Q-learning and policy iteration algorithms for stochastic shortest path problems

Yu, Huizhen; Bertsekas, Dimitri P.

Terms of use

Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/

Abstract

We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.
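The scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and the abstract does not give the update rule. The sketch assumes the policy evaluation mapping has the form Q(i,u) = g(i,u) + sum_j p_ij(u) min{J(j), Q(j, nu(j))}, i.e. an optimal stopping problem with stopping costs J, applied a fixed number of times before a policy improvement step sets J(i) = min_u Q(i,u). The toy problem, the function name `q_policy_iteration`, and the parameters `n_outer` and `m_eval` are hypothetical. The sketch covers only the synchronous, deterministic lookup table case.

```python
TERMINAL = 0  # cost-free, absorbing termination state

# Hypothetical toy problem (not from the paper): states 1 and 2 plus the
# termination state 0. P[i][u] maps next state -> probability, and
# g[i][u] is the expected one-stage cost of control u at state i.
P = {
    1: {0: {0: 1.0}, 1: {2: 1.0}},
    2: {0: {0: 1.0}, 1: {1: 0.5, 0: 0.5}},
}
g = {
    1: {0: 1.0, 1: 0.5},
    2: {0: 2.0, 1: 0.2},
}


def q_policy_iteration(P, g, n_outer=50, m_eval=3):
    """Alternate m_eval stopping-problem sweeps with one improvement step."""
    states = sorted(P)
    J = {i: 0.0 for i in states}                     # stopping costs
    Q = {(i, u): 0.0 for i in states for u in P[i]}  # Q-factor table
    nu = {i: min(P[i]) for i in states}              # current policy

    def stop_value(j):
        # At the next state, either stop and pay J(j), or continue with
        # the current policy and pay Q(j, nu(j)); termination costs 0.
        if j == TERMINAL:
            return 0.0
        return min(J[j], Q[(j, nu[j])])

    for _ in range(n_outer):
        # Inexact policy evaluation: a finite number of value iterations
        # for the stopping problem, with no minimization over controls.
        for _ in range(m_eval):
            Q = {
                (i, u): g[i][u]
                + sum(p * stop_value(j) for j, p in P[i][u].items())
                for i in states
                for u in P[i]
            }
        # Policy improvement: the only step that minimizes over controls.
        for i in states:
            nu[i] = min(P[i], key=lambda u: Q[(i, u)])
            J[i] = Q[(i, nu[i])]
    return J, Q, nu


J, Q, nu = q_policy_iteration(P, g)
print(J, nu)
```

On this toy problem the optimal costs can be checked by hand from Bellman's equation: J(1) = min{1, 0.5 + J(2)} and J(2) = min{2, 0.2 + 0.5 J(1)} give J(1) = 1 and J(2) = 0.7, which the sketch should reproduce up to floating-point error. Raising `m_eval` makes each evaluation phase more accurate at the cost of more sweeps between minimizations.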

Date issued

2012-04

URI

http://hdl.handle.net/1721.1/93745

Department

Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science; Massachusetts Institute of Technology. Laboratory for Information and Decision Systems

Journal

Annals of Operations Research

Publisher

Springer-Verlag

Citation

Yu, Huizhen, and Dimitri P. Bertsekas. “Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems.” Annals of Operations Research 208, no. 1 (April 18, 2012): 95–132.

Version: Author's final manuscript

ISSN

0254-5330

1572-9338
