Q-learning and policy iteration algorithms for stochastic shortest path problems

Yu, Huizhen; Bertsekas, Dimitri P.

dc.contributor.author	Yu, Huizhen
dc.contributor.author	Bertsekas, Dimitri P.
dc.date.accessioned	2015-02-03T19:29:00Z
dc.date.available	2015-02-03T19:29:00Z
dc.date.issued	2012-04
dc.identifier.issn	0254-5330
dc.identifier.issn	1572-9338
dc.identifier.uri	http://hdl.handle.net/1721.1/93745
dc.description.abstract	We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1):66-94, 2012). The main difference from the standard policy iteration approach is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm solves an optimal stopping problem inexactly with a finite number of value iterations. The main advantage over the standard Q-learning approach is lower overhead: most iterations do not require a minimization over all controls, in the spirit of modified policy iteration. We prove the convergence of asynchronous deterministic and stochastic lookup table implementations of our method for undiscounted, total cost stochastic shortest path problems. These implementations overcome some of the traditional convergence difficulties of asynchronous modified policy iteration, and provide policy iteration-like alternative Q-learning schemes with as reliable convergence as classical Q-learning. We also discuss methods that use basis function approximations of Q-factors and we give an associated error bound.	en_US
dc.description.sponsorship	United States. Air Force (Grant FA9550-10-1-0412)	en_US
dc.description.sponsorship	National Science Foundation (U.S.) (Grant ECCS-0801549)	en_US
dc.language.iso	en_US
dc.publisher	Springer-Verlag	en_US
dc.relation.isversionof	http://dx.doi.org/10.1007/s10479-012-1128-z	en_US
dc.rights	Creative Commons Attribution-Noncommercial-Share Alike	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/	en_US
dc.source	Prof. Bertsekas via Chris Sherratt	en_US
dc.title	Q-learning and policy iteration algorithms for stochastic shortest path problems	en_US
dc.type	Article	en_US
dc.identifier.citation	Yu, Huizhen, and Dimitri P. Bertsekas. “Q-Learning and Policy Iteration Algorithms for Stochastic Shortest Path Problems.” Annals of Operations Research 208, no. 1 (April 18, 2012): 95–132.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.contributor.department	Massachusetts Institute of Technology. Laboratory for Information and Decision Systems	en_US
dc.contributor.approver	Bertsekas, Dimitri P.	en_US
dc.contributor.mitauthor	Yu, Huizhen	en_US
dc.contributor.mitauthor	Bertsekas, Dimitri P.	en_US
dc.relation.journal	Annals of Operations Research	en_US
dc.eprint.version	Author's final manuscript	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dspace.orderedauthors	Yu, Huizhen; Bertsekas, Dimitri P.	en_US
dc.identifier.orcid	https://orcid.org/0000-0001-6909-7208
mit.license	OPEN_ACCESS_POLICY	en_US
mit.metadata.status	Complete
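
Illustrative sketch (not part of the record). The abstract describes replacing exact policy evaluation with a few value iterations on an optimal stopping problem, so that most iterations need no minimization over all controls. The code below is a minimal, synchronous, deterministic-policy simplification of that idea, written under stated assumptions: the function name `q_policy_iteration`, the arrays `P` and `g`, the parameters `n_outer` and `n_eval`, and the two-state example are all hypothetical and do not come from the paper, whose algorithms are asynchronous and more general.

```python
def q_policy_iteration(P, g, n_outer=50, n_eval=3):
    """Sketch of Q-factor policy iteration with inexact evaluation.

    Assumed inputs (hypothetical layout):
      P[i][u][j]: probability of moving from non-terminal state i to
                  non-terminal state j under control u; any missing
                  probability mass goes to the cost-free termination state.
      g[i][u]:    expected one-stage cost of control u in state i.
    """
    n = len(g)
    J = [0.0] * n                      # stopping costs (cost estimates)
    mu = [0] * n                       # current policy
    Q = [[0.0] * len(g[i]) for i in range(n)]
    for _ in range(n_outer):
        # Inexact policy evaluation: a few value iterations on the
        # stopping problem. No minimization over all controls here.
        for _ in range(n_eval):
            stop = [min(J[j], Q[j][mu[j]]) for j in range(n)]
            Q = [[g[i][u] + sum(P[i][u][j] * stop[j] for j in range(n))
                  for u in range(len(g[i]))]
                 for i in range(n)]
        # Policy improvement: the only step that minimizes over controls.
        for i in range(n):
            J[i] = min(Q[i])
            mu[i] = Q[i].index(J[i])
    return Q, J, mu


# Hypothetical two-state example (plus an implicit termination state).
P = [[[0.0, 0.0], [0.0, 1.0]],   # state 0: u=0 terminates, u=1 moves to state 1
     [[0.0, 0.0], [0.0, 0.5]]]   # state 1: u=0 terminates, u=1 stays w.p. 0.5
g = [[4.0, 1.0],
     [1.0, 2.0]]
Q, J, mu = q_policy_iteration(P, g)
# For this example the iteration settles at J = [2.0, 1.0], mu = [1, 0].
```

In this sketch the inner loop only looks up `Q[j][mu[j]]` for the current policy, which is where the lower per-iteration overhead mentioned in the abstract would come from; the `min` against `J[j]` is the stopping option.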

Files in this item

Name: YU qlearning_ssp_YB (2).pdf
Size: 506.8Kb
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles
