Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming

Bertsekas, Dimitri P; Yu, Huizhen

dc.contributor.author	Bertsekas, Dimitri P
dc.contributor.author	Yu, Huizhen
dc.date.accessioned	2019-06-11T20:09:45Z
dc.date.available	2019-06-11T20:09:45Z
dc.date.issued	2012-02
dc.date.submitted	2011-05
dc.identifier.issn	0364-765X
dc.identifier.issn	1526-5471
dc.identifier.uri	https://hdl.handle.net/1721.1/121248
dc.description.abstract	We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal state costs or Q-factors. The main difference is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm requires solving an optimal stopping problem. The solution of this problem may be inexact, with a finite number of value iterations, in the spirit of modified policy iteration. The stopping problem structure is incorporated into the standard Q-learning algorithm to obtain a new method that is intermediate between policy iteration and Q-learning/value iteration. Thanks to its special contraction properties, our method overcomes some of the traditional convergence difficulties of modified policy iteration and admits asynchronous deterministic and stochastic iterative implementations, with lower overhead and/or more reliable convergence over existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm addresses effectively the inherent difficulties of approximate policy iteration due to inadequate exploration of the state and control spaces.	en_US
dc.description.sponsorship	National Science Foundation (U.S.) (Grant ECCS-0801549)	en_US
dc.description.sponsorship	United States. Air Force (Grant FA9550-10-1-0412)	en_US
dc.description.sponsorship	Academy of Finland (Grant 118653)	en_US
dc.language.iso	en_US
dc.publisher	Institute for Operations Research and the Management Sciences (INFORMS)	en_US
dc.relation.isversionof	https://doi.org/10.1287/moor.1110.0532	en_US
dc.rights	Creative Commons Attribution-Noncommercial-Share Alike	en_US
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/	en_US
dc.source	MIT web domain	en_US
dc.title	Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming	en_US
dc.type	Article	en_US
dc.identifier.citation	Bertsekas, Dimitri P., and Huizhen Yu. “Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming.” Mathematics of Operations Research, vol. 37, no. 1, Feb. 2012, pp. 66–94.	en_US
dc.contributor.department	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science	en_US
dc.contributor.department	Massachusetts Institute of Technology. Laboratory for Information and Decision Systems	en_US
dc.relation.journal	Mathematics of Operations Research	en_US
dc.eprint.version	Author's final manuscript	en_US
dc.type.uri	http://purl.org/eprint/type/JournalArticle	en_US
eprint.status	http://purl.org/eprint/status/PeerReviewed	en_US
dspace.date.submission	2019-04-04T10:05:09Z

Files in this item

Name: Bertsekas-Q-learning.pdf
Size: 2.435Mb
Format: PDF

This item appears in the following Collection(s)

MIT Open Access Articles
