
dc.contributor.author: Bertsekas, Dimitri P
dc.contributor.author: Yu, Huizhen
dc.date.accessioned: 2019-06-11T20:09:45Z
dc.date.available: 2019-06-11T20:09:45Z
dc.date.issued: 2012-02
dc.date.submitted: 2011-05
dc.identifier.issn: 0364-765X
dc.identifier.issn: 1526-5471
dc.identifier.uri: https://hdl.handle.net/1721.1/121248
dc.description.abstract: We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal state costs or Q-factors. The main difference is in the policy evaluation phase: instead of solving a linear system of equations, our algorithm requires solving an optimal stopping problem. The solution of this problem may be inexact, with a finite number of value iterations, in the spirit of modified policy iteration. The stopping problem structure is incorporated into the standard Q-learning algorithm to obtain a new method that is intermediate between policy iteration and Q-learning/value iteration. Thanks to its special contraction properties, our method overcomes some of the traditional convergence difficulties of modified policy iteration and admits asynchronous deterministic and stochastic iterative implementations, with lower overhead and/or more reliable convergence over existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm addresses effectively the inherent difficulties of approximate policy iteration due to inadequate exploration of the state and control spaces. [en_US]
dc.description.sponsorship: National Science Foundation (U.S.) (Grant ECCS-0801549) [en_US]
dc.description.sponsorship: United States. Air Force (Grant FA9550-10-1-0412) [en_US]
dc.description.sponsorship: Academy of Finland (Grant 118653) [en_US]
dc.language.iso: en_US
dc.publisher: Institute for Operations Research and the Management Sciences (INFORMS) [en_US]
dc.relation.isversionof: https://doi.org/10.1287/moor.1110.0532 [en_US]
dc.rights: Creative Commons Attribution-Noncommercial-Share Alike [en_US]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0/ [en_US]
dc.source: MIT web domain [en_US]
dc.title: Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming [en_US]
dc.type: Article [en_US]
dc.identifier.citation: Bertsekas, Dimitri P., and Huizhen Yu. “Q-Learning and Enhanced Policy Iteration in Discounted Dynamic Programming.” Mathematics of Operations Research, vol. 37, no. 1, Feb. 2012, pp. 66–94. [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science [en_US]
dc.contributor.department: Massachusetts Institute of Technology. Laboratory for Information and Decision Systems [en_US]
dc.relation.journal: Mathematics of Operations Research [en_US]
dc.eprint.version: Author's final manuscript [en_US]
dc.type.uri: http://purl.org/eprint/type/JournalArticle [en_US]
eprint.status: http://purl.org/eprint/status/PeerReviewed [en_US]
dspace.date.submission: 2019-04-04T10:05:09Z
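
The abstract above describes a policy-iteration-like scheme whose evaluation phase runs a small number of value-iteration sweeps on an optimal stopping problem, making the method intermediate between policy iteration and Q-learning/value iteration. The sketch below illustrates that idea for a small tabular, cost-minimizing MDP; the function name enhanced_policy_iteration, the synchronous sweep structure, and the parameters num_outer and num_eval_sweeps are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def enhanced_policy_iteration(P, g, alpha, num_outer=50, num_eval_sweeps=5):
    """Sketch of a policy-iteration-like scheme whose evaluation step is a short
    run of value iteration on an optimal stopping problem (illustrative only).

    P     : (n, m, n) array, P[i, u, j] = transition probability i -> j under control u
    g     : (n, m) array, expected one-stage cost of choosing control u at state i
    alpha : discount factor in (0, 1)
    """
    n, m, _ = P.shape
    Q = np.zeros((n, m))                      # Q-factor estimates
    for _ in range(num_outer):
        J = Q.min(axis=1)                     # greedy state costs J(i) = min_u Q(i, u)
        mu = Q.argmin(axis=1)                 # current greedy policy
        # Inexact policy evaluation: a few sweeps of the stopping-problem mapping
        #   Q(i, u) <- g(i, u) + alpha * sum_j P[i, u, j] * min{ J(j), Q(j, mu(j)) },
        # i.e. at the next state one either "stops" at cost J(j) or continues with mu.
        for _ in range(num_eval_sweeps):
            continue_cost = Q[np.arange(n), mu]      # cost of continuing with policy mu
            target = np.minimum(J, continue_cost)    # stop or continue, whichever is cheaper
            Q = g + alpha * P @ target               # batched matrix-vector product -> (n, m)
    return Q, Q.argmin(axis=1)
```

For example, with arrays P, g, and a discount factor for a small finite MDP, enhanced_policy_iteration(P, g, 0.9) returns Q-factor estimates and the corresponding greedy policy; choosing num_eval_sweeps=1 gives value-iteration-like behavior, while a large value approximates exact policy evaluation, reflecting the "intermediate" character described in the abstract.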

