dc.contributor.author | Bertsekas, Dimitri P. | |
dc.date.accessioned | 2012-09-28T17:46:49Z | |
dc.date.available | 2012-09-28T17:46:49Z | |
dc.date.issued | 2011-08 | |
dc.date.submitted | 2011-01 | |
dc.identifier.issn | 1672-6340 | |
dc.identifier.issn | 1993-0623 | |
dc.identifier.uri | http://hdl.handle.net/1721.1/73485 | |
dc.description.abstract | We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation when done by the projected equation/TD approach may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds. | en_US
dc.description.sponsorship | National Science Foundation (U.S.) (No. ECCS-0801549) | en_US
dc.description.sponsorship | Los Alamos National Laboratory. Information Science and Technology Institute | en_US |
dc.description.sponsorship | United States. Air Force (No. FA9550-10-1-0412) | en_US
dc.language.iso | en_US | |
dc.publisher | Springer-Verlag | en_US |
dc.relation.isversionof | http://dx.doi.org/10.1007/s11768-011-1005-3 | en_US |
dc.rights | Creative Commons Attribution-Noncommercial-Share Alike 3.0 | en_US |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/3.0/ | en_US |
dc.source | MIT web domain | en_US |
dc.title | Approximate policy iteration: A survey and some new methods | en_US |
dc.type | Article | en_US |
dc.identifier.citation | Bertsekas, Dimitri P. “Approximate Policy Iteration: A Survey and Some New Methods.” Journal of Control Theory and Applications 9.3 (2011): 310–335. Web. | en_US
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | en_US |
dc.contributor.approver | Bertsekas, Dimitri P. | |
dc.contributor.mitauthor | Bertsekas, Dimitri P. | |
dc.relation.journal | Journal of Control Theory and Applications | en_US |
dc.eprint.version | Author's final manuscript | en_US |
dc.type.uri | http://purl.org/eprint/type/JournalArticle | en_US |
eprint.status | http://purl.org/eprint/status/PeerReviewed | en_US |
dspace.orderedauthors | Bertsekas, Dimitri P. | en |
dc.identifier.orcid | https://orcid.org/0000-0001-6909-7208 | |
mit.license | OPEN_ACCESS_POLICY | en_US |
mit.metadata.status | Complete | |