
dc.contributor.author: Bertsekas, Dimitri P.
dc.date.accessioned: 2012-09-28T17:46:49Z
dc.date.available: 2012-09-28T17:46:49Z
dc.date.issued: 2011-08
dc.date.submitted: 2011-01
dc.identifier.issn: 1672-6340
dc.identifier.issn: 1993-0623
dc.identifier.uri: http://hdl.handle.net/1721.1/73485
dc.description.abstract: We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments, and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), and their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class of methods and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation, when done by the projected equation/TD approach, may lead to policy oscillation, but when done by aggregation it does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.
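
For orientation, a minimal LaTeX sketch (not part of the record) of the projected Bellman equation and the LSTD/LSPE forms the abstract refers to, assuming the standard linear approximation setting of this literature; the symbols Φ, Π, C, d, and G below are conventions of this sketch, not quoted from the record:

% Hedged sketch: policy evaluation with a linear cost approximation
% J ≈ Φr, where Π denotes projection onto the span of Φ with respect
% to a weighted Euclidean norm and T_μ is the Bellman operator of the
% policy μ being evaluated.
\[
  \Phi r^{*} = \Pi T_{\mu}(\Phi r^{*}),
  \qquad T_{\mu}J = g_{\mu} + \alpha P_{\mu} J,
\]
% The projected equation can be written as a linear system
% C r^{*} = d. LSTD forms simulation-based estimates C_k, d_k and
% solves C_k r = d_k by matrix inversion, while scaled LSPE iterates:
\[
  r_{k+1} = r_{k} - \gamma\, G_{k}\,(C_{k} r_{k} - d_{k}),
\]
% where G_k is a scaling matrix and γ a stepsize. Taking G_k = C_k^{-1}
% and γ = 1 recovers the LSTD solution r = C_k^{-1} d_k, which is one
% way to see the "connecting link" between LSTD and LSPE mentioned in
% the abstract.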
dc.description.sponsorship: National Science Foundation (U.S.) (No. ECCS-0801549)
dc.description.sponsorship: Los Alamos National Laboratory. Information Science and Technology Institute
dc.description.sponsorship: United States. Air Force (No. FA9550-10-1-0412)
dc.language.iso: en_US
dc.publisher: Springer-Verlag
dc.relation.isversionof: http://dx.doi.org/10.1007/s11768-011-1005-3
dc.rights: Creative Commons Attribution-Noncommercial-Share Alike 3.0
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/3.0/
dc.source: MIT web domain
dc.title: Approximate policy iteration: A survey and some new methods
dc.type: Article
dc.identifier.citation: Bertsekas, Dimitri P. “Approximate Policy Iteration: A Survey and Some New Methods.” Journal of Control Theory and Applications 9.3 (2011): 310–335. Web.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.contributor.approver: Bertsekas, Dimitri P.
dc.contributor.mitauthor: Bertsekas, Dimitri P.
dc.relation.journal: Journal of Control Theory and Applications
dc.eprint.version: Author's final manuscript
dc.type.uri: http://purl.org/eprint/type/JournalArticle
eprint.status: http://purl.org/eprint/status/PeerReviewed
dspace.orderedauthors: Bertsekas, Dimitri P.
dc.identifier.orcid: https://orcid.org/0000-0001-6909-7208
mit.license: OPEN_ACCESS_POLICY
mit.metadata.status: Complete

