dc.contributor.author | Bertsekas, Dimitri P. | |
dc.date.accessioned | 2011-06-21T19:35:53Z | |
dc.date.available | 2011-06-21T19:35:53Z | |
dc.date.issued | 2010-12 | |
dc.identifier.issn | 0743-1546 | |
dc.identifier.issn | 0191-2216 | |
dc.identifier.uri | http://hdl.handle.net/1721.1/64641 | |
dc.description.abstract | Approximate policy iteration methods based
on temporal differences are popular in practice, and have
been tested extensively since the early 1990s, but
their convergence behavior is complex and not
well understood at present. An important question is
whether the policy iteration process is seriously hampered
by oscillations between poor policies, roughly analogous to
the attraction of gradient methods to poor local minima.
There has been little apparent concern in the approximate
DP/reinforcement learning literature about this possibility,
even though it has been documented with several simple
examples. Recent computational experimentation with the
game of tetris, a popular testbed for approximate DP
algorithms over a 15-year period, has brought the issue
into sharp focus. In particular, using a standard set of 22
features and temporal difference methods, an average score
of a few thousand was achieved. Using the same features
and a random search method, an overwhelmingly better
average score (600,000-900,000) was achieved. The paper
explains the likely mechanism of this phenomenon and
derives conditions under which it will not occur. | en_US |
dc.description.sponsorship | National Science Foundation (U.S.) (NSF Grant ECCS-0801549) | en_US |
dc.description.sponsorship | United States. Air Force (Grant FA9550-10-1-0412) | en_US |
dc.description.sponsorship | Los Alamos National Laboratory. Information Science and Technology Institute | en_US |
dc.language.iso | en_US | |
dc.publisher | Institute of Electrical and Electronics Engineers | en_US |
dc.rights | Creative Commons Attribution-Noncommercial-Share Alike 3.0 | en_US |
dc.rights.uri | http://creativecommons.org/licenses/by-nc-sa/3.0/ | en_US |
dc.source | MIT web domain | en_US |
dc.title | Pathologies of Temporal Difference Methods in Approximate Dynamic Programming | en_US |
dc.type | Article | en_US |
dc.identifier.citation | Bertsekas, Dimitri P. "Pathologies of Temporal Difference Methods in Approximate Dynamic Programming." In Proceedings of the 49th IEEE Conference on Decision and Control, Dec. 15-17, 2010, Hilton Atlanta Hotel, Atlanta, Georgia, USA. | en_US |
dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | en_US |
dc.contributor.approver | Bertsekas, Dimitri P. | |
dc.contributor.mitauthor | Bertsekas, Dimitri P. | |
dc.relation.journal | Proceedings of the 49th IEEE Conference on Decision and Control (CDC), 2010 | en_US |
dc.eprint.version | Author's final manuscript | en_US |
dc.type.uri | http://purl.org/eprint/type/ConferencePaper | en_US |
dspace.orderedauthors | Bertsekas, Dimitri P. | |
dc.identifier.orcid | https://orcid.org/0000-0001-6909-7208 | |
mit.license | OPEN_ACCESS_POLICY | en_US |
mit.metadata.status | Complete | |