Pathologies of Temporal Difference Methods in Approximate Dynamic Programming
Author(s)
Bertsekas, Dimitri P.
Terms of use
Open Access Policy
Creative Commons Attribution-Noncommercial-Share Alike
Abstract
Approximate policy iteration methods based on temporal differences are popular in practice and have been tested extensively since the early nineties, but the associated convergence behavior is complex and not well understood at present. An important question is whether the policy iteration process is seriously hampered by oscillation between poor policies, roughly analogous to the attraction of gradient methods to poor local minima. There has been little apparent concern in the approximate DP/reinforcement learning literature about this possibility, even though it has been documented with several simple examples. Recent computational experimentation with the game of Tetris, a popular testbed for approximate DP algorithms over a 15-year period, has brought the issue into sharp focus. In particular, using a standard set of 22 features and temporal difference methods, an average score of a few thousand was achieved. Using the same features and a random search method, an overwhelmingly better average score (600,000-900,000) was achieved. The paper explains the likely mechanism of this phenomenon and derives conditions under which it will not occur.
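To make the setting concrete, the following is a minimal sketch (not taken from the paper) of approximate policy iteration with TD(0) evaluation and a linear feature architecture on a small synthetic MDP. All names, dimensions, and the random problem data are illustrative assumptions; the sketch only shows where oscillation can enter: each policy is evaluated approximately, and the greedy improvement step may then cycle among a small set of policies instead of converging.

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, n_features = 10, 3, 4   # assumed sizes (illustrative only)
gamma, alpha = 0.95, 0.05                    # discount factor and TD step size

# Random transition probabilities P[a, s, s'], stage costs g[s, a],
# and a feature matrix Phi whose rows are the feature vectors phi(s).
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
g = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
Phi = rng.normal(size=(n_states, n_features))

def td0_evaluate(policy, sweeps=2000):
    """Approximate the cost J_mu(s) ~ Phi @ r by TD(0) under a fixed policy."""
    r = np.zeros(n_features)
    for _ in range(sweeps):
        s = rng.integers(n_states)
        a = policy[s]
        s2 = rng.choice(n_states, p=P[a, s])
        delta = g[s, a] + gamma * Phi[s2] @ r - Phi[s] @ r   # TD error
        r += alpha * delta * Phi[s]
    return r

def greedy(r):
    """Policy improvement: minimize expected cost-to-go under the approximate J."""
    J = Phi @ r
    Q = g + gamma * np.einsum('ast,t->sa', P, J)
    return Q.argmin(axis=1)

policy = np.zeros(n_states, dtype=int)
for k in range(15):
    new_policy = greedy(td0_evaluate(policy))
    print(k, new_policy)        # watch for cycling among a few policies
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
```

With exact evaluation the loop would terminate at an optimal policy; with the projected, feature-based evaluation above, the generated policy sequence need not settle, which is the oscillation phenomenon the paper analyzes.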
Date issued
2010-12
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Journal
Proceedings of the 49th IEEE Conference on Decision and Control (CDC), 2010
Publisher
Institute of Electrical and Electronics Engineers
Citation
Bertsekas, Dimitri P. "Pathologies of Temporal Difference Methods in Approximate Dynamic Programming." In Proceedings of the 49th IEEE Conference on Decision and Control, Dec.15-17, 2010, Hilton Atlanta Hotel, Atlanta, Georgia USA.
Version: Author's final manuscript
ISSN
0743-1546
0191-2216