DSpace@MIT

Computationally Efficient Reinforcement Learning under Partial Observability

Author(s)
Rohatgi, Dhruv
Thesis PDF (1.126 MB)
Advisor
Moitra, Ankur
Terms of use
In Copyright - Educational Use Permitted. Copyright MIT. http://rightsstatements.org/page/InC-EDU/1.0/
Abstract
A key challenge in reinforcement learning is the inability of the agent to fully observe the latent state of the system. Partially observable Markov decision processes (POMDPs) are a generalization of Markov decision processes (MDPs) that model this challenge. Unfortunately, planning and learning near-optimal policies in POMDPs is computationally intractable. Most existing algorithms either lack provable guarantees, require exponential time, or only apply under stringent assumptions about either the dynamics of the system or the observation model. This thesis shows that the computational intractability of planning and learning in worst-case POMDPs is fundamentally due to degeneracy in the observation model: making an appropriate assumption about the informativeness of the partial observations (of the latent state) mitigates this intractability. Specifically, we show that planning and learning are both possible in quasi-polynomial time for γ-observable POMDPs, where γ-observability is the assumption that c-well-separated distributions over the latent states induce (γ·c)-well-separated distributions over observations. These are the first sub-exponential time planning and learning algorithms for POMDPs under reasonable assumptions. While this falls just short of polynomial time, quasi-polynomial time turns out to be optimal for γ-observable POMDPs under standard complexity assumptions. The main technical innovation driving our algorithmic results is a new quantitative connection between γ-observability and the stability of posterior distributions for the latent state in hidden Markov models (HMMs) and (more generally) POMDPs. Essentially, stability implies that old observations have limited relevance to the current state, and hence "short-memory" policies that only depend on a short window of recent observations are nearly optimal. This connection has several applications beyond planning and learning in POMDPs. Leveraging γ-observability, we give a quasi-polynomial time algorithm for (improperly) learning overcomplete HMMs that does not require a full-rankness assumption on the transition matrices. We also give a quasi-polynomial time algorithm for planning coarse correlated equilibria in partially observable Markov games.
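The belief-stability idea the abstract relies on (old observations have limited relevance to the posterior over the current state, so a short window of recent observations nearly suffices) can be illustrated with a small, self-contained sketch. This is not the thesis's algorithm: the toy HMM, the uniform restart prior, and the window sizes below are all illustrative assumptions. The sketch compares an exact Bayes filter over the full observation history with a "short-memory" filter that restarts from a uniform prior and uses only the last k observations; as k grows, the total-variation gap between the two beliefs typically shrinks, more rapidly the more informative the observations are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy HMM: n latent states, m observations.
n, m = 4, 5
T = rng.dirichlet(np.ones(n), size=n)   # T[s, s'] = P(s' | s)
O = rng.dirichlet(np.ones(m), size=n)   # O[s, o]  = P(o | s)

def filter_belief(prior, obs_seq):
    """Exact Bayes filter: push a belief through dynamics, then reweight by each observation."""
    b = prior.copy()
    for o in obs_seq:
        b = b @ T          # predict step
        b = b * O[:, o]    # update step (observation likelihood)
        b = b / b.sum()    # normalize back to a distribution
    return b

# Simulate one trajectory from the toy HMM.
horizon = 60
s = rng.integers(n)                     # uniform initial state
observations = []
for _ in range(horizon):
    s = rng.choice(n, p=T[s])
    observations.append(rng.choice(m, p=O[s]))

uniform = np.full(n, 1.0 / n)

# Exact posterior uses the whole history; the short-memory posterior
# restarts from a uniform prior and sees only the last k observations.
exact = filter_belief(uniform, observations)
for k in (1, 2, 5, 10, 20):
    windowed = filter_belief(uniform, observations[-k:])
    tv = 0.5 * np.abs(exact - windowed).sum()
    print(f"window {k:2d}: total-variation gap to exact posterior = {tv:.4f}")
```

Under γ-observability the thesis proves a quantitative version of this forgetting, which is what makes short-memory policies (and hence quasi-polynomial time planning and learning) nearly optimal; the sketch only demonstrates the qualitative phenomenon on a random toy instance.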
Date issued
2023-02
URI
https://hdl.handle.net/1721.1/150191
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
