
dc.contributor.author: Golowich, Noah
dc.contributor.author: Moitra, Ankur
dc.contributor.author: Rohatgi, Dhruv
dc.date.accessioned: 2024-07-11T21:21:25Z
dc.date.available: 2024-07-11T21:21:25Z
dc.date.issued: 2024-06-10
dc.identifier.isbn: 979-8-4007-0383-6
dc.identifier.uri: https://hdl.handle.net/1721.1/155662
dc.description: STOC ’24, June 24–28, 2024, Vancouver, BC, Canada
dc.description.abstract: The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map φ(x, a) that maps state-action pairs to d-dimensional vectors, and that the rewards and transition probabilities are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the “kitchen sink” approach and hope that the true features are included in a much larger set of potential features. In this paper we revisit linear MDPs from the perspective of feature selection. In a k-sparse linear MDP, there is an unknown subset S ⊂ [d] of size k containing all the relevant features, and the goal is to learn a near-optimal policy in only poly(k, log d) interactions with the environment. Our main result is the first polynomial-time algorithm for this problem. In contrast, earlier works either made prohibitively strong assumptions that obviated the need for exploration, or required solving computationally intractable optimization problems. Along the way we introduce the notion of an emulator: a succinct approximate representation of the transitions that still suffices for computing certain Bellman backups. Since linear MDPs are a non-parametric model, it is not even obvious whether polynomial-sized emulators exist. We show that they do exist, and moreover can be computed efficiently via convex programming. As a corollary of our main result, we give an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and takes a polynomial number of samples (in the size of the decision tree). This can be seen as a reinforcement learning analogue of classic results in computational learning theory. Furthermore, it gives a […]
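A minimal formal sketch of the setting the abstract describes, assuming the standard linear MDP notation from the literature (the transition measures μ and reward vector θ below are standard conventions and are not named in this record):

In a linear MDP with a known feature map $\varphi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^d$,
\[
  P(x' \mid x, a) = \langle \varphi(x, a),\, \mu(x') \rangle,
  \qquad
  r(x, a) = \langle \varphi(x, a),\, \theta \rangle,
\]
and in the $k$-sparse case there is an unknown subset $S \subseteq [d]$ with $|S| = k$ such that
\[
  \operatorname{supp}(\theta) \subseteq S
  \quad \text{and} \quad
  \operatorname{supp}(\mu(x')) \subseteq S \ \text{ for all } x',
\]
with the goal of learning a near-optimal policy from only $\mathrm{poly}(k, \log d)$ interactions with the environment.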
dc.publisher: ACM
dc.relation.isversionof: 10.1145/3618260.3649710
dc.rights: Creative Commons Attribution
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.source: Association for Computing Machinery
dc.title: Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles
dc.type: Article
dc.identifier.citation: Golowich, Noah, Moitra, Ankur and Rohatgi, Dhruv. 2024. "Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles."
dc.contributor.department: Massachusetts Institute of Technology. Department of Mathematics
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.mitlicense: PUBLISHER_CC
dc.eprint.version: Final published version
dc.type.uri: http://purl.org/eprint/type/ConferencePaper
eprint.status: http://purl.org/eprint/status/NonPeerReviewed
dc.date.updated: 2024-07-01T07:50:29Z
dc.language.rfc3066: en
dc.rights.holder: The author(s)
dspace.date.submission: 2024-07-01T07:50:29Z
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed

