
dc.contributor.author: Golowich, Noah
dc.contributor.author: Moitra, Ankur
dc.contributor.author: Rohatgi, Dhruv
dc.date.accessioned: 2024-07-11T21:21:25Z
dc.date.available: 2024-07-11T21:21:25Z
dc.date.issued: 2024-06-10
dc.identifier.isbn: 979-8-4007-0383-6
dc.identifier.uri: https://hdl.handle.net/1721.1/155662
dc.description: STOC ’24, June 24–28, 2024, Vancouver, BC, Canada
dc.description.abstract: The key assumption underlying linear Markov Decision Processes (MDPs) is that the learner has access to a known feature map φ(x, a) that maps state-action pairs to d-dimensional vectors, and that the rewards and transition probabilities are linear functions in this representation. But where do these features come from? In the absence of expert domain knowledge, a tempting strategy is to use the “kitchen sink” approach and hope that the true features are included in a much larger set of potential features. In this paper we revisit linear MDPs from the perspective of feature selection. In a k-sparse linear MDP, there is an unknown subset S ⊂ [d] of size k containing all the relevant features, and the goal is to learn a near-optimal policy in only poly(k, log d) interactions with the environment. Our main result is the first polynomial-time algorithm for this problem. In contrast, earlier works either made prohibitively strong assumptions that obviated the need for exploration, or required solving computationally intractable optimization problems. Along the way we introduce the notion of an emulator: a succinct approximate representation of the transitions that still suffices for computing certain Bellman backups. Since linear MDPs are a non-parametric model, it is not even obvious whether polynomial-sized emulators exist. We show that they do exist, and moreover can be computed efficiently via convex programming. As a corollary of our main result, we give an algorithm for learning a near-optimal policy in block MDPs whose decoding function is a low-depth decision tree; the algorithm runs in quasi-polynomial time and takes a polynomial number of samples (in the size of the decision tree). This can be seen as a reinforcement learning analogue of classic results in computational learning theory. Furthermore, it gives a […]
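A minimal formal sketch of the setting the abstract describes, assuming the standard linear MDP notation from the literature (the transition measures μ and reward vector θ below are standard conventions and are not named in this record):

In a linear MDP with a known feature map $\varphi : \mathcal{X} \times \mathcal{A} \to \mathbb{R}^d$,
\[
  P(x' \mid x, a) = \langle \varphi(x, a),\, \mu(x') \rangle,
  \qquad
  r(x, a) = \langle \varphi(x, a),\, \theta \rangle,
\]
and in the $k$-sparse case there is an unknown subset $S \subseteq [d]$ with $|S| = k$ such that
\[
  \operatorname{supp}(\theta) \subseteq S
  \quad \text{and} \quad
  \operatorname{supp}(\mu(x')) \subseteq S \ \text{ for all } x',
\]
with the goal of learning a near-optimal policy from only $\mathrm{poly}(k, \log d)$ interactions with the environment.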
dc.publisher: ACM
dc.relation.isversionof: 10.1145/3618260.3649710
dc.rights: Creative Commons Attribution
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.source: Association for Computing Machinery
dc.title: Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles
dc.type: Article
dc.identifier.citation: Golowich, Noah, Moitra, Ankur and Rohatgi, Dhruv. 2024. "Exploring and Learning in Sparse Linear MDPs without Computationally Intractable Oracles."
dc.contributor.department: Massachusetts Institute of Technology. Department of Mathematics
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
dc.identifier.mitlicense: PUBLISHER_CC
dc.eprint.version: Final published version
dc.type.uri: http://purl.org/eprint/type/ConferencePaper
eprint.status: http://purl.org/eprint/status/NonPeerReviewed
dc.date.updated: 2024-07-01T07:50:29Z
dc.language.rfc3066: en
dc.rights.holder: The author(s)
dspace.date.submission: 2024-07-01T07:50:29Z
mit.license: PUBLISHER_CC
mit.metadata.status: Authority Work and Publication Information Needed

