
dc.contributor.advisor: Agrawal, Pulkit
dc.contributor.author: Karnik, Sathwik
dc.date.accessioned: 2023-11-02T20:24:52Z
dc.date.available: 2023-11-02T20:24:52Z
dc.date.issued: 2023-09
dc.date.submitted: 2023-10-03T18:21:07.795Z
dc.identifier.uri: https://hdl.handle.net/1721.1/152886
dc.description.abstract: Despite recent advances in deep reinforcement learning (RL), deploying RL policies in robotics remains challenging. The typical RL training paradigm involves policy rollouts executed over a finite horizon, or episodes. However, such policies may struggle to generalize in non-episodic tasks, including both object manipulation and locomotion. In this thesis, we study the challenges that arise from non-episodic tasks in two settings: (1) object manipulation in the Habitat Home Assistant Benchmark (HAB) [18] and (2) locomotion in the MuJoCo suite [20]. In the first setting, we study the failure modes of the baseline methods and find that many failures stem in part from instabilities in object placement and the lack of error recovery under open-loop task planning. To address this, we modify the steady-state termination condition in the RL objective so that the object must remain at the goal position over a longer horizon. We next consider an error-corrective inverse-kinematics (IK) policy executed after the RL policy. Integrating the IK policy significantly improves the final task success rate from 41.8% to 65.3% on SetTable, one of the three HAB tasks. In the second setting, we study extrapolation in the non-episodic task of locomotion in the MuJoCo suite. RL policies are typically trained for a finite horizon, but during deployment in locomotion tasks they may need to be executed for much longer, and current RL approaches may fail to generalize beyond the training horizon. To address this issue, we add time-to-go embeddings to the observations. Specifically, we introduce a constant time-to-go embedding for the setting where the horizon is much longer during evaluation or deployment. We find limited evidence of improvements in average episode returns during evaluation across six tasks in the MuJoCo suite.
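The time-to-go embedding described in the abstract can be sketched as a simple observation augmentation. The function below is a hypothetical illustration (the names, interface, and normalization are my own assumptions, not from the thesis): during training the observation is extended with a normalized time-to-go feature, and at deployment over a longer horizon a constant value can be substituted.

```python
def augment_observation(obs, step, horizon, constant_ttg=None):
    """Return obs (a list of floats) with a time-to-go scalar appended.

    step: current timestep within the episode.
    horizon: training horizon H; time-to-go is normalized as (H - step) / H.
    constant_ttg: if given, this fixed value is appended instead, mimicking
        a constant time-to-go embedding when the deployment horizon exceeds
        the training horizon. (Hypothetical interface for illustration.)
    """
    if constant_ttg is not None:
        ttg = constant_ttg
    else:
        # Clamp at zero so running past the training horizon stays well-defined.
        ttg = max(horizon - step, 0) / horizon
    return obs + [ttg]


# Training-time usage: time-to-go decays from 1 toward 0 over the episode.
print(augment_observation([0.1, 0.2], step=250, horizon=1000))   # [0.1, 0.2, 0.75]
# Deployment-time usage: a constant embedding regardless of elapsed steps.
print(augment_observation([0.1, 0.2], step=5000, horizon=1000, constant_ttg=1.0))
```

The choice of the constant and the normalization are free parameters here; the sketch only shows where such a feature would enter the observation vector.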
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Towards Stable Reinforcement Learning in Non-Episodic Tasks
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science

