
dc.contributor.advisor: Agrawal, Pulkit
dc.contributor.author: Karnik, Sathwik
dc.date.accessioned: 2023-11-02T20:24:52Z
dc.date.available: 2023-11-02T20:24:52Z
dc.date.issued: 2023-09
dc.date.submitted: 2023-10-03T18:21:07.795Z
dc.identifier.uri: https://hdl.handle.net/1721.1/152886
dc.description.abstract: Despite recent advances in deep reinforcement learning (RL), deploying RL policies in robotics remains challenging. The typical RL training paradigm involves policy rollouts executed over a finite horizon, or episodes. However, such policies may struggle to generalize in non-episodic tasks, including both object manipulation and locomotion. In this thesis, we study the challenges that arise from non-episodic tasks in two settings: (1) object manipulation in the Habitat Home Assistant Benchmark (HAB) [18] and (2) locomotion in the MuJoCo suite [20]. In the first setting, we study the failure modes of the baseline methods and find that many failures stem in part from instabilities in object placement and the lack of error recovery under open-loop task planning. To address this, we modify the steady-state termination condition in the RL objective so that the object must remain at the goal position over a longer horizon. We next consider an error-corrective inverse-kinematics (IK) policy executed after the RL policy. Integrating the IK policy significantly improves the final task success rate from 41.8% to 65.3% on SetTable, one of the three HAB tasks. In the second setting, we study extrapolation in the non-episodic task of locomotion in the MuJoCo suite. RL policies are typically trained for a finite horizon, but during deployment in locomotion tasks they may need to be executed for much longer, and current RL approaches may fail to generalize beyond the training horizon. To address this issue, we add time-to-go embeddings to the observations. Specifically, we introduce a constant time-to-go embedding for the setting where the horizon is much longer during evaluation or deployment. We find limited evidence of improvements in average episode returns during evaluation across six tasks in the MuJoCo suite.
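The time-to-go embedding described in the abstract can be sketched as a simple observation augmentation. The function below is a hypothetical illustration (the names, interface, and normalization are my own assumptions, not from the thesis): during training the observation is extended with a normalized time-to-go feature, and at deployment over a longer horizon a constant value can be substituted.

```python
def augment_observation(obs, step, horizon, constant_ttg=None):
    """Return obs (a list of floats) with a time-to-go scalar appended.

    step: current timestep within the episode.
    horizon: training horizon H; time-to-go is normalized as (H - step) / H.
    constant_ttg: if given, this fixed value is appended instead, mimicking
        a constant time-to-go embedding when the deployment horizon exceeds
        the training horizon. (Hypothetical interface for illustration.)
    """
    if constant_ttg is not None:
        ttg = constant_ttg
    else:
        # Clamp at zero so running past the training horizon stays well-defined.
        ttg = max(horizon - step, 0) / horizon
    return obs + [ttg]


# Training-time usage: time-to-go decays from 1 toward 0 over the episode.
print(augment_observation([0.1, 0.2], step=250, horizon=1000))   # [0.1, 0.2, 0.75]
# Deployment-time usage: a constant embedding regardless of elapsed steps.
print(augment_observation([0.1, 0.2], step=5000, horizon=1000, constant_ttg=1.0))
```

The choice of the constant and the normalization are free parameters here; the sketch only shows where such a feature would enter the observation vector.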
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Towards Stable Reinforcement Learning in Non-Episodic Tasks
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science

