Towards Stable Reinforcement Learning in Non-Episodic Tasks
Author: Karnik, Sathwik
Advisor: Agrawal, Pulkit
Abstract
Despite recent advances in deep reinforcement learning (RL), deploying RL policies in robotics remains challenging. The typical RL training paradigm rolls out policies over a finite horizon, i.e., in episodes. However, policies trained this way may struggle to generalize in non-episodic tasks, including both object manipulation and locomotion. In this thesis, we study the challenges that arise in non-episodic tasks in two settings: (1) object manipulation in the Habitat Home Assistant Benchmark (HAB) [18] and (2) locomotion in the MuJoCo suite [20].
In the first setting, we study the failure modes of the baseline methods and attribute many of the failures in part to instabilities in object placement and to the lack of error recovery under open-loop task planning. We first consider addressing this issue by modifying the steady-state termination condition in the RL objective so that the policy must keep the object at the goal position over a longer horizon. We then consider an error-corrective policy based on inverse kinematics (IK) that executes after the RL policy. Integrating the IK policy significantly improves the final task success rate from 41.8% to 65.3% on SetTable, one of the three HAB tasks.
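As a rough illustration of this two-stage execution scheme, consider the Python sketch below: a trained RL policy runs first, and a hypothetical IK-based controller then moves the object toward the goal if placement failed. The environment API, the `info` keys, and `ik_solver.solve` are illustrative assumptions, not the actual Habitat interfaces used in the thesis.

```python
import numpy as np

GOAL_TOLERANCE = 0.03  # meters; assumed placement tolerance


def execute_with_ik_correction(env, rl_policy, ik_solver, max_steps=200):
    """Run the RL policy, then apply an IK-based corrective phase.

    All interfaces here (env, rl_policy, ik_solver) are hypothetical
    placeholders, not the thesis's actual implementation.
    """
    obs = env.reset()
    info = {}
    for _ in range(max_steps):
        obs, reward, done, info = env.step(rl_policy(obs))
        if done:
            break

    # Error-corrective phase: if the object was not placed within
    # tolerance, follow an IK joint trajectory toward the goal.
    error = np.linalg.norm(info["object_position"] - info["goal_position"])
    if error > GOAL_TOLERANCE:
        for q in ik_solver.solve(target_pos=info["goal_position"]):
            obs, reward, done, info = env.step({"joint_positions": q})
    return info
```

The corrective phase here is purely geometric and independent of the learned policy, which is one way to supply the error recovery that open-loop execution of the RL policy lacks.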
In the second setting, we consider extrapolation in the non-episodic task of locomotion in the MuJoCo suite. RL policies are typically trained for a finite horizon but may need to be executed for a much longer horizon when deployed on locomotion tasks, and current RL approaches may fail to generalize beyond the training horizon. To address this issue, we add time-to-go embeddings to the observations; specifically, we introduce a constant time-to-go embedding for the setting where the evaluation or deployment horizon is much longer than the training horizon. We find limited evidence of improvement in average episode returns during evaluation across six MuJoCo tasks.
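To make the time-to-go idea concrete, below is a minimal gym-style observation wrapper, assuming a single scalar feature appended to the observation: during training it counts down with the remaining horizon, while at deployment it can be pinned to a constant, matching the constant time-to-go embedding described above. The class name and the normalization are illustrative assumptions, not the thesis's implementation.

```python
import numpy as np
import gym


class TimeToGoWrapper(gym.Wrapper):
    """Append a (possibly constant) time-to-go feature to observations.

    A minimal sketch under assumed conventions: during training,
    time-to-go = (horizon - t) / horizon decays from 1 to 0; for
    long-horizon deployment, `constant_ttg` pins it to a fixed value.
    """

    def __init__(self, env, horizon, constant_ttg=None):
        super().__init__(env)
        self.horizon = horizon
        self.constant_ttg = constant_ttg
        self.t = 0

    def _augment(self, obs):
        if self.constant_ttg is not None:
            ttg = self.constant_ttg  # deployment: constant embedding
        else:
            ttg = (self.horizon - self.t) / self.horizon  # training: countdown
        return np.concatenate([obs, [ttg]]).astype(np.float32)

    def reset(self, **kwargs):
        self.t = 0
        return self._augment(self.env.reset(**kwargs))

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        return self._augment(obs), reward, done, info
```

Wrapping the evaluation environment with, e.g., `constant_ttg=1.0` would feed the policy the same time feature at every step, so it never observes an out-of-distribution countdown when run past its training horizon.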
Date issued: 2023-09
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher: Massachusetts Institute of Technology