Hindsight Trust Region Policy Optimization
Hanbo Zhang, Site Bai, Xuguang Lan, David Hsu, Nanning Zheng
Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence
Main Track. Pages 3335-3341.
https://doi.org/10.24963/ijcai.2021/459
Reinforcement Learning (RL) with sparse rewards is a major challenge. We propose Hindsight Trust Region Policy Optimization (HTRPO), a new RL algorithm that extends the highly successful TRPO algorithm with hindsight to tackle the challenge of sparse rewards. Hindsight refers to the algorithm's ability to learn from information across goals, including past goals not intended for the current task. We derive the hindsight form of TRPO, together with QKL, a quadratic approximation to the KL divergence constraint on the trust region. QKL reduces variance in KL divergence estimation and improves stability in policy updates. We show that HTRPO has convergence properties similar to those of TRPO. We also present Hindsight Goal Filtering (HGF), which further improves learning performance on suitable tasks. HTRPO has been evaluated on various sparse-reward tasks, including Atari games and simulated robot control. Experimental results show that HTRPO consistently outperforms TRPO, as well as HPG, a state-of-the-art policy gradient algorithm for RL with sparse rewards.
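For readers unfamiliar with hindsight, the core mechanism (as in HER and HPG) is goal relabeling: transitions collected while pursuing one goal are reused as if other, actually achieved goals had been the intended ones, so a sparse-reward episode still yields nonzero learning signal. The Python sketch below is a minimal, hypothetical illustration of that relabeling step; the names and signatures are ours, not the paper's, and HTRPO additionally corrects the resulting policy-gradient estimate for the substituted goals via importance sampling, which is omitted here.

def relabel_with_hindsight(episode, achieved_goals, reward_fn):
    # Hedged sketch, not the paper's code: reuse one trajectory's
    # transitions under alternative goals that were actually reached.
    # `episode`: list of (state, action) pairs from one rollout.
    # `achieved_goals`: goals attained somewhere along that rollout.
    # `reward_fn(state, goal)`: sparse task reward for a state-goal pair.
    relabeled = []
    for goal in achieved_goals:
        for state, action in episode:
            # Under the achieved goal, the reward is often nonzero,
            # unlike under the originally intended goal.
            relabeled.append((state, action, goal, reward_fn(state, goal)))
    return relabeled

# Toy usage with hypothetical 1-D states and goals:
episode = [((0,), "right"), ((1,), "right"), ((2,), "stop")]
reward_fn = lambda s, g: 1.0 if s == g else 0.0
batch = relabel_with_hindsight(episode, achieved_goals=[(2,)], reward_fn=reward_fn)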
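The abstract's description of QKL as a quadratic approximation to the KL divergence constraint is consistent with the standard second-order expansion sketched below; the notation is ours, and the paper's exact estimator may differ. Writing $\Delta(x) = \log q(x) - \log p(x)$ and using $\mathbb{E}_{x \sim p}[e^{\Delta(x)}] = 1$, a second-order expansion gives $\mathbb{E}_p[\Delta] \approx -\tfrac{1}{2}\mathbb{E}_p[\Delta^2]$ for nearby distributions, hence

\[
D_{\mathrm{KL}}(p \parallel q)
= \mathbb{E}_{x \sim p}\!\left[\log p(x) - \log q(x)\right]
\approx \frac{1}{2}\,\mathbb{E}_{x \sim p}\!\left[\big(\log p(x) - \log q(x)\big)^{2}\right].
\]

Each per-sample term on the right is nonnegative, unlike samples of $\log p - \log q$, which plausibly accounts for the reduced variance in KL estimation and the more stable policy updates the abstract claims.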
Keywords:
Machine Learning: Deep Reinforcement Learning
Machine Learning: Reinforcement Learning