Efficient and Stable Offline-to-online Reinforcement Learning via Continual Policy Revitalization

Rui Kong, Chenyang Wu, Chen-Xiao Gao, Zongzhang Zhang, Ming Li

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 4317-4325. https://doi.org/10.24963/ijcai.2024/477

In offline-to-online Reinforcement Learning (RL), policies pre-trained on offline data are used to initialize subsequent online fine-tuning. However, existing methods suffer from instability and low sample efficiency compared with purely online learning. This paper identifies that these limitations stem from directly initializing the policy with an offline-trained policy model. We propose Continual Policy Revitalization (CPR), a novel, efficient, and stable fine-tuning method. CPR incorporates a periodic policy revitalization technique that restores the overtrained policy network to full learning capacity while ensuring stable initial performance. This approach enables fine-tuning without being adversely affected by low-quality pre-trained policies. In contrast to previous research, CPR optimizes the newly initialized policy under an adaptive policy constraint that keeps it close to a behavior policy constructed from historical policies, which contributes to stable policy improvement and optimal converged performance. Practically, CPR can be seamlessly integrated into existing offline RL algorithms with minimal modification. We empirically validate the effectiveness of our method through extensive experiments, demonstrating substantial improvements in learning stability and efficiency over previous approaches. Our code is available at https://github.com/LAMDA-RL/CPR.
Keywords:
Machine Learning: ML: Reinforcement learning
Machine Learning: ML: Offline reinforcement learning
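
The abstract describes two mechanisms: periodic policy revitalization and an adaptive constraint that keeps the optimized policy close to a behavior policy built from historical policies. Below is a minimal Python/PyTorch sketch of how these two pieces could fit into a fine-tuning loop. It is not the authors' implementation; all names (GaussianPolicy, revitalize, history_constraint), the fixed beta and revitalization schedule, and the placeholder actor loss are illustrative assumptions, and the real actor objective would come from the base online RL algorithm.

# Minimal sketch (illustrative assumptions, not the authors' code) of:
# (1) periodic policy revitalization and (2) a constraint toward a behavior
# policy formed from historical policy snapshots.
import copy
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.obs_dim, self.act_dim = obs_dim, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.mu(self.net(obs)), self.log_std.exp())

def revitalize(policy, history):
    # Snapshot the over-trained policy into the history set, then return a
    # freshly initialized network that regains full learning capacity.
    history.append(copy.deepcopy(policy).eval())
    return GaussianPolicy(policy.obs_dim, policy.act_dim)

def history_constraint(policy, history, obs, beta):
    # Keep the new policy close to the historical snapshots (here via an
    # average KL divergence); in CPR the constraint weight would be adaptive,
    # a fixed beta is used here purely for illustration.
    pi = policy.dist(obs)
    with torch.no_grad():
        hist = [h.dist(obs) for h in history]
    kls = [torch.distributions.kl_divergence(pi, hd).sum(-1).mean() for hd in hist]
    return beta * torch.stack(kls).mean()

# Usage sketch: the constraint is added to whatever actor loss the base
# online algorithm computes; revitalization happens on a fixed schedule.
obs_dim, act_dim = 17, 6                      # illustrative dimensions
policy = GaussianPolicy(obs_dim, act_dim)     # would be loaded from offline pre-training
history = [copy.deepcopy(policy).eval()]      # pre-trained policy seeds the behavior set
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
beta, revitalize_every = 1.0, 100

for step in range(1, 301):
    if step % revitalize_every == 0:
        policy = revitalize(policy, history)
        optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)
    obs = torch.randn(256, obs_dim)           # stands in for a replay-buffer batch
    acts = torch.randn(256, act_dim)
    # Placeholder actor objective; in practice this is the base algorithm's
    # actor loss, not part of CPR itself.
    actor_loss = -policy.dist(obs).log_prob(acts).sum(-1).mean()
    loss = actor_loss + history_constraint(policy, history, obs, beta)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

The key design point the sketch tries to mirror is that the re-initialized network restores plasticity, while closeness to the historical snapshots (including the pre-trained policy) is what preserves stable initial performance after each revitalization.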