Towards Long-delayed Sparsity: Learning a Better Transformer through Reward Redistribution

Tianchen Zhu, Yue Qiu, Haoyi Zhou, Jianxin Li

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 4693-4701. https://doi.org/10.24963/ijcai.2023/522

Recently, Decision Transformer (DT) pioneered casting offline RL as a contextual conditional sequence modeling problem, leveraging self-attended autoregression to learn from global target returns, states, and actions. However, many applications suffer a severe delay of these signals; for example, the agent may only receive a reward signal at the end of each trajectory. This delay causes an unwanted bias to accumulate when autoregressively learning from global signals. In this paper, we focus on a typical instance of this problem, episodic reinforcement learning with trajectory feedback. We propose a new reward redistribution algorithm that learns a parameterized reward function and decomposes the long-delayed reward onto each timestep. To improve the redistribution's adaptability, we formulate this decomposition as a bi-level optimization problem in pursuit of a globally optimal solution. We extensively evaluate the proposed method on various benchmarks and demonstrate overwhelming performance improvements under long-delayed settings.
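The abstract describes redistributing a trajectory-level (episodic) reward onto individual timesteps via a learned, parameterized reward function. The sketch below is a minimal illustration of that core idea only, under the common return-equivalence assumption that the per-step predictions should sum to the delayed episode return; it is not the paper's exact method (in particular, it omits the bi-level optimization), and the class/function names (`RewardModel`, `redistribution_loss`) are hypothetical.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Hypothetical parameterized per-step reward r_phi(s_t, a_t)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # states: (T, state_dim), actions: (T, action_dim) -> (T,) per-step rewards
        return self.net(torch.cat([states, actions], dim=-1)).squeeze(-1)


def redistribution_loss(model: RewardModel,
                        states: torch.Tensor,
                        actions: torch.Tensor,
                        episode_return: torch.Tensor) -> torch.Tensor:
    """Return-equivalence objective (assumed simplification): the redistributed
    per-step rewards should sum to the single delayed trajectory return."""
    per_step = model(states, actions)
    return (per_step.sum() - episode_return) ** 2
```

Once such a model is trained over the offline trajectories, the dataset could be relabeled with the predicted per-step rewards (and the corresponding returns-to-go) before conditioning a Decision Transformer on them, which is the point where the redistribution would interact with the sequence model.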
Keywords:
Machine Learning: ML: Deep reinforcement learning
Planning and Scheduling: PS: POMDPs
Uncertainty in AI: UAI: Sequential decision making