Improving Multi-agent Reinforcement Learning with Stable Prefix Policy

Yue Deng, Zirui Wang, Yin Zhang

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence

In multi-agent reinforcement learning (MARL), the epsilon-greedy method plays an important role in balancing exploration and exploitation during decision making in value-based algorithms. However, epsilon-greedy exploration introduces conservativeness into the estimated state value precisely when the agents most need exploitation, i.e., as the policy approaches convergence, which may result in suboptimal convergence. Conversely, removing the epsilon-greedy mechanism altogether leaves no exploration and may lead to unacceptable local optima. To resolve this dilemma, we use previously collected trajectories to construct a Monte-Carlo Trajectory Tree, from which an optimal template, a sequence of state prototypes, can be planned. The agents first follow the planned template and act greedily without exploration, which we call the Stable Prefix Policy. When the policy still requires exploration, the agents adaptively drop out of the template and resume epsilon-greedy exploration. We integrate our approach with various value-based MARL methods and empirically evaluate it on a cooperative MARL task, the SMAC benchmark. Experimental results demonstrate that our method achieves not only better performance but also faster convergence than baseline algorithms within early time steps.
Keywords:
Agent-based and Multi-agent Systems: MAS: Multi-agent learning
Agent-based and Multi-agent Systems: MAS: Coordination and cooperation
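
The action-selection rule sketched in the abstract (follow the planned template of state prototypes greedily, then drop out to epsilon-greedy exploration) can be illustrated with a minimal example. This is a hedged sketch, not the authors' implementation: the function `select_action`, the `match_threshold` parameter, and the cosine-similarity test for whether the current state still matches the planned prototype are all illustrative assumptions.

```python
import random

import numpy as np


def select_action(q_values, step, template, state_embedding,
                  epsilon=0.05, match_threshold=0.9):
    """Illustrative sketch (assumed, not the paper's code) of the rule
    described in the abstract: act greedily while the episode still
    follows the planned template of state prototypes, and fall back to
    epsilon-greedy exploration once the trajectory drops out of it.

    q_values:        per-action Q estimates for the current agent (1-D array)
    step:            current time step in the episode
    template:        list of state-prototype vectors planned from the
                     Monte-Carlo Trajectory Tree (assumed to be given)
    state_embedding: vector representation of the current state
    """
    on_template = False
    if step < len(template):
        prototype = np.asarray(template[step], dtype=float)
        state_vec = np.asarray(state_embedding, dtype=float)
        # Cosine similarity as an assumed test of whether the current
        # state still matches the planned prototype at this step.
        sim = np.dot(prototype, state_vec) / (
            np.linalg.norm(prototype) * np.linalg.norm(state_vec) + 1e-8)
        on_template = sim >= match_threshold

    if on_template:
        # Stable prefix: pure exploitation while the template still holds.
        return int(np.argmax(q_values))

    # Dropped out of the template: resume standard epsilon-greedy exploration.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))
```

In this sketch, how the template is planned from the trajectory tree and how the dropout point is adapted over training are left abstract, since the abstract does not specify them.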