PoRank: A Practical Framework for Learning to Rank Policies

Pengjie Gu, Mengchen Zhao, Xu He, Yi Cai, Bo An

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 4044-4052. https://doi.org/10.24963/ijcai.2024/447

In many real-world scenarios, we need to select from a set of candidate policies before online deployment. Although existing off-policy evaluation (OPE) methods can be used to estimate online performance, they suffer from high variance. Fortunately, we care only about the ranking of the candidate policies, rather than their exact online rewards. Based on this observation, we propose PoRank, a novel framework for learning to rank policies. In practice, learning to rank policies faces two main challenges: 1) generalization over the huge policy space and 2) lack of supervision signals. To overcome the first challenge, PoRank uses a Policy Comparison Transformer (PCT) to learn cross-policy representations, which capture the core discrepancies between policies and generalize well across the whole policy space. The second challenge arises because learning to rank requires online comparisons of policies as ground-truth labels, whereas deploying policies online can be highly expensive. To overcome this, PoRank adopts a crowdsourcing-based learning-to-rank (LTR) framework, where a set of OPE algorithms is employed to provide weak comparison labels. Experimental results show that PoRank not only outperforms baselines when ground-truth labels are provided, but also achieves competitive performance when they are unavailable.
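To make the crowdsourcing-based LTR idea concrete, the following is a minimal sketch (not the authors' code): several OPE estimators cast pairwise votes on which of two policies is better, the majority vote serves as a weak preference label, and a comparison model is trained on such labels with a pairwise logistic loss. The Policy Comparison Transformer is abstracted here as a small MLP over concatenated policy representations; all names (PolicyComparator, weak_label) are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn as nn

class PolicyComparator(nn.Module):
    """Stand-in for PCT: maps a pair of policy representations to a
    preference logit (>0 means the first policy is ranked higher)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, repr_a: torch.Tensor, repr_b: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([repr_a, repr_b], dim=-1)).squeeze(-1)

def weak_label(est_a: torch.Tensor, est_b: torch.Tensor) -> torch.Tensor:
    """Majority vote over K OPE estimates (one per 'crowd worker'):
    returns 1.0 if more estimators prefer policy a, else 0.0."""
    votes = (est_a > est_b).float()          # (K,) one vote per estimator
    return (votes.mean() > 0.5).float()

# Toy example: random "policy representations" and K OPE score estimates.
dim, K = 16, 5
model = PolicyComparator(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

repr_a, repr_b = torch.randn(dim), torch.randn(dim)
est_a, est_b = torch.randn(K), torch.randn(K)

label = weak_label(est_a, est_b)             # weak supervision signal
logit = model(repr_a, repr_b)
loss = nn.functional.binary_cross_entropy_with_logits(logit, label)
opt.zero_grad(); loss.backward(); opt.step()
```

Once trained, such a comparator can rank a candidate set by, e.g., counting how many pairwise comparisons each policy wins, without ever deploying the policies online.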
Keywords:
Machine Learning: ML: Reinforcement learning