Black-box Prompt Tuning for Vision-Language Model as a Service

Lang Yu, Qin Chen, Jiaju Lin, Liang He

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 1686-1694. https://doi.org/10.24963/ijcai.2023/187

In the Model-as-a-Service (MaaS) scenario, pre-trained models are usually released as inference APIs, and users may query them only with manually crafted prompts. Without access to the network structure or gradient information, continuous prompt tuning is difficult under MaaS, especially for vision-language models (VLMs), where cross-modal interaction must be considered. In this paper, we propose a black-box prompt tuning framework for VLMs that learns task-relevant prompts without back-propagation. In particular, the vision and language prompts are jointly optimized in an intrinsic parameter subspace with various evolution strategies, and different prompt variants are explored to enhance cross-modal interaction. Experimental results show that our framework outperforms both hand-crafted prompt engineering and gradient-based prompt learning methods, demonstrating its ability to train task-relevant prompts in a derivative-free manner.
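The core idea of optimizing prompts in a low-dimensional intrinsic subspace with an evolution strategy can be illustrated with a minimal sketch. The code below is a toy illustration only, not the paper's actual method: it uses a simple (1+1) evolution strategy, a fixed random projection from a small intrinsic vector to the full prompt space, and a quadratic stand-in for the black-box loss that a real system would obtain by querying the VLM inference API. All function names and dimensions here are hypothetical.

```python
import random

def black_box_loss(prompt):
    # Stand-in for querying the MaaS inference API and scoring the output;
    # here a toy quadratic with a fixed (hypothetical) target prompt.
    target = [0.5] * len(prompt)
    return sum((p - t) ** 2 for p, t in zip(prompt, target))

def project(z, A):
    # Map the low-dimensional intrinsic vector z into the full prompt
    # space via a fixed random matrix A (d_prompt x d_intrinsic).
    return [sum(a * zj for a, zj in zip(row, z)) for row in A]

def evolve(d_intrinsic=5, d_prompt=20, iters=200, sigma=0.3, seed=0):
    rng = random.Random(seed)
    # Fixed random projection defining the intrinsic subspace.
    A = [[rng.gauss(0, 1 / d_intrinsic) for _ in range(d_intrinsic)]
         for _ in range(d_prompt)]
    z = [0.0] * d_intrinsic
    best = black_box_loss(project(z, A))
    for _ in range(iters):
        # (1+1)-ES step: perturb z, query the black box, keep improvements.
        cand = [zi + rng.gauss(0, sigma) for zi in z]
        loss = black_box_loss(project(cand, A))
        if loss < best:
            z, best = cand, loss
    return z, best

z, loss = evolve()
```

Because only loss values from the API are needed, no gradients flow through the model; in the paper's setting the same loop would cover concatenated vision and language prompt parameters, with more sophisticated evolution strategies in place of the (1+1)-ES used here.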
Keywords:
Computer Vision: CV: Vision and language
Machine Learning: ML: Evolutionary learning
Machine Learning: ML: Multi-modal learning