Prompt Learns Prompt: Exploring Knowledge-Aware Generative Prompt Collaboration For Video Captioning

Liqi Yan, Cheng Han, Zenglin Xu, Dongfang Liu, Qifan Wang

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 1622-1630. https://doi.org/10.24963/ijcai.2023/180

Fine-tuning large vision-language models is a challenging task. Prompt tuning approaches have been introduced to learn fixed textual or visual prompts while freezing the pre-trained model in downstream tasks. Despite the effectiveness of prompt tuning, what those learnable prompts actually learn remains unexplained. In this work, we explore whether prompts used in fine-tuning can learn from knowledge-aware prompts acquired during pre-training, by designing two different sets of prompts for the pre-training and fine-tuning phases respectively. Specifically, we present a Video-Language Prompt tuning (VL-Prompt) approach for video captioning, which first efficiently pre-trains a video-language model to extract key information (e.g., actions and objects) with flexibly generated Knowledge-Aware Prompts (KAP). Then, we design a Video-Language Prompt (VLP) to transfer knowledge from the knowledge-aware prompts and fine-tune the model to generate full captions. Experimental results show the superior performance of our approach over several state-of-the-art baselines. We further demonstrate that the video-language prompts are well learned from the knowledge-aware prompts.
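
The abstract describes prompt tuning as learning a small set of prompt vectors while the pre-trained backbone stays frozen. The sketch below illustrates that general idea only; the class, parameter names, and dimensions are illustrative assumptions and are not taken from the paper's implementation.

```python
# Minimal sketch of prompt tuning (assumed setup, not the paper's code):
# learnable prompt embeddings are prepended to the input sequence while
# every pre-trained parameter of the backbone is frozen.
import torch
import torch.nn as nn


class PromptTunedEncoder(nn.Module):
    def __init__(self, backbone: nn.Module, num_prompts: int = 8, embed_dim: int = 512):
        super().__init__()
        self.backbone = backbone
        # Freeze all pre-trained parameters; only the prompts receive gradients.
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Learnable prompt vectors, randomly initialized.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, embed_dim)
        batch = token_embeddings.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the learnable prompts to the frozen model's input sequence.
        return self.backbone(torch.cat([prompts, token_embeddings], dim=1))
```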
Keywords:
Computer Vision: CV: Video analysis and understanding   
Computer Vision: CV: Vision and language