Cross-modal Generation and Alignment via Attribute-guided Prompt for Unsupervised Text-based Person Retrieval

Zongyi Li, Jianbo Li, Yuxuan Shi, Hefei Ling, Jiazhong Chen, Runsheng Wang, Shijuan Huang

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1047-1055. https://doi.org/10.24963/ijcai.2024/116

Text-based person search aims to retrieve a specified person using a given text query. Current methods rely predominantly on paired, labeled image-text data to train the cross-modal retrieval model, which necessitates laborious and time-consuming annotation. To address this challenge, we present Cross-modal Generation and Alignment via Attribute-guided Prompt (GAAP), a framework for fully unsupervised text-based person search that uses only unlabeled images. GAAP consists of two key parts: an Attribute-guided Prompt Caption Generation module and an Attribute-guided Cross-modal Alignment module. The Attribute-guided Prompt Caption Generation module produces pseudo text labels by feeding attribute prompts into a large-scale pre-trained vision-language model. These synthetic captions are then carefully filtered through a sample selection process to ensure their reliability for subsequent fine-tuning. The Attribute-guided Cross-modal Alignment module comprises three sub-modules for aligning features across modalities. First, Cross-Modal Center Alignment (CMCA) aligns samples with the centroids of the other modality. Next, to resolve the ambiguity arising from local attribute similarities, an Attribute-guided Image-Text Contrastive learning module (AITC) aligns the relationships among different pairs by taking those local attribute similarities into account. Finally, the Attribute-guided Image-Text Matching module (AITM) mitigates noise in the pseudo captions by using image-attribute matching scores to soften the hard matching labels. Empirical results demonstrate the effectiveness of our method on multiple text-based person search datasets under the fully unsupervised setting.
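As a rough illustration of the label-softening idea behind AITM, the sketch below blends one-hot image-text matching labels with row-normalized image-attribute matching scores. This is not the paper's implementation; the function name `soften_matching_labels`, the blending weight `alpha`, and the score layout are all assumptions made for illustration.

```python
def soften_matching_labels(hard_labels, attr_scores, alpha=0.5):
    """Soften one-hot image-text matching labels with attribute scores.

    hard_labels: list of rows, each a one-hot image-text matching target.
    attr_scores: list of rows of non-negative image-attribute matching scores.
    alpha:       assumed blending weight between hard and attribute-derived labels.
    """
    soft = []
    for hard_row, attr_row in zip(hard_labels, attr_scores):
        total = sum(attr_row)  # normalize attribute scores into a distribution
        soft.append([(1 - alpha) * h + alpha * (a / total)
                     for h, a in zip(hard_row, attr_row)])
    return soft
```

With `alpha=0.5`, a one-hot target `[1, 0]` and attribute scores `[3, 1]` yield the softened target `[0.875, 0.125]`, so a pair that shares many attributes with the image is no longer penalized as a strict negative.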
Keywords:
Computer Vision: CV: Multimodal learning
Computer Vision: CV: Vision, language and reasoning
Machine Learning: ML: Multi-modal learning
Machine Learning: ML: Unsupervised learning