Enhancing Cross-modal Completion and Alignment for Unsupervised Incomplete Text-to-Image Person Retrieval

Tiantian Gong, Junsheng Wang, Liyan Zhang

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 794-802. https://doi.org/10.24963/ijcai.2024/88

Traditional text-image person retrieval methods rely heavily on fully matched and identity-annotated multimodal data, an ideal yet limited scenario. Incomplete multimodal data and the cost of annotating it are common challenges in real-world applications. In response, we consider a more robust and pragmatic setting termed unsupervised incomplete text-image person retrieval, where person images and text descriptions are not fully matched and lack the supervision of identity labels. To tackle these two problems, we propose the Enhancing Cross-modal Completion and Alignment (ECCA) method. Specifically, we propose a feature-level cross-modal completion strategy for incomplete data. This strategy leverages available cross-modal features with high semantic similarity to construct relational graphs for samples with missing modalities, yielding more reliable completion features. Additionally, to address cross-modal matching ambiguity, we propose a weighted inter-instance granularity alignment module and an enhanced prototype-wise granularity alignment module, which map semantically similar image-text pairs more compactly in the common embedding space. Extensive experiments on public datasets demonstrate the consistent superiority of our method over state-of-the-art text-image person retrieval methods.
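
The abstract does not spell out how the relational graph is built, but the idea of synthesizing a missing-modality feature from its most similar available cross-modal features can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the cosine-similarity graph, the top-k neighborhood, the softmax edge weighting, and the hyperparameters k and tau are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def complete_missing_features(avail_feats, bank_feats, k=5, tau=0.1):
    """Illustrative sketch of feature-level cross-modal completion.

    For each sample whose paired modality is missing, connect its
    available-modality feature to its k most similar features in a
    cross-modal feature bank (the relational graph), then aggregate
    those neighbors with similarity-softmax edge weights to synthesize
    a completion feature. k and tau are hypothetical hyperparameters.
    """
    q = F.normalize(avail_feats, dim=-1)       # (n, d) available-modality features
    b = F.normalize(bank_feats, dim=-1)        # (m, d) cross-modal feature bank
    sim = q @ b.t()                            # (n, m) cosine-similarity graph
    topk_sim, topk_idx = sim.topk(k, dim=-1)   # keep the k strongest edges per node
    w = F.softmax(topk_sim / tau, dim=-1)      # (n, k) normalized edge weights
    neighbors = b[topk_idx]                    # (n, k, d) neighbor features
    completed = (w.unsqueeze(-1) * neighbors).sum(dim=1)
    return F.normalize(completed, dim=-1)      # synthesized missing-modality features
```

Weighting neighbors by softmax-normalized similarity, rather than averaging them uniformly, lets the few most semantically reliable cross-modal features dominate the synthesized completion.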
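
Likewise, the weighted inter-instance alignment can be pictured as a confidence-weighted contrastive objective: without identity labels, co-occurring image-text pairs are only putative matches, so each pair's contribution is scaled by an estimate of its match confidence. The sketch below is a guess at one such formulation, not the paper's loss; the diagonal-similarity weighting and the temperature tau are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def weighted_alignment_loss(img_feats, txt_feats, tau=0.07):
    """Illustrative sketch of weighted inter-instance alignment.

    A symmetric InfoNCE-style loss over a batch of putative image-text
    pairs, where each pair's term is re-weighted by its (detached)
    cross-modal similarity as a stand-in for match confidence, so that
    ambiguous pairs pull the embeddings together less forcefully.
    """
    v = F.normalize(img_feats, dim=-1)                      # (n, d) image embeddings
    t = F.normalize(txt_feats, dim=-1)                      # (n, d) text embeddings
    logits = v @ t.t() / tau                                # (n, n) pairwise scores
    targets = torch.arange(v.size(0), device=v.device)      # diagonal = putative pairs
    # Confidence weights from detached diagonal similarities, scaled to mean 1.
    diag = torch.diag(v @ t.t()).detach()
    conf = torch.softmax(diag / tau, dim=0) * v.size(0)
    loss_i2t = F.cross_entropy(logits, targets, reduction="none")
    loss_t2i = F.cross_entropy(logits.t(), targets, reduction="none")
    return (conf * (loss_i2t + loss_t2i)).mean() / 2
```

The same weighting idea extends to the prototype-wise granularity: replacing instance targets with cluster prototypes would pull each embedding toward its pseudo-identity center in the common space.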
Keywords:
Computer Vision: CV: Multimodal learning