Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval

Xiaobo Shen, Qianxin Huang, Long Lan, Yuhui Zheng

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1227-1235. https://doi.org/10.24963/ijcai.2024/136

As video-based social networks continue to grow exponentially, there is rising interest in video retrieval using natural language. Cross-modal hashing, which learns compact hash codes to encode multi-modal data, has proven widely effective in large-scale cross-modal retrieval, e.g., image-text retrieval, primarily due to its computational and storage efficiency. However, when applied to video-text retrieval, existing cross-modal hashing methods generally extract frame- or word-level features for videos and texts individually, thereby ignoring their long-term dependencies. To address this issue, we propose Contrastive Transformer Cross-Modal Hashing (CTCH), a novel approach designed for the video-text retrieval task. CTCH employs a bidirectional transformer encoder to encode videos and texts and leverage their long-term dependencies. CTCH further introduces a supervised multi-modality contrastive loss that effectively exploits inter-modality and intra-modality similarities among videos and texts. Experimental results on three video benchmark datasets demonstrate that CTCH outperforms state-of-the-art methods on video-text retrieval tasks.
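The abstract does not give the exact form of the supervised multi-modality contrastive loss, so the following is only a minimal PyTorch-style sketch of one common formulation (SupCon-style) that matches the description: video and text embeddings are stacked into a single batch, and any pair sharing a class label counts as a positive, whether the pair is cross-modal (video-text) or within one modality (video-video, text-text). The function name, temperature value, and this specific formulation are illustrative assumptions, not the paper's definition.

    import torch
    import torch.nn.functional as F

    def supervised_multimodal_contrastive_loss(video_emb, text_emb, labels, tau=0.1):
        # Illustrative sketch, not the paper's exact loss.
        # Stack both modalities into one 2N x d batch and L2-normalise, so a
        # single objective covers inter- and intra-modality positive pairs.
        z = F.normalize(torch.cat([video_emb, text_emb], dim=0), dim=1)
        y = torch.cat([labels, labels], dim=0)
        n = z.size(0)

        # Temperature-scaled pairwise cosine similarities.
        sim = z @ z.t() / tau

        # Exclude self-similarity from the softmax denominator.
        self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(self_mask, float('-inf'))

        # Positives: any pair (across or within modalities) sharing a label.
        pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask

        # Average log-probability of each row's positives.
        log_prob = F.log_softmax(sim, dim=1)
        pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
        loss = -pos_log_prob / pos_mask.sum(dim=1).clamp(min=1)
        return loss.mean()

Given video codes v and text codes t of shape (N, d) with class labels y of shape (N,), the call would be loss = supervised_multimodal_contrastive_loss(v, t, y). Treating the two modalities as one joint batch is what lets a single objective exploit both inter-modality and intra-modality similarities, as the abstract describes.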
Keywords:
Computer Vision: CV: Image and video retrieval 
Machine Learning: ML: Multi-modal learning
Machine Learning: ML: Multi-view learning