Contrastive Transformer Cross-Modal Hashing for Video-Text Retrieval

Xiaobo Shen, Qianxin Huang, Long Lan, Yuhui Zheng

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1227-1235. https://doi.org/10.24963/ijcai.2024/136

As video-based social networks continue to grow exponentially, there is rising interest in video retrieval using natural language. Cross-modal hashing, which learns compact hash codes to encode multi-modal data, has proven widely effective in large-scale cross-modal retrieval, e.g., image-text retrieval, primarily due to its computational and storage efficiency. However, when applied to video-text retrieval, existing cross-modal hashing methods generally extract frame- or word-level features for videos and texts individually, thereby ignoring their long-term dependencies. To address this issue, we propose Contrastive Transformer Cross-Modal Hashing (CTCH), a novel approach designed for the video-text retrieval task. CTCH employs a bidirectional transformer encoder to encode videos and texts and leverage their long-term dependencies. CTCH further introduces a supervised multi-modality contrastive loss that effectively exploits inter-modality and intra-modality similarities among videos and texts. Experimental results on three video benchmark datasets demonstrate that CTCH outperforms state-of-the-art methods on video-text retrieval tasks.
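The abstract does not give the exact form of the supervised multi-modality contrastive loss, so the following is only a minimal PyTorch-style sketch of one common formulation (SupCon-style) that matches the description: video and text embeddings are stacked into a single batch, and any pair sharing a class label counts as a positive, whether the pair is cross-modal (video-text) or within one modality (video-video, text-text). The function name, temperature value, and this specific formulation are illustrative assumptions, not the paper's definition.

    import torch
    import torch.nn.functional as F

    def supervised_multimodal_contrastive_loss(video_emb, text_emb, labels, tau=0.1):
        # Illustrative sketch, not the paper's exact loss.
        # Stack both modalities into one 2N x d batch and L2-normalise, so a
        # single objective covers inter- and intra-modality positive pairs.
        z = F.normalize(torch.cat([video_emb, text_emb], dim=0), dim=1)
        y = torch.cat([labels, labels], dim=0)
        n = z.size(0)

        # Temperature-scaled pairwise cosine similarities.
        sim = z @ z.t() / tau

        # Exclude self-similarity from the softmax denominator.
        self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
        sim = sim.masked_fill(self_mask, float('-inf'))

        # Positives: any pair (across or within modalities) sharing a label.
        pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask

        # Average log-probability of each row's positives.
        log_prob = F.log_softmax(sim, dim=1)
        pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
        loss = -pos_log_prob / pos_mask.sum(dim=1).clamp(min=1)
        return loss.mean()

Given video codes v and text codes t of shape (N, d) with class labels y of shape (N,), the call would be loss = supervised_multimodal_contrastive_loss(v, t, y). Treating the two modalities as one joint batch is what lets a single objective exploit both inter-modality and intra-modality similarities, as the abstract describes.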
Keywords:
Computer Vision: CV: Image and video retrieval 
Machine Learning: ML: Multi-modal learning
Machine Learning: ML: Multi-view learning