Expressiveness is Effectiveness: Self-supervised Fashion-aware CLIP for Video-to-Shop Retrieval

Likai Tian, Zhengwei Yang, Zechao Hu, Hao Li, Yifang Yin, Zheng Wang

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1335-1343. https://doi.org/10.24963/ijcai.2024/148

The rise of online shopping and social media has spurred the Video-to-Shop Retrieval (VSR) task, which involves identifying fashion items (e.g., clothing) in videos and matching them with identical products provided by stores. In real-world scenarios, human movement in dynamic video scenes can cause substantial morphological alterations of fashion items, including occlusion, shifting viewpoints (parallax), and partial visibility (truncation). As a result, the few high-quality frames are overwhelmed by a vast number of redundant ones, making retrieval less effective. To this end, this paper introduces a framework, named Self-supervised Fashion-aware CLIP (SF-CLIP), for effective VSR. SF-CLIP discovers salient frames with high fashion expressiveness by generating pseudo-labels that assess three key aspects of that expressiveness: occlusion, parallax, and truncation. With these pseudo-labels, CLIP is extended to identify salient frames. Furthermore, to obtain a comprehensive representation across salient frames, a dual-branch graph-based fusion module is proposed to extract and integrate inter-frame features. Extensive experiments demonstrate the superiority of SF-CLIP over state-of-the-art methods.
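To make the pipeline described in the abstract concrete, the sketch below illustrates the overall idea in PyTorch: combine three per-frame degradation scores into an expressiveness pseudo-label, train a lightweight scorer on top of CLIP frame embeddings to select salient frames, and fuse the selected frames with a similarity-weighted graph layer. All names, the equal-weight pseudo-label formula, the single-branch fusion layer (the paper uses a dual-branch module), and the hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the SF-CLIP idea; heuristics and layer choices are
# assumptions for illustration, not the paper's exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

def expressiveness_pseudo_label(occlusion, parallax, truncation):
    """Combine the three per-frame degradation scores (each in [0, 1],
    1 = most degraded) into one expressiveness pseudo-label.
    Equal weighting is an assumption made here for simplicity."""
    return 1.0 - (occlusion + parallax + truncation) / 3.0

class FrameScorer(nn.Module):
    """Regresses an expressiveness score from a (frozen) CLIP frame
    embedding; it would be trained against the pseudo-labels above."""
    def __init__(self, dim=512):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                  nn.Linear(128, 1))

    def forward(self, frame_emb):            # (T, dim) -> (T,)
        return self.head(frame_emb).squeeze(-1)

class GraphFusion(nn.Module):
    """One attention-style graph layer over the selected salient frames:
    edge weights come from embedding similarity, followed by mean pooling.
    The paper's module is dual-branch; a single branch is shown here."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                     # (K, dim)
        adj = F.softmax(x @ x.t() / x.shape[-1] ** 0.5, dim=-1)
        x = F.relu(self.proj(adj @ x)) + x    # message passing + residual
        return x.mean(dim=0)                  # (dim,) video-level feature

# Usage: score all frames, keep the top-k salient ones, fuse them.
T, dim, k = 32, 512, 8
frame_emb = F.normalize(torch.randn(T, dim), dim=-1)  # stand-in CLIP features
scores = FrameScorer(dim)(frame_emb)
salient = frame_emb[scores.topk(k).indices]
video_feat = GraphFusion(dim)(salient)  # match against shop-image embeddings
```

The resulting video-level feature would then be compared (e.g., by cosine similarity) against shop-image embeddings from the same CLIP encoder to rank candidate products.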
Keywords:
Computer Vision: CV: Image and video retrieval 
Computer Vision: CV: Interpretability and transparency