Zero-Shot Sketch Based Image Retrieval via Modality Capacity Guidance

Yanghong Zhou, Dawei Liu, P. Y. Mok

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1780-1787. https://doi.org/10.24963/ijcai.2024/197

Zero-shot sketch-based image retrieval (ZS-SBIR) aims to recognize and retrieve relevant photos based on freehand sketch queries that belong to unseen categories in the search set. It has sparked considerable interest, benefiting from rapid advances in multimodal learning and feature representation research. Despite recent improvements in performance, there is still room for refining feature representation and thus enhancing the generalization capabilities of the models. Most existing research has focused on learning the feature distribution of modalities within specific datasets, without considering the broader, dataset-agnostic 'population distribution' of the relevant modalities. In this paper, we investigate the modality population distribution and apply this knowledge to guide feature learning. Specifically, we propose a modality capacity constraint loss to control the learning of the population distribution for sketches and photos. This loss can be effectively combined with a retrieval loss (e.g., triplet loss) or a classification loss (e.g., InfoNCE loss) to enhance ZS-SBIR performance when fine-tuning pre-trained models such as CLIP and DINO. Extensive experiments demonstrate significant performance improvements: on the Sketchy-ext dataset (split 2), our method increases mAP@200/P@200 by 7.3%/3.2% on CLIP and by 19.9%/10.3% on DINO compared to the state-of-the-art models. Data, code, and supplementary information are available at https://github.com/YHdian0716/ZS-SBIR-MCC.git.
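
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how P@k and mAP@k are commonly computed for sketch-to-photo retrieval. It is not taken from the authors' code: the function names, the use of cosine similarity for ranking, and the choice to normalize average precision by the number of relevant items found in the top k are all assumptions, and other codebases use different normalizations.

```python
import numpy as np

def precision_at_k(relevant, k):
    """Fraction of the top-k ranked items that are relevant (assumes k <= gallery size)."""
    top = np.asarray(relevant[:k], dtype=float)
    return float(top.sum() / k)

def average_precision_at_k(relevant, k):
    """Average of the precision values at each relevant rank within the top k.

    Assumption: normalized by the number of relevant items found in the top k.
    """
    top = np.asarray(relevant[:k], dtype=float)
    hits = top.sum()
    if hits == 0:
        return 0.0
    ranks = np.arange(1, len(top) + 1)
    precisions = np.cumsum(top) / ranks
    return float((precisions * top).sum() / hits)

def evaluate(sketch_feats, photo_feats, sketch_labels, photo_labels, k=200):
    """Rank photos for each sketch by cosine similarity; return (mAP@k, mean P@k)."""
    s = sketch_feats / np.linalg.norm(sketch_feats, axis=1, keepdims=True)
    p = photo_feats / np.linalg.norm(photo_feats, axis=1, keepdims=True)
    order = np.argsort(-(s @ p.T), axis=1)  # most similar photo first
    relevant = photo_labels[order] == sketch_labels[:, None]
    ap = [average_precision_at_k(r, k) for r in relevant]
    pr = [precision_at_k(r, k) for r in relevant]
    return float(np.mean(ap)), float(np.mean(pr))
```

Here `sketch_feats` and `photo_feats` stand for hypothetical embedding matrices (one row per item) produced by whichever fine-tuned encoder is being evaluated, and the label arrays hold integer category ids.
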
Keywords:
Computer Vision: CV: Image and video retrieval 
Computer Vision: CV: Multimodal learning
Computer Vision: CV: Recognition (object detection, categorization)
Computer Vision: CV: Representation learning