Multi-Attention Based Visual-Semantic Interaction for Few-Shot Learning

Peng Zhao, Yin Wang, Wei Wang, Jie Mu, Huiting Liu, Cong Wang, Xiaochun Cao

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1753-1761. https://doi.org/10.24963/ijcai.2024/194

Few-Shot Learning (FSL) aims to train a model that generalizes to recognize new classes, where each new class has only a few training samples. Since extracting discriminative features for new classes from so few samples is challenging, existing FSL methods leverage visual and semantic prior knowledge to guide discriminative feature learning. However, in the meta-learning setting, the semantic knowledge of the query set is unavailable, so query features lack discriminability. To address this problem, we propose a novel Multi-Attention based Visual-Semantic Interaction (MAVSI) approach for FSL. Specifically, we utilize spatial and channel attention mechanisms to select discriminative visual features for the support set based on its ground-truth semantics, while using all the support-set semantics for each query sample. Then, a relation module with class prototypes of the support set is employed to supervise and select discriminative visual features for the query set. To further enhance the discriminability of the support set, we introduce a visual-semantic contrastive learning module that promotes the similarity between visual features and their corresponding semantic features. Extensive experiments on four benchmark datasets demonstrate that our proposed MAVSI outperforms existing state-of-the-art FSL methods.
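To make the two core ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: a semantic-conditioned channel/spatial attention block and an InfoNCE-style visual-semantic contrastive loss. All module names, dimensions, the gating design, and the fusion scheme are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of the two mechanisms the abstract describes.
# All names, dimensions, and design choices here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGuidedAttention(nn.Module):
    """Reweight a visual feature map with channel and spatial attention
    conditioned on a class-semantic embedding (e.g., a word vector)."""

    def __init__(self, channels: int, semantic_dim: int):
        super().__init__()
        # Channel attention: project the semantic vector to per-channel gates.
        self.channel_gate = nn.Sequential(
            nn.Linear(semantic_dim, channels), nn.Sigmoid()
        )
        # Spatial attention: score each location from visual + semantic cues.
        self.spatial_gate = nn.Conv2d(channels + semantic_dim, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor, sem: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); sem: (B, D)
        b, c, h, w = feat.shape
        feat = feat * self.channel_gate(sem).view(b, c, 1, 1)
        sem_map = sem.view(b, -1, 1, 1).expand(-1, -1, h, w)
        spatial = torch.sigmoid(self.spatial_gate(torch.cat([feat, sem_map], dim=1)))
        return feat * spatial


def visual_semantic_contrastive_loss(
    visual: torch.Tensor, semantic: torch.Tensor, temperature: float = 0.1
) -> torch.Tensor:
    """InfoNCE-style loss: pull each visual feature toward its own class
    semantics, push it away from the other classes' semantics."""
    v = F.normalize(visual, dim=-1)    # (N, D) pooled visual features
    s = F.normalize(semantic, dim=-1)  # (N, D), row i matches row i of v
    logits = v @ s.t() / temperature   # (N, N) cosine-similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, targets)
```

In this sketch, support features would be gated by their ground-truth class semantics, while each query feature would be gated by every support-class semantic in turn before the prototype-based relation module scores the candidates; the contrastive loss is applied only to the support set, mirroring the abstract's description.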
Keywords:
Computer Vision: CV: Transfer, low-shot, semi- and un-supervised learning
Machine Learning: ML: Meta-learning