CLIP-FSAC: Boosting CLIP for Few-Shot Anomaly Classification with Synthetic Anomalies

Zuo Zuo, Yao Wu, Baoqiang Li, Jiahao Dong, You Zhou, Lei Zhou, Yanyun Qu, Zongze Wu

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1834-1842. https://doi.org/10.24963/ijcai.2024/203

Few-shot anomaly classification (FSAC) is a vital task in the manufacturing industry. Recent methods leverage CLIP for zero-/few-normal-shot anomaly detection instead of training custom models. However, text prompts specific to anomaly classification are lacking, most methods ignore the modality gap between image and text, and there is a distribution discrepancy between the pre-training data and the target data. As a remedy, in this paper we propose a method to boost CLIP for few-normal-shot anomaly classification, dubbed CLIP-FSAC, which comprises two-stage training that alternately fine-tunes two modality-specific adapters. Specifically, in the first stage, we train the image adapter against the text representations output by the text encoder and introduce image-to-text tuning to enhance multi-modal interaction and produce a more language-compatible visual representation. In the second stage, we freeze the image adapter and train the text adapter. Both adapters are constrained by a fusion-text contrastive loss. Comprehensive experiments evaluate our method on few-normal-shot anomaly classification, where it outperforms the state-of-the-art method by 12.2%, 10.9%, and 10.4% AUROC on VisA in the 1-, 2-, and 4-shot settings, respectively.
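The two-stage alternating scheme described above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the adapter architecture, the fusion-text contrastive loss, and all names (`Adapter`, `contrastive_loss`, dimensions, hyperparameters) are assumptions, and frozen CLIP encoder outputs are replaced by random features so the sketch runs standalone.

```python
# Hypothetical sketch of two-stage adapter training on frozen CLIP features.
# The actual CLIP-FSAC architecture and fusion-text loss may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Small bottleneck MLP placed on top of a frozen CLIP encoder."""
    def __init__(self, dim=512, bottleneck=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.ReLU(),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, x):
        # Residual connection keeps the adapted feature close to the
        # original CLIP feature; re-normalize for cosine similarity.
        return F.normalize(x + self.net(x), dim=-1)

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE over matched image/text pairs (a generic
    stand-in for the paper's fusion-text contrastive loss)."""
    logits = img_feats @ txt_feats.t() / temperature
    targets = torch.arange(len(logits))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Stand-ins for frozen CLIP encoder outputs (random, for illustration).
img_feats = F.normalize(torch.randn(8, 512), dim=-1)
txt_feats = F.normalize(torch.randn(8, 512), dim=-1)

img_adapter, txt_adapter = Adapter(), Adapter()

# Stage 1: train only the image adapter against fixed text features.
opt1 = torch.optim.Adam(img_adapter.parameters(), lr=1e-3)
loss1 = contrastive_loss(img_adapter(img_feats), txt_feats)
opt1.zero_grad(); loss1.backward(); opt1.step()

# Stage 2: freeze the image adapter, train only the text adapter.
for p in img_adapter.parameters():
    p.requires_grad_(False)
opt2 = torch.optim.Adam(txt_adapter.parameters(), lr=1e-3)
loss2 = contrastive_loss(img_adapter(img_feats), txt_adapter(txt_feats))
opt2.zero_grad(); loss2.backward(); opt2.step()
```

The design point the sketch captures is that the CLIP encoders themselves are never updated; only the lightweight adapters are tuned, one modality at a time, against a shared contrastive objective.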
Keywords:
Computer Vision: CV: Transfer, low-shot, semi- and un- supervised learning   
Computer Vision: CV: Multimodal learning
Computer Vision: CV: Recognition (object detection, categorization)