ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition

ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition

Mengqi Xue, Qihan Huang, Haofei Zhang, Jingwen Hu, Jie Song, Mingli Song, Canghong Jin

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1516-1524. https://doi.org/10.24963/ijcai.2024/168

Prototypical part network (ProtoPNet) and its variants have drawn wide attention and been applied to various tasks due to their inherent self-explanatory property. Previous ProtoPNets are primarily built upon convolutional neural networks (CNNs). Therefore, it is natural to investigate whether these explainable methods can be advantageous for the recently emerged Vision Transformers (ViTs). However, directly utilizing ViT-backed models as backbones can lead to prototypes paying excessive attention to background positions rather than foreground objects (i.e., the “distraction” problem). To address the problem, this paper proposes prototypical part Transformer (ProtoPFormer) for interpretable image recognition. Based the architectural characteristics of ViTs, we modify the original ProtoPNet by creating separate global and local branches, each accompanied by corresponding prototypes that can capture and highlight representative holistic and partial features. Specifically, the global prototypes can guide local prototypes to concentrate on the foreground and effectively suppress the background influence. Subsequently, local prototypes are explicitly supervised to concentrate on different discriminative visual parts. Finally, the two branches mutually correct each other and jointly make the final decisions. Moreover, extensive experiments demonstrate that ProtoPFormer can consistently achieve superior performance on accuracy, visualization results, and quantitative interpretability evaluation over the state-of-the-art (SOTA) baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.
Keywords:
Computer Vision: CV: Interpretability and transparency
Computer Vision: CV: Recognition (object detection, categorization)
Computer Vision: CV: Representation learning