CoAtFormer: Vision Transformer with Composite Attention

Zhiyong Chang, Mingjun Yin, Yan Wang

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 614-622. https://doi.org/10.24963/ijcai.2024/68

Transformers have recently gained significant attention and achieved state-of-the-art performance in various computer vision applications, including image classification, instance segmentation, and object detection. However, the self-attention mechanism underlying the transformer incurs quadratic computational cost with respect to image size, limiting its widespread adoption in state-of-the-art vision backbones. In this paper, we introduce an efficient and effective attention module we call Composite Attention. It features parallel branches, enabling the modeling of various global dependencies. In each composite attention module, one branch employs a dynamic channel attention module to capture global channel dependencies, while the other branch utilizes an efficient spatial attention module to extract long-range spatial interactions. In addition, we effectively blend the composite attention module with convolutions and accordingly develop a simple hierarchical vision backbone, dubbed CoAtFormer, by repeating the basic building block over multiple stages. Extensive experiments show that CoAtFormer achieves state-of-the-art results on a variety of tasks. Without any pre-training or extra data, CoAtFormer-Tiny, CoAtFormer-Small, and CoAtFormer-Base achieve 84.4%, 85.3%, and 85.9% top-1 accuracy on ImageNet-1K with 24M, 37M, and 73M parameters, respectively. Furthermore, CoAtFormer also consistently outperforms prior work on other vision tasks such as object detection, instance segmentation, and semantic segmentation. When further pretrained on the larger ImageNet-22K dataset, CoAtFormer achieves 88.7% top-1 accuracy on ImageNet-1K.
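To make the described block structure concrete, below is a minimal PyTorch-style sketch of a composite attention block with two parallel branches: a global channel-attention branch and an efficient spatial-attention branch, fused by summation with a residual connection. The specific branch designs (squeeze-and-excitation-style channel gating, spatial attention over pooled keys/values) and the fusion scheme are illustrative assumptions, not the paper's exact implementation; module names such as `ChannelAttention` and `SpatialAttention` are hypothetical.

```python
# Hedged sketch of a "composite attention" block with parallel channel and
# spatial branches. Branch designs and fusion are assumptions for illustration.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Global channel dependencies via squeeze-and-excitation-style gating (assumed design)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )

    def forward(self, x):                    # x: (B, N, C) tokens
        gate = self.fc(x.mean(dim=1))        # squeeze over spatial tokens -> (B, C)
        return x * gate.unsqueeze(1)         # re-weight channels globally


class SpatialAttention(nn.Module):
    """Long-range spatial interactions with pooled keys/values for efficiency (assumed design)."""
    def __init__(self, dim, num_heads=4, pool_size=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pool = nn.AdaptiveAvgPool1d(pool_size * pool_size)

    def forward(self, x):                                    # x: (B, N, C)
        kv = self.pool(x.transpose(1, 2)).transpose(1, 2)    # shrink token set for K/V
        out, _ = self.attn(x, kv, kv)                        # full-resolution queries
        return out


class CompositeAttention(nn.Module):
    """Parallel channel + spatial branches whose outputs are summed (assumed fusion)."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.channel = ChannelAttention(dim)
        self.spatial = SpatialAttention(dim)

    def forward(self, x):
        y = self.norm(x)
        return x + self.channel(y) + self.spatial(y)         # residual connection


if __name__ == "__main__":
    tokens = torch.randn(2, 14 * 14, 64)                     # (batch, tokens, channels)
    print(CompositeAttention(64)(tokens).shape)              # torch.Size([2, 196, 64])
```

A hierarchical backbone in the spirit of the abstract would interleave such blocks with convolutional downsampling layers across multiple stages; the stage configuration here is not specified by the abstract and would need to follow the paper.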
Keywords:
Computer Vision: CV: Representation learning
Machine Learning: ML: Deep learning architectures