CrowdFormer: An Overlap Patching Vision Transformer for Top-Down Crowd Counting

Shaopeng Yang, Weiyu Guo, Yuheng Ren

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 1545-1551. https://doi.org/10.24963/ijcai.2022/215

Crowd counting methods typically predict a density map as an intermediate representation for counting and achieve good performance. However, due to the perspective effect, real scenes exhibit scale variation, which causes density map-based methods to suffer from a severe scene generalization problem, because only a limited number of scales are fitted during density map prediction and generation. To address this issue, we propose a novel vision transformer network, CrowdFormer, and a density kernel fusion framework for more accurate density map estimation and generation, respectively. We then incorporate these two innovations into an adaptive learning system that takes both the annotation dot map and the original image as input and jointly learns the density map estimator and generator within an end-to-end framework. The experimental results demonstrate that the proposed model achieves state-of-the-art performance in terms of MAE and MSE (e.g., an MAE of 67.1 and an MSE of 301.6 on the NWPU-Crowd dataset) and confirm the effectiveness of the two proposed designs. The code is available at https://github.com/special-yang/Top_Down-CrowdCounting.
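
To illustrate the "overlap patching" idea named in the title, the sketch below shows a generic overlapping patch embedding layer: patches are cut with a stride smaller than the patch size, so neighbouring patches share pixels before being fed to the transformer. This is a minimal sketch of the general technique only; the class name, patch size, stride, and embedding dimension are illustrative assumptions, not the configuration used by CrowdFormer.

import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding (illustrative, not the paper's exact layer).
    A strided convolution with kernel_size > stride cuts patches that overlap,
    so adjacent tokens share image content."""
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=stride,
                              padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                    # (B, C, H', W') overlapping patch features
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)    # (B, H'*W', C) token sequence
        x = self.norm(x)
        return x, (H, W)

# Example: a 3x512x512 crowd image becomes a sequence of overlapping patch tokens.
tokens, (H, W) = OverlapPatchEmbed()(torch.randn(1, 3, 512, 512))
print(tokens.shape, H, W)  # torch.Size([1, 16384, 64]) 128 128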
Keywords:
Computer Vision: Scene analysis and understanding   
Computer Vision: Machine Learning for Vision
Computer Vision: Recognition (object detection, categorization)
Computer Vision: Representation Learning
Computer Vision: Video analysis and understanding