CrowdFormer: An Overlap Patching Vision Transformer for Top-Down Crowd Counting

Shaopeng Yang, Weiyu Guo, Yuheng Ren

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 1545-1551. https://doi.org/10.24963/ijcai.2022/215

Crowd counting methods typically predict a density map as an intermediate representation for counting and achieve good performance. However, due to the perspective effect, real scenes exhibit scale variation, which causes density map-based methods to suffer from a severe scene generalization problem, because only a limited number of scales are fitted during density map prediction and generation. To address this issue, we propose a novel vision transformer network, CrowdFormer, and a density kernel fusion framework for more accurate density map estimation and generation, respectively. We then incorporate these two innovations into an adaptive learning system that takes both the annotation dot map and the original image as input and jointly learns the density map estimator and generator within an end-to-end framework. The experimental results demonstrate that the proposed model achieves state-of-the-art performance in terms of MAE and MSE (e.g., an MAE of 67.1 and an MSE of 301.6 on the NWPU-Crowd dataset) and confirm the effectiveness of the two proposed designs. The code is available at https://github.com/special-yang/Top_Down-CrowdCounting.
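
To illustrate the "overlap patching" idea named in the title, the sketch below shows a generic overlapping patch embedding layer: patches are cut with a stride smaller than the patch size, so neighbouring patches share pixels before being fed to the transformer. This is a minimal sketch of the general technique only; the class name, patch size, stride, and embedding dimension are illustrative assumptions, not the configuration used by CrowdFormer.

import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding (illustrative, not the paper's exact layer).
    A strided convolution with kernel_size > stride cuts patches that overlap,
    so adjacent tokens share image content."""
    def __init__(self, in_chans=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=stride,
                              padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = self.proj(x)                    # (B, C, H', W') overlapping patch features
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)    # (B, H'*W', C) token sequence
        x = self.norm(x)
        return x, (H, W)

# Example: a 3x512x512 crowd image becomes a sequence of overlapping patch tokens.
tokens, (H, W) = OverlapPatchEmbed()(torch.randn(1, 3, 512, 512))
print(tokens.shape, H, W)  # torch.Size([1, 16384, 64]) 128 128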
Keywords:
Computer Vision: Scene analysis and understanding   
Computer Vision: Machine Learning for Vision
Computer Vision: Recognition (object detection, categorization)
Computer Vision: Representation Learning
Computer Vision: Video analysis and understanding