Unified Single-Stage Transformer Network for Efficient RGB-T Tracking

Jianqiang Xia, Dianxi Shi, Ke Song, Linna Song, Xiaolei Wang, Songchang Jin, Chenran Zhao, Yu Cheng, Lei Jin, Zheng Zhu, Jianan Li, Gang Wang, Junliang Xing, Jian Zhao

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1471-1479. https://doi.org/10.24963/ijcai.2024/163

Most existing RGB-T tracking networks extract modality features separately, which lacks interaction and mutual guidance between modalities. This limits the network's ability to adapt to the diverse dual-modality appearances of targets and to the dynamic relationships between the modalities. Additionally, the three-stage fusion tracking paradigm followed by these networks significantly restricts the tracking speed. To overcome these problems, we propose a unified single-stage Transformer RGB-T tracking network, namely USTrack, which unifies the above three stages into a single ViT (Vision Transformer) backbone through joint feature extraction, fusion, and relation modeling. With this structure, the network not only extracts the fusion features of templates and search regions under the interaction of modalities, but also significantly improves tracking speed through the single-stage fusion tracking paradigm. Furthermore, we introduce a novel feature selection mechanism based on modality reliability to mitigate the influence of invalid modalities on the final prediction. Extensive experiments on three mainstream RGB-T tracking benchmarks show that our method achieves new state-of-the-art performance while attaining the fastest tracking speed of 84.2 FPS. Code is available at https://github.com/xiajianqiang/USTrack.
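The single-stage idea described above can be sketched as follows: template and search-region tokens from both modalities are concatenated and passed through shared self-attention, so feature extraction, cross-modal fusion, and template-search relation modeling happen in one pass; a reliability score then weights each modality's search features before prediction. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation; the toy identity-projection attention and the reliability score (mean feature energy) are placeholders for the paper's learned ViT blocks and reliability mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(tokens, d):
    # Toy single-head self-attention with identity Q/K/V projections,
    # standing in for a full ViT block.
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def joint_extract_fuse(z_rgb, x_rgb, z_t, x_t):
    # Concatenate template (z) and search (x) tokens of both modalities so
    # one attention pass jointly performs extraction, fusion, and
    # template-search relation modeling (the "single-stage" paradigm).
    tokens = np.concatenate([z_rgb, x_rgb, z_t, x_t], axis=0)
    out = attention(tokens, tokens.shape[1])
    n_z, n_x = z_rgb.shape[0], x_rgb.shape[0]
    # Recover the search-region tokens of each modality for prediction.
    s_rgb = out[n_z:n_z + n_x]
    s_t = out[2 * n_z + n_x:]
    return s_rgb, s_t

def reliability_select(s_rgb, s_t):
    # Hypothetical reliability score: mean feature energy per modality,
    # softmax-normalized into weights that down-weight an unreliable
    # (e.g. saturated or dark) modality before the final prediction head.
    energy = np.array([np.abs(s_rgb).mean(), np.abs(s_t).mean()])
    w = softmax(energy)
    return w[0] * s_rgb + w[1] * s_t, w

# Usage with dummy token sequences (4 template / 8 search tokens, dim 16):
rng = np.random.default_rng(0)
z_rgb, z_t = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
x_rgb, x_t = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
s_rgb, s_t = joint_extract_fuse(z_rgb, x_rgb, z_t, x_t)
fused, weights = reliability_select(s_rgb, s_t)
```

The fused search features keep the search-region shape, and the two modality weights sum to one, so a degraded modality contributes proportionally less to the prediction head.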
Keywords:
Computer Vision: CV: Motion and tracking
Computer Vision: CV: Multimodal learning
Computer Vision: CV: Video analysis and understanding