Diffusion Mask-Driven Visual-language Tracking

Guangtong Zhang, Bineng Zhong, Qihua Liang, Zhiyi Mo, Shuxiang Song

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1652-1660. https://doi.org/10.24963/ijcai.2024/183

Most existing visual-language trackers rely heavily on the initial language description of the target object to extract multi-modal features. However, in a highly time-varying video sequence, the initial language description often becomes inaccurate, and the resulting low-quality multi-modal features greatly deteriorate tracking performance. To address this challenge, we propose a Diffusion Mask-Driven Visual-language Tracker (DMTrack) built on a diffusion model. To counter the low-quality multi-modal features caused by inaccurate language descriptions, we leverage the diffusion model to capture high-quality semantic information from the multi-modal features and transform it into target mask features. During training, we further enhance the diffusion model's perception of pixel-level features by computing a loss between the target mask features and the ground-truth masks. Additionally, we localize the target jointly from the target mask features and the visual features, rather than relying solely on the multi-modal features. Extensive experiments on four tracking benchmarks (i.e., LaSOT, TNL2K, LaSOText, and OTB-Lang) validate that the proposed DMTrack improves the robustness and effectiveness of the model.
Keywords:
Computer Vision: CV: Motion and tracking
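
The abstract describes a pipeline in which diffusion-generated target mask features are supervised against ground-truth masks during training and then fused with visual features for joint localization. The snippet below is a minimal, hypothetical PyTorch sketch of that idea only; the module name, layer choices, box-map output, and the binary cross-entropy mask loss are assumptions for illustration and not the authors' released implementation.

```python
# Hypothetical sketch (not DMTrack's code): fuse diffusion-generated mask
# features with visual features for localization, and supervise the mask
# features against ground-truth masks during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskDrivenHead(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Project concatenated (visual + mask) features back to feat_dim.
        self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1)
        # Predict a single-channel target mask from the mask features.
        self.mask_pred = nn.Conv2d(feat_dim, 1, kernel_size=1)
        # Predict a 4-channel box map (l, t, r, b offsets) from fused features.
        self.box_pred = nn.Conv2d(feat_dim, 4, kernel_size=3, padding=1)

    def forward(self, visual_feat, mask_feat, gt_mask=None):
        # visual_feat, mask_feat: (B, C, H, W); gt_mask: (B, 1, H, W) in {0, 1}.
        fused = self.fuse(torch.cat([visual_feat, mask_feat], dim=1))
        box_map = self.box_pred(F.relu(fused))
        mask_logits = self.mask_pred(mask_feat)
        mask_loss = None
        if gt_mask is not None:
            # Pixel-level supervision of the mask features (training only).
            mask_loss = F.binary_cross_entropy_with_logits(mask_logits, gt_mask)
        return box_map, mask_logits, mask_loss

# Usage with random tensors standing in for backbone / diffusion outputs.
head = MaskDrivenHead(feat_dim=256)
v = torch.randn(2, 256, 16, 16)                    # visual features
m = torch.randn(2, 256, 16, 16)                    # diffusion mask features
gt = (torch.rand(2, 1, 16, 16) > 0.5).float()      # ground-truth masks
box_map, mask_logits, loss = head(v, m, gt)
```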