Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors

Shiyin Dong; Mingrui Zhu; Kun Cheng; Nannan Wang; Xinbo Gao

doi:10.24963/ijcai.2024/82

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence

Main Track. Pages 740-748. https://doi.org/10.24963/ijcai.2024/82

The remarkable success of diffusion models in image generation has spurred efforts to extend them beyond generative tasks. However, a unified approach for applying diffusion models to visual perception tasks with diverse semantic granularity requirements is still lacking. We aim to establish a unified visual perception framework that capitalizes on the potential synergies between generative and discriminative models. In this paper, we propose Vermouth, a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an Adapted-Expert providing discriminative priors. Comprehensive investigations reveal characteristics of Vermouth, such as the varying granularity of perception concealed in latent variables at distinct time steps and at various U-Net stages. We emphasize that a heavyweight or intricate decoder is not necessary to transform diffusion models into potent representation learners. Extensive comparative evaluations against tailored discriminative models show the efficacy of our approach on zero-shot sketch-based image retrieval (ZS-SBIR), few-shot classification, and open-vocabulary (OV) semantic segmentation tasks. The promising results demonstrate the potential of diffusion models as strong representation learners that furnish informative and robust visual representations.

Keywords:

Computer Vision: CV: Representation learning

Computer Vision: CV: Recognition (object detection, categorization)

Computer Vision: CV: Segmentation

Computer Vision: CV: Transfer, low-shot, semi- and un-supervised learning