OSIC: A New One-Stage Image Captioner Coined

Bo Wang, Zhao Zhang, Mingbo Zhao, Xiaojie Jin, Mingliang Xu, Meng Wang

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 1353-1361. https://doi.org/10.24963/ijcai.2024/150

Mainstream image captioning models are usually two-stage captioners: region features are encoded by a pre-trained detector and then fed into a language model to generate captions. However, this two-stage procedure introduces a task-based information gap that degrades performance, because region features optimized for the detection task are suboptimal representations and cannot provide all the information needed for subsequent caption generation. Moreover, region features are usually taken from the last layer of the detector, which loses the local details of images. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms images into descriptive sentences in a single stage, eliminating the information gap. Specifically, to obtain rich features, multi-level features are captured by a Swin Transformer and then fed into a novel dynamic multi-sight embedding module that exploits both the global structure and the local texture of input images. To enhance the global modeling capacity of the visual encoder, we propose a new dual-dimensional refining module that models feature interactions non-locally. As a result, OSIC directly obtains rich semantic information that improves captioning. Extensive comparisons on the MS-COCO, Flickr8K, and Flickr30K benchmarks verify the superior performance of our method.
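The abstract does not give implementation details, but the core idea of the multi-sight embedding, namely fusing multi-level Swin Transformer features with input-dependent weights, can be sketched as follows. This is a minimal, hypothetical PyTorch sketch, not the authors' code: the class name MultiSightEmbedding, the softmax gating design, and the stage dimensions (192/384/768, typical of a Swin-T backbone) are assumptions for illustration.

import torch
import torch.nn as nn

class MultiSightEmbedding(nn.Module):
    """Hypothetical sketch: project token maps from several encoder stages
    ("sights") to a shared width, then fuse them with weights computed
    from the input itself (the "dynamic" part)."""

    def __init__(self, in_dims, embed_dim=512):
        super().__init__()
        # one projection per encoder stage, mapping C_i -> embed_dim
        self.proj = nn.ModuleList(nn.Linear(d, embed_dim) for d in in_dims)
        # scores each sight from its pooled summary (assumed gating design)
        self.gate = nn.Linear(embed_dim, 1)

    def forward(self, feats):
        # feats: list of (B, N_i, C_i) token maps from different stages
        tokens = [p(f) for f, p in zip(feats, self.proj)]            # (B, N_i, D)
        pooled = torch.stack([t.mean(dim=1) for t in tokens], dim=1) # (B, S, D)
        # input-conditioned weights over the S sights
        w = torch.softmax(self.gate(pooled), dim=1)                  # (B, S, 1)
        fused = [w[:, i].unsqueeze(1) * t for i, t in enumerate(tokens)]
        return torch.cat(fused, dim=1)  # (B, sum N_i, D) tokens for the decoder

# toy multi-level features with Swin-T-like shapes for a 224x224 input
feats = [torch.randn(2, 784, 192),
         torch.randn(2, 196, 384),
         torch.randn(2, 49, 768)]
embed = MultiSightEmbedding(in_dims=[192, 384, 768])
print(embed(feats).shape)  # torch.Size([2, 1029, 512])

Because shallow stages keep fine local texture while deep stages carry global structure, weighting and concatenating all stages (rather than using only the last layer, as a detector would) is one plausible way to realize the paper's stated goal of exploiting both kinds of information in a single stage.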
Keywords:
Computer Vision: CV: Machine learning for vision
Computer Vision: CV: Scene analysis and understanding
Computer Vision: CV: Vision, language and reasoning
Natural Language Processing: NLP: Language generation