Show and Tell More: Topic-Oriented Multi-Sentence Image Captioning

Yuzhao Mao, Chang Zhou, Xiaojie Wang, Ruifan Li

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Main track. Pages 4258-4264. https://doi.org/10.24963/ijcai.2018/592

Image captioning aims to generate textual descriptions for images. Most previous work generates a single-sentence description for each image. However, a picture is worth a thousand words: a single sentence can hardly give a complete view of an image, even when written by a human. In this paper, we propose a novel Topic-Oriented Multi-Sentence (TOMS) captioning model, which generates multiple topic-oriented sentences to describe an image. Unlike object instances or attributes, topics mined by latent Dirichlet allocation (LDA) reflect the hidden thematic structure of an image's reference sentences. In our model, each topic is integrated into a caption generator through a Fusion Gate Unit (FGU), which guides the generation of a sentence toward that topic's perspective. Together, the sentences generated for different topics provide a complete description of an image. Experimental results on both sentence and paragraph datasets demonstrate the effectiveness of TOMS in terms of topical consistency and descriptive completeness.
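
The abstract names two technical pieces: topic mining with LDA over an image's reference sentences, and a Fusion Gate Unit that injects a topic into the caption decoder. The sketch below is a rough, non-authoritative illustration of how these could fit together; the paper's exact FGU equations are not reproduced here, and the sentence list, the sigmoid-gate formulation, and all layer sizes are illustrative assumptions.

    import torch
    import torch.nn as nn
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Step 1 (assumed setup): mine topic distributions from reference
    # sentences with LDA. The sentences and topic count are illustrative.
    refs = [
        "a man rides a wave on a surfboard",
        "a surfer balances on a blue surfboard in the ocean",
        "two dogs play with a frisbee on the grass",
        "a dog jumps to catch a frisbee in a park",
    ]
    counts = CountVectorizer(stop_words="english").fit_transform(refs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    topic_dist = lda.fit_transform(counts)  # (num_sentences, num_topics)

    # Step 2 (hypothetical FGU): a sigmoid gate mixes a projected topic
    # vector into the decoder hidden state element-wise. The paper's
    # actual formulation may differ.
    class FusionGateUnit(nn.Module):
        def __init__(self, hidden_dim: int, topic_dim: int):
            super().__init__()
            self.topic_proj = nn.Linear(topic_dim, hidden_dim)
            self.gate = nn.Linear(hidden_dim * 2, hidden_dim)

        def forward(self, hidden, topic):
            t = torch.tanh(self.topic_proj(topic))  # topic in hidden space
            g = torch.sigmoid(self.gate(torch.cat([hidden, t], dim=-1)))
            return g * hidden + (1.0 - g) * t       # gated mixture

    fgu = FusionGateUnit(hidden_dim=512, topic_dim=2)
    h = torch.randn(1, 512)                                # decoder hidden state
    z = torch.tensor(topic_dist[:1], dtype=torch.float32)  # one topic vector
    fused = fgu(h, z)                                      # topic-conditioned state

In the full model, one such fused state would condition the caption decoder at each step, so that running the decoder once per mined topic yields one sentence per topic perspective.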
Keywords:
Natural Language Processing: Natural Language Generation
Computer Vision: Recognition: Detection, Categorization, Indexing, Matching, Retrieval, Semantic Interpretation
Computer Vision: Language and Vision