Deep Learning for Video Captioning: A Review

Shaoxiang Chen, Ting Yao, Yu-Gang Jiang

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Survey track. Pages 6283-6290. https://doi.org/10.24963/ijcai.2019/877

Deep learning has recently achieved great success in solving specific artificial intelligence problems, with substantial progress made in Computer Vision (CV) and Natural Language Processing (NLP). As a connection between the two worlds of vision and language, video captioning is the task of producing a natural-language utterance (usually a sentence) that describes the visual content of a video. The task naturally decomposes into two sub-tasks. One is video encoding, which builds a thorough understanding of the video and learns a visual representation. The other is caption generation, which decodes the learned representation into a sequential sentence, word by word. In this survey, we first formulate the problem of video captioning, then review state-of-the-art methods categorized by their emphasis on vision or language, followed by a summary of standard datasets and representative approaches. Finally, we highlight the challenges that are not yet fully understood in this task and present future research directions.
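To make the encoder-decoder decomposition described above concrete, the following is a minimal illustrative sketch (not the authors' implementation): a video encoder mean-pools per-frame CNN features into a visual representation, and a recurrent decoder emits the caption word by word. All names (VideoCaptioner, frame_feats, the dimensions, etc.) are assumptions chosen for illustration.

import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        # Encoder: project per-frame features, then mean-pool over time.
        self.encode = nn.Linear(feat_dim, hidden_dim)
        # Decoder: word embedding + GRU + projection over the vocabulary.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim); captions: (batch, seq_len) word ids.
        video_repr = self.encode(frame_feats).mean(dim=1)   # (batch, hidden_dim)
        h0 = video_repr.unsqueeze(0)                        # visual representation initializes the decoder state
        emb = self.embed(captions)                          # (batch, seq_len, embed_dim)
        hidden, _ = self.gru(emb, h0)                       # decode the sentence word by word
        return self.out(hidden)                             # (batch, seq_len, vocab_size) logits

# Toy usage: 2 videos, 8 frames of 2048-d features each, 5-word captions.
model = VideoCaptioner()
logits = model(torch.randn(2, 8, 2048), torch.randint(0, 10000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 10000])

Real systems surveyed in the paper replace the mean-pooling encoder with temporal or attention-based models and add richer decoding strategies, but the two-stage structure remains the same.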
Keywords:
Computer Vision: Language and Vision
Natural Language Processing: Natural Language Generation
Machine Learning: Deep Learning
Computer Vision: Video: Events, Activities and Surveillance