Retrieval Guided Music Captioning via Multimodal Prefixes

Nikita Srivatsan, Ke Chen, Shlomo Dubnov, Taylor Berg-Kirkpatrick

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, AI, Arts & Creativity track. Pages 7762-7770. https://doi.org/10.24963/ijcai.2024/859

In this paper we put forward a new approach to music captioning, the task of automatically generating natural language descriptions for songs. These descriptions are useful both for categorization and analysis, and also from an accessibility standpoint, as they form an important component of closed captions for video content. Our method supplements an audio encoding with a retriever, allowing the decoder to condition on a multimodal signal: both the audio of the song itself and a candidate caption identified by a nearest-neighbor system. This lets us retain the advantages of a retrieval-based approach while also allowing for the flexibility of a generative one. We evaluate this system on a dataset of 200k music-caption pairs scraped from Audiostock, a royalty-free music platform, and on MusicCaps, a dataset of 5.5k pairs. We demonstrate significant improvements over prior systems across both automatic metrics and human evaluation.
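The core idea described above, retrieving a candidate caption via nearest-neighbor search over audio embeddings and prepending it, together with the audio encoding, as a prefix for the decoder, can be sketched as follows. This is an illustrative toy rather than the paper's implementation: the function names, toy embeddings, and token lists are all assumptions.

```python
import math


def nearest_neighbor_caption(audio_emb, bank):
    """Return the caption whose stored audio embedding has the highest
    cosine similarity to the query embedding.

    `bank` is a list of (embedding, caption) pairs standing in for the
    retrieval index; all names here are illustrative.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv + 1e-9)

    return max(bank, key=lambda pair: cosine(audio_emb, pair[0]))[1]


def build_prefix(audio_prefix_tokens, retrieved_caption_tokens):
    # The decoder conditions on a multimodal prefix: the audio encoding
    # followed by the retrieved candidate caption.
    return audio_prefix_tokens + retrieved_caption_tokens
```

In the full system the prefix would consist of continuous audio embeddings concatenated with the retrieved caption's token embeddings, but the list concatenation above captures the conditioning structure.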
Keywords:
Application domains: Music and sound
Methods and resources: Machine learning, deep learning, neural models, reinforcement learning