Learning to Recognize Transient Sound Events using Attentional Supervision

Szu-Yu Chou, Jyh-Shing Jang, Yi-Hsuan Yang

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Main track. Pages 3336-3342. https://doi.org/10.24963/ijcai.2018/463

Making sense of the surrounding context and ongoing events through acoustic cues as well as visual inputs is critical for various AI applications. This paper presents an attempt to learn a neural network model that recognizes more than 500 different sound events from the audio part of user-generated videos (UGV). Aside from the large number of categories and the diverse recording conditions found in UGV, the task is challenging because a sound event may occur only for a short period of time in a video clip. Our model specifically tackles this issue by combining a main subnet that aggregates information from the entire clip to make clip-level predictions, and a supplementary subnet that examines each short segment of the clip for segment-level predictions. As the labeled data available for model training are typically on the clip level, the latter subnet learns to pay attention to segments selectively to facilitate attentional segment-level supervision. We call our model the M&mnet, for it leverages both “M”acro (clip-level) supervision and “m”icro (segment-level) supervision derived from the macro one. Our experiments show that M&mnet works remarkably well for recognizing sound events, establishing a new state-of-the-art on the DCASE17 and AudioSet data sets. Qualitative analysis suggests that our model exhibits strong gains for short events. In addition, we show that the micro subnet is computationally light and that multiple micro subnets can be used to better exploit information at different temporal scales.
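The combination of clip-level and attention-weighted segment-level predictions described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the layer types, the form of the attention, or the loss, so the linear classifiers, the per-class softmax attention over time, the mean pooling in the macro branch, and the summed binary cross-entropy are all assumptions made for illustration, as are the function and parameter names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def macro_predict(features, w_macro):
    # Assumed macro branch: average the (T, D) segment features over time,
    # then apply a linear multi-label classifier to get (C,) clip scores.
    clip_embedding = features.mean(axis=0)
    return sigmoid(clip_embedding @ w_macro)

def micro_predict(features, w_cls, w_att):
    # Assumed micro branch: per-segment class probabilities (T, C) and
    # per-class attention weights (T, C) that sum to 1 over the T segments.
    seg_probs = sigmoid(features @ w_cls)
    attention = softmax(features @ w_att, axis=0)
    # Attention-weighted pooling turns segment predictions into a clip-level
    # prediction, so clip-level labels can supervise the segment branch.
    clip_probs = (attention * seg_probs).sum(axis=0)
    return clip_probs, seg_probs, attention

def binary_cross_entropy(probs, labels, eps=1e-7):
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-(labels * np.log(probs)
                   + (1.0 - labels) * np.log(1.0 - probs)).mean())

def combined_loss(features, labels, w_macro, w_cls, w_att):
    # Assumed objective: both branches are trained against the same
    # clip-level labels, and the two losses are simply added.
    macro_probs = macro_predict(features, w_macro)
    micro_probs, _, _ = micro_predict(features, w_cls, w_att)
    return (binary_cross_entropy(macro_probs, labels)
            + binary_cross_entropy(micro_probs, labels))

# Example with random data: 10 segments, 8-dim features, 5 classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(10, 8))
labels = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
w_macro, w_cls, w_att = (rng.normal(size=(8, 5)) for _ in range(3))
loss = combined_loss(features, labels, w_macro, w_cls, w_att)
```

Because the attention weights form a convex combination over segments, a class that is active in only one short segment can still yield a high clip-level score if the attention concentrates on that segment, which is the intuition behind the gains for short events reported above.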
Machine Learning: Deep Learning
Machine Learning Applications: Other Applications
Machine Learning Applications: Applications of Supervised Learning