Learning to Recognize Transient Sound Events using Attentional Supervision
Learning to Recognize Transient Sound Events using Attentional Supervision
Szu-Yu Chou, Jyh-Shing Jang, Yi-Hsuan Yang
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Main track. Pages 3336-3342.
https://doi.org/10.24963/ijcai.2018/463
Making sense of the surrounding context and ongoing
events through not only the visual inputs but
also acoustic cues is critical for various AI applications.
This paper presents an attempt to learn
a neural network model that recognizes more than
500 different sound events from the audio part of
user generated videos (UGV). Aside from the large
number of categories and the diverse recording conditions
found in UGV, the task is challenging because
a sound event may occur only for a short
period of time in a video clip. Our model specifically
tackles this issue by combining a main subnet
that aggregates information from the entire clip
to make clip-level predictions, and a supplementary
subnet that examines each short segment of the clip
for segment-level predictions. As the labeled data
available for model training are typically on the clip
level, the latter subnet learns to pay attention to segments
selectively to facilitate attentional segment-level
supervision. We call our model the M&mnet,
for it leverages both “M”acro (clip-level) supervision
and “m”icro (segment-level) supervision derived
from the macro one. Our experiments show
that M&mnet works remarkably well for recognizing
sound events, establishing a new state-of-theart
for DCASE17 and AudioSet data sets. Qualitative
analysis suggests that our model exhibits strong
gains for short events. In addition, we show that the
micro subnet is computationally light and we can
use multiple micro subnets to better exploit information
in different temporal scales.
Keywords:
Machine Learning: Deep Learning
Machine Learning Applications: Other Applications
Machine Learning Applications: Applications of Supervised Learning