SynthNet: Learning to Synthesize Music End-to-End

Florin Schimbinschi, Christian Walder, Sarah M. Erfani, James Bailey

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Main track. Pages 3367-3374. https://doi.org/10.24963/ijcai.2019/467

We consider the problem of learning a mapping directly from annotated music to waveforms, bypassing traditional single-note synthesis. We propose a specific architecture based on WaveNet, a convolutional autoregressive generative model designed for text-to-speech. We investigate the representations learned by these models on music and conclude that mappings between musical notes and instrument timbre can be learned directly from raw audio coupled with the musical score, in binary piano roll format. Our model requires minimal training data (9 minutes), is substantially better in quality, and converges 6 times faster than strong baselines in the form of powerful text-to-speech models. The quality of the generated waveforms (generation accuracy) is sufficiently high that they are almost identical to the ground truth. Our evaluations are based on both the RMSE of the Constant-Q transform and mean opinion scores from human subjects. We validate our work using 7 distinct synthetic instrument timbres and real cello music, and provide visualizations and links to all generated audio.
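To make the WaveNet-style conditioning mentioned above concrete, the sketch below shows a generic gated residual block with a dilated causal convolution, locally conditioned on a piano-roll signal upsampled to the audio sample rate. It is a simplified illustration of the general technique, not SynthNet's exact layer; the class name, channel sizes, and dilation are illustrative assumptions.

```python
# Sketch: WaveNet-style gated residual block with piano-roll conditioning.
# Simplified illustration of the general idea, not SynthNet's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    def __init__(self, channels, cond_channels, dilation):
        super().__init__()
        self.dilation = dilation
        # Dilated convolutions producing filter and gate activations.
        self.filter_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, 2, dilation=dilation)
        # 1x1 convolutions projecting the piano-roll conditioning signal.
        self.filter_cond = nn.Conv1d(cond_channels, channels, 1)
        self.gate_cond = nn.Conv1d(cond_channels, channels, 1)
        self.residual = nn.Conv1d(channels, channels, 1)

    def forward(self, x, cond):
        # Left-pad so the convolution never sees future samples (causality).
        pad = (self.dilation, 0)
        f = self.filter_conv(F.pad(x, pad)) + self.filter_cond(cond)
        g = self.gate_conv(F.pad(x, pad)) + self.gate_cond(cond)
        out = torch.tanh(f) * torch.sigmoid(g)   # gated activation unit
        return x + self.residual(out)            # residual connection

# Usage: 1 second of features at 16 kHz, conditioned on an upsampled binary piano roll.
block = GatedResidualBlock(channels=64, cond_channels=128, dilation=2)
audio_feats = torch.randn(1, 64, 16000)
piano_roll = torch.randn(1, 128, 16000)
print(block(audio_feats, piano_roll).shape)  # torch.Size([1, 64, 16000])
```

As a rough illustration of the evaluation criterion, the following sketch computes the RMSE between the Constant-Q transforms of a generated waveform and its ground-truth counterpart using librosa. The function name, sample rate, and use of a log-magnitude scale are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: RMSE between Constant-Q transforms of two waveforms.
# Parameter choices (sample rate, dB scaling) are illustrative assumptions.
import numpy as np
import librosa

def cqt_rmse(generated, reference, sr=16000):
    """Root-mean-square error between log-magnitude CQTs of two signals."""
    # Trim both signals to a common length before comparison.
    n = min(len(generated), len(reference))
    cqt_gen = np.abs(librosa.cqt(generated[:n], sr=sr))
    cqt_ref = np.abs(librosa.cqt(reference[:n], sr=sr))
    # Compare on a dB scale, since magnitudes span several orders of magnitude.
    log_gen = librosa.amplitude_to_db(cqt_gen)
    log_ref = librosa.amplitude_to_db(cqt_ref)
    return float(np.sqrt(np.mean((log_gen - log_ref) ** 2)))

# Usage: compare a generated clip against the ground-truth recording.
gen, sr = librosa.load("generated.wav", sr=16000)
ref, _ = librosa.load("ground_truth.wav", sr=16000)
print(f"CQT RMSE: {cqt_rmse(gen, ref, sr):.3f}")
```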
Keywords:
Machine Learning: Time-series; Data Streams
Multidisciplinary Topics and Applications: Art and Music
Machine Learning: Deep Learning
Machine Learning: Learning Generative Models
Machine Learning: Interpretability