Generating More Audios for End-to-End Spoken Language Understanding

Xuxin Cheng, Yuexian Zou

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 6234-6242. https://doi.org/10.24963/ijcai.2024/689

End-to-end spoken language understanding (SLU) aims to directly capture the comprehensive semantics of a given spoken utterance without generating any transcript. Since transcripts are not always available, Textless SLU, which eliminates the need for transcripts, is attracting increasing attention; however, it often does not perform as well as SLU models trained with transcripts. In this paper, we focus on scenarios where transcripts are not available and propose a framework, GMA-SLU, that generates more audios according to the labels. To alleviate the modality gap between text and audio, two language models are developed with discrete tokens serving as a bridge: the first language model uses labels to generate semantic tokens, and the second takes these semantic tokens together with the acoustic tokens of the source audios to generate synthetic audios. Experiments are conducted on the monolingual SLU dataset SLURP and the multilingual SLU dataset MINDS-14. Experimental results show that our method outperforms the previous best Textless End-to-end SLU models and achieves performance comparable to models trained with the assistance of the corresponding transcripts.
Keywords:
Natural Language Processing: NLP: Dialogue and interactive systems
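The abstract describes a two-stage token pipeline: one language model maps labels to semantic tokens, and a second maps those semantic tokens plus the source audio's acoustic tokens to the acoustic tokens of a synthetic audio. The sketch below illustrates that flow only; the `TokenLM` module, vocabulary sizes, greedy decoding, and placeholder inputs are all hypothetical assumptions, not the authors' GMA-SLU implementation.

```python
# Illustrative sketch of a two-stage discrete-token pipeline, assuming
# toy autoregressive LMs. Not the paper's actual models or hyperparameters.
import torch
import torch.nn as nn

class TokenLM(nn.Module):
    """Tiny autoregressive LM over discrete tokens, conditioned on a prefix."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    @torch.no_grad()
    def generate(self, prefix, steps):
        tokens = prefix  # (1, T) conditioning token ids
        for _ in range(steps):
            h, _ = self.rnn(self.embed(tokens))
            next_tok = self.head(h[:, -1]).argmax(-1, keepdim=True)  # greedy
            tokens = torch.cat([tokens, next_tok], dim=1)
        return tokens[:, prefix.size(1):]  # return only newly generated tokens

# Stage 1: labels -> semantic tokens (bridging the text/audio modality gap).
label_lm = TokenLM(vocab_size=1024)
label_ids = torch.tensor([[3, 17]])             # e.g. intent/slot label ids
semantic = label_lm.generate(label_ids, steps=50)

# Stage 2: semantic tokens + acoustic tokens of a source audio
# (e.g. from a neural audio codec) -> acoustic tokens of a synthetic audio.
acoustic_lm = TokenLM(vocab_size=2048)
src_acoustic = torch.randint(0, 2048, (1, 80))  # placeholder codec tokens
prefix = torch.cat([semantic, src_acoustic], dim=1)
synthetic_acoustic = acoustic_lm.generate(prefix, steps=80)
# A codec decoder would then turn `synthetic_acoustic` back into a waveform,
# yielding an additional labeled audio for training the SLU model.
```

In this reading, discrete tokens act as a shared interface: labels never have to be rendered as text transcripts, since the semantic tokens produced in stage one condition the acoustic generation in stage two directly.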