Contextualized Speech Recognition: Rethinking Second-Pass Rescoring with Generative Large Language Models

Yixuan Tang, Anthony K. H. Tung

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Main Track. Pages 6478-6485. https://doi.org/10.24963/ijcai.2024/716

Automatic Speech Recognition (ASR) systems have advanced considerably in recent years. Contextualized ASR requires recognizing speech not as isolated utterances but within the broader context in which it occurs. Conventional approaches typically employ a second-pass paradigm to re-rank initial transcriptions, but re-ranking risks propagating errors across the candidate hypotheses, compromising recognition accuracy. In this study, we introduce a framework that departs from typical second-pass rescoring: given the n-best hypotheses, we prompt a large language model to perform contextualized second-pass generation. Beyond pursuing higher accuracy, we explore the performance attainable without substantially altering the underlying pre-trained speech and language models. We evaluate the proposed paradigm through zero-shot prompting and low-rank adaptation (LoRA) fine-tuning. On the multi-accent spoken reading comprehension benchmark SQuAD-SRC, both the prompted and the fine-tuned models outperform the 1-best ASR hypothesis, achieving relative Word Error Rate (WER) reductions of 13.6% and 45.9%, respectively. These results suggest that the proposed approach improves both transcription accuracy and contextual understanding.
Keywords: Natural Language Processing: NLP: Speech
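
To make the generation paradigm concrete, the sketch below shows how the n-best ASR hypotheses and the surrounding passage might be folded into a single prompt for a causal language model, which then generates the final transcription. This is a minimal illustration assuming a Hugging Face causal LM; the model name ("gpt2" as a stand-in), prompt template, and decoding settings are assumptions for illustration, not the authors' exact configuration.

# A minimal sketch of contextualized second-pass generation: the n-best
# hypotheses and the context passage are assembled into one prompt, and
# a causal LM generates a corrected transcription. The model, prompt
# wording, and decoding settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def second_pass_generate(context: str, nbest: list[str], max_new_tokens: int = 48) -> str:
    """Generate one transcription conditioned on context and n-best hypotheses."""
    candidates = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    prompt = (
        f"Passage: {context}\n"
        f"Candidate transcriptions of the utterance:\n{candidates}\n"
        f"Corrected transcription:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy decoding for determinism
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the continuation, dropping the prompt tokens.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

For the fine-tuned variant, low-rank adaptation can be attached to the same model so that only a small set of adapter weights is trained. A hypothetical setup with the peft library (the rank and target modules below are assumptions, not the paper's reported hyperparameters) might look like:

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)  # only the LoRA parameters remain trainable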