Cross-Talk Reduction

Zhong-Qiu Wang; Anurag Kumar; Shinji Watanabe

doi:10.24963/ijcai.2024/572

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence

Main Track. Pages 5171-5180. https://doi.org/10.24963/ijcai.2024/572

While far-field multi-talker mixtures are recorded, each speaker can wear a close-talk microphone so that close-talk mixtures can be recorded at the same time. Although each close-talk mixture has a high signal-to-noise ratio (SNR) for the wearer's speech, it has a very limited range of applications, as it also contains significant cross-talk speech from other speakers and is not clean enough. In this context, we propose a novel task named cross-talk reduction (CTR), which aims at reducing cross-talk speech, and a novel solution named CTRnet, which is based on unsupervised or weakly-supervised neural speech separation. In unsupervised CTRnet, close-talk and far-field mixtures are stacked as input for a DNN to estimate the close-talk speech of each speaker. It is trained in an unsupervised, discriminative way such that the DNN estimate for each speaker can be linearly filtered to cancel out the speaker's cross-talk speech captured at other microphones. In weakly-supervised CTRnet, we assume the availability of each speaker's activity timestamps during training, and leverage them to improve the training of unsupervised CTRnet. Evaluation results on a simulated two-speaker CTR task and on a real-recorded conversational speech separation and recognition task show the effectiveness and potential of CTRnet.
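
The training idea described above, that a good estimate for one speaker can be linearly filtered to account for that speaker's cross-talk at another microphone, can be illustrated with a toy example. The sketch below is not the CTRnet implementation: it works in the time domain on random signals, replaces the DNN with hand-picked estimates, and fits a short FIR filter by least squares. The function names (`delayed_copies`, `cross_talk_residual`), the filter length, and the cross-talk path `h` are all assumptions made for illustration.

```python
import numpy as np

def delayed_copies(x, taps):
    """Stack x delayed by 0..taps-1 samples as columns (a convolution matrix)."""
    n = len(x)
    cols = [np.concatenate([np.zeros(d), x[:n - d]]) for d in range(taps)]
    return np.stack(cols, axis=1)

def cross_talk_residual(mixture, own_est, other_est, taps=8):
    """Fit an FIR filter to other_est so that it explains whatever own_est
    leaves unexplained in the mixture. Returns (filter, residual energy)."""
    A = delayed_copies(other_est, taps)
    target = mixture - own_est
    g, *_ = np.linalg.lstsq(A, target, rcond=None)
    residual = target - A @ g
    return g, float(np.sum(residual ** 2))

rng = np.random.default_rng(0)
n = 2000
s1 = rng.standard_normal(n)  # stand-in for speaker 1's close-talk speech
s2 = rng.standard_normal(n)  # stand-in for speaker 2's close-talk speech

# Hypothetical cross-talk path from speaker 2 to speaker 1's microphone.
h = np.array([0.0, 0.3, 0.1])
y1 = s1 + np.convolve(s2, h)[:n]  # close-talk mixture at microphone 1

# Accurate estimates: the filtered estimate of speaker 2 cancels the cross-talk.
g_good, loss_good = cross_talk_residual(y1, s1, s2)

# Poor estimate for speaker 1 (half of it is leaked speaker 2): no linear
# filter of s2 can account for the missing part of s1, so the residual is large.
_, loss_bad = cross_talk_residual(y1, 0.5 * s1 + 0.5 * s2, s2)

print(f"residual energy, accurate estimates: {loss_good:.3e}")
print(f"residual energy, poor estimates:     {loss_bad:.3e}")
```

In this toy setting the residual energy acts as a label-free training signal: it is small only when the per-speaker estimates are consistent with the recorded mixture. A loss of this kind would presumably be accumulated over all close-talk and far-field microphones; the abstract does not give those details, so they are left out here.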

Keywords:

Machine Learning: ML: Unsupervised learning

Machine Learning: ML: Weakly supervised learning

Machine Learning: ML: Applications

Natural Language Processing: NLP: Speech