Gated Cross-Attention for Universal Speaker Extraction: Pay attention to the Speaker’s Presence
Current speaker extraction models perform well at extracting target speech from highly overlapped multi-talker speech. In real-world applications, however, multi-talker speech is only sparsely overlapped and the target speaker may be absent from the mixture, which makes it difficult for these models to extract the desired speech. Universal speaker extraction has been proposed to address this problem; it evaluates the quality of both the estimated speech and the estimated silence. However, existing universal speaker extraction models are not designed to distinguish whether the target speaker is present or absent. In this paper, we propose a gated cross-attention network for universal speaker extraction. In our model, the cross-attention mechanism learns the correlation between the target speaker and the speech mixture in order to determine whether the target speaker is present. Based on this correlation, the gate mechanism makes the model focus on extracting speech when the target is present and filters out the features when the target is absent. In addition, we propose a joint loss function that optimizes the network in both the target-present and target-absent scenarios. We conducted experiments on the LibriMix dataset under various scenarios and evaluated performance in terms of speech quality and speaker extraction error rate. The experimental results show that the proposed method outperforms the baselines in all scenarios.
Email Address of Submitting Author: zhangyrease@qq.com
ORCID of Submitting Author: 0000-0002-5436-9570
Submitting Author's Institution: Nanjing University of Aeronautics and Astronautics, College of Computer Science and Technology
Submitting Author's Country