Gated Cross-Attention for Universal Speaker Extraction: Pay attention to the Speaker’s Presence
  • Yiru Zhang,
  • Zeke Li,
  • Bijing Liu,
  • Haiwei Fan,
  • Yong Yang,
  • Qun Yang
Yiru Zhang
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics

Corresponding Author: [email protected]

Abstract

Current speaker extraction models perform well at extracting target speech from highly overlapped multi-talker speech. In real-world applications, however, multi-talker speech is sparsely overlapped and the target speaker may be absent from the mixture, which makes it difficult for such models to extract the desired speech. Universal speaker extraction has been proposed to address this problem by evaluating the quality of both the estimated speech signals and the estimated silence. However, existing universal speaker extraction models are not designed to distinguish whether the target speaker is present or absent. In this paper, we propose a gated cross-attention network for universal speaker extraction. In our model, the cross-attention mechanism learns the correlation between the target speaker and the speech mixture in order to determine whether the target speaker is present. Based on this correlation, the gate mechanism makes the model focus on extracting speech when the target is present and filters out the features when the target is absent. We also propose a joint loss function that optimizes the network in both target-present and target-absent scenarios. We conducted experiments on the LibriMix dataset under various scenarios and evaluated performance in terms of speech quality and speaker extraction error rate. The experimental results show that the proposed method outperforms the baselines in all scenarios.
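
To make the idea of gating on cross-attention concrete, the following is a minimal NumPy sketch and not the architecture of the paper, whose details the abstract does not give. It assumes that mixture frames act as queries and enrollment embeddings of the target speaker act as keys and values, and that a per-frame sigmoid gate is computed from the concatenation of each frame with its attended speaker context. The function name `gated_cross_attention`, the weight matrices `w_q`, `w_k`, `w_v`, `w_g`, the bias `b_g`, and all shapes are hypothetical choices made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_cross_attention(mixture, speaker, w_q, w_k, w_v, w_g, b_g):
    """Illustrative gated cross-attention (assumed design, not the paper's).

    mixture: (T, D) frame features of the speech mixture (queries).
    speaker: (S, D) enrollment features of the target speaker (keys/values).
    Returns the gated mixture features (T, D) and the gate values (T, 1).
    """
    q = mixture @ w_q                          # (T, D)
    k = speaker @ w_k                          # (S, D)
    v = speaker @ w_v                          # (S, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, S) frame-speaker correlation
    attended = softmax(scores, axis=-1) @ v    # (T, D) speaker context per frame
    # Assumed gate: one value in [0, 1] per frame, computed from the frame
    # and its attended speaker context.
    gate_in = np.concatenate([mixture, attended], axis=-1)  # (T, 2D)
    gate = sigmoid(gate_in @ w_g + b_g)        # (T, 1)
    return gate * mixture, gate


# Usage with random placeholder values (no trained weights are implied).
rng = np.random.default_rng(0)
T, S, D = 6, 3, 8
mixture = rng.standard_normal((T, D))
speaker = rng.standard_normal((S, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
w_g = rng.standard_normal((2 * D, 1)) * 0.1
gated, gate = gated_cross_attention(mixture, speaker, w_q, w_k, w_v, w_g, 0.0)
```

In this sketch a gate value near 1 passes a frame's features through to later extraction layers, while a value near 0 suppresses them, which corresponds to the target-absent case in which the desired output is silence.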