Addressing Human Speech Natures in Single-Channel Speaker Extraction Network
Zhaohui Yuan
East China Jiaotong University

Corresponding Author: [email protected]


Single-channel speaker extraction aims to recover the clean speech of a target speaker from a multi-talker mixture. Typically, an auxiliary reference network extracts voiceprint features from the target speaker's reference speech and supplies them as a conditioning input to the main extraction network, which improves the robustness of the extraction. However, existing work has paid little attention to the time-series characteristics of speech signals, resulting in a mismatch between the receptive field and the signal features. In addition, commonly used speaker extraction models do not consider the spectral distribution characteristics of a speaker's speech. In this paper, we address these issues by extending the ConvNeXt model from image processing into a TD-ConvNeXt model and combining it with TCN blocks to form the main body of our speech extraction network. We also adapt the ConvNeXt model into a new Spk block for the auxiliary network, which learns speaker identity features from the reference speech as embedding vectors. With this approach, we achieve better speaker identification results without compromising the quality of speech extraction. We use multi-task learning to jointly train the extraction network and the auxiliary reference network. Extensive experiments verify that the proposed model significantly improves single-channel target speech extraction performance.
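To make the receptive-field point concrete, the sketch below computes how many input frames a stack of stride-1 dilated 1-D convolutions, such as a TCN stack, can see. The function name, the kernel size of 3, and the doubling dilation schedule are illustrative assumptions and are not taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in frames, of stacked stride-1 dilated 1-D convolutions."""
    rf = 1
    for d in dilations:
        # Each layer widens the view by (kernel_size - 1) * dilation frames.
        rf += (kernel_size - 1) * d
    return rf


# Hypothetical stack (an assumption): kernel size 3, dilations 1, 2, 4, ..., 128.
dilations = [2 ** i for i in range(8)]
print(receptive_field(3, dilations))  # prints 511
```

Comparing this number with the time span of the speech features of interest is one way to check whether a given configuration matches the signal.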