TechRxiv

Addressing Human Speech Natures in Single-Channel Speaker Extraction Network

preprint
posted on 2023-08-07, 15:51 authored by Zhaohui Yuan

Single-channel speaker extraction aims to recover the clean speech of a target speaker from a multi-talker speech mixture. Typically, an auxiliary reference network extracts voiceprint features from the target speaker's reference speech and feeds them to the main speech extraction network as a conditioning input, which improves the robustness of the extraction. However, current research has paid little attention to the time-series characteristics of speech signals, resulting in a mismatch between the receptive field and the signal features. Additionally, commonly used speaker extraction models do not consider the spectral distribution characteristics of a speaker's speech. In this paper, we address these issues by extending the ConvNeXt model from image processing to a TD-ConvNeXt model and combining it with TCN blocks to form the main body of our speech extraction network. We also rework the ConvNeXt model into a new Spk block for the auxiliary network, which learns speaker identity features from the reference speech as embedding vectors. With this approach, we achieve better speaker identification results without compromising the quality of speech extraction. We also use multi-task learning to jointly train the extraction network and the auxiliary reference network. Extensive experiments verify that the proposed model significantly improves single-channel target speech extraction performance.
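
To make the receptive-field point in the abstract concrete, the sketch below computes how many frames a stack of dilated 1-D convolutions (the usual building block of a TCN) can see. It uses the standard formula for stride-1 layers. The kernel size, dilation schedule, sampling rate, and encoder hop are illustrative assumptions and are not taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in frames, of stacked stride-1 dilated 1-D convolutions."""
    # Each layer widens the field by (kernel_size - 1) * dilation frames.
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical TCN stack: 8 layers, kernel size 3, dilation doubling per layer.
dilations = [2 ** i for i in range(8)]
frames = receptive_field(3, dilations)

# Hypothetical front end: 8 kHz audio with an encoder hop of 8 samples per frame.
hop_seconds = 8 / 8000
print(f"{frames} frames, about {frames * hop_seconds:.3f} s of audio")
```

Comparing this span with the duration of the speech structures of interest (for example, phonemes or syllables) is one way to check whether the receptive field matches the signal.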

Funding

The Province Science Foundation of Jiangxi, No. 20224BAB202030 and No. 20202ACBL202009

Email Address of Submitting Author

yuanzh@whu.edu.cn

ORCID of Submitting Author

yuanzh@whu.edu.cn

Submitting Author's Institution

East China Jiaotong University

Submitting Author's Country

  • China