When Speaker Recognition Meets Noisy Labels: Optimizations for Front-ends and Back-ends
Preprint posted on 28.12.2021, 20:56 by Lin Li, Fuchuan Tong, Qingyang Hong
A typical speaker recognition system involves two modules: a feature-extractor front-end and a speaker-identity back-end. Although deep neural networks have achieved superior performance for the front-end, their success depends on the availability of large-scale, correctly labeled datasets. Label noise is unavoidable in speaker recognition datasets, and it affects both the front-end and the back-end, degrading speaker recognition performance. In this paper, we first conduct comprehensive experiments to improve the understanding of the effects of label noise on both the front-end and back-end. Then, we propose a simple yet effective training paradigm and loss correction method to handle label noise for the front-end. We combine our proposed method with the recently proposed Bayesian estimation of PLDA for noisy labels, and the whole system shows strong robustness to label noise. Furthermore, we show two practical applications of the improved system: one corrects noisy labels based on an utterance’s chunk-level predictions, and the other algorithmically filters out high-confidence noisy samples within a dataset. By applying the second application to the NIST SRE04-10 dataset and verifying the filtered utterances through human validation, we find that approximately 1% of the SRE04-10 dataset consists of label errors.
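The abstract does not state the exact rule used to correct a label from chunk-level predictions. As a rough illustration only, the sketch below assumes a simple majority vote over per-chunk speaker predictions, with the given label overridden only when enough chunks agree on a different speaker. The function name `relabel_by_chunk_vote`, the `min_agreement` threshold, and its default value are hypothetical and are not taken from the paper.

```python
from collections import Counter


def relabel_by_chunk_vote(chunk_predictions, given_label, min_agreement=0.8):
    """Return (label, changed) for a single utterance.

    chunk_predictions: predicted speaker IDs, one per chunk (assumed input).
    given_label: the dataset's label for the utterance, possibly noisy.
    min_agreement: assumed fraction of chunks that must agree on another
        speaker before the given label is replaced.
    """
    if not chunk_predictions:
        # Nothing to vote on, so keep the original label.
        return given_label, False

    # Most frequent chunk-level prediction and how many chunks voted for it.
    top_label, votes = Counter(chunk_predictions).most_common(1)[0]
    agreement = votes / len(chunk_predictions)

    # Replace the label only when the chunks disagree with it confidently.
    if top_label != given_label and agreement >= min_agreement:
        return top_label, True
    return given_label, False


# Usage: 9 of 10 chunks point to "spk7", so the given label "spk2" is replaced.
label, changed = relabel_by_chunk_vote(["spk7"] * 9 + ["spk2"], "spk2")
print(label, changed)
```

A threshold on agreement keeps the rule conservative: utterances whose chunks disagree among themselves keep their original label rather than being relabeled on weak evidence.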
National Natural Science Foundation of China under Grants No. 61876160 and No. 62001405