On Training Targets for Deep Learning Approaches to Clean Speech Magnitude Spectrum Estimation
  • Aaron Nicolson,
  • Kuldip K. Paliwal
Aaron Nicolson
Signal Processing Laboratory

Corresponding Author: [email protected]



Estimation of the clean speech short-time magnitude spectrum (MS) is key for speech enhancement and separation. Moreover, an automatic speech recognition (ASR) system that employs a front-end relies on clean speech MS estimation to remain robust. Training targets for deep learning approaches to clean speech MS estimation fall into three categories: computational auditory scene analysis (CASA), MS, and minimum mean-square error (MMSE) estimator training targets. The choice of training target can have a significant impact on speech enhancement/separation and robust ASR performance. Motivated by this, we determine which training target produces enhanced/separated speech with the highest quality and intelligibility, and which is best suited to an ASR front-end. The evaluation used three deep neural network (DNN) types and two datasets that include real-world non-stationary and coloured noise sources at multiple signal-to-noise ratio (SNR) levels. Ten objective measures were employed, including the word error rate (WER) of the Deep Speech ASR system. We find that training targets that estimate the a priori SNR for MMSE estimators produce the highest objective quality scores. Moreover, we find that the gain of MMSE estimators and the ideal amplitude mask (IAM) produce the highest objective intelligibility scores and are most suitable for an ASR front-end.
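As a rough illustration of the kinds of training targets named above, the following Python sketch computes three per-bin quantities from magnitude spectra: the IAM, an instantaneous a priori SNR, and the Wiener gain as one example of an MMSE-type gain expressed through the a priori SNR. The function name `training_targets`, the `eps` guard, and the toy magnitudes are assumptions made for illustration; the abstract does not specify these definitions or which MMSE estimators were evaluated, so this should not be read as the authors' implementation.

```python
import numpy as np


def training_targets(clean_mag, noise_mag, noisy_mag, eps=1e-12):
    """Return (IAM, instantaneous a priori SNR, Wiener gain) per time-frequency bin.

    Hypothetical helper: inputs are magnitude spectra of identical shape.
    eps avoids division by zero and is an assumption of this sketch.
    """
    # Ideal amplitude mask: clean magnitude divided by noisy magnitude.
    iam = clean_mag / (noisy_mag + eps)
    # Instantaneous a priori SNR: clean power divided by noise power.
    xi = clean_mag ** 2 / (noise_mag ** 2 + eps)
    # Wiener gain, written as a function of the a priori SNR.
    wiener_gain = xi / (1.0 + xi)
    return iam, xi, wiener_gain


# Toy 2x2 magnitude spectra (made-up values, not data from the paper).
clean = np.array([[2.0, 0.0], [1.0, 3.0]])
noise = np.array([[1.0, 1.0], [1.0, 1.0]])
noisy = np.array([[2.5, 1.0], [1.5, 3.2]])

iam, xi, gain = training_targets(clean, noise, noisy)
print(iam)
print(xi)
print(gain)
```

In this sketch the IAM is unbounded above, whereas the Wiener gain always lies in [0, 1); a network trained on either would typically need the target compressed or clipped, a design choice the abstract does not detail.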
Published 01 May 2021 in The Journal of the Acoustical Society of America, volume 149, issue 5, pages 3273-3293. DOI: 10.1121/10.0004823