On Training Targets for Supervised LPC Estimation to Augmented Kalman Filter-based Speech Enhancement
preprintposted on 30.04.2021, 20:13 by Sujan Kumar RoySujan Kumar Roy, Aaron NicolsonAaron Nicolson, Kuldip K. Paliwal
The performance of speech coding, speech recognition, and speech enhancement largely depends upon the accuracy of the linear prediction coefficient (LPC) of clean speech and noise in practice. Formulation of speech and noise LPC estimation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised technique, typically a deep neural network (DNN) is trained to learn a mapping from noisy speech features to clean speech and noise LPCs. Training targets for DNN to clean speech and noise LPC estimation fall into four categories: line spectrum frequency (LSF), LPC power spectrum (LPC-PS), power spectrum (PS), and magnitude spectrum (MS). The choice of appropriate training target as well as the DNN method can have a significant impact on LPC estimation in practice. Motivated by this, we perform a comprehensive study on the training targets using two state-of-the-art DNN methods--- residual network and temporal convolutional network (ResNet-TCN) and multi-head attention network (MHANet). This study aims to determine which training target as well as DNN method produces more accurate LPCs in practice. We train the ResNet-TCN and MHANet for each training target with a large data set. Experiments on the NOIZEUS corpus demonstrate that the LPC-PS training target with MHANet produces a lower spectral distortion (SD) level in the estimated speech LPCs in real-life noise conditions. We also construct the AKF with the estimated speech and noise LPC parameters from each training target using ResNet-TCN and MHANet. Subjective AB listening tests and seven different objective quality and intelligibility evaluation measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR) on the NOIZEUS corpus demonstrate that the AKF constructed with MHANet-LPC-PS driven speech and noise LPC parameters produced enhanced speech with higher quality and intelligibility than competing methods.