On Training Targets for Supervised LPC Estimation to Augmented Kalman
Filter-based Speech Enhancement
Abstract
The performance of speech coding, speech recognition, and speech
enhancement depends largely on the accuracy of the linear prediction
coefficients (LPCs) of clean speech and noise in practice. Formulation of
speech and noise LPC estimation as a supervised learning problem has
shown considerable promise. In its simplest form, a supervised
technique, typically a deep neural network (DNN), is trained to learn a
mapping from noisy speech features to clean speech and noise LPCs.
Training targets for DNN-based clean speech and noise LPC estimation fall
into four categories: line spectrum frequency (LSF), LPC power spectrum
(LPC-PS), power spectrum (PS), and magnitude spectrum (MS). The choice
of training target, as well as of the DNN method, can have a
significant impact on LPC estimation in practice. Motivated by this, we
perform a comprehensive study on the training targets using two
state-of-the-art DNN methods: the residual network and temporal
convolutional network (ResNet-TCN) and the multi-head attention network
(MHANet). This study aims to determine which combination of training
target and DNN method produces the most accurate LPCs in practice. We
train the ResNet-TCN and MHANet for each training target on a large data set.
Experiments on the NOIZEUS corpus demonstrate that the LPC-PS training
target with MHANet produces a lower spectral distortion (SD) level in
the estimated speech LPCs in real-life noise conditions. We also
construct the augmented Kalman filter (AKF) with the estimated speech and
noise LPC parameters from each training target using ResNet-TCN and
MHANet. Subjective AB
listening tests and seven different objective quality and
intelligibility evaluation measures (CSIG, CBAK, COVL, PESQ, STOI,
SegSNR, and SI-SDR) on the NOIZEUS corpus demonstrate that the AKF
constructed with MHANet-LPC-PS-driven speech and noise LPC parameters
produces enhanced speech with higher quality and intelligibility than
competing methods.
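
To make the LPC-PS target and the SD measure concrete, the following is a minimal sketch, not the authors' implementation. It assumes the autocorrelation method with the Levinson-Durbin recursion, the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, an LPC-PS defined as the prediction error variance divided by |A(e^jw)|^2, and an SD defined as the root-mean-square difference in dB between two LPC power spectra. All function names are hypothetical and chosen for illustration only.

```python
import cmath
import math


def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one signal frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t - k] for t in range(k, n)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for an all-pole model.

    Returns (a, err): a[0] = 1 and a[1..order] are the LPCs under the
    assumed convention s[t] = -sum_k a[k] * s[t - k] + w[t];
    err is the prediction error variance.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def lpc_power_spectrum(a, err, n_bins):
    """Evaluate err / |A(e^{jw})|^2 at n_bins frequencies in [0, pi]."""
    ps = []
    for m in range(n_bins):
        w = math.pi * m / (n_bins - 1)
        a_w = sum(a[k] * cmath.exp(-1j * w * k) for k in range(len(a)))
        ps.append(err / abs(a_w) ** 2)
    return ps


def spectral_distortion_db(ps_ref, ps_est):
    """RMS difference (in dB) between two power spectra of equal length."""
    diffs = [10.0 * math.log10(p) - 10.0 * math.log10(q)
             for p, q in zip(ps_ref, ps_est)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Toy example: the autocorrelation sequence of a first-order process.
r = [1.0, 0.5, 0.25]
a, err = levinson_durbin(r, order=2)
ps = lpc_power_spectrum(a, err, n_bins=5)
ps_scaled = [10.0 * p for p in ps]          # a uniform 10 dB offset
sd = spectral_distortion_db(ps_scaled, ps)
```

In a setting like the one studied here, `ps_ref` would come from the clean speech LPCs and `ps_est` from the DNN-estimated LPCs of the same frame, with the per-frame SD values averaged over an utterance; that averaging step is likewise an assumption of this sketch.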