DeepLPC-MHANet: Multi-Head Self-Attention for Augmented Kalman Filter-based Speech Enhancement
preprintposted on 12.04.2021, 20:37 by Sujan Kumar RoySujan Kumar Roy, Aaron NicolsonAaron Nicolson, Kuldip K. Paliwal
Current augmented Kalman filter (AKF)-based speech enhancement algorithms utilise a temporal convolutional network (TCN) to estimate the clean speech and noise linear prediction coefficient (LPC). However, the multi-head attention network (MHANet) has demonstrated the ability to more efficiently model the long-term dependencies of noisy speech than TCNs. Motivated by this, we investigate the MHANet for LPC estimation. We aim to produce clean speech and noise LPC parameters with the least bias to date. With this, we also aim to produce higher quality and more intelligible enhanced speech than any current KF or AKF-based SEA. Here, we investigate MHANet within the DeepLPC framework. DeepLPC is a deep learning framework for jointly estimating the clean speech and noise LPC power spectra. DeepLPC is selected as it exhibits significantly less bias than other frameworks, by avoiding the use of whitening filters and post-processing. DeepLPC-MHANet is evaluated on the NOIZEUS corpus using subjective AB listening tests, as well as seven different objective measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR). DeepLPC-MHANet is compared to five existing deep learning-based methods. Compared to other deep learning approaches, DeepLPC-MHANet produced clean speech LPC estimates with the least amount of bias. DeepLPC-MHANet-AKF also produced higher objective scores than any of the competing methods (with an improvement of 0.17 for CSIG, 0.15 for CBAK, 0.19 for COVL, 0.24 for PESQ, 3.70\% for STOI, 1.03 dB for SegSNR, and 1.04 dB for SI-SDR over the next best method). The enhanced speech produced by DeepLPC-MHANet-AKF was also the most preferred amongst ten listeners. By producing LPC estimates with the least amount of bias to date, DeepLPC-MHANet enables the AKF to produce enhanced speech at a higher quality and intelligibility than any previous method.