TechRxiv

On Training Targets for Deep Learning Approaches to Clean Speech Magnitude Spectrum Estimation

preprint
posted on 30.09.2020 by Aaron Nicolson, Kuldip K. Paliwal
Estimating the clean speech short-time magnitude spectrum (MS) is key to speech enhancement and separation. Moreover, an automatic speech recognition (ASR) system that employs a front-end relies on clean speech MS estimation to remain robust. Training targets for deep learning approaches to clean speech MS estimation fall into three main categories: computational auditory scene analysis (CASA), MS, and minimum mean-square error (MMSE) training targets. In this study, we aim to determine which training target produces enhanced/separated speech with the highest quality and intelligibility, and which is most suitable as a front-end for robust ASR. The training targets were evaluated using a temporal convolutional network (TCN) on the DEMAND Voice Bank and Deep Xi datasets, which include real-world non-stationary and coloured noise sources at multiple signal-to-noise ratio (SNR) levels. Seven objective measures were used, including the word error rate (WER) of the Deep Speech ASR system. We find that MMSE training targets produce the highest objective quality scores. We also find that CASA training targets, in particular the ideal ratio mask (IRM), produce the highest intelligibility scores and perform best as a front-end for robust ASR.
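
To make the notion of a CASA training target concrete, the following is a minimal sketch of the ideal ratio mask (IRM) for a single frame. It assumes the commonly used definition, IRM = (|S|^2 / (|S|^2 + |D|^2))^beta with beta = 0.5, where |S| and |D| are the clean speech and noise magnitudes in each frequency bin. This definition, the function names, and the toy values are illustrative assumptions and are not taken from the preprint, which may define or compute the target differently.

```python
import math


def ideal_ratio_mask(clean_mag, noise_mag, beta=0.5):
    """Per-bin IRM from clean speech and noise magnitude spectra of one frame.

    Assumed definition: (|S|^2 / (|S|^2 + |D|^2)) ** beta.
    """
    mask = []
    for s, d in zip(clean_mag, noise_mag):
        s_pow, d_pow = s * s, d * d
        total = s_pow + d_pow
        # Guard against empty bins where both magnitudes are zero.
        mask.append((s_pow / total) ** beta if total > 0.0 else 0.0)
    return mask


def apply_mask(noisy_mag, mask):
    """Estimate the clean speech MS by scaling each noisy bin by the mask."""
    return [m * x for m, x in zip(mask, noisy_mag)]


# Toy single-frame example with three frequency bins (made-up values).
clean = [3.0, 0.0, 1.0]
noise = [4.0, 2.0, 1.0]
irm = ideal_ratio_mask(clean, noise)

# Assume speech and noise are uncorrelated, so their powers add per bin.
noisy = [math.hypot(s, d) for s, d in zip(clean, noise)]
estimate = apply_mask(noisy, irm)
```

Under the stated additivity assumption, applying the mask to the noisy magnitudes recovers the clean magnitudes in this toy case. In a learned system the mask would be predicted by the network from the noisy input rather than computed from the unavailable clean speech and noise.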

History

Email Address of Submitting Author

aaron.nicolson@griffithuni.edu.au

ORCID of Submitting Author

https://orcid.org/0000-0002-7163-1809

Submitting Author's Institution

Signal Processing Laboratory, Griffith University

Submitting Author's Country

Australia

