Abstract
In this work, we address the problem of speech enhancement in the
context of binaural hearing. We propose deep learning models that are
connected by “fusion layers” performing Hadamard products between
selected latent representations generated by each model. Fusion layers are inspired by
multi-task learning approaches that combine and/or share weights between
models that tackle related tasks. We first present a general fusion
model and show that this approach fits synthetic data better
than independent linear models, equalizes activation variance between
learning modules, and exploits input data redundancy to reduce the
training error. We then apply the concept of fusion layers to enhance
speech in binaural listening conditions. Our results show that the
proposed approach improves speech enhancement performance on unseen data
relative to the independent models. However, we observe a trade-off
between enhancement performance and speech intelligibility as predicted
by a short-time objective binaural speech intelligibility index,
potentially due to distortions introduced by fully fused models.
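As a minimal sketch of the fusion operation, assuming one left-channel and one right-channel model with matching layer dimensions (the notation below is illustrative and not taken from the paper body): given latent representations $\mathbf{h}_L$ and $\mathbf{h}_R$ produced at corresponding layers of the two models, a fusion layer computes
$$\mathbf{z} = \mathbf{h}_L \odot \mathbf{h}_R,$$
where $\odot$ denotes the Hadamard (element-wise) product and $\mathbf{z}$ is passed to the next layer of each model.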
Results also suggest that fusion layers should share parameterized
latent representations in order to properly exploit the information
available at each listening side. Overall, this work shows that
sharing information between speech enhancement modules may be a
promising way to improve binaural speech enhancement while keeping the
number of trainable parameters constant and improving generalization.