Abstract
This work addresses the issue of enhancing speech in binaural hearing
scenarios. Specifically, we present a method to improve binaural noise
reduction by integrating latent features produced by monaural speech
enhancement algorithms through “Fusion layers.” Inspired by
multi-task learning, in which model weights are shared across models
that handle interconnected tasks, these layers compute Hadamard
(element-wise) products between tensors representing latent
representations at the same processing stage, mimicking the
physiological excitatory and inhibitory mechanisms of the binaural
hearing system. This study first presents a general fusion model,
demonstrating its ability to better fit synthetic data compared to
independent linear models, equalize activation variance between learning
modules, and exploit input data redundancy to reduce the training
error. We then apply the concept of fusion layers to enhance speech in
binaural listening conditions. The proposed method shows promise for
improved noise reduction compared to other feature-sharing approaches.
The study also suggests that fusion can enhance predicted speech
intelligibility and quality, although fusing too many features may
degrade predicted intelligibility. Furthermore, the
results suggest that fusion layers should share parameterized latent
representations to effectively utilize information from each listening
side, rather than using deterministic representations. Overall, this
study highlights the potential of sharing information between speech
enhancement modules through deep fusion layers to improve binaural
speech enhancement while keeping the number of trainable parameters constant and
improving generalization.
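To make the core operation concrete: a fusion layer of the kind described above combines the latent representations of the two (left and right) enhancement modules via a Hadamard product, optionally after a learned parameterization of each side. The sketch below is illustrative only; all names, shapes, and the linear parameterization are assumptions, not the paper's exact architecture.

```python
import numpy as np

def fusion_layer(left_latent, right_latent, w_left, w_right):
    """Hypothetical fusion-layer sketch: each side's latent tensor is
    first passed through a learned linear map (a parameterized latent
    representation), then the two sides are combined with a Hadamard
    (element-wise) product, loosely mimicking excitatory/inhibitory
    binaural interaction."""
    z_left = left_latent @ w_left     # parameterized latent, left side
    z_right = right_latent @ w_right  # parameterized latent, right side
    return z_left * z_right           # Hadamard (element-wise) product

# Toy usage: fuse two 4-dimensional latents into one shared tensor.
rng = np.random.default_rng(0)
h_l = rng.standard_normal((1, 4))   # left-ear latent features
h_r = rng.standard_normal((1, 4))   # right-ear latent features
W_l = rng.standard_normal((4, 4))   # learned left projection (toy)
W_r = rng.standard_normal((4, 4))   # learned right projection (toy)
fused = fusion_layer(h_l, h_r, W_l, W_r)
print(fused.shape)  # (1, 4)
```

Because the product is element-wise, the fused tensor keeps the shape of each side's latent, so fusion adds no extra dimensions and, with shared or fixed projections, need not increase the trainable parameter count.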