Abstract
This work addresses the issue of enhancing speech in binaural hearing
scenarios. Specifically, we present a method to improve binaural noise
reduction by integrating latent features produced by monaural speech
enhancement algorithms through “Fusion layers.” Inspired by
multi-task learning, in which model weights are shared across models
that handle interconnected tasks, these layers compute Hadamard
(element-wise) products between tensors representing latent
representations at the same processing stage, mimicking the
physiological excitatory and inhibitory mechanisms of the binaural
hearing system. This study first presents a general fusion model,
demonstrating its ability to better fit synthetic data compared to
independent linear models, equalize activation variance between learning
modules, and exploit input data redundancy to reduce the training
error. We then apply the concept of fusion layers to enhance speech in
binaural listening conditions. The proposed method shows promise for
improved noise reduction compared to other feature-sharing approaches.
The study also suggests that fusion can enhance predicted speech
intelligibility and quality, although fusing too many features may
degrade predicted intelligibility. Furthermore, the
results suggest that fusion layers should share parameterized latent
representations to effectively utilize information from each listening
side, rather than using deterministic representations. Overall, this
study highlights the potential of sharing information between speech
enhancement modules through deep fusion layers to improve binaural
speech enhancement while keeping the number of trainable parameters constant and
improving generalization.
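To make the core operation concrete: a fusion layer of the kind described above combines the latent representations of the two (left and right) enhancement modules via a Hadamard product, optionally after a learned parameterization of each side. The sketch below is illustrative only; all names, shapes, and the linear parameterization are assumptions, not the paper's exact architecture.

```python
import numpy as np

def fusion_layer(left_latent, right_latent, w_left, w_right):
    """Hypothetical fusion-layer sketch: each side's latent tensor is
    first passed through a learned linear map (a parameterized latent
    representation), then the two sides are combined with a Hadamard
    (element-wise) product, loosely mimicking excitatory/inhibitory
    binaural interaction."""
    z_left = left_latent @ w_left     # parameterized latent, left side
    z_right = right_latent @ w_right  # parameterized latent, right side
    return z_left * z_right           # Hadamard (element-wise) product

# Toy usage: fuse two 4-dimensional latents into one shared tensor.
rng = np.random.default_rng(0)
h_l = rng.standard_normal((1, 4))   # left-ear latent features
h_r = rng.standard_normal((1, 4))   # right-ear latent features
W_l = rng.standard_normal((4, 4))   # learned left projection (toy)
W_r = rng.standard_normal((4, 4))   # learned right projection (toy)
fused = fusion_layer(h_l, h_r, W_l, W_r)
print(fused.shape)  # (1, 4)
```

Because the product is element-wise, the fused tensor keeps the shape of each side's latent, so fusion adds no extra dimensions and, with shared or fixed projections, need not increase the trainable parameter count.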