BiConNet: A Hybrid CNN-BiLSTM Architecture for Robust Overlapping Speech Detection in Diverse Acoustic Environments
  • Yassin Terraf,
  • Youssef Iraqi

Corresponding Author:[email protected]

Abstract

Speech overlap, which occurs when multiple people speak simultaneously, poses a significant challenge in audio and speech processing. Overlapping speech segments degrade the performance of technologies such as Automatic Speech Recognition (ASR), speaker identification, and diarization systems, and this degradation worsens in diverse acoustic environments with background noise and reverberation. To address this issue, we introduce BiConNet, a novel dual-branch architecture that combines the strengths of Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) for robust detection of overlapping speech in diverse acoustic conditions. The CNN branch performs frame-level spectral feature extraction, while the BiLSTM branch captures temporal dependencies in both forward and backward directions. Features from both branches are concatenated, yielding a robust feature representation. We also examine the impact of Mel Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GTCC), and Power Normalized Cepstral Coefficients (PNCC) as spectral-based input features on BiConNet's performance. To validate its effectiveness across acoustic environments, we constructed a dataset derived from the GRID corpus, covering conversations with different gender combinations and recording conditions: clean, noisy, reverberant, and combined noise and reverberation. Experimental results show that BiConNet outperforms various state-of-the-art methods in detecting overlapping speech segments under these conditions. Furthermore, our analysis of computational efficiency reveals that BiConNet provides competitive training and inference times, demonstrating its practicality for real-world applications.
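The abstract describes a dual-branch design: a convolutional branch for frame-level spectral features and a bidirectional recurrent branch for temporal context, with the two outputs concatenated before per-frame classification. Below is a minimal numpy sketch of that data flow, not the authors' implementation: a 1-D convolution stands in for the CNN branch, a simple tanh recurrence stands in for the BiLSTM (a real BiLSTM adds gating), and all shapes, weights, and the feature dimension (13, as for typical MFCCs) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_branch(X, W):
    """CNN-branch stand-in: 1-D convolution over frames + ReLU.
    X: (T, F) frame-level spectral features (e.g. MFCC); W: (k, F, H) kernel."""
    T, _ = X.shape
    k, _, _ = W.shape
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    out = np.stack([np.tensordot(Xp[t:t + k], W, axes=([0, 1], [0, 1]))
                    for t in range(T)])
    return np.maximum(out, 0.0)  # ReLU

def rnn_pass(X, Wx, Wh):
    """One recurrent pass (tanh cell as a simplified stand-in for an LSTM)."""
    h = np.zeros(Wh.shape[0])
    outs = []
    for x_t in X:
        h = np.tanh(x_t @ Wx + h @ Wh)
        outs.append(h)
    return np.stack(outs)

def bidir_branch(X, Wx, Wh):
    """BiLSTM-branch stand-in: forward + backward pass, concatenated per frame."""
    fwd = rnn_pass(X, Wx, Wh)
    bwd = rnn_pass(X[::-1], Wx, Wh)[::-1]
    return np.concatenate([fwd, bwd], axis=1)

# Illustrative sizes: 50 frames, 13-dim features, hidden size 8, kernel width 3.
T, F, H, k = 50, 13, 8, 3
X = rng.standard_normal((T, F))
Wc = rng.standard_normal((k, F, H)) * 0.1
Wx = rng.standard_normal((F, H)) * 0.1
Wh = rng.standard_normal((H, H)) * 0.1

spec = conv_branch(X, Wc)                       # (T, H) spectral features
temp = bidir_branch(X, Wx, Wh)                  # (T, 2H) temporal features
fused = np.concatenate([spec, temp], axis=1)    # (T, 3H) joint representation
Wo = rng.standard_normal((3 * H,)) * 0.1
probs = 1.0 / (1.0 + np.exp(-(fused @ Wo)))     # per-frame overlap probability
print(fused.shape, probs.shape)
```

The key design point the sketch illustrates is late fusion: each branch sees the same frame sequence, and concatenation lets the classifier weigh spectral and temporal evidence jointly at every frame.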
17 Feb 2024: Submitted to TechRxiv
20 Feb 2024: Published in TechRxiv