Neural Comb Filtering Using Sliding Window Attention Network for Speech Enhancement

In this paper, we demonstrate the significance of restoring the harmonics of the fundamental frequency (pitch) in deep neural network (DNN)-based speech enhancement. The parameters of the DNN can be estimated by minimizing the mask loss, but this does not restore the pitch harmonics, especially at higher frequencies. We propose to restore the pitch harmonics in the spectral domain by minimizing a cepstral loss around the pitch peak. Restoring the cepstral pitch peak, in turn, helps restore the pitch harmonics in the enhanced spectrum. The proposed cepstral pitch-peak loss acts as an adaptive comb filter on voiced segments and emphasizes the pitch harmonics in the speech spectrum. The network parameters are estimated using a combination of mask loss and cepstral pitch-peak loss. We show that this combination offers the complementary advantages of enhancing both the voiced and unvoiced regions. DNN-based methods primarily rely on the network architecture, and hence the prediction accuracy improves with increasing architectural complexity. However, lower-complexity models are essential for real-time processing systems. In this work, we propose a compact model using a sliding-window attention network (SWAN). The SWAN is trained to regress the spectral magnitude mask (SMM) from the noisy speech signal. Our experimental results demonstrate that the proposed approach achieves performance comparable to state-of-the-art noncausal and causal speech enhancement methods with much lower computational complexity. Our three-layered noncausal SWAN achieves a PESQ of 2.99 on the Valentini database with only 10^9 floating-point operations (FLOPs).


Introduction
Prolonged exposure to noisy speech signals causes severe fatigue to the listeners [26]. Speech enhancement (SE) algorithms aim to improve the quality and intelligibility of a noisy speech signal degraded by background noise [28]. SE plays a vital role in human-human communication over mobile/radio channels, hearing aids, cochlear implants, etc. [65]. SE is a crucial first step for domestic voice assistants operating amid multiple noise sources such as television, microwave oven, competing speakers, etc. [15]. Even though several methods have been proposed, SE still remains a challenging problem, especially for unseen speakers and noises. SE algorithms can be broadly classified into two categories:
- Estimating the noise characteristics and suppressing them.
- Estimating the speech characteristics and emphasizing them.
Most of the statistical SE algorithms rely on adaptively estimating the noise component and subtracting it from the noisy signal [2,9,47]. The performance of these algorithms depends on their noise tracking ability. Although they perform well in stationary noise environments, their performance degrades significantly in nonstationary noise environments [4]. Many advanced noise tracking algorithms have been proposed to estimate nonstationary noise characteristics [32,33]. Algorithms based on Bayesian statistics are well received in the literature, and their performance primarily depends on the accuracy of the estimated a priori and a posteriori signal-to-noise ratios (SNRs) [17]. The statistical methods typically involve explicit suppression of the estimated noise, which introduces musical noise distortion in the enhanced speech signal [3]. Despite these drawbacks, statistical enhancement methods are widely used in commercial applications because of their low computational complexity and real-time performance. DNN architectures have also been proposed to estimate the a priori SNR, which is subsequently used in the minimum mean-squared error (MMSE) estimation of the clean speech [36].
DNN approaches, on the other hand, rely on learning the structure of the speech signal through a nonlinear mapping between noisy and clean speech signals [57]. As DNN approaches pose SE as a supervised learning task, various noises at different SNRs can be used during the training. Hence, DNN approaches perform considerably better than the statistical approaches in unseen noises and lower SNR conditions [12,39]. The superior performance of DNNs comes at the expense of substantial computational complexity. There is a need to develop efficient low-complexity DNN architectures for low-power applications [46]. Network architecture and the objective function play an important role in efficiently capturing the signal characteristics. In the initial studies, feed-forward neural networks (FFNNs) were proposed for spectral regression-based SE [64]. However, FFNN architectures are memoryless networks that cannot capture the temporal dependencies across the frames. Subsequently, the convolutional and recurrent architectures were employed to exploit the long-term contextual information for improved SE [29,40,43]. However, the convolutional architectures require a deep stack of layers to achieve a higher receptive field [20]. Although the recurrent architectures offer longer contexts, they are not suitable for parallel processing [27]. Recent advances in the transformer architectures offer compact models for parallel processing of the frames while capturing long-term dependencies through explicit attention mechanism [51]. Kim et al. proposed a Gaussian weighted self-attention transformer (TGSA) for SE [22]. Subsequently, Nicolson et al. proposed a transformer-based architecture (MHANet) to estimate a priori SNR in a statistical enhancement framework. Later, Wang et al. proposed a two-stage transformer neural network (TSTNN) in the time-domain for SE [59]. Although TSTNN is a compact model, it requires enormous computation as it operates at the sample level in the time domain. 
In this paper, we propose a SWAN to estimate spectral masks in the frequency domain. As the proposed architecture operates at the frame level of the speech signal, it requires significantly less computation than TSTNN.
The frequency-domain approaches rely on estimating spectral mask from the noisy speech spectrum. The network parameters are updated using a combination of mask approximation loss and signal approximation loss [62]. State-of-the-art frequency-domain approaches for the SE use perception-related loss functions for signal approximation [14,23,60]. Although the perception-related loss functions operating in Mel/Bark domains are good at approximating the lower frequency bands, they do not perform well in the high-frequency bands. In this paper, we propose a production-related loss function for signal approximation.
Speech signal exhibits high SNR regions in both time and frequency domains. The region around the glottal-closure instants (GCIs) exhibits higher amplitudes in the time domain, and hence, they are less vulnerable to the degradation [66]. As the periodicity of GCIs transforms to pitch harmonics in the frequency domain, adaptive comb-filtering has been used in the literature to enhance them [30,35]. The pitch harmonics in the frequency domain show up as a peak in the cepstral domain [38]. Hence, accurately estimating the pitch peak in the cepstral domain enhances the pitch harmonics in the frequency domain and quasi-periodic structure of GCIs in the time domain. Elshamy et al. proposed an approach to manipulate the excitation and envelope in the cepstral domain to estimate a priori SNR in a statistical framework [8]. The excitation manipulation restores the pitch harmonics. In this work, we propose a novel loss function to better approximate the pitch harmonics in the spectral mask estimated using a DNN. The estimated spectral mask resembles a comb-filter and enhances the noisy speech spectrum across the frequency bands.
The rest of the paper is organized as follows. Section 2 presents the proposed transformer architecture for the speech enhancement using a sliding window attention network. The production-related loss function to enhance the pitch harmonics is discussed in Sect. 3. In Sect. 4, we compare the proposed method with the other state-of-the-art methods in terms of speech quality and intelligibility measures. Several ablation studies were conducted to understand the effect of various hyper-parameters. Finally, Sect. 5 summarizes the crucial contributions of this work.

Transformer for Spectral Mask Estimation
Let the desired speech signal of interest d[n] be corrupted by uncorrelated additive noise g[n], leading to a noisy speech observation s[n] given by

s[n] = d[n] + g[n].    (1)

The discrete Fourier transform (DFT) representation is more pertinent to stationary signals. As speech is a nonstationary signal, its spectral characteristics vary with time. Moreover, the effect of additive noise is not uniform across time and frequency in nonstationary environments [28]. As a result, the noisy speech observation exhibits varying segmental and subband SNRs across time and frequency bins, respectively. The noisy speech signal is typically processed in the STFT domain to deal with the varying levels of degradation in the time-frequency (TF) plane. The STFT of a signal s[n] is defined as

S[t, k] = Σ_{n=0}^{L−1} s[n + tR] w[n] e^{−j2πnk/L},    (2)

where w[n] is a window of length L, R is the frame shift in samples, t is the frame index, and k is the frequency bin. Since the STFT is a linear transformation, the STFT of the noisy speech signal s[n] can be expressed as

S[t, k] = D[t, k] + G[t, k],    (3)

where D[t, k] and G[t, k] are the STFTs of d[n] and g[n], respectively. Because of the nonstationary spectral characteristics, noisy speech signals must be processed using a time-varying filter to improve the perceptual quality. DNN approaches realize this time-varying filter through spectral masking in the TF plane. The spectral mask is expected to attenuate the noise bins while retaining the speech bins in the TF plane. In this approach, the noisy speech spectrum is multiplied by the spectral mask to estimate the clean speech spectrum,

D̂[t, k] = M[t, k] S[t, k],    (4)

where M[t, k] is the spectral mask and D̂[t, k] is the STFT of the enhanced speech signal. Both real and complex masks have been explored in the literature. A real mask enhances only the magnitude spectrum [10,18,49,61], while a complex mask enhances both magnitude and phase spectra [63]. A detailed discussion on a variety of spectral masks can be found in [57]. In this work, we use the SMM for speech enhancement. The SMM, defined as M[t, k] = |D[t, k]| / |S[t, k]|, has been shown to perform better than other masks [57].
Another advantage of the SMM is that computing the oracle mask does not require access to the noise signal. Since the SMM is a real mask, the noisy phase is retained during signal reconstruction from D̂[t, k] through the inverse STFT [5]. We use a transformer architecture with sliding window attention to estimate the SMM from noisy speech signals.
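The masking pipeline described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: `stft` here is a naive Hann-windowed DFT without padding, and the clean/noise signals are synthetic stand-ins.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Naive STFT: Hann-windowed frames, one-sided spectrum, no padding."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t * hop : t * hop + frame_len] * win
                       for t in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)        # shape (T, frame_len // 2 + 1)

rng = np.random.default_rng(0)
n = 16000
d = np.sin(2 * np.pi * 200 * np.arange(n) / 16000)   # "clean" tone
g = 0.1 * rng.standard_normal(n)                     # uncorrelated additive noise
s = d + g                                            # noisy observation

S, D = stft(s), stft(d)
smm = np.abs(D) / np.maximum(np.abs(S), 1e-8)        # oracle SMM: |D| / |S|
D_hat = smm * S                                      # enhanced STFT; noisy phase retained
```

Because the STFT is linear, S equals stft(d) + stft(g), and applying the oracle SMM restores the clean magnitude while keeping the noisy phase, exactly as described in the text.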

Sliding Window Attention Network
A transformer is a neural network model that learns context information through an attention mechanism, assigning a different weight to each part of the input sequence [55]. Unlike recurrent neural networks (RNNs), transformers need not process the input sequence in the temporal order of evolution. Instead, the attention mechanism extracts context for any position in the sequence. The proposed transformer architecture with sliding window attention for estimating M[t, k] from S[t, k] is shown in Fig. 1. We use time-distributed dense layers and temporal attention layers as the basic building blocks to facilitate parallel processing of the data. The dense layers capture correlations across frequency, while the temporal attention layers capture correlations across time.
The log-magnitude spectrum of the noisy speech signal is projected onto a d-dimensional space using a dense layer and batch-normalized to obtain context-independent nonlinear projections x[t], given by

x[t] = f_b(W s[t] + b),    (5)

where s[t] ∈ R^K is the vector of STFT coefficients at time t, W ∈ R^(d×K) and b ∈ R^d are the weights and biases of the dense layer, and f_b(·) denotes the nonlinear activation followed by batch normalization. As the dense layer is memoryless, it does not capture context information from the sequence. These context-independent projections x[t] are processed using a transformer block, as shown in Fig. 1, to extract the context information.
As speech is a nonstationary signal, the context information can be extracted by performing short-term analysis on x[t]. In this work, we propose to use sliding window attention [1] for extracting time-varying context information from x[t]. In the transformer block, x[t] is linearly projected to obtain the query x_q[t], key x_k[t], and value x_v[t] projections, given by

x_q[t] = W_Q x[t],  x_k[t] = W_K x[t],  x_v[t] = W_V x[t],    (6)

where W_Q, W_K, and W_V are learnable projection matrices. The sliding window attention uses only the neighboring frames to enhance the current frame; we use attention from c frames on either side of the current frame. The scaled dot product attention between frame t and frame τ is given by [55]

α[t, τ] = softmax( x_q[t]ᵀ x_k[τ] / √d_k ),    (7)

where the softmax function operates over the context of 2c + 1 frames. The attention weights are used to linearly combine the value vectors x_v[t] over a context of 2c + 1 frames to obtain the context vector x_c[t], given by

x_c[t] = Σ_{τ=t−c}^{t+c} α[t, τ] x_v[τ].    (8)

In the case of multi-head attention, multiple context vectors x_c^h[t] are estimated using parallel projections, as shown in Fig. 1. The context vectors across the heads are concatenated and projected to a lower dimension to obtain the combined summary, given by

x_c[t] = W_o [x_c^1[t]; x_c^2[t]; … ; x_c^H[t]],    (9)

where W_o ∈ R^(d×Hd_v) is a learnable linear projection and H denotes the number of parallel heads. The context vector estimation in the transformer can be interpreted as kernel adaptive filtering [41]; recent advances in transformer architectures offer a way to learn the optimum kernel from the data in a supervised manner. The context-independent vector x[t] is added to the context vector x_c[t] through a residual connection, and the sum is batch-normalized. Since the network operates in the log-spectral domain, the residual connection can be interpreted as filtering x[t] through a time-varying filter x_c[t]. As the attention block is intended to capture the correlations along the temporal axis, the batch-normalized output is processed using a couple of feed-forward layers to capture the correlations along the frequency axis.
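A minimal single-head version of the sliding window attention described above can be sketched as follows. This is illustrative NumPy, not the SWAN implementation: the projection matrices are random stand-ins for the learned W_Q, W_K, and W_V, and batching and multi-head combination are omitted.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sliding_window_attention(x, Wq, Wk, Wv, c):
    """Single-head sliding window attention over a (T, d) sequence.
    Frame t attends to frames [t - c, t + c], clipped at the sequence edges."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d_k = q.shape[-1]
    out = np.zeros_like(v)
    for t in range(x.shape[0]):
        lo, hi = max(0, t - c), min(x.shape[0], t + c + 1)
        scores = q[t] @ k[lo:hi].T / np.sqrt(d_k)   # scaled dot product
        out[t] = softmax(scores) @ v[lo:hi]         # weighted sum of value vectors
    return out

rng = np.random.default_rng(1)
T, d, d_k = 12, 8, 4
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
y = sliding_window_attention(x, Wq, Wk, Wv, c=2)
```

With c = 2, each output frame depends only on its five-frame neighborhood, which is what bounds both the latency and the cost of the softmax.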
When multiple layers of transformer blocks are employed, the signal is filtered along the time and frequency domains to estimate the SMM in the TF domain.
The output of the transformer block is processed using two time-distributed feed-forward layers to estimate the SMM. As the SMM is nonnegative, an exponential activation function is used in the final layer to estimate M̂[t, k].
The network parameters can be estimated by minimizing the distance between the oracle mask M[t, k] and the estimated mask M̂[t, k]:

L_mask = (1/|B|) Σ_{u∈B} Σ_{t,k} |M_u[t, k] − M̂_u[t, k]|^p,    (10)

where u denotes the utterance number, |B| denotes the cardinality of the batch, and |·|^p denotes the p-norm of the error. The mean square error (MSE) between the masks was used as the mask approximation objective in [62]. However, a closer observation of the oracle mask in Fig. 2a reveals occasional large amplitude fluctuations, which need not be given proportional weight in the error. Hence, we prefer the mean absolute error (MAE), i.e., p = 1, for mask estimation.
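The preference for MAE over MSE in the mask objective can be made concrete with a toy example. The mask values below are hypothetical; the point is only how an outlier in the oracle mask is weighted under the two norms.

```python
import numpy as np

def mask_loss(M, M_hat, p=1):
    """Mask approximation loss: mean of |error|^p over all TF bins.
    p = 1 (MAE) gives an outlier a linear, rather than quadratic, weight."""
    return np.mean(np.abs(M - M_hat) ** p)

M = np.array([[1.0, 0.2], [0.5, 3.0]])       # oracle SMM with one large fluctuation
M_hat = np.array([[0.9, 0.3], [0.5, 1.0]])   # estimated mask
mae = mask_loss(M, M_hat, p=1)               # (0.1 + 0.1 + 0.0 + 2.0) / 4 = 0.55
mse = mask_loss(M, M_hat, p=2)               # (0.01 + 0.01 + 0.0 + 4.0) / 4 = 1.005
```

The single large error dominates the MSE, whereas the MAE keeps its contribution proportional.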
Earlier works have reported that mask approximation does not directly optimize the speech enhancement objective [62]. This occurs because the finer fluctuations in the oracle SMM, shown in Fig. 2a, may not all contribute equally to the perceptual improvement. The objective function should be chosen to prioritize the perceptually important regions of the spectrum. Hence, a spectral approximation objective is also used along with the mask approximation objective. Weninger et al. [62] used the MSE in the mel-spectral domain as a spectral approximation objective. The SMM estimated using a combination of mask loss and mel-spectral loss is shown in Fig. 2a. While the estimated SMM matches the oracle mask well in the low-frequency bands (< 2 kHz), the approximation is poor in the high-frequency bands (> 2 kHz). In subsequent studies, the perceptual evaluation of speech quality (PESQ) was used as the signal approximation objective [11,23]. These signal approximation objectives are motivated by speech perception. Alternatively, the structure imposed by the speech production mechanism can be exploited for signal reconstruction. In this work, we propose to exploit the quasi-periodic structure imposed by the voice source for speech enhancement.

Enhancing Pitch Peak in Cepstral Domain
During the production of voiced sounds, significant excitation to the vocal tract system is delivered at the GCIs [34]. Hence, the region around the GCIs is more robust to additive noise within each pitch period [66]. The quasi-periodic sequence of GCIs in the time domain manifests as pitch harmonics in the frequency domain and as a pitch peak in the cepstral domain [38]. The real cepstrum of a clean speech signal can be computed from its spectrum S[t, k] as

S_c[t, q] = W_D log |S[t, k]|,    (11)

where W_D is the inverse discrete cosine transform (IDCT) matrix [44] and q denotes the quefrency bin in the cepstral domain.
In the cepstral domain, the low-time components predominantly correspond to the vocal tract information, while the high-time components correspond to the voice source [42]. The most prominent peak in the high-time cepstrum corresponds to the pitch period [38]. Figure 3a shows the cepstrum of a frame of clean speech from a female speaker. The cepstral peak at around 3.7 milliseconds (ms) corresponds to the pitch period of the speaker. Figure 3a also shows the cepstrum of the enhanced signal when the mel-spectral loss is used as the signal approximation objective. In this case, there is no reliable estimate of the pitch peak, as the mel-frequency warping smooths out the pitch harmonics.
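The cepstral pitch peak can be reproduced on a synthetic voiced frame. The sketch below computes the real cepstrum as the inverse DFT of the log-magnitude spectrum (a common alternative to the IDCT-matrix form used in the paper); the 270 Hz pitch and the 2–15 ms search range are illustrative choices, not values from the paper.

```python
import numpy as np

fs, f0, N = 16000, 270.0, 512            # 270 Hz -> pitch period of ~3.7 ms
n = np.arange(N)
# crude voiced frame: a stack of harmonics of f0, Hann-windowed
frame = sum(np.sin(2 * np.pi * k * f0 * n / fs) for k in range(1, 20)) * np.hanning(N)

spec = np.abs(np.fft.rfft(frame, N)) + 1e-10   # small floor avoids log(0)
cep = np.fft.irfft(np.log(spec))               # real cepstrum of the frame
# the pitch peak lives in the high-time region; search 2-15 ms
lo, hi = int(0.002 * fs), int(0.015 * fs)
q_peak = lo + np.argmax(cep[lo:hi])
pitch_ms = q_peak / fs * 1000                  # close to 1000 / f0, i.e. ~3.7 ms
```

The harmonic ripple in the log spectrum, spaced f0 apart, folds into a single prominent peak at the pitch-period quefrency, which is exactly the structure the proposed loss targets.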
The harmonic structure in the spectral domain can be enhanced by restoring the pitch peak in the cepstral domain. Hence, we propose to minimize the MSE between the pitch peaks of the clean and enhanced cepstra. During training, the location of the pitch peak is predetermined from the clean speech using a pitch tracking algorithm [21,67]. To account for minor deviations, we consider a short window around the estimated pitch peak, as indicated by the dotted lines in Fig. 3a.

Fig. 3 Real cepstrum of a voiced frame enhanced using (a) mel-spectral loss and (b) cepstral pitch-peak loss. The oracle cepstrum is shown for comparison.

The cepstral pitch-peak loss for a random batch B is computed as

L_cep = (1/|B|) Σ_{u∈B} Σ_{t∈V} Σ_{q=Q_l}^{Q_u} (S_c[t, q] − Ŝ_c[t, q])²,    (12)

where Ŝ_c[t, q] is the real cepstrum of the enhanced signal, V is the set of voiced frames in utterance u, and Q_l and Q_u denote the lower and upper bounds of the pitch window, respectively. In this work, the window extends seven samples on either side of the estimated pitch peak to account for a 1 ms deviation in the pitch period estimation. The cepstral pitch-peak loss alone cannot be used for speech enhancement, as it is defined only for the voiced regions. Hence, we combine it with the mask loss to improve the quality of the enhanced signal in the voiced regions.
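A sketch of the cepstral pitch-peak loss follows. The function name, the explicit loop, and the per-frame normalization are illustrative choices; in the paper the pitch-peak location comes from a pitch tracker run on the clean signal, whereas here it is simply supplied.

```python
import numpy as np

def cepstral_pitch_peak_loss(S_c, S_c_hat, voiced, q_pitch, half_width=7):
    """MSE between clean and enhanced cepstra inside a window around the
    pitch peak, accumulated over voiced frames only."""
    loss, count = 0.0, 0
    for t in np.flatnonzero(voiced):
        ql, qu = q_pitch[t] - half_width, q_pitch[t] + half_width + 1
        loss += np.sum((S_c[t, ql:qu] - S_c_hat[t, ql:qu]) ** 2)
        count += 1
    return loss / max(count, 1)

rng = np.random.default_rng(2)
T, Q = 6, 256
S_c = rng.standard_normal((T, Q))             # stand-in clean cepstra
S_c_hat = S_c + 0.1 * rng.standard_normal((T, Q))
voiced = np.array([1, 1, 0, 1, 0, 0], dtype=bool)
q_pitch = np.full(T, 59)                      # pitch-peak quefrency (~3.7 ms at 16 kHz)
loss = cepstral_pitch_peak_loss(S_c, S_c_hat, voiced, q_pitch)
```

Because only voiced frames enter the sum, arbitrarily bad estimates in unvoiced frames leave the loss untouched, which is exactly why it is paired with the mask loss.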
In the DNN approaches, the network weights θ are updated using gradient descent, which involves computing the gradients of the loss function L with respect to each network weight, i.e., ∇_θ L. For spectral masking methods, the gradient of the loss function can be decomposed using the chain rule as

∇_θ L = (∂M̂/∂θ)ᵀ ∇_M̂ L,    (13)

where M̂ is the estimated mask. The choice of the loss function influences the second term in (13), i.e., the gradient of the loss function with respect to the estimated mask, ∇_M̂ L. To analyze the effect of different loss functions on the training, we plot ∇_M̂ L for L_mask, L_mel, and L_cep in Fig. 4. The gradient of the mask loss, ∇_M̂ L_mask, is shown in Fig. 4c. Since MAE is used for the mask loss, its gradients take only two values, +1/|B| or −1/|B|, depending on the sign of the error, which explains the binary appearance of Fig. 4c. We can notice from the figure that the gradients capture the gross structure of the spectrogram. The gradients are higher in the unvoiced and silence regions to achieve noise suppression in those regions, which typically exhibit low segmental SNR. The gradients of the mask loss remain uniformly low in the voiced regions; hence, they do not contribute to suppressing the noise present between the pitch harmonics. Figure 4d shows the gradients of the mel-spectral loss, ∇_M̂ L_mel. In this case, the magnitude of the gradients is close to zero in most regions, which can be attributed to the spectral warping by the mel-filter bank, which smooths both the desired and estimated spectra. As the mel-filter bank offers higher resolution at low frequencies, the first and second formants are emphasized in the gradient plot. Figure 4e shows the gradients of the cepstral pitch-peak loss, ∇_M̂ L_cep. The gradients are zero in the unvoiced regions, as the loss is evaluated only for the voiced regions by applying a voice activity flag. In the voiced regions, the gradients exhibit a harmonic structure: negative values at the pitch harmonics and positive values between the harmonics.
Hence, the cepstral pitch-peak loss suppresses noise in the spectral valleys formed between the successive harmonics, emphasizing the harmonic structure.
The mask loss and the cepstral pitch-peak loss offer complementary advantages: the mask loss suppresses the noise in the unvoiced and silence regions, while the cepstral pitch-peak loss enhances the harmonic structure in the voiced regions. Hence, a weighted combination of mask loss and cepstral pitch-peak loss is optimized for updating the network parameters. The composite loss function used for training the network is given by

L = α L_mask + (1 − α) L_cep,    (14)

where the constant α determines the relative importance given to the mask and cepstral pitch-peak losses. Since the gradient magnitudes of the mask loss are about ten times smaller than those of the cepstral pitch-peak loss, we choose α = 0.9 to give approximately equal effective importance to both losses. Figure 3b illustrates the effectiveness of the proposed production-related loss function in retrieving the pitch peak. The pitch peak is accurately estimated with the cepstral pitch-peak loss, whereas it is underestimated with the mel-spectral loss. As a consequence, the SMM estimated using the cepstral pitch-peak loss resembles a comb filter, as shown in Fig. 2b. The SMM estimated using the combination of mask loss and cepstral pitch-peak loss closely follows the oracle mask up to 5 kHz, demonstrating the effectiveness of the proposed composite loss function.
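The composite objective reduces to a single weighted sum. The helper below is a sketch of that combination; the α = 0.9 balancing argument comes from the text, while the function itself and its demo values are illustrative.

```python
def composite_loss(l_mask, l_cep, alpha=0.9):
    """Weighted combination of mask loss and cepstral pitch-peak loss.
    alpha = 0.9 compensates for the mask-loss gradients being roughly an
    order of magnitude smaller than the cepstral pitch-peak loss gradients."""
    return alpha * l_mask + (1.0 - alpha) * l_cep

l = composite_loss(0.2, 1.0)   # 0.9 * 0.2 + 0.1 * 1.0 = 0.28
```

Despite the asymmetric weights, the tenfold gradient-magnitude gap means both terms pull on the network with comparable strength.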

Database Description
The performance of the proposed SWAN is evaluated on a publicly available speech dataset created for evaluating SE algorithms [54]. It contains clean speech recordings of 28 speakers (14 male and 14 female) taken from the Voice-Bank corpus [56] for training. Each speaker has about 400 sentences, giving a total of 11572 speech utterances. The noisy samples are generated by mixing in ten different noises: eight real noises and two artificially generated noises. The real noise recordings were taken from the diverse environments multi-channel acoustic noise (DEMAND) database [53] and include cafeteria, car, kitchen, meeting, metro, restaurant, station, and traffic noises. The two artificially generated noises are speech-shaped noise (SSN) and babble noise. These ten noises are mixed with the clean speech recordings at four different SNRs: 15 dB, 10 dB, 5 dB, and 0 dB. Hence, the DNNs were trained with 40 (10 noises × 4 SNR levels) different noisy conditions per speaker.
For testing, the clean speech recordings of two speakers (one male and one female), not included in the training set, were taken from the Voice-Bank corpus. Five noise types not used for training, viz. bus, cafe, living room, office, and public square, were taken from the DEMAND dataset. The noisy speech for testing was created at four different SNRs: 17.5 dB, 12.5 dB, 7.5 dB, and 2.5 dB. Hence, 20 different noisy conditions are considered for each speaker. A total of 824 clean and noisy speech utterance pairs are available for evaluating the SE algorithms. The clean speech recordings of two further speakers, not included in either the train or the test dataset, were taken from the Voice-Bank corpus to prepare the development dataset. To simulate noisy conditions similar to the test dataset, the noisy signals were generated by mixing the clean speech with the same noises and SNR levels as the test dataset. All the speech utterances were downsampled from 48 to 16 kHz.

Feature Extraction and Model Training
Speech segments of length 1.6 s are randomly sliced from the available utterances, and the STFT is computed with a frame size of 32 ms and a shift of 16 ms, resulting in 100 frames per segment. The network is trained to estimate the SMM from the normalized 257-dimensional magnitude spectra. All the dense layers in the network are of 257 dimensions, i.e., d = 257. We use swish activation for all the dense layers except the final layer, where the exponential activation function is used to estimate the SMM. In the transformer block, we use four heads (H = 4) for extracting the context vector x_c[t]. The dimension of the linear projections used to obtain the query, key, and value vectors is set to 64, i.e., d_q = d_k = d_v = 64. We observed that using different projections for keys and values did not offer a noticeable advantage in performance. Hence, we use the same linear projection for generating the key and value vectors, i.e., W_K = W_V. This choice significantly reduces the number of trainable parameters of the network.
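The dimensions quoted above are mutually consistent, as a quick check shows (assuming a padded or centered STFT so that a segment yields exactly segment-length/hop frames):

```python
fs = 16000
seg = int(1.6 * fs)        # 25600 samples per training segment
frame = int(0.032 * fs)    # 512-sample (32 ms) frames
hop = int(0.016 * fs)      # 256-sample (16 ms) shift
n_frames = seg // hop      # frames per segment, with edge padding
n_bins = frame // 2 + 1    # one-sided STFT bins per frame
```

This gives 100 frames of 257 bins per segment, matching the input shape of the network.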
The network is trained to minimize the composite loss function in Eq. (14). The Adam optimizer [24] with β1 = 0.9 and β2 = 0.99 is used to update the network parameters. The learning rate is exponentially decayed at a rate of 0.95 every 15 epochs, starting from an initial value of 0.001. The training was stopped if the training loss did not reduce for 20 consecutive epochs. A batch size of 32 segments is used for training. The pitch period required for computing the cepstral pitch-peak loss was estimated from the clean speech signal using the Yet Another Algorithm for Pitch Tracking (YAAPT) [21].
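The learning-rate schedule can be written as a small helper. The stepwise `epoch // 15` form is one plausible reading of "decayed at 0.95 every 15 epochs"; a smooth per-epoch decay would be another.

```python
def learning_rate(epoch, lr0=1e-3, decay=0.95, every=15):
    """Exponential step decay: multiply the rate by `decay` once per `every` epochs."""
    return lr0 * decay ** (epoch // every)
```

For example, the rate stays at 0.001 through epoch 14, drops to 0.00095 at epoch 15, and so on.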

Description of Evaluation Metrics
The performance of the SE algorithms is evaluated using standard objective quality and intelligibility measures. In real-time applications, we often have limited memory and computing resources. Although the state-of-the-art methods achieve better speech quality, they may not be suitable for real-time operation because of their deeper models and high computational complexity. Hence, we compare the model size and the computational complexity of different algorithms to investigate their applicability in real time. The model size is quantified in terms of the number of network weights, while the computational complexity is measured in terms of the number of FLOPs [52]. Note that a smaller model size need not result in lower computational complexity: our experimental studies show that some smaller models require far more FLOPs than deeper models.
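The point that model size and computational cost can diverge is easy to make concrete with a toy count for dense layers. The 2·d_in·d_out multiply-add convention and the application counts below are illustrative assumptions, echoing the sample-level versus frame-level contrast drawn earlier for TSTNN.

```python
def dense_layer_cost(d_in, d_out):
    """Parameter count and per-application FLOPs of one dense layer."""
    params = d_in * d_out + d_out     # weight matrix + bias vector
    flops = 2 * d_in * d_out          # one multiply and one add per weight
    return params, flops

# a small layer applied very often can out-cost a large layer applied rarely
p_small, f_small = dense_layer_cost(64, 64)
p_big, f_big = dense_layer_cost(512, 512)
applications_small, applications_big = 10_000, 100   # e.g. per-sample vs per-frame
total_small = f_small * applications_small           # 81.92 MFLOPs
total_big = f_big * applications_big                 # 52.43 MFLOPs
```

The small layer has a fraction of the parameters yet costs more total FLOPs: model size alone is a poor proxy for real-time feasibility.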

Effect of Hyper Parameters
We conducted several experiments to study the effect of hyper-parameters on the performance of the proposed SWAN. A three-layered SWAN network is used for all these experiments. All the models are evaluated on the development data described in Sect. 4.1.

Effect of Temporal Context
The information captured in the context vector of Eq. 8 critically depends on the width of the sliding window (2c + 1). In order to understand the effect of sliding window size, we have conducted experiments with varying temporal contexts, ranging from c = 3 to c = 15, and the corresponding results are reported in Table 1.
For longer temporal contexts, the attention layer averages the value vectors across nonstationary regions of the signal and hence does not capture local characteristics. As a result, the performance is inferior for longer contexts. In fact, the performance is poorest when the full (infinite) temporal context is used for context vector estimation, although a better SSNR is achieved with the infinite context. (Bold values in Table 1 indicate the best result for each metric.) Moreover, the computational complexity and latency increase with longer contexts. In contrast, shorter contexts offer lower latency; however, the context vectors can be noisy and may not capture the temporal dependencies. A ten-frame context on either side (c = 10) is found to offer a good compromise between performance and computational complexity. Hence, we use a sliding window of 21 frames (around 0.3 s) to extract the context vector in all subsequent studies.

Significance of the Cepstral Pitch-Peak Loss
To demonstrate the importance of the proposed cepstral pitch-peak loss, we have conducted ablation studies with and without including it in training the model. Firstly, we trained our network with only mask loss, followed by the combination of mask loss and cepstral pitch-peak loss. We have also evaluated the model performance for the combination of mask loss and mel-spectral loss for comparison. The performance of a 3-layer SWAN trained with different loss functions is shown in Table 2.
The quality and intelligibility of the enhanced signal improve when the mask loss is combined with either the mel-spectral loss or the cepstral pitch-peak loss. The system achieves the best performance when the mask loss is combined with the cepstral pitch-peak loss. Since the cepstral pitch-peak loss is effective only in the voiced regions, we evaluate the SSNR of voiced and unvoiced segments separately, identifying the segments using a voice activity detector [21]. The segmental SNR of the voiced regions (V-SSNR) improved significantly when the cepstral pitch-peak loss was combined with the mask loss. In contrast, the segmental SNR of the unvoiced regions (UV-SSNR) remained almost the same, because the gradient of the cepstral pitch-peak loss is zero in the unvoiced regions. On the other hand, incorporating the mel-spectral loss adversely affected the unvoiced segments while performing relatively better in the voiced regions. This study demonstrates the significance of the proposed cepstral pitch-peak loss in restoring pitch harmonics in the spectral domain. (In Table 2, N denotes the noisy signal and E the enhanced signal.)

Performance Evaluation Across SNR Levels
The performance of the proposed SWAN network is evaluated at different SNR levels, and the results are presented in Table 3. The evaluation metrics for the noisy signals are also provided in the table to analyze the relative improvement. Across all SNR levels, the evaluation metrics are consistently better for the enhanced signals than for the noisy input signals (N). The proposed method achieves a 70.47% relative improvement in PESQ at 2.5 dB SNR, illustrating the effectiveness of pitch restoration at lower SNR levels.

Performance Evaluation Across Genders
The effect of noise is not the same across speakers of different genders [25,31]. For the same SNR level, the perceived degradation can be significantly higher for female speakers than for male speakers, as can be observed from the evaluation metrics of the noisy signals (N) of the male and female speakers in Table 4. This behavior can be attributed to the higher fundamental frequency of female speakers, which results in widely spaced pitch harmonics in the narrowband spectrogram. Hence, the frequency bins between the pitch harmonics are vulnerable to additive noise, degrading the perceptual quality. Enhancement algorithms should suppress the noise between the pitch harmonics to achieve better performance on female speakers. The mask estimated using the proposed cepstral pitch-peak loss resembles a time-varying comb filter and helps in effectively suppressing the noise between the pitch harmonics. The performance of the proposed SE algorithm for both genders is reported in Table 4. Improvements in all the quality and intelligibility measures can be observed for both genders.

Performance Comparison with the State-of-the-Art Methods
The proposed SE algorithm is compared with several state-of-the-art methods evaluated on the Valentini test dataset. The methods chosen for comparison can be broadly classified into two categories: frequency-domain methods that estimate a spectral gain/mask, and time-domain methods that directly regress the clean waveform. A brief description of the methods chosen for the comparison is given below. DeepMMSE [69] is based on a statistical enhancement framework estimating the a priori SNR using a temporal convolutional network (TCN). Similarly, MHANet [37] also estimates the a priori SNR using a multi-head attention network. MMSE-GAN [48] is a generative adversarial network (GAN) that implicitly estimates a multiplicative mask. The generator and discriminator are composed of three dense layers each, and the mean squared error criterion is used to learn the parameters of the network. MetricGAN [11] and MetricGAN+ [14] are also based on the GAN framework, in which the generator consists of bidirectional long short-term memory (BLSTM) layers estimating the SMM, and the discriminator consists of convolutional layers evaluating the perceptual quality of the enhanced signal. The discriminator in these two networks was designed to mimic the behavior of the PESQ metric. SDR-PESQ [23] estimates a multiplicative mask using a combination of convolutional and BLSTM layers. In T-GSA [22], a transformer-based architecture is used to estimate the SMM. Both SDR-PESQ and T-GSA are trained to optimize the SDR and PESQ metrics. PhaseDCN [68] estimates the phase spectrum along with the ideal ratio mask using a dilated convolutional dual-path network. DEMUCS [7] and TSTNN [68] are time-domain approaches. DEMUCS uses a combination of convolutional and LSTM layers, while TSTNN uses a combination of convolutional and transformer layers. However, the transformer layer of TSTNN consists of a gated recurrent unit (GRU) instead of a conventional dense layer.
Both DEMUCS and TSTNN are trained to minimize a combination of waveform reconstruction loss and STFT loss.

Comparison with Noncausal SE Algorithms
Causality is one of the desirable features of SE algorithms for real-time processing. Noncausal models use frames from either side to extract context-rich representations and achieve better performance. However, noncausal models cannot be used in real time as they require future frames for processing the current frame. Although some of the algorithms mentioned above can be restricted to use a causal context, most of them were trained with noncausal contexts to achieve improved performance. (The models used for the detailed analysis of computational complexity are available online.) Hence, we compare the performance of the proposed SWAN model with causal and noncausal contexts separately. Table 5 compares the performance of SWAN with the state-of-the-art noncausal SE algorithms. Our model achieves performance comparable with state-of-the-art methods such as SDR-PESQ, T-GSA, and DEMUCS. Even though MetricGAN+ has the best PESQ, its STOI and SSNR suffer because of the time-varying gain introduced while optimizing the PESQ metric. Figure 5 shows a speech segment denoised with the different methods. In the case of MetricGAN+, the enhanced signal does not match the clean signal well because of the time-varying gain, which cannot be compensated by a constant scaling factor. We evaluated the MSE between the clean and enhanced signals to quantify this distortion; MetricGAN+ shows higher signal distortion than the other methods, which explains its lower STOI and SSNR despite its higher PESQ. As DEMUCS and TSTNN are time-domain approaches, they perform better on the STOI and SSNR measures. However, their computational complexity is very high as they operate in the time domain, i.e., on the speech samples. On the other hand, frequency-domain approaches like MetricGAN+ and SWAN operate at the frame level and require much less computation.
In particular, SWAN offers a good balance between enhancement quality and computational complexity. Even a single-layer SWAN (1L) delivers impressive performance with just 0.53 million (M) parameters and 4.46 billion (B) FLOPs. Although SWAN is not trained to optimize any of the metrics explicitly, its performance is comparable to that of approaches that explicitly optimize speech quality metrics. This demonstrates the perceptual significance of restoring the pitch harmonics in the spectral domain using the cepstral pitch-peak loss.

Comparison with Causal SE Algorithms
The proposed sliding-window attention can be restricted to include only the causal context for enhancing streaming audio in real time. The performance of the proposed SWAN trained with a causal context is compared with state-of-the-art causal SE algorithms. The quality and intelligibility measures of the causal models are given in Table 6. The performance of the causal algorithms is generally inferior to that of their noncausal counterparts. The performance of SWAN (6L) is better than the other frequency-domain methods and comparable with DEMUCS, a time-domain method. However, as indicated earlier in Table 5, the computational complexity of DEMUCS is much higher than that of SWAN. Hence, the proposed SWAN architecture is better suited for real-time SE, as it offers comparable performance with significantly lower computational complexity.
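The causal restriction amounts to changing the attention mask of each layer. Below is a minimal sketch assuming each frame attends to `window` neighboring frames on each side (noncausal) or only to past frames (causal); the exact windowing used by SWAN may differ.

```python
import numpy as np

def sliding_window_mask(n_frames, window, causal=False):
    """Boolean attention mask for sliding-window attention:
    entry [q, k] is True if query frame q may attend to key frame k.
    Causal masking keeps only past frames within the window, so the
    current frame never depends on future context."""
    i = np.arange(n_frames)
    d = i[None, :] - i[:, None]  # key index minus query index
    if causal:
        return (d <= 0) & (d >= -window)
    return np.abs(d) <= window
```

In practice such a mask is applied to the attention logits (disallowed positions set to negative infinity before the softmax), so the per-frame cost stays constant regardless of utterance length.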

Conclusions and Future Work
In this work, we proposed a sliding-window attention-based transformer architecture for speech enhancement. The proposed network is trained to estimate the SMM from the noisy speech signal. We demonstrated the importance of incorporating knowledge of the speech production process into mask estimation, i.e., the quasi-periodic nature of the speech signal, which manifests as a harmonic structure in the frequency domain. To restore the pitch harmonics in the enhanced speech signal, we proposed to minimize a cepstral pitch-peak loss along with the mask loss. The spectral mask thus estimated resembles a time-varying comb filter emphasizing the high-SNR regions around the pitch harmonics. Our experimental results show that the proposed SWAN architecture achieves state-of-the-art performance with much lower computational complexity. The proposed approach enhances only the magnitude spectrum of the speech signal; hence, the noisy phase is reused during reconstruction. Our future efforts will be oriented toward developing a complex attention mechanism to incorporate the effect of phase in context-vector generation. In addition, the cepstral pitch-peak loss proposed in this work is motivated by the speech production mechanism. In the future, we will combine the proposed production-related loss function with perception-related loss functions.