A Novel Melspectrogram Snippet Representation Learning Framework for Severity Detection of Chronic Obstructive Pulmonary Diseases

Chronic obstructive pulmonary disease (COPD) is a major public health concern across the world. Since it is an incurable disease, early detection and accurate diagnosis are crucial for preventing its progression. Lung sounds provide reliable and accurate cues for identifying respiratory diseases. Recently, Altan et al. recorded a 12-channel real-time lung sound dataset, namely, RespiratoryDatabase@TR, covering five severity levels of COPD at the Antakya State Hospital, Turkey, and proposed deep learning frameworks for two-class COPD classification and five-class classification using a deep belief network (DBN) classifier and an extreme learning machine (ELM) classifier, respectively. Classification accuracies (ACC) of 95.84% and 94.31% were achieved for two classes and five classes, respectively. In this article, we propose a melspectrogram snippet representation learning framework for both two-class and five-class COPD classification. The proposed framework consists of the following stages: data augmentation and preprocessing, melspectrogram snippet representation generation from lung sounds, and fine-tuning of a pretrained YAMNet. An experimental analysis on the RespiratoryDatabase@TR dataset demonstrates that the proposed framework achieves accuracies of 99.25% and 96.14% for binary and multiclass COPD severity classification, respectively, which are superior to the only existing methods, proposed by Altan et al., for severity analysis of COPD using lung sounds.


I. INTRODUCTION
Respiratory diseases are the world's third leading cause of mortality. Each year, more than three million individuals worldwide die as a result of one of the five major pulmonary disorders: asthma, COPD, tuberculosis, lung cancer, and lower respiratory tract infection [1]. In particular, COPD is a major public health concern across the world, as it is incurable and takes a considerably long time to be diagnosed. In general, the lungs' airways and alveoli are elastic or flexible. However, in the case of COPD, less air passes in and out of the airways because of the following: 1) the airway walls thicken and become inflamed and 2) the airways create more mucus than usual and might become blocked. COPD can be characterized by adventitious breathing, which can be observed during lung auscultation. Due to the severe consequences of COPD, early detection and accurate diagnosis are crucial. The majority of prior studies [2], [3], [4] stand out, as they incorporate several pathological variables, including spirometry measures, age, sex, blood pressure, heart rate, hemoglobin, hematocrit, and so on, for COPD severity classification. A lung function test, or spirometry, along with subjective lung auscultation is the standard technique for COPD diagnosis [5]. Recently, at-home spirometry has become prevalent for regular monitoring of lung capacity with the emergence of portable spirometers [6]. Under a spirometry test, a patient has to inhale deeply and exhale forcefully into the mouthpiece while the nose is clipped. To diagnose and assess the severity levels of COPD, the values of spirometry variables, including forced expiratory volume in 1 s (FEV1), forced vital capacity (FVC), and the FEV1/FVC ratio or forced expiratory volume ratio (FEVR), are employed [5]. On the contrary, the most common symptom of COPD during lung auscultation is wheezing, which is caused by narrowed airways and blockages caused by sputum (a mixture of saliva and mucus). It is a chronic condition characterized by a continuous musical quality on both exhalation and inhalation [7]. Normal or vesicular respiratory sounds are mild and perceptible in both the inspiratory and expiratory phases with a frequency range of 100 Hz-1 kHz. However, wheezes are loud, high pitched, and have frequency components above 400 Hz [7]. Based on the spirometric measurement (FEVR) and the wheezing characteristics of the respiratory sound, the global initiative for chronic obstructive lung disease (GOLD) has divided the degree of COPD severity into five groups [8]: COPD-0, COPD-1, COPD-2, COPD-3, and COPD-4. COPD-0 is a low-risk condition for those who have been smoking for a few years and have an FEVR of more than 85% [8]. Patients with moderate-level COPD (COPD-1) exhibit prolonged symptoms with minimal wheezing and have an FEVR of more than 80% [8], whereas COPD-2 refers to those who have an intermediate degree of COPD severity, with FEVR ranging from 50% to 80% [8]. Significant wheezing during expiration and inspiration is typical in COPD-3 and COPD-4; it is caused by blockage and constriction of the airways, as well as usually coexisting heart disorders. People with an FEVR of 30%-50%, all chronic symptoms, and lung infections are likely to have severe COPD and fall into the COPD-3 group [8]. Patients with COPD-4, the most severe stage of COPD, have an FEVR of less than 30% and experience all of the chronic symptoms and respiratory issues [8].

A. Related Work and Motivation
Lung auscultation is one of the most popular and traditional methods used by pulmonary specialists to analyze the status of the respiratory system, as lung sounds are connected to anatomical flaws in the lungs [9]. Although doctors use photoplethysmography [10], spirometry [5], clinical history, and so on, lung auscultation remains essential to doctors due to its simple and cost-effective nature [9]. Since spirometry and subjective lung auscultation require patient cooperation and are quite laborious for children and the elderly [11], researchers have shown a surge of interest in developing automated algorithms for COPD detection and categorization. Most of the COPD categorization works focus on clinical data [2], [3], [4], [12], which are heavily dependent on clinical technicians [11]. Recently, very few research works have been carried out on COPD severity detection from lung sounds; the majority of research has focused on identifying adventitious sound anomalies [13], [14], [15], [16] rather than diagnosing chronic respiratory diseases by utilizing lung sounds [17], [18], [19]. This highlights the motivation for using lung sounds in COPD severity grading.
Naves et al. [20] used high-order statistical characteristics (cumulants) to examine diseased lung sounds. To increase classification performance, linear as well as nonlinear classification algorithms were applied to identify crackle and wheeze lung sounds. Fernandez-Granero et al. [21] examined tracheal sounds for early detection of COPD and achieved the highest classification ACC of 75.80% by employing a neural network architecture. Oweis et al. [22] performed an adventitious lung sound classification task by feeding power spectral density-derived features along with morphological features to a neural network. Newandee et al. [12] suggested a technique for diagnosing COPD based on heart rate variability (HRV) readings using principal component analysis (PCA) and clustering methods. An ACC of 88% was achieved in classifying COPD and non-COPD individuals. On the basis of clinical data, Amaral et al. [23] suggested a COPD diagnosis framework with 90% classification ACC. To identify the distinctive information present in respiratory sounds, Sánchez Morillo et al. [24] investigated the time-frequency representation (TFR) derived from the short-time Fourier transform (STFT). Following that, COPD was detected with an ACC of 81.80% by utilizing an artificial neural network (ANN)-based classification model. In recent years, deep learning techniques have also been employed in COPD severity classification [17], [18], [19], [25], [26]. Sugimori et al. [27] used CT scan images for five-class COPD severity classification using a ResNet-50 [28]-based convolutional neural network (CNN) backbone followed by a dense classifier and achieved a classification ACC of 44%. A new noninvasive imaging approach, named parametric response mapping (PRM) [25], [29], is employed in conjunction with inspiratory and expiratory CT images for early detection of small airway damage caused by COPD. Ho et al. [25] used a 3-D-CNN architecture in combination with PRM to classify subjects affected with COPD. They used two functional parenchyma variables, named emphysema percentage and small airway disease percentage, as inputs to the 3-D-CNN architecture and achieved an ACC rate of 89.30% and a sensitivity of 88.30% while classifying COPD subjects from non-COPD ones. Altan et al. [17] classified COPD and healthy patients by using 12-channel lung sounds based on ensemble empirical mode decomposition and extracted statistical as well as time-domain features. An ACC rate of 93.67% was achieved using a deep belief network (DBN) classifier. Altan et al. [18] proposed a 3-D second-order difference plot (SODP)-based feature extraction strategy that uses an SODP methodology and a DBN classifier to distinguish the two extreme COPD severity levels: COPD-0 and COPD-4. Using these features, a 95.84% ACC was achieved on two-class severity classification. However, these two severities can be easily differentiated by analyzing the wheezing density. In another work, Altan et al. [19] employed 3-D SODP to extract distinctive anomalies from the lung sound data and used cuboid and octant quantization of the 3-D-SODP features. Thereafter, these features were fed to a deep extreme learning machine (ELM) model to categorize all five severity levels of COPD. The deep ELM architecture uses bottom-top triangular ELM (LuELM) and Hessenberg ELM kernels for achieving better generalization and faster training convergence.
By employing this deep ELM architecture, a classification ACC of 94.31%, a weighted sensitivity (WSEN) of 94.28%, and a weighted specificity (WSPE) of 98.76% were achieved for the five-class COPD severity classification task.

B. Objective and Key Contributions
The primary objective of the study is to propose a deep learning framework based on lung sounds for classifying all five severity levels of COPD, which are hard to differentiate without performing additional tests, such as the spirometry test. Early detection of COPD is critical for preventing disease development and improving people's quality of life. However, identifying the severities COPD-0, 1, and 2 without employing any further diagnostic tests is practically very challenging. The same problem persists in identifying COPD-3 and 4, as both of these classes have almost similar properties. The proposed framework exploits the potential of fine-tuning a pretrained YAMNet [30] for COPD severity categorization, which is trained on melspectrograms of audio signals present in AudioSet [31], the largest dataset for audio machine learning. The proposed framework involves the following stages: 1) preprocessing of the lung sound signals and 2) melspectrogram snippet representation learning-based severity classification, which contains two submodules: 1) the processed lung sound signal is converted to a vanilla melspectrogram and then framed into subimages, called melspectrogram snippets, and 2) a pretrained YAMNet architecture is employed to classify these snippets into the five COPD severity classes. Based on the experimental analysis using lung sounds from RespiratoryDatabase@TR [32], the proposed framework achieves a high classification performance for both two-class and multiclass COPD severity classification. To the best of our knowledge, this is the second work, after Altan et al. [18] and [19], focusing on improving the ACC in both two-class and multiclass COPD severity classification. The salient contributions of this article are as follows.
1) Investigation of the potential of the melspectrogram TFR of lung sound signals for COPD severity classification.
2) Exploitation of the potential of transfer learning by fine-tuning the state-of-the-art (SOTA) pretrained audio classification network, YAMNet [30], which is explicitly pretrained using audio TFRs, for efficient COPD severity classification.
3) Extensive evaluation of the proposed framework on the RespiratoryDatabase@TR dataset for both overall and channel-wise performance, unlike the overall-only performance reported in the existing works by Altan et al. [18] and [19].

The rest of this article is organized as follows. Section II discusses the public dataset used to evaluate the proposed framework. The main processing steps of the proposed framework are presented in Section III. Section IV interprets the performance evaluation and compares the test results with some of the notable prior studies, and Section V concludes the article.

II. DESCRIPTION OF DATASET
RespiratoryDatabase@TR [32] is a public multimedia respiratory dataset. For each patient, the dataset contains four-channel phonocardiogram signals, 12-channel respiratory sounds, and spirometric measurements. The right (R) and left (L) channel lung sound signals were collected by using a Littmann 3200 digital stethoscope, with the assistance of two experienced pulmonologists, from six different lung auscultation sites. Fig. 1 illustrates the lung auscultation positions on both the posterior and anterior sides of the body. The individuals were labeled based on the wheezing features of the respiratory sounds and the spirometric measurements. The COPD severity stage of each subject was approved by two pulmonologists who agreed on the diagnosis. The dataset comprises lung sounds from 41 COPD patients with varied degrees of severity, ranging from COPD-0 to COPD-4. Five of the 41 individuals had COPD-0, five had COPD-1, seven had COPD-2, seven had COPD-3, and 17 had COPD-4. The demographics, gender, age, and information related to the auscultation process are covered in [32]. The subjects were asked to cough at the start of each lung sound recording to ensure that the signals from the right and left channels of both lung areas were synchronized. Each recording lasted for at least 17 s. Lung sound signals were sampled at 4 kHz [33]. While recording the data using the digital stethoscope, the dataset authors, Altan et al. [32], were able to remove 85% of the background noise from the room where the auscultation was performed.

III. PROPOSED FRAMEWORK
As mentioned earlier, the proposed framework consists of the following steps: data augmentation and preprocessing, melspectrogram snippet generation, and classification using a pretrained YAMNet after fine-tuning. The block diagram of the proposed framework is shown in Fig. 2. The blocks used in the proposed framework are explained in Sections III-A and III-B.

A. Data Augmentation and Preprocessing
In the RespiratoryDatabase@TR, COPD-4 is the majority class, and the remaining classes are minority classes, as each of them contains fewer subjects. To deal with this class imbalance problem, we have adopted different time-domain audio data augmentation techniques, as follows.
1) Time Stretching: The audio sample can be sped up or slowed down while maintaining its pitch [34]. In this study, signals from each of the minority classes were stretched by two factors: {0.4, 0.17}.
2) Pitch Shifting: Pitch shifting modifies the pitch of the audio signal by raising or lowering it, while the duration of the audio signal is kept unaltered. In [35], the importance of the pitch shifting process is investigated for CNN-based environmental sound classification. Here, we exploit the same technique with two pitch-shifting factors (in semitones) of {−2, 1} to augment the minority-class recordings.
3) Noise Addition: We have also used white noise addition as another audio data augmentation strategy to increase the number of samples in the minority classes; see the sketch after this list.
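The following is a minimal sketch of these three augmentation strategies, assuming librosa is used for time stretching and pitch shifting; the noise level shown is an illustrative value, not one specified in this study.

```python
import librosa
import numpy as np

def augment_minority_class(signal, sr=4000):
    """Generate augmented copies of a minority-class lung sound signal."""
    augmented = []
    # 1) Time stretching with the two factors used in this study.
    for rate in (0.4, 0.17):
        augmented.append(librosa.effects.time_stretch(signal, rate=rate))
    # 2) Pitch shifting by the two semitone factors used in this study.
    for n_steps in (-2, 1):
        augmented.append(librosa.effects.pitch_shift(signal, sr=sr, n_steps=n_steps))
    # 3) White noise addition (the 0.005 scale is an assumed, illustrative value).
    augmented.append(signal + 0.005 * np.random.randn(len(signal)))
    return augmented
```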
The preprocessing of lung sound signals includes segmentation, baseline wander removal, and amplitude normalization. The respiratory sounds in the RespiratoryDatabase@TR [32] dataset are not consistent in length, with lung sound signals lasting at least 17 s. To draw certain pathological information about respiratory health, it is recommended to auscultate for one or more respiratory cycles (inspiration to expiration phase) at each auscultation site [36]. The average time for completing a respiratory cycle is approximately 5 s [36]. In this work, we have considered a 10-s window, which covers roughly two cycles and can capture significant information about the lung sound signal that may be beneficial to the classification tasks. Therefore, the lung sounds are split into 10-s segments with 50% overlap to keep a uniform processing length. Then, we remove the baseline wandering (BW) component from the segmented signal $s(m)$ in the frequency domain. The $M$-point discrete Fourier transform (DFT) of $s(m)$ is computed as follows [37]:
$$S(p) = \sum_{m=0}^{M-1} s(m)\, e^{-j 2\pi p m / M}, \quad p = 0, 1, \ldots, M-1.$$
In general, the frequency range of the BW component lies between 0 and 1 Hz [37]. Thereby, we can eliminate the BW component by removing the DFT coefficients corresponding to frequencies smaller than 1 Hz. For the $f$-Hz frequency component, the DFT coefficient index $p$ can be computed as $p = \lceil f M / f_s \rceil$, where $f_s$ denotes the sampling rate of the lung sound signal. The thresholded DFT coefficients, denoted by $\hat{S}(p)$, are produced by setting the coefficients below this index (and their mirrored counterparts) to zero, and the BW-component-eliminated signal, $v(m)$, is computed through the inverse DFT as follows [37]:
$$v(m) = \frac{1}{M} \sum_{p=0}^{M-1} \hat{S}(p)\, e^{j 2\pi p m / M}.$$
Finally, the signal is amplitude-normalized as $s_n(m) = v(m) / \max_m |v(m)|$, where $s_n(m)$ indicates the normalized signal. Fig. 3 illustrates the original lung sound signal $s(m)$ in Fig. 3(a), the extracted baseline wander component $s(m) - v(m)$ in Fig. 3(b), the baseline-wander-removed signal $v(m)$ in Fig. 3(c), and finally the normalized lung sound signal $s_n(m)$ in Fig. 3(d).
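A minimal numpy sketch of this preprocessing chain is given below, assuming the 4-kHz sampling rate, 10-s segments with 50% overlap, and the 1-Hz BW cutoff stated above; max-absolute amplitude normalization is assumed.

```python
import numpy as np

def segment_signal(x, fs=4000, win_s=10, overlap=0.5):
    """Split a recording into 10-s segments with 50% overlap."""
    win = int(win_s * fs)
    hop = int(win * (1 - overlap))
    return [x[i:i + win] for i in range(0, len(x) - win + 1, hop)]

def remove_bw_and_normalize(s, fs=4000, cutoff_hz=1.0):
    """Zero the sub-1-Hz DFT coefficients and normalize the amplitude."""
    M = len(s)
    S = np.fft.fft(s)                       # M-point DFT of the segment
    p = int(np.ceil(cutoff_hz * M / fs))    # index of the 1-Hz coefficient
    S[:p] = 0                               # zero DC and sub-cutoff bins
    S[M - p + 1:] = 0                       # zero their mirrored counterparts
    v = np.real(np.fft.ifft(S))             # BW-eliminated signal v(m)
    return v / np.max(np.abs(v))            # normalized signal s_n(m)
```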

B. Melspectrogram Snippet Representation Learning
STFT [15], [28], scalogram [38], melspectrogram [13], [16], gammatone spectrogram [13], [14], and constant Q transform (CQT) spectrogram [13], [14] representations have been the most popular inputs for lung sound classification problems. In this article, we propose a novel melspectrogram snippet representation learning framework for COPD severity classification, as several studies show that melspectrograms perform better than other TFRs in classifying audio signals [39]. The snippet representation learning consists of two submodules: first, the lung sounds are converted to a vanilla melspectrogram and framed into subimages, called melspectrogram snippets; second, these frames, or melspectrogram snippets, are fed as individual instances to the YAMNet model, which generates a feature vector corresponding to each snippet.
1) Melspectrogram Snippet Generation: In this study, the preprocessed signal is first transformed into a spectrogram with 256 frequency bins. To convert the signal into a spectrogram, we have used the STFT with a 25-ms periodic Hanning window and a window hop of 10 ms [30], [40]. These spectrograms are then transformed into melspectrograms with 64 mel bins. The log-melspectrograms are computed by taking the natural log of the offset melspectrogram, where the offset melspectrograms are created by adding an offset of 0.01 to avoid computing the logarithm of 0 [41]. The log-scaled melspectrograms are referred to as vanilla melspectrograms, which are framed into subimages of 0.96 s, with a frame hop of 0.096 s. These framed subimages are referred to as melspectrogram snippets, each having a size of 96 × 64, and are fed into the pretrained YAMNet, which generates a feature vector for each snippet. Fig. 4(a)-(e) illustrates the time-domain visualization of COPD-0, 1, 2, 3, and 4 signals, respectively, and Fig. 4(f)-(j) depicts the vanilla melspectrogram representations of the corresponding time-domain signals. It can be observed from the figures that the vanilla melspectrograms demonstrate the spectral component variation across different severity levels of COPD.
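A sketch of the snippet generation is shown below, assuming librosa; the FFT length (chosen to approximate the 256 frequency bins) and the 10-frame snippet hop (the nearest integer to 0.096 s at a 10-ms frame rate) are assumptions.

```python
import librosa
import numpy as np

def melspectrogram_snippets(s_n, fs=4000):
    win = int(0.025 * fs)     # 25-ms periodic Hanning window
    hop = int(0.010 * fs)     # 10-ms window hop -> one spectral frame per 10 ms
    mel = librosa.feature.melspectrogram(
        y=s_n, sr=fs, n_fft=512, win_length=win, hop_length=hop,
        window="hann", n_mels=64)
    log_mel = np.log(mel + 0.01)           # offset of 0.01 avoids log(0)
    # Frame into 0.96-s snippets (96 frames) with a ~0.096-s (10-frame) hop.
    n_frames, frame_hop = 96, 10
    snippets = [log_mel[:, i:i + n_frames].T          # each snippet: 96 x 64
                for i in range(0, log_mel.shape[1] - n_frames + 1, frame_hop)]
    return np.stack(snippets)
```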
2) Snippet Representation Learning With Pretrained YAMNet: Training a deep convolutional neural network (DCNN) from scratch requires a huge volume of data. Since acquiring lung sound data of different COPD severity levels is quite challenging due to the unavailability of frequent COPD patients and expert annotations, transferring knowledge from pretrained networks that have been trained on an extensively large audio dataset is quite useful.
As a result, when such a pretrained DCNN is fine-tuned to accomplish some other target task, a smaller amount of data is required, which allows faster learning and improved performance after model fine-tuning. In recent years, several works have been carried out on lung sound anomaly classification using standard pretrained DCNN models, such as ResNet-50 [28] and ResNet-34 [15], via transferring their knowledge. However, these DCNN models are initially trained using natural images from the ImageNet dataset [15], [28], [42] and achieve poor ACC on audio TFRs, as they are pretrained with images. It has been shown that, as spectrograms contain immense information about audio signals, it is difficult to extract promising features from DCNNs pretrained using images [43]. In this context, Tsalera et al. [43] showed that knowledge transfer is more beneficial for a sound classification problem if the network is pretrained with audio TFRs rather than natural images. Thereby, in recent years, research on audio neural networks has received significant attention [30], [44]. Hence, in this work, we have used a pretrained audio classification model, i.e., YAMNet [30], [44]. YAMNet has been trained on the melspectrograms extracted from the audio signals of AudioSet [31], which is the largest dataset for audio deep learning.
Let us consider a deep learning model $M_0$, which was pretrained using the source dataset $P_s = \{a_s^i, o_s^i\}_{i=1}^{n_s}$, where $P_s$ contains input features ($a_s^i$) and corresponding labels ($o_s^i$). On the target dataset $P_t = \{a_t^i, o_t^i\}_{i=1}^{n_t}$, we aim to fine-tune the pretrained model using the transfer learning approach in order to generate better results on the target task. In this work, the source dataset is AudioSet, and the target dataset is RespiratoryDatabase@TR [32]. At the time of fine-tuning, only $P_t$ and $M_0$ are available. As $P_s$ and $P_t$ are of different domains and may have different input and output spaces, $M_0$ cannot be directly applied to the target data. In general, DCNNs are often divided into two portions: a generic representation function $F_{\lambda_0}$ (parameterized by $\lambda_0$) and a domain- or task-specific function $G_{\lambda_s}$, which is represented by the top layers of the DCNN. In the case of a transfer learning-based approach, the generic representation function is retained, while the domain- or task-specific function is replaced by a randomly initialized function $D_{\lambda_t}$ (parameterized by $\lambda_t$). Therefore, we optimize
$$\min_{\lambda_0, \lambda_t} \sum_{i=1}^{n_t} L\{D_{\lambda_t}(F_{\lambda_0}(a_t^i)), o_t^i\}$$
where $L\{\cdot\}$ indicates the loss function. At the commencement of the optimization process, the pretrained parameters $\lambda_0$ provide a good starting point.

YAMNet is built on the MobileNetV1 architecture [30] and is made up of depth-wise separable and point-wise convolutions, which significantly help in reducing the model size and computing cost [45]. The architecture of YAMNet comprises one convolutional layer and 13 pairs of depth-wise separable and point-wise convolution layers. After each convolution layer, ReLU activation and batch normalization are used. Finally, a global average pooling (GAP) layer is applied along with a fully connected (FC) classifier layer to classify the audio signals [30]. The top layers of YAMNet have been removed for fine-tuning purposes, so that the new output of YAMNet is the feature vector of size 1024 obtained from its GAP layer. Thereafter, the feature vector is passed through two FC (dense) layers, consisting of 500 and 100 neurons, respectively, with ReLU activation. Dropout rates of 0.3 and 0.2 are used in these FC layers, respectively, to alleviate the overfitting problem. Then, the dense layer output is fed to the classification layer that generates a $C \times 1$-dimensional class output, where $C$ indicates the total number of classes ($C = 2$ for binary and $C = 5$ for multiclass COPD severity classification). In general, the operation of the classification layer can be represented as follows:
$$y = \sigma(\langle X_{\text{dense}}, W_0 \rangle + b_0)$$
where $\langle X_{\text{dense}}, W_0 \rangle$ denotes the dot product between the weight vector ($W_0$) and the output of the dense layer ($X_{\text{dense}}$), $b_0$ denotes the bias, and $\sigma$ refers to the activation function. For binary and multiclass severity classification, we have employed the sigmoid and softmax activation functions, respectively. The sigmoid function is given by [46]
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
and the softmax activation function can be expressed as [46]
$$\text{softmax}(y_i) = \frac{e^{y_i}}{\sum_{j=1}^{C} e^{y_j}}$$
where $y_i$ is the $i$th element of the output of the neural network, and the denominator normalizes the class probability values to lie between 0 and 1. A sketch of this classification head is given below.
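The following Keras sketch illustrates the classification head described above, assuming the truncated backbone exposes the 1024-d GAP embedding for each 96 × 64 snippet (the embedding extraction itself is not shown).

```python
import tensorflow as tf

def build_copd_head(num_classes):
    """Dense head on top of the 1024-d YAMNet GAP embedding."""
    inputs = tf.keras.Input(shape=(1024,))            # GAP-layer feature vector
    x = tf.keras.layers.Dense(500, activation="relu")(inputs)
    x = tf.keras.layers.Dropout(0.3)(x)
    x = tf.keras.layers.Dense(100, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    # C x 1 output: sigmoid for binary (C = 2), softmax for multiclass (C = 5).
    activation = "sigmoid" if num_classes == 2 else "softmax"
    outputs = tf.keras.layers.Dense(num_classes, activation=activation)(x)
    return tf.keras.Model(inputs, outputs)
```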

IV. RESULTS AND DISCUSSION
In this section, the performance of the proposed framework is examined on the publicly available multimedia RespiratoryDatabase@TR using different widely used performance metrics, which are presented in the subsequent sections.

A. Performance Metrics
To evaluate our proposed framework, we have used the following performance metrics: ACC, precision (PRC), sensitivity (SEN)/recall (REC), specificity (SPE), and the Matthews correlation coefficient (MCC) for both binary and multiclass classification, similar to [47] and [48]. These metrics can be computed from the confusion matrix [49]. Furthermore, for multiclass severity classification, we have used two additional performance metrics, namely, WSPE and WSEN [19], as sketched below.
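A small numpy sketch of the confusion-matrix-derived class-wise and weighted metrics is given below; weighting by class support is an assumption about how WSEN/WSPE are aggregated.

```python
import numpy as np

def weighted_sen_spe(cm):
    """cm: C x C confusion matrix with rows = true class, cols = predicted."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = cm.sum() - (tp + fn + fp)
    sen = tp / (tp + fn)                   # class-wise sensitivity (recall)
    spe = tn / (tn + fp)                   # class-wise specificity
    w = cm.sum(axis=1) / cm.sum()          # class support used as weights
    return {"SEN": sen, "SPE": spe,
            "WSEN": float(np.sum(w * sen)), "WSPE": float(np.sum(w * spe))}
```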

B. Performance Evaluation
The performance of the proposed framework is assessed via lung sound training and testing signals for both binary and multiclass severity classification. The optimized simulation parameters, obtained after extensive training of the proposed DCNN architecture, are provided in Table I. We have evaluated our framework through two experiments. In experiment 1, we have divided the entire dataset (patientwise) into 80%-20% nonoverlapping train and test subsets, respectively, which prevents the scenario in which lung sound signals from one individual appear in both the train and test subsets [13]. For this experiment, we have presented the channel-wise ACC for different subjects present in the test subset. Fig. 5 and Table II illustrate the channel-wise, or acquisition-site-wise (L1-L6 and R1-R6), average ACC for five-class severity classification, showing that our proposed framework can accurately identify different severity levels from all of the 12-channel lung sound signals with a high classification rate. In experiment 2, we have randomly split the entire dataset into an 80%-20% ratio for training-testing. Furthermore, the 20% held-out data are split into 10% each for testing and validation, similar to [47] and [48]. For this experiment, we have presented the overall training and testing performances through loss and ACC curves and other performance metrics.

A batch size of 128 was utilized to train the deep learning model, coupled with the Adam optimizer. A fixed learning rate of $3 \times 10^{-4}$ was used to fine-tune and train the DCNN model. We have applied fivefold cross validation to ensure the generalization of our proposed framework. We have utilized binary and categorical cross-entropy losses for the two-class and multiclass severity classification tasks, respectively; a minimal sketch of this training configuration follows this discussion. Fig. 6(a) and (b) depicts the ACC and loss graphs of the proposed deep learning framework for binary severity classification, and Fig. 6(c) and (d) shows the ACC and loss graphs for multiclass severity classification. From Fig. 6, it can be observed that the deep learning framework does not overfit. As we have fine-tuned a pretrained network, the network achieves a high classification ACC for both binary and multiclass severity classification within a small number of epochs. Tables III and IV illustrate the fold-wise classification performance obtained using the proposed framework for both binary and multiclass severity classification. Different classification metrics have been tabulated for each of the five folds. It can be observed from both Tables III and IV that our proposed framework provides outstanding results for each of the folds, which speaks to the generalization of the proposed framework, and it outperforms the existing results on both the binary [18] and multiclass [19] severity classification tasks.
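The sketch below mirrors this training setup, reusing the hypothetical `build_copd_head` from the earlier sketch; the epoch count and the data variables are placeholders.

```python
import tensorflow as tf

num_classes = 5                                   # 2 for the binary task
model = build_copd_head(num_classes)              # head from the earlier sketch
loss = ("binary_crossentropy" if num_classes == 2
        else "categorical_crossentropy")
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-4),
              loss=loss, metrics=["accuracy"])
# x_train/y_train: snippet features and one-hot labels (placeholders);
# in practice, this is repeated over each of the five cross-validation folds.
# model.fit(x_train, y_train, batch_size=128, epochs=50,
#           validation_data=(x_val, y_val))
```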
To visualize the problem, a t-distributed stochastic neighbor embedding (t-SNE) plot [50] is used, which depicts lung sound data of different COPD classes in a 2-D space. In Fig. 7(a), it can be observed that, initially, the lung sounds belonging to different classes are randomly scattered in the 2-D space without forming any class-specific clusters. However, from Fig. 7(b) and (c), it can be argued that our model is capable of classifying the different severity levels accurately, as the features corresponding to each class form distinct clusters in the 2-D feature plane for both binary and multiclass classification. Table V illustrates the class-wise performance analysis of the multiclass severity classification. From the confusion matrix obtained using the test data while evaluating the framework, we can calculate all possible performance metrics for each of the classes [49]. Table V contains the class-wise SEN, SPE, ACC, PRC, and F1-score values for each of the severity classes. It can be observed from the table that the proposed framework classifies each of the severity classes with almost 97% ACC, and the other metrics (such as the F1-score) for each of the classes also yield high values, underscoring the effectiveness of the proposed framework.
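A sketch of this t-SNE visualization is given below, assuming a matrix of snippet-level feature vectors and integer class labels; the perplexity value is an assumption.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    """Project N x 1024 features to 2-D and color the points by COPD class."""
    coords = TSNE(n_components=2, perplexity=30,
                  init="pca", random_state=0).fit_transform(features)
    scatter = plt.scatter(coords[:, 0], coords[:, 1],
                          c=labels, cmap="tab10", s=8)
    plt.legend(*scatter.legend_elements(), title="COPD class")
    plt.xlabel("t-SNE dim 1")
    plt.ylabel("t-SNE dim 2")
    plt.show()
```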
C. Analysis of the Sensitivity of the Proposed Framework

1) Effect of the Processing Length of the Input Lung Sound Signal: Since lung sound signals are highly nonstationary, we generally process the TFR of the lung sound signal [13]. However, the information in the lung sound time series is present in a sequential manner, and the temporal resolution, or processing length, of the input signal also influences the classification performance of the deep learning model. We investigated three different processing lengths ranging from 5 to 15 s and carried out the binary and multiclass COPD severity classification tasks to assess the influence of processing length. Fig. 8(a) and (b) illustrates the variation of the classification metrics of the proposed framework with respect to different input lengths of lung sound signals for both tasks. Through these experiments, we have found that a 10-s processing length yields the highest classification rates for both classification tasks. While the 15-s processing length also yields competitive results with respect to 10 s, a slight reduction in the metrics can be observed. In addition, with the shorter segment length, i.e., the 5-s processing length, the performance degrades drastically. According to Sarkar et al. [36], at least two respiratory cycles should be used to assess any lung sound signal. As a 5-s frame covers only about one respiratory cycle, it carries substantially less information, which explains the degradation in classification performance seen in Fig. 8(a) and (b).

2) Choice of a Proper TFR: It is necessary to transform the lung sound signal from one domain to another in order to gain a comprehensive understanding of it [13]. In this regard, TFRs capture spectral variations over time. Hence, the choice of TFR also affects the classification ACC of the DCNN model. Therefore, in Table VI, a comparative result analysis is presented based on the effect of different TFRs on classification ACC. In this research, we have mainly stressed the melspectrogram representation, as it captures a wealth of time-frequency information from the audio signal [39]. From Table VI, it can be observed that, with the use of melspectrogram snippets, the classification ACC reaches its maximum value for both binary and multiclass COPD severity classification. With the use of simple STFT-based spectrogram and gammatone spectrogram snippets, the classification ACC degrades considerably, while CQT spectrogram snippets perform the worst among all of the TFRs in classifying COPD severities.
3) Choice of a Proper Deep Learning Model for Transferring Knowledge for COPD Severity Classification: In this part, we have experimented with different well-known DCNN models for COPD severity classification using the knowledge transfer technique. We have compared our proposed framework with VGG-16 [51], AlexNet [52], ResNet-50 [28], and MobileNetV2 [45] in terms of model size, total trainable parameters, and obtained ACC for both binary and multiclass COPD severity classification. The comparison results are tabulated in Table VII. From Table VII, it can be observed that our proposed approach outperforms the other ImageNet [42]-pretrained DCNNs [28], [45], [51], [52].
To demonstrate the better domain adaptation quality of YAMNet, we have provided a group of t-SNE plots created using the features extracted from the intermediate layers of fine-tuned VGG-16 [51], ResNet-50 [28], MobileNetV2 [45], and fine-tuned YAMNet [30], depicted in Fig. 9(a)-(d), respectively. This establishes the fact that using an audio-pretrained network for the lung sound classification task offers higher knowledge transfer capability than the traditional image-pretrained models [28], [45], [51]. In a way, the results of Table VII prove that YAMNet provides better domain adaptability for the lung sound-based COPD severity classification task after transferring knowledge from pretrained audio signals rather than using other ImageNet-pretrained DCNNs [43]. In addition, our suggested framework performs better in terms of classification ACC than the lightweight DCNN model MobileNetV2 by margins of 11.43% and 6.78% for the two tasks, respectively. Furthermore, it uses substantially less storage space and offers a good trade-off in the total number of trainable parameters. We have computed the total execution time required to classify one lung sound signal using our proposed framework. It takes nearly 1.21 ± 0.02 s to classify a lung sound signal in MATLAB 2021b on a Windows 10 64-bit desktop with 32-GB RAM and an Intel Xeon W-1350 processor with a 3.30-GHz clock frequency. The preprocessing part takes 0.08 ± 0.02 s, the melspectrogram snippet generation process requires approximately 0.11 ± 0.01 s, and finally, the YAMNet-based classification takes 1.02 ± 0.01 s to execute. In comparison, MobileNetV2 needs 1.19 ± 0.02 s to classify the same melspectrogram snippets. Therefore, this also proves that our proposed framework achieves better than SOTA performance in terms of classification ACC, the total number of trainable parameters, and execution time.

D. Performance Comparison
In this section, the superiority of our proposed framework is analyzed with respect to other existing COPD severity classification methods. Even though clinical studies support that lung auscultation is one of the most valuable diagnostic techniques for detecting different respiratory disorders, the majority of prior research on computer-assisted analysis has emphasized clinical and pathological data and various other biological signals. To date, very little research has been carried out that focuses on lung sounds for COPD severity detection. To the best of our knowledge, this is the second work that focuses on both binary and multiclass COPD severity classification based on lung sounds, after Altan et al. [18] and [19]. Hence, we first compare our performance results with the methods proposed by Altan et al. Tables VIII and IX depict the performance results in comparison with the methods proposed by Altan et al. [18] and [19] for binary classification (COPD-0 versus COPD-4) and multiclass COPD severity classification, respectively. It is clearly observed from the tables that the proposed framework provides higher classification performance in terms of all evaluation metrics. Also, in [19], low classification ACC rates were reported for COPD-1 and COPD-3 as compared with the proposed framework. We also perform a detailed comparative performance analysis for COPD severity categorization using other biomedical modalities along with lung sounds. Table X presents the overall comparative study, considering different datasets, methods, classifiers, and performance measures. It can be observed from the table that the existing works exploited subjective measurements, including symptoms, medical assessments, physical examinations, and questionnaire responses [4], and different biomedical input modalities, including CT scans [26], [27], [53], the PRM method [25], spirometric measures [4], electrocardiogram (ECG) [12], and tracheal sounds [24], for COPD severity classification. The results demonstrate that the proposed framework achieves higher performance for both binary and multiclass COPD severity categorization as compared with the existing works.

V. CONCLUSION
In this article, we have proposed a melspectrogram snippet representation learning framework for COPD severity classification. This work exploits the time-frequency melspectrogram snippet representation of lung sounds and a YAMNet-based transfer learning model. The proposed framework works in the following stages: data augmentation and preprocessing, melspectrogram snippet representation generation from lung sounds, and fine-tuning of a pretrained YAMNet. The proposed framework achieves ACCs of 99.25% and 96.14% for binary and multiclass COPD severity classification, respectively, using lung sounds from RespiratoryDatabase@TR. Furthermore, the experimental results demonstrate that the proposed framework achieves superior classification performance as compared with the existing works on COPD severity classification. In the future, we hope to investigate the impact of various measurement-related influences on COPD severity classification performance, such as ambient noise and heart sound interference, and how to eliminate them in order to achieve better results. In addition, we believe that our approach will make it possible to create a system that can automatically detect COPD severities from lung auscultation in actual clinical settings.