Analysis of Automated Clinical Depression Diagnosis in a Chinese Corpus

Depression clinical interview corpora are essential for advancing automated depression diagnosis. While previous studies have relied on scripted speech material collected in controlled settings, such material does not accurately represent spontaneous conversational speech. Additionally, self-reported measures of depression are subject to bias, making the resulting data unreliable for training models for real-world scenarios. This study introduces a new corpus of depression clinical interviews collected directly from a psychiatric hospital, containing 113 recordings: 52 from healthy controls and 61 from patients with depression. The subjects were examined using the Montgomery-Åsberg Depression Rating Scale (MADRS) in Chinese. Their final diagnosis was based on medical evaluation through a clinical interview conducted by a psychiatry specialist. All interviews were audio-recorded, transcribed verbatim, and annotated by experienced physicians. This dataset is a valuable resource for automated depression detection research and is expected to advance the field of psychology. Baseline models for detecting depression and predicting its level were built, and descriptive statistics of audio and text features were calculated. The decision-making process of the model was also investigated and illustrated. To the best of our knowledge, this is the first study to collect a depression clinical interview corpus in Chinese and to train machine learning models on it to diagnose patients with depression.

The World Health Organization (WHO) predicts that clinical depression will be the second most debilitating disease by 2030, following only cardiovascular diseases [5]. Early-stage depression screening and follow-up psychotherapy are crucial for individuals with high-risk mental health conditions [6]. However, delivering early-stage mental health interventions to vulnerable and stigmatized populations, particularly those in financial hardship or concerned about privacy, remains challenging.
In addition, depression diagnosis currently lacks a reliable "gold standard" method, which leads to inconsistencies in treatment depending on individual clinicians' experience and intuition. Artificial Intelligence (AI) has the potential to revolutionize the diagnosis and prognosis of diseases, especially by making early-stage mental health screening a reality. The role of datasets in advancing AI models cannot be overlooked. Recent papers published in TBioCAS and IEEE CASS journals highlight the importance of datasets. For example, Zhang et al. conducted experiments on respiratory sound classification using their own SPRSound database [7]. Jiang et al. presented a Standardized Assessment of Underwater Image dataset and proposed a new metric for image quality assessment [8]. The availability of these datasets presents significant opportunities for researchers and practitioners in a variety of fields to develop innovative solutions to various problems.
Although significant progress has been made in automating depression diagnosis, previous studies have primarily used non-clinical datasets. These datasets are valuable for researchers, as they provide ample training data and insights for those without the resources to collect and label their own datasets. They also serve as benchmarks for performance evaluation, allowing researchers to compare their models with others. Previous datasets have investigated music-induced [9], video-induced [10], [11] and mixed emotion induction methods [12]. However, there are still challenges to implementing and deploying depression detection systems in real-world applications. For instance, existing datasets ignore the critical fact that emotions are typically context-based. To address this issue, there is a need for interactive multimodal datasets collected through interviews conducted in a clinical setting between patients and physicians. In this scenario, a patient's emotions depend on verbal and non-verbal communication with physicians. The primary objective of this study is to investigate the effectiveness of semantic and prosodic features in evaluating depression.¹

¹The dataset has been made available as open source on GitHub at the following link: https://github.com/uofabinarylab/MDDInterview

II. RELATED WORK

Depression corpora typically comprise numerous recordings labelled on different depression scales. Therefore, the generalizability of the resulting models relies heavily on the various elements that constitute the dataset. This section will discuss two key elements: data collection methods and depression assessment instruments.

A. Data collection methods
Data collection methods play a crucial role in model performance. Researchers must consider an appropriate context in which the subjects' responses are observed. To date, two main types of contexts have been used in collecting depression datasets: social networks, open platforms on which individuals share their thoughts, for example by scraping social networks to construct depression-related corpora [15], [16]; and spontaneous behaviour, in which interviewees naturally interact with interviewers or machines, for example by chatting with a chatbot [17], [18]. It is important to note that the choice of data collection method can affect the quality and generalizability of the dataset, and researchers should carefully consider which method is most appropriate for their study.
1) Social network platforms: Datasets for research on affective computing have been well studied from different perspectives. Herein, we examine a series of datasets and the corresponding data collection strategies. Numerous corpora suitable for diagnosing depression have been collected in low-noise environments and with limited topics. However, these conditions are not representative of the real world, and models trained on such datasets may not perform well when applied to recordings made in uncontrolled settings. Moreover, many researchers have had to perform feature extraction and design application-specific machine learning strategies due to data scarcity. To address the data shortage, Rajput et al. proposed collecting corpora from online forums that focus on mental disease discussions [19]. Pirina et al. investigated the influence of the quality of training data [20]. De Choudhury et al. and other researchers proposed collecting, analyzing, and summarizing data from social media platforms, where valuable patterns and information can be detected [21]-[23]. Collecting data from online forums can also significantly reduce the difficulty of obtaining sufficient data from healthy individuals: the healthy control group can be sampled from other online communities unrelated to depression. With their vast inflow of user-generated content, social media platforms effectively capture depressive behavioural cues relevant to an individual's emotional state or mental disorder. However, user-generated content from online forums can be misleading for machine learning models. For example, patients who avoid visiting clinics due to fear of mental disorder-related stigma may also avoid discussing depression online.
2) Interviews under controlled conditions: Owing to the limitations of social media, many researchers are recruiting volunteers and recording their responses during interviews or free discussions as a method of data collection. The widespread use of smartphones has enabled this new strategy to efficiently recruit a diverse sample of participants and collect large amounts of data. Examples of datasets that have adopted this approach include the SEMAINE dataset [24], which includes audio and video recordings of 150 participants, and the Affectiva-MIT Facial Expression Dataset (AM-FED) [25], which consists of labelled spontaneous facial recordings collected over the internet, including 242 video recordings and labels of 10 symmetrical and 4 asymmetrical action units (AU), head movements, smiles, feature tracker confidence, as well as gender and facial landmarks. Dhall et al. proposed a dataset including 4886 images collected in real-world situations, labelled with happiness intensity. Moore et al. proposed recording the voice of each subject while reading a short story [26]. Yingthawornsuk et al. also provided a solution to acquiring recordings: two-part interviews between participants and clinicians consisting of an audio recording session, after which the participants were asked to read a selected section of a book [27]. Cohn et al. proposed obtaining facial images from clinical patients in a study of 57 participants [28]. The facial activities and voices of the participants were recorded simultaneously during the interview session with the participants' permission.

Another issue with the clinical interview data collection strategy is the high cost involved. Gratch et al. proposed an automated interview platform that utilizes an animated virtual interviewer to make patients feel as comfortable as possible [17]. The virtual interviewer can be fully automated or controlled by an operator, which significantly reduces labour costs for data collection. Still, the stringent semi-structured interview process for each patient may be problematic, especially in cases where the patient is unwilling to answer a question. In that case, the virtual interviewer can only proceed to the next question, resulting in patients providing only a few words or nonverbal responses. The resulting datasets may not contain enough information to accurately assess the emotional and mental health state of the patient.

B. Depression assessment instrument
Assessment of depression is challenging due to ongoing research on its pathology [39]. The Diagnostic and Statistical Manual of Mental Disorders (DSM), developed by the American Psychiatric Association, provides the most commonly used set of criteria for diagnosing mental disorders. The DSM aims to provide standard criteria for identifying mental disorders based on observed symptoms such as psychomotor retardation and diminished concentration. The Hamilton Rating Scale for Depression (HAMD) and the Beck Depression Inventory (BDI) are also widely used assessment tools [40], [41].
HAMD is a clinician-administered depression scale and is considered the gold standard assessment tool, while BDI is a self-reported questionnaire. Investigations using both HAMD and BDI have led to the development of new depression scales such as the MADRS [13], the Quick Inventory of Depressive Symptomatology (QIDS) [42] and the 9-item Patient Health Questionnaire (PHQ-9) [43]. Previous research has reported that MADRS has higher reliability statistics than QIDS and PHQ-9 [44], [45].

C. The existing corpora
We reviewed previous articles that reported dyadic interview recordings annotated on the basis of clinician and outpatient interactions. Approximately half of these datasets were labelled with a self-reported depression rating scale, recorded in controlled conditions, and produced in English. Controlled conditions refer to the standardized tasks and procedures used during interviews. Specifically, in previous studies, researchers asked the interviewees to perform tasks such as reading a fixed paragraph, sustaining vowels, and recalling memories. This allows investigators to control for variability in responses and simplify the problem. In comparing our dataset to others, we believe that our approach enables the collection of more natural responses from participants. While previous studies have used structured interviews, our methodology allowed for more flexibility in the conversation, since participants could choose to continue or change the topic as they wished. This approach can yield more spontaneous and authentic responses from the participants, which is important for accurately diagnosing depression. Further, the prevalence of one language (English) in these datasets limits their usability for cross-cultural studies of depression.

III. DATA COLLECTION

The main goal of this study was to collect high-quality responses from subjects participating in clinical depression interviews. Previous research has found that spontaneous speech is more effective than read speech in depression classification [46], [47]. In addition, this study aims to examine subjects' emotional responses to physicians' questions. Therefore, the data collection protocol, related experiments, and data preprocessing procedures have been designed to detect and evaluate depression and depression levels.

A. Participants

Participants were required to be native Mandarin speakers with at least a primary education. To ensure that our findings were applicable to a broader population, we carefully selected a representative sample of individuals with depression. We also took recommendations from clinicians into account, excluding individuals with a history of antidepressant medication or other mental disorders. Participants who were diagnosed with depression and had no other mental or medical conditions were eligible. Verbal consent and signed forms were obtained, allowing data processing and distribution with patient identification removed. The study was conducted in Wenzhou, China, with in-person interviews taking place in a confidential private room that was pre-arranged to protect patients' privacy. Although the interviews were conducted in a private room, we did not use noise-cancelling equipment or impose any restrictions on the topics discussed during the interviews. Our goal was to capture the range of natural variation that might occur in real-world settings, ensuring the authenticity and generalizability of our dataset; this is important because noise levels in public spaces are not controlled. If a model is trained on noise-cancelled data, its performance may not be as robust in real-world settings.
We ensured the standardization of our data collection process in several ways. First, we employed experienced physicians to conduct all the interviews; a total of four attending physicians, each with a minimum of five years of experience, were involved. Second, we utilized the MADRS questionnaire, a reliable and well-established tool with high inter-rater consistency and reliability. Third, we followed well-established standardized protocols for conducting interviews, recording data, and managing data quality. Before starting full-scale data collection, we conducted pilot testing, including randomly selecting some outpatients for depression interviews, to ensure that the data collection process was feasible, reliable, and standardized. The interview questions are listed in Table II.
During the experiment, participants were asked questions by a clinician about their mental health. The order of the questions may have varied at the clinician's discretion. In some cases, additional questions were asked for more information based on the clinician's judgement and experience, as long as the question was still relevant to the previous topic and the participant was willing to discuss it. The clinician was also allowed to adapt the initial questions to put the participant at ease. At the end of the interview, the clinician helped the participant relax from any distress they may have experienced. Experienced clinicians conducted the interviews to minimize any further impact on the participants' mental health. Our goal was to elicit verbal and non-verbal cues of depression from the participants.

C. Dataset statistics
In our study, we interviewed 113 participants, 52 of whom were healthy and 61 of whom were depressed patients. The interview audios were an average of 364.40 seconds in length (st. dev. = 257.66 seconds). For the control group, the audio files were an average of 164.53 seconds in length (st. dev. = 101.88 seconds), and the average sentence word count was 6.14 (st. dev. = 6.44). For the depressed patient group, the audio files were an average of 535.70 seconds in length (st. dev. = 224.78 seconds), and the average sentence word count was 6.41 (st. dev. = 5.89). Patient demographics are presented in Table III. The supplementary material shows interview samples in Tables S1 and S2. Before further analysis, a balanced dataset was built by random sampling: for binary depression detection, positive and negative samples should be approximately equal, and for multiclass depression level prediction, the distribution of severity levels should be balanced. In our dataset, 52 and 61 participants were in the healthy and depressed patient groups, respectively. Fig. 1(a) shows the distribution of the depression levels in our dataset. In contrast to the distribution of audio duration, the number of words in a sentence for the control and experimental groups was not significantly different (p > 0.05). To identify patterns between the healthy and depressive groups, we generated word clouds, with frequently used words shown in a larger font in Fig. 1(f) and Fig. 1(g). The word cloud of the depressive group reveals that these individuals were more likely to use negative phrases such as 'difficult to fall asleep,' 'being in a bad mood,' and 'not good' during the interview.
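The balancing step described above can be sketched as random undersampling of the majority class. The function below is purely illustrative; the name `balance_binary` and the fixed seed are our own and not part of the original pipeline:

```python
import random

def balance_binary(samples, labels, seed=42):
    """Randomly undersample the majority class so that positive and
    negative examples are approximately equal, mirroring the balanced
    dataset construction described above (illustrative sketch)."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    n = min(len(pos), len(neg))          # size of the smaller class
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)                # avoid class-ordered output
    return balanced
```

The same idea extends to the multiclass severity setting by sampling each severity level down to the size of the rarest level.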

IV. DATA PROCESSING
In this section, we describe the preprocessing procedure for the interview recordings, using prosodic and acoustic features, frequency-based features, and pre-trained word embeddings commonly used in depression classification studies. We segmented the raw interview audio recordings using the COVAREP toolkit, which allowed us to extract audio features at a rate of 100 Hz. To achieve this, we divided the raw recording into 10-millisecond blocks, which is common practice in speech processing [48]-[51]. We then read each interview recording from the input directory and extracted various features for each block in the recording. Specifically, we extracted features such as F0, the voiced/unvoiced (VUV) decision, NAQ, the quasi-open quotient (QOQ), H1-H2, peak slope (PSP), the modulation depth quotient (MDQ), the relative amplitude quotient (Rd), creaky voice detection, Mel-cepstral coefficients (MCEPs), and Harmonic Model + Phase Distortion (HMPD) features. By using a 10-millisecond block size and a feature rate of 100 Hz, we were able to capture the relevant acoustic information at a reasonable computational cost. Detailed descriptions of each audio feature can be found in Table IV.
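The 10-millisecond segmentation can be sketched as follows. This is an illustrative framing routine, not the COVAREP implementation; the function name and the list-based interface are our own assumptions:

```python
def frame_signal(signal, sample_rate, block_ms=10):
    """Split a waveform (a sequence of samples) into non-overlapping
    blocks of `block_ms` milliseconds, yielding one feature frame per
    block: 100 frames per second of audio for 10 ms blocks."""
    block_len = int(sample_rate * block_ms / 1000)   # samples per block
    n_blocks = len(signal) // block_len              # drop the tail
    return [signal[i * block_len:(i + 1) * block_len]
            for i in range(n_blocks)]
```

For one second of 16 kHz audio this yields 100 blocks of 160 samples each, matching the 100 Hz feature rate described above.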

A. Audio recordings preprocessing
a) Fundamental frequency F0: The fundamental frequency F0 has been investigated from various aspects. For a stationary and periodic signal, the fundamental frequency is given by the inverse of its period. However, since speech signals are non-stationary and time-variant, the position of the vocal tract can change abruptly. Therefore, the starting point of the measurement cannot be ignored, as it influences the final measurement. Previous articles have proposed different algorithms to estimate F0 from the attributes of speech signals in the time and spectral domains, while other researchers have proposed exploiting both spaces. F0 has been described in many previous studies as a biomarker of depression [52].

b) Glottal flow features: Each glottal cycle involves an open phase (O) and a closed phase (C). NAQ [55] and QOQ [56] are two features calculated from the glottal flow. NAQ is given by [57]:

NAQ = f_ac / (|d_peak| * T0),

where d_peak is the negative peak amplitude of the differentiated glottal flow pulse, f_ac is the peak amplitude of the glottal flow pulse, and T0 is the length of the glottal pulse period. QOQ is calculated from amplitude measurements of the glottal flow pulse. The quasi-open period is measured by finding the peak of the glottal flow and the time points around the peak at which the flow descends below 50% of the peak amplitude. The duration between these two time points is divided by the local glottal period to determine QOQ [57]. Besides QOQ and NAQ, other glottal features have been commonly used in previous articles on automatic depression detection [57], [58].
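Under the definitions above, NAQ and a simplified QOQ can be sketched as follows. The function names and the single-pulse interface are illustrative assumptions, not the COVAREP implementation:

```python
def naq(f_ac, d_peak, t0):
    """Normalized amplitude quotient: peak glottal-flow amplitude
    divided by the magnitude of the negative peak of the differentiated
    flow and the glottal period, per the formula above."""
    return f_ac / (abs(d_peak) * t0)

def qoq(flow, t0_samples):
    """Quasi-open quotient for one glottal pulse (a sequence of flow
    samples): locate the peak, walk outward to where the flow drops
    below 50% of the peak, and divide that duration by the local
    glottal period. Simplified single-pulse sketch."""
    peak_idx = max(range(len(flow)), key=lambda i: flow[i])
    half = 0.5 * flow[peak_idx]
    left = peak_idx
    while left > 0 and flow[left - 1] >= half:
        left -= 1
    right = peak_idx
    while right < len(flow) - 1 and flow[right + 1] >= half:
        right += 1
    return (right - left + 1) / t0_samples
```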

B. Transcripts preprocessing 466
Our proposed dataset includes 113 transcripts in comma-separated value files, with five fields per transcript: "bg," "ed," "speaker," "value," and "words list." The "bg" and "ed" fields indicate the start and end of one sentence captured by the transcription algorithm. The "value" field is the sentence recognized and transcribed by the algorithm, and the "words list" field is the sentence tokenization. The "value" and "speaker" fields may contain errors due to environmental noise or a lack of pause between the psychiatrist and patient. After the transcriptions were verified against the audio recordings by research assistants, the sentences in the transcripts were tokenized using Jieba, a Chinese tokenization library. The transcripts were then divided into healthy and depressive groups based on the physicians' diagnosis after removing stop words such as "if" and "too."
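The five-field transcript layout described above can be parsed with the standard library. This is a minimal sketch; the sample row below and the `tokens` key are invented for illustration only:

```python
import csv
from io import StringIO

def load_transcript(csv_text):
    """Parse one transcript file with the five fields described above
    ('bg', 'ed', 'speaker', 'value', 'words list') into a list of
    sentence records (illustrative sketch)."""
    reader = csv.DictReader(StringIO(csv_text))
    rows = []
    for row in reader:
        rows.append({
            "bg": float(row["bg"]),               # sentence start time
            "ed": float(row["ed"]),               # sentence end time
            "speaker": row["speaker"],            # physician or patient
            "value": row["value"],                # transcribed sentence
            "tokens": row["words list"].split(),  # tokenized sentence
        })
    return rows
```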

V. BASELINE MODELS

We established baseline models and performance metrics to detect depression and classify its severity on our proposed dataset. The first frequency-based text feature is TF-IDF (term frequency-inverse document frequency). It is used to determine the relative importance of words or tokens in a document and is widely used for information retrieval and text mining.
The count vectorizer builds a vocabulary by scanning all transcripts and transforming each document into a matrix of token counts. Let X = (x_1, x_2, . . ., x_s), X ∈ R^{n×s}, where each column vector x_i contains the token frequencies of subject S_i, s is the number of subjects, and n is the size of the vocabulary.
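A minimal count vectorizer matching the description above might look like this. It is a sketch for exposition; a real pipeline would typically use an optimized library implementation:

```python
from collections import Counter

def count_vectorize(documents):
    """Build a vocabulary over all tokenized documents and return one
    token-count vector per document, mirroring the count vectorizer
    described above (minimal sketch)."""
    vocab = sorted({tok for doc in documents for tok in doc})
    matrix = []
    for doc in documents:
        counts = Counter(doc)
        # one row per document, one column per vocabulary token
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return vocab, matrix
```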
After generating the frequency-based text features, we trained multinomial Bayes classifiers to identify depression and predict its severity for each subject. We selected the multinomial classifier for its ability to handle count-based numerical features and used it to calculate the probability of depression presence and level from each subject's feature vector.
The probability that a given example x belongs to class C_i is given by Bayes' rule:

P(C_i | x) = P(x | C_i) P(C_i) / P(x).

The Bayes classifier therefore selects the class with the maximum a posteriori (MAP) probability for any example x, i.e., ĉ = argmax_i P(C_i | x).
However, x is a high-dimensional feature vector, which makes directly computing the likelihood P(x | C_i) difficult. An approximation is adopted to reduce the computational cost: the features are assumed to be conditionally independent given the class C_i, i.e.,

P(x | C_i) = ∏_{j=1}^{n} P(x_j | C_i).

These conditional probabilities can prove unreliable when a word is missing from the training set; regardless of its label, the product of conditional probabilities becomes zero. Zero conditional probabilities are avoided by adopting a smoothed estimate instead of computing p(x_j | C_i) directly:

p̂(x_j | C_i) = (N_ij + α) / (N_i + α n),

where N_ij is the count of token x_j in class C_i, N_i is the total token count in class C_i, and α is the smoothing parameter.

Binary classification text model (depression vs. healthy): We trained and evaluated multinomial Bayes models using nested cross-validation on the training set collected for this study. During the hyperparameter fine-tuning phase, we optimized the classifier parameters to achieve the highest F1 score.
To eliminate unreliable estimates and zero conditional probabilities, we optimized the parameter α for the multinomial Bayes classifiers. We expected the conditional probability of a word that only exists in the test set to be close to zero; therefore, α was varied over 10^k with k = 0, . . ., 3. The best parameter α was determined to be 1.0 after cross-validation, and the best micro-average F1 score was 0.85 in cross-validation and 0.91 on the test set. Details of the other metrics can be found in Table V.

1) Audio feature preprocessing: Histograms were used to describe the statistical characteristics of the audio features. To determine the number of bins, we used the Freedman-Diaconis rule, which calculates the bin width that minimizes the difference between the area under the empirical data distribution and that of the theoretical data distribution [63]. The hyperparameter N for each audio feature was determined in advance using nested cross-validation. Specifically, we tested values of N in the range {5, 10, 15, 20, 25, 30, 35} and recorded the value that resulted in the best cross-validation score on the training partition of the original dataset.
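The Freedman-Diaconis bin count can be computed from its standard bin-width formula, h = 2 · IQR · n^(-1/3). The helper below is an illustrative sketch; the quantile interpolation scheme is one of several common choices:

```python
import math

def fd_bins(data):
    """Number of histogram bins via the Freedman-Diaconis rule:
    bin width h = 2 * IQR * n^(-1/3), then bins = range / h."""
    xs = sorted(data)
    n = len(xs)
    def quantile(q):
        # linear interpolation between order statistics
        pos = q * (n - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])
    iqr = quantile(0.75) - quantile(0.25)
    h = 2 * iqr / n ** (1 / 3)
    if h == 0:                     # degenerate data: single bin
        return 1
    return max(1, math.ceil((xs[-1] - xs[0]) / h))
```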

2) Binary classification audio model (depression vs. healthy):
XGBoost is an open-source project that implements a tree-based gradient-boosting algorithm. The XGBoost model has several useful properties: it is an ensemble learning method, which decreases the bias of the model, and it is a tree-based model with high interpretability, which aids in determining each feature's importance in making an inference. These models also offer a good trade-off between computational cost and accuracy. Tree-based boosting methods solve many machine learning problems efficiently and accurately, making them good candidates for providing baseline results on our dataset. To train our XGBoost classifiers, we created a separate model for each audio feature. We excluded certain features, such as HMPDM 0 to HMPDM 3, since they remained constant throughout the interview. To reach a final decision, we applied a majority voting algorithm to the outputs of the individual classifiers. To optimize our models, we fine-tuned the parameters by maximizing the F1 score, which weighs precision and recall equally. For each XGBoost classifier, we tuned several parameters, including the learning rate, the maximum depth of the tree, and the number of estimators.

To identify the optimal hyperparameters, we conducted a grid search on the training set, selecting models with high precision and recall. In nested cross-validation, we achieved a best micro-average F1 score of 0.81, which improved to 0.87 on the test set. Further details on other metrics can be found in Table VII.
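The per-feature voting scheme described above can be sketched as follows. The `models` interface, with one predict function per audio feature, is a hypothetical illustration rather than our actual training code:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine binary decisions from per-feature classifiers into one
    label by majority vote (ties broken toward the positive class
    here; an illustrative choice)."""
    counts = Counter(predictions)
    return 1 if counts[1] >= counts[0] else 0

def ensemble_predict(models, feature_vectors):
    """One classifier per audio feature: each model votes on the same
    subject and the majority decides. `models` maps a feature name to
    a predict function (hypothetical interface)."""
    votes = [models[name](vec) for name, vec in feature_vectors.items()]
    return majority_vote(votes)
```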
3) Depression level classification: In our investigation of the relationship between depression severity and audio features, we trained depression severity prediction models using the top-N elements method. This method transformed variable-length audio features into fixed-length vectors, which were then used in our analysis. We trained and evaluated each model on the training and validation sets, testing different parameters to optimize performance. The highest-performing model, which we chose for our study, achieved an F1 score of 0.52 in cross-validation and 0.55 on the test set. Results of the fine-tuned baseline models can be found in Table VIII.

4) Multimodality baseline models:
To enhance the model's ability to assess depression, we employed late fusion to combine the outputs of the acoustic and semantic models. Our multimodality baseline models produced an output through a linear combination of the acoustic and semantic model outputs. As reported in Table V and Table VII, the depression detection accuracies of the acoustic-only and semantic-only models were 0.82 and 0.81, respectively. For depression-level classification, the accuracies of the semantic-only and acoustic-only models were 0.62 and 0.61, respectively, as shown in Table VI.

VI. STATISTICAL ANALYSIS

We performed two types of comparisons: inter-condition and intra-condition comparisons. Inter-condition comparisons evaluate differences in audio features between the healthy and depressive groups, such as whether participants from the control (healthy) and experimental (depressive) groups differ in vocal fundamental frequency (F0) as processed by the top-N elements method. Intra-condition comparisons evaluate the variability of a patient's audio features relative to their severity of depression. Together, these comparisons examine whether there were significant differences in the distributions of these features between the depressed and non-depressed groups, as well as between different levels of depression.

1) Fundamental frequency (F0): F0 is used to describe the periodicity of speech. Our analysis found that the inter-condition effect is present for female participants for F0. The median F0 of the healthy control group is lower than that of participants from the depressive group, as shown in Fig. 2. This is in line with the conclusion reached by Mundt et al. that the healthy control group had a lower F0 than the depressive group [64]. The variances of F0 between the two groups were compared using a Welch t-test, and the variance of F0 of the healthy group was found to be significantly greater than that of the depressive group (p < 0.01). However, F0 was not a significant audio feature for male participants.
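The Welch t-test used for these group comparisons can be sketched directly from its defining formulas: the statistic from the two sample means and variances, and the Welch-Satterthwaite degrees of freedom. The p-value lookup is omitted for brevity:

```python
import math

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples with
    unequal variances; returns (t, df). Minimal sketch without the
    p-value computation."""
    def mean(x):
        return sum(x) / len(x)
    def var(x):
        m = mean(x)
        return sum((v - m) ** 2 for v in x) / (len(x) - 1)
    na, nb = len(a), len(b)
    va, vb = var(a) / na, var(b) / nb      # per-sample variance terms
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```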
2) Mel-Cepstrum Coefficient (MCEP): MCEP was included in our model as it has been demonstrated to be effective for characterizing speech content [65]-[67]. In our research, we conducted a Welch t-test to determine whether MCEP audio features differ significantly with depression presence and severity. For the binary classification task (depressive vs. healthy), we identified MCEP values that were significantly different between the healthy and depressive groups. The box plots in Fig. 3 and 4 confirm that MCEP 0 can be used in both the male and female groups as a criterion to distinguish potentially depressed patients. However, some high-order MCEPs (such as MCEP 8, MCEP 13 and MCEP 18) had overlapping values between the healthy and depressive subjects. MCEP 0 was significantly different across the healthy, mild, moderate and severe depression groups, as shown in Fig. 3 and 4. Therefore, MCEP 0 may be a gender-independent factor in distinguishing depression presence and severity.
3) Harmonic model and phase distortion mean (HMPDM): Several reports have shown that HMPDMs can be used to predict depression presence and severity [68]-[71]. In our research, we conducted a Welch t-test to determine whether HMPDM values in the healthy and depressive groups were significantly different. The significance level was set at 1% for the HMPDM audio features. For the binary classification (depressive vs. healthy), the higher-order HMPDMs, such as HMPDM 17, were found to be significantly different between the healthy and depressive subjects (see Fig. 5), suggesting that HMPDMs may play a key role in predicting depression. Additionally, the variance of HMPDM 17 increased in participants suffering from depression, while the median of HMPDM 17 was higher in healthy subjects. In depression severity classification, HMPDM 17 of female participants was a reliable indicator for predicting depression levels. For example, Fig. 6 shows that the healthy group has a higher median HMPDM 17.

B. Model interpretation

To demonstrate the reliability of the predictions made by the model and to gain further insight into the factors that affect depression diagnosis, we present the graphical contribution of audio features to the prediction process. Our model outputs a depression probability together with an explanation, which shows the series of features that increased (red) and decreased (blue) the depression risk. Based on professional diagnosis by clinicians, we divided the dataset into two categories: healthy and depressive. The audio features were extracted and processed using the method described in Section V-C. The original dataset was split into training (80%) and test (20%) sets. We trained an XGBoost binary classification model with the optimal parameters obtained in Section V-C.2. The output of the binary classifier provided the depression probability of the participant. An explanation of our model represents the contributions of interpretable groups of audio features. These contributions explain how the model makes a prediction, making it possible for psychiatrists to reach a final diagnosis. In Section VI-A, we only investigated the difference between each preprocessed audio feature (processed by the neighbourhood top-N elements method) in the healthy and depressive groups. Without a meaningful explanation, the output probability of the model may be difficult to interpret. By presenting the depression probability as a cumulative process, the reason for the prediction becomes clearer.
The increases in the depression probability of the test examples shown in Fig. 7-10 are driven by audio features. The probability explanation bars in Fig. 7-10 contain red features that push the probability higher (to the right) and blue features that push it lower (to the left). Audio features are sorted by the magnitude of their contribution, and the features with the largest contributions are labelled. From this representation, we can conclude that most audio features have a small impact, while a few are responsible for driving the depression probability. Instead of feeding the model only hand-picked important features, we allow the model to select the features it finds effective, meaning that the model may select unforeseen features that prove effective for depression prediction. For some of these high-contribution features, it is worthwhile to investigate further how they relate to depression risk. High-contribution features could be used to quickly alert psychiatrists to implicit signals, since they are likely to be a proxy for potential negative mental states or emotions.
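The cumulative presentation of the depression probability can be sketched as an additive walk in log-odds space: each feature contribution shifts the logit, and the running probability is recorded after each step. The base value and contributions below are hypothetical inputs, not outputs of our model:

```python
import math

def explain_probability(base_logit, contributions):
    """Present a prediction as a cumulative process: start from a base
    log-odds value, add per-feature contributions sorted by magnitude,
    and record the running probability after each step. Positive
    contributions push the probability up (red), negative ones push it
    down (blue). Illustrative sketch of the additive-explanation idea."""
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))
    ordered = sorted(contributions.items(), key=lambda kv: -abs(kv[1]))
    logit = base_logit
    trace = [("base", sigmoid(logit))]
    for name, contrib in ordered:
        logit += contrib
        trace.append((name, sigmoid(logit)))
    return trace
```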

VII. CONCLUSION
Open datasets are valuable for both the research and clinical communities. Collecting and annotating clinical interviews with professional diagnoses is labour-intensive and requires expertise in psychology. On this authentic depression interview corpus, we conducted statistical analysis with two different methods. Subjects were divided into groups based on depression severity, and intragroup feature analyses were completed. We confirmed that acoustic features such as F0, MCEP 0, and high-order HMPDMs significantly impact the ability to distinguish between depressed and healthy individuals. Moreover, a novel visualization method is presented to illustrate the high-impact audio features in depression detection, which sheds light on the black-box nature of our proposed models and provides a reference for physicians. We anticipate that the release of this dataset will motivate additional researchers to work on new models for automated depression diagnosis in Chinese. We hope our dataset can also become a benchmark against which other researchers compare the performance of their models, and a supplement to datasets collected under controlled lab settings. We envision this dataset providing mental health researchers and professionals with greater insight into AI for mental healthcare.
collected over the internet, including 242 video recordings and labels of 10 symmetrical and 4 asymmetrical action units (AUs), head movements, smiles, feature-tracker confidence, as well as gender and facial landmarks. Dhall et al. proposed a dataset of 4886 images collected in real-world situations, labelled with happiness intensity. Moore et al. proposed recording the voice of each subject while reading a short story [26]. Yingthawornsuk et al. also provided a solution for acquiring recordings: two-part interviews between participants and clinicians consisting of an audio recording session, after which the participants were asked to read a selected section of a book [27]. Cohn et al. proposed obtaining facial images from clinical patients in a study of 57 participants [28]. The facial activity and voice of each participant were recorded simultaneously during the interview session with the participant's permission.
) were recruited for a psychology study with informed consent, with ages ranging from 15 to 65. In addition, we developed a data management plan that outlined procedures for storing, protecting, and sharing data to ensure that data quality and privacy were maintained throughout the data collection process. Clinicians were not aware of the mental health condition of the examined subject in advance. Interviews were conducted in Mandarin and lasted 5-10 minutes [13]. Clinicians had the flexibility to adjust the order of questions within the MADRS questionnaire, and they also allowed patients to discuss other related topics. This approach allowed us to gather more comprehensive and individualized data on each participant. Audio was recorded in real time at a 48 kHz sampling rate and 128 kbps bitrate in mono-channel MP3 format. The study was approved by the ethics committee of Wenzhou Kangning Hospital (No. AF/SQ-02/01.0).

B. Procedure

After obtaining verbal consent and a signed form for recording, the MADRS questionnaire interview was conducted by clinicians in Chinese. The MADRS, consisting of 10 items rated on a 6-point scale, evaluates core depression symptoms, with a maximum possible score of 60 points. Scores between 7 and 19 indicate mild depression, scores between 20 and 34 indicate moderate depression, and scores above 34 indicate severe depression [13]. The questionnaire focused on the ten critical symptoms in Table II.

Fig. 1(b) and Fig. 1(d) illustrate the distribution of audio duration in the healthy and depressive groups. The average audio duration for the depressive population was significantly longer than that of the healthy population. The distribution of utterance length is shown in Fig. 1(c) and Fig. 1(e).
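The MADRS severity bands above map directly to code. A small sketch follows; treating totals of 6 or below as non-depressed is our reading of the scale here, not a threshold stated explicitly in the protocol.

```python
def madrs_severity(total: int) -> str:
    """Map a MADRS total (0-60) to the severity bands used in this study."""
    if not 0 <= total <= 60:
        raise ValueError("MADRS total must be in [0, 60]")
    if total <= 6:
        return "none"      # assumption: totals below the mild band
    if total <= 19:
        return "mild"      # 7-19
    if total <= 34:
        return "moderate"  # 20-34
    return "severe"        # 35-60

print(madrs_severity(25))  # moderate
```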

1) Audio transcription: The iFlyTek API was used to transcribe the audio recordings, which were then reviewed by research assistants majoring in psychiatry. Before automatic transcription, raw audio files larger than 10 MB were divided into smaller data blocks, as required by the transcription algorithm. The transcript blocks were then merged sequentially using each block's unique ID to create the final transcript. The raw transcription results were in JavaScript Object Notation (JSON) format, containing fields such as the timestamp of each sentence, the sentence content, speaker identification, and sentence tokenization. The speaker identification helped to isolate the patients' responses in the raw audio. The sentence timestamps, indicating each sentence's start and end points, were used to extract the patients' audio clips and vocal features in later experiments. The tokenization result was used as input for the frequency vectorizer. These JSON objects were parsed using Python's built-in json package and converted into comma-separated values (CSV) files.

2) Audio feature extraction: We used the COVAREP toolkit to capture the frame-level acoustic features [14]. COVAREP is an open-source feature extraction toolkit commonly used
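The block-merge and JSON-to-CSV conversion described in the transcription step might look like the following sketch; the field names (`id`, `begin`, `end`, `speaker`, `content`, `tokens`) are illustrative stand-ins, not the exact iFlyTek schema.

```python
import csv
import json

# Two hypothetical transcript blocks, listed out of order to show the ID merge.
raw_blocks = [
    json.dumps({"id": 1, "sentences": [
        {"begin": 2.4, "end": 4.0, "speaker": "clinician",
         "content": "持续多久了", "tokens": ["持续", "多久", "了"]}]}),
    json.dumps({"id": 0, "sentences": [
        {"begin": 0.0, "end": 2.1, "speaker": "patient",
         "content": "最近睡不好", "tokens": ["最近", "睡", "不好"]}]}),
]

# Merge blocks sequentially by their unique ID, then flatten to CSV rows.
blocks = sorted((json.loads(b) for b in raw_blocks), key=lambda b: b["id"])
rows = [s for b in blocks for s in b["sentences"]]

with open("transcript.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["begin", "end", "speaker",
                                           "content", "tokens"])
    writer.writeheader()
    for s in rows:
        writer.writerow({**s, "tokens": " ".join(s["tokens"])})

print(rows[0]["speaker"])  # patient (block 0 comes first after the merge)
```

The `speaker` column supports isolating patient responses, and the `begin`/`end` timestamps support cutting the corresponding audio clips.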

Fig. 1: The proposed dataset contains 113 individuals. (a) 51 are healthy and 62 are patients with depression; of the patients with depression, 9 have mild depression, 34 have moderate depression, and 19 have severe depression. (b) The distribution of audio duration between the healthy and depressive subjects. (c) The distribution of utterance length between the healthy and depressive groups. (d) The distribution of audio duration across four depression levels. (e) The distribution of utterance length across four depression levels. (f), (g) The word clouds of the healthy (above) and depressive (below) groups. (h), (i) The word clouds in English. Negative words and phrases, such as "difficult to fall asleep" and "bad mood", appear in the word cloud below.

Fig. 2: The audio feature F0 of female subjects, comparing healthy vs. depressive, healthy vs. mild, healthy vs. moderate, and healthy vs. severe. (HLTY: healthy; MDD: major depressive disorder; M: mild depression; MT: moderate depression; SE: severe depression)
Table I compares existing data from social networks and clinical interviews.

TABLE I: A Comparative Study of the Proposed Dataset and Datasets Employed in the Reviewed Studies for Depression Detection

TABLE II: The Questionnaire Used During Interview

Lassitude: Do you feel like you don't want to do anything?
Inability to feel: Do you feel that everything has nothing to do with you?
Pessimistic thoughts: Do you feel inferior or self-blaming?
Suicidal thoughts: Have you ever thought of self-harm or suicide?
-[54], which exhibited strong discriminating power in distinguishing depression and other mental disorders. Compared with the fundamental frequency F0, the glottal flow features have received less attention in previous studies of depression and mental disorders. Moore et al. demonstrated that several glottal flow features exhibited significant separation between control and healthy groups [26]. Speech production lasts several glottal cycles, with each cycle

TABLE III: Summary of Dataset Characteristics (demographic characteristics of subjects categorized as depressed and subjects categorized as healthy)

A. Experimental Setting

This study focused on detecting depression and predicting disease severity. To achieve this, the dataset was split into independent training and test sets. The training set consisted of 41 healthy individuals and 49 individuals with depression, while the test set included 10 healthy individuals and 13 individuals with depression. As shown in Table I, most clinical datasets comprise 100 to 200 data points, due to the high cost of data collection.
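The 80/20 split above can be reproduced with a stratified hold-out. In this sketch the labels are synthetic (51 healthy, 62 depressed, matching Fig. 1), and scikit-learn is an assumption on our part, since the paper does not name its splitting tool.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 51 + [1] * 62)      # 0 = healthy, 1 = depressed (113 total)
X = np.arange(len(y)).reshape(-1, 1)   # stand-in for per-subject feature vectors

# Stratification preserves the healthy/depressed ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(len(y_train), len(y_test))  # 90 23
```

With 113 subjects, a 20% test fraction yields 23 test subjects and 90 training subjects, matching the counts reported above.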

TABLE IV: COVAREP Spectral and Cepstral Feature Set

and Table VIII. During cross-validation and on the test set, our multimodality depression detection model (accuracy = 0.86, see Table IX) and multimodality depression-level classification model (accuracy = 0.63, see Table X) produced fewer errors

TABLE V: Cross-Validation and Testing Results of the Text Depression Detection Model (text modality; binary classification; nested cross-validation; testing)

TABLE VII: Cross-Validation and Testing Results of the Audio Depression Detection Model

than the acoustic-only and semantic-only models.

VI. FEATURE STATISTICS

A. Audio features

Our dataset analyzed recordings from 113 clinically supervised participants, resulting in two different comparisons.

TABLE IX: Cross-Validation and Testing Results of the Multimodality Depression Detection Model

TABLE X: Cross-Validation and Testing Results of the Multimodality Depression Level Prediction Model

The contribution of each audio feature to the prediction is estimated by comparing the output of the model when the feature is included or excluded. However, it is important to note that a feature contribution does not demonstrate causality and does not represent a final diagnosis of depression. It enables doctors to make more informed diagnoses by showing which audio features contribute most to the generated depression prediction.
provide enough information for a doctor to understand how the prediction was made. To provide a more clinically meaningful explanation, audio features such as F0, MCEP, and HMPDM may be more informative. Generally, explaining how a prediction was made limits the models we can use, so we chose to adopt SHapley Additive exPlanations (SHAP), proposed by Lundberg et al. [76]. This approach allows us to understand the