Improving Face Alignment Accuracy on Clinical Populations and its effect on the Video-based Detection of Neurological Diseases

Abstract—Automatic facial landmark localization is an essential component in many computer vision applications, including the video-based detection of neurological diseases. Machine learning models for facial landmark localization are typically trained on faces of healthy individuals, and model performance is inferior when applied to faces of people with neurological diseases. Fine-tuning pre-trained models with representative images significantly improves performance on clinical populations. However, questions remain regarding the characteristics of the database used to fine-tune the model and the clinical impact of the improved model. We employed the Toronto NeuroFace dataset – a dataset consisting of videos of Healthy Controls (HC), individuals Post-Stroke, and individuals with Amyotrophic Lateral Sclerosis (ALS) performing speech and non-speech tasks, with thousands of manually annotated frames – to fine-tune a well-known deep learning-based facial landmark localization model. The pre-trained and fine-tuned models were used to extract landmark-based facial features from videos, and the facial features were used to discriminate clinical groups from HC. Fine-tuning a facial landmark localization model with a diverse database that includes HC and individuals with neurological disorders resulted in significantly improved performance for all groups. Our results also demonstrated the clinical significance of such improvement. Specifically, fine-tuning the model with representative data greatly improved the ability of the subsequent classifier to separate clinical groups from HC using videos.


I. INTRODUCTION
Facial alignment (FA) refers to the use of machine learning models and algorithms for the automatic localization of pre-defined landmarks in facial images or videos [1]. FA is an important first step in many computer vision applications, including face recognition [2], [3], emotion detection [4]–[6], human-computer interaction [7], [8], pain detection [9]–[12], and the detection of neurological diseases and motor disorders [13]–[30]. Many of these applications rely on pre-trained FA models for the localization of facial landmarks. These models are typically trained using large databases of manually or semi-automatically annotated facial images [1]. These databases often consist of thousands of photographs with a large variety of poses, expressions, illumination, backgrounds, and scales. Thus, pre-trained FA models are designed to provide accurate facial landmark localization under general conditions [31]–[36].
Challenges remain when applying pre-trained FA models to photographs from clinical populations [29], [37]. For instance, we recently demonstrated that pre-trained FA models perform better on healthy individuals than on those with neurological diseases such as Alzheimer's disease, stroke, and ALS, or motor disorders such as facial palsy [24], [29], [30], [37]–[39]. This phenomenon, known as algorithmic bias, is attributed to the lack of representative data in the databases used to train FA models [30], [37], [40]–[42]. One approach to mitigating the bias of a pre-trained FA model is to fine-tune the model with representative data using transfer learning techniques [43], [44]. Recently, we demonstrated that fine-tuning a pre-trained, deep neural network-based FA model with manually annotated photographs of patients with facial palsy significantly improved the model performance on individuals with the same clinical condition. Further, we demonstrated that it is possible to eliminate the model bias against individuals with facial palsy by fine-tuning the model with 320 manually annotated photographs from 40 patients [30]. Similarly, we observed a significant improvement in FA model performance on older adults with dementia after fine-tuning a deep neural network-based model with 688 manually annotated representative photographs [37]. Improved performance of a pre-trained FA model was also observed in individuals post-stroke (PS) and in individuals with ALS after fine-tuning the model with 1371 and 920 manually annotated photographs, respectively [39]. Furthermore, we observed that fine-tuning an FA model with 1015 images of age-matched healthy controls (HC) significantly improved the model performance when applied to photographs of individuals with neurological diseases. However, the improvement was smaller than when fine-tuning the model with representative clinical data [39].
Our previous results showed that fine-tuning an FA model with representative clinical data improves the model performance on that clinical population. They also showed that fine-tuning an FA model with data from age-matched HC improves the model performance on clinical populations recorded under the same conditions. Thus, the logical next step is to determine whether fine-tuning a model with representative clinical data from multiple clinical groups and age-matched HC leads to improved model performance on both the clinical and non-clinical groups. This research question may have important clinical implications: collecting data from HC is typically straightforward, whereas collecting data from patients is often difficult and time-consuming, especially for rare diseases such as ALS. Moreover, while each individual is affected differently by a neurological disease, there are many similarities in the way that many neurological diseases manifest in the orofacial musculature and function, including muscle weakness and facial asymmetries. Based on these observations, we hypothesized that fine-tuning an FA model with data from multiple patient populations and age-matched HC can improve model performance on all clinical and non-clinical groups.
Furthermore, despite significant efforts to improve FA model performance on clinical populations, there is no quantitative evidence that improved accuracy in landmark localization leads to improved computer-aided diagnosis of neurological diseases from video-based monitoring. We have shown that, using pre-trained FA models, it is possible to differentiate age-matched HC from individuals PS with an accuracy of 87% [13], and age-matched HC from individuals with ALS with an accuracy close to 89% [15]. Based on these results, and on the improved performance provided by fine-tuned FA models on clinical populations, we hypothesized that better diagnosis of neurological diseases from video-based monitoring would be achieved by applying FA models fine-tuned with representative data, as compared to pre-trained FA models.
The specific objectives of this paper are to: i) fine-tune a deep neural network-based FA model with a database of manually annotated photographs of patients from different clinical populations with neurological diseases affecting orofacial function, and of age-matched healthy controls; and ii) assess the influence of the fine-tuned FA model on the computer-aided diagnosis of neurological diseases from video-based monitoring. For these goals, we used facial videos of healthy controls and of individuals from two clinical populations – PS and ALS – performing a set of speech and non-speech tasks commonly used during clinical orofacial examinations [45], [46]. A subset of video frames was manually annotated and used to fine-tune a well-known pre-trained FA model. The pre-trained and fine-tuned FA models were used to estimate landmark-based facial features, and these features were used to automatically differentiate the clinical groups from HC. Fig. 1 summarises the methods and research questions investigated in this paper. The diagram presents the three stages of our pipeline: data acquisition and preprocessing, fine-tuning an FA model with representative data, and automatic detection of neurological disease from landmark-based facial features and video-based monitoring.

II. METHODS

A. Toronto NeuroFace dataset
The Toronto NeuroFace dataset [39] – a novel, open-access dataset for facial analysis in individuals with neurological diseases – was used in this study. Here, we provide a brief description of the database, experimental setup, and tasks.
1) Participants: The dataset includes HC participants, individuals PS, and individuals with ALS. All participants passed a cognitive screening [47] and a hearing screening. Table I presents the demographics and clinical summary of the participants. The study was approved by the Research Ethics Boards at the Sunnybrook Research Institute and the University Health Network: Toronto Rehabilitation Institute. All participants signed informed consent according to the requirements of the Declaration of Helsinki.
2) Experimental Setup: Participants were seated in front of an Intel RealSense™ depth camera (SR300 or D400) at a face-to-camera distance between 30 cm and 60 cm. A continuous light source was placed adjacent to the camera to provide uniform illumination. Participants were asked to look directly at the camera and were recorded during the execution of standard speech and non-speech tasks used during clinical orofacial examinations. A video comprising color (RGB) and depth information was recorded for each task. Both streams were recorded synchronously at approximately 50 frames per second at VGA resolution (640×480 pixels). A total of 332 videos were included in the database: 108 from HC participants, 113 from individuals PS, and 111 from individuals with ALS.
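For context, the sketch below shows how synchronized color and depth streams can be captured from a RealSense camera through the pyrealsense2 API. The stream settings are illustrative assumptions: the study recorded at approximately 50 fps, and the exact modes available depend on the camera model.

```python
import pyrealsense2 as rs

# Configure synchronized color and depth streams at VGA resolution.
# The 60 fps setting is illustrative; supported modes depend on the
# camera model (SR300 or D400 series).
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 60)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 60)
profile = pipeline.start(config)

# Align depth frames to the color frames so that every color pixel has
# a corresponding depth value.
align = rs.align(rs.stream.color)
try:
    frames = align.process(pipeline.wait_for_frames())
    color_frame = frames.get_color_frame()
    depth_frame = frames.get_depth_frame()
finally:
    pipeline.stop()
```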
3) Tasks: Participants were asked to perform a set of speech and non-speech tasks commonly used during clinical orofacial examinations [45], [46]. The tasks included 10 repetitions of the sentence "Buy Bobby a Puppy" at a comfortable speaking rate and loudness (BBP); repetitions of the syllable /pa/ as fast as possible on a single breath (PA); repetitions of the syllables /pataka/ as fast as possible on a single breath (PATAKA); puckering the lips 5 times (BLOW); pretending to kiss a baby 5 times (KISS); maximum opening of the mouth 5 times (OPEN); pretending to smile with tight lips 5 times (SPREAD); and resting with a neutral expression (REST).

B. FA model fine-tuning

We fine-tuned a well-known, deep-learning-based FA model using manually annotated representative clinical data [29], [30], [37], [39]. Next, we briefly describe the model, the manual annotation procedure, and the approach used to fine-tune the model and evaluate its performance.
1) Pre-trained FA model: The pre-trained FA model corresponds to the Face Alignment Network (FAN), a convolutional neural network (CNN) trained with more than 230,000 photographs [35]. The FAN model consists of an initial face detection stage that returns a 256×256-pixel image centered on the face. The face-centered image is then down-sampled into a set of 256 feature maps of dimensions 64×64 and passed into four stacked hourglass networks – an architecture commonly used for face and body landmark localization [48]–[50] – that transform the feature maps into a set of 68 heatmaps of dimensions 64×64. Each heatmap provides the estimated position of one facial landmark. The pre-trained FAN model and a Python API are freely available online (https://github.com/1adrianb/face-alignment).
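For illustration, a minimal sketch of landmark extraction with the publicly available face-alignment package is shown below. Class and enum names follow recent releases of the package and may differ between versions; the file name is hypothetical.

```python
import cv2
import face_alignment

# 2D landmark model; pass device='cuda' to run on a GPU instead.
fa = face_alignment.FaceAlignment(face_alignment.LandmarksType.TWO_D,
                                  device='cpu')

frame = cv2.imread('frame_0001.png')                # one video frame (BGR)
frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # the API expects RGB

# Returns a list with one (68, 2) array of [x_2d, y_2d] per detected face,
# or None if no face is found.
landmarks = fa.get_landmarks(frame_rgb)
if landmarks is not None:
    print(landmarks[0].shape)  # (68, 2)
```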
2) Manual annotations: A set of 4340 frames (1435 for HC, 1478 for PS, and 1427 for ALS) was extracted from the videos. The extracted frames were intended to capture a wide range of facial gestures during task execution. Additional details regarding frame selection can be found in [39].
The locations of the 68 facial landmarks defined by the Multi-PIE 2D configuration [51] – covering the eyebrows, eyes, nose, mouth, and jawline – were manually annotated by a trained annotator in each extracted frame. The manually annotated facial landmarks were considered the ground-truth positions.
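For reference, the 68-point Multi-PIE/iBUG scheme assigns fixed index ranges to each facial region. The zero-based grouping below is the commonly used convention; consult [51] for the exact per-point definitions used during annotation.

```python
# Zero-based index ranges of the 68-point Multi-PIE/iBUG landmark scheme.
LANDMARK_GROUPS = {
    'jawline':       range(0, 17),   # 17 points along the jaw
    'right_eyebrow': range(17, 22),
    'left_eyebrow':  range(22, 27),
    'nose':          range(27, 36),  # bridge and nostrils
    'right_eye':     range(36, 42),
    'left_eye':      range(42, 48),
    'outer_mouth':   range(48, 60),
    'inner_mouth':   range(60, 68),
}

def group_of(index: int) -> str:
    """Return the facial region that a landmark index belongs to."""
    for name, indices in LANDMARK_GROUPS.items():
        if index in indices:
            return name
    raise ValueError(f'landmark index out of range: {index}')
```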
3) Fine-tuning the FAN model with representative data: The parameters of the first four stages of the pre-trained FAN model were frozen and not modified during the fine-tuning process. The parameters of the last hourglass network were updated using the Toronto NeuroFace dataset. The optimization algorithm and hyper-parameters were the same as those used by Bulat and Tzimiropoulos to train the FAN model [35]. Our training algorithm used a recently introduced loss function, the adaptive wing loss, which improves model performance by penalizing small errors more heavily than the traditional squared loss [36], [52], [53]. Twelve participants from each group were randomly selected and used to train the model. Data from the remaining participants were used to test the model performance.
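A condensed sketch of this setup is shown below, assuming a PyTorch implementation of FAN whose stacked hourglass modules are accessible as a list. The attribute name `hourglass_modules`, the optimizer settings, and the data loader are illustrative assumptions; the adaptive wing loss follows the published formulation and default constants [52].

```python
import torch

def adaptive_wing_loss(pred, target, omega=14.0, theta=0.5,
                       epsilon=1.0, alpha=2.1):
    """Adaptive wing loss for heatmap regression [52], with the default
    constants from the original paper. pred/target: (B, 68, 64, 64)."""
    delta = (target - pred).abs()
    # The exponent (alpha - y) adapts the loss curvature to the target value.
    A = omega * (1.0 / (1.0 + (theta / epsilon) ** (alpha - target))) \
        * (alpha - target) * ((theta / epsilon) ** (alpha - target - 1)) \
        / epsilon
    C = theta * A - omega * torch.log1p((theta / epsilon) ** (alpha - target))
    small = delta < theta
    loss = torch.where(
        small,
        omega * torch.log1p((delta / epsilon) ** (alpha - target)),
        A * delta - C)
    return loss.mean()

# `model` (a FAN instance) and `train_loader` (annotated NeuroFace frames
# and target heatmaps) are assumed to be created elsewhere.
# Freeze everything, then unfreeze only the last hourglass module.
for p in model.parameters():
    p.requires_grad = False
for p in model.hourglass_modules[-1].parameters():
    p.requires_grad = True

optimizer = torch.optim.RMSprop(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

for images, target_heatmaps in train_loader:
    optimizer.zero_grad()
    pred_heatmaps = model(images)[-1]   # heatmaps from the last stack
    loss = adaptive_wing_loss(pred_heatmaps, target_heatmaps)
    loss.backward()
    optimizer.step()
```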
4) Evaluation of model performance: The performance of the pre-trained and fine-tuned FAN models was evaluated by computing the accuracy of landmark localization. Accuracy was computed as the Root-Mean-Squared Error (RMSE) between manually annotated and model-predicted landmark positions, normalized by the inter-canthal distance (NRMSE) [33].

5) Statistical analysis: Statistical differences between the localization errors of the pre-trained and fine-tuned models with respect to the ground-truth landmark positions were evaluated using the t-test; statistical significance was set at p < 0.01. Effect sizes were quantified with the standardized mean difference (SMD), computed as

$$\mathrm{SMD} = \frac{\mu_1 - \mu_2}{\sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}}$$

where µ1, s1, and n1 are the mean, standard deviation, and number of elements of the differences between the pre-trained FAN model predictions and the ground-truth landmark positions; and µ2, s2, and n2 are the mean, standard deviation, and number of elements of the differences between the fine-tuned FAN model predictions and the ground-truth landmark positions. For 0 < SMD < 0.5, the difference between groups is considered small; for 0.5 ≤ SMD < 0.8, medium; and for SMD ≥ 0.8, large [54], [55].
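For concreteness, a minimal numpy sketch of both metrics is given below. The inner eye-corner indices (39 and 42 in the 68-point scheme) and the per-frame NRMSE definition follow one common convention and are illustrative rather than the paper's exact implementation.

```python
import numpy as np

def nrmse(pred, truth, left_inner_eye=39, right_inner_eye=42):
    """RMSE between predicted and ground-truth landmarks for one frame,
    normalized by the inter-canthal distance (distance between the inner
    eye corners). pred, truth: (68, 2) arrays."""
    rmse = np.sqrt(np.mean(np.sum((pred - truth) ** 2, axis=1)))
    intercanthal = np.linalg.norm(truth[left_inner_eye]
                                  - truth[right_inner_eye])
    return rmse / intercanthal

def smd(x1, x2):
    """Standardized mean difference between two error samples using the
    pooled standard deviation (the formula above)."""
    n1, n2 = len(x1), len(x2)
    s_pooled = np.sqrt(((n1 - 1) * np.var(x1, ddof=1)
                        + (n2 - 1) * np.var(x2, ddof=1))
                       / (n1 + n2 - 2))
    return (np.mean(x1) - np.mean(x2)) / s_pooled
```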

C. Video-based diagnosis of neurological diseases
Diagnosis of neurological diseases was achieved by 1) manually segmenting the tasks into repetitions, 2) applying an FA model to localize the 68 facial landmarks in each video frame, 3) reconstructing the 3D, real-world coordinates of the 68 facial landmarks in each video frame, 4) extracting landmark-based facial features from each repetition, and 5) using a classification algorithm to detect the presence or absence of the disease based on the extracted features. Next, we describe these steps in detail and provide a brief description of the landmark-based features used in this study.

1) Participants and tasks: Nine participants declined to have their data shared publicly, so their recordings were not used for further analysis. Thus, data from 36 participants were used for the video-based diagnosis of neurological diseases: 11 individuals with ALS (7 female), 14 individuals PS (4 female), and 11 HC (4 female). Furthermore, only tasks common to all 36 participants were used for video-based diagnosis; the analyzed tasks were BBP, OPEN, SPREAD, and REST. Finally, video recordings where the participant did not look directly at the camera during task execution were removed from the analysis. A total of 138 videos were included for the video-based diagnosis of neurological diseases: 40 from HC participants, 54 from individuals PS, and 44 from individuals with ALS.
2) Task segmentation: All tasks, except REST, were manually segmented into individual repetitions by a trained observer; the observer identified the beginning and end of each repetition using the audio or video recordings.
3) Face alignment: The pre-trained and fine-tuned FAN models were used to automatically estimate the positions of the 68 facial landmarks in each video frame. The time series containing the [x2d, y2d] coordinates of each landmark were smoothed using a 5-point median filter.

4) Reconstruction of 3D coordinates: The color and depth streams provided by the Intel RealSense™ depth camera were aligned using the camera extrinsic parameters. Afterwards, the real-world coordinates (in mm) of each landmark were computed using a pinhole camera model with the depth information provided by the depth sensor (z3d), the 2D coordinates ([x2d, y2d]), and the intrinsic parameters provided by the camera manufacturer (sketched below). This procedure resulted in a set of [x3d, y3d, z3d] coordinates for each landmark. The origin of the 3D coordinate system was the center of the IR camera, and the x, y, and z axes were along the lateral, vertical, and frontal directions, respectively.

5) Feature extraction: For each repetition of each task, a set of features was extracted from the 3D coordinates of selected landmarks (see the feature-extraction sketch below). Different features were extracted to separate individuals PS from healthy controls and individuals with ALS from healthy controls. Features to identify individuals PS measured mouth range of motion, mouth movement velocity, and facial symmetry (13 features in total); these features have been previously described [13]. Features to identify individuals with ALS measured mouth range of motion, mouth movement velocity, overall movement of the lower lip, mouth symmetry, and the overall roundness of the lips during movement (11 features in total); these features have been previously described [15].

6) Classification: Disease detection was performed on a task-by-task basis using a random forest (RF) classification algorithm (see the classification sketch below). Twelve classification tests were conducted by combining two clinical contrasts (HC vs. PS and HC vs. ALS), three tasks (BBP, OPEN, and SPREAD), and two FA models (pre-trained and fine-tuned). Table II summarises the different RF classifiers trained in this study. The output of the RF model was the probability (a value between 0 and 1) that a given repetition was performed by an individual with the neurological disease. The probabilities of all repetitions of the same task were averaged to estimate the probability that the individual performing the task had the disease. Classification performance was evaluated using leave-one-subject-out cross-validation (LOSO-CV): for each fold, all the repetitions belonging to a single participant were used as the test set, and the RF classifier was trained with the repetitions from the remaining participants. Performance was evaluated using the receiver operating characteristic (ROC) curve and the corresponding area under the ROC curve (AUC).
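The following sketch illustrates step 4: back-projecting the 2D landmarks to 3D camera coordinates with a pinhole model. It assumes the intrinsic parameters (fx, fy, cx, cy) come from the camera firmware and that the depth image has already been aligned to the color stream.

```python
import numpy as np

def backproject(x2d, y2d, z3d, fx, fy, cx, cy):
    """Pinhole back-projection of one pixel to 3D camera coordinates (mm).
    z3d is the aligned depth value (mm) at that pixel."""
    x3d = (x2d - cx) * z3d / fx
    y3d = (y2d - cy) * z3d / fy
    return np.array([x3d, y3d, z3d])

def landmarks_to_3d(landmarks_2d, depth_mm, intrinsics):
    """landmarks_2d: (68, 2) pixel coordinates; depth_mm: (480, 640) depth
    image in mm, aligned to the color stream; intrinsics: (fx, fy, cx, cy)."""
    fx, fy, cx, cy = intrinsics
    points = []
    for x2d, y2d in landmarks_2d:
        z3d = depth_mm[int(round(y2d)), int(round(x2d))]
        points.append(backproject(x2d, y2d, z3d, fx, fy, cx, cy))
    return np.asarray(points)  # (68, 3), origin at the camera center
```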
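As an illustration of step 5, the sketch below computes a few mouth range-of-motion and velocity measures from one repetition's 3D landmark trajectories. The landmark indices and feature names are illustrative only; the actual feature definitions are given in [13] and [15].

```python
import numpy as np

# Illustrative mouth landmarks in the 68-point scheme: 51/57 are the
# mid upper/lower lip on the outer contour, 48/54 the mouth corners.
UPPER_LIP, LOWER_LIP, LEFT_CORNER, RIGHT_CORNER = 51, 57, 48, 54

def mouth_features(traj_3d, fps=50.0):
    """traj_3d: (T, 68, 3) landmark trajectories in mm for one repetition."""
    opening = np.linalg.norm(traj_3d[:, UPPER_LIP] - traj_3d[:, LOWER_LIP],
                             axis=1)            # mouth opening over time
    width = np.linalg.norm(traj_3d[:, LEFT_CORNER] - traj_3d[:, RIGHT_CORNER],
                           axis=1)              # mouth width over time
    velocity = np.abs(np.diff(opening)) * fps   # opening speed, mm/s
    return {
        'opening_range': opening.max() - opening.min(),
        'width_range':   width.max() - width.min(),
        'peak_velocity': velocity.max(),
        'mean_velocity': velocity.mean(),
    }
```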
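Finally, a minimal sketch of step 6 with scikit-learn: a LOSO-CV loop over participants, per-repetition probabilities from a random forest, averaging within each participant, and the AUC. The RF hyper-parameters are illustrative assumptions, as the text does not specify them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

def loso_auc(X, y, groups):
    """X: (n_repetitions, n_features); y: 0 = HC, 1 = clinical group;
    groups: participant ID of each repetition (all numpy arrays)."""
    probs = np.zeros(len(y))
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        # Probability that each held-out repetition belongs to the
        # clinical group.
        probs[test_idx] = clf.predict_proba(X[test_idx])[:, 1]
    # Average per-repetition probabilities within each participant.
    subjects = np.unique(groups)
    subj_probs = np.array([probs[groups == s].mean() for s in subjects])
    subj_labels = np.array([y[groups == s][0] for s in subjects])
    return roc_auc_score(subj_labels, subj_probs)
```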

III. RESULTS
A. FA model fine-tuning

Table III summarises the results of Fig. 2 and demonstrates that fine-tuning the FAN model with a database composed of manually annotated images from HC participants, individuals with ALS, and individuals PS significantly improved the model performance for all groups. In particular, for the HC participants, there was a large, significant improvement in the NRMSE.

B. Automatic detection of neurological diseases
1) Detection of stroke from video-based monitoring: Fig. 3 presents the ROC curves of the six RF classifiers trained to detect individuals PS, as described in Table II. Results obtained from the landmark-based facial features yielded by the pre-trained FA model are shown in blue, and those yielded by the fine-tuned FA model are shown in orange. Fig. 3 A) presents the results for the BBP task, B) for the OPEN task, and C) for the SPREAD task.
As Fig. 3 shows, fine-tuning the FA model with representative data improved the ability of the RF classifier to distinguish between HC participants and individuals PS for all tasks, as measured by the AUC of the ROC curve. In particular, fine-tuning the FA model improved the AUC from 0.60 to 0.85 for the BBP task, from 0.60 to 0.77 for the OPEN task, and from 0.83 to 0.92 for the SPREAD task.
2) Detection of ALS from video-based monitoring: Fig. 4 presents the ROC curves of the RF classifiers trained to detect individuals with ALS, as described in Table II. Results obtained from the landmark-based facial features yielded by the pre-trained FA model are shown in blue, and those yielded by the fine-tuned FA model are shown in orange. Fig. 4 A) presents the results for the BBP task, B) for the OPEN task, and C) for the SPREAD task.
As Fig. 4 shows, fine-tuning the FA model with representative data maintained or improved the ability of the RF classifier to distinguish between HC participants and individuals with ALS, as measured by the AUC of the ROC curve. In particular, the AUC remained at 0.87 for the BBP task, and improved from 0.83 to 0.95 for the OPEN task and from 0.83 to 0.95 for the SPREAD task.

IV. DISCUSSION
Recently, we showed that FA models can be used as part of algorithms for the video-based detection of stroke [13], ALS [15], and Parkinson's disease [23]. We also showed that fine-tuning FA models with representative data from a clinical group improves the model performance on that clinical population [30], [37], [39]. This study builds on those results and demonstrates that 1) fine-tuning an FA model with a diverse dataset composed of photographs of age-matched healthy controls and individuals with neurological diseases improves the model performance across the clinical and non-clinical groups; and 2) improved accuracy in facial landmark localization leads to improved detection of neurological diseases from videos. These results support our hypotheses and represent an important step towards the development of clinically useful tools for the automatic, video-based diagnosis of neurological diseases.

A. Innovation
This study presents important technical and clinical innovations that will likely have a broad impact on the clinical application of FA technology. These innovations include:

1) Fine-tuning an FA model with a diverse dataset: Herein, we showed for the first time that fine-tuning a CNN model for FA with a database composed of photographs of HC participants, individuals with ALS, and individuals PS resulted in a significant improvement in the model's performance for both the clinical and non-clinical groups. However, the performance improvement was greater for HC participants than for the clinical groups: fine-tuning the FA model improved the mean NRMSE by 36.5% for HC, 27.2% for the ALS group, and 19.5% for the PS group.
This study also showed that the pre-trained FAN model is biased against clinical populations. Furthermore, we showed that the improved model performance in clinical populations produced by fine-tuning did not eliminate the model bias; our results showed that fine-tuning the FAN model with the Toronto NeuroFace dataset actually magnified the model's bias against clinical populations.
The results presented herein fit well with our understanding of how CNNs learn from new data. After fine-tuning the model with representative images, the model gains additional information about 1) the subjects' poses and expressions, 2) image illumination and background, 3) the differences in manual annotations between the databases used to train and fine-tune the model, and 4) the facial abnormalities observed in the patients [30]. For the Toronto NeuroFace dataset, all the videos were recorded under similar conditions and the landmarks were manually localized by the same annotator. Thus, the fine-tuning process informs the model about the first three aspects from all the training images. In contrast, the FA model gains information about the fourth aspect – disease-specific facial abnormalities – only from a subset of images in the database. This observation likely explains the differences in performance of the fine-tuned model between HC and patients.
Our results indicate that fine-tuning an FA model using images of HC acquired under the same conditions as the patients' data might help improve the model performance on clinical populations: HC images can teach the model about aspects such as illumination, background, and manual annotation conventions. This is an important observation for the clinical application of FA models. Collecting patients' data can be challenging; in contrast, data from age-matched healthy controls are typically abundant.
2) Video-based detection of neurological diseases: An important contribution of this study was to demonstrate that fine-tuning FA models with representative data leads to improved video-based detection of neurological diseases. Our classification results showed that landmark-based facial features yielded by a fine-tuned FA model provided equal or better detection of individuals PS and individuals with ALS for all the tasks analyzed.
The SPREAD task provided the best classification results for the detection of stroke using landmark-based facial features, with an AUC of the ROC curve of 0.92. This result aligns well with our understanding of the typical sequelae associated with cerebrovascular accidents. Stroke survivors typically develop unilateral facial paralysis [56], which is characterized by decreased facial symmetry during movement, affecting the patients' ability to smile [57], [58]. Furthermore, stroke survivors often suffer from significant articulatory disorders [59], which might explain the good results obtained with the BBP task.
The SPREAD and OPEN tasks demonstrated an excellent ability to detect individuals with ALS using landmark-based facial features, with an AUC of the ROC curve of 0.95. This observation might be explained by the fact that individuals with ALS exhibit slower mouth movements [60], [61], which can be easily detected during maximum-effort tasks such as SPREAD and OPEN.
Comparing the results of the two classification problems (HC vs. ALS and HC vs. PS) directly is not possible because they used different feature sets. Nevertheless, we observed better classification performance in the detection of individuals with ALS than in the detection of individuals PS for all tasks.

B. Limitations
This study has three main limitations. First, data from HC, individuals PS, and individuals with ALS were recorded under tightly controlled pose, background, and illumination conditions. These laboratory conditions might be difficult to reproduce in more natural settings such as home recordings. Thus, the FA model fine-tuned with the Toronto NeuroFace dataset will likely yield higher landmark localization errors when applied to photographs and videos recorded under different conditions. Second, participants were asked to look straight at the camera during task execution. We observed that maintaining this posture was difficult for some participants (both HC and patients), who continuously turned their bodies or heads and looked away from the camera. Videos where the participant did not directly face the camera were not used in the classification analysis, as it was difficult to compare the differences between left and right facial movements. To alleviate this limitation, we are developing a custom software application that provides real-time feedback on the participant's pose.
Finally, the FAN model used in this study was originally introduced in 2017 [35]. Since then, alternative models with improved speed, better performance, and reduced complexity have been introduced [53], [62], [63]. However, these models are based on the same technology as FAN (heatmap regression), so they do not typically demonstrate large improvements in performance. Moreover, new models are trained in the same way as FAN, so they will likely demonstrate a similar bias against clinical populations.

V. CONCLUSIONS
We demonstrated that fine-tuning a CNN model for FA with a database composed of manually annotated facial images from healthy controls and individuals with neurological disorders resulted in improved model performance for all clinical and non-clinical groups. We also demonstrated that using the fine-tuned FA model resulted in improved video-based disease detection. These results provide practical guidelines for fine-tuning FA models and validate the importance of representative data when applying this technology to the automatic monitoring and assessment of neurological diseases.