Artificial Intelligence for Clinical Gait Diagnostics of Knee Osteoarthritis: An Evidence-based Review and Analysis

Background: Knee osteoarthritis (OA) remains a leading cause of disability worldwide. Recent advances in gait analysis have improved the clinical assessment of this knee-related condition. Although motion capture (mocap) technology is deemed the gold standard for gait analysis, it relies heavily on adequate data processing to yield clinically significant results. Moreover, gait data are non-linear and high-dimensional. Owing to the missing data involved in a mocap session and to typical statistical assumptions, conventional data processing methods are unable to reveal the intrinsic patterns needed to predict gait abnormalities.
Research question: Although studies have demonstrated the potential of Artificial Intelligence (AI) algorithms to address these limitations, these algorithms have not gained wide acceptance amongst biomechanists. The most common AI algorithms used in gait analysis are based on machine learning (ML) and artificial neural networks (ANN). By comparing the predictive capability of such algorithms from published studies, we assessed their potential to augment current clinical gait diagnostics when dealing with knee OA.
Methods: An evidence-based review and analysis were conducted. With over 188 studies identified, 8 studies met the inclusion criteria for a subsequent analysis, accounting for 78 participants overall.
Results: The classification performance of ML and ANN algorithms was quantitatively assessed. The test classification accuracy (ACC), sensitivity (SN), specificity (SP) and area under the curve (AUC) of the ML-based algorithms were clinically valuable, i.e., all higher than 85%, unlike those obtained via ANN.
Significance: This study demonstrates the potential of ML for accurate and reliable clinical assessment of knee disorders.

information. Whilst the MLP deploys either the logistic/sigmoid or the hyperbolic tangent sigmoid transfer functions [23], RBF uses the Gaussian transfer function [16]. MLP is trained via the back-propagation algorithm, whereby the initially randomised weighted inputs are propelled forward, whilst the errors are iteratively propagated backwards until an optimal set of weights generates the least MSE [22,24]. The SOM uses competitive learning and deploys the Kohonen neighborhood transfer function [11], whereby input data features are clustered based on distance metrics between data points [15].
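As a minimal sketch of the transfer functions named above (not code taken from any of the reviewed studies), the logistic/sigmoid, hyperbolic tangent and Gaussian functions can be written as follows. The function names and the `centre` and `width` parameters are illustrative assumptions.

```python
import math
from typing import Sequence


def logistic(x: float) -> float:
    """Logistic/sigmoid transfer function; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh_sigmoid(x: float) -> float:
    """Hyperbolic tangent sigmoid transfer function; output lies in (-1, 1)."""
    return math.tanh(x)


def gaussian_rbf(x: Sequence[float], centre: Sequence[float], width: float) -> float:
    """Gaussian transfer function of an RBF unit.

    `centre` and `width` are hypothetical unit parameters: the response is 1
    when the input coincides with the centre and decays with squared distance.
    """
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, centre))
    return math.exp(-sq_dist / (2.0 * width ** 2))
```

The MLP functions act on a weighted sum of inputs, whereas the Gaussian unit responds to the distance between the input vector and its centre.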
Amongst Machine Learning (ML)-based methods for gait analysis, the SVM [25] is the most widely applied technique [19,26]. SVMs map the training samples via kernel functions into a high dimensional space and apply a decision surface boundary as an optimal separating hyperplane (OSH) for classification [26].
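The kernel-based decision rule described above can be sketched as follows. This is an illustration only: the Gaussian (RBF) kernel, the `gamma` value, the support vectors, their multipliers and the bias are hypothetical values chosen by hand, whereas a real SVM obtains them by solving a constrained optimisation problem on the training samples.

```python
import math
from typing import Sequence


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float = 0.5) -> float:
    """Gaussian kernel: implicit inner product in a high-dimensional space."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, labels, alphas, bias, gamma=0.5) -> float:
    """Signed score f(x) = sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(
        alpha * y * rbf_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + bias


def classify(x, support_vectors, labels, alphas, bias, gamma=0.5) -> int:
    """Assign +1 or -1 depending on the side of the separating hyperplane."""
    score = svm_decision(x, support_vectors, labels, alphas, bias, gamma)
    return 1 if score >= 0 else -1


# Hypothetical, hand-picked model (not fitted to any gait data).
SUPPORT_VECTORS = [(0.0, 0.0), (2.0, 2.0)]
LABELS = [-1, 1]
ALPHAS = [1.0, 1.0]
BIAS = 0.0
```

With these assumed values, a point near (2, 2) receives the label +1 and a point near the origin receives -1.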
Via AI, gait-related indices could be derived as subject-specific metrics to assess the impact of gait-driven rehabilitation in patients with knee OA. Owing to the lack of evidence-based, quantitative analysis of the efficacy of these algorithms, the use of AI-based tools in clinical gait diagnostics is still limited [27]. Moreover, none of the topical reviews in gait analysis published so far [28] has offered such a comprehensive and quantitative analysis. Biomechanists therefore currently lack any objective ground truth on which to choose amongst several algorithms for aiding either diagnosis or assessment of prognosis of knee OA, and they often also lack the time and expertise required to understand the AI tools in question.
To the best of the authors' knowledge, this is the first study to perform an evidence-based analysis of AI-related studies in clinical gait diagnostics tailored to patients with knee OA. Besides providing an evidence-based review, a star-rating system for quality assessment of the relevant literature has been formulated. The scope of this rating system is not limited to this study: it can also be applied to any AI-related study involving healthcare data. By performing a methodological evaluation of the eligible articles, the validity of the inferences drawn in this study was further ascertained. It is hoped that this review will have a significant impact in the field of Clinical Biomechanics by promoting the clinical application of AI-based methods to aid diagnosis and/or assessment of prognosis of several lower limb pathologies.

Methods
The high-level objectives of this evidence-based review and analysis are the following:
1. To identify studies that have used AI for gait analysis on knee OA and conduct a methodological quality assessment for their inclusion to conduct an evidence-based analysis;
2. To assess the predictive capability of such studies by comparing supervised and unsupervised models via an evidence-based analysis approach, with respect to test classification accuracy (ACC) and further performance measures, such as sensitivity (SN), specificity (SP) and area under the curve (AUC);
3. To compare studies that have implemented ANN- and ML-based algorithms in clinical gait diagnostics for knee OA, with respect to the above-mentioned performance measures.
An evidence-based analysis was carried out via Review Manager (RevMan) (Version 5.3. Copenhagen: The Nordic Cochrane Centre, The Cochrane Collaboration, 2014). Further to performing the I² test to assess the heterogeneity amongst selected studies [29], a sensitivity analysis was carried out to discard studies that could have biased the results. Furthermore, a statistical analysis on the accuracy and reliability measures of the algorithms reviewed was performed via IBM SPSS Statistics (IBM Corp. Released 2016. IBM SPSS Statistics, Version 24.0. Armonk, NY: IBM Corp.). Although gait data include spatiotemporal and metabolic data more broadly, only kinetic and kinematic data were considered in this study, as they are gold standard variables to quantify human motion in clinical gait diagnostics, and, thus, were collectively referred to as "gait data".
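For readers unfamiliar with the heterogeneity statistic mentioned above, the following is a minimal sketch of how I² is commonly derived from Cochran's Q under inverse-variance weighting, i.e., I² = max(0, (Q - df) / Q) x 100%. It is not a reproduction of the RevMan implementation; the function name and the example inputs are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def heterogeneity(
    effects: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (pooled effect, Cochran's Q, I-squared in percent).

    `effects` are per-study effect sizes (e.g., mean differences) and
    `variances` their sampling variances; both are hypothetical inputs.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need matching effects and variances from >= 2 studies")
    weights = [1.0 / v for v in variances]          # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1                           # degrees of freedom
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i_squared


# Two made-up studies with mean differences of 0 and 4 and unit variance.
pooled, q, i_squared = heterogeneity([0.0, 4.0], [1.0, 1.0])
print(f"pooled={pooled:.2f}, Q={q:.2f}, I2={i_squared:.1f}%")
```

Values of I² close to 100% indicate that most of the observed variation between studies reflects heterogeneity rather than chance.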

2.1 Inclusion and exclusion criteria
Full-length papers and conference articles were initially screened, and their titles and abstracts were assessed for eligibility against the aims and high-level objectives of this study by all four authors independently.
All major electronic databases, including PubMed, Web of Science, MEDLINE, ScienceDirect, Scopus, Google Scholar, IEEE Xplore, Springer, Wiley, O'Reilly, SAGE, Cochrane and Embase, were searched with relevant keywords of interest, e.g., knee osteoarthritis, machine learning and artificial neural networks. Subscription-based and open-access articles published from 1984 (when optoelectronic mocap systems were first introduced) until 29/04/2018 were searched for. However, no relevant articles were found in the MEDLINE, Cochrane, O'Reilly, Embase and SAGE databases, which were thus discarded.
Selected articles must have reported gait-related data on human participants with knee OA, regardless of any demographic factor other than age (older than 12 but younger than 75 years). Studies reporting gait-related data on human subjects with neurotrauma or with neurodegenerative disorders, such as Parkinson's Disease, Cerebral Palsy or Huntington's Disease, were discarded from this review.
Only studies in which kinematic data were collected using optoelectronic systems that use RGB-D sensors (e.g., Vicon, Qualisys and Asus Xtion) were considered for inclusion. Moreover, body segments must have been identified via passive markers; any other methods for measuring or estimating such gait metrics were not considered. Selected articles must have reported data on walking gait (at a speed lower than that of a healthy human subject, i.e., approximately less than 5 km/h, owing to the knee OA). Any studies involving walking on instrumented treadmills (with embedded force platforms) or fall detection systems were discarded.
An initial search was performed only based on the title. Subsequently, a second and final search was carried out, after which relevant key articles were selected for inclusion based on their abstract and full-text content. All bibliographies from the retrieved key articles were also searched for potential articles that might not have been considered previously. The "Preferred Reporting Items for Systematic reviews and Meta-Analysis" (PRISMA) guidelines [30] were followed throughout this study. 188 studies were identified following a further screening. By applying the above-mentioned inclusion and exclusion criteria, 180 articles were excluded and 8 were selected. A methodological quality assessment was performed on these studies for their inclusion in the meta-analysis, as outlined in 2.2.
In case of incongruencies in the selected articles that could not be resolved amongst the four authors and reviewers, the corresponding author of the article in question was contacted for clarification, in order to ascertain whether the article was eligible for inclusion instead of discarding it a priori.

2.2 Methodological quality assessment: The UARTA star-rating system
Adapted from the MQAS scale [32], a quality assessment on selected articles was performed via a star-rating system developed by the authors L.P. and N.R. at the University of Auckland Rehabilitative Technologies Association (UARTA). The "UARTA Star-rating System for Assessing Clinical Significance of Artificial Intelligence-related Research" is deemed applicable to any research articles dealing with AI applied to healthcare-related data. Each of the following points corresponds to a star (★) attributed to selected papers for meeting the criterion described in the statement next to it. A maximum of fifteen stars was attributed to each of the selected articles. Articles carrying fewer than seven stars were not considered for the review and meta-analysis.
★ Selected articles must have outlined a clear purpose for the classification task, specifying inputs/outputs, and reported any data pre-processing steps undertaken to ensure accuracy and consistency of the results presented, and to enable their reproducibility. If those were not applicable, the authors of the selected articles must have justified why they were not.
★ Selected articles must have reported any data post-processing steps undertaken to ensure accuracy and consistency of the results presented, and to enable their reproducibility. If those were not applicable, the authors of the selected articles must have justified why they were not.
★ Selected articles must have reported a measure (number and/or percentage of the whole dataset) quantifying the training set of the data used.
★ Selected articles must have reported a measure (number and/or percentage of the whole dataset) quantifying the cross-validation set of the data used.
★ Selected articles must have reported a measure (number and/or percentage of the whole dataset) quantifying the testing set of the data used.
★ Selected articles must have reported the name of the cross-validation algorithm (e.g., holdout validation, leave-one-out (LOO), nested or k-fold cross-validation, specifying the number k of partitions made where applicable) used to avoid overfitting and ensure reproducibility of the results attained.
★ Selected articles must have reported any qualitative outputs showing the training- and cross-validation-related mean squared error (MSE) curves against the number of iterations or epochs, to illustrate at which iteration/epoch overfitting occurs. This step is fundamental to stop the training accordingly.
★ Satisfying the above-mentioned criterion provides evidence on the avoidance of overfitting, thus ensuring that the algorithms tested were truly learning from the data on which they were trained, rather than solely 'remembering' the input features.
★ Selected articles must have reported testing or out-of-sample classification accuracy.
★ Selected articles must have reported at least one measure of error, e.g., the mean squared error (MSE) or cross entropy, or it should be clearly inferable from the performance measures reported.
★ Selected articles must have reported the sensitivity (SN).
★ Selected articles must have reported the specificity (SP).
★ Selected articles must have reported the area under the receiver operating characteristic curve (AUC) or, at least, the Pearson's product-moment coefficient of determination (r²).
★ Selected articles must have reported any of the above-mentioned performance measures with confidence intervals.
★ Selected articles must have reported any qualitative outputs on the receiver operating characteristic (ROC) curve under which the AUC was computed.
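The scoring rule described above (one star per criterion met, fifteen stars at most, exclusion below seven stars) can be sketched as follows. The short criterion keys and function names are hypothetical labels introduced here for illustration; they are not part of the UARTA scale itself.

```python
from typing import Dict

# One hypothetical key per star criterion, in the order listed in the text.
CRITERIA = (
    "purpose_and_preprocessing",
    "postprocessing",
    "training_set_size",
    "cross_validation_set_size",
    "testing_set_size",
    "cross_validation_method",
    "mse_curves",
    "overfitting_avoided",
    "test_accuracy",
    "error_measure",
    "sensitivity",
    "specificity",
    "auc_or_r2",
    "confidence_intervals",
    "roc_curve",
)
MAX_STARS = len(CRITERIA)  # fifteen stars at most
MIN_STARS = 7              # articles with fewer stars are not considered


def count_stars(criteria_met: Dict[str, bool]) -> int:
    """Count one star per satisfied criterion; unlisted criteria count as unmet."""
    unknown = set(criteria_met) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    return sum(1 for name in CRITERIA if criteria_met.get(name, False))


def is_eligible(criteria_met: Dict[str, bool]) -> bool:
    """An article is retained when it carries at least seven stars."""
    return count_stars(criteria_met) >= MIN_STARS


# Made-up article that meets the first seven criteria only.
example_article = {name: True for name in CRITERIA[:7]}
print(count_stars(example_article), is_eligible(example_article))
```

Keeping the criteria in a single tuple makes the maximum score follow from the list itself, so adding or removing a criterion cannot silently desynchronise the total.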
Based on the inclusion and exclusion criteria in 2.1 and the quality assessment performed via the UARTA star-rating scale, eight (N=8) key articles were retrieved, as per the selection procedure outlined in Fig. 1. Table 1 shows the stars attributed to each of the selected articles for meeting the UARTA star-rating scale-related criteria, as outlined above.
[Table 1, recoverable "Total" row (stars per selected article): 10, 10, 9, 5, 7, 8, 11, 7]
Table 2 summarises the main elements derived from a comprehensive review of the selected studies, as per the UARTA star-rating quality assessment scale described above. Table 4 summarises the results obtained from a statistical analysis on the classification outcomes reported in selected studies. Whilst the test classification accuracy was reported in eight (N=8) selected studies, the sensitivity and specificity were reported in four (N=4) studies, with the area under the curve being reported only in two (N=2) studies.
Previous reviews on the use of AI in clinical gait diagnostics were purely qualitative [28]. In this study, instead, not only was a qualitative analysis of previous research findings carried out, but a quantitative one was also performed using a novel evidence-based analysis approach in AI for clinical gait diagnostics, which yielded two eligible studies that compared the classification performance of ML- and ANN-based methods directly, i.e., in the same study. Aljaaf et al. [6] were able to capture underlying gait patterns from the movement of body segments and quantified the adduction moment of the ankle, knee (knee adduction moment, KAM), hip and pelvis from their corresponding Euler angles during a single gait cycle. As compared to other ANN-based techniques, MLP was the most accurate algorithm tested (r² = 0.86, root mean squared error (RMSE) = 0.07). Karg et al. [18] applied a quadratic SVM to quantify pathological discrepancies in gait phases and joint angles between patients with symptomatic OA-related gait abnormalities and healthy subjects, deploying spatio-temporal parameters, with a classification accuracy of 85%. Patients with knee OA tended to have decreased walking speed, cadence, stride and step lengths when considering both lower limbs, as well as a shorter time in single support but a longer time in double support.
To the best of the authors' knowledge, the UARTA star-rating quality assessment scale represents the first ever clinical gait diagnostics equivalent (when dealing with AI-based clinical decision-making) of the MQAS scale [32] and the Newcastle-Ottawa quality assessment scale [31], the latter being used to assess findings from cohort studies (prospective or retrospective) published in the medical literature.
Published studies lack consistency in reporting machine learning-based research findings, as shown in Table 2 and as evident from the analysis in Table 4: not all of the selected studies (N=8) reported all the main performance measures of the AI-based algorithms tested on gait data. Remarkably, only two (N=2) studies reported the area under the curve (AUC) (Table 4), which is one of the most important performance measures for AI algorithms.
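To make the performance measures discussed above concrete, the sketch below computes ACC, SN and SP from confusion-matrix counts, and the AUC as the proportion of correctly ranked (patient, control) score pairs, which is equivalent to the area under the ROC curve. The function names, counts and scores are made-up illustrations, not data from the reviewed studies.

```python
from typing import Dict, Sequence


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + fn + tn + fp),  # overall proportion correct
        "SN": tp / (tp + fn),                    # true positive rate
        "SP": tn / (tn + fp),                    # true negative rate
    }


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC via pairwise ranking (Mann-Whitney form); ties count as one half."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical test set: 10 patients with knee OA and 10 healthy controls.
metrics = classification_metrics(tp=9, fn=1, tn=8, fp=2)
print({name: round(value, 2) for name, value in metrics.items()})
print(round(auc([0.9, 0.8, 0.4], [0.7, 0.3]), 3))
```

Reporting all four measures together matters because a high ACC can coexist with a low SN or SP when the classes are imbalanced.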
In selected papers that directly compared classification outcomes between machine learning (ML)- and artificial neural network (ANN)-based architectures (N=2) [6,18], ML-based algorithms were found to consistently have a higher accuracy and a lower standard deviation than those of ANN, thus being more stable in dealing with gait data (Table 3). The reduced performance of ANN-based algorithms may be partly explained by the re-sampling occurring within ANN-based architectures, such as the multi-layer perceptron (MLP), where weights and biases are adjusted iteratively, and so the data are continuously resampled until the mean squared error drops below a preset threshold that is deemed acceptable to stop training the ANN. However, such mean difference was not significant (p = 0.32, Table 3 and Fig. 2.a) and the heterogeneity between studies was also considerably high (I² = 93%, Table 3). Instead, when considering outcomes on test classification accuracy of the AI-based algorithms from the eight studies (N=8) that reported such a performance measure, the ANN seems to have a slightly higher accuracy with a slightly lower standard deviation than the ML-based ones, the two being highly correlated with one another (r² = 1.00, p = 0.01; Table 4). Nevertheless, as also shown in Table 4, all reliability-related performance measures of ML-based algorithms (SN=95.50%, SP=86.00%, AUC=0.86) were consistently higher than those of ANN-based architectures (SN=87.92%, SP=79.36%, AUC=0.85). These apparently contradictory results further support the development and use of the UARTA star-rating quality assessment scale for clinical gait diagnostics-related studies using machine learning for three main purposes:
1. qualitatively evaluating the technical rigour and quality of such studies;
2. quantitatively performing the first ever objective evidence-based analysis (in this study) on results reported in published studies;
3. providing guidelines to biomechanists and clinicians on which algorithm would be more accurate and reliable.
Fig. 2.b shows no publication bias, as the mean differences from the selected studies are close to the midline of the graph indicating the mean MD. Therefore, the lack of publication bias supports the reliability of the above-mentioned conclusions derived from analysing the results reported in Table 3. Moreover, Table 3 seems to suggest that ANN can be used when dealing with gait data collected on patients with knee OA [18], whilst ML seems to generalise better to patients with any other knee-related conditions [6,18].
Whilst the test classification accuracy was reported in all eight studies, the sensitivity and specificity were reported in four studies, and the area under the curve in only two. The limited amount of data available for the evidence-based and statistical analyses is a major limitation of this study, which in turn highlights an even greater limitation in the reporting of machine learning-related results in the literature. The main limitation nevertheless remains the small sample of reviewed studies that met the inclusion criteria for eligibility. It is therefore hard to draw definitive conclusions.
To summarise, with respect to applications in clinical gait diagnostics, whilst both ANN- and ML-based algorithms attempt to mimic the learning-related mechanisms occurring in the brain and can handle non-linear and high-dimensional data, they require adequate data pre-processing (removal of outliers and, at times, normalisation or standardisation of inputs) and do not directly yield physiologically interpretable results.
The development and validation of the UARTA star-rating quality assessment scale seeks to change this status quo and to remedy the lack of appropriate, consistent and thorough reporting on machine learning-related findings in the clinical gait diagnostics literature. The implementation of this set of standards for selection criteria is also intended to guide the development and testing of AI-based algorithms in clinical gait diagnostics, such that progress in AI research can be promptly translated into readily available and thoroughly validated tools that biomechanists and clinicians can easily use to aid diagnosis and/or assessment of prognosis of lower limb disorders and/or pathologies worldwide.
This study establishes clear design criteria for selecting and deploying Artificial Intelligence (AI)-based algorithms for diagnostic and/or prognostic purposes in clinical gait diagnostics, particularly when dealing with data on patients with knee-related conditions. A concise but comprehensive description of the main AI learning-based algorithms (ANN and ML) was provided. A quantitative analysis enabled the definition of criteria for selecting the most accurate and reliable AI-based algorithm to apply in a clinical setting. Based on this analysis, the test classification accuracy (ACC), sensitivity (SN), specificity (SP) and area under the curve (AUC) of the ML-based algorithms analysed were found to be clinically valuable, i.e., all higher than 85% (ACC=89.81±6.88%; SN=95.50±6.36%; SP=86.00±15.56%; AUC=0.86±0.00), unlike those obtained via ANN (SP=79.36±5.32%).
Biomechanists have so far applied AI-based algorithms without any standards to guide the adequate selection and implementation of such tools. Via the development and validation of the UARTA star-rating quality assessment scale for machine learning-based studies in clinical gait diagnostics, we attempted to define an initial set of standards and guidelines that can promote a thorough and prompt translational application of previous research findings and AI-based algorithms. It is hoped that the UARTA scale will be considered when international standards are outlined on appropriate and consistent reporting of findings from clinical gait diagnostics-related studies in which machine learning was used. AI can revolutionise and objectify best practices in clinical gait diagnostics, augmenting the capabilities of biomechanists to aid diagnosis and assessment of prognosis in patients with knee-related conditions.