Performance of a deep learning CNN model for the automated detection of 13 common conditions on Chest X-rays

Background and aims: Chest X-rays are widely used, non-invasive, cost-effective imaging tests. However, the complexity of interpretation and a global shortage of radiologists have led to reporting backlogs, delayed diagnoses and compromised quality of care. A fully automated, reliable artificial intelligence system that can quickly triage abnormal images for urgent radiologist review would be invaluable in the clinical setting. The aim was to develop and validate a deep learning Convolutional Neural Network algorithm to automate the detection of 13 common abnormalities found on Chest X-rays. Method: In this retrospective study, a VGG16 deep learning model was trained on images from ChestX-ray14, a large publicly available Chest X-ray dataset containing 112,120 images with annotations. Images were split into training, validation and testing sets, and the model was trained to identify 13 specific abnormalities. The primary performance measures were accuracy and precision. Results: The model demonstrated an overall accuracy of 88% in the identification of abnormal X-rays and 87% in the detection of the 13 common chest conditions, with no model bias. Conclusion: This study demonstrates that a well-trained deep learning algorithm can accurately identify multiple abnormalities on Chest X-ray images. As such models are further refined, they can be used to ease radiology workflow bottlenecks and improve reporting efficiency. Napier Healthcare’s team that developed this model consists of medical IT professionals who specialise in AI and its practical application in acute and long-term care settings. The model is currently being piloted in a few hospitals and diagnostic labs on a commercial basis.


Introduction
Artificial intelligence (AI) and deep learning are disruptive technologies that have moved at an unimaginable pace from being a futuristic promise to a current reality. The rise and dissemination of AI and ML (Machine Learning) have begun to permeate all spheres of our lives and are beginning to be applied to the field of healthcare as well.
In parallel, over the last two decades, advancement in medical imaging technology has led to the exponential growth of the use of diagnostic imaging for the early detection, diagnosis, and treatment of diseases. Imaging has taken on a critical role in modern healthcare, and most patient care pathways are reliant on an efficient radiology service to deliver the best outcomes [1], [2].
Among the various imaging modalities in use, Chest X-rays are the most common, and millions of studies are performed globally every year [3], [4], [5]. The wide availability, low cost, non-invasive nature, portability and ease of operation make it an attractive initial choice for the detection of a large number of thoracic conditions [3], [4].
Chest X-ray interpretation, however, is a complex task that is time consuming and labour intensive. Specialist radiologists who are qualified to perform this task are in short supply globally.
It is estimated that there are only about 10.83 radiologists for every 100,000 people in the United States, 6.9 per 100,000 people in Canada, around 5 per 100,000 people in the UK and 1 per 100,000 people in India [1]. In Singapore, statistics in 2019 showed that there were only around 392 registered diagnostic radiologists serving a population of around 5.7 million people [6].
In many parts of the world, the number of digital X-ray machines available far exceeds the number of professionals available to interpret and report on them. Such gaps in the radiology workforce result in backlogs, delayed diagnoses, fatigue-based diagnostic errors and poor quality of patient care.
In addition, despite their widespread use, Chest X-rays are low resolution images that are not easy to read. The overlapping of the tissue structures in the chest greatly increases the complexity of interpretation, as does the patient position during the study, the exposure technique, and image quality [7]. Significant subjectivity and inter-reader variability depending on the level of expertise and the abnormality being detected, are other factors that add to the complexity [4], [8], [9], [10].
All of these factors have thus resulted in a renewed interest in harnessing the power of AI and deep learning to assist and augment the interpretation of Chest X-ray images by radiologists. A clinically validated, automated artificial intelligence system that can independently read Chest X-rays could provide substantial benefits such as prioritisation of the workload, clinical decision support to minimise diagnostic errors, large-scale screening, and global population health initiatives, especially in low-resource settings [3], [4], [11], [12].
Computer-aided detection (CAD) is not new to radiology. Early attempts at computerised analysis of medical images date back to the 1960s. But the very limited computational power and lack of high-quality digitized image data at that time meant these applications were not very successful [13], [14], [15].
The second era of artificial intelligence through the 1980s and 90s once again saw the widespread use of CAD tools based on 'conventional' machine learning to assist radiologists in image interpretation. The first CAD commercial system was approved by the Food and Drug Administration (FDA) for use in screening mammography in 1998 [16], [17]. However, none of these systems reached the high performance or diagnostic accuracy that offered significant benefits to radiologists.
Over the last few years, there has been increasing interest in the use of deep learning algorithms based on Convolutional Neural Networks (CNNs) to assist with abnormality detection on medical images [18], [19], [20].
The major limitations of conventional ML techniques were the complex feature engineering, significant domain knowledge, and data processing expertise that were required to extract the essential discriminative features to train non-deep learning models. Deep learning systems on the other hand, are able to automate the feature extraction and classification steps, effectively shifting the burden of feature engineering from the human to the machine side [13], [20], [21]. A single well-designed and well-trained network can yield state-of-the-art results across many domains by the use of 'transfer learning' without the need for significant domain knowledge.
A good example is the application of CNNs to image detection in the real-world setting. In the early editions of the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC), the traditional CAD methods produced five times as many errors as a practiced human when attempting to identify everyday objects in photographs [22], [23]. However, in 2012, the convolutional network model 'AlexNet' significantly outperformed other conventional methods, and in 2015, the winning CNN algorithm, 'ResNet', exhibited astounding results surpassing human-level performance, paving the way for the use of these technologies in other domains as well [24], [25], [26].
'Transfer learning' from ImageNet has now become a de-facto method for deep learning applications in different fields of medical imaging, and several research groups have very successfully applied these models to detect specific problems on Chest X-rays [27], [28].

Napier CNN model for Chest X-rays
In our study, we analysed a supervised multi-label classification framework based on the VGG16 Convolutional Neural Network (CNN) model for the automated detection of abnormal images and the identification of 13 common chest conditions on Chest X-rays.
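The multi-label set-up can be illustrated with a minimal sketch: the network produces one independent sigmoid score per condition, and every label whose score clears a decision threshold is reported. The label names shown and the 0.5 threshold are illustrative assumptions, not the model's actual configuration.

```python
import math

# Three of the 13 condition labels, for illustration only
CONDITIONS = ["Atelectasis", "Cardiomegaly", "Effusion"]

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def detect(logits, threshold=0.5):
    """Map raw per-label logits to the set of detected conditions.

    Each label is scored independently, so an image can carry zero,
    one, or several findings at once (multi-label classification).
    """
    probs = [sigmoid(z) for z in logits]
    return [name for name, p in zip(CONDITIONS, probs) if p >= threshold]

print(detect([2.0, -1.5, 0.3]))  # -> ['Atelectasis', 'Effusion']
```

Because each label has its own sigmoid rather than a shared softmax, the scores do not compete with one another, which is what allows co-occurring findings on the same image.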

OBJECTIVE
To evaluate and validate the performance of the Napier CNN algorithm for the automated detection of one or more of 13 abnormal radiological findings on Chest X-rays.

DATA
The Napier algorithm was trained on the ChestX-ray14 data set [29], one of the largest public open-source repositories of chest radiographs, released by the National Institutes of Health (NIH), containing 112,120 frontal-view images from more than 30,000 unique patients. The model takes an input Chest X-ray image, assigns importance (through learned weights and biases) and delivers an output based on the presence or absence of one or more of the 13 conditions that it has been trained to detect.

IMAGE PROCESSING TECHNIQUES
The Chest X-rays in the dataset varied considerably in size, resolution and quality. The images were resized to a standard 224x224 pixels, and a set of image normalization techniques was applied to reduce variation. Additional data augmentation techniques, such as small angle tilts, were applied before the images were presented to the model.
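A minimal sketch of this preprocessing step is shown below. The nearest-neighbour resize and zero-mean/unit-variance normalization are stand-ins for whichever interpolation and normalization schemes were actually used; only the 224x224 target size comes from the text.

```python
import numpy as np

def preprocess(img: np.ndarray, size: int = 224) -> np.ndarray:
    """Resize a grayscale image to size x size (nearest-neighbour),
    then normalize to zero mean and unit variance so that differences
    in exposure and contrast between studies are reduced."""
    h, w = img.shape
    rows = np.arange(size) * h // size   # source row for each target row
    cols = np.arange(size) * w // size   # source column for each target column
    resized = img[rows][:, cols].astype(np.float32)
    return (resized - resized.mean()) / (resized.std() + 1e-8)

x = preprocess(np.random.rand(1024, 1000))  # a mock variable-size X-ray
print(x.shape)  # (224, 224)
```

Standardising every image to the same size and intensity statistics is what lets a single fixed-input network process radiographs acquired on different machines.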

PRETRAINING
The network was initialised with ImageNet pre-trained weights. During the design and training phase, selected layers of the CNN were trained, and the respective layer weights were updated based on our use-case-specific training images.
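This transfer-learning scheme can be sketched conceptually: generic early-layer features learned from ImageNet are kept frozen, and only the task-specific tail of the network is marked trainable. The layer names and the split point below are illustrative, not the actual configuration used.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    trainable: bool = False  # frozen by default (keeps pretrained weights)

def build_finetune_plan(layer_names, unfreeze_last=4):
    """Mark only the last `unfreeze_last` layers as trainable;
    all earlier layers keep their ImageNet pre-trained weights."""
    layers = [Layer(n) for n in layer_names]
    for layer in layers[-unfreeze_last:]:
        layer.trainable = True
    return layers

# VGG16-style naming: 13 convolutional layers plus 3 fully connected ones
names = [f"conv{i}" for i in range(1, 14)] + ["fc1", "fc2", "predictions"]
plan = build_finetune_plan(names)
print([l.name for l in plan if l.trainable])
# -> ['conv13', 'fc1', 'fc2', 'predictions']
```

Freezing the bulk of the network both reduces the amount of labelled data needed and prevents the small medical dataset from overwriting the general-purpose visual features.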

TRAINING
Model training was done with 50+ epochs on around 28,000 images. The number of epochs is a hyperparameter that defines the number of times the learning algorithm runs through the entire training dataset to update the model weights.
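The role of the epoch hyperparameter can be shown with a toy example: each epoch is one complete pass over the training set, with a gradient-descent weight update per sample. The data, learning rate and single-weight model here are invented purely to illustrate the loop structure.

```python
def train(samples, epochs=50, lr=0.1):
    """Fit y = w * x by plain stochastic gradient descent."""
    w = 0.0
    for _ in range(epochs):              # one epoch = one full pass over data
        for x, y in samples:             # one weight update per sample
            grad = 2 * (w * x - y) * x   # d/dw of the squared error (wx - y)^2
            w -= lr * grad
    return w

w = train([(1.0, 2.0), (2.0, 4.0)], epochs=50)
print(round(w, 3))  # converges towards 2.0, the true slope
```

Too few epochs leave the weights underfit; too many waste compute and risk overfitting, which is why the epoch count is tuned rather than fixed.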

PERFORMANCE ASSESSMENT
The primary performance metrics were accuracy, precision, recall and F1 score.
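These four metrics are all derived from the confusion-matrix counts of a classifier's decisions. The counts below are made-up examples, not results from this study.

```python
def metrics(tp, fp, fn, tn):
    """Compute the four reported metrics from confusion-matrix counts:
    tp/fp = true/false positives, fn/tn = false/true negatives."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction of correct calls
    precision = tp / (tp + fp)                   # of flagged cases, how many real
    recall = tp / (tp + fn)                      # of real cases, how many caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=80, fp=10, fn=20, tn=90)
print(acc, round(prec, 3), rec, round(f1, 3))  # 0.85 0.889 0.8 0.842
```

Reporting precision and recall alongside accuracy matters in radiology because datasets are imbalanced: a model that labels everything 'normal' can score high accuracy while having zero recall for disease.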

RESULTS
The deep learning model demonstrated an overall accuracy of 88% in the identification of abnormal X-rays and 87% in the detection of the 13 common chest conditions.
The classification report on sample test data set for normal and abnormal classification of the model is shown in figure 2.

DISCUSSION
Deep learning is a machine learning method in which a complex multi-layer neural network architecture learns representations of data automatically by transforming the input information into multiple levels of abstraction [21]. CNNs are the most commonly used deep learning networks for pattern recognition tasks in images and are trained using 'training data sets' from which the network automatically learns to extract relevant features by adjusting its weights with backpropagation. In radiology, these training sets usually consist of large numbers of hand-labelled images. If trained properly, these CNNs can identify features in medical images that are beyond the threshold of human detection and extract valuable new information from them [13], [20], [21].
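The weight-adjustment mechanism mentioned above, backpropagation, can be illustrated with a toy two-layer network: the loss gradient is propagated backwards through each layer by the chain rule, giving every weight its own update. The numbers and the purely linear layers are illustrative simplifications.

```python
def forward_backward(x, y, w1, w2, lr=0.05):
    """One training step on a two-layer linear 'network' out = w2*(w1*x)."""
    # forward pass
    h = w1 * x                 # layer 1 output
    out = w2 * h               # layer 2 output
    loss = (out - y) ** 2      # squared-error loss
    # backward pass: chain rule, outermost layer first
    d_out = 2 * (out - y)      # dLoss/dOut
    d_w2 = d_out * h           # dLoss/dW2
    d_h = d_out * w2           # gradient propagated back into layer 1
    d_w1 = d_h * x             # dLoss/dW1
    return w1 - lr * d_w1, w2 - lr * d_w2, loss

w1, w2 = 0.5, 0.5
for _ in range(100):
    w1, w2, loss = forward_backward(x=1.0, y=1.0, w1=w1, w2=w2)
print(round(loss, 8))  # loss shrinks towards 0 as w1*w2 approaches 1
```

Real CNNs apply exactly this scheme, just across millions of weights and with non-linear activations between the layers.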
However, deep network architectures are highly demanding in terms of the volume of data needed to train them. The sparsity and poor quality of digital medical data available for training the models was thus a major impediment to their widespread use in healthcare until as recently as 2017.
Since then, in an effort to provide sufficient training data for the research community for the development of deep learning-based algorithms, several institutions such as the NIH [29], Stanford University [11], and the Massachusetts Institute of Technology (MIT) [30] have released very large public datasets of annotated Chest X-rays.
The ChestX-ray14 dataset released by the NIH [29] was the first dataset to be made publicly available in 2017 and comprises 112,120 frontal-view X-ray images with fourteen disease labels derived by datamining radiology reports using Natural Language Processing (NLP) techniques.
The CheXpert dataset from Stanford University, containing around 224,316 images [11], and the MIMIC-CXR dataset from MIT [30], containing more than 350,000 images, were released in 2019.
Using these datasets, several research groups have very successfully applied CNN models to detect specific problems on Chest X-rays such as pneumothorax [31], pneumonia [32], [33], lung nodules [34], tuberculosis [35], [36] and the presence of medical devices in the thorax [37], [38]. Multi-label classification, where each input sample is associated with one or several labels, has also been explored, and a number of publications have reported excellent performance when trained on the NIH ChestX-ray14 [39], [40], CheXpert [11], and MIMIC datasets [41].
A few of these algorithms have even been approved by the FDA for clinical use [42].
The Napier VGG16 CNN model compares favourably with these previously published models and has demonstrated high accuracy in the identification of abnormal images and the automated detection of 13 abnormalities on Chest X-rays.

CLINICAL APPLICATIONS
The Napier Deep Learning CNN algorithm can be deployed in the clinical setting in at least two different ways. Firstly, as a triage tool to prioritise, reliably identify and flag abnormal images for urgent radiologist review. This would be especially useful in high-volume settings and in areas with limited reporting radiologists. Secondly, the model may also be used as a clinical decision support tool, acting as a second opinion to perform a simple back up check on the diagnosis of the physician or to direct the attention of the physicians to findings that they may have missed.
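The first deployment mode, triage, amounts to reordering the reporting queue by the model's abnormality score. The sketch below is a hypothetical illustration of that logic; the threshold, study identifiers and two-bucket policy are invented, not part of the deployed system.

```python
def triage(studies, urgent_threshold=0.8):
    """Reorder a worklist so that likely-abnormal studies are read first.

    studies: list of (study_id, abnormality_probability) tuples.
    Returns urgent studies first, then routine ones, each bucket
    sorted with the most suspicious study at the front.
    """
    urgent = [s for s in studies if s[1] >= urgent_threshold]
    routine = [s for s in studies if s[1] < urgent_threshold]
    by_suspicion = lambda s: -s[1]
    return sorted(urgent, key=by_suspicion) + sorted(routine, key=by_suspicion)

queue = triage([("cxr-001", 0.15), ("cxr-002", 0.93), ("cxr-003", 0.85)])
print([sid for sid, _ in queue])  # ['cxr-002', 'cxr-003', 'cxr-001']
```

Even when the radiologist still reads every study, pulling the likely-abnormal cases to the front of the queue shortens the time to diagnosis for the patients who need it most.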

LIMITATIONS
Although our model achieved high accuracy, we acknowledge that this study has some limitations.
First, the deep learning algorithm was trained and evaluated on a single open-source data set.
Therefore, variance in model performance due to new or unseen patterns may be observed when the model is exposed to new images in the clinical setting. To overcome this, we hope to re-train our model periodically using radiologist-annotated, richer, larger and more diverse datasets that will be made available to us through our ongoing engagement with large hospitals and radiology centres around the world.
Our training and testing dataset consisted of around 30,000 images out of the 112,120 images in the NIH Chest X-ray dataset. This may be attributed to the fact that the X-rays in the NIH dataset came from only around 30,000 unique patients, and nearly 50,000 of the images were labelled 'normal', making the 'effective size' of the dataset much smaller [43].
Whilst the model demonstrated excellent performance, rigorous evaluation of the model against specialists in radiology in real world clinical settings will help further validate its accuracy.
The released NIH dataset included corresponding metadata on patient age, gender, number of visits to the hospital and other non-image related data that were not used while training our algorithm. Integration of this metadata into the network would help the model identify any correlation between the identified labels and individual patient traits and hence increase its efficiency.

CONCLUSION
The Napier VGG16 CNN model has demonstrated high accuracy in the automated detection of 13 abnormalities on Chest X-rays and compares favourably with previously published models. Our findings support the increasing consensus that CNN-based deep learning algorithms can address unmet needs in the radiology workflow and will likely be an integral part of radiology reporting in the future. Large-scale and in-depth prospective trials will further validate the efficacy and accuracy of such applications and ascertain their overall impact on patient care and workflow efficiency.