xViTCOS: Explainable Vision Transformer Based COVID-19 Screening Using
Radiography
Abstract
Since its outbreak, the rapid growth of COrona VIrus Disease 2019
(COVID-19) across the globe has pushed the healthcare systems of many
countries to the verge of collapse. Therefore, it is imperative to
correctly identify COVID-19 positive patients and isolate them as soon
as possible to contain the spread of the disease and reduce the ongoing
burden on the healthcare system. The primary COVID-19 screening test,
RT-PCR, although accurate and reliable, has a long turnaround time. In
the recent past, several researchers have demonstrated the use of Deep
Learning (DL) methods on chest radiography (such as X-ray and CT) for
COVID-19 detection. However, existing CNN-based DL methods fail to
capture the global context due to their inherent image-specific
inductive bias. Motivated by this, we propose the use of
vision transformers (instead of convolutional networks) for COVID-19
screening using X-ray and CT images. We employ a multi-stage
transfer learning technique to address the issue of data scarcity.
Furthermore, we show that the features learned by our transformer
networks are explainable. We demonstrate that our method not only
quantitatively outperforms recent benchmarks but also focuses on
meaningful regions in the images for detection (as confirmed by
radiologists), aiding both accurate diagnosis of COVID-19 and
localization of the infected area.