Explaining the Black-box Smoothly: A Counterfactual Approach



A B S T R A C T
We propose a BlackBox Counterfactual Explainer, designed to explain image classification models for medical applications. Classical approaches (e.g., saliency maps) that assess feature importance do not explain how imaging features in important anatomical regions are relevant to the classification decision. Such reasoning is crucial for transparent decision-making in healthcare applications. Our framework explains the decision for a target class by gradually exaggerating the semantic effect of the class in a query image. We adopted a Generative Adversarial Network (GAN) to generate a progressive set of perturbations to a query image, such that the classification decision changes from its original class to its negation. Our proposed loss function preserves essential details (e.g., support devices) in the generated images.
We used counterfactual explanations from our framework to audit a classifier trained on a chest x-ray dataset with multiple labels. Clinical evaluation of model explanations is a challenging task. We proposed clinically relevant quantitative metrics such as the cardiothoracic ratio and the score of a healthy costophrenic recess to evaluate our explanations. We used these metrics to quantify the counterfactual changes between the populations with negative and positive decisions for a diagnosis by the given classifier.
We conducted a human-grounded experiment with diagnostic radiology residents to compare different styles of explanations (no explanation, saliency map, cycleGAN explanation, and our counterfactual explanation) by evaluating different aspects of explanations: (1) understandability, (2) classifier's decision justification, (3) visual quality, (4) identity preservation, and (5) overall helpfulness of an explanation to the users. Our results show that our counterfactual explanation was the only explanation method that significantly improved the users' understanding of the classifier's decision compared to the no-explanation baseline. Our metrics established a benchmark for evaluating model explanation methods in medical images. Our explanations revealed that the classifier relied on clinically relevant radiographic features for its diagnostic decisions, thus making its decision-making process more transparent to the end-user.

Introduction
Machine learning, specifically Deep Learning (DL), is being increasingly used for sensitive applications such as Computer-Aided Diagnosis (Hosny et al., 2018) and other tasks in the medical imaging domain (Rajpurkar et al., 2018; Rodriguez-Ruiz et al., 2019). However, for real-world deployment (Wang et al., 2020), the decision-making process of these models should be explainable to humans to obtain their trust in the model (Gastounioti and Kontos, 2020; Jiang et al., 2018). Explainability is essential for auditing the model (Winkler et al., 2019), identifying various failure modes (Oakden-Rayner et al., 2020; Eaton-Rosen et al., 2018) or hidden biases in the data or the model (Larrazabal et al., 2020), and for obtaining new insights from large-scale studies (Rubin et al., 2018).
With the advancement of DL methods for medical imaging analysis, deep neural networks (DNNs) have achieved near-radiologist performance in multiple image classification tasks (Seah et al., 2021; Rajpurkar et al., 2017). However, DNNs are criticized for their "black-box" nature, i.e., they fail to provide a simple explanation as to why a given input image produces a corresponding output (Tonekaboni et al., 2019). To address this concern, multiple model explanation techniques have been proposed that aim to explain the decision-making process of DNNs (Selvaraju et al., 2017; Cohen et al., 2021).
The most common form of explanation in medical imaging is a class-specific heatmap overlaid on the input image. It highlights the most relevant regions (where) for the classification decision (Rajpurkar et al., 2017; Young et al., 2019). However, the location information alone is insufficient for applications in medical imaging. Different diagnoses may affect the same anatomical regions, resulting in similar, and therefore inconclusive, explanations for multiple diagnoses. A thorough explanation should describe what imaging features are present in those important locations, and how changing such features modifies the classification decision.
To address this problem, we propose a novel explanation method that provides a counterfactual explanation. A counterfactual explanation is a perturbation of the input image such that the classification decision is flipped. By comparing the input image and its corresponding counterfactual image, the end-users can visualize the difference in important image features that leads to the change in classification decision. Fig. 1 shows an example. The input image is predicted as positive for pleural effusion (PE), while the generated counterfactual image is negative for PE. The changes are mostly concentrated in the lower lobe region, which is known to be clinically important for PE (Lababede, 2017). The counterfactual explanation is used to derive a pseudo heat-map, highlighting the regions that change the most in the transformation (difference map in Fig. 1).
We demonstrate the performance of the counterfactual explainer on a chest x-ray (CXR) dataset. Rather than generating just one counterfactual image at the end of the prediction spectrum, our explanation function generates a series of perturbed images that gradually traverse the decision boundary from one extreme (negative decision) to another (positive decision) for a given target class. We adopted a conditional Generative Adversarial Network (cGAN) as our explanation function (Singla et al., 2019). We extend the cGAN to preserve small or uncommon details during image generation (Bau et al., 2019). Preserving such details is particularly important in our application, as the missing information may include support devices that may influence human users' perceptions. To this end, we incorporated semantic segmentation and object detection into our loss function to preserve the shape of the anatomy and foreign objects during image reconstruction. We evaluated the quality of our explanations using different quantitative metrics, including clinical measures. Further, we performed a clinical study with 12 radiology residents to compare the explanations for the proposed method and the baseline models.

Related work
Posthoc explanation is a popular approach that aims to improve human understanding of a pre-trained classifier. Our work broadly relates to the following posthoc methods. Feature Attribution methods provide an explanation by producing a saliency map that shows the importance of each input component (e.g., pixel) to the classification decision.
Perturbation-based methods identify salient regions by directly manipulating the input image and analyzing the resulting changes in the classifier's output. Such methods modify specific pixels or regions in an input image, either by masking with constant values (Dabkowski and Gal, 2017) or with random noise, occluding (Zhou et al., 2015), localized blurring (Fong and Vedaldi, 2017), or in-filling (Chang et al., 2019). Especially for medical images, such perturbations may introduce anatomically implausible features or textures. Our proposed method also generates a perturbation of the query image such that the classification decision is flipped. But in contrast to the above methods, we enforce consistency between the perturbed data and the real data distribution to ensure that the perturbations are plausible and visually similar to the input.
Counterfactual Explanations are a type of contrastive (Dhurandhar et al., 2018) explanation that provides a useful way to audit the classifier and determine causal attributes that lead to the classification decision (Parafita Martinez and Vitria Marca, 2019; Singla et al., 2021). Similar to our method, generative models such as GANs and variational autoencoders (VAEs) are used to compute interventions that generate realistic counterfactual explanations (Cohen et al., 2021; Joshi et al., 2019). Much of this work is limited to simpler image datasets such as MNIST and celebA (Liu et al., 2019; Van Looveren and Klaise, 2019) or simulated data (Parafita Martinez and Vitria Marca, 2019). For more complex natural images, previous studies (Chang et al., 2019; Agarwal and Nguyen, 2020) focused on finding and infilling salient regions to generate counterfactual images. In contrast, our explanation function does not require any re-training to generate explanations for a new image at inference time.
In another line of work, (Wang and Vasconcelos, 2020; Goyal et al., 2019) provide counterfactual explanations that explain both the predicted and the counter class. Further, researchers (Narayanaswamy et al., 2020; DeGrave et al., 2020) have used a cycleGAN (Zhu et al., 2017) model to perform image-to-image translation between normal and abnormal images. Such methods are independent of the classifier. In contrast, our framework uses a classifier consistency loss to enable image perturbation that is consistent with the classifier.

Contributions
In this paper, we propose a progressive counterfactual explainer that explains the decision of a pre-trained image classifier. Our contributions are summarized as follows: 1. We developed a cGAN-based framework to generate progressively changing perturbations of the query image, such that the classification decision changes from being negative to being positive for a given target class.
2. Our method preserves the anatomical shape and foreign objects such as support devices across generated images by adding a specialized reconstruction loss. The loss incorporates context from semantic segmentation and foreign object detection networks.
3. We performed a thorough qualitative and quantitative evaluation of our explanation function to audit a classifier trained on a CXR dataset.
4. We proposed quantitative metrics based on the clinical definitions of two diseases (cardiomegaly and PE). Ours is one of the first methods to use such metrics for quantifying a DNN model explanation. Specifically, we used these metrics to quantify statistical differences between the real images and their corresponding counterfactual images.
5. Ours is one of the first methods to conduct a thorough human-grounded study to evaluate different counterfactual explanations for a medical imaging task. Specifically, we collected and compared feedback from diagnostic radiology residents on different aspects of explanations: (1) understandability, (2) classifier's decision justification, (3) visual quality, (4) identity preservation, and (5) overall helpfulness of an explanation to the users.

Methodology
We consider a black-box image classifier f with high prediction accuracy. We assume that f is a differentiable function and that we have access to its value as well as its gradient with respect to the input, ∇_x f(x). We also assume access to either the training data for f, or an equivalent dataset with competitive prediction accuracy.

Notation
The overall training objective has the form min_{E,G} max_D λ_cGAN L_cGAN + λ_f L_f + λ_rec L_rec, where L_cGAN is a conditional GAN-based loss function that enforces data consistency, L_f enforces classifier consistency through a Kullback-Leibler (KL) divergence loss, and L_rec is a reconstruction loss that enforces self-consistency. The hyperparameters λ_cGAN, λ_f, and λ_rec control the balance between the terms. In the following sections, we discuss each property and the associated loss term in detail.

Data consistency
We formulated the explanation function, I_f(x, c), as an image encoder E(·) followed by a conditional GAN (cGAN) (Miyato and Koyama, 2018), with c as the condition. The encoder enables the transformation of a given image, while the GAN framework generates realistic-looking transformations as the explanation image. The cGAN is a variant of GAN that allows the generation to be conditioned on auxiliary information, here the desired classifier output c.

The generator produces a conditional sample G(z, c), where c denotes a condition and z is noise sampled from a uniform distribution P_z. In our formulation, z is the latent representation of the input image x, learned by the encoder E(·).
Finally, the explanation function is defined as x_c = I_f(x, c) = G(E(x), c). For the discriminator in the cGAN, we adapted the loss function from the Projection GAN (Miyato and Koyama, 2018). The Projection GAN imposes the following structure on the discriminator loss function: the discriminator logit is decomposed into an unconditional term r(x) and a conditional term r(c|x). Here, r(x) is the discriminator logit that evaluates the visual quality of the generated image; it is the discriminator's attempt to separate real images from the fake images created by the generator. The second term, r(c|x), evaluates the correspondence between the generated image x_c and the condition c.
To represent the condition, the discriminator learns an embedding matrix V with N rows, where N is the number of conditions. The condition is encoded as an N-dimensional one-hot vector, which is multiplied by the embedding matrix to extract the condition embedding. When c = n, the conditional embedding is given as the n-th row of the embedding matrix (v_n).
The projection is computed as the dot product of the condition embedding and the features extracted from the fake image, r(c = n | x) = v_n^T φ(x), where n is the current class for the conditional generation and φ is the feature extractor.
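To make the projection structure concrete, the following sketch computes the discriminator logit for a single image; the variable names, shapes, and toy values are illustrative assumptions, not the authors' code.

```python
import numpy as np

def projection_logit(phi_x, r_x, V, n):
    """Projection-discriminator logit: an unconditional realism term r(x)
    plus the dot product between the condition embedding v_n and the image
    features phi(x). Shapes and names here are illustrative assumptions."""
    v_n = V[n]                      # n-th row: embedding of condition n
    return r_x + np.dot(v_n, phi_x)

# toy usage: 128-dim discriminator features, N = 10 conditions
rng = np.random.default_rng(0)
phi_x = rng.normal(size=128)        # phi(x): features from the discriminator backbone
r_x = 0.3                           # r(x): unconditional realism logit
V = rng.normal(size=(10, 128))      # learned embedding matrix, one row per condition
print(projection_logit(phi_x, r_x, V, n=4))
```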
In our use-case, the condition c is the desired posterior probability from the classification function f. c is a continuous variable with values in the range [0, 1]. The Projection-cGAN requires the condition to be a discrete variable that can be mapped to the embedding matrix V. Hence, we discretize the range [0, 1] into N bins, where each bin is one condition. One can view a change from f(x) to c as changing the bin index from the current value C(f(x)) to C(c), where C(·) returns the bin index.
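A minimal sketch of the bin-index function C(·), assuming equally sized bins over [0, 1]; the exact boundary handling is an assumption.

```python
import numpy as np

N_BINS = 10  # the paper discretizes [0, 1] into N = 10 bins

def bin_index(c, n_bins=N_BINS):
    """C(.): map a desired posterior probability c in [0, 1] to a discrete
    condition (bin index) for the Projection-cGAN."""
    c = float(np.clip(c, 0.0, 1.0))
    return min(int(c * n_bins), n_bins - 1)

print([bin_index(c) for c in (0.0, 0.05, 0.55, 1.0)])  # -> [0, 0, 5, 9]
```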

Classifier consistency
Ideally, the cGAN should generate a series of smoothly transformed images as we change the condition c in the range [0, 1]. These images, when processed by the classifier f, should also smoothly change the classification prediction over [0, 1].
To enforce this, rather than considering the bin index C(c) as a scalar, we treat it as an ordinal categorical variable, i.e., C(c_1) < C(c_2) when c_1 < c_2. Specifically, rather than checking the single condition that the desired bin index equals some value n, i.e., C(c) = n, we check n − 1 conditions that the desired bin index is greater than every bin index smaller than n, i.e., C(c) > m for all m < n, following the ordinal multi-class formulation of (Frank and Hall, 2001).
We adapted Eq. 5 to account for a categorical variable as the condition by modifying the second term to support ordinal multi-class regression. Along with the conditional loss for the discriminator, we need additional regularization for the generator to ensure that the actual classifier's outcome, i.e., f(x_c), is very similar to the condition c. To ensure this compatibility with f, we further constrain the generator to minimize a Kullback-Leibler (KL) divergence that encourages the classifier's score for x_c to be similar to c.
Our final condition-aware loss combines these two terms. The first term evaluates a conditional probability associated with the generated image given the condition c and is a function of both G and D. The second term minimizes the KL divergence between the posterior probability for the new image, f(x_c), and the desired prediction distribution c; it influences only G. Please note that the term r(x) does not appear in Eq. 7 as it is independent of c.
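As a rough sketch of this classifier-consistency term, the snippet below computes a Bernoulli KL divergence between the desired condition c and the classifier's output f(x_c); the exact reduction and weighting used in the authors' implementation are not given in the text and are assumed here.

```python
import tensorflow as tf

def classifier_consistency_loss(f_xc, c, eps=1e-7):
    """KL divergence between the desired condition c and the classifier's
    posterior f(x_c) for the target class, treating both as Bernoulli
    distributions over positive/negative."""
    f_xc = tf.clip_by_value(f_xc, eps, 1.0 - eps)
    c = tf.clip_by_value(c, eps, 1.0 - eps)
    kl = c * tf.math.log(c / f_xc) + (1.0 - c) * tf.math.log((1.0 - c) / (1.0 - f_xc))
    return tf.reduce_mean(kl)

# toy usage: desired condition 0.9, classifier currently outputs 0.6
print(classifier_consistency_loss(tf.constant([0.6]), tf.constant([0.9])))
```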

Context-aware self consistency
A valid explanation image is a small modification of the input image and should preserve the input's identity, i.e., patient-specific information such as the shape of the anatomy or any foreign objects (FO) if present. While images generated by GANs are shown to be realistic looking (Karras et al., 2019), a GAN with an encoder may ignore small or uncommon details in the input image (Bau et al., 2019). To preserve such details, we propose a context-aware reconstruction loss (CARL) that exploits extra information from the input domain to refine the reconstruction results. This additional information comes as semantic segmentation and detection of any FO present in the input image. The CARL is defined as L_rec(x, x′) = Σ_j || S_j(x) ⊙ (x − x′) ||_1 + D_KL( O(x) || O(x′) ) (Eq. 8). Here, S(·) is a pre-trained semantic segmentation network that produces a label map for different regions in the input domain, and S_j(x) denotes the mask for segmentation label j. Rather than minimizing a distance such as ℓ1 over the entire image, we minimize the reconstruction loss for each segmentation label j. Such a loss heavily penalizes differences in small regions to enforce local consistency.
O(x) is a pre-trained object detector that, given an input image x, outputs a number of bounding boxes called regions of interest (ROIs). For each bounding box, it outputs the 2-d coordinates in the image where the box is located and an associated probability of the presence of an object. Using the input image x, we obtain the ROIs and the associated O(x), which is a probability vector stating the probability of finding an object in each ROI. For the reconstructed image x′, we reuse the ROIs obtained from image x and compute the associated probabilities for the reconstructed image as O(x′). Next, we use the KL divergence to quantify the difference between the two probability vectors, as in Eq. 8. Finally, we use the CARL to enforce two essential properties of the explanation function: 1. If c = f(x), the self-reconstructed image should resemble the input image.
2. If c ≠ f(x), applying a reverse perturbation on the explanation image x_c should recover the initial image, i.e., I_f(x_c, f(x)) ≈ x. We enforce these two properties by minimizing the loss L_rec(x, I_f(x, f(x))) + L_rec(x, I_f(x_c, f(x))), where L_rec(·, ·) is defined in Eq. 8. We minimize this loss only while reconstructing the input image (either by performing self or cyclic reconstruction). Please note that the classifier f and the support networks S(·) and O(·) remain fixed during training.
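The sketch below mirrors the structure of the CARL in Eq. 8: an ℓ1 term computed per segmentation label plus a KL term over the object detector's ROI probabilities. The per-mask normalization and the KL direction are assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf

def carl(x, x_rec, seg_masks, roi_probs_x, roi_probs_rec, eps=1e-7):
    """Context-aware reconstruction loss (CARL) sketch.
    x, x_rec: input and reconstructed images (H x W tensors).
    seg_masks: list of H x W binary masks, one per segmentation label from S(x).
    roi_probs_*: object-presence probabilities for the same ROIs from O(.)."""
    per_label = []
    for mask in seg_masks:
        area = tf.reduce_sum(mask) + eps
        per_label.append(tf.reduce_sum(mask * tf.abs(x - x_rec)) / area)
    l1_term = tf.add_n(per_label)

    p = tf.clip_by_value(roi_probs_x, eps, 1.0 - eps)
    q = tf.clip_by_value(roi_probs_rec, eps, 1.0 - eps)
    kl_term = tf.reduce_sum(p * tf.math.log(p / q) +
                            (1.0 - p) * tf.math.log((1.0 - p) / (1.0 - q)))
    return l1_term + kl_term
```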

Implementation details
Classification model: We consider a classification model that takes as input a single-view chest radiograph and outputs the probability of each of the 14 observations. Following prior work on diagnosis classification (Irvin et al., 2019), we used the DenseNet-121 (Huang et al., 2016) architecture for the classifier. We use the Adam optimizer with its default β-parameters. Explanation function: Our explanation function is implemented using TensorFlow version 2.0 and is trained on an NVIDIA P100 GPU. Before training the explanation function, we assume access to the pre-trained classification function that we aim to explain. We also assume access to pre-trained segmentation and object detection networks, which are used to enforce the CARL.
In the cGAN, we adapted a ResNet (He et al., 2016) architecture for the encoder, generator, and discriminator networks. For optimization, we used the Adam optimizer (Kingma and Ba, 2015), with hyper-parameters set to α = 0.0002, β_1 = 0, β_2 = 0.9, and updated the discriminator five times per one update of the generator and the encoder.
In our experiments, we train three independent explanation functions to explain the classifier's decisions for three class labels: cardiomegaly, pleural effusion (PE), and edema. For training, we divide f(x) ∈ [0, 1] into N = 10 equally-sized bins and trained the cGAN with 10 conditions. To construct the training set for the explanation function, we randomly sample images from the test set of the classifier such that each condition (bin index) has 2500-3000 images. Similarly, we created a non-overlapping (unique subjects) evaluation dataset of 20K images for the explanation function. We created one such dataset for each class label.

Evaluation
For evaluating the explanations, we randomly sample two groups of real images from the test set of the explanation function: (1) a real-negative group defined as X_n = {x; f(x) < 0.2}, which consists of real CXRs that are predicted as negative by the classifier f for a target class k, and (2) a real-positive group defined as X_p = {x; f(x) > 0.8}. For X_n, we generate a counterfactual group by setting the condition c = 1.0 as X_c^{n→p} = {I_f(x, c = 1) ∀x ∈ X_n}. Similarly, for X_p, we derive a counterfactual group as X_c^{p→n} = {I_f(x, c = 0) ∀x ∈ X_p}. We create one such dataset for each target class k. Combining the two groups, our set of real images is X = X_n ∪ X_p and the corresponding set of counterfactual explanations is X_c = X_c^{n→p} ∪ X_c^{p→n}. All the results are computed on this evaluation dataset. We employ several metrics to quantify different aspects of a valid counterfactual explanation.
Frechet Inception Distance (FID) score: The FID score quantifies the visual similarity between the real images and the synthetic counterfactuals. It computes the distance between the activation distributions as FID = ||µ_r − µ_c||^2 + Tr(Σ_r + Σ_c − 2(Σ_r Σ_c)^{1/2}), where the µ's and Σ's are the means and covariances of the activation vectors derived from the penultimate layer of an Inception v3 network (Heusel et al., 2017) pre-trained on the MIMIC-CXR dataset.
Counterfactual Validity (CV) score: The CV score (Mothilal et al., 2020) is defined as the fraction of counterfactual explanations that successfully flipped the classification decision, i.e., if the input image is negative, then the explanation is predicted as positive for the target class. The CV score is computed as CV(X, X_c) = (1/|X|) Σ_x 1[ |f(x) − f(x_c)| > τ ], where τ is the margin between the two prediction distributions.
We used τ = 0.8 in our experiments.
Foreign Object Preservation (FOP) score: The FOP score is the fraction of the real images with a successful detection of an FO in which the FO was also detected in the corresponding explanation image, FOP = (1/|X_FO|) Σ_{x ∈ X_FO} 1[ O(x_c) > 0.5 ], where O(x) is the probability of finding an FO in image x as predicted by a pre-trained object detector, and X_FO = {x ∈ X; O(x) > 0.5}, i.e., we consider images with a positive detection of an FO in set X.
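The two scores can be computed directly from the classifier and detector outputs; the snippet below is a sketch under the definitions above, with made-up values in the usage example.

```python
import numpy as np

def cv_score(f_x, f_xc, tau=0.8):
    """Counterfactual Validity: fraction of explanations whose prediction
    differs from the input prediction by more than the margin tau."""
    f_x, f_xc = np.asarray(f_x), np.asarray(f_xc)
    return np.mean(np.abs(f_x - f_xc) > tau)

def fop_score(o_x, o_xc, thresh=0.5):
    """Foreign Object Preservation: among real images with a detected FO
    (O(x) > 0.5), the fraction whose explanation also has a detected FO."""
    o_x, o_xc = np.asarray(o_x), np.asarray(o_xc)
    has_fo = o_x > thresh
    return np.mean(o_xc[has_fo] > thresh)

print(cv_score([0.05, 0.1, 0.9], [0.95, 0.7, 0.02]))   # 2 of 3 flipped -> ~0.67
print(fop_score([0.9, 0.2, 0.8], [0.7, 0.1, 0.3]))     # 1 of 2 FOs preserved -> 0.5
```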
Next, we define two clinical metrics to quantify the counterfactual changes that lead to the flipping of the classifier's decision. Precisely, we translated the clinical definitions of cardiomegaly and pleural effusion into metrics that can be computed from a CXR.
Cardiothoracic Ratio (CTR): We used the CTR as the clinical metric to quantify cardiomegaly. The CTR is the ratio of the cardiac diameter to the maximum internal diameter of the thoracic cavity. A CTR greater than 0.5 indicates cardiomegaly (Mensah et al., 2015; Centurión et al., 2017; Dimopoulos et al., 2013). We followed the approach in (Chamveha et al., 2020) to calculate the CTR from a CXR. We use the pre-trained segmentation network S(·) to mark the heart and lung regions. We calculated the cardiac diameter as the distance between the leftmost and rightmost points from the lung centerline on the heart segmentation. The thoracic diameter is the horizontal distance between the widest points on the lung mask.
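A simplified sketch of the CTR computation from segmentation masks is shown below; it takes the horizontal extents of the heart and lung masks, whereas the procedure in (Chamveha et al., 2020) uses more careful landmarking around the lung centerline.

```python
import numpy as np

def cardiothoracic_ratio(heart_mask, lung_mask):
    """Estimate the CTR from binary segmentation masks (H x W arrays):
    cardiac width from the heart mask, thoracic width as the widest
    horizontal extent of the lung mask."""
    heart_cols = np.where(heart_mask.any(axis=0))[0]
    lung_cols = np.where(lung_mask.any(axis=0))[0]
    cardiac_diameter = heart_cols.max() - heart_cols.min() + 1
    thoracic_diameter = lung_cols.max() - lung_cols.min() + 1
    return cardiac_diameter / thoracic_diameter

# toy masks: heart spans columns 40-70, lungs span columns 10-110
heart = np.zeros((128, 128)); heart[60:90, 40:71] = 1
lungs = np.zeros((128, 128)); lungs[30:110, 10:111] = 1
print(round(cardiothoracic_ratio(heart, lungs), 3))  # ~0.307
```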

Score for detecting a healthy Costophrenic recess (SCP):
We first identify the CP recess in a CXR and then classify it as healthy or blunt to quantify pleural effusion. Fluid accumulation in the CP recess may lead to flattening of the diaphragm and the associated blunting of the angle between the chest wall and the diaphragm arc, called the costophrenic angle (CPA). A blunt CPA is an indication of pleural effusion (Maduskar et al., 2016; Lababede, 2017). Marking the CPA on a CXR requires expert supervision, while annotating the CP region with a bounding box is a much simpler task (see SM-Fig. 15). We trained an object detector to identify healthy or blunt CP recesses in the CXRs and used the SCP as our evaluation metric.

Experiments and Results
We performed four sets of experiments on the CXR dataset: (1) In Section 4.1, we evaluated the validity of our counterfactual explanations and compared them against xGEM (Joshi et al., 2018) and CycleGAN (Narayanaswamy et al., 2020; DeGrave et al., 2020).
(2) In Section 4.2, we compared against saliency-based methods for post-hoc model explanation.
(3) In Section 4.3, we associate the counterfactual changes in our explanations with the clinical definitions of two diagnoses, cardiomegaly and pleural effusion.
(4) In Section 4.4, we present a clinical study that collects subjective feedback from radiology residents on different styles of explanations.

Validity of counterfactual explanations
A valid counterfactual explanation resembles the query image while having perceivable differences that achieve the opposite classification decision from the classifier; different explanations x_c are generated using the same x but different conditions c.
To verify this behaviour, we group images in the test set of the explanation function into five non-overlapping groups based on their original prediction f(x). Next, for each image, we created 10 explanation images by discretising the range [0, 1] of the condition c into ten equally sized bins.

Visual quality
Qualitatively, the counterfactual explanations generated by our method look visually similar to the query image (see Fig. 4).

Identity preservation
Ideally, a counterfactual explanation should differ in semantic features associated with the target task while retaining unique properties of the input, such as foreign objects (FOs).
FOs provide critical information for identifying the patient in an x-ray. The disappearance of FOs in explanation images may create confusion that the explanation images show a different patient. Our results confirm that CARL is an improvement over the ℓ1 reconstruction loss. We further provide a detailed ablation study over different components of our loss in SM-Sec. 6.13.
Fig. 6. Fidelity of generated images with respect to preserving FOs: real images versus explanations generated with and without CARL.

Comparison with saliency maps
Popular existing approaches for post-hoc model explanation include explaining using a saliency map (Pasa et al., 2019; Irvin et al., 2019). To compare against such methods, we approximated a saliency map as a pixel-wise difference map between the explanations at the two extreme ends, i.e., with condition c = 0 (negative decision) and with condition c = 1 (positive decision). For a proper comparison, we normalized the absolute values of the saliency maps to the range [0, 1].
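The pixel-wise difference map can be obtained as in the sketch below, assuming the two extreme counterfactuals are available as arrays; this illustrates the procedure described above rather than the authors' code.

```python
import numpy as np

def counterfactual_saliency(x_c0, x_c1):
    """Approximate a saliency map as the normalized absolute pixel-wise
    difference between the counterfactuals at c = 0 and c = 1.
    x_c0 and x_c1 are H x W arrays; the output lies in [0, 1]."""
    diff = np.abs(x_c1.astype(np.float32) - x_c0.astype(np.float32))
    return (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
```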
In a clinical setting, multiple diagnoses may affect the same anatomical region. In this case, the saliency map may highlight the same regions as important for multiple target tasks. Fig. 8 shows one such example. Our method not only provides a saliency map, but also counterfactual images that demonstrate how image features in those relevant regions should be modified to change the classification decision. Quantitative evaluation: In this experiment, we quantitatively compare different methods for generating saliency maps, to show whether the important regions identified by these methods are actually relevant for the classification decision. Specifically, we used the deletion evaluation metric (Petsiuk et al., 2018; Samek et al., 2017). For each image in set X_p, we derived saliency maps using the different methods. We used the saliency information to sort the pixels based on their importance. Next, we gradually removed the top x% of important pixels by selectively in-painting the removed region based on its surroundings. We processed the resulting image through the classifier and measured the output probability. We repeated this process while gradually increasing the fraction of removed pixels.
For each image, we plotted the updated classification probability as a function of the fraction of removed pixels to obtain the deletion curve, and measured its area under the deletion curve (AUDC). A sharp decline in classification probability shows that the removed pixels were actually important for the classification decision. A sharp decline results in a smaller AUDC and demonstrates the high sensitivity of the classifier in the salient regions. In Fig. 7, we report the mean and standard deviation of the AUDC metric over the set X_p. Our method achieved the lowest AUDC, confirming the high sensitivity of the classifier in the salient regions identified by our method.
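A sketch of the deletion metric is given below. For simplicity it fills removed pixels with a constant instead of the surround-based in-painting used in the paper, and the classifier is passed in as a generic callable.

```python
import numpy as np

def deletion_audc(image, saliency, classifier, fractions=np.linspace(0.0, 1.0, 21)):
    """Remove the most salient pixels in increasing fractions, re-score the
    image with the classifier, and return the area under the deletion curve."""
    order = np.argsort(saliency.ravel())[::-1]      # most salient pixels first
    probs = []
    for frac in fractions:
        k = int(frac * order.size)
        perturbed = image.copy().ravel()
        perturbed[order[:k]] = 0.0                  # constant fill (paper uses in-painting)
        probs.append(classifier(perturbed.reshape(image.shape)))
    return np.trapz(probs, fractions)
```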

Disease-specific evaluation
In this experiment, we demonstrated the clinical relevance of our explanations. We defined two clinical metrics, the cardiothoracic ratio (CTR) for cardiomegaly and the score of a normal costophrenic recess (SCP) for pleural effusion. We used these metrics to quantify the counterfactual changes between the normal and abnormal populations. We conducted a paired t-test to determine the effect of the counterfactual perturbation on the clinical metric for the respective diagnosis. To perform the test, we considered the two groups of real images, X_n and X_p, and their corresponding counterfactual groups, X_c^{n→p} and X_c^{p→n}, respectively. In Fig. 9, we show the distribution of differences in CTR for cardiomegaly and SCP for PE in a pair-wise comparison between real images and their respective counterfactuals. Patients with cardiomegaly have a higher CTR as compared to normal subjects. Consistent with clinical knowledge, in Fig. 9 we observe a negative mean difference for CTR(X_n) − CTR(X_c^{n→p}) (with a p-value of < 0.0001) and a positive mean difference for CTR(X_p) − CTR(X_c^{p→n}) (with a p-value of 0.0001). The low p-value in the dependent t-test statistics supports the alternative hypothesis that the difference between the two groups is statistically significant, and that this difference is unlikely to be caused by sampling error or by chance.
By design, the object detector assigns a high SCP to a healthy CP recess with no evidence of CPA blunting. Consistent with our expectation, we observe a positive mean difference for SCP(X_n) − SCP(X_c^{n→p}) (with a p-value of 0.0001) and a negative mean difference for SCP(X_p) − SCP(X_c^{p→n}) (with a p-value of 0.0001). A low p-value confirmed the statistically significant difference in SCP between the real images and their corresponding counterfactuals.
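The paired comparison can be reproduced with a dependent t-test; the snippet below uses scipy with made-up CTR values only to illustrate the analysis, not the study's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ctr_real = rng.normal(0.45, 0.03, size=200)            # CTR of real "normal" images (X_n)
ctr_cf = ctr_real + rng.normal(0.06, 0.02, size=200)   # CTR of their counterfactuals (X_c^{n->p})

t_stat, p_value = stats.ttest_rel(ctr_real, ctr_cf)    # paired (dependent) t-test
print(f"mean difference = {np.mean(ctr_real - ctr_cf):.3f}, p = {p_value:.2e}")
```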

Human evaluation
We conducted a human-grounded experiment with diagnostic radiology residents to compare different styles of explanations (no explanation, saliency map, cycleGAN explanation, and our counterfactual explanation) by evaluating different aspects of explanations: (1) understandability, (2) classifier's decision justification, (3) visual quality, (4) identity preservation, and (5) overall helpfulness of an explanation to the users.
Our results show that our counterfactual explanation was the only explanation method that significantly improved the users' understanding of the classifier's decision compared to the no-explanation baseline. In addition, our counterfactual explanation had a significantly higher classifier's decision justification than the cycleGAN explanation, indicating that the participants found good evidence for the classifier's decision more frequently in our counterfactual explanation than in the cycleGAN explanation.
Further, the cycleGAN explanation performed better in terms of visual quality and identity preservation. However, at times the cycleGAN explanations were identical to the query image, thus providing inconclusive explanations. Overall, the participants found our explanation method the most helpful in understanding the assessment made by the AI system in comparison to the other explanation methods. Below, we describe the design of the study and the data analysis methods, along with the results of the experiment in detail.

Experiment Design
We conducted an online survey experiment with 12 diagnostic radiology residents. Participants first reviewed an instruction script, which described the AI system developed to provide an autonomous diagnosis for CXR findings such as cardiomegaly. The study comprised the radiologists evaluating six CXR images, which were presented to them in random order.
For selecting these six CXRs, we first divided the test set of the explanation function for cardiomegaly into three groups based on the classifier's prediction: positive, intermediate (f(x) ∈ [0.4, 0.6]), and negative. Next, we randomly selected two CXR images from each group. The six CXR images were anonymized as part of the MIMIC-CXR dataset protocol.
For each image, we followed the same procedure, consisting of a diagnosis task, followed by the four explanation conditions, and ending with a final evaluation question comparing the explanation conditions. Further details of the study design are included in SM-Section 6.1.
Diagnosis: For each CXR image, we first asked a participant to provide their diagnosis for cardiomegaly. This question ensures that the participants carefully consider the imaging features that helped them diagnose. Subsequently, the participants were presented with the classifier's decision and were asked to provide feedback on whether they agreed.
Explanation Conditions: Next, the study provides the classifier's decision with the following explanation conditions: 1. No explanation (Baseline): This condition simply provides the classifier decision without any explanation, and is used as the control condition.

2. Saliency map:
A heat map overlaid on the query CXR, highlighting essential regions for the classifier's decision.
3. CycleGAN explanation: A video loop over two CXR images, corresponding to the query CXR transformation with a negative and a positive decision for cardiomegaly.
4. Our counterfactual explanation: A video showing a series of CXR images gradually changing the classifier's decision from negative to positive.
Please note that after showing the baseline condition, we provided the other explanation conditions in random order to avoid any learning or biasing effects.
Evaluation metrics: Given the classifier's decision and corresponding explanation, we consider the following metrics to compare the different explanation conditions: 1. Understandability: For each explanation condition, the study included a question to measure whether the end-user understood the classifier's decision when the explanation was provided. The participants were asked to rate their agreement with "I understand how the AI system made the above assessment for Cardiomegaly".
2. Classifier's decision justification: Human users may perceive explanations as the reason for the classifier's decision. For the cycleGAN and our counterfactual explanation conditions, we quantify whether the provided explanations were actually related to the classification task by measuring the participants' agreement with "The changes in the video are related to Cardiomegaly".

3. Visual quality:
The study quantifies the proximity between the explanation images and the query CXR by measuring the participants' agreement with "Images in the video look like a chest x-ray.".

4. Identity preservation:
The study also measures the extent to which participants think the explanation images correspond to the same subject as the query CXR by measuring the participants' agreement with "Images in the video look like the chest x-ray from a given subject".

5. Helpfulness: For each CXR image, we asked the participants to select the most helpful explanation condition in understanding the classifier's decision: "Which explanation helped you the most in understanding the assessment made by the AI system?". This evaluation metric directly compares the different explanation conditions.
All metrics but the helpfulness metric were evaluated for agreement on a 5-point Likert scale, where one means "strongly disagree" and five means "strongly agree".
Free-form Response: After each question, we also asked the participants a free-form question: "Please explain your selection in a few words." We used answers to these questions to triangulate our findings and complement our quantitative metrics by understanding our participants' thought processes and reasoning.
Participants: Our participants include 12 diagnostic radiology residents who have completed medical school and have been in the residency program for one or more years. On average, the participants finished the survey in 40 minutes and were paid $100 for their participation in the study.

Data analysis
For each evaluation metric, the study asked the same question to the participants while showing them different explanations.
For each question, we gathered 72 responses (6 CXR images × 12 participants).
For the understandability and helpfulness metrics, we conducted a one-way ANOVA test to determine whether there is a statistically significant difference between the mean metric scores for the four explanation conditions. Specifically, we built a one-way ANOVA with the agreement rating for the metric as our dependent variable and the explanation condition as the independent variable. If we found a significant difference in the ANOVA, we ran Tukey's Honestly Significant Difference (HSD) post-hoc test to perform a pair-wise comparison between the different explanation conditions.
We measured the classifier's decision justification, visual quality, and identity preservation metrics only for the cycleGAN and our counterfactual explanations. We conducted paired t-tests to compare these evaluation metrics between these two explanation conditions. We also qualitatively analyzed the participants' free-form responses to find themes and patterns in their responses.
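The analysis pipeline can be sketched with scipy and statsmodels as below, using made-up Likert ratings; the group sizes (72 responses per condition) follow the study design, but the numbers themselves are illustrative.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
conditions = ["none", "saliency", "cyclegan", "counterfactual"]
ratings = {c: rng.integers(1, 6, size=72) for c in conditions}  # 5-point Likert ratings

f_stat, p_value = stats.f_oneway(*ratings.values())             # one-way ANOVA
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")

if p_value < 0.05:                                               # pairwise post-hoc comparison
    scores = np.concatenate(list(ratings.values()))
    groups = np.repeat(conditions, 72)
    print(pairwise_tukeyhsd(scores, groups))
```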

Results
Fig. 10 shows the mean scores for the evaluation metrics of understandability, classifier's decision justification, visual quality, and identity preservation among the different explanation conditions. Below, we report the statistical analysis for these results, followed by an analysis of the participants' free-form responses to understand the reasons behind these results. Understandability: The results show that our counterfactual explanation was the most understandable explanation to the participants. A one-way ANOVA revealed that there was a statistically significant difference in the understandability metric between at least two explanation conditions (F(3, 284) = 3.39, p = 0.019). The Tukey post-hoc test showed that the understandability metric for our counterfactual explanation was significantly higher than for the no-explanation baseline (p = 0.018).
However, there was no statistically significant difference in mean scores between the other pairs of explanations (refer to Table 3, "Understandability" column). This finding indicates that providing our counterfactual explanations along with the classifier's decision made the algorithm most understandable to our clinical participants, while the other explanation conditions, saliency map and cycleGAN, failed to achieve a significant difference from the no-explanation baseline on the understandability metric. Next, we use responses from the free-text question to supplement our findings.
For the no-explanation baseline, the primary reason for poor understanding was the absence of an explanation (n=30), (e.g., they stated that "there is no indication as to how the AI made this decision"). Interestingly, many responses (n=23) either associated their high understanding with the correct classification decision, i.e., participants understood the decision because the decision is correct ("I agree, it is small and normal"), or they assumed the AI system is using similar reasoning as them to arrive at its decision ("I assume the AI is just measuring the width of the heart compared to the thorax", "Assume the AI measured the CT ratio and diagnosed accordingly.").
Participants mostly found saliency maps to be correct but incomplete (n=23), ("Unclear how assessment can be made without including additional regions"). Specifically, for cardiomegaly, the saliency maps were highlighting parts of the heart and not its border ("Not sure how it gauges not looking at the border") or the thoracic diameter ("thoracic diameter cannot be assessed using highlighted regions of heat map"). We observe a similar result in Fig. 8, where the heatmap focuses on the heart but not its border. Further, some participants expressed a concern that they did not understand how the relevant regions were used to derive the decision ("i understand where it examined but not how that means definite cardiomegaly").
For the cycleGAN explanation, the primary reason for poor understanding was the minimal perceptible change between the negative and positive images (n=3), ("There is no change in the video."). In contrast, many participants explicitly reported an improved understanding of the classifier's decision in the presence of our counterfactual explanations (n=33), ("I think the AI looking at the borders makes sense.", "i can better understand what the AI is picking up on with the progression video").
Classifier's decision justification: Our counterfactual explanation (M=3.46; SD=1.12) achieved a positive mean difference of 0.63 on this metric as compared to cycleGAN (M=2.83; SD=1.33), with t(71)=3.55 and p < 0.001. This result indicates that the participants found good evidence for the predicted class (cardiomegaly) more frequently in our counterfactual explanations than in cycleGAN.
Most responses (n=25) explicitly mentioned visualizing changes related to cardiomegaly, such as an enlarged heart, in our explanation video as compared to cycleGAN (n=17). For cycleGAN, many reported that the changes in the explanation video were not perceptible (n=23). Further, the participants reported changes in density, windowing level, or other attributes which were not related to cardiomegaly ("Decreasing the density does not impact how I assess for cardiomegaly.", "they could be or just secondary to windowing the radiograph"). Such responses were observed for both cycleGAN (n=17) and our explanation (n=17). This indicates that the classifier may have associated such secondary information (short-cuts) with the cardiomegaly diagnosis. A more in-depth analysis is required to quantify the classifier's behaviour.
Visual quality and identity preservation: We observe a negative mean difference of 0.31 and 0.37 between our and the cycleGAN explanation methods in the visual quality and identity preservation metrics, respectively. The mean score for visual quality was higher for cycleGAN (M=4.55; SD=0.71) as compared to our method (M=4.24; SD=0.80), with t(71)=3.49 and p < 0.001. Similarly, the mean score for identity preservation was also higher for cycleGAN (M=4.51; SD=0.56) as compared to our method (M=4.14; SD=0.78), with t(71)=3.96. Most of the responses (n=69) agreed that the cycleGAN explanations were highly similar to the query CXR image. These results are consistent with our earlier results that cycleGAN has better visual quality with a lower FID score (see Table 1). However, in some responses, the participants pointed out that the explanation images were almost identical to the query image ("There's virtually no differences. This is within the spectrum of a repeat chest x-ray for instance."). An explanation image identical to the query image provides no information about the classifier's decision. Further, a similar-looking CXR will also result in a similar classification decision, and hence will fail to flip the classification decision. As a result, we also observed a lower agreement in the classifier consistency metric and a lower counterfactual validity score in Table 1 for cycleGAN.
Helpfulness: In our concluding question, "Which explanation helped you the most in understanding the assessment made by the AI system?", 57% of the responses selected our counterfactual explanation as the most helpful method. A one-way ANOVA revealed that there was a statistically significant difference in the helpfulness metric between at least two explanation conditions (F(3, 284) = 21.5, p < 0.0001). In the pair-wise Tukey's HSD post-hoc test, we found that the mean helpfulness metric for our counterfactual explanations was significantly different from all the other explanation conditions (p < 0.0001). Table 3 ("Helpfulness" column) summarizes these results.
These results indicate that the participants selected our counterfactual explanations as the most helpful form of explanation for understanding the classifier's decision.

Discussion and Conclusion
We provided a BlackBox Counterfactual Explainer, designed to explain image classification models for medical applications. Our framework explains the decision by gradually transforming the input image to its counterfactual, such that the classifier's prediction is flipped. We have formulated and evaluated our framework on three properties of a valid counterfactual explanation: data consistency, classifier consistency, and context-aware self-consistency. Further, we present a thorough comparison between cycleGAN and our explanation in a human evaluation study. The clinical experts expressed high agreement that the explanation images from cycleGAN were of high quality and that they resembled the query CXR. But at the same time, users found the explanation images to be too similar to the query CXR, and the cycleGAN explanations failed to provide counterfactual reasoning for the decision.
In comparison, our explanations were the most helpful in understanding the classification decision. Though the users reported inconsistencies in the visual appearance, the overall sentiment was positive and they selected our method as their preferred explanation method for improved understandability.
Clinical relevance of the explanations: From a clinical perspective, we demonstrated that the counterfactual changes associated with normal (negative) or abnormal (positive) classification decisions are also associated with corresponding changes in disease-specific metrics such as the CTR and the SCP. In our clinical study, multiple radiologists reported using the CTR as the metric to diagnose cardiomegaly. As radiologist annotations are expensive, and it is not efficient to perform human evaluation on a large test set, our results with CTR calculations provide a quantitative way to evaluate differences between the real and counterfactual populations.
We acknowledge that our GAN-generated counterfactual explanations may have missing details such as small wires. In our extended experiments, we found that foreign objects such as a pacemaker have minimal importance in the classification decision (see the ablation study in the supplementary material). We attempted to improve the preservation of such information through our revised context-aware reconstruction loss (CARL). However, even with CARL, some fine details may still be missed.

Summarizing the notation
Table 4 summarizes the notation used in the manuscript.

Dataset
We focus on explaining classification models based on deep convolutional neural networks (CNNs); most state-of-the-art performance models fall in this regime. We used a large, publicly available dataset of chest x-ray (CXR) images, MIMIC-CXR (Johnson et al., 2019). The MIMIC-CXR dataset is a multi-modal dataset consisting of 473K CXRs and 206K reports from 63K patients. We considered only frontal (posteroanterior PA or anteroposterior AP) view CXRs. The dataset provides image-level labels for fourteen radiographic observations. These labels are extracted from the radiology reports associated with the x-ray exams using an automated tool called the Stanford CheXpert labeler (Irvin et al., 2019). The labeler first defines some thoracic observations using a radiology lexicon (Hansell et al., 2008). It extracts and classifies (positive, negative, or uncertain mentions) these observations by processing their context in the report. Finally, it aggregates these observations into fourteen labels for each x-ray exam. For the MIMIC-CXR dataset, we extracted the labels ourselves, as we have access to the reports.

Classification Model
To train the classifier, we considered an uncertain mention as a positive mention. We crop the original images to have the same height and width, then downsample them to 256 × 256 pixels. The intensities were normalized to have values between 0 and 1. We followed the approach in prior work (Rajpurkar et al., 2017; Rubin et al., 2018; Irvin et al., 2019) for training. Our explanation framework treats the classifier as a black box, without requiring knowledge of its architectural details; our proposed approach can be used for explaining any DL-based neural network.

Explanation Function
The explanation function is a conditional GAN with an encoder. We used a ResNet (He et al., 2016) architecture for the encoder, the generator, and the discriminator. The choice of N matters: a small N is equivalent to fewer conditions, resulting in a coarse transformation which leads to abrupt changes across the explanation images. In our experiments, we used N = 10, with a batch size of 32. We experimented with different values of N and selected the largest N which created a class-balanced batch that fits in GPU memory and resulted in stable cGAN training. (From Table 4: q(x) is the data distribution learned by the cGAN; r(x) is the cGAN loss term that measures the similarity between the real and learned data distributions; r(c|x) is the cGAN loss term that evaluates the correspondence between generated images and the condition; φ(x) is the image feature extractor, part of the discriminator.)

Semantic Segmentation
We adopted a 2D U-Net (Ronneberger et al., 2015) to perform semantic segmentation, marking the lung and heart contours in a CXR. The network optimizes a multi-categorical cross-entropy loss function, defined as L_seg = − Σ_i Σ_s 1[y_i = s] log p_θ(x_i)[s], where 1 is the indicator function, y_i is the ground-truth label for the i-th pixel, s is the segmentation label with values in {background, lung, heart}, p_θ(x_i) denotes the output probability for pixel x_i, and θ are the learned parameters. The network is trained on 385 CXRs and corresponding masks from the Japanese Society of Radiological Technology (JSRT) (van Ginneken et al., 2006) and Montgomery (Jaeger et al., 2014) datasets.

Object Detection
We trained an object detector network to identify medical devices in a CXR. For the MIMIC-CXR dataset, we pre-processed the reports to extract keywords/observations that correspond to medical devices, including pacemakers, screws, and other hardware. We trained similar detectors for identifying normal and abnormal CP recess regions in a CXR. We associated an abnormal CP recess with the radiological finding of a blunt CP angle, as identified by a positive mention of "blunting of costophrenic angle" in the corresponding radiology report. For the normal CP recess, we considered images with a positive mention of "lungs are clear" in the reports. We extracted 300 CXR images with positive mentions of the respective terms for normal and abnormal CP recess to train the object detector.
Please note that the object detectors for the CP recess are only used for evaluation purposes; they were not used during the training of the explanation function. In the literature, blunting of the CPA, the angle between the chest wall and the diaphragm arc, is an indication of pleural effusion (Maduskar et al., 2013, 2016). A vanilla VAE makes distributional assumptions that are unrealistic for image data and is hence known to produce over-smoothed images (Huang et al., 2018).
The VAE used is available at https://github.com/LynnHo/VAE-Tensorflow. All settings and architectures were set to default values. The original code generates an image of dimension 64×64. We extended the given network to produce an image with dimensions 256×256. For training cycleGAN, we consider two sets of images. The first set comprises 2000 images from the MIMIC-CXR dataset such that the classifier has a positive prediction for the presence of a target disease, i.e., f(x) > 0.9, and the second set has the same number of images but with a strong negative prediction, i.e., f(x) < 0.1. We train one such model for each target disease.

Extended results for identity preservation
An FO is critical in identifying the patient in an x-ray. An FO's disappearance may lead to a false conclusion that removing the FO resulted in the changed classification decision.

Ablation study over pacemaker
We performed an ablation study to investigate whether a pacemaker influences the classifier's prediction for cardiomegaly; the results indicate that the pacemaker does not influence the classification decision for cardiomegaly. We compared the explanations generated using CARL against those generated using a simple ℓ1 reconstruction loss on their similarity with the input images. To quantify the similarity between the explanation images and the query image in a latent space, we used the latent-space closeness (LSC) score. The LSC score is the fraction of images for which the explanation derived using CARL (x_c^CARL) is closer to the query image x than the explanation generated using the ℓ1 loss (x_c^ℓ1). We calculated similarity as the Euclidean distance between the embeddings of the query and explanation images. The LSC score is defined as LSC = (1/|X|) Σ_{x∈X} 1[ ||E(x) − E(x_c^CARL)||_2 < ||E(x) − E(x_c^ℓ1)||_2 ], where E(·) is a pre-trained feature extractor based on the Inception v3 network. Table 6 presents our results. A high LSC score, together with a high CV score (Fig. 19), shows that the query and counterfactual images are fundamentally the same but differ only in features that are sufficient to flip the classification decision.
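The LSC score reduces to a comparison of Euclidean distances in the feature space; a sketch is shown below, assuming the Inception v3 embeddings have already been computed.

```python
import numpy as np

def lsc_score(emb_query, emb_carl, emb_l1):
    """Latent-space closeness: fraction of images whose CARL explanation
    embedding is closer to the query embedding than the L1-loss explanation
    embedding is. All arrays are [num_images, feature_dim]."""
    d_carl = np.linalg.norm(emb_query - emb_carl, axis=1)
    d_l1 = np.linalg.norm(emb_query - emb_l1, axis=1)
    return np.mean(d_carl < d_l1)
```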

Extended classifier consistency results
Our explanation framework gradually perturbs the input image to traverse the classification boundary from one extreme (negative) to the other (positive). We quantify the consistency between our explanations and the classification model at every step of this transformation. We divided the prediction range [0, 1] into ten equally sized bins. For each bin, we generated an explanation image by choosing an appropriate c ∈ [0, 1]. We further divided the input image space into five groups based on their initial prediction, i.e., f(x). In Fig. 18, we represent each group as a line and plot the average response of the classifier, i.e., f(x_c), for the explanations in each bin against the expected outcome, i.e., c. For xGEM, we generated multiple, progressively changing explanations by traversing the latent space. For each input image, we generated ten explanation images. For cycleGAN, we can generate only images at the two extreme ends of the prediction spectrum.
Table 7. FID score quantifies the visual appearance of the explanations. CV score is the fraction of explanations that have an opposite prediction compared to the input image. FOP score is the fraction of real images with an FO in which the FO was also detected in the corresponding explanation image.
In the configuration with λ_1 = 0 there is no adversarial loss from the cGAN, with λ_2 = 0 there is no KL loss for classifier consistency, and with λ_3 = 0 there is no context-aware self-reconstruction loss. We also compared the saliency maps generated by our model with popular gradient-based methods. For quantitative evaluation, we consider the deletion evaluation metric (Petsiuk et al., 2018). The metric quantifies how the probability of the target class changes as important pixels are removed from an image.

(Figure panels: Cardiomegaly, Pleural Effusion, Edema.)
To remove pixels from an image, we selectively in-painted the removed region based on its surroundings.

Disease-specific evaluation
For the quantitative analysis, we randomly sample two groups of real images: (1) a real-normal group defined as X_n = {x; f(x) < 0.2}, which consists of real CXR images that are predicted as normal by the classifier f, and (2) a real-abnormal group defined as X_p = {x; f(x) > 0.8}. For X_n, we generated a counterfactual group as X_c^p = {x ∈ X_n; f(I_f(x, c)) > 0.8}. Similarly, for X_p, we derived a counterfactual group as X_c^n = {x ∈ X_p; f(I_f(x, c)) < 0.2}. Next, we quantify the differences between the real and counterfactual groups by performing statistical tests on the distributions of clinical metrics such as the cardiothoracic ratio (CTR) and the score of a normal costophrenic recess (SCP). Specifically, we performed dependent t-test statistics on the clinical metrics for paired samples.

Fig. 1. Counterfactual explanation shows "where" + "what" minimum change must be made to the input to flip the classification decision. For pleural effusion, we can observe the vanishing of the meniscus (red) in the counterfactual image as compared to the query image.

Notation: The classification function is defined as f : R^d → R^K, where d is the dimensionality of the image space and K is the number of classes. The classifier produces point estimates for the posterior probability of class k as P(y_k | x) = f(x)[k] ∈ [0, 1].
Explanation function: We aim to explain the decision of function f for a target class k. We consider a visual explanation of the black-box as a generative process that produces a plausible and realistic perturbation of the query image x such that the classification decision for class k is changed to a desired value c. This idea allows us to view c as a "knob". By gradually changing the desired output c in the range [0, 1], we generate progressively changing perturbations of the query image x, such that the classification decision changes from being negative to being positive for class k. To achieve this, we propose an explanation function x_c = I_f^k(x, c) : (X, R) → X. This function takes two arguments: a query image x and the desired posterior probability c for the target class k. The explanation function generates a perturbed image x_c such that f(x_c)[k] ≈ c. For simplicity, we will drop k from subsequent notations. Fig. 2 summarizes our framework. We design the explanation function to satisfy the following properties: (A) Data consistency: x_c should resemble a data instance from the input space, i.e., if the input space comprises CXRs, x_c should look like a CXR with minimum artifacts or blurring. (B) Classifier consistency: x_c should produce the desired output from the classifier f, i.e., f(I_f(x, c)) ≈ c. (C) Context-aware self-consistency: On using the original decision as the condition, i.e., c = f(x), the explanation function should reconstruct the query image. We enforce this condition for self-consistency as I_f(x, f(x)) = x and for cyclic consistency as I_f(x_c, f(x)) = x. Further, we constrain the explanation function to achieve the aforementioned reconstructions while preserving the anatomical shape and foreign objects (e.g., pacemaker) in the input image.
Overall objective: Our explanation function I_f(x, c) is trained end-to-end to learn the parameters of three networks, an image encoder E(·), a conditional GAN generator G(·), and a discriminator D(·), to satisfy the above three properties while minimizing the following objective function: min_{E,G} max_D λ_cGAN L_cGAN(D, G) + λ_f L_f(G) + λ_rec L_rec(E, G).

Fig. 2. Explanation function I_f(x, c) for classifier f. Given an input image x, we generate a perturbation of the input, x_c, as the explanation, such that the classifier's posterior probability changes from its original value, f(x), to a desired value c, while satisfying the three consistency constraints.
The first term is the adversarial loss associated with the generated image given the condition c; it is a function of both G and D. The second term minimizes the KL divergence between the posterior probability for the new image, f(x_c), and the desired prediction c; it influences only G. The third term is the reconstruction loss enforcing the self-consistency property; its context-aware form (CARL) is illustrated in Fig. 3.
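For concreteness, the following is a minimal PyTorch-style sketch of the generator-side objective, not the authors' implementation. The modules E, G, D, the target-class posterior f, the λ weights, and the plain ℓ1 reconstruction term are assumptions; the full loss in this work additionally uses the context-aware reconstruction loss of Fig. 3.

import torch
import torch.nn.functional as F

def explainer_loss(x, c, E, G, D, f, lambda_adv=1.0, lambda_f=1.0, lambda_rec=1.0):
    """x: batch of query images; c: desired posterior in [0, 1], shape (B, 1).
    f is assumed to return the posterior probability for the target class only."""
    x_c = G(E(x), c)                                   # perturbed image for condition c
    # (1) data consistency: adversarial term; D is assumed to output a logit
    d_logit = D(x_c, c)
    adv = F.binary_cross_entropy_with_logits(d_logit, torch.ones_like(d_logit))
    # (2) classifier consistency: KL divergence between Bernoulli(c) and Bernoulli(f(x_c))
    p = f(x_c).clamp(1e-6, 1 - 1e-6)
    c_ = c.clamp(1e-6, 1 - 1e-6)
    kl = (c_ * torch.log(c_ / p) + (1 - c_) * torch.log((1 - c_) / (1 - p))).mean()
    # (3) self-consistency: conditioning on f(x) should reconstruct the query image
    rec = F.l1_loss(G(E(x), f(x)), x)
    return lambda_adv * adv + lambda_f * kl + lambda_rec * rec

The discriminator is trained separately with its own adversarial objective, as is standard for conditional GANs.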

Fig. 3. (a) A domain-aware self-reconstruction loss with pre-trained semantic segmentation S(x) and object detection O(x) networks. (b) The self- and cyclic reconstructions should retain maximum information from x.
(1) A real-negative group, X_n, consists of real CXRs that are predicted as negative by the classifier f for a target class k. (2) A real-positive group defined as X_p = {x; f(x) > 0.8}. For X_n, we generated a counterfactual group by setting the condition c = 1.0, i.e., X_c^{n→p} = {I_f(x, c = 1) ∀x ∈ X_n}. Similarly, for X_p, we derived a counterfactual group as X_c^{p→n} = {I_f(x, c = 0) ∀x ∈ X_p}. We created one such dataset for each target class k. Combining the two groups, our set of real images is X = X_n ∪ X_p and the corresponding set of counterfactual explanations is X_c = X_c^{n→p} ∪ X_c^{p→n}. All the results are computed on this evaluation dataset.
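As an illustrative sketch (not the authors' code) of assembling this evaluation set for one target class k: here "images" is a hypothetical list of CXR tensors, "explain" stands in for the trained I_f, and the 0.2 negative-decision threshold is an assumption.

NEG_THR, POS_THR = 0.2, 0.8                           # 0.2 threshold is an assumption
X_n = [x for x in images if f(x)[k] < NEG_THR]        # real-negative group
X_p = [x for x in images if f(x)[k] > POS_THR]        # real-positive group
X_n_to_p = [explain(x, c=1.0) for x in X_n]           # counterfactuals: negative -> positive
X_p_to_n = [explain(x, c=0.0) for x in X_p]           # counterfactuals: positive -> negative
X_real = X_n + X_p                                    # all real images
X_cf = X_n_to_p + X_p_to_n                            # all counterfactual explanations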
Fréchet Inception Distance (FID) score: The FID score (Heusel et al., 2017) is computed as FID = ||µ_x − µ_{x_c}||² + Tr(Σ_x + Σ_{x_c} − 2(Σ_x Σ_{x_c})^{1/2}), where the µ's and Σ's are the mean and covariance of the activation vectors derived from the penultimate layer of an Inception-v3 network pre-trained on the MIMIC-CXR dataset. Counterfactual Validity (CV) score: The CV score (Mothilal et al., 2020) is defined as the fraction of counterfactual explanations that successfully flipped the classification decision.
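For reference, a small NumPy/SciPy sketch of the FID computation defined above; act_real and act_cf are assumed to be (N, d) matrices of Inception-v3 penultimate-layer activations for the real and counterfactual groups.

import numpy as np
from scipy.linalg import sqrtm

def fid(act_real, act_cf):
    mu_r, mu_c = act_real.mean(axis=0), act_cf.mean(axis=0)
    sigma_r = np.cov(act_real, rowvar=False)
    sigma_c = np.cov(act_cf, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_c)
    if np.iscomplexobj(covmean):          # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_c) ** 2) + np.trace(sigma_r + sigma_c - 2 * covmean))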

Fig. 4. Qualitative comparison of the counterfactual explanations generated for two classification tasks, cardiomegaly (first row) and pleural effusion (PE) (last row). The bottom labels are the classifier's predictions for the specific task. For the input image in the first column, our model generates a series of images x_c as explanations by gradually changing c in the range [0, 1]. The last column presents a pixel-wise difference map between the explanations at the two extreme ends, i.e., with condition c = 0 (negative decision) and condition c = 1 (positive decision). The heatmap highlights the regions that changed the most during the transformation. For cardiomegaly, we show the heart border in yellow. For PE, we show the results of an object detector as a bounding box (BB) over the healthy (blue) and blunt (red) CP recess regions. The number at the top-right of the blue BB is the score for detecting a healthy CP recess (SCP). The number on the red BB is 1 − SCP.
A valid counterfactual explanation receives the opposite classification decision from the classifier as compared to the query image. In Fig. 4, we present qualitative examples of counterfactual explanations from our method and compare them against those obtained from xGEM and CycleGAN.
4.1.1. Classifier consistency
In Fig. 4, we observe that the explanation images gradually flip their decision f(x_c) (bottom label) as we go from left to right. Table 1 summarizes our results on the CV score metric. A high CV score for our model confirms that the condition used in the cGAN is successfully met and that the generated explanations successfully flip the classification decision; hence, they are consistent with the classifier. On the other hand, cycleGAN achieves a CV score of about 50%, thus creating valid counterfactual images only half of the time. In a deployment scenario, a counterfactual explanation that fails to flip the classification decision would be rejected as invalid; hence, half of the explanations provided by cycleGAN would be rejected. Our formulation constrains the condition c to vary linearly with the actual prediction f(x_c), i.e., if we increase c in the range [0, 1], the cGAN should create an image x_c such that the condition c is met and the actual prediction f(x_c) also increases. Further, consider the scenario where c = 1.0: the expected behaviour is f(x_{c=1.0}) ≈ 1.0.
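A minimal sketch of the CV-score computation described above, assuming a 0.5 decision threshold and query/counterfactual pairs (x, x_c) for the target class k:

def cv_score(f, pairs, k, thr=0.5):
    """Fraction of counterfactuals whose thresholded prediction flips relative to the query."""
    pairs = list(pairs)
    flipped = sum((f(x)[k] > thr) != (f(x_c)[k] > thr) for x, x_c in pairs)
    return flipped / len(pairs)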

Fig. 5. The plot of the condition c (desired prediction) against the actual response of the classifier on the generated explanations, f(x_c). Each line represents a set of input images with prediction f(x) in a given range. Plots for xGEM and cycleGAN are shown in SM-Fig. 18.
Here, we quantify the strength of our revised CARL loss in preserving FOs in the explanation images compared to an image-level ℓ1 reconstruction loss. In Table 2, we report the results on the FOP score metric. Our model with CARL obtained a higher FOP score. Note that the FO detector network has an accuracy of 80%. Fig. 6 presents examples of counterfactual explanations generated by our model with and without CARL.
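Read in this way, the FOP score can be sketched as the fraction of FO-containing query images for which the detector still finds the object in the counterfactual; this is a hedged illustration, not the authors' evaluation code, where "detects" is a hypothetical wrapper around the FO detector O(·) returning True/False.

def fop_score(detects, pairs):
    """pairs: (x, x_c) query/counterfactual pairs; detects(img) -> bool for the FO of interest."""
    with_fo = [(x, x_c) for x, x_c in pairs if detects(x)]   # queries where an FO is detected
    if not with_fo:
        return float("nan")
    return sum(detects(x_c) for _, x_c in with_fo) / len(with_fo)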

Fig. 7. Quantitative comparison of our method against gradient-based methods: mean area under the deletion curve (AUDC), plotted as a function of the fraction of removed pixels. A low AUDC indicates a sharp drop in prediction accuracy as more pixels are deleted.

Fig. 8. Comparison of our method against different gradient-based methods. A: input image; B: saliency maps from existing works; C: our simulation of a saliency map as the difference map between the normal and abnormal explanation images. More examples are shown in SM-Fig. 21.

We compared clinically relevant metrics between the normal (negative diagnosis) and abnormal (positive diagnosis) populations, as identified by the given classifier. If the change in the classification decision is associated with a corresponding change in the clinical metric, we can conclude that the classifier considers clinically relevant information in its diagnosis prediction.
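As a concrete illustration (not the authors' evaluation code), one can contrast a clinical metric such as the cardiothoracic ratio between paired real and counterfactual images; "ctr" is a hypothetical metric function, and X_n, X_n_to_p reuse the hypothetical groups sketched earlier.

import numpy as np
from scipy.stats import ttest_rel

ctr_real = np.array([ctr(x) for x in X_n])              # CTR on negative-decision queries
ctr_cf = np.array([ctr(x_c) for x_c in X_n_to_p])       # CTR on their positive counterfactuals
t_stat, p_val = ttest_rel(ctr_cf, ctr_real)             # paired comparison
print(f"mean CTR change: {np.mean(ctr_cf - ctr_real):.3f}, paired t-test p = {p_val:.3g}")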

Fig. 10. Comparison of the evaluation metrics of understandability, classifier's decision justification, visual quality, and identity preservation across the different explanation conditions.
We defined three essential properties of a counterfactual transformation: data consistency, classifier consistency, and self-consistency. Our results in Section 4.1 showed that our framework adheres to all three properties. Comparison with xGEM and cycleGAN: Our model creates visually appealing explanations that produce the desired outcome from the classification model while retaining maximum patient-specific information. In comparison, both xGEM and cycleGAN failed on at least one essential property. The xGEM model fails to create realistic images, as reflected in its high FID score. The cycleGAN model fails to flip the classifier's decision, as reflected in its low CV score (∼50%).

The FO preservation score is not perfect. A possible reason for this gap is the limited capacity of the object detector used to calculate the FOP score. Training a highly accurate FO detector is outside the scope of this study. Further, a resolution of 256 × 256 for the counterfactually generated images is smaller than that of a standard CXR. The small resolution limits the evaluation of fine details, both by the algorithm and by the interpreter. Our formulation of the cGAN uses conditional batch normalization (cBN) to encapsulate the condition information while generating images. For efficient cBN, the mini-batches should be class-balanced. To accommodate high-resolution images with smaller batch sizes, we must decrease the number of conditions to ensure class-balanced batches. Fewer conditions resulted in a coarse transformation with abrupt changes across explanation images. In our experiments, we selected the largest N that created a class-balanced batch fitting in GPU memory and resulted in stable cGAN training. However, with the advent of larger-memory GPUs, we intend to apply our methods to higher-resolution images in future work and assess how that impacts interpretation by clinicians. To conclude, this study uses counterfactual explanations as a way to audit a given black-box classifier and evaluate whether the radiographic features used by that classifier have any clinical relevance. In particular, the proposed model did well in explaining the decisions for cardiomegaly and pleural effusion, which was corroborated by an experienced radiology resident physician. By providing visual explanations for deep learning decisions, radiologists can better understand the causes of the model's decision-making. This is essential to lessen physicians' concerns regarding the "BlackBox" nature of an algorithm and to build the trust needed for incorporation into everyday clinical workflow. As an increasing number of artificial intelligence algorithms offer the promise of everyday utility, counterfactually generated images are a promising conduit to building trust among diagnostic radiologists. By providing counterfactual explanations, our work opens up many ideas for future work. Our framework showed that valid counterfactuals can be learned using an adversarial generative process that is regularized by the classification model. However, counterfactual reasoning is incomplete without a causal structure and explicit modeling of interventions. An interesting next step would be to explore incorporating or discovering plausible causal structures and creating explanations grounded in them.
For diagnosis classification, we used the DenseNet-121 (Huang et al., 2016) architecture as the classification model. In DenseNet, each layer implements a non-linear transformation based on composite functions such as Batch Normalization (BN), rectified linear unit (ReLU), pooling, or convolution. The resulting feature map at each layer is used as input for all subsequent layers, leading to a highly convoluted multi-level, multi-layer non-linear convolutional neural network. We aim to explain such a model post-hoc, without accessing the parameters learned by any layer. Architecture of the Encoder, Generator, and Discriminator: The details of the architecture are given in Table 5. For the encoder network, we used five ResBlocks with the standard batch normalization (BN) layer. In the encoder-ResBlock, we performed downsampling (average pooling) before the first convolution of the ResBlock, as shown in Fig. 16.a. For the generator network, we follow the details in (Miyato et al., 2018) and replace the BN layer in the encoder-ResBlock with conditional BN (cBN) to encode the condition (see Fig. 16.b). The architecture for the generator has five ResBlocks; each ResBlock performs up-sampling through nearest-neighbour interpolation. For the discriminator, we used spectral normalization (SN) (Miyato and Koyama, 2018) in the discriminator-ResBlock and performed down-sampling after the second convolution of the ResBlock, as shown in Fig. 16.c. For the optimization, we used the Adam optimizer (Kingma and Ba, 2015), with hyper-parameters set to α = 0.0002, β1 = 0, and β2 = 0.9. For creating the training dataset, we divide the posterior distribution for the target class, f(x) ∈ [0, 1], into N equally-sized bins (a minimal sketch of this binning step is given below). The cGAN is then trained on N conditions. For efficient training, cBN requires class-balanced batches. A smaller bin width δ results in more conditions for training the cGAN, increasing the cGAN's complexity and training time; it also requires a larger batch size to ensure that each condition is well represented in a batch. Hence, the GPU memory size bounds the largest usable value of N. Foreign object (FO) detector: We considered foreign objects such as pacemakers and other hardware. Such foreign objects are easy to identify in a CXR and do not require expert knowledge for manual labelling. Using the CheXpert labeller, we extracted 300 CXR images with positive mentions for each observation. The extracted x-rays were then manually annotated with bounding boxes marking the presence of foreign objects, using the LabelMe (Wada, 2016) annotation tool. Next, we trained an object detector based on Fast Region-based CNN (Ren et al., 2015), which used a VGG-16 model (Simonyan and Zisserman, 2014) trained on the MIMIC-CXR dataset as its backbone. We used this object detector to enforce our novel context-aware reconstruction loss (CARL).
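A minimal sketch of the condition-binning step mentioned above, assuming "posteriors" is a 1-D array of f(x) values for the target class; class-balanced mini-batches can then be drawn by sampling an equal number of images from each bin (a simple per-bin sampling strategy is an assumption here).

import numpy as np

def assign_conditions(posteriors, n_bins):
    """posteriors: 1-D array of f(x) values in [0, 1]; returns bin indices in [0, n_bins-1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)            # N equally-sized bins
    return np.clip(np.digitize(posteriors, edges[1:-1]), 0, n_bins - 1)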

Fig. 15. The costophrenic angle (CPA) on a CXR is marked as the angle formed by (a) the costophrenic angle point, (b) the hemidiaphragm point, and (c) the lateral chest wall point, as shown by Maduskar et al. (Maduskar et al., 2016).
Learning to automatically mark the CPA requires expert annotation and is prone to error. Hence, rather than marking the CPA, we annotate the CP region with a bounding box, which is a much simpler task. We then learned an object detector to identify a normal or abnormal CP recess in a CXR and used the score for detecting a normal CP recess (SCP) as our evaluation metric.
6.8. xGEM
We refer to the work by Joshi et al. (Joshi et al., 2019) for the implementation of xGEM. xGEM iteratively traverses the input image's latent space and optimizes the traversal to flip the classifier's decision to a different class. Specifically, it solves the following optimization
x̃ = G_θ( arg min_{z ∈ R^d} L(x, G_θ(z)) + λ ℓ( f(G_θ(z)), y′) ),   (14)
where the first term is an ℓ2 distance loss comparing real and generated data. The second term ensures that the classification decision for the generated sample is in favour of class y′, where y′ ≠ y is a class other than the original decision. Unless explicitly imposed, the explanation image does not look realistic. The explanation image is generated from an updated latent feature, and the expressiveness of the generator limits its visual quality. xGEM adopted a variational autoencoder (VAE) as the generator. The VAE uses a Gaussian likelihood (ℓ2 reconstruction), an assumption that is known to produce blurry images.
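A hedged PyTorch sketch of the xGEM-style latent optimization in Eq. (14): gradient descent on z reconstructs x while pushing the classifier towards the counter class. The generator G_theta, the classifier f (assumed here to return class logits), y_prime (a LongTensor holding the counter-class index), and all hyper-parameters are assumptions, not xGEM's exact settings.

import torch
import torch.nn.functional as F

def xgem_explain(x, G_theta, f, y_prime, z_dim, lam=1.0, steps=200, lr=0.01):
    z = torch.zeros(1, z_dim, requires_grad=True)        # latent code to optimize
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x_hat = G_theta(z)
        loss = F.mse_loss(x_hat, x) + lam * F.cross_entropy(f(x_hat), y_prime)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G_theta(z).detach()                            # the explanation image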

Fig. 16. Architecture of the ResBlocks used in all experiments.

Fig. 17. An example of an input image before and after removing the pacemaker.

Fig. 18. The plot of the desired outcome, f(x) + δ, against the actual response of the classifier on the generated explanations, f(x_δ). Each line represents a set of input images with classification prediction f(x) in a given range. The dashed line represents the y = x line.

Fig. 19. Each cell is the fraction of the generated explanations that flipped in a class as compared to the query image. The x-axis shows the classes in the multi-label setting, and the y-axis shows the target class for which an explanation is generated. Note: this is not a confusion matrix.
In Fig. 20, we show an example of deletion-by-inpainting. For generating the results in Table ??, we plotted the deletion curve for 500 images and calculated the area under the deletion curve (AUDC) for each (a small sketch of this computation is given below). Please note that, as more pixels are removed, the modified images become unrealistic and visually different from a CXR; the behavior of the classifier on such images is inconsistent. A low AUDC demonstrates that all the methods are successful in localizing the important regions for classification. However, unlike saliency-based methods, our counterfactual explanation provides extra information on which image features in those relevant regions are used for classification and how those image features change with the classification decision. We used the Adam optimizer with β2 = 0.999 and a learning rate of 1 × 10−4, which was fixed for the duration of the training, and a batch size of 16 (Ren et al., 2015). Image pre-processing included cropping, re-scaling, and intensity normalization. Our classification model achieved an AUC-ROC of 0.87 for cardiomegaly, 0.95 for pleural effusion, and 0.91 for edema. These results are comparable to the performance of the published model (Irvin et al., 2019). Object detector: We trained a Fast Region-based CNN (Ren et al., 2015) network as the object detector O(·). We trained three independent detectors for three use-cases: detecting foreign objects (FO) such as pacemakers and hardware, detecting a healthy costophrenic (CP) recess, and detecting a blunt CP recess. For constructing a training dataset for this object detection, we first collected candidate CXRs for each object by parsing the radiology reports associated with the CXRs to find positive mentions of "blunting of the costophrenic angle" for blunt CP recess and "lungs are clear" for healthy CP recess. For each object, we manually collected bounding box annotations for 300 candidate CXRs.
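A small sketch of the deletion-curve metric (AUDC) discussed at the start of this subsection: pixels are removed in order of importance and the classifier's score for class k is tracked. The "inpaint" helper, which removes the given pixel indices by in-painting, is a hypothetical stand-in.

import numpy as np

def audc(f, x, importance, k, fractions, inpaint):
    """importance: per-pixel ranking map; fractions: e.g. np.linspace(0, 1, 21)."""
    order = np.argsort(-importance.ravel())              # most important pixels first
    scores = []
    for frac in fractions:
        n_remove = int(frac * order.size)
        x_mod = inpaint(x, order[:n_remove])              # remove/in-paint the top pixels
        scores.append(float(f(x_mod)[k]))
    return np.trapz(scores, fractions)                    # area under the deletion curve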

Table 1. The counterfactual validity (CV) score is the fraction of explanations that have an opposite prediction compared to the input image. The FID score quantifies the visual appearance of the explanations. We have normalized the FID scores with respect to the best method (cycleGAN).

Table 1 reports the FID scores for our method and compares them against xGEM and cycleGAN. CycleGAN achieved the lowest FID score, thus generating the most realistic images as explanations. Our approach achieved a lower FID score than xGEM; xGEM has a very high FID score, thus creating counterfactual images that are very different from the query image and hence unsuitable for deployment.

Table 2. The foreign object preservation (FOP) score for our model with and without the context-aware reconstruction loss (CARL). The FOP score depends on the performance of the FO detector.

Table 3. Results of a one-way ANOVA for the understandability metric, followed by Tukey's HSD post-hoc test between different levels of agreement. Note that the mean value for E4 (our counterfactual explanation) is the highest, indicating that our explanations helped users the most in understanding the classifier's decision. *p < 0.05; ***p < 0.0001.

Table 4. Summary of the notation.

Table 6. The latent-space closeness (LSC) score for our model with and without the context-aware reconstruction loss (CARL).