key: cord-0656119-rqzv1j66
authors: Khorram, Saeed; Lawson, Tyler; Li, Fuxin
title: iGOS++: Integrated Gradient Optimized Saliency by Bilateral Perturbations
date: 2021-01-01
journal: nan
DOI: nan
sha: 0410155dd4c906a872187b51f0751e95a20a80f8
doc_id: 656119
cord_uid: rqzv1j66

The black-box nature of deep networks makes explaining "why" they make certain predictions extremely challenging. Saliency maps are one of the most widely used local explanation tools to alleviate this problem. One of the primary approaches for generating saliency maps is to optimize a mask over the input dimensions so that the output of the network is influenced the most by the masking. However, prior work only studies such influence by removing evidence from the input. In this paper, we present iGOS++, a framework to generate saliency maps that are optimized for altering the output of the black-box system by either removing or preserving only a small fraction of the input. Additionally, we propose to add a bilateral total variation term to the optimization that improves the continuity of the saliency map, especially under high resolution and with thin object parts. The evaluation results from comparing iGOS++ against state-of-the-art saliency map methods show significant improvement in locating salient regions that are directly interpretable by humans. We utilized iGOS++ in the task of classifying COVID-19 cases from X-ray images and discovered that sometimes the CNN is overfitted to the characters printed on the X-ray images when performing classification. Fixing this issue by data cleansing significantly improved the precision and recall of the classifier.

As deep networks achieve excellent performance in many tasks, more and more people want to open these black boxes to understand how they make their decisions under the hood. Especially, explaining deep classifiers can help to potentially "debug" them to understand how they make mistakes, and fix those mistakes e.g. by additional data preprocessing. This is increasingly important as deep learning is starting to be used in critical decision-making scenarios such as autonomous driving and medical diagnosis.

Saliency map or heatmap visualization is a fundamental tool for explaining convolutional networks (CNNs). It is mostly used to directly explain classification decisions, but approaches that explain intermediate network nodes or forming new concepts would depend on them as well. In earlier days, heatmap visualizations were mostly based on computing gradient variants of the network with respect to the input [15, 17, 24] . However, due to the highly nonlinear nature of the CNNs, those one-step gradients only account for infinitesimal changes in the function values and do not necessarily correlate to the features that CNNs actually use for decision making [1] . This has caused some of the earlier works to be disillusioned on heatmap research. The most successful of the gradient-based approaches, Grad-CAM [15] , partially avoids this issue by not backpropagating into the convolutional layers, hence correlating reasonably well with CNN classification. However, its heatmaps are quite low-resolution since it only works at the final layers where the images are already low-resolution.

Another group of approaches optimizes for a small mask so that CNNs can no longer classify masked images [7, 12]. These approaches provably correlate with the CNN classification, since it can usually be shown that the CNN would no longer predict the original category once the image is masked. However, as we experiment with those approaches, we find that sometimes it is much simpler to "break" the important features CNNs are using to classify, without necessarily capturing all the important features. An intuitive example of this is an object whose important parts are long and thin, as shown in Fig. 1. The mask only needs to cover a small number of points to "break" the legs of the tiger beetle, making them disconnected and leaving the CNN no longer capable of classification. However, those areas, when revealed to the CNN, do not necessarily contain enough information (e.g. the complete legs) for CNNs to come to a correct classification of the image. As a result, those methods do not perform well, especially at higher resolutions, as their masks become disconnected and "adversarial", focusing only on reducing the CNN prediction rather than locating all the informative areas. Usually, the capability of the CNN to classify the image into its original category based on the masked parts drops noticeably when moving from 14 × 14 heatmap resolution to e.g. 224 × 224 [12].

Figure 1: Comparison of heatmaps generated from [12] (I-GOS) and our proposed approach (iGOS++) for a tiger beetle image. I-GOS focuses only on breaking existing evidence (e.g. legs of the beetle), hence it generates a mask that is highly scattered under high resolution (2nd image from left), and the prediction confidence with the masked area (middle image) on the tiger beetle category is very low. However, with the same amount of pixels (6%) from the proposed iGOS++ heatmap, the network has 99.2% confidence (5th image from left).

In this paper, we propose novel improvements to address this gap. The first novelty is to optimize for an additional insertion loss, which aims for the CNN to correctly classify the object even if only a small informative part of the image is revealed. Instead of just adding a simple loss term, we found that separating the deletion mask and insertion mask during the optimization process helps the visualization performance. Our second novelty is a new bilateral total variation term that makes the mask smooth on image areas with similar color, alleviating heatmap diffusion at higher resolutions. A combination of these techniques enables us to obtain heatmap visualizations with significantly better performance, especially when the heatmaps are generated at a high resolution, where they are more prone to adversarial solutions if the masks are derived solely from the deletion loss. In the tiger beetle example (Fig. 1), it can be seen that iGOS++ is capable of capturing the entire legs of the beetle and hence of generating a high-confidence prediction even with only 6% of the pixels in the original image. We evaluate iGOS++ through extensive experiments and compare our results against state-of-the-art baselines on non-medical datasets, ImageNet and FashionMNIST, as standard benchmarks for the evaluation of saliency maps.

We utilized iGOS++ in a real-world problem of detecting COVID-19 patients based on chest X-ray imaging. Interestingly, we found that in some cases the classifier is overfitted to characters printed on the X-ray images, which clearly should not be related to the underlying indicators of the disease. This illustrates one of the major problems of current deep network models: their lack of interpretability. These black-box classifiers can overfit to class priors not expected by humans, which could lead to poor generalization and arbitrary decisions, particularly in high-stakes tasks such as medical diagnosis. Once we pre-processed all images to remove written characters, meaningful performance improvements on the COVID-19 detection task were observed. This shows the utility of heatmap visualization algorithms in realistic tasks to open up the black box and reveal some of the biases the classifiers may have learned.

We only review related work in heatmap visualizations rather than the broader problem of explaining deep networks. Most visualization approaches can be categorized into gradient-based approaches and perturbation-based approaches.

Gradient-based approaches for generating saliency maps commonly use different backpropagation heuristics to derive the sensitivity of the output score with respect to the input. Deconv and Saliency Maps [17, 23] attach special deconvolutional layers to the convolutional layers. Guided-backprop [19] works similarly to [23], but masks values based on negative gradients. [16] multiplies the gradient with the image RGB values. [20] proposes integrated gradients, which compute multiple gradients along a straight line in the image space and average them. [2] computes the relevance across different layers and assigns an importance to each pixel that is used to create a saliency map. [24] uses a winner-take-all probabilistic model to send top-down signals through the network and generate probabilities based on the weights that are used to create the saliencies. Grad-CAM [15] is the most popular visualization method in this category. It generalizes the existing class activation method [26] to work on any CNN-based neural network and maps class scores back to the previous convolution layer.

However, gradients reflect one-step infinitesimal changes in the input, which do not necessarily correspond to a direction in which the output score from a deep network would drop significantly, particularly for deep networks that are highly non-linear functions. In addition, some of these saliency maps were shown to be completely or somewhat independent of the category, only showing strong image edges [1].

Perturbation-based approaches work by modifying the input in some way, e.g. masking, and testing how the output of the network changes. [25] is a method that iteratively removes sections of the image using the gradient until the image contains only the information needed for classification of the target class. In [4], a new network is trained to generate saliency maps. RISE [11] and LIME [13] are similar perturbation-based methods that treat the model as a black box and thus do not use gradients at all. They both involve randomly perturbing the image. RISE weighs all of the random masks by the model's confidence and combines them, taking into account each pixel's distribution in the random masks. In LIME, the random masks are used to fit a linear model to the black-box model in the local space around the image. The distance from the original image is used in the loss function and the final weights of the linear model are used to generate an explanation for the image. LIME relies on super-pixels to avoid adversarial masks. Some methods utilize optimization with multiple iterations to generate a heatmap visualization [7, 12]. Here the main challenge is the highly non-convex nature of the optimization problem. [7] optimizes a mask to reduce the prediction confidence of a target class. Following [7], [6] uses a fixed-area binary mask that maximally affects the output. This is advantageous as it mitigates the balancing issue in the original mask optimization. [12] is the most related to our work. It combines the algorithms in [7] and [20] by utilizing the integrated gradient to optimize the mask. This is shown to significantly improve the performance of the optimization.

3 BILATERALLY-OPTIMIZED PERTURBATIONS

3.1 Background

3.1.1 Heatmap Visualizations by Optimization. We consider the well-known image classification task, where a black-box network predicts a score $f_c(I_0)$ on class $c$ for input image $I_0$. Let ⊙ denote the Hadamard product.

The idea of optimization-based heatmap visualization is to locate the regions in the input image $I_0$ that are most important for the network in outputting $f_c(I_0)$. These local perturbations can be formulated as the element-wise product of a real-valued mask $M$ with the image ($I_0 \odot M$) [7]. Prior work optimizes for the deletion task, namely masking the image so that it has e.g. low predictive confidence for class $c$. Afterward, one can visualize the mask to find the salient regions that caused the output confidence to decrease. Mathematically:

$$\min_{M} \; f_c\big(\Phi(I_0, \tilde{I}_0, M)\big) + g(M) \qquad (1)$$

where $\tilde{I}_0$ is a baseline image with near-zero evidence about the target class $c$, $f_c(\tilde{I}_0) \approx \min_I f_c(I)$. It is often chosen to be a constant, white noise, or a blurred version of the image $I_0$ [7]. The masking operator $\Phi(I_0, \tilde{I}_0, M)$ uses a weighted version between $I_0$ and $\tilde{I}_0$ to block the influence of certain pixels, decided by the mask values. The aim of optimization problem (1) is to locate a small and smooth mask which identifies the regions that are most informative to the black box by maximally reducing the output score (predictive confidence on class $c$), $f_c(\Phi(I_0, \tilde{I}_0, M)) \ll f_c(I_0)$. The regularization term $g(M)$ encourages the mask to be small, by penalizing the magnitude of $M$ with coefficient $\lambda_1$, as well as to be smooth, by penalizing the total variation (TV) in $M$ [7] with coefficient $\lambda_2$.
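As a minimal sketch of the masking operator and the deletion objective described above, the NumPy snippet below assumes the convention that a mask value of 1 keeps a pixel and 0 replaces it with the baseline, takes the size penalty as the L1 norm of 1 − M, and uses an absolute-difference total variation. These choices, the function names (`apply_mask`, `total_variation`, `deletion_objective`), and the toy score standing in for the black-box network are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def apply_mask(image, baseline, mask):
    """Phi: blend image and baseline; mask=1 keeps the image, mask=0 uses the baseline (assumed convention)."""
    return image * mask + baseline * (1.0 - mask)

def total_variation(mask):
    """Sum of absolute differences between vertically and horizontally adjacent mask values."""
    dy = np.abs(np.diff(mask, axis=0)).sum()
    dx = np.abs(np.diff(mask, axis=1)).sum()
    return float(dx + dy)

def deletion_objective(score_fn, image, baseline, mask, lam1=1.0, lam2=1.0):
    """score(Phi(image, baseline, mask)) + lam1 * ||1 - mask||_1 + lam2 * TV(mask) (assumed form of g)."""
    masked = apply_mask(image, baseline, mask)
    reg = lam1 * np.abs(1.0 - mask).sum() + lam2 * total_variation(mask)
    return float(score_fn(masked) + reg)

# Hypothetical stand-in for the class score: mean intensity of the top-left 2x2 patch.
def toy_score(x):
    return float(x[:2, :2].mean())

image = np.ones((4, 4))
baseline = np.zeros((4, 4))
keep_all = np.ones((4, 4))
delete_patch = keep_all.copy()
delete_patch[:2, :2] = 0.0  # remove exactly the evidence the toy score looks at

print(deletion_objective(toy_score, image, baseline, keep_all))
print(deletion_objective(toy_score, image, baseline, delete_patch))
```

With unit coefficients the second mask drives the toy score to zero but pays for the deleted area and its boundary, which is the trade-off the regularizer is meant to control.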

3.1.2 Integrated Gradient. Eqn. (1) is a complicated non-convex optimization problem. [7] optimizes the mask by gradient descent. However, this is slow and can take hundreds of iterations to converge. In addition, gradient descent can converge to a local optimum and is not able to escape it. [12] alleviated this issue by using the integrated gradient (IG) [20] rather than the conventional gradient as the descent direction for solving the mask optimization. The IG of $f_c$ with respect to $M$ can be formulated as follows,

where it accumulates the gradients along the straight-line path from the perturbed image $\Phi(I_0, \tilde{I}_0, M)$ towards the baseline $\tilde{I}_0$, which approximately finds the global optimum of the unconstrained problem in eq. (1). Equivalently, IG can be thought of as performing gradient descent to simultaneously optimize the performance of multiple masks:

which makes it a proper optimization algorithm. Practically, [12] has shown that it improves the optimization performance of eq. (1) as well.
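The idea of accumulating gradients along the path from the masked image to the baseline can be sketched as follows. The snippet assumes the path is discretized by scaling the mask with s/S for s = 1..S, and it uses a toy sigmoid score with an analytic input gradient in place of a network. The names `integrated_gradient`, `toy_score`, and `toy_score_grad` are hypothetical, and the exact discretization used in [12] may differ.

```python
import numpy as np

def apply_mask(image, baseline, mask):
    # Assumed convention: mask=1 keeps the image, mask=0 uses the baseline.
    return image * mask + baseline * (1.0 - mask)

def integrated_gradient(score_grad, image, baseline, mask, steps=20):
    """Average, over s = 1..steps, the gradient w.r.t. the mask of score(Phi(image, baseline, (s/steps) * mask))."""
    total = np.zeros_like(mask, dtype=float)
    for s in range(1, steps + 1):
        alpha = s / steps
        x = apply_mask(image, baseline, alpha * mask)
        # Chain rule: d Phi / d mask = alpha * (image - baseline).
        total += alpha * (image - baseline) * score_grad(x)
    return total / steps

# Toy score f(x) = sigmoid(sum(w * x)) with a closed-form gradient w.r.t. x.
w = np.array([[2.0, -1.0], [0.5, 0.0]])

def toy_score(x):
    return 1.0 / (1.0 + np.exp(-(w * x).sum()))

def toy_score_grad(x):
    p = toy_score(x)
    return p * (1.0 - p) * w

image = np.ones((2, 2))
baseline = np.zeros((2, 2))
mask = np.ones((2, 2))
ig = integrated_gradient(toy_score_grad, image, baseline, mask)
print(ig)
```

Because every term is the gradient of the score at one scaled mask, a step along the negative of `ig` is a gradient step on the average score over all the scaled masks, which matches the "multiple masks" reading given in the text.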

As shown in Fig. 1 and discussed in the introduction, the capability of destroying a CNN feature does not by itself fully explain the features a CNN used. In this section we propose iGOS++, which improves upon [12] by also optimizing for the insertion task. This would make the CNN predict the original class given information from only a small and smooth area. We believe that the deletion task and the insertion task are complementary and need to be considered simultaneously, since both of them contain important information that offers a novel "look" into the network behavior by perturbing the input. Besides, considering one task alone is prone to reaching adversarial solutions [21], particularly when removing evidence. However, it would be unlikely to find adversarial solutions that satisfy both aforementioned criteria. A naïve approach to implement this would be to add an insertion loss $-f_c(\Phi(I_0, \tilde{I}_0, 1 - M))$ directly to eq. (1). However, empirically we found that optimizing separate masks performed better than the direct approach. Our method aims to optimize separate deletion and insertion masks over the input with the constraint that the product of the two also satisfies both the deletion and insertion criteria for a target class $c$, i.e. deletion of the evidence from the input $I_0$ drastically reduces the output score while retaining the same evidence preserves the initial output score $f_c(I_0)$. Formally, we solve the optimization problem with 3 masks:

where $M_x$ is the deletion mask, $M_y$ is the insertion mask, and their element-wise product, $M_x \odot M_y$, is taken as the final solution to the above optimization problem. The BTV term stands for Bilateral Total Variation, which is explained later in section 3.2.1.

In the above formulation, eq. (4), the resolution of the masks is flexible. If the mask has a lower resolution than the original input $I_0$, it is first up-sampled to the input size using bilinear interpolation and then applied over the image. The choice of the mask resolution depends on the application and the amount of detail desired in the output mask. Commonly, lower-resolution masks, e.g. 14 × 14, tend to generate coarser and smoother saliency maps, while higher-resolution masks, e.g. 224 × 224, generate more detailed and scattered ones. Lower-resolution masks also have the advantage of being more robust against adversarial solutions [12, 21]. In addition, one can add regularization terms on the individual deletion and insertion masks ($M_x$ and $M_y$), but we have found regularization only on the final mask to be sufficient.
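The up-sampling step can be illustrated with a small NumPy routine. It uses corner-aligned bilinear interpolation, which is only one of several common conventions, so the function `upsample_bilinear` should be read as a hypothetical helper rather than the interpolation routine used in the experiments.

```python
import numpy as np

def upsample_bilinear(mask, out_h, out_w):
    """Bilinearly resize a 2-D mask so that its corner values map to the output corners."""
    in_h, in_w = mask.shape
    ys = np.linspace(0.0, in_h - 1, out_h)
    xs = np.linspace(0.0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = mask[y0][:, x0] * (1.0 - wx) + mask[y0][:, x1] * wx
    bottom = mask[y1][:, x0] * (1.0 - wx) + mask[y1][:, x1] * wx
    return top * (1.0 - wy) + bottom * wy

low_res = np.array([[0.0, 1.0],
                    [1.0, 0.0]])
up = upsample_bilinear(low_res, 3, 3)
image = np.full((3, 3), 2.0)
masked = image * up  # apply the up-sampled mask over the (toy) image
print(up)
```

Interpolated values stay between the neighbouring low-resolution entries, which is why coarse masks tend to produce smooth saliency maps after up-sampling.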

Note that the integrated gradient for the insertion mask, i.e. of $-f_c(\Phi(I_0, \tilde{I}_0, 1 - M))$, is slightly different from eq. (2), as it calculates the negative IG along the straight-line path from the image perturbed using the inverse mask $1 - M$, i.e., $\Phi(I_0, \tilde{I}_0, 1 - M)$, toward the original image $I_0$,

We use IG to substitute the conventional gradient for the partial objective $f_c(\cdot)$ in eq. (4) and the simple gradient for the convex regularization terms $g(\cdot)$. The following is the total gradient (TG) for each iteration of the mask update:

TG contains separate integrated gradients w.r.t. the deletion and insertion masks ($M_x$ and $M_y$). The integrated-gradient terms indicate the descent direction for the unconstrained problem in eq. (4), while the gradient of the regularizer $g$ steers the update toward a local and smooth mask and discards unrelated information. Moreover, to make the masks generalize better and depend less on individual dimensions, we add noise to the perturbed image [7, 12] during each step of the IG calculation.

To further alleviate the problem of scattered heatmaps shown in Fig. 1, we introduce a new variation of the TV loss [7], called Bilateral Total Variation (BTV),

where $M(u)$ and $I(u)$ are the mask and the input image value at pixel $u$, and $\beta$ and $\sigma$ are hyperparameters. This enforces the mask to not only be smooth in its own space but also to consider the pixel value differences in the image space. This is intuitive since BTV discourages mask value changes when the input image pixels have similar color. In other words, BTV penalizes the variation in the mask when it is over a single part of an object. This helps particularly in high-resolution mask optimizations and prevents scattered and adversarial masks.
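A bilateral total variation of this kind can be sketched as an ordinary TV whose per-edge penalty is down-weighted where neighbouring image pixels differ. The Gaussian weighting, the exponent on the mask differences, and the parameter names `sigma` and `beta` below are assumptions chosen to match the description, not a confirmed transcription of the paper's formula.

```python
import numpy as np

def bilateral_tv(mask, image, sigma=0.1, beta=2.0):
    """TV of the mask, weighted by exp(-(image difference)^2 / sigma^2) on each neighbouring pair."""
    total = 0.0
    for axis in (0, 1):
        d_mask = np.diff(mask, axis=axis)   # mask change between neighbours
        d_img = np.diff(image, axis=axis)   # image change between the same neighbours
        weight = np.exp(-(d_img ** 2) / sigma ** 2)
        total += (weight * np.abs(d_mask) ** beta).sum()
    return float(total)

step_mask = np.array([[0.0, 1.0],
                      [0.0, 1.0]])
flat_image = np.zeros((2, 2))              # uniform color: mask jumps are fully penalized
edge_image = np.array([[0.0, 1.0],
                       [0.0, 1.0]])        # image edge coincides with the mask jump

print(bilateral_tv(step_mask, flat_image))
print(bilateral_tv(step_mask, edge_image))
```

The same mask is penalized heavily on the uniform image and almost not at all when its boundary follows an image edge, which is the behaviour described above: variation within a single similarly colored part is discouraged.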

Step Size. Similar to [12], we use backtracking line search at each iteration of the mask update. An appropriate step size plays a significant role in avoiding local optima and accelerating convergence. To that end, we revised the Armijo-Goldstein condition as follows,

The condition involves the step size at the current time step and a control parameter in $(0, 1)$. It attempts to determine the maximum movement in the search direction that corresponds to an adequate decrease in the objective function. Note that this is slightly different from the revised condition in [12] in that the decrease in the objective function is calculated over the IG intervals rather than at the mask $M$. This is because IG actually solves an optimization problem similar to eq. (3).
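For orientation, a generic backtracking line search with the standard Armijo sufficient-decrease test looks as follows. This is the textbook condition evaluated at the mask itself, not the revised condition over the IG intervals described above, and the names `backtracking_step`, `c1`, and `shrink` are illustrative.

```python
import numpy as np

def backtracking_step(objective, mask, direction, step=1.0, c1=0.1, shrink=0.5, min_step=1e-6):
    """Shrink the step until the objective decreases by at least c1 * step * ||direction||^2.

    The mask moves along -direction and is clipped to [0, 1] after each trial step.
    """
    current = objective(mask)
    slope = float((direction * direction).sum())
    while step > min_step:
        candidate = np.clip(mask - step * direction, 0.0, 1.0)
        if objective(candidate) <= current - c1 * step * slope:
            return candidate, step
        step *= shrink
    return mask, 0.0  # no acceptable step found

# Toy quadratic objective with its minimum at mask == 0.5 everywhere.
def objective(m):
    return float(((m - 0.5) ** 2).sum())

mask = np.ones((2, 2))
gradient = 2.0 * (mask - 0.5)
new_mask, used_step = backtracking_step(objective, mask, gradient)
print(used_step, objective(new_mask))
```

Starting from a large step and shrinking it lets the update move as far as the sufficient-decrease test allows, which is the property the text relies on for faster convergence.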

In this section we will validate the algorithm, first through an experiment in the natural image domain in order to compare with other baselines. Then, we will show the application of the algorithm to the analysis of COVID-19 X-ray images.

We first evaluate the algorithm in the natural image domain to validate it against other baselines. Although visual assessment of saliency maps might seem straightforward, quantitative comparisons still pose a challenge. For example, one of the widely used metrics is the pointing game [24], which assesses how accurately the saliency maps can locate the objects using bounding boxes annotated by humans. However, this is not necessarily correlated with the underlying decision-making process of the deep networks, and localization of the objects is only intelligible for humans. Also, the pointing game requires correct localization even for misclassified examples, and its results correlate poorly with other metrics (e.g. Excitation-BP, which has excellent pointing game scores, suffers heavily from the fallacies pointed out in [1]). This casts doubt on the validity of this metric. Due to these flaws, we do not utilize this metric in our evaluations.

We opted to follow the causal metrics introduced in [11] , which is the evolved version of the "deletion game" introduced in [7] . Although not perfect, we find this to be a better evaluation metric, as it performs interventions on the input image, a necessary approach to understand causality. These metrics are: the deletion metric that evaluates how sharply the confidence of the network drops as regions are removed from the input. Starting from the original image, relevant pixels as indicated by the heatmap are gradually deleted from the image and are replaced by pixels from the highly-blurred baseline. This goes on until all pixels from the original image are removed and network has near-zero confidence. The deletion score is the area under the curve (AUC) of the classification confidences. The lower the deletion score is the better. The Insertion metric, which is complementary to the deletion one, shows how quickly the original confidence of the network can be reached if relevant evidence are presented. Starting from a baseline image with near-zero confidence, relevant pixels from the image, based on the ranking provided by the heatmap, are gradually inserted into the baseline image. This goes on until all pixels in the baseline are replaced by the original image. The insertion score is also the AUC of the classification scores, with a higher insertion score indicating better performance. While adversarial examples can break the deletion metric with ease and achieve near perfect score, they usually do poorly on the insertion score. Yet only relying on the insertion score will sometimes invite irrelevant background regions to be ranked higher than relevant areas. That is why one needs to jointly look at both scores. See supplementary materials for the set of parameters used in our experiments.
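The deletion and insertion metrics described above can be sketched on a toy example. The snippet ranks pixels by saliency, moves them in equal-sized chunks from one image to the other, and integrates the resulting confidence curve with the trapezoidal rule. The chunking, the all-zero baseline (the text uses a highly blurred image), and the names `causal_curve` and `auc` are simplifying assumptions.

```python
import numpy as np

def causal_curve(score_fn, image, baseline, saliency, mode, n_steps=4):
    """Confidence curve while the most salient pixels are deleted from, or inserted into, the input."""
    order = np.argsort(-saliency.ravel())  # most salient pixels first
    if mode == "deletion":
        start, source = image, baseline    # replace image pixels with baseline pixels
    else:
        start, source = baseline, image    # reveal image pixels on top of the baseline
    current = start.astype(float).ravel().copy()
    src = source.astype(float).ravel()
    scores = [score_fn(current.reshape(image.shape))]
    for idx in np.array_split(order, n_steps):
        current[idx] = src[idx]
        scores.append(score_fn(current.reshape(image.shape)))
    return np.array(scores)

def auc(scores):
    """Trapezoidal area under the curve with the pixel fraction normalized to [0, 1]."""
    return float(((scores[:-1] + scores[1:]) / 2.0).sum() / (len(scores) - 1))

# Hypothetical stand-in for the class score: mean intensity of the top-left 2x2 patch.
def toy_score(x):
    return float(x[:2, :2].mean())

image = np.ones((4, 4))
baseline = np.zeros((4, 4))
good_saliency = np.zeros((4, 4))
good_saliency[:2, :2] = 1.0                # highlights exactly the informative patch

deletion_auc = auc(causal_curve(toy_score, image, baseline, good_saliency, "deletion"))
insertion_auc = auc(causal_curve(toy_score, image, baseline, good_saliency, "insertion"))
print(deletion_auc, insertion_auc)
```

A saliency map that ranks the informative patch first gets a low deletion score and a high insertion score here, while the inverted map gets the opposite, which mirrors why the two scores are read jointly.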

Insertion and Deletion Scores. Table 1 and Table 2 show our results on the Insertion and Deletion metrics, averaged over 5,000 random ImageNet images, for the ResNet50 and VGG19 architectures respectively. We experimented with masks of size 224 × 224, 28 × 28, 7 × 7 (only on ResNet50), and 14 × 14 (only on VGG19). The mask sizes were chosen to provide a fair comparison with other baselines, e.g. Grad-CAM [15], which for ResNet50 works best on its "layer4" (7 × 7 resolution). In addition, Integrated Gradients visualization [20] has heatmaps directly on the input (224 × 224). Also note that binary masks for RISE [11] are generated at 7 × 7 and then up-sampled to 224 × 224. As shown in Tables 1 and 2, iGOS++ has superior performance at all resolutions compared to the other approaches. In particular, for the insertion score, our approach shows a 10% to 25% improvement over prior work. This is mainly due to the novel incorporation of the insertion objective in our eq. (4). Note that, for all the methods, the insertion score tends to be lower on higher-resolution heatmaps, since it is easier for CNNs to extract features if a significant contiguous part is inserted, rather than isolated pixels in high-resolution masks. On the other hand, at higher resolutions it is easier to break an image feature by destroying a small number of pixels, hence the deletion metric is usually better. We note that the drop in our insertion performance from 7 × 7 (or 14 × 14) to 224 × 224 is significantly smaller than that of I-GOS, hence it is safer to choose a higher resolution with iGOS++. We did not notice high variance in our reported results. For instance, for the numbers reported in Table 1 for iGOS++ (224 × 224 resolution), we obtained 0.032 ± 0.00023 and 0.722 ± 0.00097 (mean ± standard deviation) for the deletion and insertion scores, respectively. The low variance supports the validity of the experiment setup.

Furthermore, Fig. 2 shows visualizations from multiple saliency map methods along with their corresponding deletion and insertion curves and scores. Saliency maps for I-GOS and iGOS++ are generated at high (224 × 224) and medium (28 × 28) resolutions. It can be seen that iGOS++ performs well even if the object has long and thin parts, whereas I-GOS generates much more scattered heatmaps at high resolutions. Grad-CAM works well on insertion but includes too many irrelevant regions due to its low resolution, an issue shared with RISE. For additional visual comparison of our method against various baselines, please refer to the supplementary materials.

Figure 3: Sanity check using the Fashion-MNIST dataset. A simple CNN model was trained first using ground truth labels and then with random labels. The masks in the second row were generated with the model trained on ground truth labels, while the masks in the third row were generated for the model trained on random labels.

Sanity Check. Following the work of [1], we perform sanity checks on our method to justify its validity, as relying only on visual assessment can be fallacious. The first is the model randomization test, which checks whether the generated heatmaps are independent of the model parameters. Since randomizing the model (even over one layer) drives the output score of the network to near zero, our method simply returns the initialization (≈ 1), which is different from the final mask we generate. Second, for the label randomization test, we trained a simple Convolutional Neural Network (CNN) on the Fashion-MNIST dataset. The CNN was trained with two sets of data. First, it was trained using the ground truth labels of the dataset until it reached 100% accuracy. Heatmaps were generated for this model and then compared to maps generated for a model trained with random labels. The idea is that if the heatmaps are the same, then the method might be interpreting the image rather than explaining the model's decision. Figure 3 shows that iGOS++ generates meaningful heatmaps on real classes that are significantly different from those on random classes.

Table fragment (rows partially lost): [method name lost] 0.1675 0.6521; Integrated Gradients [20] 0.0907 0.2921; RISE [11] 0.1196 0.5637; Mask [7] ...

Adversarial Examples. In addition to the sanity checks presented in [1], the visualizations from our method on adversarial examples have been analyzed. Figure 4 shows four image samples from the ImageNet validation set, where the heatmaps generated on the adversarial images are visualized along with the original natural images. In this experiment, the VGG19 architecture and two mask resolutions, a low resolution (28 × 28) and a high resolution (224 × 224), are used. To generate adversarial examples (target class: Persian cat), we used MI-FGSM [5] on the VGG19 model. There are two main takeaways from this analysis. First, for adversarial examples, the heatmaps generated by iGOS++ do not provide meaningful explanations. Second, for adversarial examples the insertion score is significantly lower, and it only reaches the original score after almost all the pixels have been inserted. This shows that the insertion score is a good indication of whether the generated mask is adversarial or not.

Ablation Study. The results from the ablation study are presented in Table 3. The experiments are run at 224 × 224 and 28 × 28 resolutions using the ResNet50 model on ImageNet. We observe that optimizing a mask by replacing the deletion loss in [12] with an insertion loss obtains a good insertion score at both resolutions. However, it significantly hurts the deletion score. On the other hand, naïvely incorporating the insertion loss into eq. (1) by just adding an insertion loss term clearly does not work as well. In fact, it performs worse than either [12] or insertion optimization alone. Further, we find that adding noise makes a clear difference at high resolution (224 × 224). Moreover, having a fixed step size is worse than using an adaptive step size with line search. Finally, by removing the bilateral TV term we observed that the deletion score decreases, particularly at high resolution, while the insertion score was also impacted negatively. This shows the benefit of the bilateral TV term in avoiding adversarial solutions. In addition, for a comparison between the deletion and insertion masks, please refer to the supplementary materials.

Convergence Behavior. Throughout our experiments, we found the convergence behavior of our method to be robust against the choice of hyperparameters. The choice of hyperparameters was also transferable between all the datasets and models, as stated in the Set of Hyperparameters section. Table 4 shows the final value of the combined deletion and insertion loss $f_c(\Phi(I_0, \tilde{I}_0, M)) - f_c(\Phi(I_0, \tilde{I}_0, 1 - M))$ (eq. (4) in the main paper) after optimizing with iGOS++ and with the naïve extension of I-GOS [12], in which the insertion loss is directly added to it. The reported numbers are averaged over 500 images for different choices of the hyperparameters $\lambda_1$ and $\lambda_2$. Although the naïve extension is directly optimized over the combined deletion and insertion loss, iGOS++ shows superior performance in all settings. Note that the loss value can be negative since the insertion loss is maximized during the optimization. This validates the capability of iGOS++ in achieving a lower objective than the naïve extension of I-GOS for the same objective, showing that it found better optima in our difficult non-convex optimization problem. A running time comparison of iGOS++ is provided in the supplementary materials.

Table 4: Optimization objective comparison with naïve addition of the insertion loss at 28 × 28 resolution, averaged over 500 images. Our algorithm resulted in a lower objective than the naïve version for the same objective, showing that it found better optima in this difficult non-convex optimization problem. Note that $\lambda_2 = 200$ was never used in the actual experiments and is only included here for completeness. Also note that the loss could be negative because there is a negative sign on the insertion loss in eq. (4) in the main paper.

COVID-19 has been devastating to human lives throughout the world in 2020. Currently, diagnostic tools such as RT-PCR have non-negligible false-negative rates, hence it is desirable to be able to diagnose COVID-19 cases directly from chest imaging. X-ray imaging is significantly cheaper than CT or other higher-resolution imaging tools, hence there would be significant socio-economic benefits if one could diagnose COVID-19 reliably from X-ray imaging data, especially in an explainable manner. To that end, we used the COVIDx dataset [9], one of the largest publicly available COVID-19 datasets, with 13,786 training samples and 1,572 validation samples comprising X-ray images from Normal, Pneumonia, and COVID-19 patients. Following the setting from [9], we trained a classifier over these images. Somewhat surprisingly, when we applied iGOS++ to the trained classifier, we noticed that in occasional cases the classifier seems to have overfitted to singleton characters printed on the X-ray image (Fig. 5), such that even when only the character region is available, there is a non-negligible chance of classifying the image as COVID-19. Note that the higher-resolution explanation from iGOS++ is important in pinpointing the heatmap to the character, whereas low-resolution alternatives such as Grad-CAM were not informative. For further examples and the case when only the text region is revealed to the classifier, please refer to the supplementary materials.

Noticing this "bug" of the classifier, we utilized a state-of-the-art character detector, CRAFT [3], and removed the spurious characters from all the X-ray images in both the training and the testing sets by cleaning and in-painting the detected regions. Examples from the original and cleaned dataset, referred to as COVIDx++, can be found in the supplementary materials. The results in Table 5 show that the recall of COVID-19 detection on the validation dataset improved by 2.5% and the F1 score improved by more than 1%. This small exercise showcases that "bugs" do exist in deep network classifiers, as they do not have common sense about what part of the data is definitely noise, and that heatmap visualizations can serve as a useful debugging tool to help humans locate these bugs. We hope to dig more into this experiment in the future and obtain more meaningful knowledge from this data.

Figure 5: Showcase of the capability of iGOS++ in detecting bugs in a COVID-19 classification pipeline. Unlike Grad-CAM, which provides a coarse (8 × 8 resolution) and ineffective explanation, iGOS++ can generate a more detailed explanation (32 × 32 resolution) to discover the most salient regions. In the right-most image, inserting only the top 6% of the pixels from the iGOS++ heatmap (highlighting the character "R") is enough for the classifier to predict COVID-19 (confidence 43%).

Figure 6 showcases further examples of our visualization method giving insights into the underlying decision-making process of the classifier, potentially helping to debug and improve it. It can be seen that the iGOS++ visualizations (second column) place high weight on the text region in the images, i.e. the letter "R". The third column shows images in which only a small fraction of the pixels (e.g. 6%), referred to as Pixel Ratio (PR), were inserted back into the baseline image (a Gaussian-blurred version of the original image). Even so, this small number of pixels, which mostly include the text region, is enough for the images to be predicted as COVID-19, with the corresponding confidence shown in green. On the other hand, using the same iGOS++ visualizations, when only a small fraction of the pixels (e.g. PR: 12%) are removed from the original image, the classifier gives predictions other than COVID-19. The corresponding COVID-19 confidence (shown in red) is depicted on top of the perturbed images. It should be noted that this confidence is not high enough for the images to be classified as COVID-19. Figure 7 shows examples of the case where inserting only the text region into a baseline is enough for the classifier to change its prediction, falsely, to COVID-19. For the purpose of this analysis, we used a few examples of X-ray images from COVID-19 (Fig. 7a) and Pneumonia (Fig. 7b) patients, depicted in the left columns. When the highly blurred versions of these images were fed to the classifier, they were all predicted as Normal (second columns). However, when only the text regions were revealed back into the baselines (using the bounding boxes from the CRAFT text detector [3]), the classifier falsely predicted all those images as COVID-19, shown in red (third columns).
This suggests that the presence of text regions influences the decisions of the classifier, a bug that we hypothesized using our visualization method. In reality, one would expect these regions to be regarded as noise in the data and discarded by the classifier. This is an example of the fragility of trusting the decision making of black-box classifiers, particularly in tasks where human life is at stake, such as medical diagnosis.

Figure 6: Showcase of the capability of iGOS++ in detecting bugs in a COVID-19 classification pipeline. The X-ray images in the first column belong to COVID-19 patients. The second column shows iGOS++ visualizations at 32×32 resolution, where our method highlights the text region, the letter "R". The third column shows images in which only a small fraction of the pixels (e.g. 6%), referred to as the Pixel Ratio (PR), is inserted back into the baseline. Interestingly, the insertion of this small number of pixels, which mostly cover the text region, is enough for the perturbed images to be classified as COVID-19 (the corresponding confidence scores are shown in green). The fourth column shows the opposite scenario, where only a small fraction of the pixels (e.g. PR: 12%) is removed (blurred) from the original image, and that is enough for the perturbed images to be misclassified, i.e. not predicted as COVID-19. Accordingly, the COVID-19 confidence scores for these images (shown in red) are lower than those of the Normal or Pneumonia classes.
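The Pixel Ratio (PR) perturbations described for Figure 6 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a single-channel image, a saliency map already upsampled to the image size, and a precomputed blurred baseline. The function names are hypothetical.

```python
import numpy as np


def top_ratio_mask(saliency: np.ndarray, pixel_ratio: float) -> np.ndarray:
    """Boolean mask selecting the top `pixel_ratio` fraction of pixels by saliency."""
    k = int(round(pixel_ratio * saliency.size))
    mask = np.zeros(saliency.size, dtype=bool)
    if k > 0:
        # Indices of the k largest saliency values (their internal order is irrelevant).
        top = np.argpartition(saliency.ravel(), -k)[-k:]
        mask[top] = True
    return mask.reshape(saliency.shape)


def insert_pixels(image, baseline, saliency, pixel_ratio):
    """Reveal only the most salient pixels of `image` on top of `baseline`."""
    mask = top_ratio_mask(saliency, pixel_ratio)
    return np.where(mask, image, baseline)


def delete_pixels(image, baseline, saliency, pixel_ratio):
    """Replace the most salient pixels of `image` with the `baseline` values."""
    mask = top_ratio_mask(saliency, pixel_ratio)
    return np.where(mask, baseline, image)
```

Feeding `insert_pixels(..., pixel_ratio=0.06)` and `delete_pixels(..., pixel_ratio=0.12)` to a classifier would mirror the third and fourth columns of Figure 6, under the assumptions stated above.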

We propose a new approach, iGOS++, for creating heatmap visualizations that explain decisions made by CNNs. iGOS++ differs from other approaches by optimizing separate deletion and insertion masks, which are tied together to output a single explanation. We show empirically that this approach achieves significantly better insertion performance at all resolutions. In addition, we introduce a new regularization term based on the bilateral total variation of the image, which is shown to improve the smoothness of the generated heatmaps. As a real-life example, we showed that in the task of classifying COVID-19 patients from X-ray images, the classifier would sometimes overfit to the characters printed on the image. Removing this bug improved the classifier performance meaningfully. We hope the high-fidelity heatmaps generated with iGOS++ will be helpful for downstream tasks in explainable deep learning in both the natural image and medical imaging domains.

A. Deletion ($M_x$) vs. Insertion ($M_y$) Mask. The $M_x$ and $M_y$ masks are optimized with different objectives and behave differently. To make the behavior of each mask clearer, the insertion and deletion scores for each of them are reported in Table 6. As the results show, $M_y$ (the insertion mask) tends to be more local and smooth, while $M_x$ is often more scattered. Multiplying them together, as in the iGOS++ methodology, reduces adversariality and directs the optimization out of saddle points that come with the combination of the dueling loss functions, as mentioned previously. In addition, it can be observed that at lower resolution (28×28), $M_x$ is less adversarial than at higher resolution (224×224).

B. Failure Case. In Fig. 1, we show an example where our method does not work properly. This image is incorrectly predicted by the network and has low prediction confidence. As can be observed, our method has generated an adversarial mask, since the insertion curve requires almost all pixels to be inserted to reach the original confidence. Generally, when the initial prediction confidence is low, or the network's prediction is incorrect, our method does not work very well.

C. Set of Hyperparameters (Insertion/Deletion): To quantitatively evaluate our method (iGOS++) against baselines in terms of the causal metrics (insertion and deletion scores), we use the ImageNet [14] benchmark. For the reported deletion/insertion scores, we use ResNet50 [8] and VGG19 [18] pre-trained on ImageNet (from the PyTorch model zoo [10]) and generate heatmaps for 5,000 randomly selected images from the ImageNet validation set. All the baseline results presented in this paper are obtained either from the published paper, when applicable, or by running the publicly available implementations. The baseline hyperparameters are those giving the best performance, as taken from the papers or code repositories.

During all experiments, for the high-resolution mask (224×224), we set $\lambda_1 = 10$, and set it to 1 for all other resolutions. This avoids a diffuse mask, since the penalty on the mask size can easily become larger at high resolutions. In addition, $\lambda_2$ is set to 20 for all resolutions. The line search parameters are similar to [12]. We also set $\beta = 2$ and $\sigma = 0.01$ for the BTV term. The BTV and TV terms (with $\beta = 2$) can also be averaged and added to the main objective. Note that throughout all the experiments presented in the paper, the same set of hyperparameters is used for all network architectures (ResNet50, VGG19, COVID-NET, etc.) as well as all datasets (ImageNet, Fashion-MNIST, COVIDx, etc.), showcasing the robustness of the algorithm. We used the publicly available implementation of COVID-NET 1, and the best F1 score on the validation set is reported in the paper.
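The deletion and insertion scores used in this evaluation are areas under a perturbation curve. The sketch below follows the commonly used definition: pixels are swapped between the image and a baseline in order of decreasing saliency, and the class score is recorded after each step. The exact protocol (number of steps, choice of baseline) is not restated in this appendix, so those details, the function names, and the `model` callable are assumptions for illustration.

```python
import numpy as np


def perturbation_curve(model, image, baseline, saliency, mode, steps=16):
    """Class scores as pixels are swapped in order of decreasing saliency."""
    if mode == "deletion":
        start, source = image, baseline    # remove evidence from the image
    elif mode == "insertion":
        start, source = baseline, image    # add evidence to the baseline
    else:
        raise ValueError("mode must be 'deletion' or 'insertion'")
    order = np.argsort(-saliency.ravel())  # most salient pixel first
    current = start.astype(float).ravel()
    source_flat = source.astype(float).ravel()
    scores = [model(current.reshape(image.shape))]
    for chunk in np.array_split(order, steps):
        current[chunk] = source_flat[chunk]
        scores.append(model(current.reshape(image.shape)))
    return np.asarray(scores, dtype=float)


def curve_auc(scores):
    """Trapezoidal area under the curve, with the x-axis normalized to [0, 1]."""
    return float(((scores[:-1] + scores[1:]) / 2.0).mean())
```

Here `model` is any callable that maps an image array to the scalar score of the target class. A good explanation should yield a low deletion AUC and a high insertion AUC.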

D. Data Cleaning for the Generation of the COVIDx++ Dataset: The first stage in cleansing the COVIDx dataset is to detect the text in the X-ray images. For this purpose, the CRAFT text detection method [3] is used. We used the general pre-trained model available at the official code repository 2. A text confidence threshold of 0.7 is used for the experiments. To inpaint inside the bounding boxes detected by CRAFT, we used the built-in OpenCV inpainting function (cv2.inpaint(...)) with the algorithm by [22] and an inpaint radius of 10 pixels. Figure 9 shows some samples from the COVID-19 patients in the COVIDx dataset [9], the corresponding bounding boxes detected by the CRAFT text detector, and the final inpainted images.

E. Running Time. The running time comparison of our proposed method against the other perturbation-based methods, namely I-GOS [12], Mask [7], and RISE [11], is presented in Table 7 at two different resolutions. We followed the setting in [12], and an NVIDIA GeForce GTX 1080 Ti GPU along with an 8-core CPU is used for this experiment. The reported numbers are averaged over 5,000 images from the validation set of ImageNet. The gradient-based methods are faster to compute, as they only need one forward and backward pass through the network; generally speaking, this takes under one second per image to generate a heatmap. However, the resolution of their generated heatmaps is more restricted by the CNN architecture and not as flexible as in our method. They are also shown to perform worse than our proposed method in both quantitative and qualitative evaluations. We improved the implementation code for [12] 3, which is why the reported running time is faster than what the original paper reported. The main reason our method is slower than [12] is the difference in how the step size is calculated using backtracking line search, as explained in the methodology section.
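The inpainting step of Appendix D can be sketched as below. It uses `cv2.inpaint` with the `cv2.INPAINT_TELEA` flag, OpenCV's fast-marching inpainting, and the 10-pixel radius stated above. The sketch does not call CRAFT: it assumes the detector (with its 0.7 threshold applied upstream) has already produced axis-aligned boxes in `(x0, y0, x1, y1)` pixel coordinates. That box format and the function names are assumptions for illustration.

```python
import numpy as np


def boxes_to_mask(shape, boxes):
    """Build a uint8 mask that is 255 inside each (x0, y0, x1, y1) box."""
    mask = np.zeros(shape[:2], dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 255
    return mask


def inpaint_boxes(image, boxes, radius=10):
    """Inpaint the detected text regions of an 8-bit image."""
    # Imported here so the mask helper stays usable without OpenCV installed.
    import cv2

    mask = boxes_to_mask(image.shape, boxes)
    return cv2.inpaint(image, mask, radius, cv2.INPAINT_TELEA)
```

Keeping the mask construction separate from the OpenCV call makes it easy to inspect which pixels will be overwritten before any image is modified.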

F. iGOS++: Visual Comparisons. Figure 10 compares the visual explanations from iGOS++ with other gradient-based (Grad-Cam [15], Integrated-Gradient [20], and Gradient [17]) and perturbation-based methods (I-GOS [12], RISE [11]). The images are selected from the ImageNet validation set, and ResNet50 [8] is used as the classifier. The AUCs for the deletion and insertion metrics are indicated under each visualization (for the deletion AUC, lower is better; for the insertion AUC, higher is better). For RISE visualizations, 8000 random 7×7 masks with $p = 0.5$ are generated.

The official implementations have been used for all the visualizations. In Fig. 10, it can be noted that Grad-Cam and RISE, due to their low-resolution masks (7×7), highlight irrelevant regions in the image. In addition, the Gradient method is computed from infinitesimal changes to the input image that change the prediction. As a result, as can be seen in the figure, its visualizations are diffuse and not intuitive to humans. The same issue can be noted for Integrated-Gradient; although it shows a good deletion score, it suffers in the insertion score. Similar to iGOS++, I-GOS has the flexibility to generate masks at various resolutions, which can be chosen depending on the task at hand. Nevertheless, the visualizations from I-GOS are more scattered than the iGOS++ visualizations (which are local to the object), and I-GOS suffers more on the insertion score. As an example, the visualization for the "BulBul" image (third row from the top) clearly underlines this issue.

Figure 11 compares the visualizations from iGOS++ at different resolutions. Our method is flexible in the resolution of the generated mask, ranging from low-resolution, coarse visualizations (e.g. 7×7) to high-resolution, detailed explanations (e.g. 224×224). It can be noted that when multiple objects are present in the image (e.g. the bottom four rows), low-resolution masks perform worse in locating them. This improves as the resolution of the explanations increases; the "Granny Smith" image in Fig. 11 illustrates this. In addition, the visualization for the "Bullmastiff" image shows that iGOS++, unlike some visualization methods that perform (partial) image recovery [1], generates faithful explanations when objects from different classes exist in the image.

To visually compare the smoothness from our proposed BTV term with the TV term introduced in [7], refer to Figure 12, where the iGOS++ visualizations are shown for different smoothness losses as well as different values of their penalty constant $\lambda_2$. For this analysis, the ResNet50 architecture is used. The generated masks for the "Tiger Beetle" image are at 224×224 resolution. It can be seen that when the BTV term is used (as in the original iGOS++ method), the visualizations are local to the object, while for the TV term the visualizations are still scattered. For further details on the BTV term, refer to the methodology section in the main paper. $\lambda_2 = 20$ is used in all of our experiments; the different values in Fig. 12 are for highlighting the effect of the smoothness loss.

Figure 10: Visual comparison of iGOS++ with I-GOS [12], RISE [11], Grad-Cam [15], Integrated-Gradient [20], and Gradient [17]. For the deletion AUC, lower is better. For the insertion AUC, higher is better. (Best viewed in color.)

Figure 11: iGOS++ visual explanations for different classes (written on the left) at different mask resolutions (written on the top). This figure shows the flexibility of our method in finding the most salient regions in the image, from coarse, low-resolution explanations (e.g. 7×7) to refined, high-resolution explanations (e.g. 224×224). The last four rows also show examples where multiple objects are present in the image, both from the same class, as in the "Ostrich", "Kite", and "Granny Smith" images, and from different classes, as in the "Bullmastiff" image. iGOS++ gives reliable explanations in both cases.

Figure 12: Smoothness comparison of our proposed BTV term with the TV term [7] for the iGOS++ visualizations. The top row shows iGOS++ with our proposed BTV term, while the bottom row shows the result when the TV term [7] is used instead. Each column shows a different value of the $\lambda_2$ hyperparameter (smoothness penalty). As can be observed, the generated masks become smoother as $\lambda_2$ increases. In our experiments, we used $\lambda_2 = 20$ for all resolutions, models, and datasets. In addition, it can be noted that BTV outperforms TV, as it generates a less scattered and more local visualization, which is more intuitive for human interpretation.
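To make the difference between the two smoothness losses concrete, the sketch below contrasts a plain TV penalty with an edge-aware (bilateral) variant. The BTV definition itself is given in the methodology section of the main paper and is not restated here, so the weighting form used below, exp(-|∇I|²/σ²)·|∇M|^β, is an assumption. Mapping the two BTV constants reported in Appendix C (2 and 0.01) to `beta` and `sigma` is likewise an assumption. The mask and image are taken to be single-channel arrays of the same shape.

```python
import numpy as np


def _forward_diffs(a):
    """Vertical and horizontal forward differences of a 2-D array."""
    return a[1:, :] - a[:-1, :], a[:, 1:] - a[:, :-1]


def tv(mask, beta=2.0):
    """Plain total variation: penalizes every change in the mask equally."""
    dy, dx = _forward_diffs(mask)
    return float((np.abs(dy) ** beta).sum() + (np.abs(dx) ** beta).sum())


def btv(mask, image, beta=2.0, sigma=0.01):
    """Bilateral TV (assumed form): mask changes are cheap across image edges."""
    my, mx = _forward_diffs(mask)
    iy, ix = _forward_diffs(image)
    wy = np.exp(-(iy ** 2) / sigma ** 2)  # weight near 0 where the image has an edge
    wx = np.exp(-(ix ** 2) / sigma ** 2)
    return float((wy * np.abs(my) ** beta).sum() + (wx * np.abs(mx) ** beta).sum())
```

Under this assumed form, a mask boundary that follows an image edge costs almost nothing, while a boundary inside a flat region is penalized as in plain TV. That behavior is consistent with the more object-local masks reported for BTV.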

[1] Sanity checks for saliency maps

[2] On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation

[3] Character Region Awareness for Text Detection

[4] Real time image saliency for black box classifiers

[5] Boosting Adversarial Attacks With Momentum

[6] Understanding deep networks via extremal perturbations and smooth masks

[7] Interpretable Explanations of Black Boxes by Meaningful Perturbation

[8] Deep Residual Learning for Image Recognition

[9] COVID-Net: A Tailored Deep Convolutional Neural Network Design for Detection of COVID-19 Cases from

[10] PyTorch: An Imperative Style, High-Performance Deep Learning Library

[11] RISE: Randomized Input Sampling for Explanation of Black-box Models

[12] Visualizing deep networks by optimizing with integrated gradients

[13] Why Should I Trust You?: Explaining the Predictions of Any Classifier

[14] ImageNet Large Scale Visual Recognition Challenge

[15] Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization

[16] Not Just a Black Box: Learning Important Features Through Propagating Activation Differences

[17] Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. ICLR Workshop

[18] Very Deep Convolutional Networks for Large-Scale Image Recognition. International Conference on Learning Representations

[19] Striving for Simplicity: The All Convolutional Net

[20] Axiomatic Attribution for Deep Networks

[21] Intriguing properties of neural networks

[22] An image inpainting technique based on the fast marching method

[23] Visualizing and Understanding Convolutional Networks

[24] Top-down neural attention by excitation backprop

[25] Object Detectors Emerge in Deep Scene CNNs

[26] Learning deep features for discriminative localization

We thank Dr. Alan Fern for helpful discussions. This work is supported in part by DARPA contract N66001-17-2-4030.