key: cord-199863-5j01k5v6
title: Improving Explainability of Image Classification in Scenarios with Class Overlap: Application to COVID-19 and Pneumonia
authors: Verenich, Edward; Velasquez, Alvaro; Khan, Nazar; Hussain, Faraz
date: 2020-08-06

Trust in predictions made by machine learning models is increased if the model generalizes well on previously unseen samples and when inference is accompanied by cogent explanations of the reasoning behind predictions. In the image classification domain, generalization can be assessed through accuracy, sensitivity, and specificity. Explainability can be assessed by how well the model localizes the object of interest within an image. However, both generalization and explainability through localization are degraded in scenarios with significant overlap between classes. We propose a method based on binary expert networks that enhances the explainability of image classifications through better localization, by mitigating the model uncertainty induced by class overlap. Our technique performs discriminative localization on images that contain features with significant class overlap, without explicitly training for localization. Our method is particularly promising in real-world class-overlap scenarios, such as COVID-19 and pneumonia, where expertly labeled data for localization is not readily available. This can be useful for early, rapid, and trustworthy screening for COVID-19.

The use of deep neural networks for image classification and object detection is well established in the computer vision domain. As neural networks became increasingly used in real-world applications, such as assisting medical diagnosis, the phenomenon of class overlap became more apparent [1]. Recent work on detecting COVID-19 using X-ray imagery has also shown that class overlap degrades classifier performance [2]. When convolutional neural networks (CNNs) are trained to account for classes with similar conditions, the model becomes less certain. This is in part due to overlap in the class activations triggered by the same image. This paper presents a new technique to distinguish between COVID-19 and regular pneumonia in X-ray imagery in a more explainable fashion by using class activation maps.

Fig. 1: The left and center images show significant overlap in the class activation maps computed by the binary classifiers for carwheels and cars, respectively. The third image, computed by applying our kernel function to the first two, localizes the region in the original image that is primarily responsible for its classification as a carwheel.

The standard approach in deep learning to reduce uncertainty is to provide more training data to the model, which is not always possible and primarily addresses the model uncertainty that is due to model parameters. Another method to reduce decision uncertainty is to localize target objects, thereby increasing confidence in the prediction. Localization through supervised training, with labeled bounding boxes used to compute the reward, is a widely used approach for reducing decision uncertainty in image classification [3]. However, the lack of labeled data and the inherent noise present in novel situations result in additional predictive uncertainty. Consider, for example, X-ray imagery of confirmed COVID-19 patients, where X-ray images were taken to analyze pulmonary complications, yet expert localization of COVID-19-specific attributes was not performed by radiologists, i.e.
no bounding boxes were annotated on the COVID-19-relevant regions of the X-ray images [4]-[8]. Ghoshal et al. [9] note that there are two distinct kinds of predictive uncertainty in deep learning. First, epistemic uncertainty, or uncertainty in the model parameters, which decreases with more training data. Second, aleatoric uncertainty, which accounts for noise in observations due to class overlap, label noise, and varying error term size across values of an independent variable. Aleatoric uncertainty cannot be easily reduced by increasing the size of the training set.

Fig. 2: High-level view of our dual-network technique for handling class overlap. Given a class of interest C1 and another possibly overlapping class C2, our approach involves training two separate binary expert networks (N1, N2). Each input image (I) is fed to both expert networks to obtain class activation maps (CAM1, CAM2), which are then used by our directed kernel function K to localize regions in I where the expert network for the class of interest (N1) is more confident.

At the heart of our approach is the use of class activation maps (CAMs) for improved localization of the regions responsible for the image being in a specific class (e.g. COVID-19) as opposed to some other overlapping class (pneumonia). As an example, Figure 1 shows activation maps from two separate models trained to classify carwheels and cars, respectively. The figure depicts significant overlap in the regions responsible for categorizing the image into the two classes. Our goal is to localize the regions in the image that are more responsible for its classification into a specific class of interest (carwheel), in particular when bounding boxes are not available during training.

In unexpected public health emergencies, such as the coronavirus pandemic, labeled datasets with bounding box annotations are unlikely to be available to the community at an early stage. Such scenarios preclude the possibility of training a model for localization. In such situations, is it possible to improve localization when there is no way to train for it? This work proposes a method for improved localization in order to enhance the explainability of image classifications in data regimes with significant class overlap, thus mitigating aleatoric uncertainty. Our results show that training per-class binary CNN models and applying our new kernel function to their class activation maps can extract and better localize objects from overlapping classes.

Image classification and object localization have been successfully utilized for diagnostic purposes in radiology, for example in pneumonia detection using chest X-rays [10]. With the recent emergence of COVID-19, a number of methods and models have been proposed, as surveyed by Shi et al. [11], to detect the disease using medical imaging. Wang and Wong [4] released early work on using convolutional neural networks for COVID-19 detection from X-ray images. Alqudah et al. [12] used convolutional neural networks to classify X-ray images, as well as to extract features and pass them to other classifiers. Ghoshal et al. [9] observed that most methods focused exclusively on increasing accuracy without accounting for uncertainty in the decision, and proposed a method to estimate decision uncertainty. Zhou et al. [13] showed that discriminative localization is possible without explicitly training for object detection with labeled bounding boxes. However, noise related to overlapping classes has not been considered in these works.
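As a concrete illustration of the first stage of the pipeline in Fig. 2, the sketch below loads two fine-tuned binary experts and runs one preprocessed image through both. This is a minimal PyTorch sketch under stated assumptions, not the authors' released code: the checkpoint file names, the input image path, and the convention that output index 1 is the in-class score are all hypothetical.

```python
# Minimal sketch of the dual-expert inference stage in Fig. 2.
import torch
from torchvision import models, transforms
from PIL import Image

# Standard ImageNet preprocessing, producing the 224x224 inputs used here.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def load_expert(checkpoint_path):
    """Build a ResNet-152 with a 2-way head and load fine-tuned weights."""
    net = models.resnet152()
    net.fc = torch.nn.Linear(net.fc.in_features, 2)  # in-class / out-of-class
    net.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
    return net.eval()

covid_expert = load_expert("covid_expert.pt")          # N1: class of interest
pneumonia_expert = load_expert("pneumonia_expert.pt")  # N2: overlapping class

image = preprocess(Image.open("chest_xray.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    p1 = torch.softmax(covid_expert(image), dim=1)[0, 1]      # P(in-class) for N1
    p2 = torch.softmax(pneumonia_expert(image), dim=1)[0, 1]  # P(in-class) for N2
```

The later sketches reuse the experts, `image`, and `preprocess` defined here.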
This work builds on these ideas and extends the work of Zhou et al. [13] on discriminative localization to specifically address the problem of overlapping classes and explainability in image classification. Our goal is to localize the image regions responsible for a class of interest (e.g. COVID-19), given possibly overlapping classes (e.g. pneumonia/COVID-19). Our method consists of training locally independent expert networks as binary classifiers for two different, possibly overlapping classes, e.g. classifiers for COVID-19/No-COVID-19 and Pneumonia/No-Pneumonia. These binary expert networks are then leveraged as expert classifiers on each input image as part of a dual-network architecture, as shown in Figure 2. The CAMs obtained from the two networks are then passed to our novel kernel function (K), which localizes the image regions responsible for the class of interest. Note that, in our approach, a binary expert network is a classifier that classifies its input as either in-class or out-of-class for a specific class, e.g. a COVID-19/No-COVID-19 classifier is a binary expert network for COVID-19. Similarly, a Pneumonia/No-Pneumonia classifier is an expert network for Pneumonia. However, we do not consider a COVID-19/Pneumonia classifier to be a binary expert network. Our approach is summarized below:

1. Given a class of interest C1 and a possibly overlapping class C2, train separate binary expert networks (N1, N2) for both of them.
2. Pass each input image through N1 and N2 and, in both cases, extract the features of the last convolutional layer as CAMs (CAM1, CAM2).
3. Pass CAM1 and CAM2 to our kernel function, which localizes the regions in the image where the classifier (N1) of the class of interest is more confident.

We now describe how the CAMs are computed and then define our kernel function.

Class activation maps allow us to localize objects of a given class by mapping regions of an image to the most active values in the activation layer of a network. To obtain the CAMs of each binary expert classifier, we follow the approach described by Zhou et al. [13], differing only in the architecture of the expert models. In order to compute a CAM, a convolutional network architecture needs an activation layer, followed by a pooling layer (average or max), and a fully connected layer to obtain a class score. A ResNet architecture meets these requirements, utilizing a Global Average Pooling layer after the final convolutional layer. We create a mechanism for extracting values from the Activation layer; in the case of our ResNet-based networks, this is the last layer before the Global Average Pooling layer. We add a hook to the Activation layer to store its activation values when a forward pass is performed on the network for inference (see the sketch below). The Activation layer contains 2048 activation maps [f_1, ..., f_2048], each with a dimension of 7x7; thus, when we extract activation values after a forward pass, we obtain a tensor of shape 7x7x2048. In the normal forward pass, this tensor is reduced by the Global Average Pooling layer to a tensor of shape 1x1x2048 by averaging each feature map, which is then flattened to a vector of length 2048 in the Flatten layer. The connections between the Flatten layer and the final Fully Connected layer contain the weights [w_1, ..., w_i] used for classification, as each node in the final fully connected layer represents an object class.
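To make the hook mechanism concrete, here is a minimal PyTorch sketch under the assumptions of the previous code block (a torchvision ResNet-152 expert and a preprocessed `image`). In torchvision's ResNet, the stage that feeds the Global Average Pooling layer is `layer4`, whose output for a 224x224 input has shape (batch, 2048, 7, 7).

```python
import torch

net = covid_expert  # the expert for the class of interest, from the earlier sketch
activations = {}

def save_activation(module, inputs, output):
    # Called on every forward pass; output has shape (batch, 2048, 7, 7).
    activations["maps"] = output.detach()

# Register the hook on the last convolutional stage (the "Activation layer").
hook_handle = net.layer4.register_forward_hook(save_activation)

with torch.no_grad():
    logits = net(image)  # the forward pass also fills activations["maps"]
predicted_class = logits.argmax(dim=1).item()

feature_maps = activations["maps"][0]           # shape (2048, 7, 7): f_1..f_2048
class_weights = net.fc.weight[predicted_class]  # shape (2048,): w_1..w_2048
hook_handle.remove()  # detach the hook once the tensors are captured
```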
We use these weights w_i along with the activation maps f_i to compute a CAM for the predicted class during a forward pass. Figure 3 shows a high-level view of an expert network and the locations of the activation (f) and weight (w) tensors, respectively. Finally, we compute a weighted sum of the activation values and the weights for the predicted class, as in Equation 1, to produce a 7x7 tensor that represents the CAM for the predicted class c:

$\mathrm{CAM}_c(i, j) = \sum_k w_k^c \, f_k(i, j)$    (1)

In order to identify the regions in the image that are important to the predicted class, we superimpose the 7x7 CAM on the original image, using bilinear sampling to scale the CAM to the appropriate size, which in this case is 224x224.

In order to extract overlapping features between two classes, we need a directed divergence, or difference, measure. For tensors (x, x') we need a measure that amplifies only the positive differences in activation values in (x - x'), because we are interested in recovering features of tensor x with higher values than in tensor x', but not vice versa, i.e. K(x, x') ≠ K(x', x). We introduce a kernel method called the Amplified Directed Divergence Kernel (ADDK) that accepts two tensors (x, x') of equal shape and returns another tensor of the same shape with the positive differences of (x - x') amplified, as shown in Equation 2 (see the code sketch below). The kernel method ensures that the maximum value of a given tensor is not zero in the normalization step. Normalization with maximum tensor values has shown promising empirical results, but we plan to explore other normalization techniques in the future. The parameter α controls the amplification of the directed differences, where higher amplification concentrates the resulting heat map to a smaller region. To illustrate the kernel function's operation, a simplified example with α = 15 is shown in Equation 3.

We apply our proposed dual-network technique to localize regions indicating COVID-19 in X-ray imagery. However, due to the absence of localized and labeled bounding boxes for COVID-19 in X-rays, the computed localizations cannot be easily validated. Therefore, we also tested our technique on a natural-imagery dataset for which the localizations can be visually validated.

To train the binary expert models, we utilized transfer learning to mitigate the problem of training a robust image classifier with a small number of training samples from a novel class of interest. We used a pretrained ResNet-152 architecture [14], replaced the final connected layer with one matching the appropriate classes, and fine-tuned it with new data. Stochastic gradient descent was used with a learning rate of 0.001 and a momentum of 0.9. Training was performed for 30 epochs, and the best-performing model based on validation accuracy was selected.

We selected two categories of images with significant class overlap, viz. carwheel and car, the former being the class of interest. We fine-tuned two CNNs pretrained on ImageNet as binary experts for car and carwheel. The models were trained solely for classification and not for object localization.
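Since the display forms of Equations 2 and 3 did not survive extraction, the following sketch pairs the CAM computation of Equation 1 with one plausible reading of the ADDK that is consistent with the prose: each map is normalized by its (guarded, non-zero) maximum, the directed difference is clipped to its positive part, and α amplifies the result before renormalization. Treat this as an illustration under stated assumptions, not the published formula; dropping negative CAM values before normalizing is an additional simplifying assumption.

```python
import torch
import torch.nn.functional as F

def compute_cam(feature_maps, class_weights, size=224):
    """Equation 1: weighted sum of the 2048 7x7 activation maps,
    bilinearly upsampled to the original image size (224x224)."""
    cam = torch.einsum("k,khw->hw", class_weights, feature_maps)  # 7x7
    return F.interpolate(cam[None, None], size=(size, size),
                         mode="bilinear", align_corners=False)[0, 0]

def addk(cam1, cam2, alpha=15.0, eps=1e-8):
    """A plausible Amplified Directed Divergence Kernel: keep only the regions
    where the class-of-interest CAM (cam1) dominates cam2, amplified by alpha."""
    x = torch.relu(cam1)        # drop negative evidence (simplifying assumption)
    x_prime = torch.relu(cam2)
    x = x / x.max().clamp(min=eps)                # guard against a zero maximum
    x_prime = x_prime / x_prime.max().clamp(min=eps)
    diff = torch.relu(x - x_prime)                # directed: K(x, x') != K(x', x)
    return diff ** alpha / (diff.max() ** alpha).clamp(min=eps)

# Hypothetical usage with the (feature_maps, class_weights) pairs captured by
# the hook sketch, one pair per expert network:
# heatmap = addk(compute_cam(f_covid, w_covid), compute_cam(f_pneu, w_pneu))
```

Raising the clipped difference to the power α > 1 after normalizing to [0, 1] suppresses weaker responses, which matches the observation above that larger α concentrates the heat map to a smaller region.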
Furthermore, the carwheel expert model was fine-tuned with only thirteen images in order to simulate an environment with a novel class of interest that is likely to be data-starved (e.g. COVID-19). The CAMs obtained from these two experts were passed through our novel kernel function to obtain a heat map that localizes the regions in the image where the carwheel expert network was more confident. Figure 4 shows results consisting of sets of three images: the first two are the CAMs obtained from the carwheel and car expert networks, respectively, and the third is the heat map computed using our kernel function, which significantly improves the localization of the class of interest (carwheel). This enhances the explainability of the classification decision in this class-overlap scenario.

We utilized the COVID-19 chest X-ray dataset [5] and extracted the COVID-19 samples with the posteroanterior view of the X-ray. The dataset by Kermany et al. [15], which is also available from Kaggle, was used for the pneumonia X-ray images. Both datasets were processed into training, validation, and test splits using a 60/20/20 ratio. Table I shows the dataset sizes during training, validation, and testing.

Fig. 5: The results of our approach for distinguishing COVID-19-specific regions in X-ray imagery on publicly available X-ray datasets for COVID-19 [5] and Pneumonia [15]. In each set of three images, the first two show superimposed CAMs corresponding to the two classes predicted by the expert networks for COVID-19 and Pneumonia, respectively. The third image in each set is the output of our kernel function, a heat map in which the regions where the expert network for COVID-19 was more confident have been localized.

Fig. 6: As the amplification parameter α is increased, the COVID-19 localization in the heat map output by the kernel function becomes more concentrated.

Figure 5 shows triples of X-ray images with superimposed class activation maps for the predictions obtained from the expert binary models (images one and two), with the third image showing the heat map computed using our kernel. The intended use of our method is to examine positive classifications from two possibly overlapping classes (i.e. COVID-19 and pneumonia) and extract the discriminative features pertaining to the class of interest, i.e. COVID-19. Each triple shows positive classifications from both expert models along with class activation maps that localize the image region responsible for that classification; the third image in each triple shows a better-localized image region for COVID-19, as computed using our method. Our method is intended to improve the explainability of predictions under circumstances where both models return positive classifications, resulting in significant overlap in the activation maps. Figure 6 demonstrates the role of the kernel parameter α: it controls the amplification of the directed differences among the activation maps, and higher values of α concentrate the resulting heat map to a smaller region.

We have described a novel method that improves the predictive explainability of image classification by reducing the uncertainty induced by class overlap. For a classification task with overlapping classes, our approach creates multiple separate binary classification problems. In this way, we avoid the uncertainty due to the overlap between the classes, and each binary model is allowed to become more confident in its specific task, which is reflected in its CAM. A direct comparison of the CAMs then allows us to localize the image regions that explain why the image was classified into a specific class of interest. It should be noted that our approach enhances explainability in settings with class overlap by enabling models trained solely for classification to be used for localization. This is extremely useful in scenarios where the training data was not annotated at the level of bounding boxes, as is the situation for existing COVID-19 datasets.
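For reference, the expert training protocol described above (pretrained ResNet-152, a replaced two-way head, SGD with learning rate 0.001 and momentum 0.9, 30 epochs, keeping the checkpoint with the best validation accuracy) can be condensed into the following sketch. The dataset directories, batch size, and checkpoint name are hypothetical placeholders, and `preprocess` is the transform from the first code sketch.

```python
import torch
from torchvision import datasets, models

# Deterministic preprocessing only: posteroanterior X-ray positioning is
# relatively constant, so no random geometric augmentation is applied.
train_set = datasets.ImageFolder("data/covid/train", transform=preprocess)
val_set = datasets.ImageFolder("data/covid/val", transform=preprocess)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_set, batch_size=16)

net = models.resnet152(pretrained=True)              # ImageNet initialization
net.fc = torch.nn.Linear(net.fc.in_features, 2)      # in-class / out-of-class head
optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

best_acc = 0.0
for epoch in range(30):
    net.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        criterion(net(x), y).backward()
        optimizer.step()
    # Keep the checkpoint with the best validation accuracy.
    net.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            correct += (net(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    if correct / total > best_acc:
        best_acc = correct / total
        torch.save(net.state_dict(), "covid_expert.pt")
```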
Our results show that the proposed method is effective at extracting and better localizing objects or regions associated with a class of interest that has significant overlap with another class. Furthermore, discriminative localization is performed without the models having been explicitly trained for localization or object detection using labeled bounding boxes.

This work was motivated by our observation that numerous reported applications [4] of image classification and object detection in support of rapid screening and diagnosis of COVID-19 were inhibited by the noisy data available to practitioners, which in turn increased the uncertainty in the associated predictions. We identified the uncertainty induced by the overlapping features of COVID-19 and non-SARS-CoV-2 pneumonia, and we developed a method to mitigate this uncertainty due to class overlap, which cannot be easily reduced just by using more training data. Our dual-network technique and our Amplified Directed Divergence Kernel function can help domain experts, e.g. radiologists, in computer-aided diagnosis. For COVID-19 and regular pneumonia, which have shared symptoms, our approach can help better isolate the relevant regions in diagnostic imagery that explain specific classifications.

Our work improves the explainability of classification decisions in scenarios with overlapping classes. We do this by training more confident models on simpler binary classification problems, and our approach then uses these simpler binary models to enhance explainability through improved localization, without training for localization. We believe that our technique can be extended to other useful applications. One example is monitoring the training progress of classification models on data without localization ground truth, while subject matter experts assess whether the model is discriminating the proper or expected regions for each class. This is useful for assessing whether the model is learning a causal relationship between the data and the class, rather than some spurious correlation induced by data artifacts, e.g. certain classes carrying an artificial mark produced by the collection process. Finally, by pairing subsequent evolutions of the same model, i.e. under continuous retraining on new data, our method can extract the shifts in activation maps induced by retraining and hence detect model drift or covariate shift in the data.

We believe that our technique is promising for addressing uncertainty related to noisy data, and further development will enable its use in numerous applications. One direction of future work is to investigate image upsampling methods in order to better map class activations to the original imagery. Whether variations of our kernel function can improve explainability through better localization can also be explored. More experiments are needed to observe the effects of the transformations used during the training of the expert models. For example, some domains, such as natural images, benefit from random geometric variations during training, while others, such as posteroanterior X-ray imagery, do not, as subject positioning is relatively constant between data points. Finally, the effects of training the expert models from scratch, instead of fine-tuning pretrained models, can also be explored.
References

[1] A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios.
[2] Automated detection of COVID-19 cases using deep neural networks with X-ray images.
[3] Accurate object localization in remote sensing images based on convolutional neural networks.
[4] COVID-Net: A tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images.
[5] COVID-19 image data collection.
[6] COVID-19 Chest X-ray Dataset Initiative.
[7] Can AI help in screening viral and COVID-19 pneumonia?
[9] Estimating uncertainty and interpretability in deep learning for coronavirus (COVID-19) detection.
[10] CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning.
[11] Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19.
[12] Automated systems for detection of COVID-19 using chest X-ray images and lightweight convolutional neural networks.
[13] Learning deep features for discriminative localization.
[14] Deep residual learning for image recognition.
[15] Labeled optical coherence tomography (OCT) and chest X-ray images for classification.