HINT: Hierarchical Neuron Concept Explainer
Andong Wang, Wei-Ning Lee, Xiaojuan Qi
2022-03-27

To interpret deep networks, one main approach is to associate neurons with human-understandable concepts. However, existing methods often ignore the inherent relationships of different concepts (e.g., dog and cat both belong to animals), and thus lose the chance to explain neurons responsible for higher-level concepts (e.g., animal). In this paper, we study hierarchical concepts inspired by the hierarchical cognition process of human beings. To this end, we propose HIerarchical Neuron concepT explainer (HINT) to effectively build bidirectional associations between neurons and hierarchical concepts in a low-cost and scalable manner. HINT enables us to systematically and quantitatively study whether and how the implicit hierarchical relationships of concepts are embedded into neurons, such as identifying collaborative neurons responsible for one concept and multimodal neurons responsible for different concepts, at different semantic levels from concrete concepts (e.g., dog) to more abstract ones (e.g., animal). Finally, we verify the faithfulness of the associations using Weakly Supervised Object Localization, and demonstrate its applicability in various tasks such as discovering saliency regions and explaining adversarial attacks. Code is available at https://github.com/AntonotnaWang/HINT.

Deep neural networks have attained remarkable success in many computer vision and machine learning tasks. However, it is still challenging to interpret the hidden neurons in a human-understandable manner, which is of great significance for uncovering the reasoning process of deep networks and increasing the trustworthiness of deep learning to humans [3, 35, 68]. Early research focuses on finding evidence from input data to explain deep model predictions [4, 10, 31, 37, 38, 53, 56, 57, 59-62, 72], where the neurons remain unexplained. More recent efforts have attempted to associate hidden neurons with human-understandable concepts [7-9, 11, 23, 49, 50, 75, 76, 79, 80]. Although insightful interpretations of neurons' semantics have been demonstrated, such as identifying the neurons controlling the contents of trees [8], existing methods define the concepts in an ad-hoc manner and rely heavily on human annotations such as manual visual inspection [11, 49, 50, 80], manually labeled classification categories [23], or hand-crafted guidance images [7-9, 79]. They thus suffer from heavy costs and scalability issues. Moreover, existing methods often ignore the inherent relationships among different concepts (e.g., dog and cat both belong to mammal) and treat them independently, which loses the chance to discover neurons responsible for implicit higher-level concepts (e.g., canine, mammal, and animal) and to explore whether the network can create abstractions of things the way we humans do. The above motivates us to rethink how concepts should be defined to more faithfully reveal the roles of hidden neurons. We draw inspiration from the hierarchical cognition process of human beings, who tend to organize things from specific to general categories [42, 52, 67], and propose to explore hierarchical concepts, which can be harvested from WordNet [44] (a lexical database of semantic relations between words).
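To make the construction of such a hierarchy concrete, the sketch below shows one way to harvest hypernym chains for ImageNet classes from WordNet using nltk. It is an illustration only: the specific class ID, the helper names, and the choice to follow the first hypernym at each step are our assumptions, not part of HINT's released code.

```python
from nltk.corpus import wordnet as wn   # requires nltk with the "wordnet" corpus downloaded

def synset_from_wnid(wnid):
    """ImageNet class IDs such as 'n02110063' encode WordNet noun offsets."""
    return wn.synset_from_pos_and_offset(wnid[0], int(wnid[1:]))

def hypernym_chain(wnid):
    """Follow hypernyms from a leaf class up to the WordNet root.
    WordNet is a DAG; taking the first hypernym at each step is a simplification."""
    syn = synset_from_wnid(wnid)
    chain = [syn.name()]
    while syn.hypernyms():
        syn = syn.hypernyms()[0]
        chain.append(syn.name())
    return chain

# e.g., a dog breed climbs through dog -> canine -> ... -> mammal -> animal -> ... -> whole
print(hypernym_chain("n02110063"))      # wnid shown here is illustrative
```

Grouping the leaf classes by the shared nodes of such chains yields higher-level concepts (dog, canine, mammal, animal, whole) without any extra labeling, which is the property HINT exploits.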
We investigate whether deep networks can automatically learn the hierarchical relationships of categories that were not labeled in the training data. More concretely, we aim to identify neurons for both low-level concepts, such as Malamute, Husky, and Persian cat, and implicit higher-level concepts, such as dog and animal, as shown in Figure 1 (a). (Note that we call less abstract concepts low-level and more abstract concepts high-level.) To this end, we develop HIerarchical Neuron concepT explainer (HINT), which builds bidirectional associations between neurons and hierarchical concepts (see Figure 1). First, we develop a saliency-guided approach to identify the high-dimensional representations associated with the hierarchical concepts on hidden layers (denoted as responsible regions in Figure 1 (b)), which makes HINT low-cost and scalable as no extra hand-crafted guidance is required. Then, we train classifiers, shown in Figure 1 (c), to separate different concepts' responsible regions, where the weights represent the contribution of the corresponding neuron to the classification. Based on the classifiers, we design a Shapley value-based scoring method to fairly evaluate neurons' contributions, considering both neurons' individual and collaborative effects. To our knowledge, HINT presents the first attempt to associate neurons with hierarchical concepts, which enables us to systematically and quantitatively study whether and how hierarchical concepts are embedded into deep network neurons. HINT identifies collaborative neurons contributing to one concept and multimodal neurons contributing to multiple concepts. In particular, HINT finds that, despite being trained with only low-level labels, such as Husky and Persian cat, deep neural networks automatically embed hierarchical concepts into their neurons. HINT is also able to discover the neurons responsible for both higher-level concepts, such as animal, person, and plant, and lower-level concepts, such as mammal, reptile, and bird. Finally, we verify the faithfulness of the neuron-concept associations identified by HINT with a Weakly Supervised Object Localization task. In addition, HINT achieves remarkable performance in a variety of applications, including saliency method evaluation, adversarial attack explanation, and COVID19 classification model evaluation, further manifesting the usefulness of HINT.

Neuron-concept Association Methods. Neuron-concept association methods aim at directly interpreting the internal computation of CNNs [2, 12, 25, 48]. Early works show that neurons on shallower layers tend to learn simpler concepts, such as lines and curves, while higher layers tend to learn more abstract ones, such as heads or legs [71, 72]. TCAV [32] and related studies [22, 24] quantify the contribution of a given concept, represented by guidance images, to a target class on a chosen hidden layer. Object Detector [80] visualizes the concept-responsible region of a neuron in the input image by iteratively simplifying the image. After that, Network Dissection [7, 8, 79] quantifies the roles of neurons by assigning each neuron to a concept with the guidance of extra images. GAN Dissection [8, 9] illustrates the effect of concept-specific neurons by altering them and observing the emergence and vanishing of concept-related contents in images. Neuron Shapley [23] identifies the most influential neurons across all hidden layers for an image category by sorting Shapley values [54].
Besides pre-defined concepts, feature visualization methods [11, 49, 50] generate Deep Dream-style [47] explanations for each neuron and manually interpret their meanings. Additionally, Net2Vec [20] maps semantic concepts to vectorial embeddings to investigate the relationship between CNN filters and concepts. However, existing methods cannot systematically explain how the network learns the inherent relationships of concepts, and they suffer from high cost and scalability issues. HINT is proposed to overcome these limitations and goes beyond exploring each concept individually: it adopts hierarchical concepts to explore their semantic relationships.

Saliency Map Methods. Saliency map methods are a stream of simple and fast interpretation methods which show the pixel responsibility (i.e., the saliency score) in the input image for a target model output. There are two main categories of saliency map methods: backpropagation-based and perturbation-based. Backpropagation-based methods mainly generate saliency maps from gradients; they include Gradient [57], Gradient x Input [56], Guided Backpropagation [60], Integrated Gradient [62], SmoothGrad [59], LRP [5, 26], Deep Taylor [46], DeepLIFT [55], and Deep SHAP [13]. Perturbation-based saliency methods perturb input image pixels and observe the variations of the model outputs; they include Occlusion [72], RISE [51], Real-time [15], Meaningful Perturbation [21], and Extremal Perturbation [19]. Inspired by saliency methods, in HINT we build a saliency-guided approach to identify the responsible regions for each concept on hidden layers.

Overview. Considering a CNN classification model f and a hierarchy of concepts $E = \{e\}$ (see Figure 1 (a)), our goal is to identify bidirectional associations between neurons and hierarchical concepts. To this end, we develop HIerarchical Neuron concepT explainer (HINT) to quantify the contribution of each neuron d to each concept e by a contribution score φ, where a higher value means a stronger association between d and e, and vice versa. The key problem therefore becomes how to estimate the score φ for any pair of e and d. We achieve this by identifying how the network maps concept e to a high-dimensional space and quantifying the contribution of d to this mapping. First, given a concept e and an image x, on the feature map z of the l-th layer, HINT identifies the responsible regions $r_e$ for concept e with a saliency-guided approach elaborated in Section 3.1. Then, given the identified regions for all the concepts, HINT further trains a concept classifier $L_e$ to separate concept e's responsible regions $r_e$ from the other regions $r_E \cup r_{b^*}$, where $b^*$ represents background (see Section 3.2). Finally, to obtain φ, we design a Shapley value-based approach to fairly evaluate the contribution of each neuron d from the concept classifiers (see Section 3.3).

In this section, we introduce our saliency-guided approach to collect the responsible regions $r_e$ for a certain concept e ∈ E, which serve as the training samples of the concept classifier described in Section 3.2. Taking an image x containing a concept e as input, the network f generates a feature map $z \in \mathbb{R}^{D_l \times H_l \times W_l}$, where there are $D_l$ neurons in total. Generally, not all regions of z are equally related to e [76]. In other words, some regions have stronger correlations with e while others are less correlated, as shown in Figure 1 (b) "Step 1".
Based on the above observation, we propose a saliency-guided approach to identify the regions of feature map z that are closely related to concept e. We call them responsible regions.

Algorithm 1 (excerpt):
9:  for each e in E do
10:      Train classifier $L_e$ which separates $r_e$ and $r_E \cup r_{b^*}$
11: for each e in E do
12:      for each d in D do
13:          φ = Shapley value of neuron d to concept e
14:          Update Φ with φ

First, we obtain the saliency map on the l-th layer. As shown in Figure 1 (b) "Step 1", with the feature map z on the l-th layer extracted, we derive the l-th layer's saliency map s with respect to concept e by the saliency map estimation approach Λ. Note that HINT is compatible with different backpropagation-based saliency map estimation methods. We implement five of them [56, 57, 59, 60, 62]; please refer to the Supplementary Material for more details. Note that, different from existing works [56, 57, 59, 60, 62] that pass the gradients to the input image as saliency scores, we early stop the back-propagation at the l-th layer to obtain the saliency map s. Here, we use a modified SmoothGrad [59] as an example to demonstrate our approach:

$s = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial f_{C_k}\big(x + \mathcal{N}(0,\sigma^2)\big)}{\partial z}$

where $f_{C_k}$ is the logit for the class $C_k$ corresponding to concept e, z is the feature map of the l-th layer, and $\mathcal{N}$ indicates the normal distribution. It is notable that we may also optionally select part of the neurons D for analysis.

Next is to identify the responsible regions on feature map z with the guidance of the saliency map s. Specifically, we categorize each entry $z_{D,i,j}$ in z as responsible to e or not. To this end, the saliency map s is first aggregated by an aggregation function ζ along the channel dimension and then normalized to be within [0, 1]. Note that different aggregation functions ζ can be applied (see the five different ζ shown in the Supplementary Material). Here, we aggregate s using the Euclidean norm, $\zeta(s) = \|s\|_2$, along its first dimension. After that, we obtain $\hat{s} \in [0, 1]^{H_l \times W_l}$, with each element $\hat{s}_{i,j}$ indicating the relevance of $z_{D,i,j}$ to concept e. By setting a threshold t (we set t to 0.5 in the paper) and masking z with $\hat{s} \ge t$ and $\hat{s} < t$, we finally obtain responsible regions and background regions, respectively (see the illustration of the two regions in Figure 1 (b): "Step 1").

Our saliency-guided approach extends the interpretability of saliency methods, which originally aim to find the "responsible regions" for a concept on one particular image. In contrast, our approach identifies "responsible regions" for a concept in the high-dimensional space of a hidden layer from multiple images, which can more accurately describe how the network represents concept e internally. Therefore, our saliency-guided approach provides better interpretability as it helps us investigate the internal abstraction of concept e in the network.
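As a concrete illustration of Section 3.1, the sketch below derives a hidden-layer SmoothGrad map and splits the feature map into responsible and background regions. It is a minimal PyTorch rendering under our own assumptions: the noise level, sample count, and helper names are illustrative and not taken from the released code.

```python
import torch

def hidden_layer_saliency(model, layer_name, x, class_idx,
                          n_samples=15, sigma=0.15):
    """Modified SmoothGrad: back-propagation is stopped at the chosen hidden
    layer, so the saliency map s lives on the feature map z rather than on
    input pixels. n_samples and sigma are illustrative values."""
    store = {}
    layer = dict(model.named_modules())[layer_name]
    handle = layer.register_forward_hook(lambda m, i, o: store.update(z=o))

    grads = []
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        logits = model(noisy)
        score = logits[:, class_idx].sum()
        g, = torch.autograd.grad(score, store["z"])   # d score / d z
        grads.append(g)
    model(x)                                          # clean pass: store["z"] = z
    handle.remove()
    return torch.stack(grads).mean(0), store["z"].detach()

def split_regions(s, z, t=0.5):
    """Aggregate s with the Euclidean norm over channels, normalize to [0, 1],
    and split the spatial activations of z into responsible / background sets."""
    s_hat = s.norm(p=2, dim=1)                        # (B, H_l, W_l)
    s_hat = (s_hat - s_hat.min()) / (s_hat.max() - s_hat.min() + 1e-8)
    mask = (s_hat >= t).reshape(-1)
    r = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1]) # rows are r = z_{D,i,j}
    return r[mask], r[~mask]                          # responsible, background
```

The rows returned by split_regions are the spatial activations r collected across images to train the concept classifier in the next step.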
For each image, we identify its responsible regions for each concept e ∈ E following the procedure described in Section 3.1, and construct a dataset which contains a collection of responsible regions $r_e$ and a collection of background regions $r_{b^*}$. Given the dataset, as shown in Figure 1 (c) "Step 2", we use the high-dimensional CNN hidden-layer features to train a concept classifier $L_e$ which distinguishes $r_e$ from $r_E \cup r_{b^*}$, i.e., separates concept e from the other concepts $E \cup b^*$ (Lines 9 and 10 in Algorithm 1). $L_e$ can take many forms: a linear classifier, a decision tree, a Gaussian mixture model, and so on. Here, we use the simplest form, a linear classifier, which is equivalent to a hyperplane separating concept e from the others in the high-dimensional feature space of the CNN:

$L_e(r) = \sigma(\alpha^{\top} r)$

where $r = z_{D,i,j} \in \mathbb{R}^{|D|}$ represents a spatial activation with each element corresponding to a neuron, α is a vector of weights, σ is the sigmoid function, and $L_e(r) \in [0, 1]$ represents the confidence that r is related to concept e. It is notable that we can apply the concept classifier $L_e$ back to the feature map z to visualize how $L_e$ detects concept e. Classifiers of more abstract concepts (e.g., whole) tend to activate regions of more general features, which helps us locate the entire extent of the object. On the contrary, classifiers of lower-level concepts tend to activate regions of discriminative features such as eyes and heads.

Next is to decode the contribution score φ from the concept classifiers. A simple method to estimate φ is to use the learned classifier weight corresponding to each neuron d, where a higher value typically means a larger contribution [45]. However, treating α as the contribution score assumes that the neurons are independent of each other, which is generally not true. To achieve a fair evaluation of the neurons' contributions to e, a Shapley value-based approach is designed to calculate the scores φ, which takes into account the neurons' individual effects as well as the contributions coming from collaboration with others. The Shapley value [54] comes from game theory and evaluates channels' individual and collaborative effects. More specifically, if a channel cannot be used for classification independently but can greatly improve classification accuracy when collaborating with other channels, its Shapley value can still be high. The Shapley value satisfies the properties of efficiency, symmetry, dummy, and additivity [45]. Monte-Carlo sampling is used to estimate the Shapley values by testing the target neuron's possible coalitions with other neurons. Equation (2) shows how we calculate the Shapley value:

$\phi = \frac{1}{M}\sum_{m=1}^{M}\Big(L_e^{*(S_m \cup d)}(r) - L_e^{*(S_m)}(r)\Big)$   (2)

where $r = z_{D,i,j}$ represents a spatial activation from $r_E$ and $r_{b^*}$; $S_m \subseteq D \setminus d$ is the neuron subset randomly selected at each iteration; * is an operator keeping the neurons in the brackets, i.e., $S_m \cup d$ or $S_m$, unchanged while randomizing the others; M is the number of iterations of Monte-Carlo sampling; and $L_e^{*}$ means that the classifier is re-trained with the neurons in the brackets unchanged and the others randomized.

By repeating the score calculation for all pairs of e and d (Lines 11 to 14 in Algorithm 1), we obtain a score matrix Φ, where each row represents a neuron d and each column represents a concept e in the hierarchy. By sorting the scores in the column of concept e, we can find collaborative neurons that all have high contributions to e. Also, by sorting the scores in the row of neuron d, we can test whether d is multimodal (having high scores for multiple concepts) and observe the hierarchy of concepts that d is responsible for. Note that the score matrix Φ cannot tell us the exact number of neurons responsible for concept e. A neuron d whose contribution score φ is zero or near zero can be regarded as irrelevant to the corresponding concept e; therefore, for truncation, we may set a threshold on φ. In our experiments, for each concept, we sort the scores and select the top N neurons as responsible neurons.
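To make the Monte-Carlo estimation of Equation (2) concrete, here is a small sketch with scikit-learn. The modeling choices are ours for illustration: the value of a coalition is taken as the validation accuracy of a retrained logistic-regression concept classifier, and "randomizing" the excluded neurons is implemented by permuting their activations across samples; the paper's exact objective is the one stated in Equation (2).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def coalition_value(X_tr, y_tr, X_va, y_va, keep, seed=0):
    """Value of a neuron coalition `keep`: accuracy of a concept classifier
    retrained after the activations of all other neurons are permuted across
    samples (one way to destroy their information). y = 1 for responsible
    regions of concept e, 0 for other concepts / background."""
    rng = np.random.default_rng(seed)
    X_tr, X_va = X_tr.copy(), X_va.copy()
    for j in np.setdiff1d(np.arange(X_tr.shape[1]), keep):
        X_tr[:, j] = rng.permutation(X_tr[:, j])
        X_va[:, j] = rng.permutation(X_va[:, j])
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_va, y_va)

def mc_shapley(X_tr, y_tr, X_va, y_va, d, M=50, seed=0):
    """Monte-Carlo estimate of neuron d's Shapley value for one concept."""
    rng = np.random.default_rng(seed)
    D = X_tr.shape[1]
    others = np.delete(np.arange(D), d)
    total = 0.0
    for _ in range(M):
        S = rng.choice(others, size=rng.integers(0, D), replace=False)
        total += coalition_value(X_tr, y_tr, X_va, y_va, np.append(S, d)) \
               - coalition_value(X_tr, y_tr, X_va, y_va, S)
    return total / M
```

Sorting the resulting scores per concept gives the top-N responsible neurons used in the experiments below.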
HINT is a general framework which can be applied to any CNN architecture. We evaluate HINT on several models trained on ImageNet [17] with representative CNN backbones including VGG-16 [58], VGG-19 [58], ResNet-50 [27], and Inception-v3 [63]. In this paper, the layer names follow the PyTorch pretrained models (e.g., "features.30" is a layer name of VGG19). The hierarchical concept set E is built upon the 1000 categories of ImageNet, with the hierarchical relationships defined by WordNet [44], as shown in Figure 1. Figure 3 shows the computational complexity analysis, indicating that the Shapley value calculation is negligible when considering the whole pipeline.

In this section, we study the neurons responsible for the concepts and show the hierarchical cognitive pattern of CNNs. We adopt the VGG-19 backbone and show the top-10 significant neurons for each concept (N = 10). The results in Figure 2 show that HINT explicitly reveals the hierarchical learning pattern of the network: some neurons are responsible for concepts at higher semantic levels, such as whole and animal, while others are responsible for more detailed concepts, such as canine. Besides, HINT shows that multiple neurons can contribute to a single concept, and HINT identifies multimodal neurons which have high contributions to multiple concepts.

Concepts of different levels. First, we investigate the concepts of different levels in Figure 2 (a). Among all the concepts, whole has the highest semantic level, including animal, person, and plant. To study how a network recognizes a Husky (a subclass of canine) image on a given layer, HINT hierarchically identifies the neurons which are responsible for the concept from higher levels (like whole and animal) to lower ones (like canine). Besides, HINT is able to identify multimodal neurons which are responsible for many concepts at different semantic levels. For example, the 445th neuron delivers the largest contribution to multiple concepts including animal, vertebrate, mammal, and carnivore, and also contributes to canine, manifesting that the 445th neuron captures general and species-specific features which are not labeled in the training data.

Concepts of the same level. Next, we study the neurons responsible for concepts at the same level identified by HINT. For mammal, reptile, and bird, there exist multimodal neurons, as the three categories share morphological similarities. For example, the 199th and 445th neurons contribute to both mammal and bird, while the 322nd and 347th neurons are each responsible for both reptile and bird. Interestingly, HINT identifies multimodal neurons contributing to concepts which are conceptually far apart to humans. For example, the 199th neuron contributes to both bird and car. By applying the bird classifier to images of birds and cars, we find that both the body of the bird and the wheels of the car can be detected.

Same concept on different layers. We also identify responsible neurons on different network layers with HINT. Figure 2 (b) illustrates the 10 most responsible neurons for mammal on four other network layers. On shallow layers, such as layer features.10, HINT indicates that the concept of mammal cannot be recognized by the network (F1 score: 0.04). However, as the network goes deeper, the F1 score of the mammal classifier increases to around 0.8 on layer features.30, which is consistent with existing works [71, 72] showing that deeper layers capture higher-level and semantically richer features.

With the associations between neurons and hierarchical concepts obtained by HINT, we further validate the associations using Weakly Supervised Object Localization (WSOL).
Specifically, we train a concept classifier $L_e$ (see the detailed steps in Sections 3.1 and 3.2) with the top-N significant neurons corresponding to concept e at a certain layer, and locate the responsible regions using $L_e$ as the localization results. Good localization performance of $L_e$ indicates that the N neurons indeed have high contributions to concept e.

Comparison of localization accuracy. The quantitative evaluations in Tables 1 and 2 show that HINT achieves comparable performance to existing WSOL approaches, thus validating the associations. We train animal (Table 1) and whole (Table 2) classifiers with 10%, 20%, 40%, and 80% of the neurons, sorted and selected by Shapley values, on layer "features.26" (512 neurons) of VGG16, layer "layer3.5" (1024 neurons) of ResNet50, and layer "Mixed 6b" (768 neurons) of Inception v3, respectively. To be consistent with the commonly-used WSOL metric, Localization Accuracy measures the ratio of images whose IoU between the ground-truth and predicted bounding boxes is larger than 50%. In Table 1, we compare HINT with state-of-the-art methods on the CUB-200-2011 dataset [65], which contains images of 200 categories of birds. Note that existing localization methods need to re-train the model on CUB-200-2011, as they are tailored to the classifier, while HINT directly adopts the classifier trained on ImageNet without further fine-tuning on CUB-200-2011. Even so, HINT still achieves comparable performance when adopting VGG16 and Inception v3, and performs the best when adopting ResNet50. Moreover, Table 2 shows that HINT outperforms all existing methods on all models on ImageNet. Besides, the differences in localization accuracy may indicate that different models have different learning modes. Specifically, few neurons in VGG16 are responsible for animal or whole, while most neurons in ResNet50 contribute to identifying animal or whole. In conclusion, the results quantitatively prove that the associations are valid and that HINT achieves performance comparable to existing WSOL methods. More analysis is included in the supplementary file.

Flexible choice of localization targets. When locating objects, HINT has a unique advantage: a flexible choice of localization targets. We can locate objects at different levels of the concept hierarchy (e.g., bird, mammal, and animal). In our experiments, we train concept classifiers of whole, person, animal, and bird using the 20 most important neurons on layer features.30 of VGG19 and apply them to PASCAL VOC 2007 [18]. Figure 4 (a) shows that HINT can accurately locate the objects belonging to different concepts.

Extension to locating the entire extent of the object. Many existing WSOL methods adapt the model architecture and develop training techniques to highlight the entire extent rather than only the discriminative parts of the object [6, 36, 41, 43, 69, 73]. However, can we effectively achieve this goal without model adaptation and retraining? HINT provides an approach that utilizes the implicit concepts learned by the model. As shown in Figure 4 (c), classifiers of higher-level concepts (e.g., whole) tend to draw larger masks on objects than classifiers of lower-level concepts (e.g., bird). This is because the responsible regions of whole contain all the features of its subcategories. Naturally, the whole classifier tends to activate full object regions rather than object parts.
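For reference, the WSOL metric reported in Tables 1 and 2 can be computed as below. The box-from-mask step is a simple tight-box heuristic of our own and may differ in detail from the exact evaluation protocol; the masks are assumed to come from thresholding the concept classifier's response map upsampled to image size.

```python
import numpy as np

def mask_to_bbox(mask):
    """Tightest box (x0, y0, x1, y1) around a binary localization mask."""
    ys, xs = np.where(mask)
    if len(xs) == 0:
        return None
    return xs.min(), ys.min(), xs.max(), ys.max()

def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1) with inclusive coordinates."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0 + 1) * max(0, iy1 - iy0 + 1)
    area = lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1)
    return inter / (area(a) + area(b) - inter)

def localization_accuracy(pred_masks, gt_boxes, thr=0.5):
    """Fraction of images whose predicted box overlaps the ground-truth box
    with IoU > 50%, i.e., the WSOL metric used in Tables 1 and 2."""
    hits = 0
    for mask, gt in zip(pred_masks, gt_boxes):
        box = mask_to_bbox(mask)
        hits += box is not None and box_iou(box, gt) > thr
    return hits / len(gt_boxes)
```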
We perform an ablation study to show that HINT is general and can be implemented with different saliency methods, and that Shapley values are good measures of neurons' contributions to concepts.

Implementation with different saliency methods. We train concept classifiers with five modified saliency methods (see Supplementary Material). Then, we apply the classifiers to the object localization task. Figure 4 (b) shows that all five saliency methods perform well. This shows that HINT is general and that different saliency methods can be integrated into HINT.

Shapley values. To test the effectiveness of Shapley values, we train concept classifiers using 20 neurons on layer features.30 of VGG19 selected by different approaches, including Shapley values (denoted as shap), absolute values of linear classifier coefficients (denoted as clf coef), and random selection (denoted as random). We then use the classifiers to perform localization tasks on PASCAL VOC 2007 (see Figure 4 (c)). Two metrics are used: the pointing game (mask intersection over the ground-truth mask, commonly used by other interpretation methods) [74] and IoU (mask intersection over the union of the masks). The results show that "shap" outperforms "clf coef" and "random" when locating different targets. This suggests that the Shapley value is a good measure of neuron contribution, as it considers both the individual and collaborative effects of neurons. On the contrary, linear classifier coefficients assume that neurons are independent of each other.

We further demonstrate HINT's usefulness and extensibility through saliency method evaluation, adversarial attack explanation, and COVID19 classification model evaluation (Figure 5).

Saliency method evaluation. Guided Backpropagation can pass the sanity test in [1, 30] if we observe the hidden-layer results (see Figure 5 (a)). On layer features.8, with fewer randomized layers, the classifier-identified regions are more concentrated on the key features of the bird (its beak and tail), thereby suggesting that Guided Backpropagation does detect the salient regions.

Explaining adversarial attack. We attack images of various classes to be bird using PGD [40] and apply the bird classifier to their feature maps. The responsible regions for concept bird highlighted in those fake bird images may imply that, for certain images, the adversarial attack does not change the whole body of the object into another class but instead targets some details of the original image where there exist shapes similar to a bird (see Figure 5 (b)). For example, in the coffee mug image where most shapes are round, the adversarial attack catches the only pointed shape and perturbs it to look like a bird. Based on the above observations, we design a quantitative evaluation of the faithfulness of our explanations. First, we attack 300 images of categories other than bird to be birds using the VGG19 model. Then, we use a bird classifier to find the regions corresponding to the adversarial bird features on the attacked images. By visual inspection, we find that most of these regions contain pointed shapes. Based on the regions, we train an adversarially attacked "bird" classifier ("ad clf"). Finally, we use the "ad clf" to perform the WSOL task on real bird images. The accuracy is 64.3% (for the true bird classifier, it is 70.1%), indicating that HINT captures the adversarial bird features and validating the explanation: some kinds of adversarial attacks may work by attacking shapes in the original image that are similar to the target class.
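The two mask-level scores used in the ablation above and in the COVID19 evaluation below follow directly from their definitions in the text; a minimal sketch, with masks assumed to be boolean NumPy arrays:

```python
import numpy as np

def pointing_score(pred_mask, gt_mask):
    """'Pointing game' as used here: intersection of the predicted mask with
    the ground-truth mask, normalized by the ground-truth area."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return inter / max(gt_mask.sum(), 1)

def mask_iou(pred_mask, gt_mask):
    """Mask IoU: intersection over the union of the two masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / max(union, 1)
```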
COVID19 classification model evaluation. Applying deep learning to the detection of COVID19 in chest radiographs has the potential to provide quick diagnosis and guide management in situations where molecular testing resources are limited. However, the robustness of those models remains unclear [16]. We do not know whether the model decisions rely on confounding factors or on medical pathology in the chest radiographs. Object localization with HINT can check whether the identified responsible regions overlap with the lesion regions drawn by doctors (see Figure 5 (c)). As shown, the pointing game and IoU scores are not high. For the many cases with low pointing game and IoU values, the model does not focus on the lesion region; for the cases with high values, further investigation is still required to determine whether the model captures medical pathology features or merely happens to focus on the area of the stomach.

HINT can systematically and quantitatively identify the neurons responsible for implicit high-level concepts. However, our approach cannot handle concepts that are not included in the concept hierarchy, and it is not effective at identifying neurons responsible for concepts below the bottom level of the hierarchy, i.e., the classification categories. More exploration is needed if we want to build such neuron-concept associations.

We have presented the HIerarchical Neuron concepT explainer (HINT), which builds bidirectional associations between neurons and hierarchical concepts in a low-cost and scalable manner. HINT systematically and quantitatively explains whether and how the neurons learn the high-level hierarchical relationships of concepts in an implicit manner. Besides, it is able to identify not only collaborative neurons contributing to the same concept but also multimodal neurons contributing to multiple concepts. Extensive experiments and applications manifest the effectiveness and usefulness of HINT. We open source our development package and hope HINT can inspire more investigations in this direction.

In this supplementary file, we first show the five modified saliency methods and the five aggregation approaches with which HINT can be implemented, in Sections A and B respectively. Second, we explain the properties that HINT's Shapley value-based neuron contribution scoring approach satisfies in Section C. Third, we provide detailed descriptions of the applications of HINT (saliency method evaluation, explaining adversarial attacks, and evaluation of COVID19 classification models) in Section D. Next, we demonstrate more neuron-concept associations and the activation maps of multimodal neurons in Section E. Then, we show more quantitative analysis and illustrations of the results of applying HINT to Weakly Supervised Object Localization tasks in Section F. Finally, we provide more illustrations of the ablation studies on the modified saliency methods and the Shapley value-based scoring approach in Section G.

Inspired by backpropagation-based saliency methods, we develop a saliency-guided approach to identify responsible regions in the feature map z. Equation (S.1) shows how the representative backpropagation-based saliency method, Gradient (Vanilla Backpropagation) [57], calculates the contribution of pixel $x_{:,i_0,j_0}$ to a class $C_k$:

$s_{:,i_0,j_0} = \frac{\partial f_{C_k}(x)}{\partial x_{:,i_0,j_0}}$   (S.1)

where f is a deep network, $f_{C_k}(x)$ is the logit of x for class $C_k$, and $x_{:,i_0,j_0}$ is a pixel.
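A minimal PyTorch rendering of Equation (S.1), computing the vanilla gradient of the class logit with respect to every input pixel (the function name is ours):

```python
import torch

def input_gradient_saliency(model, x, class_idx):
    """Vanilla gradient saliency (Eq. S.1): d f_Ck(x) / d x for every pixel."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    logits[:, class_idx].sum().backward()
    return x.grad.detach()   # same shape as x; aggregate over channels to visualize
```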
We extend the idea of saliency maps to hidden layers. We take concept e and the neurons D on the l-th layer as an example. Given an image x with label $C_k$, where $C_k$ is concept e or a subcategory of concept e, the contribution of the spatial activation $z_{D,i_l,j_l}$ to class $C_k$ (and hence to concept e) is shown in Equation (S.2):

$s_{D,i_l,j_l} = \frac{\partial f_{C_k}(x)}{\partial z_{D,i_l,j_l}}$   (S.2)

where $s_{D,i_l,j_l} \in \mathbb{R}^{|D|}$ is a vector, and the $s_{D,i_l,j_l}$ for all $i_l$ and $j_l$ form the saliency map s. As shown in Table S.1, we modify five backpropagation-based saliency methods. All of them can be used in HINT. With the saliency map s, the next step is to aggregate $s_{D,i_l,j_l}$, and the aggregated value is used to decide whether each $z_{D,i_l,j_l}$ belongs to the responsible foreground regions or the irrelevant background regions. We implement five aggregation approaches, shown in Table S.1. All of them can be applied to HINT. Note that the aggregation is only conducted along the first dimension of s.

In the main paper, the Shapley value φ of a neuron d for a concept e is calculated as in Equation (S.3):

$\phi = \frac{1}{M}\sum_{m=1}^{M}\Big(L_e^{*(S_m \cup d)}(r) - L_e^{*(S_m)}(r)\Big)$   (S.3)

where D is the set of neurons; $L_e$ is the classifier for concept e; $r = z_{D,i,j}$ represents a spatial activation; $r_E$ and $r_{b^*}$ are the responsible regions of all concepts e ∈ E and the background regions; $S_m \subseteq D \setminus d$ is the neuron subset randomly selected at each iteration; * is an operator keeping the neurons in the brackets, i.e., $S_m \cup d$ or $S_m$, unchanged while randomizing the others; M is the number of iterations of Monte-Carlo sampling; and $L_e^{*}$ means that the classifier is re-trained with the neurons in the brackets unchanged and the others randomized.

The following explains the properties of efficiency, symmetry, dummy, and additivity that Shapley values satisfy [45], i.e., that our Shapley value-based scoring approach satisfies.

Efficiency. The sum of the neuron contributions should be equal to the difference between the prediction for r and its expectation, as shown in Equation (S.4):

$\sum_{d \in D} \phi_d = L_e(r) - \mathbb{E}\big[L_e(r)\big]$   (S.4)

Symmetry. If two neurons $d_n$ and $d_m$ contribute equally in every coalition, i.e., $L_e^{*(S \cup d_n)}(r) = L_e^{*(S \cup d_m)}(r)$ for every $S \subseteq D \setminus \{d_n, d_m\}$, then $\phi_{d_n} = \phi_{d_m}$, where * is an operator keeping the neurons in the brackets, i.e., $S \cup d_n$ or $S \cup d_m$, unchanged while randomizing the others.

Dummy. If a neuron d has no contribution to concept e, which means d's individual contribution is zero and d also has no contribution when it collaborates with other neurons, then d's contribution score should be zero:

$\phi_d = 0$   (S.8)

Additivity. If $L_e$ is a random forest consisting of different decision trees, the Shapley value of neuron d for the random forest is the sum of the Shapley values of neuron d for each decision tree:

$\phi_d = \sum_{t=1}^{T} \phi_d^{t}$

where there are T decision trees and $\phi_d^{t}$ is the Shapley value of neuron d for the t-th tree.

We demonstrate more applications of HINT as follows. With the emergence of various saliency methods, different sanity evaluation approaches have been proposed [1, 33, 70]. However, as most saliency methods only show the responsible pixels on the input images, feature maps on hidden layers are not considered, which makes the sanity evaluation not comprehensive enough. For example, [1] proposed a sanity test that compares the saliency map before and after cascading randomization of the model parameters from the top layers to the bottom layers. Guided Backpropagation failed the test because its results remained invariant. We propose to apply the concept classifier implemented with the target saliency method to identify the responsible regions on hidden-layer feature maps for the sanity test. The target saliency method passes the sanity test if meaningful responsible regions can be observed.
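The cascading randomization used by the sanity test of [1] can be sketched as follows; applying the hidden-layer concept classifier to such progressively randomized copies of the model is how results like Figure S.1 (a) are produced. The layer selection below is a simplification that only re-initializes convolutional and linear layers.

```python
import copy
import torch

@torch.no_grad()
def cascading_randomization(model, n_top):
    """Re-initialize the top `n_top` parameterized layers (counted from the
    output side), following the cascading-randomization sanity test of [1]."""
    layers = [m for m in model.modules()
              if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    for m in layers[len(layers) - n_top:]:
        m.reset_parameters()   # destroys the learned weights of that layer
    return model

# Usage: randomize a copy, then recompute the responsible regions and check
# whether they degrade as more layers are randomized.
# randomized = cascading_randomization(copy.deepcopy(model), n_top=3)
```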
As shown in Figure S.1 (a), on the hidden layer features.8, when fewer layers are randomized, the responsible regions are more focused on the key features of the bird (its beak and tail), which means that Guided Backpropagation does reveal the salient region and could pass the sanity test if hidden-layer results are considered.

Concept classifiers can also be applied to explain how the object in an adversarially attacked image is shifted to another class. As shown in Figure S.1 (b), we attack images of various classes to be bird using PGD [40] and apply the bird classifier to the attacked images' feature maps. The responsible regions for concept bird highlighted in those fake bird images imply that the adversarial attack does not change all the content of the original object into another class but targets some details of the original image where there exist shapes similar to a bird. For example, in the image of a coffee mug where most shapes are round, the adversarial attack catches the only pointed shape and perturbs it to look like a bird. Additionally, we find that the attacked image still preserves features of the original class. In Figure S.1 (b), applying the mammal classifier to the attacked lion image shows that most parts of the lion face are highlighted, while applying the mammal classifier to the original lion image shows a similar pattern.

Applying deep learning to the detection of COVID19 in chest radiographs has the potential to provide quick diagnosis and guide management in situations where molecular testing resources are limited. However, the robustness of those models remains unclear [16, 30]. We do not know whether the model decisions rely on confounding factors or on medical pathology in the chest radiographs. To tackle this challenge, object localization by HINT can be used to check whether the identified responsible regions overlap with the lesion regions drawn by doctors. With the COVID19 dataset from the SIIM-FISABIO-RSNA COVID-19 Detection competition [34], we trained the models used by high-ranking teams and other baseline models for classification. The localization results of COVID19 cases with typical symptoms by EfficientNet [64] are shown in Figure S.1 (c). As shown, the pointing game and IoU scores are not high. For the many cases with low pointing game and IoU values, the model does not focus on the lesion region; for the cases with high values, further investigation is still required to determine whether the model captures medical pathology features or merely happens to focus on the area of the stomach.

We also compare the contribution scores of the neurons of VGG16, ResNet50, and Inception v3 to the concept of animal. As we can see, the drop of the neurons' contribution scores is less sharp for ResNet50 than for VGG16 and Inception v3, which means that the neurons of ResNet50 rely more on collaboration to detect animal. As shown in S.4, the 445th neuron on layer features.30 of VGG19 contributes strongly to multiple concepts, indicating that it is multimodal; we also show the activation maps of the 445th neuron. In this section, because many images in ImageNet only have classification labels, we use the hidden-layer saliency map as the mask of the target object, and we apply the metrics of pointing game (pointing) [74], Spearman's correlation (spearman cor), and the structural similarity index (SSIM) [66] to evaluate the concept classifiers' performance on ImageNet. VGG19 is used for testing (see Table S.5).
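For completeness, a minimal sketch of a targeted PGD attack [40] of the kind used in Section D.2 to turn images of other classes into "birds". The step size, budget, and iteration count are illustrative, and the code assumes inputs in [0, 1]; adapt the clamping if images are already ImageNet-normalized.

```python
import torch
import torch.nn.functional as F

def targeted_pgd(model, x, target_class, eps=8/255, alpha=2/255, steps=40):
    """Targeted PGD: perturb x within an L-inf ball of radius eps so that the
    model predicts `target_class` (e.g., the index of a bird class)."""
    target = torch.full((x.size(0),), target_class, dtype=torch.long, device=x.device)
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target)
        grad, = torch.autograd.grad(loss, x_adv)
        # descend on the targeted loss (move toward the target class)
        x_adv = x_adv.detach() - alpha * grad.sign()
        x_adv = torch.clamp(x_adv, x - eps, x + eps).clamp(0, 1)
    return x_adv.detach()
```

The bird concept classifier is then applied to the hidden-layer feature maps of the attacked images to locate which shapes the attack latched onto.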
References

Sanity checks for saliency maps (authors include Ian Goodfellow, Moritz Hardt, and Been Kim)
Guided Zoom: Questioning network evidence for fine-grained classification
Interpretable machine learning in healthcare
Synthesizing robust adversarial examples (PMLR)
On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation
Rethinking class activation mapping for weakly supervised object localization
Network dissection: Quantifying interpretability of deep visual representations
Understanding the role of individual units in a deep neural network
GAN dissection: Visualizing and understanding generative adversarial networks
Exploring neural networks with activation atlases
This looks like that: Deep learning for interpretable image recognition
Explaining models by propagating Shapley values of local components
Attention-based dropout layer for weakly supervised object localization
Real time image saliency for black box classifiers
AI for radiographic COVID-19 detection selects shortcuts over signal
ImageNet: A large-scale hierarchical image database
The PASCAL Visual Object Classes Challenge
Understanding deep networks via extremal perturbations and smooth masks
Net2Vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks
Interpretable explanations of black boxes by meaningful perturbation
Towards automatic concept-based explanations
Neuron Shapley: Discovering the responsible neurons
Regression concept vectors for bidirectional explanations in histopathology
Semantics for global and local interpretation of deep neural networks
Understanding individual decisions of CNNs via contrastive backpropagation
Deep residual learning for image recognition
Densely connected convolutional networks
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size
Towards semantic interpretation of thoracic disease and COVID-19 diagnosis models
Examples are not enough, learn to criticize! Criticism for interpretability
Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV)
The (un)reliability of saliency methods
The 2021 SIIM-FISABIO-RSNA Machine Learning COVID-19 Challenge: Annotation and standard exam classification of COVID-19 chest radiographs
The mythos of model interpretability
Geometry constrained weakly supervised object localization
Consistent individualized feature attribution for tree ensembles
A unified approach to interpreting model predictions
ShuffleNet V2: Practical guidelines for efficient CNN architecture design
Towards deep learning models resistant to adversarial attacks
Erasing integrated learning: A simple yet effective approach for weakly supervised object localization
The parallel distributed processing approach to semantic cognition
Foreground activation maps for weakly supervised object localization
WordNet: A lexical database for English
Interpretable machine learning
Explaining nonlinear classification decisions with deep Taylor decomposition
Inceptionism: Going deeper into neural networks
Compositional explanations of neurons
Feature visualization: How neural networks build up their understanding of images (Distill)
The building blocks of interpretability (authors include Katherine Ye and Alexander Mordvintsev)
RISE: Randomized input sampling for explanation of black-box models
M. R. Quillian. Semantic memory. In Semantic Information Processing
"Why should I trust you?": Explaining the predictions of any classifier
A value for n-person games
Learning important features through propagating activation differences
Not just a black box: Learning important features through propagating activation differences
Deep inside convolutional networks: Visualising image classification models and saliency maps
Very deep convolutional networks for large-scale image recognition
SmoothGrad: Removing noise by adding noise
Striving for simplicity: The all convolutional net
One pixel attack for fooling deep neural networks
Axiomatic attribution for deep networks
Rethinking the Inception architecture for computer vision
EfficientNet: Rethinking model scaling for convolutional neural networks
The Caltech-UCSD Birds-200-2011 Dataset
Mean squared error: Love it or leave it? A new look at signal fidelity measures (IEEE Signal Processing Magazine)
The selective impairment of semantic memory
The challenge of crafting intelligible intelligence
DANet: Divergent activation for weakly supervised object localization
On the (in)fidelity and sensitivity of explanations
Understanding neural networks through deep visualization
Visualizing and understanding convolutional networks
Rethinking the route towards weakly supervised object localization
Top-down neural attention by excitation backprop
Interpreting CNN knowledge via an explanatory graph
Interpreting CNNs via decision trees

Figure: Localization results of applying the whole classifier on sample images from PASCAL VOC; the classifier is trained on layer features.30 of VGG19.