title: Explainable multiple abnormality classification of chest CT volumes with deep learning
authors: Draelos, Rachel Lea; Carin, Lawrence
date: 2021-11-24

Abstract: Understanding model predictions is critical in healthcare, to facilitate rapid verification of model correctness and to guard against use of models that exploit confounding variables. We introduce the challenging new task of explainable multiple abnormality classification in volumetric medical images, in which a model must indicate the regions used to predict each abnormality. To solve this task, we propose a multiple instance learning convolutional neural network, AxialNet, that allows identification of top slices for each abnormality. Next we incorporate HiResCAM, an attention mechanism, to identify sub-slice regions. We prove that for AxialNet, HiResCAM explanations are guaranteed to reflect the locations the model used, unlike Grad-CAM which sometimes highlights irrelevant locations. Armed with a model that produces faithful explanations, we then aim to improve the model's learning through a novel mask loss that leverages HiResCAM and 3D allowed regions to encourage the model to predict abnormalities based only on the organs in which those abnormalities appear. The 3D allowed regions are obtained automatically through a new approach, PARTITION, that combines location information extracted from radiology reports with organ segmentation maps obtained through morphological image processing. Overall, we propose the first model for explainable multi-abnormality prediction in volumetric medical images, and then use the mask loss to achieve a 33% improvement in organ localization of multiple abnormalities in the RAD-ChestCT data set of 36,316 scans, representing the state of the art. This work advances the clinical applicability of multiple abnormality modeling in chest CT volumes.

1 Introduction

Automated interpretation of medical images with machine learning has the potential to revolutionize the field of radiology. However, machine learning systems have not yet been adopted on a large scale in clinical practice [1]. One barrier to adoption is trust [2, 3]. Most medical imaging models are based on convolutional neural networks (CNNs), which are "black box" models unless additional steps are taken to improve explainability [4]. Visual explanation methods in computer vision indicate which regions of an input image contribute to a model's predictions [4]. Explainability is critical in medical imaging to detect situations where a model leverages aspects of the data that are correlated with an outcome but inappropriate for prediction. Zech et al. [5] used class activation mapping [6] to reveal that a CNN trained to predict pneumonia from chest radiographs leveraged non-medical features to make the pneumonia prediction. Specifically, the model identified differences in metal tokens, postprocessing, and compression artefacts which revealed the hospital system of origin, a highly effective indicator of pneumonia risk due to differing frequencies of pneumonia among the patient populations. Furthermore, a "model" consisting of sorting the radiographs by hospital system achieved an AUROC of 0.861, illustrating that high performance alone does not guarantee that a model is faithful to expected medical reasoning.
This behavior is analogous to natural image classifiers detecting boats via water, trains via rails, or horses via a copyright watermark [7, 8]. It is thus critical to seek insight into how models make their predictions. In this work, we consider explainable multiple abnormality classification in volumetric medical images, a task that to the best of our knowledge has not yet been explored in the literature. Our first two contributions create an initial solution:

• We propose AxialNet, a multiple instance learning CNN that explains its predictions by identifying the axial slices that contribute most towards identification of each abnormality;
• To obtain finer-grained, sub-slice predictions, we incorporate HiResCAM explanations into AxialNet. We also prove that for multiple instance learning models like AxialNet, HiResCAM is guaranteed to highlight the locations the model used.

Our next two contributions aim to improve AxialNet's learning and thus improve the "reasoning" behind its explanations:

• We develop a mask loss that leverages HiResCAM and 3D allowed regions to encourage AxialNet to predict abnormalities from only the organs in which they are found;
• In order to obtain the 3D allowed regions efficiently, we propose PARTITION, a method to obtain pixel-level, abnormality-specific allowed regions without any manual labeling, by combining location information extracted from radiology reports with organ segmentation maps obtained via morphological image processing.

Overall, we present an initial method for the new task of explainable multiple abnormality classification in volumetric medical images, then improve upon our solution to achieve state-of-the-art performance. This work represents a step towards machine learning systems that may accelerate the radiology workflow and contribute to improved detection and monitoring of disease.

2 Related Work

Computed tomography (CT) scans are used to diagnose and monitor numerous conditions, including cancer [9], injuries [10], and lung disease [11, 12]. Due to the challenging and time-consuming nature of CT interpretation, there has been substantial interest in developing machine learning models to analyze CT scans. Almost all prior work in CT classification has focused on one class of abnormalities at a time, such as interstitial lung disease [13, 14, 15, 16, 17, 18, 19], lung cancer [20], pneumothorax [21], or emphysema [22]. The only model developed to predict multiple diverse abnormalities simultaneously from one CT volume is CT-Net, which was trained and evaluated on the RAD-ChestCT data set of 36,316 CT volumes [23]. While CT-Net achieves high performance, its final representation is not interpretable due to an intermediate convolution step over the feature dimension that disrupts the spatial relationship between the input volume and the low-dimensional representation. We propose a new model, AxialNet, which outperforms CT-Net while providing explainability through identification of the slices that contribute most to each abnormality prediction. Slice-level explanations are useful, but obtaining finer-grained explanations remains important for gaining additional insight into a model's behavior. One straightforward way to obtain abnormality-specific fine-grained explanations is via the gradient-based methods Class Activation Mapping (CAM) [6] and Grad-CAM [24], which are commonly used explanation approaches in medical imaging [25, 26, 27, 28, 29, 30].
Unfortunately, CAM can only be used for CNNs that end in global average pooling followed by one fully connected layer, which does not apply to AxialNet. Grad-CAM is a generalization of CAM that loosens CAM's architecture requirements; however, recent work has demonstrated that Grad-CAM can be misleading, and sometimes highlights irrelevant locations that the model did not actually use for prediction [31]. HiResCAM is a new explanation method that is applicable to a wider range of CNN architectures than CAM, while remaining provably guaranteed to accurately highlight the locations these models used [31]. In this work, we prove that HiResCAM's location faithfulness is additionally guaranteed for the multiple instance learning architecture of the AxialNet model. We then apply HiResCAM to AxialNet in order to obtain sub-slice explanations. AxialNet and HiResCAM provide an initial solution to the task of explainable abnormality prediction. To improve AxialNet, we propose a mask loss that is mathematically related to a typical loss function used for training an abnormality segmentation model. Prior work in abnormality segmentation in CT scans has focused on groundglass opacities and consolidation [32, 33], COVID-19-related "anomalies" [34], pneumothorax [35], and lung nodules [36, 37, 38]. Our mask loss does rely on abnormality-specific allowed regions; however, unlike an abnormality segmentation loss, the mask loss is intended to enhance a classification model rather than train an abnormality segmentation model; it is calculated in a low-dimensional space for computational feasibility, rather than the input space; and it relies on automatically generated allowed regions rather than manually obtained abnormality segmentation maps, so that over 80 abnormalities across 36,316 CT volumes can be considered rather than 1-2 abnormalities across a few hundred CT volumes. The abnormality-specific allowed regions for our mask loss are obtained using PARTITION, which is the first reported approach to provide abnormality-specific allowed regions automatically. One component of PARTITION is organ segmentation with morphological image processing. We choose an unsupervised morphological image processing approach over a supervised machine learning approach for several reasons. First, the unsupervised approach for organ segmentation requires no manually created segmentation maps. RAD-ChestCT [23] includes whole-volume abnormality labels, but no segmentation ground truth, so training a machine-learning-based multi-organ segmentation model [39, 40, 41, 42] on RAD-ChestCT would require manually circumscribing the relevant anatomical structures on hundreds or thousands of slices, representing months of effort by a domain expert. Second, the two largest organs we were interested in segmenting, the right and left lungs, can be segmented with excellent performance using morphological image processing, because they are large contiguous regions of mostly black pixels. In fact, morphological image processing yields such good results for lung segmentation that it was used to create the training set lung segmentation ground truth in the CT-ORG data set [43, 44, 45]. Finally, training a machine learning segmentation model on another dataset and then deploying it on RAD-ChestCT would be unlikely to result in satisfactory performance on RAD-ChestCT due to significant differences in the data distributions [46].
RAD-ChestCT includes only non-contrast chest CT scans, while CT-ORG includes contrast, non-contrast, abdominal, and full body CT scans, and the AAPM Thoracic Auto-segmentation Challenge 2017 data set [47, 48] includes only contrast scans. RAD-ChestCT includes a mixture of mild and severe lung diseases, and 95% of scans have a slice thickness < 0.625 mm; AAPM excludes cases with collapsed lungs from extensive disease, and uses scans with slice spacing of 1 mm, 2.5 mm, or 3 mm. The anatomical regions of interest also differ. For our application, we required right lung, left lung, and heart/great vessels segmentations; CT-ORG includes labels for lung but not the heart or great vessels, while AAPM includes labels for lungs and heart but not the great vessels. For these reasons, we decided that domain adaptation from CT-ORG and/or AAPM models to RAD-ChestCT was best left to future work, and we pursued an unsupervised organ segmentation approach that could be optimized for the RAD-ChestCT dataset.

Figure 1: The proposed AxialNet model including a mask loss calculated using HiResCAM attention and PARTITION allowed regions. In the initial layers of AxialNet, a low-dimensional representation $z_h$ is obtained for each slice $h = 1, \ldots, H$ via 2D convolutions on axial slices. A final fully connected layer produces $M$ abnormality scores per slice. The slice scores are averaged for each abnormality to produce the overall score used in the classification loss (lower right). A HiResCAM explanation is calculated for each abnormality as the element-wise product of the overall scan representation $Z$ with the gradient of the abnormality score with respect to $Z$, summed over the feature dimension. In AxialNet, this gradient is directly proportional to the fully connected layer weights for the relevant abnormality. The mask loss we propose is computed using the HiResCAM explanations and PARTITION allowed regions $G^{true}$ to encourage the model to only increase the abnormality score using organs in which that abnormality is found. Here, the allowed organs are the left lung for mass, the right lung for nodule, and both lungs for pleural effusion. Best viewed in color.

Consider a dataset $\{X_i, y_i\}_{i=1}^{N}$ where $y_i \in \{0, 1\}^M$ is a binary vector corresponding to the presence/absence of $M$ abnormalities and $X_i \in [0, 1]^{405 \times 420 \times 420}$ is a CT volume. We wish to predict, in an explainable manner, the labels $\hat{y}_i$ given a CT volume $X_i$, meaning that for any abnormality $m \in \{1, 2, \ldots, M\}$ a physician can query the model to obtain a visualization that highlights the sub-regions of the volume that the model used to predict that abnormality. Figure 1 provides an overview of our solution to this task. Radiologists often view CT scans as a stack of axial slices [49], which form horizontal planes through an upright patient. Motivated by this practice, we propose AxialNet, a multiple instance learning architecture that treats each CT scan as a "bag of slices" and produces whole-volume scores by averaging per-slice scores, thus enabling direct determination of which axial slices contributed the most to prediction of each abnormality. In the first part of AxialNet, a 2D CNN is applied to each CT slice. The CNN's parameters are shared across slices (Figure 1), an approach that has been previously successful in CT analysis [23, 50, 51] and has the benefit of reducing the number of model parameters while incorporating the reasonable assumption that the same features may be seen at multiple levels in a CT volume.
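To make the multiple instance learning structure concrete, the following is a minimal PyTorch-style sketch of an AxialNet-like model. It is an illustrative approximation rather than the published implementation: the form of the custom convolutional layers, the channel counts, the use of 3-channel slice inputs to suit the ImageNet-pretrained ResNet, and the adaptive pooling used to fix the spatial size are all assumptions made for the sketch.

```python
import torch.nn as nn
import torchvision

class AxialNetSketch(nn.Module):
    """Sketch of the AxialNet idea: a shared 2D CNN encodes each axial slice,
    a shared fully connected layer produces M abnormality scores per slice,
    and the per-slice scores are averaged into whole-volume scores."""

    def __init__(self, n_abnormalities=80, n_features=16, d1=6, d2=6):
        super().__init__()
        resnet = torchvision.models.resnet18(pretrained=True)
        # Keep only the convolutional trunk so spatial feature maps are preserved.
        self.slice_cnn = nn.Sequential(*list(resnet.children())[:-2])
        # Stand-in for the "custom convolutional layers" described in the paper.
        self.custom_conv = nn.Conv2d(512, n_features, kernel_size=3, padding=1)
        # Sketch convenience: fix the per-slice spatial size to (d1, d2).
        self.pool = nn.AdaptiveAvgPool2d((d1, d2))
        self.fc = nn.Linear(n_features * d1 * d2, n_abnormalities)

    def forward(self, x):
        # x: [batch, H, 3, height, width] -- each axial slice is one instance.
        b, h = x.shape[0], x.shape[1]
        x = x.reshape(b * h, *x.shape[2:])                  # fold slices into the batch dim
        z = self.pool(self.custom_conv(self.slice_cnn(x)))  # [b*h, F, d1, d2]
        c = self.fc(z.flatten(start_dim=1))                 # per-slice abnormality scores
        c = c.reshape(b, h, -1)                             # [batch, H, M]
        s = c.mean(dim=1)                                   # average over slices: [batch, M]
        return s, c                                         # whole-volume and per-slice scores
```

Applying a sigmoid to each whole-volume score then gives a predicted probability per abnormality, and only whole-volume labels are needed to train the classification objective described next.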
The low-dimensional representation produced by the 2D CNN for the $h$th slice is termed $z_h \in \mathbb{R}^{F \times D_1 \times D_2}$, with $F$ features, width $D_1$, and depth $D_2$. An abnormality score vector $c_h \in \mathbb{R}^M$ is then produced from each slice representation $z_h$ using a fully connected layer with parameters shared across all CT slices: $c_h = W z_h + b$, with $W \in \mathbb{R}^{M \times F D_1 D_2}$ and $b \in \mathbb{R}^M$. Since AxialNet makes predictions for all $M$ abnormalities and all $H$ slices at once, it produces a matrix of all per-slice abnormality scores $C \in \mathbb{R}^{M \times H}$. This matrix $C$ provides basic explainability because it illustrates the quantitative contribution of each axial slice to each abnormality score, enabling identification of the top slices for each abnormality prediction. Next, for each abnormality $m$ AxialNet averages together the per-slice scores to produce a whole-volume abnormality score $s_m$:

$$s_m = \frac{1}{H} \sum_{h=1}^{H} \left( w_m z_h + b_m \right) \quad (1)$$

Above, one row vector $w_m \in \mathbb{R}^{1 \times F D_1 D_2}$ of the weight matrix $W$ corresponds to one abnormality, meaning that the expression $w_m z_h + b_m$ produces the scalar score $c_{mh}$ for the $m$th abnormality and $h$th slice. The whole-volume predicted probability $\hat{y}_m$ for the $m$th abnormality is calculated from the whole-volume score $s_m$ using the sigmoid function $\hat{y}_m = \sigma(s_m) = \frac{1}{1 + e^{-s_m}}$. For each observation we optimize the expected multilabel cross entropy objective, requiring only whole-volume abnormality labels:

$$L_{class} = -\sum_{m=1}^{M} \left[ y_m \log \hat{y}_m + (1 - y_m) \log(1 - \hat{y}_m) \right] \quad (2)$$

AxialNet alone can provide explainability through identification of top slices. However, AxialNet does not provide granular sub-slice explanations. In order to obtain these finer-grained explanations, we apply High-Resolution Class Activation Mapping (HiResCAM) [31], a recently proposed visual explanation method. Recall that $s_m$ is the model's score for abnormality $m$ before the sigmoid function (equation 1). To obtain a HiResCAM explanation for abnormality $m$, we first compute the gradient of $s_m$ with respect to a collection of convolutional feature maps $A = \{A_f\}_{f=1}^{F}$. For volumetric data, this gradient $\frac{\partial s_m}{\partial A}$ is 4-dimensional, $[F, H, D_1, D_2]$. The HiResCAM explanation is an element-wise multiplication of this gradient with the feature maps themselves:

$$\tilde{A}_m^{\mathrm{HiResCAM}} = \sum_{f=1}^{F} \frac{\partial s_m}{\partial A_f} \odot A_f \quad (3)$$

For any CNN consisting of convolutional layers followed by a single fully connected layer, HiResCAM was previously proven to highlight locations the model used when applied at the last convolutional layer [31]. In this section, we prove that this location faithfulness guarantee also holds for any CNN consisting of convolutional layers, one fully connected layer, and a final multiple instance learning averaging layer, i.e. models with the general structure of AxialNet.

Proof. We apply HiResCAM at the last convolutional layer of AxialNet; this layer produces a low-dimensional representation of the entire CT scan termed $Z$. We thus replace $A = \{A_f\}_{f=1}^{F}$ in equation 3 with the feature maps $Z = \{Z_f\}_{f=1}^{F}$ to obtain:

$$\tilde{A}_m^{\mathrm{HiResCAM}} = \sum_{f=1}^{F} \frac{\partial s_m}{\partial Z_f} \odot Z_f \quad (4)$$

To calculate $\frac{\partial s_m}{\partial Z}$, the gradient of the abnormality score $s_m$ with respect to $Z$, we must use an expression for $s_m$. However, the previous expression for the abnormality score $s_m = \frac{1}{H} \sum_{h=1}^{H} (w_m z_h + b_m)$ (equation 1) expressed the score in terms of the slice representations $z_h$ rather than $Z$ overall. We can rewrite the score $s_m$ in terms of $Z$ overall via two concatenations. First, define $Z$ as the vector resulting from concatenation of all the flattened $z_h$ representations. Next, define $w_m^{cat}$ as the vector resulting from concatenation of the $m$th-abnormality-specific weights $w_m$ with themselves $H$ times:

$$Z = z_1 \oplus z_2 \oplus \cdots \oplus z_H, \qquad w_m^{cat} = \underbrace{w_m \oplus w_m \oplus \cdots \oplus w_m}_{H \text{ times}} \quad (5)$$

where the flattened $Z \in \mathbb{R}^{H F D_1 D_2 \times 1}$, $w_m \in \mathbb{R}^{1 \times F D_1 D_2}$, and $w_m^{cat} \in \mathbb{R}^{1 \times H F D_1 D_2}$. Then an alternative expression for the whole volume abnormality score $s_m$ is:

$$s_m = \frac{1}{H} w_m^{cat} Z + b_m \quad (6)$$

Note that the $\frac{1}{H}$ fraction is only applied to the $w_m^{cat} Z$ term because $\frac{1}{H} \times H b_m = b_m$. The gradient of the abnormality score $s_m$ with respect to $Z$ can then be calculated as:

$$\frac{\partial s_m}{\partial Z} = \frac{1}{H} w_m^{cat} \quad (7)$$

Substituting equation 7 for $\frac{\partial s_m}{\partial Z}$ into equation 4, we obtain

$$\tilde{A}_m^{\mathrm{HiResCAM}} = \frac{1}{H} \, w_m^{cat} \odot Z \quad (8)$$

The element-wise multiplication $w_m^{cat} \odot Z$ in the HiResCAM expression is the intermediate computation in calculating $w_m^{cat} Z$, which in turn is a direct contributor to the abnormality score $s_m = \frac{1}{H} w_m^{cat} Z + b_m$. Therefore, the large positive elements of the HiResCAM explanation $\tilde{A}_m^{\mathrm{HiResCAM}}$ that show up as abnormality-relevant image locations correspond to locations which directly increase the abnormality score. Similarly, negative elements of the HiResCAM explanation are direct contributors to a lower abnormality score. Thus, for any CNN consisting of convolutional layers, one fully connected layer, and a final multiple instance learning averaging layer, HiResCAM explanations at the last convolutional layer can be interpreted as showing exactly which parts of the input CT volume contributed most to each abnormality prediction.

HiResCAM is a member of the CAM family of gradient-based explanation methods. Grad-CAM is another member of this family, and is an explanation method that is familiar to many medical imaging researchers. Unfortunately, Grad-CAM does not have a location faithfulness guarantee, which means that Grad-CAM sometimes produces misleading explanations that highlight irrelevant locations [31]. To demonstrate the effect of these misleading explanations in a medical imaging context, we compare Grad-CAM to HiResCAM for the task of explainable multiple abnormality prediction in CT volumes. In order to apply Grad-CAM to CT images, we must extend Grad-CAM from 2D to 3D. Similar to HiResCAM, Grad-CAM is applied at the output of a convolutional layer. The first step is the same as HiResCAM: calculate $\frac{\partial s_m}{\partial A}$, the gradient of $s_m$ with respect to a collection of feature maps $A = \{A_f\}_{f=1}^{F}$. The next step differs: we calculate a vector of importance weights [24] $\alpha_m \in \mathbb{R}^F$ that will be used to re-weight each corresponding feature map $A_f$. For 3D data, the importance weights are obtained by global average pooling the gradient over the height, width, and depth dimensions:

$$\alpha_m^f = \frac{1}{H D_1 D_2} \sum_{h=1}^{H} \sum_{d_1=1}^{D_1} \sum_{d_2=1}^{D_2} \frac{\partial s_m}{\partial A_{f, h, d_1, d_2}} \quad (9)$$

The importance weights suggest which features are most relevant to this particular abnormality throughout the volume overall. The final Grad-CAM explanation is an importance-weighted combination of the feature maps:

$$\tilde{A}_m^{\mathrm{GradCAM}} = \sum_{f=1}^{F} \alpha_m^f A_f \quad (10)$$

The gradient averaging step that Grad-CAM requires (equation 9) is the reason that Grad-CAM explanations sometimes highlight incorrect locations. For a more detailed discussion of this topic, see [31]. The combination of AxialNet and HiResCAM provides an initial solution to the new task of explainable multiple abnormality classification. The explanations produced are guaranteed to reveal the locations the model used for each abnormality prediction. Initial inspection of the explanations produced by an AxialNet model trained only on a classification loss showed that this model sometimes made predictions using unexpected locations, and thus could be exploiting confounding variables. To encourage the AxialNet model to learn more medically meaningful relationships and thereby produce better explanations, we introduced a mask loss.
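The contrast between the two explanation methods can be summarized in a few lines of code. The sketch below assumes the feature maps of the last convolutional layer and the gradient of one abnormality score with respect to them have already been extracted (for example with autograd); shapes follow the notation above, [F, H, D1, D2].

```python
import torch

def hirescam(feature_maps: torch.Tensor, grads: torch.Tensor) -> torch.Tensor:
    """HiResCAM: element-wise product of the gradient and the feature maps,
    summed over the feature dimension. [F, H, D1, D2] -> [H, D1, D2]."""
    return (grads * feature_maps).sum(dim=0)

def gradcam_3d(feature_maps: torch.Tensor, grads: torch.Tensor) -> torch.Tensor:
    """3D Grad-CAM: global-average-pool the gradient over the spatial dimensions
    to obtain one importance weight per feature map, then form the weighted sum
    of feature maps. (The ReLU Grad-CAM typically applies at the end is omitted.)"""
    alpha = grads.mean(dim=(1, 2, 3))                              # [F]
    return (alpha[:, None, None, None] * feature_maps).sum(dim=0)  # [H, D1, D2]

# Example usage, assuming Z (feature maps) was retained with requires_grad and
# s_m is the scalar score for abnormality m:
#   grads = torch.autograd.grad(s_m, Z, retain_graph=True)[0]
#   explanation = hirescam(Z, grads)
```

The only difference between the two functions is the gradient averaging in `gradcam_3d`, which is exactly the step that can cause Grad-CAM to highlight locations the model did not use.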
Our proposed mask loss drives the model to focus its abnormality-specific attention on 3D allowed regions in which each abnormality appears. Given the model's predicted attention $\tilde{A} \in \mathbb{R}^{M \times H \times D_1 \times D_2}$ and a binary mask $G^{true} \in \{0, 1\}^{M \times H \times D_1 \times D_2}$ which defines an allowed region for the attention for each abnormality, the mask loss is calculated as follows (using $\tilde{a}_i$ to access all $M H D_1 D_2$ elements of $\tilde{A}$, and $g_i$ to access the corresponding elements of $G^{true}$):

$$L_{mask} = -\sum_{i=1}^{M H D_1 D_2} (1 - g_i) \log\left(1 - \sigma(\tilde{a}_i)\right)$$

The proposed mask loss is conceptually half of a segmentation loss [52], applied in a low-dimensional space. To minimize the mask loss the model must not increase the abnormality score using forbidden regions. Further justification for this choice of mask loss is provided in the Appendix. To understand the effect of the mask loss, we train one AxialNet model on the classification loss alone, $L_{class}$ (equation 2), and another AxialNet model on the overall proposed loss that incorporates both the classification loss and the mask loss: $L_{total} = L_{class} + \lambda L_{mask}$, where $\lambda$ is a hyperparameter. We found $\lambda = \frac{1}{3}$ to be an effective value. The primary barrier to computing the mask loss is obtaining $G^{true}$, which specifies the allowed regions for each abnormality. Manually creating $G^{true}$ would require decades of full-time expert manual labor (36,316 CT volumes × 405 axial slices × 80 abnormalities = over one billion annotations; at 1 second per annotation, the full dataset would require 30 years of work). Therefore, we develop PARTITION ("Per Abnormality oRgan masks To guIde aTtentION"), an efficient approach for obtaining $G^{true}$ automatically, without any manual input. First, we expand the previously described SARLE [23] natural language processing method with location vocabulary, in order to automatically identify the anatomical location of each abnormality described in the free text CT reports. Next, we develop an unsupervised multi-organ segmentation pipeline using morphological image processing to define segmentation maps of the right lung, left lung, and mediastinum in each volume. Combining the location × abnormality labels with the organ segmentations enables determination of an "allowed region" for each abnormality as illustrated in Figure 2. Further details of PARTITION are provided in the Appendix.

In experiments on RAD-ChestCT, we assess the suitability of AxialNet and HiResCAM for the new task of explainable multiple abnormality prediction in chest CT volumes. We then consider whether the mask loss can encourage the AxialNet model to learn more medically meaningful relationships and thereby produce more appropriate explanations. RAD-ChestCT [23] is a data set of 36,316 CT volumes with 83 whole-volume abnormality labels. We focus on $M = 80$ labels relevant to the lungs and mediastinum, detailed in the Appendix. To the best of our knowledge, RAD-ChestCT is the only large-scale volumetric medical imaging dataset with multiple diverse abnormality labels. To quantify the effect of the mask loss on organ localization of abnormalities, we train and evaluate AxialNet with and without the mask loss using the full data set of 36,316 volumes (> 5 terabytes). For architecture comparisons and ablation studies we use a predefined subset [23] of 2,000 training scans and 1,000 validation scans intended for this purpose, as using the full data set would require 1-2 weeks of compute time per comparison. The per-slice CNN in AxialNet consists of a 2D ResNet-18 [53] pretrained on ImageNet [54] and refined during training on CTs, followed by custom convolutional layers.
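As an implementation-level illustration of the combined objective $L_{total} = L_{class} + \lambda L_{mask}$, the following PyTorch-style sketch writes both terms from the descriptions above. It is a rough reconstruction rather than the published code: in particular, the exact form and normalization of the mask loss (here, the ground-truth-zero half of a sigmoid cross entropy, averaged over forbidden elements) are assumptions.

```python
import torch
import torch.nn.functional as F

def classification_loss(scores, labels):
    """Multilabel cross entropy over whole-volume abnormality scores.
    scores, labels: [batch, M] (labels as floats in {0, 1})."""
    return F.binary_cross_entropy_with_logits(scores, labels)

def mask_loss(attention, allowed):
    """Penalize attention falling in forbidden regions (allowed == 0).
    attention: HiResCAM explanations, [batch, M, H, D1, D2];
    allowed: binary G_true of the same shape."""
    forbidden = 1.0 - allowed
    penalty = -forbidden * torch.log(1.0 - torch.sigmoid(attention) + 1e-8)
    return penalty.sum() / forbidden.sum().clamp(min=1.0)

def total_loss(scores, labels, attention, allowed, lam=1.0 / 3.0):
    """Combined objective; lam = 1/3 is the value reported to work well."""
    return classification_loss(scores, labels) + lam * mask_loss(attention, allowed)
```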
Models were trained using stochastic gradient descent with momentum 0.99 and learning rate $10^{-3}$. Whole-dataset models were trained on an NVIDIA Tesla V100 GPU with 32 GiB of memory. All models are implemented in PyTorch. Code will be made publicly available upon publication. The accuracy of any explanation method for a particular class of models must be proven mathematically; it is impossible to measure using experiments [31]. However, it is possible to experimentally evaluate how well a particular model yields desired behavior, such as localizing abnormalities to the correct organs. To gain insight into the behavior of particular trained AxialNet models - specifically, to measure the models' organ localization of abnormalities - we propose OrganIoU, a metric that equals 1 when the model has assigned all abnormality attention within the allowed regions for that abnormality, and equals 0 when the model has only assigned attention to forbidden regions. The OrganIoU is calculated using the model's predicted attention $\tilde{A} \in \mathbb{R}^{M \times H \times D_1 \times D_2}$ and the attention ground truth $G^{true} \in \{0, 1\}^{M \times H \times D_1 \times D_2}$. The predicted attention is binarized with different thresholds and the optimal threshold chosen for each abnormality on the validation set. Define allowed as the sum of all predicted attention values where $G^{true} = 1$ and forbidden as the sum of all predicted attention values where $G^{true} = 0$. Then OrganIoU = allowed / (allowed + forbidden).

Table 1: RAD-ChestCT validation set classification performance and localization performance using the predefined 2,000 train/1,000 val subset [23] for computational feasibility. Classification performance is reported as median AUROC (area under the receiver operating characteristic) while localization performance is reported as mean OrganIoU. The proposed AxialNet architecture outperforms all previously published multilabel CT scan classifiers (CTNet, 3DConv based on 3D convolutions, and BodyConv), as well as a new BodyCAM architecture detailed in the Appendix, and ablated versions of AxialNet. OrganIoU was calculated at the last convolutional layer of all models. No OrganIoU could be calculated for CTNet as the spatial relationship between the output of the last convolutional layer and the input has been disrupted due to convolution over features.

AxialNet is the first multilabel CT scan abnormality classification model that has built-in explainability. To understand how the built-in explainability affects performance, we compare AxialNet to previously published and alternative architectures, including fully "black box" models such as CTNet. We find that AxialNet outperforms all these models including CTNet on both classification and localization (Table 1). We additionally performed an ablation study to gain further insight into different components of the AxialNet architecture. The proposed AxialNet architecture includes a ResNet pretrained on ImageNet, custom convolutional layers, and average pooling; it outperforms variants that instead use a randomly-initialized ResNet (RandInitResNet), no custom convolutional layers (NoCustomConv), or max pooling (MaxPool) (Table 1). In addition to outperforming competing methods, the AxialNet model provides explainability via the matrix of all per-slice abnormality scores $C \in \mathbb{R}^{M \times H}$ (abnormalities × slices), which is produced for each input CT volume as an intermediate step in the model's computations. The C matrix quantifies the contribution of each axial slice group to each final abnormality score.
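For reference, the OrganIoU metric defined above can be sketched as follows. Whether the sums are taken over raw or binarized attention values, and the per-abnormality threshold search, are simplified here.

```python
import numpy as np

def organ_iou(attention, allowed_mask, threshold):
    """OrganIoU for one abnormality and one scan.
    attention: predicted attention, shape [H, D1, D2].
    allowed_mask: binary allowed region G_true, same shape.
    Binarize the attention at `threshold`, then measure how much of it falls
    inside vs. outside the allowed region."""
    binarized = (attention >= threshold).astype(float)
    allowed = float((binarized * allowed_mask).sum())
    forbidden = float((binarized * (1 - allowed_mask)).sum())
    if allowed + forbidden == 0:
        return 0.0  # no attention survived the threshold
    return allowed / (allowed + forbidden)

# Toy per-abnormality threshold selection on a validation scan:
# best_t = max(np.linspace(0, 1, 21), key=lambda t: organ_iou(att, mask, t))
```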
Figure 3 provides a visualization of C, showing per-slice abnormality scores aggregated across all test set volumes, for particular abnormalities of interest. The visualization demonstrates that C displays patterns consistent with medical knowledge. First, Figure 3 shows a general pattern in which the orange line (for scans containing the abnormality) is higher than the blue line (for scans lacking the abnormality), indicating higher scores when the abnormality is present, as desired. On the top row, example heart abnormalities tend to have peak scores in central slices where the heart is found. Furthermore, the scores for "heart calcification" and "great vessel calcification" have a similar distribution across slices, which is reasonable since these abnormalities are related - though great vessel calcification scores are comparatively higher in slices 80+, which makes sense as the aorta (a great vessel) descends into the abdomen. For "pleural effusion," the model tends to yield high scores towards the lower section of the chest cavity (approximately slices 80-120), which is reasonable because pleural effusions frequently collect in the pleural spaces next to the lung bases. The high scores for pleural effusion in the upper lungs may be an artefact of symmetry - the model may have learned to find the lung bases through a relative decrease in the proportion of the slice occupied by lung tissue, and is detecting this relative decrease again at the apices. The "emphysema" scores peak towards the upper lobes of the lungs (approximately slices 0-50), which is consistent with the fact that the most common form of emphysema (centrilobular emphysema) is typically most visible in the upper lungs [55]. Finally, "interstitial lung disease," "honeycombing," and "reticulation," when present, have high predicted scores throughout the entire lung field, which is reasonable as all of these abnormalities tend to be diffuse [56].

AxialNet's built-in slice-level explanations are useful for gaining insight into the model's behavior. HiResCAM augments these explanations further, by providing insight into sub-slice locations that contribute to a particular abnormality prediction. Figure 4 includes visualizations of HiResCAM explanations for predictions of several different abnormalities. These heat maps narrow down the explanation to a particular part of each slice. The Grad-CAM explanation method has been used in a variety of medical imaging applications. However, Grad-CAM sometimes highlights locations the model did not use, yielding misleading explanations. We generated HiResCAM and Grad-CAM explanations, holding the model, abnormality, and input CT volume constant, and found that these explanations were often different. Because HiResCAM is mathematically guaranteed to highlight only the locations the model actually used to make a prediction, this means that Grad-CAM is wrong whenever it disagrees with HiResCAM. Figure 4 compares HiResCAM and Grad-CAM explanations across a variety of abnormalities. Sometimes the HiResCAM and Grad-CAM explanations appear similar, e.g. for "great vessel atherosclerosis," "interstitial lung disease," and "honeycombing." However, other times Grad-CAM creates the incorrect impression that the model made predictions for lung abnormalities based on the heart or body wall, when in fact the model did rely on the lungs as illustrated in the HiResCAM explanations ("groundglass," "opacity," and "aspiration").
We hypothesize that Grad-CAM focuses on the wrong organ in these examples because these lung abnormalities are "light grey" and may activate features that detect this "light grey" quality; however, the heart is more "light grey" than any adjacent lung tissue, so feature-focused Grad-CAM fixates incorrectly on the heart. Furthermore, by focusing on locations the model did not actually use, Grad-CAM explanations generally produce worse organ localization of abnormalities, as can be seen by Grad-CAM's lower OrganIoU in Tables 1 and 2 for the 3DConv, BodyConv, MaxPool, NoCustomConv, and AxialNet models. (Grad-CAM and HiResCAM have identical OrganIoU for BodyCAM because this is a CAM architecture, and Grad-CAM and HiResCAM are alternative generalizations of CAM, meaning they yield identical explanations for CAM architectures only [31].) Inspection of the explanations for the AxialNet model trained with only $L_{class}$ revealed that this model sometimes makes predictions using unexpected locations, and therefore may be exploiting some spurious correlations (discussed further in Section 4.6 and Figure 5). The goal of the mask loss is to reduce use of spurious correlations and encourage the model to predict abnormalities from more medically reasonable locations; the latter characteristic can be measured with OrganIoU. OrganIoU is lower when the model exploits spurious correlations in irrelevant locations, and is higher when the model predicts each abnormality based on the organs in which that abnormality appears. This is true.

Figure 3: AxialNet provides explainability through per-slice abnormality scores, and the explanations in turn suggest that the model may have learned some medical concepts. This figure provides a summary of the per-slice abnormality scores $C \in \mathbb{R}^{M \times H}$ across the 7,209 RAD-ChestCT test set CT volumes, for 9 example abnormalities. The matrices C were calculated for each scan using the final AxialNet $L_{class} + \lambda L_{mask}$ whole dataset model. The orange line depicts the mean score and 95% confidence interval for scans in which the listed abnormality is present, while the blue line depicts the mean score and 95% confidence interval for the scans in which the abnormality is absent. Axial slice group 0 is closer to the head, while slice group 135 is closer to the abdomen. The medical concepts demonstrated in this figure are described in detail in the text. Best viewed in color.

In this section, we provide a case study that illustrates how HiResCAM explanations can reveal when a model is exploiting spurious correlations, and how the mask loss may reduce this undesirable behavior. The case study is depicted in Figure 5. The left side of Figure 5 depicts the AxialNet $L_{class}$ model. While this AxialNet $L_{class}$ model appropriately predicts high likelihood of atherosclerosis (probability 170% over the mean) in an overweight patient, the explanation highlights the fat in the body wall and not the great vessel, suggesting that the AxialNet $L_{class}$ model may be inappropriately exploiting the correlation between body fat and atherosclerosis to make its atherosclerosis prediction. High fat in the body wall (i.e., overweight or obesity) is a known risk factor for atherosclerosis [57]. However, it is inappropriate to directly use a patient's weight to predict atherosclerosis, because it is possible to be overweight without great vessel atherosclerosis and it is also possible to have great vessel atherosclerosis without being overweight.
Great vessel atherosclerosis should be diagnosed by inspecting the great vessel itself for signs of atherosclerosis, which include calcifications like those circled in the inset on the right. Indeed, the AxialNet $L_{class}$ model's strategy of exploiting the patient's obesity can backfire. The bottom row shows a thinner patient who still has great vessel atherosclerosis in spite of their lower body fat. Here, the AxialNet $L_{class}$ model again places too much focus on the body wall and produces a score only 108% above average in spite of the obvious atherosclerosis visible in the vessel. The right side of Figure 5 depicts the AxialNet $L_{class} + \lambda L_{mask}$ model, which has been trained with the mask loss to encourage predictions from within relevant anatomical regions. This model's explanations for great vessel atherosclerosis appear to focus more on the great vessel itself. The benefit is seen for the thinner patient, where the AxialNet $L_{class} + \lambda L_{mask}$ model is able to produce a score 157% above average in spite of the patient's thinner body habitus. It appears that the mask loss may have discouraged exploiting the body wall for atherosclerosis prediction, and was thus able to yield a model that relied more on actual signs of atherosclerosis.

One limitation of AxialNet is its tendency to focus on small discriminative regions, a frequent weakness for classification CNNs [58, 59, 60, 61]. Consequently, some of AxialNet's explanations highlight only one example of an abnormality, e.g. bilateral groundglass highlighted on only one side, or only one nodule highlighted out of many. Exploring methods to encourage AxialNet to leverage all relevant examples without focus spreading to irrelevant organs is thus a promising direction for future work. A general limitation of visual explanation methods is that these methods show what parts of the input are being used but do not provide insight into how these parts of the input are used. HiResCAM and other visual explanations suggest a model's behavior by highlighting particular locations, but they do not indicate what functions were computed on these locations. In the worst case scenario, a model could look at the right region for the wrong reasons - for example, detecting pneumonia via post-processing and compression artefacts located within the lung fields - and then the human observer would have no way to identify the model's mistake. Overall, building truly interpretable computer vision models in which both the locations used and the functions computed on them are transparent remains a major unsolved problem.

In this study, we introduced the new task of explainable multiple abnormality classification in chest CT volumes. We presented a multiple instance learning CNN architecture, AxialNet, that specifies top axial slices contributing most to each abnormality prediction. We proved that the HiResCAM gradient-based explanation method is guaranteed to highlight the regions the AxialNet model used. We then proposed a mask loss that enables AxialNet to achieve better organ localization of abnormalities and thus more medically plausible explanations. Calculation of the mask loss is enabled by PARTITION, the first approach for automatic identification of 3D abnormality-specific allowed regions. Overall, our innovations result in a 33% improvement in organ localization of abnormalities, and represent the first effort towards explainable multiple abnormality prediction in chest CT volumes.

Appendix

The classification loss $L_{class}$ determines the extent of attention (e.g.
a whole lung or only a lobe) in allowed regions, while the mask loss $L_{mask}$ discourages attention in forbidden regions. Classification loss: The amount of attention placed within allowed regions, determined by $L_{class}$, can vary from abnormality to abnormality. The model could attend to the entire allowed region (useful for diffuse abnormalities) or a small sub-part of the allowed region (useful for focal abnormalities). The mask loss is

$$L_{mask} = -\sum_{i=1}^{M H D_1 D_2} (1 - g_i) \log\left(1 - \sigma(\tilde{a}_i)\right)$$

where $\tilde{a}_i$ accesses all elements of $\tilde{A} \in \mathbb{R}^{M \times H \times D_1 \times D_2}$, the model's predicted attention, and $g_i$ accesses the corresponding elements of $G^{true} \in \{0, 1\}^{M \times H \times D_1 \times D_2}$, which defines an allowed region for the attention for each abnormality. The mask loss is half of a segmentation cross entropy objective [52], specifically the part of the objective where the ground truth is equal to 0. To minimize the mask loss the model must not increase the abnormality score using forbidden regions (where $G^{true} = 0$), which are regions outside the organ(s) in which that abnormality is found in a particular scan. The portion of the full segmentation objective relating to $G^{true} = 1$ is excluded because we do not want to force the model to attend to the entire relevant organ. This is because most abnormalities do not occupy the entire organ, especially focal abnormalities such as nodules and masses. In practice we find that training with $L_{mask}$ is stable. We explored an alternative mask loss formulation based on an $L_2$ norm but found that its performance was inferior. The mask loss is computed in a low-dimensional space, at the level of the last convolutional layer where the HiResCAM explanation is produced. This is orders of magnitude more computationally efficient than computing a mask loss in the input space.

Loss conceptual example: right lung nodule: For a nodule in the right lung, the mask loss will discourage the model from predicting this nodule using forbidden regions, i.e. anatomy anywhere outside of the right lung. For example, the mask loss will discourage the model from exploiting liver nodules to predict the lung nodule (in a patient with metastatic cancer, there may be cancerous nodules and masses in multiple organs, but to predict specifically "right lung nodule" only the right lung should be used). The goal of the classification loss is to enable the model to make a correct prediction, so the classification loss should encourage the model to predict the nodule from the small part of the lungs in which it is actually found. Ideally, the final explanation for a right lung nodule should cover the relevant part of the right lung, and nowhere outside of the right lung.

Features in one organ providing clues for another organ: It is true that sometimes, features in one organ may provide clues for another organ, such as the case of the metastatic cancer patient described above. We split abnormalities that occur in multiple organs into different labels to encourage the model to learn differences between organs. Developing models to leverage relationships across organs while precisely distinguishing between them is an interesting direction for future work.

The PARTITION approach to create $G^{true}$, the allowed regions for each abnormality, combines location × abnormality labels extracted from radiology reports with unsupervised organ segmentation, as shown in Figure 4 of the main paper. This section provides further details on PARTITION's subcomponents. The first step in creating the attention ground truth $G^{true}$ is obtaining the location × abnormality labels.
The location × abnormality labels are produced via a simple extension of SARLE, the publicly available rule-based automated label extraction method used to create the RAD-ChestCT abnormality labels [23]. SARLE was introduced and evaluated in prior work [23]. The first step of SARLE is phrase classification, in which each sentence of a free text report is analyzed using a rule-based system to distinguish between "normal phrases" (those that describe normality or lack of abnormalities) and "abnormal phrases" (those that describe presence of abnormalities). The second step of SARLE is a "term search" that searches for abnormality-related vocabulary within the "abnormal phrases" to identify which specific abnormalities are present. The handling of negation in the first step of SARLE is highly effective, and when combined with the term search yields an average F-score of 0.976 [23]. SARLE is designed to be easily customizable through the addition of extra vocabulary to the term search. We leverage this customizability by adding location terms to the term search, to identify whether each abnormality is located in the lungs, heart, great vessels, mediastinum, or elsewhere. We also make use of the medical definitions of the abnormalities themselves. Many abnormalities reveal their locations by definition, e.g. "pneumonia" (lung infection) which can only occur in the lungs or "cardiomegaly" (enlarged heart) which can only occur in the heart. Specific examples of some of our extensions to SARLE's term search are shown in Table 3.

Table 3: Examples of identifying the location of each abnormality from the radiology reports. In Step 1 (Vanilla SARLE) only abnormal phrases are kept. In Step 2 (Vanilla SARLE) the abnormality is identified with a term search. In Step 3 (our addition), the location is identified with a term search.

| Report sentence | Step 1 | Step 2 | Step 3 | Comment |
| the heart is enlarged without pericardial effusion | the heart is enlarged | cardiomegaly | heart | Step 2: "heart is enlarged" is one of the synonyms for the abnormality "cardiomegaly." Step 3: the word "heart" indicates heart as the location |
| there is a nodule in the right upper lobe | there is a nodule in the right upper lobe | nodule | right lung | Step 3: "right upper lobe" indicates "right lung" as the location |
| left pneumonia | left pneumonia | pneumonia | left lung | Step 3: "pneumonia" by definition only affects the lung, so "left pneumonia" implies the location "left lung" |
| calcifications in the aorta | calcifications in the aorta | calcification | great vessel | Step 3: the aorta is a great vessel |
| The consolidation has resolved | normal finding; no abnormality labels produced | | | |

Table 4 includes a full listing of all the labels used in this study, including 51 right and/or left lung labels and 29 mediastinum labels. Abnormalities were subdivided by location, meaning that the right lung and left lung were considered distinctly, thus increasing the total number of labels on which models were trained to 131.

Table 4: The 80 abnormality labels used in this study. When subdivided by location, so that the right and left lungs are represented separately, the total number of unique labels increases to 131.
Right and/or Left Lung (51): air trapping, airspace disease, aspiration, atelectasis, bronchial wall thickening, bronchiectasis, bronchiolectasis, consolidation, emphysema, fibrosis (lung), groundglass, interstitial lung disease, lung infection, lung inflammation, lung scarring, lung scattered nodules or nodes, mucous plugging, plaque (lung), pleural effusion, pleural thickening, pneumonia, pneumothorax, pulmonary edema, reticulation, septal thickening, tree in bud, lung resection, lung transplant, postsurgical (lung), bandlike or linear, honeycombing, lung calcification, lung cancer, lung cavitation, lung cyst, lung density, lung granuloma, lung lesion, lung lucency, lung lymphadenopathy, lung mass, lung nodule, lung nodulegr1cm, lung opacity, lung scattered calcification, lung soft tissue, chest tube, lung catheter or port, lung clip, lung staple, lung suture
Mediastinum (includes Heart and Great Vessels) (29): cardiomegaly, heart failure, pericardial effusion, pericardial thickening, CABG, heart transplant, postsurgical (great vessel), postsurgical (heart), sternotomy, coronary artery disease, great vessel aneurysm, great vessel atherosclerosis, great vessel calcification, great vessel dilation or ectasia, great vessel scattered calc, heart atherosclerosis, heart calcification, heart scattered calc, mediastinum calcification, mediastinum cancer, mediastinum lymphadenopathy, mediastinum mass, mediastinum nodule, mediastinum opacity, great vessel catheter or port, heart catheter or port, heart stent, heart valve replacement, pacemaker or defib

Radiologists typically distinguish right from left when describing abnormalities. Furthermore, subdividing abnormalities by location often yields a more medically relevant task. For example, calcification in the lungs is usually caused by calcified nodules or calcified granulomas [62], whereas calcification in the aorta is typically due to atherosclerosis [63]. A catheter in the lungs is often a pigtail catheter, e.g. to treat a pneumothorax [64], while a catheter in the superior vena cava (a great vessel) is a central venous catheter [65].

The second step in creating the attention ground truth $G^{true}$ is obtaining the segmentation maps for the allowed regions. Method summary: Our unsupervised multi-organ segmentation approach includes three stages. In the first stage both lungs are segmented together using morphological image processing, using the four steps shown in Figure 6. Next, a bounding box enclosing both lungs is computed and the right lung and left lung are separated via bisection along the midline sagittal plane. Finally, the mediastinum segmentation is defined as the center-left non-lung region inside of the lung bounding box, exploiting the anatomical relationship of the mediastinum and the lungs. The final output is a left lung segmentation, a right lung segmentation, and a mediastinum segmentation that includes the heart and great vessels.

Evaluation and quality control: We do not calculate intersection over union (IoU) of our organ segmentations on RAD-ChestCT because there are no ground truth organ segmentations available for RAD-ChestCT. We also do not calculate IoU of our RAD-ChestCT segmentation approach on the CT-ORG or AAPM data sets because of the significant differences in data distribution between RAD-ChestCT and CT-ORG or AAPM discussed in the Related Work section. Like all morphological image processing approaches for lung segmentation, our unsupervised segmentation method is sensitive to threshold values that had to be optimized specifically for RAD-ChestCT. For example, the minimum allowed volume size specified in step 4 (removal of small objects) was tuned to 1,000,000, a value that enables preservation of the right and left lung but elimination of the stomach in RAD-ChestCT.
However, this threshold value is inappropriate for any data set that has a different resolution such that a single voxel corresponds to a different physical volume in the real world. Each threshold was chosen in an iterative process that involved manual inspection of hundreds of examples produced with different thresholds. Selecting new customized thresholds for CT-ORG or AAPM would thus not yield an evaluation that is reflective of our algorithm's performance on RAD-ChestCT. A limitation of our work is that we did not manually create segmentation maps to calculate IoU for our unsupervised segmentation pipeline, due to the time-consuming nature of creating segmentation ground truth in large 3D volumes. In future work, we would like to manually create this ground truth so that we can quantitatively evaluate our unsupervised segmentation with IoU. For now, we are encouraged by the performance benefit of using the unsupervised segmentation in the mask loss, as well as by the qualitative results for our segmentation approach discussed next.

Qualitative results: We undertook two steps to evaluate our unsupervised segmentation approach. First, we performed manual inspection of numerous randomly-selected segmentations, including 3D renderings and/or 2D projections along the axial, sagittal, and coronal planes. Several randomly-selected right lung and left lung 3D renderings are shown in Figure 7. Next, we use summary statistics gathered from the training set to assess lung inclusion quality. Analysis of histograms of lung bounding box dimensions paired with visual inspection of outlier scans enabled definition of outlier thresholds that automatically identify when one or both lungs are missing due to severe disease. We quantified the fraction of segmentations in which one or both lungs is missing and found it to be only 4%, meaning that 96% of the segmentations pass this basic quality control metric. In creation of the attention ground truth, the 4% of segmentations that fail quality control are discarded and replaced with a heuristic mask in which the allowed region for the right lung is the right half of the volume, the allowed region for the mediastinum is the center half, and the allowed region for the left lung is the left half.

Figure 7: A random selection of full resolution right and left lung segmentations from our unsupervised segmentation method. The scan at row 3, column 2 where both lungs are missing is an example of a scan for which the unsupervised segmentation failed; for this scan a heuristic mask was used in computing the mask loss. Only 4% of scans require a heuristic mask. Best viewed in color.

In order to calculate the mask loss, the dimensions of the attention ground truth must match the dimensions of the low-dimensional CT representation. Thus, the raw segmentation masks were downsampled to $\{0, 1\}^{H \times D_1 \times D_2}$, which has the additional side benefit of "smoothing out" any small errors in the segmentation masks. We compared nearest neighbors, trilinear, and area downsampling algorithms, with and without morphological dilation of the downsampled mask. Morphological dilation produces more expansive allowed regions and is thus a more permissive approach. The nearest neighbors algorithm with no dilation yielded the best performance as shown in Table 5, so this setting was selected to create the attention ground truth used in the final mask loss implementation.
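For illustration, a heavily simplified scikit-image sketch of the kind of morphological lung segmentation described above is shown below. The HU threshold, the border-removal heuristic, the structuring element, and the minimum object size are placeholders standing in for the tuned RAD-ChestCT pipeline; they are not the exact four steps of Figure 6.

```python
import numpy as np
from skimage import measure, morphology

def rough_lung_mask(ct_hu: np.ndarray, air_threshold: float = -500.0,
                    min_volume_voxels: int = 1_000_000) -> np.ndarray:
    """Very rough lung segmentation sketch for a CT volume in Hounsfield units:
    threshold air-like voxels, drop air connected to the volume border (outside
    the body), remove small objects such as bowel gas, and close small holes."""
    air = ct_hu < air_threshold                      # lungs + air outside the body
    labeled = measure.label(air)
    # Labels of components touching any face of the volume (outside-body air).
    border_labels = np.unique(np.concatenate([
        labeled[0].ravel(), labeled[-1].ravel(),
        labeled[:, 0].ravel(), labeled[:, -1].ravel(),
        labeled[:, :, 0].ravel(), labeled[:, :, -1].ravel()]))
    lungs = np.isin(labeled, border_labels, invert=True) & air
    lungs = morphology.remove_small_objects(lungs, min_size=min_volume_voxels)
    lungs = morphology.binary_closing(lungs, morphology.ball(3))
    return lungs
```

Splitting the result into right and left lungs by bisecting the lung bounding box along the midline sagittal plane, and deriving the mediastinum from the center-left non-lung region of that box, would follow this step as described above.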
For all models, the OrganIoU was calculated directly between the predicted attention $\tilde{A}$ and the attention ground truth $G^{true}$. We explored calculation of OrganIoU in the input space using upsampling of the predicted attention but found this to be prohibitively computationally expensive (estimated > 3 weeks of runtime per model). To improve model training time, the attention ground truth was computed in default axial orientation during the first epoch and then loaded from disk for all subsequent epochs. The data augmentation transformations randomly applied to each CT scan in each epoch were applied dynamically to each sub-volume of that scan's attention ground truth before the mask loss calculation.

6.3 The BodyCAM architecture and an advantage of AxialNet

Table 1 includes performance of a BodyCAM model. We developed the BodyCAM architecture to begin exploring how much value HiResCAM adds over CAM by loosening CAM's architecture requirements. CAM requires a CNN to end in global average pooling followed by one fully connected layer. BodyCAM is essentially a variant of AxialNet that has been modified to meet CAM's architecture requirements. The first part of the BodyCAM model is the same as AxialNet, consisting of a 2D ResNet and custom convolutional layers applied to each slice. However, after the custom convolutional layers, BodyCAM uses global average pooling over the $[H, D_1, D_2]$ spatial dimensions, and then a fully connected layer produces the final predictions, whereas AxialNet follows a multiple instance learning setup with a per-slice fully connected layer that produces per-slice predictions which are then averaged to yield the whole-volume predictions. We hypothesized that AxialNet would outperform BodyCAM, because AxialNet does not include global average pooling and thus can preserve the information about spatial locations of certain features. Our experimental results support this hypothesis, with AxialNet achieving better AUROC than BodyCAM as shown in Table 1 of the main paper. Because BodyCAM follows the CAM architecture, Grad-CAM and HiResCAM produce identical explanations on BodyCAM models, and thus yield the same OrganIoU in Table 1. We are encouraged by the result that AxialNet outperforms BodyCAM. While we recognize that the global average pooling step provides the convenient property of enabling input images of variable spatial dimensions, we also believe that if eliminating global average pooling can provide better performance in some tasks then it is a direction worth exploring further. In future work, it would be interesting to systematically investigate the effect of removing the global average pooling step across a variety of architectures and imaging applications.

This section includes examples of various studies that apply deep learning to medical imaging tasks to illustrate how HiResCAM could positively impact explainability. Table 6 includes examples of previously published studies that apply machine learning to a medical imaging task. The table considers what architecture was used as well as any visual explanation methods that were applied, and analyzes whether there was a risk of a visual explanation method providing the false impression that the model had highlighted the wrong location. The top half of the table includes studies that already incorporated visual explanation methods.
The bottom half of the table includes studies that did not report results of a visual explanation method, but which use CNN models for a medical imaging classification task and thus could have had a visual explanation method applied. It appears that presently, there is risk in medical imaging of relying on faulty visual explanations that highlight locations the model did not use, due to the popularity of Grad-CAM as a visual explanation method and the popularity of custom CNN architectures that end in more than one fully connected layer. In these studies, there are no experiments justifying the need for more than one fully connected layer. Most likely, the decision to include more than one fully connected layer was due to two factors: (a) the authors are likely taking inspiration from architectures like VGG or AlexNet which include multiple fully connected layers; and (b) to the best of our knowledge, no work until recently [31] conveyed the potential explainability benefits of using only one fully connected layer at the end of a CNN, so there may not have been any motivation to limit the number of fully connected layers to one. In all studies with high risk of wrong explanations, we recommend reducing the number of fully connected layers to one and using HiResCAM as the explanation approach to ensure that the visual explanations reflect the regions the model is using to make predictions. 3 Table 6 : Example studies in medical imaging, particularly CT scan analysis, and assessment of whether there was a risk of a visual explanation method highlighting locations the model did not actually use for prediction. When there is a risk (final column orange), that risk could be removed by using HiResCAM for explanations and modifying the architecture to end in only one fully connected layer ("FC layer"). An intelligent future for medical imaging: a market outlook on artificial intelligence for medical imaging Building trust in deep learning system towards automated disease detection why should i trust you?" 
References

An intelligent future for medical imaging: a market outlook on artificial intelligence for medical imaging.
Building trust in deep learning system towards automated disease detection.
"Why should I trust you?" Explaining the predictions of any classifier.
Gradient-based attribution methods.
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study.
Learning deep features for discriminative localization.
Analyzing classifiers: Fisher vectors and deep neural networks.
Unmasking Clever Hans predictors and assessing what machines really learn.
Low-dose CT scan for lung cancer screening: clinical and coding considerations.
CT imaging of blunt chest trauma.
CT densitometry in emphysema: a systematic review of its clinical utility.
Chest CT signs in pulmonary disease: a pictorial review.
Classification of interstitial lung abnormality patterns with an ensemble of deep convolutional neural networks.
Weakly-supervised deep learning of interstitial lung disease types on CT images.
Multi-label deep regression and unordered pooling for holistic interstitial lung disease pattern detection.
Holistic classification of CT attenuation patterns for interstitial lung diseases via deep convolutional neural networks.
Deep learning for classifying fibrotic lung disease on high-resolution computed tomography: a case-cohort study.
Multisource transfer learning with convolutional neural networks for lung pattern analysis.
Lung pattern classification for interstitial lung diseases using a deep convolutional neural network.
End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography.
Deep learning-enabled system for rapid pneumothorax screening on chest CT.
Genetic Epidemiology of COPD (COPDGene) Investigators. Deep learning enables automatic classification of emphysema pattern at CT.
Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes.
Grad-CAM: Visual explanations from deep networks via gradient-based localization.
An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets.
A deep learning and Grad-CAM based color visualization approach for fast detection of COVID-19 cases using chest x-ray and CT-scan images.
Comparison of deep learning approaches for multi-label chest x-ray classification.
Dynamic routing on deep neural network for thoracic disease classification and sensitive area localization.
Efficient deep network architectures for fast chest x-ray tuberculosis screening and visualization.
Deep learning to assess long-term mortality from chest radiographs.
Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks.
Automated quantification of CT patterns associated with COVID-19 from chest CT.
Automatic segmentation of ground-glass opacities in lung CT images by using Markov random field-based algorithms.
Anam-Net: Anamorphic depth embedding-based lightweight CNN for segmentation of anomalies in COVID-19 chest CT images.
Helmut Prosch, and Georg Langs. Deep learning detection and quantification of pneumothorax in heterogeneous routine chest computed tomography.
Lung nodule segmentation in chest computed tomography using a novel background estimation method. Quantitative Imaging in Medicine and Surgery.
Byung-il Lee, and Yeong-Gil Shin. Volumetric lung nodule segmentation using adaptive ROI with multi-view residual learning.
Segmentation of pulmonary nodules in CT images based on 3D-UNet combined with three-dimensional conditional random field optimization.
Automatic multiorgan segmentation in thorax CT images using U-Net-GAN.
Automatic multi-organ segmentation in dual-energy CT (DECT) with dedicated 3D fully convolutional DECT networks.
Automatic segmentation of multiple organs on 3D CT images by using deep learning approaches.
A method of rapid quantification of patient-specific organ doses for CT using deep-learning-based multi-organ segmentation and GPU-accelerated Monte Carlo dose computing.
CT-ORG: CT volumes with multiple organ segmentations dataset. The Cancer Imaging Archive.
CT organ segmentation using GPU data augmentation.
The Cancer Imaging Archive (TCIA): Maintaining and operating a public information repository.
Unsupervised domain adaptation of ConvNets for medical image segmentation via adversarial learning.
Data from Lung CT Segmentation Challenge. The Cancer Imaging Archive.
Autosegmentation for thoracic radiation treatment planning: A grand challenge at AAPM 2017.
Computed tomography of the chest: I. Basic principles. BJA Education.
Using artificial intelligence to detect COVID-19 and community-acquired pneumonia based on pulmonary CT: evaluation of the diagnostic accuracy.
Doubly weak supervision of deep learning models for head CT.
A survey of loss functions for semantic segmentation.
Deep residual learning for image recognition.
ImageNet: A large-scale hierarchical image database.
Imaging of pulmonary emphysema: a pictorial review.
Accuracy of high-resolution CT in the diagnosis of diffuse lung disease: effect of predominance and distribution of findings.
Adipokines as a novel link between obesity and atherosclerosis.
FickleNet: Weakly and semi-supervised semantic image segmentation using stochastic inference.
Tell me where to look: Guided attention inference network.
Weakly-supervised semantic segmentation via sub-category exploration.
Seed, expand and constrain: Three principles for weakly-supervised image segmentation.
The calcified lung nodule: what does it mean?
Computed tomography of aortic wall calcifications in aortic dissection patients.
Small-bore pigtail catheters for the treatment of primary spontaneous pneumothorax in young adolescents.
Central venous line placement in the superior vena cava and the azygos vein: differentiation on posteroanterior chest radiographs.

Acknowledgments

The authors would like to thank the Duke Protected Analytics Computing Environment (PACE), particularly Mike Newton and Charley Kneifel, PhD, for providing the computing resources and GPUs needed to complete this work. The authors also thank Paidamoyo Chapfuwa, PhD, for thoughtful comments on a previous version of the manuscript, David Dov, PhD, for discussion of multiple instance learning, and Geoffrey D. Rubin, MD, FACR, for helpful remarks on the explanations. This work was supported in part by the National Institutes of Health (NIH) Duke Medical Scientist Training Program Training Grant (GM-007171).

Table 6: Example studies in medical imaging, particularly CT scan analysis, and assessment of whether there was a risk of a visual explanation method highlighting locations the model did not actually use for prediction. When there is a risk (final column), that risk could be removed by using HiResCAM for explanations and modifying the architecture to end in only one fully connected layer ("FC layer"). Cells that could not be recovered from the source are left blank.

Study | Architecture | Is the architecture a "CAM architecture"? | Were results of a visual explanation method reported in the paper? | If a visual explanation method were applied, could it highlight incorrect locations?
An explainable deep-learning algorithm for the detection of acute intracranial haemorrhage from small datasets | | | |
Comparison of deep learning approaches for multi-label chest x-ray classification [27] | ResNet-50 | Yes | Grad-CAM was applied at the final convolutional layer of an architecture that follows the "CAM architecture"; thus, this attention approach is mathematically equivalent to CAM | No
Efficient deep network architectures for fast chest x-ray tuberculosis screening and visualization [29] | Custom CNN ending in global average pooling then one FC layer | Yes | Saliency maps and Grad-CAM were applied at different layers. From the figure appearance, it appears that Guided Grad-CAM was used rather than vanilla Grad-CAM. | Yes, because Grad-CAM was applied at different layers, and also because Guided Grad-CAM appears to have been used.
Deep learning to assess long-term mortality from chest radiographs [30] | Modified … | | |
Weakly-supervised deep learning of interstitial lung disease types on CT images [14] | Custom CNN ending in >1 FC layer | No | | Yes: an architecture ending in >1 FC layer was used.
Holistic classification of CT attenuation patterns for interstitial lung diseases via deep convolutional neural networks [16] | Custom CNN ending in 3 FC layers | | | Yes: an architecture ending in >1 FC layer was used.
Deep learning for classifying fibrotic lung disease on high-resolution computed tomography: a case-cohort study [17] | Inception-ResNet-