key: cord-0579252-gvkvtowm authors: Engstler, Paul; Keicher, Matthias; Schinz, David; Mach, Kristina; Gersing, Alexandra S.; Foreman, Sarah C.; Goller, Sophia S.; Weissinger, Juergen; Rischewski, Jon; Dietrich, Anna-Sophia; Wiestler, Benedikt; Kirschke, Jan S.; Khakzar, Ashkan; Navab, Nassir title: Interpretable Vertebral Fracture Diagnosis date: 2022-03-30 journal: nan DOI: nan sha: 11db8d1ff310c18d1d7446521a496a667e19a5e6 doc_id: 579252 cord_uid: gvkvtowm

Do black-box neural network models learn clinically relevant features for fracture diagnosis? The answer not only establishes reliability and quenches scientific curiosity but also leads to explainable and verbose findings that can assist radiologists in the final diagnosis and increase trust. This work identifies the concepts networks use for vertebral fracture diagnosis in CT images. This is achieved by associating concepts with neurons that are highly correlated with a specific diagnosis in the dataset. The concepts are either associated with neurons by radiologists pre-hoc or are visualized during a specific prediction and left for the user's interpretation. We evaluate which concepts lead to correct diagnoses and which concepts lead to false positives. The proposed frameworks and analysis pave the way for reliable and explainable vertebral fracture diagnosis.

Osteoporosis is regarded as one of the most relevant diseases of the elderly, with 22 million women and 5.5 million men affected in the EU alone [5, 15]. Early detection of incidental osteoporotic fractures in routinely acquired computed tomography (CT) scans is important, as these often remain clinically silent for a long time [13]. Furthermore, osteoporotic fractures are an independent predictor of further fractures, with an approximately 12-fold increased risk, and are associated with an 8-fold increased mortality [6, 25]. The sequelae include major socioeconomic consequences and an individual reduction in quality of life [4, 17, 14, 7].
Despite the clinical significance, around 85% of osteoporotic fractures are not adequately described in the radiological reports of routinely acquired CT scans, possibly as a result of a disproportionate increase in radiologists' workload [34, 2]. Automatic detection of vertebral body fractures with deep learning models can remedy this and increase incidental findings. However, most of these methods are black-box models that do not give insights into the decision-making process. Revealing the inside of these models allows failure cases to be investigated and, once these are addressed, increases robustness and trust in the system.

Thus far, interpretable diagnosis has mostly been investigated via feature attribution (saliency) approaches [20] such as class activation maps [39]. These interpretations reveal where the features important for the prediction are located. Although a valuable tool for running a sanity check on the network inference mechanism, feature attribution does not disclose further information regarding the prediction. Moreover, knowing only the location of important features is of limited use for fracture diagnosis, since it is easy to see where the fracture is located; what is of interest is "what" features are important. To this end, inspired by the network dissection [3] approach and its derivations in chest radiography [20] and mammography [35] applications, we propose two scenarios for analyzing the internal units of the neural network and their associated clinical concepts: dataset-wise and single-inference. In the dataset-wise scenario, we compute the output of the last convolutional layer for all input data and identify neurons highly correlated with the output value associated with fractures. Subsequently, we ask the clinicians to identify the concepts associated with highly correlated activations by inspecting the inputs that activate those neurons the most.
The dataset-wise scenario provides an overall understanding of what concepts the network has learned and whether they are aligned with what clinicians use. In the single-inference scenario, the highly activated convolutional neurons for a single input are identified. Their associated concepts are then presented to the user by showing the top images that activate each neuron. In this scenario, the user can get a conceptual understanding of the decision-making mechanism of the model. We perform the analysis for both scenarios on the open-source VerSe [31] dataset and a larger private dataset procured in our hospital. The concept-based interpretations are a building block toward the broader objective of explainable diagnosis and radiology report generation. The objective of this work is to investigate what features the network uses for fracture diagnosis, whether they overlap with clinical knowledge, and how they can be used for more verbose and explainable fracture diagnosis.

Vertebral Fracture Detection

Many works on automatic vertebral fracture detection have been proposed in recent years. Most of these approaches use Convolutional Neural Networks (CNN) on Computed Tomography (CT) spine images. Notable exceptions are [33], which uses 3D radiomics extracted from CT images in a Random Forest, and methods [27, 8] that detect fractures on radiographs. CNN-based methods can be categorized into those using 2D and those using 3D convolutions. 2D methods usually rely on feature aggregation with Recurrent Neural Networks to model inter-slice dependencies [1, 32]. Husseini et al. [16] reformat the image to use the most informative mid-sagittal slice of each vertebra and, in addition to fracture detection, grade fractures using an ordinal regression loss for representation learning. Pisov et al. [29] also reformat the 3D volume to retrieve a spine-centered 2D image and detect key points for measuring the compression of each vertebra, thereby detecting and grading fractures.
Detecting fractures at the voxel level and then post-processing the result, Nicolaes et al. [28] were the first to use 3D convolutions for the detection of vertebral fractures. More recent works using 3D convolutions include modeling the dependency between the 3D volumes of each vertebra with a sequence-to-sequence model [9] and detecting osteoporotic fractures at the patient level [36]. Related to the task of fracture detection and grading, Li et al. [23] and Feng et al. [11] recently explored the distinction between benign and malignant vertebral fractures.

Interpretability of models has been little explored in the domain of vertebral fracture diagnosis; [37] interprets the models with feature attribution (saliency) approaches to identify which regions in the input contributed to the prediction. In fact, feature attribution is the dominant approach in most medical image analysis applications [20]. However, attribution methods are limited in the information they can disclose regarding the decision-making mechanism of the model. Moreover, the feature attribution problem remains largely unsolved, and although there are many attribution approaches (CAM [39], LRP [26], DeepSHAP [24], IBA [30, 38, 21], ...), the methods disagree on which features they identify as important [19, 38, 18]. This disagreement problem is a caveat for domain experts who use these attribution methods. Thus, there is a need for interpretation approaches that are reliable and reveal more information than "which region is important." An inspiring approach, Network Dissection [3], identifies the concepts encoded by internal units (neurons) of the network. Motivated by this approach, Wu et al. [35] identify concepts the network encodes for diagnosis on mammography images, and Khakzar et al. [20] perform dissection on chest X-ray models and investigate research questions such as what clinical concepts networks pick up when trained on COVID-19 severity scores.
Methodologically, our work differs from [20, 3] in that we do not use an annotation dataset and instead identify the neurons most highly correlated with the output under investigation. We investigate a different medical domain and explore different research questions, such as which features contribute to true positives and which to false positives.

We model the vertebral fracture detection task as a binary classification problem, where the positive class indicates a fracture. The network function is defined as $f_\Theta(x): \mathbb{R}^{H \times W \times D} \to \mathbb{R}$. The predicted probability is $\hat{y} = \mathrm{sigmoid}(f_\Theta(x))$. We use a 3D U-Net [10] for the vertebral fracture classification task, replacing its upsampling path with a classification head.

In neural networks, each neuron is activated by a specific input pattern. The pattern corresponding to each neuron can equivalently be deemed its associated concept. In convolutional neural networks, each neuron can be considered either as an activation map or as an activation unit within the map. As the activation units within an activation map all represent the same function (only for different spatial locations), they represent the same concept [3]. We denote the output activations of the final convolutional layer of the network by the tensor $A \in \mathbb{R}^{H \times W \times K}$, where $K$ represents the number of channels in that layer. After computing the distribution of individual unit activations $a_k$, we determine the top quantile level $T_k$ for each unit $k$ such that $P(a_k > T_k) = 0.005$ [3]. We then derive the binary segmentation mask $M_k(x) = A_k(x) \geq T_k$.

Positive Prediction Correlation

Some units might capture concepts that are highly useful for determining whether a sample is fractured, establishing a stronger correlation with a true positive prediction than other units. To find these units, we compute a correlation score $c_k$ for each unit $k$, where $P$ is the set of positive samples and $\mathbb{1}$ is the indicator function. With $c_{k_1} > c_{k_2} > \dots$, $k_1$ is the unit most strongly correlated with a true positive prediction, followed by $k_2$.
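A minimal NumPy sketch of the thresholding step described above, assuming the final-layer activations have already been collected into one array. The function names and the array layout are illustrative assumptions. In particular, the scoring rule in `positive_prediction_score` (the fraction of positive samples on which a unit's mask is non-empty while the network predicts a fracture) is only one plausible stand-in and should not be read as the paper's exact definition of the correlation score.

```python
import numpy as np

def unit_thresholds(acts, tail=0.005):
    """Per-unit threshold T_k such that P(a_k > T_k) is roughly `tail`.

    acts: array of shape (N, K, ...) with final-layer activations for
    N samples and K units (assumed layout, spatial axes trailing).
    """
    k = acts.shape[1]
    per_unit = np.moveaxis(acts, 1, 0).reshape(k, -1)  # (K, N * spatial)
    return np.quantile(per_unit, 1.0 - tail, axis=1)   # (K,)

def binary_masks(acts, thresholds):
    """Binary masks marking where each unit reaches its threshold."""
    shape = (1, -1) + (1,) * (acts.ndim - 2)  # broadcast over N and space
    return acts >= thresholds.reshape(shape)

def positive_prediction_score(masks, probs, labels, cutoff=0.5):
    """Hypothetical per-unit score (an assumption of this sketch).

    Fraction of positive samples for which the unit's mask is non-empty
    and the predicted probability exceeds `cutoff`.
    """
    positives = labels == 1
    n_pos = int(positives.sum())
    fired = masks[positives].reshape(n_pos, masks.shape[1], -1).any(axis=2)
    predicted = (probs[positives] > cutoff)[:, None]
    return (fired & predicted).mean(axis=0)
```

Sorting the returned scores in descending order would then give a ranking of units in the sense of $k_1, k_2, \dots$; the 0.5 cutoff is likewise an assumption.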
Due to the variability of observed defects in fractured vertebrae, different concepts are relevant during the inference of a sample. We compute a relevance score $r_k$ for each unit $k$ during inference of input $\mathbf{x}$. For units $k_1, k_2$ with $r_{k_1} > r_{k_2}$, $k_1$ is more relevant for the inference of $\mathbf{x}$ than $k_2$. When visualizing highly correlated concepts for a sample $\mathbf{x}$, we compute the inference relevance of each detector unit and display the activation maps showing the corresponding responses for the input sample $\mathbf{x}$.

Data Preparation

The network is trained on the VerSe dataset [31] as well as an in-house dataset acquired at Klinikum Rechts der Isar (Munich) and Klinikum der Universität München (Munich). The latter includes 465 patients with a median age of ∼69 (±12) years and contains a heterogeneous collection of fields of view, scanner settings, and healthy and fractured vertebrae, including metallic implants and foreign materials. The combined dataset contains CT scans of patients with healthy and fractured vertebrae of osteoporotic or malignant nature from a heterogeneous collection of CT scanners. To address the inherent class imbalance in the data, negative samples are undersampled and positive (fractured) samples are oversampled in training to achieve a perfect class balance in each epoch. As osteoporotic and malignant fractures rarely occur in cervical vertebrae (C1-C7), these vertebrae are excluded from the dataset. We extract 3D patches of size 96 × 96 × 96 for each vertebra at a resolution of 1 mm. These patches are centered on the vertebral body and oriented along the spine by aligning the vertical axis with a spline constructed from the vertebral centroids provided by the dataset, similar to [16].

In the following, we first evaluate the performance of our vertebral fracture detection neural network before dissecting it into its individual detector units.
We then validate detector units highly correlated with a true positive prediction by showing that they represent clinically meaningful concepts. Lastly, we present a system to display the units most relevant to a single inference.

Vertebral Fracture Detection

We consider the threshold-based evaluation metrics F1-score and accuracy. To remove the dependence on a manually chosen threshold whose optimum might vary between trained networks, the area under the curve (AUC) and average precision (AP) metrics are also evaluated. We report the mean and standard deviation of these metrics from five separate training trials for each model. For networks trained on the smaller VerSe dataset, we observe performance akin to "naive" two-dimensional vertebral fracture detection approaches on the same dataset [16], and a high dependence on a beneficial random seed. These networks, however, do not yield detector units that exhibit any discernible patterns. Such patterns emerge only when training a network on the larger dataset, which combines VerSe and the in-house data collected at Klinikum Rechts der Isar (Munich) and Klinikum der Universität München (Munich); this network is also reliably superior in performance. Its detector units exhibit a variety of patterns that are investigated in the subsequent sections.

Given the network trained on the larger dataset, we extract its semantic concepts with Network Dissection [3], which we extended to three-dimensional space. To reduce the 512 detector units of the 3D U-Net to a tractable number, we determine the top ten units most highly correlated with a true positive prediction, as detailed in Section 2.2. For these units, we exported a single-slice collage of 25 strongly activating fractured samples, serving as an overview of the units' activations. For the five samples that activated the unit most strongly, all two-dimensional slices as well as three-dimensional NIfTI files are exported, allowing for a detailed inspection.
Based on these exports, we consulted two clinical experts with a combined experience of 22 years in spine imaging about the clinical meaningfulness of these detector units. Omitting three units where no immediate association was possible, we show the detector units identified by their correlation rank with their corresponding clinical explanation in Table 2. The provided samples show a diverse collection of detector unit activations, with each unit exhibiting consistent patterns across multiple samples. We also observe that these units' main focus is the primary vertebra, even if there is some activation in the surroundings. It is noteworthy that the patterns align with the bone anatomy and present themselves in clinically significant locations. As severe fractures are associated with changes in the superior and inferior vertebral endplates, we find the majority of activations in these regions. Although multiple detector units target these areas, they focus on different locations and exhibit varying sizes of regions of interest, with some integrating further information from the intervertebral discs as well as the adjacent vertebra. These insights are clinically meaningful for detecting moderate and severe vertebral deformations (Genant grade 1 or higher [12]), and thus show that our network learned concepts that have a clinical correspondence.

[Table 2 (column header: Clinical Explanation)]

For the omitted cases, we observed either no statistically significant activations, i.e. $M_k(\mathbf{x}) = \mathbf{0}$, or sporadic activations that do not present any clear patterns, even though they are highly correlated with a true positive prediction. Overall, such detector units represent a minority and can therefore be disregarded in light of those that exhibit tangible patterns. Having shown that the network learns clinically relevant concepts, we have validated its ability to make use of conducive features.
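The unit selection used in this section could be sketched as follows: rank units by their correlation score, keep the ten highest, list the samples that activate each of them most strongly, and flag units whose thresholded mask is empty for a sample. This is a hedged NumPy sketch. The function names, the array layout, and the choice of the per-sample peak activation as the ranking criterion are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def top_units(scores, n=10):
    """Indices of the n units with the highest score, best first."""
    order = np.argsort(scores)[::-1]
    return order[:n]

def top_activating_samples(acts, unit, n=25):
    """Indices of the n samples with the largest peak activation for `unit`.

    acts: array of shape (N, K, ...) (assumed layout). Ranking samples by
    their maximum activation is an assumption of this sketch.
    """
    peak = acts[:, unit].reshape(acts.shape[0], -1).max(axis=1)
    return np.argsort(peak)[::-1][:n]

def has_empty_mask(mask):
    """True if a unit's binary mask is all zeros for a sample."""
    return not bool(np.any(mask))
```

With `n=25` the second function would give candidates for an overview collage, and with `n=5` the samples for a slice-by-slice inspection; exporting the volumes themselves is left out here.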
We further seek to illuminate the black-box decision-making process of the network by providing the user with a visual explanation for a single inference. To this end, we propose a system that visualizes the concepts considered most important by the network during inference. Using the method described in Section 2.3 to identify the units representing the most relevant concepts, we retrieve their respective top activating images from our combined dataset. We then display two visualizations for each unit: (i) the activations of those units for the input sample, and (ii) the activations for their corresponding top images. This provides the user with a detector unit's particular response for the given input sample as well as a larger context to understand its general concept. For both visualizations, a single slice with high activation (after thresholding) is shown. An example of (i) is given in Table 3, which gives evidence of the network corroborating its prediction with a diverse set of concepts. These concepts illustrate the network accurately identifying relevant indications for the wedge-shaped deformity and incorporating information from an adjacent vertebra.

Table 3. Visualization of the most relevant detector units during class prediction of the sample shown on the left, which the network correctly predicted as fractured. Each detector unit is represented by a single slice activation for that particular sample. We also show its ranking in units highly correlated with a true positive prediction. We observe that the network uses concepts associated with wedge-shaped deformity and incorporates information from an adjacent vertebra.

This system enables users to comprehend the network's decision making, increasing trust in the system and allowing them to identify failure cases more easily.
Furthermore, this approach does not require any prior concept matching by experts, as the user is able to interpret the general concept of a detector unit and make informed judgements about its importance for a particular sample.

We show that a 3D U-Net learns a diverse set of concepts to tackle the task of detecting vertebral fractures. To gauge their meaningfulness, we first proposed a method to identify units highly correlated with a fracture detection. Then, we showed the overlap of these units with clinical concepts as validated by experts. Finally, we introduced a system to visually explain a single inference by showing the concepts most relevant for the classification of the sample, giving users insight into the network's decision-making process. Further extensions of this system are conceivable, such as pre-filling a radiology report based on activations in a group of semantically similar detector units.

[1] Compression fractures detection on ct
[2] Prevalence of thoracolumbar vertebral fractures on multidetector CT
[3] Network dissection: Quantifying interpretability of deep visual representations
[4] Mortality Risk Associated With Low-Trauma Osteoporotic Fracture and Subsequent Fracture in Men and Women
[5] Public Health Impact of Osteoporosis
[6] Risk of mortality following clinical fractures
[7] Mortality after all major types of osteoporotic fracture in men and women: an observational study
[8] Application of deep learning algorithm to detect and visualize vertebral fractures on plain frontal radiographs
[9] 3d convolutional sequence to sequence model for vertebral compression fractures identification in ct
[10] 3D U-Net: learning dense volumetric segmentation from sparse annotation
[11] Two-stream compare and contrast network for vertebral compression fracture diagnosis
[12] Vertebral fracture assessment using a semiquantitative technique
[13] Vertebral fractures: a hidden problem of osteoporosis
[14] Health-related quality of life after vertebral or hip fracture: a seven-year follow-up study
[15] Osteoporosis in the European Union: medical management, epidemiology and economic burden: A report prepared in collaboration with the International Osteoporosis Foundation (IOF) and the European Federation of Pharmaceutical Industry Associations (EFPIA)
[16] Grading loss: a fracture grade-based metric loss for vertebral fracture detection
[17] Association Between Vertebral Fracture and Increased Mortality in Osteoporotic Patients
[18] Rethinking positive aggregation and propagation of gradients in gradient-based saliency methods
[19] Do explanations explain? model knows best
[20] Towards semantic interpretation of thoracic disease and covid-19 diagnosis models
[21] Explaining covid-19 and thoracic pathology model predictions by identifying informative input features
[22] Adam: A method for stochastic optimization
[23] Differential diagnosis of benign and malignant vertebral fracture on ct using deep learning
[24] A unified approach to interpreting model predictions
[25] Vertebral Fractures Predict Subsequent Fractures
[26] Explaining nonlinear classification decisions with deep Taylor decomposition
[27] Artificial intelligence for the detection of vertebral fractures on plain spinal radiography
[28] Detection of vertebral fractures in ct using 3d convolutional neural networks
[29] Keypoints localization for joint vertebra detection and fracture severity quantification
[30] Restricting the flow: Information bottlenecks for attribution
[31] Verse: a vertebrae labelling and segmentation benchmark for multi-detector ct images
[32] Deep neural networks for automatic detection of osteoporotic vertebral fractures on ct scans
[33] Opportunistic osteoporosis screening in multi-detector ct images via local classification of textures
[34] Under-reporting of osteoporotic vertebral fractures on computed tomography
[35] Deepminer: Discovering interpretable representations for mammogram classification and explanation
[36] Automated deep learning-based detection of osteoporotic fractures in ct images
[37] Assessing attribution maps for explaining cnn-based vertebral fracture classifiers
[38] Fine-grained neural network explanation by identifying input features with predictive information
[39] Learning deep features for discriminative localization