key: cord-0544552-cgmn5vzi authors: Gonzalez, Camila; Gotkowski, Karol; Bucher, Andreas; Fischbach, Ricarda; Kaltenborn, Isabel; Mukhopadhyay, Anirban title: Detecting when pre-trained nnU-Net models fail silently for Covid-19 lung lesion segmentation date: 2021-07-13 journal: nan DOI: nan sha: 79896420ad94d67cb67d6e4a49bdd4e18945dfd6 doc_id: 544552 cord_uid: cgmn5vzi

Automatic segmentation of lung lesions in computed tomography has the potential to ease the burden on clinicians during the Covid-19 pandemic. Yet predictive deep learning models are not trusted in the clinical routine because they fail silently on out-of-distribution (OOD) data. We propose a lightweight OOD detection method that exploits the Mahalanobis distance in the feature space. The proposed approach can be seamlessly integrated into state-of-the-art segmentation pipelines without requiring changes in model architecture or training procedure, and can therefore be used to assess the suitability of pre-trained models to new data. We validate our method with a patch-based nnU-Net architecture trained with a multi-institutional dataset and find that it effectively detects samples that the model segments incorrectly.

Automatic lung lesion segmentation in the clinical routine would significantly lessen the burden on radiologists, standardise quantification and staging of Covid-19, and open the way for a more effective utilisation of hospital resources. With this hope, several initiatives have gathered Computed Axial Tomography (CAT) scans and ground-truth annotations from expert thorax radiologists and released them to the public [6, 20, 23]. Experts have identified ground glass opacities (GGOs) and consolidations as characteristic of a pulmonary infection caused by the SARS-CoV-2 virus [24]. Deep learning models have shown good performance in segmenting these lesions.
In particular, the fully automatic nnU-Net framework [11] secured top spots (9 out of 10, including the first) in the leaderboard for the Covid-19 Lung CT Lesion Segmentation Challenge [7]. Such frameworks would ideally be utilised in clinical practice. However, deep learning models are known to fail on data that diverges considerably from the training distribution. CAT scans are particularly prone to this domain shift problem [4]. The data showcased in the challenge is multi-centre and diverse in terms of patient group and acquisition protocol. A model trained with it would be presumed to produce good predictions for a wide spectrum of institutions. Yet when we evaluate an nnU-Net model on three other datasets, we notice a considerable drop in segmentation quality (see Fig. 1 (a)). Lung lesions do not manifest in large connected components (see Fig. 4), so it is not trivial for a novice radiologist to identify an incorrect segmentation. Clinicians can still leverage models trained with large amounts of heterogeneous data, but only alongside a process that identifies when the model is unsuitable for a new data sample. Widely used segmentation frameworks are not designed with OOD detection in mind, and so a method is needed that reliably identifies OOD samples post-training while requiring minimal intervention.

Several strategies have shown good OOD detection performance in classification models. Hendrycks and Gimpel [8] propose using the maximum softmax output as an OOD detection baseline. Guo et al. [5] find that replacing the regular softmax function with a temperature-scaled variant produces better-calibrated estimates. This can be complemented by adding perturbations to the network inputs [19]. Other methods [10, 17] instead look at the KL divergence of softmaxed outputs from the uniform distribution. Some approaches use OOD data during training to explicitly train an outlier detector [1, 9, 17]. Bayesian-inspired techniques can also be used for outlier detection.
Commonly used are Monte Carlo Dropout [3] and Deep Ensembles [16]. These have shown promising results in the field of medical image segmentation [12, 13, 21]. Approaches that modify the architecture or training procedure have shown better performance in some cases, but their applicability to widely used segmentation frameworks is limited [2, 15, 22].

We propose a method for OOD detection that is lightweight and seamlessly integrates into complex segmentation frameworks. Inspired by the work of Lee et al. [18], our approach estimates a multivariate Gaussian distribution from in-distribution (ID) training samples and utilises the Mahalanobis distance as a measure of uncertainty during inference. We compute the distance in a low-dimensional feature space, and down-sample it further to ensure a computationally inexpensive calculation. We validate our method on a patch-based 3D nnU-Net trained with multi-centre data from the Covid-19 Lung CT Lesion Segmentation Challenge. Our evaluation shows that the proposed method can effectively identify OOD samples for which the model produces faulty segmentations, and provides good model calibration estimates. Our contributions are:
- The introduction of a lightweight, flexible method for OOD detection that can be integrated into any segmentation framework.
- An extension of the nnU-Net framework to provide clinically relevant uncertainty estimates.

We start by summarising the particularities of the nnU-Net framework in Sec. 2.1. In Sec. 2.2, we outline our proposed method for OOD detection, which follows a three-step process: (1) estimation of a Gaussian distribution from training features, (2) extraction of uncertainty masks for test images and (3) calculation of subject-level uncertainty scores.

The nnU-Net framework is a standardised baseline for medical image segmentation [11].
Without deviating from traditional U-Net architectures [26], it has won several grand challenges by automatically customising the architecture and training configuration to the data at hand [7]. The framework also performs pre- and post-processing steps, such as adapting voxel spacing and contrast normalisation, during both training and inference. In this work we utilise the patch-based full-resolution variant, which is recommended for most applications [11], but our method can be integrated into any other architecture. For the patch-based architecture, training images are first divided into overlapping patches with a sliding window approach, resulting in N patches x_i (i = 1, ..., N). Predictions for each patch are multiplied by a filtering operation that weights centre voxels more heavily, and then aggregated into an output mask with the dimensions of the original image.

Fig. 1 (b). The Euclidean distance D_E does not recognise that ẑ_1 (purple marker) is closer than ẑ_2 (blue marker) to the distribution of training samples (gray markers), with mean µ (green marker) and covariance Σ. This difference intensifies in high-dimensional spaces, where it is common for regions close to the mean to be underrepresented.

We are interested in capturing epistemic uncertainty, which arises from a lack of knowledge about the data-generating process. Quantifying it for image regions instead of region boundaries is challenging, particularly for OOD data [14]. One computationally inexpensive way to assess epistemic uncertainty is to calculate the distance between training and testing activations in a low-dimensional feature space. As a model is unlikely to produce reasonable outputs for features far from any seen during training, this is a reliable signal for bad model performance [18].
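As a toy illustration of the situation depicted in Fig. 1 (b), the following minimal sketch compares both distances in two dimensions. The mean, covariance and the two points are made-up values chosen only for illustration and are not taken from the paper; the use of the squared form of the Mahalanobis distance (no square root) is likewise an assumption, which does not change the ordering of the points.

```python
import math

def euclidean(z, mu):
    """Plain Euclidean distance between a point and the mean."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, mu)))

def mahalanobis_sq_2d(z, mu, cov):
    """Squared Mahalanobis distance for a 2x2 covariance, using its closed-form inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = z[0] - mu[0], z[1] - mu[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

# Hypothetical training distribution with strongly correlated features.
mu = (0.0, 0.0)
cov = ((4.0, 3.0), (3.0, 4.0))

z1 = (2.0, 2.0)    # lies along the main axis of the distribution
z2 = (1.0, -1.0)   # lies across the main axis

print("Euclidean:  ", euclidean(z1, mu), euclidean(z2, mu))
print("Mahalanobis:", mahalanobis_sq_2d(z1, mu, cov), mahalanobis_sq_2d(z2, mu, cov))
```

In this example the Euclidean distance ranks z2 as the closer point, while the Mahalanobis distance ranks z1 as closer because it follows the correlation present in the training distribution.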
Model activations have non-trivial covariance, and the activations of typical input images do not necessarily resemble the mean [27], so the Euclidean distance is not appropriate for identifying unusual activation patterns, a problem that is exacerbated in high-dimensional spaces. The Mahalanobis distance D_M rescales samples into a space without covariance, supplying a more effective way to identify typical patterns in deep model features. Fig. 1 (b) illustrates a situation where the Euclidean distance suggests that ẑ_2 is closer to the training distribution than ẑ_1, when ẑ_2 is highly unusual and ẑ_1 is a probable sample. In the following we describe the steps we perform to extract a subject-level uncertainty value. Note that only one forward pass is necessary for each image, keeping the computational overhead to a minimum.

Estimation of the training distribution: We start by estimating a multivariate Gaussian N(µ, Σ) over model features. For all training inputs x_i, features z_i are extracted. For modern segmentation networks, the dimensionality of the extracted features z_i is too large to calculate the covariance Σ in an acceptable time frame. We thus project the latent space into a lower-dimensional subspace by average pooling, obtaining ẑ_i. Finally, we flatten this subspace and estimate the empirical mean µ and covariance Σ.

Extraction of uncertainty masks: During inference, we estimate an uncertainty mask for a subject following the process outlined in Fig. 2. For each patch x_i, features are extracted and projected into ẑ_i. Next, the Mahalanobis distance (Eq. 2) to the Gaussian distribution estimated in the previous step is calculated. Each distance is a point estimate for the corresponding model input. These are aggregated in a similar fashion to how network outputs are combined to form a prediction mask. Following the example of the patch-based nnU-Net, a zero-filled tensor is initialised with the dimensionality of the original image.
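The estimation step and the per-patch distance can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation in the released code: the function names are hypothetical, features are assumed to have shape (channels, depth, height, width), the 1e4-element limit is the one reported in the experimental setup, the pseudo-inverse is used in case the covariance is singular, and the distance is computed in its squared form.

```python
import numpy as np

def avg_pool3d(x, k=2):
    """Average-pool the three spatial axes of a (C, D, H, W) array with kernel = stride = k."""
    c, d, h, w = x.shape
    d2, h2, w2 = d // k, h // k, w // k
    if min(d2, h2, w2) == 0:
        raise ValueError("spatial dimensions too small to pool further")
    x = x[:, : d2 * k, : h2 * k, : w2 * k]  # drop remainders that do not fill a window
    return x.reshape(c, d2, k, h2, k, w2, k).mean(axis=(2, 4, 6))

def project(features, max_elements=10_000):
    """Pool repeatedly until fewer than max_elements entries remain, then flatten."""
    z = features
    while z.size >= max_elements:
        z = avg_pool3d(z)
    return z.reshape(-1)

def fit_gaussian(z_hat):
    """Empirical mean and (pseudo-)inverse covariance; z_hat has one row per training patch."""
    mu = z_hat.mean(axis=0)
    cov = np.cov(z_hat, rowvar=False, bias=True)  # bias=True gives the 1/N estimate
    return mu, np.linalg.pinv(cov)

def mahalanobis_sq(z, mu, precision):
    """Squared Mahalanobis distance of one projected feature vector to N(mu, Sigma)."""
    diff = z - mu
    return float(diff @ precision @ diff)
```

At inference time, each patch would be passed through `project` and `mahalanobis_sq` once, so the overhead on top of the single forward pass is one pooling loop and one quadratic form per patch.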
After assessing the distance for a patch, the value is replicated to the specified patch size and a filtering operation is applied to weight centre voxels more heavily. Finally, patch-level uncertainties are aggregated into an image-level mask.

Subject-level uncertainty: The process described above produces an uncertainty mask with the dimensionality of the CAT scan. In order to effectively identify highly uncertain samples, we aggregate this mask into a subject-level uncertainty U by averaging over all voxels. We then normalise uncertainties between the minimum and doubled maximum values represented in an ID validation set (which we assume to be available during training) to ensure U ∈ [0, 1].

We work with a total of four datasets for segmentation of Covid-19-related findings. The Challenge dataset [6] contains chest CAT scans for patients with a confirmed SARS-CoV-2 infection from an array of institutions. The data is heterogeneous in terms of age, gender and disease severity. We use the 199 cases made available under the Covid Segmentation Grand Challenge, which we randomly divide into 160 cases to train the model, 4 validation and 35 test cases. We evaluate our method with two publicly available datasets and an in-house one. The public datasets encompass cases for patients with and without confirmed infections. Mosmed [23] contains fifty cases and the Radiopedia dataset [20] a further twenty. Finally, we utilise an in-house dataset consisting of fifty patients who tested positive for SARS-CoV-2 with an RT-PCR test. All fifty scans were reviewed for diagnostic image quality. The annotations for the in-house data were performed slice-by-slice by two independent readers trained in the delineation of GGOs and pulmonary consolidations. Central vascular structures and central bronchial structures were excluded from all segmentations. All delineations were reviewed by an expert radiologist reader.
For the public datasets, the segmentation process is outlined in the corresponding publications.

With the Challenge data, we train a patch-based nnU-Net [11] on a Tesla T4 GPU. Our configuration has a patch size of [28, 256, 256], and adjacent patches overlap by half that size. To reduce the dimensionality of the feature space, we apply average pooling with a kernel size of (2, 2, 2) and stride (2, 2, 2) until the dimensionality falls below 1e4 elements. With the scikit-learn library (version 0.24) [25], calculating Σ requires 85 seconds for 1e5 samples. Our code is available under github.com/MECLabTUDA/Lifelong-nnUNet (branch ood detection).

We compare our approach to state-of-the-art techniques that derive uncertainty information by performing inference on a trained model. Max. Softmax consists of taking the maximum softmax output [8]. Temp. Scaling performs temperature scaling on the outputs before applying the softmax operation [5], for which we test three different temperatures T ∈ {10, 100, 1000}. KL from Uniform computes the KL divergence from a uniform distribution [10]. Note that all three methods output a confidence score (higher is more certain), which we invert to obtain an uncertainty estimate (lower is more certain). Finally, MC Dropout consists of doing several forward passes whilst activating the Dropout layers that would usually be dormant during inference. We perform 10 forward passes and report the standard deviation between outputs as an uncertainty score. For all methods, we calculate a subject-level metric by averaging uncertainty masks, and normalise the uncertainty range between the minimum and doubled maximum uncertainty represented in ID validation data.

We start this section by analysing the performance of the proposed method in detecting samples that vary significantly from the training distribution. We then examine how well the model estimates segmentation performance.
Lastly, we qualitatively evaluate our method for ID and OOD examples.

OOD detection: We first assess how effective our method is at identifying samples that are not ID (Challenge data). Due to the heterogeneity of the Challenge dataset, in practice data from an array of institutions would be considered ID. However, for our evaluation datasets there is a drop in performance which should manifest in higher uncertainty estimates. As is common practice in OOD detection [19], we find the uncertainty boundary that achieves a 95% true positive rate (TPR) on the ID validation set, where a true positive is a sample correctly identified as ID. We report for the ID test data and all OOD data the false positive rate (FPR) and Detection Error = 0.5 (1 − TPR) + 0.5 FPR at 95% TPR. Tab. 1 summarises our findings. All methods that utilise the network outputs after one forward pass have a high detection error and FPR, while the MC Dropout approach manages to identify more OOD samples. Our proposed method displays the lowest FPR and detection error.

Table 1. Detection Error (lower is better) and FPR (lower is better) for the boundary of 95% TPR, ESCE (lower is better) and (mean±sd) Dice (higher is better) for subjects with an uncertainty below the 95% TPR boundary. The results are reported for ID test data and all OOD samples. Columns: Det. Error, FPR, ESCE, Dice.

Segmentation performance: While the detection of OOD samples is a first step in assessing the suitability of a model, an ideal uncertainty metric would inversely correlate with model performance, informing the user of the likely quality of a prediction without requiring manual annotations. For this we calculate the Expected Segmentation Calibration Error (ESCE). Inspired by Guo et al. [5], we divide the N test scans into M = 10 interval bins B_m according to their normalised uncertainty.
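The OOD detection metrics defined above (uncertainty boundary at 95% TPR, FPR and Detection Error) can be sketched in plain Python as follows. The function names and the example uncertainty values are hypothetical, and choosing the boundary as the smallest validation uncertainty that admits at least 95% of ID validation samples is one possible reading of the procedure, stated here as an assumption.

```python
import math

def threshold_at_tpr(id_val_uncertainties, tpr=0.95):
    """Smallest uncertainty boundary that admits at least `tpr` of the ID validation samples."""
    ordered = sorted(id_val_uncertainties)
    k = math.ceil(tpr * len(ordered))  # number of ID samples that must be admitted
    return ordered[k - 1]

def detection_metrics(id_test, ood, threshold):
    """TPR on ID test data, FPR on OOD data and the resulting detection error.

    A sample is predicted to be ID when its uncertainty does not exceed the threshold.
    """
    tpr = sum(u <= threshold for u in id_test) / len(id_test)
    fpr = sum(u <= threshold for u in ood) / len(ood)
    det_error = 0.5 * (1.0 - tpr) + 0.5 * fpr
    return tpr, fpr, det_error

# Made-up normalised uncertainties, for illustration only.
id_val = [0.10, 0.20, 0.15, 0.30]
id_test = [0.10, 0.20, 0.25, 0.40]
ood = [0.20, 0.50, 0.60, 0.90]

boundary = threshold_at_tpr(id_val)
tpr, fpr, det_error = detection_metrics(id_test, ood, boundary)
print(boundary, tpr, fpr, det_error)
```

With only a handful of validation cases, as in the split used here, the 95% boundary coincides with the largest validation uncertainty.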
Over all bins, the absolute difference between the average Dice (Dice(B_m)) and the inverse average uncertainty (1 − U(B_m)) of the samples in the bin is summed, weighted by the number of samples in the bin: ESCE = Σ_{m=1}^{M} (|B_m| / N) · |Dice(B_m) − (1 − U(B_m))|. The results are reported in Tab. 1 (fourth column). Our proposed approach shows the lowest ESCE at 0.125. The average Dice of admitted samples (fifth column) lies at 0.744, which is consistent with the expected ID performance of the model (see Fig. 1 (a)).

Fig. 4. Upper row: a good prediction. Lower row: a prediction for an OOD sample where two lesions are erroneously segmented in the superior lung lobes. Despite the considerable differences from the ground truth, these errors are not directly noticeable to a non-expert observer, as GGOs can manifest in superior lobes [24].

Increasingly, institutions are taking part in initiatives to gather large amounts of annotated, heterogeneous data and release it to the public. This could potentially alleviate the work burden of medical practitioners by allowing the training of robust segmentation models. Open-source end-to-end frameworks contribute to this process. But regardless of the variety of the training data, it is necessary to assess whether a model is well-suited to new samples. This is particularly true when it is not trivial to identify a faulty output, such as for the segmentation of SARS-CoV-2 lung lesions. There is currently a disconnect between methods for OOD detection, which often require special training or architectural considerations, and widely used segmentation frameworks. We find that calculating the Mahalanobis distance to features in a low-dimensional subspace is a lightweight and flexible way to signal when a model prediction should not be trusted. Future work should explore how to better identify high-quality predictions. For now, our work increases clinicians' trust when translating trained neural networks from challenge participation to real clinics.
[1] Simultaneous semantic segmentation and outlier detection in presence of domain shift
[2] Weight uncertainty in neural network
[3] Dropout as a bayesian approximation: Representing model uncertainty in deep learning
[4] Machine learning with multi-site imaging data: An empirical study on the impact of scanner effects
[5] On calibration of modern neural networks
[6] Artificial intelligence for the detection of covid-19 pneumonia on chest ct using multinational datasets
[7] Leading pediatric hospital reveals top ai models in covid-19 grand challenge
[8] A baseline for detecting misclassified and out-of-distribution examples in neural networks
[9] Deep anomaly detection with outlier exposure
[10] Using self-supervised learning can improve model robustness and uncertainty
[11] nnu-net: a self-configuring method for deep learning-based biomedical image segmentation
[12] Analyzing the quality and challenges of uncertainty estimations for brain tumor segmentation
[13] Assessing reliability and challenges of uncertainty estimations for medical image segmentation
[14] What uncertainties do we need in bayesian deep learning for computer vision?
[15] A probabilistic u-net for segmentation of ambiguous images
[16] Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems
[17] Training confidence-calibrated classifiers for detecting out-of-distribution samples
[18] A simple unified framework for detecting out-of-distribution samples and adversarial attacks
[19] Enhancing the reliability of out-of-distribution image detection in neural networks
[20] Covid-19 ct lung and infection segmentation dataset
[21] Confidence calibration and predictive uncertainty estimation for deep medical image segmentation
[22] Stochastic segmentation networks: Modelling spatially correlated aleatoric uncertainty
[23] Mosmeddata: Chest ct scans with covid-19 related findings dataset
[24] Review of the chest ct differential diagnosis of ground-glass opacities in the covid era
[25] Scikit-learn: Machine learning in python
[26] U-net: Convolutional networks for biomedical image segmentation
[27] Understanding intra-class knowledge inside cnn