1 Introduction

Deep Learning (DL) models have found widespread use in various applications, ranging from autonomous driving [19] and pest detection to speech recognition [26]. Despite their outstanding results in computer vision and natural language processing tasks, accuracy is not the only criterion to consider in a DL deployment [10]. Depending on the problem, other aspects may also become important, such as explainability and the capability to handle samples from unknown classes [34].

DL models are generally trained under a closed-set assumption, and this restriction is reflected in their inability to explicitly express ignorance about input samples from unseen classes. As a result, a DL model trained in such a setup is often unable to identify data from an unknown class as unknown, which leads to model overconfidence [32]. This overconfidence has several sources, such as undetected overfitting, bias, or even the choice of the softmax function for the model's output layer, all of which make directly identifying unknown samples more difficult [34]. Therefore, the model needs to be robust and able to handle Out-of-Distribution (OOD) samples, which can come in various forms depending on the problem.

For medical applications, OOD detection is an important auxiliary task to improve the ability to detect unseen classes in an open-set problem. For example, when a DL skin lesion classifier encounters an unseen rare lesion, it would be preferable to identify it as unknown instead of erroneously assigning it to one of the known classes [25, 36]. Therefore, the OOD detection task has drawn attention in a wide range of applications, such as histopathology [22], X-ray [3], and magnetic resonance image [14] classification problems.

OOD detection can be considered a recent field of research in DL, and one of its main objectives is to improve the ability of models to recognize unknown samples. In other words, an OOD detection algorithm should be able to identify whether an input can be considered known or unknown. The most straightforward option for OOD detection is to use activations from the model's output layer, as this is closest to the final inference result [9]. These strategies typically rely on logits or softmax outputs to compute confidence scores, which are then used to differentiate between known and unknown classes.

More recently, researchers have explored using the model's feature space to identify unknown samples, based on the assumption that the feature space can be useful for OOD detection, as intermediate layers capture different levels of semantic features [20]. One such method is the Open Principal Component Score (OpenPCS), which uses a low-dimensional feature space representation obtained from Principal Component Analysis (PCA) to fit class-wise Gaussian distributions and identify whether data is known or unknown. This approach was first implemented for semantic segmentation problems, but it can be extended to multi-class classification, in a variant named OpenPCS-Class [5]. However, the feature-space approach for OOD detection, especially OpenPCS-Class, is still underexplored in many applications.

In this article, we evaluate the OpenPCS-Class for OOD detection in skin lesion classification problems. The objective is to evaluate the capability of a Gaussian-based approach using the feature space to identify unseen classes in this medical application, which is usually a complex task with numerous OOD classes related to unknown skin lesions. The contribution of this work is three-fold:

  1. We evaluate the OpenPCS-Class method for OOD detection in skin lesion problems. We use different OOD data to evaluate the approach, ranging from samples of unseen classes of skin lesions to different medical problems.

  2. We compare the results with traditional and state-of-the-art methods for OOD detection. We assess how these methods behave in the presence of different OOD classes and additional ID data.

  3. We evaluate these methods across different model architectures to investigate the model's contribution to OOD detection using different feature space representations.

2 Related Works

Detecting OOD samples is crucial for building reliable Deep Learning models that need to operate effectively in an open-set scenario. In medical applications, such strategies enhance the robustness of DL results in critical tasks. These works generally concentrate on semantic segmentation and image classification tasks [4]. Karimi et al. [13] proposed a spectral analysis of the intermediate features of DL models to enhance the robustness of multi-organ segmentation by quantifying the uncertainty of the segmentation result. Wollek et al. [30] evaluated several state-of-the-art OOD detection methods on medical image classification tasks, discussing the advantages and drawbacks of such methods in identifying unknown samples close to the training classes.

Due to the relevance of this topic for guaranteeing safety and robustness in DL applications, a plethora of new strategies for OOD detection has emerged. One of the most common methods involves using the softmax output as an OOD score, known as Maximum Softmax Probability (MSP) [11]. MSP is based on the idea that samples from unknown classes generate lower confidence scores for each known class, which are then used to distinguish ID and OOD data. This method has been evaluated in a wide range of problems, including medical applications. Zhang et al. [37], for example, evaluated the effectiveness of MSP for OOD detection in diabetic retinopathy and chest radiography problems. However, the softmax output can sometimes yield overconfident scores on unknown data, which is inappropriate for OOD detection [33].

To avoid the issues associated with the softmax, the feature space can also be used to distinguish between known and unknown samples. Lee et al. [15] proposed a method that uses information from the feature space to detect OOD samples, assuming that the feature representations can be fitted by Gaussian distributions. In this case, class-conditional Gaussian distributions are estimated and the score is computed as the Mahalanobis distance from a test sample to the closest class-conditional distribution [24]. This OOD detection method has been applied in different medical image analysis applications, such as malaria parasitized cell classification [28], lung cancer classification [2], and skin lesion classification [25].

Despite its effectiveness for the OOD detection task, the feature space is generally a high-dimensional representation, which can often be inefficient and highly redundant, making the OOD detection method harder to fit [31]. To alleviate the problem of high dimensionality in intermediate representations, Oliveira et al. [21] proposed an OOD detection method, called OpenPCS, that uses PCA to reduce the dimensionality of the feature space. The low-dimensional representation is then used to fit class-conditional Gaussian distributions, and the score is calculated as the maximum likelihood between a sample's intermediate representation and the class-conditional distributions. More recently, Carvalho et al. [5] proposed an extension of this method for multi-class classification problems, named OpenPCS-Class. This method was successfully evaluated on benchmark problems, but OpenPCS-Class remains unexplored in many applications, including medical image analysis.

3 Detecting Unseen Samples Using Feature Space

In this section, we describe the OpenPCS-Class method in detail. We also briefly introduce the OOD detection problem in skin lesion classification, motivating the applicability of this work. The code of this work is publicly available (Footnote 1).

3.1 Open Principal Component Score for Image Classification

The Open Principal Component Score (OpenPCS) is a method that uses intermediate features for OOD detection in semantic segmentation tasks. Originally, this method could be applied only to Fully Convolutional Networks (FCNs), which can be prohibitive for the direct use of OpenPCS in other DL tasks.

OpenPCS-Class can be seen as the extension of the OpenPCS method to classification tasks. This method discards the need for an FCN but retains the main characteristic of combining intermediate features into a low-dimensional representation. For a better comprehension of the method, Fig. 1 displays an overview of the method for an image classification problem.

Fig. 1. OpenPCS-Class overview for image classification

OpenPCS-Class is an OOD detection method that can combine features from different layers to distinguish whether a sample belongs to a known or unknown class. For each model layer l, we transform the activation map \(a^{(l)}\) into the corresponding activation vector \(h^{(l)}\) using a reduction method (e.g., average pooling). Therefore, we always obtain a feature vector regardless of the layer specification.
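
As an illustration of this reduction step, a minimal NumPy sketch (assuming channel-first activation maps of shape (C, H, W), as in most convolutional frameworks; the variable names are illustrative):

```python
import numpy as np

def activation_to_vector(activation_map):
    """Reduce an activation map a^(l) of shape (C, H, W) to a
    C-dimensional vector h^(l) via global average pooling."""
    return activation_map.mean(axis=(1, 2))

# A hypothetical layer output: 64 channels on an 8x8 spatial grid.
a_l = np.random.rand(64, 8, 8)
h_l = activation_to_vector(a_l)  # shape (64,), regardless of H and W
```

Because the pooled vector's length depends only on the number of channels, layers with different spatial resolutions all yield fixed-length vectors.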

One of the main abilities of the OpenPCS method is the capability to combine feature representations from different layers, which is a user-defined parameter. For classification tasks, the features are combined by concatenating their vectors, resulting in a feature vector h. The drawback of such an approach is the high dimensionality of h. To alleviate this issue, we apply PCA to obtain a better representation in a lower dimension.

To fit the eigenvectors and eigenvalues for the PCA, we follow a class-wise approach. Therefore, we use the collection of feature vectors related to each of the known classes to fit the parameters of the PCA, creating a specific dimensionality reduction for each of the known classes, according to Eq. 1.

$$\begin{aligned} h^{*}_c = h \cdot v_{c} \end{aligned}$$
(1)

where h is the feature representation, \(v_{c}\) contains the eigenvectors with the highest eigenvalues for the dimensionality reduction of class c, and \(h^{*}_c\) is the transformed feature vector for class c. Therefore, depending on the class c, the resulting low-dimensional feature vector \(h^{*}_{c}\) can be different: we obtain a different low-dimensional representation of the same feature vector h for each class c that we evaluate.
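
A minimal sketch of this class-wise reduction, using scikit-learn's PCA as a stand-in (note that scikit-learn also centers the data before projecting, a small departure from the plain product in Eq. 1; the function names and the number of components are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_classwise_pca(features, labels, n_components=16):
    """Fit one PCA per known class c, keeping the eigenvectors v_c
    associated with the largest eigenvalues (cf. Eq. 1)."""
    return {c: PCA(n_components=n_components).fit(features[labels == c])
            for c in np.unique(labels)}

def project_classwise(h, pcas):
    """Return the low-dimensional representation h*_c of a single
    feature vector h for every known class c."""
    return {c: pca.transform(h.reshape(1, -1))[0] for c, pca in pcas.items()}
```

Each known class thus owns its own projection, so a single test feature vector produces one low-dimensional representation per class.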

The OOD score is computed by estimating how likely the feature vector is under each of the known classes. In the literature, the Gaussian density estimator has been successfully used to quantify OOD-ness [18, 21]. Therefore, we adopt the Gaussian density estimator and compute the corresponding likelihood. Mathematically, the OOD score for each class c is computed according to Eq. 2

$$\begin{aligned} G_c(h^{*}_c) = \frac{1}{\sqrt{2\pi \sigma _{c}^2}}\exp \left( -\frac{(h^{*}_c -\mu _{c})^2}{2\sigma _{c}^2}\right) \end{aligned}$$
(2)

where \(\mu _c\) and \(\sigma _c\) represent the mean and standard deviation for a known class c, and \(G_{c}(h^{*}_c)\) represents the probability density of \(h^{*}_c\) under \(G_{c}\). The final OOD score is the maximum log-likelihood over all known classes, as defined in Eq. 3.

$$\begin{aligned} s = \max _{c=1}^{n} {\log \left[ G_{c} \left( h^{*}_{c} \right) \right] } \end{aligned}$$
(3)

where n is the number of classes. In summary, to detect whether a sample can be considered OOD, we obtain its feature vector representation, apply the class-wise dimensionality reduction, and, for each low-dimensional representation, calculate the log-likelihood under its corresponding class distribution. The OOD score is the maximum log-likelihood over all classes. For an ID sample, the likelihood is low for all classes except its true class, which yields a high score s. For an OOD sample, the likelihood tends to be low for all class-wise distributions, so the score s is also low. Thereby, ID and OOD samples can be distinguished by setting a threshold on s.
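
Putting the pieces together, the scoring step can be sketched as follows (a diagonal-covariance Gaussian is assumed for simplicity, with one univariate factor per component of \(h^{*}_c\); the helper names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def fit_gaussians(projected_by_class):
    """Estimate (mu_c, sigma_c) of Eq. 2 from each class's projected
    training features, given as {class: array of shape (N_c, d)}."""
    return {c: (z.mean(axis=0), z.std(axis=0) + 1e-8)
            for c, z in projected_by_class.items()}

def ood_score(h_star_by_class, gaussians):
    """Eq. 3: the maximum log-likelihood of the class-wise projections
    h*_c over all known classes; low values suggest an OOD sample."""
    return max(norm.logpdf(h_star_by_class[c], mu, sigma).sum()
               for c, (mu, sigma) in gaussians.items())
```

An ID sample matches at least one class-conditional Gaussian and receives a high score s, while an OOD sample fits none of them and receives a low score, so a single threshold on s separates the two cases.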

3.2 OOD Detection in Skin Lesion Classification

With the world constantly evolving, new medical pathologies are frequently discovered through diagnosis. However, the identification of novel or rare diseases can be troublesome for DL-based automated diagnosis, potentially leading to incorrect classification and inappropriate treatment [36]. In such cases, OOD detection methods can play an essential role in identifying whether a new sample belongs to any of the known classes of the problem, thus providing an auxiliary task for DL-based approaches.

Specifically for dermatological tasks, OOD detection strategies can be handy for identifying samples from classes unseen during the training phase. For instance, consider a deep learning problem aimed at automatically identifying the three most common skin lesions, as illustrated in Fig. 2. In an open-set scenario, the trained model may encounter unseen skin lesions, so it would be preferable to detect them as unknown instead of erroneously classifying them as the closest known class. Ideally, the OOD detection method should be capable of correctly classifying these samples as OOD, but this may depend on the chosen strategy [6]. Especially when ID and OOD samples are visually similar, it can be challenging to distinguish between known and unknown classes [30].

Fig. 2. Examples of In-Distribution and Out-of-Distribution samples of skin lesions

This article evaluates the feature space-based OOD detection method in different scenarios. In some experiments, we verify the capability of OpenPCS-Class to detect near-OOD samples, typically skin lesion images taken in the same settings as the ID samples. We also evaluate the OOD detection approaches on the same problem (skin lesions) but under different acquisition conditions, to assess how such strategies behave. Finally, we also evaluate these models on far-OOD samples, still related to medical applications.

4 Experiments

This section presents the experimental protocol for our case studies in the OOD detection task. In this work, we focused on the skin lesion classification problem, selecting different medical-related samples as OOD.

4.1 OOD Methods

We evaluated three robust methods commonly employed in this area to assess the OOD detection results. One of them is the Maximum Softmax Probability (MSP) method [11], a traditional approach that utilizes the softmax probability vector to identify unknown samples. By computing the maximum probability value over all classes, MSP assumes that a lower score suggests that the model is less confident about the predicted class, which could indicate an OOD sample.

Another method we selected is Energy-Based Out-of-Distribution detection (EBO) [16], a more sophisticated technique that uses the output space to calculate the OOD score. EBO computes an energy score from the logits (the negative log-sum-exp) and employs it as an OOD score to distinguish between OOD and ID samples.

We also opted for a feature space-based method for OOD detection to provide a more insightful discussion of the OpenPCS-Class approach. The Mahalanobis OOD detection method [12] measures the OOD score as the Mahalanobis distance to the closest class-conditional Gaussian distribution in the feature space. In this case, OOD samples are expected to lie farther from the class-conditional distributions than ID samples.
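
For reference, the three baseline scores can be sketched as follows, with higher values suggesting ID in every case (the exact normalizations in the cited works may differ, so this is only an illustrative implementation):

```python
import numpy as np

def msp_score(logits):
    """Maximum Softmax Probability [11]: peak of the softmax vector."""
    e = np.exp(logits - logits.max())
    return (e / e.sum()).max()

def energy_score(logits, T=1.0):
    """EBO [16]: negative energy, i.e., T * logsumexp(logits / T),
    computed in a numerically stable way."""
    z = logits / T
    return T * (np.log(np.exp(z - z.max()).sum()) + z.max())

def mahalanobis_score(h, means, inv_cov):
    """Mahalanobis detector [12]: negative squared distance to the
    closest class-conditional Gaussian (shared covariance)."""
    return -min(float((h - mu) @ inv_cov @ (h - mu)) for mu in means)
```

MSP and EBO read only the output space (logits), while the Mahalanobis detector reads the feature space, mirroring the distinction drawn in this section.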

4.2 Datasets

For the OOD detection in medical multi-class classification, we utilized the HAM10000 dataset as our In-Distribution dataset (\(D_{in}\)) for skin lesion classification [27]. This dataset comprises 10,015 images of seven distinct skin lesions: Melanocytic nevi, Melanoma, Benign keratosis-like lesions, Basal cell carcinoma, Actinic keratoses, Vascular lesions, and Dermatofibroma. For this study, we selected the first four classes as our ID classes, which contain 6705, 1113, 1099, and 514 samples, respectively. In addition, we designated the remaining three classes as OOD samples to form the basis of our first case study. For the other experiments, we maintain the same \(D_{in}\) and ID classes, changing only the OOD samples.

The dataset for the second case study consists of a wide range of skin lesion images taken on different parts of the body. However, a significant issue with this dataset is that some classes overlap with those found in \(D_{in}\). Therefore, to ensure a fair evaluation of the OOD detection task, we remove the overlapping classes from \(D_{out}\).

The third selected \(D_{out}\) is related to the monkeypox classification problem [1]. The dataset contains images of monkeypox lesions and of other skin lesions (e.g., chickenpox), given that the problem was originally framed as a binary classification task to identify whether a lesion is monkeypox. Therefore, we used all images from this dataset as OOD samples in the third experiment.

For the fourth case study, we have manually selected images of rare skin lesions that do not belong to any of the classes in \(D_{in}\). In this experiment, we have included additional ID images obtained from different circumstances than those found in the HAM10000 dataset. This collection of images will enable us to gain practical insights into the identification of unknown and uncommon classes in skin lesion classification and evaluate how the OOD detection methods perform when presented with different ID samples.

4.3 Metrics

To compare the methods, we selected three metrics to evaluate the OOD detection task in multi-classification problems [35].

AUROC (Area Under the Receiver Operating Characteristic curve) summarizes the Receiver Operating Characteristic (ROC) curve by calculating the area under it. As the ROC curve is usually used in binary classification problems, to evaluate the OOD detection task with this metric we consider only the ID and OOD classes, independently from the fine-grained classes. Mathematically, the AUROC can be approximated by evaluating the True Positive Rate (TPR) and False Positive Rate (FPR) at discrete threshold values, as presented in Eq. 4

$$\begin{aligned} \text {AUROC} = \sum _{i=1}^{n-1} \frac{1}{2} (x_{i+1} - x_i) (y_i + y_{i+1}) \end{aligned}$$
(4)

where n is the number of thresholds, \(x_i\) and \(y_i\) are the false positive and true positive rates, respectively, at the i-th threshold.

AUPR (Area Under the Precision-Recall Curve) is a metric that summarizes the Precision-Recall trade-off over different threshold values for a specific class. This metric is highly important for imbalanced problems, which may be the case in our experiments. Therefore, we calculate the AUPR for the OOD class.

FPR95 indicates the False Positive Rate (FPR) when the True Positive Rate (TPR) is 95%. Typically, FPR95 describes how likely the method is to erroneously classify an ID sample as unknown at a reasonably high TPR. Therefore, the lower the FPR95, the better the OOD detection method. Unlike the other metrics, this one is threshold-dependent, since we define a cutoff value to classify a sample as known or unknown.
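
These three metrics can be computed directly from the OOD scores of the test samples; a sketch using scikit-learn, treating OOD as the positive class and negating the score s so that higher values indicate OOD (the function name is illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, roc_curve

def ood_metrics(scores_id, scores_ood):
    """AUROC, AUPR (OOD as the positive class), and FPR95 from the OOD
    scores s of the test samples, where higher s means more likely ID."""
    y = np.concatenate([np.zeros(len(scores_id)), np.ones(len(scores_ood))])
    s = -np.concatenate([scores_id, scores_ood])  # flip: higher -> OOD
    auroc = roc_auc_score(y, s)
    aupr = average_precision_score(y, s)
    fpr, tpr, _ = roc_curve(y, s)
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]       # FPR at TPR >= 95%
    return auroc, aupr, fpr95
```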

4.4 Experimental Details

To evaluate the proposed approach, we used the same experimental procedure in all experiments. For \(D_{in}\), we split the dataset proportionally into training (60%), validation (20%), and test (20%) sets. We fit the OOD detection methods using the training set and, for all experiments, we use the test samples from \(D_{in}\) and the whole \(D_{out}\) to evaluate the separability between ID and OOD samples. During the testing phase, we randomly selected 500 samples from each set of \(D_{in}\) and \(D_{out}\) (when applicable) and computed the average metrics over ten runs. We also used the Wilcoxon signed-rank test to verify the statistical significance between the best metric result and all others. In Sect. 5, we denote an average result with a statistical difference using an underscore in the tables.
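
The splitting and resampling protocol can be sketched as follows (the labels array is hypothetical and stands in for the ID class annotations; a proportional split corresponds to stratifying by class):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=9431)        # hypothetical ID labels
idx = np.arange(len(labels))

# 60/20/20 stratified split: carve out 40%, then halve it into val/test.
train, rest = train_test_split(idx, test_size=0.4, stratify=labels,
                               random_state=0)
val, test = train_test_split(rest, test_size=0.5, stratify=labels[rest],
                             random_state=0)

# Ten evaluation runs, each drawing 500 test samples without replacement.
runs = [rng.choice(test, size=500, replace=False) for _ in range(10)]
```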

We also assess the impact of the OOD detection methods on different model architectures. As our problem is related to image classification, we selected three models: the Vision Transformer (ViT) [7], ConvNeXT [17], and ResNet [8]. For the first two architectures, we used weights pre-trained on ImageNet1k and fine-tuned the classification layer on the \(D_{in}\) problem. The ResNet model was trained from scratch, following a training procedure similar to that presented in the literature [29].

5 Discussion and Results

This section contains the results of the four case studies in medical applications. It is important to note that the selected OOD detection methods are similar in their experimental setup (i.e., they do not require any model retraining and need just one forward pass to identify OOD samples) but use different approaches to detect unseen classes.

For the first experiment, Table 1 summarizes the results for different OOD detection methods and architectures.

Table 1. OOD Detection Results for Experiment 1

The first experiment is a more challenging scenario for discriminating whether a sample belongs to a known or unknown class. This is reflected in the OOD detection metrics, which show a lower AUROC score for all of the methods compared to the other experiments. Even so, we noticed that OpenPCS-Class outperformed the other three methods in terms of AUROC and AUPR, independently of the model architecture. Moreover, the FPR95 shows that this approach can improve the OOD detection task in a realistic scenario, considering the threshold that yields a TPR of 95%. In that case, OpenPCS-Class lowers the FPR under this condition by up to 7.1% (compared to the MSP method with the ResNet model).

The model architecture plays an important role in OOD detection. In this experiment, the ViT model increased the capability to detect OOD samples, at least for the OpenPCS-Class method. Especially for feature-based approaches, the chosen model can directly impact the results, given that different architectures yield different feature activations.

The second experiment can be considered an easier OOD detection task compared to the first one. Although the classes from \(D_{out}\) are similar in the first two experiments, the images were obtained from different body parts, which can facilitate the OOD detection task. The results for the second experiment can be observed in Table 2.

Table 2. OOD Detection Results for Experiment 2

In this experiment, the approaches based on the feature space had a better OOD detection capability than those that use the output space. In fact, the feature space can contain both low-level and high-level feature information, which can help to detect unknown classes in different contexts. On the other hand, the output space does not contain this kind of information, which may explain the difference between these approaches. These strategies therefore directly impact the scores generated, as illustrated in Fig. 3.

Fig. 3. ID and OOD score distributions: (a) maximum softmax score from MSP; (b) maximum likelihood from OpenPCS-Class

The main objective of OOD detection is to yield scores that make it easy to distinguish between ID and OOD samples. To evaluate the distributions shown in Fig. 3, we conducted a Welch t-test [38], which rejected the hypothesis that the ID and OOD distributions have equal means (\(p < 0.05\)) only for OpenPCS-Class.

In this experiment, OpenPCS-Class outperformed the other three OOD detection methods (a 48.7% decrease in FPR95 compared to Mahalanobis with ConvNeXT). However, there is only a slight difference between the OpenPCS-Class and Mahalanobis methods, depending on the model architecture. For transformer-based models, both methods obtained AUROC and AUPR metrics close to one. This result corroborates recent findings that Transformer-based architectures can enhance the robustness of OOD detection [23].

The third experiment uses skin lesion pathologies that are more dissimilar to those present in \(D_{in}\). The results are presented in Table 3.

Table 3. OOD Detection Results for Experiment 3

Although the \(D_{out}\) in the third case study contains images of skin lesions, we noticed that all methods, independently of the model architecture, improved their OOD detection metrics compared to the previous experiments. As the OOD samples are related to diseases like monkeypox and chickenpox, the images are more dissimilar to those presented to the model in the training phase (using \(D_{in}\)). Consequently, the confidence scores in the output space are low and the feature representations of OOD samples are more dissimilar to those of ID samples, resulting in an easier OOD detection task.

In this experiment, the feature-based approaches also showed a considerably high capability to detect samples visually dissimilar to those present in \(D_{in}\). For Transformer-based models, Mahalanobis and OpenPCS-Class obtained comparable results, given that both could almost completely separate ID and OOD samples. However, for the ResNet architecture, the difference between these approaches is more significant, showing better performance for the OpenPCS-Class method (a 5.1% increase compared to the Mahalanobis detector).

For the last case study, Table 4 displays the results for OOD detection using the same experimental protocol as the previous experiments.

Table 4. OOD Detection Results for Experiment 4

In this experiment, we observed that feature-based approaches performed comparatively better in detecting a wide range of pathologies as OOD. Even in the presence of new ID images slightly different from those found in \(D_{in}\), OpenPCS-Class outperformed the other methods in all three evaluation metrics. Therefore, even with visually different ID samples, the feature space-based approaches obtained better results in the OOD detection task.

Although we only present the distributions for the second experiment, we applied the Welch t-test to all experiments in this section. For all experiments using the transformer-based architectures, we noted that the ID and OOD score distributions can be easily distinguished (i.e., they have different means) for OpenPCS-Class.

6 Conclusions

In this work, we evaluated OpenPCS-Class in a new application domain for OOD detection, more specifically skin lesion problems. The feature space-based approaches, in general, obtained superior OOD detection when the OOD samples were visually more dissimilar to the ID ones, corresponding to the latter three experiments of this work.

Compared to all the methods evaluated in the experiments, OpenPCS-Class performed best in all scenarios regarding AUROC, and in 9 (out of 12) in terms of average FPR95. More interestingly, the transformer-based models were more suitable for the OpenPCS-Class method, which always obtained superior OOD detection results with them.

Going forward, we aim to evaluate OpenPCS-Class in different medical classification problems to gain a better perspective on feature space-based models in OOD detection applications.