1 Introduction

The Machine Learning (ML) literature extensively provides algorithmic developments focused on model hyperparameter tuning and related model-centric tasks. More recently, the data-centric Artificial Intelligence (AI) community has shifted the focus toward understanding the data and improving its quality, rather than developing ever more complex ML models [13].

Paving the way for such a data-centric approach is a more fine-grained analysis of the data and of classification performance. Aggregated measures for classification problems, such as accuracy and precision, limit the understanding of the particularities of the data the algorithms are modeling. These aggregated metrics provide no information about misclassification at the instance level or about why individual instances are misclassified. However, a more reliable usage of ML algorithms must reveal which particular instances a model struggles to classify correctly and why. One way to achieve such understanding is to correlate data characteristics extracted by a set of meta-features [12] with the predictive performance of multiple algorithms, in a meta-learning (MtL) approach [2].

One particular set of meta-features is the set of data complexity measures, previously proposed by Ho and Basu [5] to explore the overall complexity of solving a classification problem given the dataset available for learning, providing a global perspective of the difficulty of the problem [1]. Since these measures can fail to provide information at the instance level [8], the Instance Hardness Measures (IHMs) were introduced by Smith et al. [14] to characterize the difficulty level of each individual instance of a dataset, giving information on which particular instances are misclassified and why. These developments attend to the growing interest in responsible AI that has emerged in recent years, leading researchers to focus on the reliability and trustworthiness of the predictions obtained by ML models.

Nonetheless, current IHMs require the instance label to be computed, which restricts their use to analyzing and curating ML training datasets. When ML is used in production, where the class of an instance is unknown, adaptations are needed. This paper proposes alternative instance hardness measures for instances that do not have a label. The idea is to leverage knowledge of the hardness of the labeled training dataset to assess the hardness level of new unlabeled instances. In the future, this knowledge can support reject-option strategies, in which the ML model abstains from predictions that would be uncertain [4].

First, a set of IHMs is adapted to disregard the labels of new instances in their computation. A second strategy generates regression meta-models that estimate the IHMs of new unlabeled observations, in a meta-learning approach at the instance level. Both approaches are compared experimentally using one synthetic dataset and four datasets from the health domain, which are known to contain hard instances. Instances whose characteristics place them in overlapping or borderline regions of the classes are highlighted as hard to classify by both approaches. The adjusted measures show a higher correlation with the original instance hardness values and prove to be an adequate alternative for estimating instance hardness in the deployment stage, driving the solutions to a more refined level and contributing toward a more trustworthy use of ML models.

The paper is organized as follows: Sect. 2 details the hardness measures to apply to unlabeled data and how they were modified from the original measures. Section 3 presents the materials and methods used in experiments, whose results are presented in Sect. 4. Finally, Sect. 5 presents the conclusions of this work.

2 Instance Hardness Measures

The concept of instance hardness was introduced in the seminal work of Smith et al. [14] as an alternative for a fine-grained analysis of classification difficulty. They define an instance as hard to classify if it gets consistently misclassified by a set of classifiers of different biases. They also define a set of measures to explain possible reasons why an instance is difficult to classify, which are regarded as instance-level meta-features in the literature [7].

The base IHMs adopted in this work are presented next, along with their adaptations, which are indicated by the “\({\text {adj}}\)” (adjusted) extension. In their definition, let \(\mathcal D\) be a training dataset with n pairs of labeled instances \((\textbf{x}_i,y_i)\), where each \(\textbf{x}_i \in \mathcal X\) is described by m input features and \(y_i \in \mathcal Y\) is the class of the instance in the dataset. The number of classes is denoted as C. And let \(\textbf{x}\) be a new instance for which the label is unknown.

To illustrate these concepts, consider the dataset in Fig. 1, containing two classes, red and blue. Two instances are highlighted: \(\textbf{x}_1\) and \(\textbf{x}_2\). The instance \(\textbf{x}_1\) lies in a borderline area between the classes and might be difficult to classify regardless of its class. The instance \(\textbf{x}_2\) is more aligned with the blue class. If the label registered for it in the dataset is blue, it will be easily classified; otherwise, it will have a hardness level higher than that of \(\textbf{x}_1\). Standard IHMs need to know these labels, so both \(\textbf{x}_1\) and \(\textbf{x}_2\) are contained in the labeled dataset \(\mathcal D\). This work introduces adaptations to estimate the hardness level of an instance in the absence of its label, meaning \(\textbf{x}_1\) and \(\textbf{x}_2\) are not in the labeled dataset \(\mathcal D\) used to estimate the hardness levels. Note that the two estimations differ: based on its characteristics, \(\textbf{x}_2\) will probably be easily classified as blue, whereas \(\textbf{x}_1\) will probably be considered hard to classify in both scenarios.

Fig. 1.
figure 1

Example of a dataset with highlighted instances: \(\textbf{x}_1\) is in a borderline region and can be difficult to classify regardless of its class; \(\textbf{x}_2\) might be easy or hard to classify depending on its registered label. (Color figure online)

2.1 Neighborhood-Based IHM

The hardness level of an instance can be obtained by considering its neighbourhood in the dataset. In the original IHMs, instances surrounded by elements sharing their own label can be considered easier to classify. For new data without labels, our approach seeks the neighbourhood of the instance in the labeled dataset \(\mathcal D\) and assigns a higher hardness level when there is a mix of different classes in this region.

  • k-Disagreeing Neighbors kDN: the original kDN measure computes the percentage of the k nearest neighbors of \(\textbf{x}_i\) in the dataset \(\mathcal D\) that have a label different from that of the instance itself:

    $$\begin{aligned} \text {kDN}(\textbf{x}_i,y_i) = \frac{\sharp \{\textbf{x}_j | \textbf{x}_j \in \text {kNN}(\textbf{x}_i) \wedge y_j \ne y_i\}}{k}, \end{aligned}$$
    (1)

    where \(\text {kNN}(\textbf{x}_i)\) represents the set of k-nearest neighbors of the instance \(\textbf{x}_i\) in the dataset \(\mathcal D\). An instance is considered harder to classify when the value of kDN\((\textbf{x}_i,y_i)\) is higher. Values close to 1 indicate an instance surrounded by examples from a class different from its own; this would be \(\textbf{x}_2\)’s case in Fig. 1 if it were labeled red in \(\mathcal D\). Intermediate values of kDN\((\textbf{x}_i,y_i)\) are found for borderline instances. Easier instances are those surrounded by elements sharing their class label, which would correspond to \(\textbf{x}_2\) when it has a blue label.

    In the absence of an instance’s label, an alternative way to measure the mixture of classes in its neighbourhood is to compute an entropy measure. Specifically, the entropy is computed based on the proportions of the classes found in the instance’s neighbourhood. Higher entropy values indicate that the new instance lies in regions of \(\mathcal D\) near elements from different classes. This corresponds to the \(\textbf{x}_1\) case in Fig. 1. In contrast, \(\textbf{x}_2\) will be regarded as easy to predict, as it is surrounded by elements of the blue class.

    $$\begin{aligned} \text {kDN}_{\text {adj}}(\textbf{x}) = -\sum _{i=1}^{C} p(y_j = c_i) \log p(y_j = c_i), \text { for } \textbf{x}_j \in \text {kNN}(\textbf{x}), \end{aligned}$$
    (2)

    where \(p(y_j = c_i)\) are the proportions of the classes of the k-nearest neighbours of \(\textbf{x}\) in the dataset \(\mathcal D\).

  • Ratio of the Intra-class and Extra-class Distances, N2\(_{\text {IHM}}\): the original measure takes the complement of the ratio of the distance of \(\textbf{x}_i\) to the nearest example from its class in \(\mathcal D\) to the distance it has to the nearest instance from a different class (nearest enemy) in \(\mathcal D\), with a normalization, as presented next:

    $$\begin{aligned} \text {N2}_{\text {IHM}}(\textbf{x}_i,y_i) = 1 - \frac{1}{\textrm{IntraInter}(\textbf{x}_i,y_i)+1}, \end{aligned}$$
    (3)

    where:

    $$\begin{aligned} \textrm{IntraInter}(\textbf{x}_i,y_i) = \frac{d(\textbf{x}_i,\text {NN}(\textbf{x}_i\in y_i) )}{d(\textbf{x}_i,\text {NE}(\textbf{x}_i))}, \end{aligned}$$
    (4)

    where d is a distance function, NN\((\textbf{x}_i\in y_i)\) represents the nearest neighbor of \(\textbf{x}_i\) from its class and NE\((\textbf{x}_i)\) is the nearest enemy of \(\textbf{x}_i\) (NE\((\textbf{x}_i) = \text {NN}(\textbf{x}_i \in y_j \ne y_i\))). In this formulation, when an instance is closer to an example from another class than to an example from its own class, the N2\(_{\text {IHM}}\) value will be larger, indicating that the instance is harder to classify. This would correspond to the case where \(\textbf{x}_2\) in Fig. 1 has the red label.

    The alternative measure for unlabeled instances can be obtained by taking the ratio of the distance from \(\textbf{x}\) to its closest element in \(\mathcal D\), denoted as \(\textbf{x}_j\) in Eq. 5, to the distance from \(\textbf{x}\) to the closest element of another class in \(\mathcal D\), that is, a class different from that of \(\textbf{x}_j\). This ratio assumes values close to 1 when the instance is almost equally distant from different classes, which is more likely for borderline instances, such as \(\textbf{x}_1\) in Fig. 1.

    $$\begin{aligned} \text {N2}_{\text {adj}}(\textbf{x}) = \frac{\min (d(\textbf{x},\textbf{x}_j))}{\min (d(\textbf{x},\textbf{x}_k)|y_k \ne y_j)} \end{aligned}$$
    (5)
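The two neighborhood-based adaptations can be sketched as follows, assuming Euclidean distance; the function names, the use of the natural logarithm, and the default k are our choices, not the paper's:

```python
import numpy as np
from collections import Counter

def kdn_adj(x, X_train, y_train, k=5):
    """Adjusted kDN (Eq. 2): entropy of the class proportions among the
    k nearest labeled neighbours of the unlabeled instance x."""
    dists = np.linalg.norm(X_train - x, axis=1)
    neigh = y_train[np.argsort(dists)[:k]]
    props = np.array(list(Counter(neigh).values())) / k
    return float(-np.sum(props * np.log(props)))

def n2_adj(x, X_train, y_train):
    """Adjusted N2 (Eq. 5): distance from x to its nearest labeled neighbour
    x_j over the distance to the nearest element of a class other than y_j."""
    dists = np.linalg.norm(X_train - x, axis=1)
    j = int(np.argmin(dists))
    return float(dists[j] / dists[y_train != y_train[j]].min())
```

For a borderline instance such as \(\textbf{x}_1\), kdn_adj returns a high entropy and n2_adj approaches 1; for an instance deep inside a single class, both values stay low.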

2.2 Class Likelihood IHM

This type of measure captures whether the instance is well situated in its class, considering the general patterns of that class. For this, the likelihood can be estimated assuming the input features are independent, which simplifies the computations.

  • Class Likelihood Difference CLD: the original measure takes the complement of the difference between the likelihood that \(\textbf{x}_i\) belongs to its class \(y_i\) and the maximum likelihood it has for any other class. The complement is taken to standardize the interpretation of the direction of hardness, since for an easy instance the likelihood of belonging to its own class is larger than that for any other class [9]:

    $$\begin{aligned} \text {CLD}(\textbf{x}_i,y_i) = \frac{1 -\left( p(\textbf{x}_i|y_i)p(y_i) - {\max }_{y_j \ne y_i}[p(\textbf{x}_i |y_j)p(y_j)]\right) }{2}, \end{aligned}$$
    (6)

    where \(p(y_i)\) is the prior of class \(y_i\), set as \(\frac{1}{C}\) for all data instances, and \(p(\textbf{x}_i|y_i)\) represents the likelihood that \(\textbf{x}_i\) belongs to class \(y_i\), which can be estimated considering the input features independent of each other, as in Naïve Bayes classification. For example, if \(\textbf{x}_2\) in Fig. 1 is labeled blue in \(\mathcal D\), it will be easy according to this measure, as its likelihood for the blue class will be higher than for the red class.

    When the class of an instance cannot be defined in advance, the hardness measure can be estimated by the difference between the two highest likelihoods among all possible classes in the dataset. As in the original measure, the complement of the difference is taken to keep the interpretation that higher values correspond to instances that are harder to classify. The values of this measure tend to be higher for borderline instances, since their likelihoods of belonging to different classes will be similar.

    $$\begin{aligned} \text {CLD}_{\text {adj}}(\textbf{x}) = \frac{1 -\left( {\max }_{y_i}[p(\textbf{x}|y_i)p(y_i)] - {\max }_{y_j \ne y_i}[p(\textbf{x}|y_j)p(y_j)]\right) }{2}. \end{aligned}$$
    (7)
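A sketch of the adjusted CLD, assuming Gaussian class-conditional densities with independent features (as in Gaussian Naive Bayes) and uniform priors; the variance smoothing term and the function name are our assumptions. Note the joint likelihoods are densities, so the score is comparable only within a dataset:

```python
import numpy as np

def cld_adj(x, X_train, y_train):
    """Adjusted CLD (Eq. 7): complement of the difference between the two
    largest joint likelihoods p(x|y)p(y), with uniform prior p(y) = 1/C."""
    classes = np.unique(y_train)
    C = len(classes)
    joint = []
    for c in classes:
        Xc = X_train[y_train == c]
        mu, var = Xc.mean(axis=0), Xc.var(axis=0) + 1e-9  # variance smoothing
        dens = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        joint.append(np.prod(dens) / C)  # independent features, prior 1/C
    top, second = sorted(joint, reverse=True)[:2]
    return (1.0 - (top - second)) / 2.0
```

An instance far inside one class has one dominant likelihood and a value below 0.5; a borderline instance has two near-equal likelihoods and a value close to 0.5.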

2.3 Tree-Based IHM

Decision trees (DTs) can be used to estimate the hardness level of an instance based on the number of splits necessary to classify it: if many splits are required, the instance is harder to classify. The DT is built on the labeled dataset \(\mathcal D\). Unlabeled instances are then input to the built DT, and the measure is computed based on the leaf where each instance is classified.

  • Disjunct Class Percentage DCP: from a pruned decision tree (DT) built on \(\mathcal D\), the leaf node where the instance is classified is considered the disjunct of \(\textbf{x}_i\). The complement of the percentage of instances in this disjunct that share the same label as \(\textbf{x}_i\) gives the original DCP measure:

    $$\begin{aligned} \text {DCP}(\textbf{x}_i,y_i) = 1- \frac{\sharp \{\textbf{x}_j | \textbf{x}_j \in \text {Disjunct}(\textbf{x}_i) \wedge y_j = y_i\}}{\sharp \{\textbf{x}_j|\textbf{x}_j \in \text {Disjunct}(\textbf{x}_i)\}}, \end{aligned}$$
    (8)

    where Disjunct\((\textbf{x}_i)\) represents the instances contained in the disjunct (leaf node) where \(\textbf{x}_i\) is placed. For instances that are easy according to this measure, larger percentages of examples sharing their label will be found in their disjunct. For example, if \(\textbf{x}_2\) in Fig. 1 has the red label in \(\mathcal D\), it will probably be placed in a leaf node containing many elements of the blue class, making it harder to classify according to this measure.

    In scenarios where the instance’s class is unknown, we take the entropy of the disjunct where the instance is placed as a hardness measure, similarly to what has been done for kDN.

    $$\begin{aligned} \text {DCP}_{\text {adj}}(\textbf{x}) = - \sum _{i=1}^{C} p(y_j = c_i) \log p(y_j = c_i), \text { for } \textbf{x}_j \in \text {Disjunct}(\textbf{x}), \end{aligned}$$
    (9)

    where the proportions of the classes are taken based on the disjunct where \(\textbf{x}\) is placed in the DT built using the dataset \(\mathcal {D}\).

  • Tree Depth TD: the original measure gives the depth of the leaf node that classifies \(\textbf{x}_i\) in a DT built using the whole labeled dataset \(\mathcal D\), normalized by the maximum depth of the tree:

    $$\begin{aligned} \text {TD}(\textbf{x}_i,y_i) = \frac{\text {depth}_{\text {DT}}(\textbf{x}_i)}{\max (\text {depth}_{\text {DT}}(\textbf{x}_j \in \mathcal D))}, \end{aligned}$$
    (10)

    where \(\text {depth}_{\text {DT}}(\textbf{x}_i)\) gives the depth at which the instance \(\textbf{x}_i\) is placed in the DT. Instances harder to classify tend to be placed at deeper levels of the tree, making TD higher. There are two versions of this measure: one derives from a pruned tree (\(\text {TD}_{\text {P}}\)) and the other from an unpruned tree (\(\text {TD}_{\text {U}}\)). For unlabeled instances, the procedure for hardness estimation is the same as in DCP: the DT is built from the labeled set \(\mathcal D\), and the unlabeled instance is then submitted to the built DT. The depth of the leaf node where this instance is classified is used in the equation:

    $$\begin{aligned} \text {TD}_{\text {adj}}(\textbf{x}) = \frac{\text {depth}_{\text {DT}}(\textbf{x})}{\max (\text {depth}_{\text {DT}}(\textbf{x}_j \in \mathcal D))}. \end{aligned}$$
    (11)
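Both tree-based adaptations can be sketched with a scikit-learn decision tree. This sketch uses an unpruned tree (the TD\(_\text {U}\) variant); cost-complexity pruning via `ccp_alpha` would give the pruned variants. The function name is ours:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_adj(x, X_train, y_train):
    """Adjusted DCP (Eq. 9) and TD (Eq. 11): fit a DT on the labeled data,
    route the unlabeled instance x to a leaf, and read hardness off that leaf."""
    dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    x = np.asarray(x).reshape(1, -1)
    leaf = dt.apply(x)[0]
    # DCP_adj: entropy of the training-class proportions in x's leaf
    labels = y_train[dt.apply(X_train) == leaf]
    props = np.bincount(labels) / len(labels)
    props = props[props > 0]
    dcp = float(-np.sum(props * np.log(props)))
    # TD_adj: depth of x's leaf over the maximum depth of the tree
    depth = dt.decision_path(x).sum() - 1  # edges from root to the leaf
    td = float(depth) / dt.get_depth()
    return dcp, td
```

A pure leaf yields DCP\(_\text {adj}\) of zero, while a mixed leaf, typical of overlapping regions, yields a positive entropy.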

2.4 Using Meta-models to Estimate IHM

Meta-learning is a traditional ML task that uses data about ML itself [2]. Here, MtL is designed to predict IHM values for instances without requiring their labels. This is done by using the original input features from the dataset \(\mathcal D\) to learn the expected IHM values in a regression task. Therefore, in this approach, regression meta-models are induced to estimate the IHM values of new instances. Their training datasets comprise the original input features of \(\mathcal D\) and a target corresponding to an IHM computed from \(\mathcal D\) in its original formulation. There is one regression model per IHM.

The estimation of the IHM values for unlabeled data with this meta-learning approach is compared to the usage of the adjusted IHM values.
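A minimal sketch of this meta-learning pipeline, using the original kDN (Eq. 1) as an example target; the helper name and the k value are assumptions, and in the paper one such regressor is trained per IHM:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors

def fit_kdn_meta_model(X_train, y_train, k=5):
    """Compute the original (label-aware) kDN on the labeled set, then fit a
    regressor mapping the input features to those IHM values."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_train)
    _, idx = nn.kneighbors(X_train)  # column 0 is the instance itself
    kdn = (y_train[idx[:, 1:]] != y_train[:, None]).mean(axis=1)
    return RandomForestRegressor(random_state=0).fit(X_train, kdn)

# New, unlabeled instances then get an estimated hardness via:
# estimated = fit_kdn_meta_model(X, y).predict(X_new)
```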

3 Materials and Methods

In this section, we describe the materials and methods used in the experiments performed to analyze the behaviour of IHMs for unlabeled data in classification problems.

3.1 Datasets

Five datasets are employed in the experiments. The first dataset was created synthetically, containing three classes with some overlap. The other four datasets are from the health domain, in which some instances are hard to classify due to overlapping attribute values across classes or to inconsistencies. Two of them are from the UCI public repository [6] and have been employed in previous related work [8, 11]. The last two are related to severe COVID-19 cases in two large hospitals of the São Paulo metropolitan area [15]. The main characteristics of the five datasets are presented in Table 1, including the number of instances, classes and input features.

Table 1. Summary of the datasets used in the study.

The blobs dataset was generated synthetically using the make_blobs function from the scikit-learn library [10], which generates isotropic Gaussian blobs. The standard deviation of the blobs was set to 2 to create some overlap between the classes and regions where classifying the instances is harder than in others. Figure 2 presents this dataset, where it is possible to notice some overlap in the borderline regions of the classes.
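A dataset with the same design can be generated as follows; the sample count and random seed are assumptions, as the paper specifies only three classes and a standard deviation of 2:

```python
from sklearn.datasets import make_blobs

# Three isotropic Gaussian blobs in 2-D with some overlap between classes
X, y = make_blobs(n_samples=500, centers=3, cluster_std=2.0, random_state=0)
```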

Fig. 2.
figure 2

Illustration of the blobs dataset.

The diabetes dataset is related to the incidence of diabetes in female patients of Pima Indian heritage who are at least 21 years old. The objective is to identify the presence of diabetes. The predictive variables record blood indices and patient characteristics, such as number of pregnancies and age [6].

The heart dataset registers heart disease in patients and has features collected during the exercise test, others reflecting blood indices and personal characteristics of the patients, such as age and gender [6].

The last two datasets, named hospital, were extracted from the raw public database provided by the FAPESP COVID data sharing initiative [3]. The binary response categorized patients as severe when the hospital stay was greater than or equal to 14 days or when the patient progressed to death. The features collected in these datasets were related to blood indices, age and gender [15].

3.2 Methodology

The adjusted IHMs proposed in this paper were applied to the datasets by considering each instance as unlabeled, one at a time, with the remaining instances labeled, resembling a leave-one-out (LOO) cross-validation scheme.
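The LOO scheme can be sketched as follows, where `measure` stands for any of the adjusted IHMs from Sect. 2 (the callable signature is our assumption):

```python
import numpy as np

def loo_adjusted_hardness(X, y, measure):
    """Treat each instance as unlabeled in turn and score it with an
    adjusted IHM computed from the remaining labeled instances."""
    n = len(X)
    values = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        values[i] = measure(X[i], X[mask], y[mask])
    return values
```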

The same procedure is used to generate the meta-models that predict the IHM values, where one instance is left out as unlabeled at a time. The IHMs of the other instances are calculated using their original formulations. Next, a meta-dataset is built, mapping the original features of the instances to the computed IHM values. Regression meta-models are induced to learn this relationship and to predict the expected IHM value of the left-out instance. One meta-model is induced per IHM considered. We used the Random Forest Regressor (RF) available in the Scikit-learn library [10] with default hyperparameter values to generate these meta-models.

We also computed the original IHMs for the entire datasets, which use the labels of all instances. Next, we compare the association between the IHM values of the original measures and those of the estimated measures, where the estimation is given either by the adjusted measures or by the induced meta-models. Spearman’s correlation provides a non-parametric estimate of the association (monotonic relationship) between the modified measures and the original measures. This correlation captures whether the direction of the adjusted/estimated IHM is the same as that of the original IHM. Higher Spearman’s correlation values indicate a stronger association between the estimated and original IHMs.
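Spearman’s correlation can be computed with SciPy; the arrays below are hypothetical IHM values, chosen so that the two measures rank the instances identically:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical hardness values for six instances under two measures
original_ihm = np.array([0.10, 0.20, 0.40, 0.55, 0.80, 0.90])
adjusted_ihm = np.array([0.05, 0.25, 0.30, 0.60, 0.70, 0.95])  # same ranking

rho, _ = spearmanr(original_ihm, adjusted_ihm)  # rho == 1.0: identical ordering
```

Because Spearman’s coefficient operates on ranks, it rewards agreement in the hardness ordering even when the two measures live on different scales.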

We expect medium to high correlations, although there can be deviations, since the adjusted measures do not strive to deliver identical IHM values. Indeed, instances with noisy labels in the training datasets have characteristics that align them with another class and are expected to show a lower correlation with the original IHM values. But in most cases, we expect the hardness directions to be maintained.

All code and analyses are implemented in Python. The original IHMs are computed using the PyHard package [7, 9]. Code for the adjusted measures is available in a public repository: https://anonymous.4open.science/r/Adj-IHM-BF75. The k value in kDN was set to 10, the default value in the PyHard package.

4 Results

The results of the experiments are presented and discussed next.

4.1 Meta-models

First, we present the performance of the meta-models in the regression task. Table 2 presents the Mean Squared Error (MSE) obtained in predicting the IHMs using the regression meta-models. Lower values indicate better performance in predicting the original IHM values.

Table 2. MSE obtained for the RF algorithm concerning predicting the IHMs to different datasets.

For some measures, the MSEs are lower, demonstrating a better approximation of the original IHM values. This happens mostly for the tree-depth measures. For others, the approximations are not as good (e.g., for kDN and DCP). One possible explanation is that the tree-depth measures do not depend as much on the labels of the instances as the others do. The only difference between the original tree-depth measures and their estimated counterparts is the exclusion of one instance from the decision tree induction, which affects the results less. For the other measures, if an instance is incorrectly labeled, the original measures will point to it as very hard to classify, yet this instance might be easily classified into another class, making it easy without the label information.

4.2 Correlation Analysis

Table 3 shows Spearman’s correlation coefficients between the original IHMs and the measures obtained using the meta-learning approach. Values higher than 0.5 are highlighted in bold. The values for the estimated tree-depth measures are the highest, especially for the pruned version of the measure (TD\(_\text {P}\)). This happens because, in the pruned version of the tree, noisy and outlier instances tend to be placed in nodes that have undergone pruning. Therefore, the label of the particular instance matters less in the original IHM formulation. In contrast, the formulations of the original CLD, DCP and kDN measures are highly influenced by the label of each instance for which they are measured. This decreases the correlations, especially in datasets with many instances whose feature values are akin to one class despite being labeled as another class in the dataset. This is the case for the hospital 1 and 2 datasets, where situations such as wrongly labeled instances or overlapping feature values are more common.

Table 3. Spearman coefficient obtained for the RF algorithm concerning predicting the IHMs to different datasets.
Table 4. Spearman coefficient obtained for adjusted measures compared to the original IHMs in different datasets.

Table 4 presents the same results for the adjusted IHMs: their Spearman correlation with the original IHMs. As in Table 3, values higher than 0.5 are boldfaced. More boldfaced correlations are observed here. Similar observations concerning the higher correlation values for the tree-depth measures hold in Table 4 too. The correlations observed for the adjusted measures are generally higher than those observed for the measures estimated by the meta-regressors. To make the differences clearer, Fig. 3 plots the Spearman’s correlations of the adjusted IHMs and of the meta-models with the original values. Blue bars represent the correlation of the adjusted measures, while orange bars denote the meta-learning approach. Only for the blobs dataset and for the DCP-diabetes combination were the correlations of the meta-models higher than those of the adjusted IHMs. The blobs dataset has its difficult instances concentrated on the borders of the classes, while the other datasets may pose other sources of difficulty that are not captured when the labels are absent, such as label noise.

Fig. 3.
figure 3

Spearman’s correlation applied to the adjusted IHM vs. the original IHM (blue bars) and the predicted vs. expected values from MtL (orange bars). (Color figure online)

Figure 4 shows the instances of the blobs dataset colored by hardness according to the original IHM (left), the adjusted IHM (center) and the meta-learning approach (right). This is possible for this dataset because it is bi-dimensional. The harder an instance is to classify, the more intensely it is colored in red; instances that are easier to classify are filled with darker blue. The central areas of the plots contain the overlapping region between the three classes (see Fig. 2) and are, therefore, harder to classify. The first row corresponds to the kDN measure, while the second corresponds to the TD\(_\text {U}\) measure. For kDN, it is clear that the hardest instances are those on the borders of the classes. For TD\(_\text {U}\), the pattern observed in the three approaches shows that the hardness level is related to the partitions derived from the decision tree. All measures show similar behaviors, although for the adjusted kDN measure, more central instances have higher IHM values compared to the other measures. It is important to note that, since the adjusted measures can vary on a different scale from the original IHMs, the results presented in the plots were normalized between 0 and 1 to allow a direct comparison.

Fig. 4.
figure 4

Visualization of the measures kDN (top) and TD\(_\text {U} \) (bottom): original IHM (left), adjusted IHM (middle) and meta-learning approach (right) for the blobs dataset. (Color figure online)

4.3 Discussion

Considering the difference in the nature of the datasets, where blobs was artificially designed with three classes and the others are real-world health data, the Spearman’s correlations in Fig. 3 show that MtL achieves results more convergent with the original measures than the adjusted IHMs on the blobs dataset for most of the measures. Conversely, for the real datasets, the adjusted IHMs are more associated with the original measures for almost all measures.

The tree-depth measures had the highest correlations with the original measures for both the adjusted IHM and the MtL approach. This is mostly related to the fact that the original tree-depth measures do not depend so directly on the labels of the instances. The other measures all consider whether the labels in some vicinity are in accordance with the registered label of the instance, which makes them deviate more for mislabeled instances, for example.

This can be observed in Fig. 5, where the original and estimated IHMs kDN (top) and TD\(_\text {U}\) (bottom) are contrasted for all instances of the blobs dataset. The x-axis shows the original measures, whilst the y-axis shows the proposed counterparts. The adjusted kDN is normalized between 0 and 1 for direct comparison.

Fig. 5.
figure 5

The adjusted IHM vs the original IHM and meta-learning prediction vs original IHM for the blobs dataset.

For the kDN measure, one can observe that, as the hardness of the instances grows, both the adjusted and the original IHM increase, reaching their peak in the middle of the scale. After that, the estimated IHM moves in the opposite direction of the original measure. This result is expected for unlabeled data, given that an instance that is hard to classify without a label will be predicted as belonging to some other class, rather than being identified as an outlier of a specific class.

Conversely, in the unpruned TD graphs, the results indicate that the hardness of classification is independent of whether the class is known. For both the adjusted IHM and the meta-learning approaches, there is some linearity between these measures and the original IHM. This means that, for this measure, the proposed measures for unlabeled data capture the increase in classification hardness equivalently to when the class is given. This result can be expected considering the nature of the measure.

The CLD measure, the only one based on likelihoods, tracked the original measure more closely through the adjusted IHM for all datasets. Especially for datasets with two classes, the two estimates will often agree, namely when the classes with the first and second maximum likelihoods coincide.

Overall, both the adjusted and the meta-learning IHMs were able to assess the hardness level of unlabeled instances, with some prominence of the adjusted measures, which showed larger correlations with the original measures in most cases. They are also simpler to compute, as they do not require inducing an ML model as in the meta-learning approach. In the absence of labels, most measures are more effective at pointing to borderline instances as posing a higher difficulty for posterior classification.

5 Conclusion and Future Work

This research analyzed alternative ways to measure the hardness of instances in classification problems for scenarios where the label of an instance is unknown, that is, in the deployment stage. Standard IHMs from the literature were adapted to this scenario, and their results were compared to the alternative of generating regression meta-models to predict the IHM values. Both alternatives were effective, correlating with the original IHMs, which need to know the label of each instance. The correlations were higher for measures that rely less on the labels; for the other measures, lower correlations are expected, as their original formulations identify noise and outliers in the data based on their labels. The results encourage the usage of the adjusted measures in the deployment of ML models, allowing the identification of instances that ML models might struggle to classify.

In future work, we will explore the patterns found in the comparisons between the original and adjusted IHMs not presented here, as well as alternative measures for unlabeled data not addressed in this research. We will expand the application of the adjusted measures and of meta-learning to more datasets, and tuning the meta-models could lead to new findings about the characteristics of the instances. Another fruitful direction will be to explore the usage of the adjusted measures for designing classification rejection options.