title: On Model Evaluation Under Non-constant Class Imbalance
authors: Brabec, Jan; Komárek, Tomáš; Franc, Vojtěch; Machlica, Lukáš
date: 2020-05-23
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50423-6_6

Many real-world classification problems are significantly class-imbalanced to the detriment of the class of interest. The standard set of proper evaluation metrics is well known, but the usual assumption is that the test dataset imbalance equals the real-world imbalance. In practice, this assumption is often broken for various reasons. The reported results are then often too optimistic and may lead to wrong conclusions about the industrial impact and suitability of proposed techniques. We introduce methods (supplementary code related to the techniques described in this paper is available at: https://github.com/CiscoCTA/nci_eval) focusing on evaluation under non-constant class imbalance. We show that not only the absolute values of commonly used metrics, but even the order of classifiers in relation to the evaluation metric used, is affected by the change of the imbalance rate. Finally, we demonstrate that subsampling the test dataset so that its class imbalance equals the one observed in the wild is not necessary, and can eventually lead to significant errors in the estimate of the classifier's performance.

Class-imbalanced problems arise when the number of samples in one of the classes, often the class of interest, is significantly lower than in the other class, often the background class. Such problems are present in a variety of domains such as medicine [16], finance [15, 20, 21], cybersecurity [1, 3, 5] and many others. In highly imbalanced problems it is essential to use suitable evaluation metrics to correctly assess the merit of pursued algorithms and to realistically judge their impact before they are deployed into the wild.

Methods for evaluation of classifiers on class-imbalanced datasets are well known and have been thoroughly described in the past [4, 9, 11, 19]. It is usually assumed that the imbalance of the test dataset is the same as in the real distribution on which the model will operate once deployed into the production environment. However, this assumption is often broken for different reasons, ranging from selection bias when constructing the test dataset, through the high cost of acquiring a large dataset when the imbalance is high (e.g. 1:10^4), to the fact that often no single general distribution exists (e.g. a disease classifier may face different priors depending on the location).

Discrepancy between the imbalances in test datasets and in the real world is often the root cause of too optimistic results leading to wrong expectations about the impact in industrial applications. This is detrimental to the research community, because it creates confusion about which problems are still open and which are solved. It might discourage groups from working on such problems, and it makes it harder for researchers still investigating the field to convince the community that, in the light of the too optimistic prior work, their results still have impact.

Throughout this paper, we frame and investigate the problem of classifier evaluation without the assumption of constant class imbalance. We focus on precision-related metrics, which are among the most popular metrics for imbalanced problems [4, 9].
We show how these metrics can be computed for arbitrary class imbalances and any test dataset without the need to re-sample the data. We also inspect their behavior as a function of the imbalance rate. We show that Precision-Recall (PR) curves have little value without the corresponding imbalance ratio being stated, since it can dramatically affect the results and their assessment. We demonstrate that a change in the imbalance rate, perhaps surprisingly, also affects the ranking of classifiers under these metrics. We argue that instead of tabulating the results for a single dataset, it is beneficial to plot the dependence on the class imbalance rate whenever possible. Such plots provide considerably more information for a wider audience. We also describe how errors in measurements can be assessed and show that they can significantly affect the reliability of the measured precision, mainly when low regions of false positive rate are of interest. This can be primarily attributed to the fact that the test dataset is finite. Therefore, we further elaborate on how the class imbalance increases the demands on the size of the test dataset. Most importantly, we refute the common understanding that the best practice is to alter the test dataset so that its class imbalance matches the imbalance of the pursued distribution, as is suggested e.g. in [14]. We show how re-sampling of a dataset may lead to significant errors in measurements. We stress that the test dataset should be constructed in a way that allows measurement of false-positive and true-positive rates with errors as small as possible. We show that the crucial quantity to focus on is the coefficient of variation related to both the true-positive and false-positive rates.

Throughout this paper we are concerned with the binary classification task. Let x ∈ X be an input and y ∈ Y = {−1, 1} be a target. We call the class y = −1 the negative class and the class y = 1 the positive class. The positive class is assumed to be the minority class and the negative class the majority class. We do not assume that there exists a single real-world joint probability distribution p(x, y), but instead consider a parametric family

$p(x, y; \eta) = p(x \mid y)\, p(y; \eta), \qquad p(y{=}1; \eta) = \eta, \quad p(y{=}{-}1; \eta) = 1 - \eta.$   (1)

Parameter η ∈ [0, 1] specifies the positive class prevalence. If we consider a classifier h : X → Y, then the following classifier evaluation metrics can be expressed as probabilities:

$TPR = P(h(x) = 1 \mid y = 1),$   (2)
$FPR = P(h(x) = 1 \mid y = -1),$   (3)
$Prec(\eta) = P(y = 1 \mid h(x) = 1) = \frac{TPR \cdot \eta}{TPR \cdot \eta + FPR \cdot (1 - \eta)}.$   (4)

TPR stands for true-positive rate (also called recall or sensitivity), FPR for false-positive rate and Prec for precision. Formula (4) is derived using Bayes' theorem. We can observe that both TPR and FPR are not affected by the positive class prevalence, but precision is. This observation is very important for the rest of this paper.

To estimate the above-mentioned metrics we need to evaluate the classifier on a test dataset. We assume that the test dataset is sampled i.i.d. from p(x, y; η_test), where η_test may or may not correspond to the positive class prevalence connected to some real-world application of the classifier. TP, FP, TN, FN denote the number of true positives, false positives, true negatives and false negatives, respectively, and N = TP + FP + TN + FN equals the size of the test set. The prevalence of the positive class in the test dataset, $\hat{p}_+$, and the imbalance ratio (IR) are defined as (one can easily be computed from the other):

$\hat{p}_+ = \frac{TP + FN}{N}, \qquad IR = \frac{TN + FP}{TP + FN} = \frac{1 - \hat{p}_+}{\hat{p}_+}.$   (5)

The estimate of TPR is defined as the fraction of positive samples that were classified correctly,

$\widehat{TPR} = \frac{1}{|X^+|} \sum_{x \in X^+} [[h(x) = 1]] = \frac{TP}{TP + FN},$   (6)

where X+ is the set of positive test samples and [[·]] is the indicator function. The estimate of FPR is defined as the fraction of negative samples that were classified incorrectly,

$\widehat{FPR} = \frac{1}{|X^-|} \sum_{x \in X^-} [[h(x) = 1]] = \frac{FP}{FP + TN},$   (7)

where X− is the set of negative test samples. The estimated precision is the number of true positives out of all the positive predictions; expressed through the estimates above and an arbitrary positive class prevalence η it reads

$\widehat{Prec}(\eta) = \frac{\widehat{TPR} \cdot \eta}{\widehat{TPR} \cdot \eta + \widehat{FPR} \cdot (1 - \eta)}.$   (8)

It can be easily shown that $\widehat{Prec}(\hat{p}_+) = TP/(TP + FP)$, i.e. evaluating (8) at the prevalence of the test dataset resolves to the standard formula used to compute precision. The metrics measured on the test dataset approach their true values originating from the distribution p(x, y; η) as the size of the dataset grows. In other words, $\hat{p}_+ \to \eta_{test}$, $\widehat{TPR} \to TPR$, $\widehat{FPR} \to FPR$ and $\widehat{Prec} \to Prec$ as N approaches infinity, but the errors in estimation caused by the limited size of the test dataset are often significant enough to deserve consideration, particularly during classifier evaluation in settings that are heavily class-imbalanced. We elaborate on this in Sect. 5.

Equation (8) in Sect. 2 shows that the class imbalance ratio of the test dataset directly impacts the measured precision. As such, the test dataset class imbalance must be considered when interpreting the results to assess the viability of the classifier for a given application. Fortunately, it is not necessary for the test dataset's imbalance ratio to be equivalent to the real-world imbalance. Equation (8) shows how to estimate the precision corresponding to any class imbalance from the estimates of TPR and FPR, which are obtained from the test dataset and are unaffected by its imbalance. In Sect. 5 we provide the rationale and show that matching the real-world class imbalance is often sub-optimal and not desirable for correct evaluation.

The positive-prevalence-adjusted precision computed by Equation (8) is a linear rational function of the positive class prevalence η. As such, it can be plotted over an interval of positive prevalence values. We call such a plot the Positive-Prevalence Precision (P3) curve. The curve should be plotted with a log-scaled x-axis (lin-log P3 curve) to easily distinguish between different orders of magnitude of the positive prevalence, as demonstrated in Fig. 1. Given a particular ROC curve, each point on the curve corresponds to a different value of TPR. Instead of saying that a P3 curve corresponds to a particular point on the ROC curve, it can also be said that it corresponds to a fixed value of TPR. For example, the P3 curve in Fig. 1 corresponds to a classifier with TPR fixed at 60%. The P3 curve answers the question "How does the precision of a given classifier evolve when changing the class imbalance ratio?" and allows one to quickly visually assess some of the conditions under which the classifier is suitable for the production environment. Also, even if a P3 curve is not used in a particular evaluation of a classifier, it is still important to possess intuition about its general shape.
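As a minimal illustration of Eqs. (6)-(8), the estimates and the prevalence-adjusted precision can be computed as in the following sketch (plain NumPy, not the supplementary nci_eval code; the function names and the numbers in the example are ours):

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """Empirical TPR and FPR, Eqs. (6)-(7); y_true and y_pred are
    NumPy arrays with labels in {-1, 1}."""
    pos = y_true == 1
    neg = y_true == -1
    tpr = np.mean(y_pred[pos] == 1)  # fraction of positives classified correctly
    fpr = np.mean(y_pred[neg] == 1)  # fraction of negatives classified incorrectly
    return tpr, fpr

def adjusted_precision(tpr, fpr, eta):
    """Precision at positive class prevalence eta, Eqs. (4)/(8)."""
    return tpr * eta / (tpr * eta + fpr * (1.0 - eta))

# Example: an operating point with TPR = 0.6 and FPR = 1e-3 evaluated
# at two different positive class prevalences.
for eta in (1e-2, 1e-4):
    print(f"eta={eta:.0e}  precision={adjusted_precision(0.6, 1e-3, eta):.3f}")
```

The same operating point yields a precision of roughly 0.86 at η = 10⁻² but only about 0.06 at η = 10⁻⁴, which already illustrates how strongly the prevalence drives the metric.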
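A sketch of how a lin-log P3 curve can be drawn for a single operating point using matplotlib is given below; TPR = 0.6 matches the fixed TPR of Fig. 1, while the FPR value and the prevalence range are assumptions made only for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

def p3_curve(tpr, fpr, eta_min=1e-6, eta_max=1e-1, n_points=200):
    """Positive-Prevalence Precision (P3) curve for one ROC operating point."""
    eta = np.logspace(np.log10(eta_min), np.log10(eta_max), n_points)
    precision = tpr * eta / (tpr * eta + fpr * (1.0 - eta))
    return eta, precision

eta, precision = p3_curve(tpr=0.6, fpr=1e-3)   # TPR fixed at 60%, assumed FPR
plt.semilogx(eta, precision)                   # log-scaled x-axis (lin-log P3)
plt.xlabel("positive class prevalence")
plt.ylabel("precision")
plt.show()
```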
The PR curve is a very popular method to evaluate classifiers on imbalanced datasets. It captures the relationship between recall (TPR) on the x-axis and precision on the y-axis. As is the case with the ROC curve, the PR curve is usually created by applying different thresholds to the raw output of a classifier. While the ROC curve is a non-decreasing function, PR curves do not have to be monotonic, because precision can both increase and decrease for different threshold values. As discussed in Sect. 3.1, contrary to the ROC curve, the PR curve is affected by the imbalance ratio present in the test dataset. This behavior is demonstrated in Fig. 2. PR curves can immediately reveal poor performance on class-imbalanced datasets that might not be obvious when inspecting ROC curves alone [18]. Because of this property, PR curves are a well-suited and popular choice for the evaluation of classifiers on class-imbalanced sets.

We suggest that the particular imbalance ratio present in the test dataset for which a PR curve was created should always be reported and considered when interpreting the impact of the results. When different research teams perform their experiments on different test sets while solving the same problem, the resulting PR curves will not be comparable if different imbalance ratios are present, even if the data originate from the same source. For example, in computer security the datasets of downloaded files might originate from the VirusTotal service, but different teams may work with different subsets that have different imbalance ratios.

Another danger is that the class imbalance ratio in a particular test dataset is often not representative of the imbalance ratios encountered once the classifier is deployed in the real environment. It is often the case that the imbalance ratios experienced in the wild are lower than the ratio in the test dataset (not rarely, the test datasets are not imbalanced at all). In such situations, too optimistic estimates of the classifier's performance are obtained if the evaluation is based on a PR curve computed directly on the test dataset. To remedy these risks, test datasets are often created with the same class imbalance ratios that would be encountered in the real environment. In Sect. 5 we demonstrate that this should not be the goal. Rather, a test dataset should be assembled that allows estimation of TPR and FPR with low enough variance, and (8) should be used to compute Precision-Recall curves for the different class imbalance ratios of interest.

When comparing the performance of classifiers that need to deal with imbalanced data, the area under the PR curve (PR-AUC) or the F1 score, $F_1 = \frac{2 \cdot Prec \cdot Recall}{Prec + Recall}$, are often used out of convenience because they can be expressed as a single number [8]. In this section, we show that not only the values of these metrics dramatically depend on the imbalance rate in the selected test dataset, but that the rate has a notable influence even on the ordering of classifiers by their efficacy. That is, based on these metrics, two classifiers can switch places given different imbalance rates. This can lead to incorrect conclusions about the performance of classifiers on real data. The fact can also be misused for cherry-picking the imbalance rate at which a classifier achieves better results than any other method it competes with.

(Fig. 3 caption: The graph is similar to the Positive-Prevalence Precision plot in Fig. 1, but instead of precision it plots the F1 score of two distinct classifiers computed on the same dataset while assuming different imbalance rates. Not only the absolute value of the score but even the order of the classifiers depends on the positive class prevalence.)

The F1 score is defined as the harmonic mean of precision and recall. The comparison of the F1 scores of two classifiers is therefore affected by the selected imbalance rate, since precision depends on the rate while recall does not. Figure 3 demonstrates how the F1 score of two classifiers depends on the imbalance rate present in a test dataset. Therefore, in applied research papers we suggest plotting F1 scores in relation to imbalance rates, as in Fig. 3, instead of tabulating F1 scores. Such a plot contains a superset of the information, is easily interpretable and space-efficient, and conveys an overall better picture of the performance of classifiers independent of the particular imbalance rate in the selected test dataset. The imbalance rate of the particular test dataset can easily be highlighted on the x-axis.
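A sketch of how such a comparison can be produced from the fixed operating points of two classifiers via Eq. (8) follows; the two (TPR, FPR) pairs are hypothetical values chosen only to show that the ordering can flip:

```python
import numpy as np

def f1_at_prevalence(tpr, fpr, eta):
    """F1 = 2*Prec*Recall / (Prec + Recall), with Prec adjusted to prevalence eta."""
    prec = tpr * eta / (tpr * eta + fpr * (1.0 - eta))
    return 2.0 * prec * tpr / (prec + tpr)

# Two hypothetical operating points: A has higher TPR but also higher FPR.
clf_a = dict(tpr=0.80, fpr=1e-2)
clf_b = dict(tpr=0.55, fpr=1e-3)

for eta in np.logspace(-4, -1, 4):  # prevalences 1e-4, 1e-3, 1e-2, 1e-1
    f1_a = f1_at_prevalence(eta=eta, **clf_a)
    f1_b = f1_at_prevalence(eta=eta, **clf_b)
    print(f"eta={eta:.0e}  F1_A={f1_a:.3f}  F1_B={f1_b:.3f}  better: {'A' if f1_a > f1_b else 'B'}")
```

With these particular numbers, classifier B wins at the lower prevalences while A wins at η = 10⁻¹, reproducing the order swap discussed above.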
It has been proven that if a classifier dominates in ROC space it also dominates in PR space [6]; however, dominance is not linked to the area under the ROC curve (ROC-AUC). It is easily possible for a classifier to have a greater ROC-AUC than another but a smaller PR-AUC on the same test dataset. A convenient property of evaluating a classifier by ROC-AUC is that its value is invariant to class imbalance. On the other hand, the value of ROC-AUC can be dominated by insignificant regions in the ROC space, e.g. high values of FPR, which are in practice of no importance. If the problem is heavily class-imbalanced, ROC-AUC is usually not an appropriate method for evaluation of classifiers [2] and PR-AUC should be considered. However, it is often not realized that PR-AUC values depend on class imbalance, and notably that the order of classifiers under this metric also depends on the imbalance rate, as demonstrated in Fig. 4. This may be more surprising than in the case of the F1 score, which is computed only at a single operating point, because PR-AUC is evaluated over the whole range of operating points. Therefore, one might wrongly expect the metric to preserve the ordering of classifiers across different imbalance rates. We offer similar advice as with the F1 score: report the dataset imbalance rate together with PR-AUC values and, ideally, use plots as in Fig. 4 instead of tabulated values for a single imbalance ratio.

Class-imbalanced problems have increased demands on the test dataset size. It is often ignored that the TPR and FPR computed on a test dataset are just point estimates of the real TPR and FPR, given in (2) and (3), respectively, and as such they may be affected by uncertainty related to an insufficient number of samples of the minority class. In this section, we investigate how this uncertainty impacts the measured precision and how to correctly design experiments in the presence of imbalanced data to suppress the uncertainty in the outcome.

A common approach to quantify the uncertainty of estimates based on finite samples is to use interval estimates. We say that $I_{TPR} = (\widehat{TPR} - \sigma_{TPR},\ \widehat{TPR} + \sigma_{TPR})$ is the α-confidence interval of TPR if it holds that

$P(TPR \in I_{TPR}) \ge \alpha,$

where the probability is w.r.t. the randomly generated positive test samples X+ which are used to compute $\widehat{TPR}$ by (6). The interval (half-)width σ_TPR, the number of samples |X+| and the confidence level α ∈ (0, 1) are dependent variables whose exact relation is characterized by numerous concentration bounds such as Hoeffding's inequality. For example, by fixing σ_TPR and α we can compute the minimal number of samples in X+ which guarantees that I_TPR is the α-confidence interval. In the sequel we assume that the interval width σ_TPR is not greater than $\widehat{TPR}$. Note that this formalisation does not introduce any specific constraints on the shape of the distribution of $\widehat{TPR}$. The confidence interval I_TPR can be characterized by a single number, the coefficient of variation, defined as

$CV_{TPR} = \frac{\sigma_{TPR}}{\widehat{TPR}}.$

Analogously, we can define $I_{FPR} = (\widehat{FPR} - \sigma_{FPR},\ \widehat{FPR} + \sigma_{FPR})$ and $CV_{FPR} = \frac{\sigma_{FPR}}{\widehat{FPR}}$, and we also assume that $\sigma_{FPR} < \widehat{FPR}$.
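In practice the half-widths σ_TPR, σ_FPR and the coefficients of variation have to be estimated; one possible approach, mentioned later in this section, is bootstrapping the test set. A sketch follows (it uses the bootstrap standard error as the half-width, which is a simplification; a quantile-based interval or a concentration bound could be used instead):

```python
import numpy as np

def bootstrap_cv(y_true, y_pred, n_boot=1000, seed=0):
    """Bootstrap estimates of (TPR, sigma_TPR, CV_TPR) and (FPR, sigma_FPR, CV_FPR).
    y_true and y_pred are NumPy arrays with labels in {-1, 1}."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_true == 1)
    neg = np.flatnonzero(y_true == -1)
    tprs, fprs = [], []
    for _ in range(n_boot):
        p = rng.choice(pos, size=pos.size, replace=True)  # resample positives
        n = rng.choice(neg, size=neg.size, replace=True)  # resample negatives
        tprs.append(np.mean(y_pred[p] == 1))
        fprs.append(np.mean(y_pred[n] == 1))
    tpr_hat = np.mean(y_pred[pos] == 1)
    fpr_hat = np.mean(y_pred[neg] == 1)
    sigma_tpr = np.std(tprs)  # bootstrap standard error used as interval half-width
    sigma_fpr = np.std(fprs)
    return (tpr_hat, sigma_tpr, sigma_tpr / tpr_hat), (fpr_hat, sigma_fpr, sigma_fpr / fpr_hat)
```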
Let us define the precision as a function of the positive class prevalence η, TPR and FPR:

$Prec(\eta, TPR, FPR) = \frac{TPR \cdot \eta}{TPR \cdot \eta + FPR \cdot (1 - \eta)}.$

Given TPR ∈ I_TPR and FPR ∈ I_FPR, the value of Prec(η, TPR, FPR) has to lie, for any fixed η ∈ (0, 1), inside the interval (LB(η), UB(η)), where

$LB(\eta) = \min_{TPR \in I_{TPR},\ FPR \in I_{FPR}} Prec(\eta, TPR, FPR), \qquad UB(\eta) = \max_{TPR \in I_{TPR},\ FPR \in I_{FPR}} Prec(\eta, TPR, FPR).$

Let Δ be the maximal width of the interval (LB(η), UB(η)) w.r.t. η, that is,

$\Delta = \max_{\eta \in (0, 1)} \bigl( UB(\eta) - LB(\eta) \bigr).$

The number Δ can be interpreted as the maximal uncertainty in measurements of precision when the exact values of TPR and FPR are replaced by their confidence intervals I_TPR and I_FPR, respectively. It is easy to see that TPR ∈ I_TPR and FPR ∈ I_FPR imply $Prec(\eta, TPR, FPR) \in (\widehat{Prec}(\eta) - \Delta,\ \widehat{Prec}(\eta) + \Delta)$. The concepts of UB(η), LB(η) and Δ, as well as their relation to Prec(η, TPR, FPR), are illustrated in Fig. 5. The following theorem relates the maximal uncertainty Δ to the coefficients of variation CV_TPR and CV_FPR, which characterize the confidence intervals I_TPR and I_FPR, respectively.

Theorem 1. If I_TPR and I_FPR are α-confidence intervals of TPR and FPR, then with probability at least α² it holds that $\Delta \le \max(CV_{TPR}, CV_{FPR})$.

The α²-confidence level stems from the fact that TPR ∈ I_TPR and FPR ∈ I_FPR are two independent random events, each with probability not less than α. Theorem 1 shows the relationship between the confidence interval for precision, the widths of the intervals I_TPR and I_FPR, and the point estimates of TPR and FPR. That is, the coefficients of variation of TPR and FPR are the crucial quantities to consider when designing a test dataset. If a test set is being constructed, we first need to manually fix both σ_TPR and σ_FPR at reasonable values based on the purpose of the dataset, and then ensure the number of test samples necessary to estimate TPR and FPR with the desired Δ.

If, for example, one is interested in FPR = 10⁻³ on a dataset having only 10,000 negative samples, the estimate around this working point may become extremely noisy. Since such a low FPR corresponds to only 10 FP samples (10,000 × 10⁻³), just a small increase or decrease in the number of FPs suffices to significantly alter the relative value of the FPR. Therefore, if such low values of FPR are of interest, one should increase the number of negatives. Different methods exist to quantify the concentration bounds. For example, Hoeffding's inequality can be used, which states that the upper bound on the number of required samples is proportional to $1/\sigma_{FPR}^2$, but Hoeffding's bound is very loose and usually fewer samples are required. On the other hand, given a test dataset, in order to find Δ we need to estimate σ_TPR and σ_FPR to get CV_TPR and CV_FPR. For that purpose cross-validation or bootstrapping can be used.

For example, a classifier with $\widehat{TPR} = 0.6$, σ_TPR = 0.06, $\widehat{FPR} = 10^{-3}$, σ_FPR = 10⁻⁴ has CV_TPR = CV_FPR = Δ = 0.1, which might be a reasonable width of the precision's confidence interval (i.e. a ±10% change). But if we increase σ_FPR to 5·10⁻⁴ then, even though the number might seem small and not indicative of the impact on the estimate of precision, the bound for precision becomes Δ = 0.5 (i.e. a ±50% change), which immediately sheds light on the reliability of the estimates of precision.
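The worked example above can be checked numerically. The sketch below evaluates UB(η) − LB(η) on a grid of prevalences (exploiting that Prec is increasing in TPR and decreasing in FPR, so the corners of the confidence box give the extremes) and compares the maximal width with max(CV_TPR, CV_FPR); the grid resolution is an arbitrary choice of ours:

```python
import numpy as np

def prec(eta, tpr, fpr):
    return tpr * eta / (tpr * eta + fpr * (1.0 - eta))

def max_interval_width(tpr, sigma_tpr, fpr, sigma_fpr, n_grid=100_000):
    """Numerically maximize UB(eta) - LB(eta) over eta in (0, 1)."""
    eta = np.linspace(1e-6, 1.0 - 1e-6, n_grid)
    ub = prec(eta, tpr + sigma_tpr, fpr - sigma_fpr)  # largest TPR, smallest FPR
    lb = prec(eta, tpr - sigma_tpr, fpr + sigma_fpr)  # smallest TPR, largest FPR
    return np.max(ub - lb)

for sigma_fpr in (1e-4, 5e-4):
    delta = max_interval_width(tpr=0.6, sigma_tpr=0.06, fpr=1e-3, sigma_fpr=sigma_fpr)
    cv_bound = max(0.06 / 0.6, sigma_fpr / 1e-3)  # max(CV_TPR, CV_FPR)
    print(f"sigma_FPR={sigma_fpr:.0e}  max width={delta:.3f}  max(CV)={cv_bound:.2f}")
```

With these numbers, the equal-CV case attains the bound (width 0.1), while for σ_FPR = 5·10⁻⁴ the realized maximal width is roughly 0.31 and stays below the bound of 0.5.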
To illustrate the error introduced by sub-sampling, we used ResNet-50 [10] on the ImageNet validation dataset [17] to detect images of 'agama' in a one-vs-all manner. The positive class prevalence $\hat{p}_+$ in this dataset is 10⁻³. To plot PR curves for η = 10⁻² we can either use the full dataset and then apply (8) to adjust the precision, or sub-sample the dataset to $\hat{p}_+ = 10^{-2}$. Figure 6 compares these two approaches, where we repeated the sub-sampling 30 times to estimate the variance introduced by the random reduction of the negative class. The results show that the PR curves measured on the sub-sampled datasets are encumbered by considerable measurement errors, even though each sub-sampled dataset has 5000 samples, which might otherwise be a reasonable number for evaluation on balanced problems. Moreover, η = 10⁻² is not as drastic an imbalance as is often encountered in applications, and the errors would be even more pronounced if η were lower. Contrary to the common practice of sub-sampling the test dataset to the desired imbalance rate [14], we recommend using a bigger dataset (to decrease the coefficients of variation) and adjusting the metrics to the desired imbalance rate instead.
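The same effect can be reproduced on synthetic data. The sketch below is not the ImageNet/ResNet-50 experiment of Fig. 6; the score distributions, sample sizes and target prevalence are invented for illustration. It repeatedly sub-samples the negatives to the target prevalence and compares the resulting PR curves with the single PR curve obtained from the full test set with precision adjusted via (8):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

# Synthetic test set with prevalence ~1e-3: scores from two overlapping Gaussians.
n_neg, n_pos = 200_000, 200
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(2.0, 1.0, n_pos)])
labels = np.concatenate([-np.ones(n_neg), np.ones(n_pos)])

def pr_curve_at_prevalence(scores, labels, eta):
    """Recall/precision pairs at target prevalence eta via Eq. (8)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    precision = tpr * eta / (tpr * eta + fpr * (1.0 - eta) + 1e-12)  # avoid 0/0 at the origin
    return tpr, precision

def subsampled_pr_curve(scores, labels, eta, rng):
    """PR curve measured after sub-sampling negatives so the prevalence equals eta."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == -1)
    n_keep = int(pos.size * (1.0 - eta) / eta)               # negatives to keep
    idx = np.concatenate([pos, rng.choice(neg, size=n_keep, replace=False)])
    return pr_curve_at_prevalence(scores[idx], labels[idx], eta)  # eta equals the new p+

eta = 1e-2
for _ in range(30):                                           # 30 noisy sub-sampled curves
    r, p = subsampled_pr_curve(scores, labels, eta, rng)
    plt.plot(r, p, color="lightgray", linewidth=0.5)
recall, precision = pr_curve_at_prevalence(scores, labels, eta)
plt.plot(recall, precision, color="black")                    # adjusted curve, full test set
plt.xlabel("recall"); plt.ylabel("precision"); plt.show()
```

Each gray curve corresponds to one random sub-sample; their spread around the black adjusted curve is the measurement error introduced purely by discarding negatives.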
Several comprehensive papers about the methodology of evaluation on imbalanced datasets have been written [4, 7, 9, 11, 19]. They focus on measuring the performance on the test dataset and do not address the problem of mismatch between class imbalances in the test and application datasets. In [5], the authors use a plot with the area under the PR curve on the y-axis and a quantity related to the imbalance ratio on the x-axis. The plot is similar to Fig. 4; it is used because it is useful in the context of the paper, but its properties and impacts are not discussed. In [2], the authors discuss several bad practices in the handling of class-imbalanced problems. Among other issues, they discuss the importance of addressing real imbalance ratios that can be different from the test dataset. They also present a formula for adjusting the precision to different imbalance ratios, but do not explore this formula in greater detail, nor do they inspect the impact of the uncertainty originating from the finite size of the test dataset on precision. Paper [12] introduces a measure based on the area under the PR curve, which is further integrated across different class imbalances, yielding a single evaluation number. The idea is based on the relationship between PR and ROC given in (8). No additional investigations related to multiple working points, the ordering of classifiers according to the score, or errors in measurements are carried out. In [14], the authors raise the issue of experimental results in cybersecurity often not being reproducible in real applications. They mention the problem that the class imbalance is often different in the test dataset than in practice. They do not address the issue analytically but instead choose to re-sample the test dataset to the desired imbalance ratios. This goes directly against our observations in Sect. 5, and applying such a method leads to results heavily affected by noise.

It should be mentioned that other evaluation metrics well suited for the evaluation of class-imbalanced problems have been proposed. A notable example is the Matthews Correlation Coefficient (MCC) [13], but it is not in the scope of this paper. MCC is not as widely used as PR [8] and its values are not as easily interpretable as the values of precision and recall.

This paper addressed the evaluation of classifiers under the consideration that the class imbalance ratio encountered in the real world is different from the imbalance present in the test dataset, or is subject to change. We focused on precision as one of the most popular evaluation metrics for imbalanced problems. We stress that it is of significant importance to also report the imbalance ratio under which the classifier was developed and for which it is aimed, because assuming different imbalance ratios may easily lead to classifiers swapping places. This holds for both PR-AUC and the F1 score. We have shown that even very small absolute values of σ_FPR can result in large variance of the measured precision. The larger the class imbalance, the greater the demands on the number of negative samples in the test dataset. Therefore, rather than sub-sampling a dataset to reach the desired imbalance rate, all the samples should be kept to decrease the coefficients of variation, and the evaluation metrics should be computed using the presented formulas.

References
[1] The base-rate fallacy and the difficulty of intrusion detection. In: Understanding Intrusion Detection Through Visualization
[2] Bad practices in evaluation methodology relevant to class-imbalanced problems. In: Critiquing and Correcting Trends in Machine Learning workshop at NeurIPS
[3] Decision-forest voting scheme for classification of rare classes in network intrusion detection
[4] Data mining for imbalanced datasets: an overview
[5] A comparison of static, dynamic, and hybrid analysis for malware detection
[6] The relationship between precision-recall and ROC curves
[7] An introduction to ROC analysis
[8] Learning from class-imbalanced data: review of methods and applications
[9] Learning from imbalanced data
[10] Deep residual learning for image recognition
[11] Handling imbalanced datasets: a review
[12] Precision-recall operating characteristic (P-ROC) curves in imprecise environments
[13] Comparison of the predicted and observed secondary structure of T4 phage lysozyme
[14] TESSERACT: eliminating experimental bias in malware classification across space and time
[15] Minority report in fraud detection: classification of skewed data
[16] Addressing the class imbalance problem in medical datasets
[17] ImageNet large scale visual recognition challenge
[18] The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets
[19] A systematic analysis of performance measures for classification tasks
[20] Effective detection of sophisticated online banking fraud on extremely imbalanced data
[21] A multiscale neural network learning paradigm for financial crisis forecasting