A Robust Profit Measure for Binary Classification Model Evaluation∗ Franco Garrido1, Wouter Verbeke2, and Cristián Bravo3 1Programa de Maǵıster en Gestión de Operaciones, Universidad de Talca, fgarridoc@alumnos.utalca.cl 2Faculty of Economic and Social Sciences and Solvay Business School, Vrije Universiteit Brussel, Belgium, wouter.verbeke@vub.be 3Department of Decision Analytics and Risk, Southampton Business School, University of Southampton, c.bravo@soton.ac.uk Abstract Using profit-based evaluation measures is a necessity in business- oriented contexts, as they aid companies in making cost-optimal de- cisions. Among the measures that effectively include the true na- ture of costs and benefits in binary classification, the expected max- imum profit (EMP) has been used successfully for churn prediction and credit scoring, and defined in general for binary classification problems. However, despite its competitive results against the most frequently used measures, the EMP relies on a fixed probability dis- tribution of costs and benefits, the range of which in real applications is not entirely known. In this paper, we propose to extend this mea- sure by adding random shocks to these distributions. We call this new measure the R-EMP, following the convention of the analogous EMP measure. Our metric adds a stochastic component to each point of ∗ NOTICE: this is the author’s version of a work that was accepted for publication in Expert Systems with Applications in September 19, 2017, published online as a self-archive copy after the 24 month embargo period. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Please cite this paper as follows: Franco Garrido, Wouter Verbeke, Cristin Bravo, A Robust Profit Measure for Binary Classification Model Evaluation, In Expert Systems with Applications, 2017, Accepted: Available Online https://doi.org/10.1016/j.eswa.2017.09.045. 1 https://doi.org/10.1016/j.eswa.2017.09.045. the cost-benefit distributions, assuming that costs and benefits have a fixed probability, but its distribution range is subject to an external shock, which can be different for each cost or benefit. The experimen- tal set-up is focused on a credit scoring application using a dataset of a Chilean financial institution, with the attribute selection for a logistic regression being accomplished using the AUC, EMP, H-measure, and R-EMP as the selection criteria. The results indicate that the R-EMP measure is the most robust metric for achieving the greatest profit for the company under uncertain external conditions. Keywords: Supervised Binary Classification; Business analytics; Per- formance measures; Profit-driven analytics 1 Introduction The development of performance measures for classification methods has become an important task in data analytics, given their critical role in oper- ations management (Baesens et al., 2009). In many industries, information analysis has become the only method of differentiation (Davenport, 2006). McAfee and Brynjolfsson (2012) stated the following: ”You can’t manage what you don’t measure”. The most common predictive analytics problem, binary classification, has the goal of classifying elements into one of two classes. Most models that are used to solve this problem, such as logistic regression or neural networks, return a continuous score that indicates how likely each case is to belong to one of the two classes, and it is up to the practitioners to determine the threshold that defines the frontier between the two classes. There is a wide variety of measures available for evaluating the performances of algorithms, among which the receiver operating characteristic (ROC) curve and, espe- cially, the area under the ROC curve (AUC) are the most frequently used (Bradley, 1997). Researchers have demonstrated that measures like the AUC are not suit- able for environments in which misclassification costs are different (Hand, 2009). There are measures that consider the true nature of costs effectively; among them, the H-measure (Hand, 2009), the maximum profit (MP) mea- sure (Verbeke et al., 2012), and the expected maximum profit (EMP) measure (Verbraken et al., 2013) are some of the best known. The last two measures are designed as total profit measures; the former (MP) assumes certainty in 2 cost parameters, obtaining the maximum benefit and the optimal threshold, while the latter (EMP) is a stochastic version of the MP measure, in which cost and benefit parameters are described by a probability distribution, lead- ing to the estimation of the expected maximum profit. In real applications, the MP and EMP measures have contributed to the selection of models aligned with the nature of costs and benefits. Some ap- plications of these measures include determining the optimal fraction of the consumers to be targeted for a churn prevention campaign at a telecommuni- cations company (Verbeke et al., 2012) and estimating the maximum profit of credit scoring models (Verbraken et al., 2014a). This paper presents a more robust version of the measure for the case in which the uncertainty comes not only from the profit parameters but also from external random shocks, which we call the Robust Expected Maximum Profit (R-EMP) measure. Our method is based on the rationale that random shocks can modify an originally rigid profit estimation; thus, if the distribu- tion of these potential shocks can be known, then the R-EMP will fit the profit estimation, taking into account this information. This paper is structured as follows: in Section 2, we describe the state-of- the-art profit-based performance measures for evaluating classification mod- els. Section 3 shows the R-EMP formulation, specifying its structure and all the considerations regarding its implementation. Section 4 presents the experimental design of this work, which consists of a synthetic case (Section 5) and an empirical case using loan data (Section 6). This section includes both the benchmark of the R-EMP against other measures and a case in which we show the use of the measure as a decision-making tool. Finally, conclusions are presented in Section 7. 2 Evaluation Measures for Classification Mod- els Within the field of predictive analytics, a classification problem refers to the task of determining a class label for an element from a set of known labels. When the number of possible labels is only two, this task is known as binary classification. Data mining/analytics models for supervised classification al- low determining the labels for new cases with unobserved labels, and binary classifiers usually return a probability of belonging to one of two classes, lead- 3 ing to the necessity of defining a threshold that separates the two classes, i.e., a cut-off value. According to Ali and Smith (2006), there is no unique measure that can be used to find the best classification model. Baldi et al. (2000) showed that the most frequently used measures are percentages, different kinds of distances, correlation, entropy, mutual information, and ROC curves. Various authors have applied ROC curves in many applications and used the area under the ROC curve, i.e., the AUC, as the performance measure, mainly because the AUC does not depend on a cut-off value and is insensitive to class distri- bution (Bradley, 1997). The AUC is also easily implemented (Brown and Davis, 2006) and interpreted (Fawcett, 2006a). Fawcett (2006b) showed that ROC graphs are not a suitable reference when there are instance-varying, i.e., case-dependent, classification costs. To fix this problem, he developed a variant called the ROCIV. Hand (2009) detected an additional AUC weak- ness and proposed an alternative measure known as the H-measure, which is coherent and therefore should yield a more reliable indication of performance than the area under the ROC curve. Correa Bahnsen et al. (2014) indicated a significant need that exists for measures that are sensitive to classification costs. They proposed an algorithm for credit scoring that allows construct- ing a classifier while simultaneously taking into account the variable nature of costs. Several publications presented measures or techniques for evaluat- ing classification models. Most of the proposed measures were compared to the AUC measure. Among these works, we find McDonald (2006) introduc- ing a measure that has the characteristic of allowing an unbiased (with or without cost sensitivity) comparison between different classifiers; De Bock and Van den Poel (2011) presenting a methodology that considers a rota- tive evaluation of performance measures; and Aman et al. (2015) proposing a set of measures that allows comparing models in terms of independence, reliability, volatility, and cost. Later, Clemente-Ćıscar et al. (2014) proposed two measures, one based on benefits and another based on returns, with the objective of evaluating the performance of a customer retention campaign. In this paper, we elaborate upon the MP framework as introduced by Verbeke et al. (2012) for supervised binary classification problems. The MP measure Verbeke et al. (2012), which is the first measure in the MP frame- work, considers the different costs of classification and at the same time facil- itates the obtaining of the optimal cut-off value to be applied when operating the obtained classification model, which is a practical advantage when com- pared to alternative measures. Verbraken et al. (2013) developed a stochastic 4 version of the MP measure, called the EMP, which models each cost using a probability distribution, allowing the estimation of the expected value of the maximum profit. The MP and EMP measures have been implemented and adopted successfully in churn prediction (Verbraken et al., 2014b) and credit scoring (Verbraken et al., 2014a). Both the MP and EMP measures are discussed in more detail in the next section. 2.1 Profit-based Evaluation Measures The MP measure is designed as a profit-based function, in which the param- eters b0 and c0 (b1 and c1) are, respectively, the benefit and cost associated with correctly and incorrectly classifying a good (bad) case, F0(t) and F1(t) denote the cumulative fraction of, respectively, goods and bads, with a score assigned by the classifier below the variable cut-off t. The average classi- fication profit per case resulting from adopting a threshold t is defined as follows: P(t; b0,c0,b1,c1) = b0π0F0(t)+b1π1(1−F1(t))−c0π0(1−F0(t))−c1π1F1(t) (1) Since all parameters in this function, i.e., b0, b1, c0, and c1, are assumed to be positive, then it follows that the theoretical overall maximum profit can be attained when F0(t) = 1 and F1(t) = 0. This, however, only occurs when a classifier perfectly discriminates between goods and bads. The max- imum profit that can be obtained for a non-perfect classifier is defined in Equation (2), where T is the optimal cut-off value that defines the threshold score separating the two classes. MP = max ∀t P(t; b0,c0,b1,c1) = P(T ; b0,c0,b1,c1) (2) The value of T can be obtained under the maximization of the profit function and satisfies the first-order condition for the maximization of the average profit, P : f0(t) f1(t) = π1(b1 + c1) π0(b0 + c0) = π1θ π0 (3) with π0 and π1 being the prior class probabilities and θ = b1+c1 b0+c0 being the cost-benefit ratio. Hence, the optimal threshold T depends on the cost- benefit ratio θ. The MP measure has the merit of being oriented toward the 5 central business objective, i.e., profit maximization, and also the practical benefit of providing the optimal cut-off value. More recently, the EMP measure has been proposed as an extension of the MP (Verbraken et al., 2013). This measure was designed considering that in real application settings, it is often difficult to estimate accurate values for benefit and cost parameters or costs and benefits may be case dependent; therefore, these parameters were modeled using a probability distribution. The EMP measure is presented in Equation (4) and accounts for the involved uncertainty, with ω(b0,c0,b1,c1) being the conjoint probability distribution of the cost and benefit parameters. For each possible combination of the cost and benefit parameters (b0,c0,b1,c1), the optimal threshold T is determined using Equation (3) as a function of the cost-benefit ratio θ. EMP = ∫ b0 ∫ c0 ∫ b1 ∫ c1 P(T(θ); b0,c0,b1,c1) ·ω(b0,c0,b1,c1)db0dc0db1dc1 (4) 3 The R-EMP measure In this article, we extend the EMP measure by acknowledging that in addition to the uncertainty regarding the cost and benefit parameters, as captured by ω(b0,c0,b1,c1) in the EMP measure, these parameters can change because of a random shock. Such a random shock can either be an external or internal event, or, despite the name, a steady evolution of the operational setting in which the classification model functions. For instance, changes in economic conditions or technological evolutions can have an impact on the operational setting and are examples of external shocks that affect profitability. On the other hand, changes in customer behavior, customer relationship manage- ment, business strategies and business processes are examples of internal shocks affecting profitability. Therefore, the presented R-EMP measure ex- tends the EMP approach, which models the benefit and cost parameters using a probability distribution to capture either uncertainty in estimating the exact values or to account for inherent variability across cases, by su- perimposing a perturbation or uncertainty on top of these distributions to capture the effect of such random shocks on profitability. As such, we aim to achieve a more robust measure and, through the use of this measure, as will be explained in a later section, to obtain more robust classification models for improved decision-making under variable conditions. Thus, we include the 6 impact of external information, given by the random shock, and the potential correlation between the components of profit, as the random shock can affect each measure separately or the same shock can affect multiple parameters at once. For example, both the benefits and the costs of an application can be affected by inflation, which is both external to any intrinsic uncertainty and the same for all measures, thus creating correlation between the costs and benefits. If Equation (4) is considered to give an estimation of the maximum profit but there is an external event η (that is out of our control) affecting this value, then we can extend the definition in Equation (4) to incorporate such a random shock. This leads to the definition of the extended, more robust R-EMP measure. For this purpose, the benefit and cost parameters are defined by a probability function and, in addition, by a random shock. As explained, these random shocks correspond to perturbations of benefits and costs. The extended expressions for the benefit and cost parameters are given in Equations (5), (6), (7) and (8). b′0 = f (b0,ηb0 ) (5) c′0 = f (c0,ηc0 ) (6) b′1 = f (b1,ηb1 ) (7) c′1 = f (c1,ηc1 ) (8) Then, b′0 represents the stochastic benefit of correctly classifying a case of class 0, which is a function of the original stochastic variable b0, as adopted in the EMP measure, and additionally of a random variable (ηb0 ), representing an external random shock. b′1 is the benefit of correctly classifying a case of class 1 and is defined as a function of the stochastic variables b1 and ηb1 . The cost variables c′0 and c ′ 1 define the cost of classifying a case as class 0 that belongs to class 1 or vice versa, respectively; these variables have been defined in a manner similar to that of the benefit parameters that include a stochastic random shock. If ω represents the joint distribution function of these parameters, then the R-EMP measure is defined by the following equation: 7 R−EMP = ∫ b′0 ∫ c′0 ∫ b′1 ∫ c′1 P(T(θ′); b′0,c ′ 0,b ′ 1,c ′ 1) ·ω(b ′ 0,c ′ 0,b ′ 1,c ′ 1)db ′ 0dc ′ 0db ′ 1dc ′ 1 (9) Note that the definitions of the random shocks, ηj, allow the specification and inclusion of highly complex probability distributions for the cost and benefit parameters, incorporating both stochastic effects that are intrinsic to the operation and random shocks that are external to the user. For example, by defining the distribution d of each component ηj as d(ηj) = d(νj,ε), with νj being an internal random shock affecting only one parameter and ε being an external random shock affecting every parameter j ∈ {b′0,b′1,c′0,c′1}, then each parameter will depend on internal and external stochastic effects. 3.1 R-EMP for Credit Scoring The R-EMP measure proposed in the previous section is a generic profit measure that can be adapted towards application in any business setting that requires accounting for stochastic costs and benefits, which may be subject to shocks. In this section, we define the functional form of the R-EMP measure for credit scoring. In credit risk management, one critical decision involves whether or not to grant a loan to a consumer. Credit scorecards, especially application scorecards, are classification models that are typically developed for making this decision in a data-driven manner (Thomas et al., 2002; Siddiqi, 2016). By defining the cost and benefit parameters and establishing the involved profit formula, Verbraken et al. (2014a) adapted the EMP measure as defined in Equation (4) to evaluate credit scorecards: EMPCS = ∫ 1 0 P(T(θ); λ,ROI) ·h(λ)dλ (10) with P(t,λ,ROI) = λ ·π0F0(t) −ROI ·π1F1(t) (11) The EMP measure for credit scoring defined in Equation (10) involves a single uncertain parameter, i.e., the loss fraction of a loan, λ, which is the benefit of correctly classifying a bad applicant (i.e., b1). This fraction is defined in Equation (12) below as the amount owed in the case of default, 8 i.e., the exposure at default (EAD), multiplied by the loss after all collection measures have been exhausted, i.e., the loss given default (LGD), and divided by the original amount of the loan (A). b1 = λ = LGD · EAD A (12) Additionally, when an application scorecard wrongly classifies a good ap- plicant as a bad applicant, an opportunity cost is incurred equal to the total return over the investment (ROI). This cost is considered relative to the amount of the requested loan (A) and will be assumed to be constant across cases (Verbraken et al., 2014a). In defining the R-EMP measure for credit scoring, we adopt the same approach as for the EMP measure, except that the λ variable is replaced in the same way that b1 is replaced in Equation (7), i.e., instead of λ, we now consider λ′, which is expressed by λ′ = f (λ,ηλ). This function adds a random shock ηλ to the stochastic loss fraction λ defined in Equation (12), which impacts the potential losses. The distribution of the loss fractions can be impacted, for instance, by changing the economic conditions and collateral prices. The R-EMP measure for credit scoring is defined as follows: R−EMPCS = ∫ 1 0 (f (λ,ηλ) ·π0F0(t) −ROI ·π1F1(t)) ·h(λ′)dλ′ (13) In the following sections, we will extensively illustrate the use of this measure, using both a synthetic and a real credit scoring dataset. 4 Experimental Settings An evaluation measure has practical use when it assists in decision-making during model development and operation. Common decisions that have to be made are, for example, choosing model parameters to maximize the classifi- cation performance, choosing the best attributes to include in a classification model, or choosing the cut-off point that is to be used for making a binary decision (e.g., to accept or reject a loan application) based upon the con- tinuous score that is produced by the classification model. In the following sections, we will illustrate how the R-EMP measure supports and improves such decision-making in a business context. 9 For this purpose, we conducted experiments on a synthetic and on a real credit scoring dataset. The experiments on the synthetic dataset focus on pinpointing differences between the R-EMP and other commonly used eval- uation measures, such as the AUC, H-measure and EMP, while the empirical case study focuses on the use of the R-EMP measure for practical decision- making. In the empirical case, given that we have data spanning 12 years, we will show both a comparison of measures overall with an out-of-time benchmark and a year-by-year comparison. This allows observing the behavior of these measures both over short and long terms and to assess their robustness, which was the main objective in developing the R-EMP measure. 5 Synthetic Case For the experiments reported in this section, a synthetic dataset has been cre- ated to compare the characteristics of a selection of measures often adopted to evaluate credit scorecards. The goal of this experiment is to compare how the applied measures behave when we subject the profit to a higher level of uncertainty. For this, we built a synthetic dataset, with a binary target variable (default and non-default) following a binomial distribution with size n = 1000 and probability of success p = 0.5. We created 12 attributes originating from six distributions (binomial, exponential, normal, Poisson, uniform and Weibull). Each attribute had different parameters per class; thus, there was a slight overlap (15%) between the distributions for each class. This process resulted in attributes that do not allow for linearly separable classes but do allow the achievement of a very high predictive accuracy. To construct the costs and benefits, we first created the loan amount A for each of the n cases, and then the EAD and the LGD were set based on this value. Benefits were defined as b = LGD ·EAD and costs as c = ROI · A, meaning that there are benefits when defaulters are identified correctly because we are avoiding the loss of b and that there are losses when rejecting good applicants because we are not earning c. Following this, we calculated the value of λ using Equation (12). The dataset was then randomly divided into a training and a test set. To introduce a shock, we perturbed the value of λ following Equation (14). 10 λ′ = λ + N(µ = 0,σ2 = 0.2 ·λ2) (14) We want to study the performance of different measures when more noise is added to the dataset to study the behavior of different performance mea- sures in this situation. Because each variable is simulated with an overlap, as more variables become available, the uncertainty (noise) in the model will increase. We selected random forest as our underlying classifier to extract as much information as possible from the variables, to filter random noise that can be easily eliminated by a model, and to focus on the impact of the uncertainty that cannot be filtered and the impact of the profit subject to a random shock. The simulation starts with no variables, and in each iteration, we include in the model the attribute that maximizes each evaluation measure. Once the procedure converges, i.e., when no further improvements can be achieved by adding more attributes, the resulting profit is calculated for the test sample. This process is repeated 100 times, generating new attributes without varying the underlying distributions. After all profits are calculated, the maximum profit achieved across all 100 iterations is calculated and used to normalize the results. Hence, the experiment simulates different samples of the same population with the same statistical structure but also with different external shocks to the profit structure. Figure 1 shows the profit in the test sample as a proportion of the max- imum profit for that sample. Here, the power of the profit-driven measures is demonstrated. The AUC shows a high variability, with profit proportions ranging from 0% to 70% and an average of just 42%. The H-measure per- forms slightly better, with less variability and a somewhat higher average profit proportion being achieved (54%), but yields results much lower than those achieved by the EMP and R-EMP. Both of these measures yield an average profit of 80% (EMP: 79.8%, R-EMP: 79.6%). It can also be seen that the R-EMP achieves a smaller overall standard deviation for most ex- periments but that there are outliers, which occur when the maximum profit is achieved. This effect is consistent with the design of the measure: the model will generally select the most robust measure (small deviation), but that measure will be a maximum profit measure. The robustness of the measure is shown in Table 1. In 57 out of 100 times, the R-EMP reaches the maximum value. The small increased standard deviation of the R-EMP occurs only because of the outlier value (without this 11 Figure 1: Synthetic Dataset - Profit Out-of-time by Measure ● ● ● ● 0 25 50 75 100 AUC EMP H−measure R−EMP Measures P ro fit % Measures AUC EMP H−measure R−EMP value, the R-EMP falls to 5.9%), and acknowledging some small difference, the means are basically the same. We can conclude that the R-EMP is equivalent to the EMP in that it presents better behavior when the noise in the sample is bigger; explicit information regarding this variability can be captured by adding a new exogenous variable. Table 1: Results of synthetic experiments Measure Average s.d. Times best measure R-EMP 79.6% 6.3% 57 EMP 79.8% 6.1% 43 AUC 42.2% 13.0% 0 H-Measure 53.8% 13.5% 0 6 Empirical Case The dataset, consisting of loans for small businesses, that was used to eval- uate the presented empirical case was provided by a Chilean financial in- stitution. The dataset contains 9 attributes, which after preprocessing and 12 one-hot encoding result in 16 predictive variables, as shown in Table 2. The attributes can be grouped into three types: • Loan variables: which describe the characteristics of the operation. Besides the amount and the term - in two forms, to account for different types of loans that might be either very short or very long term -, whether the borrower has collateral, or if the loan has a guarantor. • Sociodemographic variables: These variables describe the owner of the business. It is composed of the age of the borrower, a grouping of the zip code where the borrower operates in terms of default rates, and the ownership status of their main plot of land. For the last variable, the borrower can either rent, own, have free use of the land, part-own, or have other arrangements, resulting in five categories. • Business segment: These variables describe the business segment in which the borrower operates. The number of plots was segmented considering the default rates at each group, resulting in three categories: one, two or more than two plots. The second variable, the business segment, was divided following the same logic into four categories, each describing a macro-economic sector in the economy. Table 2: Variables used in empirical experiment. Variable Description Type Guarantor If the borrower had a guarantor for the loan Loan Collateral If the borrower had collateral on the loan Loan Term (months) Term of the loan (in months) Loan Term (year) Term of the loan (in years) Loan Age Age of the borrower Sociodemographic Housing Ownership status Sociodemographic ZipGroup Region of the country Sociodemographic Properties Number of plots of land Business EconSector Main economic sector Business The number of cases in the dataset is approximately 40, 000, with a de- fault rate of 24%. The dataset includes loan applications from 1996 to 2008, but the cases are not distributed equally across years: 35% of the cases stem from the first two years and the last four years, with 13% of the total cases corresponding to the most recent year. The last four years are selected as 13 the out-of-time sample, and the cases occurring between 1996 and 2004 are randomly divided into a training and a test sample in a 70% versus 30% proportion, respectively. For more details regarding this dataset, one may refer to Bravo et al. (2013). 6.1 Using R-EMP as a Profit Measure We first repeat the experimental setup applied in Section 5 for the synthetic dataset, allowing us to study how the R-EMP measure behaves when shocks affect a dataset with more complex data structures. As before, we want to choose the best model as more information becomes available, following a forward selection-like procedure. We stop adding information once there is no improvement in the measure and record the profit over the test set. As in the synthetic case, we assume a normal distribution to model the random shock (ηλ), and the number of replications (ne) is again set to 100. From Figure 2, we can see that the AUC and the H-measure generally lead to a similar number of attributes being selected; the EMP selects a smaller number of attributes; and the R-EMP selects the highest number. The R- EMP, AUC and H-measure select very similar average numbers of attributes, with the EMP being slightly off this mark. Again, as in the previous section, the R-EMP selects the models more robustly, with only small deviations with respect to the mean number of attributes, instead of either less or more attributes than average typically being selected, as with the other measures. Using the test sample, which covers the same time period as the training sample, we obtain the results shown in Table 3. The main insight from this table is that the R-EMP outperforms the other measures in all years except for 1999. This result again exemplifies the goal of the R-EMP, i.e., to provide a measure that is resistant to random shocks and variability. The selected model exhibits a stable, high performance over the selected period of time, as opposed to exhibiting such performance only in average years, as observed for the EMP and the H-measure. Additionally, note that in 2002, which involved a severe economic downturn, the model that was developed using the R-EMP measure is the only model that yielded a positive profit. From 2002 onward, the R-EMP model clearly outperforms the other models. Upon calculating the total average profit, the R-EMP is found to achieve the highest profit, followed at some distance by the EMP, H-measure, and AUC. Finally, the total average standard deviation is calculated over the full set of experimental results. As expected, the R-EMP has the lowest standard 14 Figure 2: Empirical Dataset - Number of Selected Attributes by Measure ● ● ● ●● ● ● ● ● ●●● ● ● ● ●● ● ●● ● ● ●●●●● ●● ● ●● ● ● ●●●● ● ●●●●●●●●●●● ● ●●● ● ●●● ● ●● ●● ● ● ●●●● ● ●● ● ●●●●●● ● ● ●●●● ●● ● ●●● ● ●● ● ●●● ● ● ●● ● ● ● ●● ●●●●●● ●● ● ● ● ● ● ●●●● ●● ● ● ● ● ●●● ● ●●●●● ● ● ●● ●● ●● ● ●● ● ● ● ●●● ● ● ● ● ● ●●● ●● ● ●●● ●●●●●●●● ● ● ● ● ● ●●●● ●●● ●● ● ● ●●●●● ● ●● ●● ● ● ● ● ● ●●● ●●● ●●●● ● ●● ● ●● ● ● ●●● ●● ● ●● ● ● ●● ●● ● ●● ● ● ● ● ●●● ●●●●● ● ●●●●●●●● ● ● ● ● ● ● ●●● ● ●●●● ●● ● ●●●●●● ● ●●● ● ●●● ● ● ●●● ● ● ● ● ● ● ●●●● ● ●●● ●●●●●● ●● ● ●●●●● ● ●● ●● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ● ●●● ●● ● ●●●● ● ●●●●●●● ● ●● ● ● ●● ●●●● ● ● ●●● ● ● ● ● ●●● ●● ● ● ●● ● ● ●●●● ● ● ● ● ●●●●●● ●●● ● ● ● ● ● ●●●● ●●● ● ●●● ● ● ● ● ●●●● ● ● ●● ● ● ●●● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ●● ● ●●●● ●●● ● ● ●● ●● ● ●●●●● ●● ●●●●●●● ●● ●● ●●●●● ● ● ●● ● ●● ●● ● ● ●● ● ● ●● ●● ●● ●● ● ● ● ● ●● ●● ●●● ●●●● ●● ●● ●●● ●● ● ● ●●● ●● ●●● ● ● ● ●●● ●● ●●● ● ●● ●● ● ● ● ●● ● ● ●●●●● ● ● ● ● ●●●● ●●● ● ● ● ●● ● ●●● ●● ● ●● ●●● ● ●● ● ● ●● ●●●● ● ●● ● ●● ● ● ● ●● ●● ● ●● ●● ●● ● ● ● ● ● ● ●● ●●●●● ● ● ● ●●●● ● ● ●● ●● ● ●● ●● ● ● ●● ● ●●●● ●●●● ● ● ●● ● ● ●● ● ●●●● ● ● ● ● ● ●● ●● ●●● ●●●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ●● ●● ●●●● ●●● ● ●● ● ●●● ●●● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●●●●● ●● ● ● ● ●● ●● ●●●●●●●●●●●●●●●●●● 7.5 10.0 12.5 15.0 AUC EMP H−measure R−EMP Measures N u m b e r o f se le ct e d a tt ri b u te s Measures AUC EMP H−measure R−EMP deviation with respect to profit. A final comparison between the models developed using the different mea- sures considers their predictive ability for an out-of-time test sample. The results are given in Figure 3. According to this figure, it is possible to ob- serve that the density of profit obtained using the EMP and R-EMP is more concentrated than that obtained using the AUC and H-measure; also, the EMP and R-EMP yield less dispersion, again hinting at the robustness of the proposed measure. The average profit for the out-of-time test sample is only slightly different across the models. According to Table 4, the R-EMP yields the highest profit (51,930 EUR), with the EMP in second place, closely followed by the AUC and the H-measure. Both Table 4 and Figure 3 indicate that even though the difference in average profit is small, the use of the R-EMP does consistently yield a higher profit model, which, importantly, is either less disperse or more robust. This outcome is exactly what the measure was designed to accomplish. 6.2 R-EMP as a decision-making tool This section is devoted to showing how the R-EMP can be used as a decision- making tool. Therefore, two additional experiments are conducted: the first 15 Table 3: Empirical Dataset - Average Profit Year-by-year in EUR by Measure Year AUC EMP R-EMP H-measure 1996 17,273 18,010 18,373 18,277 1997 6,391 5,043 7,877 6,511 1998 41,981 47,271 52,202 42,487 1999 121,923 121,625 109,097 123,311 2000 143,919 143,667 145,059 144,901 2001 103,714 103,375 104,923 103,987 2002 -3,683 -2,686 1,776 -2,856 2003 59,515 68,550 78,670 59,893 2004 41,235 50,974 59,539 40,548 Total average 532,269 555,827 577,515 537,059 Total standard deviation 217,210 216,181 209,535 218,179 Table 4: Empirical Dataset - Profit Out-of-time ± Standard Deviation in EUR by Measure AUC EMP R-EMP H-measure 51,270 ± 13,138 51,732 ± 10,162 51,930 ± 9,057 50,665 ± 13,936 experiment concerns parameter tuning, while the second experiment concerns determining a cut-off value. For this purpose, and for illustrating the use of the developed profit driven evaluation measure for decision making, an artificial neural network is trained. More specifically, since often adopted in a business analytics con- text (Verbeke et al., 2012), a multilayer perceptron (MLP) with one hidden layer is trained, also given the importance of tuning the characteristics of an MLP, for which various evaluation measures can be adopted. As a result, we obtain an indication of the potential gain from a profit perspective that may be achieved when consistently adopting the proposed measure during development of a classification model, in this case, a credit risk model. Note that the evaluation measure can be adopted in combination with any supervised learning approach, including alternative neural networks and deep learning approaches (Schmidhuber, 2015). A broad benchmarking study to compare various supervised learning techniques and performance evaluation measures, is beyond the scope of this paper, but is considered an important topic for further research. 16 Figure 3: Empirical Dataset - Profit Out-of-time in EUR (thousands) by Measure ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ● ● ● ●● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 20 40 60 80 AUC EMP H−measure R−EMP Measures P ro fit Measures AUC EMP H−measure R−EMP An MLP requires a large number of hyperparameters to be tuned to func- tion optimally. One typical method for setting parameter values is using a grid-search over various combinations of parameter values, thus limiting the infinite search space. The parameters tuned in this experiment are the num- ber of units in the hidden layer of the network and the maximum number of iterations of the algorithm. By using a training sample that includes data from the real dataset (described in the previous section) for the years from 1996 to 2004 and an out-of-time test sample that includes data for the years from 2005 to 2008 to evaluate the performance, we select the best combina- tion of parameters using the Accuracy, AUC and R-EMP measures. Given the strong correlation between the AUC and H-measure that was observed in the previous sections, we have omitted the H-measure from this experiment and instead included the Accuracy. The same reasoning was used for the EMP and R-EMP, focusing, of course, on the latter. Based on the number of attributes, the size of the hidden layer is tested from 8 neurons to 32 neurons in steps of 1, and the number of iterations is set from 50 iterations to 1000 iterations in steps of 50. In Table 5, the optimal values of the parameters using the different mea- sures are shown. The table shows the values for the iterations and hidden 17 layer size, the value of the performance measure (PM) at which that param- eter combination occurred, and, for the R-EMP, the optimal fraction (cut-off value) suggested for the model. Table 5: Parameter selection decision driven by different measures Performance Measure Iterations Hidden layer size Value of PM Optimal fraction Accuracy 800 17 0.6945 N/A AUC 50 20 0.7364 N/A R-EMP 50 30 0.0127 6.37% To operate a credit scorecard in practice to decide whether to accept or reject loan applications, a cut-off value needs to be adopted. Setting a cut-off value also allows a straightforward comparison of the performances of the various models in terms of profitability. The R-EMP measure implies a cut-off to be used, reported as the optimal fraction in Table 5. However, for the Accuracy and AUC, we need to choose the cut-off, which for Accuracy is selected based on the score of the test sample for which the accuracy was maximal, while for the AUC, we select the score for which the tangent of the ROC curve is equal to the proportion between average costs for acceptance and rejection, following Hand (2009). In Table 6, the behavior of the models based on the selected cut-off value and parameters is reported. The R-EMP, AUC and Accuracy measures are considered as alternatives, and a baseline scenario, in which no model is used to make a decision, i.e., loans are always granted, is reported as a reference. Under the baseline scenario, 4,566 loans are granted, leading to a total negative profit of -382,197 EUR. When using the Accuracy-based model and selecting the optimal cut-off for the validation sample, there is an improvement in terms of profit, resulting in a positive number of 11,567 EUR. AUC-based decision-making leads to an improvement in the total profit of up to 22,609 EUR. Note that the test accuracy is considerably lower when using the AUC and that the number of rejected loans is relatively larger. Finally, when adopting the R-EMP-based model, we further improve upon the AUC-based model profit, yielding 45,028 EUR and significantly reducing the number of rejected loans. As reported in Table 6, the R-EMP achieves both the highest accuracy and profit for the test sample. The accuracy of the Accuracy-based model is very similar to the accuracy of the model using the R-EMP, with the marginal gain in accuracy when using the R-EMP probably due to the robustness of 18 Table 6: Cut-off value decision driven by different measures Model Cut-off Test accuracy Total profit (EUR) Profit/loan (EUR) Number of granted loans No model N/A 80.20% -382,197 -83.70 4,566 Accuracy-based 0.78 57.60% 11,567 2.74 4,220 AUC-based 0.65 47.82% 22,609 6.89 3,283 R-EMP-based 0.75 58.18% 45,028 10.53 4,275 the measure, as it considers distributions over the sample as opposed to only the averages. The result in terms of the achieved profit is expected, as the R-EMP measure is designed to maximize profit over populations using distributions over samples, whereas the other measures are not. These results show that the R-EMP can be used with confidence to make decisions during model development and for model selection, supporting the method as a decision-making tool. 7 Conclusions In business environments, it is imperative to strive for optimal and robust decision-making, taking into account risks and evolving conditions. This ar- ticle contributes to achieving optimal and robust decision-making by propos- ing a novel performance metric for improving the decision-making process in developing classification models, i.e., by designing a variation of the EMP measure for evaluating classification performance when random shocks may affect the distribution of the profit parameters. The novel measure, dubbed the R-EMP, is experimentally evaluated using both a synthetic and real credit scoring dataset to assess its appropriateness and to compare its characteris- tics with those of the EMP measure as well as the AUC and H-measure. The results of the experiments indicate that the R-EMP effectively out- performs the EMP, as well as the AUC and H-measure, when there are ex- ternal factors affecting the profit variables, thus demonstrating that taking into consideration the impact of random shocks improves the quality of deci- sions in terms of profit. Additionally, the experiments provide evidence that the use of the EMP and R-EMP effectively leads to the selection of different models and yields better performance in terms of achieved profits. More- over, using the R-EMP for decision-making results in reduced variability and therefore improved robustness. Moreover, the presented results show that the novel measure can be reliably used to select the best model and to define the cut-off point for future use. 19 The experiments on a real dataset confirm that the use of the R-EMP results in a more robust model than the use of the EMP, which leads to the conclusion that by adding perturbations to the profit variables, the EMP measure can effectively be improved. For credit scoring applications, the R- EMP measure leads to better decisions than the main measures reported in the literature. Hence, the R-EMP appears to be a robust measure for building predictive analytics models within highly variable business environments. Acknowledgments We acknowledge the support of Conicyt Fondecyt Initiation Into Research 11140264. References References Ali, S., Smith, K. A., 2006. On learning algorithm selection for classification. Applied Soft Computing 6 (2), 119–138. Aman, S., Simmhan, Y., Prasanna, V. K., 2015. Holistic measures for eval- uating prediction models in smart grids. Transactions on Knowledge and Data Engineering, IEEE 27 (2), 475–488. Baesens, B., Mues, C., Martens, D., Vanthienen, J., 2009. 50 years of data mining and OR: upcoming trends and challenges. Journal of the Opera- tional Research Society 60 (1), S16–S23. Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A., Nielsen, H., 2000. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16 (5), 412–424. Bradley, A. P., 1997. The use of the area under the ROC curve in the eval- uation of machine learning algorithms. Pattern Recognition 30 (7), 1145– 1159. 20 Bravo, C., Maldonado, S., Weber, R., 2013. Granting and managing loans for micro-entrepreneurs: New developments and practical experiences. Eu- ropean Journal of Operational Research 227 (2), 358 – 366. Brown, C. D., Davis, H. T., 2006. Receiver operating characteristics curves and related decision measures: A tutorial. Chemometrics and Intelligent Laboratory Systems 80 (1), 24–38. Clemente-Ćıscar, M., San Mat́ıas, S., Giner-Bosch, V., 2014. A methodology based on profitability criteria for defining the partial defection of customers in non-contractual settings. European Journal of Operational Research 239 (1), 276–285. Correa Bahnsen, A., Aouada, D., Ottersten, B., 2014. Example-dependent cost-sensitive logistic regression for credit scoring. In: International Con- ference on Machine Learning and Applications. p. 7. Davenport, T. H., 2006. Competing on analytics. Harvard Business Re- view (84), 98–107. De Bock, K. W., Van den Poel, D., 2011. An empirical evaluation of rotation- based ensemble classifiers for customer churn prediction. Expert Systems with Applications 38 (10), 12293–12301. Fawcett, T., 2006a. An introduction to ROC analysis. Pattern Recognition Letters 27 (8), 861–874. Fawcett, T., 2006b. ROC graphs with instance-varying costs. Pattern Recog- nition Letters 27 (8), 882–891. Hand, D. J., 2009. Measuring classifier performance: a coherent alternative to the area under the roc curve. Machine Learning 77 (1), 103–123. McAfee, A., Brynjolfsson, E., 2012. Big data: the management revolution. Harvard Business Review (90), 60–6. McDonald, R. A., 2006. The mean subjective utility score, a novel metric for cost-sensitive classifier evaluation. Pattern Recognition Letters 27 (13), 1472–1477. Schmidhuber, J., 2015. Deep learning in neural networks: An overview. Neu- ral networks 61, 85–117. 21 Siddiqi, N., 2016. Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards. John Wiley & Sons. Thomas, L. C., Edelman, D. B., Crook, J. N., 2002. Credit Scoring and its Applications. SIAM. Verbeke, W., Dejaeger, K., Martens, D., Hur, J., Baesens, B., 2012. New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research 218 (1), 211–229. Verbraken, T., Bravo, C., Weber, R., Baesens, B., 2014a. Development and application of consumer credit scoring models using profit-based classifica- tion measures. European Journal of Operational Research 238 (2), 505–513. Verbraken, T., Verbeke, W., Baesens, B., 2013. A novel profit maximizing metric for measuring classification performance of customer churn pre- diction models. Transactions on Knowledge and Data Engineering, IEEE 25 (5), 961–973. Verbraken, T., Verbeke, W., Baesens, B., 2014b. Profit optimizing customer churn prediction with bayesian network classifiers. Intelligent Data Anal- ysis 18 (1), 3–24. 22 Introduction Evaluation Measures for Classification Models Profit-based Evaluation Measures The R-EMP measure R-EMP for Credit Scoring Experimental Settings Synthetic Case Empirical Case Using R-EMP as a Profit Measure R-EMP as a decision-making tool Conclusions