title: Semi-supervised adapted HMMs for P2P credit scoring systems with reject inference
authors: El Annas, Monir; Benyacoub, Badreddine; Ouzineb, Mohamed
date: 2022-05-14
journal: Comput Stat
DOI: 10.1007/s00180-022-01220-9

Abstract. The majority of current credit scoring models used for loan approval are built solely on information from accepted credit applicants, whose ability to repay the loan is known. This generates what is called selection bias: the sample is not representative of the population of applicants, since rejected applications are excluded. This undermines the validity of such models from both a statistical and an economic point of view, especially for models used in peer-to-peer lending platforms, whose rejection rates are extremely high. The method of incorporating information on rejected applicants into the construction of credit scoring models is known as reject inference. This study proposes a semi-supervised learning framework based on hidden Markov models (SSHMM) as a novel method of reject inference. Real data from the Lending Club platform, the most widely used online lending marketplace in the United States and worldwide, is used to test the effectiveness of our method against existing approaches. The results of this study clearly illustrate the proposed method's superiority, stability, and adaptability.

Fintech is emerging rapidly worldwide. Despite the economic shock of the COVID-19 pandemic, global fintech investment remained strong, at over $25.6 billion in the first half of 2020 (https://home.kpmg/xx/en/home/insights/2020/02/pulse-of-fintech-archive.html). The pandemic has significantly accelerated digital trends and the demand for digital platforms such as digital banking, peer-to-peer lending, and other fintech-related services. Peer-to-peer (P2P) lending platforms (https://www.lendingclub.com/info/download-data.action) allow borrowers to obtain loans directly from other people. For lenders, it is an alternative way to lend to customers without going through banks and credit organizations, which are very demanding in terms of guarantees and expensive in terms of transaction charges. Despite its many advantages, P2P lending carries a high level of risk for lenders. As a result, credit scoring systems are commonly used by P2P lending platforms to evaluate potential borrowers. This is generally done by building models using only data from previously accepted applicants, without taking into account the applicants who were rejected. The resulting credit scoring models are therefore biased (Bücker et al. 2013), with statistical and economic consequences (Chen and Astebro 2001; Marshall et al. 2010). Reject inference, as a method of inferring the creditworthiness status of rejected applications, has attracted a lot of interest in the P2P lending domain, where the rejection rate is extremely high. For example, between June 2007 and December 2018, the Lending Club P2P lending platform accepted 2,260,701 loans and rejected 27,648,741, so only 8% of loan applications were issued by the platform. The majority of reject inference methods use statistical techniques.
However, semi-supervised machine learning algorithms are increasingly used in this research area (see Table 1). This study proposes a semi-supervised hidden Markov model (SSHMM) as a novel method, in order to evaluate the use of semi-supervised machine learning for reject inference in credit scoring. We compare the performance of the SSHMM model with a set of state-of-the-art semi-supervised machine learning algorithms used for reject inference. In addition, supervised machine learning models are used to evaluate the performance gain from reject inference. Finally, by sampling the rejected data set to generate several samples with varied rejection rates, we conduct a full sensitivity study on reject inference.

The paper is structured as follows. Section 2 discusses related work on credit scoring and reject inference strategies, followed by Sect. 3's discussion of HMM models and introduction of the proposed SSHMM model. Section 4 describes the data and experimental setup and discusses the major findings. Finally, we give the main conclusions as well as some suggestions for further research.

Credit scoring is used by financial institutions and P2P lending platforms to assess the creditworthiness of loan applicants. It is usually embedded in a probabilistic framework p(y | x), which describes the likelihood that an applicant will repay his loan (y = 1) or not (y = 0) depending on his characteristics x. As a result, estimating p(y | x) is an important part of any credit rating process. Generally, the two types of standard credit scoring models, statistical and machine learning based (Siddiqi 2017; Lessmann et al. 2015), use only the information on loan records of accepted applicants. Reject inference, the process of inferring the good or bad loan performance of rejected applicants for the construction of credit scoring models, has been explored as a missing data problem and categorized into three types (Feelders 1999), based on the modelling of p(z | x, y), where z is a binary variable indicating whether the applicant's request was accepted or refused.

The first missing mechanism is missing completely at random (MCAR), which means p(z | x, y) = p(z). In this situation, applicants are approved or denied independently of their loan records or personal information: acceptance is independent of the applicant characteristics x and the class y. It essentially means that platforms or financial institutions accept or reject applicants at random, without considering their characteristics or repayment ability. Under the MCAR condition there is no selection mechanism, and thus no sample bias in the lending process. Since the way platforms and financial institutions handle loan applications is totally inconsistent with this mechanism, it is always disregarded in credit scoring models.

The second mechanism is missing at random (MAR), which means p(z | x, y) = p(z | x). In this situation, loan requests are accepted only on the basis of the values of x and certain arbitrary cut-offs. In credit scoring applications, this is similar to accepting exactly those applicants whose score, computed from x alone, exceeds a fixed cut-off.
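To make the MAR selection mechanism and the resulting sample bias concrete, the small self-contained simulation below (our illustration, not from the paper) accepts applicants only when an observable characteristic exceeds a cut-off chosen to mimic the roughly 8% acceptance rate reported above. The default rate among accepts ends up far below the population default rate, which is exactly the bias that reject inference tries to correct.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
x = rng.normal(size=n)                        # observable characteristic
# True repayment behaviour: higher x means more likely to repay (y = 1)
p_good = 1 / (1 + np.exp(-(0.5 + 2 * x)))
y = rng.random(n) < p_good

z = x > 1.4                                   # MAR acceptance: depends on x only
print(f"acceptance rate: {z.mean():.1%}")     # P(x > 1.4) is roughly 8%
print(f"default rate, population:   {1 - y.mean():.1%}")
print(f"default rate, accepts only: {1 - y[z].mean():.1%}")
```

Because acceptance here depends only on x, a model of p(y | x) fitted on accepts alone is still well specified, but it is trained on a region of x that badly under-represents the applicants a platform actually screens.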
The third mechanism is missing not at random (MNAR), which states that z can also be influenced by the missing outcome y, implying that p(z | x, y) ≠ p(z | x). MNAR is thus a type of missingness in which the acceptance decision is determined not just by x but also by y, which is affected by unobserved variables, such as loan officers' manual overrides of the model decision (according to their overall impression of an applicant, based on personal experience or other factors). In particular, the majority of online loan investors are not expert financial investors, and their selections are frequently influenced by a variety of subjective factors.

A variety of reject inference strategies have been used, which may be divided into statistical methods and machine learning techniques. The most common statistical methods in early reject inference studies are augmentation and extrapolation (Banasik et al. 2003; Anderson 2007). In augmentation, accepted loan applications are re-weighted so that the accepted sample better represents the full population of applicants. In extrapolation, a credit scoring model is first built solely on accepted applications and used to predict the classes of the rejected applications, before a new credit scoring model is built on both samples. However, according to relevant research, augmentation and extrapolation do not improve the performance of credit scoring models in most circumstances compared to the original model trained solely on accepted loans (Banasik and Crook 2007; Crook and Banasik 2004). Survival analysis techniques (Sohn and Shin 2006) are another widely used approach to reject inference; however, they have been found useful only when rejected applications form the majority of the sample (Banasik and Crook 2010). In contrast, some recent studies have addressed reject inference in a semi-supervised scenario based on: support vector machines (Maldonado and Paredes 2010; Li et al. 2017; Tian et al. 2018; Kim and Cho 2019), gradient boosting decision trees (Xia et al. 2018), LightGBM (Xia 2019), Bayesian networks (Anderson 2019), deep generative models (Mancisidor et al. 2020), logistic regression (Kozodoi et al. 2019), and ensemble learning frameworks that combine multiple classifiers and clustering algorithms (Liu et al. 2020; Shen et al. 2020; Kang et al. 2021). In comparison to statistical approaches, all of the studies above demonstrated the superiority of semi-supervised machine learning methods for reject inference. A summary of reject inference research using semi-supervised machine learning approaches is shown in Table 1.

This section introduces the mathematical basis and learning algorithms of discrete hidden Markov models; the proposed SSHMM model is then described. The transition matrix A, the observation probability matrix B, and the initial probability vector π are the hidden Markov model parameters, represented by a single parameter λ = {A, B, π}. The main elements of a hidden Markov model, namely the hidden state sequence, the possible values of each state, the possible symbols per state, the initial probability vector, the transition matrix, and the observation probability matrix, are summarized in Table 2 (Baum et al. 1970; Levinson et al. 1983; Li et al. 2000).

To illustrate the Baum-Welch procedure for estimating the parameter λ of an HMM that generates a single observation sequence O = (o_1, o_2, ..., o_T), we define the following probabilities (Baum et al. 1970; Levinson et al. 1983):

• The joint probability α_t(i) = P(o_1, o_2, ..., o_t, s_t = e_i | λ), computed recursively (forward algorithm): α_1(i) = π_i b_i(o_1) for i = 1, 2, ..., N; then, for t = 2, 3, ..., T and j = 1, 2, ..., N, α_t(j) = [Σ_{i=1}^N α_{t-1}(i) a_{ij}] b_j(o_t).

• The conditional probability β_t(i) = P(o_{t+1}, ..., o_T | s_t = e_i, λ), computed recursively (backward algorithm): β_T(i) = 1 for i = 1, 2, ..., N; then, for t = T-1, ..., 1 and i = 1, 2, ..., N, β_t(i) = Σ_{j=1}^N a_{ij} b_j(o_{t+1}) β_{t+1}(j).

• The probability γ_t(i) of being in state e_i at time t: γ_t(i) = α_t(i) β_t(i) / Σ_{j=1}^N α_t(j) β_t(j).

• The probability ξ_t(i, j) of being in state e_i at time t and in state e_j at time t+1: ξ_t(i, j) = α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j) / Σ_{i=1}^N Σ_{j=1}^N α_t(i) a_{ij} b_j(o_{t+1}) β_{t+1}(j).

HMM learning is then performed with the Baum-Welch algorithm for a single observation sequence: initialize the parameters λ and a tolerance δ, then iterate the re-estimation formulas π̂_i = γ_1(i), â_{ij} = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i), and b̂_j(v_k) = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^T γ_t(j), until the likelihood gain falls below δ.
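The recursions above can be made concrete with a short NumPy sketch of the forward and backward passes and one Baum-Welch re-estimation step for a single discrete observation sequence. This is a minimal illustration of the standard algorithm, not the authors' implementation; the function names are ours, and the numerical scaling of α and β that a production implementation needs is omitted for clarity.

```python
import numpy as np

def forward(A, B, pi, obs):
    """alpha[t, i] = P(o_1..o_t, s_t = e_i | lambda)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+1}..o_T | s_t = e_i, lambda)."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(A, B, pi, obs):
    """One EM re-estimation of (A, B, pi) from a single sequence."""
    obs = np.asarray(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    gamma = alpha * beta                       # gamma_t(i), unnormalized
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi_t(i,j) ~ alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                # emission counts per symbol
        new_B[:, k] = gamma[obs == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi
```

Iterating `baum_welch_step` until the log-likelihood `np.log(forward(A, B, pi, obs)[-1].sum())` improves by less than the tolerance δ reproduces the stopping rule described above.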
The HMM may be extended to support L independent observation sequences with one common hidden structure. To describe Baum-Welch learning for L independent sequences of equal length T, we define, for each l = 1, 2, ..., L, the forward probabilities α^(l)_t(i) = P(o^(l)_1, ..., o^(l)_t, s_t = e_i | λ) and the backward probabilities β^(l)_t(i) = P(o^(l)_{t+1}, ..., o^(l)_T | s_t = e_i, λ), for i = 1, 2, ..., N and t = 1, 2, ..., T; each is computed recursively with the same forward and backward algorithms as in the single-sequence case. HMM learning is then done using the Baum-Welch algorithm for multiple observation sequences, in which the re-estimation formulas sum the expected counts γ^(l)_t(i) and ξ^(l)_t(i, j) over the L sequences (Li et al. 2000).

We propose a semi-supervised hidden Markov model (SSHMM) framework to address the problem of reject inference, which aims to take advantage of the data collected on both accepted and rejected credit applicants. The proposed SSHMM model is constructed in three main stages: binning, filtering, and model training.

In the first stage, a binning process is used to discretize the values of continuous variables into bins and to address the presence of outliers and statistical noise. Furthermore, the binning process serves for data scaling and model complexity reduction. It is worth noting that binning techniques are commonly applied in credit risk modelling (Siddiqi 2017). The binning quality is assessed using a score considering the following aspects (Navas-Palencia 2020): information value (IV), statistical significance, and homogeneity.

In the second stage, a filtering process is performed to remove observations that may have a deleterious effect on the model's performance, using the isolation forest algorithm (Liu et al. 2008). We first remove the rejected applicants that differ the most from the distribution of the accepts; second, the rejected applicants that are the most similar to accepted applicants are removed. The filtering process also reduces data noise and retains cleaner data, thus decreasing the data size and saving computing resources.

In the third stage, the HMM structure is set such that the class labels (good/bad) are represented by two hidden states and the observation sequence corresponds to the sequence of binned characteristics. We first compute an initial parameter λ of the HMM using maximum likelihood estimation (MLE), i.e. from the relative frequency counts of states, transitions, and emitted symbols in the labelled accepted sample. Then, we adjust the HMM parameters using the iterative Baum-Welch procedure, given the observation sequences from the rejected applicant samples. The flow chart describing the SSHMM modelling pipeline is presented in Fig. 1. Thus, SSHMM takes advantage of both supervised and unsupervised learning: it combines information from supervised learning (the MLE initialization) and unsupervised learning (the Baum-Welch algorithm) to obtain the complete model.
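The filtering and initialization stages might be sketched as follows, assuming the binned characteristics are already encoded as integer symbols. The paper describes the filtering rule only qualitatively, so the dropped fractions at each end of the isolation-forest score range, the near-identity transition initialization, and all function names are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def filter_rejects(X_acc, X_rej, contamination=0.05, similar_frac=0.05):
    """Drop rejects most unlike the accepts and, per one plausible
    reading of the filtering stage, those most like them."""
    iso = IsolationForest(contamination=contamination, random_state=0)
    iso.fit(X_acc)
    scores = iso.score_samples(X_rej)         # higher = more accept-like
    lo = np.quantile(scores, contamination)
    hi = np.quantile(scores, 1 - similar_frac)
    keep = (scores > lo) & (scores < hi)
    return X_rej[keep]

def mle_init(seqs, labels, n_symbols):
    """Supervised MLE of lambda = (A, B, pi). Assumption: the two hidden
    states are the good/bad labels, so each labelled applicant's
    sequence is treated as staying in its class state throughout."""
    N = 2
    pi = np.bincount(labels, minlength=N).astype(float)
    pi /= pi.sum()
    A = np.full((N, N), 1e-3)                 # smoothed transition counts,
    np.fill_diagonal(A, 1.0)                  # near-identity by assumption
    A /= A.sum(axis=1, keepdims=True)
    B = np.ones((N, n_symbols))               # Laplace-smoothed emissions
    for seq, y in zip(seqs, labels):
        B[y] += np.bincount(seq, minlength=n_symbols)
    B /= B.sum(axis=1, keepdims=True)
    return A, B, pi
```

The Baum-Welch updates from the earlier sketch, extended to sum the expected counts over the L rejected sequences, would then refine λ = {A, B, π} starting from this supervised initialization.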
Since the initialization is done in a supervised manner, the learned parameters remain aligned with the initialization labels rather than with randomly assigned labels, resulting in a more consistent credit scoring model with reject inference.

This section introduces the data sets, performance measures, and the evaluation baseline of the proposed framework. Our numerical experiments are based on data from the Lending Club online credit marketplace (https://www.lendingclub.com/info/download-data.action) for the period from 2007 to 2018, containing both rejected and accepted applications. The characteristics of the accepted and rejected data sets are incompatible: the accepted data set initially has 150 characteristics, whereas the rejected data set has only six, namely loan amount, FICO score, debt-to-income (dti) ratio, loan purpose, address state, and employment length. Only these characteristics, shared by accepted and rejected applicants, are used in this study. Although the accepted data set's features provide a lot of information about applicants' creditworthiness, by using only the six shared characteristics to build the credit scoring model, some important information may be missed. Only loans with a fully paid or defaulted status were considered, and records with missing values or obvious errors were removed. The final data set used in this study contains 2,064,314 rejected loans and 1,266,782 accepted loans, including 247,426 defaulted loans. Tables 3 and 4 show descriptive statistics of the Lending Club data, and the data binning summary is given in Table 5. It is worth mentioning that the Lending Club data set is the most commonly used data set in previous studies of the reject inference problem (Li et al. 2017; Tian et al. 2018; Kim and Cho 2019).

We use four evaluation measures relevant to credit scoring studies to assess the performance of our proposed model and the benchmarks: accuracy, precision, recall, and AUC. Accuracy is the proportion of correctly predicted instances out of the total number of instances. Precision quantifies the fraction of the predicted positive instances that are truly positive. Recall quantifies the fraction of the positive instances that are correctly predicted as positive. AUC reflects a classifier's overall behaviour independently of the classification threshold: the model is considered to have good discriminative capability when its AUC approaches 1, and poor discriminative capability when its AUC approaches 0.5. The AUC can be computed as the fraction of correctly ranked (positive, negative) pairs, AUC = (1 / (n⁺ n⁻)) Σ_{i: y_i = 1} Σ_{j: y_j = 0} 1[f(x_i) > f(x_j)], where f is the separation surface (the classifier's score function) and 1 is the indicator function.
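A brief, runnable sketch of the four measures on toy scores, using scikit-learn and, for comparison, the pairwise (Mann-Whitney) form of the AUC given above; the arrays are placeholders, not Lending Club data.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])          # toy labels
scores = np.array([.9, .2, .7, .4, .3, .6, .8, .1])  # classifier scores f(x)
y_pred = (scores >= 0.5).astype(int)                 # threshold at 0.5

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, scores))

# AUC as the fraction of (positive, negative) pairs ranked correctly;
# ties in the scores are ignored in this simple form.
pos, neg = scores[y_true == 1], scores[y_true == 0]
print("AUC (pairwise):", (pos[:, None] > neg[None, :]).mean())
```

With these toy values both AUC computations agree, illustrating that `roc_auc_score` is exactly the pairwise ranking statistic of the formula above.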
In the literature, parametric and nonparametric significance tests have been conducted to determine whether one model is significantly better than another. The assumptions of parametric tests, such as normality or homogeneity of variance, are generally violated in practice (Lessmann et al. 2015); therefore, nonparametric tests are often preferred (Demsar 2006; García et al. 2010). Friedman's test (Friedman 1940) is used in this study to determine whether there is a significant difference between models for a given evaluation measure. The Friedman aligned-rank test and the Quade test are two alternatives (García et al. 2010), but they are preferred over the Friedman test only when no more than four or five algorithms are compared.

The Friedman statistic is computed as χ²_F = (12N / (k(k+1))) [Σ_{j=1}^k R_j² − k(k+1)²/4], where k denotes the number of models, N the number of data samples, R_j = (1/N) Σ_{i=1}^N r_i^j the average rank of the j-th model over all data samples, and r_i^j the rank of the j-th of k models on the i-th of N data samples. If Friedman's test rejects the null hypothesis of equivalence of ranks for a given evaluation measure, we perform pairwise comparisons using the post-hoc Nemenyi test (Nemenyi 1962) by computing the critical difference CD = q_α √(k(k+1) / (6N)). The critical values q_α are based on the studentized range statistic. The results of the Nemenyi post-hoc test are illustrated by critical difference diagrams, which display the model ranks as well as the critical difference; a horizontal bar connects models that are not significantly different. Furthermore, for each evaluation measure, we use a Wilcoxon rank-sum test to compare the control approach (the proposed SSHMM model in this research) to the set of benchmark models. This test is more powerful than the post-hoc test when the aim is to determine whether a new approach is superior to existing ones (Demsar 2006).

Our experimental process for evaluating the effectiveness of the proposed framework is described in Fig. 2. Two sets of experiments are performed. In the first, we compare the performance of the SSHMM model with a range of semi-supervised learning techniques for reject inference, including semi-supervised SVM (S3VM), SVM combined with the self-learning, contrastive pessimistic likelihood estimation (CPLE), and label propagation frameworks, as well as LightGBM as the base classifier within the self-learning and CPLE frameworks. To measure the marginal gain of reject inference, we use six supervised machine learning classifiers widely used in credit scoring (Lessmann et al. 2015): multi-layer perceptron (MLP), support vector machines (SVM), random forest (RF), extreme gradient boosting (XGBoost), light gradient boosting machines (LightGBM), and categorical boosting (CatBoost). In the second experiment, we vary the size of the rejected sample while keeping the size of the accepted sample fixed, to see how the rejected sample size affects the SSHMM model's predictive ability. As suggested by Li et al. (2017); Tian et al. (2018); Xia et al. (2018); Xia (2019), the experiment is carried out as follows:

Step 1: Randomly select a sample of accepts and a sample of rejects, whose sizes are denoted NA and NR, respectively.
Step 2: Randomly divide the accepted sample into a training set and a test set in the proportion 70%:30%, then merge the NR rejected applications with the training sample.
Step 3: Build supervised models using the labelled training sample and semi-supervised models using the full training sample (labelled and unlabelled).
Step 4: Predict the likelihood of default and the labels of the test set using the classification rules generated in Step 3.
Step 5: Compute and compare the models' performance metrics.

Steps 2 through 5 were repeated 25 times, and the evaluation metrics were computed by averaging the resulting values. In the first experiment, we set NA to 2,000 and keep the original acceptance ratio of 8%. In the second experiment, we set up two alternative scenarios and compared ROC curves and AUC scores to see how the rejection rate affects the SSHMM model's performance. We first set NA to 2,000 and varied NR from 1,000 to 25,000. We then set NA to 1,266,782 and varied NR between 1,000 and 2,064,314; this is more data than S3VM can handle due to memory requirements, and it is not feasible for the CPLE procedure due to computing time.
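Applied to per-sample performance values such as those produced by Steps 1 through 5, the significance-testing machinery described above can be sketched as follows. The AUC matrix is a toy stand-in, and the constant q_0.05 = 2.569 is the Nemenyi critical value for k = 4 models.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata, ranksums

# auc[i, j]: AUC of model j on repetition i (toy numbers for illustration)
rng = np.random.default_rng(0)
N, k = 25, 4                                  # data samples, models
auc = 0.70 + 0.02 * rng.random((N, k)) + np.array([0.03, 0.02, 0.00, 0.01])

# Friedman test: are the k models' rank distributions different?
stat, p = friedmanchisquare(*auc.T)
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4g}")

# Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N))
R = rankdata(-auc, axis=1).mean(axis=0)       # average rank (1 = best)
cd = 2.569 * np.sqrt(k * (k + 1) / (6 * N))
print("average ranks:", np.round(R, 2), " CD:", round(cd, 3))

# Wilcoxon rank-sum: control model (column 0) vs. each benchmark
for j in range(1, k):
    print(f"model {j}: p = {ranksums(auc[:, 0], auc[:, j]).pvalue:.4g}")
```

In a critical difference diagram, any pair of models whose average ranks differ by less than `cd` would be joined by a horizontal bar, exactly as described above.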
Furthermore, to avoid evaluating the models on accepted cases only, the previous set of experiments was also performed with the same proportion of rejected cases included in the test sample, so that the test sample contains both accepted and rejected cases (unbiased test sample). Since the true labels of the rejects are unknown, direct estimation of performance on them is impossible; we therefore generate an approximate ground truth for the good/bad labels of the rejected cases, following the method of Li et al. (2017). It is worth mentioning that only a few studies had access to a data set including a fraction of rejected applicants with known outcomes (Kozodoi et al. 2019; Shen et al. 2020), obtained, for example, by executing risky strategies such as accepting some applicants rejected by the scoring system, so that the true repayment status of those initially rejected applicants becomes known. Unfortunately, the data sets from those studies are private.

Machine learning algorithms have several hyper-parameters that largely influence performance, so these must be tuned. We used a grid search with 10-fold cross-validation to find the optimal hyper-parameters for the SVM, RF, XGBoost, CatBoost, LightGBM, and MLP classifiers; Table 7 summarizes the hyper-parameter search space for each classifier. In our proposed SSHMM framework, hyper-parameter optimization concerns the contamination parameter of the filtering stage, for which we considered the values 0.01, 0.03, 0.05, 0.1, and 0.2.
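As an illustration of the tuning procedure, here is a minimal grid search with 10-fold cross-validation in scikit-learn; the classifier, the toy data, and the grid are small stand-ins, not the actual search spaces of Table 7.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy labelled data standing in for the accepted applicants
X_train, y_train = make_classification(n_samples=500, random_state=0)

# Illustrative grid only; Table 7 defines the actual search spaces
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [4, 8, None],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",   # AUC as the model-selection criterion
    cv=10,               # 10-fold cross-validation
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 4))
```

The same loop applies to each of the six supervised classifiers, with only the estimator and its grid swapped out.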
Predictive performance analysis. Table 8 shows the numerical results of the proposed SSHMM model and the benchmark models while preserving the original acceptance ratio; the best result for each performance metric (accuracy, precision, recall, and AUC) is highlighted in bold. On the biased test set, SSHMM outperforms the other classifiers on most evaluation measures, namely accuracy, precision, and AUC. In particular, SSHMM improved the classification capability of the base HMM model on these measures. Over all evaluation measures, the S3VM model performed worse than the standard SVM model, and when SVM was combined with the self-learning, CPLE, and label propagation frameworks, predictive performance deteriorated as well. On the unbiased test set, the MLP model yields the best performance in terms of accuracy, recall, and AUC, with LightGBM the second best model, matching the MLP in terms of recall and AUC. Our proposed SSHMM model was the third best, while achieving the highest precision. Note that SSHMM improved the classification capability of HMM on all evaluation measures; the S3VM and self-learning frameworks also improved the classification capability of the base SVM model on all evaluation measures.

The Friedman test statistics on the accuracy, recall, precision, and AUC metrics are presented in Table 8. The Friedman test's null hypothesis is rejected at the 95% level of significance, indicating significant differences among the models. We use the Nemenyi post-hoc test to identify which models differ significantly: if the difference in mean ranks exceeds the critical difference, the difference is significant. The results of the post-hoc tests are shown in Figs. 3 and 4; at the 95% level of significance, the models connected by a bold line are not statistically different. Furthermore, Table 9 shows the results of the significance test on the AUC of the control method SSHMM against the benchmark models using the Wilcoxon rank-sum test, at significance level α = 0.05. The null hypothesis is that there is no significant difference between the AUC performance of the control model SSHMM and that of the comparison model. SSHMM is significantly better than the benchmark models in AUC performance on the biased test set (p-value < 0.05). However, the p-values between SSHMM and the MLP, LightGBM, CatBoost, RF, and HMM models were greater than 0.05, indicating statistically insignificant differences on the unbiased test set. Overall, the results highlight the efficiency of the proposed model.

To investigate the impact of rejection rates on AUC performance and to identify the optimal rejection rate for the SSHMM, we randomly sampled the rejected data set with different rejection rates. The ROC curves in Fig. 5 lead to the following conclusions. First, the proposed SSHMM model can reach optimal performance without requiring a large number of rejected samples; moreover, samples with a low rejection rate yield better predictive accuracy than samples with a higher rejection rate. Second, the SSHMM's predictive performance varied as the rejection rate increased, but in most circumstances the SSHMM's ROC curves dominated those of the supervised HMMs.

In terms of semi-supervised learning, there have been few successful methodologies for the problem of reject inference in the credit scoring domain. Using a semi-supervised adapted HMM model, this study offers a novel approach to the problem. The SSHMM model outperforms other models in terms of applicability, stability, and performance when tested on real P2P lending data. More importantly, by exploiting the prospective information of rejected candidates, the proposed framework improves the prediction performance of the underlying HMM classifier. We can look into the following directions for future research. First, because the Baum-Welch algorithm is known to converge to local optima, different algorithms can be used to estimate the HMM parameters (El Annas et al. 2022). Second, an ensemble method incorporating the existing machine learning methods together with SSHMM could be built to perform the reject inference.
References (cited above; titles as recovered from the source)
The credit scoring toolkit: theory and practice for retail credit risk management and decision automation
Using Bayesian networks to perform reject inference
Reject inference, augmentation, and sample selection
Reject inference in survival analysis by augmentation
Sample selection bias in credit scoring models
A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains
Reject inference in consumer credit scoring with nonignorable missing data
The economic value of reject inference in credit scoring
Does reject inference really improve the performance of application scoring models?
Statistical comparisons of classifiers over multiple datasets
Hidden Markov models training using hybrid Baum Welch: variable neighborhood search algorithm
Credit scoring and reject inference with mixture models
A comparison of alternative tests of significance for the problem of m rankings
Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power
A graph-based semi-supervised reject inference framework considering imbalanced data distribution for consumer credit scoring
An ensemble semi-supervised learning method for predicting defaults in social lending
Shallow self-learning for reject inference in credit scoring
Benchmarking state-of-the-art classification algorithms for credit scoring: an update of research
An introduction to the application of the theory of probabilistic functions of Markov process to automatic speech recognition
Training hidden Markov models with multiple observations: a combinatorial method
Reject inference in credit scoring using semi-supervised support vector machines
Isolation forest
A new approach in reject inference of using ensemble learning based on global semi-supervised framework
A semi-supervised approach for reject inference in credit scoring using SVMs
Deep generative models for reject inference in credit scoring
Variable reduction, sample selection bias and bank retail credit scoring
Optimal binning: mathematical programming formulation
Distribution-free multiple comparisons
Three-stage reject inference learning framework for credit scoring using unsupervised transfer learning and three-way decision theory
Intelligent credit scoring: building and implementing better credit risk scorecards
Reject inference in credit operations based on survival analysis
A new approach for reject inference in credit scoring using kernel-free fuzzy quadratic surface support vector machines
A novel reject inference model using outlier detection and gradient boosting technique in peer-to-peer lending
A rejection inference technique based on contrastive pessimistic likelihood estimation for P2P lending