1 Introduction

In recent years, machine learning (ML) has become a popular tool for data analysis in various fields such as finance [1], healthcare [2], and marketing [8]. The ultimate goal of machine learning is to build predictive models that can accurately predict the target variable based on the input features.

Choosing the best model among a set of candidate models, however, can be a daunting task. One approach is to use statistical tests of hypothesis to compare the performance of different models [4].

The statistical tests of hypothesis are widely used in the scientific community to evaluate the significance of a result or to compare the performance of different methods [7]. In machine learning, statistical tests are used to determine whether there is a significant difference between the performance of two or more models.

The use of statistical tests in ML model selection has become an important topic of research due to its impact on the performance of the final model [3,4,5, 14, 17]. The use bayesian statistics is studied in [3, 5] while the use of frequentist tests is the focus of [4, 14, 17]. While these studies investigate the ability of statistical tests to choose the best model in a single experiment, this study aims to understand their effectiveness when they are applied over larger periods of time. More specifically, this study investigates whether the statistical tests commonly used for selecting the best ML model in a single experiment remain effective when applied in sequences of experiments. To achieve this, using a set a binary classification problems, the quality of the models selected by each test is examined after a significant number of iterations. Other parameters such as the number of iterations and the amount of data collected to feed the statistical tests are also investigated. In this context, the research questions investigated in this study can be laid down as follows:

  1. 1.

    The selection of machine learning models based on statistical tests of hypothesis lead to higher quality models after a large number of iterations?

  2. 2.

    What is the effect of the number of tests performed to acquire data to these statistical test of hypothesis?

  3. 3.

    Is there a preferred statistical test for different experimental time horizons?

The rest of the paper is organized as follows. In Sect. 2, we will provide a brief overview of statistical tests of hypothesis and their application in machine learning. In Sects. 3 and 4, we will describe the experimental setup and datasets used in this study. In Sect. 5, we will present and analyze the results of our experiments. Finally, in Sect. 6, we will conclude the paper with a discussion of the implications of our findings and future research directions.

2 Statistical Hypothesis Testing

In this section, we introduce the hypothesis tests used in this study. Given our objective of comparing two groups, which is the simplest scenario, i.e., comparing two algorithms, we have selected methods for this purpose. We have chosen the t-test as a representative parametric test, while the Mann-Whitney U test has been selected as the non-parametric option.

2.1 Paired T-Test for Two Related Samples

A paired t-test is a statistical test used to determine whether there is a significant difference between the means of two related samples [11]. While using it, each individual in one sample is paired with an individual in the other sample based on some common characteristic, such as before-and-after measurements of the same individual, or measurements from two different methods applied to the same subjects. It assumes that the two samples are normally distributed and have equal variances.

The null hypothesis of the paired t-test is that there is no difference between the means of the two samples. If the p-value returned by it, which is a representation of the probability of obtaining a statistic as extreme as the one observed, assuming that the null hypothesis is true, is less than a predefined significance level (e.g., 0.05), the null hypothesis can be rejected, and it usually concluded that there is a significant difference between the means of the two samples.

The scipy.stats.ttest_rel function is the method in the SciPy library [16] that performs a paired t-test for two related samples. It can be used in machine learning to compare the performance of two models on the same dataset. This information can be useful in model selection and can help to identify the best performing model for a given task. The test statistic t is computed as follows:

$$\begin{aligned} t = \frac{\overline{X}_D - \mu _0}{s_D/\sqrt{n}} \end{aligned}$$
(1)

where, \(\overline{X}_D\) and \(s_D\) are the average and the standard deviation of the difference between all pairs. \(\mu _0\) is the true mean of the difference under the null hypothesis. It is zero if we want to test whether the average of the difference is significant. The number of pairs is represented by n, which is also used to calculate the degrees of freedom as \(n-1\).

2.2 Independent T-Test for Two Independent Samples

An independent t-test is a statistical test used to determine whether there is a significant difference between the means of two independent samples.

The scipy.stats.ttest_ind function is a method in the SciPy library [16] that performs an independent t-test for two independent samples. It assumes that the two samples are normally distributed and have equal variances.

The null hypothesis of the independent t-test is that there is no difference between the means of the two samples [11]. If the p-value returned by the ttest_ind function is less than a predefined significance level (e.g., 0.05), the null hypothesis can be rejected, and it can be concluded that there is a significant difference between the means of the two samples. The test t statistic is computed as:

$$\begin{aligned} t = \frac{\overline{X}_1 - \overline{X}_2}{s_p \times \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} } \end{aligned}$$
(2)

where

$$\begin{aligned} s_p = \sqrt{\frac{(n_1 - 1)s^2_{X_1} + (n_2 - 1)s^2_{X_2}}{n_1 + n_2 - 2} } \end{aligned}$$
(3)

\(s_ p\) is the pooled standard deviation of the two samples. It is defined in this way so that its square is an unbiased estimator of the common variance. For more details, see [12].

2.3 Mann-Whitney U Test

The Mann-Whitney U test is a non-parametric statistical test used to determine whether there is a significant difference between the distributions of two independent samples.

The ss.mannwhitneyu function is a method in the SciPy library [16] that performs the Mann-Whitney U test, also known as the Wilcoxon rank-sum test.

A very general formulation is to assume that:

  • All the observations from both groups are independent of each other,

  • The responses are at least ordinal (i.e., one can at least say, of any two observations, which is the greater),

  • Under the null hypothesis \(H_0\), the distributions of both populations are identical.

  • The alternative hypothesis \(H_1\) is that the distributions are not identical.

The function takes two arrays of different sizes as input. The arrays represent the two samples that are being compared. The function returns two values: the calculated U statistic and the associated p-value. The U statistic is a measure of the difference between the ranks of the two samples.

The null hypothesis of the Mann-Whitney U test is that there is no difference between the distributions of the two samples. If the p-value returned by the ss.mannwhitneyu function is less than a predefined significance level (e.g., 0.05), the null hypothesis can be rejected, and it can be concluded that there is a significant difference between the distributions of the two samples.

The Mann-Whitney U test is commonly used in machine learning to compare the performance of two models, especially when the assumptions of the t-test (such as normality and equal variances) are not met.

In details, let \(X_1, \cdots , X_n\) be an independent and identically distributed (i.i.d.) sample from X and \(Y_1, \cdots , Y_m\) an i.i.d. sample from Y and each sample independent from another. The corresponding Mann-Whitney U statistic is defined as:

$$\begin{aligned} U = \sum _{i=1}^n\sum _{j=1}^m S\left( X_i, Y_j \right) \end{aligned}$$
(4)

with

$$\begin{aligned} S(X, Y) = {\left\{ \begin{array}{ll} 1 , \text { if } X > Y \\ 0.5, \text { if } X = Y\\ 0, \text { if } X < Y \\ \end{array}\right. } \end{aligned}$$
(5)

For large samples, assign numeric rank to all the observations, independent of the group they are. Then, U is given by \(min(U_1, U_2)\), where: \(U_1 = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1,\) and \(U_2 = n_1 n_2 + \frac{n_2(n_2 + 1)}{2} - R_2\). \(R_i\) is the sum of the ranks from observations of the group i and \(n_i\) is the sample size of the group i.

3 The Test Bed

To evaluate the impact of using the aforementioned statistical tests for model selection over time, a simplified evolutionary algorithm is defined. This algorithm operates on a population of n model types, each having a set of associated hyper-parameters. The model with the highest performance, according to metric M, is designated as best model.

During each generation, a random model type is selected and its parameters undergo random mutations. The mutated model then competes with the model of the same type in the population. If a statistical test reveals a significant difference in performance at a significance level of \(\alpha = 0.05\), the hyper-parameters of the model in the population are updated. In such cases, the mutated model also competes against the current best model using the same process.

Ultimately, the method returns the best model obtained through this evolutionary process. A pseudocode for the method is given in Algorithm 1.

Algorithm 19
figure a

. Evolutionary Model Selection

4 Experimental Setup

4.1 Datasets

Four binary classification problems defined over four datasets from the UCI repository [6] were selected for analysis. They are described below:

  1. 1.

    The Banking Dataset - Marketing Targets [13] (Banking) represents customer data in the banking industry for predicting conversion. It features a large sample size (\(n \approx 45000\)) and includes seventeen predictors, consisting of seven numeric and nine categorical variables (three of which are binary).

  2. 2.

    The Default of Credit Card Clients Dataset [18] (Credit) contains information about default payments of credit card clients in Taiwan from 2005. With a substantial sample size (\(n \approx 30000\)), it comprises 24 predictors, comprising fifteen numeric and nine categorical variables.

  3. 3.

    The Heart Disease Dataset [10] (Heart) integrates information from four databases: Cleveland, Hungary, Switzerland, and Long Beach V. It encompasses 76 attributes, but published experiments focus on a subset of 14 attributes. The “target” field denotes the presence of heart disease in patients. The dataset has a sample size of \(n = 303\) and includes thirteen attributes, of which five are numeric and eight are categorical.

  4. 4.

    The Spambase Data Set [9] (Spambase) consists of a sample of \(n = 4601\) emails classified as either “spam” or “non-spam.” It contains fifty-seven numeric predictors, including metrics such as capital character frequency and the percentage of matching words.

The datasets are preprocessed by converting categorical variables to numerical values using LabelEncoder and normalizing the feature values using preprocessing normalize from scikit-learn. The preprocessed data is then split into training and testing sets using train_test_split from scikit-learn.

4.2 Models and Hyper-parameters

The following models were chosen for the study:

  • Random Forest Classifier

  • K-Nearest Neighbors Classifier

  • Decision Tree Classifier

  • XGBoost Classifier

The sampling functions used in the hyper-parameter mutation phase are denoted by:

  • \(\mathcal {S}_{[\mu _1,\mu _2,\min ]}\): Samples from a Skellan Distribution defined with \(\mu _1\) and \(\mu _2\) and the center on the current value of the hyper-parameter. The samples truncated to \(\min \) if its value is below \(\min \)

  • \(\mathcal {N}_{[\sigma ,\min ,\max ]}\): Samples from a Normal Distribution defined with \(\mu \) on the current value of the hyper-parameter and \(\sigma \). The sample is truncated to \(\min \) if its value is below \(\min \) or \(\max \) if its value is above max

  • \(\mathcal {U}_{[v_1,v_2,...,v_k]}\): Samples the values \(v_1,v_2,...,v_k\) using a uniform distribution.

The hyper-parameters varied in each model and the related sampling functions are given below:

  • Random Forest Classifier:

    • Number of estimators: \(\mathcal {S}_{[10,10,3]}\)

    • Maximum depth: \(\mathcal {S}_{[2,2,2]}\)

    • Maximum number of features: \(\mathcal {U}_{['sqrt', 'log2', None]}\)

  • K-Nearest Neighbors Classifier:

    • Number of neighbors: \(\mathcal {S}_{[1,1,3]}\)

    • Weights: \(\mathcal {U}_{['uniform', 'distance']}\)

  • Decision Tree Classifier:

    • Maximum depth: \(\mathcal {S}_{[1,1,2]}\)

    • Maximum number of features: \(\mathcal {U}_{['sqrt', 'log2', None]}\)

    • Criterion: \(\mathcal {U}_{['gini', 'entropy', 'log\_loss']}\)

  • XGBoost Classifier:

    • Tree method: \(\mathcal {U}_{['auto', 'exact', 'approx']}\)

    • Maximum depth: \(\mathcal {S}_{[1,1,1]}\)

    • Booster: \(\mathcal {U}_{['gbtree', 'dart']}\)

    • Number of estimators: \(\mathcal {S}_{[1,1,1]}\)

    • Subsample: \(\mathcal {N}_{[0.3,10^-2,1]}\)

If no sampling function is the defined, the mutation selects a random element from the set of possible values. Otherwise, it ramdomly samples from the defined distribution.

The F1 score [15] is used as the evaluation metric for the classification models. The F1 metric is a particular case of a general measure \(F_{\beta }\). This measure is a weighting between precision and recall, indicated in cases of unbalanced data, which can generate more inaccurate results when only one of these two measures is used.

4.3 Statistical Tests

The statistical tests used by Algorithm 1 to compare the performance of different hyper-parameters are:

  1. 1.

    Paired t-test for two related samples (ttest_rel)

  2. 2.

    Two-sample t-test (ttest_in)

  3. 3.

    Mann-Whitney U test (mannwhitneyu)

They are performed on the set of evaluation metrics collected using k-fold cross-validation. As baseline, Algorithm 1 was also without any statistical test. In this case is simply selects the models with the best average performance as given by the cross-validation procedure. This procedure will be called dummy_stats_test for rest of the text.

4.4 Experimental Procedure

Algorithm 1 was executed 10 independent times for each combination of the following factor levels:

  • Datasets: [Banking,Credit,Heart,Spambase]

  • Statistical Tests: [Paired t-test for two related samples (ttest_rel), Two-sample t-test (ttest_in), Mann-Whitney U test (mannwhitneyu), Dummy (dummy_stats_test)]

  • K folds: [10, 30, 50]

  • Number of generations: [25, 100, 1000]

5 Results

This section summarizes the results obtained using Algorithm 1 for all tested conditions related to the research questions in Sect. 1. The results are presented as the percentage improvement in performance between the final best model (\(F1_{final}\)) and the initial best model (\(F1_0\)) which was computed by the equation below:

$$\begin{aligned} \text {\%improvement} = \frac{F1_{final} - F1_0}{F1_0} \end{aligned}$$
(6)

5.1 Performance over Datasets

Figure 1 shows the box-plots for the \(\%improvement\) obtained with Algorithm 1 combining all the possible configurations. It can be seen that the proposed procedure was more successful for the Spambase dataset with an average improvement around 7% reaching 17.5% improvement in some scenarios. Conversely, the improvements obtained for the other datasets were more modest with averages around 3%, 1.7% and 2.6% for the Banking, Credit and Heart datasets, respectively.

Fig. 1.
figure 1

% of improvement by dataset

5.2 Performance of the Statistical Tests Varying Number of Folds

Figure 2 shows the box-plots for the \(\%improvement\) obtained with Algorithm 1 when we vary the applied statistical test and the number of folds used in the cross-validations procedure. It can be seen that, in this scenario where the dataset effect is confounded increasing the number of folds had no effect in the \(\%improvement\).

Fig. 2.
figure 2

% of improvement by Statistical Test varying the number of folds

Figure 3 presents the factors from Fig. 2 for each dataset, separately. Looking at the Spambase results, the dataset where Algorithm 1 performed the set, it can be seen that, even though there is still no significant difference in the average \(\%improvement\), increasing the number of folds contributed to reduce the variance of the results for the parametric tests. This phenomenon, however, could not be observed in the overall results, possibly because of the difficulty Algorithm 1 had to generate improved models.

Fig. 3.
figure 3

% of improvement by Statistical Test varying the number of folds for each dataset

5.3 Performance of the Statistical Tests Varying the Maximum Number of Generations

Figure 4 displays the performance Algorithm 1 in regards to the \(\%improvemnt\) when varying the statistical test and the maximum number of generations (iterations) allowed. As expected, when the number of generations increases the \(\%improvement\) also increases. These overall results do not show an interaction between the statistical test and the number of iterations since the difference in the performance among the different statistical tests and the dummy test remains almost constant as the number o generations increases.

Fig. 4.
figure 4

% of improvement by Statistical Test and Maximum Number of Generations

Figure 5 displays the performance Algorithm 1 in regards to the \(\%improvemnt\) for each dataset. For the datasets Heart, Banking and Credit, even for a small number of iterations, or 25 or 100, the use of no statistical test (dummy) has led to better average performances of Algorithm 1. Among the versions with an actual statistical test, the paired t-test with relates samples (ttest_rel) has led to better or almost equal results on average.

For the Spambase where is was easy to generate improved models, the difference between the dummy and the other versions, only shows at the limit of 1000 generations.

Overall, these results suggest that in a challenging scenario where generating improved models is difficult, the use of statistical tests may hinder progress by imposing a an excessively stringent standard for accepting new models.

Fig. 5.
figure 5

% of improvement by Statistical Test and Maximum Number of Generations for each dataset

6 Conclusion

This study investigates the effectiveness of statistical tests commonly used for selecting the best machine learning (ML) model when applied over larger periods of time. The study aims to understand whether a selection procedure based on statistical tests leads to higher quality models after a significant number of iterations. Additionally, the impact of the number of tests (number of folds in cross-validation) performed to acquire data for these statistical tests and the preferred statistical test for different experimental time horizons are examined.

In summary, the analysis suggests that the use of statistical tests in ML model selection should be carefully considered, especially in challenging scenarios, as they may hinder progress and impose overly stringent criteria for accepting new models.

Although the number of datasets in this study is limited, and the proposed procedure did not verify the assumptions of the applied tests beforehand, it is noteworthy that the dominance of the versions without statistical tests remained consistent. This highlights the importance of conducting further research and exploration in this area. The use of statistical tests of hypothesis is fundamental in many scientific fields, and it is crucial to better understand their impact on the selection of machine learning models.