A unified framework for dataset shift diagnostics
Felipe Maia Polo, Rafael Izbicki, Evanildo Gomes Lacerda, Juan Pablo Ibieta-Jimenez, Renato Vicente
2022-05-17

Most machine learning (ML) methods assume that the data used in the training phase comes from the distribution of the target population. In practice, however, one often faces dataset shift, which, if not properly taken into account, may decrease the predictive performance of ML models. In general, if the practitioner knows which type of shift is taking place - e.g., covariate shift or label shift - they may apply transfer learning methods to obtain better predictions. Unfortunately, current methods for detecting shift are only designed to detect specific types of shift or cannot formally test their presence. We introduce a general framework that gives insights on how to improve prediction methods by detecting the presence of different types of shift and quantifying how strong they are. Our approach can be used for any data type (tabular/image/text) and for both classification and regression tasks. Moreover, it uses formal hypothesis tests that control false alarms. We illustrate how our framework is useful in practice using both artificial and real datasets. Our package for dataset shift detection can be found at https://github.com/felipemaiapolo/detectshift.

A standard assumption in machine learning and applied statistics is that the data used to train models comes from the distribution of interest. When that assumption does not hold, we say that a dataset shift has happened (Quiñonero-Candela et al., 2008). Formally, we have dataset shift when the joint distribution of features (X) and labels (Y) associated with the training sample - the source distribution, P^{(1)}_{X,Y} - and the distribution of interest - the target distribution, P^{(2)}_{X,Y} - are different. Dataset shift is common in real-world problems and has been shown to be relevant in several applied fields, from finance (Lucas et al., 2019, Speakman et al., 2018) and health (Finlayson et al., 2021) to technology (Li et al., 2010) and physics (Wojtkiewicz et al., 2018). Unfortunately, dataset shift may substantially decrease the predictive power of machine learning algorithms if it is not properly taken into account (Quiñonero-Candela et al., 2008, Sugiyama and Kawanabe, 2012).

In order to learn about the target distribution using training data, assumptions that relate both distributions need to be made. Different assumptions translate into different types of shift (Moreno-Torres et al., 2012), and each shift type may demand different adaptation methods. For instance, if P^{(1)}_{Y|X} = P^{(2)}_{Y|X}, dataset shift adaptation can be performed by using importance weighting on the training data (Sugiyama et al., 2007, Gretton et al., 2009, Sugiyama and Kawanabe, 2012, Maia Polo and Vicente, 2022). Similarly, if P^{(1)}_{X|Y} = P^{(2)}_{X|Y}, dataset shift adaptation can be performed by re-calibrating posterior probabilities via Bayes' theorem (Saerens et al., 2002, Vaz et al., 2019). Thus, to successfully adapt prediction algorithms in a dataset shift setting, practitioners need to know not only whether dataset shift occurs, but also which type of shift happens for the data at hand. We propose a novel and flexible methodology that answers this question.
Our approach (i) leverages the power of probabilistic classifiers to estimate the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951, Polyanskiy and Wu, 2019) between two marginal, joint, or conditional distributions of features and labels, and (ii) makes use of simulation-based hypothesis tests to formally detect each type of shift at a given significance level, thus being able to control false alarms.

Related work. There is a rich literature on dataset shift detection. For instance, Gama et al. (2004) and Baena-García et al. (2006) developed methods to detect distribution shift on data streams while performing a classification task. The former, for example, uses the time-evolving error rate of a classifier as a way to track distribution shifts. Both approaches are powerful, but were designed neither to detect different types of shift nor to control false alarms through hypothesis testing. More recently, Webb et al. (2016) and Webb et al. (2018) showed how to detect different types of shift via Hellinger and total variation distances using discrete or discretized data (both features and labels), which can be a hard task for high-dimensional data. The authors show how to assess different types of shift, although no formal hypothesis tests were introduced. In a different direction, some works propose dataset shift hypothesis testing. For example, Yu et al. (2019) introduced a simulation-based hierarchical hypothesis testing framework for classification problems. This method was designed to detect overall shifts in the data, and thus does not distinguish between the different types of shift separately. Similarly, Raza et al. (2015) proposed the use of classical hypothesis tests after a data stream screening phase to detect covariate shift on time series, i.e., unlabeled data. Vovk (2020) also develops a framework for detecting shifts in data streams. This martingale-based approach is able to separately detect concept shift and label shift in classification problems.

Contribution and novelty. Our work provides a unified and flexible framework to quantify and test the presence of different types of dataset shift. Using labeled samples from both source and target distributions, we leverage the power of probabilistic classifiers to estimate the KL divergence between the two distributions and then use simulation-based hypothesis testing procedures in order to control type I errors (that is, false alarms) when detecting shifts. Our method is capable of quantifying and testing for different kinds of dataset shift, e.g., covariate/label/concept shifts, in isolation. Moreover, because we use classifiers to obtain the KL divergence estimates, our method can be adapted to any type of data, including images and texts, for example. Also, our approach works for both classification and regression tasks, as well as for unsupervised problems (see Final Remarks). Therefore, this methodology helps practitioners successfully adapt prediction algorithms according to the detected types of shift, independently of their data type or final task.

We observe two datasets, D^{(1)} and D^{(2)}, where D^{(i)} = {(X^{(i)}_k, Y^{(i)}_k)}_{k=1}^{n^{(i)}} for i ∈ {1, 2}. We assume that all observations from the same dataset are i.i.d., and that the datasets are independent of each other. We denote by P^{(i)}_{X,Y} the distribution associated with an observation from the i-th dataset, where i = 1 stands for source and i = 2 for target. From now on, we assume P^{(1)}_{X,Y} and P^{(2)}_{X,Y} are absolutely continuous with respect to each other (we denote this assumption by P^{(1)}_{X,Y} ∼ P^{(2)}_{X,Y}).
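To make the setup concrete, the sketch below pools the two samples into a single augmented dataset with a source/target indicator Z and draws a random train/test split; this construction is used by the estimators described next. It is a minimal sketch under our own assumptions: the array names X1, y1 (source), X2, y2 (target) and the 50/50 split ratio are illustrative, not prescribed by the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def build_augmented_dataset(X1, y1, X2, y2, test_size=0.5, seed=0):
    """Pool source (Z=1) and target (Z=2) samples and split into D_tr / D_te."""
    X = np.vstack([X1, X2])
    y = np.concatenate([y1, y2])
    Z = np.concatenate([np.ones(len(y1), dtype=int),       # Z = 1: source dataset
                        2 * np.ones(len(y2), dtype=int)])  # Z = 2: target dataset
    # Random split of the augmented dataset D = {(X_k, Y_k, Z_k)} into D_tr and D_te
    return train_test_split(X, y, Z, test_size=test_size, random_state=seed)

# Example usage:
# X_tr, X_te, y_tr, y_te, Z_tr, Z_te = build_augmented_dataset(X1, y1, X2, y2)
```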
Our goal is to quantify and test which types of dataset shift occur in a dataset. The null hypotheses we want to test are

H_{0,D}: P^{(1)}_{X,Y} = P^{(2)}_{X,Y} (no total dataset shift),
H_{0,C}: P^{(1)}_X = P^{(2)}_X (no covariate shift),
H_{0,L}: P^{(1)}_Y = P^{(2)}_Y (no label shift),
H_{0,C1}: P^{(1)}_{X|Y=y} = P^{(2)}_{X|Y=y} for P̃_Y-almost all y (no concept shift of type 1),
H_{0,C2}: P^{(1)}_{Y|X=x} = P^{(2)}_{Y|X=x} for P̃_X-almost all x (no concept shift of type 2),

where P̃_Y and P̃_X denote any distributions such that P̃_Y ∼ P^{(i)}_Y and P̃_X ∼ P^{(i)}_X. Notice that the nomenclature we use for the hypotheses is slightly different from the one used by some papers. For instance, the covariate shift assumption is often stated in terms of P^{(1)}_{Y|X} = P^{(2)}_{Y|X} together with a change in the distribution of X, while our hypothesis H_{0,C} is about the distribution of X only. Therefore, the standard covariate shift assumption would be directly related to the non-rejection of H_{0,C2}, but rejection of H_{0,C}. Section 2.1 describes how to obtain statistics that are able to quantify the amount of each type of dataset shift, and Section 2.2 shows how to use the statistics to formally test the occurrence of each type of shift.

Our statistics are based on the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951, Polyanskiy and Wu, 2019), which is a well-known measure of the discrepancy between probability measures. Formally, the KL divergence between two probability distributions P and Q over a space Z is defined as

KL(P || Q) = ∫_Z log( dP/dQ ) dP,

where dP/dQ is the density (Radon-Nikodym derivative; we assume Q dominates P) of P with respect to Q. We use the following quantities to measure each of the shifts described in the beginning of Section 2:

KL_{X,Y} = KL(P^{(2)}_{X,Y} || P^{(1)}_{X,Y}),
KL_X = KL(P^{(2)}_X || P^{(1)}_X),
KL_Y = KL(P^{(2)}_Y || P^{(1)}_Y),
KL_{X|Y} = E_{Y ~ P^{(2)}_Y}[ KL(P^{(2)}_{X|Y} || P^{(1)}_{X|Y}) ],
KL_{Y|X} = E_{X ~ P^{(2)}_X}[ KL(P^{(2)}_{Y|X} || P^{(1)}_{Y|X}) ].

The next theorem states that we can rewrite the null hypotheses we want to test in terms of the quantities above: each hypothesis holds if, and only if, the corresponding divergence (KL_{X,Y}, KL_X, KL_Y, KL_{X|Y}, or KL_{Y|X}) equals zero. The proof can be found in the supplementary material.

To quantify and test the different types of shift, we use estimators of the parameters KL_{X,Y}, KL_X, KL_Y, KL_{X|Y}, KL_{Y|X} as test statistics. This is a reasonable choice because (i) our null hypotheses can be equivalently written in terms of such estimable parameters and (ii) the magnitudes of the statistics are directly related to the intensities of the shifts. This suggests that tests based on these statistics will be powerful in detecting shift. Moreover, all parameters are integrals computed with respect to the target distribution; thus, they give more weight to regions of the feature/label space where most target data points lie.

In order to estimate KL_{X,Y}, KL_X, and KL_Y, we first use training data and the probabilistic classification method for density ratio estimation, also known as the odds trick, to estimate the Radon-Nikodym derivative between the two probability distributions (Cranmer et al., 2015, Dalmasso et al., 2021). Then, we use test data to estimate the divergences. More precisely, we first create the augmented dataset D = {(X_k, Y_k, Z_k)}_{k=1}^{n^{(1)}+n^{(2)}}, where each (X_k, Y_k) corresponds to a different observation taken at random without replacement from D^{(1)} ∪ D^{(2)} and Z_k ∈ {1, 2} indicates from which dataset (X_k, Y_k) comes. We then randomly split D into two sets: D_tr (training set) and D_te (test set). We use D_tr to train a probabilistic classifier that predicts Z. The features used to predict Z are (i) (X, Y) to estimate the amount of total dataset shift (KL_{X,Y}), (ii) X to estimate the amount of covariate shift (KL_X), and (iii) Y to estimate the amount of label shift (KL_Y). The estimated Radon-Nikodym derivative (in the case of total dataset shift; the other ones are analogous) is then given by

d̂P^{(2)}_{X,Y}/dP^{(1)}_{X,Y}(x, y) = [ P̂(Z = 2 | x, y) / P̂(Z = 1 | x, y) ] · ( n^{(1)}_tr / n^{(2)}_tr ),

where P̂ denotes the trained classifier and n^{(i)}_tr is the number of samples from population i in D_tr. Finally, we use empirical averages over the test dataset D_te in order to estimate the KL divergence. In the case of total dataset shift (the other cases are analogous), the estimator is

K̂L_{X,Y} = (1 / n^{(2)}_te) Σ_{(X_k, Y_k) ∈ D^{te}_2} log( d̂P^{(2)}_{X,Y}/dP^{(1)}_{X,Y}(X_k, Y_k) ),

where D^{te}_2 denotes the test samples from the second population and n^{(2)}_te = |D^{te}_2|. (When Y is discrete, KL_Y can also be estimated using the plug-in estimator described in the supplementary material. The same approach to estimating divergences is used by Sønderby et al. (2016) in the context of generative models, for example.)
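The classifier-based (odds trick) estimator just described can be sketched as follows: train a probabilistic classifier to predict the dataset indicator Z, turn its predicted probabilities into an estimated Radon-Nikodym derivative, and average the log-ratio over the target test points. This is a minimal illustration, not the detectshift implementation; the logistic regression choice, the probability clipping, and the variable names are our assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_domain_classifier(X_tr, y_tr, Z_tr, use_labels=True):
    """Train a probabilistic classifier for P(Z = 2 | features) on D_tr.
    use_labels=True  -> features are (X, Y), targeting KL_{X,Y};
    use_labels=False -> features are X only, targeting KL_X (Y only would give KL_Y)."""
    feats = np.column_stack([X_tr, y_tr]) if use_labels else X_tr
    return LogisticRegression(max_iter=1000).fit(feats, (Z_tr == 2).astype(int))

def estimate_kl(clf, X_te, y_te, Z_te, n1_tr, n2_tr, use_labels=True, eps=1e-12):
    """Odds-trick estimate of KL(P^(2) || P^(1)): average the estimated log density
    ratio over the target (Z = 2) test points."""
    feats = np.column_stack([X_te, y_te]) if use_labels else X_te
    p2 = np.clip(clf.predict_proba(feats)[:, 1], eps, 1 - eps)
    # Estimated Radon-Nikodym derivative dP^(2)/dP^(1) at each test point
    ratio = (p2 / (1 - p2)) * (n1_tr / n2_tr)
    return np.mean(np.log(ratio[Z_te == 2]))
```

Note that the same estimated ratio, evaluated on source training points, can later be reused as importance weights for covariate shift adaptation, as the MNIST/USPS experiment illustrates.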
This approach, however, cannot be used to estimate KL_{X|Y} or KL_{Y|X}. Instead, we rely on the KL divergence decomposition given in the following theorem, which is extracted from Polyanskiy and Wu (2019; Section 2.2).

Theorem 2. Let KL_Y, KL_X, KL_{Y|X}, KL_{X|Y}, KL_{X,Y} be defined as in Section 2.1. Then KL_{X,Y} = KL_Y + KL_{X|Y} = KL_X + KL_{Y|X}.

The authors show a proof of the theorem when the distributions are discrete and then discuss how the result can be extended to more general cases (Polyanskiy and Wu, 2019; Sections 2.1, 2.6). This result shows that the KL divergences of the conditional distributions can be estimated via K̂L_{X|Y} = K̂L_{X,Y} − K̂L_Y and K̂L_{Y|X} = K̂L_{X,Y} − K̂L_X.

Once we have statistics that can quantify the amount of different types of dataset shift, we can use them to formally test the hypotheses described in Section 2.1. In this section, Y can be discrete or continuous, except when obtaining the p-values for the hypothesis H_{0,C1}, in which case we assume it is discrete. This is needed since Algorithm 1 relies on this assumption. If Y is continuous or has few repeated values, concept shift of type 1 can be tested by discretizing/binning the label for computing the statistic and applying the algorithm (binning is not needed when training the classifiers, though) - we give more details and references at the end of this section.

Consider the datasets D_te and D^{te}_2 as defined in Section 2.1, and let T(D^{te}_2) be a test statistic of interest computed using D^{te}_2. Namely, T(D^{te}_2) can represent any of the quantities K̂L_{X,Y}, K̂L_X, K̂L_Y, K̂L_{X|Y}, or K̂L_{Y|X}, depending on which type of shift we are testing for. We test each of the hypotheses of interest by computing a p-value of the form

p = [ 1 + Σ_{j=1}^{B} 1{ T(D^{te}_2(j)) ≥ T(D^{te}_2) } ] / (B + 1),   (1)

where each D^{te}_2(j) is a modified version of D^{te}_2, which depends on the whole test set D_te. The modification that is done depends on the hypothesis we are testing (a code sketch of the three schemes is given after this list):

• To test the hypotheses related to unconditional distributions (H_{0,D}, H_{0,C}, H_{0,L}), each D^{te}_2(j) is obtained by randomly permuting the Z_k's on D_te and then selecting the samples with Z_k = 2 to form the modified version of D^{te}_2. In this case, p is the p-value associated with a permutation test, a framework commonly used to perform two-sample testing (Ernst, 2004).

• To test the hypothesis related to the first type of concept shift (H_{0,C1}), each D^{te}_2(j) is obtained by randomly permuting the values of the Z_k's within each level of Y on D_te and then selecting the samples with Z_k = 2 to form the modified version of D^{te}_2. Thus, we require Y to be discrete to apply this test. In this case, p is the p-value associated with a conditional independence local permutation test (Berrett et al., 2020, Kim et al., 2021), which can be used to test H_{0,C1} because this hypothesis is equivalent to the hypothesis that X ⊥⊥ Z | Y.

• To test the hypothesis related to the second type of concept shift (H_{0,C2}), we first estimate the conditional distribution of Y|X using the whole training set D_tr. Let Q(y|x) denote such an estimate, which can be obtained using any probabilistic classifier, such as logistic regression, neural networks, or the CatBoost classifier (Prokhorenkova et al., 2017), or conditional density estimators if Y is continuous (Izbicki and Lee, 2017, Okuno and Polo, 2021). We then obtain D^{te}_2(j) by replacing each Y_k in D^{te}_2 by a random draw from Q(y|X_k). This test is known as the conditional randomization test (Candès et al., 2018). It can be used to test H_{0,C2} because this hypothesis is equivalent to the hypothesis that Y ⊥⊥ Z | X.
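As referenced above, here is a minimal sketch of the three randomization schemes and of the p-value of Equation 1. The function names are ours; T is assumed to be a callable mapping a (modified) target test set to the corresponding K̂L statistic, and q_sampler is assumed to be a fitted estimate of Q(y|x) that can be sampled from.

```python
import numpy as np

def p_value(t_obs, t_null):
    """Equation 1: Monte Carlo p-value with the usual +1 correction."""
    t_null = np.asarray(t_null)
    return (1.0 + np.sum(t_null >= t_obs)) / (len(t_null) + 1.0)

def permute_Z(Z_te, rng):
    """Unconditional shifts (H_0D, H_0C, H_0L): permute Z over the whole test set."""
    return rng.permutation(Z_te)

def permute_Z_within_Y(Z_te, y_te, rng):
    """Concept shift type 1 (H_0C1): permute Z within each level of the (discrete) label."""
    Z_new = Z_te.copy()
    for y in np.unique(y_te):
        idx = np.where(y_te == y)[0]
        Z_new[idx] = rng.permutation(Z_te[idx])
    return Z_new

def redraw_Y_from_Q(X2_te, q_sampler, rng):
    """Concept shift type 2 (H_0C2): replace each Y_k in D_2^te by a draw from Q(y | X_k)."""
    return q_sampler(X2_te, rng)

# For each j = 1, ..., B: build the modified D_2^te with one of the schemes above,
# evaluate the statistic T on it, collect the B values in t_null, and call p_value.
```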
Algorithm 1 details the steps to obtain the p-values for each case. For all cases, we assume the procedure involving the training of probabilistic classifiers, described in Section 2.1, has already been executed; that is, we have a test statistic T for every test we want to perform. Also, when testing for concept shift type 2, we assume the estimated conditional distribution Q(y|x) has been computed. We fix a significance level α ∈ (0, 1) and, after calculating the p-value p for a specific null hypothesis of interest, we reject that hypothesis if p < α. The next theorem shows that such tests are valid (that is, they control the type I error probability). The only exception is the test for H_{0,C2}, which controls type I error probabilities asymptotically as long as Q(y|x) is a good estimator of P(y|x) (see Berrett et al. 2020; Section 5.1 for some examples).

Theorem 3. Let p(D_te) be the p-value from Equation 1, where each modified dataset D^{te}_2(j) is computed as described above (and thus depends on the hypothesis to be tested). Then, for every 0 < α < 1, the tests for H_{0,D}, H_{0,C}, H_{0,L}, and H_{0,C1} satisfy P(p(D_te) ≤ α) ≤ α under the corresponding null hypothesis, while for H_{0,C2} the type I error probability is bounded by α plus an expected total variation term, E[ d_TV( Q^{|D_te|}, P^{|D_te|} ) ], where d_TV is the total variation distance, Q^{|D_te|} and P^{|D_te|} are the product measures of Q and P over |D_te| independent samples, and the expectation is taken with respect to a new value X randomly drawn from the distribution induced by the features of D_te.

Algorithm 1: Dataset shift detection: obtaining p-values
Input: (i) hypothesis to be tested and respective test statistic T; (ii) test set D_te; (iii) number of iterations B ∈ N; (iv) conditional distribution Q(y|x) (in case of testing for concept shift type 2).
Output: p-value p = p(D_te).
For j = 1, ..., B: if testing for label shift, covariate shift, or total dataset shift, draw a random permutation π = (π_1, ..., π_{|D_te|}) of the natural numbers from 1 to |D_te| and permute the Z_k's on D_te accordingly; if testing for concept shift type 1, permute the Z_k's within each level of Y on D_te; if testing for concept shift type 2, replace each Y_k in D^{te}_2 by a random draw from Q(y|X_k). Form D^{te}_2(j) from the samples with Z_k = 2 and compute T(D^{te}_2(j)). Finally, return the p-value p given by Equation 1.

In this section, we present numerical experiments with both artificial and real data. The experiments with artificial data investigate the statistical power of our tests in identifying the different types of dataset shift, while the experiments with real data showcase how our methods are useful for extracting new insights from data. In all the experiments in which Y is discrete, we use the plug-in estimator (supplementary material) to estimate KL_Y.

In the first experiment, we set P^{(1)}_Y and P^{(2)}_Y to be Bernoulli distributions whose means differ by δ, and P^{(1)}_{X|Y} and P^{(2)}_{X|Y} to be Gaussian distributions with covariance I_d whose mean vectors differ by γ 1_d, where Ber(p) denotes the Bernoulli distribution with mean p, 1_d denotes a vector of ones of size d = 3, I_d the identity matrix of dimension d = 3, and N(µ, Σ) denotes the normal distribution with mean vector µ and covariance matrix Σ. In this way, δ controls the amount of label shift, while γ controls the amount of concept shift; indeed, it is possible to show that both types of shift increase when |δ| and |γ| grow.

In the second experiment, we set P^{(1)}_X and P^{(2)}_X to be univariate Gaussian distributions whose means differ by λ, and P^{(1)}_{Y|X} = N(X, 1), P^{(2)}_{Y|X} = N(X + θ, 1), where N(µ, σ²) denotes the normal distribution with mean µ and variance σ². In this way, λ controls the amount of covariate shift, while θ controls the amount of concept shift; indeed, it is possible to show that both types of shift increase when |λ| and |θ| grow.

We vary (δ, γ) or (λ, θ) in a grid of points for experiments 1 and 2, respectively. For each point in the grid, we perform 100 Monte Carlo simulations to estimate the tests' power, that is, the probabilities of rejecting the null hypotheses.
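For reference, the snippet below generates data in the spirit of these two experiments. Since the baseline parameters are not fully specified above, the Ber(1/2) label marginal, the N(y·1_d, I_d) conditional, and the standard Gaussian covariate are our assumptions; what the sketch preserves is that δ shifts P_Y, γ shifts P_{X|Y}, λ shifts P_X, and θ shifts P_{Y|X}.

```python
import numpy as np

def experiment1_sample(n, delta=0.0, gamma=0.0, d=3, rng=None):
    """Classification DGP: delta controls label shift, gamma controls concept shift (X|Y).
    Baseline Ber(1/2) labels and N(y*1_d, I_d) features are assumed, not taken from the paper."""
    rng = rng or np.random.default_rng()
    y = rng.binomial(1, 0.5 + delta, size=n)
    X = rng.normal(loc=(y[:, None] + gamma) * np.ones(d), scale=1.0, size=(n, d))
    return X, y

def experiment2_sample(n, lam=0.0, theta=0.0, rng=None):
    """Regression DGP: lam controls covariate shift, theta controls concept shift (Y|X).
    Baseline N(0, 1) covariate is assumed; Y|X = N(X + theta, 1) follows the text."""
    rng = rng or np.random.default_rng()
    X = rng.normal(loc=lam, scale=1.0, size=n)
    y = rng.normal(loc=X + theta, scale=1.0)
    return X.reshape(-1, 1), y

# Source sample: experiment1_sample(2500, 0, 0); target sample: experiment1_sample(2500, delta, gamma)
```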
Specifically, we conduct hypothesis tests aiming to detect total dataset shift, label and covariate shift, and concept shift (types 1 and 2). For each pair (δ, γ) or (λ, θ) and Monte Carlo simulation, we: (i) draw training and test sets, from both joint distributions, with size 2500 each; (ii) train a logistic regression model as a probabilistic classifier to estimate the Radon-Nikodym derivatives using the training sets; (iii) use the test set from population 2 to estimate KL_{X,Y}, KL_{X|Y} or KL_{Y|X}, and KL_Y or KL_X; (iv) use both test sets to estimate the p-values using Algorithm 1 (B = 100); (v) reject the null hypothesis if the p-value is smaller than the significance level α = 5%.

Figure 1 shows the power estimates for each test as a function of (δ, γ) or (λ, θ). Our procedure to test the presence of different types of dataset shift is well-behaved: the power is close to the nominal level α = 5% when (δ, γ) or (λ, θ) is close to the origin, i.e., when no shift happens, and grows to 1 when ||(δ, γ)|| or ||(λ, θ)|| gets larger. Moreover, our procedure could also detect types of shift in isolation: the power of our tests increases for concept shift (types 1 and 2) and label/covariate shift detection when increasing |γ| or |θ| and |δ| or |λ| separately. As expected, the tests are not affected by the shifts that are not being tested at that moment. Finally, the test designed to detect concept shift in the second experiment has slightly lower power than the test designed to detect covariate shift, despite the fact that the divergences in both cases are the same. This is because the concept shift test depends on a conditional distribution estimate, which may not be perfect.

Next, we compare our framework with existing approaches for detecting shifts. We do this by comparing the power of the different hypothesis tests for α = 5%. For this experiment, we use the same data generating process and sample sizes used in the first two experiments. Moreover, we use 250 Monte Carlo simulations to estimate power and set B = 250 for Algorithm 1. When our objective is to detect label and covariate shifts, we vary δ and λ but fix γ = θ = 0; when our objective is to detect both types of concept shift, we vary γ and θ but set δ = λ = 0. The main alternative approach we compare our method with is the total variation (TV) approach introduced by Webb et al. When testing for label shift, we also include comparisons with the normal approximation (Z-test) for comparing two proportions (Lehmann et al., 2005). Moreover, we add a comparison with the Kolmogorov-Smirnov (KS) test (Kolmogorov, 1933, Smirnov, 1939) for testing covariate shift. Except for the case where we test for concept shift 2 (shift in P_{Y|X}), there is no training phase when using the existing approaches, so we use the whole dataset, i.e., 5k data points from each distribution, to calculate the statistics and obtain the p-values. When testing for concept shift 2, we need to train the linear regression model (to obtain Q(y|x)); thus, in that case we only use the test set to conduct the tests using the existing approaches.
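The baseline tests used in this comparison are standard; a minimal SciPy-based sketch (with our own function names) of the two-proportion Z-test and the Kolmogorov-Smirnov test is shown below.

```python
import numpy as np
from scipy.stats import ks_2samp, norm

def z_test_two_proportions(y1, y2):
    """Normal-approximation test that P^(1)_Y = P^(2)_Y for binary labels (label shift)."""
    p1, p2 = np.mean(y1), np.mean(y2)
    p_pool = (np.sum(y1) + np.sum(y2)) / (len(y1) + len(y2))
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / len(y1) + 1 / len(y2)))
    z = (p1 - p2) / se
    return 2 * (1 - norm.cdf(abs(z)))  # two-sided p-value

def ks_test_covariate_shift(x1, x2):
    """Kolmogorov-Smirnov two-sample test that P^(1)_X = P^(2)_X for a univariate feature."""
    return ks_2samp(x1, x2).pvalue
```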
When using our own approach, we always make a distinction between training and test sets. Figure 2 shows that our method had power curves similar to those of the alternative approaches when testing for both label and covariate shift. However, when testing for both types of concept shift, our method was able to achieve significantly higher power than the total variation approach - probably because our approach does not require discretization of the data. We expect our method to have an even more pronounced advantage in higher dimensions - even when testing for covariate shift, for example - because the number of points inside each bin tends to be small in such cases.

Next, we investigate the role of the dimensionality of the feature space in the performance of the various methods. More specifically, our goal in this example is to detect covariate shift using the settings from the second experiment of this section with λ = .1. We add to X a standard Gaussian random vector (independent of the original X), ending up with an updated version of X of size d. Then, we compare the various tests in terms of their power to test H_{0,C} at α = 5%. We use 250 Monte Carlo simulations to estimate power and set B = 250 for Algorithm 1. Because the divergence between the distributions remains the same when adding this noise, this experiment allows us to isolate the influence of dimensionality. Figure 3 indicates that the performance of our method does not suffer as much from increasing d as the TV approach does. We compare the TV approach with d ∈ {1, 2, 3, 4} with our approach for d ∈ {1, 2, 3, 4, 10, 20, 30, 40, 100, 200, 300, 400}. We stop at d = 4 for the TV approach because the number of bins increases geometrically with the number of dimensions, and for d = 5 we would expect to find fewer than two data points per bin. Interestingly, the performance of our method at d = 400 is equivalent to the performance of the TV approach at d = 4.

In this experiment, we use our method to extract insights on how probability distributions can differ from each other in a financial application. The dataset used in this experiment can be used for credit scoring and was kindly provided by the Latin American Experian DataLab, based in Brazil. It contains financial data of one million Brazilians collected every month from August/2019 to May/2020. The features in this dataset are related to past financial data, e.g., amount of loans and credit card bills not paid on time, and the label variable indicates whether a consumer will delay a debt payment by 30 days in the next 3 months, i.e., we have a binary classification problem. In this experiment, we kept 20k random data points in each month, with 20% of them going to the test set. Also, we kept the top 5 most important features for the credit risk prediction model. These specific features are related to payment punctuality for credit card bills, the number of active contracts of the consumer, and the monetary values involved. We used CatBoost (Prokhorenkova et al., 2017) both to estimate the Radon-Nikodym derivative and the conditional distribution of Y|X. The results in Figure 4 indicate increasing covariate shift, concept shift type 1, and total dataset shift from the beginning. Recall that the label variable refers to the next 3 months, so, e.g., labels from February correspond to March/April/May.
In Figure 4, the vertical dashed line marks the beginning of the COVID-19 crisis in Brazil. We highlight the decoupling between the total shift curve and the covariate or concept shift type 1 curves after February/2020. This behavior is due to a bigger shift in the marginal and conditional distributions of Y and is possibly associated with the economic consequences of the pandemic. This is expected because the features contain information about how people use their credit (e.g., amount of loans, credit card use), and the way people use their credit is a function of changes in the economy that can occur quite rapidly, such as inflation/interest/exchange rate fluctuations, extra expenses due to holidays, etc. From February 2020 onwards, i.e., for labels relative to months after March 2020, it is possible to notice a decoupling between these shifts in the months following the first official COVID-19 case detected in Brazil and the beginning of the economic crisis. The decoupling means that a bigger share of the total shift is due to shifts in the marginal and conditional distributions of Y. We speculate that this decoupling is due to measures taken by banks and credit bureaus to help consumers during the pandemic. Some of the measures include, but are not limited to, longer payment intervals and lower interest rates. Indeed, the proportion of people with credit restrictions, according to the dataset provided by the Experian credit bureau, fell from 54% to 49% from March/2020 to August/2020, while it was roughly constant in 2019.

In this experiment, we evaluate our method as a guide for dataset shift adaptation using the MNIST and USPS datasets (LeCun et al., 1998, Xu and Klabjan, 2021). Both datasets contain images, i.e., pixel intensities, and labels for the same 10 digits (0 to 9). Our interest is (i) to use our framework to quantify and formally test the presence of all types of dataset shift using the MNIST distribution as the source, and mixtures of MNIST and USPS (with increasing proportions of USPS participation) as target distributions, and then (ii) to adapt our predictors using the insights given by our diagnostics in order to achieve better out-of-sample performance. Our goal is to show how detecting specific types of shift helps practitioners correct their models in a more informed manner. In this experiment, the distributions of labels are similar in both populations - the KL divergence between the two marginal distributions is equivalent to the divergence between two Bernoulli distributions with parameters .48 and .52. On the other hand, MNIST samples tend to have more white pixels than USPS (Figure 5), and thus the distributions of X are clearly different in the two datasets.

To run our experiments, we first organize the data as follows (a sketch of this pipeline is given below): (i) resize all MNIST images to match the USPS dimensions (16×16); (ii) normalize the pixels of all images to guarantee all pixel intensities are between 0 and 1; (iii) select approximately 27k samples from MNIST and 9k from USPS; (iv) use Principal Component Analysis (PCA; Hastie et al. 2009) to reduce the number of features from 256 to 10; (v) partition the data into 12 smaller disjoint datasets of size 3.1k such that the proportion of USPS samples increases linearly from 0% to 50% from the first to the last dataset. We split each dataset using 10% of the samples as testing data points. We use the first dataset, with only MNIST samples, as our baseline and compare it to the mixed datasets.
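A sketch of this preprocessing pipeline, under our own assumptions, is shown below: MNIST is assumed to be loaded as an (n, 28, 28) array and USPS as an (m, 16, 16) array, the resizing uses scipy.ndimage.zoom (one choice among many), and the sample counts and helper name are ours.

```python
import numpy as np
from scipy.ndimage import zoom
from sklearn.decomposition import PCA

def preprocess_and_mix(mnist_imgs, mnist_y, usps_imgs, usps_y,
                       n_datasets=12, size_per_dataset=3100,
                       n_components=10, seed=0):
    rng = np.random.default_rng(seed)
    # (i)-(ii): resize MNIST 28x28 -> 16x16 and scale pixel intensities to [0, 1]
    mnist16 = np.stack([zoom(img, 16 / 28) for img in mnist_imgs])
    X_m = mnist16.reshape(len(mnist16), -1) / max(mnist16.max(), 1e-12)
    X_u = usps_imgs.reshape(len(usps_imgs), -1) / max(usps_imgs.max(), 1e-12)
    # (iv): PCA from 256 pixels down to 10 components, fitted on the pooled data
    pca = PCA(n_components=n_components, random_state=seed).fit(np.vstack([X_m, X_u]))
    X_m, X_u = pca.transform(X_m), pca.transform(X_u)
    # (v): disjoint datasets whose USPS proportion grows linearly from 0% to 50%
    datasets, im, iu = [], 0, 0
    for p in np.linspace(0.0, 0.5, n_datasets):
        n_u = int(round(p * size_per_dataset))
        n_m = size_per_dataset - n_u
        X = np.vstack([X_m[im:im + n_m], X_u[iu:iu + n_u]])
        y = np.concatenate([mnist_y[im:im + n_m], usps_y[iu:iu + n_u]])
        idx = rng.permutation(len(y))
        datasets.append((X[idx], y[idx]))
        im, iu = im + n_m, iu + n_u
    return datasets  # datasets[0] is the pure-MNIST baseline
```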
To estimate both the Radon-Nikodym derivatives and the conditional distributions, we use CatBoost (Prokhorenkova et al., 2017). The first two plots of Figure 6 show both the test statistics and their respective p-values. The plots indicate that total dataset shift (shift in P_{X,Y}), concept shift 1 (shift in P_{X|Y}), and covariate shift (shift in P_X) increase as we make the USPS participation higher. Our p-values rapidly approach zero, thus detecting those changes even when the USPS participation is still less than 10%. This indicates that our hypothesis tests have high power. On the other hand, label shift (shift in P_Y) and concept shift 2 (shift in P_{Y|X}) are not promptly detected by our method. This indicates that label shift and concept shift 2 are weak or non-existent in this example. That observation is consistent with the facts that (i) the distribution of Y is similar in both the MNIST and USPS populations, and that (ii) two similar pixel configurations should induce similar posterior distributions of labels regardless of the origin distribution (source or target), as long as the prior distribution is approximately stable across populations. Finally, the plots indicate that adapting for covariate shift (e.g., using importance weighting (Sugiyama et al., 2007)) should be enough to achieve better predictions on the target domain. Indeed, that is the case here - in the third plot, we compare the performance of two logistic regression models. Both are trained using pure MNIST samples, but one of them is adapted using importance weighting. The weights are obtained via the classifier used to estimate KL_X, that is, with no need to fit an extra model. The latter approach starts giving better predictions as soon as covariate shift is detected by our test.

Next, we use our framework to detect shifts in image and text datasets using deep learning models as the classifiers to estimate the Radon-Nikodym derivative and the conditional distribution of Y|X. We use large pretrained models as feature extractors, freezing all the layers except the output one, which is given by a logistic regression model. The pretrained models are EfficientNetV2S (Tan and Le, 2021) for images and XLM-RoBERTa (Conneau et al., 2020) for texts. We use CIFAR-10 and CIFAR-100 as image datasets (Krizhevsky et al., 2009) and "Amazon Fine Food Reviews", available on Kaggle, as our text dataset. The first two datasets are composed of small RGB images from K = 10 or K = 100 different classes, while the third dataset is composed of product reviews in the form of short texts and a rating, varying from 0 to 4, given by consumers, thus having K = 5 classes. In the third dataset, we subsampled the data to guarantee all the classes have roughly the same number of examples. The total sample size for each experiment is 30k data points, with 10% of them going to the test portion.

From the original datasets, we derive the source and target datasets as follows (see the sketch below). First, we fix δ ∈ (0, .5) and then create a list LIST of K numbers (one for each class) where the first element of the list is δ, the last is 1 − δ, and the intermediate ones are given by a linear interpolation of δ and 1 − δ. Then, we select a fraction LIST[k] of the samples of class k ∈ {0, ..., K − 1} to be in the source dataset, while the rest goes to the target dataset. In this way, we explicitly introduce label shift, which may induce covariate shift and concept shift of type 2. After we have the data from both populations (source and target), we proceed as usual to detect the shifts.
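The source/target construction just described can be written compactly as below; the interpolation and per-class split follow the text, while the function and array names are ours.

```python
import numpy as np

def inject_label_shift(X, y, delta, seed=0):
    """Split (X, y) into source/target portions with explicit label shift:
    a fraction LIST[k] of class k goes to the source, where LIST linearly
    interpolates between delta and 1 - delta over the K classes."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    LIST = np.linspace(delta, 1 - delta, len(classes))
    to_source = np.zeros(len(y), dtype=bool)
    for frac, k in zip(LIST, classes):
        idx = np.where(y == k)[0]
        n_src = int(round(frac * len(idx)))
        to_source[rng.choice(idx, size=n_src, replace=False)] = True
    return (X[to_source], y[to_source]), (X[~to_source], y[~to_source])

# Example usage: (X_src, y_src), (X_tgt, y_tgt) = inject_label_shift(X, y, delta=0.2)
```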
We repeated the same procedure for δ = .1, .2, .3, .4, and in all cases we were able to detect all types of shift except concept shift 1. This result was expected because, given the class, the distribution of the features must not be affected by the way we introduced the shift. The results for CIFAR-10 can be seen in Figure 7. The results for the CIFAR-100 and "Amazon Fine Food Reviews" experiments are similar and displayed in the supplementary material.

Figure 6: Detecting different types of dataset shift using MNIST and USPS data and then adapting a predictor. We define the MNIST distribution, represented in the first dataset (USPS (%) = 0), as the source distribution and the mixed distributions, represented in datasets 2 to 12 (USPS (%) > 0), as multiple target distributions. Total dataset shift (shift in P_{X,Y}), covariate shift (shift in P_X), and concept shift of type 1 (shift in P_{X|Y}) are rapidly detected, while label shift and concept shift of type 2 (shift in P_{Y|X}) are not evident. The second plot indicates that adapting for covariate shift should be enough, so we compare the performance of two logistic regression models trained using pure MNIST samples, one of which is adapted using importance weighting. The adapted model starts giving better predictions when covariate shift is detected by our test.

Figure 7: Detecting different types of dataset shift using CIFAR-10 data and EfficientNetV2S (Tan and Le, 2021) as the classifier to estimate the Radon-Nikodym derivative and the conditional distribution of Y|X. We split the data into source and target portions, explicitly introducing label shift, which may induce covariate shift and concept shift of type 2. We fix δ ∈ (0, .5) and then create a list LIST of 10 numbers where the first element of the list is δ, the last is 1 − δ, and the intermediate ones are given by a linear interpolation of δ and 1 − δ. Then, we select a fraction LIST[k] of the samples of class k ∈ {0, ..., 9} to be in the source dataset, while the rest goes to the target dataset. For different values of δ, we were able to detect all types of shift except concept shift 1, which is expected because we introduced label shift in isolation.

Finally, we present a regression experiment using data from the 2017, 2018, 2019, and 2020 editions of ENEM, the "Brazilian SAT". In each of the years, Y is given by the students' math score on a logarithmic scale, while X is composed of six of their personal and socioeconomic features: gender, race, school type (private or public), mother's education, family income, and the presence of a computer at home. We randomly subsample the data in each of the years to 30k data points, with 10% of them going to the test portion, and then use the CatBoost algorithm both to estimate the Radon-Nikodym derivative and the conditional distribution of Y|X. When estimating the distribution of Y|X, we first fit a regressor to predict Y given X and then, using a holdout set, we fit a Gaussian model on the residuals (see the sketch below). When testing for a shift in the distribution of X|Y, we discretize Y into 10 bins that evenly split the data. Even though we use the binned version of Y to get the p-value, we report K̂L_{X|Y} in the first panel of Figure 8. In this experiment, we perform an analysis similar to the credit scoring one, comparing the probability distributions of 2018, 2019, and 2020 with the one of 2017. From Figure 8, it is possible to see that we detected all kinds of shift after 2017.
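For the regression experiment, the conditional model Q(y|x) described above (a point regressor plus a Gaussian fit to holdout residuals) can be sketched as follows. We use a generic scikit-learn regressor in place of CatBoost, and the class and method names are ours.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

class GaussianResidualModel:
    """Q(y|x) = N(f(x), sigma^2): fit a regressor f, then estimate sigma from
    residuals on a holdout set. Used to draw the Y's for the conditional
    randomization test when Y is continuous."""

    def fit(self, X_fit, y_fit, X_holdout, y_holdout):
        self.reg = GradientBoostingRegressor().fit(X_fit, y_fit)
        resid = y_holdout - self.reg.predict(X_holdout)
        self.sigma = resid.std()
        return self

    def sample(self, X, rng=None):
        rng = rng or np.random.default_rng()
        mu = self.reg.predict(X)
        return rng.normal(loc=mu, scale=self.sigma)
```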
This result indicates that a model trained in 2017 might not generalize well to other years, and practitioners may consider retraining their models from scratch using more recent data.

We introduced a novel methodology that can both quantify and formally test for different types of dataset shift, including label shift, covariate shift, and concept shifts. Our approach sheds light not only on whether a prediction model should be retrained, but also on how to retrain it, enabling practitioners to objectively tackle shifts in the probability distributions they are dealing with. Furthermore, our method can be applied to a diversity of practical problems, independently of data type or final supervised task, and proved effective at dataset shift detection in experiments with artificial and real data. We compared our framework with existing approaches. Unlike our framework, existing methods are only designed to detect specific types of shift or cannot formally test their presence, sometimes even requiring both labels and features to be discrete. In practice, our method achieved good results when compared with alternative approaches. Furthermore, we demonstrated how our framework leads to insights that improve the predictive power of the supervised model. Moreover, because our approach estimates Radon-Nikodym derivatives, dataset shift correction via importance weighting comes for free, as shown in the MNIST/USPS experiment.

Our framework can be used or adapted for different situations. Firstly, our methodology can also be applied to unsupervised tasks, e.g., we might be interested in tracking shifts in the distribution of features X even when we do not have access to labels. Secondly, our hypothesis tests are agnostic to the choice of test statistics, and therefore our framework is modular. That is, we can use different statistics along with the same testing procedures. That opens the possibility of creating even more powerful tests for some settings. For instance, we could use an estimator for a symmetrized version of the KL divergence. Improvements of our method could also be obtained by applying different methods for density ratio estimation, e.g., Rhodes et al. (2020), to obtain better estimates of the KL divergence.

We are thankful for the credit dataset provided by the Latin American Experian DataLab, the Serasa Experian DataLab. FMP is grateful for the financial support of CNPq (32857/2019-7) and Fundunesp/Advanced Institute for Artificial Intelligence (AI2) (3061/2019-CCP) during his master's degree at the University of São Paulo (USP). Part of this work was written when FMP was at USP. RI is grateful for the financial support of CNPq (309607/2020-5) and FAPESP (2019/11321-9).

Our package for dataset shift detection can be found at https://github.com/felipemaiapolo/detectshift. The source code used in this paper can be found at https://github.com/felipemaiapolo/dataset_shift_diagnostics. All the datasets can be found at https://github.com/felipemaiapolo/dataset_shift_diagnostics/tree/main/data.

All the experiments were run on a MacBook Air (M1, 2020) with 16GB of RAM, except for the credit analysis experiment, which was run on an 80-CPU Intel Xeon Gold 6148 cluster. In this running time analysis, we consider one iteration as all the steps needed to compute all the p-values used for a specific experiment. In the artificial data experiments, each iteration performed by our framework took less than 1s on average.
On the other hand, one iteration took, approximately and on average, 16s in the credit experiment, 34s in the digits experiment, 20s in the text experiment, 175s for CIFAR-10, 715s for CIFAR-100, and 39s in the regression experiment.

Proof. The results for H_{0,D}, H_{0,C}, and H_{0,L} are well known in probability theory and derive from properties of Jensen's inequality. For H_{0,C1}, note that KL_{X|Y} = 0 ⇔ KL(P^{(2)}_{X|Y=y} || P^{(1)}_{X|Y=y}) = 0 for P^{(2)}_Y-almost all y ⇔ P^{(1)}_{X|Y=y} = P^{(2)}_{X|Y=y} for P^{(2)}_Y-almost all y ⇔ P^{(1)}_{X|Y=y} = P^{(2)}_{X|Y=y} for P̃_Y-almost all y, where (i) the first step derives from the fact that the KL divergence is always non-negative, (ii) the second step is analogous to the results for H_{0,D}, H_{0,C}, H_{0,L}, and (iii) the last step is due to P̃_Y ∼ P^{(2)}_Y, that is, P̃_Y ≪ P^{(2)}_Y and P^{(2)}_Y ≪ P̃_Y. The result for H_{0,C2} is obtained in a similar way.

Having observed two datasets (in practice we use the test datasets), D^{(1)} and D^{(2)}, we define p̂^{(i)}_y to be the relative frequency of the label y in dataset i. Then, a plug-in estimator for KL_Y is given by K̂L_Y = Σ_y p̂^{(2)}_y log( p̂^{(2)}_y / p̂^{(1)}_y ). This estimator is consistent.

Figure 9: Detecting different types of dataset shift using CIFAR-100 / Amazon Fine Food Reviews data and EfficientNetV2S (Tan and Le, 2021) / XLM-RoBERTa (Conneau et al., 2020) as the classifier to estimate the Radon-Nikodym derivative and the conditional distribution of Y|X. We split the data into source and target portions, explicitly introducing label shift, which may induce covariate shift and concept shift of type 2. We fix δ ∈ (0, .5) and then create a list LIST of 100 or 5 numbers where the first element of the list is δ, the last is 1 − δ, and the intermediate ones are given by a linear interpolation of δ and 1 − δ. Then, we select a fraction LIST[k] of the samples of class k ∈ {0, ..., 99} or {0, ..., 4} to be in the source dataset, while the rest goes to the target dataset. For different values of δ, we were able to detect all types of shift except concept shift 1, which is expected because we introduced label shift in isolation.

References
Early drift detection method
The conditional permutation test for independence while controlling for confounders
Panning for gold: 'Model-X' knockoffs for high dimensional controlled variable selection
Unsupervised cross-lingual representation learning at scale
Approximating likelihood ratios with calibrated discriminative classifiers
Likelihood-free frequentist inference: Bridging classical statistics and machine learning in simulation and uncertainty quantification
Permutation methods: a basis for exact inference
The clinician and dataset shift in artificial intelligence
A unified framework for constructing, tuning and assessing photometric redshift density estimates in a selection bias setting
Learning with drift detection
Covariate shift by kernel mean matching. Dataset shift in machine learning
The elements of statistical learning: data mining, inference, and prediction
Converting high-dimensional regression to high-dimensional conditional density estimation
Photo-z estimation: An example of nonparametric conditional density estimation under selection bias
Local permutation tests for conditional independence
Sulla determinazione empirica di una legge di distribuzione
Learning multiple layers of features from tiny images
On information and sufficiency. The Annals of Mathematical Statistics
Gradient-based learning applied to document recognition
Testing statistical hypotheses
Application of covariate shift adaptation techniques in brain-computer interfaces
Dataset shift quantification for credit card fraud detection
Effective sample size, dimensionality, and generalization in covariate shift adaptation
A unifying view on dataset shift in classification
OCDE: Odds conditional density estimator
Lecture notes on information theory
CatBoost: unbiased boosting with categorical features
Dataset shift in machine learning
EWMA model based shift-detection methods for detecting covariate shifts in non-stationary environments
Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure
Estimate of deviation between empirical distribution functions in two independent samples
Amortised MAP inference for image super-resolution
Three population covariate shift for mobile phone-based credit scoring
Machine learning in non-stationary environments: Introduction to covariate shift adaptation
Covariate shift adaptation by importance weighted cross validation
Density ratio estimation in machine learning
EfficientNetV2: Smaller models and faster training
Quantification under prior probability shift: The ratio estimator and its extensions
Testing for concept shift online
Characterizing concept drift
Analyzing concept drift and shift from sample data
A concept-drift based predictive-analytics framework: Application for real-time solar irradiance forecasting
Concept drift and covariate shift detection ensemble with lagged labels
Concept drift detection and adaptation with hierarchical hypothesis testing