Why did the distribution change?
Budhathoki, Kailash; Janzing, Dominik; Bloebaum, Patrick; Ng, Hoiyi
2021-02-26

Abstract: We describe a formal approach based on graphical causal models to identify the "root causes" of the change in the probability distribution of variables. After factorizing the joint distribution into conditional distributions of each variable given its parents (the "causal mechanisms"), we attribute the change to changes of these causal mechanisms. This attribution analysis accounts for the fact that mechanisms often change independently, and sometimes only some of them change. Through simulations, we study the performance of our distribution change attribution method. We then present a real-world case study identifying the drivers of the difference in the income distribution between men and women.

Changes to a probability distribution are common in many real-world domains that are part of a changing environment. For example, during COVID-19, most retailers likely observed a shift in the distribution of their inventory levels of products, as certain products were in unusually high demand (e.g., masks and hand sanitizers) while the supply of some other products was limited because suppliers suspended manufacturing. To respond effectively to similar situations, either proactively or retrospectively, it is important not only to identify if and where the distribution changed, but also to know why the distribution changed. In recent years, several techniques have been developed to either automatically detect changes in the underlying distribution from a sequence of observations (Pollak, 1985; Kifer et al., 2004), or determine whether two data samples come from the same distribution (Chakravarti et al., 1967; Scholz and Stephens, 1987; Snedecor and Cochran, 1989; Gretton et al., 2012).
However, a formal way to identify the "root causes" of the distribution change seems to be missing. In this work, we consider a system of n variables X_1, ..., X_n. In a supply chain, for instance, these variables can represent different business metrics, such as demand forecast, labour cost, shipment cost, and inventory level, to name a few. Typical causal questions about these variables are interventional in nature, e.g. "What would be the impact on X_k if we were to intervene on X_j?" Here we are interested in a slightly different question, namely "Which mechanisms are responsible for the change in the joint distribution P_{X_1,...,X_n}, or the marginal distribution P_{X_k} of one of the variables X_k?" For example, in a supply chain, we might be interested in understanding the drivers of week-over-week changes in the distribution of inventory level (or a summary statistic, such as its mean across different products). To this end, we build upon graphical causal models (Pearl, 2009). Given a causal graph G of variables X_1, ..., X_n, assuming the causal Markov condition (Spirtes et al., 2000), we can factorise the joint distribution into causal conditionals, i.e.

P_{X_1,...,X_n} = ∏_{j=1}^n P_{X_j | PA_j},

where P_{X_j | PA_j} denotes the causal mechanism of variable X_j given its direct parents PA_j in the causal graph. Each parent-child relationship captured by P_{X_j | PA_j} represents an autonomous physical mechanism: we can change one such relationship without affecting the others. It is thus plausible to attribute any change in the joint distribution, or in the marginal distribution of some target variable, to changes in some of the causal mechanisms. Based on this insight, we develop a formal approach to define the quantitative contribution of each mechanism to the overall change.
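As a concrete illustration of this factorisation, the joint distribution of a hypothetical three-node chain X_1 → X_2 → X_3 can be assembled from its causal conditionals (a minimal sketch; the distribution tables are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete chain X1 -> X2 -> X3, each variable binary.
def random_dist(shape, axis):
    p = rng.random(shape)
    return p / p.sum(axis=axis, keepdims=True)

p_x1 = random_dist((2,), axis=0)        # P(X1)
p_x2_x1 = random_dist((2, 2), axis=1)   # P(X2 | X1), rows indexed by x1
p_x3_x2 = random_dist((2, 2), axis=1)   # P(X3 | X2), rows indexed by x2

# Causal Markov factorisation: P(x1, x2, x3) = P(x1) P(x2|x1) P(x3|x2).
joint = np.einsum('i,ij,jk->ijk', p_x1, p_x2_x1, p_x3_x2)

# Any marginal follows by summing out the remaining variables.
p_x3 = joint.sum(axis=(0, 1))
```

Replacing any one factor (a mechanism change) yields a new joint distribution while the other factors stay untouched.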
In particular, we use the Shapley value concept (Shapley, 1953) from cooperative game theory to cope with the fact that the impact of changing a mechanism depends on which other mechanisms have already been changed.

The paper is structured as follows. In Section 2, we formalise changes to causal mechanisms. Section 3 presents a proposal to attribute the change in the joint distribution. In Section 4, we describe a formal method to attribute the change in the marginal distribution to causal mechanisms. Section 5 discusses the practical implications of applying these attribution proposals. In Section 7, we report results from simulations and present a case study identifying the drivers of the difference in the income distribution between men and women. Finally, we conclude in Section 8.

We consider probabilistic causal models that incorporate probability to infer causal relationships between variables. Suppose that we have a collection of n random variables (X_1, ..., X_n) =: X. The underlying causal graph G of these variables is a directed acyclic graph in which a directed edge from X_i to X_j indicates that X_i directly causes X_j. A joint distribution P_X is said to be compatible with the causal graph G if P_X can be generated following the edges in G, i.e. if it factorises as P_X = ∏_{j=1}^n P_{X_j | PA_j}.

Definition 1 (Probabilistic Causal Model) A probabilistic causal model is a pair C := (G, P_X) that consists of a causal graph G, and a joint distribution P_X over the variables in G that is compatible with G.

Given a probabilistic causal model, the causal Markov assumption (Spirtes et al., 2000) allows us to factorise the joint distribution P_X into causal mechanisms P_{X_j | PA_j} at each node X_j. Each causal mechanism P_{X_j | PA_j} remains invariant to interventions (external influences) on other variables. With this, we can formally define what causal mechanism changes entail.
Definition 2 (Mechanism Changes) Mechanism changes to a causal model C := (G, P_X) on a subset of variables X_T indexed by a change set T ⊆ {1, ..., n} transform C into C^T := (G, P^T_X), where P^T_X is a new joint distribution obtained by replacing the "old" causal mechanism P_{X_j | PA_j} at each node X_j with j ∈ T by the "new" causal mechanism P̃_{X_j | PA_j}.

The following example illustrates the idea above in a formal setting where causal relationships between variables are represented in terms of structural equations (Pearl, 2009).

Example 1 Consider a causal model consisting of two variables, i.e. C = (X_1 → X_2, P_{X_1,X_2}), induced by the structural equations X_1 := N_1 and X_2 := 2X_1 + N_2, where the independent unobserved noise terms N_j ∼ N(0, 1) are distributed according to a standard Normal distribution. A typical structure-preserving intervention (Eberhardt and Scheines, 2007) changes either the parent-child functional relationship or the distribution of the unobserved noise term. Consider an intervention that changes the relationship between X_1 and X_2 from the linear function to a non-linear function represented by the new structural assignment X_2 := X_1^3 + N_2. Whereas this changes the causal mechanism of X_2 from P_{X_2|X_1} to P̃_{X_2|X_1}, the underlying causal graph remains the same. As such, we have a new causal model C^T = (G, P^T_X) with change set T = {2}.

The "new" joint distribution P^T_X can also be seen as the post-intervention joint distribution P^{do(T)}_X, where do(T) represents mechanism changes through a stochastic intervention at each node X_j indexed by T (Correa and Bareinboim, 2020). In particular, we only consider cases where the change in the joint distribution is the result of systematic replacements of independent physical mechanisms. Other cases, e.g. adversarial perturbations, can also change the joint distribution arbitrarily; we do not consider them here. Moreover, we consider the problem setting where the joint distribution changes, but not the causal graph.
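Example 1 can be simulated directly; the sketch below (sample size and seed are arbitrary) draws from both SCMs and illustrates that only the mechanism of X_2 is replaced while the marginal of X_1 stays the same:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# "Old" SCM C (Example 1): X1 := N1, X2 := 2*X1 + N2, with N1, N2 ~ N(0, 1).
x1_old = rng.standard_normal(n)
x2_old = 2 * x1_old + rng.standard_normal(n)

# "New" SCM: only the mechanism of X2 is replaced (X2 := X1**3 + N2);
# the mechanism of X1 and the causal graph X1 -> X2 are untouched.
x1_new = rng.standard_normal(n)
x2_new = x1_new ** 3 + rng.standard_normal(n)

# Marginal of X1: unchanged. Marginal of X2: Var = 5 under C but 16 under
# the new SCM (Var(X1**3) = E[X1**6] = 15, plus unit noise variance).
```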
3 Why did the joint distribution change?

As the joint distribution P_{X_1,...,X_n} is a composition of independent causal mechanisms P_{X_j | PA_j}, it is plausible to attribute any change in the joint distribution to changes in some of the causal mechanisms. We would like to compute the contribution of each node (potentially due to the change in its causal mechanism) to the change in the joint distribution. To this end, we first need a measure that quantifies the change in the joint distribution. Divergence measures are a natural choice, as they measure the "distance" between two probability distributions. In this work, we consider the Kullback-Leibler (KL) divergence, also called relative entropy (Cover and Thomas, 2006). Let P and Q be two probability distributions of a discrete random variable defined on the same probability space X. Then the KL divergence from Q to P is defined as

D(P || Q) := ∑_{x ∈ X} P(x) log (P(x) / Q(x)).

If P and Q are distributions of a continuous random variable, the summation is replaced by an integral. The KL divergence is particularly suitable for our purpose as it is additive over the independent factors of the joint distribution. That is, by generalising the chain rule to more than two variables using the causal Markov condition, we get an additive decomposition of the KL divergence from the joint distribution P_X to P̃_X (Cover and Thomas, 2006, Theorem 2.5.3):

D(P̃_X || P_X) = ∑_{j=1}^n D(P̃_{X_j | PA_j} || P_{X_j | PA_j}).

Thus, the contribution of each node X_j to the KL divergence from the joint distribution P_X to P̃_X is the KL divergence from its causal mechanism P_{X_j | PA_j} to P̃_{X_j | PA_j}. In other words, each node, due to the change in its causal mechanism, contributes independently to the KL divergence from the joint distribution P_X to P̃_X. The definition below formalises this observation.

Definition 3 Suppose that the causal mechanism of a node X_j changes from P_{X_j | PA_j} to P̃_{X_j | PA_j}.
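The chain rule behind this decomposition can be checked numerically on a toy two-node model (binary variables; the distribution tables are made up):

```python
import numpy as np

def kl(p, q):
    # KL divergence between two distributions given as arrays of probabilities.
    return np.sum(p * np.log(p / q))

# "Old" model P over X1 -> X2.
p_x1 = np.array([0.6, 0.4])
p_x2_x1 = np.array([[0.7, 0.3], [0.2, 0.8]])   # rows indexed by x1

# "New" model (tilde): same graph, changed mechanisms.
pt_x1 = np.array([0.5, 0.5])
pt_x2_x1 = np.array([[0.9, 0.1], [0.2, 0.8]])

p_joint = p_x1[:, None] * p_x2_x1
pt_joint = pt_x1[:, None] * pt_x2_x1

# Joint KL equals the sum of per-node contributions, where the conditional
# divergence is averaged under the *new* parent distribution.
lhs = kl(pt_joint, p_joint)
contrib_x1 = kl(pt_x1, p_x1)
contrib_x2 = sum(pt_x1[i] * kl(pt_x2_x1[i], p_x2_x1[i]) for i in range(2))
```

Note that the mechanism of X_2 for x_1 = 1 is unchanged in this toy model, so only the x_1 = 0 row contributes to the conditional term.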
Then the contribution of the node X_j to the KL divergence from the joint distribution P_X to P̃_X is D(P̃_{X_j | PA_j} || P_{X_j | PA_j}).

The KL divergence is always non-negative, i.e. for any P and Q it holds that D(P || Q) ≥ 0. Therefore, the contribution of a node X_j to the change in the joint distribution, measured in terms of the KL divergence, cannot be negative. If the causal mechanism of a variable did not change, its contribution is zero. We should remark, however, that the KL divergence between conditionals also depends on the distribution of the parents, not only on the conditionals. Formally, the KL divergence from P_{X_j | PA_j} to P̃_{X_j | PA_j} is defined as

D(P̃_{X_j | PA_j} || P_{X_j | PA_j}) := ∑_{pa_j} P̃(pa_j) D(P̃_{X_j | pa_j} || P_{X_j | pa_j}).

This is not really problematic, however, as the marginal distribution P̃_{PA_j} of the parents is only used for averaging the KL divergences D(P̃_{X_j | pa_j} || P_{X_j | pa_j}). As such, each node X_j uses the corresponding parent distribution P̃_{PA_j} from the same joint distribution P̃_X. Estimating KL divergence in high-dimensional settings is a challenging problem. For some parametric families, such as the exponential family of distributions, however, closed-form expressions for computing the KL divergence exist. If a non-parametric estimator is desired, we can use the k-nearest-neighbour-based estimator of KL divergence (Wang et al., 2009), which is asymptotically unbiased and mean-square consistent assuming i.i.d. samples. One may also wonder whether the asymmetry of the KL divergence creates a non-intuitive interpretation, as the impact of each mechanism change differs according to the direction of comparison. From an inferential standpoint, however, one direction seems preferable. When a distribution changes in time, there is a natural inferential asymmetry, since one would rather consider the likelihood of new data with respect to the old model than vice versa. The main reason to choose the KL divergence, however, is that it nicely decomposes additively.
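For univariate Gaussians, for instance, the closed form is D(N(μ_1, σ_1²) || N(μ_2, σ_2²)) = log(σ_2/σ_1) + (σ_1² + (μ_1 − μ_2)²)/(2σ_2²) − 1/2, which a Monte Carlo estimate reproduces (a minimal sketch; the parameter values are arbitrary):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    # Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) ).
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def logpdf(x, mu, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - mu)**2 / (2 * s**2)

rng = np.random.default_rng(2)
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0

# Monte Carlo estimate of the same quantity: E_{x ~ P}[ log p(x) - log q(x) ].
x = rng.normal(mu1, s1, size=500_000)
mc = np.mean(logpdf(x, mu1, s1) - logpdf(x, mu2, s2))
```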
Admittedly, this additive decomposition is somewhat spoiled because we need a reference distribution to weigh the change of the conditionals, but this seems like a problem that is hard, if not impossible, to avoid.

In practice, it is often of interest to understand why the marginal distribution of one target variable changed, rather than the change in the joint distribution of all variables. In the next section, using a concept from game theory, we formalise how to attribute the change in the marginal distribution of a target variable to each node in the causal graph.

Suppose that the marginal distribution of a target variable X_k changes from P_{X_k} to P̃_{X_k}. The causal Markov condition allows us to compute the marginal distribution of any variable in the causal graph by marginalising (summing) over all other variables. Formally, given a causal model C = (G, P_X), the marginal distribution P_{X_k} of the variable X_k can be computed by first factorising the joint distribution using the causal Markov condition and then marginalising over all other variables, i.e.

P_{X_k} = ∑_{x_1, ..., x_{k−1}, x_{k+1}, ..., x_n} ∏_{j=1}^n P_{X_j | PA_j}.

The marginal distribution of X_k under the change set T is then given by

P^T_{X_k} = ∑_{x_1, ..., x_{k−1}, x_{k+1}, ..., x_n} ∏_{j ∈ T} P̃_{X_j | PA_j} ∏_{j ∉ T} P_{X_j | PA_j}.

It is therefore reasonable to assume that any change in the marginal distribution P_{X_k} of the variable X_k is most likely due to changes in some of the causal mechanisms P_{X_j | PA_j}. Unlike in the case of the joint distribution, however, the additive property of the KL divergence cannot be leveraged directly for the attribution here, due to the marginalisation. A natural way to compute the contribution of each node to the change in the marginal distribution of the target, from P_{X_k} to P̃_{X_k}, is then as follows: replace each "old" mechanism P_{X_j | PA_j} by the "new" mechanism P̃_{X_j | PA_j} in succession. Each replacement changes the marginal distribution of the target X_k, which can be used to compute the contribution of the corresponding node.
The amount of change, however, depends on the causal mechanisms that have already been replaced. In other words, the contribution of a node X_j to the change in the marginal distribution of the target X_k depends on the order in which we replace the causal mechanisms; the resulting attribution procedure is hence in danger of becoming arbitrary. The Shapley value (Shapley, 1953) from cooperative game theory provides a principled approach to mitigate this arbitrariness due to the dependence on the ordering of replacements. In particular, it removes the arbitrariness by symmetrising over all orderings. We briefly summarise the Shapley value here.

Let N := {1, ..., n} be a set of n players and ν : 2^N → R be a set function that associates a real-valued payoff to a coalition of players T ⊆ N, with ν(∅) = 0, where ∅ denotes the empty set. We assume that players cooperate to form the grand coalition N. The goal is then to "fairly" assign the resulting payoff ν(N) to each player j in N. Let σ : N → N denote a permutation of the players N; all permutations of the set N with n elements form the symmetric group S_n. Suppose that the players enter into a coalition one by one in the ordering σ(1), σ(2), ..., σ(n) to eventually form the grand coalition. Let N_prec(j) := {i ∈ N : σ^{−1}(i) < σ^{−1}(j)} denote the set of players that precede player j in the ordering, where σ^{−1}(i) gives player i's position. Then the change in the payoff of a coalition when a player joins it is the marginal contribution of the player to that coalition,

C(j | σ) := ν(N_prec(j) ∪ {j}) − ν(N_prec(j)).

The marginal contributions of all players then sum up to the payoff of the grand coalition ν(N), i.e.

∑_{j=1}^n C(j | σ) = ν(N).

Unfortunately, the marginal contribution of a player j depends on the order σ in which the players enter the coalition. To remove the dependence on the ordering σ, Shapley's solution (Shapley, 1953) is to assign each player j ∈ N its average marginal contribution over all permutations, i.e.

φ_j(ν) := (1/n!) ∑_{σ ∈ S_n} [ν(N_prec(j) ∪ {j}) − ν(N_prec(j))]
        = ∑_{T ⊆ N \ {j}} (|T|! (n − |T| − 1)! / n!) [ν(T ∪ {j}) − ν(T)],
where the second line follows from counting the number of permutations (out of n!) in which N_prec(j) = T. The Shapley value is also desirable because it is the unique solution satisfying the following axioms, which capture a notion of fairness.

Efficiency: The Shapley values of all players sum up to the payoff of the grand coalition. That is, ∑_{j=1}^n φ_j(ν) = ν(N) for any set function ν.

Symmetry: If two players contribute the same amount to every coalition of other players, their Shapley values are equal. Formally, for two players i and j, if ν(T ∪ {i}) = ν(T ∪ {j}) for every T ⊆ N \ {i, j}, then φ_i(ν) = φ_j(ν).

Null Player: A player that does not contribute to any coalition gets a zero Shapley value. Formally, for a player j, if ν(T ∪ {j}) = ν(T) for every T ⊆ N \ {j}, then φ_j(ν) = 0.

Additivity: The Shapley value computed from the sum of two set functions is the same as the sum of the Shapley values computed from the individual set functions. That is, for any two set functions ν_1 and ν_2, we have φ_j(ν_1 + ν_2) = φ_j(ν_1) + φ_j(ν_2).

Computing the Shapley value of a player has a worst-case time complexity of O(2^n). However, efficient approximations exist (Lundberg and Lee, 2017).

Slightly abusing notation, we use the index set {1, ..., n} =: N to refer to the corresponding variables X_1, ..., X_n. Then any coalition T ⊆ N represents the change set for mechanism changes to the "old" causal model. Let P^T_{X_k} denote the marginal distribution of X_k obtained by marginalising the joint distribution P^T_X in the new causal model C^T := (G, P^T_X). Naturally, for T = ∅, the causal model does not change, i.e. C^∅ = C. First consider a scenario where the change in the marginal distribution of the target X_k is quantified by the KL divergence. The quantity that we want to attribute to each node is then the KL divergence D(P̃_{X_k} || P_{X_k}).
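A brute-force implementation of the Shapley value via the subset formula (exponential in n, as noted above) can serve as a reference; the toy payoff function below is made up:

```python
from itertools import combinations
from math import factorial

def shapley_values(n, payoff):
    # Exact Shapley values for players 0..n-1; `payoff` maps a frozenset
    # (coalition) to a real number, with payoff(frozenset()) == 0.
    phi = [0.0] * n
    for j in range(n):
        others = [i for i in range(n) if i != j]
        for size in range(n):
            for T in combinations(others, size):
                T = frozenset(T)
                weight = factorial(len(T)) * factorial(n - len(T) - 1) / factorial(n)
                phi[j] += weight * (payoff(T | {j}) - payoff(T))
    return phi

# Toy game: the payoff of a coalition is the square of its total "weight".
w = [1.0, 2.0, 3.0]
v = lambda T: sum(w[i] for i in T) ** 2
phi = shapley_values(3, v)
```

By the efficiency axiom, `phi` sums to the grand-coalition payoff v({0, 1, 2}) = 36.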
To this end, we define the marginal contribution of a node X_j given a change set T ⊆ N \ {j} as

C_D(j | T) := D(P^{T ∪ {j}}_{X_k} || P_{X_k}) − D(P^T_{X_k} || P_{X_k}).

In other words, the marginal contribution quantifies how much replacing the causal mechanism of variable X_j contributes to the KL divergence D(P̃_{X_k} || P_{X_k}), given that we have already replaced the causal mechanisms of the variables in the change set T. The Shapley value of the variable X_j is then

φ_j(D) = ∑_{T ⊆ N \ {j}} (|T|! (n − |T| − 1)! / n!) C_D(j | T),

which gives the "fair" contribution of X_j to the change in the marginal distribution of the target X_k, measured in terms of D(P̃_{X_k} || P_{X_k}). Note that the Shapley value contribution φ_j(D) can be negative. This is completely reasonable, as the marginal contribution C_D(j | T) can be negative: replacing the causal mechanism of a variable can bring the "new" marginal distribution of the target closer to the "old" marginal distribution P_{X_k}, thereby lowering the KL divergence relative to not replacing it. Due to the efficiency property, the Shapley values of all variables sum up to the KL divergence of the marginal distribution from P_{X_k} to P̃_{X_k}, i.e.

∑_{j=1}^n φ_j(D) = D(P̃_{X_k} || P_{X_k}).

Note that the equality above may not hold strictly when we approximate the Shapley values. For a worst-case analysis of the approximation error of Shapley values, we refer the interested reader to Charnes et al. (1988) and Fatima et al. (2008).

Instead of the overall change in the marginal distribution as measured by the KL divergence, we might be interested in the change of some property or summary of it (e.g. mean, median, variance, skew). For instance, it is not uncommon for a retailer to ask, "Why did the mean inventory level go down?", as maintaining an inventory requires upfront investment. Among other reasons, that could be due to a change in the distribution of the demand forecast, or changes in the algorithms (mathematical functions). The definition below provides a rather general way to attribute marginal distribution change.
Let ψ denote any functional defined on the marginal distribution. Given a change set T ⊆ N \ {j}, the marginal contribution of a node X_j to the change in the functional of the marginal distribution of the target X_k, i.e. to ∆ψ := ψ(P̃_{X_k}) − ψ(P_{X_k}), is

C_ψ(j | T) := ψ(P^{T ∪ {j}}_{X_k}) − ψ(P^T_{X_k}),

and its Shapley value contribution to ∆ψ is given by

φ_j(ψ) = ∑_{T ⊆ N \ {j}} (|T|! (n − |T| − 1)! / n!) C_ψ(j | T).

The marginal contribution C_ψ(j | T) quantifies how much replacing the causal mechanism of X_j contributes to the change in the summary of P_{X_k}, given that we have already replaced the causal mechanisms of the variables in the change set T. As the marginal contribution of a variable can be negative, the Shapley value contribution φ_j(ψ) can also be negative. In the example below, we show how to attribute the change in the mean of the target variable to each node.

Example 2 Let E_{X_k ∼ P_{X_k}}[X_k] denote the mean of the variable X_k under the distribution P_X. The Shapley value contribution of a node X_j to the change in the mean of the target node X_k (due to the change in its marginal distribution) is

φ_j(E) = ∑_{T ⊆ N \ {j}} (|T|! (n − |T| − 1)! / n!) C_E(j | T),

where the marginal contribution C_E(j | T) of X_j given a change set T is defined as

C_E(j | T) := E_{X_k ∼ P^{T ∪ {j}}_{X_k}}[X_k] − E_{X_k ∼ P^T_{X_k}}[X_k].

The Shapley values of all nodes sum up to the change in the mean of the target node. That is, the following holds:

∑_{j=1}^n φ_j(E) = E_{X_k ∼ P̃_{X_k}}[X_k] − E_{X_k ∼ P_{X_k}}[X_k].

Next, we discuss practical aspects of the distribution change attribution solutions we have presented thus far. To apply our attribution methods, we require a causal graph and its causal conditionals. Existing techniques for sound and complete causal structure learning (Spirtes et al., 2000; Pearl, 2009), however, can only recover the Markov equivalence class of DAGs assuming faithfulness. With additional assumptions on the data-generating process, it is possible to recover the exact causal graph (Peters et al., 2017; van de Geer and Bühlmann, 2013). A promising approach is to combine domain knowledge with conditional independence tests (Mastakouri et al., 2019).
In some cases, we can also perform controlled randomised experiments to identify the exact DAG from the Markov equivalence class (Dubois et al., 2020). Once we have the causal graph, we can estimate the causal conditionals directly from data using techniques from high-dimensional statistics (Bühlmann and van de Geer, 2011; Wainwright, 2019). Note that we are given two samples of the same or different sizes (e.g. data from the week before Christmas, and data from the Christmas week). For each node X_j, we then estimate P_{X_j | PA_j} from the first sample and P̃_{X_j | PA_j} from the second sample.

In the context of distribution-change attribution, however, sampling variability can lead to spurious results when we directly plug in the estimated causal conditionals. Even if two samples are drawn from the same joint distribution P_{X_j, PA_j}, we will most likely estimate two different causal conditionals (even with regularisation) because of sampling variability. Contrary to expectation, X_j will then be attributed a non-zero value. If we knew that the causal mechanism P_{X_j | PA_j} did not change, it would make sense to learn the causal mechanism from the combined sample, as the quality of the learned causal conditional would also improve due to the increased sample size. Moreover, we could directly attribute a zero contribution to the node.

[Figure 1: The assumed causal graph to detect mechanism changes for a parent-child relationship (PA_j → X_j, with the mechanism depending on g(A)).]

This raises the question: how do we detect causal mechanism changes from two samples? In other words, are there any statistically testable implications of causal mechanism changes? There exists a large body of work, e.g. Chakravarti et al. (1967); Scholz and Stephens (1987); Snedecor and Cochran (1989); Gretton et al.
(2012), on statistical hypothesis tests to determine whether the difference between two distributions is statistically significant, based on two samples, each drawn from one of the distributions. Those are not applicable in our setting, as we are interested in the difference between the conditional distributions, not the marginals. In recent work, Huang et al. (2020) extend the PC algorithm (Spirtes et al., 2000) for causal discovery from non-stationary or heterogeneous data, where one of the steps involves detecting changing causal mechanisms. Here we adapt their method. Assume that the causal mechanisms P_{X_j | PA_j} can be written as functions of a time or domain index A (see Figure 1). If the causal graph is induced by a functional causal model, then quantities such as functional models, noise levels, etc. that may change over time or across domains can be written as functions of A. Under these assumptions, if the causal mechanism P_{X_j | PA_j} remains the same across the values of A, then the conditional independence test X_j ⊥⊥ A | PA_j suffices to detect changes to the causal mechanism P_{X_j | PA_j}.

Let D_t denote the m_t-by-n matrix containing the sample from time (or domain) t ∈ {1, 2}, and let D denote the m-by-n matrix obtained by vertically concatenating the D_t, where m = m_1 + m_2. We can construct A directly from data. The key idea is to assign the same value of A to the units in the sample from the same time (or domain) t. For clarity, with a slight abuse of notation, we use A interchangeably for the variable and the data vector. Each entry a_i of the m-by-1 vector A is assigned the index t of the sample that row i of D comes from, i.e. a_i := 1 if i ≤ m_1 and a_i := 2 otherwise. Using the columns of the combined data matrix D and the index vector A, we then test whether each variable X_j is conditionally independent of A given its direct parents PA_j in the causal graph, i.e. X_j ⊥⊥ A | PA_j. If the conditional independence holds, then the causal mechanism P_{X_j | PA_j} did not change across the values of A.
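Both outcomes of this test can be illustrated with a sketch. For simplicity it uses a partial-correlation (Fisher z) test, which is only valid under a linear-Gaussian approximation — unlike the non-parametric kernel-based test used in this work — and the data-generating parameters are made up:

```python
import math
import numpy as np

def partial_corr_pvalue(x, a, pa):
    # Test X independent of A given PA via the partial correlation of the
    # residuals after regressing both on PA (linear-Gaussian shortcut).
    Z = np.column_stack([np.ones(len(x)), pa])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ra = a - Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
    r = np.corrcoef(rx, ra)[0, 1]
    z = 0.5 * np.log((1 + r) / (1 - r)) * math.sqrt(len(x) - pa.shape[1] - 3)
    return math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value

rng = np.random.default_rng(3)
m = 2000
pa = rng.standard_normal(2 * m)      # direct parent of X_j
a = np.repeat([1.0, 2.0], m)         # index vector A: sample 1 vs. sample 2

# Mechanism shifts with A (changed) vs. identical across A (unchanged).
x_changed = 2 * pa + 1.5 * (a - 1) + rng.standard_normal(2 * m)
x_same = 2 * pa + rng.standard_normal(2 * m)

p_changed = partial_corr_pvalue(x_changed, a, pa.reshape(-1, 1))
p_same = partial_corr_pvalue(x_same, a, pa.reshape(-1, 1))
```

A small p-value rejects X_j ⊥⊥ A | PA_j and flags the mechanism as changed.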
On the other hand, if X_j is dependent on A given PA_j, then its causal mechanism P_{X_j | PA_j} changes with A. Note that the functional relationship between the variables X_j and the time (or domain) index A is unknown. It is therefore important to use a non-parametric conditional independence test. In this work, we use the kernel-based conditional independence test (Zhang et al., 2011).

The path-specific effect of X on Y via a path π is the degree to which an interventional change in X would change the marginal distribution of Y if that change were transmitted only via π. If an indicator (or context) variable A represents samples from the distributions P and P̃, then computing the path-specific effect of A on any node via the direct path simply measures the distance between the "old" and the "new" causal mechanism of the node. Arguably, we then capture the causal influence of external factors (abstracted by A) on the node (Janzing et al., 2013). To discuss the relation to the strength of causal arrows defined in Janzing et al. (2013), we need to label the two distributions by an additional variable V, attaining values v and ṽ, with edges to all X_j whose mechanisms change. The strengths of these arrows would then quantify how much the corresponding mechanisms changed.

Most feature-based interpretability techniques assume that the features independently co-cause the target (Lundberg and Lee, 2017; Janzing et al., 2020). In particular, they consider how the marginal distribution of the target changes w.r.t. interventional changes in the features. They cannot attribute causal mechanism changes, as one must assume that the causal mechanism of the target does not change to make them work for multiple datasets. Kulinski et al. (2020) perform a statistical test for the distance between the conditional distribution of each feature given the other features from two samples, to assign the blame for a distribution shift to a subset of features. As direct causes are not considered in the conditioning, they do not answer the "why?" question.
In causal discovery from multiple contexts (Mooij et al., 2020), one finds the union of the causal graphs in each dataset (or context) by jointly modelling the context variables (A_1, ..., A_k) and the observed variables (X_1, ..., X_n). While this approach also works on samples from different distributions and hence might appear related, these are two different problems. Mechanism changes can also be represented with stochastic interventions (Pearl, 2009; Correa and Bareinboim, 2020); we briefly discussed this connection in the last paragraph of Section 2. Computing the effect of a stochastic intervention, however, does not tell us how much the corresponding mechanism change contributed to a target quantity summarising the distribution change. Overall, existing methods are not suitable for distribution change attribution, where our goal is to attribute a target quantity summarising the distribution change (e.g. change in mean, KL divergence) to the change in the causal mechanism (the conditional distribution of a node given its direct causes) of each variable.

In our experiments, we study the performance of our approach for attributing the change in the marginal distribution, and its application to real-world data. In particular, through simulations, we study the performance of our attribution method when we learn causal mechanisms from samples, w.r.t. sample size and the magnitude of the causal mechanism change. On real-world data, we assess whether the results are sensible.

We consider a setting where n − 1 independent input variables X_1, ..., X_{n−1} co-cause a target variable X_n. Their underlying causal graph is shown in Figure 2.

[Figure 2: The causal graph used in simulations, where independent inputs X_1, ..., X_{n−1} co-cause the target X_n.]

We choose this simple, yet representative, causal graph for simulation because computing the Shapley values analytically for summaries of distributional changes is non-trivial for more complex graphs.
Let N_w ∼ N(μ_w, 1) denote independent Gaussian noise with mean μ_w and unit variance for each w ∈ {1, ..., n}. Suppose that the causal graph is induced by the following structural assignments: X_w := N_w for w ∈ {1, ..., n − 1}, and X_n := X_1 + ... + X_{n−1} + N_n. We refer to the structural causal model (SCM) above as C. Suppose that the joint distribution P_{X_1,...,X_n} changes to P̃_{X_1,...,X_n} due to changes in the structural assignments. In the new SCM C̃, we have the following assignments: X_w := N_w + λ_w for w ∈ {1, ..., n − 1}, and X_n := X_1 + ... + X_{n−1} + N_n + λ_n, where λ_w is a scalar that shifts the mean of the corresponding variable X_w for each w ∈ {1, ..., n}, and is further defined as

λ_w := S_w · λ,

where S_w is a Bernoulli random variable with probability of success p = 0.5. That is, S_w determines whether the causal mechanism of the corresponding variable X_w is potentially subject to change. If S_w = 1, then the value of λ subsequently dictates the magnitude of the change in the causal mechanism of X_w. Note that even if S_w = 1, the causal mechanism of the corresponding variable X_w does not change if λ = 0. With rejection sampling, we ensure that at least one causal mechanism changes, i.e. λ_w ≠ 0 for at least one w ∈ {1, ..., n}. This way, we can change the causal mechanisms of a random subset of variables through S_w, and study the performance of our attribution method w.r.t. λ. Due to the changes in the causal mechanisms of the variables, the marginal distribution of the target X_n also changes:

C:  X_n ∼ N(μ_1 + ... + μ_n, n)
C̃:  X_n ∼ N(μ_1 + ... + μ_n + λ_1 + ... + λ_n, n).

Let P_{X_n} and P̃_{X_n} denote the marginal distributions of X_n in the SCMs C and C̃ respectively. We measure the change in the marginal distribution of X_n by the difference in its mean, i.e.

∆E := E_{X_n ∼ P̃_{X_n}}[X_n] − E_{X_n ∼ P_{X_n}}[X_n] = λ_1 + ... + λ_n.
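The ground-truth attribution for this setting can be computed in closed form: the mean of X_n under a change set T is μ_1 + ... + μ_n plus the sum of λ_w over w ∈ T, so the payoff of a change set is additive in the shifts, and each node's Shapley value equals its own λ_w (a sketch with made-up shift values):

```python
from itertools import combinations
from math import factorial

# lam_w for each node (target included); node 0 is unchanged here.
lam = [0.0, 3.0, -1.0, 2.0]
n = len(lam)

def payoff(T):
    # Change in the target mean caused by replacing the mechanisms in T:
    # for this SCM, simply the sum of the shifts of the replaced nodes.
    return sum(lam[w] for w in T)

phi = [0.0] * n
for j in range(n):
    others = [i for i in range(n) if i != j]
    for size in range(n):
        for T in combinations(others, size):
            T = frozenset(T)
            weight = factorial(len(T)) * factorial(n - len(T) - 1) / factorial(n)
            phi[j] += weight * (payoff(T | {j}) - payoff(T))
```

Since the game is additive, phi[w] equals lam[w] exactly, and the values sum to the total mean shift ∆E.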
With some algebraic manipulation, we can show that the contribution of each variable X_w (due to the change in its causal mechanism) to ∆E is given by

φ_w(E) = λ_w.

These closed-form expressions provide the ground truth for our evaluation. As it should, the following holds:

∑_{w=1}^n φ_w(E) = λ_1 + ... + λ_n = ∆E.

First, we generate two sets of samples of the same size from the two SCMs C and C̃ stated above. From each sample, we learn (estimate) the SCM, assuming that the causal graph is known. As the SCM has an additive unobserved noise term, it is possible to estimate both the function and the noise from data with regression. We generate a sample from the joint distribution induced by the learned SCM. Then we estimate the mean of the marginal distribution by the sample average. Finally, we compute the Shapley value of each variable X_w. To measure the quality of the estimated Shapley values against the ground truth, we use the ℓ1 norm, otherwise known as the Manhattan distance.

First, we study the performance of our attribution method against the magnitude parameter λ. To this end, for a given value of λ, we generate 100 pairs of SCMs (C, C̃) with μ_w chosen according to the uniform distribution U(−5, 5) for each w ∈ {1, ..., n}, where n is chosen uniformly at random from {2, 3, 4, 5}. From each SCM in the pair (C, C̃), we generate 100 samples, each containing 1000 observations. Note that we learn the SCM from each sample, and then estimate the Shapley values from the sample drawn from the learned SCM, which we repeat 100 times. Therefore, we report the average ℓ1 distance (with standard error) over 100 × 100 × 100 = 1,000,000 pairs of samples in Figure 3 at various values of λ. With the right regression model (linear regression), the estimated Shapley values are very close to the ground truth regardless of the magnitude of the causal mechanism change λ, indicated by a close-to-zero ℓ1 distance.
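The learn-and-resample step of this pipeline can be sketched as follows (an assumed linear-Gaussian ground truth with made-up parameters; the difference of the resampled means estimates ∆E):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n_in = 1000, 3
mu = np.array([1.0, -0.5, 2.0])        # means of the inputs' noises
lam = np.array([0.0, 3.0, 0.0, 1.0])   # shifts; the last entry is the target's

def draw(shift, m):
    # X_w := N_w (+ shift_w) for inputs; X_n := sum of inputs + N_n (+ shift_n).
    X = rng.standard_normal((m, n_in)) + mu + shift[:n_in]
    y = X.sum(axis=1) + rng.standard_normal(m) + shift[n_in]
    return X, y

X_old, y_old = draw(np.zeros(4), m)
X_new, y_new = draw(lam, m)

def fit_and_resample(X, y, m_out):
    # Learn the target's mechanism by regression (additive-noise model),
    # keep the residuals as an empirical noise distribution, then resample.
    A = np.column_stack([np.ones(len(y)), X])
    coef = np.linalg.lstsq(A, y, rcond=None)[0]
    resid = y - A @ coef
    idx = rng.integers(0, len(y), size=m_out)
    y_s = np.column_stack([np.ones(m_out), X[idx]]) @ coef + rng.choice(resid, size=m_out)
    return coef, y_s

coef_old, y_res_old = fit_and_resample(X_old, y_old, 20_000)
coef_new, y_res_new = fit_and_resample(X_new, y_new, 20_000)
delta_e_hat = y_res_new.mean() - y_res_old.mean()
```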
With gradient boosted trees (from the xgboost Python package with default hyperparameters and 100 trees), however, the estimated Shapley values, on average, differ from the ground truth as λ increases. This is expected, as inferring the right model is harder if the function class is less restricted a priori. Therefore, the Shapley values estimated using a non-linear function deviate more from the ground truth than those from a linear function. We also observe that the uncertainty in the Shapley value estimation increases when model estimation is inaccurate. This is indicated by the wide error bars for gradient boosted trees, compared to the narrow error bars, invisible in the figure, for linear regression. In Figure 4, we show the result when we vary the sample size but randomly choose the magnitude parameter λ according to U(1, 5) for each SCM pair. We observe that the Shapley values from the linear regression model remain close to the ground truth. Next we present a case study on the Adult Census Income dataset 3 where we use our proposal to identify the drivers of the difference in the income distribution between men and women. The dataset contains 32,561 records from the 1994 United States census on annual income. In addition to whether the annual income of an individual is greater than fifty thousand USD, it contains 14 other socio-economic attributes. For the analysis, we consider a subset of non-redundant attributes, namely education and occupation, which directly affect income and also act as a proxy of income for other attributes. After removing the rows with missing values, we end up with 30,718 rows. As the number of variables is small, we combine causal discovery with domain knowledge to construct the causal graph. First, from the combined records of men and women, we discover the skeleton graph using the PC algorithm (Spirtes et al., 2000) with a kernel-based conditional independence test (Zhang et al., 2011) at a significance level of 0.2 (see Figure 5, left).
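The estimation step used throughout the simulations (regress each node on its parents and recover the additive noise as residuals) can be sketched with an ordinary least-squares fit; for the non-linear case the paper swaps in gradient boosted trees. The function name is ours:

```python
import numpy as np

def fit_additive_mechanism(X_parents, x_child):
    """Estimate a linear mechanism x = f(parents) + noise by least squares
    and recover the noise as residuals.

    A sketch of the SCM-estimation step for additive-noise models; the
    same recipe works with any regressor in place of the linear fit.
    """
    # Design matrix with an intercept column appended.
    A = np.column_stack([X_parents, np.ones(len(x_child))])
    coef, *_ = np.linalg.lstsq(A, x_child, rcond=None)
    residuals = x_child - A @ coef   # estimated noise sample
    return coef, residuals
```

With the true linear mechanism X_n := X_1 + ... + X_{n−1} + N_n, the fitted coefficients approach one, the intercept approaches zero, and the residuals recover the unit-variance noise.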
At a higher significance level, the PC algorithm gives us a denser skeleton; this way, we minimise the chances of omitting dependencies. (The dataset is available at http://archive.ics.uci.edu/ml/datasets/Adult.) We then orient the edges in the skeleton using domain knowledge. Note that changing the occupation will "mainly" lead to a change in the annual income, not the other way around. Changing the education not only affects the occupation but also the income (Heckman et al., 2018), e.g. passive income through smart investments, side incomes, etc. Therefore, education confounds both occupation and income. We then have the causal graph shown in Figure 5 (right). Since our goal is to identify the drivers of the difference in the income distribution between men and women, as a sanity check, we perform a two-sample test to determine whether the income distribution is, indeed, different between men and women in the dataset. Under the null hypothesis that the incomes of men and women come from the same distribution, the Kolmogorov–Smirnov two-sample test yields a p-value of 1.38 × 10⁻²³⁴. Since the p-value is extremely low, we can safely reject the null hypothesis. All three variables are categorical. In particular, the target variable (income) is binary (>50K?). We use the empirical distribution as the causal mechanism of the root node (education). For each non-root node (occupation and income), we learn its causal mechanism using an XGBoost classifier with 100 gradient boosted trees. To detect causal mechanism changes, we use a kernel-based conditional independence test (Gretton et al., 2012) with a delta kernel at a significance level of 0.05. We would like to attribute the difference in the mean annual income between men and women. The result of our method is shown in Figure 6. We find that the change in the causal mechanism of occupation, P_{occupation|education}, is the main driver of the difference in the mean annual income between men and women.
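The sanity check above can be reproduced with a standard Kolmogorov–Smirnov two-sample test from scipy; the sketch below uses synthetic income data (the lognormal parameters are illustrative stand-ins, not values from the census records):

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins for the two income samples (illustrative only).
rng = np.random.default_rng(0)
income_men = rng.lognormal(mean=10.8, sigma=0.6, size=5000)
income_women = rng.lognormal(mean=10.5, sigma=0.6, size=5000)

# Null hypothesis: both samples come from the same distribution.
stat, p_value = ks_2samp(income_men, income_women)
reject_null = p_value < 0.05   # here: clearly rejected
```

A tiny p-value, as in the census data, licenses rejecting the null that the two income distributions coincide.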
Figure 6: Shapley value contribution of each variable (due to the potential difference in its causal mechanism) to the difference in the mean annual income between men and women (µ_men − µ_women). Each horizontal bar represents the mean Shapley value contribution, and each line represents the bias-corrected and accelerated bootstrap confidence interval at the 95% confidence level, over 100 resamples. The horizontal bars of education and income are almost invisible as their mean Shapley value contributions are close to zero.

This is also often cited in public discourse as a reason why we need more women's participation in the labour market to empower them economically (Giuliano, 2017). Educational choices of men and women differ: science-related subjects are attended more by men than by women (Wang and Degol, 2017; Tellhed et al., 2016). In the UC Berkeley gender bias study from 1975, for instance, it was found that, compared to men, women tended to apply to departments that were more crowded, less well funded, and that frequently offered poorer professional employment prospects (Bickel et al., 1975). Graduates of science-related subjects are also known to earn more than others (Deming and Noray, 2018). Moreover, women often take on non-professional responsibilities, such as parenting and caring for family relatives, that affect their careers and earnings (Jolly et al., 2014; Budig, 2006). Therefore, the income distributions of men and women differ even when they have the same level of education. As a representative case, we show the empirical conditional probability distribution of occupation given "Bachelor"-educated men versus women in Figure 7, which also corroborates our attribution result. The finding that the subject of education plays a significant role in occupation and income deserves further study with more detailed datasets, which goes beyond the scope of the present paper.
A comprehensive review of the literature on this topic, along with data and visualisations, is available at https://ourworldindata.org/female-labor-supply.

Figure 7: Conditional probability distribution of occupation given "Bachelor"-educated men versus women.

We presented a formal approach to identify the drivers of distribution change using graphical causal models. The key idea is that, given a causal graph, we can factorise the joint distribution into independent causal conditionals. Any change in the joint distribution, or in the marginal distribution of any target variable thereof, can then be attributed to changes in some of the causal conditionals. We illustrated our method on both simulated and real-world datasets. In Section 5, we showed how to detect causal mechanism changes from data, given the underlying causal graph, using conditional independence tests. In many tasks, e.g. exploratory data analysis or designing and evaluating robust models, knowing which conditionals change is already sufficient. One might then ask, "Why do we need to quantify the contribution from each mechanism?" In a system with a large number of variables embedded in a changing environment, causal mechanisms change all the time. A supply chain is one such example, where continuous deployment of new changes in constituent subsystems (e.g. forecasting, buying) is common. It is too costly (in terms of time and personnel) to look at every variable whose mechanism changes. Quantifying how much each variable contributed to the change allows us to focus on the few most relevant variables. Finally, our attribution approach requires a causal graph, which may not be identifiable from observational data. If the causal graph is not identifiable, the Shapley values will not be identifiable either. The robustness of our attribution approach to causal graph misspecification therefore deserves further research.
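The detection step recapped above, testing for each node whether its conditional distribution given its parents differs between the two samples, can be illustrated by pooling both samples, adding a binary dataset indicator E, and testing dependence between E and the node within parent strata. The following is a numpy-only permutation-test sketch of this idea; it is not the kernel-based test used in the paper, and all names are ours:

```python
import numpy as np

def _strat_stat(x, e, bins):
    # Dependence of the indicator E on X, aggregated over parent strata.
    stat = 0.0
    for b in np.unique(bins):
        m = bins == b
        if m.sum() > 5 and e[m].std() > 0 and x[m].std() > 0:
            stat += abs(np.corrcoef(x[m], e[m])[0, 1]) * m.sum()
    return stat / len(x)

def mechanism_changed(x0, pa0, x1, pa1, n_perm=200, alpha=0.05, seed=0):
    """Check whether P(X | PA) differs between two samples.

    Pools both samples, adds a dataset indicator E, discretises the
    parent, and runs a permutation test of the dependence between E
    and X within parent strata. A crude numpy-only sketch of the idea.
    """
    rng = np.random.default_rng(seed)
    x = np.concatenate([x0, x1])
    pa = np.concatenate([pa0, pa1])
    e = np.concatenate([np.zeros(len(x0)), np.ones(len(x1))])
    edges = np.quantile(pa, np.linspace(0.2, 0.8, 4))
    bins = np.digitize(pa, edges)          # parent strata
    obs = _strat_stat(x, e, bins)
    hits = 0
    for _ in range(n_perm):
        ep = e.copy()
        for b in np.unique(bins):          # permute E within each stratum
            m = bins == b
            ep[m] = rng.permutation(ep[m])
        if _strat_stat(x, ep, bins) >= obs:
            hits += 1
    p = (1 + hits) / (1 + n_perm)
    return p <= alpha, p
```

Permuting E within parent strata simulates the null of an unchanged mechanism, so a small p-value indicates that P(X | PA) differs between the two samples.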
References

Sex bias in graduate admissions: Data from Berkeley
Gender, self-employment, and earnings: The interlocking structures of family and professional status
Statistics for High-Dimensional Data: Methods, Theory and Applications
Handbook of Methods of Applied Statistics
Extremal principle solutions of games in characteristic function form: Core, Chebychev and Shapley value generalizations
A calculus for stochastic interventions: Causal effect identification and surrogate experiments
Elements of Information Theory
STEM careers and the changing skill requirements of work
Causal mapping of emotion networks in the human brain: Framework and initial findings
Interventions and causal inference
A linear approximation method for the Shapley value
Gender: An historical perspective (NBER Working Papers 23635)
A kernel two-sample test
Returns to education: The causal effects of education on earnings, health, and smoking
Causal discovery from heterogeneous/nonstationary data
Quantifying causal influences (The Annals of Statistics)
Feature relevance quantification in explainable AI: A causal problem
Gender differences in time spent on parenting and domestic responsibilities by high-achieving young physician-researchers
Detecting change in data streams
Feature shift detection: Localizing which features have shifted via conditional distribution tests
A unified approach to interpreting model predictions
Selecting causal brain features with a single conditional independence test per feature
Joint causal inference from multiple contexts
Causality: Models, Reasoning and Inference
Elements of Causal Inference: Foundations and Learning Algorithms
Optimal detection of a change in distribution
K-sample Anderson–Darling tests
A value for n-person games
Statistical Methods (Iowa State University Press)
Causation, Prediction, and Search
Will I fit in and do well? The importance of social belongingness and self-efficacy for explaining gender differences in interest in STEM and HEED majors
ℓ0-penalized maximum likelihood for sparse directed acyclic graphs
High-Dimensional Statistics: A Non-Asymptotic Viewpoint (Cambridge Series in Statistical and Probabilistic Mathematics)
Gender gap in science, technology, engineering, and mathematics (STEM): Current knowledge, implications for practice, policy, and future directions
Divergence estimation for multidimensional densities via k-nearest-neighbor distances
Kernel-based conditional independence test and application in causal discovery

The authors thank Dr. David Afshartous for useful comments.