The Stability and Accuracy Tradeoff Under Dataset Shift: A Causal Graphical Analysis
Adarsh Subbaswamy, Bryant Chen, Suchi Saria
2019-05-27

Recent interest in dataset shift has produced many methods for finding invariant distributions for prediction in new, unseen environments. However, these methods consider different types of shifts and have been developed under disparate frameworks, making it difficult to theoretically analyze how solutions differ with respect to stability and accuracy. Taking a causal graphical view, we use a flexible graphical representation to express various types of dataset shifts. We show that all invariant distributions correspond to a causal hierarchy of graphical operators which disable the edges in the graph that are responsible for the shifts. The hierarchy provides a common theoretical underpinning for understanding when and how stability to shifts can be achieved, and in what ways stable distributions can differ. We use it to establish conditions for minimax optimal performance across environments, and derive new algorithms that find optimal stable distributions. Using this new perspective, we empirically demonstrate that there is a tradeoff between minimax and average performance.

Given that dataset shifts happen in nearly every domain where predictive models are deployed, and given how serious the consequences of failures due to these shifts can be, it is critical to be able to ensure the reliability of models under such shifts. That is, we need to understand how a model's behavior will change under a given set of dataset shifts. In the chest X-ray example developed later in the paper, a model developer wants to know: What shifts can lead to model instability? Would shifts in color encoding schemes or upgrades to X-ray equipment lead to instability and deteriorate model accuracy? Has the model learned any dependencies (e.g., between pneumonia and choice of equipment) that will lead to this instability? For models trained using different algorithms, what guarantees do they give about stability to shifts in color schemes? How do the accuracies of these models differ under these dataset shifts? What guarantees can be made about a model's worst-case performance under such shifts? Lacking common footing for framing stability and dataset shift, it is difficult to begin to answer these questions and compare algorithms. A common framing of dataset shift is predominantly reactive [14]: data from a clearly defined "target" environment of interest is used (along with data from a "source" environment) to make inferences or learn models about the target environment. Reactive approaches to addressing dataset shift exist across multiple fields of study. Some examples include methods for domain adaptation in machine learning (for an overview, see [6]), generalizability and transportability in causal inference (e.g., [15-18]), and sample selection bias in statistics and econometrics (e.g., [19-22]). These approaches require data from the target environment, which makes it difficult to proactively determine if a model will perform well in a new, unseen environment. In this case, it is important to instead use proactive learning approaches which learn models that are stable to any anticipated problematic shifts. One common class of proactive methods is declarative in nature.
These methods allow users to specify dependencies (i.e., mechanistic relationships) between variables in the dataset that are likely to experience a dataset shift. For instance, in the previously discussed X-ray example, a user might want to specify that the model should be stable to changes in scanner manufacturer, the choice of X-ray orientation (front-to-back vs back-to-front), or the color encoding scheme, since all of these are problematic shifts that are likely to occur. Example methods include "graph pruning" approaches which find stable conditioning sets [23, 14] , "graph surgery" approaches which hypothetically intervene on variables to enforce stability [24, 25] , and "counterfactual" approaches which compute counterfactual features to remove unstable dependencies [14] . A key feature of declarative methods is that they can give guarantees about the stability of their predictions to changes in the specified shifts. When a user specifies that they desire invariance to the choice of X-ray orientation, then the declarative method will find a stable solution satisfying this specification. However, this does require domain expertise to be able to specify the likely problematic shifts to which stability is desired. A notable exception is the method of [25] , which learns candidate shifts that occurred across datasets and allows users to choose which invariances to enforce. Additionally, while different declarative methods can guarantee stability, it is unknown what tradeoffs exist between methods with respect to accuracy. For example, models trained using graph pruning vs graph surgery will both satisfy stability to, e.g., the choice of X-ray orientation, but we currently cannot answer how their accuracy will compare and differ under shifts in X-ray orientation preferences. While both are stable, is one always more accurate than the other? A second class of proactive methods are imperative in nature. These methods take in datasets collected from multiple, heterogeneous environments and automatically extract invariant predictors from the data without user input [26] [27] [28] [29] . Examples of these methods are those that compute features sets [26] or representations [27, 28] that yield invariant predictors. In the X-ray example, imperative methods would require datasets collected from a large number of health centers which diversely represented the sets of shifts that could be observed (e.g., the datasets differ in terms of scanner manufacturer, X-ray orientation, and encoding schemes). An advantage of such approaches is that they do not require domain expertise in order to determine invariances. These methods often provide theoretical guarantees about minimax optimal performance (i.e., that they have the smallest worst-case error) across the input distributions. Thus, using an imperative approach a model developer can guarantee good worst-case performance at new hospitals that "look like" a mixture of the training hospitals. However, they cannot provide guarantees about stability to a set of specified shifts: we do not know the ways in which the datasets differ (or by how much), so we cannot answer if the model is stable to shifts in scanner manufacturers, X-ray orientations, or encoding schemes. The difficulties of understanding model behavior within and across each thread of work prevent rigorous analysis of the reliability of models under dataset shifts. In reality, we are presented with a prediction problem in which the data has been collected and generated under some DGP. 
Dataset shifts can then lead to changes to arbitrary pieces of the DGP. Thus, there is a need for a framework that enables us to answer the fundamental questions about model behavior under changes to the DGP. This would provide common ground to compare algorithms which address dataset shift, and to generate generalizable insights from methods which address particular instances of shifts. In this paper, we provide a unifying framework for specifying dataset shifts that can occur, analyzing model stability to these shifts, and determining conditions for achieving the lowest worst-case error (i.e., minimax optimal performance) across environments produced by these shifts. This provides common ground so that we can begin to answer fundamental questions such as: To what dataset shifts are the model's predictions stable vs unstable? Has the model learned a predictive relationship that is stable to a set of prespecified shifts of interest? How will the model's performance be affected by these shifts? For models trained using different methods, what guarantees do they provide about stability and accuracy? The framework centers around two key requirements: First, a causal graphical representation of the environment which includes specifications of what can shift. These specifications take the form of marked unstable edges in the graph which represent mechanistic dependencies between variables that can shift across environments. We consider arbitrary shifts to mechanisms in the graph, as opposed to, e.g., constraining shifts to be within bounded norm-balls of the training data. This specification entails commonly studied instances of dataset shift (such as label shifts, covariate shifts, and conditional shifts), but also handles the more general, unnamed shifts that we expect to see in practice. Second, we restrict our analysis to methods whose target distribution can be expressed graphically. This entails algorithms which do not learn intermediate feature representations, but instead operate directly on the observed variables. For models which do not induce a graphical representation, we discuss how our results might be used to probe these models for their stability properties and discuss opportunities for future work to bridge this gap. Our main contribution is the development of a causal hierarchy of stable distributions, in which distributions at higher levels of the hierarchy guarantee lower worst-case error. The levels of the hierarchy provide insight into how stability can be achieved: the levels correspond to three operators on the graphical representation of the environment, which modify the graph to produce stable distributions-the learning targets for stable learning algorithms (Definitions 6, 7, 8) . We further show that the operators have different stability properties: higher level operators more precisely remove unstable edges from the graph when producing stable distributions (Corollary 2). Using this graphical characterization of stability, we provide a simple graphical criterion for determining if a distribution is stable to a set of prespecified shifts: a distribution is stable if it modifies the graph to remove the corresponding unstable edges (Theorem 1). We then address questions about the accuracy of different stable solutions by showing how the hierarchy provides a causal characterization of the minimax optimal predictor: the predictor which achieves the lowest worst-case error across shifted environments (Proposition 7) . 
Surprisingly, we find that frequently studied intervention-invariant solutions generally do not achieve this minimax optimal performance. Finally, we demonstrate through a series of semisynthetic experiments that there is a tradeoff between minimax and average performance. Through these contributions, we provide a common theoretical underpinning for understanding model behavior under dataset shifts: when and how stability to shifts can be achieved, in what ways stable distributions can differ, and how to achieve the lowest worst-case error across shifted environments. While the focus of this work is on statistical and machine learning models, we briefly discuss concepts related to dataset shift that have been studied in other fields. In particular, external validity, or the ability of experimental findings to generalize beyond a single study, has long been an important goal in the social and medical sciences [30] . For example, practitioners (such as clinicians) who want to assess the results of a randomized trial must consider how the results of the trial relate to their target population of interest. This need has led to much discussion and work on assessing the generalizability of randomized trials (see, e.g., [31] [32] [33] ). More recently, methodological work in causal inference has focused on transportability (see [18] for a review). For example, researchers have developed causal graphical methods for determining when and how experimental findings can be transported from one population or setting to a new one. (see, e.g., [15, 34, 17, 35] ). Generalizability is also of importance in economics research [36] , with much methodological work focusing on the problem of sample selection bias resulting from non-random data collection (see, e.g., [19] [20] [21] ). Selection bias leads to systematic differences between the observed data and the general population, and thus is related to problems of external validity, which consider different environments or populations. Returning to the focus of this work, the problem of differing train and test distributions in predictive modeling is known as dataset shift in machine learning [6] . Classical approaches, such as domain adaptation, assume access to unlabeled samples from the target distribution which they use to reweight training data during learning or extract invariant feature representations (e.g., [37] [38] [39] [40] ). More recently, work on statistical transportability has produced sound and complete algorithms for determining how data from multiple domains can be synthesized to compute a predictive conditional distribution in the target environment of interest [15, 41] . These methods reactively adjust to target data. In many practical scenarios, however, it is not possible to get samples from all possible target distributions. Instead, this requires proactive methods that make assumptions about the set of possible target environments in order to learn a model from source data that can be applied elsewhere [14] . Work on proactive methods has primarily focused on either bounded or unbounded shifts. Bounded shifts have been studied through the lens of distributional robustness, often assuming shifts within a finite radius divergence ball (e.g., [42, 43] ). [44] consider robustness to bounded magnitude interventions in latent style features. [45] consider bounded and unbounded mean-shifts in mechanisms (in which the means of certain variables vary by environment). [46] build on this to allow for stochastic mean-shifts. 
In this paper, we focus on unbounded shifts: in safety-critical domains the cost of failure is high and it can be difficult to accurately specify bounds. Many proactive methods use datasets from multiple source environments to train invariant models to predict in new, unseen environments. [47] propose a kernel method to find a data transformation minimizing the difference between feature distributions across environments while preserving the predictive relationship. Invariant risk minimization (IRM) learns a representation such that the optimal classifier is invariant across the input environments [27, 48] . Similar works learn invariant predictors by seeking derivative invariance across environments [28, 29] . These methods learn representations, so the predictors do not induce a graphical representation. Despite this, we discuss how our graphical results can be applied to probe these methods for their stability properties. IRM establishes its ability to generalize to new environments using the framework of invariant causal prediction (ICP) [49] . Under ICP, the minimax optimal performance of a causally invariant predictor is tied to assuming shifts occur in all variables except the target prediction variable. In reality, however, specific shifts occur and we want to determine which ones to protect against (and how). In this work, we take this approach by starting with the data generating process, and derive stable solutions to prespecified sets of shifts. Other proactive methods make explicit use of connections to causality. Related to ICP, [26] use multiple source environment datasets to find a feature subset that yields an invariant conditional distribution. This has also been extended to the reactive case in which unlabeled target data is available (see also [23] ). For discrete variable settings in which data from only one source environment are available and there is no unobserved confounding, covariate balancing techniques have been used to determine the causal features that yield a stable conditional distribution [50, 51] . Other causal methods assume explicit knowledge of the graph representing the DGP instead of requiring multiple datasets. Explicitly assuming no unobserved confounders, [12] protect against shifts in continuous-time longitudinal settings by predicting counterfactual outcomes. [14] find a stable feature set that can include counterfactual variables, assuming linear mechanisms. [52] regularize towards counterfactual invariance so that a model learns to predict only using causal associations. Other works consider using counterfactuals generated by human annotation [53] , data augmentation [54, 55] , or active learning [56] to improve model robustness. [24] use selection diagrams [15] to identify mechanisms that can shift and find a stable interventional distribution to use for prediction. More recently, end-to-end approaches have been developed which relax the need for the graph to be known beforehand, instead learning it from data [57, 25] . In this paper we provide a common ground for understanding the different types of stable solutions and for finding stable predictors with the best worst-case performance. In this section we present a causal hierarchy of stable distributions that are invariant to different types of dataset shifts. First, we will introduce a general graphical representation for specifying dataset shifts that can occur (Section 3.2). 
We use this representation to give a simple graphical criterion for determining if a distribution is stable to a set of prespecified shifts. Then, we present the hierarchy and show that the levels of the hierarchy correspond to three operators on the graphical representation which modify the graph to produce stable distributions (Section 3.3). This allows different stable distributions to be compared in terms of how they modify the graphical representation. We further show that the hierarchy is nested, and thus that it has implications for the existence of stable distributions (Section 3.3.2). We begin by introducing necessary background on causal graphs (Section 3.1). Proofs of results are in Appendix B.

Figure 1: (a) Posited DAG for the pneumonia example of [13]. Y represents the target condition, pneumonia. X represents the X-ray style features (e.g., orientation and color encoding scheme). Z represents the X-ray itself. W represents the hospital department the patient visited. The orange edge represents the unstable style feature mechanism, while the dashed node represents an unobserved variable. (b) The corresponding ADMG for the DAG in (a). The unobserved confounder W has been replaced by a bidirected edge.

Throughout the paper, sets of variables are denoted by bold capital letters while their particular assignments are denoted by bold lowercase letters. We will consider graphs with directed (→) or bidirected (↔) edges. Acyclic will be taken to mean that there exists no purely directed cycle. The sets of parents, children, ancestors, and descendants in a graph G will be denoted by pa_G(·), ch_G(·), an_G(·), and de_G(·), respectively (subscript G omitted when obvious from context). For an edge e, He(e) and Ta(e) will refer to the head and tail of the edge, respectively. We represent the data generating process (DGP) underlying a prediction problem using acyclic directed mixed graphs (ADMGs), G, which consist of a set of vertices O corresponding to observed variables and sets of directed and bidirected edges such that there are no directed cycles. Directed edges indicate direct causal relations, while bidirected edges indicate the presence of an unobserved confounder (common cause) of the two variables. ADMGs are able to represent directed acyclic graph (DAG) models that contain latent variables. The graph G defines a Structural Causal Model (SCM) [58] in which each variable V_i ∈ O is generated as a function of its parents and a variable-specific exogenous noise variable U_i: V_i = f_i(pa_G(V_i), U_i). The prediction problem associated with the graph consists of a target output variable Y and the remaining observed variables as input features. As an example, consider the DAG in Fig 1a. This DAG corresponds to a simple version of the pneumonia example in [13]. The goal is to diagnose pneumonia Y from chest X-rays Z and stylistic features (i.e., orientation and coloring) of the image X. The latent variable W represents the hospital department the patient visited. The corresponding ADMG is shown in Fig 1b. The unobserved confounder, W, has been replaced by a bidirected edge. In this section we introduce types of dataset shifts that have been previously studied. Then, we graphically characterize instability in terms of edges in the graph of the data generating process. This will be key to the development of the hierarchy in Section 3.3. To define the types of dataset shifts, assume that there is a set of environments such that a prediction problem maps to the same graph structure G.
However, each environment is a different instantiation of that graph such that certain mechanisms differ. Thus, the factorization of the data distribution is the same in each environment, but terms in the factorization corresponding to shifts will vary across environments. As an example, consider again the graph in Fig 1a. In the pneumonia example, each department has its own protocols and equipment, so the style preferences P(X|W) vary across departments. In this example, a mechanism shift in the style mechanism P(X|W) leads to differences across environments. Definition 1 (Mechanism shift). A shift in the mechanism generating a variable V corresponds to arbitrary changes in the distribution P(V|pa(V)). Causal mechanism shifts produce many previously studied instances of dataset shift. Consider, for example, label shift, a well-studied mechanism shift in which the distribution of the features X given the label Y (P(X|Y)) is stable, but P(Y) varies across environments. Label shift corresponds to a causal graph Y → X in which the features are caused by the label, and the mechanism that generates Y varies across environments, resulting in changes in the prevalence P(Y) (see, e.g., [59, 38]). More generally, mechanism shifts are the most common and general type of shift considered in prior work on proactive approaches for addressing dataset shift [49, 26, 23, 50, 24]. However, special cases of mechanism shifts have also been studied. For example, [60, 45, 46] considered parametric mean-shifted mechanisms, in which the means of variables in linear SCMs can vary by environment. Definition 2 (Mean-shifted mechanisms). A mean-shift in the mechanism generating a variable V corresponds to an environment-specific change in the intercept of its linear structural equation (i.e., V = β^T pa(V) + δ_e + U_V, where the intercept δ_e varies with the environment e). Nonlinear generalizations are possible. Another special case, considered by [14], is edge-strength shifts, in which the relationship encoded by a subset of edges into a variable may vary. Variation along an individual edge corresponds to the natural direct effect [58, Chapter 4], defined in terms of nested counterfactuals of the form V(x, Y(x′)): the counterfactual value of V had X been x and had Y been counterfactually generated under X = x′. Thus, an edge-strength shift is a mechanism shift which changes the natural direct effect associated with the edge. Key Result: All of these shifts can be expressed in terms of edges. First, edge-strength shifts directly correspond to particular edges. Next, since the mechanism generating a variable V is encoded graphically by all of the edges into V, shifts in mechanism can be represented by marking all edges into V as unstable. For shifts in the mechanism of an exogenous variable V with no parents in the graph, one might imagine adding an explicit mechanism variable M_V to the graph and considering the edge M_V → V to be unstable. Finally, mean-shifts correspond to an edge A → V where the mean of V is shifted in each environment A (also referred to as an "anchor"; see [45] for a discussion of anchor variables). We denote the set of unstable edges that can vary across environments by E_u ⊆ E, where E is the set of edges in G. Graphically, unstable edges will be colored. Definition 4 (Unstable edge). An edge is said to be unstable if it is the target of an edge-strength shift or a mechanism shift. The concept of unstable edges provides a flexible and extensible way to graphically represent dataset shifts. We note that defining shifts in terms of unstable edges makes it possible to tackle new problems determined by shifts in sets or paths of unstable edges.
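To make Definition 1 concrete, the following minimal sketch (ours, not from the paper) simulates the label-shift DGP Y → X in two environments that differ only in the mechanism generating Y. The parameter values are illustrative; empirically, the stable conditional P(X|Y) agrees across environments while P(Y|X) does not.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_env(p_y, n=200_000):
    """Label-shift DGP Y -> X: only the mechanism generating Y differs by environment."""
    y = rng.binomial(1, p_y, size=n)
    # P(X = 1 | Y) is the same function in every environment (the stable mechanism).
    x = rng.binomial(1, np.where(y == 1, 0.8, 0.3))
    return x, y

for p_y in (0.1, 0.6):                     # two environments differing only in P(Y)
    x, y = sample_env(p_y)
    print(f"P(Y=1)={p_y:.1f}  "
          f"P(X=1|Y=1)={x[y == 1].mean():.3f}  "   # stable: ~0.8 in both environments
          f"P(Y=1|X=1)={y[x == 1].mean():.3f}")    # unstable: changes with P(Y)
```

In the edge-based representation, this shift is marked on the (implicit) mechanism edge into the exogenous variable Y, while the Y → X edge itself remains stable.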
As one example of such new problems, DAGs can be used to represent non-i.i.d. network data in which certain edges represent interference between units (e.g., friendship ties in social networks) [61, 62]. Thus, one can define dataset shifts pertaining to networks (e.g., deleting, adding, or changing the strength of friendships). Similarly, dataset shifts due to changing path-specific effects [63] are another interesting avenue for future exploration (e.g., reductions in side effects of a drug while maintaining its efficacy). We can now define stable distributions, which are the target sought by methods addressing instability due to shifts. We will refer to a model of a stable distribution as a stable predictor. Definition 5 (Stable Distribution). Consider a graph G with unstable edges E_u defining a set of environments E. A distribution P(Y|Z) is said to be stable if, for any two environments E1, E2 ∈ E, P_E1(Y|Z) = P_E2(Y|Z) holds. Note that P(Y|Z) is not restricted to being an observational distribution. Having established a common graphical representation for arbitrary shifts of various types, we provide a graphical definition of stable distributions. First, define an active unstable path to be an active path (as determined by the rules of d-separation [64]) that contains at least one unstable edge. Key Result: The non-existence of active unstable paths is a graphical criterion for determining a distribution's stability: P(Y|Z) is stable if there is no active unstable path from Z to Y in G and the mechanism generating Y is stable (Theorem 1).

Figure 2: (a) Example graph for a data generating process in which the orange Y → X edge is unstable. (b) The level 1 operator applied to the graph in (a). The stable level 1 distribution over observed variables is P(Y|V), which ignores all information from X and Z. (c) Level 2 operator applied to (a), deleting the edges into X. The level 2 stable distribution is P(Y|V, Z, do(X)). The level 2 operator deletes the stable Y ↔ X edge. (d) Level 3 operator applied to (a). The stable level 3 distribution is P(Y|V, Z(X = x), X(Y = 0)), where Z(X = x) is the value of Z had X been set to its observed value x, and X(Y = 0) is the counterfactual value of X had Y been set to 0. The counterfactual operator retains the stable Y ↔ X edge.

Intuitively, Theorem 1 means that a stable distribution cannot capture a statistical association that relies on the information encoded by an unstable edge. In the pneumonia example of Fig 1a, the W → X edge, which denotes the X-ray style mechanism, was determined to be unstable. Because W is unobserved, a model of P(Y|X, Z) will learn an association between Y and X through W. Thus, P(Y|X, Z) contains an active unstable path, and this distribution is unstable to shifts in the style mechanism. This means that P(Y|X, Z) is different in each environment. By contrast, if W were observed and we could condition on it, then P(Y|X, Z, W) would be stable to shifts in the style mechanism because all paths containing the unstable edge are blocked by W. Thus, P(Y|X, Z, W) is invariant across environments. In the next section we use this edge-based graphical characterization to show that all stable distributions, including those found by existing methods, can be categorized into three levels. Thus, this hierarchy defines the ways in which it is possible to achieve stability to shifts. Many works seek stable distributions in order to make predictions that are stable or invariant to dataset shifts. However, because these methods have been developed in isolation, there has been little discussion of whether these methods find the same stable distributions, or how these distributions differ from one another.
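Before turning to the hierarchy, this criterion can be checked mechanically. The sketch below (ours) follows the selection-diagram construction used in the proof of Theorem 1: add an auxiliary node S pointing at the head of the unstable edge and test whether S is d-separated from Y given the candidate feature set. We assume the Fig 1a structure W → X (unstable), W → Y, X → Z, Y → Z, which matches the description above but may differ in detail from the actual figure; the path condition is checked with a standard moralization-based d-separation test.

```python
def d_separated(parents, xs, ys, zs):
    """Test xs _||_ ys | zs in a DAG via the moralized ancestral graph criterion.

    parents: dict mapping each node to the set of its parents.
    """
    xs, ys, zs = set(xs), set(ys), set(zs)
    # 1. Restrict attention to ancestors of the variables in the query.
    anc, stack = set(), list(xs | ys | zs)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents[v])
    # 2. Moralize: link each node to its parents and marry co-parents.
    adj = {v: set() for v in anc}
    for v in anc:
        for p in parents[v]:
            adj[v].add(p)
            adj[p].add(v)
        for p in parents[v]:
            for q in parents[v]:
                if p != q:
                    adj[p].add(q)
    # 3. Remove the conditioning set and test reachability from xs to ys.
    seen, frontier = set(zs), list(xs - zs)
    while frontier:
        v = frontier.pop()
        if v in ys:
            return False              # an active path exists
        if v not in seen:
            seen.add(v)
            frontier.extend(adj[v] - seen)
    return True

# Hypothesized pneumonia DAG (Fig 1a) plus a selection node S -> X marking the
# unstable W -> X (style) edge, as in the proof of Theorem 1.
parents = {"W": set(), "S": set(), "X": {"W", "S"}, "Y": {"W"}, "Z": {"X", "Y"}}

print(d_separated(parents, {"S"}, {"Y"}, {"X", "Z"}))       # False: P(Y|X,Z) is unstable
print(d_separated(parents, {"S"}, {"Y"}, {"X", "Z", "W"}))  # True:  P(Y|X,Z,W) is stable
```

Theorem 1 additionally requires that the mechanism generating Y itself is stable; the check above only covers paths through the unstable W → X edge.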
As a main contribution of this paper, we show in this section that there exists a hierarchy of stable distributions, in which stable distributions at different levels have distinct graphical properties. Thus, the development of this hierarchy provides a common theoretical underpinning for understanding when and how stability to shifts can be achieved, and in what ways stable distributions can differ. In this section we will define the levels of the hierarchy and show that they correspond to different operators that can remove unstable edges from the graph. Then, in the next section we will further study how differences between levels of the hierarchy affect worst-case performance across environments. Armed with the graphical characterization of stability from the previous section, we now introduce a hierarchy of the 3 categories of stable distributions. The levels of the hierarchy are: 1) observational conditionals, 2) conditional interventionals, and 3) counterfactuals. This hierarchy is related to the hierarchy of causal queries, which defines three levels of causal study questions an investigator can have: association, intervention, and counterfactuals [58] . Also relatedly, in [65] the authors connect the identification of different types of causal effects to a hierarchy of graphical interventions: node, edge, and path interventions. While these works develop hierarchies that relate different types of causal queries and effects, in this paper we develop a hierarchy of shift-stable distributions which connects different types of stable distributions to interventions which remove unstable parts of the data generating process from the underlying graph of the DGP. Each level of the hierarchy of stable distributions corresponds to graphical operators which differ in the precision with which they can remove edges in the graph (Corollary 2, main result of this subsection). Using the graph in Fig 2(a) as a common example, we discuss each level in detail below. Note that in Fig 2(a) , the goal is to predict Y from V, X, Z, and the Y → X edge is unstable. Definition 6 (Stable Level 1 Distribution). Let G be an ADMG with unstable edges E u defining a set of environments E. A stable level 1 distribution is an observational conditional distribution of the form P (Y |Z) such that, for any two environments E 1 , E 2 ∈ E, P E1 (Y |Z) = P E2 (Y |Z) holds. Level 1: Methods at level 1 of the hierarchy seek invariant conditional distributions of the form P (Y |Z) that use a subset of observed features for prediction [26, 23] . These distributions only have conditioning (i.e., the standard rules of d-separation) as a tool for disabling unstable edges. For this reason, the conditioning operator is coarse and removes large pieces of the graph. Consider Figure 2a , in which the maximal stable level 1 distribution is P (Y |V ), since conditioning on either X or Z activates the path through the unstable (orange) edge. The conditioning operator disables all paths from X and Z to Y to produce Fig 2b. While the operator successfully removes the unstable edge, many stable edges were removed as well. Definition 7 (Stable Level 2 Distribution). Let G be an ADMG with unstable edges E u defining a set of environments E. A stable level 2 distribution is a conditional interventional distribution of the form P (Y |do(W), Z) such that, for any two environments E 1 , E 2 ∈ E, P E1 (Y |do(W), Z) = P E2 (Y |do(W), Z) holds. Methods at level 2 [24] find conditional interventional distributions [58] of the form P (Y |do(W), Z). 
In addition to conditioning, level 2 distributions use the do operator, which deletes all edges into an intervened variable [58]. Fig 2c shows the result of do(X) applied to Fig 2a: the edges into X (including the unstable edge) are removed. Thus, P(Y|Z, V, do(X)) is stable and retains statistical information along stable paths from Z and X that the level 1 distribution P(Y|V) did not. However, the stable Y ↔ X edge was also removed by the operator. Intervening interacts with the factorization (according to the graph) of the joint distribution of the observed variables P(O) by deleting the terms corresponding to the mechanisms of the intervened variable: for do(X) in Fig 2a, the term for X is dropped while terms such as P(Z|X, Y) remain. The term P(Z|X, Y) corresponds to the stable information we retain by intervening that we could not capture by conditioning. Definition 8 (Stable Level 3 Distribution). Let G be an ADMG with unstable edges E_u defining a set of environments E, and let V(x) denote the counterfactual value of V had X been set to x for variables V, X ⊆ O. A stable level 3 distribution is a counterfactual distribution of the form P(Y(W, Z(W))|Z(W)) such that, for any two environments E1, E2 ∈ E, P_E1(Y(W, Z(W))|Z(W)) = P_E2(Y(W, Z(W))|Z(W)) holds. Level 3: Finally, level 3 methods [14, 52] seek counterfactual distributions, which allow us to consider conflicting values of a variable, or to replace a mechanism with a new one. For example, let Y and Z denote two children of a variable X. If we hypothetically set X to x′ for X → Y but left X at its observed value x for X → Z, this corresponds to the counterfactual Y(x′) and the factual Z(x) = Z. By setting a variable to a reference value (e.g., 0) for one edge but not others, computing counterfactuals effectively removes (or replaces) a single edge. In Fig 2c, we saw that P(Y|V, Z, do(X)) is stable and deletes both edges into X, including the stable Y ↔ X edge. However, if we compute the counterfactual X(Y = 0), depicted in Fig 2d, then the level 3 distribution P(Y|X(Y = 0), V, Z(X = x)) is stable and only deletes the unstable Y → X edge, retaining information along the Y ↔ X path. More generally, level 3 distributions allow us to counterfactually replace mechanisms (and thus replace the influence along unstable edges) with new ones. We will exploit this fact in Section 4 when we investigate accuracy. The effects of the three operators produce the following result: Corollary 2. Distributions at increasing levels of the hierarchy of stability grant increased precision in disabling individual edges (and thus paths). Key Result: Thus, the difference between the operators associated with the different levels of stable distributions is the precision of their ability to disable edges into a variable. Level 1, conditioning, must remove large amounts of the graph to disable edges. Level 2, intervening, deletes all edges into a variable. Level 3, computing counterfactuals, can precisely disable a single edge into a variable. Since paths encode statistical influence, this also provides a natural definition of a maximally stable distribution as one which deletes the unstable edges, and only the unstable edges. Thus, given a stable distribution found by any method, we can compare it to the maximally stable distribution to see which, and how many, stable paths were removed. Another important fact is that the hierarchy is nested.
This means that a level 1 distribution can be expressed as a level 2 distribution (and a level 2 distribution can be expressed as a level 3 distribution; see Lemma 4 in Appendix B). There are a number of practical consequences of the hierarchy of shift-stable distributions: First, level 1 distributions can always be learned from the available data because conditional distributions are observational quantities. This means that we can simply fit a model of P(Y|Z) from the training data. However, because the conditioning operator throws away large parts of the graph, including many stable paths, models of level 1 distributions will generally have higher error compared to models of level 2 and level 3 distributions. A tradeoff exists, though, since level 2 and level 3 distributions are not always identifiable: they cannot always be estimated as a function of the observational training data. Level 2 distributions model the effects of hypothetical interventions, and, just as in causal inference, unobserved confounding can lead to identifiability issues (for more detail on identification and level 2 stable distributions see [24]). In addition to identifiability challenges, level 3 counterfactual distributions require further assumptions about the functional form of the causal mechanisms in the SCM (see [58, Chapter 7] for more discussion). For example, the method of [14] assumes that causal mechanisms are linear in order to compute counterfactuals. The nested nature of the hierarchy also has consequences for the existence of stable distributions: If there is no stable level 3 distribution, then no stable level 1 or level 2 distributions exist. Considering the other direction, if we find that no stable level 1 distribution exists, there may still be a stable level 2 or 3 distribution. This is an important consideration as more methods for finding stable distributions are developed. For example, [24] developed a sound and complete algorithm for finding stable level 2 distributions in a graph. This means that the algorithm returns a distribution if and only if a stable level 2 distribution exists. An open problem is to develop a sound and complete algorithm for finding stable level 3 distributions. Such a result would be very powerful: If a complete algorithm failed to find a stable level 3 distribution, then that would mean no stable distributions (level 1, 2, or 3) exist. We have shown that the hierarchy of stable distributions defines graphical operators which can be used to construct stable distributions by disabling edges in the underlying graph. Next, we show how the ability of counterfactual level 3 distributions to replace edges can be used to achieve minimax optimal performance under dataset shifts. We now compare stable distributions with respect to their minimax performance under dataset shifts. Specifically, we show that there is a hypothetical environment in which counterfactually training a model would yield minimax optimal performance across environments. We further show that this level 3 counterfactual distribution is not, in general, a level 2 interventional distribution. Counter to the increasing interest in invariant interventional solutions like Invariant Risk Minimization and its related follow-ups (e.g., [27-29]), these results motivate the development of counterfactual (as opposed to level 2) learning algorithms. We now present our result characterizing the stable distribution that achieves minimax optimal performance.
First, recall that dataset shifts result in a set of hypothetical environments E generated from the same graph G such that the mechanisms associated with unstable edges in G differ in each environment. For simplicity, we will assume that the mechanism of a single variable W ∈ X is subject to shifts while the mechanisms of all other variables V = {X, Y} \ {W} remain stable across environments. Each distribution in the set of data distributions U corresponding to each environment factorizes according to G, but differs only in the term P(W|pa_G(W)) which corresponds to the mechanism for generating W. Now consider the following game: Suppose the data modeler (DM) wishes to pick the distribution B ∈ U such that the corresponding Bayes predictor h*_B (i.e., the true P_B(Y|X)) minimizes the worst-case expected loss (i.e., worst-case risk) across all distributions in U. This can be written as B* = arg min_{B ∈ U} sup_{P ∈ U} E_P[L(h*_B(X), Y)] (Equation 1). Following a game-theoretic result [66, Theorem 6.1], this game has a solution for bounded loss functions (e.g., the Brier score but not the log loss; Theorem 5). Key Result: That this game has a solution means that B* is the "optimal training environment" such that counterfactually training a predictor in B* to learn the true P_B*(Y|X) would produce the minimax optimal predictor. Importantly, this optimal environment B* depends on the choice of loss function. There are two consequences of this result: First, P_B*(Y|X) is not, in general, a level 2 distribution (and thus level 2 distributions are not, in general, minimax optimal). Second, there is a level 3 distribution which corresponds to P_B*(Y|X) and thus is minimax optimal. Proposition 6. The level 2 stable distribution P(Y|do(W), X) = P_Q(Y|X), where Q is the member of U such that W has a uniform distribution, i.e., P(W, pa(W)) = cP(pa(W)) for c ∈ R+. In the appendix we provide a counterexample in which Q ≠ B*. This shows that the level 2 stable distribution P(Y|do(W), X) is not minimax optimal. Proposition 7. The level 3 distribution P(Y(W_B*)|X(W_B*)) equals P_B*(Y|X) and is minimax optimal, where W_B* is the counterfactual W generated under the mechanism associated with the environment B*. Thus, given training data from P_0 ∈ U, if we could counterfactually learn P(Y|X) in the environment associated with B*, then the resulting predictor would be minimax optimal. This means the stable level 3 distribution P(Y(W_B*)|X(W_B*)) produces the best worst-case performance across environments out of all distributions that could be used for prediction.

Algorithm 1: Gradient Descent Ascent. Input: number of steps T, step size η, data O. Output: robust model parameters θ̂.

We now consider a simple distributionally robust likelihood reweighting algorithm for learning the minimax optimal level 3 predictor. This approach can serve as a starting point for developing new stable learning algorithms which achieve minimax optimal performance under dataset shift. For simplicity, suppose there are no unobserved confounders (i.e., G has no bidirected edges); we relax this condition in the Appendix. Then, assuming full shared support, let h(W, pa(W); φ) = P_Q(W|pa(W)) / P_P0(W|pa(W)), with E_P0[h|pa(W)] = 1, be a reweighting function parameterized by φ. The learning problem becomes min_θ max_φ E_{O∼P0}[h(W, pa(W); φ) L(f_θ(X), Y)], with model parameters θ.
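The following is a minimal, schematic sketch of such a gradient descent ascent procedure in JAX (the library used for the paper's models), in the spirit of Algorithm 1 but not a reproduction of it. It assumes a single binary shifting variable W with one binary parent A, logistic models for both the predictor and the adversarial policy Q(W|A; φ), and the Brier score; all names, initializations, and hyperparameters are illustrative.

```python
import jax
import jax.numpy as jnp

def fit_gda(x, a, w, y, steps=500, eta=0.05, seed=0):
    """Sketch of gradient descent ascent for a reweighted (level 3 style) predictor.

    x: (n, d) features; a: (n,) binary parent of W; w: (n,) binary W; y: (n,) binary label.
    theta parameterizes the predictor f_theta; phi parameterizes the shifted policy Q(W | A).
    """
    x, a, w, y = map(jnp.asarray, (x, a, w, y))
    # Plug-in estimate of the training policy P(W = 1 | A) from empirical frequencies.
    p_hat = jnp.array([w[a == 0].mean(), w[a == 1].mean()])

    def weights(phi):
        q1 = jax.nn.sigmoid(phi[0] + phi[1] * a)        # Q(W = 1 | A; phi)
        p1 = p_hat[a.astype(int)]                       # estimated P(W = 1 | A)
        return jnp.where(w == 1, q1 / p1, (1.0 - q1) / (1.0 - p1))

    def objective(theta, phi):
        # Reweighted Brier risk: an estimate of E_P0[h(W, A; phi) * (f_theta(X) - Y)^2].
        pred = jax.nn.sigmoid(x @ theta[1:] + theta[0])
        return jnp.mean(weights(phi) * (pred - y) ** 2)

    g_theta = jax.grad(objective, argnums=0)
    g_phi = jax.grad(objective, argnums=1)

    theta = jnp.zeros(x.shape[1] + 1)
    phi = 0.1 * jax.random.normal(jax.random.PRNGKey(seed), (2,))
    for _ in range(steps):
        theta = theta - eta * g_theta(theta, phi)       # descent: improve the predictor
        phi = phi + eta * g_phi(theta, phi)             # ascent: make the shift adversarial
    return theta, phi
```

The alternating updates mirror the descent step on θ and the ascent step on φ; as discussed next, the choice of reweighting model h and the optimization dynamics both require care.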
While many possibilities exist, perhaps the simplest version of h is to explicitly learn a parametric density model (e.g., logistic regression for discrete W), P̂, for P_P0(W|pa(W)) and use the same density model class to model P_Q(W|pa(W)) = Q̂(W|pa(W); φ). Algorithm 1 describes a gradient descent ascent (GDA) learning procedure for this case. It is important to note that this general minimax learning problem is often very challenging, with complicated convergence and equilibrium dynamics (see, e.g., [68-70]). Thus, Algorithm 1 only serves as a starting point for designing counterfactual level 3 learning algorithms. We turn to semisynthetic experiments on a real medical prediction task to demonstrate practical performance implications of the hierarchy. To carefully study model behavior under dataset shifts, we posit a graph of the DGP for this dataset and reweight the data to simulate a large number of dataset shifts. We investigate how the performance of models of stable distributions at different levels of the hierarchy behaves as test environments differ from the training environment. Our results show that though level 3 models can produce the best worst-case performance (i.e., they are minimax optimal), level 2 models may perform better on average. This further highlights that model developers need to carefully choose how they achieve stability. One prominent application of machine learning is patient risk stratification in healthcare. It has been widely noted that developing reliable clinical decision support models is difficult due to changes in clinical practice patterns [9]. The resulting behavior-related associations are often brittle (policies change over time and differ across hospitals) and can cause models to make dangerous predictions if left unaccounted for [12]. We investigate practical implications of the hierarchy on this important risk prediction challenge. Below we describe our setup, which loosely follows the setup of [71] for predicting patient risk of sepsis, a life-threatening response to infection. We use electronic health record data collected over four years at our institution's hospital. The dataset consists of 278,947 patient encounters that began in the emergency department. The prevalence of sepsis (S) is 2.3%. Three categories of variables were extracted: vital signs (V) (heart rate, respiratory rate, and temperature), lab tests (L) (lactate), and demographics (D) (age and gender). For encounters that resulted in sepsis, physiologic data available prior to sepsis onset time was used. For non-sepsis encounters, all data available until the time the patient was discharged from the hospital was used. Min, max, and median features were derived for each time-series variable. Unlike vitals, lab measurements are not always ordered (O), so a binary missingness indicator was included. The graph of the DGP is shown in Fig 3. Different lab test ordering policies correspond to shifts in the conditional P(O|s, d). As a result, missingness patterns vary across datasets derived from different hospitals, because the lab test rate can vary from one institution to another [72]. To compare across datasets corresponding to differing lab testing patterns, we simulated one hundred datasets as follows: For a given test split, we fit a (logistic regression) model of the ordering policy P(O|s, d) (i.e., the P model). Then, for a new ordering policy Q(O|s, d), we reweight the test samples by Q/P to mimic data from a new hospital which differs only in the ordering policy.
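A minimal sketch of this reweighted evaluation (ours; predict, p_model, and q_model are hypothetical callables standing in for the fitted prediction model and the two ordering-policy models):

```python
import numpy as np

def shifted_brier(y, x, o, s, d, predict, p_model, q_model):
    """Importance-weighted Brier score under a simulated ordering-policy shift.

    predict(x): a fitted model's estimate of P(sepsis = 1 | features).
    p_model(o, s, d): probability of the observed ordering decision o under the
    training policy P; q_model(o, s, d): the same under a hypothetical shifted
    policy Q. Weighting test samples by Q/P mimics a hospital that differs from
    the training hospital only in its lab-ordering policy.
    """
    weights = q_model(o, s, d) / p_model(o, s, d)   # per-sample weight Q_i / P_i
    return np.average((predict(x) - y) ** 2, weights=weights)
```

How the shifted policies Q were constructed for the edge and mechanism shifts is described next.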
To simulate an edge shift, we created new ordering policies Q by perturbing the coefficient of sepsis in the P model. This corresponds to changing the log odds ratio for sepsis of a patient receiving a lab test. To simulate a mechanism shift, we perturbed all coefficients in the P model. Train/test splits were generated via 5-fold cross-validation. Full experimental details are in the Appendix. Models were fit using the Brier score as the loss, since it is a bounded loss function (as required by Theorem 5). We consider four models: stable models for each level of the hierarchy and an unstable baseline that does not adjust for shifts. In fitting the models, any model structure (e.g., random forests, neural networks, etc.) can be used; here, all models are logistic regressions (see the Appendix). The level 2 model is of P(S|d, v, l, do(o)), and is an implementation of the "graph surgery estimator" [24]. In Fig 3, the do operator deletes both edges into the O node. Finally, the level 3 model was trained using Algorithm 1, with a logistic regression counterfactual reweighting model. In Fig 3, this deletes and then replaces the mechanism for O with a new ordering policy mechanism. All models were implemented in JAX [73]. Full details of how the level 2 and 3 models were fit are in the Appendix. When test environments differ from the training environment, stable models have more robust performance than unconstrained, unstable models. An unconstrained model uses all dependencies present in the training data; in other words, the model captures correlations due to all paths in the underlying graph. As we impose invariance constraints (by disabling edges), stable models show performance improvements over the unstable model as the test distribution deviates further from the training distribution. We see this, for example, in Fig 4a when, due to the edge shift, the correlation flips from being negative to positive: the level 1, 2, and 3 models outperform the unstable model for log odds ratios < 1. As desired, the level 3 model achieves the best worst-case performance amongst the four models, indicating that training using Algorithm 1 was successful. Further, the performance of the level 3 model is nearly constant across the shifts. This is encouraging evidence, because constant risk is a sufficient condition for a Bayes estimator to be minimax optimal [74]. The results are largely consistent in Fig 5, in which we consider a mechanism shift in the lab test ordering policy. We see that irrespective of the KL divergence between the training and shifted distributions, the level 3 model still has almost constant performance. Finally, from Theorem 5 we know that the optimal training distribution depends on the choice of loss function. Thus, we do not expect a minimax optimal predictor under one loss to be optimal when measured under a different loss. Indeed, in Fig 4b, when the four models are evaluated with respect to the log loss, the level 3 model is no longer minimax optimal. In fact, its performance is strictly worse than that of the level 2 model. Even when evaluated using the Brier score (Fig 4a), the worst-case performance of the level 3 model is only slightly better than that of the level 2 model. Further, the level 2 model sees performance improvements as the log odds increase that the level 3 model does not (its loss drops noticeably for x-axis values > 0). Thus, on average, the level 2 model might be preferable on this data, and a conservative objective like worst-case performance may not be desirable.
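For reference, a minimal sketch of the level 2 (IPW) fit described above, following the appendix; scikit-learn's logistic regression (log loss) is used here as a stand-in for the paper's custom JAX logistic regressions trained with the Brier score, and all names are ours:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_level2_ipw(features, o, s, d):
    """Graph-surgery-style fit of P(S | d, v, l, do(o)) via inverse probability weighting.

    features: full feature matrix (demographics, vitals, labs, missingness indicator);
    o: binary lab-ordering indicator (coded 0/1); s: binary sepsis label; d: demographics.
    """
    # 1. Model the mechanism being intervened on: the ordering policy P(O | S, D).
    mech = LogisticRegression().fit(np.column_stack([s, d]), o)
    p_o = mech.predict_proba(np.column_stack([s, d]))[np.arange(len(o)), o.astype(int)]

    # 2. Train the predictor on the full feature set with weights 1 / P(o_i | s_i, d_i),
    #    which removes the ordering policy's term from the training distribution.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, s, sample_weight=1.0 / np.clip(p_o, 1e-3, None))
    return clf
```

The level 3 model differs only in that the fixed 1/P weights are replaced by the learned, adversarial Q/P weights from the gradient descent ascent procedure.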
The discussion in this paper has focused on a graphical perspective-explicitly starting with knowledge of the data generating process and using this to determine when and how stability to shifts is achievable. An alternative emerging paradigm in machine learning has focused on invariant learning [27, 29, 28] . Invariant learning is applicable when multiple datasets from different environments are available, and the goal is to learn a representation that produces an optimal predictor which is invariant across these environments. In this section, we discuss an important limitation of the invariant learning paradigm which highlights a key advantage of graphical approaches. We also discuss how graphical analyses can guide future work to address this. A critical question that determines the usefulness of an invariant predictor is: to what set of shifts is the predictor stable? The answer to this question defines the set of new environments to which an invariant predictor can be safely applied. In the graphical approach, the answer is transparent by design. Shifts are defined as (arbitrary) changes to particular causal mechanisms in the graph, so an invariant predictor is exactly one which is stable to the specified shifts in mechanisms. Further, the graph allows model developers to choose the set of shifts to which a predictor should be stable and provide guarantees about shifts that are protected against. In contrast, invariant learning methods currently struggle to answer this critical question. First, existing invariant learning methods do not identify the differences that exist across the observed environments. Thus, they are unable to provide guarantees about the nature of the shifts in environment (i.e., the causal mechanisms) against which they protect. This also means it is difficult to state the set of new environments to which the invariant predictor can be safely applied. Further, because invariant learning automatically determines invariance from datasets, there is no opportunity for developers to specify particular invariances that they want to hold. Fortunately, we believe there are opportunities to leverage graphical analyses to improve the current state of invariant learning. For example, structure learning [75] can be used to detect particular mechanism shifts that occur across environments [76, 57, 25] . When mechanisms of interest have been identified, one can develop tests to probe an invariant predictor for stability to particular mechanisms shifts. For example, one could examine how the performance or calibration of the predictor changes as evaluation data is reweighted according to the distribution associated with the mechanism shift. This would provide a post-hoc way to verify the integrity of a trained invariant predictor. Finally, there is an opportunity to develop invariant learning objectives which explicitly capture invariances at different levels of the hierarchy of shift-stability. This would allow developers to specify particular invariances they want to be guaranteed to hold. The use of machine learning in production represents a shift from applying models to static datasets to applying them in the real world. As a result, aspects of the underyling DGP are almost certain to change. Many methods have been developed to find distributions that are stable to dataset shift, but as a field we have lacked common underlying theory to characterize and relate different stable distributions. 
To address this, we developed a common framework for expressing the different types of shifts as unstable edges in a graphical representation of the DGP. We further showed that stable distributions belong to a causal hierarchy in which stable distributions at different levels have distinct operators that can remove unstable edges in the graph. This provides a new, but natural, way to characterize and construct stable models by removing only the unstable edges. This also motivates a new paradigm for future work developing methods that can modify individual edges. We also showed that popular invariant solutions (level 2; invariant under intervention) do not, in general, achieve minimax optimal performance across environments. Our experiments showed that there is a tradeoff between worst-case and average performance. Thus, developers need to carefully determine when and how they achieve invariance.

A.1 Data

Our experimental setup follows that of [77]. The dataset contains electronic health record data collected over four years at our institution's hospital. The dataset consists of 278,947 emergency department patient encounters. The prevalence of the target disease, sepsis (S), is 2.1%. Features pertaining to vital signs (V) (heart rate, respiratory rate, temperature), lab tests (L) (lactate), and demographics (D) (age, gender) were extracted. For encounters that resulted in sepsis (i.e., positive encounters), physiologic data available up until sepsis onset time was used. For non-sepsis encounters, all data available until discharge was used. For each of the time-series physiologic features (V and L), min, max, and median summary features were derived. Unlike vitals, lab measurements are not always ordered (O) and are subject to missingness (lactate is 89% missing). To model lab missingness, missingness indicators (O) for the lab features were added, and lab value-missingness interaction terms were used in place of lab value features. Logistic regression models were fit using a custom JAX [73] implementation. L2 regularization with regularization coefficient 0.1 was used (the hyperparameter was chosen via grid search using the performance of the unstable model on a hold-out 10% of the initial dataset). These same hyperparameters were used to train the level 1-3 models. For the predictive models, a b-spline basis feature expansion was used for continuous features (lab values and vital signs). Following the standards in [77] for accounting for missingness, the missingness feature and the missingness-lab value interaction features were added. The specific shift in lab test ordering patterns considered was a shift in lactate ordering, as these patterns have seen great variation across hospitals and are known to be associated with sepsis [72]. Lactate missingness has a correlation of -0.36 with sepsis in this dataset (i.e., the presence of the measurement is predictive of the target variable). Thus, to simulate the edge shift, in each test fold we first fit a logistic regression model (no b-spline basis expansion, with default scikit-learn hyperparameters) to the test fold's lactate missingness (O = 0) given S, D; that is, a logistic regression model of P(O = 0|s, d). Then, to simulate the edge-shifted lactate ordering policies, we replaced the coefficient for sepsis in the logistic regression model with 100 values on a grid from -6 to 8. The resulting logistic regression model is of the hypothetical shifted hospital's ordering policy Q(O = 0|s, d).
Evaluating the loss under each shift was then done using sample weights computed as Q_i/P_i for each test sample using the two models. The mechanism shift was simulated in a similar manner to the edge shift. However, instead of only perturbing the coefficient of sepsis in the P(O|s, d) model, all coefficients and the intercept were perturbed. Specifically, for a single test fold, 1000 new coefficient settings were sampled as follows: Let w denote a weight in the P model. The new coefficients/intercepts were drawn from Unif(-|w| - 0.1, |w| + 0.1). Because all weights of the logistic regression model changed, we plotted the shifts according to the KL divergence (estimated using the test set) between the Q and P logistic regression models. The level 1 model was fit using a reduced feature set that excluded the lactate features (min, max, median) and the lactate missingness indicator. The level 2 model P(S|d, v, l, do(o)), an instance of the "graph surgery estimator" [24], was fit by inverse probability weighting (IPW). The term corresponding to O in the factorization of the DAG in Fig 3 is P(O|s, d), so we fit a logistic regression model of this distribution using the training data. Then, the main logistic regression prediction model with the full feature set was trained using sample weights 1/P(o_i|s_i, d_i). The resulting model was the level 2 model. The level 3 model is similar to the level 2 model, but instead corresponds to a counterfactual ordering policy Q(O|s, d). The level 3 logistic regression model was trained using the gradient descent ascent procedure in Algorithm 1. As noted in the main paper, this procedure has complicated dynamics, and we found it was quite sensitive to the choice of step size (or learning rate) η. Through grid search using performance on a hold-out 10% of the initial dataset, the value η = 5 was selected. The model parameters θ were initialized using the learned level 2 model parameters, and the reweighting parameters φ were initialized via random draws from N(0, 0.1^2). The resulting model was the level 3 model. Theorem 1. P(Y|Z) is stable if there is no active unstable path from Z to Y in G and the mechanism generating Y is stable. Proof. We first recall that all of the shifts considered in Section 3.2 are types of arbitrary shifts in mechanism: mean-shifted mechanisms are a special parametric case, and edge-strength shifts correspond to a constrained class of mechanism shifts in which the natural direct effect associated with the mechanism has changed. Thus, if a distribution is stable to arbitrary shifts in mechanisms, then it will also be stable to mean-shifts and edge-shifts. For this reason, in our proof we will prove a distribution is stable by leveraging previous graphical results on stability under shifts in mechanisms (and stability to the specific cases follows). To do so, we will leverage results from transportability, which uses a graphical representation called selection diagrams (see [15, 24] for details). A selection diagram is a graph augmented with selection variables S (which each have at most one child) that point to variables whose mechanisms may vary across environments. Prior results have shown that a distribution P(Y|Z) is stable if Y ⊥⊥ S|Z in the selection diagram (see [15, Theorem 2] and [24, Definition 3]). Thus, to prove the theorem, we will first translate our unstable edge representation of the graph to a selection diagram.
Theorem 1. P(Y | Z) is stable if there is no active unstable path from Z to Y in G and the mechanism generating Y is stable.

Proof. We first recall that all of the shifts considered in Section 3.2 are types of arbitrary shifts in mechanism: mean-shifted mechanisms are a special parametric case, and edge-strength shifts correspond to a constrained class of mechanism shifts in which the natural direct effect associated with the mechanism has changed. Thus, if a distribution is stable to arbitrary shifts in mechanisms, then it is also stable to mean-shifts and edge-shifts. For this reason, we prove that a distribution is stable by leveraging previous graphical results on stability under shifts in mechanisms (stability in the special cases then follows). To do so, we use results from transportability, which employs a graphical representation called selection diagrams (see [15, 24] for details). A selection diagram is a graph augmented with selection variables $\mathbf{S}$ (each of which has at most one child) that point to variables whose mechanisms may vary across environments. Prior results have shown that a distribution P(Y | Z) is stable if $Y \perp\!\!\!\perp \mathbf{S} \mid Z$ in the selection diagram (see [15, Theorem 2] and [24, Definition 3]). Thus, to prove the theorem, we will first translate our unstable-edge representation of the graph to a selection diagram. Then, we will show that if Y is not d-separated from the selection variables, this implies there is an active unstable path to Y.

We first translate the unstable-edge representation of the graph to a selection diagram. For an edge e, let $\mathrm{He}(e)$ denote the variable that e points into. Now, for each $e \in E_u$, add a unique selection variable that points to $V = \mathrm{He}(e)$. This indicates that the mechanism that generates V is unstable. We now consider the cases in which there could be an active path from a selection variable to Y (which would make a distribution P(Y | Z) unstable), and show that each corresponds to an active path that contains an unstable edge. There are two possible ways there can be an active path from a variable $S \in \mathbf{S}$ to Y. If there is an active forward path from S to Y (e.g., $S \rightarrow \mathrm{ch}(S) \rightarrow \cdots \rightarrow Y$), then there is a corresponding active path from $\mathrm{Ta}(e)$ to Y that contains the unstable edge e: e.g., a path $\mathrm{Ta}(e) \xrightarrow{e} \mathrm{ch}(S) \rightarrow \cdots \rightarrow Y$. Alternatively, if the forward path points directly into Y (i.e., $\mathrm{ch}(S) = Y$), then the mechanism that generates Y is unstable. The other case is if there is an active path from S to Y beginning with a collider (e.g., $S \rightarrow \mathrm{ch}(S) \leftarrow \cdots Y$). Then there is a corresponding active path from $\mathrm{Ta}(e)$ to Y that contains e: e.g., $\mathrm{Ta}(e) \xrightarrow{e} \mathrm{ch}(S) \leftarrow \cdots Y$. Thus, if P(Y | Z) is unstable in the selection diagram, then there is an active unstable path to Y in the original graph with its unstable edges. Taking the contrapositive of this statement proves the theorem.

Proof. This is a restatement of Corollary 1 in [24].

Lemma 4. A stable level 2 distribution of the form P(Y | Z, do(W)) can be expressed as a stable level 3 distribution of the form P(Y(W) | Z(W)).

Proof. Consider the (level 2) intervention do(X) = x. For a variable V, letting V(x) denote the value V would have taken had X been set to x, we have that P(V(x)) = P(V | do(x)). When interventions are consistent (i.e., for $x \neq x'$ there are no conflicting interventions do(X = x) and do(X = x')), counterfactuals reduce to the potential responses of interventions expressible with the do operator [58, Definition 7.1.4].

For completeness, we restate the following result from [66]. For the present paper, both the action space A and the set of distributions $\Gamma$ are $\mathcal{U}$: the decision maker is picking a training distribution (the action a) from $\mathcal{U}$, and nature is picking the test distribution P from $\mathcal{U}$.

Theorem 8 (Theorem 6.1, [66]). Let $\Gamma \subseteq \mathcal{P}$ be a convex, weakly closed, and tight set of distributions. Suppose that for each $a \in A$ the loss function L(x, a) is bounded above and upper semicontinuous in x. Then the restricted game $G_\Gamma = (\Gamma, A, L)$ has a value. Moreover, a maximum entropy distribution $P^*$ exists, attaining this value.

Theorem 5. Consider a classification problem and suppose $\ell$ is a bounded loss function. Then Equation 1 has a solution, and the maximum generalized entropy distribution $Q^* \in \mathcal{U}$ satisfies $B^* = \arg\inf_{B \in \mathcal{U}} \cdots$

Proof. This result follows directly from [66, Theorem 6.1]. The preconditions are trivially satisfied: the set of all distributions over W is convex, closed, and tight, and we consider bounded loss functions, which for finite discrete Y (i.e., for classification problems) are continuous. Thus, the game has a solution. Further, by [66, Corollary 4.2], the maximum generalized entropy distribution $Q^*$ is also the distribution minimizing the worst-case expected loss.
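The selection-diagram construction in the proof of Theorem 1 can be carried out mechanically. The sketch below, using networkx (whose d_separated helper is renamed is_d_separator in releases 3.3 and later), adds one selection variable per unstable edge and tests the d-separation condition from [15, 24]; the example graph and variable names are illustrative.

```python
# Sketch of the graphical stability check behind Theorem 1: add a selection variable
# pointing at the head of every unstable edge; P(Y | Z) is stable if Y is
# d-separated from all selection variables given Z.
import networkx as nx

def is_stable(dag_edges, unstable_edges, y, z):
    G = nx.DiGraph(dag_edges)
    selection_vars = []
    for i, (_, head) in enumerate(unstable_edges):
        s = f"S_{i}"              # selection variable for this unstable edge
        G.add_edge(s, head)       # points at the head He(e) of the unstable edge
        selection_vars.append(s)
    # Note: an unstable edge into Y itself places a selection variable directly on Y,
    # so the check also enforces that the mechanism generating Y is stable.
    # In networkx >= 3.3, use nx.is_d_separator instead of nx.d_separated.
    return all(nx.d_separated(G, {s}, {y}, set(z)) for s in selection_vars)

# Anti-causal example: Y -> X with an unstable mechanism P(X | Y).
edges = [("Y", "X")]
print(is_stable(edges, unstable_edges=[("Y", "X")], y="Y", z=["X"]))  # False: conditioning on the collider X opens S -> X <- Y
print(is_stable(edges, unstable_edges=[("Y", "X")], y="Y", z=[]))     # True: P(Y) does not involve the shifted mechanism
```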
Proposition 6. The level 2 stable distribution $P(Y \mid do(W), X) = P_Q(Y \mid X)$, where Q is the member of $\mathcal{U}$ such that W has a uniform distribution, i.e., $P_Q(W, pa(W)) = c\,P(pa(W))$ for some $c \in \mathbb{R}^+$.

Proof. We know that every distribution in $\mathcal{U}$ factorizes according to the graph G, and that the distributions differ only in the term corresponding to the mechanism for W, $P(W \mid pa(W))$. Thus, for any $D \in \mathcal{U}$,
$$P_D(W, pa(W)) = P_D(W \mid pa(W))\,P_D(pa(W)) = P_D(W \mid pa(W))\,P(pa(W)),$$
noting that $P(pa(W))$ is the same across all members of $\mathcal{U}$. It suffices to show, then, that $P(O \setminus \{W\} \mid do(W)) \propto P_Q(O)$ (within a constant factor), where Q is such that $P_Q(W, pa(W)) = c\,P(pa(W))$. Recall that, by definition, performing do(W) deletes the W term from the factorization (or equivalently sets $P(W = w \mid do(w), pa(W)) = P(W = w \mid do(w)) = 1$), resulting in the so-called "truncated factorization." Further, the resulting distribution $P(O \setminus \{W\} \mid do(W))$ is a proper distribution (sums to 1) over $O \setminus \{W\}$. Consider two cases: 1) W is a discrete variable, or 2) W is a continuous variable. With slight abuse of notation, for continuous variables the results will be with respect to the pdf.

1. Suppose W is discrete and that across environments it is observed to take k distinct values for $k \in \mathbb{N}$, $k < \infty$. $P(O \setminus \{W\} \mid do(W))$ is not a proper distribution over O because $\sum_w P(W = w \mid do(w), pa(W)) = k$. However, it can be made proper by normalizing it such that $P(W = w \mid do(w), pa(W)) = \frac{1}{k}$. Thus, $P(O \setminus \{W\} \mid do(W))$ is within a constant factor of $P_Q(O)$, where Q is the member of $\mathcal{U}$ such that $P(W \mid pa(W)) = P(W) = \frac{1}{k}$ (i.e., where W has a discrete uniform distribution). With respect to the proposition statement, $c = \frac{1}{k}$.

2. This case follows similarly. Suppose W is continuous and that across environments it is observed to be ...

Proof. The following counterexample is adapted from an example in [67]. Consider the DAG G in Fig 6, in which the goal is to predict Y from W and Z, and the mechanism for generating W (i.e., $P(W \mid Y)$) varies across environments. The distribution factorizes as $P(Z, W, Y) = P(Z \mid W, Y)\,P(W \mid Y)\,P(Y)$. Let all variables be binary, and assume that $P(Y = 1) = \frac{1}{2}$, that $P(Z = 1 \mid W, Y) = \frac{1}{2}$ if $Y = W$, and that $P(Z = 1 \mid W, Y) = 1$ otherwise. Finally, we parameterize $P(W \mid Y)$ as follows: $P(W = 1 \mid Y = 0) = 1 - \alpha_0$ and $P(W = 1 \mid Y = 1) = \alpha_1$ for $\alpha_0, \alpha_1 \in [0, 1]$. For the Brier score, [67] computed the maximum generalized entropy parameter values to be $\alpha_0 = 2 - \sqrt{2}$ and $\alpha_1 = 2 - \sqrt{2}$. Thus, the minimax optimal $P(W \mid Y)$ that yields the maximum generalized entropy $P(Y \mid W, Z)$ is $P(W = 1 \mid Y = 1) = 2 - \sqrt{2} \approx 0.586$ and $P(W = 1 \mid Y = 0) = \sqrt{2} - 1 \approx 0.414$. This is different from the $P(W \mid Y)$ that yields the $P(Y \mid W, Z)$ equivalent to the stable level 2 solution $P(Y \mid do(W), Z)$, which is $P(W \mid Y) = P(W) = 0.5$ (by Proposition 6). Thus, the level 2 solution $P(Y \mid do(W), Z)$ is not optimal for this graph under the Brier score (this also holds for the log loss; see the computations in [67]).
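Proposition 6 can be checked numerically on the three-variable graph of the counterexample above. The sketch below uses illustrative parameter values and reads the condition on Z as Y = W; the agreement between P(Y | do(W), Z) and the conditional of the W-uniform environment holds for any choice of P(Z | W, Y) and P(W | Y).

```python
# Numerical check of Proposition 6 on the graph Y -> W, Y -> Z, W -> Z (all binary):
# P(Y | do(W), Z) equals P_Q(Y | W, Z) for the environment Q in which W is uniform,
# and generally differs from P(Y | W, Z) under the training mechanism P(W | Y).
import numpy as np

p_y = np.array([0.5, 0.5])                       # P(Y)
p_w_given_y = np.array([[0.9, 0.1],              # P(W | Y): rows index Y, cols index W (illustrative)
                        [0.3, 0.7]])
p_z_given_wy = np.empty((2, 2, 2))               # P(Z | W, Y): axes (W, Y, Z)
for w in range(2):
    for y in range(2):
        p_z1 = 0.5 if y == w else 1.0            # counterexample parameterization, reading the condition as Y = W
        p_z_given_wy[w, y] = [1 - p_z1, p_z1]

def conditional_y(w_mechanism):
    """P(Y | W, Z) in the environment whose W-mechanism is `w_mechanism` = P(W | Y)."""
    # joint[y, w, z] = P(y) P(w | y) P(z | w, y)
    joint = p_y[:, None, None] * w_mechanism[:, :, None] * p_z_given_wy.transpose(1, 0, 2)
    return joint / joint.sum(axis=0, keepdims=True)

# Level 2: truncated factorization P(y, z | do(w)) = P(z | w, y) P(y), normalized over y.
do_joint = p_y[:, None, None] * p_z_given_wy.transpose(1, 0, 2)      # indexed [y, w, z]
level2 = do_joint / do_joint.sum(axis=0, keepdims=True)

uniform_w = np.full((2, 2), 0.5)                 # Q: P_Q(W | Y) = 1/2
assert np.allclose(level2, conditional_y(uniform_w))      # Proposition 6 holds
print(np.allclose(level2, conditional_y(p_w_given_y)))    # False: differs from the training conditional
```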
Proposition 7. The level 3 distribution $P(Y(W_{B^*}) \mid X(W_{B^*}))$ equals $P_{B^*}(Y \mid X)$ and is minimax optimal, where $W_{B^*}$ is the counterfactual W generated under the mechanism associated with the environment $B^*$.

Proof. Given that our training data was generated from $P_0 \in \mathcal{U}$, we are interested in $P(Y \mid X)$ if, counterfactually, W had been generated using the mechanism $g_{B^*}(pa(W), w)$ (i.e., we edit the structural equation in the SCM), the mechanism for W associated with the environment $B^* \in \mathcal{U}$ identified in Theorem 5. This mechanism change produces a new distribution associated with W, $P_{B^*}(W \mid pa(W))$. We can represent this counterfactually by letting $Z(W_{B^*}) = Z(g_{B^*}(pa(W)))$ be the potential outcome of Z had W been generated according to $g_{B^*}$, for some variable Z (the right-hand side notation is sometimes used to express policy interventions; see, e.g., [62]). Thus, the counterfactual distribution can be expressed as $P(Y(W_{B^*}) \mid W_{B^*}, pa(W), (X \setminus \{W, pa(W)\})(W_{B^*})) = P(Y(W_{B^*}) \mid X(W_{B^*}))$ (noting that $pa(W)(W_{B^*}) = pa(W)$ because changing the mechanism of W does not affect its parents). Because $P_0$ and $B^*$ differ only with respect to the mechanism generating W, the counterfactual distribution associated with this mechanism change yields $P_{B^*}(Y \mid X)$. Thus, $P(Y(W_{B^*}) \mid X(W_{B^*}))$ is minimax optimal because $P_{B^*}(Y \mid X)$ was shown to be minimax optimal in Theorem 5.

In Section 4 we developed a likelihood reweighting formulation for a minimax optimal predictor by assuming that the graph has no bidirected edges (no unobserved confounders). We now relax this condition. First, note that the c-component (or district) of a variable in an ADMG is the set of nodes reachable via purely bidirected paths (i.e., paths of the form $V_1 \leftrightarrow \cdots \leftrightarrow V_k$). An ADMG over variables O factorizes as
$$P(\mathbf{O}) = \prod_{O_i \in \mathbf{O}} P(O_i \mid pa(T_i) \setminus \{O_i\}),$$
where $T_i$ denotes the c-component of $O_i$. Now we can see that the term $P(O_i \mid pa(T_i) \setminus \{O_i\})$ is the ADMG generalization of $P(O_i \mid pa(O_i))$ in DAGs, and is the term associated with the mechanism for generating $O_i$. Thus, if $P(W \mid do(pa(W)))$ is identifiable in the ADMG, then we need to perform likelihood reweighting with respect to $P(W \mid pa(T_W) \setminus \{W\})$. That is, let $B^* \in \mathcal{U}$ be the minimax optimal training distribution/environment. Then
$$\frac{P_{B^*}(\mathbf{O})}{P(\mathbf{O})} = \frac{P_{B^*}(W \mid pa(T_W) \setminus \{W\})}{P(W \mid pa(T_W) \setminus \{W\})},$$
and we can define a reweighting function as before.

References

Hospitals roll out AI systems to keep patients from dying of sepsis
Palantir has secretly been using New Orleans to test its predictive policing technology
Machine bias
Concrete problems in AI safety
Tutorial: Safe and reliable machine learning
Dataset shift in machine learning
The clinician and dataset shift in artificial intelligence
How the coronavirus pandemic is breaking artificial intelligence and how to fix it
Biases in electronic health record data due to processes within the healthcare system: retrospective observational study
Practice variation and physician-specific effects
Physician beliefs and patient preferences: a new look at regional variation in health care spending
Reliable decision support using counterfactual models
Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study
Counterfactual normalization: Proactively addressing dataset shift using causal mechanisms
Transportability of causal and statistical relations: a formal approach
Assessing the generalizability of randomized trial results to target populations
Causal inference and the data-fusion problem
A review of generalizability and transportability
Shadow prices, market wages, and labor supply
Sample selection bias as a specification error
Models for sample selection bias
Estimating models with sample selection bias: a survey
Domain adaptation by using causal inference to predict invariant conditional distributions
Preventing failures due to dataset shift: Learning predictive models that transport
I-spec: An end-to-end framework for learning transportable, shift-stable models
Invariant models for causal transfer learning
Invariant risk minimization
Generalization and invariances in the presence of unobserved confounding
Out-of-distribution generalization with maximal invariant predictor
Experimental and quasi-experimental designs for research. Houghton, Mifflin and Company
Commentary: External validity of results of randomized trials: disentangling a complex concept
Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial
The use of propensity scores to assess the generalizability of results from randomized trials
External validity: From do-calculus to transportability across populations
Generalizing causal inferences from randomized trials: counterfactual and graphical identification
The promise and success of lab-field generalizability in experimental economics: A critical reply to Levitt and List
Correcting sample selection bias by unlabeled data
Domain adaptation under target and conditional shift
Domain-adversarial training of neural networks
Domain adaptation with conditional transferable components
From statistical transportability to estimating the effect of stochastic interventions
Certifying some distributional robustness with principled adversarial training
Variance-based regularization with convex objectives
Conditional variance penalties and domain shift robustness
Anchor regression: heterogeneous data meets causality
Regularizing towards causal invariance: Linear models with proxies
Domain generalization via invariant feature representation
Invariant risk minimization games
Causal inference by using invariant prediction: identification and confidence intervals
Stable prediction across unknown environments
Stable prediction with model misspecification and agnostic distribution shift
Counterfactual invariance to spurious correlations: Why and how to pass stress tests
Learning the difference that makes a difference with counterfactually-augmented data
Explaining the efficacy of counterfactually augmented data
Selecting data augmentation for simulating interventions
Active learning for decision-making from imbalanced observational data
Domain adaptation as a problem of inference on graphical models
On causal and anticausal learning
Causality from a distributional robustness point of view
Causal diagrams for interference
Intervening on network ties
Identifiability of path-specific effects
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
Causal inference with a graphical hierarchy of interventions
Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory
Robust causal domain adaptation in a simple diagnostic setting
Training GANs with optimism
The limit points of (optimistic) gradient descent in min-max optimization
On gradient descent ascent for nonconvex-concave minimax problems
A machine learning algorithm to predict severe sepsis and septic shock: Development, implementation, and impact on clinical practice
Incidence and trends of sepsis in US hospitals using clinical vs claims data
JAX: composable transformations of Python+NumPy programs
Statistical decision theory and Bayesian analysis
Causation, prediction, and search
Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination
Development and evaluation of a machine learning model for the early identification of patients at risk for sepsis
Studies in Causal Reasoning and Learning