authors: Ahuja, Kartik; Shanmugam, Karthikeyan; Dhurandhar, Amit
title: Linear Regression Games: Convergence Guarantees to Approximate Out-of-Distribution Solutions
date: 2020-10-28

Recently, invariant risk minimization (IRM) (Arjovsky et al.) was proposed as a promising solution to address out-of-distribution (OOD) generalization. In Ahuja et al., it was shown that solving for the Nash equilibria of a new class of "ensemble-games" is equivalent to solving IRM. In this work, we extend the framework in Ahuja et al. for linear regressions by projecting the ensemble-game on an $\ell_{\infty}$ ball. We show that such projections help achieve non-trivial OOD guarantees despite not achieving perfect invariance. For linear models with confounders, we prove that Nash equilibria of these games are closer to the ideal OOD solutions than the standard empirical risk minimization (ERM), and we also provide learning algorithms that provably converge to these Nash equilibria. Empirical comparisons of the proposed approach with the state-of-the-art show consistent gains in achieving OOD solutions in several settings involving anti-causal variables and confounders.

A recent study shows that models trained to detect COVID-19 from chest radiographs rely on spurious factors such as the source of the data rather than the lung pathology [DeGrave et al., 2020]. This is just one of many alarming examples of spurious correlations failing to hold outside a specific training distribution. In one commonly cited example, [Beery et al., 2018] trained a convolutional neural network (CNN) to classify camels from cows. In the training dataset, most pictures of the cows had green pastures, while most pictures of camels were in the desert. The CNN picked up the spurious correlation and associated green pastures with cows, thus failing to classify cows on beaches correctly. Recently, [Arjovsky et al., 2019] proposed a framework called invariant risk minimization (IRM) to address the problem of models inheriting spurious correlations. They showed that when data is gathered from multiple environments, one can learn to exploit invariant causal relationships rather than relying on varying spurious relationships, thus learning robust predictors. The authors used the invariance principle based on causality [Pearl, 1995] to construct powerful objects called "invariant predictors". An invariant predictor, loosely speaking, is a predictor that is simultaneously optimal across all the training environments under a shared representation. In [Arjovsky et al., 2019], it was shown that for linear models with confounders and/or anti-causal variables, learning ideal invariant predictors translates to learning solutions with ideal out-of-distribution (OOD) generalization behavior. However, building efficient algorithms guaranteed to learn these invariant predictors is still a challenge. The algorithm in [Arjovsky et al., 2019] is based on minimizing a risk function comprising the standard risk and a penalty term that tries to approximately ensure that the learned predictors are invariant. The penalty is non-convex even for linear models, and thus the algorithm is not guaranteed to arrive at invariant predictors.
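To make the non-convexity concrete, below is a minimal sketch of the penalized objective used by the IRMv1 algorithm of [Arjovsky et al., 2019] for linear models with the square loss: each environment contributes its risk plus the squared gradient of that risk with respect to a scalar classifier fixed at 1.0. The function names and synthetic setup are illustrative, not taken from the original implementation. Since the penalty squares a quantity that is already quadratic in the representation, the objective is quartic in phi, which is why gradient descent on it carries no guarantee of reaching an invariant predictor.

```python
import numpy as np

def env_risk(phi, X, y):
    """Square-loss risk of the linear predictor x -> phi @ x on one environment."""
    return np.mean((y - X @ phi) ** 2)

def irmv1_penalty(phi, X, y):
    """IRMv1-style penalty: squared gradient of the environment risk w.r.t. a scalar
    multiplier w, evaluated at w = 1, i.e. d/dw E[(y - w * X @ phi)^2] at w = 1."""
    pred = X @ phi
    grad_w = np.mean(2.0 * (pred - y) * pred)   # closed form of the derivative at w = 1
    return grad_w ** 2

def irmv1_objective(phi, envs, lam=1.0):
    """Sum over environments of risk + lambda * invariance penalty (quartic in phi)."""
    return sum(env_risk(phi, X, y) + lam * irmv1_penalty(phi, X, y) for X, y in envs)
```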
Another recent work [Ahuja et al., 2020] , proposed a framework called invariant risk minimization games (IRM-games) and showed that solving for the Nash equilibria (NE) of a special class of "ensemble-games" is equivalent to solving IRM for many settings. The algorithm in [Ahuja et al., 2020] has no convergence guarantees to the NE of the ensemble-game. To summarize, building algorithms that are guaranteed to converge to predictors with non-trivial OOD generalization is unsolved even for linear models with confounders and/or anti-causal variables. In this work, we take important steps towards this highly sought after goal. As such, we formulate an ensemble-game that is constrained to be in the ∞ ball. Although this construction might seem to be surprising at first, we show that these constrained ensemble-game based predictors have a good OOD behavior even though that they may not be the exact invariant predictors. We provide efficient algorithms that are guaranteed to learn these predictors in many settings. To the best of our knowledge, our algorithms are the first for which we can guarantee both convergence and better OOD behavior than standard empirical risk minimization. We carry out empirical comparisons in the settings proposed in [Arjovsky et al., 2019] , where the data is generated from models that include both causal and anti-causal variables as well as confounders in some cases. These comparisons of our approach with the state-of-the-art depict its promise in achieving OOD solutions in these setups. This demonstrates that searching over the NE of constrained ensemble-games is a principled alternative to searching over invariant predictors as is done in IRM. IRM [Arjovsky et al., 2019] has its roots in the theory of causality [Pearl, 1995] . A variable y is caused by a set of non-spurious actual causal factors x Pa(y) if and only if in all environments where y has not been intervened on, the conditional probability P (y|x Pa(y) ) remains invariant. This is called the modularity condition [Bareinboim et al., 2012] . Related and similar notions are the independent causal mechanism principle [Schölkopf et al., 2012] [Janzing and Schölkopf, 2010] [Janzing et al., 2012] and the invariant causal prediction principle (ICP) [Peters et al., 2016] [Heinze-Deml et al., 2018] . These principles imply that if all the environments (train and test) are modeled by interventions that do not affect the causal mechanism of target variable y, then a predictor trained on the transformation that involves the causal factors (Φ(x) = x Pa(y) ) to predict y is an invariant predictor, which is robust to unseen interventions. In general, for finite sets of environments, there may be other invariant predictors. If one has information about the causal Bayesian network structure, one can find invariant predictors that are maximally predictive using conditional independence tests and other graph-theoretic tools [Magliacane et al., 2018 , Subbaswamy et al., 2019 . The above works select subsets of features, primarily using conditional independence tests, that make the optimal predictor trained on the selected features invariant. In IRM [Arjovsky et al., 2019] , the authors provide an optimization-based reformulation of this invariance that facilitates searching over transformations in a continuous space. 
Following the original work on IRM [Arjovsky et al., 2019], there have been several interesting works ([Teney et al., 2020], [Krueger et al., 2020], [Chang et al., 2020], [Koyama and Yamaguchi, 2020] is an incomplete representative list) that build new methods inspired by IRM to address OOD generalization. In these works, similar to IRM, the algorithms are not provably guaranteed to converge to predictors with desirable OOD behavior.

3 Background, Problem Formulation, and Approach

A standard normal form game is written as a tuple Ω = (N, {u_i}_{i∈N}, {S_i}_{i∈N}), where N is a finite set of players. Player i ∈ N takes actions from a strategy set S_i. The utility of player i is u_i : S → R, where we write the joint set of actions of all the players as S = Π_{i∈N} S_i. The joint strategy of all the players is given as s ∈ S, the strategy of player i is s_i, and the strategy of the rest of the players is s_{−i}.

Definition 1. A strategy s† ∈ S is said to be a pure strategy Nash equilibrium (NE) if it satisfies u_i(s†_i, s†_{−i}) ≥ u_i(k, s†_{−i}) for every k ∈ S_i and every player i ∈ N.

A NE defines a state where each player is using the best possible strategy in response to the rest of the players. A natural question to ask is when a pure strategy NE exists. In the seminal work of [Debreu, 1952], it was shown that for a special class of games called concave games such a NE always exists.

Definition 2. A game Ω is a concave game if for each player i ∈ N the strategy set S_i is compact and convex, and the utility u_i(s_i, s_{−i}) is continuous and concave in s_i.

Theorem 1. [Debreu, 1952] For any concave game Ω a pure strategy Nash equilibrium s† always exists.

In this work, we only study pure strategy NE and use the terms pure strategy NE and NE interchangeably. We are given a collection of training datasets D = {D_e}_{e∈E_tr} gathered from a set of environments E_tr, where D_e = {(x^i_e, y^i_e)}_{i=1}^{n_e} is the dataset gathered from environment e ∈ E_tr and n_e is the number of points in environment e. The feature value for data point i is x^i_e ∈ X and the corresponding label is y^i_e ∈ Y, where X ⊆ R^n and Y ⊆ R. Each point (x^i_e, y^i_e) in environment e is drawn i.i.d. from a distribution P_e. Define a predictor f : X → R. The goal of IRM is to use this collection of datasets D to construct a predictor f that performs well across many unseen environments E_all, where E_all ⊇ E_tr. Define the risk achieved by f in environment e as R_e(f) = E_e[ℓ(f(X_e), Y_e)], where ℓ is the square loss when f(X_e) is the predicted value and Y_e is the corresponding label, (X_e, Y_e) ∼ P_e, and the expectation E_e is defined with respect to (w.r.t.) the distribution of points in environment e.

Invariant predictor and IRM optimization: An invariant predictor is composed of two parts: a representation Φ ∈ R^{d×n} and a predictor w ∈ R^{d×1}. We say that a data representation Φ elicits an invariant predictor w^T Φ across the set of environments E_tr if there is a predictor w that achieves the minimum risk for all the environments, i.e., w ∈ argmin_{w̃ ∈ R^{d×1}} R_e(w̃^T Φ), ∀ e ∈ E_tr. IRM may be phrased as the following constrained optimization problem:

min_{Φ ∈ R^{d×n}, w ∈ R^{d×1}} Σ_{e∈E_tr} R_e(w^T Φ)  s.t.  w ∈ argmin_{w̃ ∈ R^{d×1}} R_e(w̃^T Φ), ∀ e ∈ E_tr.   (1)

If w^T Φ satisfies the constraints above, then it is an invariant predictor across the training environments E_tr. Define the set of invariant predictors w^T Φ satisfying the constraints in (1) as S_IV. Informally stated, the main idea behind the above optimization is inspired by invariance principles in causality [Bareinboim et al., 2012][Pearl, 2009]. Each environment can be understood as an intervention.
By learning an invariant predictor, the learner hopes to identify a representation Φ that transforms the observed features into the causal features; the optimal model trained on the causal representation is likely to be the same (invariant) across the environments, provided we do not intervene on the label itself. These invariant models can be shown to have good out-of-distribution performance. Next, we briefly describe IRM-games.

Ensemble-game: Each environment e is endowed with its own predictor w_e ∈ R^{d×1}. Define an ensemble predictor w̄ ∈ R^{d×1} given as w̄ = Σ_{q∈E_tr} w_q; for the rest of this work a bar on top of a vector represents an ensemble predictor. We require all the environments to use this ensemble w̄. We want to solve the following new optimization problem:

min_{Φ} Σ_{e∈E_tr} R_e(w̄^T Φ)  s.t.  w_e ∈ argmin_{w̃ ∈ R^{d×1}} R_e((w̃ + Σ_{q≠e} w_q)^T Φ), ∀ e ∈ E_tr.   (3.2)

For a fixed representation Φ, the constraints in the above optimization (3.2) represent the NE of a game with each environment e as a player with actions w_e. Environment e selects w_e to maximize its utility −R_e((w_e + Σ_{q≠e} w_q)^T Φ). Define the set of ensemble-game predictors w̄^T Φ, i.e., the predictors that satisfy the constraints in (3.2), as S_EG. In [Ahuja et al., 2020] it was shown that the set of ensemble-game predictors equals the set of invariant predictors, S_EG = S_IV. Having briefly reviewed IRM and IRM-games (we presented them with linear models but these works are more general), we are now ready to build our framework.

The data is gathered from a set of two environments, E_tr = {1, 2}.¹ Each data point (X_e, Y_e) in environment e is sampled from P_e. Each environment e ∈ {1, 2} is a player that wants to select a w_e ∈ R^{n×1} such that it minimizes

R_e(w_1, w_2) = E_e[(Y_e − (w_1 + w_2)^T X_e)^2],

where E_e is expectation w.r.t. P_e. We write the above as a two player game represented by a tuple Γ = ({1, 2}, {R_e}_{e∈{1,2}}, R^{n×1}). We refer to Γ as an unconstrained linear regression game (U-LRG). A Nash equilibrium w† = (w†_1, w†_2) of U-LRG is a solution to

w†_e ∈ argmin_{w_e ∈ R^{n×1}} R_e(w_e, w†_{−e}), ∀ e ∈ {1, 2}.

¹ Discussion on multiple environments is in the Appendix Section.

The above two-player U-LRG is a natural extension of linear regressions, and we start by analyzing the NE of the above game. Before going further, the above game can be understood as fixing Φ to identity in the ensemble-game defined in the previous section. For each e ∈ {1, 2}, define the mean of features µ_e = E_e[X_e], Σ_e = E_e[X_e X_e^T], and the correlation between the feature X_e and the label Y_e as ρ_e = E_e[X_e Y_e].

Assumption 1. Regularity condition. For each e ∈ {1, 2}, µ_e = 0 and Σ_e is positive definite.

The above regularity conditions are fairly standard, and the mean zero condition can be relaxed by introducing intercepts in the model. When µ_e = 0, Σ_e is the covariance matrix. For each e ∈ {1, 2}, define w*_e = Σ_e^{−1} ρ_e, where Σ_e^{−1} is the inverse of Σ_e. w*_e is the least squares optimal solution for environment e, i.e., it solves min_{w̃ ∈ R^{n×1}} E_e[(Y_e − w̃^T X_e)^2].

Proposition 1. If Assumption 1 holds and if the least squares optimal solutions in the two environments are
• equal, i.e., w*_1 = w*_2 = w*, then the set {(w_1, w_2) ∈ R^{n×1} × R^{n×1} | w_1 + w_2 = w*} describes all the pure strategy Nash equilibria of U-LRG, Γ;
• not equal, i.e., w*_1 ≠ w*_2, then U-LRG, Γ, has no pure strategy Nash equilibrium.

The proofs of all the propositions and theorems are in the Appendix. From the above proposition, it follows that agreement between the environments on the least squares optimal solution is both necessary and sufficient for the existence of a NE of U-LRG. Next, we describe the family of linear structural equation models (SEMs) in [Arjovsky et al., 2019] and show how the two cases, w*_1 = w*_2 and w*_1 ≠ w*_2, naturally arise.
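Before moving to those SEMs, here is a minimal numerical sketch of Proposition 1: estimate w*_e = Σ_e^{−1} ρ_e from samples of each environment and check whether the two environments agree, which by the proposition determines whether U-LRG has a pure strategy NE. The synthetic data and helper names below are illustrative only.

```python
import numpy as np

def least_squares_solution(X, y):
    """Estimate w*_e = Sigma_e^{-1} rho_e from samples of one environment (zero-mean features)."""
    sigma = X.T @ X / len(X)          # estimate of E_e[X X^T]
    rho = X.T @ y / len(X)            # estimate of E_e[X Y]
    return np.linalg.solve(sigma, rho)

rng = np.random.default_rng(0)
n = 5000
# Two environments sharing the same conditional mean E[Y|X] = gamma^T X (covariate shift only),
# so Proposition 1 predicts w*_1 = w*_2 and hence that a pure NE of U-LRG exists.
gamma = np.array([1.0, -2.0, 0.5])
envs = []
for scale in (1.0, 3.0):                       # different feature distributions per environment
    X = rng.normal(0.0, scale, size=(n, 3))
    y = X @ gamma + rng.normal(0.0, 1.0, n)    # invariant mechanism for the label
    envs.append((X, y))

w1, w2 = (least_squares_solution(X, y) for X, y in envs)
print(w1, w2)   # both close to gamma -> any pair with w_1 + w_2 = w* is a NE of U-LRG
```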
In this section, we consider linear SEMs from [Arjovsky et al., 2019] and study the NE of U-LRG. Assumption 2. Linear SEM with confounders and (or) anti-causal variables For each e ∈ {1, 2}, (X e , Y e ) is generated from the following SEM The feature vector is X e = (X 1 e , X 2 e ). H e ∈ R s is a confounding random variable, where each component of H e is an i.i.d draw from a distribution with zero mean and unit variance. H e affects both the labels Y e through weights η e ∈ R s and a subset of features X 2 e ∈ R q through weights Θ e ∈ R q×s . ε e ∈ R is independent zero mean noise in the label generation. Y e affects a subset of features X 2 e with weight α e ∈ R q , ζ e ∈ R q is an independent zero mean noise vector affecting X 2 e . X 1 e ∈ R p are the causal features drawn from a distribution with zero mean and affect the label through a weight γ ∈ R p , which is invariant across the environments. The above model captures many different settings. If α e = 0 and Θ e = 0, then features X 2 e appear correlated with the label due to the confounder H e . If α e = 0 and Θ e = 0, then features X 2 e are correlated with the label but they are effects or anti-causal. If both α e = 0, Θ e = 0, then we are in a hybrid of the above two settings. In all of the above settings it can be shown that relying on X 2 e to make predictions can lead to failures under distribution shifts (modeled by interventions). From [Arjovsky et al., 2019] , we know that for the above family of models the ideal OOD predictor is γ, 0 as it performs well across many distribution shifts (modeled by interventions). Hence, the goal is to learn γ, 0 . No confounders & no anti-causal variables (w * 1 = w * 2 ): Consider the SEM in Assumption 2. For each environment e ∈ {1, 2}, assume α e = 0 and Θ e = 0, i.e. no confounding and no anticausal variables. This setting captures the standard covariate shifts [Gretton et al., 2009] , where it is assumed that P e (Y e |X e = x) is invariant across environments, here we assume E e (Y e |X e = x) = γ T x is invariant across environments. The least squares optimal solution for each environment is w * e = γ, 0 , which implies that w * 1 = w * 2 . From Proposition 1 we know that a NE exists (any two predictors adding to w * 1 form a NE). In this setting, different methods -empirical risk minimization (ERM), IRM, IRM-games, and methods designed for covariate shifts such as sample reweightingshould perform well. Confounders only (w * 1 = w * 2 ): Consider the SEM in Assumption 2. For each environment e ∈ {1, 2}, assume α e = 0, Θ e = 0, i.e. confounders only setting. Define Σ e1 = E e X 1 e X 1,T e and define the variance for the noise vector ζ as σ 2 ζe = E e [ζ e ζ e ], where represents element-wise product between two vectors. Assumption 3. Regularity condition for linear SEM in Assumption 2. For each environment e ∈ {1, 2}, Σ e1 is positive definite and each element of the vector σ 2 ζe is positive. Assumption 3 is equivalent to Assumption 1 for SEM in Assumption 2 (it ensures Σ e is positive definite). Proposition 2. If Assumption 2 holds with α e = 0 for each e ∈ {1, 2}, and Assumption 3 holds, then the least squares optimal solution for environment e is We divide w * e into two halves w inv e = γ and w var e = Θ e Θ T e + diag[σ 2 ζe ] −1 Θ e η e . Observe that the first half w inv e is invariant, i.e., it does not depend on the environment, while w var e may vary as it depends on the parameters specific to the environment e.g., Θ e , η e . 
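A quick empirical check of this decomposition for the confounder-only SEM (α_e = 0): fit least squares on data simulated from the SEM and compare with the closed form of Proposition 2, whose second block is (Θ_e Θ_e^T + diag[σ²_ζe])^{−1} Θ_e η_e. The specific dimensions, parameter draws, and variable names below are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, s, n = 3, 3, 3, 200_000
gamma = np.ones(p)
theta, eta = rng.normal(size=(q, s)), rng.normal(size=s)   # environment-specific confounding
sigma_zeta = np.full(q, 1.0)

# Confounder-only SEM (alpha_e = 0): H affects both Y and X2, X1 causes Y.
H = rng.normal(size=(n, s))
X1 = rng.normal(size=(n, p))
eps = rng.normal(size=n)
Y = X1 @ gamma + H @ eta + eps
X2 = H @ theta.T + sigma_zeta * rng.normal(size=(n, q))
X = np.hstack([X1, X2])

# Empirical least squares solution for this environment.
w_star = np.linalg.solve(X.T @ X / n, X.T @ Y / n)

# Closed form from Proposition 2: (gamma, (Theta Theta^T + diag(sigma_zeta^2))^{-1} Theta eta).
w_var = np.linalg.solve(theta @ theta.T + np.diag(sigma_zeta ** 2), theta @ eta)
print(np.round(w_star, 2))
print(np.round(np.concatenate([gamma, w_var]), 2))   # should match up to sampling noise
```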
In general, w var 1 = w var 2 (e.g., s = q, Θ e is identity I q , σ 2 ζe is one 1 q , η 1 = η 2 ) and as a result w * 1 = w * 2 . In such a case, from Proposition 1, we know that NE does not exist. ERM and other techniques such as domain adaptation [Ajakan et al., 2014 , Ben-David et al., 2007 , Glorot et al., 2011 , Ganin et al., 2016 , robust optimization [Mohri et al., 2019 , Hoffman et al., 2018 , Lee and Raginsky, 2018 , Duchi et al., 2016 , would tend to learn a model which tends to exploit information from the spuriously correlated X 2 e thus placing a non-zero weight on the second half corresponding to the features X 2 e and not recovering (γ, 0). IRM based methods are designed to tackle these problems. These works try to learn representations that filter out causal features, X 1 e , with invariant coefficients, w inv e , from spurious features, X 2 e , with variant coefficients w var e and learn a predictor on top resulting in the invariant predictor (γ, 0). However, the current algorithms that search for these representations in IRM and IRM-games are based on gradient descent over non-convex losses and non-trivial best response dynamics respectively, both of which are not guaranteed to converge to the ideal OOD predictor (γ, 0). We formally state the assumption underlying these methods, which we also use later. Assumption 4. Spurious features have varying coefficents across environments. w var 1 = w var 2 In U-LRG, Γ, the utility of environment 1 (2) for is −R 1 (w 1 , w 2 ) −R 2 (w 1 , w 2 ) . For each environment e ∈ {1, 2}, −R e is continous and concave in w e . For each e in the game Γ, the set of actions it can take is in R n×1 , which is not a compact set. If the set of actions for each environment were compact and convex, then we can use Theorem 1 to guarantee that a NE always exists. Let us constraint the predictors to be in the set W = w e w e ∞ ≤ w sup , where · ∞ is the ∞ norm and 0 < w sup < ∞. We define the constrained linear regression game (C-LRG) as Γ c = E tr , {−R e } e∈Etr , W . W is a closed and bounded subset in the Euclidean space, which implies it is also a compact set. W is also a convex set as ∞ norm is convex. From Definition 2, the game Γ c is concave. Proposition 3. A pure strategy Nash equilibrium always exists for C-LRG, Γ c . The above proposition follows from Theorem 1. Unlike the game Γ, a NE always exists for the game Γ c . Let w † 1 , w † 2 be a NE of Γ c and letw † be the corresponding ensemble predictor, i.e. w † = w † 1 + w † 2 . In the next theorem, we analyze the properties of w † 1 , w † 2 but before that we state some assumptions. Assumption 5. Realizability. For each e ∈ {1, 2} the least squares optimal solution w * e ∈ W. We write the feature vector in environment e as X e = (X e1 , . . . , X en ) and the least squares optimal solution in environment e as w * e = (w e1 , . . . , w en ). Divide the features indexed {1, . . . , n} into two sets U and V. U is defined as: i ∈ U if and only if the weight associated with i th component in the least squares solution is equal in the two environments, i.e., w * 1i = w * 2i . V is defined as: i ∈ V if and only if the weight associated with i th component in the least squares solution is not equal in the two environments, i.e., w * 1i = w * 2i . For an example of these sets, consider the least squares solution to the confounded only SEM in equation (5) under Assumption 4 , w inv 1 = w inv 2 = γ =⇒ U = {1, . . . , p}, and w var 1 = w var 2 =⇒ V = {p + 1, . . . , p + q}. Assumption 6. 
Features with varying coefficients across environments are uncorrelated. For each i ∈ V and for each e ∈ {1, 2}, the corresponding feature X_ei is uncorrelated with every other feature j ∈ {1, . . . , n} \ {i}, i.e., E_e[X_ei X_ej] = 0 for all j ∈ {1, . . . , n} \ {i}.

The above assumption says that any feature component whose least squares optimal solution coefficient varies across environments is not correlated with the rest of the features. We use the above assumption to derive an analytical expression for the NE of Γ_c next. For a vector a, |a| represents the vector of absolute values of all the elements. The element-wise product of two vectors a and b is written as a ⊙ b. Define an indicator function 1_{a≥b}; it carries out an element-wise comparison of a and b and outputs a vector of ones and zeros, where a one at component i indicates that the i-th component of a, a_i, is greater than or equal to the i-th component of b, b_i. Recall that the ensemble predictor constructed from the NE is w̄† = w†_1 + w†_2.

Theorem 2. If Assumptions 1, 5, 6 hold, then the ensemble predictor, w̄†, constructed from the Nash equilibrium, (w†_1, w†_2), of Γ_c is equal to

w̄† = 1_{w*_1 ⊙ w*_2 ≥ 0} ⊙ ( 1_{|w*_2| ≥ |w*_1|} ⊙ w*_1 + 1_{|w*_1| > |w*_2|} ⊙ w*_2 ).   (6)

Casewise analysis of NE in equation (6):
• Different least squares coefficients across environments (w*_1i ≠ w*_2i): We analyze this case under two categories.
Opposite sign coefficients: If the i-th components of w*_1 and w*_2 have opposite signs, then the i-th component of the ensemble predictor, w̄†, constructed from the NE of Γ_c, is zero, i.e., w̄†_i = 0. In this case, the coefficients of the environments' predictors in the NE, w†_1i and w†_2i, have exactly opposite signs and both are at the boundary, one at w_sup and the other at −w_sup. This case shows that when the features have a large variation in their least squares coefficients across environments, they can be spurious (see Proposition 4), and the ensemble predictor filters them out by assigning a zero weight to them.
Same sign coefficients: If the i-th components of w*_1 and w*_2 have the same sign, then the i-th component of the ensemble predictor, w̄†, constructed from the NE of Γ_c, is set to the least squares coefficient with the smaller absolute value, i.e., w̄†_i = w*_1i if |w*_1i| ≤ |w*_2i| and w̄†_i = w*_2i otherwise. This shows that the ensemble predictor is conservative and selects the smaller least squares coefficient. This property is useful for identifying predictors that are robust (see Proposition 4).
• Lastly, only when the least squares coefficients are the same, i.e., w*_1i = w*_2i, can the coefficients of the environments' predictors in the NE be in the interior, i.e., |w†_1i| < w_sup and |w†_2i| < w_sup.

Suppose for each environment e ∈ {1, 2} the data is generated from the SEM in Assumption 2. We study whether the NE of C-LRG, Γ_c, achieves or gets close to the ideal OOD predictor (γ, 0). We compare the ensemble predictor w̄† constructed from the NE of Γ_c to the solutions of ERM (Theorem 2 enables this comparison). In ERM, the data from both the environments is combined and the overall least squares loss is minimized. Define the probability that a point is from environment e as π_e (π_2 = 1 − π_1). The set of ERM solutions over all distributions {π_1, π_2} is S_ERM, given as

S_ERM = { argmin_{w ∈ R^{n×1}} π_1 E_1[(Y_1 − w^T X_1)^2] + (1 − π_1) E_2[(Y_2 − w^T X_2)^2] | π_1 ∈ [0, 1] }.

Proposition 4. If Assumption 2 holds with α_e = 0 and Θ_e an orthogonal matrix for each e ∈ {1, 2}, and Assumptions 3, 4, 5 hold, then ‖w̄† − (γ, 0)‖ < ‖w_ERM − (γ, 0)‖ holds for all w_ERM ∈ S_ERM.² Moreover, if all the components of the two vectors Θ_1 η_1 and Θ_2 η_2 have opposite signs, then w̄† = (γ, 0).

² Exception occurs over a measure zero set of probabilities π_1. If the least squares solutions are strictly ordered, i.e., ∀ i ∈ {1, . . . , n}, 0 < w*_1i < w*_2i, and π_1 = 1, then w_ERM = w̄† = w*_1.
In general, w * 1 , w * 2 are not ordered and π1 ∈ (0, 1), thus C-LRG improves over ERM. We use Theorem 2 to arrive at the above (Assumptions in the above proposition imply that Assumptions 1, 5, 6 hold thus allowing us to use Theorem 2). From the first part of the above we learn that for many confounder only models (α e = 0, Θ e an orthogonal matrix), the ensemble predictor constructed from the NE is closer to the ideal OOD solution than ERM. For the second part, set Θ e = I q , where I q is identity matrix. Suppose the signs of all the components of η 1 and η 2 disagree. As a result, the signs of latter half of least squares solution w var e (in equation (5)) disagree. From Theorem 2, we know that if the signs of the coefficients in least squares solution disagree, then the corresponding coefficient in the ensemble predictor is zero, which impliesw † = (γ, 0). Remark. In Proposition 4, besides the regularity conditions, the main assumption is Θ e is orthogonal. This assumption ensures that the the spurious features X 2 e are uncorrelated (Assumption 6). For confounder only models this seems reasonable. However, in the models involving anti-causal variables, i.e., α e = 0, the spurious features can be correlated and one may wonder how does the ensemble predictor behave in such setups? In experiments, we show that ensemble predictors perform well in these settings as well. Extending the theory to anti-causal models is a part of future work. Insights from Theorem 2, Proposition 4 Suppose the data comes from the SEM in Assumption 2. For this SEM, [Arjovsky et al., 2019] showed that if the number of environments grow linearly in the total number of features, then the solution to non-convex IRM optimization recovers the ideal OOD predictor. We showed that for many confounder only SEMs (α e = 0 and Θ e orthogonal) NE based ensemble predictor gets closer to the OOD predictor than ERM and sometimes recovers it exactly with just two environments, while no such guarantees exist for IRM. Next, we show how to learn these NE based ensemble predictor. Initialize: In this section, we show how we can use best response dynamics (BRD) [Fudenberg et al., 1998 ] to learn the NE. Each environment takes its turn and finds the best possible model given the choice made by the other environment. This procedure (Algorithm 1) is allowed to run until the environments stop updating their models. In the next theorem, we make the same set of Assumptions as in Theorem 2 and show that Algorithm 1 converges to the NE derived in Theorem 2. Theorem 3. If Assumption 1, 5, 6 hold, then the output of Algorithm 1,w + , is We now understand different aspects of BRD. Dynamics of BRD. Consider the i th component of the predictorsw 1i andw 2i from Algorithm 1. Suppose w * 1i > w * 2i and |w * 1i | > |w * 2i |. The two environments push the ensemble predictor, w 1i +w 2i , in opposite directions during their turns, with the first environment increasing its weight, w 1i , and the second environment decreasing its weight,w 2i . Eventually, the environment with a higher absolute value (e = 1 since |w * 1i | > |w * 2i |) reaches the boundary (w 1i = w sup ) and cannot move any further due to the constraint. The other environment (e = 2) best responds. It either hits the other end of the boundary (w 2i = −w sup ), in which case the weight of the ensemble for component i is zero, or gets close to the other boundary while staying in the interior (w 2i = w * 2i − w sup ), in which case the weight of the ensemble for component i is w * 2i . 
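Below is a minimal sketch of this best response dynamic in the simplified regime where the per-environment risk separates across coordinates (uncorrelated features, as in Assumption 6), so that each environment's exact best response is the ℓ∞ projection of w*_e − w_{−e}. It is compared against the closed-form rule of Theorem 2 (zero where the least squares coefficients disagree in sign, otherwise the coefficient with the smaller magnitude). The numerical values are illustrative.

```python
import numpy as np

def clip_inf(w, w_sup):
    """Projection onto the l_inf ball W = {w : ||w||_inf <= w_sup}."""
    return np.clip(w, -w_sup, w_sup)

def best_response_dynamics(w_star_1, w_star_2, w_sup, n_rounds=200):
    """Best response dynamics for C-LRG when the risk separates across coordinates
    (uncorrelated features): the exact best response of environment e to w_{-e}
    is then the coordinate-wise projection of w*_e - w_{-e} onto the l_inf ball."""
    w1 = np.zeros_like(w_star_1)
    w2 = np.zeros_like(w_star_2)
    for _ in range(n_rounds):
        w1 = clip_inf(w_star_1 - w2, w_sup)   # environment 1 best responds
        w2 = clip_inf(w_star_2 - w1, w_sup)   # environment 2 best responds
    return w1 + w2                             # ensemble predictor

def theorem2_ensemble(w_star_1, w_star_2):
    """Closed-form NE ensemble described by Theorem 2: zero where the least squares
    coefficients disagree in sign, otherwise the coefficient of smaller magnitude."""
    same_sign = w_star_1 * w_star_2 >= 0
    smaller = np.where(np.abs(w_star_1) <= np.abs(w_star_2), w_star_1, w_star_2)
    return same_sign * smaller

w_star_1 = np.array([1.0, 0.7, -0.4, 1.5])   # illustrative per-environment least squares solutions
w_star_2 = np.array([1.0, -0.9, -0.8, 0.6])
print(best_response_dynamics(w_star_1, w_star_2, w_sup=2.0))
print(theorem2_ensemble(w_star_1, w_star_2))  # both should give [1.0, 0.0, -0.4, 0.6]
```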
BRD a sequence of convex minimizations. In Algorithm 1, we assumed that at each time step each environment can do an exact minimization operation. The minimization for each environment is a simple least squares regression, which is a convex quadratic minimization problem. There can be several ways of solving it -gradient descent for R e and solving for gradient of R e equals zero directly, which is a linear system of equations. We provide a simple bound for the total number of convex minimizations (or turns for each environment) in Algorithm 1 next. For each i ∈ V (defined in Section 3.4), compute the distance between the least square coefficients in the two environments |w * 1i − w * 2i | and find the least distance over the set V given as ∆ min = min i∈V |w * 1i − w * 2i | (following the definition of V this distance is positive). The bound on number of minimizations is 2w sup ∆ min . Suppose the data is generated from SEM in Assumption 2. Next, we show the final result that the NE based predictor, which we proved in Proposition 4 is closer to the OOD solution, is achieved by Algorithm 1. Proposition 5. If Assumption 2 holds with α e = 0 and Θ e an orthogonal matrix for each e ∈ {1, 2}, and Assumptions 3, 4, 5 hold, then the output of Algorithm 1,w + obeys w + − (γ, 0) < w ERM − (γ, 0) for all w ERM ∈ S ERM except over a set of measure zero (see footnote 2). Moreover, if all the components of vectors Θ 1 η 1 and Θ 2 η 2 have opposite signs, thenw + = (γ, 0). We use Theorem 3 to arive at the above result. We have shown through Theorem 2, Proposition 4, Theorem 3 and Proposition 5 that the NE based ensemble predictor of Γ c has good OOD properties and it can be learned by solving a sequence of convex quadratic minimizations. Extensions: In the Appendix, we extend the Theorem 3 to other BRD commonly used in literature (alternating gradient descent). We discuss convergence to NE when Assumption 6 (uncorrelatedness) may not hold. Finally, we also have a high level discussion on extending the entire setup to non-linear models and multiple environments. In this section, we run the regression experiments described in [Arjovsky et al., 2019] . We use the SEM in Assumption 2 with following configurations. • γ is a vector of ones with p dimensions, 1 p , which makes the ideal OOD model 1 p , 0 q . Each component of the confounder H e is drawn i.i.d. from N (0, σ 2 He ). σ H 1 = 0.2, σ H 2 = 2.0. We consider two configurations for Θ e and η e . i) Θ e = 0, η e = 0, thus there is full observability (F) as there are no confounding effects, ii) each component of Θ e and η e is drawn i.i.d. from N (0, 1) thus there is partial observability (P) as there are confounding effects. • Each component of α e is drawn i.i.d from N (0, 1). ε e ∼ N (0, σ 2 εe ) and each component of the vector ζ e is drawn from N (0, σ 2 ζe ). We consider two settings for the noise variances -Homoskedastic (HOM) σ ε 1 = 0.2 and σ ε 2 = 2.0, σ ζ 1 = σ ζ 2 = 1.0 and Heteroskedastic (HET) σ ζ 1 = 0.2 and σ ζ 2 = 2.0, σ ε 1 = σ ε 2 = 1.0. From the above, we gather that there are four possible combination of settings in which comparisons will be carried out -F-HOM, P-HOM, F-HET, P-HET. We use the following benchmarks in our comparison. IRM from [Arjovsky et al., 2019] , ICP from [Peters et al., 2015] , and standard ERM. Note in each of the cases we use a linear model. Our implementation for the data generation, IRM, ICP and ERM comes from https://github.com/facebookresearch/ InvariantRiskMinimization. 
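For concreteness, the following is a simplified sketch of the data generation implied by the configurations above. The dimension of the confounder, the distribution of X1, and other unspecified details are assumptions here; the exact generator used for the reported results is the one in the linked repository.

```python
import numpy as np

def sample_environment(e, n, p=5, q=5, setting="P-HOM", rng=None):
    """Sample n points from the SEM of Assumption 2 for environment e in {1, 2}.
    Simplified illustration of the F/P and HOM/HET configurations described above.
    Assumptions: the confounder dimension is set to p and X1 is standard normal
    (the text only requires X1 to be zero mean)."""
    rng = rng or np.random.default_rng(e)
    partial = setting.startswith("P")           # P: confounding present, F: fully observed
    hom = setting.endswith("HOM")
    sigma_H = {1: 0.2, 2: 2.0}[e]
    sigma_eps = {1: 0.2, 2: 2.0}[e] if hom else 1.0
    sigma_zeta = 1.0 if hom else {1: 0.2, 2: 2.0}[e]

    gamma = np.ones(p)                                          # ideal OOD model is (1_p, 0_q)
    theta = rng.normal(size=(q, p)) if partial else np.zeros((q, p))
    eta = rng.normal(size=p) if partial else np.zeros(p)
    alpha = rng.normal(size=q)                                  # anti-causal effect Y -> X2

    H = sigma_H * rng.normal(size=(n, p))
    X1 = rng.normal(size=(n, p))
    Y = X1 @ gamma + H @ eta + sigma_eps * rng.normal(size=n)
    X2 = np.outer(Y, alpha) + H @ theta.T + sigma_zeta * rng.normal(size=(n, q))
    return np.hstack([X1, X2]), Y
```

Calling sample_environment(1, n) and sample_environment(2, n) then yields the two training environments fed to C-LRG and to the baselines.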
All other implementation details can be found in the Appendix. The source code is available at https://github.com/IBM/IRM-games/tree/master/LRG_games. The performance is measured in terms of the model estimation error, i.e., the square of the distance from the ideal model (1 p , 0 q ). Before we discuss a comparison in all these settings, we look at a two dimensional experiment where p = q = 1 and the parameters are set to F-HOM. We carry out this comparison to illustrate several points. Firstly, we want to show why is ∞ constraint very important. Secondly, we want to show that the works when α e is non-zero, i.e., X 2 e is anti-causal (in the theory we had assumed α e = 0). We compare with following variants of the linear regression game (LRG) i) no constraints, which is the game U-LRG (Section 3.3), ii) regularize each R e with ∞ penalty (R ∞ -LRG), and iii) regularize each R e with 2 penalty (R 2 -LRG). In Table 1 , we show the estimated model against the respective method and the estimation error. Observe that C-LRG was able to outperform other variants of LRG. Moreover, C-LRG performed better than the other existing methods as well. w 12 (w 22 ) are the coefficients that model 1 (2) associates with feature 2, which is spuriously correlated. We plot the trajectories of the coefficients w 12 (w 22 ) of the models of each of the environments for the spurious features as the best response dynamics based training proceeds in Figure 1 . Observe how the ∞ constrained models saturate on opposite ends of the boundary and as a result they cancel the spurious factors out. In contrast for other models, we do not see such an effect. Lastly, see if we choose a larger bound w sup = 5 the coefficients reach the boundary they just take more steps than w sup = 2. Next, we move to a more elaborate comparison for the 10 (p = q = 5) dimensional setting from [Arjovsky et al., 2019] . In Figure 2a , 2b, we show the estimation error for F-HET and P-HET settings. In Figure 2c , 2d, we show the estimation error for F-HOM and P-HOM settings. Observe that in each of the settings how the C-LRG performs better than the rest or is close to the best approaches when the number of samples is more than 400. In this work, we developed a new game-theoretic approach to learn OOD solutions for linear regressions. To the best of our knowledge, we have provided the first algorithms for which we can guarantee both convergence and better OOD behavior than standard empirical risk minimization. Experimentally too we see the promise of our approach as it is either competitive or outperforms the state-of-the-art by a margin. We would like to thank Dr. Kush R. Varshney for the valuable discussions in the initial stages of this work. In this section, we provide the proofs to the propositions and theorems, and also provide other details on the experiments. We restate all the propositions and theorems for reader's convenience. In all our results, we use the following notation a is a vector, a i is the i th component of vector a, A is a scalar random variable, A is a vector random variable, A i is the i th component of the random variable A, A is a set, bold capitalized Greek letters e.g., Σ are used for matrices. I m is a m dimensional identity matrix and 1 m is a m dimensional vector of ones. A bar over a vector w,w, denotes the ensemble predictor (sum of predictor from the two environments). We restate Proposition 1 below. Proposition 6. 
If Assumption 1 holds and if the least squares optimal solution in the two environments are } describes all the pure strategy Nash equilibrium of U-LRG, Γ. • not equal, i.e., w * 1 = w * 2 , then U-LRG, Γ, has no pure strategy Nash equilibrium. Proof. We start with latter part of the proposition. Suppose there exists a pair w † ; loss inside the expectation is convex and expectation is a weighted sum over these losses). Let us compute the gradient of R e (w 1 , w 2 ) w.r.t w e . From the definition of pure strategy NE, it follows that w † 1 (w † 2 ) minimizes R 1 (·, w † 2 ) (R 2 (w † 1 , ·)). From the convexity of R 1 (·, w † 2 ) and R 2 (w † 1 , ·) it follows that ∇ w 1 |w 1 =w † 1 R 1 (w 1 , w † 2 ) = 0 and ∇ w 2 |w 2 =w † 2 R 2 (w † 1 , w 2 ) = 0. Therefore, we have In the above equation (8), we use Assumption 1 and the optimal solution defined in Section 3.3, w * e = Σ −1 e ρ e , for each e ∈ {1, 2}. From equations (8) it follows that w * 1 = w * 2 . Therefore, the existence of NE implies w * 1 = w * 2 or in other words if w * 1 = w * 2 implies NE does not exist. In the above we learned that w * 1 = w * 2 is a necessary condition for NE to exist. In the next part we show that this condition is sufficient as well. Suppose w * 1 = w * 2 = w * . Define any pointŵ 1 and another pointŵ 2 = w * −ŵ 1 . Compute ∇ w 1 |w 1 =ŵ 1 R 1 (w 1 ,ŵ 2 ) and ∇ w 2 |w 2 =ŵ 2 R 2 (ŵ 1 , w 2 ). Using the expression in equation (7) we get ∇ w 1 |w 1 =ŵ 1 R 1 (ŵ 1 ,ŵ 2 ) = 2Σ 1 (ŵ 1 +ŵ 2 ) − 2ρ 1 = 2Σ 1 w * − 2ρ 1 = 0 (From the optimality of w * for R 1 ) ∇ w 2 |w 2 =ŵ 2 R 2 (ŵ 1 ,ŵ 2 ) = 2Σ 2 (ŵ 1 +ŵ 2 ) − 2ρ 2 = 2Σ 2 w * − 2ρ 2 = 0 (From the optimality of w * for R 2 ) From the convexity of R 1 and R 2 it follows thatŵ 1 ,ŵ 2 simultaneously minimize R 1 and R 2 . Therefore, every suchŵ 1 andŵ 2 that sum to w * form a NE. This completes the proof. We restate Proposition 2 below. Proposition 7. If Assumption 2 holds with α e = 0 for each e ∈ {1, 2}, and Assumption 3 holds, then the least squares optimal solution for environment e is Proof. We derive the expression for the optimal predictor in the confounder only SEM in Assumption 2. Recall that the general expression for the least squares optimal predictor (defined in Section 3.3) is We use the SEM in Assumption 2 to derive an expression for Σ e . First observe that from Assumption 2, we have that E e [X 1 e ] = 0 and We divide Σ e into four smaller matrices Σ e1 = E e [X 1 e X 1,T e ], Σ e2 = E e [X 2 e X 2,T e ], Σ e12 = E e [X 1 e X 2,T e ] and Σ e21 = E e [X 2 e X 1,T e ]. From Assumption 2, we know (H e , ζ e ) ⊥ X 1 e and X 2 e ← Θ e H e + ζ e , which implies X 2 e ⊥ X 1 e . Therefore, from X 2 e ⊥ X 1 e and equation (12) it follows that In the above equation (14), we use E e H e H T e = I s and E e H e ζ T e = 0 s×q , which follow from Assumption 2. From Assumption 3, we know that σ 2 ηe > 0 and we use this observation in equation (14) to deduce that Σ e2 is positive definite. From equation (13) we can simplify Σ e into a block diagonal matrix written as diag Σ e1 , Σ e2 , where Σ e1 = E e X 1 e , X 1,T e and Σ e2 = E e X 2 e , X 2,T e . From Assumption 3, Σ e1 is positive definite and we showed above that Σ e2 is positive definite as well. Therefore, we can write the inverse of Σ e as another block diagonal matrix written as Next let us simplify ρ e = E e X 1 e Y e , E e X 2 e Y e . Combining equations (11)- (17), This completes the derivation. We first state a lemma needed for proving Theorem 2. Lemma 1. Suppose Assumptions 1 and 5 hold. 
Consider the case when w * 1 = w * 2 . In this case, at least one of the predictors in the NE of C-LRG w † 1 or w † 2 has to be on the boundary of the set, i.e. for at least one e ∈ {1, 2}, w † e ∞ = w sup . Moreover, if w † 1 ∞ < w sup ( w † 2 ∞ < w sup ) and w † 2 ∞ = w sup ( w † 1 ∞ = w sup ) then the ensemble predictor is optimal for environment e, i.e., w † = w * 2 (w † = w * 1 ). Proof. We start with the first part of the above lemma. In the first part, the only case that is excluded is when both the points forming the NE are in the interior, i.e., w † 1 ∞ < w sup and w † 2 ∞ < w sup . Denote w −e as the predictor used by the environment q ∈ {1, 2} \ {e}. We interhangeably use R e (w e , w −e ) and R e (w 1 , w 2 ). For environment e, from the definition of NE, it follows that w † e satisfies w † e ∈ arg min we∈W R e (w e , w † −e ). Note i) R e (w e , w † −e ) is a convex function in w e , and ii) the set W has a non-empty relative interior (Since w sup > 0). From these two conditions it follows that Slater's constraint qualification is satisfied, which implies strong duality holds [Boyd and Vandenberghe, 2004] . From strong duality, it follows that w † e and λ † e , where λ † e is the dual variable for the constraint w e ∞ ≤ w sup , satisfy the KKT conditions given as follows In the above ∂( w † e ∞ ) represents the subdifferential of · ∞ at w † e . If w † 1 ∞ < w sup and w † 2 ∞ < w sup , then λ † 1 and λ † 2 are both zero. As a result, we have for e ∈ {1, 2}, ∇ w † e R e (w † e , w † −e ) = 0. From the expression of the gradients in (7) we have for each e ∈ {1, 2} From equation (19) it follows that w * 1 = w * 2 , which contradicts the assumption w * 1 = w * 2 . This completes the proof for the first part of the Lemma. Next, we move to the latter part of the proof, which states that if w † e ∞ < w sup and w † −e ∞ = w sup , then the ensemble predictor is optimal for environment e, i.e.,w † = w * e . Since w † e ∞ < w sup , from the KKT conditions above in (18) we have that λ † e = 0, which implies that ∇ w † e R e (w † e , w † −e ) = 0. Using the expression for gradient in equation (7), we have that w † e +w † −e =w † = w * e . This completes the proof. We restate Theorem 2 for reader's convenience. Theorem 4. If Assumptions 1, 5, 6 hold, then the ensemble predictor,w † , constructed from the Nash equilibrium, Proof. Recall in Section 3.4, we divided the features {1, . . . , n} into two sets U and V. Without loss of generality assume that the first k components in X e belong to U and the next n − k components to be in V. Therefore, U = {1, . . . , k} and V = {k + 1, . . . , n}. Define X e+ = (X e1 , . . . , X ek ) and X e− = (X e(k+1) , . . . , X en ). We divide the weights in w e = (w e1 , . . . , w en ) into two parts where the weights associated with the first k components, X e+ , are w e+ = (w e1 , . . . , w ek ) and the weights associated with the next n − k components, X e− , are w e− = (w e(k+1) , . . . , w en ). Similarly, we divide the vector ρ e defined in Section 3.3 into ρ e+ and ρ e− . Define Σ e+ = E e X e+ X T e+ ] and define Σ e− = E e X e− X T e− ]. As a consequence of the Assumption 6, we can simplify the expression for Σ e as follows For each e ∈ {1, 2}, each feature component i ∈ {1, . . . , n} has a mean zero E e X ei = 0. Therefore, the variance in each feature component i ∈ {1, . . . , n} is σ 2 ei = E e X 2 ei . We can further simplify Σ e− . 
Using Assumption 6, we have that Σ e− is a diagonal matrix, which we write as We use equations (21) and (22) and the notation introduced above to simplify the risk as follows. Recall that w * e = Σ −1 e ρ e (defined in Section 3.3). From the above equations (21), (22) and Assumption 1, we get where w * e+ is the vector of the first k components in w * e , w * e− are the next n − k components in w * e , ρ ei , is the i th component of ρ e and σ 2 ei is the variance in X ei . Recall that the first k components comprise the set U, which is defined as the set where the features of the least squares coefficients are the same across environments, i.e., For each i ∈ V = {k + 1, . . . , n} define We use the above equations (26) and (27) to simplify the risks as follows min we∈W R e (w 1 , w 2 ) = min In the above W + = w | w ∈ R k , w ∞ ≤ w sup . From the above expression in equation (28), we see that the the optimization for environment e can be decomposed into separate smaller minimizations, which we analyze separately next. ∞ norm constraints allows to make the problem in equation (28) separable and for other norms such separability is not possible. Henceforth, we will look at each smaller minimization as a separate game between the environments. Let us consider the first minimization in equation (28) min Let us minimize the objective in equation (29) without imposing the constraint that w 1+ ∈ W 1+ Therefore, we have that if (w 1+ + w 2+ ) = w * e+ , then environment e achieves the minimum risk possible and cannot do any better. In fact, from equation (24) , then both environments are at the minimum and cannot do any better. Therefore, we know that all the elements in the set C = {w 1+ , w 2+ | w 1+ ∈ W + , w 1+ + w 2+ = w * 1+ } form a NE of C-LRG. From Assumption 5, we know that this set C is non-empty. Moreover, there are no points outside this set C which form a NE. If w 1+ + w 2+ = w * 1+ , then the gradient will not be zero for either of the environments and both would prefer to move to a point where their gradients are zero. Hence, in every NE, w 1+ + w 2+ = w * 1+ . If we use the expression in equation (6) and compute first k components it returns vector w * 1+ (we use the condition w * 1+ = w * 2+ to simplify the expression in equation (6)). This shows that the expression in equation (6) correctly characterizes the NE for the first k components that make up the set U. We now move to the remaining n − k components that make up the set V. Consider a component i ∈ V = {k + 1, . . . , n}. Environment e is interested in minimizing R ei defined in equation (26). Let us consider the i th component of the expression in equation (6) in Theorem 2 and rewrite the expression in terms of scalars. We divide the analysis into two cases. In the first case, the signs of w * 1i and w * 2i disagree, which implies 1 w * 1i w * 2i ≥0 is zero. In the the second case, the signs of w * 1i and w * 2i agree, which implies 1 w * 1i w * 2i ≥0 is one. Let us start with the first case. Without loss of generality say w * 1i < 0 and w * 2i > 0. Observe that (32) , w † 1i can be decreased and improve the utility for environment 1, which contradicts that w † 1i is NE. Supposew † i < 0, then from symmetry we can show that one of the environments will be able to increase the weight and improve its utility. Therefore, the only option that remainsw = 2(−w * 2i ) < 0 and if w † 2i < w sup environment 2 will want to increase w † 2i . 
Hence, the only solution left is for environment 1 to be at −w sup and environment 2 to be at w sup . Environment 1's (2's) risk decreases (increases) as it moves closer to its optimal point w * 1i (w * 2i ). When environment 2 uses w sup , environment 1's best response is to use −w sup as it brings the environment 1 the closest it can get to w * 1i . Therefore, (w sup , −w sup ) is a NE. This completes the first case, i.e., when the coefficients have opposite signs the coefficient of the NE based ensemble predictor for that component is 0, which is what equation (6) states. Next, consider the case when the signs of w * 1i and w * 2i agree. Let us consider the case when both have positive signs and the negative sign case will follow from symmetry. Suppose 0 < w * 1i < w * 2i . From Lemma 1, we know that there are three scenarios possible. In the first scenario, both w † 1i and w † 2i are on the same side of the boundary, say both are at From Assumption 5, 0 < w * 1i < w * 2i ≤ w sup . Thus for both e ∈ {1, 2}, from equation (33) it follows that decreasing w † ei from the current state would improve the utility thus contradicting that they form a NE. The other possibility is that the two are on the other sides of the boundary, which negative (from Assumption 5); thus prompting each player on the negative side of the boundary to increase the weight and improve its utility, which contradicts the fact that they form a NE. The other possibility arising out of Lemma 1 is w † 1i = w sup and w † 2i = w * 2i − w sup . In this case, the is positive implying environment 1 can decrease and improve its utility. Thus this state is not a NE. Therefore, the only remaining possibility is w † 2i = w sup and w † 1i = w * 1i − w sup . In this case, the is negative, environment 2 cannot increase the weight further as it is already at the boundary (playing w sup is a best response of environment 2 brings it closest to the desired w * 2i ). Hence, this state is a NE and the ensemble predictor is at w * 1i . If we suppose, w * 2i < w * 1i < 0. In this case, we can follow the exact same line of reasoning and arrive at the conclusion that the only NE isw † = w * 1i . We have analyzed all the possible cases when |w * 1i | < |w * 2i | and both w * 1i and w * 2i have the same sign. This completes the proof for the first term in the expression in equation (6) w The second term is same as the first term in equation (6) with the roles of environments swapped. Therefore, due to symmetry we do not need to work out the second term separately. This completes the analysis for all the cases in the equation (6) in Theorem 2. We restate Proposition 4 below. Proposition 8. If Assumption 2 holds with α e = 0 and Θ e an orthogonal matrix for each e ∈ {1, 2}, and Assumptions 3, 4, 5 hold, then w † − (γ, 0) < w ERM − (γ, 0) holds for all w ERM ∈ S ERM . 3 Moreover, if all the components of two vectors Θ 1 η 1 and Θ 2 η 2 have opposite signs, thenw † = (γ, 0). Proof. We first show that the Assumptions made in the above proposition imply that the Assumptions needed for Theorem 2 to be true hold. We show that Assumptions 2, 3 =⇒ Assumption 1 holds. 
X e 1 is zero mean (from Assumption 2) and E e Y e = γ T E e X 1 e + η T e E e H e + E e ε e = 0 E e X 2 e = α e E e Y e + Θ e E e H e + E e ζ e = 0 Thus E e X e = E e (X 1 e , X 2 e ) = 0 In the proof of Proposition 2, we had shown that when the data is generated from SEM in Assumption 2 Since Θ is an orthogonal matrix we have Both Σ e1 and diag[σ 2 ζe + 1 q ] are positive definite as a result Σ e is also positive definite. Therefore, Assumption 1 holds. The expression for the solution to the least squares optimal solution derived in equation (5) has two parts w inv e and w var e . Recall the definition of sets U and V from Section 3.4. The first p components corresponding to w inv e =⇒ {1, . . . , p} ⊆ U. The next q components corresponding to w var e =⇒ {p + 1, . . . , p + q} ⊇ V. We showed above that Σ e is a block diagonal matrix and the block corresponding to the feature components {p + 1, . . . , p + q} also equalling a diagonal matrix diag[σ 2 ζe + 1 q ]. Therefore, we can see that each feature component in {p + 1, . . . , p + q} is uncorrelated with any other feature component. Therefore, Assumption 6 also holds. Hence, all the Assumptions required for Theorem 2 also hold. We write the expression for least squares optimal solution in this case (from equation (5) We divide the NE based ensemble predictorw † into two halves: w † 1 is the vector of first p coefficients ofw † and w † 2 is the vector of next q coefficients ofw † . From Theorem 2 it follows that w † 1 = γ The next q components in the set {1, . . . , q} are computed as follows. For k ∈ {1, . . . , q}, (p + k) th ζe,k +1 , where Θ e η e k is the k th component of Θ e η e and σ 2 ζe,k is the k th component of σ 2 ζe . We first prove the latter part of the above proposition. If Θ 1 η 1 and Θ 2 η 2 have opposite signs, then for each k ∈ {1, . . . , q}, the sign of (p + k) th component of w * 1 and w * 2 are opposite. From Theorem 2 it follows thatw † p+k = 0. This holds for all k ∈ {1, . . . , q} and as a result we havē w † = γ, 0 . Now we move to the former part of the Proposition, which compares the NE based ensemble predictor to ERM's solution. ERM solves the following optimization problem min w∈R n×1 By putting the gradient of the above to zero, we get Substituting the expression for Σ e from equation (34) into equation (37) we get w ERM equals where ξ = 1 q π 1 σ 2 ζ 1 + 1 q + (1 − π 1 ) σ 2 ζ 2 + 1 q and a b is elementwise division of the two vectors a and b, and w ERM 1 = γ and w ERM 2 = (π 1 Θ 1 η 1 + (1 − π 1 )Θ 2 η 2 ) ξ. For k ∈ {1, . . . , q}, the k th component of w ERM 2 is given as Based on the ERM predictor (equation (38)) and NE-based ensemble predictor (equation (35)) correctly estimate the causal coefficients γ, i.e., they match in the first p coefficients. We focus on the latter q coefficients. The distance of ERM and NE based ensemble predictors are written Hence, we only need to compare the norm of w ERM 2 andw † 2 . From Assumption 4, we know that w var 1 = w var 2 , thus the two differ in at least one component. Consider a component m, where the two vectors w var 1 and w var 2 do not match. For simplicity, let us write Θ e η e m = ϑ e . Therefore, There are two possiblities -i) the signs of ϑ 1 σ 2 ζ 1 ,m +1 and ϑ 2 σ 2 ζ 2 ,m +1 do not match, and ii) the signs of ϑ 1 σ 2 ζ 1 ,m +1 and ϑ 2 σ 2 ζ 2 ,m +1 match. In case i), the magnitude of the corresponding coefficient of NE based predictor is 0. 
The magnitude for the ERM based predictor is given as π 1 ϑ 1 + 1 − π 1 )ϑ 2 π 1 σ 2 ζ 1 ,m + 1 + (1 − π 1 ) σ 2 ζ 2 ,m + 1 If π 1 ϑ 1 + 1 − π 1 )ϑ 2 = 0 (π 1 = ϑ 2 ϑ 2 −ϑ 1 ), then the coefficient of ERM based solution has same magnitude as NE based predictor, which is equal to zero. Therefore, except for when π 1 = ϑ 2 ϑ 2 −ϑ 1 , ERM is strictly worse than NE based ensemble predictor. In case ii), the the signs of ϑ 1 σ 2 ζ 1 ,m +1 and ϑ 2 σ 2 ζ 2 ,m +1 match. Let us consider the case when both are positive. Without loss of generality assume that 0 ≤ ϑ 1 σ 2 ζ 1 ,m +1 < ϑ 2 σ 2 ζ 2 ,m +1 . From Theorem 2, we know that the magnitude of the NE based predictor is equal to ϑ 1 σ 2 ζ 1 ,m +1 and the magnitude of ERM based predictor is π 1 ϑ 1 + 1 − π 1 )ϑ 2 π 1 σ 2 ζ 1 ,m + 1 + (1 − π 1 ) σ 2 ζ 2 ,m + 1 We take a difference of the magnitudes of the two and get π 1 ϑ 1 + 1 − π 1 )ϑ 2 π 1 σ 2 ζ 1 ,m + 1 + (1 − π 1 ) σ 2 ζ 2 ,m + 1 − ϑ 1 σ 2 ζ 1 ,m + 1 1 − π 1 ) ϑ 2 σ 2 ζ 1 ,m + 1 − ϑ 1 σ 2 ζ 2 ,m + 1 π 1 σ 2 ζ 1 ,m + 1 + (1 − π 1 ) σ 2 ζ 2 ,m + 1 σ 2 ζ 1 ,m + 1 Since ϑ 1 σ 2 ζ 1 ,m +1 < ϑ 2 σ 2 ζ 2 ,m +1 it follows that if π 1 ∈ [0, 1), then the above difference in equation (42) is positive. However, if π 1 = 1, then the difference is zero. Therefore, except for when π 1 = 1, ERM is strictly worse than NE based ensemble predictor. Lastly, the analysis for the case when both coefficients are negative also follows on exactly the above lines. Without loss of generality consider the case, ϑ 2 σ 2 ζ 2 ,m +1 < ϑ 1 σ 2 ζ 1 ,m +1 ≤ 0. In this case, NE based predictor will take the value ϑ 1 σ 2 ζ 1 ,m +1 (follows from Theorem 2) and its magnitude is − ϑ 1 σ 2 ζ 1 ,m +1 . The magnitude of ERM based predictor is − π 1 ϑ 1 + 1 − π 1 )ϑ 2 π 1 σ 2 ζ 1 ,m + 1 + (1 − π 1 ) σ 2 ζ 2 ,m + 1 We take a difference of the magnitudes of the NE based predictor and the ERM based predictor to get Since ϑ 2 σ 2 ζ 2 ,m +1 < ϑ 1 σ 2 ζ 1 ,m +1 it it follows that if π 1 ∈ [0, 1), then the above difference in equation (42) is positive. However, if π 1 = 1, then the difference is zero. Therefore, except for when π 1 = 1, ERM is strictly worse than NE based ensemble predictor. This completes the analysis for all the possible cases. For each component where the least squares optimal solution differ, we showed that ERM based predictor is worse than NE based predictor except over a set of measure zero over the probability π 1 . This completes the proof. We restate Theorem 3 below. Theorem 5. If Assumption 1, 5, 6 hold, then the output of Algorithm 1,w + , is Proof. From Theorem 2, we know that if Assumptions 1, 5, 6 hold, then the NE based ensemble predictor is given as In Algorithm 1, each environment plays the optimal action given the action of the others. Hence, by best responding to each other we hope the procedure would converge to NE based ensemble predictor in equation (45). We write a dynamic which seems simpler than the dynamic in Algorithm 1. However, we show that the two are equivalent. We index the iteration by t. Define w t e as the predictor for environment e at the end of iteration t. The ensemble predictor is given asw t = w t 1 + w t 2 . Each component of e's predictor is given as w t e = (w t e1 , . . . , w t en ) and for the other environment q ∈ {1, 2} \ {e} as w t −e = (w t −e1 , . . . , w t −en ). Environment e in its turn sees that the other environment is using a predictor w t−1 −e ; environment e updates the predictor by taking a step such that min w t e ∈W w t e + w t−1 −e − w * e 2 , i.e. 
the environment moves such that it gets closest to the optimal least squares solution. We can simplify this minimization as For t = 0, w t 1 = 0 and w t 2 = 0. The dynamic based on equation (46) is written as follows. For t ≥ 1 t = t + 1 In the above equations (47), (48), Π W represents the projection on the set W = {w s.t. w ∞ ≤ w sup }. In each iteration, only one of the environment updates the predictors. In the above dynamic, whenever an environment completes its turn to update the predictor, t is incremented by one. Before showing the convergence of this dynamic, we first need to establish that this dynamic is equivalent to the one stated in Algorithm, where when its the turn of environment e to update it minimizes the following min w t e ∈W R e (w t e , w t−1 −e ). (47) and (48) to dynamic in Algorithm 1 Recall from Section 3.4 and proof of Theorem 2, that we divide the feature components {1, . . . , n} into two sets, the first k components are in U and the next n − k components are in V. The two environments have the same least squares coefficients for components in U but have differing coefficients for points in V. For each e ∈ {1, 2}, w t e+ corresponds to the first k coefficient at the end of iteration t and w t e− corresponds to the next n − k coefficients at the end of iteration t. Recall the decomposition that we stated in equation (28). To arrive at the equation (28) we used Assumptions 1 and 6. We continue to make these assumptions in this theorem as well. Therefore, we can continue to use the decomposition in equation (28). For environment 2 we can write A decomposition identical to above equation (49) also holds for environment 1. From equation (46) and (49), we gather that for latter n − k components, the update rule in equations (47), (48) and the update rule in Algorithm 1 are equivalent for both the environments. We now show that both the rules are equivalent in the first k components as well. Consider iteration t = 1. If the environment 1 uses the update rule in equation (47), then the ensemble predictor is set to w * 1 (From Assumption w * 1 ∈ W is in the interior, the projection will be the point itself). Note that if environment 1 used min w t 1 ∈W R 1 (w t 1 , w t−1 2 ), then as well it will move the ensemble predictor to w * 1 (since w * 1 is the least squares optimal solution). Consider iteration t = 2. Suppose the environment 2 uses the update rule in equation (48). Define ∆ * = w * 2 − w * 1 and the i th component of ∆ * as ∆ * i . Given the environment 1 is at w * 1 the rule dictates that environment 2 should update the predictor to Π W [∆ * ]. The first k components of ∆ * would be zero as the two environments agree in these coefficients. Therefore, environment 2 will not move its predictor for the first k components and continue to be at 0. After this the two environments do not need to update the first k components as they have already converged. Suppose the environment 2 uses the update rule from Algorithm 1. Consider the first k components in which both the environments agree. Since environment 1 already is using w * 1+ , which is optimal for environment 2 as well, environment 2 will not move its predictor for first k components and continue to be at 0. After this the two environments do not need to update the first k components as they have already converged. Thus so far we have established that both dynamics in equations (47), (48) and the update rule in Algorithm 1 are equivalent. We have also shown the convergence in the first k components. 
We now focus on establishing the convergence for the next $n - k$ components that make up the set $\mathcal{V}$.

Convergence of the dynamics in equations (47) and (48). In the previous section, we showed that the dynamic in equations (47) and (48) is equivalent to the dynamic in Algorithm 1. While we have already shown the convergence of the first $k$ components in the set $\mathcal{U}$, we repeat the analysis for ease of exposition. We begin by showing how the dynamic in equations (47) and (48) plays out. Purely for clarity of exposition, in the dynamic displayed below we assume that the predictors of the environments remain in the interior of the set $\mathcal{W}$.

1. End of $t = 1$: $\bar{w}^t = w^*_1$; $e = 1$ plays $w^t_1 = w^*_1$, $e = 2$ plays $w^t_2 = 0$.
2. End of $t = 2$: $\bar{w}^t = w^*_2$; $e = 1$ plays $w^t_1 = w^*_1$, $e = 2$ plays $w^t_2 = \Delta^*$.
3. End of $t = 3$: $\bar{w}^t = w^*_1$; $e = 1$ plays $w^t_1 = w^*_1 - \Delta^*$, $e = 2$ plays $w^t_2 = \Delta^*$.
4. End of $t = 4$: $\bar{w}^t = w^*_2$; $e = 1$ plays $w^t_1 = w^*_1 - \Delta^*$, $e = 2$ plays $w^t_2 = 2\Delta^*$.
5. End of $t = 5$: $\bar{w}^t = w^*_1$; $e = 1$ plays $w^t_1 = w^*_1 - 2\Delta^*$, $e = 2$ plays $w^t_2 = 2\Delta^*$.
6. End of $t = 6$: $\bar{w}^t = w^*_2$; $e = 1$ plays $w^t_1 = w^*_1 - 2\Delta^*$, $e = 2$ plays $w^t_2 = 3\Delta^*$.

In the dynamic displayed above, we assumed that the predictors $w^t_1$ and $w^t_2$ remain in the interior only to illustrate that the two sequences $w^t_1$ and $w^t_2$ are monotonic. Observe that if a certain component of $\Delta^*$, say $\Delta^*_i$, is non-zero, the two sequences are strictly monotonic in that component. The sequences cannot grow unbounded, and at least one of them will eventually hit the boundary at $w_{\sup}$ or $-w_{\sup}$. Recall from the last section that the first $k$ components of $\Delta^*$, those associated with the set $\mathcal{U}$, are zero (from equation (25)). Hence, for the first $k$ components the sequences $w^t_1$ and $w^t_2$ converge at the end of $t = 1$. Therefore, we only need to focus on the remaining $n - k$ components comprising the set $\mathcal{V}$. Since the update rules in equations (47) and (48) are separable across the components, we focus on a single component, say $i$. We divide the analysis based on whether $w^*_{1i}$ and $w^*_{2i}$ have the same sign or not.

Suppose $w^*_{1i}$ and $w^*_{2i}$ have the same sign. Let us consider the case when both are positive (the negative case follows by symmetry since the dynamic starts at zero).

• Suppose $0 \le w^*_{1i} < w^*_{2i}$. If $0 \le w^*_{1i} < w^*_{2i}$ is plugged into equation (6), we obtain $w^*_{1i}$; our objective is thus to show convergence to $w^*_{1i}$. In this case $\Delta^*_i$, the $i$-th component of $\Delta^*$, is greater than zero. Observe from the dynamic that environment 2 hits the boundary first in this case, and since $\Delta^*_i > 0$, it hits the positive end, i.e., $w_{\sup}$. The best response of environment 1 is to play $\Pi_{\mathcal{W}}[w^*_{1i} - w_{\sup}]$. Since $0 \le w^*_{1i} \le w_{\sup}$, environment 1 uses the predictor $w^*_{1i} - w_{\sup}$, and the ensemble predictor takes the value $w^*_{1i}$. Environment 2 in the next step continues to play $\Pi_{\mathcal{W}}[w^*_{2i} - w^*_{1i} + w_{\sup}] = w_{\sup}$ and environment 1 continues to play $w^*_{1i} - w_{\sup}$. Hence, the predictors stop updating. Thus, at convergence the ensemble predictor achieves the value we wanted to prove, $w^*_{1i}$. Also, both environments best respond to each other, which implies that this state is a NE.

• Suppose $0 \le w^*_{2i} < w^*_{1i}$. If $0 \le w^*_{2i} < w^*_{1i}$ is plugged into equation (6), we obtain $w^*_{2i}$; our objective is thus to show convergence to $w^*_{2i}$. Observe from the dynamic that environment 1 hits the boundary first in this case, and since $\Delta^*_i < 0$, it hits the positive end, i.e., $w_{\sup}$.
The best response of environment 2 is to play $\Pi_{\mathcal{W}}[w^*_{2i} - w_{\sup}]$. Since $0 \le w^*_{2i} \le w_{\sup}$, environment 2 uses the predictor $w^*_{2i} - w_{\sup}$, and the ensemble predictor takes the value $w^*_{2i}$. Thus, just as in the case described above, both environments stop updating. Hence, the state $(w_{\sup}, w^*_{2i} - w_{\sup})$ is a NE and the final ensemble predictor is at $w^*_{2i}$, which is what we wanted to prove.

Suppose $w^*_{1i}$ and $w^*_{2i}$ have opposite signs.

• Consider the case $w^*_{1i} < 0 < w^*_{2i}$. If $w^*_{1i} < 0 < w^*_{2i}$ is plugged into equation (6), we obtain 0. In this setting, environment 2 moves towards $w_{\sup}$ and environment 1 moves towards $-w_{\sup}$. Suppose environment 2 hits the boundary first. In the next step, the best response of environment 1 is $\Pi_{\mathcal{W}}[w^*_{1i} - w_{\sup}] = -w_{\sup}$. Hence, both environments continue to stay at the boundary. Since both environments best respond to each other, the state $(-w_{\sup}, w_{\sup})$ is a NE and the final ensemble predictor is at 0, which is what we wanted to prove. The same argument applies if environment 1 hits the boundary first, and the same analysis applies to the case $w^*_{2i} < 0 < w^*_{1i}$.

We focused on one component $i$; the above analysis applies to all components $j \in \{k+1, \ldots, n\}$. This completes the proof. Note that the entire analysis is symmetric: it does not matter which environment moves first, and the analysis also extends to the case when the initialization is not zero.

We restate Proposition 5 below.

Proposition 9. If Assumption 2 holds with $\alpha_e = 0$ and $\Theta_e$ an orthogonal matrix for each $e \in \{1, 2\}$, and Assumptions 3, 4, 5 hold, then the output of Algorithm 1, $\bar{w}^+$, obeys $\|\bar{w}^+ - (\gamma, 0)\| < \|w^{ERM} - (\gamma, 0)\|$ for all $w^{ERM} \in S^{ERM}$, except over a set of measure zero (see footnote 2). Moreover, if all the components of the vectors $\Theta_1 \eta_1$ and $\Theta_2 \eta_2$ have opposite signs, then $\bar{w}^+ = (\gamma, 0)$.

Proof. In the above proposition, we make the same set of assumptions as in Proposition 4. In the proof of Proposition 4, we showed that these assumptions imply that Assumptions 1, 5, 6 hold. Since Assumptions 1, 5, 6 hold, it follows from Theorem 3 that the output of Algorithm 1 is equal to the NE based ensemble predictor given by equation (6). In Proposition 4, we already showed that this NE based ensemble predictor (equation (6)), which is the output of Algorithm 1, is closer to $(\gamma, 0)$ than the solution of ERM (except over a set of measure zero, defined in the proof of Proposition 4). We also showed that the NE based ensemble predictor is equal to $(\gamma, 0)$ when the signs of $\Theta_1 \eta_1$ and $\Theta_2 \eta_2$ are opposite. This completes the proof.

In this section, we describe a simple signed gradient descent based dynamic. The aim is to show that for simple variations of the dynamic proposed in Algorithm 1, we continue to have convergence guarantees. We define a signed gradient descent based version of the dynamic in equations (47) and (48) with step length $\beta$. For $t = 0$, $w^t_1 = 0$ and $w^t_2 = 0$. For $t \ge 1$, when it is environment $e$'s turn,
$$w^t_e = \Pi_{\mathcal{W}}\big[w^{t-1}_e + \beta \, \mathrm{sgn}(w^*_e - \bar{w}^{t-1})\big], \qquad w^t_{-e} = w^{t-1}_{-e}, \qquad \bar{w}^t = w^t_1 + w^t_2, \qquad t = t + 1.$$
In the above, sgn is the component-wise sign function, which takes the value 1 when the input is positive (including zero) and $-1$ when the input is negative; $\bar{w}^t$ is the ensemble predictor at the end of iteration $t$ and $w^t_e$ is the predictor of environment $e$ at the end of iteration $t$. Suppose we want to get within $\epsilon$ (per component) distance of the NE based predictor. Divide the components into two sets $E$ and $F$ defined as follows: $i \in E$ if and only if the least squares solutions are within $\epsilon$ distance, i.e.,
$|w^*_{1i} - w^*_{2i}| \le \epsilon$, and $i \in F$ if and only if the least squares solutions are separated by more than $\epsilon$, i.e., $|w^*_{1i} - w^*_{2i}| > \epsilon$. Let $\beta < \epsilon$. Let us analyze the dynamic for a component $i \in F$. We divide the analysis into two cases: $w^*_{1i}$ and $w^*_{2i}$ have the same sign, and $w^*_{1i}$ and $w^*_{2i}$ have opposite signs. Let us start with the same-sign case with both $w^*_{1i}$ and $w^*_{2i}$ positive (the negative case follows by symmetry). Consider the case $0 < w^*_{1i} < w^*_{2i}$. In this case, from the expression of the NE in equation (6), we would hope the dynamic can eventually achieve an ensemble predictor that stays within $\epsilon$ distance of $w^*_{1i}$. Since the dynamic starts at 0 and $\mathrm{sgn}[w^*_{1i}] = 1$ and $\mathrm{sgn}[w^*_{2i}] = 1$, the ensemble predictor $\bar{w}^t_i$ will, after some iterations, enter the interval $[w^*_{1i}, w^*_{2i}]$. Once the ensemble predictor enters the interval, the two predictors $w^t_{1i}$ and $w^t_{2i}$ push it in opposite directions, and since the step length is the same, the ensemble predictor does not move. This continues until environment 2 hits the positive boundary $w_{\sup}$. Once environment 2 hits the boundary, it stops updating and environment 1 pushes the ensemble predictor towards $w^*_{1i}$. Once the ensemble predictor is within $\beta$ distance of $w^*_{1i}$, it continues to oscillate around $w^*_{1i}$. Consider the case $0 < w^*_{2i} < w^*_{1i}$: the same analysis follows and the dynamic eventually oscillates around $w^*_{2i}$. Next, consider the case when $w^*_{1i}$ and $w^*_{2i}$ have opposite signs. In this case, from the expression of the NE in equation (6), we would hope the dynamic can eventually achieve an ensemble predictor that stays within $\epsilon$ distance of 0. Since the predictors start at zero, the two environments push in opposite directions. Eventually, both environments hit opposite ends of the boundary and stay there, which results in an ensemble predictor coefficient of zero.

Now let us analyze the game for components in $E$. We again carry out the analysis based on whether the signs agree or not. Consider the case $0 < w^*_{1i} < w^*_{2i}$. In this case, from the expression of the NE in equation (6), we would hope the dynamic can eventually achieve an ensemble predictor that stays within $\epsilon$ distance of $w^*_{1i}$. The dynamic starts at 0 with $\mathrm{sgn}[w^*_{1i}] = 1$ and $\mathrm{sgn}[w^*_{2i}] = 1$, and the ensemble predictor $\bar{w}^t_i$ may or may not enter the interval $[w^*_{1i}, w^*_{2i}]$. If it enters, the analysis is identical to the previous case with $i \in F$. If the ensemble predictor does not enter the interval, then it must overshoot it and move to the right of the interval, which implies that in the next step it is pulled back to the left of the interval. This sets the ensemble predictor in an oscillation around $w^*_{1i}$. The analysis for the case when $w^*_{1i}$ and $w^*_{2i}$ have opposite signs is the same as in the previous case.

In this section, we discuss whether we can still learn the NE if Assumption 6 is relaxed. We rely on the results in [Zhou et al., 2017] for this discussion. In [Zhou et al., 2017], the authors introduced a notion called variational stability. Consider the class of concave games. It was shown that if the set of Nash equilibria satisfies variational stability, then a mirror descent based learning dynamic (described in [Zhou et al., 2017]) converges to the NE. Next, we analyze variational stability for C-LRG. Define the gradient of the utility of environment $e$ as $v_e(w) = -\nabla_{w_e} R_e(w_e, w_{-e})$, where, recall, $w_e$ is the action of environment $e$, $w_{-e}$ is the action of the other environment, $R_e$ is the risk, and $w = (w_e, w_{-e})$.
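To make this object concrete, here is a minimal sketch that assumes the population risk of each environment is the quadratic least squares risk $R_e(w_e, w_{-e}) = \mathbb{E}[(y - (w_e + w_{-e})^\top x)^2]$, with second moments $\Sigma_e = \mathbb{E}[x x^\top]$ and $\rho_e = \mathbb{E}[x y]$; the names `Sigma`, `rho`, and the random test vectors are illustrative, not from the paper. The check at the end verifies numerically that $v_e$ is affine in the ensemble predictor, which is the kind of relation used below to connect the gradient at an arbitrary point to the gradient at a NE.

```python
import numpy as np

def utility_gradient(Sigma_e, rho_e, w_e, w_other):
    # v_e(w) = -grad_{w_e} R_e(w_e, w_{-e}) for the (assumed) quadratic risk
    # R_e = E[(y - (w_e + w_{-e})^T x)^2], with Sigma_e = E[x x^T], rho_e = E[x y].
    w_bar = w_e + w_other                      # ensemble predictor
    return -2.0 * (Sigma_e @ w_bar - rho_e)

# Illustrative check: v_e(w) - v_e(w_dagger) = -2 * Sigma_e @ (w_bar - w_bar_dagger).
rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
Sigma = A @ A.T + np.eye(n)                    # a positive definite second-moment matrix
rho = rng.normal(size=n)
w1, w2 = rng.normal(size=n), rng.normal(size=n)        # an arbitrary point
w1_d, w2_d = rng.normal(size=n), rng.normal(size=n)    # stand-in for a NE

lhs = utility_gradient(Sigma, rho, w1, w2) - utility_gradient(Sigma, rho, w1_d, w2_d)
rhs = -2.0 * Sigma @ ((w1 + w2) - (w1_d + w2_d))
print(np.allclose(lhs, rhs))                   # True
```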
Let us recall a characterization of the NE in terms of the gradients ([Zhou et al., 2017]): $w^\dagger$ is a NE if and only if, for each environment $e$ and every $w_e \in \mathcal{W}$, $v_e(w^\dagger)^T (w_e - w^\dagger_e) \le 0$ (equation (52)). Next, we relate the gradient at the NE, $v_e(w^\dagger)$, to the gradient at any other point; this relation is stated in equation (54), in which $\bar{w} = w_1 + w_2$ and $\bar{w}^\dagger = w^\dagger_1 + w^\dagger_2$. For establishing variational stability, we need to show that for each $w \in \mathcal{W} \times \mathcal{W}$ and for each NE $w^\dagger$, the inequality $\sum_e v_e(w)^T (w_e - w^\dagger_e) \le 0$ holds. We begin by analyzing the case $\Sigma_1 = \Sigma_2$. Substituting $\Sigma_1 = \Sigma_2$ in equation (54) and using the condition in equation (52) along with the fact that $\Sigma_1$ is positive definite, we get that $\sum_{e \in \{1,2\}} v_e(w)^T (w_e - w^\dagger_e) \le 0$, which implies that the set of NE of C-LRG is variationally stable. We now give an example of when $\Sigma_1 = \Sigma_2$ is satisfied. Consider the SEM in Assumption 2. If, between the two environments, the only parameters that vary are $\eta_e$ and the distribution of $\varepsilon_e$, and all the other parameters in the model are the same, then $\Sigma_1 = \Sigma_2$ is satisfied.

We now discuss what happens if we relax the assumption $\Sigma_1 = \Sigma_2$ made above. Consider the eigenvalue decomposition $\Sigma_2 - \Sigma_1 = \Omega \Lambda \Omega^T$; since $\Sigma_2 - \Sigma_1$ is a symmetric matrix, $\Omega$ can be chosen to be an orthogonal matrix and $\Lambda$ is a diagonal matrix of eigenvalues. Define the smallest eigenvalue of $\Sigma_2 - \Sigma_1$ as $\lambda_{\min}(\Sigma_2 - \Sigma_1)$. Define the transformation of a vector $w$ under $\Omega$ as $\tilde{w} = \Omega w$. Since $\Omega$ is orthogonal, $\|\tilde{w}\| = \|w\|$. We use these relationships to derive the bound in equation (55); in the last two inequalities in equation (55), we used the Cauchy-Schwarz inequality and the fact that norms do not change under orthogonal transformations, $\|\tilde{w}\| = \|w\|$. Next, we bound the corresponding term in equation (54), which yields equation (56); in the last inequality in equation (56), we used equation (55). If $\Sigma_2 - \Sigma_1$ is positive semi-definite with smallest eigenvalue equal to zero, then the term in equation (56) is bounded above by zero. Using this observation in equation (54), the condition for variational stability is satisfied. Note that the entire analysis is symmetric, and we can state the same result for the matrix $\Sigma_1 - \Sigma_2$. Therefore, if one of the matrices $\Sigma_2 - \Sigma_1$ or $\Sigma_1 - \Sigma_2$ is positive semi-definite with smallest eigenvalue equal to zero, then we obtain variational stability of the NE. As a result, we can use the convergence results in [Zhou et al., 2017] to guarantee that the NE will be learned.

In the main body of the paper, we discussed the results when the data is gathered from two environments. What happens if the data were gathered from multiple (more than two) environments? First, consider the game U-LRG from Section 3.3. The first result, in Proposition 1, states that when the least squares optimal solutions are not equal, there is no NE of U-LRG. When we move to multiple environments, the same proof techniques show that if any two environments do not agree on the least squares optimal solution, then no NE exists; for a NE to exist, all environments must have the same least squares optimal solution. Next, consider the game C-LRG from Section 3.4. With multiple (more than two) environments, we are guaranteed that a NE exists. How does Theorem 2 change for multiple environments? We extend Assumption 6 to state that any feature component that does not have the same least squares coefficient across all the environments is uncorrelated with the rest of the features. Suppose the environments are indexed by $\{1, \ldots, r\}$. For this discussion, let us focus on one feature component, say $i$.
Without loss of generality, assume that the environments are ordered in increasing order of the optimal least squares coefficient, i.e., if $e, f \in \{1, \ldots, r\}$ with $e \le f$, then $w^*_{ei} \le w^*_{fi}$. First, assume that $r$ is odd. Consider the median environment indexed $m = \frac{r+1}{2}$. The ensemble predictor's coefficient will be equal to the coefficient of the median environment, $\bar{w}^\dagger_i = w^*_{mi}$. In this case, in the NE, all environments with index $e > m$ play $w_{\sup}$, all environments with index $e < m$ play $-w_{\sup}$, and the median environment $m$ plays $w^*_{mi}$. Next, assume that $r$ is even. Consider the two median environments indexed $m = \frac{r}{2}$ and $m + 1$. If $w^*_{mi}$ and $w^*_{(m+1)i}$ have the same sign, then the NE based ensemble predictor's coefficient equals the one with the smaller absolute value; in this case, say $0 \le w^*_{mi} \le w^*_{(m+1)i}$, in the NE environment $m$ plays $w^*_{mi} - w_{\sup}$ and environment $m+1$ plays $w_{\sup}$. If $w^*_{mi}$ and $w^*_{(m+1)i}$ have opposite signs, then the NE based ensemble predictor's coefficient equals zero; in this case, say $w^*_{mi} < 0 \le w^*_{(m+1)i}$, in the NE environment $m$ plays $-w_{\sup}$ and environment $m+1$ plays $w_{\sup}$. For all the remaining environments other than $m$ and $m+1$, the actions are as follows: environments with index $e > m + 1$ play $w_{\sup}$ and environments with index $e < m$ play $-w_{\sup}$.

In Proposition 4, we analyzed linear SEMs and showed that the NE based ensemble predictor is closer to the OOD solutions than ERM. Proposition 4 relied on Theorem 2, which, as discussed, extends appropriately to the multi-environment setting. Hence, using the same proof techniques as in Proposition 4 together with the expression for the NE discussed above for multiple environments, we can show that the same result extends to multiple environments. In Theorem 3, we proved the convergence of the BRD dynamics to the NE; we can show convergence in this setup as well using the same ideas discussed in Sections 7.5 and 7.7. We have thus shown that Theorem 2, Proposition 4, and Theorem 3 extend to the multi-environment setting. Since Proposition 5 directly follows from Theorem 2, Proposition 4, and Theorem 3, Proposition 5 extends to the multi-environment setting as well.

In the main body of the paper, we discussed all the results for linear models. An important direction for future work is to consider non-linear models. A natural extension of linear models is to consider models from reproducing kernel Hilbert spaces (RKHS). In the standard analysis of kernel regression, the optimal predictors can be expressed as linear functions of the kernel evaluated at the different data points. Similar to the constrained linear regression game introduced in Section 3.4, we can introduce a constrained kernel regression game, where each environment's model choice is a predictor from an RKHS with a constraint on its norm. Owing to the similarity of the analyses of kernel regression and standard linear regression, we believe the type of analysis carried out here for linear regression games can serve as a useful building block for solving kernel regression games.

The experiments were done on a 2.3 GHz Intel Core i9 processor with 32 GB of memory (2400 MHz DDR4). We use linear models for all the methods. We carried out 10 trials for all the experiments and show the average performance in Figure 2. In the experiments for the 10-dimensional case (p = q = 5) shown in Figure 2, we use a bound of $w_{\sup} = 2$.
In Figure 1, we showed that provided the solution is contained in the search space, i.e., the realizability assumption (Assumption 5) is satisfied, the choice of the bound does not impact the solution as long as the number of training steps is sufficiently large. We use a stochastic gradient descent based best response dynamic to learn the NE; this dynamic is very similar to the one described in Section 7.7. For each environment $e \in \{1, 2\}$, let the loss for the current batch at the end of iteration $t$ be $\hat{R}_e(w^t_1, w^t_2)$ (the sample mean estimate of the loss over the current batch), and let $w^t_e$ be the model used by environment $e$ at the end of iteration $t$. The two environments alternate turns to update their models, i.e., in odd iterations $t$ environment 1 updates its model, and in even iterations $t$ environment 2 updates its model. Each environment, in its turn, takes a step based on the gradient of its loss over the batch with respect to its own model parameters. In its turn, environment 1 updates $w^t_1 = \Pi_{\mathcal{W}}\big[w^{t-1}_1 - \beta \nabla_{w_1} \hat{R}_1(w^{t-1}_1, w^{t-1}_2)\big]$, while environment 2 does not update its model, $w^t_2 = w^{t-1}_2$; then $t$ is incremented by 1. In the next turn, the same procedure is repeated with the roles of environments 1 and 2 reversed, i.e., environment 2 updates and environment 1 does not. We continue this cycle of updates for a fixed number of epochs; a short sketch of this update loop is given at the end of this section. In our experiments, we set $\beta = 0.005$, the batch size to 128, and the total number of epochs to 200 (each epoch consists of a number of iterations equal to the size of the training data divided by the batch size). For the implementation of IRM, we needed to change the cross-validation procedure in the implementation provided by [Arjovsky et al., 2019]. The cross-validation procedure in that implementation requires access to data from a separate validation environment with a different distribution. Since we only use two environments, we use the cross-validation procedure called the train-domain validation set procedure (defined in [Gulrajani and Lopez-Paz, 2020]), which requires splitting each training environment into a train portion and a validation portion; the validation splits are then combined and used as one validation set. We use a 4:1 split. Besides this change in the cross-validation procedure, the rest of the implementation comes from https://github.com/facebookresearch/InvariantRiskMinimization/. Below we provide tables (Tables 2, 3, 4, 5) containing the numerical values and the standard deviations associated with the model estimation errors shown in Figure 2. ICP can often be conservative in accepting a covariate as a direct cause, which is why in some rows the entry for ICP is 5.0 ± 0.0; it does not accept any covariate as a cause.
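For concreteness, the alternating stochastic-gradient best response described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the code used for the experiments: the arrays `X1, y1, X2, y2`, the uniform minibatch sampling, and the per-epoch step count are illustrative choices, while the step size 0.005, batch size 128, 200 epochs, and $w_{\sup} = 2$ mirror the values reported above.

```python
import numpy as np

def project_linf(w, w_sup):
    # Projection onto the l_inf ball W = {w : ||w||_inf <= w_sup}.
    return np.clip(w, -w_sup, w_sup)

def train_constrained_game(X1, y1, X2, y2, beta=0.005, w_sup=2.0,
                           batch_size=128, epochs=200, seed=0):
    # Alternating stochastic-gradient best response for two environments.
    # Each environment keeps its own linear model; on its turn it takes one
    # projected gradient step on its minibatch squared loss evaluated at the
    # ensemble predictor w1 + w2, while the other environment stays fixed.
    rng = np.random.default_rng(seed)
    n = X1.shape[1]
    w = [np.zeros(n), np.zeros(n)]
    data = [(X1, y1), (X2, y2)]
    steps_per_epoch = max(1, min(len(y1), len(y2)) // batch_size)
    t = 0
    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            e = t % 2                                  # whose turn it is
            X, y = data[e]
            idx = rng.choice(len(y), size=min(batch_size, len(y)), replace=False)
            w_bar = w[0] + w[1]                        # ensemble predictor
            residual = X[idx] @ w_bar - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)
            w[e] = project_linf(w[e] - beta * grad, w_sup)
            t += 1
    return w[0] + w[1]                                 # final ensemble predictor
```

A call such as `train_constrained_game(X1, y1, X2, y2)` then returns the learned ensemble coefficients for the two training environments.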
Invariant risk minimization games
Domain-adversarial neural networks
Invariant risk minimization
Local characterizations of causal Bayesian networks
Recognition in terra incognita
Analysis of representations for domain adaptation
Convex optimization
Invariant rationalization
A social equilibrium existence theorem
AI for radiographic COVID-19 detection selects shortcuts over signal
Statistics of robust optimization: A generalized empirical likelihood approach
The theory of learning in games
Domain-adversarial training of neural networks
Domain adaptation for large-scale sentiment classification: A deep learning approach
Covariate shift by kernel mean matching
In search of lost domain generalization
Invariant causal prediction for nonlinear models
Algorithms and theory for multiple-source adaptation
Information-geometric approach to inferring causal directions
Causal inference using the algorithmic Markov condition
Out-of-distribution generalization with maximal invariant predictor
Out-of-distribution generalization via risk extrapolation (REx)
Minimax statistical learning with Wasserstein distances
Domain adaptation by using causal inference to predict invariant conditional distributions
Agnostic federated learning
Causal diagrams for empirical research
Causal inference using invariant prediction: identification and confidence intervals
Causal inference by using invariant prediction: identification and confidence intervals
On causal and anticausal learning
Should I include this edge in my prediction? Analyzing the stability-performance tradeoff
Unshuffling data for improved generalization
Mirror descent learning in continuous games