Causal Homotopy

Sridhar Mahadevan

September 20, 2021

We characterize homotopical equivalences between causal DAG models, exploiting the close connections between partially ordered set (poset) representations of DAGs and finite Alexandroff topologies. Alexandroff spaces yield a directional topological space: the topology has a unique minimal basis consisting of one open set for each variable x, specified as the intersection of all open sets containing x. An Alexandroff space induces a (reflexive, transitive) preorder; an Alexandroff space satisfying the Kolmogorov T0 separation criterion, in which open sets distinguish variables, converts the preordering into a partial ordering. Our approach, broadly, is to construct a topological representation of posets from data, and then use the poset representation to build a conventional DAG causal model. We illustrate our framework by showing how it unifies disparate algorithms and case studies proposed previously. Topology plays two key roles in causal discovery. First, topological separability constraints on datasets have been used in several previous approaches to infer causal structure from observations and interventions. Second, a diverse range of graphical models used to represent causal structures can be represented in a unified way in terms of a topological representation of the induced poset structure. We show that the homotopy theory of Alexandroff spaces can be exploited to significantly reduce the number of possible DAG structures, shrinking the search space by several orders of magnitude.

Topology (Munkres, 1984a) has found extensive use in many areas of AI, machine learning and optimization.
The Hahn-Banach theorem is a topological result concerning separation of points from convex sets by hyperplanes, and the entire framework of Lagrange duality can be derived from this topological insight (Luenberger, 1997). The Hahn-Banach theorem is also the basis for the universal representation theorem in deep neural networks (Cybenko, 1989). Topological data analysis techniques, such as persistent homology, are playing an increasingly important role in different areas of machine learning (Edelsbrunner, 2007; Zomorodian and Carlsson, 2005). Graphical models have been extensively studied in artificial intelligence (AI), causal reasoning, machine learning (ML), physics, statistics, and many other fields (Koller and Friedman, 2009; Lauritzen, 1996; Pearl, 1989, 2019). Causal discovery (Spirtes et al., 2000) involves the construction of a causal model, for example a DAG graphical model structure together with the specification of a probability model, from observational or experimental data. A broad family of models, ranging from Bayes networks (Pearl, 1989) on directed acyclic graphs (DAGs) to more recent variants, such as acyclic directed mixed graphs (ADMGs) (Richardson, 2009), marginalized DAGs (mDAGs) (Evans, 2018) and hyperedge-directed graphical models (HEDGs) (Forre and Mooij, 2017), can be represented by finite Alexandroff spaces with different topological properties. For example, a DAG model imposes a T0 topology on a finite Alexandroff space, which induces a partial ordering σ on the variables V in the model, so that the function f_i determining the value of variable X_{σ_i} in the model is measurable given the values of the previous variables X_{σ_1}, . . . , X_{σ_{i-1}}. An acyclic directed mixed graph (ADMG) (Richardson, 2009) and a chain graph (Andersson et al., 1996), on the other hand, have both undirected and directed edges, which induce only a preordering on the set of variables.
Marginalized DAGs (mDAGs) (Evans, 2018) and HEDG (Forre and Mooij, 2017) models allow hyperedges between nodes, representing the effect of latent variables. These can be represented using topological constructions, such as non-Hausdorff cones C(X) or non-Hausdorff suspensions S(X) (Barmak, 2011). We propose a novel topological framework for causal inference, building an initial topological representation of a partially ordered set (poset) from data, prior to building a probabilistic graphical model from the poset (see Figure 1). To that end, we represent posets using the algebraic topology of finite Alexandroff spaces (Alexandroff, 1956; Barmak, 2011; May). Representing posets as topological spaces confers many computational advantages, such as the ability to combine multiple posets into a joint poset, and to use the algebraic homotopy theory (Barmak, 2011; May) of finite topological spaces to significantly reduce the search space of possible structures. We show how a wide variety of graphical models, from chain graphs (Lauritzen and Richardson, 2002) to DAGs (Pearl, 1989), can be topologically represented as posets in a finite Alexandroff space. We illustrate our approach using a real-world dataset of pancreatic cancer (Beerenwinkel and Sullivant, 2009; Diaz-Uriarte, 2017). Our primary goal is to illustrate how algebraic topology provides powerful tools for designing more scalable algorithms for structure discovery, which otherwise presents an intractable combinatorial search space. We also relate our approach to previous studies of causal discovery, showing how to classify them using the concept of intervention topology (Pearl, 2009; Spirtes et al., 2000; Eberhardt, 2008; Hauser and Bühlmann, 2012a; Kocaoglu et al., 2017; Mao-cheng, 1984; Tadepalli and Russell, 2021). Bernstein et al. (2020) develop a greedy poset-based algorithm for learning DAG models, but do not exploit poset topological properties.
Our approach builds on a key technical innovation: using a topological representation of partially ordered sets, in between the original datasets and the final probabilistic or statistical graphical model. The principal reason for explicitly modeling posets as topological spaces is that it allows us to exploit the rich algebraic theory of finite spaces (Barmak, 2011; May; McCord, 1966; Stong, 1966) to our computational advantage. For example, in modeling cancer disease progression, in addition to obtaining crucial timing information about mutations from clinical datasets on tumors and their associated genotypes, we can also exploit the disease pathways (Jones et al., 2008) that are known from medical research. Each source of information results in a different poset, and these posets can be combined using the algebraic topology of finite spaces. In addition, algebraic topology gives powerful tools for reducing the combinatorial search space of possible structures. We characterize homeomorphic equivalences among minimal poset models, and show that homotopical equivalences sharply reduce the number of structures that need to be examined during structure discovery. In the case of pancreatic cancer, for example, we can reduce the search space of possible structures by three orders of magnitude. We build the topology of posets of graphical models on the algebraic topology of finite Alexandroff spaces (Alexandroff, 1956). We show how a wide variety of graphical models, from chain graphs (Lauritzen and Richardson, 2002) to DAGs (Pearl, 1989), can be topologically embedded in a finite Alexandroff space. For a DAG G, simply compute the unique transitive closure graph G_tc, and define the topological model (M, ≤) by taking the open sets U_x ∈ T to be the descendants of a node x (including itself) in G_tc. To revert from a topological model M = (X, T) to its DAG representation, form the Hasse diagram of the partial order defined by x ≤ y if x ∈ U_y.
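The DAG-to-topology round trip described above can be sketched in a few lines of Python. This is our own illustrative sketch, not code from the paper: the edge set below is a hypothetical fragment of the COVID example, and the helper names are ours. Note that recovering the Hasse diagram returns the transitive reduction of the original DAG, which coincides with the DAG when it has no redundant edges.

```python
from itertools import product

def transitive_closure(adj):
    """adj maps node -> set of children; returns node -> set of strict descendants."""
    desc = {v: set(children) for v, children in adj.items()}
    changed = True
    while changed:
        changed = False
        for v in desc:
            grand = set().union(*[desc[c] for c in desc[v]]) if desc[v] else set()
            if not grand <= desc[v]:
                desc[v] |= grand
                changed = True
    return desc

def minimal_open_sets(adj):
    """U_x consists of x together with its descendants in the transitive closure."""
    desc = transitive_closure(adj)
    return {v: {v} | desc[v] for v in adj}

def hasse_diagram(opens):
    """Recover DAG edges (x, y) meaning x -> y: y is strictly below x
    in the induced order, with no element strictly in between."""
    le = lambda a, b: a in opens[b]            # a <= b  iff  a lies in U_b
    nodes = list(opens)
    edges = set()
    for x, y in product(nodes, nodes):
        if x != y and le(y, x):
            if not any(z not in (x, y) and le(y, z) and le(z, x) for z in nodes):
                edges.add((x, y))
    return edges

# Hypothetical fragment of the COVID DAG: AZV -> PF4 -> VITT
covid = {"AZV": {"PF4"}, "PF4": {"VITT"}, "VITT": set()}
opens = minimal_open_sets(covid)
edges = hasse_diagram(opens)
```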
We present a novel algorithmic paradigm for structure discovery that iterates between searching among topologically distinct structures and causally faithful structures. We show how this paradigm can be used to characterize many previous studies of causal discovery in terms of the concept of intervention topology: collections of subsets intervened on to determine directionality (Pearl, 2009; Spirtes et al., 2000; Eberhardt, 2008; Hauser and Bühlmann, 2012a; Kocaoglu et al., 2017; Mao-cheng, 1984; Tadepalli and Russell, 2021). This connection immediately suggests topological generalizations of these previous algorithms. We begin with a brief review of basic point-set topology, and then give a succinct characterization of finite space topologies. Topology (Munkres, 1984a) characterizes the abstract properties of arbitrary spaces that are equivalent under smooth deformations, usually represented as continuous invertible mappings called homeomorphisms. Formally, a general topological space (X, U) is characterized by a base space X, along with a collection U of "open" subsets of X closed under arbitrary union and finite intersection. Note the asymmetry in these restrictions. For example, if we define X = R to be the real line, and consider the open intervals (-1/n, 1/n) for n = 1, 2, . . ., their infinite intersection {0} is not an open set! Alexandroff (Alexandroff, 1956) pioneered the study of the subclass of topological spaces that are closed under both arbitrary union and arbitrary intersection. While our framework can potentially be extended to the non-finite case, for simplicity we restrict our presentation in this paper to finite Alexandroff spaces (Barmak, 2011; May; McCord, 1966; Stong, 1966). Note that finite topological spaces (X, U) are trivially closed under arbitrary unions and intersections, because there are only a finite number of open sets O ∈ U.
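Because everything here is finite, closure under unions and intersections, the T0 condition, and even the rapid growth in the number of topologies can all be checked by brute force. The sketch below is our own illustration, not from the paper: it enumerates every collection of subsets of a 3-element set, keeps the topologies, and counts the T0 ones and the homeomorphism classes. The counts 29, 19 and 9 match the known values for n = 3, illustrating the reduction obtained by working up to homeomorphism.

```python
from itertools import combinations, permutations

X = frozenset({0, 1, 2})
subsets = [frozenset(s) for r in range(4) for s in combinations(X, r)]

def is_topology(opens):
    """Closure under unions and intersections (finite = arbitrary here)."""
    if frozenset() not in opens or X not in opens:
        return False
    return all(a | b in opens and a & b in opens for a in opens for b in opens)

def is_T0(opens):
    """T0: any two distinct points are separated by some open set."""
    return not any(x != y and all((x in u) == (y in u) for u in opens)
                   for x in X for y in X)

def relabel(t, perm):
    """Apply a permutation of the points to every open set (a homeomorphism)."""
    m = dict(zip(sorted(X), perm))
    return frozenset(frozenset(m[x] for x in u) for u in t)

# Brute force over all 2^8 collections of subsets of X.
topologies = []
for r in range(len(subsets) + 1):
    for coll in combinations(subsets, r):
        s = set(coll)
        if is_topology(s):
            topologies.append(frozenset(s))

t0 = [t for t in topologies if is_T0(t)]
classes = {frozenset(relabel(t, p) for p in permutations(sorted(X)))
           for t in topologies}
print(len(topologies), len(t0), len(classes))   # 29 topologies, 19 T0, 9 classes
```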
However, what turns out to be surprising is that the particular construction used by Alexandroff in defining open sets applies even in the finite case, and results in spaces with considerable topological richness, even though they are finite. Definition 1. A finite Alexandroff topological space (or simply, finite space, in the remainder of the paper) (X, U) is a finite set X and a collection U of "open" sets, namely subsets of X, such that (i) ∅ (the empty set) and X are in U; (ii) any union of sets in U is in U; (iii) any intersection of sets in U is in U as well. We will often refer to a finite space simply by its elements X, leaving the topology implicit unless its character is important, in which case we will clarify it. The most common topologies on X are the discrete topology, where the collection of open sets U is just the powerset 2^X, and the trivial topology U = {∅, X}. The major contribution of this paper is the use of a specific topological representation of partially ordered sets based on Alexandroff spaces (Alexandroff, 1956); Alexandroff showed that finite topological spaces naturally define preordered and partially ordered sets. Two classic papers by McCord (McCord, 1966) and Stong (Stong, 1966) laid the foundations for much of the subsequent study of finite topological spaces. Detailed proofs of all the main theorems on finite topological spaces in this paper can be found in (Barmak, 2011; May; McCord, 1966; Stong, 1966). Remarkably, a key idea that is implicit in many causal discovery algorithms is the topological notion of separability (Beerenwinkel et al., 2007; Kocaoglu et al., 2017; Acharya et al., 2018), which intimately relates to the topology of the finite space. In order to construct a poset model from data, Beerenwinkel et al. (2007) assume that the dataset u : G → N specifies the number of observations of each genotype g. For example, Table 5 specifies the number of tumors that contain a specific set of gene mutations g.
The support S(u) is the set of non-zero coordinates of u, namely the genotypes that occur in the data. A crucial assumption here is that the dataset u separates the events e and f if there exists some genotype g ∈ S(u) such that g ∩ {e, f} ≠ ∅. Viewed more abstractly, this notion of separability implies that the underlying space has the T0 Kolmogorov topology. Kocaoglu et al. (2017) assume a separating set, which is essentially a restricted type of finite topological space. Definition 2. The neighborhood of an element x in a finite space X is a subset V ⊂ X such that x ∈ U for some open set U ⊂ V. • X is a Kolmogorov (or T0) finite space if each pair of points x, y ∈ X is distinguishable in the space, namely for each pair of distinct x, y ∈ X, there is an open set U ∈ U such that x ∈ U and y ∉ U. Alternatively, if x ∈ U if and only if y ∈ U for all U ∈ U, then x = y. • X is a T1 finite space if every singleton {x}, x ∈ X, is a closed set. • X is a T2 finite space, or a Hausdorff space, if any two distinct points have disjoint neighborhoods. It turns out that T1 finite spaces are not interesting, since the only T1 topology on a finite space is the discrete (powerset) topology. The most interesting finite spaces are those equipped with the T0 topology. Lemma 1. (May) If X is a T2 space, then it is a T1 space. If X is a T1 space, then it is a T0 space. The key concept that gives finite (Alexandroff) spaces their power is the definition of the minimal open basis. First, we introduce the concept of a basis in a topological space. Definition 3. A basis for the topological space X is a collection B of subsets of X such that • for each x ∈ X, there is at least one B ∈ B such that x ∈ B; • if x ∈ B′ ∩ B″, where B′, B″ ∈ B, then there is at least one B ∈ B such that x ∈ B ⊂ B′ ∩ B″. The topology U generated by the basis B is the set of subsets U such that for every x ∈ U, there is a B ∈ B such that x ∈ B ⊂ U.
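The separability assumption can be phrased directly as a T0 test on the observed genotypes. The sketch below is our own reading, under the assumption that two events are separated when some observed genotype contains exactly one of them (the open sets distinguishing the two points); the genotype counts are hypothetical, not taken from Table 5.

```python
def support(u):
    """Genotypes (frozensets of mutation events) observed at least once."""
    return {g for g, count in u.items() if count > 0}

def separates(u, e, f):
    """T0-style separation: some observed genotype contains exactly one of e, f."""
    return any((e in g) != (f in g) for g in support(u))

def dataset_is_T0(u, events):
    """The dataset induces a T0 space when every pair of distinct events is separated."""
    return all(separates(u, e, f) for e in events for f in events if e != f)

# Hypothetical tumor genotype counts over two mutation events.
u = {frozenset(): 5,
     frozenset({"KRAS"}): 3,
     frozenset({"KRAS", "TP53"}): 2}
```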
Note that the relation ≤ defined above is a preorder, because it is reflexive (clearly, x ∈ U_x) and transitive (if x ∈ U_y and y ∈ U_z, then x ∈ U_z). However, in the special case where the finite space X has a T0 topology, the relation ≤ becomes a partial ordering. This gives a topological way to model DAG models, which will play a crucial role in our framework. Lemma 3. A function f : X → Y from one finite space to another is continuous if and only if it is order-preserving. Lemma 4. If X is a T2 space, then it is a T1 space. If X is a T1 space, then it is a T0 space. Lemma 5. A function f : X → Y between two finite spaces is continuous if and only if it is order-preserving, meaning that x ≤ x′ for x, x′ ∈ X implies f(x) ≤ f(x′). Lemma 6. Let x, y be two comparable points in a finite space X. Then there exists a path from x to y in X, that is, a continuous map α : [0, 1] → X such that α(0) = x and α(1) = y. Theorem 1. If X is a finite topological space containing a point y such that the only open (or closed) subset of X containing y is X itself, then X is contractible. In particular, the non-Hausdorff cone C(X) is contractible for any X. Proof: Let Y = {*} denote the space with a single element, *. Define the retraction mapping r : X → Y by r(x) = * for all x ∈ X, and define the inclusion mapping i : Y → X by i(*) = y. Then r ◦ i = id_Y, and since the only open set containing y is X itself, y is comparable to every point of the induced order, so i ◦ r, the constant map at y, is homotopic to id_X; hence X is contractible. Lemma 7. If X is a finite Alexandroff space, then U_x is contractible. In particular, if X has a unique maximal point or unique minimal point, then X is contractible. We use a motivational example of causal discovery for treatment of patients during the COVID pandemic using vaccines (Greinacher et al., 2021; Schultz et al., 2021). In the COVID vaccination causal discovery problem, we are given a finite set of 5 variables X = {AZV, PF4, H, BC, VITT}, defined as follows: • AZV: This variable represents the administration of the AstraZeneca vaccine.
• PF4: A number of patients suffering from vaccine-induced abnormal blood clotting tested positive for heparin-induced platelet factor 4 (PF4). • Gender: Many of the patients who exhibited adverse effects to the COVID vaccine were disproportionately women, so gender may be a causal factor. • HIT: Heparin is a blood thinner used to prevent blood clots. Triggered by the immune system in response to heparin, HIT causes a low platelet count (thrombocytopenia). • VITT: This variable denotes whether patients suffered from a rare vaccine-related variant of spontaneous heparin-induced thrombocytopenia that the authors of these studies referred to as vaccine-induced immune thrombotic thrombocytopenia. Figure 1 illustrates a simple causal model for the COVID problem, represented both conventionally as a DAG, and in two alternative representations as a finite topological space (X, T), where X is represented by the two variables shown, and T is either a set of open sets, defined as the descendants of a node (including itself), or a set of closed sets comprising the ancestors of a node (including itself). In the open set parameterization, there is an arrow from node x to node y whenever x ∈ O_y, that is, when the node x is in the open set corresponding to node y. Figure 2 shows how latent variables are typically modeled in causal reasoning using acyclic directed mixed graphs (ADMGs), where the latent variable is represented by a dashed undirected edge connecting the two observable variables. Such latent variables can be captured in our finite space topological framework by the use of the non-Hausdorff cone construction, which is one of several ways of connecting two topological spaces.
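The non-Hausdorff cone that models such a latent variable is easy to construct explicitly. In the sketch below (our own illustration, with hypothetical variable names), the cone point * plays the role of the latent confounder: its minimal open set is the whole space, so every observable variable lies below it in the induced order, exactly like a common ancestor.

```python
def non_hausdorff_cone(opens_X, points_X, star="*"):
    """C(X): adjoin a point star whose only open neighborhood is the whole space.
    The open sets of C(X) are the open sets of X, plus C(X) itself."""
    cx = frozenset(points_X) | {star}
    opens = {frozenset(u) for u in opens_X}
    opens.add(cx)
    return opens

def minimal_open_set(opens, x):
    """U_x: the intersection of all open sets containing x."""
    return frozenset.intersection(*[u for u in opens if x in u])

# Two observable variables with the discrete topology, plus a latent cone point.
observables = {"a", "b"}
discrete = [set(), {"a"}, {"b"}, {"a", "b"}]
cone = non_hausdorff_cone(discrete, observables)
```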
Recall that the non-Hausdorff cone, merging a topological space X with {*}, yields the new space C(X), whose open sets are the open sets of X together with C(X) itself, so that the only open set containing the new point * is the whole space. To take a real-world example, two studies were recently published in the New England Journal of Medicine describing patients in Austria, Germany and Norway who developed an unexpected blood clotting disorder in reaction to their first dose of the AstraZeneca/Oxford COVID-19 vaccine (Greinacher et al., 2021; Schultz et al., 2021). Understanding causal pathways in such problems requires modeling the effects of tens of thousands of discrete and continuous variables, from the administration of the vaccine, heparin-induced platelet factors like PF4, and thrombocytopenia (blood clots) and its various causes, to the entire previous medical history of the patient. Clinicians have to weigh all these factors in deciding on a potential treatment (e.g., should heparin be given to a patient?). As mentioned above, every concept in a topological space must be defined in terms of the open (or closed) set topology, and that includes (path) connectivity. The crucial idea here is that connectivity is defined in terms of a continuous mapping from the unit interval I = [0, 1] to a topological space X. Remarkably, the upshot of this construction is that every graph-theoretic concept in causal models, e.g. separation and conditional independence, can be translated into properties of open sets in the topology. Definition 4. We call two points x, y ∈ X comparable if x ≤ y or y ≤ x. A fence in X is a sequence x_0, . . . , x_n of elements such that any two consecutive elements are comparable. X is order connected if for any two elements x, y ∈ X, there exists a fence starting in x and ending in y. Lemma 9. Let x, y be two comparable points in a finite space X. Then there exists a path from x to y in X, that is, a continuous map α : [0, 1] → X such that α(0) = x and α(1) = y. Lemma 10. Let X be a finite space.
The following are equivalent: (i) X is a connected topological space; (ii) X is an order-connected topological space; (iii) X is a path-connected topological space. To illustrate the notion of connectivity, Table 4 gives examples of connected and disconnected finite space topologies for a small three element space. We now give a purely topological characterization of separation and conditional independence in finite topological spaces, which draws upon equivalent notions in graphical models (Lauritzen and Richardson, 2002; Pearl, 1989), but is defined with respect to the open sets of the topology. Definition 5. Given a connected finite space X, with an induced (pre- or partial) ordering ≤, and subsets U, V, Z ⊂ X, the subset U is topologically blocked or d-separated from V given Z if every fence x_0, x_1, . . . , x_n from an element x_0 = u ∈ U to an element x_n = v ∈ V satisfies the following conditions: 1. If every consecutive pair of elements of the fence is of the form x_i ≤ x_{i+1} or x_{i+1} ≤ x_i without forming a collider, then some element x_k ∈ Z (this condition is equivalent to stating that a path all of whose edges are of the form x_i → x_{i+1} or x_i ← x_{i+1} must pass through the conditioning set). 2. For every collider in the fence, namely a triple of elements x_{i-1}, x_i, x_{i+1} with x_i ≤ x_{i-1} and x_i ≤ x_{i+1}, neither x_i nor any element of its minimal open set U_{x_i} lies in Z. This condition is the topological restatement of the standard collider condition in graphical models, where for a path to be blocked, no collider or any of its descendants can be in the conditioning set. Definition 6. Given a finite space X, and subsets U, V, Z ⊂ X, U is topologically conditionally independent (TCI) of V given Z if and only if every fence from an element u ∈ U to an element v ∈ V is topologically blocked with respect to the conditioning set Z. We now precisely define stable and solvable causal models, including both their Alexandroff topological structure, and a decomposable product probability measure constituting the parameters of the model.
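Returning to Lemma 10: its equivalence between topological and order connectivity yields a cheap decision procedure, since connectivity can be checked by breadth-first search over comparability fences. A minimal sketch of our own, using hypothetical three-point spaces in the spirit of Table 4:

```python
from collections import deque

def comparable(opens, x, y):
    """x and y are comparable when one lies in the other's minimal open set."""
    return x in opens[y] or y in opens[x]

def order_connected(opens):
    """Every pair of points is joined by a fence of pairwise-comparable steps.
    opens maps each point to its minimal open set U_x."""
    nodes = list(opens)
    seen, frontier = {nodes[0]}, deque([nodes[0]])
    while frontier:
        x = frontier.popleft()
        for y in nodes:
            if y not in seen and comparable(opens, x, y):
                seen.add(y)
                frontier.append(y)
    return len(seen) == len(nodes)

# a below both b and c: the fence b, a, c connects the space.
connected = {"a": {"a"}, "b": {"a", "b"}, "c": {"a", "c"}}
# c is incomparable to everything else: the space is disconnected.
disconnected = {"a": {"a"}, "b": {"a", "b"}, "c": {"c"}}
```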
We impose the condition that the product probability measure respect the underlying Alexandroff topology, namely its open (or closed) sets; the requirement of a particular factorization is translated into the requirement of a particular topology. Our formulation is related to the intrinsic model of decision making (Witsenhausen, 1975), which was recently adapted to causal inference (Heymann et al., 2020). Neither of these investigated Alexandroff topologies. Definition 7. A finite space causal model is defined as M = (U_α, F_α, I_α, (Ω, B, P)), where α ranges over X, a finite Alexandroff topological space. U_α is a non-empty set that defines the range of values that variable α can take. F_α is a sigma field (or algebra) of measurable sets for variable α. The triple (Ω, B, P) is a probability space, where B is a sigma field of measurable subsets of the sample space Ω. The information field I_α ⊂ F represents the "receptive field" of an element α ∈ X, namely the set of other elements β ∈ X whose values α must consult in determining its own value. We impose the restriction that the information field I_α respect the Alexandroff topology on X, so that I_α ⊂ F(U_α), where U_α is the minimal basic open set associated with element α ∈ X. Following structural causal models (Pearl, 1989), we can decompose the elements of the topological space into disjoint subsets X = U ⊔ V, where U represents "exogenous" variables that have no parents, namely α is exogenous precisely when I_α ⊂ F(∅), and V represents "endogenous" variables whose values are defined by measurable functions over exogenous and endogenous variables. Note that the probability space can be defined over the "exogenous" variables α ∈ U, in which case it is convenient to attach a local probability space (Ω_α, B_α, P) to each exogenous variable, where B_α ⊂ B. We define conditional independence with respect to the induced information fields over the open sets of the Alexandroff space. Definition 8.
Given the induced probability space over information fields in a topological finite space, a stochastic basis is a sequence of information fields G = I_1, . . . , I_n such that I_i ⊂ I_{i+1} for 1 ≤ i ≤ n − 1, and ∪_{i=1}^{n} I_i = F. Two such sequences G_1 and G_2 are conditionally independent given the base sigma algebra F if, for all subsets A ∈ G_1, B ∈ G_2, it follows that P(A ∩ B | F) = P(A | F) P(B | F). Definition 9. The decision field U = ∏_{α ∈ X} U_α defines the space of all possible values of the variables in the finite space causal model, where the cartesian product is interpreted as a map u : X → ∪_{α ∈ X} U_α such that u(α) ≡ u_α ∈ U_α. Definition 10. For any subset of elements B ⊂ X, let P_B denote the projection of the product ∏_{α ∈ X} U_α upon the product ∏_{β ∈ B} U_β; that is, P_B(u) is simply the restriction of u to the domain B. Definition 11. The product sigma field over ∏_{α ∈ B} U_α, denoted F(B), is the smallest sigma field such that P_B is measurable. Note that if B_1 ⊂ B_2, then F(B_1) ⊂ F(B_2). The finest sigma field is F(X) = ∏_{α ∈ X} F_α. Definition 12. A finite space causal model M is causally faithful with respect to the probability distribution P over M if every conditional independence in the topology, as defined in Section 2.3, is satisfied by the distribution P, and vice versa, every conditional independence property of P is satisfied by the topology. We can now formally define what it means to "solve" a causal finite space model M. We impose the requirement that each variable α ∈ X must compute its value using a function measurable on its own information field. Definition 13.
Let the policy function f_α of each element α ∈ X be constrained so that f_α : U × Ω → U_α is measurable on the product sigma field. Definition 14. The finite space causal model M = (X, U_α, F_α, I_α) is measurably solvable if for every ω ∈ Ω, the closed loop equations P_α(u) = f_α(u, ω) have a unique solution for all α ∈ X, where for a fixed ω ∈ Ω, the induced map M_γ : Ω → U is a measurable function from the measurable space (Ω, B) into (U, F). Definition 15. The finite space causal model M = (X, U_α, F_α, I_α) is stable if for every ω ∈ Ω, the closed loop equations P_α(u) = f_α(u, ω) are solvable by a fixed constant ordering Ξ that does not depend on ω ∈ Ω. Measurably solvable models capture the corresponding property in a structural causal model (U, V, F, P), which states that for any fixed probability distribution P defined over the exogenous variables U, each function f_i computes the value of variable x_i ∈ V, given the value of its parents Pa(x_i), uniquely as a function of u ∈ U. This allows defining the induced distribution P_u(V) over endogenous variables in a unique functional manner depending on some particular instantiation of the random exogenous variables U. Stable models are those where the ordering of variables is fixed. We now extend the notion of recursive causal models in DAGs (Pearl, 2009) to finite topological spaces. Definition 16. The finite space causal model M = (X, U_α, F_α, I_α) is a recursively causal model if there exists an ordering function ψ : X → Ξ_n, where Ξ_n is the set of all injective (1-1) mappings of (1, . . . , n) to the set X, such that for any 1 ≤ k ≤ n, the information field of variable α_k in the ordering Ξ_n is contained in the joint information fields of the variables preceding it: I_{α_k} ⊂ F(α_1, . . . , α_{k−1}). (1) Definition 17.
A causal intervention do(β = u_β) in a finite space topological model M = (X, U_α, F_α, I_α) is defined as the submodel M_β whose information fields I_α are exactly the same as in M for all elements α ≠ β, and the information field of the intervened element β is defined to be I_β ⊂ F(∅) × B_β. Note that since the only measurable function on F(∅) is the constant function, whose value depends on a random sample space element ω ∈ Ω_β, this generalizes the notion of causal intervention in DAGs, where an intervened node has all its incoming edges deleted. We now explain how to construct faithful topological embeddings of causal graphical models. The following lemma plays a fundamental role in constructing topologically faithful embeddings of graphical models. Lemma 11. (Barmak, 2011; May) A preorder (X, ≤) determines a topology U on the space X with the basis given by the collection of open sets U_x = {y | y ≤ x}. It is referred to as the order topology on X. The space (X, U) is a T0 space if and only if (X, ≤) is a partially ordered set (poset). As before, we can alternatively characterize finite space topologies by the closed sets F_x = {y | y ≥ x}. The unique minimal basis gives us a way of characterizing whether or not a finite space has the T0 topology. Lemma 12. (May) Two elements x, y ∈ X have the same neighborhoods if and only if U_x = U_y. Thus, a finite space X has the T0 topology if and only if U_x = U_y implies x = y. Proof: If x and y have the same neighborhoods, then trivially U_x = U_y. Conversely, if U_x = U_y, then if x ∈ U for some open set U, then U_y = U_x ⊂ U (recall that U_x is the intersection of all open sets that contain x), and hence y ∈ U. Similarly, if y ∈ U, the same argument shows x ∈ U. Thus, x and y have the same neighborhoods.
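Returning to the intervention of Definition 17: on the DAG representation it reduces to deleting every edge into the intervened node, after which the induced order can simply be recomputed. A minimal sketch of our own, with hypothetical variable names from the COVID example:

```python
def do_intervention(adj, beta):
    """do(beta): return the submodel's graph with every edge into beta deleted.
    adj maps each node to the set of its children."""
    return {v: set(children) - {beta} for v, children in adj.items()}

def ancestors(adj, x):
    """Strict ancestors of x, by reverse reachability."""
    parents = {v: {u for u in adj if v in adj[u]} for v in adj}
    seen, stack = set(), [x]
    while stack:
        for p in parents[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

# Hypothetical edges: AZV -> PF4 -> VITT.
covid = {"AZV": {"PF4"}, "PF4": {"VITT"}, "VITT": set()}
cut = do_intervention(covid, "VITT")
```

After the intervention, VITT retains its outgoing structure but loses all ancestors, mirroring the information-field surgery I_β ⊂ F(∅) × B_β.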
We state three theorems that show how to reduce several popular causal models to their faithful topological embedding. The same construction can be followed for all the other models in the literature as well. We focus on embedding the topology of a graphical model, leaving aside the parametric specification of a probability measure on the model (which we discuss in more depth in the appendix). Theorem 2. Every causal DAG graphical structure G = (V, E) defines a finite T0 Alexandroff topological space with a partial ordering. Proof: Define the elements of the topology X = V, the vertices representing the variables of the DAG G = (V, E). Construct the transitive closure G_tc of the DAG G. Define the partial ordering x ≤ y in the topological space if the variable x is a descendant of y in G_tc. Define the open sets of X as U_y = {x | x ≤ y}. Theorem 3. Every chain graph (Lauritzen and Richardson, 2002) structure G = (V, E) defines a finite Alexandroff topological space X with a preordering ≤. Proof: Once again, define X as the variables in the chain graph. Recall that in a chain graph G = (V, E), two nodes x and y are connected by an edge that is either directed, so x ← y or x → y, or undirected, written x − y. Define the ordering x ≤ y on the topology X if and only if there exists a path from x to y such that every comparable pair of nodes on this path is either of the form x_k ← x_{k+1} or of the form x_k − x_{k+1}. This ordering ≤ on X is a preordering, and hence defines a general Alexandroff finite space topology. Define the open sets of X as U_x = {y | y ≤ x}. Theorem 4. Every mDAG graphical model (Evans, 2018) or HEDG hyperedge-directed graphical model (Forre and Mooij, 2017) G = (V, E, H), where H is a set of hyperedges represented by an abstract simplicial complex, can be represented by a finite Alexandroff topological space X with a preordering ≤. Proof: Define the space X by the variables in the graphical model G = (V, E, H).
For the observable edges represented by E, we follow the same construction as in DAG models described above. Note that the hyperedges h ∈ H in effect represent an abstract simplicial complex. For example, in Table 4, the D_3 discrete topology on X = {a, b, c} can be represented as an mDAG model where the three observable variables a, b, c are connected only through one latent variable, whose effect on the observable variables is manifested by the hyperedge that constitutes a simplicial complex C defined by the non-empty power set of X. This simplicial complex C can be modeled as a non-Hausdorff cone C between the latent variable and the open set topology of the observable variables (see Section 3.2 below). A crucial strength of our topological framework is the ability to combine two topological spaces X and Y into a new space, which can generate a rich panoply of models. Here are a few of the myriad ways in which topological spaces can be combined (Munkres, 1984a). Table 1 illustrates some of these ways of combining spaces for a small finite space X comprised of just three elements. • Subspaces: The subspace topology on A ⊂ X is defined by the set of all intersections A ∩ U for open sets U of X. • Quotient topology: The quotient topology on Y, defined by a surjective mapping q : X → Y, is the set of subsets U ⊂ Y such that q^{-1}(U) is open in X. • Union: The topology of the union of two spaces X and Y is given by their disjoint union X ⊔ Y, which has as its open sets the unions of the open sets of X and those of Y. • Product of two spaces: The product topology on the cartesian product X × Y is the topology with basis the "rectangles" U × V of an open set U in X with an open set V in Y. • Wedge sum of two spaces: The wedge sum is the "one point" union of two "pointed" spaces (X, x_0) and (Y, y_0), defined by X ∨ Y = (X ⊔ Y)/(x_0 ∼ y_0), the quotient space of the disjoint union of X and Y in which x_0 and y_0 are identified.
• Smash product: The smash product topology is defined as the quotient X ∧ Y = (X × Y)/(X ∨ Y).

Like the number of possible DAG structures, the number of possible finite space topologies grows extremely rapidly. However, there are powerful tools in algebraic topology, such as homotopies, which characterize equivalences among spaces (Munkres, 1984a). In particular, Table 2 shows that by exploiting homotopy equivalences, we can save over three orders of magnitude in searching for an appropriate poset model over naive search. Given that evolutionary processes, such as pancreatic cancer, may potentially involve multiple thousands of elements (genes), the savings may be very significant. Of course, it is crucial to combine the savings from domain knowledge, as provided in Table 5 and Table 4, with that provided by efficient enumeration of poset topologies under homeomorphic equivalences.

Definition 18. A topological space X is contractible if the identity map id_X : X → X is homotopically equivalent to the constant map f(x) = c for some c ∈ X.

For example, any convex subset A ⊂ R^n is contractible. Let f(x) = c, c ∈ A, be the constant map. Define the homotopy H : A × I → A as H(x, t) = tc + (1 − t)x. Note that at t = 0 we have H(x, 0) = x, that at t = 1 we have H(x, 1) = c, and that since A is a convex subset, the convex combination tc + (1 − t)x ∈ A for any t ∈ [0, 1].

Theorem 5. If X is a finite topological space containing a point y such that the only open (or closed) subset of X containing y is X itself, then X is contractible. In particular, the non-Hausdorff cone C(X) is contractible for any X.

Proof: Let Y = { * } denote the space with a single element * . Define the retraction mapping r : X → Y by r(x) = * for all x ∈ X, and define the inclusion mapping i : Y → X by i( * ) = y. Then r ∘ i = id_Y, and i ∘ r is homotopic to id_X: since the only open set containing y is X itself, y is a maximum of the induced preorder, so the identity map is pointwise below the constant map at y, and comparable maps between finite spaces are homotopic. Hence X is contractible.

The following lemma is of crucial importance in Section 4, where we will define beat points, elements of a topological space that can be removed, reducing model size.

Definition 19.
A point x in a finite Alexandroff topological space X is maximal if there is no y > x, and minimal if there is no y < x.

Lemma 13. If X is a finite Alexandroff space, then U_x is contractible. In particular, if X has a unique maximal point or unique minimal point, then X is contractible.

Definition 20. Let f, g : X → Y be two continuous maps between finite space topologies X and Y. We say f is homotopic to g, denoted f ≃ g, if there exists a continuous map h : X × [0, 1] → Y such that h(x, 0) = f(x) and h(x, 1) = g(x). In other words, there is a smooth "deformation" between f and g, so we can visualize f being slowly warped into g. Note that ≃ is an equivalence relation, since f ≃ f (reflexivity), if f ≃ g then g ≃ f (symmetry), and finally f ≃ g, g ≃ h =⇒ f ≃ h (transitivity).

Definition 21. A map f : X → Y is a homotopy equivalence if there exists another map g : Y → X such that g ∘ f ≃ id_X and f ∘ g ≃ id_Y, where id_X and id_Y are the identity mappings on X and Y, respectively.

In this section, we describe a number of algorithms for constructing causal poset models. Our approach is intended to highlight the important role played by topological constraints, which were implicit in many previous studies. We use the domain of cancer genomics to illustrate how topological constraints on datasets make it possible to efficiently learn the structure and parameters of a causal model from observational data (Beerenwinkel and Sullivant, 2009; Beerenwinkel et al., 2007, 2006; Gerstung et al., 2011). A greedy algorithm for learning causal poset models is described in (Bernstein et al., 2020), but it does not fully exploit the algebraic topology of posets for efficient enumeration. We show that the space of causal structures can be significantly pruned by exploiting the algebraic topology of finite spaces, in particular using homeomorphisms among topologically equivalent finite space models.
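Before turning to the algorithms, the constructions above can be made concrete in a few lines of code. The following is a minimal sketch (our own illustration, not the paper's implementation) of Theorem 2's construction of the minimal open sets U_y from a DAG, together with the maximal and minimal points of Definition 19; for a T_0 space, x ≤ y exactly when U_x ⊆ U_y.

```python
def alexandroff_opens(vertices, edges):
    """Theorem 2: x <= y iff x is a descendant of y in the transitive
    closure of the DAG; the minimal open set of y is U_y = {x : x <= y}."""
    desc = {v: {v} for v in vertices}   # descendants of v, including v
    changed = True
    while changed:                      # naive transitive closure
        changed = False
        for (a, b) in edges:            # edge a -> b: b is a child of a
            if not desc[b] <= desc[a]:
                desc[a] |= desc[b]
                changed = True
    return {y: frozenset(desc[y]) for y in vertices}

def maximal_points(opens):
    """Definition 19: x is maximal iff no other point's minimal open set
    strictly contains U_x (i.e., there is no y > x)."""
    return {x for x in opens if not any(opens[x] < opens[y] for y in opens)}

def minimal_points(opens):
    """Definition 19: x is minimal iff U_x strictly contains no other U_y."""
    return {x for x in opens if not any(opens[y] < opens[x] for y in opens)}
```

For the chain a → b → c this yields U_a = {a, b, c}, U_b = {b, c}, U_c = {c}, with unique maximal point a and unique minimal point c; by Lemma 13, each U_x has the unique maximal point x and is therefore contractible.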
First, we show that many previous studies of causal discovery from interventions, including the conservative family of intervention targets (Hauser and Bühlmann, 2012a), path queries (Bello and Honorio, 2018), and separating systems of finite sets or graphs (Eberhardt, 2008; Hauser and Bühlmann, 2012a; Kocaoglu et al., 2017; Katona, 1966; Mao-cheng, 1984), can all be viewed as imposing an intervention topology. Table 3 classifies a few previous studies in terms of the induced intervention topology. If no experiments are allowed, the intervention topology is simply the trivial topology. If X = I ⊔ (X \ I), where I is the intervention target (Eberhardt, 2008), then the intervention topology is disconnected. (Shpitser and Tchetgen, 2016) study single node interventions, which can be viewed as a T_1 intervention topology where singleton sets are closed. (Hauser and Bühlmann, 2012a) introduce the idea of a conservative family I of intervention targets, meaning a family of (open) subsets of variables in a causal model such that for every variable x ∈ X, there exists an I ∈ I such that x ∉ I. This is closely related to the idea of Alexandroff topologies where elements have distinguishable neighborhoods, and each intervention target I ∈ I defines a neighborhood. A closely related notion is that of separating systems of finite sets as intervention targets (Eberhardt, 2008; Hauser and Bühlmann, 2012a; Kocaoglu et al., 2017), or separating systems of graphs (Mao-cheng, 1984; Hauser and Bühlmann, 2012b). (Kocaoglu et al., 2017) used antichains, a partitioning of a poset into subsets of non-comparable elements. (Bello and Honorio, 2018) use path queries, which can be viewed as chains. Finally, (Tadepalli and Russell, 2021) used leaf queries on tree structures, where none of the interior nodes can be intervened on. We first introduce the notion of a separating system, which is a special case of the T_0 separation axiom of finite Alexandroff spaces.

Definition 22.
A separating system on a finite set X is a collection of subsets {U_1, ..., U_m} such that for every pair of elements x, y ∈ X, there is a set U_k such that either x ∈ U_k, y ∉ U_k, or alternatively, x ∉ U_k, y ∈ U_k. An (m, n) strongly separating system further requires, for every such pair, a pair of sets U_i, U_j such that x ∈ U_i, y ∉ U_i and x ∉ U_j, y ∈ U_j.

Definition 23. Given a finite space Alexandroff topology (X, T), the T_0-topogenous matrix A (Shiraki, 1969) associated with it is defined as the m × n binary matrix with A(i, j) = 1 if x_j ∈ U_i, and A(i, j) = 0 otherwise. In words, each row i lists the elements that are contained in the open set U_i, and each column j lists the open sets that element x_j belongs to.

Algorithm 1: Learn a causally faithful poset from interventions on separating sets. A conditional independence oracle is also assumed.
Output: Causally faithful poset P that is consistent with the conditional independences in the data.
begin
  Set the basic closed sets F_e ← ∅.
  repeat
    Select an open separating set g ∈ T and intervene on g.
    for e ∈ g, f ∉ g do
      Use samples and the CI oracle to test whether (e ⊥ f)_{M_g} on dataset D. If the CI test fails, then set F_e ← F_e ∪ {f}, because f is an ancestor of e.
    end
  until convergence
  Define the relation e ≤ f if f ∈ F_e, for all e, f ∈ E, and compute its transitive closure.
  Return the poset P = (E, ≤), where ≤ is the induced relation on the poset P.
end

Theorem 6. Given a finite space Alexandroff topology (X, T), the T_0-topogenous matrix A associated with it defines a separating system.

Proof: Note that in a T_0 finite space Alexandroff topology, each element is distinguished by a unique neighborhood. Consequently, the minimal open sets U_x and U_y associated with distinct x, y ∈ X must be distinct, and if U_x = U_y, then trivially x = y. Consequently, the topogenous matrix defines a separating system for the topology (X, T). Similarly, the closed sets F_x in Algorithm 1 also define separating sets.
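The separating-system property of Definition 22, the topogenous matrix of Definition 23, and the conservative-family condition discussed above are all directly checkable. The following is a small sketch (our illustration, with our own function names, not the paper's code):

```python
def is_separating(elements, family):
    """Definition 22: every pair of distinct elements is split
    by at least one set in the family."""
    return all(any((x in U) != (y in U) for U in family)
               for i, x in enumerate(elements)
               for y in elements[i + 1:])

def topogenous_matrix(elements, family):
    """Definition 23: A[i][j] = 1 iff element x_j belongs to open set U_i."""
    return [[1 if x in U else 0 for x in elements] for U in family]

def is_conservative(elements, targets):
    """Conservative family of intervention targets (Hauser and Buhlmann,
    2012a): every variable lies outside at least one target."""
    return all(any(x not in I for I in targets) for x in elements)
```

For the minimal basis {U_a, U_b, U_c} = {{a, b, c}, {b, c}, {c}} of a T_0 chain, `is_separating` returns True, illustrating Theorem 6, while the single-set family {{a, b, c}} separates nothing.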
Algorithm 2: Given a causally faithful T_0 topological model (X, ≤, T) and access to a conditional independence testing oracle, compute the causal (observable) DAG model G = (V, E).
Input: T_0 causally faithful topology M = (X, ≤, T), with ≤ defining a transitively closed partial ordering.
begin
  Set E ← ∅.
  repeat
    Select an antichain set T_i and intervene on T_i.
    for x ∈ T_i, y ∉ T_i do
      Use samples and the CI oracle to test whether (x ⊥ y)_{M_{T_i}} on a dataset. If the CI test fails, then set E ← E ∪ {(x, y)}.
    end
  until convergence
  Set V = X, and return the observable causal DAG G = (V, E).
end

Theorem 7. Algorithm 1 requires only |T| interventions, and conditional independence tests on samples obtained from each post-interventional distribution, to find a statistically consistent T_0 topological model. If there are O(log n) separating sets, the algorithm requires O(log n) interventions.

Proof: If we intervene on the separating open set U and find an element y ∉ F_x that is statistically dependent on x, then we update F_x to include y (since x needs to "consult" y in determining its value). The bound O(log n) in (Kocaoglu et al., 2017) assumes that there are up to 2 log n sets in the original separating system. It has been argued in (Bello and Honorio, 2018) that interventions on multiple variables, such as used here and in the previous work (Kocaoglu et al., 2017), can potentially require an exponential number of experiments: if, for example, a separating set has O(n) distinct elements, and each node is a binary variable, each node requires two experiments (setting it to both 0 and 1). There is an inherent trade-off between the size of each separating set and the number of separating sets.

Algorithm 2 is a generalization of Algorithm 1 in the recent paper by (Kocaoglu et al., 2017), who construct a DAG by doing interventions on the antichains of posets, and extend this approach to discover causal models with latent variables as well.
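Algorithm 2 intervenes on the antichain sets of the poset. A small sketch (our own illustration, assuming the partial order is supplied as a `leq` predicate) of partitioning a finite poset into antichain levels, the decomposition whose optimality is guaranteed by Mirsky's theorem below, is:

```python
def antichain_levels(elements, leq):
    """Partition a finite poset into antichains by height:
    level(x) = length of the longest chain ending at x.
    Assumes leq is a reflexive, antisymmetric, transitive predicate."""
    level, remaining = {}, set(elements)
    while remaining:
        for x in list(remaining):
            # x is minimal among the remaining elements, so every element
            # strictly below x already has a level assigned.
            if not any(y != x and leq(y, x) for y in remaining):
                below = [level[y] for y in elements if y != x and leq(y, x)]
                level[x] = 1 + max(below, default=0)
                remaining.discard(x)
    levels = {}
    for x, h in level.items():
        levels.setdefault(h, set()).add(x)
    return [levels[h] for h in sorted(levels)]
```

On the diamond poset a ≤ b ≤ d, a ≤ c ≤ d (with b, c incomparable), this produces the three antichains {a}, {b, c}, {d}, matching the height 3 of the longest chain a ≤ b ≤ d.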
(Acharya et al., 2018) propose a related approach for inferring causal models, which does not require conditional independence testing, but instead uses a sample-efficient testing methodology based on squared Hellinger distances. Algorithm 2 constructs the observable DAG model based on antichain sets, namely the sets of incomparable elements at each level of the partial ordering. Antichain sets can be shown to be in bijective correspondence with the open sets of an Alexandroff T_0 topology.

Theorem 8. Mirsky's theorem (Mirsky, 1971): The height of a T_0 topology causal model (X, ≤, T) is defined to be the maximum cardinality of a chain, a totally ordered subset of the given partial order. For every partially ordered T_0 causal model (X, ≤, T), the height also equals the minimum number of antichains, namely subsets in which no pair of elements are ordered, into which the set may be partitioned.

Theorem 9. Algorithm 2 requires O(h) interventions and conditional independence tests on samples obtained from the post-interventional distributions, where h is the height of the T_0 topology causal model (X, ≤, T).

Note that Algorithms 1 and 2 are topological generalizations of Algorithms 1 and 2 in (Kocaoglu et al., 2017). The remaining Algorithms 3 and 4 in (Kocaoglu et al., 2017), on learning a latent variable DAG model, can be generalized as well (see the appendix). Next, we turn to the fundamental problem of how to efficiently enumerate posets, which is a key requirement for scaling many causal discovery algorithms (Kocaoglu et al., 2017; Acharya et al., 2018; Bernstein et al., 2020; Beerenwinkel et al., 2006).

Definition 24. For every T_0 finite space model M with a partial ordering ≤, define its associated Hasse diagram H_M as a directed graph which captures all the relevant order information of M.
More precisely, the vertices of H_M are the elements of M, and the edges of H_M are such that there is a directed edge from x to y whenever y ≤ x, but there is no other vertex z such that y ≤ z ≤ x. General preordered finite spaces can be reduced to partially ordered T_0 topologies up to homotopy equivalence.

Theorem 10. (Stong, 1966) Let (X, T) be an arbitrary finite space model with an associated preordering ≤. Let X_0 represent the quotient topological space X/∼, where x ∼ y if x ≤ y and y ≤ x. Then X_0 is a homotopically equivalent topological model with T_0 separability, and the quotient map q : X → X_0 is a homotopy equivalence. Furthermore, X_0 induces a partial ordering on the elements x ∈ X_0.

A key idea in the enumeration is to assume that no element in the Hasse diagram of the poset has an in-degree or out-degree of 1.

Definition 25. (Stong, 1966) An element x ∈ X in a finite T_0 space X is a down beat point if x covers one and only one element of X; equivalently, the set Û_x = U_x \ {x} has a (unique) maximum. Similarly, x ∈ X is an up beat point if x is covered by a unique element, or equivalently if F̂_x = F_x \ {x} has a (unique) minimum. A beat point is either a down beat point or an up beat point.

Theorem 11. (Stong, 1966) Let X be a finite T_0 topological model, and let x ∈ X be a (down or up) beat point. Then the reduced model X \ {x} is a strong deformation retract of X. A point x in a finite space M is an up beat point if and only if it has in-degree one in the associated Hasse diagram H_M (i.e., it has only one incoming edge). Similarly, x is a down beat point if and only if it has out-degree one (it has only one outgoing edge).

Definition 26. A finite T_0 topological space is a minimal model if it has no beat points. A core of a finite topological space X is a strong deformation retract which is a minimal finite space. The minimal graph of a minimal model is its equivalent Hasse diagram.

Theorem 12.
(Stong, 1966) Classification Theorem: A homotopy equivalence between minimal finite space topological models is a homeomorphism. In particular, the core of a finite space model is unique up to homeomorphism, and two finite spaces are homotopy equivalent if and only if they have homeomorphic cores (Barmak, 2011; Stong, 1966).

[Caption fragment: Right: efficiently enumerating minimal posets (Fix and Partias; Brinkmann and McKay, 2002).]

Algorithm 3: Find Topologically Minimal T_0 Causal Model.
Input: General preordered causal model, such as a chain graph G = (V, E), that is causally faithful to a dataset.
Output: Minimal T_0 causal model homotopically equivalent to the original non-T_0 model (e.g., from a chain graph G). The algorithm uses homotopy theory to find the core T_0 model of a general chain graph.
begin
  Define the topological model (X, U) where X = V and the open sets in U are constructed from the preorder ≤ induced from G.
  Define the minimal model (X_0, U_0), and set X_0 = X.
  repeat
    for x, y ∈ X_0 s.t. x ≤ y, y ≤ x do
      Remove x, y from X_0, and replace them with a new variable z = x ∼ y: set X_0 ← X_0 \ {x, y} ∪ {z}. Here z represents the equivalence class that includes x and y.
    end
    for x ∈ X_0 do
      Remove down beat points: if Û_x = U_x \ {x} has a maximum, then X_0 ← X_0 \ {x}.
      Remove up beat points: if F̂_x = F_x \ {x} has a minimum, then X_0 ← X_0 \ {x}.
    end
  until convergence
  Define the open sets U_x ∈ U_0 as U_x = {y | y ≤ x} for x ∈ X_0.
end

Algorithm 3 first determines a quotient T_0 topology that is homotopically equivalent to the general non-T_0 topology defined by a chain graphical model. Second, the algorithm further reduces the model to its core by removing beat points (Barmak, 2011; May; McCord, 1966; Stong, 1966).

Table 5 shows a small fragment of a dataset for pancreatic cancer (Jones et al., 2008). Like many cancers, it is marked by a particular partial ordering of mutations in some specific genes, such as KRAS, TP53, and so on.
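The beat-point reduction at the heart of Algorithm 3 (its second phase, applied to a space that is already T_0) can be sketched as follows; this is our own illustration, assuming the partial order is supplied as a `leq` predicate:

```python
def core(points, leq):
    """Repeatedly remove beat points (Theorem 11) until a minimal model
    (Definition 26) remains. Assumes (points, leq) is a finite T_0
    space, i.e., leq is a partial order."""
    X = set(points)
    lt = lambda a, b: a != b and leq(a, b)   # strict order a < b
    changed = True
    while changed:
        changed = False
        for x in list(X):
            down = [y for y in X if lt(y, x)]   # U_x \ {x}
            up = [y for y in X if lt(x, y)]     # F_x \ {x}
            # Down beat point: U_x \ {x} has a unique maximum.
            down_beat = any(all(leq(z, m) for z in down) for m in down)
            # Up beat point: F_x \ {x} has a unique minimum.
            up_beat = any(all(leq(m, z) for z in up) for m in up)
            if down_beat or up_beat:
                X.remove(x)     # removal is a strong deformation retract
                changed = True
                break
    return X
```

A chain a ≤ b ≤ c is contractible and reduces to a single point, while the four-point space with two minimal points below two maximal points (the minimal finite model of the circle) has no beat points and is its own core; by Theorem 12, two finite spaces are homotopy equivalent exactly when such cores are homeomorphic.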
In order to understand how to model and treat this deadly disease, it is crucial to understand the inherent partial ordering in the mutations of such genes. Pancreatic cancer remains one of the most prevalent and deadly forms of cancer. Roughly half a million people contract the disease each year, most of whom succumb to it within a few years. Figure 5 shows the roughly 20 most common genes that undergo mutations during the progression of this disease. The most common, the KRAS gene, provides instructions for making a protein called K-Ras that is part of a signaling pathway known as the RAS/MAPK pathway; the protein relays signals from outside the cell to the cell's nucleus. The second most common mutation occurs in the TP53 gene, which makes the p53 protein that normally supervises the cell's response as the body tries to repair damaged DNA. Like many cancers, pancreatic cancer progresses as the normal reproductive machinery of the cell is taken over by the cancer. In the pancreatic cancer problem, for example, the topological space X is comprised of the significant events that mark the progression of the disease, as shown in Figure 5.

[Figure 5 caption: Left: genetic mutations in pancreatic cancer (Jones et al., 2008). Middle: histogram of genes sorted by mutation frequencies. Right: poset learned from the dataset (Jones et al., 2008), showing that genetic mutations occur along distinct pathways. Bottom: topological representation of causal poset and DAG models for COVID-19 AZV (AstraZeneca vaccine), VITT (vaccine-induced clotting of blood), and other factors (Greinacher et al., 2021; Schultz et al., 2021).]

Each mutation event alters a gene at specific locations by the change of an amino acid, causing the gene to malfunction. We can model a tumor in terms of its genotype, namely the subset of X, the gene events, that characterize the tumor. For example, the table shows that the tumor Pa022C can be characterized by the genotype KRAS, SMAD4, and TP53.
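As a toy illustration of this genotype-as-subset view (our own sketch, not the conjunctive Bayesian network algorithms of Beerenwinkel et al.), candidate ancestor relations can be read directly off genotype data under a separability assumption: f is inferred as a prerequisite of e when every observed genotype containing e also contains f. The gene names follow the pancreatic cancer example:

```python
def infer_ancestors(events, genotypes):
    """For each event e, collect events f that occur in every genotype
    in which e occurs: candidate ancestors of e under the separability
    assumption that observed genotypes separate events."""
    ancestors = {e: set() for e in events}
    for e in events:
        carriers = [g for g in genotypes if e in g]   # tumors exhibiting e
        for f in events:
            if f != e and carriers and all(f in g for g in carriers):
                ancestors[e].add(f)
    return ancestors
```

On the genotypes {KRAS}, {KRAS, TP53}, {KRAS, TP53, SMAD4}, this recovers KRAS as an ancestor of TP53, and KRAS and TP53 as ancestors of SMAD4, matching the intuition that KRAS mutations occur earliest in the progression.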
In general, a finite space topology is just the elements of the space (e.g., genetic events) and the subspaces (e.g., genomes) that define the topology. We illustrate our framework using the problem of inferring topological causal models for cancer (Beerenwinkel and Sullivant, 2009; Beerenwinkel et al., 2007, 2006; Gerstung et al., 2011). The progression of many types of cancer is marked by mutations of key genes whose normal reproductive machinery is subverted by the cancer (Jones et al., 2008). Often, viruses such as HIV and SARS-CoV-2 are constantly mutating to combat the pressure of interventions such as drugs, and successful treatment requires understanding the partial ordering of mutations. A number of past approaches use topological separability constraints on the data, assuming observed genotypes separate events, which, as we will show, is abstractly a separability constraint on the underlying topological space. A key computational lever in making model discovery tractable in evolutionary processes, such as pancreatic cancer, is that multiple sources of information are available that guide the discovery of the underlying poset model. In particular, for pancreatic cancer (Jones et al., 2008), in addition to the tumor genotype information shown in Table 5, it is also known that the disease follows certain pathways, as shown in Table 4. This type of information from multiple sources gives the ability to construct multiple posets that reflect different event constraints (Beerenwinkel et al., 2006).

Algorithm 4: Learn a causally faithful poset from genotype data. Here, it is assumed that each genome is an intervention target, whose size will affect the complexity of each causal experiment. A conditional independence oracle is also assumed.
Output: Causally faithful poset P that is consistent with the conditional independences in the data.
begin
  Set the basic closed sets F_e ← ∅.
  repeat
    Select an open separating set g ∈ T and intervene on g.
    for e ∈ g, f ∉ g do
      Use samples and the CI oracle to test whether (e ⊥ f)_{M_g} on dataset D. If the CI test fails, then set F_e ← F_e ∪ {f}, because f is an ancestor of e.
    end
  until convergence
  Define the relation e ≤ f if f ∈ F_e, for all e, f ∈ E, and compute its transitive closure.
  Return the poset P = (E, ≤), where ≤ is the induced relation on the poset P.
end

Algorithm 4 is a generalization of past algorithms that infer conjunctive Bayesian networks (CBN) from a dataset of events (e.g., tumors or signaling pathways) and their associated genotypes (e.g., sets of genes) (Beerenwinkel et al., 2007, 2006). The pathway poset and DAG shown in Figure 1 and the poset in Figure 5 were learned by applying Algorithm 4 to the pancreatic cancer dataset published in (Jones et al., 2008).

We proposed a topological framework for causal discovery, building on the key relationship between posets and finite Alexandroff topologies; in the supplementary material, we elaborate on additional details. We gave some examples from the domain of cancer genomics. A growing body of work in causal discovery has implicitly used topological constraints to constrain search. Our paper uses insights from the algebraic topology of finite spaces to develop more scalable algorithms. Our paper has a number of significant limitations. We did not discuss building poset models over latent variables, which is important in many applications. Furthermore, a deeper study of the empirical performance of the algorithms proposed here is necessary to fully evaluate the promise of the proposed framework.

References
- Elements of algebraic topology
- Optimization in Vector Spaces
- Approximation by superpositions of a sigmoidal function
- An introduction to persistent homology
- Computing persistent homology
- Probabilistic Graphical Models: Principles and Techniques
- Graphical Models
- Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann series in representation and reasoning
- The seven tools of causal inference, with reflections on machine learning
- Causation, Prediction, and Search, Second Edition
- A factorization criterion for acyclic directed mixed graphs
- Margins of discrete Bayesian networks
- Markov properties for graphical models with cycles and latent variables
- Algebraic topology of finite topological spaces and applications. Lecture Notes in Mathematics
- Diskrete Räume. Rec. Math. Graylock Press, 1956
- J. P. May. Finite spaces and larger contexts
- Chain graph models and their causal interpretations
- Markov models for accumulating mutations
- Cancer progression models and fitness landscapes: a many-to-many relationship
- Causality: Models, Reasoning and Inference
- Almost optimal intervention sets for causal discovery
- Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs
- Experimental design for learning causal graphs with latent variables
- On separating systems of graphs
- PAC learning of causal trees with latent variables
- Ordering-based causal structure learning in the presence of latent variables
- Singular homology groups and homotopy groups of finite topological spaces
- Finite topological spaces
- Core signaling pathways in human pancreatic cancers revealed by global genomic analyses
- Conjunctive Bayesian networks
- Learning and testing causal models with interventions
- Thrombotic thrombocytopenia after ChAdOx1 nCoV-19 vaccination
- Thrombosis and thrombocytopenia after ChAdOx1 nCoV-19 vaccination
- The intrinsic model for discrete stochastic control: Some open problems
- Causal information with information fields
- Evolution on distributive lattices
- The temporal order of genetic and pathway alterations in tumorigenesis
- Computationally and statistically efficient learning of causal Bayes nets using path queries
- Illya Shpitser and Eric Tchetgen Tchetgen. Causal inference with a graphical hierarchy of interventions
- Two optimal strategies for active learning of causal models from interventions. CoRR, abs/1205
- A dual of Dilworth's decomposition theorem
- Enumeration of homotopy classes of finite T0 topological spaces
- Posets on up to 16 points