title: Distributed Learning of Generalized Linear Causal Networks
authors: Ye, Qiaoling; Amini, Arash A.; Zhou, Qing
date: 2022-01-23

We consider the task of learning causal structures from data stored on multiple machines, and propose a novel structure learning method called distributed annealing on regularized likelihood score (DARLS) to solve this problem. We model causal structures by a directed acyclic graph that is parameterized with generalized linear models, so that our method is applicable to various types of data. To obtain a high-scoring causal graph, DARLS simulates an annealing process to search over the space of topological sorts, where the optimal graphical structure compatible with a sort is found by a distributed optimization method. This distributed optimization relies on multiple rounds of communication between local and central machines to estimate the optimal structure. We establish its convergence to a global optimizer of the overall score computed on all data across local machines. To the best of our knowledge, DARLS is the first distributed method for learning causal graphs with such theoretical guarantees. In extensive simulation studies, DARLS shows competitive performance against existing methods on distributed data, and achieves structure learning accuracy and test-data likelihood comparable to those of competing methods applied to data pooled across all local machines. In a real-world application modeling protein-DNA binding networks with distributed ChIP-Sequencing data, DARLS also exhibits higher predictive power than other methods, demonstrating a great advantage in estimating causal networks from distributed data.

Causal diagrams are mathematical models for cause-effect relationships among variables (Pearl 1995), with applications in education (Dehejia and Wahba 1999; Hill 2012), economics (Imbens 2004), epidemiology (Glymour 2006), and other fields. A causal diagram is represented by a directed acyclic graph (DAG), where edges encode causal effects among variables (nodes). Graphical models associated with DAGs are known as Bayesian networks (BNs). The probability density of a set of random variables {X_1, …, X_p} in a BN factorizes as

p(x_1, …, x_p) = ∏_{j=1}^{p} p(x_j | pa_j),    (1)

where PA_j ⊂ {X_1, …, X_p} \ {X_j} is the parent set of X_j, with pa_j being its value. Since randomized experiments are not always feasible, various approaches have been put forward to learn causal DAGs from observational data (Pearl 2009; Guo et al. 2020). In this work, we focus on learning causal DAGs from distributed data. Distributed data storage has been used for privacy protection when managing the copious amounts of data generated every day by government agencies, research institutions, medical centers, technology companies, etc. (Mehmood et al. 2016). Many of these organizations collect similar data, or data from the same population, and they often collaborate in social, scientific and business domains. For example, in 2021, Google and Apple were launching an exposure notification system that tracks potential exposure to COVID-19 and a privacy-preserving analytics platform that collects health metrics (Apple and Google 2021). Also, small clinics often combine data to obtain statistically significant results, since the individual datasets are otherwise too small.
Such collaborations typically require privacy disclosures stating that each institution cannot share its privately held sensitive data with the other parties. Even when data privacy is not a concern, merging data from multiple sources remains a difficult task. Since each local agency stores and manages data on a different platform, it is often time-consuming to extract, standardize and migrate the data. From a purely computational perspective, distributed computing often uses parallelism across local machines, resulting in a substantial reduction in computation time and allowing us to scale inference procedures to massive datasets. The widespread use of distributed data has led to a research area whose goal is to adapt common statistical and machine learning tasks to the distributed setting. A straightforward idea for deriving a distributed version of any inference procedure is to form a global estimate by averaging the local estimates, an approach known as one-shot parameter averaging. This approach, however, fails to obtain solutions with any desired level of suboptimality compared to the global estimate constructed from all the data (Zinkevich et al. 2010; Shamir et al. 2014). To overcome the drawbacks of one-shot averaging, communication-efficient algorithms have been proposed that utilize multiple rounds of communication between local and central machines to generate a sequence of (global) estimates (Zhang et al. 2013; Shamir et al. 2014; Jordan et al. 2018; Fan et al. 2019). Communication-efficient algorithms are particularly useful in distributed optimization of multi-agent systems, such as electric power systems, sensor networks and smart manufacturing (Molzahn et al. 2017; Yang et al. 2019). Despite these methodological advances, learning causal DAGs from data distributed across independent machines is still a challenging task. A major difficulty is how to integrate local information to form a global causal graph that satisfies the acyclicity constraint. One approach is to iterate over local datasets (once) and then combine the local graphs or the local p-values to form a global graph (Na and Yang 2010; Gou et al. 2007; Tang et al. 2019). However, there is no theoretical guarantee that aggregating local estimates, using this single-iteration approach, would lead to an estimate close to the global minimizer of the loss on the combined data. In this paper, we propose a score-based learning method that carries out multiple rounds of communication to estimate DAGs from distributed data. Our objective function is equivalent to a regularized log-likelihood of the overall data. To optimize it, the central machine proposes a candidate ordering π, whose score is evaluated via communication with local machines over DAGs compatible with π. The candidate sort π is then accepted or rejected by a simulated annealing step. Because every DAG has at least one topological ordering, searching over the space of orderings ensures that the acyclicity constraint is always satisfied. We show that the convergence rate of our distributed estimate to the global one is O(log(n)/√m) for a fixed true DAG, where n is the total sample size across all local machines and m is the smallest local sample size (Theorem 1, Section 4.1). To the best of our knowledge, our approach is the first to learn causal graphs from distributed data with a theoretical guarantee of convergence to the global estimate as the sample sizes of the local machines grow.
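To make the contrast between one-shot parameter averaging and multi-round communication concrete, the following minimal Python sketch estimates a logistic-regression coefficient vector in both ways on simulated data. All names and numbers (K, n_k, the learning rate, the number of rounds) are illustrative assumptions, and the multi-round scheme shown is plain distributed gradient averaging rather than the DANE-style update used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: K machines, each holding a local sample from the same
# logistic-regression model with true coefficients beta_star.
K, n_k, p = 10, 200, 5
beta_star = rng.normal(size=p)
local_data = []
for _ in range(K):
    X = rng.normal(size=(n_k, p))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_star)))
    local_data.append((X, y))

def local_grad(beta, X, y):
    """Gradient of the normalized negative log-likelihood on one machine."""
    return X.T @ (1.0 / (1.0 + np.exp(-X @ beta)) - y) / len(y)

def local_fit(X, y, steps=500, lr=0.5):
    """Plain gradient descent on the local loss (stand-in for a local MLE)."""
    beta = np.zeros(p)
    for _ in range(steps):
        beta -= lr * local_grad(beta, X, y)
    return beta

# One-shot parameter averaging: fit each machine separately, average once.
one_shot = np.mean([local_fit(X, y) for X, y in local_data], axis=0)

# Multi-round communication: each round, machines send local gradients,
# the center averages them into a global gradient and broadcasts an update.
beta = np.zeros(p)
for _ in range(500):
    g = np.mean([local_grad(beta, X, y) for X, y in local_data], axis=0)
    beta -= 0.5 * g

print("one-shot error   :", np.linalg.norm(one_shot - beta_star))
print("multi-round error:", np.linalg.norm(beta - beta_star))
```

The point of the sketch is the communication pattern, not the size of the error gap; with equal local sample sizes and a smooth loss, one-shot averaging can be reasonably accurate, but the multi-round scheme always targets the pooled-data objective.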
Another contribution of our work is the use of generalized linear models (GLMs) for local conditional distributions in BNs, which brings several advantages to causal structure learning. Our proposed GLM DAG model is a flexible family for various data types beyond linear Gaussian models (with equal variance) and multi-logit models (Aragam et al. 2019), and hence it can be applied in multiple domains. Additionally, most models in the GLM family lead to convex loss functions, which facilitate the optimization task. The objective function in our distributed learning algorithm is equivalent to a regularized likelihood of the overall data, which has been shown to be effective in learning both continuous and discrete DAGs (Fu and Zhou 2013; Aragam and Zhou 2015; Gu et al. 2019; Ye et al. 2021). Furthermore, we show that GLM DAG models lead to the identifiability of the underlying causal DAGs (Proposition 1, Section 2), while other common models, such as multinomial models for discrete networks and Gaussian linear DAGs, are not identifiable in general (Pearl 1995; Heckerman et al. 1995). Under such identifiability, we establish the ℓ_2-consistency of a global maximizer DAG of our regularized likelihood score (Theorem 2, Section 4.2).

The paper is organized as follows. Section 2 defines the generalized linear DAG model and establishes some identifiability results under this model. In Section 3, we set up the optimization problem for learning causal graphs and develop the DARLS algorithm, which combines simulated annealing to search over the permutation space with a distributed optimization algorithm to optimize the network structure given an ordering. We then establish theoretical results for the convergence of the distributed optimization algorithm and the estimation consistency of DARLS in Section 4. Section 5 consists of extensive simulation experiments, where we compare our method to existing ones using distributed data, test the robustness of DARLS against violations of its underlying model assumptions, and examine the accuracy loss and computational efficiency of distributed learning. We also apply the distributed learning methods to ChIP-sequencing data for modeling protein-DNA binding networks in Section 6. The paper is concluded with a discussion in Section 7. All proofs are relegated to the supplementary material.

Denote by x_j ∈ ℝ^{d_j} a realization of variable X_j, where d_j = 1 for a numerical X_j and d_j = r_j − 1 for a categorical variable X_j with r_j classes, using the one-hot encoding. Let β_{ij} ∈ ℝ^{d_i × d_j} encode the influence of X_i on X_j, with β_{ij} = 0 if X_i ∉ PA_j. Put

β_j := [β_{0j}, β_{1j}, …, β_{pj}] ∈ ℝ^{(d+1) × d_j},    (2)

where β_{0j} ∈ ℝ^{1 × d_j} is an intercept term and d = Σ_{i=1}^{p} d_i. Here and elsewhere, [x, y] denotes the vertical concatenation of two vectors or matrices x and y. We define a generalized linear DAG (GLDAG) as the Bayesian network (1) with conditional densities given by GLMs with canonical links, that is,

p(x_j | pa_j) = exp{ ⟨x_j, β_j^⊤ x⟩ − b_j(β_j^⊤ x) + c_j(x_j) },    (3)

where, with a slight abuse of notation, x = [1, x_1, …, x_p] ∈ ℝ^{d+1} denotes the full covariate vector augmented with an intercept, and b_j and c_j are both functions from ℝ^{d_j} to ℝ. Note that β_j^⊤ x = β_{0j}^⊤ + Σ_{i ∈ PA_j} β_{ij}^⊤ x_i only depends on pa_j. GLDAG models allow for many common distributions via the choice of the log partition function b_j(·). Examples include the Bernoulli distribution for b_j(θ) = log(1 + e^θ), constant-variance Gaussian for b_j(θ) = θ²/2, Poisson for b_j(θ) = exp(θ), Gamma for b_j(θ) = −log(−θ), and the multinomial for b_j(θ) = log(1 + Σ_{l=1}^{d_j} e^{θ_l}). Note that in the multinomial case b_j(·) is a multivariate function, operating on a vector θ = (θ_l), in contrast to the other examples, for which b_j(·) is a scalar function.
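As an illustration of the GLDAG model with Bernoulli conditionals, the sketch below draws samples from a small binary network whose nodes are generated in topological order with logit (canonical) links. The coefficient matrix B and the three-node chain are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_logistic_gldag(B, n, rng):
    """Draw n samples from a binary (Bernoulli) GLDAG.

    B is a (p+1) x p coefficient matrix: row 0 holds intercepts beta_0j and
    B[i, j] != 0 encodes an edge X_i -> X_j.  Columns must be ordered so that
    parents precede children (a topological sort), which we assume here.
    """
    p = B.shape[1]
    X = np.zeros((n, p))
    for j in range(p):                      # generate nodes in topological order
        eta = B[0, j] + X @ B[1:, j]        # linear predictor beta_j^T [1, x]
        prob = 1.0 / (1.0 + np.exp(-eta))   # canonical (logit) link
        X[:, j] = rng.binomial(1, prob)
    return X

# Tiny example graph X_1 -> X_2 -> X_3 (hypothetical coefficients).
B = np.zeros((4, 3))
B[0, :] = [0.2, -0.5, 0.3]   # intercepts
B[1, 1] = 1.2                # X_1 -> X_2
B[2, 2] = -1.0               # X_2 -> X_3
X = sample_logistic_gldag(B, n=1000, rng=rng)
print(X.mean(axis=0))
```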
The Bernoulli and multinomial choices above give rise to logistic and multi-logit regression models for each node. We collect all the parameters of model (3) in a matrix β ∈ ℝ^{(d+1) × d}, which is obtained by horizontal concatenation of β_j, j = 1, …, p, each as defined in (2). We say that a GLDAG (3) is continuous if all the variables are continuous. Recall that in this case, d_j = 1 for all j ∈ [p] and thus β is a (p + 1) × p matrix. We rewrite the log pdf of (3), in the continuous case, as

L(x; β) = Σ_{j=1}^{p} [ x_j (β^⊤ x)_j − b_j((β^⊤ x)_j) + c_j(x_j) ],    (4)

where β^⊤ x ∈ ℝ^p and β_{ij} ≠ 0 if and only if X_i → X_j. Next, we define identifiability of DAG models following Hoyer et al. (2008) and Peters and Bühlmann (2014), and show that continuous GLDAGs are identifiable.

Definition 1 (Identifiability). Suppose we are given a joint distribution L(X) = L(X_1, …, X_p) that has been generated from an unknown GLDAG model (3) with a graph G_0. If the distribution L(X) cannot be generated by any GLDAG model with a different graph G ≠ G_0, then we say G_0 is identifiable from L(X).

It is well known that linear Gaussian DAGs and multinomial DAGs are in general not identifiable (Pearl 1995; Heckerman et al. 1995). In contrast, continuous GLDAG models (4) are identifiable under mild assumptions:

Proposition 1. Suppose the joint distribution L(X) is defined by the log-pdf L(x; β) with a DAG G_0 according to (4) such that β_{ij} ≠ 0 if and only if i ∈ PA_j in G_0. If L(x; β) is second-order differentiable with respect to x and the first-order derivative of b_j(·) exists and is not constant for all j, then G_0 is identifiable from L(X).

Proposition 1 establishes the identifiability of continuous GLDAG models (4), partially justifying our goal of learning causal graphs. This result also expands the class of identifiable DAG models in the literature. DAGs generated from linear Gaussian structural equation models with equal variance can be fully identified (Peters and Bühlmann 2014); they are a special case of the GLDAG models with b_j(θ) = θ²/2 for all j. A different class of identifiable DAG models is the additive noise model, X_j = f_j(PA_j) + ε_j, assuming nonlinear f_j and/or non-Gaussian ε_j (Shimizu et al. 2006; Hoyer et al. 2008).

In this section, we construct the objective function using distributed data and propose a simulated annealing search combined with an iterative optimization method to learn causal DAG structures. We start with the definition of topological sorts for DAGs. Given a permutation π on [p] := {1, …, p}, we permute a vector v = (v_1, …, v_p) according to π to obtain a relabeled vector v_π = (v_{π(1)}, …, v_{π(p)}). A topological sort of a DAG is a permutation π of the nodes such that if a ∈ PA_b, then a precedes b in the order defined by π, denoted a ≺_π b. By definition (1), every DAG has at least one topological sort. Let {x_h}_{h=1}^{n} be an i.i.d. sample of size n from model (3). We also let x_h^j represent the observed value of the j-th variable (X_j) in the h-th data point. Consider a subset I ⊂ [n]. The normalized negative log-likelihood of the subsample {x_h}_{h ∈ I} is given, up to an additive constant, by

ℓ_I(β) = (1/|I|) Σ_{h ∈ I} Σ_{j=1}^{p} [ b_j(β_j^⊤ x_h) − ⟨x_h^j, β_j^⊤ x_h⟩ ].    (5)

Note that in this notation, ℓ_{[n]} denotes the normalized negative log-likelihood of the entire sample of size n. We consider the case that the overall data is stored on K different servers, where each local machine M_k holds its private data {x_h}_{h ∈ I_k} and communicates with a central machine C. Let n_k = |I_k| be the sample size on M_k, so that Σ_{k=1}^{K} n_k = n.
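The next sketch spells out two ingredients used repeatedly below: checking that a coefficient matrix is compatible with a topological sort, and evaluating the normalized negative log-likelihood (5) for the Bernoulli case, where b_j(θ) = log(1 + e^θ). The encoding of the coefficient matrix follows the previous sketch and is an assumption made for illustration.

```python
import numpy as np

def is_compatible(B, order):
    """Check that the DAG encoded by B (row 0 = intercepts; B[i + 1, j] != 0
    means there is an edge from node i to node j, 0-based) is compatible with
    a topological sort `order`, i.e. every parent precedes its child."""
    pos = {node: k for k, node in enumerate(order)}
    p = B.shape[1]
    return all(B[i + 1, j] == 0 or pos[i] < pos[j]
               for i in range(p) for j in range(p))

def neg_loglik(B, X):
    """Normalized negative log-likelihood (up to a constant) of a binary
    GLDAG with logit links, matching b_j(theta) = log(1 + e^theta)."""
    ones = np.ones((X.shape[0], 1))
    eta = np.hstack([ones, X]) @ B            # n x p matrix of beta_j^T [1, x_h]
    return np.mean(np.sum(np.log1p(np.exp(eta)) - X * eta, axis=1))

# Example with the chain X_1 -> X_2 -> X_3 from before (hypothetical numbers).
B = np.zeros((4, 3)); B[0] = [0.2, -0.5, 0.3]; B[1, 1] = 1.2; B[2, 2] = -1.0
X = (np.random.default_rng(2).random((500, 3)) < 0.5).astype(float)
print(is_compatible(B, order=[0, 1, 2]), neg_loglik(B, X))
```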
The normalized negative log-likelihood based on the entire data can be decomposed as

ℓ_{[n]}(β) = Σ_{k=1}^{K} (n_k/n) ℓ_{I_k}(β).

Let P be the set of all permutations on [p] and D(π) ⊂ ℝ^{(d+1)×d} the set of DAGs whose topological sorts are compatible with a permutation π ∈ P. Note that D(π) is a linear subspace of ℝ^{(d+1)×d}. We ideally would like to estimate β by minimizing a regularized loss function of the form

f(π) := min_{β ∈ D(π)} { ℓ_{[n]}(β) + ρ(β) },    (6)

where ρ(·) is an appropriate regularizer to promote sparsity in β. We call f(π) the global objective function since it is defined using all data across local machines. Recall that β_{ij} ≠ 0 if and only if i ∈ PA_j. To learn sparse DAGs, we apply group regularization of the form

ρ(β) = λ Σ_{j=1}^{p} Σ_{i ≠ j} ρ_g(β_{ij}),    (7)

where ρ_g(·) is a nonnegative and nondecreasing group regularizer and λ > 0 is a tuning parameter. Restricted to D(π), the regularizer can be further simplified to ρ(β) = λ Σ_j Σ_{i ≺_π j} ρ_g(β_{ij}). In this paper, we consider the Group Lasso (i.e., group ℓ_2) penalty with the choice

ρ_g(β_{ij}) = |||β_{ij}|||_F,    (8)

where |||β_{ij}|||_F is the Frobenius norm of the matrix β_{ij}. As a convex penalty and a natural extension of Lasso regularization, Group Lasso has demonstrated remarkable performance in grouped variable selection (Yuan and Lin 2007).

To search over (π ∈ P, β ∈ D(π)) with distributed data as in (6), we propose the DARLS algorithm, whose main steps are outlined in Algorithm 1. At each annealing iteration, a permutation π+ is proposed based on the current π (line 5) and is accepted with the usual simulated-annealing acceptance probability under a decreasing temperature schedule. To compute the score of the optimal DAG structure for a given permutation, we use the distributed optimization approach outlined in Algorithm 2, the details of which are discussed in Section 3.2 below. This approach allows multiple rounds of communication between local and central machines to update and synthesize information. Note that DARLS can be applied to any objective function as long as the gradient w.r.t. β has a closed-form expression.

Algorithm 1 Distributed annealing on regularized likelihood score (DARLS).
1: Select tuning parameter λ by BIC selection.
2: π ← π_0; compute (β̂, f(π)) by Algorithm 2.
3: for i = 0, …, N do
4:   T ← T^{(i)}.
5:   Central machine C proposes π+ by flipping a random interval (length up to τ) in π.
6:   Compute (β+, f(π+)) using Algorithm 2.
7:   Accept (π, β̂) ← (π+, β+) with probability min{1, exp[(f(π) − f(π+))/T]}.
8: end for
9: Refine the causal structure implied by β̂.

Algorithm 2 Distributed optimization to compute the global permutation score.
1: Initialize β^{(0)}_π and set t ← 0.
2: repeat
3:   Each M_k computes the local gradient ∇ℓ_{I_k}(β^{(t)}_π) and sends the result to C.
4:   C aggregates the global gradient ∇ℓ_{[n]}(β^{(t)}_π) = Σ_k (n_k/n) ∇ℓ_{I_k}(β^{(t)}_π) and broadcasts it to local machines.
5:   Each M_k calculates the minimizer β^{(t+1)}_{k,π} = ϕ_{k,π}(β^{(t)}_π) given by (9) and sends it to C.
6:   C averages β^{(t+1)}_π = Σ_k (n_k/n) β^{(t+1)}_{k,π} and broadcasts it to local machines; t ← t + 1.
7: until convergence; return β̂_π and f(π).

For any fixed π, we use distributed computing to evaluate f(π). That is, instead of directly working with the objective function in (6), we rely on local versions of it to guide a distributed algorithm that divides the task of computing f(π) among the K local machines. In particular, we consider the local objective functions

F_k(β) := ℓ_{I_k}(β) + ρ(β),  k = 1, …, K.

The global version (6) can be rewritten as f(π) = min_{β ∈ D(π)} F(β), where F(β) := ℓ_{[n]}(β) + ρ(β). Typically, each of F and F_k is nonsmooth due to the presence of the regularizer ρ, but the difference h_k := F_k − F = ℓ_{I_k} − ℓ_{[n]} is often smooth. The gradient of h_k is used to guide the iterations in each local machine. That is, given the current (global) estimate β, local machine M_k performs the update

ϕ_{k,π}(β) := argmin_{ξ ∈ D(π)} { ℓ_{I_k}(ξ) − ⟨∇h_k(β), ξ⟩ + ρ(ξ) }.    (9)

The local regularized loss F_k corrected by the linear term ⟨∇h_k(β), ·⟩, denoted by F̃_k, is a first-order approximation to the global regularized loss F, up to an additive constant. Let β^{(t)}_π be the global estimate of the algorithm at iteration t.
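The following sketch mimics the outer annealing loop of Algorithm 1: propose a new permutation by flipping a random interval, evaluate its score, and accept or reject with a Metropolis rule under a geometrically decreasing temperature. The score function is a black-box stand-in for the distributed evaluation of f(π) by Algorithm 2, and the specific proposal, schedule, and toy score are illustrative assumptions.

```python
import math
import random

def darls_style_anneal(score, p, n_iters=1000, t0=5e-2, t_end=5e-5, tau=4,
                       rng=random.Random(0)):
    """Simulated annealing over permutations, in the spirit of Algorithm 1.

    `score` stands in for the distributed computation of f(pi); here it is
    just a callable taking a permutation and returning a number to minimize.
    """
    perm = list(range(p))
    best = cur = score(perm)
    best_perm = perm[:]
    for i in range(n_iters):
        temp = t0 * (t_end / t0) ** (i / max(n_iters - 1, 1))  # geometric schedule
        # Propose by flipping (reversing) a random interval of length <= tau.
        a = rng.randrange(p)
        b = min(p, a + rng.randint(2, tau))
        prop = perm[:a] + perm[a:b][::-1] + perm[b:]
        new = score(prop)
        # Metropolis acceptance for a minimization problem.
        if new <= cur or rng.random() < math.exp((cur - new) / temp):
            perm, cur = prop, new
            if cur < best:
                best, best_perm = cur, perm[:]
    return best_perm, best

# Toy score: prefer permutations close to the identity (illustrative only).
toy_score = lambda pi: sum(abs(pos - node) for pos, node in enumerate(pi))
print(darls_style_anneal(toy_score, p=8))
```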
At the next iteration, t + 1, we obtain local estimates β^{(t+1)}_{k,π} = ϕ_{k,π}(β^{(t)}_π) for k = 1, …, K. These local estimates are then passed to the central machine C to compute the next global estimate by averaging, i.e., β^{(t+1)}_π = Σ_{k=1}^{K} (n_k/n) β^{(t+1)}_{k,π}. The main steps of this distributed optimization method are outlined in Algorithm 2. The above approach is essentially a version of the DANE algorithm (Zhang et al. 2013; Shamir et al. 2014; Jordan et al. 2018; Fan et al. 2019). Note that to calculate the local updates β^{(t+1)}_{k,π} (line 5, Algorithm 2), only the current global estimate β^{(t)}_π and the global gradient need to be communicated to each local machine. In Section 4.1, we show that for a sufficiently large minimum sample size per machine, i.e. min_k n_k, the sequence {β^{(t)}_π}_{t ≥ 0} thus produced will converge to a global minimizer β̂_π of F(·) over D(π).

Another piece of the distributed optimization is the computation of the local update (9) (line 5, Algorithm 2), for which we use the proximal gradient algorithm (Algorithm 3). Given a current global estimate β^{(t)}, optimizing the local objective F̃_k(ξ) in (9) is equivalent to min_{ξ ∈ D(π)} { ℓ̃_{I_k}(ξ) + ρ(ξ) }, where ℓ̃_{I_k}(ξ) := ℓ_{I_k}(ξ) − ⟨∇h_k(β^{(t)}), ξ⟩ is a surrogate for the global likelihood ℓ_{[n]}(ξ). To solve (9), we use iterative proximal gradient descent. At each iteration, we minimize a quadratic approximation to ℓ̃_{I_k} around the current solution ξ, plus a regularization term,

ξ+ = argmin_u { ℓ̃_{I_k}(ξ) + ⟨∇ℓ̃_{I_k}(ξ), u − ξ⟩ + (1/2s) |||u − ξ|||²_F + ρ(u) },    (10)

where s > 0 plays the role of a step size and ξ+ is our next estimate of the solution. Equivalently, the update (10) can be re-written as

ξ+ = prox_{sρ}( ξ − s ∇ℓ̃_{I_k}(ξ) ),    (11)

where prox_{sρ}(x) := argmin_u { sρ(u) + (1/2) |||x − u|||²_F } is the proximal operator applied to the scaled function sρ(·). Equation (11) is known as a proximal gradient update and, for our choice of the regularizer given by (7) and (8), has the following closed-form (block soft-thresholding) expression:

[prox_{sρ}(x)]_{ij} = ( 1 − sλ / |||x_{ij}|||_F )_+ x_{ij},

where (a)_+ := max{a, 0}.

Algorithm 3 Proximal gradient algorithm to compute local permutation scores; each iteration applies the update (11) and sets ξ ← ξ+ and iter ← iter + 1.

In this section, we study the convergence of the iterative distributed optimization algorithm (Algorithm 2) and establish the consistency of the global minimizer of (6). As our method is primarily motivated by applications involving a large amount of distributed data, we develop theoretical results under the setting that n is large and the number of variables p stays fixed. Recall the local iteration functions ϕ_{k,π} defined in (9). The overall iteration function for the distributed algorithm can be written as Φ_π(·) := Σ_k (n_k/n) ϕ_{k,π}(·) (line 6, Algorithm 2). Let B_F(β, r) := {ξ : |||ξ − β|||_F ≤ r} denote the Frobenius ball of radius r centered at β. We consider the case of numerical variables, i.e. d_j = 1 for all j, which includes continuous and binary discrete random variables. The following theorem provides convergence guarantees on the distributed optimization algorithm represented by Φ_π for any fixed π. Let β̂_π be any global minimizer of the global objective function over D(π), i.e.,

β̂_π ∈ argmin_{ξ ∈ D(π)} { ℓ_{[n]}(ξ) + ρ(ξ) },

where ρ(ξ) is a convex regularizer. Let Ω := ∪_π D(π) ⊂ ℝ^{(p+1)×p} be the parameter space of GLDAGs. We recall that for θ ∈ Ω, θ_j denotes the jth column and that {x_h} is an i.i.d. sample from a GLDAG model (3).

Theorem 1. Assume that the coordinates of x_h are T-bounded, that is, |x_h^j| ≤ T for all h ∈ [n] and j ∈ [p]. Let θ ∈ Ω be any GLDAG parameter and r > 0, and set R*_1 = max_j ||θ_j||_1. Define the rate ζ_n in terms of ψ(x) := max{x, √x} and m := min_k |I_k|, and assume further that np ≥ max{K + 1, 3}. There exist constants c_1, C_1, C > 0 such that if C_1 T² p² log(np)/m ≤ λ_min(Σ), then, with high probability, Φ_π(·) is a contraction toward β̂_π on B_F(θ, r): |||Φ_π(ξ) − β̂_π|||_F ≤ C ζ_n |||ξ − β̂_π|||_F for all ξ ∈ B_F(θ, r) ∩ D(π).

Theorem 1 applies to any θ ∈ Ω.
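Below is a minimal sketch of the proximal gradient update (11) for the case d_j = 1, where the group-lasso proximal operator reduces to entrywise soft-thresholding; the restriction to D(π) is imposed by a mask of allowed entries, and the intercept row is left unpenalized. The mask construction, step size, and tuning value are assumptions made for illustration.

```python
import numpy as np

def permutation_mask(p):
    """Mask of allowed entries in the (p+1) x p coefficient matrix for the
    identity permutation: intercepts (row 0) plus edges i -> j with i < j."""
    mask = np.zeros((p + 1, p))
    mask[0, :] = 1.0
    mask[1:, :] = np.triu(np.ones((p, p)), k=1)
    return mask

def prox_grad_step(beta, grad, mask, s, lam):
    """One proximal gradient update xi+ = prox_{s*rho}(xi - s*grad); with
    d_j = 1 the group-lasso prox is soft-thresholding of each edge entry.
    Entries outside D(pi) are zeroed and the intercept row is not shrunk."""
    step = (beta - s * grad) * mask
    out = step.copy()
    out[1:] = np.sign(step[1:]) * np.maximum(np.abs(step[1:]) - s * lam, 0.0)
    return out

p = 3
beta = np.zeros((p + 1, p))
grad = np.random.default_rng(3).normal(size=(p + 1, p))
print(prox_grad_step(beta, grad, permutation_mask(p), s=0.1, lam=0.5))
```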
It is natural to take θ to be β*_π, the minimizer of the population loss defined as

β*_π := argmin_{β ∈ D(π)} E_{β*}[ −log p(x | β) ].

Since β̂_π is a consistent estimate of β*_π for any π (Theorem 2, Section 4.2), P(|||β̂_π − β*_π|||_F > r) goes to zero as n grows. Thus, with high probability, the iteration operator Φ_π(·) will be a contraction: the sequence {β^{(t)}_π}_{t ≥ 0} produced by the distributed algorithm converges geometrically to β̂_π if Cζ_n < 1. For fixed p, and for sufficiently large r such that β^{(0)}_π ∈ B_F(β̂_π, r), one can always satisfy the condition Cζ_n < 1 by taking m (the minimum sample size per machine) large enough. Hence, Theorem 1 provides a quantitative lower bound on m for the geometric convergence to take effect.

Let us write ψ(x; β) := −log p(x | β) for the negative log-likelihood of a single sample x from model (3). We view x, x_h and β as vectors by concatenating the columns when dealing with ψ(·; ·), so that β ∈ ℝ^D for D = d(d + 1). The consistency results established in this section apply to the class of models in (3), without restricting to numerical variables. Recall that F(β) = ℓ_{[n]}(β) + λ_n Σ_{i,j} |||β_{ij}|||_F is the global regularized negative log-likelihood and Ω is the GLDAG parameter space. The optimization problem (6) is equivalent to min_{β ∈ Ω} F(β). We denote by β̂ ∈ Ω a global minimizer of F(β) and by β* ∈ Ω the true parameter with the true DAG G*. For any β, let us consider the (cross) Fisher information matrix I(β; β*) := E_{β*}[∇² ψ(x; β)]. We note that ψ(x; ·) is a convex function for exponential families, and hence I(β; β*) is always positive semi-definite. To establish the consistency of β̂, as well as that of β̂_π used in Theorem 1 for any fixed π, we make the following assumptions:

(A1) The true DAG G* is identifiable.
(A2) For every π, there exist a neighborhood of β*_π, denoted by nb(β*_π), and functions M_{jkl} such that |∂³ ψ(x; β) / ∂β_j ∂β_k ∂β_l| ≤ M_{jkl}(x) for all β ∈ nb(β*_π), almost surely, and E_{β*}[M_{jkl}(x)] < ∞ for all j, k and l.
(A3) For every π, we have inf_{u ∈ D(π), ||u|| = 1} ⟨u, I(β*_π; β*) u⟩ > 0.

In (A2), it is implicitly assumed that ψ(x; ·) is finite in nb(β*_π) almost surely. Before stating our theoretical results, we define Π* := {π : β*_π = β*}, which is exactly the set of permutations consistent with β*; in particular, it is nonempty. Below is a sketch of the argument; a full proof is provided in the supplement (Section S2.3). First, for any π that is consistent with β*, we have β* ∈ D(π). A KL divergence argument then shows that β* is the unique solution of the optimization problem defining β*_π. That is, any π consistent with β* belongs to Π*. Conversely, if π ∈ Π*, then β* = β*_π ∈ D(π), and hence π is consistent with β*. With this observation, we establish the desired consistency results in the following theorem.

Theorem 2. Assume (A1)-(A3) and √n λ_n = O_p(1). Then,
(a) For every π, F(·) has a unique minimizer β̂_π over D(π) and sup_{π ∈ P} |||β̂_π − β*_π|||_F = O_p(n^{−1/2}).
(b) F(·) has a unique minimizer β̂ over Ω (the space of DAGs) and |||β̂ − β*|||_F = O_p(n^{−1/2}).
(c) With probability converging to one as n → ∞, β̂ = β̂_π for some (sequence of) π ∈ Π*.

Theorem 2 confirms that the Group Lasso regularized estimator β̂, defined as a global minimizer of F, is √n-consistent, and that it identifies a correct topological sort π ∈ Π* in the large-sample limit. Moreover, the theorem also establishes the uniform consistency of the restricted minimizers β̂_π for all π, which is needed for Theorem 1.
Assumption (A1) holds under mild conditions according to Proposition 1, Assumption (A2) is a standard regularity condition, and Assumption (A3) is related to the non-singularity of the second moment matrix E_{β*}(xx^⊤). For example, consider the case d_j = 1 for all j, assume that the elements of x are T-bounded, and let R_j = max_π ||[β*_π]_j||_1, viewing β*_π as a matrix with jth column [β*_π]_j. Then, if inf_{|t| ≤ T R_j} b_j''(t) > 0 for all j, non-singularity of E_{β*}(xx^⊤) is sufficient for (A3) to hold.

Denote by s_0 the number of edges in a graph on p nodes. We downloaded seven benchmark networks from the Bayesian network repository (Scutari 2007), including Asia, Sachs, Child, Insurance, Hailfinder and Hepar2, with (p, s_0) up to (70, 123). We generated data under the GLDAG model (3) (Section 5.3) and under other common DAG models (Section 5.4), where the latter is to examine the robustness of our method against violation of its model assumptions. We compared the DARLS algorithm to the following DAG structure learning methods: the standard greedy hill climbing (HC) algorithm (Gámez et al. 2011), the Peter-Clark (PC) algorithm (Spirtes and Glymour 1991), the max-min hill-climbing (MMHC) algorithm (Tsamardinos et al. 2006), the fast greedy equivalence search (FGES) (Chickering 2002; Ramsey et al. 2017; Ramanan and Natarajan 2020), and the NOTEARS algorithm (Zheng 2018). Among these methods, PC is a constraint-based method and MMHC is a hybrid method. The other three methods are score-based, where HC searches over DAGs, FGES searches over the equivalence classes, and NOTEARS uses continuous optimization to estimate DAG structure.

Following the practice for DAG learning on distributed data as in Na and Yang (2010), we applied a competing method to each local dataset D_k to obtain a completed partially directed acyclic graph (CPDAG) A_k, and then constructed a global graph using {A_k}_{k=1}^{K}. We used CPDAGs here because all the competing methods were developed under non-identifiable DAG models. Among the five competing methods, only PC and FGES output a CPDAG, and thus we converted estimated DAGs from the other methods to CPDAGs to obtain A_k. Given {A_k}_{k=1}^{K}, we counted occurrences of the three possible orientations between each pair (i, j): i → j, i ← j, or i − j (undirected). We then ranked the orientations across all node pairs in descending order of their counts, and sequentially added these edge orientations to an empty graph, as long as they would not introduce a directed cycle (a cycle consisting entirely of directed edges). By the end of this process, we had a partially directed graph. Lastly, we applied Meek's rules (Meek 1995; Chickering 2002) to maximally orient undirected edges, and hence constructed a global CPDAG estimate.

Remark 1. The HC global estimates had too many edges under this approach, because its local CPDAGs lacked consensus, resulting in a large number of candidate edges and many more false positive (FP) edges. To solve this problem, we did not add any edge between a node pair (i, j) if the majority of the local graphs {A_k} had no edge between them. In this way, global graphs estimated by HC became reasonably sparse compared to global estimates by the other methods.

Competing methods were applied to each local dataset using a 2016 MacBook Pro (2.9 GHz Intel Core i5, 16 GB memory). Since DARLS is designed for distributed computing, it was run on a computer cluster at UCLA. In this study, the DARLS algorithm (Algorithm 1) was initialized with a random permutation.
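The aggregation of local CPDAGs described above can be sketched as follows: count orientation votes across local estimates, then add orientations in decreasing order of their counts as long as no directed cycle is created. Meek's rules and the HC-specific majority filter of Remark 1 are omitted; the dictionary encoding of a CPDAG and the use of networkx are assumptions made for this illustration.

```python
import networkx as nx
from collections import Counter

def combine_local_graphs(local_cpdags, p):
    """Aggregate local CPDAG estimates by voting on edge orientations.

    Each local CPDAG is a dict mapping a node pair (i, j) with i < j to one of
    '->', '<-', or '--'.  Orientations are added in descending order of their
    vote counts as long as they do not create a directed cycle.
    """
    votes = Counter()
    for cpdag in local_cpdags:
        for (i, j), orient in cpdag.items():
            votes[(i, j, orient)] += 1

    directed = nx.DiGraph()
    directed.add_nodes_from(range(p))
    undirected_edges = []
    for (i, j, orient), _ in votes.most_common():
        if directed.has_edge(i, j) or directed.has_edge(j, i) or (i, j) in undirected_edges:
            continue  # this pair already has an orientation in the global graph
        if orient == '--':
            undirected_edges.append((i, j))
        else:
            u, v = (i, j) if orient == '->' else (j, i)
            directed.add_edge(u, v)
            if not nx.is_directed_acyclic_graph(directed):
                directed.remove_edge(u, v)  # skip orientations that create a cycle
    return directed, undirected_edges

# Toy usage with two hypothetical local estimates over 3 nodes.
A1 = {(0, 1): '->', (1, 2): '->'}
A2 = {(0, 1): '->', (1, 2): '--'}
g, und = combine_local_graphs([A1, A2], p=3)
print(list(g.edges()), und)
```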
Based on the landscape of the objective function, we set the initial annealing temperature to 5·10^{-2} and decreased it geometrically to 5·10^{-5} over a total of 10^3 iterations. Note that since the log-likelihood has been normalized by the sample size, as in (5), the range of the objective function is quite small. For the PC algorithm, a significance level of 0.01 was used to generate graphs with the desired sparsity. FGES was applied with a significance level of 0.1, which was the default value. For the MMHC and HC methods, the maximum number of parents for a node was set to three. For NOTEARS, we used the default ℓ_2 loss and the default threshold value. Given the estimates generated by the above methods, we use a few metrics to evaluate their structural accuracy. To standardize the performance metrics, we transform an estimated DAG into its CPDAG when the true DAG is not identifiable, before calculating metrics such as the structural Hamming distance (SHD) and the numbers of true positive (TP) and false positive (FP) edges.

The logistic GLDAG model (3), with b_j(β_j^⊤ X) = log(1 + exp(β_j^⊤ X)) for all j ∈ [p], was used to generate binary data X_j ∈ {0, 1}, where the coefficient parameters {β_{ij}} were sampled uniformly from [−1.5, −0.8] ∪ [0.8, 1.5]. We simulated 20 datasets for each network under two settings, n = 100p with K = 10 and n = 10,000 with K = 20, where a total of n observations were randomly assigned to K local machines. Since GLDAGs are identifiable, we compare a DAG estimated by DARLS to the true DAG when calculating the SHD. For all the other methods, we compare the estimated CPDAGs to the CPDAG of the true DAG. Table 1 reports the average performance metrics across 20 datasets for each of the seven graphs, using all six methods. Since NOTEARS estimates under the default settings were too sparse to be competitive, we decreased its penalty tuning parameter from the suggested value of 10^{-1} to 10^{-4}. Its results are listed only for two small graphs, Asia and Sachs. PC had difficulty generating estimates within a reasonable time limit (30 minutes per estimation) for the last two networks, Hailfinder and Hepar2, and hence it was removed from the comparisons on these two graphs. In both cases, n = 100p and n = 10,000, DARLS consistently achieved the lowest SHD among all methods for every network, demonstrating higher accuracy in estimating graphical structures. DARLS identified more TP edges than the other methods in almost every case. The refinement step via thresholding β̂ also helped to reduce the SHD by cutting down FP edges.

To examine the loss of accuracy in estimating network structures on distributed data, we applied each method to combined data obtained by pooling the K local datasets. For brevity in reporting the results, we chose the best performance on each network among the five competing methods, called the best competing method, to compare with DARLS. Figure 1 shows the performance, in terms of SHD, of DARLS and the best competing method on distributed and combined data. First, we observe that, applied to either distributed or combined data, DARLS achieves similar SHD values. The SHD difference between the combined and distributed data for DARLS is much smaller than that for the best competing method, demonstrating that DARLS is more effective in using distributed data. Second, consistent with Table 1, DARLS always outperformed the best competing method significantly when applied to distributed data (best-distributed).
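For concreteness, here is one common way to compute the structural Hamming distance between two graphs given as adjacency matrices; SHD conventions vary slightly across papers, so this is an illustrative implementation rather than the exact definition used in the experiments.

```python
import numpy as np

def shd(A_est, A_true):
    """Structural Hamming distance between two graphs given as adjacency
    matrices (A[i, j] = 1 for a directed edge i -> j; an undirected edge is
    encoded by setting both A[i, j] and A[j, i] to 1).  Counts one unit for
    each node pair whose edge status (absent, direction, or type) differs."""
    p = A_est.shape[0]
    dist = 0
    for i in range(p):
        for j in range(i + 1, p):
            pair_est = (A_est[i, j], A_est[j, i])
            pair_true = (A_true[i, j], A_true[j, i])
            if pair_est != pair_true:
                dist += 1
    return dist

# Toy example: true chain 0 -> 1 -> 2 versus an estimate with one reversed edge.
A_true = np.zeros((3, 3)); A_true[0, 1] = A_true[1, 2] = 1
A_est = np.zeros((3, 3)); A_est[1, 0] = A_est[1, 2] = 1
print(shd(A_est, A_true))   # 1: the edge between nodes 0 and 1 is reversed
```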
Furthermore, DARLS on distributed data shows a substantial overlap in the distribution of SHD with the best competing method on combined data (best-combined), which is either HC or FGES for n = 100p and HC, MMHC or FGES for n = 10,000. Such overlaps indicate the highly competitive performance of our distributed learning algorithm. The variability in SHD of DARLS is in general smaller than that of the best competing method, showing a higher consistency across different datasets.

Figure 1: SHD comparison between DARLS and the best competing method using combined or distributed data. For each graph, the box plots from left to right report the results for the best competing method on combined data (best-combined), DARLS on combined data, DARLS on distributed data, and the best competing method on distributed data (best-distributed).

To quantify the accuracy of distributed optimization using Algorithm 2, we computed permutation scores f(π) in (6) under various values of K ∈ {1, 2, 5, 10} for a fixed π. For each value of K, we fixed the tuning parameter λ, the permutation π and all internal computation parameters to ensure that only K varied in the calculation of f(π). Let f^{(K)} be the value of f(π) computed by K local machines, and ∆f^{(K)} := f^{(K)} − f^{(1)} be the increase in the loss relative to the pooled-data computation. Figure 2a shows {∆f^{(K)} : K = 2, 5, 10} across all the networks. The values of ∆f^{(K)} are on the order of 10^{-13}, verifying that f(π) is essentially identical whether computed from the overall data (K = 1) or distributed data (K ≥ 2). Since the number of iterations is fixed, ∆f^{(K)} increases with the network size. We also examined the computation time of finding f^{(K)} for K ∈ [20] using the 20 Insurance datasets. In each test, the same data were split and distributed to different numbers K of local machines to solve the optimization problem (6), with all other parameters fixed. Figure 2b shows the computation time versus K. As expected when distributing a complicated task, computing f(π) requires less time when more machines are used. However, the reduction in computation time levels off after K = 10, indicating a trade-off between the gains from parallel computation and the communication overhead. Additionally, the smallest local sample size m decreases when K is large, and Theorem 1 shows that this would slow down the convergence of Algorithm 2.

To test the robustness of DARLS against violations of its model assumptions, we also generated data from different DAG models. In particular, we used threshold-Gaussian and multinomial DAGs to generate discrete data, and then compared DARLS with the other methods. For the threshold-Gaussian DAG model, we first generated continuous variables Y_1, …, Y_p using Gaussian structural equation models Y_j = B_j^⊤ PA_j + ε_j, where ε_j is Gaussian noise drawn from N(0, 1) and s_j = |PA_j| is the number of parents of Y_j. Then, we thresholded these continuous values to generate binary data X_j = I(Y_j > c_j), where c_j is the sample mean of Y_j. We used a two-component Gaussian mixture to simulate continuous data for the root variables among the Y_j, each drawn from N(1, 1) or N(−1, 1) with equal probability. Under this design, most of the Y_j's were bimodal in distribution, which greatly increased the robustness of the thresholding. Such conversion between continuous and categorical variables has been used for modeling BNs in prior work (Lee and Kim 2010; Cai et al. 2020; Song et al. 2021), but the threshold-Gaussian DAG model is not the underlying model for any of the six methods in our comparison.
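A sketch of the threshold-Gaussian data generation described above: simulate a linear Gaussian SEM with mixture-Gaussian roots, then threshold each variable at its sample mean. The distribution of the edge weights B_j is not fully specified in the text, so the unit-normal weights used here are an assumption for illustration only.

```python
import numpy as np

def simulate_threshold_gaussian(edges, p, n, rng):
    """Generate binary data by thresholding a linear Gaussian SEM.

    Root nodes are drawn from an equal-weight mixture of N(1, 1) and N(-1, 1);
    non-root nodes follow Y_j = B_j' PA_j + N(0, 1); each Y_j is thresholded
    at its sample mean c_j.  Nodes are assumed to be in topological order."""
    parents = {j: [i for i, jj in edges if jj == j] for j in range(p)}
    Y = np.zeros((n, p))
    for j in range(p):
        if not parents[j]:
            comp = rng.integers(0, 2, size=n)
            Y[:, j] = rng.normal(np.where(comp == 1, 1.0, -1.0), 1.0)
        else:
            w = rng.normal(size=len(parents[j]))      # illustrative weights B_j
            Y[:, j] = Y[:, parents[j]] @ w + rng.normal(size=n)
    X = (Y > Y.mean(axis=0)).astype(float)            # threshold at sample means
    return X

rng = np.random.default_rng(5)
X = simulate_threshold_gaussian([(0, 2), (1, 2), (2, 3)], p=4, n=1000, rng=rng)
print(X.mean(axis=0))
```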
We also simulated multinomial data from the contingency tables provided in the BN repository (Scutari 2007). We made a few modifications to the contingency tables to ensure, by merging states, that (1) the number of states per variable was at most three, and (2) the marginal probability of any state was at least 0.1 for every variable. Due to the high structural complexity of Hailfinder and Hepar2, the original contingency tables of a few nodes were used without modification, resulting in marginal probabilities less than 0.1 for some states. We note that the multinomial DAG model is the underlying model for the majority of the competing methods, including HC, PC, FGES and MMHC. Therefore, comparisons on these data also test whether the GLDAG model can approximate the multinomial model reasonably well.

Table 2 reports the structural estimation accuracy of each method on data generated by the threshold-Gaussian and the multinomial DAG models, with n = 10,000 and K = 20. When the underlying model is a threshold-Gaussian DAG, DARLS achieved the lowest SHD for four networks: Asia, Sachs, Child and Insurance. For data simulated from multinomial models, DARLS still achieved the lowest SHD for Sachs and Child. For most of the other cases, DARLS was worse only than HC, and better than or at least comparable to the other methods. Note that we specifically tuned the way HC combines local estimates to build a global graph, due to its lack of consensus among local estimates (Remark 1). It is encouraging to see that DARLS uniformly outperformed FGES, a consistent score-based method under the multinomial DAG model (Chickering 2002), highlighting the effectiveness of DARLS in using distributed data. This also suggests that the GLDAG model (3) can be a good approximation to the commonly used multinomial DAG model for discrete data. This study confirms that DARLS performs relatively well on data generated from different DAG models, which is important for its practical use. This is further demonstrated by the application to a real dataset in the next section.

In this section, we apply our method to the ChIP-Seq data generated by Chen et al. (2008). The dataset contains the DNA binding sites of 12 transcription factors (TFs) in mouse embryonic stem cells: Smad1, Stat3, Sox2, Pou5f1, Nanog, Esrrb, Tcfcp2l1, Klf4, Zfx, E2f1, Myc, and Mycn. For each TF, an association strength score, which is a weighted sum of the corresponding ChIP-Seq signal strength, was calculated for each of 18,936 genes (Ouyang et al. 2009). Roughly speaking, this score can be understood as a measure of the binding strength of a TF to a gene. Following the same preprocessing as in Wang and Zhou (2021), genes with zero association scores were removed from our analysis. Accordingly, our observed data matrix, of size n × p = 8462 × 12, contains the association scores of the 12 TFs over 8,462 genes. We aim to build a causal network that reveals how these 12 TFs affect each other's binding to genes. The association scores of a TF are typically bimodal, and they were discretized before network estimation; see Figure S1 in the supplemental material for an illustration of the discretization. We distributed this dataset across K = 20 local machines and applied DARLS, HC, MMHC, PC and FGES to the distributed data to learn the protein-DNA binding network. NOTEARS was excluded because its performance was not competitive.
Local estimates of a competing method were combined to construct a global graph as in Section 5. To ease the comparison, we controlled the sparsity of the estimated networks such that every method produced two graphs using the distributed data, with roughly s_0 = 17 and 29 edges, except FGES, which had difficulty generating output close to 17 edges. We also applied each method to the combined data (i.e., K = 1) with the same parameters as in Section 5. In this case, each estimated graph had around s_0 = 30 edges. The only exception is PC, whose estimate had 21 edges even when its significance level had been reduced to 10^{-10}.

Since the true network structure is unknown, test-data likelihood under multinomial DAG models in ten-fold cross-validation is used to assess the accuracy of the estimated networks. Denote by g the test-data likelihood and by g̃ the training-data likelihood of an estimated graph under multinomial DAG models (see supplemental material Section S1 for the calculation of g and g̃). We also compute BIC = −2 log g̃ + log(ñ) N(Ĝ) for model comparison, where ñ is the training sample size and N(Ĝ) is the number of multinomial parameters for the estimated graph Ĝ. We choose some benchmarks to ease the comparison. Denote by g_B and BIC_B the highest test-data likelihood and the lowest BIC value, respectively. Define the log-likelihood difference ∆(log g) = log g − log g_B and the BIC difference ∆BIC = BIC − BIC_B. Note that since log g is a test-data log-likelihood while BIC is calculated with training data, the magnitude of ∆BIC is much larger than that of ∆(log g). We also compute the value of exp{−∆BIC/(2ñ)} as an approximation to the normalized marginal likelihood ratio (NLR), (P(X̃ | Ĝ)/P(X̃ | Ĝ_B))^{1/ñ}, where X̃ denotes the training data, between an estimated DAG Ĝ of a competing method and the benchmark Ĝ_B selected by BIC.

To gain more scientific insights, we examine the estimated CPDAG. Of particular interest is the directed path Nanog→Pou5f1→Sox2, among the three core regulators in the gene regulatory network in mouse embryonic stem cells (Chen et al. 2008; Zhou et al. 2007). It is well known that many genes are co-regulated by Pou5f1, Sox2 and Nanog. The estimated path suggests that Nanog binding would cause the binding of Pou5f1, which then may cause Sox2 binding. This provides a new clue as to how the three TFs work together to co-regulate downstream genes. Data analysis in Chen et al. (2008), the original work that generated the ChIP-Seq data, suggests that there are two clusters of TFs that tend to co-bind: the first group consists of Nanog, Sox2, Oct4, Smad1 and STAT3, while the second group includes Mycn, Myc, Zfx and E2f1. These two groups are clearly recovered in the estimated CPDAG, which contains a dense undirected subgraph on the second group of TFs and a fully directed subgraph on the first group. Moreover, the directed edge Myc→Pou5f1 indicates that the second group might be causally upstream of the first group, a novel hypothesis for potential experimental investigation.
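The model-comparison quantities used above are simple to compute once the training log-likelihood and parameter count of each estimated graph are available; the sketch below evaluates BIC and the normalized marginal likelihood ratio exp{−∆BIC/(2ñ)} on made-up numbers (the log-likelihoods, parameter counts, and training size are hypothetical, not results from the paper).

```python
import math

def bic(train_loglik, n_params, n_train):
    """BIC = -2 * training-data log-likelihood + log(n) * number of parameters."""
    return -2.0 * train_loglik + math.log(n_train) * n_params

def normalized_likelihood_ratio(bic_value, bic_benchmark, n_train):
    """exp{-Delta BIC / (2 n)}: approximate per-sample marginal likelihood ratio
    between an estimated graph and the BIC benchmark."""
    return math.exp(-(bic_value - bic_benchmark) / (2.0 * n_train))

# Hypothetical numbers for two estimated graphs on roughly 9/10 of 8462 genes.
n_train = 7616
bic_a = bic(train_loglik=-52000.0, n_params=180, n_train=n_train)
bic_b = bic(train_loglik=-51850.0, n_params=260, n_train=n_train)
best = min(bic_a, bic_b)
print([round(normalized_likelihood_ratio(b, best, n_train), 4) for b in (bic_a, bic_b)])
```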
In this paper, we develop the DARLS algorithm, which incorporates a distributed optimization method into simulated annealing to learn causal graphs from distributed data. Based on simulation studies and a real-data application, we have shown that DARLS is highly competitive even when its model assumptions are violated. In its distributed optimization given an ordering, DARLS learns a causal graph by optimizing a convex penalized likelihood. In practice, one may consider concave penalties (Aragam and Zhou 2015; Ye et al. 2021) to improve accuracy when learning DAGs, although theoretical guarantees for the convergence of distributed learning with a concave penalty may be lacking. This is a promising direction for further development of our method. Our proposed GLDAG model includes a family of flexible distributions besides linear Gaussian models (with equal variance), and thus can be applied to different types of data. It is also possible to add interaction terms to GLDAGs (3) to model nonlinear effects between variables. Such a generalization is expected to approximate real causal relations with higher accuracy. We have established that continuous GLDAGs are identifiable, justifying their use in causal discovery, and it is left as future work to study the identifiability of general GLDAGs. The primary focus of this paper is on big distributed data, with large n but moderate p. In this setting, we established the convergence of the solution obtained by distributed optimization to a global minimizer of the loss on the combined data, and the consistency of the global minimizer as an estimate of the true DAG parameter. However, generalizing the convergence and consistency results to allow diverging p is theoretically interesting and left as future work.

References
Exposure Notification Privacy-preserving Analytics (ENPA) White Paper
Learning Large-Scale Bayesian Networks with the sparsebn Package
Concave Penalized Estimation of Sparse Gaussian Bayesian Networks
Bayesian Network Marker Selection via the Thresholded Graph Laplacian Gaussian Prior
Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells
Optimal Structure Identification with Greedy Search
Causal Effects in Nonexperimental Studies: Reevaluating the Evaluation of Training Programs
Communication-Efficient Accurate Statistical Estimation
Being Bayesian about Network Structure: A Bayesian Approach to Structure Discovery in Bayesian Networks
Learning Sparse Causal Gaussian Networks with Experimental Intervention: Regularization and Coordinate Descent
Learning Bayesian Networks by Hill Climbing: Efficient Methods Based on Progressive Restriction of the Neighborhood
Using Causal Diagrams to Understand Common Problems in Social Epidemiology
Learning Bayesian Network Structure from Distributed Homogeneous Data
Penalized Estimation of Directed Acyclic Graphs from Discrete Data
A Survey of Learning Causality with Data: Problems and Methods
Learning Bayesian Networks: The Combination of Knowledge and Statistical Data
Bayesian Nonparametric Modeling for Causal Inference
Nonlinear Causal Discovery with Additive Noise Models
Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review
Communication-Efficient Distributed Statistical Inference
Causal Inference Using Graphical Models with the R Package pcalg
Structure Learning of Bayesian Networks by Genetic Algorithms: A Performance Analysis of Control Parameters
Conversion of Categorical Variables into Numerical Variables via Bayesian Network Classifiers for Binary Classifications
Strong Completeness and Faithfulness in Bayesian Networks
Protection of Big Data Privacy
A Survey of Distributed Optimization and Control Algorithms for Electric Power Systems
Distributed Bayesian Network Structure Learning
ChIP-Seq of Transcription Factors Predicts Absolute and Differential Gene Expression in Embryonic Stem Cells
Proximal Algorithms
Causal Diagrams for Empirical Research
Causal Inference in Statistics: An Overview
Identifiability of Gaussian Structural Models with Equal Error Variances
Causal Discovery with Continuous Additive Noise Models
Causal Learning from Predictive Modeling for Observational Data
A Million Variables and More: the Fast Greedy Equivalence Search Algorithm for Learning High-dimensional Graphical Causal Models, with an Application to Functional Magnetic Resonance Images
Learning Bayesian Networks with Thousands of Variables
Bayesian Network Repository
Learning Bayesian Networks with the bnlearn R Package
Communication-Efficient Distributed Optimization using an Approximate Newton-type Method
A Linear Non-Gaussian Acyclic Model for Causal Discovery
Bayesian Sparse Mediation Analysis with Targeted Penalization of Natural Indirect Effects
An Algorithm for Fast Recovery of Sparse Causal Graphs
PEnBayes: A Multi-Layered Ensemble Approach for Learning Bayesian Network Structure from Big Data
The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm
Causal Network Learning with Non-invertible Functional Relationships
A Survey of Distributed Optimization
Optimizing Regularized Cholesky Score for Order-Based Learning of Bayesian Networks
Model Selection and Estimation in Regression with Grouped Variables
Communication-Efficient Algorithms for Statistical Optimization
DAGs with NO TEARS
DAGs with NO TEARS: Continuous Optimization for Structure Learning
A Gene Regulatory Network in Mouse Embryonic Stem Cells
Parallelized Stochastic Gradient Descent