Submitted 13 March 2017. Accepted 7 June 2017. Published 3 July 2017.
Corresponding author: Jacob M. Schreiber, jmschr@cs.washington.edu
Academic editor: Jingbo Wang
DOI 10.7717/peerj-cs.122
Copyright 2017 Schreiber and Noble. Distributed under Creative Commons CC-BY 4.0.

Finding the optimal Bayesian network given a constraint graph

Jacob M. Schreiber1 and William S. Noble2
1 Department of Computer Science, University of Washington, Seattle, WA, United States of America
2 Department of Genome Science, University of Washington, Seattle, WA, United States of America

ABSTRACT
Despite recent algorithmic improvements, learning the optimal structure of a Bayesian network from data is typically infeasible past a few dozen variables. Fortunately, domain knowledge can frequently be exploited to achieve dramatic computational savings, and in many cases domain knowledge can even make structure learning tractable. Several methods have previously been described for representing this type of structural prior knowledge, including global orderings, super-structures, and constraint rules. While super-structures and constraint rules are flexible in terms of what prior knowledge they can encode, they achieve savings in memory and computational time simply by avoiding the consideration of invalid graphs. We introduce the concept of a "constraint graph" as an intuitive method for incorporating rich prior knowledge into the structure learning task. We describe how this graph can be used to reduce the memory cost and computational time required to find the optimal graph subject to the encoded constraints, beyond merely eliminating invalid graphs. In particular, we show that a constraint graph can break the structure learning task into independent subproblems even in the presence of cyclic prior knowledge. These subproblems are well suited to being solved in parallel on a single machine or distributed across many machines without excessive communication cost.

Subjects Artificial Intelligence, Data Mining and Machine Learning, Data Science, Distributed and Parallel Computing
Keywords Bayesian network, Structure learning, Discrete optimization, Parallel processing, Big data

INTRODUCTION
Bayesian networks are directed acyclic graphs (DAGs) in which nodes correspond to random variables and directed edges represent dependencies between these variables. Conditional independence between a pair of variables is represented as the lack of an edge between the two corresponding nodes. The parameters of a Bayesian network are typically simple to interpret, making such networks highly desirable in a wide variety of application domains that require model transparency. Frequently, one does not know the structure of the Bayesian network beforehand, making it necessary to learn the structure directly from data.

The most intuitive approach to the task of Bayesian network structure learning (BNSL) is "search-and-score," in which one iterates over all possible DAGs and chooses the one that optimizes a given scoring function. Recent work has described methods that find the optimal Bayesian network structure without explicitly considering all possible DAGs (Malone, Yuan & Hansen, 2011; Yuan, Malone &
Wu, 2011; Fan, Malone & Yuan, 2014; Jaakkola et al., 2003), but these methods are still infeasible for more than a few dozen variables. In practice, a wide variety of heuristics are often employed for larger datasets. These algorithms, which include branch-and-bound (Suzuki, 1996), Chow-Liu trees (Chow & Liu, 1968), optimal reinsertion (Moore & Wong, 2003), and hill-climbing (Tsamardinos, Brown & Aliferis, 2006), typically attempt to efficiently identify a structure that captures the majority of important dependencies.

In many applications, the search space of possible network structures can be reduced by taking into account domain-specific prior knowledge (Gamberoni et al., 2005; Zuo & Kita, 2012; Schneiderman, 2004; Zhou & Sakane, 2003). A simple method is to specify an ordering on the variables and require that the parents of a variable precede it in the ordering (Cooper & Herskovits, 1992). This representation leads to tractable structure learning because the parent set for each variable can be identified independently of the other variables. Unfortunately, prior knowledge is typically more ambiguous than a full topological ordering and may exist for only some of the variables. A more general approach to handling prior knowledge is to employ a "super-structure," i.e., an undirected graph that defines the superset of edges allowed in valid learned structures, forbidding all others (Perrier, Imoto & Miyano, 2008). This method has been fairly well studied and can also be used as a heuristic if defined through statistical tests instead of prior knowledge. A natural extension of the undirected super-structure is the directed super-structure (Ordyniak & Szeider, 2013), but to our knowledge the only work done on directed super-structures proved that an acyclic directed super-structure is solvable in polynomial time. An alternate, but similar, concept is to define which edges must or cannot exist as a set of rules (Campos & Ji, 2011). However, these rule-based techniques do not specify how one would exploit the constraints to reduce the computational time beyond simply skipping over invalid graphs.

We propose the idea of a "constraint graph" as a method for incorporating prior information into the BNSL task. A constraint graph is a directed graph where each node represents a set of variables in the BNSL problem and edges represent which variables are candidate parents for which other variables. The primary advantage of constraint graphs over other methods is that the structure of the constraint graph can be used to achieve savings in both memory cost and computational time beyond simply eliminating invalid structures. This is done by breaking the problem into independent subproblems even in the presence of cyclic prior knowledge. An example of such cyclic prior knowledge is identifying two groups of variables that can draw parents only from each other, similar to a bipartite graph; in that setting it can be difficult to identify the best parents for each variable without creating a cycle in the learned structure.
In addition, constraint graphs are visually more intuitive than a set of written rules while also typically being simpler than a super-structure, because constraint graphs are defined over sets of variables instead of the original variables themselves. This intuition, combined with automatic methods for identifying parallelizable subproblems, makes constraint graphs easy for non-experts to define and use without requiring them to know the details of the structure learning task.

This technique is similar to work done by Fan, Malone & Yuan (2014), where the authors describe the same computational gains through the identification of "potentially optimal parent sets." One difference is that Fan et al. define the constraints on individual variables instead of on sets of variables, as this work does. By defining the constraints on sets of variables instead of individual ones, one can identify further computational gains when presented with cyclic prior knowledge.

Given that two types of graphs will be discussed throughout this paper, the Bayesian network we are attempting to learn and the constraint graph, we will use the terminology "variable" exclusively in reference to the Bayesian network and "node" exclusively in reference to the constraint graph.

CONSTRAINT GRAPHS
A constraint graph is a directed graph in which nodes contain disjoint sets of variables from the BNSL task, and edges indicate which sets of variables can serve as parents to which other sets of variables. A self-loop in the constraint graph indicates that nothing is known a priori about the relationships among the variables in that node, whereas the lack of a self-loop indicates that no variable in that node can serve as a parent for another variable in the same node. Thus, the naive BNSL task can be represented as a constraint graph consisting of a single node with a self-loop.

A constraint graph can be thought of as a way to group the variables (Fig. 1A), define relationships between these groups (Fig. 1C), and then guide the BNSL task to efficiently find the optimal structure given these constraints (Fig. 1D). In contrast, a directed super-structure defines all possible edges that can exist in accordance with the prior knowledge (Fig. 1B). Typically, a directed super-structure is far more complicated than the equivalent constraint graph.

Figure 1 A constraint graph grouping variables. (A) We wish to learn a Bayesian network over 11 variables. The variables are colored according to the group that they belong to, which is defined by the user. These variables can either (B) be organized into a directed super-structure or (C) grouped into a constraint graph to encode equivalent prior knowledge. Both graphs define the superset of edges which can exist, but the constraint graph uses far fewer nodes and edges to encode this knowledge. (D) Either technique can then be used to guide the BNSL task to learn the optimal Bayesian network given the constraints.

Cyclic prior knowledge can be represented as a simple cycle in the constraint graph, such that the variables in node A draw their parents solely from node B, and B from A.
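As a concrete illustration, a grouping like the one in Fig. 1 can be written down as a small directed graph whose nodes are sets of column indices. The sketch below is a minimal example using networkx; the particular groupings and column indices are hypothetical, and the interface of the accompanying pomegranate implementation is not assumed here, only the plain graph object.

```python
# A minimal sketch (not the authors' code): encoding a constraint graph as a
# networkx DiGraph whose nodes are frozensets of column indices in the dataset.
# The groupings below are hypothetical, chosen only to mirror Fig. 1.
import networkx as nx

group_a = frozenset({0, 1, 2, 3})   # e.g., variables with no known parents
group_b = frozenset({4, 5, 6, 7})   # variables that may draw parents from group A
group_c = frozenset({8, 9, 10})     # variables that may draw parents from A and B

cg = nx.DiGraph()
cg.add_nodes_from([group_a, group_b, group_c])
cg.add_edge(group_a, group_b)       # variables in A are candidate parents of B
cg.add_edge(group_a, group_c)
cg.add_edge(group_b, group_c)
cg.add_edge(group_c, group_c)       # a self-loop: no prior knowledge within C

# A single node with a self-loop encodes the unconstrained BNSL task.
naive = nx.DiGraph()
all_vars = frozenset(range(11))
naive.add_edge(all_vars, all_vars)
```

The sketches in the remainder of this paper assume this encoding (variable groups as frozensets of column indices) when they need a concrete representation of a constraint graph.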
Any method for reducing computational time through prior knowledge exploits the "global parameter independence property" of BNSL. Briefly, this property states that the optimal parents for a variable are independent of the optimal parents for another variable as long as the variables do not form a cycle in the resulting Bayesian network. This acyclicity requirement is typically computationally challenging to enforce because a cycle can involve more variables than the ones being directly considered, such as a graph that is simply a directed loop over all of the variables. However, given an acyclic constraint graph or an acyclic directed super-structure, it is impossible to form a cycle in the resulting structure; hence, the optimal parent set for each variable can be identified independently of all other variables.

A convenient property of constraint graphs, and one of their advantages relative to other methods, is that independent subproblems can be found through global parameter independence even in constraint graphs that contain cycles. We describe in 'Solving a component of the constraint graph' the exact algorithm for finding optimal parent sets for each case one can encounter in a constraint graph. Briefly, the constraint graph is first broken up into its strongly connected components (SCCs), which identify which variables can have their parent sets found independently of all other variables ("solving a component") without the possibility of forming a cycle in the resulting graph. Typically these SCCs will be single nodes from the constraint graph, but they may be comprised of multiple nodes if cyclic prior knowledge is being represented. In the case of an acyclic constraint graph, all SCCs will be single nodes, and in fact each variable can be optimized without needing to consider the other variables, in line with theoretical results from Ordyniak & Szeider (2013).

In addition to allowing these problems to be solved in parallel, this breakdown suggests a more efficient method of sharding the data in a distributed learning context. Specifically, one can assign an entire SCC of the constraint graph to a machine, including all columns of data corresponding to the variables in that SCC and all variables in nodes which are parents of nodes in the SCC. Given that all subproblems which involve this shard of the data are contained in this SCC of the constraint graph, there will never be duplicate shards, and all tasks involving a shard are limited to the same machine. The concept of identifying SCCs as independent subproblems has also been described in Fan, Malone & Yuan (2014).
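This decomposition is straightforward to prototype. The sketch below assumes the networkx encoding from the earlier example (variable groups as frozensets of column indices) and lists the SCCs of a constraint graph together with the data columns each subproblem would need if sharded to its own machine; it is an illustration of the idea rather than the authors' implementation.

```python
# A minimal sketch (not the authors' code): one independent subproblem per SCC
# of the constraint graph. Its shard needs the SCC's own columns plus the
# columns of all parent nodes of the SCC.
import networkx as nx

def subproblems(cg: nx.DiGraph):
    """Yield (scc_members, columns_needed) pairs, one per strongly connected component."""
    condensation = nx.condensation(cg)            # DAG whose nodes are the SCCs of cg
    for scc_id, data in condensation.nodes(data=True):
        members = data["members"]                 # constraint-graph nodes in this SCC
        own_cols = set().union(*members)
        parent_cols = set()
        for node in members:
            for parent in cg.predecessors(node):
                if parent not in members:
                    parent_cols |= parent
        yield members, own_cols | parent_cols

# Example usage with the graph `cg` built earlier:
# for members, cols in subproblems(cg):
#     print(len(members), "node(s), columns needed:", sorted(cols))
```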
It is possible to convert any directed super-structure into a constraint graph and vice versa, though it is far simpler to go from a constraint graph to a directed super-structure. To convert from a directed super-structure to a constraint graph, one must first identify all strongly connected components that contain more than a single variable. All variables in such a strongly connected component can be put into the same node of the constraint graph, and that node is given a self-loop. Then, one tabulates the unique parent and child sets that each variable can have. All variables outside of the previously identified strongly connected components that share the same parent and child sets can be grouped together into a node in the constraint graph. Edges then connect these nodes based on the parent sets specified for each node. In the situation where a node in the constraint graph can draw parents from only a subset of the variables in a node created by the identification of the strongly connected components, that node must be broken into two nodes, both with self-loops and with edges connecting them to each other, so that only a subset of those variables can serve as parents for another node. In contrast, to convert from a constraint graph to a directed super-structure one simply draws, for each node, an edge from every variable in that node to every variable in each of the node's children. We suggest that constraint graphs are the more intuitive method, both because of their simpler representation and because of the ease of extracting computational benefits from the task.

METHODS
Bayesian network structure learning
Although solving a component in a constraint graph can be accomplished by a variety of algorithms, including heuristic algorithms, we assume for this paper that one is using some variant of the exact dynamic programming algorithm proposed by Malone, Yuan & Hansen (2011). We briefly review that algorithm here. The goal of the algorithm is to identify the optimal Bayesian network defined over the set of variables without having to repeat any calculations and without having to use excessive memory. This is done by defining additional graphs, the parent graphs and the order graph. We will refer to each node in these graphs as an "entry" to distinguish them from the nodes of the constraint graph and the learned Bayesian network.

A parent graph is defined for each variable and can be thought of as a lattice, where the entries in layer i correspond to all combinations of the other variables of size i. Each entry is connected to the entries in the previous layer that are subsets of it, such that (X1, X2) would be connected to both X1 and X2. For each entry, the score of the variable is calculated using the parents in the entry and compared to the scores held in the parent entries, recording only the best scoring value and parent set amongst them. Each entry thus holds the dynamically calculated best parent set and associated score, allowing constant-time lookup later of the best parent set given a set of possible parents.

The order graph is structured in the same manner as the parent graphs except that it is defined over all variables. In contrast with the parent graphs, it is the edges that store the useful information, in the form of the score associated with adding a given variable to the set of seen variables stored in the entry, along with the parent set that yields this score. Each path from the empty root entry to the leaf entry containing the full set of variables encodes the optimal network given a topological sort of the variables, and the shortest path encodes the optimal network overall. This data structure reduces the time required to find the optimal Bayesian network from O(n·2^{n(n−1)}) to O(n·2^n) in the number of variables n, without the need to keep a large cache of values.

Structure learning is flexible with respect to the score function used to identify the optimal graph. There are many score functions, which typically aim to penalize the log likelihood of the data by the complexity of the graph in order to encourage sparser structures. These usually come in the form of Bayesian score functions, such as Bayesian-Dirichlet (Heckerman, Geiger & Chickering, 1995), or scores derived from information theory, such as minimum description length (MDL) (Suzuki, 1996). Most score functions decompose across the variables of a Bayesian network according to the global parameter independence property, such that the score for a dataset given a model is equal to the product of the score of each variable given its parents. While constraint graphs remain agnostic to the specific score function used, we assume that MDL is used, as it has several desirable computational benefits. For review, MDL defines the score as follows:

MDL(D|M) = P(D|M) − (1/2) log(N) |B|     (1)

where |B| is the number of parameters in the network. The term "minimum description length" arises from needing (1/2) log(N) bits to represent each parameter in the model, making the second term the total number of bits needed to represent the full model. The MDL score function has the convenient property that a variable cannot have more than log(n / log(n)) parents given n samples, greatly reducing computational time.
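To make Eq. (1) concrete, the sketch below computes the decomposed MDL score of a single discrete variable given a candidate parent set, the kind of per-variable score a parent-graph entry would store. It is a simplified illustration, not the authors' implementation, and it reads the data-fit term as the log likelihood, the usual interpretation of P(D|M) in Eq. (1).

```python
# A minimal sketch (not the authors' code): the per-variable MDL score for a
# discrete variable given a candidate parent set, computed from a data matrix
# of integer-coded rows. Larger scores are better.
import math
from collections import Counter
from typing import Sequence, Tuple

def mdl_score(data: Sequence[Sequence[int]], child: int, parents: Tuple[int, ...]) -> float:
    """Score one variable given a candidate parent set."""
    n_samples = len(data)
    child_values = {row[child] for row in data}
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marginal = Counter(tuple(row[p] for p in parents) for row in data)

    log_likelihood = sum(
        count * math.log(count / marginal[config])
        for (config, _), count in joint.items()
    )
    # One free parameter per observed parent configuration and non-reference child value.
    n_parameters = len(marginal) * (len(child_values) - 1)
    return log_likelihood - 0.5 * math.log(n_samples) * n_parameters

# Example: score column 2 with columns (0, 1) as parents.
# X = [[0, 1, 1], [1, 0, 1], [1, 1, 0], [0, 0, 0]]
# print(mdl_score(X, child=2, parents=(0, 1)))
```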
Solving a component of the constraint graph
The strongly connected components of a constraint graph can be identified using Tarjan's algorithm (Tarjan, 1971). Each SCC corresponds to a subproblem of the constraint graph and can be solved independently. In many cases the SCC will be a single node of the constraint graph, because prior knowledge is typically not cyclic. In general, the SCCs of a constraint graph can be solved in any order due to the global parameter independence property.

The algorithm for solving an SCC of a constraint graph is a straightforward modification of the dynamic programming algorithm described above. Specifically, parent graphs are created for each variable in the SCC but defined only over the union of possible parents for that variable. Consider the case of a simple, four-node cycle with no self-loops such that W → X → Y → Z → W. A parent graph is defined for each variable in W ∪ X ∪ Y ∪ Z but only over valid parents. For example, the parent graph for X1 would be defined only over the variables in W. Then, an order graph is defined with entries that violate the edge structure of the constraint graph filtered out. The first layer of the order graph would be unchanged, containing only singletons, but the second layer would prohibit entries with two variables from the same node, because there are no valid orderings in which Xi is a parent of Xj, and it would prohibit entries in which a variable from W is joined with a variable from Y. One can identify the valid entries by taking the entries of the previous layer and iterating over each variable present, adding all valid parents of that variable which are not already present in the set.

A simple example illustrating the algorithm is a constraint graph made up of a four-node cycle where each node contains only a single variable (Fig. 2A). The parent graphs defined for this would consist solely of two entries, the null entry and the entry corresponding to the only valid parent. The first layer of the order graph would contain all variables, as before (Fig. 2B). However, once a variable is chosen to start the topological ordering, the order of the remaining variables is fixed because of the constraints, producing a far simpler lattice.

Figure 2 An example of a constraint graph and resulting order graph. (A) A constraint graph is defined as a cycle over four nodes with each node containing a single variable. (B) The resulting order graph during the BNSL task. It is significantly sparser than in the typical BNSL task because after choosing a variable to start the topological ordering the remaining variables must be added in the order defined by the cycle.
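The entry-expansion rule described above (grow an entry only by the valid parents of variables it already contains) is easy to state in code. The sketch below is a simplified illustration under the assumption that allowed parents are given as a per-variable mapping derived from the constraint graph; it only enumerates the constrained order-graph entries and does not compute scores.

```python
# A minimal sketch (not the authors' code): enumerating the entries of a
# constrained order graph, layer by layer, from a per-variable map of allowed
# parents derived from the constraint graph.
from typing import Dict, FrozenSet, List, Set

def order_graph_layers(allowed_parents: Dict[int, Set[int]]) -> List[Set[FrozenSet[int]]]:
    """Return the layers of valid order-graph entries (sets of 'seen' variables)."""
    variables = set(allowed_parents)
    layers = [{frozenset()}]
    while True:
        next_layer: Set[FrozenSet[int]] = set()
        for entry in layers[-1]:
            # The first layer contains all singletons; afterwards an entry is
            # grown only by valid parents of variables already present in it.
            candidates = variables - entry if not entry else set()
            for var in entry:
                candidates |= allowed_parents[var] - entry
            for var in candidates:
                next_layer.add(entry | {var})
        if not next_layer:
            return layers
        layers.append(next_layer)

# The four-node cycle of Fig. 2A: variable i may only have variable (i - 1) % 4 as a parent.
cycle = {i: {(i - 1) % 4} for i in range(4)}
print([len(layer) for layer in order_graph_layers(cycle)])   # [1, 4, 4, 4, 1]
```

The printed layer sizes match the sparse lattice of Fig. 2B, where each layer past the first contains only the entries reachable by following the cycle.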
Because constraint graphs can encode a wide variety of different constraints, the complexity of the task depends on the structure of the constraint graph. Broadly, the results from Ordyniak & Szeider (2013) still hold, namely, that acyclic constraint graphs can be solved in quadratic time. As was found in Fan, Malone & Yuan (2014), because each SCC can be solved independently, the time complexity for constraint graphs containing a cycle corresponds to the time complexity of the worst-case component. Fortunately, although the complexity of a node engaging in a cycle is still exponential, it is only exponential with respect to the number of variables that the node interacts with. Adding additional, equally sized nodes to the constraint graph only causes the algorithm to grow linearly in time and adds no memory cost if the components are solved sequentially.

The algorithm described above has five natural cases, which are described below.

One node, no parents, no self loop: The variables in this node have no parents, so nothing needs to be done to find the optimal parent sets given the constraints. This naturally takes O(1) time to solve.

One node, no parents, self loop: This is equivalent to exact BNSL with no prior knowledge. In this case, the previously described dynamic programming algorithm is used to identify the optimal structure of the subnetwork containing only the variables in this node. This takes O(n·2^n) time, where n is the number of variables in the node.

One node, one or more parent nodes, no self loop: In this case it is impossible for a cycle to be formed in the resulting Bayesian network regardless of the optimal parent sets, so we can justify solving every variable in this node independently by the global parameter independence property. Doing so results in a significant improvement over applying the algorithm naively: because the optimal topological ordering does not need to be discovered, the optimal parent set for each variable can be calculated without dynamic programming, and neither the parent graphs nor the order graph need to be explicitly calculated or stored in memory (see the sketch following these case descriptions). This takes O(nm^k) time, where n is the number of variables in the node, m is the number of possible parents, and k is the maximum number of parents that a variable can have, in this case set by the MDL bound. If k is set to any constant value, then this step requires quadratic time with respect to the number of possible parents and linear time with respect to the number of variables in the node.
One node, one or more parents, self loop: Initially, one might think that solving this SCC could involve taking the union of all variables from all involved nodes, running exact BNSL over the full set, and simply discarding the parent sets learned for the variables not in the currently considered node. However, in the same way that one should not handle prior knowledge by learning the optimal graph over all variables and then discarding the edges that violate the prior knowledge, one should not do so in this case. Instead, a modification can be made to the dynamic programming algorithm itself to restrict the parent sets on a variable-by-variable basis. For simplicity, we define the variables in the current node of the constraint graph as X and the union of all variables in the parent nodes in the constraint graph as Y. We begin by setting up an order graph, as usual defined over X. We then add Y to each entry in the order graph, such that the root entry is now comprised of Y instead of the empty set and the leaf entry is comprised of X ∪ Y instead of just X. Because the primary purpose of the order graph is to identify the optimal parent sets that do not form cycles, this addition is intuitive: it is impossible to form a cycle by including any of the variables in Y as parents of any of the variables in X. In other words, if one attempted to find the optimal topological ordering over X ∪ Y, it would always begin with the variables in Y but would be invariant to the ordering within Y. Parent graphs are then created for all variables in X but are defined over the set of all variables in X ∪ Y, because that is the full set of candidate parents. This restriction allows the optimal parents for each variable in X to be identified without wasting time considering what the parent sets of the variables in Y should be, or potentially throwing away the optimal graph because of improper edges leading from a variable in Y to a variable in X. This step takes O(n·2^{n+m}) time, where n is the number of variables in the node and m is the number of variables in the parent nodes. This is because we only need to define a parent graph for the variables in the node we are currently considering, but these parent graphs must be defined over all variables in the node plus all of the variables in the parent nodes.

Multiple nodes: The algorithm as presented initially is used to solve an entire component at the same time.
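As a concrete illustration of the "no self loop" case above, the sketch below finds each variable's best parent set independently by brute-force enumeration of candidate subsets up to size k. It is a simplified illustration of the idea, not the authors' implementation, and it reuses the hypothetical mdl_score helper from the earlier sketch.

```python
# A minimal sketch (not the authors' code): solving a constraint-graph node that
# has parent nodes but no self loop. Each variable's optimal parent set is found
# independently; no parent graphs or order graph are required.
from itertools import combinations
from typing import Iterable, Sequence

def best_parents_per_variable(
    data: Sequence[Sequence[int]],
    node_vars: Iterable[int],           # variables in the constraint-graph node
    candidate_parents: Iterable[int],   # union of variables in the parent nodes
    max_parents: int,                   # e.g., the MDL bound on parent-set size
):
    """Return {variable: (best_parent_set, best_score)} under the constraints."""
    candidates = tuple(candidate_parents)
    result = {}
    for var in node_vars:
        best = ((), mdl_score(data, var, ()))   # assumes the mdl_score sketch above is in scope
        for size in range(1, max_parents + 1):
            for parents in combinations(candidates, size):
                score = mdl_score(data, var, parents)
                if score > best[1]:
                    best = (parents, score)
        result[var] = best
    return result
```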
RESULTS
While it is intuitive how a constraint graph provides computational gains by splitting the structure learning task into subproblems, we have thus far only alluded to the idea that prior knowledge can provide efficiencies beyond that. In this section we examine the computational gains achieved in the three non-trivial cases of the algorithm presented in 'Solving a component of the constraint graph'.

Acyclic constraint graphs can model the global stock market
First, we examine the computational benefits of an acyclic constraint graph modeling the global stock market. In particular, we want to identify, for each stock, which other stocks are predictive of its performance. We chose to do this by learning a Bayesian network over the opening and closing prices of 54 top-performing stocks from the New York Stock Exchange (NYSE) in the United States, the Tokyo Stock Exchange (TSE) in Japan, and the Financial Times Stock Exchange (FTSE) in England. Learning a Bayesian network over all 108 variables is clearly infeasible, so we encode in our constraint graph some common-sense restrictions (Fig. 3A). Specifically, opening and closing prices for the same market are grouped into separate nodes, for a total of six nodes in the constraint graph. There are no self-loops because the opening price of one stock does not influence the opening price of another stock. Naturally, the closing prices of one group of stocks are influenced by the opening prices of the stocks from the same market, but they are also influenced by the opening or closing prices of any markets which opened or closed in the meantime. For instance, the TSE closes after the FTSE opens, so the FTSE opening prices have the opportunity to influence the TSE closing prices. However, the TSE closes before the NYSE opens, so the NYSE cannot influence those stock prices. The dataset consists of opening and closing prices for these stocks between December 2nd, 2015 and November 29th, 2016, binarized to indicate whether the value was an increase compared to the prior price seen.

Figure 3 A section of the learned Bayesian network of the global stock market. (A) The constraint graph contains six nodes, the opening and closing prices for each of the three markets. These are connected such that the closing prices in a market depend on the opening prices but also on the most recent international activity. (B) The most connected subset of stocks from the learned network, covering 25 variables.

The resulting Bayesian network has some interesting connections (Fig. 3B). For example, the opening price of Microsoft influences the closing price of Raytheon, and the closing price of Debenhams plc, a British multinational retailer, influences the closing price of GE. In addition, there were some surprising and unexplained connections, such as Google and Johnson & Johnson influencing the closing price of Cobham plc, a British defense firm. Given that this example is primarily meant to illustrate the types of constraints a constraint graph can easily model, we suggest caution in reading too deeply into these connections.

It took only ~35 s on a computer with modest hardware to run BNSL over 250 samples. If we set the maximum number of parents to three, which is the empirically determined maximum number of parents, then it takes only ~2 s to run. In contrast, it would be infeasible to run the exact BNSL algorithm on even half the number of variables considered here.
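The market-hours constraints described above translate directly into a six-node constraint graph. The sketch below builds it with networkx under a purely hypothetical column layout (18 opening-price columns followed by 18 closing-price columns per market); only the edges explicitly described in the text are added, and the remaining cross-market edges of Fig. 3A would be added the same way.

```python
# A minimal sketch (not the authors' code): the six-node stock-market constraint
# graph. Column indices are hypothetical; only the within-market edges and the
# one cross-market edge stated in the text are included here.
import networkx as nx

markets = ["NYSE", "TSE", "FTSE"]
nodes = {}
offset = 0
for market in markets:
    nodes[market + "-open"] = frozenset(range(offset, offset + 18))
    nodes[market + "-close"] = frozenset(range(offset + 18, offset + 36))
    offset += 36   # 18 opening + 18 closing columns per market (hypothetical layout)

stocks_cg = nx.DiGraph()
stocks_cg.add_nodes_from(nodes.values())

# Opening prices of a market are candidate parents of that market's closing prices.
for market in markets:
    stocks_cg.add_edge(nodes[market + "-open"], nodes[market + "-close"])

# One cross-market edge stated in the text: the FTSE opens before the TSE closes.
stocks_cg.add_edge(nodes["FTSE-open"], nodes["TSE-close"])
```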
Constraint graphs allow learning of Bayesian network classifiers
Bayesian network classifiers are an extension of Bayesian networks to supervised learning tasks, created by defining a Bayesian network over both the feature variables and the target variables together. Normal inference methods are used to predict the target variables given the observed feature variables. In the case where the feature variables are always observed, only the Markov blanket of the target variables must be defined, i.e., their parents and children. The other variables are independent of the target variables and can be discarded, serving as a form of feature selection.

A popular Bayesian network classifier is the naive Bayes classifier, which defines a single class variable as the parent of all feature variables. A natural extension of this method is to learn which features are useful, instead of assuming they all are, thereby combining feature selection with parameter learning in a manner that has some similarities to decision trees. This approach can be modeled by using a constraint graph that has all feature variables X in one node and all target variables y in its parent node, such that y → X.

We empirically evaluated the performance of learning a simple Bayesian network classifier on the UCI digits dataset. The digits dataset is a collection of 8 × 8 images of handwritten digits, where the features are discretized values between 0 and 16 representing the intensity of a pixel and the labels are values between 0 and 9 representing the digit stored there. We learn a Bayesian network where the 64 pixels are in one node of the constraint graph and the class label is by itself in another node of the constraint graph that serves as a parent. We then train a Bayesian network classifier, a naive Bayes classifier, and a random forest classifier comprised of 100 trees on a training set of 1,500 images and test their performance on a held-out set of 297 images. As expected, the learned Bayesian network classifier falls between naive Bayes and the random forest in terms of both training time and test set performance (Table 1).

Table 1 Model comparison between naive Bayes, Bayesian network classifiers (BNC), and random forest. The three algorithms were evaluated on the UCI handwritten digits dataset, using as input binarized values indicating whether the intensity of each pixel was above average. The fitting time and test set accuracy are reported for each algorithm.

Model            Train time (s)    Test set accuracy
Naive Bayes      0.05              0.79
BNC              0.5               0.81
Random forest    1.4               0.89

Furthermore, more complicated Bayesian network classifiers can be learned with different constraint graphs. One interesting extension is, instead of constraining all features to be children of the target variable, to allow features to be either parents or children of the target variable. This can be specified by a cyclic constraint graph where y → X → y, preventing the model from spending time identifying dependencies between the features. Finally, in cases where some features may be missing, it may be beneficial to model all dependencies between the features in order to allow inference to flow from observed variables not directly connected to the target variables to the target variables. This can be modeled by adding a self loop on the feature variables X, allowing all edges to be learned except those between pairs of target variables. Learning a Bayesian network classifier in this manner will suffer from the same computational challenges as the unconstrained version, given the looseness of the constraints.
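The three classifier constraint graphs just described (y → X, the cycle y → X → y, and the version with a self loop on X) are all one- to three-edge graphs over column indices. The sketch below writes them out for the digits layout used above, assuming a hypothetical arrangement with the 64 pixel columns first and the class label in a 65th column, and using the networkx encoding from the earlier sketches.

```python
# A minimal sketch (not the authors' code): constraint graphs for Bayesian
# network classifiers on the digits data. Columns 0-63 are assumed to be the
# pixels and column 64 the class label (a hypothetical layout).
import networkx as nx

X = frozenset(range(64))     # feature variables (pixels)
y = frozenset({64})          # target variable (digit label)

# Naive-Bayes-like structure learning: features may only have y as a parent.
cg_forward = nx.DiGraph([(y, X)])

# Features may be either parents or children of y, but not of each other.
cg_cycle = nx.DiGraph([(y, X), (X, y)])

# Additionally model dependencies among the features themselves (self loop on X),
# which is essentially as hard as unconstrained BNSL over the features.
cg_self_loop = nx.DiGraph([(y, X), (X, y), (X, X)])
```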
Self-loops and parents
We then turn to the case where the strongly connected component is a main node with a self loop and a parent node. Because the order graph is defined only over the variables in the main node, its size is invariant to the number of variables in the parent node, allowing for speed improvements when it comes to calculating the shortest path. In addition, parent graphs are only defined for the variables in the main node, and so while they are not smaller than the ones in the exact algorithm, there are fewer of them. We compare the computational time and the complexity of the underlying order and parent graphs between the exact algorithm over the full set of variables and the modified algorithm based on a constraint graph (Table 2). The data consisted of randomly generated binary values, because the running time does not depend on the presence of underlying structure in the data. We note that in all cases there are significant speed improvements and simpler graphs, but that there are particularly encouraging speed improvements when the number of variables in the main node is increased. This suggests that it is always worth the time to identify which variables can be moved from a node with a self loop to a separate node.

Table 2 Algorithm comparison on a node with a self loop and other parents. The exact algorithm and the constrained algorithm proposed here were run on an SCC comprised of a main node with a self loop and one parent node. Shown are the results of increasing the number of variables in the main node while keeping the number of variables in the parent node steady at five, and the results of increasing the number of variables in the parent node while keeping the number of variables in the main node constant. For both algorithms we show the number of nodes across all parent graphs (PGN), the number of nodes in the order graph (OGN), the number of edges in the order graph (OGE), and the time to compute.

            Exact                                        Constraint graph
            PGN         OGN       OGE        Time (s)    PGN       OGN     OGE      Time (s)
Variables
4           2,304       512       2,304      0.080       1,024     16      32       0.033
8           53,248      8,192     53,248     1.30        32,768    256     1,024    0.545
12          1,114,112   131,072   1,114,112  27.03       786,432   4,096   24,576   9.56
Parents
4           2,304       512       2,304      0.087       1,280     32      80       0.045
8           53,248      8,192     53,258     1.401       20,480    32      80       0.356
12          1,114,112   131,072   1,114,112  27.22       327,680   32      80       4.01

Figure 4 Cyclic constraint graphs. (A) This constraint graph is comprised of a simple two-node cycle with each node containing four variables. (B) The learned Bayesian network on random data where some variables were forced to identical values. Each circle here corresponds to a variable in the resulting Bayesian network instead of a node in the constraint graph. There were multiple possible cycles which could have been formed, but the constraint graph prevented that from occurring. (C) This constraint graph encodes a four-node cycle with each node containing four variables. (D) The learned Bayesian network on random data with two distinct loops of identical values forced. Again, no loops are formed.

Cyclic constraint graphs
Lastly, we consider constraint graphs that encode cyclic prior knowledge. We visually inspect the results from cyclic constraint graphs to ensure that they do not produce cyclic Bayesian networks even when the potential exists. Two separate constraint graphs are inspected, a two-node cycle and a four-node cycle (Figs. 4A and 4C).
The dataset is comprised of random binary values, where the value of one variable in the cycle is copied to the other variables in the cycle to add synthetic structure. However, by jointly solving all nodes, cycles are avoided while the dependencies are still captured (Figs. 4B and 4D).

We then compare the exact algorithm without constraints to the use of an appropriate constraint graph, in a similar manner as before (Table 3). This is done first for four-node cycles where we increase the number of variables in each node of the constraint graph, and then for increasingly sized cycles with three variables per node. The exact algorithm likely produces structures that are invalid according to the constraints, so this comparison is done solely to highlight that efficiencies are gained by considering the constraints. In each case, using a constraint graph yields simpler parent and order graphs, and the computational time is significantly reduced. The biggest difference is in the number of nodes in the parent graphs, as the constraints place significant limitations on which variables are allowed to be parents of which other variables. Since the construction of the parent graphs is the only part of the algorithm which considers the dataset itself, it is unsurprising that significant savings are achieved for larger datasets when much smaller parent graphs are used.

Table 3 Algorithm comparison on a cyclic constraint graph. The exact algorithm and the constrained algorithm proposed here were run for four-node cycles with differing numbers of variables per node, for cycles with differing numbers of nodes but three variables per node, and for differing numbers of samples for a four-node, three-variable cycle. All experiments with differing numbers of variables or nodes were run on 1,000 randomly generated samples. Shown for both algorithms are the number of nodes across all parent graphs (PGN), the number of nodes in the order graph (OGN), the number of edges in the order graph (OGE), and the time to compute. Since the number of nodes does not change as a function of the number of samples, those values are not repeated in the blank cells.

            Exact                                         Constraint graph
            PGN         OGN       OGE        Time (s)     PGN     OGN       OGE         Time (s)
Variables
1           32          16        32         0.005        8       14        16          0.005
2           1,024       256       1,024      0.036        32      186       544         0.014
3           24,576      4,096     24,576     0.611        96      3,086     16,032      0.320
4           524,288     65,536    525,288    14.0         256     54,482    407,328     7.12
Nodes
2           192         64        192        0.111        48      56        150         0.008
4           24,576      4,096     24,576     0.634        96      3,086     16,032      0.217
6           2,359,296   262,144   2,359,296  60.9         144     168,068   1,307,358   26.12
Samples
100         24,576      4,096     24,576     0.357        96      3,086     16,032      0.311
1,000       –           –         –          0.615        –       –         –           0.211
10,000      –           –         –          2.670        –       –         –           0.357
100,000     –           –         –          243.9        –       –         –           10.41
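For reference, the synthetic setup used in this subsection is easy to reproduce in outline: random binary columns with one column's values copied across the cycle, plus a cyclic constraint graph. The sketch below is a simplified, hypothetical reconstruction of that setup (the exact data generation in the authors' experiments may differ), using the same networkx encoding as the earlier sketches.

```python
# A minimal sketch (not the authors' code): a two-node cyclic constraint graph
# over eight variables and a random binary dataset with synthetic structure,
# loosely mirroring the setup of Fig. 4A. Details are hypothetical.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n_samples = 1000

node_a = frozenset({0, 1, 2, 3})
node_b = frozenset({4, 5, 6, 7})
cyclic_cg = nx.DiGraph([(node_a, node_b), (node_b, node_a)])   # A <-> B, no self loops

X = rng.integers(0, 2, size=(n_samples, 8))
# Copy one variable's values across the cycle so that dependencies crossing the
# two nodes exist; the learned network must capture them without forming a
# directed cycle.
X[:, 4] = X[:, 0]
X[:, 1] = X[:, 4]
```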
DISCUSSION
Constraint graphs are a flexible way of encoding into the BNSL task prior knowledge concerning the relationships among variables. The graph structure can be exploited to identify potentially massive computational gains, and acyclic constraint graphs make tractable problems that would be infeasible to solve without constraints. This is particularly useful in cases where there are both a great number of variables and many constraints available from prior knowledge. We anticipate that the automatic manner in which parallelizable subtasks are identified in a constraint graph will be of particular interest given the recent increase in availability of distributed computing.

Although the networks learned in this paper are discrete, the same principles can be applied to all types of Bayesian networks. Because the constraint graph represents only a restriction on the parent set on a variable-by-variable basis, the same algorithms that are used to learn linear Gaussian or hybrid networks can be seamlessly combined with the idea of a constraint graph. In addition, most of the approximation algorithms which have been developed for BNSL can be modified to take constraints into account, because these constraints simply encode a limitation on the parent set for each variable.

One could extend constraint graphs in several interesting ways. The first is to assign weights to edges, so that the weight represents the prior probability that the variables in the parent set are parents of the variables in the child set, perhaps as pseudocounts to take into account when coupled with a Bayesian scoring function. A second way is to incorporate "hidden nodes," that is, variables which model underlying, unobserved phenomena and can be used to reduce the parameterization of the network. Several algorithms have been proposed for learning the structure of a Bayesian network given hidden variables (Elidan et al., 2001; Elidan & Friedman, 2005; Friedman, 1997). Modifying these algorithms to obey a constraint graph seems like a promising way to incorporate restrictions on this difficult task. A final way may be to encode ancestral relationships instead of direct parent relationships, indicating that a given variable must occur at some point before some other variable in the topological ordering.

ACKNOWLEDGEMENTS
We would like to acknowledge Maxwell Libbrecht, Scott Lundberg, and Brandon Malone for many useful discussions and comments on drafts of the paper.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This work is supported by an NSF IGERT grant DGE-1258485. There was no additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
NSF IGERT: DGE-1258485.

Competing Interests
The authors declare there are no competing interests.

Author Contributions
• Jacob M. Schreiber conceived and designed the experiments, performed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, performed the computation work, and reviewed drafts of the paper.
• William S. Noble analyzed the data, wrote the paper, and reviewed drafts of the paper.

Data Availability
The following information was supplied regarding data availability:
Code implementing the concept is available at GitHub: www.github.com/jmschrei/pomegranate
Data and code reproducing the figures are available at GitHub: https://github.com/jmschrei/constraint_graphs

REFERENCES
Campos C, Ji Q. 2011. Efficient structure learning of Bayesian networks using constraints. Journal of Machine Learning Research 12:663–689.
Chow C, Liu C. 1968. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory 14(3):462–467.
Cooper G, Herskovits E. 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9:309–347 DOI 10.1007/BF00994110.
Elidan G, Friedman N. 2005. Learning hidden variable networks: the information bottleneck approach. Journal of Machine Learning Research 6:81–127.
Elidan G, Lotner N, Friedman N, Koller D. 2001. Discovering hidden variables: a structure-based approach. In: Leen TK, Dietterich TG, Tresp V, eds. Advances in neural information processing systems (NIPS), vol. 13, 479–485.
Fan X, Malone BM, Yuan C. 2014. Finding optimal Bayesian network structures with constraints learned from data. In: UAI'14 proceedings of the thirtieth conference on uncertainty in artificial intelligence, 200–209.
Friedman N. 1997. Learning belief networks in the presence of missing values and hidden variables. In: ICML'97 proceedings of the fourteenth international conference on machine learning, 125–133.
Gamberoni G, Lamma E, Riguzzi F, Storari S, Volinia S. 2005. Bayesian networks learning for gene expression datasets. In: Advances in intelligent data analysis VI: 6th international symposium on intelligent data analysis, 109–120 DOI 10.1007/11552253_11.
Heckerman D, Geiger D, Chickering D. 1995. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning 20:197–242.
Jaakkola T, Sontag D, Globerson A, Meila M. 2003. Learning Bayesian network structure using LP relaxations. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics (AISTATS-10), 358–365.
Malone BM, Yuan C, Hansen EA. 2011. Memory-efficient dynamic programming for learning optimal Bayesian networks. In: Proceedings of the twenty-fifth AAAI conference on artificial intelligence.
Moore A, Wong W. 2003. Optimal reinsertion: a new search operator for accelerated and more accurate Bayesian network structure learning. In: Proceedings of the 20th international conference on machine learning (ICML-03), 552–559.
Ordyniak S, Szeider S. 2013. Parameterized complexity results for exact Bayesian network structure learning. Journal of Artificial Intelligence Research 46:263–302.
Perrier E, Imoto S, Miyano S. 2008. Finding optimal Bayesian network given a super-structure. Journal of Machine Learning Research 9:2251–2286.
Schneiderman H. 2004. Learning a restricted Bayesian network for object detection. In: Proceedings of the 2004 IEEE computer society conference on computer vision and pattern recognition. IEEE, 639–646.
Suzuki J. 1996. Learning Bayesian belief networks based on the minimum description length principle: an efficient algorithm using the B & B technique. In: Machine learning, proceedings of the thirteenth international conference (ICML'96).
Tarjan R. 1971. Depth-first search and linear graph algorithms. SIAM Journal of Computing 2:146–160.
Tsamardinos I, Brown LE, Aliferis CF. 2006. The max–min hill-climbing Bayesian network structure learning algorithm. Machine Learning 65:31–78.
Yuan C, Malone BM, Wu X. 2011. Learning optimal Bayesian networks using A* search. In: IJCAI proceedings-international joint conference on artificial intelligence.
Zhou H, Sakane S. 2003. Learning Bayesian network structure from environment and sensor planning for mobile robot localization. In: Proceedings of IEEE international conference on multisensor fusion and integration for intelligent systems, MFI2003 DOI 10.1109/MFI-2003.2003.1232636.
Zuo Y, Kita E. 2012. Up/down analysis of stock index by using Bayesian network. Engineering Management Research 1:46–52.