PII: 0166-218X(88)90005-4 Discrete Applied Mathematics 19 (1988) 45-63 North-Holland 45 GRAPH EMBEDDING IN SYNCHEM2, AN EXPERT SYSTEM FOR ORGANIC SYNTHESIS DISCOVERY * Joseph D. BENSTOCK** and Donald J. BERNDT Department of Computer Science, State University of New York at Stony Brook, Stony Brook, NY 11794, USA Krishna K. AGARWAL Department of Computing and Information Sciences, Trinity University, San Antonio, TX 78284, USA Received 15 October 1985 Revised 30 April 1986 Graph embedding (subgraph isomorphism) is an NP-complete problem of great theoretical and practical importance in the sciences, especially chemistry and computer science. This paper presents positive test results for techniques to speed embedding by modeling graphs with subroutines, precalculating edge tables, turning recursion into iteration, and using search- ordering heuristics. The expert system SYNCHEM~ searches for synthesis routes of organic molecules without the online guidance of a user, and this paper examines how embedding information helps to imple- ment the central operations of SYNCHEM.2: selection, application, and evaluation of chemical reac- tions. The paper also outlines the architecture of SYNCHEM~, analyzes the computational time complexity of embedding and related problems in graph isomorphism and canonical chemical naming, and suggests topics and techniques for further research. 1. Introduction Graph embedding algorithms determine whether a given guest graph can be embedded within a given host graph; some embedding algorithms in addition store the mapping(s) of the isomorphism(s) between the guest and subgraphs of the host. Such algorithms are of great theoretical and practical importance in a number of the sciences. We present positive test results for techniques to speed embedding by modeling graphs with subroutines, precalculating edge tables, turning recursion into iteration, and using search-ordering heuristics. As a central process in the expert System SYNCHEM~ at SUtiY/Stony Brook, an AI project under the direction of H. Gelernter, embedding assists in organic syn- thesis discovery, an important and difficult problem in the natural sciences for * This research was supported by Eastman Kodak Co. and NASA Research Grant #NAG 3-651. ** Present address: Computer Resources International A/S, Vesterbrogade lA, DK-1620 Copenhagen V, Denmark. 0166-218X/88/$3.50 0 1988, Elsevier Science Publishers B.V. (North-Holland) 46 J.D. Benstock et al. which SYNCHEM~ has generated results of interest to chemists and biochemists [7,19,20]. SYNCHEM searches for synthesis routes for a given organic molecule without the online guidance of a user, and uses embedding information to help select, apply, and evaluate chemical reactions. Various algorithms for substructure search, molecular pattern matching, pretransform and posttransform testing, and heuristic subgoal evaluation will be presented and discussed in the context of SYNCHEM2. Embedding is an NP-complete problem related to important problems in graph isomorphism and canonical chemical naming that currently have unknown com- putational time complexity and that bear upon the deep questions of containment in the polynomial hierarchy of problem complexity classes, including P and NP. We will analyze the time complexity of these problems and the relationships between them. Graph embedding is a central operation in all the following applications in- volving pattern matching and feature detection: canonical chemical nomen- clature and indexing, chemical substructure searching, and synthesis discovery [2,7, 10,33,36,39,42,43], aspects of developmental biology [ 141, and modeling memory recall in cognitive psychology [7,32]. Applications are found in electrical network design and optimization [43], image processing involving template match- ing and feature recognition [7,8, 11, 16, 17,34,35,37,38,40], and speech processing using templates and networks of phonemes [7]. Embedding techniques find wide- spread use in the following subfields of computer science: database management and information retrieval [6,7,8,11,30], program transformation (including syn- thesis, optimization, and automated programming) [2,5,7, lo], computer learning and knowledge representation [7,15], analogical reasoning using graph homo- morphism [12,22], state representation and pattern recognition in game trees [2,6, lo], as well as study of programming languages, data and control flow, and concurrency [9, 141. The paper consists of six sections: (1) Introduction, (2) Architecture of SYNCHEM& (3) Using embedding in SYNCHEM~, (4) Computational time complexity of embed- ding and related problems, (5) Techniques to speed embedding, (6) Summary and directions for further research. 2. Architecture of SYNCHEM Other introductions to the SYNCHEM project are found in [7,19,20]. 2. I. Control strategy Problem reduction by retrosynthesis The problem reduction strategy SYNCHEM~ uses for finding synthesis rOUteS for an input organic molecule (called the tirrget molecule) is to select reactions that Graph embedding in SYNCHEM~ 47 retrosynthesize it into compounds that are simpler in the sense that they are com- mercially available or are expected to be more readily synthesizable. These generated compounds are referred to as subgoal conjuncts in artificial intelligence and precur- sors in chemistry. Problem representation using AND/OR graphs The search space of the problem is represented by an AND/OR Problem Solving Graph (PSG) as in Fig. 1, in which a molecule is represented by a compound node, that when selected for development, spawns alternative subgoal nodes [7,3 11. Com- pound nodes and subgoal nodes alternate in the PSG and are connected by reaction edges. Subgoal nodes (whose children are one or more reacting compounds) are by definition ‘OR’ nodes because the establishment of any subgoal node solves the parent compound. Compound nodes, on the other hand, are ‘AND' nodes because all sibling compound nodes must be established to solve their parent subgoal. Available compounds are ‘terminal’ nodes in the PSG, and normally are not further developed. A cycle of subgoal generation is completed when every feasible reaction schema from the knowledge base that meets the dynamic selection criteria has been applied to a particular compound. The local tactic of SYNCHEM~ is to progressively reduce the ‘distance’ between leaf nodes of the PSG and a list of available com- pounds by selecting and applying reactions until all leaf nodes represent available compounds. SG #I SG #2 SG #3 SG #4 I CvC C A I CvC C 0 0 0 O-H O-H 0 I I C - C C c=o I 0 0 C - c jc C C \\ \ c Cl O-H O-H c=c C Fig. I. A Problem Solving Graph. Adapted from [24]. The target compound at the root is developed into three ‘OR’ subgoal nodes, SC #2, 3, and 4. Subgoal nodes have one or more reacting ‘AND' com- pound nodes as children. The goal of SYNCHEM~ is to develop all leaf nodes into available compounds. Note that chemical convention omits depiction of hydrogen atoms on carbons when they clutter draw- ings; thus all carbons have four edges, although in drawings some edges may be implicit. 48 J.D. Benstock et al. Heuristic evaluation functions An exhaustive search of the problem space is computationally infeasible (‘com- binatorially explosive’) for any but the most simple syntheses. As the search gener- ates and evaluates subgoals, it must correspondingly limit the growth of the PSG and prune it. Thus the search is guided by figures of merit, which are heuristically computed values used to predict the success of program alternatives. The three types of merit, compound, reaction, and subgoal, are integers in the range [0, . . . , 1001. Merits are computed using information about the current chemical environment of a given node in the PSG to adjust the default values from the knowledge bases [20,24]. Compound merits are used to predict the likelihood of successfully completing a pathway from a particular compound all the way down to available compounds and reflect the maximum subgoal merit of subgoal children. (Available compounds are assigned a merit of 100 on the scale [O, . . . , 1001.) Reaction merits, working in the opposite direction, estimate the cost of successfully reacting compounds to solve their parent subgoal. Reaction merits are a function of the dynamic parameters ease, yield, and confidence, which are also rated [0, . . . , 1001. The default parameter values associated with each reaction are estimates of usefulness under standard con- ditions, supplied by chemists who have worked on the knowledge bases. The subgoal merit of each subgoal is a dynamic evaluation function of its constituent child compound merits, its reaction merit, and after updating, the subgoal merits of descendants. After each cycle of subgoal generation, affected subgoal and com- pound merits throughout the PSG are updated by recomputing them iteratively, starting from newly generated subgoals and moving up the PSG to the target com- pound at the top. The merit of a subgoal is generally somewhat worse than the merit of its worst unsolved child compound. Effort distribution The Effort Distribution algorithm dynamically apportions search time for each compound node so that the program can develop the search space rationally. Search status flags for compound nodes (open, solved, just solved, stuck, available, un- solvable) are updated and distributed throughout the PSG after each cycle of subgoal generation [24]. SYNCHEM~ has the capability of allowing the user to control the degree to which the search concentrates on syntheses that are either inexpensive, short, diverse, different from those in the literature, or have well-understood chemistry, etc. 2.2. Knowledge bases The knowledge bases consist of four available-compounds libraries (two of which are currently active), and four reaction libraries. Graph embedding in SYNCHEM~ 49 Compound representation and libraries Davis, Boivie, and Agarwal have designed SLINGS (SYNCHEM Linear Input Graphs) for SYNCHEM~ [ 131, a compact linear canonical representation for molecules, for user input/output as well as internal storage and communication be- tween program modules. Compounds are also represented by the Extended Topo- logical Representation (XTR), a data structure that includes a connection matrix and that computes and stores additional information on demand, such as atomic degree, number of cycles, and so on. SYNCHEM~ has a knowledge base of about 10,000 available compounds, most of which are from the Aldrich Catalog-Handbook of Organic and Biochemicals. Reaction representation and libraries SYNCHEM~ selects reactions from about 1200 reactions in its knowledge base by determining the attributes of a compound and using these attributes as keys to reac- tions in the knowledge base. Attributes are characteristic chemical features, func- tional groups, or patterns of infeasible structures. The main reaction library is divided into 30 currently active chapters; each chapter is associated with a unique attribute and a chapter screening test. A reaction with more than one attribute is arbitrarily assigned to a chapter. Reactions within chapters are arranged in order of decreasing probable utility. Reactions are stored in the reaction library as reaction schemata, which may be viewed as graph rewriting rules. A reaction schema consists of a goalpattern graph and its retrosynthetic derivative, the subgoalpattern graph, which have isomorphic node sets. The goal pattern or the subgoal pattern may be disconnected. To deter- mine whether a reaction may be applied to a compound, i.e., a reaction schema ap- guest Y H ?-C-Cl + H-Cl ?-b-H + Cl-Cl I!l k!l HH HH H-&k-Cl + H-Cl H-&-H + Cl-Cl IG ifi host Fig. 2. A reaction schema and application. The guest graph in the reaction schema on top embeds in the host graph, showing that the reaction schema can be applied to the host graph, as on the bottom. The left and right sides of the reaction schema are called the goal pattern and subgoal pattern respectively. The fragment variable nodes labeled ‘?’ correspond to subgraphs, called fragments, in the host and its retrosynthetic derivative. 50 J.D. Benstock et al. plied to a host graph, SYNCHEM~ considers the goal pattern graph as the guest graph and searches for embeddings of it within the host graph. Fig. 2 shows a guest graph that embeds in a host graph, which indicates that the reaction schema can be applied to the host graph. 3. Using embedding in SYNCHEM~ 3.1. Definition of chemical graph embedding Because a goal pattern graph stands for a class of compounds, it is convenient to use fragment variable nodes to stand for classes of compound substructures. Definition. A fragment variable node (or variable node, for short) is a guest graph node that, according to its label, maps to exactly one of the following attribute classes: any element at all, any element except hydrogen, any halogen atom, or an alkyl carbon. Similarly, a variable edge maps to either a single, double, or resonant bond. Definition. A constant node is the opposite of a variable node and is labeled by a definite element. Definition. A chemical graph embedding of a guest graph in a host graph is a l-to-l mapping of guest nodes into host nodes such that: (1) adjacency is preserved, (2) node and edge labels are preserved, (3) stereochemical orientation is preserved, and (4) images of variable nodes are part of connected subgraphs in the host correspond- ing to their label and containing no images of constant nodes [39]. These subgraphs in the host and its retrosynthetic derivative are called fragments and are an instance of the attribute class the labeled variable node represents. In- tuitively, fragments are what is ‘left over’ in the host graph after images of constant nodes are accounted for. See Fig. 3. \ ,c=c’ $1 $2 H y H Fig. 3. Variable nodes and fragments. From [39]. The guest graph embeds in the host graph. The hydrogen and carbon atoms in the guest map to the obvious atoms in the host; the two variable nodes labeled ‘$’ both correspond to the same fragment in the host graph, an eight node subgraph constituting the rest of the host. ‘$N’, where N is an integer, is represented as R, in chemistry literature, a molecular structure variable. Graph embedding in SYNCHEM~ 51 3.2. Selecting reactions Attribute determination In order for the program to know which types of reactions to select for a par- ticular compound, it must first determine the attributes of that compound. It uses SUBSRCH, a procedure designed by Boivie, for finding all attributes within a com- pound from a set of predefined substructures. SUBSRCH operates by conducting a binary search of the attribute table and comparing table entries to neighborhoods recursively ‘grown’ around each compound atom. SUBSRCH operates by examining the neighborhood of each atom in a compound, using breadth-first search (BFS, see Section 5.3), and identifying each neighborhood as: (1) part of a potential attribute warranting further search, (2) an instantiation of an attribute, or (3) a substructure not warranting further search, unlike any in the table (in which case SUBSRCH backtracks). The SUBSRCH procedure is more effi- cient than general embedding algorithms for locating sets of predefined substruc- tures and identifying attributes that are: (1) sensitive and that interfere with reactions, (2) readily available or readily synthesizable, or (3) chemically invalid 1101. Screening reactions Once SUBSRCH has determined the attributes of a compound, SYNCHEM conducts screening pretransform tests on them to select reactions from the reaction library. Pretransform tests for a reaction use chemist-specified combinations of conditions of the following four types: the compound (can’t/must) have (any/all) of the at- tributes on a list associated with the reaction. An example of a pretransform test would be that a Grignard reaction on lactones or aldehydes can’t have any carboxyl or nitrile groups, and must have any secondary or tertiary alcohol group. 3.3. Applying reactions If all the pretransform tests for a reaction on a compound are satisfied then SYNCHEM~ attempts the following embedding using Sanders’s MATCH algorithm: it takes the goal pattern graph of the reaction schema as the guest graph, and the graph representing the compound as the host graph, and generates maps for all embeddings of the guest within the host [39]. The difference between SUBSRCH and MATCH is that, while SUBSRCH locates a ‘root’ node for each instance of an attribute within the host, MATCH determines the exact location and scope of the guest graph within the host graph, generating a l-to-l map from nodes of the guest graph to nodes of the host graph. Because a reaction can take place at several sites in a molecule, but sometimes such chemistry is redundant, MATCH has the option of generating only one embedding on any given procedure call. MATCH is called recursively with a guest graph and a host graph and the most recently matched pair of nodes. It then finds all possible extensions of existing par- 52 J.D. Benstock et al. tial embeddings using a dual depth-first search (DDFS), and returns the maps of all complete embeddings. The pseudocode for the MATCH algorithm is given below. Note that a frond or back edge is an edge that completes a cycle in a graph traversal 1411. The Dual Depth-First Search MATCH Algorithm. Node a in the guest already matches node a in the host and MATCH attempts to extend the embedding to nodes b and p. procedure MATCH (a, a: nodes); begin for each unmarked neighbor b of a do begin mark b; for each unmarked neighbor /3 of a do begin if (node labels of b and /3 match and fronds of b and ,8 match and edge labels of edges (a, b) and (a, /I) match) then begin mark /3; image(b) : = p; if (all nodes of guest graph are marked) then record embedding eke Cd MATCH(b, p); unmark /I; image(b) : = 0; end; (if nodes, fronds, and edges match} end; {for each unmarked neighbor of a do} unmark b; end; {for each unmarked neighbor of a do} end; (procedure MATCH} If the guest graph of the reaction can be embedded within the host graph represent- ing the compound, then SYNCHEM~ applies the reaction to the host graph using the procedure SUBGENR, producing a subgoal graph consisting of one or more com- pounds, as in Fig. 2. 3.4. Evaluating reactions After applying a reaction, SYNCHEM~ canonically names the generated compounds and conducts a series of posttransform tests, using SUBSRCH and MATCH, to evaluate the application to the particular compound in the chemical environment existing at that node in the PSG. The default values for the ease, yield, and confidence parameters of the reaction are adjusted at this stage and combined to calculate the adjusted reaction merit. Posttransform tests can cause a subgoal to be rejected (e.g., Graph embedding in SYNCHEM~ 53 due to an invalid attribute having excessive bond strain), the reaction procedure to be modified (e.g., changing the reagent), or protection procedures to be specified for sensitive groups. In each case, the specific site of the embedding and the current chemical environment are considered. Posttransform tests may be concerned with, for example, electronic and stereochemical effects, steric hindrance, proximity of functional groups to the reac- tion site, ring structures and sizes, alkynes in a ring, allenes in a small ring, relative stabilities of carbonium ions, leaving groups, migratory aptitudes, the largest at- tribute in the molecule, symmetry properties of the goal and subgoal molecules, etc. 1241. After posttransform testing, compound merits of newly generated compounds are evaluated using several criteria, calling SUBSRCH to identify significant attributes and heuristic procedures to estimate compound merit. Compound merit is computed in a procedure that considers the number and bonding of carbon atoms, the ratio of number of functional groups to number of carbon atoms, ring system complexities, and metastable oxidation states. In general, simpler compounds are rated pre- ferentially. 4. Computational time complexity of embedding and related problems 4. I. The polynomial hierarchy A program for a deterministic Turing machine (DTM) consists of: (1) a finite state read/write head; (2) an infinite tape of cells; (3) a finite set Z of input tape symbols; (4) a finite set Q of states, including a start state q. and two halt states qy and qN, and (5) a transition function that directs the tape head in a given state at a given cell to read the symbol in the cell, to write a symbol over it (perhaps the same one), and to move either left or right. A program for a nondeterministic Turing machine (NTDM) has a transition function that directs the tape head in a given state at a given cell to read the cell, and to nondeterministically choose a pair, consisting of a symbol to write and a head movement, from a set. A nondeterministic program ‘guesses’ a solution of a decision problem by choosing the right such pair at each step, and then ‘verifies’ whether it is actually a solution. An example of this concept is the following informal description of an NDTM for graph isomorphism: it guesses a mapping between the vertex sets of two graphs, and then proceeds to verify whether the mapping is l-to-l, onto, and adjacency-preserving. An input string XEZ* is a finite sequence of symbols and we say (for a DTM or an NDTM) that program M accepts x iff it begins at the first tape cell, reads the string x, and halts in the accepting state qy. The language accepted or recognized by program M is L, = {x: XEZ* and M accepts x}. 54 J.D. Benstock et cd There is a natural correspondence between recognizing languages and solving deci- sion problems. A polynomial time program is one in which the number of computation steps is bounded by a polynomial function of the input length. The complexity class P is defined to be P = {L: 5’ a polynomial time DTM program recognizing L}. The class NP is defined to be NP = {L: 2 a polynomial time NTDM program recognizing L}. The class co-NP is defined to be co-NP = {L: 5’ a polynomial time NTDM program recognizing Z* - L}. The P = NP problem is open because, while verifying solutions can be done in polynomial time for many problems, it is not known whether guessing solutions can also be done in polynomial time for those problems. To rate the relative complexity of languages, it is useful to define a polynomial transformation between languages L, and L, as a function f computable by a recursive (always halting) polynomial time DTM program such that XE L, iff f(x) E L,. We then write L, I, L,, and say that L, many-one Turing reduces to L,. This expresses the relation that if the ‘harder’ language L, is in P, then so is L,. ‘I,’ is a partial order. L1 and L, are polynomial time equivalent iff L, &, L, and Lz<, L,. We then write L, =, L,. The class NP-complete (NPC) consists of the ‘hardest’ problems in NP: NPC={L: LENP, and VL’ENP,L’S,L}. A polynomial algorithm found for an NPC problem would be translatable to a polynomial algorithm for any NP problem, by direct construction using the reduc- ing polynomial transformation. It is felt that there is a strong possibility that NPC problems are intractable, with optimal solutions taking an exponential amount of time, although good approximate solutions may sometimes be found in polynomial time. We can show a given problem to be NPC by showing that it contains a known NPC problem as a subcase with a recursive (always halting) polynomial time transformation reducing the known problem to the given one. (Intuitively, if the new, harder, problem were polynomially solvable, then its known NPC subcase would be polynomially solvable as well). Ladner showed that if P # NP, then there exists a dense hierarchy of problems strictly between P and NPC [ 181. The class NP- Incomplete is defined to be NPI = NP - (P U NPC). Fig. 4 depicts the containments in the polynomial hierarchy under this assumption that P#NP, in which case NPI is nonempty. Accounts of Turing machines and complexity classes are found in [3,18,21]. Graph embedding in SYNCHEM 55 4.2. Time complexity of subgraph isomorphism Embedding is NP-complete for chemical and general graphs. Compounds can be modeled by graphs with labeled edges and nodes. Thus the problem of determining whether one compound is ‘part’ of another can be restated as that of determining, in a pair of labeled graphs, whether a graph G can be embedded in a graph G’, i.e., whether G is isomorphic to a subgraph of G’. This is the labeled variant of the Subgraph Isomorphism problem (SGI). To see that SGI is NPC, we use the standard technique of noting that it is NP (because it is possible to verify a solution in polynomial time), and that it contains a known NPC problem, CLIQUE, as a subcase. Intuitively, this shows that SGI is in the class NPC because it is ‘at least as hard’ as its subcase, known to be ‘at least as hard’ as any problem in NP. CLIQUE asks whether a graph contains a complete subgraph of size ?k. CLIQUE is a subcase of SGI because an algorithm for SGI can be iteratively called as a subroutine to attempt to embed the complete graphs of size ?k in CLIQUE’s input graph. That iteration takes polynomial time. We conclude that CLIQUES, SGI, and that SGI is NPC. A polynomial reduction from unlabeled graphs to graphs with only one label shows that SGII, labeled SGI. Thus the chemical application we are interested in is modeled by labeled SGI, which is NPC. 4.3. Relationship of SGI to graph isomorphism and chemical naming The Graph Isomorphism problem (GI) asks whether a graph G is isomorphic to a graph G’. (SGI asks whether there exists a 1-I map between the graphs, while GI asks whether there exists an onto l-l map between the graphs.) The isomorphisms of a graph onto itself are called automorphisms or symmetries. In synthesis discovery embedding, nontrivial automorphisms and interactions between goal pat- tern, subgoal pattern, and host graphs often generate redundant compounds and waste the substantial time required to canonically name, store, evaluate, and develop them. Agarwal discusses several heuristics for eliminating redundant maps (some before generation and some after) [2], but notes that the heuristics bear fur- ther unification and extension. Because heuristics to speed an algorithm can them- selves suffer from combinatorial explosion, applications must be time analyzed carefully. SGI and GI algorithms can be used as subroutines to help solve the chemical canonical naming problem, which is of central concern whenever dealing with a chemical database. The graph naming problem (GRAPHNAME) determines a canonical name for a compound. SYNCHEM~ uses the SLING representation and algorithms [13]. The Morgan algorithm is another popular and effective system [29]. 4.4. Time complexities of graph isomorphism and chemical naming The graph isomorphism problem is of considerable theoretical interest because, 56 J.D. Benstock et al. NP Fig. 4. The polynomial hierarchy of complexity classes, assuming Pf NP. From [18, p. 1541. although it models many fundamental problems, its exact complexity is not current- ly known. A polynomial algorithm for SGI could be easily modified to polynomially solve GI, so GI<, SGI, but GI has not been shown to be NPC. The problems Ladner showed to be NPI under the assumption P # NP are ‘artificially constructed’ diagonalized problems, but determination of graph isomorphism and integer primality are two naturally occurring candidates for membership in NPI. Further- more, since there is no known efficient test for determining non-isomorphism of graphs, only failure of isomorphism tests, it is not known whether GI is in co-NPC. Thus the complexity of GI bears on the deep questions of containment among the complexity classes P, NP, co-NP, and the rest of the polynomial hierarchy. Indications that GI may be in the class NPI are: (1) The failure of current NPC proof techniques to show the membership of GI in NPC, as opposed to the thousands of NP problems that have been shown to be in NPC. (2) The polynomial time equivalence of the existence and enumeration GI pro- blems, which is at variance with some NPC problems [18] (although the polynomial equivalence of existence and enumeration variants of many NPC problems is cur- rently unknown [23,44]). The enumeration variant of GI asks how many isomor- phisms exist between a pair of graphs. Graph isomorphism can be tested in polynomial time for restricted classes of graphs, such as trivalent or planar [18,25]. Luks showed in 1980 that isomorphism for graphs of bounded degree can also be tested in polynomial time [27]. An in- teresting question is whether bounding the degree of guest and/or host graphs makes SGI also solvable in polynomial time, and if it does, whether Luks’s 0(n6) algorithm could be made practical or improved for chemical graphs of the size cur- rent chemistry programs may encounter (about 50 vertices). Chemical graphs naturally have bounded degree, which is usually four (for elements such as carbon), and certainly less than eight. Luks’s technique exemplifies the group theoretic approach to GI, which seeks to determine a set of generating permutations for the automorphism group of a graph. The other main approach, exemplified by the work of Miller [28], is topological, and seeks to embed a graph onto a surface of minimal genus and then dissect the surface into planar components. The relationship between the two approaches is not known. There are a number of combinatorial and group-theoretic problems (some Graph embedding in SYNCHEM~ 51 concerning group generating sets) that are polynomial time equivalent to GI and we say that they constitute the class of isomorphism complete problems [25]. A polynomial time solution to GRAPHNAME would polynomially solve GI, but an interesting fact is that it is not currently known whether a polynomial time solu- tion to GI could be extended to polynomially solve GRAPHNAME [25]. Thus we know that GI I ,,, GRAPHNAME, but not whether GRAPHNAME I, GI. 5. Techniques to speed embedding 5.1. Modeling graphs with subroutines Unrolling loops and making fewer array references and procedure calls can significantly shorten execution time of a program. Benstock has developed a tech- nique to do this that models or mirrors each guest graph with a short subroutine in source code. A simple algorithm creates this code out of an adjacency represen- tation for a guest graph. It apportions a ‘for’ loop in source code for each node of a guest graph, so that some guest graph information is implicit in the body of that loop. The technique is similar to that of transforming the labeled graphs of transi- tion diagrams into code for lexical analyzers of compilers, as in [5]. There is a l-to-l map from a graph node to a ‘for’ loop that searches the node’s neighbors to extend the embedding. The following operations are saved: procedure calls and array references for node, edge, and frond matching are either eliminated (in the frequent situation that any label will match, or that no fronds exist) or reduced to references to constants. The pseudocode below for the mirroring technique shows the ‘for’ loop of one node with the ‘for’ loop of its successor (in DFS traversal) nested within it. Pseudocode for Mirroring Technique for each unmarked neighbor p of CY do begin if (node labels of b and /3 match and fronds of b and p match and edge labels of edges (a, b) and (a, /3) match) then begin mark /I; image(b) :=/3; a :=/I; for e-f unmark p; image(b) : = 0; CY : = image(a); end; {if nodes, fronds, and edges match} end; {for each unmarked neighbor of (x do} 58 J.D. Benstock et al. The mirroring technique reduced execution time of MATCH by a factor of 174 when attempting to embed the perester pattern graph of Fig. 6 in each of the five host graphs of Fig. 5. It reduced execution time by a factor of 35 when embedding a benzene ring in itself. The mirroring technique is suitable for stable databases re- quiring fast pattern matching and with storage enough for a subroutine correspond- ing to each pattern in the database. 5.2. Eliminating recursion and precalculating tables We have developed an embedding algorithm GREMBED (for Graph Embed) that turns the recursion of MATCH into iteration, and uses a precalculated edge table of the guest graph to control the search, instead of using direct DFS of the guest graph. It is efficient in the context of SYNCHEM~ to precalculate edge tables a single time rather than searching guest graphs anew each time (with marking, unmarking, querying, and backtracking operations) since there are only about 1200 reaction schemata, and edge tables of their guest graphs would be consulted repeatedly [l]. These modifications made GREMBED about two to three times faster than MATCH, in our tests which used the five host graphs shown in Fig. 5 and the five guest graphs shown in Fig. 6. We attempted embeddings, using MATCH and GREMBED, for each of the 25 possible pairs of host and guest graphs. The total execution times in milliseconds for finding all embeddings are shown below in Table 1, along with the MATCH/GREMBED ratios of those times. They are grouped by host and guest graph. The average ratio (speedup factor) is 2.6. The figures for finding just one embedding (existence) are shown in Table 2. The average ratio for this case is 3.3. Both algorithms were written in VAX/VMS PL/I and run on a MicroVAX-II. For simplicity, these tests did not consider stereochemistry and the heuristics for reduc- ing the generation of redundant graphs. y=c c~c-c-c-c=c-c=o H-O 0 C=O (1) (2) 0 0: E-o-o-c-c 0 $-O-H CJ (4) (5) Fig. 5. Five host graphs used to compare MATCH and GREMBED. Unlabeled vertices represent carbon atoms by convention. Hexagonal rings with a circle in the center represent benzene rings where each of the six edges is labeled as a resonant bond. A carbon in a benzene ring has room outside the ring for only one single bond. (1) arbitrary molecule, (2) 3-vinyl-2-hepten-6-ynal, (3) OO-ethyl hydrogen monoperox- phthalate, (4) 3-chlorophthalide, (5) furoyl. Graph embedding in SYNCHEM 59 (6) (7) (8) ??? ? I-2 I I? ? ? 0 I ; ? 0 ? ??? (9) (10) Fig. 6. Five guest graphs used to compare MATCH and GREMBED. (6) alkene, (7) lactone, (8) perester, (9) furan-3-carboxaldehyde, (10) cyclohexene. There are 8 and 32 embeddings of alkene and cyclohexene respectively in the first host graph, 16 embeddings of alkene in the second host, 6 embeddings of perester in the third host, no embeddings in the fourth host, and 16 and 2 embeddings of alkene and furan respectively in the fifth host. We also compared GREMBED and the canonical naming algorithm of SYNCHEM~ for testing graph isomorphism, and found GREMBED to be about 24 times faster on the average when testing each of the host graphs of Fig. 5 for isomorphism with itself. GREMBED was 70 times faster than canonical naming for the molecule below. Table 1. Times (in milliseconds) using MATCH and GREMBED to find all embeddings of the 25 host-guest pairs. Summed and grouped by host and guest graph, and showing the MATCH/GREMBED ratios of the times. Name MATCH GREMBED Ratio arbitrary molecule 3.vinyl-2-hepten-6-ynal OO-ethyl hydrogen monoperoxphthalate 3-chlorophthalide furoyl 1055 410 2.57 665 210 3.17 825 315 2.62 520 200 2.60 315 160 1.97 alkene 740 360 2.06 perester 540 160 3.38 lactone 465 205 2.27 furan 525 175 3.00 cyclohexene 1110 395 2.81 60 J.D. Benstock et al. Table 2. Times (in milliseconds) using MATCH and GREMBED to find the first embedding (i.e., determine existence of embedding) of the 25 host-guest pairs. Summed and grouped by host and guest graph, and showing the MATCH/GREMBED ratios of the times. Name MATCH GREMBED Ratio arbitrary molecule 3-vinyl-2-hepten-6-ynal OO-ethyl hydrogen monoperoxphthalate 3chlorophthalide furoyl 885 255 3.47 665 210 3.17 795 260 3.06 535 200 2.68 325 105 3.10 alkene 575 200 2.88 perester 525 165 3.18 lactone 480 176 2.74 furan 515 165 3.12 cyclohexene 1130 275 4.11 This indicates that GI algorithms are to be preferred to GRAPHNAME algorithms in applications in which approximately one hundred graphs or fewer are to be pair- wise compared and duplicates not named. 5.3. Search-ordering heuristics Comparisons between search-ordering heuristics should analyze, among other factors, the following graph parameters: the expected number of embeddings (so we can seek to primarily accept or reject matches), the cycle structure and general graph topology, and the absolute and relative sizes of the guest and host graphs. For the sake of efficiency, attempts should be made to match distinguishing structural features, such as attributes and fronds, as soon as possible in an embedding search. Time might be saved in SYNCHEM~ embedding searches if substructures in the guest graph were matched as units to their images in the host graph, using informa- tion already gathered by SUBSRCH. This attribute matching would also be suitable for applications with guest graphs that had many predefined substructures. MATCH is passed matched initial root nodes by SUBSRCH, and we have found it to be highly sensitive to the choice of pattern graph root node. We have found that choosing a guest root node within a distinctive region of the guest (allowing rapid rejection of nonembeddings) can speed embedding by a factor of as much as four. Breadth-first search (BFS) is the graph traversal method that visits the root, then all nodes at distance one from the root, then all nodes at distance two, etc., and never revisits nodes. Thus it crosses every edge exactly twice, counting backtracking upon finding an already visited node. Depth-first search (DFS) is the graph traversal method that visits the unvisited node of maximum depth at each step. BFS tends to waste storage space, since desired nodes within the graph lie too far from the root for most applications, and BFS also tends to manifest embedding failures more Graph embedding in SYNCHEM~ 61 slowly than DFS. Embedding using DFS is usually faster than BFS, perhaps because it covers cycles and other structural features sooner. BFS is faster than DFS is in some instances, however, and further investigation is needed to determine the condi- tions under which this obtains [ 1,261. Note that a computation to determine whether DFS or BFS is preferable in a given instance would add its own costs to the algorithm. One possible heuristic to increase efficiency visits nodes in the order of highest- degree-first [I]. This, in effect, causes the algorithm to traverse chains and cycles of carbons early. We found that this particular traversal heuristic made no signifi- cant difference in embedding times. A fragment code is a set of predefined symbols standing for a molecular substruc- ture such as a ring or functional group. Time might be saved if SYNCHEM~ used a fragment code to represent substructures, as is the practice in many chemical databases. However, we feel this refinement would have only a small effect, which might even be negated by the added time spent expanding the fragment symbols, e.g., as must be done when rings are broken. 6. Summary and directions for further research Graph embedding is a widely applied process that needs heuristic and theoretical improvements to push upward the threshold of combinatorial explosion. Our mir- roring technique, which models graphs with procedures, makes embedding faster by a factor between one and two orders of magnitude. The mirroring technique is suitable for stable databases needing fast pattern matching. We have eliminated recursion and used a precalculated edge table in the general embedding algorithm of SYNCHEM~, which made embedding faster by a factor of about three in 25 test cases. It will be useful to determine the relative contributions of recursion elimina- tion and the precalculated table, and the reasons why BFS is occasionally faster than DFS. In addition to graph-theoretical heuristics, a chemical context allows chemical heuristics to speed embedding searches and a wide range of possible suitable heuristics is open for investigation, including the suggested ‘attribute matching’. A statistical profile of SYNCHEM~ execution is needed to determine the distribution and structure of chemical graphs most frequently input to MATCH. A preliminary count indicates that embeddings determined by SUBSRCH are rejected by MATCH 60% of the time in typical sYNcm342 runs. Further investigation is needed into other possible embedding heuristics and their costs, as well as the costs of determining when they are favored. Towards this end, we have developed an embedding program and a driver program above it that can accept arbitrary traversal sequences, and can output traversal and embedding infor- mation at variable levels of detail. A particularly useful investigative tool would be a graphics program that displayed traversals through graphs at variable speeds using 62 J.D. Benstock et al. highlighting or color, allowing for concrete experience of, and experimentation with, different embedding strategies on different classes of graphs. Acknowledgements We wish to thank Prof. H. Gelernter and Dr. G. Miller for suggestions and helpful discussions, and Noglle Benstock for assistance with the diagrams. References [l] K.K. Agarwal, Graph transformation and canonization algorithms, Ph.D. Thesis, State Univ. of NY at Stony Brook (1976). [2] K.K. Agarwal, D.P. Agrawal and M.A. Lassner, Subgraph identification using associative techni- ques with applications to information science and chemical structure investigation, NSF Report, Wayne State Univ., College of Engineering, Detroit, MI (1982). [3] A.V. Aho, J.E. Hopcroft and J.D. Ullman, The Design and Analysis of Algorithms (Addison- Wesley, Reading, MA, 1974). [4] A.V. Aho, J.E. Hopcroft and J.D. Ullman, Data Structures and Algorithms (Addison-Wesley, Reading, MA, 1983). [5] A.V. Aho, R. Sethi and J.D. Ullman, Compilers: Principles, Techniques, and Tools (Addison- Wesley, Reading, MA, 1986). [6] R.B. Banerji, Artificial Intelligence: A Theoretical Approach (North-Holland, Amsterdam, 1980). [7] A. Barr, P.R. Cohen and E.A. Feigenbaum, eds., The Handbook of Artificial Intelligence (William Kaufmann, Los Altos, CA, 1982). [8] H.G. Barrow, A.P. Ambler and R.M. Burstall, Some techniques for recognizing structures in pic- tures, in: S. Watanabe, ed., Frontiers of Pattern Recognition (Academic Press, New York, 1972) l-30. [9] L. Bit, Processing of semantic nets on dataflow architectures, Artificial Intelligence 27 (1985) 219-227. [lo] R.H. Boivie, Heuristic search guidance in SYNCHEMZ, Ph.D. Thesis, State Univ. of NY at Stony Brook (1977). [11] J.M. Brayer and K.S. Fu, Some multidimensional grammar inference methods, in: C.H. Chen, ed., Pattern Recognition and Artificial Intelligence (Academic Press, New York, 1976) 29-60. [12] J.G. Carbonell and S. Minton, Metaphor and common sense reasoning, in: J.R. Hobbs and R.C. Moore, eds., Formal Theories of the Common Sense World (Ablex, Norwood, NJ, 1985) 405-426. [13] H.W. Davis, Computer representation of the stereochemistry of organic molecules, in: Inter- disciplinary Research Series 23 (Birkhauser, Basel, 1976). [14] H. Ehrig, M. Nagl and Cl. Rozenberg, eds., Graph-Grammars and Their Application to Computer Science, 2nd International Workshop, 1982, Haus Ohrbeck, Germany, Lecture Notes in Comp. Sci. Vol. 153 (Springer, Berlin, 1983). [15] N.V. Findler, ed., Associative Networks: The Representation and Use of Knowledge by Computers (Academic Press, New York, 1979). [16] E.C. Freuder, Structural isomorphism of picture graphs, in: C.H. Chen, ed., Pattern Recognition and Artificial Intelligence (Academic Press, New York, 1976) 248-256. [17] K.S. Fu, Syntactic Methods in Pattern Recognition (Academic Press, New York, 1974). [IS] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP- Completeness (Freeman, San Francisco, CA, 1979). Graph embedding in SYNCHEMS? 63 [19] H.L. Gelernter, N.S. Sridharan, A.J. Hart, S.C. Yen, F.W. Fowler and H.-J. Shue, The discovery of organic synthesis routes by computer, in: Topics in Current Chemistry 41(l) (Springer, Berlin, 1973) 114-130. [20] H.L. Gelernter, A.F. Sanders, D.L. Larsen, K.K. Agarwal, R.H. Boivie, G.A. Spritzer and J.E. Searleman, Empirical explorations of SYNCHEM, Science 197 (1977) 1041-1049. [21] J.E. Hopcroft and J.D. Ullman, Introduction to Automata Theory, Languages, and Computation (Addison-Wesley, Reading, MA, 1979). [22] E.B. Hunt, Artificial Intelligence (Academic Press, New York, 1975). [23] D.B. Johnson and S. Kashdan, Lower bounds for selection in X+ Y and other multisets, J. ACM 25(4) (1978) 556-570. [24] M.A. Jones, The organization and design of SYNCHEMZ, Tech. Report No. 82/038, State Univ. of NY at Stony Brook (1982). [25] C.M. Hoffmann, Group-Theoretic Algorithms and Graph Isomorphism, Lecture Notes in Comp. Sci. 136 (Springer, Berlin, 1982). [26] M.A. Lassner, Graph embedding algorithms and their applications, Ph.D. Thesis, Wayne State University, College of Engineering (1981). [27] E.M. Luks, Isomorphism of graphs of bounded valence can be tested in polynomial time, in: Pro- ceedings 21st IEEE Symp. on Foundations of Comp. Sci., Rochester (1980) 42-49. [28] G.L. Miller, Isomorphism of k-contractible graphs. A generalization of bounded valence and bounded genus, Information and Control 56(1/2) (1983) l-33. [29] H.L. Morgan, The generation of a unique machine description for chemical structures ~ a techni- que developed at Chemical Abstracts Service, J. Chemical Documentation 5(2) (1965) 107-I 13. [30] R.R. Muntz, The WELL system: a multi-user database system based on binary relationships and graph-pattern-matching, Information Systems 3 (1978) 99-115. [31] N.J. Nilsson, Problem-solving Methods in Artificial Intelligence (McGraw-Hill, New York, 1971). [32] D.A. Norman and D.E. Rumelhart, Explorations in Cognition (Freeman, San Francisco, CA, 1975). [33] L.J. O’Korn, Algorithms in the computer handling of chemical information, in: R.E. Christof- fersen, ed., Algorithms for Chemical Computation (American Chemical Society Symposium Series Vol. 46, Washington, DC, 1977) 122-148. [34] T. Pavlidis, Linear context-free graph grammars, J. ACM 19(l) (1972) 1 l-22. [35] J.F. Pfaltz and A. Rosenfeld, Web grammars, in: 1st Proc. IJCAI, Washington, DC (May 1969) 609-619. [36] R.C. Read and D.G. Corneil, The graph isomorphism disease, J. Graph Theory 1 (1977) 339-363. [37] A. Rosenfeld, Picture automata and grammars: An annotated bibliography, in: Proc. Symp. Com- put. Image Process. Recognition 2 (Univ. of Missouri at Columbia, Aug. 1972). [38] A. Rosenfeld, Picture processing: 1977, in: Computer Graphics and Image Processing 7 (1978) 21 l-242. [39] A.F. Sanders, Some applications of graph theory, Ph.D. Thesis, State Univ. of NY at Stony Brook (1976). [40] A.C. Shaw, Picture graphs, grammars and parsing, in: S. Watanabe, ed., Frontiers of Pattern Recognition (Academic Press, New York, 1972) 491-510. [41] R.E. Tarjan, Depth-first search and linear graph algorithms, SIAM J. Comput. l(2) (1972) 146-160. [42] R.E. Tarjan, Graph algorithms in chemical computation, in: R.E. Christoffersen, ed., Algorithms for Chemical Computation (American Chemical Society Symposium Series Vol. 46, Washington, DC, 1977) l-20. [43] A. Tucker, Applied Combinatorics (Wiley, New York, 1980). 1441 L.C. Valiant, The complexity of computing the permanent, Theoret. Comp. Sci. 8 (1979) 189-202.