key: cord-0057848-1sid1z64
authors: Mahfoud, Houari
title: Conditional Graph Pattern Matching with a Basic Static Analysis
date: 2021-02-22
journal: Pattern Recognition and Artificial Intelligence
DOI: 10.1007/978-3-030-71804-6_22
sha: cf6abdced30b06dfa32957150d38141e1004a39d
doc_id: 57848
cord_uid: 1sid1z64

We propose conditional graph patterns (CGPs) that extend conventional ones with simple counting quantifiers on edges, attributes on nodes, and positive and negative predicates. In emerging applications such as social network marketing, CGPs make it possible to express complex search conditions and to find more sensible information than their traditional counterparts. We show that the expressivity of CGPs does not come at a much higher price. Indeed, we propose a matching algorithm that is based on a revised notion of graph simulation and matches CGPs over any data graph in quadratic time, as opposed to the prohibitive solutions based on subgraph isomorphism. We discuss a parallel version of our algorithm that makes it very efficient over real-life data. We investigate the satisfiability and containment problems of CGPs and show that they are solvable in quadratic time by providing a non-trivial checking algorithm for each one. This paper is the first effort to investigate static analysis of non-conventional graph patterns containing important features that are widely used in practice. An extensive experimental study has been conducted to show the effectiveness and efficiency of our results.

Given a data graph G and a graph pattern Q, graph pattern matching is to find all subgraphs of G that match Q. Matching here is typically expressed in terms of subgraph isomorphism, which consists of finding all subgraphs of G that are isomorphic to Q. However, subgraph isomorphism is an NP-complete problem [4]. Moreover, the number of matches via subgraph isomorphism may be exponential in the size of the data graph. To reduce this cost, graph simulation [18] and its extensions [5, 6, 15] have been proposed, which allow graph pattern matching to be conducted in polynomial time. Unlike subgraph isomorphism, which defines a one-to-one mapping, these extensions map a node of a graph pattern to many nodes of a data graph, which avoids the NP-completeness. However, many complex graph patterns that are needed in practice are not expressible by existing simulation-based models, notably patterns with counting quantifiers (CQs), predicates and negation, as we illustrate by the next example. Example 1. Consider the graph patterns of Fig. 1, where node labels represent types of entities (e.g. Patient, User); edge labels represent types of the relationships that exist between these entities (e.g. is friend); and a label of the form "≥k" specifies quantification on edges (e.g. number of treatments prescribed for a patient). Moreover, dotted-line blocks with "+" (resp. "−") represent positive (resp. negative) predicates that must be satisfied (resp. unsatisfied) by some nodes. I) The pattern P depicts a request that is defined by a nurse identified by "123" in order to return information on all patients that have at least five long-term treatments. Suppose that the hospital wants to impose an access control policy as follows: 1) only nurses of the medical staff can access information of patients; 2) a nurse can access only information of patients who are treated by a doctor working with her; and 3) information of top-secret patients (e.g. diplomat patients) cannot be shown to nurses.
To enforce this policy, one may easily rewrite P into a safe one by adding some parts that guarantee (resp. prevent) access to accessible (resp. inaccessible) patient information. The safe version of P is given by the pattern P s, which defines a positive and a negative predicate over the node Patient to enforce respectively conditions 1-2 and condition 3. In other words, a patient must not be returned to the nurse if he belongs to the Top-Secret category or is being treated by a doctor who does not work with the underlying nurse. The result of P s over any data graph must be composed of only the entities Patient and Treatment along with their edge, while the remaining parts of P s are used only to refine (make safe) this result. II) The pattern Q is useful in social media marketing. It looks for each user x in the UK who has at least n friends who recommend a Sony product, and such that no friend of x has given a bad rating since 2018 to a Sony product after buying one. The result of Q will be composed of single nodes representing the profiles of the users satisfying the predicates of Q.

To the best of our knowledge, no existing graph pattern matching model can express the queries P s and Q of the previous example. Thus, we revised the notion of graph simulation to strike a balance between its computational complexity and its ability to deal with more expressive graph patterns.

Contributions and Road-Map. (i) We propose conditional simulation, which extends dual simulation [15] by supporting simple CQs and positive and negative predicates (Sect. 3). (ii) We show that conditional simulation remains in ptime, like earlier extensions of graph simulation [5, 6, 14, 15], by providing a quadratic-time computation algorithm (Sect. 4). (iii) We discuss a parallel version of this algorithm that allows efficient matching of CGPs over large data graphs. (iv) We investigate the satisfiability and containment problems (Sect. 5) of our graph patterns and show that these problems are solvable in quadratic time by providing the corresponding checking algorithms. (v) Using a real-life graph, we experimentally verify the performance of our algorithms (Sect. 6).

We classify previous work as follows.

Graph Pattern Matching. Graph pattern matching (GPM) is typically defined in terms of subgraph isomorphism, which is an NP-complete problem [19]. Moreover, it is often too restrictive to capture sensible matches [6], as it requires matches from a data graph to have the same topology as the graph pattern, which hinders its applicability in real-life applications such as social network marketing. To overcome these limits, graph simulation [18] was first adopted for graph pattern matching due to its low computational complexity. GPM via graph simulation has been useful for various applications (e.g. social position detection [1]). However, it preserves only downward mappings, which raised the need to extend it in order to find more sensible matches in emerging applications. There has been a host of work on improving the notion of graph simulation (e.g. bounded simulation [6], regular simulation [5], dual simulation [15]). Unfortunately, all existing extensions remain limited in emerging applications that need patterns with complex features, notably counting quantifiers (CQs), predicates and negation. Among these features, only CQs have received some attention. The visual tool QGraph [12] allows annotating edges of a graph pattern with CQs of the form [min, max] that can express different semantics (e.g. negated ([0, 0]) and optional ([0, 1]) edges). The formal matching algorithm of QGraph (whose complexity is undefined) is a heavy extension of subgraph isomorphism, which probably makes its complexity higher than NP-complete. The authors of [8] proposed CQs that can express numeric and ratio aggregates of the forms "=p(%)", "≥p(%)", and "=0" for negation. Their matching algorithm is based on subgraph isomorphism and has been proved to be DP-complete.

In contrast, our first motivation was to use simple CQs that are less expressive than those of [8, 12] but keep quantified matching in ptime. Moreover, we allow negation of any part of a graph pattern, which is more useful than negation of simple edges. At the time of writing this paper, we do not know of any approach that allows definition and matching of predicates within graph patterns.

Static Analysis of Graph Patterns. The satisfiability and containment problems are classical and fundamental problems for any query language. They have been well studied for XPath (e.g., [10, 11]). For graph patterns, however, it is striking how little attention has been paid to these problems. The containment problem has been studied in [5] for graph patterns with neither predicates nor negation. The satisfiability problem has been studied in [16] for child relationships only, which gave rise to a trivial solution. Apart from these works, we are not aware of others that investigate static analysis of graph pattern queries.

A Data Graph is a directed graph G = (V, E, L, A) where: 1) V is a finite set of nodes; 2) E ⊆ V × V is a finite set of edges in which (v, v′) denotes an edge from node v to v′; 3) L is a function that assigns a label L(v) (resp. L(e)) to each node v ∈ V (resp. edge e ∈ E); and 4) for each node

Intuitively, for any data node v, the label L(v) represents the type of v (e.g. Movie) while the tuple A(v) defines its properties (e.g. title, release date, running time). Moreover, since an edge links two entities (e.g. Movie and Person), its label may represent a relationship between them (e.g. produced by, edited by). Notice that the attributes defined over a node v ∈ V may be different from those defined over another node w ∈ V, even if L(v) = L(w).

A Ball in G with center v and radius r is a subgraph of G, denoted by B(G, v, r), s.t.: 1) a node v′ is in B(G, v, r) if the number of hops between v and v′ is at most r; and 2) it has exactly the edges appearing in G over the same node set. A Graph Pattern is a directed connected graph Q = (V, E, L, A) where: 1) V, E, and L are defined as for data graphs; and 2) for each node u ∈ V, A(u) is a predicate defined as a conjunction of atomic formulas of the form "A op c" where: A is an attribute of u, c is a constant, and op ∈ {≥, ≤, =, ≠}.
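The ball definition can be illustrated with a breadth-first search. The following is a minimal sketch, assuming that hops are counted over undirected paths (as the diameter definition does) and that a graph is given as a node set plus a set of directed (source, target) pairs; the function name `ball` and this representation are illustrative choices, not taken from the paper.

```python
from collections import deque

def ball(nodes, edges, center, radius):
    """Return (ball_nodes, ball_edges) for B(G, center, radius).

    Assumption: hops ignore edge direction.
    """
    neighbours = {v: set() for v in nodes}
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    hops = {center: 0}
    queue = deque([center])
    while queue:
        current = queue.popleft()
        if hops[current] == radius:
            continue  # do not expand beyond the radius
        for nxt in neighbours[current]:
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)
    ball_nodes = set(hops)
    # Keep exactly the edges of G whose two endpoints are in the ball.
    ball_edges = {(s, t) for s, t in edges if s in ball_nodes and t in ball_nodes}
    return ball_nodes, ball_edges

# Toy chain a -> b -> c -> d (hypothetical data).
toy_nodes = {"a", "b", "c", "d"}
toy_edges = {("a", "b"), ("b", "c"), ("c", "d")}
toy_ball = ball(toy_nodes, toy_edges, "b", 1)
```

On the toy chain, the ball of radius 1 around b keeps a, b and c together with the two edges between them.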

For any pattern node u, we call A(u) the attribute constraints of u, which specify a search condition: e.g. a movie released in the US after 2017 (i.e. A(u) = "country=US ∧ year > 2017"). If A(u) = ∅ then L(u) is the only search condition for u, as in [8, 15]. Notice that conflicting constraints over u (e.g. A(u) = "Age ≥ 25 ∧ Age < 20") should be detected by the satisfiability checking.
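A small sketch of how attribute constraints might be evaluated against a data node is given below. It assumes that A(u) is stored as a list of (attribute, operator, constant) triples and that A(v) is a dictionary; these encodings, the name `satisfies`, and the inclusion of the strict operators used in the examples above are assumptions made for illustration.

```python
import operator

# Comparison operators allowed in atomic formulas "A op c" (assumed encoding).
OPS = {
    ">=": operator.ge, "<=": operator.le, "=": operator.eq,
    "!=": operator.ne, ">": operator.gt, "<": operator.lt,
}

def satisfies(node_attrs, constraints):
    """True if the node's attribute tuple satisfies every atomic formula."""
    for attr, op, const in constraints:
        # A node that lacks the attribute cannot satisfy the formula.
        if attr not in node_attrs or not OPS[op](node_attrs[attr], const):
            return False
    return True  # an empty conjunction is trivially satisfied

movie_constraints = [("country", "=", "US"), ("year", ">", 2017)]
recent_us = satisfies({"country": "US", "year": 2019}, movie_constraints)
old_us = satisfies({"country": "US", "year": 2010}, movie_constraints)
```

Here `recent_us` is true and `old_us` is false, mirroring the "country=US ∧ year > 2017" example.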

The Diameter of Q, written d Q , is the longest distance between all pairs of nodes in Q. That is, d Q = max{dist(v, v′) | v, v′ ∈ V}, where dist(v, v′) is the length of the shortest undirected path from v to v′.
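As an illustration, the diameter can be computed by running one breadth-first search per node over the undirected version of the pattern. This is a sketch under the assumption that the pattern is connected (as required of graph patterns) and is given as a node set and a set of directed pairs; the function name and the edge directions of the toy pattern are hypothetical.

```python
from collections import deque

def diameter(nodes, edges):
    """Longest shortest-path distance, ignoring edge direction."""
    adjacency = {v: set() for v in nodes}
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    longest = 0
    for start in nodes:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in adjacency[current]:
                if nxt not in dist:
                    dist[nxt] = dist[current] + 1
                    queue.append(nxt)
        longest = max(longest, max(dist.values()))
    return longest

# A three-node chain Pr -> PhD -> Article (directions assumed).
chain_diameter = diameter({"Pr", "PhD", "Article"},
                          {("Pr", "PhD"), ("PhD", "Article")})
```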

We slightly revise subgraph isomorphism [9], graph simulation [18] and dual simulation [15] to take into account labels on edges and attributes on vertices. We refer next to the data graph G = (V, E, L, A) and the graph pattern

By condition (3), graph simulation preserves only child relationships.

Dual simulation enhances graph simulation by preserving parent relationships.

When G matches Q via subgraph isomorphism, the match result is the set of all subgraphs of G that are isomorphic to Q. Moreover, the match result w.r.t. a binary match relation S ⊆ V P ×V (i.e. produced by dual simulation or its extensions) is a subgraph G s of G s.t.: 1) a node v ∈ V s if it matches some node of Q via S; and 2) an

Contrary to the complex CQs proposed in [12], which lead to a prohibitive cost, we use simple but useful CQs in order to strike a balance between expressivity and matching cost.

A are defined as for conventional graph patterns; 2) for each edge e ∈ E, C(e) is a CQ given by an integer p (p ≥ 1).

Intuitively, for any data graph G and any edge e = (u, u′) in E with C(e) = p, a node v from G matches u if it has at least p children that match u′, and moreover, these children must be reached from v via an edge labeled L(e). As a special case, C(e) = 1 expresses existential quantification. Contrary to [8], we do not allow CQs of the form C(e) = 0 since their semantics would be ambiguous with negation. Using QGPs as building blocks, we next define CGPs.

Remark that CGPs extend QGPs with the function P, which allows the definition of positive and negative predicates. Intuitively, P(u) assigns a (possibly empty) conjunction of positive and/or negative predicates to the vertex u of Q c . Each positive (resp. negative) predicate specifies a quantified and attributed graph-based condition that must be satisfied (resp. unsatisfied) by each match of u in a data graph. This kind of predicate gives our matching model some expressivity that is not covered by existing approaches [5, 6, 8, 12, 15].

The whole semantic of Q can be stated naturally with the friendly syntax

. This means that a data graph G matches Q if it has a subgraph G s that satisfies all constraints expressed by Q c (child and parent relationships, attribute constraints, and CQs). Moreover, for any node u ∈ Q c , each match of u in G s must satisfy (resp. not satisfy) all positive (resp. negative) predicates given by P(u). As in well-known conditional languages (e.g. SQL, XPath), the core Q c of Q represents the structure of the match result that will be returned to the user, while the predicates of Q are used only during the matching process to refine this match result.

The size of a QGP q = (V′, E′) is given by |V′| + |E′|, while the size of a CGP Q = (V, E, L, A, C, P) is given by |V| + |E| + Σ_{u∈V} ( Σ_{q∈P(u)} |q| + Σ_{not(q)∈P(u)} |q| ). In other words, the size of Q is given by the number of vertices and edges that belong to the core Q c and to the positive and/or negative predicates of Q.

We simply denote the CGP Q as (V, E) when it is clear from the context.

Consider the CGPs of Fig. 2. According to Definition 2 (condition c), remark that each predicate intersects with the core of the original CGP in only one node. The QGP Q 1 returns all professors (Pr), their PhD students (PhD), and the articles (Article) published by the latter. The QGP Q 2 is a special case of Q 1 since it requires: 1) each PhD student to have at least two published articles; and 2) each professor to be from the UK and between 38 and 45 years old. The CGP Q 3 returns pairs of nodes composed of professors and their PhD students, provided that each of these students has exactly two published articles. This restriction is imposed by the conjunction of a positive and a negative predicate defined over the node PhD of Q 3 . This means that each PhD node from a data graph is returned by Q 3 only if it has a parent node labeled Pr and satisfies each predicate of Q 3 (has exactly two children labeled Article).

We introduce conditional simulation that extends dual simulation of Ma et al. [15] with attribute constraints, CQs, positive and negative predicates. Our extension allows matching of large data graphs in ptime via more expressive patterns.

Conditions (1-2) concern node properties: a pattern node u can be matched by any data node v that has its label and whose attribute values satisfy attribute constraints of u. Conditions (3) and (4) check respectively the satisfaction of CQs and parent relationships. Condition (5) specifies that for each match v of u, the ball centered at v in G satisfies (resp. does not satisfy) each positive (resp. negative) predicate defined over u. Finally, condition (6) states that there are matches in G for any node in Q.

As with graph simulation and its counterparts, conditional simulation aims to find the maximum version of S G Q , that is, all subgraphs of G that satisfy the topology and constraints of Q. Based on this notion, we obtain the following result. Figure 3 shows our CGP matching algorithm, referred to as Match C , which takes as input a CGP Q and a data graph G, and returns ∅ if Q ⊀ C G or the corresponding maximum match relation S G Q otherwise. Firstly, the match relation S G Q is initialized by matching each node in Q with all nodes of G that have its label (line 1). Since Q c ⊀ C G implies Q ⊀ C G, Match C starts by refining the initial version of S G Q w.r.t. the constraints of Q c only (lines 2-3). This step is carried out by the algorithm Match Q that is explained below. If the call of Match Q returns an empty version of S G Q then Q c ⊀ C G and the principal algorithm Match C terminates. Otherwise, the new version of S G Q reflects all matches of Q c in G, and it must be refined w.r.t. each predicate of Q as follows. A couple (u, v) is deleted from S G Q if there exists a predicate q defined over u in Q where: q is positive but the ball centered at v in G, B(G, v, d q ), does not match it (lines 4-6); or q is negative but it is matched by the ball B(G, v, d q ) (lines 7-9). As remarked in [17], when a couple (u, v) is deleted from S G Q (lines 5 and 8), other couples in S G Q may become incorrect w.r.t. the constraints of Q c . For this reason, after deleting all couples of S G Q that are incorrect w.r.t. the predicates of Q (lines 4-9), we refine the match relation S G Q again to keep only matches that are correct w.r.t. the constraints of Q c (line 10). The last version of S G Q that results from this refinement is finally returned.
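The control flow of Match C described above (refine by the core, filter by predicates, refine again) can be sketched as follows. This is not the algorithm of Fig. 3: the core refinement and the ball test are passed in as callables (`match_core` and `ball_matches`, both hypothetical names), and the toy stand-ins at the bottom replace real graph matching with a simple article count so that the sketch stays short.

```python
def match_c(predicates, graph, match_core, ball_matches):
    """Sketch of the Match C control flow.

    predicates: {u: [(sign, q), ...]} with sign "+" or "-".
    match_core(graph, seed): refined relation {u: set(v)}, or {} on failure.
    ball_matches(graph, v, q): whether the ball around v matches predicate q.
    """
    relation = match_core(graph, None)          # refine w.r.t. the core only
    if not relation:
        return {}
    for u, preds in predicates.items():
        for v in list(relation[u]):
            for sign, q in preds:
                matched = ball_matches(graph, v, q)
                # Drop (u, v) on a failed positive or a satisfied negative predicate.
                if (sign == "+" and not matched) or (sign == "-" and matched):
                    relation[u].discard(v)
                    break
    return match_core(graph, relation)          # re-check the core constraints

# Toy stand-ins (assumptions): a graph maps each PhD node to its article count,
# and a predicate q is the threshold "at least q articles".
def toy_core(graph, seed):
    relation = seed if seed is not None else {"PhD": set(graph)}
    return relation if all(relation.values()) else {}

def toy_ball_matches(graph, v, q):
    return graph[v] >= q

students = {"s1": 1, "s2": 2, "s3": 3}
exactly_two = match_c({"PhD": [("+", 2), ("-", 3)]}, students,
                      toy_core, toy_ball_matches)
```

With a positive predicate "at least two articles" and a negative predicate "at least three articles", only s2 survives, which mirrors how Q 3 expresses "exactly two published articles".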

G with Q ≺ C G,

The algorithm Match Q of Fig. 3 extends dual simulation of Ma et al. [15] with attribute constraints, labeled edges, and especially CQs. The algorithm checks whether Q c ≺ C G for any data graph G and QGP Q c given as input. This is done by checking conditions (1-4 and 6) of Definition 2. Given a match relation S (defined by algorithm Match C ) that maps nodes of Q c to nodes of G, the goal is to eliminate all incorrect matches from S as follows. A couple (u, v) is deleted from S if: i) an atomic formula "A op m" (e.g. "age>35") is defined over u in Q c but there is no tuple "A = n" over v in G that satisfies this formula (e.g. "age=n" with n < 35) (lines 1-5); ii) there exists an edge (u′, u) in E c with label l but v has no parent node v′ in G that matches u′ and reaches v with an edge labeled l (lines 7-9); iii) u must have at least n instances of the child u′ in Q c (i.e. C c (u, u′) = n) but the number of children of v in G that match u′ (i.e. the size of the set C(v, u, u′)) is less than n (lines 10-13). The two latter cases are repeated (lines 6-14) until there is no incorrect match in S. Finally, if each node in Q c gets some matches from S then Q c ≺ C G and the refined version of S that ensures this matching is returned; otherwise Q c ⊀ C G and ∅ is returned.
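The three deletion rules above can be sketched as a naive fixpoint computation. This sketch does not reproduce the bookkeeping of Fig. 3 and makes no claim about its running time; the dictionary encodings, the function name `match_q`, and the edge labels of the toy data are assumptions introduced for illustration.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, "=": operator.eq,
       "!=": operator.ne, ">": operator.gt, "<": operator.lt}

def match_q(p_nodes, p_edges, g_nodes, g_edges):
    """p_nodes: {u: (label, [(attr, op, const), ...])}
    p_edges: {(u, u2): (edge_label, p)} with counting quantifier p >= 1
    g_nodes: {v: (label, {attr: value})}
    g_edges: set of (v, edge_label, v2)
    Returns {u: set of matches}, or {} if some pattern node has no match.
    """
    # Rule i): label equality and attribute constraints.
    S = {u: {v for v, (lab, attrs) in g_nodes.items()
             if lab == label and all(a in attrs and OPS[op](attrs[a], c)
                                     for a, op, c in cons)}
         for u, (label, cons) in p_nodes.items()}
    changed = True
    while changed:
        changed = False
        for (u, u2), (elabel, p) in p_edges.items():
            # Rule iii): v needs at least p children matching u2 via elabel.
            for v in list(S[u]):
                kids = {t for (s, l, t) in g_edges
                        if s == v and l == elabel and t in S[u2]}
                if len(kids) < p:
                    S[u].discard(v)
                    changed = True
            # Rule ii): each match of u2 needs a parent matching u via elabel.
            for v2 in list(S[u2]):
                if not any(t == v2 and l == elabel and s in S[u]
                           for (s, l, t) in g_edges):
                    S[u2].discard(v2)
                    changed = True
    return S if all(S.values()) else {}

# Toy pattern in the spirit of Q 2 (edge labels are hypothetical).
pattern_nodes = {"Pr": ("Pr", [("country", "=", "UK")]),
                 "PhD": ("PhD", []), "Article": ("Article", [])}
pattern_edges = {("Pr", "PhD"): ("supervises", 1),
                 ("PhD", "Article"): ("publishes", 2)}
data_nodes = {"pr1": ("Pr", {"country": "UK"}), "pr2": ("Pr", {"country": "US"}),
              "phd1": ("PhD", {}), "phd2": ("PhD", {}), "phd3": ("PhD", {}),
              "a1": ("Article", {}), "a2": ("Article", {}), "a3": ("Article", {}),
              "a4": ("Article", {}), "a5": ("Article", {})}
data_edges = {("pr1", "supervises", "phd1"), ("pr1", "supervises", "phd2"),
              ("pr2", "supervises", "phd3"),
              ("phd1", "publishes", "a1"), ("phd1", "publishes", "a2"),
              ("phd2", "publishes", "a3"),
              ("phd3", "publishes", "a4"), ("phd3", "publishes", "a5")}
result = match_q(pattern_nodes, pattern_edges, data_nodes, data_edges)
```

In the toy data, phd2 is removed because it has a single article, phd3 because its only supervisor is not from the UK, and their articles are then removed by the parent rule.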

S ⊆ V Q × V , algorithm Match Q takes at most O(|Q||G|) time to refine S.

S G Q .

To allow efficient matching of CGPs over large data graphs, we discuss here a parallel version of our algorithm Match C , referred to as PMatch C (not shown here due to the page limit). The intuition behind this version is as follows: rather than evaluating the core and predicates of a given CGP C sequentially, the goal is to match each of them over the underlying data graph G via separate threads. If the data graph does not match the core and/or at least one positive predicate of C then we return null. Otherwise, we combine the match relations produced by the different threads to compute the final match result. As stated in [15], parallel scalability is an important property that each parallel algorithm must satisfy. It guarantees that the more processors are used, the less time is taken by the parallel algorithm. We verify this property experimentally over large data graphs in Sect. 6. Even though PMatch C is quite simple, it improves sequential matching time by 39.48%-60.74% on average.
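Since PMatch C itself is not shown, the following is only a sketch of the dispatch pattern described above, using a thread pool. The callables `evaluate` and `combine` and the toy stand-ins are hypothetical; in particular, how the per-thread match relations are combined is an assumption. Note also that in CPython, threads illustrate the structure but do not by themselves speed up CPU-bound matching.

```python
from concurrent.futures import ThreadPoolExecutor

def pmatch_c(core, positive, negative, graph, evaluate, combine, workers=None):
    """Evaluate the core and every predicate in separate threads.

    evaluate(part, graph): match result of one part (empty means no match).
    combine(core_res, pos_res, neg_res): final result (assumed helper).
    """
    parts = [core] + list(positive) + list(negative)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda part: evaluate(part, graph), parts))
    core_res = results[0]
    pos_res = results[1:1 + len(positive)]
    neg_res = results[1 + len(positive):]
    # No match for the core or for some positive predicate: return null.
    if not core_res or any(not res for res in pos_res):
        return None
    return combine(core_res, pos_res, neg_res)

# Toy stand-ins: a "part" is a node filter and a "graph" is a range of node ids.
def toy_evaluate(part, graph):
    return {v for v in graph if part(v)}

def toy_combine(core_res, pos_res, neg_res):
    kept = set(core_res)
    for res in pos_res:
        kept &= res      # must satisfy every positive predicate
    for res in neg_res:
        kept -= res      # must satisfy no negative predicate
    return kept

toy_result = pmatch_c(lambda v: v % 2 == 0, [lambda v: v > 2], [lambda v: v == 6],
                      range(10), toy_evaluate, toy_combine)
```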

We first introduce the notion of matching between QGPs, which plays an important role in static analysis tasks such as satisfiability and containment checking.

The notion of Pattern-only Matching (PoM ) is to check for two graph patterns Q and R, whether Q matches R. This means that there is at least one sub-pattern of Q that satisfies all constraints of R.

Intuitively, A Q (w) ∼ A R (u) if for any data graph G and any node v in G, if v satisfies A Q (w) then it also satisfies A R (u). For instance, the constraint "age > 25" satisfies the constraint "age > 20", but the converse does not hold.
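For numeric constraints on a single attribute, this implication can be checked by folding each conjunction into an interval and testing interval inclusion. The sketch below makes several assumptions: it covers only the operators =, <, ≤, >, ≥ (not ≠), it treats one attribute at a time, and the helper names are illustrative. The same interval view also detects conflicting constraints, since a conflict corresponds to an empty interval.

```python
import math

def to_interval(atoms):
    """Fold atoms [(op, const), ...] into (low, low_open, high, high_open)."""
    low, low_open, high, high_open = -math.inf, True, math.inf, True
    for op, c in atoms:
        if op in (">", ">=", "="):
            is_open = (op == ">")
            if c > low or (c == low and is_open):
                low, low_open = c, is_open
        if op in ("<", "<=", "="):
            is_open = (op == "<")
            if c < high or (c == high and is_open):
                high, high_open = c, is_open
    return low, low_open, high, high_open

def is_empty(interval):
    """True if no value satisfies the conjunction (conflicting constraints)."""
    low, low_open, high, high_open = interval
    return low > high or (low == high and (low_open or high_open))

def implies(stronger, weaker):
    """True if every value satisfying `stronger` also satisfies `weaker`."""
    s, w = to_interval(stronger), to_interval(weaker)
    if is_empty(s):
        return True
    low_ok = s[0] > w[0] or (s[0] == w[0] and (s[1] or not w[1]))
    high_ok = s[2] < w[2] or (s[2] == w[2] and (s[3] or not w[3]))
    return low_ok and high_ok

forward = implies([(">", 25)], [(">", 20)])
backward = implies([(">", 20)], [(">", 25)])
conflict = is_empty(to_interval([(">=", 25), ("<", 20)]))
```

Here `forward` holds while `backward` does not, and `conflict` flags the constraint "Age ≥ 25 ∧ Age < 20" as unsatisfiable.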

Q = (V Q , E Q , L Q , A Q , C Q ) and R = (V R , E R , L R , A R , C R ). We say that Q matches R, denoted by R Q, if there exists a binary match relation S ⊆ V R × V Q s.t.: 1. for each (u, w) ∈ S: L Q (w) = L R (u) and A Q (w) ∼ A R (u). 2. for each (u, w) ∈ S and each edge e u = (u′, u) in E R , there exists an edge e w = (w′, w) in E Q with: (u′, w′) ∈ S and L Q (e w ) = L R (e u ).

Intuitively, R Q if for any data graph G that matches Q, G matches R too.
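The two conditions of pattern-only matching that are stated above (node labels with attribute implication, and preservation of incoming edges) can be sketched as a fixpoint over a candidate relation. The sketch covers only these two conditions; the encodings, the function name, the edge labels, and the use of syntactic containment as a stand-in for the implication test are all assumptions.

```python
def pattern_only_match(r_nodes, r_edges, q_nodes, q_edges, implies):
    """r_nodes / q_nodes: {node: (label, constraints)}
    r_edges / q_edges: {(source, target): edge_label}
    implies(a_q, a_r): whether constraints a_q entail constraints a_r.
    Returns the largest relation {u: set(w)} over V_R x V_Q, or {} if none.
    """
    # Condition 1: same label and implied attribute constraints.
    S = {u: {w for w, (label_q, a_q) in q_nodes.items()
             if label_q == label_r and implies(a_q, a_r)}
         for u, (label_r, a_r) in r_nodes.items()}
    changed = True
    while changed:
        changed = False
        # Condition 2: every incoming edge (u_parent, u) of R must be mirrored
        # by an incoming edge (w_parent, w) of Q with the same label.
        for (u_parent, u), label in r_edges.items():
            for w in list(S[u]):
                mirrored = any(target == w and q_label == label
                               and source in S[u_parent]
                               for (source, target), q_label in q_edges.items())
                if not mirrored:
                    S[u].discard(w)
                    changed = True
    return S if all(S.values()) else {}

# Q 1 and Q 2 in the spirit of Fig. 2 (edge labels are hypothetical).
shared_edges = {("Pr", "PhD"): "supervises", ("PhD", "Article"): "publishes"}
q1_nodes = {"Pr": ("Pr", []), "PhD": ("PhD", []), "Article": ("Article", [])}
q2_nodes = {"Pr": ("Pr", [("country", "=", "UK")]),
            "PhD": ("PhD", []), "Article": ("Article", [])}

def contains_all(a_q, a_r):
    # Syntactic stand-in for implication: a_q lists every atom of a_r.
    return set(a_r) <= set(a_q)

q2_matches_q1 = pattern_only_match(q1_nodes, shared_edges, q2_nodes, shared_edges,
                                   contains_all)
q1_matches_q2 = pattern_only_match(q2_nodes, shared_edges, q1_nodes, shared_edges,
                                   contains_all)
```

The more restrictive pattern matches the less restrictive one but not the other way around, because the unconstrained Pr node does not entail "country = UK".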

Q with the maximum match relation S. We say that an edge e w = (w, w′) in Q is covered by S if there exists an edge e u = (u, u′) in R s.t.: (u, w) and (u′, w′) are in S, and L Q (e w ) = L R (e u ). Moreover, we denote by S Q the set of all edges of Q that are covered by S. Example 3. Consider Q 1 and Q 2 of Fig. 2. As Q 2 imposes more restrictions over the nodes and edges of Q 1 , any data graph G that matches Q 2 will match Q 1 too, i.e. Q 1 Q 2 . The converse does not hold, since there exists a data graph G that matches Q 1 but not Q 2 . Consider a QGP Q 5 given by

Article. Remark that Q 1 Q 5 even if Q 5 has nodes (resp. edges) that match no nodes (resp. edges) in Q 1 . If S is the maximum match relation that ensures Q 1 Q 5 , then S Q5 is given by 

The satisfiability problem discussed in this section is to determine, given a CGP Q, whether or not there exists a data graph G that matches Q via conditional simulation, i.e. Q ≺ C G. Contrary to existing graph pattern models [2, 5, 6, 15] , the problem is more intriguing in our case due to the presence of predicates.

Example 4. Given the CGP Q 4 of Fig. 2 , it is clear that no data graph G can match Q 4 since each subgraph of G that matches the positive predicate of Q 4 mismatches its negative predicate and vice versa.

Consider a CGP Q = (V, E, L, A, C, P). One of the unsatisfiability cases is due to the presence of conflicting attribute constraints over the vertices of Q (e.g. "Age > 25 ∧ Age < 20"). Moreover, Q is unsatisfiable if there exist conflicts between a negative predicate not(q) ∈ P(u) and the constraints defined over u in Q + . This can be checked in quadratic time thanks to the notion of PoM (Lemma 2).

As widely defined in the literature [11, 13] , the containment problem is to determine whether all nodes and edges returned by a query Q are also returned by the query R. We revisit this traditional definition via conditional simulation as follows.

The static checking of Q ⊆ C R may help to decrease matching time over a large data graph G as follows. If R ⊀ C G then we deduce statically that Q ⊀ C G, and if Q ≺ C G then we do not need to test R anymore since R ≺ C G holds automatically. In general, if Q ⊆ C R then we can consider M C (Q, G) as an under-evaluation of R over G, and M C (R, G) as an over-evaluation of Q over G.

We give next necessary and sufficient conditions for the containment of CGPs. 

For each (u, w) ∈ S Q R and not(q u ) ∈ P R (u), there exists not(q w ) ∈ P Q (w): q w q u with the match relation S and (u, w) ∈ S. 3. Each edge in Q c matches at least one edge in R c via S Q R . Example 5. Consider the data graph G and the CGPs Q i of Fig. 4 (i ∈ [1, 5] ). Based on Lemma 3, one can deduce that: 

We conducted extensive experiments to evaluate (1) the effectiveness of QGPM and CGPM; (2) the effectiveness of parallel CGPM; (3) the scalability of algorithms Match Q , Match C and PMatch C ; and (4) the parallel scalability of PMatch C . Experimental Settings. We used the Amazon real-life network dataset, which is a product co-purchasing graph with 548K nodes and 1.78M edges. Each node has attributes such as ID, Title, Rating, and Group, and an edge from product u to w indicates that people who buy u would also buy w with a high probability.

Using Java, we implemented the algorithms (1) Match Q ; (2) Match C ; and (3) PMatch C , the parallel version of Match C that detects the number of processors and dispatches the core and predicates of the CGP over them. Moreover, a data graph extractor has been implemented to extract Amazon subgraphs with a fixed number of vertices.

All the experiments were run on an Ubuntu desktop machine with an Intel Core i5-8250U CPU, 8GB memory, and 256GB SSD storage. Each experiment was run 5 times and the average is reported. Experimental Results. We next present our findings.

Experiment 1. We first evaluated the performance of quantified and conditional GPM using an Amazon data graph that has 10000 nodes and 37854 edges.

Varying |V Q | and |E Q | (Match Q ): we generated 8 QGPs with sizes (|V Q |, |E Q |, |C Q |) ranging from (4, 4, 4) to (10, 10, 20). As shown in Fig. 5(a), we remark that: (i) the larger |V Q | is, the more time is taken by algorithm Match Q , as expected; (ii) |V Q | influences matching time more than |E Q | does.

Varying |C Q | (Match Q ): fixing |V Q | = 10 and |E Q | = 20 (a very large QGP), we varied the number of CQs from 0 (query with only existential quantifications) to 20. As shown in Fig. 5(c), the number of CQs has only a slight influence on matching time, which is consistent with the fact that our quantified simulation does not increase the cost of dual simulation (Lemma 1).

Varying |A Q | (Match Q ): using the same QGP of size (10, 20), we varied the number of attribute constraints from 0 to 20 by incrementally adding two attribute constraints (e.g. "Group = Book", "Rating ≥ 3") to each vertex of the pattern. As shown in Fig. 5(c), attribute constraints may slightly reduce the matching time taken by Match Q . This is because attribute constraints reduce the size of the initial match relation (lines 1-5 of Fig. 3), which leads to a slight (considerable) reduction of matching time. For instance, the average number of vertex matches ranges from 2946 (|A Q | = 0) to 1814 (|A Q | = 20).

Varying |P + Q | and |P − Q | (Match C ): we defined two CGPs C 1 and C 2 with cores of size (|V Q |, |E Q |, |C Q |, |A Q |) = (5, 10, 10, 5). We varied the number of positive (resp. negative) predicates from 0 to 10 in C 1 (resp. C 2 ) by incrementally adding two predicates of size (3, 2, 2, 2) to each vertex of C 1 (resp. C 2 ). The results are reported in Fig. 5(b). As expected, each predicate needs to be evaluated over the underlying data graph, which increases the matching time of the whole CGP. Match C gives good performance for a reasonable number of predicates: it takes around 7 seconds for three positive (resp. negative) predicates, since the whole size of C 1 (resp. C 2 ) becomes (11, 16, 16), which is somewhat large compared to real-life patterns. Contrary to what one may deduce from the figure, positive and negative predicates theoretically require the same computational time.
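The whole size quoted above can be checked with a small computation. The sketch below assumes that each predicate shares exactly one node with the core (as noted for the CGPs of Fig. 2), so it contributes all of its edges and CQs but one node fewer than its own node count; the function name is illustrative.

```python
def whole_size(core, predicate, count):
    """Size tuple (nodes, edges, ...) of a core plus `count` equal predicates.

    Assumption: each predicate overlaps the core in exactly one node.
    """
    nodes = core[0] + count * (predicate[0] - 1)
    rest = tuple(c + count * p for c, p in zip(core[1:], predicate[1:]))
    return (nodes,) + rest

# Core (|V|, |E|, |C|) = (5, 10, 10) with three predicates of size (3, 2, 2).
three_predicates = whole_size((5, 10, 10), (3, 2, 2), 3)
```

Three predicates add 3 × 2 nodes, 3 × 2 edges and 3 × 2 CQs to the core, which gives (11, 16, 16).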

Experiment 2. PMatch C compared to Match C . Using the same data set as in Experiment 1, we generated a CGP with a core of size (|V Q |, |E Q |, |C Q |, |A Q |) = (3, 6, 6, 3). We varied the number of predicates from 0 to 10 by evenly adding 5 positive predicates and 5 negative predicates to different vertices of the core. As shown in Fig. 5(d), the running time of PMatch C increases only very slightly, contrary to that of Match C . This is because our machine has 4 cores with hyper-threading technology, which allows 8 threads to execute at the same time; hence, the increasing number of predicates has only a slight influence on the matching time. Remark that PMatch C outperforms Match C by 2.7 times, a significant improvement.

Experiment 3. We evaluated the scalability of algorithms Match Q , Match C and PMatch C . We generated a CGP C with a core of size (3, 4) and a positive and a negative predicate, each of size (3, 4) too. This gives a whole graph pattern of a large size, (7, 12). We generated ten data graphs whose sizes varied from (10^4, 37854) to (10^5, 391545). For algorithm Match Q , we measure only the time needed to evaluate the core of C. As shown in Fig. 5(e), (1) the time of PMatch C is close to that of Match Q since the core and predicates of C have the same size and are matched in parallel; (2) PMatch C scales well with large data graphs and is on average 39.48%-60.74% faster than Match C .

Experiment 4. We evaluated the parallel scalability of algorithm PMatch C . We generated a CGP C with a core of size (3, 6) and three positive and three negative predicates of size (3, 2). This gives a whole graph pattern of size (15, 18). Next, we measured the running times of PMatch C by varying the number of CPU cores from 1 to 4. In the second part of this experiment, we disabled the hyper-threading technology to see its impact on matching time. As shown in Fig. 5(f): (1) PMatch C is parallel scalable since the more cores are used, the less time it takes to match C over the data graph; (2) enabling hyper-threading slightly reduces the matching time of CGPs. This suggests using more CPU cores in order to efficiently match CGPs over large data graphs.

We have proposed conditional graph patterns (CGPs) that extend traditional graph patterns with edge quantifications of the form "≥k", attribute constraints on nodes, and positive and negative predicates. Despite their increased expressivity, CGPs do not make matching much harder, as shown by our matching algorithm that runs in quadratic time. In the second part of this paper, we have studied satisfiability and containment of CGPs, which are fundamental problems for any query language. We have shown that these problems are decidable in low ptime by providing two checking algorithms. Our experimental results have verified the effectiveness, efficiency and scalability of our algorithms, using real-life data. As future work, we will investigate some strategies to speed up matching of CGPs, such as data graph distribution [3] and incremental matching [7].

[1] Detecting social positions using simulation

[2] Making pattern queries bounded in big graphs

[3] Distributed query evaluation with performance guarantees

[4] A (sub)graph isomorphism algorithm for matching large graphs

[5] Adding regular expressions to graph reachability and pattern queries

[6] From intractable to polynomial time

[7] Incremental graph pattern matching

[8] Adding counting quantifiers to graph patterns

[9] Matching structure and semantics: a survey on graph-based pattern matching

[10] Satisfiability of XPath queries with sibling axes

[11] A system for the static analysis of XPath

[12] A visual language for querying and updating graphs

[13] Containment of queries for graphs with data

[14] Multi-constrained graph pattern matching in large-scale contextual social graphs

[15] Strong simulation: capturing topology in graph pattern matching

[16] Graph pattern matching for dynamic team formation

[17] Graph pattern matching with counting quantifiers and label-repetition constraints

[18] Communication and Concurrency

[19] An algorithm for subgraph isomorphism