UIUCDCS-R-75-718

Algorithms for Parsing Search Queries in Inverted File Document Retrieval Systems

by Jane W. S. Liu

November, 1975

Digitized by the Internet Archive in 2013: http://archive.org/details/algorithmsforpar718liuj

This work was supported by the National Science Foundation under Grants NSF DCR 73-07980 and NSF DCR 72-03740 A01.

Abstract

In an inverted file document retrieval system, a query is in the form of a Boolean expression of index terms. In response to a query, the system accesses the inverted lists corresponding to the index terms, merges them and selects from the merged list those documents that satisfy the search logic. In this paper, we consider the problem of determining a Boolean expression which leads to the minimum total merge time among all Boolean expressions that are equivalent to the expression given in the query. This problem is the same as finding an optimal merge tree among all trees that realize the truth function determined by the Boolean expression in the query. Several algorithms which generate optimal merge trees, when the lengths of overlaps between different lists are small compared with the lengths of the lists, are described. These algorithms are no longer optimal when the lengths of overlaps cannot be neglected. In this case, it is possible to bound the performance of these algorithms in some instances in terms of the maximum overlap between lists. The performance bounds are discussed.

Acknowledgment

The author wishes to thank Drs. D. J. Kuck and W. Stellhorn for their comments and suggestions.

I. Introduction

In this paper, we consider the problem of parsing search queries in inverted file document retrieval systems.
In such systems, the index file contains an entry for each of the index terms selected as descriptors for the documents in the data file. Each entry in the index file contains a pointer to an inverted list of pointers in the postings file. The pointers in this list in turn point to all the documents in the data file that contain the corresponding index term. This file organization has been studied extensively and is used in many well-known systems [1-4].

A query to an inverted file document retrieval system is in the form of a Boolean expression of index terms. For example, to request information on scheduling or resource management policies in time-shared systems, a user may present to the system the query

"Time Shared" · ("Scheduling Policy" + "Resource Management Policy")    (1-1)

where · and + are the AND and OR operators, respectively. In response to a request, the system accesses the inverted lists in the postings file corresponding to the index terms in the query, merges them and selects from the merged list those documents that satisfy the search logic. In our example, the union of the lists corresponding to the index terms "Scheduling Policy" and "Resource Management Policy" is the list of pointers to the documents on scheduling or resource management policies. Let us denote this list by A. The list A is obtained by merging the two lists with duplicated entries deleted. We call the process of merging two lists to obtain their union an OR merge. The intersection of two lists is obtained by merging them and deleting from the merged list all entries except the duplicated ones. We call the process of merging two lists to obtain their intersection an AND merge. In this example, pointers to documents to be selected as the response to the query in (1-1) are obtained by AND merging the list A with the list corresponding to the index term "Time Shared." In other words, these three lists are merged in the order specified by the tree in Fig. 1-1a.
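The two merges described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each inverted list is a sorted list of distinct document numbers; the function names `or_merge` and `and_merge` and the sample postings are hypothetical and do not come from the report.

```python
def or_merge(a, b):
    """Union of two sorted lists of distinct pointers (duplicates deleted)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]


def and_merge(a, b):
    """Intersection of two sorted lists (only duplicated entries are kept)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical postings (document numbers) for the three index terms.
time_shared = [2, 3, 5, 8, 13]
scheduling_policy = [1, 2, 8, 21]
resource_management_policy = [3, 4, 8]

# Merge order of the tree for query (1-1): OR merge first, then AND merge.
list_a = or_merge(scheduling_policy, resource_management_policy)
response = and_merge(time_shared, list_a)
```

Both functions make a single pass over their inputs, which is why the merge time is later taken to be proportional to the sum of the two list lengths.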
We label the leaves of the tree by the index terms and the internal nodes by the Boolean operators corresponding to the merges. We note that the query in (1-1) is equivalent to the query

"Scheduling Policy" · "Time Shared" + "Resource Management Policy" · "Time Shared"

To answer this query, the corresponding lists are merged in the order specified by the tree in Fig. 1-1b. Clearly, the times required to produce the response might be different for the two equivalent queries written in different Boolean forms.

[Figure 1-1: merge trees (a) and (b) for the two equivalent forms of the query in (1-1)]

We are concerned here with the problem of determining a Boolean expression which leads to the minimum total merge time* among all Boolean expressions that are equivalent to the expression given in the query.

* The total merge time computed here is equal to the amount of time required of the merge processor to process the query. The merge order that minimizes this time does not minimize the total retrieval time in general. However, keeping the total merge time minimized becomes important in multiuser systems in which many users share a large buffer memory. While the lists corresponding to the index terms in the query of a user are being merged, the lists specified in the queries of other users can be loaded into the buffer.

[Figure 1-2: schematic of the inverted file document retrieval system]

For this purpose, we describe the inverted file document retrieval system schematically as shown in Fig. 1-2. To process a query, the lists corresponding to the index terms in the query are read into the buffer memory and are merged after they have been placed in the buffer. The total retrieval time is equal to the sum of the time required to access the lists from the secondary memory and the merge time of the lists. In this paper, no attempt is made to minimize the former.
(The average access time of the lists from the secondary memory has been estimated elsewhere [5,6]. The dependence of list access time on the access algorithm and buffer management scheme used in the system is the subject of a separate study [7].) Furthermore, only the case of two-way merges is considered here.

In Section II, we introduce the terminology and notation necessary for our discussion. In Section III, we assume that the lengths of overlaps between different lists are very small compared with the lengths of the lists. Hence, in the computation of the total merge time, lengths of overlaps between lists can be neglected. Several algorithms are described in Section III. These algorithms allow us to find optimal Boolean expressions when queries are written as nested Boolean expressions in which (i) all variables are distinct, and (ii) the complement of any variable, B, can appear only in product terms, such as $A_1 \cdot A_2 \cdots A_k \cdot \bar{B}$, with at least one of the variables $A_i$ being uncomplemented. These algorithms are no longer optimal when the lengths of overlap between lists cannot be neglected. In this case, it is possible to bound the performance of these algorithms in some instances in terms of the maximum overlap between lists. We discuss these bounds in Section IV.

II. Notations

Throughout our discussion, we use upper case letters (e.g., A, B, C, D) to denote both index terms and their corresponding inverted lists. Lower case letters are used to denote the lengths of the corresponding lists (e.g., a, b, c and d denote the lengths of the lists A, B, C and D, respectively). Let $\sigma(A,B)$ denote the length of overlap between lists A and B. The lengths of the resultant lists obtained by AND merging and OR merging the lists A and B are $\sigma(A,B)$ and $a+b-\sigma(A,B)$, respectively. Let $\bar{A}$ denote the complement of the index term A.
The list corresponding to $\bar{A} \cdot B$ is obtained by merging the lists A and B and selecting from the merged list those entries that are in list B but not in list A. We call this merging process an AND NOT merge. Clearly, the length of the list obtained by AND NOT merging A and B is equal to $b-\sigma(A,B)$. We shall not be concerned with Boolean expressions containing terms of the forms $\bar{A}+B$ and $\bar{A} \cdot \bar{B}$. This restriction leads to no loss of generality since Boolean expressions of these forms are explicitly ruled out in most installations.

Consider a query written as a Boolean expression of the index terms $A_1, A_2, \ldots, A_n$. Let $Q(A_1, A_2, \ldots, A_n)$ denote the truth function determined by this expression. (A truth function is usually represented by a truth table for the Boolean expression.)* We note that the truth function $Q(A_1, A_2, \ldots, A_n)$ determines a unique list of pointers. This list is the valid response to all queries written as Boolean expressions that determine this truth function, and this list can be obtained by merging the lists $A_1, A_2, \ldots, A_n$.

* More precisely, a Boolean expression of the index terms $A_1, A_2, \ldots, A_n$ specifies an element in the free Boolean algebra with n generators. This element in the free Boolean algebra in turn specifies the truth function $Q(A_1, A_2, \ldots, A_n)$.

Let $T(F(A_1, A_2, \ldots, A_n))$ denote a tree specifying the merge order of lists $A_1, A_2, \ldots, A_n$ corresponding to the Boolean expression $F(A_1, A_2, \ldots, A_n)$. Again, the leaves of the tree are labeled with the names of the corresponding lists while the internal nodes are marked by the Boolean operators corresponding to the merges. (Examples of such trees are shown in Fig. 1-1.) We say that the tree $T(F(A_1, A_2, \ldots, A_n))$ realizes the truth function $Q(A_1, A_2, \ldots, A_n)$ if $Q(A_1, A_2, \ldots, A_n)$ is the truth function determined by $F(A_1, A_2, \ldots, A_n)$.
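As a small illustration of what it means for two expressions to determine the same truth function, the following Python sketch tabulates the truth tables of the two forms of the query in (1-1) and compares them. The helper `truth_function` and the lambda names are hypothetical, introduced only for this example.

```python
from itertools import product


def truth_function(expr, n):
    """Truth table of an n-variable Boolean expression, as a tuple of bools."""
    return tuple(bool(expr(*bits)) for bits in product((False, True), repeat=n))


# t = "Time Shared", s = "Scheduling Policy", r = "Resource Management Policy"
nested = lambda t, s, r: t and (s or r)                # form (1-1)
distributed = lambda t, s, r: (s and t) or (r and t)   # distributed form
different = lambda t, s, r: t or (s and r)             # not equivalent to (1-1)

same_truth_function = truth_function(nested, 3) == truth_function(distributed, 3)
```

Here `same_truth_function` is true, so any merge tree for either form realizes the same truth function and yields the same response list, while `different` has another truth table.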
Since there are many ways to parenthesize a Boolean expression (e.g., $A \cdot (B+(C+D))$ and $A \cdot ((B+C)+D)$ are two different ways to parenthesize the expression $A \cdot (B+C+D)$), there are many different binary merge trees corresponding to a Boolean expression. We distinguish them by using different subscripts. (For example, $T_1(A \cdot (B+C+D))$ and $T_2(A \cdot (B+C+D))$ denote two different binary merge trees for the expression $A \cdot (B+C+D)$.) When there is no possible confusion, we also refer to the tree $T(F(A_1, A_2, \ldots, A_n))$ simply as T.

The time required to merge two lists is proportional to the sum of their lengths for all three types of merges. (To be specific, let the proportionality constant be 1.) The cost of the tree $T(F(A_1, A_2, \ldots, A_n))$, denoted C(T), is equal to the total merge time of the lists $A_1, A_2, \ldots, A_n$ when the order of the merges is specified by T. A tree is said to be optimal when its cost is minimum among all trees which realize the truth function $Q(A_1, A_2, \ldots, A_n)$ determined by the Boolean expression $F(A_1, A_2, \ldots, A_n)$ in the query. With a slight abuse of notation, we denote the optimal tree by $T_0(F(A_1, A_2, \ldots, A_n))$. Hence the problem of finding a Boolean expression corresponding to the minimum total merge time is the same as that of finding an optimal merge tree among all trees that realize the truth function $Q(A_1, A_2, \ldots, A_n)$.

III. Algorithms for Determining Optimal Merge Trees

In this section, we assume that the lengths of overlap between different lists are very small compared with the lengths of the lists. Hence, in the computation of total merge time, lengths of overlaps can be neglected. In this case, the length of the resultant list obtained by OR merging lists A and B is equal to a+b. The length of the resultant list obtained by AND merging A and B is very small compared to a or b. In the computation of merge time, it is assumed to be zero. The length of the list obtained by AND NOT merging A and B is approximately equal to b.
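The cost model of this section can be written down directly. The sketch below is an illustration under the negligible-overlap assumption stated above; the tuple encoding of trees, the function name `length_and_cost` and the list lengths are assumptions made for this example only.

```python
def length_and_cost(tree, lengths):
    """Return (result length, total merge time) of a merge tree.

    A leaf is a list name; an internal node is (op, left, right) with op in
    {"OR", "AND", "ANDNOT"}. Overlaps are neglected, as in Section III.
    """
    if isinstance(tree, str):
        return lengths[tree], 0
    op, left, right = tree
    l_len, l_cost = length_and_cost(left, lengths)
    r_len, r_cost = length_and_cost(right, lengths)
    merge_time = l_len + r_len          # proportionality constant taken as 1
    if op == "OR":
        out_len = l_len + r_len
    elif op == "AND":
        out_len = 0                     # intersection assumed negligible
    elif op == "ANDNOT":                # (NOT left) AND right
        out_len = r_len
    else:
        raise ValueError(op)
    return out_len, l_cost + r_cost + merge_time


# Hypothetical list lengths.
lengths = {"A": 4, "B": 1, "C": 2, "D": 3}
t1 = ("AND", "A", ("OR", "B", ("OR", "C", "D")))   # A.(B+(C+D))
t2 = ("AND", "A", ("OR", ("OR", "B", "C"), "D"))   # A.((B+C)+D)
cost_t1 = length_and_cost(t1, lengths)[1]
cost_t2 = length_and_cost(t2, lengths)[1]
```

With these lengths the two parenthesizations of $A \cdot (B+C+D)$ cost 21 and 19, respectively, which shows that trees realizing the same truth function can differ in cost.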
The lengths of overlaps between lists being negligibly small, an optimal tree for OR merging the lists $A_1, A_2, \ldots, A_n$ to realize the truth function determined by the Boolean expression $A_1+A_2+\cdots+A_n$ is one with minimum weighted path length, where the weights of the leaves are the lengths of the lists and the weights of internal nodes are zero. Such a tree can be constructed using Huffman's procedure. We call the resultant tree a Huffman tree for $A_1, A_2, \ldots, A_n$ and denote it by $T_0(A_1+A_2+\cdots+A_n)$ [9].

A property of an optimal tree is that all its subtrees are optimal. It follows that if $T_{10}$ and $T_{20}$ are the two subtrees of an optimal tree $T_0(A_1+A_2+\cdots+A_n)$ obtained by removing the root of $T_0$, then $C(T_{10}) + C(T_{20})$ is minimum among all possible pairs of subtrees of an arbitrary tree $T(A_1+A_2+\cdots+A_n)$. Similarly, let $T_{10}, T_{20}, \ldots, T_{m0}$ denote m subtrees of $T_0(A_1+A_2+\cdots+A_n)$ obtained by removing the roots of larger subtrees of $T_0$. Then

$C(T_{10}) + C(T_{20}) + \cdots + C(T_{m0}) \le C(T_1) + C(T_2) + \cdots + C(T_m)$    (3-1)

where $T_1, T_2, \ldots, T_m$ are m subtrees of an arbitrary tree $T(A_1+A_2+\cdots+A_n)$. We call the subtrees $T_{10}, T_{20}, \ldots, T_{m0}$ Huffman subtrees.

Optimal Merge Trees for Boolean Expressions of the Form $(A_1+A_2+\cdots+A_n) \cdot B$

Let $A_1, A_2, \ldots, A_n$ and B be n+1 lists with lengths $a_1, a_2, \ldots, a_n$ and b, respectively. To find an algorithm which yields an optimal form of the Boolean expression $(A_1+A_2+\cdots+A_n) \cdot B$, we consider the tree shown in Fig. 3-1 where $T_1, T_2, \ldots, T_m$ are m subtrees of $T(A_1+A_2+\cdots+A_n)$. We note that

Lemma 3-1. If the tree in Fig. 3-1 is an optimal merge tree among all trees that realize the truth function determined by the expression $(A_1+A_2+\cdots+A_n) \cdot B$, then $T_1, T_2, \ldots, T_m$ are Huffman subtrees of $T_0(A_1+A_2+\cdots+A_n)$.

Proof: The cost of the tree in Fig. 3-1 is

$C(T) = C(T_1) + C(T_2) + \cdots + C(T_m) + \sum_{i=1}^{n} a_i + mb$

Because of (3-1), this cost is minimized when $T_1, T_2, \ldots, T_m$ are Huffman subtrees.
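Huffman's procedure for building the OR merge tree can be sketched with a priority queue. This is a minimal illustration using Python's standard `heapq` module; the function name `huffman_tree`, the nested-pair encoding of the tree and the sample lengths are assumptions of this example, not part of the report.

```python
import heapq
from itertools import count


def huffman_tree(lengths):
    """Return (tree, cost) for OR merging lists with negligible overlaps.

    `lengths` maps list names to list lengths. The tree is a nested pair of
    names; the cost is the total merge time (the weighted path length).
    """
    tie = count()  # tie-breaker so that subtrees are never compared directly
    heap = [(n, next(tie), name) for name, n in lengths.items()]
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)   # the two shortest lists so far
        n2, _, t2 = heapq.heappop(heap)
        cost += n1 + n2                   # merge time of this OR merge
        heapq.heappush(heap, (n1 + n2, next(tie), (t1, t2)))
    return heap[0][2], cost


tree, cost = huffman_tree({"A1": 1, "A2": 2, "A3": 5, "A4": 10})
t10, t20 = tree   # the two Huffman subtrees obtained by removing the root
```

For these lengths the procedure merges $A_1$ with $A_2$, then the result with $A_3$, then with $A_4$, for a total merge time of 3 + 8 + 18 = 29.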
■

As a consequence of Lemma 3-1, to find an optimal tree which realizes the truth function determined by the Boolean expression $(A_1+A_2+\cdots+A_n) \cdot B$, we need to consider only trees corresponding to expressions of the form

$(\sum_{A_i \in S_1} A_i) \cdot B + (\sum_{A_i \in S_2} A_i) \cdot B + \cdots + (\sum_{A_i \in S_m} A_i) \cdot B$    (3-2)

where $S_j$ ($1 \le j \le m$) is the set of leaves of the j-th Huffman subtree of $T_0(A_1+A_2+\cdots+A_n)$. Furthermore, an optimal merge tree that realizes the truth function determined by $(A_1+A_2+\cdots+A_n) \cdot B$ can be obtained from

[Figure 3-1]

Theorem 3-1. $(A_1+A_2+\cdots+A_n) \cdot B$ is an optimal Boolean form if and only if

$\sum_{i=1}^{n} a_i \le b$

Hence an optimal merge tree for $(A_1+A_2+\cdots+A_n) \cdot B$ is the tree $T_u$ shown in Fig. 3-2a, where $T_0(A_1+A_2+\cdots+A_n)$ is the Huffman tree for the lists $A_1, A_2, \ldots, A_n$.

Proof: Let $T_{10}$ and $T_{20}$ be two Huffman subtrees of $T_0(A_1+A_2+\cdots+A_n)$ and S be a subset of $\{A_1, A_2, \ldots, A_n\}$. Because of Lemma 3-1, the tree in Fig. 3-2b, $T_d$, will have the minimum cost among all trees $T(F(B, A_1, A_2, \ldots, A_n))$ corresponding to Boolean expressions of the form

$F(B, A_1, A_2, \ldots, A_n) = (\sum_{A_i \in S} A_i) \cdot B + (\sum_{A_i \in \{A_1, A_2, \ldots, A_n\} - S} A_i) \cdot B$

The cost of $T_d$ is

$C_d = C(T_{10}) + C(T_{20}) + \sum_{i=1}^{n} a_i + 2b = C(T_0) + 2b$

while the cost of $T_u$ is

$C_u = C(T_0) + \sum_{i=1}^{n} a_i + b$

Therefore, we have $C_d \ge C_u$ if and only if

$\sum_{i=1}^{n} a_i \le b$    (3-3)

[Figure 3-2: (a) $T_u$; (b) $T_d$]

We now show that the inequality in (3-3) implies that the tree $T_u$ in Fig. 3-2a is indeed optimal. Again, because of Lemma 3-1, it suffices for us to show that (3-3) implies that the cost of the tree $T_m$ shown in Fig. 3-3a, $C_m$, is less than the cost of the tree $T_{(m+1)}$ shown in Fig. 3-3b, $C_{(m+1)}$, where $T_{j10}$ and $T_{j20}$ are two Huffman subtrees of $T_{j0}$. We note that [...] we have Algorithm A for finding an optimal merge tree for $(A_1+A_2+\cdots+A_n) \cdot B$.

[Figure 3-3: (a) $T_m$; (b) $T_{(m+1)}$]

Algorithm A

1. a. If $b \ge \sum_{i=1}^{n} a_i$, $(A_1+A_2+\cdots+A_n) \cdot B$ is an optimal Boolean form and an optimal merge tree is $T_u$ shown in Fig. 3-2a, where $T_0(A_1+A_2+\cdots+A_n)$ is the Huffman tree for the lists $A_1, A_2, \ldots, A_n$.

   b. If $b < \sum_{i=1}^{n} a_i$, choose the merge tree $T_d$ shown in Fig. 3-2b where $T_{10}$ and $T_{20}$ are two Huffman subtrees of $T_0(A_1+A_2+\cdots+A_n)$. The corresponding Boolean expression is

   $(\sum_{A_i \in S_1} A_i) \cdot B + (\sum_{A_i \in S_2} A_i) \cdot B$

   where $S_1$ and $S_2$ are the sets of leaves of $T_{10}$ and $T_{20}$.

2. An optimal tree can be obtained by repeating step 1 for each of the terms $(\sum A_i) \cdot B$.

Consider the example shown in Fig. 3-4. The lengths of the lists B, $A_1$, $A_2$, $A_3$ and $A_4$ are 5, 1, 2, 5 and 10, respectively. The Huffman tree for $A_1+A_2+A_3+A_4$ is shown in Fig. 3-4a together with its two subtrees $T_{10}$ and $T_{20}$. Since $a_1+a_2+a_3+a_4 = 18 > b$, the cost of the tree $T_d$ in Fig. 3-4b is less than the cost of any tree $T(B \cdot (A_1+A_2+A_3+A_4))$. ($C(T_d) = 39$ and $C(T(B \cdot (A_1+A_2+A_3+A_4))) \ge 52$.) Hence, we choose the Boolean expression corresponding to the tree $T_d$,

$B \cdot (A_1+A_2+A_3) + B \cdot A_4$

instead of $B \cdot (A_1+A_2+A_3+A_4)$. Repeating step 1 for the term $B \cdot (A_1+A_2+A_3)$, we choose to distribute the · operation with respect to the + operation and obtain

$B \cdot (A_1+A_2) + B \cdot A_3 + B \cdot A_4$

The corresponding merge tree is $T_d'$ shown in Fig. 3-4c. Since $a_1+a_2 < b$, we conclude that $T_d'$ is an optimal merge tree among all trees that realize the truth function determined by the expression $B \cdot (A_1+A_2+A_3+A_4)$. Indeed, $C(T_d') = 36$ and $C(T(B \cdot A_1 + B \cdot A_2 + B \cdot A_3 + B \cdot A_4)) = 38$.

[Figure 3-4: Boolean expression $B \cdot (A_1+A_2+A_3+A_4)$: (a) Huffman tree $T_0(A_1+A_2+A_3+A_4)$ with subtrees $T_{10}$ and $T_{20}$; (b) $T_d$; (c) $T_d'$]

Optimal Merge Trees for Boolean Expressions of the Form $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$

Lemma 3-1 and Theorem 3-1 can be generalized to the case when the Boolean expression specified in the query is of the form $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$. To do so, let us consider the tree $T_k$ shown in Fig. 3-5 where $T_{A1}, T_{A2}, \ldots, T_{Aj}$ are subtrees of $T(A_1+A_2+\cdots+A_n)$ and $T_{B1}, T_{B2}, \ldots, T_{Bk}$ are subtrees of $T(B_1+B_2+\cdots+B_m)$. We state without proof:

Lemma 3-2. If the merge tree $T_k$ is optimal for the Boolean expression $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$, then $T_{A1}, T_{A2}, \ldots, T_{Aj}$ are Huffman subtrees of $T_0(A_1+A_2+\cdots+A_n)$ and $T_{B1}, T_{B2}, \ldots, T_{Bk}$ are Huffman subtrees of $T_0(B_1+B_2+\cdots+B_m)$.

In this case, we have

Theorem 3-2. $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$ is an optimal Boolean expression if and only if

$\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i$

Hence an optimal merge tree for $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$ is the tree $T_u$ in Fig. 3-6a where $T_A$ and $T_B$ are Huffman trees for $A_1, A_2, \ldots, A_n$ and $B_1, B_2, \ldots, B_m$, respectively.

[Figure 3-5: the tree $T_k$]

[Figure 3-6: (a) $T_u$; (b) $T_{dA}$; (c) $T_{dB}$]

Proof: We compare the cost of the tree $T_u$ in Fig. 3-6a with that of $T_{dA}$ and $T_{dB}$ in Fig. 3-6b and 3-6c. The Boolean expressions corresponding to the trees $T_{dA}$ and $T_{dB}$ are

$(\sum_{A_i \in S_{1A}} A_i) \cdot (B_1+B_2+\cdots+B_m) + (\sum_{A_i \in S_{2A}} A_i) \cdot (B_1+B_2+\cdots+B_m)$

and

$(\sum_{B_i \in S_{1B}} B_i) \cdot (A_1+A_2+\cdots+A_n) + (\sum_{B_i \in S_{2B}} B_i) \cdot (A_1+A_2+\cdots+A_n)$

respectively, where $S_{1A}$ and $S_{2A}$ are the sets of leaves of $T_{A1}$ and $T_{A2}$, respectively, and $S_{1B}$ and $S_{2B}$ are defined similarly for $T_B$. Let $C_u$ and $C_{dA}$ be the costs of the trees $T_u$ and $T_{dA}$, respectively.

$C_u = C(T_A) + C(T_B) + \sum_{i=1}^{n} a_i + \sum_{i=1}^{m} b_i$

and

$C_{dA} = C(T_{A1}) + C(T_{A2}) + C(T_B) + 2 \sum_{i=1}^{m} b_i + \sum_{i=1}^{n} a_i$

But

$C(T_A) = C(T_{A1}) + C(T_{A2}) + \sum_{i=1}^{n} a_i$

Hence

$C_u - C_{dA} = \sum_{i=1}^{n} a_i - \sum_{i=1}^{m} b_i$

which is less than or equal to zero if and only if

$\sum_{i=1}^{n} a_i \le \sum_{i=1}^{m} b_i$

Similarly,

$\sum_{i=1}^{m} b_i \le \sum_{i=1}^{n} a_i$

implies

$C_u - C_{dB} \le 0$

where $C_{dB}$ is the cost of the tree $T_{dB}$. In other words, $C_u = C_{dB} = C_{dA}$ when

$\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i$    (3-4)

We need to show that because of Eq. (3-4), $C_u$ is no greater than the cost of any other tree which realizes the truth function determined by the Boolean expression $(A_1+A_2+\cdots+A_n) \cdot (B_1+\cdots+B_m)$.
Let $T_k$ shown in Fig. 3-5 be such a tree. Again, because of Lemma 3-2, $T_{A1}, T_{A2}, \ldots, T_{Aj}$ are j Huffman subtrees of $T_A$ and $T_{B1}, T_{B2}, \ldots, T_{Bk}$ are k Huffman subtrees of $T_B$. Let $S_{Ap}$ be the set of leaves of $T_{Ap}$ (p = 1, 2, ..., j) and $S_{Bq}$ be the set of leaves of $T_{Bq}$ (q = 1, 2, ..., k). We note that the Boolean expression corresponding to the tree $T_k$ is

$\sum_{p=1}^{j} \sum_{q=1}^{k} (\sum_{A_i \in S_{Ap}} A_i) \cdot (\sum_{B_i \in S_{Bq}} B_i)$    (3-5)

Let $T_{Bq1}$ and $T_{Bq2}$ be two Huffman subtrees of $T_{Bq}$ with $S_{Bq1}$ and $S_{Bq2}$ being the sets of leaves, respectively. Let $T_{k+1}(1; r)$ denote a tree corresponding to the Boolean expression

$\sum_{p=1}^{j} \sum_{q=2}^{k} (\sum_{A_i \in S_{Ap}} A_i) \cdot (\sum_{B_i \in S_{Bq}} B_i) + \sum_{p=r+1}^{j} (\sum_{A_i \in S_{Ap}} A_i) \cdot (\sum_{B_i \in S_{B1}} B_i) + \sum_{p=1}^{r} (\sum_{A_i \in S_{Ap}} A_i) \cdot (\sum_{B_i \in S_{B11}} B_i) + \sum_{p=1}^{r} (\sum_{A_i \in S_{Ap}} A_i) \cdot (\sum_{B_i \in S_{B12}} B_i)$

obtained by further distributing the · operation with respect to the + operation in the sum term $\sum_{B_i \in S_{B1}} B_i$ in (3-5). We note that for any $1 \le r < j$,

$C(T_{k+1}(1; r)) \ge C(T_k)$

This inequality follows from

$C(T_k) = \sum_{p=1}^{j} C(T_{Ap}) + \sum_{q=1}^{k} C(T_{Bq}) + k \sum_{i=1}^{n} a_i + j \sum_{i=1}^{m} b_i$

while

$C(T_{k+1}(1; r)) = \sum_{p=1}^{j} C(T_{Ap}) + \sum_{q=1}^{k} C(T_{Bq}) + k \sum_{i=1}^{n} a_i + j \sum_{i=1}^{m} b_i + \sum_{p=1}^{r} \sum_{A_i \in S_{Ap}} a_i$

When r = j, we have

$C(T_{k+1}(1; j)) = \sum_{p=1}^{j} C(T_{Ap}) + \sum_{q=2}^{k} C(T_{Bq}) + C(T_{B11}) + C(T_{B12}) + (k+1) \sum_{i=1}^{n} a_i + j \sum_{i=1}^{m} b_i$

But

$C(T_{B1}) = C(T_{B11}) + C(T_{B12}) + \sum_{B_i \in S_{B1}} b_i$

Hence

$C(T_{k+1}(1; j)) - C(T_k) = \sum_{i=1}^{n} a_i - \sum_{B_i \in S_{B1}} b_i$    (3-6)

which is larger than or equal to zero when Eq. (3-4) is valid. In general, let $T_{k+1}(t; j)$ be the tree corresponding to the Boolean expression

$\sum_{p=1}^{j} \left[ \sum_{q=t+1}^{k} (\sum_{A_i \in S_{Ap}} A_i) \cdot (\sum_{B_i \in S_{Bq}} B_i) + \sum_{q=1}^{t} \left( (\sum_{A_i \in S_{Ap}} A_i) \cdot (\sum_{B_i \in S_{Bq1}} B_i) + (\sum_{A_i \in S_{Ap}} A_i) \cdot (\sum_{B_i \in S_{Bq2}} B_i) \right) \right]$

for $1 \le t \le k$. [...]

It follows from Eq. (3-6) that

Corollary 3-1. When $\sum_{i=1}^{n} a_i < \sum_{j=1}^{m} b_j$, an optimal tree corresponding to the Boolean expression

$\sum_{j=1}^{k} (A_1+A_2+\cdots+A_n) \cdot (\sum_{B_i \in S_{Bj}} B_i)$

has the minimum cost among all trees that realize the truth function determined by $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$.

Similarly, for the case of $\sum_{j=1}^{m} b_j < \sum_{i=1}^{n} a_i$, let $S_{Aj}$, $1 \le j \le k$, be the sets of leaves of k Huffman subtrees of $T_A$ such that their union is $\{A_1, A_2, \ldots, A_n\}$, and

$\sum_{A_i \in S_{Aj}} a_i \le \sum_{i=1}^{m} b_i$,    j = 1, 2, ..., k

but

$\sum_{A_i \in S_{Aj} \cup S_{Aj'}} a_i > \sum_{i=1}^{m} b_i$,    j, j' = 1, 2, ..., k and $j \ne j'$.

We have

Corollary 3-2. When $\sum_{i=1}^{n} a_i > \sum_{j=1}^{m} b_j$, an optimal tree corresponding to the Boolean expression

$\sum_{j=1}^{k} (B_1+B_2+\cdots+B_m) \cdot (\sum_{A_i \in S_{Aj}} A_i)$

has the minimum cost among all trees that realize the truth function determined by $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$.

An algorithm to determine an optimal tree in this case is

Algorithm B

1. If $\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i$, $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$ is an optimal expression. An optimal tree is $T_u$ shown in Fig. 3-6a where $T_A$ and $T_B$ are Huffman trees of $A_1, A_2, \ldots, A_n$ and $B_1, B_2, \ldots, B_m$, respectively.

2. If $\sum_{i=1}^{n} a_i < \sum_{i=1}^{m} b_i$,

   a. Choose the Boolean expression

   $(A_1+A_2+\cdots+A_n) \cdot (\sum_{B_i \in S_{B1}} B_i) + (A_1+A_2+\cdots+A_n) \cdot (\sum_{B_i \in S_{B2}} B_i)$

   where $S_{B1}$ and $S_{B2}$ are the sets of leaves of the two Huffman subtrees of $T_B$. The corresponding merge tree is $T_{dB}$ in Fig. 3-6c.

   b. For each of the terms $(A_1+A_2+\cdots+A_n) \cdot (\sum_{B_i \in S_{Bj}} B_i)$, if $\sum_{i=1}^{n} a_i < \sum_{B_i \in S_{Bj}} b_i$, distribute · with respect to the sum of the $B_i$'s such that the $B_i$'s in each of the sum terms are elements of sets of leaves of smaller Huffman subtrees.

   c. The process in step 2b terminates either when $\sum_{i=1}^{n} a_i \ge \sum_{B_i \in S_{Bj}} b_i$ or when we obtain terms of the form $(\sum_{i=1}^{n} A_i) \cdot B_j$.

3. If $\sum_{i=1}^{n} a_i > \sum_{i=1}^{m} b_i$,

   a. Choose the Boolean expression

   $(B_1+B_2+\cdots+B_m) \cdot (\sum_{A_i \in S_{A1}} A_i) + (B_1+B_2+\cdots+B_m) \cdot (\sum_{A_i \in S_{A2}} A_i)$

   where $S_{A1}$ and $S_{A2}$ are the sets of leaves of the Huffman subtrees of $T_A$.
   The corresponding merge tree is $T_{dA}$ in Fig. 3-6b.

   b. For each of the terms $(B_1+B_2+\cdots+B_m) \cdot (\sum_{A_i \in S_{Aj}} A_i)$, if $\sum_{i=1}^{m} b_i < \sum_{A_i \in S_{Aj}} a_i$, distribute · with respect to the sum of the $A_i$'s such that the $A_i$'s in each of the sum terms are elements of sets of leaves of smaller Huffman subtrees.

   c. The process in step 3b terminates either when $\sum_{i=1}^{m} b_i \ge \sum_{A_i \in S_{Aj}} a_i$ or when we obtain terms of the form $(\sum_{i=1}^{m} B_i) \cdot A_j$.

We illustrate Algorithm B by an example. Consider the expression $(A_1+A_2+A_3+A_4) \cdot (B_1+B_2)$. The lengths of the corresponding lists are $a_1 = 1$, $a_2 = 2$, $a_3 = 3$, $a_4 = 6$, $b_1 = 2$ and $b_2 = 2$. The Huffman trees for the lists $A_1$, $A_2$, $A_3$ and $A_4$ and for the lists $B_1$ and $B_2$ are shown in Fig. 3-7a. Since $a_1+a_2+a_3+a_4 > b_1+b_2$, the cost of the merge tree $T_1$ in Fig. 3-7b corresponding to the Boolean expression

$F_1 = (B_1+B_2) \cdot (A_1+A_2+A_3) + (B_1+B_2) \cdot A_4$

is less than the cost of the tree $T((A_1+A_2+A_3+A_4) \cdot (B_1+B_2))$. Moreover, since $a_1+a_2+a_3 > b_1+b_2$, the cost of the tree $T_2$ in Fig. 3-7c is less than the cost of $T_1$. Indeed, $T_2$ is an optimal tree and the Boolean expression corresponding to it is

$F_2 = (B_1+B_2) \cdot (A_1+A_2) + (B_1+B_2) \cdot A_3 + (B_1+B_2) \cdot A_4$

($C(T((A_1+A_2+A_3+A_4) \cdot (B_1+B_2))) \ge 41$, $C(T_1) = 33$, $C(T_2) = 31$.)

[Figure 3-7: (a) Huffman trees $T_A$ and $T_B$; (b) $T_1$; (c) $T_2$]

Further Generalizations

To find an optimal merge tree when the Boolean expression specified in the query is a product of sum terms

$P_m = \prod_{j=1}^{m} (A_{j1}+A_{j2}+\cdots+A_{jn_j})$    (3-7)

and the $A_{ji}$ are all distinct, let C(m) denote the cost of the merge tree $T(P_m)$. Suppose that we first complete all those merges corresponding to the Boolean expression

$P_{m-1} = \prod_{j=1}^{m-1} (A_{j1}+A_{j2}+\cdots+A_{jn_j})$

Since the lengths of overlaps between lists are assumed to be negligibly small, it follows from Theorem 3-1 that the cost

$C(m) = \sum_{i=1}^{n_m} a_{mi} + C(m-1)$

is minimum in this case and the corresponding Boolean expression is

$P_m = A_{m1} \cdot P_{m-1} + A_{m2} \cdot P_{m-1} + \cdots + A_{mn_m} \cdot P_{m-1}$

Similarly, we have

$C(m) = \sum_{j=3}^{m} \sum_{i=1}^{n_j} a_{ji} + C(T((A_{11}+A_{12}+\cdots+A_{1n_1}) \cdot (A_{21}+A_{22}+\cdots+A_{2n_2})))$

Suppose that the indices are chosen so that [...]

[...] $C(T_A) + \sum_{i=1}^{n} a_i + 2b = C_u + b$

Similarly, let $T_{Ai}$ for i = 1, 2, ..., m be m subtrees of $T_A$ and $S_{Ai}$ be the sets of their leaves. Then

$C_{dm} = C(T_A) + mb + \sum_{i=1}^{n} a_i \ge C_u$

We further observe that the cost of the merge tree corresponding to the Boolean expression $B \cdot \overline{A_1+A_2+\cdots+A_n}$ is equal to that of $B \cdot (A_1+A_2+\cdots+A_n)$. Again, let $T_{A1}$ and $T_{A2}$ be two Huffman subtrees of $T_0(A_1+A_2+\cdots+A_n)$ with their sets of leaves being $S_{A1}$ and $S_{A2}$, respectively. The cost of the merge tree corresponding to the Boolean expression*

$B \cdot \overline{\sum_{A_i \in S_{A1}} A_i} \cdot \overline{\sum_{A_i \in S_{A2}} A_i}$

is equal to

$C(T_{A1}) + C(T_{A2}) + \sum_{i=1}^{n} a_i + 2b$

the same as the cost of the tree corresponding to

$B \cdot (\sum_{A_i \in S_{A1}} A_i) + B \cdot (\sum_{A_i \in S_{A2}} A_i)$

* We note that in response to a query of the form $B \cdot \bar{F}_1 \cdot \bar{F}_2$, the system will parenthesize the expression as $(B \cdot \bar{F}_1) \cdot \bar{F}_2$, or rewrite it as $B \cdot \overline{F_1+F_2}$.

Hence we can use the result in Theorem 3-1 and obtain

Corollary 3-5. When $b \ge \sum_{i=1}^{n} a_i$, $B \cdot \overline{A_1+A_2+\cdots+A_n}$ is optimal among all Boolean expressions equivalent to it.

Algorithm A can be modified to determine an optimal equivalent form when $b < \sum_{i=1}^{n} a_i$.

Algorithm A1

1. If $b \ge \sum_{i=1}^{n} a_i$, $B \cdot \overline{A_1+A_2+\cdots+A_n}$ is optimal among all Boolean expressions equivalent to it and an optimal merge tree is $T_u$ as shown in Fig. 3-9a. Otherwise, we have

2. a. $b < \sum_{i=1}^{n} a_i$. Choose the Boolean expression

   $B \cdot \overline{\sum_{A_i \in S_{A1}} A_i} \cdot \overline{\sum_{A_i \in S_{A2}} A_i}$

   and the corresponding tree is $T_d$ shown in Fig. 3-9b.

   b. An optimal tree can be obtained by repeating steps 1 and/or 2a for $B \cdot \overline{\sum_{A_i \in S_{A1}} A_i}$ and then for $(B \cdot \overline{\sum_{A_i \in S_{A1}} A_i}) \cdot \overline{\sum_{A_i \in S_{A2}} A_i}$.
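Algorithm A1 follows the same splitting rule as Algorithm A. As an illustration, the recursion of Algorithm A for the uncomplemented form $(A_1+A_2+\cdots+A_n) \cdot B$ can be sketched as follows, assuming negligible overlaps. The function names (`huffman`, `leaves`, `or_cost`, `algorithm_a`) and the representation of lists by their indices are assumptions of this sketch, not part of the report.

```python
import heapq
from itertools import count


def huffman(lengths):
    """Huffman tree as nested pairs of indices into `lengths`."""
    tie = count()
    heap = [(n, next(tie), i) for i, n in enumerate(lengths)]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tie), (t1, t2)))
    return heap[0][2]


def leaves(tree):
    return [tree] if isinstance(tree, int) else leaves(tree[0]) + leaves(tree[1])


def or_cost(tree, lengths):
    """(length, cost) of OR merging the leaves in the order given by `tree`."""
    if isinstance(tree, int):
        return lengths[tree], 0
    l_len, l_cost = or_cost(tree[0], lengths)
    r_len, r_cost = or_cost(tree[1], lengths)
    return l_len + r_len, l_cost + r_cost + l_len + r_len


def algorithm_a(a, b):
    """Return (groups, cost): the sum terms AND merged with B, and the total
    merge time. The final OR merges of the AND results are taken to cost 0."""
    groups, cost = [], 0
    stack = [huffman(a)]
    while stack:
        tree = stack.pop()
        total, c = or_cost(tree, a)
        if isinstance(tree, int) or total <= b:      # step 1a: keep this sum term
            groups.append(sorted(leaves(tree)))
            cost += c + total + b                    # build the sum, AND merge with B
        else:                                        # step 1b: split at the root
            stack.extend([tree[1], tree[0]])
    return groups, cost


groups, cost = algorithm_a([1, 2, 5, 10], 5)
```

With the lengths of the example in Fig. 3-4 (b = 5 and list lengths 1, 2, 5, 10), the sketch returns the groups $\{A_1, A_2\}$, $\{A_3\}$, $\{A_4\}$ (as zero-based indices) and a cost of 36, in agreement with $C(T_d') = 36$.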
[Figure 3-9: (a) $T_u$; (b) $T_d$]

Similarly, from Theorem 3-2, we have

Corollary 3-6. When $\sum_{i=1}^{m} b_i = \sum_{i=1}^{n} a_i$, an optimal form of the Boolean expression $(B_1+B_2+\cdots+B_m) \cdot \overline{A_1+A_2+\cdots+A_n}$ is $(B_1+B_2+\cdots+B_m) \cdot \overline{A_1+A_2+\cdots+A_n}$.

Algorithm B1 determines an optimal merge tree when $\sum_{i=1}^{m} b_i \ne \sum_{i=1}^{n} a_i$.

Algorithm B1

1. If $\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i$, the optimal merge tree is $T_u$ in Fig. 3-10a where $T_A$ and $T_B$ are Huffman trees of $A_1, A_2, \ldots, A_n$ and $B_1, B_2, \ldots, B_m$. Otherwise, we have

2. If $\sum_{i=1}^{n} a_i > \sum_{i=1}^{m} b_i$,

   a. Choose the expression

   $(\sum_{A_i \in S_{A1}} A_i) \cdot \overline{B_1+B_2+\cdots+B_m} + (\sum_{A_i \in S_{A2}} A_i) \cdot \overline{B_1+B_2+\cdots+B_m}$

   The corresponding merge tree is $T_{dA}$ in Fig. 3-10b.

   b. For each of the terms $(\sum_{A_i \in S_{Aj}} A_i) \cdot \overline{B_1+B_2+\cdots+B_m}$, if $\sum_{A_i \in S_{Aj}} a_i > \sum_{i=1}^{m} b_i$, rewrite the term as

   $(\sum_{A_i \in S_{Aj1}} A_i) \cdot \overline{B_1+B_2+\cdots+B_m} + (\sum_{A_i \in S_{Aj2}} A_i) \cdot \overline{B_1+B_2+\cdots+B_m}$

   where $S_{Aj1}$ and $S_{Aj2}$ are the sets of leaves of the two Huffman subtrees of $T_{Aj}$.

   c. Repeat step 2b until either $\sum_{A_i \in S_{Aj}} a_i \le \sum_{i=1}^{m} b_i$ or the term becomes $A_j \cdot \overline{B_1+B_2+\cdots+B_m}$.

3. If $\sum_{i=1}^{n} a_i < \sum_{i=1}^{m} b_i$,

   a. Choose the Boolean expression

   $(A_1+A_2+\cdots+A_n) \cdot \overline{\sum_{B_i \in S_{B1}} B_i} \cdot \overline{\sum_{B_i \in S_{B2}} B_i}$

   and the corresponding merge tree is $T_{dB}$ as shown in Fig. 3-10c.

   b. For each of the terms $\overline{\sum_{B_i \in S_{Bj}} B_i}$, if $\sum_{i=1}^{n} a_i < \sum_{B_i \in S_{Bj}} b_i$, rewrite the term as

   $\overline{\sum_{B_i \in S_{Bj1}} B_i} \cdot \overline{\sum_{B_i \in S_{Bj2}} B_i}$

   where $S_{Bj1}$ and $S_{Bj2}$ are the sets of leaves of the Huffman subtrees of $T_{Bj}$.

   c. Repeat step 3b until either $\sum_{i=1}^{n} a_i \ge \sum_{B_i \in S_{Bj}} b_i$ or $S_{Bj}$ contains one single term.

[Figure 3-10: (a) $T_u$; (b) $T_{dA}$; (c) $T_{dB}$]

Furthermore, we have

Corollary 3-7. The cost of an optimal merge tree corresponding to the expression

$\prod_{i=1}^{m} (A_{i1}+A_{i2}+\cdots+A_{in_i}) \cdot \prod_{j=1}^{k} \overline{(B_{j1}+B_{j2}+\cdots+B_{j\ell_j})}$

is

$C_0(m) + \sum_{j=1}^{k} \sum_{i=1}^{\ell_j} b_{ji}$

where $C_0(m)$ is the cost for merging the lists $A_{ij}$ in Eq. (3-8).
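The splitting rule shared by Algorithms B and B1 can be sketched as below for the uncomplemented form $(A_1+\cdots+A_n) \cdot (B_1+\cdots+B_m)$, assuming negligible overlaps. The side with the larger total length is split along its Huffman subtrees while the other sum is built once and reused. All function names and the index-based representation are assumptions of this sketch.

```python
import heapq
from itertools import count


def huffman(lengths):
    """Huffman tree as nested pairs of indices into `lengths`."""
    tie = count()
    heap = [(n, next(tie), i) for i, n in enumerate(lengths)]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tie), (t1, t2)))
    return heap[0][2]


def leaves(tree):
    return [tree] if isinstance(tree, int) else leaves(tree[0]) + leaves(tree[1])


def or_cost(tree, lengths):
    """(length, cost) of OR merging the leaves in the order given by `tree`."""
    if isinstance(tree, int):
        return lengths[tree], 0
    l_len, l_cost = or_cost(tree[0], lengths)
    r_len, r_cost = or_cost(tree[1], lengths)
    return l_len + r_len, l_cost + r_cost + l_len + r_len


def algorithm_b(a, b):
    """Return (split_side, groups, cost) for (A_1+...+A_n).(B_1+...+B_m)."""
    side, split, fixed = ("A", a, b) if sum(a) > sum(b) else ("B", b, a)
    fixed_total, fixed_cost = or_cost(huffman(fixed), fixed)
    groups, cost = [], fixed_cost        # the unsplit sum is built once and reused
    stack = [huffman(split)]
    while stack:
        tree = stack.pop()
        total, c = or_cost(tree, split)
        if isinstance(tree, int) or total <= fixed_total:
            groups.append(sorted(leaves(tree)))
            cost += c + total + fixed_total   # build the group, AND merge it
        else:
            stack.extend([tree[1], tree[0]])  # distribute over the two subtrees
    return side, groups, cost


side, groups, cost = algorithm_b([1, 2, 3, 6], [2, 2])
```

For the example of Fig. 3-7 (list lengths 1, 2, 3, 6 and 2, 2) the sketch splits the A side into $\{A_1, A_2\}$, $\{A_3\}$, $\{A_4\}$ and reports a cost of 31, matching $C(T_2) = 31$. When the two totals are equal, nothing is split and the cost reduces to $C_u$.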
IV. Bounds on Suboptimal Parsing Algorithms

When the lengths of overlaps cannot be neglected in the computation of merge times, the algorithms described in Section III are no longer optimal. In this section, we derive bounds on the performance of these algorithms in terms of the maximum overlap between lists.

Again, let $\sigma(A,B)$ denote the length of overlap between lists A and B. The length of the resultant list obtained by OR merging lists A and B is equal to $a+b-\sigma(A,B)$. The lengths of the resultant lists are $\sigma(A,B)$ and $b-\sigma(A,B)$ when the lists A and B are AND merged and AND NOT merged, respectively.

Consider a set of lists $S = \{A_1, A_2, \ldots, A_n\}$. We say that the maximum length of overlap between them is $\sigma$ if $\sigma(A_i, A_j) \le \sigma$ for all $A_i$ and $A_j$ in S. Moreover, let $S_1$ and $S_2$ be any two disjoint subsets of S, and $R_1$ and $R_2$ be the two lists obtained by OR merging the lists in $S_1$ and $S_2$, respectively. Then $\sigma(R_1, R_2) \le \sigma$. Hence, $\sigma$ is an absolute measure of the maximum length of overlap. It is a meaningful measure when the lists in S are of comparable lengths and their intersections are relatively small compared to their lengths.

In practice, the lengths of the inverted lists may differ by several orders of magnitude.* The length of overlap between any pair of lists is often measured in terms of a fraction of the length of the shorter list. Let $\phi(A_i, A_j)$ denote the fraction such that

$\sigma(A_i, A_j) = \phi(A_i, A_j) \min[a_i, a_j]$

for any $A_i$, $A_j$ in $S = \{A_1, A_2, \ldots, A_n\}$. Let $\phi$ denote the maximum overlap for the set S. That is,

$\sigma(A_i, A_j) \le \phi \min[a_i, a_j]$

and

$\sigma(R_1, R_2) \le \phi \min[r_1, r_2]$

where $r_1$ and $r_2$ are the lengths of $R_1$ and $R_2$, respectively. With slight abuse of the term, we also call $\phi$ the maximum length of overlap.

* For example, from the MEDLARS Master MeSH, we found that the length of the list corresponding to the index term HUMAN is 493,599 while that corresponding to LUROVIN is only 4.

Again let C(T) denote the cost of a merge tree $T(A_1+A_2+\cdots+A_n)$.
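The two overlap measures can be computed directly from the lists. The sketch below is a simplified illustration: it considers pairwise overlaps only, whereas the definitions above also require the bound to hold between OR-merged disjoint subsets. The function names and the sample lists are hypothetical.

```python
from itertools import combinations


def overlap(a, b):
    """sigma(A, B): number of pointers common to both lists."""
    return len(set(a) & set(b))


def max_pairwise_overlaps(lists):
    """Return (sigma, phi): the largest pairwise overlap, as an absolute
    length and as a fraction of the shorter list of each pair."""
    pairs = list(combinations(lists, 2))
    sigma = max(overlap(a, b) for a, b in pairs)
    phi = max(overlap(a, b) / min(len(a), len(b)) for a, b in pairs)
    return sigma, phi


# Hypothetical inverted lists of document numbers.
a1 = [1, 2, 3, 4]
a2 = [3, 4, 5, 6, 7, 8]
a3 = [8, 9]
sigma, phi = max_pairwise_overlaps([a1, a2, a3])
```

Here the largest absolute overlap is 2 (between the first two lists), while the fractional measure is 0.5 both for that pair and for the short third list, which shares one of its two entries with the second list.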
Since $C(T)$ is the total merge time of $A_1, A_2, \ldots, A_n$ when their merge order is specified by $T$, $C(T)$ depends on the lengths of $A_1, A_2, \ldots, A_n$ as well as on the overlaps between them. Let $P(T)$ denote the weighted path length of the tree $T$. As discussed in Section III, $P(T)$ is the cost of the tree $T$ when the corresponding Boolean expression is $A_1 + A_2 + \cdots + A_n$ and the lengths of overlap between the lists are zero.

Bounds on Cost of Huffman Tree

Let $T_O(A_1 + A_2 + \cdots + A_n)$ be an optimal merge tree for the lists $A_1, A_2, \ldots, A_n$ corresponding to the Boolean expression $A_1 + A_2 + \cdots + A_n$. As demonstrated by the example in Fig. 4-1, $T_O$ cannot be obtained using Huffman's procedure in general. Let $T_H$ denote the Huffman tree for $A_1, A_2, \ldots, A_n$. We now bound the cost of the Huffman tree $T_H$.

Lemma 4-1. Let $S = \{A_1, A_2, \ldots, A_k\}$ be a set of lists and $R_k$ be the list obtained by OR merging all the $A_i$ in $S$. The length of the list $R_k$, $r_k$, is such that

$$r_k \ge \sum_{i=1}^{k} a_i - (k-1)\,\sigma$$

where $\sigma$ is the maximum overlap between the lists in $S$. Moreover, the bound is tight.

[Figure 4-1: (a) $T_H$, Huffman tree; (b) $T_O$, optimal tree, with $C(T_O) = 13$. Overlaps: $\sigma(A_1, A_2) = 3$, $\sigma(A_1, A_3) = 0$, $\sigma(A_2, A_3) = 1$.]

Proof: Let us consider any list $A_i$ in the set $S$. (Throughout our discussion, by a list we mean a sorted list of distinct elements. Hence, we may also regard a list as a set whenever the order in which the elements appear in it is irrelevant.) Without loss of generality, suppose that $A_i \cap A_{i_1}, A_i \cap A_{i_2}, \ldots, A_i \cap A_{i_k}$ are nonempty, where $\cap$ denotes set intersection. By definition of $\sigma$, the total number of elements in $A_i \cap A_{i_1}, A_i \cap A_{i_2}, \ldots, A_i \cap A_{i_k}$ is at most $\sigma$. Let

$$I(A_1) = \emptyset$$

where $\emptyset$ is the null set, and

$$I(A_i) = A_i \cap (A_1 \cup A_2 \cup \cdots \cup A_{i-1}), \qquad i = 2, 3, \ldots, k.$$

We note that the lists

$$A_1' = A_1 - I(A_1),\quad A_2' = A_2 - I(A_2),\quad \ldots,\quad A_k' = A_k - I(A_k)$$

are disjoint. Moreover, their lengths $a_1', a_2', \ldots, a_k'$ are such that

$$a_1' = a_1, \qquad a_i' \ge a_i - \sigma, \quad i = 2, 3, \ldots, k.$$

Since the list corresponding to $A_1' + A_2' + \cdots + A_k'$ is $R_k$, we have

$$r_k = \sum_{i=1}^{k} a_i' \ge \sum_{i=1}^{k} a_i - (k-1)\,\sigma.$$

That the bound is tight can be demonstrated by the example

$$A_1 = (\alpha, \beta, \gamma, x_1, \ldots, x_m),\quad A_2 = (\alpha, \beta, \gamma, y_1, \ldots, y_n),\quad \ldots,\quad A_k = (\alpha, \beta, \gamma, z_1, \ldots, z_\ell)$$

where the $x_i$, $y_i$, $z_i$ are all distinct. Here $\sigma = 3$ and the list $R_k$ is $(\alpha, \beta, \gamma, x_1, \ldots, x_m, y_1, \ldots, y_n, \ldots, z_1, \ldots, z_\ell)$ with length $\sum_{i=1}^{k} a_i - 3(k-1)$. ■

In terms of $\sigma$, we have the following bound on the weighted path length $P(T_H)$ of a Huffman tree for a set of lists $S = \{A_1, A_2, \ldots, A_n\}$.

Theorem 4-1.

$$P(T_H) \le C(T) + \tfrac{1}{2}(n^2 + n - 1)\,\frac{n-1}{n}\,\sigma \qquad (4\text{-}1)$$

where $\sigma$ is the maximum overlap for the lists $A_1, A_2, \ldots, A_n$ and $C(T)$ is the cost of any tree $T$ for these lists corresponding to the Boolean expression $A_1 + A_2 + \cdots + A_n$.

Proof: Let $T$ be an arbitrary tree for the set of lists $\{A_1, A_2, \ldots, A_n\}$ corresponding to the expression $A_1 + A_2 + \cdots + A_n$. Suppose that in $T$ the leaf $A_i$ is at level $\ell_i$, $i = 1, 2, \ldots, n$. Let $T'$ be the tree obtained from $T$ when the weight of $A_i$ is replaced by $a_i - \frac{n-1}{n}\sigma$. We claim that

$$P(T') \le C(T) \qquad (4\text{-}2)$$

That is,

$$\sum_{i=1}^{n} \ell_i \left(a_i - \frac{n-1}{n}\,\sigma\right) \le C(T).$$

Since a Huffman tree has the minimum weighted path length and $\sum_{i=1}^{n} \ell_i \le \tfrac{1}{2}(n^2 + n - 1)$, we have

$$P(T_H) - \tfrac{1}{2}(n^2 + n - 1)\,\frac{n-1}{n}\,\sigma \le P(T') \le C(T).$$

To establish (4-2), consider any leaf $A$ in $\{A_1, A_2, \ldots, A_n\}$. Suppose that in the tree $T$, $A$ is at level $\ell$ as shown in Fig. 4-2, where $T_1, T_2, \ldots, T_{\ell-1}, T_\ell$ are subtrees of $T$. Let $S_1, S_2, \ldots, S_\ell$ be the sets of leaves and $R_1, R_2, \ldots, R_\ell$ be the lists corresponding to the roots of $T_1, T_2, \ldots, T_\ell$, respectively.

[Figure 4-2: The tree $T$ with the leaf $A$ at level $\ell$ and the subtrees $T_1, T_2, \ldots, T_\ell$.]

We note that the weight of the node $j$ in $T'$, $t_j'$, is

$$t_j' = a + \sum_{k=j+1}^{\ell} \sum_{A_i \in S_k} a_i - \frac{n-1}{n}\,\sigma \left(\sum_{k=j+1}^{\ell} |S_k| + 1\right).$$

On the other hand, from Lemma 4-1, the length of the list $R_{j+1}$, $r_{j+1}$, is such that

$$r_{j+1} \ge \sum_{A_i \in S_{j+1}} a_i - (|S_{j+1}| - 1)\,\sigma$$

and the length of the list corresponding to the node $(j+1)$ is lower bounded by

$$a + \sum_{k=j+2}^{\ell} \left(\sum_{A_i \in S_k} a_i - \sigma\,|S_k|\right).$$

Hence the merge time required to generate the list corresponding to the node $j$ in the tree $T$, $t_j$, is such that

$$t_j \ge a + \sum_{k=j+1}^{\ell} \sum_{A_i \in S_k} a_i - \left(\sum_{k=j+1}^{\ell} |S_k| - 1\right)\sigma.$$

Since

$$\frac{n-1}{n}\left(\sum_{k=j+1}^{\ell} |S_k| + 1\right) \ge \sum_{k=j+1}^{\ell} |S_k| - 1$$

for any $n \ge 1$, we have $t_j' \le t_j$ for all $j = 0, 1, \ldots, \ell - 1$, and the inequality in (4-2) follows. (The inequality $\frac{n-1}{n}(x+1) \ge x - 1$ is equivalent to $2n \ge x + 1$. Since we have $x \le n$ and $n \ge 1$, the inequality is always valid.) ■

It follows from Theorem 4-1 that the cost of a Huffman tree is upper bounded as follows.

Corollary 4-1.

$$C(T_H) \le C(T_O) + \tfrac{1}{2}(n^2 + n - 1)\,\frac{n-1}{n}\,\sigma \qquad (4\text{-}4)$$

where the tree $T_O$ is optimal.

Although the bound in (4-4) is not tight, it does indicate that in most cases of practical interest, the cost of the Huffman tree does not differ substantially from that of an optimal tree for processing a query of the form $A_1 + A_2 + \cdots + A_n$. For example, when the lengths of the lists are approximately equal,

$$C(T_O) \approx \ell\,(1 - \gamma)\, n \log_2 n$$

where $\ell$ is the approximate list length. Since $\tfrac{1}{2}(n^2 + n - 1)\frac{n-1}{n}\sigma$ is approximately equal to $\tfrac{1}{2} n^2 \sigma$ for large $n$, its value becomes comparable to $C(T_O)$ only when

$$\tfrac{1}{2}\, n^2 \sigma \approx \ell\,(1 - \gamma)\, n \log_2 n.$$

(For $n = 64$, $\gamma \approx 0.20$; for $n = 8$, $\gamma \approx 0.43$. That is, only for very large overlaps.)

Similarly, we bound the cost of a Huffman tree in terms of the maximum overlap $\phi$.

Lemma 4-2.

$$r_k \ge \left(1 - \frac{\phi}{k}\right) \sum_{i=1}^{k} a_i$$

Again, $r_k$ is the length of the list corresponding to the expression $A_1 + A_2 + \cdots + A_k$.

Proof: Let $A_k$ be the shortest list among $A_1, A_2, \ldots, A_k$ and $R_{k-1}$ be the list corresponding to the expression $A_1 + A_2 + \cdots + A_{k-1}$. The length of $R_{k-1}$, $r_{k-1}$, is clearly larger than the length of any particular list $A_1, A_2, \ldots,$ or $A_{k-1}$. Hence

$$r_k \ge \sum_{i=1}^{k} a_i - \phi\, a_k \ge \sum_{i=1}^{k} a_i \left(1 - \frac{\phi}{k}\right).$$

Hence, we obtain the following bound for the cost of a Huffman tree for the lists $A_1, A_2, \ldots, A_n$.

Theorem 4-2.

$$C(T_H) \le P(T_H) \le \frac{1}{1 - \frac{\phi}{2}}\, C(T) \qquad (4\text{-}5)$$

where $T$ is any arbitrary tree for $A_1 + A_2 + \cdots + A_n$.

Proof: The proof is similar to that of Theorem 4-1.
Suppose that $T'$ is a tree obtained from $T$ when the weight of $A_i$ is replaced by $a_i \left(1 - \frac{\phi}{2}\right)$. We claim that

$$P(T') \le C(T) \qquad (4\text{-}6)$$

and hence, the inequalities in (4-5) follow.

To show that the inequality (4-6) is valid, we again look at the tree in Fig. 4-2. The weight of the node $j$ in the tree $T'$, $t_j'$, is

$$t_j' = \left(a + \sum_{k=j+1}^{\ell} \sum_{A_i \in S_k} a_i\right)\left(1 - \tfrac{1}{2}\phi\right).$$

On the other hand, the length of the list $R_{j+1}$, $r_{j+1}$, is such that

$$r_{j+1} \ge \sum_{A_i \in S_{j+1}} a_i \left(1 - \frac{\phi}{|S_{j+1}|}\right)$$

and the length of the list corresponding to the node $(j+1)$ is lower bounded by

$$\left(a + \sum_{k=j+2}^{\ell} \sum_{A_i \in S_k} a_i\right)\left(1 - \frac{\phi}{1 + \sum_{k=j+2}^{\ell} |S_k|}\right).$$

Hence, the merge time required to generate the list corresponding to the node $j$, $t_j$, is such that

$$t_j \ge \left(a + \sum_{k=j+1}^{\ell} \sum_{A_i \in S_k} a_i\right) \min\left[\,1 - \frac{\phi}{|S_{j+1}|},\; 1 - \frac{\phi}{1 + \sum_{k=j+2}^{\ell} |S_k|}\right].$$

The inequality in (4-6) follows immediately from this expression. ■

We again note that the bound in (4-5) is not tight. However, we can conclude from it that for most cases of practical interest ($\phi < 0.1$) the total merge time is quite close to the minimum when the merge order is specified by a Huffman tree.

Performance of Suboptimal Algorithms

Clearly, when the length of overlap is not negligible, the algorithms described in Section III are no longer optimal. However, we have the following special cases:

Theorem 4-3. When $\sum_{i=1}^{n} a_i \le b$, $(A_1 + A_2 + \cdots + A_n) \cdot B$ is optimal among all Boolean expressions equivalent to it. Hence, an optimal merge tree is $T_1$ as shown in Fig. 4-3a when $T$ is an optimal merge tree for $(A_1 + A_2 + \cdots + A_n)$.

Proof: Let $C_1(T)$ denote the cost of the tree $T_1$ in Fig. 4-3a. Suppose that $T(A_1 + A_2 + \cdots + A_n)$ is an arbitrary tree for the lists $A_1, A_2, \ldots, A_n$. Let $R$ be the list corresponding to the Boolean expression $A_1 + A_2 + \cdots + A_n$ and $r$ be its length. Clearly,

$$C_1(T) = C(T) + r + b.$$

We need to show that $C_1(T)$ is less than or equal to the cost of any tree $T_{m+1}$, shown in Fig. 4-3b.
Let $R_1, R_2, \ldots, R_{m-1}, R_{m1}, R_{m2}$ be the lists corresponding to the roots of $T_1, T_2, \ldots, T_{m-1}, T_{m1}, T_{m2}$, which are the $m+1$ subtrees of $T(A_1 + A_2 + \cdots + A_n)$ obtained by removing the roots of larger subtrees of $T$. Let $r_1, r_2, \ldots, r_{m-1}, r_{m1}, r_{m2}$ be the lengths of these lists, respectively. Furthermore, let $D_1, D_2, \ldots, D_{m-1}, D_{m1}, D_{m2}$ be the lists obtained by AND merging $R_1, R_2, \ldots, R_{m-1}, R_{m1}, R_{m2}$, respectively, with the list $B$. The cost of the tree $T_{m+1}$, $C_{m+1}$, is

$$C_{m+1} = \sum_{i=1}^{m-1} [C(T_i) + r_i] + C(T_{m1}) + C(T_{m2}) + r_{m1} + r_{m2} + (m+1)\,b + C(T(D_1 + D_2 + \cdots + D_{m-1} + D_{m1} + D_{m2})).$$

[Figure 4-3: (a) the tree $T_1$; (b) the tree $T_{m+1}$.]

Let $T_m$ be a subtree of $T$ obtained by OR merging $R_{m1}$ and $R_{m2}$ together. We have

$$C(T_m) = \sum_{i=1}^{m} [C(T_i) + r_i] + m\,b + C(T(D_1 + D_2 + \cdots + D_m))$$

where $R_m$ is the root of $T_m$ and $D_m$ is the list obtained by AND merging $R_m$ with $B$. Since

$$C(T_m) = C(T_{m1}) + C(T_{m2}) + r_{m1} + r_{m2},$$

we have

$$C(T_{m+1}) - C(T_m) = b - (r_{m1} + r_{m2} - r_m) + C(T(D_1 + D_2 + \cdots + D_{m-1} + D_{m1} + D_{m2})) - C(T(D_1 + D_2 + \cdots + D_m)).$$

That this difference is larger than or equal to zero follows from

$$r_{m1} + r_{m2} - r_m \ge 0, \qquad b \ge \sum_{i=1}^{n} a_i \ge (r_{m1} + r_{m2} - r_m),$$

and from the fact that $D_m$ is a list obtained by OR merging the lists $D_{m1}$ and $D_{m2}$. Hence, $C(T_m)$ is a monotone nondecreasing function of $m$. In other words, for any tree $T_m$, there is a tree $T_1$ whose cost is less than or equal to $C_m$. Hence, we have the statement of the theorem. ■

V. Summary

The results in Section III allow us to find an optimal form of any nested Boolean expression in which

(i) all variables are distinct, and

(ii) the complement of any variable, $B$, (or any expression) can appear only in the form $A_1 \cdot A_2 \cdot \cdots \cdot A_n \cdot \bar{B}$ with at least one of the $A_i$'s not complemented.
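As an illustration of the restricted form in (ii), the following minimal sketch evaluates a query $A_1 \cdot A_2 \cdot \cdots \cdot A_n \cdot \bar{B}$ with AND merges followed by one AND NOT merge. It assumes that inverted lists are sorted Python lists of distinct document identifiers. The function names are hypothetical, and the left-to-right order of the AND merges is arbitrary here; it is not the optimal order determined by the algorithms of Section III.

```python
from functools import reduce


def and_merge(a, b):
    """AND merge: entries present in both sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def and_not_merge(a, b):
    """AND NOT merge: entries of a that do not appear in b."""
    out, j = [], 0
    for x in a:
        while j < len(b) and b[j] < x:
            j += 1
        if j == len(b) or b[j] != x:
            out.append(x)
    return out


def evaluate_restricted(positives, negated):
    """Evaluate A_1 . A_2 ... A_n . (not B).

    positives: non-empty sequence of lists A_1, ..., A_n
               (restriction (ii) requires at least one uncomplemented term).
    negated:   the list B whose complement appears in the query.
    """
    if not positives:
        raise ValueError("at least one uncomplemented list is required")
    conjunction = reduce(and_merge, positives)   # arbitrary left-to-right order
    return and_not_merge(conjunction, negated)


A1 = [1, 2, 3, 5, 8]
A2 = [2, 3, 5, 7, 8]
B = [3, 9]
print(evaluate_restricted([A1, A2], B))
```

The requirement of at least one uncomplemented term shows up in the sketch as the non-empty `positives` argument: without it there would be no list from which to delete the entries of $B$.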
We already noted that restriction (ii) leads to no real loss of generality, since it is indeed a restriction imposed on the form of the query in most inverted list document retrieval systems.

The problem of finding the optimal forms of Boolean expressions in which not all variables are distinct is more difficult than that of finding the minimum gate realization of the Boolean expression (when the fan-in of the gates is 2). Our problem is further complicated by the fact that if one of the terms $A + B$, $A \cdot B$, $A \cdot \bar{B}$, $\bar{A} \cdot B$ is generated by merging lists $A$ and $B$, the other three terms are also obtained at no additional cost.

When the lengths of overlaps between lists cannot be neglected in computing the total merge time, the Boolean expressions determined by the algorithms described in Section III are no longer optimal. The performance of these algorithms can be bounded in terms of the maximum overlap between the lists, as done in Section IV.

References

[1] Hsiao, D. and Prywes, N. S., "A System to Manage an Information System," in Proc. of the FID/IFIP Joint Conference on Mechanized Information Storage, Retrieval and Dissemination, Rome, Italy, 1967.

[2] Hsiao, D. and Harary, F., "A Formal System for Information Retrieval from Files," Comm. ACM, Vol. 13, No. 2, February, 1970.

[3] Martin, L. D., "A Model for File Structure Determination for Large On-line Data Files," in Proc. of the FILE 68 International Seminar on File Organization, Copenhagen, 1968.

[4] Prywes, N. S., "Man-Computer Problem Solving with Multilists," Proc. IEEE, Vol. 54, No. 12, December, 1966.

[5] Cardenas, A. F., "Analysis and Performance of Inverted Data Base Structures," Comm. ACM, Vol. 18, No. 5, May, 1975.

[6] Lowe, T. C., "The Influence of Data Base Characteristics and Usage on Direct Access File Organization," J. ACM, Vol. 15, No. 4, October, 1968.

[7] Liu, Jane W. S., "Probabilistic Models of Inverted File Document Retrieval Systems," Technical Report, UIUCDCS-R-75-7 1 42, University of Illinois, Department of Computer Science.

[8] Liu, Jane W. S., "Algorithms for Parsing Search Queries in Inverted File Document Retrieval Systems," Technical Report UIUCDCS-R-75-718, University of Illinois, Department of Computer Science, Urbana, Illinois, September, 1975.

[9] Knuth, D. E., The Art of Computer Programming, Vol. 1: Fundamental Algorithms, pp. 402-405, 1968.