UIUCDCS-R-75-718
 
 Algorithms for Parsing Search Queries in 
 
 Inverted File Document 
 
 Retrieval Systems 
 
 by 
 
 Jane W. S. Liu 
 
 November, 1975 
 
 This work was supported by the National Science Foundation under Grants 
 NSF DCR 73-07980 and NSF DCR 72-037^0 A01. 
 
 
 Abstract 
 
 In an inverted file document retrieval system, a query is in the 
 form of a Boolean expression of index terms. In response to a query, 
 the system accesses the inverted lists corresponding to the index terms, 
 merges them and selects from the merged list those documents that satisfy 
 the search logic. In this paper, we consider the problem of determining 
 a Boolean expression which leads to the minimum total merge time among 
 all Boolean expressions that are equivalent to the expression given in 
 the query. This problem is the same as finding an optimal merge tree 
 among all trees that realize the truth function determined by the Boolean 
 expression in the query. Several algorithms which generate optimal merge 
 trees, when the lengths of overlaps between different lists are small 
 compared with the length of the lists, are described. These algorithms 
 are no longer optimal when the lengths of overlaps cannot be neglected. 
 In this case, it is possible to bound the performance of these algorithms 
 in some instances in terms of the maximum overlap between lists. The 
 performance bounds are discussed. 
 
 
 Acknowledgment 
 
 The author wishes to thank Drs. D. J. Kuck and W. Stellhorn for
 their comments and suggestions.
 
I. Introduction 
 
 In this paper, we consider the problem of parsing search queries 
 in inverted file document retrieval systems. In such systems, the index 
 file contains an entry for each of the index terms selected as descriptors 
 for the documents in the data file. Each entry in the index file contains a 
 pointer to an inverted list of pointers in the postings file. The pointers 
 in this list in turn point to all the documents in the data file that 
 contain the corresponding index term. This file organization has been 
 studied extensively and is used in many well known systems [1-4].
 
 A query to an inverted file document retrieval system is in the
 form of a Boolean expression of index terms. For example, to request
 information on scheduling or resource management policies in time-shared
 systems, a user may present to the system a query

 "Time Shared" • ("Scheduling Policy" + "Resource Management Policy")   (1-1)

 where • and + are the AND and OR operators, respectively. In response to a
 request, the system accesses the inverted lists in the postings file
 corresponding to the index terms in the query, merges them and selects from
 the merged list those documents that satisfy the search logic. In our
 example, the union of the lists corresponding to the index terms
 "Scheduling Policy" and "Resource Management Policy" is the list of
 pointers to the documents on scheduling or resource management policies.
 Let us denote this list by A. The list A is obtained by merging the two
 lists with duplicated entries deleted. We call the process of merging two
 lists to obtain their union an OR merge. The intersection of two lists is
 obtained by merging them and deleting from the merged list all entries except the
 duplicated ones. We call the process of merging two lists to obtain their
 intersection an AND merge. In this example, pointers to documents to be
 selected as response to the query in (1-1) are obtained by AND merging the
 list A with the list corresponding to the index term "Time Shared." In
 other words, these three lists are merged in the order specified by the tree
 in Fig. 1-1a. We label the leaves of the tree by the index terms and the
 internal nodes by the Boolean operators corresponding to the merges.
 We note that the query in (1-1) is equivalent to the query

 "Scheduling Policy" • "Time Shared" + "Resource Management Policy" • "Time Shared"

 To answer this query, the corresponding lists are merged in the order specified
 by the tree in Fig. 1-1b. Clearly, the times required to produce the response
 might be different for the two equivalent queries written in different
 Boolean forms.
 
 We are concerned here with the problem of determining a Boolean 
 expression which leads to the minimum total merge time among all Boolean 
 expressions that are equivalent to the expression given in the query. For this 
 purpose, we describe the inverted file document retrieval system schematically 
 as shown in Fig. 1-2.* To process a query, the lists corresponding to the index
 terms in the query are read into the buffer memory and are merged after they have
 been placed in the buffer. The total retrieval time is equal to the sum of the
 time required to access the lists from the secondary memory and the merge
 time of the lists. In this paper, no attempt is made to minimize the
 former. (The average access time of the lists from the secondary memory
 has been estimated elsewhere [5,6]. The dependence of list access time
 on the access algorithm and buffer management scheme used in the system
 is the subject of a separate study [7].) Furthermore, only the case of
 two-way merges is considered here.

 * The total merge time computed here is equal to the amount of time required
 of the merge processor to process the query. The merge order that minimizes
 this time does not minimize the total retrieval time in general. However,
 keeping the total merge time minimized becomes important in multiuser systems
 in which many users share a large buffer memory. While the lists corresponding
 to the index terms in the query of a user are being merged, the lists specified
 in the queries of other users can be loaded into the buffer.

 [Figure 1-1: merge trees for (a) the query in (1-1) and (b) its equivalent distributed form]

 [Figure 1-2: schematic of the inverted file document retrieval system]
 
 In Section II, we introduce the terminologies and notations 
 necessary in our discussion. In Section III, we assume that the lengths 
 of overlaps between different lists are very small compared with the lengths 
 of the lists. Hence, in computation of the total merge time, lengths of 
 overlaps between lists can be neglected. Several algorithms are described 
 in Section III. These algorithms allow us to find optimal Boolean 
 expressions when queries are written as nested Boolean expressions in which 
 (i) all variables are distinct, and (ii) the complement of any variable, $\overline{B}$,
 can appear only in product terms, such as $A_1 \cdot A_2 \cdot \ldots \cdot A_n \cdot \overline{B}$, with at
 least one of the variables $A_i$ being uncomplemented. These algorithms are
 no longer optimal when the lengths of overlap between lists cannot be 
 neglected. In this case, it is possible to bound the performance of these 
 algorithms in some instances in terms of the maximum overlap between lists. 
 We discuss these bounds in Section IV. 
 
II. Notations 
 
 Throughout our discussions, we use upper case letters (e.g., A, B, C,
 D) to denote both index terms and their corresponding inverted lists. Lower
 case letters are used to denote the lengths of the corresponding lists
 (e.g., a, b, c and d denote the lengths of the lists A, B, C, and D,
 respectively). Let $\sigma(A,B)$ denote the length of overlap between lists A
 and B. The lengths of the resultant lists obtained by AND merging and OR
 merging the lists A and B are $\sigma(A,B)$ and $a+b-\sigma(A,B)$, respectively. Let $\overline{A}$
 denote the complement of the index term A. The list corresponding to $\overline{A} \cdot B$ is
 obtained by merging the lists A and B and selecting from the merged list
 those entries that are in list B but not in list A. We call this merging
 process an AND NOT merge. Clearly, the length of the list obtained by
 AND NOT merging A and B is equal to $b-\sigma(A,B)$. We shall not be concerned
 with Boolean expressions containing terms of the forms $\overline{A}+B$ and $\overline{A} \cdot \overline{B}$. This
 restriction leads to no loss of generality since Boolean expressions of
 these forms are explicitly ruled out in most installations.
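 The three merge types defined above can be illustrated with a short sketch. It assumes each inverted list is a sorted Python list of distinct document pointers; the function names (`two_way_merge`, `or_merge`, `and_merge`, `and_not_merge`) are hypothetical and not taken from the report. Each merge scans both lists once, which is why its time is proportional to the sum of their lengths.

```python
def two_way_merge(a, b, keep):
    """Merge sorted lists a and b; keep(in_a, in_b) selects the output entries."""
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            x, in_a, in_b = a[i], True, False   # entry only in A
            i += 1
        elif i == len(a) or b[j] < a[i]:
            x, in_a, in_b = b[j], False, True   # entry only in B
            j += 1
        else:
            x, in_a, in_b = a[i], True, True    # duplicated entry (overlap)
            i += 1
            j += 1
        if keep(in_a, in_b):
            out.append(x)
    return out


def or_merge(a, b):
    """Union: length a + b - sigma(A, B)."""
    return two_way_merge(a, b, lambda in_a, in_b: True)


def and_merge(a, b):
    """Intersection: length sigma(A, B)."""
    return two_way_merge(a, b, lambda in_a, in_b: in_a and in_b)


def and_not_merge(a, b):
    """Entries in B but not in A: length b - sigma(A, B)."""
    return two_way_merge(a, b, lambda in_a, in_b: in_b and not in_a)


A = [1, 3, 5, 7]
B = [3, 4, 7, 9]          # overlap sigma(A, B) = 2
print(or_merge(A, B), and_merge(A, B), and_not_merge(A, B))
```

 With these two example lists the result lengths are 6, 2 and 2, in agreement with $a+b-\sigma(A,B)$, $\sigma(A,B)$ and $b-\sigma(A,B)$.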
 
 Consider a query written as a Boolean expression of the index terms
 $A_1, A_2, \ldots, A_n$. Let $Q(A_1, A_2, \ldots, A_n)$ denote the truth function determined by
 this expression.* (A truth function is usually represented by a truth table
 for the Boolean expression.) We note that the truth function $Q(A_1, A_2, \ldots, A_n)$
 determines a unique list of pointers. This list is the valid response
 to all queries written as Boolean expressions that determine this truth
 function and this list can be obtained by merging the lists $A_1, A_2, \ldots, A_n$.
 Let $T(F(A_1, A_2, \ldots, A_n))$ denote a tree specifying the merge order of lists
 $A_1, A_2, \ldots, A_n$ corresponding to the Boolean expression $F(A_1, A_2, \ldots, A_n)$. Again,
 the leaves of the tree are labeled with the names of the corresponding lists
 while the internal nodes are marked by the Boolean operators corresponding
 to the merges. (Examples of such trees are shown in Fig. 1-1.) We say that
 the tree $T(F(A_1, A_2, \ldots, A_n))$ realizes the truth function $Q(A_1, A_2, \ldots, A_n)$ if
 $Q(A_1, A_2, \ldots, A_n)$ is the truth function determined by $F(A_1, A_2, \ldots, A_n)$. Since
 there are many ways to parenthesize a Boolean expression (e.g., $A \cdot (B+(C+D))$
 and $A \cdot ((B+C)+D)$ are two different ways to parenthesize the expression
 $A \cdot (B+C+D)$), corresponding to a Boolean expression, there are many different
 binary merge trees. We distinguish them by using different subscripts.
 (For example, $T_1(A \cdot (B+C+D))$ and $T_2(A \cdot (B+C+D))$ denote two different binary
 merge trees for the expression $A \cdot (B+C+D)$.) When there is no possible
 confusion, we also refer to the tree $T(F(A_1, A_2, \ldots, A_n))$ simply as $T$.

 * More precisely, a Boolean expression of the index terms $A_1, A_2, \ldots, A_n$
 specifies an element in the free Boolean algebra with n generators. This
 element in the free Boolean algebra in turn specifies the truth function
 $Q(A_1, A_2, \ldots, A_n)$.

 The time required to merge two lists is proportional to the sum of
 their lengths for all three types of merges. (To be specific, let the
 proportional constant be 1.) The cost of the tree $T(F(A_1, A_2, \ldots, A_n))$, denoted
 $C(T)$, is equal to the total merge time of the lists $A_1, A_2, \ldots, A_n$ when the
 order of the merges is specified by $T$. A tree is said to be optimal when
 its cost is minimum among all trees which realize the truth function
 $Q(A_1, A_2, \ldots, A_n)$ determined by the Boolean expression $F(A_1, A_2, \ldots, A_n)$ in the
 query. With a slight abuse of the notation, we denote the optimal tree by
 $T_0(F(A_1, A_2, \ldots, A_n))$. Hence the problem of finding a Boolean expression
 corresponding to the minimum total merge time is the same as that of finding
 an optimal merge tree among all trees that realize the truth function
 $Q(A_1, A_2, \ldots, A_n)$.
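 The cost $C(T)$ can be evaluated mechanically for any binary merge tree. The sketch below is only an illustration under stated assumptions: trees are nested tuples `(op, left, right)` with string leaves, the operator names `"or"`, `"and"`, `"andnot"` and the list lengths are hypothetical, and overlaps are treated as negligible (the simplification adopted in Section III), so an OR merge yields length $a+b$, an AND merge length 0, and an AND NOT merge the length of its right operand.

```python
def cost_and_length(tree, lengths):
    """Return (C(T), length of resulting list) for a binary merge tree.

    tree    -- a leaf name (str) or a tuple (op, left, right)
    lengths -- dict mapping leaf names to list lengths
    Overlaps are neglected; ("andnot", X, Y) stands for (not X) AND Y.
    """
    if isinstance(tree, str):
        return 0, lengths[tree]
    op, left, right = tree
    cost_l, len_l = cost_and_length(left, lengths)
    cost_r, len_r = cost_and_length(right, lengths)
    merge_time = len_l + len_r            # proportionality constant 1
    if op == "or":
        out_len = len_l + len_r
    elif op == "and":
        out_len = 0
    elif op == "andnot":
        out_len = len_r
    else:
        raise ValueError("unknown operator: %r" % (op,))
    return cost_l + cost_r + merge_time, out_len


# The two equivalent forms of query (1-1), with made-up list lengths.
lengths = {"TS": 100, "SP": 10, "RM": 20}
tree_a = ("and", "TS", ("or", "SP", "RM"))
tree_b = ("or", ("and", "SP", "TS"), ("and", "RM", "TS"))
print(cost_and_length(tree_a, lengths)[0], cost_and_length(tree_b, lengths)[0])
```

 With a long "Time Shared" list the undistributed form is cheaper; if the same sketch is run with `"TS": 5` instead, the distributed form wins, which is the trade-off the following sections quantify.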
 
III. Algorithms for Determining Optimal 
 Merge Trees 
 
 In this section, we assume that the lengths of overlap between 
 different lists are very small compared with the lengths of the lists. 
 Hence, in computation of total merge time, lengths of overlaps can be 
 neglected. In this case, the length of the resultant list obtained by OR
 merging lists A and B is equal to a+b. The length of the resultant list
 obtained by AND merging A and B is very small compared to a or b. In the
 computation of merge time, it is assumed to be zero. The length of the
 list obtained by AND NOT merging A and B is approximately equal to b.
 
 The lengths of overlaps between lists being negligibly small, an
 optimal tree for OR merging the lists $A_1, A_2, \ldots, A_n$ to realize the truth
 function determined by the Boolean expression $A_1+A_2+\cdots+A_n$ is one with
 minimum weighted path length where the weights of the leaves are the lengths
 of the lists and the weights of internal nodes are zero. Such a tree can
 be constructed using Huffman's procedure. We call the resultant tree a
 Huffman tree for $A_1, A_2, \ldots, A_n$ and denote it by $T_0(A_1+A_2+\cdots+A_n)$ [9].

 A property of an optimal tree is that all its subtrees are optimal.
 It follows that if $T_{10}$ and $T_{20}$ are two subtrees of an optimal tree
 $T_0(A_1+A_2+\cdots+A_n)$ obtained by removing the root of $T_0$, then $C(T_{10}) + C(T_{20})$
 is minimum among all possible two subtrees of an arbitrary tree
 $T(A_1+A_2+\cdots+A_n)$. Similarly, let $T_{10}, T_{20}, \ldots, T_{m0}$ denote m subtrees of
 $T_0(A_1+A_2+\cdots+A_n)$ obtained by removing the roots of larger subtrees of $T_0$.
 Then

 $$C(T_{10}) + C(T_{20}) + \cdots + C(T_{m0}) \le C(T_1) + C(T_2) + \cdots + C(T_m) \tag{3-1}$$

 where $T_1, T_2, \ldots, T_m$ are m subtrees of an arbitrary tree $T(A_1+A_2+\cdots+A_n)$. We
 call the subtrees $T_{10}, T_{20}, \ldots, T_{m0}$ Huffman subtrees.
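 As a rough illustration of Huffman's procedure in this setting, the sketch below computes the total OR merge time of a Huffman tree by repeatedly merging the two shortest lists. It assumes negligible overlaps, so merging lists of lengths x and y costs x + y and produces a list of length x + y; the function name `huffman_merge_cost` is hypothetical.

```python
import heapq


def huffman_merge_cost(lengths):
    """Total OR merge time C(T_0) of a Huffman tree over the given list lengths."""
    heap = list(lengths)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        x = heapq.heappop(heap)      # shortest remaining list
        y = heapq.heappop(heap)      # second shortest
        total += x + y               # time of this two-way merge
        heapq.heappush(heap, x + y)  # merged list goes back into the pool
    return total


print(huffman_merge_cost([1, 2, 5, 10]))
```

 For lengths 1, 2, 5 and 10 the merges cost 3, 8 and 18. Merging the same lists longest-first would cost 15 + 17 + 18 instead.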
 
 Optimal Merge Trees for Boolean Expressions of the
 Form $(A_1+A_2+\cdots+A_n) \cdot B$

 Let $A_1, A_2, \ldots, A_n$ and $B$ be n+1 lists with lengths $a_1, a_2, \ldots, a_n$ and $b$,
 respectively. To find an algorithm which yields an optimal form of the
 Boolean expression $(A_1+A_2+\cdots+A_n) \cdot B$, we consider the tree shown in Fig. 3-1
 where $T_1, T_2, \ldots, T_m$ are m subtrees of $T(A_1+A_2+\cdots+A_n)$. We note that

 Lemma 3-1. If the tree in Fig. 3-1 is an optimal merge tree among
 all trees that realize the truth function determined by the expression
 $(A_1+A_2+\cdots+A_n) \cdot B$, then $T_1, T_2, \ldots, T_m$ are Huffman subtrees of $T_0(A_1+A_2+\cdots+A_n)$.

 Proof: The cost of the tree in Fig. 3-1 is

 $$C(T) = C(T_1) + C(T_2) + \cdots + C(T_m) + \sum_{i=1}^{n} a_i + mb$$

 Because of (3-1), this cost is minimized when $T_1, T_2, \ldots, T_m$ are Huffman
 subtrees. ■

 [Figure 3-1]

 As a consequence of Lemma 3-1, to find an optimal tree which realizes
 the truth function determined by the Boolean expression $(A_1+A_2+\cdots+A_n) \cdot B$,
 we need to consider only trees corresponding to expressions of the form

 $$\Bigl(\sum_{A_i \in S_1} A_i\Bigr) \cdot B + \Bigl(\sum_{A_i \in S_2} A_i\Bigr) \cdot B + \cdots + \Bigl(\sum_{A_i \in S_m} A_i\Bigr) \cdot B \tag{3-2}$$

 where $S_j$ ($1 \le j \le m$) is the set of leaves of the j-th Huffman subtree of
 $T_0(A_1+A_2+\cdots+A_n)$. Furthermore, an optimal merge tree that realizes the
 truth function determined by $(A_1+A_2+\cdots+A_n) \cdot B$ can be obtained from
 
 Theorem 3-1

 $(A_1+A_2+\cdots+A_n) \cdot B$ is an optimal Boolean form if and only if

 $$\sum_{i=1}^{n} a_i \le b$$

 Hence an optimal merge tree for $(A_1+A_2+\cdots+A_n) \cdot B$ is the tree, $T_u$, shown in Fig.
 3-2a, where $T_0(A_1+A_2+\cdots+A_n)$ is the Huffman tree for the lists $A_1, A_2, \ldots, A_n$.

 Proof: Let $T_{10}$ and $T_{20}$ be two Huffman subtrees of $T_0(A_1+A_2+\cdots+A_n)$
 and $S$ be a subset of $\{A_1, A_2, \ldots, A_n\}$. Because of Lemma 3-1, the tree in
 Fig. 3-2b, $T_d$, will have the minimum cost among all trees $T(F(B, A_1, A_2, \ldots, A_n))$
 corresponding to Boolean expressions of the form

 $$F(B, A_1, A_2, \ldots, A_n) = \Bigl(\sum_{A_i \in S} A_i\Bigr) \cdot B + \Bigl(\sum_{A_i \in \{A_1, A_2, \ldots, A_n\} - S} A_i\Bigr) \cdot B$$

 The cost of $T_d$ is

 $$C_d = C(T_{10}) + C(T_{20}) + \sum_{i=1}^{n} a_i + 2b = C(T_0) + 2b$$

 while the cost of $T_u$ is

 $$C_u = C(T_0) + \sum_{i=1}^{n} a_i + b$$

 Therefore, we have

 $$C_d \ge C_u$$

 if and only if

 $$\sum_{i=1}^{n} a_i \le b \tag{3-3}$$

 [Figure 3-2: (a) the tree $T_u$; (b) the tree $T_d$]
 
 We now show that the inequality in (3-3) implies that the tree $T_u$ in
 Fig. 3-2a is indeed optimal. Again, because of Lemma 3-1, it suffices for
 us to show that (3-3) implies that the cost of the tree $T_m$ shown in Fig. 3-3a,
 $C_m$, is no greater than the cost of the tree $T_{(m+1)}$ shown in Fig. 3-3b, $C_{m+1}$, where
 $T_{j10}$ and $T_{j20}$ are two Huffman subtrees of $T_{j0}$. We note that

 $$C_{m+1} = \sum_{\substack{k=1 \\ k \ne j}}^{m} C(T_{k0}) + C(T_{j10}) + C(T_{j20}) + \sum_{i=1}^{n} a_i + (m+1)b$$

 and

 $$C_m = \sum_{k=1}^{m} C(T_{k0}) + \sum_{i=1}^{n} a_i + mb$$

 But

 $$C(T_{j0}) = C(T_{j10}) + C(T_{j20}) + \sum_{A_i \in S_j} a_i$$

 where $S_j$ is the set of leaves of $T_{j0}$. That

 $$C_{m+1} - C_m = b - \sum_{A_i \in S_j} a_i$$

 is equal to or larger than zero is clearly implied by the inequality (3-3). ■
 
 [Figure 3-3: (a) the tree $T_m$; (b) the tree $T_{(m+1)}$]

 From Theorem 3-1, we have algorithm A for finding an optimal merge
 tree for $(A_1+A_2+\cdots+A_n) \cdot B$.

 Algorithm A

 1. a. If $b \ge \sum_{i=1}^{n} a_i$, $(A_1+A_2+\cdots+A_n) \cdot B$ is an optimal Boolean form and
       an optimal merge tree is $T_u$ shown in Fig. 3-2a where $T_0(A_1+A_2+\cdots+A_n)$
       is the Huffman tree for the lists $A_1, A_2, \ldots, A_n$.

    b. If $b < \sum_{i=1}^{n} a_i$, choose the merge tree $T_d$ shown in Fig. 3-2b where
       $T_{10}$ and $T_{20}$ are two Huffman subtrees of $T_0(A_1+A_2+\cdots+A_n)$. The
       corresponding Boolean expression is

       $$\Bigl(\sum_{A_i \in S_1} A_i\Bigr) \cdot B + \Bigl(\sum_{A_i \in S_2} A_i\Bigr) \cdot B$$

       where $S_1$ and $S_2$ are the sets of leaves of $T_{10}$ and $T_{20}$.

 2. An optimal tree can be obtained by repeating step 1 for each of the
    terms $(\sum A_i) \cdot B$.
 
 Consider the example shown in Fig. 3-4. The lengths of the lists
 $B$, $A_1$, $A_2$, $A_3$ and $A_4$ are 5, 1, 2, 5 and 10, respectively. The Huffman tree
 for $A_1+A_2+A_3+A_4$ is shown in Fig. 3-4a together with its two subtrees $T_{10}$ and
 $T_{20}$. Since $(a_1+a_2+a_3+a_4) = 18 > b$, the cost of the tree $T_d$ in Fig. 3-4b is
 less than the cost of any tree $T(B \cdot (A_1+A_2+A_3+A_4))$. ($C(T_d) = 39$ and
 $C(T(B \cdot (A_1+A_2+A_3+A_4))) \ge 52$.) Hence, we choose the Boolean expression
 corresponding to the tree $T_d$,

 $$B \cdot (A_1+A_2+A_3) + B \cdot A_4$$

 instead of $B \cdot (A_1+A_2+A_3+A_4)$. Repeating step 1 for the term $B \cdot (A_1+A_2+A_3)$,
 we choose to distribute the $\cdot$ operation with respect to the + operation and obtain

 $$B \cdot (A_1+A_2) + B \cdot A_3 + B \cdot A_4$$

 The corresponding merge tree is $T_d'$ shown in Fig. 3-4c. Since $a_1 + a_2 < b$,
 we conclude that $T_d'$ is an optimal merge tree among all trees that realize
 the truth function determined by the expression $B \cdot (A_1+A_2+A_3+A_4)$. Indeed,
 $C(T_d') = 36$ and $C(T(B \cdot A_1 + B \cdot A_2 + B \cdot A_3 + B \cdot A_4)) = 38$.
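 Algorithm A can be sketched in a few lines. The code below is an illustrative reading of the steps, not the report's own implementation: overlaps are neglected, each Huffman tree node is a hypothetical tuple `(weight, cost, children)`, and `algorithm_a_cost` returns only the total merge time of the tree the algorithm selects. The list of lengths is assumed to be non-empty.

```python
import heapq
from itertools import count


def huffman_tree(lengths):
    """Build a Huffman OR-merge tree; each node is (weight, cost, children)."""
    tie = count()  # tie-breaker so heap entries never compare the node tuples
    heap = [(w, next(tie), (w, 0, None)) for w in lengths]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)
        w2, _, n2 = heapq.heappop(heap)
        node = (w1 + w2, n1[1] + n2[1] + w1 + w2, (n1, n2))
        heapq.heappush(heap, (w1 + w2, next(tie), node))
    return heap[0][2]


def algorithm_a_cost(node, b):
    """Merge time chosen by Algorithm A for (sum of A_i under node) AND B."""
    weight, cost, children = node
    if children is None or weight <= b:
        # Step 1a: keep the sum intact, then one AND merge with B.
        return cost + weight + b
    # Step 1b and step 2: distribute over the two Huffman subtrees, repeat.
    return sum(algorithm_a_cost(child, b) for child in children)


tree = huffman_tree([1, 2, 5, 10])      # lengths of A_1 .. A_4
print(algorithm_a_cost(tree, 5))        # b = 5, as in the example above
print(tree[1] + tree[0] + 5)            # undistributed form, for comparison
```

 On the lengths of the example, the sketch reproduces the costs quoted in the text: 36 for the selected tree and 52 for the undistributed form.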
 
 [Figure 3-4: Boolean expression $B \cdot (A_1+A_2+A_3+A_4)$. (a) the Huffman tree $T_0(A_1+A_2+A_3+A_4)$ with its subtrees $T_{10}$ and $T_{20}$; (b) the tree $T_d$; (c) the tree $T_d'$]
 
 Optimal Merge Trees for Boolean Expressions of the
 Form $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$

 Lemma 3-1 and Theorem 3-1 can be generalized to the case when the
 Boolean expression specified in the query is of the form

 $$(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$$

 To do so, let us consider the tree $T_k$ shown in Fig. 3-5 where $T_{A1}, T_{A2}, \ldots, T_{Aj}$
 are subtrees of $T(A_1+A_2+\cdots+A_n)$ and $T_{B1}, T_{B2}, \ldots, T_{Bk}$ are subtrees of
 $T(B_1+B_2+\cdots+B_m)$. We state without proof:

 Lemma 3-2. If the merge tree $T_k$ is optimal for the Boolean expression
 $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$, then $T_{A1}, T_{A2}, \ldots, T_{Aj}$ are Huffman subtrees of
 $T_0(A_1+A_2+\cdots+A_n)$ and $T_{B1}, T_{B2}, \ldots, T_{Bk}$ are Huffman subtrees of $T_0(B_1+B_2+\cdots+B_m)$.

 In this case, we have

 Theorem 3-2.

 $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$ is an optimal Boolean expression if
 and only if

 $$\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i$$

 Hence an optimal merge tree for $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$ is the
 tree $T_u$ in Fig. 3-6 where $T_A$ and $T_B$ are Huffman trees for $A_1, A_2, \ldots, A_n$ and
 $B_1, B_2, \ldots, B_m$, respectively.
 
 Proof: We compare the cost of the tree $T_u$ in Fig. 3-6a with that
 of $T_{dA}$ and $T_{dB}$ in Fig. 3-6b and 3-6c. The Boolean expressions corresponding
 to the trees $T_{dA}$ and $T_{dB}$ are

 $$\Bigl(\sum_{A_i \in S_{1A}} A_i\Bigr) \cdot (B_1+B_2+\cdots+B_m) + \Bigl(\sum_{A_i \in S_{2A}} A_i\Bigr) \cdot (B_1+B_2+\cdots+B_m)$$

 and

 $$\Bigl(\sum_{B_i \in S_{1B}} B_i\Bigr) \cdot (A_1+A_2+\cdots+A_n) + \Bigl(\sum_{B_i \in S_{2B}} B_i\Bigr) \cdot (A_1+A_2+\cdots+A_n)$$

 respectively, where $S_{1A}$ and $S_{2A}$ are the sets of leaves of $T_{A1}$ and $T_{A2}$,
 respectively (and $S_{1B}$ and $S_{2B}$ those of $T_{B1}$ and $T_{B2}$). Let $C_u$ and $C_{dA}$ be
 the costs of the trees $T_u$ and $T_{dA}$, respectively.

 $$C_u = C(T_A) + C(T_B) + \sum_{i=1}^{n} a_i + \sum_{i=1}^{m} b_i$$

 and

 $$C_{dA} = C(T_{A1}) + C(T_{A2}) + C(T_B) + 2 \sum_{i=1}^{m} b_i + \sum_{i=1}^{n} a_i$$

 But

 $$C(T_A) = C(T_{A1}) + C(T_{A2}) + \sum_{i=1}^{n} a_i$$

 Hence

 $$C_u - C_{dA} = \sum_{i=1}^{n} a_i - \sum_{i=1}^{m} b_i$$

 which is less than or equal to zero if and only if

 $$\sum_{i=1}^{n} a_i \le \sum_{i=1}^{m} b_i$$

 Similarly,

 $$\sum_{i=1}^{m} b_i \le \sum_{i=1}^{n} a_i$$

 implies

 $$C_u - C_{dB} \le 0$$

 where $C_{dB}$ is the cost of the tree $T_{dB}$. In other words,

 $$C_u = C_{dB} = C_{dA}$$

 when

 $$\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i \tag{3-4}$$

 [Figure 3-5: the merge tree $T_k$]

 [Figure 3-6: (a) the tree $T_u$; (b) the tree $T_{dA}$; (c) the tree $T_{dB}$]
 
 We need to show that because of Eq. (3-4), $C_u$ is no greater than the cost of any
 other tree which realizes the truth function determined by the Boolean
 expression $(A_1+A_2+\cdots+A_n) \cdot (B_1+\cdots+B_m)$. Let $T_k$ shown in Fig. 3-5 be such a
 tree. Again, because of Lemma 3-2, $T_{A1}, T_{A2}, \ldots, T_{Aj}$ are j Huffman subtrees
 of $T_A$ and $T_{B1}, T_{B2}, \ldots, T_{Bk}$ are k Huffman subtrees of $T_B$. Let $S_{Ap}$ be the
 set of leaves of $T_{Ap}$ (p = 1, 2, ..., j) and $S_{Bq}$ be the set of leaves of
 $T_{Bq}$ (q = 1, 2, ..., k). We note that the Boolean expression corresponding
 to the tree $T_k$ is

 $$\sum_{p=1}^{j} \sum_{q=1}^{k} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{Bq}} B_i\Bigr) \tag{3-5}$$

 Let $T_{Bq1}$ and $T_{Bq2}$ be two Huffman subtrees of $T_{Bq}$ with $S_{Bq1}$ and
 $S_{Bq2}$ being the sets of leaves, respectively. Let $T_{k+1}(1; r)$ denote a tree
 corresponding to the Boolean expression

 $$\sum_{p=1}^{j} \sum_{q=2}^{k} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{Bq}} B_i\Bigr) + \sum_{p=r+1}^{j} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{B1}} B_i\Bigr)$$
 $$+ \sum_{p=1}^{r} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{B11}} B_i\Bigr) + \sum_{p=1}^{r} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{B12}} B_i\Bigr)$$

 obtained by further distributing the $\cdot$ operation with respect to the + operation
 in the sum term $\sum_{B_i \in S_{B1}} B_i$ in (3-5). We note that for any $1 \le r \le j$,

 $$C(T_{k+1}(1; r)) \ge C(T_k)$$

 This inequality follows from

 $$C(T_k) = \sum_{p=1}^{j} C(T_{Ap}) + \sum_{q=1}^{k} C(T_{Bq}) + k \sum_{i=1}^{n} a_i + j \sum_{i=1}^{m} b_i$$

 while

 $$C(T_{k+1}(1; r)) = \sum_{p=1}^{j} C(T_{Ap}) + \sum_{q=1}^{k} C(T_{Bq}) + k \sum_{i=1}^{n} a_i + j \sum_{i=1}^{m} b_i + \sum_{p=1}^{r} \sum_{A_i \in S_{Ap}} a_i$$

 When r = j, we have

 $$C(T_{k+1}(1; j)) = \sum_{p=1}^{j} C(T_{Ap}) + \sum_{q=2}^{k} C(T_{Bq}) + C(T_{B11}) + C(T_{B12}) + (k+1) \sum_{i=1}^{n} a_i + j \sum_{i=1}^{m} b_i$$

 But

 $$C(T_{B1}) = C(T_{B11}) + C(T_{B12}) + \sum_{B_i \in S_{B1}} b_i$$

 Hence

 $$C(T_{k+1}(1; j)) - C(T_k) = \sum_{i=1}^{n} a_i - \sum_{B_i \in S_{B1}} b_i \tag{3-6}$$

 which is larger than or equal to zero when Eq. (3-4) is valid. In general,
 let $T_{k+1}(t; j)$ be the tree corresponding to the Boolean expression

 $$\sum_{p=1}^{j} \sum_{q=t+1}^{k} \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{Bq}} B_i\Bigr) + \sum_{p=1}^{j} \sum_{q=1}^{t} \Bigl[\Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{Bq1}} B_i\Bigr) + \Bigl(\sum_{A_i \in S_{Ap}} A_i\Bigr) \cdot \Bigl(\sum_{B_i \in S_{Bq2}} B_i\Bigr)\Bigr]$$

 for $1 \le t \le k$ and its cost $C(T_{k+1}(t; j))$. We have

 $$C(T_{k+1}(t; j)) - C(T_k) = t \sum_{i=1}^{n} a_i - \sum_{q=1}^{t} \sum_{B_i \in S_{Bq}} b_i$$

 which is equal to or larger than zero when Eq. (3-4) is valid. ■
 
 When $\sum_{i=1}^{n} a_i < \sum_{j=1}^{m} b_j$, the tree $T_u$ is no longer optimal. In this case,
 let $S_{Bj}$, $1 \le j \le k$, be the sets of leaves of k Huffman subtrees of $T_B$ such
 that their union is $\{B_1, B_2, \ldots, B_m\}$. Moreover,

 $$\sum_{B_i \in S_{Bj}} b_i \le \sum_{i=1}^{n} a_i, \qquad j = 1, 2, \ldots, k$$

 but

 $$\sum_{B_i \in S_{Bj} \cup S_{Bj'}} b_i > \sum_{i=1}^{n} a_i, \qquad j, j' = 1, 2, \ldots, k \text{ and } j \ne j'.$$

 It follows from Eq. (3-6) that

 Corollary 3-1

 When $\sum_{i=1}^{n} a_i < \sum_{j=1}^{m} b_j$, an optimal tree corresponding to the Boolean
 expression

 $$\sum_{j=1}^{k} (A_1+A_2+\cdots+A_n) \cdot \Bigl(\sum_{B_i \in S_{Bj}} B_i\Bigr)$$

 has the minimum cost among all trees that realize the truth function
 determined by $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$.

 Similarly, for the case of $\sum_{j=1}^{m} b_j < \sum_{i=1}^{n} a_i$, let $S_{Aj}$, $1 \le j \le k$, be
 the sets of leaves of k Huffman subtrees of $T_A$ such that their union is
 $\{A_1, A_2, \ldots, A_n\}$, and

 $$\sum_{A_i \in S_{Aj}} a_i \le \sum_{i=1}^{m} b_i, \qquad j = 1, 2, \ldots, k$$

 but

 $$\sum_{A_i \in S_{Aj} \cup S_{Aj'}} a_i > \sum_{i=1}^{m} b_i, \qquad j, j' = 1, 2, \ldots, k \text{ and } j \ne j'.$$

 We have

 Corollary 3-2

 When $\sum_{i=1}^{n} a_i > \sum_{j=1}^{m} b_j$, an optimal tree corresponding to the Boolean
 expression

 $$\sum_{j=1}^{k} (B_1+B_2+\cdots+B_m) \cdot \Bigl(\sum_{A_i \in S_{Aj}} A_i\Bigr)$$

 has the minimum cost among all trees that realize the truth function
 determined by $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$.
 
 An algorithm to determine an optimal tree in this case is

 Algorithm B

 1. If $\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i$, $(A_1+A_2+\cdots+A_n) \cdot (B_1+B_2+\cdots+B_m)$ is an
    optimal expression. An optimal tree is $T_u$ shown in Fig. 3-6a where $T_A$ and
    $T_B$ are Huffman trees of $A_1, A_2, \ldots, A_n$ and $B_1, B_2, \ldots, B_m$, respectively.

 2. If $\sum_{i=1}^{n} a_i < \sum_{i=1}^{m} b_i$,

    a. Choose the Boolean expression

       $$(A_1+A_2+\cdots+A_n) \cdot \Bigl(\sum_{B_i \in S_{B1}} B_i\Bigr) + (A_1+A_2+\cdots+A_n) \cdot \Bigl(\sum_{B_i \in S_{B2}} B_i\Bigr)$$

       where $S_{B1}$ and $S_{B2}$ are the sets of leaves of the two Huffman subtrees of $T_B$.
       The corresponding merge tree is $T_{dB}$ in Fig. 3-6c.

    b. For each of the terms $(A_1+A_2+\cdots+A_n) \cdot (\sum_{B_i \in S_{Bj}} B_i)$, if
       $\sum_{i=1}^{n} a_i < \sum_{B_i \in S_{Bj}} b_i$, distribute $\cdot$ with respect to the sum of the $B_i$'s such that
       the $B_i$'s in each of the sum terms are elements of sets of leaves of smaller
       Huffman subtrees.

    c. The process in step 2b terminates either when
       $\sum_{i=1}^{n} a_i \ge \sum_{B_i \in S_{Bj}} b_i$ or when we obtain terms of the form $(\sum_{i=1}^{n} A_i) \cdot B_j$.

 3. If $\sum_{i=1}^{n} a_i > \sum_{i=1}^{m} b_i$,

    a. Choose the Boolean expression

       $$(B_1+B_2+\cdots+B_m) \cdot \Bigl(\sum_{A_i \in S_{A1}} A_i\Bigr) + (B_1+B_2+\cdots+B_m) \cdot \Bigl(\sum_{A_i \in S_{A2}} A_i\Bigr)$$

       where $S_{A1}$ and $S_{A2}$ are the sets of leaves of the two Huffman subtrees of $T_A$.
       The corresponding merge tree is $T_{dA}$ in Fig. 3-6b.

    b. For each of the terms $(B_1+B_2+\cdots+B_m) \cdot (\sum_{A_i \in S_{Aj}} A_i)$, if
       $\sum_{i=1}^{m} b_i < \sum_{A_i \in S_{Aj}} a_i$, distribute $\cdot$ with respect to the sum of the $A_i$'s
       such that the $A_i$'s in each of the sum terms are elements of sets of
       leaves of smaller Huffman subtrees.

    c. The process in step 3b terminates either when $\sum_{i=1}^{m} b_i \ge \sum_{A_i \in S_{Aj}} a_i$ or
       when we obtain terms of the form $(\sum_{i=1}^{m} B_i) \cdot A_j$.
 
 We illustrate Algorithm B by an example. Consider the expression
 $(A_1+A_2+A_3+A_4) \cdot (B_1+B_2)$. The lengths of the corresponding lists are
 $a_1 = 1$, $a_2 = 2$, $a_3 = 3$, $a_4 = 6$, $b_1 = 2$ and $b_2 = 2$. The Huffman trees for
 the lists $A_1$, $A_2$, $A_3$ and $A_4$ and for the lists $B_1$ and $B_2$ are shown in Fig. 3-7a.
 Since $a_1 + a_2 + a_3 + a_4 > b_1 + b_2$, the cost of the merge tree $T_1$ in Fig. 3-7b
 corresponding to the Boolean expression

 $$F_1 = (B_1+B_2) \cdot (A_1+A_2+A_3) + (B_1+B_2) \cdot A_4$$

 is less than the cost of the tree $T((A_1+A_2+A_3+A_4) \cdot (B_1+B_2))$. Moreover, since
 $a_1+a_2+a_3 > b_1+b_2$, the cost of the tree $T_2$ in Fig. 3-7c is less than the cost
 of $T_1$. Indeed, $T_2$ is an optimal tree and the Boolean expression corresponding
 to it is

 $$F_2 = (B_1+B_2) \cdot (A_1+A_2) + (B_1+B_2) \cdot A_3 + (B_1+B_2) \cdot A_4$$

 ($C(T((A_1+A_2+A_3+A_4) \cdot (B_1+B_2))) \ge 41$, $C(T_1) = 33$, $C(T_2) = 31$.)

 [Figure 3-7: (a) the Huffman trees $T_A$ and $T_B$; (b) the tree $T_1$; (c) the tree $T_2$]
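 The steps of Algorithm B can be sketched as follows. This is an illustrative reading under the same assumptions as before (negligible overlaps, non-empty inputs), with hypothetical names throughout: the sum with the larger total length is split along its Huffman subtrees, while the other sum is merged once and reused in every AND merge. The function returns only the total merge time.

```python
import heapq
from itertools import count


def huffman_tree(lengths):
    """Build a Huffman OR-merge tree; each node is (weight, cost, children)."""
    tie = count()  # tie-breaker so heap entries never compare the node tuples
    heap = [(w, next(tie), (w, 0, None)) for w in lengths]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)
        w2, _, n2 = heapq.heappop(heap)
        node = (w1 + w2, n1[1] + n2[1] + w1 + w2, (n1, n2))
        heapq.heappush(heap, (w1 + w2, next(tie), node))
    return heap[0][2]


def algorithm_b_cost(a_lengths, b_lengths):
    """Merge time of the tree chosen by Algorithm B for (sum A_i) AND (sum B_i)."""
    t_a, t_b = huffman_tree(a_lengths), huffman_tree(b_lengths)
    # The sum with the larger total is distributed (steps 2 and 3);
    # with equal totals neither is split (step 1).
    split, whole = (t_a, t_b) if t_a[0] > t_b[0] else (t_b, t_a)
    whole_len, whole_cost, _ = whole

    def distribute(node):
        weight, cost, children = node
        if children is None or weight <= whole_len:
            # Stop: OR-merge this subtree, then AND-merge it with the whole sum.
            return cost + weight + whole_len
        return sum(distribute(child) for child in children)

    # The undistributed sum is built once and reused in every AND merge.
    return whole_cost + distribute(split)


print(algorithm_b_cost([1, 2, 3, 6], [2, 2]))
```

 On the lengths of the example above, the sketch returns 31, the value the text gives for $C(T_2)$.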
 
 Further Generalizations

 To find an optimal merge tree when the Boolean expression
 specified in the query is a product of sum terms

 $$P_m = \prod_{j=1}^{m} (A_{j1} + A_{j2} + \cdots + A_{jn_j}) \tag{3-7}$$

 and $A_{ji}$ are all distinct, let $C(m)$ denote the cost of the merge tree $T(P_m)$.
 Suppose that we first complete all those merges corresponding to the Boolean
 expression

 $$P_{m-1} = \prod_{j=1}^{m-1} (A_{j1} + A_{j2} + \cdots + A_{jn_j})$$

 Since the lengths of overlaps between lists are assumed to be negligibly small,
 it follows from Theorem 3-1 that the cost

 $$C(m) = \sum_{i=1}^{n_m} a_{mi} + C(m-1)$$

 is minimum in this case and the corresponding Boolean expression is

 $$P_m = A_{m1} \cdot P_{m-1} + A_{m2} \cdot P_{m-1} + \cdots + A_{mn_m} \cdot P_{m-1}$$

 Similarly, we have

 $$C(m) = \sum_{j=3}^{m} \sum_{i=1}^{n_j} a_{ji} + C(T((A_{11}+A_{12}+\cdots+A_{1n_1}) \cdot (A_{21}+A_{22}+\cdots+A_{2n_2})))$$

 Suppose that the indices are chosen so that

 $$C(T((A_{11}+A_{12}+\cdots+A_{1n_1}) \cdot (A_{21}+A_{22}+\cdots+A_{2n_2}))) - \sum_{i=1}^{n_1} a_{1i} - \sum_{i=1}^{n_2} a_{2i}$$
 $$\le C(T((A_{j1}+A_{j2}+\cdots+A_{jn_j}) \cdot (A_{k1}+A_{k2}+\cdots+A_{kn_k}))) - \sum_{i=1}^{n_j} a_{ji} - \sum_{i=1}^{n_k} a_{ki}$$

 for all j, k = 3, 4, ..., m.

 Corollary 3-3

 If $\sum_{i=1}^{n_j} a_{ji}$ are equal for all j = 1, 2, ..., m, the cost of the
 optimal tree corresponding to the Boolean expression in (3-7) is

 $$C_0(m) = \sum_{j=1}^{m} \sum_{i=1}^{n_j} a_{ji} + C(T(A_{11}+A_{12}+\cdots+A_{1n_1})) + C(T(A_{21}+A_{22}+\cdots+A_{2n_2}))$$

 Furthermore, when $\sum_{i=1}^{n_j} a_{ji}$ are not all equal, the cost of the optimal tree
 is given by

 Corollary 3-4

 $$C_0(m) = \sum_{j=3}^{m} \sum_{i=1}^{n_j} a_{ji} + C(T_0((A_{11}+A_{12}+\cdots+A_{1n_1}) \cdot (A_{21}+A_{22}+\cdots+A_{2n_2}))) \tag{3-8}$$

 When the Boolean expression given in the query can be written as
 a sum of product terms

 $$S_m = \sum_{j=1}^{m} A_{j1} \cdot A_{j2} \cdot \ldots \cdot A_{jn_j}$$

 with all $A_{ji}$ being distinct, we note that distribution of the + operation
 with respect to the $\cdot$ operation will lead to an increase in the total merge
 time. Hence, the optimal cost in this case is

 $$C = \sum_{j=1}^{m} \sum_{i=1}^{n_j} a_{ji}$$
 
 Optimal Merge Trees for Boolean Expressions Containing
 Complements of Variables

 To generalize the results discussed above to Boolean expressions
 containing complements of variables, we consider expressions of the form

 $$(A_1+A_2+\cdots+A_n) \cdot \overline{B} \tag{3-9}$$

 In this case, we have

 Theorem 3-3

 $(A_1+A_2+\cdots+A_n) \cdot \overline{B}$ is an optimal expression among all expressions
 equivalent to it. Hence an optimal merge tree is $T_u$ shown in Fig. 3-8
 where the symbol $\cdot\,\neg$ is used to denote an AND NOT merge.

 Proof: The cost of the merge tree corresponding to the Boolean
 expression in (3-9) is

 $$C_u = C(T_A) + \sum_{i=1}^{n} a_i + b$$

 where $T_A$ is a Huffman tree for the lists $A_1, A_2, \ldots, A_n$. Let $T_{A1}$ and
 $T_{A2}$ be two subtrees of $T(A_1+A_2+\cdots+A_n)$ with sets of leaves $S_{A1}$ and $S_{A2}$,
 respectively. The cost of the merge tree corresponding to the Boolean
 expression

 $$\Bigl(\sum_{A_i \in S_{A1}} A_i\Bigr) \cdot \overline{B} + \Bigl(\sum_{A_i \in S_{A2}} A_i\Bigr) \cdot \overline{B}$$

 is

 $$C_d = C(T_{A1}) + C(T_{A2}) + \sum_{i=1}^{n} a_i + 2b + \sum_{i=1}^{n} a_i \ge C(T_A) + \sum_{i=1}^{n} a_i + 2b = C_u + b$$

 Similarly, let $T_{Ai}$ for i = 1, 2, ..., m be m subtrees of $T_A$ and $S_{Ai}$ be
 the sets of their leaves,

 $$C_{dm} \ge C(T_A) + mb + \sum_{i=1}^{n} a_i \ge C_u$$

 ■

 [Figure 3-8: the tree $T_u$ for $(A_1+A_2+\cdots+A_n) \cdot \overline{B}$]
 
 We further observe that the cost of the merge tree corresponding to
 the Boolean expression

 $$B \cdot \overline{A_1+A_2+\cdots+A_n}$$

 is equal to that of $B \cdot (A_1+A_2+\cdots+A_n)$. Again, let $T_{A1}$ and $T_{A2}$ be two
 Huffman subtrees of $T_0(A_1+A_2+\cdots+A_n)$ with their sets of leaves being $S_{A1}$
 and $S_{A2}$, respectively. The cost of the merge tree corresponding to the
 Boolean expression*

 $$B \cdot \overline{\sum_{A_i \in S_{A1}} A_i} \cdot \overline{\sum_{A_i \in S_{A2}} A_i}$$

 is equal to

 $$C(T_{A1}) + C(T_{A2}) + \sum_{i=1}^{n} a_i + 2b$$

 the same as the cost of the tree corresponding to

 $$B \cdot \Bigl(\sum_{A_i \in S_{A1}} A_i\Bigr) + B \cdot \Bigl(\sum_{A_i \in S_{A2}} A_i\Bigr)$$

 * We note that in response to a query of the form $B \cdot \overline{F_1} \cdot \overline{F_2}$, the system
 will parenthesize the expression as $(B \cdot \overline{F_1}) \cdot \overline{F_2}$, or rewrite it as
 $B \cdot \overline{F_1+F_2}$.

 Hence we can use the result in Theorem 3-1 and obtain

 Corollary 3-5

 When $b \ge \sum_{i=1}^{n} a_i$, $B \cdot \overline{A_1+A_2+\cdots+A_n}$ is optimal among all Boolean
 expressions equivalent to it.

 Algorithm A can be modified to determine an optimal equivalent form
 when $b < \sum_{i=1}^{n} a_i$.
 
 Algorithm A1

 1. If $b \ge \sum_{i=1}^{n} a_i$, $B \cdot \overline{A_1 + A_2 + \dots + A_n}$ is
    optimal among all Boolean expressions equivalent to it and an optimal
    merge tree is $T_u$ as shown in Fig. 3-9a. Otherwise, we have

 2. a. $b < \sum_{i=1}^{n} a_i$. Choose the Boolean expression

       $$B \cdot \overline{\sum_{A_i \in S_{A1}} A_i} \cdot \overline{\sum_{A_i \in S_{A2}} A_i}$$

       and the corresponding tree is $T_d$ shown in Fig. 3-9b.

    b. An optimal tree can be obtained by repeating steps 1 and/or 2a for
       $B \cdot \overline{\sum_{A_i \in S_{A1}} A_i}$ and then for
       $\big(B \cdot \overline{\sum_{A_i \in S_{A1}} A_i}\big) \cdot \overline{\sum_{A_i \in S_{A2}} A_i}$.
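 The recursion in Algorithm A1 can be sketched in a few lines of Python. This
 is only an illustration under the Section III assumption that overlaps are
 negligible: merging lists of lengths x and y is taken to cost x + y, and the
 result of an AND NOT merge with B is taken to have length b. The names
 `huffman` and `a1_cost` are hypothetical and do not appear in the report.

```python
import heapq
from itertools import count


def huffman(lengths):
    """Build a Huffman tree over list lengths.

    Each node is a tuple (total_length, or_merge_cost, left, right);
    leaves have left = right = None and or_merge_cost = 0.
    """
    tick = count()  # tie-breaker so heapq never compares the node tuples
    heap = [(a, next(tick), (a, 0, None, None)) for a in lengths]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        node = (w1 + w2, t1[1] + t2[1] + w1 + w2, t1, t2)
        heapq.heappush(heap, (w1 + w2, next(tick), node))
    return heap[0][2]


def a1_cost(node, b):
    """Cost of the form chosen by steps 1 and 2 (zero-overlap assumption)."""
    total, or_cost, left, right = node
    if b >= total or left is None:
        # Step 1: OR merge this group of A lists, then one AND NOT merge with B.
        return or_cost + total + b
    # Step 2: split at the root into the two Huffman subtrees and repeat.
    return a1_cost(left, b) + a1_cost(right, b)


tree = huffman([1, 2, 3])
cost_long_b = a1_cost(tree, 10)   # b >= a1 + a2 + a3, so step 1 applies
cost_short_b = a1_cost(tree, 2)   # b < a1 + a2 + a3, so the tree is split
```

 With lengths 1, 2, 3 the unsplit form costs 9 + 6 + b; for b = 2 that is 17,
 whereas the split form found by the recursion costs 12.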
 
 Figure 3-9. (a) The merge tree $T_u$. (b) The merge tree $T_d$.
 
 Similarly, from Theorem 3-2, we have

 Corollary 3-6

 When $\sum_{i=1}^{m} b_i = \sum_{i=1}^{n} a_i$, an optimal form of the Boolean
 expression

 $$(B_1 + B_2 + \dots + B_m) \cdot \overline{A_1 + A_2 + \dots + A_n}$$

 is

 $$(B_1 + B_2 + \dots + B_m) \cdot \overline{A_1 + A_2 + \dots + A_n}$$

 The algorithm B1 determines an optimal merge tree when
 $\sum_{i=1}^{m} b_i \ne \sum_{i=1}^{n} a_i$.

 Algorithm B1
 
 1. If $\sum_{i=1}^{n} a_i = \sum_{i=1}^{m} b_i$, the optimal merge tree is $T_u$
    in Fig. 3-10a, where $T_A$ and $T_B$ are Huffman trees of
    $A_1, A_2, \dots, A_n$ and $B_1, B_2, \dots, B_m$. Otherwise, we have

 2. If $\sum_{i=1}^{n} a_i > \sum_{i=1}^{m} b_i$:

    a. Choose the expression

       $$\Big(\sum_{A_i \in S_{A1}} A_i\Big) \cdot \overline{B_1 + B_2 + \dots + B_m}
         + \Big(\sum_{A_i \in S_{A2}} A_i\Big) \cdot \overline{B_1 + B_2 + \dots + B_m}$$

       The corresponding merge tree is shown in Fig. 3-10b.

    b. For each of the terms
       $\big(\sum_{A_i \in S_{Aj}} A_i\big) \cdot \overline{B_1 + B_2 + \dots + B_m}$,
       if $\sum_{A_i \in S_{Aj}} a_i > \sum_{i=1}^{m} b_i$, rewrite the term as

       $$\Big(\sum_{A_i \in S_{Aj1}} A_i\Big) \cdot \overline{B_1 + B_2 + \dots + B_m}
         + \Big(\sum_{A_i \in S_{Aj2}} A_i\Big) \cdot \overline{B_1 + B_2 + \dots + B_m}$$

       where $S_{Aj1}$ and $S_{Aj2}$ are the sets of leaves of the two Huffman
       subtrees of $T_{Aj}$.

    c. Repeat Step 2b until either
       $\sum_{A_i \in S_{Aj}} a_i \le \sum_{i=1}^{m} b_i$ or the term becomes
       $A_j \cdot \overline{B_1 + B_2 + \dots + B_m}$.

 3. If $\sum_{i=1}^{n} a_i < \sum_{i=1}^{m} b_i$:

    a. Choose the Boolean expression

       $$(A_1 + A_2 + \dots + A_n) \cdot \overline{\Big(\sum_{B_i \in S_{B1}} B_i\Big)}
         \cdot \overline{\Big(\sum_{B_i \in S_{B2}} B_i\Big)}$$

       and the corresponding merge tree is $T_{dB}$ as shown in Fig. 3-10c.

    b. For each of the terms $\overline{\sum_{B_i \in S_{Bj}} B_i}$, if
       $\sum_{i=1}^{n} a_i < \sum_{B_i \in S_{Bj}} b_i$, rewrite the term as

       $$\overline{\Big(\sum_{B_i \in S_{Bj1}} B_i\Big)} \cdot \overline{\Big(\sum_{B_i \in S_{Bj2}} B_i\Big)}$$

       where $S_{Bj1}$ and $S_{Bj2}$ are the sets of leaves of the Huffman
       subtrees of $T_{Bj}$.

    c. Repeat Step 3b until either
       $\sum_{i=1}^{n} a_i \ge \sum_{B_i \in S_{Bj}} b_i$ or $S_{Bj}$ contains
       one single term.

 Figure 3-10. Merge trees used in Algorithm B1: (a) $T_u$; (b) the tree of
 Step 2a; (c) $T_{dB}$ of Step 3a.
 
 Furthermore, we have 
 
 Corollary 3-7 
 
 The cost of an optimal merge tree corresponding to the expression 
 
 m I 
 
 « ± (VV-'-^iJ • k ! x (B ii +B i2 + - + %^ 
 
 is 
 
 C (m) +2 2 b 
 U k=l i=l J6± 
 
 where C (m) is the cost for merging the lists A. . in Eq. (3-8). 
 
 >J J- J 
 
 IV. Bounds on Suboptimal Parsing Algorithms
 
 When the lengths of overlaps cannot be neglected in the computation of
 merge times, the algorithms described in Section III are no longer optimal.
 In this section, we derive bounds on the performance of these algorithms in
 terms of the maximum overlap between lists. Again, let $\sigma(A, B)$ denote
 the length of overlap between lists $A$ and $B$. The length of the resultant
 list obtained by OR merging lists $A$ and $B$ is equal to
 $a + b - \sigma(A, B)$. The lengths of the resultant lists are $\sigma(A, B)$
 and $a - \sigma(A, B)$ when the lists $A$ and $B$ are AND merged and AND NOT
 merged, respectively.

 Consider a set of lists $S = \{A_1, A_2, \dots, A_n\}$. We say that the maximum
 length of overlap between them is $\hat{\sigma}$ if
 $\sigma(A_i, A_j) \le \hat{\sigma}$ for all $A_i$ and $A_j$ in $S$. Moreover,
 let $S_1$ and $S_2$ be any two disjoint subsets of $S$, and $R_1$ and $R_2$ be
 the two lists obtained by OR merging the lists in $S_1$ and $S_2$,
 respectively. Then $\sigma(R_1, R_2) \le \hat{\sigma}$. Hence, $\hat{\sigma}$ is
 an absolute measure of the maximum length of overlap. It is a meaningful
 measure when the lists in $S$ are of comparable lengths and their
 intersections are relatively small compared to their lengths.
 
 In practice, the lengths of the inverted lists may differ by several
 orders of magnitude.† The length of overlap between any pair of lists is
 often measured in terms of a fraction of the length of the shorter list. Let
 $\phi(A_i, A_j)$ denote the fraction such that

 $$\sigma(A_i, A_j) = \phi(A_i, A_j) \min[a_i, a_j]$$

 for any $A_i$, $A_j$ in $S = \{A_1, A_2, \dots, A_n\}$. Let $\hat{\phi}$ denote
 the maximum overlap for the set $S$. That is,
 $\sigma(A_i, A_j) \le \hat{\phi} \min[a_i, a_j]$ and
 $\sigma(R_1, R_2) \le \hat{\phi} \min[r_1, r_2]$, where $r_1$ and $r_2$ are the
 lengths of $R_1$ and $R_2$, respectively. With slight abuse of the term, we
 also call $\hat{\phi}$ the maximum length of overlap.

 † For example, from MEDLAR Master Mesh, we found that the length of the list
 corresponding to the index term HUMAN is 493,599 while that corresponding
 to LUROVIN is only 4.
 
 Again let $C(T)$ denote the cost of a merge tree $T(A_1 + A_2 + \dots + A_n)$.
 Since $C(T)$ is the total merge time of $A_1, A_2, \dots, A_n$ when their merge
 order is specified by $T$, $C(T)$ depends on the lengths of
 $A_1, A_2, \dots, A_n$ as well as the overlaps between them. Let $P(T)$ denote
 the weighted path length of the tree $T$. As discussed in Section III, $P(T)$
 is the cost of the tree $T$ when the corresponding Boolean expression is
 $A_1 + A_2 + \dots + A_n$ and the lengths of overlap between the lists are zero.
 
 Bounds on Cost of Huffman Tree 
 
 Let $T_O(A_1 + A_2 + \dots + A_n)$ be an optimal merge tree for the lists
 $A_1, A_2, \dots, A_n$ corresponding to the Boolean expression
 $A_1 + A_2 + \dots + A_n$. As demonstrated by the example in Fig. 4-1, $T_O$
 cannot be obtained using Huffman's procedure in general. Let $T_H$ denote the
 Huffman tree for $A_1, A_2, \dots, A_n$. We now bound the cost of the Huffman
 tree $T_H$.

 Lemma 4-1

 Let $S = \{A_1, A_2, \dots, A_k\}$ be a set of lists and $R_k$ be the list
 obtained by OR merging all the $A_i$ in $S$. The length of the list $R_k$,
 $r_k$, is such that

 $$r_k \ge \sum_{i=1}^{k} a_i - (k-1)\hat{\sigma}$$

 where $\hat{\sigma}$ is the maximum overlap between the lists in $S$. Moreover,
 the bound is tight.
 
 Figure 4-1. An example with $a_1 = 4$, $a_2 = 3$, $a_3 = 2$ and
 $\sigma(A_1, A_2) = 3$, $\sigma(A_1, A_3) = 0$, $\sigma(A_2, A_3) = 0$.
 (a) $T_H$, Huffman tree: $C(T_H) = 2(2+3) + 4 = 14$.
 (b) $T_O$, optimal tree: $C(T_O) = 3 + 4 + 4 + 2 = 13$.
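 The two costs in Fig. 4-1 can be reproduced with a short Python sketch. The
 concrete element values below are invented for illustration; only the
 lengths (4, 3, 2) and the overlap of 3 between $A_1$ and $A_2$ are taken from
 the figure. The function name `merge_cost` is hypothetical, and merging two
 lists of lengths x and y is assumed to take x + y time.

```python
def merge_cost(order, lists):
    """Total OR-merge time for a nested merge order.

    `order` is a list name or a pair (left_order, right_order);
    returns (total_cost, merged_set).
    """
    if isinstance(order, str):
        return 0, lists[order]
    cost_l, left = merge_cost(order[0], lists)
    cost_r, right = merge_cost(order[1], lists)
    # Each merge reads both inputs once; the result drops duplicates.
    return cost_l + cost_r + len(left) + len(right), left | right


lists = {
    "A1": {1, 2, 3, 4},   # a1 = 4
    "A2": {1, 2, 3},      # a2 = 3, shares 3 elements with A1
    "A3": {10, 11},       # a3 = 2, disjoint from the others
}

huffman_order = (("A3", "A2"), "A1")   # two shortest lists first
overlap_order = (("A1", "A2"), "A3")   # heavily overlapping lists first

huffman_cost, _ = merge_cost(huffman_order, lists)   # 14
overlap_cost, _ = merge_cost(overlap_order, lists)   # 13
```

 Merging the overlapping pair first shrinks the intermediate list from 5 to 4
 elements, which is why the Huffman order is not optimal here.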
 
 Proof: Let us consider any list $A_i$ in the set $S$. Without loss of
 generality, suppose that $A_i \cap A_{i_1}$, $A_i \cap A_{i_2}$, ...,
 $A_i \cap A_{i_k}$ are nonempty (where $\cap$ denotes set intersection).† By
 definition of $\hat{\sigma}$, the total number of elements in
 $A_i \cap A_{i_1}$, $A_i \cap A_{i_2}$, ..., $A_i \cap A_{i_k}$ is at most
 $\hat{\sigma}$. Let

 $$I(A_1) = \emptyset$$

 where $\emptyset$ is the null set and

 $$I(A_i) = A_i \cap (A_1 \cup A_2 \cup \dots \cup A_{i-1}), \qquad i = 2, 3, \dots, k$$

 We note that the lists

 $$A_1' = A_1, \quad A_2' = A_2 - I(A_2), \quad \dots, \quad A_k' = A_k - I(A_k)$$

 are disjoint. Moreover, their lengths $a_1', a_2', \dots, a_k'$ are such that

 $$a_1' = a_1, \quad a_2' \ge a_2 - \hat{\sigma}, \quad \dots, \quad a_k' \ge a_k - \hat{\sigma}$$

 Since the list corresponding to $A_1' + A_2' + \dots + A_k'$ is $R_k$, we have

 $$r_k = \sum_{i=1}^{k} a_i' \ge \sum_{i=1}^{k} a_i - (k-1)\hat{\sigma}$$

 † We point out here that throughout our discussion, by a list, we mean a
 sorted list of distinct elements. Hence, we may also regard a list as a
 set whenever the order in which the elements appear in it is irrelevant.
 
 That the bound is tight can be demonstrated by the example:

 $$A_1 = (\alpha, \beta, \gamma, x_1, \dots, x_m), \quad
   A_2 = (\alpha, \beta, \gamma, y_1, \dots, y_n), \quad \dots, \quad
   A_k = (\alpha, \beta, \gamma, z_1, \dots, z_\ell)$$

 where the $x_i$, $y_i$, $z_i$ are all distinct, $\hat{\sigma} = 3$ and the list
 $R_k$ is
 $(\alpha, \beta, \gamma, x_1, \dots, x_m, y_1, \dots, y_n, \dots, z_1, \dots, z_\ell)$
 with length $\sum_{i=1}^{k} a_i - 3(k-1)$. ■
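 The tightness example can be checked numerically. The following Python
 sketch builds k lists that share exactly three elements and are otherwise
 disjoint; the helper name `tight_example` and the choice of five private
 elements per list are assumptions made only for this illustration.

```python
def tight_example(k, tail=5):
    """Return k sets sharing exactly three elements and nothing else."""
    shared = {"alpha", "beta", "gamma"}
    # (i, j) pairs give every list its own private, distinct elements.
    return [shared | {(i, j) for j in range(tail)} for i in range(k)]


lists = tight_example(4)
sigma_hat = 3                                   # maximum overlap in this example
sum_of_lengths = sum(len(s) for s in lists)     # 4 lists of length 8
r_k = len(set().union(*lists))                  # length of the OR-merged list
bound = sum_of_lengths - (len(lists) - 1) * sigma_hat
```

 Here `r_k` and `bound` are both 23, so the inequality of Lemma 4-1 holds with
 equality.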
 In terms of $\hat{\sigma}$, we have the following bound on the weighted path
 length of a Huffman tree for a set of lists $S = \{A_1, A_2, \dots, A_n\}$,
 $P(T_H)$.

 Theorem 4-1

 $$P(T_H) \le C(T) + \tfrac{1}{2}(n^2 + n - 1)\,\frac{n-1}{n}\,\hat{\sigma} \qquad (4\text{-}1)$$

 where $\hat{\sigma}$ is the maximum overlap for the lists
 $A_1, A_2, \dots, A_n$ and $C(T)$ is the cost of any tree $T$ for these lists
 corresponding to the Boolean expression $A_1 + A_2 + \dots + A_n$.

 Proof: Let $T$ be an arbitrary tree for the set of lists
 $\{A_1, A_2, \dots, A_n\}$ corresponding to the expression
 $A_1 + A_2 + \dots + A_n$. Suppose that in $T$ the leaf $A_i$ is at level
 $\ell_i$, $i = 1, 2, \dots, n$. Let $T'$ be a tree obtained from $T$ when the
 weight of $A_i$ is replaced by $a_i - \frac{n-1}{n}\hat{\sigma}$. We claim that

 $$P(T') \le C(T) \qquad (4\text{-}2)$$

 That is,

 $$\sum_{i=1}^{n} \ell_i \Big(a_i - \frac{n-1}{n}\hat{\sigma}\Big) \le C(T)$$

 Since a Huffman tree has the minimum weighted path length and
 $\sum_{i=1}^{n} \ell_i \le \tfrac{1}{2}(n^2 + n - 1)$, we have

 $$P(T_H) - \tfrac{1}{2}(n^2 + n - 1)\,\frac{n-1}{n}\,\hat{\sigma} \le C(T)$$

 and hence Eq. (4-1).
 
 To show that the inequality in (4-2) is valid, let $A$ be an arbitrary
 list in the set $\{A_1, A_2, \dots, A_n\}$. Suppose that in the tree $T$, $A$ is
 at level $\ell$ as shown in Fig. 4-2, where
 $T_1, T_2, \dots, T_{\ell-1}, T_\ell$ are subtrees of $T$. Let
 $S_1, S_2, \dots, S_\ell$ be the sets of leaves and $R_1, R_2, \dots, R_\ell$ be
 the lists corresponding to the roots of $T_1, T_2, \dots, T_\ell$,
 respectively. We note that the weight of the node $j$ in $T'$, $t_j'$, is

 $$t_j' = a + \sum_{k=j+1}^{\ell} \sum_{A_i \in S_k} a_i
        - \frac{n-1}{n}\hat{\sigma}\Big(\sum_{k=j+1}^{\ell} |S_k| + 1\Big) \qquad (4\text{-}3)$$

 On the other hand, from Lemma 4-1, the length of the list $R_{j+1}$,
 $r_{j+1}$, is such that

 $$r_{j+1} \ge \sum_{A_i \in S_{j+1}} a_i - (|S_{j+1}| - 1)\hat{\sigma}$$

 and the length of the list corresponding to the node $(j+1)$ is lower bounded
 by

 $$a + \sum_{k=j+2}^{\ell} \Big(\sum_{A_i \in S_k} a_i - \hat{\sigma}|S_k|\Big)$$

 Hence the merge time required to generate the list corresponding to the
 node $j$ in the tree $T$, $t_j$, is such that

 $$t_j \ge a + \sum_{k=j+1}^{\ell} \sum_{A_i \in S_k} a_i
       - \Big(\sum_{k=j+1}^{\ell} |S_k| - 1\Big)\hat{\sigma}$$

 Since†

 $$\frac{n-1}{n}\Big(\sum_{k=j+1}^{\ell} |S_k| + 1\Big) \ge \Big(\sum_{k=j+1}^{\ell} |S_k| - 1\Big)$$

 for any $n \ge 1$, for all $j = 0, 1, \dots, \ell-1$, $t_j' \le t_j$ and the
 inequality in (4-2) follows. ■

 † The inequality $\frac{n-1}{n}(x+1) \ge x - 1$ is equivalent to
 $2n \ge x + 1$. Since we have $x \le n$ and $n \ge 1$, the inequality is always
 valid.

 Figure 4-2. The tree $T$, with the leaf $A$ at level $\ell$ and the subtrees
 $T_1, T_2, \dots, T_\ell$.
 
 It follows from Theorem 4-1 that the cost of a Huffman tree is
 upper bounded as follows.

 Corollary 4-1

 $$C(T_H) \le C(T_O) + \tfrac{1}{2}(n^2 + n - 1)\,\frac{n-1}{n}\,\hat{\sigma} \qquad (4\text{-}4)$$

 where the tree $T_O$ is optimal.
 
 Although the bound in (4-4) is not tight, it does indicate that in
 most cases of practical interest, the cost of the Huffman tree does not
 differ substantially from that of an optimal tree for processing a query of
 the form $A_1 + A_2 + \dots + A_n$. For example, when the lengths of the lists
 are approximately equal,

 $$C(T_O) \approx \ell (1 - \gamma)\, n \log_2 n$$

 where $\ell$ is the approximate list length. Since
 $\tfrac{1}{2}(n^2 + n - 1)\frac{n-1}{n}\hat{\sigma}$ is approximately equal to
 $\tfrac{1}{2} n^2 \hat{\sigma}$ for large $n$, its value becomes comparable to
 $C(T_O)$ only when
 
 2 % 2 n 
 
 In £ / Ov 
 ^ v 1 - a / 
 
 a 
 
 
 
 (For $n = 64$, $\gamma \approx 0.20$; for $n = 8$, $\gamma \approx 0.43$. That
 is, only for very large overlaps.)

 Similarly, we bound the cost of a Huffman tree in terms of the
 maximum overlap $\hat{\phi}$.
 
 Lemma 4-2
 
 * k 
 
 i=l 
 
 Again, $r_k$ is the length of the list corresponding to the expression
 $A_1 + A_2 + \dots + A_k$.

 Proof: Let $A_k$ be the shortest list among $A_1, A_2, \dots, A_k$ and
 $R_{k-1}$ be the list corresponding to the expression
 $A_1 + A_2 + \dots + A_{k-1}$. The length of $R_{k-1}$, $r_{k-1}$, is clearly
 larger than the length of any particular list $A_1, A_2, \dots$, or
 $A_{k-1}$. Hence
 
 r k > = a i " «k 
 
 i=l 
 
 > S a. (1 - r) * 
 
 - i=1 i k 
 
 Hence, we obtain the following bound for the cost of a Huffman tree for
 the lists $A_1, A_2, \dots, A_n$.

 Theorem 4-2

 $$C(T_H) \le P(T_H) \le \frac{1}{1 - \tfrac{1}{2}\hat{\phi}}\, C(T) \qquad (4\text{-}5)$$

 where $T$ is any arbitrary tree for $A_1 + A_2 + \dots + A_n$.

 Proof: The proof is similar to that of Theorem 4-1. Suppose that $T'$ is a
 tree obtained from $T$ when the weight of $A_i$ is replaced by
 $a_i (1 - \tfrac{1}{2}\hat{\phi})$. We claim that

 $$P(T') \le C(T) \qquad (4\text{-}6)$$

 and hence the inequalities in (4-5) follow.
 
 To show that the inequality (4-6) is valid, we again look at the
 tree in Fig. 4-2. The weight of the node $j$ in the tree $T'$, $t_j'$, is such
 that
 
 I 
 
 t! > (a + Z 
 
 3 N +1A A 
 
 a i )(l-|*) 
 
 On the other hand, the length of the list R. -. , r. _,, is such that 
 
 0+1' j+1' 
 
 r i+l - L a i ^ " Tq 1 
 
 a+1 A.eS. n X |S d+l' 
 
 and the length of the list corresponding to the node (j+1) is lower bounded by 
 
 * V 
 
 a + Z Z a. 1 
 
 k=j+2 A.eS n V \ 
 
 J l k / \ 1+ Z S 
 
 £ 
 
 k=o+2 
 
 k 1 
 
 Hence, the merge time required to generate the list corresponding to the node, 
 
 t.» is such that 
 
 ( l 
 
 t . > [a + I Z a. 1 min 
 
 1 - 
 
 0+1' I 
 
 1 - 
 
 i + z s 
 
 k=o'+2 
 
 k 1 
 
 
 The inequality in (4-6) follows immediately from this expression.

 We again note that the bound in (4-5) is not tight. However,
 we can conclude from it that for most cases of practical interest
 ($\hat{\phi} < 0.1$) the total merge time is quite close to the minimum when
 the merge order is specified by a Huffman tree.

 Performance of Suboptimal Algorithms
 
 Clearly, when the length of overlap is not negligible, the algorithms 
 described in Section III are no longer optimal. However, we have the 
 following special cases : 
 
 Theorem 4-3

 When $\sum_{i=1}^{n} a_i \le b$, $(A_1 + A_2 + \dots + A_n) \cdot B$ is optimal
 among all Boolean expressions equivalent to it. Hence, an optimal merge tree
 is $T_u$ as shown in Fig. 4-3a when $T$ is an optimal merge tree for
 $(A_1 + A_2 + \dots + A_n)$.

 Proof: Let $C_u(T)$ denote the cost of the tree $T_u$ in Fig. 4-3a. Suppose
 that $T(A_1 + A_2 + \dots + A_n)$ is an arbitrary tree for the lists
 $A_1, A_2, \dots, A_n$. Let $R$ be the list corresponding to the Boolean
 expression $A_1 + A_2 + \dots + A_n$ and $r$ be its length. Clearly,

 $$C_u(T) = C(T) + r + b$$

 We need to show that $C_u(T)$ is less than or equal to the cost of any tree
 $T_{m+1}$ shown in Fig. 4-3b. Let
 $R_1, R_2, \dots, R_{m-1}, R_{m1}, R_{m2}$ be the lists corresponding to the
 roots of $T_1, T_2, \dots, T_{m-1}, T_{m1}, T_{m2}$, which are the $m+1$
 subtrees of $T(A_1 + A_2 + \dots + A_n)$ obtained by removing the roots of
 larger subtrees of $T$. Let $r_1, r_2, \dots, r_{m-1}, r_{m1}, r_{m2}$ be the
 lengths of these lists, respectively. Furthermore, let
 $D_1, D_2, \dots, D_{m-1}, D_{m1}, D_{m2}$ be the lists obtained by AND merging
 $R_1, R_2, \dots, R_{m-1}, R_{m1}, R_{m2}$, respectively, with the list $B$.
 The cost of the tree $T_{m+1}$, $C_{m+1}$, is

 $$C_{m+1} = \sum_{i=1}^{m-1} [C(T_i) + r_i] + C(T_{m1}) + C(T_{m2}) + r_{m1} + r_{m2} + (m+1)b
           + C\big(T(D_1 + D_2 + \dots + D_{m-1} + D_{m1} + D_{m2})\big)$$
 
 Figure 4-3. (a) The merge tree $T_u$. (b) The merge tree $T_{m+1}$.
 
 Let $T_m$ be a subtree of $T$ obtained by OR merging $R_{m1}$ and $R_{m2}$
 together. We have

 $$C(T_m) = \sum_{i=1}^{m} [C(T_i) + r_i] + mb + C\big(T(D_1 + D_2 + \dots + D_m)\big)$$

 where $R_m$ is the root of $T_m$ and $D_m$ is the list obtained by AND merging
 $R_m$ with $B$. Since

 $$C(T_m) = C(T_{m1}) + C(T_{m2}) + r_{m1} + r_{m2}$$

 $$C(T_{m+1}) - C(T_m) = b - (r_{m1} + r_{m2} - r_m)
   + C\big(T(D_1 + D_2 + \dots + D_{m-1} + D_{m1} + D_{m2})\big)
   - C\big(T(D_1 + D_2 + \dots + D_m)\big)$$

 That this difference is larger than or equal to zero follows from

 $$r_{m1} + r_{m2} - r_m \ge 0$$

 $$b \ge \sum_{i=1}^{n} a_i \ge (r_{m1} + r_{m2} - r_m)$$

 and that $D_m$ is a list obtained by OR merging the lists $D_{m1}$ and
 $D_{m2}$. Hence, $C(T_m)$ is a monotone nondecreasing function of $m$. In other
 words, for any tree $T_m$, there is a tree $T_u$ whose cost is less than
 $C_m$. Hence, we have the statement of the theorem.
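 A small numerical case illustrates the theorem. The Python sketch below
 compares the factored form $(A_1 + A_2) \cdot B$ with the distributed form
 $A_1 \cdot B + A_2 \cdot B$ on invented lists satisfying $a_1 + a_2 \le b$.
 The helper names `or_merge` and `and_merge` are hypothetical, and each merge
 of lists of lengths x and y is assumed to take x + y time.

```python
def or_merge(x, y):
    """Return (merge time, resulting list) for an OR merge of two sets."""
    return len(x) + len(y), x | y


def and_merge(x, y):
    """Return (merge time, resulting list) for an AND merge of two sets."""
    return len(x) + len(y), x & y


A1, A2 = {1, 2}, {3, 4}          # a1 + a2 = 4
B = set(range(1, 11))            # b = 10 >= a1 + a2

# Factored form: (A1 + A2) . B
c_or, r = or_merge(A1, A2)
c_and, _ = and_merge(r, B)
cost_factored = c_or + c_and

# Distributed form: A1 . B + A2 . B
c1, d1 = and_merge(A1, B)
c2, d2 = and_merge(A2, B)
c3, _ = or_merge(d1, d2)
cost_distributed = c1 + c2 + c3
```

 The factored form reads the long list $B$ once (cost 18), while the
 distributed form reads it twice (cost 28).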
 
 V. Summary

 The results in Section III allow us to find an optimal form of
 any nested Boolean expression in which (i) all variables are distinct, and
 (ii) the complement of any variable $B$ (or any expression) can appear
 only in the form

 $$A_1 \cdot A_2 \cdot \dots \cdot A_n \cdot \bar{B}$$

 with at least one of the $A_i$'s not complemented. We already noted that
 restriction (ii) leads to no real loss of generality since it is indeed a
 restriction imposed on the form of the query in most inverted list document
 retrieval systems.
 
 The problem of finding the optimal forms of Boolean expressions in
 which not all variables are distinct is more difficult than that of finding
 the minimum gate realization of the Boolean expression (when the fan-in of
 the gates is 2). Our problem is further complicated by the fact that if one
 of the terms $A + B$, $A \cdot B$, $A \cdot \bar{B}$, $\bar{A} \cdot B$ is
 generated by merging lists $A$ and $B$, the other three terms are also
 obtained at no additional cost.
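 The observation that one merge yields several results at once can be made
 concrete. The following Python sketch is an illustration only, not the
 report's procedure: it makes a single pass over two sorted lists of distinct
 elements and collects the union, the intersection and the two differences.
 The function name `merge_all` is hypothetical, and reading the four terms as
 these four lists is an assumption.

```python
def merge_all(a, b):
    """Single pass over two sorted lists of distinct elements.

    Returns (a_or_b, a_and_b, a_and_not_b, b_and_not_a).
    """
    i = j = 0
    union, both, only_a, only_b = [], [], [], []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            union.append(a[i])
            both.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            union.append(a[i])
            only_a.append(a[i])
            i += 1
        else:
            union.append(b[j])
            only_b.append(b[j])
            j += 1
    # At most one of the two tails is non-empty at this point.
    union += a[i:] + b[j:]
    only_a += a[i:]
    only_b += b[j:]
    return union, both, only_a, only_b


result = merge_all([1, 3, 5, 7], [3, 4, 7, 9])
```

 Every element of either input is read exactly once, so the pass takes time
 proportional to a + b regardless of which of the four outputs is wanted.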
 
 When the lengths of overlaps between lists cannot be neglected in
 computing the total merge time, the Boolean expressions determined by the
 algorithms described in Section III are no longer optimal. The performance
 of these algorithms can be bounded in terms of the maximum overlap between
 the lists, as done in Section IV.
 
 