ON RANDOM 3-2 TREES

by

Andrew Chi-Chih Yao

October 1974

Report No. UIUCDCS-R-74-679

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

Research supported by NSF Grant GJ-41538.

Digitized by the Internet Archive in 2013
http://archive.org/details/onrandom32trees679yaoa

ABSTRACT

It is shown that $n(N)$, the average number of nodes in an $N$-key random 3-2 tree, satisfies the inequality $0.70N - \epsilon$ [...]

[...]

\[
A_1(N) = \cdots = A_1(N-1) + \frac{1}{N}\bigl(-4A_1(N-1) + 6A_5(N-1) + 6A_6(N-1) + 6A_7(N-1) + 6A_8(N-1) + 6A_9(N-1)\bigr).
\]
  x_1'    x_2'    x_3'    x_4'    x_5'    x_6'    x_7'    x_8'    x_9'    Probability
  x_1-1   x_2+1   x_3     x_4     x_5     x_6     x_7     x_8     x_9     4x_1/N
  x_1     x_2-1   x_3+1   x_4     x_5     x_6     x_7     x_8     x_9     2x_2/N
  x_1     x_2-1   x_3     x_4+1   x_5     x_6     x_7     x_8     x_9     3x_2/N
  x_1     x_2     x_3-1   x_4     x_5+1   x_6     x_7     x_8     x_9     6x_3/N
  x_1     x_2     x_3     x_4-1   x_5+1   x_6     x_7     x_8     x_9     4x_4/N
  x_1     x_2     x_3     x_4-1   x_5     x_6+1   x_7     x_8     x_9     2x_4/N
  x_1     x_2     x_3     x_4     x_5-1   x_6     x_7+1   x_8     x_9     2x_5/N
  x_1     x_2     x_3     x_4     x_5-1   x_6     x_7     x_8+1   x_9     2x_5/N
  x_1+2   x_2     x_3     x_4     x_5-1   x_6     x_7     x_8     x_9     3x_5/N
  x_1     x_2     x_3     x_4     x_5     x_6-1   x_7+1   x_8     x_9     4x_6/N
  x_1+2   x_2     x_3     x_4     x_5     x_6-1   x_7     x_8     x_9     3x_6/N
  x_1     x_2     x_3     x_4     x_5     x_6     x_7-1   x_8     x_9+1   2x_7/N
  x_1+1   x_2+1   x_3     x_4     x_5     x_6     x_7-1   x_8     x_9     6x_7/N
  x_1     x_2     x_3     x_4     x_5     x_6     x_7     x_8-1   x_9+1   2x_8/N
  x_1+1   x_2+1   x_3     x_4     x_5     x_6     x_7     x_8-1   x_9     6x_8/N
  x_1+1   x_2     x_3+1   x_4     x_5     x_6     x_7     x_8     x_9-1   6x_9/N
  x_1     x_2+2   x_3     x_4     x_5     x_6     x_7     x_8     x_9-1   3x_9/N

Table 1. Transition under a random insertion: a tree of class $(2; x_1, x_2, \ldots, x_9)$ becomes a tree of class $(2; x_1', x_2', \ldots, x_9')$. Each row gives the values of $x_1', x_2', \ldots, x_9'$ for a possible resulting class, with its probability of occurrence in the last column.

Similar formulas for $A_2(N), \ldots, A_9(N)$ can also be derived. These relations can be compactly written in the following form: let $A(N)$ be the 9-component column vector $(A_i(N))$; then

\[
A(N) = \Bigl(I + \frac{1}{N} D\Bigr) A(N-1) \tag{4}
\]

where $I$ is the $9 \times 9$ identity matrix and $D$ is given by

\[
D = \begin{pmatrix}
-4 & 0 & 0 & 0 & 6 & 6 & 6 & 6 & 6 \\
4 & -5 & 0 & 0 & 0 & 0 & 6 & 6 & 6 \\
0 & 2 & -6 & 0 & 0 & 0 & 0 & 0 & 6 \\
0 & 3 & 0 & -6 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 6 & 4 & -7 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 2 & 0 & -7 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 2 & 4 & -8 & 0 & 0 \\
0 & 0 & 0 & 0 & 2 & 0 & 0 & -8 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 2 & 2 & -9
\end{pmatrix} \tag{5}
\]

To solve $A(N)$ from (4), we define a 9-component column vector $a(N) = (a_i(N))$ by

\[
a_i(N) = A_i(N)/(N+1). \tag{6}
\]

In terms of $a(N)$, (4) can be written as

\[
a(N) = \Bigl(I + \frac{1}{N+1}(D - I)\Bigr) a(N-1). \tag{7}
\]

To solve (7), the following two lemmas will be useful. Since recurrence relations of the form (7) have been studied in another context [5], we shall omit the proofs here.

Lemma 2.10: Let $G$ be a $p \times p$ real matrix with simple eigenvalues $\lambda_0, \lambda_1, \lambda_2, \ldots, \lambda_{p-1}$, where $\lambda_0 = 0$ and $\mathrm{Re}\,\lambda_{p-1} \le \mathrm{Re}\,\lambda_{p-2} \le \cdots \le \mathrm{Re}\,\lambda_1 < 0$. If $v(1), v(2), \ldots, v(N), \ldots$
is a sequence of $p$-component vectors satisfying

\[
v(j) = \Bigl(I + \frac{1}{j+1} G\Bigr) v(j-1),
\]

then there exists a vector $u$ such that

(i) $Gu = 0$,
(ii) $|(v(N))_i - u_i| < C N^{\mathrm{Re}\,\lambda_1}$ for some constant $C$ and all $i$, $N$,

where $(v(N))_i$, $u_i$ denote the $i$-th components of $v(N)$ and $u$ respectively.

Corollary: $v(N) \to u$ as $N \to \infty$.

Lemma 2.11: Let $k$ be a positive integer; then the polynomial

\[
g(\lambda) = \prod_{j=0}^{l} (\lambda + k + j) - \prod_{j=0}^{l} (k + j)
\]

has only simple roots. Furthermore, the real parts of all the roots except the root $\lambda = 0$ are negative.

We now turn to the determination of $a(N)$ by (7). An explicit calculation shows that the characteristic polynomial of $D - I$ is

\[
-\lambda(\lambda+7)(\lambda+8)(\lambda+9)\bigl((\lambda+9)^5 + 25(\lambda+9)^3 + 210(\lambda+9)^2 + 136(\lambda+9) + 384\bigr).
\]

The roots of the polynomial, which are the eigenvalues of $D - I$, are $0$, $-6.55 \pm 6.25i$, $-7$, $-8$, $-9$, $-9.23 \pm 1.37i$, $-13.44$. Thus, $D - I$ satisfies the conditions on $G$ in Lemma 2.10. Therefore, there exists a vector $u = (u_i)$ such that

\[
|a_i(N) - u_i| < C_1 N^{-6.55} \quad \text{for some constant } C_1, \tag{8}
\]

where $u$ satisfies

\[
(D - I)u = 0. \tag{9}
\]

In terms of the $u_i$'s, we can express Lemma 2.9 as follows:

Lemma 2.12:

\[
(N+1)\Bigl(\frac{7}{2}\sum_{i=1}^{3} u_i + \frac{9}{2}\sum_{i=4}^{9} u_i\Bigr) - \frac{1}{2} - C N^{-5.55} \le n(N) \le (N+1)\Bigl(4\sum_{i=1}^{3} u_i + 5\sum_{i=4}^{9} u_i\Bigr) - 1 + C N^{-5.55}
\]

for some constant $C$.

Proof: From equations (6) and (8), we obtain

\[
|A_i(N) - (N+1)u_i| < C N^{-5.55} \quad \text{for some constant } C. \tag{10}
\]

The lemma follows immediately from (10) and Lemma 2.9. □

Now, to find the values of the $u_i$'s, we observe that equation (9) determines $u$ up to a constant factor. The normalization constant can be determined as follows: Lemma 2.7 and equation (6) lead to the equation

\[
4a_1(N) + 5a_2(N) + 6a_3(N) + 6a_4(N) + 7a_5(N) + 7a_6(N) + 8a_7(N) + 8a_8(N) + 9a_9(N) = 1. \tag{11}
\]

Since $a_i(N) \to u_i$ as $N \to \infty$, (11) implies that

\[
4u_1 + 5u_2 + 6u_3 + 6u_4 + 7u_5 + 7u_6 + 8u_7 + 8u_8 + 9u_9 = 1, \tag{12}
\]

which is the equation we need in order to determine the normalization constant.
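As a numerical cross-check, the linear system formed by (9) and (12) can be solved in exact rational arithmetic. The following is only a minimal sketch and not part of the original report: it assumes the matrix $D$ as read off from the transitions in Table 1, the names `solve_stationary`, `D` and `W` are introduced here for illustration, and the coefficients 7/2, 9/2, 4 and 5 used for the two leading constants are an assumed reading of Lemma 2.12.

```python
from fractions import Fraction

# Assumed reading of Table 1: D[i][j] is the coefficient of A_{j+1}(N-1)/N
# in the recurrence for A_{i+1}(N) (indices 0..8 stand for types 1..9).
D = [
    [-4, 0, 0, 0, 6, 6, 6, 6, 6],
    [4, -5, 0, 0, 0, 0, 6, 6, 6],
    [0, 2, -6, 0, 0, 0, 0, 0, 6],
    [0, 3, 0, -6, 0, 0, 0, 0, 0],
    [0, 0, 6, 4, -7, 0, 0, 0, 0],
    [0, 0, 0, 2, 0, -7, 0, 0, 0],
    [0, 0, 0, 0, 2, 4, -8, 0, 0],
    [0, 0, 0, 0, 2, 0, 0, -8, 0],
    [0, 0, 0, 0, 0, 0, 2, 2, -9],
]
W = [4, 5, 6, 6, 7, 7, 8, 8, 9]  # weights of the normalization (12)


def solve_stationary(D, W):
    """Solve (D - I) u = 0 together with W . u = 1 by Gauss-Jordan elimination."""
    n = len(D)
    # The rows of D - I are linearly dependent, so the last one is replaced
    # by the normalization equation W . u = 1.
    rows = [[Fraction(D[i][j] - (i == j)) for j in range(n)] + [Fraction(0)]
            for i in range(n - 1)]
    rows.append([Fraction(w) for w in W] + [Fraction(1)])
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pv = rows[col][col]
        rows[col] = [x / pv for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [a - f * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][n] for i in range(n)]


u = solve_stationary(D, W)
# Leading constants of the lower and upper bounds (assumed coefficients).
lower = Fraction(7, 2) * sum(u[:3]) + Fraction(9, 2) * sum(u[3:])
upper = 4 * sum(u[:3]) + 5 * sum(u[3:])
print([str(x) for x in u])
print(float(lower), float(upper))
```

Exact fractions are used instead of floating point so that the components of $u$ can be compared digit for digit with tabulated rational values.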
Therefore, solving equations (9) and (12), we obtain:

\[
\begin{aligned}
u_1 &= 414/7991 \approx 0.052, & u_2 &= 396/7991 \approx 0.050, & u_3 &= 912/55937 \approx 0.016, \\
u_4 &= 1188/55937 \approx 0.021, & u_5 &= 1278/55937 \approx 0.023, & u_6 &= 297/55937 \approx 0.005, \\
u_7 &= 416/55937 \approx 0.007, & u_8 &= 284/55937 \approx 0.005, & u_9 &= 20/7991 \approx 0.003.
\end{aligned} \tag{13}
\]

Substituting the values of the $u_i$'s into the inequality in Lemma 2.12, we obtain the main result of this section.

Theorem 2.13: $0.70N + 0.2 - C N^{-5.55} \le n(N) \le 0.79N - 0.2 + C N^{-5.55}$ for some constant $C$.

The technique used in this subsection can be used to compute higher moments of the number of nodes $n(T)$. For example, we can set up a system of recurrence relations of the form of equation (4) for the quantities $A_{ij}(N)$, where

\[
A_{ij}(N) = \sum_{x_1, \ldots, x_9} P_N(2; x_1, \ldots, x_9)\, x_i x_j, \qquad i, j = 1, 2, \ldots, 9.
\]

Determination of the $A_{ij}(N)$'s will then lead (by Lemma 2.8) to bounds on the average value of $n(T)^2$ for $N$-key random 3-2 trees.

2.3 Higher Order Analysis

The methods used in the previous two subsections obviously can be generalized to obtain better approximations of $n(N)$. By computing the average number of nodes in the lowest $k$ levels, we can determine $n(N)$ to an accuracy of $\frac{1}{2(2^k - 1)} \times 100\%$. This is so because at most $1/2^k$ of the keys are in nodes above the lowest $k$ levels.

This general procedure, however, is not very useful in practice. If $F(k)$ is the number of different types of trees of height $k$, then the solution of this problem involves the manipulation of an $F(k) \times F(k)$ matrix. It is not difficult to show that $F(k) = \frac{1}{2} F(k-1)\,(F(k-1)+1)^2$. We have analyzed the cases $k = 1, 2$, where $F(1) = 2$ and $F(2) = 9$. However, $F(3) = 450$ is already such a large number that carrying out the computation appears to be very difficult.

3. An Analysis of B-trees

3.1 Introduction

A natural extension of 3-2 trees is the idea of "B-trees" [2] [4].
A B-tree of order $m$ is a tree in which the number of keys contained in any internal node other than the root is no greater than $m-1$ and no less than $\lceil m/2 \rceil - 1$. 3-2 trees are just B-trees of order 3. To add a key to a node, we insert the new key among the other keys and check whether the node now contains more than $m-1$ keys. If it does not, the insertion is complete. Otherwise, we split the node into 2 nodes, one of which contains the smallest $\lceil m/2 \rceil - 1$ keys and the other the $m - \lceil m/2 \rceil$ largest keys; the one remaining key is then inserted into the parent node.

Random B-trees are defined in exactly the same way as random 3-2 trees are defined. We shall study $n_m(N)$, the average number of nodes in the B-trees of order $m$ resulting from $N$ random insertions. An obvious bound was given in [4]:

\[
\frac{1}{m-1} \le \frac{n_m(N)}{N} \le \frac{1}{\lceil m/2 \rceil - 1} + \frac{1}{N}. \tag{14}
\]

In this section we shall consider the nodes at the lowest level, and do an analysis similar to the first order analysis done in Section 2.1. As we shall see, this analysis yields better results than the corresponding analysis for 3-2 trees. This is so because a greater proportion of the keys in a B-tree are stored in the lowest internal nodes as $m$ becomes larger.

Define the following functions:

\[
H(N) = \sum_{k=1}^{N} \frac{1}{k} \quad \text{for } N \ge 1,
\]

\[
r(m) = \begin{cases}
\dfrac{1}{m+1}\bigl(H(m) - H(m/2)\bigr)^{-1} & \text{if } m \text{ is even,} \\[2ex]
\dfrac{1}{m+1}\bigl(H(m+1) - H((m+1)/2)\bigr)^{-1} & \text{if } m \text{ is odd.}
\end{cases}
\]

It is well known [6] that $H(m) \sim \ln m + 0.58 + \frac{1}{2m} + \cdots$. A simple computation shows that $r(m) = \frac{1}{m \ln 2} + O\bigl(\frac{1}{m^2}\bigr)$. Our new bounds on $n_m(N)$ are given below:

Theorem 3.1: For any $\epsilon > 0$ and fixed $m$,

\[
\Bigl(1 + \frac{1}{m-1}\Bigr) r(m) - \epsilon \le \frac{n_m(N)}{N} \le \Bigl(1 + \frac{1}{\lceil m/2 \rceil - 1}\Bigr) r(m) + \epsilon
\]

when $N$ is sufficiently large.

Corollary: For any $\epsilon > 0$ and fixed $m$,

\[
\Bigl|\frac{n_m(N)}{N} - \frac{1}{m \ln 2}\Bigr| < \frac{C}{m^2} + \epsilon
\]

for all sufficiently large $N$, where $C$ is a constant independent of $m$ and $N$.

The corollary follows from Theorem 3.1 and the approximation of $r(m)$ given earlier. If all the nodes in a B-tree of order $m$ contained $m-1$ keys, there would be $N/(m-1)$ nodes.
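To make the size of these quantities concrete, $r(m)$ and the limiting bounds of Theorem 3.1 can be evaluated directly. The sketch below is illustrative only: it assumes that $r(m)$ carries the prefactor $1/(m+1)$ in both the even and the odd case and that the theorem's bounds are $(1 + \frac{1}{m-1})\,r(m)$ and $(1 + \frac{1}{\lceil m/2 \rceil - 1})\,r(m)$ as $\epsilon \to 0$. The function names are hypothetical, and `node_ratio_bounds` requires $m \ge 3$.

```python
from fractions import Fraction
from math import log


def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


def r(m):
    """r(m) under the assumed reading: 1/(m+1) times an inverse harmonic difference."""
    if m % 2 == 0:
        return 1 / ((m + 1) * (harmonic(m) - harmonic(m // 2)))
    return 1 / ((m + 1) * (harmonic(m + 1) - harmonic((m + 1) // 2)))


def node_ratio_bounds(m):
    """Limiting lower and upper bounds on n_m(N)/N (epsilon -> 0), for m >= 3."""
    half = -(-m // 2)  # ceil(m / 2)
    return ((1 + Fraction(1, m - 1)) * r(m),
            (1 + Fraction(1, half - 1)) * r(m))


def utilization_bounds(m):
    """Bounds on N / ((m - 1) n_m(N)) implied by node_ratio_bounds."""
    low, high = node_ratio_bounds(m)
    return 1 / ((m - 1) * high), 1 / ((m - 1) * low)


for m in (3, 11, 121):
    util_low, util_high = utilization_bounds(m)
    # m * r(m) should approach 1 / ln 2 as m grows.
    print(m, float(m * r(m)), 1 / log(2), float(util_low), float(util_high))
```

For small orders the two bounds are far apart, while for large $m$ they close in on each other, which is the behavior the corollary describes.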
The ratio $N/((m-1)\,n_m(N))$ can therefore be viewed as the storage utilization [4]. Our corollary to Theorem 3.1 shows that, as $N$ becomes large, the storage utilization is essentially $\ln 2 \approx 0.69$ for fixed large $m$ (cf. equation (14)).

3.2 Proof of Theorem 3.1

We will first introduce some notation. Note that there are $m - \lceil m/2 \rceil + 1$ types of B-trees of order $m$ and height 1. As shown in Figure 5, a "type $i$" B-tree contains $i$ keys in its node, for $i = \lceil m/2 \rceil - 1, \lceil m/2 \rceil, \ldots, m-1$. A B-tree of order $m$ is said to be of class $(y_{\lceil m/2 \rceil - 1}, y_{\lceil m/2 \rceil}, \ldots, y_{m-1})$ if at the lowest level there are $y_i$ subtrees of type $i$ for each $i$. Let $P_N(y_{\lceil m/2 \rceil - 1}, y_{\lceil m/2 \rceil}, \ldots, y_{m-1})$ be the probability for a random $N$-key B-tree of order $m$ to be of class $(y_{\lceil m/2 \rceil - 1}, \ldots, y_{m-1})$.

[Figure 5. The height 1 B-trees of order $m$ consist of $m - \lceil m/2 \rceil + 1$ types: type $\lceil m/2 \rceil - 1$, type $\lceil m/2 \rceil$, ..., type $m-1$. (Shown for $m = 10$.)]

Definition 3.2:

\[
A_i(N) \stackrel{\mathrm{def}}{=} \sum_{y_{\lceil m/2 \rceil - 1}, \ldots, y_{m-1}} y_i\, P_N(y_{\lceil m/2 \rceil - 1}, \ldots, y_{m-1}), \qquad i = \lceil m/2 \rceil - 1, \ldots, m-1.
\]

For brevity, we have suppressed the dependence of $A_i(N)$ and $P_N$ on $m$ in our notation.

Lemma 3.3:

\[
\Bigl(1 + \frac{1}{m-1}\Bigr) \sum_{i=\lceil m/2 \rceil - 1}^{m-1} A_i(N) - \frac{1}{m-1} \le n_m(N) \le \Bigl(1 + \frac{1}{\lceil m/2 \rceil - 1}\Bigr) \sum_{i=\lceil m/2 \rceil - 1}^{m-1} A_i(N) - \frac{1}{\lceil m/2 \rceil - 1} + 1.
\]

Proof: Similar to the proof of Lemma 2.1. The term $+1$ appearing on the right-hand side arises from the fact that the root may contain fewer than $\lceil m/2 \rceil - 1$ keys. □

The major effort in proving Theorem 3.1 is contained in the next lemma.

Lemma 3.4: Let $g(N) \stackrel{\mathrm{def}}{=} \sum_{i=\lceil m/2 \rceil - 1}^{m-1} A_i(N)/(N+1)$. Then for any $\epsilon > 0$, $|g(N) - r(m)| < \epsilon$ for all sufficiently large $N$.

Proof: We shall assume $m = 2p$ to be an even number. The proof for odd $m$ is similar. Let $T$ be an $(N-1)$-key B-tree of order $2p$ and of class $(y_{p-1}, \ldots, y_{2p-1})$. After a random insertion, $T$ may become a B-tree of class $(y_{p-1}, \ldots, y_{i-1} - 1, y_i + 1, \ldots, y_{2p-1})$ with probability $i\,y_{i-1}/N$ (for each $i = p, \ldots, 2p-1$), or it may become a B-tree of class $(y_{p-1} + 1, y_p + 1, y_{p+1}, \ldots, y_{2p-1} - 1)$ with probability $2p\,y_{2p-1}/N$.
It follows that

\[
\begin{aligned}
A_{p-1}(N) &= A_{p-1}(N-1) + \frac{1}{N}\bigl(-p\,A_{p-1}(N-1) + 2p\,A_{2p-1}(N-1)\bigr), \\
A_p(N) &= A_p(N-1) + \frac{1}{N}\bigl(p\,A_{p-1}(N-1) - (p+1)\,A_p(N-1) + 2p\,A_{2p-1}(N-1)\bigr), \\
A_j(N) &= A_j(N-1) + \frac{1}{N}\bigl(j\,A_{j-1}(N-1) - (j+1)\,A_j(N-1)\bigr) \quad \text{for } p+1 \le j \le 2p-1.
\end{aligned} \tag{15}
\]

Denoting by $A(N)$ the $(p+1)$-component vector $(A_j(N))$, (15) can be written in matrix notation as

\[
A(N) = \Bigl(I + \frac{1}{N} B\Bigr) A(N-1) \tag{16}
\]

where $I$ is the $(p+1) \times (p+1)$ identity matrix, and $B$ is defined by

\[
B = \begin{pmatrix}
-p & & & & & 2p \\
p & -(p+1) & & & & 2p \\
 & p+1 & -(p+2) & & & \\
 & & p+2 & -(p+3) & & \\
 & & & \ddots & \ddots & \\
 & & & & 2p-1 & -2p
\end{pmatrix} \tag{17}
\]

To solve (16) for $A(N)$, we define a $(p+1)$-component vector $a(N) = (a_i(N))$ by

\[
a_i(N) = A_i(N)/(N+1). \tag{18}
\]

Equations (16) and (18) lead to the following recurrence relation:

\[
a(N) = \Bigl[I + \frac{1}{N+1}(B - I)\Bigr] a(N-1). \tag{19}
\]

The characteristic polynomial $q(\lambda)$ of $B - I$ is computed to be

\[
q(\lambda) = (-1)^{p+1} (\lambda + 2p + 1) \Bigl[\prod_{j=1}^{p} (\lambda + p + j) - \prod_{j=1}^{p} (p + j)\Bigr]. \tag{20}
\]

From Lemma 2.11, it is easy to see that the roots of $q(\lambda) = 0$ satisfy the following conditions:

(i) All roots are simple roots.
(ii) $\lambda = 0$ is a root.
(iii) The real parts of all roots except $\lambda = 0$ are negative.

Therefore, according to Lemma 2.10, there exists a vector $u = (u_{p-1}, u_p, \ldots, u_{2p-1})^T$ such that

\[
\text{(i)} \quad (B - I)\,u = 0, \tag{21}
\]

\[
\text{(ii)} \quad |a_i(N) - u_i| < C_m N^{-\epsilon_m} \quad \text{for } p-1 \le i \le 2p-1, \tag{22}
\]

where $C_m$, $\epsilon_m$ are positive constants. Summing over the $p+1$ components, (22) implies

\[
\Bigl|\sum_{i=p-1}^{2p-1} A_i(N)/(N+1) - \sum_{i=p-1}^{2p-1} u_i\Bigr| \le (p+1)\,C_m N^{-\epsilon_m}. \tag{23}
\]

Now, to determine the $u_i$'s, we note that the following equation can be proved easily (cf. the derivation of equation (12)):

\[
\sum_{i=p-1}^{2p-1} (i+1)\,u_i = 1. \tag{24}
\]

Solving (21) and (24), we obtain

\[
\begin{aligned}
u_{p-1} &= \frac{1}{(p+1)(2p+1)} \bigl[H(2p) - H(p)\bigr]^{-1}, \\
u_i &= \frac{1}{(i+1)(i+2)} \bigl[H(2p) - H(p)\bigr]^{-1}, \qquad p \le i \le 2p-1.
\end{aligned} \tag{25}
\]

Therefore,

\[
\sum_{i=p-1}^{2p-1} u_i = \frac{1}{2p+1} \bigl(H(2p) - H(p)\bigr)^{-1} = r(m). \tag{26}
\]

Finally, substituting (26) in (23), the lemma is obtained. □

Proof of Theorem 3.1: It is a direct consequence of Lemmas 3.3 and 3.4. □

4. Conclusion

We have derived bounds on the average number of nodes in an $N$-key random B-tree, which is essentially the average number of "splittings" in building the tree.
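The recurrence (15) behind these bounds can also be iterated directly as a sanity check on the limit (26). The following sketch is illustrative only and rests on a stated assumption: the iteration starts at $N = 2p-1$ keys, all held in a single full node of type $2p-1$, which is the last moment before the first split. The names `iterate_expected_counts`, `g` and `r_even` are hypothetical.

```python
from fractions import Fraction


def harmonic(n):
    """H(n) = 1 + 1/2 + ... + 1/n as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, n + 1))


def r_even(p):
    """Right-hand side of (26): 1 / ((2p + 1) (H(2p) - H(p)))."""
    return 1 / ((2 * p + 1) * (harmonic(2 * p) - harmonic(p)))


def iterate_expected_counts(p, n_max):
    """Iterate recurrence (15) for order m = 2p up to N = n_max keys.

    A[i] is the expected number of lowest-level nodes holding i keys.
    Assumed start: N = 2p - 1 keys stored in one full node.
    """
    A = {i: Fraction(0) for i in range(p - 1, 2 * p)}
    A[2 * p - 1] = Fraction(1)
    for N in range(2 * p, n_max + 1):
        new = {}
        new[p - 1] = A[p - 1] + (-p * A[p - 1] + 2 * p * A[2 * p - 1]) / N
        new[p] = A[p] + (p * A[p - 1] - (p + 1) * A[p]
                         + 2 * p * A[2 * p - 1]) / N
        for j in range(p + 1, 2 * p):
            new[j] = A[j] + (j * A[j - 1] - (j + 1) * A[j]) / N
        A = new
    return A


def g(p, n_max):
    """g(N) of Lemma 3.4: expected number of lowest-level nodes over N + 1."""
    return sum(iterate_expected_counts(p, n_max).values()) / (n_max + 1)


for p in (2, 3, 5):
    print(p, float(g(p, 200)), float(r_even(p)))
```

Each step preserves the identity $\sum_i (i+1) A_i(N) = N + 1$, which is a convenient invariant to monitor when experimenting with other starting states.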
One interesting result is that the asymptotic storage utilization is approximately $\ln 2 \approx 69\%$ for B-trees of high orders. This seems to agree well with one set of experimental data ($m = 121$, $N = 5000$, storage utilization $= 67\%$, see [4]).

Many problems about 3-2 trees remain to be investigated. What is the average number of splittings on the $N$-th random insertion [3]? How should 3-2 trees be analyzed when deletions are also present? Some upper bounds can be obtained for the former problem using the present approach, but it appears that very different methods would be required to answer these questions satisfactorily.

References

[1] Aho, Hopcroft, Ullman, The Design and Analysis of Computer Algorithms, Addison-Wesley, 1974.

[2] Bayer, R. and E. McCreight, Acta Informatica (1972), pp. 173-189.

[3] Chvatal, V., D. A. Klarner, and D. E. Knuth, Selected Combinatorial Research Problems, Problem 37, STAN-CS-72-292, 1972.

[4] Knuth, D. E., The Art of Computer Programming, Vol. 3, Addison-Wesley, 1973, pp. 468-480.

[5] Knuth, D. E., The Art of Computer Programming, Vol. 3, Addison-Wesley, 1973, pp. 679-680, answer to exercise 10.

[6] Knuth, D. E., The Art of Computer Programming, Vol. 1, Addison-Wesley, 1968, p. 74.