Table 2 - Two different sampling techniques as estimators of query performance. Precisions are at the 95% confidence level.

Queries, in the order listed:
  state=98 (means a ship)
  state=uk
  state=06
  fuel=145
  fuel=jp4
  fuel=jp4 ! fuel=145
  command=sac
  command=mac ! command=sac
  fuel=jp4 & command=mac
  receipts_dod>100000
  stock<10000
  state=98 & fuel=145
  state=98 & fuel=jp4
  fuel=jp4 & stock<10000
  stock<5000 & stock>0
  stock<5000 & stock>0 & state=98
  open_inventorystock & command=sac
  state=98 & stock<10000
  state=98 & stock<10000 & fuel=jp4
  receipts_dod>receipts_comm
  command≠sac & command≠mac & fuel=jp4
  location>us
  location>us & state=98
  receipts_dod>10
  receipts_comm>1
  receipts_comm>10 & receipts_dod>10
  command=pac/ships ! command=pacflt
  (command=pac/ships ! command=pacflt) & state=98

Columns, left to right: actual volume produced; estimated from 100 random tuples; estimated from 100 random locations; estimated from classical assumptions.

  1752    1900±765    1755±626     109 e
   190     400±332      93±176     109 e
   701     700±498     994±731     109 e
  1015     600±463    1196±330     400 e
  1227    1400±677    1398±355     400 e
  2242    2000±780    2594±620     800 e
   464     300±333     186±352      86 e
   607     700±498     419±560     172 e
    27       0±194 a    47±38        --
     9       0±194 a    31±59         ?
  8268    8300±733    7812±1167      ?
    55     100±194     124±123     177 i
     4       0±194 a    16±29        --
   633     900±564     668±261    1014 i
  5130    5400±972    4923±1250      ?
    55     100±194      62±92      399 i
  1953    1800±749    2019±538       ?
   239     100±194     109±205      90 i
  1608    1300±749    1460±493    1449 i
     3       0±194 a    16±29      178 i
  1528    1600±715    1600±452       ?
  1110    1400±677    1305±346    1153 i
  2671    2300±621    2794±890       ?
  1740    2000±780    1739±627     467 i
  1554    1600±715    1584±472       ?
  2505    2500±844    2252±665       ?
   224       0±194 a   248±203     389 i
   663     300±333     373±236     172 e
   583     300±333     311±225     116 i

a. Actually, since p=0, this should be 0±0. We give a more conservative precision based upon p=0.01.
e. This is based on the assumption that each distinct domain value is equally probable.
i. This is based upon independence between domains.

... assumptions.

Probability of correct guess. Having obtained p_R, d_R, p_S, and d_S for the two relations in the query, one can decide which intermediate result to ship over the net. It would be useful to know the probability that the choice made is, in fact, optimal. If |R| and |S| are the volumes of the relations R and S, then the estimates of |(R|B_R)| and |(S|B_S)| are p_R|R| and p_S|S|. Assume that p_R|R| > p_S|S|. The best strategy then is to ship the relation (S|B_S). The exact values of |(R|B_R)| and |(S|B_S)| will usually be different from the predicted values, but as long as |(R|B_R)| > |(S|B_S)|, the correct decision will have been made. The confidence in the decision is merely the probability that |(R|B_R)| > |(S|B_S)| is true. If it is assumed that the two values are normally distributed around the estimated values, then it can be shown that the "probability of correct guess", if relation (S|B_S) is shipped, is

    \Phi\!\left( \frac{(p_R|R| - p_S|S|)\, z}{\sqrt{(d_R|R|)^2 + (d_S|S|)^2}} \right),

where \Phi is the standard normal distribution function and z is the normal deviate corresponding to the confidence level of the precisions (z = 1.96 at 95%). A derivation of this is given in Appendix A. Sample values of this function, for d_R|R| ≈ d_S|S|, are presented in Table 3. For instance, suppose |R| = |S| = 10^4, p_R = 0.1000, p_S = 0.084, and d_R = d_S = 0.02; then d_R|R| = d_S|S| = 200, and p_R|R| - p_S|S| = 1000 - 840 = 160. From the table, one can deduce that by shipping the output from S, one will have made the correct decision 86.6% of the time.

                              p_R|R| - p_S|S|
  d_R|R| = d_S|S|      0      40      80     120     160     200
         20         0.500   0.997   1.000   1.000   1.000   1.000
         40         0.500   0.917   0.997   1.000   1.000   1.000
         60         0.500   0.822   0.968   0.997   1.000   1.000
         80         0.500   0.756   0.917   0.981   0.997   1.000
        100         0.500   0.710   0.866   0.952   0.987   0.997
        120         0.500   0.678   0.822   0.917   0.966   0.990
        140         0.500   0.654   0.786   0.883   0.943   0.976
        160         0.500   0.636   0.756   0.851   0.917   0.958
        180         0.500   0.621   0.731   0.822   0.891   0.938
        200         0.500   0.609   0.710   0.797   0.866   0.917
        220         0.500   0.599   0.693   0.775   0.843   0.896
        240         0.500   0.591   0.678   0.756   0.822   0.876
        260         0.500   0.584   0.665   0.739   0.803   0.857
        280         0.500   0.578   0.654   0.724   0.786   0.839
        300         0.500   0.573   0.644   0.710   0.770   0.822
        320         0.500   0.569   0.636   0.698   0.756   0.807
        340         0.500   0.565   0.628   0.688   0.743   0.793
        360         0.500   0.561   0.621   0.678   0.731   0.779
        380         0.500   0.558   0.615   0.669   0.720   0.767
        400         0.500   0.555   0.609   0.661   0.710   0.756

Table 3 - Probability of correct guess when relation S is shipped.

                              p_R|R| - p_S|S|
  d_R|R| = d_S|S|      0       40      80     120     160     200
         20          5.76    0.01    0.00    0.00    0.00    0.00
         40         11.51    1.09    0.02    0.00    0.00    0.00
         60         17.27    4.16    0.55    0.04    0.00    0.00
         80         23.03    8.35    2.18    0.40    0.05    0.00
        100         28.79   13.10     --      --     0.33    0.06
        120         34.54   18.17     --      --      --     0.31
        140         40.30   23.42     --      --      --     0.90
        160         46.06   28.80     --      --     4.37    1.95
        180         51.82   34.25   21.34   12.48    6.82    3.47
        200         57.57   39.77   26.20   16.40    9.73    5.46
        220         63.33   45.33   31.21   20.61   13.03    7.86
        240         69.09   50.92   36.33   25.04   16.64   10.65
        260         74.85   56.54   41.55   29.65   20.52   13.76
        280         80.60   62.18   46.84   34.41   24.63   17.16
        300         86.36   67.83   52.19     --    28.93   20.80
        320         92.12   73.50   57.59     --    33.39   24.66
        340         97.87     --      --    49.35     --    28.71
        360        103.63     --      --    54.50     --    32.91
        380        109.39   90.55   74.01   59.70   47.50   37.26
        400        115.15   96.25   79.54   64.96   52.40   41.73

Table 4 - Expected excess cost, in tuples, when relation S is shipped.
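The two quantities just tabulated can be computed directly from p_R, d_R, |R|, p_S, d_S, and |S|. The following sketch is illustrative only and is not part of the original text; it assumes the precisions are 95% half-widths, so z = 1.96, and it evaluates the expected-excess-cost integral of the next section in its standard closed form for normal variables.

    # Illustrative sketch (not from the thesis): probability of correct guess and
    # expected excess cost for the two-relation case, assuming z = 1.96 (95% precisions).
    from math import erf, exp, pi, sqrt

    def phi(x):      # standard normal density
        return exp(-x * x / 2.0) / sqrt(2.0 * pi)

    def Phi(x):      # standard normal distribution function
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    def correct_guess_probability(pR, dR, R, pS, dS, S, z=1.96):
        delta = pR * R - pS * S                          # estimated difference in volumes
        return Phi(delta * z / sqrt((dR * R) ** 2 + (dS * S) ** 2))

    def expected_excess_cost(pR, dR, R, pS, dS, S, z=1.96):
        # Closed form of the double integral in the next section:
        # E[(|(S|B_S)| - |(R|B_R)|)+] for independent normal volume estimates.
        delta = pR * R - pS * S
        sigma = sqrt((dR * R) ** 2 + (dS * S) ** 2) / z  # std. dev. of the difference
        return sigma * phi(delta / sigma) - delta * Phi(-delta / sigma)

    # Worked example from the text: |R| = |S| = 10**4, pR = 0.1, pS = 0.084, dR = dS = 0.02.
    print(correct_guess_probability(0.10, 0.02, 10**4, 0.084, 0.02, 10**4))  # ~0.866
    print(expected_excess_cost(0.10, 0.02, 10**4, 0.084, 0.02, 10**4))       # ~9.7 tuples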
Expected excess cost. From (3), it is possible to find the likelihood of making the wrong decision. It is also useful to know how much will be lost as a result of the fact that the choice made is not always correct. The expected excess cost is the expected number of extra tuples that will be shipped if the wrong decision is made, multiplied by the probability that the wrong decision will be made. When relation (S|B_S) is shipped, this can be computed from the formula

    \int_{-\infty}^{\infty} \int_{x}^{\infty} (y - x)\, f\!\left(x,\, p_R|R|,\, \frac{d_R|R|}{z}\right) f\!\left(y,\, p_S|S|,\, \frac{d_S|S|}{z}\right) dy\, dx,

where f(x, \mu, \sigma) denotes the normal density with mean \mu and standard deviation \sigma. This is the expected value of (y - x) = |(S|B_S)| - |(R|B_R)| computed over all x and y such that y > x. This formula can be simplified to

    \int_{0}^{\infty} (1 - F(x))\, dx,

where F is the distribution function of |(S|B_S)| - |(R|B_R)|.

This equation for d can be easily inverted to get the sample size m required to produce a given precision when the sample variance is s^2.

To implement this method for predicting join output, first take a sample of the joining domain values. From each relation, select and save all tuples which have one of these values. The query in question should be run against these samples. The above formulas can then be used to predict the performance of the query when run against the full relations.

There are, of course, disadvantages to this scheme. If the multiplicity (the number of tuples with a given joining domain value) and the variance (s^2 in the above equations) are not small, then the number of tuples required for a given precision will be larger than for simple random sampling. Also, because constructing a sample will require searching the relation instead of just picking random tuples, the cost of constructing the sample will be much higher. (The presence of an appropriate index would reduce this extra cost factor.)

Another problem inherent in the scheme has to do with the fact that the sample is no longer truly random. In particular, if a select function references the joining domain explicitly, the prediction could be off. There are two ways of coping with this problem. It could be ignored (with some justification) in the hope that the sample will be large enough to smooth out this effect. Alternatively, the query processor could be made smart enough to separate the select into parts that do not depend on the joining domain. These parts could be run against the sample separately and the results combined using an independence assumption. For instance, suppose the join is on domain "location" and the query is "(location = 'London' & stock > 1000) or (location = 'Paris' & stock > 5)". If the sample shows that the "average" location has five tuples with stock > 1000 and ten with stock > 5, then one can expect 15 tuples to result from this query.

It will sometimes happen that the domain value(s) tested in the query will be included in the sample. In this case it is not necessary to look at the full relation at all. If, in the above example, both "London" and "Paris" were in the sample, the query could be processed using only the sample. (In light of this, it is tempting to place the most frequently referenced domain values in the sample. This would have to be done with extreme care, however, to preserve the statistical properties of the sample.)
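As a concrete illustration of the sampling scheme just described, the sketch below builds a sample by joining-domain value and scales the count observed on the sample by M/m, the ratio of distinct domain values to sampled values. It is not from the thesis; relations are represented as lists of dictionaries, and the names in the usage comment are hypothetical.

    # Illustrative sketch (not from the thesis): predict a query's output volume by
    # sampling joining-domain values and scaling the count observed on the sample.
    import random

    def sample_by_domain_value(relation, domain, m, seed=0):
        """relation: list of dicts; domain: name of the joining domain; m: values to sample."""
        values = sorted({t[domain] for t in relation})
        M = len(values)
        random.seed(seed)
        chosen = set(random.sample(values, m))
        return [t for t in relation if t[domain] in chosen], M, m

    def estimate_volume(sample, M, m, predicate):
        """Scale the number of qualifying sample tuples by M/m."""
        return sum(1 for t in sample if predicate(t)) * M / m

    # Hypothetical usage, mirroring the experiment reported below: sample 100 of the
    # distinct "location" values and estimate the volume of "stock < 10000".
    # sample, M, m = sample_by_domain_value(test_relation, "location", 100)
    # print(estimate_volume(sample, M, m, lambda t: t["stock"] < 10000))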
Experimental results. A second sampling method was used on the same test database as before. This was a sample by domain value. One hundred of the 1553 values of the field called "location" were selected, and the 663 records which had one of those values were collected. The predicted volumes in column 3 of Table 2 are 15.53 times the volumes obtained from the queries run on this sample. The precisions are obtained from equation (4).

Generalized Query Sampling

All of the above discussions have dealt with a very simple query operating on only two relations. This section describes a sampling technique which is applicable to more complex queries consisting of joins and restricts on an arbitrary set of relations. By constructing specially designed samples of the individual relations, this technique will allow estimation of the volume of output from each intermediate result in the query.

Unfortunately, it is difficult in this general case to define the probability of correct guess and the expected excess cost described in earlier sections. These statistics were defined for a simpler, two-relation query for which there were only two possible query strategies. It was easy to treat the tradeoffs between the two strategies. In the current, more general case, there could be a very large number of feasible query strategies for any given query. The probability of correct guess in this case would be the probability that the chosen strategy is better than all alternatives. To compute this would require some kind of enumeration of the possible strategies and a computation of the effect of the errors in the estimated volumes on the cost of each strategy. Furthermore, in the two-relation case each relation was sampled independently, so it was reasonable to assume that errors in the volumes being estimated were independent. This made it easy to treat the problem analytically. In the general case, the relation samples are not built independently, so it is unreasonable to assume independence of the errors. For these reasons, no attempt has been made to define probability of correct guess or expected excess cost in this context.

A Multiple-Relation Sampling Technique. Consider now an arbitrary relational-algebra query to the database. This query can be represented as a tree. Each leaf node represents a select function to be applied to some stored relation. Each interior node represents a join between two relations and a select on the result of the join. This tree can be locally optimized using a procedure, similar to that described by Smith and Chang [75], which "moves down" portions of the select functions as far as possible in the tree. This will cause tuples to be removed as early as possible, at low levels in the tree, thus reducing the total volume of intermediate results.

To optimize the assignment of tree nodes (query operations) to network hosts, it is necessary to be able to estimate the volume of output from each operation in the query tree. The approach developed here is an extension of one described in an earlier section. For predicting the output of a join between two large relations, the suggested approach was to select a sample of the domain values and estimate the number of tuples in the relation which have each value. To predict arbitrary queries, we will select one database domain as a "base" domain B from which sample values will be taken.
Samples of individual relations will be constructed which consist of all tuples which either contain one of the sampled B values or which could be joined, through some series of join operations, with a tuple containing one of the sampled B values. The assumption of join uniqueness ensures that the relation samples will be unique for any given base domain sample.

The procedure described below selects tuples to be included in the sample by first identifying samples of some domains which could be involved in query joins. Every database relation will contain at least one domain which has been sampled. The samples of the individual relations will consist of all tuples which contain one of the sampled domain values. For each sampled domain D in the database, the sets S_Di and the constant z_D will be developed. S_Di is the set of all values of domain D which are "associated with" the sampled value b_i of the base domain B. The values b_i of B and d_k of D are defined to be associated with each other if there exists a tuple in GOAL which contains those two values. The constant z_D is the number of joining steps needed to get from the relation containing B to the relation containing D.

1. First, the base domain B must be selected. This could be any domain in the database, but for best results the joining domain with the largest number of distinct values or a key domain of GOAL should probably be selected. From the M distinct values of B, select a random sample of m values. For each sampled value b_i, the set S_Bi contains only the value b_i. The "distance" z_B is equal to zero.

2. Let T be any relation in the database which contains no domains for which samples have been constructed, but which can be joined to relation R by domain D, where R contains a domain C which has been sampled. If no such T exists, the sampling procedure is finished.

3. Use the notation (c_j, d_k) ∈ R to indicate that a tuple with a domain C value of c_j and a domain D value of d_k exists in R. The sets S_Di can be constructed as

       S_{Di} = \{\, d_k : c_j \in S_{Ci},\; (c_j, d_k) \in R \,\}.

   S_Di is the set of all D values which appear in R together with one of the C values in S_Ci. The values in S_Di are therefore all of those associated with values in S_Ci. The domain C is either B itself or it is one of a set of domains which can be used to join R (possibly through a series of intermediate linking relations) to a relation containing B. The nature of the join operation ensures that if any R tuple with C value c_j can be joined to a tuple with B value b_i (again, possibly using intermediate joins), then every R tuple with value c_j will be joined to a tuple with value b_i. Since the join operation must reconstruct pieces of GOAL, and since S_Ci contains all C values associated with b_i, it follows that S_Di contains all D values associated with b_i. Conversely, a value d_k cannot appear in S_Di unless it is associated with b_i, so S_Di is exactly the set of D values associated with b_i. Having identified the sample of D, the sample of T can be constructed, consisting of all those T tuples which have D values appearing in one of the sets S_Di. If there are any other relations in the database containing domain D which have not yet been sampled, their samples can also be constructed from the S_Di.

4. Let z_D = z_C + 1. Go to step 2.

The sample of the database consists of the individual relation samples developed above.
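A compact way to realize steps 1 through 4 is sketched below. This is illustrative only and is not from the thesis: relations are represented as lists of dictionaries, the set of joining domains is assumed to be known in advance, and the traversal is simplified to a breadth-first pass over domains reachable from the base domain. The weights discussed next are omitted.

    # Illustrative sketch (not from the thesis) of the sample-construction procedure above.
    import random

    def build_domain_samples(relations, base_relation, base_domain, joining_domains, m, seed=0):
        """relations: dict name -> list of dicts. Returns (S, z): S[D] is the list of sets
        S_D1..S_Dm of D values associated with each sampled base value; z[D] is the distance."""
        random.seed(seed)
        base_values = sorted({t[base_domain] for t in relations[base_relation]})
        sampled = random.sample(base_values, m)
        S = {base_domain: [{b} for b in sampled]}          # step 1: S_Bi = {b_i}, z_B = 0
        z = {base_domain: 0}
        frontier = [base_domain]
        while frontier:                                    # steps 2 and 4: propagate outward
            C = frontier.pop(0)
            for R in relations.values():
                if not R or C not in R[0]:
                    continue
                for D in R[0]:
                    if D not in joining_domains or D in S:
                        continue
                    # Step 3: S_Di = {d : c in S_Ci and (c, d) appear together in R}.
                    S[D] = [{t[D] for t in R if t[C] in S_Ci} for S_Ci in S[C]]
                    z[D] = z[C] + 1
                    frontier.append(D)
        return S, z

    def relation_sample(relation, S):
        """All tuples whose value in some sampled domain appears in one of the sets S_Di."""
        return [t for t in relation
                if any(D in t and any(t[D] in S_Di for S_Di in sets)
                       for D, sets in S.items())]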
Each individual value of a sampled domain may be associated with several values of B. The probability that a given domain value will be included in the sample is directly proportional to the number of B values it is associated with. Different domain values will therefore have different probabilities of being included in the sample. If this effect is not compensated for, the estimates will be biased in favor of tuples with values which are associated with a large number of B values. Therefore, the relation samples must be used with appropriately developed weights which will cause "high probability" tuples to have proportionally lower impact on the estimates.

For each value d_j which appears in one of the sets S_Di, a weight m_Dj must be defined. This weight is equal to the number of distinct B domain values which d_j is associated with. The computation of these weights is by far the most expensive portion of the setup cost of this sampling method. The only way to find them in the general case is to enumerate the base domain values which could be joined to each sampled value in the sampled domain D. This could be done by finding the set S'_Dj, the set of B values associated with sampled value d_j, in a manner similar to that used to find S_Di. Alternatively, a sample of GOAL could be constructed containing all tuples with a sampled domain D value, using appropriate joins of individual relations. The m_Dj values could be computed directly from this relation.

There is a special case where computation of m_Dj becomes somewhat easier. This is when the base domain B is a key of GOAL. One such key can always be constructed, if necessary, by combining several domains. In this case, for the relation R and domains C and D (from step 2 above),

    m_{Dk} = \sum_{c_j : (c_j, d_k) \in R} m_{Cj}.

In other words, the weight on the domain value d_k is equal to the sum of the weights on the domain C values which are associated with d_k. The weights on all values of the base domain are 1. The weights for the other domains can be computed as part of the procedure defined above. At each step, the weights for all values of a domain D (not just the values in the sample) can be computed from the weights on C.

If B is a composite domain, the sampling procedure must treat B as distinct from its components. For instance, if B is the composite of domains A and C (call it AC), then the procedure could generate samples of A or C or both in addition to the sample of AC. If, in addition, no stored relation exists which contains all of the domains of B, it will be necessary to construct a temporary relation containing only B for sampling purposes. It can be discarded when the sampling parameters have been computed.
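When B is a key of GOAL, the weight propagation above can be written in a few lines. The sketch below is illustrative only (not from the thesis) and uses the same list-of-dictionaries representation as the earlier sketch; base-domain values are assumed to start with weight 1.

    # Illustrative sketch (not from the thesis): weight propagation for the special case
    # where the base domain is a key of GOAL, so m_Dk is the sum of m_Cj over the C values
    # that appear together with d_k in relation R.
    def propagate_weights(R, C, D, weights_C):
        """R: list of dicts containing domains C and D; weights_C: dict c -> m_Cj."""
        weights_D = {}
        seen = set()                          # count each distinct (c, d) pair only once
        for t in R:
            pair = (t[C], t[D])
            if pair not in seen:
                seen.add(pair)
                weights_D[t[D]] = weights_D.get(t[D], 0) + weights_C[t[C]]
        return weights_D

    # In the example database below, the weights on S# follow from weight 1 on every value
    # of the (P#,S#) key via relation SP, and the weights on City from those on S# via SPLR.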
However, the accuracy of any intermediate volume estimate can be obtained as before from the formula ,2 z 2 M(M-m) 2 o = £- i-S m where ^ (n i " M } r.2 i s = (m-1) Example . Consider a database consisting of four relations. It contains information on parts and suppliers who supply those parts. Relation PARTS contains information for each part, relation SPLR contains information for each supplier, relation CITIES contains information on individual cities in which suppliers are located, and relation SP indicates which suppliers supply which parts. 39 PARTS SP SPLR CITIES PNAME Widget Bolt Nut Rivet P# 1 2 3 4 P# "1 s* 2 1 2 3 1 4 3 2 3 4 4 4 4 2 s# "I 2 3 4 SNAME Ajax Acme Widget Bomad Ci ty_ Urbana Hinkley Urbana Flag staff City State Flagstaff NM Hinkley Oil Urbana IL v Suppose that the composite domain (P#,S#) is selected as the base domain. (This is a key of GOAL.) Call this composite domain C, and let the sample of C be the values (1,1), (2,3), and (1,4). The domain samples generated from this sample of (S#,P#) are S C1 ={<1,1)} S p##1 ={l} S s#rl ={l} S city#1 «{Urbana} S C2 ={(2,3)} S p#f2 ={2} S s#f2 ={3} S cityr2 ={Urbana} S C3 -{(l f 4)} S p#r3 ={l} S s#r3 ={4) S CltVf3 ={Flagstaft} The weights resulting from this choice of base domain are P# m P# S# m S# 2 2 2 2 1 2 3 4 2 2 1 3 City Urbana Hinkley Flagstaff ra City 3" 2 3 To test the effect of a given query which uses only the SPLR relation, for instance, run the query on the sample relation SPLR' which contains only tuples with S# values of 1, 3, or 4. If that query selected only the tuple (4, Bomad, Flagstaff), then an estimate of the actual output from the whole SPLR relation would be: 40 15 = 3 ( 2 + T + 3> = - 89 The precision of this at the 95% confidence level is .2 (1. 96) 2 (8) (5) ((0-0. II) 2 + (0-0. II) 2 + (0.33-0. II) 2 ) (3) (2) = 1.86 = > d ■ 1.36 Prediction of Project Output In the discussion up to this point, the project operation has been deliberately ignored. This is because not all sampling methods can be used to predict the effect of a project. The project operation will coalesce several input tuples into one output tuple, so if the sampling method used was such that some tuples in the sample could possibly be merged with tuples not in the sample, it would be impossible to accurately predict the output volume. There would be no way to estimate from the sample alone how many input tuples would be projected into a single output tuple. It turns out, however, that the multiple-relation sampling technique just described will handle project in the majority of cases. As long as a given project retains at least one of the sampled domains, it will be possible to predict the output of the project using that sampled domain. As before, the query, including the project, is run on the sample database, and tnen the output volume is predicted using the sampled domain. This tfill work because the sample database, by definition, contains 41 all tuples which have any one of the sampled domain values, hence, there can be no tuples not in the sample which could be merged with a tuple not in the sample. Since all joining domains are sampled, there will always be a sampled domain in the output of any project which will later be the input to a join. Therefore, all volumes (except sometimes the final output volume) in a query with projects can be predicted. In the cases where the final output volume cannot be predicted using sampling, any gross estimate of the volume can be used to find a sub-optimal strategy. 
Pathological Cases

It is quite possible that any given database will exhibit pathological behavior which will make it impractical to implement sampling for the entire database. The samples required for some of the relations might prove to be substantial portions of the relations themselves. In this case, the cost of sampling for those relations will be an unacceptably large proportion of the total query cost for that relation. There are at least two situations in which this can occur.

Wide variation in relation sizes. If there is a very wide range in the sizes of the relations of the database, it is probable that the smaller relations will have samples which are appreciable proportions of the whole relations. To see how this happens, consider a three-relation database which is an upgraded version of the two-relation one discussed earlier in this chapter.

    Relation     Domains                  Number of Tuples
    SUPPLIERS    Name, Splr#                        1,000
    PARTS        Color, Shape, Part#              100,000
    S-P          Splr#, Part#                     300,000

The SUPPLIERS relation has one tuple to describe each supplier, the PARTS relation has one tuple identifying each part, and S-P is a "linking" relation which indicates which suppliers supply each part. On the average, each supplier supplies 300 different parts, and each part is supplied by 3 suppliers. Suppose that Part# is selected as the "base" domain, and 1000 of its 100,000 values are sampled. There will then be 1000 tuples in the sample of PARTS, or 1% of the whole relation. Because each part number (value of Part#) appears about 3 times in S-P, it is reasonable to expect about 3000 tuples in the sample of S-P, which is still only 1% of the tuples in that relation. However, because each Splr# value is associated with a large number of Part# values, it should happen that a large percentage of the SUPPLIERS tuples (over 95%, based upon strictly probabilistic considerations) will be in the sample. This means that the sampling cost for the part of the query which uses SUPPLIERS will be close to the actual processing cost for that part of the query.

In this case, sampling should not be used on the SUPPLIERS relation. It is so small, anyway, that it contributes only a small part of the total query cost. Instead, the part of the query which uses this relation should be run on any copy of SUPPLIERS which is available on the network. The rest of the query can then be optimized using exact results for the output from this part of the query, rather than estimates based upon sampling.

Large "fan-out" between domains. The term "fan-out" is used very loosely here to refer to the average number of domain values in one domain associated with each value of some other domain. Even if the relations are of similar size, a large fan-out can cause large samples. In the last example, the Splr# domain has a large fan-out to domain Part#. Suppose Splr# is picked as the base domain with a 10% sample of 100 of its values. The sample of SUPPLIERS would contain 100 tuples. Because each Splr# value occurs about 300 times in S-P, the sample of S-P will have about 30,000 tuples. About 27,100 tuples could be expected in the sample of PARTS, or over 27% of that relation's tuples. (There will be some duplication of the Part# values selected by the sample of Splr#, which is why the total sample of PARTS contains fewer than 30,000 tuples.) A 10% sample of SUPPLIERS therefore results in a 27% sample of the larger PARTS relation. In a large database with many relations, this "fan-out" problem could result in unreasonably large samples.
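The sample-size figures in the two scenarios above follow from simple inclusion probabilities. The sketch below is illustrative only (not from the thesis) and assumes, as the text does, exactly 300 parts per supplier and 3 suppliers per part.

    # Illustrative sketch (not from the thesis): expected sample sizes under fan-out.
    # A tuple enters the sample if at least one of its associated base values is sampled,
    # which happens with probability 1 - (1 - f)**k for sampling fraction f and fan-out k.
    def expected_in_sample(total_tuples, fraction_sampled, fanout):
        return total_tuples * (1.0 - (1.0 - fraction_sampled) ** fanout)

    # Base domain Part#, 1% sampled; each supplier supplies 300 parts.
    print(expected_in_sample(1000, 0.01, 300))     # ~951 SUPPLIERS tuples (over 95%)

    # Base domain Splr#, 10% sampled; each part is supplied by 3 suppliers.
    print(expected_in_sample(100000, 0.10, 3))     # ~27100 PARTS tuples (about 27%)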
In such cases, a more judicious choice of base domain might help, or it might be necessary to eschew sampling for those relations with unreasonably large samples. This means that the "samples" for those relations would consist of the relations themselves.

V. AN INTEGER LINEAR PROGRAM FOR OPTIMAL QUERY STRATEGIES

Chapter IV described a method which can be used to estimate the amount of data produced by each operation in the query. In this chapter, a method for generating a query strategy based upon those estimates is discussed. This is an integer linear programming (ILP) model which will produce an optimal assignment of query operations to network hosts. Solving a general ILP is an expensive proposition, so a procedure is developed which will allow much less expensive linear programming (LP) techniques to be used instead.

This technique is not guaranteed to find the best strategy for the given query. Finding such a global optimum would require considering many possible permutations of the query tree. In particular, the fact that the join operation is associative (within certain limits) would have to be considered. Altering the order in which joins are performed can have a large effect on the total size of the various intermediate results. However, this procedure will find the optimal strategy based upon the given query tree and estimates of intermediate volumes.

Definition of the Integer Linear Program

The cost of a particular implementation of a relational algebra query in a distributed environment can be expressed with the cost function

    C = \sum_{q} \sum_{i} \left( E_{qi} x_{qi} + V_q t_{qi} \right),

where E_qi is the expense of performing operation q on host i; V_q is the expense of sending the output of operation q over the network (an ARPANET-like cost structure is assumed, where cost is determined wholly by volume of traffic); x_qi is 1 if operation q is performed on host i, and is zero otherwise; and t_qi is 1 if operation q is performed on host i and its output must be shipped over the network, and is zero otherwise.

The constants E_qi represent purely local processing costs, and are included in the cost function here only for completeness. No suggestions are made of how to compute them. The constants V_q are the estimated volumes of output from the operations, multiplied by an appropriate cost factor. (The constants E_qi will also depend partly on the estimated volumes for the inputs to operation q.) All cost constants are assumed to be positive.

Let there be N operations and M hosts, and denote the successor (or parent) operation of operation q as a_q. (This means that the output of operation q is input to operation a_q.) A minimum-cost strategy for servicing the query can be found by solving the integer linear programming problem (ILP): find values for x_qi and t_qi which minimize

    \sum_{q=1}^{N} \sum_{i=1}^{M} E_{qi} x_{qi} \;+\; \sum_{q=1}^{N-1} \sum_{i=1}^{M} V_q t_{qi}

under the constraints

    \sum_{i=1}^{M} x_{qi} = 1,                         q = 1, \ldots, N                      (5)

    x_{qi} - x_{a_q i} - t_{qi} \le 0,                 q = 1, \ldots, N-1;\; i = 1, \ldots, M  (6)

    x_{qi}, t_{qi} = 0 \text{ or } 1.                                                        (7)

The constraints (5) ensure that each operation is performed on exactly one host. The constraints (6) ensure that t_qi is one if operation q is performed on host i and a_q is performed somewhere else. If both q and a_q are performed on host i, or if q is not performed on host i, then the fact that V_q is positive (by assumption) will ensure that the minimization procedure will produce a t_qi of zero.
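For concreteness, the sketch below builds this model for a small query tree and solves its linear-programming relaxation (constraint (7) relaxed to 0 ≤ x, t ≤ 1, as is done later in this chapter). It is illustrative only and is not from the thesis; the cost data and the use of scipy's linprog solver are assumptions of the sketch.

    # Illustrative sketch (not from the thesis): the LP relaxation of the ILP above,
    # with hypothetical costs E, V and a parent[] array encoding the query tree.
    import numpy as np
    from scipy.optimize import linprog

    def query_placement_lp(E, V, parent):
        """E: (N, M) local costs E_qi; V: length-N shipping costs V_q; parent[q] = a_q,
        with parent[root] == root. Returns the LP solution vector and its cost."""
        N, M = E.shape
        nonroot = [q for q in range(N) if parent[q] != q]
        tpos = {q: N * M + k * M for k, q in enumerate(nonroot)}   # start of the t_q block
        nvars = N * M + len(nonroot) * M

        def x(q, i):                       # index of x_qi in the variable vector
            return q * M + i

        c = np.zeros(nvars)
        c[:N * M] = E.ravel()
        for q in nonroot:
            c[tpos[q]:tpos[q] + M] = V[q]

        A_eq = np.zeros((N, nvars))        # constraint (5): sum_i x_qi = 1 for every q
        for q in range(N):
            A_eq[q, q * M:(q + 1) * M] = 1.0
        b_eq = np.ones(N)

        A_ub = np.zeros((len(nonroot) * M, nvars))   # constraint (6)
        row = 0
        for q in nonroot:
            for i in range(M):
                A_ub[row, x(q, i)] = 1.0
                A_ub[row, x(parent[q], i)] = -1.0
                A_ub[row, tpos[q] + i] = -1.0
                row += 1

        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(row), A_eq=A_eq, b_eq=b_eq,
                      bounds=(0.0, 1.0), method="highs")
        return res.x, res.fun

    # Hypothetical 3-operation, 2-host tree: operations 0 and 1 feed operation 2 (the root).
    E = np.array([[1.0, 4.0], [3.0, 1.0], [2.0, 2.0]])
    V = np.array([5.0, 5.0, 0.0])
    print(query_placement_lp(E, V, parent=[2, 2, 2])[1])   # minimum cost 6.0 (all on host 0)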
It is assumed here that the output of operation N, the last operation in the query, will never be shipped over the network. Therefore, the variables t_Ni and the constant V_N are left unused in this formulation. If it became necessary to allow the output of N to be shipped, it would be easy to add a final, dummy operation N+1, with a_N = N+1.

(It would be possible in this ILP formulation to replace the M variables t_qi with a single variable t_q. However, this simplification would render the linear programming formulation given below unworkable.)

Notation

In standard terminology, any assignment of values to all of the variables x_qi and t_qi is called a solution to the ILP. A solution which satisfies all of the constraints is called a feasible solution. A feasible solution which minimizes the cost function (i.e., has a cost which is less than or equal to the cost of any other feasible solution) is called an optimal solution. In this thesis, a solution is feasible with respect to operation Q if all constraints of type (6) are satisfied when q is a descendant of Q, and all constraints (5) are satisfied when q is either equal to Q or is a descendant of Q.

In the discussion which follows, a solution to the above ILP will be represented as a vector of length (2N-1)M. The vector corresponding to a given solution S with variables equal to x*_qi and t*_qi will be

    (x*_{11}, x*_{12}, \ldots, x*_{21}, x*_{22}, \ldots, x*_{NM}, t*_{11}, t*_{12}, \ldots, t*_{(N-1)M}).

Similarly, the cost constants can be put in a vector of the form

    (E_{11}, E_{12}, \ldots, E_{21}, E_{22}, \ldots, E_{NM}, V_1, V_1, \ldots, V_{N-1}),

where each volume constant V_q occurs M times. This vector will be called E, so the cost of a particular solution S is S E^T. A solution which has the variable x_qi (t_qi) equal to one and all others equal to zero will be represented by the vector \hat{x}_{qi} (\hat{t}_{qi}). The value of the variable x_qi (t_qi) in a solution represented by the vector S will be denoted by S[x_qi] (S[t_qi]). The symbol \delta_{ij} is the Kronecker delta, which is equal to 1 if i = j and zero if i ≠ j.

Solving the ILP as a Linear Program

The above ILP can be treated as a linear program (LP) and solved using classical methods (e.g., the simplex method) by changing the constraints (7) to

    x_{qi}, t_{qi} \ge 0.                                                                    (7')

However, there is no guarantee that an all-integer solution will result. It will be shown here, however, that if there are any feasible solutions, there will always be at least one all-integer solution which is optimal.

Given an optimal solution S (and the existence of a feasible solution and the fact that the costs are positive guarantee the existence of at least one optimal solution), the following section will describe how to construct a set of integer, feasible solutions p_i and associated positive weights w_i such that

    \sum_i w_i = 1   \quad \text{and} \quad   \sum_i w_i p_i = S.

Because S is an optimal solution, it is true that p_i E^T \ge S E^T for each p_i. It is also true that \sum_i w_i p_i E^T = S E^T. It follows that p_i E^T = S E^T for each p_i, so each p_i is an all-integer, optimal solution.

Having proved the existence of an integer, optimal solution, a fast algorithm which will produce an integer, optimal solution from an arbitrary, non-integer, optimal solution will be given. A proof of its validity will also be given.

Construction of the Partial Solutions and Weights. Given an arbitrary optimal solution S, the following paragraphs show inductively how to construct a set P_q for each node q in the operation tree. Each element of the set P_q is a pair of the form (w; p) consisting of a weight w and a solution vector p. The weights and vectors in a given P_q will have the following properties:

A: \sum_{(w;p) \in P_q} w = 1, i.e., the weights sum to unity.

B: If \sum_{(w;p) \in P_q} w p = T_q, then T_q[x_ui] = S[x_ui] for all hosts i, where u is either q or one of its descendants.

C: T_q[t_ui] = S[t_ui] for all hosts i, where u is a descendant of q.

D: Each solution p in P_q is feasible with respect to operation q.

The set P_N, where N is the root node of the query, contains the weights and all-integer, optimal solutions which can be used to reconstruct S as above.
First, construct the sets P_q for each q that is a leaf of the query tree. Each P_q will be

    P_q = \{\, (S[x_{qi}];\; \hat{x}_{qi}) : S[x_{qi}] > 0 \,\}.

The equations (5) in the LP ensure that the weights in P_q sum to 1, so P_q satisfies property A. The other properties are true trivially.

The next task is to show that if P_r and P_s exist with these properties, where r and s are the children of node q, then P_q can be constructed with the same properties. This will be the basis for an inductive proof that such a set can be constructed for the root node N.

Select a node q with children r and s (meaning that a_r = a_s = q). The variable g_ji is, in a sense, the part of the output from operation r which is generated at host j and used at host i. The quantity minimized can be thought of as the total amount of output from operation r that is shipped over the network. A similar transportation problem will yield h_ki for the other child s. The set P_q is then defined as

    P_q = \bigcup \left\{ \left( \frac{w_r w_s\, g_{ji}\, h_{ki}}{S[x_{qi}]\, S[x_{rj}]\, S[x_{sk}]} ;\;\; p_r + p_s + \hat{x}_{qi} + (1 - \delta_{ij})\hat{t}_{rj} + (1 - \delta_{ik})\hat{t}_{sk} \right) \right\},

the union being taken over all hosts i, j, k and all (w_r; p_r) in P_r and (w_s; p_s) in P_s such that g_{ji} > 0, h_{ki} > 0, p_r[x_{rj}] = 1, and p_s[x_{sk}] = 1.

Proof That the Solutions Satisfy Properties A, B, C, and D. Let p_q be one of the vectors generated by this procedure from a given p_r and p_s. The vectors p_r and p_s are orthogonal, meaning p_r[x_ij] p_s[x_ij] = p_r[t_ij] p_s[t_ij] = 0 for all i and j. This is because the subtrees defined by r and s are disjoint. In fact, all of the five vectors summed to form p_q are mutually orthogonal. This guarantees that all of the components of p_q are either 0 or 1.

To demonstrate that p_q is feasible with respect to q, note that p_r and p_s are feasible with respect to r and s, respectively. Also note that any ILP constraint satisfied by either p_r or p_s will be satisfied by p_r + p_s, so p_r + p_s is feasible with respect to r and s. It must be shown that \sum_i x_{ui} = 1 for u equal to q or a descendant of q, and x_{ui} - x_{a_u i} - t_{ui} \le 0 for u a descendant of q. The induction hypothesis ensures that \sum_i x_{ui} = 1 for u a descendant of q, and x_{ui} - x_{a_u i} - t_{ui} \le 0 for u a descendant of r or s. The inclusion of the term \hat{x}_{qi} in the definition of P_q ensures that \sum_i x_{qi} = 1. The terms (1 - \delta_{ij})\hat{t}_{rj} and (1 - \delta_{ik})\hat{t}_{sk} ensure that x_{ui} - x_{qi} - t_{ui} \le 0 for u equal to either r or s. Therefore, p_q is feasible with respect to q. This is true for all vectors in P_q, so P_q has property D.

The fact that P_q has property A will be shown as a subresult while demonstrating properties B and C. Properties B and C must be demonstrated in several steps. Let

    T_q = \sum_{(w;p) \in P_q} w p.

First, it will be shown that T_q[x_qi] = S[x_qi] for any host i. The value of T_q[x_qi] can be found by summing the weights on all solution vectors p which have p[x_qi] = 1. This sum is equal to

    \sum_j \sum_{\substack{(w_r;p_r) \in P_r \\ S[x_{rj}] > 0,\; p_r[x_{rj}] = 1}} \; \sum_k \sum_{\substack{(w_s;p_s) \in P_s \\ S[x_{sk}] > 0,\; p_s[x_{sk}] = 1}} \frac{w_r w_s\, g_{ji}\, h_{ki}}{S[x_{qi}]\, S[x_{rj}]\, S[x_{sk}]}.          (8)

The fact that P_r and P_s satisfy property B ensures that
and h, are solutions to their respective ^ji ki r transportation problems ensure that 2 »ji- 2 V-siV S[x rj ]>0 S[x sk ]>0 Rearranging expression (8) and using these identities yields g w h. . j S(X rj j (v. ;p r )eP r SlX qi J k S(X sk J (w_ ,p ) €P^ S S[X rjJ> P r [x n ]»l SlX s^> P s [x s "].l = Six . ] . qi Therefore T [x . ]=Slx J for all hosts i. Because every p in P vj ^j J. vj x M Si has p [x - ] =1 for exactly one i, it follows that the sum of all c q qi * weights in P is q 1 w = I I w = IT [x ] * 2S[x J * 1 (w;p)€P q i (w;p)€P q i 4 4X i 4± P[x qi ]=l rnerefore, P has property A. 54 To show that T [t . ]=S[t .J, it is necessary to first prove a lemma. Lemma 1: If the vector g is the solution to the transportation problem between nodes r and q which was solved while constructing P , then S [t . J -£(1-6 . . )g .. for any j. Proof: Suppose 00 for some i and j such that i^ j . Construct g', a new solution to the transportation problem, in the following way: g '. . - g . . + g • *J3 *33 ^Di gj 4 - g ji!u. g kj = 9 kj * S[x']-g,. k=1 J-lrJ+lf.-.t" Hi - Hi * S[x ji )-i.. k=1 3-l.J+l M tJ J J g!. » g . for all other case; It is obvious that q'. . , g'» and g' are non-negative. From the fact that g . . +g . . 29^ - Zg ki 55 and (by similar reasoning) it follows that 9' is feasible. The cost of g' will be lower that that of g by q . + -__ — ± — - — a. — ^ji S x ^T-g . . Hence, the original solution was not optimal. Therefore, in any optimal solution, if S [x . J S[x .], then g.. = s l x q jJ and ^ij =0 tor ^3 • From the above argument, it follows that when S[x JS [x . ] , then g... = Six J, so 5(1-6. . )a . . = ?g - g =S[x ■ ]-S[x ]. Since 1 qj I 11 Ji f ]l y DD H q} J Sft . ]>0 cv constraint (7')/ and since S(t .]>S[x ]-S[x J by constraint (6), it follows that Sit . ] >5 (1-6 . . ) g . . . The r j — t 13 31 assumption that V is positive ensures that S[t .] will never be larger than is required by the constraints (6) . Therefore, S(t r -]=^(l-6 i .)g. i . Q.E.D. As oefore, T [t ] is equal to the sum of the weights on q r j all p 's which have p ft . J =1 . This sum is *q 4 cj 56 w w g . .h. 5 5 5 5 r s 3i ki I (w ;p )6P k (w ;p ) €P S [x qi J S lx rj ] S [X sk ] r r C Cfy ISA s'*V s ^ J qi J I S[X rj 1 (w r ;5 r )€P r S[X qi J k S(X sk J (w ..pJGp" 3 Six .]>0 , r , , r S[x J>0 T , , s qi p r Ix rj 1=1 sk p r Ix rk ]ssl Using Lemma 1 and the same identities used to simplify expression (8) , this reduces to 5 (1-6. )g • . S[x q .]>0 Since S[x .]»0 implies g . . -0 (from the definition of the transportation problem) , this is equal to 5(1-6. .)g . . = S[t J . 7 1 lj'^ji r] J Therefore, for all t . (and t . by similar argument), T q [t r .]-S[t rj ] and T q (t sk )=S[t sk ). Now, it remains to be shown that T [y]*S[y] for all variables y such that p [yl«l for some (w ;p )€P . (If p [y]-l, there can be no p_ in p such that p_[y]*l.) T (yj is obtained by s s s q summing the weights which correspond to solutions p in P which have p(yl*l. This yields 5 w (w,p)6P g Ply]=i 57 w w g . . h. = 55 5 5 5 r s 3 1 k * i J (w r ;5 r )€P k (w ;5 )€P Slx qi ]Slx rj )Slx sk J S[x .J>0 S[x .]>0 „ , , i S[x J>0 ^ 7 , i 1 qi J ru Pr^rjl" 1 sk P s lx sk J=1 P r [y]-i w g . h, j (w r JP r )6P r SIx rj 1 I SIX qi ] k S[X sk ] (w :p JGp/ 3 SlX rD 1>0 P r lx n r ,,i r S < x qi'> S ^s k J> P s tx s ;i«l S P r lyJ-i j (w ;p )€P S l x rj^ >0 p x ]=1 P r lyJ=i Since each p in P will have p [x .1=1 for exactly one j, the c r r ^r l r] J sum over j and the restriction p (x ]=1 can be eliminated. 
The sum can then be reduced to

    \sum_{\substack{(w_r;p_r) \in P_r \\ p_r[y] = 1}} w_r.

This is exactly equal to T_r[y], which by hypothesis is equal to S[y], so T_q[y] = S[y]. A similar argument will show that T_q[y] = S[y] for any y such that p_s[y] = 1 for some p_s in P_s. Therefore, P_q has properties B and C.

It has thus been shown that the set P_q has the properties A through D, as do the sets P_r and P_s. Since it is possible to construct such sets for the leaves of the query tree, it is possible to construct one for the root node, N, by induction. This set, P_N, will contain a group of integer, feasible solutions. The non-integer solution S lies in the interior of the hyper-polyhedron defined by these vectors (i.e., it is a convex combination of these vectors), so, by the argument given earlier, each of the integer solutions in P_N is an optimal solution.

An Algorithm for Finding Integer Solutions

The above argument proves the existence of an optimal, integer solution to the LP, but is entirely impractical as a method for finding such a solution. The following algorithm will quickly find an optimal, integer solution W, given an optimal, non-integer solution S.

1. Set the vector W to all zeroes. Select an i such that S[x_Ni] > 0. Set W[x_Ni] = 1.

2. Pick any operation r which has not been visited, but whose parent q has already been visited. (This is an arbitrary, top-down traversal of the tree.) If none such exists, stop; W is an optimal, integer solution.

3. Find the i such that W[x_qi] = 1. If S[x_ri] ≥ S[x_qi], let j = i. Otherwise, let j be any value such that

> 0. Similarly, if W[x_sk] = 1, it is possible to have h_ki > 0. From this and the derivation of P_q, it is clear that P_q could contain a solution of the form p_r + p_s + \hat{x}_{qi} + (1 - \delta_{ij})\hat{t}_{rj} + (1 - \delta_{ik})\hat{t}_{sk}. This solution corresponds to W for the subtree with q at the root. If q is a leaf and W[x_qi] = 1, then there must be a p in P_q such that p = \hat{x}_{qi}. It can therefore be shown by induction that W corresponds to some p in P_N for the subtree with root N. This means that W is the same as p, and since p is optimal, W is optimal.

VI. CONCLUSIONS

The described method for sampling a database to allow prediction of query performance can be fairly expensive, and will not be practical for small databases where the potential savings from optimization are small. For large databases, however, it will allow optimization based upon figures which are theoretically more valid than figures derived using an independence assumption. An important advantage of sampling is that quantitative estimates for the error in the estimates can be obtained.

A major part of the cost of sampling is the setup cost. A large amount of work must be done to build the samples and compute the weights which are used in interpreting the samples. Technically, these samples should be reconstructed whenever an update is made to the database. However, it is reasonable to expect that the statistical properties of the database should not change rapidly with time. Therefore, a sample, once constructed, can continue to be used even if the database has been updated, until such time that it is observed to no longer reflect the status of the entire database.
REFERENCES

Bernstein, P. A. "Synthesizing third normal form relations from functional dependencies," ACM Transactions on Database Systems, 1, 4 (December 1976), 277-298.

Chu, W. W. "Optimal file allocation in a multiple computer system," IEEE Transactions on Computers, October 1969, 885-889.

Codd, E. F. "A relational model of data for large shared data banks," Comm. ACM, 13, 6 (June 1970), 377-387.

Codd, E. F. "Further normalization of the database relational model," in Data Base Systems, Courant Inst. Computer Science Symp. 6, R. Rustin, Ed., Prentice-Hall, Englewood Cliffs, 1972, 33-64.

Hammer, M. and Chan, A. "Index selection in a self-adaptive data base management system," Proc. ACM-SIGMOD Conf. on Management of Data, 1976.

Levin, K. D. "Organizing distributed databases in computer networks," Tech. Report No. 74-09-01, Dept. of Decision Sciences, The Wharton School, University of Pennsylvania (Ph.D. dissertation).

Smith, J. and Chang, P. "Optimizing the performance of a relational algebra data base system," ACM-SIGMOD Workshop, San Francisco, CA (May 14, 1975).

Vallarino, O. "On the use of bit maps for multiple key retrieval," Proc. ACM-SIGMOD Conf. on Data, Salt Lake City, March 1976.

Wong, E. and Youssefi, K. "Decomposition - A strategy for query processing," ACM Trans. on Database Systems, 1, 3 (September 1976), 223-241.

Wong, E. "Retrieving dispersed data from SDD-1: A System for Distributed Databases," Proc. of the Second Berkeley Workshop on Distributed Data Management and Computer Networks, May 1977, 217-235.

Yamane, T. Elementary Sampling Theory, Prentice-Hall, Englewood Cliffs, 1967.

APPENDIX A

Derivation of Formula for Probability of Correct Guess

Given that sampling has been used on the relations R and S to obtain p_R, d_R, p_S, and d_S, let

    \mu_R = p_R|R|, \qquad s_R = \frac{d_R|R|}{z}, \qquad \mu_S = p_S|S|, \qquad s_S = \frac{d_S|S|}{z}.

The actual volumes |(R|B_R)| and |(S|B_S)| can then be thought of as normally distributed with means \mu_R and \mu_S and variances s_R^2 and s_S^2. We say that |(R|B_R)| is N(\mu_R, s_R^2) and |(S|B_S)| is N(\mu_S, s_S^2). (N(x, y) denotes a normal distribution with mean x and variance y.) Because the two samples were taken independently, it follows that

    D = |(S|B_S)| - |(R|B_R)| \quad \text{is} \quad N(\mu_S - \mu_R,\; s_S^2 + s_R^2),

so that

    D' = \frac{D - (\mu_S - \mu_R)}{\sqrt{s_R^2 + s_S^2}} \quad \text{is} \quad N(0, 1).

The probability of correct guess when (S|B_S) is shipped is the probability that |(R|B_R)| is greater than |(S|B_S)|. This is equal to

    P\bigl(|(R|B_R)| > |(S|B_S)|\bigr) = P(D < 0) = P\!\left( D' < \frac{\mu_R - \mu_S}{\sqrt{s_R^2 + s_S^2}} \right) = \Phi\!\left( \frac{(p_R|R| - p_S|S|)\, z}{\sqrt{(d_S|S|)^2 + (d_R|R|)^2}} \right).
REPORT DOCUMENTATION PAGE (UNCLASSIFIED)

Report Number: CAC Document Number 234
Title (and Subtitle): Optimization of a Relational-Algebra Query to a Distributed Database Using Statistical Sampling Methods
Type of Report and Period Covered: Thesis
Performing Org. Report Number: CAC #234
Author: David A. Willcox
Contract or Grant Number: DCA100-75-C-0021
Performing Organization Name and Address: Center for Advanced Computation, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
Controlling Office Name and Address: Joint Technical Support Activity, 11440 Isaac Newton Square, North, Reston, Virginia 22090
Report Date: August 1, 1977
Number of Pages: 69
Security Class (of this report): UNCLASSIFIED
Distribution Statement (of this Report): Copies may be obtained from the address given under Performing Organization above.
Distribution Statement (of the abstract): No restriction on distribution.
Supplementary Notes: None
Key Words: Distributed data management; Relational database; Sampling; Query optimization
Abstract: A novel approach to the minimization of a relational-algebra query, where the relations of the database are distributed among several computers, is presented. A statistical sampling method is described which can be used to develop the parameters to an integer-linear programming (ILP) problem. An efficient method for solving the ILP is also presented.