Statistical Behavior of Search Keys

Abraham BOOKSTEIN: Graduate Library School, University of Chicago

Editor's note: The editor and author are aware that varying approaches may be taken to the problem presented here. Readers are invited to respond in the form of a paper or a technical communication.

In discussion about search keys, concern has been expressed as to how the number of items retrieved by a single value relates to collection size. This paper creates a statistical model that attempts to give some insight into this behavior. It is concluded that, in general, the observed behavior can be explained as being intrinsically statistical in nature rather than being a property of specific search keys. An attempt is made to relate this model to other research, and to indicate how this model may be made to yield more accurate predictions.

INTRODUCTION

Various experiments suggest that it may be possible to develop, as an access route into a file of bibliographic records, a search key* whose values can be easily derived from such bibliographic data as is likely to be available to its users.[1] Some concern, however, has been expressed regarding the nonuniqueness of these keys: if the number of items retrieved were often to exceed an amount easily handled by a user of the system, the value of this access route would be considerably diminished. Accordingly, an important measure of search key performance is the frequency with which a large number of records is retrieved as the search key is applied to the file.
This measure is related, for example, to how many memory accesses will be required, on the average, to retrieve all records satisfying a request; it is also an important consideration in deciding which display device should be installed in a system.[2, 3]

After evaluating such a measure for a search key on a particular file, it is reasonable to ask how that measure will change over time, as the file increases in size. The nature of this variation has already been of concern to researchers in the field. Kilgour, on the basis of a number of experiments carried out at OCLC, notes that "There remains a major problem to be solved and a major question to be answered. The problem is constituted of those replies that contain a number of entries exceeding the optimal maximum. . . . The major question to be answered is how truncated search keys will perform on files ten and a hundred times the size of that used in this experiment."[4] He elsewhere observes that "as a file of bibliographic entries increases, the maximum number of entries per reply does not increase in a one-to-one ratio. . . ."[5] This paper presents a mathematical model that addresses itself to the problem defined by Kilgour and attempts to explain his observation; it is suggested that the gross features of the behavior are statistical in nature and not properties of specific search keys.

* By the phrase "search key" we mean a key similar to the 3-3 or 3-1-1-1 keys used at the Ohio College Library Center and other places, which is made up by concatenating truncations of bibliographic data elements.

Journal of Library Automation Vol. 6/2 June 1973

A VIEW OF COLLECTION GROWTH

The cause of the phenomenon observed by Kilgour can best be understood by first considering a simple model which, while not itself valid, does cast light on the nature of the behavior. This first model neglects the effect of randomness both in the growth of the collection and in the arrival of requests.
It supposes our search key has the following property: regardless of collection size, the fraction of the collection retrieved by a particular search key value, $v_i$, is exactly given by a constant $f_i$; thus, if the file holds $N$ records, a request for $v_i$ will retrieve $n_i = f_i N$ records. This model similarly assumes that among any sizeable number of requests, the fraction of the time any particular search key value will occur is fixed; thus, for any subset of search key values, it is possible to determine how often members of that subset will occur among a set of requests.

In particular, for any integer $n$, we can form the set of all the search key values that will retrieve fewer than $n$ items. We can then determine how often search key values from that set are requested. If, for example, requests for these values occur 99 percent of the time, then we can assert that 99 percent of the time fewer than $n$ items will be retrieved. If the file contains $N$ items, then these $n$ items constitute the fraction $f = n/N$ of the file. Should the collection size increase to $lN$, then the model predicts that 99 percent of the time fewer than $f(lN) = ln$ items would be retrieved. In other words, we have precisely the proportional growth that Kilgour observes does not occur. This argument shows that a simple deterministic model does not conform to experience with search keys.

The model breaks down in two ways, which accounts for the discrepancy between the results derived from it and Kilgour's observations:

1. in any actual library, the fraction of the time that a particular request will appear within a sequence of requests will vary; and
2. in comparing two different samples having the same size, the number of items having a given search key value will vary.

The first of these factors is easily dealt with and its analysis will suggest the number of requests to use in a test of search key behavior in a given library.
For a particular collection, let $S$ denote the set of search key values for which, say, twenty or more items are retrieved. We would like to find the fraction of the time that a request in $S$ occurs in the long run; suppose this value is in fact $q$. Then among $M$ requests, the probability that $m$ members of $S$ occur is given by the binomial distribution $f_B(m \mid q, M)$. This distribution has a mean of $qM$ and a variance of $qM(1 - q)$. Should we desire to estimate the actual fraction of the time that twenty or more items will be retrieved, we can take a sample of $M$ requests and compute $\hat{q}$, the fraction of the requests with search key values in $S$; if we do so, we will usually get a value for $\hat{q}$ between $q - \frac{2}{\sqrt{M}}\sqrt{q(1 - q)}$ and $q + \frac{2}{\sqrt{M}}\sqrt{q(1 - q)}$. If, for example, $q = .01$ and $M = 10{,}000$, we would tend to find $\hat{q}$ in the interval $.01 \pm .002$. Thus the effect of randomness in the arrival of requests can easily be controlled by increasing the number of requests considered; furthermore, the size of error can be predicted.

We next introduce the second factor; its analysis will suggest how the behavior of search keys will change as the collection grows in size. For this purpose we adopt a model of collection growth which assumes that as items arrive, they are randomly distributed among the search key values in accordance with some probability distribution. If we suppose that the probability of an item being assigned a specified search key value, $v_i$, is $p_i$, then in a collection of $N$ items we may conclude that the probability of $n$ items having that value is given by the binomial distribution:

$$f_B(n \mid p_i, N) = \binom{N}{n} p_i^{\,n} (1 - p_i)^{N-n}.$$

If $g'(v_i)$ is the probability that the value $v_i$ is selected from the request population, then the probability that the "next" request retrieves $n$ items is given by

$$\sum_i g'(v_i)\, f_B(n \mid p_i, N) = \int g(p)\, f_B(n \mid p, N)\, dp;$$

here

$$g(p)\, dp \stackrel{\text{def}}{=} \sum_{p \le p_i \le p + dp} g'(v_i)$$

is the probability that a request arrives with value $p_i$ in the interval $(p, p + dp)$, and $g(p)$ will be treated as a continuous function.**

Since the expectation of the binomial distribution is given by $pN$, we have $N \int p\, g(p)\, dp \stackrel{\text{def}}{=} N\bar{p}$ as the expected number of items retrieved by a random request; since this is proportional to $N$, doubling the size of the collection will, on the average, double the amount of material retrieved. Similarly, the variance, $\sigma^2$, is given by

$$\sigma^2 = N^2 \left( \overline{p^2} - \bar{p}^{\,2} \right) + N \int p(1 - p)\, g(p)\, dp.$$

Should $\overline{p^2} - \bar{p}^{\,2}$, the variance of $p$, be small, this reduces to $N \int p(1 - p)\, g(p)\, dp \stackrel{\text{def}}{=} \bar{\sigma}^2 N$, so that approximately 95 percent of the time the amount of material retrieved would be less than

$$N\bar{p} + 2\sqrt{N}\,\bar{\sigma} = N \left( \bar{p} + \frac{2\bar{\sigma}}{\sqrt{N}} \right).$$

** This result would more precisely be expressed as $\int f_B(n \mid p, N)\, dG(p)$, which has the form of a Stieltjes integral. The expression used in the text is simpler and reasonably valid because of the vast number of values the search key can take.

It is the factor $\bar{p} + \frac{2\bar{\sigma}}{\sqrt{N}}$, and its dependence on $N$, that may account for Kilgour's nonlinearity, and not any property intrinsic in the nature of any type of search key. Thus, to the extent that this model reflects what is really happening, the 95 percent point increases roughly proportionately with file size; the "constant" of proportionality, however, is the sum of two terms: the first is a true constant, and the second is a term that approaches zero as the file gets larger. In particular, this model suggests that we will never reach a leveling off point: as the file increases in size, the number of items retrieved will also increase, and the pattern of increase will become increasingly linear.
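As a rough numerical illustration of the two results above, the following Python sketch evaluates the two-standard-deviation interval for the observed request fraction and the approximate 95 percent point $N(\bar{p} + 2\bar{\sigma}/\sqrt{N})$ for several file sizes. The function names and the values chosen for `p_bar` and `sigma_bar` are assumptions made only for illustration; they are not taken from any measured file.

```python
import math


def request_fraction_interval(q, M):
    """Two-standard-deviation interval for the observed fraction of requests
    falling in S, given the long-run fraction q and a sample of M requests."""
    half_width = 2.0 * math.sqrt(q * (1.0 - q) / M)
    return q - half_width, q + half_width


def ninety_five_point(p_bar, sigma_bar, N):
    """Approximate 95 percent point from the text: N * (p_bar + 2 * sigma_bar / sqrt(N))."""
    return N * (p_bar + 2.0 * sigma_bar / math.sqrt(N))


if __name__ == "__main__":
    # The example from the text: q = .01 and M = 10,000 requests.
    low, high = request_fraction_interval(0.01, 10_000)
    print(f"observed fraction expected between {low:.4f} and {high:.4f}")

    # Hypothetical parameter values, chosen only to show the trend with N.
    p_bar, sigma_bar = 1e-5, 3e-3
    for N in (10**5, 10**6, 10**7):
        point = ninety_five_point(p_bar, sigma_bar, N)
        # The ratio point / N falls toward p_bar as N grows.
        print(N, point, point / N)
```

Under these assumed values the ratio of the 95 percent point to $N$ decreases toward $\bar{p}$ as the file grows, which is the nonlinearity the model attributes to the $2\bar{\sigma}/\sqrt{N}$ term.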
Up to this point this discussion has been qualitative in nature, being based upon general statistical considerations and making use of the normal approximation to some unknown distribution; its broad conclusions are, however, consistent with the findings of earlier workers and can explain certain unanticipated properties of search keys. To proceed further it will be necessary to restrict the form of the function $g(p)$; this will be attempted in a later section of this paper.

RELATIONSHIP OF MODEL TO EARLIER RESEARCH

Interest in access methods that are appropriate for files of bibliographic data has generated a considerable amount of empirical research on search key behavior. Of necessity, this pioneering work has been of a descriptive nature, resulting in data showing search key behavior in specific environments. While these efforts have lent a good deal of insight into the nature of search keys, the basic weakness of such research lies in the difficulty of extending these findings to other situations. One purpose of a mathematical model such as the one being developed here is to provide this increased generality by representing in a concise and easily manipulated form the results of previous research. It is accordingly of interest to indicate the relationship between previous work on search keys and our model.

Research on search key performance has been of two kinds. The first kind seeks to answer the question: for any number, $n$, how many search key values retrieve $n$ items? The answer to this question depends only on the search key and the collection; it is independent of the pattern of request arrivals. The second kind of research involves the actual arrival of requests; it tries to answer the question: for any number $n$, how frequently will requests resulting in the retrieval of $n$ items occur?

To discuss this research in terms of our model requires a closer examination of the function $g(p)$ previously defined.
We recall that

$$g(p)\, dp \stackrel{\text{def}}{=} \sum_{p \le p_i \le p + dp} g'(v_i),$$

with $dp$ being a small number. Thus $g(p)$ is determined by two factors:

a. The number of search key values in the interval $(p, p + dp)$. Let us denote this value by $f(p)\, dp$, so $f(p)$ is the density of search keys at $p$. We make use here of the fact that although the number of possible search key values is finite, the number is very large, so their distribution can be thought of as continuous.
b. The average probability of search keys, with values $p_i$ near $p$, being requested. We shall refer to this quantity as $g''(p)$.

By combining these factors we have $g(p) = g''(p) f(p)$.

In terms of this discussion, the first type of research described above is in fact estimating $f(p)$: if there are $s$ search key values that retrieve $n$ items from a collection of $N$ items, then $s$ is an estimate of $\frac{1}{N} f\!\left(\frac{n}{N}\right)$; this relation uses $n = pN$ and $dp = \frac{n + 1/2}{N} - \frac{n - 1/2}{N} = \frac{1}{N}$. The second kind of research directly estimates $g(p)$.

Guthrie, in a recent paper, provides a bridge between the two types of research by discussing his findings in terms of two models.[6] One of his models, which asserts that each search key value has an equal chance of being requested, is equivalent to the assumption that $g''(p) = 1$, and $g(p) = f(p)$. Guthrie finds that this is not an adequate representation of his data. Guthrie's second model asserts that each item has an equal chance of being requested. In our terms this becomes $g''(p) \propto p$, and $g(p) \propto p f(p)$. This model, while an improvement over the first, still disagrees with the data. Furthermore, these models do not estimate $f(p)$; even if Guthrie's model were correct, we would not know the probability that $n$ items would be retrieved until we were told how many search key values contained $n$ items. In the next section we will try to remedy this situation by means of a two-parameter representation of $g(p)$.
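The connection between the two kinds of research can be sketched numerically. Given counts of how many search key values retrieve $n$ items (an estimate of $f(p)$ at $p = n/N$), the Python sketch below weights those counts according to each of Guthrie's two models to obtain the probability that a request retrieves $n$ items. The function names and the sample counts are hypothetical, introduced only for illustration; they are not Guthrie's data.

```python
def retrieval_distribution(counts, weight):
    """counts maps n (items retrieved) to s (number of key values retrieving n items).
    weight(n) is the relative chance that one such key value is requested.
    Returns a dict mapping n to the probability that a request retrieves n items."""
    raw = {n: s * weight(n) for n, s in counts.items()}
    total = sum(raw.values())
    return {n: w / total for n, w in raw.items()}


def equal_key_model(counts):
    # First model: every search key value is equally likely to be requested,
    # so g''(p) is constant and g(p) follows f(p).
    return retrieval_distribution(counts, lambda n: 1.0)


def equal_item_model(counts):
    # Second model: every item is equally likely to be requested, so a key
    # value is requested in proportion to n, that is, in proportion to p = n/N.
    return retrieval_distribution(counts, lambda n: float(n))


if __name__ == "__main__":
    # Hypothetical counts: 800 key values retrieve 1 item, 150 retrieve 2, and so on.
    counts = {1: 800, 2: 150, 5: 40, 20: 10}
    for name, model in (("equal key", equal_key_model), ("equal item", equal_item_model)):
        dist = model(counts)
        print(name, {n: round(prob, 4) for n, prob in dist.items()})
```

In this sketch the second model shifts probability toward the heavily populated key values, since each such value is weighted by the number of items it retrieves.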
A REPRESENTATION OF f(p)

To get a more detailed account of search key behavior by experiment is difficult since the two aspects of randomness already discussed are confounded; the experimenter only sees the combined effect. We will, however, try to estimate the distribution $g(p)$ by a distribution of the form

$$\frac{(\alpha + \beta + 1)!}{\alpha!\, \beta!}\, p^{\alpha} (1 - p)^{\beta}.$$

We believe that such an attempt is reasonable on three grounds:

a. It is not possible to find $g(p)$ exactly, and moreover, it is not clear that this would be desirable. We are interested in a reasonable approximation that is satisfactory for decision-making purposes;
b. The above distribution assumes a wide variety of shapes as $\alpha$ and $\beta$ vary; it seems likely that values of $\alpha$ and $\beta$ can be found for which this distribution is close enough to $g(p)$; and
c. This distribution is mathematically tractable.

If we proceed using the above approximation for $g(p)$, we find:

(i) the probability, $P(n)$, of $n$ items being retrieved is given by

$$P(n) = \binom{N}{n} \frac{(\alpha + \beta + 1)!}{\alpha!\, \beta!} \cdot \frac{(\alpha + n)!\, (N - n + \beta)!}{(\alpha + \beta + N + 1)!}; \tag{1}$$

(ii) the expected number of items retrieved, $E$, is given by

$$E = N\, \frac{\alpha + 1}{\alpha + \beta + 2}; \tag{2}$$

and (iii) the variance, $V$, of the number of items retrieved is given by

$$V = N\, \frac{\alpha + 1}{\alpha + \beta + 2} \cdot \frac{\beta + 1}{\alpha + \beta + 3} \left( 1 + \frac{N}{\alpha + \beta + 2} \right). \tag{3}$$

If the experiment is performed on a small sample, the expectation and variance can be computed and the values of $\alpha$ and $\beta$ estimated from the relations

$$\beta = \frac{\alpha \left( 1 - \frac{E}{N} \right) + 1 - 2\frac{E}{N}}{\frac{E}{N}}, \tag{4}$$

and

$$\alpha = \frac{E - \dfrac{V/N}{1 - \frac{E}{N}}}{\dfrac{V}{E \left( 1 - \frac{E}{N} \right)} - 1} - 1. \tag{5}$$

Usually $\frac{E}{N}$ will be much smaller than one; in this case we may use the approximations:

$$\beta = (\alpha + 1) \frac{N}{E}, \tag{4$'$}$$

and

$$\alpha = \frac{E - \frac{V}{N}}{\frac{V}{E} - 1} - 1. \tag{5$'$}$$

Once $\alpha$ and $\beta$ have been evaluated, we can compute the probabilities $P(n)$ for files of arbitrary size, and with these values we can make assertions regarding the probability of, say, more than 30 items being retrieved.
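The procedure just described can be sketched in Python as follows. This is a minimal illustration, assuming the sample mean and variance are already in hand; the function names are hypothetical, and the log-gamma function is used in place of factorials so that non-integer $\alpha$ and $\beta$ are allowed. The parameter estimates are obtained by inverting Formulas 2 and 3 directly, which is algebraically the same as the exact relations 4 and 5.

```python
from math import exp, lgamma


def log_factorial(x):
    # lgamma(x + 1) extends x! to non-integer x (requires x > -1).
    return lgamma(x + 1.0)


def prob_n(n, N, alpha, beta):
    """Formula 1: probability that a request retrieves n items from a file of N."""
    log_p = (log_factorial(N) - log_factorial(n) - log_factorial(N - n)
             + log_factorial(alpha + beta + 1) - log_factorial(alpha) - log_factorial(beta)
             + log_factorial(alpha + n) + log_factorial(N - n + beta)
             - log_factorial(alpha + beta + N + 1))
    return exp(log_p)


def estimate_parameters(E, V, N):
    """Solve Formulas 2 and 3 for alpha and beta, given mean E and variance V
    observed on a sample of N items."""
    m = E / N
    ratio = V / (E * (1.0 - m))          # equals (S + N) / (S + 1), S = alpha + beta + 2
    S = (N - ratio) / (ratio - 1.0)
    alpha = m * S - 1.0
    beta = (1.0 - m) * S - 1.0
    return alpha, beta


def tail_probability(threshold, N, alpha, beta):
    """Probability that more than `threshold` items are retrieved."""
    return 1.0 - sum(prob_n(n, N, alpha, beta) for n in range(threshold + 1))


if __name__ == "__main__":
    # Hypothetical sample statistics, for illustration only.
    sample_N, E, V = 1000, 0.75, 2.0
    alpha, beta = estimate_parameters(E, V, sample_N)
    print("alpha:", alpha, "beta:", beta)
    # Extrapolate to a file ten times larger and ask how often more than 30 items appear.
    print("P(more than 30 items):", tail_probability(30, 10 * sample_N, alpha, beta))
```

The same estimated $\alpha$ and $\beta$ are reused for the larger file; only $N$ changes, which is the sense in which the model extrapolates from a sample.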
A relation that can be derived from Formula 1 and may be of use when comparing this model with experiment is:

$$\frac{P(n)}{P(n + 1)} = \frac{1 + \dfrac{\beta}{N - n}}{1 + \dfrac{\alpha}{n + 1}}.$$

The probability of zero retrievals is likely to be an extraordinary point in the distributions $g(p)$ and $P(n)$ since it is influenced by the knowledge that a user may have of the collection; this effect is likely to be encountered in a sampling process in which the requests have to be generated artificially. In such cases it would be advisable to treat $P(0)$ as an empirically derived parameter, $\theta$, and use the modified formula

$$P'(n) = \begin{cases} \theta & \text{if } n = 0 \\[4pt] (1 - \theta)\, \dfrac{P(n)}{1 - P(0)} & \text{if } n \neq 0. \end{cases} \tag{6}$$

The value of $\theta$ can be estimated by the fraction of requests retrieving zero items; for sampling techniques using only productive requests, $\theta$ will be zero. $\alpha$ and $\beta$ can be calculated as before from the mean and variance of the sample.

CONCLUSION

The above discussion is intended as an attempt to provide some theoretical understanding of the puzzling behavior discovered in the use of search keys and also to provide some guide for those experimenting with samples of such files. We do, however, urge caution for the latter uses.

An analysis similar to the above can be useful under several different circumstances, such as: determining the future behavior expected of a search key in a single library as the collection grows; determining the behavior for one library based upon experiments conducted on a different but similar library; and extrapolating from the performance of a search key in a sample of the collection to its performance in the full collection. If one wishes to compare two different libraries, one can note that as far as search key values are concerned, a particular library's collection can be thought of as a random sample of the larger population from which it selects its material, and accordingly the formula for $P(n)$ should be valid.
In this case, if two different collections are drawn from the same population, then $g(p)$ refers to this population and the libraries are distinguished by the parameter $N$; when we are considering samples from a single library, then $N$ is the sample size and $g(p)$ refers to the library itself. No theoretical basis exists at present for estimating to what extent, if any, the populations being considered depend upon the type of library, so this problem must be dealt with empirically. We have assumed here that these populations are similar with regard to search key values. Should these populations in fact vary, it is possible that they can be broken down, e.g., by language, into subpopulations that are stable and for each of which the analysis is valid.

ACKNOWLEDGMENTS

This work was made possible by CLR/NEH Grant No. E0-262-70-4658. I would like to express my gratitude to members of the University of Chicago Systems Development Office for their many comments and suggestions on this work.

REFERENCES

1. Frederick G. Kilgour, Philip L. Long, Eugene B. Leiderman, and Alan L. Landgraf, "Title-Only Entries Retrieved by Use of Truncated Search Keys," Journal of Library Automation 4:207-10 (Dec. 1971).
2. A. Bookstein, "Double Hashing," Journal of the American Society for Information Science 23:402-25 (Nov.-Dec. 1972).
3. A. Bookstein, "Hash Coding with a Non-Unique Search Key," to be published in the Journal of the American Society for Information Science.
4. Frederick G. Kilgour, Philip L. Long, Eugene B. Leiderman, and Alan L. Landgraf, "Retrieval of Bibliographic Entries from a Name-Title Catalog by Use of Truncated Search Keys," preprint.
5. Kilgour, Long, Leiderman, and Landgraf, "Title-Only Entries," p. 209-10.
6. Gerry P. Guthrie and Steven D. Slifko, "Analysis of Search Key Retrieval on a Large Bibliographic File," Journal of Library Automation 5:96-100 (June 1972).