
Statistical Behavior of Search Keys 

Abraham BOOKSTEIN: Graduate Library School, University of Chicago 

Editor's note: The editor and author are aware that varying approaches may be taken to the 
problem presented here. Readers are invited to respond in the form of a paper or a technical 
communication.

In discussion about search keys, concern has been expressed as to how the 
number of items retrieved by a single value relates to collection size.
This paper creates a statistical model that attempts to give some insight 
into this behavior. It is concluded that, in general, the observed behavior 
can be explained as being intrinsically statistical in nature rather than being 
a property of specific search keys. An attempt is made to relate this model 
to other research, and to indicate how this model may be made to yield
more accurate predictions. 

INTRODUCTION 

Various experiments suggest that it may be possible to develop, as an access 
route into a file of bibliographic records, a search key* whose values can be
easily derived from such bibliographic data as is likely to be available to 
its users.1 Some concern, however, has been expressed regarding the non-
uniqueness of these keys: if the number of items retrieved were often to 
exceed an amount easily handled by a user of the system, the value of this 
access route would be considerably diminished. Accordingly, an important 
measure of search key performance is the frequency with which a large 
number of records is retrieved as the search key is applied to the file. This
measure is related, for example, to how many memory accesses will be required,
on the average, to retrieve all records satisfying a request; it is also
an important consideration in deciding which display device should be installed
in a system.2,3

After evaluating such a measure for a search key on a particular file, it is 
reasonable to ask how that measure will change over time, as the file in-
creases in size. The nature of this variation has already been of concern to 
researchers in the field. Kilgour, on the basis of a number of experiments
carried out at OCLC, notes that "There remains a major problem to be 

* By the phrase "search key" we mean a key similar to the 3-3 or 3-1-1-1 keys used at
Ohio College Library Center and other places, which is made up by concatenating truncations
of bibliographic data elements.


110 Journal of Library Automation Vol. 6/2 June 1973

solved and a major question to be answered. The problem is constituted 
of those replies that contain a number of entries exceeding the optimal 
maximum.... The major question to be answered is how truncated search
keys will perform on files ten and a hundred times the size of that used in
this experiment."4 He elsewhere observes that "as a file of bibliographic
entries increases, the maximum number of entries per reply does not increase
in a one-to-one ratio...."5 This paper presents a mathematical
model that addresses itself to the problem defined by Kilgour and attempts 
to explain his observation; it is suggested that the gross features of the be-
havior are statistical in nature and not properties of specific search keys. 

A VIEW OF COLLECTION GROWTH 

The cause of the phenomenon observed by Kilgour can best be under-
stood by first considering a simple model which, while not itself valid, does 
cast light on the nature of the behavior. This first model neglects the effect 
of randomness both in the growth of the collection and in the arrival of 
requests. It supposes our search key has the following property: regardless 
of collection size, the fraction of the collection retrieved by a particular 
search key value, $v_i$, is exactly given by a constant $f_i$; thus, if the file holds
$N$ records, a request for $v_i$ will retrieve $n_i = f_i N$ records. This model similarly
assumes that among any sizeable number of requests, the fraction of the 
time any particular search key value will occur is fixed; thus, for any subset 
of search key values, it is possible to determine how often members of that 
subset will occur among a set of requests. 

In particular, for any integer n, we can form the set of all the search key 
values that will retrieve less than n items. We can then determine how 
often search key values from that set are requested. If, for example, re-
quests for these values occur 99 percent of the time, then we can assert that 
99 percent of the time less than $n$ items will be retrieved. If the file contains
$N$ items, then these $n$ items constitute the fraction $f = n/N$ of the file. Should
the collection size increase to $lN$, then the model predicts that 99 percent
of the time less than $f \cdot (lN) = ln$ items would be retrieved. In other words,
we have precisely the behavior Kilgour observes does not occur. This 
argument shows that a simple deterministic model does not conform to ex-
perience with search keys. 

The model breaks down in two ways, which accounts for the dis-
crepancy between the results derived from it and Kilgour's observations: 

1. in any actual library, the fraction of the time that a particular request 
will appear within a sequence of requests will vary; and 

2. in comparing two different samples having the same size, the number 
of items having a given search key value will vary. 

The first of these factors is easily dealt with and its analysis will suggest 
the number of requests to use in a test of search key behavior in a given 
library. For a particular collection, let S denote the set of search key values



for which, say, twenty or more items are retrieved. We would like to find 
the fraction of the time that a request in S occurs in the long run; suppose 
this value is in fact $q$. Then among $M$ requests, the probability that $m$ members
of S occur is given by the binomial distribution $f_B(m \mid q, M)$. This distribution
has a mean of $qM$ and a variance of $qM(1 - q)$. Should we desire
to estimate the actual fraction of the time that twenty or more items
will be retrieved, we can take a sample of $M$ requests and compute $\hat{q}$, the
fraction of the requests with search key values in S; if we do so, we will
usually get a value for $\hat{q}$ between

$$q - \frac{2}{\sqrt{M}}\sqrt{q(1 - q)} \quad \text{and} \quad q + \frac{2}{\sqrt{M}}\sqrt{q(1 - q)}.$$

If, for example, $q = .01$ and $M = 10{,}000$, we would tend to
find $\hat{q}$ in the interval $.01 \pm .002$. Thus the effect of randomness in the arrival
of requests can easily be controlled by increasing the number of requests
considered; furthermore, the size of error can be predicted.
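
The interval just described can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the two-standard-deviation width is the normal approximation used in the text, not an exact binomial interval.

```python
import math


def sample_fraction_interval(q, num_requests):
    """Two-standard-deviation interval for the sample fraction of requests in S.

    q is the long-run fraction of requests falling in S, num_requests is M.
    """
    half_width = 2.0 * math.sqrt(q * (1.0 - q) / num_requests)
    return q - half_width, q + half_width


def requests_needed(q, half_width):
    """Smallest M whose two-standard-deviation half-width is at most half_width."""
    return math.ceil(4.0 * q * (1.0 - q) / half_width ** 2)


if __name__ == "__main__":
    low, high = sample_fraction_interval(0.01, 10_000)
    print(f"interval for q = .01, M = 10,000: {low:.5f} to {high:.5f}")
    print("requests needed for +/- .001:", requests_needed(0.01, 0.001))
```

Because the half-width shrinks as the inverse square root of M, halving the error requires roughly four times as many requests.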

We next introduce the second factor; its analysis will suggest how the 
behavior of search keys will change as the collection grows in size. For 
this purpose we adopt a model of collection growth which assumes that as 
items arrive, they are randomly distributed among the search key values 
in accordance with some probability distribution. If we suppose that the 
probability of an item being assigned a specified search key value, $v_i$, is $p_i$,
then in a collection of $N$ items we may conclude that the probability of $n$
items having that value is given by the binomial distribution:

$$f_B(n \mid p_i, N) = \binom{N}{n} p_i^{\,n} (1 - p_i)^{N - n}.$$
If $g'(v_i)$ is the probability that the value $v_i$ is selected from the request
population, then the probability that the "next" request retrieves $n$ items is
given by

$$\sum_i g'(v_i)\, f_B(n \mid p_i, N) = \int g(p)\, f_B(n \mid p, N)\, dp;$$

$$g(p)\, dp \overset{\text{def}}{=} \sum_{p \le p_i \le p + dp} g'(v_i)$$

is the probability that a request arrives with value $p_i$ in the interval
$(p, p + dp)$, and will be treated as a continuous function.** Since the expectation
of the binomial distribution is given by $pN$, we have

$$N \int p\, g(p)\, dp \overset{\text{def}}{=} N\bar{p}$$

as the expected number of items retrieved by a random request;
since this is proportional to $N$, doubling the size of the collection will,
on the average, double the amount of material retrieved. Similarly, the

variance, $\sigma^2$, is given by $N^2(\overline{p^2} - \bar{p}^2) + N \int p(1 - p)\, g(p)\, dp$. Should $\overline{p^2} - \bar{p}^2$,
the variance of $p$, be small, this reduces to $N \int p(1 - p)\, g(p)\, dp \overset{\text{def}}{=} \bar{\sigma}^2 N$, so
that approximately 95 percent of the time the amount of material retrieved
would be less than

$$N\bar{p} + 2\sqrt{N}\,\bar{\sigma} = N\left(\bar{p} + \frac{2\bar{\sigma}}{\sqrt{N}}\right).$$

** This result would more precisely be expressed as $\int f_B(n \mid p, N)\, dG(p)$, which has the form
of a Stieltjes integral. The expression used in the text is simpler and reasonably valid because
of the vast number of values the search key can take.


It is the factor

$$\bar{p} + \frac{2\bar{\sigma}}{\sqrt{N}},$$

and its dependence on N, that may account for Kilgour's nonlinearity, and 
not any property intrinsic in the nature of any type of search key. Thus, to 
the extent that this model reflects what is really happening, the 95 percent 
point increases roughly proportionately with file size; the "constant" of 
proportionality, however, is the sum of two terms: the first is a true constant,
and the second is a term that approaches zero as the file gets larger.
In particular, this model suggests that we will never reach a leveling off 
point-as the file increases in size, the number of items retrieved will also 
increase, and the pattern of increase will become increasingly linear. 
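
The less-than-proportional growth of the 95 percent point can be seen numerically. The sketch below is illustrative only: the mean probability is a made-up value, and the second parameter is set under the added assumption that the variance of p is negligible, so that the integral of p(1 - p)g(p) is close to the mean times one minus the mean.

```python
import math


def ninety_five_percent_point(n_items, p_mean, sigma_bar):
    """Approximate 95 percent point: N * (p_mean + 2 * sigma_bar / sqrt(N))."""
    return n_items * (p_mean + 2.0 * sigma_bar / math.sqrt(n_items))


# Hypothetical parameters, chosen only to show the shape of the curve.
P_MEAN = 1e-5
# Assumption: with negligible variance of p, sigma_bar^2 is about p_mean * (1 - p_mean).
SIGMA_BAR = math.sqrt(P_MEAN * (1.0 - P_MEAN))


def growth_table(sizes):
    """Return (file size, 95 percent point) pairs for the given file sizes."""
    return [(n, ninety_five_percent_point(n, P_MEAN, SIGMA_BAR)) for n in sizes]


if __name__ == "__main__":
    for n_items, point in growth_table([100_000, 1_000_000, 10_000_000]):
        print(f"N={n_items:>10,d}  95% point={point:8.2f}  per item={point / n_items:.2e}")
```

Each tenfold increase in file size raises the 95 percent point by less than a factor of ten, and the factor moves toward ten as the file grows, which is the behavior the model attributes to the second term.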

Up to this point this discussion has been qualitative in nature, being 
based upon general statistical considerations and making use of the normal 
approximation to some unknown distribution; its broad conclusions are, 
however, consistent with the findings of earlier workers and can explain 
certain unanticipated properties of search keys. To proceed further it will
be necessary to restrict the form of the function $g(p)$; this will be attempted
in the following section of this paper.

RELATIONSHIP OF MODEL TO EARLIER RESEARCH 

Interest in access methods that are appropriate for files of bibliographic 
data has generated a considerable amount of empirical research on search 
key behavior. Of necessity, this pioneering work has been of a descriptive 
nature, resulting in data showing search key behavior in specific environ-
ments. While these efforts have lent a good deal of insight into the nature 
of search keys, the basic weakness of such research lies in the difficulty of 
extending these findings to other situations. One purpose of a mathematical 
model such as the one being developed here is to provide this increased
generality by representing in a concise and easily manipulated form the 
results of previous research. It is accordingly of interest to indicate the re-
lationship between previous work on search keys and our model. 

Research on search key performance has been of two kinds. The first
kind seeks to answer the question: for any number, n, how many search
key values retrieve n items? The answer to this question depends only on
the search key and the collection; it is independent of the pattern of request
arrivals. The second kind of research involves the actual arrival of
requests; it tries to answer the question: for any number n, how frequently 
will requests resulting in the retrieval of n items occur? 

To discuss this research in terms of our model requires a closer examination
of the function $g(p)$ previously defined. We recall that

$$g(p)\, dp \overset{\text{def}}{=} \sum_{p \le p_i \le p + dp} g'(v_i),$$

with $dp$ being a small number. Thus $g(p)$ is determined
by two factors:



a. The number of search key values in the interval $(p, p + dp)$. Let us
denote this value by $f(p)\,dp$, so $f(p)$ is the density of search keys at $p$.
We make use here of the fact that although the number of possible
search key values is finite, the number is very large, so their distribution
can be thought of as continuous.

b. The average probability of search keys, with values $p_i$ near $p$, being
requested. We shall refer to this quantity as $g''(p)$. By combining
these factors we have $g(p) = g''(p)\,f(p)$.

In terms of this discussion, the first type of research described above is
in fact estimating $f(p)$: if there are $s$ search key values that retrieve $n$ items
from a collection of $N$ items, then $s$ is an estimate of

$$\frac{1}{N}\, f\!\left(\frac{n}{N}\right);$$

this relation uses $n = pN$, and

$$dp = \frac{n + \tfrac{1}{2}}{N} - \frac{n - \tfrac{1}{2}}{N} = \frac{1}{N}.$$

The second kind of research directly estimates g ( p). Guthrie, in a recent 
paper, provides a bridge between the two types of research by discussing 
his findings in terms of two models.6 One of his models, which asserts that 
each search key value has an equal chance of being requested, is equivalent 
to the assumption that $g''(p) = 1$, and $g(p) = f(p)$. Guthrie finds that
this is not an adequate representation of his data. 

Guthrie's second model asserts that each item has an equal chance of
being requested. In our terms this becomes $g''(p) \propto p$, and $g(p) \propto p\,f(p)$. This
model, while an improvement over the first, still disagrees with the data.
Furthermore, these models do not estimate $f(p)$; even if Guthrie's model
were correct, we would not know the probability that $n$ items would be retrieved
until we were told how many search key values contained $n$ items.
In the next section we will try to remedy this situation by means of a
two-parameter representation of $g(p)$.
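
The difference between Guthrie's two models can be illustrated on a tabulation of how many key values retrieve each number of items. The sketch below is an empirical analogue, not Guthrie's procedure: it works on observed counts rather than on the underlying probabilities p, and the histogram is invented for illustration.

```python
def request_distribution(keys_per_count, weight_by_items):
    """Distribution of the number of items retrieved by a random request.

    keys_per_count maps n to the number of key values that retrieve n items.
    weight_by_items=False: every key value is equally likely to be requested.
    weight_by_items=True: every item is equally likely to be requested, so a
    key value is requested in proportion to the number of items it retrieves.
    """
    weights = {
        n: (n * keys if weight_by_items else keys)
        for n, keys in keys_per_count.items()
    }
    total = sum(weights.values())
    return {n: weight / total for n, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical histogram: 700 key values retrieve 1 item, 200 retrieve 2, ...
    histogram = {1: 700, 2: 200, 3: 70, 10: 30}
    equal_keys = request_distribution(histogram, weight_by_items=False)
    equal_items = request_distribution(histogram, weight_by_items=True)
    for n in sorted(histogram):
        print(n, round(equal_keys[n], 3), round(equal_items[n], 3))
```

Under the second model the heavily populated key values are requested more often, so large replies become more probable than under the first.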

A REPRESENTATION OF f(p) 

To get a more detailed account of search key behavior by experiment is 
difficult since the two aspects of randomness already discussed are con-
founded; the experimenter only sees the combined effect. We will, however, 
try to estimate the distribution $g(p)$ by a distribution of the form

$$\frac{(\alpha + \beta + 1)!}{\alpha!\,\beta!}\, p^{\alpha} (1 - p)^{\beta}.$$

We believe that such an attempt is reasonable on three grounds:

a. It is not possible to find $g(p)$ exactly, and moreover, it is not clear
that this would be desirable. We are interested in a reasonable approximation
that is satisfactory for decision-making purposes;

b. The above distribution assumes a wide variety of shapes as $\alpha$ and $\beta$
vary; it seems likely that values of $\alpha$ and $\beta$ can be found for which
this distribution is close enough to $g(p)$; and

c. This distribution is mathematically tractable.
If we proceed using the above approximation for $g(p)$, we find:

(i) the probability, $P(n)$, of $n$ items being retrieved is given by

$$P(n) = \binom{N}{n} \frac{(\alpha + \beta + 1)!}{\alpha!\,\beta!} \cdot \frac{(\alpha + n)!\,(N - n + \beta)!}{(\alpha + \beta + N + 1)!}; \tag{1}$$

(ii) the expected number of items retrieved, $E$, is given by

$$E = N\, \frac{\alpha + 1}{\alpha + \beta + 2}; \text{ and} \tag{2}$$

(iii) the variance, $V$, of the number of items retrieved is given by

$$V = N\, \frac{\alpha + 1}{\alpha + \beta + 2} \cdot \frac{\beta + 1}{\alpha + \beta + 3} \left(1 + \frac{N}{\alpha + \beta + 2}\right). \tag{3}$$
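
Formulas 1 to 3 can be cross-checked numerically. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, factorials are evaluated through the gamma function (so non-integer values of the two parameters greater than -1 are accepted, which the text does not discuss), and the parameter values in the example are arbitrary.

```python
import math


def log_factorial(x):
    """log(x!) through the gamma function, valid for non-integer x > -1."""
    return math.lgamma(x + 1.0)


def retrieval_probability(n, n_items, alpha, beta):
    """Formula 1: probability that a request retrieves n of N items."""
    log_p = (
        log_factorial(n_items) - log_factorial(n) - log_factorial(n_items - n)
        + log_factorial(alpha + beta + 1) - log_factorial(alpha) - log_factorial(beta)
        + log_factorial(alpha + n) + log_factorial(n_items - n + beta)
        - log_factorial(alpha + beta + n_items + 1)
    )
    return math.exp(log_p)


def expected_retrieved(n_items, alpha, beta):
    """Formula 2: expected number of items retrieved."""
    return n_items * (alpha + 1) / (alpha + beta + 2)


def variance_retrieved(n_items, alpha, beta):
    """Formula 3: variance of the number of items retrieved."""
    return (
        n_items * (alpha + 1) / (alpha + beta + 2)
        * (beta + 1) / (alpha + beta + 3)
        * (1 + n_items / (alpha + beta + 2))
    )


if __name__ == "__main__":
    n_items, alpha, beta = 50, 1.0, 8.0  # arbitrary illustrative values
    probs = [retrieval_probability(n, n_items, alpha, beta) for n in range(n_items + 1)]
    mean = sum(n * p for n, p in enumerate(probs))
    print("total probability:", sum(probs))
    print("mean from P(n):", mean, "Formula 2:", expected_retrieved(n_items, alpha, beta))
```

Working in logarithms avoids overflow in the factorials when N is large.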

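If the experiment is performed on a small sample, the expectation and
variance can be computed and the values of $\alpha$ and $\beta$ estimated from the
relations

$$\beta = \frac{\alpha\left(1 - \dfrac{E}{N}\right) + 1 - \dfrac{2E}{N}}{\dfrac{E}{N}}, \text{ and} \tag{4}$$

$$\alpha = \frac{E - \dfrac{V/N}{1 - E/N}}{\dfrac{V}{E} \cdot \dfrac{1}{1 - E/N} - 1} - 1. \tag{5}$$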

Usually $E/N$ will be much smaller than one; in this case we may use the
approximations:

$$\beta = (\alpha + 1)\,\frac{N}{E}, \text{ and} \tag{4'}$$

$$\alpha = \frac{E - \dfrac{V}{N}}{\dfrac{V}{E} - 1} - 1. \tag{5'}$$
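
The estimation step can be exercised with a round trip: compute E and V from known parameters, then recover the parameters from E and V. The sketch below is not the author's code. It reads Formula 5 as alpha = (E - (V/N)/(1 - E/N)) / ((V/E)/(1 - E/N) - 1) - 1 and Formula 4 as beta = (alpha(1 - E/N) + 1 - 2E/N) / (E/N), a reading that is algebraically consistent with Formulas 2 and 3; the function names are hypothetical.

```python
def moments(n_items, alpha, beta):
    """Expectation and variance of the number retrieved (Formulas 2 and 3)."""
    mean = n_items * (alpha + 1) / (alpha + beta + 2)
    variance = (
        mean * (beta + 1) / (alpha + beta + 3)
        * (1 + n_items / (alpha + beta + 2))
    )
    return mean, variance


def estimate_alpha_beta(mean, variance, n_items):
    """Recover alpha and beta from E, V, and N (Formulas 4 and 5 as read above)."""
    frac = mean / n_items                      # E / N
    ratio = variance / (mean * (1.0 - frac))   # (V / E) / (1 - E / N)
    if ratio <= 1.0:
        # The model needs more spread than a single binomial would produce.
        raise ValueError("variance too small for this model")
    alpha = (mean - (variance / n_items) / (1.0 - frac)) / (ratio - 1.0) - 1.0
    beta = (alpha * (1.0 - frac) + 1.0 - 2.0 * frac) / frac
    return alpha, beta


if __name__ == "__main__":
    mean, variance = moments(100, 1.0, 8.0)  # arbitrary illustrative parameters
    print(estimate_alpha_beta(mean, variance, 100))
```

In practice the mean and variance would come from a sample of requests, so the recovered parameters inherit the sampling error of those two statistics.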
Once $\alpha$ and $\beta$ have been evaluated, we can compute the probabilities
$P(n)$ for files of arbitrary size, and with these values we can make assertions
regarding the probability of, say, more than 30 items being retrieved.
A relation that can be derived from Formula 1 and may be of use
when comparing this model with experiment is:

$$\frac{P(n)}{1 + \dfrac{\beta}{N - n}} = \frac{P(n + 1)}{1 + \dfrac{\alpha}{n + 1}}.$$
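
The relation between successive probabilities also gives a convenient way to tabulate P(n) and a tail probability such as "more than 30 items." The sketch below is one possible implementation, with hypothetical function names; it uses the relation in the form P(n + 1) = P(n)(1 + alpha/(n + 1)) / (1 + beta/(N - n)) and starts from P(0) as given by Formula 1.

```python
import math


def retrieval_probabilities(n_items, alpha, beta):
    """Return [P(0), ..., P(N)] by stepping the ratio relation upward from P(0)."""
    # Formula 1 at n = 0: (alpha+beta+1)! (N+beta)! / (beta! (alpha+beta+N+1)!)
    log_p0 = (
        math.lgamma(alpha + beta + 2) - math.lgamma(beta + 1)
        + math.lgamma(n_items + beta + 1)
        - math.lgamma(alpha + beta + n_items + 2)
    )
    probabilities = [math.exp(log_p0)]
    for n in range(n_items):
        step = (1 + alpha / (n + 1)) / (1 + beta / (n_items - n))
        probabilities.append(probabilities[-1] * step)
    return probabilities


def tail_probability(probabilities, threshold):
    """Probability that more than `threshold` items are retrieved."""
    return sum(probabilities[threshold + 1:])


if __name__ == "__main__":
    probs = retrieval_probabilities(100, 1.0, 8.0)  # arbitrary illustrative values
    print("P(more than 30 items):", tail_probability(probs, 30))
```

For very large files P(0) may underflow to zero in floating point; in that case the recurrence would need to be carried out in logarithms.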



The probability of zero retrievals is likely to be an extraordinary point
in the distributions $g(p)$ and $P(n)$ since it is influenced by the knowledge
that a user may have of the collection; this effect is likely to be encountered
in a sampling process in which the requests have to be generated artificially.
In such cases it would be advisable to treat $P(0)$ as an empirically derived
parameter, $\theta$, and use the modified formula

$$P'(n) = \begin{cases} \theta & \text{if } n = 0, \\[4pt] (1 - \theta)\, \dfrac{P(n)}{1 - P(0)} & \text{if } n \ne 0. \end{cases} \tag{6}$$

The value of $\theta$ can be estimated by the fraction of requests retrieving zero
items; for sampling techniques using only productive requests, $\theta$ will be
zero. $\alpha$ and $\beta$ can be calculated as before from the mean and variance of
the sample.
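
Formula 6 amounts to replacing the zero point and rescaling the rest of the distribution. A minimal Python sketch follows, assuming the probabilities P(0), ..., P(N) are already available as a list; the function name and the example numbers are hypothetical.

```python
def zero_adjusted(probabilities, theta):
    """Apply Formula 6: set P'(0) = theta and rescale P(n) for n != 0.

    probabilities is the list [P(0), P(1), ..., P(N)] from the unmodified model.
    """
    p_zero = probabilities[0]
    if p_zero >= 1.0:
        raise ValueError("P(0) must be less than one to rescale the rest")
    scale = (1.0 - theta) / (1.0 - p_zero)
    return [theta] + [scale * p for p in probabilities[1:]]


if __name__ == "__main__":
    model = [0.25, 0.25, 0.25, 0.25]   # made-up distribution for illustration
    print(zero_adjusted(model, 0.1))   # observed fraction of empty replies: 0.1
    print(zero_adjusted(model, 0.0))   # only productive requests sampled
```

The rescaling keeps the relative sizes of the nonzero probabilities unchanged while making the whole distribution sum to one.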

CONCLUSION 

The above discussion is intended as an attempt to provide some theoreti-
cal understanding of the puzzling behavior discovered in the use of search 
keys and also to provide some guide for those experimenting with samples 
of such files. We do, however, urge caution for the latter uses. 

An analysis similar to the above can be useful under several different 
circumstances, such as: determining the future behavior expected of a 
search key in a single library as the collection grows; determining the be-
havior for one library based upon experiments conducted on a different but 
similar library; and extrapolating from the performance of a search key 
in a sample of the collection to its performance in the full collection.

If one wishes to compare two different libraries, one can note that as far 
as search key values are concerned, a particular library's collection can be 
thought of as a random sample of the larger population from which it selects 
its material, and accordingly the formula for $P(n)$ should be valid. In this
case, if two different collections are drawn from the same population, the
$g(p)$ refers to this population and the libraries are distinguished by the
parameter $N$; when we are considering samples from a single library, then
$N$ is the sample size and $g(p)$ refers to the library itself.

No theoretical basis exists at present for estimating to what extent the 
populations being considered depend upon the type of library, if any, so 
this problem must be dealt with empirically. We have assumed here that 
these populations are similar with regard to search key values. Should these 
populations in fact vary, it is possible that they can be broken down, e.g., 
by language, into subpopulations that are stable and for each of which the 
analysis is valid. 

ACKNOWLEDGMENTS 

This work was made possible by CLR/NEH Grant No. E0-262-70-4658. I would
like to express my gratitude to members of the University of Chicago Systems Develop-
ment Office for their many comments and suggestions on this work. 


REFERENCES 

1. Frederick G. Kilgour, Philip L. Long, Eugene B. Leiderman, and Alan L. Landgraf,
"Title-Only Entries Retrieved by Use of Truncated Search Key," Journal of Library
Automation 4:207-10 (Dec. 1971).

2. A. Bookstein, "Double Hashing," Journal of the American Society for Information
Science 23:402-25 (Nov.-Dec. 1972).

3. A. Bookstein, "Hash Coding with a Non-Unique Search Key," to be published in
the Journal of American Society for Information Science.

4. Frederick G. Kilgour, Philip L. Long, Eugene B. Leiderman, and Alan L. Landgraf,
"Retrieval of Bibliographic Entries from a Name-Title Catalog by Use of Truncated
Search Keys," preprint.

5. Kilgour, Long, Leiderman, and Landgraf, "Title-Only Entries," p.209-10.

6. Gerry P. Guthrie and Steven D. Slifko, "Analysis of Search Key Retrieval on a
Large Bibliographic File," Journal of Library Automation 5:96-100 (June 1972).