File Structure for an On-Line Catalog of One Million Titles

J. J. DIMSDALE: Department of Computing Science, University of Alberta, Edmonton, Canada, and H. S. HEAPS: Department of Computer Science, Sir George Williams University, Montreal, Canada.

A description is given of the file organization and design of an on-line catalog suitable for automation of a library of one million books. A method of virtual hash addressing allows rapid search of the indexes to the catalog file. Storage of textual material in a compressed form allows considerable reduction in storage costs.

INTRODUCTION

An integrated system for on-line library automation requires a number of computer accessible files. It proves convenient to divide these files into three principal groups: those required for the on-line catalog subsystem, those required for the acquisition subsystem, and those required for the on-line circulation subsystem. The present paper is concerned with the files for the catalog subsystem. Files required for the circulation subsystem will be discussed in a future paper.

The files for an on-line catalog system should contain all bibliographic details normally present in a manual catalog, and the file should be organized to allow searches to be made with respect to title words, authors, and Library of Congress (LC) call numbers. It may also be desired to search on other bibliographic details, in which instance the appropriate files may be added to those described in the present paper.

The file organization should be such as to support economic searching with respect to questions in which terms are connected by the logic operations AND, OR, and NOT. It should also allow question terms to be connected by operations of ADJACENCY and PRECEDENCE, and it should allow question terms to be weighted and the search made with reference to a specified threshold weight. It may be desirable for the file organization to include a thesaurus that may be used either directly by the user or by the search program to narrow, or broaden, the scope of the initial query or to ensure standardization of the question vocabulary.

The file organization and search strategy should ensure that the user of the on-line catalog system receives an acceptable response time to his queries, although it is likely that some of the operations required by the circulation system will be given a higher priority. Thus the integrated system must time-share between search queries, circulation transactions, and other tasks that originate from a number of separate terminals or from batch input. Such tasks might arise from acquisitions, and from update and maintenance of the on-line catalog. The system should be a special purpose time-sharing system such as the Time Sharing Chemical Information Retrieval System described by Lefkovitz and Powers and by Weinberg.1,2 In this system the queries time-share disk storage as well as the central processor.

Since an on-line catalog is a large file, and hence expensive to store in computer accessible form, it is desirable to store it in as compact a form as possible. For example, a catalog file for one million titles is likely to involve between 2 x 10^8 and 5 x 10^8 alphanumeric characters. If stored character by character the required storage capacity would be equivalent to that supplied by from seven to sixteen IBM 2316 disk packs. It is also important to design the frequently accessed files so as to minimize the number of disk, or data cell, accesses required to process each query.
The files described in the present paper include ones stored in compressed form and organized for rapid access.

Throughout the present paper the term title is used in a general sense. It may include periodical titles as well as book titles. However, it is supposed that frequently changing information, such as periodical volume ranges, will be stored as part of the circulation subsystem rather than the catalog subsystem.

OVERALL FILE ORGANIZATION

The complete bibliographic entries of the catalog may be stored in a serial (sequential) file so that any record may readily be read and displayed in its entirety. However, as indicated by Curtice, use of an inverted file is to be preferred for purposes of searching.3 An alternative to the simple serial file is one organized in the form of a multiple threaded list (multilist) in which all records that contain a particular key are linked together by pointers within the records themselves. The first record in each list is pointed to by an entry in a key directory as described by Lefkovitz, Holbrook, Dodd, and Rettenmayer.4-7

For very small collections of documents Divett and Burnaugh have attempted to organize on-line catalogs by use of ring structured variations of the multilist technique.8,9 Neither file organization is feasible for a collection of a million documents because of the long length of the threads involved. Many disk accesses would be needed in order to retrieve all elements of a list, and hence there would be a very slow response to queries. The cellular multilist structure proposed by Lefkovitz and Powers, or the cellular serial structure proposed by Lefkovitz, may well prove to be a viable alternative to the organization proposed in the present paper.10,11 However, as indicated by Lefkovitz, the inverted organization provides shorter initial, and successive, response times in answer to queries.12

In the present paper it is supposed that the on-line catalog file consists of both a serial file of complete bibliographic entries and an inverted file organized with respect to search keys such as title words, subject terms, author names, and call numbers. Such a two-level structure is often assumed and has been termed a "combined file" by Warheit who concluded it to be superior to either a single serial file or a threaded list organization.13-17

The file structure described in the present paper uses indexes based on the virtual scatter table as described by Morris and Murray, the scatter index table discussed by Morris, and the bucket as treated by Buchholz.18-20 The attractiveness of a similar structure for use in the Ohio College Library Center has been analyzed by Long, et al.21

The basic elements of the file organization are shown in Figure 1. It is supposed that the access keys are title words, but a similar file structure is used for access with respect to keys of other types.

Fig. 1. Overall File Organization (a key, e.g. a title word, is passed through a hashing function to the hash table file).

Any key may be operated on by a hashing function which transforms it into a pointer to an entry in a hash table file. This file contains pointers to both a dictionary file of title words and an inverted index which is stored in a compressed form. Entries within the compressed inverted index serve as pointers to the catalog file of complete bibliographic entries. Terms, such as title words, within the catalog file are coded to allow a compressed form of storage. The codes used in the compressed catalog file serve as pointers to the uncoded terms stored in the dictionary file. There would be a separate hashing function, hash table file, dictionary file, and compressed inverted file for use with each different type of key. However, there is only one compressed catalog file.
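By way of illustration only, the lookup path of Figure 1 may be sketched in present-day terms as follows. Ordinary Python dictionaries stand in for the disk-resident hash table, dictionary, inverted index, and catalog files; all names and structures are hypothetical and are not part of the design described in this paper.

# Illustrative sketch of the lookup path of Figure 1.  Plain Python dicts
# stand in for the disk-resident files; every name here is hypothetical.

def search_title_word(word, hash_table, dictionary, inverted_index, catalog):
    """Return the catalog records whose titles contain `word`."""
    entry = hash_table.get(hash(word))        # one bucket access in the real design
    if entry is None:
        return []                             # key is not present in the data base
    if dictionary[entry["dict_ptr"]] != word: # guard against a chance hash collision
        return []
    postings = inverted_index[entry["index_ptr"]]  # sequence numbers of catalog items
    return [catalog[n] for n in postings]          # complete bibliographic entries

# Example with toy data:
hash_table = {hash("hashing"): {"dict_ptr": 0, "index_ptr": 0}}
dictionary = ["hashing"]
inverted_index = [[0]]
catalog = ["Scatter storage techniques / R. Morris"]
print(search_title_word("hashing", hash_table, dictionary, inverted_index, catalog))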
For a search scheme that allows use of a thesaurus of synonyms, narrower terms, broader terms, and so forth, a thesaurus file may be added (Figure 2).

Fig. 2. File Organization with Inclusion of a Thesaurus.

The files must be organized to allow for ease of updating. As further bibliographic entries are added it is necessary to add additional pointers from the inverted index. Also, whenever a new key occurs in a bibliographic entry it must be added to the dictionary, assigned a code for storage in the compressed catalog file, and entered into the compressed inverted index.

STRUCTURE OF THE HASH TABLE FILE

In order to locate the set of inverted index pointers that corresponds to a given search key K, the key is first operated on by a hashing function that transforms it into a bit string of length v bits. Each such bit string is said to represent a virtual hash address, and is regarded as the concatenation of two substrings of length r and v-r bits. The two substrings are respectively said to constitute the major and the minor M(K) of the virtual hash address. The major is further divided into two bit strings B(K) and I(K) that define a bucket number B(K) of a bucket B(K), and an index number I(K) of an entry within the bucket. The major that represents the pair of numbers B(K), I(K) is said to constitute a real hash address.

The hash table file is divided into portions, or buckets, of equal length. Each bucket is further divided into an index section, a content section, and a counter section (Figure 3). The index sections of all buckets have the same length. Similarly, all content sections are of equal length, and so are all counter sections.

Fig. 3. Bucket of the Hash Table File: an index section, a content section, and a counter section that records the number of entries occupied, the number of overflows from the bucket, and the number of overflows into it.

As the hash table is created, entries are added sequentially into the content section so that any unfilled portion is at the end. In contrast, the index section of any bucket may contain unfilled entries at random positions and hence constitutes a scatter table.

The hash table file is created as follows. The various keys are transformed by the hashing function into bit strings B(K), I(K), M(K). In the bucket of number B(K) an entry as described below is added to the content section, and the vacancy pointer within the counter section is incremented to point to the beginning of the unfilled portion of the content section. The I(K)th entry number in the index section is then set to point to the position of the entry added to the content section. The entry placed in the content section includes the minor M(K) and a dictionary pointer to where the key is placed in the dictionary file as well as a pointer to an entry in the compressed inverted index.
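As a concrete illustration, the following sketch shows one way a key might be reduced to a v-bit virtual hash address and split into B(K), I(K), and M(K). The paper does not prescribe a particular hashing function; MD5 folded to v bits is used here purely as a stand-in, and the default bit widths are those derived later for title words.

import hashlib

# A sketch of the decomposition of a virtual hash address into bucket number
# B(K), index number I(K), and minor M(K).  MD5 is an arbitrary stand-in for
# the (unspecified) hashing function; the bit widths are illustrative.
def virtual_hash_address(key, r=18, v=33, bucket_bits=12):
    digest = hashlib.md5(key.encode("utf-8")).digest()
    address = int.from_bytes(digest, "big") & ((1 << v) - 1)  # v-bit virtual address
    major = address >> (v - r)                      # leading r bits: the real hash address
    minor = address & ((1 << (v - r)) - 1)          # trailing v-r bits: M(K)
    bucket = major >> (r - bucket_bits)             # leading bits of the major: B(K)
    index = major & ((1 << (r - bucket_bits)) - 1)  # remaining bits: I(K)
    return bucket, index, minor

# For example, virtual_hash_address("catalog") yields a bucket number in
# 0..4095, an index number in 0..63, and a 15-bit minor.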
If there has previously occurred a bit string B(K1), I(K1), M(K1) in which B(K1) = B(K), I(K1) = I(K), M(K1) is not equal to M(K), then no change is made to the I(K)th entry in the bucket B(K) or to the minor M(K1) in the content section. Instead, the chain pointer is set to point to the location of a new entry that is added to the content section. In this new entry the minor is set to M(K) and the dictionary pointer is set to indicate where the new key is placed in the dictionary file. There is said to have resulted a collision at the real hash address B(K), I(K).

If there has previously occurred a bit string B(K1), I(K1), M(K1) in which B(K1) = B(K), I(K1) = I(K), M(K1) = M(K), where K1 is not equal to K, then the collision bit that precedes M(K1) is set to 1 and a further content entry containing M(K) is chained from the entry that contains M(K1). There is said to have occurred a collision at the virtual hash address B(K), I(K), M(K).

The last three entries included in the counter section shown in Figure 3 are optional but are useful for monitoring the performance of the hashing function with respect to bucket overflows and so forth.

A bucket becomes full when there is no remaining unfilled space in its content section. If a further chain pointer is required from a content entry, its preceding overflow bit QC is set to 1 to indicate that the pointer is to another bucket. Likewise, if a further entry is required in the index section its preceding overflow bit QI is set to 1 to indicate that it refers to an entry within another bucket. The bucket is then said to have overflowed. Methods of handling bucket overflow, and choice of the new bucket, are discussed in a subsequent section.

It should be noted that use of a hash table as described above retains most of the advantages of the usual scatter index method in which the index entries and content entries are stored in two separate files. It has the further important advantage that in most instances a single disk access is sufficient to locate both the index entry and the corresponding content entry.

As noted by Buchholz and Heising, if it is known that certain keys are likely to appear with high frequency in search queries then it is advantageous to enter them at the start of creation of the hash file.22,23 They will then tend to appear near the beginnings of the content entry chains and hence require little CPU time for their subsequent location. Furthermore, they will tend to appear in the same bucket as their corresponding index entries, and hence their location will usually require only a single disk access.
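The insertion rules above may be summarized by the following simplified, in-memory sketch of a single bucket. Overflow into other buckets is omitted, and the record layout (Python dictionaries rather than packed bytes) is illustrative only; none of the names below come from the paper.

# A simplified in-memory sketch of one bucket and of the insertion rules
# described above.  Overflow into other buckets is not handled here.
class Bucket:
    def __init__(self, b=64, c=62):
        self.index = [None] * b      # index section: I(K) -> position in content section
        self.content = []            # content section, filled sequentially
        self.capacity = c

    def insert(self, I, minor, dict_ptr, inv_ptr):
        """Insert the minor M(K) under index number I(K); return False if the bucket is full."""
        if len(self.content) >= self.capacity:
            return False             # bucket overflow: handled by the overflow algorithm
        new_pos = len(self.content)
        entry = {"minor": minor, "collision": False, "chain": None,
                 "dict_ptr": dict_ptr, "inv_ptr": inv_ptr}
        pos = self.index[I]
        if pos is None:
            self.index[I] = new_pos          # empty index slot: start a new chain
        else:                                # collision at the real hash address
            while True:
                if self.content[pos]["minor"] == minor:
                    self.content[pos]["collision"] = True   # virtual-address collision
                if self.content[pos]["chain"] is None:
                    break
                pos = self.content[pos]["chain"]
            self.content[pos]["chain"] = new_pos            # chain the new entry
        self.content.append(entry)
        return True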
NUMBER OF BITS FOR VIRTUAL HASH ADDRESS

Suppose the hashing function is chosen so that the majors of the transformed keys are uniformly distributed among the R slots available for real hash addresses B, I. If there are N keys then a = N/R may be termed the load factor. It is the average number of keys that are transformed into any given real hash address. The probability that any given real hash address corresponds to k keys is given by Murray24 as

(1) Pk = e^(-a) a^k / k!

Hence, for any given real address the probability of a collision occurring is

(2) C = sum from k=2 to N of Pk = 1 - P0 - P1 = 1 - (1 + a)e^(-a).

If a collision occurs at a particular real hash address, the expected length of the required chain within the content section is

(3) L = (sum from k=2 to N of k Pk) / C = (1/C)(sum from k=0 to N of k Pk - P1) = (1/C)(a - ae^(-a)) = a(e^a - 1) / (e^a - 1 - a).

It may be noted that if the load factor a is equal to 1 then L = 2.43.

If all the transformed keys are distributed uniformly among the V possible virtual addresses B, I, M then the expected total number of collisions at virtual addresses is given by Murray25 as

(4) p = N^2 / 2V

provided V >> N. The expected relative frequency of collisions at virtual addresses is therefore

(5) f = N / 2V.

It proves convenient to regard N, f, and a as basic parameters in terms of which may be determined the number r of bits required in the major and the number v of bits required in the virtual hash addresses. The value of r must be at least as large as log2 R = log2 (N/a), and hence r may be chosen according to the formula

(6) r = ceil[log2 (N/a)]

where ceil[ ] means "the smallest integer greater than or equal to." The value of v must be at least as large as

(7) v = ceil[log2 V] = ceil[log2 (N/2f)].

If N and f have the form N = 2^n and f = 2^(-g) then v may be chosen according to the formula

(8) v = n + g - 1

and the number of bits required for the minor is

(9) m = v - r.

CHOICE OF BUCKET CAPACITY

With an 8-bit byte-oriented computer, such as the IBM 360, it proves convenient to use 8 bits of storage for each entry number plus overflow bit within the index section. If a value of zero is used to indicate an unused index entry there remain up to 127 possible values for entry numbers. Thus the number c of entries in the content section must be less than or equal to 127.

Suppose there are b slots for index entries in each bucket. The total number of index entries in the entire file is R. It follows from the results of Schay and Spruth,26 Tainter,27 and Heising28 that the probability P(b,c) of overflow of any bucket is given by

(10) P(b,c) = sum from k=c+1 to infinity of (ab)^k e^(-ab) / k!

For selected values of b, Beyer's tables of the Poisson distribution have been used to compute P(b,c) and to determine the smallest value of c for which P(b,c) <= 0.01.29 The results are shown in Table 1 for the instance in which a = 1. A similar table has been computed by Buchholz30 for the instance in which c = b and a ranges from 0.1 to 1.2.

Table 1. Values of b, c, and c/b for which P(b,c) <= 0.01 when a = 1.

b      c      c/b
1      5      5.00
2      6      3.00
3      8      2.66
4      10     2.50
5      11     2.20
6      13     2.17
7      14     2.00
8      15     1.88
9      17     1.89
10     18     1.80
11     19     1.73
12     20     1.67
13     22     1.69
14     23     1.64
15     24     1.60
16     25     1.56
17     27     1.59
18     28     1.55
19     29     1.53
20     30     1.50
60     80     1.33
100    125    1.25

As is apparent from Table 1, an increase in the value of b allows use of a smaller ratio c/b and hence permits more economical use of storage. With b = 64 the allowed value of c/b is 1.33 and hence c may be chosen equal to 85. The reduction in access time that results from structuring the file so that each bucket contains both index and content entries is, of course, effected at the expense of additional storage costs. For example, if c/b = 1.33 then the space allocated for storage of content entries is 33 percent greater than if content entries are stored in a separate file. Relaxation of the condition P(b,c) <= 0.01 allows a reduction in c/b, but the increased number of bucket overflows will cause additional disk accesses to be required.
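The sizing rules of Equations 6 through 10 can be checked numerically; the sketch below uses only the formulas quoted above, so the values it returns may differ by a unit or so from the published figures because of rounding. The function and parameter names are illustrative.

import math

# A sketch of the sizing calculations of Equations 6-10: bits for the major
# and virtual address, and the smallest content-section size c that keeps the
# bucket-overflow probability P(b, c) below a target.
def address_bits(N, a=1.0, f=2**-16):
    r = math.ceil(math.log2(N / a))         # Equation 6
    v = math.ceil(math.log2(N / (2 * f)))   # Equation 7, with V = N / (2f)
    return r, v, v - r                      # Equation 9: m = v - r

def bucket_capacity(b, a=1.0, p_max=0.01, c_limit=127):
    """Smallest c with P(b, c) <= p_max, where P(b, c) is the Poisson tail of Equation 10."""
    mean = a * b
    term = math.exp(-mean)                  # Poisson term for k = 0
    cumulative = term
    for c in range(1, c_limit + 1):
        term *= mean / c
        cumulative += term
        if 1.0 - cumulative <= p_max:       # the tail beyond c is the overflow probability
            return c
    return None                             # no c <= c_limit meets the target

# For example, address_bits(180_000) gives (18, 33, 15), and
# bucket_capacity(64, a=0.7) gives a value close to the c = 62 used later.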
TREATMENT OF BUCKET OVERFLOWS

When a new key is found to map into a bucket whose content section is full then some means must be found to provide space in some other bucket. The particular procedure that should be used depends on the extent to which the entire set of buckets contains unfilled portions.

Suppose that many buckets are almost full and that the number c of allowed content entries is less than 127. The entire hash file may then be expanded with the same index sections but with longer content sections.

If many buckets are almost full and c = 127 then the entire file may be expanded in such manner that each bucket is replaced by a pair of buckets that contain the same number b of allowable index entries, but whose number c1 of allowable content entries is chosen to ensure that P(b,c1) <= 0.01. Such doubling of buckets also doubles the number of index entries but it does not double the storage required for the entire file. Each key K that corresponds to an entry in the original bucket is associated with an entry in the first, or second, of the new buckets according as the leading bit of either its index address I(K) or its minor M(K) is equal to 0 or 1. The effect is to shift one bit from I(K) or M(K) into the bucket address B(K). This method is based on a suggestion of Morris.31

Suppose that few buckets are almost full. Then a suitable means of determination of an unfilled bucket for storage of the minor is through use of some overflow algorithm that determines a sequence of bucket numbers B0(K), B1(K), B2(K), etc., corresponding to any given full bucket of number B0(K). Suppose there are nb buckets. A quadratic residue algorithm

(11) Bj(K) = [B0(K) + aj + bj^2] mod nb

has been considered by Maurer and by Bell for use with in-core hash tables, but it suffers from the disadvantage that the existence of a full bucket B0(K) will divert entries into the particular buckets B1(K), B2(K), etc. and hence cause them to fill more rapidly than other buckets which may contain fewer entries.32,33

It is believed that a more desirable form of the quadratic residue algorithm is

(12) Bj(K) = {B0(K) + fj[I(K)]} mod nb

where fj is a suitably chosen function. Letting Bj(K) depend, through fj, on both j and I(K), instead of on j alone, allows reduction of the tendency to fill a particular set of buckets.

To prevent a tendency to overflow particular buckets it is also desirable for the overflow algorithm to produce bucket numbers that are uniformly distributed among all possible bucket numbers. Among the more promising forms to be chosen for the fj[I(K)] are the following:

(13a) fj[I(K)] = I'(K) j

where j = 1, 2, ..., nb - 1, and I'(K) denotes I(K) if I(K) is odd, but denotes I(K) + 1 if I(K) is even. Since nb is a power of 2 such choice of I'(K) ensures that I'(K) and nb have no common factors, and hence that Bj(K) steps through the sequence B1(K), B2(K), etc. covering every bucket in the file.

(13b) fj[I(K)] = I'(K) j^2

where j = 1, 2, ..., ceil[sqrt(nb)] - 1, and ceil[ ] means "the least integer greater than or equal to."

(13c) fj[I(K)] = Rj[I'(K)]

where j = 1, 2, ..., nb, and Rj[I'(K)] denotes a number output by a pseudorandom number generator of the form suggested by Morris34 with an initial input of I'(K) instead of 1.

It may be remarked that use of Equation 13a requires the least number of machine instructions, and the least CPU time per step, but it has a strong tendency to cluster the B1(K) immediately after the B0(K) and hence it is likely to be the least effective of the three methods. Use of Equation 13b produces less clustering, but the sequence does not include all buckets of the file. Use of Equation 13c requires the largest number of instructions and CPU time per step, but the Bj(K) are less likely to cluster and they are uniformly distributed among all possible buckets. Thus Equation 13c produces shorter chains of overflow buckets and hence requires fewer disk accesses.
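The three candidate overflow sequences can be sketched as simple generators. The pseudorandom variant of Equation 13c is shown with Python's random module seeded by I'(K) as a stand-in for the particular generator cited from Morris, so it illustrates the idea rather than reproducing that generator.

import math
import random

def i_prime(I):
    # I'(K): force the index number to be odd so that it has no common factor
    # with nb, which is a power of 2.
    return I if I % 2 == 1 else I + 1

def overflow_13a(B0, I, nb):
    for j in range(1, nb):
        yield (B0 + i_prime(I) * j) % nb          # Equation 13a

def overflow_13b(B0, I, nb):
    for j in range(1, math.ceil(math.sqrt(nb))):
        yield (B0 + i_prime(I) * j * j) % nb      # Equation 13b

def overflow_13c(B0, I, nb):
    rng = random.Random(i_prime(I))               # illustrative stand-in for the Morris generator
    for _ in range(nb):
        yield (B0 + rng.randrange(nb)) % nb       # Equation 13c

# For example, list(overflow_13a(5, 3, 16))[:4] gives the first four candidate
# overflow buckets for a full bucket number 5 and index number 3.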
If a new key K maps into a full bucket B0(K) then the following procedure is used to determine the bucket into which the minor of K is to be inserted:

(i) The chain of pointers from the I(K)th entry of the bucket B0(K) is followed, possibly through overflow buckets given by Equation 12, in order to locate the terminal entry of the chain. Suppose this terminal entry is within a bucket Bj(K).

(ii) If there is available space in bucket Bj(K) then the minor M(K) is entered and chained as described previously.

(iii) If bucket Bj(K) is full, but there is space in Bj+1(K), then the minor M(K) is entered into Bj+1(K) and chained as described previously.

(iv) If buckets Bj(K) and Bj+1(K) are both full, and bucket Bj+1(K) contains at least one nonempty index entry I(K') whose chained content entries are all contained within Bj+1(K), then the minor M(K) is stored according to the following displacement algorithm: the terminal member of the chain from I(K') is displaced to an overflow bucket Br(K') determined by use of Equation 12, except that if both Br(K') and Br+1(K') are full then a further bucket is determined by use of the displacement algorithm. The minor M(K) is substituted for the displaced entry in bucket Bj+1(K) and is chained appropriately.

(v) If application of Step (iv) leads to a bucket Bj+1(K), or Br+1(K), that contains no nonempty index entry whose chained content entries are all contained within it, then the entire hash file must be expanded by use of one of the procedures described at the beginning of the present section.

It should be emphasized that, although Step (iv) is necessary for completeness, the probability of its use is very low. With a probability of less than 0.01 for a bucket overflow, the probability of use of Step (iv) is less than (0.01)^3.

SEARCH PHASE AND PROBLEM OF MISMATCH

In the previous sections the structure of the hash index file has been discussed with emphasis on details of its creation and update. During search of the catalog files by use of the inverted index, each search key is processed by the following search algorithm:

Step 1: The search key K is transformed by the hashing function into a virtual hash address B(K), I(K), M(K).

Step 2: The bucket B(K) is read into core.

Step 3: The index entry specified by I(K) is examined. If it is empty then the search key is not present in the data base. If it is not empty then Step 4 is performed.

Step 4: The overflow bit of the index entry specified by I(K) is examined. If it is equal to 1 then Step 5 is performed. If it is equal to 0 then Step 6 is performed.

Step 5: The overflow algorithm is used to determine the address of the required overflow bucket which is then read into core, and Step 6 is executed.

Step 6: The minor of each entry in the chain of content entries is compared to the minor of the search key's virtual hash address until either a match is found or the chain is exhausted. Whenever the chain leads to an overflow bucket then Step 5 is performed.
Step 7: If a match is found for M(K) then the collision bit of the entry is examined. If it is equal to 0 then Step 9 is performed. If it is equal to 1 then Step 8 is performed.

Step 8: The dictionary entry that corresponds to each content entry in the virtual address collision is read into core and compared to the search key K. If no match is found then the search key is not present in the index.

Step 9: This step is included because there is a small probability that a misspelled search key, or one not present in the hash file, may be transformed into the same virtual address as some key already included in the file. The step consists of reading the corresponding dictionary entry into core and comparing it with the search key. For reasons discussed later in the present section it is desirable to omit this step.

It should be noted that in most instances the search algorithm will not require execution of Steps 5 and 8. In fact, with the hash index files designed as described in the previous sections, the probability of execution of Step 5 is about 0.01 and the probability of execution of Step 8 is about 2^(-16). Consequently, if Step 9 is also omitted the number of disk accesses required to find the index entry corresponding to a search key is approximately 1.01.

The mismatch problem, which gives rise to Step 9 of the search algorithm, is less serious than might be expected. Suppose the hash function distributes the transformed keys uniformly over all hash addresses. The probability that a new, or misspelled, key maps into an existing entry is given by

(14) Pc = N/V

The probability that a search leads to a mismatch is therefore

(15) Pm = Ps N/V

where Ps is the probability that the search key is misspelled or not in the hash table. Thus, for a hash table of N = 2^16 = 65,536 title words and V = 2^31, an assumption of Ps = 0.1 leads to Pm = 3 x 10^-6.

Because Pm is extremely small, and because each execution of Step 9 requires up to two disk accesses, it is desirable to omit this step. If experience shows that particular new or misspelled search keys occur frequently, and cause mismatches, they may themselves be entered into the hash index file. In fact, some degree of automatic spelling correction may be provided if some common misspellings are included in the hash files and chained to the content entries that correspond to the correctly spelled keys. Correct, but alternative, spellings of search keys may also be treated in the same manner.
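A quick numerical check of Equations 14 and 15 with the figures just quoted (V = 2^31 corresponds to n = 16 and g = 16 in Equation 8):

# A small check of Equations 14 and 15 with the figures quoted above.
N, V, Ps = 2**16, 2**31, 0.1
Pc = N / V         # Equation 14: a foreign key lands on an existing virtual address
Pm = Ps * N / V    # Equation 15: a search ends in a mismatch
print(Pc, Pm)      # Pm is about 3e-06, as stated in the text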
SIZE OF HASH FILE FOR TITLE WORDS

Suppose the document collection contains T different titles that comprise a total of W words of which there are N different words. Let Wav = W/T denote the average number of words in each title. Reid and Heaps35 have reported word counts on the 57,800 titles included on the MARC tapes between March 1969 and May 1970 and have noted that

(16) Wav = 5.5

(17) log10 N = 0.6 log10 W + 1.2.

Examination of other data bases has led to the conclusion that log N is likely to be a linear function of log W over the range 0 < W <= 10^6.

For a library of one million titles the Equations 16 and 17 may therefore be used to predict that when T = 10^6 then

(18) W = 5.5 x 10^6 and N = 1.8 x 10^5.

It follows from Equation 6 that if a = 1 the number of bits required in the major is

(19) r = 18.

According to Equation 7, in order to reduce the frequency f of collisions at virtual addresses to 2^(-16) the number of bits required in the entire virtual address is

(20) v = ceil[log2 (1.8 x 10^5)] + 16 - 1 = 33.

Consequently, the number of bits in the minor is

(21) m = v - r = 15.

However, with such a choice of r then R = 2^18 and the value of the load factor is, in fact,

(22) a = N/R = 0.7

It follows from Equation 4 that the expected total number p of collisions at virtual addresses is equal to approximately 2. It may be further noted that Murray36 has derived the following approximation for the probability that the number of collisions at virtual hash addresses lies within the range a to d:

(23) P(a,d) = sum from i=a to d of e^(-p) p^i / i!   (0 <= i <= floor[N])

where floor[ ] means "the greatest integer less than or equal to."

When p = 2 the equation gives a value of 0.9998 for the probability that the total number of collisions lies between 0 and 8. Thus the above choice of r, v, and m leads to a title word hash table file with excellent virtual address collision properties.

Use of Equation 10 with b = 64 and a = 0.7 leads to the result that the probability of bucket overflow may be reduced to 0.01 by choosing c = 62.

In view of the above value of m it proves convenient to allocate 10 bytes of storage for each content entry. Each entry consists of a 2-byte portion to contain the 15-bit minor preceded by a collision bit, a 1-byte portion to contain a 7-bit chain pointer preceded by an overflow bit, a 3-byte dictionary pointer, and a 4-byte pointer to an inverted index. The 64 one-byte index entries, the 62 ten-byte content entries, and 4 one-byte counters, constitute buckets of length 688 bytes. The entire hash file consists of R entries, and hence R/b = 2^12 buckets. Its storage requirement is therefore for 2^12 x 688 = 2.82 x 10^6 bytes.

It may be remarked that nine 688-byte buckets may be stored unblocked in one track of an IBM 2316 disk pack, and that the entire hash file occupies 11.38 percent of the disk pack. When the disk and channel are idle the average time to access such a bucket is the sum of the average seek time, the average rotational delay, and the record transmission time. For storage on an IBM 2314 disk drive the average bucket access time is therefore 60 + 12.5 + 2.8 = 75.3 milliseconds. The average access time for a sequence of accesses could be reduced by suitable scheduling.
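The bucket and file sizes just derived can be reproduced directly from the figures in the text; the following lines simply restate that arithmetic.

# The title-word bucket and file sizes, restated from the figures above.
index_bytes   = 64 * 1          # 64 one-byte index entries
content_bytes = 62 * 10         # 62 ten-byte content entries
counter_bytes = 4 * 1           # 4 one-byte counters
bucket_bytes  = index_bytes + content_bytes + counter_bytes    # 688 bytes
n_buckets     = 2**18 // 64     # R / b = 2**12 buckets
file_bytes    = n_buckets * bucket_bytes                        # about 2.82e6 bytes
access_ms     = 60 + 12.5 + 2.8   # IBM 2314: average seek + rotation + transfer
print(bucket_bytes, n_buckets, file_bytes, access_ms)           # 688 4096 2818048 75.3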
SIZE OF HASH FILE FOR LC CALL NUMBERS

For a library of one million titles the number N of call numbers is 10^6. If a = 1 and f = 2^(-16) it follows from Equations 6, 7, 9, and 4 that

(24) r = 20, v = 35, m = 15, p = 16.

With such a choice of r the load factor is approximately equal to 1. Equation 23 gives a probability of 0.9998 that the total number of virtual address collisions lies between 0 and 34. Use of Equation 10 with b = 64 and a = 1.0 shows that the probability of bucket overflow may be reduced to 0.01 by choosing c = 85.

The content entries for LC call numbers may be arranged as for title words except that the 4-byte pointer to an inverted index is replaced by a 3-byte pointer to the compressed catalog file. The bucket length is therefore 64 + 85 x 9 + 4 = 833 bytes. The storage requirement for the hash file is (2^20 / 2^6) x 833 = 13.65 x 10^6 bytes which may be stored in 2184 tracks, or 54.6 percent, of an IBM 2316 disk pack. The average time to access a bucket is 60 + 12.5 + 3.3 = 75.8 milliseconds.

SIZE OF HASH FILE FOR AUTHOR NAMES

In the present section the term "author" will be used to include personal names, corporate names, editors, compilers, composers, translators, and so forth. It will be assumed that for personal names only surnames are entered into the author dictionary. A search query that includes specification of authors with initials is first processed as if initials were omitted, and the resulting retrieved catalog entries are then scanned sequentially to eliminate any entries whose authors do not have the required initials. It will also be supposed that each word of a corporate name is entered separately into the author dictionary, and that the inverted index contains an entry for each term.

In the absence of reliable statistics regarding the distributions of author surnames, words within corporate names, and so forth, the following assumptions have been made in order to estimate the size of the author dictionary and hash file for a library of one million titles:

(i) Personal author names contain 2 x 10^5 different surnames of average length 7 characters.

(ii) The corporate author names include 4 x 10^4 different words of average length 6 characters.

(iii) The author names include 1.6 x 10^4 different acronyms such as IBM, ASLIB, and so forth; their average length is 4 characters.

It is thus supposed that N = 2.56 x 10^5 entries are required in the author hash files. Calculations similar to those of the previous section show that

(25) r = 18, v = 33, m = 15, p = 4, a = 1.0.

Equation 23 gives a probability of 0.9999 that the total number of virtual address collisions lies between 0 and 13. The probability of bucket overflow may be reduced to 0.01 by choosing c = 85. Content entries of 10 bytes may be arranged as previously described for title words. Hence each bucket requires 918 bytes of storage. The storage requirement for the hash file is (2^18 / 2^6) x 918 = 3.76 x 10^6 bytes which may be stored in 586 tracks, or 14.6 percent, of an IBM 2316 disk pack. The average time to access a bucket is 76.1 milliseconds.

STRUCTURE OF DICTIONARY FILES

The structure of the dictionary files for title words and author names is as described by Thiel and Heaps.37,38 Each dictionary file contains up to 128 directories each of which points to up to 128 term strings that may each contain space for storage of 128 terms of equal length. Thus the dictionary associated with each directory contains up to 2^14 different terms.

The dictionary pointers in the hash files are essentially the codes stored instead of alphanumeric terms in the catalog file. The most frequent 127 title words are assigned dictionary pointers of the form

(26) 10000000 10000000 1XXXXXXX (PT)

and do not have corresponding entries in the inverted index file. The last byte forms the code used to represent the title word within the compressed catalog file.
The next most frequent 16,384 title words are assigned dictionary pointers of the form

(27) 00000000 1XXXXXXX 1XXXXXXX

or

(28) 10000000 0XXXXXXX 1XXXXXXX

according as there is, or is not, a corresponding entry in the inverted index. The last 2 bytes are used as codes in the compressed catalog file. The remaining title words are assigned dictionary pointers of the form

(29) 0XXXXXXX 0XXXXXXX 1XXXXXXX (PD, PS, PT)

They all have corresponding entries in the inverted index file, and the 3 bytes are used as codes in the catalog file.

The reason that terms coded in the form 26 or 28 do not have corresponding entries in the inverted index file is that very frequently occurring terms form very inefficient search keys. Also, previous results suggest that omission of corresponding entries in the inverted index allows its size to be reduced by about 50 percent.39,40

The codes of type PT, (PS,PT), and (PD,PS,PT) are used respectively for approximately 50 percent, 45 percent, and 5 percent of the title words. The average length of the coded title words in the compressed catalog file is therefore 1.55 bytes.

Associated with each dictionary file there is a directory of length 512 bytes whose entries point to the beginnings of term strings within the dictionary file and also indicate the lengths of the terms. Within the hash table file a dictionary pointer of the form PD, PS, PT points to the PTth term of the PSth term string in the dictionary associated with the PDth directory. There is a single directory associated with each set of pointers of type PT and PS, PT.
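The decoding path implied by these pointers can be sketched as follows. The nested-list representation of directories and term strings is illustrative only, and the bit-level packing of the three pointer bytes is not reproduced.

# A sketch of the decoding path for a dictionary pointer (PD, PS, PT): the
# PDth directory locates a set of term strings, the PSth term string is read,
# and the PTth term within it is the uncoded title word.  The nested lists
# below stand in for the disk-resident directories and term strings.
def decode_pointer(pd, ps, pt, directories):
    term_strings = directories[pd]   # directory: points to up to 128 term strings
    term_string = term_strings[ps]   # term string: up to 128 terms of equal length
    return term_string[pt]           # the PTth term is the uncoded word

# Toy example: one directory holding one term string of three terms.
directories = [[["ONLINE", "CATALOG", "LIBRARY"]]]
print(decode_pointer(0, 0, 1, directories))   # prints CATALOG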
The average length of the 1.8 x 10^5 different title words is 7.6 characters, and hence the entire set of term strings requires 1.8 x 10^5 x 7.6 = 1.37 x 10^6 bytes for storage of title words. Since twelve directories occupying 12 x 512 = 6,144 bytes will be required, and since some term strings will contain unfilled portions, the storage requirement of the dictionary file will be slightly larger. If the title word dictionary is stored on disk in 1,000 byte records then the storage requirement is 238 tracks, or 5.95 percent, of an IBM 2316 disk pack.

The assumptions made previously regarding author names imply an author dictionary size of 1.70 x 10^6 bytes and sixteen directories whose total storage requirements are 16 x 512 = 8,192 bytes. Using an IBM 2316 disk pack the storage requirement is for 286 tracks, or 7.15 percent.

On completion of a search through use of the inverted index file there results a set of sequence numbers that indicate the position of the relevant items in the compressed catalog file. Before such items are displayed to a user of the system, each term must be decoded through access to the directory and dictionary to which it points.

The time required to decode a catalog item depends on how the directories and dictionaries are partitioned between disk and core memory. Several partitioning schemes for title words have been analysed, and the results are summarized in Table 2.

Table 2. Average Time to Decode a Title Word of the Compressed Catalog File.

Core Resident          Core Resident        Average Number    Average Decode        Dedicated Core
Directories            Term Strings         of Accesses       Time (milliseconds)   Memory (bytes)
None                   None                 1.50              115                   0
PT                     PT                   1.01              77                    989
PT, (PS,PT)            PT                   0.55              42                    1501
All                    PT                   0.50              39                    7133
PT, (PS,PT)            PT, (PS,PT)*         0.49              38                    2474
All                    PT, (PS,PT)*         0.44              34                    8106

(PS,PT)* signifies the 128 most frequent of the codes PS,PT.

In the calculations used to obtain Table 2 it is assumed that title words occur with the frequencies listed by Kucera and Francis.41 It is supposed that both the directory and term strings corresponding to codes of form PT are stored in a single physical record, that every other directory is contained wholly within a physical record, and that each dictionary term may be located by a single access to a term string. Any required CPU time is regarded as insignificant compared to the time needed for file accesses.

From the results shown in Table 2 it appears that the best partition between core and disk is probably that which gives an average decode time of 42 milliseconds while requiring a dedicated 1501 bytes of core memory. This results when core is used to store both the directories and term strings for terms that correspond to pointers of type PT, and the directories only for terms that correspond to pointers of type PS,PT.

COMPRESSED CATALOG FILE

Since the title word codes stored in the compressed catalog file have an average length of 1.55 bytes, whereas uncoded title words and their delimiting spaces have an average length of 6.5 characters, the compressed title fields occupy only 24 percent of the storage required for uncompressed words. Uncoded author names and their delimiting spaces have an average length of 7.6 characters and are coded to occupy not more than 3 bytes; hence coding of author names effects an average compression factor of less than 3/7.6 = 40 percent. For LC call numbers the compression factor is less than 30 percent. Clearly, subject headings, publisher names, and series statements may be coded with even more effective compression factors.

The saving in space through compression of the catalog file may be translated into a cost saving as follows. If there are an average of 5.5 words in each title then one million titles include 5.5 x 10^6 title words and delimiting spaces which, if stored in the catalog file in uncoded form, would require 3.63 x 10^7 bytes.42 When stored in coded form the requirement is for 8.54 x 10^6 bytes. Charges for disk space vary considerably with different computing facilities. At the University of Alberta users of the IBM 360 Model 67 are charged a monthly rate of $.50 for each 4,096 bytes of disk storage. Thus, for title words alone the advantage of storing the catalog file in compressed form is to allow the monthly storage cost to be reduced from $4,440 to $950.
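The compression factors quoted above follow directly from the average code and term lengths, and the storage charge follows from the quoted rate; the short computation below restates that arithmetic (the published dollar figures involve some further rounding of the intermediate byte counts).

# Compression factors and the storage-charge formula, restated from the text.
title_factor  = 1.55 / 6.5    # about 0.24: coded vs. uncoded title words
author_factor = 3 / 7.6       # just under 0.40 for author names
monthly_cost  = lambda n_bytes: n_bytes / 4096 * 0.50   # University of Alberta rate
print(round(title_factor, 2), round(author_factor, 2))
print(round(monthly_cost(3.63e7)))   # uncoded title words: on the order of $4,400 per month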
CONCLUDING REMARKS

The results reported in the present paper indicate that a satisfactory structure for a catalog file may be designed to use the concept of virtual hash addressing and storage of terms in compressed form. Access and decoding times may be reduced to acceptable amounts.

It may prove advantageous to arrange the items in the catalog file in the order of their call numbers. This will tend to reduce the number of disk accesses needed to retrieve catalog items in response to queries since it will tend to group relevant items. However, the benefits should be weighed against the additional expense required to maintain and update the ordered file.

The present paper has omitted discussion of the form of the query language or the search algorithm that operates on the elements of the inverted index. A formal definition of one form of query language has been discussed by Dimsdale.43

Details of a search algorithm and structure of a compressed form of inverted index have been discussed by Thiel and Heaps.44 It may be noted that each content entry in the hash table file has 4 bytes reserved for a pointer to a bit string of the inverted index. Whenever the bit string is less than 4 bytes in length it is stored in the content section and no pointer is required. Storage of such bit strings within the content entries significantly reduces the storage requirements of the inverted index and also reduces the number of required disk accesses in the search phase of the program.

ACKNOWLEDGMENT

The authors wish to express their appreciation to the National Research Council of Canada for their support of the present investigation.

REFERENCES

1. D. Lefkovitz and R. V. Powers, "A List-Structured Chemical Information Retrieval System," in G. Schecter, ed., Information Retrieval (Washington, D.C.: Thompson Book Co., 1967), p.109-29.

2. P. R. Weinberg, "A Time Sharing Chemical Information Retrieval System" (Doctoral Thesis, Univ. of Pennsylvania, 1969).

3. R. M. Curtice, "Experimental Retrieval Systems Studies. Report No. 1. Magnetic Tape and Disc File Organization for Retrieval" (Master's Thesis, Lehigh Univ., 1966).

4. D. Lefkovitz, File Structures for On-Line Systems (New York: Spartan Books, 1969).

5. I. B. Holbrook, "A Threaded-file Retrieval System," Journal of the American Society for Information Science 21:40-48 (Jan.-Feb. 1970).

6. G. G. Dodd, "Elements of Data Management Systems," Computer Surveys 1:117-33 (June 1969).

7. J. W. Rettenmayer, "File Ordering and Retrieval Cost," Information Storage and Retrieval 8:19-93 (April 1972).

8. R. T. Divett, "Design of a File Structure for a Total System Computer Program for Medical Libraries and Programming of the Book Citation Module" (Doctoral Thesis, Univ. of Utah, 1968).

9. H. P. Burnaugh, "The BOLD (Bibliographic On-Line Display) System," in G. Schecter, ed., Information Retrieval (Washington, D.C.: Thompson Book Co., 1967), p.53-66.

10. Lefkovitz, Powers, "A List-Structured Chemical Information," p.109-29.

11. Lefkovitz, File Structures for On-Line Systems, p.141.

12. Ibid., p.177.

13. F. G. Kilgour, "Concept of an On-Line Computerized Catalog," Journal of Library Automation 3:1-11 (March 1970).

14. J. L. Cunningham, W. D. Schieber, and R. M. Shoffner, A Study of the Organization and Search of Bibliographic Holdings Records in On-Line Computer Systems: Phase I (Berkeley: Univ. of California, 1969).

15. R. S. Marcus, P. Kugel, and R. L. Kusik, "An Experimental Computer Stored, Augmented Catalog of Professional Literature," in Proceedings of the 1969 Spring Joint Computer Conference (Montvale: AFIPS Press, 1969), p.461-73.

16. J. W. Henderson and J. A. Rosenthal, eds., Library Catalogs: Their Preservation and Maintenance by Photographic and Automated Techniques; M.I.T. Report 14 (Cambridge, Mass.: M.I.T. Press, 1968).

17. I. A. Warheit, "File Organization of Library Records," Journal of Library Automation 2:20-30 (March 1969).
Morris, "Scatter Storage Techniques," Communications of the ACM 11 :38-44 (Jan. 1968) . 19. D. M. Murray, "A Scatter Storage Scheme for Dictionary Lookups," Journal of Library Automation 3:173-201 (Sept. 1970). 20. W. Buchholz, "File Organization and Addressing," IBM Systems Journal 2:86-111 {June 1963). 21. P. L. Long, K. B. L. Rastogi, J. E. Rush, and J. A. Wyckoff, "Large On-Line Files of Bibliographic Data: An Efficient Design and a Mathematical Predictor of Re- trieval Behavior," in Information Processing 71 (North Holland Publishing Com- pany, 1972) p.473-78. 22. Buchholz, "File Organization," p.l02-3. 23. W. P. Reising, "Note on Random Addressing Techniques," IBM Systems Journal 2:112- 16 (June 1963). 24. Murray, "A Scatter Storage Scheme," p.178. 25. Ibid., p.181. 26. G. Schay and W. G. Spruth, "Analysis of a File Addressing Method," Communi- cations of the ACM 5:459-62 (August 1962). 27. M. Tainter, "Addressing for Random-Access Storage with Multiple Bucket Capaci- ties," Journal of the ACM 10:307-15 (July 1963). 28. Reising, "Note on Random Addressing," p.ll2-16. 29. W. H. Beyer, Handbook of Tables for Probability and Statistics (Cleveland: The Chemical Rubber Company, 1966). 30. Buchholz, "File Organization," p.99. 31. Morris, "Scatter Storage," p.42. 32. W. D. Maurer, "An Improved Hash Code for Scatter Storage," Communications of the ACM 11:35-38 (Jan. 1968). File Structure for an On-Line Catalog/DIMSDALE 55 33. J. R. Bell, "The Quadratic Quotient Method: A Hash Code Eliminating Secondary Clustering," Communications of the ACM 13:107-9 (Feb. 1970). 34. Morris, "Scatter Storage," p.40. 35. W. D. Reid and H. S. Heaps, "Compression of Data for Library Automation," in Canadian Association of College and University Libraries: Automation in Li- braries1971 (Ottawa: Canadian Library Association, 1971), p.2.1-2.21. 36. Murray, "A Scatter Storage Scheme," p.183. 37. L. H. Thiel and H. S. Heaps, "Program Design for Retrospective Searches on Large Data Bases," Information Storage and Retrieval8:1-20 (Jan. 1972) . 38. H. S. Heaps, "Storage Analysis of a Compression Coding for Document Data Bases," INFOR 10:47-61 (Feb. 1972) . 39. Thiel and Heaps, "Program Design," p.l5-16. 40. Reid and Heaps, "Compression of Data," p.2.1-2.21. 41. H. Kucera and W. N. Francis, Computational Analysis of Present-Day American English (Providence: Brown University Press, 1967). 42. Reid and Heaps, "Compression of Data," p.2.4. 43. J. J. Dimsdale, "Application of On-Line Computer Systems to Library Automa- tion" (Master's Thesis, Univ. of Alberta, 1971), p.50-68. 44. Thiel and Heaps, "Program Design," p.l-20.