Journal of Library Automation Vol. 5/2 June, 1972

ANALYSIS OF SEARCH KEY RETRIEVAL ON A LARGE BIBLIOGRAPHIC FILE

Gerry D. GUTHRIE, Steven D. SLIFKO: Research & Development Division, The Ohio State University Libraries, Columbus, Ohio

Two search keys (4,5 and 3,3) are analyzed using a probability formula on a bibliographic file of 857,725 records. Assuming random requests by record permits the creation of a predictive model which more closely approximates the actual behavior of a search and retrieval system as determined by a usage survey.

INTRODUCTION

Systems planners are hard pressed to predict accurately the access characteristics of search keys on large on-line bibliographic files when so little is known about user requests. This paper presents a realistic model for analyzing different search keys and compares its results with actual request data gathered from a usage survey of the Ohio State University Libraries Circulation System.

A number of papers are available in the literature concerning search key effectiveness; however, all of these were done on relatively small data bases (1-5). Of particular importance to this paper is Kilgour's article on truncated search keys (6).

PURPOSE

The purposes of this study are (1) to determine the comparative effectiveness of the 4,5 and 3,3 search keys, (2) to compare two predictive models, and (3) to test the results with an actual usage survey.

METHOD

The Ohio State University Libraries Circulation System contained at the time of this study 857,725 titles representing over 2.6 million volumes in the OSU collection. The data base used for this study was the search key index file, which contained one search key for each title in the master file.
The search key is composed of the first four letters of the author's last name and the first five letters of the first word of the title, excluding nonsignificant words (4,5 key). Title words are passed against a stop-list to determine significance. The stop-list contains the words: a, an, and, annual, bulletin, conference, in, international, introduction, journal, of, on, proceedings, report, reports, the, to, yearbook. The search key file is in sequence by search key.

For comparative purposes, a second search key file was created and sorted which contained a 3,3 key (the first three characters of the author's last name and the first three characters of the first significant word of the title).

The two files of sorted search keys were then processed by a statistical analysis computer program. This program created a frequency distribution table of identical keys, i.e., how many keys were unique, duplicated once, duplicated twice, etc. From this table two models were compared.

Model 1: File entry was viewed as a random process with choice of any unique search key equiprobable. This model has been suggested in the literature mentioned earlier. It states that if $x_i$ keys each return $i$ matches, then the probability of a file search returning $i$ matches may be written

$$P(i) = \frac{x_i}{K_u}$$

where $K_u$ is the total number of unique file keys. Likewise, the cumulative probability for $I$ or fewer matches is

$$P(I) = \sum_{i=1}^{I} P(i) = \frac{\sum_{i=1}^{I} x_i}{K_u}$$

Model 2: File entry is viewed as a random process with the choice of any record equiprobable. Thus,

$$P(i) = \frac{i x_i}{R_t}$$

where $R_t$ is the total number of file records. Correspondingly,

$$P(I) = \sum_{i=1}^{I} P(i) = \frac{\sum_{i=1}^{I} i x_i}{R_t}$$

Survey: The Ohio State University Libraries Automated Circulation System includes a telephone center to which patrons may telephone requests for library holdings information and for checking out and renewing books.
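The key construction and the two models defined under METHOD can be sketched in a few lines of Python. This is only an illustrative sketch, not the statistical analysis program used in the study: the function names `search_key` and `cumulative_models` are hypothetical, and the lowercase normalization, whitespace tokenization, and comma separator inside the key are assumptions, since the text does not say how the OSU system stored or normalized its keys.

```python
from collections import Counter

# Stop-list exactly as given in the text.
STOP_WORDS = {
    "a", "an", "and", "annual", "bulletin", "conference", "in",
    "international", "introduction", "journal", "of", "on",
    "proceedings", "report", "reports", "the", "to", "yearbook",
}


def search_key(author_last, title, n_author=4, n_title=5):
    """Build an n_author,n_title key (4,5 by default; pass 3, 3 for a 3,3 key).

    Assumptions (not from the paper): lowercase text, whitespace
    tokenization, and a comma between the author and title parts.
    """
    significant = [w for w in title.lower().split() if w not in STOP_WORDS]
    first_word = significant[0] if significant else ""
    return author_last.lower()[:n_author] + "," + first_word[:n_title]


def cumulative_models(keys, max_matches=10):
    """Return rows (I, Model 1 P(I), Model 2 P(I)) for I = 1..max_matches.

    `keys` holds one search key per record in the file.
    """
    if not keys:
        raise ValueError("keys must contain at least one search key")
    records_per_key = Counter(keys)          # key -> records sharing that key
    x = Counter(records_per_key.values())    # i -> x_i, keys returning i matches
    k_u = len(records_per_key)               # K_u: number of unique keys
    r_t = len(keys)                          # R_t: total number of records
    rows, key_sum, record_sum = [], 0, 0
    for i in range(1, max_matches + 1):
        key_sum += x[i]                      # running sum of x_i
        record_sum += i * x[i]               # running sum of i * x_i
        rows.append((i, key_sum / k_u, record_sum / r_t))
    return rows
```

Under these assumptions, `search_key("Guthrie", "An Analysis of Search Key Retrieval")` drops the stop word "an" and yields `guth,analy`, while passing `3, 3` yields `gut,ana`. Feeding one key per record into `cumulative_models` gives the random key and random record percentages side by side, in the same form as the two models.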
Telephone operators, sitting at cathode ray tube (CRT) terminals, translate the patron's author-title request into a 4,5 search key and proceed with a file search. By having the telephone operators treat telephone calls as random input to the system and record the number of matches returned for each search used, results can be generated in the same form that both of the models take, i.e., $I$ or fewer matches have been returned $P(I) \times 100$ percent of the time.

This is a relatively easy survey to conduct since the output list of matching records for any particular key entry is headed with the exact number of matches which follow.

The sample size was 1000 information requests recorded over two one-week periods separated by one month. Before these two subsamples were merged, statistical analysis of their individual means (for percent of 10 or fewer matches) showed no significant difference between them at the 99 percent confidence level.

RESULTS

The results predicted by the two models for both a 4,5 and 3,3 search key for 1-10 matches appear in Tables 1 and 2.

The figures pertaining to the 4,5 key can be compared directly to the data received from the survey conducted through the OSU Libraries' telephone center. This comparison is shown in Table 1 for 1-10 matches.

Table 1. File Access Comparisons (4,5 search key).
(Percent of time I or fewer matches returned)

  I    Actual Survey    Model 1 (random key)    Model 2 (random record)
  1        35.9                 81.3                     55.7
  2        53.8                 92.9                     71.6
  3        66.0                 96.3                     78.5
  4        73.1                 97.7                     82.4
  5        78.5                 98.4                     84.9
  6        81.3                 98.8                     86.6
  7        83.8                 99.1                     87.8
  8        85.6                 99.3                     88.8
  9        86.6                 99.4                     89.6
 10        87.8                 99.5                     90.2

To acquire a 99 percent upper confidence limit on the percent of requests returning 10 or fewer matches, the normal distribution was used as an approximation to the binomial distribution ($n = 1000$, $p = .878$), producing an upper limit of 90.2 percent. This figure is consistent with a one-sided limit, $0.878 + 2.326\sqrt{0.878 \times 0.122 / 1000} \approx 0.902$.

Table 2. File Access Comparisons (3,3 search key).
(Percent of time I or fewer matches were returned)

  I    Model 1 (random key)    Model 2 (random record)
  1            64.3                     28.0
  2            81.0                     42.5
  3            87.9                     51.7
  4            91.6                     58.0
  5            93.7                     62.7
  6            95.1                     66.3
  7            96.1                     69.3
  8            96.8                     71.8
  9            97.3                     73.9
 10            97.7                     75.7

DISCUSSION

In Table 1 the results of the survey show that 87.8 percent of all searches recorded returned 10 or fewer titles. In Model 1, assuming that requests of the file are random with respect to search key, it is predicted that 99.5 percent of all searches will return 10 or fewer titles. All predicted percentages for Model 1 are consistently higher than observed results.

The predicted response in Model 2 more closely approximates the observed behavior of the system as the number of responses increases. However, Model 2 is also consistently higher than the actual survey. Comparing Model 1 and Model 2 only, it is apparent that assuming a random record request more accurately reflects the true usage of a library collection.

The lower percentages recorded in the actual survey may be attributable to a number of variables not taken into consideration in this study. Clustering due to common English word titles and common names may account for the greater part of this difference.

Table 2 shows the results of predicted response for a 3,3 search key. In this table, Model 2 predicts that only 75.7 percent of requests will return 10 or fewer titles. Equally important, only 28.0 percent of the requests will return a single record.

CONCLUSION

In predicting the expected behavior of an information retrieval system, it is more accurate to assume random requests by record than to assume random requests by search key. Probability predictions are deceptively high for assumed random key requests and do not reflect actual usage of the file. Even assuming random requests by record will produce higher-than-observed results. Data calculated using Model 2 should be considered as an upper limit or "ideal" performance indicator.
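A small worked example may help show why the random key assumption gives the higher figure. The counts are invented for illustration and are not taken from the study: suppose three keys each index a single record and one key indexes seven records, so that $x_1 = 3$, $x_7 = 1$, $K_u = 4$, and $R_t = 3 + 7 = 10$. The probability of a search returning exactly one match is then

$$\text{Model 1: } P(1) = \frac{x_1}{K_u} = \frac{3}{4} = 0.75 \qquad \text{Model 2: } P(1) = \frac{1 \cdot x_1}{R_t} = \frac{3}{10} = 0.30$$

The heavily duplicated key counts once under Model 1 but is weighted by its seven records under Model 2, so a request for a randomly chosen record is far more likely to land on it.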
Regarding the results of the random record model as the upper limit on effectiveness of the search key, the data gathered indicate that, as the search key is shortened from 4,5 to 3,3, the deviation between the random key and random record models widens considerably. The 4,5 search key is more efficient for retrieval of 10 or fewer records from a large file than the 3,3 key (90.2 versus 75.7 percent under the random record model). Based on these data, the OSU Libraries decided to retain the 4,5 search key and not reduce it to 3,3.

Additional studies should be undertaken to determine the effects of common word usage, common names, and their relation to book usage. Secondly, the data presented here could be systematically and randomly reduced in size to predict the behavior of various search key combinations on varying file sizes.

REFERENCES

1. Philip L. Long and Frederick G. Kilgour, "A Truncated Search Key Title Index," Journal of Library Automation 5:17-20 (Mar. 1972).
2. Frederick G. Kilgour, Philip L. Long, Eugene B. Leiderman, and Alan L. Landgraf, "Title-Only Entries Retrieved by Use of Truncated Search Keys," Journal of Library Automation 4:207-10 (Dec. 1971).
3. Frederick G. Kilgour, "Retrieval of Single Entries from a Computerized Library Catalog File," Proceedings of the American Society for Information Science 5:133-36 (1968).
4. Frederick H. Ruecking, Jr., "Bibliographic Retrieval from Bibliographic Input; The Hypothesis and Construction of a Test," Journal of Library Automation 1:227-38 (Dec. 1968).
5. William L. Newman and Edwin J. Buchinski, "Entry/Title Compression Code Access to Machine Readable Bibliographic Files," Journal of Library Automation 4:72-85 (June 1971).
6. Frederick G. Kilgour, Philip L. Long, and Eugene B. Leiderman, "Retrieval of Bibliographic Entries from a Name-Title Catalog by Use of Truncated Search Keys," Proceedings of the American Society for Information Science 7:79-81 (1970).