FILE SIZE AND THE COST OF PROCESSING MARC RECORDS

John P. KENNEDY: Data Processing Librarian, Georgia Institute of Technology, Atlanta, Georgia

Many systems being developed for utilizing MARC records in acquisitions and cataloging operations depend on the selection of records from a cumulative tape file. Analysis of cost data accumulated during two years' experience in using MARC records for the production of catalog cards at the Georgia Tech Library indicates that the ratio of titles selected to titles read from the cumulative file is the most significant determinant of cost. This implies that the number of passes of the file must be minimized and an effective formula for limiting the growth of the file must be developed in the design of an economical system.

Since 1963 several articles on computerized production of catalog cards have reported cost figures for card production. Fasana reported a cost per card of 9.9 cents at the Air Force Cambridge Research Laboratory (AFCRL) (1). Costs at the Yale Medical Library under the Columbia-Harvard-Yale computerized card production system varied from 8.8 cents to 9.8 cents per card (2). Under the Yale Bibliographic System, costs for card production at the Yale Medical Library have been 13.9 cents per card. When the MARC MATE program is used to introduce MARC records into the Yale Bibliographic System, the cost of cards produced from the MARC records is 24.9 cents (3). Costs for computer assisted card production at the Philip Morris Research Library have been estimated at 18 cents per card (4). The cost per card for cards produced from MARC records at the Georgia Institute of Technology Library has been reported as 10 cents (5).

The focus of interest in these cost reports has been on a comparison of the costs of computer produced cards and manually produced cards. There is agreement in these reports that computer production can compete favorably in terms of cost with other methods of production. Less attention has been given to variations in the costs of computer produced cards. Since the systems for which costs have been reported vary in scope and objectives, equipment used, nature of input, rates for labor, and charges for computer time, it is not very useful to compare the costs from system to system. Variations in cost within one system are of greater interest, since it is easier to isolate the factors that result in the altered costs. The report on the Yale Bibliographic System shows that the introduction of MARC records into a system that was not designed for processing MARC records may produce substantially higher costs. Fasana reported that when a PDP-1 computer was used rather than the specially built Crossfiler in the AFCRL system, the cost per card was quadrupled. Kilgour discusses briefly the effects of three changes in the Columbia-Harvard-Yale system on the cost of cards produced.

The 10-cent-per-card cost reported for Georgia Tech was the average cost during the preceding three-month period, January through March 1968. During the three years in which catalog cards have been produced on the computer at Georgia Tech, costs have varied widely as procedures, personnel, file sizes and work loads have changed. The greatest variation has occurred in the cost of the manual steps in the system, mainly proofreading and making corrections.
The greatly improved accuracy of the MARC II records has resulted in a reduction in the time required for proofreading and making corrections. The costs of supplies and equipment have been small and shown little variation. The cost of computer time has varied from 18 cents per title (just over 2 cents per card) to a high of 47 cents (6 cents per card), excluding the cost of the merge runs to maintain a cumulative file of MARC records. An analysis has been made to determine the factors responsible for this variation in computer costs, and techniques for reducing computer costs have been developed.

MATERIALS AND METHODS

The Price Gilbert Memorial Library at the Georgia Institute of Technology is a centralized scientific, technical and management collection of 612,000 volumes plus 500,000 microtext and other bibliographic units. In 1968/69 almost 20,000 titles representing about 35,000 volumes were cataloged for addition to the collection. The Library makes use of the UNIVAC 1108 and the Burroughs B5500 computing systems of the Institute's Rich Electronic Computing Center for its data processing needs. The work described here was performed on the B5500. The Georgia Tech B5500 configuration includes two central processing units, 32,000 forty-eight-bit words of core storage, 29 million characters of disc storage and 10 magnetic tape drives. Library programs are written in COBOL and are multi-processed with other programs in the standard work stream. The Library is billed $140 per hour for central processor time and $47 per hour for IO channel time.

The system for production of catalog cards from MARC I records which was in operation for over two years has been described previously (6). Statistics were recorded for all computer runs in the processing of 73 batches of MARC I titles. These statistics include number of records processed, file sizes, processor time, IO channel time, and cost, for each run. The time and cost remained fairly constant for some runs. The cost of runs to produce the sorted catalog cards from edited MARC records ranged from 6 to 9 cents per title and averaged a little over 7 cents. The cost of runs to make changes and additions to the MARC records ranged from 1 to 5 cents per title and averaged 2 cents. The cost was usually about 1 cent per title for each time the correction program was run; it often had to be rerun several times before all records in the batch were correct. The Library's improved MARC II system avoids the cost of correction reruns by permitting independent corrections to any record in a direct access file rather than requiring records to be processed as a batch.

Most of the variation in the cost of computer time occurred in the run in which records were selected from the cumulative MARC file and the selected records were then converted to the B5500 character codes, reformatted and prooflisted. The cost of this run varied from a low of 10 cents per title selected to a high of 36 cents per title; the variation is primarily an effect of the increasing size of the cumulative MARC file and of variation in the number of titles selected in the run. As the MARC file increased in size, the cost of selecting a small number of titles increased dramatically. The precise relationship of file size and batch size to cost per title is not apparent, however, because the cost of character conversion, reformatting, and printing the prooflist was combined with the cost of selection in a single run.
An additional complication results from the effects of the other jobs being processed by the computer concurrently. For example, one batch which had to be rerun because the output tape was defective cost 23 cents per title the first time and 28 cents per title when rerun with a different job mix. Although the part of the run cost which can be attributed to passing the MARC file and the part attributable to code conversion, formatting and printing cannot be determined for a single run, this can be calculated from a number of runs with varying file sizes and batch sizes. It is assumed that variations in the time required for processing individual records of varying lengths, and variations due to the mix of jobs run concurrently, will average out and may be disregarded. Statistics for the selection runs include the number of records read from the cumulative MARC file, the number of records selected and processed, the processor time and IO channel time required for the run, and the cost of the run. Using the method of least squares, these statistics were used to calculate the average time and cost for each record read from the cumulative MARC file. Once these constants are calculated it is possible to predict the cost per item or the total cost of a select run with any given file size and batch size.

In order to determine the average cost for processing a selected record and the average cost for reading a record from the cumulative MARC file, it was postulated that

    C_T = \left( \frac{FS}{BS} \right) C_R + C_P

where

    C_T is the total cost per title
    FS (File Size) is the number of records read from the cumulative MARC file
    BS (Batch Size) is the number of records selected in the run
    C_R is the cost of reading a record from the cumulative MARC file
    C_P is the cost of processing a selected record

The method of least squares yields the following normal equations, where N is the number of runs in the sample:

    \left[ \sum \left( \frac{FS}{BS} \right)^{2} \right] C_R + \left[ \sum \frac{FS}{BS} \right] C_P = \sum \left( \frac{FS}{BS} \right) C_T

    \left[ \sum \frac{FS}{BS} \right] C_R + N C_P = \sum C_T

Solving these equations for the data from the 73-batch sample gives the following values:

    C_P = $.073
    C_R = $.00068

Since charges for computer time are determined differently at other installations, the figures for processor time and IO channel time may be more useful to others than the cost figures. Using the same technique but substituting processor time for cost gives the following values:

    Processor time per record read = .00646 seconds
    Processor time per selected record = 1.339 seconds

Again, using the same technique but substituting IO channel time for cost gives the following values:

    IO channel time per record read = .02048 seconds
    IO channel time per selected record = .456 seconds

These values may be substituted in the formula C_T = (FS/BS) C_R + C_P to find the cost or time per title for any batch and file size. For example, the per-title cost for selecting and processing a batch of 200 records from a MARC file of 40,000 records is

    C_T = \left( \frac{40000}{200} \right) (\$.00068) + \$.073 = \$.21

It will cost about twenty-one cents per title. The total cost of the run can be predicted as follows:

    C = (FS - BS) C_R + (BS) C_P
    C = (40000 - 200)(\$.00068) + (200)(\$.073)
    C = \$41.66
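These calculations are easy to reproduce mechanically. The sketch below uses a present-day scripting language (Python, for brevity; the Library's production programs were written in COBOL) to solve the two normal equations for C_R and C_P from per-run statistics and to apply the prediction formula. The sample run data are hypothetical placeholders, not the actual 73-batch statistics.

    # Fit C_T = (FS/BS) * C_R + C_P by least squares, then predict select-run
    # costs. A minimal sketch; the sample runs below are hypothetical, not the
    # actual 73-batch Georgia Tech statistics.

    def fit_constants(runs):
        """runs: (file_size, batch_size, cost_per_title) tuples for past runs.
        Solves the two normal equations for C_R (cost per record read) and
        C_P (cost per selected record)."""
        n = len(runs)
        x = [fs / bs for fs, bs, _ in runs]      # FS/BS for each run
        y = [ct for _, _, ct in runs]            # observed cost per title
        sx, sy = sum(x), sum(y)
        sxx = sum(v * v for v in x)
        sxy = sum(v * w for v, w in zip(x, y))
        # Normal equations: sxx*C_R + sx*C_P = sxy  and  sx*C_R + n*C_P = sy
        c_r = (n * sxy - sx * sy) / (n * sxx - sx * sx)
        c_p = (sy - c_r * sx) / n
        return c_r, c_p

    def cost_per_title(fs, bs, c_r=0.00068, c_p=0.073):
        """Predicted per-title cost of a select run (constants from the text)."""
        return (fs / bs) * c_r + c_p

    # Hypothetical run statistics, for illustration only.
    sample_runs = [(10000, 100, 0.141), (40000, 200, 0.209), (80000, 500, 0.182)]
    print(fit_constants(sample_runs))            # recovers roughly (0.00068, 0.073)
    print(round(cost_per_title(40000, 200), 2))  # 0.21: the worked example above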
RESULTS

Table 1 shows the predicted cost per title for various file sizes and batch sizes; it is based on the cost of the select run at Georgia Tech and ignores the cost of maintaining the MARC file.

Table 1. Relationship of File Size and Batch Size to Cost per Title

    File                                BATCH SIZE
    Size      50    100    150    200    250    300    400    500    750   1000
    10K    $.209  $.141  $.118  $.107  $.100  $.095  $.090  $.087  $.082  $.080
    20K     .345   .209   .164   .141   .127   .118   .107   .100   .091   .087
    30K     .481   .277   .209   .175   .155   .141   .124   .114   .100   .093
    40K     .617   .345   .254   .209   .182   .164   .141   .127   .109   .100
    50K     .753   .413   .300   .243   .209   .186   .158   .141   .118   .107
    60K     .889   .481   .345   .277   .236   .209   .175   .155   .127   .114
    70K    1.025   .549   .390   .311   .263   .232   .192   .168   .137   .121
    80K    1.161   .617   .436   .345   .291   .254   .209   .182   .146   .127
    90K    1.297   .685   .481   .379   .318   .277   .226   .194   .155   .134
    100K   1.433   .753   .526   .413   .345   .300   .243   .209   .164   .141
    110K   1.569   .821   .572   .447   .372   .322   .260   .223   .173   .148
    120K   1.705   .889   .617   .481   .399   .345   .277   .236   .182   .155

Since the Library of Congress cumulated MARC I records until a reel of tape was filled and provided a cumulative card number listing of the records on the reel, it was not essential to update the cumulative MARC file each week. The MARC II tapes issued from the MARC Distribution Service are not cumulative. Most libraries maintaining a cumulative file of MARC records will find it necessary to update this file each week. Weekly updating of the MARC file requires that all records on the file be not only read but also written on a new tape each week. For most systems this will rapidly become the most expensive machine procedure in the entire system. Combining the selection function and any index production with the file update means that no additional passes of the file will be required, but the cost of writing the file each week must be added to the figures in Table 1. Statistics from the merge runs at Tech show that if the number of old MARC file records read, the number of records read from the weekly update tape, and the number of records written on the new MARC file are totaled, the average cost per IO operation for the merge runs ranged between $.00062 and $.00073 and averaged $.00068 for all merge runs. Since this is the same cost as that obtained for each record read from the cumulative file in the select runs, it seems reasonable to use this figure as the cost for reading or writing a MARC record in calculating the cost of combined merge-select runs. Table 2 shows the predicted costs per title for combined merge-select runs with varying file and batch sizes.

Table 2. Relationship of File Size and Batch Size to Cost per Title - File Update and Record Selection Functions Combined in Same Program

    Old
    File                                BATCH SIZE
    Size      50    100    150    200    250    300    400    500    750   1000
    10K    $.378  $.225  $.175  $.149  $.134  $.124  $.111  $.104  $.093  $.088
    20K     .650   .361   .265   .217   .188   .169   .145   .131   .111   .102
    30K     .922   .497   .356   .285   .243   .214   .179   .158   .130   .115
    40K    1.194   .633   .447   .353   .297   .260   .213   .185   .148   .129
    50K    1.466   .769   .537   .421   .352   .305   .247   .212   .166   .143
    60K    1.738   .905   .628   .489   .406   .350   .281   .240   .184   .156
    70K    2.010  1.041   .719   .557   .461   .396   .315   .267   .202   .170
    80K    2.282  1.177   .809   .625   .515   .441   .349   .294   .220   .183
    90K    2.554  1.313   .900   .693   .569   .486   .383   .321   .238   .197
    100K   2.826  1.449   .991   .761   .624   .532   .417   .348   .257   .211
    110K   3.098  1.585  1.081   .829   .678   .577   .451   .376   .275   .224
    120K   3.370  1.721  1.172   .897   .732   .622   .485   .403   .293   .238
The costs shown are based on the following equation:

    C_T = \left( \frac{FS_O + FS_A + FS_D + FS_N}{BS} \right) C_{IO} + C_P

where

    C_T is the cost per title
    FS_O is the file size for the old MARC file
    FS_A is the file size for the add records (1200)
    FS_D is the file size for the delete records (1200)
    FS_N is the file size for the new MARC file
    BS (Batch Size) is the number of records selected in the run
    C_IO is the cost of reading or writing a record ($.00068)
    C_P is the cost of processing a selected record ($.073)

Calculations for this table are based on several assumptions: it is assumed that the file has reached a state of equilibrium in which the weekly additions and deletions are equal; it is also assumed that delete records have the same average length as other records and therefore take as long to read. While it is unlikely that these assumptions will hold perfectly, the variations are not great enough to destroy the usefulness of the resulting figures as a guide.

DISCUSSION

The figures presented in the two tables have several implications for the design of systems based on the maintenance of a cumulative MARC file and the selection of records from that file. First, they show the importance of assuring that no unnecessary passes of the cumulative MARC file are made. Updating of the MARC file, production of indexes to it, and selection of records from it should be accomplished in a single pass of the file. If it is desired to select records from the file more often than once a week, Table 1 provides a means of estimating the cost of the improved response time. If, for example, the file size is 100,000 and the weekly volume is 500, twice-a-week runs would increase the cost by 14 cents per title, or by $68.00 a week, for the select runs.

The figures presented in the two tables also show the critical importance of controlling the growth of the cumulative MARC file, especially for libraries with a relatively small volume of titles to be processed. Three characteristics of the acquisitions program of the library largely determine the possibilities for controlling the growth of this file. The number of titles acquired by the library determines the batch sizes for records to be selected from the file each week. The acquisition rate is also an important determinant of the growth rate of the cumulative file, provided that records which have been selected and used are then purged from the file. If the Library of Congress issues an average of 1200 titles per week and a library uses an average of 1000 titles a week from the file, the net annual growth of the cumulative file will be only slightly over 10,000 records. On the other hand, a smaller library selecting an average of only 100 titles a week would have a net annual growth rate of about 57,000 records. If unused records were purged after one year, the file size would remain stable at these levels. Table 2 indicates that the cost per title for file maintenance and selection at these two libraries would be about 9 cents and 86 cents respectively.
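The equilibrium arithmetic above can be made concrete with a short sketch. The function names below are illustrative, not from the actual system; the constants are the C_IO and C_P values already reported, with 1200 issued records assumed per week. Within a cent of rounding, it reproduces the 9-cent and 86-cent estimates for the two hypothetical libraries.

    # Estimate equilibrium file size and per-title merge-select cost.
    # A minimal sketch with illustrative helper names; constants are the
    # C_IO and C_P values reported above.

    C_IO = 0.00068          # cost to read or write one MARC record
    C_P = 0.073             # cost to process one selected record
    ISSUED_PER_WEEK = 1200  # average weekly Library of Congress output

    def equilibrium_file_size(selected_per_week, retention_weeks=52):
        """Stable file size once weekly purging of old unselected records
        balances weekly additions."""
        return (ISSUED_PER_WEEK - selected_per_week) * retention_weeks

    def merge_select_cost_per_title(old_file_size, batch_size):
        """Per-title cost with update and selection combined in one pass.
        At equilibrium the new file equals the old file in size, and the
        add and delete files are each one week's issue."""
        io_ops = old_file_size + ISSUED_PER_WEEK + ISSUED_PER_WEEK + old_file_size
        return (io_ops / batch_size) * C_IO + C_P

    for weekly in (1000, 100):
        size = equilibrium_file_size(weekly)
        cost = merge_select_cost_per_title(size, weekly)
        print(f"{weekly}/week: file of {size} records, ${cost:.2f} per title")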
A second characteristic of the acquisitions program of the library that is important in controlling the growth of the cumulative MARC file is the scope of the subject coverage attempted. If most of the monographs acquired fall within well defined subject classes, the probability of utilizing MARC records in many other subject classes may be low enough that these records need not be added to the cumulative MARC file at all.

For a special library that attempts to collect everything published in a few well defined subject areas, it may be economical to maintain and utilize a limited MARC file even though the number of records selected is small. On the other hand, a small or medium-sized public library acquiring the same number of titles would probably find a much larger percentage of its records on the MARC file but still not be able to use the MARC tapes economically. Since the public library is likely to collect titles in most subject fields, the probabilities of utilizing records in different classes would not vary as widely, and it would not be possible to limit the file to records in a few classes having a high probability of utility. Consequently, the per-item cost of MARC records would likely be too high for consideration. If it is determined that the probabilities of using MARC records vary widely for other characteristics, such as publisher, these characteristics may be used for restricting the records to be added to the cumulative file, thus limiting its size; but subject class seems to be the most promising characteristic for this purpose.

An analysis by subject class of all non-juvenile records in the MARC I file, and of those records selected from it for use by the Georgia Tech Library, has been used as the basis for restricting the growth of the cumulative file of MARC II records. Overall, 8,953 out of 46,486 records were utilized, 19.3% of the file. The percentage selected varied from more than 50% in some engineering classes to less than 1% in a few classes such as CS (Genealogy) and BV (Practical theology). Elimination of thirty classes in which fewer than 4% of the records were eventually used would have reduced the file by 7,710 records, or 16.6%. Only 184 of these records (2.4%) were eventually selected for use. Records for these thirty subject classes are not being added to the Georgia Tech file of MARC II records.

A third characteristic of the acquisitions program important in controlling the growth of the cumulative MARC file is the speed with which newly published monographs are acquired. If most monographs are acquired soon after publication, the probability of using a MARC record that has not been selected in the first few months after its receipt may be low. Unselected records may therefore be purged after a relatively short time and the file size thereby controlled. Use of the MARC tapes for book selection will help to increase the probability of records being selected during the first few months on the file. A system that uses the weekly MARC tapes for book selection and does not retain on the cumulative MARC file those records not selected for purchase might be quite economical. The frequency with which decisions are later made to acquire titles that were initially passed over, and the added cost for manual input of those records, would have to be considered in deciding on this policy.

An analysis has been made of the interval between the date records were added to the MARC file and the date on which they were selected for use by the Georgia Tech Library. Distributions by time intervals for each Library of Congress subject class were prepared. The distributions varied significantly for reasons that are not yet clear. Generally, it appeared that in those subject classes for which a smaller percentage of the titles available on the MARC file were acquired, the selected titles were acquired more rapidly.
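Taken together, the class analysis and the lag distributions suggest the retention policy described in the paragraphs that follow: excluded classes are never added to the file, and unselected records in other classes are purged after a class-dependent interval. The sketch below shows that logic; the record and policy structures are hypothetical stand-ins, since the actual controls were part of the COBOL tape-merge program.

    # A minimal sketch of class-based file-growth control, with hypothetical
    # policy values; the actual excluded classes and retention periods would
    # be set from the kind of selection statistics described above.
    from datetime import date, timedelta

    EXCLUDED_CLASSES = {"CS", "BV"}        # classes never added to the file
    RETENTION_WEEKS = {"TK": 78, "PS": 26} # per-class retention of unselected records
    DEFAULT_RETENTION_WEEKS = 52

    def should_add(lc_class):
        """Skip records in classes whose selection probability is too low."""
        return lc_class not in EXCLUDED_CLASSES

    def should_purge(lc_class, added_on, selected, today):
        """Purge an unselected record once it outlives its class's retention
        period; selected records are purged separately, after use."""
        if selected:
            return False
        weeks = RETENTION_WEEKS.get(lc_class, DEFAULT_RETENTION_WEEKS)
        return today - added_on > timedelta(weeks=weeks)

    # Example: an unselected record held 80 weeks in a heavily used class
    # (18-month retention) is purged; the same record at 70 weeks is kept.
    added = date(1971, 1, 4)
    print(should_purge("TK", added, False, added + timedelta(weeks=80)))  # True
    print(should_purge("TK", added, False, added + timedelta(weeks=70)))  # False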
This pattern seems to be advantageous for keeping the MARC file small. For those classes in which a large percentage of titles are selected, unselected records will be retained on the file for a long period, such as eighteen months. Use of a large percentage will mean that the number of unused records remaining on the file will be relatively small, and these will have a high probability of selection over the extended period. For those classes in which a smaller percentage of titles are acquired, the unselected records will be retained on the file for a shorter period, such as six months. Since titles in these fields tend to be acquired more promptly, few potentially useful records will be lost by purging unselected records after a shorter interval.

Over the past year major changes have been made in acquisitions procedures in the Georgia Tech Library. A much larger proportion of monographs are now received on approval plans. The MARC distribution service now provides about twice as many records each week as were provided during the pilot project phase. The effects of these changes on the proportion of titles selected and the time required for acquiring titles in the various subject classes have not yet been determined. Continuous monitoring of the operation of the system for changes in these characteristics will be required for efficient operation. The improved program for maintenance of the MARC II file and selection of records from it provides for designating subject classes which are not to be added to the file and for designating how long unselected records in other classes are to be retained on the file.

This study of variations in the computer costs of card production lends support to the decision to continue using COBOL as the primary language for the MARC II system being implemented on the UNIVAC 1108, rather than using assembly language. The inefficiency of COBOL for character-by-character code conversion and for manipulating variable length data had been a source of some concern. The cost of all processing of selected records, including code conversion, reformatting, prooflisting, making corrections, generating and formatting added entry records, and sorting and printing catalog cards, averaged only about 16 cents per title. A reduction of even 50% through the use of assembly language and increased effort directed to program efficiency would reduce costs by only about 8 cents per title, or 1 cent per card. These savings do not seem to justify the increased original programming costs and the likelihood of eventual costly reprogramming. On the other hand, the cost of selecting records from the MARC file varied from 3 cents per title to 29 cents per title. With the added cost of weekly maintenance of the MARC file, and with more than twice as many MARC records being received, the costs of processing the cumulative MARC file might easily go much higher. By careful attention to controlling the growth of this file, significant savings in the cost of the system may be achieved.

CONCLUSION

Some librarians have assumed that as the scope of the MARC distribution service expands to include other languages and other types of materials, their problems of inputting current records will be solved. This analysis shows that the situation is not so simple.
Probably only a few of the largest general research libraries will be able to maintain complete MARC files for their individual use during the next few years, though reductions in computing costs may eventually change this prediction. Even medium-sized libraries such as Georgia Tech will not be able to use economically the foreign language materials when they are included in the MARC program.

Some libraries which do not use a large enough proportion of the MARC records to make it economically practical to maintain a complete MARC file may be able to make economical use of MARC records by carefully controlling the retention of records on the cumulative file. Continuing analysis of the probabilities for selecting records of varying age and subject classes may be utilized in developing a formula for maintaining the file at near optimum size, if the system provides for collection of the required statistics.

For libraries which cannot profitably use the MARC tapes, there is another prospect. Cooperative centers that do the processing for large library systems or for several systems will have the volume to justify maintenance of complete files. Certainly, a processing center serving all libraries of the University System of Georgia could economically maintain a more complete MARC file than Georgia Tech alone can justify. The development of cooperative processing programs in Ohio, New England, Oklahoma (7, 8, 9), and elsewhere indicates that some librarians are coming to this realization.

ACKNOWLEDGMENTS

Mrs. Julie Gwynn wrote most of the computer programs referred to in this paper. Her husband, Professor John Gwynn, gave valuable advice on the statistical techniques employed in analyzing the data. The University of Toronto Library generously provided a copy of its MARC file, which included the date each record was added to the file, for use in analysis of the time lag between availability of the record and selection of it.

REFERENCES

1. Fasana, Paul J.: "Automating Cataloging Functions in Conventional Libraries," Library Resources and Technical Services, 7 (Fall 1963), 350-365.
2. Kilgour, Frederick G.: "Costs of Library Catalog Cards Produced by Computer," Journal of Library Automation, 1 (June 1968), 121-127.
3. Stone, Sandra F.: Yale Bibliographic System; Time and Cost Analysis at the Yale Medical Library (Unpublished document, New Haven: Yale University Library, 1969).
4. Murrill, Donald P.: "Production of Library Catalog Cards and Bulletin Using an IBM 1620 Computer and an IBM 870 Document Writing System," Journal of Library Automation, 1 (September 1968), 198-212.
5. Kennedy, John P.: "A Local MARC Project: The Georgia Tech Library." In University of Illinois, Graduate School of Library Science: Proceedings of the 1968 Clinic on Library Applications of Data Processing (Urbana: University of Illinois, 1969), pp. 199-215.
6. Ibid.
7. Kilgour, Frederick G.: "A Regional Network - Ohio College Library Center," Datamation, 16 (February 1970), 87-89.
8. Agenbroad, James E., et al.: Systems Design and Pilot Operations of the New England State Universities. NELINET, New England Library Information Network. Progress Report, July 1, 1967 - March 30, 1968 (Cambridge, Mass.: Inforonics, Inc., 1968). ED 026 078.
9. Bierman, Kenneth John; Blue, Betty Jean: "Processing of MARC Tapes for Cooperative Use," Journal of Library Automation, 3 (March 1970), 36-64.