MULTIPURPOSE CATALOGING AND INDEXING SYSTEM (CAIN) AT THE NATIONAL AGRICULTURAL LIBRARY

Vern J. VAN DYKE: Chief, Computer Applications, National Agricultural Library, and Nancy L. AYER: Computer Systems Analyst, National Agricultural Library, Beltsville, Maryland.

Journal of Library Automation Vol. 5/1 March, 1972

A description of the Cataloging and Indexing System (CAIN) which the National Agricultural Library has been using since January 1970 to build a broad data base of agricultural and associated sciences information. With a single keyboarding, bibliographic data are input, edited, manipulated, and merged into a permanent base which is used to produce many types of printed or print-ready end-products. Presently consisting of five subsystems, CAIN utilizes the concept of controlled authority files to facilitate both information input and its retrieval. The system was designed to provide maximum computer services with the minimum of effort by users.

INTRODUCTION

This article describes an interactive system in operation at the National Agricultural Library which with a single keyboarding of data provides all necessary catalog cards, book catalogs, bibliographies, and related internal reports, as well as a computer data base for information retrieval. Primarily in batch mode, the system can operate on an IBM 360 with 256K memory using OS, six magnetic tape drives, a card reader, and a line printer.

BACKGROUND

The National Agricultural Library (NAL), as one of the three national libraries, is responsible for the collection and dissemination of agricultural information on a national and worldwide basis. In this pursuit publications are obtained through gifts, exchange agreements, and purchase of items in many languages. Titles of those items in non-Roman alphabets are transliterated and all non-English titles are translated.

The volume of publications handled by NAL in 1969 was in the neighborhood of 600,000, of which approximately 275,000 were added to the collection. This volume was large enough to pose a serious problem for NAL's staff, and computer assistance was clearly a logical and necessary arrangement.

In 1964 a computer group was formed in NAL; it became active in developing systems to prepare voluminous indexes for the Bibliography of Agriculture, the complete Pesticides Documentation Bulletin, and the categorical and alphabetical issues of the Agricultural/Biological Vocabulary. During 1969 these systems were consolidated and expanded so as to process all input data within one coordinated set of parameters. In January 1970 the new Cataloging and Indexing (CAIN) System was implemented.

SYSTEM DESIGN

CAIN is a complex and comprehensive computer system which has been engineered to handle up to five (5) simultaneous but separate users who share the same controlled authority files. The basic precept in development of computer applications at NAL is to make input and output simple and convenient for the users, with the computer assuming as much detail and data manipulation as is technically feasible. At NAL the current users providing input data are the New Book Section, Cataloging, Indexing, and Agricultural Economics. Operating in parallel, CAIN also services the herbicides data base of the Agricultural Research Service; the International Tree Disease data base of the Forest Service; and in 1971 will be installed in the Library of the Technion-Israel Institute of Technology in Haifa, Israel.

The master data record is variable in length with a fixed portion of 173 characters and up to fifty-seven additional segments of 65 characters each. The fixed portion includes basic data plus a directory of data contained in the variable portion.

Data elements in CAIN are:
a. File code: delineates the various files.
b. Identification number: on cataloged items this embodies the accession number.
All identification numbers include the year of accession, a parallel run code, and a unique control number.
c. Source code.
d. User codes: specific identification of up to five users.
e. English indicator: language of text.
f. Translation code: availability of an English translation.
g. Language, if other than English.
h. Proprietary restrictor: identifies classified records.
i. Title tracing indicator: for catalog cards.
j. Main entry: designates main entry if not normal sequence.
k. Document type: whether journal article, monograph, serial, etc.
l. Filing location: if other than in the library stacks.
m. Categories: two. General area of coverage of subject matter.
n. New book description: if the title is not sufficiently explanatory.
o. Titles: three types: (1) vernacular or short, (2) alternate or holdings, and (3) translated title (English).
p. Personal authors: up to 10. Names plus identifying data.
q. Corporate authors: maximum of two.
r. Major personal author affiliation.
s. Abbreviated journal title if item is a journal article; imprint if monographs and serials.
t. Collation/Pagination.
u. Date: two: search date, and date on publication if different.
v. Call number.
w. Subject terms: may be nested. Up to 45.
x. General notes.
y. Special purpose numbers: patent, grant, analysis, contract, technical, or report.
z. Series statement.
aa. Abstract/Extract.
bb. Tracings not otherwise normally generated by the system.
cc. Nonvocabulary cross-references.

The total number of individual elements is limited only by the maximum record size.

The NAL-produced software is written in COBOL. The data base is maintained on tape which is nine-track, 800 bpi, blocked 2, in EBCDIC, with standard IBM 360 header and trailer labels. The total system presently consists of forty programs, some of which are multipass. In addition, throughput is sorted twenty-five times during the full computer run.
These, of course, include the search and retrieval programs and sorts which are run only on request.

The ultimate system which NAL is working toward and for which the basic design is already substantially complete is an on-line full library document locator and control system which may be linked via dial-up service to an international and national science and technology information network. Each portion of CAIN is developed with the broader picture in mind. It was this factor which weighed heavily in selecting cathode ray tube (CRT) terminals for the proposed data gathering subsystem inasmuch as CRT's will be the predominant type of terminal in the future network.

For convenience in discussion, the system will be described by its subsystems: data gathering, edit and update, publication, search, and controlled authorities.

DATA GATHERING SUBSYSTEM

From its inception the input to CAIN was in the form of punched cards, a method which has proved to be slow and error prone. In order to eliminate double keyboarding and excessive time lag, as well as to reduce the error rates, it was decided to perform this input function in the library with trained library personnel.

To accomplish this, NAL proposes to implement an "on-line" type of input subsystem using CRT's. Although this form of entry is not yet in use, the subsystem should operate substantially as follows.

The documents are to be marked by catalogers and indexers and passed to library technicians who will enter the data through CRT's into an on-line storage file. To do this, the technician will call from the hardware prestored formats as desired and fill in the data elements required. These formats use English terms and for the most part call for data rather than codes. In addition, data are to be entered in normal upper- and lowercase without diacritics, thus improving visual scanning for errors.
An average of four formats will be needed to enter one item. By use of an algorithm, the system would store formatted records for each ID in such a manner as to permit recall singly or collectively.

The physical documents are then to be passed on to an editor who can recall any or all formatted records for review. With the document in hand, stored records will be reviewed and corrected if necessary. When acceptable, the records will then be transmitted to magnetic tape. Variations on this procedure could include input direct to tape, storage to tape without recall to a CRT by an editor, cancellation of actions, and a direct purge of the entire storage file without loss of the controlling matrix.

The expertise of the library technicians inputting the data should ensure far more accuracy than could be expected from multihandling and multikeyboarding. In addition the system has been designed to accomplish basic pre-CAIN editing of such factors as numeric or alphabetic characters in certain fields and overall lengths of the fields. Errors in these categories will be promptly identified by the computer by a blinking feature on the CRT screen.

Another major benefit of this direct approach is that documents can be processed through the system so as to reach the stacks twenty-four days faster than under the current keypunch method.

Magnetic tapes created by the data gathering system will be periodically converted from ASCII to EBCDIC and processed into the edit and update subsystem of CAIN. The present NAL time schedule for updating master CAIN files is weekly. This is not a requirement of the system but an administrative decision based on other deadlines.

The data gathering system as prescribed by NAL will be composed of sixteen CRT's, a large on-line storage file, and one nine-track 800 bpi magnetic tape drive. This configuration will be either a hard-wired "black-box" approach, or controlled by a dedicated mini-computer.
The hardware prescribed for this subsystem is not included as a requirement of CAIN inasmuch as transactions can be entered on 80-column cards if desired.

An additional feature of this subsystem will be the generation of management information feedback. This will encourage elimination of manual counts and provide accurate throughput volume statistics on a timely basis. Through this means the supervisor will be in a better position to evaluate workload, individual performance, and hardware utilization.

EDIT AND UPDATE SUBSYSTEM

The first step in the acceptance of transactions is a thorough validation of each data element. The computer is used to relieve librarians of the voluminous and time-consuming edit of many individual elements having predetermined limits. Thus, only a cursory review of the proof-listed records is necessary by a librarian before acceptance. The system cannot detect, of course, logical or typographical errors, but it can determine the absence of necessary information, codes in invalid ranges, and the incorrect placement of data.

Elements for which the system supplies authority files are not only verified against the file but also additional transactions are generated from the authority file to assure uniformity in output. This also eliminates the necessity for librarians having to enter those elements which have a direct predictable relationship to another element.

Further validations are performed at the point of building new records or updating records already in the master file. The two "master" files are (1) the temporary set of unselected records and (2) the permanent set of those records which have been approved and selected for publication in some form. Data elements specified as required within each record are reviewed. If one or more is missing, the system refuses to approve this record, and a notice is produced concerning this reversal of human input.
Fields can be deleted, in whole or in part, replaced or added.

Three types of output from this subsystem are:
• New updated master files. Those which have been added or altered during this update run are proof-listed for cursory review by a team of professional librarians. Corrections and/or approvals are submitted in a subsequent update run.
• Activity notices. Every action whether submitted by the user or system-generated which has been accepted for processing is reported.
• Error notices. All error and warning messages from this subsystem are compiled into one listing. This includes errors on individual elements, system-discovered errors of omission, and warnings of computer overriding of submitted actions.

Through the use of control cards various handling options are possible. One of these is proof-listing of a specific range or ranges of masters by identification numbers or dates.

Subject headings are assigned by professional librarians for monographs and new serial titles. For journal articles, however, the system analyzes the title of the article and creates subject index terms, using single words, combinations of two words not separated by stop words, and singular and plural variations. The generated terms are then processed against the controlled authority file. Those accepted as valid are inserted in the record for searching purposes.

PUBLICATION AND DISTRIBUTION SUBSYSTEM

Each data element of a bibliographic item is captured only once and at the earliest possible time in the receipt process. Master records which have successfully passed the edit and update phase become candidates for various types of publications and other user services. Six major modes of publication products are produced by CAIN, at various times and in a variety of both formats and media.
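The automatic title analysis described at the end of the edit and update subsystem can be pictured with a small sketch. This is only an illustration under stated assumptions, not the NAL COBOL logic: the stop-word list, the sample authority vocabulary, the trailing-S plural rule, and all function names are hypothetical.

```python
import re

# Hypothetical stop words and authority vocabulary (not taken from the article).
STOP_WORDS = {"A", "AN", "AND", "FOR", "IN", "OF", "ON", "THE", "TO"}
AUTHORITY = {"RUST", "SOILS", "WHEAT", "WHEAT RUST"}


def variants(term):
    """Return the term with a naive singular/plural variation (trailing S)."""
    other = term[:-1] if term.endswith("S") else term + "S"
    return {term, other}


def candidate_terms(title):
    """Single words and adjacent two-word combinations, skipping stop words."""
    words = re.findall(r"[A-Z]+", title.upper())
    candidates = set()
    for i, word in enumerate(words):
        if word in STOP_WORDS:
            continue
        candidates |= variants(word)
        # Pair the word with its neighbor only if no stop word intervenes.
        if i + 1 < len(words) and words[i + 1] not in STOP_WORDS:
            candidates |= variants(word + " " + words[i + 1])
    return candidates


def index_terms(title):
    """Keep only the candidates accepted by the controlled authority file."""
    return sorted(candidate_terms(title) & AUTHORITY)


print(index_terms("Control of wheat rust in sandy soil"))
```

In this example WHEAT RUST is generated because no stop word separates the two words, CONTROL and WHEAT are not paired because OF lies between them, and SOILS is accepted through the plural variation of SOIL.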
Preliminary to the production of formal output there is a screening for records designated as fully acceptable by the edit and update subsystem. As mentioned above, any record may be identified as being applicable to any combination of from one to five users. By a method of control cards the system is informed as to which users are scheduled for publication/distribution, and the maximum quantity to be selected in each case. This subsystem reviews each record to ascertain its appropriateness for selection. Records meeting the criteria are siphoned off for individual handling. No record is dropped from the temporary file until it has been selected by all applicable users.

A New Book Shelf listing may be printed on photocopy paper on request. On preparation, it is ready to be matted, photographed, printed, and distributed throughout the Department of Agriculture. Only enough new book entries are selected by the computer at one time as will fit on three sheets of a four-page publication.

Approved cataloged records are selected weekly. Each record is analyzed for applicability to any or all of the eight major files for which catalog cards are prepared. Each card file has its own criteria both in content and in the number and types of cards produced for it. The system produces a separate record for each card required, sorts together the records for each file, and alphabetizes within that file. Leading articles (regardless of language) are printed but are excluded in the sorting procedure. Cards are printed two-up in upper- and lowercase in the format prescribed by Anglo-American cataloging rules. After printing, the cards are distributed to the appropriate organizations and sections where they may be filed with a minimum of additional effort.

Monthly, a book catalog is compiled. This contains not only a listing by main entry but also indexes of personal authors, corporate authors, subjects, and titles.
A biographic index (major personal author affiliation) capability is available although not presently used by NAL in the book catalog. This catalog is printed in varying numbers of columns changeable by control card option for each index. Again photocopy paper is used with a standard upper- and lowercase (TN) print train. An alternate option is magnetic tape output formatted for direct input to a computer-driven LINOTRON. See bibliographic description for more detail.

Semiannually the index portions of the book catalog are cumulated. Main entry listings are not repeated. Multiyear accumulations may also be produced. The book catalogs are presently being published from photocopy printout by Rowman and Littlefield, Inc., New York.

Bibliographies, either scheduled or special, can be produced with the same indexes as those in the book catalog. These are normally prepared for printing via the LINOTRON. This magnetic tape record contains all formatting requirements with the exception of word divisions. Document title, page, and columnar (subject category) headers are provided by NAL. Running headers are inserted by the LINOTRON. Through predetermined codes, the CAIN tape specifies the print style, print size, and print format. Bibliographies may also be computer printed on photocopy paper similar to the book catalog.

Once a month, each record selected for publication is processed through a merge and adjustment program. At this point published records not previously on the permanent master file are added to it. Those which are already on it are compared and the resident record is adjusted to include the new user for whom the record has just been published. The term field is also verified and updated if necessary. Each term is also used to generate posting records for the subject authority file.
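The monthly merge and adjustment step can be sketched as follows. This is a minimal illustration, not the actual CAIN program: the dictionary layout, the short identification numbers, and the function name are hypothetical, and the choice to generate postings only for terms newly attached to a record is an assumption the article does not state.

```python
def merge_published(permanent, published):
    """Merge published records into the permanent file.

    permanent: dict mapping identification number -> {"users": set, "terms": list}
    published: list of {"id": ..., "users": [...], "terms": [...]}
    Returns the posting records (term, identification number) generated.
    """
    postings = []
    for rec in published:
        resident = permanent.get(rec["id"])
        if resident is None:
            # Not previously on the permanent master file: add it.
            old_terms = set()
            permanent[rec["id"]] = {
                "users": set(rec["users"]),
                "terms": list(rec["terms"]),
            }
        else:
            # Already resident: record the new user and update the term field.
            old_terms = set(resident["terms"])
            resident["users"] |= set(rec["users"])
            resident["terms"] = list(rec["terms"])
        # Assumption: post only terms that are new for this record.
        for term in rec["terms"]:
            if term not in old_terms:
                postings.append((term, rec["id"]))
    return postings


permanent = {"A1": {"users": {"CATALOGING"}, "terms": ["WHEAT"]}}
published = [
    {"id": "A1", "users": ["INDEXING"], "terms": ["WHEAT", "RUST"]},
    {"id": "A2", "users": ["INDEXING"], "terms": ["SOILS"]},
]
print(merge_published(permanent, published))
```

Here record A1 gains INDEXING as a second user and one new term, while A2 is added as a new permanent record.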
The permanent (published) CAIN data base is available on magnetic tape in either the master format or a print format of the linear proof (listing of each data element). Only records not previously published are added to the monthly sale tapes. These tapes may be ordered individually (new monthly selections) or collectively (whole file) at the cost of reproduction only. The tape is nine-track, 800 bpi, EBCDIC with standard IBM 360 header and trailer labels. One of the purchasers of CAIN tape is the CCM Information Corporation of New York, which has published Bibliography of Agriculture from it since 1970. Current purchasers include private corporations and universities, both in the United States and abroad.

The last type of output is normal computer printout of numerous internal reports in a variety of customized formats.

SEARCH SUBSYSTEM

The search capability of the CAIN system is not being used by NAL on its own data base at the present time. It is utilized, however, by other organizations who run the CAIN system on a parallel basis, maintaining their own data bases. The following description, therefore, pertains to the programmed system rather than to its use on the NAL data base.

This subsystem permits identification and retrieval of records in CAIN format based on search statements as applied to almost every data element or combinations thereof. Such searches may use simple statements or a complex series of nested boolean parameters. Questions may also be absolute or weighted to give more precise results. The weight factors if used are normally assigned to each statement within a search question, with a threshold weight assigned to the overall question. The total weight of all true statements must be equal to or greater than the threshold weight for the full query in order to be considered as meeting the search criteria. If such is not the case, the record will not be selected.
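The weighted form of a search question can be illustrated with a short sketch. It assumes a record is a simple dictionary and each statement is a (test, weight) pair; the field names, weights, and threshold are hypothetical, and only the selection rule (total weight of true statements at least equal to the threshold) comes from the description above.

```python
def meets_query(record, statements, threshold):
    """Select a record when the weights of its true statements reach the threshold."""
    total = sum(weight for test, weight in statements if test(record))
    return total >= threshold


# Hypothetical weighted question: three statements and a threshold of 7.
statements = [
    (lambda r: "WHEAT" in r["subjects"], 5),
    (lambda r: r["language"] == "ENGLISH", 2),
    (lambda r: r["doc_type"] == "JOURNAL ARTICLE", 3),
]

rec_a = {"subjects": ["WHEAT"], "language": "ENGLISH", "doc_type": "MONOGRAPH"}
rec_b = {"subjects": ["SOILS"], "language": "ENGLISH", "doc_type": "JOURNAL ARTICLE"}

print(meets_query(rec_a, statements, threshold=7))  # weights 5 + 2 = 7
print(meets_query(rec_b, statements, threshold=7))  # weights 2 + 3 = 5
```

The first record is selected because its total weight equals the threshold; the second falls short and is not selected.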
Since CAIN uses a controlled vocabulary, query statements on subject terms are first matched against that authority file. At this point each invalid (USE) term is replaced by a corresponding valid (UF) term if appropriate. In addition, if the query statement so specifies, the requested terms may be expanded one level in the hierarchy. In other words, it could generate additional statements requesting all broader, narrower, or related terms as specified if such structure were present for the subject within the vocabulary.

Because subject terms comprise the largest percentage of all search elements, an algorithm was developed whereby queries on this type of element are first processed against an inverted file. Identification numbers are extracted for all terms matching the query and only those candidate records are searched using the full query. On a serial file such as CAIN, this concept provides a substantial savings in computer run time.

The print options of retrieval output allow either for normal sequence by identification number or for a specific sequence as requested by the originator. The printout may contain all data elements or only those selected, all others being suppressed.

At the present time this subsystem is used infrequently by NAL and only for internal high priority searches due to the extremely limited subject indexing terms present. It is used more extensively on the parallel operation established for the International Tree Disease Register maintained for the U.S. Forest Service.

AUTHORITY FILES SUBSYSTEM

This subsystem updates, generates, expands, and maintains three types of authority files. These include subject terms with associated hierarchy, call numbers of indexed journals with abbreviated titles, and a subject term inverted file carrying the identification number of each record using that term.

Each transaction to add, change, or delete any data is both edited and reversed before entering the updating sequence.
Thus an addition of a narrower term (for example, HORSE) to a base term (for example, ANIMAL) will automatically generate another transaction to add the broader term of ANIMAL to a base term (new or existing) of HORSE. This precludes having to manually enter both sides of an action as well as assuring reciprocity of entries. Due to the flexibility of the search subsystem of CAIN, this hierarchical continuity is of great importance. If an item is changed the same procedure is followed.

In the instance of deletion, a broader precept is involved. In this case, the term is deleted from all entries in other hierarchies but is itself left on the authority file and marked as being no longer valid. It is thus available for search purposes but is not allowed to be used on subsequent CAIN data records.

During a normal CAIN data run, each call number or subject term in a record is verified against the appropriate file. Each element on these files is carried in two forms: one in stripped uppercase, and the other in preferred print form. When an incoming term is found on the authority file, the system substitutes the proper form. This includes substituting a valid term for an invalid term as in the "use-use for" relationship, as well as generation of the appropriate abbreviated journal title for a given call number.

In order to keep the authority file up to date, the transactions generated by the publication subsystem are now used to insert the record identification number into the inverted file as well as increase the number of postings per term. This assists search specialists in formulating queries in the manner which will reduce computer processing time to the greatest degree.

When published, the authority files themselves can be printed in a special format which displays the entire hierarchy of each term. In addition, up to ten levels of increasingly narrower terms can be listed for each term.
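The reversal of hierarchy transactions (the HORSE and ANIMAL example above) can be sketched as follows. This is an illustration under assumptions: the relation codes NT, BT, and RT are conventional thesaurus abbreviations for narrower, broader, and related term and are not given in the article, and the transaction layout and function names are hypothetical. Only additions are modeled; the deletion rule, which leaves the term on file marked as no longer valid, is not shown.

```python
# Assumed relation codes and their reciprocals (NT/BT/RT are not from the article).
INVERSE = {"NT": "BT", "BT": "NT", "RT": "RT", "USE": "UF", "UF": "USE"}


def with_reciprocal(action, base, relation, target):
    """Return the submitted transaction plus its automatically generated reverse."""
    return [
        (action, base, relation, target),
        (action, target, INVERSE[relation], base),
    ]


def apply_transactions(hierarchy, transactions):
    """Apply ADD transactions to a dict of base term -> relation -> set of terms."""
    for action, base, relation, target in transactions:
        if action != "ADD":
            raise ValueError("only ADD is modeled in this sketch")
        # The base term may be new or already on file.
        hierarchy.setdefault(base, {}).setdefault(relation, set()).add(target)


hierarchy = {}
apply_transactions(hierarchy, with_reciprocal("ADD", "ANIMAL", "NT", "HORSE"))
print(hierarchy)
```

A single submitted action therefore leaves HORSE as a narrower term under ANIMAL and ANIMAL as a broader term under HORSE, so the two sides of the hierarchy cannot drift apart.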
SUMMARY

CAIN is a broad-based comprehensive batch mode system which meets many library requirements. Its flexibility is apparent from the fact that it has already been expanded to select each newly cataloged serial record for transmission in MARC II communication format to the National Serials data bank being created by the three national libraries. Still more capabilities will undoubtedly be built into it before the NAL ultimate on-line system is implemented.

The major thrust of the systems design has been to concentrate on simplifying user interface while imposing stringent and extensive service requirements on the computer system itself.

Due to its inherent fluidity, CAIN is being retained as an in-house system. It is so complex that a single change in one subsystem may have radial effects in any or all of the other portions. Continuing efforts are underway to simplify input, accelerate throughput, and expand its already generous services both to the staff of the National Agricultural Library and to those organizations utilizing output from the CAIN system.