College and Research Libraries

F. W. LANCASTER

Evaluation of Systems by Comparison Testing

This paper contends that the retrieval abilities of four index languages studied in the Cranfield Project are comparable, although many of their respective characteristics differ considerably one from another. The ability of a system to retrieve a high percentage of documents may not, in itself, be meaningful; the total expenditure of effort must also be taken into account. In the case of the Cranfield Project, the four systems, utilizing a common conceptual analysis and given identical entry vocabularies, would have achieved identical recall performance for any given group of requests.

Mr. Lancaster is with Herner and Company, Washington, D.C.

Phyllis Richmond's recent article "Systems Evaluation by Comparison Testing" criticizes the early study by the Cranfield Project of the comparative ability of four index languages to retrieve known relevant documents (i.e., the recall powers of these index languages). Her main point of criticism is that the test program compared unlike things: that the Universal Decimal Classification, the special faceted classification, the scheme of alphabetical subject headings, and the system of Uniterms were not equally applicable to handle the subject matter of the test collection, namely aeronautics. This argument I believe to be ill-founded.

In the view of Mrs. Richmond, use of the UDC and of alphabetical subject headings represents a "dilute approach" to the indexing of aeronautics documents, whereas the "concentrated" approach is provided by the special faceted classification devised for the Cranfield Project and by the use of Uniterms extracted from the document texts.

Admittedly the Universal Decimal Classification is an example of hierarchical classification (allowing for a certain element of synthesis) designed to organize the whole of recorded knowledge.
Unlike the Dewey Decimal Classification, however, UDC is applied much less to the control of general document collections than to the control of collections in fairly restricted subject fields. Indeed the UDC appears to be used more for microdocumentation than macrodocumentation. In England, at least, a principal application of the scheme is for the detailed indexing of reports and journal articles in specialized technical libraries. In many if not most cases, these libraries are centrally interested in only a small segment of the total schedules, as, for example, the aeronautics section. The advantage of the UDC under such circumstances is that, in many subject areas, it has been developed in sufficient detail to cope with the specific indexing of highly specialized collections, while the remainder of the schedules can be drawn upon in a more general way to index the subject areas of peripheral interest. Thus, insofar as application to special collections is concerned, many documentalists would disagree with the statement that if a particular section of the index language is "selected for special treatment or expansion or realignment, the ramifications are soon felt throughout the rest of the system, which then needs the same kind of attention so that it will continue to function as an organic whole."

College & Research Libraries • May, 1966

Mrs. Richmond's claim that alphabetical subject headings are "generalized-concept index terms" would appear to be naive. She perhaps confuses an indexing method with popular examples of its application. Certainly the subject headings in the authority lists of Sears and of the Library of Congress are somewhat general, but this does not make the whole subject heading principle inapplicable to the indexing of highly specific subject matter.
Properly designed, a scheme of alphabetical subject headings can afford an approach to the indexing of aeronautics (or whatever other subject you care to name) as "concentrated" as an approach through a special faceted classification, Uniterms, or any other type of index language.

The recall performance of a system (i.e., its score in retrieving relevant documents) is not in itself a very meaningful measure of the efficiency of a document retrieval system, since it is obvious that 100 per cent recall can always be obtained by examining the entire document collection. It is to save the time and effort involved in this task that an index to a collection is created. By so doing, the number of documents that need to be looked at is reduced (i.e., precision is improved). At the same time, some relevant items tend to be lost (i.e., recall deteriorates). It follows, then, that any recall figure for a particular search (i.e., the percentage of the relevant documents that are retrieved) is only meaningful when considered in relation to the precision figure (i.e., the percentage of the total documents retrieved that are in fact relevant) achieved at the same time.

In reviewing Dr. Richmond's conclusions, it is worthwhile considering briefly the principal factors governing the recall and precision power of a document retrieval system. Precision is governed primarily by the specificity of the index language (i.e., by its ability to define classes uniquely). This is not a direct reflection of the number of terms used to define classes in the system. The five thousand classes that are defined by, say, five thousand distinct subject headings or five thousand notational elements from a traditional hierarchical classification (pre-coordinate) may be uniquely definable by one thousand Uniterms, three hundred Mooers-type descriptors, or as few as one hundred carefully chosen semantic factors.
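The recall and precision ratios defined earlier in this section can be made concrete with a small worked example. What follows is only a minimal sketch in Python: the collection size, the document numbers, and the names `recall`, `precision`, `collection`, `relevant`, and `index_search` are all invented for illustration and do not come from the Cranfield data. It shows why a recall figure means little on its own: examining the whole collection yields 100 per cent recall at negligible precision.

```python
def recall(retrieved, relevant):
    """Percentage of the relevant documents that were retrieved."""
    if not relevant:
        raise ValueError("recall is undefined when no documents are relevant")
    return 100.0 * len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Percentage of the retrieved documents that are in fact relevant."""
    if not retrieved:
        raise ValueError("precision is undefined when nothing is retrieved")
    return 100.0 * len(retrieved & relevant) / len(retrieved)


# Hypothetical collection of 1,000 documents, 5 of them relevant to a request.
collection = set(range(1, 1001))
relevant = {3, 17, 42, 256, 900}

# Hypothetical result of a search through an index:
# 8 documents retrieved, 4 of them relevant.
index_search = {3, 17, 42, 256, 500, 501, 502, 503}

# Examining the entire collection: perfect recall, negligible precision.
print(recall(collection, relevant), precision(collection, relevant))      # 100.0 0.5

# Searching through the index: one relevant item lost, far fewer to examine.
print(recall(index_search, relevant), precision(index_search, relevant))  # 80.0 50.0
```

In this invented case the index trades one lost relevant document for a reduction from 1,000 documents to 8 that must be examined, which is the exchange the two figures are meant to capture together.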
Recall, on the other hand, is governed by the exhaustivity of the indexing. The more concepts we recognize in our analysis of document content, and convert into the terms of some index language, the greater will be the number of requests for which the indexed documents will be retrieved. Maximum recall would be assured if we were able always to foresee all the types of requests for which each document entering the system would provide a relevant response. But it is not enough to recognize indexable concepts and to translate these into the terminology of the index language. We must also create a record to show what particular terms, or combination of terms, we have used to represent some particular idea. In other words, we must create an entry vocabulary to supplement the working vocabulary of our index language.

It is important at this point to emphasize the fact that the indexing process consists of two quite distinct steps. The first step we might call "conceptual analysis." It is the intellectual task of determining what a document is about, or more properly, of deciding for what types of requests the document is likely to provide a suitable response.

The second step involves the translation of the notions identified in this conceptual analysis into the terms of some index language. Once a suitable entry vocabulary has been developed to link textual expressions (from the indexing of documents) and verbal expressions (from the indexing of requests) to the working terms of our vocabulary, this translation task can be a purely clerical operation. In fact, with suitable table lookup procedures, it can very well be delegated to a machine. That Mrs. Richmond has failed to recognize the distinction between these two steps is shown in her statement that one "system was used for the initial analysis ...
[and] its result was then matched to the terminological or structural pattern of the other three."

Let us assume that we have a collection of one thousand documents and that we carry out a conceptual analysis of these items. Now we translate these conceptual analyses into the terms of four separate index languages, say, UDC, a faceted classification, alphabetical subject headings, and Uniterms. No matter how much variation there is among these languages with respect to their ability to define classes specifically, if we equip each system with an identical entry vocabulary, they will all have the capability of achieving the same recall performance for any particular group of requests. If, in the Cranfield investigation, identical entry vocabularies for the four systems had been built up, based on the original conceptual analysis of test documents, and if human variables in searching had been eliminated, the performance of the systems with respect to retrieval of known relevant documents would have been identical.

For a particular collection of documents and of requests, any index can achieve the same recall performance as any other, providing they are both equipped with identical entry vocabularies, based on a common conceptual analysis. If the two systems should also have the same capability for uniquely defining classes, then both will also be capable of the same precision performance.

It would appear then that Dr. Richmond is mistaken in her contention that, with respect to the indexing of highly specialized subject matter, a tailor-made faceted classification or Uniterms can offer a "concentrated approach," whereas UDC and alphabetical subject headings can offer only a "dilute approach."

It should not be assumed of the UDC that there is only one such beast. In fact there are as many UDCs as there are organizations using the scheme, since no two organizations use it in exactly the same way.
Certainly no informed librarian would rely on the printed index to the schedules as a suitable entry vocabulary. Each library must develop its own entry vocabulary to reflect the way that documents are written in the subject fields of interest and, even more importantly, to reflect the way that requests are made by the library's user group. The richness of the entry vocabulary is a function of the exhaustivity of the indexing, and an individual library is able to control the recall powers of its version of the UDC on this basis. Similarly, the precision powers of the system can be controlled by the degree of specificity effected through synthesis of notational elements.

In retrospect, it can be seen that the early efforts of the Cranfield Project were imperfect. Cyril Cleverdon is the first to admit this. However, the comparative study of the four index languages was of great value in signposting the direction which further investigations should take. This, and subsequent work at Cranfield, has done much to clarify thinking regarding the factors that affect importantly the operating efficiency of a document retrieval system.
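The central argument of this paper, that index languages sharing a common conceptual analysis and an identical entry vocabulary reach the same recall while differing in precision according to their specificity, can be illustrated with a toy model. The sketch below is in Python and rests entirely on invented data: the documents, concepts, and index terms, and the names `analysis`, `specific_language`, `broad_language`, `build_index`, and `search`, are hypothetical and are not drawn from the Cranfield tests. It is an illustration under these assumptions, not a proof.

```python
# Step 1, conceptual analysis: the concepts recognized in each document.
# Document d4 is relevant to flutter requests, but the analysis missed that concept.
analysis = {
    "d1": {"wing flutter"},
    "d2": {"wing flutter", "boundary layer"},
    "d3": {"tail buffeting"},
    "d4": {"boundary layer"},
    "d5": {"jet noise"},
}

# Step 2, translation: two index languages with the same entry vocabulary
# (the same concept keys) but different specificity of working terms.
specific_language = {
    "wing flutter": "FLUTTER",
    "tail buffeting": "BUFFETING",
    "boundary layer": "BOUNDARY LAYER",
    "jet noise": "NOISE",
}
broad_language = {
    "wing flutter": "AEROELASTICITY",
    "tail buffeting": "AEROELASTICITY",
    "boundary layer": "FLUID FLOW",
    "jet noise": "ACOUSTICS",
}


def build_index(analysis, entry_vocabulary):
    """Translate each analyzed concept into an index term by table lookup."""
    index = {}
    for doc, concepts in analysis.items():
        for concept in concepts:
            index.setdefault(entry_vocabulary[concept], set()).add(doc)
    return index


def search(index, entry_vocabulary, request_concept):
    """Translate the request through the same entry vocabulary, then look it up."""
    return index.get(entry_vocabulary[request_concept], set())


def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) as fractions; precision is 0.0 if nothing is retrieved."""
    hits = len(retrieved & relevant)
    return hits / len(relevant), (hits / len(retrieved) if retrieved else 0.0)


# Hypothetical request on wing flutter, with relevance judged independently.
relevant = {"d1", "d2", "d4"}

results = {}
for name, language in [("specific", specific_language), ("broad", broad_language)]:
    index = build_index(analysis, language)
    retrieved = search(index, language, "wing flutter")
    results[name] = recall_and_precision(retrieved, relevant)

print(results)
```

In this invented case both languages retrieve d1 and d2 and both miss d4, because the loss occurred in the conceptual analysis and not in the translation. The broader language additionally retrieves d3, so its recall equals that of the specific language while its precision is lower.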