Report No. UIUCDCS-R-76-823
NSF-OCA-DCR73-07980 A02-000022

EVALUATION OF SELECTED FEATURES OF THE EUREKA FULL-TEXT INFORMATION RETRIEVAL SYSTEM*

by

James Richard Rinewalt

September 1976

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the National Science Foundation under Grant No. US NSF-DCR73-07980 A02 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, September 1976.

James Richard Rinewalt, Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1976

Evaluations of on-line information retrieval systems have been largely dependent upon monitoring of users' searches and the cooperation of users in interviews and questionnaires.
Since users have a variety of information needs and levels of experience, the evaluation process has been difficult. This paper describes and presents the results of a series of experiments designed to evaluate various features of an information retrieval system in a controlled environment. Features are evaluated on the basis of their development and implementation cost, their effect on system performance as well as user performance, and the attitude of users toward the feature.

ACKNOWLEDGMENT

The author would like to thank the Illinois Legislative Information Systems and West Publishing Company for making available a digitized copy of the Illinois State Statutes. He would also like to thank Bernie Hurley for his recruiting activities and Keith Morgan for helping run the classes. Also, the financial assistance of the Department of Computer Science and LTV Aerospace Corp. is gratefully acknowledged. He would especially like to thank his advisor, Prof. Dave Kuck, for his advice, encouragement, and patience. Finally, the author wishes to thank his parents, his wife Ruth Ann, and son Mark for their support and understanding.

TABLE OF CONTENTS

CHAPTER 1 - INTRODUCTION
CHAPTER 2 - INVERTED FILE STRUCTURE FOR FULL-TEXT IR SYSTEMS
CHAPTER 3 - EUREKA QUERY LANGUAGE
  3.1 Query Language
    3.1.1 FIND Statement
    3.1.2 PRINT Statement
  3.2 Sample User Session
CHAPTER 4 - USER EXPERIMENTS
  4.1 Initial User Experiments
  4.2 Feature Evaluation Experiments
    4.2.1 Fall 1975 Experimental Series
    4.2.2 Spring 1976 Experimental Series
CHAPTER 5 - ASSESSMENT OF USER ATTITUDES
CHAPTER 6 - SUMMARY
  6.1 Extrapolation to Larger Systems
  6.2 Suggestions for Future Research
LIST OF REFERENCES
VITA

CHAPTER 1 - INTRODUCTION

In the 19 years since the start of the Cranfield Project, much work has been done on the problem of evaluating information retrieval systems.
The early systems were of the controlled vocabulary type which operated in the batch mode and whose response to a search was a list of bibliographic citations. The primary tradeoff in this type of system was the cost of the depth and exhaustivity of indexing versus their effect on recall and precision. The primary concern of the user was to construct one complex search which would satisfy his need. Thus, the evaluation of this type of system was based on the values of its recall ratio (the proportion of relevant material actually retrieved in response to a search request) and its precision ratio (the proportion of retrieved material which is actually relevant). Cleverdon [1] listed other criteria but considered these two the most important. Cooper [2] proposed a measure of retrieval effectiveness which combined recall and precision and took into account the amount of relevant material desired by the user. Various other measures, all involving some form of precision and recall, have been proposed [3].

Advances in computer systems have made on-line full-text information retrieval systems practical. Since users are now able to conduct iterative searches, other criteria including user effort, response time, and the form in which search results are displayed have become more important [4]. The evaluation of operational on-line systems [4] has been largely dependent upon monitoring of users' searches and the cooperation of users in interviews and questionnaires. Since the users had a variety of information needs and levels of experience, the evaluation process was quite difficult. This thesis describes and presents the results of a series of experiments designed to evaluate various features of an information retrieval system in a controlled environment by comparing the performance of several users who have a common information need and somewhat comparable backgrounds.

Carlisle [5] proposed a framework (Figure 1.1) for conducting research in man-computer interactions.
In this framework, the system refers to items which are transparent to the user (e.g., the hardware, the language in which the routines are written, etc.) while the user-system interface refers to items directly affecting the user (e.g., the commands available, the command syntax, the form of the output, etc.). This thesis will examine the effect of varying two entities of man-computer interactions, the user-system interface and the task, on some of the characteristics of performance. Varying the user-system interface will consist of denying selected features to different groups of users, while the two tasks to be considered are short answer-type quizzes and essay-type quizzes. The characteristics of performance are highly interdependent. Those to be examined in this thesis are time, cost, and quantity and quality of performance.

In Chapter 2, the effects of inverted file structure on full-text retrieval systems will be discussed. A brief description of the EUREKA query language will be presented in Chapter 3. In Chapter 4, the design of the experiments will be discussed and the results presented. The results of a survey of user attitudes will be discussed in Chapter 5. Chapter 6 is devoted to a summary of the results and some suggestions for future experiments.

Entities of Man-Computer Interaction
1. The System
2. The Data Base
3. The User-System Interface
4. The User
5. The Training
6. The Setting
7. The Task

Characteristics of Performance
1. The time to perform the task
2. The cost to perform the task
3. The quantity and quality of the performance
4. The errors committed
5. The user's satisfaction
6. The utilization of available resources
7. The patterns of user and system behavior

Figure 1.1 Experimental Framework for Man-Computer Interaction Research [5]

CHAPTER 2 - INVERTED FILE STRUCTURE FOR FULL-TEXT IR SYSTEMS

For the sake of efficiency and adequate response time, an on-line full-text information retrieval system requires some form of inverted file or index to the words used in the text. Without this, the full text of each document, or some surrogate thereof, would have to be searched for each query submitted to the system. While this technique is straightforward, it is obviously time consuming.

The content of the inverted file varies from one implementation to another. For the purposes of this discussion, the inverted file structure of EUREKA [6] will be assumed:

1. A token is defined as any unbroken string of alphanumeric characters.
2. A type is a distinct token.
3. The inverted file contains only those types which occur in the data base.
4. Associated with each type is a list of pointers indicating where the tokens of this type occur in the data base.

The level to which the inverted file points is important to the design of a full-text information retrieval system. The level of indexing, in order of increasing specificity, may be that of document, section of a document, paragraph, sentence, or word. There may be other levels appropriate for specific data bases. The tradeoff is the use of a higher, less specific level of indexing to conserve storage space versus a lower, more specific level of indexing to minimize full-text searching and improve response time.

Figure 2.1 illustrates the difference in storage requirements for various levels of indexing for a data base consisting of a set of state statutes. The area under a curve is the number of pointers required in the inverted file. Since the curves are plotted on log-log paper, the relative sizes of the areas may be misleading. Also, as the level of indexing becomes more specific, the number of bits required for each pointer increases.
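The inverted file structure and the choice of indexing level described above can be illustrated with a minimal Python sketch. It is not EUREKA's implementation: the nested-list document layout, the function names, the folding of tokens to upper case, and the sample text are all assumptions made for illustration. Each pointer is a tuple that is simply cut off at the chosen level, so a more specific level yields more pointers.

```python
import re
from collections import defaultdict

# A token is any unbroken string of alphanumeric characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

# Number of pointer components kept at each indexing level (assumed layout).
DEPTH = {"document": 1, "section": 2, "paragraph": 3, "word": 4}


def build_inverted_file(documents, level="section"):
    """Map each type to a sorted list of pointers at the given level.

    documents: {doc_id: [section, ...]}, each section a list of
    paragraph strings (a hypothetical layout chosen for this sketch).
    """
    depth = DEPTH[level]
    postings = defaultdict(set)
    for doc_id, sections in documents.items():
        for s, paragraphs in enumerate(sections):
            for p, text in enumerate(paragraphs):
                for w, match in enumerate(TOKEN_RE.finditer(text)):
                    # Truncate the full pointer to the chosen level.
                    pointer = (doc_id, s, p, w)[:depth]
                    # Case folding is an assumption of this sketch.
                    postings[match.group().upper()].add(pointer)
    return {t: sorted(ptrs) for t, ptrs in postings.items()}


def pointer_count(inverted_file):
    """Total number of pointers stored in the inverted file."""
    return sum(len(ptrs) for ptrs in inverted_file.values())


# Invented sample data, not taken from the statutes data base.
SAMPLE = {
    "D1": [["Robbery is a felony.", "The penalty for robbery is imprisonment."],
           ["Theft is a misdemeanor."]],
    "D2": [["The penalty for theft is a fine or the return of the property."]],
}

for level in ("document", "section", "paragraph", "word"):
    print(level, pointer_count(build_inverted_file(SAMPLE, level)))
```

At the word level every token has its own pointer, so the pointer count equals the number of tokens; at the document level a type contributes only one pointer per document in which it occurs.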
Table 2.1 shows the number of pointers required and the storage space used as a percentage of the full text for four levels of indexing for this data base.

INDEXING     POINTERS        STORAGE USED
LEVEL        REQUIRED        (% FULL TEXT)
Document     0.39 x 10^6          7
Section      0.96 x 10^6         20
Paragraph    2.04 x 10^6         75
Word         3.30 x 10^6        120

Table 2.1 Storage Requirements for Different Indexing Levels

[Figure 2.1 - Zipf Curves for State Statutes. Curves: word level, paragraph level, document level; horizontal axis: log(rank).]

Theoretically, the minimum number of bits required for indexing to the word level is given by $n \log_2 n$, where $n$ is the number of tokens. Assuming eight-bit characters and an average of six characters per token, the size of the inverted file as a percentage of the full text is then

\[ \frac{\text{inverted file}}{\text{full text}} = \frac{n \log_2 n}{48n} \times 100 \approx 2 \log_2 n \]

Realistically, however, pointer length should be an integral number of bytes. Also, pointers should distinguish between documents and subdivisions thereof; i.e., a pointer should indicate the document, the section within that document, the paragraph within that section, etc. This arrangement allows a user to search for co-occurrences of tokens at any level and allows the data base to change without total reinversion. Unfortunately, it also drastically increases the storage requirements; for the state statutes data base ($3.3 \times 10^6$ tokens and 21202 types), the inverted file requires 120% instead of the theoretical minimum 43% of the full text.

Intuitively, not all types in the data base need be included in the inverted file. The high frequency types are generally syntactic words (THE, AND, OF, etc.), while the lowest frequency types are generally too specific to be useful in searches. One method of determining the usefulness of types is to count the number of times each was actually used in a search. Types can be accessed in two ways: (1) fully specified or (2) truncated to eliminate prefixes and/or suffixes.
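The theoretical minimum quoted earlier in this chapter can be checked with a few lines of arithmetic. The Python sketch below evaluates the expression for the state statutes data base; the function names are illustrative. With the factor 100/48 kept exact, the estimate comes out slightly above the 43% obtained when that factor is rounded to 2.

```python
import math


def min_index_percent(n_tokens, bits_per_char=8, chars_per_token=6):
    """Word-level inverted file size as a percentage of the full text,
    assuming n * log2(n) bits for the index and 48 bits per token of text."""
    index_bits = n_tokens * math.log2(n_tokens)
    text_bits = bits_per_char * chars_per_token * n_tokens
    return 100.0 * index_bits / text_bits


def rounded_estimate(n_tokens):
    """The same estimate with 100/48 rounded to 2, as in the text."""
    return 2.0 * math.log2(n_tokens)


n = 3.3e6  # tokens in the state statutes data base
print(f"exact factor:   {min_index_percent(n):.1f}% of full text")
print(f"rounded factor: {rounded_estimate(n):.1f}% of full text")
```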
Figure 2.2 shows data which was collected for the state statutes data base during the user experiments. The data is based on approximately 12000 searches conducted during 870 user-hours. There were 10317 fully-specified and 31784 truncated types accessed during this period. Figure 2.2 indicates that 25% of the types (ranks 100 through 5192) account for 80% of the fully-specified types used for searching.

[Figure 2.2 Token Usage for Statutes Data Base. Curves: fully specified accesses and total accesses, plotted against rank.]

For this data base, the type-token ratio is 150 and the highest frequency token, THE, occurs $256 \times 10^3$ times. By deleting the sixty highest frequency tokens, the size of the inverted file containing pointers to the word level can be halved. Consequently, the type-token ratio is reduced to 75 and the highest frequency token of those remaining occurs 6116 times.

The choice of the level of indexing has a direct effect on system performance. If a high level is chosen and a large number of search requests specify a lower level, the system will spend much of its time performing full-text searching. This will increase response time and decrease the number of users which the system can effectively handle. Alternatively, if a lower level of indexing is chosen, storage will be used inefficiently unless a sufficient number of search requests occur at or below that level.

Among other features, this thesis will explore document-level versus section-level indexing. In an attempt to compare section-level indexing with paragraph-, sentence-, or word-level indexing, section-level indexing with full-text searching will be compared to that without full-text searching.

CHAPTER 3 - EUREKA QUERY LANGUAGE

The EUREKA query language has been designed as a basic tool for studying user behaviour and searching techniques.
Most of the facilities provided in this system are available elsewhere, though not necessarily all together in one system. This chapter will give a brief description of the EUREKA language emphasizing those features which were evaluated in this research. A complete description of the language and a detailed explanation of its implementation are given in [6].

Each EUREKA user is given a file, accessible only by him, on disk in which a record of his actions is kept. The text of each query is stored here along with a list of identifiers for all documents responding to the query. This file also contains the text of any comments which the user has attached to a query or to individual documents.

3.1 Query Language

Currently, there are nine commands in the EUREKA query language. Only two of these (FIND and PRINT) are necessary for conducting searches, while the other seven perform auxiliary functions. In brief, the functions of the commands are:

FIND: The FIND statement is used to perform searches for documents containing a user selected set of words, parts of words, or phrases. The collection of document identifiers returned by the FIND statement is known as a query set. This command will be discussed in more detail in section 3.1.1.

PRINT: The PRINT command is used to print user comments, selected portions of a document, and information about preceding queries and their resultant query sets. This command will be discussed further in section 3.1.2.

MAKE: The MAKE statement is used to compare and combine sets of documents created by previous FIND and MAKE statements.

COMMENT: The COMMENT statement is used to write notes in the user file concerning a query set or particular document. These notes may be retrieved at a later time by use of the PRINT statement or, if attached to a document, searched by a FIND statement.

MACRO: The MACRO statement is used to name lists of search terms so that the user does not have to repeatedly type in long search expressions.
These macro definitions are saved in the user file and may be used in conjunction with other search terms in FIND statements.

CHANGE: The CHANGE statement is used to assign a name to or change the existing name of a query set.

DELETE: The DELETE statement is used to delete query sets and/or comments which are no longer needed.

LOGON: The LOGON statement is used to identify the user to EUREKA in order for EUREKA to gain access to the correct user files and data base.

LOGOFF: The LOGOFF command is used to terminate a session. It disconnects a user from EUREKA and closes his files.

Each of these commands has a simple basic form. The FIND and PRINT commands, however, have optional clauses and modes of operation that significantly increase their power. These two commands will be discussed in more detail in the following paragraphs.

3.1.1 FIND Statement

The general form of the FIND statement is

FIND search-expression [IN contexts] [FROM set-expression] [= name] ["comment"]

where search-expression is an arbitrarily complex Boolean expression whose variables are search terms and whose operators are + and *, representing the Boolean OR and AND operations, respectively. Search terms are enclosed in apostrophes and may consist of words, parts of words, phrases, or arbitrary character strings. A universal character, #, is provided for indicating to the system that prefixes, suffixes, or both have been deleted from a search term.

The clauses enclosed in square brackets are optional. The IN clause restricts the search to specified contexts of a document; e.g., author list, title, abstract, body, footnotes, etc. If this clause is omitted, the entire document is searched. The FROM clause can be used to restrict the search to specific documents or to results of previous queries. Its argument, set-expression, is a Boolean expression whose variables are sets of documents and whose operators are +, *, and -, representing the OR, AND, and AND NOT operations.
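The way a FIND search expression can be resolved against a document-level inverted file is sketched below in Python. This is an illustration under stated assumptions, not EUREKA's implementation: the function names and the tiny index are invented, giving * higher precedence than + and accepting parentheses are assumptions, and only single-word terms are handled, since phrases would require a search of the full text.

```python
import re


def match_types(term, index):
    """Types selected by one search term; '#' marks a deleted prefix or suffix."""
    core = term.strip("#")
    if term.startswith("#") and term.endswith("#"):
        return [t for t in index if core in t]
    if term.startswith("#"):
        return [t for t in index if t.endswith(core)]
    if term.endswith("#"):
        return [t for t in index if t.startswith(core)]
    return [t for t in index if t == core]


def lookup(term, index):
    """Union of the document sets of every type the term selects."""
    docs = set()
    for t in match_types(term.upper(), index):
        docs |= index[t]
    return docs


# A search term in apostrophes, or one of the operators + * ( ).
TOKEN = re.compile(r"\s*(?:'([^']*)'|([+*()]))")


def tokenize(expr):
    expr, pos, out = expr.rstrip(), 0, []
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected input at position {pos}")
        out.append(("term", m.group(1)) if m.group(1) is not None
                   else ("op", m.group(2)))
        pos = m.end()
    return out


def find(expr, index):
    """Evaluate a search expression; + is OR, * is AND (assumed to bind tighter)."""
    tokens, pos = tokenize(expr), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def parse_or():
        nonlocal pos
        result = parse_and()
        while peek() == ("op", "+"):
            pos += 1
            result = result | parse_and()
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while peek() == ("op", "*"):
            pos += 1
            result = result & parse_atom()
        return result

    def parse_atom():
        nonlocal pos
        kind, value = peek()
        pos += 1
        if kind == "term":
            return lookup(value, index)
        if (kind, value) == ("op", "("):
            result = parse_or()
            if peek() != ("op", ")"):
                raise ValueError("missing ')'")
            pos += 1
            return result
        raise ValueError("expected a search term")

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing input")
    return result


# Invented document-level index: type -> set of document accession numbers.
INDEX = {"ROBBERY": {1, 2}, "ROBBED": {3}, "PENALTY": {2, 3, 4}, "THEFT": {4, 5}}

print(sorted(find("'ROB#' * 'PENALTY'", INDEX)))
```

The set operators of the FROM clause could be handled by the same kind of parser, with - mapped to set difference.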
The last two optional clauses allow the user to assign an alphanumeric name to the query and to attach an arbitrary character string as a comment.

EUREKA's inverted file contains all types which occur in the data base. Associated with each type is a list of documents in which the type occurs. For each document in this list, there is a set of flags indicating which contexts of the document contain one or more occurrences of the type. Thus, many search requests can be satisfied by a search of this file. A search of the full text of a document, however, is required whenever the user (1) enters a search term containing nonalphanumeric characters, (2) searches for a co-occurrence of two or more terms in the same paragraph or sentence, or (3) searches his comments. Statistics, gathered during the user experiments, concerning the use of full-text searching will be presented in Chapter 4.

3.1.2 PRINT Statement

The PRINT statement has three uses. It may be used to display all or selected contexts of any document. It may also be used to display information about previous queries and the documents which responded to them and to display macro definitions. Only the first use will be discussed here. The general form of the PRINT command for displaying a document or a selected part thereof is

PRINT contexts FROM (set [accession-numbers])

where contexts specifies one or more contexts and the FROM clause indicates which documents are to be displayed. The argument of the FROM clause may be the user-assigned name or the system-assigned number of a set of documents created by a previous query, or a list of document accession numbers enclosed in square brackets. The documents are displayed in order of probable relevance according to the frequency of occurrence of the search terms.

When the system displays a portion of a document, the user has the option of browsing through the text.
Using the currently displayed portion as an entry point, the user may move backward or forward one or more paragraphs or sentences, or he may display any other context of the document. He may at any time stop browsing and continue with the current display, skip to the next document in the output list, or cancel all further output. Also at any time he may attach a comment to the document currently being displayed by entering an arbitrary character string enclosed in quotes. Comments attached to a document may be displayed by a PRINT statement or searched by a FIND statement.

3.2 Sample User Session

To illustrate the use of some of these commands, a sample user session is shown in Figure 3.1. All input by the user is underlined. The lines following "... DOCUMENTS ARE POSTED..." in each FIND statement give a list of the accession numbers of the documents which responded to the search. The numbers in parentheses in these lines are the system-assigned relevance ranks based on the frequency of occurrence of the search terms.

The data base being used is a set of state statutes and the object of the session is to find the penalty for robbery. With a less restricted data base, the user might well be satisfied with the results of the first search, having retrieved only thirteen documents, and immediately begin viewing text. Indeed, the ranking mechanism would have presented him with the desired information in the first document it displayed. However, to illustrate more features of the language and since it is known that the desired information occurs in only one document, the object of the session is to retrieve and display only that document.

[Figure 3.1: Sample User Session]

CHAPTER 4 - USER EXPERIMENTS

The experiments centered around machine assignments in a special topics course in information retrieval. The system used was a minicomputer-based experimental retrieval program known as EUREKA [6]. Two data bases were used: (1) a collection of thirty-seven technical articles on information retrieval containing approximately one million characters and (2) a set of state statutes containing approximately twenty million characters. Due to its small size, the information retrieval data base was used only during the first series of experiments.

4.1 Initial User Experiments

The initial set of experiments was intended to "shake down" the system and obtain a preliminary view of user reactions to it. The first set of experiments was conducted during the spring semester of 1975. Before registration and again at the first class meeting, the nature of the course was explained and students who wished to withdraw were given a chance to do so.
The group which completed the course consisted of five graduate students and seven undergraduates. Four were majors in Computer Science, one in Engineering, one in Library Science, and six in Business Administration. The first two class meetings were devoted to a description of the system and its inquiry language. Each student was given a user's manual and one two-hour practice session on-line. A monitor was always present to answer questions and assist with technical problems.

For experimental purposes, the class was divided into two sections. Each week, a list of questions covering unfamiliar material was prepared, and one section attempted to answer them using EUREKA while the other group completed the same assignment using the original documents. Class sections alternated from week to week between EUREKA and the documents. In either case, the student was informed of the general subject to be covered and allowed up to two hours of study time. He or she could elect to take the quiz at any time during this period and was then allowed a maximum of one hour in which to complete it.

By proceeding in this way, two important sets of measurements can be obtained. First, we can compare machine assisted searching techniques with the use of conventional materials. Second, we can compare the performance of several motivated users who are all seeking the same information and whose "information need" is known to the investigator.

Twelve quizzes were given during the semester. The first four consisted of short answer-type questions taken from the information retrieval documents. The second set of four quizzes consisted of short answer-type questions taken from the state statutes. The final set consisted of essay questions taken from the state statutes. Overall, the students using the original documents scored approximately 50% better than those using EUREKA.
Throughout the semester, the students using the printed materials spent the preliminary period each week studying and taking notes, while those using EUREKA took considerably less time and used it to reacquaint themselves with the language of the system rather than to study the material. Some sources, e.g., [4], claim that substantial practice is required to develop a facility with an on-line retrieval system. Since each student had two weeks between on-line sessions, they tended to use only the most primitive features (FIND and PRINT statements) of EUREKA. Only one student used EUREKA's comment feature. The DEFINE statement was rarely used, and the MAKE statement was not used at all.

The poor performance of the students using EUREKA can be attributed to two factors in addition to lack of familiarity with the system. First, during the first set of quizzes, the system was quite unstable. Hard failures which occurred during user sessions prevented them from gaining confidence in the system. Thus, they avoided the more powerful features and used EUREKA in a very elementary and time consuming manner. Also, data corruption which occurred on at least two occasions resulted in non-retrieval of relevant documents. Since this type of error did not cause a system crash, it went undetected for an undetermined period of time.

During the second and third sets of quizzes, the system was relatively stable. Unfortunately, the users' opinion of the system was well established by this time. Also, these quizzes were taken from the state statutes. The documents for this data base contain an extensive (500 page) index which was not available on EUREKA. This gave the group using the documents a substantial advantage. For example, one question concerned the advertisement and sale of birth control devices. The phrase "birth control" does not occur in the text of the statutes but does appear in the index.
To retrieve the relevant document using EUREKA, the user must search for some form of the phrase "prevent pregnancy". However, due to their lack of confidence in the system, the group using EUREKA repeatedly searched for forms of the phrase "birth control", often repeating the same search.

The second set of experiments was conducted during the summer semester of 1975. The class consisted of five graduate students and two undergraduates. In order to learn more about training users and to obtain at least a subjective evaluation of the more powerful features of EUREKA, the comparison of machine versus manual searching was temporarily abandoned. The introductory lectures and initial on-line practice sessions were similar to those of the spring semester except that more emphasis was placed on the Macro and Comment features. The students were given two two-hour sessions per week using EUREKA. Three of these sessions were devoted to short answer quizzes from the spring semester; i.e., exactly the same questions were used in order to make comparisons. Six sessions were devoted to two new essay quizzes which were not used during the spring semester.

During the first short answer quiz, the then inexperienced users approached the system in the same way and performed approximately the same as their confidence-lacking spring semester counterparts. After taking an essay quiz designed to force them to use the more powerful features of EUREKA, they performed substantially better. After another three-session essay, the EUREKA users equalled the performance of the spring semester index-aided document users on the final short answer quiz.

4.2 Feature Evaluation Experiments

Two series of experiments were conducted to evaluate selected features of the EUREKA retrieval system.
The evaluation procedure consisted of giving a set of essay and short answer quizzes to three groups of students: one using the full version of EUREKA, one using a restricted version, and one using the original documents. All questions were taken from the state statutes data base. Since the index to these documents was not available on EUREKA, it was also denied to the document users. The document users still had access to a two-level table of contents which was not available on EUREKA. The primary emphasis of these experiments was on the relative performance of the group using the restricted version of EUREKA. The document group was retained as an experimental control group.

Those features of EUREKA which could be removed without totally handicapping the user were selected for evaluation. The features which were removed are:

1. User personal files
   a. Accessing previous queries
   b. Creating and using macros (personal thesaurus)
   c. Attaching comments to a document or query
2. Full-text searching (ability to search for phrases, to search for the co-occurrence of two or more words in the same sentence or paragraph, and to search user comments)
3. Browse mode (ability to access any portion of a selected document at random).

The subtopics under item 1 can be removed individually or in combinations. Items 1.b. and 1.c. should not affect user performance on short answer quizzes but may be useful on essay quizzes. The other items were expected to have a substantial effect on user and system performance on both types of quizzes.

A cost-benefit analysis of these system features can be developed from the results of these experiments. The benefit of a given feature is defined in terms of the difference in user performance (quiz score) between the group having the feature and the one to which it has been denied. For short answer quizzes, solution time as well as raw score is taken into account in user performance.
The development, implementation, and maintenance cost can be estimated by the code and data storage requirements. The cost in terms of system performance is the difference in the system load, defined as the average space-time product per command, between the full system and the restricted system. The space-time product is the amount of core memory required for both code and data multiplied by the CPU time during which it was used.

Table 4.1 gives the storage requirements for the above mentioned features in the current version of EUREKA. Although the User Personal Files feature is a combination of the Macros and Comments and Access to Previous Queries features, the code required is greater than the sum of its parts. This is due to a substantial amount of bookkeeping code which is common to the two parts.

[Table 4.1 and the tables of experimental results that follow it: contents not recoverable]

[...] user "THINK" time per command is also shown for short answer quizzes. These times are in the range of mean "THINK" times, 20.0 to 35.3 seconds, found in studies of five time-sharing systems [7]. Since users spend much more time writing during an essay, this parameter does not seem appropriate for essay-type quizzes.
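The cost and benefit definitions given earlier in this section can be written out as a short Python sketch. The function names, the units, the use of a group mean as the measure of user performance, and all of the numbers are assumptions chosen for illustration; none of them are measurements from the experiments.

```python
from statistics import mean


def space_time_product(core_words, cpu_seconds):
    """Core memory held for code and data times the CPU time it was in use."""
    return core_words * cpu_seconds


def system_load(commands):
    """Average space-time product per command.

    commands: iterable of (core_words, cpu_seconds) pairs, one per command.
    """
    return mean(space_time_product(m, t) for m, t in commands)


def feature_cost(full_commands, restricted_commands):
    """Difference in system load between the full and the restricted system."""
    return system_load(full_commands) - system_load(restricted_commands)


def feature_benefit(full_scores, restricted_scores):
    """Difference in user performance (here, mean quiz score) between the
    group that had the feature and the group to which it was denied."""
    return mean(full_scores) - mean(restricted_scores)


# Invented example values (memory in words, time in seconds, scores in points).
full = [(1000, 0.5), (2000, 1.0)]
restricted = [(1000, 0.5), (1000, 1.0)]

print("cost:", feature_cost(full, restricted))
print("benefit:", feature_benefit([80, 90], [70, 80]))
```

A negative cost in this sketch would mean that the restricted system placed the heavier load on the machine.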
The figure of merit is the average of the ratio of user performance to system load for each user. An analysis of variance was performed for this measure. The values for the F test give the ratio of two estimates of the variance, between groups and within each group, of the figure of merit. A value of F much greater than 1 indicates a larger variance between groups than within each group and therefore a high probability that a difference does exist between the groups. The probability entry in each table was obtained from standard tables for the distribution of F as a function of the sample size of each group, and indicates the probability that the figures of merit are random samplings of the same population.

With the exception of Quiz and Essay #6, the highest statistically significant difference in the figure of merit occurred during Quiz #4 comparing the full system to the system lacking all personal files. User performance is significantly better on the short answer quiz as well as the essay using the full system. Also, system load is drastically increased by the lack of personal files. This may be attributed to the fact that lacking macros and access to previous queries, the user has no alternative to entering long, complicated search requests. This observation is also supported by the substantially longer "THINK" time between queries taken by users of the restricted system on both the short answer quiz and the essay.

A slight improvement in user performance was recorded by users of the restricted system in Essay #5. System load also decreased with the restricted system on this essay and to a greater extent on the short answer quiz. Analysis of user sessions shows that users of the restricted system spent approximately twice as much time displaying text as did users of the full system. Since the restricted system did not allow browsing, users were forced to read the entire text of each possibly relevant document.
This may have presented them with information which they would have missed if browsing had been available and may therefore have been a factor in their score advantage on the essay. Users of the full system, however, spent less time viewing text and more time performing searches. Searching is naturally more demanding of system resources and accounted for the higher system load.

Essay #2 shows a statistically significant difference in the figure of merit in favor of the system without Macros and Comments. Some small difference is to be expected since full-text searching is required to retrieve user comments. The large difference in this case may be due to lack of familiarity with the feature since, as is shown in Table 4.4, it was not heavily used. Over half the class did not use the Comment feature at all.

FEATURE                       USAGE
Macros                        Average of less than 1 macro per user per session
Comments                      This feature was not used on short answer quizzes.
                              The following statistics are for essays only.
                              Fall 1975: 38% of the users made an average of 10 comments each per two-hour session
                              Spring 1976: 48% of the users made an average of 34 comments each per two-hour session
Access to Previous Queries    40% of the search requests used the results of a previous search
Browse Mode                   47% of the time during which users were viewing text, they were browsing
Full-Text Search              Fall 1975: 46% of all searches requested full-text searching
                              Spring 1976: 30% of all searches requested full-text searching

Table 4.4 Usage of EUREKA Features

For Quiz and Essay #6, the level of indexing was changed because it was felt that inhibiting full-text searching while indexing only to the chapter level would not produce any interesting information. The inverted file was modified to provide pointers to the section level where each section contained approximately 1500 tokens. The actual implementation reformatted the data base, making several smaller new documents (sections) out of the old documents (chapters).
This did not allow term coordination at higher levels; i.e., users could no longer search for co-occurrences of terms at the chapter level. This was a drastic change for users who had become familiar with the chapter-level indexing since the number of documents increased from 378 to 3176. Therefore, the comparison between the full and restricted systems in this case cannot be considered valid. However, it was noted that users performed significantly better, although somewhat erratically, on this quiz than on previous quizzes. Also, system load decreased drastically. The lower level of indexing reduced the amount of full-text searching required and also provided a new level for term coordination.

User performance on Essay #6 degraded somewhat compared with previous essays. This implementation of section-level indexing tended to fragment concepts since chapters were broken into several documents. This fragmentation did not affect user performance on the short answer quiz since only specific factual information was being sought. However, for the essay, concepts are important and their fragmentation degraded user performance.

User performance for short answer quizzes throughout the semester is shown in Figure 4.1. The vertical bars indicate the 95% confidence limits in each case. To account for the differences between groups, the average score over all quizzes was calculated for each group. These averages were used as a measure of the native intelligence of each group, and the scores on each quiz were adjusted accordingly. In an attempt to factor out the difficulty of the quiz so that the learning curve could be examined, the scores for each quiz were then normalized to the average document score.

[Figure 4.1 Fall 1975 Short Answer Quizzes. * = Full System, ° = Restricted System. Horizontal axis: Quiz Number.]

From Figure 4.1, it can be seen that this approximation of the difficulty of the quiz is not valid since Quizzes #4 and #6 were evidently slanted toward the machine.

4.2.2 Spring 1976 Experimental Series

A second set of feature evaluation experiments was conducted during the spring semester of 1976. The class consisted of thirty-one students - fourteen in Computer Science or related majors, nine in Business Administration, and eight in Liberal Arts and Sciences. The introductory lectures, demonstrations, and practice sessions were similar to those in the Fall 1975 experiments. Additionally, this class was given a programming language-type quiz on the EUREKA language after the second lecture. This quiz was designed to force the users to learn the EUREKA language and to obtain an estimate of the native intelligence of each group. This estimate agrees well with the procedure used during the Fall 1975 experiments, the difference between the best group and the worst group being approximately 8%.

The class was again divided into three groups - one using the full version of EUREKA, one using a restricted version, and one using the original documents. For this series of experiments, each group had one one-hour session and one two-hour session per week. To eliminate inter-group distinctions, they rotated every two weeks. During the first week of each two-week period, the one-hour session was devoted to a thirty-minute short answer quiz, prior to which each student could have up to thirty minutes to refamiliarize himself with the system, while the two-hour session was devoted to studying for an essay. During the second week, two hours were allotted for writing an essay and one hour for a short answer quiz.
To investigate the effect of indexing level, all short answer quizzes used the indexing system which was used on Quiz #6 during the Fall 1975 experiments. Because of the concept fragmentation in this system, the chapter-level indexing system was used for all essays. Except for Quiz and Essay #1, the order in which the quizzes were given as well as which restricted system was used with a particular quiz were scrambled from the preceding semester.

Tables 4.5, 4.6, and 4.7 present the results of these experiments. Analysis of variance of the user performance expressed in points per minute for the short answer quizzes and points for the essays shows no statistically significant difference between the full version of EUREKA and any restricted version. However, the raw score, which does not include solution time, for Quiz #4 showed a significant difference at the 10% level in favor of the full system over the system lacking access to previous queries.

The results for system performance show a good agreement with the Fall 1975 experiments. As expected, the system lacking full-text searching significantly decreased the system load but also increased the user "THINK" time. User performance on both quizzes and the essay comparing these two systems shows a relatively large difference in scores and a large value of F from the analysis of variance. A larger sample size may have shown a statistically significant difference in user performance between these two systems.

[Tables 4.5, 4.6, and 4.7 and the accompanying plot against Quiz Number: contents illegible in the source scan.]

CHAPTER 5 — ASSESSMENT OF USER ATTITUDES

In addition to user and system performance, another important factor in the evaluation of an information retrieval system is the attitude of users toward the system.
Although users may be able to perform adequately with a minimal system when under pressure, they may not willingly use such a system.

One method of measuring attitudes is through the use of a semantic differential. A semantic differential consists of a series of bipolar adjective scales on which a subject indicates his reaction to a particular concept. An example is shown in Figure 5.1. One semantic differential exists for each concept to be rated. The subject is instructed to mark one of the seven intervals between each adjective pair indicating the strength of his reaction to the concept.

                         EUREKA
      fast  ___:___:___:___:___:___:___  slow
      good  ___:___:___:___:___:___:___  bad
successful  ___:___:___:___:___:___:___  unsuccessful
  valuable  ___:___:___:___:___:___:___  worthless

Figure 5.1 Example of a Semantic Differential

To reduce the amount of data which must be examined, adjective scales can often be combined into independent groups through factor analysis. Each group then measures a different dimension of a subject's attitude toward a concept.

The adjective scales used for the factor analysis in this evaluation are the same as those used in a 1970 study of SUPARS [8,9], while the concepts which were rated naturally differ. The current study rated fifteen concepts, some of a general nature and some specific to EUREKA. Thirty-two experienced users (four members of the EUREKA staff and twenty-eight students who had participated in the experiments) were given a packet of semantic differentials, one for each concept. The order of the semantic differentials within each packet, the order of the adjective scales within each semantic differential, and the ends of the adjective scales were randomized. The completed semantic differentials were scored and the data was then subjected to factor analysis.

The factor analysis procedure used in this evaluation follows that of Katzer [8]. Each semantic differential was treated as a separate observation resulting in a matrix consisting of 480 observations by 19 variables. The correlation matrix among the variables was first computed.
Then the eigenvalues and associated eigenvectors of this matrix were found. To reduce the number of dimensions, only those eigenvalues greater than 1.0 were retained. The remaining dimensions were rotated using Kaiser's varimax procedure [10] to approximate a simple structure. Each variable was then assigned to the one dimension on which it loaded highest. Each acceptable dimension was required to have at least as many variables assigned to it as the dimensionality of the factor space. For example, the acceptance of a fourth dimension would require each dimension to have at least four variables assigned to it.

The results of the factor analysis are given in Table 5.1. Factor loading is a measure of the correlation between a variable and a dimension. Communality is a measure of the variance of a variable accounted for by the reduced number of dimensions, while factor purity, defined as the square of the highest loading divided by the communality, indicates the proportion of the variance accounted for by the dimension to which the variable is assigned. Variables 11 and 19 loaded highest on a fourth dimension which was discarded because it did not satisfy the requirement for the number of variables assigned to it.

Based on factor loading and factor purity values, representative adjective scales, identified by an asterisk in Table 5.1, were then selected from each dimension. These eight adjective scales were then used for an attitude survey of the students participating in the Spring 1976 series of experiments. The survey was conducted at the end of the semester at which time each student had completed approximately twenty-six contact-hours on EUREKA. Twenty-five of the thirty-one students completed the semantic differential packets. As in the factor analysis phase, each packet contained fifteen randomized semantic differentials.
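The assignment and acceptance rules described above can be illustrated with a short sketch that starts from an already rotated loading matrix. It is not the procedure actually run in this study: the function names and the dictionary layout are hypothetical, and communality is computed as the sum of squared loadings over the retained dimensions, which is the usual definition for orthogonal factors and is assumed here.

```python
def assign_variables(loadings):
    """Assign each variable to the dimension on which it loads highest.

    `loadings` maps a variable name to its list of loadings, one per
    retained dimension (assumed not all zero). Returns, per variable,
    the assigned dimension index, the communality, and the factor purity.
    """
    result = {}
    for name, row in loadings.items():
        # Communality: variance accounted for by the retained dimensions
        # (sum of squared loadings; an assumption of this sketch).
        communality = sum(x * x for x in row)
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        # Factor purity: square of the highest loading over the communality.
        purity = row[best] ** 2 / communality
        result[name] = {
            "dimension": best,
            "communality": communality,
            "purity": purity,
        }
    return result


def acceptable_dimensions(assignment, n_dimensions):
    """For each dimension, report whether it has at least as many
    variables assigned to it as there are dimensions in the factor space."""
    counts = [0] * n_dimensions
    for info in assignment.values():
        counts[info["dimension"]] += 1
    return [count >= n_dimensions for count in counts]
```

If any entry returned by `acceptable_dimensions` is false, the rule in the text would call for reducing the dimensionality and repeating the assignment, as was done when the fourth dimension was discarded.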
The completed semantic differentials were scored by assigning integer values from -3 to +3 to the seven-interval adjective scales, positive values indicating a positive reaction. The means and standard deviations were then calculated for each concept by dimension.

[Table 5.1: results of the factor analysis; contents illegible in the source scan.]
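The scoring step described above (integer values from -3 to +3, then means and standard deviations for each concept by dimension) might be sketched as follows. The interval numbering, the `positive_on_left` flag used to undo the randomized scale ends, the record layout, and the use of the sample standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, stdev


def score_mark(interval, positive_on_left):
    """Convert a marked interval (1 to 7, counted left to right) into an
    integer from -3 to +3, positive meaning a positive reaction."""
    value = 4 - interval  # leftmost interval -> +3, rightmost -> -3
    return value if positive_on_left else -value


def summarize(responses):
    """Mean and standard deviation for each (concept, dimension) pair.

    `responses` is a list of (concept, dimension, score) tuples.
    The sample standard deviation is used; 0.0 is returned when only
    one score is available (a choice made for this sketch).
    """
    groups = defaultdict(list)
    for concept, dimension, score in responses:
        groups[(concept, dimension)].append(score)
    return {
        key: (mean(scores), stdev(scores) if len(scores) > 1 else 0.0)
        for key, scores in groups.items()
    }
```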