Digitized by the Internet Archive in 2013
http://archive.org/details/thesaurusfeature855morg

Report No. UIUCDCS-R-77-855
UILU-ENG 77 1731
NSF-OCA-MCS73-07980 A03-000027

A THESAURUS FEATURE FOR THE EUREKA INFORMATION RETRIEVAL SYSTEM*

by

Trevor John Morgan

May 1977

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the National Science Foundation under Grant No. US NSF-MCS73-07980 A03 and was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, October 1976.

ACKNOWLEDGMENT

The author would like to thank the many people who contributed to the success of this thesis project. In particular, thanks go to the EUREKA wizards: Keith Morgan and Bernie Hurley for initiating me, and Dick Rinewalt for many patient hours of explanations and assistance. Special thanks to my advisor, Professor David J. Kuck, for his suggestions and guidance throughout the project.

The financial support of the Rotary Foundation and the Western Australian Institute of Technology to make this work possible is gratefully acknowledged.

Finally, thanks to my wife Barbara and son Kent for their unending moral support and understanding through many lonely times.

TABLE OF CONTENTS

1. INTRODUCTION
2. BASIS FOR THESAURUS FACILITIES
   2.1 Reasons for Recall Failure
   2.2 Increasing the Recall Ratio
   2.3 Synonyms
   2.4 Spelling Variants
   2.5 Different Grammatical Constructs
   2.6 Words with Multiple Meanings
   2.7 Different User Environments
   2.8 User Control of Thesaurus Features
   2.9 Construction of the Thesaurus
3. THESAURUS COMMAND
   3.1 ENTER Option
   3.2 DISPLAY Option
   3.3 DELETE Option
   3.4 ON and OFF Options
4. IMPLEMENTATION
   4.1 Overall View
   4.2 Structure of Thesaurus Files
5. CONCLUSION
APPENDIX A Error Messages
REFERENCES

1. INTRODUCTION

The EUREKA system (1) is a free text information retrieval system in operation on a PDP-11/40 computer system at the University of Illinois. It is used extensively as an experimental tool to determine the desirability of various features in such a system.

Any information retrieval system, whether it uses a controlled vocabulary index or free text searching, has the problem of matching the terms and language of the searcher with those used in the controlled index or in the documents themselves (2). If this problem is not solved, then it is to be expected that search recall ratios will suffer because the searcher is not presenting the correct terms in his searches.

There is much to commend a free text information retrieval system such as EUREKA when it is used in non-delegated search mode by practitioners in the field. The natural language of the documents is likely to match the language of the searcher more closely than any controlled vocabulary will. Since most people have slightly different vocabularies, searchers will often not use the exact terms used in the documents. Unfortunately, the searcher in such a system has a very large number of words to choose from and must specify all of the possible terms to assure a high level of recall.

When conducting a search in the EUREKA system, the searcher is continually creating synonym tables in the form of the boolean search expressions that he develops. He may not think of all possible ways in which a particular topic may be expressed, and even if he does discover most of the terms used in the documents, the time expended may be considerable and will reduce the efficiency of his search. Moreover, this effort is lost once the search is completed and the identical process may be repeated many times.
To help overcome these problems, the searcher in a free text system needs a thesaurus or similar aid to control synonyms and to group related words together. Such a thesaurus would normally consist largely of tables of synonyms or near synonyms. For example, the thesaurus should remind the user that 'factory' may also appear as 'industrial plant' or 'warehouse'. The thesaurus will also perform the important task of preserving portions of search strategies over an extended period of time for all users.

It is desirable that the user be able to manipulate and examine such a thesaurus on-line. Moreover, the searcher should be able to have all synonyms for an input term substituted automatically in his search.

This thesis describes a thesaurus facility which has been introduced with these features to ease the burden a searcher must carry during his search. Chapter 2 describes some specific reasons for recall failure and the thesaurus facilities introduced to overcome these. In Chapter 3 the explicit thesaurus commands are described, and Chapter 4 is a description of the actual implementation of these features. Chapter 5 is a summary of the facilities available.

2. BASIS FOR THESAURUS FACILITIES

2.1 Reasons for Recall Failure

When a user is conducting a search he is greatly interested in the recall ratio and precision ratio of his search (1). The recall ratio is the proportion of the material available in the data base that is found by a search, and the precision ratio is the proportion of the material found that is judged by the user to be relevant. As described in the following sections, a thesaurus feature is intended primarily to increase the recall ratio of a search, but it also has an impact on the precision ratio.

Salton (4) has found that many recall failures in free text systems are due to three different problems with the language of searchers and documents. The first of these is due to synonyms, where words with the same meaning are used interchangeably.
It may be that different documents use different terms for the same concept, or the searcher uses a different term from that found in the documents. Examples of such cases are 'dark' and 'night' or 'fright' and 'scare'. Another form of synonym is the conceptual group containing words which are related but are not identical in meaning, such as the terms 'brain', 'nervous system' and 'spinal column'.

The second problem is caused by variant spellings of the same word. Such variations in spelling are due to different inflected forms of the word, such as plurals and tenses. For example we may have 'factory' and 'factories', or we may have 'control', 'controller', 'controlling', 'controlled' and 'controls'. In these cases a searcher would find it difficult to think of all word endings for inclusion in a search.

The third source of difficulty is the occurrence of different grammatical constructions. For example, the concept of 'birth control' may alternatively be expressed as 'pregnancy prevention', 'prevention of pregnancy', 'control of the birth rate' etc. The searcher without the use of a controlled index, where one term is adopted for all such alternatives, would not usually be able to think of all of these phrases in his search.

2.2 Increasing the Recall Ratio

The three problems outlined in the previous section have been addressed, and a thesaurus facility has been introduced to the EUREKA system to improve the recall ratio in that system.

One method of overcoming these problems is to base the system on a precontrolled vocabulary. Rather than using the words of the document text in the indexing, these words are converted to the word stem or to predetermined concept classes. If word stems are used as the basis for this approach, the problem of synonyms must still be solved. When concept classes are used, the problems involved with a controlled vocabulary are reintroduced and all terms used in the natural language searches must be converted to the appropriate concept classes.
In addition the EUREKA system has an index based on all words in the text of documents and it was considered a major task to convert it to a precontrolled index.

The approach taken was to introduce a thesaurus which would be placed between the natural language of the searcher and the free text index. All the terms in a concept class are stored as one entry in the thesaurus. The statements entered by the searcher are matched against all entries in the thesaurus, and if a match occurs, the term in the search statement is replaced by the thesaurus entry. In this way the search statements are expanded before being passed on to the index. The facilities offered by the thesaurus are limited to constructs which can be placed in the thesaurus entries. The following sections describe the facilities available to overcome each of the three problems outlined in the last section as causing recall failure.

Unfortunately a thesaurus feature has several disadvantages. While increasing the recall of a search, it is also likely to increase precision failures due to false coordinations and incorrect term relationships. Several features of the thesaurus facility have been specifically designed to minimize the effect of this problem. In addition, the EUREKA system is designed to be used on-line, with the user interacting with the system to dynamically develop his search strategy. In this way the user is able to monitor the precision of his search by examining the number of documents retrieved, and improve it by development of a useful search strategy.

A thesaurus also requires that a considerable vocabulary of synonyms be constructed, which may be difficult to store and maintain. These synonyms will generally have to be maintained as the vocabulary of searchers and documents changes with time. Facilities for interactive maintenance of the thesaurus are provided to minimize this problem.
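The expansion step described in this section can be illustrated with a minimal sketch. It is written in Python rather than in the language of the PDP-11 implementation, and it assumes a thesaurus held in memory as a list of concept classes. The names THESAURUS, expand_term and expand_expression are hypothetical and do not appear in EUREKA, and only expressions whose terms are joined by + (boolean OR) are handled.

```python
# Hypothetical in-memory thesaurus: each concept class is a list of terms.
# The terms are taken from the examples in Section 2.1.
THESAURUS = [
    ["fright", "scare"],
    ["brain", "nervous system", "spinal column"],
]


def expand_term(term, thesaurus):
    """Return the concept class containing term, or [term] if none matches."""
    for concept_class in thesaurus:
        if term in concept_class:
            return list(concept_class)
    return [term]


def expand_expression(terms, thesaurus):
    """Expand a list of OR-ed terms and render the result in '+' form."""
    expanded = []
    for term in terms:
        for synonym in expand_term(term, thesaurus):
            if synonym not in expanded:  # avoid repeating a term
                expanded.append(synonym)
    return "+".join("'%s'" % t for t in expanded)
```

Under these assumptions, expand_expression(["scare", "memory"], THESAURUS) gives 'fright'+'scare'+'memory': the matched term is replaced by its whole concept class and the unmatched term 'memory' is passed on unchanged.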
2.3 Synonyms

Synonyms are the constructs which form the basis for the operation of the whole thesaurus. Words which have the same meaning can be entered into a concept class within the thesaurus using the boolean logic which is the basis of all EUREKA searches. For example, when we search for the concept 'darkness', we also want any of the terms 'dark' or 'night' or 'black'. Putting this in a boolean expression as it would appear in a FIND statement or a thesaurus concept class we get 'darkness'+'dark'+'night'+'black'.

Other facilities available in the thesaurus also depend on this construction of a concept class by including all words or word variations which have the same or similar meanings.

2.4 Spelling Variants

The EUREKA system has a universal character #, which allows any ending to appear after a word stem. As an example of this, 'factor#' will find all terms containing the word stem 'factor' followed by any ending. It can be used to find all of the terms 'factor', 'factorize' and 'factorization'. Unfortunately the scope of the universal character is indiscriminate and it will also detect 'factory' and 'factories', causing a drop in the precision ratio for the search. A more extreme example is the use of 'd#' to search for 'die' or 'dying' or 'died'. In cases such as this, the universal character is obviously useless.

The thesaurus allows permanent storage of all variants of a word as synonyms in a thesaurus entry. The different versions of the same word are assumed to have the same meaning and would be stored as 'die'+'dying'+'died'. This approach has the advantage that we can differentiate between different meanings for the same word stem with a variety of word endings, depending on the thesaurus entry they appear in. For example 'factor'+'factorize'+'factorization' would appear in one entry and 'factory'+'factories' in another entry, and the two entries are never confused.
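The difference between the universal character and an explicit thesaurus entry can be shown with a short sketch. It is a Python illustration under stated assumptions, not the EUREKA matching routine: the universal character is taken to be a trailing #, and the function names and the word list are hypothetical.

```python
# Hypothetical vocabulary drawn from the examples in this section.
VOCABULARY = ["factor", "factorize", "factorization", "factory", "factories"]

# Explicit thesaurus entries keep the two meanings of the stem apart.
ENTRIES = [
    ["factor", "factorize", "factorization"],
    ["factory", "factories"],
]


def matches(pattern, word):
    """True if word matches pattern; a trailing '#' allows any ending."""
    if pattern.endswith("#"):
        return word.startswith(pattern[:-1])
    return word == pattern


def words_matching(pattern, vocabulary):
    """All vocabulary words that the pattern would retrieve."""
    return [w for w in vocabulary if matches(pattern, w)]


def entry_for(word, entries):
    """Return the first entry that lists word explicitly, or None."""
    for entry in entries:
        if word in entry:
            return entry
    return None
```

Here words_matching("factor#", VOCABULARY) returns all five words, including 'factory' and 'factories', whereas entry_for("factorize", ENTRIES) returns only the three words of the first entry.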
In order to conserve storage space, a shorthand method is available for storing different word endings. The word stem is ended by a colon and is followed by the allowable endings (which may include null), separated by commas. Using the examples from the last paragraph, we would have 'factor:,ize,ization' and 'factor:y,ies'. These word variants can appear in the same thesaurus entry together with different words with the same meaning, as in

'factor:y,ies'+'warehouse#'+'industry'

In addition to this facility for storing word variants in the thesaurus, an additional feature is provided for terms used in a search statement which do not appear in any thesaurus entry. The analysis by Winograd (5) is used to convert plural terms to the singular equivalent and terms with the special endings 'ly', 'ing', 'er', 'en', 'ed' and 'est' to the singular word stem. Singular terms are also converted to the plural form. For example the search expression

'watch'+'babies'+'rising'

is expanded to

'watch'+'watches'+'babies'+'baby'+'rising'+'rise'

and 'prettily' is expanded to 'prettily'+'pretty'.

In a FIND statement it is assumed that words with special endings are usually verbs, adverbs or adjectives, and do not have a plural form. This is not true in all cases, but it holds often enough to serve as the basis for the automatic analysis. Similarly, terms entered in the singular or plural forms are assumed to be nouns and are not expanded to include the special endings. It is also assumed that words which appear in a thesaurus entry will have all possible endings already associated with them in the thesaurus, and so no additional analysis is done.

2.5 Different Grammatical Constructs

As a standard feature, the EUREKA system allows the user to enter phrases as a single term in a search, for example 'birth control' and 'prevention of pregnancy'.
These are also treated by the thesaurus as single terms and hence can be stored and referred to in the standard manner. As an example we may have as thesaurus entries

'birth control'+'contraceptive'+'pregnancy prevention'

and

'factor:y,ies'+'warehouse#'+'industrial plant'

Unfortunately phrases used in this manner must match the phrases present in the documents exactly. To be completely effective, a thesaurus entry must contain all the possible phrases expressing each concept. This is obviously difficult to establish and maintain.

The EUREKA system contains another feature which overcomes this last problem, but also reduces the precision ratio. This is achieved by using statistical association with the boolean AND function, denoted by *, which assures that two terms appear in the same context. The context may be a full document, a paragraph or a sentence. It is to be expected that in some cases the required terms would appear together, but not related to each other, or with a meaning different from the one required. These are false coordinations and incorrect term relationships, and they increase the precision failure in a search. For example 'pregnancy'*'prevention' would incorrectly retrieve 'prevention of hysterical fathers during pregnancy'.

When required, these term relationships can be stored in the thesaurus in the form usually used in searches, surrounded by parentheses, as in ('pregnancy'*'prevention'). Such relationships will not be matched to any terms in a FIND statement. Thus to retrieve such an expression, a word or phrase must appear in the thesaurus entry and must also be used in the search statement. For example 'birth control'+('pregnancy'*'prevention') can only be used if 'birth control' appears in the search statement. To be most effective, these term relationships should be used in a restricted context such as sentence or paragraph.
Unfortunately the thesaurus is incapable of forcing such a context and uses full documents as the context unless the user explicitly specifies otherwise.

Another method of handling different grammatical constructs is to do a full syntactical analysis of the document text to discover all syntactic equivalents of the given phrase. This approach was considered far too complex and slow to be an effective tool in the on-line environment of EUREKA.

2.6 Words with Multiple Meanings

In any free text information retrieval system, multiple meanings are a problem when searches are being conducted. They lead to a decrease in precision due to false coordinations and incorrect term relationships. In the thesaurus facility in the EUREKA system, a word with multiple meanings may appear in more than one thesaurus entry and will be flagged as having multiple appearances. When a search statement containing the term is entered, the system displays each thesaurus entry and asks the user if it has the correct meaning. If a term has multiple meanings, but only appears in the thesaurus once, the system is unaware of the alternative meanings and will automatically use the single entry whenever the term is used.

When a FIND statement includes a term with the universal character #, the system assumes that it may match more than one thesaurus entry and so displays each one that is matched and asks if it is the correct one. Only one such entry will replace the input term.

2.7 Different User Environments

Each user of an information retrieval system will operate to some extent in his own environment. His information requirements, vocabulary and expectations from the system will be different from other users, even if they are working in a similar field. For this reason, a Universal Thesaurus is available to all users, and each individual user is given the full thesaurus features available in his own User Thesaurus.
He alone is completely responsible for the maintenance and use of this thesaurus, and may store in it whatever he chooses.

The user may select between the use of his own User Thesaurus, use of the Universal Thesaurus, or use of both, thus giving him considerable flexibility. It is possible to use only the Universal Thesaurus for general queries and then select his User Thesaurus for searches in a particular field. This feature is described in more detail in Section 3.4. It gives the user the ability to dynamically change his search environment. In this way, each user can use thesaurus entries tailored to his own individual needs without interfering at all with other users.

2.8 User Control of Thesaurus Features

When a feature such as a thesaurus automatically alters the search statements entered by the user, the user must have the ability to control its use. This is accomplished in the case of the thesaurus by allowing the user to selectively turn automatic features on and off. These features include use of the whole thesaurus, selective use of the Universal Thesaurus and the User Thesaurus, display of the expanded form of the search statement, use of the special word endings feature and use of the plural words feature. These can all be controlled for each individual statement, giving the user great flexibility. The user is able to interactively decide if the thesaurus is introducing incorrect terms into a particular search statement and can improve the precision of his search by turning thesaurus features off for that search.

2.9 Construction of the Thesaurus

The construction of the thesaurus is almost completely a manual task. The user must think of synonyms, phrases and term relationships to be entered into the thesaurus using the ENTER command. Some assistance is given to the user for generating different endings to each word entered, based again on the work of Winograd (5).
When the singular form of a term is entered, the plural form is automatically derived; when a plural, or a term with one of the special endings 'ly', 'ing', 'er', 'en', 'ed' and 'est', is entered, the singular form is derived from it. Each of the special endings mentioned above is then added to the singular form and the user is interactively asked to determine whether the resultant word has the correct meaning. This allows the automatic generation of a wide range of commonly used endings and at the same time removes erroneous versions of words. For example, if the term 'fast' was entered, the user may accept 'faster' and 'fastest' but reject the incorrect meanings 'fasting', 'fasted' and 'fasten' and the nonsense word 'fastly'.

3. THESAURUS COMMAND

The thesaurus is implicitly referenced by every FIND statement as described in the preceding sections. All statements which explicitly reference the thesaurus are grouped together as the THESAURUS command. The keyword THESAURUS in this statement must be immediately followed by a second keyword to identify the particular type of thesaurus facilities required. In most cases additional information is also required in the command. The forms of all variations of the THESAURUS command are shown in Table 3.1.1. Keywords in these statements are shown in upper case letters, and may be abbreviated to any number of characters which uniquely identify them.

    THESAURUS ENTER expression
    THESAURUS DISPLAY expression
    THESAURUS DISPLAY ALL
    THESAURUS DELETE term
    THESAURUS [ON ! OFF] [ALL ! USER ! UNIV ! EXPANSION ! WORDENDING ! PLURAL]

    expression ::= term ! expression + term

                        Table 3.1.1

As described in previous sections, a term may be a word, a phrase, a word containing the universal character #, or a word stem followed by : and one or more word endings separated by commas. Each of these forms must be enclosed in quotes as in other EUREKA commands. The term may also be a term relationship of two or more terms separated by * and enclosed by parentheses.
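The shorthand form of a term and the generation of candidate endings described in Section 2.9 can be sketched as follows. This is a Python illustration only: the function names are hypothetical, candidate words are formed by simple concatenation, and neither the plural form nor the spelling adjustments of the actual word analysis (for example 'baby' and 'babies') are attempted.

```python
# The special endings, in the order given in the text.
SPECIAL_ENDINGS = ["ly", "ing", "er", "en", "ed", "est"]


def expand_shorthand(term):
    """Expand 'stem:end1,end2' into full words.

    An empty ending stands for the bare stem. A term without a colon
    is returned unchanged.
    """
    if ":" not in term:
        return [term]
    stem, endings = term.split(":", 1)
    return [stem + ending.strip() for ending in endings.split(",")]


def candidate_words(stem):
    """Words that would be offered to the user for a Y or N decision."""
    return [stem + ending for ending in SPECIAL_ENDINGS]


def to_shorthand(stem, accepted_endings):
    """Build a shorthand term from the stem and the accepted endings.

    Assumption: the bare stem (null ending) is always kept.
    """
    return stem + ":" + ",".join([""] + list(accepted_endings))
```

For the stem 'fast', candidate_words produces 'fastly', 'fasting', 'faster', 'fasten', 'fasted' and 'fastest'. If the user accepts only 'er' and 'est', to_shorthand gives 'fast:,er,est', which expand_shorthand turns back into 'fast', 'faster' and 'fastest'.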
The following sections describe the different options available in the THESAURUS command. Examples are included to illustrate the use of the facilities.

3.1 ENTER Option

This command is used to enter terms with the same meaning into the thesaurus. Each concept class in the thesaurus is searched for the occurrence of any of the terms in the search expression. Each of the terms which do not occur already, and which do not have multiple endings or a # in them, will be given the word ending treatment. If none of the terms occur already, a new entry is created. For example, assuming the User Thesaurus is initially empty, we would get the following sequence for the command

    T E 'CALL'
    DO FOLLOWING WORDS HAVE THE SAME MEANING (Y OR N)
    'CALLY' N
    'CALLING' Y
    'CALLER' Y
    'CALLEN' N
    'CALLED' Y
    'CALLEST' N
    'CALL:,S,ING,ED,ER'
    WILL ONLY ENTER INTO USERS THESAURUS

This will create a new concept class as shown in the second last line of the example. Similarly the commands

    T E 'FACTOR:Y,IES' + 'WAREHOUSE#' + 'INDUSTRIAL PLANT'
    T E 'BIRTH CONTROL' + ('PREGNANCY' * 'PREVENTION')

will create the concept classes

    'FACTOR:Y,IES' + 'WAREHOUSE#' + 'INDUSTRIAL PLANT:,S'
    'BIRTH CONTROL:,S' + ('PREGNANCY' * 'PREVENTION')

When a concept class is found which contains one of the terms in the command, it will be displayed and the user asked if it has the correct meaning. If it does, the thesaurus concept class will replace the term in the ENTER command, and will then be deleted from the thesaurus. The search is then continued to find any other concept classes containing any of the other terms in the original command. When the search is completed, the expanded expression is entered into the thesaurus as a new concept class. As an example, the command

    T E 'SHOUTING' + 'CALL'

will match the concept class entered in the first example.
The term 'CALL' will be replaced by that concept class, and the new concept class created will be

    'SHOUT:ING,,S,ED,ER' + 'CALL:,S,ING,ED,ER'

In this way, all the terms in the ENTER command which exist in a thesaurus concept class are replaced by the appropriate concept class. As a result, if two different terms in the ENTER statement already appear in different thesaurus concept classes, these classes will be combined into a single large entry. Since this process is repeated for all terms entered into the thesaurus, the same term should not appear in two different concept classes with the same meaning. If a term exists already in a concept class but has a different meaning to the one being entered, it is flagged as a multiple meaning and a new entry is created containing the new terms.

3.2 DISPLAY Option

The DISPLAY command is used to display all thesaurus entries containing any of the terms or phrases specified. The terms may be any of the forms described for the ENTER command, as in the following example.

    T DIS 'WAREHOUSE#'
    USER THESAURUS
    'FACTOR:Y,IES' + 'WAREHOUSE#' + 'INDUSTRIAL PLANT:,S'

The DISPLAY ALL command is used to display all entries in the thesaurus, as shown below.

    THES DIS ALL
    USER THESAURUS
    'FACTOR:Y,IES' + 'WAREHOUSE#' + 'INDUSTRIAL PLANT:,S'
    'BIRTH CONTROL:,S' + ('PREGNANCY' * 'PREVENTION')
    'SHOUT:ING,,S,ED,ER' + 'CALL:,S,ING,ED,ER'

3.3 DELETE Option

If a thesaurus entry is in error it may be deleted by specifying in a DELETE statement any term which occurs within it. Each concept class which contains this term will be displayed and the user asked if he wants it deleted. This provides the user with an opportunity to reconsider, and also allows duplicates to be deleted individually. Using the previous examples, the user may decide that 'shout' and 'call' are not really synonyms, and so would enter the following command.
    T DEL 'CALL'
    'SHOUT:ING,,S,ED,ER' + 'CALL:,S,ING,ED,ER'
    DO YOU REALLY WANT TO DELETE THIS SYNONYM (Y OR N) Y

3.4 ON and OFF Options

These are to allow the user to control some of the facilities of the thesaurus as described earlier. The keyword ALL turns the whole thesaurus feature on and off by simultaneously turning both the Universal Thesaurus and the User Thesaurus on or off. The keywords UNIVERSAL and USER are used to turn on and off the Universal Thesaurus and the User Thesaurus respectively. When both of these are turned off, the thesaurus feature is not used at all. If one thesaurus is turned on, FIND commands and THESAURUS commands operate on it only. When both are turned on, the ENTER and DELETE options of the THESAURUS command will only operate on the User Thesaurus, the DISPLAY option operates on both, and FIND commands will search first the User Thesaurus and then the Universal Thesaurus. Only the first match found will be used, so that if the same word appears in one concept in the Universal Thesaurus and one concept in the User Thesaurus, only the User Thesaurus concept will be used.

The keyword EXPANSION is used to control the display of the expanded form of each FIND command. Automatic analysis of special endings is turned on or off by using WORDENDING, and similarly PLURALS controls generation of plural forms and conversion of plurals to the singular form. Examples are

    T OFF ALL
    THES OFF PLURALS
    T ON USER

These commands will turn off the whole thesaurus and the plurals processing, and then turn the User Thesaurus back on again.

4. IMPLEMENTATION

4.1 Overall View

All of the thesaurus routines are executed under the control of the thesaurus search module THESCH. In the case of the FIND command, the thesaurus routines are called by the routine PARSER for each term in the search expression (although the thesaurus can handle the whole search expression after the expansion of macro calls).
Each THESAURUS command is handled by one call to THESCH from PARSUB.

    Bytes   Contents
    0-1     Address of the user Logon Block.
    2-3     Command code:
              21 = THESAURUS ENTER
              22 = FIND
              23 = THESAURUS DISPLAY
              24 = THESAURUS DELETE
              25 = THESAURUS DISPLAY ALL
    4-5     Address of input expression.
    6-7     Address for output expression (if any required).

            Thesaurus Command Table Structure
                        Table 4.1.1

The routine THESCH sets up a thesaurus command table, shown in Table 4.1.1, which is passed to the routine THSPCH for execution. For each thesaurus file currently turned on, each concept class is searched for any terms used in the command. If any are found, the appropriate action is taken for the particular command and the search goes on until the end of file is reached, or all terms in the command have been found. The routines used in this process are as follows.

THSETA - Sets up addresses of terms in the command search expression.

THSCHS - Searches the current concept class for any terms which match the command search expression.

THESIO - Performs standard disk and terminal I/O.

THMSYN - Replaces a term in the search expression with the current concept class (assumed to match it).

THCRSN - Creates a new concept class for the expanded search expression.

TMWEND - Does word analysis to reduce special endings and plurals to the singular form, and to convert singular words to the plural form.

THENTW - Adds the special endings to the singular form of the word, for the ENTER command only.

4.2 Structure of Thesaurus Files

The Users Thesaurus is stored with other user specific information such as macros, query history and comments, in the User File. The thesaurus starts at block 240 and is only limited in length by the size of the User File. The Universal Thesaurus starts in block number 1 of the file UNIVTH.SYN. In both of these files, each block is 256 words, or 512 bytes in length.
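The block arithmetic implied by these figures can be sketched as follows. The sketch is in Python and is not taken from the EUREKA source. It assumes that blocks are numbered from 0 within a file, which the text does not state, and the names block_offset and read_block are hypothetical.

```python
import io

WORDS_PER_BLOCK = 256
BYTES_PER_WORD = 2                               # PDP-11 words are 16 bits
BLOCK_SIZE = WORDS_PER_BLOCK * BYTES_PER_WORD    # 512 bytes

USER_THESAURUS_FIRST_BLOCK = 240       # within the User File
UNIVERSAL_THESAURUS_FIRST_BLOCK = 1    # within UNIVTH.SYN


def block_offset(block_number):
    """Byte offset of a block, assuming blocks are numbered from 0."""
    return block_number * BLOCK_SIZE


def read_block(stream, block_number):
    """Read one whole block from a binary file-like object."""
    stream.seek(block_offset(block_number))
    data = stream.read(BLOCK_SIZE)
    if len(data) != BLOCK_SIZE:
        raise ValueError("block %d lies beyond the end of the file" % block_number)
    return data
```

Under the numbering assumption, the User Thesaurus would begin 240 x 512 = 122880 bytes into the User File. An io.BytesIO object can stand in for a disk file when trying the sketch.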
The first block in each file has the structure shown in Table 4.2.1 and other blocks have the structure shown in Table 4.2.2. The only difference between these two structures is that bytes 2-3 of the first block contain the number of the last block of the file which is being used.

    Bytes   Contents
    0-1     Number of the last byte used in this block.
    2-3     Number of the last block used in the file.
    4-511   Thesaurus concept classes.

        Structure of First Block of Thesaurus Files
                        Table 4.2.1

    Bytes   Contents
    0-1     Number of the last byte used in this block.
    2-511   Thesaurus concept classes.

        Structure of Other Blocks in Thesaurus Files
                        Table 4.2.2

    Bytes   Contents
    0-1     Length of this concept class = L (includes bytes 0-7).
    2-5     Bits indicating if the corresponding terms in the
            concept class have a duplicate meaning.
    6       Number of terms in this concept class.
    7       Not used.
    8-L     Terms in the concept class.

                Concept Class Structure
                        Table 4.2.3

Table 4.2.3 shows the structure of each concept class stored in the thesaurus files. Each concept class must reside in a single block and hence L is limited to 512 bytes. The bits indicating if the corresponding terms have multiple meanings are set to 0 normally, and to 1 if the term has a multiple meaning.

5. CONCLUSION

The preceding sections have described a thesaurus facility within the EUREKA system. It addresses three major causes of recall failure due to language problems and allows parts of search statements in the form of synonyms to be stored for repeated use. Commands are provided for the user to make entries to the thesaurus or delete these entries. Users may also display the contents of the thesaurus and control the automatic facilities which it provides.

The construction of such synonym tables requires a considerable expenditure of human intellectual effort. Nevertheless such searching aids will hopefully raise the average performance capabilities of the free text information retrieval system dramatically.
These synonym tables could repay their cost manyfold in saving the time and intellectual effort of users, thus leading to overall economy in the system.

APPENDIX A

Error Messages

NEW SYNONYM TOO LONG
Cause - After combination with concept classes with the same meaning, already in the thesaurus, the expression being entered is too long to be stored in the file. It is over 502 characters in length.

SYNONYM IS SMALLER THAN TERM SO NOT USED
Cause - A term in the last command matches a synonym which is shorter than the actual term, and therefore likely to be more restrictive.

THESAURUS FILE ERROR
Cause - The thesaurus file currently in use has been corrupted. Cry for help.

THESAURUS FILE IS FULL
Cause - There is no room to store any more concept classes in the thesaurus currently turned on. Call a programmer to reallocate the thesaurus in a larger contiguous file.

THESAURUS - ILLEGAL CHARACTER FOUND
Cause - The search expression or term in the last command was not in the correct form or contained an illegal character. Re-enter the command correctly.

THESAURUS - NO TERMS IN FIND
Cause - The last command entered had no terms in the expression part. Enter a sensible command.

UNIVERSAL THESAURUS ALREADY IN USE
Cause - Only one person can enter information into the Universal Thesaurus at one time. Wait until the current user has turned the Universal Thesaurus off.

WILL ONLY ENTER INTO USERS THESAURUS
Cause - Both the Users Thesaurus and the Universal Thesaurus were turned on when an ENTER command was performed. The expression entered will only be stored in the Users Thesaurus.

REFERENCES

(1) Morgan J.K., "Description of an Experimental On-line Minicomputer Based Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 76-779, February 1976.

(2) Lancaster F.W. and Fayen E.G., "Information Retrieval On-Line", Los Angeles, Calif.: Melville Publishing Company (1973).

(3) Lancaster F.W. and Climenson W.D., "Evaluating the Economic Efficiency of a Document Retrieval System", Journal of Documentation vol. 24, March 1968, pp. 16-40.

(4) Salton G., ed., "The SMART Retrieval System: Experiments in Automatic Document Processing", Englewood Cliffs, N.J.: Prentice Hall (1971).

(5) Winograd T., "Understanding Natural Language", New York, Academic Press (1972).

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-77-855
Title and Subtitle: A Thesaurus Feature for the EUREKA Information Retrieval System
Report Date: October 1976
Author(s): Trevor John Morgan
Performing Organization Rept. No.: UIUCDCS-R-77-855
Performing Organization Name and Address: University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois 61801
Contract/Grant No.: US NSF MCS73-07980 A03
Sponsoring Organization Name and Address: National Science Foundation, Washington, D.C.
Type of Report & Period Covered: Master's Thesis

Abstract: Any information retrieval system, whether it uses a controlled vocabulary or free text searching, has the problem of matching the terms and language of the searcher with the terms available in the system index. These and other problems have been studied and a thesaurus feature has been implemented and installed into EUREKA in order to help understand and solve these problems.

Key Words and Document Analysis. Descriptors: Data base, EUREKA, Information retrieval, Thesaurus

Availability Statement: Release Unlimited
Security Class (This Report): UNCLASSIFIED
Security Class (This Page): UNCLASSIFIED

Form NTIS-35 (10-70) USCOMM-DC 40329-P71