Report No. UIUCDCS-R-78-914 / UILU-ENG 78 1704
NSF-OCA-MCS73-07980-000032

A UNIQUE WORD-SCANNING FACILITY FOR THE EUREKA FULL-TEXT INFORMATION RETRIEVAL SYSTEM

by

William Ming-Cheong Leung

January 1978

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

This work was supported in part by the National Science Foundation under Grant No. US NSF MCS73-07980 and was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, January 1978.

ACKNOWLEDGMENT

Various people have contributed to the successful completion of this thesis. To them, I would like to express my appreciation. My special thanks go to Professor David J. Kuck, my thesis advisor, for his guidance, suggestions and, most of all, his patience with me. I would also like to thank Perry Emrath for his invaluable assistance throughout this project, John Keith Morgan for helping me with the Eureka internals, and Fon Krupp, my supervisor, for his understanding during the final weeks of this preparation.

TABLE OF CONTENTS

1 INTRODUCTION
2 HISTOGRAM...A WORD-SCANNING FACILITY
  2.1 The Need for an Intelligent Word Scanning Facility
  2.2 HISTOGRAM and the Zipf Distribution
  2.3 The HISTOGRAM Screening and Selection Mechanisms
  2.4 EUREKA and the HISTOGRAM Subsystem
  2.5 Language Design and Description
    2.5.1 HIST Command
    2.5.2 MERGE Command
    2.5.3 STATS Command
    2.5.4 LIST Command
    2.5.5 SEARCH Command
    2.5.6 EXIT Command
  2.6 Usages of HISTOGRAM as a Search Aid
3 HISTOGRAM DESIGN AND DESCRIPTION
  3.1 Design Philosophy
  3.2 Functional Overview
  3.3 Files of HISTOGRAM
  3.4 Description of the Command Modules
    3.4.1 HSTGRM
    3.4.2 MGR
    3.4.3 STATS
    3.4.4 LISTER
    3.4.5 SRCHER
4 THESAURUS GENERATION
  4.1 Conventional Methods of Thesaurus Generation
  4.2 Concept of Thesaurus Classes
  4.3 Our Proposed Approach to Thesaurus Generation
    4.3.1 Clustering Documents
    4.3.2 Generating Thesaurus Classes from Clusters
  4.4 HISTOGRAM Modules and Thesaurus Building in EUREKA
5 CONCLUSION
REFERENCES
APPENDIX A GLOSSARY OF TERMS

CHAPTER 1 - INTRODUCTION

Advances in computer technology have made on-line full-text retrieval systems an attractive alternative to the conventional means of information retrieval by hand. However, since most such systems allow searchers to use words from a non-controlled vocabulary as search terms, the success of a search depends on a correct choice of words. To be assured of a good recall and/or precision ratio, a searcher has to include the right words together with all their spelling variants, synonyms and similar-meaning phrases - a not too realistic demand. Fortunately, the problem is alleviated to some extent by the iterative nature of on-line retrieval systems and the fact that most retrieval systems do allow some form of prefix, infix, and/or suffix truncation for the search terms.
But often enough, a searcher can only find his desired information with much difficulty and unnecessary delays. At other times, a searcher cannot even converge a search, simply because he cannot come up with enough, or any, alternate or more precise words for his search expression. We believe at least two tools can be used to circumvent this predicament. One is to provide searchers with an on-line thesaurus. The other is to provide a facility which enables searchers to selectively scan meaningful words from the documents retrieved thus far. The purpose of both is to suggest potential clues for search terms. Various work has been done in generating thesauri, but not much work has been done in the other area. While it is still debatable whether any practical and elaborate thesaurus system is easily implementable, a word-scanning facility is comparatively simple to design and implement. Used wisely, it can be a very effective means to accelerate and converge searches. The main design consideration with such a facility is a clever word selection and screening mechanism.

This thesis describes the implementation of such a facility on an experimental on-line retrieval system - Eureka - and discusses its two potential uses. Firstly, it can be used as an effective search aid for on-line searchers to obtain good, discriminating search terms to improve their search strategy. Secondly, it can be used towards automatic or semi-automatic system generation of thesaurus classes consisting of synonyms, spelling variants, similar-meaning phrases and related words. Chapter 2 describes our word scanning system - HISTOGRAM - its purpose, facilities, language design, syntax and usages as an on-line search aid. Chapter 3 describes the systems aspects of HISTOGRAM. Chapter 4 discusses how HISTOGRAM can be used towards automatic or semi-automatic generation of thesauri. Chapter 5 concludes this discussion by suggesting possible areas of further improvement and raising some open-ended questions for future research. For the purpose of this discussion, a list of terms and definitions used in our context is included in Appendix A.

CHAPTER 2 - HISTOGRAM...A WORD-SCANNING FACILITY

2.1 The Need for an Intelligent Word Scanning Facility

On many occasions during an information search, a user would like to be able to look at the contents of the responding documents. By looking at the text, a user can usually tell whether he is going in the right direction in converging his search. If not, he will have saved considerable time and effort by changing his search strategy at an early stage. In addition, being able to glance over the text of the responding documents can make the user aware of other potentially good words to use or add to his search expression, other synonyms or synonymous phrases that he has not considered or has simply overlooked. However, since the number of initial responding documents is usually very large, often on the order of tens or even hundreds, users will be reluctant to look at the actual text of the document set. To take, as an example, a responding document set of 50 documents with an average size of 1,000 words each, the document set would then contain 50,000 words. Obviously, not too many users will have the patience to go through these words. Even if they do, it is doubtful whether any significant information can be extracted, given the amount of built-in noise, duplicate and irrelevant words present.
For this reason, an intelligent word scanning facility with a good word screening and selection mechanism is needed. HISTOGRAM, our version of a word scanning facility, is an implementation that attempts to provide the above-mentioned capabilities.

2.2 HISTOGRAM and the Zipf Distribution

Our design of the HISTOGRAM word screening and selection mechanisms draws on the concept of the Zipf distribution of words in natural English language documents. George Kingsley Zipf, a linguist, made the interesting observation that for any collection of natural English language documents, f, the frequency of occurrence of the n-th ranked type - where a type is a distinct word and rank is the standing of this type in descending order of frequency of occurrence - is approximately equal to k/n, where k is the number of occurrences of the most frequently occurring type. Mathematically, this can be expressed as f = k/n. The general shape of a Zipf distribution curve is shown in Figure 2.1. The area under the curve approximates the number of words in the document set.

[Figure 2.1: The general shape of the Zipf distribution curve - token frequency plotted against type rank]

The non-linearity of the curve suggests two things. The left end of the curve suggests that for any collection of documents, a large percentage of tokens, where a token is a single occurrence of a type, corresponds to only a small percentage of types in the document set. These are the highest ranking types in the set. The right tail of the curve suggests that for any collection of documents, a large number of types have a very low frequency of occurrence, or token frequency, in the set. These are the lower ranking types in the set. As words from both of these groups are either too general or too specific to be of much use as search terms, they are really noise words and can be cut off from the group of words to be displayed to the searcher.

2.3 The HISTOGRAM Screening and Selection Mechanisms

The screening mechanism of HISTOGRAM consists of 2 concurrent processes. The first process is merging, which is a reduction of words to types. An upper bound average reduction ratio of 150 can be achieved from this merging process for the State Statutes data bases. (This figure is arrived at from the fact that the data base has an average token-to-type ratio of 150 [6]. However, since we are only dealing with subsets of the data base at any given time, we would expect the ratio to be much lower.) While this reduction process does not exactly eliminate words off the higher end of the Zipf curve, they are reduced to only a relatively small number of types. Another screening process, done concurrently with the merging, is the screening of the lower-end Zipf curve words from the documents to be merged. This effectively is a low frequency cutoff from the merged document set and further reduces the number of types by a significant percentage. Assuming the upper bound for this percentage to be 10%, the average reduction ratio achieved by these combined processes alone would then be as high as 167. To take our previous example of the 50-document set, the initial HISTOGRAM screening process would then have reduced the number of words to be displayed from 50,000 to as low as 300 words! After the initial screening process, which reduces the number of words to be displayed to a much more manageable size, the user can then choose to manipulate and scan these words selectively with the HISTOGRAM word selection mechanism.
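The combined effect of the two screening processes can be made concrete with a small sketch (written in modern Python purely for illustration - the original PDP-11 implementation works very differently, and the cutoff threshold below is an assumed parameter, not a value from the thesis):

    from collections import Counter

    def screen(documents, low_cutoff=2):
        """Two-stage HISTOGRAM-style screening (sketch).

        documents is a list of token lists, one per document.
        Stage 1 merges duplicate tokens into distinct types,
        summing token frequencies; stage 2 cuts off the
        low-frequency tail of the Zipf curve.
        """
        # Stage 1: merging - reduce words (tokens) to types.
        type_freq = Counter()
        for doc in documents:
            type_freq.update(doc)

        # Stage 2: low-frequency cutoff - drop types occurring
        # fewer than low_cutoff times in the merged set.
        survivors = {t: f for t, f in type_freq.items() if f >= low_cutoff}

        total_tokens = sum(type_freq.values())
        print(f"{total_tokens} tokens -> {len(type_freq)} types "
              f"-> {len(survivors)} types after cutoff")
        return survivors

With the thesis's figures - a token-to-type ratio of 150, and a further 10% of the types removed by the cutoff - 50,000 tokens shrink to roughly 333 types and then to about 300 displayable words, which is the overall ratio of about 167 cited above.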
The HISTOGRAM selection mechanism consists of a statistics table and a windowing facility through which the user can view the words. The statistics table displays distribution statistics of the types with respect to token frequency, and shows how the types within each interval of the token frequency domain are distributed among the documents. By defining token frequency as the frequency of occurrence of a distinct type, and document frequency as the number of documents a particular type appears in, we can interpret this statistics table as a 3-dimensional histogram, as in Figure 2.2.

[Figure 2.2: The statistics table interpreted as a 3-dimensional histogram over the token frequency and document frequency domains]

The histogram on the front is a distribution histogram of the types within a domain of token frequency. The histograms on and parallel to the left side are the distribution histograms of the types within the document frequency domain for the corresponding interval of token frequency on the front plane. The histogram on the front plane is in fact another way of looking at the shape of the Zipf curve of a document set. This different arrangement is more revealing and appropriate for our purpose. The 2 domains of this 3-dimensional histogram, the token frequency and the document frequency, are user-definable. This enables the searcher, in effect, to look at all or zoom into any part of the Zipf distribution curve, with the additional capability of seeing how the types distribute among the documents within intervals of the part examined. With the aid of this table, the user is then able to determine how the types are distributed and which groups of types he wants to see. Then, with the very same parameters, the token frequency and the document frequency, the user can select the types he wants to inspect.

2.4 EUREKA and the HISTOGRAM Subsystem

HISTOGRAM is implemented as a subsystem in EUREKA - an experimental on-line full text retrieval system designed at the University of Illinois by a research group under Prof. David J. Kuck. The system is PDP-11 based, running under a multi-process, multi-user executive [3]. The current hardware configuration is made up of a PDP-11/40, 2 disk drives, 2 CRT's and 1 printer. Presently, the data bases used are the various state statutes. The EUREKA query language is made up of 9 commands in 4 main functional areas as follows:

(1) Finding an arbitrarily complex Boolean search expression from a defined document set with/without context specification.
(2) Printing selected portions of documents and information about preceding queries.
(3) Auxiliary functions such as defining and deleting document sets, query sets, macros and comments.
(4) Logon, logoff.

A thorough description of the EUREKA system and query language can be found in [4].

2.5 Language Design and Description

To be in line with the language design of EUREKA, the HISTOGRAM commands are simple, few, but effective. Apart from the entry and exit commands, there are only 4 HISTOGRAM commands. The syntax of the commands is straightforward and minimal; no verbose, English-like commands nor exhaustive options exist to confuse the users.

In the description of the HISTOGRAM commands that follows, the following notations for syntax specification are used:

(1) Capital letters or commas are the characters to be typed as part of the command.
(2) An underscored character means the character is mandatory for the command.
(3) Words in small letters represent parameters whose values are to be supplied by the user.
(4) A set of parallel entries enclosed by 2 vertical bars means any, but only one, of the entries applies.

2.5.1 HIST Command

Syntax: HISTOGRAM

Function: Entry from EUREKA to the HISTOGRAM subsystem.

2.5.2 MERGE Command

Syntax: MERGE | eureka document set name |
              | LAST                     |
              | [ document number list ] |

- eureka document set name is any EUREKA defined document set
- LAST is the last document set defined in EUREKA
- document number list is a user defined list of document numbers in the data base, separated by commas with no embedded blanks

Function: Produces a resultant merged file of alphabetically ordered types and statistics from the document set. The completion of the merge process is signalled by the system prompt character '.'. A maximum of only 1 resultant merged file exists for a user at any given time. Any subsequent merge will erase the previous resultant merged file. Any merge command that requests a non-existent document set or a non-existent document will be ignored and no error message will be issued. However, the previously saved resultant merged file, if any exists, will not be scratched. Error messages will be issued for the other invalid requests.

Examples: ME [1,7,23,15] (Merges documents 1, 7, 23 and 15.)
          ME LAST (Merges the last EUREKA defined document set.)

2.5.3 STATS Command

Syntax: STATS [ltf,utf,ldf,udf]

- ltf is the user specified lower bound for the token frequency
- utf is the user specified upper bound for the token frequency
- ldf is the user specified lower bound for the document frequency
- udf is the user specified upper bound for the document frequency

Default: ltf - 1
         utf - 65535
         ldf - 1
         udf - the total number of documents in the data base

Substitution of the default is implied by the absence of an operand or operands between commas, or by a premature ']'. No embedded blanks are allowed.

Function: The STATS command displays a distribution statistics table of all the types in the merged file within the user-specified domains of token and document frequencies. The table is an 8*8 table with intervals scaled on the two domains. By wisely varying the upper and lower bounds of these parameters, the user has in effect a zoomable viewing device on the distribution statistics of the types. The command is ignored with invalid domains or a non-existent merged file.

Display: Figure 2.3 is an example of the statistics distribution table displayed by STATS. Token frequency, whose domain is bounded by the user-specified values or through default, is scaled into 8 equal integral intervals, with the last interval possibly extended or truncated to the upper bound. The document frequency domain is similarly scaled. Each entry under the column labelled 'TYPES' shows a count of the types occurring in the token frequency interval specified in the rightmost column of each row. The 'TOTAL' displayed at the bottom of the 'TYPES' column shows the total number of types in the specified token frequency domain. The entries in the columns between the leftmost and rightmost ones show the occurrence frequencies of types in the specific document frequency interval.

Example: In response to the command STATS [1,380,1,15], HISTOGRAM will produce a display similar to the one in Figure 2.3.

    TYPES    DOC FREQ:  2    4    6    8   10   12   14   15   TOKEN FREQ
     231               30   40   35   50   45   12   12    7      1- 50
     120                5   20   30   27   23    5    4    6     51-100
     102               12    8    6   12   12   16   24   12    101-150
      70               20   10   11    9    3    7    4    6    151-200
      52                1    1   12   10   18    3    4    3    201-250
      32                7    8    7    4    6                   251-300
      18                1    3    3    7    4                   301-350
      10                4    2    2    2                        351-380

    NO. OF TYPES IN THE MERGED FILE: 404

    FIGURE 2.3
The display is a statistics distribution table for types in the merged file with token frequency from and including 1 to 380 and document frequency from and including 1 to 15. As an example, row 5 in the table would mean the following: there are 70 types altogether that occur between 151 and 200 times inclusive. Among the types, 20 occur in 1 to 2 documents, 10 in 3 to 4 documents, 11 in 5 to 6 documents, 9 in 7 to 8 documents, 3 in 9 to 10 documents, 7 in 11 to 12 documents, 4 in 13 to 14 documents and 6 in 15 documents, summing up to 70.

2.5.4 LIST Command

Syntax: LIST [ltf,utf,ldf,udf]

- ltf, utf, ldf, udf as in 2.5.3

Default: as in 2.5.3

Function: Lists the types in the merged documents that satisfy the user requested thresholds on frequencies.

Display: Selected types are displayed four in a row, alphabetically.

Examples: (1) L [5,10,4,4] (Lists types that occur from 5 to 10 times in the documents and in only 4 documents.)
          (2) L [,200] (Lists types that occur from 1 to 200 times in all the documents of the data base.)

2.5.5 SEARCH Command

Syntax: SEARCH 'character string'

Function: The search command initiates a search through the merged file for the type specified in the operand. If a hit is made, the type's token and document frequencies in the documents merged previously will be displayed along with the text of the type. If not found, a message to that effect will be generated.

Examples: (1) SE 'JUNK' (Searches for the type 'JUNK'.)
              JUNK [132,4] (The type 'JUNK' occurs 132 times in 4 documents in the document set.)
          (2) SEARCH 'BIGFOOT'
              BIGFOOT NOT FOUND

2.5.6 EXIT Command

Syntax: | EX |
        | ED |

Function: Exit from HISTOGRAM to EUREKA. EX is an exit with the merged file saved for subsequent use. ED is an exit with the merged file scratched.

2.6 Usages of HISTOGRAM as a Search Aid

There are many ways HISTOGRAM can be used to assist a searcher in converging his search. By using the SEARCH command, a searcher can evaluate the usefulness of the terms in his search expression. Words that have relatively high token and document frequencies are probably too general to be good search terms. Likewise, words with relatively low token and document frequencies are probably too specific to be of any use. On the other hand, words with a significant token frequency and a not too high document frequency are words that appear many times in a small number of documents, and these are potentially good terms to converge a search. By using the STATS command, a user can feel out the general characteristics of the document set before he sets out to look selectively at the words. From the statistics table, one can get an idea of how the documents in the document set relate to each other. If a significant number of the medium token frequency words appear in a great percentage of the documents, the documents within the set are probably closely related to some common topics. On the other hand, if a significant number of medium token frequency words appear with a small document frequency, the document set is probably made up of clusters of related documents. However, when the medium token frequency types are quite evenly distributed, it could mean that we have a set of unrelated documents on our hands and should probably re-evaluate our search strategy. As a matter of fact, there are many ways that the statistics can be interpreted and used to help a searcher.
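These rules of thumb can be summarized as a small classifier. The following sketch is illustrative only; the thresholds are assumptions of this sketch, not values from the thesis - in practice a searcher would pick them after inspecting the STATS table for the merged set:

    def judge_term(token_freq, doc_freq, tf_hi=200, tf_lo=5, df_hi=10):
        """Classify a candidate search term by the section 2.6
        heuristic. All thresholds are illustrative assumptions."""
        if token_freq >= tf_hi and doc_freq >= df_hi:
            return "too general"         # frequent nearly everywhere
        if token_freq <= tf_lo:
            return "too specific"        # too rare to help a search
        if doc_freq < df_hi:
            return "good discriminator"  # frequent in few documents
        return "inspect further"

With these assumed thresholds, the type 'JUNK' from the SEARCH example above (132 occurrences in 4 documents) would be classified as a good discriminator: judge_term(132, 4) returns "good discriminator".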
Once the user has an idea of what scope of words he wants to see, he can select them for inspection with the proper parameters in the LIST command. Though generally words with medium token frequencies are our main focus of attention, it is certainly conceivable that other words could be useful on other occasions. To summarize, what we have cited are only a few instances of how we think HISTOGRAM could be used, and they are definitely not exhaustive. Different goals and stages in our search might call for different approaches to using HISTOGRAM. Hopefully, with practice and experience, users can find new ways of interpreting and using HISTOGRAM to improve their search strategy.

CHAPTER 3 - HISTOGRAM DESIGN AND DESCRIPTION

3.1 Design Philosophy

HISTOGRAM was designed as an autonomous subsystem of EUREKA with a minimal number of interface paths. As it is experimental in nature, one of our main design concerns was to make the code straightforward and easy to modify and maintain. To this effect, the technique of modular programming was used extensively. The HISTOGRAM subsystem is broken down into command processing modules, and each command processing module is in turn further granularized into functional units of smaller programming entities in the form of macros and subroutines. Every function that might be changed, modified, replaced or tuned in the future is programmed as an independent unit so that any change can be made with minimal effort and disturbance. While this extensive modularizing tends to impose some additional CPU overhead, this should not affect EUREKA and HISTOGRAM performance, as HISTOGRAM itself is an I/O bound procedure and EUREKA, from measurements done by Milner [3], is only using 50% of its CPU cycles anyway.

3.2 Functional Overview

HISTOGRAM is made up of 5 modules. Except for the dispatching module, HSTGRM, which communicates with the various command modules, there exists no direct interaction among command modules. HISTOGRAM itself only communicates with EUREKA in three situations - entry, exit and the passing of the EUREKA user logon block pointer in order to obtain a EUREKA defined document set list. Briefly, the 5 HISTOGRAM command modules perform the following:

(1) HSTGRM - Interfacing with the EUREKA user, the EUREKA system and the various HISTOGRAM command modules.
(2) MGR - Merging documents: the process of reducing duplicate words to distinct words and summing up their token and document frequencies while sorting them in alphabetical order.
(3) STATS - Building and displaying, according to user-defined domains, a statistical table for the merged document set.
(4) LISTER - Displaying words of the merged document set according to user-specified parameters.
(5) SRCHER - Showing distribution statistics of any specific word within the merged document set.

[Figure 3.1: HISTOGRAM functional overview - control and data flow among EUREKA, HSTGRM, MGR, STATS, LISTER and SRCHER, from the statute files to the merged file]

Basically, what HISTOGRAM does is that, given any user requested document set, it will input already-existing statistics files corresponding to each of these documents. Through the merger, it then will merge them and at the same time strip the intermediate files generated during the merge, and the resulting merged file, into a simpler format. The resulting merged file is then input and acted upon by other HISTOGRAM modules whenever required.
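The merge step at the heart of MGR (described in detail in section 3.4.2 below) can be sketched as follows. This is a minimal illustration, assuming entries are already in memory as (token, token frequency, document frequency) tuples; the real module's file handling, buffering and low-frequency cutoff are omitted:

    def merge2(a, b):
        """2-way merge of two alphabetically ordered entry lists.

        On unequal tokens the alphabetically smaller entry is
        emitted; on equal tokens the statistics are summed and a
        single entry is put out, mirroring the MGR 2-way merger.
        """
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            ta, fa, da = a[i]
            tb, fb, db = b[j]
            if ta < tb:
                out.append(a[i])
                i += 1
            elif tb < ta:
                out.append(b[j])
                j += 1
            else:
                # Same type in both runs: add statistics, emit once.
                out.append((ta, fa + fb, da + db))
                i += 1
                j += 1
        out.extend(a[i:])  # drain whichever run is left over
        out.extend(b[j:])
        return out

If each single-document run starts with a document frequency of 1 per entry, repeated 2-way merges leave each type carrying its total token frequency and its document frequency across the whole set.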
3.3 Files of HISTOGRAM

The input statistics files and the output merged files of HISTOGRAM are made up of entries, each containing a variable-length token text and tagged-along statistical data. These entries are ordered alphabetically. Entries in the merged file and the intermediate scratch files have the same format. Each entry contains a token text preceded by a 1-byte text length count, followed on an even boundary by a double-word token frequency count and a one-word document frequency count. The input statistics files have essentially the same format, without the document frequency count but with some other irrelevant data tagged on after the frequency count (Fig. 3.2). The input statistics files are really read-only files, while the merged file can be saved, scratched or overwritten by another merge operation. The intermediate files generated are work files and are deleted after every 2-way merge.

[Figure 3.2: Entry formats - statute file entries: 1-byte length count, token text, double-word frequency count, other data; generated file entries: 1-byte length count, token text, double-word token frequency count, single-word document frequency count]

3.4 Description of the Command Modules

3.4.1 HSTGRM

HSTGRM, the interfacing module between the user, EUREKA and the HISTOGRAM modules, is a command interpreter and control dispatcher for the subsystem. It is entered from EUREKA and re-entered from the HISTOGRAM command modules when they are finished with the requested command. On user request to exit, it jumps to a section of code that handles the exit option of keeping or deleting the merged file. The parameters that it passes to the command modules are the EUREKA logon block pointer and the command buffer pointer. No condition flags or data are passed back when control is returned from the command modules.

3.4.2 MGR

The merging module, MGR, consists of a driver, a 2-way merger, an input handler and an output handler. The 2-way merger is a very specialized routine. It merges by pulling entries off and placing the results at pre-designated locations, letting the input and output handlers worry about the actual I/O and the tedious bookkeeping involved. In merging the entries, it takes the smaller of the two if the tokens are unequal. If they are equal, it adds the statistics, puts only one entry out and discards the other. The input handler performs the actual I/O in retrieving the entries from the disk files, does buffer management, updates pointers to the current entries in buffers, moves entries from buffers to the appropriate fixed locations, does low frequency word cutoffs in the first pass, and checks for and acts accordingly on I/O error conditions and end-of-file conditions. The output handler is comparatively simpler. It moves entries from the fixed location to the output buffer, updates the pointer and outputs if the buffer is full.

[Figure 3.3: The MGR module - the driver controls the 2-way merger; the input handler feeds entries from the statute files and the output handler writes the merged file]

The driver itself is the heart of the MGR module. It emits file names to the input handler for every 2-way merge and sequences the overall merging activities at the document level. Initially, it builds the list of file names to be merged by interpreting the EUREKA passed document set list or parsing the user-defined document set list, depending on how the merge is requested. It then picks out these names 2 at a time and emits them to the input handler for the 2-way merge. It also generates file names for all the intermediate scratch files.
The scratch files are deleted as soon as they are entirely absorbed by the 2-way merger. Only the final merged file is not deleted automatically. As for the sequencing function, in addition to passing file names for the merge, the driver is also able to recognize the various situations of whether it is the first pass of the merge or not and whether the number of documents in that pass is even or odd. Different combinations of these call for different actions and settings of flags in order to optimize control flow and/or simplify coding.

3.4.3 STATS

The STATS command module is responsible for building the statistical table on the merged file for the user-requested domains. When passed control by HSTGRM, the subsystem dispatcher, STATS checks whether the user-specified parameters are valid. These parameters are the user requested upper and lower bounds of the domains of token and document frequencies. They are considered valid if they are numbers and the upper bound is greater than the lower bound. It then builds an internal statistical table equivalent to the one to be displayed. The domains are scaled into 8 equal integral intervals whenever possible, the exceptions being the cases when the domain itself is not large enough to be split into 8 integral intervals and/or the last interval is extended or truncated to the upper bound of that domain. Each slot in this usually 8*8 table corresponds then to a particular interval of document frequency and token frequency. The slots are used as counters and are initialized to zeroes. STATS then builds a 2-level directory that maps entries into the appropriate slots to increment the counters. When this is done, the STATS module simply walks through the entries in the merged file with a relatively simple input handler and updates the appropriate counters whenever tokens within the requested domains are found. On end of file, STATS proceeds to format the internal table into a displayable form by doing the necessary editing and conversions. It then displays the table to the user. The statistics table that it builds is not retained, and every new STATS command causes another round of table building. The merged file, however, remains intact and is only overwritten after another MERGE command.

3.4.4 LISTER

LISTER enables users to look at words in the merged documents that are within the user's specified threshold values of document and token frequencies. These threshold values do not have to coincide with intervals displayed in the statistical table at all. Again, this module simply walks through the entries in the merged file and picks the ones that satisfy the thresholds. It then moves these tokens to the terminal buffer, edits them and outputs them whenever the output buffer is full.

3.4.5 SRCHER

SRCHER does essentially the same walk-through process as LISTER does, except that it is looking for just one particular entry with the specified token. When a hit is made, SRCHER converts the tagged-on token and document frequencies into character form and displays the token and its statistics. When end of file or an alphabetically larger token is reached, SRCHER gives up and informs the user. At present, SRCHER can only handle one token per search request. But with additional code that can parse multiple entries in the search command operand and sort these tokens, the module would be able to accept multiple tokens per request without any further modification to the walk-through itself.
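Before leaving the command modules, the interval scaling and one-pass counting at the core of STATS (section 3.4.3) can be sketched as follows. The rounding rule is an assumption of this sketch - the thesis only says the domains are scaled "into 8 equal integral intervals whenever possible", and Figure 2.3 suggests the real system also favors round interval widths such as 50:

    import math

    def scale(lo, hi, n=8):
        """Split the integer domain [lo, hi] into n equal integral
        intervals where possible; the last interval is extended or
        truncated to the upper bound. Returns each interval's upper
        bound. Ceiling division is an assumed rounding rule."""
        width = max(1, math.ceil((hi - lo + 1) / n))
        bounds = [min(lo + (k + 1) * width - 1, hi) for k in range(n)]
        bounds[-1] = hi  # extend or truncate the last interval
        return bounds

    def build_table(entries, ltf, utf, ldf, udf):
        """Fill the 8*8 counter table in one walk over the merged
        file's (token, token_freq, doc_freq) entries, as STATS does."""
        tf_bounds, df_bounds = scale(ltf, utf), scale(ldf, udf)
        table = [[0] * len(df_bounds) for _ in tf_bounds]
        for _token, tf, df in entries:
            if ltf <= tf <= utf and ldf <= df <= udf:
                row = next(k for k, ub in enumerate(tf_bounds) if tf <= ub)
                col = next(k for k, ub in enumerate(df_bounds) if df <= ub)
                table[row][col] += 1
        return table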
CHAPTER 4 - THESAURUS GENERATION

4.1 Conventional Methods of Thesaurus Generation

The generation of on-line thesauri for full-text retrieval systems has received considerable attention in recent years. Thesauri can be generated manually, semi-automatically or automatically. Manual thesaurus generation is a tedious task which involves subjective judgement and is inflexible to changes in the data base. Automatic and semi-automatic thesaurus generation, on the other hand, are more objective, faster and more flexible to changes in the data base, provided reasonably good generation algorithms are used. One problem, though, with automatic and semi-automatic thesaurus generation is that, regardless of the generation algorithm used, a tremendous amount of computing resource in terms of processor storage, secondary storage and computing time is required. Simply put, the problem with automatic and semi-automatic thesaurus generation is that no simple and effective generation algorithm yet exists.

4.2 Concept of Thesaurus Classes

Before we proceed any further, we need to clarify our notion of thesaurus classes. We believe thesaurus classes should not only contain synonyms, but also words that are related but not synonymous in any sense of the word. Words that appear in the same phrases, or words that are always mentioned together, should also be considered and included in our thesaurus classes. As an example, consider the following group of words - 'birth', 'control', 'prevention', 'abortion' and 'pregnancy'. While certainly not all of these words would be included in a conventional thesaurus group, they all pertain to a common topic. For a searcher who is looking for information on birth control, it would be helpful to know of these other possible words he can use as search terms. We believe that, as an on-line search aid, this extended concept of thesaurus groups of synonyms as well as related words serves a better purpose, and we should therefore aim at such thesaurus classes in generating thesauri.

4.3 Our Proposed Approach to Thesaurus Generation

As far as the generation process is concerned, we outline here a 2-step approach. The first step involves clustering the documents in the data base. The idea behind clustering is that documents in a data base can be broken up into smaller clusters of documents. Documents within a cluster are loosely related to some common topics. While a document can belong to more than one of these clusters, for all practical purposes, clusters are independent entities with no inter-relationship to each other whatsoever. As an example, a document on birth control obviously belongs to a different cluster from a document on highway safety regulations. Generation of a thesaurus then should be from these subsets of the data base rather than from the entire data base itself. There are two advantages. Obviously, by reducing the number of documents to a manageable size, automatic or semi-automatic generation of a thesaurus should be a lot easier. In addition, generating a thesaurus from documents within a cluster filters out noise that would otherwise be introduced from unrelated documents outside of the cluster. As an example, the words 'law' and 'order' would probably be put into the same thesaurus group for a cluster of documents on law and order, while the words 'order' and 'magnitude' would probably be put into another thesaurus group in a cluster of documents on computer performance; but the words 'law' and 'magnitude' certainly do not belong together.
4.3.1 Clustering Documents

To cluster documents, we can make use of the concept of document vectors, where a document vector is a list of distinct words which characterizes and distinguishes a document from the others. Looking at the Zipf curve of any particular document, we know that words under either end of the curve are noise words of very little significance. Intuitively, as we approach the inside of the curve from both ends, we find more and more significant words. If some kind of significance level curve could be plotted onto the Zipf curve, it would probably resemble the one in Figure 4.1.

[Figure 4.1: A significance level curve superimposed on the Zipf distribution curve]

The idea of defining a document vector then is to define low cutoff and high cutoff threshold values of token frequency so that we can obtain a set of words large enough to characterize a document but small enough for later manipulation in the clustering and thesaurus generation processes. Since the threshold values are absolute quantities which will vary with document sizes, to design any general algorithm to determine threshold values, we have to relate them to relative quantities such as distribution percentiles.

4.3.2 Generating Thesaurus Classes from Clusters

Once we have built clusters, and thereby reduced the number of documents and the number of words to be practical for thesaurus generation, a lot of the existing statistical correlation techniques to generate a thesaurus become practical and feasible to implement. One such technique is the generation of thesaurus classes by a co-occurrence matrix. The co-occurrence matrix is a table of 'd' rows and 't' columns, where 'd' is the number of documents in the data base concerned and 't' is the number of types in the data base. Each column in the table corresponds to a type and each row corresponds to a document in the data base. A '1' in the i-th row, j-th column entry indicates the presence of the j-th column type in the i-th row document, and a '0' indicates an absence of the type in the document. The matrix is then used as a tool to determine what types co-occur often enough in the same documents to be considered within the same thesaurus classes.

4.4 HISTOGRAM Modules and Thesaurus Building in EUREKA

The above section only gives an overall outline of an approach to thesaurus generation. A detailed plan for the implementation requires further investigation and effort. However, at this point, it is clear that a lot of the existing HISTOGRAM modules can lend themselves to use regardless of the final plan adopted. The merging module of HISTOGRAM, for example, can be used to strip the statistics files into a format acceptable to the STATS module. The STATS module can then compile distribution statistics. These statistics can be used as a criterion by some general algorithm to generate document-dependent high and low cutoff values.
By experimenting with different cutoff values to vectorize documents and different correlation techniques for clustering, we hopefully will arrive at some satisfactory algorithms in generating clusters that are small. Once we have good ways of producing small clusters, we can try out various correlation techniques with our "co-occurrence matrix" method of thesaurus generation. Again through experimentation, hopefully we would come up with a good scheme of generating a thesaurus. At such a time, we can then use the existing thesaurus feature in EUREKA to actually build the thesaurus system[5]. It has yet to be investigated on how feasible and how much effort 41 would be involved to interface the HISTOGRAM modules with this thesaurus feature in order to be able to generate thesaurus classes and build the theasurus system completely automatically. However, at the very least, our generated thesaurus classes can be entered into EUFEKA manually to produce an on-line thesaurus. H2 CHAPTER 5 - CONCLOSION We have described a facility that attempts to improve users' performance on on-line searches. The facility is experimental in nature. Our initial objective was to get the facility operational and then do whatever necessary tunings and modifications afterwards. It is believed that ways for improvements and modifications, in both system performance and user interfacing, will become clear as more data and user feedback are available once the system is operational. As can be seen now, there are several areas for maior improvements. The major performance problem with the current HISTOGRAM system is the I/O bottleneck. This bottleneck is mainly caused by the large number of arm movements necessary for every 2-way merge. Two things can be done in directly minmizing and optimizing arm movements. Placing the statistics files and intermediate scratch files close together and/or close to the center of the disk would minimize arm movement time. Also, an arm movement optimizing scheme in the choosing of input and output files for merging would greatly reduce unnecessary arm movement delays. As a matter of fact, the latter is to a certain 43 extent implicit with the current merging algorithm. If a user requests a merge for a EUREKA defined document set, the document number list passed by EUREKA to the HISTOGRAM merger is already arranged in ascending document numbers, thus by picking them 2 at a time from the start of the list, the merger is picking the closest possible pair of documents for every merge. Significant performance improvement can also be achieved by reducing the volume of I/O for the merge. This can be done by shrinking the size of the statistics files and the intermediate scratch files. One way to do so would be to eliminate the irrelevant statistical information in the original statistics files. They do occupy a significant percentage of the file space and are not useful information for our purpose. As the first pass merge I/O constitutes more than half the total volume of I/O, reducing the size of these initial statistics files would significantly affect the system performance. Another way worth considering is to encode the variable length token text in every entry into a 2-byte code. This would reduce the size of the statistics files and the intermediate scratch files by a substantial percentage. As a matter of fact, by encoding words with an implicit alphabetical ordering, comparision of tokens can be reduced to a trivial operation with the merging process greatly simplified. 
The actual decoding of the words does not have to be done till the point where the user selects the words to see, and even then, the number 44 of words to be decoded would generally be only on the order of tens, or hundreds. The merging process is another potential area of improvement. We have a tentative low cutoff mechanism for every merge in the fisrt pass, but we have yet to introduce a high cutoff scheme in this pass. The reason that we have not done so is the difficulty in determining a threshold value for the high cutoff. To a much greater extent, the high cutoff threshold would vary with the size of the documents. It is not obvious at this point what good method could be used to filter out the high fregeuEcy types and at the same time not imposing so much overhead that would defeat the whole purpose of a high cutoff. Introducing a stop list of the usual high frequency words like •the', 'a', etc. would help but does not seem to be very general and flexible. The merging algorithm can be improved also. At present, merging of the statistics files for the documents is done by 2-way merges and intermediate files are generated for each merge. The generation of files hampers system performance as the openning and closing of files are slow processes. One possible way to avoid having to open and close files so frequently is to use a single intermediate file for all the merges within the same pass rather than using 1 intermediate file per 2-way merge. This would, however, require a much more complicated bookeeping 45 algorithm than the current one and restructuring of some of the systems modules concerned might be necessary. One final point that deserves attention is that we have talked about vectorizing documents. Our objective is to use as few words as possible to characterize a document's contents. A mere high and low cutoff as discussed earlier might not be sufficient for our purpose. Maybe some additional screening using weighing factors based on a type's token and document freguencies would help. This information is readily available from HISTOGRAM. As mentioned earlier, HISTOGRAM is meant to be a dynamic system. It is hoped that through experimentation with this facility, solutions to some of the problems encountered in on-line retrieval systems can be found. 46 REFERENCES [1] Lancater F.W., "Information Retrieval Systems Characteristics, Testing and Evaluation", New York, Wiley 1968. [2] Lancaster F.W. and Fayen E.G., "Information Retrieval On-Line", Los Angeles, Calif.: Melville Publishing Company, 1973. [3] Milner, J. M,, "A Multiprocess, Multiuser Executive for an Experimental Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 75-736, 1975. [4] Morgan, J. K. , "Description of an Experimental On-line Minicomputer Based Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 76-779, Febraury 1976. [5] Morgan, T.J., "A Thesaurus Feature for the Eureka Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 77-855, 1977. 77-855, 1977. £6] Rinewalt, J.R., "Evaluation of Selected Features of the Eureka Full-text Information Retrieval System", Ph.D. Thesis, University of Illinois Department of Computer Science Report Number 76-823, 1976. H7 APPENDIX A - GLOSSARY OF TERHS recall ratio: the ratio of retrieved relevant documents to the total number of relevant documents. 
REFERENCES

[1] Lancaster, F.W., "Information Retrieval Systems: Characteristics, Testing and Evaluation", New York: Wiley, 1968.
[2] Lancaster, F.W. and Fayen, E.G., "Information Retrieval On-Line", Los Angeles, Calif.: Melville Publishing Company, 1973.
[3] Milner, J.M., "A Multiprocess, Multiuser Executive for an Experimental Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 75-736, 1975.
[4] Morgan, J.K., "Description of an Experimental On-line Minicomputer Based Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 76-779, February 1976.
[5] Morgan, T.J., "A Thesaurus Feature for the Eureka Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 77-855, 1977.
[6] Rinewalt, J.R., "Evaluation of Selected Features of the Eureka Full-text Information Retrieval System", Ph.D. Thesis, University of Illinois Department of Computer Science Report Number 76-823, 1976.

APPENDIX A - GLOSSARY OF TERMS

recall ratio: the ratio of retrieved relevant documents to the total number of relevant documents.
precision ratio: the ratio of retrieved relevant documents to the total number of retrieved documents.
token: any unbroken string of alphanumeric characters.
type: a distinct token.
token frequency: the number of times a token repeats itself.
document frequency: the number of documents a type appears in.
type rank: the standing of a type in a document set according to its token frequency. The lowest ranked type in a document set always has the highest token frequency in the set.
query: a search request.
query language: the repertoire of commands in a retrieval system.
search term: a word used in a search request.
search expression: a syntactically correct combination of the desired contents of documents to be retrieved from the data base.
responding documents: the retrieved documents in a search.
document set: a collection of documents.
thesaurus classes: groups of related words, not necessarily synonymous.

BIBLIOGRAPHIC DATA SHEET

1. Report No.: UIUCDCS-R-78-914
4. Title and Subtitle: A UNIQUE WORD-SCANNING FACILITY FOR THE EUREKA FULL-TEXT INFORMATION RETRIEVAL SYSTEM
5. Report Date: January 1978
7. Author(s): William Ming-Cheong Leung
8. Performing Organization Report No.: UIUCDCS-R-78-914
9. Performing Organization Name and Address: University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois 61801
11. Contract/Grant No.: US NSF MCS73-07980
12. Sponsoring Organization Name and Address: National Science Foundation, Washington, D.C.
13. Type of Report & Period Covered: Master's Thesis
16. Abstracts: This thesis describes a word screening and scanning feature in the experimental on-line full text retrieval system, EUREKA. Given any specified collection of documents in the database, the feature can reduce the words within the set to distinct occurrences, screen out insignificant words, provide distribution statistics of the words, and, based on these statistics, enable users to look at alphabetical listings of these words selectively. This feature saves users considerable effort in converging their searches. In addition, the feature can serve as a potential tool for automatic thesaurus generation in EUREKA. By using it to generate word-vectors which characterize documents, documents of the database can be clustered to workable sizes for generating thesauri by statistical methods.
17. Key Words and Document Analysis. 17a. Descriptors: Thesaurus generation; Word scanning and screening; Zipf distribution
18. Availability Statement: Release Unlimited
19./20. Security Class (This Report / This Page): UNCLASSIFIED
21. No. of Pages: 51