Report No. UIUCDCS-R-78-914 / UILU-ENG 78 1704
NSF-OCA-MCS73-07980-000032

A UNIQUE WORD-SCANNING FACILITY FOR THE EUREKA FULL-TEXT INFORMATION RETRIEVAL SYSTEM

by

William Ming-Cheong Leung

January 1978

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

This work was supported in part by the National Science Foundation under Grant No. US NSF MCS73-07980 and was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, January 1978.

ACKNOWLEDGMENT

Various people have contributed to the successful completion of this thesis. To them, I would like to express my appreciation. My special thanks go to Professor David J. Kuck, my thesis advisor, for his guidance, suggestions and, most of all, his patience with me. I would also like to thank Perry Emrath for his invaluable assistance throughout this project, John Keith Morgan for helping me with the Eureka internals, and Fon Krupp, my supervisor, for his understanding during the final weeks of this preparation.

TABLE OF CONTENTS

1 INTRODUCTION
2 HISTOGRAM...A WORD-SCANNING FACILITY
  2.1 The Need for an Intelligent Word Scanning Facility
  2.2 HISTOGRAM and the Zipf Distribution
  2.3 The HISTOGRAM Screening and Selection Mechanisms
  2.4 EUREKA and the HISTOGRAM Subsystem
  2.5 Language Design and Description
    2.5.1 HIST Command
    2.5.2 MERGE Command
    2.5.3 STATS Command
    2.5.4 LIST Command
    2.5.5 SEARCH Command
    2.5.6 EXIT Command
  2.6 Usages of HISTOGRAM as a Search Aid
3 HISTOGRAM DESIGN AND DESCRIPTION
  3.1 Design Philosophy
  3.2 Functional Overview
  3.3 Files of HISTOGRAM
  3.4 Description of the Command Modules
    3.4.1 HSTGRM
    3.4.2 MGR
    3.4.3 STATS
    3.4.4 LISTER
    3.4.5 SRCHER
4 THESAURUS GENERATION
  4.1 Conventional Methods of Thesaurus Generation
  4.2 Concept of Thesaurus Classes
  4.3 Our Proposed Approach to Thesaurus Generation
    4.3.1 Clustering Documents
    4.3.2 Generating Thesaurus Classes from Clusters
  4.4 HISTOGRAM Modules and Thesaurus Building in EUREKA
5 CONCLUSION
REFERENCES
APPENDIX A GLOSSARY OF TERMS

CHAPTER 1 - INTRODUCTION

Advances in computer technology have made on-line full-text retrieval systems an attractive alternative to the conventional means of information retrieval by hand. However, since most such systems allow searchers to use words from a non-controlled vocabulary as search terms, the success of a search depends on a correct choice of words. To be assured of a good recall and/or precision ratio, a searcher has to include the right words together with all their spelling variants, synonyms and similar-meaning phrases - a not too realistic demand. Fortunately, the problem is alleviated to some extent by the iterative nature of on-line retrieval systems and the fact that most retrieval systems do allow some form of prefix, infix, and/or suffix truncation for the search terms.
But often enough, a searcher can only find his desired information with much difficulty and unnecessary delays. At other times, a searcher cannot even converge a search, simply because he cannot come up with enough, or any, alternate or more precise words for his search expression. We believe at least two tools can be used to circumvent this predicament. One is to provide searchers with an on-line thesaurus. The other is to provide a facility which enables searchers to selectively scan meaningful words from the documents retrieved thus far. The purpose of both is to suggest potential clues for search terms. Various work has been done in generating thesauri, but not much work has been done in the other area. While it is still debatable whether any practical and elaborate thesaurus system is easily implementable, a word-scanning facility is comparatively simple to design and implement. Used wisely, it can be a very effective means to accelerate and converge searches. The main design consideration with such a facility is a clever word selection and screening mechanism.

This thesis describes the implementation of such a facility on an experimental on-line retrieval system - Eureka - and discusses its two potential uses. Firstly, it can be used as an effective search aid for on-line searchers to obtain good, discriminating search terms to improve their search strategy. Secondly, it can be used towards automatic or semi-automatic system generation of thesaurus classes consisting of synonyms, spelling variants, similar-meaning phrases and related words. Chapter 2 describes our word scanning system - HISTOGRAM - its purpose, facilities, language design, syntax and usages as an on-line search aid. Chapter 3 describes the systems aspects of HISTOGRAM. Chapter 4 discusses how HISTOGRAM can be used towards automatic or semi-automatic generation of thesauri. Chapter 5 concludes this discussion by suggesting possible areas of further improvement and raising some open-ended questions for future research. For the purpose of this discussion, a list of terms and definitions used in our context is included in Appendix A.

CHAPTER 2 - HISTOGRAM...A WORD-SCANNING FACILITY

2.1 The Need for an Intelligent Word Scanning Facility

On many occasions during an information search, a user would like to be able to look at the contents of the responding documents. By looking at the text, a user can usually tell whether he is going in the right direction in converging his search. If not, he will have saved considerable time and effort by changing his search strategy at an early stage. In addition, being able to glance over the text of the responding documents can make the user aware of other potentially good words to use or add to his search expression, other synonyms or synonymous phrases that he has not considered or has simply overlooked. However, since the number of initial responding documents is usually very large, often on the order of tens or even hundreds, users will be reluctant to look at the actual text of the document set. To take, as an example, a responding document set of 50 documents with an average size of 1,000 words each, the document set would then contain 50,000 words. Obviously, not too many users will have the patience to go through these words. Even if they do, it is doubtful whether any significant information can be extracted, given the amount of built-in noise, duplicate and irrelevant words present.
For this reason, an intelligent word scanning facility with a good word screening and selection mechanism is needed. HISTOGRAM, our version of a word scanning facility, is an implementation that attempts to provide the above-mentioned capabilities.

2.2 HISTOGRAM and the Zipf Distribution

Our design of the HISTOGRAM word screening and selection mechanisms draws on the concept of the Zipf distribution of words in natural English language documents. George Kingsley Zipf, a linguist, made the interesting observation that for any collection of natural English language documents, f, the frequency of occurrence of the n-th ranked type - where a type is a distinct word and rank is the standing of this type in descending order of frequency of occurrence - is approximately equal to k/n, where k is the number of occurrences of the most frequently occurring type. Mathematically, this can be expressed as f = k/n. The general shape of a Zipf distribution curve is shown in Figure 2.1. The area under the curve approximates the number of words in the document set.

[Figure 2.1: The general shape of the Zipf distribution curve - token frequency plotted against type rank]

The non-linearity of the curve suggests two things. The left end of the curve suggests that for any collection of documents, a large percentage of tokens, where a token is a single occurrence of a type, corresponds to only a small percentage of types in the document set. These are the highest ranking types in the set. The right tail of the curve suggests that for any collection of documents, a large number of types have a very low frequency of occurrence, or token frequency, in the set. These are the lower ranking types in the set. As words from both of these groups are either too general or too specific to be of much use as search terms, they are really noise words and can be cut off from the group of words to be displayed to the searcher.

2.3 The HISTOGRAM Screening and Selection Mechanisms

The screening mechanism of HISTOGRAM consists of 2 concurrent processes. The first process is merging, which is a reduction of words to types. An upper bound average reduction ratio of 150 can be achieved from this merging process for the State Statutes data bases. (This figure is arrived at from the fact that the data base has an average token-to-type ratio of 150 [6]. However, since we are only dealing with subsets of the data base at any given time, we would expect the ratio to be much lower.) While this reduction process does not exactly eliminate words off the higher end of the Zipf curve, they are reduced to only a relatively small number of types. Another screening process, done concurrently with the merging, is the screening of the lower-end Zipf curve words from the documents to be merged. This effectively is a low frequency cutoff from the merged document set and further reduces the number of types by a significant percentage. Assuming the upper bound for this percentage to be 10%, the average reduction ratio achieved by these combined processes alone would then be as high as 167. To take our previous example of the 50-document set, the initial HISTOGRAM screening process would then have reduced the number of words to be displayed from 50,000 to as low as 300 words! After the initial screening process, which reduces the number of words to be displayed to a much more manageable size, the user can then choose to manipulate and scan these words selectively with the HISTOGRAM word selection mechanism.
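The combined effect of the two screening processes can be made concrete with a small sketch (written in modern Python purely for illustration - the original PDP-11 implementation works very differently, and the cutoff threshold below is an assumed parameter, not a value from the thesis):

    from collections import Counter

    def screen(documents, low_cutoff=2):
        """Two-stage HISTOGRAM-style screening (sketch).

        documents is a list of token lists, one per document.
        Stage 1 merges duplicate tokens into distinct types,
        summing token frequencies; stage 2 cuts off the
        low-frequency tail of the Zipf curve.
        """
        # Stage 1: merging - reduce words (tokens) to types.
        type_freq = Counter()
        for doc in documents:
            type_freq.update(doc)

        # Stage 2: low-frequency cutoff - drop types occurring
        # fewer than low_cutoff times in the merged set.
        survivors = {t: f for t, f in type_freq.items() if f >= low_cutoff}

        total_tokens = sum(type_freq.values())
        print(f"{total_tokens} tokens -> {len(type_freq)} types "
              f"-> {len(survivors)} types after cutoff")
        return survivors

With the thesis's figures - a token-to-type ratio of 150, and a further 10% of the types removed by the cutoff - 50,000 tokens shrink to roughly 333 types and then to about 300 displayable words, which is the overall ratio of about 167 cited above.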
The HISTOGRAM selection mechanism consists of a statistics table and a windowing facility through which the user can view the words. The statistics table displays distribution statistics of the types with respect to token frequency, and shows how the types within each interval of the token frequency domain are distributed among the documents. By defining token frequency as the frequency of occurrence of a distinct type, and document frequency as the number of documents a particular type appears in, we can interpret this statistics table as a 3-dimensional histogram, as in Figure 2.2.

[Figure 2.2: The statistics table interpreted as a 3-dimensional histogram over the token frequency and document frequency domains]

The histogram on the front is a distribution histogram of the types within a domain of token frequency. The histograms on and parallel to the left side are the distribution histograms of the types within the document frequency domain for the corresponding interval of token frequency on the front plane. The histogram on the front plane is in fact another way of looking at the shape of the Zipf curve of a document set. This different arrangement is more revealing and appropriate for our purpose. The 2 domains of this 3-dimensional histogram, the token frequency and the document frequency, are user-definable. This enables the searcher, in effect, to look at all or zoom into any part of the Zipf distribution curve, with the additional capability of seeing how the types distribute among the documents within intervals of the part examined. With the aid of this table, the user is then able to determine how the types are distributed and which groups of types he wants to see. Then, with the very same parameters, the token frequency and the document frequency, the user can select the types he wants to inspect.

2.4 EUREKA and the HISTOGRAM Subsystem

HISTOGRAM is implemented as a subsystem in EUREKA - an experimental on-line full text retrieval system designed at the University of Illinois by a research group under Prof. David J. Kuck. The system is PDP-11 based, running under a multi-process, multi-user executive [3]. The current hardware configuration is made up of a PDP-11/40, 2 disk drives, 2 CRT's and 1 printer. Presently, the data bases used are the various state statutes. The EUREKA query language is made up of 9 commands in 4 main functional areas as follows:

(1) Finding an arbitrarily complex Boolean search expression from a defined document set with/without context specification.
(2) Printing selected portions of documents and information about preceding queries.
(3) Auxiliary functions such as defining and deleting document sets, query sets, macros and comments.
(4) Logon, logoff.

A thorough description of the EUREKA system and query language can be found in [4].

2.5 Language Design and Description

To be in line with the language design of EUREKA, the HISTOGRAM commands are simple, few, but effective. Apart from the entry and exit commands, there are only 4 HISTOGRAM commands. The syntax of the commands is straightforward and minimal; no verbose, English-like commands nor exhaustive options exist to confuse the users.

In the description of the HISTOGRAM commands that follows, the following notations for syntax specification are used:

(1) Capital letters or commas are the characters to be typed as part of the command.
(2) An underscored character means the character is mandatory for the command.
(3) Words in small letters represent parameters whose values are to be supplied by the user.
(4) A set of parallel entries enclosed by 2 vertical bars means any, but only one, of the entries applies.

2.5.1 HIST Command

Syntax: HISTOGRAM

Function: Entry from EUREKA to the HISTOGRAM subsystem.

2.5.2 MERGE Command

Syntax: MERGE | eureka document set name |
              | LAST                     |
              | [ document number list ] |

- eureka document set name is any EUREKA defined document set
- LAST is the last document set defined in EUREKA
- document number list is a user defined list of document numbers in the data base, separated by commas with no embedded blanks

Function: Produces a resultant merged file of alphabetically ordered types and statistics from the document set. The completion of the merge process is signalled by the system prompt character '.'. A maximum of only 1 resultant merged file exists for a user at any given time. Any subsequent merge will erase the previous resultant merged file. Any merge command that requests a non-existent document set or a non-existent document will be ignored and no error message will be issued. However, the previously saved resultant merged file, if any exists, will not be scratched. Error messages will be issued for the other invalid requests.

Examples: ME [1,7,23,15] (Merges documents 1, 7, 23 and 15.)
          ME LAST (Merges the last EUREKA defined document set.)

2.5.3 STATS Command

Syntax: STATS [ltf,utf,ldf,udf]

- ltf is the user specified lower bound for the token frequency
- utf is the user specified upper bound for the token frequency
- ldf is the user specified lower bound for the document frequency
- udf is the user specified upper bound for the document frequency

Default: ltf - 1
         utf - 65535
         ldf - 1
         udf - the total number of documents in the data base

Substitution of the default is implied by the absence of an operand or operands between commas, or by a premature ']'. No embedded blanks are allowed.

Function: The STATS command displays a distribution statistics table of all the types in the merged file within the user-specified domains of token and document frequencies. The table is an 8*8 table with intervals scaled on the two domains. By wisely varying the upper and lower bounds of these parameters, the user has in effect a zoomable viewing device on the distribution statistics of the types. The command is ignored with invalid domains or a non-existent merged file.

Display: Figure 2.3 is an example of the statistics distribution table displayed by STATS. Token frequency, whose domain is bounded by the user-specified values or through default, is scaled into 8 equal integral intervals, with the last interval possibly extended or truncated to the upper bound. The document frequency domain is similarly scaled. Each entry under the column labelled 'TYPES' shows a count of the types occurring in the token frequency interval specified in the rightmost column of each row. The 'TOTAL' displayed at the bottom of the 'TYPES' column shows the total number of types in the specified token frequency domain. The entries in the columns between the leftmost and rightmost ones show the occurrence frequencies of types in the specific document frequency interval.

Example: In response to the command STATS [1,380,1,15], HISTOGRAM will produce a display similar to the one in Figure 2.3.

    TYPES    DOC FREQ:  2    4    6    8   10   12   14   15   TOKEN FREQ
     231               30   40   35   50   45   12   12    7      1- 50
     120                5   20   30   27   23    5    4    6     51-100
     102               12    8    6   12   12   16   24   12    101-150
      70               20   10   11    9    3    7    4    6    151-200
      52                1    1   12   10   18    3    4    3    201-250
      32                7    8    7    4    6                   251-300
      18                1    3    3    7    4                   301-350
      10                4    2    2    2                        351-380

    NO. OF TYPES IN THE MERGED FILE: 404

    FIGURE 2.3
The display is a statistics distribution table for types in the merged file with token frequency from and including 1 to 380 and document frequency from and including 1 to 15. As an example, row 5 in the table would mean the following: there are 70 types altogether that occur between 151 and 200 times inclusive. Among the types, 20 occur in 1 to 2 documents, 10 in 3 to 4 documents, 11 in 5 to 6 documents, 9 in 7 to 8 documents, 3 in 9 to 10 documents, 7 in 11 to 12 documents, 4 in 13 to 14 documents and 6 in 15 documents, summing up to 70.

2.5.4 LIST Command

Syntax: LIST [ltf,utf,ldf,udf]

- ltf, utf, ldf, udf as in 2.5.3

Default: as in 2.5.3

Function: Lists the types in the merged documents that satisfy the user requested thresholds on frequencies.

Display: Selected types are displayed four in a row, alphabetically.

Examples: (1) L [5,10,4,4] (Lists types that occur from 5 to 10 times in the documents and in only 4 documents.)
          (2) L [,200] (Lists types that occur from 1 to 200 times in all the documents of the data base.)

2.5.5 SEARCH Command

Syntax: SEARCH 'character string'

Function: The search command initiates a search through the merged file for the type specified in the operand. If a hit is made, the type's token and document frequencies in the documents merged previously will be displayed along with the text of the type. If not found, a message to that effect will be generated.

Examples: (1) SE 'JUNK' (Searches for the type 'JUNK'.)
              JUNK [132,4] (The type 'JUNK' occurs 132 times in 4 documents in the document set.)
          (2) SEARCH 'BIGFOOT'
              BIGFOOT NOT FOUND

2.5.6 EXIT Command

Syntax: | EX |
        | ED |

Function: Exit from HISTOGRAM to EUREKA. EX is an exit with the merged file saved for subsequent use. ED is an exit with the merged file scratched.

2.6 Usages of HISTOGRAM as a Search Aid

There are many ways HISTOGRAM can be used to assist a searcher in converging his search. By using the SEARCH command, a searcher can evaluate the usefulness of the terms in his search expression. Words that have relatively high token and document frequencies are probably too general to be good search terms. Likewise, words with relatively low token and document frequencies are probably too specific to be of any use. On the other hand, words with a significant token frequency and a not too high document frequency are words that appear many times in a small number of documents, and these are potentially good terms to converge a search. By using the STATS command, a user can feel out the general characteristics of the document set before he sets out to look selectively at the words. From the statistics table, one can get an idea of how the documents in the document set relate to each other. If a significant number of the medium token frequency words appear in a great percentage of the documents, the documents within the set are probably closely related to some common topics. On the other hand, if a significant number of medium token frequency words appear with a small document frequency, the document set is probably made up of clusters of related documents. However, when the medium token frequency types are quite evenly distributed, it could mean that we have a set of unrelated documents on our hands and should probably re-evaluate our search strategy. As a matter of fact, there are many ways that the statistics can be interpreted and used to help a searcher.
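These rules of thumb can be summarized as a small classifier. The following sketch is illustrative only; the thresholds are assumptions of this sketch, not values from the thesis - in practice a searcher would pick them after inspecting the STATS table for the merged set:

    def judge_term(token_freq, doc_freq, tf_hi=200, tf_lo=5, df_hi=10):
        """Classify a candidate search term by the section 2.6
        heuristic. All thresholds are illustrative assumptions."""
        if token_freq >= tf_hi and doc_freq >= df_hi:
            return "too general"         # frequent nearly everywhere
        if token_freq <= tf_lo:
            return "too specific"        # too rare to help a search
        if doc_freq < df_hi:
            return "good discriminator"  # frequent in few documents
        return "inspect further"

With these assumed thresholds, the type 'JUNK' from the SEARCH example above (132 occurrences in 4 documents) would be classified as a good discriminator: judge_term(132, 4) returns "good discriminator".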
Once the user has an idea of what scope of words he wants to see, he can select them for inspection with the proper parameters in the LIST command. Though generally words with medium token frequencies are our main focus of attention, it is certainly conceivable that other words could be useful on other occasions. To summarize, what we have cited are only a few instances of how we think HISTOGRAM could be used, and they are definitely not exhaustive. Different goals and stages in our search might call for different approaches to using HISTOGRAM. Hopefully, with practice and experience, users can find new ways of interpreting and using HISTOGRAM to improve their search strategy.

CHAPTER 3 - HISTOGRAM DESIGN AND DESCRIPTION

3.1 Design Philosophy

HISTOGRAM was designed as an autonomous subsystem of EUREKA with a minimal number of interface paths. As it is experimental in nature, one of our main design concerns was to make the code straightforward and easy to modify and maintain. To this effect, the technique of modular programming was used extensively. The HISTOGRAM subsystem is broken down into command processing modules, and each command processing module is in turn further granularized into functional units of smaller programming entities in the form of macros and subroutines. Every function that might be changed, modified, replaced or tuned in the future is programmed as an independent unit so that any change can be made with minimal effort and disturbance. While this extensive modularizing tends to impose some additional CPU overhead, this should not affect EUREKA and HISTOGRAM performance, as HISTOGRAM itself is an I/O bound procedure and EUREKA, from measurements done by Milner [3], is only using 50% of its CPU cycles anyway.

3.2 Functional Overview

HISTOGRAM is made up of 5 modules. Except for the dispatching module, HSTGRM, which communicates with the various command modules, there exists no direct interaction among command modules. HISTOGRAM itself only communicates with EUREKA in three situations - entry, exit and the passing of the EUREKA user logon block pointer in order to obtain a EUREKA defined document set list. Briefly, the 5 HISTOGRAM command modules perform the following:

(1) HSTGRM - Interfacing with the EUREKA user, the EUREKA system and the various HISTOGRAM command modules.
(2) MGR - Merging documents: the process of reducing duplicate words to distinct words and summing up their token and document frequencies while sorting them in alphabetical order.
(3) STATS - Building and displaying, according to user-defined domains, a statistical table for the merged document set.
(4) LISTER - Displaying words of the merged document set according to user-specified parameters.
(5) SRCHER - Showing distribution statistics of any specific word within the merged document set.

[Figure 3.1: HISTOGRAM functional overview - control and data flow among EUREKA, HSTGRM, MGR, STATS, LISTER and SRCHER, from the statute files to the merged file]

Basically, what HISTOGRAM does is that, given any user requested document set, it will input already-existing statistics files corresponding to each of these documents. Through the merger, it then will merge them and at the same time strip the intermediate files generated during the merge, and the resulting merged file, into a simpler format. The resulting merged file is then input and acted upon by other HISTOGRAM modules whenever required.
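The merge step at the heart of MGR (described in detail in section 3.4.2 below) can be sketched as follows. This is a minimal illustration, assuming entries are already in memory as (token, token frequency, document frequency) tuples; the real module's file handling, buffering and low-frequency cutoff are omitted:

    def merge2(a, b):
        """2-way merge of two alphabetically ordered entry lists.

        On unequal tokens the alphabetically smaller entry is
        emitted; on equal tokens the statistics are summed and a
        single entry is put out, mirroring the MGR 2-way merger.
        """
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            ta, fa, da = a[i]
            tb, fb, db = b[j]
            if ta < tb:
                out.append(a[i])
                i += 1
            elif tb < ta:
                out.append(b[j])
                j += 1
            else:
                # Same type in both runs: add statistics, emit once.
                out.append((ta, fa + fb, da + db))
                i += 1
                j += 1
        out.extend(a[i:])  # drain whichever run is left over
        out.extend(b[j:])
        return out

If each single-document run starts with a document frequency of 1 per entry, repeated 2-way merges leave each type carrying its total token frequency and its document frequency across the whole set.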
3.3 Files of HISTOGRAM

The input statistics files and the output merged files of HISTOGRAM are made up of entries, each containing a variable-length token text and tagged-along statistical data. These entries are ordered alphabetically. Entries in the merged file and the intermediate scratch files have the same format. Each entry contains a token text preceded by a 1-byte text length count, followed on an even boundary by a double-word token frequency count and a one-word document frequency count. The input statistics files have essentially the same format, without the document frequency count but with some other irrelevant data tagged on after the frequency count (Fig. 3.2). The input statistics files are really read-only files, while the merged file can be saved, scratched or overwritten by another merge operation. The intermediate files generated are work files and are deleted after every 2-way merge.

[Figure 3.2: Entry formats - statute file entries: 1-byte length count, token text, double-word frequency count, other data; generated file entries: 1-byte length count, token text, double-word token frequency count, single-word document frequency count]

3.4 Description of the Command Modules

3.4.1 HSTGRM

HSTGRM, the interfacing module between the user, EUREKA and the HISTOGRAM modules, is a command interpreter and control dispatcher for the subsystem. It is entered from EUREKA and re-entered from the HISTOGRAM command modules when they are finished with the requested command. On user request to exit, it jumps to a section of code that handles the exit option of keeping or deleting the merged file. The parameters that it passes to the command modules are the EUREKA logon block pointer and the command buffer pointer. No condition flags or data are passed back when control is returned from the command modules.

3.4.2 MGR

The merging module, MGR, consists of a driver, a 2-way merger, an input handler and an output handler. The 2-way merger is a very specialized routine. It merges by pulling entries off and placing the results at pre-designated locations, letting the input and output handlers worry about the actual I/O and the tedious bookkeeping involved. In merging the entries, it takes the smaller of the two if the tokens are unequal. If they are equal, it adds the statistics, puts only one entry out and discards the other. The input handler performs the actual I/O in retrieving the entries from the disk files, does buffer management, updates pointers to the current entries in buffers, moves entries from buffers to the appropriate fixed locations, does low frequency word cutoffs in the first pass, and checks for and acts accordingly on I/O error conditions and end-of-file conditions. The output handler is comparatively simpler. It moves entries from the fixed location to the output buffer, updates the pointer and outputs if the buffer is full.

[Figure 3.3: The MGR module - the driver controls the 2-way merger; the input handler feeds entries from the statute files and the output handler writes the merged file]

The driver itself is the heart of the MGR module. It emits file names to the input handler for every 2-way merge and sequences the overall merging activities at the document level. Initially, it builds the list of file names to be merged by interpreting the EUREKA passed document set list or parsing the user-defined document set list, depending on how the merge is requested. It then picks out these names 2 at a time and emits them to the input handler for the 2-way merge. It also generates file names for all the intermediate scratch files.
The scratch files are deleted as soon as they are entirely absorbed by the 2-way merger. Only the final merged file is not deleted automatically. As for the sequencing function, in addition to passing file names for the merge, the driver is also able to recognize the various situations of whether it is the first pass of the merge or not and whether the number of documents in that pass is even or odd. Different combinations of these call for different actions and settings of flags in order to optimize control flow and/or simplify coding.

3.4.3 STATS

The STATS command module is responsible for building the statistical table on the merged file for the user-requested domains. When passed control by HSTGRM, the subsystem dispatcher, STATS checks whether the user-specified parameters are valid. These parameters are the user requested upper and lower bounds of the domains of token and document frequencies. They are considered valid if they are numbers and the upper bound is greater than the lower bound. It then builds an internal statistical table equivalent to the one to be displayed. The domains are scaled into 8 equal integral intervals whenever possible, the exceptions being the cases when the domain itself is not large enough to be split into 8 integral intervals and/or the last interval is extended or truncated to the upper bound of that domain. Each slot in this usually 8*8 table corresponds then to a particular interval of document frequency and token frequency. The slots are used as counters and are initialized to zeroes. STATS then builds a 2-level directory that maps entries into the appropriate slots to increment the counters. When this is done, the STATS module simply walks through the entries in the merged file with a relatively simple input handler and updates the appropriate counters whenever tokens within the requested domains are found. On end of file, STATS proceeds to format the internal table into a displayable form by doing the necessary editing and conversions. It then displays the table to the user. The statistics table that it builds is not retained, and every new STATS command causes another round of table building. The merged file, however, remains intact and is only overwritten after another MERGE command.

3.4.4 LISTER

LISTER enables users to look at words in the merged documents that are within the user's specified threshold values of document and token frequencies. These threshold values do not have to coincide with intervals displayed in the statistical table at all. Again, this module simply walks through the entries in the merged file and picks the ones that satisfy the thresholds. It then moves these tokens to the terminal buffer, edits them and outputs them whenever the output buffer is full.

3.4.5 SRCHER

SRCHER does essentially the same walk-through process as LISTER does, except that it is looking for just one particular entry with the specified token. When a hit is made, SRCHER converts the tagged-on token and document frequencies into character form and displays the token and its statistics. When end of file or an alphabetically larger token is reached, SRCHER gives up and informs the user. At present, SRCHER can only handle one token per search request. But with additional code that can parse multiple entries in the search command operand and sort these tokens, the module would be able to accept multiple tokens per request without any further modification to the walk-through itself.
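Before leaving the command modules, the interval scaling and one-pass counting at the core of STATS (section 3.4.3) can be sketched as follows. The rounding rule is an assumption of this sketch - the thesis only says the domains are scaled "into 8 equal integral intervals whenever possible", and Figure 2.3 suggests the real system also favors round interval widths such as 50:

    import math

    def scale(lo, hi, n=8):
        """Split the integer domain [lo, hi] into n equal integral
        intervals where possible; the last interval is extended or
        truncated to the upper bound. Returns each interval's upper
        bound. Ceiling division is an assumed rounding rule."""
        width = max(1, math.ceil((hi - lo + 1) / n))
        bounds = [min(lo + (k + 1) * width - 1, hi) for k in range(n)]
        bounds[-1] = hi  # extend or truncate the last interval
        return bounds

    def build_table(entries, ltf, utf, ldf, udf):
        """Fill the 8*8 counter table in one walk over the merged
        file's (token, token_freq, doc_freq) entries, as STATS does."""
        tf_bounds, df_bounds = scale(ltf, utf), scale(ldf, udf)
        table = [[0] * len(df_bounds) for _ in tf_bounds]
        for _token, tf, df in entries:
            if ltf <= tf <= utf and ldf <= df <= udf:
                row = next(k for k, ub in enumerate(tf_bounds) if tf <= ub)
                col = next(k for k, ub in enumerate(df_bounds) if df <= ub)
                table[row][col] += 1
        return table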
CHAPTER 4 - THESAURUS GENERATION

4.1 Conventional Methods of Thesaurus Generation

The generation of on-line thesauri for full-text retrieval systems has received considerable attention in recent years. Thesauri can be generated manually, semi-automatically or automatically. Manual thesaurus generation is a tedious task which involves subjective judgement and is inflexible to changes in the data base. Automatic and semi-automatic thesaurus generation, on the other hand, are more objective, faster and more flexible to changes in the data base, provided reasonably good generation algorithms are used. One problem, though, with automatic and semi-automatic thesaurus generation is that, regardless of the generation algorithm used, a tremendous amount of computing resource in terms of processor storage, secondary storage and computing time is required. Simply put, the problem with automatic and semi-automatic thesaurus generation is that no simple and effective generation algorithm yet exists.

4.2 Concept of Thesaurus Classes

Before we proceed any further, we need to clarify our notion of thesaurus classes. We believe thesaurus classes should not only contain synonyms, but also words that are related but not synonymous in any sense of the word. Words that appear in the same phrases, or words that are always mentioned together, should also be considered and included in our thesaurus classes. As an example, consider the following group of words - 'birth', 'control', 'prevention', 'abortion' and 'pregnancy'. While certainly not all of these words would be included in a conventional thesaurus group, they all pertain to a common topic. For a searcher who is looking for information on birth control, it would be helpful to know of these other possible words he can use as search terms. We believe that, as an on-line search aid, this extended concept of thesaurus groups of synonyms as well as related words serves a better purpose, and we should therefore aim at such thesaurus classes in generating thesauri.

4.3 Our Proposed Approach to Thesaurus Generation

As far as the generation process is concerned, we outline here a 2-step approach. The first step involves clustering the documents in the data base. The idea behind clustering is that documents in a data base can be broken up into smaller clusters of documents. Documents within a cluster are loosely related to some common topics. While a document can belong to more than one of these clusters, for all practical purposes, clusters are independent entities with no inter-relationship to each other whatsoever. As an example, a document on birth control obviously belongs to a different cluster from a document on highway safety regulations. Generation of a thesaurus then should be from these subsets of the data base rather than from the entire data base itself. There are two advantages. Obviously, by reducing the number of documents to a manageable size, automatic or semi-automatic generation of a thesaurus should be a lot easier. In addition, generating a thesaurus from documents within a cluster filters out noise that would otherwise be introduced from unrelated documents outside of the cluster. As an example, the words 'law' and 'order' would probably be put into the same thesaurus group for a cluster of documents on law and order, while the words 'order' and 'magnitude' would probably be put into another thesaurus group in a cluster of documents on computer performance; but the words 'law' and 'magnitude' certainly do not belong together.
4.3.1 Clustering Documents

To cluster documents, we can make use of the concept of document vectors, where a document vector is a list of distinct words which characterizes and distinguishes a document from the others. Looking at the Zipf curve of any particular document, we know that words under either end of the curve are noise words of very little significance. Intuitively, as we approach the inside of the curve from both ends, we find more and more significant words. If some kind of significance level curve could be plotted onto the Zipf curve, it would probably resemble the one in Figure 4.1.

[Figure 4.1: A significance level curve superimposed on the Zipf distribution curve]

The idea of defining a document vector then is to define low cutoff and high cutoff threshold values of token frequency so that we can obtain a set of words large enough to characterize a document but small enough for later manipulation in the clustering and thesaurus generation processes. Since the threshold values are absolute quantities which will vary with document sizes, to design any general algorithm to determine threshold values, we have to relate them to relative quantities such as distribution percentiles.

4.3.2 Generating Thesaurus Classes from Clusters

Once we have built clusters, and thereby reduced the number of documents and the number of words to be practical for thesaurus generation, a lot of the existing statistical correlation techniques to generate a thesaurus become practical and feasible to implement. One such technique is the generation of thesaurus classes by a co-occurrence matrix. The co-occurrence matrix is a table of 'd' rows and 't' columns, where 'd' is the number of documents in the data base concerned and 't' is the number of types in the data base. Each column in the table corresponds to a type and each row corresponds to a document in the data base. A '1' in the i-th row, j-th column entry indicates the presence of the j-th column type in the i-th row document, and a '0' indicates an absence of the type in the document. The matrix is then used as a tool to determine what types co-occur often enough in the same documents to be considered within the same thesaurus classes.

4.4 HISTOGRAM Modules and Thesaurus Building in EUREKA

The above section only gives an overall outline of an approach to thesaurus generation. A detailed plan for the implementation requires further investigation and effort. However, at this point, it is clear that a lot of the existing HISTOGRAM modules can lend themselves to use regardless of the final plan adopted. The merging module of HISTOGRAM, for example, can be used to strip the statistics files into a format acceptable to the STATS module. The STATS module can then compile distribution statistics. These statistics can be used as a criterion by some general algorithm to generate document-dependent high and low cutoff values.
By experimenting with different cutoff values to vectorize documents and different correlation techniques for clustering, we hopefully will arrive at some satisfactory algorithms in generating clusters that are small. Once we have good ways of producing small clusters, we can try out various correlation techniques with our "co-occurrence matrix" method of thesaurus generation. Again through experimentation, hopefully we would come up with a good scheme of generating a thesaurus. At such a time, we can then use the existing thesaurus feature in EUREKA to actually build the thesaurus system[5]. It has yet to be investigated on how feasible and how much effort 41 would be involved to interface the HISTOGRAM modules with this thesaurus feature in order to be able to generate thesaurus classes and build the theasurus system completely automatically. However, at the very least, our generated thesaurus classes can be entered into EUFEKA manually to produce an on-line thesaurus. H2 CHAPTER 5 - CONCLOSION We have described a facility that attempts to improve users' performance on on-line searches. The facility is experimental in nature. Our initial objective was to get the facility operational and then do whatever necessary tunings and modifications afterwards. It is believed that ways for improvements and modifications, in both system performance and user interfacing, will become clear as more data and user feedback are available once the system is operational. As can be seen now, there are several areas for maior improvements. The major performance problem with the current HISTOGRAM system is the I/O bottleneck. This bottleneck is mainly caused by the large number of arm movements necessary for every 2-way merge. Two things can be done in directly minmizing and optimizing arm movements. Placing the statistics files and intermediate scratch files close together and/or close to the center of the disk would minimize arm movement time. Also, an arm movement optimizing scheme in the choosing of input and output files for merging would greatly reduce unnecessary arm movement delays. As a matter of fact, the latter is to a certain 43 extent implicit with the current merging algorithm. If a user requests a merge for a EUREKA defined document set, the document number list passed by EUREKA to the HISTOGRAM merger is already arranged in ascending document numbers, thus by picking them 2 at a time from the start of the list, the merger is picking the closest possible pair of documents for every merge. Significant performance improvement can also be achieved by reducing the volume of I/O for the merge. This can be done by shrinking the size of the statistics files and the intermediate scratch files. One way to do so would be to eliminate the irrelevant statistical information in the original statistics files. They do occupy a significant percentage of the file space and are not useful information for our purpose. As the first pass merge I/O constitutes more than half the total volume of I/O, reducing the size of these initial statistics files would significantly affect the system performance. Another way worth considering is to encode the variable length token text in every entry into a 2-byte code. This would reduce the size of the statistics files and the intermediate scratch files by a substantial percentage. As a matter of fact, by encoding words with an implicit alphabetical ordering, comparision of tokens can be reduced to a trivial operation with the merging process greatly simplified. 
The actual decoding of the words does not have to be done till the point where the user selects the words to see, and even then, the number 44 of words to be decoded would generally be only on the order of tens, or hundreds. The merging process is another potential area of improvement. We have a tentative low cutoff mechanism for every merge in the fisrt pass, but we have yet to introduce a high cutoff scheme in this pass. The reason that we have not done so is the difficulty in determining a threshold value for the high cutoff. To a much greater extent, the high cutoff threshold would vary with the size of the documents. It is not obvious at this point what good method could be used to filter out the high fregeuEcy types and at the same time not imposing so much overhead that would defeat the whole purpose of a high cutoff. Introducing a stop list of the usual high frequency words like •the', 'a', etc. would help but does not seem to be very general and flexible. The merging algorithm can be improved also. At present, merging of the statistics files for the documents is done by 2-way merges and intermediate files are generated for each merge. The generation of files hampers system performance as the openning and closing of files are slow processes. One possible way to avoid having to open and close files so frequently is to use a single intermediate file for all the merges within the same pass rather than using 1 intermediate file per 2-way merge. This would, however, require a much more complicated bookeeping 45 algorithm than the current one and restructuring of some of the systems modules concerned might be necessary. One final point that deserves attention is that we have talked about vectorizing documents. Our objective is to use as few words as possible to characterize a document's contents. A mere high and low cutoff as discussed earlier might not be sufficient for our purpose. Maybe some additional screening using weighing factors based on a type's token and document freguencies would help. This information is readily available from HISTOGRAM. As mentioned earlier, HISTOGRAM is meant to be a dynamic system. It is hoped that through experimentation with this facility, solutions to some of the problems encountered in on-line retrieval systems can be found. 46 REFERENCES [1] Lancater F.W., "Information Retrieval Systems Characteristics, Testing and Evaluation", New York, Wiley 1968. [2] Lancaster F.W. and Fayen E.G., "Information Retrieval On-Line", Los Angeles, Calif.: Melville Publishing Company, 1973. [3] Milner, J. M,, "A Multiprocess, Multiuser Executive for an Experimental Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 75-736, 1975. [4] Morgan, J. K. , "Description of an Experimental On-line Minicomputer Based Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 76-779, Febraury 1976. [5] Morgan, T.J., "A Thesaurus Feature for the Eureka Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 77-855, 1977. 77-855, 1977. £6] Rinewalt, J.R., "Evaluation of Selected Features of the Eureka Full-text Information Retrieval System", Ph.D. Thesis, University of Illinois Department of Computer Science Report Number 76-823, 1976. H7 APPENDIX A - GLOSSARY OF TERHS recall ratio: the ratio of retrieved relevant documents to the total number of relevant documents. 
REFERENCES

[1] Lancaster, F.W., "Information Retrieval Systems: Characteristics, Testing and Evaluation", New York: Wiley, 1968.
[2] Lancaster, F.W. and Fayen, E.G., "Information Retrieval On-Line", Los Angeles, Calif.: Melville Publishing Company, 1973.
[3] Milner, J.M., "A Multiprocess, Multiuser Executive for an Experimental Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 75-736, 1975.
[4] Morgan, J.K., "Description of an Experimental On-line Minicomputer Based Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 76-779, February 1976.
[5] Morgan, T.J., "A Thesaurus Feature for the Eureka Information Retrieval System", M.S. Thesis, University of Illinois Department of Computer Science Report Number 77-855, 1977.
[6] Rinewalt, J.R., "Evaluation of Selected Features of the Eureka Full-text Information Retrieval System", Ph.D. Thesis, University of Illinois Department of Computer Science Report Number 76-823, 1976.

APPENDIX A - GLOSSARY OF TERMS

recall ratio: the ratio of retrieved relevant documents to the total number of relevant documents.
precision ratio: the ratio of retrieved relevant documents to the total number of retrieved documents.
token: any unbroken string of alphanumeric characters.
type: a distinct token.
token frequency: the number of times a token repeats itself.
document frequency: the number of documents a type appears in.
type rank: the standing of a type in a document set according to its token frequency. The lowest ranked type in a document set always has the highest token frequency in the set.
query: a search request.
query language: the repertoire of commands in a retrieval system.
search term: a word used in a search request.
search expression: a syntactically correct combination of the desired contents of documents to be retrieved from the data base.
responding documents: the retrieved documents in a search.
document set: a collection of documents.
thesaurus classes: groups of related words, not necessarily synonymous.

BIBLIOGRAPHIC DATA SHEET

1. Report No.: UIUCDCS-R-78-914
4. Title and Subtitle: A UNIQUE WORD-SCANNING FACILITY FOR THE EUREKA FULL-TEXT INFORMATION RETRIEVAL SYSTEM
5. Report Date: January 1978
7. Author(s): William Ming-Cheong Leung
8. Performing Organization Report No.: UIUCDCS-R-78-914
9. Performing Organization Name and Address: University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois 61801
11. Contract/Grant No.: US NSF MCS73-07980
12. Sponsoring Organization Name and Address: National Science Foundation, Washington, D.C.
13. Type of Report & Period Covered: Master's Thesis
16. Abstracts: This thesis describes a word screening and scanning feature in the experimental on-line full text retrieval system, EUREKA. Given any specified collection of documents in the database, the feature can reduce the words within the set to distinct occurrences, screen out insignificant words, provide distribution statistics of the words, and, based on these statistics, enable users to look at alphabetical listings of these words selectively. This feature saves users considerable effort in converging their searches. In addition, the feature can serve as a potential tool for automatic thesaurus generation in EUREKA. By using it to generate word-vectors which characterize documents, documents of the database can be clustered to workable sizes for generating thesauri by statistical methods.
17. Key Words and Document Analysis. 17a. Descriptors: Thesaurus generation; Word scanning and screening; Zipf distribution
18. Availability Statement: Release Unlimited
19./20. Security Class (This Report / This Page): UNCLASSIFIED
21. No. of Pages: 51