These special characters are, for the most part, printed above the letters on the terminal keyboard.

One other point about keys should be noted. On a terminal keyboard, zero and the letter "O" are not the same. The zero key is between the "9" key and the ":" key on the top row of keys, while the "O" key is between the "I" and "P" keys on the next row down. Also note that there are both single quote and double quote keys, the double quote key being SHIFT-2 and the single quote key being SHIFT-7. EUREKA makes use of both single and double quotes, so do not attempt to use two single quotes where one double quote is called for.

The only other special keys the user need be aware of are the CTRL key, the RETURN key, and the RUB OUT key. The RETURN key is used to cause a carriage return and to send the line of characters you have just typed on the keyboard to EUREKA. While you type, the terminal holds what you type in a temporary storage area until you push the RETURN key, at which time the entire line is sent to EUREKA. If, while you are typing in a command to EUREKA, you discover that you have just hit the wrong key, you may delete the last letter you typed in by hitting the RUB OUT key. Pushing the RUB OUT key twice deletes the last two characters, etc. The CTRL key is similar to the SHIFT key in that it assigns a new set of meanings to other keys on the keyboard.
If, while typing in a line, you notice a mistake back in the first part of the line and don't want to use the RUB OUT key to rub out the entire line back to that point, you may delete the entire line by holding down the CTRL key and pushing the "U" key.

The user should note that EUREKA can send and receive commands at the same time, even though commands typed in while EUREKA is working on something else are not echoed on the terminal until EUREKA finishes whatever it was working on. Therefore, it is unwise to play with the keyboard while EUREKA is working on a command you have just entered, since whatever you type in will eventually be sent to EUREKA.

In order to prevent information being flashed on the screen faster than you can read it, EUREKA pauses every 15 lines or at the end of each item you have requested to have printed, whichever occurs first. If the line count has been reached, EUREKA will print a "»" on the next line of the screen to inform you of this fact. If the end of an item has been reached, EUREKA uses an "!" to signal you. When signalled by either a "»" or an "!", you may then instruct EUREKA to continue with the current output by pushing the carriage return key, stop the current output (the current search continues, however) by pressing the "K" key followed by a carriage return, skip to the next document in the list to be printed by pressing "S" followed by a carriage return, or enter the browse mode to skip around within the document currently being printed.

Browse Mode

Browse mode is merely a method of viewing sentences or paragraphs adjacent to the one currently being printed. To skip back to the previous sentence, type in "-S" and push the carriage return key. Similarly, the previous paragraph may be viewed by typing "-P". To skip forward one sentence or paragraph, type "+S" or "+P".
If, while viewing portions of a document, you decide to look at the title or author (or any other context), you have only to type in the correct context identifier (see Appendix A) and push carriage return when EUREKA has paused to allow you to respond. Once any of the commands for skipping around within a document or printing other contexts have been used after a "»" or "!" flag, the user is said to be in browse mode. In order to get out of browse mode and resume the output from the current document, type in "E" followed by a carriage return.

2.7 THE QUERY LANGUAGE

There are only nine commands in the EUREKA query language. Only two of these are necessary for conducting searches, while the other seven perform auxiliary functions. In brief, the functions of these commands are:

FIND: The FIND statement is the heart of the EUREKA system. It is used to perform searches for documents containing a user selected set of words or phrases.

MAKE: The MAKE statement is used to compare and combine sets of documents created by the FIND statement.

COMMENT: The COMMENT statement is used to write notes to yourself concerning a query set or particular document. These notes may be retrieved at a later time by use of the PRINT statement.

CHANGE: The CHANGE statement is used to assign a name to a query set or to change the existing name of a query set.

DELETE: The DELETE statement is used to delete query sets, macros, and/or comments which are no longer needed.

PRINT: The PRINT statement is used to print user comments, selected portions of a document (up to and including the entire document), and information about preceding queries and their resultant query sets.

LOGON: The LOGON statement is used to identify the user to EUREKA in order for EUREKA to gain access to the correct user files and data base.

LOGOFF: The LOGOFF command is used to terminate a session. It disconnects a user from the EUREKA system and closes his files.
DEFINE: The DEFINE statement is used to give a name to a list of search terms so that the user does not have to repeatedly type in long search expressions. These macro definitions are saved in the user file area and may be used in conjunction with other search terms in FIND statements.

Each of the query language commands has a very simple basic form which may be used alone or with the addition of optional clauses that significantly increase its power. This allows the user to begin with very simple query statements and progress to more complicated forms when the need arises.

EUREKA is a keyword driven language. This means that EUREKA figures out what a command typed in by the user means by looking for special words that tell it to do a specific operation. These keywords are usually followed by one or more user supplied parameters that control how the operation is performed.

Although this sounds somewhat complicated, it is pretty simple. Algebra is one other example of a keyword driven language. In the algebraic expression:

    A = 3X + 5

the equal sign "=" is a keyword to tell anyone looking at it that whatever appears on the right side of it is equal to whatever appears on the left. Similarly, the plus sign, "+", is a keyword for the add operation, while A, 3, X, and 5 are parameters.

In discussing the EUREKA query language we will wish to represent parameters in a general fashion in addition to giving specific examples, since it would take an impossibly large number of examples to cover the possible range of each command entirely. Therefore, if we wish to describe a parameter, we will just use an English phrase that describes the parameter and then describe allowable forms for the parameter by giving a definition for the phrase. However, this causes a slight problem. Since most of the EUREKA keywords are English words, we need something to distinguish the EUREKA keywords from the English phrases!
For this purpose we will use some characters not used in the EUREKA query language to enclose the English phrases. Since neither the less-than symbol (<) nor the greater-than symbol (>) is used in the EUREKA language, we will use them for separating the phrases from the keywords. For example, the algebraic expression used earlier could be represented by an equal sign standing between two bracketed phrases, each of which is then defined in turn by a form of its own, and so forth.

The notation just presented is definitely easier to understand after seeing several examples than it is to describe formally. The user should become familiar enough with the notation to make sense of it by comparing the examples on the next few pages with the descriptions given in this notation.

2.7.1 THE FIND STATEMENT

The FIND statement is used to enter search requests. Its basic form is:

    FIND <term expression>

The keyword is "FIND", but for brevity this may be abbreviated "F". The parameter <term expression> is a combination of words enclosed in single quotes for which the search is to be conducted. Examples are:

    FIND 'PRECISION'
    FIND 'PRECISION' * 'RECALL'
    F 'CATS' * 'DOGS' + 'MICE'

In the above examples, the first one directs EUREKA to find all documents in which the word "PRECISION" appears. The second example directs EUREKA to find all documents in which both the word "PRECISION" and the word "RECALL" appear. Note that "*" means that "Both what is on the right and what is on the left must appear". The final example directs EUREKA to find all documents in which both the word "CATS" and the word "DOGS" appear, and to also retrieve all documents in which the word "MICE" appears. Note that "+" means that "Either what is on the left or what is on the right must appear". For this reason, "+" is referred to as "OR", while "*" is referred to as "AND". Another fact to note about the final example is that the "AND" is considered before the "OR".
That is, the third example is taken to mean "all documents containing either the word MICE or both the word CATS and the word DOGS" rather than "all documents containing both the word CATS and either of the words DOGS and MICE". A little perusal will show the reader that the above two phrases do not mean the same thing. This is similar to the problem in algebra of deciding whether

    A = 4/X-5

means

    A = 4/(X-5)   or   A = (4/X)-5

Since EUREKA always assumes "AND" is to be done before "OR", if one wishes to tell EUREKA to find "all the documents containing the word 'CATS' and either of the words 'DOGS' or 'MICE'", one must make use of parentheses to alter the order of evaluation of the search expression. For instance:

    F 'CATS' * ('DOGS' + 'MICE')

One final remark about search expressions. The words in single quotes are called "search terms" and are used in the exact form typed in when searching for matches in the document texts. Therefore, if one types in

    FIND 'DOG'

one gets the list of documents containing the word "DOG", but not those containing the words "DOGGIES", "DOGS", etc. (unless the word "DOG" appears in them also). If the user wishes to search for all forms of a word stem, then he may make use of the "universal character", "#". For instance, the query

    F 'DOG#'

would return the list of all documents containing any of the following words: "DOG", "DOGS", "DOGMATIC", "DOGWOOD", etc. This is known as suffixing, since it directs EUREKA to accept any suffix attached to the word stem. Prefixing is also permitted. For instance,

    F '#FIX'

would retrieve all documents containing any of the following words: "PREFIX", "POSTFIX", "SUFFIX", etc. Both prefixing and suffixing may be performed at once. The query

    F '#IZ#'

would retrieve all documents containing such words as: "AMERICANIZATION", "AMERICANIZED", "COMPUTERIZED", etc. However, if the universal character appears in the middle of a word it is assumed to actually be a pound sign.
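The matching rules for the universal character can be illustrated with a short Python sketch. This is not EUREKA's implementation, which the manual does not describe at this level; the function name `matches` and the treatment of a lone "#" as a literal character are assumptions made only for illustration.

```python
def matches(term: str, word: str) -> bool:
    """Return True if `word` satisfies the search term `term`.

    Illustrative only: a "#" at the start or end of the term is treated
    as the universal character; a "#" anywhere else is a literal pound sign.
    """
    # Assumption: a term consisting of just "#" is treated literally.
    leading = len(term) > 1 and term.startswith("#")
    trailing = len(term) > 1 and term.endswith("#")
    start = 1 if leading else 0
    end = len(term) - 1 if trailing else len(term)
    stem = term[start:end]

    if leading and trailing:
        return stem in word           # prefixing and suffixing at once
    if leading:
        return word.endswith(stem)    # prefixing: any prefix accepted
    if trailing:
        return word.startswith(stem)  # suffixing: any suffix accepted
    return word == term               # exact form, including any middle "#"


for term, word in [("DOG#", "DOGWOOD"), ("DOG", "DOGS"),
                   ("#FIX", "SUFFIX"), ("#IZ#", "AMERICANIZED"),
                   ("A#D", "AND")]:
    print(term, word, matches(term, word))
```

In this sketch a "#" that is neither first nor last takes no part in pattern matching, so the term is compared character for character against the word.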
Therefore, F 'A#D' tries to find the word "A#D" rather than retrieving all documents containing a word starting with "A" and ending with "D".

Now we may begin to add on optional clauses to the basic FIND statement in order to simplify certain operations and make possible others that are not possible with the basic FIND statement. Options are just like options on a car - they may be included if necessary or left out if not needed.

2.7.1.1 CONTEXT CLAUSE

The first optional clause we shall discuss is the context option. Its general form is:

    FIND <term expression> IN <context>

"IN" is the keyword to inform EUREKA that a search context follows, and <context> is the parameter that specifies the context in which the words in the <term expression> must appear in order for the document to be retrieved. For example, the query:

    FIND 'DOG' * 'CAT' IN SENTENCE

directs EUREKA to retrieve all documents in which both the word "CAT" and the word "DOG" appear in the same sentence within the document. The list of all allowable contexts is:

     1: SENTENCE
     2: PARAGRAPH
     3: DOCUMENT
     4: ARTICLE
     5: DATA
     6: AUTHOR
     7: TITLE
     8: SOURCE
     9: DATE
    10: PAGES
    11: MISC
    12: INDEX
    13: KEYS
    14: TEXT
    15: ABSTRACT
    16: BODY
    17: NOTES
    18: REFERENCES
    19: COMMENTS

Definitions of the various contexts appear in Appendix A. Note that any context term may be abbreviated by truncating it to any length that leaves it distinguishable from all context terms preceding it in the above list. For instance, specifying "IN A" for a context is the same as specifying "IN ARTICLE", while "IN AU" specifies "IN AUTHOR".

Examples of FIND statements containing context clauses are:

    F 'SALTON' + 'LANCASTER' IN AUTHOR

which directs EUREKA to find all documents written by either Salton or Lancaster.
    FIND 'GARBAGE' IN COMMENTS

which directs EUREKA to find all documents to which the user has added a comment (to be explained later) containing the word "GARBAGE".

    F 'COMPUTER#' * 'LIBR#' IN TITLE

which directs EUREKA to find all documents that contain both a word starting with the characters "COMPUTER" and a word starting with the letters "LIBR" in the title.

    F 'AUTOMATA' IN AB

which directs EUREKA to find all documents containing the word "AUTOMATA" in their abstract.

2.7.1.2 FROM CLAUSE

The next option we shall discuss is the from clause. The from clause is used to specify the search set (the set of documents among which the search is to be conducted; see Section 2.4). Its general form is:

    FIND <term expression> FROM <set expression>

The keyword is "FROM", which directs EUREKA to search for the words in the <term expression> only among the documents that meet the requirements of the <set expression>. The parameter <set expression> is an expression involving query sets and documents. Query sets may be referred to by either query number or query set name, while documents are referred to by a list of document numbers separated by commas and enclosed by square brackets. The general form for a <set expression> is:

    <set ID> <operator> <set ID> ...

where <set ID> is either a query set number, a query set name, or a document list as described above, and <operator> is one of the following: "*", "+", or "-". Since the concept of a <set expression> is difficult to describe rigorously in English, let us resort to some examples.

    FIND 'ALPHA' FROM 1

directs EUREKA to find all documents that responded to query #1 and also contain the word "ALPHA".

    F 'ALPHA' FROM 1 * 2

directs EUREKA to find all documents that responded to both query #1 and query #2 and also contain the word "ALPHA".

    F 'ALPHA' FROM 1*3+2

directs EUREKA to find all documents that responded to either query #2 or to both query #1 and query #3, and that, in addition, contain the word "ALPHA".

    FIND 'ALPHA' FROM 1 - 2

directs EUREKA to find all documents that responded to query #1 but did not respond to query #2, and also contain the word "ALPHA".
    F 'ALPHA' * 'SOMETHING' IN SENTENCE FROM 1+[1,24,3]

directs EUREKA to find all documents that responded to query #1 and contain the words "ALPHA" and "SOMETHING" in the same sentence, and to also search documents 1, 24, and 3 for the occurrences of the search terms.

    FIND 'ALPHA' + 'BETA' FROM 1 -[3,19]

directs EUREKA to search all documents responding to query #1, except documents #3 and #19, for an occurrence of either the word "ALPHA" or the word "BETA".

    F 'ALPHA'

directs EUREKA to search all documents in the data base for any that contain the word "ALPHA".

    F 'ALPHA' FROM LAST

directs EUREKA to search for documents containing the word "ALPHA" among all documents responding to the last query. That is, if this is query #4 then all documents responding to query #3 are used as the search set (just as if "FROM 3" had been specified). However, if this happens to be query #1, then the entire data base is searched because there are no preceding queries from which to search.

    F 'ALPHA' FROM CHEZFAC - 3

directs EUREKA to search for documents containing the word "ALPHA" among all the documents in the query set named "CHEZFAC" by the user, except any documents that responded to both query #3 and the query named "CHEZFAC" by the user.

Let us note in passing that "*" is used as a Boolean "AND" operator, "+" is used as a Boolean "OR" operator, and "-" is the Boolean "RELATIVE COMPLEMENT". Note also that parentheses may not be used to alter the order of evaluation of the set expression. If one wishes to have a complicated expression of set names/numbers not obtainable by the from clause, one must use the MAKE statement (see Section 2.7.3) to obtain a set equivalent to the desired set expression.

2.7.1.3 QUERY SET NAMING CLAUSE

Since most of us will not want to keep track of large numbers of relatively easy to forget query numbers, EUREKA allows the user to specify a mnemonic name for any query set he/she creates.
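A set name such as CHEZFAC can stand anywhere a query number can in a set expression. As a rough illustration of how the set expressions of the from clause combine numbered sets, named sets, and bracketed document lists, here is a small Python sketch. It is not EUREKA's evaluator. The names `tokenize`, `operand`, and `evaluate`, and the dictionary of query sets, are hypothetical. The manual's examples show "*" being applied before "+"; treating "-" at the same level as "+", applied left to right, is an assumption.

```python
import re

# One token: a bracketed document list, a set number or name, or an operator.
TOKEN = re.compile(r"\s*(\[[\d,\s]*\]|[A-Za-z0-9]+|[*+-])")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected text at {expr[pos:]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def operand(token, sets):
    """Turn one token into a set of document numbers."""
    if token.startswith("["):                       # explicit document list
        return {int(n) for n in token[1:-1].split(",") if n.strip()}
    return sets[int(token)] if token.isdigit() else sets[token]


def evaluate(expr, sets):
    tokens = tokenize(expr)
    if len(tokens) % 2 == 0:
        raise ValueError("expected: operand (operator operand)...")
    # First pass: fold every "*" (AND) into the operand on its left.
    terms, ops = [operand(tokens[0], sets)], []
    for op, tok in zip(tokens[1::2], tokens[2::2]):
        value = operand(tok, sets)
        if op == "*":
            terms[-1] = terms[-1] & value
        else:
            ops.append(op)
            terms.append(value)
    # Second pass (assumed order): "+" (OR) and "-" (relative complement),
    # applied left to right. No parentheses are supported.
    result = terms[0]
    for op, value in zip(ops, terms[1:]):
        result = result | value if op == "+" else result - value
    return result


# Hypothetical query sets: number or name -> document numbers.
sets = {1: {1, 2, 3, 4}, 2: {3, 4, 5}, 3: {1, 4, 9}, "CHEZFAC": {4, 9, 10}}
print(sorted(evaluate("1*3+2", sets)))
print(sorted(evaluate("CHEZFAC - 3", sets)))
```

Because the sketch has no parentheses, an expression whose grouping cannot be written this way would have to be built up in steps, which is the role the manual assigns to the MAKE statement.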
One method of assigning a set name is via the set name clause attached to either a "FIND" or "MAKE" statement (another method is via the "CHANGE" statement, which will be described later). The general form for the query set name clause is:

    FIND <term expression> = <set name>

in which "=" is the keyword that signals EUREKA that what follows is a name the user wishes to have associated with the query set that will result from this FIND statement. <set name> may be any string of up to ten letters and/or numbers (no special characters like +, ", or <) that meets the following restrictions:

1) Must not begin with a number
2) Must not be any of the following words: "ALL", "FROM", "COMMENTS", "MACRO", or "LAST".

Examples of FIND statements containing set naming clauses are:

    FIND 'DOG#' FROM 3 = DOGSET

which directs EUREKA to search all documents in query set #3 for words that begin with the letters "DOG" and then name the resulting query set "DOGSET".

    F 'ANORAK' + 'CAGOULE' IN TITLE FROM ALL = RAINCOATS

which directs EUREKA to search the entire data base for documents containing either the word "ANORAK" or the word "CAGOULE" in the title and then name the resulting query set "RAINCOATS".

    F 'CHEESE' * 'FACTOR#' IN SENTENCE = CHEZFAC

Now turn back to Section 2.3 and study the example there.

2.7.1.4 COMMENTS CLAUSE

The next option we shall discuss is the comments clause, which is used for attaching user comments to a query set. Comments are a mechanism for writing notes to oneself that may be retrieved at a later time via the PRINT statement. These comments may be a statement of the purpose of creating this particular query set, the number of documents in the set, or anything else the user feels to be of interest. The general form for the comments clause of the FIND statement is:

    FIND <term expression> " [...]

In the DEFINE statement, "DEFINE" is the keyword that tells EUREKA what to do with the rest of the command, and <term expression> is as defined for the FIND statement (Section 2.7.1).
The "=" is a keyword to separate the search expression from the name the user wishes to assign to the macro (<macro name>). The <macro name> must follow the same rules set forth for the <set name> (Section 2.7.1). Examples of DEFINE statements are:

    DEFINE 'TICKS' + 'FLEAS' = BUGS

which defines the macro used in the macro example above.

    DEF 'FLEAPOWDER' * BUGS = CORES

Note that macros may be used within definitions of new macros. Also note that the keyword "DEFINE" may be abbreviated "DEF".

2.7.3 MAKE STATEMENT

The MAKE statement is used to compare two or more query sets and generate a new query set based on the results of the comparison. The basic form of the MAKE statement is:

    MAKE <set expression>

where "MAKE" is the keyword (which may be abbreviated "M"), and <set expression> is the same as the <set expression> defined for the FIND statement (see Section 2.7.1). This is because the MAKE statement is, in effect, an explicit method of creating new sets of documents from old sets and explicit document numbers. The difference between a <set expression> created by a from clause of a FIND statement and a query set created by a MAKE statement is that a <set expression> is temporary only and may not be referred to again without explicitly re-creating it, while a query set created by a MAKE statement is given exactly the same status as a query set created by a FIND statement. It may be named, have comments attached to it, and be referred to by query number or query name in a later query.

Examples of basic MAKE statements are:

    MAKE 3*2

which directs EUREKA to create a new query set consisting of all documents that appear in both query set #3 and query set #2.

    MAKE 3 + 2 * CHEZFAC

which directs EUREKA to create a new query set composed of all documents that are either in query set #3 or in both query set #2 and the query set named "CHEZFAC" by the user.
    MAKE 3 + [7,26,8]

which directs EUREKA to create a new query set composed of documents 7, 26, and 8, and also all documents that are in query set #3.

The options for the MAKE statement are the query naming clause and the comments clause. Both of these options function exactly like their counterparts for the FIND statement, so the reader is directed to Section 2.7.1 if further description is required. The general form of the MAKE statement is:

    MAKE <set expression> = <set name> " [...]

    CHANGE <set ID> TO <set name>

"CHANGE" is the keyword, which may be abbreviated "CH", that informs EUREKA that a name assignment/change follows. "TO" is a keyword to separate the old set ID from the new. <set ID> is either a query number or a query set name. <set name> is the new query set name to be assigned to the query set identified in <set ID>. This name must obey the rules described for query set naming in Section 2.7.1. Examples of CHANGE statements are:

    CHANGE 3 TO GOODSET

which assigns the name "GOODSET" to query set #3.

    CH GOODSET TO BADSET

which changes the name of the query set currently named "GOODSET" to "BADSET".

Macros may be renamed by following the word "CHANGE" by the word "MACRO". An example is:

    CHANGE MACRO TX34J TO FRED

which changes the name of a macro named "TX34J" to "FRED". Note that the keyword "MACRO" may not be abbreviated.

2.7.5 COMMENT STATEMENT

The COMMENT statement is used to assign comments to query sets or individual documents. These comments may be retrieved upon demand by use of the PRINT statement. The general form of the COMMENT statement is:

    COMMENT <set or document ID> "COMMENTS"

"COMMENT" is the keyword (which may be abbreviated "CO") to inform EUREKA that what follows is a set or document identifier for the set or document the user wishes to add a comment to, and the double quotes act as delimiters for the comment string. The <set or document ID> must be either a query number, a query set name, or a document number enclosed in square brackets ([ ]). The comment string must follow the rules described in Section 2.7.1.
Examples of COMMENT statements are:

    COMMENT 3 "SOME COMMENT STRING"
    CO [19] "VERY GOOD PAPER ON ""FRABBLEGIBBETS"""

The first example merely attaches the comment

    SOME COMMENT STRING

to query set #3, while the second attaches the comment string

    VERY GOOD PAPER ON "FRABBLEGIBBETS"

to document #19.

2.7.6 DELETE STATEMENT

The DELETE statement is used to delete query sets and/or comments from the user file area. Once a query set has been deleted it cannot be referred to in a MAKE statement or a from clause, but it no longer takes up space in the user's file. Since each user is assigned only one cylinder of disk, it is important to remove unwanted query sets and comments when they are no longer needed. The DELETE statement has three forms. The first looks like this:

    DELETE <set ID list>

"DELETE" is the keyword, and may be abbreviated "DEL". The <set ID list> is a list of query set names and query set numbers of query sets that the user wishes to have deleted. This will remove both the query set and all associated comments. Examples are:

    DELETE 3,7,JONKSET
    DEL BADSET

The second form is:

    DELETE COMMENTS <set and document list>

where "DELETE COMMENTS" is the keyword informing EUREKA to remove all user comments attached to the query sets and/or documents that make up the <set and document list>. The keyword may be abbreviated "DEL COMMENTS". The query sets in the list may be referred to by either query number or query set name, while the documents must be referred to by a list of document numbers separated by commas and enclosed in a single set of square brackets. When this form of the DELETE statement is used, only the comments attached to a query set or document are deleted, so the query sets may be referred to in later statements (but the comments are no longer available). Examples are:

    DELETE COMMENTS 3,[15,23,5],GOODSET
    DEL COMMENTS [18]

Note that users may delete comments from documents, but are not allowed to delete the actual documents.
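The difference between the first two forms of DELETE can be made concrete with a toy model of the user file area. The Python below is only a sketch: the class `UserFile` and its attribute and method names are hypothetical, and nothing here reflects how EUREKA actually lays out its files.

```python
class UserFile:
    """Toy stand-in for one user's file area (hypothetical layout)."""

    def __init__(self):
        self.query_sets = {}    # set ID -> set of document numbers
        self.set_comments = {}  # set ID -> list of comment strings
        self.doc_comments = {}  # document number -> list of comment strings

    def delete_sets(self, set_ids):
        """First form: remove each query set and its associated comments."""
        for set_id in set_ids:
            self.query_sets.pop(set_id, None)
            self.set_comments.pop(set_id, None)

    def delete_comments(self, set_ids=(), doc_numbers=()):
        """Second form: remove comments only; sets and documents remain."""
        for set_id in set_ids:
            self.set_comments.pop(set_id, None)
        for number in doc_numbers:
            self.doc_comments.pop(number, None)


uf = UserFile()
uf.query_sets = {3: {1, 2}, 5: {2}}
uf.set_comments = {3: ["SOME COMMENT STRING"], 5: ["ANOTHER"]}
uf.doc_comments = {18: ["VERY GOOD PAPER"]}

uf.delete_comments(set_ids=[3], doc_numbers=[18])  # like DEL COMMENTS 3,[18]
uf.delete_sets([5])                                # like DELETE 5
print(uf.query_sets, uf.set_comments, uf.doc_comments)
```

After the two calls, query set 3 still exists but carries no comment, while query set 5 and its comment are both gone. The model offers no way to delete a document, only the comments attached to it.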
The third form of the DELETE statement is:

    DELETE MACRO <macro list>

As one would expect, this form is used for deleting macro definitions. "MACRO" is the keyword informing EUREKA that the following list is a list of macro definitions to be deleted rather than a list of query sets to be deleted. Note that this command may be abbreviated to "DEL MACRO <macro list>".

If the user wishes to delete all or most of his/her macros, query sets, and/or comments, he may specify that he wishes EUREKA to delete all sets/macros except the ones he wishes to save. This is done by putting the names and/or numbers of the queries/macros to be saved in the <set ID list>/<macro list> and preceding the <set ID list>/<macro list> with the keywords "ALL EXCEPT". Similarly, "DELETE ALL", "DELETE COMMENTS ALL", and "DELETE MACRO ALL" delete all query sets, comments, and macros respectively, with no exceptions. Examples of the use of "ALL" are:

    DELETE ALL EXCEPT 5

which deletes all query sets and comments except query set #5.

    DEL COMMENTS ALL

which deletes all user assigned comments.

    DELETE MACRO ALL EXCEPT BIGMAC

which deletes all macro definitions except the one named "BIGMAC".

2.7.7 LOGON STATEMENT

The LOGON statement is used to identify the user to the EUREKA system so that it can retrieve the user files and initialize a workspace for the user. The form of the LOGON statement is:

    LOGON <user ID>

where "LOGON" is the keyword (it may not be abbreviated), and <user ID> is the (up to) six letter identification code assigned to each user. If a person does not have a user ID, he may still use all facilities of the system except the storage of results between terminal sessions. If a user enters a query without first typing in a "LOGON" command, he is automatically logged in as a public user and allowed full access to the system. However, as soon as the public user logs off the system, all of his query sets and comments are erased in order that the next public user may start with a clean slate.
Examples are:

    LOGON A3UKR7
    LOGON FRED

2.7.8 LOGOFF STATEMENT

The LOGOFF statement is used to log out from the system after a session. It tells EUREKA to save all the user files and free the user's workspace. Its form is:

    LOGOFF

and there are no variations on its form.

2.7.9 PRINT STATEMENT

The PRINT statement has three uses. It may be used to print all or part of any document. It may also be used to print information about previous queries and their related query sets. Another use is to print macro definitions.

The form of the PRINT statement used for printing query set/statement information is:

    PRINT <set ID 1> TO <set ID 2>

where <set ID 1> and <set ID 2> are either query numbers or query set names. This will cause the following information to be printed for each query set with a query number between that of <set ID 1> and <set ID 2>: query number, query set name (if present), query text, list of all documents making up this query set and their relative rank, and any comments associated with this query set. If "TO <set ID 2>" is omitted, only the information for the set specified by <set ID 1> is printed. Examples are:

    PRINT 3 TO LAST
    PRINT JOE
    PRINT 3

The form of the PRINT statement used for printing all or part of documents is:

    PRINT <context list> FROM <set list>

where <context list> is the list of context items that the user wishes to have printed. Any list of context terms is valid here, as long as they are meaningful. If "PARAGRAPH" or "SENTENCE" is specified here, EUREKA looks at the query statement that generated the set list from which we wish to print information and then prints all paragraphs or sentences containing the search terms specified by the query statement. Therefore, it is not meaningful to command EUREKA to print a sentence from a set created by a MAKE statement, since there are no search terms in the MAKE statement to search for. If no <context list> is specified, the default value assumed is "DOCUMENT".
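Context terms in a PRINT statement may be abbreviated in the same way as in the context clause (Section 2.7.1.1). The Python sketch below shows one way to read that abbreviation rule together with the "DOCUMENT" default. The function names are hypothetical, and resolving an abbreviation to the first matching term in list order is an interpretation of the manual's wording, not a description of the actual parser.

```python
# Context terms in the order given in Section 2.7.1.1.
CONTEXTS = [
    "SENTENCE", "PARAGRAPH", "DOCUMENT", "ARTICLE", "DATA", "AUTHOR",
    "TITLE", "SOURCE", "DATE", "PAGES", "MISC", "INDEX", "KEYS", "TEXT",
    "ABSTRACT", "BODY", "NOTES", "REFERENCES", "COMMENTS",
]


def resolve_context(abbrev: str) -> str:
    """Expand a possibly truncated context term.

    Assumed reading of the rule: the first term in list order that begins
    with the abbreviation wins, so "A" is ARTICLE and "AU" is AUTHOR.
    """
    abbrev = abbrev.strip().upper()
    if not abbrev:
        raise ValueError("empty context term")
    for name in CONTEXTS:
        if name.startswith(abbrev):
            return name
    raise ValueError(f"unknown context term: {abbrev!r}")


def parse_context_list(text: str) -> list:
    """Parse a comma-separated context list; default to DOCUMENT if empty."""
    items = [part for part in text.split(",") if part.strip()]
    return [resolve_context(part) for part in items] or ["DOCUMENT"]


print(parse_context_list("TITLE, AU"))
print(parse_context_list(""))
```

Under this reading "DA" selects DATA, so DATE can only be requested by spelling it out in full.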
<set list> may be either of the following:

1) Set name or query number;
2) List of document numbers separated by commas and enclosed by square brackets ("[" and "]").

The command "PRINT <context list> FROM <set ID>" will cause the portions of documents specified by the <context list> of each document in the set list of that query to be printed. The command "PRINT <context list> FROM [Doc#1,Doc#2,...,Doc#N]" causes the portions (specified by the <context list>) of the documents numbered "Doc#1", "Doc#2", etc. up to "Doc#N" to be printed.

All output from a PRINT statement is routed to the user's terminal, unless he ends the PRINT statement with "ON LP", in which case all output is routed to the line printer. Examples are:

    PRINT JOE

which prints all information (as described above) about the query and query set named "JOE".

    PRINT JOE TO 14

which prints the query set information for every query set with a query number between 14 and that of set "JOE" (whether "JOE" has a lower or higher number than 14).

    PRINT FROM JOE ON LP

which prints the entire document text of every document in query set "JOE" on the line printer. *** WARNING! BE CAREFUL WHEN USING THIS COMMAND, AS IT CAN EASILY GENERATE IMMENSE AMOUNTS OF OUTPUT ***

    PRINT TITLE FROM 3

which prints the title of every document in query set #3.

    PRINT TITLE, AUTHOR FROM [3]

which prints the title and author of document number 3.

    P FROM [3,43,22] ON LP

which directs EUREKA to print documents 3, 43, and 22 on the line printer. Notice that "PRINT" may be abbreviated by "P".

    P SEN FROM NEWSET

which prints every sentence in each document in the query set "NEWSET" that contains a term from the term expression of the FIND statement that generated "NEWSET".

    PRINT COMMENTS FROM [12]

which prints all user-assigned comments attached to document #12.

    P COMMENTS FROM 12

which prints all comments the user has attached to any document in query set #12.

The third use of the PRINT statement is printing macro definitions.
The form used is:

    PRINT MACRO <macro name>

"PRINT MACRO" is the keyword specifying that the following word is to be taken to be the name of a user defined macro definition and that this macro definition is to be retrieved from the user file and printed. If the word "ALL" is substituted for <macro name>, then all macros and their definitions are listed. The output may be routed to the line printer by adding "ON LP" to the end of the command. Some examples are:

    PRINT MACRO BUGS

which causes the definition of the macro "BUGS" (see Section 2.7.2) to be printed on the terminal.

    P MACRO BIGMAC ON LP

which causes the macro named "BIGMAC" by the user to be printed on the line printer.

    P MACRO ALL

which prints out all macro names and the macro text associated with each macro name.

3 SYSTEM PROGRAMMERS GUIDE

3.1 AN OVERVIEW OF EUREKA

In order to obtain an overview of EUREKA before being faced with the gory details, let us examine a block diagram of the system structure (Fig. 3.1.1). This diagram shows the structure of EUREKA at a task level. The block labelled "Processor" is the actual PDP-11 hardware, which is allocated and controlled by "DOS", the DEC operating system, and by "EXECUTIVE", the EUREKA operating system. "DD" and "DE" are the two Diva 2314-style disk drives where both system and user files reside. "User1" and "User2" are the terminals and associated non-deterministic, non-rational physical cellular automata (1).

(1) sometimes referred to as "humans".

[FIGURE 3.1.1: SYSTEM FLOW DIAGRAM. Blocks shown: Processor, DOS, Executive, Root Node, User Interface, Parsers, Search Supervisor, Set Information Printer, Full Text Searcher, Browse Mode Handler, Index and Postings Handler, Merger, Set Expression Evaluator, Set Handler, disk drives DD and DE, and terminals User1 and User2.]

Neither DOS nor EXECUTIVE shall be explained in detail here, as sufficient documentation exists elsewhere [1,2]. One point we should consider before proceeding, however, is the interrelationship of the two operating systems. DOS, the standard DEC operating system, is used primarily as a bootstrap and low-level software resource by EXECUTIVE, which actually provides almost all of the multi-user scheduling, allocation, and management facilities. All I/O requests, task startup and control, and memory management are done by traps to EXECUTIVE.

Each module is actually an invocation of a collection of one or more object modules that are logically grouped together and may be treated as a single unit by the executive for purposes of memory management and process control. Within tasks, JSR's and JMP's may be used to transfer control between modules, but between tasks, the EXECUTIVE trap $PRFRM must be used in order to maintain EXECUTIVE's task control. Modules within a task share a common memory area, known as the "workspace", which must be allocated by the EXECUTIVE trap $ALOC. This memory area (and any other memory allocated by a task) may be accessed by any routine in the task and by any routines in tasks initiated via $PRFRM traps by the task, but by no others.

The Root Task or Root Node (ROOT) is an initialization and housekeeping task used to initialize user and system files, etc. It is the first module to be performed by EXECUTIVE upon starting up the system.

User Interface (USRNTF) is the window between the users and the internal EUREKA routines. It acts as a terminal handler and message router by accepting and formatting commands from the user, performing the Parser, and then displaying any error messages generated by lower-level routines and/or prompting the user for his next command. There is one invocation of User Interface in existence for each user in the system at any given time. The Root Task starts up one invocation of the User Interface for each terminal attached to the system (this is determined by assembled-in constants in the code) at the time EUREKA is initialized.
This invocation remains in existence for the duration of the execution (at least, until "SHUTDOWN" is typed in to shut EUREKA down). All EUREKA routines (except initialization routines) access user-dependent structures by way of pointers and tables passed to them by higher-level routines. This allows EUREKA routines to be reentrant, greatly simplifying the process of adding more users to the system and minimizing memory usage, since only one copy of the code need exist to serve all users.

The Parser (PARSER) decodes user commands by examining the command string typed in by the user. It creates either a Search Supervisor Table or a Set Handler Table that describes the services requested by the user and contains all the information needed by lower-level routines to perform these services. The Parser then performs the correct action routine. If the command parsed requires either full-text searching or set list merging (FIND, MAKE, or PRINT FROM), the Search Supervisor is performed upon completion of the parse. If the command parsed was PRINT, then the Set Information Printer (INFOPT) is performed. LOGON and LOGOFF are handled internally by the Parser. All other commands (CHANGE, DEFINE, COMMENT, and DELETE) cause the Set Handler to be performed. Upon completion of the action routine, control is returned to the Parser, which immediately returns control to the User Interface.

The Search Supervisor (SRCHSP) is primarily a sequencing routine that controls the operation of the EUREKA routines used in a search. The Merger (MERGE), Index and Postings Handler (IPHNDL), Full Text Searcher (FTSRCH), and Set Handler (SETHLR) are all used by the Search Supervisor to complete a search.

The Merger (MERGE) is used to merge lists of documents together in order to construct lists of documents meeting the conditions of a Boolean function specified by the user in his/her search command.
It is hoped that eventually the Merger will become a manager for a hardware merge unit now under construction.

The Index and Postings Handler (IPHNDL) is used to evaluate one term at a time. It is given one search term by the Search Supervisor and produces a list of documents in which this term appears. Note that in the case of a term containing several tokens, e.g. 'FULL TEXT', the Index and Postings Handler returns a list of documents containing both the words "FULL" and "TEXT" with no assurance that the string "FULL TEXT" actually appears. In this case, the Index and Postings Handler marks each document in the list by setting the full-text search bit in its descriptor (described in Sec. 3.8). The Full Text Searcher must be used to determine if the words actually appear in the correct relationship within the documents listed by the Index and Postings Handler.

The Full Text Searcher (FTSRCH) performs the actual comparison of search strings from the user's command to the text of documents whenever a full-text search must be performed.

The Set Handler (SETHLR) performs all maintenance and accession of the user's personal files. All entries, deletions, and reads to/from this file must be done through this routine.

The Set Information Printer (INFOPT) is used to retrieve information on previous queries and/or macros typed in by the user. It uses the Set Handler to retrieve the desired information from the user's personal file, formats it, and then displays it on either the user's terminal or the line printer.

Now let us take a quick look at the information structures manipulated by EUREKA. There are effectively four types of information (excluding EXECUTIVE data) dealt with by EUREKA: the user's command string; the documents in the database and their associated accession mechanism; the user's Logon Block and personal file; and command tables passed from one task to another.
The user's command string is entered by the user through the keyboard and is passed, along with a pointer to the user's Logon Block, to the Command Parser. The Logon Block is effectively EUREKA's record of all system information specific to one user. The Command Parser then builds either a Search Supervisor Table or a Set Handler Table, depending on which action task is to be performed. If the command is MAKE, FIND, or PRINT FROM, then a Search Supervisor Table is constructed; otherwise a Set Handler Table is constructed. The table is then passed to the correct action routine (again, along with a pointer to the user's Logon Block). The action routines use the tables passed to them to determine what actions are to be performed (e.g. read a set list, search on a list of terms, etc.) and the Logon Block to find the correct user file to use and other such user-specific data. The document file and its associated accession mechanism (including the index, postings, and hash files) are used to perform the actual searches and to display text on the user's terminal. The user's personal file is used to store the record of his/her past searches, along with any macro definitions or comments attached to sets or documents by the user.

All the above tasks and information structures will be described in greater detail in the following sections.

3.2 ROOT TASK

The first module we shall consider is the Root Task or Root Node (ROOT). This is a relatively uncomplicated module that essentially gets things started for the rest of the system and then closes up shop when the system is shut down.
The Root Task starts the system up by:

1) Calling (via a JSR) subroutine IRINIT, which opens all files except the user files, .INITs the terminals, and sets up some non-relocatable scratch spaces;
2) Performing n copies of the User Interface, where n is an assembled-in constant with global name LGNUM, passing each the address of a different Logon Block. The layout of the Logon Block is shown in Fig. 3.2.1;
3) Doing a TRAP 5 to wait until all n copies of the User Interface have executed $RETN traps (i.e. SHUTDOWN has been typed in at all terminals).

Once all copies of the User Interface have died, the Root Task closes all relevant files and then executes a $RETN trap to return control to the EXECUTIVE, thus shutting down the system.

[Figure 3.2.1. User Logon Block. Fields shown, in order: USER ID; CURRENT QUERY #; DATABASE ID; FLAGWORD; LINK BLOCK POINTER; FILE BLOCK POINTER; TRAN BLOCK POINTER; CURSOR; FREE CHUNKS; DISP INTO BMAP; BITMAP; 1st FREE DIR BLK; LINK BLOCK (8 words); FILE BLOCK (7 words); TRAN BLOCK (5 words); TTY LINK BLK PTR; TTY LINK BLOCK (8 words); PTR TO "LAST" SET DIR; "LAST" SET (14 words).]

3.3 USER INTERFACE

The User Interface (USRNTF) is the user's window into the system. Each terminal has one invocation (task) of the User Interface associated with it (via the Logon Block passed to the task at the time it was initiated via a $PRFM trap by the Root Task). This User Interface task handles query prompting, formatting, Parser invocation, error message display, and statistics recording for the user at its associated terminal and does not die until "SHUTDOWN" is typed in at the terminal.

The first action performed by the User Interface is the setting up of a buffer/statistics area. One component of this buffer is the text buffer that is passed to the Command Parser (and hence the rest of the system). Another large section of the buffer is the statistics area in which all timing and frequency statistics are recorded.
Double buffering is used for the statistics area in order to decrease the number of I/O requests made during operation of the system. Once the buffers have been initialized, the User Interface enters a loop in which it:

1) Clears and resets the least recently used statistics block;
2) Re-initializes the byte count in the terminal I/O buffer header;
3) Reads the next query typed in by the user into the text buffer;
4) Checks to see if the query was too long;
5) Checks for continuation to another line, and loops back to (2) if so;
6) Checks for "SHUTDOWN" having been typed in, and goes to the shutdown routine if so;
7) Performs the Parser, passing it pointers to the Logon Block and the query text string;
8) Displays an error message (if any), or the "COMMAND COMPLETE" message if no error has occurred, when control is returned from the Command Parser;
9) Records statistics into the buffer and writes the block containing both buffers if both buffers are full;
10) Loops back to (2).

The shutdown routine writes the user statistics block out to disk (one buffer may be empty, depending on whether the user typed in an even or odd number of queries), and then executes a $RETN trap, returning control to the Root Task.

3.4 COMMAND PARSERS

The Command Parser module consists of a collection of different routines that each parse one EUREKA command, plus several subroutines used to perform common functions. Access to the Command Parser module is through routine FIND, which parses the FIND, DEFINE, LOGON, and LOGOFF statements. FIND initially allocates workspace for the entire Command Parser module (placing the address in R5), fills in the address of the beginning and end of the query text in the workspace, and then does a JMP to the correct Command Parser routine based on the first two or three letters of the command. The individual Command Parser routines shall not be described here since they are very straightforward linear-scan, fill-in-the-table routines.
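The routing from command keyword to action routine described in Sections 3.1 and 3.4 can be sketched as follows. This is a minimal illustration in Python, not the PDP-11 code: the function name `action_routine` is hypothetical, and the sketch matches whole keywords rather than the first two or three letters that the real Parser examines. The task names (SRCHSP, INFOPT, SETHLR, PARSER) and the command groupings are taken from the text.

```python
# Hypothetical sketch of the Parser's choice of action routine.
SEARCH_SUPERVISOR = "SRCHSP"   # FIND, MAKE, PRINT ... FROM
SET_INFO_PRINTER = "INFOPT"    # PRINT without FROM (set or macro information)
SET_HANDLER = "SETHLR"         # CHANGE, DEFINE, COMMENT, DELETE
INTERNAL = "PARSER"            # LOGON and LOGOFF are handled inside the Parser


def action_routine(command: str) -> str:
    """Return the name of the task that would service this command."""
    words = command.upper().split()
    if not words:
        raise ValueError("empty command")
    verb = "PRINT" if words[0] == "P" else words[0]  # "P" abbreviates PRINT
    if verb in ("FIND", "MAKE"):
        return SEARCH_SUPERVISOR
    if verb == "PRINT":
        # PRINT FROM needs document text, so it goes through the search path.
        return SEARCH_SUPERVISOR if "FROM" in words[1:] else SET_INFO_PRINTER
    if verb in ("LOGON", "LOGOFF"):
        return INTERNAL
    if verb in ("CHANGE", "DEFINE", "COMMENT", "DELETE"):
        return SET_HANDLER
    raise ValueError("unrecognized command: " + words[0])
```

For example, `action_routine("P TITLE FROM 3")` selects the Search Supervisor, while `action_routine("PRINT JOE TO 14")` selects the Set Information Printer.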
Once the Command Parser routine has constructed a command table of the appropriate type, it then initiates via a $PRFM trap either the Set Handler, Search Supervisor, or Set Information Printer, depending on instruction type. As soon as the action routines execute $RETN traps, returning control to the Command Parser, the Command Parser executes a $RETN trap, returning control to the User Interface in order to begin the cycle again.

3.5 SET HANDLER

The next task we shall consider is the Set Handler (SETHLR), which maintains the users' personal files. All changes to the user file or retrievals therefrom must be made by this task in order to maintain the integrity of the user file structure.

3.5.1 USER FILE STRUCTURE

Each user in the EUREKA system is assigned a personal disk file in which all user-specific records are stored. Since user information is dynamic and of varying lengths and types, we must have an access/storage system that can cope efficiently with rapidly changing, non-homogeneous data. This system should also seek to minimize the number of disk accesses required by common functions, as they are currently one of the more troublesome bottlenecks in EUREKA.

The data structure chosen for this task is shown in Fig. 3.5.1.1. It exists in the medium of a disk file consisting of one cylinder of disk. This gives us 240 contiguous blocks of 256 16-bit words. Since the blocks are allocated in one cylinder they may all be accessed without moving the heads of the disk unit, thus avoiding some seek time on sequential reads. The file is accessed in relative (.BLOCK) mode, giving us a block address space of 0-239 and a byte address space of 0-511 within each block. User information is stored in blocks 7-239, with blocks 0-6 being used as directory space. The storage blocks (7-239) are divided (for allocation purposes) into chunks of 64 bytes (8 per block). File space is allocated in chunks starting at the last block (239) of the file and grows toward the front of the file. Chunks within each block are allocated from byte 448 proceeding back to byte 0. Since Set Handler routines requesting disk space from the Bitmap Handler are only given a starting address, which is actually the lowest byte of the lowest block number of the contiguous space allocated them, only the Bitmap Handler (ALOCD) need be concerned with this allocation pattern.

[Figure 3.5.1.1. User File Structure. The diagram shows the bitmap, macro directory, query directory, and user file space. A macro directory entry (macro name, block number, offset) points to a length word followed by the macro text. A query directory entry (query number, query name, and block number/displacement pairs) points to the query text, the set list, the first comment, and the last comment, each stored with its length.]

The Bitmap Handler keeps track of which chunks of disk are in use by recording their status in a bitmap which occupies the last 240 bytes of the user's Logon Block when the user is logged in, or the first 240 bytes (0-239) of block 0 of the user's disk file when logged out. In this bitmap, a bit value of "0" implies the corresponding chunk is in use, while a bit value of "1" implies that the chunk is free. The mapping scheme that allows us to associate bits with chunks will be discussed along with the Bitmap Handler in Sec. 3.5.8.

The rest of the first block (block 0) of the user's file is occupied by a one-byte (byte 240) free directory block number, a one-byte (byte 241) valid/invalid flag for the bitmap, and the user's macro directory in words 121-253 (bytes 242-507). See Figure 3.5.1.2 for details. The free directory block number is the relative block number of the first directory block (block 1-6) that has an unused entry in it. The valid/invalid flag is used to prevent users from accidentally wiping out information in their files by logging back on after a system crash or other calamity occurring while they were logged on has destroyed the copy of their bitmap in their Logon Block before it could be rewritten to disk.
Whenever a user logs on, the bitmap is read in from their disk file and stored in their Logon Block and the valid/invalid flag is set to "-1" as a flag that the bitmap is no longer current. When the user logs back out, the updated bitmap is transferred from the user's Logon Block back to the first 240 bytes of block 0 of the user's file and the valid/invalid flag is set to zero to indicate that the bitmap is current again. Should the system crash while the user is logged on, the bitmap must be rebuilt by the off-line routine BITFIX, which builds a new bitmap for the user based on the current contents of the user's file.

Words 121 through 253 of the first block are taken up by the user's macro directory. The macro directory consists of 19 7-word blocks structured as in Fig. 3.5.1.2. A 5-word (10 character) macro name is followed by the starting address (block number and offset within block) of the macro text. At this starting address will be found one word containing the length of the macro text in characters, followed by the macro text. This directory (and the entire file) is maintained by the Set Handler routine ALOCD, the Bitmap Handler, which uses the first word of each macro directory slot as a flag for allocation of directory slots. If the first word of a directory slot contains zero (in binary, not the character), that directory slot is free; any other value shows that the directory slot is in use as a pointer to some macro text.

The next six file blocks (1-6) are occupied by the query set directory, as shown in Fig. 3.5.1.3. This directory is structured much like the macro directory, the main difference being more fields within each directory slot. Each query set directory slot can be seen to be a 14-word long block containing a query set number, query set name, and pointers to the starting addresses of all of the pertinent information on disk that makes up the query set.
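As an illustration of the block 0 layout just described, the following Python sketch unpacks a 512-byte image of block 0 into its bitmap, free directory block number, valid/invalid flag, and macro directory. The function name `parse_block0` and the returned dictionary are hypothetical. The byte positions come from the text; little-endian word order is the PDP-11 convention, while storing the name as ten ASCII bytes in reading order is an assumption.

```python
import struct

BLOCK_BYTES = 512        # 256 16-bit words per block
BITMAP_BYTES = 240       # bytes 0-239 of block 0
MACRO_DIR_START = 242    # word 121
MACRO_SLOTS = 19
SLOT_BYTES = 14          # 7 words: 5-word name, block number, offset


def parse_block0(block: bytes) -> dict:
    """Split an image of block 0 of a user file into its parts."""
    if len(block) != BLOCK_BYTES:
        raise ValueError("block 0 must be exactly 512 bytes")
    flag = struct.unpack_from("<b", block, 241)[0]   # signed: -1 or 0
    macros = []
    for slot in range(MACRO_SLOTS):
        pos = MACRO_DIR_START + slot * SLOT_BYTES
        name, blk, offset = struct.unpack_from("<10sHH", block, pos)
        if name[:2] == b"\x00\x00":
            continue                                 # first word zero: free slot
        # Assumption: names are blank-padded ASCII.
        macros.append((name.decode("ascii").rstrip(), blk, offset))
    return {
        "bitmap": block[:BITMAP_BYTES],              # bit 1 = free, bit 0 = in use
        "free_dir_block": block[240],
        "bitmap_valid": flag == 0,                   # -1 while user is logged on
        "macros": macros,
    }
```

The 19 slots end at byte 507, which leaves the two unused words (bytes 508-511) mentioned in Fig. 3.5.1.2.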
Both the query text and query set list are stored in the same manner as the macro text (length word followed by information). The comments are slightly more complex, as they are stored as a one-way linked list so that comments may be added to existing queries or documents. Refer to Fig. 3.5.1.4 for details of the comment chain structure. For comments, the field labeled "1ST COMM PTR" points to the first comment in the chain. Each comment consists of a length word followed by a two-word link field (block number and displacement) that points to the next comment in the chain. The comment text follows immediately after the link. The last comment in the chain is pointed at by the field of the directory entry labeled "LAST COMM PTR" and is also flagged by having the high-order bit in the first word of the link field set to 1. Also stored in this word is the total length of all comments (retrieved by setting bit 15 to 0).

[Figure 3.5.1.2. Block 0 Detail. BITMAP (120 words); V/I FLAG and FREE DIR ADR; one macro directory entry (MACRO NAME, BLOCK #, OFFSET IN BLOCK); 18 more 7-word macro directory entries as above; 2 unused words.]

[Figure 3.5.1.3. Block 1-6 Detail. One query/document directory entry (Q/D NUMBER; QUERY SET NAME, blank if document; QUERY TEXT ADDRESS; SET LIST ADDRESS; FIRST COMMENT ADDRESS; LAST COMMENT ADDRESS); 17 more query/document directory entries as above; 4 unused words.]

[Figure 3.5.1.4. Comment Detail. LENGTH OF FOLLOWING COMMENT; BLOCK NUMBER OF NEXT COMMENT IN CHAIN; OFFSET IN BLOCK OF NEXT COMMENT IN CHAIN; COMMENT TEXT.]

The tail pointer in the directory entry is used to speed up the attaching of comments by allowing the File Writer routine to avoid running down the chain each time a new comment is to be added. The bit flag is used to signal the end of the list to routines that are running down the list, such as the File Reader or the Delete Routine.
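The comment chain can be illustrated with a small in-memory model. In this Python sketch the "disk" is a dictionary keyed by (block number, displacement), and each comment is a tuple of length word, two link words, and text. The names `read_chain` and `append_comment` and the dictionary-based directory entry are hypothetical; the end-of-chain flag in bit 15, the total length held in the remaining bits of that word, and the use of the tail pointer follow the description above.

```python
END_FLAG = 0x8000        # bit 15 of the first link word marks the last comment
LENGTH_MASK = 0x7FFF     # remaining bits hold the total length of all comments


def read_chain(disk: dict, head: tuple) -> tuple:
    """Follow the chain from the head; return (texts, total length)."""
    texts = []
    addr = head
    while True:
        length, link1, link2, text = disk[addr]
        texts.append(text[:length])
        if link1 & END_FLAG:                  # last comment reached
            return texts, link1 & LENGTH_MASK
        addr = (link1, link2)                 # block number, displacement


def append_comment(disk: dict, entry: dict, new_addr: tuple, text: str) -> None:
    """Attach a comment via the tail pointer, without walking the chain."""
    tail = entry["last"]
    length, link1, _, old_text = disk[tail]
    total = (link1 & LENGTH_MASK) + len(text)
    disk[tail] = (length, new_addr[0], new_addr[1], old_text)   # relink old tail
    disk[new_addr] = (len(text), END_FLAG | total, 0, text)     # new tail
    entry["last"] = new_addr
```

Because block numbers never exceed 239, bit 15 of a genuine link word is always clear, so the flag cannot be confused with an address.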
The total length of all comments field is used to put an upper bound on the buffer size needed to read in user comments. If a query has no comments attached to it, both of its comment pointers are set to "-1" to flag their non-existence.

Document directories occupy the same slots as query slots in order to avoid having three kinds of directories. A document directory entry is distinguished from a query directory entry by having bit 15 of the query/document number field set to 1. Again, in order to retrieve the document number one must clear this bit. The only fields in a document directory entry that are meaningful are the query/document number, as just discussed, and the comment pointers. The comment chain attached to a document is stored in exactly the same form as one attached to a query set. As in the case of the macro directory, the first word of the query/document directory slot is used as a flag word. If this word contains a negative value, then it is a document directory; if it contains a positive number, it is a query directory; if it contains zero, it is an unused directory slot.

3.5.2 SET HANDLER TABLE

The Set Handler may be invoked by the Parser when processing CHANGE, DELETE, or COMMENT statements; the Search Supervisor when processing MAKE or FIND statements; and the Information Printer when processing PRINT commands. These tasks use the EXECUTIVE trap $PRFRM to start up the Set Handler. These routines communicate with the Set Handler via a table known as the Set Handler Table. The address of the Set Handler Table must be placed in register R0 by the performing task. There are effectively three different kinds of Set Handler Tables, as shown in Figs. 3.5.2.1, 3.5.2.2, and 3.5.2.3. The table shown in Fig. 3.5.2.1 is used for all read and write operations (op codes 0-6,12,13,17). The contents of the read/write form of the Set Handler Table are as follows:

OFFSET    CONTENTS
0-1       address of Logon Block
2         command (one byte long only)
3         not used
4-5       query/document # - high-order bit is set to 0 if query #, 1 if document. This word is set to 0 for a macro read or write.
6-15      query/macro name
16-17     address of query/macro text
18-19     address of set list
20-21     address of comments

A list of Set Handler commands is given in Table 3.5.2.4. Note that not all fields of this table will be filled in on every invocation of the Set Handler. Read and write commands need fill in only those addresses pertaining to the items to be read/written (there is one exception, however; see the section describing FILRDR). For instance, if a command of "2" (read query text and set list) is used, then only the addresses of buffers in which to store the query text and the set list need be allocated. Similarly, if a command of "6" (write comments) is used, then only the address of the buffer containing the comments to be attached to the specified set need be filled in. Also, it is not necessary to fill in both the set name and set number on a read, either one being sufficient. The address of the Logon Block, a command, and either a set name or set number must always be present, however. In order to avoid accidental matches to incorrect sets in the directory search, care should be taken to clear unused set name/number fields when only one of the two is being used. When not being used, the query/macro name should be set to blanks, and the query/document number should be set to zero.

[Figure 3.5.2.1. Read/Write Format (cmds 0-6, 14, 15). R0 points to: LOGON PTR; COMMAND; Q/D NUMBER; QUERY/MACRO NAME; PTR TO Q OR M TEXT; PTR TO SET LIST; PTR TO CMT STRING.]

[Figure 3.5.2.2. Delete Format (cmds 7-12, 16, 17). R0 points to: LOGON PTR; COMMAND and # Q/M; then, repeated as necessary, Q/D/M # (0 if macro; high-order bit 0 = Q #, 1 = D #) and QUERY/MACRO NAME.]

[Figure 3.5.2.3. Rename Format (cmds 13, 20). R0 points to: LOGON PTR; COMMAND; QUERY/MACRO #; OLD QUERY/MACRO NAME; NEW NAME.]

ACTION TO BE PERFORMED                                         COMMAND
READ QUERY TEXT                                                0
READ SET LIST                                                  1
READ QUERY TEXT AND SET LIST                                   2
READ COMMENTS                                                  3
WRITE QUERY TEXT AND SET LIST                                  4
WRITE QUERY TEXT, SET LIST, AND COMMENTS                       5
WRITE COMMENTS                                                 6
DELETE FOLLOWING QUERY TEXT, SET LIST, AND COMMENTS            7
DELETE ALL QUERY TEXTS, SET LISTS, AND COMMENTS
  EXCEPT FOLLOWING                                             8
DELETE FOLLOWING COMMENTS                                      9
DELETE ALL COMMENTS EXCEPT FOLLOWING                           10
RENAME QUERY SET                                               11
READ MACRO TEXT                                                12
WRITE MACRO TEXT                                               13
DELETE FOLLOWING MACROS                                        14
DELETE ALL MACROS EXCEPT FOLLOWING                             15
RENAME MACRO                                                   16
READ LIST OF ALL COMMENTED DOCUMENTS                           17
NOT USED                                                       18
READ LIST OF MACRO IDENTIFIERS                                 19
READ QUERY NAME/NUMBER ONLY                                    20

Set Handler Command Codes
Figure 3.5.2.4

The next type of Set Handler Table we shall consider is the delete form of the Set Handler Table (Fig. 3.5.2.2), used only when deleting query sets, comments, or macro definitions (command codes 7-10, 14, 15). Its contents are the same as the read/write form except for several minor variations. Byte three contains the number of items to be deleted, unless a code meaning "delete all except" (codes 8,10,15) is used, in which case the third byte contains the number of set/macro identifiers that are to be saved. Note that in this case a zero in byte three implies "delete all". The only other difference from the read/write form is the absence of buffer addresses. These are replaced by a series of six-word long blocks containing a query number and/or name for each query/macro to be saved or deleted, depending on the command used. On a "delete" command (codes 7,9,14) the identifiers of the sets/comments/macros to be deleted are listed, while in the case of a "delete all except" command (codes 8,10,15) the identifiers of the items to be saved are listed.

The last Set Handler Table to consider is the rename format (Fig. 3.5.2.3). This table is used only for rename commands (codes 11 and 16) and has a very simple layout.
The first three words of the table are identical with the read/write form (except for the command code, of course). These three words are then followed by the old query/macro name (5 words) and the new name to be assigned to the query/macro (5 words).

3.5.3 SET HANDLER SUPERVISOR

Now that we have an understanding of the input to this module, let us consider the routines contained in the Set Handler module and delineate their individual functions. Figure 3.5.3.1 shows us the various routines in the Set Handler and their interconnections. Notice that the only entry path into the Set Handler is through the Set Handler Supervisor (SETHLR). This routine of the Set Handler module acts as a startup routine for all the action routines within the Set Handler. Its functions are:

1) Allocation of workspace for all routines;
2) Translation of all requests for action on the "last" set into a specific set name/number;
3) Determination of which action routine to call (via a "JMP" command).

[Figure 3.5.3.1. Set Handler System Flow Diagram. Blocks shown: Set Handler Supervisor, File Writer, Delete Routine, Bitmap Handler, File Reader, Rename Routine, Directory Searcher.]

The sequence of operations performed by the SETHLR routine is as follows:

1) Check for valid command code;
2) Check to see if this is a reference to the "last" set (signalled by "LAST" in the query name). If "LAST" occurs in any set name slot, then move in the name/number of the "last" set, if it exists;
3) Allocate workspace, put address in R5;
4) Initialize the .TRAN Block in the LOGON Block (using the last 320 (500 octal) bytes of the workspace, starting at byte 72 (120 octal) as the I/O buffer);
5) Do a JMP to the correct action routine.

No file accesses are made from this routine, the "last" set information being obtained from the LOGON Block.
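The read/write table layout of Section 3.5.2 and the first two SETHLR steps can be sketched together. In this Python illustration the helper names (`pack_rw_table`, `table_format`, `resolve_last`) are hypothetical, pointers are modelled as plain 16-bit integers, and the placement of codes 19 and 20 in the read/write group is an assumption, since the text assigns them no table format explicitly.

```python
import struct

# Offsets 0-21: Logon ptr, command byte, unused byte, Q/D number,
# 10-character name, text ptr, set list ptr, comment ptr (little-endian).
RW_FORMAT = "<HBBH10sHHH"

READ_WRITE = {0, 1, 2, 3, 4, 5, 6, 12, 13, 17, 19, 20}  # 19, 20: assumption
DELETE = {7, 8, 9, 10, 14, 15}
RENAME = {11, 16}


def pack_rw_table(logon_ptr, command, number, name,
                  text_ptr=0, list_ptr=0, cmt_ptr=0) -> bytes:
    """Build a read/write-form Set Handler Table; unused name is blanks."""
    field = name.upper().ljust(10).encode("ascii")
    return struct.pack(RW_FORMAT, logon_ptr, command, 0, number,
                       field, text_ptr, list_ptr, cmt_ptr)


def table_format(command: int) -> str:
    """SETHLR step 1: reject invalid codes, report the table form in use."""
    if command in READ_WRITE:
        return "read/write"
    if command in DELETE:
        return "delete"
    if command in RENAME:
        return "rename"
    raise ValueError("invalid Set Handler command code: %d" % command)


def resolve_last(table: bytes, last_number: int, last_name: str) -> bytes:
    """SETHLR step 2: replace a "LAST" reference by the real set identity."""
    fields = list(struct.unpack(RW_FORMAT, table))
    if fields[4].decode("ascii").strip() == "LAST":
        fields[3] = last_number
        fields[4] = last_name.upper().ljust(10).encode("ascii")
    return struct.pack(RW_FORMAT, *fields)
```

In the real system the "last" set name and number would come from the Logon Block; here they are passed in directly to keep the sketch self-contained.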
The only data structures modified by this routine are the LOGON Block (the .TRAN Block is set up and the "last" set name is changed on rename of the "last" set) and the Set Handler Table (the "last" set identifier is moved in if the "last" set is referenced). For further information, refer to the program listing.

3.5.4 FILE WRITER

Now let us consider the File Writer (FILWRT). This routine accepts a Set Handler Table containing a "write" command and pointers to information to be written to disk in the user's personal file. This routine, the delete routine, and the logon/logoff routines are the only ones allowed to alter the user's file.

When entered (via a "JMP" from routine SETHLR), FILWRT expects to find the address of a Set Handler Table in register R0 and the address of its workspace in R5. This Set Handler Table should contain a command code of 4-6 or 13. A set identification, macro name, or document number must also be present, along with a pointer to the user's Logon Block. Last, but not least, there must be some pointers to the buffer(s) containing the information to be written to disk. If a query set is being written out to disk, then either a command of 4 or 5 will be used, depending on whether the user has attached comments to the set or not. The pointers to the query text and to the set list must be present in either case. The pointer to the query text is the address of a buffer containing:

1) A pointer to the Logon Block;
2) A pointer to the carriage return - line feed ending the text string;
3) The actual text of the query.

This strange format is due to the current User Interface/Parser communication protocol. The File Writer computes the length of the query text (without the carriage return - line feed) from the starting address of the text and the address of the CR-LF.
When written to disk, the Logon Block pointer and CR-LF pointer are replaced by a word containing the length of the query text in characters (note that this is identical with the number of bytes of storage). Similarly, the set list pointer points to a buffer containing:

1) The number of documents in the set (1 word);
2) Two words per document, the document number and its relative rank in the set list.

The number of documents is transformed into a length in bytes (i.e. multiplied by 4) before being written to disk in order to simplify the read mechanism. If comments are present, the comment pointer is the address of a buffer containing:

1) Two blank words for use by the File Writer as link words;
2) The length of the comments in bytes;
3) The comment string.

The output format of the comment string will be discussed along with the comment string writing mechanism.

When writing out a new query set, the File Writer first calls routine ALOCD, the disk file space allocator, to obtain a directory entry slot to use for the new query set. As soon as this request is granted, the query set identification number and name are moved into the area in the workspace reserved for building the new query directory entry. Next the query text length is computed and stored at the head of the text in the output buffer. Disk space for the text is then requested by another JSR to ALOCD. When the disk address is returned by ALOCD it is entered in the workspace copy of the directory entry record and is passed (along with appropriate pointers to the text buffer, Logon Block, etc.) to the routine WRTDSK, a subroutine that formats the information into disk block size and writes it to disk. The query set list and comments are then handled in a similar fashion. Upon completion of their transfer to disk the File Writer must then write the directory entry to disk and update the "last" set information in the Logon Block (since the new query set is by definition the new "last" set).
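The two buffer transformations described above (query text and set list) can be sketched as follows. This Python illustration models memory addresses as integers and buffers as byte strings; the function names and the `buffer_addr` parameter are hypothetical, and little-endian 16-bit words are assumed as on the PDP-11.

```python
import struct


def query_text_to_disk(buffer: bytes, buffer_addr: int) -> bytes:
    """Replace the two leading pointer words by one length word.

    buffer holds: Logon Block pointer, pointer to the CR-LF, query text.
    buffer_addr is the (hypothetical) memory address of the buffer itself.
    """
    _logon_ptr, crlf_addr = struct.unpack_from("<HH", buffer, 0)
    text_addr = buffer_addr + 4            # text starts after the two pointers
    length = crlf_addr - text_addr         # CR-LF itself is not counted
    return struct.pack("<H", length) + buffer[4:4 + length]


def set_list_to_disk(buffer: bytes) -> bytes:
    """Convert the leading document count into a length in bytes."""
    (count,) = struct.unpack_from("<H", buffer, 0)
    nbytes = 4 * count                     # two words (4 bytes) per document
    return struct.pack("<H", nbytes) + buffer[2:2 + nbytes]
```

After conversion both records have the same shape, a length word followed by that many bytes of data, which is what allows a single read mechanism to serve query text, set lists, and macro text.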
When these duties have been completed, the File Writer performs the EXECUTIVE trap $RETN, effecting a return of control from the Set Handler to the performing task.

Macro text writes (code 13) and comment-only writes (code 6) are handled somewhat differently. A macro text write resembles a query set write. However, since neither a set list nor comments will be present, the macro directory entry is shorter than the regular query set directory entry, leading to many directory size kludges in the File Writer (see Figs. 3.5.1.1 and 3.5.1.2 for a diagram of the directory layouts). Another important difference is the format of the text buffer passed to the File Writer. Unlike the query text, the macro text is passed in a buffer containing the character count of the text followed by the text itself. Since this more closely resembles the format of the document list (at least after conversion of the document count into a byte count), the macro text is put in the correct format and written to disk by the same section of the File Writer that handles the set list.

Comment writes (code 6) are the most complex operations performed by the File Writer. Three distinct cases must be considered:

1) Adding comments to a query set that currently has no comments;
2) Adding comments to a document that currently has no comments;
3) Adding comments to a query set or document that has previously had comments attached to it.

We discover into which category a given request falls by examining the query/document number and the directory entry for the query set (or existence/non-existence of a directory entry in the case of documents). The subroutine DIRSRH is used to search the directory for this information. If the entity to which the comments are to be attached is a document and no comments have been previously attached, then no directory entry will exist for this document.
If the entity is a query set with no currently attached comments, a directory entry must exist, but the pointer to the comment string will be set to -1. For either a document or a query set with existing comments, the directory entry for the document or query set will contain pointers (disk addresses) to the head and tail of the comment chain for that entity (see Fig. 3.5.1.1 for a diagram of the structure of the comment list).

In light of this information, we can see how the File Writer must proceed. First, the Directory Searcher (DIRSRH) is called to find out if a directory entry exists for this entity and to retrieve a copy if it does (non-existence is flagged by the Directory Searcher by moving -1 to the first word of the buffer in which it has been requested to place the directory copy). If the entity is a document and no directory entry exists, the File Writer performs the following actions:

1) Requests a directory slot and disk space for the comments from ALOCD;
2) Fills out the directory entry with both the head and tail pointers pointing to the newly allocated disk space in which the comments are to be written;
3) Moves "-1" to the first word of the output buffer for the comments (again, see Fig. 3.5.1.4 for the comment chain layout) to flag the non-existence of further links in the chain;
4) Moves the length of the comment string to the second and third words (total comments length and local string length, respectively);
5) Writes out the directory entry and comment string.

Attaching comments to a query set that currently contains no comments is done in a similar fashion, differing only in that the pre-existing directory entry must be altered rather than a new directory entry slot allocated.
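The comment chain layout lends itself to a short illustration. The following is a minimal sketch in Python, not the File Writer's actual code: a dictionary stands in for the disk, a counter stands in for ALOCD, and the names `CommentStore`, `add_comment`, and `read_all` are hypothetical. It covers both the first-comment case above and the append case that the text takes up next, keeping the three leading words of each record (link, total length, local length) as named fields.

```python
# Hypothetical in-memory model of the comment chain; a dict stands in for disk.
NO_LINK = -1  # the text uses -1 to flag "no further link" and "no comments"


class CommentStore:
    def __init__(self):
        self.disk = {}      # disk address -> comment record
        self.next_addr = 0  # stand-in for the ALOCD space allocator

    def _alloc(self):
        addr = self.next_addr
        self.next_addr += 1
        return addr

    def add_comment(self, entry, text):
        """Attach text to a directory entry holding 'head' and 'tail' addresses."""
        addr = self._alloc()
        record = {
            "link": NO_LINK,          # first word: no further links yet
            "total_len": len(text),   # second word: total comments length
            "local_len": len(text),   # third word: length of this string
            "text": text,
        }
        if entry["tail"] == NO_LINK:
            # First comment: head and tail both point at the new record.
            entry["head"] = addr
        else:
            # Append: carry the running total forward and link the old tail.
            old_tail = self.disk[entry["tail"]]
            record["total_len"] = old_tail["total_len"] + len(text)
            old_tail["link"] = addr
        entry["tail"] = addr
        self.disk[addr] = record
        return addr

    def read_all(self, entry):
        """Follow the links from head to tail, returning the comment texts."""
        texts = []
        addr = entry["head"]
        while addr != NO_LINK:
            record = self.disk[addr]
            texts.append(record["text"])
            addr = record["link"]
        return texts
```

Keeping the running total in the tail record means an append only has to read one existing record, which matches the single read of the old tail comment described in the text.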
In the case of adding a new comment to a query set or document that already has one or more comments attached to it, the File Writer must:

1) Read the comment pointed to by the tail pointer of the directory entry;
2) Move the word containing the total length of the existing comments (the second word of the old last comment) to the second word of the new tail comment and add in the length of the new comments;
3) Move the disk address of the new comments to the link field of the old tail comment and rewrite the old tail comment (or the first block thereof if it extends across a block boundary);
4) Update the tail pointer in the directory record to point to the new comment;
5) Write out the new comment record;
6) Rewrite the updated directory record;
7) Check to see if the "last" set has been modified, and update the "last" set information in the Logon Block if so.

3.5.5 FILE READER

Now that we have seen the mechanism used for writing data into the user's files, let us consider the mechanism used for retrieving that data. This routine of the Set Handler is called the File Reader (FILRDR). It is entered by a JMP command from the SETHLR routine whenever SETHLR detects a read op code (0-3, 11-14) in the Set Handler Table. For each of these op codes, the Type I Set Handler Table is used. The only other routine referenced by the File Reader is the Directory Searcher (DIRSRH).

The first operation the File Reader performs (after a minute amount of housekeeping) is to determine what is requested of it. If a list of macros or a list of all comments is requested, then special sections of the File Reader are JMP'ed to (MACALL and COMALL, respectively). If a macro text or any type of set information is requested, then FILRDR must obtain the directory entry for the data to be read.
This is done by one of two methods. If the data requested is anything but information from the "last" set, then the Directory Searcher (DIRSRH) is called via a JSR command. If the information desired is from the "last" set (signalled by a set name of "LAST" in the Set Handler Table), then the section of the File Reader following label LASTRT is used to obtain the information directly from the user's Logon Block, thus avoiding one or more disk reads. In either case, the address of the disk block containing the disk directory for the set/macro to be read is returned, along with the directory entry itself (in working storage), to the File Reader.

If the read requested was from the "last" set, but no "last" set exists, then a "-1" is put in the first word of the Set List Buffer and a $RETN trap is executed in order to allow a "read last set" to be done on the first query of a session without causing undue problems. Note that this forces us to fill in the set list pointer of all Set Handler Tables referencing the "last" set and to check for this error condition in each routine doing a read from the user's file, even if we are not requesting a set list read. If a read is requested from any non-existent set other than the "last" set, an $ERROR trap return is done.

Once we have the directory entry for the selected set or macro, we may begin to read in the requested information. The query text, set list, comments, and macro definition reads are all done by essentially the same section of code (starting at label READIT), with parameters for the read loop set to point to the correct buffers, etc. by a compare-branch tree preceding the loop for each iteration of the loop. The code starting with label DIDIT is a trailer section that follows each pass through the read loop and determines whether another pass is needed.
This occurs whenever both the query text and set list must be read for a set, or when comments are being read and the one just read has a pointer to another comment chained to it. Once all the information has been read in, the File Reader exits via a $RETN trap at label DONE.

The two special case sections of the File Reader, MACALL and COMALL, are straightforward linear search strategies that merely obtain the desired lists of macro/document ID's. Macros are listed in the buffer pointed to by the query text pointer, with one macro name every five words and the number of macro names stored in the query number field of the Set Handler Table. Commented document lists are returned in the set list buffer, with the first two words of the buffer being identical counts of the number of document numbers following. After the two count words come the document numbers themselves. These are in the form of two-word entries, the first word being the document number and the second zero, in order to simulate a normal set list.

This concludes our discussion of the File Reader and allows us to proceed to the next major routine in the Set Handler Module, the Delete Routine (DELRTN).

3.5.6 DELETE ROUTINE

The Delete Routine (DELETR) handles requests to delete query sets, comments, or user-defined macros from a user's file. It does this by finding the directory entry for the information to be deleted, then zeroing the first word of the directory entry if the request is to delete a query set or a macro, or moving "-1" to the comment pointers if comments are to be deleted. After this has been performed, the Delete Routine calls the Bitmap Handler (ALOCD), which frees the disk space involved.

In finding the directory entry for a query/macro to be deleted, the Delete Routine reads in the user's directory (one block at a time) and scans each block linearly.
This approach is used rather than calling the Directory Searcher to locate the directory entry in order to minimize the number of disk accesses when the user is deleting a large number of queries at once. Each query/macro directory entry scanned is compared to the list of queries/macros attached to the Set Handler Table passed to the Delete Routine by the Delete Statement Parser. If the query/macro identifier matches one of the identifiers in the list of queries/macros attached to the table and a "delete" command code (7 for query sets, 9 for comments, 14 for macros) has been entered in the command table, or if no match has been found in the identifier list for a directory entry and a "delete all except" command code (8 for query sets, 10 for comments, 15 for macros) has been entered in the command table, then the directory address is passed to the portion of the Delete Routine which causes storage deallocation and directory deletion. Otherwise, the directory entry is left unaltered and the search continues. It should be noted that "delete all" for query sets, macros, or comments is denoted by a "delete all except" command with a null list of identifiers.

Disk storage deallocation for query set lists, query texts, and macro texts is done by a loop that reads in the information to be deleted in order to get the length of the disk space to be freed and then calls the Bitmap Handler (ALOCD), which marks the space free in the user's bitmap. Comments are handled in a similar fashion, but by a recursive routine (CHASER) which runs down the links of the comment chain, marking the disk space occupied by each comment free in the bitmap. After the disk space has been marked free in the bitmap, the comment pointers in the directory are set to -1 if only comments have been deleted, or the first word of the directory entry is set to zero if an entire query set/macro definition has been deleted.
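The selection rule and the chain walk described above can be sketched as follows. This is an illustrative Python sketch only: the command codes are the ones quoted in the text, but the function names `should_delete` and `chase`, the dictionary standing in for disk, and the order in which addresses are recorded are assumptions, not details of the Delete Routine or of CHASER.

```python
# Command codes as quoted in the text.
DELETE_CODES = {7, 9, 14}          # delete query set / comments / macro
DELETE_EXCEPT_CODES = {8, 10, 15}  # "delete all except" counterparts

NO_LINK = -1


def should_delete(code, identifier, id_list):
    """Decide whether one scanned directory entry is to be deleted."""
    listed = identifier in id_list
    if code in DELETE_CODES:
        return listed
    if code in DELETE_EXCEPT_CODES:
        # An empty id_list therefore means "delete all".
        return not listed
    raise ValueError("not a delete command code: %r" % code)


def chase(disk, addr, freed):
    """Recursively walk a comment chain, recording each address to be freed.

    disk maps a disk address to a record with a 'link' field (hypothetical).
    """
    if addr == NO_LINK:
        return freed
    freed.append(addr)
    return chase(disk, disk[addr]["link"], freed)
```

Expressing "delete all" as "delete all except" with an empty list keeps the scan loop down to a single membership test per directory entry.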
After all items in the query/macro identifier list have been deleted or the end of the directory has been reached, control is returned to the Delete Statement Parser (and hence to the User Interface) via a $RETN trap.

3.5.7 RENAME ROUTINE

The Rename Routine (RENAME) of the Set Handler module is an extremely simple routine. Its purpose is to attach a (new) mnemonic name to a query set or to a user-defined macro. The Rename Routine accepts as input in register R0 the address of a Type III Set Handler Table (see Section 3.5.2). This table contains the current identifier (name or number) of a query set or user macro and the new name to be attached to the set/macro. The Rename Routine uses the Directory Searcher (DIRSRH) to retrieve the address of the directory entry for the set/macro to be renamed. The Rename Routine then reads in the disk block containing the directory entry, replaces the name field of the directory entry with the new name from the Set Handler Table, rewrites the block containing the directory entry, and exits via a $RETN trap.

3.5.8 BITMAP HANDLER

The Bitmap Handler (ALOCD) is used to allocate/deallocate disk storage and directories for the Set Handler. This routine is performed as a subroutine (via a JSR) by the File Writer and the Delete Routine and receives all of its parameters through the Set Handler Workspace, whose base address is located in register R5 throughout the Set Handler. The workspace parameters used by this routine are the Logon Block pointer (LOGPTR), amount requested (NBRREQ), block number (RELBLK), and offset in block (BLKDSP). The exact layout of the Set Handler Workspace is stored as a template macro in the file SYSMAC.SML. The NBRREQ field of the workspace is used to hold the number of bytes of storage to be allocated/deallocated, or a directory allocate/deallocate flag when either query or macro directories must be manipulated.
The Logon Block pointer contains the address of the user's Logon Block, needed by this routine for reading directories from the user's file when doing directory allocates/deallocates. The block number and offset in block fields are used for passing the disk address of the beginning of the disk space allocated or to be deallocated by the Bitmap Handler.

There are essentially five cases of Bitmap Handler action requests to consider:

1) Allocating user file space;
2) Deallocating user file space;
3) Allocating macro directories;
4) Allocating query set directories;
5) Deallocating query set/macro directories.

Deallocation of directories, whether they are macro or query set directories, is very straightforward. The Bitmap Handler is passed the disk address of the offending directory entry in the disk address fields of the workspace and need merely read in the required block, move zero to the first word of the entry, rewrite it, and return.

Allocation of directories is only slightly more complex. When requesting allocation of a directory, the calling routine (FILWRT) places a flag describing the type of directory desired in the NBRREQ field of the workspace (these flags, MACMSK and DIRMSK, are contained in the file SYSMAC.SML). The two short routines MACALC and DIRALC are used to allocate macro and query set directories, respectively. Both routines work by searching linearly through the proper directory until they find a directory slot with a first word of zero or reach the end of the directory. If a free slot is found, its block number and offset within the block are placed in the RELBLK and BLKDSP fields of the workspace and the Bitmap Handler executes an RTS to return control to the calling routine. If the end of the directory is reached without finding a free directory slot, a "-1" is moved to the RELBLK field of the workspace to signal this fact and an RTS is done.
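The directory slot search used by MACALC and DIRALC amounts to a first-fit linear scan. The sketch below is a hedged Python illustration: the directory is modelled as a list of blocks, each block a list of entries, and the function name `allocate_slot` is hypothetical. The returned offset is an entry index within the block, whereas the real BLKDSP field presumably holds a byte displacement.

```python
FREE = 0        # a first word of zero marks a free directory slot
NOT_FOUND = -1  # value moved to RELBLK when the directory is full


def allocate_slot(blocks):
    """Return (block number, entry index) of the first free slot.

    blocks is a list of directory blocks; each block is a list of
    entries, and each entry is a list of words. Returns (NOT_FOUND, None)
    if the end of the directory is reached without finding a free slot.
    """
    for block_no, block in enumerate(blocks):
        for index, entry in enumerate(block):
            if entry[0] == FREE:
                return block_no, index
    return NOT_FOUND, None
```

For example, with two blocks whose first three entries are in use, `allocate_slot` reports the fourth entry's position; with every entry in use it reports `NOT_FOUND`, mirroring the "-1" placed in RELBLK.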
When file space allocation is requested, the number of bytes to be allocated is placed in the NBRREQ field. The Bitmap Handler calls subroutine VALCNK, which converts this number to the number of chunks that must be allocated (i.e. ceiling[bytes/64]). The Bitmap Handler then looks through the bitmap until it finds enough contiguous free chunks to satisfy the request. These bits are then cleared, the address of the lowest numbered block allocated is placed in RELBLK, and the address of the lowest byte allocated in this block is placed in the BLKDSP field. The higher level routines therefore receive the starting address of a contiguous segment of disk space guaranteed to be at least as large as they requested, but possibly spread across several blocks.

File space deallocations are done in a similar fashion. The number of bytes to be deleted is placed in NBRREQ, but with bit 15 set to "1" as a flag that a delete is being requested. The starting address of the disk string to be deleted is placed in the RELBLK and BLKDSP fields, as with directory deletes. The Bitmap Handler then translates the number of bytes into the corresponding number of chunks and sets the proper bits in the bitmap back to "1" to show that the space is free.

Before attempting to understand the address translation mechanism, it is perhaps wise to reconsider the bitmap layout. There are 16 bits/word in the bitmap and 8 chunks/block of disk. Therefore each word in the bitmap covers two blocks of disk chunk space. Since blocks are allocated from the highest block down and we associate the first word in the bitmap with the last block of the file, the difference in bytes between any byte and the first byte of the bitmap is the number of blocks down from the highest block in the disk file.
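The chunk arithmetic and bitmap search can be illustrated with a short Python sketch. It makes several simplifying assumptions that are not in the source: the bitmap is a flat list of 0/1 flags (1 = free) rather than packed words, the function names are hypothetical, and `allocate` returns a chunk index instead of filling RELBLK and BLKDSP. The helper `to_disk_address` applies the byte-to-block rule just stated together with the bit-to-displacement rule (bit position times 64) that the text spells out next.

```python
CHUNK_BYTES = 64


def chunks_needed(nbytes):
    """ceiling(nbytes / 64), the conversion attributed to VALCNK."""
    return -(-nbytes // CHUNK_BYTES)


def find_free_run(bitmap, count):
    """Index of the first run of `count` consecutive free chunks, or -1."""
    run = 0
    for i, bit in enumerate(bitmap):
        run = run + 1 if bit else 0
        if run == count:
            return i - count + 1
    return -1


def allocate(bitmap, nbytes):
    """Clear the bits for a contiguous run big enough for nbytes."""
    count = chunks_needed(nbytes)
    if count <= 0:
        raise ValueError("nbytes must be positive")
    start = find_free_run(bitmap, count)
    if start >= 0:
        for i in range(start, start + count):
            bitmap[i] = 0  # 0 = in use
    return start


def release(bitmap, start, nbytes):
    """Set the bits for a previously allocated run back to 1 (free)."""
    for i in range(start, start + chunks_needed(nbytes)):
        bitmap[i] = 1


def to_disk_address(byte_offset, bit, highest_block):
    """Translate a bitmap position to (block number, displacement in block)."""
    return highest_block - byte_offset, bit * CHUNK_BYTES
```

Because space is handed out in whole chunks, a request for 65 bytes consumes two chunks, which is why callers are only guaranteed a segment at least as large as they requested.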
The low-order bit in a byte corresponds to the chunk starting at displacement 0 in that block, the next one to the chunk starting at offset 64 (bytes), and so on; the high-order bit (bit 7) corresponds to the chunk starting at byte 448 within the block. In order to convert a bitmap address into a disk address, therefore, we must subtract the offset in bytes into the bitmap from the highest block number in the file to get the block number. To get the displacement in block, we need to multiply the relative bit position within the byte by 64.

3.6 SEARCH SUPERVISOR

The Search Supervisor (SRCHSP) module is the scheduling and control module that orchestrates the performance of the Merge Routine, Set Handler, Index and Postings Handler, and Full-Text Searcher modules in the execution of a search. The Search Supervisor may be performed by the FIND Statement Parser, the PRINT Statement Parser, or the MAKE Statement Parser and is passed the address of a Search Supervisor Table (Fig. 3.6.1.1) in register R0. This table contains all the pertinent information collected from the user's query by the command parser.

3.6.1 SEARCH SUPERVISOR TABLE

Since the primary function of the Search Supervisor (and the rest of EUREKA, for that matter) is to perform the operations described in the Search Supervisor Table, we shall take a detailed look at its contents and form. Referring to Fig. 3.6.1.1, we see that the first three words pointed at by R0 are reserved for EXECUTIVE use. This actually reflects an earlier incarnation of the EXECUTIVE, and these words are not currently used. The next word in the table, "PTR TO TTY STRING", is the address of the buffer containing the text of the user's query. The next six words are all pointers (memory addresses) to various information blocks within the table that shall be described later. The next block, "'IN' CONTEXT", is a three word descriptor identifying the contexts in which the term expression must occur.
Similarly, the block labeled "'PRINT' CONTEXT" is a one word descriptor identifying the contexts to be printed from documents that satisfy the search request. The word labeled "DEVICE" is a descriptor specifying whether the information is to be printed on the line printer or displayed upon the user's screen.

[Fig. 3.6.1.1: SEARCH SUPERVISOR TABLE. Entries as labeled in the figure: RESERVED FOR EXEC USE; PTR TO TTY STRING; PTR TO TERMS; PTR TO FULL TXT SR TBL; PTR TO SET QUADS; PTR TO SETS; PTR TO COMMENTS; PTR TO RESULTS; "IN" CONTEXT; "PRINT" CTXT; DEVICE; SET NAME; TERM QUADS; FULL TEXT SEARCH TABLE; TERMS; SET QUADS; SETS; COMMENTS; RESULTS (256 words). Part of the table is marked "Not Produced by Make Stmt Parser".]

[Fig. 3.6.1.2: Term/Set Quad Detail. Labels in the figure: Pointer to Set/Term Quadruple; Descriptor; Pointer to Left Side of Expression; Pointer to Right Side of Expression; Pointer to Results; Term Results; Query Set Number or Term Length; Query Set Name or ...]

The five word block "SET NAME" is the mnemonic name the user wishes to have attached to the resulting query set. "TERMS" and "TERM QUADS" are the two blocks that specify the search terms for this query and in what relationship they are to occur. Fig. 3.6.1.2 describes the structure of these two blocks. The "TERM QUADS" block holds a set of four word descriptors which form a binary operation tree that describes the search expression. The first descriptor in the block is the root node of the operation tree. Each descriptor is made up of four words, the first being a word of bit-flags, the second and third being pointers to the left and right hand terms of the expression (with respect to the operator described at this node), and the fourth being a pointer to the address at which the results of this operation are to be placed. The bit-flag word is broken down as follows:

Bits 15,14 : Operation to be performed;
             00 => OR
             10 => AND
             11 => AND NOT
Bit 11 : 1 => Suffixing to be performed on left hand side term.
Bit 10 : 1 => Prefixing to be performed on left hand side term.
Bit 8 : 1 => Left hand side pointer points to another node (term quad) rather than a leaf (term).
Bit 3 : same as bit 11, only for right hand side.
Bit 2 : same as bit 10, only for right hand side.
Bit 0 : same as bit 8, only for right hand side.

Note that bits 11, 10, and 8 are bits 3, 2, and 0 of the high-order byte. The node-leaf selector bit allows us to handle cases of the form:

    FIND 'A' * 'B' * 'C'

Since we use binary operations exclusively, we must first AND together the list of documents responding to the first search term with the list of documents responding to the second search term, thus producing a temporary result to be ANDed with the list of documents responding to the third search term. Whenever the node-leaf bit is set to 1 for either the left or right hand side, the "TERM POINTER" points to the term quad of the operation that must be performed in order to generate the temporary result list needed to perform the operation described in the current node. If the right/left node-leaf bit is set to 0, then the right/left term pointer is the address of the search term to be used in the right/left hand side of the current operation.

The "TERMS" block contains all the terms pointed at by the term pointers just described. The terms are laid out sequentially in the "TERMS" block, each consisting of a length word followed by the text of the term. The term pointers actually point to the length words. The result pointer in each term quad is filled in by the Search Supervisor upon completion of the operation specified in that node and is used whenever a parent node term pointer points at the current node.

"SET QUADS" and "SETS", which describe the set expression for this search, are structured in the same form as "TERM QUADS" and "TERMS".
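Decoding the bit-flag word of a term quad is a matter of shifts and masks. The following Python sketch is illustrative only; the function name `decode_flags` and the dictionary it returns are hypothetical, and the use of bit 0 as the right-hand node-leaf bit is an assumption based on the symmetry with bits 3 and 2, since the bit number is not legible in the source.

```python
# Operation codes held in bits 15 and 14 of the flag word.
OPS = {0b00: "OR", 0b10: "AND", 0b11: "AND NOT"}


def decode_flags(word):
    """Split a 16-bit term-quad flag word into its named fields."""

    def bit(n):
        return bool((word >> n) & 1)

    op_bits = (word >> 14) & 0b11
    if op_bits not in OPS:
        raise ValueError("undefined operation code: %s" % bin(op_bits))
    return {
        "op": OPS[op_bits],
        "left": {"suffix": bit(11), "prefix": bit(10), "is_node": bit(8)},
        # Bit 0 for the right-hand node-leaf flag is an assumption.
        "right": {"suffix": bit(3), "prefix": bit(2), "is_node": bit(0)},
    }
```

Under these assumptions, a flag word of hexadecimal 8100 describes an AND whose left operand is itself another quad and whose right operand is a plain term, which is the shape of the root node for an expression such as FIND 'A' * 'B' * 'C'.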
The only significant differences are that the prefixing-suffixing bits in the bit-flag words are meaningless here and the "SET POINTER" points to a six word block containing either a query set number in the first word thereof or a query set name in the last five words. The indirect (node-leaf) bit, results pointer, and op code bits are the same as for the "TERM QUADS".

3.6.2 SEARCH SUPERVISOR OPERATION

Now that we have an understanding of the Search Supervisor Table, the description of the Search Supervisor is relatively trivial. The only difficulties occur in attempting to describe the handling of the various "special cases" that can occur. We shall therefore first look at the main structure of the code and then go back and describe how the "special case" handlers fit within the framework of the body of the module.

The first action of the Search Supervisor (aside from some housekeeping) is to start up the Set Expression Evaluator (STEVL), which evaluates the set expression as contained in the "SET QUADS" and "SETS" blocks of the Search Supervisor Table. While the Set Expression Evaluator is in progress, the Search Supervisor does some initialization of working lists for use in evaluating the term expression. As soon as the Set Expression Evaluator finishes, the Search Supervisor evaluates the term expression by use of a loop which effectively traverses the "TERM QUADS" tree in postorder, using the Index and Postings Handler (IPHNDL) to read in the postings list for each leaf (search term), and the Merge Routine (MERGE) to perform the operation specified at each node of the "TERM QUADS" tree. The Search Supervisor then performs the Full-Text Searcher (FTSRCH), which determines whether any of the documents require full-text searching and, if so, performs the search. Refer to Sec. 3.10 for details of the operation of the Full-Text Searcher.
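The postorder evaluation of the operation tree can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Search Supervisor's loop: the tree is written as nested tuples instead of four-word quads, a dictionary lookup stands in for the Index and Postings Handler, the `merge` function is a stand-in for the Merge Routine, and the postings lists are assumed to be sorted in ascending document number order (the source does not state the ordering).

```python
def merge(op, left, right):
    """Boolean merge of two ascending lists of document numbers."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            if op != "AND NOT":      # in both lists: kept by AND and OR
                out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            if op != "AND":          # only in left: kept by OR and AND NOT
                out.append(left[i])
            i += 1
        else:
            if op == "OR":           # only in right: kept by OR alone
                out.append(right[j])
            j += 1
    if op != "AND":
        out.extend(left[i:])
    if op == "OR":
        out.extend(right[j:])
    return out


def evaluate(node, postings):
    """Evaluate a tree in postorder.

    node is either a term (str, a leaf) or a tuple (op, left, right).
    postings maps each term to its ascending list of document numbers.
    """
    if isinstance(node, str):
        return postings.get(node, [])
    op, left, right = node
    # Both operands are evaluated before the operation at this node.
    return merge(op, evaluate(left, postings), evaluate(right, postings))
```

An expression such as FIND 'A' * 'B' * 'C' would be written here as `("AND", ("AND", "A", "B"), "C")`, so the inner AND produces the temporary result that the outer AND consumes.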
Upon completion of the Full-Text Searcher execution the Search Supervisor prints the message:

    n DOCUMENTS POSTED TO THIS SET

and then constructs a Set Handler Table describing the results of this search and performs the Set Handler in order to save a record of this search for future use by the user. All that remains then is the not inconsiderable task of cleaning up by freeing all the temporary lists used in the term expression evaluation and freeing all the disk space used by the Merger. A $RETN trap is then executed to return control to the parser routine that initiated (via a $PRFM trap) the Search Supervisor.

Now we shall look at the special cases. If the Search Supervisor has been performed by a PRINT Statement, then no term expression exists and the section of code that evaluates this expression must be skipped. The Search Supervisor must, however, handle print requests that access user comments. This requires a call to the Set Handler to get the list of all documents which have user comments attached to them. The Search Supervisor must also skip over the call to the Set Handler that saves the search results in order to avoid generating a new query set from a PRINT statement. MAKE Statements present a similar problem in that they have no term expression to be evaluated, but this is easily handled by skipping the term expression evaluation process.

The last major "special case" to consider is the case where one or more search terms contain no alphanumeric characters and therefore do not have entries in the index file. If this case arises, the Search Supervisor must construct a set list consisting of all the document accession numbers in the entire database for this term and force full-text searching of all of them. For further details, refer to the program listing.
3.7 SET EXPRESSION EVALUATOR

The Set Expression Evaluator (STEVL) performs the function of evaluating the set expression contained in the "SET QUADS" and "SETS" blocks of the Search Supervisor Table (see Fig. 3.6.1.1; also refer to the preceding section of this report for a description of the table layout). The address of the Search Supervisor Table is passed to this routine in register R0. Once one understands the structure of the "SETS" and "SET QUADS" sections of the Search Supervisor Table, the operation of the Set Expression Evaluator is self-evident. It traverses the operation tree in postorder, using the Set Handler to retrieve document lists (query set lists) from the user's file for the leaves and the Merger (MERGE) to perform the operations specified at the nodes. The final result is put in the area of the Search Supervisor Table labeled "RESULTS".

3.8 INDEX AND POSTINGS HANDLER

The purpose of the Index and Postings Handler (IPHNDL) module is to determine in which documents the user's search terms occur, so that we need only consider those documents rather than searching the entire database to find documents that satisfy the user's search expression.

3.8.1 FILES MANIPULATED BY THE INDEX AND POSTINGS HANDLER

Before attempting to fathom the details of the Index and Postings Handler, let us consider the file structures manipulated by it. This file structure consists of two hash tables, HASH1 and HASH2 (read in from files HASH1.XXX and HASH2.XXX by IRINIT); the index file, INDEX.XXX; and the postings file, PSTNG.XXX. The "XXX" file extension on the file name is used to distinguish between various versions of EUREKA and also between various databases. The two hash tables are used to get a disk address in the index file. HASH1 is used to hash on the first letter of a token and HASH2 is used to hash on the second letter of the token. The sum of the values obtained (via a process explained in Sec.
3.8.2) from HASH1 and HASH2 is used to give a disk address in the index file (INDEX) where terms beginning with these two characters are indexed. This section of the index file is then searched linearly for a match to the entire token until the index file entry is lexicographically less than the token. If a match is found, a pointer into the postings file (PSTNG) is obtained from the index file. The postings file contains the list of all documents in which the token under consideration occurs. Each entry in the postings file contains a document accession number, context bits to describe the context(s) in which the token occurs, and a count of the number of occurrences of the term within the document. Let us now look at the layouts of the files and tables in a semi-tabular format.

3.8.1.1 HASH1 TABLE

Table Name: HASH1
Size: 256 words
Content: Each word contains a full word value to hash the character which indexes it. If the value of the word is FFFF base 16, then the character does not exist in the index.

3.8.1.2 HASH2 TABLE

Table Name: HASH2
Size: 256 bytes
Content: Each byte contains a byte value to hash the character which indexes it. If the value of the byte is FF base 16, then the character does not exist in the index.

3.8.1.3 INDEX FILE

File Name: INDEX.XXX
Type: Contiguous
Blocking Factor: 2
Format:
    Next Block Pointer
    Number of Types in this Block (N)
    The following group occurs N times:
        Length of this Type (n)
        Type (n bytes; "odd" noted in the figure)
        Directory Address of Postings
        Offset Into Postings Block

3.8.1.4 POSTINGS FILE

File Name: PSTNG.XXX
Type: Contiguous
Format (256 words):
    Next Block Pointer
    Total Postings This Type (N)
    Postings This Block (n1)
    Postings

N > n1 implies postings for a type are split across blocks.
In this case, the next block has the following format:

    Next Block Pointer
    Postings This Block
    Postings

Each posting consists of two words in the following format:

[Figure: two-word posting layout. The recoverable labels are "Document Number" and "COUNT"; the remaining single-letter labels, presumably the context bits, are not legible.]

3.8.2 OPERATION OF THE INDEX AND POSTINGS HANDLER

Now that we have analyzed the files manipulated by the Index and Postings Handler, let us look at the operation of the routine itself. Upon entry, register R0 should point at a table of six words containing:

1) A prefix/suffix descriptor for the term.
2) A pointer to the term.
3) A pointer to where the postings are to be placed.
4) and 5) A two-word context descriptor.
6) A pointer to the Logon Block.

The prefix/suffix descriptor contains only two bits of useful information: if bit 2 is on (i.e. equals 1), then prefixing is to be used; if bit 3 is on, then suffixing is to be used. If both bits are on, then both prefixing and suffixing are to be used. The pointer to the term points to a term in the "TERMS" block in the following format: one word containing the length in characters (bytes) of the term, followed immediately by the text of the term. The context descriptor is in the standard form shown in Sec. 3.8.1.4.

After the usual housekeeping, the Index and Postings Handler first checks to see if the term contains any non-alphanumeric characters. If it does, then full-text searching will be required of all documents containing as a substring any token within the term. If, as is normally the case, there are no special characters in the term, the Index and Postings Handler checks to see if prefixing or suffixing has been specified for this term in the descriptor word. If either or both have been specified, then control is passed to one of three special purpose routines (PREFIX, SUFFIX, and BOTH) which shall be described later.
In the simplest case (where the user has requested a term with no prefixing, suffixing, or special characters), the Index and Postings Handler hashes on the first two characters of the term to obtain the address in the Index File at which to begin searching for an exact match to the search term. The hash is done by treating the bytes containing the characters being hashed as if they were numeric values. The first character is multiplied by two (i.e. shifted left one bit) and is used as an index into table HASH1. A one word value is retrieved from the indexed location in HASH1. If the value is FFFF base 16, then the character doesn't exist in the index. The second character is then treated similarly, retrieving a byte value from HASH2 which is added to the word value retrieved from HASH1 (unless it is FF base 16, the flag that a character is non-existent) to obtain a disk address in the index file at which terms beginning with this bigram are located. If either value has been flagged as non-existent, the Index and Postings Handler marks this fact and immediately executes a $RETN trap.

If a valid disk address has been obtained by the hash, the Index and Postings Handler uses the subroutine GETNDX to retrieve the index listing for the term in question. Once the index entry has been found (if it exists), the Index and Postings Handler uses the subroutine GETPST to retrieve the postings (list of document accession numbers) for this term and calls the Merger to merge this list with any previously generated lists (this occurs primarily when handling special cases such as prefixing). Once the final postings list has been constructed, it is read into the buffer specified in the six word descriptor table. If the results list is too long to fit in the buffer, then only the first block is read in, with a link pointer to the remainder on disk. After some fairly messy housekeeping a $RETN trap is done.
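The two-table hash can be illustrated with a small Python sketch. It is a hedged example rather than the handler's code: the tables are modelled as 256-element lists indexed directly by character code (so the shift that turns a character into a byte offset into the word table is not needed), the function name `index_address` is hypothetical, and the treatment of one-character terms is an assumption, since the source does not describe that case.

```python
MISSING_WORD = 0xFFFF  # HASH1 flag: first character not in the index
MISSING_BYTE = 0xFF    # HASH2 flag: second character not in the index


def index_address(term, hash1, hash2):
    """Return the index-file address for the term's leading bigram.

    hash1 is a list of 256 word values and hash2 a list of 256 byte
    values. Returns None if either character is flagged as non-existent.
    """
    if len(term) < 2:
        # Assumption: the source does not say how shorter terms are handled.
        raise ValueError("term must have at least two characters")
    base = hash1[ord(term[0])]
    offset = hash2[ord(term[1])]
    if base == MISSING_WORD or offset == MISSING_BYTE:
        return None
    return base + offset
```

The address returned is only a starting point; the linear search of the index file from that address for an exact match is a separate step (GETNDX in the text).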
The special subroutines SUFFIX, PREFIX, and BOTH handle finding all truncated matches for terms and merging the postings list of each newly found match into the list of those already found. These routines use the common exit routine to clean up after execution and do the $RETN trap that returns control to the Search Supervisor.

3.9 MERGER

The Merger (MERGE) is used to perform Boolean operations (AND, OR, and AND NOT) on lists of document accession numbers. A subsidiary function is the allocation/deallocation of scratch disk space for itself and for the Search Supervisor, Index and Postings Handler, and the Full-Text Searcher. Parameters are passed to this module in the form of a nine word long parameter list containing:

Word 0 ......... operation descriptor
Word 1 ......... pointer to left hand side list
Word 2 ......... pointer to right hand side list
Word 3 ......... pointer to results buffer
Words 4&5 ...... two-word long context descriptor
Word 6 ......... pointer to left hand side buffer
Word 7 ......... pointer to right hand side buffer
Word 8 ......... disk address of result list overflow

In word 0, the operation descriptor, only the high-order byte is meaningful. The bits of this byte have the following meanings:

Bits 15,14 : Binary operation code as described in Section 3.6.1.
Bit 11 : 1 => Free the scratch disk space whose starting address is located in word 8 of the parameter list (bytes 15, 16).
Bit 10 : 1 => Allocate a scratch disk space and place the starting address in word 8 of the parameter list.

Words 1 and 2, the left and right hand side list pointers, are the starting memory addresses of the two document accession number lists to be merged. The first word of each list contains the number of document accession numbers (postings) in the list, while the second word contains the number of postings in this block. These words are followed by the document accession list in the standard two-word long descriptor format described in Sec. 3.8.1.4.
Word 3 points to the memory buffer in which to store the result list. This buffer is only one block long, so if the result list is over one block long only the first block of the list is stored here. The rest of the list is stored on disk as a one-way linked list with the starting block number stored in word 8 of the parameter list.

Words 4 and 5 are a standard context descriptor that is used for setting context bits in entries in the result list for use by the Full-Text Searcher.

Words 6 and 7 are pointers to the head of the buffer used by the Index and Postings Handler for reading in the information from the Postings file. Words 1 and 2 are addresses within this buffer. The buffer addresses are provided in case the posting spreads across more than one block.

Word 8 is used to return disk addresses of result list overflow lists, to return the address of freshly allocated disk space, and to receive the address of scratch disk buffers to be deallocated.

The first action of the Merger is the decoding of the operation descriptor. If the request in the parameter list is for disk buffer allocation/deallocation then subroutine BMAP is called. BMAP is a straightforward bitmap handler which keeps track of which disk buffers are in use. As soon as BMAP has updated its bitmap and placed its result in the parameter list a $RETN trap is executed to return control to the calling routine.

Boolean operations on document accession number lists are performed in separate loops (one for each operation) that compare the two lists on an element-by-element basis, generating the result list with correct context bits set as they do so. After the two lists have been merged into a result list a $RETN trap is executed, returning control to the calling routine. It is hoped that this routine will be replaced by a hardware merger at some time in the near future.

3.10 FULL-TEXT SEARCHER

The Full-Text Searcher (FTSRCH) module does all the full-text searching, browsing, and text display for the user.
This module is made up of three main routines: FTSRCH, the full-text searching routine; BROWSE, the browse mode handler; and PRNTR, the document text display routine. It is initiated (via a SPRPH trap) by the Search Supervisor and receives the address of the Search Supervisor Table in register R0.

3.10.1 FULL-TEXT SEARCHING ROUTINE

The Full-Text Searching routine (FTSRCH) is called during each search after the Index and Postings Handler has constructed a list of all documents that contain the proper Boolean conjunction of search terms. The Full-Text Searcher sorts this list of documents into highest-count-field-first sequence and rewrites it to disk. If the user has requested that all text printed for this query be displayed upon the line printer a $LOCK trap is executed to lock the line printer. The Full-Text Searcher then determines what type of search (if any) is to be performed on documents.

Next the Full-Text Searcher enters a loop that reads in the directory for each document in the list, one at a time. Tests are then made to determine if any type of full-text search or text print is to be performed on this document. If a full-text search is to be performed, control is passed to the appropriate controlling loop for the type of search to be performed. If no search is required, a JSR is made to the Text Printer (PRNTR) routine.

The search controlling loops set up parameters for the subroutine LEVEL1, which does the actual searching. Each time a "hit" is found in the text being searched, LEVEL1 returns control to the parent loop, which handles coordinating Boolean conjunctions of terms within contexts, etc. Whenever some text that satisfies the search request is found, control is passed to the Text Printer routine, which formats and displays the text if a print has been specified.

3.10.2 TEXT PRINTER ROUTINE

The Text Printer routine (PRNTR) does all text display for the EUREKA system.
It is called by the Full-Text Searcher routine and the Browse Mode Handler. If the user has requested that some term(s) be found in the same paragraph and that the sentence in which they occur be printed, or that they be found in the same sentence and that the paragraph in which they occur be printed, then this routine is passed the starting and stopping addresses of the clause that satisfies the Print Clause. Under any other combination of requests, the Text Printer gets all needed information from the Search Supervisor Table.

The Text Printer has several special subroutines used for handling the "FIND IN SENTENCE PRINT PARAGRAPH" type of situations. These routines utilize "inside knowledge" about starting and stopping addresses in the text, etc. to set up parameters for the regular formatting and marking routines and then perform them just as the main body of the Text Printer does.

The main body of the Text Printer first goes through a series of tests to see if individual contexts are to be printed. On each "hit", the appropriate context is moved into the parameter areas of the workspace and the PRINT1 subroutine, which handles the mechanics of printing one context, is called. For some contexts, such as a SENTENCE print during an "IN PARAGRAPH" find, the special-purpose routines described before must be called.

The PRINT1 subroutine uses the subroutine RDTXT to read in the text containing the context to be printed, the subroutine MARK to mark the text to be displayed, and the subroutine FORMAT to actually format and display the marked text. MARK does little more than stick a special fern in front of search terms to mark them for FORMAT and will not be discussed in any greater detail. FORMAT handles the mechanics of moving the text to be displayed into the print buffer, deleting ferns, moving asterisks to column one of any buffer that contains a mark fern, and actually displaying the text.
After the text has been displayed, a JSR is made to the Browse Mode Handler (BROWSE), unless the information is being displayed upon the line printer, in which case the JSR is skipped and an immediate RTS is done.

3.10.3 BROWSE MODE HANDLER

The Browse Mode Handler (BROWSE) controls all interaction with the user while text display is in progress. It prompts the user each time a context is printed and examines his reply to see if any browse commands have been entered. If no browse mode commands have been entered an immediate RTS is done to return control to the Text Printer routine. If the user has entered a command that requests printing of another context or of the previous/succeeding sentence or paragraph, the Browse Mode Handler must take care of the mechanics of retrieving the text to be displayed, setting up parameters for the subroutine FORMAT, and calling it to have the text printed.

If the user has entered a comment string to be attached to the document currently being viewed the Browse Mode Handler must build a Set Handler Table containing the Logon Block pointer, the document accession number (with bit 15 set to 1 to flag it as a document), and a pointer to the comment string. The Browse Mode Handler then calls the Set Handler and re-prompts the user for another command.

3.11 SET INFORMATION PRINTER

The Set Information Printer (INPOPT) module is used to retrieve information from the user's personal file and display it. Its function, therefore, is primarily that of calling the Set Handler to retrieve information from the user's file, formatting it, and displaying it upon either the user's terminal or the line printer.

The first operation of the Set Information Printer is some housekeeping which includes workspace allocation, line buffer header initialization [1], locking (via a $LOCK trap) the line printer if necessary, and various buffer initialization.
The Set Information Printer then determines whether the information to be printed is a macro or a query set and proceeds to the proper section of the module. We shall first look at the case of a user macro print.

In the case of a user macro print the Set Information Printer first checks to see if the macro identifier in the Set Handler Table is "ALL". If it is, the user has requested that all macro definitions in the user's file be displayed. In this case the Set Information Printer must first request a list of all the macro identifiers in the user's directory from the Set Handler. Once it has this list it goes into a loop which moves one macro identifier at a time into the Set Handler Table and JSR's into the display subroutine once for each macro. If the identifier is not "ALL", the user has requested to see a single macro definition and the Set Information Printer need only perform (via a JSR) the display subroutine once and then exit by executing a $RETN trap.

The query set print section of the module works in much the same fashion as the macro print. All query set print requests are assumed to be of the form:

PRINT <query set> TO <query set>

This form is reflected in the use of a modified Set Handler Table for passing parameters to the Set Information Printer module. This table looks like a Type I Set Handler Table with a second Query Set Descriptor (describing the second <query set>) following the first. If the user has requested the display of a single query, the second Query Set Descriptor is zeroed. If the user has specified either of the <query set>'s via a mnemonic name the Query Set Information Printer requests the query number of that query set from the Set Handler in order to use it as a bound on the printing loop. Once the Query Set Information Printer has both query numbers it clears the set name field of the Set Handler Table, moves the lower of the two query numbers to the query number field of the table, and subtracts one from the query number field.
It then enters a loop that:

1) Adds one to the query number field of the Set Handler Table;
2) Compares the query number to the upper bound of the print request: if it is less than or equal to the upper bound it performs the display subroutine once and loops back to (1); if it is greater than the upper bound the Set Information Printer jumps to the exit routine.

The display subroutine (PRINT) is referenced by both the macro print section and the query set print section. It receives as input a Type I Set Handler Table containing the correct identifiers/buffer pointers to read in the information to be printed. The display routine calls the Set Handler once for each pertinent block of information to be retrieved (query/macro text, query set list, comments). When the Set Handler returns control to the display routine it formats the data for display, prints it wherever the user has requested that it be printed, and executes an RTS instruction to return control to the controlling loop. If the Set Handler has returned an error message of "INVALID SET ID" on the stack, the display routine merely clears the stack and does an RTS, returning control to either the macro print loop or the query set print loop, thus discarding the error message. All other error messages are propagated back up the tree of tasks. This allows the Query Set Information Printer to handle cases where the user requests to see all query sets between [...] and [...] where some [...], L < M [...]

[...] 10 CHARACTERS:
    You have attempted to attach a name with 11 or more letters to a query set. This is not permitted. Watch for concatenated words on multiple line commands.

ILLEGAL OR IMMORAL USE OF QUOTES:
    Check for unmatched single quotes, i.e. ' . Also check for single quotes used in improper places.

INVALID WORD FOLLOWS "ALL":
    EUREKA cannot parse the part of your command that comes after the word "all".

CANNOT DELETE DOCUMENT:
    EUREKA will not allow you to delete a document, only the comments attached to the document.
ILLEGAL USE OF BRACKETS:
    EUREKA has discovered some brackets where it doesn't expect them.

QUERY OR DOCUMENT NO. TOO BIG:
    You have just input a query/document number that is far bigger than the number assigned to any query or document in the EUREKA system.

MISSING SET NAME IN CHANGE STMT.:
    EUREKA cannot find the name of the query set you wish to have changed.

MISSING KEYWORD OR SETNAME IN CHANGE STATEMENT:
    EUREKA cannot find "TO" in your CHANGE statement. Check to see if you have two Set Names separated by " TO ".

TOO MANY SET NAMES:
    EUREKA is confused. There are too many names there.

QUERY OR DOCUMENT NO. IS NOT NUMERIC:
    EUREKA has found a string of letters in a place it expected only numbers. Be sure you haven't typed the letter "O" instead of the digit "0".

NO SET NAME IN DELETE STATEMENT:
    Did you tell EUREKA what you wanted deleted? It doesn't think you did!

ILLEGAL USE OF BRACKETS:
    Check the usage of brackets ("[" or "]") for validity.

NAME OF DOCUMENT CANNOT BE CHANGED:
    You may not assign a name to a document. A close approximation is to use the MAKE statement to create a query set with only the desired document in the set.

INVALID CHARACTERS IN SET NAME:
    A Set Name cannot contain any symbols other than letters of the alphabet or numbers; the first character of the Set Name must be non-numeric.

MISSING SET NAME:
    You've left out the name of a Set somewhere.

LOGICAL END OF STATEMENT REACHED PREMATURELY:
    EUREKA thinks you started something that you didn't finish. Check your brackets, etc.

PARSER TABLE OVERFLOW:
    Program couldn't handle your expression - please use smaller queries to achieve your final result.

ILLEGAL SUCESSOR:
    You probably have either left out an operator (*, +, or -), or have two of them with nothing in between them.

ILLEGAL CONTEXT:
    EUREKA doesn't recognize the context you want it to use. Check your spelling or abbreviation.

ILLEGAL SET NAME:
    Check to see if you have used a reserved word or invalid characters.
TOO MANY LEVELS OF PARENTESES:
    Your expression is too complicated. Break it up into several FIND/MAKE statements.

TOO MANY TERMS OR PARENTHESES:
    Same as previous message.

ILLEGAL TERM:
    Most likely explanation is that you have too many or too few single quotes (') in your search expression.

TOO MANY DISJUNCTIONS:
    Expression too complicated. Use several smaller queries.

ILLEGAL SET OR DOCUMENT NO.:
    The number is either non-numeric or too big.

EXPRESSION TOO COMPLICATED:
    Use several smaller queries and/or MAKE statements to accomplish your grand design.

INVALID USER ID:
    Check your spelling. If you cannot log on, see a systems programmer.

INVALID BITMAP:
    Some of your records may be missing. Notify a systems programmer.

INVALID BOOLEAN CONJUNCTION OF SETNAME-#'S:
    You've probably left out or put in an extra Boolean operator (*, +, -).

YOU HAVE ATTEMPTED TO RENAME A NON-EXISTENT SET:
    Self explanatory. Check your spelling.

YOU HAVE ATTEMPTED TO READ A NON-EXISTENT SET:
    Same as previous error message.

YOU HAVE ATTEMPTED TO COMMENT A NON-EXISTENT SET:
    Same as previous error message.

YOU HAVE RUN OUT OF DISK SPACE. PLEASE DELETE SOMETHING:
    Your disk space is completely full - EUREKA cannot find enough space to finish processing your last command. Delete some sets and/or comments before proceeding.

SET EXPRESSION NOT VALID:
    You have probably left out a Boolean operator (*, +, -) in your set expression.

NO "LAST" SET EXISTS:
    You have attempted to access the "LAST" set, which has been deleted or somehow changed.

ILLEGAL MACRO NAME:
    Macro names must obey the same regulations as query set names.

YOU MUST LOGON!:
    No queries may be entered until you have logged on.

SYSTEM ERROR:
    Program error - call system programmer immediately.

REFERENCES

1. Digital Equipment Corporation, "Disk Operating System Monitor Programming Handbook", 1972.

2. Milner, J.M., "A Multiprocess, Multiuser Executive for an Experimental Information Retrieval System", M.S. thesis,
University of Illinois Department of Computer Science Report Number 75-736, August 1975.

BIBLIOGRAPHIC DATA SHEET

1. Report No.: UIUCDCS-R-76-779
3. Recipient's Accession No.:
4. Title and Subtitle: DESCRIPTION OF AN EXPERIMENTAL ON-LINE, MINICOMPUTER-BASED INFORMATION RETRIEVAL SYSTEM
5. Report Date: February 1976
7. Author(s): John Keith Morgan
8. Performing Organization Rept. No.: UIUCDCS-R-76-779
9. Performing Organization Name and Address: University of Illinois at Urbana-Champaign, Department of Computer Science, Urbana, Illinois 61801
10. Project/Task/Work Unit No.:
11. Contract/Grant No.: US NSF DCR73-07980 A02
12. Sponsoring Organization Name and Address: National Science Foundation, Washington, D. C.
13. Type of Report & Period Covered: Master's Thesis - 1976
15. Supplementary Notes:
16. Abstracts: [...] information retrieval systems have provided little more than titles or abstracts of documents in response to user queries. This thesis describes an experimental information retrieval system that provides a framework for research in providing users access to the entire text of documents.
17. Key Words and Document Analysis. 17a. Descriptors: Database systems; Documentation; Information retrieval; Inverted files; Nonnumeric processing; Query languages; User aids
17b. Identifiers/Open-Ended Terms:
17c. COSATI Field/Group:
18. Availability Statement: RELEASE UNLIMITED
19. Security Class (This Report): UNCLASSIFIED
20. Security Class (This Page): UNCLASSIFIED
21. No. of Pages: 135
22. Price: