Report No. UIUCDCS-R-75-762
NSF - OCA - DCR73-07980 A02

A LIST MERGING PROCESSOR FOR INVERTED FILE INFORMATION RETRIEVAL SYSTEMS

Lee Allen Hollaar

October 1975

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS

LIST OF FIGURES

1.1 Context Hierarchies
1.2 Logical Database Organization
1.3 Table of Word Frequency in "Ulysses"
1.4 Zipf Curve for "Ulysses"
1.5 Zipf Curves for State Statutes
1.6 Truncated Zipf Curve
1.7 Savings Using Complementary Lists
1.8 AND Subroutine from EUREKA System
1.9 List Entry Format for AND Subroutine
1.10 Operations Table
2.1 Stellhorn/Batcher Merge System
2.2 HARVEST Functional Units
2.3 Merge System Configuration
2.4 Complex Merge System Configuration
2.5 Merge Element Block Diagram
2.6 Ripple Carry Number Comparator
2.7 Parallel Count Field Generator
2.8 Serial Merge Element Block Diagram
2.9 Basic Serial Merge Element
2.10 Complete Serial Merge Element
2.11 Serial Element Control Flow Chart
2.12 Element Performance
3.1 Network to Process a Complex Expression
3.2 Binary Tree Network Configuration
3.3 Network with PASS Element
3.4 Network Gate Counts
3.5 Network Deadlock
3.6 Modified Operations Table
3.7 Modified Network Operation
3.8 Multi-Pass Expression Processing
3.9 Input Pair Classes
4.1 Simulation Results

CHAPTER 1 -- INTRODUCTION TO THE PROBLEM

This thesis deals with the design and use of a specialized data processing system in connection with a conventional digital computer running a large scale information retrieval system using an inverted file database structure. It describes the operation of a representative information retrieval system implemented on a standard digital computer, with emphasis on the structure of the database and the types of queries generally made of it. It then shows that the processing of standard queries frequently requires the merging of two or more ordered lists of index term pointers. The design of a simple processor to aid in this merging is presented, along with a technique for increasing the effective memory bandwidth available by connecting a number of these merge processors in a network. The third chapter concludes with the implications, for both the hardware design and the methods used to parse expressions, of arranging the network as a fixed binary tree. Finally, a number of areas for future research are proposed, with preliminary results in these areas given.

This thesis will not try to justify the use of a particular file structure, data format, or form of query. These problems have been discussed at great length in the past, and to do so here would only obscure the issue of a specialized merge processor.
Suffice it to say that inverted file information retrieval systems exist, that large scale systems of this type show a decrease in performance due to the disproportionate time spent handling the merging and correlation of the index terms, and that a specialized processor presents a possible solution to the problem.

1.1 Information Retrieval Queries

A representative information retrieval system can be considered to have a command which locates items* within the database which match a given pattern. Any additional commands, such as those which print out the results of the search, create sets of items based on previous searches, and the like, are not important to the discussion of list merging.

* An item is the primary entity in the representative information retrieval system. When inverted files are present, it represents the context to which the inversion was made, and which must be searched if a lower context is specified. For EUREKA [1], an item corresponds to a document.

This command can find all items which contain one or more of a set of explicitly specified character strings (with the command in the form "FIND 'AARDVARK'" to locate all occurrences of AARDVARK, or "FIND 'AARDVARK' OR 'AARDWOLF'" to find all items which contain either AARDVARK or AARDWOLF). This union of terms can also be specified implicitly by the use of "explosion" or "wildcard" functions. For example, EXPLODE('HOUND') may produce an operation equivalent to 'BASSET HOUND' OR 'BEAGLE' OR ... . Similarly, 'PROGRAM*' (where * is assumed to match any arbitrary character string, including the null string) may be equivalent to 'PROGRAM' OR 'PROGRAMMED' OR 'PROGRAMMER' OR ... , with as many terms as there are words in the database which begin with PROGRAM.

In addition, the command allows searching for two or more terms occurring within a given context. While the meaning of context varies from system to system depending upon the inherent hierarchy of the data stored, the contexts given in Figure 1.1 for EUREKA will be assumed. An example of this second form of the command is "FIND 'BEOWULF' AND 'GRENDEL' IN PARAGRAPH", which matches all items which have a paragraph containing both BEOWULF and GRENDEL. If no context is specified, it is assumed that the request is for the co-occurrence of the terms in the same item.

Figure 1.1 - Context Hierarchies
EUREKA PDP-11 System: Document, Author, Title, Source, Date, Keys, Abstract, Body, Paragraph, Sentence, Notes, References, Comments
State Statutes Database: Statutes, Chapter, Index, Act, Title, Enabling Clause, Section, Paragraph, Sentence, Footnotes

A special form of the AND connective exists in the form "FIND 'AREA NAVIGATION'". This expression asks the system to find a sentence which contains both AREA and NAVIGATION, occurring adjacently and in that order.

A third form of an expression is "FIND 'COMPUTER' AND NOT 'ANALOG'", which finds all items which contain the word COMPUTER but not the word ANALOG.

It is easy to see that the OR connective increases the number of items meeting the test, while the two forms of the AND decrease the number. This is important, since the primary function of the information retrieval system is to reduce the number of items which must be examined for relevancy. These three forms can also be combined to make a more complex expression, for example:

FIND EXPLODE('RNAV') OR 'OMEGA' AND 'DIGITAL COMPUTER' IN SENTENCE OR 'PROGRAM*' AND NOT 'INERTIAL' AND 'NAVIGATION' IN PARAGRAPH

1.2 Database Structure

Probably the easiest database organization, both conceptually and in terms of implementation, is one consisting of all the textual material organized in a single file which is searched sequentially in response to a user's query. The programming necessary to implement the previously discussed connectives is fairly obvious.
The only disadvantage to this type of organization is that the time required for each search on a conventional digital computer increases linearly with the number of characters (or items) stored. For a batch processing system this may not be a serious problem, since many users' requests can be satisfied in a single pass through the database. However, for a system like the National Library of Medicine's MEDLINE [2], with 500,000 documents available for online user inquiries, the batching of requests would be impractical, while starting a search through the entire database for each user request is impossible if adequate response times are to be maintained for a large number of users.

The answer to this problem is to provide an index to the material which can be checked to eliminate the needless searching of items which do not contain the desired terms. This index can be prepared manually by trained indexers, or automatically by a computer. In the latter case, all words can be indexed, only certain words from a predefined list can be indexed, or all words except those on an exception list (such as THE, AND, A, etc.) can be indexed.

Figure 1.2 illustrates the logical organization of the database in an inverted file structure. The index file contains lists of pointers to items in the text file, and other data necessary for the operation of the system. Additionally, the list entries may contain context flags, to indicate that the word is also contained in some context outside of the body of the item (such as the title), and other data, such as a count field to indicate the frequency of occurrence of the word in the item.

As long as the expressions imply the item level for a context, the requested operations can be performed without actually searching the textual material.
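As a rough illustration of the index organization just described, the following Python sketch builds one sorted list of (item pointer, count) entries per index term, skipping words on an exception list. The function names, the tiny exception list, and the omission of context flags are all assumptions made for the example; they are not taken from EUREKA or any other system mentioned here.

```python
from collections import defaultdict

# Hypothetical exception list; a real system would use a much longer one.
EXCEPTION_LIST = {"THE", "AND", "A"}


def build_index(items):
    """Return {index term: sorted list of (item number, count) entries}.

    Items are numbered from 1 in the order given. Context flags and any
    other per-entry data are left out of this sketch.
    """
    index = defaultdict(dict)
    for item_number, text in enumerate(items, start=1):
        for word in text.upper().split():
            if word in EXCEPTION_LIST:
                continue
            entries = index[word]
            entries[item_number] = entries.get(item_number, 0) + 1
    # Sorting by item number gives the ordered lists that merging relies on.
    return {term: sorted(entries.items()) for term, entries in index.items()}


def item_pointers(index, term):
    """Item numbers containing the term, in ascending order."""
    return [item for item, _count in index.get(term, [])]
```

An item-level query then needs only the lists returned by `item_pointers`; the text of the items is never read.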
For an OR operation, all that is required is to form the union of two or more index lists; for AND, the intersection; and for AND NOT, the removal of entries contained in the second argument's list from the list specified by the first. For requests specifying a lower level of context, such as the sentence, the result of the merging of the index lists gives a set of items which have a chance of success in a full-text search. This saves processor time by not requiring the searching of items which cannot possibly match the specified search expression.

Figure 1.2 - Logical Database Organization
Index File:
  List Header A: Index Term, Entry Count (=N)
  List Entry A1: Item Pointer (=X), Context Flags, Other Data
  List Entry A2: Item Pointer, Context Flags, Other Data
  ...
  List Entry AN: Item Pointer, Context Flags, Other Data
  List Header B: Index Term, Entry Count
  List Entry B1: Item Pointer, Context Flags, Other Data
Text File:
  Item 1, Item 2, ..., Item X-1, Item X, Item X+1

1.3 Zipf's Law and the Level of Inversion

Since the storage requirements for an inverted file system depend upon not only the total number of words in the database (the number of tokens), but also the number of distinct words (the types) and the number of times each type occurs (the number of tokens per type), estimating the requirements is more difficult than with a full text searching system. However, it has been observed for a number of natural phenomena, including large collections of text, that when the types are ranked according to their number of tokens, the product of the rank of the type and the number of its tokens is constant. This is referred to as Zipf's Law*. When a constant product is plotted with the rank of the type on the x-axis, and the number of tokens in the type on the y-axis, the graph is in the form of a rectangular hyperbola. If this same curve is plotted on log-log coordinates, the result will be a straight line.

* After George Kingsley Zipf, a professor of German at Harvard University.
Zipf initially studied the distribution of words in a variety of languages and from a number of sources, observing approximately the same results. He regarded these results as some sort of universal truth, which he called the Principle of Least Effort, and attempted to use it to explain the Civil War, Committees of Congress, chamber music, the Chicago Tribune, and sex. [3,4]

Figure 1.3 represents data collected by Dr. Miles Hanley and Dr. M. Joos on the distribution of words in James Joyce's novel "Ulysses" [5]. This work was initially selected to demonstrate that the Zipf distribution would not exist in a sample of this size (260,430 tokens). It can be seen from the product column in the table, and from the graph of the data on log-log coordinates in Figure 1.4, that even a sample of this size confirms Zipf's observations. The abnormality at the right side of the curve exists because a word can only occur an integral number of times; the final step, for example, is between words which occur twice and those which occur once.

Figure 1.5 shows the curves for a large database consisting of the statutes of a state. The curve for inversion to the word level corresponds to the actual Zipf curve, and approximates a straight line when plotted log-log. The other two curves indicate that as the inversion is made to a higher level, the curve flattens on the left side, as a result of a greater number of words which appear in all or nearly all items at the level to which the inversion was made. At the sentence level, about a dozen words (such as A and THE) occur in at least half the items, while at the document level, with a document corresponding to a chapter of the statutes, more than 300 words occur in over half the items.

It is customary in many information retrieval systems to delete common words from the index file, both to conserve space and to prevent the user from making a query which would result in an inordinate number of items matching.
However, this imposes a rather arbitrary restriction on the system user, since it is possible for him to meaningfully use these words in a query. This query would generally take the form of A AND B or A AND NOT B, where A is a list which has a small number of entries, and B is a list which has a large number. Both produce a result list with no more entries than A, while the result of A OR B is of the same order as B (the large list). However, the first two expressions can be rewritten as A AND NOT (NOT B) and A AND (NOT B). Therefore, a list indicating the items which do not contain a term can be stored if the term is contained in more than half the total number of items.

Figure 1.3 - Table of Word Frequency in "Ulysses"

RANK      FREQUENCY   PRODUCT OF RANK AND FREQUENCY
10        2,653       26,530
20        1,311       26,220
30        926         27,780
40        717         28,680
50        556         27,800
100       265         26,500
200       133         26,600
300       84          25,200
400       62          24,800
500       50          25,000
1,000     26          26,000
2,000     12          24,000
3,000     8           24,000
4,000     6           24,000
5,000     5           25,000
10,000    2           20,000
20,000    1           20,000
29,899    1           29,899

[Figure 1.4 - Zipf Curve for "Ulysses": log-log plot of frequency against rank for the James Joyce data]

[Figure 1.5 - Zipf Curves for State Statutes: curves labeled WORD LEVEL, PARAGRAPH LEVEL, and DOCUMENT LEVEL, plotted against LOG (RANK)]

The saving achieved by using this technique depends upon the frequency of occurrence of the words and the total number of tokens at the inversion level selected. A simple model which can be used consists of a line with slope -1 for the Zipf curve of the database inverted to the word level, as illustrated in Figure 1.6. When the inversion is made to a higher level, all points on the curve with a frequency greater than the total number of items at the inversion level are made equal to the total number of items.
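The simple model and the complemented-list rule can be put into a few lines of Python. This is only a sketch under stated assumptions: the function names are invented for the example, the Zipf constant and the numbers of types and items are free parameters, and each type's entry count is taken directly from the clipped slope -1 line rather than from measured data.

```python
def truncated_zipf(num_types, num_items, constant):
    """Entries per type under the simple model.

    The word-level Zipf curve is taken as constant / rank (slope -1 on
    log-log axes) and is clipped at the total number of items.
    """
    return [min(constant / rank, num_items) for rank in range(1, num_types + 1)]


def index_sizes(entries_per_type, num_items):
    """Total entries stored without and with complemented lists."""
    plain = sum(entries_per_type)
    # A type found in more than half the items is stored as its complement.
    complemented = sum(min(e, num_items - e) for e in entries_per_type)
    return plain, complemented


def fractional_saving(entries_per_type, num_items):
    plain, complemented = index_sizes(entries_per_type, num_items)
    return 1.0 - complemented / plain


def and_using_complement(a, not_b):
    """A AND B evaluated as A AND NOT (NOT B), given only the stored NOT B."""
    excluded = set(not_b)
    return [item for item in a if item not in excluded]
```

The saving reported by `fractional_saving` depends entirely on the parameters chosen, so it should be read as an exploration of the model rather than a reproduction of the measured figures.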
In reality, changing the inversion level not only truncates the left hand portion of the curve, but also decreases the slope, moving the point where the negatively-sloped line meets the horizontal line to the left.

Figure 1.7 illustrates the savings achieved by using complemented lists. It must be remembered that the results are plotted on log-log coordinates, so the relative sizes of the areas are misleading. However, it does illustrate where the saving is achieved. With the simple model for a higher level inversion, using complementary lists results in a saving of 15 to 25 percent in the size of the index file. On actual data (the statutes mentioned previously) the decrease in size is 15 percent at the document level and 10 percent at the sentence level, due to differences between the simple model's curve and the actual curve.

[Figure 1.6 - Truncated Zipf Curve: the ideal Zipf curve, the theoretical truncated Zipf curve, and the actual truncated Zipf curve, plotted against LOG (RANK)]

[Figure 1.7 - Savings Using Complementary Lists: number of documents against number of types, marking the total number of documents, the total number of types, the number of types contained in all documents, and the storage saved by using complementary lists. Note: areas are distorted because of the log-log axes.]

Another consideration is the extra processing required to check contexts either above or below the item level [6]. Earlier, it was noted that if the context specified is lower than the item level, full-text searching may be required to determine if a match actually exists. If the context is higher than the item level, then a change in the method used for combining terms is required.
This can be done either by having the item pointer contain encoded information regarding the higher level structure (for example, for inversion at the sentence level, having the item number consist of fields indicating the document and paragraph numbers which contain the sentence, as well as the number of the sentence within the paragraph), or by defining each higher entry as any item within a fixed range of another item. The first technique is inefficient in its use of bits in the item pointer, while the second is inaccurate due to the use of a non-standard definition for the higher levels. However, in many systems operating on data with an inherent hierarchical structure (such as Figure 1.1 shows for state statutes in a legal database), it is possible to invert to an optimal level which minimizes the number of context requests either above or below the item level.

1.4 Processing on Conventional Computers

Most currently implemented inverted file information retrieval systems run on standard digital computers (for example, MEDLINE on an IBM System/370 and EUREKA on a PDP-11/40). Estimates made by the operators of these systems indicate that a majority of the processor time is spent in the routines used for fetching and merging posting lists.

Figure 1.8 is representative of the instructions used in the EUREKA system to produce the AND of two lists both contained in memory. The program has been written to take advantage of the high speed registers available on the PDP-11 computer, and is close to the maximum efficiency for doing the operations required. Figure 1.9 illustrates the data element on which the program operates. The first part of the program checks the context bits to determine if the entry occurs in the proper context, and if not, fetches the next entry on the list. When a valid entry from each list has been found, the two document number fields are compared, and if they are not equal, a new entry is fetched from the list containing the lower document number.
If they are equal, an output entry is created consisting of a document number from either of the two input entries, a count field equal to the minimum of the two input count fields, a tag bit equal to the OR of the two input tags, and cleared context bits. The tag bit is used as an indication that full-text searching is required for an entry, since if one of the input entries required full-text searching to determine if it is a valid entry, then any entry formed from an AND operation with that entry will require full-text searching. Figure 1.10 summarizes the operations used to generate the fields for all three operators.

The number to the right of the comment field for each instruction is its number of memory references, which on the PDP-11 is directly proportional to the time required to perform that instruction. Assuming that an average merge operation will require the reading of one entry and the writing of one entry (it can read either one or two and write zero or one), and counting the number of cycles required on the average to carry out this operation, the average merge time is about 30 microseconds, with an effective bandwidth of 2.17 megahertz. The actual bandwidth of the memory (16 bits per 650 nanoseconds) is 24.6 MHz.

[Figure 1.8 - AND Subroutine from EUREKA System: PDP-11 assembly listing, with the number of memory references noted to the right of each instruction's comment]

Figure 1.9 - List Entry Format for AND Subroutine

EVEN WORD: bits 15-12 CONTEXT-I, bits 11-0 DOCUMENT NUMBER
ODD WORD: bit 15 T, from bit 14 CONTEXT-II, followed by COUNT

Context-I and Context-II form a group of 13 flag bits [Ctx(A)] which indicate the contexts within which the term occurs. The assignment is arbitrary, but must match the assignment used for the context check mask. In general, it indicates contexts other than Body, Paragraph, or Sentence.

Document Number [Doc(A)] is a pointer or indicator of the document which contains the term. Due to implementation restrictions, it only allows 4094 documents in the database.

Count [Cnt(A)] indicates how many times the term occurs within the body of the document. Zero means it is only in another context (Title, Author, etc.) and 63 means it occurs 63 or more times.

T is a special tag bit [Tag(A)] that indicates a result of the merge operations must be full-text searched to determine if it actually satisfies the specified conditions of the query.

[Figure 1.10 - Operations Table: the output fields generated for each of the three operators]

Memory efficiency of a process can be defined as the effective bandwidth of the process divided by the available memory bandwidth. For the previous example, this is about 8.8%. This number is similar to those calculated for other general purpose digital computers; for example, the IBM System/360 Model 75 has an efficiency of 6.4%, due to its higher available bandwidth.

This low efficiency for conventional digital processors can be easily explained by examining the program. Before an instruction is executed it must be fetched from storage, requiring an overhead memory cycle. Because of this, even if every instruction completely processed a word of the input or output data, the efficiency would be only 50%. In addition, on the PDP-11 and many other computers, an instruction may consist of more than a single word, reducing the efficiency even more.

There are also instructions in the program which do not process any input or output data. These can be divided into two classes: flow of control, and locating and aligning. The flow of control instructions include branches necessary to reach other statements of the program either conditionally or unconditionally. The second class is used to find the next input data element in a list or the next available output location (adjustment of pointers), or to transform data to a form which can be processed by the machine (bit masking, shifting, etc.).
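The efficiency arithmetic above can be checked with a short Python sketch. It assumes, following the list entry format of Figure 1.9, that each entry occupies two 16-bit words, so that an average merge (one entry read, one written) moves 64 bits in about 30 microseconds. The function names are invented for this example, and because both inputs are rounded the result only approximates the figure quoted in the text.

```python
WORD_BITS = 16            # PDP-11 word size
MEMORY_CYCLE_NS = 650.0   # one 16-bit word per 650 nanoseconds
ENTRY_WORDS = 2           # assumption: even word plus odd word per entry
MERGE_TIME_US = 30.0      # approximate average merge time from the text


def available_bandwidth(word_bits=WORD_BITS, cycle_ns=MEMORY_CYCLE_NS):
    """Memory bandwidth in millions of bits per second."""
    return word_bits / cycle_ns * 1000.0


def effective_bandwidth(entries_moved=2, merge_time_us=MERGE_TIME_US):
    """Useful data moved per merge (one entry read, one written)."""
    bits = entries_moved * ENTRY_WORDS * WORD_BITS
    return bits / merge_time_us


def memory_efficiency():
    return effective_bandwidth() / available_bandwidth()


def fetch_limited_efficiency(data_refs, instruction_refs):
    """Upper bound when every data reference also costs instruction fetches."""
    return data_refs / (data_refs + instruction_refs)
```

With these inputs `memory_efficiency()` comes out a little under nine percent, and `fetch_limited_efficiency(1, 1)` gives the 50% ceiling for a machine that fetches one instruction word per data word.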
It is clear from the preceding discussion that the problem of merging lists of entries does not nicely match a conventional digital computer's architecture. What is needed is a processor which could execute instructions more compatible with the problem, reducing both the number of instructions which must be fetched and the need for flow of control instructions.

CHAPTER 2 -- A SPECIALIZED LIST MERGING PROCESSOR

Very few types of processors have been proposed to conveniently handle the generally non-numeric task of merging two lists of data. Most non-numeric processors are associative processors, which are ideal for searching large bodies of data, but not for combining two lists and eliminating unwanted entries.

The implementation of a specialized processor to merge two input lists into a single output list is simplified by the nature of the problem. The operations required are both simple and well defined, allowing a hardwired, rather than programmed, sequencer for speed and efficiency. Operations such as pointer and count manipulations can be performed in parallel with the actual merging operation, further increasing the speed. Finally, the data alignment problem present on conventional processors is non-existent, since the data can easily be routed to the appropriate points in the processor (assuming the data format is fixed or falls within a small set of previously defined formats).

2.1 Previous List Merging Processors

Two different styles of list merging processors have previously been proposed: the bit serial/entry parallel unit discussed by Stellhorn, and the HARVEST non-numeric extension to the IBM STRETCH computer designed for the National Security Agency.

Stellhorn [7] proposed using a Batcher merge network [8] to combine the two input lists (see Figure 2.1).
Since it is not practical to build a Batcher network capable of merging two large lists (for two lists, each containing N entries, it requires order N log N Batcher merge elements*), a technique for merging the lists in parts was devised. This consists of merging the next sublist from one of the input lists, selected based on the lower first entry, with the last half of the results of the previous merge (which are fed back from the outputs to the inputs of the merge network). Stellhorn proved that this technique will always produce a properly merged list.

However, this network only produces a list consisting of the merged entries of the two input lists; no action is taken to remove duplicate entries in an OR operation or, more importantly, to identify these duplicates as the only correct results of an AND. This action must be handled by an additional unit, the coordination network. This unit must examine the entries and eliminate those which are not proper results. It then must repack the data in the output buffers and wait until these buffers are full, because some of the entries from the Batcher merge network may have been eliminated. It is possible, when a large number of list entries are being processed in parallel, for either the processing time (with the unit proposed by Stellhorn) or the number of gates (as proposed by Lawrie [9,10]) of the coordination network to be greater than that of the Batcher merge network!

* In all instances in this thesis, log n will mean the logarithm to base two of n, if n is an integral power of two, or the logarithm of the next higher power of two, if it is not.
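The merging-in-parts technique can be sketched in Python as follows. This is one reading of the description above, not Stellhorn's design: `heapq.merge` stands in for the fixed-size Batcher network, `k` is a hypothetical number of entries the network accepts on each input, and the function names are invented for the example. As in the text, duplicates are merged but not removed.

```python
import heapq


def merge_network(x, y):
    """Stand-in for the fixed-size merge network: merges two sorted lists."""
    return list(heapq.merge(x, y))


def merge_in_parts(a, b, k):
    """Merge sorted lists a and b through a network taking k entries per input."""
    output, feedback = [], []
    i = j = 0
    while i < len(a) or j < len(b):
        # Select the input list whose next sublist has the lower first entry.
        take_a = j >= len(b) or (i < len(a) and a[i] <= b[j])
        if take_a:
            block, i = a[i:i + k], i + k
        else:
            block, j = b[j:j + k], j + k
        merged = merge_network(block, feedback)
        # The lower part can no longer be preceded by any unread entry,
        # so it is final; the upper k entries are fed back to the inputs.
        split = max(0, len(merged) - k)
        output.extend(merged[:split])
        feedback = merged[split:]
    output.extend(feedback)
    return output
```

Keeping the upper part as feedback is what lets a network of fixed width handle lists of any length; only the choice of which list supplies the next sublist depends on the data.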
[Figure 2.1 - Stellhorn/Batcher Merge System: block diagram showing the data memory, control network, control computer, disk units, and coordination network, with control and data connections to disk]

[Figure 2.2 - HARVEST Functional Units: stream units P, Q, and R with their registers and matrices, the P unit, the Q and R indexing units, and a table stored in memory]

The second form of merge processor is similar to the IBM 7950 HARVEST processor, an extension to the IBM 7030 STRETCH system [11]. Figure 2.2 is a simplified diagram of the functional units within the processor. HARVEST is programmed by having STRETCH pass it a list of setup instructions. When the processor has been successfully programmed by these instructions, a start command is issued by STRETCH, and HARVEST processes the streams of data based on the instructions. Facilities exist for the transformation of data controlled by table lookup, in addition to logical transformation. This table lookup scheme causes the processor to fetch data from the current table based on a function of the input characters. This data can consist of an arbitrary number of output characters, including none, and an address for the table to be used next.

2.2 A Simple List Merging System

Figure 2.3 shows the major data paths connecting the components of a large scale data processing system, such as the IBM System/360 Model 75. The large, high bandwidth memory is connected to the various processors by a memory bus control unit (BCU), which acts as an arbitrator between the potential memory users. The channels have the highest priority access to the memory, and the central processor the lowest. The channels are used to relieve the need for character assembly by the central processor, and to better match the high bandwidth of the memory to the low bandwidth of the peripheral units.
In a smaller system, the BCU is replaced by a simple bus arbitration protocol, and the channels by direct memory access capability in the control units which transfer large amounts of data.

[Figure 2.3 - Merge System Configuration: block diagram with data paths and control paths linking the CPU and the device controllers (terminals, etc.)]

[Figure 2.4 - Complex Merge System Configuration: the host system channel (also serving other device controllers) connects through a channel interface to the merge system, which contains local memory, a disk controller with index file disk storage, the merge processor, and the scheduling processor, linked by data paths and control paths]

The merge processor is added as if it were an additional processor or channel, with control information exchanged between the central processor and the merge processor, and the merge processor transferring data to and from memory through the BCU. In this configuration, the central processor issues a command to the merge processor indicating the memory locations for the input and output lists and the type of operation desired. The central processor can then execute other tasks until the merge processor completes the operation or detects an error; at this point it will interrupt the central processor for a new command. Because the merge processor's data requirements are high compared to those of normal input/output devices, and because the bus control unit grants access to the central processor only when another unit is not requesting it, the merge processor can effectively halt execution of the program running on the central processor if care is not taken to periodically relinquish memory ownership to the central processor.

A more complex merge processing system is illustrated in Figure 2.4. This system contains its own memory and disk files, so its interference with the operation of the conventional data processing system is minimized. It consists of a merge processor, a disk system, a channel interface, memory, and a scheduling processor.
This scheduling processor receives requests from the host processor, queues them until the appropriate resources are available, fetches the data from disk into memory, merges the entries, and transfers the result either to disk for later usage or to the host processor via the channel interface. To the host system, this configuration appears to be a very intelligent disk system which has all possible combinations of lists stored.

The merge processor can be implemented either as a parallel or serial unit. As is generally true, the parallel unit operates considerably faster, but requires an increase in gates greater than the increase in its speed over the serial unit. However, the parallel unit is easier to understand, and will be discussed first.

2.3 Parallel Element Implementation

A general block diagram of the parallel merge processor [12,13] is given in Figure 2.5. Data is fetched from memory by either the X or Y list fetch logic and delivered to the appropriate mask checker. Here the context bits are checked using the specified mask to determine if the entry is for an item in the proper context; if it is, the entry is placed in the appropriate input holding register and that register is marked full*. Fetching of list entries continues until both holding registers are full. At this time, the two document number fields are compared, and the action to be taken is determined based on the operation specified. This consists of forming the output fields, marking the output register as full if the operations table specifies the creation of an output entry, and indicating that either or both of the input registers are empty. This action continues until the lists are exhausted**.

* The merge processor and the memory interface are interlocked using a bit for each of the inputs and for the output.
These bits are set by the data source when it places data in the buffer, to indicate the connection is full, and reset by the data sink to indicate the connection is empty and new data should be placed in it.

** In the case of A AND B, processing can be stopped when either list A or list B is exhausted, rather than waiting for both lists to be exhausted. For A AND NOT B, it can be stopped when list A is exhausted. The amount of time this saves is highly data dependent, and will be ignored in future discussions.

Figure 2.5 - Merge Element Block Diagram (block diagram: memory; X and Y input fetchers and Z output storer; X and Y mask checkers; X and Y holding registers; document number selector; document number comparator; action selector; field generators 1 through N; command registers)

The major unit in the element is the document number comparator, which is used to decide which input holding registers should be marked as empty and how the output entry should be formed. Figure 2.6 illustrates a ripple-carry design for this comparator, including the equations for each stage and the approximate number of gates and the delay. The output XLOW is true (high) if the document number in the X input register is less than or equal to the one in the Y register; YLOW is true if Y is less than or equal to X. For higher speed operation, a parallel comparator similar to the SN7485 MSI unit [14] can be used.

The document number in the output entry is generated by using a two input selector driven by the document number fields of the two input list entries, and using either the XLOW or YLOW signal as required to select the lower document number. Figure 2.7 shows the simple ALU/selector used to form the output count field.
A comparator examines the two input count fields to determine the lesser, which is selected if an AND operation is being performed. Since only one of the four AND gates is selected in this case, the adder passes the desired count field directly to the output. If either an OR or an AND NOT operation is specified, the proper inputs to the adder are selected based on the XLOW and YLOW control signals: if the appropriate nLOW control signal is true, the Cnt(n) data is fed to the adder, while if nLOW is false, a zero is sent. Other output generators, such as for the tag bit, can be added in a similar fashion.

Figure 2.6 - Ripple Carry Number Comparator
Stage equations (complement overbars restored to agree with the definitions of XLOW and YLOW in the text):
$XL_0 = 0$, $YL_0 = 0$
$XL_i = (\overline{X_i} \cdot Y_i \cdot \overline{YL_{i-1}}) + XL_{i-1}$
$YL_i = (X_i \cdot \overline{Y_i} \cdot \overline{XL_{i-1}}) + YL_{i-1}$
$XLOW = \overline{YL_n}$, $YLOW = \overline{XL_n}$
Gates = 4n; maximum delay = n x average gate delay

Figure 2.10 - Complete Serial Merge Element (logic diagram; legible signal names: CNT(X), CARRY, CONTEXT, YLOW, XLOW, CLK1)

Figure 2.11 - Serial Element Control Flow Chart (flow chart; legible steps: START; pulse RESET1, RESET2, RESET3; wait for Full(X), Full(Y), and Empty(Z); shift inputs and load output until all document number bits are processed; process the context bits; set Empty(X) = XLOW and Empty(Y) = YLOW; set Full(Z) = XLOW AND YLOW for "AND", 1 for "OR", and NOT YLOW for "AND NOT"; process the count bits under control of SELLOW and SELSUM; signal end-of-cycle to the memory interface control)

Figure 2.12 - Element Performance

Figure 3.5 - Network Deadlock
R = LM + LN + MN
L = [3,8,10, ... ]   M = [1,2,6,9,12, ... ]   N = [1,2,5,8,11, ... ]
(Eight value columns per row; of the column headings, only M, C1, C2, C3, and C4 are legible.)
Time
 1    3  X  X  X  X  X  X  X
 2    3  1  X  X  X  X  X  X
 3    3  1  1  X  X  X  X  X
 4    3  2  X  X  X  1  X  X
 5    3  2  2  X  X  1  X  X
 6    3  2  2  X  X  1  X  X
 7    3  2  2  X  X  1  X  X
 8    3  2  2  X  X  1  X  X
 9    3  2  2  X  X  1  X  X
X indicates the connection is empty

Figure 3.6 - Modified Operations Table

produced as before, except in the case where the two inputs are equal. Here, an input is only used to form the output field if it is valid.
The major change consists of unconditionally placing the value of the lower input on the output connection if the output connection is free (does not contain valid data). In addition, the element compares the two inputs regardless of whether they are both valid, and sets the validity of the output based not only on the relative magnitudes of the two inputs, but also on their validity. Figure 3.7 shows the operation of the network in Figure 3.5, but with the modified element operation. A value in parentheses indicates that the value is invalid.

Figure 3.7 - Modified Network Operation
R = LM + LN + MN
L = [3,8,10, ... ]   M = [1,2,6,9,12, ... ]   N = [1,2,5,8,11, ... ]
(Eight value columns per row; of the column headings, only M, C1, C2, C3, and C4 are legible.)
Time
 1    3    (0)  (0)  (0)  (0)  (0)  (0)  (0)
 2    3    1    (0)  (0)  (0)  (0)  (0)  (0)
 3    3    1    1    (1)  (0)  (0)  (0)  (0)
 4    3    2    (1)  (1)  (1)  1    (0)  (0)
 5    3    2    2    (2)  (1)  1    (1)  (0)
 6    3    6    (2)  (2)  (2)  2    (1)  1
 7    3    6    (2)  (3)  (2)  2    (2)  (1)
 8    3    6    5    (3)  (2)  (2)  (2)  2
 9    3    6    5    (3)  (3)  (5)  (2)  (2)
10    8    6    5    (3)  (3)  (5)  (3)  (2)
11    8    6    8    (6)  (5)  (5)  (3)  (3)
12    8    9    8    (6)  8    (6)  (5)  (3)
13    10   9    (8)  (8)  8    (8)  (6)  (5)
14    10   9    11   (9)  (8)  (8)  8    (6)
15    10   12   11   (9)  (10) (9)  (8)  8
(A) indicates A is a value marked as invalid

A modified network is one constructed from processing elements following the operations table given in Figure 3.6. An unmodified network is one formed according to the original operations table in Figure 1.10. The following results show that a modified network will never deadlock, and will produce valid output list entries identical to those of the unmodified network.

Lemma. The output of any subtree of a modified network consists of the union of its input lists, in increasing order with all duplicates removed. Furthermore, if an output list entry is invalid, it cannot become valid at some later time.

Proof. First consider a subtree consisting of a single merge processing element whose inputs are the input lists. By the operations table, it takes the lower of its two inputs and places it on its output. It then gets a higher valued input to replace the one transferred to the output. Therefore, its output is the union of its two inputs, in increasing order. If the same entry occurs in both its inputs, only one output entry is produced, so that no duplicates exist in the output. Finally, once an output has been produced, the input that was used to form it is replaced by a higher valued entry, so that any future output entry, valid or invalid, must be greater than the current output entry.

Assume the Lemma is true for all subtrees consisting of N stages. A subtree of N + 1 stages consists of a single merge processing element, with two subtrees of N stages as its inputs. Since the outputs of these two subtrees are assumed to be in the form given by the Lemma, which is the same form as for an input list, the arguments given for the single element subtree also hold for the final element in the N + 1 stage subtree. Therefore, the output of this subtree is in the form given by the Lemma. By induction, the Lemma is true.

Lemma. If an N stage subtree of a modified network is not deadlocked, and its output connection remains free (either because no valid items are placed on it or because the successor element of the subtree immediately marks it as invalid), the lowest of its inputs is transferred to its output in not more than N cycles.

Proof. For a subtree consisting of a single element, the Lemma is obvious. Assume the Lemma is true for all subtrees of N - 1 stages.
Remember that a tree of N stages consists of two subtrees of N - 1 stages as inputs to a single element at stage N. After N - 1 cycles, each of these subtrees has transferred the lowest of its inputs to its output. At cycle N, the final stage's element takes the lower of these two subtree outputs, which is the lowest of the inputs to both subtrees, and places it on its output connection. Therefore, the Lemma holds for all subtrees of N stages, and by induction is true.

Theorem. A modified network cannot deadlock.

Proof. Assume the network is deadlocked. Therefore, one or more elements are unable to place a valid entry on their output connections because a previous valid entry has not been marked invalid by the element which has the connection for an input. Furthermore, this condition has been in existence for an arbitrary length of time. Such a connection is a blocked connection.

Consider the blocked connection B closest to the input end of the tree. The subtree which has B as its output connection will be called X, the element with B as its input, E, and the subtree which feeds E's other input will be called Y. Because connection B is blocked, E only takes its input from subtree Y. If there are no input lists in common between X and Y, Y will continue to transfer its input list entries to element E's input. Eventually, an entry (possibly the end-of-list marker) greater than the value in connection B will occur at the output of Y. This will allow E to process the value in B, unblocking it. Hence, the network cannot deadlock if there is not at least one input list in common between X and Y.

If there is an input in common, the value of its list entry is greater than or equal to the value in connection B. This is because if there were an input to a subtree less than the value of its output, at some later time the lower input would occur as an output entry. But this cannot occur because of the first Lemma above.
Therefore, subtree Y has at least one input which is greater than or equal to the value in connection B. Since Y is not blocked, eventually the value at the common input will be the lowest input to Y. By the second Lemma, shortly thereafter this value will be the output of Y. Since this value is greater than or equal to the value in connection B, connection B will be marked invalid by element E. Hence, the tree is not deadlocked.

Theorem. An OR element modified according to the operations table in Figure 3.6 produces the same valid items in the same order as one constructed according to the operations table in Figure 1.10.

Proof. The only parts of the new operations table which must be examined are those which differ from the original table. These fit two different categories: the input with the lower value is valid but the higher input is invalid, or both inputs are equal, but only one is valid.

In the first case, a valid result is produced and the lower input is marked as invalid, where previously no action was taken. However, this can produce an incorrect action only if at the next time both inputs are valid, the input which held the higher invalid input now contains a valid entry less than or equal to the original lower input. However, by the above Lemma, this cannot occur. Hence, the OR element functions correctly in this case.

In the second case, the element produces a valid result even though only one of the two equal inputs is valid. Again, this is an incorrect action only if the input which contains the invalid entry were to contain a valid entry less than or equal to the current invalid entry at some later point in time. Since by the Lemma this cannot occur, the operation is performed correctly.

A similar proof can be used to show that the AND, AND NOT, and PASS elements function correctly. Since all elements in the network function correctly, it is clear that the network will always yield the correct results.
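As a software cross-check of the example expression R = LM + LN + MN used in Figures 3.5 and 3.7, the following minimal Python sketch merges ordered document-number lists pairwise. It is not a model of the hardware network or of the validity flags in the modified operations table; it only computes the reference result that, by the theorems above, the network should produce. The function name `merge`, the use of infinity as an end-of-list marker, and the truncation of L, M, and N to the prefixes shown in the figures are assumptions made for illustration. The rule for consuming inputs (X when XLOW, Y when YLOW) and for emitting an output follows the legible parts of the flow chart in Figure 2.11.

```python
import math
from typing import List

END = math.inf  # stands in for the end-of-list marker (an assumption)


def merge(op: str, x: List[int], y: List[int]) -> List[int]:
    """Merge two ascending document-number lists under AND, OR, or AND NOT."""
    if op not in ("AND", "OR", "AND NOT"):
        raise ValueError(f"unknown operation: {op}")
    out: List[int] = []
    i = j = 0
    while i < len(x) or j < len(y):
        xv = x[i] if i < len(x) else END
        yv = y[j] if j < len(y) else END
        xlow = xv <= yv  # X holds the lower (or an equal) document number
        ylow = yv <= xv  # Y holds the lower (or an equal) document number
        if op == "AND":
            emit = xlow and ylow  # entry occurs in both lists
        elif op == "OR":
            emit = True           # the lower entry is always passed on
        else:
            emit = not ylow       # AND NOT: entry occurs in X but not in Y
        if emit:
            out.append(min(xv, yv))
        if xlow:
            i += 1                # X input register marked empty
        if ylow:
            j += 1                # Y input register marked empty
    return out


# List prefixes as printed in Figures 3.5 and 3.7 (the elided tails are omitted).
L = [3, 8, 10]
M = [1, 2, 6, 9, 12]
N = [1, 2, 5, 8, 11]

LM = merge("AND", L, M)
LN = merge("AND", L, N)
MN = merge("AND", M, N)
R = merge("OR", merge("OR", LM, LN), MN)
print(R)  # [1, 2, 8]
```

For these prefixes the result is 1, 2, 8, which agrees with the valid (unparenthesized) entries appearing in the final column of Figure 3.7 at times 6, 8, and 15.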
3.4 Parsing Expressions for a Fixed Tree Size

If the expression can be contained in the available tree, how the expression is parsed has no effect on the required processing time. Disregarding end-of-list effects, the time required to process the expression is identical for all forms of the expression. It is simply proportional to the lengths of the input and output lists, since the network is pipelined and only one entry can be transferred to or from the memory in any one network cycle. However, if the expression cannot be contained in the available tree in any form, the problem of reducing the processing time becomes more complex.

In the following discussion, subexpression will mean that portion of the total expression which can be processed directly by the available tree. The processing of a subexpression by the tree will be termed a pass, with the first subexpression processed during pass one. Figure 3.8 illustrates a possible scheme for numbering the passes. All the passes performed at the same level of trees in the processing of an expression will be referred to as a level.