UIUCDCS-R-80-1027
UILU-ENG 80 1726

HARDWARE FOR SEARCHING VERY LARGE TEXT DATABASES

August 1980

by Roger Lee Haskin

HARDWARE FOR SEARCHING VERY LARGE TEXT DATABASES

BY

ROGER LEE HASKIN

B.S., University of Illinois at Urbana-Champaign, 1973
M.S., University of Illinois at Urbana-Champaign, 1978

THESIS

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1980

Urbana, Illinois

HARDWARE FOR SEARCHING VERY LARGE TEXT DATABASES

Roger Lee Haskin, Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1980

The problems involved in searching very large text databases are discussed. It is shown that conventional techniques for searching current databases cannot be scaled up to larger ones, and that it is necessary to build hardware to search the database in parallel if reasonable search times are expected. The part of the search process requiring the highest bandwidth is scanning the database to detect instances of search terms. Methods from the literature for doing this in hardware are examined, problems with using them in large systems are discussed, and design criteria to be met by a successful search architecture are defined. A new model for text searching, using a nondeterministic finite state automaton (NFSA) to control matching, is introduced. First the NFSA model itself is discussed.
Examples are given showing how it can be used to search for a wide variety of textual patterns. Next, implementation of the NFSA Searcher in hardware is discussed. By partitioning the nondeterministic state table on the basis of pairwise compatibilities and assigning blocks of states to interconnected sub-machines, it is shown how the NFSA can be built with simple logic in a manner that lends itself to LSI implementation. It is of critical importance that it be possible to quickly partition the state table for a group of search patterns. Methods for partitioning tables efficiently are developed and their performance is analyzed. Methods of detecting instances of higher-level search expressions from instances of their component patterns detected by the NFSA Searcher are also discussed. Finally, the configuration and performance of the search system as a function of user load and other parameters is discussed. By comparing the hardware required and response time afforded by the NFSA Searcher with that for an alternative implementation, it is seen that the NFSA Searcher is a significant advance in architecture for text searching.
Table of Contents

Chapter 1 — Introduction 1
1.1 Search Strategies 2
1.2 Text Searchers 4
1.3 Query Resolution 4
1.4 Term Matching - Previous Efforts 7
1.5 Summary 17
Chapter 2 — The NFSA Term Matcher 20
2.1 The NFSA Model 20
2.1.1 Comparison of the FSA and NFSA Models 25
2.2 Implementing the NFSA Term Matcher with Multiple FSA's 28
2.3 NFSA Implementation Using Memory-Mapped State Tables 29
2.4 CM Operation 35
2.4.1 Startup 36
2.4.2 CM State Transitions 37
2.4.3 Forking 37
2.4.4 Loop Transitions 38
2.4.5 Recognizing Document Formatting Codes 39
2.4.6 Bounded EVLDC's 41
2.5 MC Operation 42
2.5.1 Timing Generation 44
2.5.2 Load Cycles 45
2.5.3 Match Cycles 45
2.6 Matching a Data Stream 47
2.7 MSI Prototype of the NFSA Term Matcher 50
2.7.1 MSI Match Controller 51
2.7.2 MSI Character Matcher 53
2.7.3 Comments on the MSI Term Matcher 57
2.8 LSI Implementation of the Matcher 59
2.8.1 Packaging 59
2.8.2 Testing 63
2.9 Summary 64
Chapter 3 — State Assignment 67
3.1 Definitions 69
3.2 Best-Fit (Minimum CM) Assignment 73
3.2.1 Machine Decomposition 74
3.2.2 State Minimization 75
3.2.3 Lower Bound on Number of CM's - Unlimited Interconnection 77
3.2.3.1 Incompatibility Covers 78
3.2.3.2 Term Tail Removal 82
3.2.4 Computing Maximal Incompatibles 84
3.2.4.1 MAXINC Program Overview 84
3.2.4.2 MAXINC Results 88
3.2.5 Lower Bound on Number of Neighbor-Connected CM's 94
3.3 First-Fit Assignment 95
3.3.1 Compatibility Testing 95
3.3.2 Assigning States to CM's 96
3.3.3 Assigning State Addresses in CM's 98
3.3.4 AS2CMZ Program Overview 99
Chapter 4 — Query Resolution 104
4.1 Query Resolution Algorithms 105
4.2 Query Resolution in Real Time 108
4.3 Query Resolution for the Hardware Term Matcher 110
4.3.1 Accepting Hits from the Term Matcher 111
4.3.2 Term Hit Decoding 113
4.3.3 Context Boundary (Fern) Processing 114
4.3.4 Term Hit Processing 116
4.4 Contexts Spanning Track Boundaries 119
4.5 Processing Large Search Expressions 123
4.6 Summary 124
Chapter 5 — System Configuration and Performance 126
5.1 System Behavior Under Load 126
5.1.1 Search Time 128
5.1.2 Number of Queries in the System - Constant Interarrival Time 130
5.1.3 Number of Queries in the System - Poisson Arrivals 131
5.1.4 Queries per Searcher 133
5.1.5 Searchers per Drive 135
5.2 Term Matcher Characteristics 136
5.3 Loading the Searchers 138
5.4 A Case Study 140
5.5 Summary 143
Chapter 6 — Summary and Conclusions 144
References 148
Vita 151

Chapter 1

Introduction

Several large scale online text retrieval systems are now available, both systems that store document abstracts (Medline [McCar78], Orbit [Black78], and Dialog [Bayer78] to name a few), and those that store the full text of their documents (notably Lexis [Sprowl76], and Westlaw and Juris [Hollaar79]). Large amounts of computer readable text are also being accumulated as a side effect of other processes such as office automation and computer typesetting. The availability of large text databases containing material pertinent to fields such as law, engineering, management, etc., will generate a demand for facilities to allow this data to be searched. Thus both the number and size of text database systems can be expected to increase in the near future. Two factors have contributed to limiting the growth of such systems, both in terms of the database size, and in terms of the size of the user community that can afford to use them.
These factors are the processing capacity of the computers on which the systems run, and the cost of online storage - both the per-bit costs of disk drives and the indirect cost of the space to put them. Disk technology is advancing rapidly; current drives have over 30 times the capacity and cost less than half as much as those of ten years ago. Databases in the 30-100 billion character size range are now economically feasible. New advances such as monolithic read-write heads and 'Winchester' (sealed surface) technology promise to continue this trend. Mainframe technology has not kept pace - most of the speedup in recent computers has been in execution of floating-point operations. Systems whose query response times are only just tolerable to begin with cannot be expanded to take advantage of the increased database size allowed by larger, cheaper disks unless methods are found to speed up searching. However, this speedup must not have the effect of pricing retrieval systems out of the reach of their user community, as just adding more conventional processing power would do. Searches on present systems are too expensive already (from a few dollars to as much as $100 per search); any significant increase in this cost would be unacceptable. Much effort has been directed toward developing special purpose hardware to speed searching of formatted databases. In particular, machines to support the relational model (RARES [LiSmSm76], RAP [OzScSm75], and CASSM [SuLip75]), and variants of this model (i.e. sets of value-attribute tuples, for example DBC [HsKaKe75]) have been proposed. Prototypes of RAP and CASSM have been built. Unfortunately, text (journal articles, legal decisions, books, etc.) cannot usually be formatted into the fixed size fields normally required by such database management systems. Thus, these hardware designs are not directly applicable for use as fast text searchers.
1.1 Search Strategies

Strategies for speeding up the search of text databases are either to increase the speed at which characters in the database can be processed, or to narrow down the area to be searched to a small fraction of the database. The second strategy usually involves using an index that contains pointers to occurrences of search terms in the data. Index-processing hardware has been relatively well studied ([Stell75], [Hollaar76], [Hurl76], and [Miln76]). It is tempting to place the entire processing burden on the index processor by inverting to the word level. Here, no text searching at all is necessary. While it initially appears to be a viable solution, word-level inversion has severe drawbacks. One is the complexity of the postings (pointers to term occurrences). The index processors proposed in the literature are only designed to perform simple AND, OR, and NOT operations between postings lists, and this is not adequate to handle common query operations such as proximity ('A within five words of B') and context qualification ('A and B in the same sentence'). Another problem is the large amount of space required to store a fully inverted index. Space estimates for storing the index range from 50% to 300% of the space needed to store the text itself ([BiNeTr78]). The performance of index processors such as Stellhorn's degrades substantially if the postings lists are stored on moving head disks ([Hurl76], [Miln76]). Storing an index of a size comparable to the full text is expensive enough on moving head disks, and is not economically feasible on any faster medium (i.e. drum, shift register memory, etc.). Thus, the performance degradation due to the use of moving head disks for index storage must be accepted if full inversion is used. Similarly, searching the entire database for each query has been proposed [BiNeTr78]. In this strategy, the entire surface of each disk is scanned repeatedly. During each scan, the
batch of queries arriving during the preceding scan is searched for. Searching the entire database results in slow best case response time. How slow response is depends upon the implementation (whether all drives are searched in parallel or sequentially; whether multiple heads per drive are scanned simultaneously, etc.). This topic will be discussed in more detail in Chapter 5. A third alternative search strategy is partial inversion. Here, the index contains a posting for each region of the database (i.e. cylinder, track, document, etc.) containing an instance of the term. This index contains fewer and smaller postings, so the index size can be held to around 20% to 30% of the text size. It thus costs much less to store than the considerably larger fully inverted index. The index is processed as for a fully inverted system, and then regions in the result postings list are searched for actual occurrences of the search expression. Since the region pointed to by each posting can (and usually does) contain several occurrences of the term, postings lists will be much shorter and faster to process than if full inversion were used. By using a partially inverted index, it might be possible to confine full text searching to a small enough area to make it reasonable to use a conventional processor for searching. However, even assuming that such a processor could keep up with disk transfer rates (about 1 microsecond per character), and assuming that the index processor will narrow the search to a very optimistic 0.1% of a 50-billion character database, the average query would completely saturate the processor for almost a minute. A system of this size would be expensive and require many users to pay for its operation. If a reasonable number of users are to be accommodated simultaneously, it is clearly quite intolerable to require a full minute for processing an individual query! 
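The arithmetic behind this estimate can be made explicit. The figures below are only the ones quoted above (a 50-billion-character database, 0.1% index selectivity, about one microsecond per character on a conventional processor); nothing here is measured:

```python
# Back-of-envelope search-time estimate using the figures quoted in the text.
database_chars = 50e9        # 50-billion-character database
index_selectivity = 0.001    # index narrows the search to 0.1% of the data
seconds_per_char = 1e-6      # conventional processor: about 1 microsecond/char

search_seconds = database_chars * index_selectivity * seconds_per_char
print(search_seconds)        # 50.0 -- the "almost a minute" of CPU per query
```

At 50 seconds of saturated processor time per query, even a handful of concurrent users would overwhelm a single conventional machine, which is the point of the paragraph above.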
No matter whether full searching or partial inversion is used, special search hardware is necessary if the database is large and if fast response time is desired. Thus, a general-purpose large text database system might consist of a conventional computer to handle communication with users and to translate queries, of index and text data residing in secondary storage, and of special purpose machines to process the index and search the text.

1.2 Text Searchers

Before discussing the design of a text searcher, its function should be more clearly defined. The searcher (Figure 1.1) can be thought of as a black box accepting questions (search commands) consisting of an encoded query (search expression) and the address of a region to be searched. It responds with pointers to occurrences of the search expression in the region. The searcher accepts many such commands, processing them concurrently. It does not necessarily provide answers in the order the questions were issued, but rather responds in an order that attempts to optimize some performance variable such as throughput or response time. The search system thus has the job of scheduling the order in which regions are to be searched, and of carrying out the searches. Searching itself can be broken down into two processes: scanning the database region being searched for instances of search terms and formatting codes (i.e. end of sentence, etc.), and doing the bookkeeping necessary to detect instances of the query's search expression (query resolution).

Figure 1.1 The Text Searcher (block diagram: term comparator and query resolver)

1.3 Query Resolution

Conceptually, query resolution is quite simple. Each time a term is matched, the occurrence is recorded, and at each context boundary the occurrences are checked to see whether the search expression has been satisfied. Then the data structure is reinitialized (the appropriate term-found flags are reset) and the search continues. Each time an instance of the expression is detected, its location is reported to the host. Resolvers for conventional systems operate sequentially; documents are processed one at a time from start to finish. Design problems are typically dependent upon the implementation of the term matching logic and upon the search expression format - for example, matching resolution speed to expected term hit rates. Such considerations are often very mundane, and as might be expected, not much has appeared on query resolution in the literature. The query resolution functions for the CIA's proposed SAFE system were discussed in [OSl77a] and [OSl77b]. Roberts [Rob77] discussed query resolution processing speed requirements under assumptions on query statistics appropriate to the SAFE system. The query resolver for this system is designed to operate in conjunction with a hardware term matcher, and is intended to perform query resolution at disk speeds. However, the SAFE system has several peculiarities which limit the applicability of its techniques to more general systems. The system has a relatively slow response time (nominally over seven minutes) and is used essentially in a batch mode. Thus, queries tend to be more complex than typical queries for other systems. Also, the SAFE system is not indexed. Since non-relevant as well as relevant documents are searched, the hit rate on each term can be expected to be somewhat lower than if only potentially relevant documents were searched. Finally, the query language limits the type of search operations which can be performed.
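The record/check/reset cycle described at the start of this section can be sketched in a few lines. This is only an illustration of the concept, not any system's actual resolver; the event encoding, the function name, and the AND-only search expression are invented for the example:

```python
def resolve(events, required_terms):
    """events: ('hit', term) and ('boundary', context_id) tuples, in data order."""
    found = set()            # per-context term-found flags
    matches = []
    for kind, value in events:
        if kind == 'hit':
            found.add(value)             # record the term occurrence
        else:                            # context boundary: check, then reset
            if required_terms <= found:  # expression satisfied in this context?
                matches.append(value)
            found.clear()                # reinitialize the term-found flags
    return matches

events = [('hit', 'cab'), ('hit', 'taxi'), ('boundary', 'sentence 1'),
          ('hit', 'cab'), ('boundary', 'sentence 2')]
print(resolve(events, {'cab', 'taxi'}))  # ['sentence 1']
```

A real resolver must of course handle OR, NOT, and proximity operators and run at disk rates, which is where the design problems discussed below arise.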
Further development of the resolution algorithms is necessary if a flexible query language is to be supported. Stellhorn ([Stell74a], [Stell74b]) discusses search expression data structures and resolution processing techniques used by the EUREKA system ([Morgan76], [BurEm79]). Most of this discussion assumes a fixed, hierarchically related set of formatting codes and a term matching technique that requires making one pass over the data for each term in the expression. The result was a set of heuristics for optimizing search time under these constraints, but the work is of little relevance to systems capable of searching for many terms simultaneously. Questions of speed, memory requirements, and efficient algorithms necessary in the parallel search system being considered here remain to be answered.

1.4 Term Matching - Previous Efforts

Several approaches to building term-matching hardware have been suggested in the literature. Cheng [Cheng77] proposed two parallel matcher designs using associative match cells (character comparators) to match strings held in memory at extremely high speeds. The design was intended primarily to speed execution of SNOBOL pattern matches and has several drawbacks that make it impractical for use in searching databases. First, it was designed to search short (less than 100 character) strings - terms spanning string boundaries cannot be matched and must be processed specially. Second, a huge number of comparators is needed. The fastest version of the matcher uses an MxN array of comparators, where M is the string length (about 100 characters) and N is the maximum number of characters in a term (about 16). Either the matcher must be cycled once for each term being searched, or more than one NxM comparator array must be included. Large systems needing search hardware can require searching for many terms (50-100) simultaneously, making either of these alternatives unpalatable.
Finally, duplicating this huge matrix for many disk drives is impractical, so the searcher must be built as a centralized unit with a large buffer memory, with the disks transferring into it in parallel with searching (Figure 1.2). The cabling and channel hardware to do this would be prohibitively costly. Cabling, channel hardware, and buffer memory costs can be eliminated by placing the search hardware local to the storage device (Figure 1.3). In this arrangement, the only high-bandwidth path is between the storage medium and the term matcher connected to it. The only necessary communication between searchers and host is instructions telling the searchers what to look for, and reports to the host on the locations of hits on the search expression. The low bandwidth required by this communication path allows searchers to be connected to the host along a standard I/O bus. Distributed searchers using associative comparators to match terms have been proposed by Stellhorn [Stell74b], Bird [BiTuWo77], Foster and Kung [FoKu80], and Mules and Warter [MuWar79]. These matchers accept data at typical disk rates (200 - 1000 ns per character), thus one matcher is required per track being searched at any one time. The first three of these matchers share similar drawbacks. They cannot handle embedded variable-length don't cares (EVLDC's - patterns with a specified prefix and suffix and an unspecified middle). More importantly, they are limited as to both the number and length of terms they can handle. Stellhorn's matcher, for example, uses a fixed NxM array of cells, where N is the maximum term length allowed and M is the maximum number of terms allowed. The average term length in most queries is under eight characters; allowing a reasonable maximum term length of 16 characters results in many cells being unused.
Also, associative matchers are typically not very flexible in the types of matching they can perform (usually allowing only exact match or fixed-length don't care). It is very useful to have other functions available (such as matching any punctuation or matching any alphanumeric). However, adding any more power to the comparator requires that the necessary added logic be included in each cell in the comparator array. Mules and Warter's matcher is a very clever design - it is capable of detecting several types of errors in the data (insertions, deletions, and transpositions) and matching terms in spite of them. It has a flexible bit-by-bit don't care masking scheme. This matcher was designed for searching a small, noninverted database. It is debatable how useful the error-detection feature is with an inverted database; the index processor will only point the text searcher to areas known to contain correct spellings. Also, this searcher is designed to handle a small number of terms (16) of limited length (16 characters). It is not clear that it could be economically scaled up to handle the number of terms required in a large system. Copeland [Cop78] and Mukhopadhyay [Muk78] independently proposed matchers based upon networks of match cells. Copeland's matcher was used in a relational database environment, and matched only one term per pass over the data. Mukhopadhyay described a more complex design with both character match cells and other cells that handle functions such as character counting and boolean operations. Presumably a reconfigurable interconnection network would be required. This was not discussed in [Muk78], and since such networks are difficult to implement in LSI, a large matcher of this type would probably not be practical for direct scanning of data coming off the disk. An alternative implementation for the term matcher can be based upon the concept of a finite-state automaton (FSA).
Figure 1.4 shows the state diagram and associated table necessary to match the word '#CAB#', where '#' indicates a word break. When a '#' is seen, the FSA makes a transition to state 2. If the next input character is 'C', the transition to state 3 is taken, and any other input causes the FSA to go either to state 2 (on a word break) or state 1 (for anything else). Matching continues in this manner until the trailing '#' is seen, at which point an output signal is produced and the FSA returns to state 2. Conventionally, this state diagram is specified in the tabular form shown in Figure 1.4. Each row represents a state, and the entry in each column corresponds to the next state if the input character shown at the top of the column is received.

                  INPUT
    STATE    #    A    B    C    D ...
      1      2    1    1    1    1
      2      2    1    1    3    1
      3      2    4    1    1    1
      4      2    1    5    1    1
      5     2/1   1    1    1    1

Figure 1.4 FSA State Diagram and Table (the entry 2/1 marks the transition on which the output is produced)

Bird [OSl77a] designed a searcher based upon an FSA model. This matcher (Figure 1.5) is designed to search the entire disk sequentially; all queries arriving during a search are batched and processed concurrently during the next pass. The number of terms in a batch of queries can be quite large. All terms to be searched for are collected, and a state table is built. A conventional FSA state table, with one row per state (character in a term) and one column per possible input character code, would require a prohibitive amount of memory to store. Recognizing that the table is very sparse, Bird devised a method of storing it in much less memory, using one type of state (sequential) for states having one transition leading out of them, and another type (index) for the 10% of the states in the table having more than one outbound transition defined for them. Sequential states operate straightforwardly; each has only one possible successful match character.
If the next input character corresponds to the one required for the current state, a transition to the next sequential state is taken. On a mismatch, a default transition to an idle state is taken. States having more than one successful match alternative require a more sophisticated processing method, which is embodied in the index state. Rather than indexing into a vector of next state addresses, one for each possible input character (wasteful of memory), or comparing against a list of alternative successful match characters (requiring multiple comparators or iteratively cycling one comparator), the Bird machine has a bit vector stored in each index state word. Each character code in the input alphabet corresponds to a bit position in the vector. When an input character is received whose bit is set, a 'leading ones' count is done. The number of bits set in the vector with positions lower than that of the input character's bit is used as an index, and added to a base address stored in the state word. The resultant address is that of the next state. If the input character's bit was clear, the default transition to an idle state is taken. The FSA matcher represents a significant improvement over the previous designs. It has less memory wasted due to term-length variations, and since separate comparison logic is not required for each character in the list of terms, it can be fairly sophisticated in terms of what types of matching it will perform. Also, no complex interconnection network is required. However, the FSA has a few shortcomings of its own. There is a definite speed/hardware tradeoff in the leading ones counter. Consider a Bird machine matching a database with a character set with n codes (typically, n is in the range from 64 to 256).
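The index-state lookup just described can be sketched as follows. The field widths, alphabet, and packing here are simplified assumptions for illustration, not Bird's actual state-word format:

```python
def index_state_next(bit_vector, next_states, code, idle=1):
    """Follow an index-state transition for input character `code`.

    bit_vector: one bit per character code, set where a transition is defined.
    next_states: packed list of successor states, in ascending code order.
    """
    if not (bit_vector >> code) & 1:
        return idle                              # bit clear: default to idle
    lower = bit_vector & ((1 << code) - 1)       # bits below the input's position
    return next_states[bin(lower).count('1')]    # ones count indexes the list

# A state with transitions defined on codes 2 and 5, say to states 7 and 9:
bv = (1 << 2) | (1 << 5)
print(index_state_next(bv, [7, 9], 2))  # 7
print(index_state_next(bv, [7, 9], 5))  # 9
print(index_state_next(bv, [7, 9], 3))  # 1 (no transition defined: idle)
```

The software popcount here is exactly the operation the hardware leading-ones counter must perform in a fraction of a character time, which is the tradeoff analyzed next.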
Counting the ones in an n-bit vector can be done with one shifter and a log n bit counter in n steps, or can be done in O(log n loglog n) time using O(n^2 log n loglog n) gates ([ChKu77]). If n is large or if the data rate is fast, the expensive realization may be necessary (repeated, of course, for every disk drive in the system). To overcome this, a version of the Bird machine was proposed which processes the data stream in 4-bit nibbles. This reduces the bit vector size to 16 bits and simplifies the leading-ones counter, but requires two states per character. As a result, the rate at which characters can be matched using a given state table memory speed is reduced. The positional bit vector also complicates matching on character classes. If a particular transition was to be taken if the next input character was any numeric digit, all ten corresponding bits would have to be set and ten separate (identical) states would be required in the list of next states. The Bird machine as implemented has special tests to overcome this problem, at the expense of extra hardware and complexity. Certain terms (and combinations of terms) greatly increase the complexity of the FSA state table. This can be illustrated most simply with two examples. Figure 1.6 shows a state diagram to match the string 'ANAS'. Suppose the input string is 'BANANAS'. When the first 'ANA' has been matched, the FSA will be in state 3. When the following 'N' is seen, the FSA must 'backtrack' to state 2 to enable successful recognition of the succeeding 'AS' rather than take the normal mismatch transition to state 0. Similarly, if an 'A' is seen in any state, the FSA must go to state 1. Figure 1.7 shows the state diagram to match the two strings 'FISH' and 'ISMS'. It illustrates that the start of a term can be encountered while another one is being matched. Terms such as these which are not required to begin on word boundaries (i.e.
whose first character can be preceded by another alphanumeric character) are called initial variable-length don't-cares (IVLDC's). Notice first that each state must have transitions to state 1 (if 'F' is seen) and state 4 (if 'I' is seen).

Figure 1.6 Diagram of FSA State Table with Backup Transitions

Figure 1.7 Diagram of FSA Table with Embedded Startups

Also, state 3 must 'remember' that 'ISMS' can occur within a suspected instance of 'FISH', and must have a transition to state 6 to allow this path to be continued. In Figure 1.7, mismatch transitions back to state 0 have been left out for simplicity, but they still exist in the table. The point of these examples is that every character of the input is potentially the start of a search term in addition to being the middle of another. Terms which can start during the matching of others (or themselves) must be detected when the state table is being built, and the appropriate recovery transitions must be included. The Bird machine ameliorates this problem to some extent by including special logic to automatically execute transitions to one state if the input is both a mismatch and a word separator, and another state on any other mismatch. If all terms start on word boundaries, most have only one match transition and can be assigned as sequential states. However, if even one term is not restricted to starting on a word boundary, all other states in the table must have recovery transitions leading to that term, and thus must be index states. To circumvent this problem, a second identical FSA is included in the Bird machine, and all IVLDC terms are assigned to it. All states in the second FSA are index states, but the main FSA can now be assigned mostly to sequential states. A similar problem exists with continuous word phrases (CWP's) - terms containing more than one word in sequence. These are handled by yet a third FSA. The implementation of index states causes three more problems.
First, sequential and index states require different amounts of time to process. The Bird machine uses a FIFO input buffer to synchronize the FSA with the disk. Second, the Bird machine requires memory access times in the 100 nsec range. The per-bit cost of memory in this speed range is higher than that for slower memory. Finally, index states require storing the leading ones bit vector, the index table base address, and several other smaller fields. As a result the state memory word is quite wide (85 bits for the 6-bit character version and 39 bits for the 4-bit nibble version). The necessary high memory speed and wide data path both would complicate an LSI implementation.

1.5 Summary

This chapter discussed the problem of searching very large text databases. The applicability of indexing and full-text searching were discussed, and it was shown that a combination of the two techniques, partial inversion, was an attractive approach to use in implementing a search system. However, even if partial inversion is used to narrow the search to a fraction of the database, the necessary searcher bandwidth is too high to be accommodated by conventional processors, and specialized text search processors need to be developed. There has been some investigation into the design of such processors, and how these designs could be used in a large-system environment was discussed. The three basic organizations which have been explored in the literature are associative array comparators, comparator networks, and finite state automata. It was shown that each has inherent problems complicating their use in large systems. It is now possible to state several criteria to be met by a well-designed text searcher for large text databases:

1. The design should be flexible in accommodating different system loads. It should be configurable, such that different systems can be designed to handle different ranges of user loads and expected query complexities using the same basic design.
The design should scale up well as the load it is to be configured to handle increases.

2. To minimize cost in large systems requiring many searchers, the design should lend itself to LSI implementation. Factors such as data path width, use of memory vs. random logic, and partitioning into small, identical building blocks should be considered.

3. It should straightforwardly implement complex matching operations such as matching groups of character codes, handling variable-length don't-cares (VLDC's), and recognizing document formatting codes.

4. The channel bandwidth should be minimized to lower cabling and hardware costs.

5. The design should not require the use of large buffer memories. Not only are these expensive, but extra program logic and execution time is required to allocate space in them.

Each of the three organizations discussed fell short in one or more of these areas. To facilitate building search systems for large databases, improved architectures are clearly needed. The subsequent chapters will be devoted to developing such an architecture. First, a new model for the search process will be introduced. Next, it will be shown how the model can be mapped into hardware in a manner fulfilling each of the above design criteria. Then, it will be shown how this searcher can be fitted into a retrieval system. Problems of translating queries into commands for the search system and of processing the output of the searcher to detect instances of search expressions will be discussed, both in single-user and multi-user environments. Configuring the search hardware for use in individual systems will be discussed. It will be considered how parameters such as user load, index system selectivity, and query size affect both the configuration of the system (i.e. how much hardware is necessary to handle the load) and the response time.
Finally, the new design will be compared with one of the above architectures to determine the cost and performance improvement it affords. In short, this chapter demonstrated why it is difficult to build search systems for very large text databases using current designs. The following chapters will introduce a new design, and will show how large systems having excellent response times can be built using it.

Chapter 2

The NFSA Term Matcher

Of the alternatives examined so far, the Bird FSA came closest to fulfilling the design criteria stated at the end of Chapter 1. It was more economical to implement in hardware than the associative schemes, and could handle don't-care conditions without the difficulties of other approaches. The problems to be overcome were the necessary recovery (backtrack) transitions if wrong paths are taken, the many alternative transitions out of certain states, the wide state word necessitated by the leading-ones bit vector, and the fact that the state table must be stored in relatively fast memory. These problems are not due to the implementation of the FSA matcher; they are due to inherent shortcomings of the deterministic FSA as a model for the search process.

2.1 The NFSA Model

A far superior model for the text search process is the nondeterministic finite state automaton. An NFSA can be described ([HopUl79]) as a 6-tuple ({X}, {Y}, {Z}, M, O, y_0):

  {X} is the alphabet of allowable input characters
  {Y} is the set of states the NFSA can occupy
  {Z} is the set of outputs produced
  M is a mapping X x Y -> 2^{Y} (the state transition function; the set 2^{Y} is the power set of {Y})
  O is a mapping X x Y -> {Z} (the output function)
  y_0 is the initial state

The primary difference between a deterministic and a nondeterministic FSA is that the latter can occupy more than one state at once. Actually, different references state this concept in different ways.
For example, Hopcroft and Ullman ([HopUl69]) define an NFSA as 'choosing any one' of the successors of a state when making a transition. A sequence of inputs is accepted if there exists a sequence of transitions leading into a final state that could be taken as a result of receiving the input sequence. (A final state would correspond to one generating an output in the above definition.) The ambiguity regarding which transition is taken at each step is presumably resolved upon arrival in the final state. Ten years later ([HopUl79]) they changed their minds; now during each transition with multiple successors, the NFSA is said to 'make a duplicate copy' of itself in each possible successor state. For our purposes these distinctions are meaningless; it is just as easy to think of the NFSA as one machine that can occupy several states simultaneously. The FSA matched input strings by occupying one state and making a transition to one of a number of states based upon the input character. In any state the FSA had to determine which alternative input was received and choose one of a number of possible successors. The comparator logic to do this was quite complex. The NFSA will be employed to match terms by having it occupy one state for each alternative next input character. Instead of choosing from a number of successors, each state will only have to determine whether or not the next input is the one alternative that it is assigned to match, and make transitions to one (or more) successors if a match occurs. It is quite simple to design such a yes/no comparator. To allow the use of a simple yes/no comparator, the transitions out of all states but the initial (or idle) state will be restricted in the following way: each non-idle state will have associated with it some subset x of the input alphabet {X}, and all transitions out of the state will be labelled either x or x̄ (the complement of subset x).
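Either formulation reduces, in software terms, to tracking the set of currently occupied states and mapping that set through the transition function on each input character. The Python fragment below is only an illustrative simulation of this idea (the toy transition table and the run function are inventions of this sketch, not part of the hardware design); it detects occurrences of 'ISM':

```python
# Minimal NFSA simulation: the configuration is a SET of states.
# delta[(state, char)] -> set of successor states; state 0 is the
# idle state, and state 3 plays the role of an output transition.
delta = {
    (0, 'I'): {0, 1},   # stay idle AND start a match attempt
    (1, 'S'): {2},
    (2, 'M'): {3},      # reaching state 3 signals that 'ISM' was seen
}

def run(text):
    states = {0}                     # initially only the idle state
    hits = []
    for pos, ch in enumerate(text):
        nxt = set()
        for s in states:
            nxt |= delta.get((s, ch), set())
        nxt.add(0)                   # the idle state is always occupied
        if 3 in nxt:
            hits.append(pos)         # report the end position of the term
        states = nxt
    return hits

print(run('ANTITERRORISM'))          # -> [12]
```

Note that the idle state is re-entered on every character, so new match attempts can begin at any position; this is the same 'always occupying its idle state' behavior the CM startup logic relies on later.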
Deciding which transition(s) to take out of a state only requires one check to see whether the input is in x. The form of these subsets x will be further restricted to include only certain useful combinations of the inputs (i.e., single characters, all alphanumerics, all punctuation, document formatting codes, etc.), which will allow building a comparator to evaluate an input's membership in x using a minimum of logic. Figure 2.1 shows a sample list of terms to be matched. The special character '#' signals that any punctuation (word separator) can appear in that position, '%' stands for any document formatting code (i.e., end of sentence), '*' denotes that any single character can appear, 'u' allows any alphanumeric character, and '?' stands for any string of alphanumeric characters.

  1. #A?ISM#
  2. IST#
  3. #SCHISM#
  4. #BEST
  5. #BENT#
  6. #BUNT
  7. #BUNTED#

Figure 2.1 Sample List of Search Terms

Figure 2.2 shows the diagram of a nondeterministic state table generated directly from the term list. Each transition is labeled with the character that must be found for the transition to be taken. Mismatches cause transitions to be taken to the idle state, which is not shown for clarity. Also, transitions not leading into or out of numbered states are presumed to lead from or to the idle state. The last transition for each term is labeled 'x/z', where x ∈ {X} is the last character in the term and z ∈ {Z} is the output produced that signals detection of the end of a term. Each such transition's output is unique, and identifies which term has been found. Transitions leading into the first state of each term have two-character labels; for '#SCHISM#' the first transition is labeled '#S'. This means that the first state is entered only after both characters in the label are matched in sequence. The first character is restricted to being a type code, i.e., '%', '#', 'u', or '*'.
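Because the subsets x are limited to these few classes, membership can be evaluated with a single simple check per state. A rough software analogue follows (the function and the FMT stand-in code are hypothetical; Python's isalnum stands in for the hardware's alphanumeric class):

```python
# Yes/no comparator sketch: each state is assigned one subset label
# from Figure 2.1; membership is one cheap test.
FMT = '\x1f'   # made-up stand-in byte for a document formatting code

def in_subset(label, ch):
    """True iff ch belongs to the subset denoted by label:
    '*' any character, 'u' any alphanumeric, '%' any formatting code,
    '#' any punctuation/word separator, otherwise a literal character."""
    if label == '*':
        return True
    if label == 'u':
        return ch.isalnum()
    if label == '%':
        return ch == FMT
    if label == '#':
        return not ch.isalnum() and ch != FMT
    return ch == label

# A state's comparator is then a single membership check:
print(in_subset('u', 'A'), in_subset('#', ' '), in_subset('S', 'S'))
# -> True True True
```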
The way to interpret the two-character label is that the first character is a precondition; it must be received to enable the transition into the first (or start) state of a term to be taken on receipt of the second character. This is a slight departure from the standard definition of the NFSA, in which each transition has only one label. This change reduces the number of states (two states with single-character labels on incoming transitions are combined into one) and reduces the frequency of transitions out of the idle state.

Figure 2.2 Diagram of Nonreduced NFSA State Table

Note the first state of '#A?ISM#', which matches terms like 'atheism', 'antiterrorism', etc. The arc labeled 'u' means that the NFSA will stay in state 1 until it is forced out by a word separator. When an 'I' is matched, a transition is taken both to the next state and back to the present state. This is done because the first 'I' in the term may not be the start of 'ISM' (i.e., the first 'I' in 'antiterrorism'). This type of state is called a loop state. Figure 2.3 shows the diagram condensed to eliminate multiple occurrences of term prefixes. For example, after recognizing '#BE', a fork transition is taken to states 15 and 17; one tine recognizes instances of 'BEST' and another those of 'BENT'. To simplify design of an NFSA term matcher, we will limit the maximum number of transitions out of fork states and loop states to three. Results to be presented later will show that this is a sufficient number to match practical-sized tables. Even if the number of forks out of a state is greater than three, the table can be modified to limit fanout to three. Figure 2.4 shows the diagram for a table with four forks out of state 3. Figure 2.5 shows the diagram modified to limit the fanout.
Obviously, if backing the fork up to a previous state caused that state to have too large a fanout, the fork could be backed up still further. If it were necessary to back a fork up past the beginning of a term, the excessive fork would just be entered in the table as a separate term.

Figure 2.3 Diagram of Reduced NFSA State Table

Figure 2.4 Diagram of Table with Fanout of Four

Figure 2.5 Table with Fanout Reduced to Three

2.1.1 Comparison of the FSA and NFSA Models

The NFSA model can be mapped into a hardware implementation having significant advantages over an FSA implementation, and this will be discussed in detail later. However, another advantage that the NFSA has is that it is a simpler and more natural model of the search process. The NFSA state table is much simpler than the FSA table. It is easier to understand, contains fewer transitions, and is easier to generate from a list of search terms than its FSA counterpart. This is best shown by considering an example. To illustrate the comparative complexity of the FSA model, consider the FSA state diagram corresponding to the list of terms in Figure 2.1. This FSA diagram is shown in Figure 2.6. As in the NFSA example, many of the transitions are not shown for clarity. In addition to the transitions shown, every state has a transition to state 2, which is taken if the input is a wordspace ('#'). Additionally, all states but 3, 4, 5, 6, and 19 take transitions to state 23 if the input is 'I'. Finally, all states except 3, 4, 5, and 6 take transitions to state 1 (the idle state) if an input character is encountered other than one for which another transition is defined. These states do not have transitions to the idle state because they have transitions to other states defined for the entire input alphabet. The diagram in Figure 2.6 has 25 states and 96 transitions.
This compares with 26 states (counting the idle state, which is not shown) and 65 transitions for the NFSA diagram in Figure 2.3. The fact that no backup transitions (i.e., from states 4 to 3 in Figure 2.6) are necessary in the NFSA table accounts for most of the difference in the number of transitions. Mismatch transitions to the idle state can be easily handled by the hardware, and it is not fair to count them as full-fledged transitions. Eliminating them from the accounting results in 75 match transitions for the FSA table and only 32 for the NFSA table. Each of these transitions has to be detected by analyzing the list of terms. Detecting some of the transitions (i.e., backups) requires extensive checking of states in one term against states in all others. This contributes to making the FSA table computationally more difficult to build than the corresponding NFSA table. Finally, consider how many comparisons must be done in the worst-case match cycle. In state 2 of Figure 2.6, the FSA must decide which of five non-default transitions to take (to states 2, 3, 7, 17, or 23). In Figure 2.3, the NFSA can be in at most three (non-idle) states simultaneously (1, 2, and 5). After receipt of the next input character, the NFSA must decide which subset of the three non-default successor transitions to take. It will be shown that the three yes/no decisions required by the NFSA require simpler hardware to make than the five-way branch required by the FSA. Furthermore, it will be shown that this property holds for larger tables (i.e., the number of simultaneous yes/no decisions required by the NFSA grows relatively slowly with table size). The advantage of using an NFSA for term detection follows from its ability to occupy several states at once; several potential search paths can simultaneously be followed with no need to back up in case of a mismatch along one of the paths.
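The slow growth of simultaneous occupancy can be illustrated in software. The sketch below is a toy, per-term simulation (it ignores prefix merging and the real table structure; the state representation is an invention of this sketch) that counts the largest number of non-idle states ever occupied while scanning an input against a few of the sample terms:

```python
# Count the peak number of simultaneously occupied non-idle states
# while scanning text against several terms at once.
TERMS = ['BEST', 'BENT', 'BUNT', 'BUNTED']

def occupied(text):
    peak = 0
    active = set()                       # (term, matched-length) pairs
    for ch in text:
        nxt = set()
        for term, k in active:
            if term[k] == ch:
                nxt.add((term, k + 1))   # advance along this path
        for term in TERMS:               # startup attempt on every char
            if term[0] == ch:
                nxt.add((term, 1))
        # completed terms (k == len) would be reported as hits; drop them
        active = {(t, k) for t, k in nxt if k < len(t)}
        peak = max(peak, len(active))
    return peak

print(occupied('BUNTED'))                # -> 4
```

Even though four terms begin with 'B', the occupancy collapses after the first few characters; a reduced table (Figure 2.3) shares prefixes and keeps the count lower still.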
In a practical matcher, the number of states that can be simultaneously occupied must be limited to some small number, as each requires a separate state variable, yes/no comparator, and associated logic to effect state transitions. Simulations discussed later show that this number is small for practical-size state tables.

2.2 Implementing the NFSA Term Matcher with Multiple FSA's

Originally it was intended to use a number of small FSA's to interpret the nondeterministic state table. These would all share a common state table memory. Each FSA would follow a potentially successful search path. At each place where more than one path could be followed, an FSA would be activated from a pool of free ones, and would be started on the new path. This would be analogous to the NFSA 'creating a copy of itself' in a successor state, as stated in [HopUl79]. Each active FSA would follow its assigned path either until a mismatch occurred or until the end of the term was found, whereupon it would be returned to the free pool. Special logic is required to allocate FSA's from the free pool as new alternative search paths are encountered. Each new character can be the start of many new alternative search paths. These paths can be due to forks as was discussed above, or can be due to new terms whose beginnings could be embedded in occurrences of other terms. Any or all of the free FSA's may get dispatched in one cycle, each FSA receiving a GO signal and the state address at which it is to start. Either fast and highly parallel logic would have to be implemented to allow multiple allocations and startups in one character time, or synchronization logic (such as a FIFO buffer for input characters) would have to be used. If a new search path appeared and no free FSA was available, some method of signalling the host to notify it that the table was too complex would be necessary.
A problem that would have to be solved is signalling errors in a manner that would allow the host to recover from an error by eliminating an offending term from the table and then re-executing the search in multiple revolutions. Contention among active FSA's for access to the state table memory could be quite severe. All active FSA's would need one memory cycle per match cycle. Methods exist for implementing multi-port parallel access memories with adequate bandwidth to support the needed number of FSA's, but these require too much logic for it to be practical to build such a memory for each term matcher in the system.

2.3 NFSA Implementation Using Memory-Mapped State Tables

Allocating FSA's to interpret an NFSA state table stored in a central memory resulted in several implementation problems. A far better way to approach the problem is to connect each FSA to its own dedicated memory and partition the state table so that each state resides in the memory of one of the FSA's. Such a design is shown in Figure 2.7. It contains several small machines called character matchers (CM's), each corresponding to an individual FSA in the above discussion. A module called a Match Controller (MC) detects the occurrences of output transitions in the CM's, reporting instances of terms and the disk addresses at which they are found to the query resolver. A diagram of an individual character matcher (CM) is shown in Figure 2.8, and the format of the CM's startup and state table words appears in Figure 2.9. As was noted, it is not necessary to dynamically allocate processors (CM's) to follow currently active search paths.
Instead, the state table is partitioned into groups of compatible states that can reside in a single CM's memory.

Figure 2.7 NFSA Term Matcher

Figure 2.8 Character Matcher (CM) Block Diagram

Figure 2.9 State Table Memory Word Formats (startup table ST: previous character type, character type, start state; transition table TT: character code, type, next state, loop bit, hit bit; fork table FT: fork-right and fork-left enables, right and left start states)

Although compatibility will be more rigorously defined later, what it essentially means is that two compatible states can never potentially be occupied simultaneously. Partitioning states among CM's such that all states assigned to one CM are compatible assures that each CM will occupy a maximum of one state at any one time. Figure 2.10 shows such a partitioning for the state table of Figure 2.3. The states I_0, I_1, and I_2 are idle states. The NFSA's idle state is split among the CM's, each one containing only transitions connecting it to states assigned to the associated CM. A feature of the notation should also be pointed out: the two-character startup label here has the first character of the label adjacent to the transition arc inside the idle state. This is more in line with the interpretation that the first character is a precondition for taking the transition after the second character is recognized. This partitioning has the effect of 'scheduling' the CM's at the time the table is built, eliminating the necessity of including logic to allocate CM's dynamically. Partitioning the state table, allocating states in advance to the CM's which will execute them, requires some extra computation.
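The flavor of that partitioning computation can be shown with a greedy first-fit sketch (purely illustrative: the incompatible pairs below are made up, and the thesis develops real partitioning methods later):

```python
# Greedy partitioning sketch: place each state into the first CM whose
# resident states are all pairwise compatible with it.  'compatible'
# is assumed to be precomputed; here it is faked with an explicit
# incompatibility set for illustration.
incompatible = {(1, 2), (1, 3), (2, 3), (4, 5)}   # hypothetical pairs

def compatible(a, b):
    return (a, b) not in incompatible and (b, a) not in incompatible

def partition(states, n_cms):
    cms = [[] for _ in range(n_cms)]
    for s in states:
        for cm in cms:
            if all(compatible(s, t) for t in cm):
                cm.append(s)
                break
        else:
            # the table does not fit; it would have to be subdivided
            # and the search performed in multiple revolutions
            raise ValueError(f'state {s} does not fit in {n_cms} CMs')
    return cms

print(partition([1, 2, 3, 4, 5], 3))   # -> [[1, 4], [2, 5], [3]]
```

A failed partition corresponds to the case discussed below, where the search is not attempted and the table is split across multiple revolutions.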
In effect, this computation is being traded for the opportunity to eliminate the hardware for scheduling the FSA's, and the memory contention resulting from many FSA's concurrently accessing one copy of the state table. Introducing the idea of partitioned state tables allowed a very simple implementation for the CM's. Each CM is self-starting; no outside allocation hardware is necessary. During normal state transitions, each CM is only looking for one character code. This allows the comparator logic to be quite simple. No component of a CM gets cycled more than once per character match cycle. This allows faster cycle times to be possible with a given logic and memory speed. One side benefit is that it can be determined when the table is partitioned whether it will fit in the available number of CM's. If not enough CM's are available, the search need not even be attempted; the table can be subdivided and the search performed in multiple revolutions.

Figure 2.10 Diagram of Partitioned State Table

2.4 CM Operation

Each CM contains three RAM's: a startup table (ST), a transition table (TT), and a fork table (FT). The FT is mapped into a segment of the TT address space. The fork entries can be thought of as fields of a state's TT word, but since most states do not fork, the FT can be implemented more efficiently as a smaller, independent memory. The logic necessary to load the memories has been omitted from Figure 2.8 for clarity. The CM also contains the following registers:

  C - Holds the most recent character in the string to be searched.
  A - Holds the current state (TT and FT addresses).
  T - Contains the type ('%', '#', or 'u') of the character now in C.
  F - Buffer holding the startup state when the CM is being forked to by a neighbor.
  FR, FL - Hold fork table word contents, which are the start states for the right and left neighbor CM's when they are being forked to.
  H - Buffer for holding the terminal state and CM number when signaling that a term has been found.

Several of these 'registers' (F, FR, FL, H) are transparent - they do not need to actually latch any data, and can be implemented as buffers or even eliminated if the logic family being used to build the CM permits. Additionally, the CM contains a character comparator (CC), a type comparator (TC, used to control startup), a next-state address multiplexor (N), and logic to control gating data among these components. The multiplexor N can also be eliminated if the logic family permits, instead being implemented as a bus with a selector enabling the source containing the next address onto the bus. The state at address 0 in the TT is the idle state. Whenever a mismatch occurs and TT_l = 0 for the current state, a transition is automatically taken to the idle state. The CM remains in the idle state until a startup transition occurs.

2.4.1 Startup

A CM can be started in one of two ways: either by being forked to by a neighboring CM, or by taking a startup transition defined by an entry in the ST. Inbound forks have priority over startup transitions, which in turn have priority over normal state-to-state transitions. To fork to a CM, one of its neighbors gates the start state into F (the state assignment prevents forking to an active CM). An example is the transition between states 14 and 15 in Figure 2.10. Loading F causes the multiplexor N to select F. The fork address is gated into A, and the corresponding state word is fetched. The CM is now ready to match the next input character. A CM can also start itself in response to reception of a startup sequence. Startup sequences define transitions out of the idle state; it is therefore convenient to think of a CM as always occupying its idle state even though it may also simultaneously be in another state.
Startup transitions work as follows: The contents of C for each input character are used as an address in the ST, and the corresponding word's contents are fetched. Each character's ST entry contains a previous character type (PCT) field (ST_p) which specifies what type(s) must precede the character to allow a startup to occur. Characters for which no startup transition is defined have a PCT of zero, which will not match any previous character's type. The PCT corresponds to the first character in the two-character startup labels mentioned earlier (which was restricted to being a type). If the TC detects a type match (ST_p & T ≠ 0), it signals N to select the startup state field ST_s and gate it into A. The CM then starts as described for forks. The previous character type restrictions allow terms to be started at word boundaries (ST_p = '#'), within words (ST_p = 'u'), or anywhere (ST_p = '*'). Only one startup transition per CM is allowed for each character code. It must be remembered that startup transitions override normal state transitions. It is necessary that the state assignment insure that startups do not interfere with normal matching. Sometimes, however, it is necessary to use a startup transition to intentionally drive a CM out of a particular state. For example, a startup transition is used to force exit from an EVLDC loop state at the end of a term.

2.4.2 CM State Transitions

When a state is entered by gating its address into A, the contents of the corresponding word in the TT are fetched. At the start of the next cycle, the character code field in the TT word (TT_c) is compared with either the input character in C or the current character type in T, based upon the value of TT_t. If the character or type matches, first (if the hit bit TT_h = 1) the hit register H (which contains the present state address) is output to the Match Controller. Then, the next state (in TT_n) is gated into A and H.
If there was a mismatch, and if the loop bit TT_l = 0, A is cleared to force a transition to the idle state. The loop bit being set forces TT_n to be used as the next state address regardless of whether or not the character stored in the state table entry matches the input. TT_l is called the loop bit because TT_n of states with this bit set will usually contain the address of the state itself, causing the CM to loop in the same state until it is forced out by a startup transition or by being forked to (i.e., state 1 in Figure 2.10). However, in some cases a loop state's TT_n field may contain the address of a different state. Such a case will be discussed in Section 2.4.6.

2.4.3 Forking

A CM can start either or both of its neighbors by forking to them. Most tables have many fewer fork states than normal states, so fork states are allocated only the top 1/8 of the state-table address space. Whether or not a state is a fork state is determined by its address. Fork states have normal transition sequences which work exactly as described above. Forking is done in parallel with normal transitions. If the CM is in a fork state, if the input character matches successfully, and if one or both of the fork enable bits (FR_x, FL_x) are set, then the corresponding fork addresses in FR and FL are gated out to the neighboring CM('s). The neighbor then starts as described in Section 2.4.1. Forking is done in the following situations:

1. Incompatibilities - If a term is being matched in a CM and one of its states is incompatible with another in the CM's memory, control must be transferred to another CM to match the incompatible state. To do this, the last compatible character is made a fork state, and (if reached) a fork is made to a neighboring CM. For such states, TT_n contains the address of the idle state.

2. Terms with common prefixes - The fork is made after matching the last common character (i.e., state 14 in Figure 2.10).
The CM proceeds to look for one alternative, and the neighbor(s) look for the other(s).

3. EVLDC's - After matching the prefix, a loop state is entered which will fork to another CM each time the first character in the suffix is encountered (i.e., state 1 in Figure 2.10). Loop transitions are described in more detail below.

2.4.4 Loop Transitions

Any state with its loop bit TT_l = 1 is called a loop state. An example is state 1 in Figure 2.10. Loop states are used to eliminate the need to back up in case of a mismatch, as was necessary in the FSA. If TT_l = 1, the mismatch transition back to the idle state is inhibited. The next state address in TT_n is gated into A regardless of the result of the match. TT_n usually contains the loop state's own address (hence the name 'loop state'), although Section 2.4.6 will discuss an exception. In addition to remaining in the same state, loop states fork to a neighboring CM to continue matching the suffix. They therefore must reside in the fork-state portion of the address space. The CM remains in the loop state even after forking. This insures that if the suffix match fails, the CM will still be looking for a correct instance of the suffix. Thus, at least two CM's are required to match an EVLDC term, one to match the suffix and one to continue monitoring the input for each potential start of the suffix string. In the absence of any further outside influences, the loop state would remain occupied forever. The CM can only be forced out of the loop by being restarted or by being forked to. Since it is usually desired to stop looking for the suffix at the end of a term, it is necessary to include a startup transition in the ST so that term separators force the CM back to the idle state. This special startup transition is called a kill transition. One appears in Figure 2.10, going from the idle state back to itself.
Its label, 'uū' (an alphanumeric followed by a non-alphanumeric), is equivalent to having both the transitions 'u%' and 'u#' in the CM. It would be implemented by including an entry in the startup table for every non-alphanumeric, each entry having ST_p = 'u' and each going to the idle state. Each time a non-alphanumeric was seen following an alphanumeric, the loop would be killed. Actually, loop transitions are very powerful and can be used for several purposes in addition to matching EVLDC's. Figure 2.11 illustrates the use of a loop state to match a term containing two words separated by a variable number of word separators. At the end of the first word ('PUNK'), the loop state is entered. The CM remains in that state until the start of 'ROCK' is seen, whereupon a fork is done to match 'ROCK'. A kill transition must be included for every character not having a startup transition, except word separators. The one for Figure 2.11 is implemented by setting the ST entry for all characters but 'R' and word breaks to force the CM to the idle state.

2.4.5 Recognizing Document Formatting Codes

The query resolver needs to be notified when document formatting codes (ferns) are encountered. Actually, these are recognized just like one-character terms. One CM is assigned to matching ferns, and has a startup sequence defined for each. Figure 2.12 shows an example. Since a startup transition cannot cause a hit to be reported, it must go to a second state. This state reports a hit and takes a transition back to the idle state regardless of the next input. This can be accomplished by setting TT_h = 1, TT_l = 1, TT_c = '*' (all type bits set), and TT_n = 0 (the idle state address). Even if the character after the fern causes a startup transition to be taken, the operation of the CM will insure that the hit is reported.

Figure 2.11 Loop States Used to Match Word Breaks

Figure 2.12 Transitions to Match Context Boundary Codes
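The per-character behavior described in Sections 2.4.1 and 2.4.2 can be summarized in software. The sketch below is an illustrative simplification (forking, ferns, and the full type field are omitted; the separator class is handled as a special case; the table shown is hypothetical, for a term '#IST#'):

```python
# Software model of one CM's match cycle.  Field names follow the text
# (TT_c, TT_n, TT_l, TT_h; ST holds a previous-character-type and a
# start state); the table contents are made up for this example.
IDLE = 0
SEP, ALNUM = '#', 'u'                    # character type codes

st = {'I': (SEP, 1)}                     # startup label '#I'
tt = {
    1: dict(c='S', n=2, l=0, h=0),
    2: dict(c='T', n=3, l=0, h=0),
    3: dict(c='#', n=IDLE, l=0, h=1),    # final separator reports the hit
}

def ctype(ch):
    return ALNUM if ch.isalnum() else SEP

def run(text):
    state, prev, hits = IDLE, '#', []    # start of stream acts as a separator
    for pos, ch in enumerate(text):
        nxt = IDLE
        if state != IDLE:                # normal transition (2.4.2)
            w = tt[state]
            if ch == w['c'] or (w['c'] == SEP and ctype(ch) == SEP):
                if w['h']:
                    hits.append(pos)     # H register output to the MC
                nxt = w['n']
            elif w['l']:
                nxt = w['n']             # loop bit inhibits the idle exit
        # startup (2.4.1) overrides a normal transition
        if ch in st and st[ch][0] == ctype(prev):
            nxt = st[ch][1]
        state, prev = nxt, ch
    return hits

print(run('A MIST IST '))                # -> [10]: 'IST' only at a word boundary
```

The embedded 'IST' in 'MIST' is correctly ignored because the startup's previous-character-type check fails there; only the word-bounded occurrence produces a hit.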
Depending upon the requirements of the query resolver, each unique fern code would probably force the CM to a different state so that the fern code that was received could be deduced from the output.

2.4.6 Bounded EVLDC's

State 1 in Figure 2.10 shows the usual type of embedded variable-length don't-care, where the don't-care string can be of arbitrary length. The only way the CM can be stopped from looking for the suffix is to force it out of the loop state with a kill transition (i.e., when a word separator occurs). Sometimes, however, it might be desired to match strings with a variable, but bounded, number of don't-cares. A table to match such a string is shown in Figure 2.13. This table will match 'AB', 'A*B', or 'A**B'. The loop bit TT_l = 1 for states 1 and 2. The match character for states 1, 2, and 3 is 'B'. When the initial 'A' is seen, CM_0 enters state 1. For the next two input characters, instead of mismatches sending CM_0 back to the idle state, the CM will go to the next state regardless of the input. If a 'B' is seen while CM_0 is in states 1, 2, or 3, a fork to state 4 (in CM_1) will be performed. State 3 returns to the idle state regardless of the input.

Figure 2.13 Bounded EVLDC

2.5 MC Operation

Figure 2.14 shows a block diagram of the Match Controller (MC). The MC is arranged around a 16-bit internal bus, to which the following components are connected:

1. Four registers (F, A, D, and S) for communicating with the query resolver (QR). The F register allows the query resolver to specify the matcher's function, and S displays the matcher's status. Registers A and D are set by the query resolver to control the address and data (respectively) when loading data into the CM's.

2. Two buffers (C, D) and one register (H) for communicating with the CM's.
Commands and match characters are transmitted to the CM's over C, data to be loaded into the CM memories is transmitted over D, and hits (successful matches) are reported by the CM's into H.

3. A small FIFO (Q) for reporting hits to the query resolver. The FIFO is used because sequential characters can cause multiple hits which must be processed by the query resolver. Buffering these in a FIFO allows more flexibility in building the query resolver hit-processing logic.

4. A serial/parallel code converter to handle data from the disk. The serial data is converted to parallel format, then the bytes are looked up in a ROM to convert them to the character set used by the CM's. The ROM allows the character set size and codes to differ between the CM's and the storage medium. Case conversion can also be done if desired. A RAM could be used to allow these functions to be programmed, but the extra complexity involved in loading a RAM was thought to outweigh any benefits.

5. A counter to generate addresses corresponding to characters coming from the disk. The counter allows each hit to be accompanied by the address (displacement into the region) at which it occurred.

6. A timing generator to allow manufacturing timing signals for the MC and CM's from the disk bit clock and character strobe signals.

Figure 2.14 Match Controller (MC) Block Diagram

The MC also contains a small state machine to sequence gating data onto the internal bus. The MC is interfaced to the query resolver over an external bus, to which F, A, D, S, and Q are also connected.
The decision to use the query resolver to load and control operation of the matcher was made because it is anticipated that the QR will be implemented using a small general purpose computer (i.e. a microprocessor), and it is expected that programming the QR to support controlling the matcher will not be a problem. If the QR is eventually implemented using different technology, the matcher and the QR might be interfaced to a small control microprocessor to oversee both their functions.

2.5.1 Timing Generation

Each cycle (character time) is broken into eight equal segments, corresponding to the eight bit-times during which data is received from the disk. The basic clock signal is derived from the bit clock from the disk, which is nominally around 100ns for 3330-technology drives. The eight bit times are named T0 thru T7, T0 corresponding to the first bit of each data byte. All clock times in both the MC and CM's are defined in terms of this nomenclature. The timing generator provides three signals of single-time duration (T1, T3, and T7) used solely as clocks, six signals of two-time duration (T01, T23, T45, T56, T67, and T70) used both as clocks and to control gating data, and one signal of four-time duration (T0123) used as a gate. To clarify the nomenclature, T01 is one pulse occurring during T0 and T1 of a cycle, T70 is a pulse occurring during T7 of one cycle and T0 of the next, and T0123 is one pulse occurring during T0, T1, T2, and T3 of a cycle.

All data transfers in both the MC and the CM's are set up so that data is enabled onto a bus during (at least) two successive times, and is clocked into the receiving register by a clock signal occurring in the middle of the interval during which it is available. For example, if data were gated onto a bus during T67, it would be clocked into the receiving register on the leading edge of T7.

Three timing signals are provided externally for use by the CM's (Ta, Tb, and Tc).
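The bit-time nomenclature above can be captured in a small sketch (illustrative only; the hardware derives these signals from a Johnson counter, not a lookup):

```python
def timing_signals(t):
    """Return which of Section 2.5.1's gating/clock signals are asserted
    during bit time t (0..7) of a character cycle."""
    active = lambda *times: t in times
    return {
        "T1": active(1), "T3": active(3), "T7": active(7),  # clocks only
        "T01": active(0, 1), "T23": active(2, 3),
        "T45": active(4, 5), "T56": active(5, 6),
        "T67": active(6, 7),
        "T70": active(7, 0),           # spans the cycle boundary
        "T0123": active(0, 1, 2, 3),   # four-time gate
    }
```

This reflects the transfer convention in the text: data gated onto a bus during T67 is available at both bit times 6 and 7, so it can be clocked on the leading edge of T7, in the middle of its availability window.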
Ta corresponds to T56, Tb to T70, and Tc to T0123, although special logic inhibits them during certain cycles. Figure 2.15 shows a timing diagram of these signals, and also illustrates when data is available on the I/O registers of the CM's, and when it is clocked into them. Tc is used only during load cycles; its function is to strobe the write enable line of the memory being loaded.

2.5.2 Load Cycles

To load the CM's, the query resolver places the CM memory address and data into the F, A, and D registers. The MATCH bit in F is cleared to indicate a load cycle. During T45, F will be gated onto the MC bus and out to the CM's via C. Each CM will use Ta to strobe C into itself. During T67, A and D (the memory address in the CM and the data) will be gated onto the MC bus and out to the CM's via C and D. If it was the one addressed, the CM will use the address and data strobed into C and D by Tb to load the addressed byte of memory during Tc.

2.5.3 Match Cycles

For the moment, details involved with starting and ending match operations on a data stream will be ignored. It is assumed that the match cycle being described is somewhere in the middle of the data stream (i.e. in the middle of a track). If the data flow through the complete term matcher (MC and CM's) is considered, the system can be regarded as a pipeline. For a given character, bits arriving serially from the disk are assembled into a byte during one cycle, and code conversion is done by the ROM during the next cycle. During T45 of the second cycle, F (the function register) is gated onto the MC's internal bus, thru C, and from there broadcast to all CM's. The MATCH bit informs them that this will be a match cycle as opposed to a load. During T67 of the second cycle, the ROM contents (the translated character) are similarly gated onto the bus and through C to all CM's.
During the third cycle, the active CM's compare the character with TT for their current states, and during T56 of that cycle, any hit is reported to the MC, during which time it is strobed into the MC's H-register. If there was a hit, then during the fourth cycle the character's address is gated from N (the address counter) onto the MC's internal bus during T01 and is strobed into Q (the FIFO), and during T23 the terminal state address (in H) is gated onto the bus and into Q. Sometime later, the hit will be read from the FIFO by the query resolver.

Figure 2.15 Character Matcher (CM) Timing Diagram
[Waveforms illegible in this scan. Legible legend entries: data available (register output-enabled); register clocked.]

When N (the address counter) runs out at the end of the track, the MC gates it and H (which contains nothing in particular) into Q. The query resolver, seeing the 0 in the address field of a hit, can detect the end of the track. The counter going to 0 also inhibits any further match cycles. Notice that due to the delay in propagating a character through the matcher, the address in N must be displaced by a fixed value so the count runs out at the proper time. If an up-counter is used to match an n-byte data stream, N must be loaded with -(n+3) at the start of the track. The disk must continue to provide clock pulses even between tracks so timing signals can be generated for load cycles. Most disks have clocks which free-run at approximately the bit rate even when data is not being transferred, so this is not a problem.

2.6 Matching a Data Stream

Figure 2.16 shows a timing diagram representing all major signals in the term matcher occurring during matching of a three-byte data stream. A track length of three bytes is sufficient to show the special cases involved in start-of-track and end-of-track processing, and also is sufficient to illustrate the propagation of data through the system.
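The preload rule above can be checked with a quick sketch: with N loaded to -(n+3) and counted once per character cycle, the counter reaches 0 exactly when the last byte's results have cleared the pipeline.

```python
def cycles_until_end_of_track(n_bytes):
    """Count character cycles until the address counter N, preloaded to
    -(n+3) per Section 2.5.3, reaches 0 and ends match mode.
    (A sketch of the counting rule, not the 74161 counter chain itself.)"""
    n = -(n_bytes + 3)
    cycles = 0
    while n != 0:
        n += 1          # N is counted once per cycle (at T6)
        cycles += 1
    return cycles
```

For the three-byte example track of Section 2.6, this gives 6: N overflows during the sixth cycle, which matches the walkthrough's account of when further hits are inhibited.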
For purposes of the example, assume that the CM's have been loaded with a state table, that a 0 was strobed into C in all CM's to force them to the idle state, and that MATCH in F was set. The matcher is waiting for the INDEX pulse to arrive from the disk, signalling start-of-track. Further assume that the first three bytes of the track will cause one of the CM's to generate a hit. It may be helpful to refer back to Figures 2.5 and 2.11 for review of the names of the components of the MC and CM's.

[Figure 2.16, a timing diagram of the major term-matcher signals during the three-byte match, is illegible in this scan.]

During T0 of the first cycle (character time), INDEX and CHR co-occur ([1] on Figure 2.16), signifying the arrival of the first bit of the first character on the track. During the next eight bit times, data bits arrive from the disk and are assembled in a shift register in the MC. The contents of F and the ROM (whatever they happen to be) are gated through C to the CM's during T45 and T67 of the first cycle, but since Ta and Tb are inhibited during the first match cycle, the CM's ignore them.

At the start of T0 of the second cycle, the new data byte is gated from the shift register to the ROM address register. During T45 of this cycle, F (the function, containing the MATCH bit) is gated through C to the CM's [2], and since Ta occurs during this cycle [3], it is gated into each CM. MATCH going from 0 to 1 clears T in each CM, and enables the CC's. During T67 of this cycle, the first data byte ([3] in Figure 2.16) is sent from the ROM to the CM's via C. Tb occurs during T70, causing the character to be strobed into C. The CM's will remain in the idle state, and the character will be looked up in the ST of each CM.

During the third cycle, the serial data for the third character is received from the disk and the translated version of the second is sent from the ROM to the CM's (during T67).
At T7, the occurrence of Tb strobes the third character into C, and strobes the type of the second character (from the ST field of the ST entry, looked up during the preceding cycle) into T in the CM. Tb also gates a new state address into A, but since there is no match and no startup sequence (T was 0, so the TC did not generate a startup signal), a 0 is again gated into A. Thus, the CM's remain in the idle state for another cycle.

During the fourth cycle, no data is received from the disk. The third byte is sent from the ROM to the CM's and strobed into C by Tb. During this cycle, the ST entry for the second character is fetched, and the TC compares its ST field against the type of the first character, which is now in T. By assumption, they match, and a startup signal is generated. The next state address is taken from ST, and is clocked by Tb into the address register A.

During the fifth cycle, again no disk data is available, and furthermore the data sent from the ROM to C is not defined ([4]). By T7 of the fifth cycle, when Tb occurs, the TT word for the state entered after receipt of the second character will have been fetched. The character code field of TT will have been compared against the third data byte, and (by assumption) a match will have been generated. Since the hit bit in TT is 1, a hit will be generated, and Tb will gate the state address and CM number out of the H-register in the CM into that of the MC.

During the sixth cycle again no valid character is sent to the CM's, and in fact what they do is of no further interest. Since a hit was received by the MC during the previous cycle, the hit address (-1) is gated from N onto the MC bus and into Q during T01 ([5] in Figure 2.16). The hit address in H is gated into Q during T23 ([6]). The counter N will be counted at T6, and will overflow. This will prevent any further Tb signals from being generated, so no further (spurious) hits will be reported by the CM's.
During the seventh cycle, since N contains a 0, its contents are gated onto the bus and into Q during T01. The meaningless contents of H are gated into Q during T23. The overflow of the disk address counter N at this point sends the MC out of match mode, and no further action takes place without intervention by the query resolver. Eventually, the data for the two matches (one for the character sequence, the other for end-of-data) will appear at the output of the FIFO, and will be processed by the query resolver. During the succeeding interval, the disk is positioned to a new track, a new table is loaded, the CM's are forced to the idle state, and the match is performed on a new cylinder.

2.7 MSI Prototype of the NFSA Term Matcher

As anyone with experience in logic design will attest, the initial block diagram of a system will usually bear little resemblance to the final implementation. To be truthful, the ones shown in previous sections (Figures 2.5 and 2.11) were modified many times during the NFSA matcher's gestation period. If the design is only taken to the block diagram stage, problems often tend to be ignored or swept under the rug. To iron out the details of the design and to demonstrate its feasibility, a prototype of the NFSA Term Matcher was designed using MSI circuits. It is configured with an MC and four CM's, enabling it to handle most single-query searches. It will be available for use in an experimental retrieval system to aid in further research.

2.7.1 MSI Match Controller

Figure 2.17 is a reproduction of the logic diagram of the MSI Match Controller. It consists of 41 IC packages, most of them MSI registers. All but one of the packages are standard 7400-series TTL, the exception being the ROM, an Intel 2716 2Kx8-bit EPROM. The MC is connected to the QR via 21 lines, to the CM's via 21 lines, to the disk via 5 lines, and to power and ground through another two (a total of 49). The INDEX line from the disk signals the start of data on a track.
Assuming the MC has received a MATCH command from the QR (on the X7 line into F), receipt of INDEX places the MC into match mode. CHR synchronizes the timing generator to characters arriving over the DATA line; individual bits are clocked by CLK, which also clocks the Johnson counter in the timing generator. A truth table for the Johnson counter appears at the left side of Figure 2.17, next to the counter. Bits arriving on the DATA line are shifted into the 74164, and at the end of the cycle are strobed into the 74175 which serves as an address latch for the 2716 ROM. Output data from the ROM (the translated, six-bit versions of the input bytes) are gated onto the internal bus at the appropriate time through the 74244 tri-state driver. The disk address counter (N) appears just above the deserializer/code converter, and is implemented using four 74161 counters. The system will be used on fixed-length tracks, so the initial count is set using jumper wires. The 74244's gate the count onto the bus at the appropriate times after a hit is detected and after the counter overflows. The carry out of the high-order counter is OR-ed with the ERROR line from the disk, and the result is used to take the MC out of match mode.

[Figure 2.17 and the intervening pages, through the start of Chapter 3's discussion of machine decomposition, are illegible in this scan. The surviving text resumes with the definition of a partition p of the state set into blocks B_i:]

    B_i ∩ B_j = ∅ for i ≠ j, and the union of B_1, ..., B_k is {Y}

where {Y} is the state set of M, and p is a partition composed of blocks B_i, each block containing states from {Y}, such that each state appears in exactly one block. The component machines M_i can be described in terms of partition pairs (p_i, q_i) defined over the inputs and states of M, where {X} is the input alphabet of M. In simple terms, M_i is defined by a partition pair (p_i, q_i) such that blocks of the preserved partition q_i define the states of M_i and p_i contains the information needed to compute the next state of M_i for any given input. Finding a minimal assignment of states to CM's is analogous to finding a minimal feedback decomposition of a machine M [Booth68].
From the set of partition pairs, it is necessary to select the smallest set M_i such that the g.l.b. of all q_i is a partition having one state per block (i.e. M's state can be decoded from the states of the M_i), with there being some ordering of the partition pairs such that q_{i-1}·q_i·q_{i+1} ≤ p_i (to satisfy the requirement that submachines (CM's) are only connected to their neighbors), and with all blocks of q_i corresponding to states in the nondeterministic table.

Computing assignments for NFSA tables of 1000 states by this method is, of course, quite impractical. To begin with, converting the k-state nondeterministic table into a deterministic one results in a table size of k' = 2^k ([HopUl69]). Second, the set of partition pairs must be generated. There are O(k'(k'-1)/2) of these. Even if it were practical to compute these, finding an ordered set satisfying the connectivity and correspondence constraints would in general require checking all combinations of partition pairs. Machine decomposition is useful as an abstraction for viewing interconnected machines and describing their behavior, but it provides us with no efficient strategy for assigning large tables.

3.2.2 State Minimization

State assignment in the NFSA has an analog in the problem of state minimization for incompletely specified sequential machines, although the definition of compatibility differs between the two. In the latter, if for any input sequence two states of a machine M can never cause M to generate different outputs in a case where both outputs are specified, the states are compatible and may be grouped together in the same state of a minimal machine M'. In the former case, states are compatible if they can never be simultaneously occupied, in which case they can be assigned to the same CM. Sources such as [Unger69] and [FrMe75] give algorithms for state minimization. These algorithms differ in detail, but in general consist of these steps:

1.
Generate a table expressing compatibility between each pair of states (pair table).

2. Use this compatibility information to compute the set of 'maximal compatibles' (MC's) - sets of compatible states which are not subsets of other compatible sets.

3. From the set of MC's, find the set of 'prime compatibles' (PC's). A PC is a subset of (the states in) an MC which is not dominated by another such subset ([FrMe75]).

4. Find a minimal closed set of PC's covering every state of M such that elements of the class set implied by each PC under all inputs are contained in other PC's in the set.

The states in each compatible in the cover are grouped together into one state of the minimal realization M'. Similarly, in an NFSA table assignment, the states in each compatible can be assigned to the CM associated with that compatible. The reason for checking the class sets of each PC is to insure that the transitions out of all states in each PC_i in the cover M' under a given input go to the same compatible PC_j - just as a state of M takes a transition to only one other state under a given input. The restriction on which compatibles are eligible for inclusion in the cover is slightly different for NFSA assignment. In the NFSA, states retain their identity, and all states in a compatible in the cover need not go to the same next state under a given input. Rather, the interconnection structure of the CM's imposes a different restriction on the choice of compatibles for the cover: since CM's can only be connected to two neighbors, there must exist an ordering of the compatibles in the cover M' such that the transitions out of each state in a compatible under any input go only to states in the same or adjacent compatibles.

As was true for machine decomposition, direct application of the textbook algorithm is too inefficient to use on large tables even during the design phase.
Computing the pairwise compatibles requires N²/2 compatibility tests, and computing the maximal compatibles requires in the worst case O(2^N) operations (these algorithms usually involve traversing part of a tree whose maximum depth is the number of states in the table). Choosing a minimal cover in general involves examination of all combinations of PC's. The following sections will discuss several aspects of determining the minimum-CM assignment of a state table, and will show how the above-mentioned algorithms were modified to allow these assignments to be computed quickly.

3.2.3 Lower Bound on Number of CM's - Unlimited Interconnection

Each CM in the NFSA Term Matcher is connected to two neighbors. This limited interconnectivity allowed branches in the search path to be pursued in parallel without the need for any centralized dispatching logic, while still providing a reasonable limit to the number of output pins on each CM. However, it is an interesting question whether allowing unlimited interconnection would permit a smaller number of CM's to be used.

If unlimited interconnection is allowed, the number of CM's needed is equal to the largest number of simultaneously active states possible in the state table. Since a state would be capable of making a transition to a successor in any CM, the only restriction on assignment would be that no two incompatible states could reside in the same CM. The number of CM's required would be the number of states in the largest 'maximal incompatible' - that is, the largest group of states such that each state in the group is pairwise incompatible with all other states in the group.

Recall that the most difficult part of finding a minimal state assignment was finding an ordered set of compatibles such that transitions went only to adjacent compatibles. This required an exhaustive search of the MC's.
However, since the number of CM's is the item of interest, it is not necessary to find such a set - we only need determine its existence. When the maximal incompatibles (MI's) have been computed and the largest found, the procedure is done. MI's are computed in the same manner as MC's. Several good algorithms ([SinSh72], [Stoff74]) exist for doing this. Also note the advantage of computing maximal incompatibles instead of maximal compatibles. For example, Stoffers' algorithm requires O(2^k) operations, where k is the number of states in the largest maximal group (compatible or incompatible, depending upon which is being computed). In the NFSA table for a list of search terms, every state in the table is compatible with almost every other state. Thus, when computing MC's, k is usually almost N (the number of states in the table), and 2^k is very large. On the other hand (as will be shown later) k is usually less than 10 for most 1000-state tables when computing maximal incompatibles.

However, even computing the maximal incompatibles of large tables requires much computation and a large amount of memory. For example, for an N-state table, Stoffers' algorithm requires two N-bit vectors per state to store the compatibility and identification fields; two megabits are required for N=1000. A program was written to compute pairwise compatibles as part of the general state assignment program to be described later. This program is capable of determining pairwise compatibility in about 400 microseconds (115 PDP-11 machine instructions). At this rate, the pairwise compatibility table takes three minutes to build. While these numbers show that the computation is no longer impractical, the algorithm can be optimized substantially. More importantly, these optimizations are directly applicable to the first-fit assignment problem.
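Since a maximal incompatible is a maximal set of pairwise-incompatible states, it is exactly a maximal clique of the incompatibility graph. A small sketch of that computation (using a standard Bron-Kerbosch enumeration for illustration, not Stoffers' algorithm):

```python
def maximal_incompatibles(states, incompatible_pairs):
    """Enumerate maximal incompatibles: the maximal cliques of the graph
    whose edges are the pairwise-incompatible state pairs."""
    adj = {s: set() for s in states}
    for a, b in incompatible_pairs:
        adj[a].add(b)
        adj[b].add(a)
    out = []

    def expand(r, p, x):
        # r: current clique; p: candidates; x: already-processed vertices
        if not p and not x:
            out.append(r)          # r cannot be extended -> maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(states), set())
    return out
```

The size of the largest clique returned is the number of CM's needed under unlimited interconnection, i.e. the lower bound this section derives. For example, with states 1..4 and incompatible pairs (1,2), (1,3), (2,3), (3,4), the maximal incompatibles are {1,2,3} and {3,4}, so three CM's would be needed.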
3.2.3.1 Incompatibility Covers

The first step in computing the maximal incompatibles for an N-state table is building the pair table, which requires N²/2 compatibility tests, and at least that many bits of storage. One way to speed up compatibility testing is to break the state table down into small portions and process them independently. The big N² process can be broken down into many smaller n² processes, where n << N.

[Several intervening pages, containing the description of the INCOVR and DETAIL optimizations, the MAXINC program, and Figure 3.5, are illegible in this scan.]
Figure 3.6 Compatibility Tests vs. Table Size
[Curves illegible in this scan; they are described below.]

Figure 3.6 plots, against table size, the number of compatibility tests needed to find the maximal incompatibles. Curve 1 shows the number of tests to build the entire pair table for an N-state table (N²/2). The number of tests needed when INCOVR is used to subdivide the table into smaller blocks is shown in Curve 2. This is nearly twice the N²/52 estimated in Section 3.2.3.1, mostly because several of the blocks are larger than average (e.g. terms starting with 'S') and some are quite small ('X'), so the total number of tests is larger (a > b > c and a + c = 2b => a² + c² > 2b²). Still, Curve 2 is a great improvement over Curve 1.

Curve 3 shows the number of tests when using both INCOVR and DETAIL. The improvement over Curves 1 and 2 is obvious, but comparing Curve 3 with Curve 4 shows just how good the result is. Curve 4 plots the average number of incompatible pairs in tables of each size, and as such represents the minimum necessary number of compatibility tests. Curve 3 is very close to optimal over the range of table sizes being considered.

The actual time taken by MAXINC to find the maximal incompatibles is shown in Figure 3.7. Each data point represents the average time (BLDFK+BLDCVR+ASBLKS) needed to generate all maximal incompatibles for a table and select the largest. Curve 1 plots the times for MAXINC to generate tables using incompatibility covers alone, and Curve 2 plots the execution times using both incompatibility covers and term tail removal. The rise of Curve 2 is greater than linear, but it still rises quite slowly within the range of reasonable table sizes.
One may wonder why such attention is being paid to performance statistics for best-fit assignment, since first-fit will be used in production. The reason is that most of the time necessary in both best-fit and first-fit is that consumed in computing pairwise incompatibilities, and this needs to be done for both. INCOVR and DETAIL will speed up first-fit assignment as much as they did MAXINC.

Figure 3.8 shows the statistic which this section originally set out to determine. For tables of from 10 to 120 terms chosen at random, MAXINC found the number of CM's required. Each number on the field of the graph corresponds to how many test tables containing the number of terms shown on the X-axis required the number of CM's to match them shown on the Y-axis. Twenty tables of each size were generated, for a total of 240 tables. For a table of a given size, the number of CM's required falls into a narrow band. Only 3 of the 240 tables fall above the line CM = 5 + (T/16), where CM is the number of CM's required and T is the number of terms in the table. The graph also shows that large tables of 120 terms (approximately 1000 states) will fit easily into a 16-CM system.

Figure 3.7 Number of Terms vs. Time for MAXINC
[Graph illegible in this scan.]

[Figure 3.8 (CM's required vs. table size) and the surrounding pages, including the start of the description of the state assignment procedure, are illegible in this scan. The assignment listing resumes below.]

PROC ASCVR;                      ; ASSIGN ALL BLOCKS OF COVER
  CM<-0;                         ; START ASSIGNING TO CM 0
  DO FOR ALL BLOCKS;             ; DO FOR ALL BLOCKS
    ASBLK(BLOCK, NTRM, CM);      ; ASSIGN BLOCK TO CM
    CM<-MOD(CM+1, NCM);          ; BUMP CM # MOD NO. OF CM'S
  END;
END ASCVR;
PROC ASBLK;
  DO FOR I=1,NTRMS;              ; DO FOR ALL TERMS IN BLOCK
    TERM<-BLOCK<I>;              ; NEXT TERM IN BLOCK
    STATE<-TERM<1>;              ; START W/1ST STATE IN TERM
    ASSIGN(STATE, CM);           ; START TRYING AT CM
    CM<-MOD(CM+1, NCM);          ; NEXT TERM AT NEXT CM
  END;
END ASBLK;

PROC ASSIGN;
  DCL DTBL<16> INIT(0, 1, -1, 2, -2, 3, -3, ...);  ; ORDER TO TRY CM'S
  IF (STATE = NULL) THEN RETURN;
  IF (CM<STATE> .NE. NULL) THEN RETURN;            ; IF ALREADY DONE
  IF (PREV = NULL) THEN N<-NCM; ELSE N<-2;         ; START STATE CAN GO ANYWHERE
  DO FOR I=1,N;                                    ; TRY ALL VALID CM'S
    TCM<-MOD(CM+DTBL<I>, NCM);                     ; ... CLOSEST FIRST!
    IF (NO INCOMPATIBLES IN TCM) THEN DO;
      CM<STATE><-TCM;                              ; SET TRIAL CM
      ASSIGN(NEXT<STATE>, TCM);                    ; TRY ASSIGNING REST OF TERM
      IF (ASSIGN WAS SUCCESSFUL) THEN DO;
        ASSIGN(FORK<STATE>, CM);                   ; ASSGN FORKS TO ORIG. CM
        IF (ASSIGN WAS SUCCESSFUL) THEN RETURN;
        ELSE DO;
          DEASSIGN(NEXT<STATE>);                   ; UNDO THE DAMAGE
          CM<STATE><-NULL;
        END;
      END;
      ELSE CM<STATE><-NULL;                        ; IF 1ST ASSIGN DIDN'T WORK
    END;
  END;                                             ; TRY NEXT CM
  FRETURN;                                         ; FAIL IF NO VALID ASSIGNMENT
END ASSIGN;

Figure 3.9 State Assignment Algorithm

If the state is compatible with all currently in the CM, ASSIGN is called recursively to attempt assignment of the successor and fork states. If both succeed, ASSIGN returns to its caller successfully also. If an assignment is possible, it will almost always be found very quickly. If the table will not fit in the allotted number of CM's, all possible placements of the incompatibles will be tested, and it will take a long time to decide that no assignment is possible. To combat this problem, the following tests were added to the assignment program:

1. If the total number of incompatible states in any block is greater than the available number of CM's, the maximal incompatibles of that block are generated to see if it is possible to fit the table into the matcher. The assignment is aborted if not.

2. Certain alternative paths with no hope of correcting a problem are not tried.
Many 'alternatives' are just permutations among the CM's of states in the same maximal incompatible. Those which are detectable are not generated.

3. In case all else fails, a time limit is put on the assignment procedure. If the limit is exceeded, the assignment is aborted.

The processing necessary to recover from aborted assignments is beyond the scope of this chapter, but in general what would be done is to remove the problem term(s) from the table and perform the search in two passes.

3.3.3 Assigning State Addresses in CM's

When all states are assigned to CM's, their CM addresses are assigned. ASSIGN could possibly perform this function (see above), but it was decided to do it independently. If ASSIGN were delegated this function, the overhead when an alternate path had to be tried would be increased; the address assignment would have to be 'undone' along with the CM assignment. It is necessary to traverse the terms table anyway to compute statistics at the end of the run, so doing address assignment during this pass is no extra trouble.

Transition states are assigned from the bottom of each CM's memory space, and forks are assigned down from the top. If a CM either has too many total states or too many forks assigned to it, an error is flagged. Admittedly, assigning addresses during CM assignment would allow correcting these errors rather than aborting because of them. Attempting assignment to a full CM is identical to attempting assignment to a CM containing an incompatible state. The algorithm could back up similarly in each case. However, the number of CM's needed expands fast enough with increasing table size that there was plenty of room left in each CM even with the simple assignment algorithm used. Unless this fact changes, a more complicated address assignment algorithm is unnecessary.

3.3.4 AS2CMZ Program Overview

AS2CMZ is very similar to MAXINC, which was described earlier.
The only difference is that instead of (or if desired, in addition to) finding the number of CM's by generating maximal incompatibles, the states are assigned addresses in CM's using the above methods. AS2CMZ is essentially a test program, and the output available is similar to that of MAXINC. Figure 3.10 shows sample output from a run. Instead of printing the cover blocks and pair tables, the entire state table is printed out after addresses have been assigned. The statistics printed at the bottom are the same as for MAXINC. In this example the # CM's NEEDED is determined by computing maximal incompatibles, but in the tests used to generate the graph in Figure 3.12, the above formula for the number of CM's is used. The COMPATIBILITY CHECKS are tests done by ASSIGN, not tests done during pair table generation. The ASSIGNMENTS field shows the number of assignments attempted; this will be identical to the number of states unless backtracking was necessary.

In Figures 3.11 and 3.12 are shown the results of running AS2CMZ on the same tables used to generate the statistics in Figures 3.5-3.7. Recall that each data point represents the average time to assign five tables of the size shown. Considering that ASSIGN uses a potentially very slow recursive backtrack algorithm to assign states to CM's, one might wonder how efficient the program is in practice.
[Figure 3.10's state-table dump (columns for address, character, previous/next/fork pointers, and CM assignment) is garbled in this scan; the summary statistics at the bottom of the output read:]

TOTAL # OF TERMS = 00007, TOTAL # OF CHARACTERS = 00033
00004 UNIQUE 1ST CHARS, 00001 FORK STATES, AND 00026 STATES IN TABLE
00022 COMPATIBLES FOUND, 00030 TESTED; 00012 INCOMPATIBLES FOUND, 00021 TESTED.
00030 COMPATIBILITY CHECKS, 00026 ASSIGNMENTS, # CM'S NEEDED IS 00001 TIMES: BLDFK-00012 BLDCVR-00166 ASBLKS-00607 Figure 3.10 Sample Output from AS2CMZ 101 Figure 3.11 shows one measure of this - the number of unsuccessful assignments as a percentage of the total number of attempted assignments. In this case, an unsuccessful assignment corresponds to the *if (no incompatibles in TCM)' test failing in the ASSIGN procedure in Figure 3.9. As the graph shows, for tables of up to 120 terms fewer than 20% of the assignment attempts are unsuccessful. Another possible measure of ASSIGN'S efficiency is the number of backtracks executed - that is, the number of times a state is assigned to a CM and later de-assigned because a successor or fork could not be assigned. It is interesting that no backtracks were ever necessary when the number of CM's available was sufficient to hold the test table. Of course, the number of backtracks would be immense if the table did not fit in the CM's, but the tests introduced in Section 3.3.2 detect and stop excessive backtracking. Figure 3.12 shows the average execution times for AS2CMZ on the test tables. Again, Curve 1 shows the execution time with incompatibility covers only, and Curve 2 shows the time needed with incompatibility covers and tail removal. Notice that Curve 2 indicates that execution time seems to grow linearly for the range of table sizes shown, and that tables of up to 120 terms can be assigned in nearly a second each. Tables with 120 terms contain nearly 1000 states, and the stated goal has been handling tables of this size in real time. Whether this performance is sufficient depends upon the query arrival rate and processing strategy, which will be covered in a later chapter. However, being able to do state assignment at the rate of about a millisecond per state on a PDP-11/40 with 1.75 microsecond cycle time is certainly encouraging. 
[Figure 3.11 Number of Terms vs. Unsuccessful Assignments (%) for ASSIGN; plot not reproduced.]

[Figure 3.12 Number of Terms vs. Time for ASSIGN: Curve 1, incompatibility covers only; Curve 2, with tail removal; plot not reproduced.]

Chapter 4

Query Resolution

The Query Resolver is responsible for accepting input from the Term Matcher informing it of the occurrence of items of interest in the text stream (terms, context boundary codes, etc.) and, from these, detecting occurrences of the user's search expression. Figure 4.1 shows the syntax for search expressions in the query language used by EUREKA ([BurEm79]). The EUREKA language is reasonably powerful in the types of queries it allows while remaining fairly simple to learn and use. It will be used as a 'lower bound' in the succeeding discussion; any successful implementation of the resolver must be able to handle it. Consider a sample search expression in the language: '(CHANDLER IN AUTHOR) AND (MARLOWE OR (PRIVATE AND DETECT? IN SENTENCE) OR (DEAD AND BODY IN PARAGRAPH) IN BODY)'. The term matcher detects instances of search terms (CHANDLER, MARLOWE, PRIVATE, DETECT?, etc.) and context boundaries (the AUTHOR section, the BODY text, paragraphs, and sentences). The query resolver performs the bookkeeping necessary to detect instances of the full search expression from these. The structure of English-language terms is fairly well-defined and does not vary much from database to database. This made it possible to design special-purpose term searching hardware which would not have to be changed from system to system. Query resolution, on the other hand, is very much dependent upon the query language, and different languages can make fundamentally different requirements of the query resolver.

[Figure 4.1 Syntax of EUREKA Search Expressions: BNF grammar; the angle-bracketed nonterminal names are not reproduced. The productions define expressions built from terms and parenthesized subexpressions, connected by the operators AND, OR, &, and +, with optional 'IN' clauses naming one or more contexts (SENTENCE, PARAGRAPH, AUTHOR, TITLE, ...) separated by commas.]
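To make the bookkeeping concrete, the sample expression above can be pictured as a tree in which each node optionally restricts its subexpression to a context. The tuple representation below is this sketch's own assumption, not the thesis's data structure:

```python
# Hypothetical tree form of the sample EUREKA expression:
#   (CHANDLER IN AUTHOR) AND
#   (MARLOWE OR (PRIVATE AND DETECT? IN SENTENCE)
#            OR (DEAD AND BODY IN PARAGRAPH) IN BODY)
# Each node is (op, context, children); leaves of TERM nodes are strings.

query = ("AND", None, [
    ("TERM", "AUTHOR", ["CHANDLER"]),
    ("OR", "BODY", [
        ("TERM", None, ["MARLOWE"]),
        ("AND", "SENTENCE", [("TERM", None, ["PRIVATE"]),
                             ("TERM", None, ["DETECT?"])]),
        ("AND", "PARAGRAPH", [("TERM", None, ["DEAD"]),
                              ("TERM", None, ["BODY"])]),
    ]),
])

def terms(node):
    """Collect the search terms the Term Matcher must look for."""
    op, _ctx, children = node
    if op == "TERM":
        return list(children)
    return [t for child in children for t in terms(child)]
```

The term matcher reports hits on the leaves; the resolver's job, described in this chapter, is to evaluate the operator nodes subject to their context restrictions.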
It must be possible to program the query resolver to accept a wide variety of query languages, and this argues for implementing it with some sort of small general-purpose processor (i.e. a microprocessor) rather than with special-purpose hardware. On the other hand, term hit rates can get quite high. If two terms were found in a typical 200-character sentence, the resolver would have 67 microseconds to handle each of the three hits. Unless careful attention is paid to the resolution algorithms, a general-purpose processor may not be able to keep up. Many specifics of query resolution are system-dependent, and will not be discussed in detail. However, the speed of the hardware term matcher and the desire to search each cylinder in one revolution pose problems for the query resolver which will not change from system to system. It is these system-invariant problems that will be discussed. First, a basic description of the query resolution process will be given. Next, considerations relating to the speed and generality of the resolver will be discussed. Finally, methods of handling documents spanning track boundaries and of processing queries too large to be searched for in one revolution will be introduced.

4.1 Query Resolution Algorithms

One query resolution algorithm which has been documented in detail is the one for the CIA's proposed SAFE full-text retrieval system ([OSI77b]). This system uses a hardware term matcher (a Bird FSA) for each disk, but searches only one track of the disk at a time rather than being able to search several (or all) tracks on a cylinder in parallel. Also, the search expression syntax is considerably less general than that of EUREKA. The language essentially allows only product-of-sums expressions, with proximity operations allowed only between adjacent terms in the outer product expression.
However, there is enough similarity between SAFE and systems of the type being considered here to warrant a brief discussion of the OSI resolver.

The OSI Query Resolver (QR) receives input from the Term Matcher signalling the occurrence of search terms and document formatting codes (context delimiters). For each hit, the QR is passed a code which is an index into a table of addresses. Each address in this table (TTABLE) is a pointer to an entry in another table (XTABLE). The XTABLE entry contains one element for each query in which the term causing the hit appears. Each element points into a data structure (the QLB, or query logic block) containing the internal representation of the search expression. When a hit is detected, the QLB entry corresponding to the term is marked as present. At the end of the document, the QLB is examined to see if the search expression was satisfied. If it was, the document number is saved to be later reported to the host.

The SAFE system query language is simpler than EUREKA's in that no context-bounded subexpressions (i.e. 'IN SENTENCE') are allowed. The only in-context operator is proximity (i.e. 'A AND WITHIN 3 WORDS B'), which is only allowed at the outermost level. The QR must differ from OSI's algorithm, since to handle complex context-bounded queries, resolution cannot be deferred until the end of a document.

Stellhorn ([Stell74a]) discusses query resolution for an experimental precursor to EUREKA. This system did term matching differently from the ones that have been discussed, in that one pass was made over the data for each search term. Text between two context delimiters (e.g. a sentence) was buffered and searched for each term. When a hit occurred, the term's presence bit was set in the data structure representing the search expression, and the structure was scanned to see if the expression had been satisfied. If it had, a hit on the expression was reported.
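A schematic of the OSI dispatch path just described might look as follows. The names TTABLE, XTABLE, and QLB come from the text; the product-of-sums QLB representation and all table contents are illustrative assumptions:

```python
# Schematic of the OSI resolver: a hit code indexes TTABLE, whose entry
# names an XTABLE entry with one element per query containing the term;
# each element marks a presence bit in that query's QLB (query logic
# block). At end of document, each QLB is evaluated. A QLB is modeled
# here as a product-of-sums: a list of OR-groups, all of which must be
# satisfied (an assumption matching SAFE's restricted syntax).

QLB = {"Q1": {"groups": [["DATA", "BASE"], ["RETRIEVAL"]],
              "present": set()}}        # (DATA OR BASE) AND RETRIEVAL
XTABLE = {"x_data": [("Q1", "DATA")],
          "x_base": [("Q1", "BASE")],
          "x_retr": [("Q1", "RETRIEVAL")]}
TTABLE = {0: "x_data", 1: "x_base", 2: "x_retr"}   # hit code -> XTABLE entry

def hit(code):
    """Mark the term as present in every query that uses it."""
    for query, term in XTABLE[TTABLE[code]]:
        QLB[query]["present"].add(term)

def end_of_document():
    """Return the queries satisfied by this document, then reset."""
    satisfied = [q for q, b in QLB.items()
                 if all(any(t in b["present"] for t in g) for g in b["groups"])]
    for b in QLB.values():
        b["present"].clear()
    return satisfied
```

Note that all evaluation is deferred to the document boundary, which is exactly what the resolver for EUREKA-style context-bounded queries cannot do.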
Even ignoring the fact that Stellhorn's algorithm requires multiple passes over the data, it is not capable of resolving the EUREKA language. It can only identify and search one context at a time, and therefore cannot handle expressions like '(ENEMY AND AIRCRAFT IN SENTENCE) AND (SIGHT? AND OVERHEAD IN PARAGRAPH)'. On the other hand, it does allow context specifications to be attached to the entire expression rather than just allowing proximity specifications between adjacent terms.

Allowing context specifiers on subexpressions greatly increases the power of the language. For example, concepts can be expressed by constructs such as 'DATA AND BASE IN SENTENCE', and used as terms in more complex expressions. However, a price is paid for this feature. Doing resolution for the full EUREKA search expression syntax requires considerably more logic than either of the above methods. The primary difference is that different subexpressions may be looked for in different and potentially overlapping contexts. Each time a context boundary mentioned in an 'IN' clause of a query is crossed, the QR must check if the subexpression was found and then reset the term found flags for the subexpression. This is considerably more complex and time consuming than the OSI algorithm (which, except in the case of word proximity, only had to do processing at document boundaries), and also more complex than Stellhorn's program, which reset all flags each time a boundary was crossed.

The EUREKA query resolver (written by the author) basically works as follows:

1. The search expression is represented by a tree. Each operator node contains context specifications and 'found' bits for the right and left hand sides. These can be either terms or other operators.

2. Each time a term is found, it is looked up in a table. The table entry points to each node in the tree corresponding to an instance of the term. The proper 'found' bits are set in the tree.
Then the tree is traversed to see if the entire expression was satisfied as a result of the latest term hit. If so, the expression hit is reported.

3. When a context delimiter (fern) is encountered, the 'found' bits are reset for subexpressions being searched for in contexts delimited by the fern.

Searching in EUREKA is done in software, and data is read into memory before being searched. Therefore, timing is not critical. If the EUREKA resolution algorithm were adopted for use with a hardware term matcher, several problems might result. First, traversing the entire tree for each term found is too slow for large queries. Second, it is also necessary to go through the whole tree to reset found bits upon receipt of each fern. Both of these problems need to be corrected before the algorithm can be used to process output from the hardware term matcher in real time.

4.2 Query Resolution in Real Time

The speed at which query resolution can be done depends upon knowledge of expected term and delimiter hit rates. These rates in turn depend upon the number of queries being searched for, the complexity of these queries, and the frequency of occurrence of search terms in regions being searched. It was hoped that EUREKA could be used to measure these for a large sample of test queries, but the lack of a suitable database and a sizable user community made this impossible. Little data is available regarding performance statistics of full-text search systems. Roberts ([Rob77]) gives a statistical estimate of term hit rates. It is based upon the Zipf curve (rank order vs. frequency of occurrence in the database) and a curve of rank order in the database vs. frequency of occurrence in queries.
Although he admitted that the latter curve was empirically derived, Roberts contended that the product of the two curves was essentially a constant, and therefore the expected frequency of occurrence of a search term while scanning the database was a constant, which he determined to be around 10 . For Roberts' example system (64 key words per query and 20 queries being searched in a batch) the expected hit rate was one every 1215 microseconds. For a batch of 120 terms, which is closer to the size which has been used as an example in previous chapters, Roberts' formula predicts one hit every 11.6 milliseconds, or approximately one term hit per track. In a partially indexed system, of course, each track being searched would be known to contain at least one term occurrence, and usually more than one. Additionally, search terms are often looked for in conjunction with one another. Two semantically related terms such as 'DATA' and 'BASE' can be expected to occur in proximity much more often than is suggested by their frequency of occurrence. An analytical formula for hit rate would therefore be a complex conditional probability problem. Knowledge of probability distributions of co-occurrences of terms in both queries and the database would be necessary. It is not clear that this information could be obtained except from analysis of a large corpus of queries.

If a large corpus of queries were available, it would be possible to make hit-rate measurements by simulating the search and instrumenting the simulator to gather the necessary statistics. However, it is expected that fundamental differences between the search system being proposed here and conventional systems will result in substantial differences in the way searches are performed. Conventional systems are characterized by large search costs and slow response. Therefore, users attempt to get high recall by such techniques as the use of a large synonym dictionary (thesaurus) and complex search expressions.
A system using search hardware as being proposed here could have much lower cost per search and extremely fast response. This would substantially alter the dynamics of the search process, facilitating doing the logical query as a number of small steps. Thus, query characteristics can be expected to be quite different for hardware-augmented systems than for conventional ones. It is not clear that measuring hit rates using a corpus of conventional queries would yield valid results. The best that can be done before an experimental hardware-augmented system is built and tried in practice is to make educated guesses about query characteristics and hit rates. It seems reasonable that the average search expression will be smaller and the term hit rate therefore lower. However, even increasing the estimate given by Roberts' formula by a factor of ten predicts only around one hit per millisecond when searching for 120 terms. This figure is almost certainly much higher than will be observed in practice. The hit rate for ferns will most likely be much higher. For example, in the Brown Corpus database ferns occur at an average of one every 200 characters. Most of these are end-of-sentence ferns, and intuition suggests that most sentences will not contain any search terms even in relevant documents. Thus, it can be expected that the most critical real-time function of the query resolver will be fern processing.

4.3 Query Resolution for the Hardware Term Matcher

Individual systems may have different search expression syntax and different requirements concerning reporting search expression hits. Rather than forcing all systems to conform to an arbitrarily defined interface to the text searcher, it was decided to allow the query resolver's operation to be tailored to each system within a fairly broad set of guidelines.
This requires that it be relatively easy to modify details of the resolution algorithm, which in turn suggests that the query resolver be implemented in programmable logic (e.g. a microprocessor of some sort) rather than with fixed logic. Since many of the details of the resolution process are system-dependent, it is not worthwhile to present a detailed discussion of a complete query resolution system. On the other hand, it is necessary to demonstrate that the basic approach to resolution in the hardware searcher environment will work, which means demonstrating that the query resolver can do its job in real time. Fortunately, most of the time-critical parts of the resolver do not change from system to system. Term and fern hits have to be accepted from the matcher and recorded, the data structure representing the search expression must be scanned to detect query hits, and the 'found' bits must be periodically reset as context boundaries are crossed. Timing estimates for these functions would most likely remain valid for any system. Since some program fragments have to be implemented in order to obtain execution time estimates, some assumptions are necessary regarding the type of processor used in the implementation. The processor is constrained to being rather small, both because it will most likely be packaged in the disk drive cabinet and because it is desirable for the size and cost of the resolution logic to be of the same order of magnitude as that of the term matching logic. Since a PDP-11 was available for use by this project, sample program segments were implemented for it. Although the PDP-11 is a 16-bit processor, this is of no great advantage in many of the code segments, and in fact several microprocessors are available having close to the same performance. In a production system, a single-component microprocessor such as the Intel 8048 - possibly with the addition of some outboard hardware (RAM, I/O support) - would be used.
4.3.1 Accepting Hits from the Term Matcher

The term matcher signals the query resolver for every term hit. The matcher contains a limited amount of output buffering capability provided by the FIFO in the match controller (mainly to allow two successive hits, such as a term followed immediately by a fern, to be handled). It is still quite possible for a burst of hits to occur faster than they can be processed sequentially, so the resolver must be interrupted by the occurrence of a hit, and must then buffer it in its own memory, to minimize the latency between hit detection and when the hit is read from the FIFO. Figure 4.2 shows an interrupt routine to buffer term matcher hits. BUFFER is a circular hit buffer, with input pointer INPNT and output pointer OUTPNT. FIFO and STAT are device registers in the term matcher interface. When an interrupt arrives, the buffer pointer is loaded and the hit is read from the term matcher FIFO. It is saved in the buffer, and the circular wrap-around test is made. The program then checks for the buffer being full, and goes to an error routine (not shown) if it is. Finally, the term matcher status register is tested to see if another hit is available, and the program loops to buffer it if so. When no more hits are available, the pointer is saved and the program returns from the interrupt.

Assuming that the query resolver is at the highest interrupt priority, the maximum interrupt latency would be from the time STAT is tested, through the return from interrupt, back into the interrupt routine, until the FIFO is read. This takes 19 bus cycles or 33.25 microseconds (computed using the 1.75 microsecond cycle time memory on the PDP-11 used). Thereafter, a hit can be read from the FIFO every 20 bus cycles (35 microseconds). The total execution time of the interrupt routine if only one hit is read per interrupt is 59.5 microseconds.

[Figure 4.2 (PDP-11 assembly listing of the hit-buffering interrupt routine) not reproduced.]
The nominal speed of a 3330-type disk is 1.25x10^6 bytes per second. The MSI prototype term matcher from Chapter 2 could buffer eight hits in its FIFO. If the FIFO started empty, it would take nine hits within 61.25 microseconds (76 characters) to saturate the interrupt routine. Cases which could cause this include eight consecutive sentences containing no search terms, averaging under eight characters (plus fern) each, or one sentence 76 characters long containing seven search term hits (one hit every 11 characters). Neither of these cases is very likely. The worst-case sustained hit rate that can be handled is governed by the length of the loop (35 microseconds). Assuming the FIFO were full, one hit could be accepted every 44 characters. This rate is substantially greater than the expected rate of occurrence of either ferns or search terms.

This discussion demonstrates that the query resolver can accept hits from the matcher at a reasonably high burst rate. It is now necessary to ascertain whether the resolution itself can be done faster than the average hit rate.
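The buffering discipline of the interrupt routine of Figure 4.2 can be sketched in a high-level form (Python standing in for the PDP-11 assembly; the names BUFFER, INPNT, and OUTPNT follow the text, while the buffer size is an arbitrary assumption):

```python
# Sketch of the circular hit buffer maintained by the interrupt routine:
# hits read from the term matcher FIFO are stored at INPNT, the pointer
# wraps circularly, and a full buffer (INPNT catching OUTPNT) goes to an
# error routine, as in Figure 4.2. The size of 64 entries is assumed.

SIZE = 64

class HitBuffer:
    def __init__(self):
        self.buffer = [None] * SIZE
        self.inpnt = 0      # where the interrupt routine stores the next hit
        self.outpnt = 0     # where the resolver removes the next hit

    def interrupt(self, fifo):
        """Drain the term matcher FIFO into the circular buffer."""
        while fifo:                              # STAT: another hit is ready
            hit = fifo.pop(0)                    # read the FIFO register
            self.buffer[self.inpnt] = hit
            self.inpnt = (self.inpnt + 1) % SIZE # circular wrap-around test
            if self.inpnt == self.outpnt:
                raise OverflowError("hit buffer full")  # error routine

    def next_hit(self):
        """Called at resolver (non-interrupt) level; None if empty."""
        if self.outpnt == self.inpnt:
            return None
        hit = self.buffer[self.outpnt]
        self.outpnt = (self.outpnt + 1) % SIZE
        return hit
```

The point of the two-pointer discipline is that the interrupt routine touches only INPNT and the resolver only OUTPNT, so no locking is needed between the two levels.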
4.3.2 Term Hit Decoding

Each hit report from the term matcher contains the disk address at which the hit was detected and a unique code (the CM state address) identifying the hit. The disk address is not used during resolution; it is used only to specify the location of expression hits in reports to the host. The code is used to address a hash table which yields a pointer to a term table entry. A term table entry contains a pointer to the presence bit for each occurrence of the term in search expression trees. A term matcher with sixteen 128-state CM's has a total of 2048 possible states, any of which can be a final state. Allowing 2048 hash slots would require too much memory, so fewer must be used. Assume that each CM has 128 states, and the state address in the CM is used as the hash code. In this case, up to sixteen hit codes (state addresses) can hash into the same slot. If all states are equally likely to be terminal states, a binomial distribution can be used to calculate the expected number of hit codes hashing into the same slot. Since the average term is seven characters long, each state has a 1/7 probability of being a terminal state. Assuming (in the worst case) that all states in all CM's were full, the expected number of hit codes hashing into each slot is two. Figure 4.3 shows a program segment for taking a hit from the circular buffer filled by the hit interrupt handler and looking it up in the hash table. The expected value of the time to look up an entry is 42 microseconds, assuming it is equally probable that the hit code is any of those hashing to the same slot. It will take a couple more instructions to determine whether the hit is a term or fern, and go to the proper routine.

4.3.3 Context Boundary (Fern) Processing

In the EUREKA query resolver, each time a context boundary is crossed the search expression tree is scanned for nodes whose context is restricted to contexts corresponding to the boundary just crossed.
All 'found' bits in such nodes are reset so the search can begin anew the next time such a context is entered.

Since ferns occur on the average every 200 characters (160 microseconds), and since a substantial fraction of this time is already consumed handling the hit interrupt and looking it up in the hash table, visiting all nodes in each search expression to reset found bits is much too slow. For the same reason, it is not practical to look for search expression hits at fern boundaries (as opposed to during term hits). There are many more fern hits than term hits, and it would be necessary to examine the tree after each fern hit even in the usual case of no term hits occurring between successive ferns. Since Section 4.2 indicated that term hits occur so much less frequently than ferns, fern processing should be optimized, not term processing. Thus, the primary thing that has to be done for each fern is resetting found bits. This can be substantially speeded up if the bits are not stored in the search tree itself. Instead, all found bits for nodes (subexpressions) being searched for in a particular context are [...]

[Figure 4.3 (program segment for hash table lookup of hits) and Figure 4.5 (AND-Subexpression Tree, showing context, 'found', and leaf fields) are not reproduced, along with the intervening text.]

4.4 Contexts Spanning Track Boundaries

If documents were wholly contained in one track, searching could be done by examining tracks independently. Organizing the data like this is not practical, however, because documents are large relative to the track size. Not only would space be wasted at the end of each track due to fragmentation, but many documents will not even fit on a track. It is therefore necessary that documents, and even the larger contexts within documents (e.g. sections), be allowed to reside on more than one track. This introduces the problem of searching for queries in contexts mapping into more than one track (i.e. 'A AND B IN DOCUMENT').

The basic strategy for handling contexts spanning track boundaries is to search the track containing the start of the context, and save the partial results (found bits) existing at the end of the context fragment. These results are then used when searching the continuation of the context on the next track. For contexts residing on more than two tracks, the partial results are passed along until the track containing the end of the context is searched.

The manner in which data is organized on disk affects how this strategy is carried out. Two possible ways that spanning contexts can be arranged include continuing the context on the next track of the same cylinder (vertical spanning), or continuing it on the same track of the next cylinder (horizontal spanning). Each choice has advantages and drawbacks. An example of vertical spanning is shown in Figure 4.6.
One paragraph starts on track 0 and ends on track 2, and another starts on track 2 and ends on track 3. Both contain search expression hits. The right arrows at the end of tracks 0, 1, and 2 represent a fern indicating that the context is continued on the next track. Similarly, the left arrows at the beginning of tracks 1, 2, and 3 indicate contexts continued from previous tracks. During searching, the query resolver saves the found bits for contexts only partially contained in the track. These are combined to resolve the entire expression. For example, QR0 sends a message to QR1 saying that it has found 'A'. The paragraph continues through track 1 and onto track 2, so QR1 combines its own partial results with those from QR0, showing that both 'A' and 'B' have been found, and passes them to QR2. The paragraph ends in track 2, so QR2 saves the partial results when the end of the continued paragraph is encountered. When these are combined with the results from QR1, a hit on the entire expression is detected. Additionally, the partial results for the paragraph continued on track 3 are sent to QR3, telling it that 'C' has been found. The partial results saved by QR3 show that 'A' and 'B' were found in the part of the paragraph on track 3, so combining them with the results passed from QR2 produces [...]

[Figure 4.6 (vertical spanning example: tracks with term matchers TMi and query resolvers QRi) is not reproduced, along with the intervening text.]

If the number of cylinders is C, each scan searches an average of p_c*C cylinders. In this case, the average scan time is:

    t_s = p_c*C*R*(2.25 - 0.75*p_c)

The probability that a cylinder is accessed is the probability that any track on it is accessed, which can also be stated:

    p_c = 1 - Pr[no tracks to search on cylinder]
        = 1 - (1 - p_t)^T

where p_t is the probability that a track requires searching and T is the number of tracks per cylinder. The probability that a track does not require searching, (1 - p_t), is the probability that no query being searched accesses the track.
The number of queries in the system is n, and p_a is the probability that a given query accesses the track, so:

    (1 - p_t) = (1 - p_a)^n

and

    p_c = 1 - (1 - p_a)^(nT)

so the scan time t_s can be expressed as

    t_s = C*R*(1 - (1 - p_a)^(nT))*[2.25 - 0.75*(1 - (1 - p_a)^(nT))]

The time necessary to search all cylinders (i.e. for p_c = 1) is t_max = 1.5*R*C, which is 20 seconds for a 3330 disk drive (R = .01667 sec, C = 800 cylinders).

5.1.2 Number of Queries in the System - Constant Interarrival Time

Each query remains in the system for one scan, so t_s is the time necessary to service all queries in the system. To solve for n, the number of queries in the system, consider that during any scan, n = t_s/t_q queries will arrive. Thus, at equilibrium, n*t_q = t_s, which allows solving for n. Figure 5.1 shows a graphical solution. Notice that t_s reaches a maximum of t_max, which is the time necessary to search all cylinders. Curves 1, 2, and 3 show n*t_q for varying interarrival times t_q. For t_q > t_q1 (curve 1), n [...]

Table 5.1 - Queries Searched per Scan (only the rows for the smallest p_a values are legible; column headings not reproduced)

    p_a
    .003     38   14    4    1    1    1    1
    .001      9    1    1    1    1    1    1
    .0003     1    1    1    1    1    1    1
    .0001     1    1    1    1    1    1    1
    .00003    1    1    1    1    1    1    1

Table 5.2 - Queries per Searcher

    p_a \ Interarrival Time in Sec (t_q)
             .5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0
    1        41   21   14   11    9    7    6    6    5    5
    .3       20   12    9    7    6    5    5    5    4    4
    .1        9    6    4    4    3    3    3    3    2    2
    .03       4    3    2    2    2    2    2    1    1    1
    .01       2    2    1    1    1    1    1    1    1    1
    .003      1    1    1
    .001
    .0003
    .0001
    .00003

Table 5.3 - Searchers per Drive

    p_a \ Interarrival Time in Sec (t_q)
             .5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0
    1        19   19   19   19   19   19   19   19   19   19
    .3       19   19   19   19   19   19   19   19   19   19
    .1       19   19   18   17   16   15   14   14   13   13
    .03      18   14   12   10    9    8    8    7    7    6
    .01      11    8    6    5    4    4    3    3    2    2
    .003      6    3    2    1    1    1    1    1    1    1
    .001      2    1    1    1    1    1    1    1    1    1
    .0003
    .0001
    .00003

[...] system for searching at constant intervals t_q'. The analysis of t_s and n vs. t_q and p_a could then be used by substituting the release interval t_q' corresponding to the average interarrival time t_q.
This is actually a very reasonable model; before a query can be searched, various processing is necessary (building state tables, doing index processing, etc.), and performing these functions will tend to 'smooth out' bursts of arrivals. Rather than the queue being artificially imposed to simplify the analysis, it is a reflection of the way the host processes queries. The length of this queue is of interest, because if it becomes too long, response time will suffer. Since the queue is an M/G/1 system with constant service time [CofDe73], the average queue length is given by:

    n_q = r*(1 - r/2)/(1 - r),  where r = t_q'/t_q

Figure 5.2 shows how average queue length n_q varies with r. Notice that queries must be released for searching faster than the average interarrival time, or else the equilibrium queue length will be infinite. Choosing t_q' = .3*t_q results in an average queue length of .364, meaning that the queue is normally empty. However, the number of queries in the system n'(t_q', p_a) > n(t_q, p_a) for t_q' < t_q. [...] > .99. As one would expect, for p_a = 1, each searcher must look for all queries in the system (see Table 5.1). However, for a partially indexed system with p_a = .001 and t_q = 1.5 (the example used above), over 99% of the time no queries will be mapped onto a track, and almost never will more than one. Thus, a searcher designed to handle only one query will be sufficient.

5.1.5 Searchers per Drive

If cylinders are to be searched in one revolution, sufficient searchers must be built into each disk drive to handle each track requiring searching. If p_t is the probability that a track requires searching, the probability of needing to search k or fewer tracks on a cylinder is given by:

    Pr[# tracks to search <= k] = SUM(i=0..k) C(T,i) * p_t^i * (1 - p_t)^(T-i)

[...] > .99. Not surprisingly, a nonindexed system (p_a = 1) requires 19 searchers, one for each track of the cylinder. Notice that for the example partially indexed system (p_a = .001, t_q = 1.5 sec.), the probability is over 99% that no more than one track on a cylinder will require searching.
For this partially indexed example system, drives may safely be configured with only one searcher. As the previous section showed, this searcher only has to accommodate one query. Chapters 2 and 4 showed that the searcher could be implemented using very few IC packages, so in this case the hardware searcher would increase the cost of the disk drive by only a small fraction.

5.2 Term Matcher Characteristics

Section 5.1.4 showed how the maximum likely number of queries per searcher could be determined from the interarrival time (t_q) and the index system efficiency (p_a). If statistics regarding the number of terms per query are available, the number of terms that the matcher should be configured to handle can be determined. For most indexed systems, the searcher will be designed to handle only one query. Evidence is conflicting regarding just how large one query should be expected to be. [OSl77a] gave estimates for the CIA's SAFE system, indicating that the average number of terms per query was 23. However, the references for this system are not consistent about this. Roberts ([Rob77]) gave a conflicting set of requirements indicating that the average number of terms per query would be 41.2, and that 10% would have an average of 70 terms. How these estimates were obtained was not stated. Milner ([Miln76]) gave statistics from an analysis of one day's transactions of MEDLINE. His results indicated that the average number of terms per query was 6.5, even using the EXPLODE (thesaurus) feature. However, about 1/8 of the terms used in queries caused explodes. The average number of basic terms per exploded term in MEDLINE was given as 30.25. Again, there are fundamental differences between these systems and a system using the search hardware that has been discussed in previous chapters. SAFE uses one Bird FSA per drive, and the average response time is 7.5 minutes, which hardly encourages interactive use of the system.
Intuitively, users would try to minimize the number of searches they perform, which could explain the relatively large average number of terms per query. Additionally, the SAFE system is designed to be used in a rather specialized manner. The database is composed of continuously updated intelligence data, and the users are primarily specialists in a given area looking for new information pertinent to their specialty. Such users, looking for the same type of information day after day, have time to refine their queries. Rather than performing their intellectual search as a sequence of smaller trial queries, they can input the same (larger) canned query each time and have a fair amount of confidence that it will produce the correct result. These factors would seem to contribute to the large expected number of terms per query.

Substantial differences also exist between MEDLINE and a true full-text retrieval system; MEDLINE only stores document citations and only indexes on a small subset of the terms in the document. However, MEDLINE has a relatively fast response time (computed as averaging 11.83 sec. in [Miln76]), and therefore might have query characteristics similar to those of an interactive full-text system. Since a significant fraction of the queries contain exploded terms, the matcher should be able to accommodate in the vicinity of 30-50 terms.

Once the number of terms to be handled by the term matcher is decided upon, the number of CM's needed and the number of states per CM can be determined. For tables of under 120 terms, Figure 3.8 can be used to determine the number of CM's. To handle the 30-50 terms in the MEDLINE example, seven CM's would be required. Figure 3.5 showed that the average number of states per term was very close to seven, so for 50 terms, the table would have around 350 states. The number of states per CM would be (350 states / 7 CM's) = 50. Rounding up to the next power of 2, each CM would need 64 states.
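The sizing steps just described can be sketched in a few lines of Python. The fit n = 5 + t/16 for the number of CM's is the asymptotic form of Figure 3.8 quoted later in this chapter; the function name and the truncation rule are mine, and for 50 terms the fit yields 8 CM's where the text reads 7 directly from the figure, so this is only an approximation:

```python
from math import ceil, log2

def matcher_config(terms, states_per_term=7):
    """Rough NFSA term matcher sizing: CM count from the fit n = 5 + t/16,
    states spread evenly over the CM's, and each CM's state memory rounded
    up to the next power of two."""
    n_cms = int(5 + terms / 16)                   # truncate, per the fit
    per_cm = ceil(terms * states_per_term / n_cms)
    mem = 1 << ceil(log2(per_cm))                 # next power of two
    return n_cms, mem
```

Under these assumptions matcher_config(142) gives (13, 128), the configuration used for the SAFE example below, and a 50-term table likewise lands on 64-state CM's.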
Thus, to handle a large query, the term matcher would be configured with 7 CM's of 64 states each. In cases where the searcher has to handle more than one query, this configuration could accommodate 7.6 'average' 6.5-term queries.

As another example, requirements for the SAFE system given in [OSl77a] specify a maximum of 70 active queries per scan, which corresponds to an interarrival time t_q = 3.5 sec. Table 5.2 shows that this corresponds to 6 queries per searcher. Since the same reference gives the number of terms per query as 23.57, each term matcher must handle 142 terms. Extrapolating Figure 3.8 slightly, this number of terms requires 13 CM's. Each CM would require 76 states. Rounding up to the next power of two, the configuration for this system would use a matcher containing 13 CM's with 128 states each.

If the NFSA Term Matcher were built using LSI, the design and layout would be the most important factor in its cost. It is not likely that designing CM's with different numbers of states would be practical. Therefore, one would like to choose a memory size able to accommodate any system's requirement. Figure 3.8 illustrated that the number of CM's required was 5 + (t/16), where t is the number of terms. Thus, the number of terms per CM is asymptotic to 16. Since each term contains about seven states, each CM need contain only 112 states. Again, the next highest power of two is 128. Therefore, assuming that the terms/CM ratio continues to grow linearly, CM's of 128 states will be adequate for tables of arbitrary size.

5.3 Loading the Searchers

The searchers will be loaded while the disk is seeking to the next cylinder. During this interval, it is necessary to load each searcher on a drive with the search tree (residing in the query resolver), and the startup and state tables for each CM in the searcher. Each search tree node requires 40 bits, and since leaves are not represented by nodes, only about one node per term is necessary.
Each state table entry requires an average of 18 bits (16 for the TT word, plus a 1/8 probability of needing a 10-bit fork table entry). Additionally, the startup table for each CM contains (64×13) = 832 bits. Assuming that t terms of seven states each are loaded into n CM's, the number of bits to load into each searcher is:

    (bits/searcher) = 832n + t(7×18 + 40)

Substituting the formula n = (t/16) + 5 derived from Figure 3.8 into the above equation,

    (bits/searcher) = 4160 + 832t/16 + 166t = 4160 + 218t

The time available to load all searchers on a drive is 1/2 revolution, or 8 milliseconds. Assuming a data rate of 10^7 bits/second,

    t = [(8×10^4 / m) - 4160] / 218

where m is the number of searchers being loaded (i.e. the number of tracks being searched on the cylinder). Figure 5.3 (Searcher Capacity as Limited by Loading Time) plots terms per searcher (log scale) against searchers per cylinder. Note that if only one searcher is being loaded, there is time to load it with 349 terms. There is more than adequate bandwidth to handle the indexed system example (p_a = .001, t_q = 1.5 sec.).

It initially seems alarming that if all 19 tracks require searching, there is not even time to load one term. However, in practice, any system using an index will have a very low value of p_t (<< .01), and thus will search at most a few tracks per cylinder (Table 5.3). Only nonindexed systems would be configured with a searcher per read head. In a nonindexed system, all searchers will be looking for the entire query batch, and will therefore be loaded with identical tables. If the control logic in the drive is appropriately designed, it is only necessary to send the tables once, broadcasting them to all searchers.
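The loading-time budget can be transcribed directly into Python; the 8×10^4-bit budget is half a revolution at the 10^7 bits/second data rate, the function names are mine, and the integer result for one searcher is 347 terms, in the neighborhood of the 349 quoted in the text:

```python
def bits_per_searcher(t):
    """Load size for t terms: 832 startup-table bits per CM with
    n = t/16 + 5 CM's, plus 7*18 state-table bits and one 40-bit search
    tree node per term; this simplifies to 4160 + 218t."""
    n = t / 16 + 5
    return 832 * n + t * (7 * 18 + 40)

def max_terms(m, budget=8e4):
    """Terms loadable per searcher when m searchers on a cylinder must
    share the half-revolution budget: t = (8*10**4/m - 4160) / 218."""
    return int((budget / m - 4160) / 218)
```

With all 19 searchers loading independently, max_terms(19) is 0, matching the observation that a fully parallel nonindexed load must rely on broadcasting identical tables.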
Broadcasting helps even more than it first appears: the tables need be reloaded at most each time a query enters or leaves the system, which is considerably less often than once per disk revolution. In fact, by batching queries and processing them together, the searchers would only have to be loaded once per 20-second scan.

5.4 A Case Study

To demonstrate how an NFSA searcher might be configured for a practical application, we will again consider the example of the SAFE system mentioned briefly in Section 5.2. To reiterate, [OSl77a] estimated the minimum query interarrival time t_q as 3.5 seconds and the average number of terms per query as 23. Naturally, since the system is not indexed, all queries search all tracks and thus p_a = 1. The statistics shown in Figure 3.5 suggest that each query will contain around 177 characters (7.7 characters per term) and will require around 168 states (7.3 states per term).

To evaluate the practicality of using the NFSA searcher, the amount of hardware necessary will be compared against that for the Bird FSA used by the SAFE system. Most of the gates in both architectures are in the state table memories, so the state table memory size will be compared for the two designs. For a given table size, two numbers will be presented: the total number of bits in the state table memory (which is a good measure of system cost, but due to packaging considerations can be larger than what is actually needed), and the number of bits used to store the benchmark state table (which is a measure of load time and indicates the room available for larger than average tables).

The SAFE system was designed to use one Bird FSA per disk drive. The entire disk is scanned repeatedly to search, which takes about 266 seconds. Queries are collected during each scan for searching during the next scan. The average response time is therefore about 1.5 scans, or 400 seconds (6.67 minutes).
Configuring a comparable system with NFSA searchers would require 19 searchers per disk drive (one per read head), and using the staggered data organization assumed throughout this chapter, would require only 20 seconds to scan the entire disk surface. Two variants of the Bird FSA were mentioned in the SAFE design, one processing 6-bit characters, and the other processing each data byte from the disk as two sequential 4-bit nibbles. Each has different memory requirements, but using figures from [OSl77a], a six-bit character version configured to be capable of handling the assumed 70 queries per scan would require a total of 1392640 bits of memory (4 boards of 4096 85-bit words each), of which 678640 would actually be used. The 4-bit nibble version would require 5 boards of 4096 39-bit words, or 798720 bits of memory. Of these, 622758 would actually be used.

In comparison, the NFSA matcher would be configured with 13 CM's of 128 states (including 16 fork states) each. Each CM would require (64×13) = 832 bits of ST memory, (128×16) = 2048 bits of TT memory, and (16×10) = 160 bits of FT memory. The 19 NFSA searchers per drive would thus require a total of 750880 bits, of which 550240 would actually be used.

The amount of memory required by the NFSA searcher compares quite favorably with that required by either version of the Bird FSA, even ignoring that the latter requires 100-nanosecond memory and the former only requires a memory speed in the vicinity of 600 nanoseconds. When it is remembered that response time is at least 13 times better for the NFSA searcher, this design seems even more attractive.

Admittedly, the use of an index might not be practical considering the SAFE system's constantly changing database of intelligence data. However, it is interesting to look at how the system could be configured if indexing were used.
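Before turning to the indexed variant, the memory totals in this comparison can be reproduced by direct arithmetic. A Python transcription, using the board and word sizes quoted above (variable names are mine):

```python
# NFSA searcher: memory per CM as configured for this comparison.
st_bits = 64 * 13                  # startup table: 832 bits
tt_bits = 128 * 16                 # transition table: 2048 bits
ft_bits = 16 * 10                  # fork table: 160 bits
cm_bits = st_bits + tt_bits + ft_bits      # 3040 bits per CM

nfsa_total = 19 * 13 * cm_bits     # 19 searchers per drive, 13 CM's each

# Bird FSA variants, board figures per [OSl77a]:
bird_6bit = 4 * 4096 * 85          # four boards of 4096 85-bit words
bird_4bit = 5 * 4096 * 39          # five boards of 4096 39-bit words

assert nfsa_total < bird_4bit < bird_6bit   # NFSA needs the least memory
```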
Assuming Poisson arrivals with an average interarrival time t_q = 3.5 seconds, releasing queries into the system every t_q' = 2.5 seconds results in an average queue length n_q = 1.5 (Figure 5.2, r = .7). If an index is used with p_a = .001, column 5 of Table 5.2 shows that each searcher would only have to be designed to search for one query. The average query is assumed to contain 23 terms, and configuring each matcher with 6 CM's would be sufficient to handle queries of up to 32 terms. Each CM requires (13×64) = 832 bits for the ST memory, and assuming that each CM has 128 states (16 fork states), each would have 2304 bits of state table memory. Each matcher would thus have 6×(832+2304) = 18816 bits of memory. The 168 states in the average query require (168×18) = 3024 bits, so of the 18816 total bits in the matcher, only (832×6)+3024, or 8016 bits, would actually be used. Since Table 5.3 shows that only one searcher per drive is needed, this is the total system memory requirement. Finally, as Table 5.1 shows, each query can be searched in (much) less than the release time t_q'. Neglecting index processing time, the total response time is the search time plus the time spent in the queue, which is 2.5·t_q' = 6.25 seconds.

As a final example, consider a situation in which indexing is not used, and off-the-shelf 3330-type disks are used having only the capability of reading from one track at a time. This is identical to the SAFE system implementation. In this case, 70 queries (1610 terms) would have to be searched for at once. The memory requirements were given above for two variations of the Bird FSA, but suppose that an NFSA searcher was used instead. If one large NFSA were used, a table of 11753 states would have to be assigned to a term matcher with 95 CM's (extrapolating the results of Chapter 3).
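Returning to the indexed configuration above, its totals can be transcribed into Python (18 bits per state as derived in Section 5.3; note that 168 × 18 is 3024 bits, so 8016 bits of the matcher end up used):

```python
cms = 6
st_bits = 13 * 64                 # startup table per CM: 832 bits
state_bits = 128 * 18             # 128 states at ~18 bits each: 2304 bits
matcher_bits = cms * (st_bits + state_bits)

avg_query_states = 168            # 23 terms at ~7.3 states per term
used_bits = cms * st_bits + avg_query_states * 18

t_release = 2.5                   # release interval t_q' in seconds
response_time = 2.5 * t_release   # search plus queueing, per the text
```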
A single 95-CM matcher is not practical, so instead (since the 70 queries can be processed independently), several term matchers could be connected in parallel to the one read head, either reporting hits to one query resolver or to several (depending upon hit rate). If each NFSA term matcher were configured to have 16 CM's of 128 states each, seven 23-term queries would comfortably fit in each. Thus, a total of 10 term matchers can contain the 70 queries. Each CM has 3040 bits of memory, so the 160 CM's necessary to handle the 70 queries would have a total of 486400 bits of 600-nanosecond memory, of which 348034 would actually be used. This is slightly more than 60% of the 798720 bits of 100-nanosecond memory necessary for the 4-bit nibble processor version of the Bird FSA.

5.5 Summary

This chapter considered how the NFSA searcher could be configured for different types of systems. The two parameters having the most impact on how the system is configured are the query interarrival time and the index system efficiency (which indicates how much data must be searched for each query). A simple model was developed showing how search time, response time, searcher size, and the number of searchers per drive vary for a wide spectrum of potential systems with different values of the two parameters. It was shown that for two representative systems with very different parameters (indexed and nonindexed), systems could be configured having excellent response time and (for the indexed system) almost negligible cost for the search hardware. Finally, the performance and cost (in terms of memory requirements) of the Bird FSA searcher and the NFSA searcher were compared. The NFSA searcher can be implemented in LSI, which allows duplicating it so that each read head can be connected to its own searcher. This allows the response time to be 13 times faster than that for the SAFE system using the Bird FSA.
Additionally, the NFSA version can use much slower memory, and needs less of it than the Bird FSA. Finally, even if the NFSA is used in the same manner as the Bird FSA (i.e. connected to one head of the disk, searching the entire surface sequentially), the memory requirement is still much lower. As Chapter 1 mentioned, the Bird FSA seemed to be the most attractive of the systems mentioned in previous literature. The NFSA searcher seems to be a significant improvement over the Bird FSA, and all evidence seems to support the contention that systems can be built economically to allow many users to simultaneously search very large text databases with excellent response time.

Chapter 6

Summary and Conclusions

Very large text retrieval systems could prove very useful in many areas such as law, medicine, science, and engineering. The main obstacle to building such systems is the inability of conventional, software-based search techniques to deliver satisfactory response times at reasonable cost. To get fast response time, the database must be searched in parallel, and to do this cost-effectively requires special-purpose hardware optimized for the task of text searching.

The major components of a large-scale text-retrieval system are the host computer, the index processor, and the text searcher. This thesis focused on the latter. It was shown that previous architectures for hardware text searchers had problems inhibiting their use in large systems. A new architecture was introduced, which is modeled after a nondeterministic finite-state automaton (NFSA). It was discussed how this searcher could be used to search not only for words, but also for many other interesting classes of patterns. It was shown how the NFSA could be implemented in hardware using interconnected sub-machines. The state table is partitioned among the sub-machines, ensuring that each will only be following one alternative search path at once.
The particular organization chosen allowed a simple comparator to be used in each sub-machine (CM). It avoided state table memory contention by storing each CM's states locally, and included self-starting and forking features in each CM to eliminate the necessity of externally scheduling the CM's. Although partitioning the state table into compatible subsets facilitated the hardware implementation of the searcher, it was shown that a simpleminded approach to computing the partitions was too slow to be used in production. Optimizations were developed which drastically decreased the amount of computation necessary compared to that required by direct application of standard table partitioning algorithms. It was shown that state tables of reasonable size could be partitioned in real time even on a relatively slow machine.

The NFSA Matcher is capable of recognizing terms and other primitive patterns in the database, but to detect instances of higher-level search expressions generated by user queries, separate query-resolution logic is necessary. The fact that the NFSA Matcher runs at disk speeds places severe speed constraints on the query resolver, and methods of doing resolution under these constraints were discussed. Methods were also discussed for resolving hits in contexts spanning physical device boundaries.

For it to be practical to actually build text searching hardware, it must be applicable to systems with possibly widely varying characteristics. A major advantage of the NFSA Matcher is that it is made up of small, identical building blocks (Character Matchers). As such, it can be configured to efficiently handle a wide range of anticipated system loads and indexing strategies. The influence of these parameters on the amount of hardware necessary was studied, along with how the NFSA Matcher could be configured for an individual system once values for these parameters are known.
Examples were presented to show that the amount of hardware necessary for a practical system was indeed reasonable, and was in fact less than that required by the best previous searcher architecture.

Although the NFSA Searcher was discussed from the viewpoint of its application to searching very large databases, there is no basic reason why it cannot be used in other environments. Its high bandwidth was achieved by breaking the database down into many independent data streams and searching them simultaneously. Since the hardware necessary to search each data stream is relatively inexpensive (especially if only one query is being searched for), the NFSA Searcher 'scales down' well to smaller systems. For example, in an office automation system, a small NFSA Searcher could be used in the local terminals to provide a fast-access 'automated filing cabinet' feature.

The NFSA might also be useful to search text stored in the memory of a mainframe CPU. Although the search bandwidth for an individual NFSA is lower than one might hope for in such an application, several obvious methods exist for speeding it up. First of all, remember that the match rate is governed by the logic family used to implement the searcher, and for the LSI version which searched data directly from disk, very slow logic and memory were adequate. If the NFSA were used to augment the searching capability of a mainframe, not as many duplicates of the NFSA would be necessary, and the economic factors that forced an LSI implementation to be used for the database application would no longer hold. Therefore, faster logic families could be used. Memories capable of storing the state table exist with access times as low as 20 nanoseconds, and correspondingly fast logic is available for implementing the comparators and registers. While no detailed timing estimates for such a matcher were made, it seems possible to achieve match rates of under 50 nanoseconds per character.
Depending upon the application, it might also be possible to use the search strategy that was used for databases, that is, breaking down long strings to be searched into smaller segments and searching these in parallel with slower, cheaper searchers. If the patterns being searched for are short relative to the string segment size, many searchers could be used in this manner to match segments of long strings at a very high aggregate bandwidth.

As with any project of this sort, many extensions and possibilities for future research were thought of that could not be pursued due to time limitations. First, several extensions to the basic NFSA design are possible to greatly extend its power. One possibility is including the capability to do matches on numbers stored in the database. Remember that the comparator in the MSI prototype discussed in Chapter 2 had the capability of doing full numeric comparisons (<, =, and >). If numbers to be searched (e.g. dates) are stored using a fixed number of digits with leading zeroes, a simple extension to the NFSA could allow a state table to be defined to search for numbers related to a pattern by one of the above three operators. Another extension might be to support searching for 'near misses' on a pattern, allowing a limited number of incorrect characters to occur before a search path is abandoned. This would be useful in a nonindexed system (such as an office automation system) where spelling errors are possible.

But perhaps the major area needing further work is investigating how to tie all the parts of a large database system together. As Chapter 1 mentioned, index processing has been studied in some depth, and now similar work has been done on text searching. Integrating index processing hardware and text searching hardware into a balanced system will require much thought.
Partial inversion allows impressive speedup of the search using a relatively small index, while still allowing the full flexibility in search operations possible with full-text searching. However, the level to which the text is inverted must be properly chosen if the workload is to be balanced between the index processor and the searcher. Additionally, data flow between the components of the system must be studied. The output of the index processor is a set of postings not necessarily ordered by disk address. These must be sorted to allow commands to the text searchers to be generated. These and many other problems dealing with the care and feeding of the index processing and text searching hardware require further study.

There is light at the end of the tunnel: the parts of a large hardware-augmented search system are now all understood, and there is some feeling for how they can be fit together. With just a little more work on the higher-level system aspects, the use of specialized processors to search very large text databases in interactive time will become a reality.

References

[Bayer78] Bayer, M. P., "Dialog - An Online Retrieval System for Bibliographic Information," Digest of Papers, COMPCON Fall 1978, Washington, D.C., pp. 54-58.

[BiNeTr78] Bird, R. M., Newsbaum, J. B., and Trefftzs, J. L., "Text File Inversion: An Evaluation," Proc. Fourth Non-Numeric Workshop, Syracuse, N.Y., Aug. 1978, pp. 42-50.

[BiTuWo77] Bird, R. M., Tu, J. C., and Worthy, R. M., "Associative/Parallel Processors for Searching Very Large Textual Data Bases," Proc. Third Non-Numeric Workshop, Syracuse, N.Y., May 1977, pp. 8-16.

[Black78] Black, D. V., "System Development Corporation's Search Service," Digest of Papers, COMPCON Fall 1978, Washington, D.C., pp. 59-64.

[Booth68] Booth, T. L., Sequential Machines and Automata Theory, John Wiley and Sons, Inc., New York, 1968.

[BurEm79] Burket, T. C., and Emrath, P. E., "User's Guide to Eureka and Eurup," Report No.
UIUCDCS-R-79-956, University of Illinois, Urbana, Feb. 1979, pp. 1-49.

[ChKu77] Chen, S. C., and Kuck, D. J., "Combinational Circuit Synthesis with Time and Component Bounds," IEEE Transactions on Computers, Vol. C-26, No. 8, Aug. 1977, pp. 712-726.

[Cheng77] Cheng, W. K., "Multiprocessor for String Manipulation," M.S. Thesis, University of Illinois, Urbana, Oct. 1977, pp. 22-65.

[CofDe73] Coffman, E. G., and Denning, P. J., Operating Systems Theory, Prentice-Hall, Englewood Cliffs, New Jersey, 1973.

[Cop78] Copeland, G. P., "String Storage and Searching for Data Base Applications: Implementation on the INDY Backend Kernel," Proc. Fourth Non-Numeric Workshop, Syracuse, N.Y., Aug. 1978, pp. 8-17.

[FoKu80] Foster, M. J., and Kung, H. T., "Design of Special Purpose VLSI Chips: Examples and Opinions," Carnegie-Mellon University, Pittsburgh, Pa., Sept. 1979, pp. 5-17.

[FrMe75] Friedman, A. D., and Menon, P. R., Theory and Design of Switching Circuits, Computer Science Press, Woodland Hills, California, 1975.

[HaSt66] Hartmanis, J., and Stearns, R. E., Algebraic Structure Theory of Sequential Machines, Prentice-Hall, Englewood Cliffs, New Jersey, 1966.

[Hollaar76] Hollaar, L. A., "An Architecture for the Efficient Combining of Linearly Ordered Lists," Second Workshop on Comp. Arch. for Non-Numeric Processing, Jan. 1976.

[Hollaar79] Hollaar, L. A., "Text Retrieval Computers," Computer, March 1979, Vol. 12, No. 3 (ISSN 0018-9162), pp. 40-50.

[HopUl69] Hopcroft, J. E., and Ullman, J. D., Formal Languages and Their Relation to Automata, Addison-Wesley, Reading, Massachusetts, 1969.

[HopUl79] Hopcroft, J. E., and Ullman, J. D., Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, Reading, Massachusetts, 1979.

[HsKaKe75] Hsiao, D. K., Kannan, K., and Kerr, D. S., "Structure Memory Designs for a Database Computer," Proc. ACM Annual Conf., 1977, pp. 343-350.

[Hurl76] Hurley, B.
J., "Analysis of Computer Architectures for Information Retrieval," M.S. Thesis, University of Illinois, Urbana, May 1976.

[LiSmSm76] Lin, C. S., Smith, D. C. P., and Smith, J. M., "The Design of a Rotating Associative Memory for Relational Database Applications," ACM Trans. Database Syst., March 1976, Vol. 1, No. 1, pp. 53-65.

[McCar78] McCarn, D. B., "Online Services of the National Library of Medicine," Digest of Papers, COMPCON Fall 1978, Washington, D.C., pp. 48-53.

[MeCon80] Mead, C., and Conway, L., Introduction to VLSI Systems, Addison-Wesley, Reading, Massachusetts, 1980.

[Miln76] Milner, J. M., "An Analysis of Rotational Storage Access Scheduling in a Multiprogrammed Information Retrieval System," Ph.D. Thesis, University of Illinois, Urbana, Sept. 1976.

[Morgan76] Morgan, J. K., "Description of an Experimental On-Line Minicomputer-Based Information Retrieval System," M.S. Thesis, University of Illinois, Urbana, Feb. 1976.

[Muk78] Mukhopadhyay, A., "Hardware Algorithms for Non-numeric Computation," Proc. Fifth Symposium on Computer Architecture, Palo Alto, Calif., Apr. 1978, pp. 8-16.

[MuWar79] Mules, D. W., and Warter, P. J., "A String Matcher for an 'Electronic File Cabinet' Which Allows Errors and Other Approximate Matches," Department of Electrical Engineering, University of Delaware, Newark, April 1979.

[OzScSm75] Ozkarahan, E. A., Schuster, S. A., and Smith, K. C., "RAP: An Associative Processor for Data Base Management," Proc. 1975 AFIPS Nat. Comp. Conf., Vol. 44, AFIPS Press, Montvale, N.J., pp. 379-387.

[OSl77a] "High-Speed-Text-Search Design Contract Interim Report," OSI:R77-002, Operating Systems Incorporated, Woodland Hills, Calif., Jan. 1977, pp. 2-69 - 2-101.

[OSl77b] "High-Speed-Text-Search Design Contract Design Specification Document," OSI:R77-008, Operating Systems Incorporated, Woodland Hills, California, April 1977.

[Rob77] Roberts, D. C. (ed.), "A Computer System for Text Retrieval: Design Concept Development," U. S.
Central Intelligence Agency ORD, RD-77-10011, Oct. 1977.

[SinSh72] Sinha Roy, P. K., and Sheng, C. L., "A Decomposition Method of Determining Maximum Compatibles," IEEE Transactions on Computers, Vol. C-21, March 1972, pp. 309-312.

[Sprowl76] Sprowl, J. A., "Computer-Assisted Legal Research - An Analysis of Full Text Document Retrieval Systems, Particularly the LEXIS System," American Bar Foundation Research Journal, Jan. 1976, Vol. 1, No. 1, pp. 175-226.

[Stell74a] Stellhorn, W. H., "An Experimental Information Retrieval System," Report No. UIUCDCS-R-74-657, University of Illinois, Urbana, July 1974, pp. 22-43.

[Stell74b] Stellhorn, W. H., "A Processor for Direct Scanning of Text," presented at First Non-Numeric Workshop, Dallas, Oct. 1974.

[Stell75] Stellhorn, W. H., "A Specialized Computer for Information Retrieval," Ph.D. Thesis, University of Illinois, Urbana, 1975.

[Stoff74] Stoffers, K. L., "Sequential Algorithm for the Determination of Maximum Compatibles," IEEE Transactions on Computers, Vol. C-23, Jan. 1974, pp. 95-98.

[SuLip75] Su, S., and Lipovski, G. J., "CASSM: A Cellular System for Very Large Databases," Proc. Conf. on Very Large Data Bases, Sept. 1975, pp. 456-472.

[Unger69] Unger, S. H., Asynchronous Sequential Switching Circuits, Wiley-Interscience, New York, 1969, pp. 28-63.

[Woll80] Wollsen, D. L., "CMOS LSI - The Computer Component Process of the 80's," Computer, ISSN 0018-9162, February 1980, pp. 59-67.

Vita

Roger Lee Haskin was born on August 8, 1950, in Cleveland, Ohio. He received his B.S. degree in Computer Engineering from the University of Illinois at Urbana-Champaign in February, 1973, and his M.S. in Computer Science from the same institution in October, 1978. Mr. Haskin was a senior systems engineer with Datalogics, Inc., in Chicago, Illinois from 1972 to 1975. He supervised a group developing a comprehensive computer typesetting system which supported multicolumn full-page makeup.
Since 1975, he has been a research assistant at the University of Illinois at Urbana-Champaign, where he has done work in the fields of artificial intelligence, computerized aircraft flight instrumentation, information retrieval, and computer architecture.

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-80-1027
Title and Subtitle: Hardware for Searching Very Large Text Databases
Report Date: August 1980
Author: Roger Lee Haskin
Performing Organization: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801
Sponsoring Organization: National Science Foundation, Washington, D.C.
Type of Report and Period Covered: Doctoral - 1980
Abstract: The problem of searching very large text databases is discussed. Problems with using both conventional CPU's and previously developed text search hardware are noted. A new model for text searching, using a nondeterministic finite-state automaton (NFSA) to control matching, is introduced. By partitioning the nondeterministic state table and assigning blocks of states to interconnected sub-machines, it is shown how the NFSA searcher can be built with simple hardware amenable to LSI implementation. Methods of effectively partitioning large state tables are developed, query resolution is discussed, and system configuration and performance as a function of user load and other parameters are discussed.
Key Words: Computer Architecture; Disk Systems; Finite-State Automata; Full-Text Document Retrieval; Inverted File; Large-Scale Integration; On-Line Information Retrieval
Availability Statement: Release unlimited
Security Class: UNCLASSIFIED
No. of Pages: 155