Report No. 321
COO-1018-1178

SUGGESTIONS FOR USE OF A PARTICULAR DIRECTORY SCHEME

by

D. Austin Henderson, Jr.*
David E. Gold

April 10, 1969+

Department of Computer Science
University of Illinois
Urbana, Illinois 61801

*Presently at Massachusetts Institute of Technology, Project MAC
+Document originally completed in April 1967

Abstract

The storage/directory method described in this paper allows for flexibility in modification of a rapidly growing directory while maintaining a reasonable number of searches necessary to locate items whose keys are stored in the directory. A mathematical analysis shows the average number of searches to be of the same order of magnitude as for a binary-chop method of table look-up, while the directory itself need not be reordered to accommodate new entries. The analysis is followed by a series of procedures, and suggestions for procedures, which should prove useful when employing this method.

Introduction

Two classical directory schemes are the binary-chop and sequential-link pointer* methods.
The method to be described, which is essentially that of Hibbard [1], embodies many of the desirable features of these two schemes while avoiding many of their disadvantages. This storage/directory system will be referred to as the Binary-Chop Pointer (BCP) method. As will be shown later in this paper, the BCP method reduces to either the binary-chop or the sequential-pointer-linked directory method at the two ends of its functional spectrum.

*i.e., a directory in which a pointer in each entry indicates the next entry.

General

The binary-chop search method is a highly efficient directory look-up procedure. Given a directory of n elements in collating sequence, this scheme locates a key being searched for in, on the average, log_2 n - 1 searches. The main disadvantages of this type of table are the necessary reordering of the entries as additions are made and the necessity of using contiguous storage space for the table.

The sequential-pointer-linked directory allows for rapid modification and use of non-contiguous storage by merely setting pointers. On the other hand, the search of a table of this type is necessarily a linear process and hence requires, on the average, n/2 searches for a table of n entries.

A BCP table allows for addition of entries with no reordering, and does not require contiguous storage. The average number of searches, however, is small, being less than 1.38 log_2 n.

Format of the Table

A BCP table is made up of n entries, each entry consisting of the following items:

1) the key on which the search is made.
2) a low pointer or LOP.
3) a high pointer or HIP.
4) the argument which is associated with the key in that entry.

The LOP in an entry refers (points) to an entry whose key is low with respect to the present entry's key. Similarly, the HIP in an entry points to an entry whose key is high with respect to the present key.
Thus, an examination of a particular key in the table during a search operation results in a ternary decision branch:

i) an equal compare, in which case the argument corresponding to that entry is retrieved.
ii) the key being searched for is lower than the one being examined, in which case the LOP indicates the location of the next key to be examined.
iii) the key being searched for is higher than the one being examined, in which case the HIP indicates the location of the next key to be examined.

Example: Assume that the following five names appear as keys in a BCP table: MAC, BOB, PETE, SAM, JOE. The table might be:

Location   Key    LOP   HIP   Argument
   1       MAC     2     3       XX
   2       BOB           5       XX
   3       PETE          4       XX
   4       SAM                   XX
   5       JOE                   XX

If one desires to insert an entry whose key is MIKE, the following operations occur:

1) Insert the entry in an available storage location (in this case, location 6) with the LOP and HIP blank.
2) Compare MIKE to the key at location 1. Examine the appropriate pointer (in this case the HIP, since MIKE > MAC). If it is not blank, continue the search at the indicated location (i.e., now make the comparison at location 3).
3) When the examined pointer is blank, the location of the key being added is inserted in it (here the LOP of location 3, since MIKE < PETE).

In this case, the resultant table is:

Location   Key    LOP   HIP   Argument
   1       MAC     2     3       XX
   2       BOB           5       XX
   3       PETE    6     4       XX
   4       SAM                   XX
   5       JOE                   XX
   6       MIKE                  XX

Suppose one started with no entries in the table. If one then inserted MAC, followed by BOB, PETE, SAM, JOE, MIKE, in that order, the table above will result. However, if one were to insert the entries in the order BOB, MIKE, MAC, JOE, SAM, PETE, one would obtain the table:

Location   Key    LOP   HIP
   1       BOB           2
   2       MIKE    3     5
   3       MAC     4
   4       JOE
   5       SAM     6
   6       PETE

Note that the structure of a BCP table is highly dependent upon the ordering of keys as they are entered.
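The insertion steps and the ternary search decision described above can be sketched in a few lines. This is only an illustrative sketch, not the authors' program: the names `Entry`, `BCPTable`, `insert`, and `search` are hypothetical, locations are numbered from 1 as in the example tables, a blank pointer is represented by `None`, and the sketch checks for an equal key before storing the new entry (the text stores it first).

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Entry:
    key: str
    arg: Any = None
    lop: Optional[int] = None  # location of an entry with a lower key, or blank
    hip: Optional[int] = None  # location of an entry with a higher key, or blank


class BCPTable:
    """Entries sit in a list; location i in the text is index i - 1 here."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def insert(self, key: str, arg: Any = None) -> int:
        """Add an entry without reordering; return its location."""
        new_loc = len(self.entries) + 1
        loc = 1 if self.entries else None
        while loc is not None:
            entry = self.entries[loc - 1]
            if key == entry.key:
                # Assumption: equal keys are rejected in this basic sketch.
                raise KeyError(f"duplicate key: {key}")
            side = "lop" if key < entry.key else "hip"
            nxt = getattr(entry, side)
            if nxt is None:
                setattr(entry, side, new_loc)  # blank pointer found: link here
                break
            loc = nxt
        self.entries.append(Entry(key, arg))
        return new_loc

    def search(self, key: str) -> Tuple[Optional[int], int]:
        """Return (location or None, number of entries examined)."""
        loc = 1 if self.entries else None
        examined = 0
        while loc is not None:
            entry = self.entries[loc - 1]
            examined += 1
            if key == entry.key:
                return loc, examined
            loc = entry.lop if key < entry.key else entry.hip
        return None, examined


table = BCPTable()
for name in ["MAC", "BOB", "PETE", "SAM", "JOE", "MIKE"]:
    table.insert(name, "XX")
```

Inserting the six names in this order links MIKE from the LOP of location 3, and inserting them in the order BOB, MIKE, MAC, JOE, SAM, PETE gives the pointers of the second table.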
The algorithm used in determining the location of the pointer to the new entry in the last example is also the method used to search the table to retrieve an entry already in the table. An equal compare will result when the desired entry is encountered. An example of this basic algorithm was used by Knowlton [2].

A convenient form for representing the BCP table is a tree in which each node corresponds to an entry in the table. From each node there may be both left and right arrows, corresponding to the LOP and HIP respectively. The tree representation of the last table is shown in Figure 1. A node X which has a pointer to a node Y is said to subsume node Y and all nodes subsumed by Y.

Analysis

A measure of the worth of a table searching algorithm is the average time required to locate an item in the table. Here, this time is directly proportional to the average number of entries examined in obtaining a required entry. For a given table this is the average of the search times to each node in the tree that the table represents. A search time of 1 is assumed to the top node, 2 to each node subsumed immediately below this, and so on; a table with no nodes will be defined to have a search time of 0 to each element, and hence an average search time of 0.

The analysis is based on the notion that for a table of n+1 entries there are exactly (n+1)! different orders in which the entries can be made. For simplicity and without loss of generality it is assumed that the values of the keys of the entries are the integers 1 through n+1. Any given order of elements in the table will be assumed equally likely, i.e., with a probability of 1/(n+1)!.

Each ordering of entries defines exactly the tree structure of the table. The (n+1)! resulting trees are partitioned according to which of the (n+1) integers is entered first and hence determines the top node. Clearly there are n! trees in each block of this partitioning; each block is equally likely.
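The averaging model just described can be checked by brute force for small tables. The sketch below is an illustration under stated assumptions rather than part of the report: it enumerates every insertion order of the keys 1..n, builds the corresponding tree, and averages the search times. It then compares the result with the recursion S(n+1) = [(n+2) n S(n) + 2n + 1] / (n+1)^2, S(0) = 0, which this analysis goes on to derive. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from typing import List, Sequence


def total_search_time(order: Sequence[int]) -> int:
    """Sum of search times (top node = 1) in the tree built by inserting `order`."""
    root = None  # each node is a list: [key, low subtree, high subtree]
    total = 0
    for key in order:
        depth = 1
        if root is None:
            root = [key, None, None]
        else:
            node = root
            while True:
                depth += 1  # the new node lies one level below `node`
                side = 1 if key < node[0] else 2
                if node[side] is None:
                    node[side] = [key, None, None]
                    break
                node = node[side]
        total += depth
    return total


def brute_force_average(n: int) -> Fraction:
    """Average search time over all n! equally likely insertion orders."""
    if n == 0:
        return Fraction(0)
    orders = list(permutations(range(1, n + 1)))
    return Fraction(sum(total_search_time(o) for o in orders), n * len(orders))


def recursion_values(n_max: int) -> List[Fraction]:
    """S(0), ..., S(n_max) from the recursion quoted in the lead-in."""
    s = [Fraction(0)]
    for n in range(n_max):
        s.append(((n + 2) * n * s[n] + 2 * n + 1) / Fraction((n + 1) ** 2))
    return s


for n, value in enumerate(recursion_values(6)):
    print(n, value, brute_force_average(n) == value)
```

Exact fractions are used so that the two computations can be compared for equality instead of to within rounding error.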
The top node in the tree determines the number of elements to the left of the top, and the number to the right. For example, for tables consisting of the 5 keys 1, 2, 3, 4, 5, if 4 is the top node, the subtree to the left of 4 will have 1, 2, and 3 in it, and that to the right will have the node 5.

In any given block, say the (k+1)th, all trees can be characterized by the form shown in Figure 2; i.e., the top node is k+1, there are the k nodes for 1 to k in the subtree to the left, and the (n-k) nodes for k+2 to n+1 in the subtree to the right. Within the block we will have all trees of this form. Hence, on the average, the search time to an element in the left subtree will be S(k)+1, where S(k) is the average search time to an element in a table with k elements. There will be k such elements. The search time to an element in the right subtree will be S(n-k)+1. There will be (n-k) such elements. The search time to the top element, k+1, is 1.

[Figure 2]

Let the average search time over the (k+1)th block of the set of all trees having n+1 elements be denoted S_{k+1}(n+1). Then

\[
S_{k+1}(n+1) = \frac{k\,[S(k)+1] + (n-k)\,[S(n-k)+1] + 1}{n+1}
             = \frac{k\,S(k) + (n-k)\,S(n-k) + n + 1}{n+1}
\]

Now to obtain the average search time over all blocks we take an average (unweighted, as all blocks are equally likely) of S_{k+1}(n+1). Thus, we arrive at the basic recursion formula:

\[
S(n+1) = \frac{1}{n+1} \sum_{k=0}^{n} \frac{k\,S(k) + (n-k)\,S(n-k) + n + 1}{n+1}
       = \frac{\sum_{k=0}^{n} \left[ k\,S(k) + (n-k)\,S(n-k) \right]}{(n+1)^2} + 1
\tag{1}
\]

As the summation index k runs from 0 to n, kS(k) runs from 0S(0) to nS(n) and (n-k)S(n-k) runs from nS(n) to 0S(0). Also 0S(0) = 0. This leads to

\[
S(n+1) = \frac{2 \sum_{k=1}^{n} k\,S(k)}{(n+1)^2} + 1
\tag{2}
\]

This formula involves, in the calculation of S(n+1), the use of the values S(1) ... S(n). A more useful formula involving only S(n) can be obtained as follows. Formula (2) can be rewritten as

\[
(n+1)^2\,[S(n+1) - 1] = 2 \sum_{k=1}^{n} k\,S(k)
                      = 2n\,S(n) + 2 \sum_{k=1}^{n-1} k\,S(k)
\tag{3}
\]

But from (3) with n replaced by n-1,

\[
2 \sum_{k=1}^{n-1} k\,S(k) = n^2\,[S(n) - 1]
\]

Hence,

\[
(n+1)^2\,[S(n+1) - 1] = 2n\,S(n) + n^2\,[S(n) - 1]
\]

\[
S(n+1) - 1 = \frac{2n\,S(n) + n^2\,S(n) - n^2}{(n+1)^2}
\]

\[
S(n+1) = \frac{(n+2)\,n\,S(n) - n^2 + (n+1)^2}{(n+1)^2}
       = \frac{(n+2)\,n\,S(n) + 2n + 1}{(n+1)^2}
\tag{4}
\]

This formula can also be written:

\[
(n+1)\,S(n+1) = \frac{(n+2)\,n\,S(n) + 2n + 1}{n+1}
\]

And clearly S(n) behaves as log n for n sufficiently large. Now, knowing that S(n) = k ln n in the limit, the value of the constant k can be determined from (4). Clearly:

\[
k \ln(n+1) \approx \frac{(n+2)\,n}{(n+1)^2}\, k \ln n + \frac{2n+1}{(n+1)^2}
\]

As n → ∞, ln(n+1) - ln n ≈ 1/n and (2n+1)/(n+1)^2 ≈ 2/n, while the remaining term k ln n/(n+1)^2 is of lower order. Hence k = 2, and S(n) ≈ 2 ln n ≈ 1.38 log_2 n.

[Figure 3]

Deletions in a BCP Table

Deletion of an item from a basic BCP table is not always a trivial operation. A node can have one of three pointer structures within the BCP table. These are represented graphically in Figure 4(a)-4(c), where X is the node to be deleted.

The deletion in case a) is merely the removal of the pointer to node X. In case b), the pointer to node X is reset to point to node Y, as shown schematically in Figure 4(d). In either case, the entry corresponding to the deleted node X will, in general, occur somewhere imbedded in the actual BCP table. This space is now freed and may be used for a subsequent entry (addition) in the table.

Case c) is somewhat more difficult to handle because there exists a single pointer to X but two from X. Clearly this single pointer cannot be made to point to two different nodes at the same time. One possible solution would be to reset this lone pointer to point to one of the two nodes subsumed by X, say the left (or Y). A pointer to the other subsumed node (Z) would be established at the first node not having a HIP in use which is reached by successively following HIP's from Y. (Note that any other form of linking node Z from a node subsumed by Y will yield an incorrectly structured BCP table.
In particular, when there exists any other linkage to Z, the Z entry is irretrievable.) This solution is undesirable because the search path to Z and any node subsumed by it must now go through Y and, in general, other nodes subsumed by Y.

As an example, consider the representation of a BCP table shown in Figure 5(a), where the number of searches for each node is listed below the respective node. It is desired to delete node e. If the pointer is now changed to indicate the left node subsumed by e, the resultant situation is shown in Figure 5(b). For the case in which the pointer is reset to indicate the right node first, the table is represented by Figure 5(c).

[Figure 4 (a)-(d)]

[Figure 5 (a)-(d). * indicates null node]

A more acceptable solution to the problem is reached by allowing the "image" of the deleted node to remain for comparison purposes. The entry in the table corresponding to this node is flagged to indicate that it no longer exists as a standard entry in the table, but rather is there only to allow the search algorithm to continue to reach nodes subsumed by it. Such a node is called a null node. The tree representing such a deletion is shown in Figure 5(d).

Note that a null node may be entirely deleted if one or both of the two nodes subsumed by it become deleted through some sequence of later deletions. When this occurs, the null node is handled as in the earlier non-troublesome cases (cases a and b above). It is also possible to re-utilize the space in the table containing a null node entry when a subsequent addition to the table falls anywhere in the allowable range. The authors have written an algorithm which does this in a manner which is essentially no more complicated than the standard search algorithm.

Balance of a BCP Table

The structure of a BCP table varies between two extremes. These endpoints are referred to as best case and worst case conditions.
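As a rough illustration of how far apart these two extremes lie, the sketch below computes the average search time S for one table built by entering seven keys in ascending order and for one built by entering the middle key of each range first. The helper names and the middle-key-first construction are assumptions made for this example; the report does not prescribe a particular construction.

```python
from fractions import Fraction
from typing import List, Sequence


def average_search_time(order: Sequence[int]) -> Fraction:
    """Average S over all nodes of the tree built by inserting `order`."""
    root = None  # each node is a list: [key, low subtree, high subtree]
    total = 0
    for key in order:
        depth = 1
        if root is None:
            root = [key, None, None]
        else:
            node = root
            while True:
                depth += 1  # the new node lies one level below `node`
                side = 1 if key < node[0] else 2
                if node[side] is None:
                    node[side] = [key, None, None]
                    break
                node = node[side]
        total += depth
    return Fraction(total, len(order))


def middle_first_order(sorted_keys: Sequence[int]) -> List[int]:
    """Entry order that puts the middle key of every range above the rest."""
    if not sorted_keys:
        return []
    mid = len(sorted_keys) // 2
    return (
        [sorted_keys[mid]]
        + middle_first_order(sorted_keys[:mid])
        + middle_first_order(sorted_keys[mid + 1:])
    )


keys = list(range(1, 8))  # the integers 1 - 7
ascending = average_search_time(keys)
balanced = average_search_time(middle_first_order(keys))
print(ascending, balanced)
```

The same `average_search_time` value, recomputed from time to time for a live table, is one simple quantity against which a balanced table of equal size can be compared.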
The best case condition occurs when S is a minimum for a given number of entries in the table, and is the same (with respect to search time) as a binary-chop method. This case is equivalent to the method described by Brooks and Iverson [3]. A schematic representation of such a table containing seven keys (the integers 1 - 7) is shown in Figure 6(a). Note that the tree corresponding to this condition is unique only when the number of entries in the table is one less than a power of two.

The worst-case condition is satisfied when S is a maximum for a given number of entries in the table. The search time for the table in this case is the same as for a sequential-linked-pointer directory. Such a situation is depicted in Figure 6(b).

[Figure 6 (a), (b)]

One method of eliminating this situation comes to mind when one considers that most tables or directories are not formed by starting with an empty list and merely adding entries. In general an initial table is established by some kind of declaration, and this table is modified by later adding or deleting entries. The entries thus initially declared should be established in the table in a best-case fashion.

A practical alternative to this method is applicable in cases where there exists previous knowledge as to the usage of the initial entries. In such a situation, the approximate frequencies of use (retrieval) could also be declared, and this information would be used to establish an optimum BCP table. Such a table will minimize S over the retrieval of all entries in the table, where the retrieval of each entry is weighted according to its frequency of use. Note that in general this is not the same as a best-case table.

Regardless of the initial method used to start the table, the tree representing the table might become highly asymmetric or skewed through subsequent additions and/or deletions. The way to determine this would be to establish a measure of skewness which would be calculated periodically.
When this measure exceeded some allowable limit, a garbage collection routine could take over and restructure the table. Note that a larger well-structured table will require more modifications than a smaller one before becoming adversely skewed; hence a very large directory need be examined for skewness less frequently than a smaller one.

A simple, although not necessarily best, measure of skewness is merely S for a given table. This can then be compared to S for the best-case condition, and suitable action can then be taken.

Multiple Keying

A method suggested by Brooks and Iverson [3] for a particular type of binary-chop table is also applicable to the BCP table -- namely, multiple keying. In the BCP table this is equivalent to imagining many columns instead of a single column for each entry (i.e., for a single argument), each containing a key and a set of pointers. This allows searching on any set of keys and can be thought of as an alternate indexing scheme.

Note that it is also possible to use one or more of the keys of an entry as the actual argument in some cases. An example of such a case would be telephone directory information, where the sets of keys would be name, address, and telephone number, thus allowing the retrieval of either or both of the remaining corresponding keys when searching on any one of the three.

Segmentation

When using a BCP table with multiple keys, it is not always possible to obtain a system of segmenting the table which is universal to all keys. In the telephone directory example, one would be inviting trouble to suggest that the table can be broken into two tables by merely inserting people with last names beginning in A-L in one table and those with M-Z in the other. Which of these tables does one look in when one wants to retrieve a name but knows just a telephone number?
One special feature of the BCP table is helpful in this respect: if we always require null nodes when deleting items, all pointers in the table point down in the table (i.e., away from the first entry). The entire table can then be thought of as one long continuous one, where different segments may be retrieved and searched as needed.

In general, a pointer from an entry in a segment may point to an entry in another segment. However, because these pointers only point downward, it will never be necessary to call a segment into memory twice to locate an entry. Because a pointer need not point from an entry in a segment to an entry in the next segment, some segments might be entirely passed over. It would also be possible to structure the initial table (at declaration time) such that a minimal number of segments need be entered (and hence retrieved).

Multiple Entry Pointers

Previously, the top node of the tree defining a table was the initial entry in that table, and hence this entry contained the first key to be examined in any search. This entry can be chosen at the initial declaration of the directory from those entries thus declared, but when multiple-keying the best candidate is not obvious. (Indeed, a best candidate may not exist -- the best choice with respect to one set of keys might be the worst with respect to another.) A method of resolving this is to assign a pointer for each column.

In the case where the table is to be segmented, there need only exist the further restrictions that each of these pointers indicate an entry in the first segment and that pointers not in this first segment point downward as before; i.e., the pointers in the first segment may point up or down. This creates no problems because this first segment is always the first to be retrieved and there are none before (higher than) it which could be recalled.
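One way to picture a separate entry pointer for each column is sketched below. It is a minimal illustration under several assumptions: records are dictionaries, each column gets its own LOP/HIP pair per record, the declared records are linked middle-key-first in every column, and segmentation (with its downward-pointing restriction) is not modeled. The names `build`, `find`, `links`, and `top`, and the sample telephone numbers, are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Record = Dict[str, str]
Links = List[Dict[str, List[Optional[int]]]]


def build(records: Sequence[Record],
          columns: Sequence[str]) -> Tuple[Links, Dict[str, Optional[int]]]:
    """Link the declared records once per column; return (links, entry pointers)."""
    # links[i][col] == [LOP, HIP] of record i in the tree for column col
    links: Links = [{c: [None, None] for c in columns} for _ in records]
    top: Dict[str, Optional[int]] = {}
    for col in columns:
        order = sorted(range(len(records)), key=lambda i: records[i][col])

        def link(lo: int, hi: int) -> Optional[int]:
            if lo > hi:
                return None
            mid = (lo + hi) // 2
            i = order[mid]
            links[i][col][0] = link(lo, mid - 1)
            links[i][col][1] = link(mid + 1, hi)
            return i

        top[col] = link(0, len(order) - 1)  # entry pointer for this column
    return links, top


def find(records: Sequence[Record], links: Links,
         top: Dict[str, Optional[int]], col: str, key: str) -> Optional[Record]:
    """Search on one column, starting from that column's own entry pointer."""
    cur = top[col]
    while cur is not None:
        here = records[cur][col]
        if key == here:
            return records[cur]
        cur = links[cur][col][0 if key < here else 1]
    return None


directory = [
    {"name": "MAC", "phone": "555-0103"},
    {"name": "BOB", "phone": "555-0101"},
    {"name": "PETE", "phone": "555-0102"},
]
links, top = build(directory, ["name", "phone"])
print(top)  # the two columns start their searches at different records
print(find(directory, links, top, "phone", "555-0102"))
```

In this small directory the name column is entered at MAC and the phone column at PETE, which is the situation the per-column pointer is meant to accommodate.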
Variable Segment Boundaries

In searching through a segmented table, it is obviously advantageous to minimize the number of segments which must be retrieved. The fact that pointers only point downward can sometimes prove useful here, too, depending upon the configuration of the table in tertiary storage. If the table is stored contiguously such that a number of records comprise each segment, the segment boundaries may be varied to optimize the transfer of data (table segments) into primary storage. When a pointer indicates an entry which is outside of the segment containing that pointer, the next segment to be retrieved is the one starting with the record containing the new entry and continuing until the proper number of records is reached.

For example, suppose that segments are made up of four records each, and that the entire table is stored contiguously. If a pointer in the first segment indicates an entry in the seventh record, the second segment (consisting of records 5-8) need not be retrieved. Instead, a segment consisting of records 7-10 can be retrieved. If this new entry's pointer now routed the algorithm to an entry in the first half of the third segment (records 9 and 10), no new segment need be immediately retrieved. Had records 5-8 been retrieved instead, a further retrieval would have been needed at this point.

Duplications

A whole area of thought is opened up when we consider the effects of having duplicated keys in any column. If no special precautions are taken, we will have an equal compare resulting while adding an entry to the table. Normally this would trigger an error condition.

The simplest solution is to allow a duplicate key to be considered as higher than the equal entry already in the table. This inserts duplicates like any other entry. However, when searching to locate an element, one would have to search all the way to the bottom of the tree to determine if there were any duplicate entries. A partial solution is to flag entries which have duplicates.
Then one need search further only if a duplicate is known to exist. A further disadvantage of this system is that to reach entries lower in the tree one will in general have to search many duplicates -- a time-wasting procedure.

The best system is to provide a pointer from a node which is being duplicated by a new entry to that new entry. If an entry has SAM as a key, for example, a third pointer is provided to the next entry with SAM as a key (the HIP and LOP being the first two). The first SAM is flagged to indicate that a duplicate exists, or the presence or absence of the third pointer (referred to as a duplicate pointer -- DUP) can be tested. A third entry with SAM as a key will simply be pointed to in a sequential-pointer-linked fashion from the second SAM-keyed entry, and so forth. To allow this scheme, there must be storage provided (for each column if multiple-keying) in each original (non-duplicate) entry to provide space for a possible later duplicate pointer. This usually represents a fairly expensive use of storage.

A scheme which overcomes the necessity of wasting core on possibly unused duplicate pointers is the following. The second SAM-keyed entry has its HIP set to point to the entry which the first SAM-keyed entry's HIP indicated. The HIP in the original SAM-keyed entry then indicates the new entry. The LOP of the new entry can be used for the first pointer in a chained set to the third and higher-numbered SAM-keyed entries (see Figure 7). A drawback of this system is that the pointer from the second SAM-keyed entry to the entry indicated by the original HIP may be pointing physically upward in the table. If segmentation is being used, this is unacceptable.

If both the boundary conditions above are in force (segmentation, no space for DUP pointers), as may happen in large files, a solution may be achieved at the expense of some search time.
When a new entry duplicates an old, the old entry is flagged and the new entry is entered in a BCP table containing only duplicate entries. Space is left in this table for DUP pointers. One must search both tables to locate a duplicate. But hopefully this duplicate table will be small relative to the main table, and both of these drawbacks will be minimal.

If one is multiple-keying, an entry may duplicate an existing key in one column and be a new key in a second. One cannot store the entry in two different tables at once, unless they are interleaved. This scheme necessitates having a pointer to the top node of the duplicate table in each column, as these tables will be made up of keys from different entries in each column. The following table illustrates this concept of a multiple-keyed table with duplicate keys which is capable of being segmented.

[Figure 7]

The keys to be searched here are name and age:

       Name                                  Age
   Loc.(1)  Key   LOP  HIP  DUP(2)       Loc.(1)  Key   LOP  HIP  DUP(2)
      1     MAC     3    5                  2     *22     8    6
      3     DON    13    7              ->  4      22    24   22    10
      5     TIM     9   19                  6      29    12   16
      7     JIM                             8      19    14   18
      9     RON         15                 10      22
  -> 11     JIM    23         21           12      27
     13     BOB    17                      14      17
     15     SAM                            16      32         26
     17     *AL                            18     *20         20
     19     TOM                            20      21
     21     JIM               25           22      27
     23     AL                             24      20
     25     JIM                            26      33

* indicates that a key has been flagged to show that it has been duplicated.
-> (3) marks the first entry of the duplicate table in that column.

Notes:

(1) The locations here appear in sequential order only to simplify the example. In actual practice, entries may occupy any fixed size of memory, depending on length of keys, pointers, etc.
(2) Space for the DUP is not left for entries in which none appears, but only in those entries in which one exists in the table.
(3) The pointers for entering the table to search on the two keys, name and age, are assumed to be set at locations 1 and 2 respectively. The arrows indicate the pointers which reference the first entries in the duplicate table.
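The two-table arrangement can be sketched for a single column as follows. This is an assumption-laden illustration, not the authors' procedure: nodes are linked by object reference instead of by location, every node carries a `dup` field even though only duplicate-table entries need one, and the class and method names are hypothetical. The sample data are the name/age pairs of the table above, with the age used as the argument.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Node:
    key: str
    arg: Any
    lop: Optional["Node"] = None
    hip: Optional["Node"] = None
    dup: Optional["Node"] = None  # chain of equal keys (duplicate table only)
    flagged: bool = False         # main table: a duplicate of this key exists


class DupAwareTable:
    def __init__(self) -> None:
        self.main: Optional[Node] = None  # entry pointer of the main table
        self.dups: Optional[Node] = None  # entry pointer of the duplicate table

    @staticmethod
    def _place(top: Node, node: Node) -> Optional[Node]:
        """Link `node` below `top`; if an equal key is met, return it unlinked."""
        cur = top
        while True:
            if node.key == cur.key:
                return cur
            side = "lop" if node.key < cur.key else "hip"
            nxt = getattr(cur, side)
            if nxt is None:
                setattr(cur, side, node)
                return None
            cur = nxt

    def insert(self, key: str, arg: Any) -> None:
        node = Node(key, arg)
        if self.main is None:
            self.main = node
            return
        original = self._place(self.main, node)
        if original is None:
            return                      # new key: it is now in the main table
        original.flagged = True         # old entry is flagged
        if self.dups is None:
            self.dups = node
            return
        first = self._place(self.dups, node)
        if first is not None:           # key already in the duplicate table
            while first.dup is not None:
                first = first.dup
            first.dup = node            # append to the DUP chain

    @staticmethod
    def _find(top: Optional[Node], key: str) -> Optional[Node]:
        cur = top
        while cur is not None and cur.key != key:
            cur = cur.lop if key < cur.key else cur.hip
        return cur

    def find_all(self, key: str) -> List[Any]:
        hit = self._find(self.main, key)
        if hit is None:
            return []
        found = [hit.arg]
        if hit.flagged:                 # search the duplicate table only if flagged
            d = self._find(self.dups, key)
            while d is not None:
                found.append(d.arg)
                d = d.dup
        return found
```

An unflagged key is answered from the main table alone; only a flagged key costs the second search in the duplicate table.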
If the added device of entering the key being duplicated in the duplicate table as well is used, then when one desires to search for a key which one knows to be duplicated, one can search the duplicate table first, bypassing the (hopefully larger) search on the main table. As one cannot recopy keys in other columns which are not duplicates, one copies only one column into the duplicate table and uses, as its argument, a pointer (admittedly upward) to the original entry. As this upward pointer is used only when the final entry has been retrieved, we will have at most one segment of table to recall.

Summary

The Binary-Chop Pointer directory scheme seems to have considerable promise. The authors have noted several possibilities and have given suggestions for incorporating them into a computer system. Their results are by no means exhaustive but, rather, merely suggestive of the possibilities of a little-explored method.

Bibliography

1. Hibbard, T. N. Some Combinatorial Properties of Certain Trees with Applications to Searching and Sorting. JACM 9 (Jan. 1962), 13-28.
2. Knowlton. Movie on L6.
3. Brooks and Iverson. Automatic Data Processing, Ch. F: Searching and Sorting.