Report No. UIUCDCS-R-80-1022                         UILU-ENG 80 1720

SPEEDING UP COMPUTATION IN TWO-DIMENSIONAL ITERATIVE ARRAYS*

by

Peter Gerald Rose

May 1980

NSF-OCA-MCS80-01561-000048

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the National Science Foundation under Grant No. US NSF-MCS80-01561 and was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, May 1980.

ACKNOWLEDGMENT

I would like to thank my thesis advisor, Professor Daniel Gajski, for his patient support, for his helpful suggestions, and most of all for his willingness to let me pursue my own ideas wherever they led.

NOTATIONAL CONVENTIONS USED IN THIS THESIS

Symbol   Meaning
log      logarithm with base 2
=>       "implies"
<=>      "if and only if"
≡        "is defined to be"
⌊x⌋      greatest integer <= x

TABLE OF CONTENTS

1. OVERVIEW
   1.1. Background
   1.2. Method 1
   1.3. Proposed New Methods
2. METHOD 2
   2.1. Construction
   2.2. A Bound for Computation Time
   2.3. Actual Computation Time
   2.4. Required Hardware
   2.5. Fan-out and Method 2'
   2.6. Summary
3. METHOD 3: CONSTRUCTION
   3.1.
An Example
   3.2. The Algorithm
   3.3. Required Hardware
4. METHOD 3: APPLICATIONS
   4.1. The Augmented Parallel Counter
   4.2. The Partial Sorter
   4.3. Other Special Cases
   4.4. A Conjecture
   4.5. Additional Remarks
5. METHOD 3: COMPUTATION TIME
   5.1. Comparison with Method 1
   5.2. Further Theoretical Results
   5.3. Some Empirical Results
   5.4. Comparison with Method 2
6. CONCLUSIONS
   6.1. Methods 2 and 2'
   6.2. Method 3
   6.3. Summary
LIST OF REFERENCES

LIST OF TABLES

1. Propagation time (in unit delays) for (16,k) Method 2 network
2. Comparison between c̄n and cn
3. Attempts to apply Method 3
4. T3 as a function of k (for fixed n)
5. Method 3 parameter values
6. Comparison chart for (n,k) networks

LIST OF FIGURES

1. An (n,k) rectangular network
2. Single cell from rectangular network
3. A (4,3) rectangular (Method 1) network
4. First stage of (4,3) Method 2 network
5. Modules used in Method 2 network
6. Stage j of Method 2 network for n = 16
7. Stage j of Method 2' network for n = 16
8. Speed comparison of Methods 1, 2, and 2'
9. (8,3) Method 1 network
10. Stage 1 of (8,3) Method 3 network
11. Stages 1 and 2 of (8,3) Method 3 network
12. (8,3) Method 3 network
13. (16,4) Method 3 network
14. Half adder
15. Speed comparison between Method 3 and Method 1
16. Speed comparison for n = 16
17. Speed comparison for n = 64
18. Speed comparison for n = 256
19. Speed comparison for n = 1024

CHAPTER 1
OVERVIEW

1.1. Background

Two-dimensional arrays of switching circuits have been studied for some time, beginning notably with the work of Hennie [1]. Other early research in the area is reviewed in the survey article by Minnick [2]. Important contributions were later made by Kautz [3], Akers [4], and several others.
Most of the work cited above consists of theoretical investigations into the ability of two-dimensional logic networks to realize arbitrary combinational functions. Of more practical interest have been the many schemes proposed which use two-dimensional iterative arrays to implement specific computations. As examples, networks have been designed for performing multiplication, division, and square root extraction [5-7]. Because of the modularity inherent in array structures, such circuits possess great advantages in the modern world of LSI technology.

This thesis explores the idea of transforming a two-dimensional array into an equivalent one which performs the same computation but in less time, preferably using no more hardware than the original array. We could thus obtain a speed increase while preserving the modularity of cellular design. Ideally, we would like to have an algorithm which automatically performs this transformation for a large class of two-dimensional networks.

1.2. Method 1

We shall consider here a rectangular network of the kind shown in Figure 1.

[Figure 1. An (n,k) rectangular network.]

All modules, or "cells", in the network are identical and are interconnected by bundles of w wires (w >= 1). By S we shall denote the set of 2^w distinct states such a bundle can assume. However, we will normally think of a bundle as a single "line" carrying a signal which represents an element of S.

Figure 2 shows a single module. We shall assume that the propagation time to either output is exactly one "unit delay". The output values will be denoted as shown; we think of # and * as binary operations defined on the set S (that is, as mappings from the Cartesian product S x S into S). We are concerned mainly with the propagation delay of the network in Figure 1, which we shall henceforth call a Method 1 network. It is easy to see that once the inputs are applied it takes n + k - 1 unit delays before all the outputs are valid.
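The n + k - 1 figure is easy to check with a small timing simulation. The sketch below (Python; the function name is mine, not the thesis's) assigns time 0 to all applied inputs and lets each cell's two outputs become valid one unit delay after the cell's latest-arriving input:

```python
def method1_ready_times(n, k):
    """Times at which the outputs y_1..y_k of an (n,k) Method 1
    rectangular network become valid, in unit delays."""
    top = [0] * n                  # the a inputs are applied at time 0
    ready = []
    for _ in range(k):
        h = 0                      # each x input is also applied at time 0
        down = [0] * n
        for j in reversed(range(n)):     # the row computes right to left
            t = 1 + max(top[j], h)       # both cell outputs valid at t
            down[j] = h = t
        ready.append(h)            # leftmost # output of the row is y_i
        top = down                 # the * outputs feed the next row
    return ready
```

For a (16,3) network this gives [16, 17, 18]: output y_i settles at time n + i - 1, so the whole array is valid after n + k - 1 unit delays.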
We want to transform this array into a faster one which produces the same outputs.

1.3. Proposed New Methods

In Chapter 2 such a transformation is described. We shall call this new way of doing the computation Method 2, Method 1 referring to the original rectangular network. Method 2 can often yield a striking improvement in performance, and it has the advantage of being applicable to any array, regardless of the operations # and *. However, Method 2 is faster than Method 1 only when k is quite small compared to n; otherwise, it is actually slower. Method 2 also requires more hardware.

[Figure 2. Single cell from rectangular network.]

In the search for a better transformation, Method 3 was developed; it is introduced in Chapter 3. Most of this thesis is devoted to investigating the properties of Method 3, which is a superior technique in several ways. First of all, Method 3 is a purely geometric transformation of the rectangular array. That is, the Method 3 network uses not only the same type of module as Method 1, but also the same number of them. Only the interconnection pattern is different. Yet we shall show that for most values of k Method 3 is substantially faster than Method 1 and that in no case is it slower. Moreover, Method 3 always outperforms Method 2. Chapter 5 develops and discusses these speed comparisons.

Unfortunately, Method 3 will not work for all arrays, since in general it is not an output-preserving transformation. Only for certain instances of the cell operations # and * will the Method 1 network and the corresponding Method 3 network perform the same computation; this matter is discussed in Chapter 4. Perhaps future research will be able to modify Method 3 so as to extend its range of application. In any event, the many advantages of Method 3 give it excellent potential as a design algorithm for speeding up computations done by two-dimensional arrays.

CHAPTER 2
METHOD 2

2.1.
Construction

To help explain Method 2, we shall use as an example a (4,3) rectangular network, which Figure 3 shows in detail.

[Figure 3. A (4,3) rectangular (Method 1) network.]

Let us first consider the top row of cells. Viewed by itself, it is a one-dimensional iterative array, otherwise known as a (first-order) recurrence network. There exists a general method for speeding up such a computation; it involves using semigroups of functions to obtain a tree implementation of the network. This technique has been described recently by several authors, including Gajski [8] and Unger [9]. However, to simplify our discussion let us assume that the operation # is associative. In that case we can derive a tree implementation directly, for

    y1 = a11 # (a12 # (a13 # (a14 # x1)))

can be computed as

    a11 # ((a12 # a13) # (a14 # x1)).

In this way some parallelism can be introduced into an originally serial computation. For a detailed discussion of tree-height reduction by associativity, see pp. 101-102 of Kuck [10]. Anyway, Method 2 uses this technique to transform the first row of the rectangular array into the network shown in Figure 4.

[Figure 4. First stage of (4,3) Method 2 network.]

Figure 5 depicts the two modules used in this network; we assume each has propagation time of one unit delay. Observe that the cell in Figure 2 is simply these two modules combined into one box. With all of these modules note carefully the order of the inputs, for # and * are not necessarily commutative.

[Figure 6. Stage j of Method 2 network for n = 16.]

2.2. A Bound for Computation Time

Each stage of the Method 2 network has depth at most 1 + log n, so an upper bound on the total propagation time is k(1 + log n) unit delays, because there are k stages in all. We shall see later that the actual propagation time is somewhat less than this upper bound, although over the range of interest the difference is not very great.
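One balanced grouping of this kind can be sketched in a few lines (Python; the function name is mine, and the particular grouping differs harmlessly from Figure 4's under associativity). It folds the operand list pairwise, level by level, and reports the number of levels used, i.e., the delay in unit delays:

```python
def tree_fold(op, vals):
    """Evaluate an associative op over vals with a balanced tree.
    Returns (result, number of levels of cells used)."""
    levels = 0
    while len(vals) > 1:
        nxt = [op(vals[j], vals[j + 1])          # one cell per adjacent pair
               for j in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:                        # odd element passes through
            nxt.append(vals[-1])
        vals = nxt
        levels += 1
    return vals[0], levels
```

With five operands (a11, a12, a13, a14, x1) the serial recurrence needs 4 unit delays, while the tree needs only 3 = 1 + log 4. Because the adjacent order of operands is preserved, only associativity of # is required, not commutativity.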
Let T1 denote the computation time for an (n,k) network using Method 1 and T2 denote the time required using Method 2. We then have:

    T1 = n + k - 1
    T2 <= k(1 + log n) ≡ T̄2

For the moment let us assume that the upper bound T̄2 is the true Method 2 delay time. It is trivial to prove the following result:

    k < (n - 1)/log n  =>  T̄2 < T1
    k > (n - 1)/log n  =>  T̄2 > T1

We shall denote the crossover point (n - 1)/log n by c̄n. Note that c̄n is generally much less than n; for example, if n = 128, c̄n ≈ 18. Thus Method 2 is faster than Method 1 only when k is small relative to n; otherwise, Method 1 is faster.

If k < c̄n, then k << n, whence the speedup (as defined by Kuck [10]) of Method 2 over Method 1 is:

    T1/T̄2 = (n + k - 1)/(k(1 + log n)) ≈ (1/k)(n/log n)

When k = 1, which corresponds to the rectangular array having only one row, this becomes n/log n, the well-known speedup obtained when a one-dimensional iterative array is replaced by a binary tree implementation. Hence, one can view 1/k as an attenuation factor which arises in the two-dimensional case.

2.3. Actual Computation Time

As mentioned earlier, the true Method 2 propagation time T2 is in general somewhat smaller than T̄2 = k(1 + log n). To see why this is so, let us look again at the example shown in Figure 6. Although for the sake of regularity we have arranged the diagram to show 5 rows of modules, the rightmost 8 columns actually have depth less than 5. Thus (taking j = 1) the first stage of the Method 2 network produces output y1 after 5 unit delays, but a2,1, a2,2, ..., a2,15 are ready earlier. Hence, the second stage of the network can begin computing sooner than we might have expected, and many of its outputs are ready after far less than 10 unit delays. Continuing down through later stages, this speedup effect begins to snowball, aided by the fact that most of the columns (even on the left side) do not have modules in every row.
The effect drifts from right to left, until after a few stages the speedup starts to appear in the outputs y_i. This means that effectively the propagation time through each of the later stages is less than 5 unit delays. Hence, for the (16,k) network we find that T2 < T̄2 = 5k, at least for the larger values of k. Table 1 details the situation for this example.

Naturally, the preceding discussion applies regardless of the value of n. In all cases, the effective propagation time is 1 + log n only for the first several (in fact, log n) stages; thereafter, the stage delay decreases. Thus, keeping n fixed, the difference between T̄2 and T2 grows with increasing k. As Table 1 shows, this difference can become quite large as k approaches n.

     k   T̄2   T2   effective delay   T̄2 - T2
                    of kth stage
     1    5    5         5               0
     2   10   10         5               0
     3   15   15         5               0
     4   20   20         5               0
     5   25   24         4               1
     6   30   28         4               2
     7   35   32         4               3
     8   40   35         3               5
     9   45   38         3               7
    10   50   40         2              10
    11   55   42         2              13
    12   60   44         2              16
    13   65   45         1              20
    14   70   46         1              24
    15   75   47         1              28
    16   80   48         1              32

    Table 1. Propagation time (in unit delays) for (16,k) Method 2 network.

However, we are not really interested in the larger values of k if in those cases Method 2 is outperformed by Method 1, anyway. What we would like to know is, for each n, the largest value of k for which Method 2 is faster than Method 1. Earlier we calculated this crossover point to be c̄n = (n - 1)/log n. (Since c̄n is not in general an integer, the value of k we desire would be ⌊c̄n⌋.) But recall that this was computed on the basis of T̄2, not T2. Thus we now define cn to be the true crossover point between the two methods. That is, cn is the integer such that:

    k <= cn  =>  T2 < T1
    k > cn   =>  T2 >= T1

Since in general T2 <= T̄2, Method 2 should compare more favorably to Method 1 than we had thought. In particular, we would expect that cn >= ⌊c̄n⌋. But what is the magnitude of this improvement?
The answer was obtained from a computer simulation of Method 2 which determined T2 for values of n up through 1024 and all relevant values of k. From this were derived values of cn, which are displayed in Table 2. Note that over the range considered cn/c̄n < 1.2, an improvement of less than 20%. More significantly, (cn - ⌊c̄n⌋)/n < .02; that is, the difference between the true crossover and the estimated one is no more than 2% of n.

In addition, the computer study showed that for k <= cn (the range over which Method 2 would be used) the difference between T̄2 and T2 is not very great; the improvement is never more than 14%, and usually it is much less. T2 and T̄2 are especially close when k is very small, which is precisely when we would be most likely to use Method 2 (because in this range Method 2 outperforms Method 1 by the greatest amount). Later in this paper Figure 8 and Figures 16 through 19 provide graphical comparisons.

[Table 2. Comparison between c̄n and cn.]

2.5. Fan-out and Method 2'

[Figure 7. Stage j of Method 2' network for n = 16.]

Since every output of a Method 2' stage has depth exactly 1 + log n, it follows that the crossover point between Method 1 and Method 2' is simply c̄n and that for k < c̄n, T2' = T̄2 < T1. That is, in the case of Method 2' our simple upper bound formulas, time = k(1 + log n) and crossover = (n - 1)/log n, hold exactly! (Actually, T2' < k(1 + log n) when k is sufficiently large, but of course in that range we would never use Method 2' anyway, since it is slower than Method 1.)

Now recall from our study of Method 2 that cn, the crossover point between Methods 1 and 2, is slightly greater than c̄n. Remember also that for k <= cn, T2 is slightly less than T̄2.
Therefore, over the range of interest Method 2 outperforms Method 2' by a small amount. However, this difference in speed (plus the lower module count of Method 2) must be weighed against the considerable fan-out advantage of Method 2'.

2.6. Summary

As we stated earlier, an (n,k) Method 2 network is faster than the corresponding (n,k) Method 1 network only if k is small compared to n. Note from Table 2 that cn/n is generally somewhere between .1 and .2. That is, as k varies from 1 to n, only in the lower 10 to 20 percent of that range (depending on n) does Method 2 outperform Method 1. The situation is similar for Method 2', as can be seen from the tabulated values of c̄n.

Figure 8 compares the speeds of Methods 1, 2, and 2' in a general way.

[Figure 8. Speed comparison of Methods 1, 2, and 2'.]

Considering n to be fixed, we have sketched T1, T2, T̄2, and T2' as functions of k, which runs from 1 to n. (For purposes of presentation these curves have been drawn as continuous; of course, in reality k takes on only integral values.) Figures 16 through 19 in Chapter 5 present the same curves plotted from actual data for several values of n. In addition, Table 6 (in Chapter 6) displays side by side the major characteristics of Methods 1, 2, and 2'.

To summarize our results on Method 2 (and its variation, Method 2'), we may say that this technique has definite, though rather limited, advantages. Method 2 is a simple and quite general approach for achieving significant speedup in two-dimensional arrays. However, this speedup is obtained only when the relative dimensions of the array are in a certain narrow range. Thus in spite of the fact that they require more hardware, Method 2 networks are in most cases actually slower than the original (Method 1) rectangular arrays. It is this difficulty that motivated the development of Method 3.
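The delay formulas compared in this chapter are easy to tabulate. A small sketch (Python; the function names are mine, with c̄n written as c_bar):

```python
from math import log2

def t1(n, k):
    """Method 1 delay: n + k - 1 unit delays."""
    return n + k - 1

def t2_bar(n, k):
    """Upper bound on the Method 2 delay (exact for Method 2'):
    k(1 + log n), for n a power of 2."""
    return k * (1 + int(log2(n)))

def c_bar(n):
    """Estimated crossover between Methods 1 and 2: (n - 1)/log n."""
    return (n - 1) / log2(n)
```

For n = 128, c_bar(128) = 127/7, about 18.1, matching the value quoted in Chapter 2: the Method 2 bound beats Method 1 for k <= 18 and loses for k >= 19.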
CHAPTER 3
METHOD 3: CONSTRUCTION

Like Method 2, Method 3 transforms an (n,k) rectangular (Method 1) network into a circuit consisting of k stages built one upon the other, each producing a single network output. In both cases the ith stage corresponds to the ith row of the rectangular array. However, in contrast to Method 2, the modules used in Method 3 are identical to those of the original network.

3.1. An Example

To illustrate the new technique, we shall consider an (8,3) Method 1 network; this is shown in Figure 9.

[Figure 9. (8,3) Method 1 network.]

Figure 10 depicts the first stage of the corresponding Method 3 network. The modules in both diagrams are the same; however, Figure 10 does not show their * outputs because only the # outputs are used within stage 1. What we have done in Figure 10 is to simply fan in as quickly as possible the inputs a1 through a8 plus the input x1. Output z1 is thus the #-product of these nine inputs. This is the same as output y1 in Figure 9, provided that # is associative.

Now note that, except for x2, each input to the second row of the Method 1 network is a *-product of signals occurring within the first row. Output y2 is then the #-product of all of these inputs. This suggests that we construct stage 2 of the Method 3 network in the following way. We shall simply take the * outputs from all the stage 1 modules (along with input x2) and form their #-product as fast as we can. This is shown in Figure 11, where dashed lines indicate the second stage. The * outputs of the stage 2 modules are not shown, since they will not be needed until stage 3 is constructed.
[Figure 10. Stage 1 of (8,3) Method 3 network.]

[Figure 11. Stages 1 and 2 of (8,3) Method 3 network.]

[Figure 12. (8,3) Method 3 network.]

In general, the outputs of the network just constructed may depend on exactly which way we connected the various cell inputs in constructing each stage. However, our interest lies in those cases for which the outputs are the same regardless of the order in which the products are formed, these outputs being equal to the corresponding Method 1 outputs. This is the "output-preserving" property mentioned earlier.

As examples of our terminology, observe in Figure 12 that stage 3 has top level 3, bottom level 8, and height 6. Also note that the output of the leftmost cell on level 5 is reserved until level 8. However, the most important thing to observe here is that the total computation time for the Method 3 network is 8 unit delays, as opposed to the 10 unit delays Method 1 requires (Method 2 takes 12).

Note that Figure 11 can be viewed as either the first two stages of an (8,3) network or as a complete (8,2) network. Likewise, the array in Figure 12 can be made into an (8,4) Method 3 network by simply adding one extra stage. The * outputs of the 3-cells would be inputs to that stage, which would be constructed in the same way as the previous stages.

As a further illustration of our procedure, Figure 13 depicts a (16,4) Method 3 network. Solid lines are used for stages 1 and 3, dashed lines for stages 2 and 4. The network propagation time is 11 unit delays; this compares to 19 for Method 1 and 20 for Method 2. Chapter 5 is devoted to a detailed discussion of the speedup provided by Method 3.

3.2. The Algorithm

We now present a formal procedure for designing any (n,k) Method 3 array (1 <= k <= n, n a power of 2).
We shall make use of the notation and terminology introduced in describing our example.

[Figure 13. (16,4) Method 3 network.]

The algorithm consists of two parts: Step A is performed once, then Step B is iterated k - 1 times (for i from 2 to k). The statement of the algorithm is followed by several remarks which will be useful later in the paper.

Algorithm for Constructing an (n,k) Method 3 Network

STEP A. To construct stage 1:

1. Recalling that n is a power of 2, fan in the inputs a1, a2, ..., an with a full binary tree of cells, using only the # outputs. (The root cell of this tree will be on level log n. See Figure 10 for an example where n = 8.)

2. Install an additional cell on level log n + 1, having as inputs the network input x1 and the # output from the bottom cell of the above tree. The # output of this added cell is the network output z1.

STEP B. To construct stage i, once stage i - 1 has been constructed:

1. "Reserve" the input x_i.

2. Let L be the number of the top level of stage i - 1.

3. Let S be the set of all * outputs from (i - 1)-cells on level L and all # outputs from i-cells on level L. Let C denote the cardinality of S.

4. If C is odd, then do the following: If there is a reserved element, add it to S; otherwise, remove an element from S and reserve it. C is now even or zero.

5. If C is not zero, then do the following:
   a. Install C/2 i-cells on level L + 1, connecting their inputs to the elements of S (in any order whatever).
   b. Increment L by one.
   c. Go to step 3.

6. Now C = 0 and we are done. The reserved element is the network output z_i.

REMARKS (valid for all i)

Remark 1. The top level of stage i is level i. (This is obviously true for i = 1, and an easy induction verifies it for i > 1.)

Remark 2. Note that stage i skips no levels; that is, it has at least one cell on every level between its top level and its bottom level.
Remark 3. Clearly, the bottom level of stage i is below the bottom level of stage i - 1 and contains exactly one i-cell (whose # output is z_i).

Remark 4. Hence, the total number of levels in the network is equal to the bottom level number of stage k. This is the propagation time through the network (in unit delays).

Remark 5. Since each i-cell has two input lines and one # output line, stage i forms a regular binary tree. The branches of this tree are the input lines and # output lines of all the i-cells, its interior nodes are the i-cells themselves, and (if i > 1) its leaves are the (i - 1)-cells together with the input x_i.

3.3. Required Hardware

It has been mentioned, of course, that the cells of Method 3 are identical to those used by Method 1. We now proceed to show that corresponding Method 1 and Method 3 networks contain exactly the same number of cells, as well.

Theorem 1. Each stage of an (n,k) Method 3 network contains exactly n cells.

Proof. By induction. We first show the theorem is true for stage 1. In Step A1 of the Method 3 algorithm, we construct a full binary tree fanning in n inputs, where n is a power of 2. This, of course, requires n - 1 cells. In Step A2 we complete stage 1 by installing one additional cell. Thus stage 1 has exactly n cells.

Now suppose the theorem is true for stage i - 1. Let us consider stage i. By Remark 5, the i-cells are the interior nodes of a regular binary tree whose leaves are the (i - 1)-cells plus input x_i. Since by hypothesis there are n (i - 1)-cells, this tree has a total of n + 1 leaves. Thus, by a well-known result of graph theory, the tree must have n interior nodes. That is, there are exactly n i-cells, whence
Moreover, it is evident from the algorithm that in a Method 3 network each cell output is connected to (at most) one cell input. This is also the case in a Method 1 array. Thus neither method has a fan-out problem, which again is different from Method 2. We therefore conclude that the Method 3 transformation of a rectangular network amounts to nothing more than a rather limited rearrangement of the interconnections among the nk cells. That this usually results in a significant increase in speed (see Chapter 5) is indeed remarkable. 37 CHAPTER 4 METHOD 3: APPLICATIONS Having just boasted the merits of Method 3, we must again remind the reader that unfortunately it cannot always be used, for often the transformation is not output-preserving. However, in this chapter we shall present several cases for which Method 3 does indeed work. These are intended as illustrative examples only; our array methods are not necessarily the most practical or efficient circuits available to perform the given computations. We shall also look at some cases where Method 3 does not work and then speculate as to which properties of the cell operations are relevant to output-preservation. 4.1. The Augmented Parallel Counter In this example the interconnections of the Method 1 array (Figure 1) are single bit lines, and the cell operations (Figure 2) are as follows: # is EXCLUSIVE OR (XOR) and * is logical AND. Each cell is thus a half adder (see Figure 14). To understand what a rectangular network of half adders does, let us look, for instance, at Figure 9. Suppose first that x = x„ = x~ = 0. Then y 1 is just the EXCLUSIVE OR sum of the inputs a.., a« , . . ., a g . The carries from this summation are then XOR'ed together to form y~ , and finally y„ is the EXCLUSIVE OR of the carries from the y ? summation. Thus y~y~y, is simply the ordinary sum, in weighted binary form, of the eight inputs a.. 
In other words, the three-place binary numeral y3y2y1 represents the total number of a inputs which are in the logic ONE state.

[Figure 14. Half adder.]

Hence, the network acts as an eight-input parallel counter. However, the three-bit output can represent only 8 of the 9 possible sums. That is, y3y2y1 is actually the sum of the ai's modulo 8. Moreover, dropping our assumption that the xi's are zero, we see that the array output in fact represents the modulo 8 sum of the weighted binary number x3x2x1 and the number of active (logic ONE) a inputs.

Thus an (n,k) Method 1 network of half adders is a kind of generalized parallel counter. Its k-bit output represents, in binary coded form and modulo 2^k, the total number of the n top inputs which are active, plus the weighted binary number formed by the k right-hand inputs. We shall call such a circuit an "augmented parallel counter". If k = log n + 1 and the xi are all zero, then the array becomes an ordinary n-input parallel counter; it simply computes the number of a's which are ONE.

Now to see what happens when the Method 3 transformation is applied to an augmented parallel counter, let us return to our previous example, the (8,3) network (Figure 9). Recalling that # is XOR, we see from Figure 10 that z1, the first Method 3 output, is the XOR of x1 and all the ai's. That is, z1 is the least significant bit of the sum of the eight ai's and the binary number x3x2x1. Since * is AND, Figure 11 shows that z2 is the XOR of x2 and all the carries from the z1 summation. Hence, z2 is the next-to-least significant bit of the sum of x3x2x1 and all the ai's. Similarly, z3 is the next most significant bit of this sum (see Figure 12). Thus we see that z3z2z1 = y3y2y1; that is, the Method 3 network performs exactly the same computation as does the original Method 1 array.

Clearly, this argument can be generalized to any (n,k) augmented
The Method 1 network and the Method 3 network will always compute the same sum, column by column. The two methods are dissimilar only in that for any one output bit they produce the sum in a different order, the Method 3 arrangement being faster. We therefore conclude that when # is XOR and * is AND, the Method 3 transformation is output-preserving. 4.2. The Partial Sorter Unlike the previous example, in which the cell operations were Boolean functions, many of the other cases we shall look at feature cell functions which operate on integer values. We will then assume that in our arrays each interconnecting line is capable of carrying a representa- tion of any positive integer. Naturally, there would in practice be a limit to the size of the integers, but this restriction has no bearing on the relevance of our examples. For our next special case, we take the cell operation # to be MAX and the operation * to be MIN. Of course, MAX returns the larger of its two integer operands, while MIN returns the smaller. To see what a rectangular array of such cells does, let us look once again at Figure 9. The network inputs are now positive integers. Clearly, y 1 will be the largest of the 9 numbers L, a., a., . . . , a ft . The other 8 numbers are passed down to the second row of the array. They are then compared with x„ , and the largest of these 9 integers becomes y» . A similar operation determines y„. Hence, we see that the network outputs will simply be the 3 largest numbers among the 11 numbers which were inputs to the array. It is also easy to see that if the x.'s were 41 originally in order, with x- largest and x_ smallest, then the y.'s will likewise be in order. Of course, similar reasoning holds for any (n,k) Method 1 network composed of such cells. The output numbers are always the k largest of the n + k input numbers. Moreover, if the right-hand inputs are in increasing order from the bottom to the top of the array, then the same will be true of the outputs. 
Thus the network is able to take an ordered list of length k and an unordered list of length n and merge them into an ordered list of length k composed of the largest elements in the original two lists. Hence, it is with some justification that we call such a network a "partial sorter".

It is quite easy to convince oneself that in this case, too, Method 3 is output-preserving. Look, for example, at Figure 12 and recall that for each i the # (i.e., MAX) outputs of the i-cells are kept within stage i to eventually produce output z_i, while the * (i.e., MIN) outputs are passed on to stage i + 1. Hence, the Method 3 network operates in exactly the same way as does the Method 1 network, producing the largest k numbers at its outputs. Only the order of comparison is different.

4.3. Other Special Cases

Having presented two examples in detail, we shall now simply give a list of a number of other particular cases that were investigated. They employ as cell operations a wide variety of two-variable functions, some Boolean and some integral. Several of these (XOR, AND, MAX, and MIN) we have used before; another one of them is just the usual logical OR. The rest require a brief explanation.

There are three new Boolean functions. COIN, short for "coincidence", returns a 1 if and only if its two arguments are equal. IMP, which stands for "implies", has value 1 for all cases except 1 IMP 0. Finally, FIR, short for "first", simply returns its first argument; i.e., x FIR y = x.

We shall also use a number of operations defined on the set of positive integers. The value of GCD is the greatest common divisor of its two operands. Similarly, LCM is the least common multiple. The function ADD is just ordinary addition, while MULT performs ordinary multiplication. Lastly, AVG returns the average of its two arguments; strictly speaking, this must be defined on the set of rationals, since the average of two integers is not always an integer.
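For concreteness, the less familiar of these operations can be written out as code (a sketch in Python; the lowercase names mirror the text's operation names, and GCD is taken from the standard library):

```python
from math import gcd    # GCD: greatest common divisor

def lcm(a, b):
    """LCM: least common multiple of two positive integers."""
    return a * b // gcd(a, b)

def coin(a, b):
    """COIN ("coincidence"): 1 if and only if the arguments are equal."""
    return 1 if a == b else 0

def imp(a, b):
    """IMP ("implies"): 1 in all cases except 1 IMP 0."""
    return 0 if (a, b) == (1, 0) else 1

def fir(a, b):
    """FIR ("first"): x FIR y = x."""
    return a

def avg(a, b):
    """AVG: the average; rational-valued in general."""
    return (a + b) / 2
```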
We combined these functions into various #/* pairs and determined for which of these Method 3 is output-preserving. The results of this effort are shown in Table 3. For the moment ignore the column marked "conjecture properties violated"; this will be explained in the next section.

Two of the sixteen cases have already been described: example no. 1 is the augmented parallel counter and example no. 4 is the partial sorter. Note that we list four other cases for which Method 3 works. For these, the proofs of output-preservation will be omitted. As for those instances where Method 3 fails, we merely found simple counterexamples by plugging in specific inputs and computing the outputs by hand. In fact, in all cases (4,3) networks sufficed to demonstrate the lack of output-preservation.

Operand types: B = Boolean, I = (positive) integer, R = (positive) rational.

No.  #     *     type  output-preserving?  conjecture properties violated
 1   XOR   AND    B    yes                 none
 2   AND   OR     B    yes                 none
 3   COIN  OR     B    yes                 none
 4   MAX   MIN    I    yes                 none
 5   GCD   LCM    I    yes                 none
 6   GCD   GCD    I    yes                 none
 7   AND   XOR    B    no                  not distributive, XOR not idempotent
 8   COIN  AND    B    no                  not distributive
 9   IMP   OR     B    no                  IMP not associative, not commutative
10   OR    FIR    B    no                  FIR not commutative
11   ADD   MULT   I    no                  MULT not idempotent
12   ADD   MIN    I    no                  not distributive
13   MAX   ADD    I    no                  ADD not idempotent
14   ADD   ADD    I    no                  not distributive, ADD not idempotent
15   MULT  GCD    I    no                  not distributive
16   MAX   AVG    R    no                  AVG not associative

Table 3. Attempts to apply Method 3.

4.4. A Conjecture

We say that the cell operation * "distributes over" the cell operation # if a * (b # c) = (a * b) # (a * c) for all a, b, c in the set S of operands. (If * is not commutative, we also require that (b # c) * a = (b * a) # (c * a).) To make a less familiar definition, we shall say that the operation * is "idempotent" if a * a = a for all a in S.
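The defining properties just introduced can be checked mechanically by exhaustion over a small operand set. The sketch below is our own (function names and the sampled integer domain are illustrative, not from the thesis); note that a check over a finite sample can refute a property but, for the integer operations, cannot prove it in general.

```python
from math import gcd

def associative(f, S):
    return all(f(f(a, b), c) == f(a, f(b, c)) for a in S for b in S for c in S)

def commutative(f, S):
    return all(f(a, b) == f(b, a) for a in S for b in S)

def idempotent(f, S):
    return all(f(a, a) == a for a in S)

def distributes_over(star, sharp, S):
    # a * (b # c) == (a * b) # (a * c); both operations here are commutative.
    return all(star(a, sharp(b, c)) == sharp(star(a, b), star(a, c))
               for a in S for b in S for c in S)

def conjecture_holds(sharp, star, S):
    """True iff the #/* pair satisfies conditions a, b, c of the conjecture (over S)."""
    return (associative(sharp, S) and commutative(sharp, S)
            and associative(star, S) and commutative(star, S)
            and idempotent(star, S) and distributes_over(star, sharp, S))

B = [0, 1]
I = list(range(1, 13))   # a small sample of the positive integers

print(conjecture_holds(lambda a, b: a ^ b, lambda a, b: a & b, B))          # XOR/AND  -> True
print(conjecture_holds(max, min, I))                                         # MAX/MIN  -> True
print(conjecture_holds(gcd, lambda a, b: a * b // gcd(a, b), I))             # GCD/LCM  -> True
print(conjecture_holds(lambda a, b: a + b, lambda a, b: a * b, I))           # ADD/MULT -> False
```

The last pair fails because MULT is not idempotent, in agreement with row 11 of Table 3.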
A large part of our research has been an attempt to characterize those rectangular arrays to which Method 3 can be successfully applied. That is, we have tried to find sets of conditions necessary and/or sufficient for the Method 3 transformation to be output-preserving. Experience with many examples as well as consideration of several theoretical aspects have led us to propose the following:

CONJECTURE. Method 3 is output-preserving if and only if the cell operations # and * have the following properties:
a. # is associative and commutative
b. * is associative, commutative, and idempotent
c. * distributes over #

Although this statement has stubbornly resisted all attempts to prove it, no counterexamples have yet been found. The rightmost column in Table 3 shows for each case those properties listed in the conjecture which are not satisfied by the given pair of cell operations. Any property not mentioned was proved to hold. The examples marked with "none" are those which satisfy all the conditions of the conjecture. Note that these are precisely the cases for which Method 3 is output-preserving. The examples where Method 3 fails are those which violate one or more of the conjecture properties. Hence, Table 3 is consistent with our belief that these properties are both necessary and sufficient.

Since the Method 3 algorithm pays no attention to the order in which inputs are combined, it is not surprising that in order for the outputs to be the same as in Method 1, both cell operations must be associative and commutative. (In fact, this has been proved in the case of the # operation.) However, the need for distributivity and, especially, for idempotence is much less obvious.

Note that Table 3 concerns arrays of arbitrary dimensions in which the cell operations are specified. While trying to prove the necessity of the conjecture conditions, we investigated networks with arbitrary cell functions but where n and k were small enough to handle conveniently.
In manual computations with these networks, we needed to use all of the conjecture properties in order to get the outputs of Methods 1 and 3 to be equal. Nevertheless, actually proving necessity seemed impossible. Perhaps one can somehow construct rather artificial # and * operations which violate the conjecture conditions but for which Method 3 works. However, judging from our research it appears doubtful that such a case would arise naturally, that is to say, when using any standard sort of functions for cell operations.

The situation is similar with regard to sufficiency. When working with networks of feasibly small dimensions, we found the conjecture properties sufficient to obtain output equality. A general proof, though, has remained elusive. However, as an indication that the conjecture conditions could well be sufficient, let us point out that they form a more powerful set of postulates than may be evident at first glance. For example, using only those properties we can derive the following formula (for all a, b in S):

a # b = (a # b) * (a # b)                        (idempotence of *)
      = [(a # b) * a] # [(a # b) * b]            (distributivity)
      = (a * a) # (a * b) # (b * a) # (b * b)    (distributivity)
      = a # (a * b) # (a * b) # b                (idempotence and commutativity of *)

that is,

a # b = (a # b) # (a * b) # (a * b)

Moreover, letting a = b and once again using idempotence, we obtain:

a # a = a # a # a # a                            (1)

Now, it may perhaps be possible to invent a pair of functions satisfying the conjecture conditions but for which Method 3 is not output-preserving; however, once again we consider it very unlikely that one would run into such a counterexample by accident. For one thing, any system which has all the conjecture properties must also satisfy (1) above.
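As a quick sanity check of our own (not part of the thesis), the derived formula can be confirmed exhaustively for two of the conjecture-satisfying systems of Table 3:

```python
def check_identity(sharp, star, S):
    """Verify a # b == (a # b) # (a * b) # (a * b) over all of S x S."""
    return all(
        sharp(a, b) == sharp(sharp(sharp(a, b), star(a, b)), star(a, b))
        for a in S for b in S
    )

# XOR/AND over the Booleans, and MAX/MIN over a sample of positive integers.
print(check_identity(lambda a, b: a ^ b, lambda a, b: a & b, [0, 1]))  # True
print(check_identity(max, min, range(1, 20)))                          # True
```

For XOR/AND the identity holds because XORing twice with a AND b cancels; for MAX/MIN it holds because combining max(a,b) again with min(a,b) under MAX changes nothing.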
Hence, quite probably it possesses as well one of these two common properties (for all a in S):

Property 1: a # a = 0, where 0 is an identity element for the # operation
Property 2: a # a = a (i.e., # is idempotent)

Indeed, it is easy to check that one of these holds for every example in Table 3 which satisfies all the conjecture conditions. (Obviously, both properties cannot hold at once, unless 0 is the only element in S.)

Now the point is that either of these two properties, when added to the conjecture conditions, makes the two-operation system even richer in structure, allowing equations with # and * to be greatly simplified. This seems to enhance the likelihood that corresponding Method 1 and Method 3 outputs are indeed equal.

Taking all of this into account, we come to the following conclusions. In the first place, our conjecture is very plausible. Moreover, even if it is not literally true, the conjecture appears at any rate to be a highly reliable test for determining to which arrays Method 3 can be applied. Because the conjecture properties are easy to check, this provides a quite practical method for screening prospective function pairs. Of course, to be completely safe, one must supply for each individual case an independent proof or disproof of the output-preservation property. Thus we hope that future research will lead to a proof of the conjecture, a proof of some modification of it, or at least some enlightening counterexamples.

4.5. Additional Remarks

Note that all of the conjecture properties hold in any Boolean ring as well as in any Boolean algebra. (For definitions and an excellent discussion of these two types of systems, see the book by Burton [14].) Thus we would expect Method 3 to be output-preserving for both kinds of systems. This is illustrated by the first two examples in Table 3: XOR/AND is a Boolean ring, while AND/OR is a Boolean algebra.
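Which of Properties 1 and 2 holds for each conjecture-satisfying # operation in Table 3 can itself be checked mechanically. The following sketch is ours (the operand samples are illustrative); it searches the sample for an identity element and then tests each property.

```python
from math import gcd

def identity_of(sharp, S):
    """Return an identity element for # in S, or None if the sample has none."""
    for e in S:
        if all(sharp(a, e) == a and sharp(e, a) == a for a in S):
            return e
    return None

def classify(sharp, S):
    """Report whether # satisfies Property 1 (a # a = identity) or Property 2 (a # a = a)."""
    e = identity_of(sharp, S)
    if e is not None and all(sharp(a, a) == e for a in S):
        return "Property 1"
    if all(sharp(a, a) == a for a in S):
        return "Property 2"
    return "neither"

B = [0, 1]
I = list(range(1, 13))

print(classify(lambda a, b: a ^ b, B))        # XOR  -> Property 1 (identity 0)
print(classify(lambda a, b: a & b, B))        # AND  -> Property 2
print(classify(lambda a, b: int(a == b), B))  # COIN -> Property 1 (identity 1)
print(classify(max, I))                       # MAX  -> Property 2
print(classify(gcd, I))                       # GCD  -> Property 2
```

In agreement with the remark above, every "yes" row of Table 3 lands in one of the two classes; COIN is the interesting case, since its identity element is 1 rather than 0.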
On the other hand, MAX/MIN, for which Method 3 also works, does not belong to either class of systems. Finally, it is interesting to note that Boolean rings satisfy Property 1, whereas Boolean algebras satisfy Property 2.

If the conjecture is true, then the range of applications for Method 3 may be rather limited, for it seems that the conjecture conditions are really quite restrictive. For example, among the 16 two-variable switching functions, only AND and OR are both commutative and idempotent. Moreover, on the domain of positive integers, the only operations we could readily find satisfying all three conditions on the * function were MAX, MIN, LCM, and GCD. Note that these four functions are essentially the same! (For instance, to find the least common multiple of two numbers, we simply choose the maximum exponent for each prime factor; GCD and MIN are likewise related.) Clearly, idempotence is the most difficult condition to meet. The other conjecture properties hold in any commutative ring, a very common type of system. Hence, to find applications for Method 3 we might be tempted to search among ordinary commutative rings, hoping that the conjecture's requirement for idempotence is not absolute. Unfortunately, the following proposition has been proved (we omit the argument, which is very tedious):

If the cell operations # and * form a commutative ring (with identity), where # is addition and * is multiplication, and if for this system Method 3 is output-preserving, then the operation * is idempotent; that is, the system is actually a Boolean ring.

This result is quite significant in view of the fact that there are not, so to speak, many Boolean rings around. Indeed, according to the famous Stone Representation Theorem, every Boolean ring is isomorphic to a ring of sets! (In a ring of sets, addition is the symmetric difference operation and multiplication is set intersection. See Burton [14] for details.)
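The kinship between LCM/GCD and MAX/MIN noted in this section can be made concrete: taking the maximum (respectively minimum) of prime exponents reproduces LCM (respectively GCD). A small illustration of our own, using a naive factorization:

```python
from math import gcd

def factor(n):
    """Naive prime factorization: return {prime: exponent}."""
    f, p = {}, 2
    while p * p <= n:
        while n % p == 0:
            f[p] = f.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        f[n] = f.get(n, 0) + 1
    return f

def combine(a, b, pick):
    """Rebuild a number by applying `pick` (max or min) to each prime's exponent."""
    fa, fb = factor(a), factor(b)
    result = 1
    for p in set(fa) | set(fb):
        result *= p ** pick(fa.get(p, 0), fb.get(p, 0))
    return result

a, b = 360, 84                 # 360 = 2^3 * 3^2 * 5,  84 = 2^2 * 3 * 7
print(combine(a, b, max))      # LCM: 2^3 * 3^2 * 5 * 7 = 2520
print(combine(a, b, min))      # GCD: 2^2 * 3 = 12
```

Since LCM and GCD are just MAX and MIN applied exponent-wise, it is unsurprising that both pairs satisfy the conjecture conditions.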
On the other hand, in order for Method 3 to work it is not necessary that the #/* system be a ring at all; MAX/MIN is a case in point. Hence, the true extent to which Method 3 can be applied is still an open question.

CHAPTER 5

METHOD 3: COMPUTATION TIME

We come now to the crucial question: just how fast does Method 3 perform? Let T_3 be the propagation time in unit delays for an (n,k) Method 3 network. Usually we shall consider n fixed and observe how T_3 varies as k runs from 1 to n. Unfortunately, an exact formula for T_3 in terms of n and k remains to be discovered. However, a combination of theoretical and empirical research has revealed a great deal about the behavior of the function T_3, including simple and accurate approximations valid at least for the range n <= 1024. Moreover, we shall show that T_3 is in general much smaller than both T_1 and T_2, the propagation delays for Methods 1 and 2, respectively.

5.1. Comparison with Method 1

Consider an (n,k) Method 3 network and let h_i denote the height of stage i (as defined in Section 3.1). Note that by Step A of the Method 3 algorithm (Section 3.2), h_1 = log n + 1. Now consider stage k, the final stage in the network. By Remark 1 (Section 3.2), there are k - 1 levels above stage k. Hence, by Remark 4, the total number of levels in the network is k - 1 + h_k. We thus have:

Remark 6: T_3 = h_k + k - 1

Hence, surprisingly, the computation time of the Method 3 array depends solely on the height of its last stage. Unfortunately, it turns out that h_k is in general difficult to determine. Much can be gained, however, by studying the properties of the sequence h_1, h_2, ..., h_k. In fact, we shall now present a few lemmas on the subject, which lead ultimately to a theorem comparing the speeds of Methods 1 and 3.

Lemma 1: h_i <= n (for all i, 1 <= i <= k)

Proof: By Theorem 1 (Section 3.3), stage i contains n cells. Hence, noting Remark 2 (Section 3.2), it is obvious that the height of stage i can be no more than n.
QED

Lemma 2: h_i <= h_{i+1} (1 <= i <= k - 1)

Proof: By Remark 1, stage i + 1 begins one level below the top level of stage i; by Remark 3, it ends at least one level below the bottom level of stage i. Thus the height of stage i + 1 must be at least as great as the height of stage i. QED

Lemma 3: h_i = h_{i+1} <=> h_i = n (1 <= i <= k - 1)

Proof: If h_i = n, then by Lemma 2, h_{i+1} >= n. On the other hand, by Lemma 1, h_{i+1} <= n. Hence, h_{i+1} = n = h_i.

Conversely, suppose h_i = h_{i+1}. Then since (by Remark 1) the top level of stage i + 1 is one level below the top level of stage i, it follows that the bottom level of stage i + 1 is one level below the bottom level of stage i. Let L be the bottom level of stage i + 1. By Remark 3, level L contains exactly one (i + 1)-cell; call it C_L. Since level L - 1 is the bottom level of stage i, it contains an i-cell, say c_{L-1}; moreover (by Remark 2), it contains at least one (i + 1)-cell, say C_{L-1}. Now according to the Method 3 algorithm, the * output of c_{L-1} and the # output of C_{L-1} must each be connected to some (i + 1)-cell below level L - 1. But the only (i + 1)-cell below level L - 1 is C_L. Hence, one input of C_L is attached to c_{L-1} and the other to C_{L-1}. This means that there can be no (i + 1)-cell besides C_{L-1} on level L - 1, for there are no available (i + 1)-cells below that level to accept their # outputs.

Now consider level L - 2. It contains at least one i-cell c_{L-2} and at least one (i + 1)-cell C_{L-2}. Since C_{L-1} is the only available (i + 1)-cell below level L - 2, its two inputs must be connected to c_{L-2} and C_{L-2}. Hence, there can be no (i + 1)-cell besides C_{L-2} on level L - 2.

Clearly, this argument can be repeated for levels L - 3, L - 4, etc., all the way to the top level of stage i + 1. The result is that every level of stage i + 1 has only one (i + 1)-cell. Since (by Theorem 1) the stage has n cells, its height must be n; that is, h_{i+1} = n. But then since h_i = h_{i+1}, we get h_i = n.
QED

Recalling that h_1 = log n + 1 and taking into account Lemmas 1 through 3, we see that the sequence {h_i} starts at log n + 1, strictly increases until it reaches n, and remains at n thereafter. Let s_n be the smallest value of i for which h_i = n. We can then list the basic properties of the sequence {h_i} as follows:

Property A: h_1 = log n + 1
Property B: {h_i} strictly increases for 1 <= i <= s_n
Property C: h_i = n for i >= s_n

Note that the height h_i of a stage is a kind of inverse measure of its parallelism. Since (by Theorem 1) each stage has exactly n cells, the greater the height of a stage, the less parallel is its computation. Stage 1, with height log n + 1, is as parallel as it can be. By Property B, later stages gradually become less parallel. Beyond some point (s_n, by Property C), each stage has height n, i.e., its computation is completely serial.

We have now assembled enough information to easily prove our theorem. Recall that for an (n,k) Method 1 network, the computation time T_1 = n + k - 1. T_3 is the computation time of the corresponding Method 3 network.

Theorem 2:
a. for k < s_n, T_3 < T_1
b. for k >= s_n, T_3 = T_1

Proof: By Remark 6, T_3 = h_k + k - 1. By Property C, for k >= s_n, h_k = n, whence T_3 = n + k - 1 = T_1, which proves b. By Properties B and C, for k < s_n, h_k < n, whence T_3 < n + k - 1 = T_1, which proves a. QED

5.2. Further Theoretical Results

Lemma 4: log n + i <= h_i <= n - s_n + i (1 <= i <= s_n)

Proof: By Property A, h_1 = log n + 1, and by Property B the sequence strictly increases up to s_n, so

h_2 >= log n + 2
h_3 >= log n + 3
...
h_{s_n} >= log n + s_n

This proves the first inequality. By Property C, h_{s_n} = n. Thus by Property B we must have

h_{s_n - 1} <= n - 1
h_{s_n - 2} <= n - 2
...
h_1 = h_{s_n - (s_n - 1)} <= n - (s_n - 1)

Hence, h_{s_n - j} <= n - j for 0 <= j <= s_n - 1. To change subscripts, let i = s_n - j. Then 1 <= i <= s_n, and our result becomes h_i <= n - (s_n - i) = n - s_n + i. This proves the second inequality. QED

Lemma 5: Suppose that for some l (1 <= l <= s_n), h_l = n - s_n + l. Then h_i = n - s_n + i for all i with l <= i <= s_n.

Proof: Since by Property B the sequence strictly increases up to s_n,

h_{l+1} >= n - s_n + l + 1
h_{l+2} >= n - s_n + l + 2
...
h_{s_n} >= n - s_n + s_n = n

That is, for all i with l <= i <= s_n, h_i >= n - s_n + i. But by Lemma 4 we also have h_i <= n - s_n + i.
QED

From Lemma 4, we note in particular that

log n + i <= n - s_n + i
log n <= n - s_n
s_n <= n - log n

If s_n were equal or very close to this upper bound, then since n - log n is in general close to n, s_n would also be close to n. By Theorem 2 (Section 5.1), this would mean that Method 3 outperforms Method 1 for most k between 1 and n. Thus in some sense n - log n is an ideal value for s_n. For that reason we introduce the following special notation:

s*_n = n - log n
d_n = s*_n - s_n

Our hope is that d_n is small (relative to n). In preparation for the upcoming theorems, we must make another definition:

T*_3 = log n + 2k - 1

Moreover, we note the following equality:

Remark 7: T*_3 + d_n = n - s_n + 2k - 1

This is a direct consequence of the above definitions and will be used often in the work that follows.

Theorem 3: For 1 <= k <= s_n, T*_3 <= T_3 <= T*_3 + d_n

Proof: By Remark 6, T_3 = h_k + k - 1, and by Lemma 4, log n + k <= h_k <= n - s_n + k. Hence log n + 2k - 1 <= T_3 <= n - s_n + 2k - 1, which by Remark 7 is the desired result. QED

For k > s_n we know by Theorem 2b that T_3 = T_1 = n + k - 1. Moreover, d_n small means that s*_n = n - log n is an accurate estimate for s_n. Hence, d_n is a crucial parameter. If it is small compared to n, then to a good approximation we know the values of T_3 for all k, plus we know (as mentioned earlier) that T_3 < T_1 for all k except those very close to n. We postpone to the next section our evidence that d_n is indeed quite close to zero.

We can improve our results even more by introducing another parameter. Note that when k = s_n, we have:

T_3 = T_1 (by Theorem 2b)
    = n + k - 1
    = n - s_n + 2k - 1 (since k = s_n)
    = T*_3 + d_n (by Remark 7)

Thus we can define e_n to be the smallest value of k for which T_3 = T*_3 + d_n. Obviously, e_n <= s_n.

Theorem 4: For e_n <= k <= s_n, T_3 = T*_3 + d_n

Proof: Using Remarks 6 and 7, we find that:

T_3 = T*_3 + d_n <=> h_k + k - 1 = n - s_n + 2k - 1
                 <=> h_k = n - s_n + k

Hence, by the definition of e_n, h_{e_n} = n - s_n + e_n. But then according to Lemma 5, for e_n <= k <= s_n we have h_k = n - s_n + k, whence T_3 = T*_3 + d_n. QED

Since for k >= s_n, T_3 = n + k - 1, only for k < e_n is a formula for T_3 not known (and even in that case the bounds of Theorem 3 apply). Hence, it would be nice if e_n were relatively small.
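The closed-form quantities above are easy to tabulate. The sketch below is our own (it is not the empirical program described later in this chapter): it uses s*_n = n - log n in place of the empirically determined s_n, i.e., it assumes d_n = 0, so the Method 3 figures are the Theorem 3 lower bound rather than exact delays.

```python
from math import log2

def bounds(n, k):
    """Closed-form timing quantities for an (n,k) network.

    Returns (T1, T3_lower) where T1 = n + k - 1 is the Method 1 delay
    and T3_lower = log n + 2k - 1 is the Theorem 3 lower bound on the
    Method 3 delay (exact only up to d_n, which is assumed to be 0 here).
    """
    logn = int(log2(n))
    T1 = n + k - 1
    T3_lower = logn + 2 * k - 1
    return T1, T3_lower

n = 1024
for k in (1, 32, n - int(log2(n))):   # last value is s*_n, the crossing point
    T1, T3_lower = bounds(n, k)
    print(f"k={k:4d}  T1={T1:5d}  T3>={T3_lower:5d}  speedup<={T1 / T3_lower:.1f}")
```

Running this shows the behavior proved above: for k = 1 the bound gives a speedup near n / 2, and at k = s*_n = n - log n the two delays coincide, which is exactly where T*_3 crosses the line T_1.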
In the next section we shall find that this is apparently the case.

Table 4 summarizes our results concerning T_3. The same information is shown pictorially in Figure 15. This graph is intended to illustrate the general theory and has been distorted for the sake of clarity. [Table 4 is illegible in this scan.]

For e_n <= k <= s_n, the difference in computation time between Methods 1 and 3 is

T_1 - T_3 = (n + k - 1) - (n - s_n + 2k - 1) = s_n - k

Assuming s_n is close to n and that k is not too small, we can also estimate the speedup:

T_1 / T_3 = (n + k - 1) / (n - s_n + 2k - 1) ~ (n + k) / 2k

As a final note on Figure 15, we point out that T*_3 crosses the line T_1 at s*_n. This is because when k = s*_n,

n - log n = k
n = log n + k
n + k - 1 = log n + 2k - 1
T_1 = T*_3

From this it follows that for all k <= s*_n, T*_3 <= T_3 <= T*_3 + d_n. This is a slight extension of Theorem 3.

5.3. Some Empirical Results

The results of the last section are useful only if d_n and e_n are small compared to n. Unfortunately, general formulas for these quantities are not known. For that reason we wrote a computer program implementing the Method 3 algorithm which, for a given n, calculates T_3 for all values of k between 1 and that n. From this information, d_n and e_n are easily determined. The program was run for values of n up through 1024, in order to get an idea of the relative magnitudes of the various Method 3 parameters and to see how they vary with n.

Our results are presented in Table 5. Note that, as we had hoped, the crucial quantity d_n is indeed very small compared to n. In fact, as a fraction of n, d_n actually decreases as n approaches 1024. Although it is dangerous to extrapolate from this kind of data, the results are certainly encouraging. Similarly, over the range investigated the parameter e_n stays fairly small; as a fraction of n it fluctuates within an interval bounded by 7 and 11 percent.
This behavior is somewhat less promising than that of d_n, but it is good enough to indicate the usefulness of Theorem 4, especially since s_n seems to stay so close to n. This brings us to the most impressive aspect of our table. Note that the ratio s_n/n rises steadily as n increases; beyond n = 64 this ratio is greater than [value illegible in scan]. [Table 5 is illegible in this scan.]

In Figures 16 through 19, T_1, T_2, and T_3 have been plotted together for all values of k between 1 and n (for n = 16, 64, 256, and 1024, respectively). (We have connected the plotted points so as to present continuous curves.) Thus from actual data we can directly compare the speeds of all the methods which we have discussed. In particular, we see that Method 3 is superior to Method 1 over nearly the entire range of k, the speed difference being especially drastic when k << n.

In these graphs the Method 3 bounds T*_3 and T*_3 + d_n are shown as dashed lines. Note that T_3 = T*_3 + d_n everywhere except at the extremes of k's range. Of course, for k close to n, T_3 = T_1. For k close to 0, the behavior of T_3 is somewhat erratic (though of course within the given bounds). We observe also that for all k <= s*_n = n - log n, T_3 is only slightly greater than T*_3 = log n + 2k - 1.

5.4. Comparison with Method 2

In Figures 16 through 19 we also see that Method 3 is much faster than Method 2. In fact, this difference in performance becomes more marked for the larger values of n. For any given n, the speed difference increases with k. But note that even when k is so small that T_2 < T_1, we still have T_3 < T_2.
[Figures 16 through 19 (speed comparisons for n = 16, 64, 256, and 1024), Table 6 (comparison chart for (n,k) networks), and Chapter 6 (Conclusions, Sections 6.1-6.3) are illegible in this scan.]

LIST OF REFERENCES

1. Hennie, F. C., Iterative Arrays of Logical Circuits, John Wiley and Sons, New York, 1961.
2. Minnick, R. C., "A Survey of Microcellular Research," Journal of the Association for Computing Machinery, Vol. 14, No. 2, pp. 203-241, April 1967.
3. Kautz, W. H., "Programmable Cellular Logic," Recent Developments in Switching Theory, Mukhopadhyay, ed., pp. 369-422, Academic Press, New York, 1971.
4. Akers, S. B., "A Rectangular Logic Array," IEEE Transactions on Computers, Vol. C-21, No. 8, pp. 848-857, August 1972.
5. Pezaris, S. D., "A 40-ns 17-Bit by 17-Bit Array Multiplier," IEEE Transactions on Computers, Vol. C-20, No. 4, pp. 442-447, April 1971.
6. Cappa, M. and V. C. Hamacher, "An Augmented Iterative Array for High-Speed Binary Division," IEEE Transactions on Computers, Vol. C-22, No. 2, pp. 172-175, February 1973.
7. Majithia, J. C., "Cellular Array for Extraction of Squares and Square Roots of Binary Numbers," IEEE Transactions on Computers, Vol. C-21, No. 9, pp. 1023-1024, September 1972.
8. Gajski, D. D., "Semigroups of Recurrences," High Speed Computer and Algorithm Organization, Kuck et al., eds., pp. 179-183, Academic Press, New York, 1977.
9. Unger, S. H., "Tree Realizations of Iterative Circuits," IEEE Transactions on Computers, Vol. C-26, No. 4, pp. 365-383, April 1977.
10. Kuck, D. J., The Structure of Computers and Computations, Vol.
1, John Wiley and Sons, New York, 1978.
11. Gajski, D. D., "Class Notes on Digital System Design Automation," Department of Computer Science, University of Illinois, Urbana, Illinois, 1978.
12. Ladner, R. E. and M. J. Fischer, "Parallel Prefix Computation," Technical Report No. 77-03-02, Department of Computer Science, University of Washington, Seattle, Washington, 1977.
13. McCluskey, E. J., Introduction to the Theory of Switching Circuits, McGraw-Hill, New York, 1965.
14. Burton, D. M., Introduction to Modern Abstract Algebra, Addison-Wesley, Reading, Massachusetts, 1967.

BIBLIOGRAPHIC DATA SHEET

Report No.: UIUCDCS-R-80-1022
Report Date: May 1980
Title: Speeding Up Computation in Two-Dimensional Iterative Arrays
Author: Peter Gerald Rose
Performing Organization: Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801
Contract/Grant No.: US NSF MCS80-01561
Sponsoring Organization: National Science Foundation, Washington, D.C.
Type of Report: Master's Thesis

Abstract: This thesis considers rectangular, unidirectional arrays containing identical cells of combinational logic. The goal was to find an algorithm for transforming any such array into a network which performs the same computation in less time. Two methods are presented. The first achieves speedup only if the length of the original array is much greater than its width; it also requires much additional hardware. The second merely rearranges the lines which interconnect the cells, yet it obtains substantial speedup for all but the most square arrays. However, the latter method can be applied only if the cell logic satisfies fairly restrictive conditions.

Key Words:
Combinational logic; Computation time; Iterative arrays

Security Class: UNCLASSIFIED. No. of Pages: 83. Release unlimited.