Report No. UIUCDCS-R-75-775

NSF-OCA-DCR73-07980 A02-000016
 
 COMBINATIONAL CIRCUIT SYNTHESIS WITH TIME AND COMPONENT BOUNDS 
 
 by 
 
 S. C. Chen and D. J. Kuck 
 
 December 1975 
 
 DEPARTMENT OF COMPUTER SCIENCE 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 
 
This work was supported in part by the National Science Foundation
under Grant No. US NSF DCR73-07980 A02.
 
 
1. Introduction 
 
 This paper discusses some aspects of the relationship between 
 sequential circuits and combinational circuits. Circuit design in both 
 areas has been studied extensively in the past. Past studies have included 
 efforts to reduce the time and gates required to compute various functions. 
 This paper establishes upper bounds on time and gates, and also provides 
 a systematic procedure for transforming a sequential circuit design into 
 a combinational circuit. 
 
The upper bounds on time which we prove are quite good, relative
to the best known lower bounds in most cases. We also give gate bounds,
which have often eluded detailed analysis in the past. Our gate bounds
seem quite sharp relative to the actual numbers found in real logic design
examples.
 
 The algorithms we have for transforming sequential circuit designs 
 into combinational ones yield circuits which meet the above-mentioned gate 
 and time bounds. In this sense, we present a uniform design procedure for 
 the realization of any linear sequential machine in combinational circuit 
 form. The advantage of this is that one can often specify the behavior of 
 some desired function quite easily as a sequential circuit. It is somewhat 
 more difficult to translate such a specification into a faster combinational 
 circuit form. A classic example is the ease with which a bit serial adder 
 is specified in sequential form. On the other hand, the design of 
 combinational parallel adders (with various lookahead schemes) occupied 
 many logic designers for some years in the 1950s. The automatic design of 
 a fast parallel combinational adder derived from a bit serial specification 
 is one example of the use of our method. 
 
Not all interesting logic design problems are presented in a
sequential form that is linear. As we shall see later, multiplication is
an example. While some nonlinear cases can be linearized mathematically,
we shall discuss another approach. We will show how nonlinear logic
circuits can be used to remove the nonlinearity in the sequential specification.
Then, in terms of elements which contain the nonlinearities, we
obtain a linear system at a higher level. Our method can then be applied
in a straightforward way.
 
 An important question in modern, practical logic design is what 
 to put in one integrated circuit package and then how to synthesize useful 
 circuits using such packages. One of the methods we present deals with 
 what can be regarded as logic design at the integrated circuit package 
 level. We show what logic should be contained in a package and then give 
 a method for interconnecting packages. Again our discussion is centered 
 on transforming given sequential logic specifications into combinational 
 logic in the form of packages. This is closely related to the subject of 
 the previous paragraph in the sense that nonlinear logic functions can 
 often be hidden in integrated circuit packages, leaving us with a linear 
 problem at a higher level. 
 
 Throughout the paper we illustrate our methods with 
 examples giving gate and time bound coefficients for several practically 
useful logic design problems including adders, multipliers, and ones' position
counters.

The techniques described in this paper are variations on our
earlier efforts to design fast parallel operation computers [1], [2].
There our basic units were adders and multipliers which operated on whole
 floating-point numbers, while here we are dealing with logic design at 
 lower levels. In this paper we deal with operations on bits and bytes 
 at the gate and integrated circuit package level. It is important to 
 notice that mathematically, precisely the same ideas and algorithms are 
 used at all levels; only the details of the technology change. Thus we 
 feel that in attempts to automate the design of general purpose or 
 special purpose machines, one set of underlying ideas may be of general 
 use. 
 
 The following definitions and assumptions will hold throughout 
the paper. An atom is a constant or variable denoted by a lower case
letter. In some parts of the paper we will deal with Boolean atoms
(which have value 0 or 1) and in other parts we will deal with arithmetic
atoms (which represent binary numbers). A dyadic Boolean operator is
either a logical or or a logical and. A dyadic arithmetic operator is
either an addition or multiplication operator. We denote these by $+$ and $\cdot$
respectively, in either case. The context will make our meaning clear
when necessary, and in some cases the same result will hold in either the
Boolean or the arithmetic case.
 
 Except as noted in the paper, we assume that all Boolean nots 
 and arithmetic subtractions are distributed down to the level of atoms. 
In the arithmetic case, this is discussed in [3], while in the Boolean
 case a similar procedure may be carried out using DeMorgan's Laws. We do 
 this without loss of generality to simplify our discussion. 
 
An expression (Boolean or arithmetic) is a well-formed string
consisting of atoms and operators and is denoted by an upper case letter.
We write $E\langle e\rangle$, for example, to denote an expression $E$ containing $e$ atoms.
The distinction between Boolean and arithmetic atoms and expressions will
be clear by the context of our discussion.
 
We assume throughout the paper that and, or and not gates each
have one gate delay of unit time. We assume that all and and or gates have
fan-in 2 and fan-out $f$. By dealing with such stylized gates we are able
to compare various designs in elementary terms. If one assumes more
complex gates with higher fan-ins, our gate and time upper bounds can
obviously be reduced, in general. Another way in which the coefficients
in our bounds can be uniformly improved is by ignoring the time required
to complement signals. Many circuit families have gates in which both
true and complemented outputs are available with no time or cost penalty.
To make our bounds conservative and as widely useful as possible, we have
not taken advantage of any such features.
 
 We emphasize the fact that in practice fan-out is usually greater 
 than fan-in, but fan-out delays may be nonnegligible. We account for fan- 
 out delays and gates in all of our bounds. Thus our results represent a 
 more refined treatment than is usually found in abstract bounds of this 
 type which often ignore fan-out limitations. 
 
We use the notation $T_G[E]$ to denote the number of gate delays
in a circuit which implements expression $E$ using $G$ gates. Similarly, we
use the notation $T_P[E]$ to denote the number of processor delays required to
compute $E$ using $P$ processors.

Throughout the paper we use $\log x$ to denote $\log_2 x$.
 
2. Combinational Circuits 
 
In this section we discuss gate and time bounds for combinational
 logic circuits. We give bounds for gates with fan-out f and fan-in 2. 
 After giving some elementary fan-out and combinational fan-in bounds we 
 present an overall circuit bound. This is expressed in terms of the number 
 of inputs and outputs, and could, for example, be used to bound the gates 
 and time needed for an integrated circuit package. 
 
 Throughout the paper, we assume that signals appear from some 
 external source and are returned to some external destination after our 
 operations on them. Effectively we are ignoring registers from which 
 signals come and to which they are returned. Thus we can count gates and 
 time delays and compose them in a uniform way, without ad hoc accounting 
 procedures at the source and destination of our signals. 
 
 Our first lemma concerns the fan-out of signals and will be used 
 extensively later. 
 
Lemma 1 An $e$ way fan-out can be accomplished using gates with
fan-out of $f \ge 2$ in

$$T_G \le \lceil \log_f e \rceil - 1$$

with

$$G < \frac{e-1}{f-1}.$$
 
Proof The first stage of fan-out is accomplished by either an
external source (which we ignore) or a previous combinational gate which
will be counted elsewhere. The destination of our signals is either external
(hence ignored) or combinational gates which are accounted for elsewhere.
This is illustrated in Figure 1.

Thus we can fan-out to $f$ places with zero gates, to $f-1+f$ places
with one gate, to $f-2+2f$ places with two gates, and to $f-G+Gf$ places with
$G$ gates. Since we want $e \le f-G+Gf$, we see that $e \le f-G+Gf < e+f-1$. Thus
we have

$$G < \frac{e-1}{f-1}.$$

We can fan-out to $f$ places in zero time, to $f^2$ places in 1 time
unit, to $f^3$ places in 2 time units, and to $f^k$ places in $k-1$ time units. It
follows that for $e \le f^k < fe$ we have $k < 1 + \log_f e$, so $k - 1 < \log_f e$.
But $k - 1 = T_G$ so the lemma is proved.
 
 Q.E.D. 
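The counting argument in the proof of Lemma 1 can be checked numerically. The following is a minimal Python sketch, not part of the original report; the function name `fanout_cost` is hypothetical, and it assumes $e \ge 2$ and $f \ge 2$. It grows the fan-out one gate at a time (each gate uses one place and supplies $f$ new ones) and one level at a time (each level multiplies the reach by $f$).

```python
def fanout_cost(e: int, f: int) -> tuple[int, int]:
    """Return (gates, delays) for an e-way fan-out built from fan-out-f gates.

    Hypothetical helper for illustration; assumes e >= 2 and f >= 2.
    """
    if e < 2 or f < 2:
        raise ValueError("this sketch assumes e >= 2 and f >= 2")
    # Gate count: the (uncounted) source reaches f places; every extra gate
    # consumes one place and supplies f new ones, a net gain of f - 1.
    places, gates = f, 0
    while places < e:
        places += f - 1
        gates += 1
    # Delay: the source stage reaches f places in zero time, and each
    # additional level of gates multiplies the reach by f.
    reach, delays = f, 0
    while reach < e:
        reach *= f
        delays += 1
    return gates, delays


# A 28-way fan-out with fan-out-8 gates.
print(fanout_cost(28, 8))
```

Under these assumptions the returned gate count stays strictly below $(e-1)/(f-1)$ and the delay equals $\lceil \log_f e \rceil - 1$.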
 
Next, we bound the gates and time in the combinational part of any logic
circuit.
 
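Lemma 2 [3], [4]

Any Boolean expression $E\langle e\rangle$ of $e$ atoms can be realized using
gates of fan-in 2 in

$$T_G[E\langle e\rangle] \le \begin{cases} 1 + 2d + \lceil \log e \rceil & \text{if } d < \frac{3}{2}\log e \\ 4\lceil \log e \rceil & \text{otherwise,} \end{cases}$$

with

$$G[E\langle e\rangle] \le \begin{cases} e-1 & \text{if } d < \frac{3}{2}\log e \\ 2(e-1) & \text{otherwise,} \end{cases}$$

where $d$ is the depth of parenthesis nesting in $E$.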
 
[Figure 1: Signal Fan-out (Stage 1, Stage 2, Signal Destinations)]
 
The proof of this lemma for $d < \frac{3}{2}\log e$ is found in [3]. In
most practical expressions, the depth of parenthesis nesting is small, so
this provides the best bound. However, if $d \ge \frac{3}{2}\log e$, we use the second
half of the lemma which is proved in [4], where it is also shown that this
may be extended to $T_G[E\langle e\rangle] \le 3\log e$ with $G[E\langle e\rangle] \le 2.5e$. We have found
that for practical purposes a low gate bound is more important than a low
time bound, however. In much of the following we will use Lemma 2, assuming
for simplicity that $d < \frac{3}{2}\log e$.
 
 Next we define a combinational circuit and then give overall 
 
 gate and time bounds for such circuits. 
 
 Definition 1 
 
A combinational circuit $C\langle r,s,e,n,d\rangle$ is defined by

1) A set of inputs $x_i$, $1 \le i \le r$.

2) A set of outputs $y_j$, $1 \le j \le s$, where $y_j$ is
defined by an output expression $E_j\langle e_j\rangle$ of $e_j$ atoms (representing inputs or
complements of inputs) and with parenthesis nesting depth $d_j$.

3) $e = \max_{1 \le j \le s}\{e_j\}$ is the maximum number of atoms contained in $E_j$.

4) $n = \sum_{j=1}^{s} e_j$ is the total number of atoms in all $E_j$'s.

5) $d = \max_{1 \le j \le s}\{d_j\}$ is the maximum parenthesis nesting depth among
all of the output expressions $E_j$.

It is clear that $n \ge s$, and we assume that $n \ge r$, i.e., each input is used in
at least one output expression.
 
Theorem 1 
 
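Any combinational circuit $C\langle r,s,e,n,d\rangle$ can be realized using gates
of fan-in 2 and fan-out $f$ in

$$T_G \le \lceil \log e \rceil + 2(d + \lceil \log_f n \rceil)$$

with

$$G \le \left(1 + \frac{2}{f-1}\right)n + \left(1 - \frac{1}{f-1}\right)r - s.$$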
 
 Proof 
 
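First, consider the fan-out of the inputs. Let the $i$-th input be
used $e_i$ times in output expressions. Since we may need to complement the
input, we first fan it out to $e_i + 1$ places (the extra one for complementation).
By Lemma 1 we need (since we assume each input atom is used at least once)

$$T_{G1} \le \lceil \log_f (n-s+1) \rceil - 1 \le \lceil \log_f n \rceil - 1$$

with

$$G_1 \le \sum_{i=1}^{r} \frac{e_i}{f-1} = \frac{n}{f-1}.$$

Now we can complement each input variable in

$$T_{G2} = 1$$

with $G_2 \le r$,

and fan the complemented variable out, each to at most $e_i$ places. Thus we have

$$T_{G3} \le \lceil \log_f (n-s+1) \rceil - 1 \le \lceil \log_f n \rceil - 1$$

with

$$G_3 \le \sum_{i=1}^{r} \frac{e_i - 1}{f-1} = \frac{n-r}{f-1}.$$

Next, we consider the fan-in of the atoms to form the output
variables according to the output expressions $E_j\langle e_j\rangle$. By Lemma 2 we
have

$$T_{G4} \le \begin{cases} 1 + 2d + \lceil \log e \rceil & \text{if } d_j < \frac{3}{2}\log e_j,\ 1 \le j \le s \\ 4\lceil \log e \rceil & \text{otherwise} \end{cases}$$

with

$$G_4 \le \begin{cases} \sum_{j=1}^{s} (e_j - 1) = n - s & \text{if } d_j < \frac{3}{2}\log e_j,\ 1 \le j \le s \\ \sum_{j=1}^{s} 2(e_j - 1) = 2(n-s) & \text{otherwise.} \end{cases}$$

Thus (assuming $d_j < \frac{3}{2}\log e_j$, $1 \le j \le s$) we have a total of

$$T_G \le T_{G1} + T_{G2} + T_{G3} + T_{G4} \le \lceil \log e \rceil + 2(d + \lceil \log_f n \rceil)$$

with

$$G \le \frac{n}{f-1} + r + \frac{n-r}{f-1} + (n - s) = \left(1 + \frac{2}{f-1}\right)n + \left(1 - \frac{1}{f-1}\right)r - s.$$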
 
 Q.E.D. 
 
 Example 1 
 
Suppose we have a 16 pin integrated circuit package which contains
only combinational logic. Assume we can use 7 pins for inputs and 7 pins
for outputs, i.e., $r = s = 7$. Assume that we have an average of 4 atoms
per output expression so $n = 4 \cdot 7 = 28$, the maximum number of atoms per
expression is $e = 8$, and $d = 2$. Thus a typical output expression may be of
the form

$$y_j = (x_1 + x_2) \cdot (\bar{x}_3 + x_5).$$
 
Let us use circuits with fan-in 2 and fan-out 8. Now for any
possible combinational logic with the above characteristics, a package
can be designed such that the total package time in gate delays is

$$T_G \le \lceil \log e \rceil + 2(d + \lceil \log_f n \rceil) = \lceil \log 8 \rceil + 2(2 + \lceil \log_8 28 \rceil) = 3 + 2(4) = 11.$$

The total number of gates in any such package is at most

$$G \le \left(1 + \tfrac{2}{7}\right)28 + \left(1 - \tfrac{1}{7}\right)7 - 7 = 35.$$
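The arithmetic of Example 1 can be reproduced mechanically. The sketch below is an illustration only, not part of the original report: the names `ceil_log` and `theorem1_bounds` are hypothetical, and the formulas are the Theorem 1 bounds $T_G \le \lceil \log e \rceil + 2(d + \lceil \log_f n \rceil)$ and $G \le (1 + \frac{2}{f-1})n + (1 - \frac{1}{f-1})r - s$, evaluated with exact fractions.

```python
from fractions import Fraction


def ceil_log(x: int, base: int) -> int:
    """Smallest k >= 0 with base**k >= x, i.e. ceil(log_base x) for x >= 1."""
    k, power = 0, 1
    while power < x:
        power *= base
        k += 1
    return k


def theorem1_bounds(r: int, s: int, e: int, n: int, d: int, f: int):
    """Evaluate the Theorem 1 time and gate bounds for C<r,s,e,n,d>.

    Hypothetical helper; assumes the small-nesting-depth case of Lemma 2.
    """
    time = ceil_log(e, 2) + 2 * (d + ceil_log(n, f))
    gates = (1 + Fraction(2, f - 1)) * n + (1 - Fraction(1, f - 1)) * r - s
    return time, gates


# Example 1: r = s = 7, e = 8, n = 28, d = 2, f = 8.
time, gates = theorem1_bounds(r=7, s=7, e=8, n=28, d=2, f=8)
print(time, gates)
```

For the parameters of Example 1 this evaluates to 11 gate delays and 35 gates.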
 
 Example 2 
 
Suppose we have a 48 pin package for large-scale integrated
circuits. Let $r = s = 23$, $n = 6 \cdot 23 = 138$, $e = 16$, $d = 3$, and $f = 8$. Now
any possible combinational circuit can be realized with

$$T_G \le \lceil \log 16 \rceil + 2(3 + \lceil \log_8 138 \rceil) = 4 + 2(6) = 16$$

and

$$G \le \left(1 + \tfrac{2}{7}\right)138 + \left(1 - \tfrac{1}{7}\right)23 - 23 \le 177.$$
 
 Thus we see that for realistic assumptions about packages and 
 logical expressions, we obtain gate and time bounds that are of practical 
 interest. 
 
 
 3. Sequential Circuits 
 
In this section we discuss methods of transforming sequential
circuits into combinational ones and give time bounds and component bounds
on the resulting circuits.
 
 Definition 2 
 
A sequential circuit $S\langle r,s,e,n,d,m\rangle$ is defined at time $t$ by

1) A set of inputs $x_i(t)$, $1 \le i \le r = r_1 + r_2$. We call the
$x_i(t)$, $1 \le i \le r_1$, the external inputs, and the $x_i(t)$, $r_1 + 1 \le i \le r$,
the feedback inputs.

2) A set of outputs $y_j(t)$, $1 \le j \le s = s_1 + s_2$, where for any
logical functions $f_j$,

$$y_j(t) = f_j[x_1(t), x_2(t), \ldots, x_r(t)] = f_j[x_1(t), \ldots, x_{r_1}(t), y_{s_1+1}(t-m_1), \ldots, y_s(t-m_{r_2})]$$

as shown in Figure 2. We call the $y_j(t)$, $1 \le j \le s_1$, the external outputs
and the $y_j(t)$, $s_1 + 1 \le j \le s$, the feedback outputs. Note that $r_2 \ge s_2$.
Each output is defined by an output expression $E_j\langle e_j\rangle$ of $e_j$ atoms (representing
inputs or their complements). Expression $E_j$ has parenthesis nesting
depth $d_j$.

3) $e = \{e_1, e_2\}$ where

$$e_1 = \max_{1 \le j \le s_1}\{e_j\} \quad\text{and}\quad e_2 = \max_{s_1+1 \le j \le s}\{e_j\}.$$
 
[Figure 2: Sequential Circuit. A clocked combinational logic block with inputs $1, \ldots, r_1 + r_2$ and outputs $1, \ldots, s_1 + s_2$; the feedback outputs return to the feedback inputs through delays $m_1, \ldots, m_{r_2}$.]
 
4) $n = n_1 + n_2$ where

$$n_1 = \sum_{j=1}^{s_1} e_j \quad\text{and}\quad n_2 = \sum_{j=s_1+1}^{s} e_j.$$

5) $d = \{d_1, d_2\}$ where

$$d_1 = \max_{1 \le j \le s_1}\{d_j\} \quad\text{and}\quad d_2 = \max_{s_1+1 \le j \le s}\{d_j\}.$$

6) A set of delays $m_i$, $1 \le i \le r_2$, where
$m = \max_i m_i$ is the maximum delay.
 
 Definition 3 
 
A linear sequential circuit is a sequential circuit with outputs
$y_i(t)$, $s_1 + 1 \le i \le s$, of the form

$$y_i(t) = f_i[x_1(t), \ldots, x_{r_1}(t), y_{s_1+1}(t-m_1), \ldots, y_s(t-m_{r_2})] = c_i + a_{i1}\, y_{s_1+1}(t-m_1) + \cdots + a_{i r_2}\, y_s(t-m_{r_2}),$$

where the $c_i$ and $a_{ij}$, $1 \le j \le r_2$, are derived from any logical functions of
the inputs $x_1(t), \ldots, x_{r_1}(t)$.
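As an illustration of Definition 3, the bit serial adder mentioned in the introduction fits the linear form with a single feedback output (the carry) and delay $m = 1$. The Python sketch below is not from the original report; the function names are hypothetical, and the exclusive-or used for the sum bit is shorthand for an and/or/not expression.

```python
def serial_add(a_bits, b_bits):
    """Add two little-endian bit lists, one bit position per time step.

    Hypothetical illustration: the carry is the only feedback output,
    delayed by one time step (m = 1).
    """
    carry = 0
    sum_bits = []
    for x1, x2 in zip(a_bits, b_bits):
        c_t = x1 & x2                       # coefficient "c": carry generate
        a_t = x1 | x2                       # coefficient "a": carry propagate
        sum_bits.append(x1 ^ x2 ^ carry)    # external output for this step
        carry = c_t | (a_t & carry)         # linear in the fed-back carry
    return sum_bits, carry


def to_bits(value, width):
    """Little-endian list of the low `width` bits of a non-negative integer."""
    return [(value >> i) & 1 for i in range(width)]


bits, carry_out = serial_add(to_bits(11, 4), to_bits(6, 4))
print(bits, carry_out)
```

Here the feedback output has the form $c + a \cdot y(t-1)$, where $c$ and $a$ depend only on the external inputs at time $t$, which is what the definition requires.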
 
Definition 4

An $m$-th order linear recurrence system of $n$ equations, $R\langle n,m\rangle$, is
defined by

$$x_i = 0 \quad\text{for } i \le 0,$$

and

$$x_i = c_i + \sum_{j=i-m}^{i-1} a_{ij}\, x_j \quad\text{for } 1 \le i \le n,$$

where $1 \le m < n$, and the $c_i$ and $a_{ij}$ are constants. We assume that $n$ and $m$
are powers of 2. If either is not, we choose the next higher power of 2
and apply our bounds and algorithms directly. The solution of this recurrence
is the set $\{x_i \mid 1 \le i \le n\}$.
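For reference, the recurrence of Definition 4 can be evaluated serially, one $x_i$ at a time. The following Python sketch is a plain sequential evaluator for the arithmetic case, not the parallel method of the report; the function name and the dictionary layout for the coefficients $a_{ij}$ are assumptions made for illustration.

```python
def solve_recurrence(c, a, m):
    """Serially solve x_i = c_i + sum_{j=i-m}^{i-1} a_ij * x_j, 1 <= i <= n.

    Hypothetical layout: c is a list of length n + 1 with index 0 unused,
    and a[i] is a dict mapping j to the coefficient a_ij.
    """
    n = len(c) - 1
    x = [0] * (n + 1)                          # x[0] is unused; x_i = 0 for i <= 0
    for i in range(1, n + 1):
        total = c[i]
        for j in range(max(1, i - m), i):      # terms with j <= 0 vanish
            total += a[i].get(j, 0) * x[j]
        x[i] = total
    return x[1:]


# A second-order system with every coefficient equal to 1 (Fibonacci-like).
c = [None, 1, 0, 0, 0, 0]
a = {i: {i - 1: 1, i - 2: 1} for i in range(1, 6)}
print(solve_recurrence(c, a, m=2))
```

This serial evaluation takes on the order of $nm$ operation delays and serves only as a baseline against which the parallel bounds can be compared.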
 
The following lemma forms the basis of much of our subsequent work.
We will use it to count gates as well as higher level components such as
integrated circuit packages or whole processors. Thus we state the lemma in
terms of operations $\theta$ which can be interpreted as logical or and and or as
arithmetic addition and multiplication. When we deal with fan-out, $\theta$ at the
gate level corresponds to gates while at the processor level it refers to
registers or demultiplexers.
 
 Lemma 3 
 
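Any $m$-th order linear recurrence $R\langle n,m\rangle$ can be solved in

$$T_\theta \le \left(\frac{5}{2} + \log m + \frac{1}{2}\log_f n\right)\log n - \frac{1}{2}(\log^2 m + \log m)$$

with

$$\theta \le \frac{1}{2}\left[m^2\left(2 + \frac{1}{f-1}\right) + m\left(1 + \frac{1}{f-1}\right)\right] n \log n + \left[m^3\left(1 + \frac{1}{f-1}\right) - m^2\left(2 + \frac{1}{f-1}\right) - \frac{m}{f-1}\right] n + 2m^2 + \frac{2m^2}{f-1}\log n$$

where $f = 2^q$, $q \ge 1$.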
 
 Proof 
 
Our proof follows the proof of Theorem 2 of [2] and a logical
circuit can be constructed following Algorithm 2 of [2]. First, we consider
the time required. The computational $\theta$ delays follow directly from the time
bound for solving an $R\langle n,m\rangle$ system in Theorem 2 of [2]. Thus, for the first
part of our time bound, we have from Theorem 2 of [2]

$$T_1 \le (2 + \log m)\log n - \frac{1}{2}(\log^2 m + \log m).$$

To complete the time bound, we must consider the fan-out time
required by Theorem 2 of [2]. Such times were regarded as negligible
compared to arithmetic operation times in [2]. The solution of an $R\langle n,m\rangle$
system is generated in $\log n$ iterations. It may be seen from Figure 4 of [2]
that on iteration $i = \log k$, we perform at most $(\frac{k}{2} + m - 1)$ way fan-outs.
Thus the fan-out time on iteration $i$ is $\lceil \log_f(\frac{k}{2} + m - 1) \rceil - 1$, by Lemma 1.
Summing over all iterations, for $k = 2, 4, 8, \ldots, n$, we have (since
$f = 2^q \ge 2$),

$$T_2 \le (\lceil \log_f 2f \rceil - 1) + (\lceil \log_f 4f \rceil - 1) + \cdots + \left(\left\lceil \log_f\left(\frac{n}{2} + m - 1\right) \right\rceil - 1\right)$$

and grouping terms, we get

$$T_2 \le q\left(1 + 2 + 3 + \cdots + \left\lceil \frac{\log n - \log 2f + 1}{q} \right\rceil\right) = q(1 + 2 + 3 + \cdots + \lceil \log_f n \rceil - 1)$$

$$\le q(1 + 2 + 3 + \cdots + \log_f n) = \frac{q}{2}\log_f n\,(1 + \log_f n) = \frac{1}{2}\log n\,(1 + \log_f n).$$

Thus our total time is $T_1 + T_2$ or

$$T_\theta \le (2 + \log m)\log n - \frac{1}{2}(\log^2 m + \log m) + \frac{1}{2}\log n\,(1 + \log_f n)$$

$$= \left(\frac{5}{2} + \log m + \frac{1}{2}\log_f n\right)\log n - \frac{1}{2}(\log^2 m + \log m).$$
 
Next, we consider the number of operations required. In the proof
of Theorem 2 [2], we gave expressions for counting the number of processors
in evaluating an $R\langle n,m\rangle$ system. Since a tree of $n$ leaves has at most $2n - 1$
nodes, we can upper bound the number of $\theta$ operations by doubling the processor
count from Theorem 2 of [2]. We choose the worst expression for the
processor count on iteration $i = \log k$, namely, expression (2) [2], the
$2m \le k \le n$ case, sum over all iterations, for $k \in K = \{2, 4, 8, \ldots, n\}$,
and multiply by 2 to bound the $\theta$ operations. Thus, ignoring fan-out for the
moment, we have a total of
 
 91 < 2E (fl + (£ - 2)(m + 1) 
 " keK L 
 
 r m 
 
 I J +m(f - 1) 
 3=1 
 
 r m 
 
 + (m + 1. 
 
 Ij + m(f - m) 
 
 } , 
 
 where 
 
 K = {2,U,8,...,n} . 
 
 By rearranging terms, we have 
 
 = 2Z { 
 keK 
 
 1 +(|- l)(m + 1) 
 
 m 
 
 3=1 
 
 2) m(m + 1)(|- 1) + m(| - l) 
 
 + m(m + 1)(|- m)} 
 
Now summing on j gives
 
 = 21 {[f(m + l) - m] aJfal + (S. , 2 ) m(m + l)(| 
 
 keK * ^ k 2 
 
 1) + f^(m + 2) 
 
 (m + m + m)} 
 
 3 2 
 m 
 
 _ oy r m k ( m - m )n ( m + m ) 3m J - 
 - dL I- - + — - + — n - 
 
 keK ^ ^ k ^ 2 
 
 - 2m 
 
 } 
 
$$\le \left[-m^2(2n - 2) + (m^3 - m)n + (m^2 + m)\,n \log n\right]$$

$$= (m^2 + m)\,n \log n + (m^3 - 2m^2 - m)\,n + 2m^2.$$
 
As is discussed in [2], the trees we are evaluating are of a
special form with $\cdot$ operations at the leaf nodes and $+$ operations elsewhere.
The above sum can be used as an exact count of $\cdot$ operations. But since the
trees are somewhat sparse, a more refined count reduces the number of $+$
operations. Thus our factor of 2 above is too large. By a straightforward
but long argument similar to the above, we can show that the $\theta$ operation
count is actually bounded by

$$\theta_1 \le \left(m^2 + \frac{m}{2}\right) n \log n + (m^3 - 2m^2)\,n + m(2m - 1)$$

which we use in the statement of the lemma.
 
Now we consider the number of fan-out $\theta$ operations required. It
follows from Theorem 2 [2] that iteration $i$ requires $(m^2 + m)n/k - m^2$
fan-outs, each fanning out to at most $k/2 + m - 1$ destinations. Thus the
total number of operations can be computed using Lemma 1 as

$$\theta_2 \le \frac{(m^2+m)\,n}{f-1}\sum_{k \in K}\frac{1}{k}\left(\frac{k}{2} + m - 2\right) - \frac{m^2}{f-1}\sum_{k \in K}\left(\frac{k}{2} + m - 2\right), \quad K = \{2, 4, 8, \ldots, n\}.$$

Summing, we obtain

$$\theta_2 \le \frac{(m^2+m)\,n}{f-1}\left[\frac{\log n}{2} + m - 1\right] - \frac{m^2}{f-1}\left[\frac{2n-2}{2} + (m-2)\log n\right]$$

$$= \frac{m^2+m}{2(f-1)}\,n \log n + \frac{m^3 - m^2 - m}{f-1}\,n - \frac{m^3 \log n - 2m^2 \log n - m^2}{f-1}$$

$$\le \frac{m^2+m}{2(f-1)}\,n \log n + \frac{m^3 - m^2 - m}{f-1}\,n + \frac{2m^2}{f-1}\log n.$$
 
Note that at the gate level these $\theta$ operations are gates and are
comparable to the gates counted in $\theta_1$. At the integrated circuit or processor
level, these $\theta$ operations correspond to registers or demultiplexers which are
generally less costly than the $\theta$ operations of $\theta_1$. But to be conservative
we count each of them as one operation. Thus our total operation count
is $\theta = \theta_1 + \theta_2$, so

$$\theta \le \left[m^2 + \frac{m}{2} + \frac{m^2+m}{2(f-1)}\right] n \log n + \left[m^3 - 2m^2 + \frac{m^3 - m^2 - m}{f-1}\right] n + 2m^2 + \frac{2m^2}{f-1}\log n$$

$$= \frac{1}{2}\left[m^2\left(2 + \frac{1}{f-1}\right) + m\left(1 + \frac{1}{f-1}\right)\right] n \log n + \left[m^3\left(1 + \frac{1}{f-1}\right) - m^2\left(2 + \frac{1}{f-1}\right) - \frac{m}{f-1}\right] n + \frac{2m^2}{f-1}\log n + 2m^2.$$
 
 Q.E.D. 
The following corollary follows directly from Lemma 3 and covers
a case of wide practical interest.

Corollary 1 Any first order linear recurrence $R\langle n,1\rangle$ can be solved in

$$T_\theta \le \frac{1}{2}(5 + \log_f n)\log n$$

with

$$\theta \le \frac{1}{2}\left(3 + \frac{2}{f-1}\right) n \log n - \left(1 + \frac{1}{f-1}\right) n + \frac{2}{f-1}\log n + 2.$$

Thus we see that for large fan-outs, we can solve any $R\langle n,1\rangle$ system
in $T_G = O(\log n)$ with $G = O(n \log n)$.
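To get a feel for the size of these bounds, they can be evaluated for concrete parameters. The Python sketch below is an illustration only: the function name `lemma3_bounds` is hypothetical, and the formulas coded are the Lemma 3 bounds as read from the derivation in its proof, with $n$, $m$, and $f$ assumed to be powers of 2 so that all logarithms are exact.

```python
from fractions import Fraction


def lemma3_bounds(n: int, m: int, f: int):
    """Evaluate the Lemma 3 time and operation bounds for R<n,m>.

    Hypothetical helper; assumes n, m and f are powers of 2 with f >= 2.
    """
    for v in (n, m, f):
        if v < 1 or v & (v - 1):
            raise ValueError("n, m and f are assumed to be powers of 2")
    if f < 2:
        raise ValueError("fan-out f must be at least 2")
    log_n = n.bit_length() - 1
    log_m = m.bit_length() - 1
    q = f.bit_length() - 1                  # f = 2**q
    log_f_n = Fraction(log_n, q)
    inv = Fraction(1, f - 1)
    time = ((Fraction(5, 2) + log_m + log_f_n / 2) * log_n
            - Fraction(log_m ** 2 + log_m, 2))
    ops = (Fraction(1, 2) * (m ** 2 * (2 + inv) + m * (1 + inv)) * n * log_n
           + (m ** 3 * (1 + inv) - m ** 2 * (2 + inv) - m * inv) * n
           + 2 * m ** 2
           + 2 * m ** 2 * inv * log_n)
    return time, ops


# First-order system of 8 equations with fan-out 8 (the Corollary 1 case).
print(lemma3_bounds(8, 1, 8))
```

With $m = 1$ the time value coincides with the Corollary 1 expression $\frac{1}{2}(5 + \log_f n)\log n$, which gives a quick consistency check on the two statements.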
 
 Example 3 
 
The $R\langle 8,1\rangle$ system

$$c_i = 0, \quad i \le 0$$

and

$$c_i = y_i + x_i \cdot c_{i-1}, \quad 1 \le i \le 8$$

can be used to describe the carry generation in a binary adder (c.f., Theorem 3).
A circuit to generate the $c_i$ follows directly from Lemma 3 and Algorithm 2
of [2] by interpreting $\cdot$ as and, and $+$ as or. The circuit is shown in
Figure 3, assuming f = 5.
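The idea of generating all carries in about $\log n$ levels can be sketched in software. The Python code below is not Algorithm 2 of [2]; it uses the standard recursive doubling scheme as a stand-in, and the function names are hypothetical. Each position carries a pair (propagate product, carry so far) and on each pass combines it with the pair a fixed distance to its right, doubling that distance every time.

```python
def carries_serial(x, y):
    """Reference: c_i = y_i OR (x_i AND c_{i-1}), with c_0 = 0."""
    c, out = 0, []
    for xi, yi in zip(x, y):
        c = yi | (xi & c)
        out.append(c)
    return out


def carries_doubling(x, y):
    """Same carries in ceil(log2 n) passes (recursive doubling sketch).

    Hypothetical stand-in for the circuit construction: every pass could be
    one level of and/or gates operating in parallel.
    """
    a, c = list(x), list(y)
    n, step = len(a), 1
    while step < n:
        new_a, new_c = a[:], c[:]
        for i in range(step, n):
            new_c[i] = c[i] | (a[i] & c[i - step])
            new_a[i] = a[i] & a[i - step]
        a, c, step = new_a, new_c, step * 2
    return c


x = [1, 0, 1, 1, 0, 1, 1, 1]   # propagate-like terms x_i
y = [0, 1, 0, 0, 1, 0, 0, 0]   # generate-like terms y_i
print(carries_doubling(x, y) == carries_serial(x, y))
```

For $n = 8$ the loop runs three times, which matches the $\log n$ iterations mentioned in the proof of Lemma 3; the fan-out of each intermediate signal is what the fan-out terms of the lemma account for.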
 
 Next we give a corollary of Lemma 3 which shows the ranges of 
 time and gates for an R<n,m> system as fan-out ranges from 2 to an arbitrarily 
 high number. 
 
Corollary 2 Any $m$-th order linear recurrence $R\langle n,m\rangle$ can be solved
in

$$(2 + \log m)\log n - \frac{1}{2}(\log^2 m + \log m) \le T_\theta \le \left(\frac{5}{2} + \log m + \frac{1}{2}\log n\right)\log n - \frac{1}{2}(\log^2 m + \log m)$$

with

$$\left(m^2 + \frac{m}{2}\right) n \log n + (m^3 - 2m^2)\,n + O(m^2 \log n) \le \theta \le \frac{1}{2}\left[3m^2 + 2m\right] n \log n + \left[2m^3 - 3m^2 - m\right] n + 2m^2 \log n + 2m^2.$$

Proof The lower bounds follow directly from $T_1$ and $\theta_1$ in the proof
of Lemma 3, assuming that fan-out time and $\theta$ count are negligible. The upper
bounds follow from Lemma 3 by setting $f = 2$.

Thus we see that for large fan-outs we can solve an $R\langle n,m\rangle$ system
in $T_G = O(\log m \log n)$ with $G = O(m^2 n \log n)$.
 
 Definition 5 
 
The $k$ step operation of a sequential circuit $S$ is defined by $k$ pairs
of vectors

$$[(x_1(t), \ldots, x_{r_1}(t)),\ (y_1(t), \ldots, y_{s_1}(t))]$$

for $1 \le t \le k$. These vectors represent the external inputs and outputs of
$S$ at each time step $t$.
 
[Figure 3: Carry generation circuit for the $R\langle 8,1\rangle$ system of Example 3]
 
 Theorem 2 
 
The $k$ step operation of any linear sequential circuit $S\langle r,s,e,n,d,m\rangle$
can be realized by a combinational circuit such that for large $k$

$$T_G \le \frac{1}{2}(\log_f s_2 k)(\log s_2 k) + O(\log k)$$

with

$$G \le \frac{3}{4}(m+1)^2 s_2^3 \left(2 + \frac{1}{f-1}\right) k \log s_2 k + O(k).$$
 
Proof Our proof is in three parts. First, we set up the $A$ and $b$
arrays of Definition 4. Then we evaluate the resulting recurrence system.
Finally, we generate the external outputs.

The $A$ matrix and $b$ vector components can be generated from the
external inputs at any of the $k$ time steps. Thus we have a total of $kr_1$
inputs to combinational circuit $C_1$ which produces as outputs the components
of $A$ and $b$. Since a total of $n_2$ atoms are used in generating all of the
feedback outputs of $S$, there are at most $kn_2$ non-zero components in $A$ and $b$.
The maximum number of atoms in any expression is $e_2$, the total number of
atoms is $kn_2$ and the maximum parenthesis depth is $d_2$, so we can set up the
$A$ and $b$ arrays with

$$C_1\langle kr_1, kn_2, e_2, kn_2, d_2\rangle.$$

Next we solve the linear recurrence $R\langle n,m\rangle$. There are a total
of $ks_2$ outputs in $k$ time steps so $n = ks_2$. Since the maximum delay is $m$
time steps with $s_2$ outputs per time step, the bandwidth of this system is
at most $(m+1)s_2 - 1$. Thus we have a recurrence of the form $R\langle ks_2, (m+1)s_2 - 1\rangle$.

Finally, we generate the external outputs with combinational circuit
$C_2$. There are a total of $kr$ inputs and $ks_1$ external outputs. The maximum
number of atoms in any output expression is $e_1$, the total number of atoms
in all output expressions is $kn_1$ and the maximum depth is $d_1$, so we have

$$C_2\langle kr, ks_1, e_1, kn_1, d_1\rangle.$$
 
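Now we bound the gates and time required for each of these. By
Theorem 1, for $C_1$ we have

$$T_{G1} \le \lceil \log e_2 \rceil + 2(d_2 + \lceil \log_f kn_2 \rceil)$$

with

$$G_1 \le \left(1 + \frac{2}{f-1}\right)kn_2 + \left(1 - \frac{1}{f-1}\right)kr_1 - kn_2 = \left(1 - \frac{1}{f-1}\right)kr_1 + \frac{2}{f-1}\,kn_2.$$

By Lemma 3, we can solve $R\langle ks_2, (m+1)s_2 - 1\rangle$ in

$$T_{G2} \le \left(\frac{5}{2} + \log (m+1)s_2 + \frac{1}{2}\log_f ks_2\right)\log ks_2$$

with

$$G_2 \le \frac{1}{2}\left[(m+1)^2 s_2^2\left(2 + \frac{1}{f-1}\right) + (m+1)s_2\left(1 + \frac{1}{f-1}\right)\right] ks_2 \log ks_2 + \left(1 + \frac{1}{f-1}\right)(m+1)^3 s_2^3\, ks_2 + O((m+1)^2 s_2^2 \log ks_2).$$

By Theorem 1 we have for $C_2$

$$T_{G3} \le \lceil \log e_1 \rceil + 2(d_1 + \lceil \log_f kn_1 \rceil)$$

with

$$G_3 \le \left(1 + \frac{2}{f-1}\right)kn_1 + \left(1 - \frac{1}{f-1}\right)kr - ks_1.$$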
 
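Combining the above we have a total time of

$$T_G \le \lceil \log e_1 \rceil + \lceil \log e_2 \rceil + 2(d_1 + d_2 + \lceil \log_f kn_1 \rceil + \lceil \log_f kn_2 \rceil) + \left(\frac{5}{2} + \log (m+1)s_2 + \frac{1}{2}\log_f s_2 k\right)\log s_2 k.$$

Thus, for a fixed circuit, as we increase the number of operating
time steps $k$, we have

$$T_G \le \frac{1}{2}(\log_f s_2 k)(\log s_2 k) + O(\log k).$$

The total gate count is

$$G \le \frac{1}{2}\left[(m+1)^2 s_2^2\left(2 + \frac{1}{f-1}\right) + (m+1)s_2\left(1 + \frac{1}{f-1}\right)\right] s_2 k \log s_2 k + \left[\left(1 - \frac{1}{f-1}\right)(r_1 + r) + n_1 + \frac{2}{f-1}\,n - s_1\right] k + \left(1 + \frac{1}{f-1}\right)(m+1)^3 s_2^4\, k + O((m+1)^2 s_2^2 \log ks_2).$$

Thus, for any fixed circuit, as $k$ increases we have

$$G \le \frac{1}{2}\left[(m+1)^2 s_2^2\left(2 + \frac{1}{f-1}\right) + (m+1)s_2\left(1 + \frac{1}{f-1}\right)\right] s_2 k \log s_2 k + O(k)$$

or (since $m \ge 1$ and $f \ge 2$)

$$G \le \frac{3}{4}(m+1)^2 s_2^3\left(2 + \frac{1}{f-1}\right) k \log s_2 k + O(k).$$

Q.E.D.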
 
 Now we turn to the consideration of higher level components as our 
 basic circuit elements. We will define two package types which could be 
 implemented directly using integrated circuits. Our time bounds will be 
 expressed in package delays. The techniques of the previous section could 
 be used to design such packages. Our component bounds will be expressed in 
 terms of the total number of packages required. 
 
 
 Our strategy in this case is to decompose a linear recurrence 
 system R<n,m> into a number of small identical systems. These smaller 
 systems can be solved directly by interconnecting the integrated circuit 
 packages we specify. An algorithm to decompose a large R<n,m> system 
has been given in [2], [5] for arithmetic operations. Here we present
 the algorithm for logic design and consider only the R<n,l> case for the 
 sake of easy explanation. The R<n,l> case is by far the most common one 
 occurring in practical logic design, and our method can be extended to 
 larger m in a straightforward way. 
 
 Definition 6 
 
We define two types of integrated circuit packages.

a) $IC_{R\langle n,1\rangle}$ is a package which accepts input atoms $c_i$ for
$1 \le i \le n$, and $a_i$ for $2 \le i \le n$. It computes the outputs $x_i$ for $1 \le i \le n$
according to the recurrence relation

$$x_0 = 0, \qquad x_i = c_i + a_i\, x_{i-1}.$$

For signal input and output it has a total number of pins equal to $3n - 1$
times the number of bits per atom.

b) $IC_{U\langle n\rangle}$ is a package which may accept input atoms $a_i$ and $b_i$ for
$1 \le i \le n$, and $c$ and $d$. It computes the outputs $x_i$ for $1 \le i \le n$, according to

$$x_i = v_i w_i + y_i z_i,$$

where either

i) $v_i = a_i$, $w_i = c$, $y_i = b_i$ and $z_i = d$, $1 \le i \le n$

or

ii) $v_i = a_i$, $w_i = b_i$, $y_i = a_i$ and $z_i = b_i$, $1 \le i \le n$.

For signal input and output it has a total number of pins of at most $3n + 2$
times the number of bits per atom. In general, we denote the total number of
integrated circuits in some logical circuit by $IC$.
 
Example 4

An $IC_{R\langle 4,1\rangle}$ has a total of $3 \cdot 4 - 1 = 11$ signal pins if it is to solve
a Boolean recurrence. Suppose we are summing the bits in a 16-bit word and
will produce a $\log 16 = 4$ bit result. Then 4 bits are required per atom and
an arithmetic $IC_{R\langle 4,1\rangle}$ to solve this problem would need 44 signal pins. An
$IC_{U\langle 3\rangle}$ for Boolean operations requires $3 \cdot 3 + 2 = 11$ signal pins. An arithmetic
package for handling 4 bit numbers would need a total of 44 signal pins.
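The pin counts of Definition 6 are simple enough to tabulate. The following Python sketch is an illustration with hypothetical function names; it encodes the $3n - 1$ and $3n + 2$ signal pin formulas, each multiplied by the number of bits per atom.

```python
def pins_recurrence_package(n: int, bits_per_atom: int = 1) -> int:
    """Signal pins of a recurrence package: n inputs c_i, n - 1 inputs a_i,
    and n outputs x_i, each bits_per_atom wide (hypothetical helper)."""
    return (3 * n - 1) * bits_per_atom


def pins_u_package(n: int, bits_per_atom: int = 1) -> int:
    """Upper bound on signal pins of a U package: inputs a_i, b_i, c, d and
    outputs x_i, each bits_per_atom wide (hypothetical helper)."""
    return (3 * n + 2) * bits_per_atom


# The four pin counts quoted in Example 4.
print(pins_recurrence_package(4), pins_recurrence_package(4, 4))
print(pins_u_package(3), pins_u_package(3, 4))
```

Power and ground pins are not included in these counts, in keeping with the text's restriction to signal pins.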
 
The following algorithm is adapted from [5] (c.f. Ch. 4). It solves
any $R\langle n,1\rangle$ system by partitioning it into smaller systems.
 
Algorithm 1 Any given first-order linear recurrence

$$R\langle n,1\rangle: \quad x_0 = 0, \qquad x_i = c_i + a_i\, x_{i-1}, \quad 1 \le i \le n$$

can be solved as follows.

Step 1

(a) For any $h \ge 2$, compute $\frac{n}{h}$ independent recurrence systems $Z^{(j)}$,
$1 \le j \le \frac{n}{h}$, defined as follows.

$$Z^{(j)}: \quad z_0^{(j)} = 0, \qquad z_i^{(j)} = c_i^{(j)} + a_i^{(j)}\, z_{i-1}^{(j)}, \quad 1 \le i \le h,$$

where

$$c_i^{(j)} = c_{i+(j-1)h}, \qquad a_i^{(j)} = a_{i+(j-1)h}.$$

(b) Compute $(\frac{n}{h} - 1)$ independent recurrence systems $Y^{(j)}$, $2 \le j \le \frac{n}{h}$,
defined as follows.

$$Y^{(j)}: \quad y_0^{(j)} = 1, \qquad y_i^{(j)} = a_i^{(j)}\, y_{i-1}^{(j)}, \quad 1 \le i \le h.$$

From this step we obtain $h$ elements of the solution of the original
system, i.e., $x_i = z_i^{(1)}$ for $1 \le i \le h$.

Step 2

From the results of Step 1, compute the following recurrence system

$$z_h^{(0)} = 0, \qquad z_h^{(j)} = z_h^{(j)} + y_h^{(j)}\, z_h^{(j-1)}, \quad 1 \le j \le \frac{n}{h},$$

where on the right-hand side $z_h^{(j)}$ and $y_h^{(j)}$ are the values from Step 1 and
$z_h^{(j-1)}$ is the value already produced by this step.
From this step we obtain another $(\frac{n}{h} - 1)$ elements of the solution,
i.e., $x_{jh} = z_h^{(j)}$ for $2 \le j \le \frac{n}{h}$.

Step 3

From the results of Steps 1 and 2, compute the remaining elements
of the solution using the following $n - \frac{n}{h} - (h-1)$ independent expressions

$$x_{i+(j-1)h} = z_i^{(j)} + y_i^{(j)}\, z_h^{(j-1)}$$

for $1 \le i \le h - 1$ and $2 \le j \le \frac{n}{h}$.
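The three steps of Algorithm 1 can be sketched in software for the arithmetic case. The Python code below is an illustration under stated assumptions, not the package-level circuit itself: the function names are hypothetical, lists are 0-indexed, $n$ and $h$ are assumed to be powers of 2, and the loops that a circuit would run in parallel are run one after another.

```python
def solve_serial(c, a):
    """Reference solution of x_i = c_i + a_i * x_{i-1} with x_0 = 0."""
    x, prev = [], 0
    for ci, ai in zip(c, a):
        prev = ci + ai * prev
        x.append(prev)
    return x


def solve_partitioned(c, a, h):
    """Solve the same recurrence block by block, following Steps 1 to 3.

    Hypothetical sketch; assumes len(c) and h are powers of 2.
    """
    n = len(c)
    if n <= h:
        return solve_serial(c, a)            # small enough for one package
    assert n % h == 0, "this sketch assumes h divides n"
    blocks = n // h
    z, y = [], []
    for j in range(blocks):
        cj, aj = c[j * h:(j + 1) * h], a[j * h:(j + 1) * h]
        z.append(solve_serial(cj, aj))       # Step 1a: local solutions Z(j)
        prods, p = [], 1
        for ai in aj:                        # Step 1b: running products Y(j)
            p *= ai
            prods.append(p)
        y.append(prods)
    # Step 2: a smaller first-order recurrence over the block boundaries.
    ends = solve_partitioned([zj[-1] for zj in z], [yj[-1] for yj in y], h)
    # Step 3: correct the interior elements of every block after the first.
    x = list(z[0])
    for j in range(1, blocks):
        carry_in = ends[j - 1]
        x.extend(z[j][i] + y[j][i] * carry_in for i in range(h - 1))
        x.append(ends[j])
    return x


c = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
a = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5, 9, 0, 4, 5]
print(solve_partitioned(c, a, 4) == solve_serial(c, a))
```

The sketch computes the running products for the first block as well, although they are multiplied by $x_0 = 0$ and are not needed; the algorithm as stated skips them, which is why Step 1b has one system fewer than Step 1a.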
 
 
Lemma 4

Any first-order linear recurrence $R\langle n,1\rangle$ can be solved in time

$$T_{IC} \le 2\,\frac{\log n}{\log h} - 1$$

using a total package count of

$$IC \le \frac{6n}{h} + 4\,\frac{\log n}{\log h} - 7$$

with package types $IC_{R\langle h,1\rangle}$ and $IC_{U\langle h-1\rangle}$ for $h \ge 2$.
 
Proof

It follows directly from the above algorithm that we need one
$IC_{R<h,1>}$ type package for each $Z^{(j)}$ and $Y^{(j)}$ in Steps 1a and 1b.
This results in $(2\lceil n/h \rceil - 1)$ packages. In Step 3 we use
$(\lceil n/h \rceil - 1)$ packages of type $IC_{U<h-1>}$, corresponding to
Definition 6b, part i. We can treat Step 2 as a new
R<$\lceil n/h \rceil$,1> system and apply the same algorithm recursively to
solve this system. This implies that we reduce the size of the original
system from n to less than or equal to h, following the sequence
$n' = n, \lceil n/h \rceil, \lceil \lceil n/h \rceil / h \rceil, \ldots, h$,
and finally use one extra $IC_{R<h,1>}$ package to solve the residual
system. Hence,
 
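for each iteration we need $(2\lceil n'/h \rceil - 1)$ packages of type
$IC_{R<h,1>}$ and $(\lceil n'/h \rceil - 1)$ packages of type $IC_{U<h-1>}$.
Since at most $\frac{\log n}{\log h} - 1$ iterations are required, we have
for $IC_{R<h,1>}$ type packages a total of

\[ IC_R = \left( 2\left\lceil \frac{n}{h} \right\rceil - 1 \right)
        + \left( 2\left\lceil \left\lceil \frac{n}{h} \right\rceil \frac{1}{h} \right\rceil - 1 \right)
        + \cdots \]
\[ \le \cdots + 3\left( \frac{\log n}{\log h} - 2 \right) + 1, \]

and since $h \ge 2$,

\[ \le \frac{4n}{h} + 3\,\frac{\log n}{\log h} - 5 . \]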
 
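Similarly, for $IC_{U<h-1>}$ type packages, we have a total of

\[ IC_U = \left( \left\lceil \frac{n}{h} \right\rceil - 1 \right)
        + \left( \left\lceil \left\lceil \frac{n}{h} \right\rceil \frac{1}{h} \right\rceil - 1 \right)
        + \cdots \]
\[ \le \left( \frac{n}{h} \right)
     + \left( \frac{n}{h^2} + \frac{1}{h} \right)
     + \left( \frac{n}{h^3} + \frac{1}{h^2} + \frac{1}{h} \right) + \cdots \]
\[ \le \frac{2n}{h} + \frac{\log n}{\log h} - 2 . \]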
 
The time bound is obtained by the fact that all packages in the same step
per iteration are operating in parallel.

Q.E.D.
 
Example 5

The R<16,1> system

\[ x_i = 0 \quad \text{for } i \le 0 \]

and

\[ x_i = c_i + a_i x_{i-1}, \quad \text{for } 1 \le i \le 16, \]

can be solved by the circuit of Figure 4, which follows directly from
Algorithm 1 with h = 4. The packages marked R represent $IC_{R<4,1>}$ types
and those marked U represent $IC_{U<3>}$.
 
For use in a later application, we now consider a special case of
an R<n,1> system. Let $a_i = 1$, for all i, in Algorithm 1. In this case we
need not perform Step 1b. So for each iteration, only $\lceil n'/h \rceil$
packages of type $IC_{R<h,1>}$ are required. Also, note that all $Z^{(j)}$ are
computed in Step 1a by merely summing atoms. Since Steps 2 and 3 require
only multiplication by the y's generated in Step 1, which are 1's, no
multiplication is required in any package. From this we have
 
Corollary 3  Any R<n,1> system of the form

\[ x_0 = 0, \qquad x_i = c_i + x_{i-1}, \quad 1 \le i \le n, \]

can be solved in time

\[ T_{IC} \le \left( 2\,\frac{\log n}{\log h} - 1 \right) \]

using a total package count of

\[ IC \le \frac{4n}{h} + 3\,\frac{\log n}{\log h} - 4 . \]
 
 
4. Applications
 
 In this section we will study several practical logic design
 problems. The methods of section 3 will be used to derive time and
 component bounds. We will consider binary addition and ones' position
 counting in detail. In less detail we will consider binary multiplication,
 digital filtering and a control problem.
 
Definition 7

By the addition of two n digit binary numbers $a = a_n \ldots a_1$
and $b = b_n \ldots b_1$ we mean the generation of sum digits
$s = s_n \ldots s_1$ and carry digit $c_n$, defined as follows.

We write

\[ s_i = (a_i b_i + \bar{a}_i \bar{b}_i)\, c_{i-1}
       + (a_i \bar{b}_i + \bar{a}_i b_i)\, \bar{c}_{i-1} \tag{1} \]

where $1 \le i \le n$ and $c_0 = 0$, such that $s_i = 1$ iff just one or all
three of $a_i$, $b_i$ and $c_{i-1}$ are equal to 1. Also we write

\[ c_i = a_i b_i + (a_i + b_i)\, c_{i-1} \tag{2} \]

where $1 \le i \le n$ and $c_0 = 0$, such that $c_i = 1$ iff any two or all
three of $a_i$, $b_i$ and $c_{i-1}$ are equal to 1. Now let

\[ x_i = a_i + b_i \tag{3} \]

and

\[ y_i = a_i b_i . \tag{4} \]

If we write

\[ d_i = a_i \bar{b}_i + \bar{a}_i b_i
       = (a_i + b_i) \cdot \overline{a_i b_i}
       = x_i \bar{y}_i \tag{5} \]

then Equation 1 can be rewritten as

\[ s_i = \bar{d}_i\, c_{i-1} + d_i\, \bar{c}_{i-1} \tag{6} \]

and Equation 2 can be rewritten as

\[ c_i = y_i + x_i c_{i-1}, \quad 1 \le i \le n, \qquad c_0 = 0 . \tag{7} \]
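As a quick sanity check of Definition 7, the bit-level equations can be evaluated directly in software. The sketch below is illustrative only: the function name `add_by_equations` is hypothetical, `+` and juxtaposition in the equations are taken as Boolean OR and AND, and d_i is taken to be the exclusive or of a_i and b_i, formed as x_i AND NOT y_i. The carries are computed serially here, whereas the paper computes them in parallel.

```python
def add_by_equations(a, b, n):
    """Add two n-bit integers bit by bit using x_i, y_i, d_i, s_i and c_i."""
    carry, total = 0, 0                       # c_0 = 0
    for i in range(n):                        # bit i here is digit i+1 in the text
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        x_i = a_i | b_i                       # Equation 3: x_i = a_i OR b_i
        y_i = a_i & b_i                       # Equation 4: y_i = a_i AND b_i
        d_i = x_i & (1 - y_i)                 # Equation 5: a_i XOR b_i
        s_i = ((1 - d_i) & carry) | (d_i & (1 - carry))   # Equation 6
        carry = y_i | (x_i & carry)           # Equation 7
        total |= s_i << i
    return total, carry                       # (sum digits, carry digit c_n)
```

Calling `add_by_equations(a, b, 4)` for every pair of 4-bit inputs and comparing `total + (carry << 4)` with `a + b` exercises all of Equations 3 through 7.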
 
 Our first result concerns binary addition using gates as components. 
 
Theorem 3

Two $n = 2^t$, $t \ge 0$, digit binary numbers can be added in

\[ T_G \le \tfrac{1}{2}\,(5 + \log_f n)\,\log n + 4 \]

with
 
 G - ( 2 + fll 5 n log n + (8 - -^-) n + {-£-) log n + 2 . 
 
Proof  Our proof consists of three parts.

1) To generate the $x_i$ and $y_i$, $1 \le i \le n$, from $a_i$ and $b_i$ by
Equations 3 and 4, we need 2n gates and one gate delay, so $T_{G1} = 1$ and
$G1 = 2n$.

2) To generate the $s_i$, $1 \le i \le n$, from $x_i$, $y_i$, and $c_{i-1}$
using Equation 6, we refer to Figure 5. A total of 7 gates are required for
each $s_i$, for a total of 7n gates. After $d_i$ and $c_{i-1}$ are available,
three gate delays are required. It will be seen in part 3 that the
generation of the $c_i$, $1 \le i \le n$, from $x_i$ and $y_i$ can be
accomplished in $2 \log n$ steps. So for $n \ge 2$ the two steps required to
generate $d_i$ from $x_i$ and $y_i$ are no more than the time required to
generate $c_i$, since $2 \log 2 = 2$.
 
[Figure 5. Sum Generation]

It is easy to verify that the theorem holds for n = 1 by a direct
construction. Thus we have $T_{G2} = 3$ with $G2 \le 7n$.
 
3) To generate the $c_i$, $1 \le i \le n$, from $x_i$ and $y_i$ using
Equation 7, we turn to Lemma 3. Since Equation 7 defines an R<n,1> system,
it follows immediately from Corollary 1 (cf. Figure 3) that

\[ T_{G3} \le \tfrac{1}{2}\,(5 + \log_f n)\,\log n \]

with
 
 G3 < 
 
 3 + 1 
 
 2 f-1 
 
 1 ? 
 n log n - (l + j^-) n + — r log n + 2 . 
 
Thus we have from parts 1, 2 and 3 a total of

\[ T_G \le 1 + 3 + \tfrac{1}{2}\,(5 + \log_f n)\,\log n
       = \tfrac{1}{2}\,(5 + \log_f n)\,\log n + 4 \]
 
 < 2n + 7n + (| + ~-) n log n - (l + — ) n + ~- log n + 2 
 - (§ + ^j) n log n + (8 - £^-) n + (~j-) log n + 2 . 
 
 Q.E.D. 
 
 
 Next we consider binary addition with integrated circuit packages 
 as components . 
 
Theorem 4

Two $n = 2^t$, $t \ge 0$, digit binary numbers can be added in time

\[ T_{IC} \le \left( 2\,\frac{\log n}{\log h} + 1 \right) \]

using a total package count of

\[ IC \le \frac{9n}{h} + 4\,\frac{\log n}{\log h} - 7 \]

with package types $IC_{R<h,1>}$ and $IC_{U<h-1>}$ for $h \ge 2$.
 
Proof  The $x_i$ and $y_i$ of Definition 7 can be generated in one package
delay using $2n/h$ IC packages. The carries of Equation 7 (cf. Figure 4) can
be generated following Lemma 4 in
$T_{IC} \le (2\,\frac{\log n}{\log h} - 1)$ using
$\frac{6n}{h} + 4\,\frac{\log n}{\log h} - 7$ packages. Then the sum bits of
Equation 6 can be generated in one package delay using $n/h$ packages of
type $IC_{U<h-1>}$ following Definition 6b, part ii. Summing these counts
proves the theorem.

Q.E.D.
 
Example 6

Consider the problem of adding two 32-bit binary numbers using
gates with fan-in 2 and fan-out 8. By the method of Theorem 3, the sum can
be formed in at most 21 gate delays since

\[ T_G \le \tfrac{1}{2}\,(5 + \log_8 32)\,\log 32 + 4
       = \tfrac{1}{2}\left(\tfrac{20}{3}\right) 5 + 4
       = \tfrac{100}{6} + 4 < 21 . \]

The number of gates required is at most

\[ G \le \tfrac{23}{14} \cdot 160 + \tfrac{55}{7} \cdot 32 + 12 < 527 . \]

On the other hand, if integrated circuit packages are available
which handle 8 bits at a time, h = 8, we have the following. The total
package count is

\[ IC \le 9 \cdot \tfrac{32}{8} + 4\,\frac{\log 32}{\log 8} - 7 < 37 , \]

and the number of package delays is

\[ T_{IC} \le \left( 2\,\frac{\log 32}{\log 8} + 1 \right) < 5 . \]
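The time and package figures of this example can be rechecked with exact rational arithmetic. The sketch below is a hedged illustration: it assumes the gate-delay bound is (1/2)(5 + log_f n) log n + 4, the package-delay bound is 2 log n / log h + 1, and the package count bound is 9n/h + 4 log n / log h - 7, which is how the garbled source is read here. The helper name `log_ratio` is hypothetical and only handles powers of two.

```python
from fractions import Fraction


def log_ratio(n, base):
    """log_base(n) as an exact fraction, for n and base both powers of two."""
    return Fraction(n.bit_length() - 1, base.bit_length() - 1)


n, f, h = 32, 8, 8            # word length, gate fan-out, bits per package
log_n = log_ratio(n, 2)       # log base 2 of n

# Assumed reading of the gate-delay bound for the adder built from gates.
gate_delays = Fraction(1, 2) * (5 + log_ratio(n, f)) * log_n + 4
# Assumed reading of the package-delay and package-count bounds.
package_delays = 2 * log_ratio(n, h) + 1
package_count = Fraction(9 * n, h) + 4 * log_ratio(n, h) - 7

print(float(gate_delays), float(package_delays), float(package_count))
```

Under these assumptions the three values come out as 62/3, 13/3 and 107/3, which sit below the rounded figures 21, 5 and 37 quoted in the example.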
 
 The next application we study is a ones' position counter. This 
 is the problem of determining the number of ones to the right (say) of each 
 bit in a word. The problem arises in various real world contexts, particularly 
 in control design. We discuss the problem because of its practical interest 
 and also because it serves as an interesting case standing between binary 
 addition and binary multiplication. 
 
 As we saw above, given the theoretical background of section 3 on 
 solving linear recurrences, the design of a binary adder is straightforward. 
 The ones' position counter is not as easy, however. When formulated at the 
 bit level, this problem leads to a nonlinear recurrence which cannot be 
 solved by the methods of section 3. As we shall see later, binary multiplication 
 also shares this property. 
 
 The technique we use to solve such logic design problems with bit
 level nonlinearities is to reformulate them at a higher level where they
 are in fact linear. The nonlinearity is thus hidden inside a more complex
 bit level operator. In practical terms, this can be accomplished by building
 a nonlinear circuit element and then combining these in linear ways according to
 the techniques of section 3. Putting such nonlinearities inside integrated
 circuit packages is an attractive possibility.
 
Definition 8

The ones' position counting of an n bit word $a = a_n \ldots a_2 a_1$
is the generation of a count vector $z = (z_n, \ldots, z_1)$ such that $z_i$
is the sum of the number of ones in bits $a_i \ldots a_1$. Thus, the ones'
position count of a = 10110110 is the vector z = (5,4,4,3,2,2,1,0).
 
Following Definition 8, we can easily generate the z vector using
the following arithmetic R<n,1> system

\[ z_0 = 0, \qquad z_i = a_i + z_{i-1}, \quad 1 \le i \le n . \]
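The arithmetic recurrence z_i = a_i + z_{i-1} with z_0 = 0 can be checked against the example in Definition 8 with a few lines of code. This is a serial sketch for illustration only (the function name `ones_position_count` is hypothetical); the paper's interest is in evaluating the same recurrence in parallel.

```python
def ones_position_count(word):
    """Return (z_n, ..., z_1) for a bit string written as a_n ... a_1."""
    z, counts = 0, []
    for bit in reversed(word):       # visit a_1 first
        z = int(bit) + z             # z_i = a_i + z_{i-1}
        counts.append(z)
    return tuple(reversed(counts))   # report in the order z_n, ..., z_1


print(ones_position_count("10110110"))
```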
 
Thus by using log n bit adders (cf. Theorem 4) as components we can solve
the system in O(log n) adder steps (cf. Corollary 1), so

\[ T_G = O(\log n) \cdot O(\log \log n) = O(\log n \, \log \log n) . \]

Since each adder has O(log n log log n) gates, we have a total gate count of

\[ G = O(n \log n) \cdot O(\log n \, \log \log n)
     = O(n \log^2 n \, \log \log n) . \]
 
By formulating this problem in terms of integrated circuit packages
we can use Corollary 3 to achieve a better gate count than the above. Thus,
to solve an arithmetic R<n,1> system we need $IC = O(n/h)$ packages. Each
$IC_{R<h,1>}$ package is used to count 1's, so inside each package we can use
the method of Corollary 1 to solve an arithmetic R<h,1> system. Thus from
Corollary 1, we have $\theta = O(h \log h)$. Now let us choose $h = \log n$,
so $\theta = O(\log n \, \log \log n)$. Each such $\theta$ processor is used
to add log n bit numbers. Thus we use Theorem 3 to count the gates as
$G = O(\log n \, \log \log n)$. Multiplying these three levels of components
we obtain a total gate count of

\[ G = O\!\left( \frac{n}{h} \cdot h \log h \cdot \log n \cdot \log \log n \right)
     = O\!\left( n \log n \, (\log \log n)^2 \right) . \]
 
Similarly, we obtain the time. By Corollary 3 we have O(log n / log h)
package delays. Each package delay is $T_\theta = O(\log h)$ from
Corollary 1. And the add time by Theorem 3 is O(log log n). Hence, our total
time in gate delays is

\[ T_G = O(\log n \cdot \log \log n) . \]

Thus we see that the time is the same but we have reduced the gate
count over the straightforward method. We can summarize this as
 
Theorem 5

The ones' position count of an $n = 2^t$, $t \ge 0$, bit word can be
generated in

\[ T_G = O(\log n \cdot \log \log n) \]

with

\[ G = O\!\left( n \cdot \log n \cdot (\log \log n)^2 \right) . \]
 
We note that the gate count can be further improved by using more
types of packages. For example, if we let h = log log n in Step 1 of
Algorithm 1 and h = log n in Step 2 (see proof of Lemma 4), we can obtain a
solution in

\[ T_G = O(\log n \cdot \log \log n) \]

with

\[ G = O\!\left( n \cdot (\log \log n)(\log \log \log n) \right) . \]

By using even more package types, even better gate bounds are possible.

To obtain a package bound, Corollary 3 can be applied directly.
The following example illustrates this.
 
Example 7

Suppose we have packages of types $IC_{R<4,1>}$ and $IC_{U<3>}$ as
illustrated in Example 4. Then by Corollary 3, the ones' position count of
an n-vector, with n = 16, can be done with at most
$2(\tfrac{16}{4}) + 2(\tfrac{\log 16}{\log 4}) - 2 = 10$ $IC_{R<4,1>}$
packages and $2(\tfrac{16}{4}) + (\tfrac{\log 16}{\log 4}) - 2 = 8$
$IC_{U<3>}$ packages. Actually, by direct application of Algorithm 1, it can
be easily found that we need just five $IC_{R<4,1>}$ packages and three
$IC_{U<3>}$ packages. The following table shows the package count for some
practical values of n.
 
          package          n = 16    n = 32    n = 64

 Bound    IC_{R<4,1>}        10        19        36
          IC_{U<3>}           8        17        33

 Actual   IC_{R<4,1>}         5        11        21
          IC_{U<3>}           3         8        18

 Table 1. Ones' Position Count

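The entries of Table 1 can be reproduced by counting packages recursively, following the partitioning of Algorithm 1 for the special case in which every a_i equals 1. The sketch below makes several assumptions: the names `actual_packages` and `bound_packages` are hypothetical, each iteration is taken to use one R package per block and one U package per block after the first, and the bounds are read as 2n/h + 2 log n / log h - 2 for R packages and 2n/h + log n / log h - 2 for U packages, rounded up.

```python
from fractions import Fraction
from math import ceil


def actual_packages(n, h):
    """Count (R, U) packages used by the partitioning when every a_i = 1."""
    if n <= h:
        return 1, 0                      # one R package solves the residual system
    blocks = ceil(Fraction(n, h))
    r_rest, u_rest = actual_packages(blocks, h)   # Step 2, solved recursively
    return blocks + r_rest, (blocks - 1) + u_rest


def bound_packages(n, h):
    """Evaluate the assumed bounds, rounded up; n and h must be powers of two."""
    ratio = Fraction(n.bit_length() - 1, h.bit_length() - 1)   # log n / log h
    r_bound = Fraction(2 * n, h) + 2 * ratio - 2
    u_bound = Fraction(2 * n, h) + ratio - 2
    return ceil(r_bound), ceil(u_bound)


for n in (16, 32, 64):
    print(n, bound_packages(n, 4), actual_packages(n, 4))
```

Exact fractions are used instead of floating-point logarithms so that rounding up is not disturbed by representation error.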
Finally, let us consider a bit level formulation of this problem.
Following Definition 8, let $z_{ij}$ be the j-th, $1 \le j \le 1 + \log n$,
bit of $z_i$. We can imagine solving the problem using an array of half
adders such that the half adder in position (i,j) is described by
($\oplus$ denotes exclusive or)

\[ z_{ij} = z_{i-1,j} \oplus c_{i,j-1} \tag{Eq. 8} \]

\[ c_{ij} = z_{i-1,j} \cdot c_{i,j-1} \tag{Eq. 9} \]

where $c_{i,0} = a_i$ and $z_{0,j} = 0$. Notice that at the bit level, this
is a nonlinear recurrence and cannot be solved by the methods of section 3.

If we use half-adders as components, it is easy to see that the
problem can be solved with $G = O(n \log n)$ and $T_G = O(n)$. This gate
count is comparable to the best shown above, but it uses much more time.
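The half-adder array can be simulated row by row to see why it needs about n log n components and a delay that grows with n. In this sketch (illustrative only; the function name and the choice of word width are assumptions), row i takes the previous count z_{i-1}, injects a_i as the carry into the lowest bit position, and ripples it upward through one half adder per bit.

```python
def ones_position_count_half_adders(bits):
    """bits[0] is a_1. Returns [z_1, ..., z_n] using one half-adder row per bit."""
    n = len(bits)
    width = n.bit_length()                 # enough bit positions to hold the value n
    z_prev = [0] * width                   # z_{0,j} = 0 for every j
    counts = []
    for a_i in bits:
        carry = a_i                        # c_{i,0} = a_i
        z_row = []
        for j in range(width):
            z_row.append(z_prev[j] ^ carry)    # sum output: exclusive or
            carry = z_prev[j] & carry          # carry output: and
        z_prev = z_row
        counts.append(sum(bit << j for j, bit in enumerate(z_row)))
    return counts


word = "10110110"                          # written a_n ... a_1
z = ones_position_count_half_adders([int(ch) for ch in reversed(word)])
print(list(reversed(z)))                   # z_n, ..., z_1
```

Each of the n rows depends on the row before it, which is the source of the O(n) delay noted in the text.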
 
 
 Next, we turn to bounds for binary number multipliers. 
 
Definition 9  By the multiplication of two n digit binary numbers
$a = a_n \ldots a_1$ and $b = b_n \ldots b_1$, we mean the generation of 2n
product digits $p = p_{2n} \ldots p_1$.

First, we can formulate the multiplication problem using a
straightforward (row parallel) carry-save adder array. If we let x
correspond to various pairwise ands of input bits [6], we obtain a coupled
recurrence system of the form

\[ q_{ij} = x \oplus q_{i-1,j} \oplus c_{i-1,j-1} \tag{Eq. 10} \]

\[ c_{ij} = x \cdot q_{i-1,j} + x \cdot c_{i-1,j-1}
          + q_{i-1,j} \cdot c_{i-1,j-1} \tag{Eq. 11} \]

Note that this nonlinear recurrence system is a generalization of Equations
8 and 9 for the ones' position counter. This cannot be solved by the methods
of section 3; however, we can solve it directly using an array of $n^2$ bit
level adders. This gives a circuit which can multiply two n bit numbers in
$T_G = O(n)$ with $G = O(n^2)$. Since we are interested in faster schemes, we
will now turn to two methods to solve the recurrence of Equations 10 and 11
in parallel.
 
 The first method uses a tree of 2n bit adders. First, we form a 
 standard array of partial products. Then we use the adder tree to form the sum. 
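The adder-tree idea can be sketched in software: form one shifted partial product per bit of b, then add the rows pairwise, level by level. This is only an illustration under stated assumptions (the function name `multiply_with_adder_tree` is hypothetical, and Python integer addition stands in for each hardware adder); it shows the tree shape, not the gate-level construction.

```python
def multiply_with_adder_tree(a, b, n):
    """Multiply two n-bit integers by summing partial products in a tree."""
    # One partial product row per bit of b: (a AND b_j), shifted left j places.
    rows = [(a if (b >> j) & 1 else 0) << j for j in range(n)]
    levels = 0
    while len(rows) > 1:
        # All additions on one level are independent and could run in parallel.
        paired = [rows[k] + rows[k + 1] for k in range(0, len(rows) - 1, 2)]
        if len(rows) % 2:
            paired.append(rows[-1])        # odd row passes through unchanged
        rows = paired
        levels += 1
    return rows[0], levels


print(multiply_with_adder_tree(13, 11, 4))
```

For n a power of two the loop runs log n times and performs n - 1 additions in total, matching the tree height and adder count used in the analysis that follows.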
 
Theorem 6

Two $n = 2^t$, $t \ge 0$, digit binary numbers can be multiplied in

\[ T_G \le \tfrac{1}{2}\,(6 + \log_f n)\,\log^2 n
         + \tfrac{1}{2}\,(14 + \log_f n)\,\log n + \log_f n + 1 \]

with
 
 2 s 2 , ,_ 2x2^. ,,„ 2 
 
 G < 
 
 (3 + — )n log n + (20 + — )n - 3n log n - (IT - J-^) n 
 
 
Proof  Our proof consists of two parts. 1) To generate the $n^2$
partial product bits $a_i \cdot b_j$, for all $1 \le i, j \le n$, we need
$n^2$ AND gates and one gate delay. Since each input bit is fanned out to n
places we have, from the above and Lemma 1,

\[ T_{G1} \le 1 + \log_f n \]

with

\[ G1 \le n^2 \left( 1 + \frac{2}{f-1} \right) . \]
 
2) To generate the sum of the partial products we need an adder tree of
n - 1 adders. Each adder adds 2n bit numbers and the height of the tree is
log n adder delays. Thus, by Theorem 3, and since
$\log_f 2n \le 1 + \log_f n$, we have

\[ T_{G2} \le \log n \cdot \left[ \tfrac{1}{2}\,(5 + \log_f 2n)\,\log 2n + 4 \right]
   \le \tfrac{1}{2}\,(6 + \log_f n)\,\log^2 n
     + \tfrac{1}{2}\,(14 + \log_f n)\,\log n \]

with
 
 G 2 = (n - 1) 
 
 (f + -jT^) 2n log 2n + (8 - ^-) 2n + -~ log 2n + 2 
 
 < (3 + ~^-r) n 2 log n + 19n 2 - 3n log n - (17 - ~) n . 
 
 Q.E.D. 
 
 As an example of this theorem, consider an integrated circuit 
 
 package as follows. 
 
 Example 8 
 
 Using gates with fan-out 8, a multiplier of two 4-bit numbers can
 be implemented with a delay of
 
 T G < |(6) log 2 h + |(lU) log k + 1 = 20 
 using 
 
 G < (3 + j) h 2 log h + (20 + j) k 2 - 3-U-log \ - (17 - h k = 3^1 
 
 
 The above result is somewhat sloppy because we considered all
 inputs to the tree adder to be 2n bit numbers. In fact, the inputs to
 the first level of adders are only n bit numbers. At succeeding levels
 they are of length $n + 2$, $n + 5$, ..., $n + i + 2^{i-1} - 2$, for
 $1 \le i \le \log n$.
 By a careful analysis which takes this increasing length into account, we
 can improve the gate count in Theorem 6 by a factor between 2 and 3. Thus,
 in our example the gate count could actually be bounded by a number between
 115 and 170.
 
 The method above is the best method we know (in terms of time) 
 for numbers with few digits. For long numbers, the next method is the best 
 we know. The crossover between the two occurs between 8 and 16 bits. 
 
 The next method is a variation of the Wallace-Dadda method [7],
 [8]. It consists of three stages: generation of partial products, column
 compression, and a carry propagate adder. This differs from Wallace-Dadda
 only in the last stage.
 
 The generation of partial products is done in the same way as in
Theorem 6. For an upper bound on time, we assume a three to two column
compression scheme [6]. The column compression for two n-bit numbers can be
done with $(n^2 - 4n + 3)$ full adders and $(n - 1)$ half adders. The half
adder can be built using 9 gates (see Theorem 3 with n = 1). A full adder of
2 bits can be easily implemented with 11 gates by a scheme similar to
Figure 5. Thus we have a total of

\[ G2 \le 11(n^2 - 4n + 3) + 9(n - 1) = 11n^2 - 35n + 24 . \]

The time for the column compression is

\[ T_{G2} \le 6 \log_{3/2} n \approx 10 \log n \]

since each full adder requires at most 6 gate delays.
 
 Finally, to propagate the carry, we use a 2n bit adder (fewer
bits are actually needed), with

\[ T_{G3} \le \tfrac{1}{2}\,(5 + \log_f 2n)\,\log 2n + 4 , \]

as we used in Theorem 6. This leads us to

Theorem 7

Two $n = 2^t$, $t \ge 0$, digit binary numbers can be multiplied in

\[ T_G \le \left( 13 + \tfrac{1}{2} \log_f n \right) \log n
         + \tfrac{3}{2} \log_f n + 8 \]

with
 
 G < (12 + ~j-) n 2 + (3 + ~j0 n log n - l6n + ^~ log n + (26 + —■) 
 
Example 9

Using gates of fan-in 8, we can multiply two 32-bit numbers in

\[ T_G \le \left( 13 + \tfrac{1}{2} \log_8 32 \right) \log 32
         + \tfrac{3}{2} \log_8 32 + 8 < 81 \]

with
 
 G < (12 + j) 102U + (3+|) 321og 32 - 512 + ylog 32 + 26 + y < 12,62k 
 
 Thus far, all of our examples have dealt with R<n,1> systems. In
practical logic design, linear recurrence systems with m > 1 also arise.

 First, consider the following logical path tracing problem. Suppose
we are given two binary words $a = a_n \ldots a_1$ and $b = b_n \ldots b_1$
and a starting bit, either $a_1$ or $b_1$. We wish to generate a word
$e = e_n \ldots e_1$ which consists of those bits on a path through a and b
chosen as follows. First, we let $e_1$ be the given starting bit. Then we
choose bits in the same word until we encounter a zero in (say) bit i, which
causes us to choose bit $e_{i+1}$ from the other word. We continue in the
other word until we encounter a zero which causes another switch, etc. We
define

\[ c_i = a_{i-1} \cdot c_{i-1} + \bar{b}_{i-1} \cdot d_{i-1}, \quad 2 \le i \le n \]

and

\[ d_i = \bar{a}_{i-1} \cdot c_{i-1} + b_{i-1} \cdot d_{i-1}, \quad 2 \le i \le n \]

as two control words, where $c_1 = 1$ if $a_1$ is our starting bit and
$d_1 = 1$ if $b_1$ is our starting bit. Then we have

\[ e_i = a_i \cdot c_i + b_i \cdot d_i . \]

The generation of $c = c_n \ldots c_1$ and $d = d_n \ldots d_1$ can be
handled as a coupled linear recurrence system of the form R<2n,3>.
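The path tracing rule can be simulated directly to see how the two control words select bits. The sketch below rests on an assumed reading of the problem, since the subscripts and complement bars are damaged in the source: positions are numbered from 1 upward, the path stays in its current word after a 1 and switches to the other word after a 0, and the control words c and d mark which word is active at each position. The function name `trace_path` is hypothetical.

```python
def trace_path(a, b, start_in_a):
    """a[0], b[0] are the bits at position 1. Returns the path bits e_1..e_n."""
    c, d = (1, 0) if start_in_a else (0, 1)    # control bits at position 1
    e = []
    for i in range(len(a)):
        if i > 0:
            a_prev, b_prev = a[i - 1], b[i - 1]
            # Stay in a after a 1 in a; enter a after a 0 in b (and vice versa).
            c, d = ((a_prev & c) | ((1 - b_prev) & d),
                    ((1 - a_prev) & c) | (b_prev & d))
        e.append((a[i] & c) | (b[i] & d))      # pick the bit of the active word
    return e


print(trace_path([1, 1, 0, 1], [0, 1, 1, 0], start_in_a=True))
```

Exactly one of c and d is 1 at every position, so each e_i is a copy of either a_i or b_i.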
 
 As a final application of the ideas of this paper we mention
digital filtering. This topic has received a great deal of attention in
recent years. Our combinational results can be applied to nonrecursive
filters and our recurrence results can be applied to recursive filters in
rather direct ways. For more details about such filters see [9] or [10].
 
 
 REFERENCES

 [1] R. Brent, D. Kuck, and K. Maruyama, "The Parallel Evaluation of
 Arithmetic Expressions Without Division," IEEE Transactions on
 Computers, Vol. C-22, No. 5, pp. 532-534, May 1973.

 [2] S. C. Chen and D. Kuck, "Time and Parallel Processor Bounds for
 Linear Recurrence Systems," IEEE Transactions on Computers,
 Vol. C-24, No. 7, pp. 701-717, July 1975.

 [3] D. Kuck and Y. Muraoka, "Bounds on the Parallel Evaluation of
 Arithmetic Expressions Using Associativity and Commutativity,"
 Acta Informatica, Vol. 3, Fasc. 3, pp. 203-216, 1974.

 [4] R. P. Brent, "The Parallel Evaluation of Arithmetic Expressions
 in Logarithmic Time," Complexity of Sequential and Parallel
 Numerical Algorithms, J. F. Traub, ed., Academic Press, N.Y.,
 1973.

 [5] S. C. Chen, "Speedup of Iterative Programs in Multiprocessor
 Systems," Ph.D. thesis, Univ. of Ill. at Urb.-Champ., Dept. of
 Computer Science Report No. 694, Jan. 1975.
 (NSF-OCA-GJ-36936-000004).

 [6] A. Habibi and P. A. Wintz, "Fast Multipliers," IEEE Transactions
 on Computers, Vol. C-19, No. 2, pp. 153-57, Feb. 1970.

 [7] C. S. Wallace, "A suggestion for a fast multiplier," IEEE
 Transactions on Electronic Computers, Vol. EC-13, pp. 14-17, Feb. 1964.

 [8] L. Dadda, "Some schemes for parallel multipliers," Alta Frequenza,
 Vol. 31, pp. 319-356, March 1965.

 [9] W. D. Little, "An Algorithm for High-Speed Digital Filters," IEEE
 Transactions on Computers, Vol. C-23, No. 5, pp. 466-469, May 1974.

 [10] L. B. Jackson, J. F. Kaiser, and H. S. McDonald, "An Approach to the
 Implementation of Digital Filters," IEEE Transactions on Audio and
 Electroacoustics, Vol. AU-16, No. 3, Sept. 1968.
 
BIBLIOGRAPHIC DATA SHEET

 1. Report No.: UIUCDCS-R-75-775

 4. Title and Subtitle: Combinational Circuit Synthesis with Time and
 Component Bounds

 5. Report Date: December 1975

 7. Author(s): S. C. Chen and D. J. Kuck

 9. Performing Organization Name and Address:
 University of Illinois at Urbana-Champaign
 Department of Computer Science
 Urbana, Illinois 61801

 11. Contract/Grant No.: US NSF DCR73-07980 A02

 12. Sponsoring Organization Name and Address:
 National Science Foundation
 Washington, D. C.

 13. Type of Report & Period Covered: Technical Report

 16. Abstracts

 New results are given concerning the design of combinational logic
 circuits. We give time and component bounds for combinational circuits
 specified in several ways. For any sequential machine defined by linear
 recurrence relations, we discuss an algorithm for the synthesis of
 equivalent combinational logic. The procedure includes upper bounds on the
 time and components involved. We also discuss the transformation of
 nonlinear recurrences into combinational circuits. Examples are given
 using gates as well as ICs as components. These include binary addition,
 multiplication, and ones' position counting. The time and component bounds
 our procedure yields compare favorably with traditional results.

 17. Key Words and Document Analysis. 17a. Descriptors

 Binary addition
 Binary multiplication
 Circuit synthesis
 Combinational circuits
 Component bounds
 Sequential circuits
 Time bounds

 18. Availability Statement: Release Unlimited

 19. Security Class (This Report): UNCLASSIFIED

 20. Security Class (This Page): UNCLASSIFIED

 21. No. of Pages: 48