LIBRARY OF THE UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 510 .fc* Til for no. 752-757 cop. 2+ The person charging this material is re- sponsible for its return to the library from which it was withdrawn on or before the Latest Date stamped below. Theft, mutilation, and underlining of books are reasons for disciplinary action and may result in dismissal from the University. UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN ttjil 1 3 1976 JUL 1 2 REd} MAR 1 5 im FEB 1 o RE MAY 6 1981 L161 — O-1096 Digitized by the Internet Archive in 2013 http://archive.org/details/classofcompacthi756sten 7SC- UIUCDCS-R-75-756 / / , v- i- r; A CLASS OF COMPACT HIGH SPEED PARALLEL MULTIPLICATION SCHEMES by WILLIAM JOHN STENZEL September, 1975 UIUCDCS-R-75-756 A CLASS OF COMPACT HIGH SPEED PARALLEL MULTIPLICATION SCHEMES BY WILLIAM JOHN STENZEL September, 1975 Department of Computer Science University of Illinois Urbana, Illinois 61801 This work was supported in part by Grant No. US NASA NAS5-23334 and was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer science at the University of Illinois. m ACKNOWLEDGMENT The author would like to express his deepest gratitude to his thesis advisor, Professor W. J. Kubitz, for his interest, guidance, en- couragement, and, above all, patience over the past several years. The author would also like to thank Professor D. L. Slotnick for providing the invaluable opportunity to participate in this project, and who, through his constructive criticisms and questionable sense of humor, provided the stimulus for the development of the prototype multi- plier. Additionally, thanks are due to Marvin Graham and Gilles Garcia for their continuing advice and support and to Frank Serio and the E- lectronics Fabrication Group, without whom the construction of the proto- type would have been impossible. iv TABLE OF CONTENTS CHAPTER Page 1 INTRODUCTION 1 2 BASIC TECHNIQUES 3 3 M x M MULTIPLIERS 14 4 GENERALIZED COUNTERS 19 5 NUMBER OF LEVELS FOR REDUCTION 28 6 REDUCTION ALGORITHM 33 7 COMPARISON OF SEVERAL SCHEMES 40 8 IMPLEMENTATION OF MULTIPLIERS AND COUNTERS .... 47 9 PROTOTYPE MULTIPLIER 54 10 TESTING OF THE PROTOTYPE MULTIPLIER 63 11 CONCLUDING REMARKS 69 LIST OF REFERENCES 71 LIST OF FIGURES Figure Page 2.1 Diagramatic Representation of 6 x 6 Multiplication . . 4 2.2 Full Adder as Used to Reduce Columns of Bits .... 5 2.3 Wallace Tree Reduction for 6x6 Multiplication ... 6 2.4 2K ROM as 4 x 4 Multiplier 8 2.5 4 x 12 Bit Partial Product Matrix 9 2.6 12 x 12 Bit Partial Product Matrix 10 2.7 IK ROM as (5,5,4) Counter 11 2.8 12 x 12 Bit Partial Product Reduction Using (5,5,4) Counters 13 3.1 Partial Product Tree for 1 x 1 Multipliers 15 3.2 Partial Product Tree for m x m Multipliers 17 4.1 Some Generalized Counters 21 4.2 Effect of a Series of Adjacent Counters 22 4.3 Six Row Matrix Reduction Using (3,3,3,3,6) Counters . . 25 4.4 Maximally Efficient Counters with Equal Columns ... 27 5.1 Examples of Multilevel Reduction by (5,5,4) and (7,3) Counters 29 5.2 Effect of Mixing Counter Types Upon the Reduction Sequence 31 6.1 32 x 32 Bit Multiplication Using Only (5,5,4) Counters . 36 6.2 32 x 32 Bit Multiplication Using (5,5,4) Counters with (3,2) Counters in Last Stage 38 6.3 32 x 32 Bit Multiplication with Underutilized (5,5,4) Counters Replaced by (3,2) Counters 39 vi Figure Page 7.1 Characteristics of Some Multipliers and Counters . . 42 7.2 Plot of Propagation Delay vs. Word Size 43 7.3 Plot of Package Area vs. Word Size 44 7.4 Plot of Power Dissipation vs. Word Size 45 7.5 Plot of Cost vs. Word Size 46 8.1 Synthesis of (15,4) and (3,3,3,3,6) Counters ... 48 8.2 Combining Redundant Addresses 51 8.3 PLA-typc Implementation of (5,5,4) Counter .... 53 9.1 24 x 24 Bit Multiplier 55 9.2 Details of Power and Ground Bussing System .... 57 9.3 Multiplier Prototype Artwork; Component Side ... 58 9.4 Photograph of Multiplier Prototype; Component Side . . 59 9.5 Multiplier Prototype Artwork; Solder Side .... 60 9.6 Photograph of Multiplier Prototype; Solder Side 61 10.1 Multiplier Prototype Interface Unit Block Diagram 65 10.2 Photograph of Multiplier and Interface Unit .... 66 10.3 Interface Unit Control Signals 67 1. INTRODUCTION Various forms of parallel multiplication have been proposed. The schemes are roughly divisible into 2 classes—those which consist of an iterative array of cells [Advanced Micro Devices 25S05, Chung, Deegan, Fairchild 9344, Perzaris] and those which entail the generation of a ma- trix of partial product terms and the subsequent reduction of the matrix by means of pseudo adders [Dadda, Habibi, Singh, Svoboda, Wallace]. Array schemes are attractive in that they are fairly compact and often involve only one circuit type but their speed of operation increases linearly with the length of the operands to be multiplied and hence is slow for large words. Matrix generation-reduction schemes are much faster for large operands since their speed of operation increases with the log of the operand length. Traditional forms of the matrix generation-reduction scheme employ AND gates as 1 by 1 bit multipliers in forming the partial product matrix and use full adders to reduce the matrix; this form is not nearly as compact as the array scheme. Current LSI technology has made possible integrated circuits which can form partial products for operands larger than one bit [Texas Instruments 74S274] as well as ICs which can reduce larger portions of the matrices than is possible with full adders [Kingsbury, Texas Instruments 74S275] thus allowing fabrication of gener- ation-reduction type multipliers which rival array type multipliers in compactness. This paper attempts to deal with compact forms of generation-re- duction type multipliers. The consequences of employing larger partial product generation circuits and more general reduction circuits are con- sidered; possible implementations of these circuits are discussed. An algorithm for the design of multipliers using these circuits is presented and is used to obtain gross measures of merit for several multiplication schemes. Finally, the fabrication of a prototype 24 x 24 bit multiplier employing such circuits will be described and its performance will be evaluated. Note that all schemes described are for unsigned (i.e., sign- magnitude) multiplication. If 2's complement multiplication is desired, the algorithm presented by Baugh and Wooley may be used to generate ad- ditional partial product terms which will produce a 2's complement result when either added to the unsigned product or incorporated into the matrix and reduced along with the rest of the terms [Baugh]. 2. BASIC TECHNIQUES Consider two n bit unsigned binary numbers, A and B, of the n-1 .j n-1 form A= £ a.2 , B = £ 6.2 . We may form the product of these num- i=0 n i=0 n 2 bers by first calculating the n product terms p.. = a. * b. for i, j = 0, ..., n-1 with weighting factor 2 1 J , and then summing the n terms to form the 2n bit product. The form that this operation assumes is shown diagramatically in Figure 2.1. This is analogous to long-hand multipli- cation. The product terms, p.., may be readily computed by using AND gates as 1 x 1 bit mul tipliers--p. . = a. * b. = a.b.. Hence the diffi- culty lies in forming the sum of the matrix. The approach followed by Wallace and Dadda used full adders to reduce three bits of weight 2 to one bit of weight 2 and one bit of weight 2 . This is commonly termed Wallace tree reduction. This reduction is shown diagramatically in Fig- ure 2.2 as a dot matrix in which the three encircled input bits yield two output bits. The basis of the scheme is the repeated application of this reduction until a matrix with at most two terms present in a given column is obtained. This two row matrix is then reduced to a single row (i.e., the product) by means of a standard carry look-ahead adder. An example for the 6x6 case is shown in Figure 2.3. Note that the 2 trapeziodal representation of the n matrix has been altered to a tri- angular form. The matrix of height six has been reduced by full adders to height four, then by another set of adders to height three, and fi- nally to height two. Carry look-ahead adders reduce this to the final product. A B Figure 2.1 Diagramatic Representation of 6 x 6 Multiplication A "out Sum B in .i+1 2 1+1 2 1 Figure 2.2 Full Adder as Used to Reduce Columns of Bits LEVEL 1 LFVEL 2 1 LFVEL 3 CLA LEVEL PPODUCT Figure 2.3 Wallace Tree Reduction for 6x6 Multiplication This scheme may be improved by the use of a more sophisticated means of product term generation. There exist currently 256 x 8 read only memories (ROMs) (Figure 2.4) which may be programmed with the multi- plication tables for pairs of 4 bit operands. That is, the two 4 bit operands are input as an 8 bit address at which location their 8 bit pro- duct is stored. Hence, the ROMs may be used as 4 x 4 multipliers in generating 8 bit partial product terms, rather than employing AND gates as 1 x 1 multipliers. The applicability of this approach may be demonstrated by de- tailing the generation of the partial product matrix for a 12 x 12 bit multiplication. First consider the partial product generation for a 4 x 12 bit multiplication, as shown in Figure 2.5. Three terms of 8 bits each are generated (A B thru A ? B ), and are expressed more compactly in the form labeled AB . Now consider the partial product generation for the full 12 x 12 bit multiplication, as shown in Figure 2.6. It consists of three sets of terms similar to AB which may in turn be compactly ex- pressed in the triangular form shown which is 5 bits high at its deepest 2 point and 24 bits wide. The full n expansion would have been 12 bits high and contained exactly twice as many terms. The savings thus obtained in partial product generation leads one to seek similar savings in the matrix reduction. Accordingly, we may note that currently available IK x 4 bit ROMs may be programmed to treat the 10 address lines as 2 adjacent columns of 5 bits each and per- form a table lookup on the sum of all the bits (Figure 2.7). A ROM so ,i+3 A .A. i r »i+3 B >i+T Addresses 256 x 8 ROM Output A x B Figure 2.4 2k ROM as 4 x 4 Multiplier 2 1 X X X A X B X X Vo X X \ B o X X Vo X X X X X m o Figure 2.5 4x12 Bit Partial Product Matrix (Each 'X' represents four bits.) 10 XXX A XXX B X X X X AB Q X X X X X X AB- X X X X X AB 2 X X X X X X X X X X X X X X X X X X X X Figure 2.6 12 x 12 Bit Partial Product Matrix (Each 'X' represents four bits.) i+1 2 11 i r 2 IK x U ROM >i+3 Figure 2.7 Ik x 4 ROM as (5,5,4) Counter 12 programmed could be used to reduce a matrix in a fashion similar to the full adders of the previous example. Note that the maximum sum of the first column is five, and the maximum sum of the adjacent column is twice that, or ten; hence the maximum possible total sum is 15, which requires exactly 4 bits for its binary representation. This provides complete utilization of the 10 bit address field and of the 4 bit output field while providing a convenient tool for reducing large, symmetric portions of the matrix. This tool may be applied to the matrix of Figure 2.6 with the result shown in Figure 2.8. The first level in Figure 2.8 cor- responds to the full matrix and the second shows the resulting two row matrix. 13 Figure 2.8 12 x 12 Bit Partial Product Reduction Using (5,5,4) Counters 14 M x M MULTIPLIERS Consider the generation of the partial product tree for an n 2 by n bit multiplication using 1 x 1 bit multipliers. A total of n par- 2 tial products of one bit each are generated via n multipliers. These partial products align themselves in the triangular form shown in Figure 3.1 in which each dot represents one bit. The depth at the center of the matrix is n bits and the height of the i column (i.e., the column weighted by 2 ) is h. =■ i + 1 for i = 0, ..., n-1 i h. = 2n - 1 - i for i = n, ..., 2n-2. l Now let us consider partial product generation via m by m multi- pliers with m > 2. We may divide the n bit operands into n' = — segments of m bits (assuming for the sake of simplicity that n is divisible by m) 2 and form the partial product tree by taking the (n 1 ) crossproducts of the 2 segments using (n 1 ) multiplier modules. Each of the crossproducts con- tains 2m bits, hence the matrix contains a total of 2m(n') 2 ■ 2m(£) 2 = *£- m m bits. If we let each dot in Figure 3.1 represent an m bit segment rather than a single bit, the figure shows the alignment of the low order halves of the partial products. Since the height of individual columns is con- stant within m bit blocks, we may denote the height of a block of m columns V 15 2n - 1 bits < Figure 3.1 Partial Product Tree for 1 x 1 Multipliers 16 (for which the lowest order column has weighting 2 ) as g.. If we con- sider only the contributions of the low order portions of the partial products, the height of the blocks of columns will be given by g! = i + 1 for i = 0, ..., n'-l g. = 2n' - 1 - i for i = n' , ..., 2n' -2. Now let us take into account the effect of the high order portion of the products. Note that if the low order portion of a product contributes to block i-1 its high order will contribute to block i. This means that the total height of block i is equal to the number of low order segments in block i-1 plus the number of low order segments in block i. Denoting the total height of block i as g! this gives g'. = g. + g. , for i = 0, . . . , 2n' -1 'i-1 or g! = (i + 1) + i = 2i + 1 for i = 0, ..., n'-l g^ = (2n' - 1 - i) + (2n' - i) = 4n' - 1 - 2i for i = n' , ..., 2n'-l. The matrix has the form shown in Figure 3.2, i.e., the height at the extreme high and low order blocks of the matrix is 1 and the height in- creases by 2 ewery m bits moving toward the center. The height at the center is 17 2n bits or 2n' segments /N * 2n' - 1 ^ X X X X X X X X X X 7 X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X Figure 3.2 Partial Product Tree for m x m Multipliers (Each 'X' represents m bits.) 18 2n' -1=^-1. m The height of the individual columns may be equivalently expressed as h. = 2 I 1 !- 1 for i = 0, ..., n-1 l l_mj l |aL-LJij tl . = 2 v— ■ ll\ + ! for i = n, ..., 2n-l where h. denotes the height of the column with weighting 2 . 19 4. GENERALIZED COUNTERS Dadda has introduced the notion of a (c,d) counter as a com- binatorial network which receives c bits of equal weight (i.e., a single column of c bits) as input and produces a d bit word corresponding to their sum as output. A full adder, for example, would be termed a (3,2) counter. The value of the output, then may be expressed as m-1 v- L b 1 i=0 n where b. denotes the binary value of bit i of the input column and v de- notes the value of the d bit output. Note that the number of output bits must be sufficient to represent all possible sums of c bits, and hence must satisfy the condition 2 d - 1 > c. This class of counters may readily be extended to include coun- ters which receive several successively weighted input columns and produce their sum, taking the weighting into account. We may denote counters of this type as (c k-T c k-2' '••' c 0' d) counters, where k is the number of input columns, c. is the number of in- put bits in the column of weight 2 , and d is the number of bits in the output word. The value of the output may be expressed as 20 k-1 V 1 v - I I b 2 1 i=0 j=0 J where b. . denotes the value of bit j in column i. The number of bits in the output word must again be sufficient to represent the largest possible sum, hence d is subject to the constraint that 2 d - 1 > I c. 2 1 . i=0 n Examples of several counters are shown in Figure 4.1 in dot matrix form. The encircled dots represent the configuration of the input bits, and are followed by the resultant output bits. Consider now the effect of a series of counters acting on ad- jacent sets of input columns, as shown in Figure 4.2. The inputs to the counters are shown first, followed by the resulting output bits, followed in turn by an equivalent but more compact expression of the output bits. Let us refer to the number of resulting output rows as s. We see, then, that a series of (7,3) counters can reduce a matrix 7 rows high to a ma- trix 3 rows high (s = 3), or a series of (5,5,4) counters can reduce 5 rows to 2 rows (s = 2). Note also that a string of (2,2,2,3,5) counters can reduce 3 rows to 2 rows by virtue of the fact that an extra input bit is consumed in the low order position, compensating for the carry out of the previous counter. Let us now focus our attention on counters with input columns of equal height. Such counters provide a convenient tool for reducing regular portions of a matrix and, in general, are more interesting and useful than irregular counters. The regularity of the counters permits 21 • • (3,2) (7,3) (5.5.U) n • • • • • • • • (2,2,2,3,5) (3,3,3,3,6) Figure 4.1 Some Generalized Counters 22 (7,3) (5.5.U) (3,2) • • • • • • • (2,2,2,3,5) nn n • • * I * * * * I * (3,3,3,3,6) Figure 4.2 Effect of a Series of Adjacent Counters 23 us to make the following observations about them, which are not necessar- ily true of unequal column counters. A counter with equal height columns will consume a rectangular matrix segment of k columns by r rows, where r = c k-l = c k-2 = •" = V As Figure 4.2 shows, a string of counters produces d bit outputs at inter- vals of k bits. The outputs align themselves such that no more than | r- outputs contribute to a given column, hence the number of rows, s, of the output matrix is simply s = A Thus the height of the resulting output matrix is determined by the number of input and output columns, subject to the constraint that the sum of the r input rows is representable in d bits. Note that the number of re- sultant output rows has a direct bearing on the number of stages of coun- ters needed to reduce a large matrix (as explained in the next section), hence it is desirable that the number of output rows be small. If we let v = 2 k - 1 r denote the maximum possible value of a single input row and let v Q - 2 - 1 denote the maximum representable output value, the constraint on the number of input rows may be expressed as 24 or v - r v o r r < _o = 2 - 1 r 2 - Note that if d is not divisible by k, the output matrix will be somewhat sparse, e.g., the output of a string of (3,3,3,3,6) counters consists of alternate 2 bit segments of height 1 and height 2. This may be used to advantage in certain situations, such as the reduction of a six row matrix by means of (3,3,3,3,6) counters. The reduction is accom- plished by stacking two strings of counters, one atop the other, as shown in Figure 4.3. Note that direct alignment of the counters produces a ragged matrix of height 4, but that skewing the 2 strings of counters by 2 bits produces a matrix which is uniformly 3 bits high. The advantage of producing an output matrix of uniform height is, of course, that it can in turn be easily reduced by a single counter type. These considerations lead us to the notion of a maximally effi- cient counter type as a counter which produces a uniform output matrix while consuming the largest possible regular portion of a matrix, i.e., it should have equal columns with the largest possible number of rows. Disregarding cases where strings of counters may be stacked and skewed as in the above example, this means that the number of output columns should be a multiple of the number of input columns, or d = s k. 25 • • Figure 4.3 Six Row Matrix Reduction Using (3,3,3,3,6) Counters 26 The counter will consume the largest possible portion of the matrix when v = r v o r so that r . Zo . 2^_I = 2 sk - 1 = 2 (s-l)k + 2 (s-2)k + _ + ^ v r 2 k - 1 2 k - 1 A table of such maximally efficient counters is shown in Figure 4.4 for the first few values of k and s. The only currently implementable such maximally efficient counters having equal input columns are the (3,2), the (7,3), the (5,5,4) and possibly the (15,4) counters. The number of input rows increases rapidly with both the number of input columns and output rows, hence any implementation of counters with a larger k or s than those mentioned above would probably be less than maximally effi- cient. 27 INPUT Columns Rows k r 1 3 2 5 3 9 4 17 5 33 1 7 2 21 3 73 4 273 1 15 2 85 3 585 1 31 2 341 denotes currently realizable counters OUTPUT Columns Rows d s 2 2 4 2 6 2 8 2 10 2 3 3 6 3 9 3 12 3 4 4 8 4 12 4 5 5 10 5 Figure 4.4 Maximally Efficient Counters with Equal Columns 28 5. NUMBER OF LEVELS FOR REDUCTION In the previous section it was shown that a matrix r bits high could be reduced to one s bits high through the use of one level of coun- ters. We may now turn to the issue of the number of levels of counters required to reduce a matrix more than r bits in height to one of s bits. Let us denote the maximum height of the matrix which may be reduced to s bits using j levels of counters as 1.. Note that l n = s and 1, = r. 3 J 1 Knowi ing 1 . we may determine 1 . , by observing that the 1 . bits represents the output of a stack of 1 M e n< counters each consume r bits, so counters at level j+1 plus a residue of (1.) mod s bits which were not reduced by counters (Figure 5.1). The l j+1 ■ r |ji| ■♦ (Ij) mod s. For a (5,5,4) counter, then, we have the maximum reduction height sequence 2, 5, 11, 26, 65, ... and for a (7,3) counter we have the sequence 3, 7, 15, 35, 79, ... A rough approximation of the first few terms of the sequence is s, s(f), s(^) 2 , s(f) 3 , ... hence the number of levels of counters required to reduce a matrix h bits high to s rows is roughly log (h) levels. & Note that not all counters give a reduction to 2 rows--the (7,3) counter for example will give a reduction to only 3 rows. The final 29 • • • (5,5,M sequence . . . . • ♦ ♦ * -^m (7,3) sequence Figure 5.1 Examples of Multilevel Reduction by (5,5,4) and (7,3) Counters 30 reduction to 2 rows must be accomplished by means of an additional stage of counters, e.g., (3,2) or (2,2,2,3,5) counters. Note also that other sequences may be obtained by combining different types of counters. For example, if the final level of a series of (5,5,4) counters is followed by an additional level of (3,2) counters, the final reduction height and hence the first term of the sequence for the (5,5,4) counters becomes 3 rather than 2. The sequence for the (3,2) level is 2, 3 and for the (5,5,4) levels is 3, 6, 15, 36, 90, ... giving an overall sequence of 2, 3, 6, 15, 36, Thus, if we wish to employ (5,5,4) counters for the reduction of a matrix which is initially 15 bits high to 2 rows we may use either 3 levels of (5,5,4) counters or 2 levels of (5,5,4) coun- ters followed by one level of (3,2) counters. Assuming that (3,2) coun- ters have a shorter propagation delay than (5,5,4) counters, the second approach will be slightly faster than the first. Similarly, we may insert a level of (7,3) counters between a final level of (3,2) counters and succeeding levels of (5,5,4) counters (Figure 5.2). The sequence for the (3,2) level is 2,3 for the (7,3) level its 3, 7 and for the (5,5,4) levels it is 7, 16, 40, 100, ... Combining these, we get an overall sequence of 2, 3, 7, 16, 40, 100, ... 31 • • • • i — m (3,?) (7,3) (5.5.U) Figure 5.2 Effect of Mixing Counter Types Upon the Reduction Sequence 32 Hence we may in some sense tailor a sequence to fit the requirements of a particular matrix by mixing various counter types. 33 6. REDUCTION ALGORITHM The heuristic rule for the reduction of a matrix with (3,2) counters as presented by Dadda is directly applicable to reduction by generalized counters. Dadda's rule was to find the largest term 1. in the counter's sequence which is less than the original matrix height and to reduce the matrix only to height 1.. Each successive level of counters then reduces the matrix to the height given by the next smaller term in the series, i.e., 1., 1 .,,..., l fl . In the more general situ- ation the sequence may be characteristic of more than one type of coun- ter, as in the examples of the previous section. The actual placement of the counters at a given reduction level is determined by traversing the matrix from right to left, inserting counters as needed to reduce columns to the targeted reduction height. The entire process may be stated algorithmically as follows (it is as- sumed that the choice of counters has been made, and that the reduction sequence is known): Denote the desired word length as n, the multiplier module size as m (i.e., an m by m multiplier), and the counters at a given level as (c, , , •••c r) , d) as before, where k is the number of input columns, d is the number of output columns, and s = h- is the number of output rows. Further let h. for i = 0, ..., 2n-l denote the current height of column i of the matrix and let the variables H and T denote the matrix height before and 34 after the reduction, respectively; i.e., T is the targeted reduction height for a given level. 1. Initialize h. and H as follows: i a) if m = 1 h. = i + 1 for i = 0, ..., n - 1 h. = 2n - 1 - i for i = n, ..., 2n - 1 H = n b) if m > 2 h i = 2 |l| + 1 for i = 0, ..., n - 1 h . = 2 [< 2n "J - ^j + 1 for i = n, ..., 2n - 1 H=2n-1 m 2. Set T to the largest term in the reduction series which is less than H, i .e. , T = 1 . for j such that 1 . < H and J J Vi J H 3. If H = 2 then terminate the algorithm; otherwise perform step 3a for i = 0, ..., 2n - 1. a) If h. - T do nothing; otherwise insert a counter at this point, adjust the column heights accordingly, and repeat step 3a for the new height of the same 35 column, i. The rules for adjusting column heights are: h i+j = max [T, (h i+ . - c. + 1)] for j = k - 1 h. . = h... +1 for j = k, ..., d - 1 4. Set H to T and T to the next smaller term in the reduction series. This term is given by T = s 1-1+ (T) mod r LSh < T) where the s and r values characterize the counters which will be used at the next level. 5. Go to step 3. This process lends itself to solution via computer. Accordingly, subroutines have been written which design multipliers using various com- binations of counters for word sizes of 4 to 64 bits. These routines are used to determine the numerical results presented later. An example of a reduction is shown in Figure 6.1 in which 32 x 32 bit multiplication is performed using 4 x 4 bit multipliers for partial product generation and (5,5,4) counters for reduction. The height of the matrix is initially 15 and the counter's sequence is 2, 5, 11, 26. Hence the matrix is first reduced to height 11, then to height 5, and finally to height 2. Note that in some cases a (5,5,4) counter has been underutilized so that the reduction proceeds only to the de- sired height. 36 Figure 6.1 32 x 32 Bit Multiplication Using Only (5,5,4) Counters 37 In the previous section it was shown that a reduction from 15 rows to 2 rows could be accomplished by using 2 levels of (5,5,4) counters and one level of (3,2) counters. Hence an alternative reduction scheme is presented in Figure 6.2. We follow the same rules for reduction after altering the sequence to 2, 3, 6, 15. The matrix is reduced first to height 6 and then to height 3 by (5,5,4) counters, and finally to height 2 by (3,2) counters. If there is a need to economize on the number of large counters (e.g., if the cost of a (5,5,4) counter is much larger than that of a (3,2) counter) a slight minimization can be obtained by using smaller counters to re- place underutilized large counters wherever possible. The effect of such a minimization is shown in Figure 6.3. 38 Figure 6.2 32 x 32 Bit Multiplication Using (5,5,4) with (3,2) Counters in Last Stage 39 ,.J. Figure 6.3 32 x 32 Bit Multiplication with Underutilized (5,5,4) Counters Replaced by (3,2) Counters 40 7. COMPARISON OF SEVERAL SCHEMES The algorithm presented earlier for the mechanical partioning of multipliers is well suited to solution via computer and has, in fact, been programmed to generate partitions for multipliers using various combinations of counters and multiplier modules. The counters and multi- plier modules involved are all currently available as off the shelf items in TTL implementation. Hence we may use the characteristics of the individual components in conjunction with the component counts pro- duced by the respective multiplier generating programs to obtain real- istic measures of speed, power consumption, relative circuit board area, and cost for a multiplier as a whole. Four schemes will be compared in this fashion. They are: 1) Generate partial products with 4x4 multiply modules and reduce with both (5,5,4) and (3,2) counters, using the (3,2) counters wherever possible (similar to Figure 6.3). 2) Generate matrix with 4x4 multiply modules and reduce with both (7,3) and (3,2) counters, using the (3,2) coun- ters wherever possible. 3) Generate matrix with 4x4 multiply modules and reduce with (3,2) counters exclusively. 4) Generate matrix with 1 x 1 multiply modules and reduce with (3,2) counters exclusively. The components involved are in practice implemented in several versions of TTL (i.e., standard TTL, Schottky clamped TTL, and high speed 41 TTL) hence, for purposes of normalization, their propogation delays are characterized in terms of equivalent logic levels when modeling the multipliers. For purposes of reference, all logic levels may be taken as the standard TTL gate delay of 10 nsec. The 1 x 1 multiplier is characterized as having 2 logic level s--a NAND followed by an inversion-- and the (3,2) counter as having 3 logic levels--a level of inversion fol- lowed by a 2 level AND-OR-INVERT tree. The 4 x 4 multiplier, as well as the (5,5,4) and (7,3) counters are characterized as ROMs having 6 logic levels--3 levels of row address inversion and decoding and one level each for bit sensing, column selection, and output buffering. An attempt has also been made to normalize the power consumption to that characteristic of a standard TTL implementation of the ICs. It is assumed that the cir- cuits are packaged in standard dual inline packages (DIPs) and that sev- eral components are housed in a single DIP when possible (e.g., two (3,2) counters or four 1 x 1 multipliers in a single DIP). This permits us to obtain a figure for the total area occupied by the ICs themselves, which is a good indication of the total area of a circuit board for a given multiplier relative to other multipliers. The cost figures were taken from current distributors catalogs, but are somewhat unrealistic in that IC prices fluctuate with the particular distributor, the quantity involved, the current state of the market, etc. A table of the charac- teristics of the various ICs is given in Figure 7.1 and the results of the modeling of the various schemes are presented in Figures 7.2 through 7.5. 42 to o o o o o o 00 o o CM o o IT) CM 5 s- • CD E o c D_ T- o CM CM O <0 O LO 00 CO o LO 00 CD CD _*: CD o s- Q. CM CM CM CM O CO to J- CD +-> C o o -a c D- O <0 *0 O CM CU CD Q_ CM «3" r- Q. 4-> 3 E O oo O l/> •r- +J «/> •r- S- O fC s- to U r- •r- CU CD > O CD CO tO to CM to CD s- Z5 CD «3- **— ■* *<■■■«* #* CM CO LO *N A •N co r^ LO X «3- 43 DELAY (usee. ) 4 uoo -- 300 -- 200 -- 100 -- -► n 12 2U 36 U8 60 Figure 7.2 Plot of Propagation Delay vs. Word Size 44 AREA (in. ? ) 1 I llUO - t)i SOO ;O0 300 - *l L_'OJ - r - 2 J 00 - — i — -Tr "i**^— H 1 1 1 1 I 1 1 — H 1 1 -^- 12 2 1* 36 1*8 60 Figure 7.3 Plot of Package Area vs. Word Size 45 ■?oo 300 100 -- Figure 7.4 Plot of Power Dissipation vs. Word Size 46 9000 root' 5000 3000 1000 12 2U 36 l>8 60 Figure 7.5 Plot of Cost vs. Word Size 47 8. IMPLEMENTATION OF MULTIPLIERS AND COUNTERS There are several avenues of approach to the implementation of multipliers and counters. The logic for the desired function may be im- plemented directly by means of gates, or the function may be simulated using read only memory techniques. Direct implementation is feasible in counters only when the height of the input columns are small, (e.g., the traditional full adder (3,2) counter) and carry look ahead adder ((2,2,2, 3,5) counter). Additionally, small units may be used as functional modules in constructing larger units by means of either LSI or hybrid techniques. Figure 8.1 shows a (15,4) counter synthesized via (3,2) and (5,5,4) counters, and a (3,3,3,3,6) counter synthesized via (3,2) and (2,2,2,3,5) counters. This approach may also be applied to large multi- pliers—an 8x8 multiplier may be synthesized from 4x4 multipliers and carry look ahead adders using hybrid technology and arrays of gated full adders have been used in fabricating monolithic multipliers as large as 8 x 8 bits. [Breuer] Table lookup schemes are practical for m x m multipliers and larger counters. In its purest form this type of scheme utilizes a stand- ard ROM with one address bit corresponding to each input bit of the coun- ter or multiplier and eyery input configuration addressing a distinct word in memory. A ROM simulating a device with i input bits and j output bits must have a total of 2 x j bits of memory. The implementation of multipliers and counters via standard ROMs is especially attractive in that the time and expense of designing and developing a new chip are 48 (3,2) (5.5,10 (3,2) (2,2,2,3,5) (15,10 (3,3,3,3,6) Figure 8.1 Synthesis of (15,4) and (3,3,3,3,6) Counters 49 avoided--one need only program the appropriate pattern into an existing type of chip. High speed TTL memories are currently available in organi- zations as large as Ik x 4 and 256 x 8 permitting the implementation of such units as 4 x 4 multipliers, (7,3) counters, (5,5,4) counters, (10,4) counters, and (2,2,3,3,6) counters. The propagation delay of these units is simply the addressing delay of the memory—typically 6-8 logic level delays. One drawback of the standard ROM approach is the high degree of redundancy in the stored information. Consider the case of a ROM imple- mented (7,3) counter. There are seven distinct configurations of inputs which correspond to the output value 1 meaning that the same output word, 1, must be stored at seven distinct locations. Clearly a reduction in the amount of storage required could be achieved if the input configura- tions could somehow be mapped into classes such that each member of a class referenced the same memory location. A scheme for such a mapping which incorporates residue threshold functions has been proposed by Ho and Chen. [Ho] The scheme was presented in conjunction with single in- put column counters (they demonstrate its applicability to the (7,3) counter) but it is also useful for certain multiple column counters as will be shown in the ensuing sketch of their scheme* The standard ROM implementation of a (5,5,4) counter requires 10 address bits and 4 output bits for a total of 2 10 x 4 = 4096 bits of memory. The internal organization of the memory utilizes some 50 form of coincidence addressing, e.g., row and column selection. Let the row addresses of the memory correspond to the 32 possible bit configura- tions of the low order input column of the counter, and the column ad- dresses of the memory correspond to the 32 configurations of the high order input column of the counter. This means that every row address represents a sum of either 0, 1, 2, 3, 4, or 5 in the low order input column and that ewery column address corresponds to a sum of either 0, 2, 4, 6, 8, or 10 in the high order input column. Note that the OR of all row address lines corresponding to a particular sum conveys as much information about that sum as the constituent address lines, and simi- larly for the column addresses. Hence, by ORing appropriate address lines we may condense 32 lines into 6 lines, as shown in Figure 8.2, and rather than a 32 x 32 matrix for each output bit we now have a 6 x 6 matrix. The intersection of the row and column lines corresponds to their sum, hence coincident selection of a row line with value v and a column line with value v implies an output value v of v = v + v r c Note that a particular output bit f. with weighting 2 1 will contain a 1 for a given value v if and only if 2 1 < (v) mod 2 i+1 e.g., f Q is 1 for 1, 3, 5, 7, ...and f, is 1 for 2, 3, 6, 7, 12, 13, ... This means that the memory matrix for a given output bit f . may be pro- grammed by testing each v and v line for the condition 51 00000 00001 00010 00100 01000 10000 ADDRESS LINES CORRESPONDING SUM 11111 Figure 8.2 Combining Redundant Addresses 52 2 1 < (v + v ) mod 2 r c i+1 and inserting a 1 at their intersection if the condition is satisfied and a otherwise. The complete matrix for a (5,5,4) counter programmed in this fashion is shown in Figure 8.3. Since bit f Q is independent of the high order input column its matrix may be replaced by the OR of row lines 1,3, and 5. The amount of storage required, then, is 6 x 6 x 3 = 108 bits. This represents a savings by a factor of nearly 40 over the 4096 bits required for direct implementation at the expense of one extra logic level in decoding. Some technologies allow this extra logic level to be implemented with no significant extra proagation delay, e.g., ECL collector dotting. Note that this type of implementation bears great similarity to programmed logis arrays (PLAs). In fact, the most basic form of a PLA consists of a stage in which inputs are selectively ANDed together to form a number of product terms and a second stage in which the products are selectively ORed to form a number of sum of product terms. The con- densed implementation discussed above may then be viewed as simply a first PLA which encodes the input column into the proper value, fol- lowed by either a ROM or a second PLA which effectively adds the column values to produce the output word. 53 1 LOW 2 ORDER 1 SUM It 5 HIGH ORDER SUM (WEIGHTED BY 2) U 6 8 10 2 L-k k^k OUTPUTS Figure 8.3 PLA-Type Implementation of (5,5,4) Counter 54 9. PROTOTYPE MULTIPLIER A prototype 24 x 24 bit multiplier incorporating ROM implemented 4x4 multipliers and (5,5,4) counters as well as standard (2,2,2,3,5) counters has been fabricated as a portion of a floating point arithmetic unit. The arithmetic unit is a processing element for a large scale ar- ray computer considered in a design feasibility study sponsored by NASA. For a description of the processing element and the global aspects of the computer see [Graham], The multiplier circuit was hand optimized to reduce the number of (5,5,4) counters and to minimize the propagation delay. The design is shown in Figure 9.1. Note that the carries from the (2,2,2,3,5) counters are propagated horizontally in some cases, as indicated by arrows between the counters. Schottky TTL integrated circuits were used, necessitating care in coping with the high power consumption and fast rise times character- istic of the Schottky family. The high power consumption requires that the V and ground distribution system be able to carry large DC currents (about 10A.) and exhibit low inductance in the 10 - 100 MHz range in order to minimize the transients characteristic of saturating logic. The fast rise time of the logic necessitates care in board layout to avoid transmission line effects and crosstalk between reduction levels. Due to fabrication limitations, we were restricted to two-sided boards with maximum dimensions of 15 x 18 in. The multiplier circuit contains 90 ICs, hence a 15 x 18 in. board allows three square inches of board area per package. While this is normally sufficient area for 55 -M CO *3- CM CTl cn 56 standard TTL, it was feared that the power bussing requirements and the complexity of the package interconnections would force a two board layout. In view of this, a novel power distribution system was devised by Mr. Frank Serio; two thin sheets of copper were etched, one for power, one for ground, to form a one piece system of thin strips which lie underneath the rows of integrated circuits and between their pins. This power bussing system need not be etched onto the PC board, allowing higher density and greater flexibility in the board layout. Figure 9.2 shows a schematic and a cross section of the bus system. A trial layout of the nine most tightly interconnected integrated circuits was attempted with promising results, and a decision was made to fabricate the entire circuit on one board. The layout of the board was done entirely by hand, the package placement being settled upon after several increasingly successful iter- ations. Due to its complexity, the interconnection specification for the board's artwork was in the form of a wiring list rather than the usual logic diagram form. In its final form, the board consists of a horseshoe type arrangement of integrated circuits with the input lines running up the center of the horseshoe and the output lines running down the outside, as shown in Figures 9.3 and 9.4. The thirty-six 4x4 multiplier ROMs immediately surround the input lines and are fed by horizontal connections on the back side of the board, as shown in Figures 9.5 and 9.6. The packages for the first level of reduction surround the 4x4 multiplier ROMs and are in turn surrounded by the second level and the carry look ahead adders with their associated carry propagate units. This layout 57 PRINTED CIRCUIT BOARD — ' DUAL IN-LINE PACKAGE %///////////zzz% , zzzzzzzzzzzz z* i &ZZZZZZZ2L GROUND i Vcc CONNECTION ETCHED 'ON THE BOARD INSULATING TAPE Figure 9.2 Details of Power and Ground Bussing System 58 Figure 9.3 Multiplier Prototype Artwork; Component Side C O 0) j- 60 Figure 9.5 Multiplier Prototype Artwork; Solder Side Figure 9.6 Photograph of Multiplier Prototype; Solder Side 62 tends to minimize signal path length and also limits crosstalk between reduction levels because the signals flow uniformly from the center of the board toward the edges. 63 10. TESTING OF THE PROTOTYPE MULTIPLIER The testing of the multiplier proceeded in two phases--one phase in which the steady state integrity of the circuit was established and a second in which the dynamic aspects of the circuit were examined. The only problem uncovered in the first phase of testing was the omission of approximately ten connections to the ground bussing sys- tem—this was a minor problem and was quickly remedied. Since there are 48 2 possible input configurations, direct verification of the correctness of the outputs of the circuit by exhaustive testing is not feasible. In fact, at the rate of one microsecond per test on the order of nine years of continuous testing would be required. Instead, a set of inputs for which the outputs are easily calculated by hand were used. For example, multiplications in which one of the operands contains only one nonzero bit or multiplications of two contiguous blocks of '1' bits. This con- stitutes a good test of the correctness of the signal interconnections on the circuit board, but does not provide exhaustive testing of the contents of the read only memories. Hence the ROMs were screened in advance by means of a master-slave type test and several defective ROMs were discarded. The dynamic phase of the testing investigated such parameters as the propagation delay of the circuit, the power and ground noise and the rise times and transient behavior of the signals. An interface unit was constructed to facilitate manual input of the two operands and visual inspection of the results via toggle switches and LEDs. Both the 64 inputs and outputs were buffered via D-type registers. This permits determination of the propagation delay of the circuit by clearing the input register, clocking the data into the input register, and, after an adjustable interval, clocking the output register. The interval between clocking the input and output registers is adjusted to the smallest value such that the correct answer is displayed on the LEDs. A block diagram and photograph of the test interface unit appears in Figures 10.1 and 10.2, respectively. The clear and clock signals are generated by employing a simple ring counter to split a master clock into three out-of-phase clocks, each at one-third the frequency of the master. The relationship of the control signals is shown in Figure 10.3. It should be noted that this scheme includes input and output register settling times as well as delays in the cabling between the multiplier board and the buffer registers when measuring the propagation delay. Initial measurements of the worst case delay were about 380 nsec. Inspection of the signal quality, however, revealed very slow rise times and a great deal of crosstalk in the lines between the input register and the multiplier. It was determined that this was due to capacitive loading between signal lines both in the input cables and in the paths feeding the 4x4 multiplier ROM stages (i.e., the lines at the center of the horseshoe). The cables employed were approximately 18 in. long flat flexible cables with conductors spaced on .05 in. centers. Alter- nating grounds, unfortunately, were not employed. The corresponding input lines in the center of the board were also spaced on .05 in. centers, the intervals between paths being about equal to the path width, and 65 i_ u >s € ^ 3 O o M T3 Uj o. O o _| M « u • — K ^^•^ Q. o M 1 1 U o o 1 t> •1 Q. ^ 3 2 r i O 0) +J O) Q. o O S- E O. ro S- i- cn Oi ro •i— »r- i— Q Q. •r- -* ■•-> O i— o 3 r— 5: co 0) J- 3 24 23 22 21 20 19 18 1 * 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 * * * * •«♦• ••*« I f f I * « * * • • I A 24 23 22 21 20 16 15 14 13 12 11 10 9 4(21 | | « • | | * « *««4 A *««* * « « « m m m i i! «7 46 45 44 43 42 41 40 39 38 37 * 36 35 34 33 11 30 29 28 27 26 ?5 24 23 22 21 20 19 18 17 16 15 14 13 12 UK 8 7 6 5 Fiqure 10.2 Photograph of Multiplier and Interface Unit 67 c en •r— O o o 4-> •r™ C o to s- 4-> o s- 3 CA k- 0> (A S> o> q: 3 Q. C (A (A 5> v cr 3 Q. 0) IA '5» 3 Q. (A O O 0> o o o o 68 again alternating grounds were not used. The crosstalk problem was attacked by placing a ground plane over the input lines on the top side of the board. The ground plane consisted of an insulating layer of electrical tape covered by a layer of copper foil, which was adequately grounded at several points. This reduced the circuit's propagation delay somewhat and improved the risetimes and crosstalk on the inputs so a similar ground plane was placed over the corresponding area on the back side of the board and the input cables were shortened to about 10 in. and shielded with copper foil. Adequate ground returns were also established between the multiplier and buffer circuits and the standard TTL buffer registers were replaced with their high speed Schottky equiv- alents. The net effect of these improvements was to lower the propa- gation delay to about 200 nsec. The sum of the typical delay as specified by the manufacturer for components along the worst case path is about 190 nsec, while the sum of the rated maximum delays is 260 nsec. This figure is well within the expected range of performance. Inspection of the power distribution system during operation revealed yery little noise, attesting to the efficacy of the bussing system. The signal quality on intra-board paths was found to be excel- lent with fast rise times and yery little ringing, although the 4x4 multiplier ROMs tended to generate spurious output pulses as their ad- dresses changed. This seems to be an inherent feature of the memories and is not a serious problem although the circuit would probably operate slightly faster (by 10 - 20 nsec.) had the pulses not been present. 69 11. CONCLUDING REMARKS The partial product matrix generation-reduction schemes of Wallace and Dadda may be enhanced through the use of multiplier modules larger than 1 x 1 bit and counters larger and more general than (3,2) counters. The larger multiplier modules permit the generation of a partial product matrix containing fewer total bits and having a maximum height less than that generated by 1 x 1 multipliers and, with current implementations, require fewer IC packages to do so. Larger, more com- plex counters permit reduction of matrices in fewer stages of counters with fewer IC packages than the simpler (3,2) counters while maintaining a similar number of total logic levels. The net result is a multiplier with the speed of Dadda 's scheme and the compactness of current imple- mentation of array multipliers. [Advanced Micro Devices 25S05, Fairchild 9344] Several large counters and multiplier modules have been realized as TTL integrated circuits and are also feasible as ECL circuits. The feasibility of their use in multiplication schemes has been demonstrated by the fabrication of a 200 ns. 24 x 24 bit prototype. Advances in LSI and hybrid technology should make possible still larger counters and multiplier modules. The algorithm for the logical design of multipliers incorpo- rating generalized counters and multiplier modules is extremely straight- forward. The physical implementation and maintenance of such multipliers should be facilitated by their compactness and hence should require less 70 time and expense than comparable multipliers employing Dadda's scheme, especially for large (e.g., 64 bits) words. This type of multiplier then, is attractive for a variety of applications, ranging from fast floating point mini and midi computers to large scientific machines. 71 LIST OF REFERENCES Advanced Micro Devices, Data Book , 1974. Baugh, C. R. and Wooley, B. A., "Two's Complement Parallel Array Multi- plication Algorithm," IEEE Trans. Comput. , Vol C-22, pp 1045- 1047, Dec. 1973. Breuer, D. R., "A High Speed Monolithic 8x8 Multiplier," unpublished correspondence. Chung, T. J. and Bedrosian, S. D., "Fast Digital Multiplier Based on Iterative Cellular Arrays of ROMs," unpublished correspondence. Dadda, L., "Some Schemes for Parallel Multipliers," Alta Frequenza , Vol 19, pp 349-356, May 1965. Deegan, I. D., "Cellular Multiplier for Signed Binary Numbers," Electronics Letters , Vol 7, pp 436-437, July 29, 1971. Fairchild Semiconductor, TTL Data Book , June 1972. Graham, M. L., "The Design and Logic Simulation of an Array Computer," Ph.D. Thesis, University of Illinois, 1975. Habibi, A. and Wintz, P. A., "Fast Multipliers," IEEE Trans. Comput. , Vol C-19, pp 153-157, Feb. 1970. Ho, I. T. and Chen, T. C, "Multiple Addition by Residue Threshold Func- tions and Their Representation by Array Logic," IEEE Trans. Comput. , Vol C-22, pp 762-767, Aug. 1973. Kingsbury, N. G., "High Speed Binary Multiplier," Electronics Letters , Vol 7, pp 277-278, May 20, 1971. Majithia, J. C. and Kitai, R., "An Iterative Array for Multiplication of Signed Binary Numbers," IEEE Trans. Comput. , Vol C-20, pp 214-216, Feb. 1971. Pezaris, S. D., "A 40 ns 17-Bit by 17-Bit Array Multiplier," IEEE Trans. Comput. , Vol C-20, pp 442-447, April 1971. Singh, S. and Waxman, R., "Multiple Operand Addition and Multiplication," IEEE Trans. Comput. , Vol C-22, pp 113-120, Feb. 1973. Svoboda, A., "Adder With Distributed Control," IEEE Trans. Comput. , Vol C-19, pp 749-751, Aug. 1970. 72 Texas Instruments, Supplement to the TTL Data Book , 1974. Wallace, C. S., "A Suggestion for a Fast Multiplier," IEEE Trans. Elect. Comput. , Vol EC-13, pp 14-17, Feb. 1964. BIBLIOGRAPHIC DATA SHEET 1. Report No. UIUCDCS-R-75-756 3. Rei ipienl 'i Ac < . ssion N< 4. 1 11 le .mil Sunt it k- A CLASS OF COMPACT HIGH SPEED PARALLEL MULTIPLICATION SCHEMES 5- Report Date September 191"? 7. Author(s ) William John Stenzel 8. Performing Organization Rept. No. 9. Performing Organization Name and Address Department of Computer Science University of Illinois Urbana, Illinios 6l801 10. Pro|ect/Task/Work Unit No. 11. Contract/Grant No. US NASA NAS5-23331+ 12. Sponsoring Organization Name and Addre \* tf> '&