DIGITAL COMPUTER LABORATORY
UNIVERSITY OF ILLINOIS
URBANA, ILLINOIS

REPORT NO. 133

SUGGESTED DESIGN FOR A VERY FAST MULTIPLIER

by C. S. Wallace

February 11, 1963

This work was supported in part by the Atomic Energy Commission under Contract No. AT(11-1)-415.

Abstract

It is suggested that the economics of present large-scale scientific computers could benefit from a greater investment in hardware to mechanize multiplication and division than is now common. As a move in this direction a design is developed for a multiplier which generates the product of two 40-digit numbers using purely combinational logic, i.e., in one gating step. This design is described in some detail to establish that no exceptional cases invalidate the assertion made about its speed of operation. Using straightforward diode-transistor logic, it appears presently possible to obtain products in under one microsecond, and quotients in three. A rapid square root process is also outlined. Approximate component counts are given for the proposed design.

I. Introduction

A contemporary computer spends a large percentage of its time executing multiplication and, to a lesser extent, division. The recent advent in very large machines of bookkeeping controls (operating in advance of the arithmetic unit to execute memory fetches, stores, address modification, etc.) has tended to increase this percentage by relieving the arithmetic unit of many trivial burdens. The arithmetic unit of such a machine, when used for scientific computations, will spend nearly half its time multiplying or dividing. Paradoxically, the amount of hardware built into large machines specifically for these operations is rarely very great.

Thus the situation has arisen, viewed in the context of a very large machine involving a heavy investment in memory, peripheral equipment and controls, that it may be advantageous to the economy of the machine as a whole to increase the hardware investment in the operations of multiplication and division, even beyond the point where an increment of this investment yields an equal incremental increase in multiplication-division speed. Consistent with this point of view, this paper will describe the logical design and economics of a multiply-divide unit designed for maximum possible speed.

For multiplication, which will be discussed first, obvious ways to get high speed are (a) to reduce the number of partial products to be summed, and (b) to extend the parallelism used in their addition. The limiting case of the latter course, in which the product is formed by combinatorial logic in one gating step, is treated below.

This approach, while clearly involving a great deal of hardware, has some by-product advantages. First, control complexity is reduced to the minimum of a single step. Second, with present transistor technology the time for the distribution of gate signals to a flipflop register, augmented by the time required for the flipflops to settle into their new state, generally exceeds by a considerable factor the propagation delay through a combinatorial logic element. Thus there is a strong argument toward performing many levels of logic in each gating step.

In this paper attention will be restricted to the multiplication and division of 40-digit two's complement binary numbers.

II.
The Adder Tree

Given a large number of numbers to be summed by combinatorial logic, it is clearly unnecessary and undesirable to have carry propagation at each intermediate stage of the additions. A straightforward approach, used here, employs a sufficient number of full-adder words, each consisting of as many full-adder circuits as there are significant digits in the numbers to be added. The full-adder circuits are not interconnected in any way. A full-adder word gives two output numbers, sum and carry, whose sum equals the sum of the three input numbers. If there are n numbers to be summed, n - 2 full-adder words will be needed to express the sum as two numbers. These two numbers must then be summed in a carry-propagating adder to produce the final result.

All partial products to be summed are generated simultaneously. An arrangement of the n - 2 full-adder words is desired which starts work on all partial products simultaneously and produces the result after as few full-adder propagation delays as possible. This suggests a tree structure of the type shown in Figure 1. In this figure each box represents a full-adder word, with the three incoming numbers identified at the top. The sum and carry numbers leaving the bottom of the box are identified by the letters s and c. As can be seen, starting with the carry-propagating adder, each additional level of full-adder words increases the number of available inputs by a factor of 1.5 or less. The inputs shown by W1, W3, etc., to W39 are the partial product numbers. The example shown in the figure, with twenty input numbers, corresponds to the particular multiplier design to be developed below.

[Figure 1. Adder Tree. The partial products W1 to W39 enter full-adder words arranged in levels 7 down to 1; the sum and carry outputs of level 1 enter the carry-propagating adder, which delivers the final sum.]

Certain complications arise from the fact that the summands, being partial products, are shifted relative to one another. Thus the three input numbers to any one adder word do not in general cover the same range of digital positions. At the less significant end of the adder word, there will be digital positions having only one or two inputs. Since the function of an adder word is to reduce three input numbers to two output numbers, these digital positions of the adder word need not contain full-adder circuits. In some cases, they need contain only one or two inverter circuits.

Unfortunately, the same simplification does not apply at the more significant end. Each partial product, and hence in general any input number to an adder word, may be negative. In the two's complement representation, a number to be added in an adder word to a number of greater significance must be augmented to the left of its sign digit with copies of its sign digit. Thus, the adder word must contain full-adder circuits as far left as the most significant (i.e., sign) digit of its most significant input number if all input numbers may be negative.

Also, where the outputs of an adder word enter another adder word together with an input number of greater significance, those outputs must be augmented by copies of their sign digits, so that the most significant full-adder circuit of the first adder word must be capable of driving several full-adder circuit inputs. However, in such cases it is possible to arrange that only the carry output number need be so augmented, the sum output being restricted to non-negative values. As will be shown when the circuitry proposed for the full adder is described, the carry output may be provided with extra fanout without overall loss of speed (see Figure 2).
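The counting argument of this section (n - 2 full-adder words, and a growth factor of at most 1.5 per level) can be checked with a small arithmetic model. The sketch below is illustrative only: the names `carry_save` and `reduce_tree` are hypothetical, the summands are plain non-negative Python integers rather than shifted two's complement words, and the words are grouped greedily in threes at each level, which is an assumption and not necessarily the wiring of Figure 1.

```python
def carry_save(x, y, z):
    """Model of one full-adder word: three numbers in, (sum, carry) out.

    No carry is propagated between digit positions; the carry word is
    simply displaced one position to the left.
    """
    s = x ^ y ^ z                              # digit-wise sum modulo two
    c = ((x & y) | (y & z) | (z & x)) << 1     # digit-wise majority, shifted left
    return s, c


def reduce_tree(numbers):
    """Reduce a list of summands to two numbers with carry-save adder words.

    Returns (remaining_numbers, levels_used, adder_words_used).
    Grouping is greedy (an assumption of this sketch): at each level every
    complete group of three is fed to an adder word, leftovers pass down.
    """
    numbers = list(numbers)
    levels = words = 0
    while len(numbers) > 2:
        full = len(numbers) - len(numbers) % 3
        nxt = []
        for k in range(0, full, 3):
            nxt.extend(carry_save(*numbers[k:k + 3]))
            words += 1
        nxt.extend(numbers[full:])             # inputs that wait for a later level
        numbers = nxt
        levels += 1
    return numbers, levels, words


summands = list(range(1000, 1020))             # twenty arbitrary test summands
pair, levels, words = reduce_tree(summands)
print(levels, words, sum(pair) == sum(summands))
```

With twenty summands this model uses seven levels and eighteen adder words, in agreement with the n - 2 rule; the two remaining numbers still have to pass through a carry-propagating adder.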
If the three sign digit inputs to the most significant adder stage of an adder word are denoted x, y and z, the adder stage may take the normal form, giving two outputs

$$c = xy \lor yz \lor zx, \qquad s = (x \lor y \lor z)\,\bar{c} \lor xyz,$$

where, of course, the digit c enters the next adder word displaced one digital position to the left, and provided that both s and c are augmented by copies as far left as is necessary. This form will be used only where the s output does not in fact require augmenting, as provision for the extra fanout would slow the addition.

To ensure that the sum output word is never negative, use will be made of the fact, true in all cases where this maneuver is required, that the three input numbers are not of equal significance. Two of the numbers will have identical digits in both the sign and next less significant digital positions. Thus the two left-hand adders of the adder word must sum three numbers having, at their left-hand ends, the form

    x y - - -
    a a - - -
    b b - - -

into two numbers of the form

    Sum (non-negative):    0 p q - - -
    Carry:             r r r t - - -

The digit q is developed as the sum modulo two of y, a and b in the usual way. The logical expressions for the remaining digits are:

$$r = x \lor a \lor b$$
$$t = (a \lor b)\; y\; \overline{(ab)}$$
$$p = r\;\overline{[\,x\,(a \lor b)\,]}$$

The only digit requiring fanout in any circumstances will be r, which can be amplified without loss of speed (see Figure 3). When this form of sum and carry outputs from one adder stage both enter the same adder word in the next stage together with a third input of greater significance, the left-hand end of the second adder word will have stages with only two significant input digits. In this case, the circuits used in these stages can be half-adders.

[Figure 2. A Full Adder Circuit. Schematic with inputs at +0.5 or -1.5 v, a carry output provided with an emitter follower for fanout of carry when necessary, and a sum output.]
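The expressions for q, r, t and p can be verified exhaustively. The sketch below assumes one particular reading of the digit layout: the input sign position carries weight -2 and the next position weight +1, so that a sign-extended pair "a a" is worth -a; the sum output "0 p q" is worth 2p + q; and the carry output, with t in the weight-2 position and r as its sign-extended sign digit, is worth -4r + 2t. The function names are hypothetical, and the complement bars in t and p are taken as reconstructed from the arithmetic.

```python
from itertools import product


def top_stages(x, y, a, b):
    """Outputs of the two most significant stages, per the expressions above."""
    q = y ^ a ^ b                          # sum modulo two of y, a and b
    r = x | a | b                          # sign digit of the carry output
    t = (a | b) & y & (1 - (a & b))        # (a v b) . y . not(ab)
    p = r & (1 - (x & (a | b)))            # r . not(x . (a v b))
    return p, q, r, t


def outputs_match_inputs():
    """Check all sixteen input combinations under the assumed weights."""
    for x, y, a, b in product((0, 1), repeat=4):
        p, q, r, t = top_stages(x, y, a, b)
        inputs = (-2 * x + y) - a - b      # 'x y' plus two sign-extended digits
        sum_word = 2 * p + q               # '0 p q' is never negative
        carry_word = -4 * r + 2 * t        # 'r r r t' read as two's complement
        if sum_word + carry_word != inputs:
            return False
    return True


print(outputs_match_inputs())
```

Under these assumptions every combination balances, which also shows why only r needs extra fanout: it is the one digit that is copied to the left.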
At least one adder word in each level of the adder tree will have at its more significant end adder circuits handling digits of the weight of the sign digit of the final product. Since two's complement representation is proposed, carry outputs from these circuits may be ignored. No adder word will contain digital positions to the left of the sign digit of the final product.

At the other end of the adder words, each level of the adder tree will contain an adder word some of whose least significant output digits have less significance than any output digits of any other adder word in the same level. These digits may bypass all remaining levels of the tree and enter directly into the (double-length) carry propagating adder. Thus, at the time when adder word one produces its output numbers, these will not contain the less significant end of the product, which will have already been produced in its final form by the right-hand end of the carry propagating adder.

[Figure 3. Possible circuit for the two most significant stages of an adder word giving a non-negative sum output. Biasing resistors and clamps not shown.]

III. Generation of Partial Products

Each partial product will be selected from a limited number of multiples of the multiplicand on the basis of some of the multiplier digits. It is proposed that the available multiples be +2, +1, 0, -1, and -2 times the multiplicand. Partial product $W_i$, where i is odd, will depend on multiplier digits $x_{i-1}$, $x_i$ and $x_{i+1}$, where the multiplier digits are labelled from $x_0$ (the sign digit) to $x_{39}$ (the least significant). Digit $x_{40}$ is taken as zero. The rules for selection of a multiple are:

+2 if $\bar{x}_{i-1}\, x_i\, x_{i+1}$
+1 if $\bar{x}_{i-1}\, x_i\, \bar{x}_{i+1} \lor \bar{x}_{i-1}\, \bar{x}_i\, x_{i+1}$
 0 if $x_{i-1}\, x_i\, x_{i+1} \lor \bar{x}_{i-1}\, \bar{x}_i\, \bar{x}_{i+1}$
-1 if $x_{i-1}\, x_i\, \bar{x}_{i+1} \lor x_{i-1}\, \bar{x}_i\, x_{i+1}$
-2 if $x_{i-1}\, \bar{x}_i\, \bar{x}_{i+1}$

This recoding scheme has the following advantages:

i) It requires little logic.
ii) All selections can be made simultaneously; the recoding is not a serial process.
iii) The multiples used can be obtained from the multiplicand by the trivial processes of complementation and displacement.
iv) It produces only 20 partial products from the 40-digit multiplier.
v) It applies without alteration to the leftmost digits of the multiplier.

Alternative schemes involving a smaller number of partial products, each selected from more possibilities, are considered inadvisable in this context. If, say, eight or nine multiples are allowed, the number of partial products is reduced to 14. The time saving however is small; the adder tree, while using 12 rather than 18 adder words, is shortened by only one level in seven. Such a recoding scheme would require multiples not obtainable by shifting and complementation as above. The generation of these multiples would almost certainly require longer than the propagation delay of the one adder tree level saved. Moreover, it is not clear that any equipment saving could be made in this way, as the circuits required for selection of the partial products involve a considerable amount of equipment, which increases linearly with the number of possible multiples.

A two's complement number representation has been assumed. When a negative multiple of the multiplicand is selected, the complement of the multiplicand is used, and a correction is applied to this complement by adding one to its least-significant digit. To add this correction directly to the complement in a special adder provided for the purpose would be both time-consuming and expensive. Instead, the correction digit for some partial product $W_i$, occurring in digital position i + 39, can be appended to the right-hand end of the next more significant partial product $W_{i-2}$, thus extending this partial product to the right from position i + 37 to i + 39.

A slight improvement of this method is to so recode the last digit of $W_i$ that the correction bit will occur in position i + 38, thus extending $W_{i-2}$ by one digital position rather than two. Suppose the least-significant digit of the possibly shifted, but as yet uncomplemented, multiplicand is x. Instead of setting, for a negative multiple, digit i + 39 of $W_i$ equal to $\bar{x}$ and digit i + 39 of $W_{i-2}$ equal to 1, we set digit i + 39 of $W_i$ equal to x, and digit i + 38 of $W_{i-2}$ equal to $\bar{x}$. (The two settings are equivalent, since $\bar{x} + 1 = x + 2\bar{x}$ for either value of x.) Thus, after allowing a possible left displacement of the multiplicand of one place for multiples of modulus two, the range of digital positions occupied by significant digits of $W_i$ is i - 1 to i + 40.

The correction to $W_1$ cannot be so treated, as it is the most significant partial product. Its correction digit, which, by use of the above technique, is made to lie in position 39, is instead added to the least-significant partial product $W_{39}$. This can be done without loss of time in the following way. Digits of $W_{37}$ and $W_{39}$ in positions to the left of and including position 39 are not fed directly into adder word 17. Instead, they are fed into a short adder word section (number 19) in level seven of the adder tree, having adder stages in digital positions 36 to 39. This section also receives, in position 39, the $W_1$ correction digit. The sum and carry outputs of this section cover digital positions 36 to 39 and 34 to 38 respectively. These, together with positions 34 to 39 of $W_{35}$, enter positions 34 to 39 of adder word 17. Since level seven of the adder tree is necessary in any case, this additional section does not delay the final result.

IV.
Dimensions of the Adder Tree

Having decided upon the formation of the partial products and the general scheme for their addition, one can now fix the length and relative significance of the numbers appearing at the inputs to the various adder words, and hence the dimensions of the adder words themselves.

In the following list, the inputs to each adder word are listed with their ranges of significant digital positions. Partial product input numbers, as modified by the addition of correction bits, are called $W_i$. Sum and carry output numbers of adder word j are called $s_j$ and $c_j$. Where the last few output digits of an adder are fed directly to the carry propagating adder, this is shown by the range of digital positions involved and the word "out." In this case, the digits going "out" are not included in the listed sum and carry outputs. Where the last few stages of an adder word have two or fewer inputs, and hence do not involve the use of full-adder circuits, the range of digital positions is shown with the word "void." The number of full-adder stages is given in parentheses after their range.

| Adder word | Inputs | Outputs | Stages | Remarks |
|---|---|---|---|---|
| 19 | 38, 39 of W39; 36-39 of W37; correction to W1 at position 39 | s: 36-39; c: 34-38 | 36-39 (4) | |
| 18 | W3 (2-43); W5 (4-45); W7 (6-47) | s: 1-47; c: 1-45 | 2-43 (42) | 44-47 void |
| 17 | W39 (40-78) and s19 (36-39); W37 (40-77) and c19 (34-38); W35 (34-75) | s: 34-73; c: 32-73 | 34-75 (42) | 74-78 out; 76-78 void |
| 16 | W33 (32-73); W31 (30-71); W29 (28-69) | s: 28-73; c: 26-71 | 28-69 (42) | 70-73 void |
| 15 | W27 (26-67); W25 (24-65); W23 (22-63) | s: 21-67; c: 21-65 | 22-63 (42) | 64-67 void |
| 14 | W21 (20-61); W19 (18-59); W17 (16-57) | s: 16-61; c: 14-59 | 16-57 (42) | 58-61 void |
| 13 | W15 (14-55); W13 (12-53); W11 (10-51) | s: 10-55; c: 8-53 | 10-51 (42) | 52-55 void |
| 12 | W9 (8-49); s18 (1-47); c18 (1-45) | s: 0-49; c: 0-47 | 0-45 (46) | 46-49 void |
| 11 | s17 (34-73); c17 (32-73); s16 (28-73) | s: 28-71; c: 26-71 | 28-71 (44) | 72, 73 out |
| 10 | c16 (26-71); s15 (21-67); c15 (21-65) | s: 21-71; c: 19-67 | 21-65 (45) | 66-71 void |
| 9 | s14 (16-61); c14 (14-59); s13 (10-55) | s: 9-61; c: 9-59 | 10-55 (46) | 56-61 void |
| 8 | c13 (8-53); s12 (0-49); c12 (0-47) | s: 0-53; c: 0-49 | 0-47 (48) | 48-53 void |
| 7 | s11 (28-71); c11 (26-71); s10 (21-71) | s: 21-67; c: 19-67 | 21-71 (51) | 68-71 out |
| 6 | c10 (19-67); s9 (9-61); c9 (9-59) | s: 9-67; c: 7-61 | 9-59 (51) | 60-67 void |
| 5 | s8 (0-53); c8 (0-49); W1 (0-40) | s: 0-53; c: 0-49 | 0-40 (41) | 41-53 void |
| 4 | s7 (21-67); c7 (19-67); s6 (9-67) | s: 9-61; c: 7-61 | 9-61 (53) | 62-67 out |
| 3 | c6 (7-61); s5 (0-53); c5 (0-49) | s: 0-61; c: 0-53 | 0-49 (50) | 50-61 void |
| 2 | s4 (9-61); c4 (7-61); s3 (0-61) | s: 0-53; c: 0-53 | 0-61 (62) | 54-61 out |
| 1 | s2 (0-53); c2 (0-53); c3 (0-53) | s: 0-53; c: 0-52 | 0-53 (54) | All digits out |

Total adder stages: 847

V. Circuitry

The suggested circuits employ diode OR-AND logic and transistor inverting amplifiers, run saturated. Figure 2 shows a possible circuit for a full-adder stage. It is designed to be fed from, and to feed, exactly complementary circuits using npn transistors with their emitters tied to -2 volts and collectors caught at +0.5 volts. Either output can drive one input of a complementary circuit. Thus, odd-numbered levels of the adder tree would employ one polarity of circuit, and even-numbered levels would employ the other. The circuit shown generates the complement of the sum and carry outputs as normally defined. However, since the logical equations for both sum and carry of a full adder are self-dual, the same circuit configuration is employed in both varieties of circuit.
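The self-duality property relied on here (complementing every input complements the output) is easy to confirm by enumeration. The following is a minimal check, using the sum and carry expressions of the normal full-adder form; the helper names `full_adder` and `is_self_dual` are hypothetical.

```python
from itertools import product


def full_adder(x, y, z):
    """Sum and carry of a full adder, in OR-AND form."""
    carry = (x & y) | (y & z) | (z & x)
    total = ((x | y | z) & (1 - carry)) | (x & y & z)
    return total, carry


def is_self_dual(f):
    """True if complementing all three inputs always complements f."""
    return all(
        f(1 - x, 1 - y, 1 - z) == 1 - f(x, y, z)
        for x, y, z in product((0, 1), repeat=3)
    )


print(is_self_dual(lambda x, y, z: full_adder(x, y, z)[0]),   # sum output
      is_self_dual(lambda x, y, z: full_adder(x, y, z)[1]))   # carry output
```

Both outputs pass, which is why a level built from inverting circuits can feed a level of the opposite polarity without any additional inverters. A plain three-input AND, by contrast, fails the same test.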
Another way of looking at the polarity of signals is to define the output of a transistor of either sort as one if the transistor is on, in which case the circuit shown will give true signals at all adder tree levels. Notice that if all inputs to the adder tree are zero in this latter convention, all transistors of the tree will be cut off. This point will be of importance to the discussion of division.

The component values shown guarantee, in an assumed worst-case combination of ±3 per cent resistor and power supply variations, base turn-on and turn-off currents of about one-twelfth of the maximum collector standing current. The use of modern epitaxial transistors of around $1 in cost, and of diodes of around $.30 in cost, should, with reasonable care in packaging, give a circuit propagation delay of about 30 nsec per transistor, or 60 nsec for a full adder.

It is proposed that the partial products be generated using OR-AND-NOT circuits of the same general type as used in the full adder. One such circuit, and hence one transistor, would be required for each digit of each of the twenty partial products, a total of 800 transistors. A circuit producing the digit of weight $2^{-(i+j)}$ in partial product $W_i$ would produce the function

$$(p_{j+1} \lor \overline{+2})\,(p_j \lor \overline{+1})\,(\bar{p}_j \lor \overline{-1})\,(\bar{p}_{j+1} \lor \overline{-2})$$

where $p_j$ are the digits of the multiplicand, numbered from 0 in order of decreasing significance, and the signals +2, +1, -1 and -2 are the signals generated by recoding the multiplier digits as described in Section III to give the multiple for $W_i$. The recoded multiplier digit zero can be obtained by making signals +1 and -1 true simultaneously.

Partial products generated for introduction to level seven of the adder tree must differ in polarity from those for introduction to levels four or six. Complementary forms of the selector circuit would be used for the two polarities. Each recoded multiplier digit signal must drive 40 inputs, as must each polarity of multiplicand digit.
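The recoding rules of Section III amount to choosing, for each odd i, the multiple $-2x_{i-1} + x_i + x_{i+1}$. The sketch below models the arithmetic only, not the selector circuits: operands are treated as 40-digit two's complement integers (rather than fractions), the names `digits`, `recode` and `multiply` are hypothetical, and the partial products are summed with ordinary integer addition instead of the adder tree.

```python
N = 40  # digits per operand, sign digit included


def digits(value):
    """Digits x_0 (sign) to x_39 of a 40-digit two's complement integer,
    followed by x_40, which is taken as zero."""
    pattern = value & ((1 << N) - 1)
    return [(pattern >> (N - 1 - i)) & 1 for i in range(N)] + [0]


def recode(multiplier):
    """Map each odd i to its multiple in {-2, -1, 0, +1, +2}."""
    x = digits(multiplier)
    return {i: -2 * x[i - 1] + x[i] + x[i + 1] for i in range(1, N, 2)}


def multiply(multiplicand, multiplier):
    """Sum the twenty selected multiples, each displaced according to i."""
    return sum((d * multiplicand) << (N - 1 - i)
               for i, d in recode(multiplier).items())


a, b = -123456789012, 98765432109
print(len(recode(b)), multiply(a, b) == a * b)
```

Every digit pair is recoded independently of its neighbours' results, which is the sense in which the recoding is not a serial process; the twenty selections could all be made at once.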
With 80 recoded multiplier digit signals (four for each of the twenty partial products) and 80 multiplicand signals (two polarities of each of the 40 digits), 160 driver circuits of about 200 ma. capacity are required. Modern silicon epitaxial transistors can be used to make such drivers with delay times of about 30 nsec. However, in view of the possibly large spatial fanout of these signals, a delay time estimate of 100 nsec might be more realistic. The logical conditions for the recoded multiplier digit signals would each be generated from an OR-AND-NOT single-transistor circuit forming the input to the associated driver. Eighty such circuits are needed.

The circuitry required for the 79-digit carry-propagation adder will not be discussed. General designs have appeared in the literature capable of performing the carry propagation in a time of the order of 100 nsec. It should be noted that of the 79 digital positions, only the propagation time over the most significant 54 will be additive to the propagation delay through the adder tree.

VI. The Possibilities for Division

Although a case can be made for the use of a structure of the form described solely for the purpose of multiplication, it is of interest to see whether it can be used to execute a reasonably rapid division when suitably augmented. The author has been unable to discover any very effective method for direct division in the multiplier. However, it appears possible to use it in a four-step process to obtain the reciprocal of a 40-digit number.

If the multiplier structure is to be used efficiently, advantage must be taken of its ability to sum many numbers simultaneously and rapidly. In normal division processes, the usual direction taken to accelerate the process is to inspect the more significant digits of the partial remainder (or dividend) and divisor, and to guess, on the basis of these, the next few quotient digits. The product of the guessed digits and the divisor is then formed, possibly using simultaneous addition and recoding of the guessed quotient digits, and subtracted from the partial remainder to give a new partial remainder.
This method can in principle be carried to whatever extent desired, but the logic required for guessing quotient digits becomes rapidly more complex as the number of digits guessed per step increases. The practical limit is probably not much more than six quotient digits per step. The partial remainders may be left in a carry-unassimilated form to save time, but this considerably complicates the circuits required for guessing. In any case, the guesses can never be always correct, so the quotient must normally be developed in a redundant form, e.g., as two numbers which must be summed to give the final quotient. Excellent though this method and its variants may be for conventional arithmetic units, it does not seem feasible to the author to extend it to the point where it would make good use of the adder tree.

The proposed method is essentially based on the following iterative division process. Given x and y, to divide y by x, set

$$a_1 = xp, \qquad b_1 = yp,$$

where p is some approximation to the reciprocal of x, and iterate

$$a_{n+1} = a_n (2 - a_n), \qquad b_{n+1} = b_n (2 - a_n).$$

This process converges quadratically, $a_n$ to one and $b_n$ to the required quotient. That is, the number of correct digits in $b_{n+1}$ is double that in $b_n$. If p is sufficiently good an approximation to 1/x that xp differs from one by $2^{-5}$ at most, then three of the repetitive steps will give a 40-bit quotient. The part of the third step which generates $a_4$ is not needed. The values of $(2 - a_n)$ used at each step need not be exact, provided that the same value is used to form both $a_{n+1}$ and $b_{n+1}$.

VII. The Iterative Division

For the moment, consider only the part of the iteration involved in forming the $a_n$. This part is independent of the formation of the $b_n$. We will assume that the divisor, x, is positive and normalized to lie in the range $1/2 \le x < 1$. By inspection of the first seven digits of x following the binary point, the approximation p will be generated.
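Before turning to how p is chosen, the convergence claim for the iteration $a_{n+1} = a_n(2 - a_n)$, $b_{n+1} = b_n(2 - a_n)$ can be checked with exact rational arithmetic. This is a sketch under stated assumptions: the function name `iterate_division` and the sample values of x, y and p are hypothetical, and exact fractions stand in for the truncated fixed-point words of the proposed hardware.

```python
from fractions import Fraction


def iterate_division(x, y, p, steps=3):
    """Run the iteration from a_1 = x*p, b_1 = y*p for the given number of steps."""
    a, b = x * p, y * p
    for _ in range(steps):
        factor = 2 - a              # the same factor must be used for a and b
        a, b = a * factor, b * factor
    return a, b


x = Fraction(5, 8)                  # divisor, normalized to 1/2 <= x < 1
y = Fraction(1, 3)                  # dividend
p = Fraction(13, 8)                 # rough reciprocal: x*p = 65/64, within 2**-5 of one

a4, b4 = iterate_division(x, y, p)
relative_error = abs(b4 - y / x) / (y / x)
print(relative_error <= Fraction(1, 2**40))
```

Since $1 - a_{n+1} = (1 - a_n)^2$, an initial discrepancy of at most $2^{-5}$ becomes at most $2^{-40}$ after three steps, and $b_n$ carries the same relative error as $a_n$ because $b_n = (y/x)\,a_n$ throughout.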
If x has the form 0.1abcdef and p is given the form 1.qrst, then suitable expressions for the digits of p are as follows. [The complement bars over individual letters are not legible in this copy; the terms are given as they survive.]

q = ab v ac
r = be v abc v acde v bdef
s = ace v abc v bde v abce v abed v abed v abedf v abcef
t = acd v bed v bde v bede v abce v bede v abdf v aedf v abed v abedf v abede v abedf

These expressions are essentially raw minterm forms and may not be minimal. However, even as they stand, they could be realized quite cheaply and quickly with diode logic. The values of p yielded by these expressions are such that xp always has one of the forms 0.11111... or 1.00000... (The digit in position zero should be interpreted with positive weight.) The set of p values chosen is not unique in having this property, but they appear to require the simplest logical expressions for their generation.

The first step of the process will consist of forming p, recoding it to give three partial products, each either -2, -1, 0, +1 or +2 times x, and summing these. (It may possibly be advantageous to generate the recoded multiplier digits directly from the digits of x.) Only digital positions 5 et seq. of the product need be explicitly formed.

The next step of the process should be to take $a_1$ as formed and multiply it by $(2 - a_1)$, with the aim of producing as $a_2$ a number with digits 1 to 10 all the complement of digit zero. This aim can be achieved by using a multiplier which approximates $(2 - a_1)$, but which has many fewer significant digits.

Consider a number of the form of $a_1$, viz.,

$$p\,.\,\bar{p}\,\bar{p}\,\bar{p}\,\bar{p}\,\bar{p}\,q\,r\,s\,t\,u\,v \ldots$$

and the following approximation to $2 - a_1$:

$$1 - 2^{-5}\,(\bar{p}\,.\,qrstu)$$

where the number in brackets is interpreted as a signed two's complement fraction, hereafter called d, in the range -1 to 1 - 1/32. We may write $a_1$ as $1 + 2^{-5} d + e$, where e is in the range $0 \le e < 2^{-10}$.
The product of the two numbers will be

$$1 - 2^{-10} d^2 + e\,(1 - 2^{-5} d).$$

This number will have a minimum value, when e = 0 and d = -1, of $1 - 2^{-10}$, and a maximum value, when e is just less than $2^{-10}$ and d = 0, of just less than $1 + 2^{-10}$. (With e at its maximum value, differentiation of the above expression with respect to d would give a stationary value with d slightly negative. In fact the smallest negative allowable value for d, viz., -1/32, is already past the stationary point and yields a value for the product of $1 + e - 2^{-10}(2^{-10} - e)$, which is slightly less than the value quoted above.) Thus the approximation to $(2 - a_1)$ given above always yields a product of the desired form, differing from one by an amount in the range $-2^{-10}$ to just under $2^{-10}$. This multiplier, which can be recoded to give four partial products, is therefore used in the second step of the iteration to give $a_2$.

Similarly, the number formed by adding one to $2^{-10}$ times the signed two's-complement fraction represented by digits 10 to 20 of $a_2$ is an adequate multiplier for use in step three. It may be recoded to yield seven partial products, and will give an $a_3$ whose digits 1 to 20 will all differ from its digit zero. The multiplier for step four will be one plus $2^{-20}$ times the signed two's-complement fraction represented by digits 20 to 40 of $a_3$. However, no $a_4$ will be generated.

In forming $b_4$, the final answer, we could start with the dividend y and multiply it successively by the four multipliers used in the four steps. This would require four multiplication times in addition to the three needed to form $a_1$, $a_2$ and $a_3$. If, however, we instead generate the reciprocal of x, then y = 1 and $b_1$ is simply p, which is available. As will be shown below, it is possible, because $b_1$ is then a number of only five digits, to obtain $b_2$ at the same time as $a_2$, and $b_3$ at the same time as $a_3$.
Thus $b_4$, the reciprocal, can be obtained in four multiplication times, and a true division can be done in five multiplication times, as opposed to seven. Also, the reciprocal, yielded as a byproduct, may often be useful to the programmer.

VIII. Details of Reciprocal Generator

Assume the existence of a register R having digital positions $R_i$, where i = 0, 1, 2, .... Initially, the positive normalized number x occupies $R_0$ to $R_{39}$. Digits $R_2$ to $R_7$ (the digits a to f of x) are decoded to give p and hence a recoded multiplier of the form s0q0r, where s, q and r take the values +2, +1, 0, -1, -2. Three partial products are formed by selector circuits other than those normally used to produce the $W_i$. These selectors introduce their output words into the adder tree at the points labelled on Figure 1 as B, C and D. If the normally used multiplier input is made zero during this and subsequent steps, all transistors of the adder tree above the levels at which numbers are specifically introduced will be off, so that the collectors of the selector circuits introducing numbers at these points may simply be commoned to the adder tree collectors which normally supply these points. We will have

at B: $(R_{2-39})\,s$ in positions 0-37 (a two-place left displacement)
at C: $(R_{1-39})\,q$ in positions 1-39 (no displacement)
at D: $(R_{1-39})\,r$ in positions 3-41 (a two-place right displacement)

In generating these partial products, correction bits are applied in the usual way. That for the entry at B can be introduced into the appropriate digital position at point A of Figure 1. (If x is positive, s will never in fact be negative. However, a simple way to produce reciprocals of negative numbers is to use a negative multiplier in step one.) If the output of the adder tree is now gated back without shifting into R, we will have in $R_{0-41}$ digits 2-43 of $a_1$, of which digits 2-5, i.e., $R_{0-3}$, will be identical. At the same time we gate digits 0 to 4 of the multiplier p just used into $R_{51-55}$. This completes step one.

In the second step, $R_{3-8}$, i.e., digits 5 to 10 of $a_1$, are decoded to give a recoded multiplier of the form 1.00000c0d0e, giving four partial products introduced at the adder-tree points A, B, C and D:

at A: $R_{8-41}$ in positions 0-33, $R_{52-55}$ in positions 44-47 (an eight-place left displacement)
at B: $(R_{2-41})\,c$ in positions 0-39, $(R_{51-55})\,c$ in positions 49-53 (two-place left shift)
at C: $(R_{0-41})\,d$ in positions 0-41, $(R_{51-55})\,d$ in positions 51-55 (no displacement)
at D: $(R_{0-41})\,e$ in positions 2-43, $(R_{51-55})\,e$ in positions 53-57 (two-place right shift)

Note that the entries to B, C and D are displaced in the same way as for step one. Thus the same selector inputs may be used. Carries from digital position 44 to position 43 of adder words 1 and 2 of the adder tree are inhibited, to give effectively independent multiplication of $a_1$ and $b_1$ by the same multiplier. The b part of the entries at B, C and D must be augmented by sign digit copies as far left as position 44. Gating the tree output back into R gives digits 10 to 53 of $a_2$ in $R_{0-43}$, and digits 1-14 of $b_2$ in $R_{44-57}$. Digit 0 of $b_2$ is known to be one, and need not be formed. It can be thought of as occupying an additional flipflop, indexed as position 43 of the b part, for the discussion of later steps.

In the third step, digits 10-20 of $a_2$ in $R_{0-10}$ are decoded to give a multiplier of the form 1.000000000f0g0h0i0j0k, giving seven partial products, which must be formed in specially provided selectors and introduced into adder tree points E, F, G, H, I, J and K:

at E: $R_{10-43}$ in positions 0-33, $R_{53-57}$ in positions 35-39
at F: $R_{0-33}$ in positions 0-33, $R_{43-57}$ in positions 35-49
at G: $R_{0-31}$ in positions 2-33, $R_{43-57}$ in positions 37-51
at H: $R_{0-29}$ in positions 4-33, $R_{43-57}$ in positions 39-53
at I: $R_{0-27}$ in positions 6-33, $R_{43-57}$ in positions 41-55
at J: $R_{0-25}$ in positions 8-33, $R_{43-57}$ in positions 43-57
at K: $R_{0-23}$ in positions 10-33, $R_{43-57}$ in positions 45-59

Carry from position 34 to position 33 is inhibited in all relevant adder words. As described above, only digits 10 onwards of $b_3$ are formed. However, if, in the carry propagating adder, an additional 10-digit section is added to the left of position 34, this section and the stage in position 34 can receive digits 0-9 of $b_2$, i.e., $R_{43-52}$, and any carry (which may be negative) into position 34 generated in the adder tree. Since, in this step, positions of the double-length carry propagation adder to the right of position 59 are not used, ten of these could be switched to form the required additional section. The adder output thus contains digits 20 to 53 of $a_3$ in positions 0 to 33, and digits 0 to 34 of $b_3$ in positions 25-59.

In this third step some truncation of $a_3$ has occurred. However, digits of $a_3$ are retained as far right as digit 53, and only digits 20 to 40 are required in step four. Thus the truncation error introduced is very small. Digits 20 to 40 of $a_3$ are gated into $R_{0-20}$, and digits 0-34 of $b_3$ are gated into $R_{25-59}$.

In the fourth and final step, $R_{0-20}$ are decoded to give a recoded multiplier specifying twelve partial products. These could be introduced into level five of the adder tree, but only at considerable expense. The multiplier is therefore used to control a standard multiplication step, using little or no additional equipment, to produce the reciprocal, $b_4$.

Thus the formation of a reciprocal requires 15 adder word delays, four carry propagating addition delays, of which only the last is of the normal length, and four recoding and selection delays.

If the range of digital positions of the numbers involved in each step is examined and compared with the list of adder word dimensions, it will be found that all numbers will fit into the adder word dimensions already prescribed, with the exception of the partial products input in step 3 at points I, J and K of the adder tree.
To accommodate these, adder word 4 must be extended by three digital positions at its left-hand end. The reciprocal-forming process also requires an additional eleven words of selector circuits, with drivers.

IX. A Square-Root Process

An iterative procedure for generating the reciprocal of the square root of a number, very similar to that used above for obtaining the reciprocal, is: given a number x, and an approximation p to the reciprocal of its root, set $a_1 = xp^2$, $b_1 = p$ and iterate

$$a_{n+1} = a_n \left(1 + \tfrac{1}{2}(1 - a_n)\right)^2, \qquad b_{n+1} = b_n \left(1 + \tfrac{1}{2}(1 - a_n)\right).$$

The process converges quadratically, $a_n$ to 1, and $b_n$ to the required reciprocal root. If p and $p^2$ are provided by inspection of x, eight multiplications are required to obtain $b_4$ which, if $xp^2$ differs from one by less than 1/32, will have about 40 correct digits. It is possible that the eight multiplications could be done in six steps, by a partitioning of the adder tree similar to that suggested above. Even if this were not so, it would still be quite a rapid process. The true square root can of course be obtained from its reciprocal by multiplication by x.

X. Speed and Cost

For the limit of a large number of digits per word, most conventional multiplier structures have a product of equipment cost and multiplication time which varies as the square of the number of digits. In the present structure the equipment cost varies as the square of the number of digits, and the multiplication time varies as the logarithm of the number of digits. Thus, the present structure has a cost-time product increasing more rapidly with increasing word length than that of conventional structures. It is therefore less efficient in the long-word limit. However, it is not necessarily as inefficient as one might suppose for the proposed word length of 40 bits, particularly in the context of existing transistor technology.
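The scaling argument of this section can be written compactly. In the sketch below, n is the number of digits per word, and C and T (with subscript c for a conventional multiplier) are shorthand introduced here for equipment cost and multiplication time; they are not symbols used in the report.

```latex
\begin{aligned}
\text{conventional:}\quad & C_c(n)\,T_c(n) \propto n^{2},\\
\text{present structure:}\quad & C(n) \propto n^{2},\qquad T(n) \propto \log n
  \;\Longrightarrow\; C(n)\,T(n) \propto n^{2}\log n,\\
\text{ratio:}\quad & \frac{C(n)\,T(n)}{C_c(n)\,T_c(n)} \propto \log n .
\end{aligned}
```

On these assumptions the relative inefficiency of the present structure grows only as the logarithm of the word length.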
The apparent logarithmically increasing inefficiency is a reflection of the fact that, while the multiplication time depends upon the propagation delay of signals passing through the logarithmically increasing number of adder tree levels, each logical element of the structure is used only once during the multiplication process. Thus, if one defines the useful duty cycle of a logical element as the ratio between its propagation delay and the period between meaningful and distinct uses of its output, this duty cycle is in the present structure logarithmically decreasing with word length. However, the duty cycle in the case of 40 digits is about 1/15, which is not greatly below the upper limit set by the characteristics of the circuits used. Typical transistor circuits having propagation delays of 15 to 30 nsec are very difficult to operate at repetition rates above 10 mc, especially when allowance is made for the reliable distribution time of clock and gating signals, and for the settling time of flipflops.

The equipment requirements of this structure are approximately as follows:

Circuit type            Transistors   Diodes       Number     Total         Total
                        (per unit)    (per unit)   of units   transistors   diodes

For multiplication:
Full adders                  3            18          750        2,250       13,500
Selectors                    1            13          840          840       10,920
Recoders                     4            10           80          320          800
Multiplicand drivers         4             3           80          320          240
Totals for multiplication:                                       3,730       25,460

Additional requirements for division:
Recoders                     4            10           15           60          150
Multiplicand drivers         4             3           60          240          180
Selectors                    1            13          561          561        7,293
Grand totals:                                                    4,591       33,083

Not included in the above estimates are the carry propagating adder, the registers necessary to hold operands and results, and the control circuitry. This equipment would almost certainly be present in the computer arithmetic unit for addition-subtraction, and should not be charged specifically to the multiplier-divider.
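The totals in the table can be checked by simple arithmetic. The short Python sketch below recomputes them from the per-unit figures as read from the table; the dictionary and function names are illustrative only and do not come from the report.

```python
# Per-unit figures as read from the table: (transistors, diodes, number of units).
MULTIPLICATION = {
    "full adders": (3, 18, 750),
    "selectors": (1, 13, 840),
    "recoders": (4, 10, 80),
    "multiplicand drivers": (4, 3, 80),
}
DIVISION_EXTRA = {
    "recoders": (4, 10, 15),
    "multiplicand drivers": (4, 3, 60),
    "selectors": (1, 13, 561),
}


def totals(rows):
    """Return (total transistors, total diodes) for a group of circuit types."""
    transistors = sum(t * n for t, _, n in rows.values())
    diodes = sum(d * n for _, d, n in rows.values())
    return transistors, diodes


mult_totals = totals(MULTIPLICATION)
extra_totals = totals(DIVISION_EXTRA)
grand_totals = (mult_totals[0] + extra_totals[0], mult_totals[1] + extra_totals[1])

print("multiplication:", mult_totals)
print("grand totals:  ", grand_totals)
```

Both results agree with the totals quoted in the table: 3,730 transistors and 25,460 diodes for multiplication, and 4,591 and 33,083 overall.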
That is, the totals above represent the additional equipment required for the proposed multiply-divide scheme over and above that necessary for even the most primitive parallel arithmetic unit. The additional equipment is perhaps ten per cent of the semiconductor complement of a modern, large-scale computer, but would almost certainly represent much less than ten per cent of the cost of a large computer.

To estimate the time required for a multiplication, it is assumed that

i) The propagation delay per transistor is 30 nsec; the delay per adder tree level accordingly is 60 nsec.
ii) The propagation delay of the high-current drivers is 100 nsec.
iii) The settling time of the carry-propagating adder is 100 nsec.
iv) The result will be gated into a register with a gating time of 100 nsec.

On this basis, the multiplication time becomes 750 nsec. This should be a fairly conservative estimate, the circuit delays being those obtainable at reasonable cost in 1962 using readily available components. On the same basis, the time required for the generation of a reciprocal, excluding the pre-normalization time, is 2220 nsec. The time for a full division is therefore about 3 µsec.

XI. A Simpler Version

If a 40 x 40 bit multiplication is performed in two steps, the adder tree technique described above can be used in a simpler form. In the first step, the 22 least-significant multiplier digits would be recoded to yield 11 partial products. A carry can arise in the recoding process, to be incorporated in the recoding of the remaining multiplier digits in the second step. An adder tree of five levels containing nine adder words is used to reduce the 11 partial products to a sum word and a carry word having digits in positions 17 to 78 (approximately). Of these words, digits in positions 57 to 78 are final, and can be added in the carry-propagating adder to give digits of the final product.
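The figures of five levels and nine adder words for eleven partial products can be reproduced with a short count. The Python sketch below assumes that each adder word reduces three input words to two (a sum word and a carry word) and that words left over at a level pass through unchanged; the function name is hypothetical.

```python
def reduce_words(n):
    """Count the tree levels and adder words needed to bring n words down to 2."""
    levels = adders = 0
    while n > 2:
        groups = n // 3      # each adder word turns three words into two
        n -= groups          # so every group removes exactly one word
        adders += groups
        levels += 1
    return levels, adders


print(reduce_words(11))
```

Under this assumption eleven words need five levels and nine adder words, in agreement with the figures quoted for the first step.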
The remaining digits of both words, together with the nine partial products formed by recoding digits 0-17 of the multiplier, are summed in the same tree, with the output words added in the carry-propagating adder to yield the rest of the product. The equipment cost of this scheme would be about half that of the one described above, and, making the same assumptions as above, the multiplication time would be about 1.2 µsec. The time for reciprocal formation remains unchanged, since by increasing the number of adders in the tree to ten, the tree can be made capable of all operations required by the process described. However, the time required for a division is increased by the increase in the multiplication time to about 3.5 µsec. Such a scheme might well be attractive in some circumstances.

XII. Conclusion

A method of performing multiplications has been described using a large amount of equipment to produce the product in a one-step combinatorial manner. Although in principle rather inefficient, this process is reasonably well matched to the characteristics of saturating diode-transistor circuitry, and the considerable increase it could yield in the overall speed of a large computer might well justify its cost. A four-step method for obtaining reciprocals can be employed using essentially the same equipment to give a fairly rapid substitute for division. A perhaps slightly more efficient scheme employing a little more than half as much equipment can multiply in a little less than twice the time required by the more expensive scheme, using two steps. This cheaper version is as fast as the more expensive when generating reciprocals.
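As a numerical illustration of the four-step reciprocal method and of the reciprocal-root iteration of Section IX, the following Python sketch uses exact rational arithmetic. It assumes that each reciprocal step multiplies both a and b by 2 - a, which is consistent with the shared multiplier described in Section VIII but is not spelled out in this part of the report, and that the root iteration multiplies b by 1 + (1 - a)/2 and a by the square of that factor. The function names and the example values (x = 3/4 with starting guesses 21/16 and 37/32) are chosen here for illustration only.

```python
from fractions import Fraction


def reciprocal(x, p, steps=4):
    """Approximate 1/x in `steps` multiplication steps, starting from a guess p.

    Assumed form of the iteration: a <- a*(2 - a), b <- b*(2 - a).
    """
    a, b = x * p, p              # step one: a_1 = x*p, b_1 = p
    for _ in range(steps - 1):   # steps two to four
        m = 2 - a                # one multiplier, applied to both a and b
        a, b = a * m, b * m
    return b


def reciprocal_root(x, p, iterations=3):
    """Approximate 1/sqrt(x) from a guess p (assumed form of the iteration)."""
    a, b = x * p * p, p          # a_1 = x*p^2, b_1 = p
    for _ in range(iterations):
        m = 1 + (1 - a) / 2
        a, b = a * m * m, b * m
    return b


x = Fraction(3, 4)

b4 = reciprocal(x, Fraction(21, 16))        # x*p = 63/64, within 1/32 of one
print(float(b4), float(abs(1 - x * b4)))

r4 = reciprocal_root(x, Fraction(37, 32))   # x*p^2 is within 1/32 of one
print(float(r4), float(abs(1 - x * r4 * r4)))
```

In both cases the residual (1 - x b for the reciprocal, 1 - x b^2 for the reciprocal root) falls below 2^-40, in line with the report's figure of about 40 correct digits from a starting error of less than 1/32. The final update of a in each loop is not needed for the result and is kept only to keep the sketch short.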