Report No. UIUCDCS-R-77-873
UILU-ENG 77 1722

AN ARITHMETIC UNIT FOR ON-LINE COMPUTATION

by

Mary Jane Irwin

May 1977

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois

This work was supported in part by the National Science Foundation under Grant No. US NSF 73-07998 and was submitted in partial fulfillment for the Doctor of Philosophy in Computer Science, 1977.

This thesis is concerned with the algorithmic and logic design of an arithmetic unit to be used in a computational environment in which the basic arithmetic operations satisfy the on-line property; that is, to generate the jth digit of a result (where a digit consists of n bits for base $2^n$), it is necessary and sufficient to have the operands available only up to the jth digit plus, in the case of division, a predetermined number of extra digits which correspond to an "on-line delay." Since there is no on-line delay for addition, subtraction, and multiplication, the unit can begin generating result digits as soon as one digit of each operand has been input. The delay for division is shown to be a small, positive, radix dependent constant. To fulfill the on-line requirements, a set of left-to-right (most-to-least significant), digit-by-digit algorithms has been derived.
The existence of such algorithms is contingent upon the use of a redundant representation for the result digits. These algorithms and a block diagram level implementation of the basic arithmetic unit are developed in the thesis.

The proposed arithmetic unit, capable of performing on-line operations, would be extremely useful in many real-time applications. Due to its potential for performing sequences of operations in an overlapped fashion (pipelining), the unit could provide an effective way to speed up execution. Furthermore, it is ideally suited for variable precision arithmetic.

ACKNOWLEDGMENTS

I would like to express my deep appreciation to my advisor, Professor James E. Robertson, for his advice, support, interest, and insight. I would also like to thank the members of my thesis committee for their interest and comments: Professors Mike Faiman, Dave Kuck, Tom Murrell, and Sylvian Ray. Fellow students Alf Weaver, John Larson, and Will Gillett provided valuable comments, encouragement, and proofreading skills. Gayanne Carpenter's contributions in terms of friendship and advice on departmental policies are also greatly appreciated.

Thanks are also due to June Wingler for an outstanding job of typing, to Stan Zundo for his drafting expertise, and to the Department of Computer Science and the National Science Foundation for their financial support.

Finally, I want to thank my husband, Vern, and son, John, for their love, patience, and understanding throughout this long undertaking and my parents for instilling in me the desire to succeed.

TABLE OF CONTENTS

1. INTRODUCTION
   1.1 Objectives
   1.2 Related Work
   1.3 Number Representation
   1.4 The Generalized Procedure
   1.5 Dissertation Overview
2. DIVISION
   2.1 Background
   2.2 The On-line Algorithm
   2.3 Quotient Digit Selection
       2.3.1 Range Restriction Analysis
       2.3.2 The Selection Equations
       2.3.3 Determining the Minimum Index Difference
       2.3.4 The Model Division
   2.4 Valid Operand Ranges
   2.5 Hardware Block Diagram
3. MULTIPLICATION
   3.1 Background
   3.2 The On-line Algorithm
   3.3 Product Digit Selection
       3.3.1 The Selection Function
       3.3.2 Sufficient Precision
   3.4 Valid Operand Ranges
   3.5 Hardware Block Diagram
   3.6 Some Numerical Examples
4. ADDITION AND SUBTRACTION
   4.1 Background
   4.2 The On-line Algorithm
   4.3 Sum (Difference) Digit Selection
   4.4 Hardware Block Diagram
   4.5 Some Numerical Examples
5. IMPLEMENTATION
   5.1 Design Constraints of LSI
   5.2 Floating Point Considerations
   5.3 Algorithmic Modifications of AURORA for Pipelining
   5.4 AURORA Hardware Requirements
       5.4.1 The Processing Logic
       5.4.2 Hardware Modifications of AURORA for Pipelining
       5.4.3 Speed of the Processing Logic
       5.4.4 AURORA as a Total Module
   5.5 System's Level Overview
6. SUMMARY
   6.1 Summary of the Results
   6.2 Applications
   6.3 Suggestions for Future Research
BIBLIOGRAPHY
APPENDIX
   I   REDUNDANCY DEFINITION
   II  SAMPLE P-D PLOTS FOR DIVISION
   III AN ON-LINE RECODING ALGORITHM
VITA

LIST OF TABLES

2.1 Equations Defining the Selection Regions of Figure 2.1
2.2 Equations Defining the Selection Regions of Figure 2.2
3.1 Minimum y Values
3.2 Example of the Algorithm MULT (r=2, K=1)
3.3 Example of the Algorithm MULT (r=4, K=1)
3.4 Example of the Algorithm MULT (r=4, K=2/3)
4.1 Example of the Algorithm ADD (r=2, K=1)
4.2 Example of the Algorithm ADD (r=4, K=1)
4.3 Example of the Algorithm SUBT (r=4, K=1)
5.1 Result Digit Selector Inputs
5.2 Result Digit Selector Inputs After Carry Propagation
5.3 Word Processing Time Delay
5.4 Rough Pin Count for AURORA
5.5 Speed vs. Complexity (Processing Logic)
A.1 Example of the Algorithm DIVIDE (r=2, K=1)

LIST OF FIGURES

2.1 P-D Plot with r=4, K=2/3, and δ=4
2.2 Modified Symmetric P-D Plot with r=4, K=2/3, and δ=4
2.3 P-D Plot with r=2, K=1, and δ=3
2.4 P-D Plot with r=2, K=1, and δ=4
2.5 P-D Plot Showing the Worst Case Overlap Region
2.6 Flowchart for Determining α and β
2.7 Selecting a "Good" Staircase
2.8 Step Definition Flowchart
2.9 Block Diagram for Division
3.1 P-p Plot
3.2 The Overlap Region of the P-p Plot
3.3 P-p Plot Showing the Worst Case Error
3.4 Block Diagram for Multiplication
4.1 Block Diagram for Addition and Subtraction
5.1 A Typical PLA Control Layout
5.2 Communication between Mantissa and Exponent Units
5.3 Pipelining Multiple AURORAs
5.4 Serial Processing with One AURORA
5.5 AURORA Processing Logic
5.6 Adder I/O Requirements
5.7 I/O Requirements of Digital Position i of Adder (i ≥ 1)
5.8 Modified AURORA Processing Logic for Pipelining
5.9 Block Diagram of an AURORA
5.10 Sample System Organization
A.1 The On-line Recoding Algorithm

1. INTRODUCTION

This thesis is concerned with the development of a set of basic (addition, subtraction, multiplication, and division) arithmetic algorithms suitable for use in a computational environment which calls for the on-line processing of data. In on-line processing the operands, as well as the results, flow through the arithmetic unit in a digit-by-digit manner, most significant digit first. In various real-time applications in which the operands are generated serially by an analog-to-digital conversion process beginning with the most significant digits, an arithmetic unit possessing this on-line property is highly desirable.
A unit which operates in an on-line fashion can provide the ever more popular microprocessor, a device traditionally restricted from most mathematical applications because of its short word length, with variable precision arithmetic capabilities. At the same time, it can provide for overlapping the generation of result digits with the fetching of operand digits. As an added bonus, the user may halt the processing when sufficient precision has been obtained, which may conceivably occur before all of the operand digits have been processed. With these thoughts in mind, an effort has been made to develop implementable, on-line algorithms for the basic arithmetic operations.

1.1 Objectives

A set of implementable on-line algorithms for the basic arithmetic functions should meet several desirable objectives. Three specific objectives were imposed on the algorithmic design during the development stage of this dissertation.

OBJECTIVE 1: The algorithms should be on-line with respect to the result digits. They should generate the most significant digits of the result first, in such a way that once generated, the result digit produced at step j would not be affected by any subsequent step k, k > j.

OBJECTIVE 2: The algorithms should also be on-line with respect to the operand digits. Only those digits up to and including those provided at step j should be required in order to perform the jth step of the algorithm. To avoid extreme scaling of the operands during division, a limited number of leading digits of each operand corresponding to an "on-line delay" are accumulated prior to starting the actual division algorithm.

OBJECTIVE 3: The basic computational step should be invariant at every step j and the only primitive arithmetic operation should be addition. The selection procedure generating one result digit per step should be such that the step execution time is independent of the operands; i.e., the selection should be based on a limited precision model of the operands.
The first objective implies the use of a redundant representation of the results. Without redundancy the problem cannot be solved. The second objective calls for a more complicated basic computational step than would otherwise be necessary, to allow for the on-line arrival of operand digits. The step invariance requirement of the third objective makes the control section of the implementation very straightforward, while the requirement that addition be the only primitive operator simplifies the processing section of the implementation. In order for the selection procedure to be independent of the length of the operands and, thus, step invariant, a limited propagation mode of addition must be employed. This, in turn, provides for a cost effective speed up of the overall algorithms.

Once algorithms which satisfy these objectives have been developed, an arithmetic unit encompassing all of the algorithms must be specified. This unit should, of course, be real world implementable. To accomplish this end, the unit itself must conform to several objectives imposed during the logic design phase of research.

OBJECTIVE 4: The unit should comply with the design constraints of LSI (Large Scale Integration).

OBJECTIVE 5: The unit as a whole should have conventional input/output requirements; i.e., the operands which are input and the results which are output should be in a conventional form (e.g., two's complement, sign magnitude, etc.).

OBJECTIVE 6: The unit as designed should be modular and expandable, both from the individual chip and the overall system viewpoint. It should be designed to handle floating point numbers.

OBJECTIVE 7: The unit should be fast as compared to typical central processor and memory speeds.

In order to comply with the fourth objective, the unit must possess a high circuit density, regularity of structure, a low pin count, and a large domain of applications.
The fifth objective requires that the redundant nature of the algorithms be hidden from the user. The result digits generated by the algorithms must be recoded into a conventional format before they are output. By designing the unit to function either in a serial mode as a stand alone unit or in a pipelined mode in connection with other identical units, the sixth objective would be satisfied. The last objective is difficult to quantify given today's rapidly changing technology.

1.2 Related Work

Several of the well known basic algorithms satisfy the on-line property with respect to either the operands or the results. Consider, for example, conventional division which has the on-line property with respect to the quotient digits. Similarly, conventional multiplication has the on-line property with respect to the multiplier. Several authors have extended this on-line property for multiplication to the product digits as well [AVI62, PIS70, GOY76].

It is also possible to define algorithms conforming to a right-to-left type of processing; i.e., algorithms which operate from the least-to-most significant end. Conventional addition and subtraction process operands and produce result digits in this manner. Atrubin [ATR65] developed a right-to-left type algorithm for multiplication. But, since the division process must by its very nature operate in a left-to-right fashion for the calculation of the quotient, this left-to-right, on-line processing was imposed upon all of the algorithms. The most significant digit first approach is also consistent with other arithmetic processes such as operand normalization, mantissa overflow detection, and result sign determination. All of these processes inherently require examination of the most significant digits of the result. Thus, in this thesis the on-line process is defined to be that process in which all of the operand digits as well as the result digits flow through the arithmetic unit in a left-to-right, digit-by-digit fashion.
To fulfill the on-line requirements, a set of left-to-right, digit-by-digit algorithms had to be derived. The existence of such algorithms is contingent upon the use of a redundant representation for the result. In the past, redundant number representations have often proved useful for speeding up arithmetic operations [MET57, ROB58, TOC58, AVI61, AVI62, PEN62]. In a non-redundant system, even simple operations like addition and subtraction possess a significant on-line delay due to the carry propagation which may involve the full precision of the result. (Recall that the on-line delay corresponds to the number of digits of the operands which must be input before the generation of correct result digits can be initiated.) By allowing redundancy in the number representation, it is possible to limit the carry propagation to one (or two) digital positions [MET57, ROH67, BOR68, ROB67]. Thus, on-line algorithms for addition (and subtraction) with on-line delays of at most one or two digits can be easily developed. Campeau [CAM70] has also developed an on-line algorithm for multiplication with an on-line delay of one digit.

More recent work in the area of on-line computation has been done by Ercegovac [ERC75, TRI75]. He developed an on-line multiplication algorithm which combines the technique of incremented multiplication, as used in digital differential analyzers [BRA63, CAM70, BAK75], with the use of a redundant number system. Ercegovac also proposed an on-line division algorithm in his work. It, however, is not suitable for implementation (in this author's viewpoint) because it requires excessive scaling of the operands initially to insure convergence of the algorithm.

The existence of a reasonable on-line division algorithm, however, has not been at all obvious. The first attempt at deriving a reasonable algorithm for division was made by Trivedi [TRI75] in the Spring of 1975.
In his algorithm the on-line delay for division depends upon the radix and other properties of the number system used. The delay is generally a small, positive constant and alleviates the problem of excessive operand scaling. Division methods presented in this thesis are extensions of Trivedi's preliminary work. His algorithm, combined with parts of the previously mentioned work by Ercegovac, led to the evolution of a set of compatible algorithms for the basic arithmetics.

1.3 Number Representation

All numerical values considered in this thesis are assumed to be represented in finite precision, floating point format with a representational error of $|\epsilon| < r^{-m}$ where m is the precision of the mantissa. The effect of the representational errors is minor and necessitates only a slight extension of the precision of the initial data to obtain the required precision of the results. Thus, the representational error is not of immediate concern. It is assumed that the precision of all initial data has been properly adjusted so that, for a given precision of m bits, the data can be regarded as exact.

Consider an m digit, radix r fractional component of a floating point number N, where

    $N = \sum_{i=1}^{m} n_i r^{-i}$.

Using a conventional representation, each digit $n_i$ can assume any value from the digit set $\{0, 1, \ldots, (r-1)\}$. Such representations, allowing only r values in the digit set, are non-redundant. There is only one (unique) representation for each representable number. By contrast, number systems that allow more than r values in the digit set are redundant and allow more than one representation for each number. See Appendix I for a formal definition of redundancy.

The scope of this thesis covers those cases where the operands are provided in a conventional format while the results are generated in a redundant format.
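The difference between the two kinds of digit set can be made concrete with a small enumeration. The sketch below is illustrative only: the helper name `representations` and the choice of radix 2 with the signed digit set {-1, 0, 1} are assumptions made for this example, not notation from the thesis. It lists every 3-digit fraction $\sum n_i r^{-i}$ equal to 3/8 under a conventional and under a redundant digit set.

```python
from fractions import Fraction
from itertools import product


def representations(value, radix, digits, m):
    """Return every m-digit string (most significant digit first) whose
    value sum(n_i * radix**-i, i = 1..m) equals `value` exactly."""
    return [
        ds
        for ds in product(digits, repeat=m)
        if sum(Fraction(d, radix ** i) for i, d in enumerate(ds, start=1)) == value
    ]


target = Fraction(3, 8)
# Conventional radix-2 digit set: exactly r = 2 digit values.
conventional = representations(target, 2, (0, 1), 3)
# Signed digit set with 3 > r values, hence redundant.
signed_digit = representations(target, 2, (-1, 0, 1), 3)

print(conventional)
print(signed_digit)
```

With the conventional digit set the value has a single digit string, while the signed digit set admits several strings for the same value. That freedom is what allows a result digit to be chosen from incomplete operand information and corrected by later digits.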
A "mostly" on-line recoding scheme for converting redundant results into conventional format, as given in Appendix III, is then used so that the unit will satisfy OBJECTIVE 5.

1.4 The Generalized Procedure

The algorithms developed in this thesis consist of the following sequence of operations whose order may vary slightly from one specific function to another:

1) initialization, which consists of waiting for sufficient digits, corresponding to the on-line delay, to be input;
2) input of the next operand digits;
3) an addition which corrects for any error made in the previous result digit selection and accounts for the new operand digits just received;
4) selection of the next result digit based upon a limited precision model; and
5) a completion test which loops back to step 2) upon failure.

(The on-line delay is shown to be zero for addition, subtraction, and multiplication and is a small, radix dependent constant for division.)

From the above sequence, it is obvious that the computational algorithms are step invariant with the only primitive operation being addition as required by OBJECTIVE 3. The operands are input according to OBJECTIVE 2 and the results are generated according to OBJECTIVE 1.

1.5 Dissertation Overview

Chapters 2, 3, and 4 present compatible on-line algorithms for division, multiplication and addition/subtraction, respectively. The on-line division algorithm is of primary concern, since it is the most difficult of all the algorithms to specify. Once division is specified, compatible on-line algorithms for the other functions can then be defined. Each chapter contains necessary background material on the function in question. The on-line algorithms are then presented along with their convergence conditions and result digit selection schemes. Finally, valid ranges for the operands are derived and a hardware block diagram with description is given. Chapter 5 addresses the question of implementation.
Suitability to LSI, floating point considerations, hardware requirements, and a system's level overview are discussed. A summary of the results, applications for on-line arithmetic units, and the possible implications of such a device are discussed in the final chapter.

2. DIVISION

Algorithms which satisfy the on-line property for addition and subtraction can be easily specified [AVI61, ROH67, ROB67]. Multiplication requires a somewhat more elaborate approach [ERC75, TRI75] and will be discussed in detail in Chapter 3. However, the existence of an on-line division algorithm was not determined until a first attempt at such an algorithm was made by Trivedi in the spring of 1975 [TRI75]. The methods presented in this chapter are extensions of this preliminary work. The division algorithm must be of primary concern, since the algorithmic and logic design for division are the most difficult of all the algorithms to specify. Compatible algorithms for on-line multiplication, addition, and subtraction can then be specified.

2.1 Background

In designing a computer arithmetic unit, division is the most difficult of the basic operations to implement efficiently. Division is inherently a trial-and-error process requiring an initial guess of a quotient digit followed by a comparison (in the form of a subtraction) to determine whether this guess was correct. If it was not, the initial quotient digit is modified and the process is repeated. This class of division based upon subtraction can be defined by the recursive relationship

    $P_j \leftarrow rP_{j-1} - q_j D, \quad j = 1, \ldots, m$    (2.1)

in which

    $P_0$ is the dividend,
    $P_{j-1}$ is the partial remainder used in the jth recursion,
    $P_m$ is the remainder,
    j is the recursion index,
    $q_j$ is the jth quotient digit,
    D is the divisor, and
    r is the radix.

To form the partial remainder, $P_j$, a multiple of the divisor is subtracted from the previous shifted partial remainder.
The determination of which multiple of D to subtract is dependent upon the quotient digit; but it is precisely this quotient digit that must be computed. It is not known a priori. As it stands, this recursion relationship for division does not adequately specify how $q_j$ is to be selected. By adding the range restriction (which is intuitively applied when doing the hand calculation)

    $|P_j| \le K|D|$,    (2.2)

the division algorithm becomes completely specified. (K will be defined in the next section. It is sufficient here to know that $1/2 < K \le 1$.) The important point here is that division not only requires an addition or subtraction (as in multiplication), but also the selection of a quotient digit such that the value of the new partial remainder is within a specified range. On-line division, as investigated in this thesis, is yet a further complication of this process in that the full precision of the operands, $P_0$ and $D$, is not available for comparison. At first consideration it would seem that on-line division is impossible.

By allowing the quotient digits to take on a redundant representation, many of the above problems which are seemingly inherent to division can be resolved. As will be discussed in this chapter, redundancy in the representation of the quotient permits inspection of fewer digits of the operands in the selection of the quotient digits. This seems to be intuitively correct, since without redundancy the quotient has one unique representation and thus each digit of that quotient must be selected precisely. It should be clear that without redundancy it is not possible to avoid a set of full precision comparisons. But, with redundancy the selection of the quotient digit need not be precise. A selection based upon just the first few most significant digits of the divisor and partial remainder is good enough; i.e., the selection is based upon a limited precision version of the operands.
Thus, by using redundancy the trial-and-error nature of division can be avoided. The resultant non-unique representation of the quotient does, however, complicate the division in that the redundant form must eventually be converted to a conventional representation. See Appendix III for a description of the result digit recoding algorithm. Note that, in most cases, this recoding algorithm is also an on-line, most-to-least significant process.

In the case of on-line division it should be immediately evident that in the absence of redundancy the problem cannot be solved. (See Appendix I for a discussion of redundant number representations.) By definition, during on-line processing the full precision of the operands is not available for selection. Therefore, the quotient digit selection must be based upon a limited precision estimate of the divisor and partial remainder. Error is also introduced into the on-line division process when calculating the new partial remainder. During the jth recursion, only those operand digits which have been received prior to the jth iteration have been included in the computation of the old partial remainder. Part of the jth calculation must then be a correction factor which takes into account the effects of the new dividend and divisor digits on the value of the new partial remainder. Thus, the quotient digit selection is based on a possibly erroneous partial remainder, though this error is relatively small. Can the margin of allowable error which is permitted by the use of redundant quotient digits also be made to cover this extra error? If so, how much error of this type can be tolerated? Is there some minimum allowable operand precision required in order for on-line division to proceed?
This chapter will resolve these questions by specifying an on-line division algorithm and the conditions on it.

2.2 The On-line Algorithm

For floating point operation each number X is of the form

    $X = f \cdot r^{e}$

where, usually, f is a fraction in the range $1/r \le |f| < 1$ and e is an exponent. The arithmetic for the fractional parts is handled separately from the arithmetic for the exponents. Thus, two arithmetic units are required, one for fractions and one for exponents. Design of the exponent handling unit is straightforward, requiring only addition and subtraction. However, the design of the unit to handle the fractional arithmetic is nontrivial and it is the algorithms for this unit which will be discussed in depth in this thesis.

Let the radix r representation of the fractional part of the positive dividend, divisor, and quotient be denoted by N, D, and R respectively, such that

    $N = \sum_{i=1}^{m} n_i r^{-i}$,
    $D = \sum_{i=1}^{m} d_i r^{-i}$,
    $R = \sum_{i=1}^{m} q_i r^{-i}$,

and R = N/D to m digit precision. (The result (quotient) is denoted by R so as to be compatible with the notation used in the other algorithms as discussed in the next two chapters.)

Recall that in an on-line environment the digits of the dividend and divisor are not known in advance, but are available on-line, digit-by-digit, most significant digit first. These operand digits, $n_i$ and $d_i$, are typically members of a conventional, nonredundant digit set such that

    $n_i, d_i \in \{0, 1, 2, \ldots, r-1\}$.

It is assumed that the dividend and divisor are in normalized form upon input to the unit; i.e.,

    $1/2 \le D, N < 1$.

The methods presented here are extendible to the case when the operands are also in redundant form [ATK70].

Assume that the first quotient digit, $q_1$, can be properly selected after δ leading digits (the on-line delay or "index difference") of the dividend and divisor are known. Thereafter, one new digit of the quotient can be determined upon the receipt of one new digit each of the dividend and divisor.
Let the quotient digits be members of a symmetric redundant digit set, $\mathcal{D}_\rho$, such that

    $q_i \in \{-\rho, -(\rho-1), \ldots, -1, 0, 1, \ldots, \rho-1, \rho\}$

where

    $r/2 \le \rho \le r-1$.

The degree of redundancy will be denoted by K, referred to as the redundancy coefficient, where

    $K = \frac{\rho}{r-1}$.

Thus, when using a maximally redundant digit set, K = 1. When K = 1/2 (i.e., $\rho = (r-1)/2$) there is no redundancy in the digit set.

The partial remainder is computed via a limited carry-borrow propagation adder [ROH67, ROB67], resulting in a redundantly represented partial remainder. A limited carry-borrow propagation adder is necessary to make the time required to perform the recursive step independent of the precision of the operands; i.e., a carry free, totally parallel addition is possible. Thus, the digits of the partial remainder are also members of a redundant digit set, $\mathcal{D}_{\rho'}$, which may or may not be the same set as $\mathcal{D}_\rho$, the quotient digit set. The redundancy coefficient of the adder is, then, K'.

Given these definitions, the algorithm DIVIDE [TRI75] can be specified. (The algorithm is so structured as to be compatible with the multiplication algorithm specified in Chapter 3.) In this algorithm, the dividend and divisor are assumed to be padded with zero digits on the right (least significant end). Note that the basic recursion, (2.3), is more complex than that of the standard division recursion, (2.1), due to the corrective action necessitated by the operand digits arriving on-line during each iteration.

The convergence of the algorithm DIVIDE can be established as follows. Using the basic recursion (2.3) in algorithm DIVIDE, the following expression for the on-line version of the partial remainder, $\hat{P}_j$, can be derived by induction on j:

    $\hat{P}_j = r^{j} \left[ \sum_{i=1}^{j+\delta} n_i r^{-i} - \left( \sum_{i=1}^{j} q_i r^{-i} \right) \left( \sum_{i=1}^{j+\delta} d_i r^{-i} \right) \right]$    (2.4)

which implies that, as $j \to m$,

    $\hat{P}_m = r^{m} [N - R \cdot D]$

so that

    $R = N/D - \hat{P}_m r^{-m} / D$.
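The induction behind (2.4) takes only a few lines. The sketch that follows is a reconstruction offered for illustration, not a quotation from the thesis. It assumes the shorthand $N_j$ for the dividend digits received through step j (a symbol introduced only here), while $D_j$ and $R_j$ are the running divisor and quotient updated by algorithm DIVIDE, and it writes $\hat{P}_j$ for the on-line partial remainder.

```latex
% Shorthand (N_j is an assumption of this sketch):
%   N_j = \sum_{i=1}^{j+\delta} n_i r^{-i}, \quad
%   D_j = \sum_{i=1}^{j+\delta} d_i r^{-i}, \quad
%   R_j = \sum_{i=1}^{j} q_i r^{-i}.
% Claim (2.4): \hat{P}_j = r^{j} [ N_j - R_j D_j ].
\begin{align*}
\text{Base case: } \hat{P}_0 &= N_0 - R_0 D_0 = N_0 \qquad (R_0 = 0). \\
\text{Step: } r^{j}\,[N_j - R_j D_j]
  &= r^{j}\,[\,N_{j-1} + n_{j+\delta} r^{-j-\delta}
       - (R_{j-1} + q_j r^{-j})\,D_j\,] \\
  &= r^{j}\,[\,N_{j-1} - R_{j-1} D_j\,] - q_j D_j + n_{j+\delta} r^{-\delta} \\
  &= r^{j}\,[\,N_{j-1} - R_{j-1} D_{j-1}\,]
       - R_{j-1} d_{j+\delta} r^{-\delta} - q_j D_j + n_{j+\delta} r^{-\delta} \\
  &= r\hat{P}_{j-1} - q_j D_j + n_{j+\delta} r^{-\delta}
       - R_{j-1} d_{j+\delta} r^{-\delta}.
\end{align*}
```

The third line uses $D_j = D_{j-1} + d_{j+\delta} r^{-j-\delta}$, and the last line applies the induction hypothesis $\hat{P}_{j-1} = r^{j-1}[N_{j-1} - R_{j-1} D_{j-1}]$. The final expression is the right-hand side of the basic recursion (2.3), so the recursion and the closed form (2.4) agree at every step.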
Algorithm DIVIDE:

Step 1 [Initialization]:
    $\hat{P}_0 \leftarrow \sum_{i=1}^{\delta} n_i r^{-i}$;
    $D_0 \leftarrow \sum_{i=1}^{\delta} d_i r^{-i}$;
    $R_0 \leftarrow 0$;
    $j \leftarrow 0$;
    GO TO Step 4;

Step 2 [Input Digit]:
    $D_j \leftarrow D_{j-1} + d_{j+\delta} r^{-j-\delta}$;

Step 3 [Basic Recursion]:
    $\hat{P}_j \leftarrow r\hat{P}_{j-1} - q_j D_j + n_{j+\delta} r^{-\delta} - R_{j-1} d_{j+\delta} r^{-\delta}$;    (2.3)

Step 4 [Selection]:
    $q_{j+1} \leftarrow \mathrm{SELECT}(r\hat{P}_j, D_j)$;
    $R_{j+1} \leftarrow R_j + q_{j+1} r^{-j-1}$;

Step 5 [Test]:
    IF (j < m) THEN j ← j + 1; GO TO Step 2;
    ELSE END DIVIDE;

Therefore, by devising a quotient digit selection procedure, SELECT in Step 4 of algorithm DIVIDE, such that for j = m

    $|\hat{P}_j| \le K \cdot D$    (2.5)

where $1/2 < K \le 1$, then R = N/D can be computed to m digit precision. Assuming that a selection procedure can be specified which generates the quotient digit $q_{j+1}$ while guaranteeing that $|\hat{P}_{j+1}| \le KD$ given that $|\hat{P}_j| \le KD$, then, by induction, the range restriction (2.5) will hold for all values of j. (For j = 0, (2.5) can be satisfied by appropriately preshifting the dividend as explained in Section 2.4.) Such a selection procedure is derived in the next section. First a bound will be established on $|P_j - \hat{P}_j|$ and then a selection procedure will be developed which guarantees that $|P_j| \le KD$. This in turn will give a bound on $\hat{P}_j$.

2.3 Quotient Digit Selection

The division procedure may be defined graphically with a construction suggested by C. V. Freiman [FRE61]. The basis for its construction is the basic recursive relationship, (2.3), together with the range restriction, (2.5), which has been adjusted to include the error introduced by on-line processing. The figure is essentially a plot of partial remainder versus divisor values and is thus designated a P-D plot. By analyzing such a plot, a quotient digit selection procedure can be fully specified for a given r, ρ, and δ.

2.3.1 Range Restriction Analysis

From the recursion (2.1), the following equation can be derived by induction:

    $P_j = r^{j} \left[ \sum_{i=1}^{m} n_i r^{-i} - \left( \sum_{i=1}^{j} q_i r^{-i} \right) \left( \sum_{i=1}^{m} d_i r^{-i} \right) \right]$.    (2.6)

Subtracting equation (2.4) from equation (2.6) gives

    $P_j - \hat{P}_j = r^{j} \left[ \sum_{i=j+\delta+1}^{m} n_i r^{-i} - \left( \sum_{i=1}^{j} q_i r^{-i} \right) \left( \sum_{i=j+\delta+1}^{m} d_i r^{-i} \right) \right]$.    (2.7)

Recall that $P_j$ is the normal full precision representation of the partial remainder and that $\hat{P}_j$ is the on-line version of the partial remainder. Thus, equation (2.7) is a measure of the error introduced at a particular step, j, by using the on-line algorithm. Now, determine the bounds on this error.

UPPER BOUND:

    $P_j - \hat{P}_j \le r^{j} \left[ \sum_{i=j+\delta+1}^{m} (r-1) r^{-i} + \left( \sum_{i=1}^{j} \rho r^{-i} \right) \left( \sum_{i=j+\delta+1}^{m} (r-1) r^{-i} \right) \right]$

    $\le r^{j} \left[ (r-1)\left(r^{-j-\delta-1} - r^{-m-1}\right)\frac{r}{r-1} + \rho\left(r^{-1} - r^{-j-1}\right)\frac{r}{r-1}\,(r-1)\left(r^{-j-\delta-1} - r^{-m-1}\right)\frac{r}{r-1} \right]$

    $= r^{-\delta}(1+K) - r^{-m+j}(1+K) - r^{-j-\delta}K + r^{-m}K$

Since m is assumed to be large with respect to δ, the upper bound is certainly less than

    $P_j - \hat{P}_j < (1+K) r^{-\delta}$.

LOWER BOUND:

    $P_j - \hat{P}_j \ge -r^{j} \left[ \left( \sum_{i=1}^{j} \rho r^{-i} \right) \left( \sum_{i=j+\delta+1}^{m} (r-1) r^{-i} \right) \right]$

    $\ge -r^{j} \left[ \rho\left(r^{-1} - r^{-j-1}\right)\frac{r}{r-1}\,(r-1)\left(r^{-j-\delta-1} - r^{-m-1}\right)\frac{r}{r-1} \right]$

    $= -K\left(r^{-\delta} - r^{-m+j} - r^{-j-\delta} + r^{-m}\right)$

And since m is assumed to be large with respect to δ, the lower bound is certainly greater than

    $P_j - \hat{P}_j > -K r^{-\delta}$.

Combining the above results,

    $-K r^{-\delta} \le P_j - \hat{P}_j \le (1+K) r^{-\delta}$.    (2.8)

Recall from equation (2.2), the range restriction on $P_j$, that

    $-KD \le P_j \le KD$.    (2.9)

From equations (2.8) and (2.9), the range restriction on $\hat{P}_j$ becomes

    $-KD + K r^{-\delta} \le \hat{P}_j \le KD - (1+K) r^{-\delta}$.

Since K is positive, equation (2.5) is satisfied by the above equation for j = 1,...,m and by using this range restriction on $\hat{P}_j$ to define the selection procedure, R = N/D can be computed to m digit precision.

2.3.2 The Selection Equations

By applying the range restriction (2.9) on $P_{j+1}$ and using the (incremented) recursion relationship (2.1), the selection region of $rP_j$ for each possible value of $q_{j+1}$ in $\mathcal{D}_\rho$ can be determined. Let $q_{j+1} = i$ such that $i \in \mathcal{D}_\rho$; then the i-selection region guaranteeing the range restriction (2.9) is given by

    $(-K+i)D \le rP_j \le (K+i)D$.    (2.10)

The corresponding i-selection region for $r\hat{P}_j$
is obtained using equation (2.8) and (2.10). Thus, a partial definition of the SELECT function given in DIVIDE, Si i [) k as 21 q._,, «■ SELECT (rP.,D.) 3+1 J J becomes (-K+i)D + Kr 6 ~ KL i rP. 1 (K+i)D - (1+K)r 6+1 (2.11) This condition can be graphically described by means of a P-D plot, as in Figure 2.1. (The difference between this P-D plot and the conventional P-D plot is that the ordinate is rP . instead of rP . . ) It consists of a 3 3 family of curves which are linear functions of D with q.,-, as a parameter ranging from -p to +p in steps of 1. The area between the maximum rP . and the minimum rP. will be denoted the "q . , , = i region." 3 3+1 So, for a given base (r) , redundancy coefficient (K) , and index difference (6) the division procedure can be specified via a corresponding P-D plot. A given value of D. and rP . will correspond to a point in an i-selection region. The quotient digit q.,-. is, therefore, i and is used in forming the next partial remainder. Figure 2.1 is an example of a full P-D plot with r = 4, K = 2/3, and 6=4. The equations for the selection lines are given in Table 2.1. Note that, as a consequence of redundancy in the representation of the quotient, there is an overlap between each adjacent quotient digit selection region. Some values of rP . and D. will specify a point for which either q.., = i or 3 3 3+1 q. . = i-1 is a valid choice. It is this overlap which permits the quotient digit selection to be made on the basis of estimates of the full precision divisor and shifted partial remainder and thus permits on-line division. By tightening the lower bound of the selection equation (2.11) to give the selection equation (-K+i)D + (l+lOr" 6 " 1 " 1 <_ rP. 
<_ (K+i)D - (1+K)r" 6+1 (2.12) 2.66 — D 2.66 Figure 2.1 P-D Plot with r=4, K=2/3, and 6=4 Table 2.1 Equations Defining the Selection Regions of Figure 2.1 Selection Lines aa Selection Equations UPPER (2 LOWER (2 UPPER (1 LOWER (1 UPPER(0 LOWER (0 UPPER (T LOWER (I UPPER(2 2 5-3 4P. < ("I + 2)D - | 4 2 2-3 4P. > (- | + 2)D + | 4 2 5-3 4P. < (j + 1)D - | 4 2 2-3 4P. > (- | + 1)D + | 4 2 5-3 4P. < (-|)D - | 4 2 2-3 4P > (- f)D + § 4 «j « (| - 1)D - | 4" 3 2 2-3 4P. > (- | - 1)D + | 4 J 4P . < ( | - 2)D - | 4" 3 2 2-3 4P. > (- ^ " 2)D + -f 4 23 the full P-D plot becomes symmetric about both axes as shown in Figure 2.2 with r = 4, K = 2/3, and 6 = 4. The corresponding selection equations are given in Table 2.2. Although this more restrictive, but still valid equation reduces the overlap regions slightly, this reduction is more than compensated for by the fact that all of the quadrants now have identical (except for sign) overlap regions. Thus, only quadrant I need actually be implemented. This small change in the lower bound does not significantly increase the complexity of the step definition for the quotient digit selection (see Section 2. 3. A). In the rest of this chapter we will restrict our attention to the first quadrant of the P-D plot defined according to the selection equation (2.12). (Representative plots are collected into Appendix II.) 2.3.3 Determining the Minimum Index Difference Recall that an initial assumption — that the first quotient digit, q. , can be properly selected after 6 leading digits of the dividend and divisor are known — was made. Now the question arises as to what is the minimum possible value for 6, the index difference for division. The minimum value for 6 is desired because this determines the initial delay time before the division algorithm can start producing quotient digits. 6 most significant digits of the dividend and divisor must be available initially. 
Thus, if one memory word holds 4 operand digits, $2 \cdot \lceil \delta/4 \rceil$ memory accesses must be made prior to the generation of the first quotient digit.†

† Throughout this thesis the term "word" is used to refer to the width of the memory (e.g., 4, 8, or 16 bits for microprocessors). The term "digit" is one radix $r$ digit (e.g., one digit is 2 bits for radix 4). Thus, one word may consist of several digits, while the full precision operands may conceivably consist of several memory words (variable precision), up to some hardware limited maximum.

[Figure 2.2. Modified Symmetric P-D Plot with $r = 4$, $K = 2/3$, and $\delta = 4$]

Table 2.2 Equations Defining the Selection Regions of Figure 2.2

Selection Line    Selection Equation
UPPER(2)          $4\hat{P}_j \le (\tfrac{2}{3} + 2)D - \tfrac{5}{3}\, 4^{-3}$
LOWER(2)          $4\hat{P}_j \ge (-\tfrac{2}{3} + 2)D + \tfrac{5}{3}\, 4^{-3}$
UPPER(1)          $4\hat{P}_j \le (\tfrac{2}{3} + 1)D - \tfrac{5}{3}\, 4^{-3}$
LOWER(1)          $4\hat{P}_j \ge (-\tfrac{2}{3} + 1)D + \tfrac{5}{3}\, 4^{-3}$
UPPER(0)          $4\hat{P}_j \le (\tfrac{2}{3})D - \tfrac{5}{3}\, 4^{-3}$
LOWER(0)          $4\hat{P}_j \ge (-\tfrac{2}{3})D + \tfrac{5}{3}\, 4^{-3}$
UPPER(-1)         $4\hat{P}_j \le (\tfrac{2}{3} - 1)D - \tfrac{5}{3}\, 4^{-3}$
LOWER(-1)         $4\hat{P}_j \ge (-\tfrac{2}{3} - 1)D + \tfrac{5}{3}\, 4^{-3}$
UPPER(-2)         $4\hat{P}_j \le (\tfrac{2}{3} - 2)D - \tfrac{5}{3}\, 4^{-3}$
LOWER(-2)         $4\hat{P}_j \ge (-\tfrac{2}{3} - 2)D + \tfrac{5}{3}\, 4^{-3}$

The minimum allowable value for $\delta$ can be determined by requiring that the lower bound for a $q_{j+1} = i$ selection region and the upper bound for the corresponding $q_{j+1} = i-1$ selection region intersect at a value of $D \le \tfrac{1}{2}$; there must be a nonzero selection overlap for all values of $\tfrac{1}{2} \le D < 1$. Otherwise, there are valid regions on the P-D plot where the quotient digit, $q_{j+1}$, is undefined; that is, the division algorithm could not be completely specified. For example, consider the case of $r = 2$, $K = 1$, and $\delta = 3$ as shown in Figure 2.3. The shaded area is a valid region of the plot where the value of the quotient digit is undefined.

The worst case occurs when $D = \tfrac{1}{2}$ and $i$ is either $\rho$ or $\rho - 1$, that is, in the selection overlap region between the lower limit of $\rho$ and the upper limit of $\rho - 1$. If this overlap region is non-null, then all of the selection overlap regions are guaranteed to be non-null.
The condition

$$ (-K+\rho)\tfrac{1}{2} + (1+K)\, r^{-\delta+1} \le (K+\rho-1)\tfrac{1}{2} - (1+K)\, r^{-\delta+1} $$

must hold. Since $\delta$ is required to be an integer, then

$$ \delta \ge \left\lceil \log_r \frac{4(1+K)}{2K-1} \right\rceil + 1 . \tag{2.13} $$

So, for the case of $r = 2$ and $K = 1$ it is required that $\delta \ge 4$. Figure 2.4 is the P-D plot for the specific case of $\delta = 4$.

[Figure 2.3. P-D Plot with $r = 2$, $K = 1$, and $\delta = 3$; selection regions $(-1+i)D + 2 \cdot 2^{-2} \le 2\hat{P}_j \le (1+i)D - 2 \cdot 2^{-2}$]

[Figure 2.4. P-D Plot with $r = 2$, $K = 1$, and $\delta = 4$; selection regions $(-1+i)D + 2 \cdot 2^{-3} \le 2\hat{P}_j \le (1+i)D - 2 \cdot 2^{-3}$]

Looking ahead to implementation, given the pin limitations of LSI (Large Scale Integration), reasonable values for the number of bits of each operand input to one arithmetic unit (AU) at the beginning of each cycle are 4, 8, or 16. It would be preferable to have $\delta$ small enough so that the division algorithm could proceed after only one memory word for each of the two operands had been input (i.e., a delay of only two memory accesses). On the other hand, increasing $\delta$ increases the size of the overlap region and, thus, simplifies the quotient digit selection. A compromise must be made.

In the binary case, $r = 2$, a digit is one bit, so a convenient choice for $\delta$ would be 4, 8, or 16 respectively, depending on the number of bits input to the AU during each cycle. For base 4, $r = 4$, a digit consists of 2 bits, so $\delta$ should be 2, 4, or 8 respectively. Similarly for base 16. Some of the above choices must be eliminated because they do not meet the restriction on minimum $\delta$. See Appendix II for some representative P-D plots with $\delta$ values as discussed above.

2.3.4 The Model Division

Preliminary Remarks

As stated in Section 2.1, the advantage of using redundant quotient digits is that it eliminates the trial and error nature of division. Using redundant quotient digits permits the selection to be based upon a limited precision model of the operands, thus circumventing the need for a full precision comparison.
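The minimum index difference of Section 2.3.3 can be checked numerically. The following is a minimal sketch, not part of the thesis: it evaluates the worst-case overlap condition at $D = \tfrac{1}{2}$, $i = \rho$ with exact rational arithmetic and returns the smallest integer $\delta$ that satisfies it. The function name `min_index_difference` and the search cap `max_delta` are illustrative assumptions.

```python
from fractions import Fraction


def min_index_difference(r: int, rho: int, max_delta: int = 64) -> int:
    """Smallest integer delta for which LOWER(rho) <= UPPER(rho - 1) at D = 1/2.

    r   : radix
    rho : largest digit of the symmetric redundant quotient digit set
    (hypothetical helper; max_delta only bounds the search)
    """
    K = Fraction(rho, r - 1)  # redundancy coefficient K = rho / (r - 1)
    if not Fraction(1, 2) < K <= 1:
        raise ValueError("the redundancy coefficient must satisfy 1/2 < K <= 1")
    half = Fraction(1, 2)
    for delta in range(1, max_delta + 1):
        margin = (1 + K) * Fraction(1, r) ** (delta - 1)  # (1+K) r^(-delta+1)
        lower_rho = (-K + rho) * half + margin            # LOWER(rho) at D = 1/2
        upper_rho_minus_1 = (K + rho - 1) * half - margin  # UPPER(rho-1) at D = 1/2
        if lower_rho <= upper_rho_minus_1:
            return delta
    raise ValueError("no admissible delta found up to max_delta")


if __name__ == "__main__":
    print(min_index_difference(2, 1))  # r = 2, K = 1
    print(min_index_difference(4, 2))  # r = 4, K = 2/3
```

For $r = 2$, $K = 1$ the search reproduces the requirement $\delta \ge 4$ stated in the text, and for $r = 4$, $K = 2/3$ it gives $\delta = 4$, the value used in Figures 2.1 and 2.2.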
Sufficient background has now been given to permit a complete definition of the SELECT function of algorithm DIVIDE, resulting in a limited precision division model.

The limited precision model is a device which, when given estimates of the divisor and shifted partial remainder of sufficient precision, will output a quotient digit such that restriction (2.12) is satisfied. Thus the model must be able to select, given only an estimate of the operands, the correct quotient digit values. If the point corresponding to the values of $r\hat{P}_j$ and $D_j$ falls in an overlap region of the P-D plot, the model must make a choice between two adjacent quotient digit values. It must take into account the error incurred by the limited precision inputs. While making the selection based upon these inputs, it must guarantee that the quotient digit selected is also valid for the full precision values.

The selection procedure can be visualized as a series of steps spanning the overlap regions. By comparing the values of the estimates to these steps, the appropriate quotient digit can be selected. If the values of $r\hat{P}_j$ and $D_j$ correspond to a point lying on or above the step, the larger quotient digit is selected; if the point lies below the step, the smaller quotient digit is selected. Sometimes the step is one simple comparison constant for $r\hat{P}_j$, or "tread," which spans the entire overlap region from $D_j = \tfrac{1}{2}$ to $D_j = 1$. In this case, the quotient digit can be selected based merely upon the value of the shifted partial remainder and is independent of the value of the divisor! But, more often, due to the steepness and narrowness of the overlap region, the step consists of a connected series of "treads" and "risers" which span the region. Here, the risers define the divisor limits for which the corresponding tread comparison constant for $r\hat{P}_j$ is valid. See Appendix II for some typical steps.

The steps should be chosen such that the simplest and fewest comparisons need be made.
These steps, in turn, are dependent upon the precision of the model. The comparison constant values for the steps must be numbers which are representable given the amount of precision required by the model. Thus, before the steps can be defined, sufficient precision must be determined.

Sufficient Precision

Assume that sufficient precision corresponds to the use of $\alpha$ most significant digits of $r\hat{P}_j$ and $\beta$ most significant digits of $D_j$. Thus, $\alpha$ digits of $r\hat{P}_j$ and $\beta$ digits of $D_j$ are used as inputs to the limited precision division model. Denote these truncated estimates as $\widetilde{rP}_j$ and $\tilde{D}_j$ correspondingly. Recall that only $\delta + j$ digits of each operand are known at Step $j$. Then, obviously, $\beta \le \delta$.

Denote the maximum errors introduced by this truncation into the representation of the operands as $\Delta rp$ and $\Delta d$. Then $\Delta rp$ and $\Delta d$ are defined by

$$ |r\hat{P}_j - \widetilde{rP}_j| \le \Delta rp $$

and

$$ |D_j - \tilde{D}_j| \le \Delta d . $$

Since the $\beta$ most significant digits of $D_j$ are invariant for iterations $0 \le j \le m$, the value of $\tilde{D}_j$ is a constant, or just

$$ \tilde{D} \leftarrow \tilde{D}_j $$

and, for base $r$

$$ \Delta d = r^{-\beta} . \tag{2.14} $$

And, since the partial remainder is computed via a limited carry-borrow propagation adder and is, therefore, in redundant notation,

$$ \Delta rp = K' r^{-\alpha+1} \tag{2.15} $$

where $K'$ is the redundancy coefficient of the adder.

The conditions for determining the smallest possible $\alpha$ and $\beta$ can be found by investigating the worst case (steepest and narrowest) overlap region of the P-D plot, that is, when $D = \tfrac{1}{2}$ and $i = \rho$. See Figure 2.5. Sufficient precision of $r\hat{P}_j$ and $D_j$, as represented by their truncated estimates $\widetilde{rP}_j$ and $\tilde{D}$, is insured if a selection step can be defined in this worst case region. Then, steps for the rest of the overlap regions can be found and the model division completely specified.

Figure 2.5 shows the upper selection limit for $\rho - 1$, UPPER($\rho - 1$),

$$ r\hat{P}_j \le (K+\rho-1)D - (1+K)\, r^{-\delta+1} \tag{2.16} $$

and the lower selection limit for $\rho$, LOWER($\rho$),

$$ r\hat{P}_j \ge (-K+\rho)D + (1+K)\, r^{-\delta+1} \tag{2.17} $$

near $D = \tfrac{1}{2}$. An estimate, $\widetilde{rP}_j$, falling in this region and resulting in the selection of the maximum quotient digit, $q_{j+1} = \rho$, must meet the following constraints

$$ \widetilde{rP}_j - \Delta rp \ge (-K+\rho)\left(\tfrac{1}{2} + \Delta d\right) + (1+K)\, r^{-\delta+1} \tag{2.18} $$

and

$$ \widetilde{rP}_j \le (K+\rho-1)\tfrac{1}{2} - (1+K)\, r^{-\delta+1} \tag{2.19} $$

for the selection to also be valid for the full precision operand, $r\hat{P}_j$.

[Figure 2.5. P-D Plot Showing the Worst Case Overlap Region: the lines UPPER($\rho - 1$), LOWER($\rho$), and LL($\rho$) near $D = \tfrac{1}{2}$, with the limits of equations (2.18) and (2.19) marked]

Thus, the dotted line, LL($\rho$), in Figure 2.5 defines an absolute lower limit for a tread value, $\widetilde{rP}_j$, which would result in a quotient digit selection of $q_{j+1} = \rho$. A similar lower limit exists for each overlap region and will be denoted as LL($i$). The upper limit on the possible tread values is obviously the upper limit of the corresponding overlap region. Thus, the treads (and, hence, the risers) of each staircase must be fully contained between the lines LL($i$) and UPPER($i-1$) in each appropriate overlap region. These treads and risers should assume the simplest possible binary values which conform to the limits and the precision requirements.

The minimum values of $\alpha$ and $\beta$ can now be empirically defined. Subtracting equation (2.18) from (2.19) gives

$$ \Delta rp \le K - \tfrac{1}{2} + (K-\rho)\Delta d - 2(1+K)\, r^{-\delta+1} . $$

Substituting the values of $\Delta rp$ and $\Delta d$, (2.14) and (2.15), into the above equation and using $\rho - K = K(r-2)$ gives

$$ K' r^{-\alpha+1} + K(r-2)\, r^{-\beta} \le K - \tfrac{1}{2} - 2(1+K)\, r^{-\delta+1} . \tag{2.20} $$

For a given base ($r$), redundancy coefficient of the quotient ($K$), index difference ($\delta$), and redundancy coefficient of the adder ($K'$), interdependent $\alpha$ and $\beta$ values can be defined using equation (2.20). Recall that $\alpha$ represents the number of digits in $r\hat{P}_j$ which are redundant. Thus, an attempt should be made to minimize $\alpha$ even at the cost of increasing $\beta$ to the maximum allowed value of $\delta$. The flowchart of the program used to determine near minimal values for $\alpha$ and $\beta$ is given in Figure 2.6.
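Inequality (2.20) alone already narrows the search for $\alpha$ and $\beta$. The sketch below is an illustration under stated assumptions, not the program of Figure 2.6: it fixes $\beta = \delta$, as the text suggests, and returns the smallest $\alpha$ satisfying (2.20). It does not test whether a complete staircase can actually be drawn, which the thesis flowchart also requires, so the result is only a necessary condition. The names `satisfies_2_20`, `min_alpha`, and `max_alpha` are hypothetical.

```python
from fractions import Fraction


def satisfies_2_20(r, K, K_adder, delta, alpha, beta):
    """Evaluate inequality (2.20) exactly.

    K       : redundancy coefficient of the quotient
    K_adder : redundancy coefficient K' of the adder
    """
    inv_r = Fraction(1, r)
    lhs = K_adder * inv_r ** (alpha - 1) + K * (r - 2) * inv_r ** beta
    rhs = K - Fraction(1, 2) - 2 * (1 + K) * inv_r ** (delta - 1)
    return lhs <= rhs


def min_alpha(r, K, K_adder, delta, max_alpha=32):
    """Smallest alpha meeting (2.20) with beta set to its maximum, delta.

    Returns (alpha, beta), or None if no alpha up to max_alpha works
    (max_alpha is an assumed search cap, not a value from the thesis).
    """
    beta = delta
    for alpha in range(1, max_alpha + 1):
        if satisfies_2_20(r, K, K_adder, delta, alpha, beta):
            return alpha, beta
    return None


if __name__ == "__main__":
    two_thirds = Fraction(2, 3)
    print(min_alpha(4, two_thirds, two_thirds, 4))    # r = 4, K = K' = 2/3, delta = 4
    print(min_alpha(2, Fraction(1), Fraction(1), 8))  # r = 2, K = K' = 1, delta = 8
    print(min_alpha(2, Fraction(1), Fraction(1), 4))  # delta at its minimum value
```

For $r = 2$, $K = 1$, and the minimum $\delta = 4$, the right-hand side of (2.20) is zero, so no finite $\alpha$ qualifies. This is consistent with the earlier remark that a larger $\delta$ widens the overlap region and simplifies selection.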
[Figure 2.6. Flowchart for Determining $\alpha$ and $\beta$]

Definition of the Steps

Once "good" minimum values for $\alpha$ and $\beta$ have been chosen for a given set of constants ($r$, $K$, $\delta$, and $K'$), steps can be defined in each overlap region. The value of $\alpha$ limits the maximum precision allowed in the specification of the treads and $\beta$ the maximum precision allowed in the specification of the risers. Recall that the treads and risers must conform to the upper and lower limits, UPPER($i-1$) and LL($i$), in each overlap region.

Think of the overlap region between $q_{j+1} = i$ and $q_{j+1} = i-1$ as a grid of vertical spacings, $\Delta d$, and horizontal spacings, $\Delta rp$. The set of all boundaries in this overlap region is all stairsteps which can be drawn along these grids while remaining inside the upper and lower limits. See Figure 2.7. As $\Delta d$ and $\Delta rp$ are decreased (i.e., $\alpha$ and $\beta$ are increased) the number of different possible boundaries increases exponentially.

The boundary, or stairstep, which results in the simplest and fewest comparisons for the selection of $q_{j+1}$ should be chosen. Some possible choices are shown in Figure 2.7. These comparison constants are then used to define the quotient digit selection function, SELECT, of algorithm DIVIDE. The flowchart for the program used to choose a "good" stairstep in each overlap region is given in Figure 2.8. Little or no attempt was made to minimize the comparison constants of one overlap region in relation to another. See the work of Atkins [ATK67, ATK68, ATK70] for a more detailed analysis.

Appendix II contains some sample P-D plots with stairsteps and examples of the algorithm DIVIDE corresponding to them.
The steps were chosen according to the algorithm defined in the flowchart of Figure 2.8.

[Figure 2.7. Selecting a "Good" Staircase: candidate staircases between LL($i$) and UPPER($i-1$), starting at $D = 0.5$]

[Figure 2.8. Step Definition Flowchart: starting from the riser at $D = 0.5$, the largest tread at each riser is taken as $\lfloor \mathrm{UPPER}(i-1) \text{ at } D(\mathrm{NRISER}) / \Delta rp \rfloor \cdot \Delta rp$, and risers are moved in increments of $\Delta d$ until $D = 1.0$ is reached or the search fails because $\Delta d$ is too large]

The binary values of the steps (i.e., the comparison constants) are given in the righthand ($r\hat{P}_j$) and top ($D$) margins of the plots shown in Appendix II.

2.4 Valid Operand Ranges

By investigating the P-D plots, it becomes obvious that the initial operands must be restricted in the range of values they can assume. To insure the convergence of algorithm DIVIDE, equation (2.5) must be satisfied for $j = 0$. This may require an initial preshifting of the dividend.

As stated in Section 2.2, both operands are assumed to be in normalized form upon input to the arithmetic unit; i.e.,

$$ \tfrac{1}{2} \le D, N < 1 . $$

As seen by looking at the plots, only $D$ is required to be in normalized form for the division algorithm to be defined. The allowable range for $N$ must now be determined.

When the $\delta$ most significant digits of the dividend are shifted to become the first shifted partial remainder, $r\hat{P}_0$, prior to quotient digit selection, it is conceivable that this shifting results in an $r\hat{P}_0$ which is out of range. That is, it corresponds to a point on the P-D plot for which no quotient digit value is defined. In other words, if

$$ r\hat{P}_0 > (K+\rho)D - (1+K)\, r^{-\delta+1} $$

then $r\hat{P}_0$ is out of range and must be scaled before division can proceed. For $r\hat{P}_0$ to be valid it must conform to the bounds

$$ (-K-\rho)D + (1+K)\, r^{-\delta+1} \le r\hat{P}_0 \le (K+\rho)D - (1+K)\, r^{-\delta+1} . $$
Looking at the worst case values on the upper bound ($D = \tfrac{1}{2}$) and assuming minimal redundancy ($\rho = r/2$, $K = r/(2(r-1))$), this implies that

$$ r\hat{P}_0 \le \left( \frac{r}{2(r-1)} + \frac{r}{2} \right) \frac{1}{2} - \left( 1 + \frac{r}{2(r-1)} \right) r^{-\delta+1} $$

or just

$$ r\hat{P}_0 \le \frac{r^2}{4(r-1)} - \frac{3r-2}{2(r-1)}\, r^{-\delta+1} . $$

If the term involving $\delta$ is assumed to be negligible with respect to the $r^2$ term, then $\hat{P}_0$ must fall below the value

$$ \frac{1}{4} < \frac{r}{4(r-1)} \le \frac{1}{2}, \qquad r = 2, 3, \ldots $$

Since $\hat{P}_0$ is input in normalized form, shifting $\hat{P}_0$ one bit to the right will guarantee that $r\hat{P}_0$ will be within the allowable range limits. A correction on the quotient, which consists of shifting $R$ one bit to the left, must then be made.

2.5 Hardware Block Diagram

This chapter has completely defined the algorithm DIVIDE. Specifically, the requirements on the index difference, $\delta$, and the quotient digit selection function, SELECT, have been given. In this section a variable radix block diagram implementing the DIVIDE algorithm will be examined. At this point, only the major components of the arithmetic unit (AU) for DIVIDE will be discussed. Lower level details for actual implementation will be developed in Chapter 5.

Figure 2.9 is a block diagram of the AU for performing division. It is so structured as to be compatible with multiplication, addition, and subtraction with only minor modifications.

[Figure 2.9. Block Diagram for Division: redundant quotient register ($R_{j+1} \leftarrow R_j + q_{j+1} r^{-j-1}$), quotient digit selector, multi-input redundant adder with inputs $r\hat{P}_{j-1}$, $-q_j D_j$, and $r^{-\delta}(n_{j+\delta} - R_{j-1} d_{j+\delta})$, two selection networks, and divisor register ($D_j \leftarrow D_{j-1} + d_{j+\delta} r^{-j-\delta}$), with operand digits $n_{j+\delta}$ and $d_{j+\delta}$ as inputs]

The major component of the AU is a full width† multi-input limited carry-borrow propagation adder. The adder is discussed in detail in Chapter 5. In many practical applications the number of inputs to the adder is rather small.

The quotient digit selector is a table look up device which implements the SELECT function. It examines $\alpha$ most significant digits of $r\hat{P}_j$, $\widetilde{rP}_j$, and $\beta$ most significant digits of $D_j$, $\tilde{D}$, in order to select the appropriate quotient digit, $q_{j+1}$.

The rest of the unit consists mainly of registers and selectors. Two full width double bank registers are required for the storage of the quotient, $R$, and the partial remainder, $\hat{P}_j$, because they are in redundant form. The selection network must be capable of forming the required multiples of $D_j$ and $R_{j-1}$. A carry generator would be needed if, for example, a radix complement representation of negative numbers is used. In that case, the selection network must also be able to form the complement of the possible multiples of $D_j$.

The complexity of the selection network increases for higher radices, and since the additional multiples appear as inputs to the adder, the complexity of the adder would also be increased. Thus, a higher radix, while reducing the number of steps per cycle, does increase the complexity of the arithmetic unit. Chapter 5 addresses the problem of finding an optimal radix while considering both the complexity of the adder and selection network and the complexity of the result digit selector across all of the operations (/, *, +, and -).

† The term "full width" implies that the adder can process a full precision operand (i.e., several memory words) during one cycle and the registers can store one full precision operand each. Thus, the adder and register widths set a hardware upper limit on the maximum allowed precision.

3. MULTIPLICATION

Once the on-line algorithm for division has been specified, compatible algorithms for multiplication, addition, and subtraction can be defined. This chapter presents an on-line multiplication algorithm which has its roots in work done by Ercegovac [ERC75, TRI75]. It combines the well-known technique of incremented multiplication, as used in digital differential analyzers [BRA63, CAM70, BAK75], with the use of redundant number systems.
The algorithm is so structured as to be compatible with the division algorithm specified in Chapter 2.

3.1 Background

In multiplication, a product is accumulated by the successive addition of multiples of the multiplicand to a partial product. Unlike division, the selection of which multiple to add is dependent upon a known quantity (i.e., a digit of the multiplier). Thus, multiplication can be† defined by the recursion relationship

$$ P_j \leftarrow rP_{j-1} + y_j X, \qquad j = 1, \ldots, m \tag{3.1} $$

in which

    $P_0$ is zero,
    $P_j$ is the partial product used in the $j$th recursion,
    $P_m$ is the product,
    $j$ is the recursion index,
    $y_j$ is the $j$th multiplier digit,
    $X$ is the multiplicand, and
    $r$ is the radix.

† Recall that operations proceed from most-to-least significant digit as required for division.

To form the new partial product, $P_j$, a multiple of the multiplicand is added to the previous shifted partial product. Exactly which multiple to add is dependent upon a known multiplier digit. Thus, many of the problems encountered in division are not present in the design of the multiplication algorithm.

In converting multiplication to an on-line process, two complications arise. First, the recursion relationship must be restructured to take into account the on-line nature of the operands. During the $j$th recursion, only those operand digits which have been received prior to the $j$th iteration can be included in the calculation.

Secondly, if a nonredundant number system is used in representing the partial product, the digits of the desired product appear in a right-to-left (least-to-most significant) fashion, as determined by the conventional carry propagation requirements. If, however, redundancy is used in the representation of the product, the desired on-line, most-to-least significant generation of the product digits can be provided. Here, again, the redundant product must eventually be converted to a conventional representation. See Appendix III for a description of the on-line recoding algorithm.
This chapter will develop in detail the methods used to alleviate these complications and specify a compatible on-line multiplication algorithm and the conditions on it.

3.2 The On-line Algorithm

Let the radix $r$ representation of the fractional part of the positive multiplicand, multiplier, and product be denoted by $X$, $Y$, and $R$ respectively, such that

$$ X = \sum_{i=1}^{m} x_i r^{-i}, \qquad Y = \sum_{i=1}^{m} y_i r^{-i}, \qquad R = \sum_{i=1}^{m} p_i r^{-i}, $$

and

$$ R = X \cdot Y $$

to $m$ digit precision.

Recall that in an on-line environment the digits of the multiplicand and multiplier are not known in advance, but are available on-line, digit-by-digit, most significant digit first. The operand digits, $x_i$ and $y_i$, are typically members of a conventional, nonredundant digit set such that

$$ x_i, y_i \in \{0, 1, 2, \ldots, r-1\} . $$

It is assumed that the multiplicand and multiplier are in normalized form; i.e.,

$$ \tfrac{1}{2} \le X, Y < 1 . $$

Define the $j$-digit representation of the on-line operands $X$ and $Y$ as

$$ X_j = \sum_{i=1}^{j} x_i r^{-i} = X_{j-1} + x_j r^{-j} $$

where $X_0 = 0$, and

$$ Y_j = \sum_{i=1}^{j} y_i r^{-i} = Y_{j-1} + y_j r^{-j} $$

where $Y_0 = 0$. The corresponding partial product is, then,

$$ X_j Y_j \leftarrow X_{j-1} Y_{j-1} + \left( X_{j-1} y_j + x_j y_j r^{-j} + Y_{j-1} x_j \right) r^{-j} $$

which can be rewritten as

$$ X_j Y_j \leftarrow X_{j-1} Y_{j-1} + \left( X_j y_j + Y_{j-1} x_j \right) r^{-j} . $$

Therefore, if $P_j$ is the scaled partial product,

$$ P_j = X_j Y_j r^{j} \tag{3.2} $$

a recursion relationship for multiplication which takes into account the on-line nature of the operands can be expressed as

$$ P_j \leftarrow rP_{j-1} + X_j y_j + Y_{j-1} x_j \tag{3.3} $$

But this relationship does not, as it stands, generate the result digits in an on-line fashion. The product digits would become available from the least-to-most significant end of $P_m$, as determined by the traditional carry propagation requirements.

Assume that the product digits, $p_j$, can be selected on-line using a recursion relationship similar to (3.3).
One new digit of the product could then be determined upon the receipt of one new digit each of the multiplicand and multiplier. In multiplication the index difference, $\delta$, is identically 1. That is, only one digit of each of the operands is needed initially to select the first result digit. Let the product digits, $p_j$, be members of the same symmetric redundant digit set as defined for the quotient digits in Chapter 2. The partial product is computed via the same limited carry-borrow propagation adder used to generate the partial remainder during division. Thus, the digits of the partial product are members of the redundant digit set of the adder.

Given these definitions, the algorithm MULT [TRI75], shown below, can be specified.

Algorithm MULT

Step 1 [Initialization]:
    $\hat{P}_0 \leftarrow 0$; $X_0 \leftarrow 0$; $Y_0 \leftarrow 0$;
    $R_0 \leftarrow 0$; $p_0 \leftarrow 0$; $j \leftarrow 1$;
Step 2 [Input Digit]:
    $X_j \leftarrow X_{j-1} + x_j r^{-j}$;
    $Y_j \leftarrow Y_{j-1} + y_j r^{-j}$;
Step 3 [Basic Recursion]:
    $\hat{P}_j \leftarrow r(\hat{P}_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j$    (3.4)
Step 4 [Selection]:
    $p_j \leftarrow \mathrm{SELECT}(\hat{P}_j)$;
    $R_j \leftarrow R_{j-1} + p_j r^{-j}$;
Step 5 [Test]:
    IF ($j < m$) THEN $j \leftarrow j + 1$; GO TO Step 2;
    ELSE END MULT;

In this algorithm, recursion (3.3), which allows for on-line processing of operands, has been altered to form the basic recursion, (3.4), providing on-line generation of result digits. The selection procedure for result digits, as shown in the next section, corresponds to a simple rounding on $\hat{P}_j$. Thus, the basic recursion includes a "correction" on $\hat{P}_{j-1}$ to take into account the previous selection of $p_{j-1}$.

From equation (3.3) and the basic recursion equation, (3.4), in algorithm MULT, the following relation can be obtained by induction.

$$ \hat{P}_j = P_j - \sum_{i=1}^{j-1} p_i r^{j-i} . $$

This implies that, as $j \to m$,

$$ \hat{P}_m = P_m - \sum_{i=1}^{m-1} p_i r^{m-i} $$

or, rearranging,

$$ P_m = \hat{P}_m + r^m \sum_{i=1}^{m-1} p_i r^{-i} . \tag{3.5} $$

Since, by equation (3.2),

$$ P_m = X_m \cdot Y_m \cdot r^m = X \cdot Y \cdot r^m , $$

equation (3.5) can be rewritten as

$$ X \cdot Y \cdot r^m = \hat{P}_m + r^m \sum_{i=1}^{m-1} p_i r^{-i} $$

which is just

$$ X \cdot Y = \sum_{i=1}^{m} p_i r^{-i} + (\hat{P}_m - p_m)\, r^{-m} $$

so that

$$ R = \sum_{i=1}^{m} p_i r^{-i} = X \cdot Y - (\hat{P}_m - p_m)\, r^{-m} . $$

By devising a product digit selection procedure, SELECT, in Step 4 of algorithm MULT, such that

$$ |\hat{P}_m - p_m| \le K \tag{3.6} $$

where the redundancy coefficient, $K$, satisfies $\tfrac{1}{2} < K \le 1$, then $R = X \cdot Y$ can be computed to $m$ digit precision. In the next section such a selection procedure will be derived so as to guarantee convergence of the algorithm.

The algorithm as it stands produces just the most significant half of the product. The least significant half of the product is available as the redundant output of the adder after iteration $m + 1$; i.e.,

$$ \hat{P}_{m+1} = r(\hat{P}_m - p_m) . $$

By feeding these redundant adder digits directly into the recoding unit, the least significant half of the product can also be output in conventional form.

3.3 Product Digit Selection

In order to implement the algorithm MULT, a product digit selection procedure, SELECT of Step 4, must be devised such that restriction (3.6) is satisfied. For $j = 1$, the range restriction can be satisfied by appropriately preshifting the operands as explained in Section 3.4. Assuming that there is a selection procedure that generates the product digit $p_j$ so as to guarantee that

$$ |\hat{P}_j - p_j| \le K $$

given that

$$ |\hat{P}_{j-1} - p_{j-1}| \le K , $$

then by induction the range restriction (3.6), assuming that the operands conform to certain bounds as derived in Section 3.4, will hold for all values of $j$.

Obviously, by performing a "simple" rounding on $\hat{P}_j$ and using the integer part of this rounded $\hat{P}_j$ to represent the product digit, $p_j$,

$$ |\hat{P}_j - p_j| \le K . $$

The remainder of this section will fully specify an implementable rounding procedure for the selection of the product digits.

3.3.1 The Selection Function

Define the product digit selection function [TRI75] to be

$$ p_j \leftarrow \mathrm{SELECT}(\hat{P}_j) = \begin{cases} \operatorname{sign} \hat{P}_j \cdot \lfloor |\hat{P}_j| + \tfrac{1}{2} \rfloor & \text{for } |\hat{P}_j| \le \rho , \\ \operatorname{sign} \hat{P}_j \cdot \lfloor |\hat{P}_j| \rfloor & \text{otherwise.} \end{cases} \tag{3.7} $$

This represents, for all practical purposes, a rounding procedure which has been modified at the end points of the domain to avoid product digit values greater than $\rho$. Thus, the selection process itself can be carried out in a deterministic fashion; that is, the product digit selected by the procedure is simply the integer part of the rounded partial product, $\hat{P}_j$.

However, the partial product is in redundant form. Thus, a "simple" rounding would require a full precision carry propagation of $\hat{P}_j$ in order to determine its magnitude. This must be avoided if at all possible. An ideal selection procedure is one in which the time necessary to perform the selection is independent of the precision of the operands (i.e., step invariant).

By devising a graphical representation for the selection process, the problem can be more easily understood and solved. Figure 3.1 is a plot of the partial product, $\hat{P}_j$, versus the remaining partial product after rounding, $\hat{P}_j - p_j$, and will be designated a P-p plot. By analyzing such a plot, a product digit selection procedure based upon a limited number of leading digits of $\hat{P}_j$ can be fully specified for a given $r$ and $\rho$, resulting in a precision independent selection.

In Figure 3.1 each line corresponding to a different product digit value will be called a "p-line." The bound on the remaining partial product after rounding,

$$ |\hat{P}_j - p_j| \le K , $$

determines the maximum allowable value for the partial product, $\hat{P}_j$; that is,

$$ |\hat{P}_j| \le K + \rho = rK $$

in order for a product digit to be properly selected. Thus, the maximum value for $\hat{P}_j$ is $rK$, which occurs at the intersection of $(\hat{P}_j - p_j) = K$ and the p-line $p_j = \rho$. Similarly, the minimum value, $-rK$, must occur at the intersection of $(\hat{P}_j - p_j) = -K$ and the p-line $p_j = -\rho$. These bounds are indicated by the dashed vertical lines on Figure 3.1.

The range for the partial product, $\hat{P}_j$, in multiplication ($-K - \rho \le \hat{P}_j \le K + \rho$) is approximately comparable in magnitude to the range of the shifted partial remainder, $r\hat{P}_j$, in division ($(-K - \rho)D + (1 + K)r^{-\delta+1} \le r\hat{P}_j \le (K + \rho)D - (1 + K)r^{-\delta+1}$).

[...] to be within the required limits of $\pm(K + \rho)$. These operand range restrictions and their effects on the selection function are covered in Section 3.4.

As in division, redundancy in the representation of the product permits the selection of product digits to be based upon a limited number of leading digits of $\hat{P}_j$. This is manifested in Figure 3.1 by the overlap region, A, for which either of two p-lines may be legitimately selected. For example, at point A one may move vertically upward to the p-line, $p_j = 0$, or downward to the p-line, $p_j = 1$. In either case the product digit is correct.

The defined selection function (3.7) implies that the comparison constants for multiplication are simple, low precision numbers (i.e., $\pm\tfrac{1}{2}, \pm 1\tfrac{1}{2}, \pm 2\tfrac{1}{2}, \ldots, \pm(\rho - \tfrac{1}{2})$). Then the product digit, $p_j$, can be defined according to

$$ p_j = \begin{cases} i - 1 & (i-1) - \tfrac{1}{2} \le \hat{P}_j < (i-1) + \tfrac{1}{2} \\ i & i - \tfrac{1}{2} \le \hat{P}_j < i + \tfrac{1}{2} \end{cases} $$

[...] (3.10)

In Table 3.1 the minimum values of $\gamma$ (the number of redundant digits of $\hat{P}_j$) and $\gamma'$ (the number of nonredundant bits derived from the $\gamma$ most significant digits of $\hat{P}_j$ on which a carry propagation has been performed) are shown as a function of the base ($r$), product redundancy coefficient ($K$), and adder redundancy coefficient ($K'$).

[Figure 3.3. P-p Plot Showing the Worst Case Error]

Table 3.1 Minimum $\gamma$ Values

$\gamma = 1 + \left\lceil \log_r \dfrac{2K'}{2K-1} \right\rceil$

BASE    K          K'         $\gamma$ (DIGITS)    $\gamma'$ (BITS)
2       1          1          2                    2
4       2/3        2/3        2                    4
4       1          1          2                    3
8       4/7        4/7        2                    6
8       1          1          2                    4
16      8/15       8/15       2                    8
16      1          1          2                    5
256     128/255    128/255    2                    16
256     1          1          2                    9

3.4 Valid Operand Ranges

As stated in Section 3.3.1, the range of the initial operands must be restricted to insure the convergence of the algorithm MULT. The operands may have to be preshifted initially to conform to the specified convergence bounds.
Recall that both operands are assumed to be in normalized form upon input to the arithmetic unit; i.e.,

$$\tfrac{1}{2} \le X, Y < 1 .$$

To be able to select the first product digit, $p_1$, the first partial product, $P_1$, must conform to the bound

$$P_1 \le K + \rho .$$

From the basic recursion, (3.4), the bound on $X_1 y_1$ is seen to be

$$X_1 y_1 \le K + \rho = rK .$$

Assuming worst case values for $y_1$ and $K$, this implies that

$$X \le \frac{1}{2} \cdot \frac{r}{r-1}$$

and, since

$$\lim_{r \to \infty} \frac{1}{2} \cdot \frac{r}{r-1} = \frac{1}{2},$$

shifting $X$ one bit to the right will guarantee that $P_1$ will be within the allowed range limits for the selection of $p_1$.

To guarantee convergence of the algorithm MULT over all $j$, certain bounds must also be placed upon $X$ and $Y$ in order for each new partial product to be within the required limits of $\pm(K+\rho)$. Given that

$$|P_{j-1} - p_{j-1}| \le K ,$$

the upper bound on the values of the operands must be small enough to guarantee that

$$|P_j| \le rK$$

to insure valid selection of $p_j$. From the basic recursion of the algorithm MULT, equation (3.4),

$$P_j \leftarrow r(P_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j ,$$

worst case values imply that

$$P_j \le r(P_{j-1} - p_{j-1}) + X(r-1) + Y(r-1) ,$$

or just

$$P_j \le r(P_{j-1} - p_{j-1}) + (X+Y)(r-1) .$$

Let the upper bound on both operands be represented by $M$. The largest possible value for $M$, resulting in the fewest possible preshifting steps (scaling) on the initial operands, should be found. By using the upper bound $M$, the above equation becomes

$$P_j \le r(P_{j-1} - p_{j-1}) + 2M(r-1) .$$

Replacing $P_j$ by its upper bound for selection, this becomes

$$r(P_{j-1} - p_{j-1}) + 2M(r-1) \le rK$$

or, rearranging,

$$M \le \frac{r\,(K - (P_{j-1} - p_{j-1}))}{2(r-1)} . \qquad (3.11)$$

If the smallest possible number of leading digits of $P_{j-1}$ was used for the selection of $p_{j-1}$, then the upper bound $(P_{j-1} - p_{j-1}) \le K$ must be used in equation (3.11) and the bound on the operands goes to zero! But, by inspecting more digits than the absolute minimum for selection, the bound on $(P_{j-1} - p_{j-1})$ can be tightened considerably.
A full precision inspection results in the tightest possible upper bound of $\tfrac{1}{2}$. Thus, for full precision inspection the bound, $M$, is

$$M \le \frac{r(2K-1)}{4(r-1)} .$$

Assuming maximum redundancy ($K=1$),

$$\lim_{r \to \infty} \frac{r}{4(r-1)} = \frac{1}{4} ,$$

so that shifting both $X$ and $Y$ two bits to the right will guarantee convergence for the case of full precision inspection.

A compromise between the minimum number of digits inspected for selection and the smallest amount of necessary initial scaling must be made. As will be discussed in Chapter 5, more digits of the partial remainder are always required for selection during division. Since these lines are available, using them during multiplication does not increase the complexity of the arithmetic unit, while it does relax the bound, $M$, on the operands.

3.5 Hardware Block Diagram

In this chapter the algorithm MULT has been completely defined. Specifically, the product digit selection function, SELECT, and initial range restrictions have been given. In this section a variable block diagram implementing the MULT algorithm will be examined. Here, again, only the major components of the arithmetic unit (AU) for MULT will be discussed. Lower level details for actual implementation will be developed in Chapter 5.

Figure 3.4 Block Diagram for Multiplication (blocks: redundant product register, $R_j \leftarrow R_{j-1} + p_j r^{-j}$; product digit selector; multi-input redundant adder, $P_j \leftarrow -r p_{j-1} + X_j y_j + Y_{j-1} x_j + r P_{j-1}$; two selection networks; multiplicand register, $X_j \leftarrow X_{j-1} + x_j r^{-j}$; multiplier register, $Y_j \leftarrow Y_{j-1} + y_j r^{-j}$)

Figure 3.4 is a block diagram of the AU for performing multiplication. It is so structured as to be compatible with division, addition, and subtraction.

As with division, the major component of the AU is a full width multi-input limited carry-borrow propagation adder.
The product digit selector is a table look up device which implements the SELECT function for multiplication. It examines the $\gamma$ most significant digits of $P_j$, $\hat{P}_j$, and essentially rounds them to select the proper product digit, $p_j$. As with division, the rest of the unit consists of registers and selectors.

3.6 Some Numerical Examples

Table 3.2 through Table 3.4 are examples of the algorithm MULT for several different radices and product redundancy coefficients ($K$). Tabulated values are written in binary; a barred digit such as 1̄ denotes a negative digit.

Table 3.2 Example of the Algorithm MULT (r=2, K=1)

r = 2, m = 8, $D_\rho$ = {1̄, 0, 1}, K = 1
X = 0.01101001    Y = 0.01110011    (R = 0.0010111100101011)

j    $X_j y_j + Y_{j-1} x_j$    $P_j$         $p_j$    $2(P_j - p_j)$
1    0.0                        0.0           0        0.0
2    0.01                       0.01          0        2(0.01-0) = 0.1
3    0.101                      1.001         1        2(1.001-1) = 0.01
4    0.0110                     0.1010        1        2(0.1010-1) = -0.11
5    0.0111                     -0.0101       0        2(-0.0101-0) = -0.101
6    0.0                        -0.101        1̄        2(-0.101+1) = 0.11
7    0.0110100                  1.0010100     1        2(1.0010100-1) = 0.010100
8    0.11011011                 1.00101011    1        2(1.00101011-1) = 0.0101011

$$R = \sum_{j=1}^{8} p_j 2^{-j} + 2^{-8}(P_8 - p_8) = 0.00110\bar{1}11 + 2^{-9}(0.0101011) = 0.0010111100101011$$

Table 3.3 Example of the Algorithm MULT (r=4, K=1)

r = 4, m = 4, $D_\rho$ = {3̄, 2̄, 1̄, 0, 1, 2, 3}, K = 1
X = 0.01101001    Y = 0.01110011    (R = 0.0010111100101011)

j    $X_j y_j + Y_{j-1} x_j$                      $P_j$          $p_j$    $4(P_j - p_j)$
1    0.01                                         0.01           0        4(0.01-0) = 1.0
2    3(0.0110) + 2(0.01) = 1.1010                 10.1010        3        4(10.1010-3) = -1.1
3    2(0.0111) = 0.1110                           -0.101         1̄        4(-0.101+1) = 1.1
4    3(0.01101001) + 0.011100 = 1.10101011        11.00101011    3        4(11.00101011-3) = 0.101011

$$R = \sum_{j=1}^{4} p_j 4^{-j} + 4^{-4}(P_4 - p_4) = 0.03\bar{1}3_4 + 4^{-5}(0.223_4)$$

$$R = 0.02330223_4 = 0.0010111100101011_2$$

Table 3.4 Example of the Algorithm MULT (r=4, K=2/3)

r = 4, m = 4, $D_\rho$ = {2̄, 1̄, 0, 1, 2}, K = 2/3
X = 0.01101001    Y = 0.01110011    (R = 0.0010111100101011)
TO INSURE CONVERGENCE: SHIFT RIGHT ONE DIGIT EACH

j    $X_j y_j + Y_{j-1} x_j$                           $P_j$             $p_j$    $4(P_j - p_j)$
1    0.0                                               0.0               0        0.0
2    0.0001                                            0.0001            0        4(0.0001-0) = 0.01
3    3(0.000110) + 2(0.0001) = 0.011010                0.101010          1        4(0.101010-1) = -1.011
4    2(0.000111) = 0.001110                            -1.00101          1̄        4(-1.00101+1) = -0.101
5    3(0.0001101001) + 0.00011100 = 0.0110101011       -0.0011010101     0        4(-0.0011010101-0) = -0.11010101

$$R = 4^{2}\Big(\sum_{j=1}^{5} p_j 4^{-j} + 4^{-5}(P_5 - p_5)\Big) = 4^{2}\,(0.001\bar{1}0_4 - 0.0000003111_4)$$

$$R = 0.010\bar{1}0000\bar{1}\bar{1}0\bar{1}0\bar{1}0\bar{1}_2 = 0.0010111100101011_2$$

4. ADDITION AND SUBTRACTION

Once compatible on-line algorithms for division and multiplication have been specified, it remains to define compatible on-line addition and subtraction algorithms. It would be possible, since a limited carry-borrow propagation adder is used for the basic recursion, to simply feed the output of the adder directly to the recoding logic. Instead, an algorithm which is quite similar to the multiplication algorithm is presented. It will soon become apparent why this method is preferred.

4.1 Background

Consider the situation at the $i$th digit position of a conventional adder. The addend A and the augend B provide the inputs $a_i$ and $b_i$, each of weight $r^{-i}$, to the $i$th stage. A third input $c_i$, also of weight $r^{-i}$, is the carry output from digit position $i+1$. The outputs of the $i$th position are the sum digit $s_i$ of weight $r^{-i}$ and the carry out digit $c_{i-1}$ of weight $r^{-i+1}$. The relationship between outputs and inputs is, then,

$$r c_{i-1} + s_i = a_i + b_i + c_i .$$

Because the carry out $c_{i-1}$ is propagated from the least significant to the most significant end of the adder, the speed of conventional addition is less than desirable. It has long been recognized that carries do not need to be propagated during each addition of a long sequence of additions, provided that the carries are explicitly stored. This technique was first used to speed up multiplication, with a single carry propagation at the end of the operation.

Storing the carry, $c_{i-1}$, is a method of introducing redundancy into the adder. Much work has been done on the development of redundant adders [ROH67, BOR68, MEL72, GOY76].
In particular, Rohatsch's work proves that the utilization of any redundant and contiguous sum-difference digit set makes possible the implementation of limited carry-borrow addition and subtraction; that is, addition and subtraction for which carries-borrows propagate no further than a fixed number (one or two in practice) of digital positions. Goyal's work gives an exhaustive study of various design techniques, and the reader is referred to it for the gory details.

Since the carry-borrows can be limited to one or two digital positions, a redundant adder could be used directly for on-line addition with an on-line delay of up to two digits. There is, however, an algorithm which involves no delay for producing on-line sums and differences. This algorithm will now be presented.

4.2 The On-line Algorithm

Let the radix $r$ representation of the fractional part of the positive addend (minuend), augend (subtrahend), and sum (difference) be denoted by A, B, and R respectively, such that

$$A = \sum_{i=1}^{m} a_i r^{-i}, \qquad B = \sum_{i=1}^{m} b_i r^{-i}, \qquad R = \sum_{i=1}^{m} s_i r^{-i+1}$$

and R = A+B (R = A-B) to $m$ digit precision.

In an on-line environment the digits of the addend (minuend) and augend (subtrahend) are not known in advance, but are available on-line, digit-by-digit, most significant digit first. The operand digits are typically members of a conventional nonredundant digit set, $D$, such that

$$a_i, b_i \in \{0, 1, 2, \ldots, (r-1)\} .$$

It is assumed that the operands are in normalized form; i.e.,

$$\tfrac{1}{2} \le A, B < 1 .$$

Assume that the sum (difference) digits can be selected on-line. One new digit of the sum (difference) would then be determined upon the receipt of one new digit each of the addend (minuend) and augend (subtrahend). In addition (subtraction) the index difference, $\delta$, is identically 1. That is, only one digit of each of the operands is needed initially to select the first result digit.
Let the sum (difference) digits, the $s_i$'s, be members of the same symmetric redundant digit set, $D_\rho$, as defined for the quotient and product digits. The partial sum (difference) is computed via the same limited carry-borrow propagation adder used to generate the partial remainder during division and the partial product during multiplication. Thus, the digits of the partial sum (difference) are members of the redundant digit set, $D_{\rho'}$.

Given these definitions, the algorithm ADD and the algorithm SUBT can be defined in the same manner as the algorithm MULT was defined for multiplication. The selection procedure for result digits corresponds to a simple rounding on $P_j$, exactly as done in multiplication. Thus, again, the basic recursion includes a "correction" on $P_{j-1}$ to take into account the previous selection of $s_{j-1}$.

As in multiplication, a recursion relationship for addition which takes into account the on-line nature of the operands only can be expressed as

$$\tilde{P}_j \leftarrow r\tilde{P}_{j-1} + (a_j + b_j) r^{-1} . \qquad (4.3)$$

Using this recursion, the sum digits would become available from the least-to-most significant end of $\tilde{P}_m$, as determined by the traditional carry propagation requirements.

Algorithm ADD:

    Step 1 [Initialization]:   $P_0 \leftarrow 0$; $s_0 \leftarrow 0$; $R_0 \leftarrow 0$; $j \leftarrow 1$;
    Step 2 [Input Digits]:     $a_j$ AND $b_j$;
    Step 3 [Basic Recursion]:  $P_j \leftarrow r(P_{j-1} - s_{j-1}) + (a_j + b_j) r^{-1}$     (4.1)
    Step 4 [Selection]:        $s_j \leftarrow \mathrm{SELECT}(P_j)$; $R_j \leftarrow R_{j-1} + s_j r^{-j+1}$;
    Step 5 [Test]:             IF ($j < m$) THEN $j \leftarrow j + 1$; GO TO Step 2; ELSE END ADD;

Algorithm SUBT:

    Step 1 [Initialization]:   $P_0 \leftarrow 0$; $s_0 \leftarrow 0$; $R_0 \leftarrow 0$; $j \leftarrow 1$;
    Step 2 [Input Digits]:     $a_j$ AND $b_j$;
    Step 3 [Basic Recursion]:  $P_j \leftarrow r(P_{j-1} - s_{j-1}) + (a_j - b_j) r^{-1}$     (4.2)
    Step 4 [Selection]:        $s_j \leftarrow \mathrm{SELECT}(P_j)$; $R_j \leftarrow R_{j-1} + s_j r^{-j+1}$;
    Step 5 [Test]:             IF ($j < m$) THEN $j \leftarrow j + 1$; GO TO Step 2; ELSE END SUBT;
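The two algorithms can be traced in software. The following is a minimal sketch, not the thesis's hardware: it assumes a full precision inspection of $P_j$ (exact rational arithmetic instead of a truncated estimate), and the names `select`, `online_add`, and the `flush` parameter are illustrative choices. The extra flush step with zero input digits mirrors the tabulated examples of Section 4.5, which run to step $m+1$.

```python
from fractions import Fraction
import math


def select(P, rho):
    """Rounding selection: round |P| to the nearest integer when |P| <= rho,
    truncate |P| otherwise, then restore the sign of P."""
    magnitude = abs(P)
    sign = -1 if P < 0 else 1
    if magnitude <= rho:
        return sign * math.floor(magnitude + Fraction(1, 2))
    return sign * math.floor(magnitude)


def online_add(a_digits, b_digits, r, rho, subtract=False, flush=1):
    """Return the signed result digits s_1, s_2, ... (s_i has weight r**(1 - i)).

    a_digits, b_digits: operand digits in {0, ..., r-1}, most significant first.
    flush: number of extra steps with zero input digits (an assumption that
           mirrors the worked tables, which run one step past m).
    """
    P_prev, s_prev = Fraction(0), 0          # Step 1: initialization
    digits = []
    for j in range(len(a_digits) + flush):
        # Step 2: take the next operand digits (zeros once the inputs run out)
        a = a_digits[j] if j < len(a_digits) else 0
        b = b_digits[j] if j < len(b_digits) else 0
        incoming = Fraction(a - b if subtract else a + b, r)
        # Step 3: basic recursion (4.1) for ADD, (4.2) for SUBT
        P = r * (P_prev - s_prev) + incoming
        # Step 4: select the next result digit
        s = select(P, rho)
        digits.append(s)
        P_prev, s_prev = P, s
    return digits


if __name__ == "__main__":
    # A = 0.11011001 and B = 0.10100101 (binary), radix 2, rho = 1
    print(online_add([1, 1, 0, 1, 1, 0, 0, 1], [1, 0, 1, 0, 0, 1, 0, 1], r=2, rho=1))
```

For the radix 2 operands A = 0.11011001 and B = 0.10100101 the sketch yields the digit string 1, 1, -1, 1, 1, 1, 1, 1, 0, whose value is A + B = 1.01111110 in binary.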
From equation (4.3) and the basic recursion equation in algorithm ADD, (4.1), the following relation for addition can be obtained by induction.

$$P_j = \tilde{P}_j - \sum_{i=1}^{j-1} s_i r^{j-i}$$

This implies that, as $j \to m$,

$$P_m = \tilde{P}_m - \sum_{i=1}^{m-1} s_i r^{m-i}$$

or, rearranging,

$$\tilde{P}_m = P_m + r^{m-1} \sum_{i=1}^{m-1} s_i r^{-i+1} . \qquad (4.4)$$

Since $\tilde{P}_m = (A+B) r^{m-1}$ for addition and $\tilde{P}_m = (A-B) r^{m-1}$ for subtraction, equation (4.4) can be rewritten as

$$(A+B) r^{m-1} = P_m + r^{m-1} \sum_{i=1}^{m-1} s_i r^{-i+1}$$

which is just

$$A + B = \sum_{i=1}^{m} s_i r^{-i+1} + (P_m - s_m) r^{-m+1}$$

so that

$$R = \sum_{i=1}^{m} s_i r^{-i+1} = (A+B) - (P_m - s_m) r^{-m+1} .$$

By using the product selection procedure SELECT defined for multiplication in Chapter 3, Section 3.3, then

$$|P_m - s_m| \le K \qquad (4.5)$$

and R = A+B (R = A-B) can be computed to $m$ digit precision.

4.3 Sum (Difference) Digit Selection

Using the same selection function as defined for multiplication,

$$s_j \leftarrow \mathrm{SELECT}(P_j) = \begin{cases} \operatorname{sign} P_j \cdot \lfloor |P_j| + \tfrac{1}{2} \rfloor & \text{for } |P_j| \le \rho, \\ \operatorname{sign} P_j \cdot \lfloor |P_j| \rfloor & \text{otherwise,} \end{cases} \qquad (4.6)$$

will insure the convergence of the algorithms ADD and SUBT. This represents a simple rounding procedure on $P_j$. Recall, however, that the partial sum, $P_j$, is in redundant form. Thus, a "simple" rounding would require a full precision carry propagation of $P_j$ in order to determine its magnitude. This must be avoided if at all possible.

Sufficient precision for the proper selection of sum digits can be determined by looking at the basic recursion of the algorithm ADD, equation (4.1),

$$P_j \leftarrow r(P_{j-1} - s_{j-1}) + (a_j + b_j) r^{-1} .$$

Using the same upper bound on $P_j$ for proper selection as defined for multiplication,

$$|P_j| \le rK ,$$

and assuming worst case values for $a_j$ and $b_j$ of $(r-1)$, the basic recursion can be restated as

$$r(P_{j-1} - s_{j-1}) + 2(r-1) r^{-1} \le rK .$$

The bound on the "residual," $P_{j-1} - s_{j-1}$, is then

$$(P_{j-1} - s_{j-1}) \le K - \frac{2(r-1)}{r^2} . \qquad (4.7)$$

For base 2 ($K=1$), this is just

$$(P_{j-1} - s_{j-1}) \le 1 - \tfrac{1}{2} = \tfrac{1}{2} .$$

So, for base 2 a full precision inspection of $P_j$ to select $s_j$ must be performed.
For base 4 ($K=1$), equation (4.7) becomes

$$(P_{j-1} - s_{j-1}) \le 1 - \tfrac{6}{16} = \tfrac{5}{8} ,$$

which implies a two digit inspection of $P_j$ to select $s_j$. Selection requirements for other cases can be determined in a similar manner.

4.4 Hardware Block Diagram

In this chapter the algorithms ADD and SUBT have been completely defined. In this section a variable block diagram implementing the ADD and SUBT algorithms will be examined. Here, again, only the major components of the arithmetic unit (AU) for ADD and SUBT will be discussed. Lower level details for actual implementation will be developed in Chapter 5.

Figure 4.1 is a block diagram of the AU for performing addition and subtraction. It is so structured as to be compatible with division and multiplication.

As with the other algorithms, the major component of the AU is a full width multi-input limited carry-borrow propagation adder. The sum (difference) digit selector is a table look up device which implements the SELECT function described for addition and subtraction (i.e., the same function as in multiplication). It examines the $\gamma$ most significant digits of $P_j$, $\hat{P}_j$, and essentially rounds them to select the proper sum (difference) digit, $s_j$. As with the other algorithms, the rest of the unit consists of registers and selectors.

Figure 4.1 Block Diagram for Addition and Subtraction (blocks: redundant sum (difference) register, $R_j \leftarrow R_{j-1} + s_j r^{-j+1}$; sum (difference) digit selector; multi-input redundant adder, $P_j \leftarrow -r s_{j-1} + (a_j \pm b_j) r^{-1} + r P_{j-1}$; two selection networks; addend (minuend) register; augend (subtrahend) register)

4.5 Some Numerical Examples

Table 4.1 through Table 4.3 are examples of the algorithms ADD and SUBT for several different radices and sum (difference) redundancy coefficients ($K$).
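Redundant result strings such as those in the tables of this section can be cross-checked by evaluating them digit by digit. The helper below is a small sketch for that purpose; the function names `signed_digit_value` and `binary_fraction` are hypothetical, and exact rationals are used so that no rounding enters the comparison.

```python
from fractions import Fraction


def signed_digit_value(digits, r, first_exponent=0):
    """Value of a signed-digit string: digits[i] has weight r**(first_exponent - i).

    With first_exponent=0 the first digit has weight 1, matching the
    sum (difference) weights r**(-i+1) for i = 1, 2, ...
    """
    return sum(Fraction(s) * Fraction(r) ** (first_exponent - i)
               for i, s in enumerate(digits))


def binary_fraction(bits):
    """Value of a nonredundant binary fraction given as a string such as '0.1101'."""
    _, frac = bits.split(".")
    return Fraction(int(frac, 2), 2 ** len(frac))


if __name__ == "__main__":
    A = binary_fraction("0.11011001")
    B = binary_fraction("0.10100101")
    # Radix 4 sum digits 1, 2, 0, -1, 2 and radix 4 difference digits 0, 1, -1, 1, 0
    print(signed_digit_value([1, 2, 0, -1, 2], 4) == A + B)
    print(signed_digit_value([0, 1, -1, 1, 0], 4) == A - B)
```

Both comparisons hold for the operands A = 0.11011001 and B = 0.10100101, so a redundant string and its nonredundant equivalent can be confirmed to represent the same number without performing the carry-borrow propagation by hand.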
Table 4.1 Example of Algorithm ADD (r=2, K=1)

r = 2, m = 8, $D_\rho$ = {1̄, 0, 1}, K = 1
A = 0.11011001    B = 0.10100101    (R = 1.01111110)

j    $(a_j + b_j) 2^{-1}$    $P_j$    $s_j$    $2(P_j - s_j)$
1    1.0                     1.0      1        2(1.0-1) = 0.0
2    0.1                     0.1      1        2(0.1-1) = -1.0
3    0.1                     -0.1     1̄        2(-0.1+1) = 1.0
4    0.1                     1.1      1        2(1.1-1) = 1.0
5    0.1                     1.1      1        2(1.1-1) = 1.0
6    0.1                     1.1      1        2(1.1-1) = 1.0
7    0.0                     1.0      1        2(1.0-1) = 0.0
8    1.0                     1.0      1        2(1.0-1) = 0.0
9    0.0                     0.0      0        2(0.0-0) = 0.0

$$R = \sum_{i=1}^{m+1} s_i r^{-i+1} = 1.1\bar{1}111110 = 1.01111110$$

Table 4.2 Example of Algorithm ADD (r=4, K=1)

r = 4, m = 4, $D_\rho$ = {3̄, 2̄, 1̄, 0, 1, 2, 3}, K = 1
A = 0.11011001    B = 0.10100101    (R = 1.01111110)

j    $(a_j + b_j) 4^{-1}$    $P_j$    $s_j$    $4(P_j - s_j)$
1    1.01                    1.01     1        4(1.01-1) = 1.0
2    0.11                    1.11     2        4(1.11-2) = -1.0
3    0.11                    -0.01    0        4(-0.01-0) = -1.0
4    0.10                    -0.10    1̄        4(-0.10+1) = 10.0
5    0.0                     10.0     2        4(10.0-2) = 0.0

$$R = \sum_{i=1}^{m+1} s_i r^{-i+1} = 1.20\bar{1}2_4 = 1.01111110_2$$

Table 4.3 Example of Algorithm SUBT (r=4, K=1)

r = 4, m = 4, $D_\rho$ = {3̄, 2̄, 1̄, 0, 1, 2, 3}, K = 1
A = 0.11011001    B = 0.10100101    (R = 0.00110100)

j    $(a_j - b_j) 4^{-1}$    $P_j$    $s_j$    $4(P_j - s_j)$
1    0.01                    0.01     0        4(0.01-0) = 1.0
2    -0.01                   0.11     1        4(0.11-1) = -1.0
3    0.01                    -0.11    1̄        4(-0.11+1) = 1.0
4    0.0                     1.0      1        4(1.0-1) = 0.0
5    0.0                     0.0      0        4(0.0-0) = 0.0

$$R = \sum_{i=1}^{m+1} s_i r^{-i+1} = 0.1\bar{1}10_4 = 0.010\bar{1}0100_2 = 0.00110100_2$$

5. IMPLEMENTATION

The previous three chapters have dealt with the algorithmic design for on-line division, multiplication, and addition/subtraction. It is now appropriate to consider the problems of actual implementation of the on-line arithmetic unit while keeping in mind OBJECTIVES 4, 5, 6, and 7 as outlined in Chapter 1. With the typical computer scientist's fondness for acronyms,† this unit has been dubbed AURORA (Arithmetic Unit Realizing On-line Redundant Algorithms). This chapter attacks the problems of implementation: the design constraints of LSI, floating point considerations, design modifications required to pipeline AURORA, minimal hardware requirements with the corresponding speed tradeoffs, and a system's level overview.
5.1 Design Constraints of LSI

With the advent of large scale integration (LSI) technology has come a challenge to computer designers to find circuits which make efficient use of its full potential. The ever improving reliability, cost, and size of integrated circuits make it more and more reasonable to implement in hardware various functions which previously belonged in the software domain. It is this author's opinion that AURORA, in keeping with OBJECTIVE 4, makes an excellent candidate for design as a single chip, LSI module. When designing the module, several properties innate to LSI technology [HOD76, GOY76, LEW74] must be adhered to. These properties include:

• high circuit density,
• regularity of structure,
• low pin count, and
• a large domain of applications.

† An aurora is appropriately defined as "the early period of anything." In Roman mythology Aurora was the goddess of the dawn. Thus, the name seems suitable in all respects: connotation, gender, and acronymability.

With LSI, thousands of individual gates can be fabricated on an extremely small chip of silicon. There has been a problem finding suitable functions which require a large number of gates while still meeting the other design constraints of LSI. Semiconductor memories (RAMs, ROMs, and PLAs) are obviously the most suitable candidates for LSI implementation. They more than satisfy the previously defined properties. Calculator chips are also suitable candidates. The microprocessor, which is becoming more and more widely used, can be made suitable for implementation in LSI. Another area in which LSI technology is apparent is in digital watch chips. The capability provided by LSI, though, is considerably more than that required for the typical watch function of telling the time and date. Consequently, watches which are also timers, counters, alarms, and even calculators are appearing on the market in order to make full use of the inherent LSI capabilities.
The search goes on for other candidates which will make full use of the potential of LSI while meeting the predefined constraints. AURORA, similar in function to a microprocessor, will have a high gate count (high circuit density). Thus, an effort must be made to make AURORA conform, as the microprocessor has been made to conform, to the other constraints of LSI.

One of the most severe constraints in the design of an LSI device is the restriction on interconnections within the chip itself. This is due to two factors: 1) the comparatively large chip area required for interconnections, which reduces the "useable" chip area, and 2) the immense problem of routing a large number of signals while maintaining a minimum number of crossovers† and high gate density. Interconnections can be simplified by forcing the logic design of the chip into a cellular or regular structure.

† Unavoidable crossovers call for more processing steps and, thus, lead to higher tooling costs.

A regular structure has other important implications.

1) It simplifies the manufacturing process by making tooling and mask production easier and by helping to regulate the processing steps.
2) It makes it possible to optimize each cell and, thus, the overall chip to achieve the most function per dollar. A large collection of random gates is virtually impossible to optimize because of the large number of variables involved.
3) The generation of testing algorithms for cellular and regular, repetitive structures is easier than for random logic.
4) In addition, as technological improvements pave the way for larger devices, cellular structures are more easily expanded to make use of this bonus real estate.

Just as microprocessors have been made to conform to the constraint of regularity of structure, so can AURORA. Typically, the most random part of any system is its control logic.
It is envisioned that AURORA would be microprogrammed, possibly via a PLA (Programmable Logic Array) [RHY74, WES75], to avoid randomness in the control logic. The PLA would generate all necessary control signals for processing based upon the opcode input to the unit (+, -, *, /), the present state of the process, internal and external status signals, and a clock pulse. See Figure 5.1. A PLA, which can be visualized as a logical AND network feeding a logical OR network, works like a ROM with flexible addressing. Though suitable for LSI, using a ROM as a logic element has a major drawback in that a word must be stored for every possible combination of system inputs, many of which may be don't cares and are therefore never used. The PLA's address translation capability allows more than one of these possible input combinations to share a word of storage, making the PLA practical in many applications too large for the typical ROM. The PLA also has more input (address) lines than a comparable ROM. Using a PLA instead of the standard ROM for the microcontrol store provides for self-addressing of the next control state as well as the generation of the required control signals within a single array.

Figure 5.1 A Typical PLA Control Layout (inputs: internal and external status, initialize, op code, clock pulse; outputs: control signals to the processing logic)

The processing logic of AURORA will require the use of proper logic partitioning to make it suitable to LSI. Logic partitioning involves the organization of the internal logic structure so that large functional areas (or arrays) on the chip can be grouped together and used repetitively. External to the chip, functional partitioning of the overall system requires a framework consisting of modules which are completely self-contained processors, each having its own local store, processing logic, and the control necessary for the module to execute its function. Thus, each module acts as a small insular unit of logic.
Since each module's control sees only its own state, the internal and external communication requirements are correspondingly reduced. AURORA, as a module of a larger system, meets this criterion. Internal to the chip, an effort must be made to subdivide the logical units of the processing logic in order to achieve logic partitioning and its benefits.

One of the most serious constraints imposed by LSI on any device is that of a limited number of external connections (i.e., a low pin count). A realistic maximum of 64 external connections (pins) is an unavoidable LSI limitation. AURORA should require fewer pins than this upper limit. The unit will be designed so that it can be connected easily to the bus of a typical microprocessor. A bus system based upon generic signals, such as MUMS (Modular-Unified-Microprocessor-System) developed by Faiman [FAI77, CAT76], could be used as a model for designing the communication interface of AURORA. This would help to insure generality of the interface with a system. The number of signals on the MUMS bus sets an upper limit on the pin count for communication with the mother system at 47: a maximum of 32 for data (16 "data" lines and 16 "address" lines) and 15 for control (3 power lines, 6 memory control lines, and 6 interrupt control lines). The upper limit on the number of bits per operand word input at the beginning of each cycle is, then, 16. Thus, the choice of radices to be used in the internal hardware for implementation of the algorithms can be realistically set at either 2, 4, 16, or 256. The higher the radix, the higher the speed and the higher the complexity (gate count). A rough overall pin count for AURORA is attempted in Section 5.4. This count must also include requirements for communication with an exponent unit for floating point operations and, possibly, communication with other AURORA modules.
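The pin and radix arithmetic in this paragraph can be tallied in a few lines. The sketch below only restates the figures quoted above (a 64 pin ceiling, the 47 MUMS bus signals, and a 16 bit operand word); the dictionary layout and the function name `digits_per_word` are illustrative assumptions, not part of the MUMS definition.

```python
PIN_LIMIT = 64      # realistic maximum number of external connections
WORD_BITS = 16      # operand bits accepted per cycle

# MUMS bus signal counts as quoted in the text
MUMS_BUS_PINS = {
    "data": 16,
    "address": 16,
    "power": 3,
    "memory control": 6,
    "interrupt control": 6,
}


def digits_per_word(radix, word_bits=WORD_BITS):
    """Number of radix digits in one operand word, or None when the digit
    width (log2 of the radix) does not divide the word evenly."""
    bits_per_digit = radix.bit_length() - 1
    if radix < 2 or 2 ** bits_per_digit != radix:
        raise ValueError("radix must be a power of two")
    if word_bits % bits_per_digit:
        return None
    return word_bits // bits_per_digit


bus_pins = sum(MUMS_BUS_PINS.values())   # pins needed for the mother system bus
spare_pins = PIN_LIMIT - bus_pins        # pins left for everything else

if __name__ == "__main__":
    print("bus pins:", bus_pins, "spare pins:", spare_pins)
    for radix in (2, 4, 8, 16, 256):
        print("radix", radix, "->", digits_per_word(radix), "digits per word")
```

Radices 2, 4, 16, and 256 pack 16, 8, 4, and 2 digits into a word, whereas a radix such as 8 leaves bits over, which is one way to read the restriction to the four radices named above. The pins left after the bus must cover the exponent unit and inter-module communication.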
By designing AURORA to meet the specifications of a typical microprocessor bus system, the domain of usefulness increases dramatically. The desirability of having a peripheral unit on a microprocessor bus capable of achieving on-line variable precision arithmetic is obvious. This and other applications of AURORA are discussed in Chapter 6.

5.2 Floating Point Considerations

In early, fixed point computers, all numerical data within the computer was scaled to lie within a restricted range; a frequent choice was that of a fractional representation such that each number, X, used in computation was in the range $-1 < X < 1$. It soon became obvious that in order to reduce the burden of scaling during preparation or execution of a problem, a second arithmetic unit, for operations on exponents, should be introduced. Then a number X could be represented by a fractional part $f$ and an exponent $e$, such that

$$X = f \cdot r^{e}$$

where $e$ is a positive or negative integer and $f$ is a fraction in the normalized range

$$\tfrac{1}{2} \le |f| < 1 .$$

In order for AURORA to process floating point numbers, two arithmetic units are required, one for the fractional parts and one for the exponents. The handling of data between the mother system and these two processing units would follow this sequence of operations: the exponents, which are assumed to be one memory word each, would be sent to the exponent unit; the exponent unit would then handle these exponent operands appropriately, sending the proper shift signals to the mantissa unit; and, while these exponents were still being processed, the fractional operands, consisting of several memory words each, would be sent in an on-line fashion to the fractional unit.
Upon normalization and exponent adjustment of the results via an investigation of the most significant digits of the mantissa as soon as they became available, the exponent of the result could, then, be returned to the mother system, followed by an on-line return of the result words as they are generated. A sample mother system organization is given in Section 5.5.

If the operation in question was addition or subtraction, an exchange of communication such as the following must occur between the two units.

1) The exponent unit would determine which operand was smaller.
2) The mantissa unit, upon signals from the exponent unit, would shift the appropriate operand to the right while the exponent unit would increase that exponent accordingly, until the two exponents agree.
3) The mantissa unit would start the on-line addition or subtraction.
4) When the most significant digits of the result were produced in the mantissa unit, the result could be normalized (shifted right if necessary) and the mantissa unit would signal the exponent unit to adjust the result exponent (i.e., the exponent of the larger operand) appropriately.

Multiplication is only slightly more complicated in that the sum of the exponents must be formed in the exponent unit to produce the exponent of the result. One step of normalization in the form of one left shift of the result and a unit decrease of the result exponent may be necessary if the fractional product, R, falls in the range

$$\left(\tfrac{1}{2}\right)^{2} \le |R| < \tfrac{1}{2} .$$

Similarly, for division the difference of the exponents must be formed by the exponent unit to produce the result exponent. The fractional quotient, R, if it falls in the range

$$1 \le |R| < 2 ,$$

may then require normalization by one right shift with a unit increase in the result exponent. A communications protocol between the two units, exponent and mantissa, is shown in Figure 5.2.
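The division of labor between the two units can be illustrated with a small model. This is a sketch under stated assumptions, not the AURORA protocol itself: it assumes a binary exponent base, fractions normalized to $\tfrac{1}{2} \le |f| < 1$, and whole-number arithmetic in place of the on-line digit stream. The function names are hypothetical, and `normalize` is more general than the text requires, since it also performs the repeated left shifts that cancellation in a subtraction can call for.

```python
from fractions import Fraction


def align(f1, e1, f2, e2):
    """Steps 1 and 2: shift the operand with the smaller exponent right,
    increasing its exponent, until the two exponents agree."""
    while e1 < e2:
        f1, e1 = f1 / 2, e1 + 1
    while e2 < e1:
        f2, e2 = f2 / 2, e2 + 1
    return f1, f2, e1


def normalize(f, e):
    """Return (f, e) with 1/2 <= |f| < 1 for nonzero f; the value f * 2**e is unchanged."""
    while abs(f) >= 1:                         # right shift, unit increase of exponent
        f, e = f / 2, e + 1
    while f != 0 and abs(f) < Fraction(1, 2):  # left shift, unit decrease of exponent
        f, e = f * 2, e - 1
    return f, e


def fp_add(f1, e1, f2, e2):
    f1, f2, e = align(f1, e1, f2, e2)
    return normalize(f1 + f2, e)


def fp_mul(f1, e1, f2, e2):
    # exponent unit forms the sum; at most one left shift is then needed
    return normalize(f1 * f2, e1 + e2)


def fp_div(f1, e1, f2, e2):
    # exponent unit forms the difference; at most one right shift is then needed
    return normalize(f1 / f2, e1 - e2)


if __name__ == "__main__":
    half, three_quarters = Fraction(1, 2), Fraction(3, 4)
    print(fp_mul(half, 3, half, 2))            # product 1/4 needs one left shift
    print(fp_div(three_quarters, 1, half, 0))  # quotient 3/2 needs one right shift
```

For normalized operands the product lies in $[\tfrac{1}{4}, 1)$ and the quotient in $(\tfrac{1}{2}, 2)$, so the loops in `normalize` execute at most once for multiplication and division, in line with the single normalization step described above.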
Recall that operand scaling and the corresponding result correction to insure convergence of the algorithms is internal to the mantissa processing unit.

Figure 5.2 Communication between Mantissa and Exponent Units (signals: add or subtract code, normalize pulse, right shift pulse (for add/subtract), operand shift code; pin count for floating point communication: 4)

5.3 Algorithmic Modifications of AURORA for Pipelining

Up to this juncture AURORA has been discussed from the point of view of the serial processing of multiple memory words on a single unit which accepts one memory word per cycle. A problem arises, then, if the memory width is wider than the operand width accepted by one AURORA during one cycle. Perhaps this problem can be resolved by connecting multiple AURORAs together, each accepting one "byte" of the memory word, with all of the units running in parallel to produce multiple result "bytes" in parallel constituting one result word. (Note that the data input to an AURORA has been redefined from "word" to "byte" in this context.) If so, would it then be possible to combine the two approaches (serial and parallel), producing a serial system of, say, N AURORA units which run in parallel? The entire system, while cycling serially, would then produce N result bytes during each cycle.

While this is an admirable goal, on second consideration it is deemed improbable that such a system of simple AURORAs could be designed. Since the AURORA's operations include division and multiplication, such a design specification, allowing an unspecified number of parallel units to be connected to produce parallel results, implies parallel division and multiplication! Such a conclusion is intuitively impossible.

There are two more or less attractive alternative solutions to this problem. The first approach is based upon the assumption that the demand that parallel units produce parallel results is removed.
Then the units, connected as shown in Figure 5.3, though capable of being loaded in parallel, would produce result bytes in a serial fashion, rippling down from the most significant AURORA. The necessary information would be pipelined from left to right through the units; exactly which information must be pipelined through AURORAs arranged in this manner remains to be determined. Such an arrangement will, of course, increase the complexity of the unit, while the communication required between units will increase the pin count of each unit. But the speed benefits of streaming multiple instructions through the units, as allowed in pipelining [RAM77], would more than compensate for the added complexity costs.

Another technique for accommodating the parallel input of oversized operand words is to use one AURORA unit serially as in Figure 5.4, while providing large holding shift registers for the oversized operands. While this approach requires the purchase of only one expensive AURORA unit, that unit must be dedicated to a single instruction until the instruction is completed. Thus, the speed benefits of pipelining are unavailable.

[Figure 5.3 Pipelining Multiple AURORAs]

[Figure 5.4 Serial Processing with One AURORA. Labels: operand word X and operand word Y, each in a holding shift register; one operand byte from each register into the AURORA; op code; result byte into a holding shift register; result word.]

The necessary information to be pipelined through the multiple AURORA setup will now be specified. The basic recursion of each operation must be investigated to determine this information. For addition and subtraction it is obvious from the basic recursion, (4.1),

$$P_j \leftarrow r(P_{j-1} - s_{j-1}) + a_j + b_j$$

that the information to be passed down the pipe from one stage to the next is just the "residual," $r(P_{j-1} - s_{j-1})$, which increases in significant width as it flows from one unit to the next down the pipe. (For simplicity during this discussion, it is assumed that one byte of an operand or result word is equivalent to one digit of that operand or result word.)

The problem becomes slightly more complicated when multiplication is attempted. The basic recursion, (3.4),

$$P_j \leftarrow r(P_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j$$

indicates that not only the residual, but also the operand digits up to and including $x_j$ and $y_j$, must be passed. It is only in this way that the next stage can then form $X_j$ and $Y_{j-1}$.

The information to be passed in division is not at all obvious. The basic recursion, (2.3),

$$P_j \leftarrow rP_{j-1} - q_j D_j + n_{j+\delta} r^{-\delta} - R_{j-1} d_{j+\delta} r^{-\delta}$$

must be broken down further, into equation (5.1), in order to reveal the desired information. Since

$$R_j d_{j+\delta} = \Bigl(\sum_{i=1}^{j-1} q_i r^{-i}\Bigr) d_{j+\delta} + q_j r^{-j} d_{j+\delta},$$

equation (5.1) can be rewritten as

$$P_j \leftarrow rP_{j-1} - q_j D_{j-1} + n_{j+\delta} r^{-\delta} - R_j d_{j+\delta} r^{-\delta}. \qquad (5.2)$$

By using equation (5.2) as the basic recursion for division, the information to be passed down the pipe is seen to be the division residual,

$$rP_{j-1} - q_j D_{j-1},$$

the divisor digits up to and including $d_{j+\delta}$ (to be used in forming $D_j$ in the next stage), and all of the quotient digits up to and including $q_{j+1}$ (to be used in forming $R_{j+1}$ in the next stage).

The necessary information to be sent down the pipe over all operations is, then, a residual, both operands, and the result. If a digit of each operand and of the result is sent at the end of every cycle, then by the time the last stage of the pipe is processing an operation, all of the necessary information has been sent to it. The residual can be passed either as its individual elements ($P_{j-1}$ and $p_{j-1}$ for multiplication), or the residual can be explicitly formed in the unit and passed as one unique entity.
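The step from recursion (2.3) to recursion (5.2) can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes $D_j = \sum_{i=1}^{j+\delta} d_i r^{-i}$ and $R_j = \sum_{i=1}^{j} q_i r^{-i}$ (definitions assumed from the earlier chapters, not restated in this section), uses exact rational arithmetic, and compares one step of each form on random digits. The function names are hypothetical.

```python
from fractions import Fraction
import random


def prefix(digits, count, r):
    """Value of the first `count` fractional digits: sum of digits[i-1] * r**-i."""
    return sum(Fraction(x, r**i) for i, x in enumerate(digits[:count], start=1))


def step_2_3(P_prev, j, q, d, n, r, delta):
    """One step of (2.3): uses D_j and R_{j-1} (assumed definitions, see lead-in)."""
    scale = Fraction(1, r**delta)
    D_j = prefix(d, j + delta, r)
    R_prev = prefix(q, j - 1, r)
    d_new, n_new = d[j + delta - 1], n[j + delta - 1]   # d_{j+delta}, n_{j+delta}
    return r * P_prev - q[j - 1] * D_j + n_new * scale - R_prev * d_new * scale


def step_5_2(P_prev, j, q, d, n, r, delta):
    """One step of (5.2): uses D_{j-1} and R_j instead."""
    scale = Fraction(1, r**delta)
    D_prev = prefix(d, j - 1 + delta, r)
    R_j = prefix(q, j, r)
    d_new, n_new = d[j + delta - 1], n[j + delta - 1]
    return r * P_prev - q[j - 1] * D_prev + n_new * scale - R_j * d_new * scale


def forms_agree(r=4, rho=2, delta=3, m=6, seed=0):
    """Compare both forms for j = 1..m on random operands and quotient digits."""
    rng = random.Random(seed)
    q = [rng.randint(-rho, rho) for _ in range(m)]
    d = [rng.randint(0, r - 1) for _ in range(m + delta)]
    n = [rng.randint(0, r - 1) for _ in range(m + delta)]
    for j in range(1, m + 1):
        P_prev = Fraction(rng.randint(-64, 64), 64)      # arbitrary residual
        if step_2_3(P_prev, j, q, d, n, r, delta) != step_5_2(P_prev, j, q, d, n, r, delta):
            return False
    return True
```

Under these assumed definitions the two forms differ only in how the term $q_j d_{j+\delta} r^{-(j+\delta)}$ is grouped, which is why the division residual can be formed explicitly inside each unit and passed on as a single quantity.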
Of course, forming the residual explicitly would require another adder (or another pass through the main adder) in each AURORA. The addition of this residual adder simplifies the design of the main adder. Section 5.4.2 investigates the hardware adjustments of the processing logic necessary to pipeline the AURORA. Some obvious hardware complications, besides the extra residual adder in the processing logic, are 1) more pins for the communication requirements between the pipelined AURORAs, 2) a much more complex micro-control store, and 3) a bank of extra registers to store the operand and result digits as they arrive.

5.4 AURORA Hardware Requirements

A personalized hardware block diagram has been shown for each operation in its appropriate chapter. It now remains to combine the various requirements into one total arithmetic unit which can be microprogrammed to provide the various functions. Low-level (gate) detail of an actual implementation will not be discussed. The description given here will be only detailed enough to estimate the following features: complexity, speed, and pin count.

5.4.1 The Processing Logic

The block diagram of the processing logic of AURORA is shown in Figure 5.5. For clarity, the basic recursion formulas are now restated.

DIVISION ($R = N/D$):
$$P_j \leftarrow rP_{j-1} - q_j D_j + n_{j+\delta} r^{-\delta} - R_{j-1} d_{j+\delta} r^{-\delta}$$

MULTIPLICATION ($R = X \cdot Y$):
$$P_j \leftarrow r(P_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j$$

ADDITION ($R = A + B$):
$$P_j \leftarrow r(P_{j-1} - s_{j-1}) + a_j + b_j$$

SUBTRACTION ($R = A - B$):
$$P_j \leftarrow r(P_{j-1} - s_{j-1}) + a_j - b_j$$

[Figure 5.5 (block diagram of the AURORA processing logic). Recoverable labels: result digit selector; multi-input redundant adder; two selection networks; two operand registers; operand inputs $a_j, x_j, d_i, d_{j+\delta}$ and $b_j, y_j, n_i, n_{j+\delta}$.]
... and the first $\beta$ most significant digits of $D$ are inspected during division.

Table 5.1 Result Digit Selector Inputs

                      INDEX       REDUNDANT P_j            D
RADIX   ρ (K)         DIFF (δ)    α (bits)    γ (bits)     β (bits)    TOTAL
2       1 (1)         5,6,7,8     4 (8)       2 (4)        -           8 + 1 = 9
4       2 (2/3)       4           4 (16)      2 (8)        3 (6)       22 + 1 = 23
                      5,6,7,8     3 (12)      2 (8)        3 (6)       18 + 1 = 19
        3 (1)         3           3 (12)      2 (8)        3 (6)       18 + 1 = 19
                      4,5,6       3 (12)      2 (8)        2 (4)       16 + 1 = 17
16*     11 (11/15)    4           2 (16)      2 (8)        2 (8)       24 + 1 = 25
        12 (4/5)      3,4         2 (16)      2 (8)        2 (8)       24 + 1 = 25
        13 (13/15)    3,4         2 (16)      2 (8)        2 (8)       24 + 1 = 25
        14 (14/15)    3,4         2 (16)      2 (8)        2 (8)       24 + 1 = 25
        15 (1)        3,4         2 (16)      2 (8)        2 (8)       24 + 1 = 25

*Cases with more than 50 steps in any single overlap region were automatically eliminated from consideration.

The complexity of base 16 for the result digit selection is obvious. The binary case would require a table look up device with 9 available address lines: 8 for the first 4 redundant bits of $P_j$ and 1 to indicate which operation is in force. The particular case for radix 4, $\rho = 3$, and $\delta = 3$ would require 19 available lines: 12 for the first 3 redundant digits of $P_j$, 6 for the first 3 conventional digits of $D$, and 1 for control. For the radix 4 case, the number of address lines for the table look up is somewhat large by conventional standards. Two techniques to avoid this dilemma are available: 1) use a PLA, with the previously outlined advantages, to do the selection, or 2) perform a carry propagation on the most significant portion of $P_j$ to reduce the number of lines required. Table 5.2 indicates the number of address lines required when a carry propagation is employed on the most significant portion of $P_j$. Note that division always requires more significant digits of $P_j$ for selection than the other operations.
By using all of the available digits for selection during all operations, which is more than the absolute minimum required for multiplication, addition, and subtraction, the bounds on the operands during addition, subtraction, and multiplication can be relaxed significantly, as discussed in Chapter 3, Section 3.4. The comparison constants for division which must be encoded into the table look up for quotient digit selection are given in Appendix II for several cases. The comparison constants for multiplication, addition, and subtraction were given in Chapter 3, Section 3.3, as

$$\pm\tfrac{1}{2},\ \pm 1\tfrac{1}{2},\ \pm 2\tfrac{1}{2},\ \ldots,\ \pm\bigl(\rho - \tfrac{1}{2}\bigr).$$

Table 5.2 Result Digit Selector Inputs After Carry Propagation

                      INDEX       CARRY PROPAGATED P_j     D
RADIX   ρ (K)         DIFF (δ)    α (bits)    γ (bits*)    β (bits)    TOTAL
2       1 (1)         5,6,7,8     4 (4)       2 (2)        -           4 + 1 = 5
4       2 (2/3)       4           4 (8)       2 (4)        3 (6)       14 + 1 = 15
                      5,6,7,8     3 (6)       2 (4)        3 (6)       12 + 1 = 13
        3 (1)         3           3 (6)       2 (3)        3 (6)       12 + 1 = 13
                      4,5,6       3 (6)       2 (3)        2 (4)       10 + 1 = 11
16      11 (11/15)    4           2 (8)       2 (7)        2 (8)       16 + 1 = 17
        12 (4/5)      3,4         2 (8)       2 (6)        2 (8)       16 + 1 = 17
        13 (13/15)    3,4         2 (8)       2 (6)        2 (8)       16 + 1 = 17
        14 (14/15)    3,4         2 (8)       2 (5)        2 (8)       16 + 1 = 17
        15 (1)        3,4         2 (8)       2 (5)        2 (8)       16 + 1 = 17

*From Table 3.1

The central part of the processing logic, the adder, is also the most complex part. This adder, as has been repeatedly stated, is a multi-input limited carry-borrow propagation type adder. Such adders have been studied extensively in the literature [GOY76, ROH67, BOR68, ROB75, AVI61, MET57], so the adder will not be examined in detail here. Figure 5.6 gives the adder I/O requirements for the hardware block diagram being discussed, while Figure 5.7 represents the i-th digit position of this adder. A carry generator will be needed if a radix complement representation for negative numbers is used.
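The address-line totals in Tables 5.1 and 5.2 follow a simple count, sketched below. The rule used here is inferred from the table entries rather than stated in the text: a conventional radix-$r$ digit is assumed to occupy $\log_2 r$ lines, a redundant digit of $P_j$ twice that, and one further line selects the operation. The function name and parameters are hypothetical.

```python
def selector_address_lines(radix, alpha, beta, carry_propagated=False):
    """Address lines for the result digit selector table look up.

    radix            -- assumed to be a power of two (2, 4, 16, ...)
    alpha            -- number of leading digits of P_j inspected
    beta             -- number of leading digits of D inspected (0 if none)
    carry_propagated -- True if the leading part of P_j has been converted
                        to conventional form before the look up
    """
    bits_per_digit = radix.bit_length() - 1          # log2(radix)
    # Assumption inferred from Table 5.1: a redundant digit needs twice the lines.
    p_bits_per_digit = bits_per_digit if carry_propagated else 2 * bits_per_digit
    control_lines = 1                                # operation indicator
    return alpha * p_bits_per_digit + beta * bits_per_digit + control_lines


# The binary case and the 19-line radix 4 case discussed in the text:
binary_lines = selector_address_lines(2, alpha=4, beta=0)          # 8 + 1
radix4_lines = selector_address_lines(4, alpha=3, beta=3)          # 12 + 6 + 1
radix4_after = selector_address_lines(4, alpha=3, beta=3, carry_propagated=True)
```

Carry propagation removes the factor of two on the $P_j$ digits only, which is why the saving grows with $\alpha$ and with the radix.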
5.4.2 Hardware Modifications of AURORA for Pipelining

If the pipelining approach described in Section 5.3 is used, stringing multiple AURORAs together, an adjustment of the processing logic makes it suitable for this technique. An extra adder can be used to form the residual, as discussed in Section 5.3, which must be passed from one unit to another. This residual is in the form of $r(P_{j-1} - s_{j-1})$ for multiplication, addition, and subtraction (with the product digit $p_{j-1}$ in place of $s_{j-1}$ for multiplication), and is in the form of

$$rP_{j-1} - q_j D_{j-1}$$

for division. Additionally, the two operand digits, $x_j$ and $y_j$, must be passed in multiplication, and the divisor and quotient digits passed in division during each cycle. The hardware block diagram for this approach is shown in Figure 5.8. Note that the addition of the second adder has simplified the input requirements of the main adder. The control for this setup would be more complex than in the serial arrangement.

[Figure 5.6 Adder I/O Requirements. Inputs to the multi-input redundant adder, by operation: $rP_{j-1}$ with $-q_j D_j$ and $(n_{j+\delta} - R_{j-1} d_{j+\delta}) r^{-\delta}$ for division; $r(P_{j-1} - p_{j-1})$ with $X_j \cdot y_j$ and $Y_{j-1} \cdot x_j$ for multiplication; $r(P_{j-1} - s_{j-1})$ with $a_j$ and $\pm b_j$ for addition and subtraction.]

[Figure 5.7 I/O Requirements of Digital Position i of Adder (i > 1). Recoverable labels: result digit i in $\{-\rho', \ldots, -1, 0, 1, \ldots, \rho'\}$; transfer out and transfer in; operand inputs with ranges $\{0, 1, \ldots, \rho(r-1)\}$ for $\pm q_j d_i$, $\{0, 1, \ldots, (r-1)(r-1)\}$ for the digit products, and $\{0, 1, \ldots, (r-1)\}$ for $a_j$, $\pm b_j$, and $n_{j+\delta}$; a subtraction is done if $q_j$ is negative.]

[Figure 5.8 Modified AURORA Processing Logic for Pipelining. Recoverable labels: redundant result register (result from previous AURORA, result to next AURORA); result digit selector; selection networks; residual register and residual adder (residual from previous AURORA, residual to next AURORA); multi-input redundant adder; operand registers (operands from previous AURORA, operands to next AURORA).]

5.4.3 Speed of the Processing Logic

An attempt to bound the speed of the serial processing logic is now appropriate. Define the time to perform one basic recursion in the standard processing logic (Figure 5.5) as

$$t_R = t_T + t_A + t_S$$

where $t_T$ is the register and selection network transfer time, $t_A$ is the redundant add time, and $t_S$ is the result digit selection time. Both $t_T$ and $t_S$ correspond to a few gate delays, 4 to 5 $t_g$ each. So $t_A$ appears to be the dominant factor in the equation; it depends upon the radix (and, thus, the number of inputs to the adder) and the adder structure. Let $s$ be defined to be the number of radix 2 summands (inputs to the adder); i.e., higher radix and redundant inputs are considered in their binary equivalent formats. Then a simple adder structure, consisting of $(s-2)$ levels of full adder rows [GOY76], will have

$$t_A = 2(s-2)\,t_g$$

assuming $2t_g$ per full adder. More sophisticated adder structures, such as Dadda-types [HO73], can considerably reduce this time. Using this formula for $t_A$, the total recursion time for a radix 2 case, where the worst case $s$ over all operations is 5 (i.e., 1 conventional input and 2 redundant inputs), is

$$t_R = O(14\,t_g).$$

In the case of radix 4, this increases to $O(26\,t_g)$.

Depending on the operand size input, a number of cycles must occur before a result is completely generated, as tabulated in Table 5.3.
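The recursion-time estimate can be sketched numerically. This is a rough model under assumptions that go beyond the text: $t_T$ and $t_S$ are each taken as 4 gate delays (the lower end of the quoted 4 to 5), and the summand counts $s = 11$ and $s = 47$ for radix 4 and radix 16 are chosen here only because they reproduce the $26\,t_g$ and $98\,t_g$ cycle times quoted for those radices; only $s = 5$ for radix 2 is given explicitly. All names are hypothetical.

```python
def add_time(s, full_adder_delay=2):
    """Redundant add time in gate delays: (s - 2) full adder rows at 2 t_g each."""
    return full_adder_delay * (s - 2)


def recursion_time(s, t_transfer=4, t_select=4):
    """t_R = t_T + t_A + t_S in gate delays (t_T and t_S assumed to be 4 t_g)."""
    return t_transfer + add_time(s) + t_select


def word_time(word_bits, bits_per_digit, s):
    """Gate delays to process one word: one recursion per digit of the word."""
    digits_per_word = word_bits // bits_per_digit
    return digits_per_word * recursion_time(s)


# Summand counts: 5 is stated in the text; 11 and 47 are assumptions that
# match the quoted cycle times, not values given by the source.
SUMMANDS = {2: 5, 4: 11, 16: 47}

cycle_radix2 = recursion_time(SUMMANDS[2])      # 4 + 2*3 + 4 = 14
cycle_radix4 = recursion_time(SUMMANDS[4])      # 4 + 2*9 + 4 = 26
word8_radix2 = word_time(8, 1, SUMMANDS[2])     # 8 digits * 14 = 112
word8_radix4 = word_time(8, 2, SUMMANDS[4])     # 4 digits * 26 = 104
```

The model shows the trade-off directly: doubling the bits per digit halves the number of cycles, but the deeper adder lengthens each cycle.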
Table 5.3 Word Processing Time Delay

WORD                              DIGITS/WORD                     TOTAL WORD
LENGTH    RADIX    BITS/DIGIT    (# OF CYCLES)    TIME/CYCLE    PROCESSING TIME
4         2        1             4                14 t_g        56 t_g (*)
          4        2             2                26 t_g        52 t_g (*)
          16       4             1                98 t_g        98 t_g (*)
8         2        1             8                14 t_g        112 t_g
          4        2             4                26 t_g        104 t_g
          16       4             2                98 t_g        196 t_g
16        2        1             16               14 t_g        224 t_g
          4        2             8                26 t_g        208 t_g
          16       4             4                98 t_g        392 t_g

(*) These cases do not satisfy the on-line delay requirements of division.

While using a higher radix can speed up the overall processing time ($112\,t_g$ for base 2 versus $104\,t_g$ for base 4, given an 8 bit input width), it also increases the complexity of the adder. From this table it would appear that radix 4 may be the best compromise. Table 5.5 reinforces this choice in comparing speed and complexity of the processing logic. Once again the trade off between high speed and low gate count is evident.

5.4.4 AURORA as a Total Module

A block diagram of the total unit is given in Figure 5.9. It includes the PLA control logic as discussed in Section 5.1, the processing logic just discussed, the recoding logic as discussed in Appendix III, and the I/O requirements of the overall module. An approximate pin count for base 2, 4, and 16 is given in Table 5.4.

Table 5.4 Rough Pin Count for AURORA

INPUT WORD    # OPERAND    # RESULT       PINS TO COMMUNICATE
LENGTH        PINS         (OUTPUT) PINS  WITH EXP. UNIT         CONTROL    TOTAL
4             8            4              4                      15         31
8             16           8              4                      15         43
16            32           16             4                      15         67

Table 5.5 Speed vs. Complexity (Processing Logic)

         INPUT OPERAND   REDUNDANCY            ADDER COMPLEXITY        RESULT SELECTOR          WORST CASE WORD
RADIX    WIDTH (BITS)    (MAX. RESULT DIGIT)   (LEVELS)                INPUT/STORAGE            PROCESS TIME
2        4               1 (1)                 3 levels (full width)   *
         8               1 (1)                                         9 LINES/3 WORDS          112 t_g
         16              1 (1)                                         9 LINES/3 WORDS          224 t_g
4        4               2/3, 1 (2, 3)         9 levels (full width)   -                        -
         8               2/3 (2)                                       23 LINES/18 WORDS        104 t_g
         8               1 (3)                                         17 LINES/16 WORDS        104 t_g
         16              2/3 (2)                                       19 LINES/21 WORDS        208 t_g
         16              1 (3)                                         15 LINES/19 WORDS        208 t_g
16       4               8/15 to 1 (8 to 15)   45 levels (full width)  -                        -
         8               8/15 to 1 (8 to 15)                           -                        -
         16              [illegible]                                   25 LINES/~300 WORDS      392 t_g
         16              1 (15)                                        21 LINES/~200 WORDS      392 t_g

t_g is one gate delay.

[Figure 5.9 Block Diagram of an AURORA. Labels: operand words into the AURORA processing logic; result word recoding logic producing the result word; PLA control logic with control signals from the exponent logic; result word ready interrupt.]

5.5 System's Level Overview

AURORA was designed to operate as a functional module controlled externally by a larger system. This system must provide AURORA with the correct operand words and control signals and receive from AURORA the result word for storage. It must fetch the correct number of operand words in the proper order from main memory, transfer them on-line to AURORA, and store the result words in the proper order back into main memory. To avoid the necessity of dedicating the system to AURORA while the digits are being processed, the AURORA unit should signal the main system via an interrupt when a result word is ready for storage. Then the system is free to handle other processes while AURORA is functioning. Figure 5.10 shows a sample system organization.

This chapter has consisted only of an overview of the necessary considerations of actual implementation. The implementation as such was not attempted, but is left for a more suitable medium. A detailed gate count was considered inappropriate and ill advised. It remains to justify the existence of such a unit as AURORA.
[Figure 5.10 Sample System Organization. The figure shows a high level command, the microcode routine it invokes for each AURORA arrangement, and block diagrams of the two arrangements (AURORA exponent logic and AURORA mantissa logic, with operand words, control, result word, and interrupt lines to the main system).

High level command: OPERATION X, Y, R, N
  X, Y, R: address of the exponent (most significant word)
  N: precision (number of words to process)

Microcode routine (determined by AURORA arrangement):

  PIPELINED APPROACH                    SERIAL APPROACH
     INITIALIZE AURORA                     INITIALIZE AURORA
     FETCH X, Y                            FETCH X, Y
     CALL AURORA(EXP, OP, X, Y)            CALL AURORA(EXP, OP, X, Y)
     DO 10 I=1, N, 2                       DO 10 I=1, N, 1
     FETCH X+I, X+I+1                      FETCH X+I, Y+I
     FETCH Y+I, Y+I+1                      CALL AURORA(MAN, OP, X, Y)
     CALL AURORA(MAN, OP, X, Y)            IF (INTER .EQ. 0) WAIT
     IF (INTER .EQ. 0) WAIT                IF (I .EQ. 1) STORE R
     IF (I .EQ. 1) STORE R                 IF (INTER .EQ. 0) WAIT
     IF (INTER .EQ. 0) WAIT                STORE R+I
     STORE R+I                          10 CONTINUE
     IF (INTER .EQ. 0) WAIT                RETURN
     STORE R+I+1
  10 CONTINUE
     RETURN
]

6. SUMMARY

6.1 Summary of the Results

The algorithmic and logical design of an arithmetic unit to be used in a computational environment in which the basic arithmetic operations satisfy the on-line property has been presented. The on-line property requires that to generate the j-th digit of a result (where a digit consists of n bits for base $2^n$), it is necessary and sufficient to have the operands available only up to the j-th digit plus, in the case of division, a predetermined number of extra digits which correspond to an "on-line delay." Since there is no on-line delay for addition, subtraction, and multiplication, the algorithms can begin generating result digits as soon as one digit of each operand has been input. The delay for division was shown to be a small, positive, radix dependent constant. To fulfill the on-line requirements, a set of left-to-right (most-to-least significant), digit-by-digit algorithms has been derived.
The existence of such algorithms is contingent upon using a redundant representation for the result digits. These algorithms and a block diagram implementation of the basic arithmetic unit have been presented.

Algorithms for addition and subtraction which conform to this on-line property were easily specified, while the multiplication algorithm required a somewhat more elaborate approach. The existence of an on-line division algorithm was only recently discovered [TRI75]; the bulk of this thesis was a development and extension of this early work. Quotient digit selection procedures based upon a limited precision model of the operands, and minimum values for the on-line delay, were discussed in detail. Once compatible algorithms for the basic operations were defined, the problems of actual implementation of the basic unit, AURORA, were considered. Suitability to LSI, floating point processing, and adjustments in the hardware to gain speed were some of the desirable design goals. It is now fitting to discuss some possible applications of such a unit to justify its existence.

6.2 Applications

It has been stated that a unit suitable for investigation as an LSI chip should have as large a domain as possible. Several application areas are ripe for exploitation today, while others, as advances are made in technology, are on the horizon. The following list of applications is by no means exhaustive, but is presented here as a representative study.

1) The most obvious use for AURORA is in the area of real-time applications. As the operands are generated serially by an analog-to-digital conversion process beginning with the most significant digits, AURORA could be used to process these operand digits as soon as they became available from the converter. This is unlike the conventional setup, where the processing unit must wait while the full precision operands are converted before starting operation. The speed up benefits are obvious.
Such a system, with the capability of overlapping conversion and computation, has a definite place in long distance communications and satellite systems. In fact, any system designed to be of use in a real-time environment could make significant gains with the addition of an AURORA module to its hardware.

2) Another possible application is in performing variable precision arithmetic. The described algorithms and simple implementation requirements of AURORA are compatible with the required modularity of any variable precision unit. Both hardware implementations discussed in Chapter 5, serial processing on one unit or the pipelining of multiple AURORAs for speed, provide variable precision arithmetic in a straightforward manner. Of course, the ultimate allowable precision is set by the internal hardware of AURORA, as with any such device. It is believed that sufficient register and adder widths can be provided by large scale integrated technology to provide enough "variable precision arithmetic" to meet the demands of most applications. The desirability of being able to attach an AURORA as a peripheral device on a microprocessor bus is obvious. The microprocessor, a device traditionally restricted from most mathematical applications because of its short word length, could not help but benefit from this boost in processing power. As an added bonus, since AURORA signals completion via an interrupt, the microprocessor would not have to be dedicated to it. While the AURORA is functioning, the main system would be freed to handle other pending processes.

3) If the approach of connecting multiple AURORAs together is used for actual implementation, pipelining of successive operations could be used as an effective speed up technique. Recall that the recursive functioning ripples down from the most-to-least significant unit.
When the first unit completes processing and returns the result digit to the main system, it has also passed all necessary information for that operation to correctly continue on down the pipe to the next unit. When one unit has completed all of the processing associated with the present operation, the next unit in line can begin generating the next result digit associated with that same instruction. Thus, the first unit is free to initiate processing on the next instruction in the program. In this way, the fraction arithmetic unit, which has been traditionally considered as a single stage of the pipeline [STE75, AND67], can be further decomposed into multiple stages to speed up processing even more. Chaining operations on result digits as they are generated can increase processing speed even further.

4) One application on the horizon will become more feasible as technological improvements make way for the widespread use of large serial memories (CCDs, bubble memories, etc.). The major user of the large serial memories will be data base systems. AURORA could be used in conjunction with these serial memories to provide instant processing capabilities for data base systems. As soon as the most significant word was (serially) extracted from the memory, the decisions as to which actual memory word is desired could be made on-line. Then, just that operand would have to be read from memory. The only other alternative is to read both operands from memory, a lengthy process, and then proceed to compare them to determine which is the desired operand. By using AURORAs to provide on-line processing, intelligent data base retrieval systems could be built.

6.3 Suggestions for Future Research

During work on this dissertation, several extensions or other areas of interest requiring further research have become obvious to this author. They include the following questions.
1) Could the on-line processing technique employed here be extended to other functions, such as logarithmic, trigonometric, and exponential? It is believed that an alternative algorithmic approach to processing, such as continued products as described by DeLugish [DEL70] or the E-method as developed by Ercegovac [ERC75], might be more appropriate for extension. The on-line result digit selection procedure outlined in this thesis could be carried over with slight modification to these techniques. This area is in need of further research.

2) What about actual implementation? It is beyond the influence of this author whether or not an actual AURORA module is manufactured as a single chip. It is entirely feasible to build the module from several standard MSI and SSI chips now available on the market. An even more attractive alternative would be to implement AURORA in software using a suitable bit-slice microprocessor. This seems to be more useful than simulating AURORA on a large system equipped with an appropriate simulation language. The microprocessor implementation is one area of continuing research in which this author hopes to be actively engaged.

BIBLIOGRAPHY

[AND67] Anderson, S. F., et al., "The System/360 Model 91 Floating Point Execution Unit," IBM System Journal 11, Vol. 34, 1967.

[ATK67] Atkins, D. E., "The Theory and Implementation of SRT Division," M.S. Thesis, Report 230, Department of Computer Science, University of Illinois, Urbana, June 1967.

[ATK68] Atkins, D. E., "Higher Radix Division Using Estimates of the Divisor and Partial Remainders," IEEE Transactions on Computers, Vol. C-17, No. 10, pp. 925-934, October 1968.

[ATK70] Atkins, D. E., "Design of the Arithmetic Units of Illiac III: Use of Redundancy and Higher Radix Methods," IEEE Transactions on Computers, Vol. C-19, No. 8, pp. 720-733, August 1970.

[ATK70] Atkins, D. E., "A Study of Methods for Selection of Quotient Digits During Digital Division," Ph.D. Thesis, Report 397, Department of Computer Science, University of Illinois, Urbana, June 1970.

[ATK75] Atkins, D. E., "Introduction to the Role of Redundancy in Computer Arithmetic," Computer, Vol. 8, No. 6, pp. 74-76, June 1975.

[ATR65] Atrubin, A. J., "A One-Dimensional Real-Time Iterative Multiplier," IEEE Transactions on Computers, Vol. EC-14, pp. 394-399, 1965.

[AVI61] Avizienis, A., "Signed-Digit Number Representation for Fast Parallel Arithmetic," IRE Transactions on Electronic Computers, EC-10, pp. 389-400, 1961.

[AVI62] Avizienis, A., "On a Flexible Implementation of Digital Computer Arithmetic," Proceedings of IFIP, pp. 664-668, 1962.

[AVI64] Avizienis, A., "Binary-Compatible Signed-Digit Arithmetic," AFIPS Conference Proceedings, Vol. 26, Part 1, pp. 663-672, 1964.

[BAK75] Baker, P. W., "Algorithms for Higher Level Functions in Machine Hardware," Ph.D. Thesis, The University of New South Wales, November 1975.

[BOR68] Borovec, R. T., "The Logical Design of a Class of Limited Carry-Borrow Propagation Adders," M.S. Thesis, Report 275, Department of Computer Science, University of Illinois, Urbana, August 1968.

[BRA63] Braun, E. L., Digital Computer Design: Logic, Circuitry, and Synthesis, Academic Press, New York, 1963.

[CAM70] Campeau, J. O., "Communication and Sequential Problems in the Parallel Processor," Parallel Processor Systems, Technologies and Applications, Spartan Books, New York, 1970.

[CAT76] Catlin, Robert W., "MUMS: Modular Unified Microprocessor System," M.S. Thesis, Report 809, Department of Computer Science, University of Illinois, Urbana, June 1976.

[DEL70] DeLugish, B. G., "A Class of Algorithms for Automatic Evaluation of Certain Elementary Functions in a Binary Computer," Ph.D. Thesis, Report 399, Department of Computer Science, University of Illinois, Urbana, June 1970.

[ERC75] Ercegovac, Milos D., "A General Method for Evaluation of Functions and Computations in a Digital Computer," Ph.D. Thesis, Report 750, Department of Computer Science, University of Illinois, Urbana, August 1975.

[FAI77] Faiman, Michael, et al., "MUMS-A Reconfigurable Microprocessor Architecture," Computer, Vol. 10, No. 1, pp. 11-16, January 1977.

[FRE61] Freiman, C. V., "Statistical Analysis of Certain Binary Division Algorithms," Proceedings of the IRE, Vol. 49, pp. 91-103, January 1961.

[GOY76] Goyal, Lakshmi, "A Study in the Design of an Arithmetic Element for Serial Processing in an Iterative Structure," Ph.D. Thesis, Report 797, Department of Computer Science, University of Illinois, Urbana, May 1976.

[HO73] Ho, I. T. and T. C. Chen, "Multiple Addition by Residue Threshold Functions and Their Representation by Array Logic," IEEE Transactions on Computers, Vol. C-22, pp. 762-767, August 1973.

[HOD76] Hodges, David A., "Trends in Computer Hardware Technology," Computer Design, Vol. 15, No. 2, pp. 77-85, February 1976.

[LEW74] Lewin, Douglas, "Outstanding Problems in Logic Design," The Radio and Electronic Engineer, Vol. 44, No. 1, pp. 9-17, January 1974.

[MEL72] Melicher, S. A., "An Arithmetic Unit with Total Redundant Representation," M.S. Thesis, Department of Electrical Engineering, University of Illinois, Urbana, 1972.

[MET57] Metze, G., "A Study of Parallel One's Complement Arithmetic Units with Separate Carry or Borrow Storage," Ph.D. Thesis, Report 81, Department of Electrical Engineering, University of Illinois, Urbana, November 1957.

[PEN62] Penhollow, J. O., "A Study of Arithmetic Recoding with Applications to Multiplication and Division," Thesis, Report 128, Department of Computer Science, University of Illinois, Urbana, September 1962.

[PIS70] Pisterzi, M. J., "A Limited Connection Arithmetic Unit," Ph.D. Thesis, Report 398, Department of Computer Science, University of Illinois, Urbana, June 1970.

[RAM77] Ramamoorthy, C. V. and H. F. Li, "Pipeline Architecture," Computing Surveys, Vol. 9, No. 1, pp. 61-102, March 1977.

[RHY74] Rhyne, V. Thomas, Scott McPhillips and Jerry Ogdin, "Programmed Logic Array," New Logic Notebook, Vol. 1, No. 2, October 1974.

[ROB58] Robertson, J. E., "A New Class of Digital Division Methods," IRE Transactions on Electronic Computers, Vol. EC-7, pp. 218-222, September 1958.

[ROB65] Robertson, J. E., "Methods of Selection of Quotient Digits During Digital Division," Report 663, Department of Computer Science, University of Illinois, Urbana, 1965.

[ROB67] Robertson, J. E., "A Deterministic Procedure for the Design of Carry-Save Adders and Borrow-Save Subtractors," Department of Computer Science, University of Illinois, Urbana, July 1967.

[ROB75] Robertson, J. E., "Redundant Structures I: Binary Carry-Save Adders and Borrow-Save Subtractors" and "Redundant Structures II: Signed Digit Arithmetic," Chapters 6 and 7 of Class Notes for CS 364, University of Illinois, Urbana, Revised Fall 1975.

[ROH67] Rohatsch, Fred A., "A Study of Transformations Applicable to the Development of Limited Carry-Borrow Propagation Adders," Ph.D. Thesis, Report 226, Department of Computer Science, University of Illinois, Urbana, June 1967.

[STE75] Stephenson, C., "Case Study of the Pipelined Arithmetic Unit for the TI Advanced Scientific Computer," Proceedings of Third IEEE Symposium on Computer Arithmetic, Dallas, Texas, November 1975.

[TOC58] Tocher, T. D., "Techniques of Multiplication and Division for Automatic Binary Computers," Quart. J. Mech. Appl. Math., Vol. 2, Pt. 3, pp. 364-384, 1958.

[TRI75] Trivedi, Kishor S. and Milos D. Ercegovac, "On-Line Algorithms for Division and Multiplication," Proceedings of Third IEEE Symposium on Computer Arithmetic, Dallas, Texas, November 1975.

[WES75] 1975 Wescon Professional Program, "Field Programmable Logic," Session 26, San Francisco, California, September 1975.
APPENDIX I

REDUNDANCY DEFINITION

Redundancy (the state of being in excess of what is necessary) is used in the implementation of computer arithmetic to achieve three design goals: improved reliability, increased speed, and structural flexibility. To achieve the first goal, hardware redundancy and/or redundant arithmetic codes are applied to the detection and correction of faults. This, however, is not the type of redundancy referred to in this dissertation. Rather, the use of number systems employing redundancy in their representation is implied. In this way the design goals of increased speed and structural flexibility can be achieved.

A positional number system with fixed radix, $r$, is redundant if the allowable digit set includes more than $r$ distinct elements, thereby affording alternate representations of a given numeric value. Uniqueness of representation is sacrificed with the hope of gains in speed and flexibility.

The type of redundant number representation used throughout this dissertation calls for the use of a symmetric redundant digit set, defined as

$$D_\rho = \{-\rho, -(\rho-1), \ldots, -1, 0, 1, \ldots, (\rho-1), \rho\}$$

where

$$\tfrac{r}{2} \le \rho \le r - 1.$$

In particular, $D_\rho$ is (where $|D_\rho|$ is the cardinality of the digit set)

1) minimally redundant if $|D_\rho| = r + 1$, so that $\rho = \tfrac{r}{2}$,

2) or maximally redundant if $|D_\rho| = 2r - 1$, so that $\rho = r - 1$.

Consequently, the representation of a number $X$ is simply

$$X = \sum_{i=1}^{m} x_i r^{-i}$$

where the sign of $X$ is just the sign of its most significant non-zero digit $x_i$.

EXAMPLE: For radix $r = 4$,

$$D = \{0, 1, 2, 3\}$$
$$D_{\rho_{MIN}} = \{\bar{2}, \bar{1}, 0, 1, 2\}$$
$$D_{\rho_{MAX}} = \{\bar{3}, \bar{2}, \bar{1}, 0, 1, 2, 3\}$$

where the overbar denotes the negative sign, i.e., $\bar{2} = -2$. Then

$$X = 0.6875_{10} = 0.1011_2 = 0.23_4 \quad \text{for } x_i \in D$$
$$\phantom{X} = 0.110\bar{1}_2 = 0.3\bar{1}_4 \quad \text{for } x_i \in D_{\rho_{MAX}}$$
$$\phantom{X} = 1.\bar{1}10\bar{1}_2 = 1.\bar{1}\bar{1}_4 \quad \text{for } x_i \in D_{\rho_{MIN}}$$

Some of the desirable properties of this type of number representation are:

1) The representation of zero is unique. The algebraic value of $X$ is 0 if and only if all $x_i = 0$.
2) The additive inverse (negation) of an operand is very simply achieved by reversing the sign of every non-zero digit \ individually. 3) The sign of the algebraic value of X is given by the sign of the most significant (leftmost) non-zero digit. 123 APPENDIX II SAMPLE P-D PLOTS FOR DIVISION This appendix contains sample P-D plots for on-line division. The key in the top, left-hand corner of each plot corresponds to BASE: the radix(r) RHO: the maximum quotient digit (p) DELTA: the on-line delay (6) ALPHA: sufficient precision of rP. for quotient digit selection (a) BETA: sufficient precision of D. for quotient digit selection (6) The comparison constants for the treads (rP.) are given in the right-hand column. The comparison constants for the risers (D) are given along the top row. In the case of base 16 several anomalies are apparent: 1) when an overlap region required more than 50 steps, those steps were not plotted; and 2) the comparison constants are not shown due to insufficient space. Table A.l Example of the Algorithm DIVIDE (r=2,K=l) r = 2, 6 = 5, m = 8, D = {1,0, 1}, K = 1 p N = 0.10100011 D = 0.11110110 L s TO INSURE CONVERGENCE - SHIFT N RIGHT ONE BIT (R = 0.10101001101...) ] 124 j D. J 2P. J Vi -4 R. 2 J 0.11110 0.10100 1 0.0 1 0.111101 2(0.10100-0.111101) = -0.10101 T 0.00001 2 0.1111011 2 (-0.1011+0. 1111011 -0.00001) = 0.100111 1 0.000001 3 0.11110110 2(0.101001-0.1111011) = -0.101001 T 4 0.11110110 2) -0.101001+0. 1111011) = 0.101001 1 5 0.11110110 2(0.101001-0.1111011) = -0.101001 1 6 0.11110110 2 (-0.101001+0. 1111011) = 0.101001 1 7 0.11110110 2(0.101001-0.1111011) = -0.101001 1 8 0.11110110 2 (-0.101001+0. 1111011) = 0.101001 1 __ — R = 8 2( Z q 2 I -i 2(0.1111111111) 2(0.0101010101) = 0.10101010 D D 125 a. 
[P-D plots for on-line division. Horizontal axis: DIVISOR, 0.00 to 1.00; curves are labelled UPPER OF and LOWER OF each quotient digit value. Panels: BASE = 2, RHO = 1, ALPHA = 4, BETA = 0, four plots with DELTA between 5 and 8; BASE = 4, RHO = 2, DELTA = 4, ALPHA = 4, BETA = 3; BASE = 4, RHO = 2, ALPHA = 3, BETA = 3, for DELTA = 5, 6, 7, and 8.]
[P-D plots, continued. Panels: BASE = 4, RHO = 3, ALPHA = 3, BETA = 3; BASE = 16, RHO = 12, DELTA = 3, ALPHA = 2, BETA = 2; BASE = 16, RHO = 13, ALPHA = 2, BETA = 2, for DELTA = 3 and DELTA = 4.]

APPENDIX III

AN ON-LINE RECODING ALGORITHM

The result digits generated by the algorithms discussed in this dissertation are in a redundant format. Before they can be output from the unit they must be converted to a conventional format. As each redundant result digit becomes available from the result digit selector, it is stored in the full width double bank result register. Given the result in the form

$$R = \sum_{i=1}^{m} s_i r^{-i}$$

where

$$s_i \in \{-\rho, -(\rho-1), \ldots, -1, 0, 1, \ldots, (\rho-1), \rho\}$$

it must be recoded to the form

$$R' = \sum_{i=1}^{m} s'_i r^{-i}$$

where

$$s'_i \in \{0, 1, 2, \ldots, r-1\}.$$

The "mostly" on-line recoding algorithm is shown in Figure A.1. Note that the only information needed to recode the present digit is the overall sign of the result ($S_{op}$) and the sign of the next rightmost non-zero digit ($S_n$).
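One recoding rule that uses only these two signs can be sketched in software as follows. This is an assumption-laden illustration, not a transcription of Figure A.1: the names `sign` and `recode` are hypothetical, the borrow formulation is a standard signed-digit conversion chosen for the example, and every input digit is assumed to lie between $-(r-1)$ and $r-1$.

```python
def sign(d):
    """Return -1, 0, or 1 according to the sign of d."""
    return (d > 0) - (d < 0)


def recode(digits, radix):
    """Recode signed digits (most significant first) into (S_op, conventional digits).

    Hypothetical sketch: the conventional digits describe the magnitude of the result.
    """
    # S_op: sign of the most significant non-zero digit (0 for an all-zero result).
    s_op = next((sign(d) for d in digits if d != 0), 0)
    if s_op < 0:
        digits = [-d for d in digits]  # negate digit by digit, then recode the magnitude
    out = []
    for i, d in enumerate(digits):
        # S_n: sign of the next non-zero digit to the right (0 if there is none).
        s_n = next((sign(t) for t in digits[i + 1:] if t != 0), 0)
        borrow = 1 if s_n < 0 else 0   # a negative tail borrows one unit from this digit
        out.append((d - borrow) % radix)
    return s_op, out


print(recode([3, -1], 4))        # 0.3(-1) in radix 4
print(recode([1, 1, 0, -1], 2))  # 0.110(-1) in radix 2
```

The two calls print `(1, [2, 3])` and `(1, [1, 0, 1, 1])`, matching the conventional forms $0.23_4$ and $0.1011_2$ of the value 0.6875 used in the Appendix I example.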
The only time the recoding network fails to produce an output on-line is therefore when it encounters a string of zeros in the result, because $S_n$ is not known until the next non-zero digit arrives.

Figure A.1 The On-line Recoding Algorithm

The legend of Figure A.1 defines $S_{op}$ as the sign of the overall result, $S_n$ as the sign of the next rightmost nonzero result digit, $x_i$ as the present result digit, and $r$ as the radix; the sign of the present result digit is also used.

VITA

Mary Jane Irwin was born in Cairo, Illinois, in 1949. She received a B.S. degree in Mathematics from Memphis State University in 1970, and her M.S. and Ph.D. degrees in Computer Science from the University of Illinois at Urbana-Champaign in 1975 and 1977, respectively. From 1971 to 1977 she was employed as a graduate teaching/research assistant by the Department of Computer Science at the University of Illinois. She was a participant in the Digital Computer Arithmetic Group, headed by Professor James E. Robertson. While at Illinois she served as president of the University of Illinois student chapter of the ACM. She is currently a member of ACM and IEEE.

BIBLIOGRAPHIC DATA SHEET

1. Report No.: UIUCDCS-R-77-873
3. Recipient's Accession No.:
4. Title and Subtitle: An Arithmetic Unit for On-Line Computation
5. Report Date: May, 1977
6.
7. Author(s): Mary Jane Irwin
8. Performing Organization Rept. No.:
9. Performing Organization Name and Address: Department of Computer Science, University of Illinois, Urbana, IL 61801
10. Project/Task/Work Unit No.:
11. Contract/Grant No.: NSF DCR 73-07998
12. Sponsoring Organization Name and Address: National Science Foundation, Washington, DC
13. Type of Report & Period Covered:
14.
15. Supplementary Notes:
16. Abstracts: This thesis is concerned with the algorithmic and logic design of an arithmetic unit to be used in a computational environment in which the basic arithmetic operations satisfy the on-line property; that is, to generate the $j$th digit of a result (where a digit consists of $n$ bits for base $2^n$), it is necessary and sufficient to have the operands available only up to the $j$th digit plus, in the case of division, a predetermined number of extra digits which correspond to an "on-line delay." Since there is no on-line delay for addition, subtraction, and multiplication, the unit can begin generating result digits as soon as one digit of each operand has been input. The delay for division is shown to be a small, positive, radix dependent constant. To fulfill the on-line requirements, a set of left-to-right (most-to-least significant), digit-by-digit algorithms has been derived. The existence of such algorithms is contingent upon the use of a redundant representation for the result digits. These algorithms and a block diagram level implementation of the basic arithmetic unit are developed in the thesis. The proposed arithmetic unit, capable of performing on-line operations, would be extremely useful in many real-time applications. Due to its potential for performing sequences of operations in an overlapped fashion (pipelining), the unit could provide an effective way to speed up execution. Furthermore, it is ideally suited for variable precision arithmetic.
17. Key Words and Document Analysis.
17a. Descriptors: digital computer arithmetic; on-line algorithms; pipelining; redundancy; digit-by-digit algorithms; large scale integration; floating-point arithmetic
17b. Identifiers/Open-Ended Terms:
17c. COSATI Field/Group:
18. Availability Statement:
19. Security Class (This Report): UNCLASSIFIED
20. Security Class (This Page): UNCLASSIFIED
21. No. of Pages:
22. Price:

FORM NTIS-35 (10-70) USCOMM-DC 40329-P71