 
 Report No. UIUCDCS-R-77-873 
 
 
 AN ARITHMETIC UNIT FOR 
 ON-LINE COMPUTATION 
 
 by 
 MARY JANE IRWIN 
 
 UILU-ENG 77 1722 
 
 May 1977 
 
 DEPARTMENT OF COMPUTER SCIENCE 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 
 
 
 This work was supported in part by the National Science Foundation 
 under Grant No. US NSF 73-07998 and was submitted in partial 
 fulfillment for the Doctor of Philosophy in Computer Science, 1977. 
 
Digitized by the Internet Archive 
 
 in 2013 
 
 http://archive.org/details/arithmeticunitfo873irwi 
 
AN ARITHMETIC UNIT FOR ON-LINE COMPUTATION 
 
 Mary Jane Irwin, Ph.D. 
 
 Department of Computer Science 
 
 University of Illinois at Urbana-Champaign, 1977 
 
 This thesis is concerned with the algorithmic and logic design of 
 an arithmetic unit to be used in a computational environment in which the 
 basic arithmetic operations satisfy the on-line property; that is, to 
 generate the j-th digit of a result (where a digit consists of n bits for 
 base 2^n), it is necessary and sufficient to have the operands available 
 only up to the j-th digit plus, in the case of division, a predetermined 
 number of extra digits which correspond to an "on-line delay." Since there 
 is no on-line delay for addition, subtraction, and multiplication, the 
 unit can begin generating result digits as soon as one digit of each 
 operand has been input. The delay for division is shown to be a small, 
 positive, radix dependent constant. To fulfill the on-line requirements, a 
 set of left-to-right (most-to-least significant), digit-by-digit algorithms 
 has been derived. The existence of such algorithms is contingent upon 
 the use of a redundant representation for the result digits. These algorithms 
 and a block diagram level implementation of the basic arithmetic unit are 
 developed in the thesis. 
 
 The proposed arithmetic unit, capable of performing on-line operations, 
 would be extremely useful in many real-time applications. Due to its 
 potential for performing sequences of operations in an overlapped fashion 
 (pipelining) , the unit could provide an effective way to speed up execution. 
 Furthermore, it is ideally suited for variable precision arithmetic. 
 
 
 ACKNOWLEDGMENTS 
 
 I would like to express my deep appreciation to my advisor, 
 Professor James E. Robertson, for his advice, support, interest, and 
 insight. I would also like to thank the members of my thesis committee 
 for their interest and comments: Professors Mike Faiman, Dave Kuck, 
 Tom Murrell, and Sylvian Ray. 
 
 Fellow students Alf Weaver, John Larson, and Will Gillett 
 provided valuable comments, encouragement, and proof reading skills. 
 Gayanne Carpenter's contributions in terms of friendship and advice 
 on departmental policies are also greatly appreciated. 
 
 Thanks are also due to June Wingler for an outstanding job of 
 typing, to Stan Zundo for his drafting expertise, and to the Department 
 of Computer Science and the National Science Foundation for their 
 financial support. 
 
 Finally, I want to thank my husband, Vern, and son, John, for 
 their love, patience, and understanding throughout this long undertaking 
 and my parents for instilling in me the desire to succeed. 
 
 TABLE OF CONTENTS 

 1. INTRODUCTION 
    1.1 Objectives 
    1.2 Related Work 
    1.3 Number Representation 
    1.4 The Generalized Procedure 
    1.5 Dissertation Overview 

 2. DIVISION 
    2.1 Background 
    2.2 The On-line Algorithm 
    2.3 Quotient Digit Selection 
        2.3.1 Range Restriction Analysis 
        2.3.2 The Selection Equations 
        2.3.3 Determining the Minimum Index Difference 
        2.3.4 The Model Division 
    2.4 Valid Operand Ranges 
    2.5 Hardware Block Diagram 

 3. MULTIPLICATION 
    3.1 Background 
    3.2 The On-line Algorithm 
    3.3 Product Digit Selection 
        3.3.1 The Selection Function 
        3.3.2 Sufficient Precision 
    3.4 Valid Operand Ranges 
    3.5 Hardware Block Diagram 
    3.6 Some Numerical Examples 

 4. ADDITION AND SUBTRACTION 
    4.1 Background 
    4.2 The On-line Algorithm 
    4.3 Sum (Difference) Digit Selection 
    4.4 Hardware Block Diagram 
    4.5 Some Numerical Examples 

 5. IMPLEMENTATION 
    5.1 Design Constraints of LSI 
    5.2 Floating Point Considerations 
    5.3 Algorithmic Modifications of AURORA for Pipelining 
    5.4 AURORA Hardware Requirements 
        5.4.1 The Processing Logic 
        5.4.2 Hardware Modifications of AURORA for Pipelining 
        5.4.3 Speed of the Processing Logic 
        5.4.4 AURORA as a Total Module 
    5.5 System's Level Overview 

 6. SUMMARY 
    6.1 Summary of the Results 
    6.2 Applications 
    6.3 Suggestions for Future Research 

 BIBLIOGRAPHY 

 APPENDIX 
    I   REDUNDANCY DEFINITION 
    II  SAMPLE P-D PLOTS FOR DIVISION 
    III AN ON-LINE RECODING ALGORITHM 

 VITA 
 
 LIST OF TABLES 

 Table 

 2.1 Equations Defining the Selection Regions of Figure 2.1 
 2.2 Equations Defining the Selection Regions of Figure 2.2 
 3.1 Minimum y Values 
 3.2 Example of the Algorithm MULT (r=2, K=1) 
 3.3 Example of the Algorithm MULT (r=4, K=1) 
 3.4 Example of the Algorithm MULT (r=4, K=2/3) 
 4.1 Example of the Algorithm ADD (r=2, K=1) 
 4.2 Example of the Algorithm ADD (r=4, K=1) 
 4.3 Example of the Algorithm SUBT (r=4, K=1) 
 5.1 Result Digit Selector Inputs 
 5.2 Result Digit Selector Inputs After Carry Propagation 
 5.3 Word Processing Time Delay 
 5.4 Rough Pin Count for AURORA 
 5.5 Speed vs. Complexity (Processing Logic) 
 A.1 Example of the Algorithm DIVIDE (r=2, K=1) 
 
 LIST OF FIGURES 

 Figure 

 2.1 P-D Plot with r=4, K=2/3, and δ=4 
 2.2 Modified Symmetric P-D Plot with r=4, K=2/3, and δ=4 
 2.3 P-D Plot with r=2, K=1, and δ=3 
 2.4 P-D Plot with r=2, K=1, and δ=4 
 2.5 P-D Plot Showing the Worst Case Overlap Region 
 2.6 Flowchart for Determining α and β 
 2.7 Selecting a "Good" Staircase 
 2.8 Step Definition Flowchart 
 2.9 Block Diagram for Division 
 3.1 P-p Plot 
 3.2 The Overlap Region of the P-p Plot 
 3.3 P-p Plot Showing the Worst Case Error 
 3.4 Block Diagram for Multiplication 
 4.1 Block Diagram for Addition and Subtraction 
 5.1 A Typical PLA Control Layout 
 5.2 Communication between Mantissa and Exponent Units 
 5.3 Pipelining Multiple AURORAS 
 5.4 Serial Processing with One AURORA 
 5.5 AURORA Processing Logic 
 5.6 Adder I/O Requirements 
 5.7 I/O Requirements of Digital Position i of Adder (i ≥ 1) 
 5.8 Modified AURORA Processing Logic for Pipelining 
 5.9 Block Diagram of an AURORA 
 5.10 Sample System Organization 
 A.1 The On-line Recoding Algorithm 
 
1. INTRODUCTION 
 
 This thesis is concerned with the development of a set of basic 
 (addition, subtraction, multiplication, and division) arithmetic algorithms 
 suitable for use in a computational environment which calls for the on-line 
 processing of data. In on-line processing the operands, as well as the 
 results, flow through the arithmetic unit in a digit-by-digit manner, most 
 significant digit first. In various real-time applications in which the 
 operands are generated serially by an analog-to-digital conversion process 
 beginning with the most significant digits, an arithmetic unit possessing 
 this on-line property is highly desirable. A unit which operates in an 
 on-line fashion can provide the ever more popular microprocessor, a device 
 traditionally restricted from most mathematical applications because of 
 its short word length, with variable precision arithmetic capabilities. 
 At the same time, it can provide for overlapping the generation of result 
 digits with the fetching of operand digits. As an added bonus, the user 
 may halt the processing when sufficient precision has been obtained, which 
 may conceivably occur before all of the operand digits have been processed. 
 With these thoughts in mind, an effort has been made to develop 
 implementable, on-line algorithms for the basic arithmetic operations. 
 
 1.1 Objectives 
 
 A set of implementable on-line algorithms for the basic arithmetic 
 functions should meet several desirable objectives. Three specific 
 objectives were imposed on the algorithmic design during the development 
 stage of this dissertation. 
 
 OBJECTIVE 1 : The algorithms should be on-line with 
 respect to the result digits. They should generate the 
 most significant digits of the result first, in such a 
 way that once generated, the result digit produced at 
 step j would not be affected by any subsequent step k, 
 k > j. 
 
 OBJECTIVE 2 : The algorithms should also be on-line 
 with respect to the operand digits. Only those digits 
 up to and including those provided at step j should be 
 required in order to perform the j-th step of the 
 algorithm. To avoid extreme scaling of the operands 
 during division, a limited number of leading digits of 
 each operand corresponding to an "on-line delay" are 
 accumulated prior to starting the actual division algorithm. 
 
 OBJECTIVE 3 : The basic computational step should be 
 invariant at every step j and the only primitive 
 arithmetic operation should be addition. The selection 
 procedure generating one result digit per step should 
 be such that the step execution time is independent of 
 the operands; i.e., the selection should be based on 
 a limited precision model of the operands. 
 
 The first objective implies the use of a redundant representation 
 of the results. Without redundancy the problem cannot be solved. The 
 second objective calls for a more complicated basic computational step 
 than would otherwise be necessary, to allow for the on-line arrival of 
 operand digits. The step invariance requirement of the third objective 
 makes the control section of the implementation very straightforward, while 
 the requirement that addition be the only primitive operator simplifies 
 the processing section of the implementation. In order for the selection 
 procedure to be independent of the length of the operands and, thus, step 
 invariant, a limited propagation mode of addition must be employed. This, 
 in turn, provides for a cost effective speed up of the overall algorithms. 
 
 Once algorithms which satisfy these objectives have been developed, 
 an arithmetic unit encompassing all of the algorithms must be specified. 
 This unit should, of course, be real world implementable. To accomplish this 
 end, the unit itself must conform to several objectives imposed during the 
 logic design phase of research. 
 
 OBJECTIVE 4 : The unit should comply with the design 
 constraints of LSI (Large Scale Integration) . 
 
 OBJECTIVE 5 : The unit as a whole should have conventional 
 input/output requirements; i.e., the operands which are 
 input and the results which are output should be in a 
 conventional form (e.g., two's complement, sign magnitude, 
 etc.) . 
 
 OBJECTIVE 6 : The unit as designed should be modular and 
 expandable, both from the individual chip and the overall 
 system viewpoint. It should be designed to handle floating 
 point numbers. 
 
OBJECTIVE 7 : The unit should be fast as compared to typical 
 central processor and memory speeds. 
 
 In order to comply with the fourth objective, the unit must possess 
 a high circuit density, regularity of structure, a low pin count, and a 
 large domain of applications. The fifth objective requires that the 
 redundant nature of the algorithms be hidden from the user. The result 
 digits generated by the algorithms must be recoded into a conventional 
 format before they are output. By designing the unit to function either in 
 a serial mode as a stand alone unit or in a pipelined mode in connection 
 with other identical units, the sixth objective would be satisfied. The 
 last objective is difficult to quantify given today's rapidly changing 
 technology. 
 
 1.2 Related Work 
 
 Several of the well known basic algorithms satisfy the on-line 
 property with respect to either the operands or the results. Consider, for 
 example, conventional division which has the on-line property with respect 
 to the quotient digits. Similarly, conventional multiplication has the 
 on-line property with respect to the multiplier. Several authors have 
 extended this on-line property for multiplication to the product digits 
 as well [AVI62, PIS70, GOY76]. 
 
 It is also possible to define algorithms conforming to a right-to- 
 left type of processing; i.e., algorithms which operate from least-to-most 
 significant end. Conventional addition and subtraction process operands 
 and produce result digits in this manner. Atrubin [ATR65] developed a 
 right-to-left type algorithm for multiplication. But, since the division 
 process must by its very nature operate in a left-to-right fashion for 
 the calculation of the quotient, this left-to-right, on-line processing 
 was imposed upon all of the algorithms. The most significant digit first 
 approach is also consistent with other arithmetic processes such as operand 
 normalization, mantissa overflow detection, and result sign determination. 
 All of these processes inherently require examination of the most significant 
 digits of the result. Thus, in this thesis the on-line process is defined 
 to be that process in which all of the operand digits as well as the result 
 digits flow through the arithmetic unit in a left-to-right, digit-by-digit 
 fashion. 
 
 To fulfill the on-line requirements, a set of left-to-right, 
 digit-by-digit algorithms had to be derived. The existence of such algorithms is 
 contingent upon the use of a redundant representation for the result. In 
 the past, redundant number representations have often proved useful for 
 speeding up arithmetic operations [MET57, ROB58, TOC58, AVI61, AVI62, PEN62]. 
 In a non-redundant system, even simple operations like addition and 
 subtraction possess a significant on-line delay† due to the carry propagation 
 which may involve the full precision of the result. By allowing redundancy 
 in the number representation, it is possible to limit the carry propagation 
 to one (or two) digital positions [MET57, ROH67, BOR68, ROB67]. Thus, on-line 
 algorithms for addition (and subtraction) with on-line delays of at most one 
 or two digits can be easily developed. Campeau [CAM70] has also developed 
 an on-line algorithm for multiplication with an on-line delay of one digit. 
 More recent work in the area of on-line computation has been done by 
 Ercegovac [ERC75, TRI75]. He developed an on-line multiplication algorithm 
 which combines the technique of incremented multiplication, as used in 
 digital differential analyzers [BRA63, CAM70, BAK75], with the use of a 
 redundant number system. Ercegovac also proposed an on-line division 
 algorithm in his work. It, however, is not suitable for implementation 
 (in this author's viewpoint) because it requires excessive scaling of the 
 operands initially to insure convergence of the algorithm. The existence 
 of a reasonable on-line division algorithm, however, has not been at all 
 obvious. The first attempt at deriving a reasonable algorithm for division 
 was made by Trivedi [TRI75] in the Spring of 1975. In his algorithm the 
 on-line delay for division depends upon the radix and other properties of 
 the number system used. The delay is generally a small, positive constant 
 and alleviates the problem of excessive operand scaling. Division methods 
 presented in this thesis are extensions of Trivedi's preliminary work. His 
 algorithm, combined with parts of the previously mentioned work by Ercegovac, 
 led to the evolution of a set of compatible algorithms for the basic 
 arithmetics. 

 † Recall that the on-line delay corresponds to the number of digits of the 
 operands which must be input before the generation of correct result digits 
 can be initiated. 
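 The limited carry propagation that redundancy makes possible can be 
 illustrated with a short sketch. The code below is not taken from the 
 cited references; it is a minimal signed-digit addition under the assumed 
 conditions r ≥ 3 and ρ ≥ (r+1)/2, where ρ is the largest digit magnitude. 
 The names sd_value and sd_add are hypothetical. Each result digit depends 
 only on the operand digits in its own position and in the position 
 immediately to its right, so a carry never travels more than one place. 

```python
from fractions import Fraction


def sd_value(digits, r):
    """Value of a fractional digit string, most significant digit first."""
    return sum(Fraction(d, r ** i) for i, d in enumerate(digits, start=1))


def sd_add(x, y, r=4, rho=3):
    """Add two signed-digit fractions; carries move at most one position.

    Assumed (illustrative) conditions: rho <= r - 1 and 2 * rho >= r + 1.
    Returns (carry_out, digits), where carry_out has weight r**0.
    """
    assert len(x) == len(y)
    assert rho <= r - 1 and 2 * rho >= r + 1
    interim, transfer = [], []
    for xi, yi in zip(x, y):
        s = xi + yi
        # The transfer digit depends only on this position's operand digits.
        if s >= rho:
            t = 1
        elif s <= -rho:
            t = -1
        else:
            t = 0
        transfer.append(t)
        interim.append(s - r * t)  # interim digit, magnitude <= rho - 1
    # Each position absorbs the transfer produced one place to its right.
    incoming = transfer[1:] + [0]
    digits = [w + t for w, t in zip(interim, incoming)]
    return transfer[0], digits


if __name__ == "__main__":
    carry, z = sd_add([3, -2, 1, 3], [3, 3, -1, 3])
    print(carry, z)
```

 Because the interim digit is bounded by ρ − 1 in magnitude, absorbing an 
 incoming transfer of ±1 cannot generate a further carry. 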
 
 1.3 Number Representation 
 
 All numerical values considered in this thesis are assumed to be 
 represented in finite precision, floating point format with a representational 
 error of $|\varepsilon| < r^{-m}$ where m is the precision of the mantissa. The effect of 
 the representational errors is minor and necessitates only a slight 
 extension of the precision of the initial data to obtain the required 
 precision of the results. Thus, the representational error is not of 
 immediate concern. It is assumed that the precision of all initial data 
 has been properly adjusted so that, for a given precision of m bits, the 
 data can be regarded as exact. 
 
 Consider an m digit, radix r fractional component of a floating 
 point number N, where 

 \[ N = \sum_{i=1}^{m} n_i r^{-i} . \]

 Using a conventional representation, each digit $n_i$ can assume any value from 
 the digit set $\{0, 1, \ldots, (r-1)\}$. Such representations, allowing only r values 
 in the digit set, are non-redundant. There is only one (unique) representa- 
 tion for each representable number. By contrast, number systems that allow 
 more than r values in the digit set are redundant and allow more than one 
 representation for each number. See Appendix I for a formal definition of 
 redundancy. The scope of this thesis covers those cases where the operands 
 are provided in a conventional format while the results are generated in a 
 redundant format. A "mostly" on-line recoding scheme for converting 
 redundant results into conventional format, as given in Appendix III, is 
 then used so that the unit will satisfy OBJECTIVE 5. 
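 The non-uniqueness of a redundant representation can be made concrete with 
 a small enumeration. The sketch below is illustrative only; the function 
 name representations and the symmetric digit set {−ρ, ..., ρ} are 
 assumptions made for the example (the formal definition is in Appendix I). 

```python
from fractions import Fraction
from itertools import product


def representations(value, r, rho, m):
    """All m-digit strings over {-rho, ..., rho} whose radix-r fraction equals value."""
    digit_set = range(-rho, rho + 1)
    found = []
    for digits in product(digit_set, repeat=m):
        v = sum(Fraction(d, r ** i) for i, d in enumerate(digits, start=1))
        if v == value:
            found.append(digits)
    return found


if __name__ == "__main__":
    # Radix 2 with digits {-1, 0, 1}: the value 1/8 has several 3-digit forms.
    for rep in representations(Fraction(1, 8), r=2, rho=1, m=3):
        print(rep)
```

 With the three-valued radix 2 digit set, the value 1/8 has more than one 
 three-digit representation, whereas the conventional digit set {0, 1} 
 admits only 0.001. 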
 
 1.4 The Generalized Procedure 
 
 The algorithms developed in this thesis consist of the following 
 sequence of operations whose order may vary slightly from one specific 
 function to another: 
 
 1) initialization which consists of waiting for sufficient 
 digits, corresponding to the on-line delay†, to be 
 input; 

 2) input of the next operand digits; 

 3) an addition which corrects for any error made in the 
 previous result digit selection and accounts for the 
 new operand digits just received; 

 4) selection of the next result digit based upon a limited 
 precision model; and 

 5) a completion test which loops back to step 2) upon 
 failure. 

 From the above sequence, it is obvious that the computational algorithms 
 are step invariant with the only primitive operation being addition as 
 required by OBJECTIVE 3. The operands are input according to OBJECTIVE 2 
 and the results are generated according to OBJECTIVE 1. 

 † This delay is shown to be zero for addition, subtraction, and multiplication 
 and is a small, radix dependent constant for division. 

 1.5 Dissertation Overview 

 Chapters 2, 3, and 4 present compatible on-line algorithms for 
 division, multiplication and addition/subtraction, respectively. The on-line 
 division algorithm is of primary concern, since it is the most difficult of 
 all the algorithms to specify. Once division is specified, compatible 
 on-line algorithms for the other functions can then be defined. Each chapter 
 contains necessary background material on the function in question. The 
 on-line algorithms are then presented along with their convergence conditions 
 and result digit selection schemes. Finally, valid ranges for the operands 
 are derived and a hardware block diagram with description is given. 
 
 Chapter 5 addresses the question of implementation. Suitability to 
 LSI, floating point considerations, hardware requirements, and a system's 
 level overview are discussed. A summary of the results, applications for 
 on-line arithmetic units, and the possible implications of such a device are 
 discussed in the final chapter. 
 
 
 2. DIVISION 
 
 Algorithms which satisfy the on-line property for addition and 
 subtraction can be easily specified [AVI61, ROH67, ROB67]. Multiplication 
 requires a somewhat more elaborate approach [ERC75, TRI75] and will be 
 discussed in detail in Chapter 3. However, the existence of an on-line 
 division algorithm was not determined until a first attempt at such an 
 algorithm was made by Trivedi in the spring of 1975 [TRI75]. The methods 
 presented in this chapter are extensions of this preliminary work. The 
 division algorithm must be of primary concern, since the algorithmic and 
 logic design for division are the most difficult of all the algorithms 
 to specify. Compatible algorithms for on-line multiplication, addition, 
 and subtraction can then be specified. 
 
 2.1 Background 
 
 In designing a computer arithmetic unit, division is the most 
 difficult of the basic operations to implement efficiently. Division is 
 inherently a trial-and-error process requiring an initial guess of a quotient 
 digit followed by a comparison (in the form of a subtraction) to determine 
 whether this guess was correct. If it was not, the initial quotient digit 
 is modified and the process is repeated. This class of division based 
 upon subtraction can be defined by the recursive relationship 
 
 \[ P_j \leftarrow r P_{j-1} - q_j D , \qquad j = 1, \ldots, m \tag{2.1} \]

 in which 

 $P_0$ is the dividend, 

 $P_{j-1}$ is the partial remainder used in the j-th recursion, 

 $P_m$ is the remainder, 

 j is the recursion index, 

 $q_j$ is the j-th quotient digit, 

 D is the divisor, and 

 r is the radix. 
 
 To form the partial remainder, $P_j$, a multiple of the divisor is 
 subtracted from the previous shifted partial remainder. The determination 
 of which multiple of D to subtract is dependent upon the quotient digit; 
 but it is precisely this quotient digit that must be computed. It is not 
 known a priori. As it stands, this recursion relationship for division 
 does not adequately specify how $q_j$ is to be selected. By adding the range 
 restriction (which is intuitively applied when doing the hand calculation) 

 \[ |P_j| \le K\,|D| , \tag{2.2} \]

 (with K as described in the footnote†) the division algorithm becomes 
 completely specified. The important point here is that division not only 
 requires an addition or subtraction (as in multiplication), but also the 
 selection of a quotient digit such that the value of the new partial 
 remainder is within a specified range. 

 † K will be defined in the next section. It is sufficient here to know 
 that $1/2 \le K \le 1$. 

 On-line division, as investigated in this thesis, is yet a further 
 complication of this process in that the full precision of the operands, 
 $P_0$ and D, is not available for comparison. At first consideration it would 
 seem that on-line division is impossible. 
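 The conventional recursion (2.1) together with the range restriction (2.2) 
 can be sketched as follows. This is an illustration under stated 
 assumptions rather than the procedure developed in this thesis: the 
 selection rule (round $rP_{j-1}/D$ to the nearest integer and clamp it to 
 the digit set) is hypothetical, the name divide_conventional is invented 
 for the example, and K is taken as ρ/(r−1), the definition given in 
 Section 2.2. Note that the selection inspects the full precision of both 
 the partial remainder and the divisor, which is exactly what is 
 unavailable on-line. 

```python
from fractions import Fraction


def divide_conventional(N, D, r=4, rho=2, m=6):
    """Digit recurrence P_j = r*P_{j-1} - q_j*D with a full precision selection.

    Returns the m-digit quotient R and the list [P_0, P_1, ..., P_m].
    """
    K = Fraction(rho, r - 1)
    assert D > 0 and abs(N) <= K * D  # range restriction must hold for P_0
    P = Fraction(N)
    R = Fraction(0)
    remainders = [P]
    for j in range(1, m + 1):
        # Hypothetical selection: nearest integer to r*P/D, using ALL digits.
        q = round(r * P / D)
        q = max(-rho, min(rho, q))
        P = r * P - q * D            # recursion (2.1)
        R += Fraction(q, r ** j)
        remainders.append(P)
    return R, remainders


if __name__ == "__main__":
    R, rem = divide_conventional(Fraction(1, 3), Fraction(3, 4))
    print(R, float(R), float(Fraction(1, 3) / Fraction(3, 4)))
```

 Under these assumptions every partial remainder stays within the range 
 $|P_j| \le K|D|$, and the identity $N = R \cdot D + P_m r^{-m}$ holds exactly. 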
 
 By allowing the quotient digits to take on a redundant† representation, 
 many of the above problems which are seemingly inherent to division can be 
 resolved. As will be discussed in this chapter, redundancy in the 
 representation of the quotient permits inspection of fewer digits of the operands 
 in the selection of the quotient digits. This seems to be intuitively 
 correct, since without redundancy the quotient has one unique representation 
 and thus each digit of that quotient must be selected precisely. It should be 
 clear that without redundancy it is not possible to avoid a set of full 
 precision comparisons. But, with redundancy the selection of the quotient 
 digit need not be precise. A selection based upon just the first few most 
 significant digits of the divisor and partial remainder is good enough; 
 i.e., the selection is based upon a limited precision version of the operands. 
 Thus, by using redundancy the trial-and-error nature of division can be 
 avoided. The resultant non-unique representation of the quotient does, 
 however, complicate the division in that the redundant form must eventually 
 be converted to a conventional representation. See Appendix III for a 
 description of the result digit recoding algorithm. Note that, in most 
 cases, this recoding algorithm is also an on-line, most-to-least significant 
 process. 

 † See Appendix I for a discussion of redundant number representations. 

 In the case of on-line division it should be immediately evident 
 that in the absence of redundancy the problem cannot be solved. By definition, 
 during on-line processing the full precision of the operands is not available 
 for selection. Therefore, the quotient digit selection must be based upon a 
 limited precision estimate of the divisor and partial remainder. Error is 
 also introduced into the on-line division process when calculating the new 
 partial remainder. During the j-th recursion, only those operand digits 
 which have been received prior to the j-th iteration have been included in 
 the computation of the old partial remainder. Part of the j-th calculation 
 must then be a correction factor which takes into account the effects of the 
 new dividend and divisor digits on the value of the new partial remainder. 
 Thus, the quotient digit selection is based on a possibly erroneous partial 
 remainder, though this error is relatively small. Can the margin of 
 allowable error which is permitted by the use of redundant quotient digits 
 also be made to cover this extra error? If so, how much error of this type 
 can be tolerated? Is there some minimum allowable operand precision required 
 in order for on-line division to proceed? This chapter will resolve these 
 questions by specifying an on-line division algorithm and the conditions on it. 
 
 2.2 The On-line Algorithm 
 
 For floating point operation each number X is of the form 

 \[ X = f \cdot r^{e} \]

 where, usually, f is a fraction in the range 

 \[ 1/r \le |f| < 1 \]
 
 and e is an exponent. The arithmetic for the fractional parts is handled 
 separately from the arithmetic for the exponents. Thus, two arithmetic 
 units are required, one for fractions and one for exponents. Design of the 
 exponent handling unit is straightforward, requiring only addition and 
 
 
 subtraction. However, the design of the unit to handle the fractional 
 arithmetic is nontrivial and it is the algorithms for this unit which will 
 be discussed in depth in this thesis. 
 
 Let the radix r representation of the fractional part of the 
 positive dividend, divisor, and quotient† be denoted by N, D, and R 
 respectively, such that 

 \[ N = \sum_{i=1}^{m} n_i r^{-i} , \qquad
    D = \sum_{i=1}^{m} d_i r^{-i} , \qquad
    R = \sum_{i=1}^{m} q_i r^{-i} , \]

 and 

 \[ R = N/D \]

 to m digit precision. 

 † The result (quotient) is denoted by R so as to be compatible with the 
 notation used in the other algorithms as discussed in the next two 
 chapters. 

 Recall that in an on-line environment the digits of the dividend 
 and divisor are not known in advance, but are available on-line, 
 digit-by-digit, most significant digit first. These operand digits, $n_i$ and $d_i$, are 
 typically members of a conventional, nonredundant digit set, $\mathcal{D}$, such that 

 \[ n_i , d_i \in \{0, 1, 2, \ldots, r-1\} . \]

 It is assumed that the dividend and divisor are in normalized form upon 
 input to the unit; i.e., 

 \[ 1/2 \le D, N < 1 . \]

 The methods presented here are extendible to the case when the operands are 
 also in redundant form [ATK70]. 
 
 Assume that the first quotient digit, $q_1$, can be properly selected 
 after $\delta$ leading digits (the on-line delay or 'index difference') of the 
 dividend and divisor are known. Thereafter, one new digit of the quotient 
 can be determined upon the receipt of one new digit each of the dividend 
 and divisor. Let the quotient digits be members of a symmetric redundant 
 digit set, $\mathcal{D}_\rho$, such that 

 \[ q_i \in \{-\rho, -(\rho-1), \ldots, -1, 0, 1, \ldots, \rho-1, \rho\} \]

 where 

 \[ r/2 \le \rho \le r-1 . \]

 The degree of redundancy will be denoted by K, referred to as the redundancy 
 coefficient, where 

 \[ K = \frac{\rho}{r-1} . \]

 Thus, when using a maximally redundant digit set, K = 1. When K = 1/2 
 (i.e., $\rho = (r-1)/2$) there is no redundancy in the digit set. 
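 As a quick numerical illustration of the redundancy coefficient, the 
 following sketch tabulates the symmetric digit set and K for a few 
 choices of radix and largest digit. The helper names digit_set and 
 redundancy_coefficient are hypothetical. 

```python
from fractions import Fraction


def digit_set(rho):
    """Symmetric digit set {-rho, ..., -1, 0, 1, ..., rho}."""
    return list(range(-rho, rho + 1))


def redundancy_coefficient(r, rho):
    """K = rho / (r - 1), kept exact as a fraction."""
    return Fraction(rho, r - 1)


if __name__ == "__main__":
    for r, rho in [(2, 1), (4, 2), (4, 3)]:
        digits = digit_set(rho)
        # A digit set with more than r values is redundant.
        print(r, rho, digits, redundancy_coefficient(r, rho), len(digits) > r)
```

 For radix 4, the digit set with largest digit 2 has five values and 
 K = 2/3, while the maximally redundant set with largest digit 3 has seven 
 values and K = 1. 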
 
 The partial remainder is computed via a limited carry-borrow 
 propagation adder [ROH67, ROB67], resulting in a redundantly represented 
 partial remainder. A limited carry-borrow propagation adder is necessary 
 to make the time required to perform the recursive step independent of the 
 precision of the operands; i.e., a carry free, totally parallel addition is 
 possible. Thus, the digits of the partial remainder are also members of a 
 redundant digit set, $\mathcal{D}_{\rho'}$, which may or may not be the same set as $\mathcal{D}_\rho$, the 
 quotient digit set. The redundancy coefficient of the adder is, then, $K'$. 
 
 Given these definitions, the algorithm DIVIDE† [TRI75], which is 
 shown below, can be specified. In this algorithm, the dividend 
 and divisor are assumed to be padded with zero digits on the right (least 
 significant end). Note that the basic recursion, (2.3), is more complex than 
 that of the standard division recursion, (2.1), due to the corrective action 
 necessitated by the operand digits arriving on-line during each iteration. 

 † The algorithm is so structured as to be compatible with the multiplication 
 algorithm specified in Chapter 3. 

 The convergence of the algorithm DIVIDE can be established as 
 follows. Using the basic recursion (2.3) in algorithm DIVIDE, the following 
 expression for the on-line version of the partial remainder, $P_j$, can be 
 derived by induction on j: 

 \[ P_j = r^{j} \Bigl[ \sum_{i=1}^{j+\delta} n_i r^{-i}
        - \Bigl( \sum_{i=1}^{j} q_i r^{-i} \Bigr)
          \Bigl( \sum_{i=1}^{j+\delta} d_i r^{-i} \Bigr) \Bigr] \tag{2.4} \]

 which implies that, as $j \to m$, 

 \[ P_m = r^{m} \, [ N - R \cdot D ] \]

 so that 

 \[ R = N/D - P_m r^{-m} / D . \]
 
 Algorithm DIVIDE: 

 Step 1 [Initialization]: 
     $P_0 \leftarrow \sum_{i=1}^{\delta} n_i r^{-i}$; 
     $D_0 \leftarrow \sum_{i=1}^{\delta} d_i r^{-i}$; 
     $R_0 \leftarrow 0$; 
     $j \leftarrow 0$; GO TO Step 4; 

 Step 2 [Input Digit]: 
     $D_j \leftarrow D_{j-1} + d_{j+\delta} \, r^{-j-\delta}$; 

 Step 3 [Basic Recursion]: 
     $P_j \leftarrow r P_{j-1} - q_j D_j + n_{j+\delta} \, r^{-\delta} - R_{j-1} \, d_{j+\delta} \, r^{-\delta}$;     (2.3) 

 Step 4 [Selection]: 
     $q_{j+1} \leftarrow \mathrm{SELECT}(r P_j, D_j)$; 
     $R_{j+1} \leftarrow R_j + q_{j+1} \, r^{-j-1}$; 

 Step 5 [Test]: 
     IF (j < m) 
         THEN $j \leftarrow j + 1$; GO TO Step 2; 
         ELSE END DIVIDE; 
 
 
 Therefore, if a quotient digit selection procedure, SELECT in 
 Step 4 of algorithm DIVIDE, is devised such that for j = m 

 \[ |P_j| \le K \cdot D \tag{2.5} \]

 where 

 \[ 1/2 \le K \le 1 , \]

 then R = N/D can be computed to m digit precision. Assuming that a 
 selection procedure can be specified which generates the quotient digit 
 $q_{j+1}$ while guaranteeing that $|P_{j+1}| \le KD$ given that $|P_j| \le KD$, then, by 
 induction, the range restriction (2.5) will hold for all values of j. 
 (For j=0, (2.5) can be satisfied by appropriately preshifting the dividend 
 as explained in Section 2.4.) Such a selection procedure is derived in the 
 next section. First a bound will be established on $|\hat{P}_j - P_j|$ and then a 
 selection procedure will be developed which guarantees that $|\hat{P}_j| \le KD$. 
 This in turn will give a bound on $P_j$. 
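 The structure of algorithm DIVIDE can be exercised with a short sketch. 
 Several things here are assumptions and not part of the thesis: the 
 function name online_divide, the choice r = 2, ρ = 1, δ = 4, and above all 
 the SELECT rule, which simply rounds $rP_j/D_j$ to the nearest digit using 
 every digit received so far. The selection procedure derived in Section 
 2.3 instead inspects only a limited precision estimate. The example 
 dividend is also assumed to be already preshifted so that it is well below 
 the divisor. 

```python
from fractions import Fraction


def online_divide(n, d, r=2, rho=1, delta=4):
    """Sketch of algorithm DIVIDE; n and d hold digits n_1..n_m, d_1..d_m.

    Returns (R, quotient_digits, P_m). SELECT is a hypothetical rounding rule.
    """
    m = len(n)
    assert len(d) == m and delta >= 1

    def digit(seq, i):
        # 1-based access; operands are padded with zero digits on the right.
        return seq[i - 1] if i <= m else 0

    # Step 1: accumulate the first delta digits of each operand.
    P = sum(Fraction(digit(n, i), r ** i) for i in range(1, delta + 1))
    D = sum(Fraction(digit(d, i), r ** i) for i in range(1, delta + 1))
    R = Fraction(0)
    q_digits = []
    for j in range(m):
        # Step 4: select q_{j+1} from r*P_j and D_j (assumed rule).
        q = max(-rho, min(rho, round(r * P / D)))
        q_digits.append(q)
        # Step 2: input the next digits and extend the divisor.
        n_new = digit(n, j + 1 + delta)
        d_new = digit(d, j + 1 + delta)
        D = D + Fraction(d_new, r ** (j + 1 + delta))
        # Step 3: basic recursion (2.3); R still holds R_j at this point.
        P = r * P - q * D + (n_new - R * d_new) / r ** delta
        R = R + Fraction(q, r ** (j + 1))
    return R, q_digits, P


if __name__ == "__main__":
    n = [0, 1, 0, 1, 1, 0, 1, 1]   # N = 91/256
    d = [1, 1, 0, 1, 0, 0, 1, 1]   # D = 211/256
    R, q, P = online_divide(n, d)
    print(q, R, float(R), 91 / 211)
```

 In this run the quotient digits are signed, and the accumulated quotient 
 differs from N/D by $P_m r^{-m}/D$, in agreement with the convergence 
 expression given before the algorithm. 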
 
 2.3 Quotient Digit Selection 
 
 The division procedure may be defined graphically with a construction 
 suggested by C. V. Freiman [FRE61]. The basis for its construction is the 
 basic recursive relationship, (2.3), together with the range restriction, 
 (2.5), which has been adjusted to include the error introduced by on-line 
 processing. The figure is essentially a plot of partial remainder versus 
 divisor values and is thus designated a P-D plot. By analyzing such a 
 plot, a quotient digit selection procedure can be fully specified for a 
 given r, ρ, and δ. 
 
 2.3.1 Range Restriction Analysis 
 
 From the recursion (2.1), the following equation can be derived by 
 induction. 

 $$P_j = r^{j}\Big[\sum_{i=1}^{m} n_i r^{-i} - \Big(\sum_{i=1}^{j} q_i r^{-i}\Big)\Big(\sum_{i=1}^{m} d_i r^{-i}\Big)\Big]. \qquad (2.6)$$

 Subtracting equation (2.4) from equation (2.6) gives 

 $$P_j - \hat{P}_j = r^{j}\Big[\sum_{i=j+\delta+1}^{m} n_i r^{-i} - \Big(\sum_{i=1}^{j} q_i r^{-i}\Big)\Big(\sum_{i=j+\delta+1}^{m} d_i r^{-i}\Big)\Big]. \qquad (2.7)$$

 Recall that $P_j$ is the normal full precision representation of the partial 
 remainder and that $\hat{P}_j$ is the on-line version of the partial remainder. 
 Thus, equation (2.7) is a measure of the error introduced at a particular 
 step, j, by using the on-line algorithm. Now, determine the bounds on this 
 error. 
 
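 UPPER BOUND: 

 $$P_j - \hat{P}_j \le r^{j}\Big[\sum_{i=j+\delta+1}^{m} (r-1) r^{-i} + \Big(\sum_{i=1}^{j} \rho r^{-i}\Big)\Big(\sum_{i=j+\delta+1}^{m} (r-1) r^{-i}\Big)\Big]$$

 $$\le r^{j}\Big[(r-1)(r^{-j-\delta-1} - r^{-m-1})\Big(\frac{r}{r-1}\Big) + \rho\,(r^{-1} - r^{-j-1})\Big(\frac{r}{r-1}\Big)(r-1)(r^{-j-\delta-1} - r^{-m-1})\Big(\frac{r}{r-1}\Big)\Big]$$

 $$= r^{-\delta}(1+K) - r^{-m+j}(1+K) - r^{-j-\delta}K + r^{-m}K$$

 Since m is assumed to be large with respect to δ, the upper bound is 
 certainly less than 

 $$P_j - \hat{P}_j \le (1+K) r^{-\delta}.$$

 LOWER BOUND: 

 $$P_j - \hat{P}_j \ge -r^{j}\Big[\Big(\sum_{i=1}^{j} \rho r^{-i}\Big)\Big(\sum_{i=j+\delta+1}^{m} (r-1) r^{-i}\Big)\Big]$$

 $$\ge -r^{j}\Big[\rho\,(r^{-1} - r^{-j-1})\Big(\frac{r}{r-1}\Big)(r-1)(r^{-j-\delta-1} - r^{-m-1})\Big(\frac{r}{r-1}\Big)\Big]$$

 $$= -K(r^{-\delta} - r^{-m+j} - r^{-j-\delta} + r^{-m})$$

 And since m is assumed to be large with respect to δ, the lower bound is 
 certainly greater than 

 $$P_j - \hat{P}_j \ge -K r^{-\delta}.$$

 Combining the above results 

 $$-K r^{-\delta} \le P_j - \hat{P}_j \le (1+K) r^{-\delta} \qquad (2.8)$$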
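
 The bound (2.8) can be spot-checked numerically. The sketch below evaluates the difference (2.7) exactly for given digit strings and compares it with the two limits; the function names are illustrative, and the redundancy coefficient is taken as K = ρ/(r-1), which is the relation used in the derivation above.

```python
from fractions import Fraction

def remainder_gap(n, d, q, r, delta, j):
    """P_j - P_hat_j from equation (2.7); all digit lists are MSD first."""
    m = len(n)
    w = lambda i: Fraction(1, r ** i)                     # weight r^(-i)
    tail_n = sum(n[i - 1] * w(i) for i in range(j + delta + 1, m + 1))
    tail_d = sum(d[i - 1] * w(i) for i in range(j + delta + 1, m + 1))
    R_j = sum(q[i - 1] * w(i) for i in range(1, j + 1))
    return r ** j * (tail_n - R_j * tail_d)

def within_bound_2_8(gap, r, rho, delta):
    """True when -K r^(-delta) <= gap <= (1 + K) r^(-delta)."""
    K = Fraction(rho, r - 1)
    unit = Fraction(1, r ** delta)
    return -K * unit <= gap <= (1 + K) * unit
```

 The extreme cases are the ones used in the derivation: all operand digits equal to r-1 with all quotient digits equal to -ρ for the upper bound, and a zero dividend tail with all quotient digits equal to +ρ for the lower bound.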
 
 Recall from equation (2.2), the range restriction on $P_j$, that 

 $$-KD \le P_j \le KD \qquad (2.9)$$

 From equations (2.8) and (2.9), the range restriction on $\hat{P}_j$ becomes 

 $$-KD + Kr^{-\delta} \le \hat{P}_j \le KD - (1+K)r^{-\delta}$$

 Since K is positive, equation (2.5) is satisfied by the above equation for 
 j = 1,...,m and by using this range restriction on $\hat{P}_j$ to define the 
 selection procedure, R = N/D can be computed to m digit precision. 
 
 2.3.2 The Selection Equations 
 
 By applying the range restriction (2.9) on $P_{j+1}$ and using the 
 (incremented) recursion relationship (2.1), the selection region of $rP_j$ for 
 each possible value of $q_{j+1}$ can be determined. Let $q_{j+1} = i$ such that 
 $|i| \le \rho$; then the i-selection region guaranteeing the range restriction 
 (2.9) is given by 

 $$(-K+i)D \le rP_j \le (K+i)D \qquad (2.10)$$

 The corresponding i-selection region for $r\hat{P}_j$ is obtained using equations (2.8) 
 and (2.10). Thus, a partial definition of the SELECT function given in 
 Step 4 of DIVIDE as 

 $$q_{j+1} \leftarrow \mathrm{SELECT}(r\hat{P}_j, D_j)$$

 becomes 

 $$(-K+i)D + Kr^{-\delta+1} \le r\hat{P}_j \le (K+i)D - (1+K)r^{-\delta+1} \qquad (2.11)$$
 
 This condition can be graphically described by means of a P-D plot, 
 as in Figure 2.1. (The difference between this P-D plot and the conventional 
 P-D plot is that the ordinate is $r\hat{P}_j$ instead of $rP_j$.) It consists of a 
 family of curves which are linear functions of D with $q_{j+1}$ as a parameter 
 ranging from -ρ to +ρ in steps of 1. The area between the maximum $r\hat{P}_j$ and 
 the minimum $r\hat{P}_j$ will be denoted the "$q_{j+1} = i$ region." 

 So, for a given base (r), redundancy coefficient (K), and index 
 difference (δ) the division procedure can be specified via a corresponding 
 P-D plot. A given value of $D_j$ and $r\hat{P}_j$ will correspond to a point in an 
 i-selection region. The quotient digit $q_{j+1}$ is, therefore, i and is used in 
 forming the next partial remainder. 
 
 Figure 2.1 is an example of a full P-D plot with r = 4, K = 2/3, and 
 δ = 4. The equations for the selection lines are given in Table 2.1. Note 
 that, as a consequence of redundancy in the representation of the quotient, 
 there is an overlap between each adjacent quotient digit selection region. 
 Some values of $r\hat{P}_j$ and $D_j$ will specify a point for which either $q_{j+1} = i$ or 
 $q_{j+1} = i-1$ is a valid choice. It is this overlap which permits the quotient 
 digit selection to be made on the basis of estimates of the full precision 
 divisor and shifted partial remainder and thus permits on-line division. 

 By tightening the lower bound of the selection equation (2.11) to 
 give the selection equation 

 $$(-K+i)D + (1+K)r^{-\delta+1} \le r\hat{P}_j \le (K+i)D - (1+K)r^{-\delta+1} \qquad (2.12)$$
 
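 Figure 2.1 P-D Plot with r=4, K=2/3, and δ=4 

 Table 2.1 Equations Defining the Selection Regions of Figure 2.1 

 Selection Lines      Selection Equations 

 UPPER(2)             $4\hat{P}_j \le (\tfrac{2}{3} + 2)D - \tfrac{5}{3}\,4^{-3}$ 
 LOWER(2)             $4\hat{P}_j \ge (-\tfrac{2}{3} + 2)D + \tfrac{2}{3}\,4^{-3}$ 

 UPPER(1)             $4\hat{P}_j \le (\tfrac{2}{3} + 1)D - \tfrac{5}{3}\,4^{-3}$ 
 LOWER(1)             $4\hat{P}_j \ge (-\tfrac{2}{3} + 1)D + \tfrac{2}{3}\,4^{-3}$ 

 UPPER(0)             $4\hat{P}_j \le (\tfrac{2}{3})D - \tfrac{5}{3}\,4^{-3}$ 
 LOWER(0)             $4\hat{P}_j \ge (-\tfrac{2}{3})D + \tfrac{2}{3}\,4^{-3}$ 

 UPPER($\bar{1}$)     $4\hat{P}_j \le (\tfrac{2}{3} - 1)D - \tfrac{5}{3}\,4^{-3}$ 
 LOWER($\bar{1}$)     $4\hat{P}_j \ge (-\tfrac{2}{3} - 1)D + \tfrac{2}{3}\,4^{-3}$ 

 UPPER($\bar{2}$)     $4\hat{P}_j \le (\tfrac{2}{3} - 2)D - \tfrac{5}{3}\,4^{-3}$ 
 LOWER($\bar{2}$)     $4\hat{P}_j \ge (-\tfrac{2}{3} - 2)D + \tfrac{2}{3}\,4^{-3}$ 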
 
 the full P-D plot becomes symmetric about both axes as shown in Figure 2.2 
 with r = 4, K = 2/3, and δ = 4. The corresponding selection equations are 
 given in Table 2.2. 

 Although this more restrictive, but still valid, equation reduces the 
 overlap regions slightly, this reduction is more than compensated for by the 
 fact that all of the quadrants now have identical (except for sign) overlap 
 regions. Thus, only quadrant I need actually be implemented. This small 
 change in the lower bound does not significantly increase the complexity of 
 the step definition for the quotient digit selection (see Section 2.3.4). 
 
 In the rest of this chapter we will restrict our attention to the 
 first quadrant of the P-D plot defined according to the selection equation 
 (2.12). (Representative plots are collected into Appendix II.) 
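
 The selection lines of (2.11) and (2.12) are simple enough to evaluate directly. The following sketch (function names are illustrative, not from the thesis) returns the lower and upper limits on $r\hat{P}_j$ for a digit value i at a divisor value D, and the height of the overlap between the regions for i and i-1. The `symmetric` flag switches between the tightened lower bound of (2.12) and the original lower bound of (2.11).

```python
from fractions import Fraction

def selection_region(i, D, r, K, delta, symmetric=True):
    """(lower, upper) limits on r*P_hat_j for which q_{j+1} = i is valid."""
    margin = (1 + K) * Fraction(r) ** (1 - delta)          # (1+K) r^(-delta+1)
    low_margin = margin if symmetric else K * Fraction(r) ** (1 - delta)
    return (-K + i) * D + low_margin, (K + i) * D - margin

def overlap(i, D, r, K, delta, symmetric=True):
    """Height of the band where both i and i-1 are valid (negative if none)."""
    lower_i, _ = selection_region(i, D, r, K, delta, symmetric)
    _, upper_prev = selection_region(i - 1, D, r, K, delta, symmetric)
    return upper_prev - lower_i
```

 For r = 4, K = 2/3, δ = 4 the overlap between the regions for 2 and 1 at D = 1/2 is 11/96 under (2.12) and 25/192 under (2.11), which illustrates the slight loss of overlap that the text accepts in exchange for symmetry.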
 
 2.3.3 Determining the Minimum Index Difference 
 
 Recall the initial assumption that the first quotient digit, $q_1$, 
 can be properly selected after δ leading digits of the dividend and divisor 
 are known. Now the question arises as to what is the minimum 
 possible value for δ, the index difference for division. The minimum value 
 for δ is desired because this determines the initial delay time before the 
 division algorithm can start producing quotient digits. The δ most significant 
 digits of the dividend and divisor must be available initially. Thus, if 
 one memory word† holds 4 operand digits, $2\lceil \delta/4 \rceil$ memory accesses must be 
 made prior to the generation of the first quotient digit. 

 † Throughout this thesis the term "word" is used to refer to the width of 
 the memory (e.g., 4, 8, or 16 bits for microprocessors). The term "digit" 
 is one radix r digit (e.g., one digit is 2 bits for radix 4). Thus, one 
 word may consist of several digits, while the full precision operands may 
 conceivably consist of several memory words (variable precision) up to 
 some hardware limited maximum. 
 
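 Figure 2.2 Modified Symmetric P-D Plot with r=4, K=2/3, and δ=4 

 Table 2.2 Equations Defining the Selection Regions of Figure 2.2 

 Selection Lines      Selection Equations 

 UPPER(2)             $4\hat{P}_j \le (\tfrac{2}{3} + 2)D - \tfrac{5}{3}\,4^{-3}$ 
 LOWER(2)             $4\hat{P}_j \ge (-\tfrac{2}{3} + 2)D + \tfrac{5}{3}\,4^{-3}$ 

 UPPER(1)             $4\hat{P}_j \le (\tfrac{2}{3} + 1)D - \tfrac{5}{3}\,4^{-3}$ 
 LOWER(1)             $4\hat{P}_j \ge (-\tfrac{2}{3} + 1)D + \tfrac{5}{3}\,4^{-3}$ 

 UPPER(0)             $4\hat{P}_j \le (\tfrac{2}{3})D - \tfrac{5}{3}\,4^{-3}$ 
 LOWER(0)             $4\hat{P}_j \ge (-\tfrac{2}{3})D + \tfrac{5}{3}\,4^{-3}$ 

 UPPER($\bar{1}$)     $4\hat{P}_j \le (\tfrac{2}{3} - 1)D - \tfrac{5}{3}\,4^{-3}$ 
 LOWER($\bar{1}$)     $4\hat{P}_j \ge (-\tfrac{2}{3} - 1)D + \tfrac{5}{3}\,4^{-3}$ 

 UPPER($\bar{2}$)     $4\hat{P}_j \le (\tfrac{2}{3} - 2)D - \tfrac{5}{3}\,4^{-3}$ 
 LOWER($\bar{2}$)     $4\hat{P}_j \ge (-\tfrac{2}{3} - 2)D + \tfrac{5}{3}\,4^{-3}$ 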
 
 The minimum allowable value for δ can be determined by requiring 
 that the lower bound for a $q_{j+1} = i$ selection region and the upper bound for 
 the corresponding $q_{j+1} = i-1$ selection region intersect at a value of 
 $D \le \tfrac{1}{2}$; there must be a nonzero selection overlap for all values of $\tfrac{1}{2} \le D < 1$. 
 Otherwise, there are valid regions on the P-D plot where the quotient digit, 
 $q_{j+1}$, is undefined; that is, the division algorithm could not be completely 
 specified. For example, consider the case of r = 2, K = 1, and δ = 3 as 
 shown in Figure 2.3. The shaded area is a valid region of the plot where 
 the value of the quotient digit is undefined. The worst case occurs when 
 $D = \tfrac{1}{2}$ and i is either ρ or ρ-1, that is, in the selection overlap region between the 
 lower limit of ρ and the upper limit of ρ-1. If this overlap region is 
 non-null, then all of the selection overlap regions are guaranteed to be 
 non-null. The condition 

 $$(-K+\rho)\tfrac{1}{2} + (1+K)r^{-\delta+1} \le (K+\rho-1)\tfrac{1}{2} - (1+K)r^{-\delta+1}$$

 must hold. Since δ is required to be an integer, then 

 $$\delta \ge \Big\lceil \log_r \frac{4(1+K)}{2K-1} \Big\rceil + 1. \qquad (2.13)$$

 So, for the case of r = 2 and K = 1 it is required that δ ≥ 4. Figure 2.4 
 is the P-D plot for the specific case of δ = 4. 
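
 The minimum index difference can also be found by testing the worst case condition directly, which avoids floating point logarithms. In the sketch below the function names are illustrative; `initial_memory_accesses` assumes, as in the example earlier in this section, that both operands must deliver δ digits before the first quotient digit is selected.

```python
from fractions import Fraction

def min_index_difference(r, K):
    """Smallest integer delta with 2(1+K) r^(-delta+1) <= K - 1/2."""
    K = Fraction(K)
    if K <= Fraction(1, 2):
        raise ValueError("K must exceed 1/2 for any overlap to exist")
    delta = 1
    while 2 * (1 + K) * Fraction(r) ** (1 - delta) > K - Fraction(1, 2):
        delta += 1
    return delta

def initial_memory_accesses(delta, digits_per_word):
    """2 * ceil(delta / digits_per_word): one stream of words per operand."""
    return 2 * -(-delta // digits_per_word)
```

 The inequality in the loop is the worst case condition above after the terms in ρ cancel, so the result agrees with (2.13); for r = 2 and K = 1 it returns 4.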
 
 Looking ahead to implementation, given the pin limitations of LSI 
 (Large Scale Integration), reasonable values for the number of bits of each 
 operand input to one arithmetic unit (AU) at the beginning of each cycle are 
 4, 8, or 16. It would be preferable to have δ small enough so that the 
 division algorithm could proceed after only one memory word for each of the 
 two operands had been input (i.e., a delay of only two memory accesses). On 
 the other hand, increasing δ increases the size of the overlap region and, 
 thus, simplifies the quotient digit selection. A compromise must be made. 

 Figure 2.3 P-D Plot with r=2, K=1, and δ=3; selection regions $(-1+i)D + 2 \cdot 2^{-2} \le 2\hat{P}_j \le (1+i)D - 2 \cdot 2^{-2}$ 

 Figure 2.4 P-D Plot with r=2, K=1, and δ=4; selection regions $(-1+i)D + 2 \cdot 2^{-3} \le 2\hat{P}_j \le (1+i)D - 2 \cdot 2^{-3}$ 
 
 In the binary case, r = 2, a digit is one bit, so a convenient choice 
 for δ would be 4, 8, or 16 respectively, depending on the number of bits 
 input to the AU during each cycle. For base 4, r = 4, a digit consists of 
 2 bits, so δ should be 2, 4, or 8 respectively. Similarly for base 16. 
 Some of the above choices must be eliminated because they do not meet the 
 restriction on minimum δ. See Appendix II for some representative P-D 
 plots with δ values as discussed above. 
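
 The elimination of word-aligned choices of δ can be illustrated as follows. This is a sketch under two assumptions that are not stated in the text: the radix is a power of two, so that a digit occupies a whole number of bits, and the minimum δ is the one implied by the worst case overlap condition of Section 2.3.3. The function name `candidate_deltas` is illustrative.

```python
from fractions import Fraction

def candidate_deltas(r, K, word_bits=(4, 8, 16)):
    """Word-aligned delta values that also meet the minimum index difference."""
    K = Fraction(K)
    bits_per_digit = r.bit_length() - 1          # log2(r); r assumed a power of two
    # Minimum delta: smallest integer with 2(1+K) r^(-delta+1) <= K - 1/2
    minimum = 1
    while 2 * (1 + K) * Fraction(r) ** (1 - minimum) > K - Fraction(1, 2):
        minimum += 1
    aligned = [w // bits_per_digit for w in word_bits if w % bits_per_digit == 0]
    return [delta for delta in aligned if delta >= minimum]
```

 With r = 4 and K = 2/3 the aligned choices 2, 4, and 8 reduce to 4 and 8, since δ = 2 falls below the minimum.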
 
 2.3.4 The Model Division 
 Preliminary Remarks 
 
 As stated in Section 2.1, the advantage of using redundant quotient 
 digits is that it eliminates the trial and error nature of division. Using 
 redundant quotient digits permits the selection to be based upon a limited 
 precision model of the operands, thus circumventing the need for a full 
 precision comparison. Sufficient background has now been given to permit 
 a complete definition of the SELECT function of algorithm DIVIDE resulting 
 in a limited precision division model. 
 
 The limited precision model is a device which, when given estimates 
 of the divisor and shifted partial remainder of sufficient precision, will 
 output a quotient digit such that restriction (2.12) is satisfied. Thus the 
 model must be able to select, given only an estimate of the operands, the 
 correct quotient digit values. If the point corresponding to the values of 
 
 $r\hat{P}_j$ and $D_j$ falls in an overlap region of the P-D plot, the model must make 
 a choice between two adjacent quotient digit values. It must take into 
 account the error incurred by the limited precision inputs. While making the 
 selection based upon these inputs, it must guarantee that the quotient digit 
 selected is also valid for the full precision values. 

 The selection procedure can be visualized as a series of steps 
 spanning the overlap regions. By comparing the values of the estimates to 
 these steps, the appropriate quotient digit can be selected. If the values 
 of $r\hat{P}_j$ and $D_j$ correspond to a point lying on or above the step, the larger 
 quotient digit is selected. If the point lies below the step, the 
 smaller quotient digit is selected. Sometimes the step is one simple 
 comparison constant for $r\hat{P}_j$, or "tread," which spans the entire overlap region 
 from $D_j = \tfrac{1}{2}$ to $D_j = 1$. In this case, the quotient digit can be selected 
 based merely upon the value of the shifted partial remainder and is 
 independent of the value of the divisor! But, more often, due to the 
 steepness and narrowness of the overlap region, the step consists of a 
 connected series of "treads" and "risers" which span the region. Here, the 
 risers define the divisor limits for which the corresponding tread 
 comparison constant for $r\hat{P}_j$ is valid. See Appendix II for some typical 
 steps. 
 
 The steps should be chosen such that the simplest and fewest 
 comparisons need be made. These steps, in turn, are dependent upon the 
 precision of the model. The comparison constant values for the steps must 
 be numbers which are representable given the amount of precision required 
 by the model. Thus, before the steps can be defined, sufficient precision 
 must be determined. 
 
 Sufficient Precision 

 Assume that sufficient precision corresponds to the use of α most 
 significant digits of $r\hat{P}_j$ and β most significant digits of $D_j$. Thus, α 
 digits of $r\hat{P}_j$ and β digits of $D_j$ are used as inputs to the limited precision 
 division model. Denote these truncated estimates as $r\tilde{P}_j$ and $\tilde{D}_j$, respectively. 
 Recall that only δ+j digits of each operand are known at Step j. Then, 
 obviously, 

 $$\beta \le \delta.$$

 Denote the maximum error introduced by this truncation into the 
 representation of the operands as Δrp and Δd. Then Δrp and Δd are defined by 

 $$|r\hat{P}_j - r\tilde{P}_j| \le \Delta rp$$

 and 

 $$D_j - \tilde{D}_j \le \Delta d.$$

 Since the β most significant digits of $D_j$ are invariant for iterations 
 β ≤ j ≤ m, the value of $\tilde{D}_j$ is a constant or just 

 $$\tilde{D} \leftarrow \tilde{D}_j$$

 and, for base r, 

 $$\Delta d = r^{-\beta}. \qquad (2.14)$$

 And, since the partial remainder is computed via a limited carry-borrow 
 propagation adder and is, therefore, in redundant notation, 

 $$\Delta rp = K' \cdot r^{-\alpha+1} \qquad (2.15)$$

 where K' is the redundancy coefficient of the adder. 
 
 The conditions for determining the smallest possible α and β can be 
 found by investigating the worst case (steepest and narrowest) overlap 
 region of the P-D plot, that is when 

 $$D = \tfrac{1}{2}$$

 and 

 $$i = \rho.$$

 See Figure 2.5. Sufficient precision of $r\hat{P}_j$ and $D_j$, as represented by their 
 truncated estimates $r\tilde{P}_j$ and $\tilde{D}$, is insured if a selection step can be defined 
 in this worst case region. Then, steps for the rest of the overlap regions 
 can be found and the model division completely specified. 

 Figure 2.5 shows the upper selection limit for ρ-1, UPPER(ρ-1), 

 $$r\hat{P}_j \le (K+\rho-1)D - (1+K)r^{-\delta+1} \qquad (2.16)$$

 and the lower selection limit for ρ, LOWER(ρ), 

 $$r\hat{P}_j \ge (-K+\rho)D + (1+K)r^{-\delta+1} \qquad (2.17)$$

 near $D = \tfrac{1}{2}$. An estimate, $r\tilde{P}_j$, falling in this region and resulting in the 
 selection of the maximum quotient digit, $q_{j+1} = \rho$, must meet the following 
 constraints 

 $$r\tilde{P}_j - \Delta rp \ge (-K+\rho)(\tfrac{1}{2}+\Delta d) + (1+K)r^{-\delta+1} \qquad (2.18)$$

 and 

 $$r\tilde{P}_j \le (K+\rho-1)\tfrac{1}{2} - (1+K)r^{-\delta+1} \qquad (2.19)$$

 for the selection to also be valid for the full precision operand, $r\hat{P}_j$. 
 
 Figure 2.5 P-D Plot Showing the Worst Case Overlap Region (shifted partial remainder versus D near D = 0.5; lines UPPER(ρ-1), LOWER(ρ), and LL(ρ); the limits of equations (2.18) and (2.19); the estimate errors Δrp and Δd; marked values rP MAX, rP MIN, D MIN, and D MAX) 
 
 Thus, the dotted line, LL(ρ), in Figure 2.5 defines an absolute 
 lower limit for a tread value, $r\tilde{P}_j$, which would result in a quotient digit 
 selection of $q_{j+1} = \rho$. A similar lower limit exists for each overlap region 
 and will be denoted as LL(i). The upper limit on the possible tread values 
 is obviously the upper limit of the corresponding overlap region. Thus, 
 the treads (and, hence, the risers) of each staircase must be fully 
 contained between the lines LL(i) and UPPER(i-1) in each appropriate overlap 
 region. These treads and risers should assume the simplest possible binary 
 values which conform to the limits and the precision requirements. 

 The minimum values of α and β can now be empirically defined. 
 Subtracting equation (2.18) from (2.19) gives 

 $$\Delta rp \le K - \tfrac{1}{2} + (K-\rho)\Delta d - 2(1+K)r^{-\delta+1}.$$

 Substituting the values of Δrp and Δd, (2.14) and (2.15), into the above 
 equation gives 

 $$K' r^{-\alpha+1} + K(r-2)r^{-\beta} \le K - \tfrac{1}{2} - 2(1+K)r^{-\delta+1}. \qquad (2.20)$$

 For a given base (r), redundancy coefficient of the quotient (K), 
 index difference (δ), and redundancy coefficient of the adder (K'), 
 interdependent α and β values can be defined using equation (2.20). Recall 
 that α represents the number of digits in $r\hat{P}_j$ which are redundant. Thus, an 
 attempt should be made to minimize α even at the cost of increasing β to 
 the maximum allowed value of δ. The flowchart of the program used to 
 determine near minimal values for α and β is given in Figure 2.6. 
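
 Condition (2.20) alone already narrows the choice of α and β. The sketch below is not the program of Figure 2.6, which also tests whether a complete step can be drawn; it only applies (2.20) as a necessary condition, minimizing α first with β at its maximum value δ and then reducing β as far as (2.20) allows. The function names and the search cap `max_alpha` are assumptions for illustration.

```python
from fractions import Fraction

def satisfies_2_20(alpha, beta, r, K, K_adder, delta):
    """Check K' r^(-alpha+1) + K (r-2) r^(-beta) <= K - 1/2 - 2(1+K) r^(-delta+1)."""
    r = Fraction(r)
    lhs = K_adder * r ** (1 - alpha) + K * (r - 2) * r ** (-beta)
    rhs = K - Fraction(1, 2) - 2 * (1 + K) * r ** (1 - delta)
    return lhs <= rhs

def near_minimal_precision(r, K, K_adder, delta, max_alpha=32):
    """Smallest alpha (with beta = delta), then the smallest beta for that alpha."""
    for alpha in range(1, max_alpha + 1):
        if satisfies_2_20(alpha, delta, r, K, K_adder, delta):
            beta = next(b for b in range(1, delta + 1)
                        if satisfies_2_20(alpha, b, r, K, K_adder, delta))
            return alpha, beta
    return None                                   # no alpha up to max_alpha works
```

 For r = 2, K = 1, and δ = 4 the right side of (2.20) is zero, so no α and β can satisfy it and the function returns `None`; a larger δ is needed before a step can be defined from truncated estimates.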
 
 Figure 2.6 Flowchart for Determining α and β (the search starts from β = δ and α = β, the maximum possible value; each pass tests whether a complete step can be defined using this α and β given the limits of LL(i) and UPPER(i-1), adjusts α or β by one, counts failures in FAIL, and either returns or reports that the search for α and β fails) 
 
 Definition of the Steps 

 Once "good" minimum values for α and β have been chosen for a given 
 set of constants (r, K, δ, and K'), steps can be defined in each overlap 
 region. The value of α limits the maximum precision allowed in the 
 specification of the treads and β the maximum precision allowed in the specification of 
 the risers. Recall that the treads and risers must conform to the upper and 
 lower limits, UPPER(i-1) and LL(i), in each overlap region. 

 Think of the overlap region between $q_{j+1} = i$ and $q_{j+1} = i-1$ as a 
 grid of vertical spacings, Δd, and horizontal spacings, Δrp. The set of all 
 boundaries in this overlap region is all stairsteps which can be drawn along 
 these grids while remaining inside the upper and lower limits. See Figure 2.7. 
 As Δd and Δrp are decreased (i.e., α and β are increased) the number of 
 different possible boundaries increases exponentially. The boundary 
 stairstep which results in the simplest and fewest comparisons for the 
 selection of $q_{j+1}$ should be chosen. Some possible choices are shown in 
 Figure 2.8. These comparison constants are then used to define the quotient 
 digit selection function, SELECT, of algorithm DIVIDE. 

 The flowchart for the program used to choose a "good" stairstep in 
 each overlap region is given in Figure 2.8. Little or no attempt was made 
 to minimize the comparison constants of one overlap region in relation to 
 another. See the work of Atkins [ATK67, ATK68, ATK70] for a more detailed 
 analysis. 

 Appendix II contains some sample P-D plots with stairsteps and 
 examples of the algorithm DIVIDE corresponding to them. The steps were 
 chosen according to the algorithm defined in the flowchart of Figure 2.8. 
 
 Figure 2.7 Selecting a "Good" Staircase (overlap region near D = 0.5 bounded by UPPER(i-1) and LL(i), with LOWER(i) below; grid spacings Δrp and Δd) 

 Figure 2.8 Step Definition Flowchart (procedure STEPS: start with NRISER = 1 and D(NRISER) = 0.5; increment NTREAD; find the largest tread at this riser, RP(NTREAD) = ⌊UPPER(i-1) @ D(NRISER)/Δrp⌋·Δrp; advance NRISER with D(NRISER) = D(NRISER-1) and move the riser by D(NRISER) = D(NRISER)+Δd or D(NRISER) = D(NRISER)-Δd, limited to D(NRISER) = 1.0; exits: "STEPS SUCCEED, PLOT STEPS" and "FAIL DUE TO Δd TOO LARGE") 
 
 The binary values of the steps (i.e., the comparison constants) are given in 
 the righthand ($r\tilde{P}_j$) and top ($\tilde{D}$) margins of the plots shown in Appendix II. 
 
 2.4 Valid Operand Ranges 
 
 By investigating the P-D plots, it becomes obvious that the initial 
 operands must be restricted in the range of values they can assume. To 
 insure the convergence of algorithm DIVIDE, equation (2.5) must be satisfied 
 for j = 0. This may require an initial preshifting of the dividend. As 
 stated in Section 2.2, both operands are assumed to be in normalized form 
 upon input to the arithmetic unit; i.e., 
 
 $$\tfrac{1}{2} \le D, N < 1.$$
 
 As seen by looking at the plots, only D is required to be in normalized form 
 for the division algorithm to be defined. The allowable range for N must 
 now be determined. 
 
 When the δ most significant digits of the dividend are shifted to 
 become the first shifted partial remainder, $r\hat{P}_0$, prior to quotient digit 
 selection, it is conceivable that this shifting results in an $r\hat{P}_0$ which is 
 out of range. That is, it corresponds to a point on the P-D plot for which 
 no quotient digit value is defined. In other words, if 

 $$r\hat{P}_0 > (K+\rho)D - (1+K)r^{-\delta+1}$$

 then $r\hat{P}_0$ is out of range and must be scaled before division can proceed. 
 For $r\hat{P}_0$ to be valid it must conform to the bounds 

 $$(-K-\rho)D + (1+K)r^{-\delta+1} \le r\hat{P}_0 \le (K+\rho)D - (1+K)r^{-\delta+1}.$$

 Looking at the worst case values on the upper bound (D = 1/2) and assuming minimal 
 redundancy (ρ = r/2, so that K = r/(2(r-1))), this implies that 

 $$r\hat{P}_0 \le \Big(\frac{r}{2(r-1)} + \frac{r}{2}\Big)\frac{1}{2} - \Big(1 + \frac{r}{2(r-1)}\Big) r^{-\delta+1}$$

 or just 

 $$r\hat{P}_0 \le \frac{r^2}{4(r-1)} - \frac{3r-2}{2(r-1)}\, r^{-\delta+1}.$$

 If the term involving δ is assumed to be negligible with respect to the $r^2$ term, 
 then $\hat{P}_0$ must fall below the value 

 $$\frac{1}{4} < \frac{r}{4(r-1)} \le \frac{1}{2}, \qquad r = 2, 3, \ldots$$

 Since $\hat{P}_0$ is input in normalized form, shifting $\hat{P}_0$ one bit to the right will 
 guarantee that $r\hat{P}_0$ will be within the allowable range limits. A correction 
 on the quotient which consists of shifting R one bit to the left must then 
 be made. 
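
 The worst case limit on the first shifted partial remainder can be evaluated from the general selection bound and compared with the closed form for minimal redundancy. The sketch below assumes ρ = r/2 for minimal redundancy and K = ρ/(r-1); the function names are illustrative.

```python
from fractions import Fraction

def shifted_dividend_limit(r, delta, rho=None):
    """Upper limit on r*P_0 at D = 1/2 from the bound for the largest digit rho."""
    r = Fraction(r)
    rho = r / 2 if rho is None else Fraction(rho)   # minimal redundancy assumed
    K = rho / (r - 1)
    return (K + rho) / 2 - (1 + K) * r ** (1 - delta)

def closed_form_limit(r, delta):
    """r^2 / (4(r-1)) - (3r-2) / (2(r-1)) * r^(-delta+1)."""
    r = Fraction(r)
    return r ** 2 / (4 * (r - 1)) - (3 * r - 2) / (2 * (r - 1)) * r ** (1 - delta)
```

 Dividing either result by r gives the limit on the unshifted value; dropping the δ term leaves r/(4(r-1)), which is 1/2 for r = 2 and approaches 1/4 as the radix grows, so a normalized dividend close to 1 always needs the one bit preshift.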
 
 2.5 Hardware Block Diagram 
 
 This chapter has completely defined the algorithm DIVIDE. Specifically, 
 the requirements on the index difference, δ, and the quotient digit selection 
 function, SELECT, have been given. In this section a variable radix block 
 diagram implementing the DIVIDE algorithm will be examined. At this point, 
 only the major components of the arithmetic unit (AU) for DIVIDE will be 
 discussed. Lower level details for actual implementation will be developed 
 in Chapter 5. 
 
 Figure 2.9 is a block diagram of the AU for performing division. 
 It is so structured as to be compatible with multiplication, addition, and 
 subtraction with only minor modifications. The major component of the AU is 
 a full width† multi-input limited carry-borrow propagation adder. The 
 adder is discussed in detail in Chapter 5. In many practical applications 
 the number of inputs to the adder is rather small. 

 Figure 2.9 Block Diagram for Division (blocks: redundant quotient register, $R_{j+1} \leftarrow R_j + q_{j+1} r^{-j-1}$; quotient digit selector; multi-input redundant adder; selection networks; divisor register, $D_j \leftarrow D_{j-1} + d_{j+\delta} r^{-j-\delta}$; dividend register, $\hat{P}_0 \leftarrow \sum_{i=1}^{\delta} n_i r^{-i}$) 
 
 The quotient digit selector is a table look up device which implements 
 the SELECT function. It examines the α most significant digits of $r\hat{P}_j$, $r\tilde{P}_j$, and 
 the β most significant digits of $D_j$, $\tilde{D}$, in order to select the appropriate 
 quotient digit, $q_{j+1}$. 

 The rest of the unit consists mainly of registers and selectors. Two 
 full width double bank registers are required for the storage of the quotient, 
 R, and the partial remainder, $\hat{P}_j$, because they are in redundant form. The 
 selection network must be capable of forming the required multiples of $D_j$ and 
 $R_{j-1}$. A carry generator would be needed if, for example, a radix complement 
 representation of negative numbers is used. In that case, the selection 
 network must also be able to form the complement of the possible multiples 
 of $D_j$. 

 The complexity of the selection network increases for higher radices, 
 and since the additional multiples appear as inputs to the adder, the 
 complexity of the adder would also be increased. Thus, a higher radix, while 
 reducing the number of steps per cycle, does increase the complexity of the 
 arithmetic unit. Chapter 5 addresses the problem of finding an optimal radix 
 while considering both the complexity of the adder and selection network and 
 the complexity of the result digit selector across all of the operations 
 (/, *, +, and -). 
 
 † The term "full width" implies that the adder can process a full precision 
 operand (i.e., several memory words) during one cycle and the registers can 
 store one full precision operand each. Thus, the adder and register widths 
 set a hardware upper limit on the maximum allowed precision. 
 
 3. MULTIPLICATION 
 
 Once the on-line algorithm for division has been specified, com- 
 patible algorithms for multiplication, addition, and subtraction can be 
 defined. This chapter presents an on-line multiplication algorithm which 
 has its roots in work done by Ercegovac [ERC75, TRI75]. It combines the 
 well-known technique of incremented multiplication, as used in digital 
 differential analyzers [BRA63, CAM70, BAK75], with the use of redundant 
 number systems. The algorithm is so structured as to be compatible with 
 the division algorithm specified in Chapter 2. 
 
 3.1 Background 
 
 In multiplication, a product is accumulated by the successive 
 addition of multiples of the multiplicand to a partial product. Unlike 
 division, the selection of which multiple to add is dependent upon a known 
 quantity (i.e., a digit of the multiplier). Thus, multiplication can be 
 defined by the recursion relationship† 

 $$P_j \leftarrow rP_{j-1} + y_j X, \qquad j = 1, \ldots, m \qquad (3.1)$$

 in which 

     $P_0$ is zero, 
     $P_j$ is the partial product used in the j-th recursion, 
     $P_m$ is the product, 
     j is the recursion index, 
     $y_j$ is the j-th multiplier digit, 
     X is the multiplicand, and 
     r is the radix. 

 † Recall that operations proceed from most-to-least significant digit as 
 required for division. 

 To form the new partial product, $P_j$, a multiple of the multiplicand 
 is added to the previous shifted partial product. Exactly which multiple 
 to add is dependent upon a known multiplier digit. Thus, many of the 
 problems encountered in division are not present in the design of the 
 multiplication algorithm. 
 
 In converting multiplication to an on-line process, two complications 
 arise. First, the recursion relationship must be restructured to 
 take into account the on-line nature of the operands. During the j-th 
 recursion, only those operand digits which have been received prior to the 
 j-th iteration can be included in the calculation. 
 
 Secondly, if a nonredundant number system is used in representing 
 the partial product, the digits of the desired product appear in a right-to- 
 left (least-to-most significant) fashion, as determined by the conventional 
 carry propagation requirements. If, however, redundancy is used in the 
 representation of the product, the desired on-line, most-to-least significant 
 generation of the product digits can be provided. Here, again, the redundant 
 product must eventually be converted to a conventional representation. See 
 Appendix III for a description of the on-line recoding algorithm. 
 
 This chapter will develop in detail the methods used to alleviate 
 these complications and specify a compatible on-line multiplication algorithm 
 and the conditions on it. 
 
 
 3.2 The On-line Algorithm 
 
 Let the radix r representation of the fractional part of the positive 
 multiplicand, multiplier, and product be denoted by X, Y, and R respectively, 
 such that 

 $$X = \sum_{i=1}^{m} x_i r^{-i}, \qquad Y = \sum_{i=1}^{m} y_i r^{-i}, \qquad R = \sum_{i=1}^{m} p_i r^{-i},$$

 and R = X · Y to m digit precision. 

 Recall that in an on-line environment the digits of the multiplicand 
 and multiplier are not known in advance, but are available on-line, digit-by-digit, 
 most significant digit first. The operand digits, $x_i$ and $y_i$, are 
 typically members of a conventional, nonredundant digit set such that 

 $$x_i, y_i \in \{0, 1, 2, \ldots, r-1\}.$$

 It is assumed that the multiplicand and multiplier are in normalized form; 
 i.e., 

 $$\tfrac{1}{2} \le X, Y < 1.$$

 Define the j-digit representation of the on-line operands X and Y as 

 $$X_j = \sum_{i=1}^{j} x_i r^{-i} = X_{j-1} + x_j r^{-j}$$

 where $X_0 = 0$, and 

 $$Y_j = \sum_{i=1}^{j} y_i r^{-i} = Y_{j-1} + y_j r^{-j}$$

 where $Y_0 = 0$. The corresponding partial product is, then, 

 $$X_j Y_j \leftarrow X_{j-1} Y_{j-1} + (X_{j-1} y_j + x_j y_j r^{-j} + Y_{j-1} x_j) r^{-j}$$

 which can be rewritten as 

 $$X_j Y_j \leftarrow X_{j-1} Y_{j-1} + (X_j y_j + Y_{j-1} x_j) r^{-j}.$$

 Therefore, if $P_j$ is the scaled partial product, 

 $$P_j = X_j Y_j r^{j} \qquad (3.2)$$

 a recursion relationship for multiplication which takes into account 
 the on-line nature of the operands can be expressed as 

 $$P_j \leftarrow rP_{j-1} + X_j y_j + Y_{j-1} x_j. \qquad (3.3)$$

 But, this relationship does not, as it stands, generate the result digits in 
 an on-line fashion. The product digits would become available from the 
 least-to-most significant end of $P_m$, as determined by the traditional carry 
 propagation requirements. 
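
 Recursion (3.3) and the scaling (3.2) can be checked against each other with exact arithmetic. The sketch below (the function name is illustrative) consumes one digit of each operand per step, most significant digit first, and records the scaled partial product together with the operand prefixes seen so far.

```python
from fractions import Fraction

def scaled_partial_products(x, y, r):
    """Apply recursion (3.3); returns a list of (P_j, X_j, Y_j) for j = 1..m."""
    P = X = Y = Fraction(0)
    history = []
    for j, (x_j, y_j) in enumerate(zip(x, y), start=1):
        Y_prev = Y                                # Y_{j-1}
        X += Fraction(x_j, r ** j)                # X_j = X_{j-1} + x_j r^(-j)
        Y += Fraction(y_j, r ** j)                # Y_j = Y_{j-1} + y_j r^(-j)
        P = r * P + X * y_j + Y_prev * x_j        # recursion (3.3)
        history.append((P, X, Y))
    return history
```

 At every step the recorded P_j equals X_j · Y_j · r^j, which is relation (3.2); the recursion only ever uses digits that have already arrived.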
 
 Assume that the product digits, $p_j$, can be selected on-line using a 
 recursion relationship similar to (3.3). One new digit of the product could 
 then be determined upon the receipt of one new digit each of the multiplicand 
 and multiplier. In multiplication the index difference, δ, is identically 1. 
 That is, only one digit of each of the operands is needed initially to 
 select the first result digit. Let the product digits, $p_j$, be members of 
 the same symmetric redundant digit set as defined for the quotient 
 digits in Chapter 2. 

 The partial product is computed via the same limited carry-borrow 
 propagation adder used to generate the partial remainder during division. 
 Thus, the digits of the partial product are members of the adder's redundant 
 digit set. 

 Given these definitions, the algorithm MULT [TRI75], which is shown 
 below, can be specified. In this algorithm, recursion (3.3), 
 which allows for on-line processing of operands, has been altered to form 
 the basic recursion, (3.4), providing on-line generation of result digits. 
 The selection procedure for result digits, as shown in the next section, 
 corresponds to a simple rounding on $\hat{P}_j$. Thus, the basic recursion includes 
 a "correction" on $\hat{P}_{j-1}$ to take into account the previous selection of $p_{j-1}$. 
 
 Algorithm MULT 

 Step 1 [Initialization]: 
     $\hat{P}_0 \leftarrow 0$; $X_0 \leftarrow 0$; $Y_0 \leftarrow 0$; 
     $R_0 \leftarrow 0$; $p_0 \leftarrow 0$; 
     $j \leftarrow 1$; 

 Step 2 [Input Digit]: 
     $X_j \leftarrow X_{j-1} + x_j r^{-j}$; 
     $Y_j \leftarrow Y_{j-1} + y_j r^{-j}$; 

 Step 3 [Basic Recursion]: 
     $\hat{P}_j \leftarrow r(\hat{P}_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j$;    (3.4) 

 Step 4 [Selection]: 
     $p_j \leftarrow \mathrm{SELECT}(\hat{P}_j)$; 
     $R_j \leftarrow R_{j-1} + p_j r^{-j}$; 

 Step 5 [Test]: 
     IF (j < m) 
         THEN j ← j + 1; 
              GO TO Step 2; 
         ELSE END MULT; 

 From equation (3.3) and the basic recursion equation, (3.4), in 
 algorithm MULT, the following relation can be obtained by induction. 

 $$\hat{P}_j = P_j - \sum_{i=1}^{j-1} p_i r^{j-i}.$$

 This implies that, as j → m, 

 $$\hat{P}_m = P_m - \sum_{i=1}^{m-1} p_i r^{m-i}$$

 or rearranging 

 $$P_m = \hat{P}_m + r^{m} \sum_{i=1}^{m-1} p_i r^{-i}. \qquad (3.5)$$

 Since, by equation (3.2), 

 $$P_m = X_m \cdot Y_m \cdot r^{m} = X \cdot Y \cdot r^{m},$$

 equation (3.5) can be rewritten as 

 $$X \cdot Y \cdot r^{m} = \hat{P}_m + r^{m} \sum_{i=1}^{m-1} p_i r^{-i}$$

 which is just 

 $$X \cdot Y = \sum_{i=1}^{m} p_i r^{-i} + (\hat{P}_m - p_m) r^{-m},$$

 so that 

 $$R = \sum_{i=1}^{m} p_i r^{-i} = X \cdot Y - (\hat{P}_m - p_m) r^{-m}.$$
 
48 
 
 By devising a product digit selection procedure, SELECT, in Step 4
 of algorithm MULT, such that

 $$|P_m - p_m| \le K \tag{3.6}$$

 where the redundancy coefficient, K, satisfies

 $$\tfrac{1}{2} < K \le 1,$$

 then R = X · Y can be computed to m digit precision. In the next section
 such a selection procedure will be derived so as to guarantee convergence of
 the algorithm.
 
 The algorithm as it stands produces just the most significant half
 of the product. The least significant half of the product is available as
 the redundant output of the adder after iteration m + 1; i.e.,

 $$P_{m+1} = r(P_m - p_m).$$

 By feeding these redundant adder digits directly into the recoding unit, the
 least significant half of the product can also be output in conventional
 form.
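 The steps of algorithm MULT can be traced with a short Python sketch. It
 is only an illustration under stated assumptions: the names `select` and
 `online_mult` are hypothetical, exact rational arithmetic stands in for the
 redundant adder, and the selection rounds the full-precision partial
 product (the rounding rule of Section 3.3.1) rather than a truncated
 estimate.

```python
from fractions import Fraction
from math import floor

def select(P, rho):
    """Round P to the nearest digit; beyond the largest digit rho, truncate instead."""
    mag = abs(P)
    digit = floor(mag + Fraction(1, 2)) if mag <= rho else floor(mag)
    return digit if P >= 0 else -digit

def online_mult(x_digits, y_digits, r, rho):
    """Trace algorithm MULT on operand digits given most significant first.

    Returns the product digits p_1..p_m, the accumulated result R_m, and the
    final residual P_m - p_m.
    """
    X = Y = R = P = Fraction(0)
    p = 0
    digits = []
    for j, (x, y) in enumerate(zip(x_digits, y_digits), start=1):
        weight = Fraction(1, r ** j)
        Y_prev = Y
        X += x * weight                      # Step 2: X_j
        Y += y * weight                      # Step 2: Y_j
        P = r * (P - p) + X * y + Y_prev * x  # Step 3: basic recursion (3.4)
        p = select(P, rho)                   # Step 4: digit selection
        R += p * weight
        digits.append(p)
    return digits, R, P - p

# Operands of the radix-2 example: X = 0.01101001, Y = 0.01110011
digits, R, residual = online_mult([0, 1, 1, 0, 1, 0, 0, 1],
                                  [0, 1, 1, 1, 0, 0, 1, 1], r=2, rho=1)
print(digits, R, residual)
```

 For these operands the sketch satisfies the identity derived above,
 $R + (P_m - p_m) r^{-m} = X \cdot Y$, exactly.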
 
 3.3 Product Digit Selection 
 
 In order to implement the algorithm MULT, a product digit selection
 procedure, SELECT of Step 4, must be devised such that restriction (3.6) is
 satisfied. For j = 1, the range restriction can be satisfied by appropriately
 preshifting the operands as explained in Section 3.4. Assuming that
 there is a selection procedure that generates the product digit $p_j$ so as
 to guarantee that

 $$|P_j - p_j| \le K$$

 given that

 $$|P_{j-1} - p_{j-1}| \le K,$$

 then by induction the range restriction (3.6), assuming that the operands
 conform to certain bounds as derived in Section 3.4, will hold for all
 values of j. Obviously, by performing a "simple" rounding on $P_j$ and using
 the integer part of this rounded $P_j$ to represent the product digit, $p_j$,

 $$|P_j - p_j| \le K.$$

 The remainder of this section will fully specify an implementable rounding
 procedure for the selection of the product digits.
 
 3.3.1 The Selection Function 
 
 Define the product digit selection function [TRI75] to be

 $$p_j \leftarrow \mathrm{SELECT}(P_j) = \begin{cases} \operatorname{sign}(P_j) \cdot \lfloor |P_j| + \tfrac{1}{2} \rfloor & \text{for } |P_j| \le \rho, \\ \operatorname{sign}(P_j) \cdot \lfloor |P_j| \rfloor & \text{otherwise.} \end{cases} \tag{3.7}$$

 This represents, for all practical purposes, a rounding procedure which has
 been modified at the end points of the domain to avoid product digit values
 greater than $\rho$. Thus, the selection process itself can be carried out in
 a deterministic fashion; that is, the product digit selected by the procedure
 is simply the integer part of the rounded partial product, $P_j$.
 
 However, the partial product is in redundant form. Thus, a "simple"
 rounding would require a full precision carry propagation of $P_j$ in order to
 determine its magnitude. This must be avoided if at all possible. An
 ideal selection procedure is one in which the time necessary to perform
 the selection is independent of the precision of the operands
 (i.e., step invariant).

 By devising a graphical representation for the selection process,
 the problem can be more easily understood and solved. Figure 3.1 is a plot
 of the partial product, $P_j$, versus the remaining partial product after
 rounding, $P_j - p_j$, and will be designated a P-p plot. By analyzing such a
 plot, a product digit selection procedure based upon a limited number of
 leading digits of $P_j$ can be fully specified for a given r and $\rho$, resulting
 in a precision independent selection.
 
 In Figure 3.1 each line corresponding to a different product digit
 value will be called a "p-line." The bound on the remaining partial product
 after rounding,

 $$|P_j - p_j| \le K,$$

 determines the maximum allowable value for the partial product†, $P_j$; that is,

 $$|P_j| \le K + \rho = rK$$

 in order for a product digit to be properly selected. Thus, the maximum
 value for $P_j$ is rK which occurs at the intersection of $(P_j - p_j) = K$ and the
 p-line, $p_j = \rho$. Similarly, the minimum value, -rK, must occur at the
 intersection of $(P_j - p_j) = -K$ and the p-line, $p_j = -\rho$. These bounds are
 indicated by the dashed vertical lines on Figure 3.1.

 † The range for the partial product, $P_j$, in multiplication
 ($-K - \rho \le P_j \le K + \rho$) is approximately comparable in magnitude to the
 range of the shifted partial remainder, $rP_j$, in division
 ($(-K - \rho)D + (1 + K)r^{-\delta+1} \le rP_j \le (K + \rho)D - (1 + K)r^{-\delta+1}$).
 
 [Figure 3.1 P-p Plot: partial product $P_j$ versus remaining partial product $P_j - p_j$]
 
 By investigating the basic recursion, equation (3.4), of algorithm
 MULT it becomes obvious that certain bounds must also be placed upon X and Y
 in order for the new partial product, $P_{j+1}$, to be within the required limits
 of $\pm(K + \rho)$. These operand range restrictions and their effects on the
 selection function are covered in Section 3.4.

 As in division, redundancy in the representation of the product
 permits the selection of product digits to be based upon a limited number
 of leading digits of $P_j$. This is manifested in Figure 3.1 by the overlap
 region, $\Delta$, for which either of two p-lines may be legitimately selected.
 For example, at point A one may move vertically upward to the p-line,
 $p_j = 0$, or downward to the p-line, $p_j = 1$. In either case the product
 digit is correct.
 
 The defined selection function (3.7) implies that the comparison
 constants for multiplication are simple, low precision numbers (i.e.,
 $\pm\tfrac{1}{2}, \pm 1\tfrac{1}{2}, \pm 2\tfrac{1}{2}, \ldots, \pm(\rho - \tfrac{1}{2})$). Then the product digit, $p_j$, can be defined
 according to

 $$p_j = \begin{cases} i - 1 & (i-1) - \tfrac{1}{2} \le P_j < (i-1) + \tfrac{1}{2} \\ i & i - \tfrac{1}{2} \le P_j < i + \tfrac{1}{2}. \end{cases}$$

 Denote the lower and upper endpoints of the $i$-th p-line as $L_i$ and $U_i$
 respectively, as shown in Figure 3.2. The overlap, $\Delta$, becomes

 $$\Delta = U_{i-1} - L_i = (i - 1 + K) - (i - K)$$

 for $-\rho \le i \le \rho$. From this plot the relationship between the overlap and the
 redundancy coefficient is seen to be

 $$\Delta = 2K - 1.$$

 [Figure 3.2 The Overlap Region of the P-p Plot]
 
 As in the case of division, the width of the overlap region is
 proportional to the amount of redundancy in the representation of the result.
 As the redundancy is increased, the width of the overlap, $\Delta$, is increased
 and the required precision of inspection of the partial product, $P_j$, for
 selection is decreased. To fully define the algorithm MULT, sufficient
 precision for selection of $p_j$ must now be determined.
 
 3.3.2 Sufficient Precision 
 
 Assume that sufficient precision for the selection of product digits
 corresponds to the use of $\gamma$ most significant digits of the partial product
 $P_j$. Denote this truncated estimate to the partial product as $\hat{P}_j$. Unlike
 in division, multiplication can proceed after only one digit of each operand
 has been input, regardless of the value of $\gamma$. Recall that selection of the
 first product digit, $p_1$, is based upon

 $$P_1 \leftarrow X_1 y_1$$

 which must be guaranteed to be less than or equal to $(K + \rho)$ for the selection
 to be valid. Since the selection is based only upon $\gamma$ most significant
 digits of a computed value for $P_1$, selection can take place even though only
 one digit of each operand has been input.

 Denote the maximum error introduced by this truncation into the
 representation of $P_j$ as $\Delta p$. Then $\Delta p$ is defined by

 $$|P_j - \hat{P}_j| \le \Delta p$$

 and

 $$\Delta p \le K' r^{-\gamma+1} \tag{3.8}$$

 where K' is the redundancy coefficient of the adder.

 The minimum allowable value for $\gamma$ can be determined by looking at
 the overlap region, $\Delta$. See Figure 3.3. The precision implied by $\gamma$ must be
 large enough to guarantee that the error incurred by truncating the least
 significant portion of $P_j$ during the selection of $p_j$ does not result in an
 incorrect product digit selection; that is,

 $$\Delta p \le \frac{\Delta}{2}. \tag{3.9}$$

 This condition is depicted in Figure 3.3. The smallest $P_j$ falling in the
 overlap region and resulting in a product digit selection of $p_j = i$
 (i.e., $P_j = i - \tfrac{1}{2}$) must conform to the bound

 $$|P_j - \hat{P}_j| \le \frac{\Delta}{2}.$$

 The minimum value of $\gamma$ can now be determined. From equation (3.8)
 and (3.9), it is seen that

 $$K' r^{-\gamma+1} \le \frac{\Delta}{2} = \frac{2K-1}{2}.$$

 The lower bound on $\gamma$ becomes

 $$\gamma \ge 1 + \left\lceil \log_r \frac{2K'}{2K-1} \right\rceil. \tag{3.10}$$

 In Table 3.1 the minimum values of $\gamma$ (the number of redundant digits of $P_j$)
 and $\gamma'$ (the number of nonredundant bits derived from the $\gamma$ most significant
 digits of $P_j$ on which a carry propagation has been performed) are shown as
 a function of the base (r), product redundancy coefficient (K), and adder
 redundancy coefficient (K').
 
 [Figure 3.3 P-p Plot Showing the Worst Case Error]
 
 Table 3.1 Minimum $\gamma$ Values

 $$\gamma = 1 + \left\lceil \log_r \frac{2K'}{2K-1} \right\rceil$$

 BASE    K          K'         γ DIGITS    γ' BITS
 2       1          1          2           2
 4       2/3        2/3        2           4
 4       1          1          2           3
 8       4/7        4/7        2           6
 8       1          1          2           4
 16      8/15       8/15       2           8
 16      1          1          2           5
 256     128/255    128/255    2           16
 256     1          1          2           9
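 The $\gamma$ column of Table 3.1 can be recomputed from bound (3.10). The
 sketch below is an illustration only: the function name `min_gamma` is
 hypothetical, and the ceiling of the logarithm is found by an exact integer
 search instead of a floating point `log`, which assumes the ratio
 $2K'/(2K-1)$ is at least 1 (true whenever $K' > 1/2$ and $K \le 1$).

```python
from fractions import Fraction

def min_gamma(r, K, K_adder):
    """Smallest gamma with K' * r**(1 - gamma) <= (2K - 1) / 2, per (3.8)-(3.10)."""
    ratio = 2 * Fraction(K_adder) / (2 * Fraction(K) - 1)
    n = 0
    # Find the smallest n with r**n >= ratio, i.e. n = ceil(log_r(ratio)).
    while Fraction(r) ** n < ratio:
        n += 1
    return 1 + n

for r, K in [(2, 1), (4, Fraction(2, 3)), (4, 1), (8, Fraction(4, 7)),
             (16, Fraction(8, 15)), (256, Fraction(128, 255))]:
    print(r, K, min_gamma(r, K, K))
```

 In every tabulated case two redundant digits suffice, because the ratio
 $2K'/(2K-1)$ never exceeds the radix.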
 
 3.4 Valid Operand Ranges 
 
 As stated in Section 3.3.1, the range of the initial operands must
 be restricted to insure the convergence of the algorithm MULT. The operands
 may have to be preshifted initially to conform to the specified convergence
 bounds. Recall that both operands are assumed to be in normalized form
 upon input to the arithmetic unit; i.e.,

 $$\tfrac{1}{2} \le X, Y < 1.$$

 To be able to select the first product digit, $p_1$, the first partial
 product, $P_1$, must conform to the bound

 $$P_1 \le K + \rho.$$

 From the basic recursion, (3.4), the bound on $X_1$ is seen to be

 $$X_1 y_1 \le K + \rho = rK.$$

 Assuming worst case values for $y_1$ and K, this implies that

 $$X_1 \le \frac{1}{2} \cdot \frac{r}{r-1}$$

 and, since

 $$\lim_{r \to \infty} \frac{1}{2} \cdot \frac{r}{r-1} = \frac{1}{2},$$

 shifting X one bit to the right will guarantee that $P_1$ will be within the
 allowed range limits for the selection of $p_1$.
 
 To guarantee convergence of the algorithm MULT over all j, it is
 obvious that certain bounds must also be placed upon X and Y in order for
 each new partial product to be within the required limits of $\pm(K + \rho)$. To
 insure convergence of the algorithm, given that

 $$|P_{j-1} - p_{j-1}| \le K,$$

 the upper bound on the values of the operands must be small enough to
 guarantee that

 $$|P_j| \le rK$$

 to insure valid selection of $p_j$. From the basic recursion of the algorithm
 MULT, equation (3.4),

 $$P_j \leftarrow r(P_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j,$$

 worst case values imply that

 $$P_j \le r(P_{j-1} - p_{j-1}) + X(r-1) + Y(r-1),$$

 or just

 $$P_j \le r(P_{j-1} - p_{j-1}) + (X + Y)(r-1).$$

 Let the upper bound on both operands be represented by M. The
 largest possible value for M, resulting in the fewest possible preshifting
 steps (scaling) on the initial operands, should be found. By using the
 upper bound M, the above equation becomes

 $$P_j \le r(P_{j-1} - p_{j-1}) + 2M(r-1).$$

 Replacing $P_j$ by its upper bound for selection, this becomes

 $$r(P_{j-1} - p_{j-1}) + 2M(r-1) \le rK$$

 or, rearranging

 $$M \le \frac{r\,(K - (P_{j-1} - p_{j-1}))}{2(r-1)}. \tag{3.11}$$
 If the smallest possible number of leading digits of $P_{j-1}$ was used for the
 selection of $p_{j-1}$, then the upper bound

 $$(P_{j-1} - p_{j-1}) \le K$$

 must be used in equation (3.11) and the bound on the operands goes to zero!
 But, by inspecting more digits than the absolute minimum for selection, the
 bound on $(P_{j-1} - p_{j-1})$ can be tightened considerably. A full precision
 inspection results in the tightest possible upper bound of $\tfrac{1}{2}$. Thus, for
 full precision inspection the bound, M, is

 $$M \le \frac{r(2K-1)}{4(r-1)}.$$

 Assuming maximum redundancy (K=1),

 $$\lim_{r \to \infty} \frac{r}{4(r-1)} = \frac{1}{4},$$

 so that shifting both X and Y two bits to the right will guarantee convergence
 for the case of full precision inspection.

 A compromise between the minimum number of digits inspected for
 selection and the smallest amount of necessary initial scaling must be made.
 As will be discussed in Chapter 5, more digits of the partial remainder are
 always required for selection during division. Since these lines are
 available, using them during multiplication does not increase the complexity
 of the arithmetic unit, while it does relax the bound, M, on the operands.
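 The trade-off between inspection precision and operand scaling can be made
 concrete by evaluating bound (3.11) for a few residual bounds. This is a
 minimal sketch; the function name `operand_bound` and its `residual`
 parameter (the assumed upper bound on $P_{j-1} - p_{j-1}$) are illustrative
 and not taken from the thesis.

```python
from fractions import Fraction

def operand_bound(r, K, residual):
    """Upper bound M on both operands from (3.11), for a given residual bound."""
    return Fraction(r) * (Fraction(K) - Fraction(residual)) / (2 * (r - 1))

# Minimum-precision inspection: the residual may reach K, so M collapses to 0.
print(operand_bound(4, 1, 1))
# Full-precision inspection: residual <= 1/2, giving M <= r(2K-1) / (4(r-1)).
print(operand_bound(4, 1, Fraction(1, 2)))
print(operand_bound(2, 1, Fraction(1, 2)))
```

 Any inspection precision between these two extremes yields a bound between
 zero and $r(2K-1)/(4(r-1))$.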
 
 3.5 Hardware Block Diagram 
 
 In this chapter the algorithm MULT has been completely defined.
 Specifically, the product digit selection function, SELECT, and initial range
 restrictions have been given. In this section a variable block diagram
 implementing the MULT algorithm will be examined. Here, again, only the
 major components of the arithmetic unit (AU) for MULT will be discussed.
 Lower level details for actual implementation will be developed in Chapter 5.

 [Figure 3.4 Block Diagram for Multiplication. Blocks shown: redundant
 product register ($R_j \leftarrow R_{j-1} + p_j r^{-j}$), product digit selector,
 multi-input redundant adder ($P_j \leftarrow -r p_{j-1} + X_j y_j + Y_{j-1} x_j + r P_{j-1}$),
 two selection networks, multiplicand register ($X_j \leftarrow X_{j-1} + x_j r^{-j}$),
 and multiplier register ($Y_j \leftarrow Y_{j-1} + y_j r^{-j}$).]
 
 Figure 3.4 is a block diagram of the AU for performing multiplication.
 It is so structured as to be compatible with division, addition, and
 subtraction. As with division the major component of the AU is a full
 width multi-input limited carry-borrow propagation adder.

 The product digit selector is a table look up device which implements
 the SELECT function for multiplication. It examines the $\gamma$ most significant
 digits of $P_j$, denoted $\hat{P}_j$, and does essentially a rounding on it to select the
 proper product digit, $p_j$. As with division, the rest of the unit consists
 of registers and selectors.
 
 3.6 Some Numerical Examples 
 
 Table 3.2 through Table 3.4 are examples of the algorithm MULT for
 several different radices and product redundancy coefficients (K).
 
 Table 3.2 Example of the Algorithm MULT (r=2, K=1)

 r = 2, m = 8, $D_\rho = \{\bar{1}, 0, 1\}$, K = 1

 X = 0.01101001
 Y = 0.01110011

 (R = 0.0010111100101011)

 j    X_j y_j + Y_{j-1} x_j    P_j           p_j          2(P_j - p_j)
 1    0.0                      0.0           0            0.0
 2    0.01                     0.01          0            2(0.01-0) = 0.1
 3    0.101                    1.001         1            2(1.001-1) = 0.01
 4    0.0110                   0.1010        1            2(0.1010-1) = -0.11
 5    0.0111                   -0.0101       0            2(-0.0101-0) = -0.101
 6    0.0                      -0.101        $\bar{1}$    2(-0.101+1) = 0.11
 7    0.0110100                1.0010100     1            2(1.0010100-1) = 0.010100
 8    0.11011011               1.00101011    1            2(1.00101011-1) = 0.0101011

 $$R = \sum_{j=1}^{8} p_j 2^{-j} + 2^{-8}(P_8 - p_8) = 0.00110\bar{1}11 + 2^{-9}(0.0101011)$$

 = 0.0010111100101011
 
 Table 3.3 Example of the Algorithm MULT (r=4, K=1)

 r = 4, m = 4, $D_\rho = \{\bar{3}, \bar{2}, \bar{1}, 0, 1, 2, 3\}$, K = 1

 X = 0.01101001
 Y = 0.01110011

 (R = 0.0010111100101011)

 j    X_j y_j + Y_{j-1} x_j                        P_j            p_j          4(P_j - p_j)
 1    0.01                                         0.01           0            4(0.01-0) = 1.0
 2    3(0.0110) + 2(0.01) = 1.1010                 10.1010        3            4(10.1010-3) = -1.1
 3    2(0.0111) = 0.1110                           -0.101         $\bar{1}$    4(-0.101+1) = 1.1
 4    3(0.01101001) + 0.011100 = 1.10101011        11.00101011    3            4(11.00101011-3) = 0.101011

 $$R = \sum_{j=1}^{4} p_j 4^{-j} + 4^{-4}(P_4 - p_4) = 0.03\bar{1}3_4 + 4^{-5}(0.223_4)$$

 $$R = 0.02330223_4 = 0.0010111100101011$$
 
 Table 3.4 Example of the Algorithm MULT (r=4, K=2/3)

 r = 4, m = 4, $D_\rho = \{\bar{2}, \bar{1}, 0, 1, 2\}$, K = 2/3

 X = 0.01101001
 Y = 0.01110011

 TO INSURE CONVERGENCE: SHIFT RIGHT ONE DIGIT EACH

 (R = 0.0010111100101011)

 j    X_j y_j + Y_{j-1} x_j                            P_j              p_j          4(P_j - p_j)
 1    0.0                                              0.0              0            0.0
 2    0.0001                                           0.0001           0            4(0.0001-0) = 0.01
 3    3(0.000110) + 2(0.0001) = 0.011010               0.101010         1            4(0.101010-1) = -1.011
 4    2(0.000111) = 0.001110                           -1.00101         $\bar{1}$    4(-1.00101+1) = -0.101
 5    3(0.0001101001) + 0.00011100 = 0.0110101011      -0.0011010101    0            4(-0.0011010101) = -0.11010101

 $$R = 4^{2} \left( \sum_{j=1}^{5} p_j 4^{-j} + 4^{-5}(P_5 - p_5) \right) = 4^{2}(0.001\bar{1}0_4 - 0.0000003111_4)$$

 $$= 0.010\bar{1}0000\bar{1}\bar{1}0\bar{1}0\bar{1}0\bar{1} = 0.0010111100101011$$
 
 4. ADDITION AND SUBTRACTION 
 
 Once compatible on-line algorithms for division and multiplication 
 have been specified, it remains to define compatible on-line addition 
 and subtraction algorithms. It would be possible, since a limited carry- 
 borrow propagation adder is used for the basic recursion, to simply feed 
 the output of the adder directly to the recoding logic. Instead, an 
 algorithm which is quite similar to the multiplication algorithm is 
 presented. It will soon become apparent why this method is preferred. 
 
 4.1 Background 
 
 Consider the situation at the $i$-th digit position of a conventional
 adder. The addend A and the augend B provide the inputs $a_i$ and $b_i$, each
 of weight $r^{-i}$, to the $i$-th stage. A third input $c_i$, also of weight $r^{-i}$,
 is the carry output from digit position i+1. The outputs of the $i$-th
 position are the sum digit $s_i$ of weight $r^{-i}$ and the carry out digit $c_{i-1}$,
 of weight $r^{-i+1}$. The relationship between outputs and inputs is, then

 $$r c_{i-1} + s_i = a_i + b_i + c_i.$$

 Because the carry out $c_{i-1}$ is propagated from the least significant to the
 most significant end of the adder, the speed of conventional addition is
 less than desirable. It has long been recognized that carries do not
 need to be propagated during each addition of a long sequence of additions,
 provided that the carries are explicitly stored. This technique was first
 used to speed up multiplication, with a single carry propagation at the
 end of the operation.
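 The digit-position relation above is what forces a conventional adder to
 work from the least significant end. A minimal sketch (the function name
 `ripple_add` is hypothetical, and nonredundant digits in the range 0 to r-1
 are assumed) shows the carry chain explicitly:

```python
def ripple_add(a_digits, b_digits, r):
    """Add two m-digit radix-r fractions whose digits are listed most significant first.

    Returns (carry_out, sum_digits). Each stage satisfies
    r * c_{i-1} + s_i = a_i + b_i + c_i, so stage i cannot finish
    before stage i+1 has produced its carry.
    """
    carry = 0
    s = [0] * len(a_digits)
    for i in range(len(a_digits) - 1, -1, -1):   # least significant digit first
        carry, s[i] = divmod(a_digits[i] + b_digits[i] + carry, r)
    return carry, s

# A = 0.11011001, B = 0.10100101 in radix 2
print(ripple_add([1, 1, 0, 1, 1, 0, 0, 1], [1, 0, 1, 0, 0, 1, 0, 1], 2))
```

 The most significant sum digit is the last one to become valid, which is
 exactly the order an on-line unit cannot tolerate.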
 
 Storing the carry, $c_{i-1}$, is a method of introducing redundancy
 into the adder. Much work has been done on the development of redundant
 adders [ROH67, BOR68, MEL72, GOY76]. In particular, Rohatsch's work 
 proves that the utilization of any redundant and contiguous sum-difference 
 digit set makes possible the implementation of limited carry-borrow 
 addition and subtraction; that is, addition and subtraction for which 
 carries-borrows propagate no further than a fixed number (one or 
 two in practice) of digital positions is possible. In Goyal's work an 
 exhaustive study is done of various design techniques and the reader is 
 referred to this work for the gory details. 
 
 Since the carry-borrows can be limited to one or two digital 
 positions, a redundant adder could be used directly for on-line addition 
 with an on-line delay of up to two digits. There is, however, an 
 algorithm which involves no delay for producing on-line sums and differ- 
 ences. This algorithm will now be presented. 
 
 4.2 The On-line Algorithm 
 
 Let the radix r representation of the fractional part of the
 positive addend (minuend), augend (subtrahend), and sum (difference) be
 denoted by A, B, and R respectively, such that

 $$A = \sum_{i=1}^{m} a_i r^{-i},$$

 $$B = \sum_{i=1}^{m} b_i r^{-i},$$

 $$R = \sum_{i=1}^{m} s_i r^{-i+1}$$

 and R = A+B (R=A-B) to m digit precision.

 In an on-line environment the digits of the addend (minuend) and
 augend (subtrahend) are not known in advance, but are available on-line,
 digit-by-digit, most significant digit first. The operand digits are
 typically members of a conventional nonredundant digit set, $D$, such that

 $$a_i, b_i \in \{0, 1, 2, \ldots, (r-1)\}.$$

 It is assumed that the operands are in normalized form; i.e.,

 $$\tfrac{1}{2} \le A, B < 1.$$
 
 Assume that the sum (difference) digits can be selected on-line.
 One new digit of the sum (difference) would then be determined upon the
 receipt of one new digit each of the addend (minuend) and augend
 (subtrahend). In addition (subtraction) the index difference, $\delta$, is identically
 1. That is, only one digit of each of the operands is needed initially
 to select the first result digit. Let the sum (difference) digits, $s_j$,
 be members of the same symmetric redundant digit set, $D_\rho$, as defined for
 the quotient and product digits.

 The partial sum (difference) is computed via the same limited
 carry-borrow propagation adder used to generate the partial remainder during
 division and the partial product during multiplication. Thus, the digits
 of the partial sum (difference) are members of the redundant digit set, $D_{\rho'}$.
 
 Given these definitions, the algorithm ADD and the algorithm SUBT,
 as shown below, can be defined in the same manner as the
 algorithm MULT was defined for multiplication. The selection procedure
 for result digits corresponds to a simple rounding on $P_j$, exactly as done
 in multiplication. Thus, again, the basic recursion includes a "correction"
 on $P_{j-1}$ to take into account the previous selection of $s_{j-1}$.

 As in multiplication a recursion relationship for addition which
 takes into account the on-line nature of the operands only can be expressed
 as

 $$P^{*}_j \leftarrow r P^{*}_{j-1} + (a_j + b_j) r^{-1}. \tag{4.3}$$

 Using this recursion, the sum digits would become available from the
 least-to-most significant end of $P^{*}_m$, as determined by the traditional
 carry propagation requirements.

 Algorithm ADD:

 Step 1 [Initialization]: $P_0 \leftarrow 0$;
 $s_0 \leftarrow 0$;
 $R_0 \leftarrow 0$;
 $j \leftarrow 1$;

 Step 2 [Input Digits]: $a_j$ AND $b_j$;

 Step 3 [Basic Recursion]:
 $$P_j \leftarrow r(P_{j-1} - s_{j-1}) + (a_j + b_j) r^{-1} \tag{4.1}$$

 Step 4 [Selection]: $s_j \leftarrow \mathrm{SELECT}(P_j)$;
 $R_j \leftarrow R_{j-1} + s_j r^{-j+1}$;

 Step 5 [Test]: IF ($j < m$)
 THEN $j \leftarrow j + 1$;
 GO TO Step 2;
 ELSE END ADD;

 Algorithm SUBT:

 Step 1 [Initialization]: $P_0 \leftarrow 0$;
 $s_0 \leftarrow 0$;
 $R_0 \leftarrow 0$;
 $j \leftarrow 1$;

 Step 2 [Input Digits]: $a_j$ AND $b_j$;

 Step 3 [Basic Recursion]:
 $$P_j \leftarrow r(P_{j-1} - s_{j-1}) + (a_j - b_j) r^{-1} \tag{4.2}$$

 Step 4 [Selection]: $s_j \leftarrow \mathrm{SELECT}(P_j)$;
 $R_j \leftarrow R_{j-1} + s_j r^{-j+1}$;

 Step 5 [Test]: IF ($j < m$)
 THEN $j \leftarrow j + 1$;
 GO TO Step 2;
 ELSE END SUBT;

 From equation (4.3) and the basic recursion
 equation in algorithm ADD, (4.1), the following relation for addition can
 be obtained by induction, where $P^{*}_j$ denotes the uncorrected partial sum
 of recursion (4.3).

 $$P_j = P^{*}_j - \sum_{i=1}^{j-1} s_i r^{j-i}$$

 This implies that, as $j \to m$,

 $$P_m = P^{*}_m - \sum_{i=1}^{m-1} s_i r^{m-i}$$

 or rearranging

 $$P^{*}_m = P_m + r^{m-1} \sum_{i=1}^{m-1} s_i r^{-i+1}. \tag{4.4}$$

 Since $P^{*}_m = (A+B) r^{m-1}$ for addition and $P^{*}_m = (A-B) r^{m-1}$ for subtraction,
 equation (4.4) can be rewritten as

 $$(A \pm B) r^{m-1} = P_m + r^{m-1} \sum_{i=1}^{m-1} s_i r^{-i+1}$$

 which is just

 $$A \pm B = \sum_{i=1}^{m} s_i r^{-i+1} + (P_m - s_m) r^{-m+1},$$

 so that

 $$R = \sum_{i=1}^{m} s_i r^{-i+1} = (A \pm B) - (P_m - s_m) r^{-m+1}.$$

 By using the product selection procedure SELECT defined for
 multiplication in Chapter 3, Section 3.3, then

 $$|P_m - s_m| \le K \tag{4.5}$$

 and R = A+B (R=A-B) can be computed to m digit precision.
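 Algorithms ADD and SUBT differ only in the sign applied to the augend
 (subtrahend) digit, so one sketch can trace both. The names `select` and
 `online_addsub` are illustrative; exact rationals replace the redundant
 adder, selection uses the full-precision rounding rule defined for
 multiplication, and one extra iteration with zero inputs is run to flush
 the last residual, which is an assumption based on the m+1 rows shown in
 the numerical examples of Section 4.5.

```python
from fractions import Fraction
from math import floor

def select(P, rho):
    """Round P to the nearest digit; beyond the largest digit rho, truncate instead."""
    mag = abs(P)
    digit = floor(mag + Fraction(1, 2)) if mag <= rho else floor(mag)
    return digit if P >= 0 else -digit

def online_addsub(a_digits, b_digits, r, rho, subtract=False):
    """Trace ADD (or SUBT) on digits given most significant first.

    Returns the signed result digits s_1..s_{m+1} and the accumulated result R.
    """
    sign = -1 if subtract else 1
    P, s, R = Fraction(0), 0, Fraction(0)
    digits = []
    pairs = list(zip(a_digits, b_digits)) + [(0, 0)]    # extra flushing step
    for j, (a, b) in enumerate(pairs, start=1):
        P = r * (P - s) + Fraction(a + sign * b, r)     # basic recursion (4.1)/(4.2)
        s = select(P, rho)                              # digit selection
        R += s * Fraction(1, r ** (j - 1))              # s_j has weight r^(-j+1)
        digits.append(s)
    return digits, R

# A = 0.11011001 and B = 0.10100101: radix 2, and radix 4 with digits 3,1,2,1 and 2,2,1,1
print(online_addsub([1, 1, 0, 1, 1, 0, 0, 1], [1, 0, 1, 0, 0, 1, 0, 1], 2, 1))
print(online_addsub([3, 1, 2, 1], [2, 2, 1, 1], 4, 3, subtract=True))
```

 Because the first result digit has weight $r^{0}$, the sum of two normalized
 fractions, which may exceed 1, is still representable without a separate
 overflow digit.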
 
 4.3 Sum (Difference) Digit Selection

 Using the same selection function as defined for multiplication,

 $$s_j \leftarrow \mathrm{SELECT}(P_j) = \begin{cases} \operatorname{sign}(P_j) \cdot \lfloor |P_j| + \tfrac{1}{2} \rfloor & \text{for } |P_j| \le \rho, \\ \operatorname{sign}(P_j) \cdot \lfloor |P_j| \rfloor & \text{otherwise,} \end{cases} \tag{4.6}$$

 will insure the convergence of the algorithms ADD and SUBT. This represents
 a simple rounding procedure on $P_j$. Recall, however, that the partial sum,
 $P_j$, is in redundant form. Thus, a "simple" rounding would require a full
 precision carry propagation of $P_j$ in order to determine its magnitude.
 This must be avoided if at all possible.

 Sufficient precision for the proper selection of sum digits can be
 determined by looking at the basic recursion of the algorithm ADD, equation (4.1),

 $$P_j \leftarrow r(P_{j-1} - s_{j-1}) + (a_j + b_j) r^{-1}.$$

 Using the same upper bound on $P_j$ for proper selection as defined for
 multiplication,

 $$|P_j| \le rK,$$

 and assuming worst case values for $a_j$ and $b_j$ of (r-1), the basic recursion can
 be restated as

 $$r(P_{j-1} - s_{j-1}) + 2(r-1) r^{-1} \le rK.$$

 The bound on the 'residual', $P_{j-1} - s_{j-1}$, is then

 $$(P_{j-1} - s_{j-1}) \le K - \frac{2(r-1)}{r^{2}}. \tag{4.7}$$

 For base 2 (K=1), this is just

 $$(P_{j-1} - s_{j-1}) \le 1 - \frac{1}{2} = \frac{1}{2}.$$

 So, for base 2 a full precision inspection of $P_j$ to select $s_j$ must be
 performed. For base 4 (K=1), equation (4.7) becomes

 $$(P_{j-1} - s_{j-1}) \le 1 - \frac{6}{16} = \frac{5}{8}$$

 which implies a two digit inspection of $P_j$ to select $s_j$. Selection
 requirements for other cases can be determined in a similar manner.
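 Bound (4.7) is simple enough to tabulate for other radices. The helper
 below is a sketch with a hypothetical name, `residual_bound`; it only
 evaluates the right-hand side of (4.7) and does not decide how many digits
 must be inspected.

```python
from fractions import Fraction

def residual_bound(r, K):
    """Right-hand side of (4.7): the largest residual P_{j-1} - s_{j-1} that is allowed."""
    return Fraction(K) - Fraction(2 * (r - 1), r * r)

for r in (2, 4, 8, 16):
    print(r, residual_bound(r, 1))
```

 For K = 1 the bound grows toward 1 as the radix increases, so higher
 radices leave more slack above the full-precision rounding error of 1/2.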
 
 4.4 Hardware Block Diagram 
 
 In this chapter the algorithms ADD and SUBT have been completely 
 defined. In this section a variable block diagram implementing the ADD and 
 SUBT algorithms will be examined. Here, again, only the major components of 
 the arithmetic unit (AU) for ADD and SUBT will be discussed. Lower level 
 details for actual implementation will be developed in Chapter 5. 
 
 Figure 4.1 is a block diagram of the AU for performing addition and
 subtraction. It is so structured as to be compatible with division and
 multiplication. As with the other algorithms, the major component of the AU
 is a full width multi-input limited carry-borrow propagation adder.

 The sum (difference) digit selector is a table look up device which
 implements the SELECT function described for addition and subtraction (i.e.,
 the same function as in multiplication). It examines the $\gamma$ most significant
 digits of $P_j$, denoted $\hat{P}_j$, and does essentially a rounding on it to select the proper
 sum (difference) digit, $s_j$. As with the other algorithms, the rest of the
 unit consists of registers and selectors.
 
 [Figure 4.1 Block Diagram for Addition and Subtraction. Blocks shown:
 redundant sum (difference) register ($R_j \leftarrow R_{j-1} + s_j r^{-j+1}$), sum
 (difference) digit selector, multi-input redundant adder
 ($P_j \leftarrow -r s_{j-1} + (a_j \pm b_j) r^{-1} + r P_{j-1}$), two selection networks,
 addend (minuend) register, and augend (subtrahend) register.]
 
 4.5 Some Numerical Examples 
 
 Table 4.1 through Table 4.3 are examples of the algorithms ADD and 
 SUBT for several different radices and sum (difference) redundancy 
 coefficients (K) . 
 
 Table 4.1 Example of Algorithm ADD (r=2, K=1)

 r = 2, m = 8, $D_\rho = \{\bar{1}, 0, 1\}$, K = 1

 A = 0.11011001
 B = 0.10100101

 (R = 1.01111110)

 j    (a_j + b_j) 2^{-1}    P_j     s_j          2(P_j - s_j)
 1    1.0                   1.0     1            2(1.0-1) = 0.0
 2    0.1                   0.1     1            2(0.1-1) = -1.0
 3    0.1                   -0.1    $\bar{1}$    2(-0.1+1) = 1.0
 4    0.1                   1.1     1            2(1.1-1) = 1.0
 5    0.1                   1.1     1            2(1.1-1) = 1.0
 6    0.1                   1.1     1            2(1.1-1) = 1.0
 7    0.0                   1.0     1            2(1.0-1) = 0.0
 8    1.0                   1.0     1            2(1.0-1) = 0.0
 9    0.0                   0.0     0            2(0.0-0) = 0.0

 $$R = \sum_{i=1}^{m+1} s_i r^{-i+1} = 1.1\bar{1}111110 = 1.01111110$$
 
 Table 4.2 Example of Algorithm ADD (r=4, K=1)

 r = 4, m = 4, $D_\rho = \{\bar{3}, \bar{2}, \bar{1}, 0, 1, 2, 3\}$, K = 1

 A = 0.11011001
 B = 0.10100101

 (R = 1.01111110)

 j    (a_j + b_j) 4^{-1}    P_j      s_j          4(P_j - s_j)
 1    1.01                  1.01     1            4(1.01-1) = 1.0
 2    .11                   1.11     2            4(1.11-2) = -1.0
 3    .11                   -0.01    0            4(-0.01-0) = -1.0
 4    .10                   -0.10    $\bar{1}$    4(-0.10+1) = 10.0
 5    0.0                   10.0     2            4(10.0-2) = 0.0

 $$R = \sum_{i=1}^{m+1} s_i r^{-i+1} = 1.20\bar{1}2_4 = 1.01111110$$
 
 Table 4.3 Example of Algorithm SUBT (r=4, K=1)

 r = 4, m = 4, $D_\rho = \{\bar{3}, \bar{2}, \bar{1}, 0, 1, 2, 3\}$, K = 1

 A = 0.11011001
 B = 0.10100101

 (R = 0.00110100)

 j    (a_j - b_j) 4^{-1}    P_j      s_j          4(P_j - s_j)
 1    0.01                  0.01     0            4(0.01-0) = 1.0
 2    -0.01                 0.11     1            4(0.11-1) = -1.0
 3    0.01                  -0.11    $\bar{1}$    4(-0.11+1) = 1.0
 4    0.0                   1.0      1            4(1.0-1) = 0.0
 5    0.0                   0.0      0            4(0.0-0) = 0.0

 $$R = \sum_{i=1}^{m+1} s_i r^{-i+1} = 0.1\bar{1}10_4 = 0.010\bar{1}0100 = 0.00110100$$
 
 5. IMPLEMENTATION 
 
 The previous three chapters have dealt with the algorithmic design
 for on-line division, multiplication, and addition/subtraction. It is now
 appropriate to consider the problems of actual implementation of the on-line
 arithmetic unit while keeping in mind OBJECTIVES 4, 5, 6, and 7 as outlined
 in Chapter 1. With the typical computer scientist's fondness for acronyms,
 this unit has been dubbed AURORA† (Arithmetic Unit Realizing On-line
 Redundant Algorithms). This chapter attacks the problems of implementation:
 the design constraints of LSI, floating point considerations, design
 modifications required to pipeline AURORA, minimal hardware requirements with the
 corresponding speed tradeoffs, and a system's level overview.

 † An aurora is appropriately defined as "the early period of anything." In
 Roman mythology Aurora was the goddess of the dawn. Thus, the name seems
 suitable in all respects: connotation, gender, and acronymability.

 5.1 Design Constraints of LSI

 With the advent of large scale integration (LSI) technology has come
 a challenge to computer designers to find circuits which make efficient use
 of its full potential. The ever improving reliability, cost, and size of
 integrated circuits make it more and more reasonable to now implement in
 hardware various functions which previously belonged in the software domain.
 It is this author's opinion that AURORA, in keeping with OBJECTIVE 4, makes
 an excellent candidate for design as a single chip, LSI module. When
 designing the module, several properties innate to LSI technology [HOD76,
 GOY76, LEW74] must be adhered to. These properties include:
 
 • high circuit density, 
 
 • regularity of structure, 
 
 • low pin count, and 
 
 • a large domain of applications. 
 
 With LSI, thousands of individual gates can be fabricated on an 
 extremely small chip of silicon. There has been a problem finding suitable 
 functions which require a large number of gates while still meeting the 
 other design constraints of LSI. Semiconductor memories (RAMs, ROMs, and 
 PLAs) are obviously the most suitable candidates for LSI implementation. 
 They more than satisfy the previously defined properties. Calculator 
 chips are also suitable candidates. The microprocessor, which is becoming 
 more and more widely used, can be made suitable for implementation in LSI. 
 Another area in which LSI technology is apparent is in digital watch chips. 
 The capability provided by LSI, though, is considerably more than that 
 required for the typical watch function — telling the time and date. 
 Consequently, watches which are also timers, counters, alarms, and even 
 calculators are appearing on the market in order to make full use of the 
 inherent LSI capabilities. The search goes on for other candidates which 
 will make full use of the potential of LSI while meeting the predefined 
 constraints. AURORA, similar in function to a microprocessor, will have a 
 high gate count (high circuit density). Thus, an effort must be made to 
 make AURORA conform, as the microprocessor has been made to conform, to 
 the other constraints of LSI. 
 
 
 One of the most severe constraints in the design of an LSI device 
 is the restriction on interconnections within the chip itself. This is due 
 to two factors: 1) the comparatively large chip area required for inter- 
 connections reducing the "useable" chip area, and 2) the immense problem 
 of routing a large number of signals while maintaining a minimum number of 
 crossovers and high gate density. Interconnections can be simplified by 
 forcing the logic design of the chip into a cellular or regular structure. 
 A regular structure has other important implications. 
 
 1) It simplifies the manufacturing process by making tooling 
 and mask production easier and by helping to regulate the 
 processing steps. 
 
 2) It makes it possible to optimize each cell and, thus, the 
 overall chip to achieve the most function per dollar. A 
 large collection of random gates is virtually impossible 
 to optimize because of the large number of variables 
 involved. 
 
 3) The generation of testing algorithms for cellular and 
 regular, repetitive structures is easier than for random 
 logic. 
 
 4) In addition, as technological improvements pave the way
 for larger devices, cellular structures are more easily 
 expanded to make use of this bonus real estate. 
 
 † Unavoidable crossovers call for more processing steps and, thus, lead to
 higher tooling costs.
 
 
 In that microprocessors have been made to conform to the constraint 
 of regularity of structure, so can AURORA. Typically, the most random 
 part of any system is its control logic. It is envisioned that AURORA 
 would be microprogrammed, possibly via a PLA (Programmable Logic Array) 
 [RHY74, WES75] to avoid randomness in the control logic.
 
 The PLA would generate all necessary control signals for processing 
 based upon the opcode input to the unit (+, -, *, /), the present state of 
 the process, internal and external status signals, and a clock pulse. See 
 Figure 5.1. A PLA, which can be visualized as a logical AND network feeding 
 a logical OR network, works like an ROM with flexible addressing. Though 
 suitable for LSI, using an ROM as a logic element has a major drawback in 
 that a word must be stored for every possible combination of system inputs, 
 many of which may be don't cares and are therefore never used. The PLA's 
 address translation capability allows more than one of these possible input 
 combinations to share a word of storage, making the PLA practical in many 
 applications too large for the typical ROM. The PLA also has more input 
 (address) lines than a comparable ROM. Using a PLA instead of the standard 
 ROM for the microcontrol store provides for self-addressing of the next 
 control state as well as the generation of the required control signals 
 within a single array. 
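
 The contrast drawn above between a ROM and a PLA can be sketched in a few lines of Python. This is only an illustration: the four-input patterns and three-bit outputs below are made up, and the names `PRODUCT_TERMS`, `matches`, `pla_lookup`, and `rom_words` are hypothetical. The AND plane is modelled as pattern matching with don't-care positions, and the OR plane as a bitwise OR of the outputs of every term that fires.

```python
from itertools import product

# Each product term: (input pattern, output bits). '-' marks a don't-care input.
# Patterns and outputs are invented for illustration only.
PRODUCT_TERMS = [
    ("00--", 0b001),
    ("01-1", 0b010),
    ("1---", 0b100),
]

def matches(pattern, bits):
    """AND plane: a term fires when every specified input bit agrees."""
    return all(p in ("-", b) for p, b in zip(pattern, bits))

def pla_lookup(bits):
    """OR plane: OR together the outputs of every term that fires."""
    out = 0
    for pattern, value in PRODUCT_TERMS:
        if matches(pattern, bits):
            out |= value
    return out

def rom_words(n_inputs):
    """A ROM stores one word per input combination, whether used or not."""
    return 2 ** n_inputs

n_inputs = 4
covered = sum(
    1
    for combo in product("01", repeat=n_inputs)
    if any(matches(p, "".join(combo)) for p, _ in PRODUCT_TERMS)
)
print(len(PRODUCT_TERMS), "PLA terms cover", covered, "of", rom_words(n_inputs), "ROM words")
```

 In this toy case three stored terms stand in for fourteen of the sixteen words a ROM would need, which is the sharing of storage among input combinations that the text describes.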
 
 The processing logic of AURORA will require the use of proper logic 
 partitioning to make it suitable to LSI. Logic partitioning involves the 
 organization of the internal logic structure so that large functional areas 
 (or arrays) on the chip can be grouped together and used repetitively. 
 External to the chip, functional partitioning of the overall system requires 
 a framework consisting of modules which are completely self-contained 
 
 [Figure 5.1 A Typical PLA Control Layout. Labeled signals: internal and external status, initialize, op code, clock pulse, and control outputs to the processing logic.]
 
 processors, each having its own local store, processing logic, and the 
 control necessary for the module to execute its function. Thus, each module 
 acts as a small insular unit of logic. Since each module's control sees 
 only its own state, the internal and external communication requirements are 
 correspondingly reduced. The AURORA as a module of a larger system meets 
 this criterion. Internal to the chip, an effort must be made to subdivide 
 the logical units of the processing logic in order to achieve logic 
 partitioning and its benefits. 
 
 One of the most serious constraints imposed by LSI on any device is 
 that of a limited number of external connections (i.e., a low pin count). 
 A realistic maximum of 64 external connections (pins) is an unavoidable LSI 
 limitation. AURORA should require fewer pins than this upper limit. The 
 unit will be designed so that it can be connected easily to the bus of a 
 typical microprocessor. A bus system based upon generic signals, such as 
 MUMS (Modular-Unified-Microprocessor-System) developed by Faiman [FAI77,
 CAT76], could be used as a model for designing the communication interface 
 of AURORA. This would help to insure generality of the interface with a 
 system. The number of signals on the MUMS bus sets an upper limit on the 
 pin count for communication with the mother system at 47: a maximum of 32 
 for data (16 "data" lines and 16 "address" lines) and 15 for control (3 power 
 lines, 6 memory control lines, and 6 interrupt control lines). The upper 
 limit on the number of bits per operand word input at the beginning of each 
 cycle is, then, 16. Thus, the choice of radices to be used in the internal
 hardware for implementation of the algorithms can be realistically set at 
 either 2, 4, 16, or 256. The higher the radix the higher the speed and the 
 higher the complexity (gate count). A rough overall pin count for AURORA 
 is attempted in Section 5.4. This count must also include requirements for 
 
 
 communication with an exponent unit for floating point operations and, 
 possibly, communication with other AURORA modules. 
 
 By designing AURORA to meet the specifications of a typical micro- 
 processor bus system, the domain of usefulness increases dramatically. The 
 desirability of having a peripheral unit on a microprocessor bus capable 
 of achieving on-line variable precision arithmetic is obvious. This and 
 other applications of AURORA are discussed in Chapter 6. 
 
 5.2 Floating Point Considerations 
 
 In early, fixed point computers, all numerical data within the com- 
 puter was scaled to lie within a restricted range; a frequent choice was that 
 of a fractional representation such that each number, X, used in computation 
 was in the range -1 < X < 1. It soon became obvious that in order to reduce 
 the burden of scaling during preparation or execution of a problem, a second 
 arithmetic unit, for operations on exponents, should be introduced. Then 
 a number X could be represented by a fractional part f and an exponent e, 
 such that 
 
 $$X = f \cdot r^{e}$$
 
 where e is a positive or negative integer and f is a fraction in the normalized
 range

 $$\frac{1}{2} \le |f| < 1.$$
 
 In order for AURORA to process floating point numbers, two arithmetic 
 units are required, one for the fractional parts and one for the exponents. 
 As far as the handling of data between the mother system and these two pro- 
 cessing units, the following sequence of operations would occur: the 
 exponents, which are assumed to be one memory word each, would be sent to 
 
 
 the exponent unit; the exponent unit would then handle these exponent 
 operands appropriately, sending the proper shift signals to the mantissa unit; 
 and, while these exponents were still being processed, the fractional 
 operands, consisting of several memory words each, would be sent in an on- 
 line fashion to the fractional unit. Upon normalization and exponent 
 adjustment of the results via an investigation of the most significant 
 digits of the mantissa as soon as they became available, the exponent of 
 the result could, then, be returned to the mother system followed by an 
 on-line return of the result words as they are generated. A sample mother 
 system organization is given in Section 5.5. 
 
 If the operation in question was addition or subtraction, an exchange 
 of communication such as the following must occur between the two units. 
 
 1) The exponent unit would determine which operand was 
 smaller. 
 
 2) The mantissa unit, upon signals from the exponent unit, 
 would shift the appropriate operand to the right while 
 the exponent unit would increase that exponent 
 accordingly, until the two exponents agree. 
 
 3) The mantissa unit would start the on-line addition or 
 subtraction. 
 
 4) When the most significant digits of the result were 
 produced in the mantissa unit, the result could be 
 normalized (shifted right if necessary) and the mantissa 
 unit would signal the exponent unit to adjust the 
 result exponent (i.e., the exponent of the larger 
 operand) appropriately. 
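
 Steps 1) to 4) can be sketched as follows. This is a minimal illustration, not the unit's actual control sequence: it assumes an exponent base of 2, uses exact `Fraction` mantissas instead of on-line digit streams, and handles only the right-shift normalization mentioned in step 4). The function name `fp_add` is hypothetical.

```python
from fractions import Fraction

def fp_add(fa, ea, fb, eb, r=2):
    """Add fa * r**ea and fb * r**eb following steps 1) to 4)."""
    # 1) Exponent unit: find the operand with the smaller exponent.
    if ea < eb:
        fa, ea, fb, eb = fb, eb, fa, ea   # make (fa, ea) the larger operand
    # 2) Shift the smaller operand right, raising its exponent, until they agree.
    while eb < ea:
        fb /= r
        eb += 1
    # 3) Mantissa unit: add the aligned fractions.
    f, e = fa + fb, ea
    # 4) Normalize: one right shift if the sum left the fraction range.
    if abs(f) >= 1:
        f /= r
        e += 1
    return f, e

print(fp_add(Fraction(3, 4), 3, Fraction(1, 2), 1))   # 6 + 1
print(fp_add(Fraction(3, 4), 1, Fraction(3, 4), 1))   # 1.5 + 1.5
```

 The first call aligns the smaller operand by two shifts and needs no normalization; the second overflows the fraction range and is shifted right once, with the exponent raised by one.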
 
 
 Multiplication is only slightly more complicated in that the sum
 of the exponents must be formed in the exponent unit to produce the exponent
 of the result. One step of normalization in the form of one left shift of the
 result and a unit decrease of the result exponent may be necessary if the
 fractional product, R, falls in the range

 $$\frac{1}{4} \le |R| < \frac{1}{2}.$$
 
 Similarly, for division the difference of the exponents must be
 formed by the exponent unit to produce the result exponent. The fractional
 quotient, R, if it falls in the range

 $$1 \le |R| < 2$$

 may then require normalization by one right shift with a unit increase in the
 result exponent.
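
 The exponent handling for multiplication and division can be sketched in the same style. The sketch assumes binary normalization (operand fractions with 1/2 <= |f| < 1 and an exponent base of 2) and exact `Fraction` arithmetic in place of the on-line mantissa unit; `fp_mul` and `fp_div` are hypothetical names.

```python
from fractions import Fraction

def fp_mul(fx, ex, fy, ey):
    """Multiply: the exponent unit forms the sum of the exponents."""
    f, e = fx * fy, ex + ey
    if Fraction(1, 4) <= abs(f) < Fraction(1, 2):
        f, e = f * 2, e - 1        # one left shift, unit decrease of exponent
    return f, e

def fp_div(fn, en, fd, ed):
    """Divide: the exponent unit forms the difference of the exponents."""
    f, e = fn / fd, en - ed
    if 1 <= abs(f) < 2:
        f, e = f / 2, e + 1        # one right shift, unit increase of exponent
    return f, e

print(fp_mul(Fraction(1, 2), 3, Fraction(1, 2), 2))   # 4 * 2
print(fp_div(Fraction(3, 4), 2, Fraction(1, 2), 1))   # 3 / 1
```

 With normalized operands a single shift always suffices, since the raw product lies in [1/4, 1) and the raw quotient in (1/2, 2).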
 
 A communications protocol between the two units, exponent and mantissa, 
 is shown in Figure 5.2. Recall that operand scaling and the corresponding 
 result correction to insure convergence of the algorithms is internal to the 
 mantissa processing unit. 
 
 5.3 Algorithmic Modifications of AURORA for Pipelining 
 
 Up to this juncture AURORA has been discussed from the point of view 
 of the serial processing of multiple memory words on a single unit which 
 accepts one memory word per cycle. A problem arises, then, if the memory 
 width is wider than the operand width accepted by one AURORA during one cycle. 
 Perhaps this problem can be resolved by connecting multiple AURORAs together,
 each accepting one "byte" of the memory word, with all of the units running 
 in parallel to produce multiple result "bytes" in parallel constituting one 
 result word. (Note that the data input to an AURORA has been redefined from 
 
 [Figure 5.2 Communication between Mantissa and Exponent Units. Signals between the two units: add or subtract code, normalize pulse, right shift pulse (for add/subtract), and operand shift code. Pin count for floating point communication: 4.]
 
 "word" to "byte" in this context.) If so, would it then be possible to 
 combine the two approaches (serial and parallel), producing a serial system 
 of, say, N AURORA units which run in parallel? The entire system, while 
 cycling serially, would then produce N result bytes during each cycle. 
 
 While this is an admirable goal, on second consideration it is 
 deemed improbable that such a system of simple AURORAS could be designed. 
 Since the AURORA'S operations include division and multiplication, such a 
 design specification, allowing an unspecified number of parallel units 
 to be connected to produce parallel results, implies parallel division and 
 multiplication! Such a conclusion is intuitively impossible. 
 
 There are two more or less attractive alternative solutions to this 
 problem. The first approach is based upon the assumption that the demand 
 that parallel units produce parallel results is removed. Then the units, 
 connected as shown in Figure 5.3, though capable of being loaded in 
 parallel, would produce result bytes in a serial fashion rippling down 
 from the most significant AURORA. The necessary information would be 
 pipelined from left to right through the units. This information to be 
 pipelined through AURORAS arranged in this manner must be determined. Such 
 an arrangement, of course, will increase the complexity of the unit, while 
 the communication required between units will increase the pin count of 
 each unit. But the speed benefits of streaming multiple instructions 
 through the units as allowed in pipelining [RAM77], would more than com- 
 pensate for the added complexity costs. 
 
 Another technique for accommodating the parallel input of oversized 
 operand words is to use one AURORA unit serially as in Figure 5.4, while 
 providing large holding shift registers for the oversized operands. While 
 
 [Figure 5.3 Pipelining Multiple AURORAs]
 
 [Figure 5.4 Serial Processing with One AURORA. Labeled elements: operand words X and Y, each entering a holding shift register that feeds operand bytes to the AURORA; an op code input; and result bytes collected in a holding shift register to form the result word.]
 
 this approach requires the purchase of only one expensive AURORA unit, it 
 must be dedicated to a single instruction until that instruction is 
 completed. Thus, the speed benefits of pipelining are unavailable. 
 
 The necessary information to be pipelined through the multiple 
 AURORA setup will now be specified. The basic recursion of each operation 
 must be investigated to determine this information. For addition and 
 subtraction it is obvious from the basic recursion, (4.1), 
 
 $$P_j \leftarrow r(P_{j-1} - s_{j-1}) + a_j + b_j$$

 that the information to be passed down the pipe from one stage to the
 next is just the "residual"

 $$r(P_{j-1} - s_{j-1})$$
 
 which increases in significant width as it flows from one unit to the next 
 down the pipe. The problem becomes slightly more complicated when multi- 
 plication is attempted. The basic recursion, (3.4), 
 
 $$P_j \leftarrow r(P_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j$$

 indicates that not only the residual, but also the operand digits† up to
 and including $x_j$ and $y_j$ must be passed. It is only in this way that the
 next stage can then form $X_j$ and $Y_{j-1}$. The information to be passed in
 division is not at all obvious. The basic recursion, (2.3),

 $$P_j \leftarrow rP_{j-1} - q_j D_j + n_{j+\delta} r^{-\delta} - R_{j-1} d_{j+\delta} r^{-\delta}$$
 
 † For simplicity during this discussion, it is assumed that one byte of an
 operand or result word is equivalent to one digit of that operand or
 result word.
 
 
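 must be broken down further into

 $$P_j \leftarrow rP_{j-1} - q_j D_{j-1} - q_j d_{j+\delta} r^{-(j+\delta)} + n_{j+\delta} r^{-\delta} - R_{j-1} d_{j+\delta} r^{-\delta} \qquad (5.1)$$

 in order to reveal the desired information. Since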
 
 $$R_j d_{j+\delta} = \Big( \sum_{i=1}^{j-1} q_i r^{-i} \Big) d_{j+\delta} + q_j d_{j+\delta} r^{-j},$$
 
 equation (5.1) can be rewritten as 
 
 $$P_j \leftarrow rP_{j-1} - q_j D_{j-1} + n_{j+\delta} r^{-\delta} - R_j d_{j+\delta} r^{-\delta}. \qquad (5.2)$$
 
 By using equation (5.2) as the basic recursion for division, the informa- 
 tion to be passed down the pipe is seen to be the division residual, 
 
 $$rP_{j-1} - q_j D_{j-1},$$

 the divisor digits up to and including $d_{j+\delta}$ (to be used in forming $D_{j-1}$ in
 the next stage), and all of the quotient digits up to and including $q_{j+1}$
 (to be used in forming $R_j$ in the next stage).
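
 The equivalence of the two division recursions, (2.3) and (5.2), can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original design: it assumes that $D_j$ is the value of the first $j+\delta$ divisor digits and $R_j$ the value of the first $j$ quotient digits, and it uses random digits rather than a real quotient selection, since the identity is purely algebraic. All function names are hypothetical.

```python
from fractions import Fraction
import random

def prefix(digits, count, r):
    """Value of the first `count` digits: sum of digits[i-1] * r**-i."""
    return sum((Fraction(d, r**i) for i, d in enumerate(digits[:count], start=1)),
               Fraction(0))

def step_2_3(P_prev, j, q, d, n, r, delta):
    """Recursion (2.3): built from D_j and R_{j-1}."""
    scale = Fraction(1, r**delta)
    D_j = prefix(d, j + delta, r)
    R_prev = prefix(q, j - 1, r)
    return (r * P_prev - q[j - 1] * D_j
            + n[j + delta - 1] * scale - R_prev * d[j + delta - 1] * scale)

def step_5_2(P_prev, j, q, d, n, r, delta):
    """Recursion (5.2): built from D_{j-1} and R_j."""
    scale = Fraction(1, r**delta)
    D_prev = prefix(d, j - 1 + delta, r)
    R_j = prefix(q, j, r)
    return (r * P_prev - q[j - 1] * D_prev
            + n[j + delta - 1] * scale - R_j * d[j + delta - 1] * scale)

def forms_agree(r=4, delta=3, steps=8, seed=1):
    """Run both forms side by side on arbitrary digit strings."""
    rng = random.Random(seed)
    rho = r - 1
    q = [rng.randint(-rho, rho) for _ in range(steps)]        # signed quotient digits
    d = [rng.randint(0, r - 1) for _ in range(steps + delta)]  # divisor digits
    n = [rng.randint(0, r - 1) for _ in range(steps + delta)]  # dividend digits
    P = Fraction(0)
    for j in range(1, steps + 1):
        a = step_2_3(P, j, q, d, n, r, delta)
        b = step_5_2(P, j, q, d, n, r, delta)
        if a != b:
            return False
        P = a
    return True

print(forms_agree())
```

 The two forms differ only in where the term $q_j d_{j+\delta} r^{-(j+\delta)}$ is accounted for, which is why the second form lets a stage work from $D_{j-1}$ and pass $rP_{j-1} - q_j D_{j-1}$ onward.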
 
 The necessary information to be sent down the pipe over all 
 operations is, then, a residual, both operands, and the result. By sending 
 a digit of each operand and result at the end of every cycle, by the time 
 the last stage of the pipe is processing an operation all of the necessary 
 information has been sent to it.

 The residual can be passed either as its individual elements, $P_{j-1}$
 and $p_{j-1}$ for multiplication, or the residual can be explicitly formed
 in the unit and passed as one unique entity. Of course, this would require 
 another adder (or another pass through the main adder) in each AURORA to 
 form the residual. The addition of this residual adder simplifies the 
 
 
 design of the main adder. Section 5.4.2 investigates the hardware adjust- 
 ments of the processing logic necessary to pipeline the AURORA. 
 
 Some obvious hardware complications besides the extra residual 
 adder in the processing logic are 1) more pins for the communication 
 requirements between the pipelined AURORAS, 2) a much more complex micro- 
 control store, and 3) a bank of extra registers to store the operand and 
 result digits as they arrive. 
 
 5.4 AURORA Hardware Requirements 
 
 A personalized hardware block diagram has been shown for each 
 operation in its appropriate chapter. It now remains to combine the various 
 requirements into one total arithmetic unit which can be microprogrammed 
 to provide the various functions. Low-level (gate) detail of an actual 
 implementation will not be discussed. The description given here will be 
 only detailed enough to estimate the following features: complexity, 
 speed, and pin count. 
 
 5.4.1 The Processing Logic 
 
 The block diagram of the processing logic of AURORA is shown in 
 Figure 5.5. For clarity, the basic recursion formulas are now restated. 
 
 DIVISION (R = N/D):        $P_j \leftarrow rP_{j-1} - q_j D_j + n_{j+\delta} r^{-\delta} - R_{j-1} d_{j+\delta} r^{-\delta}$

 MULTIPLICATION (R = X·Y):  $P_j \leftarrow r(P_{j-1} - p_{j-1}) + X_j y_j + Y_{j-1} x_j$

 ADDITION (R = A+B):        $P_j \leftarrow r(P_{j-1} - s_{j-1}) + a_j + b_j$

 SUBTRACTION (R = A-B):     $P_j \leftarrow r(P_{j-1} - s_{j-1}) + a_j - b_j$
 
 [Figure 5.5 AURORA Processing Logic. Blocks: result digit selector, multi-input redundant adder, two selection networks, and two operand registers.]
 
 The processing unit is made up of an adder, a result digit selector, 
 registers, and selection networks. A brief discussion of the individual 
 elements will now be given. 
 
 Two full width double bank registers are required for the storage 
 of the result, R, and the partial remainder, P., because they are in 
 redundant form. Two full width single bank registers will suffice for 
 the storage of the operands: one of N, X, or A and one of D, Y, or B. 
 The selection networks must be capable of forming the required multiples 
 of their inputs. In the case of radix complement representation, the 
 selection network must also be able to form the complement of these possible 
 multiples. The complexity of the selection networks increases for higher 
 radices and, since the additional multiples appear as inputs to the
 adder, the complexity of the adder would also be increased. A higher 
 radix, while reducing the number of cycles per operand word thereby 
 increasing the processing speed, does increase the complexity of the 
 processing unit. For this reason and because of the complexity of the 
 result digit selection, radix 256 should be eliminated from further 
 serious consideration. The choice among the other possible radices (2, 4, 
 and 16) depends upon speed requirements and the amount of hardware com- 
 plexity allowed. 
 
 The result digit selector is a table look up device (ROM) which
 implements the SELECT functions of the appropriate algorithm. Recall that
 the SELECT functions of multiplication, addition, and subtraction are
 identical: a simple rounding is performed on the $\gamma$ most significant digits
 of $P_j$, the lower limit for proper selection. Table 5.1 gives a list of
 the possible values of $\alpha$, $\beta$, and $\gamma$. The first $\alpha$ most significant digits
 of $P_j$ and the first $\beta$ most significant digits of D are inspected during
 Table 5.1 Result Digit Selector Inputs

 RADIX  $\rho$(K)    INDEX DIFF    REDUNDANT $P_j$                   D                TOTAL
                     ($\delta$)    $\alpha$(bits)   $\gamma$(bits)   $\beta$(bits)

 2      1(1)         5,6,7,8       4(8)             2(4)             -                8 + 1 = 9

 4      2(2/3)       4             4(16)            2(8)             3(6)             22 + 1 = 23
                     5,6,7,8       3(12)            2(8)             3(6)             18 + 1 = 19
        3(1)         3             3(12)            2(8)             3(6)             18 + 1 = 19
                     4,5,6         3(12)            2(8)             2(4)             16 + 1 = 17

 16*    11(11/15)    4             2(16)            2(8)             2(8)             24 + 1 = 25
        12(4/5)      3,4           2(16)            2(8)             2(8)             24 + 1 = 25
        13(13/15)    3,4           2(16)            2(8)             2(8)             24 + 1 = 25
        14(14/15)    3,4           2(16)            2(8)             2(8)             24 + 1 = 25
        15(1)        3,4           2(16)            2(8)             2(8)             24 + 1 = 25

 *Cases with more than 50 steps in any single overlap region were
 automatically eliminated from consideration.
 
 
 division. The complexity of base 16 for the result digit selection is
 obvious. The binary case would require a table look up device with 9
 available address lines: 8 for the first 4 redundant bits of $P_j$ and 1 to
 indicate which operation is in force. The particular case for radix 4,
 $\rho = 3$, and $\delta = 4$ would require 19 available lines: 12 for the first 3
 redundant digits of $P_j$, 6 for the first 3 conventional digits of D, and 1
 for control.
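
 The line counts quoted above, and the totals in Tables 5.1 and 5.2, appear to follow one simple rule, sketched here. The rule is inferred from the tabulated values rather than stated in the text: a conventional radix-r digit is taken to occupy log2(r) bits, a redundant digit twice that, and one extra line selects the operation. The function name `address_lines` is hypothetical.

```python
def address_lines(radix, alpha, beta, carry_propagated=False):
    """Address lines for the result digit selector table.

    alpha: number of leading digits of P_j inspected
    beta:  number of leading digits of D inspected (0 if D is not inspected)
    """
    bits = radix.bit_length() - 1           # log2(radix) for a power of two
    # A redundant digit needs twice the bits of a conventional one; carry
    # propagation on the leading part of P_j removes that doubling.
    p_bits = alpha * bits * (1 if carry_propagated else 2)
    d_bits = beta * bits
    return p_bits + d_bits + 1              # +1 line selects the operation

print(address_lines(2, 4, 0))                          # binary case
print(address_lines(4, 3, 3))                          # radix 4, 3 digits each
print(address_lines(4, 3, 3, carry_propagated=True))   # same case, Table 5.2
```

 The first two calls reproduce the 9 and 19 lines quoted in the paragraph above; the third gives the reduced count after carry propagation.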
 
 For the radix 4 case, the number of address lines for the table
 look up is somewhat large by conventional standards. Two techniques to
 avoid this dilemma are available: 1) use a PLA with the previously outlined
 advantages to do the selection, or 2) perform a carry propagation
 on the most significant portion of $P_j$ to reduce the number of lines
 required. Table 5.2 indicates the number of address lines required when
 a carry propagation is employed on the most significant portion of $P_j$.
 Note that division always requires more significant digits of $P_j$ for
 selection than the other operations. By using all of the available
 digits for selection during all operations, which is more than the
 absolute minimum required for multiplication, addition, and subtraction,
 the bounds on the operands during addition, subtraction, and multiplication
 can be relaxed significantly as discussed in Chapter 3, Section 3.4.
 The comparison constants for division which must be encoded into the
 table look up for quotient digit selection are given in Appendix II
 for several cases. The comparison constants for multiplication, addition,
 and subtraction were given in Chapter 3, Section 3.3, as
 $\pm\frac{1}{2}, \pm 1\frac{1}{2}, \pm 2\frac{1}{2}, \ldots, \pm(\rho - \frac{1}{2})$.
 
 
 Table 5.2 Result Digit Selector Inputs After Carry Propagation

 RADIX  $\rho$(K)    INDEX DIFF    CARRY PROPAGATED $P_j$            D                TOTAL
                     ($\delta$)    $\alpha$(bits)   $\gamma$(bits*)  $\beta$(bits)

 2      1(1)         5,6,7,8       4(4)             2(2)             -                4 + 1 = 5

 4      2(2/3)       4             4(8)             2(4)             3(6)             14 + 1 = 15
                     5,6,7,8       3(6)             2(4)             3(6)             12 + 1 = 13
        3(1)         3             3(6)             2(3)             3(6)             12 + 1 = 13
                     4,5,6         3(6)             2(3)             2(4)             10 + 1 = 11

 16     11(11/15)    4             2(8)             2(7)             2(8)             16 + 1 = 17
        12(4/5)      3,4           2(8)             2(6)             2(8)             16 + 1 = 17
        13(13/15)    3,4           2(8)             2(6)             2(8)             16 + 1 = 17
        14(14/15)    3,4           2(8)             2(5)             2(8)             16 + 1 = 17
        15(1)        3,4           2(8)             2(5)             2(8)             16 + 1 = 17

 *From Table 3.1
 
 
 The central part of the processing logic, the adder, is also
 the most complex part. This adder, as has been repeatedly stated, is a
 multi-input limited carry-borrow propagation type adder. Such adders
 have been the subject of intense study in much of the literature [GOY76,
 ROH67, BOR68, ROB75, AVI61, MET57]. As a result it will not be examined
 in detail here. Figure 5.6 gives the adder I/O requirements for the
 hardware block diagram being discussed, while Figure 5.7 represents the
 $i$th digit position of this adder. A carry generator will be needed if a
 radix complement representation for negative numbers is used.
 
 5.4.2 Hardware Modifications of AURORA for Pipelining 
 
 If the pipelining approach is used (stringing multiple AURORAs
 together) as described in Section 5.3, an adjustment of the processing
 logic makes it suitable for this technique. An extra adder can be used
 to form the residual, as discussed in Section 5.3, which must be passed
 from one unit to another. This residual is in the form of

 $$r(P_{j-1} - s_{j-1})$$

 for multiplication, addition, and subtraction, and is in the form of

 $$rP_{j-1} - q_j D_{j-1}$$

 for division. Additionally, the two operand digits, $x_j$ and $y_j$, must be
 passed in multiplication, and the divisor and quotient digits passed in
 division during each cycle.

 The hardware block diagram for this approach is shown in Figure 5.8.
 Note that the addition of the second adder has simplified the input
 requirements of the main adder. The control for this setup would be
 more complex than in the serial arrangement.
 
 [Figure 5.6 Adder I/O Requirements]
 
 [Figure 5.7 I/O Requirements of Digital Position i of Adder (i > 1)]
 
 [Figure 5.8 Modified AURORA Processing Logic for Pipelining]
 
 5.4.3 Speed of the Processing Logic 
 
 An attempt to bound the speed of the serial processing logic is 
 now appropriate. Define the time to perform one basic recursion in the 
 standard processing logic (Figure 5.5) as 
 
 $$t_R = t_T + t_A + t_S$$

 where $t_T$ is the register and selection network transfer time, $t_A$ is the
 redundant add time, and $t_S$ is the result digit selection time. Both $t_T$
 and $t_S$ correspond to a few gate delays, 4 to 5 $t_g$. So $t_A$ appears to be
 the dominant factor in the equation and depends upon the radix (and, thus,
 the number of inputs to the adder) and the adder structure. Let s be
 defined to be the number of radix 2 summands (inputs to the adder); i.e.,
 higher radix and redundant inputs are considered in their binary equivalent
 formats. Then a simple adder structure, consisting of (s-2) levels
 of full adder rows [GOY76] will have a

 $$t_A = 2(s-2)\,t_g ,$$

 assuming $2t_g$ per full adder. More sophisticated adder structures, such as
 Dadda-types [HO73] can considerably reduce this time. Using this formula
 for $t_A$, the total recursion time for a radix 2 case, where the worst case
 s over all operations is 5 (i.e., 1 conventional input and 2 redundant
 inputs) is

 $$t_R = O(14\,t_g).$$

 In the case of radix 4, this increases to $O(26\,t_g)$.
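
 The recursion time estimate can be reproduced with a short calculation. This is a sketch under assumptions: the transfer and selection times are both taken as 4 gate delays (the lower end of the quoted 4 to 5), and the summand counts for radix 4 and radix 16 (11 and 47) are not stated in the text but inferred from the quoted cycle times. The function names are hypothetical.

```python
def add_time(s):
    """Redundant add time in gate delays: (s - 2) full adder rows at 2 t_g each."""
    return 2 * (s - 2)

def recursion_time(s, t_transfer=4, t_select=4):
    """One basic recursion: transfer + redundant add + result digit selection."""
    return t_transfer + add_time(s) + t_select

print(recursion_time(5))    # radix 2: 1 conventional + 2 redundant inputs
print(recursion_time(11))   # radix 4, with s inferred
print(recursion_time(47))   # radix 16, with s inferred
```

 With these assumptions the three calls give 14, 26, and 98 gate delays, matching the per-cycle times used in Table 5.3.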
 
 Depending on the operand size input, a number of cycles must occur
 before a result is completely generated as tabulated in Table 5.3. While
 
 Table 5.3 Word Processing Time Delay

 WORD                           DIGITS/WORD                    TOTAL WORD
 LENGTH   RADIX   BITS/DIGIT    (# OF CYCLES)   TIME/CYCLE     PROCESSING TIME

 4        2       1             4               14 $t_g$       56 $t_g$   (*)
          4       2             2               26 $t_g$       52 $t_g$   (*)
          16      4             1               98 $t_g$       98 $t_g$   (*)

 8        2       1             8               14 $t_g$       112 $t_g$
          4       2             4               26 $t_g$       104 $t_g$
          16      4             2               98 $t_g$       196 $t_g$

 16       2       1             16              14 $t_g$       224 $t_g$
          4       2             8               26 $t_g$       208 $t_g$
          16      4             4               98 $t_g$       392 $t_g$

 (*) These cases do not satisfy the on-line delay requirements
 of division.
 
 
 using a higher radix can speed up the overall processing time ($112\,t_g$ for
 base 2 versus $104\,t_g$ for base 4 given an 8 bit input width), it also
 increases the complexity of the adder. From this table it would appear
 that radix 4 may be the best compromise. Table 5.5 reinforces this choice 
 in comparing speed and complexity of the processing logic. Once again the 
 trade off between high speed and low gate count is evident. 
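
 The entries of Table 5.3 follow from the number of digits per word and the per-cycle time. A minimal sketch, taking the per-cycle times from the table itself and using hypothetical names:

```python
CYCLE_TIME = {2: 14, 4: 26, 16: 98}   # gate delays per cycle, as in Table 5.3

def word_time(word_length, radix):
    """Total word processing time in gate delays."""
    bits_per_digit = radix.bit_length() - 1   # log2(radix)
    cycles = word_length // bits_per_digit    # digits per word
    return cycles * CYCLE_TIME[radix]

for width in (4, 8, 16):
    print(width, [word_time(width, r) for r in (2, 4, 16)])
```

 For every word length radix 4 comes out slightly ahead of radix 2 and well ahead of radix 16, which is the basis for the compromise suggested above.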
 
 5.4.4 AURORA as a Total Module 
 
 A block diagram of the total unit is given in Figure 5.9. It 
 includes the PLA control logic as discussed in Section 5.1, the processing 
 logic just discussed, the recoding logic as discussed in Appendix III, 
 and the I/O requirements of the overall module. An approximate pin count 
 for base 2, 4, and 16 is given in Table 5.4. 
 
 Table 5.4 Rough Pin Count for AURORA

 WORD     INPUT: #        OUTPUT: #      PINS TO COMMUNICATE
 LENGTH   OPERAND PINS    RESULT PINS    WITH EXP. UNIT         CONTROL   TOTAL

 4        8               4              4                      15        31
 8        16              8              4                      15        43
 16       32              16             4                      15        67
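
 The totals in Table 5.4 can be recomputed from their parts. The sketch below assumes, as the table values suggest, that two operand words and one result word are transferred in parallel; the constants are taken from Table 5.4, Figure 5.2, and the 64-pin limit quoted in Section 5.1, and the names are hypothetical.

```python
EXP_UNIT_PINS = 4    # floating point communication, Figure 5.2
CONTROL_PINS = 15    # control column of Table 5.4
PIN_LIMIT = 64       # realistic maximum quoted in Section 5.1

def pin_count(word_length):
    """Rough pin count for a given input word length in bits."""
    operand_pins = 2 * word_length   # two operand words enter in parallel
    result_pins = word_length        # one result word leaves
    return operand_pins + result_pins + EXP_UNIT_PINS + CONTROL_PINS

for width in (4, 8, 16):
    total = pin_count(width)
    print(width, total, "within limit" if total <= PIN_LIMIT else "over limit")
```

 By this count the 4 and 8 bit word lengths stay under the limit, while the 16 bit case comes out at 67 pins, three over.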
 
 Table 5.5 Speed vs. Complexity (Processing Logic)

        INPUT OPERAND   REDUNDANCY             ADDER COMPLEXITY         RESULT SELECTOR         WORST CASE WORD
 RADIX  WIDTH (BITS)    (MAX. RESULT DIGIT)    (LEVELS)                 INPUT/STORAGE           PROCESS TIME

 2      4               1 (1)                  3 levels (full width)                            *
        8               1 (1)                                           9 LINES/3 WORDS         112 $t_g$
        16              1 (1)                                           9 LINES/3 WORDS         224 $t_g$

 4      4               2/3, 1 (2, 3)          9 levels (full width)    -                       -
        8               2/3 (2)                                         23 LINES/18 WORDS       104 $t_g$
        8               1 (3)                                           17 LINES/16 WORDS       104 $t_g$
        16              2/3 (2)                                         19 LINES/21 WORDS       208 $t_g$
        16              1 (3)                                           15 LINES/19 WORDS       208 $t_g$

 16     4               8/15 to 1 (8 to 15)    45 levels (full width)   -                       -
        8               8/15 to 1 (8 to 15)                             -                       -
        16              11/15 (11)                                      25 LINES/~300 WORDS     392 $t_g$
        16              1 (15)                                          21 LINES/~200 WORDS     392 $t_g$

 $t_g$ is one gate delay
 
 [Figure 5.9 Block Diagram of an AURORA. Blocks: PLA control logic, AURORA processing logic, and result word recoding logic. Labeled signals: operand words, result word, result word ready (interrupt), control signals, and signals from the exponent logic.]
 
 5.5 System's Level Overview 
 
 AURORA was designed to operate as a functional module controlled 
 externally by a larger system. This system must then provide AURORA with 
 the correct operand words and control signals and receive from AURORA the 
 result word for storage. This system must fetch the correct number of 
 operand words in the proper order from main memory, transfer them on-line 
 to AURORA, and store the result words in the proper order back into main 
 memory. To avoid the necessity of dedicating the system to AURORA while 
 the digits are being processed, the AURORA unit should signal the main 
 system via an interrupt when a result word is ready for storage. Then the 
 system is free to handle other processes while AURORA is functioning. 
 Figure 5.10 shows a sample system organization. 
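
 The division of labor described above can be illustrated with a small
 simulation. This is only a sketch under stated assumptions: the names
 StubAurora and run_serial, the tick-based timing, and the latency of three
 ticks are all hypothetical, and the stub returns the operand pair itself
 instead of performing any arithmetic. It shows only the control flow: the
 host supplies operand words most significant first, remains free for other
 work while the unit is busy, and stores a result word when the interrupt
 is raised.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Word = int


@dataclass
class StubAurora:
    """Hypothetical stand-in for AURORA: busy for `latency` ticks per word."""
    latency: int = 3
    interrupt: bool = False                       # models the INTER line
    result: Optional[Tuple[Word, Word]] = None
    _pending: Optional[Tuple[Word, Word]] = None
    _remaining: int = 0

    def start(self, x_word: Word, y_word: Word) -> None:
        """Accept one pair of operand words and begin processing it."""
        self._pending = (x_word, y_word)
        self._remaining = self.latency
        self.interrupt = False

    def tick(self) -> None:
        """Advance one time step; raise the interrupt when the word is done."""
        if self._pending is None:
            return
        self._remaining -= 1
        if self._remaining == 0:
            # Placeholder result: the operand pair itself, not real arithmetic.
            self.result = self._pending
            self._pending = None
            self.interrupt = True


def run_serial(x: List[Word], y: List[Word], unit: StubAurora):
    """Feed words most significant first; store each result on interrupt."""
    stored = []
    other_work = 0
    for x_word, y_word in zip(x, y):
        unit.start(x_word, y_word)
        while not unit.interrupt:
            other_work += 1       # the host is free for other processes here
            unit.tick()
        stored.append(unit.result)
        unit.interrupt = False    # acknowledge the interrupt
    return stored, other_work


stored, other_work = run_serial([10, 11, 12], [20, 21, 22], StubAurora())
```

 In this toy model the counter other_work stands for the time the main
 system can give to other processes; a real system would respond to a
 hardware interrupt rather than test a flag between units of work.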
 
 This chapter has given only an overview of the considerations
 necessary for an actual implementation. The implementation itself was
 not attempted and is left for a more suitable medium. A detailed gate
 count was considered inappropriate and ill advised. It remains to justify
 the existence of such a unit as AURORA.
 
 [Figure 5.10: the block diagrams are not recoverable from the scan. For
 each approach they show AURORA EXPONENT LOGIC and AURORA MANTISSA LOGIC
 blocks (more than one mantissa block in the pipelined arrangement) with
 the signals OPERAND WORDS, RESULT WORD, CONTROL, and INTER. The textual
 content of the figure follows.]

 HIGH LEVEL COMMAND

     OPERATION  X, Y, R, N

     X, Y, R: address of most significant exponent word
     N: precision (number of words to process)

 MICROCODE ROUTINE (determined by AURORA arrangement)

 PIPELINED APPROACH

        INITIALIZE AURORA
        FETCH X, Y
        CALL AURORA (EXP, OP, X, Y)
        DO 10 I=1, N, 2
        FETCH X+I, X+I+1
        FETCH Y+I, Y+I+1
        CALL AURORA (MAN, OP, X, Y)
        IF (INTER .EQ. 0) WAIT
        IF (I .EQ. 1) STORE R
        IF (INTER .EQ. 0) WAIT
        STORE R+I
        IF (INTER .EQ. 0) WAIT
        STORE R+I+1
     10 CONTINUE
        RETURN

 SERIAL APPROACH

        INITIALIZE AURORA
        FETCH X, Y
        CALL AURORA (EXP, OP, X, Y)
        DO 10 I=1, N, 1
        FETCH X+I, Y+I
        CALL AURORA (MAN, OP, X, Y)
        IF (INTER .EQ. 0) WAIT
        IF (I .EQ. 1) STORE R
        IF (INTER .EQ. 0) WAIT
        STORE R+I
     10 CONTINUE
        RETURN

 Figure 5.10 Sample System Organization

 
 6. SUMMARY
 
 6.1 Summary of the Results 
 
 The algorithmic and logical design of an arithmetic unit to be
 used in a computational environment in which the basic arithmetic
 operations satisfy the on-line property has been presented. The on-line
 property requires that to generate the j-th digit of a result (where a
 digit consists of n bits for base 2^n), it is necessary and sufficient to
 have the operands available only up to the j-th digit plus, in the case
 of division, a predetermined number of extra digits which correspond to
 an "on-line delay." Since there is no on-line delay for addition,
 subtraction, and multiplication, the algorithms can begin generating
 result digits as soon as one digit of each operand has been input. The
 delay for division was shown to be a small, positive, radix-dependent
 constant. To fulfill the on-line requirements, a set of left-to-right
 (most-to-least significant), digit-by-digit algorithms has been derived.
 The existence of such algorithms is contingent upon using a redundant
 representation for the result digits. These algorithms and a block diagram
 implementation of the basic arithmetic unit have been presented.
 
 Algorithms for addition and subtraction which conform to this
 on-line property were easily specified, while the multiplication algorithm
 required a somewhat more elaborate approach. The existence of an on-line
 division algorithm was only recently discovered [TRI75]; the bulk of
 this thesis was a development and extension of this early work. Quotient
 digit selection procedures based upon a limited precision model of the
 operands and minimum values for the on-line delay were discussed in detail.
 Once compatible algorithms for the basic operations were defined, the
 problems of actual implementation of the basic unit, AURORA, were
 considered. Suitability to LSI, floating point processing, and adjustments
 in the hardware to gain speed were some of the desirable design goals.
 
 It is now fitting to discuss some possible applications of such a 
 unit to justify its existence. 
 
 6.2 Applications 
 
 It has been stated that a unit suitable for investigation as an 
 LSI chip should have as large a domain as possible. Several application 
 areas are ripe for exploitation today, while others, as advances are made 
 in technology, are on the horizon. The following list of applications is, 
 by no means, exhaustive, but is presented here as a representative study. 
 1) The most obvious use for AURORA is in the area of real-time
    applications. As the operands are generated serially by an
    analog-to-digital conversion process beginning with the most
    significant digits, AURORA could be used to process these operand
    digits as soon as they become available from the converter. This
    is unlike the conventional setup, where the processing unit must
    wait while the full precision operands are converted before
    starting operation. The speed-up benefits are obvious. Such a
    system, with the capability of overlapping conversion and
    computation, has a definite place in long distance communications
    and satellite systems. In fact, any system designed to be of use
    in a real-time environment could make significant gains with the
    addition of an AURORA module to its hardware.
 2) Another possible application is in performing variable precision
    arithmetic. The described algorithms and simple implementation
    requirements of AURORA are compatible with the required modularity
    of any variable precision unit. Both hardware implementations
    discussed in Chapter 5, serial processing on one unit or the
    pipelining of multiple AURORAs for speed, provide variable
    precision arithmetic in a straightforward manner. Of course, the
    ultimate allowable precision is set by the internal hardware of
    AURORA, as with any such device. It is believed that sufficient
    register and adder widths can be provided by large scale integrated
    technology to provide enough "variable precision arithmetic" to
    meet the demands of most applications. The desirability of being
    able to attach an AURORA as a peripheral device on a microprocessor
    bus is obvious. The microprocessor, a device traditionally
    restricted from most mathematical applications because of its short
    word length, could not help but benefit from this boost in
    processing power. As an added bonus, since AURORA signals completion
    via an interrupt, the microprocessor would not have to be dedicated
    to it. While the AURORA is functioning, the main system would be
    freed to handle other pending processes.

 
 3) If the approach of connecting multiple AURORAs together is used
    for actual implementation, pipelining of successive operations
    could be used as an effective speed-up technique. Recall that the
    recursive functioning ripples down from the most significant to
    the least significant unit. When the first unit completes
    processing and returns the result digit to the main system, it has
    also passed all necessary information for that operation to
    correctly continue on down the pipe to the next unit. When one unit
    has completed all of the processing associated with the present
    operation, the next unit in line can begin generating the next
    result digit associated with that same instruction. Thus, the first
    unit is free to initiate processing on the next instruction in the
    program. In this way, the fraction arithmetic unit, which has been
    traditionally considered as a single stage of the pipeline
    [STE75, AND67], can be further decomposed into multiple stages to
    speed up processing even more. Chaining operations on result digits
    as they are generated can increase processing speed even further.
 
 4) One application on the horizon will become more feasible as
    technological improvements make way for the widespread use of
    large serial memories (CCDs, bubble memories, etc.). The major
    user of the large serial memories will be data base systems.
    AURORA could be used in conjunction with these serial memories to
    provide instant processing capabilities for data base systems. As
    soon as the most significant word was (serially) extracted from
    the memory, the decision as to which actual memory word is desired
    could be made on-line. Then, just that operand would have to be
    read from memory. The only other alternative is to read both
    operands from memory, a lengthy process, and then proceed to
    compare them to determine which is the desired operand. By using
    AURORAs to provide on-line processing, intelligent data base
    retrieval systems could be built.
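
 The early decision described in application 4 can be illustrated with a
 minimal sketch. It assumes conventional (non-redundant) unsigned operands
 of equal length, supplied most significant word first; the function
 compare_msw_first and the generator serial_memory are hypothetical names
 introduced here, not part of AURORA. The comparison stops at the first
 pair of words that differ, so the remaining words are never read from the
 (simulated) serial memory.

```python
from typing import Iterable, Iterator, List, Tuple


def compare_msw_first(a: Iterable[int], b: Iterable[int]) -> Tuple[int, int]:
    """Compare two equal-length unsigned operands, most significant word first.

    Returns (sign, words_read): sign is 1 if a > b, -1 if a < b, 0 if equal.
    """
    words_read = 0
    for a_word, b_word in zip(a, b):
        words_read += 1
        if a_word != b_word:
            # The first differing word decides the comparison.
            return (1 if a_word > b_word else -1), words_read
    return 0, words_read


def serial_memory(words: List[int], log: List[int]) -> Iterator[int]:
    """Hypothetical serial memory: yields one word at a time and logs reads."""
    for w in words:
        log.append(w)
        yield w


reads_a: List[int] = []
reads_b: List[int] = []
sign, words_read = compare_msw_first(
    serial_memory([7, 1, 2, 3], reads_a),
    serial_memory([5, 9, 9, 9], reads_b),
)
```

 Here only the leading word of each operand is read before the decision is
 made. With a redundant digit set the leading digits alone do not always
 settle a comparison, so this sketch applies to conventional words only.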
 
 6.3 Suggestions for Future Research 
 
 During work on this dissertation, several extensions or other areas
 of interest requiring further research have become obvious to this author.
 They include the following questions.
 
 1) Could the on-line processing technique employed here be 
 extended to other functions, such as logarithmic, 
 trigonometric, and exponential? It is believed that 
 an alternative algorithmic approach to processing, such 
 as continued products as described by DeLugish [DEL70] 
 or the E-method as developed by Ercegovac [ERC75] might 
 be more appropriate for extension. The on-line result 
 digit selection procedure outlined in this thesis could 
 be carried over with slight modification to these 
 techniques. This area is in need of further research. 
 
 2) What about actual implementation? It is beyond the 
 influence of this author whether or not an actual 
 AURORA module is manufactured as a single chip. It is 
 entirely feasible to build the module from several 
 standard MSI and SSI chips now available on the market. 
 An even more attractive alternative would be to 
 implement AURORA in software using a suitable bit-slice 
 microprocessor. This seems to be more useful than 
 simulating AURORA on a large system equipped with an 
 appropriate simulation language. The microprocessor 
 implementation is one area of continuing research in 
 which this author hopes to be actively engaged. 
 
 
 BIBLIOGRAPHY 
 
 [AND67] Anderson, S. F., et al., "The System/360 Model 91 Floating Point
         Execution Unit," IBM System Journal 11, Vol. 34, 1967.

 [ATK67] Atkins, D. E., "The Theory and Implementation of SRT Division,"
         M.S. Thesis, Report 230, Department of Computer Science,
         University of Illinois, Urbana, June 1967.

 [ATK68] Atkins, D. E., "Higher Radix Division Using Estimates of the
         Divisor and Partial Remainders," IEEE Transactions on Computers,
         Vol. C-17, No. 10, pp. 925-934, October 1968.

 [ATK70] Atkins, D. E., "Design of the Arithmetic Units of Illiac III: Use
         of Redundancy and Higher Radix Methods," IEEE Transactions on
         Computers, Vol. C-19, No. 8, pp. 720-733, August 1970.

 [ATK70] Atkins, D. E., "A Study of Methods for Selection of Quotient Digits
         During Digital Division," Ph.D. Thesis, Report 397, Department of
         Computer Science, University of Illinois, Urbana, June 1970.

 [ATK75] Atkins, D. E., "Introduction to the Role of Redundancy in Computer
         Arithmetic," Computer, Vol. 8, No. 6, pp. 74-76, June 1975.

 [ATR65] Atrubin, A. J., "A One-Dimensional Real-Time Iterative Multiplier,"
         IEEE Transactions on Computers, Vol. EC-14, pp. 394-399, 1965.

 [AVI61] Avizienis, A., "Signed-Digit Number Representation for Fast
         Parallel Arithmetic," IRE Transactions on Electronic Computers,
         EC-10, pp. 389-400, 1961.

 [AVI62] Avizienis, A., "On a Flexible Implementation of Digital Computer
         Arithmetic," Proceedings of IFIP, pp. 664-668, 1962.

 [AVI64] Avizienis, A., "Binary-Compatible Signed-Digit Arithmetic," AFIPS
         Conference Proceedings, Vol. 26, Part 1, pp. 663-672, 1964.

 [BAK75] Baker, P. W., "Algorithms for Higher Level Functions in Machine
         Hardware," Ph.D. Thesis, The University of New South Wales,
         November 1975.

 [BOR68] Borovec, R. T., "The Logical Design of a Class of Limited
         Carry-Borrow Propagation Adders," M.S. Thesis, Report 275,
         Department of Computer Science, University of Illinois, Urbana,
         August 1968.

 
 [BRA63] Brawn, E. L., Digital Computer Design-Logic, Circuitry, and
         Synthesis, Academic Press, New York, 1963.

 [CAM70] Campeau, J. O., "Communication and Sequential Problems in the
         Parallel Processor," Parallel Processor Systems, Technologies and
         Applications, Spartan Books, New York, 1970.

 [CAT76] Catlin, Robert W., "MUMS: Modular Unified Microprocessor System,"
         M.S. Thesis, Report 809, Department of Computer Science, University
         of Illinois, Urbana, June 1976.

 [DEL70] DeLugish, B. G., "A Class of Algorithms for Automatic Evaluation
         of Certain Elementary Functions in a Binary Computer," Ph.D. Thesis,
         Report 399, Department of Computer Science, University of Illinois,
         Urbana, June 1970.

 [ERC75] Ercegovac, Milos D., "A General Method for Evaluation of Functions
         and Computations in a Digital Computer," Ph.D. Thesis, Report 750,
         Department of Computer Science, University of Illinois, Urbana,
         August 1975.

 [FAI77] Faiman, Michael, et al., "MUMS-A Reconfigurable Microprocessor
         Architecture," Computer, Vol. 10, No. 1, pp. 11-16, January 1977.

 [FRE61] Freiman, C. V., "Statistical Analysis of Certain Binary Division
         Algorithms," Proceedings of the IRE, Vol. 49, pp. 91-103,
         January 1961.

 [GOY76] Goyal, Lakshmi, "A Study in the Design of an Arithmetic Element
         for Serial Processing in an Iterative Structure," Ph.D. Thesis,
         Report 797, Department of Computer Science, University of Illinois,
         Urbana, May 1976.

 [HO73]  Ho, I. T. and T. C. Chen, "Multiple Addition by Residue Threshold
         Functions and Their Representation by Array Logic," IEEE
         Transactions on Computers, Vol. C-22, pp. 762-767, August 1973.

 [HOD76] Hodges, David A., "Trends in Computer Hardware Technology,"
         Computer Design, Vol. 15, No. 2, pp. 77-85, February 1976.

 [LEW74] Lewin, Douglas, "Outstanding Problems in Logic Design," The Radio
         and Electronic Engineer, Vol. 44, No. 1, pp. 9-17, January 1974.

 [MEL72] Melicher, S. A., "An Arithmetic Unit with Total Redundant
         Representation," M.S. Thesis, Department of Electrical Engineering,
         University of Illinois, Urbana, 1972.

 [MET57] Metze, G., "A Study of Parallel One's Complement Arithmetic Units
         with Separate Carry or Borrow Storage," Ph.D. Thesis, Report 81,
         Department of Electrical Engineering, University of Illinois,
         Urbana, November 1957.

 
 [PEN62] Penhollow, J. O., "A Study of Arithmetic Recoding with Applications
         to Multiplication and Division," Thesis, Report 128, Department
         of Computer Science, University of Illinois, Urbana, September 1962.

 [PIS70] Pisterzi, M. J., "A Limited Connection Arithmetic Unit," Ph.D.
         Thesis, Report 398, Department of Computer Science, University of
         Illinois, Urbana, June 1970.

 [RAM77] Ramamoorthy, C. V. and H. F. Li, "Pipeline Architecture," Computing
         Surveys, Vol. 9, No. 1, pp. 61-102, March 1977.

 [RHY74] Rhyne, V. Thomas, Scott McPhillips and Jerry Ogdin, "Programmed
         Logic Array," New Logic Notebook, Vol. 1, No. 2, October 1974.

 [ROB58] Robertson, J. E., "A New Class of Digital Division Methods,"
         IRE Transactions on Electronic Computers, Vol. EC-7, pp. 218-222,
         September 1958.

 [ROB65] Robertson, J. E., "Methods of Selection of Quotient Digits During
         Digital Division," Report 663, Department of Computer Science,
         University of Illinois, Urbana, 1965.

 [ROB67] Robertson, J. E., "A Deterministic Procedure for the Design of
         Carry-Save Adders and Borrow-Save Subtractors," Department of
         Computer Science, University of Illinois, Urbana, July 1967.

 [ROB75] Robertson, J. E., "Redundant Structures I: Binary Carry-Save
         Adders and Borrow-Save Subtractors" and "Redundant Structures II:
         Signed Digit Arithmetic," Chapters 6 and 7 of Class Notes for
         CS 364, University of Illinois, Urbana, Revised Fall 1975.

 [ROH67] Rohatsch, Fred A., "A Study of Transformations Applicable to the
         Development of Limited Carry-Borrow Propagation Adders," Ph.D.
         Thesis, Report 226, Department of Computer Science, University of
         Illinois, Urbana, June 1967.

 [STE75] Stephenson, C., "Case Study of the Pipelined Arithmetic Unit for
         the TI Advanced Scientific Computer," Proceedings of Third IEEE
         Symposium on Computer Arithmetic, Dallas, Texas, November 1975.

 [TOC58] Tocher, T. D., "Techniques of Multiplication and Division for
         Automatic Binary Computers," Quarter. J. Mech. App. Math., Vol. 2,
         Pt. 3, pp. 364-384, 1958.

 [TRI75] Trivedi, Kishor S. and Milos D. Ercegovac, "On-Line Algorithms for
         Division and Multiplication," Proceedings of Third IEEE Symposium
         on Computer Arithmetic, Dallas, Texas, November 1975.

 [WES75] 1975 Wescon Professional Program, "Field Programmable Logic,"
         Session 26, San Francisco, California, September 1975.

 
 APPENDIX I 
 REDUNDANCY DEFINITION 
 
 Redundancy (the state of being in excess of what is necessary) is 
 used in the implementation of computer arithmetic to achieve three design
 goals: improved reliability, increased speed, and structural flexibility. 
 To achieve the first goal, hardware redundancy and/or redundant arithmetic 
 codes are applied to the detection and correction of faults. This, however, 
 is not the type of redundancy referred to in this dissertation. Rather, 
 the use of number systems employing redundancy in their representation is 
 implied. In this way the design goals of increased speed and structural 
 flexibility can be achieved. A positional number system with fixed radix, 
 r, is redundant if the allowable digit set includes more than r distinct 
 elements, thereby affording alternate representations of a given numeric 
 value. Uniqueness of representation is sacrificed with the hope of gains 
 in speed and flexibility. 
 
 The type of redundant number representation used throughout this
 dissertation calls for the use of a symmetric redundant digit set, defined
 as

 $$D_\rho = \{-\rho, -(\rho-1), \ldots, -1, 0, 1, \ldots, (\rho-1), \rho\}$$

 where

 $$\frac{r}{2} \le \rho \le r - 1 .$$

 In particular, $D_\rho$ is

 1) minimally redundant if (where $|D_\rho|$ is the cardinality of the digit
    set)

    $$|D_\rho| = r + 1$$

    so that

    $$\rho = \frac{r}{2}$$

 2) or maximally redundant if

    $$|D_\rho| = 2r - 1$$

    so that

    $$\rho = r - 1 .$$

 Consequently, the representation of a number X is simply

 $$X = \sum_{i=1}^{m} x_i r^{-i}$$

 where the sign of X is just the sign of its most significant non-zero
 digit.
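 EXAMPLE:

 For radix r = 4

 $$D = \{0, 1, 2, 3\}$$
 $$D_{\rho\,\mathrm{MIN}} = \{\bar{2}, \bar{1}, 0, 1, 2\}$$
 $$D_{\rho\,\mathrm{MAX}} = \{\bar{3}, \bar{2}, \bar{1}, 0, 1, 2, 3\}$$

 where the overbar denotes the negative sign, i.e., $\bar{2} = -2$. Then

 $$X = 0.6875_{10} = 0.1011_2 = 0.23_4$$
 $$\phantom{X} = 0.110\bar{1}_2 = 0.3\bar{1}_4 \quad \text{for } x_i \in D_{\rho\,\mathrm{MAX}}$$
 $$\phantom{X} = 1.\bar{1}10\bar{1}_2 = 1.\bar{1}\bar{1}_4 \quad \text{for } x_i \in D_{\rho\,\mathrm{MIN}}$$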
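
 The definitions and the example above can be checked numerically. The
 following is a minimal sketch, not part of the original design; the
 function names digit_set and value and the integer_digit parameter (the
 digit to the left of the radix point, as in the third representation of
 the example) are assumptions introduced for illustration. Exact rational
 arithmetic is used so that the three radix 4 representations and the two
 signed-digit binary representations of 0.6875 can be compared without
 rounding.

```python
from fractions import Fraction
from typing import List, Sequence


def digit_set(r: int, rho: int) -> List[int]:
    """Symmetric digit set {-rho, ..., rho}, requiring r/2 <= rho <= r - 1."""
    if not (r <= 2 * rho <= 2 * (r - 1)):
        raise ValueError("rho must satisfy r/2 <= rho <= r - 1")
    return list(range(-rho, rho + 1))


def value(digits: Sequence[int], r: int, integer_digit: int = 0) -> Fraction:
    """Value of integer_digit . x_1 x_2 ... x_m, i.e. the sum of x_i * r**(-i)."""
    total = Fraction(integer_digit)
    for i, d in enumerate(digits, start=1):
        total += Fraction(d, r ** i)
    return total


x = Fraction(11, 16)                       # 0.6875 in decimal

d_min = digit_set(4, 2)                    # minimally redundant, rho = r/2
d_max = digit_set(4, 3)                    # maximally redundant, rho = r - 1

conventional = value([2, 3], 4)                    # 0.23 in radix 4
max_form = value([3, -1], 4)                       # 0.3(-1) in radix 4
min_form = value([-1, -1], 4, integer_digit=1)     # 1.(-1)(-1) in radix 4
```

 All three values equal 11/16, which illustrates that a redundant digit
 set affords alternate representations of the same numeric value.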
 
 Some of the desirable properties of this type of number representation
 are:

 1) The representation of zero is unique. An algebraic value
    of X = 0 if and only if all $x_i = 0$.

 2) The additive inverse (negation) of an operand is very simply
    achieved by reversing the sign of every non-zero digit
    individually.

 3) The sign of the algebraic value of X is given by the sign
    of the most significant (leftmost) non-zero digit.
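
 The three properties listed above can be verified exhaustively for short
 digit strings. This is a sketch under stated assumptions: the helper names
 value, negate, sign, and check_properties are hypothetical, and digit
 strings are taken to be pure fractions of m digits with
 r/2 <= rho <= r - 1.

```python
from fractions import Fraction
from itertools import product
from typing import List, Sequence


def value(digits: Sequence[int], r: int) -> Fraction:
    """Sum of x_i * r**(-i) for i = 1..m, computed exactly."""
    return sum((Fraction(d, r ** i) for i, d in enumerate(digits, start=1)),
               Fraction(0))


def negate(digits: Sequence[int]) -> List[int]:
    """Property 2: negate by reversing the sign of every digit."""
    return [-d for d in digits]


def sign(digits: Sequence[int]) -> int:
    """Property 3: sign of the most significant non-zero digit (0 if none)."""
    for d in digits:
        if d != 0:
            return 1 if d > 0 else -1
    return 0


def check_properties(r: int, rho: int, m: int) -> bool:
    """Check properties 1 to 3 for every m-digit string over {-rho..rho}."""
    for digits in product(range(-rho, rho + 1), repeat=m):
        v = value(digits, r)
        if (v == 0) != all(d == 0 for d in digits):      # property 1
            return False
        if value(negate(digits), r) != -v:               # property 2
            return False
        if sign(digits) != (v > 0) - (v < 0):            # property 3
            return False
    return True
```

 For example, check_properties(4, 3, 3) examines all 343 three-digit
 strings over the maximally redundant radix 4 digit set. Property 3 holds
 because the digits to the right of any position can contribute strictly
 less than one unit of that position when rho does not exceed r - 1.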
 
 
 APPENDIX II 
 SAMPLE P-D PLOTS FOR DIVISION 
 
 This appendix contains sample P-D plots for on-line division.
 The key in the top, left-hand corner of each plot corresponds to

     BASE:  the radix ($r$)
     RHO:   the maximum quotient digit ($\rho$)
     DELTA: the on-line delay ($\delta$)
     ALPHA: sufficient precision of $rP_j$ for quotient
            digit selection ($\alpha$)
     BETA:  sufficient precision of $D_j$ for quotient
            digit selection ($\beta$)

 The comparison constants for the treads ($rP_j$) are given in the
 right-hand column. The comparison constants for the risers (D) are given
 along the top row. In the case of base 16 several anomalies are apparent:
 1) when an overlap region required more than 50 steps, those steps were
 not plotted; and 2) the comparison constants are not shown due to
 insufficient space.
 
Table A.1 Example of the Algorithm DIVIDE (r=2, K=1)

 $r = 2$, $\delta = 5$, $m = 8$, $D_\rho = \{\bar{1}, 0, 1\}$, $K = 1$

 N = 0.10100011
 D = 0.11110110

 TO INSURE CONVERGENCE - SHIFT N RIGHT ONE BIT
 (R = 0.10101001101...)

 j   D_j          2P_j                                          q_{j+1}   [heading illegible]

 0   0.11110      0.10100                                          1      0.0
 1   0.111101     2(0.10100-0.111101) = -0.10101                  -1      0.00001
 2   0.1111011    2(-0.1011+0.1111011-0.00001) = 0.100111          1      0.000001
 3   0.11110110   2(0.101001-0.1111011) = -0.101001               -1
 4   0.11110110   2(-0.101001+0.1111011) = 0.101001                1
 5   0.11110110   2(0.101001-0.1111011) = -0.101001               -1
 6   0.11110110   2(-0.101001+0.1111011) = 0.101001                1
 7   0.11110110   2(0.101001-0.1111011) = -0.101001               -1
 8   0.11110110   2(-0.101001+0.1111011) = 0.101001                1

 $$R = 2\Big(\sum_{i} q_i 2^{-i}\Big) = 2(0.1\bar{1}1\bar{1}1\bar{1}1\bar{1}1\bar{1}) = 2(0.0101010101) = 0.10101010$$

 
 [Twenty P-D plots appear here in the original; the plotted curves and
 comparison constants are not recoverable from the scan. Each plot has
 DIVISOR (0.00 to 1.00) on the horizontal axis and, for each quotient
 digit q, curves labeled UPPER OF q and LOWER OF q. The keys of the plots,
 in order of appearance, are:]

     BASE   RHO   DELTA          ALPHA   BETA

       2     1    5                4     0
       2     1    [illegible]      4     [illegible]
       2     1    7                4     0
       2     1    8                4     0

       4     2    4                4     3
       4     2    5                3     3
       4     2    6                3     3
       4     2    7                3     3
       4     2    8                3     3

       4     3    3 (?)            3     3
       4     3    4                3     2
       4     3    5                3     2
       4     3    6                3     2

      16     9    4                3     2
      16    10    4                2     2
      16    11    3                2     2
      16    11    4                2     2
      16    12    3                2     2
      16    13    3                2     2
      16    13    4                2     2

 
 APPENDIX III 
 AN ON-LINE RECODING ALGORITHM 
 
 The result digits generated by the algorithms discussed in this 
 dissertation are in a redundant format. Before they can be output from 
 the unit they must be converted to a conventional format. As each 
 redundant result digit becomes available from the result digit selector, 
 it is stored in the full width double bank result register. Given the 
 result in the form 
 
 $$R = \sum_{i=1}^{m} s_i r^{-i}$$

 where

 $$s_i \in \{-\rho, -(\rho-1), \ldots, -1, 0, 1, \ldots, (\rho-1), \rho\}$$

 it must be recoded to the form

 $$R' = \sum_{i=1}^{m} s'_i r^{-i}$$

 where

 $$s'_i \in \{0, 1, 2, \ldots, r-1\}$$

 The "mostly" on-line recoding algorithm is shown in Figure A.1. Note
 that the only information needed to recode the present digit is the overall
 sign of the result ($S_{op}$) and the sign of the next rightmost non-zero
 digit ($S_n$). So, the only time the recoding network fails to produce an
 output on-line is when it encounters a string of zeros in the result.
 
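 [Figure A.1: the flowchart is not recoverable from the scan. Its legend
 reads:]

     $S_{op}$:             sign of the overall result
     [symbol illegible]:   sign of the present result digit
     $S_n$:                sign of the next rightmost nonzero result digit
     $x_i$:                present result digit
     $r$:                  radix

 Figure A.1 The On-line Recoding Algorithm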
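
 Since the flowchart of Figure A.1 is not reproduced here, the following
 is only a sketch of a conventional right-to-left borrow-propagation
 recoding, not the on-line network itself. It assumes a sign-and-magnitude
 output and pure fractional digit strings; the function names recode and
 value are hypothetical. It is consistent with the observation above: once
 the overall sign has been factored out, the borrow into a digit position
 is 1 exactly when the next non-zero digit to its right is negative.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple


def value(digits: Sequence[int], r: int) -> Fraction:
    """Sum of s_i * r**(-i) for i = 1..m, computed exactly."""
    return sum((Fraction(d, r ** i) for i, d in enumerate(digits, start=1)),
               Fraction(0))


def recode(digits: Sequence[int], r: int) -> Tuple[int, List[int]]:
    """Recode signed digits into (overall sign, digits in {0, ..., r-1})."""
    # Overall sign: the sign of the most significant non-zero digit.
    s_op = 0
    for d in digits:
        if d != 0:
            s_op = 1 if d > 0 else -1
            break
    # Work on the magnitude so that the value being recoded is non-negative.
    magnitude = [-d for d in digits] if s_op < 0 else list(digits)

    out = [0] * len(magnitude)
    borrow = 0
    for i in range(len(magnitude) - 1, -1, -1):   # least significant first
        t = magnitude[i] - borrow
        borrow = 1 if t < 0 else 0                # borrow from the next digit
        out[i] = t + r * borrow
    return s_op, out


example = recode([1, -1, -1], 4)   # 0.1(-1)(-1) in radix 4 equals 0.023
```

 This sketch scans from the least significant digit, so it needs the whole
 result before it starts; the algorithm of Figure A.1 is described as
 producing its output on-line except when a string of zeros delays the
 decision.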
 
 
 VITA 
 
 Mary Jane Irwin was born in Cairo, Illinois, in 1949. She
 received a B.S. degree in Mathematics from Memphis State University in
 1970, and M.S. and Ph.D. degrees in Computer Science from the University
 of Illinois at Urbana-Champaign in 1975 and 1977, respectively.
 
 From 1971 to 1977 she was employed as a graduate teaching/research 
 assistant by the Department of Computer Science at the University of 
 Illinois. She was a participant in the Digital Computer Arithmetic 
 Group, headed by Professor James E. Robertson. 
 
 While at Illinois she served as president of the University of 
 Illinois student chapter of the ACM. She is currently a member of ACM and 
 IEEE. 
 
BIBLIOGRAPHIC DATA SHEET

 1. Report No.: UIUCDCS-R-77-873
 3. Recipient's Accession No.:
 4. Title and Subtitle: An Arithmetic Unit for On-Line Computation
 5. Report Date: May, 1977
 6.
 7. Author(s): Mary Jane Irwin
 8. Performing Organization Rept. No.:
 9. Performing Organization Name and Address: Department of Computer
    Science, University of Illinois, Urbana, IL 61801
 10. Project/Task/Work Unit No.:
 11. Contract/Grant No.: NSF DCR 73-07998
 12. Sponsoring Organization Name and Address: National Science Foundation,
     Washington, DC
 13. Type of Report & Period Covered:
 14.
 15. Supplementary Notes:

 
 16. Abstracts: This thesis is concerned with the algorithmic and logic
 design of an arithmetic unit to be used in a computational environment in
 which the basic arithmetic operations satisfy the on-line property; that
 is, to generate the j-th digit of a result (where a digit consists of n
 bits for base 2^n), it is necessary and sufficient to have the operands
 available only up to the j-th digit plus, in the case of division, a
 predetermined number of extra digits which correspond to an "on-line
 delay." Since there is no on-line delay for addition, subtraction, and
 multiplication, the unit can begin generating result digits as soon as one
 digit of each operand has been input. The delay for division is shown to
 be a small, positive, radix dependent constant. To fulfill the on-line
 requirements, a set of left-to-right (most-to-least significant),
 digit-by-digit algorithms has been derived. The existence of such
 algorithms is contingent upon the use of a redundant representation for
 the result digits. These algorithms and a block diagram level
 implementation of the basic arithmetic unit are developed in the thesis.
 The proposed arithmetic unit, capable of performing on-line operations,
 would be extremely useful in many real-time applications. Due to its
 potential for performing sequences of operations in an overlapped fashion
 (pipelining), the unit could provide an effective way to speed up
 execution. Furthermore, it is ideally suited for variable precision
 arithmetic.

 17. Key Words and Document Analysis. 17a. Descriptors:

 digital computer arithmetic
 on-line algorithms
 pipelining
 redundancy
 digit-by-digit algorithms
 large scale integration
 floating-point arithmetic
 
 17b. Identifiers/Open-Ended Terms:
 17c. COSATI Field/Group:
 18. Availability Statement:
 19. Security Class (This Report): UNCLASSIFIED
 20. Security Class (This Page): UNCLASSIFIED
 21. No. of Pages:
 22. Price:

 FORM NTIS-35 (10-70)
 USCOMM-DC 40329-P71
 