UIUCDCS-R-76-797

A STUDY IN THE DESIGN OF AN ARITHMETIC ELEMENT FOR SERIAL PROCESSING IN A LINEAR ITERATIVE STRUCTURE

by Lakshmi Narayana Goyal

May, 1976

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

This work was supported in part by the Department of Computer Science and in part by the National Science Foundation under grant NSF DCR 73-07998, and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical Engineering, 1976.

ACKNOWLEDGEMENTS

I wish to express my deep gratitude to my thesis advisor, Professor James E. Robertson, for his invaluable guidance, insight, constant encouragement and personal friendship during the preparation of this thesis. I would also like to thank Professors D. J. Kuck and S. R. Ray for their advice, friendship and time for many useful discussions. The support of the Department of Computer Science of the University of Illinois, the Atomic Energy Commission of the United States and the National Science Foundation during my studies at the University of Illinois is sincerely appreciated. Many thanks are due to Mr. Stan Zundo of the drafting department for the excellent illustrations, his personal friendship and constant cooperation and to Mr.
Dennis Reed for fast and excellent printing services. The excellent cooperation of Mr. Mark Goebel of the drafting department is very much appreciated. Finally, I wish to thank my wife, Madhu, for her patience, understanding, love and constant encouragement without which this thesis would not have been possible.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
1. INTRODUCTION
1.1 Introduction to LSI Design Constraints
1.2 Arithmetic Unit Structure and LSI
1.2.1 Partitioning of conventional ALU
1.2.2 Two dimensional iterative structures
1.2.2.1 Cellular arrays
1.2.2.2 Table look-up methods
1.2.3 New number system representation
1.3 Present Work
2. ORGANIZATION AND OPERATION OF ARITHMETIC UNIT
2.1 Introduction
2.2 Organization of the Arithmetic Unit
2.3 Organization of Mantissa Processing Logic
2.4 Formal Description of Processing in a PE
2.5 Generalized Example
2.6 The Micro-Instruction Repertoire of the PEs
2.6.1 The inter-register transfer microinstructions
2.6.2 The shift microinstructions
2.6.3 The arithmetic microinstructions
2.6.4 The memory-accessing microinstructions
2.6.5 The miscellaneous microinstructions
3. ARITHMETIC DESIGN AND IMPLEMENTATION CONSIDERATIONS
3.1 Introduction
3.2 Implications of Serial Processing on Arithmetic Design
3.3 Choice of Number System
3.4 Choice of Number Representation and Amount of Redundancy
3.4.1 Signed-digit number representations
3.4.2 Number format and range for mantissa
3.5 Normalization Considerations
3.5.1 Definition and range of normalized numbers
3.6 Arithmetic Microinstructions and Corresponding Digit Algorithms
3.6.1 Simple sum (SS) microinstruction
3.6.1.1 Digit algorithm
3.6.1.1.1 Arithmetic design of RBA-2
3.6.2 Form multiple and add (FMA) microinstruction
3.6.2.1 Digit algorithm
3.6.2.1.1 Algorithm 1
3.6.2.1.2 Algorithm 2
3.6.2.1.3 Design of a multi-input redundant binary adder (MIRBA)
3.6.2.1.3.1 Rohatsch's [39] technique
3.6.2.1.3.2 Log-sum tree technique
3.6.2.1.3.3 Tree-structure using RBA-3s and RBA-2s
3.6.3 Multi-sum (MS) microinstruction
3.6.4 Normalize Recode (NR) microinstruction
3.6.5 Assimilation Recode (AR) microinstruction
4. LOGIC DESIGN OF THE PROCESSING ELEMENT
4.1 Introduction
4.2 Block Diagram Description of a Processing Element
4.2.1 Register file
4.2.2 Logic design of digit processing logic
4.2.2.1 Block diagram description of DPL
4.2.2.2 Choice of logic vector encodings
4.2.2.3 Logic design of RBA-2 (BU)
4.2.2.4 Logic design of a radix-2^k multi-input adder (MIAD)
4.2.2.5 Logic design of digit product generator (DPG)
4.2.2.6 Logic design of digit sum encoder
4.2.2.7 Logic design of selector networks
4.2.2.7.1 Logic design of adder input selector (sADR)
4.2.2.7.2 Logic design of digit sum encoder selector (sDSE)
4.2.2.7.3 Logic design of selectors sRIB, sROB and sTOP
4.2.2.7.4 Storage buffer registers of DPL
4.3 Design of PE Control
4.3.1 Logical organization of PE control
4.3.1.1 Global description of interaction of subcontrols
4.3.2 Logic design of PE control
4.3.2.1 Block diagram description of PE control logic (PCL)
4.3.2.2 Design and description of microinstruction formats
4.3.2.3 Description of subcontrols by control sequence charts
4.3.2.3.1 Control sequence chart conventions
4.3.2.3.2 Description of R-control
4.3.2.3.3 Description of DM-control
4.3.2.3.4 Description of T-control
4.3.2.3.5 Description of F-control
4.3.2.3.6 Description of G-control
4.3.2.3.6.1 Description of G_gn-control
4.3.2.3.6.2 Description of G_ap-control
4.3.2.3.7 Description of E-control
4.4 Logic Complexity of Processing Element
4.4.1 Logic complexity of DPL
4.4.1.1 Gate complexity of digit processing logic DPL
4.4.1.2 Pin complexity of DPL
4.4.1.3 Effect of multiplier digit's redundancy on gate and pin complexity of DPL
4.4.2 Logic complexity of PE control
4.4.2.1 Gate complexity of PE control
4.4.2.2 Pin complexity of PE control
4.4.3 Overall logic complexity of a PE
5. INTERACTION WITH MEMORY
5.1 Introduction
5.2 Organization of Local Operand Mantissa Memory, LOMM
5.3 A Description of Buffer Memory Control
5.4 Size of Buffer Memory
6. IMPLEMENTATION OF MACHINE ARITHMETIC INSTRUCTIONS
6.1 Introduction
6.2 Implementation of 'Machine' Arithmetic Instructions
6.2.1 Global description of the processing of a 'machine' arithmetic instruction
6.2.2 Floating point Addition
6.2.2.1 Mantissa processing microprogram
6.2.2.2 Mantissa overflow correction
6.2.3 Floating point Subtraction
6.2.4 Floating point Multiplication
6.2.4.1 Microprogram for mantissa processing
6.2.5 Floating point Division
6.2.5.1 Microprogram for mantissa processing
6.2.6 Normalization of operands
6.2.7 Assimilation of signed-digit operand
7. SUMMARY AND CONCLUSIONS
7.1 Summary and Discussion of Results
7.2 Suggestions for Further Work
LIST OF REFERENCES
APPENDIX
A-1 ALGEBRAIC DESIGN OF A RIGHT-DIRECTED RECODER TO CHANGE MULTIPLIER DIGIT'S REDUNDANCY FROM δ = 1 TO δ ≤ 2/3
A-2 PRECISION REQUIREMENTS FOR QUOTIENT DIGIT CALCULATION
VITA

LIST OF TABLES

2.1 a of the Microinstructions of the Example of Figure 2.4
3.1 Values of a and a. for Various (k+1)-Input MIRBA Configurations
4.1 Logic Vector Encodings
4.2 Gate Complexity of DPL vs Radix for 1/2 ≤ δ ≤ 1 and LVE_3 Encoding for a Redundant Binary Digit
4.3 Pin Complexity of DPL vs Radix for 1/2 ≤ δ ≤ 1
4.4 Gate Complexity of DPL vs Radix for 1/2 ≤ δ ≤ 2/3 and Encoding LVE for a Redundant Binary Digit
4.5 Pin Complexity of DPL vs Radix for 1/2 ≤ δ ≤ 2/3
4.6 Gate Complexity of Various Subcontrols of TCS
7.1 Values of a and a. when the multiplier/quotient digit redundancy ratio is 1/2 ≤ δ ≤ 2/3
A.1 Values of ft vs Radix and Redundancy Ratio of a Quotient Digit

LIST OF FIGURES

1.1 Block Diagram of a Basic Model of Limited Connection Arithmetic Unit
2.1 Global Block Diagram of Arithmetic Unit
2.2 The Organization of the Limited Connection Mantissa Processing Logic
2.3 The Distribution of Operands Digits in the PEs of Mantissa Processing Logic
2.4 Illustration of the Execution of the Generalized Example in Mantissa Processing Logic
2.5a Illustration of Processing for Microinstruction vi- and a = 2
2.5b Illustration of Processing for Microinstruction vi- and a = 1
3.1 Illustration of Digit Algorithm for Microinstruction SS
3.2 Arithmetic Structure of an RBA-2
3.3 Functional Representation of Microinstruction FMA
3.4 Functional Representation of the Digit Algorithm for FMA
3.5 Functional Representation of Transformation f~
3.6 A Redundant Binary Product Matrix
3.7 Illustration of Adjacent Overlapping Product Matrices and 'Collective Product Transfer, CPT'
3.8 Illustration of the Implementation of Algorithm 1 of Microinstruction FMA, using Redundant Binary Product Matrix Generator (Radix = 16)
3.9 Illustration of the Implementation of Algorithm 2 of Microinstruction FMA using ROMs (Radix = 16)
3.10a Illustration of the Algebraic Design of a MIRBA, using First Order Simple Transformations only
3.10b Illustration of the Algebraic Design of a MIRBA using Higher (≥ 2) Order Simple Transformation
3.10c Algebraic Design of Bottom Level (Level 4) Box of Figure 3.10a
3.11 Illustration of Log-Sum Tree Structure for a MIRBA using RBA-2s only (k = 4)
3.12 Arithmetic Structure of an RBA-3
3.13 Illustration of Tree Structure for a MIRBA using RBA-2s and RBA-3s (k = 4)
3.14 Illustration of Digit Algorithm for Microinstruction MS
3.15 Flowchart of the Digit Algorithm for Microinstruction NR
3.16 Flowchart of the Digit Algorithm for Microinstruction AR
4.1 Block Diagram of a Processing Element
4.2 Block Diagram of the Register File of the PE
4.3 Block Diagram of Digit Processing Logic (DPL)
4.4 Algebraic Design of a 2-input Redundant Binary Adder (RBA-2)
4.5 Schematic Functional Diagram of an RBA-2 using LVE
4.6 Logic Implementation of an RBA-2 using Logic Vector Encoding LVE_1 (Version 1)
4.7 Logic Implementation of an RBA-2 using Logic Vector Encoding LVE_1 (Version 2)
4.8 Logic Implementation of an RBA-2 using Logic Vector Encoding LVE_2
4.9 Logic Implementation of an RBA-2 using Logic Vector Encoding LVE_3
4.10 Schematic Diagram of a Radix-2^k (k = 4) Multi-input Adder (MIAD)
4.11a Schematic Diagram of Square Array DPG
4.11b Illustration of 'Adjacent Generation' of t^p_i
4.11c Illustration of 'Local Generation' of t^p_i
4.11d Illustration of a Combination of an MIAD and DPG using 'Local Generation' of t_j
4.12a Block Diagram of Digit Sum Encoder (DSE)
4.12b Logic Network Realization of RBTC
4.12c Logic Network Realization of TCSM (o_i = P_i)
4.13a Logic Implementation of Selector sADR for Magnitude Bits
4.13b Logic Implementation of Selector sADR for Sign Bits
4.14 Logic Implementation of Selector sDSE
4.15 Logic Implementation of Selector sRIB
4.16 Logic Implementation of Selector sTOP
4.17 Logic Implementation of Selector sROB
4.18 Logic Organization of PE Control Signal Generator
4.19 Block Diagram of PE Control Logic
4.20 Microinstruction Codes and Formats
4.21 Control Sequence Chart Symbols
4.22 Control Sequence Chart for R-control
4.23a Control Sequence Chart for DM-control, Part I
4.23b Control Sequence Chart for DM-control, Part II
4.23c Control Sequence Chart for DM-control, Part III
4.24 Control Sequence Chart for T-control
4.25 Control Sequence Chart for F-control
4.26 Control Sequence Chart for G_gn-control
4.27 Control Sequence Chart for G_ap-control
4.28 Control Sequence Chart for E-control
4.29 Illustration of the Effect of NAF Recoded Multiplier Digit on Number of Inputs to MIRBA of Radix-2^k (k = 7) Adder
5.1 Block Schematic Diagram of Local Operand Processor Memory
5.2 Structure of Control Table in LOMCO
6.1 A Pictorial Representation of the Flow of the Sequence of Microinstructions for Mantissa Processing Logic for Multiplication

LIST OF ABBREVIATIONS

APR  Adjacent Operand digit Register
AR  'Assimilation Recode' Microinstruction
BU  Borovec Unit - A 2-input redundant binary adder
DMM  Data Main Memory
DPG  Digit Product Generator
DPL  Digit Processing Logic
DSE  Digit Sum Encoder
ECU  Exponent Control Unit
EPL  Exponent Processing Logic
FMA  'Form Multiple and Add' Microinstruction
IBR  Register File Input Bus Buffer Register
INR_i  ith Internal Register (File Register)
GACU  Global Arithmetic Control Unit
GIR  G-information Input Buffer Register
LDR  LOMM Data Register
LOEM  Local Operand Exponent Buffer Memory
LOMCO  Local Operand Memory Controller
LOMM  Local Operand Mantissa Buffer Memory
LPM  'Load PE from PEM' Microinstruction
LVE_i  ith Logic Vector Encoding (i = 1, 2, 3)
MATD  Multi-input Adder's 'Transfer' Decoder
MATE  Multi-input Adder's 'Transfer' Encoder
MCU  Mantissa Control Unit
MIAD  Multi-input Adder Network
MIR  Microinstruction Register
MIRBA  Multi-input Redundant Binary Adder
MPL  Mantissa Processing Logic
MS  'Multi Sum' Microinstruction
NR  'Normalization Recode' Microinstruction
PE  Processing Element
PEM  Processing Element Memory
RBA-2  Two input Redundant Binary Adder
RBA-3  Three input Redundant Binary Adder
RB  Redundant Binary Encoded Format for Radix-r digit
RIP_i  Register Input Port for ith Processing Element
ROP_i  Register Output Port for ith Processing Element
RS  'Right Shift' Microinstruction
SM  Sign Magnitude logic encoding format for a redundant binary digit
SM  Sign Magnitude Encoding format for a radix-r signed digit
SPM  'Store into PEM from PE' Microinstruction
sRIB  Selector for Register File Input Bus
sROB  Selector for Register File Output Bus
SS  'Simple Sum' Microinstruction
sTOP  Selector for 'Transfer Output Port'
sADR  Selector for data input to Adder Network MIAD
sDSE  Selector for input to Digit Sum Encoder, DSE
TD  'Transfer Direct' Microinstruction
TI  'Transfer with Inverted Sign' Microinstruction
TIP_i  Adder Transfer Input Port for ith PE
TOP_i  Adder Transfer Output Port for ith PE

1. INTRODUCTION

1.1 Introduction to LSI Design Constraints

The advent of large scale integration (LSI) technology for the manufacture of logic circuits has posed a new challenge to computer system and logic designers. The challenge is to find ways to make efficient use of the full potential of LSI (reliability, lower cost and improved speed) in the design of digital computers. LSI technology has peculiar constraints which have important implications for its effective use in future systems. The constraints and implications can be broadly classified into two categories: external or system level considerations, and internal or logic circuit level constraints.

With LSI, hundreds of logic functions can be fabricated on subminiature substrates. Since the initial development cost is very high, it is important that only a small number of standard elements be developed, so that the initial cost of development is amortized over many units. However, designing universal elements of the complexity offered by LSI is very difficult. A potential benefit of LSI that has been continually cited is an increase in reliability over current systems. Since system reliability is inversely proportional to the number of module interconnections, it is important that LSI devices have a high gate-to-pin ratio.
But the idea of universality of LSI devices and a high gate-to-pin ratio are conflicting, in that the latter tends to give the device a unique personality, so that it cannot be used repetitively in a system.

At the internal level, designing logic for integration on the chip requires a reorientation of the relative values placed on the resources used to realize the design. One of the severest constraints in the design of an LSI device is the restriction on interconnections on the chip itself. This is due to the limited wiring area available, the limited number of planes to which all of the wiring must be confined, and a host of other topological considerations which combine to determine the locations of candidate points for interconnections. The required wiring can be reduced by forcing the logic design of the chip into a cellular or regular structure. A regular structure has very important implications:

a) It facilitates every step of the LSI manufacturing process by making it possible to perform relatively simple tasks repetitively. Mask making can be facilitated by the repetitive structure.

b) It is possible to design and optimize a simple cell to achieve the most function per dollar, but a large chip of random gates is impossible to optimize because of the number of variables involved.

c) Testing of LSI devices is a major cost factor. The generation of test algorithms for simple cells and for regular and repetitive structures is easier.

d) In addition, as the yield increases due to technology improvements, larger devices can be made out of the same simple cells.

Another limitation of LSI which must be considered in logic design is that of external connections. More pins require more external gates to drive the capacitance of the external pins. The increased number of gates, and the higher current they require, raise the temperature of the device, which increases its failure rate.
Semiconductor memories meet most of the requirements of LSI technology, and the present use of LSI in computer systems is in the form of these memories for system enhancement applications [1], [2], [3]. These applications include the use of RAMs as scratch-pad memories and as cache memories to reduce main memory requests and thus increase the computational throughput. The use of ROMs for microprogramming, table look-up operations and hardwired subroutines also increases performance at relatively little cost. Content addressable memories, queues and stacks can greatly simplify building and maintaining tables and greatly reduce system overhead and software costs.

The use of LSI in the design of the central processor itself involves proper logic partitioning. Logic partitioning involves organizing the internal logic structure such that large functional arrays on a chip can be used repetitively. Two partitioning methods are bit-slicing and functional partitioning. Bit-slicing [4], [5] tends to be system dependent and not universal and thus is suitable for custom LSI only. In functional partitioning, the machine is structured into modules, wherein each module consists of a completely self-contained processor having local storage, some processing logic and the control necessary for the module to execute its function. Each module acts as a small insular unit of logic. The module control sees only its own state, and the requirements for communication outside the module are correspondingly reduced. Excellent examples of functional partitioning are RCA's LIMAC and the Macromodular computers [6], [7].

In this thesis, we report on a study of the logical organization and design of an Arithmetic Unit which is capable of performing the four basic operations of addition, subtraction, multiplication and division.
The organization and design of the Arithmetic Unit are influenced by the LSI technology constraints of modularity, the least number of different module types, structural regularity of the module, limited pin count and limited fan-out capability.

In the rest of this chapter, we first very briefly review the various proposals suggested for the Arithmetic Unit and its LSI implementation. This is followed by a brief introduction of the model chosen for study in this research, and by the scope and an overview of the thesis.

1.2 Arithmetic Unit Structure and LSI

The proposals suggested for the architecture of LSI-implementable Arithmetic Units can be broadly classified into three categories, namely:

a) partitioning of the conventional ALU which uses standard binary number representation,

b) two dimensional iterative (cellular) structures and table look-up methods, and

c) use of number representations different from conventional binary.

It must be noted that these three categories are not mutually exclusive but are rather interrelated. This classification is used here simply for ease of exposition.

1.2.1 Partitioning of conventional ALU - A low performance, parallel basic ALU essentially consists of registers (the accumulator, the M-Q register, and the registers for parallel shifting), a full adder, circuitry for complementing and shifting, and some control logic to coordinate their activity for arithmetic and logic control. For efficient operation it is necessary to allow flexible and rapid transfer of information from any one register to another. In binary arithmetic, except for the end bits, it is possible to partition all the circuitry associated with one bit, along with some local decoder bits for gating functions, into one LSI cell [8]. Thus, all the circuitry for data transfer and manipulation operations can be assembled into identical cells that provide a fairly good gate/pin ratio.
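The bit-slice partitioning described above can be illustrated with a minimal sketch in Python. It models only the adder portion of a slice, and the names `bit_slice` and `sliced_add` are hypothetical, chosen for this illustration; registers, shifting and gating logic are omitted. The point is that one cell definition is reused for every bit position, with the carry as the only signal passed between neighbouring cells.

```python
def bit_slice(a_bit, b_bit, carry_in):
    """One identical cell: a full adder for a single bit position."""
    total = a_bit + b_bit + carry_in
    return total & 1, total >> 1  # (sum bit, carry out)


def sliced_add(a, b, width):
    """Chain `width` identical slices; the carry ripples from slice to slice."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = bit_slice((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry  # (width-bit sum, carry out of the top slice)


print(sliced_add(5, 9, 8))   # an 8-slice adder
print(sliced_add(11, 6, 4))  # a 4-slice adder, with a carry out of the top slice
```

The rippling carry in this sketch is also what limits the speed of a purely sliced design: each slice must wait for the carry of its less significant neighbour.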
A classic example of this approach is the Texas Instruments LSI airborne computer, the model 2502. However, this bit-slicing approach breaks down for a high performance ALU using the conventional binary number system, when circuitry for fast carry generation and propagation is added to the ALU. Raytheon [9] has combined four bit slices into one LSI module so that the look-ahead logic can be used on the four bits of this module. But this still does not provide the flexibility for unlimited carry look-ahead with only one type of module.

To achieve high performance with conventional number systems, the various slices of the ALU work in synchro-parallelism [10] and are controlled by signals broadcast from the central control logic. Since the control functions are more difficult to modularize than functions related to data operations, the micromemory control technique is used for mapping the irregular and diverse algorithms for arithmetic control into the regular structure of a memory. However, for large word lengths, the broadcasting of control signals is not compatible with the LSI constraints of low fan-out and neighborhood connections only. To overcome this problem of control irregularity and broadcasting, many combinational two-dimensional iterative structures (cellular arrays) have been proposed for multiplication, division and other arithmetic functions such as square root.

1.2.2 Two dimensional iterative structures - Two dimensional iterative structures are memory-like structures and admirably satisfy the LSI constraints. From the arithmetic unit point of view, they can be further classified into two sub-categories: cellular arrays and table look-up methods.

1.2.2.1 Cellular arrays - A cellular array is a two-dimensional iterative configuration of identical cells, each of which contains both logic and storage and is connected mainly to its immediate neighbors. Such an array, therefore, has the form of a memory array that is enhanced with logic at each digit position.
A cellular array is a spatial analog of the temporal sequence of steps of the control algorithm; i.e., the cellular array performs the same sequence of computations iteratively in space rather than in time. Cellular arrays can be either dedicated exclusively [11] to one arithmetic function or programmable [12] so that they can be used for many functions.

Since multiplication processes are characterized by the basic algorithm of add/no add followed by shift, multiplication arrays differ mainly in the interconnection of the cells, which is chosen to speed up the effective addition time of the partial products. Some use a tree adder structure [13], while others use carry save adders [14] in the basic cells to avoid carry propagation at every stage, so that carry propagation occurs only at the last stage. Most arrays assume that the operands are either positive or in sign magnitude representation, with the sign of the product being determined separately. A negative multiplier in 2's complement representation needs a correction to the product obtained by the simple "add/no add and shift" algorithm, and this makes the interconnection of the cells in the array somewhat irregular. A cellular array for multiplication has been suggested [15] which makes use of multiplier recoding and a conditional adder/subtracter cell, so that either addition or subtraction of the shifted multiplicand to or from the partial product can take place. This does not require a final correction to the product. However, the recoding array is structurally different from the multiplication array, and thus two types of functional arrays are needed to generate the final product. More recently, Baugh and Wooley [16] have proposed a cellular multiplier in which the correction is not necessary.

Similarly, the cellular array for the division operation uses the basic binary restoring or non-restoring algorithm to produce the quotient.
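As a rough illustration of the non-restoring algorithm just mentioned, the following Python sketch divides non-negative integers. It is an assumption-laden software model, not a description of any of the cited arrays: the function name `nonrestoring_divide`, the restriction to a positive divisor, and the final correction step are choices made for this example. Each loop iteration corresponds to one row of cells; the row adds or subtracts the shifted divisor depending only on the sign of the current partial remainder.

```python
def nonrestoring_divide(dividend, divisor, n):
    """Non-restoring division of non-negative integers, n quotient steps.

    Assumes divisor > 0 and 0 <= dividend < divisor * 2**n.
    Returns (quotient, remainder) with 0 <= remainder < divisor.
    """
    assert divisor > 0 and 0 <= dividend < (divisor << n)
    remainder, quotient = dividend, 0
    for shift in range(n - 1, -1, -1):
        if remainder >= 0:
            # Partial remainder non-negative: subtract, quotient digit +1.
            remainder -= divisor << shift
            quotient += 1 << shift
        else:
            # Partial remainder negative: add back, quotient digit -1.
            remainder += divisor << shift
            quotient -= 1 << shift
    if remainder < 0:
        # Final correction so that the remainder is non-negative.
        remainder += divisor
        quotient -= 1
    return quotient, remainder


print(nonrestoring_divide(100, 7, 8))
```

With a positive divisor, comparing the signs of divisor and partial remainder reduces to testing the sign of the partial remainder, which is the only decision each step makes.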
The interconnection structure for the end cells, which are used for comparing the signs of the divisor and the partial remainder, is again different from the rest of the cell interconnection [17]. For fixed point operations, a special step may be necessary at the end to generate a remainder of the same sign as the dividend. In addition to the dedicated arrays for each arithmetic operation, a programmable array suitable for both multiplication and division has also been proposed recently. Here the most significant cell of each row is more complicated and acts either as a multiplier recoder cell or as a comparator cell for comparing the signs of the divisor and partial remainders, depending upon the arithmetic operation [18].

For operands of large word length, the cellular array contains more cells than can reliably be implemented on a single silicon slice, and hence it is subdivided into subarrays which are externally connected. Such an array made up of subarrays will take more time to generate the final result than a fully iterative monolithic array on a single silicon slice.

Cellular arrays can be either synchronous or asynchronous in operation. An asynchronous cellular multiplier for vector or pipelined mode operations has been proposed by Bjorner [19]. For arrays using conventional number systems, the problem of carry propagation along a row of cells still remains. Thompson [20] and Chen [10] have suggested using the cellular array in a diagonally-timed fashion such that digit level pipelining takes place in two dimensions, giving a higher computational throughput.
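The role of carry save adders in a multiplication array, where carry propagation along a row is avoided until the last stage, can be sketched in Python as follows. This is an illustrative model for non-negative operands only; `carry_save_add` and `array_multiply` are hypothetical names, and whole machine words stand in for rows of cells.

```python
def carry_save_add(x, y, z):
    """3-to-2 reduction: returns (sum word, carry word) with x + y + z == s + c.

    Every bit position is computed independently, so no carry ripples
    along the row.
    """
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def array_multiply(multiplicand, multiplier, width):
    """Add/no-add and shift multiplication with one carry-save row per multiplier bit."""
    s, c = 0, 0
    for i in range(width):
        # Row i adds the shifted multiplicand, or zero for a "no add" row.
        partial = (multiplicand << i) if (multiplier >> i) & 1 else 0
        s, c = carry_save_add(s, c, partial)
    return s + c  # the only carry-propagating addition


print(array_multiply(13, 11, 4))
```

Note that rows corresponding to zero multiplier bits still exist in this model and merely pass the sum and carry words along.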
Although cellular arrays do satisfy the structural needs of logic circuits for LSI technology, a few shortcomings can be summarized as follows:

i) Due to the use of a conventional ripple carry adder/subtracter in the basic cells, the total number of cells needed is always equal to that necessary for a double length product, although in practice most often only the single length product is desired. This is because the carry ripples from the least significant end to the most significant bit position; if the cells that contribute to the less significant part of the double length product (for fractional mantissas) do not exist, then the most significant part of the product will be in error, and this error becomes very acute for operands of large word length. These otherwise unnecessary cells raise the cost of the array.

ii) In cellular arrays, as many rows of adder/subtracter basic cells are needed as there are multiplier bits or quotient bits. But effectively, the rows corresponding to "zero" multiplier or quotient bits serve no useful purpose (except possibly shifting). These unnecessary cells not only add to the cost by using large amounts of silicon area, but they also increase the probability of faults on the chip and make the testing of the chip more expensive.

iii) For large word lengths where subarrays have to be externally connected, other subarrays cannot be added to expand the word length without extensive changes in the external wiring. Thus the expandability of two-dimensional structures is poor.

1.2.2.2 Table look-up methods - The structural regularity of memories makes them very suitable for implementation in large-scale integration. All the logic and arithmetic operations in the machine can be performed by extensive table look-up operations. Table look-up operations can be done either in parallel or in a serial fashion.
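As a rough sketch of the serial variant, the following Python fragment performs addition one 4-bit digit at a time using a single precomputed table. The digit width, the names `ADD_TABLE` and `table_add`, and the use of a dictionary in place of a memory are all assumptions made for illustration. The table is indexed by two operand digits and the incoming carry, so it has 16 x 16 x 2 = 512 words regardless of the operand length.

```python
RADIX_BITS = 4
RADIX = 1 << RADIX_BITS

# Precomputed table: (digit a, digit b, carry in) -> (sum digit, carry out).
ADD_TABLE = {
    (a, b, cin): ((a + b + cin) % RADIX, (a + b + cin) // RADIX)
    for a in range(RADIX)
    for b in range(RADIX)
    for cin in (0, 1)
}


def table_add(x, y, num_digits):
    """Digit-serial addition: one table look-up per digit position."""
    result, carry = 0, 0
    for d in range(num_digits):
        a = (x >> (d * RADIX_BITS)) & (RADIX - 1)
        b = (y >> (d * RADIX_BITS)) & (RADIX - 1)
        s, carry = ADD_TABLE[(a, b, carry)]
        result |= s << (d * RADIX_BITS)
    return result, carry


print(len(ADD_TABLE))
print(table_add(0x1234, 0x0FCD, 4))
```

The same table serves operands of any length; only the number of look-ups grows with the word length.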
Parallel operations, however, require too large a table for any reasonable word length and are out of the question. However, tables for bit parallel and byte serial operations can be reasonably implemented for arithmetic operations like addition and multiplication, because the number of words required grows exponentially with n, where n is the operand width in bits, and n is small for a single byte. A functional memory based on an associative array composed of writeable storage cells capable of holding three states (0, 1, and don't care) has been proposed by Gardner [21]. Here the logic is performed by associative table look-up, and the "don't care" state is used to give significant compression of the tables over conventional two-state arrays. Typically, the number of words necessary for a functional memory grows only polynomially with n, whereas it grows exponentially with n for conventional two-state arrays. In fact, such a functional memory has been suggested as a nucleus and the building block for the whole machine.

Lee, et al. [22] and Crane, et al. [23] have proposed a distributed logic memory structure which is suitable for LSI implementation. Although they suggested this structure for nonarithmetical logic operations, arithmetic can be performed on it by the bit serial table look-up method. This method is too slow when operating on scalar operands. For vector operands, however, the arithmetic operation proceeds simultaneously on all components of the vector operands, and thus the inherent slowness of the bit serial table look-up method is masked by this parallelism. Bit serial processing is used in Goodyear's STARAN computer [24].

1.2.3 New number system representation - One of the main obstacles to the partition of existing arithmetic processors which use conventional binary representation (radix complement or sign magnitude) into identical subunits is the fact that the most significant digit behaves differently from the rest of the digit positions.
Radix complement/diminished radix complement notation causes the control and structure of the most significant and least significant digits to be different from the rest of the digits, due to such things as the carry-in to the least significant digit, end-around carries and special circuitry for logical and arithmetical shifts. These factors preclude the chaining of accumulator modules to any desired length. Moreover, the radix or diminished radix complement notation causes problems both in the multiplication process (e.g., a correction factor) and in the division process. Sign magnitude notation is convenient for multiplication and division because the sign of the result can be readily determined, but addition and subtraction need a complicated sequential control algorithm for determination of the sign of the result. All this difficulty can be traced directly back to the limitations imposed by the requirement for knowledge of the sign and magnitude of the operands and the result. This knowledge, by definition, is available on a word level rather than a digit level.

So, it would be preferable to have a number representation where each digit position carries both magnitude and polarity information, unlike normal binary where the most significant bit carries sign but no magnitude information and the other bits carry only magnitude but no sign information. This removes the above limitation, since a priori or a posteriori knowledge of the operand or result magnitudes and polarities is not necessary. It makes it possible to perform arithmetic on a digit ("stage" in a machine) basis rather than on a number ("register" in a machine) basis. That is, an arithmetic operation on corresponding digits of a pair (or more) of numbers becomes invariant with respect to the polarities of the two (or more) numbers in which they are separately imbedded.
This results in two independent but important implications: one, that a true variable word length operation is completely practicable, permitting modular construction in terms of quantity of digit positions; and two, that simultaneous operations on multiple (two or more) operands are also practicable, permitting modular construction in terms of number of operands.

The sign information with each digit position in a number can be provided either implicitly or explicitly. In a positional weighted number system, a negative radix implies [25] indirectly a sign associated with each digit position (positive for odd positions and negative for even positions, for integers). An example of explicit sign information with each digit is Avizienis' signed-digit number representation [26]. These two approaches can be utilized to design computational modules for each digit position, which can then be used to perform arithmetic either in a purely combinational logic net (array) or with a sequential control algorithm. Shaipov [27] and Prangishvilli have proposed cellular arithmetic arrays using a basic computation module based on a minus-two adder system. Avizienis and Tung [28], [29] have proposed a universal arithmetic building element (ABE) to be used in a combinational logic net to perform arithmetic functions. Pisterzi [30] has utilized the explicit signed-digit representation to design a limited connection arithmetic unit with a central global control which provides the temporally sequential commands to the various modules to achieve the arithmetic functions.

The negative base number system, while facilitating the addition/subtraction and multiplication processes at a digit level, makes the division process very complicated.
In any restoring or non-restoring division algorithm [31], [32], [33], the signs of the partial remainder and the divisor are essential, and the negative base number representation does not lend itself to easy determination of the sign of an operand because one has to go through a counting process to know whether the most significant digit of the integer representation is in an even position or an odd position. Further, for faster addition/subtraction, one still needs the carry look-ahead circuits.

Avizienis' proposed number representation, besides being signed-digit, is also redundant; i.e., each digit position can take more than r values, where r is the radix of the number representation. This number system has many desirable features, namely,

i) The algebraic value Z of the number composed of n + m + 1 digits $(z_{-n} \ldots z_{-1} z_0 \, . \, z_1 \ldots z_m)$ is given by the expression:

$$Z = \sum_{i=-n}^{m} z_i \, r^{-i}$$

ii) The algebraic value Z = 0 if and only if all $z_i = 0$.

iii) The sign of the algebraic value Z is given by the sign of the most significant (left-most) nonzero digit.

iv) To form the representation of the additive inverse -Z, the sign of every nonzero digit $z_i$ is changed individually.

v) The addition and subtraction of two signed-digit operands Z and Y satisfy $s_i = f(z_i, y_i, z_{i+1}, y_{i+1})$ for all positions i, where $s_i$ are the digits in the representation of the sum or difference S = Z ± Y. This means that there are no carry propagation chains in signed-digit additions (or subtractions).

vi) The same logic that is used for adding two numbers (maximally redundant) can be used to convert the number from conventional binary representation to the signed-digit format.

vii) It allows limited inspection of partial remainder digits to determine the quotient.

The properties (iv) to (vi) make the signed-digit redundant number representation very suitable for digit-wise operation of the arithmetic unit.
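Properties (i), (iii), (iv) and (v) can be illustrated with a minimal sketch. The radix 10 and the digit set {-6, ..., 6} are assumptions chosen only for illustration, and the addition rule shown is one well-known totally parallel signed-digit rule for that digit set; it is not claimed to be the logic of the arithmetic unit studied here. All function names are hypothetical.

```python
from fractions import Fraction

RADIX = 10       # assumed radix, for illustration only
MAX_DIGIT = 6    # assumed digit set {-6, ..., 6}


def sd_value(digits, n=0, radix=RADIX):
    """Property (i): Z = sum of z_i * r**(-i), i = -n..m; digits[0] is z_{-n}."""
    return sum(Fraction(d) * Fraction(radix) ** (n - k)
               for k, d in enumerate(digits))


def sd_sign(digits):
    """Property (iii): sign of the most significant nonzero digit (0 if none)."""
    for d in digits:
        if d != 0:
            return 1 if d > 0 else -1
    return 0


def sd_negate(digits):
    """Property (iv): change the sign of every digit individually."""
    return [-d for d in digits]


def sd_add(z, y, radix=RADIX, max_digit=MAX_DIGIT):
    """Property (v): s_i depends only on positions i and i+1.

    Thresholds are valid for radix 10 with digits in -6..6. The result has
    one extra leading digit, so it has n + 1 integer positions.
    """
    assert len(z) == len(y)
    transfers, interim = [], []
    for zi, yi in zip(z, y):
        p = zi + yi
        t = 1 if p >= max_digit else (-1 if p <= -max_digit else 0)
        transfers.append(t)              # transfer digit sent one position left
        interim.append(p - radix * t)    # interim sum, magnitude at most 5
    incoming = transfers[1:] + [0]       # transfer arriving from position i+1
    return [transfers[0]] + [w + t for w, t in zip(interim, incoming)]


z = [3, -6, 5]       # 3 - 0.6 + 0.05 = 2.45
y = [4, -5, 6]       # 4 - 0.5 + 0.06 = 3.56
s = sd_add(z, y)     # [1, -4, 0, 1], i.e. 10 - 4 + 0 + 0.01 = 6.01
```

Every sum digit is formed from the operand digits at its own position and the position immediately to its right, so no carry chain arises regardless of operand length.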
Property (iii) obviates the need for any complement arithmetic operations.

Based on the above number representation, Avizienis [28] proposed his Arithmetic Building Element (ABE), which has the capability of adding two digits of the two operands to be added, forming the product of two digits, and forming a sum of many digits, one from each of several operands, besides having the capability of performing logical operations on individual digits. He proposed this element for use in combinational arrays for forming the product of two numbers. But since the ABE can form a sum of only m ≤ r+1 digits, it becomes necessary to partition the product of two numbers, where the multiplier is more than r+1 digits long, into groups of r+1 digits so that the same kind of ABE can be used to form the whole product. Secondly, the proposed ABE is too complex for any reasonable radix r. Thirdly, the combinational net for the division process is very complex and expensive.

For digit-wise arithmetic operations, mention should also be made of the residue number representation [34], which allows addition/subtraction and multiplication on a digit basis. However, handling of overflow/underflow, conversion of conventional binary representation to residue number representation, and, of course, the division process are very complicated, and that is why not many computers have been built based on this number representation. Moreover, because the modulus for each digit position is different, all the digit modules are necessarily different and are not compatible with the LSI constraint of using few module types.
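The digit-wise nature of residue arithmetic, and the overflow difficulty mentioned above, can be seen in a small sketch. The moduli (3, 5, 7) and all function names are assumptions made only for this illustration.

```python
from math import prod

MODULI = (3, 5, 7)        # assumed pairwise coprime moduli, illustration only
RANGE = prod(MODULI)      # 105 distinct values can be represented


def to_residues(x, moduli=MODULI):
    """One residue digit per modulus."""
    return tuple(x % m for m in moduli)


def residue_add(a, b, moduli=MODULI):
    """Each digit position is added independently of all the others."""
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def residue_mul(a, b, moduli=MODULI):
    """Each digit position is multiplied independently of all the others."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


# 8 + 11 = 19 and 8 * 11 = 88 both lie within the range 0..104.
assert residue_add(to_residues(8), to_residues(11)) == to_residues(19)
assert residue_mul(to_residues(8), to_residues(11)) == to_residues(88)

# 20 * 11 = 220 exceeds the range; the digits silently represent 220 mod 105.
assert residue_mul(to_residues(20), to_residues(11)) == to_residues(220 % RANGE)
```

Note that each position needs arithmetic modulo a different number, which is the source of the module-type incompatibility noted above, and that nothing in the digits themselves signals the overflow in the last line.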
1.3 Present Work

The goal of the present work is to formulate a set of desirable characteristics for an LSI implementable Arithmetic Unit capable of the four basic operations of Addition, Subtraction, Multiplication and Division, to choose a suitable system and logical organization which comes close to meeting these desirable properties, and finally to study the arithmetic and logic design of the arithmetic unit.

From our discussion in Sections 1.1 and 1.2, the following set of characteristics for the Arithmetic Unit are considered suitable for its implementation in LSI or any batch fabrication process technology.

i) The arithmetic unit should be partitionable on a bit slice or digit slice (for higher radix) basis, which means that we should be able to perform calculations on a digit-by-digit basis. All the digit processing modules should be identical so that a variable word length can be accommodated.

ii) Purely combinational cellular arrays are too expensive for large operand lengths, especially when each cell is rather complex. Hence, the arithmetic function execution should be done by a time sequence of microinstructions. Further, to achieve a balance between the high cost of a purely combinational array and the slow speed of completely sequential execution of microinstructions, some form of pipeline structure should be employed so that when an arithmetic expression is evaluated, the various arithmetic operations can be overlapped.

iii) To avoid fan-out problems in case of large operand lengths, the various modules should have limited intercommunication with each other.

iv) Each processing module should have local control and be autonomous as far as possible so that only a few microinstructions need to be issued by a central control to the modules, instead of a large number of separate control signals. This would cut down the number of external leads necessary on each module.

v) The various microinstructions should be as simple as possible.

vi) Each processing module must be consistent with the constraints of large scale integration insofar as the total external pin count of the module is concerned, and the module itself should preferably be made up of cells (identical logic repeated) when the cell consists of many gates.

vii) Since the divide process by its very nature has to examine the most significant digits of the operands (dividend/partial remainder, and divisor) for the calculation of the quotient, multiplication and addition/subtraction should also be performed as a right-directed process. This most-significant-digit-first approach is consistent with the other arithmetic processes of operand normalization, mantissa overflow determination and the determination of the sign of the result, because these processes inherently require the examination of the most significant digits of the operands to determine what additional processing is necessary.

Many of the characteristics mentioned above are met by an Arithmetic Unit structure proposed by Pisterzi [30]. The Arithmetic Unit consists of modular processing elements called the Digit Processing Units (DPUs) and a global control module called the Primitive Control Unit (PCU). (DPU and PCU are the terminology used by Pisterzi [30]. In the present thesis, the terms PE and MCU will be used for the Processing Element and the global control module respectively.) The PCU does not broadcast control signals to all the DPUs; instead, the PCU communicates only with the most significant DPU as far as the issuance of microinstructions is concerned. The first DPU executes each microinstruction and then passes it on to the second DPU, which again executes this microinstruction and passes it down to the next DPU, and so on. Thus, a sort of pipeline of microinstructions is established where the same sequence of microinstructions is executed in each DPU.
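The flow of microinstructions described above can be sketched as a small simulation. This is a deliberately simplified model: it assumes that a microinstruction moves one DPU to the right per time step and that a DPU executes it on arrival, ignoring any information a DPU may need from its neighbors. The function name and the labels "u1" and "u2" are hypothetical.

```python
def run_pipeline(microinstructions, num_dpus):
    """Return (time, dpu_index, microinstruction) triples in execution order."""
    latches = [None] * num_dpus          # microinstruction held by each DPU
    pending = list(microinstructions)    # not yet issued by the control unit
    trace = []
    time = 0
    while pending or any(item is not None for item in latches):
        # Everything moves one DPU to the right; the control unit feeds
        # only the first (most significant) DPU.
        latches = [pending.pop(0) if pending else None] + latches[:-1]
        for index, instr in enumerate(latches, start=1):
            if instr is not None:
                trace.append((time, index, instr))
        time += 1
    return trace


trace = run_pipeline(["u1", "u2"], num_dpus=3)
# At time 1 the first DPU executes u2 while the second DPU executes u1.
```

In this model every DPU sees the same sequence of microinstructions, each one time step later than its left neighbor, while the control unit is connected to a single DPU only.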
A simplified block diagram of such an arithmetic unit is shown in Figure 1.1.

[Figure 1.1: Block Diagram of a Basic Model of Limited Connection Arithmetic Unit. The PCU is followed by a linear chain DPU_1, DPU_2, ..., DPU_{n-1}, DPU_n.]

The present study concentrates on the design of the essential microinstructions necessary for performing the four basic arithmetic operations in such a structure, the logic design of the Processing Element, and the method of communication between the Processing Elements and the Data Main Memory for fetching and storing operands and results. The major part of this thesis reports on the logic design of the Processing Element and identifies those parts of the Processing Element whose gate and pin complexity are a function of the bit width of the Processing Element. This, in turn, allows us to choose a suitable bit width for the processing module consistent with the technology constraints and also to balance the costs of the processing logic and the control logic of the Processing Element.

Chapter 2 describes briefly the system and logical organization and mode of operation of the Arithmetic Unit. The major emphasis in this chapter is on the logical structure of the Mantissa Processing Logic (MPL), the method of communication between the modules of the MPL and the flow of microinstructions through them. This discussion provides the necessary perspective for the material in later chapters. The flow of microinstructions in the MPL is illustrated by a generalized example. The chapter concludes with the definition and a brief description of a set of basic and elementary microinstructions which are sufficient to execute a 'machine' arithmetic instruction such as Add or Multiply on two operands.

Chapter 3 treats the arithmetic design of the Processing Element, the basic module of the Mantissa Processing Logic.
The arithmetic design is described in terms of the implications of the particular structure of the Mantissa Processing Logic on the required characteristics of the number system, the number representation and the definition of a normalized number. Finally, in Section 3.6, which is the major portion of this chapter, we develop the definition and operational specification of a set of five simple arithmetic microinstructions. These microinstructions cause an arithmetic transformation of the data and are specified as such by an arithmetic transfer function, wherever possible. The digit algorithm for each arithmetic microinstruction is also given.

In Chapter 4, which is the largest chapter of this thesis, the logic design of the major components of the Processing Element is given. The major components are the register file for storage of active operands, the Combinational Network for processing, and the Control, which generates control signals to condition the Combinational Network. This chapter also describes the actual format and code assignment for the twelve types of microinstructions executed in a Processing Element. Finally, the logic complexity of the Processing Element is calculated in terms of the total number of gates and external leads required in the Processing Element module as a function of the bit width of the module and the redundancy ratio of the multiplier and quotient digit.

Chapter 5 describes how the Mantissa Processing Logic and the Data Main Memory may communicate to fetch and store operands through an interface whose behavior is somewhat analogous to that of a cache memory.

In Chapter 6, we show how the various microinstructions can be combined into a sequence to be executed by the Processing Element modules to perform a 'machine' arithmetic instruction like Floating Point Add, Multiply, etc. Summary and conclusions are given in Chapter 7.

Two appendices are included.
Appendix A-1 gives the algebraic design of a digit recoder which changes the redundancy ratio of the digit from unity to ≤ 2/3. In Appendix A-2, we calculate the number of radix-2 digits of the truncated operands that are necessary in the model division to determine one radix-2 quotient digit.

2. ORGANIZATION AND OPERATION OF ARITHMETIC UNIT

2.1 Introduction

In order to put the discussion in the following chapters in proper perspective, a brief description of the logical organization and method of performing the processing is given. The method of processing is illustrated by an idealized example in Section 2.5. The chapter closes with an introductory description of the repertoire of only the essential microinstructions which are executed by the processing logic.

2.2 Organization of the Arithmetic Unit

Figure 2.1 shows the global block diagram of the arithmetic unit. It consists of the Mantissa Processing Logic, the Exponent Processing Logic, the Local Operand Memories (LOMM, LOEM) and an Arithmetic Control Unit.

[Figure 2.1: global block diagram of the arithmetic unit.]

The Arithmetic Control Unit (ACU) consists of three parts: the Global Arithmetic Control Unit (GACU), the Mantissa Control Unit (MCU) and an Exponent Control Unit (ECU).

The GACU acts as the interface between the arithmetic unit and the rest of the computer. It receives the arithmetic instructions from the central control of the computer, decodes them, and causes the Local Operand Memory Control (LOMCO) to fetch the necessary operands from main memory, if they are not already present in LOMM. LOMCO provides the LOMM address of the operands to the GACU, which then issues the necessary commands to the ECU and MCU for exponent and mantissa processing and coordinates their actions. After the processing is complete, it informs the central control of its status, along with any exceptional conditions that may have arisen during execution of the instruction.

The MCU converts the commands received from the GACU into the necessary microinstructions to be executed by the Mantissa Processing Logic. For example, the Multiply command is converted into a series of shift left multiplier, form multiple and add, and shift left accumulator microinstructions. The MCU also contains the overflow recoder logic, the quotient determination logic, etc.

The ECU performs the necessary control for exponent arithmetic, such as calculating the difference of the exponents for the addition and subtraction arithmetic instructions, calculating the sum of the exponents for the multiplication instruction, and detecting exponent overflow and underflow conditions.

In this thesis, we shall be concerned mainly with the detailed design of the Mantissa Processing Logic and its communication with the Local Operand Mantissa Memories. The detailed design of the GACU, MCU and ECU is beyond the scope of this research. The next section describes the logical organization of the Mantissa Processing Logic and the method of processing.

2.3 Organization of Mantissa Processing Logic

The Mantissa Processing Logic consists of a linear cascade of identical Processing Elements (PEs). Each PE is a complex logical module and contains logic to perform the various microinstructions, issued by the Mantissa Control Unit (MCU), in cooperation with other PEs. The MCU communicates only with the most significant PE (closest to the MCU), and the microinstructions flow serially (in a pipelined manner) from the most significant PE to the least significant PE.

Figure 2.2 shows the schematic organization of the Mantissa Processing Logic along with the MCU. This figure also shows an End Unit, which is optional and not intrinsically necessary for the arithmetic processing. The End Unit allows the last PE to be identical to all the other PEs as far as the interface is concerned, by making it operate as though it had another PE to its right. Moreover, it could contain some logic in which the operand digits shifted off the right end could be temporarily stored for improving the accuracy of the result [35].

[Figure 2.2: schematic organization of the Mantissa Processing Logic, showing the MCU, the cascade of PEs from most significant to least significant, and the End Unit.]

The PEs collectively contain the fractional (mantissa) parts of all active operands, one digit in each PE, as shown in Figure 2.3. Because the quotient generation and operand normalization processes require the examination of the most significant digits, the operands are placed in the PEs so that the digits of each of the operands are available to the microinstructions in order of decreasing significance. Thus, the most significant digits of the active operands are placed in the PE which communicates with the MCU.

[Figure 2.3: The Distribution of Operand Digits in the PEs of Mantissa Processing Logic. PE_1, PE_2, PE_3, ..., PE_n hold digits a_1, a_2, a_3, ..., a_n of operand A and digits m_1, m_2, m_3, ..., m_n of operand M, etc., where $A = \sum_{i=1}^{n} a_i r^{-i}$, $M = \sum_{i=1}^{n} m_i r^{-i}$ and r is the radix.]

Each PE performs the same sequence of microinstructions. A given microinstruction is not executed by all PEs in synchro-parallelism but rather must be executed by them in sequence (i.e., first by PE_1, then PE_2, ...). Note that this is different from a conventional pipeline organization in which data flows in sequence through a number of stages which, in general, perform different operations on the data. In this organization, however, data is relatively constant, and flowing microinstructions tell a PE what operation to execute on the data resident in that PE at that instant of time.

During processing, each PE physically communicates only with its immediate neighbors. To execute a microinstruction, a given PE may need information from its right neighbor.
This information logically may depend on the contents (active operand digits) of its neighboring PEs, depending on the nature of the microinstruction. So we may say that each PE physically communicates with only one PE to its immediate right but, from a logical viewpoint, the PE communicates with more than one PE. (The logical communication could be converted into physical communication by duplicating the necessary hardware logic in the PE where that information is required, but this would increase the number of interconnections.) In the following discussion of the mode of processing, when we talk about information required by a PE from its right neighbors, we mean the information requirement in the logical sense.

As mentioned earlier, a given microinstruction is executed by the PEs not in synchro-parallelism but rather in sequence. As soon as all the PEs (say a of them) which contain information required by PE_i to perform microinstruction j+1 (referred to as $\mu_{j+1}$) have executed $\mu_j$ and have sent the required information to PE_i, $\mu_{j+1}$ may be performed by PE_i. The microinstructions executed by the PEs are defined in such a way that they have regular data requirements, independent of the position of the PE in which a microinstruction is executed, so that as each additional PE executes $\mu_j$, one more PE may execute $\mu_{j+1}$. (There is an exception to this rule in the case of the Assimilation Recode (AR) microinstruction, in which case a is variable and depends on the nature of the data resident in the PEs. This is further explained later in Section 4.3.2.3.3.) The microinstructions may be viewed as flowing through successive PEs. Clearly, the PE registers do not contain entire operands as long as any of the PEs are actively executing microinstructions. Each PE contains the digits from the results of the last microinstruction executed.
(In the worst case, if there are n PEs and each PE has the capability of storing n active operands, there could be n active operands in different stages of processing if there is a sequence of n load or store microinstructions.)

2.4 Formal Description of Processing in a PE

The processing performed by the PEs can be described by the following. Let

$${}_jX_i = \Phi_j({}_{j-1}X_i,\ {}_jF_{i-1},\ {}_jG_{i+1}) \qquad (2.1)$$
$${}_jF_i = \Psi_j({}_jF_{i-1},\ {}_{j-1}X_i) \qquad (2.2)$$
$${}_jG_k = \Gamma_j({}_jF_{k-1},\ {}_{j-1}X_k,\ {}_jG_{k+1}) \qquad (2.3)$$

where

${}_jX_i$ is the set of operand digits held in PE_i after the j-th microinstruction,
$\Phi_j$ is the function employed to obtain the new operand set and is dependent on the microinstruction to be performed,
${}_jF_i$ is a 'modifier' value which PE_i transmits to PE_{i+1} with the microinstruction $\mu_j$ to be performed next,
$\Psi_j$ is the function which each PE performs to determine ${}_jF_i$,
$\Gamma_j$ is the function PE_k employs to determine ${}_jG_k$,
${}_jG_k$ is the value which PE_k transmits to the PE executing $\mu_j$, and
$a_j$ is one more than the number of PEs which must logically cooperate with the right neighbor of the PE performing $\mu_j$ in order to generate the necessary ${}_jG_k$.

The information ${}_jG_k$ is generated in a time sequential fashion. ${}_jG_k$ consists of $a_j$ components ${}_jG_k^0, {}_jG_k^1, \ldots, {}_jG_k^{a_j-1}$, and they are given by the following relations.

$${}_jG_k^0 = \Gamma_j^0({}_jF_{k-1},\ {}_{j-1}X_k)$$
$${}_jG_k^1 = \Gamma_j^1({}_jF_{k-1},\ {}_{j-1}X_k,\ {}_jG_{k+1}^0)$$
$$\vdots \qquad (2.4)$$
$${}_jG_k^{a_j-1} = \Gamma_j^{a_j-1}({}_jF_{k-1},\ {}_{j-1}X_k,\ {}_jG_{k+1}^0, \ldots, {}_jG_{k+1}^{a_j-2})$$
$${}_jG_k = \{{}_jG_k^0,\ {}_jG_k^1,\ \ldots,\ {}_jG_k^{a_j-1}\}$$

The superscript on ${}_jG_k$ indicates the time order of sequential generation of the G-information.

Another formulation, which is applicable only for a fixed value of $a_j$, is given by Pisterzi [30]. In that formulation, the PE executing microinstruction $\mu_j$ gets G-information directly from the $a_j$ PEs to its immediate right. The trade-off between the two is that the former needs fewer connections to each PE and also less logic in each PE, since the G-information is developed in a distributed fashion in the $a_j$ PEs. However, this is obtained at the expense of more complex control and longer time delay.

The operation of a typical PE, PE_i say, is as follows. It begins in a state in which it is receptive to information defining the next microinstruction to be performed. PE_i receives this information (the microinstruction $\mu_j$) and the value of ${}_jF_{i-1}$ from its left neighbor PE_{i-1}. Then PE_i determines ${}_jG_i$, the information required by PE_{i-1} to complete microinstruction $\mu_j$. ${}_jG_i$ is determined sequentially as described by the set of relations in (2.4). The component ${}_jG_i^0$ is developed immediately. At the same time, PE_i determines ${}_jF_i$ by performing equation (2.2) (which incidentally is the same as ${}_jF_{i-1}$ in most cases) and transmits the identity of $\mu_j$ along with ${}_jF_i$ to PE_{i+1}. At this time, PE_{i+1} generates ${}_jG_{i+1}^0$ and transmits it back to PE_i so that PE_i may generate ${}_jG_i^1$. Simultaneously, it (PE_{i+1}) transmits the identity of $\mu_j$ along with ${}_jF_{i+1}$ to PE_{i+2}, which repeats the same process. Note that the complete information ${}_jG_i$ depends on G-information generated up to $a_j - 1$ PEs further to the right, which must trickle back to PE_i. Although this takes quite some time, the corresponding ${}_jG_{i+1}^{a_j-1}$ can be generated by PE_{i+1} just one time step later. The initial setup time is large, however.

As soon as PE_i transmits ${}_jG_i$ to PE_{i-1}, PE_{i-1} can complete the execution of microinstruction $\mu_j$. After some time, PE_i receives a signal from PE_{i-1} indicating that PE_{i-1} has executed $\mu_j$. PE_i then executes $\mu_j$ (the necessary ${}_jG_{i+1}$ being ready by now). PE_i now transmits a signal to PE_{i+1} which indicates that PE_{i+1} may execute $\mu_j$. When PE_i receives an acknowledgement from PE_{i+1}, it goes into a state where it is receptive to information concerning $\mu_{j+1}$. The sequence above then repeats.

2.5 Generalized Example

To illustrate how the processing of several microinstructions may take place concurrently in the Mantissa Processing Logic, each by a different PE, we describe below a generalized example.
This example is borrowed from Pisterzi [30], but the necessary changes have been made to conform to our notation. Table 2.1 shows the $a_j$ for the various microinstructions of the generalized example.

Table 2.1 $a_j$ of the Microinstructions of the Example of Figure 2.4

    j      1   2   3   4   5   6
    a_j    2   1   0   1   2   0

The Mantissa Processing Logic will have five PEs and one operand. This operand will be indicated as composed of five digits $a_1, \ldots, a_5$ such that digit ${}_ja_i$ is the digit contained in PE_i after the j-th microinstruction.

The operation of the Mantissa Processing Logic is presented in tabular form in Figure 2.4.

[Figure 2.4: Illustration of the Execution of the Generalized Example in Mantissa Processing Logic. A table with one row per time step, microinstruction register columns IR_1 to IR_5 and operand register columns O_1 to O_5.]

The columns labeled O_i indicate the operand digit contained in PE_i. The occurrence of ${}_ja_i$ in the i-th operand column indicates that ${}_ja_i$ has just been computed and placed in the operand register of PE_i. The columns labeled IR_i indicate the microinstruction being executed by PE_i and/or the G-information being produced by PE_i. The occurrence of $\mu_j$ in the IR_i column denotes that PE_i has just received the identity of the j-th microinstruction and will begin determining ${}_jG_i$ in a time sequential fashion. The appearance of ${}_jG_i^{\lambda}$ in the IR_i column indicates that PE_i has just determined the $\lambda$-th component of the G-information which is needed by PE_{i-1}. $\lambda$ ranges from 0 to $a_j - 1$. (In our example, 0 ≤ $\lambda$ ≤ 1.)
The occurrence of a circled $\mu_j$ represents that the execution of microinstruction $\mu_j$ has just been completed by PE_i (and the result operand digit ${}_ja_i$ has been generated, as indicated by the appearance of ${}_ja_i$ in column O_i). The progression of time is indicated by the rows, each row equivalent to the time required by a PE to execute one step of processing.

Figure 2.4 shows the Mantissa Processing Logic in steady state at time 0. No microinstructions are being executed and the operand A $({}_0a_1, {}_0a_2, {}_0a_3, {}_0a_4, {}_0a_5)$ is in the operand register. The processing proceeds as follows. We assume that at time 3, the identity $\mu_1$ of microinstruction 1 has reached PE_3. PE_3 calculates ${}_1G_3^0$ and sends it to PE_2 (Figure 2.5a). At time 4, PE_2 calculates ${}_1G_2^1$ and sends it to PE_1. At time 5, all the G-information required for execution of $\mu_1$ ($a_1 = 2$) is available in PE_1 and $\mu_1$ is executed by PE_1. This causes ${}_0a_1$ to be replaced by ${}_1a_1$. During the next four time intervals, $\mu_1$ is performed consecutively by each of the remaining PEs, since ${}_1G_{k+1}$ becomes available just as it is required by PE_k to perform $\mu_1$.

The identity $\mu_2$ of the second microinstruction is received by a PE one time unit after that PE performs $\mu_1$. Since PE_1 requires ${}_2G_2$ to execute $\mu_2$ ($a_2 = 1$), this microinstruction is not performed by PE_1 until time 8, one time step after PE_2 is able to determine this value and send it to PE_1 (Figure 2.5b). Just as with $\mu_1$, $\mu_2$ is executed sequentially by each of the remaining PEs during each of the next four time intervals.

[Figures 2.5a and 2.5b: diagrams of the processing steps for $\mu_1$ and $\mu_2$ referred to above.]

Microinstruction $\mu_3$ is performed by each of the PEs one time unit after each PE has performed $\mu_2$, because $a_3 = 0$ and no outside information is required. The other microinstructions are performed in the same pattern.

In general, PE_i performs $\mu_j$ $2a_j + 1$ time units after its execution of $\mu_{j-1}$. The time $T_{Em}$ elapsed between the instant when the identity of the first microinstruction reaches PE_1 and the instant of execution of the m-th microinstruction (of a set of consecutively issued microinstructions) by the first PE is given by

$$T_{Em} = \sum_{j=1}^{m} 2a_j + m.$$

2.6 The Micro-Instruction Repertoire of the PEs

In this section, we will discuss briefly the microinstructions which are executed by the PEs so that the overall arithmetic unit is able to do addition, subtraction, multiplication, division and normalization. The microinstructions may be broadly categorized in five classes for the purposes of this discussion. These five classes are:

1. the inter-register transfers,
2. the shift microinstructions,
3. the arithmetic microinstructions,
4. the memory accessing microinstructions, and
5. the miscellaneous microinstructions.

2.6.1 The inter-register transfer microinstructions

These microinstructions cause operands to be transferred from one internal register of the PE to another internal register. There are two instructions in this class: Transfer Direct (TD) and Transfer Invert (TI). The microinstruction TD moves the contents of one register in the PE to another register, both registers being specified explicitly in the microinstruction, with no changes in the source operand. The microinstruction TI, on the other hand, causes the transfer of operands from the source to the destination register with the sign of the source operand inverted, that is, changed to the opposite polarity.

The microinstruction TD allows the results of one instruction to be stored temporarily in another local register before being used as an operand in the execution of some later microinstruction, thus avoiding a memory reference. A second application of this microinstruction is in the exchange of operands when normalization is required.
As will be seen later, because the Normalization Recode (NR) and Assimilation Recode (AR) microinstructions require the operand to be in the Accumulator register only, assimilation and normalization of operands require the use of microinstruction TD for moving the operand to the Accumulator register.

The main use of the microinstruction TI occurs when one needs to change the sign of an operand before it is used, e.g., in the case of subtraction. Since the PE has only an 'add' microinstruction, it is necessary to invert the sign of the operand before it is 'added' to another operand to cause subtraction. Note that, in this microinstruction, the source and destination register addresses can be the same. This microinstruction can thus be used, if necessary, for getting the absolute value of an operand.

In all the inter-register transfers, all of the data required by a PE to perform the microinstruction is contained within that PE itself. This can be seen in Figure 2.3: each PE contains one digit of each of the operands. Therefore the value of a, the number of PEs which must logically cooperate with the PE executing the inter-register transfer microinstruction, is zero, and ${}_jF_i$ is not required to transmit data. The value of ${}_jF_i$ is used instead to identify both the registers taking part in the transfer. The exact format of the microinstructions TD and TI is discussed in Section 4.3.2.2.

In the notation of Section 2.4, the inter-register transfer microinstructions may be expressed as:

$${}_jx_i = {}_{j-1}y_i \qquad i = 1, 2, \ldots, n \qquad (2.5)$$
$${}_jF_i = {}_jF_{i-1} \qquad i = 1, 2, \ldots, n \qquad (2.6)$$
$${}_jG_i = \varnothing \qquad i = 1, 2, \ldots, n \qquad (2.7)$$

where

Y is the register to be copied into the X register,
${}_jx_i$ is the i-th digit of the X register after the transfer,
${}_{j-1}y_i$ is the i-th digit of the Y register before the transfer, and
$\varnothing$ indicates that the value of ${}_jG_i$ is not required when performing inter-register transfers.
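A minimal sketch of the two transfers follows. It assumes a hypothetical layout in which each PE is a dictionary mapping a register name to the one signed digit that PE holds for that register; the register names "A" and "B" and the function name are illustrative and are not the microinstruction format of Section 4.3.2.2.

```python
def transfer(pes, src, dst, invert=False):
    """Apply TD (invert=False) or TI (invert=True) in every PE in turn.

    Each PE reads only its own digit of the source register, which is why
    no information from neighboring PEs is needed (a = 0).
    """
    for pe in pes:                       # PE_1 first, then PE_2, and so on
        digit = pe[src]
        pe[dst] = -digit if invert else digit


# Operand held in register A, one signed digit per PE (most significant first).
pes = [{"A": 3, "B": 0}, {"A": -6, "B": 0}, {"A": 5, "B": 0}]

transfer(pes, "A", "B")                  # TD: B receives a copy of A
copied = [pe["B"] for pe in pes]

transfer(pes, "B", "B", invert=True)     # TI with identical source and destination
negated = [pe["B"] for pe in pes]
```

Because a signed-digit operand is negated by changing the sign of each digit individually, the inverting transfer needs nothing beyond the digit already resident in each PE.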
2.6.2 The shift microinstructions - These microinstructions are used during radix point alignment prior to addition or subtraction, for normalization, and for multiplication and division by the radix during the repetitive steps for multiplication and division. A shift of more than one digital position is performed as a number of successive shifts of one digital position each.

The left shift can be accomplished by causing the PE to the immediate right of the PE performing the microinstruction to transmit the value of the digit of the operand contained in its register to the PE performing the microinstruction. This PE stores the digit it receives in its operand register. The equations defining the left shift microinstruction, LS, are:

${}_jx_i = {}_jG_{i+1}$,  i = 1, 2, ..., n  (2.8)
${}_jF_i = {}_jF_{i-1}$,  i = 1, 2, ..., n  (2.9)
${}_jG_i = {}_{j-1}x_i$,  i = 1, 2, ..., n  (2.10)
${}_jG_{n+1} = {}_jF_n$ if ${}_jF_n$ is a valid digit; otherwise see text  (2.11)

where
X is the operand being shifted,
${}_jx_i$ is the i-th digit of the shifted operand,
${}_{j-1}x_i$ is the i-th digit of X before the shift,
${}_jF$ is the modifier value passed along with the microinstruction and carries the address of the register to be shifted and the value of a digit sent by MCU that is to go into the last PE. This is made use of in the execution of Multiplication.
${}_jF_0$ is the value that the MCU sends to $PE_1$ with the left shift microinstruction to indicate the value that is to go into the last PE. If ${}_jF_0$ is a valid digit, it becomes the digit shifted into the last PE. If it is not a valid digit, it causes the End Unit to shift in the digit shifted out during the last right shift.

One should also note that the left shift microinstructions make it possible to transmit the most significant digit of an operand to the MCU. The left shift can therefore be used by the MCU to examine operands.

The right shift (RS) microinstruction does not have the complexity of the left shift microinstruction.
The value stored into a PE is the value transmitted by its left neighbor PE with the indication that a right shift is to be performed. The value of the digit to be stored in the first PE is determined by the MCU. In the terminology of Equations 2.1 through 2.3,

${}_jx_i = {}_jF_{i-1}$,  i = 1, 2, ..., n  (2.12)
${}_jF_i = {}_{j-1}x_i$,  i = 1, 2, ..., n  (2.13)
${}_jG_i = \text{null}$,  i = 1, 2, ..., n  (2.14)

where ${}_jF_0$ is the digit which the MCU transmits with the indication that a right shift is to be performed. This value becomes the value of the most significant digit of the shifted operand.

The value of ${}_jF_n$, which is transmitted by $PE_n$ to the 'End Unit', is stored as the new top element in the push-down stack. The push-down stack is essentially an extended version of 'guard' digits.

A final note concerning shifts is that the value of $\alpha_{LS} = 1$ and $\alpha_{RS} = 0$. The exact format of the microinstructions LS and RS is described in Section 4.3.2.2.

2.6.3 The arithmetic microinstructions - The microinstructions in this class are those instructions which do some sort of arithmetic transformation on the operands. These microinstructions operate on one, two or more than two operands, depending on the nature of the microinstructions. The various microinstructions in this class are: Form Multiple and Add (FMA), Simple Sum (SS), Multiple Sum (MS), Assimilation Recode (AR) and Normalization Recode (NR).

The microinstruction FMA is used to form the product of a multiplier (quotient) digit and a multiplicand (divisor) digit and add (subtract) it to (from) the partial product (partial remainder) in the execution of the Multiplication (Division) instruction, and is the most complex of all microinstructions.

The microinstruction SS sums the contents of two registers and is used to execute the Add or Subtract instructions.
Although the microinstruction FMA could be used for this purpose, a separate microinstruction SS was designed for faster operation, especially because the frequency of addition or subtraction of two operands in a computer program is much higher than that of multiplication or division.

The Multiple Sum microinstruction, MS, is used to add the contents of more than two registers in a PE. This microinstruction is not intrinsically necessary for the operation of the arithmetic unit but rather comes about as a useful by-product of the design of the logic for microinstruction FMA.

The microinstruction NR operates on a single operand in the Accumulator register only. It is used to recode the operand in a form which, when left-shifted one or more places, meets the normalization definition.

Finally, the Assimilation Recode microinstruction, AR, is used to convert the operand in the Accumulator register from the number representation used in the arithmetic processing to the conventional form for communication to memory or other parts of the computer system. This microinstruction is very similar to microinstruction NR. All the above microinstructions are discussed in detail in Chapter 3.

2.6.4 The memory-accessing microinstructions - These microinstructions cause the exchange of data between the internal registers of the PEs and a local buffer operand memory. They are used to fetch operands into PEs for processing and to store the results for eventual transmission to the main memory of the computer. The two microinstructions are Load from Processor Memory (LPM) and Store into Processor Memory (SPM); the former is used to bring operands into PE registers and the latter causes the contents of a specified PE register to be stored into a specified location of the local Operand Processor Memory.
The microinstructions in this class are similar to the inter-register transfer microinstructions except that one of the source or destination addresses refers to some location in the local Operand Processor Memory. In these microinstructions also, the modifier ${}_jF_i$ is used to identify the source and the destination. The exact format of these microinstructions is discussed in Section 4.3.2.2. The communication of the PEs with the Data Main Memory via the local Operand Processor Memory is discussed in Chapter 5.

2.6.5 The miscellaneous microinstructions - One instruction in this class is Load Constant (LDC). This microinstruction can be used to clear the operand register by loading zeros in a specified register of all the PEs in the arithmetic unit. It can also be used to initialize an operand register spread across all the PEs to a pattern such that all the digits are identical. An example of such a use could be the loading of the maximum value of operands. In the terminology of Section 2.4,

${}_jx_i = {}_jF_{i-1}$,  i = 1, 2, ..., n  (2.15)
${}_jF_i = {}_jF_{i-1}$,  i = 1, 2, ..., n  (2.16)
${}_jG_i = \text{null}$,  i = 1, 2, ..., n  (2.17)

${}_jF_0$ = digit which the MCU sends to $PE_1$ with the LDC microinstruction, and the register name in which the constant is to be loaded.

Clearly, $\alpha = 0$, which means that no information is needed from the right neighbor for the execution of this microinstruction. Note that in Equation 2.15, only the digit part of field ${}_jF_{i-1}$ is stored in register ${}_jx_i$.

3. ARITHMETIC DESIGN AND IMPLEMENTATION CONSIDERATIONS

3.1 Introduction

This chapter describes the arithmetic design of the Processing Element. Arithmetic design consists of the choice of a suitable number system, number representation, and the development of suitable digit level algorithms. Serial processing in an iterative structure has important implications on all of these factors and will be considered in this chapter.
Implementation of the digit algorithm and its implications for LSI realization of the Processing Element are also discussed.

3.2 Implications of Serial Processing on Arithmetic Design

From the description of processing in Section 2.4, it is evident that the results are obtained on a digit-by-digit basis. To achieve a compromise between the digit serial processing and the arithmetic speed, the arithmetic should be carried out in a higher radix, say $r = 2^k$ ($k > 1$), such that k bits of the result are obtained at any step.

Since the processes of quotient generation, operand normalization, mantissa overflow determination and the determination of the sign of the result inherently require the examination of the most significant digits of operands to determine what additional processing is necessary, arithmetic algorithms should be so designed that the most significant digits of the result are obtained first. The most-significant-digit-first (MSDF) approach has the advantages of providing early status indication (overflow, sign of the result, etc.), normalization concurrent with processing, and early termination of processing as soon as enough significant digits in the result have been obtained. The latter would allow faster variable precision arithmetic in a digit serial environment. Early status indication would also aid in an instruction look-ahead unit.

Further, the MSDF approach allows the meshing-in (pipelining) of successive macroinstructions for efficient operation. For example, if a MULTIPLY instruction is followed by a DIVIDE instruction, at some point in time, the least significant digits of the product can be generated by a right directed procedure in the least significant elements of the iterative structure, while the most significant elements are generating quotient digits.

3.3 Choice of Number System

For a smooth flow of microinstructions in the linear iterative structure and for maximizing the rate of computation, two constraints on $\alpha_j$†
are necessary:

a) The microinstructions should have regular data requirements independent of the significance of the digits retained by a PE. That is, $\alpha_j$ should be constant.

b) The value of $\alpha_j$ should be as small as possible because the execution rate of a given microinstruction is inversely proportional to $\alpha_j$.

In a conventional weighted number system, a carry or borrow into any digital position is a function of all the digits to the right of this position. Thus for MSDF algorithms which are right directed, the conventional number system cannot be employed because in a conventional number system, the value of $\alpha_j$ is a function of the significance of the digit itself. A redundant number system which gives a bounded value of $\alpha_j$ is clearly essential.

† $\alpha_j$ is the number of PEs from which a given PE requires information (in the logical sense) in order to execute the microinstruction $\mu_j$.

3.4 Choice of Number Representation and Amount of Redundancy

The major factors influencing the choice of the redundant number representation and the amount of redundancy in the number system are the following:

a) the ease of conversion from the conventional number representation to the redundant number representation,
b) its compatibility with the widely employed conventional binary number system,
c) ease of normalization of operands to radix-2 limits, and
d) LSI technology constraints, namely
   (i) minimization of the number of types of cells (in the arithmetic and logic sense) required for higher radix ($r = 2^k$) implementation of the digit processing logic, and
   (ii) minimization of the number of input and output pins.

In this study, signed-digit redundant number representations with maximal redundancy were chosen, because they satisfy most of these requirements.

3.4.1 Signed-digit number representations - Signed-Digit (SD) representations are redundant positional representations.
A number X is represented, in radix-r, redundant, signed-digit format, as a digit vector (abbreviated as "d-vector") of length n + m + 1

$$X = x_{-m}\, x_{-(m-1)} \ldots x_0 \,.\, x_1 x_2 \ldots x_{n-1} x_n$$

such that

$$X = \sum_{i=-m}^{n} x_i\, r^{-i}$$

where $x_i \in \{\bar{d}, \overline{(d-1)}, \ldots, \bar{1}, 0, 1, \ldots, (d-1), d\}$ and $\frac{r}{2} \le d \le (r-1)$.

The overbar indicates negative values and, unless otherwise specified, we shall be using rightward indexing in the d-vector representation.

For maximally redundant, signed-digit number systems d = r - 1. That is, for a radix $r = 2^k$, each digit of the radix-r digit vector can assume any integer value in the digit set $\{\overline{(2^k-1)}, \ldots, \bar{1}, 0, 1, \ldots, (2^k-1)\}$.

Some of the desirable properties of signed-digit representations are:

1. Representation of zero is unique. The algebraic value of X = 0 if, and only if, all $x_i = 0$.
2. The additive inverse (negation) of an operand is very simply achieved by reversing the sign of every non-zero digit individually.
3. The sign of the algebraic value of X is given by the sign of the most-significant (leftmost) non-zero digit.
4. For the sum or difference of two signed-digit operands, $\alpha_j = 1$.

Maximal redundancy is compatible with the widely used sign-magnitude representation of conventional binary input operands. A binary number may be interpreted as a number of radix $r = 2^k$ by grouping the binary digits into groups of k bits each. Conversion from the conventional number system to signed-digit form is simply carried out by attaching the sign of the conventional number to each digit. Another important advantage is the fact that the carry between bits of a digit has the same properties as the carry between digits, whereas in other than maximally redundant representations such is not the case. This allows radix-2 arithmetic (for example, shifts) when necessary.
From the LSI viewpoint, it allows a radix $r = 2^k$ arithmetic structure to be composed of k identical and simpler radix-2 substructures interconnected in a regular pattern. Maximal redundancy also provides more code-space patterns [36] for testing the radix-r module. This makes the design of a self-testing version of the module easier.

Two modes of representation for a signed digit of the radix-$2^k$ d-vector are used, depending on the area of application:

a) Sign-Magnitude (SM) Mode - Each radix-$2^k$ digit $x_i$ is represented by a single sign bit $s_i$ and k magnitude bits $x_{i_j}$ (j = 0, 1, ..., k-1) such that

$$x_i = (-1)^{s_i} \sum_{j=0}^{k-1} x_{i_j}\, 2^j, \qquad s_i,\ x_{i_j} \in \{0, 1\}$$

b) Redundant-Binary (RB) Mode - Each radix-$2^k$ digit $x_i$ is represented by k redundant binary digits $x^*_{i_j}$ (j = 0, 1, ..., k-1), such that

$$x_i = \sum_{j=0}^{k-1} x^*_{i_j}\, 2^j, \qquad x^*_{i_j} \in \{\bar{1}, 0, 1\}$$

(Note that in the above representation of $x_i$ in terms of radix-2 subdigits, we use zero-origin leftward indexing.)

The SM mode requires k+1 binary storage elements (or k+1 pins as an output from the processing element) and the RB mode needs 2k binary storage elements (or pins) because each redundant binary digit requires two binary state elements. The SM representation for a radix-r digit is used for inter-PE communication to keep the number of external I/O pins small. The RB mode of representation is used for implementing digit algorithms (as will be seen).

If each redundant-binary digit is expressed in sign and magnitude form, conversion from SM to RB mode is trivial and involves appending the single sign bit to each of the k magnitude bits. Conversion from RB to SM is less trivial, however, and involves recognition that the sign of the radix-$2^k$ digit is that of the most significant non-zero binary digit, followed by subtraction of the magnitudes of those digits of opposite sign from the magnitudes of those binary digits of the same sign.
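The two conversions just described can be sketched as follows. This is an illustrative model only: a digit in RB mode is taken to be a Python list of values in {-1, 0, 1} with index j carrying weight 2^j (zero-origin leftward indexing), and a digit in SM mode is a pair (sign bit, list of magnitude bits). The function names are hypothetical.

```python
def rb_value(rb_bits):
    """Algebraic value of a digit held in RB mode."""
    return sum(b * 2**j for j, b in enumerate(rb_bits))


def sm_to_rb(sign, mag_bits):
    """SM -> RB: attach the single sign bit to each magnitude bit."""
    s = -1 if sign else 1
    return [s * b for b in mag_bits]


def rb_to_sm(rb_bits):
    """RB -> SM: the sign is that of the most significant non-zero bit;
    the magnitude is (same-sign weights) minus (opposite-sign weights)."""
    nonzero = [j for j, b in enumerate(rb_bits) if b != 0]
    if not nonzero:
        return 0, [0] * len(rb_bits)
    lead = rb_bits[max(nonzero)]
    same = sum(2**j for j, b in enumerate(rb_bits) if b == lead)
    opposite = sum(2**j for j, b in enumerate(rb_bits) if b == -lead)
    mag = same - opposite
    sign = 1 if lead < 0 else 0
    return sign, [(mag >> j) & 1 for j in range(len(rb_bits))]


digit_rb = [1, 0, -1, 1]          # 1 - 4 + 8 = 5 for k = 4
digit_sm = rb_to_sm(digit_rb)     # sign bit 0, magnitude bits of 5
```

The leading non-zero bit always outweighs all lower bits combined, so the subtraction in `rb_to_sm` never goes negative; this is the reason the sign can be read from that bit alone.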
3.4.2 Number format and range for mantissa - In this thesis, the mantissa is assumed to be represented by a one-origin right indexed d-vector of length n. The radix point is assumed to be at the left of the most significant digit with index one. That is,

$$X^j = .\,x_1^j\, x_2^j\, x_3^j \ldots x_{n-1}^j\, x_n^j$$

For a conventional number representation, the values of digit $x_i^j$ are {0, 1, ..., r-1} and for the signed-digit format, the digit $x_i^j$ can assume any value in the digit set $\{\overline{(r-1)}, \overline{(r-2)}, \ldots, \bar{1}, 0, 1, \ldots, (r-2), (r-1)\}$.

When more than one operand is considered, the superscript is employed to identify a specific operand, i.e., $X^1$ and $X^2$ for two operands or $X^1, X^2, \ldots, X^j, \ldots, X^l$ for l operands. The i-th digit of $X^j$ is uniquely identified as $x_i^j$.

The algebraic value of the mantissa is given by

$$X^j = \sum_{i=1}^{n} x_i^j\, r^{-i}$$

and $-1 < X^j < 1$.

3.5 Normalization Considerations

For the preparation of operands and the processing of results, it is necessary to restrict the range of values which the mantissa may assume. One usually restricts this range by requiring that all operands be normalized. This is generally done by defining the form of d-vector representation of the restricted range operands. However, in redundant number representations, there exist pseudo-normal forms, because more than one d-vector representation is possible for the same algebraic value. For example, the two numbers X and X'

$X = .00\ldots01$
$X' = .1\,\overline{(r-1)}\,\overline{(r-1)} \ldots \overline{(r-1)}$

have the same algebraic value. The representation X' satisfies the conventional normalization condition $x'_1 \ne 0$ but not the minimum magnitude ($\ge 1/r$) requirement for its algebraic value.

3.5.1 Definition and range of normalized numbers - Three alternative definitions of normalized operands were considered.

Definition 1
A number X (of nonzero algebraic value) is considered normalized when its d-vector representation† $X = .x_1 x_2 \ldots x_n$ satisfies either of two conditions
a. $|x_1| \ge 2$
b. $|x_1| = 1$ and $x_1 \cdot x_2 \ge 0$

Definition 2
A number X (of nonzero algebraic value) is considered normalized when in its d-vector representation $X = .x_1 x_2 \ldots x_{t-1} x_t \ldots x_n$ either
a. $|x_1| \ge 2$ or
b. $|x_1| = 1$, $x_2 = x_3 = \ldots = x_{t-1} = 0$ and
   $x_1 \cdot x_t > 0$, $t \le n$
   $x_1 \cdot x_t = 0$, $t = n$
where n is the length of the operand.

† In these definitions, the superscript on X has been dropped for ease of readability.

Definition 3
A number X (of nonzero algebraic value) is considered normalized when its d-vector representation $X = .x_1 x_2 x_3 \ldots x_j \ldots x_n$ satisfies the conditions
a. $|x_1| > 0$ and
b. $x_1 \cdot x_i \ge 0$ such that $2 \le i < j$ and $x_j$ is the first (counting from left) zero digit in the d-vector of X.
For example X = .11101 is considered unnormalized per Definition 3.

The range of values for the normalized operands under Definitions 1 and 3 is

$$\frac{1}{r} - \frac{1}{r^2} + \frac{1}{r^n} \le |X| \le 1 - \frac{1}{r^n}$$

and for operands normalized according to Definition 2, the range is

$$\frac{1}{r} \le |X| \le 1 - \frac{1}{r^n}$$

Note that Definition 2 is equivalent to the conventional definition. Of the three definitions, Definition 3 was adopted. The factors affecting the choice between the three definitions are:
i) Complexity of normalization implementation,
ii) Amount of significance loss, and
iii) Logic complexity of quotient selection.

For normalizing numbers according to Definition 2, one needs to examine more digits than for Definition 1. If immediately following $|x_1| = 1$ there is a string of zeros of length v, the normalization procedure must examine at least v+2 digits (to determine the sign of the first nonzero digit following the string of zeros) in the case of Definition 2, whereas only 2 digits need to be looked at for Definition 1. Since the examination of digits is essentially a serial process, it takes v extra steps for Definition 2.
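The difference in examination effort between Definitions 1 and 2 can be made concrete with a small sketch. It assumes the definitions read as follows: Definition 1 requires |x1| >= 2, or |x1| = 1 with x1*x2 >= 0; Definition 2 requires |x1| >= 2, or |x1| = 1 with the first non-zero digit after x1 (if any) having the same sign as x1. The function names and the radix-4 test vectors are illustrative, not taken from the thesis.

```python
from fractions import Fraction


def value(x, r):
    """Algebraic value of the mantissa .x1 x2 ... xn in radix r."""
    return sum(Fraction(d, r ** (i + 1)) for i, d in enumerate(x))


def is_normalized_def1(x):
    """Definition 1 (as assumed above): looks at x1 and x2 only."""
    return abs(x[0]) >= 2 or (abs(x[0]) == 1 and x[0] * x[1] >= 0)


def is_normalized_def2(x):
    """Definition 2 (as assumed above): may scan past a run of zeros."""
    if abs(x[0]) >= 2:
        return True
    if abs(x[0]) != 1:
        return False
    for xt in x[1:]:
        if xt != 0:
            return x[0] * xt > 0
    return True  # x1 = +-1 followed only by zeros


def def2_digits_examined(x):
    """Digits Definition 2 must inspect when |x1| = 1 is followed by v zeros."""
    if abs(x[0]) != 1:
        return 1
    v = 0
    for d in x[1:]:
        if d != 0:
            break
        v += 1
    return min(v + 2, len(x))


r = 4
x = [1, 0, 0, -3]   # .1 0 0 (-3) in radix 4
```

For this vector Definition 1 accepts the operand after two digits, while Definition 2 has to scan all four and rejects it, since its value 61/256 falls just below 1/r = 64/256.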
When the results are normalized according to Definitions 1 and 3 there may be a potential loss of one extra radix-r significant digit, compared to Definition 2. Such a case can occur when a result d-vector is of the form $1.0\,0 \ldots x_i\, x_{i+1} \ldots x_n$ and a post-normalization shift becomes necessary.† However, it is expected that such a case would not occur very often because for higher radix arithmetic the frequency of zero digits is low, and also the overflow occurs less often [37].

† In the case of Definition 2, this number would have to be normalized further to the form $.(r-1)\,(r-1) \ldots (x_i - 1)\, x_{i+1} \ldots x_n$ and a post-normalization shift would not be necessary.

Finally, because of the redundant number representation, the quotient is calculated based on a truncated version of the partial remainder and the divisor. The number of digits of the truncated operands necessary for quotient calculation depends on the minimum algebraic value of the truncated divisor, say $D_{min}$. The lower the value of $D_{min}$, the greater is the number of digits of the truncated divisor and partial remainder necessary for the quotient calculation. For higher values of radix r, e.g., $r \ge 16$, the difference in the minimum value of the truncated, normalized divisor for Definitions 1 and 2 is very small and the number of digits required for quotient calculation remains the same. However, for lower radices ($8 \ge r \ge 2$), the number of digits required and thus the logic complexity for quotient calculation is greater for Definitions 1 and 3 than for Definition 2.

From the above discussion of factors affecting the choice of definition of normalized numbers for maximally redundant signed-digit operands, it is clear that any of the three choices would be almost equally useful for higher radices ($r \ge 16$).
But for r = 2, 4, where the probability of a string of zeros is higher, Definition 1 or 3 would definitely be better for faster normalization, although the logic complexity of quotient calculation would correspondingly be increased. The speed of quotient calculation, and thus the speed of the DIVIDE instruction, would be decreased. But the frequency of DIVIDE instructions is rather low compared to ADD instructions, and so Definition 1 or 3 would overall add to the speed of the arithmetic processing.

For the present research, Definition 3 was chosen because of its compatibility with the Assimilation Recode (AR) microinstruction's digit algorithm. The Assimilation Recode algorithm converts a signed-digit operand into a conventional sign-magnitude operand. This compatibility allows the sharing of logic in the implementation of the Normalize Recode (NR) microinstruction and microinstruction AR and thus reduces the control complexity of a PE. Digit algorithms for microinstructions NR and AR are discussed in Sections 3.6.4 and 3.6.5.

Normalization of an operand is achieved by shifting out leading zeros, followed by a 'Normalize Recode' microinstruction, again followed by shifting out leading zeros, if any. This is discussed further in Section 6.2.6.

3.6 Arithmetic Microinstructions and Corresponding Digit Algorithms

In the present research, the design of the Processing Element is restricted to the capability of performing the four basic arithmetic processes of Addition, Subtraction, Multiplication and Division of two operands. Multiplication and Division are implemented as a number of additions or subtractions (of a multiple of multiplicand or divisor) and shifts, as in a classical von Neumann arithmetic unit. Hence the basic arithmetic microinstruction necessary is of the form

$$X^w = X^u + (x_j^q \cdot X^v) \qquad (3.6.1)$$

where $X^w$, $X^u$ and $X^v$ are d-vectors and $x_j^q$ is a digit.
In the case of multiplication (division), $X^v$ is the multiplicand (divisor), $X^u$ is the old partial product (partial remainder), $X^w$ is the new partial product (partial remainder), and $x_j^q$ is the signed multiplier (quotient) digit. The microinstruction which achieves (3.6.1) is termed 'Form Multiple and Add' (FMA).

Other microinstructions of an arithmetical nature which are needed for the execution of the four basic arithmetic processes are Simple Sum (SS), Multiple Sum (MS), Normalize Recode (NR), and Assimilation Recode (AR). The function of each of these microinstructions and the corresponding digit algorithm for execution of the microinstruction in a processing element is discussed next.

3.6.1 Simple sum (SS) microinstruction - This microinstruction forms the sum of two signed-digit operands, say A and $\Phi$, such that

$$A' = A + \Phi$$

where A' is the new value of the operand A. In general, A and A' are in the Accumulator register of the Processing Elements. At the digit level, the SS microinstruction is characterized by

$$a_i' = a_i + \phi_i + T_i - r\,T_{i-1}$$

where $a_i$, $a_i'$ and $\phi_i$ are radix-r signed digits of the operand in the active registers of $PE_i$, $T_i$ is the 'Transfer' (carry-borrow) from the adjacent processing element $PE_{i+1}$, and $T_{i-1}$ is the 'Transfer' out of $PE_i$.

3.6.1.1 Digit algorithm - The specification of the digit algorithm for SS is intimately connected with its implementation and is described below in terms of its algebraic implementation.

Because of the structural regularity requirements of the LSI technology, the sum of two radix-r signed digits $a_i$ and $\phi_i$ is realized in a linear cascade of k two-input redundant binary adders. This is schematically shown in Figure 3.1. RBA-2 is a two-input redundant binary adder which accepts two redundant binary digits $a^*_{i_v}, \phi^*_{i_v} \in \{\bar{1}, 0, 1\}$ and produces one redundant binary digit. The design of such an RBA-2 was studied in detail by Borovec [38] and we shall interchangeably use the term Borovec Unit (BU) for RBA-2.
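The digit-level relation $a_i' = a_i + \phi_i + T_i - r\,T_{i-1}$ can be exercised with a short sketch. Note that the code below does not model the RBA-2 cascade of the thesis; it substitutes the classical digit-parallel signed-digit addition rule for a maximally redundant digit set with r > 2, in which the transfer out of a position depends only on that position's two input digits. All names are hypothetical.

```python
from fractions import Fraction


def ss_add(a, phi, r):
    """Digit-parallel sum of two signed-digit mantissas (radix r > 2).

    a, phi: digit lists [x1, ..., xn], each digit in [-(r-1), r-1].
    Returns (T0, digits), where T0 is the transfer out of the most
    significant position.
    """
    if r <= 2:
        raise ValueError("this simple rule assumes r > 2")
    n = len(a)
    w = [0] * n        # interim sum digits, |w| <= r - 2
    t_left = [0] * n   # transfer sent to the left neighbor (T_{i-1} in the text)
    for i in range(n):
        s = a[i] + phi[i]
        if s >= r - 1:
            t_left[i], w[i] = 1, s - r
        elif s <= -(r - 1):
            t_left[i], w[i] = -1, s + r
        else:
            w[i] = s
    # Each position absorbs only the transfer from its right neighbor.
    digits = [w[i] + (t_left[i + 1] if i + 1 < n else 0) for i in range(n)]
    return t_left[0], digits


def value(digits, r):
    """Algebraic value of the mantissa .x1 x2 ... xn in radix r."""
    return sum(Fraction(d, r ** (i + 1)) for i, d in enumerate(digits))


r = 16
a = [7, 15, 15]
phi = [7, 15, 15]
t0, total = ss_add(a, phi, r)
```

Since every output digit depends on its own position and on one neighbor only, the sketch reflects the bounded-propagation property that motivates the choice of a redundant representation, even though the hardware realization described in the text is different.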
3.6.1.1.1 Arithmetic design of RBA-2 - The major consideration in the design of RBA-2 was the minimization of the number of pins required for the 'Transfer' into and out of an RBA-2. One such design is shown in Figure 3.2. RBA-2 is realized by a series of four arithmetic transformations as follows.

[Figure 3.1: illustration of microinstruction SS; the caption is only partly legible in the source.]

Figure 3.2 Arithmetic Structure of an RBA-2

[The defining equations of the four transformations are not legible in the source. The surviving conditions state that the input and output digits $a^*_{i_v}$, $\phi^*_{i_v}$ and $a'^*_{i_v}$ lie in $\{\bar{1}, 0, 1\}$, that the transfers between adjacent stages are two-valued, and that]

$\alpha = 2$ for $r = 2$
$\alpha = 1$ for $r > 2$

3.6.2 Form multiple and add (FMA) microinstruction - This microinstruction is used to form the product of the multiplicand (divisor) d-vector and a multiplier (quotient) digit, which when added to the old partial product (partial remainder) gives the new partial product (partial remainder) in the execution of a Multiplication (Division) of two d-vector operands. At the digit level, this microinstruction is characterized by the arithmetic transfer function

$$a_i' = a_i + m_j \cdot \phi_i - r\,T_{i-1} + T_i \qquad (3.6.3)$$

where

$$m_j = \sum_{q=0}^{k-1} m^*_{j_q}\, 2^q, \qquad m^*_{j_q} \in \{\bar{1}, 0, 1\}$$

Figure 3.4 Functional Representation of the Digit Algorithm for FMA. [Blocks: digit product generator ($f_1$) and multi-input radix-r adder ($f_2$), with]

$f_1$: $m_j \cdot \phi_i = r\,t^p_{i-1} + w_i$
$f_2$: $w_i + a_i + t^p_i + T_i \rightarrow a_i' + r\,T_{i-1}$
The product $\phi_i \cdot m_j$ is implemented by a product matrix generator which consists of a k x k square array of redundant binary product cells. Each cell performs the product of two redundant binary digits $\phi^*_{i_\ell}$ and $m^*_{j_q}$, and its output product digit $p^*_{\ell,q}$ is also in the digit set $\{\bar{1}, 0, 1\}$.

The product may be viewed in terms of the sums of the p terms of the same weight in the product matrix:

$$\phi_i \cdot m_j = \sum_{v=0}^{2k-2} 2^v \left[ \sum_{\ell=0}^{v} p^*_{\ell,\,v-\ell} \right] \qquad (3.6.6)$$

where $p^*_{\ell,\,v-\ell} = \phi^*_{i_\ell} \cdot m^*_{j_{v-\ell}}$, and $p^*$ does not exist when either $\ell > k-1$ or $v-\ell > k-1$.

The number $N_v$ of product elements in the v-th column of the product matrix is given by

$$N_v = \begin{cases} v + 1, & 0 \le v \le k-1 \\ -v + (2k-1), & k \le v \le 2k-2 \end{cases} \qquad (3.6.7)$$

The number $N_v$ is maximum in the column of weight $2^{k-1}$ and is equal to k. The product elements in other columns decrease uniformly by one on either side of this column, as shown in Figure 3.6.

Equation (3.6.6) can be rewritten in the following form:

$$\phi_i \cdot m_j = \sum_{v=0}^{k-1} 2^v \left( \sum_{\ell=0}^{v} p^*_{\ell,\,v-\ell} \right) + \sum_{v=k}^{2k-2} 2^v \left( \sum_{\ell=0}^{v} p^*_{\ell,\,v-\ell} \right)$$

$$\phantom{\phi_i \cdot m_j} = \sum_{v=0}^{k-1} 2^v \left( \sum_{\ell=0}^{v} p^*_{\ell,\,v-\ell} \right) + 2^k \sum_{v=k}^{2k-2} 2^{v-k} \left( \sum_{\ell=0}^{v} p^*_{\ell,\,v-\ell} \right) \qquad (3.6.8)$$

The columns of weight $2^v$ ($k \le v \le 2k-2$) of the product matrix can be considered as forming a carry $t^p_{i-1}$, called the Collective Product Transfer (CPT), to the next more significant radix-$2^k$ digital position i-1. These CPT columns ($k \le v \le 2k-2$) have weights $2^{v-k}$ (Equation (3.6.8)) with respect to the higher significant digital position. When similar CPT columns from digital position i+1 are added in the appropriate (of the same weight) MIRBA of the digital position i, all the stages of the linear cascade of MIRBAs in $PE_i$ become identical, each MIRBA having k inputs. This is illustrated in Figure 3.7.

Further, the transformation $f_2$ requires the addition of one radix-r digit $a_i$ to $w_i$ and $t^p_i$, and the digit $a_i$
contributes one redundant binary input to each position of MIRBA. Hence the transformation $f_2$ requires k MIRBAs, each capable of summing k+1 redundant binary inputs, as well as the 'Transfer' from the adjacent MIRBA position. Figure 3.8 schematically shows the implementation of the FMA digit algorithm for radix 16, that is, k = 4.

Values of $|w_i|_{max}$ and $|t^p_{i-1}|_{max}$

From Equations (3.6.4) and (3.6.8), we have

[Figures 3.7 and 3.8: implementation of the FMA algorithm using redundant binary adders and a digit product generator (radix = 16); illustrations and captions only partly legible in the source.]

... and $m_j$ into $w_i$, $t^p_{i-1}$ such that $w_i$ and $t^p_{i-1}$ are given by Equations (3.6.9) and (3.6.10) respectively.

ii) Perform transformation $f_2$ in a k-stage linear cascade of (k+1)-input redundant binary adders. The design of the multi-input redundant binary adder is discussed in Section 3.6.2.1.3.

3.6.2.1.2 Algorithm 2 - In this algorithm, the transformation $f_1$ recodes the product $\phi_i \cdot m_j$ into digits $w_i$ and $t^p_{i-1}$ such that

$w_i \in \{\overline{(r-1)}, \overline{(r-2)}, \ldots, \bar{1}, 0, 1, \ldots, (r-2), (r-1)\}$ and
$t^p_{i-1} \in \{\overline{(r-2)}, \ldots, \bar{1}, 0, 1, \ldots, (r-2)\}$.

Clearly, the recoded digits $w_i$, $t^p_i$ contribute only one redundant binary input to each MIRBA of the linear cascade. Then the transformation $f_2$ is performed in the k-stage linear cascade of 3-input MIRBAs. This is illustrated in Figure 3.9.

Note that, in Algorithm 2, the number of inputs to the MIRBAs is always three, independent of the value of k. The LSI implications of Algorithms 1 and 2 are discussed later in Section 4.2.2.5.
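The digit ranges quoted for Algorithm 2 can be checked with a minimal sketch of one possible recoding $m_j \cdot \phi_i = r\,t + w$. The rule used here, truncating the quotient toward zero so that t and w share the sign of the product, is an assumption made for illustration; the recoding actually defined by Equations (3.6.9) and (3.6.10) is not reproduced here, and the function name is hypothetical.

```python
def recode_product(m, phi, r):
    """Split the digit product m * phi into (t, w) with m * phi == r * t + w.

    Assumed rule: quotient truncated toward zero, so t and w carry the
    sign of the product. For |m|, |phi| <= r - 1 this gives
    |w| <= r - 1 and |t| <= r - 2.
    """
    p = m * phi
    sign = -1 if p < 0 else 1
    t, w = divmod(abs(p), r)
    return sign * t, sign * w


r = 16
worst_case = recode_product(r - 1, r - 1, r)   # 225 = 16 * 14 + 1
mixed_sign = recode_product(3, -7, r)          # -21 = 16 * (-1) + (-5)
```

The worst case $(r-1)^2 = r(r-2) + 1$ shows why the product transfer never needs a magnitude larger than r-2, so both recoded digits fit the maximally redundant digit set.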
3.6.2.1.3 Design of a multi-input redundant binary adder (MIRBA) - A MIRBA is a limited carry/borrow propagation adder which accepts several redundant binary inputs (digit set $\{\bar{1}, 0, 1\}$) and produces one redundant binary output (with appropriate adder 'Transfers' for more significant adjacent adder stages).

Definition
Let us define a new parameter $\alpha^b$. The redundant binary output of any MIRBA is dependent on the 'Transfers' (the composite term for carry/borrow) input to that MIRBA. In a redundant number system, the 'Transfers' are functions of 'primary' inputs (other than 'Transfer' inputs) to only ...

[Figure 3.9: illustration not recoverable from the source.]

Thus $\alpha_j$ is related to $\alpha^b$ by Equation (3.6.13)

$$\alpha_j = \left\lceil \frac{\alpha^b - 1}{k} \right\rceil + 1 \qquad (3.6.13)$$

3.6.2.1.3.1 Rohatsch's [39] technique - This is a deterministic and explicit transformation procedure which converts a given input digit set into the required output digit set by a series of simple transformations. In using this technique, one generally proceeds backwards; namely, one considers the transformations going from output set to input set. The basic concept of Rohatsch's technique is very simple:

i) Take the desired output set S, find two or more sets $A_0, A_1, A_2, \ldots, A_n$ such that

$$S = A_n + A_{n-1} + \ldots + A_2 + A_1 + A_0.$$

ii) Form the input set M where

$$M = r^n A_n + r^{n-1} A_{n-1} + \ldots + r^1 A_1 + A_0$$

where r is the radix of the adder. In our case, for MIRBA, r = 2.

iii) If necessary, repeat the steps i) and ii) (using the last input set as the new output set) as many times as is required to generate a set which includes the desired input set.

Steps i) and ii) above together constitute an n-th order Simple Transformation (referred to as S.T.).
For the contiguity of sets M and S, $A_n, A_{n-1}, \ldots, A_0$ must be contiguous and the number of distinct digits in sets $A_i$, $n-1 \ge i \ge 0$, should be greater than or equal to r.

Using the above approach, we find that for k > 2, a (k+1)-input MIRBA requires a series of three S.T.s. Figure 3.10a shows one such four-level adder (each level indicated by a box) which is applicable for $k \le 5$. In this, level 1 and level 2 perform first order S.T.s whereas level 3 represents a 2nd order transformation. If level 3 performs a third order or fourth order transformation, such a four-level adder would be applicable for $k \le 9$ and $k \le 11$ respectively. It is interesting to note that if level 2 achieves a 2nd order S.T. and level 3 constitutes a 6th order S.T., then the four-level adder can be used to sum as many as 51 redundant binary $\{\bar{1},0,1\}$ inputs. This is shown in Figure 3.10b.

However, the logic design of the bottom two levels is highly complicated for $k \ge 5$ if they are to be implemented in two or three logic levels. In practice, the technique is to break down the bottom level structure into equivalent simpler structures, frequently at the cost of increasing the number of levels, as shown in Figure 3.10c for k = 5. In this adder structure, $a^b$ is given by

$a^b = \sum_{\nu=1}^{q-1} n_\nu$

Note: Entries in the box show the allowed output digit set values.

Figure 3.10a Illustration of the Algebraic Design of a MIRBA, using First Order Simple Transformations only.
Note: Entries in the box show the allowed output digit set values.

Figure 3.10b Illustration of the Algebraic Design of a MIRBA using Higher (>2) Order Simple Transformation.

Figure 3.10c Algebraic Design of Bottom Level (Level 4) Box of Figure 3.10a (built from 3-input, 2-output redundant binary adders).

where q = number of levels in MIRBA
$n_\nu$ = order of S.T. performed by adder level $\nu$.

Table 3.1 shows the values of $a^b$ and $a_j$ for various values of k, for a (k+1)-input MIRBA.

Table 3.1 Values of $a^b$ and $a_j$ for Various (k+1)-Input MIRBA Configurations

radix          Rohatsch's      log-sum        RBA-3, RBA-2
r = 2^k   k    technique       tree           tree structure
               a^b   a_j       a^b   a_j      a^b   a_j
   4      2     3     2         4     3        3     2
   8      3     4     2         4     2        5     3
  16      4     4     2         6     3        5     2
  32      5     4     2         6     2        5     2
  64      6     5     2         6     2        6     2
 128      7     5     2         6     2        6     2
 256      8     5     2         8     2        6     2

3.6.2.1.3.2 Log-sum tree technique - A conceptually simple approach is to realize the (k+1)-input MIRBA by a log-sum tree structure of two-input redundant binary adders (RBA-2). For a (k+1)-input MIRBA, the tree structure has t levels of Borovec Units such that

$t = \lceil \log_2 (k+1) \rceil$

and the number of BUs required is k. Figure 3.11 shows the log-sum tree structure for a five-input MIRBA. In this configuration,

$a^b = 2t = 2 \lceil \log_2 (k+1) \rceil$

and

$a_j = \left\lceil \frac{2 \lceil \log_2 (k+1) \rceil - 1}{k} \right\rceil + 1$

[Figure 3.11: Log-sum tree structure of a five-input MIRBA using BUs.]

The value of $a_j$ for various values of k is tabulated in Table 3.1. From the table we find that for k = 2 and k = 4, that is, radices 4 and 16, the value of $a_j$ = 3, and $a_j$ = 2 for all other values of k. Since minimum value of $a_j$
is desirable, a different arrangement of BUs, as described in the third approach given next, can be used to achieve $a_j$ = 2.

3.6.2.1.3.3 Tree-structure using RBA-3s and RBA-2s - In this configuration, 3-input redundant binary adders (RBA-3) and RBA-2s are connected in a tree structure. An RBA-3 consists of two BUs, a D-element and a C-element arranged as shown in Figure 3.12. The C-element composes two binary inputs $\{0,1;\ 0,\bar{1}\}$ into one redundant binary $\{\bar{1},0,1\}$ output. The lower BU in combination with the C-element and the D-element acts as a redundant binary (3,2) counter. The upper BU forms the sum of the sum-outputs of the lower BUs and the 'Transfer' output of the lower BU of the adjacent less significant RBA-3.

For a design of a (k+1)-input MIRBA, RBA-3s are used whenever they can be fully utilized, that is, three inputs are available for addition; and RBA-2s are used when only 2 inputs are to be added at any level of the tree structure. (An exception occurs for k = 3, where the log-sum tree technique is necessary.) Figure 3.13 shows a 5-input MIRBA using RBA-3s and RBA-2s as building blocks.

The number of BUs required in this technique is also k for a (k+1)-input MIRBA. The number of BU levels is also $2 \lceil \log_2 (k+1) \rceil$. Table 3.1 shows the values of $a^b$ and $a_j$ for various values of k. It shows that $a_j$ = 2 for all values of k except k = 3.

Figure 3.12 Arithmetic Structure of an RBA-3.

[Figure 3.13: A 5-input MIRBA (k = 4) built from RBA-3s and RBA-2s.]

The tree structure configurations described in 3.6.2.1.3.2 and 3.6.2.1.3.3 have the following advantages compared to Rohatsch's technique.

a) It is more general and has the same configuration for any value of k.
b) It makes use of only one kind of cell, that is, the Borovec Unit, for the implementation of the MIRBA.

c) The various BUs are uniformly and regularly interconnected.

Because of b) and c) above, this implementation meets the LSI constraints of structure regularity and minimum cell number type.

In terms of our notation of Section 2.4,

${}_{FMA}F_i = {}_{FMA}F_{i-1}, \quad 1 \le i \le n$

${}_{FMA}F_0 = m_j$

where
${}_{FMA}F_i$ = modifier value which is sent by $PE_i$ along with microinstruction FMA to $PE_{i+1}$.
${}_{FMA}F_0$ = modifier value sent by MCU along with microinstruction FMA to $PE_1$.
$m_j$ = Multiplier (or Quotient) digit.

${}_{FMA}G_{i+1} = (t^P_i,\ t^A_i)$

where
$t^P_i$ = Product Transfer to $PE_i$ from $PE_{i+1}$.
$t^A_i$ = MIRBA Output Transfer from $PE_{i+1}$.

$a_{FMA} = 2$.

3.6.3 Multi-sum (MS) microinstruction - This microinstruction forms the sum of N digit vectors, where N is the number of inputs of a MIRBA used in the implementation of microinstruction FMA. N depends on the digit algorithm used for FMA. In any case $N \le k+1$. The digit level transfer function is given by

$a_i = x^1_i + x^2_i + \ldots + x^N_i - r\,t^A_{i-1} + t^A_i$    (3.6.13)

where $a_i, x_i \in \{\overline{(r-1)}, \overline{(r-2)}, \ldots, \bar{1}, 0, 1, \ldots, (r-1)\}$.

If we designate the set of arithmetic transformations performed by a Borovec Unit as Borovec Unit Transformation (BAT), then the transfer function in Equation (3.6.13) is realized by a series of $\lceil \log_2 N \rceil$ BATs. This is discussed earlier in the design of a MIRBA in Section 3.6.2.1.3.

Implementation

The MS digit algorithm can be implemented by making use of MIRBAs already existing in the digit processing logic of the processing element. This is shown in Figure 3.14.

The MS microinstruction can be represented in the notation of Section 2.4 as follows.

${}_{MS}F_i = {}_{MS}F_{i-1}, \quad 1 \le i \le n$

${}_{MS}G_{i+1} = t^A_i$

[Figure 3.14: Implementation of the MS digit algorithm using the MIRBAs of the digit processing logic.]
X +J « v | ^ -H 32. 3.6.4 Normalize Recode (NR) microinstruction - This microinstruction is used to normalize an operand according to Definition 3 given in Section 3.5.1. Given an operand of the form A. - * X- X n ••• X , ••• X . X , , .. • • • X 12 l j j+1 n such that xj > 1 <_ i < j-1, x. = 0, and |x. I >_ j+1 <^ i <_ n the NR microinstruction transforms the operand X into an algebraically equivalent operand X' X" = . 00 .. < x' . . . . x" . . . x' . xC,, ... x - * h h+1 i j j+1 n where sign (x') = sign (x ) , h <_ i <^ j-1 h >_ 1 x: = x. = J 3 and xj* = x,, j+1 <^ k, <_ n For example, if radix r=10, then the numbers .199704, .1909704, .179412 and .109018 would be recoded respectively as .001704, .0109704, .160608 and .109018. if +ve 1 if -ve 87 Digit Algorithm Let S_ = Sign of the operand S = Sign of digit x. = < r = radix |x. | = magnitude of digit x. The digit algorithm is given by the flowchart of Figure 3.15. Initially S is known and is equal to S.. . In terms of the notation of Section 2.4, F =S 1 < i < 1 - 2 NR i a 0P - - J where = i > j - 1 s op = s i for i - °* j is the index of first zero digit in the operand d-vector G = x i < i < i-1 NR i+1 x i+l J - - J and a NR = 1. 3.6.5 Assimilation Recode (AR) microinstruction - This microinstruc- tion is used to assimilate (convert) a signed-digit redundant operand into an algebraically equivalent operand such that all the digits in the re- coded form are of the same sign as the sign of original operand. In the actual implementation of the digit algorithm for NR, MD G .... NR i+1 information consists of S.,, and Z. ., where S #11 is a bit carrying l+l l+l l+l sign information of digit x.,, and Z tl1 is also a bit whose two states ° l+l l+l indicate where digit x. in is zero or not. (cf. Section 4.3.2.3.6.2) ° l+l 88 I 1 I Paas the micro- I instruction to PE, YES YES I T I Figure 3.15 Flowchart of the Digit Algorithm for Microinstruction NR. 89 Given a signed -digit operand of the form X S . X;L x 2 ... Xl x. 
+2 ... x k 00 ... XfcH ... x n such that sign (x. + .) = sign (x. +2 ) and sign (x. ..) = sign (x, ) = = sign (x, ) = sign (x. ) That is, all the zero digits in the d-vector of number X have the same sign as that of the first nonzero digit to the immediate right of a string of zeros (of length >_ 1), the AR digit algorithm would recode X into X' j\ — • X. Xa • • • A, A..., X . . f~, ••• A. X. . - • « • X« . _ • • • A 12 l l+l i+2 k k+1 k+£ n such that X = X' and sign (xp = sign (x. + ,) = sign (x) V 1 * 1 1 *■ l n Digit Algorithm Let S_ = sign of the operand S. = sign of the digit i S = sign of the digit i+1 x = digit i of operand X The digit algorithm is almost identical with that for NR micro- instruction except that the microinstruction acts on all the digits of the operand. It is given by the flowchart shown in Figure 3.16. In the notation of Section 2.4, AR F i = S 0P l i * < » = S 1 AR G i+l = S i+1 ! 1 i 1 n a._ = Variable depending on the d-vector 90 Pass the micro- instruction to PE, 'i+1 OP YES NO ^S » S ^^ ^S - S \^YES < ' ^Y^NO \* ± \ - r-lxj |xj - r-l-lxj |x ± | - IxJ - 1 1 .. \ Figure 3.16 Flowchart of the Digit Algorithm for Microinstruction AR. 91 4. LOGIC DESIGN OF THE PROCESSING ELEMENT 4.1 Introduction In this chapter, the logic design of the Processing Element (PE) is developed and discussed in detail. The major components of the PE are the Register File for the temporary storage of active operands, the Digit Processing Logic (DPL) which is essentially a large combinational logic circuit and the Processing Element Control Logic (PCL) which supplies the control signals in proper temporal order to condition the combina- tional DPL to execute the various microinstructions. The major consider- ations in the logic design of the PE are the LSI technology constraints: namely, the PE should require as few external pins as possible and that the logical organization of the PE should have structural uniformity and regularity. 
Section 4.2 discusses the logic design of the data path structure of the PE, and Section 4.3 gives the logical organization and detailed design of the control algorithms for the generation of control signals. Finally, Section 4.4 discusses the logic complexity of the DPL and the PE control logic in terms of the number of gates and the external pins for the PE module as a function of the bit width of the PE module.

4.2 Block Diagram Description of a Processing Element

Figure 4.1 shows the schematic block diagram of a Processing Element. It consists of three main components: Digit Processing Logic (DPL), Register File and Control. The Register File comprises a set of digit-wide registers which are used to hold the operand digits and ...

[Figure 4.1: Schematic block diagram of a Processing Element (Register File, Digit Processing Logic, Control).]

... The digit $\phi_i$ comes from the operand register INR2 via the output bus selector sROB, and the multiplier digit $m_j$ is input to the DPG from the microinstruction register MIR in the local control logic of the PE.

* The necessity of the register APR for the multiplicand digit from the adjacent PE would be clear from the discussion in Section 4.2.2.5.

The multi-input adder, MIAD, adds the $w_i$ columns of the redundant binary product array formed by DPG and the collective product transfer
$t^P_i$. The MIAD is also used to add the two operand digits $a_i$ and $\phi_i$ from the registers R1 and R2 for the microinstruction SS and to add the operand digits for the microinstruction MS. The radix-$2^k$ multi-input adder is made up of k stages of MIRBAs.

Figure 4.3 Block Diagram of Digit Processing Logic (DPL).

The Digit Sum Encoder DSE converts the redundant binary sum output of adder MIAD to the $SM_r$ format for local storage in the accumulator register INR1 of the register file or for transfer out of the PE. The DSE is also used in the microinstructions AR and NR for forming the radix and diminished radix complement of the magnitude bits of the accumulator register INR1 and also for subtracting unity from the magnitude of the accumulator contents. In addition, sDSE and DSE are made use of in the inter-register transfer microinstructions TD and TI for direct and reversed-sign inter-register transfer.

The Adder Input Selector sADR routes appropriate data in redundant binary form to the MIAD inputs depending on the microinstruction the PE is executing at that time. The selector sDSE selects the appropriate input to the encoder DSE.

Also shown in Figure 4.3 are input and output ports designated as $TIP_i$, $RIP_i$ and $TOP_i$, $ROP_i$, respectively. The input ports $TIP_i$ and $RIP_i$, respectively, carry the 'transfer' (carry or borrow) from the adjacent MIAD and the contents of some register in the register file of the adjacent $PE_{i+1}$. These ports essentially carry the 'G-information' from the adjacent $PE_{i+1}$ for the microinstruction that is being executed by the present $PE_i$. The output ports $TOP_i$ and $ROP_i$
are, however, shared to carry the 'G-information' for the left neighbor $PE_{i-1}$ and also the address and data information, respectively, for the local operand memory $PEM_i$. This is made use of for fetching data from and storing data to the $PEM_i$ under the control of microinstructions LPM and SPM. The selector sTOP selects either the 'transfer' information from MIAD or the address bits and Read/Write bit for $PEM_i$.

The details of the logic design of the various blocks described above are discussed in the following sections. Since the logic complexity of the major components DPG, MIAD, DSE and sADR is dependent on the choice of logic vector encoding for the redundant binary digits, the three logic vector encodings considered for study are described first. They are followed by the logic design details of the major components.

4.2.2.2 Choice of logic vector encodings - As mentioned in Section 3.4.1, the redundant binary ($RB_r$) mode of encoding for a radix-r signed digit is used for the arithmetic processing. Each redundant binary digit requires 2 bits for representation. There are nine distinct ways, under permutation and negation [40], of assigning three values $(\bar{1},0,1)$ to four states of two binary logic variables. Of the nine ways, three encodings were chosen for this study because they are the simplest as far as the conversion from the $SM_r$ mode to the chosen encoding for $RB_r$ mode is concerned. Let a radix-$2^k$ signed digit $x_i$, encoded in $SM_r$ mode, be represented by a (k+1)-tuple $(S_i, x_{i_{k-1}}, x_{i_{k-2}}, \ldots, x_{i_0})$ such that

$x_i = (-1)^{S_i} \sum_{j=0}^{k-1} x_{i_j} 2^j, \quad x_{i_j} \in \{0,1\}$.

The corresponding $RB_r$ encoded form is given by

$x_i = \sum_{j=0}^{k-1} x^*_{i_j} 2^j, \quad x^*_{i_j} \in \{\bar{1},0,1\}$.

Let the redundant binary digit $x^*_{i_j}$ be represented by a 2-tuple logic vector $(\chi_{i_j}, x_{i_j})$ where

$\chi_{i_j}, x_{i_j} \in \{0,1\}$.

The three logic vector encodings for the redundant binary digit $x^*_{i_j}$ considered in this research are given in Table 4.1.
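The relation between the sign-magnitude form and the redundant binary form of a radix-$2^k$ digit can be illustrated with a short Python sketch. It assumes the most direct conversion, in which the digit sign is attached to every magnitude bit, so that $x^*_{i_j} = (-1)^{S_i} x_{i_j}$; the function names are hypothetical and the bit lists are written least significant bit first.

```python
def sm_to_rb(sign, mag_bits):
    """Convert a sign-magnitude digit to redundant binary digits.

    sign     : 0 for positive, 1 for negative (S_i)
    mag_bits : magnitude bits x_{i_j}, index j has weight 2**j
    returns  : list of redundant binary digits, each in {-1, 0, 1}
    """
    return [(-1) ** sign * b for b in mag_bits]


def rb_value(rb_digits):
    """Value of a redundant binary digit vector (index j has weight 2**j)."""
    return sum(d * 2 ** j for j, d in enumerate(rb_digits))


# Radix-16 digit -13: sign = 1, magnitude 13 = 1101 in binary (LSB first: 1,0,1,1)
rb = sm_to_rb(1, [1, 0, 1, 1])
print(rb)            # [-1, 0, -1, -1]
print(rb_value(rb))  # -13
```

Both forms denote the same value; only the placement of the sign information differs.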
Table 4.1 Logic Vector Encodings Encodings Binary 2-tuple logic vector LVE LVE 2 LVE 3 * X i. * * X i. .1 1 1 1 1 I 1 1 D.C 1 1 I The conversion from SM mode to the encoding format LVE- is the simplest and is equivalent to attaching the sign S of the SM encoded k radix-2 digit to each magnitude bit x. individually. The conversion for the three encodings are given by 101 LVE : x = X, " x j j J x i = X i ® S i (4,1) j j X i. = S i J where stands for exclusive — OR or x. = S. A x. i. 11. j J X i. = S i - X i. (A. 2) (4.3) * LVE, : x. = x. 2x, with x. x. disallowed __1 ij ij *■, ij ij X. = Xj 1 . 1. J J X- = S. - x. i. i i. 3 J * x i. LVE- : x. = (-1) J . x. 3 l. l. J 3 x. = x. (4.4) 3 For the encodings LVE.. and LVE~, the conversion logic requires one exclusive-OR gate (Equation(4.1)) or two AND-gates (Equation (4.2)), and one AND-gate (Equation (4.3)) for each redundant binary digit respectively. For one radix-2 digit conversion, there are k redundant binary digits. Encoding LVE~ is essentially a sign and magnitude encoding of the re- dundant binary digit x by the 2-tuple (x . » x. ). Logic variables x.» 102 and x respectively act as sign and magnitude bits. This encoding format J would also be referred to, in subsequent discussion, as SM, format where subscript b indicates radix-2 (or binary) . 4.2.2.3 Logic design of RBA-2 (BU) - Let £ , m denote the redundant * binary inputs and d denote the redundant binary output of a RBA-2. Further let £ , m and d be respectively represented by the logic vari- able pairs (X , £ ) , (u , m ) and (6 , d ). Also let t , t and t ,, V V V V V V V V v-1 t __ be the input and output 'Transfers' of the RBA-2 as shown in Figure 4.4. In the configuration shown in Figure 4.4, it has a cascade combination of a symmetric subtracter and a symmetric adder. Robertson [40] has given the logic equations for the symmetric subtracter and symmetric adder for all the nine distinct encodings referred earlier. 
Using those results, the logic equations for the RBA-2 for the three logic vector encodings being considered here are given as follows: LVE, : d = X © £ © y © m © t 1 V V V V V ^ V 6 = t + V V t~ . = £ m V * (£ vm ) (4.5) v-1 v v v v v t = w y V t~ (wVu) v-1 V V V V V w = X © £ ©m V V V V 103 0,1 « djfi.o.i SYMMETRIC / ADDER— ^ / (SA) / / 0,1" / ^_. SYMMETRIC SUBTRACTER-^ / (SS) ^/ / / ojf 1,0,1 loj_ _J __J 0,1 0,1 1,0,1 m, 1,0,1 •0,1 0,1 1,0,1 0,1' 0,1" 1,0,1 1,0,1 Figure 4.4 Algebraic Design of a 2-input Redundant Binary Adder (RBA-2) . 104 This is schematically represented in Figure 4.5. Each box in the figure essentially represents a full adder with a slightly modified carry function. Figure 4.6 shows the logic implementation. This implementation requires 22 two input NAND gates and the output digit d is available after 12 gate delays. Figure 4.7 shows another logic implementation that requires 27 gates but the output digit is available after only 9 gate delays. Note further that the logic in Figure 4.7 is no longer made of two identical logic substructures. The implementation of Figure 4.6 allows a simpler basic cell for LSI implementation of MIRBA at the cost of larger logic delay. LVE 2 : d = £ © m © t + © t" 6 = t + (A t" © m ) V V t , = X V SL y v-1 V V V (4.6) :,=t (U©y)vum)vy m £ . v-1 V V ^ V V V V V V The logic implementation of this adder using only 2 input NANDS is shown in Figure 4.8. Thirty-four two input NAND gates are needed. The output * * digit is available, 13 gate delays after the primary inputs l^ and m^ are stable because the 'Transfer' input t is available 7 gate delay after inputs I ,, and m ,-. r v+1 v+1 105 x + tj/.l -#- MFA W, MODIFIED FULL ADDER (MFA) I \v Xj/ l„ m^ fly m- Figure 4.5 Schematic Functional Diagram of an RBA-2 using LVE, . 
106 M (\l CO UJ < DC UJ GO 3 O t>0 O hJ c •H CO 3 ^ i-H 4: g § en to « > o • C r- O W c •H t3 o o ex a B W o o •H 4J 00 U O CO hJ > v^ 3* •H Pi-, 107 00 o h4 00 C •H CO CM I < 4-1 o c o CM c o •H en 0) > 00 c •H T3 O O a w 1-4 o o oo a o u 3 00 •H 108 u •H 60 O hJ 00 c •H CO 3 CN I < CM w > 60 c x) o H O a c W CJ O 60 O O CO ►J > 00 CD 109 LVE 3 : d V ~ I V 6 V = + t V d = £ © m © t + © t~ V V v (A. 7) t n =X£vym£ v-1 v v v v v t ,= t ((«, © m ) V £ y ) V y I m v-1 V vv vv v v \ The logic implementation of this adder using only 2 input NANDS is shown in Figure 4.9. This RBA-2 realization requires 26 gates and the output * d is available 13 gate delays after the primary inputs of this RBA-2 and its adjacent RBA-2 are stable. Note that the lower gate delay for the sum output of RBA-2 using LVE, encoding is achieved because the logic variable d is a function of £ , m V V V and t only. In the other two encodings, d is dependent on t also. V J ° V r V 4.2.2.4 Logic design of a radix-2 multi-input adder (MIAD) - The radix-2 adder MIAD is used for two purposes; 1) to add the columns of the redundant binary product array formed by DPG and 2) to form the sum of the operand digits for microinstructions SS and MS. Figure 4.10 shows the schematic diagram of a radix-16 (k = 4) MIAD. It consists of 4 MIRBAs * each of five inputs each. Each MIRBA has two outputs — one a MF corre- sponding to the sum of all the five inputs (used in microinstructions MS and FMA) and a SS corresponding to the sum of only two inputs — for micro- instruction SS to the left-most BU in the bottom level of the tree of BUs and RBA-3s making up a MIRBA. The proper data is routed to the inputs of MIRBAs by the adder input selector sADR. MATE and MATD are the encoders 110 <0 CM t/> Mi I- < e> or UJ CD 2 Z> Z < o 60 O ■-J 60 c •H CO CM I < « o C CO o w •H > 4-1 00 c c CD -H §T3 O iH O O. 
...

[Figure 4.9: Logic implementation of an RBA-2 using encoding $LVE_3$ (2-input NAND gates).]

[Figure 4.10: Schematic diagram of a radix-16 (k = 4) MIAD.]

Figure 4.11a Schematic Diagram of Square Array DPG.

Figure 4.11b Illustration of 'Adjacent Generation' of $t^P_{i-1}$.

Figure 4.11c Illustration of 'Local Generation' of $t^P_i$.

The conversion logic, however, requires only 2k exclusive-OR gates (Equation (4.1)) or 4k AND gates (Equation (4.2)) for the encoding $LVE_1$, 2k AND gates (Equation (4.3)) for $LVE_2$ and none for the encoding $LVE_3$.

The pins contributed by DPG to the pin complexity of DPL are those pins which are required for the 'Collective Product Transfers' $t^P_i$ and $t^P_{i-1}$. If $t^P_{i-1}$ is generated in $PE_i$, then the pins needed for transmission of $t^P_{i-1}$ to the adjacent $PE_{i-1}$ consist of one pin for the sign of $t^P_{i-1}$ and (k-1)+(k-2)+...+1 = k(k-1)/2 pins for the magnitude bits, assuming that the conversion to redundant binary form is done in $PE_{i-1}$. We shall call this method of generating $t^P_{i-1}$ and $t^P_i$ 'Adjacent Generation' (AG) of Collective Product Transfer (Figures 4.11b and 3.9).

These pins can, however, be reduced to only (k+1) from k(k-1)/2 if the CPT $t^P_{i-1}$ ($t^P_i$) is generated locally in $PE_{i-1}$ ($PE_i$) itself, where it is needed. $t^P_{i-1}$ ($t^P_i$) is a function of the multiplicand digit $\phi_i$ ($\phi_{i+1}$) in $PE_i$ ($PE_{i+1}$) and the multiplier digit $m_j$, the latter being the same in both $PE_i$ and $PE_{i-1}$ ($PE_{i+1}$).
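As a quick arithmetic check of the two pin counts quoted above, the following Python sketch tabulates them for several values of k. It assumes the Adjacent Generation count is one sign pin plus k(k-1)/2 magnitude pins, and the Local Generation count is k+1; the function name is hypothetical.

```python
def cpt_pins(k):
    """Return (adjacent_generation_pins, local_generation_pins) for radix 2**k.

    Assumed reading of the text:
      AG: 1 sign pin + k(k-1)/2 magnitude pins
      LG: k+1 pins (sign-magnitude encoded multiplicand digit)
    """
    adjacent = 1 + k * (k - 1) // 2
    local = k + 1
    return adjacent, local


for k in range(2, 9):
    ag, lg = cpt_pins(k)
    print(f"k={k}  radix={2 ** k:3d}  AG pins={ag:2d}  LG pins={lg}")
```

For radix 16 (k = 4) this gives 7 pins for Adjacent Generation against 5 for Local Generation; under this reading the saving appears only for k of 4 or more and grows quadratically with k.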
Thus $PE_i$ ($PE_{i-1}$) needs to know only the multiplicand digit $\phi_{i+1}$ ($\phi_i$) in $PE_{i+1}$ ($PE_i$) to generate $t^P_i$ ($t^P_{i-1}$), and this requires only (k+1) pins for the $SM_r$ encoded multiplicand digit $\phi_{i+1}$. We shall term this method of generating CPTs 'Local Generation' (LG) of CPT. This is shown in Figure 4.11c. Figure 4.11d shows a DPG using 'Local Generation' of $t^P_i$. In the LG method of generating CPTs, the logic for DPG requires one more exclusive-OR gate than for the AG method.

For the algorithm 2 of FMA, where the DPG is implemented in ROM, the pins required for $t^P_{i-1}$ are only (k+1): one for the sign of the product and k for the magnitude bits of the product. This is shown in Figure 3.10. In the block diagram of DPL shown in Figure 4.3, local generation of $t^P_i$ is assumed. The register APR is used to hold the multiplicand digit $\phi_{i+1}$ from the adjacent $PE_{i+1}$.

Figure 4.11d Illustration of a Combination of an MIAD and DPG using 'Local Generation' of $t^P_i$.

4.2.2.6 Logic design of digit sum encoder - The Digit Sum Encoder (DSE) transforms the redundant binary sum output of the radix-$2^k$ adder into an algebraically equivalent radix-$2^k$ sum digit in $SM_r$ format for either local storage in the processing element or transmission out of the PE. The DSE is an iterative logic network and involves carry propagation. Its action can be described as a two-step process:

a) determination of the sign of the redundant binary sum digit and its conversion to an algebraically equivalent sum digit in 2's complement, and

b) conversion of the 2's complement form of the sum digit to $SM_r$ format.

Figure 4.12a shows the DSE in block diagram form. Let the input and output sum digit $x_i$ be respectively given by (4.8) and (4.9)

$x_i = \sum_{j=0}^{k-1} x^*_{i_j} 2^j, \quad x^*_{i_j} \in \{\bar{1},0,1\}$    (4.8)

$x_i = (-1)^{S_i} \sum_{j=0}^{k-1} x_{i_j} 2^j, \quad S_i, x_{i_j} \in \{0,1\}$    (4.9)

where $x^*_{i_j}$ is represented by a 2-tuple logic vector $(\chi_{i_j}, x_{i_j})$ such that

$\chi_{i_j}, x_{i_j} \in \{0,1\}$.
Figure 4.12a Block Diagram of Digit Sum Encoder (DSE), comprising the sign determination and RB to 2's complement transformation (RBTC) followed by the 2's complement to $SM_r$ format transformation (TCSM).

Figure 4.12b Logic Network Realization of RBTC.

Figure 4.12c Logic Network Realization of TCSM. ($\sigma_i = P_{i_k}$)

The Redundant Binary to Two's Complement (RBTC) logic (Figure 4.12b) converts input $x_i$ to $y_i$ such that

$y_i = (-1) \cdot P_{i_k} \cdot 2^k + \sum_{j=0}^{k-1} y_{i_j} 2^j, \quad y_{i_j} \in \{0,1\}$    (4.10)

The logic equations of the RBTC network for the three logic vector encodings of the input sum digit are given by

$LVE_3$:
$y_{i_j} = x_{i_j} \oplus P_{i_j}, \quad j = 0,1,\ldots,k-1$    (4.11)
$P_{i_{j+1}} = (\chi_{i_j} \wedge x_{i_j}) \vee (P_{i_j} \wedge \bar{x}_{i_j})$    (4.12)
$P_{i_0} = 0$

$LVE_2$: Same as for $LVE_3$.

$LVE_1$:
$y_{i_j} = x_{i_j} \oplus \chi_{i_j} \oplus P_{i_j}$    (4.13)
$P_{i_{j+1}} = (\bar{x}_{i_j} \wedge \chi_{i_j}) \vee (P_{i_j} \wedge \overline{(x_{i_j} \oplus \chi_{i_j})})$    (4.14)
$P_{i_0} = 0$

The logic equations for the logic network TCSM (Figure 4.12c), which converts the 2's complement form $y_i$ to the corresponding $SM_r$ format, are independent of the logic vector encodings for the input sum digit. The logic equations are

$x_{i_j} = y_{i_j} \oplus (\sigma_i \wedge Z_{i_j})$    (4.15)
$Z_{i_{j+1}} = Z_{i_j} \vee y_{i_j}$    (4.16)
$\sigma_i = P_{i_k}$

The signal $Z_{i_j}$, if equal to logical zero, implies that the binary digits $y_{i_{j-1}}, y_{i_{j-2}}, \ldots, y_{i_0}$ are all logical zero.

The Digit Sum Encoder DSE logic is also used to achieve the radix ($2^k$) complement and diminished radix ($2^k - 1$) complement of the magnitude bits of ROB input to DSE via sDSE. Assuming logic vector encoding $LVE_3$,

$\chi_{i_j} = 1, \quad j = 0,1,\ldots,k-1$
$P_{i_0} = \sigma_i = 0$

will cause the radix complement of the magnitude bits to appear at the output of DSE, whereas

$\chi_{i_j} = 1, \quad j = 0,1,\ldots,k-1$
$P_{i_0} = 1, \quad \sigma_i = 0$

will generate the diminished radix complement of the input magnitude bits. Similarly,

$\chi_{i_j} = 0, \quad j = 0,1,\ldots,k-1$
$P_{i_0} = 1, \quad \sigma_i = 0$

will subtract unity from the value of the input magnitude bits. These particular values of $\chi_{i_j}$, $P_{i_0}$
, and $\sigma_i$ are made use of in the processing of microinstructions AR and NR, as described in Section 4.3.2.3.7. However, in the case of the inter-register transfer microinstructions TD and TI, the magnitude bits at the input and output are to remain unchanged; the sign bit, $S_i$, is equal to the complement of $S_{ROB}$ for microinstruction TI, where $S_{ROB}$ denotes the sign of the digit on the bus ROB.

From Figures 4.12(b) and 4.12(c), we see that RBTC and TCSM consist of k stages each of identical logic cells. Each cell requires four 2-input NAND gates and one exclusive-OR (EX) gate. An EX-gate is equivalent to four 2-input NAND gates. Therefore the total number of gates, $G_{DSE}$, required by DSE logic using logic vector encoding $LVE_3$ or $LVE_2$ is given by

$G_{DSE} = 16k + C_1$    (4.17)

where $C_1$ is a constant and gives the gates necessary for generation of $P_{i_0}$ and $\sigma_i$ under various control signal conditions. Use of logic vector encoding $LVE_1$ will raise $G_{DSE}$ to $26k + C_1$.

In the remainder of this chapter, we shall assume only the sign-magnitude logic vector encoding $LVE_3$ for the redundant binary digit because

a) conversion from sign-magnitude format $SM_r$ to $RB_r$ format is the simplest for logic vector encoding $LVE_3$ for the redundant binary digit, as shown in Equation (4.4), and

b) the number of gates required for the logic implementation of the Digit Sum Encoder, DSE, is less in the case of the encoding $LVE_3$ than that of $LVE_1$, whereas the gates required for an RBA-2 are comparable for both the encodings $LVE_1$ and $LVE_3$. The encoding $LVE_2$ is too expensive for the logic implementation of an RBA-2 and hence of MIAD, which is the major consumer of gates in the DPL.

4.2.2.7 Logic design of selector networks - Since the Adder MIAD and Digit Sum Encoder, DSE, are shared by more than one microinstruction, selector networks are needed in order to route appropriate data to the inputs of these processing logics. These selector networks also do reformatting of data, if necessary.
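Returning briefly to the Digit Sum Encoder of Section 4.2.2.6, its two-step action (RBTC followed by TCSM) can be checked with a small bit-level Python model. The sketch assumes the sign-magnitude encoding $LVE_3$ and the following reading of the logic equations: $y_j = x_j \oplus P_j$, $P_{j+1} = (\chi_j \wedge x_j) \vee (P_j \wedge \bar{x}_j)$, $P_0 = 0$ for RBTC, and $x_j = y_j \oplus (\sigma \wedge Z_j)$, $Z_{j+1} = Z_j \vee y_j$, $Z_0 = 0$, $\sigma = P_k$ for TCSM. Function names are hypothetical; bit lists are least significant bit first.

```python
from itertools import product


def rbtc(chi, x, p0=0):
    """Redundant binary (sign bits chi, magnitude bits x) to two's complement.

    Returns (y, p_k) where value = -p_k * 2**k + sum(y[j] * 2**j).
    p acts as a borrow rippling from the least significant position.
    """
    p, y = p0, []
    for cj, xj in zip(chi, x):
        y.append(xj ^ p)
        p = (cj & xj) | (p & (1 - xj))
    return y, p


def tcsm(y, sigma):
    """Two's complement bits y with sign sigma to sign-magnitude (sigma, mag)."""
    z, mag = 0, []
    for yj in y:
        mag.append(yj ^ (sigma & z))   # complement bits above the lowest 1
        z |= yj
    return sigma, mag


def dse(rb_digits):
    """Model of the encoder for a list of redundant binary digits in {-1,0,1}."""
    chi = [1 if d < 0 else 0 for d in rb_digits]
    x = [abs(d) for d in rb_digits]
    y, p_k = rbtc(chi, x)
    return tcsm(y, p_k)


# Exhaustive check for k = 4 (radix 16)
for digits in product((-1, 0, 1), repeat=4):
    value = sum(d * 2 ** j for j, d in enumerate(digits))
    sign, mag = dse(list(digits))
    assert (-1) ** sign * sum(b * 2 ** j for j, b in enumerate(mag)) == value
```

The same `rbtc` routine also reproduces the complementing uses described in Section 4.2.2.6: with all sign bits set and `p0=0` the output bits are the radix complement of the magnitude, and with `p0=1` the diminished radix complement.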
Besides the selector networks sADR and sDSE, two more selector networks, sROB and sRIB, are necessary for transferring data out of and into the various registers of the register file. In addition, selector sTOP is used to choose the contents of output port TOP. In the following discussion, logic vector encoding $LVE_3$ for the redundant binary digit is assumed.

4.2.2.7.1 Logic design of adder input selector (sADR) - Selector sADR accepts inputs from two sources: 1) the 'Product Array' $w_i$ and the 'Collective Product Transfer' array $t^P_i$, along with their corresponding signs $Sw_i$ and $St_i$, from the output of DPG and 2) the contents R2,...,R5 of the internal registers INR2,...,INR5 of the register file. These inputs are in sign-magnitude format, $SM_r$. Depending on the microinstruction, the sADR directs the appropriate data, reformatted in redundant binary form, to the inputs of the MIAD. For the microinstruction FMA, the outputs of DPG are routed appropriately so that the redundant binary elements of the 'product' and 'transfer' arrays are added by MIRBAs of appropriate weights. In the case of microinstruction SS, only the contents R2 of register INR2 are input to the adder, and for microinstruction MS, the contents of one or more of the four registers INR2,...,INR5 are directed to the input of the adder. The contents R1 of the accumulator register INR1 are input directly to the adder. The logic networks for the selector sADR for radix 16 (k = 4) are shown in Figures 4.13a and 4.13b. Figure 4.13a shows the selector for the magnitude bits and Figure 4.13b shows the generation of appropriate sign bits for the redundant binary adder inputs. $SR_j$ (j = 1,...,5) indicates the sign bit of inputs R1,...,R5. The control signals RjsADR (j = 2,3,4,5) and SWTsADR are provided by local control logic in the PE.
Since the selector networks have no memory and the data at the input of adder MIAD must be continuously available throughout the processing of microinstructions SS, MS and FMA, the selector control signals are permanently tied to the appropriate outputs of the microinstruction decoder. For any radix 2^k, and assuming that the adder MIAD is made up of k stages of (k+1)-input MIRBAs, the gates required for the magnitude and sign bits (using the sign-magnitude logic vector encoding for redundant binary) are respectively 3k^2 and (3k+1). Denoting by G_sADR the total number of gates for selector sADR for a radix-2^k PE, we have

$G_{sADR} = 3k^2 + (3k+1) = 3k^2 + 3k + 1$    (4.18)

[Figure 4.13a Logic Implementation of Selector sADR for Magnitude Bits.]

[Figure 4.13b Logic Implementation of Selector sADR for Sign Bits.]

4.2.2.7.2 Logic design of digit sum encoder selector (sDSE) - The selector sDSE (shown in Figure 4.14) accepts inputs from two sources: the two outputs a*SS_j and a*MF_j (j=0,1,...,k-1) of the adder MIAD, and the bus ROB_j (j=0,1,...,k-1). The control signals ASSsDSE, AMFsDSE and ROBsDSE respectively select the MIAD output a*SS_j (microinstruction SS), the MIAD output a*MF_j (microinstructions FMA and MS) and the bus ROB. The control signal SCHI sets the sign bit of the redundant binary input X_j (j=0,1,...,k-1) of the DSE so as to achieve the radix complement, the diminished radix complement or a direct transfer of the magnitude bits of ROB. This was explained earlier in Section 4.2.2.6. Figure 4.14 shows the logic implementation of sDSE for radix 16. Since the selector sDSE has no memory, the appropriate control signals must be held active throughout the processing of the microinstruction. For any radix 2^k, the total number of NAND gates, G_sDSE, required for the logic of sDSE is given by

$G_{sDSE} = 7k$    (4.19)

[Figure 4.14 Logic Implementation of Selector sDSE.]

4.2.2.7.3 Logic design of selectors sRIB, sROB and sTOP - The selector sRIB is a three-input multiplexer whose input sources are the DSE, APR and the digit field of the microinstruction register MIR in the control logic of the PE. The selector is one digit wide, and the number of NAND gates required for the logic implementation of this selector, shown in Figure 4.15, is given by G_sRIB, where

$G_{sRIB} = 4(k+1)$    (4.20)

[Figure 4.15 Logic Implementation of Selector sRIB.]

The control signals DSEsRIB, APRsRIB and MIRsRIB respectively select the sources DSE, APR and MIR. The path from MIR to sRIB is made use of in the processing of microinstructions RS and LPM.

The width of selector sTOP (Figure 4.16) is equal to the width of output port TOP_i. The width of TOP_i is determined by the number (=k+1) of bits required for the address space of PEM_i plus one more for the Read/Write function of PEM_i, and by the bits, P^A_{i-1}, required for the 'Adder Transfer' t^A_{i-1}, which depends on the method of encoding used for t^A_{i-1}. Assuming the width of TOP_i to be b, b is given by b = Max(k+2, P^A_{i-1}).
For MIADs using encoders and/or redundancy ratio δ ≤ 2/3, k+2 is greater than P^A_{i-1}. Therefore the number of gates, G_sTOP, required for the logic implementation of selector sTOP is

$G_{sTOP} = 3(k+2)$    (4.21)

The selector sROB selects the contents of one of the registers of the register file onto the register file output bus ROB. The gates required for this network depend on the number of registers in the register file and the bit width of the registers. For radix 2^k, the register width is (k+1) bits and, assuming (k+1) registers in the register file, the total number of gates required is

$G_{sROB} = (k+1)(k+2)$    (4.22)

Figure 4.17 shows the logic implementation of sROB for radix 2^4 (k=4).

[Figure 4.16 Logic Implementation of Selector sTOP. Note: MIR<4:0> are PEM address bits; MIR<8> is the Read/Write bit. The outputs go to PEM_i and/or PE_{i-1}.]

[Figure 4.17 Logic Implementation of Selector sROB.]

Note that these selectors have no memory. The control signals would have to remain active throughout the processing of a microinstruction. Although the selectors are shown to have separate control signals, fewer control signals with a local decoder would suffice. The separate control signals are shown for ease of exposition, because the separate signal names help to identify the sources easily.

4.2.2.7.4 Storage buffer registers of DPL - In addition to the combinational logic for processing and the selector networks for proper data routing, the DPL has three buffer registers GIR, APR and IBR. GIR and APR are used to hold the G-information from the adjacent PE. The register GIR holds the 'Adder Transfer' t^A from the adjacent PE_{i+1}, and the register APR stores the multiplicand digit for the local generation of the 'Product Transfer' t^P in the DPG.
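As an aside on the selector gate counts of Equations (4.18) through (4.22), the following is a minimal sketch that tabulates them for a given k. The function name and dictionary layout are illustrative assumptions, not part of the design; the sTOP entry assumes the case in which k+2 exceeds the adder-transfer width, and the sROB entry assumes (k+1) registers in the register file, as stated in the text.

```python
def selector_gate_counts(k: int) -> dict:
    """Gate estimates for the DPL selectors of a radix-2**k PE.

    Hypothetical helper: it only evaluates Equations (4.18) to (4.22).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return {
        "sADR": 3 * k * k + 3 * k + 1,  # Equation (4.18): magnitude + sign bits
        "sDSE": 7 * k,                  # Equation (4.19)
        "sRIB": 4 * (k + 1),            # Equation (4.20)
        "sTOP": 3 * (k + 2),            # Equation (4.21), assumes b = k + 2
        "sROB": (k + 1) * (k + 2),      # Equation (4.22), assumes k + 1 registers
    }


counts = selector_gate_counts(4)        # radix 16
for name, gates in counts.items():
    print(f"{name}: {gates} gates")
print("all selectors:", sum(counts.values()), "gates")
```

For radix 16 this gives 61, 28, 20, 18 and 30 gates respectively, so sADR dominates the selector cost because of its quadratic term.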
The width of APR is (k+1) bits for radix 2^k and the width of GIR depends upon the bit requirements of t^A. At the maximum, it is equal to 2k if no encoder MATE is used in the design of adder MIAD. However, if the number of inputs to the MIRBAs is reduced, either by changing the redundancy of the multiplier digit or by any other means, then the bit width of GIR would be correspondingly reduced. Assuming that the 'encoders' and 'decoders' MATE and MATD are not used for the adder transfers t^A, and that the number of inputs to each MIRBA is k' for a radix-2^k adder, the bit width of GIR is 2(k'-1).

The register APR is also used in the processing of the left shift microinstruction LS, where it temporarily holds the shifted digit from the right neighbor PE_{i+1} before that digit is stored in the register file via the selector sRIB.

Since the outputs of the internal registers of the register file are directly and permanently connected to the input of the combinational processing logic, it is necessary to provide a buffer register, IBR, at the output of selector sRIB. The output of the selector sRIB is gated into the register IBR, which thus isolates the input bus of the registers from any changes that might occur due to feedback through the combinational logic when the contents of the buffer register IBR (i.e., the result digit) are transferred to the appropriate register in the register file. The bit width of the register IBR is (k+1) for a radix-2^k digit.

4.3 Design of PE Control

The processing of a microinstruction in the PE requires the activation of the various data paths and the conditioning of the combinational transformation logic of the DPL, in a certain temporal order depending on the nature of the microinstruction. These time-ordered activation control signals are generated by the PE Control Logic (PCL), which is locally resident in the PE.
Another function of the local PE control is to coordinate the actions of the PEs, not only to obtain 'G' information from adjacent PEs for the processing of the microinstruction but also to receive and transmit the microinstructions from and to the adjacent PEs. The latter is necessary to process a 'machine instruction'. Each PE executes the same sequence of microinstructions, which is issued by the MCU depending on the 'machine instruction' to be processed by the Arithmetic Unit and the specific operand values. After executing microinstruction j-1, a typical PE, PE_i say, must determine the value of jG_i and inform PE_{i-1} of its availability. jG_i is needed by PE_{i-1} to execute microinstruction j. PE_i also passes the jth microinstruction and modifier to PE_{i+1} so that PE_{i+1} will determine jG_{i+1} in cooperation with PE_{i+2}, if necessary. When PE_i receives jG_{i+1}, it performs microinstruction j and begins the procedure for microinstruction j+1.

The control strategy for implementing the coordination of the various PEs can be either synchronous or asynchronous. In the former case, all the PEs act in synchronism with some central clock, whereas in the asynchronous case all the activities are controlled by request-response signals. In this paper, asynchronous control with request-response signals is chosen because of the following advantages:

a) It avoids the clock-skew problems that arise when a large number of PEs are concatenated together for high precision of arithmetic.

b) Due to the pipeline nature of processing, different PEs at any instant are executing different microinstructions which take different times to execute. The request-response strategy will provide a better overall average speed of processing.

c) The asynchronous control is compatible with the 'localized' nature of processing and with an autonomous and modular arithmetic element.

However, it does have the disadvantage of increasing the number of pins required for the PCL.
4.3.1 Logical organization of PE control - The PE control is organized as a set of six interacting subcontrols, some of which are active concurrently while others are activated in sequence, depending on the nature of the control algorithm for the microinstruction. Concurrently interacting controls give an average speed-up in the processing of microinstructions by allowing independent operations to take place in parallel. Figure 4.18 shows the various subcontrols and their interaction.

The division of PE control into subcontrols is based on a functional grouping of the various steps in the control flow. The subcontrols are R-control, T-control, G-control, E-control, F-control and DM-control.

The Decode and Main or DM-control is the main control which supervises and coordinates the actions of the other subcontrols. It handles the decoding of the microinstruction, sets up the necessary data paths in the DPL, and then chooses the proper subcontrols and their temporal order for the execution of the control algorithm of the microinstruction. (In a crude software analogy, the DM-control can be considered as the Main procedure and the other subcontrols, which are invoked by DM-control, as subroutines.)

The Receive or R-control and the Transmit or T-control are the primary controls for the coordination of PEs. R-control is concerned with accepting the microinstruction from the left neighbor PE_{i-1} and acknowledging the receipt of the microinstruction (OP-code μ_j and the modifier field jF_i). The T-control transmits the received microinstruction, with the same or a new modified F-field depending on the nature of the microinstruction, to PE_{i+1}.

The G-control and E-control together can be considered as constituting the main processing controls for the microinstruction. The G-control generates the G-information for the left neighbor PE_{i-1} and accepts the G-information from the right neighbor PE_{i+1}.
The Execute or E-control activates the necessary control signals to the combinational logic to calculate the result digit and gate it into the appropriate internal register of the register file. In addition, the status of the digit in the accumulator register is set. The status checking involves determining the sign and magnitude of the digit. If the accumulator digit is zero, the sign of the digit is considered to be unknown.

The F-control is used when a new value of the modifier field, different from that received, has to be sent to the right neighbor PE_{i+1}. It is made use of in the right shift microinstruction RS.

4.3.1.1 Global description of interaction of subcontrols - Figure 4.18 shows the interaction of the various subcontrols. It should be noted that Figure 4.18 does not show the hierarchical order in which the various subcontrols are invoked by DM-control but only gives a gross overview of the interaction. The specific temporal order of the various subcontrols in the control sequence of any microinstruction is discussed later in Section 4.3.2.3.3.

[Figure 4.18 Logic Organization of PE Control Signal Generator. Legend: INVOKE, RETURN (REPLY).]

The control sequence for every microinstruction begins in the R-control. The R-control, on receiving a go-ahead signal from DM-control to accept another microinstruction from the left neighbor PE_{i-1}, accepts the microinstruction and acknowledges its receipt. It also invokes the DM-control. The DM-control decodes the microinstruction, sets up the data paths in the DPL and invokes one or more of the F, G, T, and E controls depending on the microinstruction type. The F-control makes the changes in the modifier field of the microinstruction and calls on the T-control to transmit the modified microinstruction to PE_{i+1}. F-control is invoked only for the right shift microinstruction RS. If the processing of the microinstruction requires G-information, the G-control and T-control are invoked in parallel.
The G-control can be conceptually considered as comprising two subcontrols: G_gn-control, which generates G-information for the microinstruction executing in the adjacent PE_{i-1}, and G_ap-control, which accepts G-information from the right neighbor PE_{i+1}. (In the case where G-information depends logically on two or more right neighboring PEs (e.g., microinstructions FMA, AR), the subcontrols G_gn-control and G_ap-control interact with each other.) After the necessary G-information for the execution of the microinstruction has been obtained, the G_ap-control branches to E-control for the execution of the microinstruction.

In those cases when G-control is not invoked by DM-control because no G-information is needed from adjacent neighbors (e.g., microinstructions TD, TI, LDC), the DM-control directly calls upon the E and T controls in parallel. The T-control transfers the microinstruction to the right neighbor PE_{i+1}.

As the various invoked subcontrols finish their sequences of operations, they report back to the DM-control. When all the invoked subcontrols have finished, the DM-control replies to the R-control, which was suspended earlier from accepting any more microinstructions. The R-control is now ready again to accept another microinstruction, and the control sequence begins again.

4.3.2 Logic design of PE control

4.3.2.1 Block diagram description of PE control logic (PCL) - Figure 4.19 shows the major components of the PCL in block diagram form. It consists of a microinstruction register MIR, the selector network sMIR, the 'Zero magnitude and Sign Detector', ZSD, and the timing control signal generator, TCS. The register MIR is 11 bits wide and is used to hold the microinstruction, received on the microinstruction input port, MIP_i, from the adjacent PE_{i-1}, during processing by PE_i. The selector sMIR is a two-way multiplexer which chooses either MIP or ROB from the DPL as the appropriate source of data for the bits <4:0> of MIR.
The ZSD is a combinational logic block which monitors the sign and magnitude bits of the accumulator register INR1. It sets flip-flop Z_i to logical state '1' if the magnitude of the accumulator in PE_i is zero. Flip-flop S_i is set to the state of the sign bit SR1 of the accumulator register. The TCS generates the timing signals for the activation of data paths and processing logic in the DPL and for the coordination of the adjacent PEs. The generation of the appropriate control signals and their temporal order depends on the microinstruction, that is, on its digit algorithm and on the data flow structure of the DPL.

[Figure 4.19 Block diagram of the PE control logic (PCL).]

4.3.2.2 Design and description of microinstruction formats - The major considerations in the design of the various microinstruction formats are:

a) the bit width of the microinstruction should be as small as possible so that the pins required for the input port MIP are the fewest possible, and

b) the microinstructions should be powerful so that they take full advantage of the data flow structure of the DPL and facilitate the microprogramming of the 'machine instruction'.

These two aims conflict because b) requires a large instruction width. A compromise was achieved by using a varying number of bits for the OP-code of the microinstruction.

Basically, each microinstruction has an OP-code field μ_j and a modifier field jF_i, as was discussed in Section 2.6. The basic OP-code field is 3 bits long, and the width of the modifier field depends on the bit width (radix of arithmetic processing) of the PE and the number of addressable registers in the register file. The modifier field jF_i is further divided into two subfields: one field carries the address of the register in the register file of the PE and the other field carries either a digit or the address of the PEM location in the local operand mantissa memory.
For some microinstructions, these fields are used for other purposes.

Figure 4.20 shows the specific OP-code bit assignment and the formats for the various microinstructions. In this figure, it is assumed that the bit width of the PE is 5 bits (that is, the radix is 16) and that there are 5 (=k+1) registers in the register file of the DPL. The microinstructions LPM, SPM, RS, LS and LDC have three-bit OP-codes whereas microinstructions TD and TI have four-bit OP-codes. The OP-codes for microinstructions SS, MS and FMA are six bits long whereas for AR and NR they are seven bits long. The varying length of the OP-code allows a basic three-bit field for OP-codes; a straightforward coding of 12 microinstructions would otherwise have required four-bit OP-codes.

Figure 4.20 Microinstruction Codes and Formats. Each microinstruction occupies bits 10 through 0, with the OP-code μ_j in the high-order bits followed by the modifier field jF_i.

  Mnemonic  Modifier fields and operation
  LPM       A1: destination file register address; A2: source PEM address. INR[A1] ← PEM[A2]
  SPM       A1: source file register address; A2: destination PEM address. PEM[A2] ← INR[A1]
  RS        A1: address of file register to be shifted; D1: digit from left neighbor PE
  LS        A1: address of file register to be shifted; D2: digit, sent by MCU, to be stored in least significant PE
  LDC       A1: file register address; D1: digit to be loaded in file register. INR[A1] ← D1
  TD        A1: source file register address; A2: destination file register address. INR[A2] ← INR[A1]
  TI        A1: source file register address; A2: destination file register address. INR[A2] ← (-1) · INR[A1]
  SS        No modifier field. INR1 ← INR1 + INR2
  MS        A3: source file register addresses. A '1' in bit j indicates that file register INR(j+1) will take part in the multi-sum addition; the sum goes to INR1
  FMA       D3: multiplier digit. INR1 ← INR1 + D3 * INR2
  AR        S_p: sign of the operand (1 implies negative, 0 implies positive)
  NR        S_p: sign of the operand (1 implies negative, 0 implies positive)
It should be noted that the use of a more restricted set of microinstructions could have reduced the bit width of the microinstructions at the cost of less flexibility in microprogramming capability. In general, for a radix-2^k arithmetic structure, the bit width of a PE digit is (k+1), and assuming the register file to consist of (k+1) registers, the number of bits required for a microinstruction, I_b, is given by

$I_b = 3 + \lceil \log_2 (k+1) \rceil + (k+1) = k + \lceil \log_2 (k+1) \rceil + 4$    (4.23)

For radix 16 (k=4) this gives 3 + 3 + 5 = 11 bits, which is the width of the register MIR.

A description of the various microinstructions was given earlier in Section 2.6. The function of each microinstruction is briefly given again below.

The memory access microinstructions LPM and SPM are respectively used to fetch data from and store data to the processing element memory PEM associated with the PE. The microinstruction field A2 gives the location in PEM and field A1 identifies the register in the register file.

In the shift microinstructions RS and LS, A1 identifies the register to be shifted. The field D1 carries the digit from the register in the left adjacent PE, and D2 identifies the digit which must be loaded in the register of the least significant PE. This facility is made use of in multiplication, where the digit shifted out of the most significant PE during the left shift of partial products has to be saved in the least significant digital position of the multiplier operand register.

The field A1 in microinstruction LDC identifies the register to be loaded with the digit given in field D1.

In microinstructions TD and TI, A1 and A2 respectively identify the source and destination registers in the register file. Note that A2 can be equal to A1 in microinstruction TI, whereas such a condition in TD is meaningless.
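To make Equation (4.23) and the three-bit OP-code layout concrete, here is a small sketch. The bit positions for the three-bit OP-code formats (OP-code in bits <10:8>, register address in bits <7:5>, digit or PEM address in bits <4:0>) follow the references to MIR<7:5>, MIR<4:0> and MIR<8> elsewhere in this chapter for radix 16; the function names and the example word are hypothetical, and the example OP-code value is not taken from Figure 4.20.

```python
def instruction_width(k: int) -> int:
    """Equation (4.23): 3 OP-code bits + register address + one (k+1)-bit digit."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # For k >= 1, ceil(log2(k + 1)) equals the bit length of k,
    # which avoids floating-point rounding.
    address_bits = k.bit_length()
    return 3 + address_bits + (k + 1)


def split_fields(word: int) -> dict:
    """Split an 11-bit radix-16 microinstruction with a three-bit OP-code.

    Assumed layout: <10:8> OP-code, <7:5> register address A1,
    <4:0> digit or PEM address. Longer OP-codes are not handled here.
    """
    if not 0 <= word < (1 << 11):
        raise ValueError("word must fit in 11 bits")
    return {
        "opcode": (word >> 8) & 0b111,
        "a1": (word >> 5) & 0b111,
        "low": word & 0b11111,
    }


print(instruction_width(4))          # width for radix 16
print(split_fields(0b10101000111))   # hypothetical word, not a real OP-code
```

The microinstructions with four-, six- and seven-bit OP-codes reuse part of this modifier space for OP-code bits, so a full decoder would need the bit assignments of Figure 4.20.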
In the case of the arithmetic microinstruction SS, no special registers are identified because this microinstruction always causes the contents of accumulator register INR1 and operand register INR2 to be added, with the result going to the accumulator register.

For microinstruction MS, field A3 identifies the registers of the register file whose contents are to be added. Note that the address in A3 is not encoded; rather, each bit of A3 identifies a register. A bit value of '1' in A3 indicates that the corresponding file register takes part as a source of the operand. The '1' in the least significant position of A3 indicates that accumulator register INR1 is always one of the source registers in the MS instruction. The result of the addition always goes to the accumulator register INR1.

The D3 field in microinstruction FMA identifies the multiplier digit for the formation of the partial product.

For microinstructions AR and NR, bit 4 of the microinstruction carries the sign of the operand, S_p, which is the sign of the most significant nonzero digit in the accumulator. This sign is first determined by the MCU by a sequence of left shift microinstructions and by testing the status indicators Z_i and S_i of the most significant PE. The proper value of S_p, that is, bit 4, is set by the MCU before issuing the microinstruction.

4.3.2.3 Description of subcontrols by control sequence charts - The subcontrols are multi-output finite state machines which produce control signals in the proper temporal order for the execution of various microoperations during the processing of a microinstruction. These control signals condition the combinational processing logic to perform elementary microoperations such as opening or closing a register gate, setting selector networks to certain states or setting a control status memory element.
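The unencoded A3 field of microinstruction MS described above can be illustrated with a short sketch. It assumes the radix-16 case of Figure 4.20, in which a '1' in bit j of A3 selects file register INR(j+1); the function name is hypothetical, and the sketch only lists the selected registers without modelling the redundant binary addition itself.

```python
def ms_source_registers(a3: int, num_registers: int = 5) -> list:
    """Return the file registers selected by the one-bit-per-register field A3.

    Bit j of A3 set to 1 selects INR(j+1). The text states that bit 0
    (INR1, the accumulator) is always set; that is not enforced here.
    """
    if not 0 <= a3 < (1 << num_registers):
        raise ValueError("A3 does not fit in the modifier field")
    return [f"INR{j + 1}" for j in range(num_registers) if (a3 >> j) & 1]


# Hypothetical A3 value: bits 0, 1 and 3 are set.
print(ms_source_registers(0b01011))
```

Using one bit per register costs more field width than an encoded address, but it lets a single MS microinstruction name any subset of the operand registers.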
In addition, some of the control signals act as interface request-response signals for the coordination of the various PEs or for access to the local memory (PEM) module. The operation of the finite state machine can be described by a control sequence chart (CSC), which is a flowchart-like description of a control sequence. A control sequence is an instance of the execution of a subcontrol. The control sequence chart shows the various control signals generated during the execution of the subcontrol and their temporal order.

4.3.2.3.1 Control sequence chart conventions - A control sequence chart (CSC) consists of a set of rectangular, diamond and pentagonal boxes and entry and exit symbols connected together in a two-dimensional pattern with straight directed lines. The arrows on the lines indicate the direction of the control flow in the sequence. The various symbols used in the CSC are shown in Figure 4.21.

The diamond-shaped symbol (Figures 4.21a and 4.21b) represents the decision element, with a single entry and two exit points. The exit points are either labeled yes/no (Figure 4.21a), indicating the truth or falsehood of the statement written inside the box, or labeled with the actual name of the option (Figure 4.21b) that is valid on that exit point.

The rectangular box of Figure 4.21c represents a control step. A control step is a set of microoperations (indicated by control signals) enclosed in the rectangular box. The time ordering of the microoperations within a control step is not important and they are, in general, all activated in parallel.

The rectangular boxes of Figures 4.21d and 4.21e represent the invoking of another subcontrol whose name is written inside the box.
However, in the case of Figure 4.21d, the exit from the subcontrol returns the control flow to the point where it was invoked (like a subroutine call in software), whereas the control flow at the end of execution of the subcontrol indicated in Figure 4.21e branches to the next point in the control sequence chart.

The pentagonal boxes of Figures 4.21f and 4.21g respectively represent the 'FORK' and 'JOIN' symbols. The 'FORK' symbol indicates that the subcontrols at the exit points of the symbol are to be activated concurrently. On the other hand, the 'JOIN' symbol signifies that the replies from all the concurrently active control sequences indicated by the entry points to the box must be true before the control flow can proceed any further.

[Figure 4.21 Control Sequence Chart Symbols, panels (a) through (k).]

The entry to a control sequence chart is indicated by a single circle (Figure 4.21h) with the name of the corresponding subcontrol written in the circle. The oval symbol (Figure 4.21j) represents a 'return' to the invoking point of the subcontrol in the control flow. A double circle (Figure 4.21k) represents a branch to the entry point of the subcontrol whose name is written inside the circle.

The control sequence charts which are too big to fit on a single page have been drawn on different pages, but the entry point on each page is labeled the same. An example is the DM-control.

The microoperations within a control step box are indicated either by control signals of the form

control signal name := 1 or 0

or by transfer statements of the form

x ← y
x ← 1 or 0.

Most of the control signals in the DPL are level signals, whereas the interface request-response signals are pulse signals whose leading and trailing edges are used to indicate request, acknowledge and response states.
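In the spirit of the software analogy used earlier for the DM-control, the 'FORK' and 'JOIN' symbols described above can be sketched with threads. This is only an analogy under stated assumptions: the two placeholder functions stand in for subcontrols such as E-control and T-control, and `fork_join` is a hypothetical helper, not part of the PE control logic, which is implemented as asynchronous hardware.

```python
from concurrent.futures import ThreadPoolExecutor


def e_control() -> str:
    # Placeholder for the work of an invoked subcontrol.
    return "E-control"


def t_control() -> str:
    return "T-control"


def fork_join(subcontrols) -> list:
    """Activate all subcontrols concurrently and wait for every reply."""
    with ThreadPoolExecutor(max_workers=len(subcontrols)) as pool:
        # FORK: every subcontrol at an exit point is started.
        futures = [pool.submit(subcontrol) for subcontrol in subcontrols]
        # JOIN: control flow proceeds only after all replies have arrived.
        return [future.result() for future in futures]


replies = fork_join([e_control, t_control])
print("replies received from:", replies)
```

The list of replies is returned in the order in which the subcontrols were forked, regardless of which one finishes first.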
The '1' or '0' on the right hand side indicates the logically 'active' and 'inactive' state respectively. In the case of transfer statements indicated by the arrow ←, x represents a control status memory element which is set to the state '1' or '0' or to the state of 'y'.

The control signals for the selector networks are of the form XsY, where X indicates the input source to the selector network sY. The gate signals for the registers are of the form gRegisterName, where RegisterName identifies the register which has to be loaded with information. Square brackets [ ] indicate a subscript value as in ISP notation [42], and thus the address of a register or memory location when these brackets appear after a memory element name. The value of the subscript is written within the square brackets.

The angle brackets < > enclose lists of bit names. For example, if MIR is a register, then MIR<4:0> indicates bits 0 through 4 of register MIR and that the bits in MIR are numbered from right to left in ascending order.

The subscript i, i-1 or i+1 on a signal name indicates the index of the PE originating the interface control signal.

4.3.2.3.2 Description of R-control - The function of the R-control in PE_i is to accept and acknowledge a microinstruction from the adjacent PE_{i-1}, and to invoke the DM-control for the processing of the microinstruction. The control sequence chart for the R-control is shown in Figure 4.22.

[Figure 4.22 Control Sequence Chart for R-control. The chart waits for TRQ_{i-1} = 1 and then performs RC1: gMIR<4:0> := 1, gMIR<10:5> := 1; RC2: RACK_i := 0, MIPsMIR<4:0> := 0; RC3: DM-control, on return; RC4: RACK_i := 1, MIPsMIR<4:0> := 1.]

The R-control indicates to PE_{i-1} its readiness to accept another microinstruction by the signal RACK_i := 1. The R-control in PE_i monitors the request signal TRQ_{i-1} from PE_{i-1}. The active state of TRQ_{i-1} indicates that the information on input port MIP_i is valid, and R-control (control step RC1) loads the microinstruction into register MIR<10:0>. (It is assumed that the selector sMIR<4:0> was put earlier in a state to select the MIP input.) Then the R-control (control step RC2) acknowledges the receipt of the microinstruction by the control signal RACK_i := 0. The R-control (control step RC3) then invokes the DM-control for the processing of the microinstruction, and waits for a reply from the DM-control. The reply indicates that the processing is finished and that R-control can accept another microinstruction, which R-control indicates to PE_{i-1} by the control signal RACK_i := 1. At the same time, the selector network sMIR<4:0> is set to select the data from the microinstruction input port MIP_i. This is done in control step RC4.

It is assumed, in the control sequence chart of Figure 4.22, that initially, at power turn-on, RACK_i := 1 and MIPsMIR<4:0> := 1 are true.

4.3.2.3.3 Description of DM-control - The DM-control can be looked upon as the main control which, on being invoked by the R-control, monitors the output of the microinstruction decoder. Depending on the nature of the microinstruction, it sets up the necessary data paths and conditions the combinational logic in the data flow logic of the PE. After the data paths are set up, the DM-control invokes one or more of the other controls F, E and G for processing, and the T-control, if necessary, for onward transmission of the microinstruction to PE_{i+1}. Since the selectors have no memory and the data paths remain set throughout the processing, the output of the microinstruction decoder can be directly connected to the selector signals of the form XsY := 1, which involves no extra logic cost. Figures 4.23a, b, and c show the control sequence chart for the DM-control.
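Returning to the R-control of Section 4.3.2.3.2, its four control steps can be mimicked by a small software model. This is a minimal sketch under stated assumptions: the class, attribute and method names are hypothetical, the DM-control is represented by an ordinary callback, and level and pulse signalling are reduced to plain integer flags, so the timing behaviour of the asynchronous hardware is not modelled.

```python
class RControl:
    """Software stand-in for the R-control sequence RC1 to RC4."""

    def __init__(self, dm_control):
        self.rack = 1            # RACK_i: 1 means ready (power-on state)
        self.mip_selected = 1    # MIPsMIR<4:0>: 1 selects the MIP input
        self.mir = None          # 11-bit microinstruction register
        self.dm_control = dm_control
        self.trace = []

    def step(self, trq: int, mip: int) -> bool:
        """Poll TRQ_{i-1}; run RC1..RC4 once if a request is pending."""
        if not trq:
            return False                       # keep waiting for a request
        self.mir = mip & 0x7FF                 # RC1: load MIR<10:0> from MIP
        self.trace.append("RC1")
        self.rack = 0                          # RC2: acknowledge receipt
        self.mip_selected = 0
        self.trace.append("RC2")
        self.dm_control(self.mir)              # RC3: invoke DM-control, wait
        self.trace.append("RC3")
        self.rack = 1                          # RC4: ready for the next one
        self.mip_selected = 1
        self.trace.append("RC4")
        return True


processed = []
r_control = RControl(dm_control=processed.append)
r_control.step(trq=0, mip=0x123)   # no request: nothing happens
r_control.step(trq=1, mip=0x123)   # request: RC1..RC4 are executed
print(r_control.trace, [hex(word) for word in processed])
```

In the hardware, RACK_i stays at 0 for as long as the DM-control is busy, which is what suspends the left neighbor from sending the next microinstruction.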
[Figures 4.23a and 4.23b Control Sequence Chart for DM-control, Parts I and II.]

[Figure 4.23c Control Sequence Chart for DM-control, Part III.]

The data path for the microinstruction SS is through the selector sADR for operand register INR2, through the selector sDSE and the encoder DSE for the result digit (the sum of the contents of INR1 and INR2), and finally through the selector sRIB. The encoder DSE converts the redundant binary sum digit to sign-magnitude format SM. This data path is set up by the DMC1 control step for SS. The control signal TAsTOP sets up the data path for the 'Adder Transfer' out of PE_i. The microoperation P_i ← 0 conditions the DSE encoder logic for proper conversion of the redundant binary result digit into the equivalent sign-magnitude format.

For the microinstruction MS, the corresponding DMC1 control step sets up the selector sADR for the source operands, the data paths for the result digit through the selectors sDSE and sRIB, and the data path for the 'Adder Transfer' through the selector sTOP.
If the microinstruction to be processed is FMA, the control step $DMC1_3$ sets up the necessary data paths: for the 'product array' and 'product transfer' through sADR, for the result digit through sDSE and sRIB, and for the 'Adder Transfer' through sTOP. The multiplicand digit from operand register INR2 is put on the register output port $ROP_i$ via the selector sROB. The control memory flip-flop GFMA acts as a synchronizing device between the concurrently active and interacting controls, $G_{gn}$-control and $G_{ap}$-control. It is initialized to state '1'. The details of its action are discussed later in Section 4.3.2.3.6.

The remaining DMC1 control steps set up the data paths for the microinstructions shown at the entry points of each control step in Figures 4.23a, 4.23b, and 4.23c. For the left shift microinstruction LS, the data path for the digit in $PE_i$ is from the internal register to the output port $ROP_i$ via the selector sROB, and the data path for the incoming digit from $PE_{i+1}$ is from $RIP_i$ to register IBR via the register APR and selector sRIB. In the case of microinstruction RS, the digit to be stored is in microinstruction register MIR<4:0> and its data path to the input of the register file is via the selector sRIB. The data path for the digit to be shifted out to $PE_{i+1}$ is via the selector sROB, the bus ROB and the selector sMIR<4:0> to register MIR and thence to port $MOP_i$.

For the inter-register transfer microinstructions TD and TI, the data path is through the selector sROB, the bus ROB, and the selectors sDSE and sRIB. The control memory flip-flop SCHI generates the similarly named control signal, which transforms the SM-encoded output of ROB into redundant binary format for proper transfer. The logical '1' state of control signal SCHI guarantees that the magnitude bits of the input digit on ROB will appear unchanged at the output of DSE. This can be seen from Figure 4.14 and the discussion in Section 4.2.2.6.
The data path for the microinstruction LDC is from the register MIR through the selector sRIB to the proper register INR[MIR<7:5>] of the register file via the buffer register IBR.

For the memory access microinstructions, the communication of data and address takes place via the ports $ROP_i$, $MIP_i$ and $TOP_i$. For the microinstructions SPM and LPM, the data path for the address of the location in PEM and the read/write bit is from MIR to TOP via the selector sTOP. However, the data path for the data to be stored in the case of SPM is from register INR[MIR<7:5>] through selector sROB to the output port $ROP_i$, whereas the data path from memory to the register INR[MIR<7:5>] for microinstruction LPM is via port $MIP_i$, selector sMIR<4:0>, register MIR<4:0>, the selector sRIB and buffer register IBR.

The data path for the microinstructions AR and NR is from the register INR1 through the selectors sROB and sDSE, the encoder DSE and the selector sRIB back to INR1 via buffer register IBR. Note that the OP-codes for the microinstructions AR and NR are so chosen that bits MIR<7:5> address the register INR1. This explains the OP-code choices for the various microinstructions shown in Figure 4.20.

After the data paths are set up, the DM-control invokes one or more of the G, F, E and T-controls for the actual processing of the microinstruction. The microinstructions SS, MS, FMA and LS all require G-information from their right neighboring PEs. So the DM-control invokes the G-control (consisting of $G_{gn}$-control and $G_{ap}$-control) and the T-control in parallel. The T-control transmits the present microinstruction to $PE_{i+1}$. The identity of the microinstruction in $PE_i$ is essential in $PE_{i+1}$ to generate the G-information for the microinstruction processing. The control flow at the end of $G_{ap}$-control branches directly to the E-control for the actual calculation and storage of the result digit.
When all the concurrently invoked subcontrols are finished, they report back to the DM-control at the invoking point in the control flow. The DM-control now replies to the R-control which had earlier invoked it. The R-control was in a state of active suspension (wait state) during the activity of the DM-control. The R-control now gets ready to receive another command as explained earlier.

For the microinstruction RS, the DM-control, after setting up the data paths, invokes the F-control, which changes the modifier field of the microinstruction in MIR of $PE_i$ for transmission to $PE_{i+1}$. The details of the F-control are discussed later. At the end of the F-control, the E-control and T-control are invoked in parallel, but no G-information is required for the processing of microinstruction RS. On return from both the concurrently active E- and T-controls, the DM-control replies to the waiting R-control.

In the case of microinstructions TD, TI, LDC, LPM and SPM (Figure 4.23b), no G-information is required. Hence the DM-control invokes only the E-control and T-control in parallel. The rest of the control flow is as it is for RS.

The invoking of E, G and T-controls by the DM-control for microinstructions AR and NR is more complex since the invocation depends on the nature of the data resident in the adjacent $PE_{i+1}$. The digit algorithm of the microinstruction AR, discussed in Section 3.6.5, requires knowing the sign of the first non-zero digit to the immediate right of the present digital position. This is done through the use of interface control signals $S_{i+1}$ and $Z_{i+1}$ and control memory flip-flops SAD, SAK and ADZ, which respectively stand for the Sign of Adjacent Digit, Sign of Adjacent Digit Known and Adjacent Digit is Zero. The value of logical '1' for SAK and ADZ indicates assertion or truth, whereas '0' indicates falsehood. The interface control signal $S_i$ (which is the output of control memory flip-flop $S_i$) indicates the sign of the digit in $PE_i$'s accumulator register. $Z_i$ (which is also the output of flip-flop $Z_i$) indicates whether the magnitude of the accumulator digit is zero. $Z_i = 1$ indicates that the digit is zero and $Z_i = 0$ indicates otherwise. Note that if $Z_i$ is monitored by the adjacent $PE_{i-1}$, the validity of $Z_i$ can only be ensured when $RACK_i = 1$, i.e., when $PE_i$ is not in the middle of executing any previous microinstruction.

The mechanism for determining the sign of the first non-zero digit to the right of the present digital position, $i$ say, is as follows. If the digit in $PE_{i+1}$ is zero, the $G_{ap}$-control in $PE_i$ goes into a wait loop. In the meantime, the microinstruction AR is passed to $PE_{i+1}$, where again $Z_{i+2}$ is monitored to see if the digit in $PE_{i+2}$ is zero. If it is, the $G_{ap}$-control in $PE_{i+1}$ also goes into a wait loop and the microinstruction passes to $PE_{i+2}$, $PE_{i+3}$, ..., $PE_{i+j}$ if $Z_{i+j+1} = 0$ and $Z_{i+3}$, $Z_{i+4}$, ..., $Z_{i+j}$ are all in logical state '1'. The $G_{ap}$-controls in $PE_{i+2}$, ..., $PE_{i+j-1}$ go into the wait loop. As soon as $Z_{i+j+1} = 0$ is monitored by $PE_{i+j}$, the $G_{gn}$-control in $PE_{i+j}$ assigns the value of $S_{i+j+1}$ to $S_{i+j}$ and declares the sign valid to the waiting $G_{ap}$-control in $PE_{i+j-1}$ by assigning logical state '1' to the control signal $GRQ_{i+j}$. The $G_{ap}$-control in $PE_{i+j-1}$ informs the $G_{gn}$-control in $PE_{i+j-1}$ about the validity of sign $S_{i+j}$ by setting the synchronizing control flip-flop SAK in $PE_{i+j-1}$ to logical state '1'. The $G_{gn}$-control in its turn assigns the value of $S_{i+j}$ to $S_{i+j-1}$ and declares the sign $S_{i+j-1}$ to be valid. The sign of the digit thus flows backward till $PE_i$ is reached, and in this way all the zero digits lying to the immediate right of digital position $i$ are assigned the sign of the first non-zero digit.

We now describe the action of the DM-control for microinstructions AR and NR. The DM-control checks the state of control signal $Z_{i+1}$ by monitoring the control signal $RACK_{i+1}$ as explained earlier. If the adjacent digit in $PE_{i+1}$ is not zero ($Z_{i+1} \neq 1$), the control memory element SAD is set to the state of $S_{i+1}$, the sign of the adjacent digit is declared known by SAK $\leftarrow$ 1, and the adjacent digit is declared non-zero by setting ADZ to logical state '0'. However, if $Z_{i+1} = 1$, the control memory flip-flops SAD and SAK are set to logical state '0' and flip-flop ADZ to state '1'. For the microinstruction AR, the DM-control then invokes the T-control and G-control in parallel irrespective of the state of the control signal $Z_{i+1}$. However, for microinstruction NR, no digit beyond and including the first (counting from the left) zero digit of the operand needs to be recoded. So the flow of microinstruction NR stops as soon as $Z_{i+1} = 1$ is encountered. This is done by the DM-control not invoking the T-control in $PE_i$. However, $G_{gn}$-control and $G_{ap}$-control are invoked for uniformity of the invoking procedure, although $G_{ap}$-control is immediately exited for microinstruction NR, as can be seen from the control sequence chart for $G_{ap}$-control in Figure 4.27.

When all the controls invoked in parallel have finished, the DM-control replies to the waiting R-control, which gets conditioned to receive another microinstruction for processing in $PE_i$.

4.3.2.3.4 Description of T-control - The T-control, when invoked by the DM-control, passes the microinstruction in register MIR of $PE_i$ to $PE_{i+1}$. The control sequence chart for T-control is shown in Figure 4.24. The T-control in $PE_i$ monitors the signal $RACK_{i+1}$ (from the R-control in $PE_{i+1}$), whose logical state '1' indicates that the R-control in $PE_{i+1}$ is ready to accept the microinstruction. The control step TC1 sets the control memory flip-flop whose output gates the contents of MIR onto bus $MOP_i$. Then in control step TC2, the request signal $TRQ_i$ is activated, which in turn is monitored by the R-control in $PE_{i+1}$. As soon as the R-control in $PE_{i+1}$ accepts the microinstruction from $MOP_i$ ($= MIP_{i+1}$), it acknowledges by assigning the '0' logical state to the acknowledge signal $RACK_{i+1}$.
The '0' state of $RACK_{i+1}$, being monitored by the T-control, signifies that the microinstruction has been accepted. The control step TC3 then withdraws the request for transmission by assigning the '0' logical state to the request signal $TRQ_i$ and also removes the information from the bus $MOP_i$ ($MIP_{i+1}$). The latter is necessary for microinstruction LPM, where the port $MIP_{i+1}$ is used for inputting data read from the $PEM_{i+1}$. At the end of the control sequence, the control flow returns to the point where the T-control was invoked.

[Figure 4.24 Control Sequence Chart for T-control.]

4.3.2.3.5 Description of F-control - The function of the F-control is to modify the microinstruction modifier field F before transmission of the microinstruction to the next PE, i.e., $PE_{i+1}$. This is made use of in the microinstruction RS, where the modifier field carries the digit to be shifted into the adjacent PE. Figure 4.25 shows the control sequence chart for F-control. Control step FC1 loads the buffer register IBR from the output of the selector sRIB. The selector sRIB was initially conditioned by the DMC1 control step for RS in the DM-control to select the digit from MIR<4:0>. At this time, MIR<4:0> carries the digit from the adjacent $PE_{i-1}$, which arrived as part of the microinstruction from $PE_{i-1}$. Control step FC2 loads MIR<4:0> from the output of the selector sMIR<4:0>, which was conditioned by the same DMC1 control step to accept the digit from INR[MIR<7:5>] in $PE_i$ that is to be shifted into the next $PE_{i+1}$. At the end of control step FC2, the control flow branches to initiate the E-control and T-control in parallel. The T-control transmits the microinstruction in MIR with the new modifier field, and the E-control loads the register INR[MIR<7:5>] from the buffer register IBR.
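The digit exchange performed by the F-control for the microinstruction RS can be sketched in software. The following is only an illustrative model, not the thesis's hardware: the function name `right_shift`, the parameter `fill` (the digit carried in the modifier field when the microinstruction enters the leftmost PE) and the list representation of the PE chain are assumptions made for this example.

```python
def right_shift(digits, fill=0):
    """Model of microinstruction RS travelling along a chain of PEs.

    digits[i] stands for the digit held in the addressed register of PE_i.
    'fill' is a hypothetical name for the digit carried in the modifier
    field when the microinstruction enters the first PE.
    Returns the shifted digit list and the digit leaving the last PE.
    """
    carried = fill                # modifier field MIR<4:0> on arrival
    shifted = []
    for own_digit in digits:      # PE_0, PE_1, ... in microinstruction order
        ibr = carried             # FC1: IBR <- digit received in MIR<4:0>
        carried = own_digit       # FC2: MIR<4:0> <- digit of this PE
        shifted.append(ibr)       # E-control: register <- IBR
    return shifted, carried


if __name__ == "__main__":
    print(right_shift([1, -1, 0, 1]))
```

Each loop iteration corresponds to one PE: the incoming digit is parked in the buffer, the PE's own digit replaces it in the modifier field, and the buffered digit is then stored, so every digit moves one position in the direction of microinstruction flow.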
4.3.2.3.6 Description of G-control - The G-control consists of two independent subcontrols: $G_{gn}$-control, which generates the G-information (mainly the 'Adder Transfer' $t^{A}_{i-1}$) for the adjacent $PE_{i-1}$; and $G_{ap}$-control, which accepts the G-information from the adjacent $PE_{i+1}$. The G-control is invoked by the DM-control only when the processing of the microinstruction requires information from the adjacent PEs. When the G-information depends logically on more than one adjacent PE, the $G_{gn}$-control and $G_{ap}$-control interact with each other through the synchronizing control memory flip-flops GFMA (in the case of microinstruction FMA) and SAK (for the microinstructions AR and NR).

[Figure 4.25 Control Sequence Chart for F-control.]

4.3.2.3.6.1 Description of $G_{gn}$-control - The function of $G_{gn}$-control is to generate the G-information needed in the adjacent $PE_{i-1}$. Figure 4.26 shows the control sequence chart for this control. The G-information for the microinstructions SS and MS consists of the 'Adder Transfer' $t^{A}_{i-1}$, which is routed to the output port $TOP_i$ by the control steps $DMC1_1$ and $DMC1_2$. The G-information for the microinstruction LS is the digit in register INR[MIR<7:5>], which is routed to port $ROP_i$ by the data path set up in the corresponding DMC1 control step. After the G-information stabilizes on ports $TOP_i$ and $ROP_i$, control step Ggn1 informs the $G_{ap}$-control in $PE_{i-1}$ about the validity of the G-information. The $G_{gn}$-control then monitors the acknowledgment signal $GACK_{i-1}$ from the $G_{ap}$-control of $PE_{i-1}$. When $GACK_{i-1}$ is in logical state '1', the control step Ggn2 declares the G-information not valid.

For the microinstruction FMA, the G-information ${}_{FMA}G_i$ to be generated by $PE_i$ for $PE_{i-1}$ consists of the 'Adder Transfer' $t^{A}_{i-1}$ and the multiplicand digit $\phi_i$ (assuming 'local generation' of CPT $t^{p}_{i}$). The 'Adder Transfer' $t^{A}_{i-1}$ is dependent on the multiplicand digit $\phi_i$ and accumulator digit $a_i$ in registers INR2 and INR1 respectively in $PE_i$, the multiplicand digit $\phi_{i+1}$ and accumulator digit $a_{i+1}$ in registers INR2 and INR1 respectively in $PE_{i+1}$, and the multiplier digit $m_j$. $t^{A}_{i-1}$ consists of two parts, $t^{A0}_{i-1}$ and $t^{A1}_{i-1}$, and is generated in a time sequential manner. The process of generation of ${}_{FMA}G_i$ can be represented in the notation of Section 2.4 as follows:

$$G_i = \{t^{A}_{i-1},\ \phi_i\}$$
$$t^{A0}_{i-1} = \Gamma(m_j,\ \phi_i,\ a_i)$$
$$t^{A1}_{i-1} = \Gamma(m_j,\ \phi_i,\ a_i,\ \phi_{i+1},\ t^{A0}_{i})$$

and

$$G_i = \{G^{0}_{i},\ G^{1}_{i}\}$$

where

$$G^{0}_{i} = \{t^{A0}_{i-1},\ \phi_i\}, \qquad G^{1}_{i} = \{t^{A1}_{i-1}\}$$

and

$$t^{A}_{i-1} = (t^{A0}_{i-1},\ t^{A1}_{i-1}).$$

[Figure 4.26 Control Sequence Chart for $G_{gn}$-control.]

In the above relations, the subscript FMA has been dropped from all variables for ease of reading. The above described structure for $G^{0}_{i}$ and $G^{1}_{i}$ can be deduced from an examination of Figures 3.13 and 4.11d together.

When the G-information $G^{0}_{i}$ is valid on ports $TOP_i$ and $ROP_i$, which carry the $t^{A0}_{i-1}$ and $\phi_i$ components of $G^{0}_{i}$ respectively, control step Ggn3 informs $PE_{i-1}$ about the validity of the $G^{0}_{i}$ information. After $PE_{i-1}$ has accepted $G^{0}_{i}$ (indicated by $GACK_{i-1} := 1$), control step Ggn4 sets the validity signal $GV_i$ to logical state '0'. To generate $G^{1}_{i}$, it is necessary that $G^{0}_{i+1}$ ($= \{t^{A0}_{i},\ \phi_{i+1}\}$) be available in $PE_i$. When $G^{0}_{i+1}$ from $PE_{i+1}$ is valid on input ports $TIP_i$ and $RIP_i$, the $G_{ap}$-control in $PE_i$ accepts and stores $G^{0}_{i+1}$ and informs the $G_{gn}$-control about its availability by setting the control memory flip-flop GFMA to logical state '1'. As soon as the synchronizing flip-flop GFMA is in logical state '1' and $G^{1}_{i}$ is valid on port $TOP_i$ ($G^{1}_{i} = t^{A1}_{i-1}$ is automatically generated by the logic in MIAD of $PE_i$), the control steps Ggn5 and Ggn6 declare the $G^{1}_{i}$ information valid and invalid respectively, in the same way as do control steps Ggn3 and Ggn4.

For the microinstructions AR and NR, the G-information consists of only the sign of the digit in the accumulator of the adjacent PE and also whether the digit is zero or not. If the digit in the accumulator in $PE_i$ is non-zero ($Z_i \neq 1$), then no G-information needs to be generated because it is already known to $PE_{i-1}$ via its DM-control. However, if the present digit is a zero ($Z_i = 1$), the meaningful sign for this zero digit is the sign of the first non-zero digit to its right. If the digit to the immediate right in $PE_{i+1}$ is non-zero, then the sign of the adjacent digit is known and was stored earlier in flip-flop SAD by the DM-control in $PE_i$. If, however, the digit in $PE_{i+1}$ is zero ($Z_{i+1} = 1$), the sign of the adjacent digit is unknown (SAK $\neq$ 1). The $G_{gn}$-control goes into a wait loop, continuously monitoring SAK till the $G_{ap}$-control in $PE_i$ determines the sign of the digit in $PE_{i+1}$ and stores it in SAD. As soon as the sign of the adjacent digit is known, control step Ggn7 assigns the same value to the flip-flop $S_i$, whose value represents the sign of the accumulator digit in $PE_i$. Control step Ggn8 informs the $G_{ap}$-control in $PE_{i-1}$ about the validity of $S_i$. After the $G_{ap}$-control in $PE_{i-1}$ acknowledges the receipt of the valid sign $S_i$ ($GACK_{i-1} = 1$), control step Ggn9 withdraws the validity signal. The $G_{gn}$-control now returns to the invoking point in the DM-control.

4.3.2.3.6.2 Description of $G_{ap}$-control - The function of $G_{ap}$-control is to accept the G-information generated by the $G_{gn}$-control in $PE_{i+1}$. Figure 4.27 shows the control sequence chart for $G_{ap}$-control. This G-information is available on port $TIP_i$ (for microinstructions SS, MS and FMA), on port $RIP_i$ (for microinstructions FMA and LS) and on interface control line $S_{i+1}$ (in the case of microinstructions AR and NR).
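The net effect of the sign hand-off described for the microinstruction AR (every zero digit takes the sign of the first non-zero digit to its right) can be illustrated with a short sequential sketch. This models only the result, not the asynchronous GV/GACK/SAK signalling between PEs. The function name `effective_signs` and the `default` value used when no non-zero digit exists to the right are assumptions introduced for this example; the text does not specify that case.

```python
def effective_signs(digits, default=0):
    """Return, for each digit position, the sign a PE would end up holding.

    digits  : signed digits, index 0 being the leftmost position.
    default : hypothetical placeholder used for zero digits that have no
              non-zero digit to their right (not specified in the text).
    A non-zero digit keeps its own sign (+1 or -1); a zero digit takes the
    sign of the first non-zero digit to its right.
    """
    signs = []
    known = default
    # Scan from the rightmost position, mirroring the way the sign
    # "flows backward" from PE_{i+j} towards PE_i.
    for d in reversed(digits):
        if d != 0:
            known = 1 if d > 0 else -1
        signs.append(known)
    signs.reverse()
    return signs


if __name__ == "__main__":
    print(effective_signs([1, 0, 0, -2, 0, 3, 0]))
```

In the hardware, the same result is reached without a global scan: each waiting PE simply copies the sign declared valid by its right neighbor and then declares its own sign valid to its left neighbor.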
In the case of microinstructions SS and MS, the $G_{ap}$-control monitors the validity interface signal $GV_{i+1}$. As soon as the G-information is valid, control step Gap1 stores the G-information on bus $TIP_i$ into the G-information register GIR. Control step Gap2 acknowledges the receipt of the G-information, and control step Gap3 withdraws the acknowledgment signal $GACK_i$ once the validity signal is withdrawn ($GV_{i+1} = 0$) by the $G_{gn}$-control in $PE_{i+1}$. For microinstruction LS, the same sequence is followed except that the G-information is available on $RIP_i$ and is stored in register APR by control step Gap4.

As explained earlier, the G-information $G_{i+1}$ for microinstruction FMA consists of two components: $G^{0}_{i+1}$ ($= \{t^{A0}_{i},\ \phi_{i+1}\}$) and $G^{1}_{i+1}$ ($= \{t^{A1}_{i}\}$). When the $G^{0}_{i+1}$ information is valid, the control step Gap7 stores the $t^{A0}_{i}$ component in register GIR and the $\phi_{i+1}$ component in register APR. Then control step Gap8 sets the synchronizing flip-flop GFMA to logical state '1' to inform the $G_{gn}$-control about the availability of the $G^{0}_{i+1}$ information, so that the $G_{gn}$-control may generate $G^{1}_{i}$ for $PE_{i-1}$. Control steps Gap9 and Gap10 play the same role of acknowledgment assertion and its withdrawal as control steps Gap2 and Gap3. After control step Gap10, the $G_{ap}$-control again starts monitoring the validity control signal for $G^{1}_{i+1}$. As soon as $G^{1}_{i+1}$ is valid ($GV_{i+1} = 1$) on $TIP_i$, control step Gap11 stores it in the G-information buffer register GIR, and then control steps Gap12 and Gap13 respectively acknowledge the receipt of the $G^{1}_{i+1}$ information and withdraw the acknowledge signal on response from the $G_{gn}$-control.

[Figure 4.27 Control Sequence Chart for $G_{ap}$-control.]

For the microinstructions AR and NR, if the adjacent digit is non-zero, or if it is zero and the microinstruction is NR, no G-information needs to be accepted from the adjacent $G_{gn}$-control and the $G_{ap}$-control is immediately exited.

In the case of a zero adjacent digit and microinstruction AR, the $G_{ap}$-control monitors the G-information validity signal. Here the G-information consists of $S_{i+1}$. As soon as the $G_{gn}$-control in $PE_{i+1}$ has determined the valid sign for $S_{i+1}$ (which is the sign of the first non-zero digit to its right), it sets the validity signal $GV_{i+1}$ to logical state '1'. As soon as the $G_{ap}$-control in $PE_i$ finds $GV_{i+1}$ in state '1', the control step Gap14 stores the sign $S_{i+1}$ in flip-flop SAD and Gap15 sets the synchronizing flip-flop SAK to logical state '1'. (SAK is being monitored by the $G_{gn}$-control in $PE_i$ in order to attach this sign, stored in SAD, to $S_i$.) Control steps Gap16 and Gap17 play the same role as Gap2 and Gap3.

At the end of execution of the $G_{ap}$-control, the control sequence for the processing of the microinstruction branches directly to the E-control, where the result digit is calculated and stored in the appropriate register of the register file.

4.3.2.3.7 Description of E-control - Figure 4.28 shows the control sequence chart for E-control. For the microinstructions SS, MS, FMA, LDC and LS, the E-control loads the result digit, which is available at the output of selector sRIB, into buffer register IBR. This is done in control step E3. Then control steps $E4_1$ and $E4_2$ transfer the contents of IBR into accumulator register INR1 for microinstructions SS, MS and FMA, and into the destination register INR[MIR<7:5>] for LDC and LS. Finally, the control step E5 sets the status indicators $S_i$ and $Z_i$.

[Figure 4.28 Control Sequence Chart for E-control.]

For the microinstruction RS, the control step E3 is bypassed and control step $E4_2$ loads the register to be shifted.
E5 sets the digit status indicators.

For the inter-register transfer microinstructions, the state of the sign bit of the digit on the bus ROB is transferred to the sign bit output of the digit sum encoder DSE for TD. The complement of the state of that sign bit is transferred in the case of microinstruction TI. This is done in control steps $E1_1$ and $E1_2$ respectively. The control sequence then goes through the control steps E3, $E4_2$ and E5. Control step $E4_2$ loads the destination register in the register file.

For the microinstruction LPM, control step $E6_1$ requests access to the local memory $PEM_i$ of the $PE_i$. Note that the address of the location in $PEM_i$ and the Read/Write bit (in the state 'Read') are already available on the output port $TOP_i$. The $PEM_i$ reads out the data on the microinstruction input bus $MIP_i$ and informs the $PE_i$ by the logical state '1' of the acknowledge signal $MACK_i$. The control step $E6_2$ loads the register MIR<4:0> from the output of selector sMIR<4:0>, which had been conditioned earlier, in the DM-control, to accept this output, and also withdraws the request for memory access. The control steps E3 and $E4_2$ load the buffer register IBR and the file register INR[MIR<7:5>]. Finally the control sequence goes through control step E5 for setting the status indicators.

For the store microinstruction SPM, the address of the PEM location is already available on output bus $TOP_i$ and the digit to be stored is on bus $ROP_i$ when the control flow enters the E-control. Control step $E7_1$ requests access to the memory. Then the PEM responds by the logical state '1' of the acknowledge signal $MACK_i$, after accepting the data and address from the buses $TOP_i$ and $ROP_i$. Now control step $E7_2$ withdraws the request for memory access. The control sequence finally goes through the status setting control step E5.
For the microinstructions NR and AR, the E-control implements the digit algorithms discussed earlier in Sections 3.6.4 and 3.6.5. Control steps $E2_1$ and $E2_2$ respectively achieve the radix complement and the diminished radix complement of the magnitude bits of the accumulator digit. Control step $E2_3$ diminishes the magnitude of the accumulator digit by unity. The particular setting of control signals to the states shown in control steps $E2_1$, $E2_2$ and $E2_3$ was explained earlier in Section 4.2.2.6. Control step $E2_4$ assigns the state of MIR<4>, which is the sign, $S_p$, of the whole operand to be assimilated or normalized, to the sign bit output of the digit sum encoder DSE. Control steps E3 and $E4_1$ load the result digit in buffer IBR and accumulator register INR1. Finally the control step E5 sets the status indicators regarding the sign and magnitude of the accumulator digit in $PE_i$.

When the control sequence corresponding to the E-control is finished, the control flow returns to the invoking point where the $G_{ap}$-control was invoked in the DM-control. This is because the control flow had branched into the E-control at the end of the $G_{ap}$-control sequence.

4.4 Logic Complexity of Processing Element

From the viewpoint of LSI implementation of a PE, two things are of major importance: the number of circuit elements and the number of external pins required for the chip. The total number of circuit elements and pins determines the silicon real estate, the density of the circuit elements, the heat dissipation, etc. The number of circuit elements depends on the technology used for the implementation of the logic on the chip. In this thesis, we shall use the number of gates as an indirect measure of logic complexity because the number of circuit elements is directly related to the number of gates. Further, a multi-input NAND gate is considered equivalent to a 2-input NAND gate because, in TTL logic, a multi-input NAND is realized by the use of a multi-emitter transistor.
These assumptions have been made for the sake of simplicity.

The overall gate complexity and pin complexity of a PE must take into account the gates and pins required by a PE's major components: DPL, PE control logic and Register File.

4.4.1 Logic complexity of DPL

4.4.1.1 Gate complexity of digit processing logic DPL - The total number of gates required for the DPL is equal to the sum of the gates necessary for its various components: Adder MIAD, Digit Product Generator DPG, Digit Sum Encoder DSE, the various selector networks sADR, sDSE, sROB, sRIB and sTOP, and the storage buffer registers in the DPL. The gates required for MIAD, DPG, DSE and the selectors sADR and sDSE are dependent on the choice of the logic vector encoding for the redundant binary digit. From the earlier discussion in Sections 4.2.2.2 through 4.2.2.6, it is clear that logic vector encoding $LVE_2$ is the simplest encoding and requires the least number of gates for the implementation of DPG, sADR and sDSE. In the following, we shall calculate the gate complexity of DPL, assuming only the sign-magnitude (SM) logic vector encoding $LVE_2$.

Let

$G_{DPL}$ = Total number of gates required for DPL, excluding storage registers,
$G_{DPG}$ = Gates required for the logic implementation of the Digit Product Generator, DPG, using 'local generation' of $t^{p}_{i}$,
$G_{MIAD}$ = Gates required for the radix-$2^K$ adder, MIAD,
$G_{DSE}$ = Gates required for the Digit Sum Encoder, DSE,

and let $G_{sDSE}$, $G_{sRIB}$, $G_{sROB}$, $G_{sADR}$ and $G_{sTOP}$ respectively denote the gates required for the selectors sDSE, sRIB, sROB, sADR and sTOP.

From the design details described in Section 4.2, it is clear that

$G_{MIAD} = 26K^2$ NANDs, assuming no encoder MATE for $t^{A}_{i-1}$,

$G_{DPG} = K^2$ ANDs + 2 Exclusive-OR gates for sign generation
$\phantom{G_{DPG}} = K^2 + 8$ gates, considering an AND and a NAND gate equivalent, and 1 Exclusive-OR gate equivalent to 4 NAND gates.
From Equations (4.17) through (4.22), we have

$$G_{DSE} = 16K + c_1$$
$$G_{sADR} = 3K^2 + 3K + 1$$
$$G_{sDSE} = 7K$$
$$G_{sRIB} = 4(K + 1)$$
$$G_{sROB} = (K + 1)(K + 2)$$
$$G_{sTOP} = 3(K + 2)$$

Therefore, the total number of gates required for the combinational processing logic DPL is given by

$$G_{DPL} = G_{MIAD} + G_{DPG} + G_{DSE} + G_{sADR} + G_{sDSE} + G_{sRIB} + G_{sROB} + G_{sTOP}$$
$$\phantom{G_{DPL}} = 26K^2 + (K^2 + 8) + (16K + c_1) + (3K^2 + 3K + 1) + 7K + 4(K + 1) + (K + 1)(K + 2) + 3b.$$

Ignoring the constant $c_1$ and assuming the width $b$ of port $TOP_i$ to be equal to $(K + 2)$, we have

$$G_{DPL} = 31K^2 + 36K + 21 \qquad (4.24)$$

In the expression above for $G_{DPL}$, the sum of the gates contributed by the three major components DPG, MIAD and DSE alone is $(27K^2 + 16K + 8 + c_1)$ and forms the bulk of the gates required for the implementation of DPL. The other components, like the selector networks, contribute a progressively smaller percentage of gates to the gate complexity of DPL as the value of $K$ increases. Table 4.2 lists the values of the gates ...

[Table 4.2: gate complexity of DPL; contents not recoverable.]

... 6 becomes too large and hence not suitable. However, 'Local Generation' of $t^{p}_{i}$ and ...

[Table 4.3 Pin Complexity of DPL Vs Radix; remainder of caption and table contents not recoverable.]

[Figure: contents not recoverable.]

Gate Complexity

Assuming the sign-magnitude logic vector encoding $LVE_2$ for a redundant binary digit, and 'Local Generation' of the product transfer $t^{p}_{i}$, the gates $G'_{DPG}$ for the logic of the Digit Product Generator are given by

$G'_{DPG}$ = gates required for the magnitude bits of the 'product' array and 'transfer' array $t^{p}_{i}$
  + gates required for the generation of sign bits of the 'product' and 'transfer' arrays
  + gates required to combine adjacent bits and their corresponding signs (one of the bits is zero) to form a single (composite) redundant binary digit, shown by circles in the figure

$$\phantom{G'_{DPG}} = k^2 + 8k + (\text{total number of composite redundant binary digits}) \times (\text{gates required to form one composite redundant binary digit})$$
$$\phantom{G'_{DPG}} = k^2 + 8k + k\left\lceil \tfrac{k}{2} \right\rceil \times 4 \qquad (4.29)$$

The above expression shows that the gates required for the Digit Product Generator, DPG, are increased for $1/2 \le \delta \le 2/3$, compared to the maximal redundancy case, by an amount equal to $4k\lceil k/2 \rceil + 8(k-1)$.

Further, the complexity of the selector network sADR is also increased because the composite redundant binary digit has to be individually routed to the input of the MIRBAs through the selector sADR.

number of gates needed for the magnitude bits selection $= 3k\lceil k/2 \rceil$
number of gates required for sign bits selection $= (2k+1)\lceil k/2 \rceil$
total number of gates required for the sADR network $= (2k+1)\lceil k/2 \rceil + 3k\lceil k/2 \rceil$

$$G'_{sADR} = (5k+1)\left\lceil \tfrac{k}{2} \right\rceil \qquad (4.30)$$

From the above it is clear that although the number of gates required for the sign bits' selection is increased compared to the case when $\delta = 1$, the number of gates required for the magnitude bits' selection is almost halved, and the overall number of gates $G'_{sADR}$ required for the selector network sADR is decreased compared to the maximal redundancy case.

However, there is a drastic reduction in the number of gates required for the adder MIAD because of the decrease in the number of inputs to each MIRBA. The gates $G'_{MIAD}$ required are

$$G'_{MIAD} = 26k\left\lceil \tfrac{k}{2} \right\rceil \qquad (4.31)$$

There is no change, due to the change in redundancy, in the gates required for either the Digit Sum Encoder DSE or the other remaining selector networks sDSE, sRIB and sTOP.
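Therefore, the total number of gates $G'_{DPL}$ required for the Digit Processing Logic, when the multiplier digit redundancy is restricted to $1/2 \le \delta \le 2/3$ only, is

$$G'_{DPL} = G'_{DPG} + G'_{sADR} + G'_{MIAD} + G_{DSE} + G_{sDSE} + G_{sRIB} + G'_{sROB} + G_{sTOP}$$
$$\phantom{G'_{DPL}} = k^2 + 8k + 4k\left\lceil \tfrac{k}{2} \right\rceil + (5k+1)\left\lceil \tfrac{k}{2} \right\rceil + 26k\left\lceil \tfrac{k}{2} \right\rceil + 16k + 7k + 4(k+1) + (k+1)\left(\left\lceil \tfrac{k}{2} \right\rceil + 2\right) + 3(k+2)$$
$$\phantom{G'_{DPL}} = k^2 + (36k + 2)\left\lceil \tfrac{k}{2} \right\rceil + 40k + 12 \qquad (4.32)$$

(Footnote: The gates for sROB are calculated on the assumption that we reduce the number of registers in the register file from $(k+1)$ to $(\lceil k/2 \rceil + 2)$.)

The values of $G'_{DPL}$ and its various components are given in Table 4.4 for different values of the parameter $k$. A comparison of Table 4.2 and Table 4.4 shows that the reduction in the gates required for the digit processing logic, for $1/2 \le \delta \le 2/3$, comes mainly from the drastic reduction in the number of gates necessary for the adder MIAD.

Pin Complexity

Using the same notation as in the case of $1/2 < \delta \le 1$, we have

$P^{NEL}_{DPL}$ = Total number of pins required for implementation of DPL, using the 'Local Generation' method of $t^{p}_{i}$ and no encoder for $t^{A}_{i-1}$, for $1/2 \le \delta \le 2/3$

$$\phantom{P^{NEL}_{DPL}} = \mathrm{Max}\left(k+2,\ 2\left\lceil \tfrac{k}{2} \right\rceil\right) + 2\left\lceil \tfrac{k}{2} \right\rceil + (k+1) + (k+1)$$
$$\phantom{P^{NEL}_{DPL}} = (k+2) + 2\left\lceil \tfrac{k}{2} \right\rceil + (2k+2)$$
$$\phantom{P^{NEL}_{DPL}} = 3k + 4 + 2\left\lceil \tfrac{k}{2} \right\rceil \qquad (4.33)$$

(Footnote: Strictly speaking, $\delta$ is given by an expression in $r$, illegible in this copy, which may be slightly larger than 2/3 for certain values of $r$. In this thesis, however, we shall say that $\delta \le 2/3$.)

Similarly

$$P^{EL}_{DPL} = \mathrm{Max}\left(k+2,\ 2\left\lceil \log_2\left(\left\lceil \tfrac{k}{2} \right\rceil + 1\right) \right\rceil\right) + 2\left\lceil \log_2\left(\left\lceil \tfrac{k}{2} \right\rceil + 1\right) \right\rceil + (k+1) + (k+1)$$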
Table 4.4 Gate Complexity of DPL Vs Radix for 1/2 ≤ δ ≤ 2/3 and Encoding LVE for a Redundant Binary Digit

Radix r = 2^k   k   G_DPG   G_MIAD   G_sADR   G_DSE   G_sDSE   G_sRIB   G_sROB   G_sTOP   G_DPL
4               2   28      52       11       32      14       12       9        12       170
8               3   57      156      32       48      21       16       16       15       361
16              4   80      208      42       64      28       20       20       18       480
32              5   125     390      78       80      35       24       30       21       783
64              6   156     468      93       96      42       28       35       24       942
128             7   217     728      144      112     49       32       48       27       1357
256             8   256     832      164      128     56       36       54       30       1556

Table 4.5 Pin Complexity of DPL Vs Radix for 1/2 ≤ δ ≤ 2/3

Radix r = 2^k   k   P^EL   P^NEL
4               2   12     12
8               3   17     17
16              4   20     20
32              5   23     25
64              6   26     28
128             7   31     33
256             8   34     36

P^EL = 3k + 4 + 2⌈log2(⌈k/2⌉ + 1)⌉    (4.34)

Values of both Equations (4.33) and (4.34) are tabulated in Table 4.5. A comparison of Tables 4.3 and 4.5 shows that by restricting the redundancy ratio to 1/2 ≤ δ ≤ 2/3 for each multiplier digit, one can achieve almost the same number of total pins for DPL as are achieved by using δ = 1 and the encoder MATE, without having to introduce the new cell for MATE. The introduction of MATE destroys the uniformity of the structure of MIAD.

4.4.2 Logic complexity of PE control - The major components of PE control logic are the microinstruction register, MIR, the selector network sMIR, the Zero and Sign Detection Logic ZSD, the microinstruction decoder, and the control and timing signal generator, TCS. Of these, the gate complexity of only the selector sMIR and ZSD depends on the bit width (= k+1) of the PE module, because each is one digit wide. The gate complexity of TCS is independent of bit width if we exclude the file register address decoders from consideration. However, the gate complexity does depend on the method of implementing the control sequence charts described earlier. The author used the control point technique used in ILLIAC III [45] for the implementation of control sequence charts in order to calculate the gate complexity of PE control.

4.4.2.1 Gate complexity of PE control - Table 4.6 shows the gates required for each subcontrol of TCS in terms of the number of control points, the gates for the control points and the gates required for the conditional generation of control and timing signals. The last column of Table 4.6 shows the total number of gates required for each subcontrol. Let G_TCS denote the total number of gates required for the Timing and Control Signal Generator, TCS.

G_TCS = 200 NAND gates

In addition, let G_DCD, G_sMIR and G_ZSD denote the gates required for the logic implementation of the microinstruction decoder, the selector sMIR<4:0> and the Zero and Sign Detector. These gates are given by

G_DCD = 32 NAND gates
G_sMIR = 15 NAND gates
G_ZSD = 6 NAND gates

Therefore

G_PCL = total # of gates required by PE control logic excluding storage elements
      = G_TCS + G_DCD + G_sMIR + G_ZSD
      = 253    (4.35)

[Table 4.6: gates required for each subcontrol of TCS; table body not recoverable from the scan.]

4.4.2.2 Pin complexity of PE control - The total number of pins required for the logic implementation of PE local control is the sum of the pins required for the microinstruction ports MIP and MOP and the pins required for the request-response signals of TCS. Denoting by P_PCL the total pins required by PE local control, we have

P_PCL = P_MIP + P_MOP + P_TCS
      = 11 + 11 + 14
      = 36    (4.36)

If the multiplier digit has redundancy 1/2 ≤ δ ≤ 2/3, then the number of internal registers in the PE reduces to ⌈k/2⌉ + 1 from (k+1). The number of address bits required to specify the internal register correspondingly reduces to ⌈log2(⌈k/2⌉ + 1)⌉. This results in the saving of one pin in each of the microinstruction ports, and thus the pins required for PE control logic reduce to 34.
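The pin counts quoted so far, for the DPL in Table 4.5 and for PE control in (4.36), can be reproduced with a short sketch. It rests on assumptions: the forms used for (4.33) and (4.34) are the readings of the scanned equations that match Table 4.5, and all function names are hypothetical.

```python
def ceil_log2(n):
    """Smallest integer e with 2**e >= n, for n >= 1."""
    return (n - 1).bit_length()


def dpl_pins(k):
    """Return (P_EL, P_NEL) for the DPL at radix 2**k, 1/2 <= delta <= 2/3.

    Assumption: both variants share Max(k+2, 2*ceil(k/2)) + 2(k+1) pins and
    differ only in the term 2*ceil(k/2) (no encoder, eq. 4.33) versus
    2*ceil(log2(ceil(k/2) + 1)) (with encoder, eq. 4.34).
    """
    h = (k + 1) // 2                       # ceil(k/2)
    common = max(k + 2, 2 * h) + 2 * (k + 1)
    with_encoder = common + 2 * ceil_log2(h + 1)
    no_encoder = common + 2 * h
    return with_encoder, no_encoder


def pe_control_pins(p_mip=11, p_mop=11, p_tcs=14):
    """PE local control pins per (4.36): two microinstruction ports plus TCS."""
    return p_mip + p_mop + p_tcs


for k in range(2, 9):
    print(2 ** k, k, *dpl_pins(k))
print("PE control pins:", pe_control_pins())
```

The two DPL columns coincide up to k = 4 and differ by only two pins from k = 5 onwards, which illustrates why the encoder buys little once the redundancy ratio is restricted.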
4.4.3 Overall logic complexity of a PE - The total number of gates required for the implementation of a PE is the sum of the gates required for the combinational logic of DPL, the gates required for the PE control logic and the gates required for the implementation of storage registers in the PE. The gates required for DPL and the storage registers are a function of the parameter k (radix 2^k), which represents the bit width of a PE. The gates required for PE control logic are virtually independent of k and are about 250 NANDs.

The storage registers in a PE comprise the registers in the register file, the buffer registers in DPL and the register MIR in PE control logic. Considering that all the storage registers are made of edge triggered D-type flip-flops, the gates G_STO required for the storage registers are given by

G_STO = (# of registers in the register file × (k+1) + width of IBR + width of APR + width of GIR + width of MIR) × gates required for one D-type edge triggered flip-flop.

A D-type edge triggered flip-flop requires 6 NAND gates [46]. Therefore, for multiplier digit redundancy ratio 1/2 ≤ δ ≤ 1, we have

G_STO = 6 × [(k+1)(k+1) + (k+1) + (k+1) + 2⌈log2(k+1)⌉ + 11]
      = 6 [k² + 4k + 2⌈log2(k+1)⌉ + 14]    (4.37)

For multiplier digit redundancy ratio 1/2 ≤ δ ≤ 2/3,

G_STO = 6 × [(⌈k/2⌉ + 1)(k+1) + (k+1) + (k+1) + 2⌈log2(⌈k/2⌉ + 1)⌉ + 10]    (4.38)

[Figure residue; the remainder of this passage is not recoverable from the scan.]
[Figure residue; content not recoverable from the scan.]

where s_j is the first most significant non-zero digit in the d-vector representation of S. Due to the use of an RBA-2 with the sign-magnitude logic vector encoding LVE for the redundant binary digit in the implementation of microinstruction SS, bogus overflow [35] would occur quite often. Bogus overflow occurs whenever s_0 · s_j < 0, because the sum can then always be recoded such that s_0 = 0. For example, the sum 1.03̄21 (where 3̄ denotes the digit -3) can always be recoded into its algebraic equivalent 0.9721.

Mantissa overflow can be corrected by shifting the sum right by one digital position and correspondingly adjusting the exponent of the sum. In the case of bogus overflow, however, this procedure would unnecessarily cause a loss of one significant digit (k bits for radix-2^k arithmetic). This can be taken care of during normalization of the sum if the shifted-out digit of the sum is saved in the End Unit and reintroduced during left-shifting of the operand. The left shift of the sum after normalization recoding is done to eliminate the leading zeros in the recoded sum.

6.2.3 Floating point Subtraction - The processing for Floating Point Subtraction is identical to that for Floating Point Addition except that in the Mantissa Processing Microprogram, the microinstruction SS is preceded by the microinstruction TI 2,2. This microinstruction reverses the sign of each digit of the subtrahend. The sequence of microinstructions for mantissa processing is as follows.

Microinstruction    Comments
LPM 2 ea            INR2 ← PEM[ea]
RS 2                operand alignment
RS 2
RS 2
TI 2,2              INR2 ← (-1) · INR2
SS                  INR1 ← INR1 + INR2

6.2.4 Floating point Multiplication - Let the Floating Point Multiply instruction be denoted as FPM, EA where EA is the effective DMM address of the Multiplier operand. The operand in the accumulator is the implicitly assumed multiplicand operand. If ea is the corresponding buffer memory address of the multiplier operand, the processing for the Floating Point Multiply instruction involves the following steps by the MCU. The MCU calls upon the Exponent Control Unit to sum the exponents of the operands in the accumulator and at the LOEM address specified by ea; concurrently, the MCU issues the sequence of microinstructions to the PEs to form the double length product in the PEs and finally to check for any exceptional conditions. Mantissa overflow cannot occur because the mantissas of both the operands are less than unity. However, exponent overflow may take place.

6.2.4.1 Microprogram for mantissa processing - The mantissa processing for the multiplication instruction involves the generation of partial products and the final product digits. For processing, the multiplier and multiplicand operands are respectively in file registers INR3 and INR2, whereas the Accumulator register INR1 is used to form and accumulate the partial products. Unlike conventional multipliers, the most significant half of the final double length product is in the Multiplier register INR3 and the least significant half is in the Accumulator register INR1. Because the partial products are formed beginning with the most significant digits, the most significant digits of the product are formed first. Also, to achieve maximum precision, the partial product is shifted left during each step instead of the multiplicand being shifted right. Due to the left shift of the partial product, two problems immediately arise.

a) During the left shift of the Accumulator register (partial product), not one but two digits (the digits to the left and right of the radix point) are shifted out. These two digits need to be recoded into a final product digit, to be stored into the multiplier register, and a residual digit, to be added to the next partial product in the next step. Pisterzi [30] has shown that a recoder with r² states is necessary for this purpose. Such a recoder can be conceptually looked upon as an extension of the adder network to the left. However, in our case, due to the existence of bogus overflow in RBA-2, the basic cell of the adder MIRBA, the recoder's logic design would have to be different.

b) Another problem due to left shifting of the partial product is that the digits of the most significant half of the final product, which become available one by one as the output of the recoder (connected to the most significant digital position of the adder), need to be stored in the multiplier register and/or the buffer memory. In order that these product digits may be stored in proper order in the multiplier register, the product digit (output of the recoder) needs to be stored in the least significant digital position of the multiplier register, because this position is vacant due to the left shift of the multiplier register. But the MCU communicates with and knows the state of only the most significant PE. The solution to this problem is to send the value of the digit to be placed in the least significant digital position with the left shift multiplier microinstruction. As a matter of fact, the particular definition of the left shift microinstruction was contrived specifically to serve this purpose.

The MCU forms the partial products by issuing a set of three microinstructions as many times as there are digits in the multiplier operand. The three microinstructions are 'Left Shift Multiplier' to examine the multiplier digit, 'Form Multiple and Add' to form the partial product and 'Left Shift Accumulator' to shift the partial product. The microprogram for the formation of the double length product for six-digit multiplier and multiplicand operands is given below; m_i and P_j (i = 1,2,...,6 and j = 1,2,...,11,12) respectively denote multiplier and Product digits.
Microinstruction    Comments
LPM 3 ea            INR3 ← PEM[ea] (≡ Multiplier)
TD 1 2              INR2 ← INR1 (≡ Multiplicand)
LDC 1               INR1 ← [illegible in source]
LS 3                MCU ← m_1, INR3 ← r·INR3
FMA m_1             INR1 ← INR1 + m_1·INR2
LS 1                INR1 ← r·INR1
LS 3                MCU ← m_2, INR3 ← r·INR3 + P_1
FMA m_2
LS 1
LS 3
FMA m_3
LS 1
LS 3
FMA m_4
LS 1
LS 3
FMA m_5
LS 1
LS 3
FMA m_6
LS 1
LS 3

The remaining microinstructions achieve the partial products for the rest of the multiplier digits m_3, m_4, ..., m_6 in the same way as the sequence of immediately preceding four microinstructions.

A pictorial representation of the flow and execution of the above sequence in a 6 PE Mantissa Processing Logic is shown in Figure 6.1.

[Figure 6.1: flow and execution of the multiplication microinstruction sequence in a 6 PE Mantissa Processing Logic; figure body not recoverable from the scan.]

In this figure, _iP_j is the j-th digit of the portion of the i-th accumulated partial product which is in the Accumulator register, and P_j is the j-th digit of the final product, with

P_j = _jP_1 for 1 ≤ j ≤ 6, and P_j = _6P_(j-5) for the remaining digits.

[...] digit m_(i+1) only needs to be brought into the MCU because m_i is already known from the previous step. Note that there may be one more modified multiplier digit than the number of multiplier digits in the original multiplier operand, to maintain algebraic equivalence between the two forms of the same multiplier operand.

6.2.5 Floating point Division - The processing for Floating Point Division is almost identical to that for Multiplication except that the quotient digits must be determined by examination of the partial remainder.
Division is performed by repetitive additions and shifts. In the Floating Point Divide instruction FPD, EA, the effective DMM address of the divisor operand is given by EA and the Dividend is implicitly assumed to be the operand in the Accumulator. The processing by the MCU for Floating Point Divide involves calling upon the Exponent Control Unit to take the difference of the exponents of the dividend (accumulator) and the divisor at the buffer memory address ea, and processing of the mantissa to calculate the quotient digits. The exceptional conditions are the possible exponent underflow and a zero value of the divisor.

6.2.5.1 Microprogram for mantissa processing - The major problems in the implementation of Division in the arithmetic unit under consideration are:

a) the storage of the double precision dividend,
b) the calculation of the quotient digits,
c) the placement of quotient digits in the PEs, and
d) the extension to the left of the Accumulator and the Adder Network to take care of a shifted partial remainder.

We do not discuss the above problems in detail here and only indicate the possible solutions. For details, the reader should refer to Pisterzi [30]. The double precision dividend is stored in two registers, the Accumulator register INR1 and the multiplier register INR3. The Accumulator register INR1 holds the most significant half of the dividend and INR3 holds the least significant half. At the end of the processing, they respectively hold the remainder and the quotient. Register INR2 holds the divisor. Because of the redundant number representation for the quotient digit, the quotient digit can be calculated by a 'model division' [50] which uses only truncated versions of the divisor and the shifted partial remainders.
It is shown in Appendix A-2 that for radix 2^k (k ≥ 3), 3 digits of the divisor and 2 digits of the fractional part, in addition to the integer part, of the shifted partial remainder are sufficient for the calculation of the quotient digit with a redundancy ratio of 2/3 or 1. However, for radix-4, one more digit each of the divisor and the partial remainder is necessary if the quotient digit has a redundancy ratio of 2/3. For maximally redundant quotient digits, we use the same number of digits as for k ≥ 3.

The examination of the operand digits for quotient calculation in the MCU is done by shifting the operands left as many times as the number of digits necessary. Examination of the divisor digits needs to be done only once at the beginning, since the same digits take part in the calculation of a quotient digit at every step. However, since the partial remainder changes, it has to be shifted every time. But since the unshifted divisor and the radix-r shifted partial remainder are necessary in the PEs for calculation of a new partial remainder, the examination of the operand digits is done by shifting another register which contains a copy of the operand whose digits are to be examined. File register INR4 can be used for that purpose.

The quotient digit is stored in the vacant least significant digital position of the register INR3 by using the Left Shift microinstruction, just as in the case of Multiplication. The quotient digit is sent in the modifier field of the Left Shift microinstruction issued by the MCU. Because of the characteristics of the Division process, the shifted partial remainder in the accumulator would always be less than r in absolute magnitude. Thus the overflow recoder that was used in the Multiplication process can be used to store the integer part of the shifted partial remainder. Note that the technique used for the 'Model' Division is completely independent of the architecture of our Arithmetic Unit.
It can be done by table look-up or by any other method, depending on the time and cost considerations. Pure table look-up is too expensive for any reasonable radix greater than 4. We propose that the quotient digit be calculated in the MCU serially, one bit at a time, for radix ≥ 8 and then assembled into a radix-r digit before calculating the next partial remainder. The sequence of microinstructions for Floating Point Division is very similar to that for Multiplication and hence is not given here.

6.2.6 Normalization of operands - An operand is considered normalized if it satisfies definition 3, given in Section 3.5.1, of a normalized number. The major steps in the normalization process are left shifting of the signed-digit operand till there are no leading zeros, recoding the shifted operand by the 'Normalize Recode' microinstruction NR, and finally left shifting the recoded operand to remove the leading zeros, if any were created by the microinstruction NR. Because of the interface control signal Z_1 between the PE and the MCU, there is no need to launch a Left Shift microinstruction to examine the leading digit for zero magnitude. Simply monitoring Z_1 is sufficient, and this has the advantage that no overshift of the operand takes place during the normalization process. Note that since the microinstruction NR operates only on the operand in the Accumulator register INR1, the operand should be placed in INR1 for the normalization process.

6.2.7 Assimilation of signed-digit operand - The process of Assimilation converts the signed-digit operand, whose different digits carry, in general, different signs, to a form in which each digit has the same sign. This sign is the sign of the operand. The procedure and the sequence of microinstructions necessary for Assimilation are identical to those for Normalization except that the microinstruction AR, instead of NR, is launched for recoding the operand.

7. SUMMARY AND CONCLUSIONS

7.1 Summary and Discussion of Results

Chapter 1 described the characteristics and the constraints of the newly emerging technology of Large Scale Integration (LSI) and its implications for the design of a digital system. Based on this discussion, a set of desirable characteristics for an Arithmetic Unit was formulated, and a limited interconnection arithmetic unit as proposed by Pisterzi [30] was chosen as a vehicle to study the arithmetic and logic design aspects of the basic module of such an arithmetic unit.

Chapter 2 described briefly the logical organization and mode of operation of the arithmetic unit, especially of the Mantissa Processing Logic (MPL). The MPL is composed of a linear cascade of identical logic modules called Processing Elements (PEs) which execute a sequence of microinstructions issued to the MPL by the Mantissa Control Unit (MCU). The MCU is an interpreter for the 'machine' arithmetic instructions like 'Multiply', 'Add', etc., and issues a sequence of microinstructions for processing. The salient feature of the processing in the MPL is that a microinstruction issued by the MCU is not broadcast to all the PEs in the MPL. Instead, the microinstruction is executed by the PEs in sequence, starting with the most significant PE. (The 'significance' of a PE is the same as the arithmetic significance of the operand digit contained in that PE.) The method of processing in the MPL was illustrated by an example which showed how the various microinstructions in the microinstruction stream could be pipelined. This pipelining feature allows the meshing in of machine arithmetic instructions even before all the result digits of a previous machine instruction have been calculated. The discussion in this chapter forms the framework for the material in the subsequent chapters.

Chapter 3 is concerned with the arithmetic design of the Processing Element.
Due to the digit serial nature of the arithmetic processing and the desirability of limited intercommunication between PEs, a redundant number system is a necessity. The number system was chosen to be Signed Digit and maximally redundant, firstly because the conversion from the conventional sign and magnitude number representation to the maximally redundant one and vice versa is very simple, and secondly because radix-2^k arithmetic can be realized in terms of identical stages of redundant binary {1̄,0,1} arithmetic structures. This gives the required repetitive and uniform logical structure to the internal logic of the PE. Then a set of simple arithmetic microinstructions sufficient for the implementation of the four basic arithmetic operations is defined, and their digit algorithms are described by their arithmetic transfer functions and their algebraic implementation. The particular algebraic implementation of the digit algorithm is influenced by the LSI technology constraints of regularity of logic structure, simplicity of the basic cell of the logic structure and the least number of pins for the module. The regularity of logical structure is obtained by implementing the radix-2^k multi-input adder as a linear cascade of k multi-input redundant binary adders. Each multi-input redundant binary adder in turn is implemented as a tree structure of still simpler 2-input or 3-input redundant binary adders. A definition of normalized operands was developed, and its influence on the arithmetic properties of overall processing and the complexity of quotient digit calculation was also discussed. The definition chosen for normalized operands was such that the processes of 'normalization' and 'assimilation' could share the same logic.

Chapter 4, which is the major contribution of this thesis, treats the logic design of the Processing Element.
In this chapter, the gate complexity and pin complexity of the Processing Element are shown to be related to the bit width of the Processing Element (the radix of arithmetic processing in the MPL) and the redundancy in the multiplier/quotient digit used to form the partial products in the process of multiplication. The major components of the Processing Element are the Register File for the storage of active operands, the Digit Processing Logic, which is essentially a combinational logic network for the data transformation, and the Processing Element Control, which receives and decodes the microinstruction and generates the necessary sequence of control signals to condition the combinational network DPL. The number of gates and pins required for the DPL are very strongly dependent on the bit width of the Processing Element, whereas the number of gates and pins required for the PE control is almost independent of the bit width of the module. From the inspection of Tables 4.3 and 4.5, it is clear that 'local generation' of the collective Product Transfer t_i^p should be used to keep down the number of pins necessary on the PE module.

An examination of Tables 4.2 and 4.4, which give respectively the number of gates required in the implementation of DPL for a multiplier digit redundancy ratio of 1 and 2/3, leads to the conclusion that a redundancy ratio of 2/3 should be employed for the multiplier and quotient digit. This would require the existence of a multiplier digit recoder in the MCU, because the digits of the multiplier operand in the MPL have a redundancy ratio of unity. But the multiplier digit recoder is very simple. A still further advantage of restricting the redundancy ratio of the multiplier/quotient digit to ≤ 2/3 is that a_FMA, the number of PEs which must cooperate with a given PE in the execution of microinstruction FMA, would always remain 2 irrespective of the radix of the multiplier digit, when the MIRBAs in MIAD are implemented as a log-sum tree of RBA-2s only. This can be seen from Table 7.1.

Table 7.1 Values of b and a_FMA when the multiplier/quotient digit redundancy ratio is 1/2 ≤ δ ≤ 2/3

Radix r = 2^k   k   # of inputs to a MIRBA = ⌈k/2⌉ + 1   b = 2⌈log2(⌈k/2⌉ + 1)⌉   a_FMA
4               2   2                                    2                        2
8               3   3                                    4                        2
16              4   3                                    4                        2
32              5   4                                    4                        2
64              6   4                                    4                        2
128             7   5                                    6                        2
256             8   5                                    6                        2

Finally, inspection of Table 4.5 shows that the DPL requires only 36 pins for radix-256, that is, for an 8 bit width of the PE module. Since the PE control also requires 36 pins, an eight bit wide PE module should be employed in the Mantissa Processing Logic in order to balance the arithmetic processing cost in DPL and the PE control cost. This requires a total of 72 pins on the PE module package, which is by no means unreasonable by the standards of today's technology.

A negative aspect, from the LSI viewpoint, of the structure of the Mantissa Processing Logic should be noted here. Since the microinstructions flow from one PE to the next instead of being broadcast from the MCU, the number of pins required for the PE control is doubled in the present structure. Moreover, the request-response strategy of PE coordination control also doubles the number of pins required compared to a synchronous control synchronized to a central clock. However, the asynchronous control has the advantage that any number of PEs can be concatenated together more easily to achieve any desired precision without worrying about clock skew problems. It should be noted, however, that the arithmetic and logic design of the DPL as described in this thesis is independent of the nature of PE control, and the same DPL design can be used to design a PE module for a bus-structured and synchronous Mantissa Processing Logic.
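The first two computed columns of Table 7.1 and the pin balance argued above can be tabulated with a few lines of code. This is a sketch under assumptions: the closed forms used for the column headings (⌈k/2⌉ + 1 inputs per MIRBA, and b equal to twice the depth of a binary tree over those inputs) are inferred from the tabulated values, the 36-pin figures are taken from Table 4.5 and (4.36), and the function names are hypothetical.

```python
def ceil_log2(n):
    """Smallest integer e with 2**e >= n, for n >= 1."""
    return (n - 1).bit_length()


def mirba_inputs(k):
    """Inputs to one MIRBA at radix 2**k for 1/2 <= delta <= 2/3 (assumed form)."""
    return (k + 1) // 2 + 1


def b_value(k):
    """Column b of Table 7.1, read as 2 * depth of a tree of 2-input adders."""
    return 2 * ceil_log2(mirba_inputs(k))


print("r    k  inputs  b")
for k in range(2, 9):
    print(f"{2 ** k:<4} {k}  {mirba_inputs(k):<6}  {b_value(k)}")

# Pin balance for the 8-bit module, using the figures quoted in the text.
dpl_pins_k8 = 36       # Table 4.5, radix 256
pe_control_pins = 36   # equation (4.36)
print("total pins on the PE package:", dpl_pins_k8 + pe_control_pins)
```

Because b grows only logarithmically with k while the digit width grows linearly, the number of cooperating PEs stays small across the radices considered, which is consistent with the constant a_FMA column.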
In Chapter 5, a brief description was given of the logic organization and structure of a buffer memory which acts as an interface between the arithmetic unit and the Data Main Memory. The major characteristic of the buffer memory is that communication between the buffer memory and the Data Main Memory is on a word level, whereas the communication between the buffer memory and the Mantissa Processing Logic is on a digit serial basis. It was further argued that the size of the buffer memory in words is fairly small, of the order of 16 to 32 words.

Chapter 6 showed how various machine arithmetic instructions could be implemented using the microinstructions.

7.2 Suggestions for Further Work

Reliability and availability considerations were not addressed in this thesis. Since microinstructions flow from any PE to its adjacent PE, it is important that all the consecutive PEs operate properly in order for the Arithmetic Unit to operate properly. Determining organizational modifications in the interconnection structure of the PEs which would facilitate the automatic reconfiguration of properly operating PEs to yield a working Arithmetic Unit with degraded performance, in the presence of faulty PEs, is a very important area of further investigation.

Because the processing in the Arithmetic Unit takes place on a digit-by-digit basis starting with the most significant digit, this Arithmetic Unit structure has a potential for implementing a dynamically varying precision arithmetic. But due to the possibility of different PEs working concurrently on digits of different operands, certain structural modifications would be necessary. Investigation of such modifications is another interesting area of investigation.
One possible solution may be the use of some kind of 'end-of-the-word' marker as the delimiter 227 for the precision of the operands and the use of a bus-structure to in- form the MCU when the last digits of the operands have been operated on, A simulation of the Arithmetic Unit using data from real programs would be interesting and useful to determine the useful word capacity of a PEM module. Finally, the logic design of the MCU and the GACU should be per- formed to determine the actual gate complexity of this module. 228 LIST OF REFERENCES [1] Berg, R. 0. and Jack, L. A., "System and Logic Design for the Effective Use of LSI," Proceedings of the National Tele- communications Conference , Atlanta, Ga., Nov. 1973, pp. [2] Smith, M. G., "LSI and Systems Architecture in the 1970's," First US A- JAP AN Computer Conference , 1972, pp. 182-192. [3] Conway, M. E. and Spandorfer, L. M. , "A Computer System Designer's View of Large Scale Integration," 1968 Fall Joint Computer Conference, AFIPS Proc , Washington, D.C.: Spartan 1968, pp. 835-845. [4] Jennings, R. C, "Design and Fabrication of a General Purpose Airborne Computer Using LSI Arrays," Digest of IEEE Computer Group Conference , June 1968, pp. 50-54. [5] Beuscher, H. J. and Toy, W. N. , "Check Schemes for Integrated Microprogrammed Control and Data Transfer Circuitry," IEEE Trans. EC , Vol. C-19, No. 12, Dec. 1970, pp. 1153-1159. [6] Beelitz, H. R., Levy, S. Y., Linhardt, R. J., and Miller, H. S., "System Architecture for Large-Scale Integration," 1967 Fall Joint Computer Conference, AFIPS Proc , Washington, D.C.: Spartan 1967, pp. 185-200. [7] Clark, W. A., "Macromodular Computer Systems," 1967 Spring Joint Computer Conference, AFIPS Proc , Washington, D.C.: Spartan 1967, pp. 337-401. [8] Podraza, G. V., Gregg, R. S., Jr., and Slager, J. R., "Efficient MSI Partitioning for a Digital Computer," IEEE Trans. EC , Vol. C-19, No. 11, Nov. 1970, pp. 1020-1028. [9] Cserhalmi, N. 
, Lowenschuss, 0., and Scheff, B., "Efficient Parti- tioning for the Batch-fabricated Fourth Generation Computer," 1968 Fall Joint Computer Conference, AFIPS Proc . , Washington, D.C.: Spartan 1968, pp. 857-865. [10] Chen, T. C, "Overlap and Pipeline Processing" in Introduction to Computer Architecture , Edited by H. S. Stone, Chicago, Science Research Associates, Inc., 1975, pp. 375-431. 229 List of References (continued) [11] Ramamoorthy, C. V. and Economides, S. C, "Fast Multiplication Cellular Arrays for LSI Implementation," 1969 Fall Joint Computer Conference, AFIPS Proc , Washington, D.C.: Spartan 1969, pp. 89-98. [12] Gex, A., "Multiplier-Divider Cellular Array," Electronics Letters , 29th of July 1971, Vol. 7, No. 15, pp. 442-444. [13] Kingbury, N. G. , "High Speed Binary Multiplier," Electronics Letters , 20th of May 1971, Vol. 7, No. 10, pp. 277-278. [14] Wallace, C. S., "A Suggestion for a Fast Multiplier," IEEE Trans . EC, Feb. 1964, pp. 14-17. [15] Majithia, J. C. and Kitai, R. , "An Iterative Array for Multiplica- tion of Signed Binary Numbers," IEEE Trans. EC . , Vol. C-20, No. 2, Feb. 1971, pp. 214-216. [16] Baugh, C. R. and Wooley, B. A., "A Two's Complement Parallel Array Multiplication Algorithm," IEEE Transactions on Computers , Vol C-22, No. 12, pp. 1045-1047. [17] Majithia, J. C, "Non-Restoring Binary Division Using a Cellular Array," Electronics Letters , June 1970, pp. 303-304. [18] Majithia, J. C. and Kitai, R. , "Fast Multiplier /Divider Array Using a Controlled Iterative Array," private communication. [19] Bjorner, Dines., "A Flow-Mode, Self -Steering, Cellular Multiplier- Summation Processor," BIT 10, 1970, pp. 125-144. [20] Thompson, P. M. , "Digital Arithmetic Units for a High Data Rate," The Radio and Electronic Engineer , Vol. 45, No. 3, 1975. [21] Gardner, P. L., "Functional Memory and Its Microprogrammed Impli- cations," IEEE Trans. Comput ., Vol. C-20, No. 7, July 1971, pp. 764-775. [22] Lee, C. Y. and Paull, M. 
C., "A Content Addressable Distributed Logic Memory with Application to Information Retrieval," Proc. IEEE, Vol. 51, June 1963, pp. 924-932.
[23] Crane, B. A. and Githens, J. A., "Bulk Processing in Distributed Logic Memory," IEEE Trans. EC, April 1966, pp. 186-196.
[24] Batcher, K. E., "STARAN Parallel Processor System Hardware," National Computer Conference, AFIPS Proc., 1974, pp. 405-410.
[25] deRegt, M. P., "Introduction to Negative Radix Number Systems," Part I, Computer Design, May 1967, pp. 53-63.
[26] Avizienis, A., "Signed-Digit Number Representation for Fast Parallel Arithmetic," IRE Trans. EC, Vol. EC-10, Sept. 1961, pp. 389-400.
[27] Shaipov, N. Yu., "Methods of Realizing Arithmetic Operations in the Minus-Two Number System," Automation and Remote Control, 1970, pp. 835-841.
[28] Avizienis, A. and Tung, C., "A Universal Arithmetic Building Element (ABE) and Design Methods for Arithmetic Processors," IEEE Trans. Comput., Vol. C-19, No. 8, Aug. 1970, pp. 733-745.
[29] Avizienis, A. and Tung, C., "Design of Combinational Arithmetic Nets," Digest 1st Annual IEEE Computer Conference (Chicago, Illinois), Sept. 6-8, 1967, pp. 25-28.
[30] Pisterzi, M. J., "A Limited Connection Arithmetic Unit," Ph.D. dissertation, Department of Electrical Engineering, University of Illinois, Urbana, Illinois; also, DCS Report No. 398, June 1970.
[31] deRegt, M. P., "Negative Radix Arithmetic, Part 4, Multiplication and Division," Computer Design, Aug. 1967, pp. 36-44.
[32] deRegt, M. P., "Negative Radix Arithmetic, Part 5, Division: Testing the Remainder," Computer Design, Sept. 1967, pp. 44-50.
[33] deRegt, M. P., "Negative Radix Arithmetic, Part 6, Manual Division: the Magnitude Test," Computer Design, Oct. 1967, pp. 68-77.
[34] Szabo, N. S. and Tanaka, R. I., Residue Arithmetic and Its Applications to Computer Technology, New York, McGraw-Hill, 1967.
[35] Atkins, D.
E., "Design of Arithmetic Units of ILLIAC III: Use of Redundancy and Higher Radix Methods," IEEE Trans. Comput., Vol. C-19, No. 8, Aug. 1970, pp. 720-733.
[36] Cristelly, R. de ORY, "Design of a Dynamically Checked, Signed-Digit Arithmetic Unit," Computer Science Department, University of California, Los Angeles, California, Report No. UCLA-ENG-7366, November 1973.
[37] Sweeney, T., "An Analysis of Floating-Point Addition," IBM Systems Journal, Vol. 4, No. 1, 1965, pp. 31-42.
[38] Borovec, R. T., "The Logical Design of a Class of Limited Carry-Borrow Propagation Adders," M.S. Thesis, Department of Electrical Engineering, University of Illinois, Urbana, Illinois, August 1968. Also, Report No. 275, Department of Computer Science, University of Illinois, Urbana, Illinois.
[39] Rohatsch, F. A., "A Study of Transformations Applicable to the Development of Limited Carry-Borrow Propagation Adders," Ph.D. Thesis, Department of Electrical Engineering, University of Illinois, Urbana, Illinois, June 1967. Also, DCS Report No. 226, Department of Computer Science, University of Illinois, Urbana, Illinois.
[40] Robertson, J. E., "A Deterministic Procedure for the Design of Carry-save Adders and Borrow-save Subtracters," Department of Computer Science, University of Illinois, Urbana, Illinois, Report No. 235, July 5, 1967.
[41] Foster, C. C. and Stockton, F. D., "Counting Responders in an Associative Memory," IEEE Trans. Comput., Vol. C-20, December 1971, pp. 1580-1583.
[42] Bell, C. G. and Newell, Allen, Computer Structures: Readings and Examples, New York, McGraw-Hill Inc., 1971, pp. 628-637.
[43] Preparata, Franco P., "On the Representation of Integers in Nonadjacent Form," SIAM J. Appl. Math, Vol. 21, No. 4, December 1971.
[44] Avizienis, A., "Arithmetic Microsystems for the Synthesis of Function Generators," Proc. IEEE, Vol. 54, No. 12, December 1966.
[45] Goyal, L.
N., "ILLIAC III Computer System Manual: Arithmetic Units - Vol. 2," Department of Computer Science, University of Illinois, Urbana, Illinois, Report No. UIUCDCS-R-73-551, January 1973.
[46] Texas Instruments Incorporated, TTL Integrated Circuits Catalog, Dallas, Texas Instruments Catalog CC201, August 1969.
[47] Kuck, D., Budnick, P., Chen, S. C., Davis, E., Han, J., Kraska, P., Lawrie, D., Muraoka, Y., Strebendt, R. and Towle, R., "Measurement of Parallelism in Ordinary FORTRAN Programs," IEEE Computer, January 1974, pp. 37-46.
[48] Knuth, D. E., "An Empirical Study of FORTRAN Programs," Software Practice and Experience, Vol. 1, April-June 1971, pp. 105-133.
[49] Foster, C. C. and Riseman, E. M., "Percolation of Code to Enhance Parallel Dispatching and Execution," IEEE Trans. Comput., Vol. C-21, No. 12, December 1972, pp. 1411-1415.
[50] Atkins, D. E., "A Study of Methods for Selection of Quotient Digits During Digital Division," Ph.D. Thesis, Department of Computer Science, University of Illinois, Urbana, Illinois, June 1970. Also, DCS Report No. 397, Department of Computer Science, University of Illinois, Urbana, Illinois, June 1970.

APPENDIX A-1

ALGEBRAIC DESIGN OF A RIGHT-DIRECTED RECODER TO CHANGE MULTIPLIER DIGIT'S REDUNDANCY FROM $\delta = 1$ TO $\delta \le 2/3$ *

This recoder changes the multiplier operand

$$M = .\,m_1\, m_2\, m_3 \ldots m_{j-1}\, m_j\, m_{j+1} \ldots m_n, \qquad |m_j| \le (r-1) \quad \forall j = 0, 1, \ldots, n,$$

to an algebraically equivalent operand

$$M' = m'_0 \,.\, m'_1\, m'_2\, m'_3 \ldots m'_{j-1}\, m'_j\, m'_{j+1} \ldots m'_n$$

such that

$$|m'_0| \le 1 \quad \text{and} \quad |m'_j| \le \left\lceil \tfrac{2}{3}(r-1) \right\rceil \quad \forall j = 0, 1, \ldots, n.$$

* Strictly speaking, $\delta = \left\lceil \tfrac{2}{3}(r-1) \right\rceil / (r-1)$, which may be slightly larger than 2/3.

In order to do the above recoding serially on a digit-by-digit basis, starting from the most significant digit, one needs to know only the digit to the immediate left of the digit to be recoded in addition to the digit itself. For example, if $m_j$ is the digit to be recoded, then the algebraic design of the recoder is given by the following.

[Diagram: digits $m_{j-1}$ and $m_j$ of $M$, with the interim digits $w_{j-1}$, $w_j$ and the transfer digit $t_{j-1}$ derived from them.]

Each digit $m_{j-1}$ and $m_j$ is first recoded into a pair of digits $\{t_{j-2}, w_{j-1}\}$ and $\{t_{j-1}, w_j\}$ so that

$$m_{j-1} = r\, t_{j-2} + w_{j-1}, \qquad r > 4,$$

$$m_j = r\, t_{j-1} + w_j,$$

and

$$t_{j-1} = \begin{cases} 1 & \text{if } m_j \ge \left\lceil \tfrac{2}{3}(r-1) \right\rceil \\ 0 & \text{otherwise} \\ \bar{1} & \text{if } m_j \le -\left\lceil \tfrac{2}{3}(r-1) \right\rceil \end{cases}$$

$$-\left( \left\lceil \tfrac{2}{3}(r-1) \right\rceil - 1 \right) \le w_j \le \left\lceil \tfrac{2}{3}(r-1) \right\rceil - 1 .$$

The recoded digit is given by

$$m'_{j-1} = w_{j-1} + t_{j-1} .$$

The above recoder is applicable for all values of index $j$. Note that $m'_0$ cannot have a magnitude greater than 1, and the recoded multiplier operand may have one digit extra compared to the original operand.

APPENDIX A-2

PRECISION REQUIREMENTS FOR QUOTIENT DIGIT CALCULATION

According to the analysis by Atkins [50] based on P-D plot considerations, the worst case precision of the operands required for quotient digit calculation is given by the relation

$$|\Delta P| \le D_{\min}\left(\delta - \tfrac{1}{2}\right) - |\Delta d|\,(n - 1 + \delta) \qquad \text{(A2.1)}$$

where
$\Delta P$ = truncation error in the left shifted (by one digital position) partial remainder,
$\Delta d$ = truncation error in the divisor,
$n$ = maximum allowed value of the quotient digit,
$\delta$ = redundancy ratio of the quotient digit $= n/(r-1)$, and
$D_{\min}$ = minimum value of the truncated divisor.

Let

$$|\Delta d| = r^{-\Omega} \quad \text{and} \quad |\Delta P| = r^{-(\Omega - 1)}$$

where $\Omega$ = number of digits in the truncated divisor.

Case 1: $\delta = 1$

For a maximally redundant quotient digit, $\delta = 1$ and $n = (r-1)$. For a maximally redundant divisor normalized according to Definition 3 given in Section 3.5, $D_{\min}$ takes the value [expression illegible in source]. Substituting the above values in Equation (A2.1) gives an inequality [illegible in source] which simplifies to

$$r^{-\Omega} \le \frac{r-1}{r^{2}\,(3r-2)} \qquad \text{(A2.2)}$$

Values of $\Omega$ which satisfy the relation (A2.2) for different values of $r$ are tabulated in Table A.1.

Table A.1  Values of $\Omega$ vs. Radix and Redundancy Ratio of a Quotient Digit

    Radix r    Redundancy ratio, delta, of a quotient digit
               delta = 1      delta <= 2/3
    -------    ---------      ------------
       2           4               -
       4           3               4
       8           3               3
      16           3               3
      32           3               3
      64           3               3

Case 2: $\delta \le 2/3$

In this case, $n = \left\lceil \tfrac{2}{3}(r-1) \right\rceil$. Substituting this value and the corresponding $D_{\min}$ [expression illegible in source] in Equation (A2.1) gives an inequality which simplifies to

$$r^{-\Omega} \le \frac{(r-1)\,\bigl[\,2n - (r-1)\,\bigr]}{r^{2}\,\bigl[\,2r(r-1) + n(r-2)\,\bigr]} \qquad \text{(A2.3)}$$

Values of $\Omega$ which satisfy the relation (A2.3) for different values of $r$ are given in Table A.1.

Table A.1 shows that, except for $r = 2$ with $\delta = 1$ and $r = 4$ with $\delta \le 2/3$ (where 4 digits are needed), 3 digits of the divisor and the 2 most significant digits of the fractional part of the shifted partial remainder, in addition to the integer part of the shifted partial remainder, are sufficient to calculate the quotient digit. It can be further shown that not all the bits of the last digits of the truncated divisor and partial remainder are necessary for the quotient digit calculation.

VITA

Lakshmi N. Goyal was born in Rohtak, India on October 19, 1941. He received the B.Tel.E degree in Electronics and Telecommunication Engineering in 1964 and the M.Tel.E degree in 1965, both from Jadavpur University, Calcutta, India, and the Ph.D. degree in Electrical Engineering from the University of Illinois, Urbana in 1976.

From 1963 to 1965 he was associated with the Indian Statistical Institute - Jadavpur University joint computer project. He was responsible for the logic design and hardware implementation of the Arithmetic and Control Units of the Computer ISIJU-1, a variable word-length, microprogrammed solid state digital computer. In 1965, he became a lecturer in the Department of Electronics and Telecommunication Engineering, Jadavpur University, Calcutta, India, and continued to be associated with the ISIJU-1 project.

Since 1967, he has been a Research Assistant in the Department of Computer Science, University of Illinois, Urbana. From 1967 to 1971, he was associated with the Image Processing ILLIAC III Computer Project and worked on the design and implementation of the Scan-Display system, Interrupt Unit and Arithmetic Units of ILLIAC III. His research interests include Computer Arithmetic, Computer Architecture, Microprogramming, Digital System Design and Display Processors. He has several publications in these areas.

Mr.
Goyal is a member of the Association for Computing Machinery and the Institute of Electrical and Electronics Engineers.

BIBLIOGRAPHIC DATA SHEET

1. Report No.: UIUCDCS-R-76-797
Title and Subtitle: A STUDY IN THE DESIGN OF AN ARITHMETIC ELEMENT FOR SERIAL PROCESSING IN A LINEAR ITERATIVE STRUCTURE
Report Date: May, 1976
Author(s): Lakshmi Narayana Goyal
Performing Organization Name and Address: Department of Computer Science, University of Illinois, Urbana, Illinois
11. Contract/Grant No.: NSF DCR 73-07998
Sponsoring Organization Name and Address: National Science Foundation, Washington, DC

Abstracts

This study is concerned with the design of an Arithmetic Element for Serial Processing in a Linear Iteratively Structured Arithmetic Unit. The Arithmetic Unit is made up of identical logic modules called Processing Elements (PEs) such that each module [...]ically communicates with a maximum of three of its neighboring modules for data and control information. An arithmetic instruction is executed by a sequence of elementary microinstructions such that each microinstruction is executed by all the modules not in synchro-parallelism, but in sequence by each module. The arithmetic processing takes place serially on a digit-by-digit basis with the most-significant-digit-first (MSDF).

The arithmetic and logic design of the Processing Element and the implications of the design choices on the LSI implementation of a PE are described. The MSDF nature of arithmetic execution necessitates the use of the redundant number system for processing. The arithmetic design of the PE is discussed with respect to the number representation, the definition of a normalized number and the algebraic design of the digit algorithms for the microinstructions necessary to implement the four basic arithmetic operations.
Formulas are given for the gate and pin complexities of the various components of the PE as a function of the type of 2-tuple logic vector encoding for a redundant binary digit, the bit width of the PE and the amount of redundancy in the multiplier/quotient digit. It is found that a sign-magnitude logic vector encoding and the multiplier/quotient digit's redundancy of 2/3 or less should be employed in the design of the Processing Element.

Key Words: Addition; Arithmetic Design; Arithmetic Element; Digit-by-Digit Algorithms; Digital Computer Arithmetic; Distributed Memory; Division; Iterative Structure; Large Scale Integration; Multiplication; Processing Element; Redundant Number System; Serial Processing; Subtraction

19. Security Class (This Report): UNCLASSIFIED
20. Security Class (This Page): UNCLASSIFIED

Form NTIS-35 (10-70)