Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/studyindesignofa797goya 

 UIUCDCS-R-76-797 
 
 A STUDY IN THE DESIGN OF AN ARITHMETIC ELEMENT FOR 
 
 SERIAL PROCESSING IN A LINEAR ITERATIVE STRUCTURE 
 
 by 
 Lakshmi Narayana Goyal 
 
 May, 1976 
 
 
 Department of Computer Science 
 
 University of Illinois at Urbana-Champaign 

 Urbana, Illinois 61801 
 
 This work was supported in part by the Department of Computer Science 
 and in part by the National Science Foundation under grant NSF DCR 73-07998, 
 and was submitted in partial fulfillment of the requirements for the 
 degree of Doctor of Philosophy in Electrical Engineering, 1976. 
 
 
 ACKNOWLEDGEMENTS 
 
 I wish to express my deep gratitude to my thesis advisor, Professor 
 James E. Robertson, for his invaluable guidance, insight, constant 
 encouragement and personal friendship during the preparation of this 
 thesis. I would also like to thank Professors D. J. Kuck and S. R. Ray 
 for their advice, friendship and time for many useful discussions. 
 
 The support of the Department of Computer Science of the University 
 of Illinois, the Atomic Energy Commission of the United States and the 
 National Science Foundation during my studies at the University of 
 Illinois is sincerely appreciated. Many thanks are due to Mr. Stan 
 Zundo of the drafting department for the excellent illustrations, his 
 personal friendship and constant cooperation and to Mr. Dennis Reed for 
 fast and excellent printing services. The excellent cooperation of 
 Mr. Mark Goebel of the drafting department is very much appreciated. 
 
 Finally, I wish to thank my wife, Madhu, for her patience, 
 understanding, love and constant encouragement without which this 
 thesis would not have been possible. 
 
 
 TABLE OF CONTENTS 
 
 Page 
 
 ACKNOWLEDGEMENTS iii 
 
 LIST OF TABLES x 
 
 LIST OF FIGURES xi 
 
 LIST OF ABBREVIATIONS xv 
 
 1. INTRODUCTION 1 
 
 1.1 Introduction to LSI Design Constraints 1 
 
 1.2 Arithmetic Unit Structure and LSI 4 
 
 1.2.1 Partitioning of conventional ALU 5 
 
 1.2.2 Two dimensional iterative structures 6 
 
 1.2.2.1 Cellular arrays 6 
 
 1.2.2.2 Table look-up methods 9 
 
 1.2.3 New number system representation 11 
 
 1.3 Present Work 15 
 
 2. ORGANIZATION AND OPERATION OF ARITHMETIC UNIT 21 
 
 2.1 Introduction 21 
 
 2.2 Organization of the Arithmetic Unit 21 
 
 2.3 Organization of Mantissa Processing Logic 23 
 
 2.4 Formal Description of Processing in a PE 28 
 
 2.5 Generalized Example 31 
 
 2.6 The Micro-Instruction Repertoire of the PEs 35 
 
 2.6.1 The inter-register transfer microinstructions 36 
 
 2.6.2 The shift microinstructions 38 
 
 
 2.6.3 The arithmetic microinstructions 40 
 
 2.6.4 The memory-accessing microinstructions 41 
 
 2.6.5 The miscellaneous microinstructions 42 
 
 3. ARITHMETIC DESIGN AND IMPLEMENTATION CONSIDERATIONS 44 
 
 3.1 Introduction 44 
 
 3.2 Implications of Serial Processing on Arithmetic Design 44 
 
 3.3 Choice of Number System 45 
 
 3.4 Choice of Number Representation and Amount of Redundancy 46 
 
 3.4.1 Signed-digit number representations 46 
 
 3.4.2 Number format and range for mantissa 49 
 
 3.5 Normalization Considerations 50 
 
 3.5.1 Definition and range of normalized numbers 51 
 
 3.6 Arithmetic Microinstructions and Corresponding Digit Algorithms 55 
 
 3.6.1 Simple sum (SS) microinstruction 55 
 
 3.6.1.1 Digit algorithm 56 
 
 3.6.1.1.1 Arithmetic design of RBA-2 56 
 
 3.6.2 Form multiple and add (FMA) microinstruction 60 
 
 3.6.2.1 Digit algorithm 60 
 
 3.6.2.1.1 Algorithm 1 62 
 
 3.6.2.1.2 Algorithm 2 71 
 
 3.6.2.1.3 Design of a multi-input redundant binary adder (MIRBA) 71 

 3.6.2.1.3.1 Rohatsch's [39] technique 73 

 3.6.2.1.3.2 Log-sum tree technique 78 

 3.6.2.1.3.3 Tree-structure using RBA-3s and RBA-2s 80 
 
 3.6.3 Multi-sum (MS) microinstruction 84 
 
 3.6.4 Normalize Recode (NR) microinstruction 86 
 
 3.6.5 Assimilation Recode (AR) microinstruction 87 
 
 4. LOGIC DESIGN OF THE PROCESSING ELEMENT 91 
 
 4.1 Introduction 91 
 
 4.2 Block Diagram Description of a Processing Element 91 
 
 4.2.1 Register file 93 
 
 4.2.2 Logic design of digit processing logic 96 
 
 4.2.2.1 Block diagram description of DPL 96 
 
 4.2.2.2 Choice of logic vector encodings 99 
 
 4.2.2.3 Logic design of RBA-2 (BU) 102 
 
 4.2.2.4 Logic design of a radix-2^k multi-input adder (MIAD) 109 

 4.2.2.5 Logic design of digit product generator (DPG) 113 

 4.2.2.6 Logic design of digit sum encoder 117 

 4.2.2.7 Logic design of selector networks 122 

 4.2.2.7.1 Logic design of adder input selector (sADR) 122 

 4.2.2.7.2 Logic design of digit sum encoder selector (sDSE) 126 
 
 4.2.2.7.3 Logic design of selectors sRIB, sROB and sTOP 126 

 4.2.2.7.4 Storage buffer registers of DPL 132 

 4.3 Design of PE Control 133 

 4.3.1 Logical organization of PE control 135 

 4.3.1.1 Global description of interaction of subcontrols 136 

 4.3.2 Logic design of PE control 139 

 4.3.2.1 Block diagram description of PE control logic (PCL) 139 

 4.3.2.2 Design and description of microinstruction formats 139 

 4.3.2.3 Description of subcontrols by control sequence charts 145 

 4.3.2.3.1 Control sequence chart conventions 146 
 
 4.3.2.3.2 Description of R-control 149 
 
 4.3.2.3.3 Description of DM-control 151 
 
 4.3.2.3.4 Description of T-control 161 
 
 4.3.2.3.5 Description of F-control 161 
 
 4.3.2.3.6 Description of G-control 163 
 
 4.3.2.3.6.1 Description of G_gn-control 165 

 4.3.2.3.6.2 Description of G_ap-control 169 
 
 4.3.2.3.7 Description of E-control 171 
 
 4.4 Logic Complexity of Processing Element 175 
 
 4.4.1 Logic complexity of DPL 175 
 
 4.4.1.1 Gate complexity of digit processing logic DPL 175 

 4.4.1.2 Pin complexity of DPL 179 

 4.4.1.3 Effect of multiplier digit's redundancy on gate and pin 
 complexity of DPL 183 
 
 4.4.2 Logic complexity of PE control 189 
 
 4.4.2.1 Gate complexity of PE control 190 
 
 4.4.2.2 Pin complexity of PE control 190 
 
 4.4.3 Overall logic complexity of a PE 192 
 
 5. INTERACTION WITH MEMORY 195 

 5.1 Introduction 195 
 
 5.2 Organization of Local Operand Mantissa Memory, LOMM 198 
 
 5.3 A Description of Buffer Memory Control 200 
 
 5.4 Size of Buffer Memory 204 
 
 6. IMPLEMENTATION OF MACHINE ARITHMETIC INSTRUCTIONS 207 
 
 6.1 Introduction 207 
 
 6.2 Implementation of 'Machine' Arithmetic Instructions 207 

 6.2.1 Global description of the processing of a 'machine' 
 arithmetic instruction 207 
 
 6.2.2 Floating point Addition 209 
 
 6.2.2.1 Mantissa processing microprogram 209 
 
 6.2.2.2 Mantissa overflow correction 210 
 
 6.2.3 Floating point Subtraction 211 
 
 6.2.4 Floating point Multiplication 211 
 
 6.2.4.1 Microprogram for mantissa processing 212 
 
 
 6.2.5 Floating point Division 217 
 
 6.2.5.1 Microprogram for mantissa processing 217 
 
 6.2.6 Normalization of operands 220 
 
 6.2.7 Assimilation of signed-digit operand 220 
 
 7. SUMMARY AND CONCLUSIONS 221 

 7.1 Summary and Discussion of Results 221 
 
 7.2 Suggestions for Further Work 226 
 
 LIST OF REFERENCES 228 
 
 APPENDIX 
 
 A-1 ALGEBRAIC DESIGN OF A RIGHT-DIRECTED RECODER TO CHANGE 
 MULTIPLIER DIGIT'S REDUNDANCY FROM δ = 1 TO δ ≤ 2/3 233 

 A-2 PRECISION REQUIREMENTS FOR QUOTIENT DIGIT CALCULATION 236 
 
 VITA 239 
 
LIST OF TABLES 
 
 Table Page 
 
 2.1 a of the Microinstructions of the Example of Figure 2.4 31 

 3.1 Values of a and a. for Various (k+1)-Input MIRBA Configurations 78 

 4.1 Logic Vector Encodings 100 

 4.2 Gate Complexity of DPL vs Radix for 1/2 ≤ δ ≤ 1 and 
 LVE_3 Encoding for a Redundant Binary Digit 178 

 4.3 Pin Complexity of DPL vs Radix for 1/2 ≤ δ ≤ 1 182 

 4.4 Gate Complexity of DPL vs Radix for 1/2 ≤ δ ≤ 2/3 and 
 Encoding LVE„ for a Redundant Binary Digit 188 

 4.5 Pin Complexity of DPL vs Radix for 1/2 ≤ δ ≤ 2/3 188 

 4.6 Gate Complexity of Various Subcontrols of TCS 191 

 7.1 Values of a and a. when the multiplier/quotient digit 
 redundancy ratio is 1/2 ≤ δ ≤ 2/3 224 

 A.1 Values of ft vs Radix and Redundancy Ratio of a Quotient Digit 237 
 
 
 LIST OF FIGURES 
 
 Figure Page 
 
 1.1 Block Diagram of a Basic Model of Limited Connection 
 
 Arithmetic Unit 18 
 
 2.1 Global Block Diagram of Arithmetic Unit 22 
 
 2.2 The Organization of the Limited Connection Mantissa 
 Processing Logic 25 
 
 2.3 The Distribution of Operands Digits in the PEs of 
 
 Mantissa Processing Logic 26 
 
 2.4 Illustration of the Execution of the Generalized 
 
 Example in Mantissa Processing Logic 32 
 
 2.5a Illustration of Processing for Microinstruction vi- 
 and a = 2 34 
 
 2.5b Illustration of Processing for Microinstruction vi- 
 and a = 1 34 
 
 3.1 Illustration of Digit Algorithm for Microinstruction SS. 57 
 
 3.2 Arithmetic Structure of an RBA-2 58 
 
 3.3 Functional Representation of Microinstruction FMA 61 
 
 3.4 Functional Representation of the Digit Algorithm for 
 
 FMA 63 
 
 3.5 Functional Representation of Transformation f~ 64 
 
 3.6 A Redundant Binary Product Matrix 66 
 
 3.7 Illustration of Adjacent Overlapping Product Matrices 
 and 'Collective Product Transfer, CPT' 68 

 3.8 Illustration of the Implementation of Algorithm 1 of 
 Microinstruction FMA, using Redundant Binary Product 
 Matrix Generator (Radix = 16) 69 

 3.9 Illustration of the Implementation of Algorithm 2 of 
 Microinstruction FMA using ROMs (Radix = 16) 72 
 
 3.10a Illustration of the Algebraic Design of a MIRBA, using 
 
 First Order Simple Transformations only 75 
 
 
 3.10b Illustration of the Algebraic Design of a MIRBA using 
 Higher (≥ 2) Order Simple Transformation 76 
 
 3.10c Algebraic Design of Bottom Level (Level 4) Box of 
 
 Figure 3.10a 77 
 
 3.11 Illustration of Log-Sum Tree Structure for a MIRBA 
 using RBA-2s only (k = 4) 79 
 
 3.12 Arithmetic Structure of an RBA-3 81 
 
 3.13 Illustration of Tree Structure for a MIRBA using RBA-2s 
 
 and RBA-3s (k = 4) 82 
 
 3.14 Illustration of Digit Algorithm for Microinstruction MS. 85 
 
 3.15 Flowchart of the Digit Algorithm for Microinstruction 
 
 NR 88 
 
 3.16 Flowchart of the Digit Algorithm for Microinstruction 
 
 AR 90 
 
 4.1 Block Diagram of a Processing Element 92 
 
 4.2 Block Diagram of the Register File of the PE 95 
 
 4.3 Block Diagram of Digit Processing Logic (DPL) 97 
 
 4.4 Algebraic Design of a 2-input Redundant Binary Adder 
 
 (RBA-2) 103 
 
 4.5 Schematic Functional Diagram of an RBA-2 using LVE 105 
 
 4.6 Logic Implementation of an RBA-2 using Logic Vector 
 Encoding LVE_1 (Version 1) 106 

 4.7 Logic Implementation of an RBA-2 using Logic Vector 
 Encoding LVE_1 (Version 2) 107 

 4.8 Logic Implementation of an RBA-2 using Logic Vector 
 Encoding LVE_2 108 

 4.9 Logic Implementation of an RBA-2 using Logic Vector 
 Encoding LVE_3 110 
 
 4.10 Schematic Diagram of a Radix-2^k (k = 4) Multi-input 
 Adder (MIAD) 111 
 
 
 4.11a Schematic Diagram of Square Array DPG 114 
 
 p 
 
 4.11b Illustration of 'Adjacent Generation' of t.. 114 
 
 p 
 
 4.11c Illustration of 'Local Generation' of t. 114 
 
 4.11d Illustration of a Combination of an MIAD and DPG using 
 
 'Local Generation' of tj 116 
 
 4.12a Block Diagram of Digit Sum Encoder (DSE) 118 
 
 4.12b Logic Network Realization of RBTC 118 
 
 4.12c Logic Network Realization of TCSM (o. = P. ) 118 
 
 1 \ 
 
 4.13a Logic Implementation of Selector sADR for Magnitude 
 
 Bits 124 
 
 4.13b Logic Implementation of Selector sADR for Sign Bits 125 
 
 4.14 Logic Implementation of Selector sDSE 127 

 4.15 Logic Implementation of Selector sRIB 128 

 4.16 Logic Implementation of Selector sTOP 130 
 
 4.17 Logic Implementation of Selector sROB 131 
 
 4.18 Logic Organization of PE Control Signal Generator 137 
 
 4.19 Block Diagram of PE Control Logic 140 
 
 4.20 Microinstruction Codes and Formats 142 
 
 4.21 Control Sequence Chart Symbols 147 

 4.22 Control Sequence Chart for R-control 150 

 4.23a Control Sequence Chart for DM-control, Part I 152 

 4.23b Control Sequence Chart for DM-control, Part II 153 

 4.23c Control Sequence Chart for DM-control, Part III 154 

 4.24 Control Sequence Chart for T-control 162 
 
 4.25 Control Sequence Chart for F-control 164 
 
 4.26 Control Sequence Chart for G_gn-control 166 

 4.27 Control Sequence Chart for G_ap-control 170 

 4.28 Control Sequence Chart for E-control 172 
 
 4.29 Illustration of the Effect of NAF Recoded Multiplier 
 Digit on # of Inputs to MIRBA of Radix-2^k (k=7) Adder 184 
 
 5.1 Block Schematic Diagram of Local Operand Processor 
 
 Memory 197 
 
 5.2 Structure of Control Table in LOMCO 201 
 
 6.1 A Pictorial Representation of the Flow of the Sequence 
 of Microinstructions for Mantissa Processing Logic for 
 Multiplication 215 
 
 
 LIST OF ABBREVIATIONS 
 
 APR Adjacent Operand digit Register 
 
 AR 'Assimilation Recode' Microinstruction 
 
 BU Borovec Unit - A 2-input redundant binary adder 
 
 DMM Data Main Memory 
 
 DPG Digit Product Generator 
 
 DPL Digit Processing Logic 
 
 DSE Digit Sum Encoder 
 
 ECU Exponent Control Unit 
 
 EPL Exponent Processing Logic 
 
 FMA 'Form Multiple and Add' Microinstruction 
 
 IBR Register File Input Bus Buffer Register 
 
 INR_i ith Internal Register (File Register) 
 
 GACU Global Arithmetic Control Unit 
 
 GIR G_-information Input Buffer Register 
 
 LDR LOMM Data Register 
 
 LOEM Local Operand Exponent Buffer Memory 
 
 LOMCO Local Operand Memory Controller 
 
 LOMM Local Operand Mantissa Buffer Memory 
 
 LPM 'Load PE from PEM' Microinstruction 
 
 LVE_i ith Logic Vector Encoding (i=1,2,3) 
 
 MATD Multi-input Adder's 'Transfer' Decoder 
 
 MATE Multi-input Adder's 'Transfer' Encoder 
 
 MCU Mantissa Control Unit 
 
 MIAD Multiinput Adder Network 
 
 MIR Microinstruction Register 
 
 MIRBA Multiinput Redundant Binary Adder 
 
 MPL Mantissa Processing Logic 
 
 MS 'Multi Sum' Microinstruction 
 
 NR 'Normalization Recode' Microinstruction 
 
 PE Processing Element 
 
 PEM Processing Element Memory 
 
 RBA-2 Two input Redundant Binary Adder 
 
 
 RBA-3 Three input Redundant Binary Adder 
 
 RB Redundant Binary Encoded Format for Radix-r digit 
 
 RIP_i Register Input Port for ith Processing Element 

 ROP_i Register Output Port for ith Processing Element 
 
 RS 'Right Shift' Microinstruction 
 
 SM, Sign Magnitude logic encoding format for a redundant binary digit 
 
 SM Sign Magnitude Encoding format for a radix-r signed digit 
 
 SPM 'Store into PEM from PE' Microinstruction 
 
 sRIB Selector for Register File Input Bus 
 
 sROB Selector for Register File Output Bus 
 
 SS 'Simple Sum' Microinstruction 
 
 sTOP Selector for 'Transfer Output Port' 
 
 sADR Selector for data input to Adder Network MIAD 
 
 sDSE Selector for input to Digit Sum Encoder, DSE 
 
 TD 'Transfer Direct' Microinstruction 
 
 TI 'Transfer with Inverted Sign' Microinstruction 
 
 TIP_i Adder Transfer Input Port for ith PE 

 TOP_i Adder Transfer Output Port for ith PE 
 
1. INTRODUCTION 
 
 1.1 Introduction to LSI Design Constraints 
 
 The advent of large scale integration (LSI) technology for the 
 manufacture of logic circuits has posed a new challenge to computer 
 system and logic designers: to find ways of exploiting the full 
 potential of LSI (reliability, lower cost and improved speed) in the 
 design of digital computers. 
 
 LSI technology has peculiar constraints which have important 
 implications for its effective use in future systems. These can be 
 broadly classified into two categories: external, or system level, 
 considerations and internal, or logic circuit level, constraints. With 
 LSI, hundreds of logic functions can be fabricated on subminiature 
 substrates. Since the initial development cost is very high, it is 
 important that only a small number of standard elements be developed, 
 so that this cost is amortized. However, designing universal elements 
 of the complexity offered by LSI is very difficult. A potential benefit 
 of LSI that has been continually cited is an increase in reliability 
 over current systems. Since system reliability is inversely 
 proportional to the number of module interconnections, LSI devices 
 should have a high gate-to-pin ratio. But universality and a high 
 gate-to-pin ratio are conflicting goals: a high gate-to-pin ratio tends 
 to give the device a unique personality, so that it cannot be used 
 repetitively in a system. 
 
 At the internal level, designing logic for integration on the chip 
 requires a reorientation of the relative values placed on the resources 
 used to realize the design. One of the severest constraints in the 
 design of an LSI device is the restriction on interconnections on the 
 chip itself. This is due to the limited wiring area available, the 
 limited number of planes to which all of the wiring must be confined, 
 and a host of other topological considerations which combine to 
 determine the locations of candidate points for interconnections. The 
 required wiring can be reduced by forcing the logic design of the chip 
 into a cellular or regular structure. A regular structure has several 
 important implications: 
 
 a) It facilitates every step of the LSI manufacturing process by 
 making it possible to perform relatively simple tasks 
 repetitively. Mask making, for example, is facilitated by the 
 repetitive structure. 

 b) It is possible to design and optimize a simple cell to 
 achieve the most function per dollar, whereas a large chip of 
 random gates is impossible to optimize because of the number 
 of variables involved. 

 c) Testing of LSI devices is a major cost factor. The generation 
 of test algorithms is easier for simple cells and for regular, 
 repetitive structures. 

 d) In addition, as the yield increases due to technology 
 improvements, larger devices can be made out of the same 
 simple cells. 
 
 Another limitation of LSI which must be considered in logic design 
 is that of external connections. More pins require more external gates 
 to drive the capacitance of the external pins. The increased number of 
 gates, and the higher current they draw, raise the temperature of the 
 device, which in turn increases its failure rate. 
 
 Semiconductor memories meet most of the requirements of LSI 
 technology, and the present use of LSI in computer systems is in the 
 form of these memories for system enhancement applications [1], [2], 
 [3]. These applications include the use of RAMs as scratch-pad memories 
 and as cache memories, which reduce main memory requests and thus 
 increase the computational throughput. The use of ROMs for 
 microprogramming, table look-up operations and hardwired subroutines 
 also increases performance at relatively little cost. Content 
 addressable memories, queues and stacks can greatly simplify building 
 and maintaining tables and greatly reduce system overhead and software 
 costs. 
 
 The use of LSI in the design of the central processor itself involves 
 proper logic partitioning, that is, organizing the internal logic 
 structure such that large functional arrays on a chip can be used 
 repetitively. Two partitioning methods are bit-slicing and functional 
 partitioning. Bit-slicing [4], [5] tends to be system dependent rather 
 than universal and is thus suitable for custom LSI only. In functional 
 partitioning, the machine is structured into modules, each of which is 
 a completely self-contained processor having local storage, some 
 processing logic and the control necessary for the module to execute 
 its function. Each module acts as a small insular unit of logic. The 
 module control sees only its own state, and the requirements for 
 communication outside the module are correspondingly reduced. Excellent 
 examples of functional partitioning are RCA's LIMAC and the 
 Macromodular computers [6], [7]. 
 
 In this thesis, we report on a study of the logical organization and 
 design of an Arithmetic Unit capable of performing the four basic 
 operations of addition, subtraction, multiplication and division. 

 The organization and design of the Arithmetic Unit are influenced 
 by the LSI technology constraints of modularity, a minimum number of 
 different module types, structural regularity of the module, limited 
 pin count and limited fan-out capability. 

 In the rest of this chapter, we first briefly review the various 
 proposals for the Arithmetic Unit and its LSI implementation. This is 
 followed by a brief introduction to the model chosen for study in this 
 research, and by the scope and an overview of the thesis. 
 
 1.2 Arithmetic Unit Structure and LSI 
 
 The proposals for the architecture of LSI-implementable Arithmetic 
 Units can be broadly classified into three categories: a) partitioning 
 of the conventional ALU, which uses the standard binary number 
 representation, b) two dimensional iterative (cellular) structures and 
 table look-up methods, and c) use of number representations different 
 from conventional binary. These three categories are not mutually 
 exclusive but rather interrelated; the classification is used here 
 simply for ease of exposition. 
 
 1.2.1 Partitioning of conventional ALU - A basic, low performance 
 parallel ALU essentially consists of registers (the accumulator, the 
 M-Q register, and the registers for parallel shifting), a full adder, 
 circuitry for complementing and shifting, and some control logic to 
 coordinate their activity for arithmetic and logic control. For 
 efficient operation, it is necessary to allow flexible and rapid 
 transfer of information from any one register to another. In binary 
 arithmetic, except for the end bits, it is possible to partition all 
 the circuitry associated with one bit, along with some local decoder 
 bits for gating functions, into one LSI cell [8]. Thus, all the data 
 transfer and manipulation circuitry can be assembled into identical 
 cells which provide a fairly good gate/pin ratio. A classic example of 
 this approach is the Texas Instruments LSI airborne computer, the model 
 2502. However, this bit-slicing approach breaks down for a high 
 performance ALU using the conventional binary number system, when 
 circuitry for fast carry generation and propagation is added. Raytheon 
 [9] has combined four bit slices into one LSI module so that look-ahead 
 logic can be used on the four bits of the module. But this still does 
 not provide the flexibility for unlimited carry look-ahead with only 
 one type of module. 
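
 As a rough illustration of why a four-bit look-ahead module does not 
 by itself give unlimited look-ahead, the sketch below models one such 
 module and then chains copies of it. The function names (cla4, 
 add_word) and the ripple connection between modules are assumptions 
 made for this illustration only; they are not details of the module 
 described in [9]. 

```python
def cla4(a, b, c_in):
    """Add two 4-bit values; every carry inside the group is a look-ahead term."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # generate:  a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate: a_i XOR b_i
    carries = [c_in]
    for i in range(4):
        # Flat two-level form:
        # c[i+1] = g[i] | p[i]g[i-1] | ... | p[i]..p[1]g[0] | p[i]..p[0]c_in
        carry = c_in
        for j in range(i + 1):
            carry &= p[j]
        for j in range(i + 1):
            term = g[j]
            for k in range(j + 1, i + 1):
                term &= p[k]
            carry |= term
        carries.append(carry)
    total = 0
    for i in range(4):
        total |= (p[i] ^ carries[i]) << i               # sum bit = p_i XOR c_i
    return total, carries[4]


def add_word(a, b, width):
    """Chain identical 4-bit modules; the carry ripples from module to module."""
    result, carry = 0, 0
    for pos in range(0, width, 4):
        nibble, carry = cla4((a >> pos) & 0xF, (b >> pos) & 0xF, carry)
        result |= nibble << pos
    return result, carry
```

 In this model a longer word is handled simply by adding modules, but 
 the carry between modules still ripples; removing that ripple would 
 need a second kind of circuit spanning several modules. 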
 
 To achieve high performance with conventional number systems, the 
 various slices of the ALU work in synchro-parallelism [10] and are 
 controlled by signals broadcast from the central control logic. Since 
 control functions are more difficult to modularize than functions 
 related to data operations, the micromemory control technique is used 
 to map the irregular and diverse algorithms for arithmetic control into 
 the regular structure of a memory. However, for large word lengths, the 
 broadcasting of control signals is not compatible with the LSI 
 constraints of low fan-out and neighborhood connections only. 
 
 To overcome this problem of control irregularity and broadcasting, 
 many combinational two-dimensional iterative structures (cellular arrays) 
 have been proposed for multiplication, division and other arithmetic 
 functions like square root, etc. 
 
 1.2.2 Two dimensional iterative structures - Two dimensional itera- 
 tive structures are memory-like structures and admirably satisfy the LSI 
 constraints. From the arithmetic unit point of view, they can be further 
 classified into two sub-categories of cellular arrays and table look-up 
 methods. 
 
 1.2.2.1 Cellular arrays - A cellular array is a two-dimensional 
 iterative configuration of identical cells, each of which contains both 
 logic and storage and is connected mainly to its immediate neighbors. 
 Such an array, therefore, has the form of a memory array that is enhanced 
 with logic at each digit position. A cellular array is a spatial analog 
 of the temporal sequence of steps of the control algorithm; i.e., the 
 cellular array performs the same sequence of computations iteratively in 
 space rather than in time. Cellular arrays can be either dedicated 
 exclusively [11] to one arithmetic function or programmable [12] so 
 that they can serve many functions. Since all multiplication processes 
 share the basic algorithm of add/no add followed by shift, multiplier 
 arrays differ mainly in how the cells are interconnected to speed up 
 the effective addition of the partial products. Some use a tree adder 
 structure [13], while others use carry save adders [14] in the basic 
 cells so that carry propagation is avoided at every stage and occurs 
 only at the last stage. Most arrays assume that the operands 
 are either positive or in the sign magnitude representation with the 
 sign of the product being determined separately. A negative multiplier 
 in 2's complement representation needs a correction to the product 
 obtained by the simple "add/no add and shift" algorithm and makes the 
 interconnection of the cells in the array somewhat irregular. A cell- 
 ular array for multiplication has been suggested [15] which makes use 
 of multiplier recoding and a conditional adder/subtracter cell so that 
 either addition or subtraction of the shifted multiplicand to the 
 partial product can take place. This does not require final correction 
 to the product. However, the recoding array is structurally different 
 from the multiplication array, so two types of functional array are 
 needed to generate the final product. More recently, Baugh and 
 Wooley [16] have proposed a cellular multiplier in which no correction 
 is necessary. Similarly, the cellular array for division uses the basic 
 binary restoring or non-restoring algorithm to produce the quotient. 
 The interconnection structure of the end cells, which compare the signs 
 of the divisor and the partial remainder, again differs from that of 
 the rest of the cells [17]. For fixed point operations, a special step 
 may be necessary at the end to generate a remainder of the same sign 
 as the dividend. In addition to the dedicated arrays for each 
 arithmetic operation, a programmable array suitable for both 
 multiplication and division has also been proposed recently. Here the 
 most significant cell of each row is more complicated and acts either 
 as a multiplier recoder cell or as a comparator cell for the signs of 
 the divisor and partial remainders, depending upon the arithmetic 
 operation [18]. For operands of large word length, 
 the cellular array contains more cells than can reliably be implemented 
 on a single silicon slice, and hence is subdivided into subarrays which 
 are externally connected. Such an array made up of subarrays will take 
 more time to generate the final result compared to a fully iterative 
 monolithic array on a single silicon slice. Cellular arrays can be 
 either synchronous or asynchronous in operation. An asynchronous 
 cellular multiplier for vector or pipelined mode operations has been 
 proposed by Bjorner [19]. 
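
 The "add/no add and shift" algorithm, and the correction required 
 for a negative 2's complement multiplier, can be sketched as follows. 
 This is sequential code rather than a model of any of the cited 
 arrays, and the function names are chosen here for illustration; each 
 loop iteration plays the role of one row of cells. 

```python
def to_bits(value, n):
    """n-bit 2's complement pattern of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def add_shift_multiply(multiplicand, multiplier_bits):
    """Add/no add and shift; the multiplier bits are treated as unsigned."""
    product = 0
    for i, bit in enumerate(multiplier_bits):
        if bit:                              # add the shifted multiplicand
            product += multiplicand << i
        # bit == 0: no add, only the shift to the next row
    return product


def multiply_twos_complement(x, y, n):
    """Multiply x by an n-bit 2's complement multiplier y."""
    raw = add_shift_multiply(x, to_bits(y, n))
    if y < 0:
        # The bit pattern of a negative y stands for y + 2**n, so the raw
        # result is too large by x * 2**n; this is the product correction.
        raw -= x << n
    return raw
```

 For example, with n = 4 the multiplier -3 has the bit pattern 1101, 
 which the uncorrected algorithm reads as 13; multiplying 5 by it gives 
 65, and subtracting 5 * 16 restores the expected -15. 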
 
 For arrays using conventional number systems, the problem of carry 
 propagation along a row of cells still plagues the cellular arrays. 
 
 Thompson [20] and Chen [10] have suggested operating the cellular 
 array in a diagonally-timed fashion, so that digit level pipelining 
 takes place in two dimensions, giving a higher computational throughput. 
 
 Although cellular arrays do satisfy the structural needs of logic 
 circuits for LSI technology, they have a few shortcomings, which can 
 be summarized as follows: 
 
   i) Due to the use of conventional ripple carry adder/subtracters 
      in the basic cells, the total number of cells needed is 
      always equal to that necessary for a double length product, 
      although in practice only the single length product is 
      usually desired. This is because the carry ripples from the 
      least significant end to the most significant bit position. 
      If the cells that contribute to the less significant part 
      of the double length product (for fractional mantissas) do 
      not exist, the most significant part of the product will be 
      in error, and this error becomes very acute for operands of 
      large word length. These otherwise unnecessary cells raise 
      the cost of the array. 

  ii) In cellular arrays, as many rows of adder/subtracter basic 
      cells are needed as there are multiplier or quotient bits. 
      But the rows corresponding to "zero" multiplier or quotient 
      bits effectively serve no useful purpose (except possibly 
      shifting). These unnecessary cells not only add to the cost 
      by using large amounts of silicon area, but also increase 
      the probability of faults on the chip and make the testing 
      of the chip more expensive. 

 iii) For large word lengths, where subarrays have to be 
      externally connected, further subarrays cannot be added to 
      expand the word length without extensive changes in the 
      external wiring. Thus the expandability of two-dimensional 
      structures is poor. 
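
 Shortcoming i) can be made concrete with a small numerical sketch. It 
 compares the upper half of the exact double length product with the 
 result obtained when every cell feeding the lower half is simply left 
 out. The model (unsigned n-bit operands, one cell per partial product 
 bit) and the function names are assumptions made for this 
 illustration. 

```python
def full_high_product(x, y, n):
    """Upper n bits of the exact 2n-bit product of two n-bit operands."""
    return (x * y) >> n


def truncated_high_product(x, y, n):
    """Upper n bits when the cells for the n low-order columns are omitted."""
    total = 0
    for i in range(n):            # multiplicand bit position
        for j in range(n):        # multiplier bit position (array row)
            # Keep only cells in columns of weight 2**n and above.
            if i + j >= n and (x >> i) & 1 and (y >> j) & 1:
                total += 1 << (i + j)
    return total >> n


def all_ones_error(n):
    """Error, in units of the last retained place, for all-ones operands."""
    x = y = (1 << n) - 1
    return full_high_product(x, y, n) - truncated_high_product(x, y, n)
```

 With all-ones operands this sketch gives errors of 3, 7 and 15 units 
 in the last retained place for n = 4, 8 and 16, which is consistent 
 with the remark that the error grows with word length. 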
 
 1.2.2.2 Table look-up methods - The structural regularity of memories 
 makes them very suitable for implementation in large-scale integration. 
 All the logic and arithmetic operations in the machine can be performed 
 by extensive table look-up operations. Table look-up can be done either 
 in parallel or in a serial fashion. Fully parallel operation, however, 
 requires too large a table for any reasonable word length and is out of 
 the question. Tables for bit parallel, byte serial operation, on the 
 other hand, can be reasonably implemented for arithmetic operations 
 like addition and multiplication, because the number of words required 
 is related to 2^n, where n is the operand width in bits. A functional 
 memory based on an associative array composed of writeable storage 
 cells capable of holding three states (0, 1, and don't care) has been 
 proposed by Gardner [21]. Here the logic is performed by associative 
 table look-up, and the "don't care" state gives a significant 
 compression of the tables over conventional two-state arrays. 
 Typically, only n to n^2 words are necessary for functional memory 
 instead of 2^n words for conventional two-state arrays. In fact, such a 
 functional memory has been suggested as a nucleus and the building 
 block for the whole machine. 
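
 A minimal sketch of byte serial addition by table look-up is given 
 below, to show why a table indexed by one digit of each operand stays 
 small while a table indexed by whole operands does not. The digit 
 width k, the table layout (digit, digit, carry-in) and the function 
 names are assumptions of this sketch, not the organization of any of 
 the cited machines. 

```python
def build_add_table(k):
    """Table indexed by (digit_a, digit_b, carry_in) for k-bit digits."""
    mask = (1 << k) - 1
    table = {}
    for a in range(1 << k):
        for b in range(1 << k):
            for c in (0, 1):
                s = a + b + c
                table[(a, b, c)] = (s & mask, s >> k)   # (sum digit, carry out)
    return table


def serial_add(x, y, n, k, table):
    """Add two n-bit words one k-bit digit at a time, using look-ups only."""
    mask = (1 << k) - 1
    result, carry = 0, 0
    for pos in range(0, n, k):               # least significant digit first
        a = (x >> pos) & mask
        b = (y >> pos) & mask
        digit, carry = table[(a, b, carry)]
        result |= digit << pos
    return result, carry
```

 In this sketch the table for 4-bit digits has 2^9 = 512 entries 
 whatever the word length, whereas a single table indexed by two whole 
 16-bit operands would need 2^32 entries. 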
 
 Lee, et al. [22] and Crane, et al. [23] have proposed a distribu- 
 ted logic memory structure which is suitable for LSI implementation. 
 Although they suggested this structure for nonarithmetical logic opera- 
 tions, arithmetic can be performed by the bit serial table look-up 
 method. But, this method is too slow when operated on scalar operands. 
 However, for vector operands the arithmetic operation proceeds simul- 
 taneously in parallel on all components of the vector operands, and thus 
 the inherent slowness of the bit serial table look-up method is masked by 
 this parallelism. Bit serial processing is used in Goodyear's STARAN
 computer [24].
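 The dependence of table size on digit width can be made concrete with a
 small sketch. The following Python fragment is only an illustration under
 assumed parameters (a 4-bit digit and the hypothetical names ADD_TABLE and
 lookup_add); it is not taken from any of the cited machines. It adds two
 operands digit-serially, consulting one table word per digit position.

```python
# Illustrative sketch only: digit-serial addition by table look-up.
# DIGIT_BITS, ADD_TABLE and lookup_add are assumed names, not from the thesis.
DIGIT_BITS = 4                  # assumed digit width n
BASE = 1 << DIGIT_BITS          # 2**n values per digit

# One word per (a_digit, b_digit, carry_in) combination, i.e.
# 2 * 2**n * 2**n words, each holding (sum_digit, carry_out).
ADD_TABLE = {
    (a, b, c): ((a + b + c) % BASE, (a + b + c) // BASE)
    for a in range(BASE)
    for b in range(BASE)
    for c in (0, 1)
}


def lookup_add(x: int, y: int, num_digits: int) -> int:
    """Add two non-negative integers one digit at a time, least significant first."""
    result, carry = 0, 0
    for pos in range(num_digits):
        a = (x >> (pos * DIGIT_BITS)) & (BASE - 1)
        b = (y >> (pos * DIGIT_BITS)) & (BASE - 1)
        s, carry = ADD_TABLE[(a, b, carry)]   # the only "arithmetic" is a look-up
        result |= s << (pos * DIGIT_BITS)
    # The final carry becomes one extra digit position.
    return result | (carry << (num_digits * DIGIT_BITS))
```

 With a 4-bit digit the table has 512 words; doubling the digit width to 8
 bits would already require 131,072 words, which is why a fully parallel
 table for a whole word is impractical.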
 
 
 1.2.3 New number system representation - One of the main obstacles 
 to the partition of currently existing arithmetic processors which use con- 
 ventional binary representation (radix complement or sign magnitude) into 
 identical subunits is the fact that the most significant digit behaves 
 differently than the rest of the digit positions. Radix complement/ 
 diminished radix complement notation causes the control and structure of 
 the most significant and least significant digits to be different from the 
 rest of the digits due to such things as carry-in to the least significant 
 digit, end-around carries and special circuitry for logical and arith- 
 metical shifts. These factors preclude the chaining of accumulator 
 modules to any desired length. Moreover, the radix or diminished radix 
 complement notation causes problems both in the multiplication process 
 (e.g., a correction factor) and in the division process. Sign magnitude 
 notation is nice for multiplication and division because the sign of the 
 result can be readily determined, but addition and subtraction need a 
 complicated sequential control algorithm for determination of sign of 
 the result. All this difficulty can be traced directly back to the 
 limitations imposed by the requirement for knowledge of sign and magni- 
 tude of the operands and the result. This knowledge, by definition, is 
 available on a word level rather than a digit level. So, it would be 
 preferable to have a number representation where each digit position 
 carries both magnitude and polarity information, unlike normal binary 
 where the most significant bit carries sign but no magnitude information 
 and other bits carry only magnitude but no sign information. This will 
 remove the above limitation since a priori or a posteriori knowledge of 
 
 
 the operand or result magnitude and polarities is not necessary. This 
 will make it possible to perform arithmetic on a digit ("stage" in a 
 machine) basis rather than on a number ("register" in a machine) basis. 
 That is, an arithmetic operation on corresponding digits of a pair (or 
 more) of numbers would become invariant with respect to the polarities 
 of two (or more) numbers in which they are separately imbedded. This 
 results in two independent but important implications: One, that a true 
 variable word length operation is completely practicable, permitting 
 modular construction in terms of quantity of digit positions; and two, 
 that simultaneous operations on multiple (two or more) operands are also 
 practicable, permitting modular constructions in terms of number of 
 operands. 
 
 The sign information with each digit position in a number can be 
 provided either implicitly or explicitly. In a positional weighted 
 number system, a negative radix implies [25] indirectly a sign associated 
 with each digit position (positive for odd positions and negative for 
 even positions, for integers). An example of explicit sign information
 with each digit is Avizienis' signed-digit number representation
 [26]. These two approaches can be utilized to design computational
 modules for each digit position, which can be used later on to perform
 arithmetic either in a purely combinational logic net (array) or with
 a sequential control algorithm. Shaipov [27] and Prangishvilli have
 proposed cellular arithmetic arrays using a basic computation module
 based on a minus-two adder system. Avizienis and Tung [28], [29] have
 proposed a universal arithmetic building element (ABE) to be used in
 a combinatorial logic net to perform arithmetic functions. Pisterzi [30]
 has utilized the explicit signed-digit representation to design a limited
 connection arithmetic unit with a central global control which provides
 the temporally sequential commands to the various modules to achieve
 the arithmetic functions.
 
 The negative base number system, while facilitating the addition/
 subtraction and multiplication processes at a digit level, makes the
 division process very complicated. In any restoring or non-restoring
 division algorithm [31], [32], [33], the signs of the partial remainder and
 the divisor are essential, and the negative base number representation
 does not lend itself to easy determination of the sign of an operand because
 one has to go through a counting process to know whether the most
 significant digit of the integer representation is in an even position
 or an odd position. Further, for faster addition/subtraction, one
 still needs carry look-ahead circuits.
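 The counting process mentioned above can be illustrated with a short
 sketch. The Python fragment below assumes radix -2 and integer operands;
 the function names are hypothetical and chosen only for this illustration.
 The sign of a number is found by locating the most significant nonzero
 digit and counting its position from the right (positions numbered from 1).

```python
# Illustrative sketch only (radix -2, integers); names are assumptions.

def to_negabinary(value: int) -> list[int]:
    """Digits (0 or 1) of value in radix -2, most significant digit first."""
    if value == 0:
        return [0]
    digits = []
    while value != 0:
        value, rem = divmod(value, -2)
        if rem < 0:                 # force the remainder into {0, 1}
            value, rem = value + 1, rem + 2
        digits.append(rem)
    return digits[::-1]


def from_negabinary(digits: list[int]) -> int:
    """Evaluate a radix -2 digit string (most significant digit first)."""
    value = 0
    for d in digits:
        value = value * -2 + d
    return value


def negabinary_sign(digits: list[int]) -> int:
    """Sign from the position of the leading nonzero digit: odd -> +, even -> -."""
    for index, d in enumerate(digits):
        if d != 0:
            position = len(digits) - index   # 1 = least significant position
            return 1 if position % 2 == 1 else -1
    return 0
```

 No single digit carries the sign; the scan over the digit string is the
 word-level step that complicates division in this representation.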
 
 Avizienis' proposed number representation, besides being signed-
 digit, is also redundant; i.e., each digit position can take more
 than r values, where r is the radix of the number representation. This
 number system has many desirable features, namely,

 i) The algebraic value Z of the number composed of n + m + 1
 digits $(z_{-n} \cdots z_{-1} z_0 . z_1 \cdots z_m)$ is given by the
 expression:

 $$Z = \sum_{i=-n}^{m} z_i r^{-i}$$

 ii) Algebraic value Z = 0 if and only if all $z_i = 0$.
 iii) The sign of the algebraic value Z is given by the sign of
 the most significant (left-most) nonzero digit.
 iv) To form the representation of the additive inverse -Z, the
 sign of every nonzero digit $z_i$ is changed individually.
 v) The addition and subtraction of two signed-digit operands
 Z and Y satisfy $s_i = f(z_i, y_i, z_{i+1}, y_{i+1})$ for all posi-
 tions i, where $s_i$ are digits in the representation of the
 sum or difference $S = Z \pm Y$. This means that there are no
 carry propagation chains in signed-digit additions (or
 subtractions).
 vi) The same logic that is used for adding two numbers (maxi-
 mally redundant) can be used to convert the number from
 conventional binary representation to the signed-digit
 format.
 vii) It allows limited inspection of partial remainder digits
 to determine the quotient.
 The properties (iv) to (vi) make the signed-digit redundant number
 representation very suitable for digit-wise operation of the arithmetic
 unit. Property (iii) obviates the need for any complement arithmetic
 operations.
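 Properties (iii) to (v) can be checked on a small example. The sketch
 below is a hedged illustration, not the logic developed in this thesis: it
 assumes radix 10 with digits in {-6, ..., 6}, integer operands written most
 significant digit first, and the usual two-step rule (interim sum plus a
 transfer from the right neighbor). All names are hypothetical.

```python
# Illustrative sketch only. Assumed parameters: radix R = 10, digit set
# {-A, ..., A} with A = 6, which satisfies (R + 1) / 2 <= A <= R - 1.
R = 10
A = 6


def sd_value(digits):
    """Algebraic value of a signed-digit integer, most significant digit first."""
    value = 0
    for d in digits:
        value = value * R + d
    return value


def sd_negate(digits):
    """Property (iv): negate by changing the sign of every digit."""
    return [-d for d in digits]


def sd_sign(digits):
    """Property (iii): sign of the most significant nonzero digit."""
    for d in digits:
        if d != 0:
            return 1 if d > 0 else -1
    return 0


def sd_add(z, y):
    """Property (v): each sum digit depends only on positions i and i+1."""
    assert len(z) == len(y)
    n = len(z)
    transfer = [0] * n   # transfer[i] goes from position i to position i-1
    interim = [0] * n
    for i in range(n):
        w = z[i] + y[i]
        if w >= A:
            transfer[i], interim[i] = 1, w - R
        elif w <= -A:
            transfer[i], interim[i] = -1, w + R
        else:
            interim[i] = w
    # s_i = interim_i + transfer arriving from the right neighbour (i + 1).
    s = [interim[i] + (transfer[i + 1] if i + 1 < n else 0) for i in range(n)]
    return [transfer[0]] + s   # the leftmost transfer becomes a new digit
```

 For example, sd_add([3, -6, 5], [4, -6, 6]) gives [1, -4, -1, 1], i.e.
 245 + 346 = 591, and no transfer travels further than one position.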
 
 Based on the above number representation, Avizienis [28] proposed his
 Arithmetic Building Element (ABE), which has the capability of adding two
 digits of the two operands to be added, forming the product of two digits
 and also of forming a sum of many digits, one from each different operand,
 besides having the capability of achieving logical operations on in-
 dividual digits. He proposed this element for use in combinational
 arrays for forming the product of two numbers. But since the ABE can
 form a sum of only $m \le r+1$ digits, it becomes necessary to partition
 the product of two numbers where the multiplier is greater than r+1
 digits long into groups of r+1 digits so that the same kind of ABE can
 be used to form the whole product. Secondly, the proposed ABE is too
 complex for any reasonable radix r. Thirdly, the combinational net
 for the division process is very complex and expensive.
 
 For digit-wise arithmetic operations, mention should also be made 
 of residue number representation [34] which allows addition/subtraction 
 and multiplication on a digit basis. However, handling of overflow/ 
 underflow, conversion of conventional binary representation to residue 
 number representation, and, of course, the division process are very 
 complicated, and that is why not many computers have been built based 
 on this number representation. Moreover, because the modulus for each
 digit position is different, all the digit modules are necessarily
 different, which is incompatible with the LSI constraint of few module types.
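 A brief sketch may help to show both the attraction and the difficulty of
 residue arithmetic. The Python fragment below assumes the moduli (3, 5, 7),
 chosen only for illustration, and hypothetical function names. Addition and
 multiplication act independently on each residue digit, but converting back
 and detecting overflow need word-level computation.

```python
# Illustrative sketch only; the moduli and all names are assumptions.
from math import prod

MODULI = (3, 5, 7)            # pairwise coprime; representable range 0..104


def to_residues(x: int) -> tuple:
    return tuple(x % m for m in MODULI)


def rns_add(a: tuple, b: tuple) -> tuple:
    # Each digit position works alone, but each uses a different modulus.
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a: tuple, b: tuple) -> tuple:
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_residues(res: tuple) -> int:
    """Convert back with the Chinese remainder theorem (a word-level step)."""
    big_m = prod(MODULI)
    total = 0
    for r, m in zip(res, MODULI):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)   # modular inverse, Python 3.8+
    return total % big_m
```

 Adding 60 and 60 in this system yields the residues of 15, since 120
 exceeds the range of 105; nothing in the digit-wise result signals the
 overflow.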
 
 1.3 Present Work 
 
 The goal of the present work is to formulate a set of desirable
 characteristics for an LSI-implementable Arithmetic Unit capable of the four
 basic operations of Addition, Subtraction, Multiplication and Division,
 to choose a suitable system and logical organization which comes close
 to meeting these desirable properties and finally to study the arithmetic
 and logic design of the arithmetic unit.

 From our discussion in Sections 1.1 and 1.2, the following set of
 characteristics for the Arithmetic Unit are considered suitable for its
 implementation in LSI or any batch fabrication process technology.
 
 i) The arithmetic unit should be partitionable on a bit slice
 or digit slice (for higher radix) basis, which means that we
 should be able to perform calculations on a digit-by-digit
 basis. All the digit processing modules should be identical
 so that a variable word length can be accommodated.
 ii) Purely combinational cellular arrays are too expensive for
 large operand lengths, especially when each cell is rather
 complex. Hence, the arithmetic function execution should
 be done by a time sequence of microinstructions. Further,
 to achieve a balance between the high cost of a purely com-
 binatorial array and the slow speed of completely sequential
 execution of microinstructions, some form of pipeline struc-
 ture should be employed so that when an arithmetic expression
 is evaluated, the various arithmetic operations can be over-
 lapped.
 iii) To avoid fan-out problems in case of large operand lengths,
 the various modules should have limited intercommunication
 with each other.
 iv) Each processing module should have local control and be
 autonomous as far as possible so that only a few
 microinstructions need to be issued by a central control to
 the modules, instead of a large number of separate control
 signals. This would cut down the number of external leads
 necessary on each module.
 v) The various microinstructions should be as simple as possible.
 vi) Each processing module must be consistent with the constraints
 of large scale integration insofar as total external pin count
 in the module is concerned, and the module itself should prefer-
 ably be made up of cells (identical logic repeated) when the
 cell consists of many gates.
 vii) Since the divide process by its very characteristic has to
 examine the most significant digits of the operands (dividend/
 partial remainder, and divisor) for the calculation of the
 quotient, the multiplication and addition/subtraction should
 also be performed as a right-directed process.
 This most significant digit first approach is consistent with the other
 arithmetic processes of operand normalization, mantissa overflow determination
 and the determination of the sign of the result because these processes
 inherently require the examination of the most significant digits of the
 operands to determine what additional processing is necessary.
 
 Many of the characteristics mentioned above are met by an Arithmetic
 Unit structure proposed by Pisterzi [30]. The Arithmetic Unit consists of
 modular processing elements called the Digit Processing Units (DPUs)† and
 a global control module called the Primitive Control Unit (PCU). The PCU
 does not broadcast control signals to all the DPUs; instead, the PCU
 communicates only with the most significant DPU as far as the issuance
 of microinstructions is concerned. The first DPU executes each micro-
 instruction and then passes it on to the second DPU, which again executes
 this microinstruction and passes it down to the next DPU, and so on.
 Thus, a sort of pipeline of microinstructions is established where the
 same sequence of microinstructions is executed in each DPU.

 † DPU and PCU are the terminology used by Pisterzi [30]. In the present
 thesis, the terms PE and MCU will be used for the Processing Element and the
 global control module respectively.
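 The flow of microinstructions through the chain can be sketched as
 follows. This Python fragment is a deliberately simplified illustration:
 it assumes that each microinstruction acts on a single digit without any
 information from neighboring units and that every hop takes one time step.
 The function and variable names are hypothetical.

```python
# Illustrative sketch only: microinstructions flowing through a linear chain.
# Assumption: each unit needs no information from its neighbours.

def run_pipeline(digits, program):
    """Run every microinstruction in every unit, one hop per time step.

    digits  : initial digit held by each unit (index 0 = most significant)
    program : list of (name, function) pairs issued in order by the control
    Returns the final digits and a log of (time, unit, name) events.
    """
    n = len(digits)
    state = list(digits)
    log = []
    for t in range(len(program) + n - 1):
        for i in range(n):
            j = t - i                 # microinstruction now located at unit i
            if 0 <= j < len(program):
                name, fn = program[j]
                state[i] = fn(state[i])
                log.append((t, i, name))
    return state, log


final, events = run_pipeline(
    [1, 2, 3],
    [("double", lambda d: 2 * d), ("increment", lambda d: d + 1)],
)
```

 Here the control issues one microinstruction per step to the first unit
 only; at time 1 the first unit is already executing "increment" while the
 second is still executing "double".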
 
 A simplified block diagram of such an arithmetic unit is shown in 
 Figure 1.1. 
 
 PCU -- DPU_1 -- DPU_2 -- ... -- DPU_{n-1} -- DPU_n
 
 Figure 1.1 Block Diagram of a Basic Model of Limited 
 Connection Arithmetic Unit 
 
 The present study concentrates on the design of the essential micro-
 instructions necessary for performing four basic arithmetic operations
 in such a structure, the logic design of the Processing Element, and the
 method of communication between the Processing Elements and the Data Main
 Memory for fetching and storing operands and results. The major part of this
 thesis reports on the logic design of the Processing Element and identi- 
 fies those parts of the Processing Element whose gate and pin complexity 
 are a function of the bit width of the Processing Element. This, in turn, 
 allows us to choose a suitable bit width for the processing module con- 
 sistent with the technology constraints and also to balance the costs 
 for the processing logic and the control logic of the Processing Element. 
 
 
 Chapter 2 describes briefly the system and logical organization and 
 mode of operation of the Arithmetic Unit. The major emphasis in this 
 chapter is on the logical structure of the Mantissa Processing Logic 
 (MPL) , the method of communication between the modules of the MPL and the 
 flow of microinstructions through them. This discussion provides the 
 necessary perspective for the material in later chapters. The flow of 
 microinstructions in the MPL is illustrated by a generalized example. 
 The chapter concludes with the definition and a brief description of a
 set of basic and elementary microinstructions which are sufficient to
 execute 'machine' arithmetic instructions such as Add and Multiply on two
 operands.
 
 Chapter 3 treats the arithmetic design of the Processing Element — 
 the basic module of the Mantissa Processing Logic. The arithmetic design 
 is described in terms of the implications of the particular structure of 
 the Mantissa Processing Logic on the required characteristics of the 
 number system, the number representation and the definition of a normal- 
 ized number. Finally, in Section 3.6 which is the major portion of this 
 chapter, we develop the definition and operational specification of a set of 
 five simple arithmetic microinstructions. These microinstructions cause 
 an arithmetic transformation of the data and are specified as such by an 
 arithmetic transfer function, wherever possible. The digit algorithm 
 for each arithmetic microinstruction is also given. 
 
 In Chapter 4, which is the largest chapter of this thesis, the logic 
 design of the major components of the Processing Element is given. The 
 major components are the register file for storage of active operands, 
 
 
 the Combinational Network for processing and the Control which generates 
 control signals to condition the Combinational Network. This chapter 
 also describes the actual format and code assignment for the twelve types 
 of microinstructions executed in a Processing Element. Finally, the 
 logic complexity of the Processing Element is calculated in terms of the 
 total number of gates and external leads required in the Processing 
 Element module as a function of the bit width of the module and the 
 redundancy ratio of the multiplier and quotient digit. 
 
 Chapter 5 describes how the Mantissa Processing Logic and the Data 
 Main Memory may communicate to fetch and store operands through an inter- 
 face whose behavior is somewhat analogous to that of a cache memory. 
 
 In Chapter 6, we show how the various microinstructions can be 
 combined into a sequence to be executed by the Processing Element modules 
 to perform a 'machine' arithmetic instruction like Floating Point Add, 
 Multiply, etc. 
 
 Summary and conclusions are given in Chapter 7. 
 
 Two appendices are included. Appendix A-1 gives the algebraic design
 of a digit recoder which changes the redundancy ratio of the digit from
 unity to $\le 2/3$. In Appendix A-2, we calculate the number of radix-2
 digits of the truncated operands that are necessary in the model division
 to determine one radix-2 quotient digit.
 
 
 2. ORGANIZATION AND OPERATION OF ARITHMETIC UNIT 
 
 2.1 Introduction 
 
 In order to put the discussion in the following chapters in proper 
 perspective, a brief description of the logical organization and method 
 of performing the processing is given. The method of processing is 
 illustrated by an idealized example in Section 2.4. The chapter closes 
 with an introductory description of the repertoire of only the essential 
 microinstructions which are executed by the processing logic. 
 
 2.2 Organization of the Arithmetic Unit 
 
 Figure 2.1 shows the global block diagram of the arithmetic
 unit. It consists of Mantissa Processing Logic, Exponent Processing
 Logic, Local Operand Memories (LOMM, LOEM) and an Arithmetic Control
 Unit. The Arithmetic Control Unit (ACU) consists of three parts: the
 Global Arithmetic Control Unit (GACU), the Mantissa Control Unit (MCU)
 and an Exponent Control Unit (ECU).
 
 The GACU acts as the interface between the arithmetic unit and 
 the rest of the computer. It receives the arithmetic instructions from 
 the central control of the computer, decodes them, and causes the Local 
 Operand Memory Control (LOMCO) to fetch the necessary operands from main
 memory, if they are not already present in LOMM. LOMCO provides the 
 LOMM address of the operands to the GACU which then issues the necessary 
 commands to the ECU and MCU for exponent and Mantissa processing and 
 coordinates their actions. After the processing is complete, it informs
 the central control of its status, along with any exceptional conditions
 that may have arisen during execution of the instruction.

 [Figure 2.1 Global Block Diagram of the Arithmetic Unit]
 
 The MCU converts the commands received from the GACU into the necessary
 microinstructions to be executed by the Mantissa Processing Logic. For
 example, the Multiply command is converted into a series of shift left
 multiplier, form multiple and add, and shift left accumulator. Also,
 it contains the overflow recoder logic and quotient determination logic,
 etc.
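 The expansion of a Multiply command into such a series can be sketched
 as a most significant digit first recurrence. The fragment below is only
 an illustration on integers with hypothetical names; the actual micro-
 instruction order, formats and fractional operand handling are those
 defined in later chapters, not the ones assumed here.

```python
# Illustrative sketch only: Multiply expanded into a repeated series of
# three steps, consuming multiplier digits most significant digit first.
# The step order inside one iteration is an assumption of this sketch.

def msd_first_multiply(multiplicand, multiplier_digits, radix=2):
    """Return (product, list of step names) for the given multiplier digits."""
    accumulator = 0
    steps = []
    for q in multiplier_digits:                  # most significant digit first
        accumulator *= radix                     # make room for the next multiple
        steps.append("shift left accumulator")
        accumulator += q * multiplicand          # q may be a signed digit
        steps.append("form multiple and add")
        steps.append("shift left multiplier")    # expose the next multiplier digit
    return accumulator, steps
```

 With multiplicand 13 and multiplier digits [1, 0, 1, 1] (eleven in
 binary) the accumulator passes through 13, 26, 65 and 143. Signed
 multiplier digits work unchanged: [1, 0, -1] represents 3 and gives 39.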
 
 The ECU performs the necessary control for exponent arithmetic, such as
 calculating the difference of the exponents for the addition and subtraction
 instructions, the sum of the exponents for the multiplication instruc-
 tion, and detecting exponent overflow and underflow conditions.
 
 In this thesis, we shall be concerned mainly with the detailed 
 design of the Mantissa Processing Logic and its communication with the 
 Local Operand Mantissa memories. The detailed design of GACU, MCU and 
 ECU is beyond the scope of this research. The next section describes 
 the logical organization of the Mantissa Processing Logic and a descrip- 
 tion of the method of processing. 
 
 2.3 Organization of Mantissa Processing Logic 
 
 The Mantissa Processing Logic consists of a linear cascade of 
 identical Processing Elements (PEs) . Each PE is a complex logical 
 module and contains logic to perform the various microinstructions, 
 issued by the Mantissa Control Unit (MCU), in cooperation with other PEs. 
 The MCU communicates only with the most significant PE (closest to the 
 
 
 MCU) and the microinstructions flow serially (in a pipelined manner) 
 from the most significant PE to the least significant PE. 
 
 Figure 2.2 shows the schematic organization of the Mantissa Process- 
 ing Logic along with the MCU. This figure also shows an End Unit which 
 is optional and not intrinsically necessary for the arithmetic process- 
 ing. The End Unit allows the last PE to be identical to all the other 
 PEs as far as interface is concerned, thus causing it to operate as 
 though it had another PE to its right. Moreover, it could contain some 
 logic in which the operand digits shifted off the right end could be 
 temporarily stored for improving the accuracy of the result [35] . 
 
 The PEs collectively contain the fractional (Mantissa) parts of 
 all active operands, one digit in each PE, as shown in Figure 2.3. 
 Because the quotient generation and operand normalization processes 
 require the examination of most significant digits, the operands are 
 placed in the PEs so that the digits of each of the operands are avail- 
 able to the microinstructions in order of decreasing significance. Thus, 
 the most significant digits of the active operands are placed in the PE 
 which communicates with the MCU. 
 
 Each PE performs the same sequence of microinstructions. A given
 microinstruction is not executed by all PEs in synchro-parallelism but
 rather must be executed by them in sequence (i.e., first by PE$_1$, then
 PE$_2$, ...). Note that this is different from a conventional pipeline
 organization in which data flows in sequence through a number of stages
 which, in general, perform different operations on the data. In this
 organization, however, data is relatively constant and flowing microinstructions
 
 tell a PE what operation to execute on the data resident in that PE at
 that instant of time.

 [Figure 2.2: schematic organization of the Mantissa Processing Logic,
 showing the MCU at the most significant end, the linear cascade of PEs,
 and the optional End Unit at the least significant end]

              PE_1   PE_2   PE_3   ...   PE_n
 A operand    a_1    a_2    a_3    ...   a_n
 M operand    m_1    m_2    m_3    ...   m_n
 ...
 Z operand    z_1    z_2    z_3    ...   z_n

 $A = \sum_{i=1}^{n} a_i r^{-i}$, $M = \sum_{i=1}^{n} m_i r^{-i}$, etc.,
 where r is the radix.

 Figure 2.3 The Distribution of Operands Digits in the PEs of
 Mantissa Processing Logic.
 
 During processing, each PE physically communicates only with its
 immediate neighbors. To execute a microinstruction, a given PE may need
 information from its right neighbor. This information logically may
 depend on the contents (active operand digits) of its neighboring PEs,
 depending on the nature of the microinstruction. So we may say that
 each PE physically communicates with only one PE to its immediate right
 but from a logical viewpoint, the PE communicates with more than one
 PE.† In the following discussion of the mode of processing, when we
 talk about information required by a PE from its right neighbors, we
 mean the information requirement in the logical sense.
 
 As mentioned earlier, a given microinstruction is executed by PEs
 not in synchro-parallelism but rather in sequence. As soon as all the
 PEs (say $\alpha_{j+1}$) which contain information required by PE$_i$ to perform
 microinstruction j+1 (referred to as $\mu_{j+1}$) have executed $\mu_j$ and have
 sent the required information to PE$_i$, $\mu_{j+1}$ may be performed by PE$_i$.
 The microinstructions, executed by PEs, are defined in such a way that they
 have regular data requirements independent of the position of the PE in which
 a microinstruction is executed so that as each additional PE executes $\mu_j$,
 one more PE may execute $\mu_{j+1}$.†† The microinstructions may be viewed as
 flowing through successive PEs. Clearly, the PE registers do not contain
 entire operands as long as any of the PEs are actively executing micro-
 instructions. Each PE contains the digits from the results of the last
 microinstruction executed. (In the worst case, if there are n PEs and
 each PE has the capability of storing n active operands, there could be
 n active operands in different stages of processing if there is a
 sequence of n load or store microinstructions.)

 † The logical communication could be converted into physical communi-
 cation by duplicating the necessary hardware logic in the PE where that
 information is required but this would increase the number of intercon-
 nections.

 †† There is an exception to this rule in the case of the Assimilation
 Recode (AR) microinstruction, in which case $\alpha_j$ is variable and depends
 on the nature of the data resident in the PEs. This is further explained
 later in Section 4.3.2.3.3.
 
 2.4 Formal Description of Processing in a PE 
 
 The processing performed by the PEs can be described by the follow-
 ing:

 Let

 ${}_jX_i = \psi_j({}_{j-1}X_i,\ {}_jF_{i-1},\ {}_jG_{i+1})$   (2.1)

 ${}_jF_i = \phi_j({}_{j-1}X_i,\ {}_jF_{i-1})$ and   (2.2)

 ${}_jG_k = \Gamma_j({}_jF_{k-1},\ {}_{j-1}X_k,\ {}_{j-1}X_{k+1},\ \ldots,\ {}_{j-1}X_{k+\alpha_j-1})$   (2.3)

 where

 ${}_jX_i$ is the operand information contained in the i-th PE immedi-
 ately following the execution of $\mu_j$. It consists of the
 i-th digit of each of the active operands. $\mu_j$ represents the
 j-th microinstruction,

 $\psi_j$ is the function employed to obtain the new operand set and
 is dependent on the microinstruction to be performed,

 ${}_jF_i$ is a 'modifier' value which PE$_i$ transmits to PE$_{i+1}$, with
 the microinstruction j, to be performed next,

 $\phi_j$ is the function which each PE performs to determine ${}_jF_i$,

 $\Gamma_j$ is the function PE$_k$ employs to determine ${}_jG_k$,

 ${}_jG_k$ is the value which PE$_k$ transmits to the PE executing $\mu_j$,
 and

 $\alpha_j$ is one more than the number of PEs which must logically
 cooperate with the right neighbor of the PE performing $\mu_j$ in
 order to generate the necessary ${}_jG_k$.

 The information ${}_jG_k$ is generated in a time sequential fashion. ${}_jG_k$
 consists of $\alpha_j$ components ${}_jG_k^0, {}_jG_k^1, \ldots, {}_jG_k^{\alpha_j-1}$ and
 they are given by the following relations.

 ${}_jG_k^0 = \Gamma_j^0({}_jF_{k-1},\ {}_{j-1}X_k)$

 ${}_jG_k^1 = \Gamma_j^1({}_jF_{k-1},\ {}_{j-1}X_k,\ {}_jG_{k+1}^0)$

 $\vdots$   (2.4)

 ${}_jG_k^{\alpha_j-1} = \Gamma_j^{\alpha_j-1}({}_jF_{k-1},\ {}_{j-1}X_k,\ {}_jG_{k+1}^0,\ \ldots,\ {}_jG_{k+1}^{\alpha_j-2})$

 ${}_jG_k = \{ {}_jG_k^0,\ {}_jG_k^1,\ \ldots,\ {}_jG_k^{\alpha_j-1} \}$

 The superscript on ${}_jG_k$ indicates the time order of sequential generation
 of G-information.
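 The distributed generation of G-information can be sketched recursively.
 The fragment below is a hedged illustration in Python: the function names
 and the particular choice of component function are hypothetical, and the
 component function merely records which PEs a component depends on, rather
 than performing any real arithmetic.

```python
# Illustrative sketch only: sequential generation of the G components.
# f_prev(i) stands for the modifier F from PE i, x_prev(i) for the operand
# information X held in PE i before the microinstruction; gamma(lam, f, x,
# right) plays the role of the lam-th component function. All names are
# assumptions made for this sketch.

def g_components(k, alpha, f_prev, x_prev, gamma):
    """Return the list [G^0_k, ..., G^(alpha-1)_k] for PE k."""
    if alpha == 0:
        return []
    # Components G^0 .. G^(alpha-2) produced by the right neighbour PE k+1.
    right = g_components(k + 1, alpha - 1, f_prev, x_prev, gamma)
    return [
        gamma(lam, f_prev(k - 1), x_prev(k), right[:lam])
        for lam in range(alpha)
    ]


def track_dependencies(lam, f_value, x_value, right_components):
    """Toy component function: the set of PE indices the component depends on."""
    deps = {x_value}
    for component in right_components:
        deps |= component
    return frozenset(deps)


components = g_components(
    3, 3,
    f_prev=lambda i: None,      # the modifier is ignored in this toy example
    x_prev=lambda i: i,         # "operand information" is just the PE index
    gamma=track_dependencies,
)
```

 For k = 3 and three components, the successive components depend on the
 PE sets {3}, {3, 4} and {3, 4, 5}: each later component reaches one PE
 further to the right, while PE 3 itself only ever talks to PE 4.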
 
 Another formulation, which is applicable only for a fixed value of $\alpha_j$,
 is given by Pisterzi [30]. In this formulation, the PE executing micro-
 instruction $\mu_j$ gets G-information directly from the $\alpha_j$ PEs to its
 immediate right. The trade-off between the two is that the former needs fewer
 connections between PEs and also less logic in each PE, since the
 G-information is developed in a distributed fashion in the $\alpha_j$ PEs.
 However, this is obtained at the expense of more complex control and longer
 time delay.
 The operation of a typical PE, PE$_i$ say, is as follows. It begins
 in a state in which it is receptive to information defining the next micro-
 instruction to be performed. PE$_i$ receives this information (microinstruction)
 and the value of ${}_jF_{i-1}$ from its left neighbor PE$_{i-1}$. Then PE$_i$
 determines ${}_jG_i$, the information required by PE$_{i-1}$ to complete
 microinstruction $\mu_j$. ${}_jG_i$ is determined sequentially as described by
 the set of relations in (2.4). The component ${}_jG_i^0$ is developed
 immediately. At the same time, PE$_i$ determines ${}_jF_i$ by performing
 equation (2.2) (which incidentally is the same as ${}_jF_{i-1}$ in most cases)
 and transmits the identity of $\mu_j$ along with ${}_jF_i$ to PE$_{i+1}$. At this
 time, PE$_{i+1}$ generates ${}_jG_{i+1}^0$ and transmits it back to PE$_i$ so that
 PE$_i$ may generate ${}_jG_i^1$. Simultaneously, it (PE$_{i+1}$) transmits the
 identity of the $\mu_j$ instruction along with ${}_jF_{i+1}$ to PE$_{i+2}$, which
 repeats the same process. Note that the information ${}_jG_i$ depends on
 ${}_jG_{i+\alpha_j-1}^0$, which must trickle back to PE$_i$. Although this takes
 quite some time, ${}_jG_{i+1}^{\alpha_j-1}$ can be generated by PE$_{i+1}$ just
 one time step later. Initial setup time is large, however.

 As soon as PE$_i$ transmits ${}_jG_i$ to PE$_{i-1}$, PE$_{i-1}$ can complete the
 execution of microinstruction $\mu_j$. After some time, PE$_i$ receives a
 signal from PE$_{i-1}$ indicating that PE$_{i-1}$ has executed $\mu_j$. PE$_i$ then
 executes $\mu_j$ (the necessary ${}_jG_{i+1}$ being ready by now). PE$_i$ now
 transmits a signal to PE$_{i+1}$ which indicates that PE$_{i+1}$ may execute
 $\mu_j$. When PE$_i$ receives an acknowledgement from PE$_{i+1}$, it goes into a
 state where it is receptive to information concerning $\mu_{j+1}$. The sequence
 above then repeats.
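 The timing consequences of this handshake can be explored with a
 simplified model. The Python sketch below is an assumption-laden
 illustration, not a transcription of the protocol: it assumes that every
 transfer between neighboring PEs takes one time step, that a micro-
 instruction with parameter alpha must wait until the alpha PEs to the
 right have executed the previous microinstruction and the resulting
 G-information has trickled back alpha steps, and it ignores effects at the
 right end of the chain. The function name is hypothetical.

```python
# Illustrative sketch only: a simplified timing model of the PE chain.
# Assumed rule: start(j) = start(j-1) + 2 * alpha_j + 1, with one further
# step per PE as the microinstruction moves to the right.

def execution_times(alphas, num_pes):
    """times[j][i] = step at which PE i+1 executes microinstruction j+1."""
    times = []
    previous_start = 0
    for alpha in alphas:
        start = previous_start + 2 * alpha + 1
        times.append([start + i for i in range(num_pes)])
        previous_start = start
    return times


schedule = execution_times([2, 1, 0, 1, 2], num_pes=5)
first_pe_steps = [row[0] for row in schedule]
```

 Under these assumptions a microinstruction that needs no G-information
 follows its predecessor after a single step, while each additional PE of
 look-ahead costs two further steps at the first PE.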
 
 
 2.5 Generalized Example 
 
 To illustrate how the processing of several microinstructions may 
 take place concurrently in the Mantissa Processing Logic, each by a dif- 
 ferent PE, we describe below a generalized example. This example is 
 borrowed from Pisterzi [30] but the necessary changes have been made to 
 conform to our notation. 
 
 Table 2.1 shows the $\alpha_j$ for the various microinstructions for the
 generalized example. The Mantissa Processing Logic will have five PEs
 and one operand. This operand will be indicated as composed of five
 digits $a_1, \ldots, a_5$ such that digit ${}_ja_i$ is the digit contained in PE$_i$
 after the j-th microinstruction.

 Table 2.1
 $\alpha_j$ of the Microinstructions
 of the Example of Figure 2.4.

 j            1   2   3   4   5   6
 $\alpha_j$   2   1   0   1   2   0
 
 The operation of the Mantissa Processing Logic is presented in a 
 
 tabular form in Figure 2.4. The columns labeled will indicate the 
 
 operand contained in PE . The occurrence of a. in the i-th operand 
 
 column will indicate that .a. has just been computed and placed in the 
 
 operand register of PE . The columns labeled IR. will indicate the 
 
 microinstruction being executed by PE . and/or the G-information being 
 
 produced by PE . . The occurrence of ^ in the IR. column will be used to 
 i j i 
 
 [Figure 2.4 Illustration of the Execution of the Generalized Example in
 Mantissa Processing Logic. Rows: time steps 0 to 20. Columns:
 microinstruction registers IR_1 to IR_5 and operand registers of PE_1
 to PE_5.]

 
 denote that PE$_i$ has just received the identity of the j-th microinstruction
 and will begin determining ${}_{j}G_{i}$ in a time-sequential fashion. The appearance
 of ${}^{\lambda}_{j}G_{i}$ in the IR$_i$ column indicates that PE$_i$ has just determined the
 $\lambda$-th component of the G-information which is needed by PE$_{i-1}$. $\lambda$ ranges
 from 0 to $\alpha_j - 1$. (In our example, $0 \le \lambda \le 1$.) The occurrence of a
 circled $\mu_j$ will represent that the execution of microinstruction $\mu_j$ has
 just been completed by PE$_i$ (and the result operand digit ${}_{j}a_{i}$ has been
 generated, as indicated by the appearance of ${}_{j}a_{i}$ in the i-th operand
 column). The progression of time will be indicated by the rows, each row
 equivalent to the time required by a PE to execute one step of processing.
 
 Figure 2.4 shows the Mantissa Processing Logic in steady state at
 time 0. No microinstructions are being executed and the operand $A$
 (${}_{0}a_{1}, {}_{0}a_{2}, {}_{0}a_{3}, {}_{0}a_{4}, {}_{0}a_{5}$) is in the operand register. The
 processing proceeds as follows.

 We assume that at time 3, the identity $\mu_1$ of microinstruction 1 has
 reached PE$_3$. PE$_3$ calculates ${}_{1}G_{3}$ and sends it to PE$_2$ (Figure 2.5a).
 At time 4, PE$_2$ calculates ${}_{1}G_{2}$ and sends it to PE$_1$. At time 5, all the
 G-information required for execution of $\mu_1$ ($\alpha_1 = 2$) is available in PE$_1$ and
 $\mu_1$ is executed by PE$_1$. This causes ${}_{0}a_{1}$ to be replaced by ${}_{1}a_{1}$. During
 the next four time intervals, $\mu_1$ is performed consecutively by each of
 the remaining PEs since ${}_{1}G_{i}$ becomes available just as it is required by
 PE$_{i-1}$ to perform $\mu_1$. The identity $\mu_2$ of the second microinstruction is
 received by a PE one time unit after that PE performs $\mu_1$. Since PE$_1$
 requires ${}_{2}G_{2}$ to execute $\mu_2$ ($\alpha_2 = 1$), this microinstruction is not performed
 by PE$_1$ until time 8, one time step after PE$_2$ is able to determine this
 value and send it to PE$_1$ (Figure 2.5b). Just as with $\mu_1$, $\mu_2$ is executed
 
 [Figure 2.5 (a), (b): processing of microinstructions $\mu_1$ and $\mu_2$;
 the rotated figure and its caption are not recoverable.]

 
 sequentially by each of the remaining PEs during each of the next four
 time intervals. Microinstruction $\mu_3$ is performed by each of the PEs one
 time unit after each PE has performed $\mu_2$ because $\alpha_3 = 0$ and no outside
 information is required. The other microinstructions are performed in
 the same pattern.

 In general, PE$_i$ performs $\mu_j$ $2\alpha_j + 1$ time units after it executes
 $\mu_{j-1}$. The time $T_{Em}$ elapsed between the instant when the
 identity of the first microinstruction reaches PE$_1$ and the instant of
 execution of the m-th microinstruction (of a set of consecutively
 issued microinstructions) by the first PE is given by

 $T_{Em} = \sum_{j=1}^{m} 2\alpha_j + m.$
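
 The timing rule above can be checked numerically. The following is a
 minimal sketch, not part of the original design: the function names are
 hypothetical, time 0 is taken to be the instant at which the identity of
 the first microinstruction reaches PE_1, and the values
 alpha = (2, 1, 0, 1, 2, 0) are an assumption (the first three are stated
 in the text, the last three are inferred from the execution times visible
 in Figure 2.4). Each remaining PE is assumed to execute a microinstruction
 one step after its left neighbor, as the walk-through describes.

```python
from itertools import accumulate


def first_pe_times(alphas):
    """Return T_Em for m = 1..len(alphas), i.e. sum_{j<=m} (2*alpha_j + 1)."""
    return list(accumulate(2 * a + 1 for a in alphas))


def schedule(alphas, n_pe):
    """Row j-1 lists the step at which PE_1..PE_n execute microinstruction j.

    Assumes PE_(i+1) executes one step after PE_i, as in the walk-through.
    """
    return [[t + i for i in range(n_pe)] for t in first_pe_times(alphas)]


# Assumed alpha values for the six microinstructions of the example.
ALPHAS = [2, 1, 0, 1, 2, 0]

if __name__ == "__main__":
    for j, row in enumerate(schedule(ALPHAS, n_pe=5), start=1):
        print(f"mu_{j}: executed by PE_1..PE_5 at steps {row}")
```

 Under these assumptions PE_1 executes the six microinstructions at steps
 5, 8, 9, 12, 17 and 18, which agrees with the times quoted in the text for
 the first three microinstructions.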
 
 2.6 The Micro-Instruction Repertoire of the PEs 
 
 In this section, we will discuss briefly the microinstructions 
 which are executed by the PEs so that the overall arithmetic unit is 
 able to do addition, subtraction, multiplication, division and normali- 
 zation. The microinstructions may be broadly categorized in five 
 classes for the purposes of this discussion. These five classes are: 
 
 1. the inter-register transfers, 
 
 2. the shift microinstructions, 
 
 3. the arithmetic microinstructions, 
 
 4. the memory accessing microinstructions, and 
 
 5. the miscellaneous microinstructions. 
 
 2.6.1 The inter-register transfer microinstructions - These micro- 
 instructions cause operands to be transferred from one internal register 
 of the PE to another internal register. There are two instructions in 
 this class: Transfer Direct (TD) and Transfer Invert (TI) . The micro- 
 instruction TD moves the contents of one register in the PE to another 
 register, both the registers being specified explicitly in the micro- 
 instruction, with no changes in the source operand. The microinstruc- 
 tion, TI, on the other hand, causes the transfer of operands from source 
 to destination register, with the sign of the source operand being in- 
 verted, that is, changed to opposite polarity. 
 
 The microinstruction TD allows the results of one instruction to be 
 stored temporarily into another local register before being used as an 
 operand in the execution of some later microinstruction, thus avoiding 
 a memory reference. A second application of this microinstruction is in 
 the exchange of operands when normalization is required. As will be
 seen later, because the Normalization Recode (NR) and Assimilation
 Recode (AR) microinstructions require the operand to be in the
 Accumulator register, assimilation or normalization of an operand held
 in another register requires microinstruction TD to move the operand to
 the Accumulator register.
 
 The main use of the microinstruction TI occurs when one needs to 
 change the sign of an operand before being used, e.g., in the case of 
 subtraction. Since the PE has only an 'add' microinstruction, it is 
 necessary to invert the sign of the operand before being 'added' to 
 another operand to cause subtraction. Note that, in this microinstruction, 
 
 the source and destination register addresses can be the same. This 
 microinstruction can thus be used, if necessary, for getting the 
 absolute value of an operand. 
 
 In all the inter-register transfers, all of the data required by a
 PE to perform the microinstruction is contained within that PE itself,
 as can be seen in Figure 2.3: each PE contains one digit of each of the
 operands. Therefore the value of $\alpha$, the number of PEs which must
 logically cooperate with the PE executing the inter-register transfer
 microinstruction, is zero, and ${}_{j}F_{i}$ is not required to transmit data.
 The value of ${}_{j}F_{i}$ is used instead to identify both the registers taking
 part in the transfer. The exact format of the microinstructions TD and TI
 is discussed in Section 4.3.2.2.
 
 In the notation of Section 2.4, the inter-register transfer micro-
 instructions may be expressed as:

 ${}_{j}x_{i} = {}_{j-1}y_{i}$          i = 1, 2, ..., n      (2.5)

 ${}_{j}F_{i} = {}_{j}F_{i-1}$          i = 1, 2, ..., n      (2.6)

 ${}_{j}G_{i} = \langle\text{null}\rangle$          i = 1, 2, ..., n      (2.7)

 where

 Y is the register to be copied into the X register,
 ${}_{j}x_{i}$ is the i-th digit of the X register after the transfer,
 ${}_{j-1}y_{i}$ is the i-th digit of the Y register before the transfer, and
 $\langle\text{null}\rangle$ indicates that the value of ${}_{j}G_{i}$ is not required when
 performing inter-register transfers.
 
 2.6.2 The shift microinstructions - These microinstructions are 
 used during radix point alignment prior to addition or subtraction, for 
 normalization, and for multiplication and division by the radix during 
 the repetitive steps for multiplication and division. A shift of more 
 than one digital position is performed as a number of successive shifts 
 of one digital position each. 
 
 The left shift can be accomplished by causing the PE to the immed- 
 iate right of the PE performing the microinstruction to transmit the 
 value of the digit of the operand contained in its register to the PE 
 performing the microinstruction. This PE stores the digit it receives 
 in its operand register. The equations defining the left shift
 microinstruction, LS, are:

 ${}_{j}x_{i} = {}_{j}G_{i+1}$          i = 1, 2, ..., n      (2.8)

 ${}_{j}F_{i} = {}_{j}F_{i-1}$          i = 1, 2, ..., n      (2.9)

 ${}_{j}G_{i} = {}_{j-1}x_{i}$          i = 1, 2, ..., n      (2.10)

 ${}_{j}G_{n+1} = {}_{j}F_{n}$          if ${}_{j}F_{n}$ is a valid digit;
                                otherwise see text      (2.11)

 where

 X is the operand being shifted,

 ${}_{j}x_{i}$ is the i-th digit of the shifted operand,

 ${}_{j-1}x_{i}$ is the i-th digit of X before the shift,

 ${}_{j}F$ is the modifier value passed along with the microinstruction;
 it carries the address of the register to be shifted
 and the value of a digit sent by the MCU that is to go into
 the last PE. This is made use of in the execution of
 Multiplication.

 ${}_{j}F_{0}$ is the value that the MCU sends to PE$_1$ with the left shift
 microinstruction to indicate the value that is to go into the last PE.
 If ${}_{j}F_{0}$ is a valid digit, it becomes the digit shifted into the last PE.
 If it is not a valid digit, it causes the End Unit to shift in the digit
 shifted out during the last right shift.
 
 One should also note that the left shift microinstructions make it 
 possible to transmit the most significant digit of an operand to the 
 MCU. The left shift can therefore be used by the MCU to examine 
 operands. 
 
 The right shift (RS) microinstruction does not have the complexity 
 of the left shift microinstruction. The value stored into a PE is the 
 value transmitted by its left neighbor PE with the indication that a 
 right shift is to be performed. The value of the digit to be stored in 
 the first PE is determined by the MCU. In the terminology of Equations 
 2.1 through 2.3, 
 
 ${}_{j}x_{i} = {}_{j}F_{i-1}$          i = 1, 2, ..., n      (2.12)

 ${}_{j}F_{i} = {}_{j-1}x_{i}$          i = 1, 2, ..., n      (2.13)

 ${}_{j}G_{i} = \langle\text{null}\rangle$          i = 1, 2, ..., n      (2.14)

 where

 ${}_{j}F_{0}$ is the digit which the MCU transmits with the indication
 that a right shift is to be performed. This value becomes
 the value of the most significant digit of the shifted
 operand.
 
 The value of ${}_{j}F_{n}$, which is transmitted by PE$_n$ to the 'End Unit', is
 stored as the new top element in the push-down stack. The push-down stack
 is essentially an extended version of 'guard' digits.

 A final note concerning shifts is that the value of $\alpha_{LS} = 1$ and
 $\alpha_{RS} = 0$. The exact format of the microinstructions LS and RS is
 described in Section 4.3.2.2.
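
 The interplay of the two shifts with the End Unit can be sketched at the
 register level. This is an illustrative model only: the class name, the
 use of `None` as the "not a valid digit" marker, and the choice to shift
 in a zero when the push-down stack is empty are all assumptions, not
 details taken from the design.

```python
INVALID = None  # hypothetical marker for "F is not a valid digit"


class ShiftModel:
    """One operand register spread over n PEs, plus the End Unit stack."""

    def __init__(self, digits):
        self.x = list(digits)  # x[0] is held by PE_1 (most significant)
        self.stack = []        # End Unit push-down stack ('guard' digits)

    def right_shift(self, f0):
        """RS: PE_1 stores the MCU digit f0; the digit leaving PE_n is pushed."""
        self.stack.append(self.x[-1])
        self.x = [f0] + self.x[:-1]

    def left_shift(self, f0=INVALID):
        """LS: return the digit leaving PE_1 (the one the MCU can examine).

        The last PE receives f0 if it is a valid digit; otherwise the End
        Unit shifts in the digit saved by the last right shift
        (assumed to be 0 if the stack is empty).
        """
        if f0 is INVALID:
            fill = self.stack.pop() if self.stack else 0
        else:
            fill = f0
        out = self.x[0]
        self.x = self.x[1:] + [fill]
        return out


if __name__ == "__main__":
    reg = ShiftModel([1, -2, 3])
    reg.right_shift(0)
    print(reg.x, reg.stack)        # register and saved guard digit
    print(reg.left_shift(), reg.x)  # digit seen by the MCU, restored register
```

 A right shift followed by a left shift with an invalid modifier digit
 restores the original operand in this model, which is the behavior the
 push-down stack is described as providing.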
 
 2.6.3 The arithmetic microinstructions - The microinstructions in 
 this class are those instructions which do some sort of arithmetic 
 transformation on the operands. These microinstructions operate on one, 
 two or more than two operands, depending on the nature of the micro- 
 instructions. The various microinstructions in this class are: Form 
 Multiple and Add (FMA) , Simple Sum (SS) , Multiple Sum (MS), Assimilation 
 Recode (AR) and Normalization Recode (NR) . 
 
 The microinstruction FMA is used to form the product of a multiplier 
 (quotient) digit and a multiplicand (divisor) digit and add (subtract) 
 it to (from) the partial product (partial remainder) in the execution of 
 Multiplication (Division) instruction and is the most complex of all 
 microinstructions. 
 
 The microinstruction, SS, sums the contents of two registers and is 
 used to execute the Add or Subtract instructions. Although the micro- 
 instruction FMA could be used for this purpose, a separate microinstruction 
 SS was designed for faster operation, especially because the frequency of 
 addition or subtraction of two operands in a computer program is much 
 higher than multiplication or division. 
 
 The Multiple Sum microinstruction, MS, is used to add the contents 
 of more than two registers in a PE. This microinstruction is not in- 
 trinsically necessary for the operation of the arithmetic unit but 
 rather comes about as a useful by-product of the design of the logic 
 for microinstruction FMA. 
 
 The microinstruction NR operates on a single operand in the Accum- 
 ulator register only. It is used to recode the operand in a form which 
 when left-shifted one or more places meets the normalization definition. 
 
 Finally, the Assimilation Recode microinstruction, AR, is used to 
 convert the operand in the Accumulator register, from the number repre- 
 sentation used in the arithmetic processing, to the conventional form 
 for communication to memory or other parts of the computer system. 
 This microinstruction is very similar to microinstruction NR. 
 
 All the above microinstructions are discussed in detail in 
 Chapter 3. 
 
 2.6.4 The memory-accessing microinstructions - These microinstruc-
 tions cause the exchange of data between the internal registers of the
 PEs and a local buffer operand memory. They are used to fetch operands 
 into PEs for processing and to store the results for eventual trans- 
 mission to the main memory of the computer. The two microinstructions 
 are Load from Processor Memory (LPM) and Store into Processor Memory 
 (SPM) ; the former is used to bring operands into PE registers and the 
 latter causes the contents of a specified PE register to be stored into 
 a specified location of the local Operand Processor Memory. The 
 
 microinstructions in this class are similar to the inter-register transfer 
 microinstructions except that one of the source or destination address 
 refers to some location in the local Operand Processor Memory. In these 
 microinstructions also, the modifier ${}_{j}F_{i}$ is used to identify the source
 and the destination. The exact format of these microinstructions is
 discussed in Section 4.3.2.2. The communication of the PEs with the Data
 Main Memory via the local Operand Processor Memory is discussed in 
 Chapter 5. 
 
 2.6.5 The miscellaneous microinstructions - One instruction in this 
 class is Load Constant (LDC) . This microinstruction can be used to clear 
 the operand register by loading zeros in a specified register of all the 
 PEs in the arithmetic unit. It can also be used to initialize an 
 operand register spread across all the PEs to a pattern such that all 
 the digits are identical. An example of such a use could be the load-
 ing of the maximum value of operands. In the terminology of Section 2.4,

 ${}_{j}x_{i} = {}_{j}F_{i-1}$          i = 1, 2, ..., n      (2.15)

 ${}_{j}F_{i} = {}_{j}F_{i-1}$          i = 1, 2, ..., n      (2.16)

 ${}_{j}G_{i} = \langle\text{null}\rangle$          i = 1, 2, ..., n      (2.17)

 where ${}_{j}F_{0}$ is the digit which the MCU sends to PE$_1$ with the LDC
 microinstruction, together with the name of the register in which the
 constant is to be loaded.

 Clearly, $\alpha = 0$, which means that a PE needs no information from its
 right neighbor for the execution of this microinstruction. Note that
 in Equation 2.15, only the digit part of field ${}_{j}F_{i-1}$ is stored in
 register digit ${}_{j}x_{i}$.
 
 3. ARITHMETIC DESIGN AND IMPLEMENTATION CONSIDERATIONS 
 
 3.1 Introduction 
 
 This chapter describes the arithmetic design of the Processing Ele- 
 ment. Arithmetic Design consists of the choice of a suitable number 
 system, number representation, and the development of suitable digit level 
 algorithms. Serial processing in an iterative structure has important 
 implications on all of these factors and will be considered in this 
 chapter. Implementation of the digit algorithm and its implications for 
 LSI realization of the Processing Element are also discussed. 
 
 3.2 Implications of Serial Processing on Arithmetic Design

 From the description of processing in Section 2.4, it is evident that
 the results are obtained on a digit-by-digit basis. To achieve a compro-
 mise between the digit serial processing and the arithmetic speed, the
 arithmetic should be carried out in a higher radix, say $r = 2^k$ ($k > 1$),
 such that k bits of the result are obtained at each step.
 
 Since the processes of quotient generation, operand normalization, 
 mantissa overflow determination and the determination of the sign of the 
 result inherently require the examination of the most significant digits of 
 operands to determine what additional processing is necessary, arithmetic 
 algorithms should be so designed that the most significant digits of the 
 result are obtained first. The most-significant-digit-first (MSDF) 
 approach has the advantages of providing early status indication (over- 
 flow, sign of the result, etc.), normalization concurrent with processing 
 
 and early termination of processing as soon as enough significant digits 
 in the result have been obtained. The latter would allow faster variable 
 precision arithmetic in a digit serial environment. Early status indi- 
 cation would also aid in an instruction look-ahead unit. Further, the 
 MSDF approach allows the meshing-in (pipeline) of successive macroinstruc- 
 tions for efficient operation. For example, if a MULTIPLY instruction is 
 followed by a DIVIDE instruction, at some point in time, the least sig- 
 nificant digits of the product can be generated by a right directed 
 procedure in the least significant elements of the iterative structure, 
 while the most significant elements are generating quotient digits. 
 
 3.3 Choice of Number System

 For a smooth flow of microinstructions in the linear iterative
 structure and for maximizing the rate of computation, two constraints on
 $\alpha_j$† are necessary:

 a) The microinstructions should have regular data requirements
 independent of the significance of the digits retained by a PE. That is,
 $\alpha_j$ should be constant.

 b) The value of $\alpha_j$ should be as small as possible because the
 execution rate of a given microinstruction is inversely proportional
 to $\alpha_j$.

 In a conventional weighted number system, a carry or borrow into any
 digital position is a function of all the digits to the right of this
 position. Thus for MSDF algorithms, which are right directed, the
 conventional number system cannot be employed, because in it the value
 of $\alpha_j$ is a function of the significance of the digit itself.
 A redundant number system which gives a bounded value of $\alpha_j$ is clearly
 essential.

 † $\alpha_j$ is the number of PEs from which a given PE requires information
 (in the logical sense) in order to execute the microinstruction $\mu_j$.
 
 3.4 Choice of Number Representation and Amount of Redundancy 
 
 The major factors influencing the choice of the redundant number 
 representation and the amount of redundancy in the number system are 
 the following: 
 
 a) the ease of conversion from the conventional number representa- 
 tion to the redundant number representation, 
 
 b) its compatibility with the widely employed conventional binary 
 number system, 
 
 c) ease of normalization of operands to radix-2 limits, and 
 
 d) LSI technology constraints, namely 
 
 (i) minimization of the number of types of cells (in the arith-
 metic and logic sense) required for higher radix ($r = 2^k$)
 implementation of the digit processing logic, and
 (ii) minimization of the number of input and output pins.
 In this study, signed-digit redundant number representations with maximal 
 redundancy were chosen, because they satisfy most of these requirements. 
 
 3.4.1 Signed-digit number representations - Signed-Digit (SD) 
 representations are redundant positional representations. 
 
 A number X is represented, in radix-r, redundant, signed-digit
 format, as a digit vector (abbreviated as "d-vector") of length n + m + 1

 $X = x_{-m}\, x_{-(m-1)} \ldots x_{0}\,.\,x_{1}\, x_{2} \ldots x_{n-1}\, x_{n}$

 such that

 $X = \sum_{i=-m}^{n} x_i\, r^{-i}$

 where

 $x_i \in \{\bar{d}, \overline{(d-1)}, \ldots, \bar{1}, 0, 1, \ldots, (d-1), d\}$
 
 and

 $\lceil r/2 \rceil \le d \le (r-1)$.
 
 The overbar indicates negative values and unless otherwise specified, we
 shall be using rightward indexing in the d-vector representation. For
 maximally redundant, signed-digit number systems

 $d = r - 1$.

 That is, for a radix $r = 2^k$, each digit of the radix-r digit vector can
 assume any integer value in the digit set
 $\{\overline{(2^k-1)}, \ldots, \bar{1}, 0, 1, \ldots, (2^k-1)\}$.

 Some of the desirable properties of signed-digit representations are:

 1. Representation of zero is unique. The algebraic value of X is 0
 if, and only if, all $x_i = 0$.

 2. The additive inverse (negation) of an operand is very simply
 achieved by reversing the sign of every non-zero digit individually.

 3. The sign of the algebraic value of X is given by the sign of the
 most-significant (leftmost) non-zero digit.

 4. For the sum or difference of two signed-digit operands,
 $\alpha_j = 1$.
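
 The first three properties can be illustrated with a few lines of code.
 This is a sketch under stated assumptions: the helper names are
 hypothetical, the digit vector is taken to be a pure fraction
 .x_1 x_2 ... x_n, and exact rational arithmetic from the Python standard
 library stands in for the algebraic value.

```python
from fractions import Fraction


def sd_value(digits, r):
    """Algebraic value of the fraction .x_1 x_2 ... x_n in radix r."""
    return sum(Fraction(x, r ** i) for i, x in enumerate(digits, start=1))


def sd_negate(digits):
    """Property 2: negate by reversing the sign of every digit."""
    return [-x for x in digits]


def sd_sign(digits):
    """Property 3: sign of the leftmost non-zero digit (0 if all are zero)."""
    for x in digits:
        if x != 0:
            return 1 if x > 0 else -1
    return 0


if __name__ == "__main__":
    r = 4
    x = [0, 1, -3, 2]  # radix-4 signed digits in -3..3
    print(sd_value(x, r), sd_sign(x))
    print(sd_value(sd_negate(x), r), sd_sign(sd_negate(x)))
```

 Property 3 holds because the digits to the right of the leading non-zero
 digit can contribute at most $(r-1)\sum_{k>i} r^{-k} < r^{-i}$ in
 magnitude, so they can never outweigh it.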
 
 Maximal redundancy is compatible with the widely used sign-magnitude
 representation of conventional binary input operands. A binary number
 may be interpreted as a number of radix $r = 2^k$ by grouping the binary
 digits into groups of k bits each. Conversion from the conventional number
 system to signed-digit form is carried out simply by attaching the
 sign of the conventional number to each digit. Another important
 advantage is that the carry between bits of a digit has the same
 properties as the carry between digits, which is not the case in
 representations that are less than maximally redundant. This allows
 radix-2 arithmetic (for example, shifts) when necessary. From the LSI
 viewpoint, it allows a radix $r = 2^k$ arithmetic structure to be composed
 of k identical and simpler radix-2 substructures interconnected in a
 regular pattern. Maximal redundancy also provides more code-space
 patterns [36] for testing the radix-r module. This makes the design of a
 self-testing version of the module easier.
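
 The conversion described above can be sketched as follows. The function
 name and the input convention (a sign bit plus a list of magnitude bits,
 most significant first, whose length is a multiple of k) are assumptions
 made for the illustration.

```python
def binary_to_sd(sign, bits, k):
    """Group magnitude bits k at a time and attach the sign to each digit.

    sign: 0 for positive, 1 for negative.
    bits: magnitude bits of a binary fraction, most significant first.
    Returns the radix-2**k signed digits, most significant first.
    """
    if len(bits) % k != 0:
        raise ValueError("number of bits must be a multiple of k")
    s = -1 if sign else 1
    digits = []
    for start in range(0, len(bits), k):
        magnitude = 0
        for b in bits[start:start + k]:
            magnitude = 2 * magnitude + b
        digits.append(s * magnitude)
    return digits


if __name__ == "__main__":
    # -0.101101 (binary) = -45/64; in radix 4 the groups are 10, 11, 01.
    print(binary_to_sd(1, [1, 0, 1, 1, 0, 1], k=2))
```

 No digit-to-digit carry is involved, which is why the conversion costs
 essentially nothing in a maximally redundant system.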
 
 Two modes of representation for a signed digit of the radix-$2^k$
 d-vector are used, depending on the area of application:

 a) Sign-Magnitude (SM) Mode - Each radix-$2^k$ digit $x_i$ is represented
 by a single sign bit $s_i$ and k magnitude bits $x_{i_j}$ (j = 0, 1, ..., k-1)
 such that

 $x_i = (-1)^{s_i} \sum_{j=0}^{k-1} x_{i_j}\, 2^{j}, \qquad s_i, x_{i_j} \in \{0, 1\}$

 b) Redundant-Binary (RB) Mode - Each radix-$2^k$ digit $x_i$ is repre-
 sented by k redundant binary digits $x^{*}_{i_j}$ (j = 0, 1, ..., k-1), such
 that

 $x_i = \sum_{j=0}^{k-1} x^{*}_{i_j}\, 2^{j}, \qquad x^{*}_{i_j} \in \{\bar{1}, 0, 1\}.$

 (Note that in the above representation of $x_i$ in terms of radix-2 sub-
 digits, we use zero-origin leftward indexing.)
 
 The SM mode requires k+1 binary storage elements (or k+1 pins as
 an output from the processing element) and the RB mode needs 2k binary
 storage elements (or pins) because each redundant binary digit requires
 two binary state elements. The SM representation for a radix-r digit is
 used for inter-PE communication to keep the number of external I/O pins
 small. The RB mode of representation is used for implementing digit
 algorithms (as will be seen). If each redundant-binary digit is expressed
 in sign and magnitude form, conversion from SM to RB mode is trivial
 and involves appending the single sign bit to each of the k magnitude
 bits. Conversion from RB to SM is less trivial, however, and involves
 recognition that the sign of the radix-$2^k$ digit is that of the most sig-
 nificant non-zero binary digit, followed by subtraction of the magnitudes
 of those digits of opposite sign from the magnitudes of those binary
 digits of the same sign.
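
 Both conversions can be sketched for a single radix-2^k digit. The
 function names are hypothetical; bit lists use the zero-origin leftward
 indexing mentioned above, so index 0 is the least significant position.

```python
def sm_to_rb(sign, mag_bits):
    """SM -> RB: attach the sign bit to each of the k magnitude bits."""
    s = -1 if sign else 1
    return [s * b for b in mag_bits]


def rb_to_sm(rb_digits):
    """RB -> SM for one digit whose redundant binary digits are in {-1, 0, 1}.

    The sign is that of the most significant non-zero binary digit; the
    magnitude is (weights of same-sign digits) - (weights of opposite-sign
    digits).
    """
    k = len(rb_digits)
    lead = next((d for d in reversed(rb_digits) if d != 0), 0)
    same = sum(2 ** j for j, d in enumerate(rb_digits) if d != 0 and d == lead)
    opposite = sum(2 ** j for j, d in enumerate(rb_digits) if d != 0 and d == -lead)
    magnitude = same - opposite
    sign = 1 if lead < 0 else 0
    mag_bits = [(magnitude >> j) & 1 for j in range(k)]
    return sign, mag_bits


if __name__ == "__main__":
    # 1*1 - 1*2 + 0*4 + 1*8 = 7, i.e. sign 0 and magnitude bits 1,1,1,0.
    print(rb_to_sm([1, -1, 0, 1]))
    print(sm_to_rb(1, [1, 1, 0]))
```

 The subtraction in `rb_to_sm` is where the real cost lies: it implies a
 borrow chain across the k binary positions, which is why the text calls
 this direction less trivial.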
 
 3.4.2 Number format and range for mantissa - In this thesis, the
 mantissa is assumed to be represented by a one-origin right indexed
 d-vector of length n. The radix point is assumed to be at the left of the
 most significant digit with index one. That is,

 $X^{l} = .\,x^{l}_{1}\, x^{l}_{2}\, x^{l}_{3} \ldots x^{l}_{n-1}\, x^{l}_{n}$

 For a conventional number representation, the values of digit $x^{l}_{i}$ are
 {0, 1, ..., r-1} and for the signed-digit format, the digit $x^{l}_{i}$ can
 assume any value in the digit set
 $\{\overline{(r-1)}, \overline{(r-2)}, \ldots, \bar{1}, 0, 1, \ldots, (r-2), (r-1)\}$.

 When more than one operand is considered, the superscript is employed
 to identify a specific operand, i.e., $X^1$ and $X^2$ for two operands or
 $X^1, X^2, \ldots, X^j, \ldots, X^l$ for $l$ operands. The i-th digit of $X^j$ is
 uniquely identified as $x^{j}_{i}$.

 The algebraic value of the mantissa is given by

 $X^{l} = \sum_{i=1}^{n} x^{l}_{i}\, r^{-i}$

 and $-1 < X^{l} < 1$.
 
 3.5 Normalization Considerations 
 
 For the preparation of operands and the processing of results, it is 
 necessary to restrict the range of values which the mantissa may assume. 
 One usually restricts this range by requiring that all operands be 
 normalized. This is generally done by defining the form of d-vector 
 representation of the restricted range operands. However, in redundant 
 number representations, there exist pseudo-normal forms, because more than 
 one d-vector representation is possible for the same algebraic value. For 
 example, the two numbers $X$ and $X'$

 $X = .00 \ldots 1$

 $X' = .1\,\overline{(r-1)}\,\overline{(r-1)} \ldots \overline{(r-1)}$

 have the same algebraic value. The representation $X'$ satisfies the con-
 ventional normalization condition $x'_1 \neq 0$ but not the minimum magnitude
 ($\ge 1/r$) requirement for its algebraic value.
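
 The equality of the two representations is easy to confirm numerically.
 In this sketch the digits (r-1) of X' are read as negative (overbarred)
 digits, which is an assumption about the original typography but is the
 only reading under which the two values agree; the helper name is
 hypothetical.

```python
from fractions import Fraction


def sd_value(digits, r):
    """Algebraic value of the fraction .x_1 x_2 ... x_n in radix r."""
    return sum(Fraction(x, r ** i) for i, x in enumerate(digits, start=1))


if __name__ == "__main__":
    r, n = 4, 5
    x = [0] * (n - 1) + [1]                  # X  = .0000 1
    x_prime = [1] + [-(r - 1)] * (n - 1)     # X' = .1 followed by -(r-1) digits
    print(sd_value(x, r), sd_value(x_prime, r))
```

 Both vectors evaluate to $r^{-n}$, so a leading non-zero digit alone says
 little about the magnitude of a signed-digit operand.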
 
 3.5.1 Definition and range of normalized numbers - Three alternative 
 definitions of normalized operands were considered. 
 
 Definition 1

 A number X (of nonzero algebraic value) is considered normalized†
 when its d-vector representation $X = .x_1 x_2 \ldots x_n$ satisfies either
 of two conditions

 a. $|x_1| \ge 2$

 b. $|x_1| = 1$ and $x_1 \cdot x_2 \ge 0$

 Definition 2

 A number X (of nonzero algebraic value) is considered normalized
 when in its d-vector representation
 $X = .x_1 x_2 \ldots x_{t-1} x_t \ldots x_n$ either

 a. $|x_1| \ge 2$

 or

 b. $|x_1| = 1$, $x_2 = x_3 = \ldots = x_{t-1} = 0$ and

 $x_1 \cdot x_t > 0$, $t \le n$

 $x_1 \cdot x_t = 0$, $t = n$

 where n is the length of the operand.

 † In these definitions, the superscript on X has been dropped for
 ease of readability.

 Definition 3

 A number X (of nonzero algebraic value) is considered normalized
 when its d-vector representation $X = .x_1 x_2 x_3 \ldots x_j \ldots x_n$
 satisfies the conditions

 a. $|x_1| > 0$

 and b. $x_1 \cdot x_i \ge 0$

 such that $2 \le i < j$ and $x_j$ is the first (counting from left) zero digit
 in the d-vector of X. For example X = .11101 is considered unnormalized
 per Definition 3.

 The range of values for the normalized operands under Definitions 1
 and 3 is

 $\frac{r-1}{r^2} + \frac{1}{r^n} \le |X| \le 1 - \frac{1}{r^n}$

 and for operands normalized according to Definition 2, the range is

 $\frac{1}{r} \le |X| \le 1 - \frac{1}{r^n}$

 Note that Definition 2 is equivalent to the conventional definition.
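
 The magnitude ranges can be checked by exhaustive enumeration for a small
 radix and length. This is a sketch only: the two predicates encode one
 reading of Definitions 1 and 2 (Definition 1 as "$|x_1| \ge 2$, or
 $|x_1| = 1$ and $x_1 x_2 \ge 0$"; Definition 2 as "$|x_1| \ge 2$, or
 $|x_1| = 1$ and the first non-zero digit after $x_1$, if any, has the sign
 of $x_1$"), and the lower bound $(r-1)/r^2 + 1/r^n$ used for comparison is
 derived from that reading. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def sd_value(digits, r):
    """Algebraic value of the fraction .x_1 x_2 ... x_n in radix r."""
    return sum(Fraction(x, r ** i) for i, x in enumerate(digits, start=1))


def normalized_def1(x):
    """Assumed reading of Definition 1 (needs at least two digits)."""
    return abs(x[0]) >= 2 or (abs(x[0]) == 1 and x[0] * x[1] >= 0)


def normalized_def2(x):
    """Assumed reading of Definition 2."""
    if abs(x[0]) >= 2:
        return True
    if abs(x[0]) != 1:
        return False
    for d in x[1:]:
        if d != 0:
            return x[0] * d > 0  # first non-zero digit must share the sign
    return True                  # only zeros follow x_1


def magnitude_range(predicate, r, n):
    """(min, max) of |X| over all maximally redundant vectors accepted."""
    digit_set = range(-(r - 1), r)
    values = [abs(sd_value(x, r))
              for x in product(digit_set, repeat=n) if predicate(x)]
    return min(values), max(values)


if __name__ == "__main__":
    r, n = 4, 3
    print("Definition 1:", magnitude_range(normalized_def1, r, n))
    print("Definition 2:", magnitude_range(normalized_def2, r, n))
```

 For r = 4 and n = 3 the enumeration reproduces the two closed-form ranges,
 with the Definition 1 minimum lying slightly below 1/r.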
 
 Of the three definitions, the Definition 3 was adopted. The factors 
 affecting the choice between the three definitions are: 
 
 i) Complexity of normalization implementation, 
 
 ii) Amount of significance loss, 
 and iii) Logic complexity of quotient selection. 
 
 For normalizing numbers according to Definition 2, one needs to
 examine more digits than for Definition 1. If immediately following
 $|x_1| = 1$ there is a string of zeros of length $\nu$, the normalization
 procedure must examine at least $\nu + 2$ digits (to determine the sign of
 the first nonzero digit following the string of zeros) in the case of
 Definition 2, whereas only 2 digits need to be looked at for Definition 1.
 Since the examination of digits is essentially a serial process, it takes
 $\nu$ extra steps for Definition 2.

 When the results are normalized according to Definitions 1 and 3 there
 may be a potential loss of one extra radix-r significant digit, compared
 to Definition 2. Such a case can occur when a result d-vector is of the
 form $1.0\,0 \ldots x_i\, x_{i+1} \ldots x_n$ and a post-normalization shift
 becomes necessary.† However, it is expected that such a case would not
 occur very often because for higher radix arithmetic the frequency of
 zero digits is low, and also overflow occurs less often [37].

 Finally, because of the redundant number representation, the quotient
 is calculated based on a truncated version of the partial remainder and
 the divisor. The number of digits of the truncated operands necessary
 for quotient calculation depends on the minimum algebraic value of the
 truncated divisor, say $D_{min}$. The lower the value of $D_{min}$, the greater
 is the number of digits of the truncated divisor and partial remainder
 necessary for the quotient calculation. For higher values of radix r,
 e.g., $r \ge 16$, the difference in the minimum value of the truncated,
 normalized divisor for Definitions 1 and 2 is very small and the number
 of digits required for quotient calculation remains the same. However,
 for lower radices ($8 \ge r \ge 2$), the number of digits required and thus
 the logic complexity for quotient calculation is greater for Definitions
 1 and 3 than for Definition 2.

 † In the case of Definition 2, this number would have to be normalized
 further to the form $.(r-1)(r-1) \ldots (x_i - 1)\, x_{i+1} \ldots x_n$ and a
 post-normalization shift would not be necessary.
 
 From the above discussion of factors affecting the choice of defini-
 tion of normalized numbers for maximally redundant signed-digit operands,
 it is clear that any of the three choices would be almost equally useful
 for higher radices (r > 16). But for r = 2, 4, where the probability of a
 string of zeros is higher, Definition 1 or 3 would definitely be better
 for faster normalization, although the logic complexity of quotient cal-
 culation would correspondingly be increased. The speed of quotient
 calculation and thus the speed of the DIVIDE instruction would be
 decreased. But the frequency of DIVIDE instructions is rather low compared
 to ADD instructions, and so Definition 1 or 3 would overall add to the
 speed of the arithmetic processing.
 
 For the present research, Definition 3 was chosen because of its 
 compatibility with the Assimilation Recode (AR) microinstruction's digit 
 algorithm. The Assimilation Recode algorithm converts a signed-digit 
 operand into a conventional sign-magnitude operand. This compatibility 
 allows the sharing of logic in the implementation of Normalize Recode (NR) 
 microinstruction and microinstruction AR and thus reduces the control 
 complexity of a PE. Digit algorithms for microinstructions NR and AR are 
 discussed in Sections 3.6.4 and 3.6.5. 
 
 Normalization of an operand is achieved by shifting out leading zeros, 
 followed by a 'Normalize Recode' microinstruction, again followed by 
 shifting out leading zeros, if any. This is discussed further in 
 Section 6.2.6. 
 
 3.6 Arithmetic Microinstructions and Corresponding Digit Algorithms 
 
 In the present research, the design of the Processing Element is re- 
 stricted to the capability of performing the four basic arithmetic 
 processes of Addition, Subtraction, Multiplication and Division of two 
 operands. Multiplication and Division are implemented as a number of 
 additions or subtractions (of a multiple of multiplicand or divisor) and 
 shifts as in a classical Von-Neumann Arithmetic Unit. Hence the basic 
 arithmetic microinstruction necessary is of the form 
 
 $X^{w} = X^{u} + (x^{q}_{j} * X^{v})$        (3.6.1)

 where $X^{w}$, $X^{u}$ and $X^{v}$ are d-vectors and $x^{q}_{j}$ is a digit. In the case
 of multiplication (division), $X^{v}$ is the multiplicand (divisor), $X^{u}$ is
 the old partial product (partial remainder), $X^{w}$ is the new partial
 product (partial remainder) and $x^{q}_{j}$ is the signed multiplier (quotient)
 digit.

 The microinstruction which achieves (3.6.1) is termed 'Form Multiple
 and Add' (FMA).
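
 At the level of algebraic values, a multiplication built from (3.6.1) and
 single-position shifts can be sketched as below. This is a classical
 add-and-shift recurrence used purely for illustration: the order in which
 multiplier digits are consumed (least significant first) and the shift
 applied after every FMA step are assumptions, not the schedule used in
 the arithmetic unit, and the function names are hypothetical.

```python
from fractions import Fraction


def fma(x_u, q_digit, x_v):
    """Value-level form of (3.6.1): X^w = X^u + (x^q_j * X^v)."""
    return x_u + q_digit * x_v


def multiply(multiplier_digits, multiplicand, r):
    """Product of the fraction .q_1 q_2 ... q_n (radix r) and multiplicand.

    Each step forms a multiple of the multiplicand, adds it to the partial
    product, and shifts the result one digit position to the right.
    """
    partial = Fraction(0)
    for q in reversed(multiplier_digits):
        partial = fma(partial, q, multiplicand) / r
    return partial


if __name__ == "__main__":
    # Multiplier .2 -1 in radix 4 is 2/4 - 1/16 = 7/16.
    print(multiply([2, -1], Fraction(3, 8), r=4))
```

 Because the multiplier digit is signed, the same FMA step covers both the
 additions and the subtractions mentioned in the text.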
 
 Other microinstructions of an arithmetical nature which are needed 
 for the execution of four basic arithmetic processes are Simple Sum (SS) , 
 Multiple Sum (MS) , Normalize Recode (NR) , and Assimilation Recode (AR) . 
 The function of each of these microinstructions and the corresponding 
 digit algorithm for execution of the microinstruction in a processing 
 element is discussed next. 
 
 3.6.1 Simple sum (SS) microinstruction - This microinstruction forms
 the sum of two signed-digit operands, say A and $\Phi$, such that

 $A' \leftarrow A + \Phi$

 where A' is the new value of the operand A. In general, A and A' are
 in the Accumulator register of the Processing Elements.
 
 At the digit level, the SS microinstruction is characterized by

 $a'_i = a_i + \phi_i + T_i - r\,T_{i-1}$

 where $a_i$, $a'_i$ and $\phi_i$ are radix-r signed digits of the operand in the active
 registers of $PE_i$, $T_i$ is the 'Transfer' (carry-borrow) from the adjacent
 processing element $PE_{i+1}$, and $T_{i-1}$ is the 'Transfer' out of $PE_i$.
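
 The digit-level relation above can be checked with a small value-level
 model. The sketch below does not use the RBA-2 cascade of this report;
 it assumes the classical carry-free signed-digit addition rule for the
 maximally redundant digit set with radix r >= 3. The names ss_add and
 digits_value are hypothetical, and digit position 0 is taken as the most
 significant.

```python
def digits_value(digits, radix):
    """Value of a signed-digit integer given most significant digit first."""
    value = 0
    for d in digits:
        value = value * radix + d
    return value


def ss_add(a, phi, r):
    """Add two signed-digit vectors; returns (transfer out of the top, digits).

    Each position obeys a'_i = a_i + phi_i + T_i - r * T_{i-1}, where
    T_{i-1} is produced at position i and consumed at position i - 1.
    """
    assert r >= 3, "this selection rule assumes radix >= 3"
    n = len(a)
    interim = [0] * n
    transfer_out = [0] * (n + 1)  # transfer_out[n] = 0: nothing enters the low end
    for i in range(n):
        s = a[i] + phi[i]
        if s >= r - 1:
            t = 1
        elif s <= -(r - 1):
            t = -1
        else:
            t = 0
        transfer_out[i] = t
        interim[i] = s - r * t  # magnitude at most r - 2
    digits = [interim[i] + transfer_out[i + 1] for i in range(n)]
    return transfer_out[0], digits
```

 Because each transfer depends only on the two operand digits of its own
 position, no transfer propagates further than one position.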
 
 3.6.1.1 Digit algorithm - The specification of the digit algorithm 
 for SS is intimately connected with its implementation and is described 
 below in terms of its algebraic implementation. 
 
 Because of the structural regularity requirements of the LSI tech-
 nology, the sum of two radix-r signed digits $a_i$ and $\phi_i$ is realized in a
 linear cascade of k two-input redundant binary adders.

 This is schematically shown in Figure 3.1. RBA-2 is a two-input
 redundant binary adder which accepts two redundant binary digits
 $a^{*}_{i_v}$, $\phi^{*}_{i_v} \in \{\bar{1},0,1\}$ and produces one redundant binary digit. The
 design of such an RBA-2 was studied in detail by Borovec [38] and we
 shall interchangeably use the term Borovec Unit (BU) for RBA-2.
 
 3.6.1.1.1 Arithmetic design of RBA-2 - The major consideration in 
 the design of RBA-2 was the minimization of the number of pins required 
 for the 'Transfer' into and out of an RBA-2. One such design is shown 
 in Figure 3.2. RBA-2 is realized by a series of four arithmetic trans- 
 formations as follows. 
 
 [Figure 3.1: Illustration of the implementation of microinstruction SS
 as a linear cascade of RBA-2s. Drawing and full caption not recoverable
 from the scan.]

 [Figure 3.2 Arithmetic Structure of an RBA-2]

 
 $\sigma_1$: $\phi^{*}_{i_v} = \phi^{-}_{i_v} + \phi^{+}_{i_v}$

 $\sigma_2$: $a^{*}_{i_v} + \phi^{-}_{i_v} = w_{i_v} + 2\,t^{-}_{i_{v-1}}$

 $\sigma_3$: $w_{i_v} + \phi^{+}_{i_v} + t^{-}_{i_v} = w'_{i_v} + 2\,t^{+}_{i_{v-1}}$

 $\sigma_4$: $w'_{i_v} + t^{+}_{i_v} = a'^{*}_{i_v}$

 where $a^{*}_{i_v}$, $\phi^{*}_{i_v}$ and $a'^{*}_{i_v} \in \{\bar{1},0,1\}$,
 $t^{-}_{i_v}$, $t^{-}_{i_{v-1}}$, $\phi^{-}_{i_v} \in \{0,\bar{1}\}$
 and $t^{+}_{i_v}$, $t^{+}_{i_{v-1}}$, $\phi^{+}_{i_v} \in \{0,1\}$.

 Let us call $t^{-}_{i_v}$, $t^{-}_{i_{v-1}}$ 'negative transfers' and $t^{+}_{i_v}$, $t^{+}_{i_{v-1}}$ 'positive
 transfers'. The logic design of RBA-2 is discussed in Section 4.2.2.3.
 It is clear from the design of RBA-2 above that the transfer digits $t^{-}_{i_v}$
 and $t^{+}_{i_v}$ are respectively dependent on the inputs to the RBA-2s which
 are immediately adjacent and one next to it.
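
 A minimal sketch of a cascade of two-input redundant binary adders
 built from four transformations of this kind is given below. The sign
 conventions and intermediate digit ranges are assumptions chosen so that
 every transfer is a single binary signal; they are not taken from the
 logic design of Section 4.2.2.3. The names rba2_cell and rba2_add are
 hypothetical, and digit vectors are given most significant digit first.

```python
def rba2_cell(a, phi, t_neg_in, t_pos_in):
    """One two-input redundant binary adder position (digits in {-1, 0, 1})."""
    phi_neg, phi_pos = min(phi, 0), max(phi, 0)  # sigma_1: split phi by sign
    s = a + phi_neg                              # sigma_2: s in {-2, ..., 1}
    t_neg_out = -1 if s < 0 else 0               # negative transfer in {0, -1}
    w = s - 2 * t_neg_out                        # w in {0, 1}
    u = w + phi_pos + t_neg_in                   # sigma_3: u in {-1, ..., 2}
    t_pos_out = 1 if u > 0 else 0                # positive transfer in {0, 1}
    w2 = u - 2 * t_pos_out                       # w2 in {-1, 0}
    return w2 + t_pos_in, t_neg_out, t_pos_out   # sigma_4: output in {-1, 0, 1}


def rba2_add(a, phi):
    """Add two equal-length redundant binary vectors; result has one more digit."""
    t_neg = t_pos = 0
    out = []
    # Evaluate from the least significant end only for convenience; each
    # output digit depends on at most two less significant positions.
    for av, pv in zip(reversed(a), reversed(phi)):
        digit, t_neg, t_pos = rba2_cell(av, pv, t_neg, t_pos)
        out.append(digit)
    out.append(t_neg + t_pos)  # transfers leaving the most significant position
    out.reverse()
    return out
```

 In this model the positive transfer into a position depends on the
 negative transfer into its neighbour, which is why the output digit is
 influenced by two less significant positions and no more.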
 
 In terms of the notation of Section 2.4, we have for the SS micro-
 instruction

 ${}_{SS}F_i = {}_{SS}F_{i-1}$ for all i, $1 \le i \le n$

 ${}_{SS}F_0$ = <null>

 ${}_{SS}G_{i+1} = (t^{-}_i, t^{+}_i)$

 and $\alpha_{SS}$ = 2 for r = 2
 = 1 for r > 2

 
 3.6.2 Form multiple and add (FMA) microinstruction - This micro- 
 instruction is used to form the product of the multiplicand (divisor) 
 d-vector and a multiplier (quotient) digit, which when added to the old 
 partial product (partial remainder) gives the new partial product 
 (partial remainder) in the execution of a Multiplication (Division) of 
 two d-vector operands. 
 
 At the digit level, this microinstruction is characterized by the 
 arithmetic transfer function 
 
 $a'_i = a_i + m_j \cdot \phi_i - r\,T_{i-1} + T_i$ (3.6.3)

 where

 $a_i$, $\phi_i$ are the digits in the active operand registers of the
 processing element $PE_i$
 $a'_i$ is the new value of digit $a_i$
 $m_j$ is a multiplier (quotient) digit
 r is the radix
 and $T_i$ ($T_{i-1}$) is the 'Transfer' (carry/borrow) from (to) adjacent process-
 ing element $PE_{i+1}$ ($PE_{i-1}$).
 This is functionally represented in Figure 3.3. 
 
 3.6.2.1 Digit algorithm - The major considerations in the design of 
 the digit algorithm for microinstruction FMA were the LSI technology con- 
 straints — namely, that the implementation logic for FMA should consist of 
 a regular and repetitive structure. The specification of the digit 
 algorithm is intimately connected with the implementation procedure and 
 is described below as such. 
 
 [Figure 3.3 Functional Representation of Microinstruction FMA.]

 
 The transfer function in Equation (3.6.3) is achieved by a series of
 two transformations $f_1$ and $f_2$ as shown in Figure 3.4. The two trans-
 formations are

 $f_1$: $m_j \cdot \phi_i = r\,t^{p}_{i-1} + w_i$ (3.6.4)

 and $f_2$: $w_i + a_i + t^{p}_i + t^{A}_i = a'_i + r\,t^{A}_{i-1}$ . (3.6.5)

 Transformation $f_2$ essentially requires a radix-r multi-input adder which
 forms the sum of digits of both signs. This multi-input adder is imple-
 mented as a k-stage linear cascade of radix-2 multi-input adders where
 each input of a radix-2 adder can assume three values $\bar{1}$, 0, 1. The input
 digits $w_i$, $a_i$, $t^{p}_i$ are expressed in the form of radix-2 d-vectors such
 that each component of the radix-2 d-vector is from the redundant binary
 digit set $\{\bar{1},0,1\}$. This is schematically shown in Figure 3.5. MIRBA
 represents the Multi-Input Redundant Binary Adder. The number of redun-
 dant binary inputs to each MIRBA is determined as follows. Two
 algorithms were studied for the implementation of transformation $f_1$ given
 in Equation (3.6.4). They differ in the maximum values that $w_i$ and $t^{p}_{i-1}$
 can assume.
 
 3.6.2.1.1 Algorithm 1 - To illustrate the principle, let

 $\phi_i = \sum_{\ell=0}^{k-1} \phi^{*}_{i_\ell} \cdot 2^{\ell}$

 $m_j = \sum_{q=0}^{k-1} m^{*}_{j_q} \cdot 2^{q}$

 where $\phi^{*}_{i_\ell}$, $m^{*}_{j_q} \in \{\bar{1},0,1\}$.
 
 [Figure 3.4 Functional Representation of the Digit Algorithm for FMA.]

 [Figure 3.5: Schematic of the k-stage linear cascade of MIRBAs. Drawing
 and caption not recoverable from the scan.]

 
 The product $\phi_i \cdot m_j$ is implemented by a product matrix generator which
 consists of a k x k square array of redundant binary product cells. Each
 cell performs the product of two redundant binary digits $\phi^{*}_{i_\ell}$ and $m^{*}_{j_q}$ and
 its output product digit $p^{*}_{\ell,q}$ is also in the digit set $\{\bar{1},0,1\}$.

 The product may be viewed in terms of the sums of the $p^{*}$ terms of
 the same weight in the product matrix.

 $\phi_i \cdot m_j = \sum_{v=0}^{2k-2} 2^{v} \left( \sum_{\ell=0}^{v} \phi^{*}_{i_\ell}\, m^{*}_{j_{v-\ell}} \right) = \sum_{v=0}^{2k-2} 2^{v} \left( \sum_{\ell=0}^{v} p^{*}_{\ell,v-\ell} \right)$ (3.6.6)

 where $p^{*}_{\ell,v-\ell} = \phi^{*}_{i_\ell} \cdot m^{*}_{j_{v-\ell}}$
 and $p^{*}_{\ell,v-\ell}$ does not exist when either $\ell > k-1$ or $v-\ell > k-1$.

 The number $N_v$ of product elements in the v-th column of the product
 matrix is given by

 $N_v = v+1$ for $0 \le v \le k-1$
 $N_v = -v + (2k-1)$ for $k \le v \le 2k-2$ (3.6.7)

 The number $N_v$ is maximum in the column of weight $2^{k-1}$ and is equal to k. The
 product elements in other columns decrease uniformly by one on either
 side of this column as shown in Figure 3.6.
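
 The column structure of the product matrix can be illustrated with a
 short sketch. The helper names product_columns and column_count are
 hypothetical, and the digit lists are assumed to be indexed by weight
 (index 0 has weight 2^0).

```python
def product_columns(phi_bits, m_bits):
    """Group the k*k redundant binary products p[l][q] by weight 2^(l+q)."""
    k = len(phi_bits)
    columns = [[] for _ in range(2 * k - 1)]
    for l in range(k):
        for q in range(k):
            columns[l + q].append(phi_bits[l] * m_bits[q])
    return columns


def column_count(v, k):
    """Number of product elements in column v, as in Equation (3.6.7)."""
    return v + 1 if v <= k - 1 else (2 * k - 1) - v


def weighted_value(bits):
    """Value of a redundant binary digit list indexed by weight."""
    return sum(b * 2 ** i for i, b in enumerate(bits))
```

 Summing each column, weighting it by 2^v and adding the results gives
 back the product of the two digits, which is the content of Equation
 (3.6.6).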
 
 [Figure 3.6: Product matrix, showing the number of product elements in
 each column. Drawing and caption not recoverable from the scan.]

 
 Equation (3.6.6) can be rewritten in the following form

 $\phi_i \cdot m_j = \sum_{v=0}^{k-1} 2^{v} \left( \sum_{\ell=0}^{v} p^{*}_{\ell,v-\ell} \right) + \sum_{v=k}^{2k-2} 2^{v} \left( \sum_{\ell=0}^{v} p^{*}_{\ell,v-\ell} \right)$

 $= \sum_{v=0}^{k-1} 2^{v} \left( \sum_{\ell=0}^{v} p^{*}_{\ell,v-\ell} \right) + 2^{k} \sum_{v=k}^{2k-2} 2^{v-k} \left( \sum_{\ell=0}^{v} p^{*}_{\ell,v-\ell} \right)$ (3.6.8)

 The columns of weight $2^{v}$ ($k \le v \le 2k-2$) of the product matrix can be
 considered as forming a carry $t^{p}_{i-1}$, called Collective Product Transfer (CPT),
 to the next more significant radix-$2^{k}$ digital position i-1. These
 ($k \le v \le 2k-2$) CPT columns have weights $2^{v-k}$ (Equation (3.6.8)) with respect to
 the higher significant digital position. When similar CPT columns from
 digital position i+1 are added in the appropriate (of the same weight)
 MIRBA of the digital position i, all the stages of the linear cascade of
 MIRBAs in $PE_i$ become identical, each MIRBA having k inputs. This is
 illustrated in Figure 3.7.

 Further, the transformation $f_2$ requires the addition of one radix-r
 digit $a_i$ to $w_i$ and $t^{p}_i$, and the digit $a_i$ contributes one redundant binary
 input to each position of MIRBA. Hence the transformation $f_2$ requires k
 MIRBAs, each capable of summing k+1 redundant binary inputs, as well as
 the 'Transfer' from the adjacent MIRBA position. Figure 3.8 schematically
 shows the implementation of FMA digit algorithm for radix 16, that is,
 k=4.

 Values of $|w_i|_{max}$ and $|t^{p}_{i-1}|_{max}$

 From Equations (3.6.4) and (3.6.8), we have

 [Figure 3.7: Illustration of the Collective Product Transfer columns
 added into the MIRBA cascade. Drawing and caption not recoverable from
 the scan.]

 [Figure 3.8: Implementation of the FMA digit algorithm (Algorithm 1)
 for radix 16. Drawing and full caption not recoverable from the scan.]

 
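 $w_i = \sum_{v=0}^{k-1} 2^{v} \left( \sum_{\ell=0}^{v} p^{*}_{\ell,v-\ell} \right) = \sum_{v=0}^{k-1} 2^{v} \left( \sum_{\ell=0}^{v} \phi^{*}_{i_\ell}\, m^{*}_{j_{v-\ell}} \right)$ (3.6.9)

 $t^{p}_{i-1} = \sum_{v=k}^{2k-2} 2^{v-k} \left( \sum_{\ell=0}^{v} p^{*}_{\ell,v-\ell} \right) = \sum_{v=k}^{2k-2} 2^{v-k} \left( \sum_{\ell=0}^{v} \phi^{*}_{i_\ell}\, m^{*}_{j_{v-\ell}} \right)$ (3.6.10)

 From Equations (3.6.9) and (3.6.7), we have

 $|w_i|_{max} = 1 \cdot 2^{0} + 2 \cdot 2^{1} + \dots + (v+1) \cdot 2^{v} + \dots + k \cdot 2^{k-1}$

 $= 2^{k}(k-1) + 1$ (3.6.11)

 Further

 $|w_i|_{max} + 2^{k}\,|t^{p}_{i-1}|_{max} = (2^{k}-1)^{2}$

 $\therefore |t^{p}_{i-1}|_{max} = \dfrac{(2^{k}-1)^{2} - (2^{k}(k-1) + 1)}{2^{k}}$

 $= 2^{k} - (k+1)$ (3.6.12)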
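
 The closed forms (3.6.11) and (3.6.12) can be checked numerically. The
 sketch below assumes the extreme case in which every product element
 equals 1, so that each column sum equals its element count from
 Equation (3.6.7); the function name algorithm1_maxima is hypothetical.

```python
def algorithm1_maxima(k):
    """Return (|w|max, |t^p|max) for Algorithm 1 with radix 2^k."""
    # Element count of column v, from Equation (3.6.7).
    counts = [v + 1 if v <= k - 1 else 2 * k - 1 - v for v in range(2 * k - 1)]
    # Columns of weight below 2^k stay in the digit position (w).
    w_max = sum(counts[v] * 2 ** v for v in range(k))
    # Columns of weight 2^k and above form the collective product transfer.
    t_max = sum(counts[v] * 2 ** (v - k) for v in range(k, 2 * k - 1))
    return w_max, t_max
```

 For radix 16 (k = 4) this gives |w|max = 49 and |t^p|max = 11, and
 49 + 16 * 11 = 225 = 15 * 15 as expected.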
 
 In summary, the digit algorithm 1 for the microinstruction FMA can be 
 described as follows: 
 
 i) Perform transformation $f_1$ by recoding the product of digits $\phi_i$
 and $m_j$ into $w_i$, $t^{p}_{i-1}$ such that $w_i$ and $t^{p}_{i-1}$ are given by Equations (3.6.9)
 and (3.6.10) respectively.

 ii) Perform transformation $f_2$ in a k-stage linear cascade of (k+1)-
 input redundant binary adders.
 
 The design of the multi-input redundant binary adder is discussed in 
 Section 3.6.2.1.3. 
 
 3.6.2.1.2 Algorithm 2 - In this algorithm, the transformation $f_1$
 recodes the product $\phi_i \cdot m_j$ into digits $w_i$ and $t^{p}_{i-1}$ such that

 $w_i \in \{\overline{(r-1)}, \overline{(r-2)}, \dots, \bar{1}, 0, 1, \dots, (r-2), (r-1)\}$

 and $t^{p}_{i-1} \in \{\overline{(r-2)}, \dots, \bar{1}, 0, 1, \dots, (r-2)\}$ .

 Clearly, the recoded digits $w_i$, $t^{p}_i$ contribute only one redundant binary
 input each to each MIRBA of the linear cascade.

 Then the transformation $f_2$ is performed in the k-stage linear cascade
 of 3-input MIRBAs. This is illustrated in Figure 3.9.
 
 Note that, in algorithm 2 the number of inputs to the MIRBAs is 
 always three, independent of the value of k. 
 
 The LSI implications of algorithms 1 and 2 are discussed later in 
 Section 4.2.2.5. 
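
 A value-level model of FMA under Algorithm 2 is sketched below. It is
 not the MIRBA logic of this report: the recoding in f1 (truncation
 toward zero) and the transfer selection in f2 (rounding to the nearest
 multiple of r) are assumptions chosen to keep every digit inside the
 signed-digit range, and they are only claimed to work for r >= 8. The
 names f1 and fma_vector are hypothetical; vectors are most significant
 digit first.

```python
def f1(m, phi, r):
    """Recode m * phi as r * t_p + w with |w| <= r - 1 and |t_p| <= r - 2."""
    p = m * phi
    sign = -1 if p < 0 else 1
    return sign * (abs(p) // r), sign * (abs(p) % r)


def fma_vector(a, phi, m, r):
    """Return (transfer out of the top position, new accumulator digits)."""
    assert r >= 8, "the transfer selection below assumes radix >= 8"
    n = len(a)
    t_p = [0] * (n + 1)  # t_p[i]: product transfer produced at i, used at i - 1
    w = [0] * n
    for i in range(n):
        t_p[i], w[i] = f1(m, phi[i], r)
    t_a = [0] * (n + 1)  # t_a[i]: adder transfer produced at i, used at i - 1
    residual = [0] * n
    for i in range(n):
        s = w[i] + a[i] + t_p[i + 1]
        t_a[i] = (s + r // 2) // r  # nearest multiple of r
        residual[i] = s - r * t_a[i]
    new_a = [residual[i] + t_a[i + 1] for i in range(n)]
    return t_p[0] + t_a[0], new_a
```

 Every transfer in this model is computed from the inputs of its own
 position and of the adjacent less significant position only.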
 
 3.6.2.1.3 Design of a multi-input redundant binary adder (MIRBA) - A
 MIRBA is a limited carry/borrow propagation adder which accepts several
 redundant binary inputs (digit set $\{\bar{1},0,1\}$) and produces one redundant
 binary output (with appropriate adder 'Transfers' for more significant
 adjacent adder stages).

 Definition

 Let us define a new parameter $\alpha^{b}$. The redundant binary output of
 any MIRBA is dependent on the 'Transfers' (the composite term for carry/
 borrow) input to that MIRBA. In a redundant number system, the 'Transfers'
 are functions of 'primary' inputs (other than 'Transfer' inputs) to only
 a limited number of adjacent less significant MIRBAs. $\alpha^{b}$ denotes the
 number of such adjacent MIRBAs whose 'primary' inputs in cooperation with
 the primary inputs of a given MIRBA determine the output of that MIRBA.

 The radix-2 digit processing logic in, say, $PE_i$ consists of a k-
 stage linear cascade of (k+1)-input MIRBAs. Except for the most significant
 MIRBA in the k-stage cascade, the primary inputs to the MIRBAs in $PE_i$ are
 functions of radix-$2^{k}$ operand digits in $PE_i$ and $PE_{i+1}$ (accumulator
 digits $a_i$, multiplier digit $m_j$ and multiplicand digits $\phi_i$, $\phi_{i+1}$). Thus
 $\alpha_j$ is related to $\alpha^{b}$ by Equation (3.6.13)

 $\alpha_j = \left\lceil \dfrac{\alpha^{b} - 1}{k} \right\rceil + 1$ (3.6.13)

 [Figure 3.9: Implementation of Algorithm 2 for FMA with a k-stage
 cascade of 3-input MIRBAs. Drawing and caption not recoverable from the
 scan.]
 
 3.6.2.1.3.1 Rohatsch's [39] technique - This is a deterministic and 
 explicit transformation procedure which converts a given input digit set 
 into the required output digit set by a series of simple transformations. 
 
 In using this technique, one generally proceeds backwards; namely, 
 consider the transformations going from output set to input set. The 
 basic concept of Rohatsch's technique is very simple: 
 
 i) Take the desired output set S, find two or more sets $A_0$, $A_1$,
 $A_2$, ..., $A_n$ such that

 $S = A_n + A_{n-1} + \dots + A_2 + A_1 + A_0$ .

 ii) Form the input set M where

 $M = r^{n} A_n + r^{n-1} A_{n-1} + \dots + r^{1} A_1 + A_0$

 where r is the radix of the adder. In our case, for MIRBA, r=2.

 iii) If necessary, repeat the steps i) and ii) (using the last in-
 put set as the new output set) as many times as is required to
 generate a set which includes the desired input set.

 Steps i) and ii) above together constitute an n-th order Simple Transforma-
 tion (referred to as S.T.). For the contiguity of sets M and S, $A_n$, $A_{n-1}$,
 ..., $A_0$ must be contiguous and the number of distinct digits in sets $A_i$,
 $n-1 \ge i \ge 0$, should be greater than or equal to r.
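
 Steps i) and ii) amount to two set sums, which can be sketched as
 follows. The function names set_sum and simple_transformation are
 hypothetical, and the component sets are assumed to be supplied in the
 order A_0, A_1, ..., A_n.

```python
def set_sum(x, y):
    """All sums of one element of x and one element of y."""
    return {a + b for a in x for b in y}


def simple_transformation(component_sets, radix=2):
    """Return (S, M) for component sets [A_0, A_1, ..., A_n].

    S is the output digit set (plain sum of the sets) and M is the input
    digit set (sum of the sets weighted by powers of the radix).
    """
    output_set = {0}
    input_set = {0}
    for i, a_i in enumerate(component_sets):
        output_set = set_sum(output_set, a_i)
        input_set = set_sum(input_set, {radix ** i * d for d in a_i})
    return output_set, input_set
```

 As a usage example, a first order S.T. with A_0 = {-1, 0} and
 A_1 = {0, 1} in radix 2 has the output set {-1, 0, 1} and accepts the
 input set {-1, 0, 1, 2}.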
 
 Using the above approach, we find that for k > 2, a (k+1)-input
 MIRBA requires a series of three S.T.s. Figure 3.10a shows one such
 four level (each level indicated by a box) adder which is applicable for
 $k \le 5$. In this, level 1 and level 2 perform first order S.T.s whereas
 level 3 represents a 2nd order transformation. If level 3 performs a
 third order or fourth order transformation, such a four level adder
 would be applicable for $k \le 9$ and $k \le 11$ respectively.

 It is interesting to note that if level 2 achieves a 2nd order S.T.
 and level 3 constitutes a 6th order S.T., then the four level adder can
 be used to sum as many as 51 redundant binary $\{\bar{1},0,1\}$ inputs. This is
 shown in Figure 3.10b.

 However, the logic design of the bottom two levels is highly com-
 plicated for $k \ge 5$ if they are to be implemented in two or three logic
 levels. In practice, the technique is to break down the bottom level
 structure into equivalent simpler structures, frequently at the cost of
 increasing the number of levels, as shown in Figure 3.10c for k = 5.
 
 In this adder structure, $\alpha^{b}$ is given by

 $\alpha^{b} = \sum_{v=1}^{q-1} n_v$

 where q = number of levels in MIRBA

 $n_v$ = order of S.T. performed by adder level v.

 [Figure 3.10a Illustration of the Algebraic Design of a MIRBA, using
 First Order Simple Transformations only. Note: Entries in the box show
 the allowed output digit set values.]

 [Figure 3.10b Illustration of the Algebraic Design of a MIRBA using
 Higher (>2) Order Simple Transformation. Note: Entries in the box show
 the allowed output digit set values.]

 [Figure 3.10c Algebraic Design of Bottom Level (Level 4) Box of
 Figure 3.10a, built from 3-input, 2-output redundant binary adders.]
 
 Table 3.1 shows the values of $\alpha^{b}$ and $\alpha_j$ for various values of k,
 for a (k+1)-input MIRBA.

 Table 3.1
 Values of alpha^b and alpha_j for Various (k+1)-Input MIRBA Configurations

 radix          Rohatsch's          Log-sum tree        RBA-3, RBA-2
 r = 2^k   k    Technique                               tree structure
                alpha^b  alpha_j    alpha^b  alpha_j    alpha^b  alpha_j

    4      2       3        2          4        3          3        2
    8      3       4        2          4        2          5        3
   16      4       4        2          6        3          5        2
   32      5       4        2          6        2          5        2
   64      6       5        2          6        2          6        2
  128      7       5        2          6        2          6        2
  256      8       5        2          8        2          6        2
 
 3.6.2.1.3.2 Log-sum tree technique - A conceptually simple approach 
 is to realize the (k+1) input MIRBA by a log-sum tree structure of two 
 input redundant binary adders (RBA-2) . For a (k+1) input MIRBA, the tree 
 structure has t levels of Borovec Units such that 
 
 $t = \lceil \log_2 (k+1) \rceil$

 and the number of BUs required is k. Figure 3.11 shows the log-sum tree
 structure for a five input MIRBA.
 In this configuration,

 $\alpha^{b} = 2t = 2 \lceil \log_2 (k+1) \rceil$ and

 $\alpha_j = \left\lceil \dfrac{2 \lceil \log_2 (k+1) \rceil - 1}{k} \right\rceil + 1$

 
 [Figure 3.11: Log-sum tree structure of RBA-2s for a five input MIRBA.
 Drawing and caption not recoverable from the scan.]

 
 The value of $\alpha_j$ for various values of k is tabulated in Table 3.1. From
 the table we find that for k=2 and k=4, that is, radices 4 and 16, the
 value of $\alpha_j$ is 3, and $\alpha_j = 2$ for all other values of k. Since the minimum
 value of $\alpha_j$ is desirable, a different arrangement of BUs, described in the
 third approach given next, can be used to achieve $\alpha_j = 2$.
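
 The log-sum tree entries of Table 3.1 can be recomputed with a few
 lines. The sketch assumes the relations t = ceil(log2(k+1)),
 alpha^b = 2t and alpha_j = ceil((alpha^b - 1)/k) + 1; the function name
 log_sum_tree_parameters is hypothetical, and integer arithmetic is used
 to avoid floating-point rounding.

```python
def log_sum_tree_parameters(k):
    """Return (t, alpha_b, alpha_j) for a (k+1)-input log-sum tree MIRBA."""
    t = k.bit_length()                  # equals ceil(log2(k + 1)) for k >= 1
    alpha_b = 2 * t
    alpha_j = -(-(alpha_b - 1) // k) + 1  # ceiling division, then + 1
    return t, alpha_b, alpha_j


def log_sum_table(k_values=range(2, 9)):
    """Rows (radix, k, alpha_b, alpha_j) for the log-sum tree technique."""
    rows = []
    for k in k_values:
        _, alpha_b, alpha_j = log_sum_tree_parameters(k)
        rows.append((2 ** k, k, alpha_b, alpha_j))
    return rows
```

 Under these assumptions alpha_j comes out as 3 only for k = 2 and k = 4,
 in agreement with the table.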
 
 3.6.2.1.3.3 Tree-structure using RBA-3s and RBA-2s - In this con- 
 figuration, 3-input redundant binary adders (RBA-3) and RBA-2s are con- 
 nected in a tree structure. 
 
 An RBA-3 consists of two BUs, a D-element and a C-element arranged 
 as shown in Figure 3.12. The C-element composes two binary inputs
 {0,1; 0,1} into one redundant binary $\{\bar{1},0,1\}$ output. The lower BU in
 combination with the C-element and the D-element acts as a redundant 
 binary (3,2) counter. The upper BU forms the sum of the sum-outputs of 
 the lower BUs and the 'Transfer' output of the lower BU of adjacent less 
 significant RBA-3. 
 
 For a design of a (k+1) input MIRBA, RBA-3s are used whenever they 
 can be fully utilized, that is, three inputs are available for addition; 
 and RBA-2s are used when only 2-inputs are to be added at any level of 
 the tree structure. (An exception occurs for k=3 where the log-sum tree 
 technique is necessary.) Figure 3.13 shows a 5-input MIRBA using RBA-3s 
 and RBA-2s as building blocks. 
 
 The number of BUs required in this technique is also k for a (k+1)-
 input MIRBA. The number of BU levels is also $\lceil \log_2 (k+1) \rceil$. Table 3.1 shows
 the values of $\alpha^{b}$ and $\alpha_j$ for various values of k. It shows that $\alpha_j = 2$ for all
 values of k except k=3.
 
 [Figure 3.12 Arithmetic Structure of an RBA-3.]

 [Figure 3.13: A 5-input MIRBA built from RBA-3s and RBA-2s. Drawing and
 caption not recoverable from the scan.]

 
 The tree structure configurations described in 3.6.2.1.3.2 and
 3.6.2.1.3.3 have the following advantages compared to Rohatsch's tech-
 nique.

 a) They are more general and have the same configuration for any value
 of k.

 b) They make use of only one kind of cell, namely the Borovec Unit,
 for the implementation of MIRBA.

 c) The various BUs are uniformly and regularly interconnected.

 Because of b) and c) above, this implementation meets the LSI con-
 straints of structural regularity and a minimum number of cell types.
 
 In terms of our notation of Section 2.4, 
 
 ${}_{FMA}F_i = {}_{FMA}F_{i-1} = m_j$, $1 \le i \le n$

 ${}_{FMA}F_0 = m_j$

 where ${}_{FMA}F_i$ = modifier value which is sent by $PE_i$ along with micro-
 instruction FMA to $PE_{i+1}$.

 ${}_{FMA}F_0$ = modifier value sent by MCU along with microinstruction
 FMA to $PE_1$.

 $m_j$ = Multiplier (or Quotient) digit.

 ${}_{FMA}G_{i+1} = (t^{p}_i, t^{A}_i)$

 where $t^{p}_i$ = Product Transfer to $PE_i$ from $PE_{i+1}$.
 $t^{A}_i$ = MIRBA Output Transfer from $PE_{i+1}$.

 $\alpha_{FMA} = 2$.
 
 3.6.3 Multi-sum (MS) microinstruction - This microinstruction forms
 the sum of N digit vectors where N is the number of inputs of a MIRBA used
 in the implementation of microinstruction FMA. N depends on the digit
 algorithm used for FMA. In any case $N \le k+1$.

 The digit level transfer function is given by

 $a'_i = x^{1}_i + x^{2}_i + \dots + x^{N}_i - r\,t^{A}_{i-1} + t^{A}_i$ (3.6.13)

 where $a'_i$, $x_i \in \{\overline{(r-1)}, \overline{(r-2)}, \dots, \bar{1}, 0, 1, \dots, (r-1)\}$.

 If we designate the set of arithmetic transformations performed by a
 Borovec Unit as Borovec Unit Transformation (BAT), then the transfer func-
 tion in Equation (3.6.13) is realized by a series of $\lceil \log_2 N \rceil$ BATs. This
 was discussed earlier in the design of a MIRBA in Section 3.6.2.1.3.
 
 Implementation
 
 The MS digit algorithm can be implemented by making use of MIRBAs 
 already existing in the digit processing logic of the processing element. 
 This is shown in Figure 3.14. 
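
 The claim that N vectors can be summed in ceil(log2 N) levels of
 two-input additions can be illustrated at radix 2. The sketch below is
 an assumption-laden model, not the MIRBA wiring of Figure 3.14: the
 two-input adder uses one possible choice of binary transfers, the names
 rba2_add and multi_sum are hypothetical, and all vectors are assumed to
 have the same length, most significant digit first.

```python
def rba2_add(a, phi):
    """Two-input redundant binary addition; result is one digit longer."""
    t_neg = t_pos = 0
    out = []
    for av, pv in zip(reversed(a), reversed(phi)):
        s = av + min(pv, 0)
        t_neg_out = -1 if s < 0 else 0           # negative transfer
        u = (s - 2 * t_neg_out) + max(pv, 0) + t_neg
        t_pos_out = 1 if u > 0 else 0            # positive transfer
        out.append(u - 2 * t_pos_out + t_pos)    # digit in {-1, 0, 1}
        t_neg, t_pos = t_neg_out, t_pos_out
    out.append(t_neg + t_pos)
    out.reverse()
    return out


def multi_sum(vectors):
    """Sum N redundant binary vectors pairwise; returns (digits, levels)."""
    layer = [list(v) for v in vectors]
    levels = 0
    while len(layer) > 1:
        nxt = [rba2_add(layer[i], layer[i + 1])
               for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:
            nxt.append([0] + layer[-1])  # odd vector passes through, padded
        layer = nxt
        levels += 1
    return layer[0], levels
```

 For five input vectors the loop runs three times, which matches
 ceil(log2 5) = 3.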
 
 The MS microinstruction can be represented in the notation of Section 
 2.4 as follows. 
 
 ${}_{MS}F_i = {}_{MS}F_{i-1}$ for all i, $1 \le i \le n$

 ${}_{MS}F_0$ = <Null>

 ${}_{MS}G_{i+1} = t^{A}_i$

 [Figure 3.14: Implementation of the MS digit algorithm using the MIRBAs
 of the digit processing logic. Drawing and caption not recoverable from
 the scan.]

 and $\alpha_{MS} = \lceil \alpha^{b} / k \rceil$

 = 2 for radix $2^{k} < 32$

 = 1 for radix $2^{k} \ge 32$.
 
 3.6.4 Normalize Recode (NR) microinstruction - This microinstruction 
 is used to normalize an operand according to Definition 3 given in 
 Section 3.5.1. 
 
 Given an operand of the form

 $X = \pm\,.\,x_1 x_2 \dots x_i \dots x_j x_{j+1} \dots x_n$

 such that

 $|x_i| > 0$, $1 \le i \le j-1$,

 $x_j = 0$,

 and $|x_i| \ge 0$, $j+1 \le i \le n$,

 the NR microinstruction transforms the operand X into an algebraically
 equivalent operand X'

 $X' = \pm\,.\,0 0 \dots x'_h x'_{h+1} \dots x'_i \dots x'_j x'_{j+1} \dots x'_n$

 where sign $(x'_i)$ = sign $(x_1)$, $h \le i \le j-1$, $h \ge 1$,

 $x'_j = x_j = 0$

 and $x'_k = x_k$, $j+1 \le k \le n$.

 For example, if radix r=10, then the numbers $.1\bar{9}\bar{9}704$, $.1\bar{9}0\bar{9}704$, $.17\bar{9}\bar{4}1\bar{2}$
 and $.10\bar{9}01\bar{8}$ would be recoded respectively as $.001704$, $.010\bar{9}704$, $.160608$
 and $.10\bar{9}01\bar{8}$.
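
 The effect of NR on these examples can be reproduced with a value-level
 sketch. This is not the digit-serial algorithm of Figure 3.15: it simply
 evaluates the digits ahead of the first zero and rewrites that value
 with digits of a single sign. The function name normalize_recode is
 hypothetical; negative digits stand for barred digits.

```python
def normalize_recode(digits, r):
    """Recode the digits ahead of the first zero digit into one sign."""
    # j is the (0-based) index of the first zero digit, if any.
    j = digits.index(0) if 0 in digits else len(digits)
    prefix = digits[:j]
    value = 0
    for d in prefix:
        value = value * r + d
    sign = -1 if value < 0 else 1
    magnitude = abs(value)
    recoded = []
    for _ in prefix:
        recoded.append(sign * (magnitude % r))  # conventional digit, signed
        magnitude //= r
    recoded.reverse()
    # The zero digit and everything to its right are left unchanged.
    return recoded + digits[j:]
```

 Leading zeros produced by the recoding are then removed by the shifting
 step that follows NR during normalization.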
 
 Digit Algorithm

 Let

 $S_{OP}$ = Sign of the operand

 $S_i$ = Sign of digit $x_i$ (0 if +ve, 1 if -ve)

 r = radix

 $|x_i|$ = magnitude of digit $x_i$

 The digit algorithm is given by the flowchart of Figure 3.15. Initially
 $S_{OP}$ is known and is equal to $S_1$.

 In terms of the notation of Section 2.4,

 ${}_{NR}F_i = S_{OP}$, $1 \le i \le j-2$

 = <Null>, $i \ge j-1$

 where

 $S_{OP} = S_1$ for i = 0,

 j is the index of the first zero digit in the operand d-vector,

 ${}_{NR}G_{i+1} = x_{i+1}$, $1 \le i \le j-1$ *

 and $\alpha_{NR} = 1$.
 
 3.6.5 Assimilation Recode (AR) microinstruction - This microinstruc- 
 tion is used to assimilate (convert) a signed-digit redundant operand into 
 an algebraically equivalent operand such that all the digits in the re- 
 coded form are of the same sign as the sign of original operand. 
 
 * In the actual implementation of the digit algorithm for NR, the ${}_{NR}G_{i+1}$
 information consists of $S_{i+1}$ and $Z_{i+1}$, where $S_{i+1}$ is a bit carrying
 sign information of digit $x_{i+1}$ and $Z_{i+1}$ is also a bit whose two states
 indicate whether digit $x_{i+1}$ is zero or not. (cf. Section 4.3.2.3.6.2)

 
 [Figure 3.15 Flowchart of the Digit Algorithm for Microinstruction NR.]

 
 Given a signed -digit operand of the form 
 
 X S . X;L x 2 ... Xl x. +2 ... x k 00 ... XfcH ... x n 
 
 such that 
 
 sign (x. + .) = sign (x. +2 ) and 
 
 sign (x. ..) = sign (x, ) = = sign (x, ) = sign (x. ) 
 
 That is, all the zero digits in the d-vector of number X have the same 
 sign as that of the first nonzero digit to the immediate right of a string 
 of zeros (of length >_ 1), the AR digit algorithm would recode X into X' 
 
 j\ — • X. Xa • • • A, A..., X . . f~, ••• A. X. . - • « • X« . _ • • • A 
 
 12 l l+l i+2 k k+1 k+£ n 
 such that 
 
 X = X' and 
 sign (xp = sign (x. + ,) = sign (x) V 1 * 1 1 *■ l n 
 
 Digit Algorithm 
 
 Let S_ = sign of the operand 
 
 S. = sign of the digit i 
 
 S = sign of the digit i+1 
 
 x = digit i of operand X 
 
 The digit algorithm is almost identical with that for NR micro- 
 instruction except that the microinstruction acts on all the digits of 
 the operand. It is given by the flowchart shown in Figure 3.16. 
 In the notation of Section 2.4, 
 AR F i = S 0P l i * < » 
 = S 1 
 
 AR G i+l = S i+1 ! 1 i 1 n 
 
 a._ = Variable depending on the d-vector 
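
 The effect of assimilation can be illustrated at the level of digit
 values. The sketch below is not the flowchart of Figure 3.16; it is a
 minimal value-level model, assuming a radix-r operand given as a list
 of signed digits (most significant first) whose sign is that of its
 first nonzero digit. The names `assimilate` and `operand_value` are
 hypothetical. Digits are processed from the least significant end with
 a one-digit borrow, so each digit is replaced by r - |x|, r - 1 - |x|,
 |x| - 1 or left unchanged.

```python
def operand_value(digits, r):
    """Value of a signed-digit operand, most significant digit first."""
    value = 0
    for d in digits:
        value = value * r + d
    return value


def assimilate(digits, r):
    """Recode signed digits so every nonzero digit has the operand's sign.

    Assumption: the operand sign is the sign of the first nonzero digit.
    """
    sign = 0
    for d in digits:
        if d != 0:
            sign = 1 if d > 0 else -1
            break
    if sign == 0:
        return list(digits)          # the operand is zero
    work = [sign * d for d in digits]  # treat the operand as positive
    out = [0] * len(work)
    borrow = 0
    for i in range(len(work) - 1, -1, -1):  # least significant digit first
        d = work[i] - borrow
        if d < 0:
            d += r                   # radix or diminished radix complement
            borrow = 1
        else:
            borrow = 0
        out[i] = d
    assert borrow == 0               # a positive operand never borrows out
    return [sign * d for d in out]
```

 For example, the radix-16 operand with digits (1, -3, 0, 5, -7) is
 recoded into (0, 13, 0, 4, 9), which has the same value.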
 
 Figure 3.16 Flowchart of the Digit Algorithm for
 Microinstruction AR.
 
 4. LOGIC DESIGN OF THE PROCESSING ELEMENT 
 
 4.1 Introduction 
 
 In this chapter, the logic design of the Processing Element (PE) is 
 developed and discussed in detail. The major components of the PE are 
 the Register File for the temporary storage of active operands, the Digit 
 Processing Logic (DPL) which is essentially a large combinational logic 
 circuit and the Processing Element Control Logic (PCL) which supplies 
 the control signals in proper temporal order to condition the combina- 
 tional DPL to execute the various microinstructions. The major consider- 
 ations in the logic design of the PE are the LSI technology constraints: 
 namely, the PE should require as few external pins as possible and that 
 the logical organization of the PE should have structural uniformity and 
 regularity. 
 
 Section 4.2 discusses the logic design of the data path structure of
 the PE, and Section 4.3 gives the logical organization and detailed
 design of the control algorithms for the generation of control signals.
 Finally, Section 4.4 discusses the logic complexity of the DPL and the 
 PE control logic in terms of the number of gates and the external pins 
 for the PE module as a function of the bit width of the PE module. 
 
 4.2 Block Diagram Description of a Processing Element 
 
 Figure 4.1 shows the schematic block diagram of a Processing
 Element. It consists of three main components: Digit Processing Logic
 (DPL), Register File and Control. The Register File comprises a set
 of digit-wide registers which are used to hold the operand digits and
 result digits. The registers could also be used to hold intermediate
 result digits temporarily. Inter-register transfer microinstructions
 operate on these registers. The Digit Processing Logic is essentially
 combinational processing logic (along with some storage for
 G-information) and is used to process the microinstructions of the PE.
 The DPL operates on the operand digits stored in the register file of
 the PE and G-information from its right neighboring PEs. It also
 generates the G-information for its left neighbor PEs. The Control
 issues the timing control signals to the processing logic for
 sequencing the various steps of the digit algorithms for the
 microinstructions. It also coordinates the actions of the PEs by
 accepting the microinstructions and G-information from neighboring PEs
 and by transmitting the microinstructions to the right neighbor and
 generating G-information for its left neighbor PE.

 [Figure 4.1: schematic block diagram of a Processing Element.]
 
 4.2.1 Register file - The register file is a set of registers that
 are used to hold the operand and result digits. Each PE retains one
 digit of each of the active operands. Each register is (k+1) bits long
 to hold the k magnitude bits and one sign bit of one $SM_r$-encoded
 radix-$2^k$ digit. There must be at least three registers in a PE: an
 accumulator register, a multiplier-quotient (MQ) register, and an
 operand register. However, the multi-sum microinstruction MS requires
 N operands and hence N storage registers, where N represents the
 number of operands the radix-$2^k$ adder in the DPL is capable of
 adding simultaneously. It was shown earlier in Sections 3.6.2.1.1 and
 3.6.2.1.2 that the radix-$2^k$ adder adds
 
 
 either (k+1) or 3 operand digits depending on how the algorithm for 
 microinstruction FMA is implemented. For the present discussion, we shall 
 assume that the adder is capable of adding (k+1) operand digits simultane- 
 ously and that k = 4. The register file would thus contain at least five 
 registers. Additional registers can be added to the register file. One 
 possible use of such registers is to hold the intermediate results which 
 are needed so soon after they are calculated that storing them and re- 
 trieving them from memory would unnecessarily delay the processing. The 
 number of desirable intermediate result registers is determined by the 
 method of communicating between memory and the arithmetic unit, the 
 number of extra pins, if any, required for the identification address of 
 these registers and their contribution to the overall logic complexity 
 of the chip. Figure 4.2 shows an internal register file containing five
 registers INR1, INR2, ..., INR5. In this thesis, registers INR1, INR2 and
 INR3 act respectively as accumulator, operand register and MQ-register.
 
 The registers in the register file are loaded from a buffer
 register, IBR, whose contents are determined by the Internal Register
 Input Bus Selector, sRIB, in the Digit Processing Logic (discussed
 later in Section 4.2.2.7.3). Similarly, the contents of the registers
 are input to the DPL either directly or through an Output Bus
 Selector, sROB, also in the DPL. The control signals gINR[x],
 x = 1, 2, ..., 5, for loading the registers are provided by the local
 control in the PE.
 
 Because of the bus mechanism for inputting and outputting data in
 the register file, any whole operand register (consisting of
 corresponding operand digit registers distributed in the various PEs)
 can be used as a shift register, capable of shifting one digit at a
 time. This is made use of in microinstructions LS and RS. R1, R2, ...,
 R5 denote the outputs (contents) of registers INR1, INR2, ..., INR5,
 respectively.

 Figure 4.2 Block Diagram of the Register File of
 the PE.
 
 4.2.2 Logic design of digit processing logic 
 
 4.2.2.1 Block diagram description of DPL - Figure 4.3 shows the
 data flow structure of the Digit Processing Logic (DPL) in block
 diagram form. It consists of three major components: the Digit Product
 Generator, DPG, a radix-$2^k$ multi-input adder, MIAD, and a Digit Sum
 Encoder, DSE. In addition to these three main components, there are
 selector networks sADR for the adder input, sDSE for the Sum Encoder,
 sROB for selecting the output of internal registers in the register
 file, sRIB for selecting the inputs to the in-bus of the register file
 and sTOP for selecting the contents of the 'Transfer' Output Port
 (TOP). Besides, there are two registers, GIR and APR, for storing the
 G-information and multiplicand digit from the adjacent right neighbor
 PE.
 
 The Digit Product Generator, DPG, forms the product array, in
 redundant binary form, of two $SM_r$-encoded radix-$2^k$ digits $m_i$
 and $\phi_i$. The digit $\phi_i$ comes from the operand register INR2
 via the output bus selector sROB, and the multiplier digit $m_i$ is
 input to the DPG from the microinstruction register MIR in the local
 control logic of the PE.
 
 The multi-input adder, MIAD, adds the columns of the redundant
 binary product array $w_i$ formed by the DPG and the collective
 product transfer $t^P_i$. The MIAD is also used to add the two operand
 digits $a_i$ and $\phi_i$ from the registers R1 and R2 for the
 microinstruction SS and to add the operand digits for the
 microinstruction MS. The radix-$2^k$ multi-input adder is made up of
 k stages of MIRBAs.

 The necessity of the register APR for the multiplicand digit from
 the adjacent PE would be clear from the discussion in Section 4.2.2.5.

 Figure 4.3 Block Diagram of Digit Processing Logic
 (DPL).
 
 The Digit Sum Encoder DSE converts the redundant binary sum output 
 of adder MIAD to the SM format for local storage in the accumulator 
 register INR1 of the register file or transfer out of the PE. The DSE 
 is also used in the microinstructions AR and NR for forming the radix 
 and diminished radix complement of the magnitude bits of the accumulator 
 register INR1 and also for subtracting unity from the magnitude of the 
 accumulator contents. In addition, sDSE and DSE are made use of in 
 inter-register transfer microinstructions TD and TI for direct and 
 reversed-sign inter-register transfer. 
 
 The Adder Input Selector sADR routes appropriate data in redundant 
 binary form to the MIAD inputs depending on the microinstruction the 
 PE is executing at that time. 
 
 The selector sDSE selects the appropriate input to the encoder DSE. 
 
 Also shown in Figure 4.3 are input and output ports designated
 as $TIP_i$, $RIP_i$ and $TOP_i$, $ROP_i$, respectively. The input ports
 $TIP_i$ and $RIP_i$ respectively carry the 'transfer' (carry or borrow)
 from the adjacent MIAD and the contents of some register in the
 register file of the adjacent $PE_{i+1}$. These ports essentially
 carry the 'G-information' from the adjacent $PE_{i+1}$ for the
 microinstruction that is being executed by the present $PE_i$. The
 output ports $TOP_i$ and $ROP_i$ are, however, shared to carry the
 'G-information' for the left neighbor $PE_{i-1}$ and also the address
 and data information respectively for the local operand memory
 $PEM_i$. This is made use of for fetching data from and storing data
 to the $PEM_i$ under the control of microinstructions LPM and SPM.

 The selector sTOP selects either the 'transfer' information from
 MIAD or the address bits and Read/Write bit for $PEM_i$.
 
 The details of the logic design of the various blocks described
 above are discussed in the following sections. Since the logic
 complexity of the major components DPG, MIAD, DSE and sADR depends on
 the choice of logic vector encoding for the redundant binary digits,
 the three logic vector encodings considered for study are described
 first, followed by the logic design details of the major components.
 
 4.2.2.2 Choice of logic vector encodings - As mentioned in Section
 3.4.1, the redundant binary ($RB_r$) mode of encoding for a radix-r
 signed digit is used for the arithmetic processing. Each redundant
 binary digit requires 2 bits for representation. There are nine
 distinct ways, under permutation and negation [40], of assigning three
 values ($\bar{1}$, 0, 1) to four states of two binary logic variables.
 Of the nine ways, three encodings were chosen for this study because
 they are the simplest as far as the conversion from the $SM_r$ mode to
 the chosen encoding for $RB_r$ mode is concerned. Let a radix-$2^k$
 signed digit $x_i$, encoded in $SM_r$ mode, be represented by a
 (k+1)-tuple $(S_i, x_{i_{k-1}}, x_{i_{k-2}}, \ldots, x_{i_0})$ such that

 $$x_i = (-1)^{S_i} \sum_{j=0}^{k-1} x_{i_j} 2^j, \qquad x_{i_j} \in \{0, 1\}.$$

 The corresponding $RB_r$ encoded form is given by

 $$x_i = \sum_{j=0}^{k-1} x^*_{i_j} 2^j, \qquad x^*_{i_j} \in \{\bar{1}, 0, 1\}.$$

 Let the redundant binary digit $x^*_{i_j}$ be represented by a 2-tuple
 logic vector $(\chi_{i_j}, x_{i_j})$ where

 $$\chi_{i_j}, x_{i_j} \in \{0, 1\}.$$

 The three logic vector encodings for the redundant binary digit
 $x^*_{i_j}$ considered in this research are given in Table 4.1.
 
 Table 4.1

 Logic Vector Encodings

 Redundant binary      Binary 2-tuple logic vector $(\chi_{i_j}, x_{i_j})$
 digit $x^*_{i_j}$     LVE1            LVE2        LVE3

 1                     0 1             0 1         0 1
 0                     0 0 or 1 1      0 0         D.C. 0
 $\bar{1}$             1 0             1 1         1 1
 
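 The conversion from $SM_r$ mode to the encoding format LVE3 is the
 simplest and is equivalent to attaching the sign $S_i$ of the $SM_r$
 encoded radix-$2^k$ digit to each magnitude bit $x_{i_j}$ individually.
 The conversions for the three encodings are given by

 LVE1: $x^*_{i_j} = x_{i_j} - \chi_{i_j}$

 $$x_{i_j} = x_{i_j} \oplus S_i, \qquad \chi_{i_j} = S_i \tag{4.1}$$

 where $\oplus$ stands for exclusive-OR, or

 $$x_{i_j} = \bar{S}_i \wedge x_{i_j}, \qquad \chi_{i_j} = S_i \wedge x_{i_j} \tag{4.2}$$

 LVE2: $x^*_{i_j} = x_{i_j} - 2\chi_{i_j}$, with $\chi_{i_j} \bar{x}_{i_j}$ disallowed

 $$x_{i_j} = x_{i_j}, \qquad \chi_{i_j} = S_i \wedge x_{i_j} \tag{4.3}$$

 LVE3: $x^*_{i_j} = (-1)^{\chi_{i_j}} \cdot x_{i_j}$

 $$x_{i_j} = x_{i_j}, \qquad \chi_{i_j} = S_i \tag{4.4}$$

 In each conversion equation, the $x_{i_j}$ on the right-hand side is
 the magnitude bit of the $SM_r$ encoded digit.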
 
 For the encodings LVE1 and LVE2, the conversion logic requires one
 exclusive-OR gate (Equation (4.1)) or two AND-gates (Equation (4.2)),
 and one AND-gate (Equation (4.3)) for each redundant binary digit
 respectively. For one radix-$2^k$ digit conversion, there are k
 redundant binary digits.

 Encoding LVE3 is essentially a sign and magnitude encoding of the
 redundant binary digit $x^*_{i_j}$ by the 2-tuple
 $(\chi_{i_j}, x_{i_j})$. Logic variables $\chi_{i_j}$ and $x_{i_j}$
 respectively act as sign and magnitude bits. This encoding format
 would also be referred to, in subsequent discussion, as $SM_b$ format,
 where subscript b indicates radix-2 (or binary).
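
 The three conversions can be checked at the value level with a short
 sketch. It assumes the readings $x^* = x - \chi$ (LVE1, using the
 exclusive-OR form of Equation (4.1)), $x^* = x - 2\chi$ (LVE2) and
 $x^* = (-1)^{\chi} x$ (LVE3); the function names are hypothetical and
 magnitude bits are listed least significant first.

```python
def to_bits(value, k):
    """Magnitude bits of a non-negative integer, least significant first."""
    return [(value >> j) & 1 for j in range(k)]


def sm_to_lve(sign, mag_bits, encoding):
    """Convert one sign-magnitude digit into k (chi, x) logic vectors."""
    pairs = []
    for x in mag_bits:
        if encoding == 1:        # Equation (4.1): one exclusive-OR per bit
            pairs.append((sign, x ^ sign))
        elif encoding == 2:      # Equation (4.3): one AND gate per bit
            pairs.append((sign & x, x))
        elif encoding == 3:      # Equation (4.4): the sign is simply attached
            pairs.append((sign, x))
        else:
            raise ValueError("encoding must be 1, 2 or 3")
    return pairs


def lve_digit_value(chi, x, encoding):
    """Value in {-1, 0, 1} of one redundant binary digit."""
    if encoding == 1:
        return x - chi
    if encoding == 2:
        if chi == 1 and x == 0:
            raise ValueError("state (chi=1, x=0) is disallowed in LVE2")
        return x - 2 * chi
    return -x if chi else x      # LVE3: sign and magnitude


def lve_value(pairs, encoding):
    """Value of a whole digit from its k logic vectors."""
    return sum(lve_digit_value(c, x, encoding) * (1 << j)
               for j, (c, x) in enumerate(pairs))
```

 For a negative digit of magnitude 5 (k = 4), all three encodings
 decode back to -5; in LVE1 the zero-valued positions appear as (1, 1)
 rather than (0, 0).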
 
 4.2.2.3 Logic design of RBA-2 (BU) - Let $\ell^*_\nu$, $m^*_\nu$
 denote the redundant binary inputs and $d^*_\nu$ denote the redundant
 binary output of an RBA-2. Further let $\ell^*_\nu$, $m^*_\nu$ and
 $d^*_\nu$ be respectively represented by the logic variable pairs
 $(\lambda_\nu, \ell_\nu)$, $(\mu_\nu, m_\nu)$ and
 $(\delta_\nu, d_\nu)$. Also let $t^+_\nu$, $t^-_\nu$ and
 $t^+_{\nu-1}$, $t^-_{\nu-1}$ be the input and output 'Transfers' of the
 RBA-2 as shown in Figure 4.4. In the configuration shown in Figure
 4.4, the RBA-2 is a cascade combination of a symmetric subtracter and
 a symmetric adder. Robertson [40] has given the logic equations for
 the symmetric subtracter and symmetric adder for all the nine distinct
 encodings referred to earlier. Using those results, the logic
 equations for the RBA-2 for the three logic vector encodings being
 considered here are given as follows:
 
 LVE, : d = X © £ © y © m © t 
 
 1 V V V V V ^ V 
 
 6 = t + 
 
 V V 
 
 t~ . = £ m V * (£ vm ) (4.5) 
 
 v-1 v v v v v 
 
 t = w y V t~ (wVu) 
 
 v-1 V V V V V 
 
 w = X © £ ©m 
 
 V V V V 
 
 Figure 4.4 Algebraic Design of a 2-input Redundant
 Binary Adder (RBA-2).
 
 This is schematically represented in Figure 4.5. Each box in the
 figure essentially represents a full adder with a slightly modified
 carry function. Figure 4.6 shows the logic implementation. This
 implementation requires 22 two-input NAND gates and the output digit
 $d^*_\nu$ is available after 12 gate delays. Figure 4.7 shows another
 logic implementation that requires 27 gates, but the output digit is
 available after only 9 gate delays. Note further that the logic in
 Figure 4.7 is no longer made of two identical logic substructures. The
 implementation of Figure 4.6 allows a simpler basic cell for LSI
 implementation of MIRBA at the cost of larger logic delay.
 
 LVE 2 : 
 
 d = £ © m © t + © t" 
 6 = t + (A t" © m ) 
 
 V V 
 
 t , = X V SL y 
 
 v-1 V V V 
 
 (4.6) 
 
 :,=t (U©y)vum)vy m £ . 
 
 v-1 V V ^ V V V V V V 
 
 The logic implementation of this adder using only 2-input NANDs is
 shown in Figure 4.8. Thirty-four two-input NAND gates are needed. The
 output digit is available 13 gate delays after the primary inputs
 $\ell^*_\nu$ and $m^*_\nu$ are stable, because the 'Transfer' input
 $t_\nu$ is available 7 gate delays after inputs $\ell^*_{\nu+1}$ and
 $m^*_{\nu+1}$.
 
 Figure 4.5 Schematic Functional Diagram of an RBA-2
 using LVE1.

 [Figure 4.6: logic implementation of an RBA-2 using encoding LVE1.]

 [Figure 4.7: alternative logic implementation of an RBA-2 using
 encoding LVE1.]

 [Figure 4.8: logic implementation of an RBA-2 using encoding LVE2.]
 
 LVE 3 : 
 
 d 
 
 V 
 
 ~ 
 
 I 
 
 V 
 
 
 6 
 
 V 
 
 = 
 
 + 
 t 
 
 V 
 
 d = £ © m © t + © t~ 
 
 V V v 
 
 (A. 7) 
 
 t n =X£vym£ 
 v-1 v v v v v 
 
 t ,= t ((«, © m ) V £ y ) V y I m 
 
 v-1 V vv vv v v \ 
 
 The logic implementation of this adder using only 2-input NANDs is
 shown in Figure 4.9. This RBA-2 realization requires 26 gates and the
 output $d^*_\nu$ is available 13 gate delays after the primary inputs
 of this RBA-2 and its adjacent RBA-2 are stable.

 Note that the lower gate delay for the sum output of the RBA-2 using
 LVE1 encoding is achieved because the logic variable $d_\nu$ is a
 function of $\ell^*_\nu$, $m^*_\nu$ and one transfer input only. In the
 other two encodings, $d_\nu$ is dependent on the second transfer input
 also.
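
 The limited transfer propagation of an RBA-2 can be illustrated with a
 value-level model. The sketch below does not reproduce Equations (4.5)
 to (4.7); it assumes one conventional two-stage decomposition in which
 a first stage extracts a negative transfer from the two input digits
 and a second stage extracts a positive transfer, so that the output
 digit depends on at most the two less significant neighbors. The names
 `rba2` and `rb_add` are hypothetical, and digits are listed least
 significant first.

```python
def rba2(l, m, t_neg_in, t_pos_in):
    """Value-level model of a 2-input redundant binary adder cell.

    l, m are digits in {-1, 0, 1}; t_neg_in is in {-1, 0} and t_pos_in
    is in {0, 1}. Returns (d, t_neg_out, t_pos_out) with d in {-1, 0, 1}.
    """
    s = l + m                               # -2 .. 2
    t_neg_out = -1 if s < 0 else 0          # depends on l and m only
    u = s - 2 * t_neg_out                   # 0 .. 2
    w = u + t_neg_in                        # -1 .. 2
    t_pos_out = 1 if w >= 1 else 0          # depends on one neighbor
    v = w - 2 * t_pos_out                   # -1 or 0
    d = v + t_pos_in                        # -1 .. 1
    return d, t_neg_out, t_pos_out


def rb_add(a, b):
    """Add two equal-length redundant binary numbers digit by digit."""
    digits = []
    t_neg = t_pos = 0
    for l, m in zip(a, b):
        d, t_neg, t_pos = rba2(l, m, t_neg, t_pos)
        digits.append(d)
    digits.append(t_neg + t_pos)            # transfers out of the top cell
    return digits


def rb_value(digits):
    """Value of a redundant binary number, least significant digit first."""
    return sum(d * (1 << j) for j, d in enumerate(digits))
```

 Because the negative transfer out of a cell never depends on a transfer
 input, a change in one input digit can affect at most three output
 digits, whatever the operand length.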
 
 4.2.2.4 Logic design of a radix-$2^k$ multi-input adder (MIAD) - The
 radix-$2^k$ adder MIAD is used for two purposes: 1) to add the columns
 of the redundant binary product array formed by DPG and 2) to form the
 sum of the operand digits for microinstructions SS and MS. Figure 4.10
 shows the schematic diagram of a radix-16 (k = 4) MIAD. It consists of
 4 MIRBAs of five inputs each. Each MIRBA has two outputs: one,
 $a^{MF}$, corresponding to the sum of all the five inputs (used in
 microinstructions MS and FMA), and another, $a^{SS}$, corresponding to
 the sum of only the two inputs (for microinstruction SS) to the
 left-most BU in the bottom level of the tree of BUs and RBA-3s making
 up a MIRBA. The proper data is routed to the inputs of MIRBAs by the
 adder input selector sADR. MATE and MATD are the encoders and decoders
 used for reducing the pins required for the 'Adder Transfers' $t^A_i$
 and $t^A_{i-1}$.

 [Figure 4.9: logic implementation of an RBA-2 using encoding LVE3.]

 [Figure 4.10: schematic diagram of a radix-16 (k = 4) MIAD.]
 
 In general, a radix-$2^k$ multi-input adder consists of a linear
 cascade of k MIRBAs. A (k+1) input MIRBA is implemented as a tree
 structure of RBA-2s and/or RBA-3s. Each MIRBA requires k RBA-2s,
 arranged in $L = \lceil \log_2 (k+1) \rceil$ levels. Therefore, the
 total number of gates required for the radix-$2^k$ adder is k times the
 gates needed for each MIRBA and is equal to $k^2$ times the gates
 required by each RBA-2 (plus those required by the D- and C-elements
 of RBA-3s). Also, the total gate delay for the sum output is L times
 the delay of each RBA-2 (ignoring the extra delay necessary for
 control and inter-PE communication of 'Adder Transfers' $t^A_i$ and
 $t^A_{i-1}$).

 For a (k+1) input adder, the number of pins required for the input
 and output Adder Transfers $t^A_i$ and $t^A_{i-1}$ is 2k each, which is
 large for large values of k. One way to reduce the pins necessary for
 $t^A_i$ and $t^A_{i-1}$ is to encode the output 'Adder Transfer' into
 an algebraically equivalent value and to use a corresponding decoder
 to decode the input 'Adder Transfer' into the form required by MIRBAs.
 The k RBA-2s of a MIRBA produce k 'positive transfers' and k 'negative
 transfers'. Thus the value of $t^A_i$, $t^A_{i-1}$ lies in the range
 -k to k and this can be encoded into $\lceil \log_2 (2k+1) \rceil$ bits
 (pins). However, the corresponding decoder is too complicated. A
 simpler design of the encoder, MATE, and decoder, MATD (shown dotted
 in Figure 4.10), results if the k positive and k negative 'Adder
 Transfers' are separately encoded into $\lceil \log_2 (k+1) \rceil$
 bits each. The corresponding decoder is simply a fan-out network such
 that the bit of weight w would fan out to w input 'Adder Transfers' of
 the corresponding sign. The encoder MATE simply consists of two adder
 networks which form
 
 the sum of k bits each. Each adder network requires less than k full
 adders arranged in approximately
 $(\lceil \log_2 k \rceil + \lfloor \log_2 k \rfloor - 2)$ levels [41].
 
 It should be noted that the decrease in total pin requirement for
 both $t^A_i$ and $t^A_{i-1}$ together, from 4k to
 $4 \lceil \log_2 (k+1) \rceil$, is obtained at the cost of introducing
 a new logic cell (full adder) in the radix-$2^k$ adder design and also
 more delay in the generation of the output 'Adder Transfer'.
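
 The complexity figures quoted above can be tabulated for a given k. The
 sketch below only restates the expressions in the text (k MIRBAs of k
 RBA-2s each, $\lceil \log_2 (k+1) \rceil$ tree levels, 4k transfer pins
 without encoding and $4 \lceil \log_2 (k+1) \rceil$ with MATE and
 MATD). The function name and dictionary keys are hypothetical, and the
 gates of RBA-3 elements, MATE and MATD are not counted.

```python
def miad_complexity(k, rba2_gates, rba2_delay):
    """Gate, delay and pin estimates for a radix-2^k multi-input adder."""
    levels = k.bit_length()      # equals ceil(log2(k + 1)) for k >= 1
    return {
        "mirbas": k,
        "rba2_cells": k * k,
        "levels": levels,
        "gates": k * k * rba2_gates,
        "sum_delay": levels * rba2_delay,
        "transfer_pins_unencoded": 4 * k,
        "transfer_pins_encoded": 4 * levels,
    }
```

 With k = 4 and the 26-gate, 13-delay LVE3 cell of Section 4.2.2.3, this
 gives 416 gates, three levels, and a reduction from 16 to 12 transfer
 pins.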
 
 4.2.2.5 Logic design of digit product generator (DPG) - The Digit
 Product Generator forms the product array of two signed radix-$2^k$
 digits. It accepts the two digits encoded in $SM_r$ format and outputs
 the product array in redundant binary. The logic for the DPG consists
 of three parts (Figure 4.11a):

 a) logic for generating the product magnitude digits,

 b) logic for generating the sign of the product, and

 c) logic for converting the magnitude digits to redundant binary form
 for input to MIRBAs. The gates required for this logic are dependent
 on the logic vector encodings chosen for the redundant binary digits.
 
 For the implementation of digit algorithm 1 of microinstruction FMA
 (Section 3.6.2.1.1), the logic for a) and b) consists of $k^2$ AND
 gates and one exclusive-OR gate respectively. The conversion logic
 requires $k^2$ exclusive-OR gates (Equation (4.1)) or $2k^2$ AND gates
 (Equation (4.2)) for encoding LVE1, $k^2$ AND gates (Equation (4.3))
 for LVE2 and none for the encoding LVE3.

 For the implementation of digit algorithm 2 of microinstruction FMA
 (Section 3.6.2.1.2), the logic for a) and b) consists of a ROM of bit
 capacity $2^{2k} \cdot 2k = k \cdot 2^{2k+1}$ and one exclusive-OR gate
 respectively.
 
 Figure 4.11a Schematic Diagram of Square Array DPG.

 Figure 4.11b Illustration of 'Adjacent Generation' of $t^P_{i-1}$.

 Figure 4.11c Illustration of 'Local Generation' of $t^P_i$.
 
 The conversion logic, however, requires only 2k exclusive-OR gates
 (Equation (4.1)) or 4k AND gates (Equation (4.2)) for the encoding
 LVE1, 2k AND gates (Equation (4.3)) for LVE2 and none for the encoding
 LVE3.
 
 The pins contributed by DPG to the pin complexity of DPL are those
 pins which are required for the 'Collective Product Transfers'
 $t^P_i$ and $t^P_{i-1}$. If $t^P_{i-1}$ is generated in $PE_i$, then
 the pins needed for transmission of $t^P_{i-1}$ to the adjacent
 $PE_{i-1}$ consist of one pin for the sign of $t^P_{i-1}$ and
 $(k-1) + (k-2) + \ldots + 1 = k(k-1)/2$ pins for the magnitude bits,
 assuming that the conversion to redundant binary form is done in
 $PE_{i-1}$. We shall call this method of generating $t^P_{i-1}$ and
 $t^P_i$ 'Adjacent Generation' (AG) of Collective Product Transfer
 (Figures 4.11b and 3.9).

 These pins can, however, be reduced to only (k+1) from $k(k-1)/2$ if
 the CPT $t^P_{i-1}$ ($t^P_i$) is generated locally in $PE_{i-1}$
 ($PE_i$) itself, where it is needed. $t^P_i$ ($t^P_{i-1}$) is a
 function of the multiplicand digit $\phi_{i+1}$ ($\phi_i$) in
 $PE_{i+1}$ ($PE_i$) and the multiplier digit $m_i$, the latter being
 the same in both $PE_i$ and $PE_{i+1}$ ($PE_{i-1}$). Thus $PE_i$
 ($PE_{i-1}$) needs to know only the multiplicand digit $\phi_{i+1}$
 ($\phi_i$) in $PE_{i+1}$ ($PE_i$) to generate $t^P_i$ ($t^P_{i-1}$),
 and this requires only (k+1) pins for the $SM_r$ encoded multiplicand
 digit $\phi_{i+1}$. We shall term this method of generating CPTs
 'Local Generation' (LG) of CPT. This is shown in Figure 4.11c. Figure
 4.11d shows a DPG using 'Local Generation' of $t^P_i$. In the LG
 method of generating CPTs, the logic for DPG requires one more
 exclusive-OR gate than for the AG method.

 For algorithm 2 of FMA, where the DPG is implemented in ROM, the
 pins required for $t^P_{i-1}$ are only (k+1): one for the sign of the
 product and k for the magnitude bits of the product. This is shown in
 Figure 3.10. In the block diagram of DPL shown in Figure 4.3, local
 generation of $t^P_i$ is assumed. The register APR is used to hold the
 multiplicand digit $\phi_{i+1}$ from the adjacent $PE_{i+1}$.

 Figure 4.11d Illustration of a Combination of an MIAD and
 DPG using 'Local Generation' of $t^P_i$.
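
 The gate and pin counts for the square-array DPG can be checked with a
 small model. The sketch below is an assumption-laden illustration, not
 the circuit of Figure 4.11a: it forms the $k^2$ AND terms of two
 sign-magnitude digits, groups them by column weight, and assumes that
 the columns of weight $2^k$ and above make up the collective product
 transfer. All names are hypothetical; bits are listed least significant
 first.

```python
def to_bits(value, k):
    """Magnitude bits of a non-negative integer, least significant first."""
    return [(value >> j) & 1 for j in range(k)]


def dpg_square_array(sign_m, m_bits, sign_phi, phi_bits):
    """Model of a k x k AND-array digit product generator."""
    k = len(m_bits)
    assert len(phi_bits) == k
    sign = sign_m ^ sign_phi                    # one exclusive-OR gate
    columns = [[] for _ in range(2 * k - 1)]
    for a in range(k):
        for b in range(k):                      # k * k AND gates
            columns[a + b].append(m_bits[a] & phi_bits[b])
    local = columns[:k]       # weights 2^0 .. 2^(k-1), assumed kept locally
    transfer = columns[k:]    # weights 2^k and above, assumed passed left
    return sign, local, transfer


def array_value(sign, local, transfer):
    """Signed value represented by the whole product array."""
    total = 0
    for weight, column in enumerate(local + transfer):
        total += sum(column) << weight
    return -total if sign else total
```

 For k = 4 the transfer part holds 3 + 2 + 1 = 6 bits, which matches the
 k(k-1)/2 magnitude pins quoted for Adjacent Generation.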
 
 4.2.2.6 Logic design of digit sum encoder - The Digit Sum Encoder
 (DSE) transforms the redundant binary sum output of the radix-$2^k$
 adder into an algebraically equivalent radix-$2^k$ sum digit in $SM_r$
 format for either local storage in the processing element or
 transmission out of the PE. The DSE is an iterative logic network and
 involves carry propagation. Its action can be described as a two-step
 process:

 a) determination of sign of the redundant binary sum digit and its
 conversion to an algebraically equivalent sum digit in 2's complement,
 and

 b) conversion of 2's complement form of the sum digit to $SM_r$ format.

 Figure 4.12a shows DSE in block diagram form. Let the input and output
 sum digit $x_i$ be respectively given by (4.8) and (4.9):

 $$x_i = \sum_{j=0}^{k-1} x^*_{i_j} 2^j, \qquad x^*_{i_j} \in \{\bar{1}, 0, 1\} \tag{4.8}$$

 $$x_i = (-1)^{S_i} \cdot \sum_{j=0}^{k-1} x_{i_j} 2^j, \qquad x_{i_j} \in \{0, 1\} \tag{4.9}$$

 where $x^*_{i_j}$ is represented by a 2-tuple logic vector
 $(\chi_{i_j}, x_{i_j})$ such that $\chi_{i_j}, x_{i_j} \in \{0, 1\}$.
 
 Figure 4.12a Block Diagram of Digit Sum Encoder (DSE).

 Figure 4.12b Logic Network Realization of RBTC.

 Figure 4.12c Logic Network Realization of TCSM ($\sigma_i = P_{i_k}$).
 
 The Redundant Binary to Two's Complement (RBTC) logic (Figure 4.12b)
 converts input $x_i$ to $y_i$ such that

 $$y_i = (-1) \cdot P_{i_k} \cdot 2^k + \sum_{j=0}^{k-1} y_{i_j} 2^j, \qquad y_{i_j} \in \{0, 1\} \tag{4.10}$$
 
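 The logic equations of the RBTC network for the three logic vector
 encodings of the input sum digit are given by

 LVE3:

 $$y_{i_j} = x_{i_j} \oplus P_{i_j}, \qquad j = 0, 1, \ldots, k-1 \tag{4.11}$$

 $$P_{i_{j+1}} = (\chi_{i_j} \wedge x_{i_j}) \vee (P_{i_j} \wedge \bar{x}_{i_j}) \tag{4.12}$$

 $$P_{i_0} = 0$$

 LVE2: Same as for LVE3.

 LVE1:

 $$y_{i_j} = x_{i_j} \oplus \chi_{i_j} \oplus P_{i_j} \tag{4.13}$$

 $$P_{i_{j+1}} = (\bar{x}_{i_j} \wedge \chi_{i_j}) \vee (P_{i_j} \wedge \overline{(x_{i_j} \oplus \chi_{i_j})}) \tag{4.14}$$

 $$P_{i_0} = 0$$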
 
 The logic equations for the logic network TCSM (Figure 4.12c) that
 converts the 2's complement form $y_i$ to the corresponding $SM_r$
 format are independent of logic vector encodings for the input sum
 digit. The logic equations are

 $$x_{i_j} = y_{i_j} \oplus (\sigma_i \wedge Z_{i_j}) \tag{4.15}$$

 $$Z_{i_{j+1}} = Z_{i_j} \vee y_{i_j} \tag{4.16}$$

 $$S_i = \sigma_i = P_{i_k}, \qquad Z_{i_0} = 0$$

 The signal $Z_{i_j}$, if equal to logical zero, implies that the binary
 digits $y_{i_0}, y_{i_1}, \ldots, y_{i_{j-1}}$ are all logical zero.
 
 The Digit Sum Encoder DSE logic is also used to achieve the radix
 ($2^k$) complement and diminished radix ($2^k - 1$) complement of the
 magnitude bits of ROB input to DSE via sDSE. Assuming logic vector
 encoding LVE3,

 $$\chi_{i_j} = 1, \quad j = 0, 1, \ldots, k-1; \qquad P_{i_0} = 0, \quad \sigma_i = 0$$

 will cause the radix complement of the magnitude bits to appear at the
 output of DSE, whereas

 $$\chi_{i_j} = 1, \quad j = 0, 1, \ldots, k-1; \qquad P_{i_0} = 1, \quad \sigma_i = 0$$

 will generate the diminished radix complement of the input magnitude
 bits. Similarly,

 $$\chi_{i_j} = 0, \quad j = 0, 1, \ldots, k-1; \qquad P_{i_0} = 1, \quad \sigma_i = 0$$

 will subtract unity from the value of the input magnitude bits.

 These particular values of $\chi_{i_j}$, $P_{i_0}$ and $\sigma_i$ are
 made use of in the processing of microinstructions AR and NR, as
 described in Section 4.3.2.3.7.
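
 The two-step action of the DSE and its three special uses can be
 exercised with a bit-level sketch. It assumes encoding LVE3 and the
 reading of Equations (4.11), (4.12), (4.15) and (4.16) with
 $P_{i_0}$ as the initial borrow, $Z_{i_0} = 0$ and $\sigma_i = P_{i_k}$
 unless forced by the control. The function names are hypothetical;
 bits are listed least significant first.

```python
def rbtc(pairs, p0=0):
    """Redundant binary (LVE3 pairs) to 2's complement bits plus P_k."""
    p = p0
    y = []
    for chi, x in pairs:
        y.append(x ^ p)                     # Equation (4.11)
        p = (chi & x) | (p & (1 - x))       # Equation (4.12)
    return y, p


def tcsm(y, sigma):
    """2's complement bits to magnitude bits, given the sign sigma."""
    z = 0                                   # Z_0 = 0
    x = []
    for bit in y:
        x.append(bit ^ (sigma & z))         # Equation (4.15)
        z = z | bit                         # Equation (4.16)
    return x


def dse(pairs, p0=0, sigma=None):
    """Digit Sum Encoder: returns (sign, magnitude bits)."""
    y, pk = rbtc(pairs, p0)
    s = pk if sigma is None else sigma      # sigma = P_k unless forced
    return s, tcsm(y, s)


def bits_value(bits):
    """Integer value of a list of bits, least significant first."""
    return sum(b << j for j, b in enumerate(bits))
```

 With k = 4 and input magnitude 5, forcing $\chi = 1$, $P_{i_0} = 0$,
 $\sigma_i = 0$ yields 11 (the radix complement), $P_{i_0} = 1$ yields
 10 (the diminished radix complement), and $\chi = 0$, $P_{i_0} = 1$
 yields 4.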
 
 However, in the case of inter-register transfer microinstructions
 TD and TI, the magnitude bits at the input and output are to remain
 unchanged; the sign bit, $S_i$, is equal to the complement of
 $S_{ROB}$ for microinstruction TI, where $S_{ROB}$ denotes the sign of
 the digit on the bus ROB.
 
 From Figures 4.12(b) and 4.12(c), we see that RBTC and TCSM consist 
 of k-stages each of identical logic cells. Each cell requires four 
 2-input NAND gates and one exclusive-OR (EX) gate. An EX-gate is 
 equivalent to four 2-input NAND gates. Therefore the total number of 
 gates, G_DSE, required by DSE logic using logic vector encoding LVE_2 or
 LVE_3 is given by

      G_DSE = 16k + C_1                                        (4.17)

 where C_1 is a constant and gives the gates necessary for generation of
 §. and c. under various control signal conditions. Use of logic vector 
 encoding LVE_1 will raise G_DSE to 26k + C_1.
 
 In the remainder of this chapter, we shall assume only the sign-magni- 
 tude logic vector encoding LVE- for the redundant binary digit because 
 
 a) conversion from sign-magnitude format SM to KB format is the 
 simplest for logic vector encoding LVE„ for the redundant binary digit, 
 as shown in Equation (4.4), and 
 
 b) the number of gates required for the logic implementation of 
 the Digit Sum Encoder, DSE, is less in the case of the encoding LVE-, 
 than that of LVE.. whereas the gates required for an RBA-2 are comparable 
 for both the encodings LVE, and LVE . The encoding LVE„ is too expensive 
 for the logic implementation of an RBA-2 and hence of MIAD which is the 
 major consumer of gates in the DPL. 
 
 4.2.2.7 Logic design of selector networks - Since the Adder MIAD 
 and Digit Sum Encoder, DSE, are shared by more than one microinstruction, 
 selector networks are needed in order to route appropriate data to the 
 inputs of these processing logics. These selector networks also do re- 
 formatting of data, if necessary. Besides the selector networks sADR 
 and sDSE, two more selector networks, sROB and sRIB, are necessary for 
 transferring data out of and into the various registers of the register 
 file. In addition, selector sTOP is used to choose the contents of out- 
 put port TOP. In the following discussion, logic vector encoding LVE- 
 for the redundant binary digit is assumed. 
 
 4.2.2.7.1 Logic design of adder input selector (sADR) - Selector
 sADR accepts inputs from two sources: 1) the 'Product Array' w_i and
 the 'Collective Product Transfer' array t^P_i, along with their
 corresponding signs Sw_i and St_i, from the output of DPG and 2) the
 contents R2,...,R5 of the internal registers INR2,...,INR5 of the
 register file. These inputs are in sign-magnitude format, SM. Depending
 on the microinstruction, the sADR directs the appropriate data,
 reformatted in redundant binary form, to the inputs of the MIAD. For the
 microinstruction FMA, the outputs of DPG are routed so that the redundant
 binary elements of the 'product' and 'transfer' arrays are added by
 MIRBAs of appropriate weights. In the case of microinstruction SS, only
 the contents R2 of register INR2 are input to the adder, and for
 microinstruction MS, the contents of one or more of the four registers
 INR2,...,INR5 are directed to the input of the adder. The contents R1 of
 the accumulator register INR1 are input directly to the adder. The logic
 networks for the selector sADR for radix 16 (k = 4) are shown in Figures
 4.13a and 4.13b. Figure 4.13a shows the selector for the magnitude bits
 and Figure 4.13b shows the generation of appropriate sign bits for the
 redundant binary adder inputs. SRj (j=1,...,5) indicates the sign bit of
 inputs R1,...,R5.
 
 The control signals Rj sADR (j=2,3,4,5) and SWTsADR are provided by 
 local control logic in the PE. Since the selector networks have no 
 memory and the data at the input of adder MIAD must be continuously 
 available throughout the processing of microinstructions SS, MS and FMA, 
 the selector control signals are permanently tied to the appropriate 
 outputs of the microinstruction decoder. 
 
 For any radix-2^k, and assuming that the adder MIAD is made up of
 k stages of (k+1)-input MIRBAs, the gates required for magnitude and
 sign bits (using the sign-magnitude logic vector encoding for redundant
 binary) are respectively 3k^2 and (3k+1). Denoting by G_sADR the total
 number of gates for selector sADR for a radix-2^k PE, we have
 
 [Figure 4.13a Logic Implementation of Selector sADR for Magnitude Bits.]

 [Figure 4.13b Logic Implementation of Selector sADR for Sign Bits.
 Control inputs: R2sADR, R3sADR, R4sADR, R5sADR, SWTsADR. Sign inputs:
 SR1, ..., SR5, Sw_i, St_i. Outputs: sign bits to the MIRBA stages.]
 
      G_sADR = 3k^2 + (3k+1)

             = 3k^2 + 3k + 1                                   (4.18)
 
 4.2.2.7.2 Logic design of digit sum encoder selector (sDSE) -
 The selector sDSE (shown in Figure 4.14) accepts inputs from two
 sources: the two outputs a*SS_j and a*MF_j (j=0,1,...,k-1) of the adder
 MIAD and the bus ROB_j (j=0,1,...,k-1). The control signals ASSsDSE,
 AMFsDSE and ROBsDSE respectively select the MIAD outputs a*SS_j and
 a*MF_j, corresponding to microinstructions SS, FMA and MS, and the bus
 ROB. The control signal SCHI appropriately sets the sign bit x_j
 (j=0,1,...,k-1) of the redundant binary input X_j of the DSE to achieve
 radix complement or diminished radix complement or direct transfer of
 the magnitude bits of ROB. This was explained earlier in Section 4.2.2.6.
 Figure 4.14 shows the logic implementation of sDSE for radix-16. Since
 the selector sDSE has no memory, the appropriate control signals must be
 held active throughout the processing of the microinstruction.
 
 For any radix-2^k, the total number of NAND gates, G_sDSE, required
 for the logic of sDSE is given by

      G_sDSE = 7k.                                             (4.19)
 
 4.2.2.7.3 Logic design of selectors sRIB, sROB and sTOP - The
 selector sRIB is a three-input multiplexer whose input sources are
 the DSE, APR and the digit field of microinstruction register MIR in the
 control logic of the PE. The selector is one digit wide, and the number
 of NAND gates required for the logic implementation of this selector,
 shown in Figure 4.15, is given by G_sRIB, where
 
 [Figure 4.14 Logic Implementation of Selector sDSE. Control inputs:
 SCHI, ASSsDSE, AMFsDSE, ROBsDSE.]

 [Figure 4.15 Logic Implementation of Selector sRIB. Control inputs:
 DSEsRIB, APRsRIB, MIRsRIB. Data inputs: DSE, APR, MIR. Output: sRIB.]
 
      G_sRIB = 4(k+1).                                         (4.20)
 
 The control signals DSEsRIB, APRsRIB and MIRsRIB, respectively select 
 the sources DSE, APR and MIR. The path from MIR to sRIB is made use of 
 in the processing of microinstructions RS and LPM. 
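
 The behavior of the three-input selector can be pictured with a small
 Python sketch. It is a behavioral model only, not the NAND network of
 Figure 4.15; digits are represented as integers, and the requirement
 that exactly one select signal be active is an assumption.

```python
def srib(dse, apr, mir, DSEsRIB=0, APRsRIB=0, MIRsRIB=0):
    """Return the digit (as an integer) chosen by the active select line."""
    if DSEsRIB + APRsRIB + MIRsRIB != 1:
        raise ValueError("exactly one select line is assumed active")
    # AND each source with its select line, then OR the results together.
    return (dse * DSEsRIB) | (apr * APRsRIB) | (mir * MIRsRIB)


print(srib(dse=0b10011, apr=0b00101, mir=0b01110, MIRsRIB=1))  # 14
```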
 
 The width of selector sTOP (Figure 4.16) is equal to the width of
 output port TOP_i. The width of TOP_i is determined by the number (=k+1)
 of bits required for the address space of PEM_i plus one more for the
 Read/Write function of PEM_i, and by the bits, P_{t^A_{i-1}}, required for
 the 'Adder Transfer' t^A_{i-1}, which depends on the method of encoding
 used for t^A_{i-1}. Assuming the width of TOP_i to be b, b is given by

      b = Max(k+2, P_{t^A_{i-1}}).

 For MIADs using encoders and/or redundancy ratio δ ≤ 2/3, k+2 is greater
 than P_{t^A_{i-1}}. Therefore the number of gates, G_sTOP, required for
 the logic implementation of selector sTOP is

      G_sTOP = 3(k+2).                                         (4.21)
 
 The selector sROB selects the contents of one of the registers of
 the register file on to the register file output bus ROB. The gates
 required for this network are dependent on the number of registers in
 the register file and the bit width of the registers. For radix-2^k,
 the register width is (k+1) bits and, assuming (k+1) registers in the
 register file, the total gates required are

      G_sROB = (k+1)(k+2).                                     (4.22)

 Figure 4.17 shows the logic implementation of sROB for radix-2^k (k=4).
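
 For a quick comparison of the selector costs, Equations (4.18) through
 (4.22) can be tabulated for a given k. The sketch below simply evaluates
 those formulas; the function name is hypothetical, and the sTOP and sROB
 entries inherit the assumptions stated in the text (k+2 exceeds the
 adder transfer width, and the register file has k+1 registers).

```python
def selector_gate_counts(k):
    """Gate counts of the selector networks for a radix-2^k PE."""
    return {
        "sADR": 3 * k**2 + 3 * k + 1,  # Equation (4.18)
        "sDSE": 7 * k,                 # Equation (4.19)
        "sRIB": 4 * (k + 1),           # Equation (4.20)
        "sTOP": 3 * (k + 2),           # Equation (4.21)
        "sROB": (k + 1) * (k + 2),     # Equation (4.22)
    }


counts = selector_gate_counts(4)  # radix 16
print(counts)
print(sum(counts.values()))       # 157
```

 For radix 16 the adder input selector sADR dominates with 61 gates,
 since it is the only selector whose cost grows quadratically in k.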
 
 [Figure 4.16 Logic Implementation of Selector sTOP. Control inputs:
 TAsTOP, MIRsTOP. Output TOP_i goes to PEM_i and/or PE_{i-1}.
 Note: MIR<4:0> are PEM address bits; MIR<8> is Read/Write bit.]

 [Figure 4.17 Logic Implementation of Selector sROB. Control inputs:
 R1sROB, ..., R5sROB. Data inputs: sign bits SR1, ..., SR5 and the
 magnitude bits of R1, ..., R5. Outputs: S_ROB and the ROB magnitude bits.]

 
 Note that these selectors have no memory. The control signals 
 would have to remain active throughout the processing of a micro- 
 instruction. Although the selectors are shown to have separate control 
 signals, fewer control signals with a local decoder would suffice. But 
 the separate control signals are shown for the ease of exposition because 
 the separate names of the signals help to identify the sources easily. 
 
 4.2.2.7.4 Storage buffer registers of DPL - In addition to the com-
 binational logic for processing and the selector networks for proper data
 routing, the DPL has three buffer registers GIR, APR and IBR. GIR and APR
 are used to hold the G-information from the adjacent PE. The register
 GIR holds the 'Adder Transfer' t^A from the adjacent PE_{i+1}, and the
 register APR stores the multiplicand digit φ for the local generation of
 'Product Transfer' t^P in the DPG. The width of APR is (k+1) bits for
 radix-2^k and the width of GIR depends upon the bit requirements of t^A.
 At the maximum, it is equal to 2k if no encoder MATE is used in the design
 of adder MIAD. However, if the number of inputs to the MIRBAs is reduced,
 either by changing the redundancy of the multiplier digit or by any other
 means, then the bit width of GIR would be correspondingly reduced.
 Assuming that the 'encoders' and 'decoders' MATE and MATD are not used
 for the adder transfers t^A, and that the number of inputs to each MIRBA
 is k' for a radix-2^k adder, the bit width of GIR is 2(k'-1).

 The register APR is also used in the processing of left shift
 microinstruction LS, where it holds the shifted digit from the
 right neighbor PE_{i+1} temporarily before it is stored in the register
 file via the selector sRIB.
 
 Since the outputs of internal registers of the register file are 
 directly and permanently connected to the input of combinational 
 processing logic, it is necessary to provide a buffer register, IBR at 
 the output of selector sRIB. The output of the selector sRIB is gated 
 into the register IBR and thus isolates the input bus of the registers 
 from any changes which might occur due to feedback through the combina- 
 tional logic when the contents of the buffer register IBR (i.e., the 
 result digit) are transferred to the appropriate register in the 
 register file. 
 
 The bit-width of the register IBR is (k+1) for a radix-2^k digit.
 
 4.3 Design of PE Control 
 
 The processing of a microinstruction in the PE requires the activa- 
 tion of the various data paths and the conditioning of combinational 
 transformation logic of the DPL, in a certain temporal order depending 
 on the nature of the microinstruction. These time ordered activation 
 control signals are generated by the PE Control Logic (PCL) which is 
 locally resident in the PE. 
 
 Another function of the local PE control is to coordinate the 
 actions of the PEs not only to obtain 'G' information from adjacent 
 PEs for the processing of the microinstruction but also to receive 
 and transmit the microinstructions from and to the adjacent PEs. The 
 latter is necessary to process a 'machine instruction'. Each PE 
 executes the same sequence of microinstructions which is issued by MCU 
 depending on the 'machine instruction' to be processed by the Arithmetic 
 Unit and the specific operand values. After executing microinstruction
 j-1, a typical PE, PE_i say, must determine the value of _jG_i and inform
 PE_{i-1} of its availability. _jG_i is needed by PE_{i-1} to execute
 microinstruction j. PE_i also passes the jth microinstruction and modi-
 fier to PE_{i+1} so that PE_{i+1} will determine _jG_{i+1} in cooperation
 with PE_{i+2}, if necessary. When PE_i receives _jG_{i+1}, it performs
 microinstruction j and begins the procedure for microinstruction j+1.
 
 The control strategy for implementing the coordination of the 
 various PEs can be either synchronous or asynchronous. In the former 
 case, all the PEs act in synchronism with some central clock whereas 
 in the asynchronous case, all the activities are controlled by request- 
 response signals. In this paper, asynchronous control with request- 
 response signals is chosen because of the following advantages: 
 
 a) It avoids the clock-skew problems when a large number of PEs 
 are concatenated together for high precision of arithmetic. 
 
 b) Due to the pipeline nature of processing, different PEs at any 
 instant are executing different microinstructions which take different 
 times to execute. The request-response strategy will provide overall 
 better average speed of processing. 
 
 c) The asynchronous control is compatible with the 'localized' 
 nature of processing and an autonomous and modular arithmetic element. 
 
 However, it does have the disadvantage of increasing the number 
 of pins required for the PCL. 
 
 4.3.1 Logical organization of PE control - The PE control is 
 organized as a set of six interacting subcontrols some of which are 
 active concurrently while others are activated in sequence, depending on 
 the nature of the control algorithm for the microinstruction. Concurrently 
 interacting controls allow an average speed up in the processing of micro- 
 instructions by allowing independent operations to take place in parallel. 
 
 Figure 4.18 shows the various subcontrols and their interaction. 
 The division of PE control into subcontrols is based on a functional 
 grouping of the various steps in the control flow. The various sub- 
 controls are R-control, T-control, G-control, E-control, F-control and 
 DM-control. 
 
 The Decode and Main or DM-control is the main control which super- 
 vises and coordinates the actions of other subcontrols. It handles the 
 decoding of the microinstruction, sets up the necessary data paths in 
 DPL, and then chooses the proper subcontrols and their temporal order 
 for the execution of the control algorithm of the microinstruction. (In 
 a crude software analogy, the DM-control can be considered as the Main 
 procedure and other subcontrols which are invoked by DM-control as 
 subroutines.) 
 
 The Receive or R-control and the Transmit or T-control are the
 primary controls for the coordination of PEs. R-control is concerned
 with accepting the microinstruction from the left neighbor PE_{i-1} and
 acknowledging the receipt of the microinstruction (OP-code μ_j and the
 modifier field _jF_i). The T-control transmits the received micro-
 instruction with the same or a new modified F-field _jF_{i+1}, depending
 on the nature of the microinstruction, to PE_{i+1}.
 
 The G-control and E-control together can be considered as consti-
 tuting the main processing controls for the microinstruction. The
 G-control generates the G-information for the left neighbor PE_{i-1} and
 accepts the G-information from the right neighbor PE_{i+1}. The Execute
 or E-control activates the necessary control signals to the combinational
 logic to calculate the result digit and gate it into the appropriate
 internal register of the register file. In addition, the status of the
 digit in the accumulator register is set. The status checking involves
 determining the sign and magnitude of the digit. If the accumulator
 digit is zero, the sign of the digit is considered to be unknown.

 The F-control is used when a new value of the modifier field, different
 from that received, has to be sent to the right neighbor PE_{i+1}. It is
 made use of in the right shift microinstruction RS.
 
 4.3.1.1 Global description of interaction of subcontrols - 
 Figure 4.18 shows the interaction of the various subcontrols. It 
 should be noted that Figure 4.18 does not show the hierarchical order 
 in which the various subcontrols are invoked by DM-control but only 
 shows a gross overview of the interaction. The specific temporal order 
 of the various subcontrols in the control sequence of any microinstruc- 
 tion is discussed later in Section 4.3.2.3.3. 
 
 The control sequence for every microinstruction begins in the R-
 control. The R-control, on receiving a go-ahead signal from DM-control to
 accept another microinstruction from the left neighbor PE_{i-1}, accepts
 the microinstruction and acknowledges its receipt. It also invokes the
 DM-control. The DM-control decodes the microinstruction, sets up the
 data paths in the DPL and invokes one or more of the F, G, T and E
 controls depending on the microinstruction type. The F-control makes the
 changes in the modifier field of the microinstruction and calls on the
 T-control to transmit the modified microinstruction to PE_{i+1}.
 F-control is invoked only for the right shift microinstruction RS. If
 the processing of the microinstruction requires G-information, the
 G-control and T-control are invoked in parallel. The G-control can be
 conceptually considered as comprising two subcontrols: G_gn-control,
 which generates G-information for the microinstruction executing in
 adjacent PE_{i-1}, and G_ap-control, which accepts G-information from the
 right neighbor PE_{i+1}. (In the case where G-information depends
 logically on two or more right neighboring PEs (e.g., microinstructions
 FMA, AR), the subcontrols G_gn-control and G_ap-control interact with
 each other.) After the necessary G-information for the execution of the
 microinstruction has been obtained, the G_ap-control branches to
 E-control for the execution of the microinstruction.

 [Figure 4.18 Logic Organization of PE Control Signal Generator.
 Legend: INVOKE; RETURN (REPLY).]
 
 In those cases when G-control is not invoked by DM-control because
 no G-information is needed from adjacent neighbors (e.g., microinstruc-
 tions TD, TI, LDC), the DM-control directly calls upon the E and T
 controls in parallel. The T-control transfers the microinstruction to
 the right neighbor PE_{i+1}.

 As the various invoked subcontrols finish their sequence operations,
 they report back to the DM-control. When all the invoked subcontrols
 are finished, the DM-control replies back to the R-control, which was
 suspended earlier from accepting any more microinstructions. The
 R-control is now again ready to accept another microinstruction and the
 control sequence begins again.
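
 The invoke-and-reply pattern described above can be mimicked in
 software with a fork followed by a join. The sketch below is a loose
 analogy only: threads stand in for concurrently active subcontrols, the
 function and parameter names are hypothetical, and the F-control path
 used by microinstruction RS is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def dm_control(needs_g, log):
    """Hypothetical model of one DM-control pass (F-control omitted)."""

    def t_control():
        log.append("T")      # pass the microinstruction to the right neighbor

    def e_control():
        log.append("E")      # compute and store the result digit

    def g_control():
        log.append("G")      # exchange G-information with the neighbors
        e_control()          # G-control branches to E-control afterwards

    first = g_control if needs_g else e_control
    with ThreadPoolExecutor(max_workers=2) as pool:
        replies = [pool.submit(first), pool.submit(t_control)]  # FORK
        for reply in replies:                                   # JOIN
            reply.result()
    log.append("reply to R")  # R-control may now accept the next one


log = []
dm_control(needs_g=True, log=log)
print(log)
```

 The relative order of "T" against "G" and "E" in the log is not fixed,
 which mirrors the point that the T-control runs concurrently with the
 processing subcontrols; only the final reply is guaranteed to come last.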
 
 4.3.2 Logic design of PE control 
 
 4.3.2.1 Block diagram description of PE control logic (PCL) - 
 
 Figure 4.19 shows the major components of the PCL in block diagram
 form. It consists of a microinstruction register MIR, the selector
 network sMIR, the 'Zero magnitude and Sign Detector' ZSD, and the timing
 control signal generator TCS. The register MIR is 11 bits wide and
 is used to hold the microinstruction, received on the microinstruction
 input port MIP from adjacent PE_{i-1}, during processing by PE_i. The
 selector sMIR is a two-way multiplexer which chooses either MIP or ROB
 from the DPL as the appropriate source of data for the bits <4:0> of MIR.
 The ZSD is a combinational logic block which monitors the sign and
 magnitude bits of the accumulator register INR1. It sets flip-flop
 Z_i to logical state '1' if the magnitude of the accumulator in PE_i is
 zero. Flip-flop S_i is set to the state of the sign bit SR1 of the
 accumulator register. The TCS generates the timing signals for the
 activation of data paths and processing logic in DPL and for the
 coordination of the adjacent PEs.
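
 The function of the ZSD can be stated in a few lines of Python. This is
 a behavioral sketch under the assumption that the accumulator digit is
 available as one sign bit and a list of magnitude bits; the function
 name is hypothetical.

```python
def zsd(sr1, magnitude_bits):
    """Return the flip-flop states (Z, S) for the accumulator digit."""
    z = 0 if any(magnitude_bits) else 1  # Z = 1 when the magnitude is zero
    s = sr1                              # S follows the sign bit SR1
    return z, s


print(zsd(1, [0, 0, 0, 0]))  # (1, 1)
print(zsd(0, [0, 1, 0, 0]))  # (0, 0)
```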
 
 The generation of the appropriate control signals and their temporal 
 order depends on the microinstruction — its digit algorithm and the data 
 flow structure of DPL. 
 
 [Figure 4.19: Block diagram of the PE control logic (PCL).]

 4.3.2.2 Design and description of microinstruction formats - The
 major considerations in the design of the various microinstruction
 formats are:
 
 a) the bit width of the microinstruction should be as small as 
 possible so that the pins required for the input port MIP be least, and 
 
 b) the microinstructions should be powerful so that they take full 
 advantage of the data flow structure of the DPL and facilitate the micro- 
 programming of the 'machine instruction'. 
 
 These two aims are conflicting in nature because b) requires a large 
 instruction width. A compromise was achieved by using varying number of 
 bits for the OP-code of the microinstruction. 
 
 Basically, each microinstruction has an OP-code field μ_j and a modi-
 fier field _jF_i, as was discussed in Section 2.6. The basic OP-code field
 is 3 bits long and the modifier field depends on the bit width (radix of
 arithmetic processing) of the PE and the number of addressable registers
 in the register file. The modifier field _jF_i is further divided into two
 subfields: one field carries the address of the register in the register
 file of the PE and the other field carries either a digit or the address
 of the PEM location in local operand mantissa memory. For some micro-
 instructions, these fields are used for other purposes.
 
 Figure 4.20 shows the specific OP-code bit assignment and the
 formats for various microinstructions. In this figure, it is assumed
 that the bit width of the PE is 5 bits (that is, radix is 16) and that
 there are 5 (=k+1) registers in the register file of the DPL. The micro-
 instructions LPM, SPM, RS, LS and LDC have three bit OP-codes whereas
 microinstructions TD and TI have four bit OP-codes. The OP-codes for
 microinstructions SS, MS and FMA are six bits long whereas for AR and NR,
 they are seven bits long. The varying length of the OP-code allows a
 basic three bit field for OP-codes, otherwise a straightforward coding
 of 12 microinstructions would have required four bit OP-codes.

 [Figure 4.20 Microinstruction Codes and Formats. Each microinstruction
 occupies bits 10 to 0: OP-code μ_j followed by modifier field _jF_i. The
 OP-code bit patterns are not legible in this copy; the field
 definitions are as follows.]

 MNEMONIC  FIELDS                                       OPERATION
 LPM       A1: Destination File Register Address        INR[A1] ← PEM[A2]
           A2: Source PEM Address
 SPM       A1: Source File Register Address             PEM[A2] ← INR[A1]
           A2: Destination PEM Address
 RS        A1: Address of File Register to be shifted
           D1: Digit from left neighbor PE
 LS        A1: Address of File Register to be shifted
           D2: Digit, sent by MCU, to be stored in
               least significant PE
 LDC       D1: Digit to be loaded in File Register      INR[A1] ← D1
           A1: File Register Address
 TD        A1: Source File Register Address             INR[A2] ← INR[A1]
           A2: Destination File Register Address
 TI        A1: Source File Register Address             INR[A2] ← (-1) · INR[A1]
           A2: Destination File Register Address
 SS        (none)                                       INR1 ← INR1 + INR2
 MS        A3: Source File Register Addresses. A '1'    INR1 ← sum of the
               in bit j indicates that file register    selected registers
               INR(j+1) will take part in Multi-Sum
               Addition.
 FMA       D3: Multiplier Digit                         INR1 ← INR1 + D3 * INR2
 AR, NR    S_p: Sign of the Operand
               (1 implies -ve, 0 implies +ve)
 
 It should be noted that the use of a more restricted set of micro- 
 instructions could have reduced the bit width of the microinstructions 
 at the cost of less flexibility in microprogramming capability. 
 
 In general, for a radix-2^k arithmetic structure, the bit width of
 a PE digit is (k+1), and assuming the register file to consist of (k+1)
 registers, the bits required for a microinstruction are given by

      I_b = Instruction width in bits

          = 3 + ⌈log_2(k+1)⌉ + (k+1)

          = k + ⌈log_2(k+1)⌉ + 4.                              (4.23)
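
 As a check on Equation (4.23), the width can be evaluated for the
 radix-16 case, where the text gives an 11-bit register MIR. The sketch
 below is illustrative; the function name is hypothetical, and
 int.bit_length is used as an exact integer substitute for the ceiling
 of log_2(k+1), valid for k >= 1.

```python
def instruction_width(k):
    """Microinstruction width in bits for a radix-2^k PE, per (4.23)."""
    opcode_bits = 3
    register_address_bits = k.bit_length()  # equals ceil(log2(k + 1)) for k >= 1
    digit_bits = k + 1
    return opcode_bits + register_address_bits + digit_bits


print(instruction_width(4))  # 11
```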
 
 A description of the various microinstructions was given earlier 
 in Section 2.6. The function of each microinstruction, briefly, is 
 again given below. 
 
 The memory access microinstructions LPM and SPM are respectively 
 used to fetch data from and store data to the processing element memory 
 PEM associated with the PE. The microinstruction field A2 gives the 
 location in PEM and field Al identifies the register in the register 
 file. 
 
 In the shift microinstructions RS and LS, A1 identifies the
 register to be shifted. The field D1 carries the digit from the regis-
 ter in the left adjacent PE and D2 identifies the digit which must be
 loaded in the register of the least significant PE. This facility is
 made use of in multiplication, where the digit shifted out of the most
 significant PE during left shift of partial products has to be saved in
 the least significant digit position of the multiplier operand register.
 
 The field Al in microinstruction LDC identifies the register to be 
 loaded with the digit given in field Dl. 
 
 In microinstructions TD and TI, the Al and A2 respectively identify 
 the source and destination registers in the register file. Note that A2 
 can be equal to Al in microinstruction TI, whereas such a condition in 
 TD is meaningless. 
 
 In the case of arithmetic instruction SS, no special registers are 
 identified because this microinstruction always causes the contents of 
 accumulator register INR1 and operand register INR2 to be added with 
 the result going to the accumulator register. 
 
 For microinstruction MS, field A3 identifies the various registers 
 of the register file whose contents would be added by microinstruction 
 MS. Note that the address in A3 is not encoded but rather each bit of 
 A3 identifies a register. A bit value of '1' in A3 indicates that the 
 corresponding file register would take part as the source of the operand. 
 The '1' in the least significant position of A3 indicates that the
 accumulator register INR1 would always be one of the source registers in the MS
 instruction. The result of addition always goes to the accumulator 
 register INR1. 
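
 Because A3 is a bit mask rather than an encoded address, decoding it is
 a matter of testing each bit. The following sketch illustrates this for
 the five-register case; the function name is hypothetical, and the
 mapping of bit j to register INR(j+1) follows the description in
 Figure 4.20.

```python
def ms_sources(a3, num_registers=5):
    """List the file registers selected by the one-bit-per-register field A3."""
    return [f"INR{j + 1}" for j in range(num_registers) if (a3 >> j) & 1]


print(ms_sources(0b01011))  # ['INR1', 'INR2', 'INR4']
```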
 
 The D3 field in microinstruction FMA identifies the multiplier 
 digit for the formation of the partial product. 
 
 The microinstruction bit 4 carries the sign of the operand, S_p, which
 is nothing but the sign of the most significant nonzero digit in the
 accumulator. This sign is first determined by the MCU by a sequence of
 left shift microinstructions and testing the status indicators Z_i and
 S_i of the most significant PE. The proper value of S_p, that is, bit 4,
 is set by MCU before issuing the microinstruction.
 
 4.3.2.3 Description of subcontrols by control sequence charts - 
 The subcontrols are multi-output finite state machines which pro- 
 duce control signals in proper temporal order for the execution of 
 various microoperations during the processing of a microinstruction. 
 These control signals condition the combinational processing logic to 
 perform elementary microoperations like opening or closing of a register 
 gate, setting of selector networks to certain states or the setting of 
 a control status memory element. In addition, some of the control 
 signals act as interface request-response signals for the coordination 
 of various PEs or to access the local memory (PEM) module. 
 
 The operation of the finite state machine can be described by a 
 control sequence chart (CSC) which is a flowchart like description of a 
 control sequence. A control sequence is an instance of the execution 
 of a subcontrol. The control sequence chart shows the various control 
 signals and their temporal order generated during the execution of the 
 subcontrol. 
 
 4.3.2.3.1 Control sequence chart conventions - A control sequence 
 chart (CSC) consists of a set of rectangular, diamond and pentagonal 
 shaped boxes and entry and exit symbols connected together in a two- 
 dimensional pattern with straight directed lines. The arrows on the 
 lines indicate the direction of the control flow in the sequence. The 
 various symbols used in the CSC are shown in Figure 4.21. 
 
 The diamond shaped symbol (Figure 4.21a and 4.21b) represents the 
 decision element with single entry and two exit points. The exit points 
 are labeled yes/no (Figure 4.21a) which indicate the truth/falsehood of 
 the statement written inside the box, or the exit points are labeled 
 with the actual name of the option (Figure 4.21b) that is valid on that 
 exit point. 
 
 The rectangular box of Figure 4.21c represents a control step. A 
 control step is a set of microoperations (indicated by control signals) 
 enclosed in the rectangular box. The time ordering of the micro- 
 operations within a control step is not important and they are, in 
 general, all activated in parallel. The rectangular boxes of Figures 
 4.21d and 4.21e represent the invoking of another subcontrol whose
 name is written inside the box. However, in the case of Figure 4.21d, 
 the exit from the subcontrol returns the control flow to the point where 
 it was invoked (like a subroutine call in software) whereas the control 
 flow at the end of execution of the subcontrol indicated in Figure 4.21e 
 branches to the next point in the control sequence chart. 
 
 The pentagonal boxes of Figures 4.21f and 4.21g respectively repre-
 sent the 'FORK' and 'JOIN' symbols. The 'FORK' symbol indicates that
 the subcontrols at the exit points of the symbol are to be activated
 concurrently. On the other hand, the 'JOIN' symbol signifies that the
 replies from all the concurrently active control sequences indicated by
 the entry points to the box must be true before the control flow can
 proceed any further.

 [Figure 4.21 Control Sequence Chart Symbols. Panels: (a) decision
 element with yes/no exits; (b) decision element with named exits;
 (c) control step; (d) invoked subcontrol with return; (e) invoked
 subcontrol with branch; (f) FORK; (g) JOIN; (h) entry; (j) return;
 (k) branch.]
 
 The entry to a control sequence chart is indicated by a single 
 circle (Figure 4.21h) with the name of the corresponding subcontrol 
 written in the circle. The oval symbol (Figure 4.21j) represents a 
 'return' to the invoking point of the subcontrol in the control flow. 
 A double circle (Figure 4.21k) represents a branch to the entry point 
 of the subcontrol whose name is written inside the circle. 
 
 The control sequence charts which are too big to be fitted on a 
 single page have been drawn on different pages but the entry point on 
 each page is labeled the same. An example is the DM-control. 
 
 The microoperations within a control step box are indicated by
 either control signals of the form

      control signal name := 1 or 0

 or transfer statements of the form

      x ← y
      x ← 1 or 0.

 Most of the control signals in DPL are level signals whereas the
 interface request-response signals are pulse signals whose leading and
 trailing edges are used to indicate request, acknowledge and response
 states. The '1' or '0' on the right hand side indicates the logically
 'active' and 'inactive' state respectively. In the case of transfer
 statements indicated by the arrow ←, x represents a control status
 memory element which is set to the state '1' or '0' or to the state
 of 'y'.
 
 The control signals for the selector networks are of the form XsY
 where X indicates the input source to the selector network sY. The gate
 signals for the registers are of the form gRegisterName where
 RegisterName identifies the register which has to be loaded with
 information.
 
 Square brackets [ ] indicate a subscript value as in ISP notation 
 [42] and thus the address of a register or memory location when these 
 brackets appear after a memory element name. The value of the subscript 
 is written within the square brackets. 
 
 The angle brackets < > enclose lists of bit names. For example,
 if MIR is a register, then MIR<4:0> indicates bits 0 through 4 of
 register MIR and that the bits in MIR are numbered from right to left
 in ascending order.

 The subscripts i, i-1, i+1 on the signal names indicate the index
 of the PE originating the interface control signal.
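
 The angle-bracket convention can be illustrated with a small Python
 sketch that extracts a bit field from an integer image of a register.
 The helper name `bits` and the sample register value are assumptions
 made purely for illustration; they are not part of DPL or of the ISP
 notation.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return bits hi..lo of value, numbering bit 0 as the rightmost bit."""
    if hi < lo or lo < 0:
        raise ValueError("expected hi >= lo >= 0")
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


# Hypothetical 11-bit image of MIR<10:0>, written field by field.
mir = 0b101_011_10110
low_field = bits(mir, 4, 0)    # MIR<4:0>
reg_field = bits(mir, 7, 5)    # MIR<7:5>
```

 With this numbering, MIR<4:0> is simply the five rightmost bits of the
 register image.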
 
 4.3.2.3.2 Description of R-control - The function of the R-control
 in PE_i is to accept for processing and to acknowledge a
 microinstruction from the adjacent PE_{i-1}, and to invoke the
 DM-control for the processing of the microinstruction. The control
 sequence chart for the R-control is shown in Figure 4.22.

 Figure 4.22 Control Sequence Chart for R-control.

 The R-control indicates its readiness to PE_{i-1} to accept another
 microinstruction by the signal RACK_i := 1. The R-control in PE_i
 monitors the request signal TRQ_{i-1} from PE_{i-1}. The active state
 of TRQ_{i-1} indicates that the information on input port MIP_i is
 valid, and the R-control (control step RC1) loads the microinstruction
 into register MIR<10:0>. (It is assumed that the selector sMIR<4:0>
 was put earlier in a state to select the MIP input.) Then the
 R-control (control step RC2) acknowledges the receipt of the
 microinstruction by the control signal RACK_i := 0. The R-control
 (control step RC3) then invokes the DM-control for the processing of
 the microinstruction, and waits for a reply from the DM-control. The
 reply indicates that the processing is finished and that the R-control
 can accept another microinstruction, which the R-control indicates to
 PE_{i-1} by the control signal RACK_i := 1. At the same time, the
 selector network sMIR<4:0> is set to select the data from
 microinstruction input port MIP_i. This is done in control step RC4.

 It is assumed, in the control sequence chart of Figure 4.22, that
 initially, at power turn-on, RACK_i := 1 and MIPsMIR<4:0> := 1 are
 true.
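
 The four control steps of the R-control can be summarized by the
 following Python sketch. It is only a behavioural toy model under
 stated assumptions: the class name `RControl`, the method `run_once`
 and the callback `dm_control` are hypothetical, and the hardware wait
 loops are collapsed into a single function call.

```python
from dataclasses import dataclass, field


@dataclass
class RControl:
    """Toy model of the R-control of PE_i (signal names follow the text)."""
    rack: int = 1            # RACK_i: 1 = ready to accept (power-on state)
    mip_selected: int = 1    # MIPsMIR<4:0>: 1 = selector takes the MIP input
    mir: int = 0             # microinstruction register MIR<10:0>
    trace: list = field(default_factory=list)

    def run_once(self, trq: int, mip: int, dm_control) -> bool:
        """Run RC1..RC4 once if TRQ_{i-1} is active; otherwise keep waiting."""
        if trq != 1:
            return False
        self.mir = mip & 0x7FF          # RC1: load MIR<10:0> from port MIP
        self.trace.append("RC1")
        self.rack = 0                   # RC2: acknowledge the receipt
        self.mip_selected = 0           # (selector setting as drawn in the chart)
        self.trace.append("RC2")
        dm_control(self.mir)            # RC3: invoke DM-control, wait for reply
        self.trace.append("RC3")
        self.rack = 1                   # RC4: ready for the next microinstruction
        self.mip_selected = 1
        self.trace.append("RC4")
        return True
```

 Calling `run_once` with `trq=0` models one pass through the wait loop;
 with `trq=1` it models one complete accept-process-release cycle.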
 
 4.3.2.3.3 Description of DM-control - The DM-control can be looked
 upon as the main control which, on being invoked by the R-control,
 monitors the output of the microinstruction decoder. Depending on the
 nature of the microinstruction, it sets up the necessary data paths
 and conditions the combinational logic in the data flow logic of the
 PE. After the data paths are set up, the DM-control invokes one or
 more of the other controls F, E and G for processing and, if
 necessary, the T-control for onward transmission of the
 microinstruction to PE_{i+1}. Since the selectors have no memory and
 the data paths remain set throughout the processing, the output of the
 microinstruction decoder can be directly connected to the selector
 signals of the form XsY := 1, which involves no extra logic cost.
 Figures 4.23a, b, and c show the control sequence chart for the
 DM-control.
 
 Figures 4.23a, 4.23b and 4.23c Control Sequence Chart for DM-control,
 Parts I, II and III.

 
 The data path for the microinstruction SS is through the selector
 sADR for operand register INR2, through the selector sDSE and encoder
 DSE for the result digit (the sum of the contents of INR1 and INR2)
 and finally through the selector sRIB. The encoder DSE converts the
 redundant binary sum digit to sign-magnitude format SM. This data
 path is set up by control step DMC1_1. The control signal TAsTOP sets
 up the data path for the 'Adder Transfer' out of PE_i. The
 microoperation P_i ← Q conditions the DSE encoder logic for proper
 conversion of the redundant binary result digit into the equivalent
 sign-magnitude format.

 For the microinstruction MS, the control step DMC1_2 sets up the
 selector sADR for the source operands and the data paths for the
 result digit through the selectors sDSE and sRIB and for the 'Adder
 Transfer' through the selector sTOP.

 If the microinstruction to be processed is FMA, the control step
 DMC1_3 sets up the necessary data paths: for the 'product array' and
 'product transfer' through sADR, for the result digit through sDSE and
 sRIB, and for the 'Adder Transfer' through sTOP. The multiplicand
 digit from operand register INR2 is put on the register output port
 ROP_i via selector sROB. The control memory flip-flop GFMA acts as a
 synchronizing device between the concurrently active and interacting
 controls, G_gn-control and G_ap-control. It is initialized to state
 '1'. The details of its action are discussed later in Section
 4.3.2.3.6.
 
 Control steps DMC1_4 through DMC1_9 set up the data paths for the
 microinstructions shown at the entry points of each control step in
 Figures 4.23a, 4.23b, and 4.23c. For the left shift microinstruction
 LS, the data path for the digit in PE_i is from the internal register
 to the output port ROP_i via the selector sROB, and the data path for
 the incoming digit from PE_{i+1} is from RIP_i to register IBR via the
 register APR and selector sRIB. In the case of microinstruction RS,
 the digit to be stored is in microinstruction register MIR<4:0> and
 its corresponding data path to the input of the register file is via
 the selector sRIB. The data path for the digit to be shifted out to
 PE_{i+1} is via the selector sROB, the bus ROB and the selector
 sMIR<4:0> to register MIR and thence to port MOP_i. For the
 inter-register transfer microinstructions TD and TI, the data path is
 through the selector sROB, the bus ROB, and the selectors sDSE and
 sRIB. The control memory flip-flop SCHI generates the similarly named
 control signal which transforms the SM-encoded output of ROB into
 redundant binary format for proper transfer. The logical '1' state of
 control signal SCHI guarantees that the magnitude bits of the input
 digit on ROB will appear unchanged at the output of DSE. This can be
 seen from Figure 4.14 and the discussion in Section 4.2.2.6. The data
 path for the microinstruction LDC is from the register MIR through the
 selector sRIB to the proper register INR[MIR<7:5>] of the register
 file via the buffer register IBR.
 
 For the memory access microinstructions, the communication of data
 and address takes place via the ports ROP_i, MIP_i, and TOP_i. For the
 microinstructions SPM and LPM, the data path for the address of the
 location in PEM and the read/write bit is from MIR to TOP_i via the
 selector sTOP. However, the data path for the data to be stored in the
 case of SPM is from register INR[MIR<7:5>] through selector sROB to
 the output port ROP_i. The data path from memory to the register
 INR[MIR<7:5>] for microinstruction LPM, on the other hand, is via port
 MIP_i, selector sMIR<4:0>, register MIR<4:0>, the selector sRIB and
 buffer register IBR.
 
 The data path for the microinstructions AR and NR is from the
 register INR1 through the selectors sROB and sDSE, the encoder DSE and
 the selector sRIB back to INR1 via buffer register IBR. Note that the
 OP-codes for the microinstructions AR and NR are so chosen that bits
 MIR<7:5> address the register INR1. This explains the OP-code choices
 for the various microinstructions shown in Figure 4.20.
 
 After the data paths are set up, the DM-control invokes one or
 more of the G, F, E and T-controls for the actual processing of the
 microinstruction. The microinstructions SS, MS, FMA and LS all require
 G-information from their right neighboring PEs. So the DM-control
 invokes the G-control, consisting of G_gn-control and G_ap-control,
 and the T-control in parallel. The T-control transmits the present
 microinstruction to PE_{i+1}. The identity of the microinstruction in
 PE_i is essential in PE_{i+1} to generate the G-information for the
 microinstruction processing. The control flow at the end of
 G_ap-control branches directly to the E-control for the actual
 calculation and storage of the result digit.

 When all the concurrently invoked subcontrols are finished, they
 report back to the DM-control at the invoking point in the control
 flow. The DM-control now replies back to the R-control which had
 earlier invoked the DM-control. The R-control was in a state of active
 suspension (wait state) during the activity of the DM-control. The
 R-control now gets ready to receive another command as explained
 earlier.
 
 For the microinstruction RS, the DM-control, after setting up the
 data paths, invokes the F-control which changes the modifier field of
 the microinstruction in MIR of PE_i for transmission to PE_{i+1}. The
 details of the F-control are discussed later. At the end of the
 F-control, the E-control and T-control are invoked in parallel, but no
 G-information is required for the processing of microinstruction RS.
 On return from both the concurrently active E- and T-controls, the
 DM-control replies back to the waiting R-control.

 In the case of microinstructions TD, TI, LDC, LPM and SPM (Figure
 4.23b) no G-information is required. Hence the DM-control invokes only
 the E-control and T-control in parallel. The rest of the control flow
 is as it is for RS.
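
 The invocation pattern described so far can be collected into a small
 lookup table. The Python sketch below is an illustrative summary only:
 the names `SUBCONTROL_PLAN` and `dm_plan` are hypothetical, each plan is
 written as a sequential prelude followed by the branches that run
 concurrently between a 'FORK' and a 'JOIN', and the microinstructions AR
 and NR are deliberately left out because their invocation depends on
 the data in the adjacent PE.

```python
# op -> (steps run first in sequence, branches run concurrently afterwards)
SUBCONTROL_PLAN = {
    # G-information needed: G_gn, G_ap (which branches on to E) and T in parallel.
    **{op: ((), [("G_gn",), ("G_ap", "E"), ("T",)])
       for op in ("SS", "MS", "FMA", "LS")},
    # RS: F-control first, then E and T in parallel.
    "RS": (("F",), [("E",), ("T",)]),
    # No G-information needed: only E and T in parallel.
    **{op: ((), [("E",), ("T",)])
       for op in ("TD", "TI", "LDC", "LPM", "SPM")},
}


def dm_plan(op: str):
    """Return (prelude, concurrent branches) for a microinstruction name."""
    try:
        return SUBCONTROL_PLAN[op]
    except KeyError:
        raise ValueError(f"no static plan for microinstruction {op!r}") from None
```

 In every case the DM-control replies to the waiting R-control only
 after all branches of the plan have reported back.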
 
 The invoking of E, G and T-controls by DM-control for
 microinstructions AR and NR is more complex since the invocation
 depends on the nature of the data resident in the adjacent PE_{i+1}.
 The digit algorithm of the microinstruction AR, discussed in Section
 3.6.5, requires knowing the sign of the first non-zero digit to the
 immediate right of the present digital position. This is done through
 the use of interface control signals S and Z and control memory
 flip-flops SAD, SAK and ADZ which respectively stand for the Sign of
 Adjacent Digit, Sign of Adjacent Digit Known and Adjacent Digit is
 Zero. The value of logical '1' for SAK and ADZ indicates assertion or
 truth whereas '0' indicates falsehood. The interface control signal
 S_i (which is the output of control memory flip-flop S_i) indicates
 the sign of the digit in PE_i's accumulator register. Z_i (which is
 also the output of flip-flop Z_i) indicates whether the magnitude of
 the accumulator digit is zero. Z_i = 1 indicates that the digit is
 zero and Z_i = 0 indicates otherwise. Note that if Z_i is monitored by
 the adjacent PE_{i-1}, the validity of Z_i can only be ensured when
 RACK_i = 1; i.e., when PE_i is not in the middle of executing any
 previous microinstruction. The mechanism for determining the sign of
 the first non-zero digit to the right of the present digital position,
 i say, is as follows.
 
 If the digit in PE_{i+1} is zero, G_ap-control in PE_i goes into a
 wait loop. In the meantime, the microinstruction AR is passed to
 PE_{i+1} where again Z_{i+2} is monitored to see if the digit in
 PE_{i+2} is zero. If it is, it (G_ap-control in PE_{i+1}) also goes
 into a wait loop and the microinstruction passes to PE_{i+2},
 PE_{i+3}, ..., PE_{i+j} if Z_{i+j+1} = 0 and Z_{i+3}, Z_{i+4}, ...,
 Z_{i+j} are all in logical state '1'. The G_ap-controls in PE_{i+2},
 ..., PE_{i+j-1} go into the wait loop. As soon as Z_{i+j+1} = 0 is
 monitored by PE_{i+j}, G_gn-control in PE_{i+j} assigns the value of
 S_{i+j+1} to S_{i+j} and declares the sign valid to the waiting
 G_ap-control in PE_{i+j-1} by assigning logical state '1' to the
 control signal GRQ_{i+j}. The G_ap-control in PE_{i+j-1} informs the
 G_gn-control in PE_{i+j-1} about the validity of sign S_{i+j} by
 setting synchronizing control flip-flop SAK, in PE_{i+j-1}, to logical
 state '1'. The G_gn-control in its turn assigns the value of S_{i+j}
 to S_{i+j-1} and declares the sign S_{i+j-1} to be valid. The sign of
 the digit thus flows backward till PE_i is reached and in this way,
 all the zero digits lying to the immediate right of digital position i
 are assigned the sign of the first non-zero digit.
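
 The net effect of this backward flow of sign information can be stated
 sequentially. The following Python sketch is a functional model only,
 not a model of the asynchronous hardware: the function name
 `effective_signs`, the use of +1/-1 for the two signs, and the value
 `None` for zero digits that have no non-zero digit to their right are
 all assumptions made for illustration.

```python
def effective_signs(digits):
    """Sign seen at each position of a left-to-right list of signed digits.

    A non-zero digit keeps its own sign; a zero digit takes the sign of the
    first non-zero digit to its right (None if there is no such digit).
    """
    signs = [None] * len(digits)
    sign_to_right = None
    # Scan from the rightmost PE back towards the leftmost one, mirroring
    # the direction in which the sign information flows.
    for k in range(len(digits) - 1, -1, -1):
        if digits[k] != 0:
            sign_to_right = 1 if digits[k] > 0 else -1
        signs[k] = sign_to_right
    return signs
```

 For the digit string 1, 0, 0, -1, 0 the two interior zeros inherit the
 negative sign of the digit -1 to their right, while the final zero has
 no non-zero digit to draw a sign from.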
 
 We now describe the action of DM-control for microinstructions AR
 and NR. The DM-control checks the state of control signal Z_{i+1} by
 monitoring the control signal RACK_{i+1} as explained earlier. If the
 adjacent digit in PE_{i+1} is not zero (Z_{i+1} ≠ 1), the control
 memory element SAD is set to the state of S_{i+1}, the sign of the
 adjacent digit is declared known by SAK ← 1 and the adjacent digit is
 declared non-zero by setting ADZ to logical state '0'. However, if
 Z_{i+1} = 1, the control memory flip-flops SAD and SAK are set to
 logical state '0' and flip-flop ADZ to state '1'.
 
 For the microinstruction AR, the DM-control then invokes T-control
 and G-control in parallel irrespective of the state of the control
 signal Z_{i+1}. However, for microinstruction NR, no digit beyond and
 including the first (counting from left) zero digit of the operand
 needs to be recoded. So the flow of microinstruction NR stops as soon
 as Z_{i+1} = 1 is encountered. This is done by the DM-control not
 invoking the T-control in PE_i. However, G_gn-control and G_ap-control
 are invoked for uniformity of invoking procedure, although
 G_ap-control is immediately exited for microinstruction NR as can be
 seen from the control sequence chart for G_ap-control in Figure 4.27.

 When all the concurrently invoked controls have finished, the
 DM-control replies back to the waiting R-control which gets
 conditioned to receive another microinstruction for processing in
 PE_i.
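
 The decisions taken by the DM-control for AR and NR can be condensed
 into a few lines. The Python sketch below is an illustrative reading of
 the two preceding paragraphs; the function name `dm_ar_nr` and the
 returned data structures are hypothetical, and the sketch assumes that
 Z_{i+1} is sampled only when it is valid.

```python
def dm_ar_nr(op: str, z_next: int, s_next: int):
    """Return (flip-flop states, set of subcontrols invoked) for AR or NR.

    z_next and s_next stand for Z_{i+1} and S_{i+1} of the right neighbour.
    """
    if op not in ("AR", "NR"):
        raise ValueError("only AR and NR are handled here")
    if z_next != 1:
        # Adjacent digit is non-zero: its sign is known immediately.
        flags = {"SAD": s_next, "SAK": 1, "ADZ": 0}
    else:
        # Adjacent digit is zero: sign not yet known.
        flags = {"SAD": 0, "SAK": 0, "ADZ": 1}
    invoked = {"G_gn", "G_ap"}          # always invoked, for uniformity
    if op == "AR" or z_next != 1:
        invoked.add("T")                # NR stops at the first zero digit
    return flags, invoked
```

 Only the combination of NR with a zero adjacent digit suppresses the
 T-control, which is what stops the onward flow of the microinstruction.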
 
 
 4.3.2.3.4 Description of T-control - The T-control, when invoked
 by DM-control, passes the microinstruction in register MIR of PE_i to
 PE_{i+1}. The control sequence chart for T-control is shown in Figure
 4.24. The T-control in PE_i monitors the signal RACK_{i+1} (from the
 R-control in PE_{i+1}) whose logical state '1' indicates that the
 R-control in PE_{i+1} is ready to accept the microinstruction. The
 control step TC1 sets the control memory flip-flop whose output gates
 the contents of MIR onto bus MOP_i. Then in control step TC2, the
 request signal TRQ_i is activated, which in turn is monitored by the
 R-control in PE_{i+1}. As soon as the R-control in PE_{i+1} accepts
 the microinstruction from MOP_i (= MIP_{i+1}), it acknowledges by
 assigning the '0' logical state to acknowledge signal RACK_{i+1}. The
 '0' state of RACK_{i+1}, being monitored by T-control, signifies that
 the microinstruction has been accepted. The control step TC3 then
 withdraws the request for transmission by assigning the '0' logical
 state to request signal TRQ_i and also removes the information from
 the bus MOP_i (MIP_{i+1}). The latter is necessary for
 microinstruction LPM where the port MIP_{i+1} is used for inputting
 data read from the PEM_{i+1}.

 At the end of the control sequence, the control flow returns to
 the point where the T-control was invoked.
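
 The request-acknowledge exchange between the T-control of PE_i and the
 R-control of PE_{i+1} can be traced step by step. The Python sketch
 below serializes the two asynchronous controls into one function, which
 is an assumption made for readability; the function name `hand_off` and
 the trace format are hypothetical.

```python
def hand_off(mir_i: int, rack_next: int = 1):
    """Trace one transfer of MIR from PE_i to PE_{i+1}.

    Returns (microinstruction loaded in PE_{i+1}, list of trace tuples),
    each tuple being (control step, TRQ_i, RACK_{i+1}).
    """
    if rack_next != 1:
        raise RuntimeError("PE_{i+1} not ready; T-control would keep waiting")
    trq_i, mop_i, trace = 0, None, []

    mop_i = mir_i                       # TC1: gate MIR onto bus MOP_i
    trace.append(("TC1", trq_i, rack_next))
    trq_i = 1                           # TC2: raise the request
    trace.append(("TC2", trq_i, rack_next))

    mir_next = mop_i                    # RC1 in PE_{i+1}: load its MIR
    rack_next = 0                       # RC2 in PE_{i+1}: acknowledge
    trace.append(("RC2", trq_i, rack_next))

    trq_i, mop_i = 0, None              # TC3: withdraw request, free the bus
    trace.append(("TC3", trq_i, rack_next))
    return mir_next, trace
```

 RACK_{i+1} stays at '0' at the end of the trace; it returns to '1' only
 when PE_{i+1} has finished processing the microinstruction (control
 step RC4 of its R-control).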
 
 Figure 4.24 Control Sequence Chart for T-control.

 4.3.2.3.5 Description of F-control - The function of the F-control
 is to modify the microinstruction modifier field F before transmission
 of the microinstruction to the next PE, i.e., PE_{i+1}. This is made
 use of in the microinstruction RS where the modifier field carries the
 digit to be shifted into the adjacent PE. Figure 4.25 shows the
 control sequence chart for F-control. Control step FC1 loads the
 buffer register IBR from the output of the selector sRIB. The selector
 sRIB was initially conditioned by control step DMC1_5 in DM-control to
 select the digit from MIR<4:0>. At this time, MIR<4:0> carries the
 digit from the adjacent PE_{i-1}, which arrived as part of the
 microinstruction from PE_{i-1}. Control step FC2 loads MIR<4:0> from
 the output of the selector sMIR<4:0>, which was conditioned by control
 step DMC1_5 to accept the digit from INR[MIR<7:5>] in PE_i that is to
 be shifted into the next PE_{i+1}. At the end of control step FC2, the
 control flow branches to initiate E-control and T-control in parallel.
 The T-control transmits the microinstruction in MIR with the new
 modifier field and the E-control loads the register INR[MIR<7:5>] from
 the buffer register IBR.
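
 Seen along the whole chain of PEs, the F-control turns the modifier
 field into a one-digit conveyor for the RS microinstruction. The Python
 sketch below models this sequentially; the function name `right_shift`,
 the list representation of the addressed registers and the `incoming`
 digit supplied at the left end are assumptions for illustration.

```python
def right_shift(registers, incoming):
    """Pass one RS microinstruction along PE_0 .. PE_{n-1}.

    registers[k] is the digit held in the addressed register of PE_k.
    Returns (new register contents, digit shifted out of the last PE).
    """
    modifier = incoming                 # MIR<4:0> as it enters PE_0
    shifted = []
    for digit in registers:
        ibr = modifier                  # FC1: IBR <- digit from the left PE
        modifier = digit                # FC2: MIR<4:0> <- own digit
        shifted.append(ibr)             # E-control: register <- IBR
    return shifted, modifier
```

 Each PE thus stores the digit it received and forwards its own former
 digit in the modifier field of the same microinstruction.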
 
 4.3.2.3.6 Description of G-control - The G-control consists of two
 independent subcontrols: G_gn-control, which generates the
 G-information (mainly 'Adder Transfer' t_{i-1}) for the adjacent
 PE_{i-1}; and G_ap-control, which accepts the G-information from the
 adjacent PE_{i+1}. The G-control is invoked by DM-control only when
 the processing of the microinstruction requires information from the
 adjacent PEs. When the G-information depends logically on more than
 one adjacent PE, the G_gn-control and G_ap-control interact with each
 other through synchronizing control memory flip-flops GFMA (in the
 case of microinstruction FMA) and SAK (for the microinstructions AR
 and NR).
 
 Figure 4.25 Control Sequence Chart for F-control.

 
 4.3.2.3.6.1 Description of G_gn-control - The function of
 G_gn-control is to generate the G-information needed in the adjacent
 PE_{i-1}. Figure 4.26 shows the control sequence chart for this
 control. The G-information for the microinstructions SS and MS
 consists of the 'Adder Transfer' t_{i-1} which is routed to the output
 port TOP_i by the control steps DMC1_1 and DMC1_2. The G-information
 for the microinstruction LS is the digit in register INR[MIR<7:5>]
 which is routed to port ROP_i by the data path set up in control step
 DMC1_4. After the G-information stabilizes on ports TOP_i and ROP_i,
 control step Ggn1 informs the G_ap-control in PE_{i-1} about the
 validity of the G-information. The G_gn-control then monitors the
 acknowledgment signal GACK_{i-1} from the G_ap-control of PE_{i-1}.
 When GACK_{i-1} is in logical state '1', the control step Ggn2
 declares the G-information not valid.
 
 Figure 4.26 Control Sequence Chart for G_gn-control.

 For the microinstruction FMA, the G-information {}_{FMA}G_i to be
 generated by PE_i for PE_{i-1} consists of the 'Adder Transfer'
 t_{i-1} and the multiplicand digit φ_i (assuming 'local generation' of
 the CPT t^p). The 'Adder Transfer' t_{i-1} is dependent on the
 multiplicand digit φ_i and accumulator digit a_i in registers INR2 and
 INR1 respectively in PE_i, the multiplicand digit φ_{i+1} and
 accumulator digit a_{i+1} in registers INR2 and INR1 respectively in
 PE_{i+1}, and the multiplier digit m_j. t_{i-1} consists of two parts
 t^{A0}_{i-1} and t^{A1}_{i-1} and is generated in a time sequential
 manner. The process of generation of {}_{FMA}G_i can be represented in
 the notation of Section 2.4 as follows:

 \[
 \begin{aligned}
 {}_{FMA}G_i &= \{\, t_{i-1},\ \phi_i \,\} \\
 t_{i-1} &= \Gamma(m_j,\ \phi_i,\ a_i,\ \phi_{i+1},\ a_{i+1})
          = \{\, t^{A0}_{i-1},\ t^{A1}_{i-1} \,\}
 \end{aligned}
 \]

 where

 \[
 \begin{aligned}
 t^{A0}_{i-1} &= \Gamma^{0}(m_j,\ \phi_i,\ a_i) \\
 t^{A1}_{i-1} &= \Gamma^{1}(m_j,\ \phi_i,\ a_i,\ \phi_{i+1},\ t^{A0}_{i})
 \end{aligned}
 \]

 and

 \[
 {}_{FMA}G_i = \{\, t^{A0}_{i-1},\ t^{A1}_{i-1},\ \phi_i \,\}
             = \{\, G^{0}_i,\ G^{1}_i \,\}
 \]

 where

 \[
 G^{0}_i = \{\, t^{A0}_{i-1},\ \phi_i \,\}
 \quad\text{and}\quad
 G^{1}_i = \{\, t^{A1}_{i-1} \,\}.
 \]

 In the above relations, the subscript FMA has been dropped from all
 variables for ease of reading. The above described structure for
 G^0_i and G^1_i can be deduced from an examination of Figures 3.13 and
 4.11d together.
 
 When the G-information G^0_i is valid on ports TOP_i and ROP_i, which
 carry the t^{A0}_{i-1} and φ_i components of G^0_i respectively,
 control step Ggn3 informs PE_{i-1} about the validity of the G^0_i
 information. After PE_{i-1} has accepted G^0_i (indicated by
 GACK_{i-1} := 1), control step Ggn4 sets validity signal GV_i to
 logical state '0'. To generate G^1_i, it is necessary that G^0_{i+1}
 (= {t^{A0}_i, φ_{i+1}}) be available in PE_i. When G^0_{i+1} from
 PE_{i+1} is valid on input ports TIP_i and RIP_i, the G_ap-control in
 PE_i accepts and stores G^0_{i+1} and informs the G_gn-control about
 its availability by setting the control memory flip-flop GFMA to
 logical state '1'. As soon as the synchronizing flip-flop GFMA is in
 logical state '1' and G^1_i is valid on port TOP_i (G^1_i =
 t^{A1}_{i-1} is automatically generated by the logic in MIAD of PE_i),
 the control steps Ggn5 and Ggn6 declare the G^1_i information valid
 and invalid respectively in the same way as do control steps Ggn3 and
 Ggn4.
 
 For the microinstructions AR and NR, the G-information consists of
 only the sign of the digit in the accumulator of the adjacent PE and
 also whether the digit is zero or not. If the digit in the accumulator
 in PE_i is non-zero (Z_i ≠ 1), then no G-information needs to be
 generated because it is already known to PE_{i-1} via its DM-control.
 However, if the present digit is a zero (Z_i = 1), the meaningful sign
 for this zero digit is the sign of the first non-zero digit to its
 right. If the digit to the immediate right in PE_{i+1} is non-zero,
 then the sign of the adjacent digit is known and was stored earlier in
 flip-flop SAD by the DM-control in PE_i. If, however, the digit in
 PE_{i+1} is zero (Z_{i+1} = 1), the sign of the adjacent digit is
 unknown (SAK ≠ 1). The G_gn-control goes into a wait loop,
 continuously monitoring SAK till the G_ap-control in PE_i determines
 the sign of the digit in PE_{i+1} and stores it in SAD. As soon as the
 sign of the adjacent digit is known, control step Ggn7 assigns the
 same value to the flip-flop S_i whose value represents the sign of the
 accumulator digit in PE_i. Control step Ggn8 informs the G_ap-control
 in PE_{i-1} about the validity of S_i. After the G_ap-control in
 PE_{i-1} acknowledges the receipt of the valid sign S_i (GACK_{i-1} =
 1), control step Ggn9 withdraws the validity signal. The G_gn-control
 now returns to the invoking point in DM-control.
 
 
 4.3.2.3.6.2 Description of G_ap-control - The function of
 G_ap-control is to accept the G-information generated by G_gn-control
 in PE_{i+1}. Figure 4.27 shows the control sequence chart for
 G_ap-control. This G-information is available on port TIP_i (for
 microinstructions SS, MS and FMA), on port RIP_i (for
 microinstructions FMA and LS) and on interface control line S_{i+1}
 (in the case of microinstructions AR and NR).

 In the case of microinstructions SS and MS, the G_ap-control
 monitors the validity interface signal GV_{i+1}. As soon as the
 G-information is valid, control step Gap1 stores the G-information on
 bus TIP_i into G-information register GIR. Control step Gap2
 acknowledges the receipt of the G-information, and control step Gap3
 withdraws the acknowledgment signal GACK_i once the validity signal is
 withdrawn (GV_{i+1} = 0) by the G_gn-control in PE_{i+1}. For
 microinstruction LS, the same sequence is followed except that the
 G-information is available on RIP_i and is stored in register APR by
 control step Gap4.
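
 Taken together, control steps Ggn1 and Ggn2 of the sender and Gap1
 through Gap3 of the receiver form a four-phase handshake on the
 validity and acknowledgment signals. The Python sketch below traces
 that exchange for the SS/MS case; it serializes the two controls into
 one function for readability, and the function name `g_transfer` and
 the event format are hypothetical.

```python
def g_transfer(t_value):
    """Trace the GV/GACK handshake that moves one G-information value.

    Returns (contents of register GIR in the receiver, list of events),
    each event being (control step, GV of sender, GACK of receiver).
    """
    gv, gack, gir = 0, 0, None
    events = []

    gv = 1                              # Ggn1: sender declares data valid
    events.append(("Ggn1", gv, gack))
    gir = t_value                       # Gap1: receiver latches TIP into GIR
    events.append(("Gap1", gv, gack))
    gack = 1                            # Gap2: receiver acknowledges
    events.append(("Gap2", gv, gack))
    gv = 0                              # Ggn2: sender withdraws validity
    events.append(("Ggn2", gv, gack))
    gack = 0                            # Gap3: receiver withdraws acknowledge
    events.append(("Gap3", gv, gack))
    return gir, events
```

 Both signals are back at '0' at the end of the exchange, so the same
 pair of lines can be reused for the next transfer.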
 
 As explained earlier, the G-information G_{i+1} for microinstruction
 FMA consists of two components: G^0_{i+1} (= {t^{A0}_i, φ_{i+1}}) and
 G^1_{i+1} (= {t^{A1}_i}). When the G^0_{i+1} information is valid,
 the control step Gap7 stores the t^{A0}_i component in register GIR
 and the φ_{i+1} component in register APR. Then control step Gap8 sets
 the synchronizing flip-flop GFMA to logical state '1' to inform the
 G_gn-control about the availability of the G^0_{i+1} information so
 that the G_gn-control may generate G^1_i for PE_{i-1}. Control steps
 Gap9 and Gap10 play the same role of acknowledgment assertion and its
 withdrawal as control steps Gap2 and Gap3. After control step Gap10,
 the G_ap-control again starts monitoring the validity control signal
 for G^1_{i+1}. As soon as G^1_{i+1} is valid (GV_{i+1} = 1) on TIP_i,
 control step Gap11 stores it in the G-information buffer register GIR,
 and then control steps Gap12 and Gap13 respectively acknowledge the
 receipt of the G^1_{i+1} information and withdraw the acknowledge
 signal on response from the G_gn-control.

 Figure 4.27 Control Sequence Chart for G_ap-control.
 
 For the microinstructions AR and NR, if the adjacent digit is
 non-zero, or if it is zero and the microinstruction is NR, no
 G-information needs to be accepted from the adjacent G_gn-control and
 the G_ap-control is immediately exited.
 
 In the case of a zero adjacent digit and microinstruction AR, the
 G_ap-control monitors the G-information validity signal. Here the
 G-information consists of S_{i+1}. As soon as the G_gn-control has
 determined the valid sign for S_{i+1} (which is the sign of the first
 non-zero digit to its right), the G_gn-control sets the validity
 signal GV_{i+1} to logical state '1'. As soon as the G_ap-control in
 PE_i finds GV_{i+1} in state '1', the control step Gap14 stores the
 sign S_{i+1} in flip-flop SAD and Gap15 sets the synchronizing
 flip-flop SAK to logical state '1'. (SAK is being monitored by the
 G_gn-control in PE_i in order to attach this sign, stored in SAD, to
 S_i.) Control steps Gap16 and Gap17 play the same role as Gap2 and
 Gap3.
 
 At the end of execution of the G_ap-control, the control sequence
 for the processing of the microinstruction branches directly to the
 E-control where the result digit is calculated and stored in the
 appropriate register of the register file.
 
 4.3.2.3.7 Description of E-control - Figure 4.28 shows the control
 sequence chart for E-control. For the microinstructions SS, MS, FMA,
 LDC and LS, the E-control loads the result digit, which is available
 at the output of selector sRIB, into buffer register IBR. This is done
 in control step E3. Then control steps E4_1 and E4_2 transfer the
 contents of IBR into accumulator register INR1 for microinstructions
 SS, MS, FMA and into the destination register INR[MIR<7:5>] for LDC
 and LS. Finally, the control step E5 sets the status indicators S_i
 and Z_i.

 Figure 4.28 Control Sequence Chart for E-control.
 
 For the microinstruction RS, the control step $E3$ is bypassed and
 control step $E4_2$ loads the register to be shifted. $E5$ sets the digit
 status indicators.
 
 For the inter-register transfer microinstructions, the state of the
 sign bit of the digit on the bus ROB is transferred to the sign bit
 output of digit sum encoder DSE for TD. The complement of that state is
 transferred in the case of microinstruction TI. This is done in control
 steps $E1_1$ and $E1_2$ respectively. The control sequence then goes
 through the control steps $E3$, $E4_2$ and $E5$. Control step $E4_2$
 loads the destination register in the register file.
 
 For the microinstruction LPM, control step $E6_1$ requests access to
 the local memory $PEM_i$ of the $PE_i$. Note that the address of the
 location in $PEM_i$ and the Read/Write bit (in the state 'Read') are
 already available on the output port $TOP_i$. The $PEM_i$ reads out the
 data on the microinstruction input bus $MIP_i$ and informs the $PE_i$ by
 the logical state '1' of acknowledge signal $MACK_i$. The control step
 $E6_2$ loads the register MIR<4:0> from the output of selector
 sMIR<4:0>, which had been earlier conditioned, in DM-control, to accept
 this output, and also withdraws the request for memory access. The
 control steps $E3$ and $E4_2$ load the buffer register IBR and file
 register INR[MIR<7:5>]. Finally the control sequence goes through
 control step $E5$ for setting the status indicators.
 
 For the store microinstruction SPM, the address of the PEM location
 is already available on output bus $TOP_i$ and the digit to be stored
 is on bus $ROP_i$ when the control flow enters E-control. Control step
 $E7_1$ requests access to the memory. Then the $PEM_i$ responds by the
 logical state '1' of acknowledge signal $MACK_i$, after accepting the
 data and address from the buses $TOP_i$ and $ROP_i$. Now control step
 $E7_2$ withdraws the request for memory access. The control sequence
 finally goes through the status setting control step $E5$.
 
 For the microinstructions NR and AR, the E-control implements the
 digit algorithms discussed earlier in Sections 3.6.4 and 3.6.5. Control
 steps $E2_1$ and $E2_2$ respectively achieve the radix complement and the
 diminished radix complement of the magnitude bits of the accumulator
 digit. Control step $E2_3$ diminishes the magnitude of the accumulator
 digit by unity. The particular setting of control signals to the states
 shown in control steps $E2_1$, $E2_2$ and $E2_3$ was explained earlier in
 Section 4.2.2.6. Control step $E2_4$ assigns the state of MIR<4>, which
 is the sign, $S_p$, of the whole operand to be assimilated or normalized,
 to the sign bit output of digit sum encoder DSE. Control steps $E3$ and
 $E4_1$ load the result digit in buffer IBR and accumulator register INR1.
 Finally the control step $E5$ sets the status indicators regarding the
 sign and magnitude of the accumulator digit in $PE_i$.
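
 The step sequences described in the paragraphs above can be collected
 into a small lookup table. The following is only a sketch: the names
 `E_CONTROL_STEPS` and `e_control_sequence` are hypothetical, the
 subscripted step names (written here as "E4_1", "E6_2", ...) are a reading
 of the text rather than of the chart in Figure 4.28, and the E2 substeps
 for NR and AR are lumped into a single "E2" entry because the applicable
 substep depends on the digit algorithm of Sections 3.6.4 and 3.6.5.

```python
# Control-step sequences of E-control, keyed by microinstruction mnemonic.
# Assumption: subscripts on the step names follow the prose description.
_ACC = ["E3", "E4_1", "E5"]    # result digit -> IBR -> accumulator INR1
_DEST = ["E3", "E4_2", "E5"]   # result digit -> IBR -> INR[MIR<7:5>]

E_CONTROL_STEPS = {
    "SS": _ACC, "MS": _ACC, "FMA": _ACC,
    "LDC": _DEST, "LS": _DEST,
    "RS": ["E4_2", "E5"],              # E3 is bypassed
    "TD": ["E1_1"] + _DEST,            # sign of the bus digit passed through
    "TI": ["E1_2"] + _DEST,            # sign of the bus digit complemented
    "LPM": ["E6_1", "E6_2"] + _DEST,   # request PEM read, then load MIR<4:0>
    "SPM": ["E7_1", "E7_2", "E5"],     # request PEM write, withdraw request
    "NR": ["E2"] + _ACC,               # "E2" stands for the applicable E2 substeps
    "AR": ["E2"] + _ACC,
}


def e_control_sequence(mnemonic):
    """Return a copy of the E-control step list for one microinstruction."""
    return list(E_CONTROL_STEPS[mnemonic])


for op in ("SS", "RS", "LPM", "SPM"):
    print(op, "->", " ".join(e_control_sequence(op)))
```

 Every sequence in the table ends in the status-setting step E5, which
 matches the description above.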
 
 When the control sequence corresponding to E-control is finished,
 the control flow returns to the point where $G_{ap}$-control was
 invoked in DM-control. This is because the control flow had branched
 into E-control at the end of the $G_{ap}$-control sequence.
 
 4.4 Logic Complexity of Processing Element 
 
 From the viewpoint of LSI implementation of a PE, two things are of
 major importance: the number of circuit elements and the number of
 external pins required for the chip. Together, the number of circuit
 elements and the number of pins determine the silicon real estate, the
 density of the circuit elements, the heat dissipation, etc. The number
 of circuit elements depends on the technology used for the
 implementation of the logic on the chip. In this thesis, we shall use
 the number of gates as an indirect measure of logic complexity because
 the number of circuit elements is directly related to the number of
 gates. Further, a multi-input NAND gate is considered equivalent to a
 2-input NAND gate because in TTL logic, a multi-input NAND is realized
 by the use of a multi-emitter transistor. These assumptions have been
 made for the sake of simplicity.
 
 The overall gate complexity and pin complexity of a PE must take 
 into account the gates and pins required by a PE's major components: 
 DPL, PE control logic and Register File. 
 
 4.4.1 Logic complexity of DPL 
 
 4.4.1.1 Gate complexity of digit processing logic DPL - The total
 number of gates required for the DPL is equal to the sum of the gates
 necessary for its various components: Adder MIAD, Digit Product
 Generator DPG, Digit Sum Encoder DSE, various selector networks sADR,
 sDSE, sROB, sRIB, and sTOP and the storage buffer registers in the DPL.
 The gates required for MIAD, DPG, DSE and selectors sADR and sDSE are
 dependent on the choice of the logic vector encoding for the redundant
 binary digit. From the earlier discussion in Sections 4.2.2.2 through
 4.2.2.6, it is clear that the sign-magnitude logic vector encoding is
 the simplest encoding and requires the least number of gates for the
 implementation of DPG, sADR and sDSE. In the following, we shall
 calculate the gate complexity of DPL, assuming only the sign-magnitude
 (SM) logic vector encoding.
 
 Let

 $G_{DPL}$ = Total number of gates required for DPL, excluding
 storage registers,

 $G_{DPG}$ = Gates required for the logic implementation of Digit
 Product Generator, DPG, using 'local generation' of
 the product transfer $t^P_i$,

 $G_{MIAD}$ = Gates required for the radix-$2^k$ adder, MIAD,

 $G_{DSE}$ = Gates required for Digit Sum Encoder, DSE,

 and let $G_{sDSE}$, $G_{sRIB}$, $G_{sROB}$, $G_{sADR}$ and $G_{sTOP}$
 respectively denote the gates required for the selectors sDSE, sRIB,
 sROB, sADR and sTOP. From the design details described in Section 4.2,
 it is clear that

 $$G_{MIAD} = 26k^2 \text{ NANDs, assuming no encoder MATE for } t^A_{i-1},$$

 $$G_{DPG} = k^2 \text{ ANDs} + 2 \text{ Exclusive-OR gates for sign generation} = k^2 + 8 \text{ gates},$$

 considering an AND and a NAND gate equivalent, and 1 Exclusive-OR gate
 equivalent to 4 NAND gates. From Equations (4.17) through (4.22), we have
 
 $$G_{DSE} = 16k + c_1$$
 $$G_{sADR} = 3k^2 + 3k + 1$$
 $$G_{sDSE} = 7k$$
 $$G_{sRIB} = 4(k+1)$$
 $$G_{sROB} = (k+1)(k+2)$$
 $$G_{sTOP} = 3(k+2)$$

 Therefore, the total number of gates required for the combinational
 processing logic DPL is given by

 $$\begin{aligned}
 G_{DPL} &= G_{MIAD} + G_{DPG} + G_{DSE} + G_{sADR} + G_{sDSE} + G_{sRIB} + G_{sROB} + G_{sTOP} \\
         &= 26k^2 + (k^2 + 8) + (16k + c_1) + (3k^2 + 3k + 1) + 7k + 4(k+1) + (k+1)(k+2) + 3b.
 \end{aligned}$$

 Ignoring the constant $c_1$ and assuming the width $b$ of port $TOP_i$
 to be equal to $(k+2)$, we have

 $$G_{DPL} = 31k^2 + 36k + 21 \tag{4.24}$$
 
 In the expression above for $G_{DPL}$, the sum of the gates contributed
 by the three major components DPG, MIAD and DSE alone is
 $(27k^2 + 16k + 8 + c_1)$ and forms the bulk of the gates required for
 the implementation of DPL. The other components, like the selector
 networks, contribute a progressively smaller percentage of gates to the
 gate complexity of DPL as the value of k increases. Table 4.2 lists the
 values of the gates required for various components of the DPL as a
 function of radix r, and the last column in the table shows the gate
 complexity of the combinational processing logic.

 [Table 4.2: Gates required for the components of the DPL as a function
 of radix r]
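
 The component expressions above are easy to evaluate numerically. The
 sketch below does so for k = 2 to 8 under the same assumptions as the
 text (constant $c_1$ ignored, TOP port width b = k + 2); the function name
 `dpl_gates_max_redundancy` is hypothetical and the output is computed
 from the formulas, not copied from Table 4.2.

```python
def dpl_gates_max_redundancy(k):
    """Gate counts of the DPL components for radix 2**k, maximal redundancy."""
    parts = {
        "MIAD": 26 * k * k,             # multi-input adder, no encoder MATE
        "DPG": k * k + 8,               # k^2 ANDs + 2 XORs at 4 NANDs each
        "DSE": 16 * k,                  # constant c1 ignored, as in the text
        "sADR": 3 * k * k + 3 * k + 1,
        "sDSE": 7 * k,
        "sRIB": 4 * (k + 1),
        "sROB": (k + 1) * (k + 2),
        "sTOP": 3 * (k + 2),            # port width b taken as k + 2
    }
    parts["DPL"] = sum(parts.values())  # total over the eight components
    return parts


for k in range(2, 9):
    g = dpl_gates_max_redundancy(k)
    # The component sum should agree with the closed form of Equation (4.24).
    assert g["DPL"] == 31 * k * k + 36 * k + 21
    print(2 ** k, k, g)
```

 The share of DPG, MIAD and DSE in the total can be read off the same
 dictionary, for example `(g["DPG"] + g["MIAD"] + g["DSE"]) / g["DPL"]`.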
 
 4.4.1.2 Pin complexity of DPL - Pin complexity is independent of
 the logic vector encoding chosen for a redundant binary digit. The
 number of pins required for digit processing logic DPL is the sum of the
 pins necessary for input ports $TIP_i$, $RIP_i$ and output ports $TOP_i$
 and $ROP_i$. Pins $P_{RIP_i}$ and $P_{ROP_i}$, required for ports $RIP_i$
 and $ROP_i$ respectively, are dependent on the method of generation of
 the 'product transfer' $t^P_i$ and $t^P_{i-1}$, respectively. Similarly,
 $P_{TIP_i}$, the pins required for port $TIP_i$, is dependent on the
 method of encoding used for the 'Adder Transfer' $t^A_i$. The port
 $TOP_i$ is shared both by the Read/Write and address lines for $PEM_i$
 and the 'Adder Transfer' $t^A_{i-1}$. $P_{TOP_i}$, the pins required for
 $TOP_i$, is the larger of the number of pins required for $t^A_{i-1}$
 and the PEM address lines. The total number of pins, $P_T$, necessary
 for the logic implementation of DPL is equal to the sum of the pins
 required for input and output ports:

 $$P_T = P_{TOP_i} + P_{TIP_i} + P_{RIP_i} + P_{ROP_i}$$
 
 Let

 $P^{A}_{t^P_{i-1}}$ = Pins required for generating $t^P_{i-1}$ using
 the 'Adjacent Generation' method.

 $P^{L}_{t^P_{i-1}}$ = Pins required for generating $t^P_{i-1}$ using
 the 'Local Generation' method.

 $P^{R}_{t^P_{i-1}}$ = Pins required for $t^P_{i-1}$ using ROM for DPG.

 $P^{NE}_{t^A_{i-1}}$ = Pins required for $t^A_{i-1}$ without encoder MATE.

 $P^{E}_{t^A_{i-1}}$ = Pins required for $t^A_{i-1}$ using encoder MATE.

 $P^{R}_{t^A_{i-1}}$ = Pins required for $t^A_{i-1}$ using ROM for DPG.
 
 If the Read Only Memory (ROM) is used for the implementation of DPG,
 and $P^R_T$ is the total number of pins required for DPL, using ROM for
 DPG, then

 $$\begin{aligned}
 P^R_T &= \max(k+2,\ P^R_{t^A_{i-1}}) + P^R_{t^A_i} + P^R_{t^P_i} + P^R_{t^P_{i-1}} \\
       &= \max(k+2,\ 4) + 4 + (k+1) + (k+1) \\
       &= (k+2) + 4 + (k+1) + (k+1) \\
       &= 3k + 8
 \end{aligned} \tag{4.25}$$
 
 Another interesting configuration for the DPL implementation is
 when DPG uses the 'Local Generation' method for $t^P_i$ and the encoder
 MATE is used to reduce the pins required for $t^A_{i-1}$. If $P^{EL}_T$
 denotes the total pins required for such a configuration, then

 $$\begin{aligned}
 P^{EL}_T &= \max(k+2,\ P^E_{t^A_{i-1}}) + P^E_{t^A_i} + P^L_{t^P_i} + P^L_{t^P_{i-1}} \\
          &= \max(k+2,\ 2\lceil \log_2 (k+1) \rceil) + 2\lceil \log_2 (k+1) \rceil + (k+1) + (k+1) \\
          &= (k+2) + 2\lceil \log_2 (k+1) \rceil + 2(k+1) \\
          &= 3k + 4 + 2\lceil \log_2 (k+1) \rceil
 \end{aligned} \tag{4.26}$$
 
 Still another implementation configuration of interest for comparison
 purposes is the one which uses no encoder for the 'Adder Transfer' and
 the 'Adjacent Generation' method for the collective product transfer
 $t^P_i$. Let $P^{NEA}_T$ be the total number of pins required for such a
 configuration. Now

 $$\begin{aligned}
 P^{NEA}_T &= \max(k+2,\ P^{NE}_{t^A_{i-1}}) + P^{NE}_{t^A_i} + P^A_{t^P_i} + P^A_{t^P_{i-1}} \\
           &= 2k + 2k + \left(\frac{k(k-1)}{2} + 1\right) + \left(\frac{k(k-1)}{2} + 1\right) \\
           &= 4k + k(k-1) + 2 \\
           &= k^2 + 3k + 2
 \end{aligned} \tag{4.27}$$

 Finally, we have a configuration which uses no encoder for $t^A_i$
 and the 'Local Generation' method for $t^P_i$. The total pins
 $P^{NEL}_T$ for this configuration is given by the following:

 $$\begin{aligned}
 P^{NEL}_T &= \max(k+2,\ P^{NE}_{t^A_{i-1}}) + P^{NE}_{t^A_i} + P^L_{t^P_i} + P^L_{t^P_{i-1}} \\
           &= 2k + 2k + (k+1) + (k+1) \\
           &= 6k + 2
 \end{aligned} \tag{4.28}$$
 
 Values of Equations (4.25), (4.26), (4.27) and (4.28) are tabulated in
 Table 4.3 for various values of the parameter k. It shows that the
 configuration using ROM for DPG requires the least number of pins for
 all $k \ge 3$ (at k = 2 the configuration without encoder and with
 'Adjacent Generation' needs 12 pins against 14). However, the bit
 capacity $(= k^2 2^{k+1})$ of ROM required for values of $k \ge 6$
 becomes too large and hence not suitable. On the other hand, 'Local
 Generation' of $t^P_i$ and use of the encoder MATE for 'Adder Transfer'
 gives a reasonable pin count, but at the cost of introducing a new cell
 (full adder) for MATE in the realization of MIAD.

 Table 4.3 Pin Complexity of DPL Vs Radix for 1/2 < δ ≤ 1

   Radix r = 2^k    k    P^R_T    P^EL_T    P^NEA_T    P^NEL_T
         4          2      14       14        12         14
         8          3      17       17        20         20
        16          4      20       22        30         26
        32          5      23       25        42         32
        64          6      26       28        56         38
       128          7      29       31        72         44
       256          8      32       36        90         50

 $P^R_T$ = Total pins for DPL using Read-Only-Memory DPG.

 $P^{EL}_T$ = Total pins for DPL using Encoder for Adder Transfer $t^A_i$
 and 'Local Generation' of $t^P_i$.

 $P^{NEA}_T$ = Total pins for DPL without Encoder for $t^A_i$, and
 'Adjacent Generation' of $t^P_i$.

 $P^{NEL}_T$ = Total pins for DPL without Encoder for $t^A_i$ and 'Local
 Generation' of $t^P_i$.
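
 The four pin totals of Equations (4.25) through (4.28) can be checked
 with a few lines of code. This is a sketch only: `ceil_log2` and
 `dpl_pins` are hypothetical helper names, and the per-port pin counts are
 the ones substituted into the equations above (4 pins for the adder
 transfer with a ROM DPG, $2\lceil\log_2(k+1)\rceil$ with the encoder MATE,
 2k without it, k+1 for 'Local Generation' and k(k-1)/2 + 1 for 'Adjacent
 Generation').

```python
def ceil_log2(n):
    """Integer ceil(log2(n)) for n >= 1, without floating point."""
    return (n - 1).bit_length()


def dpl_pins(k):
    """Total DPL pins for the four configurations, radix 2**k."""
    mem = k + 2                      # the k + 2 term in Max(k + 2, ...) for port TOP
    enc = 2 * ceil_log2(k + 1)       # adder transfer with encoder MATE
    noenc = 2 * k                    # adder transfer without encoder
    local = k + 1                    # product transfer, 'Local Generation'
    adjacent = k * (k - 1) // 2 + 1  # product transfer, 'Adjacent Generation'
    rom_a, rom_p = 4, k + 1          # adder / product transfer with ROM DPG
    return {
        "R": max(mem, rom_a) + rom_a + 2 * rom_p,
        "EL": max(mem, enc) + enc + 2 * local,
        "NEA": max(mem, noenc) + noenc + 2 * adjacent,
        "NEL": max(mem, noenc) + noenc + 2 * local,
    }


for k in range(2, 9):
    print(2 ** k, k, dpl_pins(k))
```

 For k = 4 this gives 20, 22, 30 and 26 pins, the same values as the
 corresponding row of Table 4.3.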
 
 4.4.1.3 Effect of multiplier digit's redundancy on gate and pin
 complexity of DPL - In the discussion so far, we have assumed that both
 the multiplier and multiplicand digits are maximally redundant, that is,
 both can assume magnitudes equal to or less than (r-1). However, if the
 multiplier digit has a redundancy ratio $\delta$ lying between 1/2 and
 2/3, that is, the maximum magnitude of the multiplier digit is
 $\le \lceil \tfrac{2}{3}(r-1) \rceil$, then the multiplier digit can be
 recoded in Non-Adjacent Format (NAF) [43,49]. In a NAF recoded radix-r
 multiplier digit, no two adjacent redundant binary digits are nonzero.
 That is, the recoded multiplier digit $x_i$ is of the form

 $$x_i = \sum_{j=0}^{k-1} x_{i_j} 2^j, \qquad x_{i_j} \in \{\bar{1}, 0, 1\}$$

 such that

 $$|x_{i_{j+1}}| \cdot |x_{i_j}| \ne 1$$

 where $|x_{i_j}|$ is the absolute value of the redundant binary digit
 $x_{i_j}$.

 With the multiplier digit in NAF format, the number of inputs to
 all MIRBAs of the radix-$2^k$ adder can be reduced to
 $(\lceil k/2 \rceil + 1)$ from (k+1), by combining two adjacent redundant
 binary digits of a column into one redundant binary digit. This is shown
 in Figure 4.29. The reduction in the number of inputs to the MIRBAs of
 the radix-$2^k$ adder causes a corresponding decrease in the gate and pin
 complexity of the Digit Processing Logic.
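
 As an illustration of the recoding, the sketch below converts an integer
 into non-adjacent form with the usual mod-4 rule and checks that every
 multiplier digit of magnitude up to $\lceil \tfrac{2}{3}(r-1) \rceil$ fits
 into k redundant binary digits. The routine is a generic NAF recoder, not
 the recoding logic of the thesis, and the names `naf_digits` and
 `max_multiplier_magnitude` are hypothetical.

```python
def naf_digits(x):
    """Non-adjacent form of integer x, least significant digit first.

    Each digit is -1, 0 or +1 and no two neighbouring digits are nonzero.
    """
    digits = []
    while x != 0:
        if x % 2:                  # odd: pick +1 or -1 so the next digit is 0
            d = 2 - (x % 4)
            x -= d
        else:
            d = 0
        digits.append(d)
        x //= 2
    return digits


def max_multiplier_magnitude(k):
    """ceil(2 * (r - 1) / 3) for radix r = 2**k, in integer arithmetic."""
    r = 2 ** k
    return -(-2 * (r - 1) // 3)


for k in range(2, 9):
    m = max_multiplier_magnitude(k)
    for x in range(-m, m + 1):
        d = naf_digits(x)
        assert len(d) <= k                                   # fits in k digits
        assert sum(di * 2 ** j for j, di in enumerate(d)) == x
        assert all(d[j] * d[j + 1] == 0 for j in range(len(d) - 1))
        assert sum(1 for di in d if di) <= (k + 1) // 2      # at most ceil(k/2) nonzero
```

 The last assertion is the property used in the text: at most
 $\lceil k/2 \rceil$ of the k digits are nonzero, so adjacent pairs can be
 merged into one composite digit.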
 
 [Figure 4.29: Combination of adjacent redundant binary digits with a
 NAF recoded multiplier digit, reducing the number of inputs to the
 MIRBAs]
 
 Gate Complexity

 Assuming the sign-magnitude logic vector encoding for a redundant
 binary digit, and 'Local Generation' of the product transfer $t^P_i$,
 the gates $G'_{DPG}$ for the logic of the Digit Product Generator are
 given by

 $G'_{DPG}$ = gates required for the magnitude bits of the 'product'
 array $w_i$ and 'transfer' array $t^P_i$ + gates required for
 the generation of sign bits of 'product' and 'transfer'
 arrays + gates required to combine adjacent bits and
 their corresponding signs (one of the bits is zero)
 to form a single (composite) redundant binary digit,
 shown by circles in Figure 4.29

 $$\begin{aligned}
 G'_{DPG} &= k^2 + 8k + (\text{total number of composite redundant binary digits}) \times (\text{gates required to form one composite redundant binary digit}) \\
          &= k^2 + 8k + k \left\lceil \frac{k}{2} \right\rceil \times 4
 \end{aligned} \tag{4.29}$$

 The above expression shows that the gates required for the Digit Product
 Generator, DPG, are increased for $1/2 \le \delta \le 2/3$ compared to
 the maximal redundancy case by an amount equal to
 $4k \lceil k/2 \rceil + 8(k-1)$.
 
 Further, the complexity of the selector network sADR is also
 increased because the composite redundant binary digit has to be
 individually routed to the input of the MIRBAs through the selector sADR.

 # of gates needed for the magnitude bits selection = $3k \lceil k/2 \rceil$

 # of gates required for sign bits selection = $(2k+1) \lceil k/2 \rceil$

 Total # of gates required for sADR network = $(2k+1) \lceil k/2 \rceil + 3k \lceil k/2 \rceil$

 $$G'_{sADR} = (5k+1) \left\lceil \frac{k}{2} \right\rceil \tag{4.30}$$
 
 From the above it is clear that although the number of gates
 required for the sign bits' selection is increased compared to the case
 when $\delta = 1$, the number of gates required for the magnitude bits'
 selection is almost halved and the overall number of gates $G'_{sADR}$
 required for the selector network sADR is decreased compared to the
 maximal redundancy case.

 However, there is a drastic reduction in the number of gates
 required for the adder MIAD because of the decrease in the number of
 inputs to each MIRBA. The gates $G'_{MIAD}$ required are

 $$G'_{MIAD} = 26k \left\lceil \frac{k}{2} \right\rceil \tag{4.31}$$
 
 There is no change, due to change in redundancy, in the gates required
 for either the Digit Sum Encoder DSE or the other remaining selector
 networks sDSE, sRIB and sTOP. Therefore, the total number of gates
 $G'_{DPL}$ required for the Digit Processing Logic, when the multiplier
 digit redundancy is restricted to $1/2 \le \delta \le 2/3$ only, is

 $$\begin{aligned}
 G'_{DPL} &= G'_{DPG} + G'_{sADR} + G'_{MIAD} + G_{DSE} + G_{sDSE} + G_{sRIB} + G'_{sROB} + G_{sTOP} \\
          &= k^2 + 8k + 4k\left\lceil \frac{k}{2} \right\rceil + (5k+1)\left\lceil \frac{k}{2} \right\rceil + 26k\left\lceil \frac{k}{2} \right\rceil + 16k + 7k + 4(k+1) + (k+1)\left(\left\lceil \frac{k}{2} \right\rceil + 2\right) + 3(k+2) \\
          &= k^2 + (36k + 2)\left\lceil \frac{k}{2} \right\rceil + 40k + 12
 \end{aligned} \tag{4.32}$$

 The gates for sROB are calculated on the assumption that we reduce
 the number of registers in the register file from (k+1) to
 $(\lceil k/2 \rceil + 2)$.
 
 The values of $G'_{DPL}$ and its various components are given in
 Table 4.4 for different values of the parameter k. A comparison of
 Table 4.2 and Table 4.4 shows that the reduction in the gates required
 for digit processing logic, for $1/2 \le \delta \le 2/3$, comes mainly
 from the drastic reduction in the number of gates necessary for the
 adder MIAD.
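
 The restricted-redundancy gate counts can be evaluated in the same way
 as the maximal-redundancy ones. The sketch below uses the component
 expressions of Equations (4.29) through (4.32), with
 $\lceil k/2 \rceil$ computed as `(k + 1) // 2`; the name
 `dpl_gates_restricted` is hypothetical, and the totals it is compared
 against are the last column of Table 4.4.

```python
def dpl_gates_restricted(k):
    """Gate counts of the DPL components for radix 2**k, 1/2 <= delta <= 2/3."""
    h = (k + 1) // 2                        # ceil(k / 2)
    parts = {
        "DPG": k * k + 8 * k + 4 * k * h,   # Equation (4.29)
        "MIAD": 26 * k * h,                 # Equation (4.31)
        "sADR": (5 * k + 1) * h,            # Equation (4.30)
        "DSE": 16 * k,                      # unchanged by the redundancy
        "sDSE": 7 * k,
        "sRIB": 4 * (k + 1),
        "sROB": (k + 1) * (h + 2),          # reduced register file
        "sTOP": 3 * (k + 2),
    }
    parts["DPL"] = sum(parts.values())
    return parts


# Totals from the last column of Table 4.4, keyed by k.
TABLE_4_4_TOTALS = {2: 170, 3: 361, 4: 480, 5: 783, 6: 942, 7: 1357, 8: 1556}

for k, total in TABLE_4_4_TOTALS.items():
    g = dpl_gates_restricted(k)
    h = (k + 1) // 2
    assert g["DPL"] == total
    assert g["DPL"] == k * k + (36 * k + 2) * h + 40 * k + 12   # Equation (4.32)
    print(2 ** k, k, g)
```

 For k = 4, for instance, MIAD drops from 416 gates in the maximal
 redundancy case to 208, which accounts for most of the saving.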
 
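 Pin Complexity

 Using the same notation as in the case of $1/2 < \delta \le 1$, we have

 $P'^{NEL}_T$ = Total number of pins required for implementation
 of DPL, using 'Local Generation' method of $t^P_i$ and
 no encoder for $t^A_i$, for $1/2 \le \delta \le 2/3$ †

 $$\begin{aligned}
 P'^{NEL}_T &= \max\left(k+2,\ 2\left\lceil \frac{k}{2} \right\rceil\right) + 2\left\lceil \frac{k}{2} \right\rceil + (k+1) + (k+1) \\
            &= (k+2) + 2\left\lceil \frac{k}{2} \right\rceil + 2(k+1) \\
            &= 3k + 4 + 2\left\lceil \frac{k}{2} \right\rceil
 \end{aligned} \tag{4.33}$$

 Similarly

 $$\begin{aligned}
 P'^{EL}_T &= \max\left(k+2,\ 2\left\lceil \log_2\left(\left\lceil \frac{k}{2} \right\rceil + 1\right) \right\rceil\right) + 2\left\lceil \log_2\left(\left\lceil \frac{k}{2} \right\rceil + 1\right) \right\rceil + (k+1) + (k+1) \\
           &= 3k + 4 + 2\left\lceil \log_2\left(\left\lceil \frac{k}{2} \right\rceil + 1\right) \right\rceil
 \end{aligned} \tag{4.34}$$

 † Strictly speaking, $\delta = \lceil \tfrac{2}{3}(r-1) \rceil / (r-1)$,
 which may be slightly larger than 2/3 for certain values of r. In this
 thesis, however, we shall say that $\delta \le 2/3$.

 Table 4.4 Gate Complexity of DPL Vs Radix for 1/2 ≤ δ ≤ 2/3
 and Sign-Magnitude Logic Vector Encoding for a Redundant Binary Digit

   r = 2^k   k   G'_DPG   G'_MIAD   G'_sADR   G_DSE   G_sDSE   G_sRIB   G'_sROB   G_sTOP   G'_DPL
       4     2     28        52       11        32      14       12        9        12       170
       8     3     57       156       32        48      21       16       16        15       361
      16     4     80       208       42        64      28       20       20        18       480
      32     5    125       390       78        80      35       24       30        21       783
      64     6    156       468       93        96      42       28       35        24       942
     128     7    217       728      144       112      49       32       48        27      1357
     256     8    256       832      164       128      56       36       54        30      1556

 Table 4.5 Pin Complexity of DPL Vs Radix for 1/2 ≤ δ ≤ 2/3

   Radix r = 2^k    k    P'^EL_T    P'^NEL_T
         4          2       12         12
         8          3       17         17
        16          4       20         20
        32          5       23         25
        64          6       26         28
       128          7       31         33
       256          8       34         36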
 
 Values of both the Equations (4.33) and (4.34) are tabulated in
 Table 4.5.
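
 The two restricted-redundancy pin totals can be recomputed from their
 Max(...) forms. In this sketch the number of adder-transfer pins is taken
 as $2\lceil k/2 \rceil$ without the encoder and
 $2\lceil \log_2(\lceil k/2 \rceil + 1) \rceil$ with it, as in Equations
 (4.33) and (4.34); the helper names are hypothetical.

```python
def ceil_log2(n):
    """Integer ceil(log2(n)) for n >= 1, without floating point."""
    return (n - 1).bit_length()


def dpl_pins_restricted(k):
    """Total DPL pins for radix 2**k with 1/2 <= delta <= 2/3."""
    h = (k + 1) // 2                 # ceil(k / 2)
    mem = k + 2                      # the k + 2 term in Max(k + 2, ...) for port TOP
    local = k + 1                    # product transfer, 'Local Generation'
    noenc = 2 * h                    # adder transfer without encoder
    enc = 2 * ceil_log2(h + 1)       # adder transfer with encoder MATE
    return {
        "EL": max(mem, enc) + enc + 2 * local,
        "NEL": max(mem, noenc) + noenc + 2 * local,
    }


for k in range(2, 9):
    p = dpl_pins_restricted(k)
    assert p["NEL"] == 3 * k + 4 + 2 * ((k + 1) // 2)   # Equation (4.33)
    print(2 ** k, k, p)
```

 The printed values agree with Table 4.5; the two configurations differ by
 at most two pins over the range k = 2 to 8.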
 
 A comparison of Tables 4.3 and 4.5 shows that by restricting the
 redundancy ratio to $1/2 \le \delta \le 2/3$ for each multiplier digit,
 one can achieve almost the same number of total pins for DPL as are
 achieved by using $\delta = 1$ and encoder MATE, without having to
 introduce the new cell for MATE. The introduction of MATE destroys the
 uniformity of the structure of MIAD.
 
 4.4.2 Logic complexity of PE control - The major components of PE
 control logic are the microinstruction register, MIR, the selector
 network sMIR, the Zero and Sign Detection Logic ZSD, the
 microinstruction decoder and the control and timing signal generator,
 TCS. Of these, the gate complexity of only the selector sMIR and ZSD is
 dependent on the bit width (= k+1) of the PE module because each is one
 digit wide. The gate complexity of TCS is independent of bit width, if
 we exclude the file register address decoders from consideration.
 However, the gate complexity is dependent on the method of implementing
 the control sequence charts described earlier. The author used the
 control point technique used in ILLIAC III [45] for the implementation
 of control sequence charts, in order to calculate the gate complexity of
 PE control.
 
 4.4.2.1 Gate complexity of PE control - Table 4.6 shows the gates
 required for each subcontrol of TCS in terms of the number of control
 points, gates for the control points and the gates required for the
 conditional generation of control and timing signals.

 The last column of Table 4.6 shows the total number of gates
 required for each subcontrol. Let $G_{TCS}$ denote the total number of
 gates required for the Timing and Control Signal Generator, TCS.

 $$G_{TCS} = 200 \text{ NAND gates}$$

 In addition, let $G_{DCD}$, $G_{sMIR}$ and $G_{ZSD}$ denote the gates
 required for the logic implementation of the microinstruction decoder,
 selector sMIR<4:0> and the Zero and Sign Detector. These gates are
 given by

 $$G_{DCD} = 32 \text{ NAND gates}$$
 $$G_{sMIR} = 15 \text{ NAND gates}$$
 $$G_{ZSD} = 6 \text{ NAND gates}$$

 Therefore

 $G_{PCL}$ = Total # of gates required by PE control
 logic excluding storage elements

 $$G_{PCL} = G_{TCS} + G_{DCD} + G_{sMIR} + G_{ZSD} = 253 \tag{4.35}$$
 
 [Table 4.6: Gates required for each subcontrol of TCS]

 4.4.2.2 Pin complexity of PE control - The total number of pins
 required for the logic implementation of PE local control is the sum of
 the pins required for microinstruction ports $MIP_i$ and $MOP_i$ and the
 pins required for the request-response signals of TCS. Denoting by
 $P_{PCL}$ the total pins required by PE local control, we have

 $$\begin{aligned}
 P_{PCL} &= P_{MIP_i} + P_{MOP_i} + P_{TCS} \\
         &= 11 + 11 + 14 \\
         &= 36
 \end{aligned} \tag{4.36}$$
 
 If the multiplier digit has redundancy $1/2 \le \delta \le 2/3$, then
 the number of internal registers in the PE reduces to
 $\lceil k/2 \rceil + 1$ from (k+1). The number of address bits required
 to specify the internal register correspondingly reduces to
 $\lceil \log_2 (\lceil k/2 \rceil + 1) \rceil$. This results in the
 saving of one pin in each of the microinstruction ports $MIP_i$ and
 $MOP_i$, and thus the pins required for PE control logic reduce to 34.
 
 4.4.3 Overall logic complexity of a PE - The total number of gates,
 $G_{PE}$, required for the implementation of a PE is the sum of the
 gates required for the combinational logic of DPL, the gates required
 for the PE control logic and the gates required for the implementation
 of storage registers in the PE. The gates required for DPL and the
 storage registers are a function of the parameter k (radix $2^k$), which
 represents the bit width of a PE. The gates required for PE control
 logic are virtually independent of k and are about 250 NANDs. The
 storage registers in a PE comprise the registers in the register file,
 buffer registers in DPL and the register MIR in PE control logic.
 Considering that all the storage registers are made of edge triggered
 D-type flip-flops, the number of gates $G_{STO}$ required for the storage
 registers is given by

 $G_{STO}$ = (# of registers in the register file × (k+1) +
 width of IBR + width of APR + width of GIR + width
 of MIR) × gates required for one D-type edge triggered
 flip-flop.

 A D-type edge triggered flip-flop requires 6 NAND gates [46].
 Therefore, for multiplier digit redundancy ratio $1/2 < \delta \le 1$,
 we have

 $$\begin{aligned}
 G_{STO} &= 6 \times \left[ (k+1)(k+1) + (k+1) + (k+1) + 2\lceil \log_2 (k+1) \rceil + 11 \right] \\
         &= 6 \left[ k^2 + 4k + 2\lceil \log_2 (k+1) \rceil + 14 \right]
 \end{aligned} \tag{4.37}$$
 
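 For multiplier digit redundancy ratio $1/2 \le \delta \le 2/3$,

 $$\begin{aligned}
 G'_{STO} &= 6 \times \left[ \left(\left\lceil \frac{k}{2} \right\rceil + 1\right)(k+1) + (k+1) + (k+1) + 2\left\lceil \frac{k}{2} \right\rceil + 10 \right] \\
          &= 6 \left[ 3k + (k+3)\left\lceil \frac{k}{2} \right\rceil + 13 \right]
 \end{aligned} \tag{4.38}$$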
 
 Now for $1/2 < \delta \le 1$

 $$G_{PE} = G_{DPL} + G_{PCL} + G_{STO}$$

 Substituting the values from Equations (4.24), (4.35) and (4.37)

 $$G_{PE} = 37k^2 + 60k + 12\lceil \log_2 (k+1) \rceil + 358 \tag{4.39}$$
 
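 Similarly for $1/2 \le \delta \le 2/3$, substituting the values from
 Equations (4.32), (4.35) and (4.38)

 $$\begin{aligned}
 G'_{PE} &= G'_{DPL} + G_{PCL} + G'_{STO} \\
         &= k^2 + k\left(58 + 42\left\lceil \frac{k}{2} \right\rceil\right) + 20\left\lceil \frac{k}{2} \right\rceil + 343
 \end{aligned} \tag{4.40}$$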
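
 The overall gate count is simply the sum of the three contributions, so
 it can be evaluated without relying on a closed form. The sketch below
 assumes the DPL totals of Equations (4.24) and (4.32), the 253 control
 gates of Equation (4.35), and 6 NAND gates per stored bit; for the
 restricted case it further assumes a GIR width of $2\lceil k/2 \rceil$
 bits and a 10-bit MIR, which is one reading of Equation (4.38). The names
 `ceil_log2` and `pe_gates` are hypothetical.

```python
def ceil_log2(n):
    """Integer ceil(log2(n)) for n >= 1, without floating point."""
    return (n - 1).bit_length()


def pe_gates(k, restricted=False):
    """Total PE gate count (DPL + control + storage) for radix 2**k."""
    h = (k + 1) // 2                                  # ceil(k / 2)
    if restricted:                                    # 1/2 <= delta <= 2/3
        dpl = k * k + (36 * k + 2) * h + 40 * k + 12  # Equation (4.32)
        # register file + IBR + APR + GIR + MIR, in bits (assumed widths)
        bits = (h + 1) * (k + 1) + (k + 1) + (k + 1) + 2 * h + 10
    else:                                             # maximal redundancy
        dpl = 31 * k * k + 36 * k + 21                # Equation (4.24)
        bits = (k + 1) * (k + 1) + (k + 1) + (k + 1) + 2 * ceil_log2(k + 1) + 11
    pcl = 200 + 32 + 15 + 6                           # Equation (4.35)
    return dpl + pcl + 6 * bits                       # 6 NANDs per flip-flop


for k in range(2, 9):
    print(2 ** k, k, pe_gates(k), pe_gates(k, restricted=True))
```

 For k = 4 (radix 16) this gives 1226 gates with maximal redundancy and
 967 gates with the restricted multiplier digit.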
 
 The total number of pins required for the logic implementation of
 the PE is the sum of the pins necessary for the DPL and the pins required
 for the PE control logic. From the earlier discussion, we find that by
 restricting multiplier (quotient) digit redundancy to
 $1/2 \le \delta \le 2/3$, a reduction results both in gate complexity
 and the total number of pins required for the PE. Moreover, the
 reduction in pins is achieved without destroying the cellularity and
 structural uniformity of the multi-input adder, MIAD. The total number
 of pins, $P'_{PE}$, for the PE is given by, for $1/2 \le \delta \le 2/3$,

 $$\begin{aligned}
 P'_{PE} &= \left(3k + 4 + 2\left\lceil \frac{k}{2} \right\rceil\right) + 34 \\
         &= 3k + 38 + 2\left\lceil \frac{k}{2} \right\rceil
 \end{aligned} \tag{4.41}$$
 
 
 5. INTERACTION WITH MEMORY 
 
 5.1 Introduction 
 
 The arithmetic structure consisting of Mantissa Processing Logic 
 (MPL) and Exponent Processing Logic (EPL) needs to communicate with 
 Data Main Memory (DMM) for fetching the operands and for storing the 
 results. The major considerations in the design of the interface 
 between the MPL and the DMM are: 
 
 a) a PE in the MPL should require minimum possible number of pins 
 for the addresses of the operands, and 
 
 b) the interface and the DMM should have high data bandwidth to 
 satisfy the data needs of concurrently processing PEs. 
 
 The first point suggests that the DMM address of the operand should 
 not be carried as part of the microinstruction issued by the mantissa 
 control unit, MCU. Instead we should, along with the microinstruction, 
 send in the modifier field either a pointer to an address register in 
 the interface [30] or the address of some location in a small size 
 
 buffer memory (in the interface) which is preloaded with the operands. 
 In this thesis, the latter approach is adopted. 
 
 Since the different PEs in the mantissa processing logic, MPL, are 
 concurrently active and in general operating on digits of different 
 operands, the different PEs may be accessing the interface simultane- 
 ously. This requires high bandwidth and leads to a distributed, multi- 
 port architecture for the buffer memory. In Section 5.2, the details 
 of the architecture of the mantissa buffer memory called the Local 
 Operand Mantissa Memory (LOMM) are described. 
 
 
 Because the operand address in the LOMM instead of the DMM address 
 of the operand is carried with the microinstruction issued by MCU, a 
 mapping mechanism is necessary in the interface. Besides, the LOMM is 
 of finite size and its contents need to be stored away in the DMM to 
 make space for new operands. A brief description of the mapping mech- 
 anism, the loading (storing) of the operands (results) from (to) DMM into 
 (from) LOMM is given in Section 5.3. Finally, Section 5.4 describes 
 some of the factors which determine the word capacity of the buffer 
 memory. 
 
 The Local Operand Memory for floating point operands consists of 
 two parts: the Local Operand Mantissa Memory, LOMM and the Local Operand 
 Exponent Memory, LOEM. Figure 5.1 shows the block diagram of Local 
 Operand Memory. The architecture of LOMM and LOEM are independent of 
 each other and depend respectively on the organization and architecture 
 of the Mantissa Processing Logic and the Exponent Processing Logic. In 
 this thesis, we shall be concerned only with the architecture and organi- 
 zation of LOMM and refer to it as the buffer memory. Whenever we speak 
 of the buffer memory operand, the exponent part of the operand is under- 
 stood to be in the LOEM. The LOEM can be organized in many ways. Since 
 the number of bits required for the operand exponent is not too large, 
 LOEM could consist of simply one memory module of word size equal to the 
 bit width of the exponent and word capacity equal to that of a PEM 
 module. 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 [Figure 5.1 Block Diagram of Local Operand Memory]
 
 5.2 Organization of Local Operand Mantissa Memory, LOMM 
 
 The buffer memory, LOMM (Figure 5.1) consists of as many memory 
 modules, called PEMs as there are PEs in the Mantissa Processing Logic. 
 One PEM is associated with one individual PE and has the same bit width 
 (length of an individual PEM word) as the bit width of the PE. Each PEM 
 communicates with its own individual PE and with the LOMM data register, 
 LDR which acts as an interface buffer between the data main memory, DMM 
 and the buffer memory LOMM. An operand is stored across all the PEMs
 at the same location, each PEM holding one digit of the operand. Each
 PEM can be considered as a set of digit wide general registers for its 
 corresponding PE. The loading of data from the DMM into PEMs and storing 
 of data from PEMs into DMM is under the control of the Local Operand 
 Memory Control, LOMCO and takes place via buffer data register LDR. 
 
 Let us define a Buffer Memory Word as a word formed by concatenating
 the digits stored at the same location of all the PEMs of the
 buffer memory. If the DMM word length is different from the Buffer
 Memory Word length, the logic for assembly and disassembly of DMM words
 and Buffer Memory words also forms part of the buffer memory logic and
 is under the control of LOMCO. Further, the operand in interface buffer
 register LDR is in signed-digit format (each digit carrying the same
 sign as the sign of the operand) while the DMM word is in conventional
 sign-magnitude format. The format conversion logic necessary for
 transformation from sign-magnitude to signed-digit form and vice versa also
 exists in the data path between DMM and the register LDR. Each PEM
 module of the buffer memory has its own independent access control and
 
 
 read/write logic. Each PEM is accessed from two sources: its associ- 
 ated PE which fetches/stores data from/into PEM during the execution of 
 microinstructions LPM/SPM; and LOMCO which accesses the individual PEM 
 to read/write a new buffer memory word from/to PEM modules into/from 
 buffer memory data register, LDR. Since the various PEs are accessing 
 their own PEMs not in synchronism but rather independently, LOMCO reads/ 
 writes a buffer memory word into/from LDR by requesting access to PEMs 
 individually. Thus, the access between DMM and buffer memory is in 
 parallel whereas the individual PEs access the different digits of the 
 buffer memory word in serial mode. 
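
 The storage layout and format conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `to_signed_digits`, `to_sign_magnitude`, `BufferMemory`, `store` and `load`, and the sizes used, are hypothetical and not part of the design. The sketch models two points from the text: a sign-magnitude DMM word becomes a signed-digit word in which every digit carries the operand sign, and a Buffer Memory Word is spread one digit per PEM at a common address.

```python
from typing import List, Tuple


def to_signed_digits(negative: bool, magnitude_digits: List[int]) -> List[int]:
    """Sign-magnitude -> signed-digit: every digit carries the operand's sign."""
    sign = -1 if negative else 1
    return [sign * d for d in magnitude_digits]


def to_sign_magnitude(digits: List[int]) -> Tuple[bool, List[int]]:
    """Inverse conversion; valid only when all non-zero digits share one sign."""
    negative = any(d < 0 for d in digits)
    if negative and any(d > 0 for d in digits):
        raise ValueError("mixed-sign digits must be assimilated first")
    return negative, [abs(d) for d in digits]


class BufferMemory:
    """Hypothetical model of LOMM: one PEM (a list of digit cells) per PE."""

    def __init__(self, num_pes: int, words: int):
        self.pems = [[0] * words for _ in range(num_pes)]

    def store(self, address: int, digits: List[int]) -> None:
        # Digit i of the operand goes to PEM i, all at the same address.
        if len(digits) != len(self.pems):
            raise ValueError("operand must have one digit per PEM")
        for pem, digit in zip(self.pems, digits):
            pem[address] = digit

    def load(self, address: int) -> List[int]:
        # Concatenating the digits at one address gives the Buffer Memory Word.
        return [pem[address] for pem in self.pems]


lomm = BufferMemory(num_pes=4, words=16)
lomm.store(3, to_signed_digits(True, [1, 0, 2, 3]))
word = lomm.load(3)
```

 Keeping each PEM as its own list mirrors the independent access control of each module: a PE touches only `pems[i]`, while the LDR path touches one cell in every module.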
 
 The distributed multiport architecture for LOMM has the advantages 
 of modularity, easy expandability, high data bandwidth and small number 
 of pins per PE for fetching and storing of operands. Each PEM module 
 in LOMM is a random access monolithic memory. Semiconductor memory 
 chips of N bit capacity are organized as N x 1 words because it leads 
 to minimum number of leads and also this organization makes feasible 
 very effective use of error correcting codes and software diagnostic 
 routines to detect and overcome component failures. Each PEM module 
 can be assembled from these chips — as many as the bit width of the PE. 
 Thus having a separate PEM module associated with each PE provides the 
 necessary flexibility and modularity. Secondly, the small word size of 
 a PEM module permits faster access and cycle times. Multiport architec- 
 ture permits concurrent operation of the various PEM modules and thus high 
 data rate and data bandwidth. Multiport organization and the communica- 
 tion of each PE with only its own PEM module allows each PEM module 
 
 
 to be of small word capacity and thus reduces the number of pins required 
 for addressing a location in PEM. Each PEM module can be logically
 considered as an extension of the register file of each PE.
 
 5.3 A Description of Buffer Memory Control 
 
 The operation of the buffer memory is under the control of LOMCO. 
 The function of LOMCO is twofold:
 
 a) When LOMCO is presented with a DMM address of an operand by the 
 GACU, LOMCO replies back with the PEM address of the operand, if the 
 operand is available in the buffer memory. However, if the operand is 
 not available in the buffer memory, LOMCO searches for an empty location 
 in the buffer memory, fetches the operand from the DMM, loads the operand 
 in the empty location of buffer memory and replies back with the address 
 of the location just loaded. 
 
 b) Another function of LOMCO is to store away the contents of 
 those locations in buffer memory which are no longer being used by any 
 of the PEs of the Mantissa Processing Logic to make space for new 
 operands that may be needed by other arithmetic machine instructions. 
 
 The LOMCO achieves the above two functions by a table look-up 
 mechanism. There are as many entries in the control table as there are 
 locations in the buffer memory — one entry for each location. Each entry 
 in the control table has five fields as shown in Figure 5.2. The fields 
 labeled DMMA and PEMA respectively denote the DMM operand address and 
 the corresponding PEM address of the operand. The field labeled VCB is 
 a one bit field and denotes the validity control bit for the data stored 
 
 [Figure 5.2: control table with one entry per buffer memory location (PEMA = 1, 2, 3, 4, ..., $2^{k+1}$) and five fields per entry: PEMA, DMMA, USC, VCB, STF]

 Figure 5.2 Structure of Control Table in LOMCO.

 
 in the location specified by PEMA. Whenever an operand is loaded in 
 the buffer memory from the Data Main Memory, DMM, the validity control 
 bit is set to logical '1' state. When a DMM operand address is presented
 to the LOMCO, an associative table look-up is performed. If there
 is a match and the field VCB is '1', then the corresponding PEMA entry
 (i.e., the address of the location in PEM containing the DMM variable
 (operand)) is sent back as a response. If, however, there is no match
 or if the validity bit is zero but a match takes place†, the LOMCO searches
 for an empty entry in the table. An empty entry in the table indicates
 an empty location in the buffer memory. On finding an empty location, 
 LOMCO fetches the operand specified by DMM operand address, stores it in 
 the empty buffer memory location, sets the VCB field to logical state '1' 
 and responds back with PEMA, that is, the address of the operand location 
 in PEM. Another field associated with each entry is the Usage Count or 
 USC field. This field is essential for deciding when to replace the 
 contents of the corresponding buffer memory location. Its function and 
 necessity can be explained in the following way. 
 
 In order that all the PEs may be kept usefully busy in processing 
 the microinstructions, it is important that there should be a steady 
 stream of microinstructions being issued to PEs and also that the cor- 
 responding operands be available in the buffer memory for processing by 
 the PEs. This implies the use of an instruction look-ahead unit which 
 supplies arithmetic instructions to the GACU of the arithmetic unit. 
 
 † Such a case can occur when the value of the DMM variable is changed
 by some other unit accessing the DMM without a corresponding update in
 the buffer memory.
 
 
 The LOMCO in cooperation with GACU loads the buffer memory with operands 
 in advance of their processing by the PEs. Since the PEs in the Mantissa 
 Processing Logic operate on the individual digits of an operand in 
 sequence and different PEs may be operating on different operands or on 
 different digits of the same operand, an operand in the buffer memory 
 cannot be replaced by a new operand as long as a PE in MPL is using any 
 digit of the operand. Moreover, if in the arithmetic instruction look- 
 ahead buffer in GACU, there exists an arithmetic instruction which may 
 make use of the operand in buffer memory, the operand should not be 
 replaced by a new operand in order to avoid an unnecessary DMM access 
 by LOMCO. All these control functions are provided through the Usage 
 Count field, USC, in each entry of the table. Contents of the field USC is 
 a tally which is incremented by one every time a match is obtained with 
 the DMMA field in the table entry or an operand is fetched from DMM into 
 buffer memory. This tally in USC is decreased by unity every time a PEM 
 accessing microinstruction exits from the last PE in the MPL. The PEM 
 accessing microinstructions are the LPM and SPM microinstructions as 
 discussed in Section 2.6. When the tally in the usage count field USC 
 goes to zero, it implies that no PE is using the operand in the corre- 
 sponding buffer memory location and it can be replaced by a new operand 
 from DMM, if necessary. 
 
 The final field associated with each entry in the control table is 
 the Store Flag field, STF. It is a one bit field and is set to logical 
 state '1' every time the PEM reference microinstruction SPM exits from 
 the last PE in MPL. Whenever the tally in the USC field goes to zero and 
 
 
 field STF is '1', then the LOMCO would store the contents of all the PEMs 
 at the corresponding location into DMM at a location specified by DMMA 
 field. Note that only the 'final' value of an operand (at the end of a 
 series of calculations involving that operand) is stored in DMM even 
 though there may be a set of store microinstructions, SPMs, in the stream 
 of microinstructions flowing through the PEs. This is because there is
 no guarantee, unless the tally in USC is zero, that no PE is modifying
 the contents of its PEM location when an SPM microinstruction referring
 to the same buffer memory location exits from the last PE in MPL.
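
 The table look-up mechanism of this section can be summarized as a small Python model. It is a sketch under stated assumptions rather than the actual LOMCO logic: the class and method names (`Entry`, `ControlTable`, `request`, `retire`) are hypothetical, the associative search is modelled by a linear scan, and an "empty" entry is taken to be one whose usage count is zero.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    pema: int                   # PEM address of this buffer location
    dmma: Optional[int] = None  # DMM address of the operand held, if any
    usc: int = 0                # usage count tally
    vcb: bool = False           # validity control bit
    stf: bool = False           # store flag


class ControlTable:
    def __init__(self, words: int, dmm: Dict[int, object]):
        self.entries = [Entry(pema=i) for i in range(words)]
        self.buffer: List[object] = [None] * words  # stands in for the PEMs
        self.dmm = dmm                              # stands in for the DMM

    def request(self, dmma: int) -> int:
        """Return the PEM address of a DMM operand, loading it on a miss."""
        for e in self.entries:
            if e.vcb and e.dmma == dmma:   # match with valid data
                e.usc += 1
                return e.pema
        for e in self.entries:
            if e.usc == 0:                 # assumption: empty == unused
                e.dmma, e.vcb, e.stf, e.usc = dmma, True, False, 1
                self.buffer[e.pema] = self.dmm[dmma]
                return e.pema
        raise RuntimeError("no empty location in buffer memory")

    def retire(self, pema: int, is_store: bool) -> None:
        """An LPM (is_store=False) or SPM (True) left the last PE."""
        e = self.entries[pema]
        e.usc -= 1
        e.stf = e.stf or is_store
        if e.usc == 0 and e.stf:           # only the final value is stored
            self.dmm[e.dmma] = self.buffer[pema]
            e.stf = False


dmm = {100: "A", 200: "B"}
table = ControlTable(words=2, dmm=dmm)
a = table.request(100)             # miss: operand fetched, USC = 1
a_again = table.request(100)       # hit: USC = 2
table.buffer[a] = "A'"             # a PE overwrites the operand
table.retire(a, is_store=True)     # SPM exits, USC = 1, nothing stored yet
pending = dmm[100]
table.retire(a, is_store=False)    # LPM exits, USC = 0, final value stored
```

 In this model the write-back to DMM happens only when the tally reaches zero, which is the behaviour the Store Flag paragraph describes.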
 
 5.4 Size of Buffer Memory 
 
 The number of words in the buffer memory and hence in each PEM 
 module is dependent, among others, on two main factors. 
 
 a) The maximum possible number of PEs which are concurrently 
 accessing different operands in the buffer memory at any time. 
 
 b) The ratio of the rate at which the buffer memory can be loaded 
 from DMM and the rate of processing of a microinstruction in a PE. 
 
 In order that no PE may be idle due to lack of operands in the 
 buffer memory, it is important that the word capacity of the buffer 
 memory be at least equal to the maximum number of PEs which may at any 
 time be accessing different locations of the buffer memory. The number 
 of PEs concurrently accessing different buffer memory locations in turn 
 depends on the nature of the arithmetic instruction stream, the amount 
 of arithmetic instruction look-ahead in the GACU and the number of PEs 
 in the Mantissa Processing Logic. If there are no data dependencies in 
 
 the instruction stream and a constant stream of microinstructions can be
 issued to the PEs, as many as $\frac{p \times n}{p+1} \approx n$ PEs may be accessing different
 operands at any time where $n$ is the number of PEs in the Mantissa
 Processing Logic and $p$ is the number of operands that can be summed by
 the Multi Sum microinstruction, MS. Such a case can occur in the
 evaluation of an arithmetic expression of the form

 $$A = \sum_{i=1}^{m} B_i$$

 where $B_1, B_2, \ldots, B_m$ are DMM operands.
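
 As a quick numerical check of the bound quoted above, the following Python fragment evaluates p x n / (p+1) for a few cases. The function name and the sample values of n and p are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def max_concurrent_pes(n: int, p: int) -> Fraction:
    """Upper bound p*n/(p+1) on PEs accessing different operands."""
    return Fraction(p * n, p + 1)


bound_small = max_concurrent_pes(n=12, p=3)   # 3*12/4
bound_large = max_concurrent_pes(n=8, p=7)    # 7*8/8
```

 The bound is always below n and approaches n as p grows.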
 
 However, in practical cases there are always data dependencies in 
 the instruction stream and if n is large, there would be some idle PEs 
 and the size of the buffer memory required would always be less than n. 
 From the empirical program studies made by Kuck et al. [47], Knuth [48]
 and Foster and Riseman [49], we deduce that a buffer memory of word
 capacity 16 or at the most 32 would be sufficient for all practical 
 purposes. 
 
 The word capacity of the buffer memory would also depend on the 
 number of pins available in a PE for addressing its PEM module in the 
 buffer memory. 
 
 In our design of the microinstructions LPM and SPM, we have assumed
 that there are $2^{k+1}$ words in the buffer memory because $(k+1)$ bits are
 available in the microinstruction word for addressing the PEM module.
 In case the word capacity of PEM has to be large, e.g., when $k$ is small,
 then correspondingly more bits would have to be assigned in the
 microinstruction word for the PEM address. For $k \ge 3$, $k+1$ bits are
 sufficient.
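
 The relation between k and the word capacity can be checked with a short Python fragment. It is a sketch only: the function names are hypothetical, and it reads the text as saying that k+1 address bits give 2^(k+1) words and that a capacity of 16 words is the practical minimum.

```python
def pem_words(k: int) -> int:
    """Words addressable with the (k+1)-bit PEM address field."""
    return 2 ** (k + 1)


def min_k_for_capacity(words: int) -> int:
    """Smallest k whose (k+1)-bit address field covers `words` locations."""
    k = 0
    while pem_words(k) < words:
        k += 1
    return k


k_for_16 = min_k_for_capacity(16)
k_for_32 = min_k_for_capacity(32)
```

 Under this reading, a 16 word buffer memory needs k of at least 3, which agrees with the closing remark of the section.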
 
 
 6. IMPLEMENTATION OF MACHINE ARITHMETIC INSTRUCTIONS 
 
 6.1 Introduction 
 
 In this chapter, we describe how a 'machine' arithmetic instruction 
 can be implemented using the various microinstructions and the particular 
 arithmetic unit organization developed in earlier chapters. The arith- 
 metic unit is organized to process floating point operands. The frac- 
 tional parts of the operands are processed by the PEs in the Mantissa 
 Processing Logic and the exponent arithmetic is performed by the Exponent 
 Processing Logic. Since the exponent arithmetic involves simply taking 
 the sum or difference of the exponents of two operands, no details are 
 given of the method of processing the exponents. Instead, the major 
 emphasis is put on the sequence of microinstructions which are issued 
 by the Mantissa Control Unit, MCU, to process the 'machine' arithmetic 
 instruction. 
 
 6.2 Implementation of 'Machine' Arithmetic Instructions 
 
 6.2.1 Global description of the processing of a 'machine' arith- 
 metic instruction - When the instruction look-ahead unit in the machine, 
 of which this Arithmetic Unit forms a part, detects an arithmetic in- 
 struction, it sends the arithmetic instruction to the Arithmetic Unit 
 for processing. The GACU part of the local arithmetic control acts as 
 the interface between the arithmetic unit and the instruction look-ahead 
 unit. On receiving the machine instruction, the GACU calls upon the 
 buffer memory control, LOMCO to provide the GACU with the buffer memory 
 
 
 address of the data operand (referred to in the machine arithmetic instruction).
 (If the data operand is not present in the buffer memory, LOMCO will
 fetch the operand and then provide the buffer memory address as ex- 
 plained earlier in Chapter 5.) The GACU now sends the machine instruc- 
 tion, with the DMM operand address replaced by the corresponding buffer 
 memory address, to the MCU. MCU decodes the arithmetic instruction and 
 calls upon the exponent control unit, if necessary, to calculate the sum/ 
 difference of the exponents of the operands involved. The micro- 
 instructions for the processing of mantissas are then issued either 
 concurrently with exponent processing or on response from the exponent 
 control unit depending on the arithmetic instruction. (In the case of an 
 Add or Subtract instruction, the difference of the exponents must be avail- 
 able for operand alignment before mantissa processing can begin.) The 
 sequence of microinstructions issued by the MCU for the mantissa process- 
 ing depends on the OP-code (for example, Add, Subtract, Multiply, etc.) 
 of the arithmetic instruction and whether the instruction is single 
 address, double address, or the three address type. In the following 
 discussion, a single address type of 'machine' arithmetic instruction 
 is assumed. That is, each instruction is of the format OP-code , EA 
 where OP-code field carries the mnemonic and EA is the effective DMM 
 address of the operand. The other operand, if necessary, is implicitly 
 assumed to be in the Accumulator which is distributed in the PEs of the 
 Mantissa Processing Logic. The Register INR1 of the register file acts 
 as the Accumulator register in each PE. 
 
 
 MCU also monitors the development of any exceptional conditions 
 like exponent overflow/underflow, etc. during arithmetic instruction 
 processing and reports such an occurrence to the GACU. GACU in turn may 
 pass this status information to other parts of the machine; e.g., in- 
 struction look-ahead and fetch unit for branching and other decisions. 
 Due to the most-significant-digit-first nature of arithmetic processing,
 the occurrence of singular conditions can be detected by MCU before the 
 execution of arithmetic instruction is fully complete. 
 
 6.2.2 Floating point Addition - Let the Floating Point ADD instruction
 be given by FPA, EA where EA is the effective DMM address of the
 Addend operand. Let ea be the corresponding buffer memory address of
 the Addend. Once the buffer memory address ea is known, the MCU calls
 upon the Exponent Control Unit, ECU, to calculate the difference of
 exponents of the operands in Accumulator and ea, and issues the
 microinstructions for the right shift of the appropriate operand to align
 the operands and for the summation of the operands, and finally checks for
 mantissa overflow and takes the necessary corrective action. This could be
 followed by normalization, assimilation to conventional form and then
 storing of the result operand.
 
 6.2.2.1 Mantissa processing microprogram - Assuming, without loss 
 of generality, that the Addend is to be shifted for operand alignment, 
 the sequence of microinstructions issued by MCU is shown below. 
 
 Microinstruction        comments

 LPM  2  ea              ; INR2 ← PEM[ea]
 RS   2                  ; As many right shift microinstructions
 RS   2                  ; are issued as is the difference in
  .                      ; exponents of the Addend and
  .                      ; Augend.
 RS   2
 SS                      ; INR1 ← INR1 + INR2
 
 The sequence of microinstructions given above loads the addend in regis- 
 ter INR2, aligns the operands and then sums the operands. The sequence 
 of microinstructions for normalization and assimilation of operands is 
 discussed in Sections 6.2.6 and 6.2.7 respectively. 
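
 The addition microprogram can be generated mechanically once the exponent difference is known. The Python sketch below is illustrative only: the function name, the textual form of the microinstructions, and the assumption that the exponent difference is a non-negative count of digit positions are not taken from the design.

```python
from typing import List


def fpa_mantissa_microprogram(ea: int, exponent_difference: int) -> List[str]:
    """Microinstructions for the mantissa part of FPA, addend shifted right."""
    if exponent_difference < 0:
        raise ValueError("sketch assumes the addend is the operand to shift")
    program = [f"LPM 2 {ea}"]                   # INR2 <- PEM[ea]
    program += ["RS 2"] * exponent_difference   # align the addend
    program.append("SS")                        # INR1 <- INR1 + INR2
    return program


program = fpa_mantissa_microprogram(ea=5, exponent_difference=3)
```

 For subtraction the same sequence would carry a TI 2 2 microinstruction immediately before SS.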
 
 6.2.2.2 Mantissa overflow correction - In the d-vector representation
 of the sum $S = s_0 . s_1 s_2 s_3 \ldots s_n$
 it is possible that $|s_0| = 1$. However, this is only an indication of
 potential overflow. A mantissa overflow occurs only when $s_0 \cdot s_j > 0$
 where $s_j$ is the first most significant non-zero digit in the d-vector
 representation of S. Due to the use of an RBA-2 using the sign-magnitude
 logic vector encoding LVE_ for the redundant binary digit, in the
 implementation of microinstruction SS, bogus overflow [35] would occur quite
 often. The bogus overflow occurs whenever $s_0 \cdot s_j < 0$ because the sum
 can always be recoded such that $s_0 = 0$. For example, the sum $1.0\bar{3}21$
 (where $\bar{3}$ denotes the digit $-3$) can always be recoded into its
 algebraic equivalent 0.9721.
 
 The mantissa overflow can be corrected by shifting the sum right by 
 one digital position and correspondingly adjusting the exponent of the 
 
 
 sum. In the case of bogus overflow, however, this procedure would cause
 an unnecessary loss of one significant digit (k bits for radix-$2^k$ arithmetic).
 This loss can be avoided during normalization of
 the sum if the shifted-out digit of the sum is saved in the End Unit and
 reintroduced during left-shifting of the operand. The left shift of the
 sum after normalization recoding is done to eliminate the leading zeros
 in the recoded sum.
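
 The distinction between true and bogus overflow can be illustrated with a short Python sketch. It rests on assumptions: the function names are hypothetical, the digit vector is written as a list [s0, s1, ..., sn] with s0 the digit to the left of the radix point, the decimal example of the text is read with its digit 3 negative, and the case in which every digit after s0 is zero is treated as a true overflow.

```python
from fractions import Fraction
from typing import List


def value(digits: List[int], radix: int = 10) -> Fraction:
    """Value of the signed-digit number s0 . s1 s2 ... sn."""
    return sum(Fraction(d, radix ** i) for i, d in enumerate(digits))


def classify_overflow(digits: List[int]) -> str:
    """Return 'none', 'bogus' or 'overflow' for a sum with digits s0..sn."""
    s0 = digits[0]
    if s0 == 0:
        return "none"
    # First non-zero digit after s0; 0 if there is none (assumed overflow).
    first = next((d for d in digits[1:] if d != 0), 0)
    return "bogus" if s0 * first < 0 else "overflow"


example = [1, 0, -3, 2, 1]          # the sum 1.0(-3)21
example_value = value(example)
example_kind = classify_overflow(example)
```

 The exact arithmetic confirms that the example sum lies below unity, so it can be recoded with a zero leading digit.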
 
 6.2.3 Floating point Subtraction - The processing for Floating
 Point Subtraction is identical to that for Floating Point Addition
 except that in the Mantissa Processing Microprogram, the
 microinstruction SS is preceded by the microinstruction TI 2,2. This
 microinstruction reverses the sign of each digit of the subtrahend.
 The sequence of microinstructions for mantissa processing is as follows.
 
 Microinstructions       comments

 LPM  2  ea              ; INR2 ← PEM[ea]
 RS   2
 RS   2                  ; operand alignment
  .
 RS   2
 TI   2  2               ; INR2 ← (-1) . INR2
 SS                      ; INR1 ← INR1 + INR2
 
 6.2.4 Floating point Multiplication - Let the Floating Point
 Multiply instruction be denoted as FPM, EA where EA is the effective
 DMM address of the Multiplier operand. The operand in the accumulator
 is the implicitly assumed multiplicand operand. If ea is the corresponding
 buffer memory address of the multiplier operand, the processing for
 Floating Point Multiply instructions involves the following steps by
 MCU. MCU calls upon the Exponent Control Unit to sum the exponents of
 the operands in the accumulator and at the LOEM address specified by ea;
 concurrently, the MCU issues the sequence of microinstructions to PEs
 to form the double length product in the PEs and finally to check for
 any exceptional conditions. Mantissa overflow cannot occur because the
 mantissas of both the operands are less than unity. However, exponent
 overflow may take place.
 
 6.2.4.1 Microprogram for mantissa processing - The mantissa pro- 
 cessing for the multiplication instruction involves the generation of 
 partial products and the final product digits. For processing, the 
 multiplier and multiplicand operands are respectively in file registers 
 INR3 and INR2 whereas the Accumulator register INR1 is used to form and 
 accumulate the partial products. Unlike the conventional multipliers, 
 the most significant half of the final double length product is in the 
 Multiplier register INR3 and the least significant half is in the Accumu- 
 lator register INR1. 
 
 Because the partial products are formed beginning with the most 
 significant digits, the most significant digits of the product are 
 formed first. Also, to achieve maximum precision the partial product 
 is shifted left during each step instead of the multiplicand being 
 shifted right. Due to the left shift of the partial product, two prob- 
 lems immediately arise. 
 
 a) During the left shift of the Accumulator register (partial 
 product) , not one but two digits (the digits to the left and right of 
 
 the radix point) are shifted out. These two digits need to be recoded
 into a final product digit to be stored into the multiplier register
 and a residual digit to be added to the next partial product in the next
 step. Pisterzi [30] has shown that a recoder with $r^2$ states is necessary
 for this purpose. Such a recoder can be conceptually looked upon as an
 extension of the adder network to the left. However, in our case, due
 to the existence of bogus overflow in RBA-2, the basic cell of the
 adder MIRBA, the recoder's logic design would have to be different.
 
 b) Another problem due to left shifting of the partial product is 
 that the digits of the most significant half of the final product which 
 become available one by one as the output of the recoder (connected to 
 the most significant digital position of the adder) need to be stored 
 in the multiplier register and/or the buffer memory. In order that
 these product digits may be stored in proper order in the multiplier 
 register, the product digit (output of recoder) needs to be stored in 
 the least significant digital position of the multiplier register because 
 this position is vacant due to the left shift of the multiplier reg- 
 ister. But MCU communicates with and knows the state of only the most 
 significant PE. The solution to this problem is to send the value of 
 the digit to be placed in the least significant digital position with 
 the left shift multiplier microinstruction. As a matter of fact, the
 particular definition of the left shift microinstruction was contrived
 specifically to serve this purpose.
 
 The MCU forms the partial products by issuing a sequence of a set 
 of three microinstructions as many times as the number of digits in the 
 
 
 multiplier operand. The three microinstructions are 'Left Shift Multi- 
 plier' to examine the multiplier digit, 'Form Multiple and Add' to form 
 the partial product and 'Left Shift Accumulator' to shift the partial 
 product. The microprogram for the formation of the double length product 
 for six digit long multiplier and multiplicand operands is given below.
 $m_i$ and $P_j$ ($i = 1, 2, \ldots, 6$ and $j = 1, 2, \ldots, 11, 12$) respectively denote
 multiplier and Product digits.
 
 Microinstruction        comments

 LPM  3  ea              ; INR3 ← PEM[ea] (≡ Multiplier)
 TD   1  2               ; INR2 ← INR1 (≡ Multiplicand)
 LDC  1  0               ; INR1 ← 0
 LS   3  0               ; MCU ← m_1, INR3 ← r.INR3
 FMA  m_1                ; INR1 ← INR1 + m_1 . INR2
 LS   1  0               ; MCU ← P_1, INR1 ← r.INR1
 LS   3  P_1             ; MCU ← m_2, INR3_6 ← P_1, INR3 ← r.INR3
 FMA  m_2
 LS   1  0
 LS   3  P_2
 FMA  m_3                ; These achieve the partial products
 LS   1  0               ; for the rest of the multiplier
 LS   3  P_3             ; digits m_2, m_3, ..., m_6 in the same
 FMA  m_4                ; way as the sequence of immediately
 LS   1  0               ; preceding four microinstructions
 LS   3  P_4
 FMA  m_5
 LS   1  0
 LS   3  P_5
 FMA  m_6
 LS   1  0
 LS   3  P_6
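
 The regular structure of the multiplication microprogram lends itself to a small generator. The Python sketch below is an illustration under assumptions: the function name and the textual form of the microinstructions are hypothetical, and it assumes the listing consists of LPM 3 ea, TD 1 2, LDC 1 0 and LS 3,0 followed by one group of FMA, LS 1,0 and LS 3,P per multiplier digit.

```python
from typing import List


def fpm_mantissa_microprogram(ea: int, digits: int) -> List[str]:
    """Microinstructions forming the double length product of two operands."""
    program = [
        f"LPM 3 {ea}",   # INR3 <- PEM[ea], the multiplier
        "TD 1 2",        # INR2 <- INR1, the multiplicand
        "LDC 1 0",       # INR1 <- 0, clear the partial product
        "LS 3,0",        # bring the first multiplier digit into the MCU
    ]
    for j in range(1, digits + 1):
        program += [
            f"FMA m{j}",   # form multiple and add
            "LS 1,0",      # left shift accumulator, product digit to MCU
            f"LS 3,P{j}",  # left shift multiplier, insert product digit
        ]
    return program


program = fpm_mantissa_microprogram(ea=7, digits=6)
```

 For six digit operands the sequence has 4 + 3 x 6 = 22 microinstructions.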
 
 A pictorial representation of the flow and execution of the above
 sequence in a 6 PE Mantissa Processing Logic is shown in Figure 6.1.

 [Figure 6.1: flow and execution of the multiplication microinstruction sequence in a 6 PE Mantissa Processing Logic]

 In this figure,
 
 ${}_{i}P_{j}$ is the $j$th digit of the portion of the $i$th
 accumulated partial product which is in the
 Accumulator register and

 $P_j$ is the $j$th digit of the final product and is

 $P_j = {}_{j}P_{1}$ for $1 \le j \le 6$,
 $P_j = {}_{6}P_{j-5}$ for $7 \le j \le 11$,
 $P_{12} = 0$.
 
 The column labeled 'Register in MCU' is a one digit wide register which 
 holds the multiplier digit when the multiplier register is shifted left 
 for examining and selecting the next multiplier digit for partial pro- 
 duct formation. This register is also used to hold the product digit, 
 from the output of the accumulator overflow recoder, for storing in the 
 least significant digital position of the multiplier register via the 
 Left Shift Accumulator (LS 3,$P_j$) microinstruction.
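
 The correspondence between final product digits and accumulated partial product digits can be written as a small lookup. The Python sketch below is illustrative: the function name is hypothetical, and the generalization from 6 digit operands to n digit operands is an assumption. It returns the pair (i, j') such that the final digit P_j equals digit j' of the i-th accumulated partial product, or None for the last digit, which is zero.

```python
from typing import Optional, Tuple


def product_digit_source(j: int, n: int = 6) -> Optional[Tuple[int, int]]:
    """Locate final product digit P_j among the partial product digits."""
    if 1 <= j <= n:
        return (j, 1)            # shifted out after the j-th step
    if n < j < 2 * n:
        return (n, j - n + 1)    # left in the Accumulator after step n
    if j == 2 * n:
        return None              # least significant digit is zero
    raise ValueError("j must lie between 1 and 2n")


first_half = [product_digit_source(j) for j in range(1, 7)]
second_half = [product_digit_source(j) for j in range(7, 12)]
```

 The first half of the product is therefore collected one digit per step, while the second half is read out of the Accumulator after the last step.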
 
 If the multiplier digit $m'_j$ in the microinstruction FMA $m'_j$ is to
 have a redundancy ratio $\delta \le 2/3$, then the two consecutive digits $m_j$ and
 $m_{j+1}$ of the multiplier register have to be examined to generate one
 modified multiplier digit $m'_j$. The algebraic design of such a redundancy
 recoder is given in the Appendix A-1. The sequence of microinstructions
 for mantissa processing when the multiplier digit redundancy is $\delta \le 2/3$
 would remain the same as before except that the fourth microinstruction
 LS 3,0 will be immediately followed by another LS 3,0 microinstruction
 in order to bring $m_2$ to the recoder. For the rest of the modified
 multiplier digits $m'_j$ ($j > 1$), only digit $m_{j+1}$ needs to be brought into
 MCU because $m_j$ is already known from the previous step. Note that there
 may be one more modified multiplier digit than the number of multiplier
 digits in the original multiplier operand to maintain algebraic
 equivalence in the two forms of the same multiplier operand.
 
 6.2.5 Floating point Division - The processing for Floating Point 
 Division is almost identical to that for Multiplication except that the 
 quotient digits must be determined by examination of the partial remainder. 
 Division is performed by repetitive additions and shifts. In the Floating
 Point Divide instruction FPD, EA, the effective DMM address of the
 divisor operand is given by EA and the Dividend is implicitly assumed to 
 be the operand in the Accumulator. The processing by MCU for Floating 
 Point Divide involves calling upon the Exponent Control Unit to take the 
 difference of the exponents of the dividend (accumulator) and the divisor 
 at the buffer memory address ea, and processing of the mantissa to cal- 
 culate the quotient digits. The exceptional conditions are the possible 
 exponent underflow and a zero value of the divisor. 
 
 6.2.5.1 Microprogram for mantissa processing - The major problems 
 in the implementation of Division in the arithmetic unit under consider- 
 ation are: 
 
 a) the storage of double precision dividend, 
 
 b) the calculation of the quotient digits, 
 
 c) the placement of quotient digits in the PEs, and 
 
 d) the extension to the left of the Accumulator and the Adder 
 Network to take care of a shifted partial remainder. 
 
 We will not discuss the above problems in detail; only the possible
 solutions are indicated. For details, the reader should refer to Pisterzi
 [30]. The double precision dividend is stored in two registers: the
 Accumulator register INR1 and the multiplier register INR3. The 
 Accumulator register INR1 holds the most significant half of the divi- 
 dend and INR3 holds the least significant half. At the end of the 
 processing, they respectively hold the remainder and the quotient. 
 Register INR2 will hold the divisor. 
 
 Because of the redundant number representation for the quotient
 digit, the quotient digit can be calculated by a 'model division' [50]
 which uses only truncated versions of the divisor and the shifted partial
 remainder. It is shown in Appendix A-2 that for radix $2^k$ ($k \ge 3$),
 3 digits of the divisor and 2 digits of the fractional part, in addition
 to the integer part, of the shifted partial remainder are sufficient for
 the calculation of the quotient digit with a redundancy ratio of 2/3 or 1.
 However, for radix 4, one more digit each of the divisor and the partial
 remainder is necessary if the quotient digit has a redundancy ratio
 of 2/3. But for maximally redundant quotient digits, we use the same
 number of digits as for $k \ge 3$. The examination of the operand digits
 for quotient calculation in the MCU is done by shifting the operands
 left as many times as the number of digits necessary. Examination of
 the divisor digits needs to be done only once at the beginning, since
 the same digits take part in the calculation of every quotient
 digit. However, since the partial remainder changes, it has to be
 shifted every time. But since the unshifted divisor and the radix-r
 shifted partial remainder are necessary in the PEs for calculation of
 a new partial remainder, the examination of the operand digits is done
 by shifting another register which contains a copy of the operand whose
 digits are to be examined. File register INR4 can be used for that
 purpose.
 
 The quotient digit is stored in the vacant least significant 
 digital position of the register INR3 by using the Left Shift microin- 
 struction just as in the case of Multiplication. The quotient digit is 
 sent in the modifier field of the Left Shift microinstruction issued by 
 the MCU. 
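
 The register usage described above can be illustrated with a minimal
 Python sketch. It is not the algorithm of this thesis: it uses
 conventional non-negative integers and an exact quotient digit selection
 in place of signed-digit operands and 'model division', and the function
 and argument names are hypothetical. It only shows how INR1 and INR3
 together hold the double precision dividend, how the pair is shifted left
 by one radix-r digit per step, and how each quotient digit enters the
 vacated least significant position of INR3.

```python
def divide_in_registers(dividend_hi, dividend_lo, divisor, r, n):
    """Digit-by-digit division on n-digit radix-r registers.

    Simplifying assumption: operands are plain non-negative integers,
    not the signed-digit operands used in the thesis.
    INR1 = most significant half of the dividend (remainder at the end)
    INR3 = least significant half of the dividend (quotient at the end)
    INR2 = divisor
    """
    scale = r ** n
    inr1, inr3, inr2 = dividend_hi, dividend_lo, divisor
    if not (0 <= inr1 < inr2 < scale and 0 <= inr3 < scale):
        raise ValueError("need 0 <= dividend_hi < divisor < r**n")
    for _ in range(n):
        # Left shift the INR1:INR3 pair by one digit; the leading digit
        # of INR3 moves into the least significant position of INR1.
        leading_digit, inr3 = divmod(inr3 * r, scale)
        shifted_remainder = inr1 * r + leading_digit
        # Exact digit selection (the thesis uses truncated operands instead).
        q = shifted_remainder // inr2
        inr1 = shifted_remainder - q * inr2
        # The quotient digit fills the vacant least significant digit of INR3.
        inr3 += q
    return inr1, inr3  # (remainder, quotient)
```

 For example, with r = 10 and n = 3, the dividend 12345 is held as
 INR1 = 12 and INR3 = 345; dividing by 67 leaves the remainder in INR1
 and the three quotient digits in INR3.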
 
 Because of the characteristics of the Division process, the shifted 
 partial remainder in the accumulator would always be less than r in 
 absolute magnitude. Thus the overflow recoder that was used in the 
 Multiplication process can be used to store the integer part of the 
 shifted partial remainder. 
 
 Note that the technique used for the 'Model' Division is completely 
 independent of the architecture of our Arithmetic Unit. It can be done 
 by Table look-up or by any other method depending on the time and cost 
 considerations. Pure table look-up is too expensive for any reasonable 
 radix greater than 4. We propose that the quotient digit be calculated
 in the MCU serially, one bit at a time, for radix $\ge 8$ and then assembled into
 a radix-r digit before calculating the next partial remainder. 
 
 The sequence of microinstructions for Floating Point Division is
 very similar to that for Multiplication and hence is not given
 here.
 
 6.2.6 Normalization of operands - An operand is considered normalized
 if it satisfies Definition 3, given in Section 3.5.1, of a
 normalized number. The major steps in the normalization process are
 left shifting of the signed-digit operand till there are no leading
 zeros, recoding the shifted operand by the 'Normalize Recode'
 microinstruction NR, and finally left shifting the recoded operand to remove
 the leading zeros, if any were created by the microinstruction NR.
 
 Because of the interface control signal $Z_1$ between the PE and the MCU,
 there is no need to launch a Left Shift microinstruction to examine the
 leading digit for zero magnitude. Simply monitoring $Z_1$ is sufficient,
 and this has the advantage that no overshift of the operand would take
 place during the normalization process.
 
 Note that since the microinstruction NR operates only on the operand 
 in the Accumulator register INR1, the operand should be placed in INR1 
 for the normalization process. 
 
 6.2.7 Assimilation of signed-digit operand - The process of Assim- 
 ilation converts the signed-digit operand whose different digits carry, 
 in general, different signs to a form in which each digit has the same 
 sign. This sign is the sign of the operand. The procedure and the se- 
 quence of microinstructions necessary for Assimilation is identical to 
 that for Normalization except that the microinstruction AR instead of
 NR is launched for recoding the operand.
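
 The effect of Assimilation on an operand can be sketched in Python as
 follows. This is not the AR microinstruction sequence of the thesis,
 which works most significant digit first; the sketch swaps in the
 conventional right-to-left borrow propagation simply to show the input
 and output forms. The function name is hypothetical, and digits are
 assumed to lie in the maximally redundant range -(r-1) to (r-1).

```python
def assimilate(digits, r):
    """Rewrite a signed-digit operand so that every digit has the operand's sign.

    digits: most significant digit first, each in -(r-1)..(r-1).
    Uses ordinary right-to-left borrow propagation (an illustrative
    substitute for the thesis's MSD-first AR recoding).
    """
    # The sign of a signed-digit operand is the sign of its leading
    # nonzero digit.
    sign = 0
    for d in digits:
        if d != 0:
            sign = 1 if d > 0 else -1
            break
    if sign == 0:
        return list(digits)
    work = [sign * d for d in digits]  # operand made positive
    out = []
    borrow = 0
    for d in reversed(work):
        d -= borrow
        if d < 0:
            d += r       # borrow one unit from the next higher digit
            borrow = 1
        else:
            borrow = 0
        out.append(d)
    out.reverse()
    return [sign * d for d in out]
```

 For instance, in radix 10 the operand with digits 1, -9, -9, 5 has the
 value 15 and is returned as 0, 0, 1, 5.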
 
 
 7. SUMMARY AND CONCLUSIONS 
 
 7.1 Summary and Discussion of Results
 
 Chapter 1 described the characteristics and the constraints of the 
 newly emerging technology of Large Scale Integration (LSI) and its impli- 
 cations for the design of a digital system. Based on this discussion, a 
 set of desirable characteristics for an Arithmetic Unit was formulated
 and a limited interconnection arithmetic unit as proposed by Pisterzi 
 [30] was chosen as a vehicle to study the arithmetic and logic design 
 aspects of the basic module of such an arithmetic unit. 
 
 Chapter 2 described briefly the logical organization and mode of
 operation of the arithmetic unit, especially of the Mantissa Processing
 Logic (MPL). The MPL is composed of a linear cascade of identical logic
 modules called Processing Elements (PEs) which execute a sequence of
 microinstructions issued to the MPL by the Mantissa Control Unit (MCU).
 The MCU is an interpreter for the 'machine' arithmetic instructions like
 'Multiply', 'Add', etc., and issues a sequence of microinstructions for
 processing. The salient feature of the processing in the MPL is that a
 microinstruction issued by the MCU is not broadcast to all the PEs in
 the MPL. Instead, the microinstruction is executed by the PEs in
 sequence starting with the most significant PE. (The 'significance' of
 a PE is the same as the arithmetic significance of the operand digit
 contained in that PE.) The method of processing in the MPL was
 illustrated by an example which showed how the various microinstructions
 in the microinstruction stream could be pipelined. This pipelining
 feature allows the meshing in of machine arithmetic instructions even
 before all the result digits of a previous machine instruction have been
 calculated. The discussion in this chapter forms the framework for the
 material in the subsequent chapters.
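
 The pipelined flow of microinstructions through the PEs can be pictured
 with a small Python sketch. The timing model is an assumption made only
 for illustration: every PE is taken to need exactly one time step per
 microinstruction before handing it to its less significant neighbour,
 whereas the actual PEs are coordinated by request-response signals. The
 function name is hypothetical.

```python
def pipeline_schedule(microinstructions, num_pes):
    """Return {time_step: [(pe_index, microinstruction), ...]}.

    Assumed timing model: one time step per PE per microinstruction.
    PE 0 is the most significant PE and receives each microinstruction
    first; PE i sees the k-th microinstruction at time step k + i.
    """
    schedule = {}
    for k, instr in enumerate(microinstructions):
        for pe in range(num_pes):
            schedule.setdefault(k + pe, []).append((pe, instr))
    return schedule
```

 Under this model a stream of m microinstructions passes through p PEs
 in m + p - 1 time steps, and in the middle of the stream different PEs
 are busy with different microinstructions at the same time.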
 
 Chapter 3 is concerned with the arithmetic design of the Processing
 Element. Due to the digit serial nature of the arithmetic processing
 and the desirability of limited intercommunication between PEs, a redundant
 number system is a necessity. The number system was chosen to be Signed
 Digit and maximally redundant, firstly because the conversion from the
 conventional number representation of sign and magnitude to the maximally
 redundant representation and vice versa is very simple, and secondly
 because the radix-$2^k$ arithmetic can be realized in terms of identical
 stages of redundant binary $\{\bar{1}, 0, 1\}$ arithmetic structures. This
 gives the required repetitive and uniform logical structure to the
 internal logic of the PE. Then a set of simple arithmetic microinstructions
 sufficient for the implementation of the four basic arithmetic operations
 is defined, and their digit algorithms are described by their arithmetic
 transfer functions and their algebraic implementation. The particular
 algebraic implementation of the digit algorithm is influenced by the LSI
 technology constraints of regularity of logic structure, simplicity of the
 basic cell of the logic structure and the least number of pins for the
 module. The regularity of logical structure is obtained by implementing
 the radix-$2^k$ multi-input adder as a linear cascade of $k$ multi-input
 redundant binary adders. Each multi-input redundant binary adder in turn
 is implemented as a tree structure of still simpler 2-input or 3-input
 redundant binary adders. A definition of normalized operands was developed,
 and its influence on the arithmetic properties of overall processing and
 the complexity of quotient digit calculation was also discussed. The
 definition of normalized operands was chosen such that the processes of
 'normalization' and 'assimilation' could share the same logic.
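
 The two properties of the maximally redundant Signed Digit system cited
 above can be illustrated by a short Python sketch. The function names
 are hypothetical, and the decomposition shown is only one of the many
 redundant binary forms of a digit; it is not claimed to be the encoding
 used inside the PE.

```python
def sign_magnitude_to_signed_digit(negative, magnitude_digits):
    """Convert a sign and magnitude operand to signed-digit form.

    Every magnitude digit (most significant first) simply takes on
    the sign of the operand.
    """
    sign = -1 if negative else 1
    return [sign * d for d in magnitude_digits]


def digit_to_redundant_binary(digit, k):
    """Split one radix-2**k signed digit into k digits from {-1, 0, 1}.

    The result is most significant bit first; all nonzero bits carry
    the sign of the digit.
    """
    if abs(digit) > 2 ** k - 1:
        raise ValueError("digit outside the maximally redundant range")
    sign = -1 if digit < 0 else 1
    magnitude = abs(digit)
    return [sign * ((magnitude >> i) & 1) for i in range(k - 1, -1, -1)]
```

 For example, the radix-8 digit -5 becomes the three redundant binary
 digits -1, 0, -1.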
 
 Chapter 4, which is the major contribution of this thesis, treats
 the logic design of the Processing Element. In this chapter, the gate
 complexity and pin complexity of the Processing Element are shown to be
 related to the bit width of the Processing Element (radix of arithmetic
 processing in the MPL) and the redundancy in the multiplier/quotient digit
 used to form the partial products in the process of multiplication.
 The major components of the Processing Element are the Register File for
 the storage of active operands, the Digit Processing Logic (DPL), which is
 essentially a combinational logic network for the data transformation, and
 the Processing Element Control, which receives and decodes the
 microinstruction and generates the necessary sequence of control signals to
 condition the combinational network DPL. The number of gates and pins
 required for the DPL is very strongly dependent on the bit width of the
 Processing Element, whereas the number of gates and pins required for the
 PE control is almost independent of the bit width of the module. From
 the inspection of Tables 4.3 and 4.5, it is clear that 'local generation'
 of the collective Product Transfer $t^p_j$ should be used to keep down the
 number of pins necessary on the PE module.
 
 An examination of Tables 4.2 and 4.4, which give respectively the
 number of gates required in the implementation of the DPL for a multiplier
 digit redundancy ratio of 1 and 2/3, leads to the conclusion that a
 redundancy ratio of 2/3 should be employed for the multiplier and quotient
 digit. This would require the existence of a multiplier digit recoder in
 the MCU because the digits of the multiplier operand in the MPL have a
 redundancy ratio of unity. But the multiplier digit recoder is very simple.
 A still further advantage of restricting the redundancy ratio of the
 multiplier/quotient digit to $\delta \le 2/3$ is that $a_{FMA}$, the number of
 PEs which must cooperate with a given PE in the execution of
 microinstruction FMA, would always remain 2 irrespective of the radix of
 the multiplier digit, when the MIRBAs in MIAD are implemented as a
 log-sum tree of RBA-2s only. This can be seen from Table 7.1.
 
 Table 7.1  Values of $a_b$ and $a_{FMA}$ when the multiplier/quotient
            digit redundancy ratio is $1/2 < \delta \le 2/3$

 Radix        $k$    # of inputs to a MIRBA         $a_b = 2\lceil \log_2 i \rceil$    $a_{FMA}$
 $r = 2^k$           $i = \lceil k/2 \rceil + 1$

     4         2               2                          2                 2
     8         3               3                          4                 2
    16         4               3                          4                 2
    32         5               4                          4                 2
    64         6               4                          4                 2
   128         7               5                          6                 2
   256         8               5                          6                 2
 
 Finally, inspection of Table 4.5 shows that the DPL requires only
 36 pins for radix 256, that is, for an 8-bit width of the PE module.
 Since the PE control also requires 36 pins, an eight bit wide PE module
 should be employed in the Mantissa Processing Logic in order to balance
 the arithmetic processing cost in the DPL and the PE control cost. This
 requires a total of 72 pins on the PE module package, which is by no means
 unreasonable by the standards of today's technology.
 
 A negative aspect, from the LSI viewpoint, of the structure of 
 Mantissa Processing Logic should be noted here. Since the microinstructions
 flow from one PE to the other instead of being broadcast from the
 MCU, the number of pins required for the PE control is doubled in the
 present structure. Moreover, the request-response strategy of PE
 coordination control also doubles the number of pins required compared to
 a synchronous control synchronized to a central clock. However, the 
 asynchronous control has the advantage that any number of PEs can be con- 
 catenated together more easily to achieve any desired precision without 
 worrying about the clock skew problems. It should be noted, however, 
 that the arithmetic and logic design of the DPL as described in this 
 thesis is independent of the nature of PE control and the same DPL 
 design can be used to design a PE module for a bus-structured and 
 synchronous Mantissa Processing Logic. 
 
 In Chapter 5, a brief description was given of the logic organization
 and structure of a buffer memory which acts as an interface between
 the arithmetic unit and the Data Main Memory. The major characteristic
 of the buffer memory is that communication between the buffer memory and
 the Data Main Memory is on a word level, whereas the communication between
 the buffer memory and the Mantissa Processing Logic is on a digit serial
 basis. It was further argued that the size of the buffer memory in words is
 fairly small, of the order of 16 to 32 words.
 
 Chapter 6 showed how various machine arithmetic instructions could 
 be implemented using the microinstructions. 
 
 7.2 Suggestions for Further Work
 
 Reliability and availability considerations were not addressed in this 
 thesis. Since microinstructions flow from any PE to its adjacent PE, it 
 is important that all the consecutive PEs operate properly in order for 
 the Arithmetic Unit to operate properly. Determining organizational 
 modifications in the interconnection structure of the PEs which would 
 facilitate the automatic reconfiguration of properly operating PEs to 
 yield a working Arithmetic Unit with degraded performance, in the 
 presence of faulty PEs, is a very important area of further investigation. 
 
 Because the processing in the Arithmetic Unit takes place on a
 digit-by-digit basis starting with the most significant digit, this
 Arithmetic Unit structure has a potential for implementing a dynamically
 varying precision arithmetic. But due to the possibility of different
 PEs working concurrently on digits of different operands, certain
 structural modifications would be necessary. Investigation of such
 modifications is another interesting area for further work. One possible
 solution may be the use of some kind of 'end-of-the-word' marker as the
 delimiter for the precision of the operands and the use of a bus-structure
 to inform the MCU when the last digits of the operands have been operated on.
 
 A simulation of the Arithmetic Unit using data from real programs 
 would be interesting and useful to determine the useful word capacity 
 of a PEM module. 
 
 Finally, the logic design of the MCU and the GACU should be
 performed to determine the actual gate complexity of these modules.
 
 
 LIST OF REFERENCES 
 
 [1] Berg, R. O. and Jack, L. A., "System and Logic Design for the
      Effective Use of LSI," Proceedings of the National
      Telecommunications Conference, Atlanta, Ga., Nov. 1973, pp.

 [2] Smith, M. G., "LSI and Systems Architecture in the 1970's," First
      USA-JAPAN Computer Conference, 1972, pp. 182-192.

 [3] Conway, M. E. and Spandorfer, L. M., "A Computer System Designer's
      View of Large Scale Integration," 1968 Fall Joint Computer
      Conference, AFIPS Proc., Washington, D.C.: Spartan 1968,
      pp. 835-845.

 [4] Jennings, R. C., "Design and Fabrication of a General Purpose
      Airborne Computer Using LSI Arrays," Digest of IEEE Computer
      Group Conference, June 1968, pp. 50-54.

 [5] Beuscher, H. J. and Toy, W. N., "Check Schemes for Integrated
      Microprogrammed Control and Data Transfer Circuitry," IEEE
      Trans. EC, Vol. C-19, No. 12, Dec. 1970, pp. 1153-1159.

 [6] Beelitz, H. R., Levy, S. Y., Linhardt, R. J., and Miller, H. S.,
      "System Architecture for Large-Scale Integration," 1967 Fall
      Joint Computer Conference, AFIPS Proc., Washington, D.C.:
      Spartan 1967, pp. 185-200.

 [7] Clark, W. A., "Macromodular Computer Systems," 1967 Spring Joint
      Computer Conference, AFIPS Proc., Washington, D.C.: Spartan
      1967, pp. 337-401.

 [8] Podraza, G. V., Gregg, R. S., Jr., and Slager, J. R., "Efficient
      MSI Partitioning for a Digital Computer," IEEE Trans. EC,
      Vol. C-19, No. 11, Nov. 1970, pp. 1020-1028.

 [9] Cserhalmi, N., Lowenschuss, O., and Scheff, B., "Efficient
      Partitioning for the Batch-fabricated Fourth Generation Computer,"
      1968 Fall Joint Computer Conference, AFIPS Proc., Washington,
      D.C.: Spartan 1968, pp. 857-865.

 [10] Chen, T. C., "Overlap and Pipeline Processing" in Introduction to
      Computer Architecture, Edited by H. S. Stone, Chicago,
      Science Research Associates, Inc., 1975, pp. 375-431.
 
 [11] Ramamoorthy, C. V. and Economides, S. C., "Fast Multiplication
      Cellular Arrays for LSI Implementation," 1969 Fall Joint
      Computer Conference, AFIPS Proc., Washington, D.C.: Spartan
      1969, pp. 89-98.

 [12] Gex, A., "Multiplier-Divider Cellular Array," Electronics Letters,
      29th of July 1971, Vol. 7, No. 15, pp. 442-444.

 [13] Kingsbury, N. G., "High Speed Binary Multiplier," Electronics
      Letters, 20th of May 1971, Vol. 7, No. 10, pp. 277-278.

 [14] Wallace, C. S., "A Suggestion for a Fast Multiplier," IEEE Trans.
      EC, Feb. 1964, pp. 14-17.

 [15] Majithia, J. C. and Kitai, R., "An Iterative Array for
      Multiplication of Signed Binary Numbers," IEEE Trans. EC, Vol. C-20,
      No. 2, Feb. 1971, pp. 214-216.

 [16] Baugh, C. R. and Wooley, B. A., "A Two's Complement Parallel Array
      Multiplication Algorithm," IEEE Transactions on Computers, Vol.
      C-22, No. 12, pp. 1045-1047.

 [17] Majithia, J. C., "Non-Restoring Binary Division Using a Cellular
      Array," Electronics Letters, June 1970, pp. 303-304.

 [18] Majithia, J. C. and Kitai, R., "Fast Multiplier/Divider Array
      Using a Controlled Iterative Array," private communication.

 [19] Bjorner, Dines, "A Flow-Mode, Self-Steering, Cellular
      Multiplier-Summation Processor," BIT 10, 1970, pp. 125-144.

 [20] Thompson, P. M., "Digital Arithmetic Units for a High Data Rate,"
      The Radio and Electronic Engineer, Vol. 45, No. 3, 1975.

 [21] Gardner, P. L., "Functional Memory and Its Microprogrammed
      Implications," IEEE Trans. Comput., Vol. C-20, No. 7, July 1971,
      pp. 764-775.

 [22] Lee, C. Y. and Paull, M. C., "A Content Addressable Distributed
      Logic Memory with Application to Information Retrieval,"
      Proc. IEEE, Vol. 51, June 1963, pp. 924-932.

 [23] Crane, B. A. and Githens, J. A., "Bulk Processing in Distributed
      Logic Memory," IEEE Trans. EC, April 1966, pp. 186-196.

 [24] Batcher, K. E., "STARAN Parallel Processor System Hardware,"
      National Computer Conference, AFIPS Proc., 1974, pp. 405-410.
 
 [25] deRegt, M. P., "Introduction to Negative Radix Number Systems,"
      Part I, Computer Design, May 1967, pp. 53-63.

 [26] Avizienis, A., "Signed-Digit Number Representation for Fast Parallel
      Arithmetic," IRE Trans. EC, Vol. EC-10, Sept. 1961, pp. 389-400.

 [27] Shaipov, N. Yu., "Methods of Realizing Arithmetic Operations in the
      Minus-Two Number System," Automation and Remote Control, 1970,
      pp. 835-841.

 [28] Avizienis, A. and Tung, C., "A Universal Arithmetic Building
      Element (ABE) and Design Methods for Arithmetic Processors,"
      IEEE Trans. Comput., Vol. C-19, No. 8, Aug. 1970, pp. 733-745.

 [29] Avizienis, A. and Tung, C., "Design of Combinational Arithmetic
      Nets," Digest 1st Annual IEEE Computer Conference (Chicago,
      Illinois), Sept. 6-8, 1967, pp. 25-28.

 [30] Pisterzi, M. J., "A Limited Connection Arithmetic Unit," Ph.D.
      dissertation, Department of Electrical Engineering, University
      of Illinois, Urbana, Illinois; also, DCS Report No. 398, June
      1970.

 [31] deRegt, M. P., "Negative Radix Arithmetic, Part 4, Multiplication
      and Division," Computer Design, Aug. 1967, pp. 36-44.

 [32] ________, "Negative Radix Arithmetic, Part 5, Division: Testing
      the Remainder," Computer Design, Sept. 1967, pp. 44-50.

 [33] ________, "Negative Radix Arithmetic, Part 6, Manual Division:
      the Magnitude Test," Computer Design, Oct. 1967, pp. 68-77.

 [34] Szabo, N. S. and Tanaka, R. I., Residue Arithmetic and Its
      Applications to Computer Technology, New York, McGraw-Hill, 1967.

 [35] Atkins, D. E., "Design of Arithmetic Units of ILLIAC III: Use of
      Redundancy and Higher Radix Methods," IEEE Trans. Comput.,
      Vol. C-19, No. 8, Aug. 1970, pp. 720-733.

 [36] Cristelly, R. de ORY, "Design of a Dynamically checked,
      Signed-Digit Arithmetic Unit," Computer Science Department,
      University of California, Los Angeles, California, Report No.
      UCLA-ENG-7366, November 1973.

 [37] Sweeney, T., "An Analysis of Floating-Point Addition," IBM Systems
      Journal, Vol. 4, No. 1, pp. 31-42, 1965.
 
 [38] Borovec, R. T., "The Logical Design of a Class of Limited
      Carry-Borrow Propagation Adders," M.S. Thesis, Department of
      Electrical Engineering, University of Illinois, Urbana, Illinois,
      August, 1968. Also, Report No. 275, Department of Computer
      Science, University of Illinois, Urbana, Illinois.

 [39] Rohatsch, F. A., "A Study of Transformations Applicable to the
      Development of Limited Carry-Borrow Propagation Adders," Ph.D.
      Thesis, Department of Electrical Engineering, University of
      Illinois, Urbana, Illinois, June, 1967. Also, DCS Report No.
      226, Department of Computer Science, University of Illinois,
      Urbana, Illinois.

 [40] Robertson, J. E., "A Deterministic Procedure for the Design of
      Carry-save Adders and Borrow-save Subtracters," Department of
      Computer Science, University of Illinois, Urbana, Illinois,
      Report No. 235, July 5, 1967.

 [41] Foster, C. C. and Stockton, F. D., "Counting Responders in an
      Associative Memory," IEEE Trans. Comput., Vol. C-20, pp. 1580-
      1583, December 1971.

 [42] Bell, C. G. and Newell, Allen, Computer Structures: Readings and
      Examples, New York, McGraw-Hill Inc., 1971, pp. 628-637.

 [43] Preparata, Franco P., "On the Representation of Integers in
      Nonadjacent Form," SIAM J. Appl. Math, Vol. 21, No. 4, December
      1971.

 [44] Avizienis, A., "Arithmetic Microsystems for the Synthesis of Function
      Generators," Proc. IEEE, Vol. 54, No. 12, December 1966.

 [45] Goyal, L. N., "ILLIAC III Computer System Manual: Arithmetic Units-
      Vol. 2," Department of Computer Science, University of Illinois,
      Urbana, Illinois, Report No. UIUCDCS-R-73-551, January 1973.

 [46] Texas Instruments Incorporated, TTL Integrated Circuits Catalog,
      Dallas, Texas Instruments Catalog CC201, August 1969.

 [47] Kuck, D., Budnick, P., Chen, S. C., Davis, E., Han, J., Kraska, P.,
      Lawrie, D., Muraoka, Y., Strebendt, R. and Towle, R.,
      "Measurement of Parallelism in Ordinary FORTRAN Programs,"
      IEEE Computer, January 1974, pp. 37-46.

 [48] Knuth, D. E., "An Empirical Study of FORTRAN Programs," Software
      Practice and Experience, Vol. 1, pp. 105-133, April-June 1971.
 
 [49] Foster, C. C. and Riseman, E. M., "Percolation of Code to Enhance
      Parallel Dispatching and Execution," IEEE Trans. Comput., Vol.
      C-21, No. 12, pp. 1411-1415, December 1972.

 [50] Atkins, D. E., "A Study of Methods for Selection of Quotient Digits
      During Digital Division," Ph.D. Thesis, Department of Computer
      Science, University of Illinois, Urbana, Illinois, June 1970.
      Also, DCS Report No. 397, Department of Computer Science,
      University of Illinois, Urbana, Illinois, June 1970.
 
 APPENDIX A-1

 ALGEBRAIC DESIGN OF A RIGHT-DIRECTED RECODER TO CHANGE
 MULTIPLIER DIGIT'S REDUNDANCY FROM $\delta = 1$ TO $\delta \le 2/3$ †

 This recoder changes the multiplier operand

     $M = .\, m_1\, m_2\, m_3 \ldots m_{j-1}\, m_j\, m_{j+1} \ldots m_n$

 where

     $|m_j| \le (r-1) \quad \forall j = 0, 1, \ldots, n$

 to an algebraically equivalent operand

     $M' = m'_0\, .\, m'_1\, m'_2\, m'_3 \ldots m'_{j-1}\, m'_j\, m'_{j+1} \ldots m'_n$

 such that

     $|m'_0| \le 1$ and

     $|m'_j| \le \lceil \tfrac{2}{3}(r-1) \rceil \quad \forall j = 0, 1, \ldots, n.$

 In order to do the above recoding serially on a digit-by-digit
 basis, starting from the most significant digit, one needs to know only
 the digit to the immediate left of the digit to be recoded in addition
 to the digit itself. For example, if $m_j$ is the digit to be recoded,
 then the algebraic design of the recoder is given by the following.

 † Strictly speaking, $\delta = \lceil \tfrac{2}{3}(r-1) \rceil / (r-1)$, which may be slightly
 larger than 2/3.

 (Recoder diagram: input digits $m_{j-1}$ and $m_j$; interim digits $w_{j-1}$
 and $w_j$; transfer digit $t_{j-1}$.)

 Each digit $m_{j-1}$ and $m_j$ is first recoded into a pair of digits
 $\{t_{j-2}, w_{j-1}\}$ and $\{t_{j-1}, w_j\}$ so that

     $m_{j-1} = r\, t_{j-2} + w_{j-1}$
                                              $r > 4$
     $m_j = r\, t_{j-1} + w_j$

 and

     $t_{j-1} = 1$         if $m_j \ge \lceil \tfrac{2}{3}(r-1) \rceil$
     $t_{j-1} = 0$         otherwise
     $t_{j-1} = \bar{1}$   if $m_j \le -\lceil \tfrac{2}{3}(r-1) \rceil$

     $-\left(\lceil \tfrac{2}{3}(r-1) \rceil - 1\right) \le w_i \le \lceil \tfrac{2}{3}(r-1) \rceil - 1, \quad i = j,\ j-1$

 The recoded digit $m'_{j-1}$ is given by

     $m'_{j-1} = w_{j-1} + t_{j-1}$

 The above recoder is applicable for all values of index $j$. Note that
 $m'_0$ cannot have a magnitude greater than 1 and the recoded multiplier
 operand may have one digit extra compared to the original operand.
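
 The recoding rule of this appendix can be exercised with a minimal
 Python sketch. It assumes the following reading of the rule: with the
 threshold T = ceil(2(r-1)/3), a digit m_j >= T produces a transfer of +1
 and an interim digit m_j - r, a digit m_j <= -T produces a transfer of
 -1 and an interim digit m_j + r, and any other digit passes unchanged
 with a zero transfer; each recoded digit is the sum of its own interim
 digit and the transfer from its right-hand neighbour. The function name
 is hypothetical, and the list handling is an illustration only; it does
 not model the serial operation of the recoder in the MCU.

```python
def recode_multiplier(m, r):
    """Recode multiplier digits m_1..m_n (most significant first).

    Input digits satisfy |m_j| <= r - 1. The result is the list
    [m'_0, m'_1, ..., m'_n] with |m'_0| <= 1 and
    |m'_j| <= ceil(2 * (r - 1) / 3), representing the same value.
    Assumes r > 4, as required by the recoding rule.
    """
    if r <= 4:
        raise ValueError("the recoding rule requires r > 4")
    threshold = -(-2 * (r - 1) // 3)  # ceil(2(r-1)/3) in integer arithmetic
    transfers = []  # transfers[j-1] is t_{j-1}, generated by digit m_j
    interims = []   # interims[j-1] is w_j
    for d in m:
        if abs(d) > r - 1:
            raise ValueError("digit outside the maximally redundant range")
        if d >= threshold:
            t, w = 1, d - r
        elif d <= -threshold:
            t, w = -1, d + r
        else:
            t, w = 0, d
        transfers.append(t)
        interims.append(w)
    # m'_j = w_j + t_j, with w_0 = 0 and t_n = 0.
    w_all = [0] + interims
    t_all = transfers + [0]
    return [w + t for w, t in zip(w_all, t_all)]
```

 The extra leading digit of the result is the transfer generated by
 $m_1$, which is why the recoded operand may be one digit longer than the
 original operand.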
 
 
 APPENDIX A-2
 PRECISION REQUIREMENTS FOR QUOTIENT DIGIT CALCULATION

 According to the analysis by Atkins [50] based on P-D plot
 considerations, the worst case precision of the operands required for
 quotient digit calculation is given by the relation

     $|\Delta P| \le D_{\min} \left(\delta - \tfrac{1}{2}\right) - \frac{|\Delta d|}{2}\,(n - 1 + \delta)$          (A2.1)

 where

     $\Delta P$ = truncation error in the left shifted (by one digital
                  position) partial remainder

     $\Delta d$ = truncation error in the divisor

     $n$ = maximum allowed value of the quotient digit

     $\delta$ = redundancy ratio of the quotient digit $= \frac{n}{r-1}$

 and

     $D_{\min}$ = minimum value of the truncated divisor

 Let $|\Delta d| = r^{-\Omega}$

 and $|\Delta P| = r^{-(\Omega - 1)}$

 where $\Omega$ = number of digits in the truncated divisor

 Case 1: $\delta = 1$

 For a maximally redundant quotient digit,
     $\delta = 1$
     $n = (r-1)$

 and for a maximally redundant divisor normalized according to Definition 3
 given in Section 3.5,

     $D_{\min} = \frac{r-1}{r^2} + \frac{1}{r^{\Omega}}$

 Substituting the above values in Equation (A2.1), we have

     $r \cdot r^{-\Omega} \le \left(\frac{r-1}{r^2} + r^{-\Omega}\right)\left(1 - \tfrac{1}{2}\right) - \frac{r^{-\Omega}}{2}\,\bigl((r-1) - 1 + 1\bigr)$

 which simplifies to

     $r^{-\Omega} \le \frac{(r-1)}{r^2\,(3r-2)}$          (A2.2)

 Values of $\Omega$ which satisfy the relation (A2.2) for different values of
 $r$ are tabulated in Table A.1.

 Table A.1
 Values of $\Omega$ vs. Radix and Redundancy Ratio of a Quotient Digit

     Radix      Redundancy ratio, $\delta$, of a quotient digit
       r          $\delta = 1$        $\delta \le 2/3$

       2              4                   -
       4              3                   4
       8              3                   3
      16              3                   3
      32              3                   3
      64              3                   3

 Case 2: $\delta = 2/3$

 In this case,

     $n = \lceil \tfrac{2}{3}(r-1) \rceil$

     $D_{\min} = \frac{r-1}{r^2} + \frac{1}{r^{\Omega}}$

     $r^{-(\Omega - 1)} \le \left(\frac{r-1}{r^2} + \frac{1}{r^{\Omega}}\right)\left(\frac{n}{r-1} - \tfrac{1}{2}\right) - \frac{r^{-\Omega}}{2}\left(n - 1 + \frac{n}{r-1}\right)$

 which simplifies to

     $r^{-\Omega} \le \frac{2n - (r-1)}{r^2} \cdot \frac{r-1}{2r(r-1) + n(r-2)}$          (A2.3)

 Values of $\Omega$ which satisfy the relation (A2.3) for different values of
 $r$ are given in Table A.1.
 
 Table A.l clearly shows that 3 digits of the divisor and 2 most 
 significant digits of the fractional part of the shifted partial remainder 
 in addition to its (shifted partial remainder's) integer part are suffi- 
 cient to calculate the quotient digit. It can be further shown that all 
 the bits of the last digits of the truncated divisor and partial remainder 
 are not necessary for the quotient digit calculation. 
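
 The entries of Table A.1 can be checked with a short Python sketch. It
 assumes that the bound on $r^{-\Omega}$ reads
 [(2n - (r-1)) / r^2] [(r-1) / (2r(r-1) + n(r-2))], which reduces to
 (r-1) / (r^2 (3r-2)) when n = r-1, and it takes $\Omega$ to be the smallest
 number of divisor digits satisfying that bound. The function names are
 hypothetical; exact rational arithmetic is used so that the boundary
 case of radix 2 is not disturbed by rounding.

```python
from fractions import Fraction


def max_quotient_digit(r, maximally_redundant):
    """n = r - 1 for redundancy ratio 1, else ceil(2 * (r - 1) / 3)."""
    if maximally_redundant:
        return r - 1
    return -(-2 * (r - 1) // 3)


def divisor_digits(r, maximally_redundant):
    """Smallest Omega whose r**(-Omega) satisfies the assumed bound."""
    n = max_quotient_digit(r, maximally_redundant)
    bound = (Fraction(2 * n - (r - 1), r * r)
             * Fraction(r - 1, 2 * r * (r - 1) + n * (r - 2)))
    omega = 1
    while Fraction(1, r ** omega) > bound:
        omega += 1
    return omega
```

 Under these assumptions the sketch returns 4 digits for radix 2 and
 3 digits for the higher radices when the quotient digit is maximally
 redundant, and 4 digits for radix 4 when the redundancy ratio is 2/3.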
 
 
 VITA 
 
 Lakshmi N. Goyal was born in Rohtak, India on October 19, 1941. He 
 received the B.Tel.E degree in Electronics and Telecommunication Engineer- 
 ing in 1964 and M.Tel.E degree in 1965, both from Jadavpur University, 
 Calcutta, India, and Ph.D. degree in Electrical Engineering from the 
 University of Illinois, Urbana in 1976. 
 
 From 1963 to 1965 he was associated with Indian Statistical Insti- 
 tute - Jadavpur University joint computer project. He was responsible 
 for the logic design and hardware implementation of the Arithmetic and 
 Control Units of the Computer ISIJU-1, a variable word-length, micro- 
 programmed solid state digital computer. In 1965, he became a lecturer 
 in the Department of Electronics and Telecommunication Engineering, 
 Jadavpur University, Calcutta, India, and continued to be associated with
 the ISIJU-1 project. Since 1967, he has been a Research Assistant in the 
 Department of Computer Science, University of Illinois, Urbana. From 
 1967 to 1971, he was associated with the Image Processing ILLIAC III 
 Computer Project and worked on the design and implementation of the Scan- 
 Display system, Interrupt Unit and Arithmetic Units of ILLIAC III. His 
 research interests include Computer Arithmetic, Computer Architecture, 
 Microprogramming, Digital System Design and Display Processors. He has 
 several publications in these areas. 
 
 Mr. Goyal is a member of the Association for Computing Machinery and 
 the Institute of Electrical and Electronics Engineers. 
 
 BIBLIOGRAPHIC DATA SHEET

 1. Report No.: UIUCDCS-R-76-797

 Title and Subtitle: A STUDY IN THE DESIGN OF AN ARITHMETIC ELEMENT FOR
 SERIAL PROCESSING IN A LINEAR ITERATIVE STRUCTURE

 5. Report Date: May, 1976

 Author(s): Lakshmi Narayana Goyal

 Performing Organization Name and Address:
 Department of Computer Science
 University of Illinois
 Urbana, Illinois

 11. Contract/Grant No.: NSF DCR 73-07998

 Sponsoring Organization Name and Address:
 National Science Foundation
 Washington, DC

 Abstracts: This study is concerned with the design of an Arithmetic
 Element for Serial Processing in a Linear Iteratively Structured
 Arithmetic Unit. The Arithmetic Unit is made up of identical logic modules
 called Processing Elements (PEs) such that each module [...]ically
 communicates with a maximum of three of its neighboring modules for data
 and control information. An arithmetic instruction is executed by a
 sequence of elementary microinstructions such that each microinstruction
 is executed by all the modules not in synchro-parallelism, but in sequence
 by each module. The arithmetic processing takes place serially on a
 digit-by-digit basis with the most-significant-digit-first (MSDF).

 The arithmetic and logic design of the Processing Element and the
 implications of the design choices on the LSI implementation of a PE are
 described. The MSDF nature of arithmetic execution necessitates the use of
 the redundant number system for processing. The arithmetic design of the PE
 is discussed with respect to the number representation, the definition of
 a normalized number and the algebraic design of the digit algorithms of
 the microinstructions necessary to implement the four basic arithmetic
 operations.

 Formulas are given for the gate and pin complexities of the various
 components of the PE as a function of the type of 2-tuple logic vector
 encoding for a redundant binary digit, the bit width of the PE and the
 amount of redundancy in the multiplier/quotient digit. It is found that a
 sign-magnitude logic vector encoding and the multiplier/quotient digit's
 redundancy of 2/3 or less should be employed in the design of the
 Processing Element.

 Key Words:
 Addition
 Arithmetic Design
 Arithmetic Element
 Digit-by-Digit Algorithms
 Digital Computer Arithmetic
 Distributed Memory
 Division
 Iterative Structure
 Large Scale Integration
 Multiplication
 Processing Element
 Redundant Number System
 Serial Processing
 Subtraction

 19. Security Class (This Report): UNCLASSIFIED

 20. Security Class (This Page): UNCLASSIFIED