Report No. UIUCDCS-R-75-763

TECHNIQUES FOR PARALLEL COMPUTER DESIGN

by

Paul Peter Budnik, Jr.

October 1975

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

This work was supported in part by the National Science Foundation under Grant No. US NSF DCR73-07980 A02 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, October 1975.

TECHNIQUES FOR PARALLEL COMPUTER DESIGN

Paul Peter Budnik, Jr., Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1975

In the future, major increases in computer performance must result primarily from architectural innovations involving increased parallelism. Large Scale Integrated circuits will provide the basic technology to make this possible. This thesis discusses various techniques for constructing parallel computers. It does so by applying those techniques to the design of a specific machine. These techniques include the determination of a basic set of building block components, the establishment of interface and timing structures, and the development of techniques for pipelining and parallelizing complex control functions. The entire machine is broken up into small functional units which are independently queue driven.

ACKNOWLEDGMENT

The author wishes to thank his advisor, Professor David J. Kuck, for ideas, criticism, and discussions, and for moral support over an extended period of time. Professor Duncan Lawrie also provided helpful comments and suggestions. Bernice Shimabukuro proofread this thesis, drafted the figures, and typed it. Her help was invaluable.

TABLE OF CONTENTS

1 INTRODUCTION
2 OVERALL STRUCTURE
3 BASIC BUILDING BLOCKS AND DESIGN TECHNIQUES
  3.1 BUILDING BLOCKS
    3.1.1 Motivation
    3.1.2 Basic Building Blocks
      3.1.2.1 Queues
      3.1.2.2 Controls
      3.1.2.3 Switches
      3.1.2.4 Access Controllers
      3.1.2.5 Descriptive Tables
      3.1.2.6 Traditional Components
    3.1.3 Block Interfaces
      3.1.3.1 Timing Structure
      3.1.3.2 Pipeline or Parallel Units
      3.1.3.3 Additional Advantages of the Interconnection and Timing Structures
        3.1.3.3.1 Error Detection and Correction
        3.1.3.3.2 Hardware Performance Monitoring
  3.2 GENERAL DISCUSSION OF DESIGN TECHNIQUES
    3.2.1 Pipeline and Parallel Design Techniques
      3.2.1.1 IUD Design Analysis
      3.2.1.2 Queuing Techniques
      3.2.1.3 Resolving Buffer Access Conflicts
    3.2.2 Tables
    3.2.3 Deadlock
  3.3 PARALLELISM - AN ABSTRACT DISCUSSION
4 COMPUTATION UNIT - DETAILED LOGICAL DESIGN
  4.1 OVERALL STRUCTURE
  4.2 FUNCTIONAL STRUCTURE
  4.3 SCALAR PORTION OF COMPUTATION UNIT
    4.3.1 Overall Structure of the Scalar Portion of Computation Unit
    4.3.2 Scalar Execution Units
      4.3.2.1 Scalar Instruction Sequencing
      4.3.2.2 Scalar Queues
      4.3.2.3 SEU Sequence Controller
    4.3.3 Scalar Execution Unit Buffers
    4.3.4 Scalar Switch
  4.4 VECTOR PORTION OF COMPUTATION UNIT
    4.4.1 Overall Structure of Vector Portion of Computation Unit
    4.4.2 Vector Execution Units
      4.4.2.1 Standard Arithmetic Units
      4.4.2.2 Vector Routers
      4.4.2.3 Other Vector Units
      4.4.2.4 Detailed Internal Structure of a VEU
    4.4.3 Vector Buffer
    4.4.4 Vector Switch
  4.5 MAIN MEMORY
  4.6 INSTRUCTION UNIT DISPATCHER
    4.6.1 Introduction
    4.6.2 IUD Functional Structure
      4.6.2.1 Data Rate Analysis
      4.6.2.2 Memory Operands and Results
      4.6.2.3 Scalar Operands and Results
      4.6.2.4 Vector Operands and Results
      4.6.2.5 Scalar EU Assignment
      4.6.2.6 Vector EU Assignment
      4.6.2.7 Generating Vector Switch and Internal Switch Queue Entries
      4.6.2.8 Generating Instructions for the EUs and Memory
    4.6.3 Logical Structure
      4.6.3.1 IUD Pipeline Structure
        4.6.3.1.1 OFFL Instruction Format
        4.6.3.1.2 OFFL Syntax
        4.6.3.1.3 Analysis of Pipeline Requirements
        4.6.3.1.4 Switching Instruction Components into the Pipe
        4.6.3.1.5 Global Structure of IUD Pipe
      4.6.3.2 Detailed Structure and Gate Counts for Internal IUD Pipe Functions
        4.6.3.2.1 Details of the Parallel Update of the Scalar Table (SPU)
        4.6.3.2.2 Generating Time Indexes
        4.6.3.2.3 Parallel Update of Vector Buffer Table (VPU)
        4.6.3.2.4 Searching the Scalar Table
        4.6.3.2.5 Searching the Vector Table (SVT)
        4.6.3.2.6 Selecting the Scalar Execution Unit (SSE)
        4.6.3.2.7 VEU Queue Selector (SVE)
        4.6.3.2.8 Reserve Vector Buffer Storage (RVS)
        4.6.3.2.9 Update Vector Tables and Fill in Vector Operands (UV and FVO)
    4.6.4 Tail End of Main IUD Pipe
      4.6.4.1 Assembling Instructions (AV, AS, AM)
      4.6.4.2 Initiating Transfer of Instructions (IM, ISC, IV)
      4.6.4.3 Generating Vector Switch Instructions (GSI)
    4.6.5 Scalar Instruction Dispatcher Subsystem
      4.6.5.1 SIDS Functional Summary
      4.6.5.2 Detailed Design of SIDS
    4.6.6 Vector Instruction Dispatcher Subsystem
  4.7 GATE COUNT SUMMARY
5 MACRO INSTRUCTION DECODER, I/O CONTROL AND EXTERNAL EXPANDABILITY
  5.1 MACRO INSTRUCTION DECODER
  5.2 PAGING DESCRIPTION
  5.3 EXTERNAL EXPANDABILITY
6 CONCLUSION
LIST OF REFERENCES
APPENDIX A DETAILED LOGIC FOR VECTOR EXECUTION UNIT SELECTOR
VITA

LIST OF TABLES

 1 MAJOR COMPONENT FUNCTIONS
 2 OFFL PROGRAM
 3 INSTRUCTION QUEUE OPERATION
 4 GATE COUNT FOR INSTRUCTION QUEUE
 5 SCALAR UNIT STATUS TABLES SPECIFICATIONS
 6 DESIGN PARAMETERS AND GATE COUNTS FOR SCALAR BUFFERS
 7 DATA AND INSTRUCTION PORTS FOR THE SCALAR SWITCH
 8 GATE COUNT FOR SCALAR SWITCH
 9 GATE COUNT FOR SCALAR ASSEMBLING UNIT
10 DATA BUFFER OPERATION
11 VEU GATE COUNT
12 VECTOR SWITCH PORTS
13 VECTOR SWITCH HARDWARE
14 MEMORY LOGIC SUMMARY
15 CLASS PAIRS
16 OPERAND DISTRIBUTION PAIRS
17 OFFL SYNTAX
18 OFFL INSTRUCTION CONSTRAINTS
19 INSTRUCTION STREAM CONSTRAINTS
20 PIPE COMPONENTS
21 LOGIC EQUATIONS AND GATE COUNTS FOR ASSIGNING INSTRUCTION NUMBERS
22 LOGIC EQUATIONS FOR THE CONTROL OF THE IUD FRONT END SWITCHES
23 GATE COUNT FOR IUD FRONT END
24 IUD PIPE FUNCTIONS, TIMINGS, AND DEPENDENCIES
25 IUD PIPE TIMING CHART
26 SCALAR COMPARISON TREE GATE COUNT
27 GATE COUNT FOR TIME INDEX GENERATORS
28 VECTOR BUFFER COMPARISON TREE GATE COUNT
29 VECTOR STATUS TABLE GATE COUNT
30 SEU QUEUE SELECTOR GATE COUNT
31 CONNECTIONS FROM OPERAND PIPES TO VEU QUEUE SELECTOR
32 GATE COUNTS FOR INDEXING OPERANDS AND PARTIAL INSTRUCTION DETECTION
33 TIMINGS FOR UPDATING WEIGHT SELECTION REGISTERS
34 TIMING AND GATE COUNT FOR VEU QUEUE SELECTION
35 LOGIC SUMMARY FOR ASSEMBLING INSTRUCTIONS
36 LOGIC FOR INITIATING INSTRUCTION TRANSFERS
37 LOGIC FOR GENERATING VECTOR SWITCH INSTRUCTIONS
38 SIDS FUNCTIONS
39 SPECIFICATIONS AND GATE COUNTS FOR SIDS TABLES
40 VIDS TABLE SPECIFICATIONS
41 COMPUTATION UNIT SUMMARY GATE COUNT

LIST OF FIGURES

 1 COMPUTATION NODE
 2 COMPUTATION UNIT
 3 PROGRAM TREE
 4 ERROR CORRECTION CONNECTIONS
 5 SCALAR PORTION OF EXECUTION UNIT
 6 SCALAR EXECUTION UNIT
 7 INSTRUCTION QUEUE
 8 ALGORITHMS FOR ACCESSING SCALAR STATUS TABLES
 9 SCALAR SWITCH DETAIL
10 VEU OVERALL STRUCTURE
11 ASSEMBLING SCALARS INTO A VECTOR
12 DATA BUFFER
13 8-WAY PRIORITY SELECTOR
14 NON-POWER-OF-2 PRIORITY SELECTOR
15 MEMORY ORGANIZATION
16 INTERNAL STRUCTURE OF MEMORY PAGE
17 IUD FRONT END
18 IUD PIPE OVERALL STRUCTURE
19 COMPARISON TREE
20 TIME INDEX LOGIC
21 VECTOR COMPARISON TREE
22 SEU QUEUE SELECTOR
23 LOGIC TO INDEX OPERANDS AND DETECT A PARTIAL INSTRUCTION
24 VEU QUEUE SELECTOR
25 RESULT PROCESSING PORTION OF UNIT TO RESERVE VECTOR STORAGE
26 VECTOR BUFFER STATUS DETAILS
27 TAIL END OF IUD PIPE VECTOR OPERATOR PORTION
28 SIDS FLOWCHARTS

1 INTRODUCTION

The brief history of computer design has been one of constant and rapid change. This change has centered on the technology of circuit design. Logic has grown cheaper, smaller, and faster at dramatic rates. With few exceptions, the art of computer design has been largely one of doing more and more of the same thing faster and faster. There have, of course, been many design innovations. However, one could undoubtedly explain the operation of any modern computer to Babbage's ghost in a day or two.

The nature of the game of computer design is changing radically. The cost and size of logic continue to decline at an accelerating rate. The speed of logic is nearing theoretical limits, with the speed of electrical signals being a major design consideration in all current high-speed computers. Thus, barring some extremely dramatic revolution in physics, logic is not going to get much faster. On the other hand, if one wants to obtain a perspective on the limits of cost and gate density for logic, one might compute the cost per gate and gate density of a pigeon brain. We have only the crudest ideas on how to effectively utilize the logic technology that currently exists. With each new advance in circuit technology, the depth of this ignorance increases. This thesis is one primitive attempt to begin to plumb these depths.

The problems associated with effectively utilizing this technology can be divided into two broad categories. The first is to determine what useful structures one can concoct using a very large number of gates. The second is to determine which of these structures can be practically implemented given the constraints of an existing technology, and how this may be done.
This thesis concentrates on the first of these problems and treats the second only in the most general sense. We justify this approach because simultaneously solving both problems is extraordinarily complex and time consuming. Determining useful theoretical designs must inevitably be the first step in generating practical designs. Further, many of the practical constraints are in a rapid state of change. The practical constraints that we do take into account are ones that we expect to exist in any technology. These include restrictions on fan-in and fan-out, and on logic levels per clock. In addition, we have adopted a general structural approach of attempting to find basic building blocks at a fairly high level of complexity. We discuss this approach and its relevance to IC technology in Chapter 3.

There are two broad categorizations of approaches to determining how to utilize this technology effectively. One is to consider existing hardware and software techniques and see how these may be expanded and generalized. The alternative is to start from scratch. Existing programming, hardware, and software techniques have evolved from the notion of a simple arithmetic unit, a set of memory registers, and a single control unit. This structure is a natural one for humans to understand and use. It is unlikely to be of universal significance for all data processing problems. It is, in fact, our belief that the problem of investigating useful computing structures is coextensive with the problem of investigating useful mathematical structures. We consider this open-ended approach to the problem to be of deep intellectual fascination. In Section 3.3, we discuss some observations about this approach. However, for pragmatic reasons, the bulk of this thesis takes the other approach. We will now define our approach and objectives in more detail.

Our overall goal is to design a good general-purpose, expandable, fast parallel computer. This statement of objective is both very vague and internally inconsistent. Parallelism implies a structuring of multiple computing elements. Inevitably some algorithms will fit the structure better than others. More parallelism implies less generality. However, we do know that eight arithmetic units can be effectively utilized by most FORTRAN programs [Kuck 5]. Thus, a limited degree of parallelism is not incompatible with a fairly general-purpose computer. The vagueness in our statement of overall objectives is intentional. Good computer design involves many complex and ill-defined factors. A precise statement of objectives is impossible. We can enumerate the important factors, and we do so by first dividing them into two broad categories of programmability and hardware.

A machine with good programmability should allow for the easy implementation of a wide range of languages. These should be easily expandable, both to facilitate the evolution of special-purpose languages and to parallel the expandability of the hardware. The languages should be efficient both in terms of the code they produce and in their own execution time. Programs should be easy to debug. The machine design should allow for the easy implementation of a powerful, efficient operating system. Facilities should be provided which ease the burden of managing a hierarchy of memories, and in many instances eliminate it entirely. Multiprogramming and multiprocessing features should be provided to facilitate fast turnaround on short jobs and to allow the system to efficiently handle a broad range of problems.
The hardware should be modular, reliable, expandable, cheap, and easy and inexpensive to maintain. These goals are still quite vague, and it is essential that they remain so. Computer design is more art than science, and to pretend otherwise can lead to disaster. The objectives of computer design cannot be precisely defined without gross oversimplification.

I should say a few words about the results of this research. It does not consist of any one technique or conclusion. Rather, it consists of showing how a number of techniques may be developed and integrated to produce a good machine.

Although our overall objectives must remain vague, we can be more specific about the techniques we will employ in meeting these objectives. Hardware designed for specialized purposes is potentially much more efficient than that designed for more general purposes. Our goal is to design a general-purpose computer. There are two ways in which specialized hardware can be employed in such a machine. First, we can directly implement some operating system and compiler functions. Secondly, we can allow for the inclusion of specialized but unspecified operational units. The great danger in designing specialized hardware is that it becomes extremely efficient at doing what nobody cares to have done. There do exist universal compiler and operating system functions that can be completely specified at machine design time and can thus benefit from specialized hardware. Operating system functions include interrupt processing, multiprogramming allocation of resources, and memory management. Obvious compiler functions for which some existing machines have specialized hardware include array indexing and subroutine calls. Functions which are candidates for such hardware include mapping the parallel structure of a language onto the parallel structure of a machine, hardware management of loops to minimize execution time nondeterminism, and anticipatory I/O scheduling.

There can be no way to anticipate what sort of specialized hardware may be desirable or even necessary for various applications. It is feasible to design a general-purpose computer that will allow for the later inclusion of various specialized pieces of hardware. To allow for this, and because it is a good basic technique, our overall design philosophy will involve the construction of various functional units. The interfaces between these units will be as simple, and at as high a level of abstraction, as seems practical. As an example, we will have units to perform unspecified operations on vectors of a fixed size and similar units which operate on scalars. The interfaces between these units and the rest of the machine will be limited to the operands and results and a minimum of information specifying whether or not the unit is in a position to accept operands and the exact operation to perform. This should allow for the evolution of specialized hardware.

As the complexity of any design project increases, it becomes important to break the problem up into a hierarchy of more tractable problems. Further, there are advantages to having this structure reflected in physical units. It would be particularly desirable if the lowest level of structure could be implemented on LSI chips, making a theoretical design constraint compatible with practical implementation constraints. One result of this research is the observation that a few functional units are required in many contexts and might be ideal candidates for a basic set of chips.
Along with the structured hierarchy of functions, we require simple, well defined interfaces between units. This constraint also serves to keep the design problem tractable and to improve the chances for a smooth LSI adaptation. Another advantage of such a structure is to ease hardware debugging and maintenance. In Section 3.1.3.3 we will discuss how this structure could facilitate the construction of a super-reliable computer. A final advantage of this structure is that any unit that meets the interface specifications can be plugged in at any time after the machine is built. This could allow for some use of new technologies in an existing machine, as well as facilitating the addition of the specialized hardware mentioned earlier. These ideas are simply good engineering practice and similar to those of structured programming.

2 OVERALL STRUCTURE

In this section we describe the overall structure that evolves from the objectives and techniques we have mentioned. We will first list the basic structural features and the objectives they are intended to meet, and then go on to describe these features in more detail.

We begin by discussing expandability. The basic "computer" we design will be called a Computation Node, and we will refer to external expandability and to internal expandability within a node. External expandability refers to the fact that these nodes will be especially well suited to being hooked together as a network of computers. We will briefly discuss this subject in a later chapter. The bulk of this thesis is concerned with the design of a single computation node. Internal expandability refers to the fact that our modular approach will allow for varying numbers of all the major control, memory, and computation portions of the machine. This is not fundamentally different from existing computers which allow for varying numbers of CPUs, memory modules, I/O channels, etc. Our approach will allow for significantly greater flexibility in this area than currently exists.

To allow for the implementation of compiler and operating system functions in hardware, we will employ three levels of machine languages, ranging from an APL-like, high-level vector language to a language which is basically a set of queue entries specifying physical machine addresses and types of operations. These various levels of machine language also help in keeping the design modular with well defined interfaces. These languages, in conjunction with an overall philosophy of having all processing driven by local queues and control, will ease implementation of multiprogramming and multiprocessing.

We have already mentioned that we will employ a minimum parallelism of 8. We will extend the potential parallelism without restricting the generality. We do this by extending conventional multiprogramming and multiprocessing to allow these functions at the queued instruction level. In particular, the largest version of this machine will allow four programs to be running simultaneously, distributing their vector instructions to up to six 8-wide arithmetic units. Of course, additional parallelism could be obtained through the external expandability we have mentioned.

Before going on to a more detailed description of the machine's structure, some additional comments on our objectives are in order. We are not attempting to provide maximum potential computation power at minimal cost, or even maximum usable computation power at minimum cost.
Instead, we are considering what we believe to be the correct problem: providing the most cost-effective overall system. Overall system cost includes both the cost of developing system software and, ultimately, the cost of doing applications programming. Thus, much of the structure we will discuss is intended to make the machine more useful in this general sense.

Figure 1 gives the overall structure of the Computation Node, Figure 2 gives the structure of the Computation Unit within the Computation Node, and Table 1 lists the units in these figures and briefly describes their functions. In order to provide a general idea of the operation of these units, we will provide a brief example. The example will raise more questions than it answers, but it is only intended to provide an initial impression of the functional structure we have in mind. Later chapters will describe the structure and function of these units in more detail.

[Figure 1: Computation Node]

[Figure 2: Computation Unit]

TABLE 1
MAJOR COMPONENT FUNCTIONS

Instruction Unit Dispatcher (IUD):
(1) Map logical registers of OFFL onto the various physical registers within the computation unit and by so doing schedule the various execution units. (2) Do conflict resolution between the competing MIDs.

Vector Execution Unit (VEU):
(1) Perform the actual processing of all vector operations of OFFL. These will include arithmetic and routing, and may include special-purpose vector operations.

Scalar Execution Unit (SEU):
(1) Perform the actual processing of all scalar operations of OFFL.

Vector Buffer:
(1) Provide temporary storage for vectors.

Scalar Buffer:
(1) Provide storage for scalars.

Vector Switch:
(1) Provide paths for routing vectors between the various vector execution units, buffers, primary memory, and the MIDs.

Scalar Switch:
(1) Provide paths for routing scalars between the scalar execution units and the scalar buffer.

Computation Unit:
(1) Includes all of the above functional units and their interconnections and connections to the external world.

Macro Instruction Decoder (MID):
(1) Decomposition of macro instructions into OFFL. (2) Program control. (3) Initiation of page faults. (4) Anticipatory I/O.

Memory Manager:
(1) Initiates requests for page swappings. (2) Assures that sufficient core is available to maintain a high level of efficiency.

Memory Controller:
(1) Maps logical memory addresses into physical addresses. (2) Does the actual addressing of memory.

Main Memory:
(1) High-speed, random access storage.

Backup Storage:
(1) All other storage devices in the machine not mentioned above.

Computation Node:
(1) All of the above functional units, their interconnections, and connections to the external world.

Our example will consist of a brief APL program segment. A, B, and C are vectors of length 24; D, F, and G are scalars. The program consists of the element-by-element add of A to B and the dot product of the result with C. This result is stored in F and also added to D, and that result is stored in G. The program for this is:

    F ← +/C×A+B
    G ← F+D
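For reference, the effect of this program in ordinary scalar terms is sketched below in Python; this rendering is purely illustrative (assuming A, B, and C are the 24-element vectors and D the scalar above) and is not one of the machine languages discussed in this thesis.

    # Element-by-element add of A and B, dot product of the result with C,
    # then the two scalar operations.
    f = sum(c * (a + b) for a, b, c in zip(A, B, C))   # F ← +/C×A+B
    g = f + D                                          # G ← F+D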
We will refer to the highest level vector language as Universal Assembly Language (UAL). The name is intended to reflect its machine-independent character. We will describe how the above is translated into UAL and how it is processed by our machine. UAL basically consists of 3-address instructions. In order to minimize the size of these instructions, their operands will refer to a small group of special registers, which will contain descriptor information for the actual operands. Special 2-address instructions are used to relate these registers to program defined variables. The UAL for the above will be as follows:

    Instruction   Operand       Comment
    SETADR        T1 ← A        Register T1 now refers to variable A.
    SETADR        T2 ← B
    ADD           T1+T2 → T1
    SETADR        T2 ← C
    MULTIPLY      T2×T1 → T1
    VECSUM        T1 → T1       T1 now refers to the sum of the components of T1 before this statement.
    STORE         T1 → F
    SETADR        T2 ← D        T1 is not used here because its value is needed later.
    ADD           T2+T1 → T1
    STORE         T1 → G

An MID must translate these UAL instructions into a machine-dependent Operand Fixed Format Language (OFFL). This name is intended to refer to the fact that this language explicitly recognizes the vector width of the machine. The IUD must in turn translate OFFL into a sequence of queue entries within the Computation Unit. These queue entries will ultimately result in a logically correct execution of the code. Figure 3 shows the tree for this program. Table 2 shows what the OFFL instructions might look like.

TABLE 2
OFFL PROGRAM

    Instruction   Operand           Comment
    LOAD          T1 ← A0-7         T refers to vector registers.
    LOAD          T2 ← B0-7
    VECADD        T1+T2 → T1        Vector addition.
    LOAD          T2 ← C0-7         There is no reason not to reuse T1 and T2 in this and the previous statement.
    VECMUL        T2×T1 → T1        Vector multiplication.
    LOAD          T3 ← A8-15        We do not reuse T1 or T2 here, so we can overlap the following sequence of five statements with the above.
    LOAD          T4 ← B8-15
    VECADD        T3+T4 → T4
    LOAD          T3 ← C8-15
    VECMUL        T3×T4 → T4
    VECADD        T1+T4 → T1        This add cannot be overlapped, so there is no reason not to reuse T1.
    LOAD          T5 ← A16-23
    LOAD          T6 ← B16-23
    VECADD        T5+T6 → T6
    LOAD          T5 ← C16-23
    VECMUL        T5×T6 → T6
    VECADD        T1+T6 → T1
    MOVE          T1 → Ts0-7        Move the result vector into 8 separate scalars (Ts0 through Ts7) to complete the summation.
    SADD          Ts0+Ts1 → Ts8     Scalar addition.
    SADD          Ts2+Ts3 → Ts9
    SADD          Ts8+Ts9 → Ts8
    SADD          Ts4+Ts5 → Ts10
    SADD          Ts6+Ts7 → Ts11
    SADD          Ts10+Ts11 → Ts10
    SADD          Ts8+Ts10 → Ts8
    STORE         Ts8 → F
    LOAD          Ts1 ← D           At this point all scalar temporaries except Ts8 are available.
    SADD          Ts8+Ts1 → Ts8
    STORE         Ts8 → G

[Figure 3: Program Tree (operations arranged by level)]

In constructing this table and figure, we have chosen temporary register locations to show how they can be used to control the sequencing of the program. This is the basis of the method by which the hardware permutes the order of execution of instructions in any way that optimizes resource utilization without affecting the logical outcome of the code. In particular, the MID has a small number of logical temporary registers available to it. These refer to 8-word wide vectors. It uses these to break up instructions operating on arbitrary sized vectors in UAL into OFFL instructions. As soon as all instructions that use one of these temporaries have been generated, that temporary may be reused. The IUD assigns physical register locations to these logical locations. It does this in a way that allows for the maximum possible parallelism. In particular, the reuse of a logical temporary in OFFL will always be assigned a different physical location. If this were not the case, the store to this temporary would have to wait until all loads from it requiring its earlier value have completed. Thus, the assignment of temporaries in our example reflects the way in which the IUD might assign physical registers. The MID can be more careless about reassigning logical registers, because the IUD operates as just described.
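The IUD's assignment of a fresh physical location on every reuse of a logical temporary is, in modern terms, register renaming. The Python sketch below illustrates only that policy; the table and free-list names are our own, and the actual mechanism (including the use counts discussed in Section 3.2.2) is developed in Chapter 4.

    # Sketch: every OFFL store to a logical temporary binds it to a fresh
    # physical register, so reuse of a logical name never forces a wait.
    free_physical = list(range(32))    # assumed pool of free physical registers
    mapping = {}                       # logical temporary -> physical register

    def define(temp):
        """Process a store to a logical temporary."""
        mapping[temp] = free_physical.pop(0)
        return mapping[temp]

    def operand(temp):
        """Process a load from a logical temporary."""
        return mapping[temp]

    # T1 is stored to twice in Table 2; the second store gets a new physical
    # register, so pending loads of the first value are unaffected.
    first, second = define("T1"), define("T1")
    assert first != second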
The scheme just described applies only to vector instructions. Scalars are handled in a different way but with essentially the same result. There can exist several copies of the same "physical" scalar at the same time. Associative tables and a time indexing scheme keep them straight. The parallelism shown in Figure 3 could be utilized by our machine.

3 BASIC BUILDING BLOCKS AND DESIGN TECHNIQUES

In the process of designing a family of machines to meet the objectives discussed previously, we found ourselves using the same sort of functional units and the same design techniques in many different contexts. Before we describe this detailed design work, we will provide a generalized discussion of the basic building blocks and techniques which we have evolved. We regard this set of very high-level building blocks as significant. The cost of integrated circuits is much more critically dependent on the number of circuits ultimately to be produced than on the number of gates in the circuit. Thus, if one can establish a canonical set of ICs from which a broad class of computers can be constructed, one will be able to keep the cost of the computers themselves down. In this thesis we have only undertaken the first step in the complex process that could ultimately lead to the fabrication of such a canonical set of ICs. That step is the recognition of the functional similarity of many of the units we will construct. We have made no attempt to provide sets of logical designs that will be of universal validity for the various function types. Such a process would be desirable if one were planning to construct machines of the sort we have designed. Section 3.1 is a discussion of these basic building blocks as well as how they can be combined to form more complex blocks. The resulting structures are in some ways similar to a well designed program made up of small subroutines existing at many levels in an overall hierarchy.

The techniques of pipelining and parallelism are well known and widely used, but they are in a fairly primitive state of development. Using the building blocks just mentioned, we have applied these techniques in a somewhat systematic way in the course of doing detailed design. Section 3.2 is a description of the various ways in which we use pipelining and parallelism to achieve our objectives. Section 3.3 is a theoretical analysis of parallelism. Its main purpose is to provide some perspective on the dimensions of this field and to suggest some unconventional approaches to gaining greater understanding of this subject.

3.1 BUILDING BLOCKS

In this section we first discuss the motivation for structuring the machine as we have. We then discuss the lowest level of building blocks, or functional units, from which the entire machine is constructed. Finally, we discuss how more global units are built up from these basic units.

3.1.1 Motivation

The building blocks we have chosen arise from three aspects of our overall approach. These are our attempt to provide local distributed control, hardware implementation of compiler and operating system functions, and parallelism itself.
The building blocks these give rise to are: queues, controls, switches, access controllers, and descriptive tables. We will define each of these units and relate them to the three aspects of our approach in the next section.

3.1.2 Basic Building Blocks

We now describe the basic building blocks. These differ from the primitives in Bell and Newell's PMS notation. They are not basic components from which any computer can be constructed; they are fairly complex units. They also differ from the various boxes that one inevitably draws when discussing conventional computers. They are more primitive in the sense that they occur repeatedly in many different contexts in the overall design.

3.1.2.1 Queues

Conceptually, a queue is a linearly ordered list of requests for resources. Our basic building block queue arises from the need to provide local control. By allowing hardware to determine the sequence of instruction execution, we can allow for more efficient utilization of resources. To provide for this, our queues attempt to provide first in first out service. They also attempt to keep the unit they drive as active as possible. The algorithm to meet these objectives will be to examine the oldest entry in the queue first, and then successively newer entries, until one is found for which all required resources are available. Such FIRFO queues will be used to drive the vector and scalar execution units and the vector switch. The control of the queue itself, the testing of outside resources, and the actual decisions about what action to take will involve other components. The queue is a memory that allows for the access of entries at the data rates and in the sequence required for implementing the above functions. It also allows new queue entries to be made at the data rate required.
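To make the search discipline concrete, a minimal Python model follows. The resource test is deliberately left abstract; as Section 3.2.1.2 describes, it differs among vector units, scalar units, and memory pages.

    # Sketch of a FIRFO queue search: scan from the oldest entry toward newer
    # ones and dispatch the first entry whose resources are all available.
    def firfo_select(queue, resources_available):
        """queue: list of entries, oldest first.
        resources_available: predicate on the driven unit and the operands."""
        for i, entry in enumerate(queue):
            if resources_available(entry):
                return queue.pop(i)    # dispatch to the unit this queue drives
        return None                    # nothing ready; retry on a later clock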
3.1.2.2 Controls

A control is a unit capable of sensing various states of its environment and initiating actions based on this information. The most general sort of control would be a full scale Turing Machine with I/O. Our controls will not correspond to the standard definition in that memory of previous states will not reside in the control itself, but will always be in queues and status tables which the control can interrogate. Controls in general will need to respond very quickly, often within a few gate delays. Thus, controls will consist of some combinatorial logic with sequencing circuits driving the unit. They can range from a very simple to a rather complex combinatorial circuit driven by the machine clock. They will be used throughout the machine: driving switches, using queues to sequence various units, determining the operation of the IUD pipe at all stages within it, and in general keeping the machine operating by individually keeping each of its parts going.

3.1.2.3 Switches

The process of transferring data and instructions to the units requiring them will frequently involve the use of small crossbar switches. In general we will confine the use of the word switch to just such units used for such purposes. We will use the term router to refer to units which sort data under program control.

3.1.2.4 Access Controllers

Many of the resources of this machine can service different controls. Access controllers referee this competition. They may provide some priority scheduling scheme and usually have some memory of the previous allocations of the resource they referee. This memory allows them to ensure that no requesting unit can be totally locked out.

3.1.2.5 Descriptive Tables

A descriptive table is simply a memory that contains information about the current state of the machine and about programs that are executing. One traditional use of such tables that will occur frequently in our design is tables which provide information about the status of registers or buffer memories. The difference between tables and queues is the method of access. Tables may be either associative or addressable memories, and they will sometimes be accessible by either method. In addition, the high data rates required in some instances may necessitate the ability for multiple simultaneous access.

3.1.2.6 Traditional Components

All of the components we have discussed arise from existing concepts. Our definitions have been restricted in ways to suit our purposes.
Both of these constraints would be ex- tremely troublesome in a machine with yery high gate counts. On the other hand, faster logic with less power consumption does seem to be in the offing. The one constraint on machine speed that is not likely to change is the delay times for signal transmission, i.e., the speed of light. Our two-level clock will be especially useful in accommodating this fact of life. Our basic approach is not geared to developing optimal techniques for a current techno- logy, but rather for developing techniques that will become increasingly attractive in the near future given the direction that technology is moving. There are structural problems inherent in our two clock levels, and we will now describe how we deal with these. We can think of the clocks as being centrally located, synchronized, and broadcasting pulses to all parts of the machine. Major components must be constructed to operate internally at the fast clock rate. The interfaces between these must operate at the 26 slower clock rate, and special Interfacing logic will be designed so that the major units can internally behave as if there were a single fast clock. We will not specify absolutely what constitutes a major component since this is a technology dependent decision. We will indicate in the process of doing detailed logical design the various levels at which the machine can and cannot be partitioned. Major components will be transmitting both data and control information among themselves at the major clock rate. We would like to minimize the width of the paths and thus, if possible, pipeline transmission on them at the minor clock rate. This can be done without losing the advantage of the longer clock rate between major components. The sending unit will transmit its output at the minor clock rate and will also send a copy of its clock pulse in parallel with the information. Provided all parallel paths are the same length, this pulse can be used for reading the transmission line. We can design a single interface that accepts input from such a line and the clock pulse of the receiving unit and that inputs data to the receiving unit according to its clocking. This can be accomplished by a simple circular buffering technique in which registers are written with one timing pulse and are read by the other Dulse. Since both pulses originate from the same mas- ter clock, there will be a constant phase difference between them. There is an additional buffering problem associated with this timing structure that our interface will handle. Not all of the units can neces- sarily process an arbitrary input stream at the maximum possible rate. There is usually some internal buffering to minimize the effects of transients, but it is possible for these buffers to become full. Just prior to this occurring, the receiving unit must notify its interface to stop transmitting information. Because of the long delays possible between major units, as 27 long as two major clocks worth of information may be transmitted before the interface has a chance to notify the sending unit to halt. The inter- face must buffer this amount of information. It would be possible to pro- vide a single design for such units with varying capacities and use them at all long delay interfaces. 3.1.3.2 Pipeline or Parallel Units The timing structure we have just described is well suited to either pipeline or parallel execution units or a combination of both types. 
3.1.3.2 Pipeline or Parallel Units

The timing structure we have just described is well suited to either pipeline or parallel execution units, or a combination of both types. We have established that transfers between units will be structured in a pipelined manner to minimize interconnections. This structure is perfectly suited to full parallel operation. We need only introduce an 8-word wide buffer to accumulate the pipelined inputs in preparation for their access by a fully parallel execution unit. It is assumed that the execution time of a fully parallel unit would be at least 8 minor clock periods. The interconnections are suitable for direct input to a pipeline processor. If there is a set-up time associated with the processor, then changing to a different operation might require some buffering similar to that for a parallel unit. On a machine of the nature we are constructing, a pipe with a long set-up time for general-purpose arithmetic would be impractical. On the other hand, such a pipe might be desirable for some specialized processing unit designed for a specific function in a specific algorithm. The timing structure provides substantial but not unlimited flexibility in choosing the structure of computational hardware.
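The buffer in front of a fully parallel unit can be modeled in a few lines of Python; the width of 8 is the machine's vector width, and the rest is illustrative.

    # Sketch: accumulate one word per minor clock; after 8 minor clocks (one
    # major clock) hand the complete 8-wide operand to a parallel unit.
    WIDTH = 8

    def make_accumulator(parallel_unit):
        pending = []
        def on_minor_clock(word):
            pending.append(word)
            if len(pending) == WIDTH:
                parallel_unit(tuple(pending))  # unit runs >= 8 minor clocks
                pending.clear()
        return on_minor_clock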
3.1.3.3 Additional Advantages of the Interconnection and Timing Structures

The interconnection structure we have described is particularly well suited to error correction and detection techniques and to performance monitoring. We regard both of these functions as being particularly important, given our overall approach. The larger the gate count of a machine, the lower the MTBF. In addition, the more costly a machine is, the more expensive down-time is, both directly in terms of lost computer time and, even worse, in the delay of projects dependent on the machine. The most serious obstacle to effectively using ILLIAC IV is a very small MTBF combined with something like an hour required to isolate a failing PE, replace it, and verify that no new errors have been introduced in the process. Fixing the CU requires several hours to days. To some degree, these problems are undoubtedly due to the decision to build ILLIAC out of the fastest logic available instead of logic that has been more highly developed and is better understood. Larger scale integration is likely to significantly improve circuit reliability [9]. Nonetheless, in constructing computers with very large gate counts, reliability problems inevitably increase. Providing a design which allows for the easy addition of architectural features that improve reliability is an extremely desirable feature for a paper machine of the sort we are proposing. It allows decisions to be made after hard information on reliability has been obtained. It also allows for the construction of machines with various cost-versus-reliability tradeoffs. Since many of the applications for large computers involve real-time processing, there may exist a need for super-reliable versions of such machines.

Performance monitoring is important for any complex, expensive system. Complexity inevitably implies that deep analytical understanding becomes extremely difficult and expensive to obtain, or simply impossible. Many existing computer architectures have reached the point where such understanding is at least pragmatically impossible to obtain. Our design has evolved from existing concepts but is significantly different from, and more complex than, existing systems. Thus we are in a position where analytical understanding is impossible and experience with existing machines is inadequate. Providing detailed critical information on performance in prototype models would be a mandatory requirement for perfecting the design concepts we are describing. Providing such information in existing systems would be extremely important in developing operating systems and compilers. Because of our modular structure, hardware improvements would also be possible at this stage. Finally, such monitoring would provide excellent feedback on what constitutes good programming techniques for this architecture. We will now describe how improved reliability and performance monitoring are obtainable from our structure.

3.1.3.3.1 Error Detection and Correction

A traditional, expensive, but simple method of providing error detection or correction is replicated hardware. Providing duplicates for all components allows detection of errors as discrepancies in the outputs. Error correction is provided by triplicated hardware and majority vote when an error is encountered. In the case of main memory and other back-up memory devices, we would propose that single error correction, double error detection codes be used. This provides protection essentially equivalent to triple redundancy at a modest cost in additional logic. We propose the more costly triple redundancy for the remainder of the logic because of its simplicity and suitability for the structure. In particular, the triple redundancy can be provided at the major component level. We can enhance the interface units we have described to include the error detection and correction function. This could be done without imposing more than a couple of additional minor clock delays in actual processing. The delays needed to synchronize the two or three input signals may impose some additional delay as a function of the physical layout of the machine.

One major advantage of this approach would be the ability to do real-time, automated error isolation. Upon detecting an error, the operating system could notify the operator to "Please replace Module X12 in Cabinet 5, Rack 4 with Part 210Z in Storage Cabinet 3, Shelf C." In double redundancy, the operating system could, in most instances, lock out the affected unit, restart any affected program, and continue operation with somewhat reduced capabilities. In the case of triple redundancy, no error should be introduced in any running program, and no reduction in capacity would occur. Clearly, if the MTBF of the individual components is at a reasonable level, the overall system could approach 100 percent reliability and availability. In addition, maintenance and repair in almost all cases could be done by unskilled personnel. Of course, repair of the individual modules would require different approaches, but this can be done in a leisurely manner if a reasonable inventory of spares is maintained.

By associating the error correction function with the interface unit, we can essentially eliminate the problem of who referees the referees. Figure 4 illustrates this structure. A, B, C, and D refer to different functional units. The subscripts refer to the three copies of each unit. There is an interface for each copy of each unit and each connection to a unit. The interfaces are labeled with the source name followed by the name of the particular copy of the destination unit. For error isolation purposes, all interfaces are to be considered as part of the unit they input to. Thus, if there is an error in AB0, this will affect the outputs of B0 and will be detected by BD0, BD1, and BD2. All three will signal the operating system that there is an error in B0, which includes the interface AB0. From that point on, until it is replaced, the outputs of B0 will be ignored.

[Figure 4: Error Correction Connections]
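The refereeing done at a receiving interface amounts to a majority vote plus a report naming the dissenting copy. A minimal sketch follows, with the operating system notification reduced to a callback and all names our own.

    # Sketch: compare the three copies of one input word at an interface.
    def vote(copies, report_error):
        """copies: dict such as {'B0': w0, 'B1': w1, 'B2': w2}."""
        values = list(copies.values())
        if values[0] == values[1] == values[2]:
            return values[0]
        for name, value in copies.items():
            others = [v for n, v in copies.items() if n != name]
            if others[0] == others[1] and others[0] != value:
                report_error(name)     # e.g. B0 (including interface AB0) bad
                return others[0]       # majority value passes through
        report_error(None)             # no majority: treat as double failure
        return None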
All three will signal the operating system 31 UNITS INTERFACES UNITS INTERFACES UNITS FIGURE 4 ERROR CORRECTION CONNECTIONS 32 that there is an error in B Q which includes the interface AB Q . From that point on until it is replaced, the outputs of B Q will be ignored. 3.1.3.3.2 Hardware Performance Monitoring The interfaces also provide an obvious source of information for very detailed performance monitoring. It would be practical to include within each interface a microcomputer to monitor the communication and selectively transmit information to a central performance monitoring system. If facili- ties were provided for altering the programming of these microcomputers, this structure would provide an extremely powerful and flexible system. As we have already mentioned, the somewhat radical and very complex nature of the structure we are proposing makes such a facility extremely desirable if not essential. It is our belief that as computers become more complex, real- time performance monitoring will become an essential element in the feedback loop that should lead to "better" computers. 33 3.2 GENERAL DISCUSSION OF DESIGN TECHNIQUES We will somewhat arbitrarily divide this discussion of techniques into a discussion of pipelining and parallelism analysis, and a discussion of techniques for reading and updating descriptive tables. The former refers to a flow analysis used to determine the degree of parallelism and pipelin- ing required to insure that all components of the machine are able to keep up with each other. It also refers to the queueing techniques used to smooth the flow between units. The processing of descriptive tables refers to the algorithms for maintaining an adequate description of the state of the machine and programs. Our hardware implementation of operating system and compiler functions and our queueing techniques require hardware maintenance of some fairly sophisticated tables. Before we describe these techniques in detail, we need to say a few words about our overall approach to pipelining and parallelism. More speci- fically, we will discuss what we consider to be the primary obstacle to effectively utilizing these techniques. This is program nondeterminism. We have kept the parallelism of individual computation units small enough that we know most programs can, in theory, effectively use it. To turn this theo- retical possibility into a practical reality and for other reasons which we have discussed, we will construct an elaborate system for hardware control of the individual execution units. The operation of this analysis and control hardware must be overlapped with actual program execution. This implies a pipeline structure. The more complex the analysis, the longer this pipe must be. We will do a flow analysis in the process of doing detailed design, which should insure that the machine will be operating efficiently as long as instructions keep flowing in at the head of the pipe. Conditional trans- fers can break up this flow and have a devastating effect on overall 34 efficiency. There are several aspects of our approach that will minimize this problem. The problem cannot be eliminated for all programs. In Section 3.3 we discuss parallelism from a completely abstract perspective. In particular, we will discuss the general question of the structure of algorithms and transformations to map them onto various parallel computing structures. For now we simply concede that there are some algorithms poorly suited to the structure we will develop. 
We believe this group of algorithms is quite a small percentage of all useful algorithms. We will now describe how the problems associated with conditional transfers can be overcome for most algorithms.

We have three complementary approaches to this problem. These are the use of an if tree analyzer, compilation and execution time analysis of flow of control, and instruction level multiprogramming. In analyzing FORTRAN programs for parallelism [5], it was determined that there are "bursts" of assignment, go to, and if statements. Special hardware has been designed [3] to process such if nodes in parallel. This results in converting a sequence of nondeterministic nodes to a single nondeterministic node. Such an execution unit can be included in the vector portion of our machine.

In referring to analysis of flow of control, we have in mind differentiating deterministic program loops from true program nondeterminism. The critical parameter is the time between when it is known which alternative of a branch must be taken and the moment when the branch occurs. Counting loops with a limit computed outside of them can be made completely deterministic. The compiler can recognize this situation, and the MID can be constructed to use this knowledge. In general, the compiler can attempt to move any branch-dependent computation as far ahead of the branch as possible. Back substitution and the introduction of redundant computations could be used in some instances. We will discuss these alternatives in more detail in Chapter 5.

Instruction level multiprogramming refers to the fact that we have up to four MIDs simultaneously processing different programs or different parallel paths of the same program. These MIDs simply load various queues which drive memory and other resources. If one of the MIDs is held up, it simply stops feeding the queues. Data rates are such that the other MIDs can take up the slack. In fact, one MID is capable of fully utilizing the computing resources. Further, no instruction ever gets past the MID unless it can proceed to completion. In particular, all operands must be in memory. Thus, in multiprogramming mode, if the nondeterministic branches are relatively sparse, utilization can approach 100 percent.

There are algorithms with a very high level of program nondeterminism, and they would not be well suited to an architecture of the sort we are proposing. However, we do believe that most programs will be able to run efficiently on our machine. As one measure of the level of nondeterminism in programs, we can examine some of the parameters measured in the analysis of FORTRAN programs [5]. Attempting to speed up a highly nondeterministic program with parallel execution inevitably results in very poor efficiency compared to executing the same program on a serial machine. Yet, it was possible to maintain an efficiency of 0.3 to 0.4 over a broad class of programs while using, in almost all cases, more than 16 parallel units and, in the majority of cases, more than 30.
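For perspective, speedup is simply efficiency multiplied by the number of units, so an efficiency of 0.3 to 0.4 sustained on 16 to 30 parallel units corresponds to speedups of roughly 5 to 12 over serial execution of the same programs.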
3.2.1.1 IUD Design Analysis The steps taken in designing the IUD were as follows: 1. Compute the instruction emergence rate required to keep the rest of the machine active. 2. List the functions the IUD was required to perform. Estimate how long each of these would take and list other functions they may be dependent on. (Table 24) 3. Make an IUD pipeline diagram giving time versus function(s) performed. (Table 25) 4. Do detailed logical design of each of the functional units. If any of the units cannot be designed to meet the estimates of step 2, modify the pipe diagram of 3 appropriately. In performing the internal design of the various units, a similar approach was applied in a less systematic way. For the most part, this pro- cess worked fairly well. Like any moderately complex subroutine in a com- puter program, we are quite certain that any of our individual designs could be improved upon by additional work. The final structure of the IUD pipe did turn out to be significantly different than the initial diagram we constructed. 37 The IUD processes instructions for all the various computation and memory units. In the process of doing the design, it was noted that the scalar instructions could be processed for the most part independently of the other instructions. There was a definite advantage to doing this processing independently after the instructions emerged from the main IUD pipe. The instructions could be processed at the maximum rate for scalar instructions as opposed to the maximum rate for all types of instructions at a considerable savings in hardware. In Section 4.6 we describe both our original structure and how it was modified in the course of design. The one unit in the IUD pipe that did not quite meet our time con- straint of 8 levels of logic was the unit that allocated functionally equivalent VEUs. This unit is designed in Section 4.6.3.2.7 and is probably the kludgiest of any of the units we designed. We choose to discuss that design here, not out of masochism, but because we learned the most in con- structing that unit. In the process of designing a portion of tnis unit, we developed a systematic notation for generalizing the techniques used to construct a carry save save adder. Of course the technique is not likely to be appli- cable to all problems of speeding up logical circuits. Further, the sys- tematic portion of the procedure is the notation. The notation must be applied in an intelligent and sometimes imaginative way to provide a high- speed logical design for a specific functional unit. Nonetheless, the notation described in Section 4.6.3.2.7 and used in the appendix does seem likely to be a powerful tool for designing fast and complex functional units. 38 3.2.1.2 Queuing Techniques We have described the basic operation of the FIRFO queues. In this section we will provide a somewhat more detailed analysis of their operation and an analysis of their size. A queued instruction is allowed to proceed when it is the oldest one in the queue and when all required resources are available. Resources refer to the unit the queue drives and the operands for each queued instruction. The unit becomes available at a time deter- mined by the previous instruction. The unit will instruct the queue con- trol of its becoming available in enough time to allow for the queue search and any preliminary set-up steps required. The determination of when operands are available differs with different types of units. 
We will briefly describe the operation of the vector and scalar execution unit queues and the memory queues. These units will be described in detail in various sections of Chapter 4.

There will be associated with each Vector Execution Unit physical registers for operands and results. When a vector instruction is processed by the IUD, the IUD will transmit instructions to switch the operands to the VEU assigned to the instruction. It will assign specific physical registers for those operands. The queued instruction within the VEU is ready to execute when those specified registers are loaded. A physical register for the result is also allocated within the VEU. Since this allocation is done by the IUD, no instruction which reaches the VEU queue can be held up for lack of a place to put the result. A range of 8 to 16 seems a reasonable size for this queue. This estimate is based on the fact that twice the number of operand registers as queue entries would be required for a binary unit, and that probably no more than 8 queue entries could be checked in one major clock. The first constraint is important because of the cost of the 8-word wide parallel buffer. Because there is likely to be something like a 6 major clock delay between when buffers are reserved by the IUD and when the instruction enters the VEU queue, we would want more than two registers per queue entry for a unit that only processes binary vector instructions. The second constraint is important because once it takes longer to search the entire queue than it does for a vector instruction to execute, it becomes increasingly likely that in doing a full queue search an earlier instruction that was not ready to execute when it was tested will become ready to execute. Thus, long queue searches can defeat the FIRFO philosophy.

Scalar Execution Units only have internal buffers for the current and next operands and the current and next result. All results from scalar instructions are assigned a time index. A scalar operand may or may not have a time index; if it has been recently computed, it will. There are many fewer time indexes than physical scalar buffer locations, and the time indexes are constantly being recycled. Thus, a scalar operand may refer to a physical location whose time index has been reused and is no longer associated with it. The mechanism for assigning these indexes is described in detail in Section 4.3. Any scalar operand without a time index is available. A scalar operand with a time index may be available in either of two places. For each of these buffers there is a set of bits, one for each time index, that indicates if the corresponding operand is in the respective buffer. The first of these is simply the main scalar buffer. Two queued scalar instructions may produce results destined for the same physical scalar buffer location. If the logically later of these is ready to proceed before the earlier, it will store its result in a special result buffer. The time indexes will assure that the correct value is ultimately stored in the scalar buffer and that intermediate instructions access the correct values. Because their operand buffers are not separate from each other, it makes sense to have a single Scalar Execution Unit queue drive all equivalent SEUs. This fact, combined with our earlier observation about queue size versus queue search time, means that we would probably want larger scalar queues than vector queues. A size of 16 would probably be reasonable. Each memory page of 8 x 1K words will have its own queue.
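Before turning to the memory queues in detail, the following is a minimal sketch of the scalar presence-bit test just described. The table names are ours, and the 256-index size is an illustrative assumption consistent with the tables later specified in Table 5.

    #include <stdbool.h>
    #include <stdint.h>

    #define TIME_INDEXES 256   /* assumed number of time indexes */

    /* One presence bit per time index for each buffer that may hold the
     * operand.  Bits are set as results return to the buffers and reset
     * as time indexes are recycled. */
    struct presence_tables {
        bool in_scalar_buffer[TIME_INDEXES];
        bool in_result_buffer[TIME_INDEXES];
    };

    /* An operand without a time index is always available in the scalar
     * buffer; one with a time index is available if either presence bit
     * for that index is set. */
    bool scalar_operand_available(const struct presence_tables *t,
                                  bool has_time_index, uint8_t time_index)
    {
        if (!has_time_index)
            return true;
        return t->in_scalar_buffer[time_index] ||
               t->in_result_buffer[time_index];
    }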
For the memory queues, the queue control must insure for every instruction that all indexes and modes are available, i.e., have been transmitted to the appropriate buffers within the page. Further, it must insure that the instruction can proceed without producing a logical error. Various schemes could be used to determine this. The simplest would require that all instructions proceed in exactly the sequence in which they entered the queue. One could allow non-indexed instructions to be executed out of sequence if their addresses insured that no conflicts would result. In the most general case, one could do arithmetic on all available and relevant indexes and modes to see if any instruction could proceed. As soon as any instruction with an unavailable index is encountered, the queue search must stop. A queue size of 8 to 16 would probably be reasonable for a memory page. Experimentation might reveal that queue sizes smaller than the ones we have suggested for all units might be practical.

In all the above cases involving local buffers, the various units must notify the IUD as buffer locations become available. There is an additional problem associated with the vector result buffers. Values from these buffers may be accessed as operands for other vector instructions. Thus, these locations may be used until the corresponding logical location is reused in the OFFL instruction stream. We must, however, insure there is space in these buffers for new instructions. Thus, the local control must initiate a transfer of some of these operands to the main Vector Buffer when it becomes too full.

3.2.1.3 Resolving Buffer Access Conflicts

Our local control and queue driven structure can often result in buffer access conflicts. Two methods for handling this are to allow multiple simultaneous accesses to the same memory and to provide hardware for conflict resolution. The first method is employed in some of the IUD tables because of the necessity for very high access rates. It involves providing multiple addressing logic and a larger fanout from each bit of storage. This makes the memory considerably more expensive, and it is thus only used when required by the data rates. For the other, more common case, we have developed a very simple, fast, and cheap circuit for conflict resolution. It is described in Section 4.4.2.4.

3.2.2 Tables

In this section we provide a general description of the hardware maintained tables and the algorithms for updating and accessing them. Vector tables are provided to map physical buffer addresses to logical buffer addresses and to maintain use counts for active buffer locations. A use count is the number of accesses to a particular register that have been processed by the IUD but have not yet occurred. Its use count must be zero before a physical buffer location can be reused. The original assignment of physical to logical address is made by the IUD when a store to a logical address occurs. This assignment is made from a known list of free physical registers. A table whose addresses correspond to logical addresses contains the physical address for each assigned logical address. This table is used to determine the physical location of an operand. Every access to a location in this table by the IUD results in an increase in the use count also stored in that location. Periodically, information is obtained from the various execution units giving a list of physical addresses which have been accessed. This list is used to access the table in an associative fashion and decrement the use counts.
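A minimal sketch of this use-count bookkeeping follows; the field names and table size are our own illustrative assumptions (the actual tables are specified in Section 4.6).

    #include <stdbool.h>
    #include <stdint.h>

    #define LOGICAL_ADDRS 256   /* assumed table size */

    struct map_entry {
        uint16_t physical;       /* assigned physical buffer address */
        uint8_t  use_count;      /* IUD-issued accesses not yet performed */
        bool     logical_reused; /* logical address reassigned since mapping */
    };

    /* Every IUD access to a logical address bumps the use count. */
    void iud_access(struct map_entry tbl[], uint16_t logical)
    {
        tbl[logical].use_count++;
    }

    /* Completed accesses reported by the execution units are matched
     * associatively against the physical field and decrement the count. */
    void report_completed(struct map_entry tbl[], uint16_t physical)
    {
        for (int i = 0; i < LOGICAL_ADDRS; i++)
            if (tbl[i].physical == physical && tbl[i].use_count > 0)
                tbl[i].use_count--;
    }

    /* Both conditions of the text: no outstanding accesses remain, and
     * the logical address has been reused, so no instruction not yet
     * processed by the IUD can still want the value. */
    bool physical_reusable(const struct map_entry *e)
    {
        return e->use_count == 0 && e->logical_reused;
    }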
When a use count goes to zero after the corresponding logical address has been reused, the physical address can be reused. Both conditions are necessary: the former insures that there is no access to the register that has yet to be made, and the latter insures that no instruction not yet processed by the IUD will require the value. A similar structure is provided to keep track of scalars. In particular, use counts must be maintained to insure that no physical scalar buffer address is overwritten while there is a queued instruction requiring that value.

3.2.3 Deadlock

In designing a machine with this structure, one must be certain that no deadlocks can occur. By consistently following two basic design constraints, we have assured this. First, no instruction gets past the IUD unless all resources required for its execution are immediately available. In particular, instructions which would cause a memory page fault do not get past the MID. All instructions which require the allocation of temporary registers have that allocation made within the IUD from a set of known registers that are physically available. The second constraint is that whenever a required resource is not available to the IUD, it ceases processing all instructions until the resource becomes available. For example, if an instruction requires space in a queue that is full, later instructions which may not directly use that queue will also be held up. Thus, no instruction will enter and possibly block a queue because it is dependent on the results of an instruction that is not yet in a queue. In conjunction with the first constraint, this assures that once an instruction enters a queue, all its operands will eventually become available and it can proceed. Thus, certain badly balanced instruction sequences could degrade the performance of this structure, but no instruction sequence could completely block it.

3.3 PARALLELISM - AN ABSTRACT DISCUSSION

This section is a general philosophical discussion of the nature and scope of parallel computing structures and is not immediately related to the remainder of this thesis. We will suggest a possible basis for relating the problem of understanding computing structures to the general problem of understanding mathematical structures. We will not be presenting established results, but rather pointing out similarities and suggesting possible approaches.

It is a great luxury in conventional computer architecture that all words of main memory are equally accessible by the processing portion of the machine. Parallelism replaces this "amorphous" topology of data interaction with a specific structure. In a totally abstract sense, the problem of parallel computer design is that of determining classes of data interaction topologies that correspond to significant real problems and that can be mapped in an efficient way to a single computer topology. Mathematics is the study of arbitrary abstract structures. Some of these are obviously and directly related to problems of computer architecture. We will describe some of these direct relationships for a very substantial portion of all mathematics and suggest possible approaches to investigating computer architecture utilizing this body of mathematics. We will then briefly explain why we believe the study of structures relevant to computation includes all of mathematics. Finally, we will discuss some of the implications of this point of view for a theory of mathematical truth.
Two fundamentally different measures of the strength of a mathematical system are provability and definability. The former refers to what questions can be decided by the system, and the latter refers to what questions can be stated within the system. As we have suggested elsewhere [2], one can directly relate mathematics through the hyperarithmetical sets to computation related structures. We can begin with some language adequate to describe all finite state machines. We Gödel number all statements within this language. We have a separate Gödel numbering for all Turing machines with blank input tapes. We code the outputs of these so that each output either represents the Gödel number of another Turing machine or the Gödel number of a statement in our language describing finite state machines. We now assign truth values to each of the Turing machines as follows: the truth value of a Turing machine is true if it has an unbounded number of outputs and the truth value for each member in some unbounded subset of these is true; the truth value for any output corresponding to a finite state machine statement is true if the statement is true. This structure is completely adequate to define all hyperarithmetical statements. This encompasses most mathematical questions and includes a broad area that Intuitionist mathematicians consider to be meaningless.

Central to this level of mathematical definability are the two related concepts of constructive ordinals in mathematics and non-deterministic Turing machines in computer science. The proof that every constructive ordinal has a recursive notation, defined in a particularly technical way [7], can be interpreted as demonstrating that there is a non-deterministic Turing machine that recursively and completely describes the structure of any constructive ordinal. Mathematical questions about hyperarithmetical sets are those which result from "iterating," up to some constructive ordinal, the question: is there an infinite subset of all true statements in a recursively enumerable collection of statements about finite state machines?

Constructive or recursive ordinals can also be used as a measure of the power of a mathematical system in terms of provability. Loosely speaking, the larger the recursive ordinal that can be proven to be a recursive ordinal in a system, the more powerful the system is in terms of provability. There are many mathematical languages rich enough to define all recursive ordinals, but no mathematical theory is rich enough to prove, for each recursive ordinal, that some definition in the language does define it. The concept of recursive ordinal can be thought of as a sort of measure or classification of level of complexity for an initial segment of mathematical structures. We suggest that this classification of structures might be a good starting point in a search for classifying various topologies of data interaction. As an example, the initial recursive ordinals correspond in a fairly direct way to the elementary mathematical operations of addition, multiplication, and exponentiation. These each have different and increasingly complex topologies of bit interactions. Different techniques of logical design are required in providing time versus gate count tradeoffs in implementing them. The concept of recursive ordinals provides a detailed and direct method of extending this hierarchy to more complex structures.
Further, it is my belief that the concept of recursive ordinals is directly connected to the computer science concept of iteration. This relationship tends to be obscured by the modern set theory treatment of ordinals. Modern set theory originated from an attempt to avoid the paradoxes discovered by Bertrand Russell in earlier versions of set theory. It seems to do so in an extremely elegant and powerful way. However, returning to the intuition that led Russell to discover the paradoxes, and to the resulting less elegant and less powerful theory of types that he proposed as a solution, will shed considerable light on the relationship between set theory and a computer related theory of iteration. The paradoxes arose from sets with self referencing definitions constructed in such a manner that if some element was a member of the set then one could show it was not a member of the set. Russell's solution was to provide a sort of index associated with all statements used in defining sets. This index provided a limit on the type of set used in the definition. The set being defined would have a higher index or type. Ordinal numbers, including the recursive ordinals, implicitly form a similar indexing scheme for set theory. We can consider the problem of iteration as that of applying various algorithms to each other. The problem of possible contradictions is replaced by the problem of whether the resulting algorithm computes a value or simply loops forever.

It is possible to consider iterations on a hierarchy of functions. For example, we can start with algorithms which compute integers from integers. We can then consider functions which, given a function of this first type and an integer, compute an integer. Given any type, we can consider a function of all lower types. Given an effective procedure for listing an infinite number of types, we can consider a function of a Turing Machine which enumerates an infinite sequence of such types. Using such types, we can construct more powerful techniques of iteration. We can also construct larger recursive ordinals. Finally, the topology of the interaction of the original operands becomes more complex and more general as we go to higher types.

We are not suggesting that any of these approaches is uniquely correct, but rather pointing out similarities and suggesting that each field and each approach may benefit from insights of the others. We would now like to outline why we believe reasoning about physically implementable processes is relevant to the outer reaches of mathematical research. We consider both the problems of definability and provability.

Problems associated with the set of all real numbers provide the first obstacle to providing constructive interpretations for all of mathematics. Cantor's proof that there cannot exist a one-to-one map from the integers to the reals makes it impossible to provide any constructive method of naming all the reals. Cantor did not prove that there were more reals than integers, since the existential status of the reals is in question. A possible interpretation of the reals is that they represent properties of Turing Machines. One can consider that the "meaningful properties" of Turing machines that one could invent might be limitless. By a meaningful property we mean a property that is either true or false for any given Turing Machine. Thus, each such property, under a particular Gödel numbering of Turing Machines, defines a real.
We can reflect the open-ended nature of the situation by employing a language for describing properties in which an infinite sequence of words is always left undefined. This seems to me to be a particularly desirable approach since it more closely reflects the reality of the situation. We know from the Löwenheim-Skolem theorem that any mathematical theory with recursively enumerable axioms has a countable model. This approach by itself would not be adequate to construct a constructive version of set theory. However, examining the actual combinatorial power of the axioms of set theory and seeing if similar constructive interpretations are possible seems to me likely to be successful.

We now consider the problems associated with providing constructive interpretations for set theory in the domain of provability. In doing so, we will confront what is probably the major philosophical problem with the approach we are suggesting. Mathematics is the one area of human endeavor that is generally considered to have a claim to absolute truths. Gödel's Incompleteness theorem showed that there exist fundamental problems with allowing mathematics to grow and at the same time retain the property of possessing absolute truths. One school of mathematics has jettisoned all but totally constructive proofs as a means of insuring the absoluteness of mathematical truth. The Intuitionists do not even accept the statement that any Turing Machine must either halt or continue indefinitely. On the opposite end of the spectrum we have what might be considered the mystical school of mathematics. This is the belief that intuition about infinite sets allows mathematicians to transcend the limits of Gödel's Incompleteness theorem when dealing with constructive processes. As far as I am aware, no one has seriously considered the possibility that mathematics should give up its claim to absolute truth outside of a narrow domain and become a speculative and experimental science. Our suggestion for handling the concept of real numbers is made in this spirit.

Gödel's Incompleteness theorem established that no mathematical theory in which a Universal Turing Machine is embeddable, and in which the halting problem can be defined, can decide within itself whether it is consistent. This establishes severe limits for any formal mathematical system with respect to its power of provability. For any such "true" system one can adjoin the statement that the system is consistent and obtain a more powerful system. In fact, one can regard the high power that set theory has in a provability sense as deriving from the powerful methods available within it for taking a powerful kernel system and iterating the statement that the system is consistent. This is accomplished via the strong axioms of infinity that allow one to construct models of increasingly more powerful subsystems. If one can construct a model for a system, one has a proof that it is consistent. An alternative approach would be to directly study and attempt to enhance the combinatorial power of this "iterative" process. But to attack the problem from that direction would require giving up the notion that the results are absolute truth.

This non-absolutist approach to mathematical truth has a philosophical appeal. Perhaps the severest problem associated with the accomplishments of Western mathematics, science, and technology is recognizing the limits of these endeavors.
It is fitting that the queen of the sciences be the first to establish precise limits for its power and scope. It is essential that we know what we do not know; otherwise we know nothing. That is why mathematics is so concerned with avoiding contradiction.

4 COMPUTATION UNIT - DETAILED LOGICAL DESIGN

In this section we provide a detailed logical design for the computation unit of Figure 2. We will briefly describe its overall physical structure. We will then describe its overall functional structure. We will then proceed to a functional and logical design of sufficient detail to provide realistic gate counts. In various subsections we will provide tables giving approximate gate counts for individual units. In a concluding section we will provide a summary gate count for the entire computation unit. In this section we will group buffers by their access times and compute their gate counts separately. This scheme is intended to give a very rough notion of the logical complexity and cost of this design.

4.1 OVERALL STRUCTURE

Figure 2 can be partitioned into four major units. We will refer to these as the scalar portion, the vector portion, memory, and the Instruction Unit Dispatcher. The scalar and vector portions are symmetric in the sense that they both consist of up to six execution units, a buffer, and a switch. The execution units are the portions of the machine that do all actual computation. The switches operate under hardware control and are responsible for transferring data between buffers, memory, and execution units. The vector buffer is a more or less conventional high-speed buffer for the main vector memory. There is also vector buffer space within the VEUs. The Scalar Buffer is the primary memory for scalars. It can be loaded via the Vector Switch from main memory for initialization. There are additional buffers associated with the scalar portion. They exist to enhance the throughput of the scalar portion and will be described in detail. The Instruction Unit Dispatcher is the most complex and unconventional of the major units. It has responsibility for mapping OFFL instructions into queue entries which drive the other units.

4.2 FUNCTIONAL STRUCTURE

The functional structure can be thought of as a generalization of the algorithms used to sequence the arithmetic on the IBM 360/91 [8]. All of the resources of the machine are queue driven. The queues are not strictly first in, first out, but rather first in which is able to begin using the resource, first out. We will refer to these as FIRFO, i.e., first in and ready, first out. An instruction is ready when its operands become available. What constitutes an available operand will vary with different types of functional units. This structure allows the sequence of instructions to be permuted in any way which enhances resource utilization without altering the logical structure of the original program. It is the responsibility of the IUD to insure the logical integrity of the original program. Most of the complexity of the IUD is a result of this function.

4.3 SCALAR PORTION OF COMPUTATION UNIT

The scalar portion of the computation unit allows us to perform operations on scalars without tying up the vector execution units. In addition, it contains a high-speed memory with sufficient space for the scalars in almost any program.
The units that actually perform scalar operations are constructed in a modular fashion to allow for the construction and use of specialized hardware at any time during the operational life of the machine.

4.3.1 Overall Structure of the Scalar Portion of Computation Unit

Figure 5 shows the structure of the scalar portion of the execution unit. We will briefly describe the functions of each of the units in the figure and the nature of their interconnections. The Scalar Execution Units contain the queues, control, and logic to sequence and perform the scalar operations. These units receive instructions from the SIDS through the instruction switch. The execution units make use of the tables in the Scalar Buffer Status unit to determine admissible instruction sequencing. The execution units also provide information for updating these status tables as instructions are executed. Every major clock, the Scalar Buffer Status unit and the SIDS exchange information to update their respective status tables. The functional structure of the SIDS relevant to sequencing instructions will be described in Section 4.3.2.1. Detailed design of the entire SIDS is in Section 4.6.5. The Result Buffer is used to buffer results that would otherwise overwrite operands needed by instructions waiting to execute. Its contents may be accessed as operands and will eventually be transferred to the Scalar Buffer through the Scalar Switch. The Vector-Scalar Buffer is used for transferring scalars between the VEUs and the SEUs. It is addressable as if it were an extension of the Scalar Buffer, but it has a special status table in the Scalar Buffer Status unit that must be updated with information from both the SEUs and VEUs.

[Figure 5, "Scalar Portion of Execution Unit," is a block diagram showing the Result Buffer, Vector-Scalar Buffer, Scalar Buffer, Scalar Switch, Scalar Buffer Control, Scalar Switch Control, and Scalar Buffer Status unit, up to six Scalar Execution Units, and the instruction switch, with paths to and from the Vector Switch, the VEUs, the vector execution units, and the SIDS (IUD subsystem).]
A special switch is provided to allow results to be used as ope- rands without going through the scalar switch. The computation hardware contains the logic to perform the actual scalar operations. Working registers are included in the figure to emphasize the buffering function of the other registers. If an interrupt condition occurs, the MID will be notified and processing will continue. 57 z I 1 z o o I— I- 1 •— t Q- 1— 1— SB c_> o ttMQ ZD t— 1 cc c£ •— • (V CJ UIOS \- 1— CO CO Z3 Z UJ o z: ct: uj l-t O |— I— r 3Z 1— =) <_> CO UJ z 211— z r> o O •— i ►-" cr l-H CC ZS. 1— LU z i UJ ^D o- UJ CO ! 1 1 I 1 cc UJ a: UJ CC LU 1 1— h- 1— cc CO co CO LU l^-l i— 1 1 — 1 2TU- DC CD CT5 CD O U_ O UJ UJ LU CCZD sri— cc cc CC U_ CO O t-* q; s as o o a a: (s > U_ CO z z z: z h- • or 13 UJ I— UJ CO Q_ M D_ LU o 3 CO o CC 1 o: I i s: —i L — o U_ CO l' UJ i in O O >- l— CO •—< LU IS cc co o UJ X cc CO CO UJ cc ZZ> CO cc Ml B n q q S(A q ) A 01 B q q P q R(B q ) A o B 0-1 q q P q S(B q ) e e E o R < W e> N Q W 0-1 e e s(w e ) N Q W 0-1 e p S(H p ) N Q B 01 e q W e S(R e ) N e B q @l M P S(R p ) R e R(w e ) R P R(W p ) P q R(W e ) R(W ) P„ F~ q e S o Po B q q R(B q ) P o B„@l q q S(B q ) 68 TABLE 4 GATE COUNT FOR INSTRUCTION QUEUE Symbols Description Number of gates to shift one bit Number of gates in control bit logic Number of bits in shift control logic Gates to store one bit Number of bits per register Number of registers Number of control bits Symbol Sample Value Explanation N s 2 From Table 3 C L 25 From Table 3 S L 3 From Table 3 G m 4 N b 60 Must hold up to three data buffer addresses and an operation code N r 16 Queue length t 4 From Table 3 Gate Estimates Functional Unit Shift Control Control Bit Logic Test Selector Control Bits Register Formula Vs G N m c G N. m b Subtotal Sample Value 3 25 64 16 240 348 Multiply by N to get total for queue = 5568 69 status tables required. We will explain how the algorithms can be imple- mented within the required time constraints and provide gate estimates. We will not do detailed logical design for these algorithms. The principal complication of this unit is the variety of possible sources of operands. Most of these sources are redundant in the sense that their only purpose is to allow for rapid processing. Except for timing con- siderations, the same operands would be available from other sources. Since only extensive experimentation could provide accurate estimates of the cost benefit tradeoffs, we do not claim that the ideal design would incorporate all these features. We include them as suggestions and state what advantages they seem to provide. The scalar buffer is the primary source of operands. All operands with- out time indexes are in the scalar buffer. In addition, a table provides a list of what time indexed operands are in the scalar buffer. It is logically possible to eliminate all sources of operands other than the scalar buffer. The result buffer allows for the existence of multiple occurrences of the same physical scalar buffer address. In addition, it eliminates the sub- stantial delay between the time all accesses to a particular physical scalar address have completed and the SEUs will be aware of that fact and be able to overwrite that physical address with a new result. It is fairly certain that this latter function of the result buffer is essential to providing a reasonable throughput for the scalar execution units. With it all instruc- tions can be processed as soon as their operands are available. 
The result goes to the result buffer unless the SEUs know that the physical address of the result can be overwritten.

It is less clear to what degree the remaining two sources of operands we will now discuss are important for efficient utilization of the SEU. Both of them will help in lessening the load on the scalar switch and, in some instances, provide for more rapid instruction processing. First we consider the case where a request has been made to transfer an operand on the scalar switch. If the same operand is required for a queued instruction, a request can be entered that the operand also be transferred to the SEU that will be assigned the queued instruction. This is the only method the 360/91 uses in sequencing its various arithmetic units. The other case occurs when a result being computed is required for a subsequent instruction. The controller can be aware of this and can simply transfer the result to an operand buffer within the SEU. This is likely to be a desirable feature since it is very common to have the result of one operation be required by the next. A final possibility would be to allow one to reuse the operands of one instruction for the next. We do not include this alternative because of the switch required within each SEU for non-symmetric operations and because it is a less likely occurrence.

We will now describe the tables required to keep track of all these sources of operands. The scalar buffer requires only a single bit for each time index to indicate if that operand is present. The same is true of the result buffer. These bits are set as results are returned to the specified buffers and reset as the time indexes are recycled. There will exist short queues to drive the entries to the scalar switch. By allowing associative reads of these queues, we can determine if an operand is about to appear in the scalar switch. The final source of operands is results in the process of being computed. Again, an associative memory is required. Figure 8 describes the detailed algorithms for accessing these tables.

[Figure 8, "Algorithms for Accessing Scalar Status Tables," is a three-part flowchart, with both operands handled simultaneously. For each operand it tests, in order, whether the operand has a time index (if not, the operand flag is simply set); whether it is present in the scalar or result buffer (setting the OP1B/OP2B or OP1R/OP2R flags); whether it is about to appear on the scalar switch (setting the OP1S/OP2S flags, adding to the switch queue, and requesting delivery to the selected SEU); and whether it is still being computed (setting the OP1C/OP2C flags). A third chart gives SEU allocation: an SEU already containing an operand is preferred if available (setting OP1U or OP2U); otherwise the next available SEU is allocated, the operands-present test is made, and the next queue entry is fetched.]
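The per-operand decision chain of Figure 8 can be summarized in software. In this minimal sketch the flag names follow the figure, while the lookup helpers stand in for the hardware tables of Table 5 and are our own assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    /* Flags recorded for one operand, following Figure 8. */
    struct op_flags {
        bool op;   /* operand accounted for                        */
        bool b;    /* present in the scalar buffer (OPnB)          */
        bool r;    /* present in the result buffer (OPnR)          */
        bool s;    /* about to appear on the scalar switch (OPnS)  */
        bool c;    /* still being computed (OPnC)                  */
    };

    /* Lookup helpers standing in for the status tables of Table 5. */
    extern bool in_scalar_buffer(uint8_t time_index);
    extern bool in_result_buffer(uint8_t time_index);
    extern bool pending_on_switch(uint8_t time_index); /* associative queue read */
    extern bool being_computed(uint8_t time_index);    /* associative search     */

    void resolve_operand(bool has_time_index, uint8_t ti, struct op_flags *f)
    {
        f->op = true;
        if (!has_time_index || in_scalar_buffer(ti)) {
            f->b = true;              /* physical address: use scalar buffer */
        } else if (in_result_buffer(ti)) {
            f->r = true;
        } else if (pending_on_switch(ti)) {
            f->s = true;              /* also: add to switch queue and request
                                       * delivery to the selected SEU */
        } else if (being_computed(ti)) {
            f->c = true;
        }
    }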
Table 5 provides detailed specifications for these tables, including gate counts for the tables and estimated gate counts for implementing the accessing algorithms.

TABLE 5 SCALAR UNIT STATUS TABLES SPECIFICATIONS

    I.   Scalar Buffer Table
         Size: 256 entries
         Fields: 1 bit to indicate the presence of each entry
         Parallel Accesses: 6 reads, 2 writes
         Gate Count: 4*6*256 = 6144

    II.  Result Buffer Table
         Same as for the scalar buffer table.

    III. Pending Requests to Use Scalar Switch
         This table will be described in Section 4.4.3.

    IV.  Results Being Computed
         Size: one entry for each SEU, or a total of 6
         Fields: destination address (12 bits); time index (8 bits);
         1 bit indicating use of result buffer or scalar buffer; 6 bits
         indicating other SEUs requesting the result
         Parallel Accesses: 6 associative searches of time index;
         6 stores to the SEU result request bit (each SEU has its own
         bit); 1 initial store for all fields
         Gate Count: (6*12*8 + 6*4 + 27)*6 = 3762

4.3.3 Scalar Execution Unit Buffers

In this section we will discuss the scalar buffer, result buffer, and vector-scalar buffer shown in Figure 5. We will describe their internal structure, their external connections, and conflict resolution. The memories are organized into independently accessible modules. The number of these is determined by the maximum data rate at which the memories can operate. This in turn is determined by the data rates of the SEUs. Table 6 provides this analysis for the three buffers. The connections to the outside world include data paths to the scalar switch and other units as well as switches between the memories and these data paths. Table 6 lists the sizes of the required switches. Conflicts may arise when there are simultaneous requests to read and write the same memory module from the scalar switch. In addition, conflicts may arise between the scalar switch and other units requesting access to the same memory module. Multiple requests for the same memory module by the scalar switch are resolved from within the switch. All other conflicts are handled by a simple rotating priority scheme. One of the requests is given priority and honored; the others must wait until they are given priority. The priority shifts between units in such a way that all are given priority once before any receives it twice. A detailed design of this type of priority logic for a more complex case will be given in Section 4.4.2.4. Table 6 provides a gate count for all the logic discussed.

TABLE 6 DESIGN PARAMETERS AND GATE COUNTS FOR SCALAR BUFFERS

    Buffer                Communicates With  Max Output Rate*  Max Input Rate*
    Scalar Buffer         6 SEUs             12                6**
                          Result Buffer      --                6**
                          Vector Switch      --                8
                          TOTALS             12                14
    Result Buffer         SEUs               12                6
                          Scalar Buffer      6                 --
                          TOTALS             18                6
    Vector-Scalar Buffer  SEUs               6                 6
                          VEUs               6                 6
                          TOTALS             12                12

    *Data rates in words per major clock.
    **The sum of these must be <= 6.

The above data rates are based on the assumption of 6 SEUs operating at full capacity. The maximum rates are in general the maximum possible rates for an individual unit, and not all maximum total rates could be maintained simultaneously. In converting these rates to actual access rates, we take full advantage of the fact that these units are 8-word parallel buffers. Conflicts keep this assumption from being totally correct, but given the highly queued nature of the design and the initial remarks in this note, the assumption seems reasonable.
TABLE 6 (cont.)

SCALAR UNIT BUFFER SPECIFICATIONS

    Buffer                Size   Read/Write Rate  Rate per Mod     Access Time      Gate Count
                                 per Major Clock  per Minor Clock  in Minor Clocks  (64 bit word)
    Scalar Buffer         8*256  26               0.41             2                524,288
    Result Buffer         8*32   24               0.38             2                65,536
    Vector-Scalar Buffer  8*32   30               0.47             2                65,536
    TOTAL                                                                           655,360

MEMORY SWITCH SIZES

    Buffer                Read   Write
    Scalar Buffer         8 x 2  8 x 2
    Result Buffer         8 x 3  8 x 1
    Vector-Scalar Buffer  8 x 2  8 x 1

4.3.4 Scalar Switch

The Scalar Switch transmits data between the buffers and execution units of the scalar portion of the Computation Unit. We have discussed its functional operation in the preceding sections. In this section we provide a detailed design. Table 7 summarizes the data and instruction paths of the switch. Figure 9 gives the structure of a representative portion of the switch. In discussing the SEU sequence controller, we did not specify how many SEUs each controller drives. The scalar switch ties together all the previously discussed scalar components, and thus, at least for the purpose of providing gate estimates, we need to assume some realistic configuration. In this section we have assumed three sequence controllers driving six SEUs. This would allow for three independent types of Scalar Execution Units.

The principal complexity of the switch design results from possible conflicts. Such problems may arise both in the instruction switch and in the data switch. Conflicts are resolved by priority logic like that mentioned in the previous section and discussed in detail in Section 4.4.2.4. Conflicts can occur for any of three reasons. All requests for accessing data originate from the queues. These requests must enter "source associated" queues. If more than one attempt is made at a time to make an entry, a conflict results. Another source of conflicts is the simultaneous attempt to access the same memory mod. The final source of conflicts is the limited number of ports into any unit. There may be too many simultaneous requests to use these ports. The mechanism for resolving all these conflicts involves two basic principles. First, once a request is made, the requesting unit waits until it receives confirmation. The priority hardware mentioned above assures that this will always happen fairly soon. The second principle is that requests are always made and honored for the earliest time at which the requesting unit is capable of honoring them. In other words, the requesting process is pipelined with the hardware that executes requests. Two of the above mentioned conflicts are interdependent. Both a memory mod and a switch port must be reserved in accessing any of the buffers. This is handled by not requesting a memory mod until a port has been reserved. It also requires an additional minor clock of pipelining in the requests for ports to insure that they can be used at full capacity. Table 8 provides gate estimates for this logic.

We add a final note that all this pipelining and requesting circuitry is not likely to create problems in throughput. This is because the SEUs only process one instruction per major clock, whereas all this conflict resolution occurs at the minor clock rate. In addition, the input and output of each SEU is buffered. Thus there should be adequate slack as long as the data rates can be maintained. The most likely source of trouble would be a poor distribution of data across the buffers. If this proved to be a problem, it could be alleviated by doubling the memory speed to a 1 minor clock cycle.
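As a minimal sketch of the request-confirm discipline just described, the following state machine models one requester's view of the two-stage, port-then-mod reservation. The state names and the simplified timing are our own; the bounded wait relies on the rotating-priority logic of Section 4.4.2.4.

    #include <stdbool.h>

    /* A switch port is requested first; only after the port is confirmed
     * is the memory mod requested.  Rotating-priority hardware guarantees
     * each confirmation arrives within a bounded number of minor clocks. */
    enum req_state { IDLE, WANT_PORT, WANT_MOD, TRANSFER };

    enum req_state step(enum req_state s, bool port_granted, bool mod_granted)
    {
        switch (s) {
        case IDLE:      return WANT_PORT;                        /* issue port request */
        case WANT_PORT: return port_granted ? WANT_MOD : WANT_PORT;
        case WANT_MOD:  return mod_granted  ? TRANSFER : WANT_MOD;
        case TRANSFER:  return IDLE;                             /* data moves this slot */
        }
        return IDLE;
    }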
TABLE 7 DATA AND INSTRUCTION PORTS FOR THE SCALAR SWITCH

DATA PORTS

    Unit                  Input Ports  Output Ports
    6 SEUs                6            6
    Scalar Buffer         2            2
    Result Buffer         1            3
    Vector-Scalar Buffer  1            2
    TOTALS                10           13

    See Table for the source of these figures.

INSTRUCTION PORTS

There are three SEU queues, each of which has a path to all three buffers.

[Figure 9, "Scalar Switch," is a block diagram of a representative portion of the switch; its labels were not recoverable from this copy.]

TABLE 8 GATE COUNT FOR SCALAR SWITCH

    Unit                         Number  Gate Estimate  Source of Estimate                   Total
    Source Queue (examines 2
      entries simultaneously)    2       2 400          4 entries in queue, Section 4.3.2.1  4 800
    Source 3x2 Switch            2       480            20 bit instructions                    960
    Source Queue Control         5       1 000          Figure                               5 000
    Source Queue                 1       1 200          4 entries in queue, Section 4.3.2.1  1 200
    Source 3x1 Switch            1       240            20 bit instructions                    240
    Local Data Switches:
      SEU 1x2                    6       256            64 bit word                          1 536
      Scalar and Result
        Buffers 8x5              2       10 240         64 bit word                         20 480
      Vector-Scalar Buffer 8x3   1       6 144          64 bit word                          6 144
    Local Switch Controls        3       3 000          Figure                               9 000
    Scalar Switch 10x13          1       39 520         64 bit words, 12 bit addresses      39 520
    TOTAL                                                                                   88 880

4.4 VECTOR PORTION OF COMPUTATION UNIT

The vector portion of the execution unit is intended to do the bulk of the actual processing. The primary purpose of the scalar unit just discussed is to avoid having to use full vector processors when these are not required. The justification for having the vector units is a combination of utility and economy. We know that most FORTRAN programs can effectively utilize vector units of at least width 8. We have already observed in Section 4.3 the substantial overhead that is involved with the queue driven and pipelined approach. By allowing each instruction to drive the equivalent of 8 parallel execution units, we minimize the cost of this overhead. The tradeoff in determining how wide such units should be is overhead cost versus utilization of the potential parallelism. As discussed in Section 3.3, we consider the whole area of parallel computing to be in a very primitive state. Thus, we justify the width we have chosen solely on the grounds that we know it will work for a very broad class of problems and that we can implement it with an acceptable level of overhead. We do not wish to enter into the extraordinarily complex question of quantifying the tradeoffs. We will now discuss the overall structure of the vector unit.

4.4.1 Overall Structure of Vector Portion of Computation Unit

Those portions of Figure 2 that constitute the vector unit are the VEUs, the Vector Switch, and the Vector Buffer. The VEUs perform the actual vector processing and are fairly complex units containing instruction queues, buffers, and other hardware in addition to that which does the actual computation. The Vector Buffer acts as a back-up reserve storage for the buffers within the VEUs. The Vector Switch is responsible for transferring data among these units and between them and the memory. In addition, it can transmit data to the Scalar Buffer and to the MIDs. It also contains its own internal queues. We will now describe each of these units in detail.

4.4.2 Vector Execution Units

Figure 10 gives the overall structure of a typical VEU. The unit is controlled by the sequencer, which reads instructions from the instruction queue. The sequencer tests successive entries in the queue until one is encountered with all operands present in the operand buffer. The sequencer will then set up this instruction to commence execution as soon as the current instruction is finished.
The access controllers resolve any conflicts that may occur in accessing the buffers. The hardware associated with the internal switch allows results to be used as operands without going through the Vector Switch. We will discuss all the units of Figure 10 in more detail in subsequent sections. We will first discuss the various possibilities for the computation hardware.

[Figure 10, "Vector Execution Unit," is a block diagram; its labels were not recoverable from this copy.]

4.4.2.1 Standard Arithmetic Units

The computation hardware may be a standard arithmetic unit. In other words, it may perform standard floating point and fixed point arithmetic and logical operations. We will not discuss the logic to do arithmetic or similar operations. There are a number of ways in which the parallel units can be organized. We will discuss several of these alternatives and their advantages and limitations.

The simplest structure would be an ILLIAC IV type of parallelism, that is, eight units driven by a single control sequencer such that the units all perform identical operations. An alternative method of implementing the same logical structure would be an eight stage pipeline. Since data transfers are pipelined within the unit as we have discussed, a pipelined arithmetic unit would fit in quite nicely. In the case of parallel units, we would have to phase them or buffer them to accommodate the pipelined data transfers. The advantages of this type of parallelism are that fewer gates are required for control purposes and the instructions are relatively simple. The disadvantage is the lack of flexibility. The statement that most FORTRAN programs can effectively utilize a parallelism of eight was based on the assumption that each unit can perform a different arithmetic operation in parallel.
This is that tree units have a scalar output and vector units a vector output. There are three possible destinations for such a scalar. These are the scalar portion of the Compu- tation Unit, the MID, and a vector operand as one element of it. In this latter case, the vector may be used in full vector computations and/or be used in more tree computations. We need to include hardware to accommodate these possibilities. We will now discuss each of these alternatives. We have already mentioned in Section 4.3 how some scalar buffer addresses refer to data from outside the scalar buffer. Results of vector instructions that have the scalar unit as destination need to be sent to the above men- tioned portion of the scalar buffer. The logic for handling conflict reso- lution for multiple VEUs will be in the scalar unit as discussed in Section 4.3.3.3. The VEU need only interpret this destination from the queue in- struction and send the data and its destination address out over the appro- priate path. The same procedure can be followed in the case of data headed for the MID. We do not do a detailed logical design of the MID, but the techniques discussed in Section 3.2.1.3 can be used to handle conflict resolution. The final alternative we have to consider is that of scalars that are to become part of a vector. We require special hardware within the vector portion of the computation unit for this case. This must include a scalar switch to transfer the data to the correct position in the destination vector and controls that are able to recognize when a complete vector has been assembled and is available for further processing. In the case of a single 88 SCALAR SOURCE and address ""8x1 VECTOR SWITCH BUFFER VECTOR OUTPUT PATH CONTROL VECTOR ELEMENT PRESENCE BITS VECTOR PRESENCE BITS When all 8 Vector Element Presence Bits are set for a single vector, then the corresponding Vector Presence Bit is set. FIGURE 11 ASSEMBLING SCALARS INTO A VECTOR 89 tree unit, it would probably be desirable to have this hardware within that unit. With multiple units we would probably want this hardware as part of the vector buffer. Figure 11 describes this logic and Table 9 provides gate estimations. Another type of tree we might wish to include is the if tree analyzer [3]. This would be especially helpful in reducing the amount of non- determinism in a program. Our highly pipelined structure makes this espe- cially desirable. Including such a feature is also necessary to obtain the theoretical speed of FORTRAN programs we have discussed earlier. Function- ally, the if tree analyzer is no different than the trees we have discussed except that it has a single output that goes to the MID. TABLE 9 GATE COUNT FOR SCALAR ASSEMBLING UNIT Unit 8 x 1 Switch Vector Buffer (8 vectors) Vector Element Presence Bits Control Vector Presence Bits TOTAL 25,000 Gate E: stima te 2 200 20 000 600 2 000 200 90 4.4.2.2 Vector Routers At least one of the VEUs will be devoted to a vector router or full crossbar switch. Given the relatively small size of our vectors and the broad spectrum of permutations that various algorithms may require, a full crossbar switch is justified. In addition to allowing the arbitrary permu- tations of a vector, this unit should allow for the combining of two oper- ands under mode control. It would also be desirable to allow selective partial broadcasting. Mode bits and routing patterns may either be included as part of the instruction or be dynamically computed within the EUs. 
4.4.2.3 Other Vector Units

There is no need to limit the computation hardware to the alternatives just discussed. Even after the machine has been constructed, different sorts of units could be added or used to replace existing units. In the next section we will discuss in detail the internal queues, switches, and controls for a single VEU. All of this hardware could be used "as is" for any type of VEU. Not all of it will necessarily be included in every VEU. The point is that at any time we could add specialized hardware without designing more than that hardware and some very simple interfaces.

4.4.2.4 Detailed Internal Structure of a VEU

In this section we will finish our discussion of the remaining units of Figure 10. We now list those units requiring elaboration. The instruction queue is logically the same as that discussed in Section 4.3.2.2. The operand and result buffers operate in a special phased array fashion which we will describe in detail. We will provide a detailed design for the general purpose access controllers which we have referred to in previous sections. The boxes associated with the internal switch will require no new specialized design. We will provide gate estimates for the entire unit at the end of this section.

We begin our detailed design with the phased array buffers. This memory is eight modules wide, corresponding to our vector width. All accesses are to a single vector stored in the same relative position across the memory. The data paths themselves are only one word wide, and so the data transfer must be pipelined. Once an access has started for one vector in the memory, we cannot always afford to wait eight clocks before starting a new access. Thus, we essentially shift the decoded address from one memory module to the next. We do this in a manner that allows a new address to enter at any clock. Similarly, we have a switch that allows the data to be transferred to several places. The addresses for this switch are shifted in parallel. Figure 12 shows the structure of such a unit. Table 10 describes its operation.

[Figure 12, "Data Buffer": address decoders feed slotted addressing units that shift enable patterns across the eight memory modules, with a fanned-out data path switching each selected word to the correct data path.]

TABLE 10 DATA BUFFER OPERATION

    Minor Cycle  Event
    0            Two addresses simultaneously enter the decoders and are
                 decoded.
    1            The enable patterns enter the addressing units and cause
                 the selected memory location to be switched to the correct
                 data path. A new pair of addresses is decoded.
    2            The enable pattern in the addressing unit is shifted one
                 to the right and a new pattern takes its place. These are
                 both used to switch two words to two data paths. A third
                 address enters the addressing unit.
    3            All enable patterns are shifted one right. A new address
                 enters the address decoder and a new enable pattern the
                 addressing unit. Three words are accessed.
    4            Same as minor cycle 3, but four words are accessed.
    5            Same as minor cycle 3, but five words are accessed.
    6            Same as minor cycle 3, but six words are accessed.
    7            Same as minor cycle 3, but seven words are accessed.
    8            Same as minor cycle 3, but eight words are accessed.
    9            Same as minor cycle 8, except the right-most address is
                 dropped.
    etc.         Operation continues in this manner.

During any minor cycle in which no new addresses are presented, there will be a vacant slot that will move to the right in the same manner as the enable patterns. Along with the enable patterns, there is a single bit which indicates whether a read or a store is being performed.

We now turn to the general problem of designing priority access controllers. The logic we design must be fast enough to operate in one minor clock. It must treat all requests equally. It must ensure that each requesting unit receives top priority once before any unit receives it twice. Finally, the design should be general enough to accommodate any number of requesting units up to 32. Actually, no application within this machine will require that many units, but our design will meet the other constraints for up to that many units. We will first describe our algorithm for the case when the number of requesting units is a power of 2. We will then show how the algorithm can be modified to handle the remaining cases.

In describing the power of 2 case, we will assume 8 requesting units. It will be obvious how to generalize to larger or smaller powers of 2. Functionally, our unit is presented with 8 bits, any combination of which may be set. We must produce an output of 8 bits, only one of which is set. This bit must correspond to one of the bits that was originally set. Over a period of time, the selection process must conform to the requirements listed above.

Physically, the unit consists of three levels, or log base 2 of the number of bits in the general case. At the first level, there are 4 two-state devices through which pairs of bits pass. At each level the number of devices is halved, and the number of bits passing through each device doubles. All the devices have two states. The bits passing through each device are divided into two groups. The device will pass on to the next stage a one bit from only one of these groups. The two groups passing through a single device form a single group for the next stage. Thus, by induction, each group has at most one bit set. The choice of which group to pass on is a function of the state of the device and its input. The state of the device indicates a preference for one group or, in the other state, the other group. By preference, we mean simply that if the preferred group has a one in it, that one is passed on; otherwise the other group's one is passed on. Of course, if neither group has a one, then nothing is passed on. Figure 13 gives an example. It should be clear that only one of the originally set bits can emerge. By changing the devices' states in an appropriate sequence, we are able to provide the uniform scheduling required. The lowest level changes state at every clock. Each higher level changes state in twice the number of clocks as the next lower level. Thus, every bit position will go through all eight priority states in every eight clocks. Figure 13 illustrates this.

We now consider the case where the number of requesting units is not a power of 2. We start with the design just described for the smallest power of 2 greater than the number we are considering. By appropriately allocating the requesting units to the excess available slots, by sequencing the entire unit correctly, and by allowing some requesting units to use either of two slots, we can meet the design constraints. We will present an informal constructive proof.

First, we precisely restate the problem. We have N units. We must construct a circuit which, when presented with N bits, any subset of which may be set, will select a single bit. It must perform this selection on a priority basis. These priorities must rotate in such a way that any bit will go through all possible priorities from 1 to N before any priority is repeated. We have already demonstrated how to construct such a circuit when N is a power of 2. We will prove the more general case by induction. It is clear we can construct such a circuit for N = 1. Now we assume we can construct the circuit for all integers less than or equal to M, the greatest power
We now consider the case where the number of requesting units is not a power of 2. We start with the design just described for the smallest power of 2 greater than the number we are considering. By appropriately allocating the requesting units to the excess available slots, by sequencing the entire unit correctly, and by allowing some requesting units to use either of two slots, we can meet the design constraints. We will present an informal constructive proof.

First, we precisely restate the problem. We have N units. We must construct a circuit which, when presented with N bits, any subset of which may be set, will select a single bit. It must perform this selection on a priority basis. These priorities must rotate in such a way that any bit will go through all possible priorities from 1 to N before any priority is repeated. We have already demonstrated how to construct such a circuit when N is a power of 2. We will prove the more general case by induction. It is clear we can construct such a circuit for N = 1. Now we assume we can construct the circuit for all integers less than or equal to M, the greatest power of 2 that is less than N. We will use these circuits to show how to construct a circuit for N.

We need to consider two cases, N even and N odd. First, if N is even, we use two circuits of size N/2. We then add one additional level that chooses between the outputs of these two circuits. By varying this selection choice every N/2 selections, we have the desired circuit.

The case where N is odd is somewhat more complex. Let K be such that N = 2K + 1. We begin with two circuits of size K + 1. Again, we will add an additional level to select between these. We will assign K of the inputs to circuit A and K + 1 to circuit B. In addition, we install binary switches to allow any of the K + 1 inputs of B to use the vacant input of A. We will need to assume that a circuit of size L + 1, with L even, can be used with L inputs. This is clearly true for L = 1. It will be true for larger L because of the way these circuits are constructed out of smaller circuits, as we have outlined above. In particular, the circuit of size L + 1 is made up of two circuits of size L/2 + 1, with a global level for selecting between them. Clearly, if these smaller circuits can be made to work for size L/2, then we can sequence the larger circuit in a way that it will work for size L. Thus, given the way we are constructing our circuit of size N, we may assume the circuits of size K + 1 can work for K inputs.

We now proceed with the construction of our size N circuit. For the first K states of the entire device, we sequence A and B in any way that ensures no input will have its priority repeated. We do this by giving B highest priority at the highest added level of the circuit and by sequencing A and B individually so they do not repeat any states.

For the remaining K + 1 states, we must give priority to A and, in addition, during each of these states, switch one of the B inputs into the vacant spot of A. We need to pay special attention to the element assigned priority K + 1 during each of these remaining states. All but one (we will call it Z) of the elements of B had that priority during the first K states. We will sequence A as if it were an ordinary K + 1 state device during these remaining states. Thus, during the state in which the vacant input of A has priority K + 1, we must switch Z into the vacant slot. During each of the other K final states, we must switch a different element of B into A. The element that must be switched during each of these other states is also uniquely determined.
Whatever priority the vacant element receives will correspond to a priority already assigned to K of the elements of B. Thus, the unique remaining element must always be switched. To complete the proof, then, we must show how we can do this switching and still ensure that none of the last priorities will be repeated. The algorithm for this is quite simple. We begin with any correct K + 1 sequence for B. We choose row X from this sequence as the one to be switched. In other words, we will always switch the element of B that would have been assigned priority X within B. This assures us that none of the last K priorities will be repeated. We note that we can arbitrarily permute the sequence in which these states occur. Thus, we permute them in such a way as to ensure that the element of B having priority X is the unique element that must be switched during each of the last K + 1 states. This completes the construction. Figure 14 gives an example for N = 7 and also provides a count of the number of switches for arbitrary N. Table 11 provides a summary gate count for the VEU.

[FIGURE 13 8 WAY PRIORITY SELECTOR: input bits, the states and outputs of the three device levels, and a possible sequencing of priority states with corresponding priorities.]

[FIGURE 14 NON-POWER-OF-2 PRIORITY SELECTOR: the N = 7 example, showing the inputs of circuits A and B, the vacant input, the final-level switch settings (R and S), the level states, and the sequence of states with priorities, together with the gate count summary below.]

Gate Count Summary. First we consider N a power of 2. The gate count for the basic selection logic is 8. Thus, for N = 2^K we have

    (N/2)(8+2) + (N/4)(8+4) + (N/8)(8+8) + ... + (8+2^K) = 8(N-1) + N - K.

For N not a power of 2, the gate count is at most the gate count for M, the smallest power of 2 greater than N, plus twice the gate count for N-1 switches (2^K = M). Thus, the total is

    <= 8(M-1) + M - K + 8(N-1) = 8(N+M-2) + M - K.
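Taking the printed formulas at face value (we have not re-derived them), the estimate can be computed mechanically. The function below is an illustration of the two cases, not part of the design; other tables in this chapter quote somewhat larger figures for complete controllers, which presumably include the state-sequencing logic.

    /* Gate estimate for an N-way priority selector, transcribing the
     * formulas printed above: 8(N-1) + N - K gates when N = 2^K, and
     * at most 8(N+M-2) + M - K otherwise, where M = 2^K is the
     * smallest power of two >= N (the 8(N-1) term covers the extra
     * input switches). */
    unsigned selector_gates(unsigned n)
    {
        unsigned m = 1, k = 0;
        while (m < n) { m <<= 1; k++; }      /* M = 2^K >= N */
        if (m == n)
            return 8 * (n - 1) + n - k;      /* N an exact power of 2 */
        return 8 * (n + m - 2) + m - k;      /* upper bound otherwise */
    }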
TABLE 11 VEU GATE COUNT
(Gate counts for the units in Figure 10)

Unit                               Source of Gate Estimate            Gate Count
Operand Buffer                     8x16x64 bits + addressing logic        40 000
Result Buffer                      8x16x64 bits + addressing logic        40 000
Instruction Queue                  Table 4                                 5 600
Internal Switch Queue              Table 4                                 5 600
Operand Buffer Access Controller   Figure 14                                 300
Result Buffer Access Controller    Figure 14                                 300
Sequencer                          Estimate based on function              1 000
Internal Switch Controller         Estimate based on function              1 000
Internal Switch                    64 bit words                              256
TOTAL                                                                     94 056

4.4.3 Vector Buffer

The Vector Buffer serves two purposes. It provides a source of operands that can be used by multiple instructions without accessing main memory. In addition, it provides space where intermediate results can be stored. These values are also stored within the VEUs, but the number allowed in a single VEU is quite small, probably 16. The detailed allocation of Vector Buffer storage is handled by the IUD. In this section we will provide a general functional discussion of this storage allocation and a detailed design of the Vector Buffer itself.

In Section 2, where we described OFFL, we noted that all instructions which perform operations have addresses referring to an intermediate buffer. All loads and stores to main memory are to locations in this virtual buffer. The physical buffer corresponding to this virtual buffer is distributed within the VEUs and the Vector Buffer. These virtual locations can be divided into two classes: those that were initially defined by an instruction to load from memory, and those that were defined as the result of some operation. All of the first class are assigned space in the Vector Buffer. All in the second class are initially assigned space within the VEU that is assigned the corresponding instruction. Elements of the second class will be transferred to the Vector Buffer if that is necessary to keep the storage space within the VEU from being exhausted.

We will refer to the Vector Buffer plus the storage space within the VEUs for results as the total vector buffer. Once a physical address within the total vector buffer has been allocated, it must remain allocated until the corresponding virtual address is reused. Once the virtual address is reused and all instructions with pending requests for the corresponding physical address have completed accessing it, that physical address may be reused. The contents of that physical location are then no longer accessible by any executing program. It would be possible to keep an associative memory that relates such buffer locations to main memory and thus in some instances possibly save some memory accesses. We do not include this option as part of our design, because it does not appear to us to provide much of a return for the logic that would be required, given our overall structure.
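The allocation rule just stated (a physical location may be reallocated only after its virtual address has been redefined and every pending access to it has completed) can be sketched in a few lines. The structure and field names below are ours, for illustration only.

    /* Illustrative bookkeeping for the total vector buffer.  An entry
     * is freed only when (a) its virtual address has been reused by a
     * later instruction and (b) its pending-access count has reached
     * zero; until then the physical location stays allocated. */
    typedef struct {
        int virt;        /* virtual buffer address this entry backs  */
        int use_count;   /* accesses still pending against the entry */
        int redefined;   /* the virtual address has been reused      */
        int in_use;      /* entry currently allocated                */
    } PhysEntry;

    /* An instruction has completed one access to entry e. */
    void access_done(PhysEntry *e)
    {
        if (--e->use_count == 0 && e->redefined)
            e->in_use = 0;               /* may now be reallocated */
    }

    /* The instruction stream has reused virtual address v; the old
     * physical copy is no longer accessible to any executing program. */
    void virt_redefined(PhysEntry table[], int n, int v)
    {
        for (int i = 0; i < n; i++)
            if (table[i].in_use && table[i].virt == v && !table[i].redefined) {
                table[i].redefined = 1;
                if (table[i].use_count == 0)
                    table[i].in_use = 0;
            }
    }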
Clearly, it is essential that the number of available virtual addresses not exceed the number of physical locations within the total vector buffer. Given the highly pipelined nature of the machine and the inevitable delays between the time when a virtual address is reused and the time that all pending instructions have completed access to the corresponding physical address, we require an excess of physical locations over logical or virtual locations. We will first discuss the number of virtual locations likely to be desirable and, on this basis, estimate a reasonable number of physical locations.

In determining the virtual buffer size, we will concentrate on the pipelining delay between memory and the VEUs. Considering other aspects, which are program dependent, makes the problem extremely complex. Further, our queued and pipelined structure is intended to ameliorate such problems across a broad spectrum of programs. Thus, it is reasonable to concentrate our attention on the buffer size required to keep the pipe flowing. The essential constraint in determining this will be the time for a transfer from a VEU to primary memory and back to the VEU. We need enough virtual memory space to ensure that a memory value that is reused within this time interval can be left in virtual storage. This leads us to the observation that the size of the virtual buffer is primarily dependent on the rate of reuse of memory locations within the specified delay time. This time cannot be computed exactly, but we can provide a rough conservative estimate. The overall delay is a sum of the following delays (the number in parentheses is the delay in minor clocks; the second number is the section in which the unit is discussed in detail):

1. Delay in the Vector Switch queue (4), 4.4.4
2. Delay in the Vector Switch (8), 4.4.4
3. Delay in the memory buffer queue (4), 4.5
4. Delay in the memory switch (11), 4.5
5. Delay in the memory page buffer (4), 4.5
6. Delay in memory store (8), 4.5

The total delay is twice the sum of the individual delays plus an additional trip through the Vector Switch, or 82 minor clocks. One VEU can generate 10 results in this time (one every 8 minor clocks). Thus, our 6 VEUs can generate roughly 60 results, and 64 would be a reasonable conservative size for the virtual vector buffer. This estimate would be adequate for a 100 percent reuse of memory values within the specified delay. Since the number of locations required for this assumption is reasonably small, and this case may be approximated over some program segments, it is reasonable to allow for this worst case.
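The sizing arithmetic can be replayed mechanically. The program below simply uses the figures quoted above (an 82 minor clock round trip, one result every 8 minor clocks per VEU, 6 VEUs) and rounds the total up to a power of two; it is a check on the estimate, not part of the design.

    /* Back-of-envelope check of the virtual buffer size estimate. */
    #include <stdio.h>

    int main(void)
    {
        const int round_trip = 82;        /* minor clocks, from the text   */
        const int clocks_per_result = 8;  /* one result per 8 minor clocks */
        const int veus = 6;

        int per_veu = round_trip / clocks_per_result;   /* about 10 */
        int total   = per_veu * veus;                   /* about 60 */

        int size = 1;                     /* round up to a power of two */
        while (size < total)
            size <<= 1;
        printf("virtual vector buffer size: %d\n", size);  /* 64 */
        return 0;
    }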
We now turn our attention to the physical buffer size required to achieve the specified virtual buffer size. The primary factor we have to consider here is the delay between the time a virtual memory location is reused in the instruction stream and the time the corresponding physical location can be reallocated. The start and end of this delay refer to the IUD. More specifically, it is the time beginning when the IUD notes that an instruction reuses an allocated virtual memory address and ending when the IUD is able to reallocate that address. The total time for this process is a function not only of the various pipe and queue delays, but also of the total number of pending requests to access the virtual memory location when it is reused. Instead of explicitly considering this general case, we will consider a particular case for which the delays are relatively easy to estimate and which should in most instances be the worst case. We will assume the following OFFL instruction sequence:

    LOAD A to T1
    LOAD B to T2
    COMPUTE T3 from A and B
    Instruction which uses T3
    Instruction which reallocates virtual location T3

In addition, we assume A and B are in the same memory page. Because accesses to virtual memory locations are buffered within each VEU, it is unlikely that these accesses will be delayed by a greater time than that required to fetch a single operand from memory. We now estimate the delays encountered by the above sequence. Again we give the time in minor clocks and the section in which the unit performing the function is described in detail.

1. IUD delay to complete processing memory instructions (8), 4.6
2. Delay in switching instruction into memory page queue (5), 4.5
3. Delay in memory page queue (4), 4.5
4. Delay in accessing memory (8), 4.5
5. Delay in memory switch (11), 4.5
6. Delay in Vector Switch queue (4), 4.4.4
7. Delay in Vector Switch (8), 4.4.4
8. Delay in Vector Switch queue (4), 4.4.4
9. Delay in Vector Switch (8), 4.4.4
10. Additional delay to access the second operand (8)
11. Delay in VEU queue (32), 4.4.2.4
12. Computation time (8), 4.4.2.4
13. Delay in Vector Switch queue (4), 4.4.4
14. Delay in Vector Switch (8), 4.4.4
15. Time to transmit information about the available virtual location to the VIDS (8), 4.6.6
16. Time to transmit information to the IUD (8), 4.6.3

These delays total 136 minor clocks, or 17 major clocks. During this period our 6 VEUs could generate up to 102 new results, each of which might require a new physical buffer location. Adding this figure to our earlier estimate of 64 different virtual addresses, we can see that a buffer size of 256 seems reasonable and leaves a substantial margin for error. The VEUs will contain 96 of these locations as their result buffers, and the remainder will be within the Vector Buffer. The internal design of the Vector Buffer will be functionally the same as the data buffer described in Section 4.4.2.4.

4.4.4 Vector Switch

The design of the Vector Switch requires that one solve two basic problems. First of all, one must determine the number of ports to and from the various units. Secondly, there is the problem of the internal structure of the switch. We begin with a discussion of the ports.

We will assume a machine with four binary VEUs and two unary VEUs. This could correspond to two routers and four vector/tree arithmetic units. This will require eight ports going to the binary VEUs and four ports coming from them. The unary units require two input ports and two output ports. A single port is required going to the Scalar Buffer. The remaining units requiring ports are the Vector Buffer and primary memory. The optimal size for the paths to these units depends on the ratio of primary memory references to total operand references. This figure varies across programs and within an individual program. In designing the ports to the Vector Buffer, we will assume two-thirds of all instructions access buffer locations already available. In designing the main memory ports, we will assume two-thirds of all instructions require a memory access. These assumptions should assure us that even in the worst cases the capacity of the Vector Switch will not slow the machine by more than a factor of one-third. Experimentation with an existing machine would undoubtedly provide the data for determining more cost effective distributions of ports. In the case of memory ports, our assumptions lead to a requirement of eight ports coming from memory and four ports going to memory. In the case of the Vector Buffer, things are a bit more complex. All operands that originate in primary memory are stored in the Vector Buffer. Those operands that were computed by earlier instructions may be in the VEU which computed them. Providing eight input and eight output ports for the Vector Buffer should roughly conform to our assumption. Table 12 summarizes these conclusions.
TABLE 12 VECTOR SWITCH PORTS

Unit             Input Ports (to unit)   Output Ports (from unit)
2 Unary VEUs              2                        2
4 Binary VEUs             8                        4
Scalar Buffer             1                        0
Memory                    4                        8
Vector Buffer             8                        8
TOTAL                    23                       22

We now turn our attention to the internal structure of the Vector Switch. It is a pipelined crossbar switch with queued instructions associated with each of its entry ports. Once a path in the switch has been reserved, it will remain active for 8 minor clocks and allow the transfer of an 8-word vector. Thus, there is a fairly long time available for searching the queues. This is important because requests to use the Vector Switch may be made long before the operand is available. Thus, in searching its queues, the Vector Switch must not only be sure that a path is available, but must also determine that the data is present. The presence of data is indicated by a single bit which is set whenever data is stored in any of the vector buffer locations. This bit is reset whenever the corresponding physical location is freed; i.e., when its use count is zero and the corresponding logical location has been reused. Note that the algorithms for keeping track of vector buffer storage are simpler than those for scalar buffer storage, because each different-valued vector has a different physical address, and there is no need for time indexes to keep track of them. On the other hand, the scalar switch does not have to test for the presence of data, since requests are never entered in its queues until the data is available.

Most of the logical design for the above mentioned functions is similar to work we have already done. However, the large number of "functionally identical" ports going to and from the vector buffer and memory does present us with a new allocation problem. The solution is to assume that these paths become available in a time-skewed fashion. There are in all cases either 4 or 8 paths, which are tied up for 8 minor clocks once they are reserved. Further, because they feed memories that must be allocated in a time-skewed fashion, some form of time skewing is required. Thus, we can assume that only one of these paths becomes available in each minor clock, and the standard priority hardware from Section 4.4.2.4 can be used. The same priority unit will be used to schedule all the paths in any equivalent set. This scheme will accommodate the problems associated with multiple input ports.

We have a related problem associated with multiple output ports. We are searching queues to drive the ports, and at least one minor clock is required for each test of a queue entry. Thus, unless every entry tested is ready to transfer, we cannot run at the maximum possible data rate. We solve this problem with multiple queues. One queue for every four paths will allow every other entry to be unavailable and still run at the maximum rate. Table 13 summarizes the hardware in the Vector Switch; a sketch of the queue-entry test follows the table.

TABLE 13 VECTOR SWITCH HARDWARE

I. QUEUES FOR ALL UNIT OUTPUT PORTS

Unit           Number of Units  Paths/Unit  Queues/Unit  Queue Size (a)  Gate Count/Queue (b)  Total Gates
Unary VEUs           2               1           1            16               6 696              13 392
Binary VEUs          4               1           1            16               6 696              26 784
Memory               1               4           1            64              23 784              23 784
Vector Buffer        1               8           2            64              23 784              47 568

II. CONFLICT RESOLUTION CIRCUITS FOR ALL UNIT INPUT PORTS

Unit           Number of Units  Paths/Unit  Requesting Units (c)  Gate Count (d)  Total Gates
Unary VEUs           2               1               8                 168             336
Binary VEUs          4               2               8                 168           1 344
Scalar Buffer        1               1               8                 168             168
Memory               1               8               8                 168             168
Vector Buffer        1               8               7                 160             160

III. CROSSBAR SWITCH

Switch Size: 21 x 22 x 80 bits
Gates: 147,840

TOTAL GATES: 262,544

a. See the previous section for the basis of the estimates.
b. See Section 4.3.2.2 for the queue gate count formula.
c. This is the total number of queues minus the queues for this unit.
d. See Section 4.4.2.4 for priority logic gate counts.
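The test a queue search must apply to each entry is simple to state. The sketch below uses our own field names and shows the two conditions that must hold before a transfer can start.

    /* A Vector Switch queue entry may start its transfer only when a
     * crossbar path is free at both ends and the operand's
     * data-present bit is set; requests may be queued long before
     * their data exists.  Field names are illustrative. */
    typedef struct {
        int src_port, dst_port;  /* ports the reserved path must join      */
        int data_present;        /* set when the vector is actually stored */
    } SwitchRequest;

    int can_start(const SwitchRequest *r,
                  const int src_busy[], const int dst_busy[])
    {
        return r->data_present
            && !src_busy[r->src_port]
            && !dst_busy[r->dst_port];
    }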
4.5 MAIN MEMORY

Logically, main memory is divided into pages. These pages are 8 words wide. A reasonable length would be 1K. Physically, each page is independently queue driven. Load and store systems of switches connect these pages to buffers in front of the memory ports in the Vector Switch. Other switches distribute queue entries, modes, and indexes to the control portion of each page. No indexing is allowed across pages. Figure 15 shows the overall structure of memory and the load system of switches.

[FIGURE 15 OVERALL STRUCTURE OF MEMORY AND THE LOAD SWITCHING NETWORK]

We now discuss the organization and operation of this system. It may frequently happen that for a short period of time it is desirable to access an individual memory page at the maximum rate possible for that page. On the other hand, the number of memory ports in the Vector Switch makes it pointless to be able to simultaneously access all pages at their maximum possible rate. There is a virtually unlimited number of ways a memory may be organized, considering these constraints. We have chosen one that seems reasonable and workable. We will consider a 1 megaword memory. It will be clear from the discussion how to generalize to other sizes. The Vector Switch has 8 input ports for communicating with memory (see Section 4.4.4). Thus, we want to design our switch to accommodate this data rate coming from anywhere in memory. We will assume the cycle time for main memory is 8 minor clocks. Thus, the data path leaving one memory page need only be one word wide if it is pipelined at one transfer every minor cycle. To exactly accommodate the Vector Switch data rate, we need to allow at most 8 pages transferring data at any given instant. The pages are grouped into blocks of 8. There are 16 of these blocks. We allow a maximum simultaneous transfer of up to 8 words from each of these groupings. All transfers are pipelined at the rate of one per minor clock.

A combination of crossbar switches and global control is used to referee conflicts. Before a page is allowed to initiate a transfer into this structure, it must have a path reserved all the way to the highest level of the structure. This is not to say that the path must be clear at the time the transfer begins, but only that it will become clear at each stage when required. We will now discuss the algorithms for allocating these paths. Since 8 minor clocks are required to complete a vector transfer, we need only allocate our various groups of 8 paths at the rate of one per minor clock. Up through the first level of crossbar switches, every page has its own path. However, the outputs of these paths are ganged together so that one output from each of the level 1 switches is an input to the same path in the level 2 switch. Thus, allocating paths consists of determining which memory pages may initiate transfers through the level 2 switch and transmitting to the switches the identity of the paths available. At a given clock, any number of paths in the level 2 switch may be available. However, to keep our allocation algorithm to a reasonable size, we will consider that at most one path becomes available during each minor clock.
At most we introduce brief transient delays by this restriction. There will be no loss in assuming that a given fixed path becomes available at a given clock. In other words, the global control attempts to allocate the level 2 paths on a round-robin basis. If there are no outstanding requests at a given clock, then the path assigned that time slot will remain vacant at least for the next 8 minor clocks. We must keep the number of pages requesting a path at a given clock to a number we can handle. This can be done by having local controllers limit the requests from each group of 8 pages to one. Thus, the global controller will have at most 8 requests to deal with in any clock. The controllers at both levels will use the access controller described in Section 4.4.2.4.

We can now describe the complete functioning of the memory in transferring data to the Vector Switch. The numbered paragraphs correspond to successive minor clocks; a skeleton of the arbitration in steps 1 through 4 appears below.

1. All memory pages with queue entries ready to initiate a memory access send a bit to the local controller.

2. Each local controller selects one of these pages for possible transfer and, if it had any requests, sends the global controller a request for a path.

3. The global controller selects one of the requests from the local controllers to honor and notifies that local controller.

4. The local controllers notify the winning page.

5. The transfer begins through the first level of the crossbar. (At clock 3 the global controller also notified the local controller which path(s) to use in the crossbar.)

6. The transfer from the lower level crossbar to the global crossbar begins.

7. The transfer from the global crossbar to the buffer begins.

The global crossbar works in a fundamentally different way from the local crossbars. It is successively transferring data to different modules in one of the buffers. Thus, with each minor clock, it changes its configuration.

Several remarks about this process are necessary. First, the entire unit must be pipelined so that each function is occurring at every clock. In practice, this is not particularly difficult; it only requires some buffering of information. The requests for transfer always come from the pages 4 clocks prior to the time they are actually able to begin the transfer. Thus, the only loss from the decision delays occurs when a new entry arrives in the intervening 4 clocks. The switches and memories may all operate all the time and at the maximum data rate the Vector Switch allows. With the transfer of the data, a queue entry is also transferred. This queue entry will be used to request use of the Vector Switch. The data paths must be slightly larger than one word to accommodate the queue entry, which would probably need to be divided into 8 parts.
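Steps 1 through 4 amount to a two-level arbitration. The skeleton below is ours: plain rotating scans stand in for the rotating priority circuits of Section 4.4.2.4, and the 16 groups of 8 pages follow the 128-page organization described earlier. Only the structure, not the exact selection logic, is intended.

    /* Two-level path arbitration: each local controller reduces its
     * group of 8 pages to one candidate; the global controller honors
     * one candidate per minor clock. */
    #define GROUPS 16
    #define PAGES_PER_GROUP 8

    /* requests[g][p] is nonzero if page p of group g wants a path.
     * On success the winner is returned through win_group/win_page. */
    int arbitrate(const int requests[GROUPS][PAGES_PER_GROUP],
                  unsigned clock, int *win_group, int *win_page)
    {
        int candidate[GROUPS];

        for (int g = 0; g < GROUPS; g++) {        /* local controllers */
            candidate[g] = -1;
            for (int i = 0; i < PAGES_PER_GROUP; i++) {
                int p = (int)((clock + i) % PAGES_PER_GROUP);  /* rotate */
                if (requests[g][p]) { candidate[g] = p; break; }
            }
        }
        for (int i = 0; i < GROUPS; i++) {        /* global controller */
            int g = (int)((clock + i) % GROUPS);
            if (candidate[g] >= 0) {
                *win_group = g;
                *win_page  = candidate[g];
                return 1;
            }
        }
        return 0;                          /* no requests this clock */
    }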
A request to store data always takes precedence over a request to load data. To initiate a store, a path must be reserved through a switching network similar to the one just discussed. In addition, it must be verified that there is room in the store buffer of the destination page. A buffer for stores within each page is desirable because of possible sequencing problems, which we will discuss later in this section when we describe the internal operation of a single memory page. Since the Vector Switch only has four output ports to memory, allowing one store to be initiated in each minor clock allows for transients of twice the maximum long-term data rate. Thus, we will provide a switching network to allow the transfer of one store request per minor clock to the specified page. If that page has space available, it sends back to the requesting unit a signal to proceed. This sequence is pipelined and requires two minor clocks. For stores, there can be no conflicts like those encountered with loads. Thus, we simply use the next slot in the highest level crossbar and ask to set up the level 1 crossbar that services our destination page.

In addition to the load and store switch paths just discussed, we need an instruction switch to transfer load queue entries to the appropriate page. Since only one instruction is required for each vector load, this network can be similar to, but less complex than, the load and store switching networks. This same switch can be used for the transfer of scalar indexes and scalars used for mode control.

We now come to the internal structure of the individual pages. Figure 16 shows this structure. Entries from the instruction switch may be either load queue entries, modes, or scalar indexes and are switched either into a scalar buffer or into an instruction queue. Entries may arrive from this switch at the rate of one per minor clock. Since a vector access can only occur once every major clock, this will be a more than adequate data rate. Vectors arriving from the store switch may be transferred either to the vector index buffer or to the store buffer, depending on their intended use. The store buffer allows the load switch and the store switch to be transferring data with the same memory page at the same time. The load buffer allows memory to be synchronized with the load switch. The control processes the queued instructions and referees possible conflicts between the load and store switches.

[FIGURE 16 INTERNAL STRUCTURE OF MEMORY PAGE: the instruction switch feeds a scalar index and mode buffer and the memory instruction queue; the store switch feeds the vector index buffer and the store buffer; memory feeds the load buffer toward the load switch; the control exchanges information with all internal units and with the switching networks.]

We will now outline the operation of the memory page control. The queue which contains both load and store instructions is continuously interrogated to see if an instruction can proceed. The conditions which must be met are as follows (a sketch of this test appears below):

1. All required indexes and modes are present.

2. No earlier instruction which conflicts with this instruction is still in the queues.

3. In the case of stores, the required data is present.

Condition 2 requires further explanation. Clearly, no load or store can proceed if there is an earlier store with unknown indexing and unknown or overlapping modes. Similarly, no store can proceed if there is an earlier load with unknown indexes and unknown or overlapping modes. These are the weakest possible conditions for the existence of conflicts, and it would be possible to test for these specific conditions among the first few queue entries. We will base our gate estimates on this capability, although a somewhat weaker condition might prove more practical.

Space in the indexing buffers is reserved by the IUD in the same manner as space within the Vector Buffer. Thus, these buffers have to notify the IUD when the values they contain have been used and the space is free. One can estimate sizes for these buffers by an analysis like that in Section 4.4.3.
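The dispatch test can be summarized as follows. This is a conservative sketch of conditions 1 through 3 with invented field names: any unknown index or mode information is treated as a potential overlap, and a real design would also compare the known index and mode patterns of the leading queue entries.

    /* Conservative dispatch test for a memory page queue. */
    typedef struct {
        int is_store;
        int indexes_known;   /* condition 1: indexes present */
        int modes_known;     /* condition 1: modes present   */
        int data_present;    /* condition 3: stores only     */
    } PageOp;

    /* Earlier loads conflict only with later stores; earlier stores
     * conflict with everything behind them unless the addressing on
     * both sides is fully known (and, in a real design, disjoint). */
    static int may_conflict(const PageOp *early, const PageOp *late)
    {
        if (!early->is_store && !late->is_store)
            return 0;                     /* load after load is safe */
        return !early->indexes_known || !early->modes_known
            || !late->indexes_known  || !late->modes_known;
    }

    int can_issue(const PageOp q[], int i)   /* may entry q[i] proceed? */
    {
        if (!q[i].indexes_known || !q[i].modes_known)
            return 0;                                    /* condition 1 */
        if (q[i].is_store && !q[i].data_present)
            return 0;                                    /* condition 3 */
        for (int j = 0; j < i; j++)                      /* condition 2 */
            if (may_conflict(&q[j], &q[i]))
                return 0;
        return 1;
    }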
Determining a size for the store buffer is more complex, since it is dependent on how much instruction reordering is done by the queue and control. We will estimate a size of 8 as being reasonably small and probably larger than will usually be needed. Table 14 provides a summary of all the hardware described in this section.

There are two capabilities that we do not provide in this design that might be of considerable practical value. One would be the ability to provide index arithmetic within each memory page. Loops might often involve performing simple operations on the same base index set and within the same memory page. The second capability is to provide memory-to-memory and memory-to-index-register transfers without going through the Vector Buffer and Vector Switch. Both of these capabilities could be provided without major increases in the logical complexity of the system and should undoubtedly be considered.

TABLE 14 MEMORY LOGIC SUMMARY

First we list the gate counts for the units in a memory page (Figure 16).

Unit                  Source of Gate Estimate        Gate Count
Instruction Queue     Table 4                             5 500
Control               Estimate based on function          1 000
Switch                1x2, 64 bits                          384
Scalar Buffer         16 words by 10 bits                   640
Vector Index Buffer   8x8 words by 10 bits                2 560
Store Buffer          8 words by 64 bits                  2 048
Load Buffer           8 words by 64 bits                  2 048

Total for memory page excluding memory:                  14 664
128 pages are required for a million words:           1 876 992

Now we compute accessing network gate counts, as illustrated by Figure 15.

8x8 Switch            72 bit words                       18 432
Local Controller      Estimate based on function          2 000

Load Network (17 switches and 16 controllers):          345 344
Store Network (one controller and 17 switches):         315 344
I/O Network (16 switches and one controller):           296 912

Total Exclusive of Memory:                            2 834 592

4.6 INSTRUCTION UNIT DISPATCHER

4.6.1 Introduction

The Instruction Unit Dispatcher (IUD) has the responsibility of mapping OFFL instructions from up to four MIDs onto some collection of execution units. It must ensure that the correct operands for an instruction will meet in the unit assigned that instruction. The principal problem in designing this unit is maintaining a high instruction rate while providing an "intelligent" scheduling algorithm. The scheduling algorithm must, as a minimum, assure that no blockages result and maintain the correct logical sequence of operations.

In describing the IUD, we will first outline its functional structure, ignoring all problems associated with maintaining the necessary high data rate. We will then determine what degree of pipelining and parallelism will be necessary. We will discuss in more detail the various operations that the IUD performs. In this discussion we will bring in any algorithm modifications necessitated by the combination of pipeline and parallel processing required. We will then provide a detailed logical design of the IUD, complete with gate counts.

4.6.2 IUD Functional Structure

The IUD's operation is partitioned into several tasks. Three broad categories are: work on operands, work on results, and construction of queue entries. The three types of operands are vectors, scalars, and main memory vectors. For main memory vectors, the IUD merely passes on the specified address to the correct memory box. For scalar instructions, a time index is necessary to uniquely identify the operand. An associative memory table is accessed to obtain this time index. The use made of this time index is discussed in Section 4.3. A logical vector operand must be mapped onto the correct physical vector register. Another associative memory is provided for this function.
Scalar results are used to update the scalar status table mentioned above. Similarly, vector results are used to update the vector status table which maps physical to logical registers. Both operands and results, as well as the operation fields, are used by the IUD in generating various queue entries. Where execution units are not unique, the IUD must decide which to use. In the case of vector instructions, it must reserve space in the VEU and set up queue entries to route data as required. Finally, it must set up the queue entry for the execution of the operation itself.

4.6.2.1 Data Rate Analysis

The instruction rate the IUD must handle is a function of the processing rate of the various execution units. A reasonable average is one operation per major clock. For this computation, we will assume all operands originate in memory and are returned to memory. This is an extremely conservative assumption. It will be somewhat balanced by neglecting memory-to-memory instructions and transfers between scalar memory and vector memory. The overall assumption is still somewhat conservative. Each vector operation counts as 4 instructions for a binary operation (3 memory instructions plus the actual operation) and 3 instructions for a unary operation. Each scalar operation counts as one instruction. Reasonable values are 4 vector binary units plus 2 vector unary units. In addition, 6 scalar units is a likely value. Thus, we should be able to process roughly 28 instructions through the IUD in one major clock. This comes to roughly 4 instructions per minor clock with full pipelining.

4.6.2.2 Memory Operands and Results

Sequencing of instructions referring to memory is controlled by the individual memory boxes. The instructions need only be passed on to the appropriate memory box in the correct sequence. The IUD need not perform additional processing on these operands.

4.6.2.3 Scalar Operands and Results

In order to ensure proper sequencing of scalar instructions, operands must have both a time and a place index. These designate a particular location in the scalar buffer and a particular "time index" which uniquely identifies a store to that location. To ensure that no operand will be over-written while it is still required by a queued instruction, the SEU must be provided with a count of the number of pending requests for a given instruction result. The scalar status table provides the information necessary to construct the time index and generate the operand use count. The range of the time index was discussed in Section 4.3; the values 128 and 256 were determined as reasonable options in that section.

The status table is an associative memory containing entries for all recent stores to scalar memory that may be ambiguous. It contains a time index and a scalar memory location for each entry. In addition, it contains a disable bit and a bit to indicate if the address refers to a vector stored across the scalar memory. Whenever an instruction with a result to scalar memory is processed and there is another store to that location in the queues, a new entry is made in an available location in the status table. In addition, a previous entry using that same location has its disable bit set and its location recorded in another table as available. An illustrative layout for this table follows.
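The sketch below shows one possible layout and the associative search; the field widths are suggested by the figures above, the names are ours, and the hardware of course examines all entries in parallel rather than looping.

    /* Illustrative scalar status table entry and associative lookup. */
    #include <stdint.h>

    typedef struct {
        uint16_t location;    /* scalar buffer address                  */
        uint16_t time_index;  /* uniquely identifies a store (range    */
                              /* 128 or 256, per Section 4.3)          */
        uint8_t  disabled;    /* superseded by a newer store           */
        uint8_t  is_vector;   /* a vector stored across scalar memory  */
        uint8_t  valid;
    } ScalarStatusEntry;

    /* Return the index of the enabled entry for location loc, or -1
     * if no ambiguous store is outstanding for that location. */
    int status_lookup(const ScalarStatusEntry t[], int n, uint16_t loc)
    {
        for (int i = 0; i < n; i++)   /* done in parallel in hardware */
            if (t[i].valid && !t[i].disabled && t[i].location == loc)
                return i;
        return -1;
    }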
We still have the problem that some of the operands may refer to results being processed in parallel with these operands. We take care of this case with special circuitry containing the scalar results being processed. This special circuitry finds the entry with the latest time index earlier than the time index for a particular operand. This requires a comparison tree, but since only a small number of results are processed in parallel, this tree is quite small. Additional circuitry selects the time index from either this comparison tree or from the full table search when the tree finds no match. The full table update for the results being processed occurs in the same clock. The search for current operands will not see these entries until one clock later.

Finally, the scalar status table must be kept from becoming full. Thus, we want to remove entries from it as quickly as possible. As soon as any instruction which causes an entry to be made in this table has completed, the associated entry in the table can be freed. Thus, there is an additional bookkeeping table containing instruction indices and associated scalar status table locations. When notification comes that a scalar instruction has completed, this table is used to determine if a scalar status table entry may be freed. Since a scalar status table location may be freed before the associated instruction is complete, this bookkeeping table needs to be updated whenever this occurs.

4.6.2.4 Vector Operands and Results

In the case of the vector buffer, the status table must map every physical register in use onto the corresponding logical register. The number of physical vector registers is much smaller than the number of physical scalar registers. Thus, 128 or 256 is a reasonable size for this table, corresponding to the size of the vector buffer discussed in Section 4.4.3. Like the SEU, the VEU must be provided with a count of the number of accesses to a particular set of data values. The vector status table must contain two pieces of information to perform the functions described above: the physical location of a logical register and the logical register identification.

We will first discuss the use of this table, ignoring the fact that more than one instruction is being processed in parallel. When a vector operand is encountered, this table is accessed as an associative memory to find the physical register corresponding to the designated logical register. If there is no corresponding entry, this is an error condition which should cause a program interrupt. Since the vector buffer is distributed among the VEUs as well as being in a central Vector Buffer, the physical location identifies the unit as well as the location within the unit. This information will be used in selecting the VEU when there is more than one which may be used. A vector result causes the associated identification of that logical register to be altered to correspond to the new physical register. In addition, it causes a signal to be sent to the unit containing the register, indicating that the physical register may be freed once all pending requests on it have cleared. In turn, when all requests have cleared, this unit notifies the IUD.

The problems associated with parallel processing of instructions are more complex than those encountered in the scalar case. This is a consequence of the fact that the physical location of a logical result is not known until processing of that instruction is nearly complete. To accommodate this situation, a special bit will signify a not-yet-known address. In addition, the time index of the corresponding result will be provided in place of the physical address.
Special logic to fill in this information will be described in Section 4.6.3.7. This same logic provides the information to be added to the vector status table when it becomes available. In addition, we need a comparison tree similar to that for the scalar status table, discussed in the previous section.

4.6.2.5 Scalar EU Assignment

There may be several SEUs that are functionally equivalent. We must provide a method of selecting which SEU will be used for a given instruction. Since the SEUs, unlike the VEUs, do not contain any operands (see the following section), the only consideration that seems reasonable to take into account is the size of the various queues. Thus, logic will be provided to keep track of where the next n scalar instructions should be assigned for each set of equivalent SEUs, where n is the maximum number of instructions that can be processed in any clock. The logic will update this information every clock, based on which SEUs were assigned in the previous clock and on information from the SEUs on instructions completed.

4.6.2.6 Vector EU Assignment

Vector operands may reside in a specific vector execution unit, and there exists logic to use these as operands. In order to lessen the load on the vector switch as well as to minimize transfer delays, we want to encourage using these features. The question of what constitutes an optimal scheduling scheme is extraordinarily complex. In addition, we have severe constraints imposed by the required rapid and comparatively cheap hardware implementation. We will propose a scheme that seems workable and reasonable.

We begin our discussion by considering different relations between the number of active programs (P_a) and the number of equivalent EUs (R_e). When possible, we will assign specific EUs to specific programs and try to keep all computation within assigned EUs. When queue size discrepancies become too large, we will start distributing the operands.

We first consider cases where R_e > P_a. Each program may have its own resource or resources; i.e., if R_e >= nP_a with n >= 1, then each program has n resources allocated to it. First we consider the case of a single resource assigned to each program. There are three pairs of threshold values that determine which EU it chooses. These threshold values all represent differences in queue sizes. The first two are limited to queues of EUs assigned to the particular program:

θ_20I   Size of the queue containing both operands minus the size of the smallest queue.
θ_10I   Size of the smallest queue containing either operand minus the size of the smallest queue.

The next two values represent the difference of a queue size assigned to the program minus a queue size not assigned to the program:

θ_20IE  Size of the queue containing both operands minus the size of the smallest queue.
θ_10IE  Size of the smallest queue containing one operand minus the size of the smallest queue.

The final two thresholds refer to queues not assigned to the program:

θ_20E   Size of the queue containing both operands minus the size of the smallest queue.
θ_10E   Size of the smallest queue containing either operand minus the size of the smallest queue.

The threshold values can be assigned dynamically or be hard-wired constants. Experimentation should be conducted to determine optimal values. Associated at any instant in time with each threshold value is the actual value of the corresponding condition. We will label these as θA with the same subscript.
Thus, θA_10IE is the actual difference of the smallest queue assigned to this program branch which contains a single operand minus the smallest queue size not assigned to this program. Both queues are restricted to those capable of performing the operation which we are now assigning to a queue. We will define Q(θ), where θ is a threshold parameter, to be the index of the first queue associated with this parameter; i.e., Q(θA_20IE) is the queue with two operands. Whenever the actual value for a given threshold is not defined, it will behave as infinity.

Finally, we will consider the six threshold values as being ordered in the sequence they were defined, and will abbreviate the threshold and actual values as θ_i and θA_i, i = 0, 1, ..., 5. Thus, θ_0 is the same as θ_20I. The algorithm for queue selection in the case where R_e > P_a is to select Q(θA_i) for the least i such that

    θA_i < θ_i.

If no i satisfies this condition, then the smallest queue is chosen. A transcription of this rule appears below. Two observations seem important here. First, we probably have more threshold values than are useful, and experiments would probably show us how to limit these. Second, other threshold values are possibly meaningful. One example would be a threshold referring to the queue size difference of a queue assigned to the program containing one operand minus the queue size of a queue not assigned to the program containing one operand. As stated previously, our algorithm is not necessarily optimal, only practical and reasonable.

In the case of R_e < P_a, there will be no assignment of EUs to programs. In this case only the external thresholds θ_20E and θ_10E will be meaningful. Otherwise, the same algorithm applies.
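The single-resource rule transcribes almost directly. In the sketch below the arrays and the UNDEFINED convention are ours; the ordered scan and the fall-back to the smallest queue follow the text.

    /* Queue selection by ordered thresholds (single resource case). */
    #include <limits.h>

    #define NTHRESH 6
    #define UNDEFINED INT_MAX  /* an undefined actual value acts as infinity */

    /* theta[i]  - threshold values (hard-wired or set dynamically)
     * actual[i] - current queue-size differences thetaA_i
     * queue[i]  - Q(thetaA_i), the queue associated with actual[i]
     * smallest  - index of the smallest eligible queue               */
    int choose_queue(const int theta[NTHRESH], const int actual[NTHRESH],
                     const int queue[NTHRESH], int smallest)
    {
        for (int i = 0; i < NTHRESH; i++)
            if (actual[i] != UNDEFINED && actual[i] < theta[i])
                return queue[i];
        return smallest;       /* no threshold satisfied */
    }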
Then the remainder of the above operations can be performed, in some cases, using this information about results be- cause it corresponds to one of the operands. 135 4.6.2.3 Generating Instructions for the EUs and Memory At this point we have discussed how all the information necessary for the EU instructions is generated. Thus the only remaining problems are to assemble the information together and transfer it to the appropriate EU. To assemble the instruction, we will provide addresses with each of its component parts. Using these addresses to assemble the instructions poses no special problems. In designing the logic for transmitting the instruc- tions to the EUs, we wish to minimize the size of the data paths. In any given minor clock all the instructions emerging may be scalar instructions (or any other single type). Since no single unit can process instructions at this rate, it makes sense to buffer the output to the verious units. 4.6.3 Logical Structure In this section we will provide a logical design for the functions discussed in Section 4.6.2. We will do this in sufficient detail to pro- vide realistic estimates of the gate counts for various individual compo- nents and for the entire unit. We will first discuss the overall structure of the IUD pipeline and then go on to describe each of the stages and units in detail . 4.6.3.1 IUD Pipeline Structure Before we begin our overall analysis of the pipeline structure, we need to provide more details about OFFL. Thus in this section we will de- scribe the OFFL instruction format and syntax. We will use this information to analyze the pipeline requirements. Then we will present a time-versus- function diagram of the IUD's operation. In this section we will describe instructions and design in considerable detail. We do this, not because we believe the design is optimal or that there is anything sacred about the 136 particular design decisions we have made, but because our approach differs radically from conventional control unit design. Thus, details are worked out both as a discipline onto ourselves to ensure that we are not over- looking any devastating problem, and as a means of convincing the reader that our overall approach is workable. 4.6.3.1.1 OFFL Instruction Format OFFL instructions consist of a variable number of 16-bit bytes. The first of these specifies the operation to be performed. It has the follow- ing format: Fields Meaning 0:1 1. This byte is 1 only for the operator portion of an instruction. 1:1 indicates an SEU instruction. 1 indicates a VEU instruction. (All memory instructions are to or from the VEU.) 2:2 Contains the program number. (Up to four different programs can be executing simultaneously with instruc- tion level multiprogramming.) 4:4 EU address. (Specifies the type of EU that this instruc- tion requires.) 9:8 Information to be interpreted by the specified EU. (This field is used by the EU to determine what operation to perform. If it is identically 0, then the next byte con- tains information to be passed on to the EU.) 137 The EU operation field or literal information for the EU can be extended over up to four additional bytes. Bit of each of these bytes contains a 1. The remainder of the byte contains information for the EU. The following is the format of all possible OFFL operands and results Fields Meaning 0:1 1:1 Indicates that this is a result and also signifies the physical end of the instruction. 
In the case of a memory result, the physical end of the instruction occurs not at this byte, but at the next. 2:1 indicates a scalar. 1 indicates a vector. 3:1 Vector only. indicates a logical vector buffer address; 1 indicates a main memory address. 3:13 Scalar only. Physical scalar buffer address. 4:12 Vector only. In the case of a memory address (bit 2), this field contains the physical location within the physical memory box specified by the next byte. In the case of a logical vector buffer address (bit 2), this field contains that address. As mentioned in the above table, memory results are two bytes long. The second byte, except for bit which is 0, is completely used to specify a memory box. This completes the specification of the format of OFFL. 4.6.3.1.2 OFFL Syntax In this section we specify the syntax of OFFL instructions in BNF using the metalanguage of the report on ALGOL 60. We will describe the 138 semantics of the terminal and non-terminal symbols used in terms of the previous section and then present the brief formal syntax. Note that if there are two scalar operands, the second will be a mode pattern. If there is only one scalar operand, then the operation will specify whether it is a mode or an index. Table 17 is the syntax of OFFL. 4.6.3.1.3 Analysis of Pipeline Requirements In Section 4.6.2.1 we determined that the IUD must allow for the emergence of instructions at a rate of approximately four instructions per minor clock. In this section we will analyze how that requirement, in con- junction with the specification of OFFL we have described in the previous two sections, translates into physical requirements of the IUD structure. Because of the extreme variability of instruction length and complexity, it would be unnecessarily costly to allow the IUD to be able to process any possible combination of instructions at an emergence rate of four per clock. As a first step in determining a reasonable design, we will enumerate rele- vant constraints on OFFL instructions. These constraints can all be easily derived from the previous two sections. They are listed in Table 18. 139 TABLE 17 OFFL SYNTAX Symbol Instruction Memory Not Memory NM Operator NM Operand NM Result M Operator M Operand V Result V Operand M Result VS Operand S Operand S Result OM Address Meaning A complete OFFL instruction from operation to result. A complete OFFL instruction which makes reference to main memory. A complete OFFL instruction which does not make reference to main memory. An operation of one to five bytes as specified in the previous section that does not refer to main memory. All the operands in a complete non-memory OFFL instruction. A vector buffer or scalar buffer result. An operation of one to five bytes as specified in the pre- vious section that does refer to main memory. An operand which specifies a main memory address, including indexing and mode specification. A result which specifies a logical vector buffer address. An operand which specifies a logical vector buffer address. A result which specifies a main memory address, including indexing and mode specification. An operand giving a vector buffer or scalar buffer address. An operand specifying a physical scalar buffer address. A result specifying a physical scalar buffer address. A two-byte operand which specifies a memory box and an address within the box. Symbol RM Address 140 TABLE 17 OFFL SYNTAX (cont.) Meaning A two-byte result which specifies a memory box and an address within the box. 
Index and Mode A string containing or 1 vector operands and to 2 scalar operands which serve as indexes and modes relative to the physical memory address specified by the associated memory address. All indexing is limited to addresses within a single memory box. The OFFL syntax follows. Instruction : : = Memory | Not Memory ; Not Memory ::= NM Operator NM Operand NM Result ; Memory. ::= M Operator M Operand V ResuU | M Operator V Operand M Result; NM Operand ::= VS Operand | VS Operand VS Operand ; VS Operand : : = V Operand | S Operand ; NM Result : : = V Result | S Result ; M Operand : : = Index and Mode OM Address ; M Result : : = Index and Mode RM Address ; Index and Mode ::= S Operand | V Operand | V Operand | S Operand | V Operand S Operand S Operand ; 141 TABLE 18 OFFL INSTRUCTION CONSTRAINTS Instruction Parameter Minimum Maximum Instruction length in bytes Number of scalar operands Number of vector operands Number of memory operands Number of results, any type Operator length in bytes Memory operand or result length in bytes Total length of all operands for a non- memory instruction in bytes 3 12 2 2 1 1 1 1 5 2 5 142 The physical paths between the MIDs and the IUD must be of some fixed size. At this point we need to translate an emergence rate of four instruc- tions per clock into a size for these paths. Since the path size determined will be a fundamental physical limit to the IUD's processing rate, we will design the IUD to handle the full bandwidth of these paths. The tradeoff in deciding on this parameter is the possibility of the IUD slowing down the EUs versus the cost of the IUD. Because of its pipelined parallel nature and the sophisticated functions it must perform as outlined in Section 4.6.2, the cost of the IUD rises dramatically with increased bandwidth. This will become even more evident in the remainder of Section 4.6.3 as we do detailed logical design. The emergence rate of instructions required is actually 3 1/2, not 4, and the assumptions that gave rise to that figure were some- what conservative. (See Section 4.6.2.1 for details.) Thus, a bandwidth of 12 bytes per clock total coming into the IUD seems to be a reasonable figure that will allow for little or no delay in the EUs. There are addi- tional reasons for choosing the precise figure 12. These will become appa- rent in the remainder of this and the following section. The IUD is required to perform various types of operations in parallel as outlined in Section 4.6.2. We will now determine what degree of paral- lelism will be required. As a first step, we will determine how many of the various types of operations, operands, and results may occur in segments of the instruction stream of different lengths. Table 19 gives maximum counts versus instruction stream length and can be easily derived from the instruc- tion constraints listed above. This table gives the maximum number of instruction components that may occur in a length of instruction stream segment. Note that 12 is a particularly good number since at 13 most of the counts go up by 1 . This table will also be used in the next section. 
TABLE 19  INSTRUCTION STREAM CONSTRAINTS

Instruction                      Memory Operands and   Vector and Scalar   Vector and Scalar
Stream         Operators         Results Combined      Operands*           Results*
Length       Number   Size       Number   Size         Number   Size       Number   Size
  1             1       1           1       1             1       1           1       1
  2             1       2           1       2             2       2           1       1
  3             1       3           2       2             2       2           1       1
  4             2       4           2       3             2       2           2       2
  5             2       5           2       4             3       3           2       2
  6             2       5           2       4             4       4           2       2
  7             3       5           3       4             4       4           3       3
  8             3       6           3       5             4       4           3       3
  9             3       7           3       6             5       5           3       3
 10             4       8           3       6             6       6           4       4
 11             4       9           4       6             6       6           4       4
 12             4      10           4       7             6       6           4       4
 13             5      10           4       8             7       7           5       5
 14             5      10           4       8             8       8           5       5
 15             5      11           5       8             8       8           5       5
 16             6      12           5       9             8       8           6       6

*These columns refer to either of the specified types and not to both types combined.

4.6.3.1.4 Switching Instruction Components into the Pipe

The first stages of the IUD pipe will consist of units designed to process each of the various instruction components. Since these components may occur at any point in each 12-byte segment, we need some means of transmitting the various components to the appropriate type of processors. At the same time we must maintain the identity and sequence of the original instructions. We will number the instructions 0 through 3 and assign this index to each instruction component. Either or both instructions 0 and 3 may be incomplete. Thus, in the later stages of the pipe we must take this into account. Table 21 gives the logic equations for assigning instruction numbers and an associated gate count. The component pipes that our 12 instruction bytes may need to be switched into are listed below.

TABLE 20  PIPE COMPONENTS

Component                     Number of Units   Size of Unit in Bytes
Operator                            4                    5
Memory Operands and Results         4                    2
Vector Operand                      6                    1
Vector Result                       4                    1
Scalar Operand                      6                    1
Scalar Result                       4                    1

TABLE 21  LOGIC EQUATIONS AND GATE COUNTS FOR ASSIGNING INSTRUCTION NUMBERS

The symbols A-L represent a logical input as to whether the corresponding instruction byte is the first byte of a new instruction. This can be determined by bit 0 of the preceding byte being 0 and bit 1 of this byte being 1. A prime denotes the complement of a variable. Q, R, and S are the following logical functions:

Q = BEHK        R = I ∨ J ∨ K        S = I ∨ J

X, Y, T, and W are intermediate functions which are used in the other equations.

Byte   High-order Bit of Instruction Number   Low-order Bit of Instruction Number      Number of Gates
A      0                                      0                                              0
B      0                                      B                                              1
C      0                                      B ∨ C                                          2
D      0                                      B ∨ C ∨ D                                      3
E      X = BE                                 Y = (BE ∨ B'C'D'E')'                           7
F      X ∨ YF                                 YF' ∨ Y'F                                      7
G      X ∨ YF ∨ YG                            YF'G' ∨ Y'F ∨ Y'G                             12
H      T = X ∨ YF ∨ YG ∨ YH                   W = YF'G'H' ∨ Y'F ∨ Y'G ∨ Y'H                 18
I      T ∨ WI                                 W'I ∨ WI'                                      7
J      T ∨ WI ∨ WJ                            WI'J' ∨ W'I ∨ W'J                             12
K      Q'T ∨ Q'WI ∨ Q'WJ ∨ Q'WK               WI'J'K' ∨ W'I ∨ W'J ∨ W'Q'K                   23
L      T'W'IL ∨ T'W(R ∨ L) ∨ TW'(IL)' ∨ TWK'L'   W'RL' ∨ W'R'L ∨ WR'L' ∨ WIL               18
TOTAL                                                                                      110

With fan out of 20 and fan in of 4, the entire decoding can be implemented in 7 levels of logic. The gate counts in this case would be 110, plus 9 for Q, R, and S, plus 24 to generate the initial values A-L. As a practical matter, at least one additional level of logic and a slightly higher gate count are likely to be required to avoid the large fan out. It may be possible to implement the decoder in one minor clock with under 200 gates, but two minor clocks may be required.

TABLE 22  LOGIC EQUATIONS FOR THE CONTROL OF THE IUD FRONT END SWITCHES

Variables A - F represent logical values associated with byte positions. They are true if the corresponding byte is of the type this switch fetches. Variables xG - xK (where x is one of A - F) represent enables of switch paths. In particular, AG equal to true enables the path from byte position A to the first pipe entry of the type the switch is for. Similarly, BH enables the path from the second byte position to the second pipe entry.
The designs will not be for fully general switches. They will take advantage of restrictions from Table 19. A prime denotes the complement of a variable.

I. Control for the 4 x 2 switch used by vector and scalar operands.

AG = A            DH = D
BG = A'B          CH = D'C
CG = A'B'CD       BH = D'C'AB

(Note: Paths AH and DG are not required, saving both switch and control logic.)

II. Control for the 3 x 1 switch used by vector and scalar results.

AG = A
BG = B
CG = C

III. Control for the 6 x 4 selector used for memory operands and results combined.

AG = A                     FJ = F
BG = A'B                   EJ = F'E
CG = A'B'C                 DJ = F'E'D
BH = AB                    EI = FE
CH = A'BC ∨ AB'C           DI = F'ED ∨ FE'D

IV. Control for the 6 x 5 selector used for operators.

AG = A                     FK = F
BG = A'B                   EK = F'E
BH = AB                    EJ = FE
CH = A'BC ∨ AB'C           DJ = F'ED ∨ FE'D
CI = ABC                   DI = FED

A full cross-bar switch for this transfer would need to be 12 bytes by 56 bytes. The logic to control this switch would be particularly cumbersome. Referring to the list of components versus instruction length in the previous section, we see that there is a possibility of partitioning the 12 instruction bytes into 4 groups of 3 each in the case of vector and scalar results. In the case of vector and scalar operands, 3 partitions of length 4 make sense. In the case of memory operands or results, and in the case of operators, two partitions of length 6 will work. All these partitions have the advantage of not requiring any additional processing units, while at the same time reducing the complexity of the switch and its control.

With these smaller partitions, we can design very simple and fast control logic for the switches. The basic idea of the design is to start at both ends and work towards the middle. Thus, if we are checking 4 bytes, 2 of which may be of the same type, then the first output path accessed by this switch will, in a sense, be assigned to the first two bytes, and the other output path to the remaining bytes. The logic will be symmetric around the middle. Detailed logical design of control for all the partitions required is contained in Table 22.

At this point we can complete the design of the front end of the IUD pipe. There is a 12-byte wide data path into the IUD which inputs a new segment of the instruction stream. There are registers to receive this input and a second set of registers to serve as a buffer if the IUD becomes blocked. It takes longer than 1 minor clock to notify the MIDs that there is a block.

[FIGURE 17  IUD FRONT END: the 12-byte path from the MIDs feeds the instruction index generator, a blockage buffer, and the switch controllers, which drive the partitioned switches (two 6 x 4 memory operand/result switches, two 6 x 5 operator switches, three 4 x 2 vector operand and three 4 x 2 scalar operand switches, and four 3 x 1 vector result and four 3 x 1 scalar result switches) into the component pipes.]
TABLE 23  GATE COUNT FOR IUD FRONT END

Unit                          How Derived                                      Gate Count/Unit   Units   Total
16 bit x 12 byte buffer       4 gates/bit                                            768           2      1536
19 bit x 12 byte buffer       4 gates/bit                                            912           2      1824
Instruction index generator   Table 21                                               200           1       200

Switch Controllers (SK) and Switches (S)
4 x 2 S                       (no. of bits)(lines in)(lines out)(4)                  608           6      3648
3 x 1 S                       same as above                                          228           8      1824
6 x 4 S                       same as above                                         1536           2      3072
6 x 5 S                       same as above                                         2280           2      4560
4 x 2 SK                      Table 22 (number of variables in equations times 2)     28           6       168
3 x 1 SK                      same as above                                            6           8        48
6 x 4 SK                      same as above                                           56           2       112
6 x 5 SK                      same as above                                           56           2       112
TOTAL                                                                                                   17104

However, all but one minor clock of this delay is buffered by the path transducer discussed in Section 3.1.3. The following functions are performed by the IUD front end.

(1) Instruction indexes are generated and produced for each instruction byte.
(2) One clock later, control signals for the various partitioned switches are generated.
(3) One clock later, the instructions meet their indexes.
(4) One clock later, the instruction components are switched into the various pipes.

Figure 17 gives the overall structure of the IUD front end. Table 23 provides a gate count for the IUD front end.

4.6.3.1.5 Global Structure of IUD Pipe

We have just seen how the stream of IUD instructions is broken up into bytes of various types by the IUD front end. This breaking up is done in such a way that the identity of the complete instruction can be recovered later. In this section we will describe the overall structure and timing of the IUD pipe as it performs the functions outlined in Section 4.6.2. In the remainder of Section 4.6.3 we will provide detailed logical design of the various components of the pipe as well as gate counts.

Table 24 gives a list of functions to be performed (including how many parallel units are required), the delays involved, and the dependency relationships. This table is then used to generate Table 25, which describes the timing sequence for all functional components in the IUD pipe. Finally, using these two tables, we construct Figure 18, which is an overall diagram of the pipeline components.

TABLE 24  IUD PIPE FUNCTIONS, TIMINGS, AND DEPENDENCIES

Function                                    Section Where   Abbre-            Number of       Requires
                                            Outlined        viation   Time    Parallel Units  Output From
Use scalar result to update parallel
  search portion of scalar table            4.6.2.3         SPU        1          4           none
Use vector result to update parallel
  search portion of vector table            4.6.2.4         VPU        1          4           none
Use scalar operand to search scalar table   4.6.2.3         SST        2          6           SPU
Use vector operand to search vector table   4.6.2.4         SVT        2          6           VPU
Select scalar execution unit                4.6.2.5         SSE        1          4           none
Select vector execution unit                4.6.2.6         SVE        2          4           SVT
Reserve vector buffer storage               4.6.2.7         RVS        1          6           SVE
Update scalar operand table                 4.6.2.3         US         2          6           none
Update vector operand table                 4.6.2.4         UV         2          6           RVS
Generate vector switch operations           4.6.2.7         GSI        2          9           FVO
Fill in vector operand fields not
  known at SVT                              4.6.2.7         FVO        2          6           RVS
Assemble complete memory instructions      4.6.2.8         AM         2          3           FVO
Assemble complete scalar instructions       4.6.2.8         AS         2                      FVO
Assemble complete vector instructions       4.6.2.8         AV         2                      FVO
Initiate buffered transfer of vector
  switch instructions                       4.6.2.8         ISW        1                      GSI
Initiate buffered transfer of memory
  instructions                              4.6.2.8         IM         1                      AM
Initiate buffered transfer of scalar
  instructions                              4.6.2.8         ISC        1                      AS
Initiate buffered transfer of vector
  instructions                              4.6.2.8         IV         1                      AV

TABLE 25  IUD PIPE TIMING CHART

Stage of Pipe   Functions Just Completed   Functions Which May Begin
 1                                         SPU  VPU  SSE  US
 2              SPU  VPU  SSE              SST  SVT
 3              US
 4              SST  SVT                   SVE
 5
 6              SVE                        RVS
 7              RVS                        UV  FVO
 8
 9              UV  FVO                    AM  AS  AV  GSI
10
11              AM  AS  AV  GSI            IM  ISC  IV  ISW
12              IM  ISC  IV  ISW

FIGURE 18  IUD PIPE OVERALL STRUCTURE

Entry Port from         Number     Bytes/
IUD Front End           of Ports   Port     Functions vs. Pipe Stage (1-8)
Operator                   2         5      SSE  H  H  *SVE  H  H  H
Memory Operands and
  Results                  2         4      H  H  H  H  H  H  H
Vector Operands            3         2      H  SVT  H  *SVE  H  RVS  GSI  FVO
Scalar Operands            3         2      H  SST  H  H  H  H  H
Vector Results             4         1      VPU  H  H  H  H  RVS  UV  GSI
Scalar Results             4         1      US SPU  H  H  H  H  H  H

H means hold and pass on to the next stage with no function initiated. *SVE indicates that SVE requires both these inputs, and not that SVE is done for each, as is the case in other duplications down a column. No functions are initiated at stage 8. At stage 9, we initiate another set of switches like that in the IUD front end in order to assemble complete instructions. We will describe this tail end of the pipe in Section 4.6.4.

4.6.3.2 Detailed Structure and Gate Counts for Internal IUD Pipe Functions

In this section we will present gate counts and, where necessary, logical design for the functions listed in Table 24, up to the point where we begin assembling complete instructions. We will do logical design to the minimum degree of detail required to obtain reasonably accurate gate counts. Many of the units we discuss will be involved with several functions. We will introduce each unit as required. At the end of this section we will provide a summary of these units and their interconnections as well as a total gate count for the internal IUD pipe.

4.6.3.2.1 Details of the Parallel Update of the Scalar Table (SPU)

As discussed in Section 4.6.2.3, there must be a comparison tree for assigning time indexes to scalar operands which may coincide with scalar results being processed in parallel with the operands. From Table 25 we see that the complete update of the scalar table (US) is complete at clock 3. Since the search of the scalar table (SST) does not begin until clock 2, the comparison tree we are designing need only contain the scalar results being processed in one clock. This is a maximum of 4 (see Table 19). The information contained in this comparison tree must include the physical address of the scalar result and the time index for the instruction. In the next section we will discuss the hardware for generating time indexes for the various classes of instructions. For now we will assume they are directly accessible by using the instruction indexes discussed in Section 4.6.3.1.4. The loading of this comparison tree consists of gating the required information into the associated registers and clearing any of the 4 registers not used. This is the SPU function.
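Behaviorally, the SPU comparison tree acts as a small associative buffer: up to four in-flight scalar results, each a (physical address, time index) pair, are held so that a later search can find the newest time index for a given address before the table proper has been updated. A minimal Python sketch of that behavior follows; the class and method names are mine, not the thesis's.

```python
class SPUTree:
    """Behavioral model of the 4-entry scalar result comparison tree."""
    def __init__(self):
        self.entries = []  # (physical_address, time_index), newest last

    def load(self, results):
        # Gate in up to 4 (address, time index) pairs; unused registers clear.
        assert len(results) <= 4
        self.entries = list(results)

    def search(self, address):
        # Return the time index of the most recent matching result, if any.
        for addr, t in reversed(self.entries):
            if addr == address:
                return t
        return None

tree = SPUTree()
tree.load([(0x12, 7), (0x30, 8), (0x12, 9)])
assert tree.search(0x12) == 9     # the newest match wins
assert tree.search(0x55) is None  # no in-flight result for this address
```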
This comparison tree is shown in Figure 19.

[FIGURE 19]

4.6.3.2.2 Generating Time Indexes

[FIGURE 20  TIME INDEX LOGIC: a 4-output counter built from plus-1 through plus-4 adders, carry and no-carry switches, and logic for changing the meaning of the most significant bit.]

SWITCH CONTROL (1)

i = first instruction is of the specified type and is not a continuation of a previous instruction
j = second instruction is of the specified type
k = third instruction is of the specified type
l = fourth instruction is of the specified type

no change = i'j'k'l'
A = ij'k'l' ∨ i'jk'l' ∨ i'j'kl' ∨ i'j'k'l
B = ijk'l' ∨ ij'kl' ∨ ij'k'l ∨ i'jkl' ∨ i'jk'l ∨ i'j'kl
C = ijkl' ∨ ijk'l ∨ ij'kl ∨ i'jkl
D = ijkl

SWITCH CONTROL (2)

i, j, k, and l are the same as in Switch Control (1).

SA = i          SB = i'j          SC = i'j'k          SD = i'j'k'l

AW = i
AX = i'j              BX = ij
AY = i'j'k            BY = ij'k ∨ i'jk                 CY = ijk
AZ = i'j'k'l          BZ = ij'k'l ∨ i'jk'l ∨ i'j'kl    CZ = ijk'l ∨ ij'kl ∨ i'jkl    DZ = ijkl

FIGURE 20  TIME INDEX LOGIC (cont.)

To minimize hardware costs, we wish to keep the range of possible indexes to a minimum. We then need to do the indexing in a circular manner. In order to keep the compare logic to a minimum, we will add an additional bit to the minimum word size needed. We can then do circular indexing by alternately considering this high-order bit or its complement as the most significant bit of the index. That is, whenever we use up all indexes and start over, the new indexes we assign will have this bit set opposite to what it was just before we started over. Numbers whose high-order bit is the same as that we are now assigning will all be considered greater than numbers with the opposite value for this high-order bit. In the case of vector results, a maximum of 36 may be processed in 9 clocks. Thus a total of 7 bits will be required. In both cases of vector and scalar results, we only need to time index instructions with results of the specified type.

Generating these indexes requires logic similar to, but a bit more complex than, that required for generating the instruction numbers. This logic can operate in parallel with the IUD front-end logic and thus has three clocks available to it. Once generated, these indexes will be carried along through the pipe in their own set of registers. They will be accessible at any time merely by providing the instruction index as an address to the registers containing the time indexes. In the case of instructions that do not have a result of the specified type, they will receive a time index that is one greater than that of the most recent instruction with a result of the specified type. This will be the same index as the next instruction with a result of that type. Figure 20 gives the structure of the time index logic. Table 27 gives a gate count for the various index generators which may be required.
TABLE 27  GATE COUNT FOR TIME INDEX GENERATORS

                                                     9 Bits                  7 Bits
Unit                                    Units   Count/Unit   Total     Count/Unit   Total
Bit 0-4 (9) / Bit 0-2 (7)                 1         50          50         30          30
Plus-1 through plus-4 adders              4         30         120         30         120
Carry or no-carry switches                4         12          36         12          36
Logic to alter meaning of most
  significant bit                         1         10          10         10          10
4-output counter (above subtotal)         1                    216                    196
Switch Control (1)                        1         64          64         64          64
Switch (1)                                1    5*9*3=135       135        115         115
Switch Control (2)                        1         98          98         98          98
Switch (2)                                1   14*9*3=378       378   14*7*3=294       294
Registers                                 6         36         216         28         168
TOTALS                                                        1107                    935

[FIGURE 21  VECTOR COMPARISON TREE. Main structure: the vector time index pipes feed a 4 x 36 switch into 36 tree registers; the vector operand pipes feed 36 pipe registers, the comparison tree logic, and a 36 x 1 switch, which outputs the selected time index; purge logic, driven by the number of results completely updated this clock, maintains a next-free counter and a next-purge counter. Control detail: the next-free counter drives a 6-bit, 36-output address decoder whose outputs, with 1-to-4 fanout, control the 4 x 36 switch paths. Purge detail: each purge signal combines 4 consecutive address decoder outputs, gated by the number of instructions with vector results completely updated in the vector status table this clock. Comparison tree detail: 36 compare elements feed a 3-level last match selector controlling the 36 x 1 switch. The 3-level selector is built from last match selectors (LMS) and 4- and 16-element pass/no-pass units, with detail shown for the first 16 of 36 outputs. LMS logic detail: A - D are address decoder inputs, X, Y, and Z are numbers of complete inputs, and the output is P = AZ ∨ BX ∨ CYZ ∨ DX.]

TABLE 28  VECTOR BUFFER COMPARISON TREE GATE COUNT

Unit                      Gate Count
4 x 36 Switch Control     16+2*36+4*36 = 232
Purge Logic               16+2*36+8*36 = 376
Next Free Counter         100
Next Purge Counter        100
4 x 36 Switch             4*36*20*4 = 11520
36 Registers              4*36*20 = 2880

Comparison Tree Logic
Unit                      Gate Count/Unit   Number   Total
LMS                             13            14       182
4-input PS                       8             9        72
16-input PS                     32             3        96
36 x 1 Switch (10 bits)      36*10*3           1      1080
COMPARISON TREE TOTAL                                 1436
6 TREES                                               8616
UNIT TOTAL                                          23,824

4.6.3.2.3 Parallel Update of Vector Buffer Table (VPU)

As mentioned in the previous section, up to 36 vector buffer results may be in an incomplete state inside the IUD pipe. The search of the vector buffer table (SVT) must be able to detect this fact. To allow this, we need a comparison tree similar to that described in Section 4.6.3.2.1. This tree must be much larger to allow for 36 entries. Since entries can remain in the tree for up to 9 clocks, we need some additional control logic to properly purge and update the tree. Its functional operation is identical to that of the scalar result comparison tree described in Section 4.6.3.2.1. Figure 21 gives a diagram of this tree. Table 28 gives a gate count for the tree.

Most of Figure 21 is self-explanatory. The three-level last match selector does require some additional explanation. It is built from the four-input last match selectors whose logic equations occur in Figure 21. Outputs from the 36 compares are routed into nine of these last match selectors (LMS).
These nine are divided into three groups, two of size 4 and one containing a single element. The match/no-match output (N) from each element in a size 4 subgroup is one of the inputs to another LMS. Finally, the N outputs from these two LMS, plus the output from the solo element at level 1, are fed into a final LMS. The N output from this final LMS is the N output for the entire comparison tree. The four select outputs from each LMS are used either for control or as input to a pass/no-pass selector (PS). The level 1 LMS outputs are inputs to nine PS. The control for each of these PS is a selected output from a level 2 LMS. In particular, if the corresponding level 1 LMS were the last in its group of four to have a match, then its PS will be enabled, and otherwise not. The same game is played at level 3, but each PS controls 16 inputs. The extra level 1 LMS is handled in the obvious way to minimize package costs without requiring specially designed units.

4.6.3.2.4 Searching the Scalar Table

If we examine Table 24, we see from the dependency column that no function requires output from SST. We also know from Section 4.6.2.3 that close communication is required between the SEUs and the circuitry maintaining the scalar status tables. Thus it is both possible and desirable to move the scalar status tables and the SST hardware to be physically just ahead of the SEUs. One additional advantage in doing this is that it will no longer be required to process these instructions at the maximum rate that can occur in the IUD pipe, but only at the lower rate at which the SEUs can accept them. This advantage cannot be realized in the case of vector instructions because of the manner in which logical vector addresses are distributed among the VEUs. Memory and scalar instructions that interact with the vector unit must be processed together in a manner that logically corresponds to the actual sequence in which the instructions occur. We will move the SST and SPU functions into the SIDS mentioned in Section 4.3.2.1. We have already described the detailed hardware for SPU, which was then used in the previous section on the VPU. We will describe the remainder of the hardware in detail in Section 4.6.5, where we provide a complete description of the SIDS.

4.6.3.2.5 Searching the Vector Table (SVT)

The vector status table is the comparison tree described in Section 4.6.3.2.3 and an associative memory that maps logical buffer addresses to physical buffer addresses. Since most of the accessing of this table is by logical addresses, it would be desirable to have each logical vector buffer address be a physical address to this table. At a given instant in time, there may exist in the queues several instructions with the same logical address. However, only the most recent store to a logical vector location is required for this table. Thus we can make the logical address of a vector be a physical address to the status table. The only information that will then be needed in the table is the current physical address of the logical buffer. When we first discussed this table in Section 4.6.2.4, we described it as an associative memory. We see here that this is no longer necessary. In that previous discussion, we mentioned the necessity of providing use counts for the VEU as a means of noting when a logical register was available for re-use. For the same reasons discussed in the previous section, these functions are best transferred to just in front of the vector portion of the machine.
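Functionally, the directly addressed status table just described keeps only the newest logical-to-physical binding. A behavioral sketch of that reading follows (the parallel read and write ports are modeled as simple loops; names are mine):

```python
class VectorStatusTable:
    """Logical vector buffer address -> current physical address."""
    def __init__(self, size=256):
        self.phys = [None] * size   # one slot per logical address

    def store_parallel(self, writes):
        # Up to 3 results per clock; only the most recent store to a
        # logical address matters, so later writes simply overwrite.
        for logical, physical in writes:
            self.phys[logical] = physical

    def read_parallel(self, reads):
        # Up to 6 operand lookups per clock.
        return [self.phys[logical] for logical in reads]

t = VectorStatusTable()
t.store_parallel([(5, 17), (9, 3), (5, 21)])   # newest binding for 5 wins
assert t.read_parallel([5, 9]) == [21, 3]
```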
We will describe this unit, the Vector Instruction Dispatcher Subsystem (VIDS), in Section 4.6.6.

There are two functions that this table must perform. It must allow for accesses, in parallel, up to the number of vector operand pipes. There are six of these pipes. Two minor cycles are allowed for this access and for selecting the output from this table or the comparison tree. The access must be pipelined so that a set of six is complete in every minor cycle. The same table must be able to accept stores in parallel at the rate that the vector results can emerge from the pipe. Again, two clocks are allowed for this function, but it must be pipelined to allow a set of stores to be processed starting at every clock. To do this, we require only a standard memory, but with six read address decoders with their own output lines and three write address decoders with their own output lines. The two-clock pipelining can consist of one clock for address decoding and one clock to do the read or write. Table 29 gives a gate count for this memory and the circuitry that selects either the memory output or the comparison tree output.

TABLE 29  VECTOR STATUS TABLE GATE COUNT

Unit                                        Gate Count                Number Units   Total
2 Level (128) Address Decoder               16*4+8*3+128*2 = 344           9          3096
2 Level (256) Address Decoder               656                            9          5904
Bits to Address (128), Store Decoded
  for Pipelining                            4*128 = 512                    9          4608
Bits to Address (256), Store Decoded
  for Pipelining                            4*256 = 1024                   9          9216
Memory Location, 16 Bits with 3-Way
  Fan In and 6-Way Fan Out                  16*(4+3+2+6) = 240           128         30720
                                                                         256         61440
Comparison Tree / Memory Selection Logic    (4+16)*2+16*2*3 = 136          1           136

Total (128): 7,840 not including memory; 38,560 with memory
Total (256): 15,256 not including memory; 76,696 with memory

4.6.3.2.6 Selecting the Scalar Execution Unit (SSE)

This is another function which can and should be moved to the SIDS. However, since this design is a simplified version of the hardware for selecting the vector execution unit, we will provide a detailed design at this point. We will take advantage of the lowered processing rate required by transferring this function to the SIDS. Thus we will only need to process instructions at the rate the SEUs can process them or, at most, one per minor clock.

A small table containing the current queue size of each SEU is required. Whenever an instruction has completed execution, a table entry must be decremented. Whenever an instruction is entered into a queue, a table entry must be incremented. Instruction codes are logical addresses referring to a group of functionally identical SEUs. If for a particular logical address there is only one element in the specified group, then the SSE function is simply to convert the logical address to a physical address and, if the queue and the SIDS buffer are full, send a signal to hold up the IUD. If there is more than one functionally equivalent SEU involved, then the conversion to a physical address involves selecting the unit with the smallest queue and holding up the IUD if all the queues and the SIDS buffer are full. Figure 22 is a unit to perform this function for up to six equivalent SEUs. Table 30 provides a gate count for this unit.

This unit must provide an indication of the smallest queue, which is updated every clock. To allow for this, we have 12 registers. Six of these contain the current queue sizes. These are divided into two groups of three. For each of these groups there is an additional set of three registers containing the differences in queue sizes.
These differences are not generated by doing subtracts, but rather directly from the increment and decrement signals to the queue sizes. The signs of these differences are used to determine a minimum in each group of three and to control a 3 x 1 switch to transfer that minimum to another register. In one minor clock, all register incrementing and decrementing and the transfer of a minimum can be performed. In the next clock, a true subtract can be done on the two minima generated in the previous clock. The results of this subtract can then be used to choose from which of the two groups the global minimum is to be chosen. Because of this two-stage pipelining, we do not necessarily have the absolute minimum at a given clock. However, only one instruction per queue can complete in a major clock, and only one queue location can be reserved in a minor clock. Thus this additional one clock delay will, at most, allow a difference of two from the minimum, and this cannot happen often.

[FIGURE 22  SEU QUEUE SELECTOR: six queue counters a-f with +1 and -1 inputs, grouped in threes; pseudo carry difference counters for a-b, b-c, c-a and for d-e, e-f, f-d; 3 x 1 switches and controls that transfer each group minimum to a register; and pass/no-pass units that select the global minimum from the two group minima.]

LOGIC EQUATIONS FOR INTEGRATOR

Output:  S = sign (TRUE means negative), X = high-order bit, Y = low-order bit
Input:   P0, P1 = increment inputs; N0, N1 = decrement inputs
(A prime denotes the complement.)

S = N0N1P0' ∨ N0N1P1' ∨ N0N1'P0'P1' ∨ N0'N1P0'P1'
X = N0N1P0'P1' ∨ N0'N1'P0P1
Y = N0'N1'P0P1' ∨ N0'N1'P0'P1 ∨ N0N1'P0P1 ∨ N0'N1P0P1 ∨ N0N1'P0'P1' ∨ N0'N1P0'P1' ∨ N0N1P0P1' ∨ N0N1P0'P1

LOGIC EQUATIONS FOR THE a-b PSEUDO CARRY DIFFERENCE COUNTERS

Input: S, X, and Y just defined, plus a0, a1, a2, a3, and Sd (sign and 4 bits of the difference)
Output: Sr, r0, r1, r2, r3 (sign and 4 bits of the new difference)
Note: X and Y cannot both be TRUE.

r3 = a3Y' ∨ a3'Y
Sa = SSd ∨ S'Sd'   (TRUE means addition, FALSE means subtraction)

[The remaining equations of the figure develop r2, r1, r0, and Sr by cases on Sa, using the positive and negative pseudo carries Cp and Cn, with separate terms for addition, subtraction without overflow, and subtraction overflow.]

LOGIC TO SELECT THE MINIMUM ELEMENT FROM THE SIGNS OF THE DIFFERENCES

a > b → Sab        b > c → Sbc        c > a → Sca

a minimal = Sab'Sca ∨ Sab'Sbc'Sca'
b minimal = SabSbc'
c minimal = Sca'Sbc

FIGURE 22  SEU QUEUE SELECTOR (cont.)

TABLE 30  SEU QUEUE SELECTOR GATE COUNT

Unit                                    Gate Count   Number Units   Total
Integrator                                  54             6          324
Queue Counters (Ripple Carry, 4 Bits)       16             6           96
Difference Counters                        140             6          840
3 x 1 Switch (4 bits)                       36             2           72
Adder for M0 - M1                           24             1           24
Registers for M0, M1                        16             2           32
Registers for Sign Bits                     12             2           24
Pass/No-Pass Units                          15             2           30
TOTAL                                                                1442

TABLE 31  CONNECTIONS FROM OPERAND PIPES TO VEU QUEUE SELECTOR

Operand Pipe   Queue Selector Port
1              1, 2
2              2, 3, 4
3              3, 4, 5, 6
4              4, 5, 6, 7
5              5, 6, 7
6              6, 7
7              7

The two-bit instruction index forms the high-order bits of the port address. The bit indicating the first or second operand of binary instructions is the low-order bit.
Gate Count
Unit                    Gate Count   Number Units   Total
3-bit address decoder       32            6           192
2 x 1 switch (4 bits)       24            2            48
3 x 1 switch                32            2            64
4 x 1 switch                40            2            80
Direct path                  8            2            16
TOTAL                                                 400

All operands of the same type are adjacent. Thus we need only look at the type of adjacent instruction bytes. Let A, B, C represent the types of adjacent instruction bytes (TRUE = the type we are testing for). x true indicates that the byte corresponding to B is the second operand of a binary instruction.

x = AB

Detecting a partial instruction is only slightly more complex. The question is, "Does a terminal byte exist at or after this byte?" T1-T12 indicate whether the corresponding byte is terminal. Pi is true if byte i is part of a partial instruction.

Pi = Ti' T(i+1)' ... T12'

FIGURE 23  LOGIC TO INDEX OPERANDS AND DETECT A PARTIAL INSTRUCTION

TABLE 32  GATE COUNTS FOR INDEXING OPERANDS AND PARTIAL INSTRUCTION DETECTION

Indexing Operands
Operation                 Gate Count   Number Units   Total
Operand type detection        5            12*2         120
Index generation              3            12*2          72
TOTAL                                                   192

Detecting Partial Instructions
Number of gates = 76

4.6.3.2.7 VEU Queue Selector (SVE)

There are two important differences between the VEU and SEU queue selectors. First, this function cannot be moved outside the main IUD pipe, and thus we must allow for the processing of up to four instructions in parallel. The second difference is the added complexity that results from VEUs being assigned to individual programs, as discussed in Section 4.6.2.6. This requires that the operands of a vector instruction be combined with the operator in the unit which selects the VEU. We have allowed a two minor clock delay for this processing, but with a fully pipelined processing rate of four instructions per clock. We will first discuss how the operands and operation get together, and then the details of the queue selection hardware.

Since each instruction element has an instruction index associated with it, switching the operands to the queue selection elements is relatively straightforward. We require a switch from the 8 vector operand pipes to the 4 sets of input ports for selecting a VEU. This does not require a full 8 x 8 crossbar. Table 31 lists the connections required and gives a gate count for this unit.

One problem occurs with binary instructions. It is necessary that each operand be switched to a different entry in the set being used for a particular instruction. To allow for this, it would be desirable to have associated with each vector operand a single bit indicating if this is the first or second operand in a binary instruction. The same information would be desirable for scalar instructions when they are recombined to be routed to the SIDS. This information is recoverable from the index of the pipe in which the operand occurs, but having it instantly available is necessary to maintain the high processing rate. Figure 23 gives the logic for this process, and Table 32 provides a gate count. This unit will occur in the pipe front end where the instruction index generator occurs (see Section 4.6.3.1.4).
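The logic of Figure 23 can be checked against a direct restatement in software. In the sketch below, T[i] marks byte i as the terminal byte of an instruction; the helper names are mine, and byte indices run from 0 rather than 1.

```python
def second_operand_flags(types, target):
    # x = A AND B over adjacent bytes: byte i is the second operand of a
    # binary instruction when both it and its predecessor are of the type.
    return [i > 0 and types[i] == target and types[i - 1] == target
            for i in range(len(types))]

def partial_instruction_flags(terminal):
    # P_i is true when no terminal byte occurs at or after byte i, i.e.
    # byte i belongs to an instruction that spills into the next segment.
    n = len(terminal)
    flags = [False] * n
    none_after = True
    for i in range(n - 1, -1, -1):
        none_after = none_after and not terminal[i]
        flags[i] = none_after
    return flags

types = ['V', 'V', 'S', 'V', 'V']
assert second_operand_flags(types, 'V') == [False, True, False, False, True]
assert partial_instruction_flags([0, 0, 1, 0, 0]) == [False] * 3 + [True] * 2
```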
If we provide additional registers to hold any initial segment of an instruction with the highest queue index, we can then complete proces- sing of that instruction in the next clock. This is the first point in the pipe where this problem arises. Thus we can add switches and buffers to hold a partially completed instruction. In order to minimize the circuitry to do this, we will provide one bit associated with each instruction component to indicate if it is part of a partial instruction. The logic for this is in- cluded in Figure 23 and the gate count for this logic is in Table 32. Figure 24, which we will discuss shortly, gives the switch control and buf- fers for retaining partial instructions. Now that we have designed hardware to retain and collect the information necessary for VEU queue selection, we can proceed to design the hardware to perform the algorithms described in Section 4.6.2.6. There are two ways in which this unit is more complex than the SEU queue selection unit. First, it is not simply the queue size that is relevant to selecting a VEU, but also the number of operands already resident in the queues in which VEU has been assigned to the instruction. These complications are handled by having six effective queue sizes for each VEU. There are effective queue sizes for the following cases: A. Cases where the VEU is assigned to this program. 1. No operands for this instruction in this queue. 2. One operand for this instruction in this queue. 3. Two operands for this instruction in this queue. 190 B. Cases where the VEU is not assigned to this program. 1. No operands for this instruction in this queue. 2. One operand for this instruction in this queue. 3. Two operands for this instruction in this queue. Figure 24 shows the entire design of the VEU queue selector. The four groups of six registers in this figure contain the above effective weights for each of the four VEU queues. The second factor which makes this unit more complex than the SEU queue selector is the necessity of processing up to four instructions in parallel. In Table 24 we have allowed two clocks for this processing. The problem with meeting these time constraints results from the way the queue use of one in- struction can affect queue assignment for later instructions. Since we are processing up to four instructions in parallel, we need to somehow simulta- neously take into account the queue use interactions of four instructions. Although we are allowed two minor clocks to do the processing, we must pipe- line this with an emergence rate of four instructions every minor clock. This will have the consequence of having to start processing of a given set of four instructions before the queue weight registers have been updated for the previous two sets of four instructions. Table 33 summarizes the depen- dency relationships of the instructions. We now outline the algorithms employed in meeting the above constraints. Each of the three major functions we will list are performed in parallel for different sets of four instructions. The subfunctions are performed sequen- tially on the same group of four instructions. 191 I. Update queue weight registers. 1. Use a bit serial counter to decrement the weight register by 1 if the corresponding VEU has notified the IUD that a queued instruction has started to execute. 2. 
Cascaded with the bit serial counter, use a bit serial adder to increment the corresponding weight register by the queue use of the instructions which had their queue reservation processing completed in the previous minor clock. II. First minor clock of queue reservation processing. 1. Switch instructions and their operands into Unit. 2. Determine which weights to use. 3. Select the determined weights. 4. Increment each weight by the corresponding queue usage of the queue reservations made in the previous clock. 5. Determine the minimum weight. This and the previous function are done simultaneously. 6. Subtract the minimum weights from all weights. III. Second minor clock of queue reservation processing. 1. Increment each weight by the corresponding queue usage of the queue reservations completed in the previous clock. Simul- taneously subtract 0, 1, 2, 3, and 4 from each of these sums. 2. Select the set of weights that produced a zero sum and had the smallest value subtracted from it. 3. Decode the selected weights as follows: a. For instruction generate BO which is true if the weight for instruction queue n is 0. 192 b. For instruction 1 generate Bl similar to BO . n n c. For instruction 2 generate B20 and B21 . B20 is n n n similar to B0 n - B21 n is true if the weight for instruc- tion 2 queue n is 1 . d. For instruction 3 generate B30 , B31 , B32 . 4. Simultaneously select the queues for instructions and 1 according to the following algorithms: a. For instruction select the minimum n such that BO . n b. For instruction 1 select the minimum n such that Bl n and B0 n were not selected for instruction 0. If this is not possible, select the unique n for which Bl . ^ n 5. Select the queue for instruction 2, taking into account the queues selected for instructions and 1. 6. Select the queue for instruction 3, taking into account the queue selections for the previous three instructions. 7. Decode the queues selected into: a. Binary integers, giving the queue use for each queue. b. Queue addresses for the instructions. The detailed logic for performing these functions is described in Appen- dix A. Most of the logic equations in this appendix are fairly straight- forward. However, in order to meet the serious time constraints, the logic to perform functions 1 1 1-4 through III-6 above require a somewhat complex technique for constructing the logic equations. We will now describe this technique. The method may be thought of as a generalization of the trick used in constructing a pseudo carry adder. In general, we are construction a func- tion R n from many Boolean inputs. We divide the function into two case. 193 We then construct factors D and D which will detect these cases. We also construct Boolean selection functions SI and S2 n which will be true for the correct value of n in cases 1 and 2 respectively. Thus we will get an equation for R as: R n - DSl n V DS2 n This is essentially the way a pseudo carry adder is constructed. Just as one can generate a multi -level pseudo carry adder, we can generalize our technique to many levels. Doing this for the adder results in very symmetric equations. In our case, that symmetry is not present, and a major part of the design problem is providing a notation to keep track of the terms we have generated and the cases we have considered. Thus, in the appendix, we have indexed detection and selection terms as Di.j.k and Si.j.k where i, j, and k are integers equal to 1 or 2. 
Each integer separated by a dot represents another level of cases, and there is no precise limit to how many levels are allowed other than the necessity of keeping the equations to a reasonable size. We do not necessarily go to deeper levels in a symmetric way. For example, we might develop S2.1.1 down three more levels until we are con- sidering S2.1.1.i.j.k, whereas S2.1.2 may not be developed to any deeper level. This notation does make it fairly easy to construct such complex functions. We can always determine the cases we have not yet considered by simply reading an index backwards until we reach the first 1. The negation of that case is the next one to be considered. One other technique we employ to keep the equations from becoming too large is to construct a subcase in a lower level of logic and to simply use the output from this lower level in the final equation. When we do this, we replace the corresponding "." with a "-". Thus, we might construct an S2.1 .1 n to use in the equation for S2 n< 194 H h 3 -*- F R M F R M 2 — kR M QUEUE WEIGHT 4x1 SWITCHES REGISTERS AND ADDRESS u n F I N A L S E L E C T I N L G I C J I — a 0n .1 n u i J — *• •i n B.m l n F D I E N C A L D E R u? J •r *" n u. J t a 3 " J p TDC1 TMI rir ll i m_ | REGISTERS INCREntiNi anu MIN SELECTORS u n q! SECOND 5x1 SWITCHES INCREMENT UNIT FIGURE 24 VEU QUEUE SELECTOR 195 TABLE 33 TIMINGS FOR UPDATING WEIGHT SELECTION REGISTERS Minor Clock Instruction Group Instruction 12 3 Instruction Group 1 Instruction 12 3 Instruction Group 2 Instruction 12 3 Instruction Group 3 Instruction 12 3 E E E E 1 D 1 D 1 D 1 D 1 E E E E 2 D 2 D 2 D 2 D 2 D lWl E E E E 3 R R R R D 2 D 2 D 2 D 2 Wl D l E E E E 4 R R R R D 2 D 2 D 2 D 2 Wl D l 5 R R R R D 2 D 2 D 2 D 2 6 R R R R E = Enter Unit; D, ,D 2 = Determine Queue Use; R = Reset weight Registers Instruction, Instruction Group (2,0) (2,1) (2,2) (2,3) (3,0) (3,1) (3,2) (3,3) Requires information about the follow- ing instructions with weight registers not yet updated (1,0) (1,1) (0,0) (0,1) Same as (2,0 Same as (2,1 Same as (2,2 (2,0) (2,1) (1,0) (1,1) Same as (3,0 Same as (3,1 Same as (3,2 1,2) (1,3) 0,2) (0,3) plus (2,0) plus (2,1) plus (2,2) 2,2) (2,3) 1,2) (1,3) plus (3,0) plus (3,1) plus (3,2) 196 TABLE 34 TIMING AND GATE COUNT FOR VEU QUEUE SELECTION TIMING FOR FIRST CLOCK OF PIPELINE Logic Level 1 2 3 4 5 6 7 8 9 10 11 Function Completed Information switched into unit Weights selected Weights switched Minimum of weights with U. added found J Minimum selected Minimum subtracted from all weights TIMING FOR SECOND CLOCK OF PIPELINE Logic Level 1 2 3 4 5 6 7 8 9 Function Completed U. added, constant weights subtracted Group with first overflow switched Queue use determined Queue use decoded 197 TABLE 34 TIMING AND GATE COUNT FOR VEU QUEUE SELECTION (cont.) GATE COUNT Unit Queue Weight Registers Queue Weight Adders and Counters Weight Selection Logic Weight Switches Increment and Decode Loc lie Second Increment Final Selection Decoding TOTAL Number Gates 120 2 760 456 240 6 864 9 608 895 64 21 ,007 198 In turn, S2.1.1 could have terms such as D2.1 .l-i.j.kS2.1 .1-i i k n ,%J * n in it. We sometimes use a similar notation when values computed at a lower level of logic are required but do not exactly fit our detection selection scheme. In such cases, the values are usually defined, e.g., T2.1.1 , and are simply used in the equation for S2.1.1 . 
This notational scheme does seem to be an effective tool in generating complex multi-level Boolean functions where it is important to keep the number of levels small. Unfortunately, the scheme gives no algorithms for determining what cases are likely to be good ways to break up the function. Intuition and trial and error are required for that part of the process. Table 34 gives the overall timing and gate count of the unit. These figures are derived from Appendix A. These figures are not unreasonably large, but they probably cound be reduced substantially by some more playing with the design. The 11 levels of logic or 22 gate delays in one minor clock is probably the one figure that one would most want to reduce. 4.6.3.2.8 Reserve Vector Buffer Storage (RVS) This unit must reserve storage for the operands and results of each vector instruction. In the case of operands, the space must be within the VEU in which the instruction is to be executed. The same is true of results except for results from a memory load instruction which have space in the vector buffer. The operand and result portions of this unit operate on dis- joint storage spaces and are functionally independent. We will now describe the structure and operation of these two units. Figure 25 gives the overall structure of the result processing portion of this unit. From this point on we will not provide detailed logical design unless the unit is in some major way dissimilar from units already designed. 199 We will make rough conservative estimates of gate counts and logic levels required. In some instances limited design of parts of a unit may be re- quired to verify that the estimates are conservative. We will try to be explicit about the assumptions on which the estimates are based. Thus we will describe how this unit works and, within this commentary, provide estimates of gate counts and timing. This same approach will be used throughout the remainder of Chapter 4. The first step in assigning storage for vector results is to determine how many spaces are required in each VEU and in the Vector Buffer. This function is performed by the field decoder. We are processing up to four instructions in parallel, so up to four results may be required in a given VEU. The field decoder generates a count of the number of results required of each VEU by looking at the VEU address portion of the result field for each vector instruction. Decoding the address fields requires one gate for each bit in each field or a total 12 bits for each VEU. Thus the total will be under 100. One gate delay of 1/2 level of logic is required to do this decoding. The encoding of the counts for individual VEUs requires roughly 10 gates for each bit of the encoded results or under 300 gates total. One level of logic is required for this encoding. The counts produced are sent to the buffer status units and to the final switch. The vector buffer status unit differs from the VEU status units mainly in the size of the memory it is working with. The size of the VEU result memory is likely to be about 16 as discussed in Section 4.4.2.1. The size of the Vector Buffer is likely to be about 256 as discussed in Section 4.4.3, The other differences between these units is how they free locations and the possibility of buffer overflow. If the Vector Buffer ever overflows, this is a hardware or programming error as discussed in Section 3.2.1.2.1. 
The vector buffer status units try to keep their respective buffers from becoming full by outputting a "too full" signal when their size crosses a certain threshold. As discussed in Section 3.2.1.2.1, this threshold should be variable so that an optimal value can be determined by experience. It might also be varied with different programs or program mixes.

We will first discuss the VEU status units. The functions of these units are:

1. To maintain in the buffer registers four available buffer locations.
2. To signal to the VIDS when a buffer has exceeded its threshold size.
3. To signal to the entire IUD to pause when a request is made for space that cannot be honored.
4. To process signals that indicate a given VEU buffer location is available for reuse.

Since there are only 16 locations to be accounted for, it is reasonable to maintain these in a stack of registers that can be shifted four positions in a single minor clock. The logic for such registers should be less than the total number of bits times 12, or less than 1000. The logic to keep track of the size of the stack, to control the shifting, and to interrupt the IUD should be under 200 gates. If we allow a new entry to be made at every fourth position, the logic for the necessary switch and control should be under 200 gates. It is reasonable to allow at most one location to be freed in a minor clock and to buffer requests at their initiation whenever they are generated at a faster rate. Thus the total gate count for one of these units will be under 1400.

Figure 26 gives the detailed structure of the vector buffer status unit. A stack of registers, or push-up buffer, is also employed in this unit. This stack only contains 12 of the possible 256 available locations. A single status bit is maintained for each physical location, and a register is available if its status bit is one or its address is contained in the push-up buffer. The status bits are grouped into four portions of 64 each, and these are searched and set separately. The purpose of the push-up buffer is to allow for no pauses in instruction processing in the case when one or more of the groups of 64 may have no free locations. Experiments might be desirable to obtain an ideal size for this push-up buffer. However, given that we are assuming less than half the instructions are vector instructions, and given that this algorithm for assigning locations will tend to distribute locations uniformly, 12 is likely to be an adequate size. The size of this buffer should also be less than 1000 gates.

The clear tree's function is to decode a 6-bit address into a signal to set a status bit. This can be done in under 160 gates. The search tree must output a 6-bit address corresponding to some bit that is set and at the same time reset that bit. This can be done by having one set of logic that is pyramided up from the status bits and indicates whether a given set of four, and then sixteen, status bits has a one in it. Logic pyramided down to the status bits can choose the lowest set of four at each level, simultaneously decode two bits of the address of the bit that will be ultimately chosen, and send a signal to the correct one of the four groups it is looking at to choose a bit from that group. Only the decoding of the least significant two bits of the address is done at the base of the pyramid. This requires less than 400 gates. There are 256 status bits. At four gates each, this comes to 1024.
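The pyramided search just described is a find-first-set tree: one pass up records which 4-bit and 16-bit groups contain a set bit, and one pass down picks the lowest nonempty group at each level while decoding two address bits per level. The following software rendering covers one 64-bit group (the unit itself uses four such groups of 64); it is a sketch under that assumption, with names of my own choosing.

```python
def find_and_clear_lowest(bits):
    """bits: list of 64 status bits (1 = location free).
    Returns the 6-bit address of the lowest set bit and clears it,
    mimicking the combined search/reset tree; None if no bit is set."""
    # Pass up: does each group of 4, and then each group of 16, contain a 1?
    any4 = [any(bits[4*g:4*g+4]) for g in range(16)]
    any16 = [any(any4[4*g:4*g+4]) for g in range(4)]
    if not any(any16):
        return None
    # Pass down: choose the lowest nonempty group, two address bits per level.
    g16 = any16.index(True)
    g4 = 4*g16 + any4[4*g16:4*g16+4].index(True)
    bit = 4*g4 + bits[4*g4:4*g4+4].index(True)
    bits[bit] = 0      # reset the chosen bit as its address is read out
    return bit

free = [0] * 64
free[37] = free[50] = 1
assert find_and_clear_lowest(free) == 37
assert find_and_clear_lowest(free) == 50
assert find_and_clear_lowest(free) is None
```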
The free decoder decodes the first two bits of a free address and switches the remainder to the appropriate clear tree. This requires less than 60 gates. The control for the push-up buffer requires less than 200 gates. The total gate count for the vector buffer status unit is under 4600.

[FIGURE 25  RESULT PROCESSING PORTION OF UNIT TO RESERVE VECTOR STORAGE: the VEU fields of the vector instructions feed a field decoder; its counts drive the vector buffer status unit and the six VEU buffer status units (VEU 0 through VEU 5), each with full and free signals; buffer registers and a final switch deliver physical addresses to the vector address result fields.]

[FIGURE 26  VECTOR BUFFER STATUS DETAILS: free decoder, clear and search trees over the four groups of status bits, push-up buffer, and push-up buffer control.]

Returning to Figure 25, we still need to provide a gate count for the buffer registers and the final switch. The buffer registers require 720 gates, and the final switch and its control require less than 400 gates. Thus the total for the entire unit will be less than 14,500.

4.6.3.2.9 Update Vector Tables and Fill in Vector Operands (UV and FVO)

Now that we have determined a physical address for all vector results, we need to update the vector status table discussed in Section 4.6.3.2.5. In addition, we need to fill in the vector operands which were not known at that stage in the pipe. No additional logic is required for the first of these functions, since we provided sufficient address decoders when we originally discussed the vector status table. For the missing operands we have only a time index of the instruction which generates the result. We need to construct a table which allows us to map this time index into a physical address. A simple way to do this is to provide a buffer with one physical location for each possible time index of an originally undefined operand. Then, if we load and address this buffer in a circular fashion and have six independent ports to address it, we will have the problem solved. This unit will be similar to the non-comparison-tree portion of the vector buffer comparison tree designed in Section 4.6.3.2.3. We will use the gate counts of that unit. We do not require the 4 x 36 switch listed there, and we can get by with four 1 x 9 switches, thereby dropping the gate count for the switch to 720. The six address decoders require less than 2000 gates. This gives a total of less than 6500 gates.

4.6.4 Tail End of Main IUD Pipe

The main IUD pipe consists of those functions listed in Figure 18. In the course of designing the pipe we have decided to move functions related to scalar operands and results to the SIDS. We now must complete the IUD pipe by assembling bytes into complete instructions and shipping these instructions to the VIDS, SIDS, or main memory for further processing and ultimate execution. In addition, after the instructions are assembled, any instruction that requires the vector switch must have queue entries generated.

4.6.4.1 Assembling Instructions (AV, AS, AM)

Referring again to Figure 18, we see that all the pipes, except the operator pipes, are separated into vector, scalar, and memory instruction bytes. Thus the operators must first enter a 3 x 1 switch as they emerge. From this switch, they enter a buffer for either vector operators, scalar operators, or memory operators. Simultaneously with making entries in each of these buffers, we will set up a word of presence bits to be used in removing entries from these buffers. A portion of this hardware is illustrated in Figure 27.
[FIGURE 27  TAIL END OF IUD PIPE, VECTOR OPERATOR PORTION: operator bytes from the main IUD pipe pass through 3 x 1 switches (repeated 10 times) to the memory operator, scalar operator, and vector operator buffers and their presence bits; the vector operator buffer, together with the parallel inputs for vector operands and results, feeds a 4 x 1 switch into the vector instruction buffer.]

The non-operator pipes do not require the initial switch. They do require a buffer and set of presence bits for the process of assembling complete instructions. After these buffers, another set of switches is required to merge the bytes of an instruction into a complete instruction.

The size of the buffers and switches is determined by the various data rates involved. We will not consider the question of what constitutes optimal size, but will suggest some reasonable sizes. Both vector and scalar execution units can process instructions at the rate of one per minor clock. Since there will be six of each of these units, the overall emergence rate will average to slightly less than one instruction per minor clock. Assuming two memory instructions for each vector instruction is probably conservative. To relate these figures to the parameters we are determining, we need to consider instruction sizes and emergence rates from the pipe. The constraints on instruction sizes are listed in Table 18.

There will be two levels of buffering involved. A sparse buffer will collect the output as it emerges from the IUD. From here the output is transmitted to a dense instruction buffer, from which it will be transmitted to its final destination. In this section we are determining the size of the first buffer and the size of the intervening switch. The input width of this switch determines the rate at which the sparse buffer is emptied. The output width of the switch determines the rate at which complete instructions can be assembled. There are three of these switches operating in parallel for each type of instruction. The three switches are for operators, operands, and results. It is these parallel switches which reassemble the instructions. The output width of these switches must at least accommodate the average instruction processing rate. We will use widths slightly larger than this, as listed in Table 35. The input widths must be adequate to accommodate the output widths. This can be determined by consulting Table 19. The size of the sparse vector buffer should be large enough to accommodate uneven instruction distribution without stopping the IUD. Determining an optimal value for this size is probably an impossibility. A smart compiler could probably do quite well with very little buffering by distributing instructions. A size between 4 and 8 words long would probably be reasonable.
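The reassembly step pairs bytes from the operator, operand, and result buffers by instruction identity. A toy model of the dense-side assembly follows; dictionaries stand in for the presence bits, and all names are mine.

```python
def assemble(operators, operands, results):
    """Merge per-component buffers into complete instructions.
    Each buffer maps an instruction tag to its bytes; an instruction
    leaves only when all three of its components are present."""
    done = []
    for tag in sorted(set(operators) & set(operands) & set(results)):
        done.append((tag, operators.pop(tag) + operands.pop(tag) + results.pop(tag)))
    return done

ops  = {7: [0x2A], 8: [0x11]}
args = {7: [0x03, 0x04]}
res  = {7: [0x90], 8: [0x91]}
assert assemble(ops, args, res) == [(7, [0x2A, 0x03, 0x04, 0x90])]
assert 8 in ops and 8 in res      # instruction 8 waits for its operands
```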
TABLE 35
LOGIC SUMMARY FOR ASSEMBLING INSTRUCTIONS

Unit                                  Size   Number   Gate Count
Switches for Operators                3x1    10       2400
Sparse Operator Buffers               10x6   3        14400
Switch and Buffer Input Controls             3        3720
Vector and Scalar Operand Buffers     6x6    2        5760
Buffer Input Controls                        2        408
Vector and Scalar Result Buffers      4x5    2        3840
Buffer Input Controls                        2        272
Memory Operand and Result Buffers     7x6    2        6720
Buffer Input Controls                        2        476
Vector and Scalar Operator Switches   4x6    2        1440
Switch Controls                              2        240
Memory Operator Switch                6x8    1        1920
Switch Control                               1        160
Vector and Scalar Operand Switches*   4x6    2        1440
Switch Controls                              2        240
Memory Operand Switches*              6x8    1        1920
Switch Control                               1        160
Vector and Scalar Result Switches*    2x6    2        720
Switch Controls                              2        240
Memory Result Switch*                 3x8    1        960
Switch Control                               1        160
                                             TOTAL    47,595

*The attribute refers to instruction type, not operand or result type. We have assumed 20-bit words and data paths, 4 gates/bit of storage, and 2 gates/switch bit junction.

Table 35 lists actual choices for all the design parameters required and provides approximate gate counts. We do not do any design of the switch controls. They are driven by instruction indices and instruction types carried along with the instructions. Techniques used in previous sections should easily produce units within the specified gate estimates and timing constraints.

4.6.4.2 Initiating Transfer of Instructions (IM, ISC, IV)

The next function is to ship the assembled instructions to the appropriate units. One of these destinations will be a unit for generating vector switch instructions. We need to estimate buffer sizes and data path widths using the methods and estimates of the previous section. These estimates and gate counts are summarized in Table 36.

4.6.4.3 Generating Vector Switch Instructions (GSI)

Either memory or vector instructions may require use of the vector switch. We must scan these instructions for vector operands and results and, where present, generate the appropriate queue entries for the vector switch. What is required is that the source and destination for each word to be switched be selected from the instruction streams and combined to make a vector switch queue entry. Physical addresses will always be used. In the case of memory instructions, we must reserve space in the vector memory buffer. Finding a space in this buffer is simply a matter of allocating a free location. Thus, simplified versions of the logic described in Section 4.6.3.2.8 can be used. The data rates must be adequate to handle the maximum rate at which instructions can emerge. Table 37 summarizes the logic requirements for this function.

TABLE 36
LOGIC FOR INITIATING INSTRUCTION TRANSFERS

Unit                                    Size   Number   Gate Count
Vector and Scalar Instruction Buffers   6x4    2        3840
Memory Instruction Buffer               8x4    1        2560
Vector and Scalar Buffer Controls              2        400
                                               TOTAL    6800

TABLE 37
LOGIC FOR GENERATING VECTOR SWITCH INSTRUCTIONS

Unit                                                     Gate Count
Select up to 2 out of 32 Available Memory Source
  Buffer Locations                                       400
Select up to 2 out of 32 Available Memory Destination
  Buffer Locations                                       300
Generate Vector Switch Queue Entries for
  Memory Instructions                                    800
Generate Vector Switch Queue Entries for
  Vector Instructions                                    400
                                                TOTAL    1900

4.6.5 Scalar Instruction Dispatcher Subsystem

The SIDS must provide time indexes and maintain use counts for physical scalar addresses. The logical functioning of this unit is described in detail in Sections 4.3.2.1, 4.6.2.3, 4.6.2.5, and 4.6.3.2.1.
We will summarize these descriptions, provide an overall design of the unit, and give gate count estimates. It should be noted here that the SPU, SST, US, and SSE functions listed in Section 4.6.3.1.5 are performed in this unit.

4.6.5.1 SIDS Functional Summary

The functions listed in Table 24 are pipelined with an emergence rate of one instruction per minor clock. They operate on the scalar status table and the scalar use table. The scalar status table contains one location for each possible active time index. It allows an associative search to be made for the correct time index of a scalar operand. Each new result causes the corresponding time index location to be loaded with the physical address for that result. Simultaneously, an associative search is made to delete any entry with the same physical address. The scalar use table consists of two parts. There is a section addressable by time indexes and another section associatively addressable by physical address. This second portion is for scalars in use with a time index that is about to be or has been reused. These are referred to as the index use table and the old operand table.

4.6.5.2 Detailed Design of SIDS

We now provide a detailed description of the SIDS structure. Table 39 provides a description and gate count for all the tables we refer to. The function US consists of two parallel stores to the scalar status table. The function SPU merely retains a result associated with its time index to be searched in performing the SST function. The SSE function selects the scalar execution unit. Since one scalar queue can drive several equivalent SEUs, this function will ordinarily be null, with instructions being routed to the unique queue required. If it is desired to have independent queues, then the logic of Section 4.6.3.2.6 can be used. The SST function consists of a parallel search of the scalar use tables for the most recent reference to the specified physical addresses. Flow charts for the functions UU, USU, and RU are provided in Figure 28. The AL function consists of accumulating a list of result locations and time indexes as use counts with non-zero links go to zero. These control functions can all be implemented in under 10,000 gates.

TABLE 38
SIDS FUNCTIONS

Function                                        Abbreviation   Time   Dependency
Use Result to Update Scalar Status Table        US             2      None
Retain Result for Pipelined Search of Scalar
  Status Table that will happen before this
  Entry is Complete                             SPU            1      None
Select Scalar Execution Unit                    SSE            1      None
Find Time Indexes for Operands                  SST            2      SPU
Update Use Counts for Operands                  UU             2      SST
Update Scalar Use Table as Instructions are
  Executed                                      USU            2      Instruction Execution
Accumulate List of Time Index and Physical
  Location Pairs for Stores that can Proceed    AL             8      Continuous Function
Use Result to Update Scalar Use Table           RU                    None

TABLE 39
SPECIFICATIONS AND GATE COUNTS FOR SIDS TABLES

I SCALAR STATUS TABLE

Size: 256 entries
Fields: 12 bits for physical scalar address
Parallel Accesses:
  2 associative reads (SST)
  1 store (US)
Gate Count: 256(12*8*2 + 4*12) = 61,440

II INDEX USE TABLE

Size: 256 entries
Fields:
  12 bits for physical address (associatively addressable)
  2 bits for top and bottom list flags (associatively addressable)
  8 bits for link
  6 bits for use count
Parallel Accesses:
  2 increments of use count (UU)
  2 decrements of use count (USU)
  1 associative read (RU)
  1 store (RU)
  1 store (UU)
Gate Count: 256(14*8 + 8*8 + 6*16 + 4*28) = 98,304
III OLD OPERAND TABLE

Size: 64 entries
Fields:
  12 bits for physical address (associatively addressable)
  1 bit for top of list (associatively addressable)
  8 bits for link to index use table
  6 bits for use counter
Parallel Accesses:
  1 associative search (RU)
  1 set for new result (RU)
  2 associative searches and increment counter (UU)
  2 associative searches and decrement counter (USU)
  1 read of link when use count goes to zero (AL)
  1 store (RU)
Gate Count: 64(13*8*5 + 6*16 + 12*4 + 4*27) = 49,408

Assumptions used in gate counts:
  8 gates per bit per associative access
  4 gates per bit per regular access
  16 gates per counter bit

[Figure 28: SIDS Flowcharts (Result Update; Update Use Count for Operands; Update Use Count for Instruction Execution)]

4.6.6 Vector Instruction Dispatcher Subsystem

The VIDS has the responsibility of freeing physical vector addresses as soon as possible. The algorithm for doing this is to maintain a use count for each logical buffer address. If a store going to a logical address is processed, the corresponding physical location can be reused when the use count goes to zero. Use counts are incremented each time an operand appears in the instruction stream in the VIDS. They are decremented each time the vector switch or internal switch transfers an operand. Since the physical address of each active vector buffer location is unique, we may organize the table on this basis. Table 40 gives the specifications of this table. Less than 4000 gates will be required for control purposes.

TABLE 40
VIDS TABLE SPECIFICATIONS

Size: 256 entries
Fields:
  8 bits for logical address
  6 bits for use count
  1 bit indicating the location may be freed when the use count is zero
Parallel Accesses:
  1 store for a new result from the instruction stream
  1 associative search on a result from the instruction stream
  2 increments of use count for each operand in the instruction stream
  2 decrements of use count for each operand used by the VEUs
  1 decoding of a physical address when it becomes available
Gate Count: 256(8*8 + 4*15 + 6*16) + 512 = 56,832

See Table 39 for gate count assumptions.

4.7 GATE COUNT SUMMARY

Table 41 provides a summary gate count for all the logic discussed in this chapter. It is divided into buffer-type gates and other logic. Summaries are provided for the computation portion, the IUD, and memory control. The counts do not include memory itself, which consists of one million words of 64 bits with a 1 major clock access rate. The gate counts for buffers assume 4 gates per bit. We have assumed 5000 gates per parallel computing element in each VEU. We have assumed 10,000 gates per SEU.
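As a check on the table arithmetic before it is summarized, the gate count conventions of Table 39 (8 gates per bit per associative access, 4 gates per bit per regular access, 16 gates per counter bit) can be applied mechanically. The fragment below recomputes the Table 40 figure; the grouping of fields into access classes is our reading of that table, not an additional specification.

```c
/* Recompute the VIDS table gate count from the stated conventions:
 * 8 gates/bit per associative access, 4 gates/bit per regular access,
 * and 16 gates per counter bit.  Entry layout follows Table 40. */
#include <stdio.h>

int main(void)
{
    int entries = 256;
    int assoc   = 8 * 8;   /* 8-bit logical address, one associative access */
    int regular = 4 * 15;  /* 15 bits (8 + 6 + 1) accessed regularly        */
    int counter = 6 * 16;  /* 6-bit use counter                             */
    int control = 512;     /* decoding of freed physical addresses          */

    int total = entries * (assoc + regular + counter) + control;
    printf("VIDS table gates: %d\n", total);   /* prints 56832 */
    return 0;
}
```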
TABLE 41
COMPUTATION UNIT SUMMARY GATE COUNT

Unit                          Gate Type   Source of Count   Count
6 SEUs                        1           Estimate           60,000
Scalar Status Tables          1           Table 5            16,050
Scalar Buffers                2           Table 6           655,360
Scalar Switch                 1           Table 8            88,880
Scalar Assembling Unit        1           Table 9            25,000
6 VEUs (control)              1           Table 11           87,936
6 VEUs (buffers)              3           Table 11          480,000
6 VEUs (arithmetic)           1           Estimate          240,000
Vector Buffer                 3           Section 4.4.3     131,072
Vector Switch                 1           Table 13          252,544
Memory Switches and Control   3           Table 14          992,040

Gate Types:
1. Ordinary logic.
2. Simple memory, access rate 1 word per 2 minor clocks.
3. Simple memory, access rate 1 word per minor clock.

TABLE 41
COMPUTATION UNIT SUMMARY GATE COUNT (cont.)

The remaining components are in the IUD.

Unit                              Gate Type   Source of Count     Count
Assign Instruction No.            1           Table 21              1,400
IUD Front End                     1           Table 23             17,104
Time Index Generator              1           Table 27              2,042
Vector Buffer Comparison Tree     1           Table 28             23,824
Vector Status Table               1           Table 29             76,696
Ports to VEU Queue Selector       1           Table 31                400
Partial Instruction Detection     1           Table 32                192
VEU Queue Selection               1           Table 34             21,007
Reserve Vector Buffer Storage     1           Section 4.6.3.2.8    14,500
Update Vector Tables, etc.        1           Section 4.6.3.2.9     6,500
Assembling Instructions           1           Table 35             47,595
Initiating Instruction Transfer   1           Table 36              6,800
Vector Switch Instructions        1           Table 37              1,900
SIDS Control                      1           Section 4.6.5.2      10,000
SIDS Tables                       1           Table 39            209,152
VIDS                              1           Table 40             56,832

Scalar and Computation Summary: Type 1, 770,410; Type 2, 655,360; Type 3, 1,603,112.
Memory Summary: Type 1, 1,851,552; Type 3, 992,040.
IUD Summary: Type 1, 495,944.

5 MACRO INSTRUCTION DECODER, I/O CONTROL AND EXTERNAL EXPANDABILITY

The machine designed in Chapter 4, with the addition of some I/O control, could be a complete CPU. In this chapter we briefly discuss possible additions to it that could significantly enhance its performance. The Macro Instruction Decoder, as described in Chapter 2, converts UAL instructions into Operand Fixed Format Instructions. Its primary purpose is to provide a high level of flexibility and to help eliminate program non-determinism as discussed in Section 3.2. These are also the reasons for including a scheme for anticipatory I/O. We will also describe the paging algorithms for this machine. Finally, we will discuss external expandability, or the connection of many of these computers to form a single working unit. This chapter is an outline of projects we would undertake if we had unlimited time, energy, and resources.

5.1 MACRO INSTRUCTION DECODER

The Macro Instruction Decoder may be regarded as a combination interpreter of UAL and operating system. Its primary function is to convert UAL instructions to OFFL instructions. Involved in this process are the following major tasks:

1. Convert instructions operating on arbitrary sized vectors to operate on the fixed vector width of the machine.
2. Insure that all memory accesses refer to pages present in Primary Memory.
3. Execute all transfers of control. In the case of conditional transfers, attempt to cause any required values to be computed by the execution units at the earliest feasible time. (The MIDs can request values from the EUs for use in evaluating conditional transfers.)
4. Attempt to anticipate I/O requests at the earliest possible time.
5. Perform normal operating system functions.

We have described in detail algorithms for converting vector instructions operating on arbitrary sized vectors to instructions for a fixed vector width [1]. We will not discuss this function further here.
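The flavor of the first task can be suggested with a fragment that decomposes an arbitrary-length vector add into operations of the machine's fixed width. This is a sketch of the general idea only, not the algorithm of [1]; the width of 8 and all names are our assumptions.

```c
/* Illustrative decomposition of an arbitrary-length vector add into
 * fixed-width operations.  The width of 8 matches an 8-word-wide
 * parallel unit; the masking of the final partial strip stands in
 * for whatever mechanism the hardware would actually use. */
#define VW 8   /* assumed fixed vector width of the machine */

void vadd(double *c, const double *a, const double *b, int n)
{
    int i = 0;
    /* full-width strips */
    for (; i + VW <= n; i += VW)
        for (int j = 0; j < VW; j++)     /* one fixed-width vector op */
            c[i + j] = a[i + j] + b[i + j];
    /* final partial strip, shorter than the machine width */
    for (int j = 0; i + j < n; j++)
        c[i + j] = a[i + j] + b[i + j];
}
```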
The second MID function may require subscript evaluation. Whenever this occurs, the effect is the same as a conditional branch. The MID cannot continue processing instructions until it can be assured that the required pages are available. Two features should be included to minimize the problems associated with this situation. First, both the compiler and the MID should attempt to insure that subscript expressions are evaluated as early as is practical. The MID should be constructed to use this information. Second, it should be possible to declare various arrays as save core during execution of various program segments. If it becomes necessary to swap a page of save core, then the entire associated program should be swapped out. The programmer, the compiler, and possibly even the MID should have the option of requesting save core.

The last three functions are all standard ones, with the observations we have already made about minimizing the effects of non-determinism. The techniques used in Chapter 4 should allow one to implement the functions described in hardware in an efficient manner. A great deal of analysis and experimentation would be required to obtain a good final result. In Chapter 6 we will outline some pragmatic considerations about constructing the entire system.

We will conclude these remarks on the MID by considering one major issue that should significantly influence its detailed design. The MID performs many compiler-like functions, and it is an open question which functions should be performed by the compiler and which by the MID. The primary motivation for moving compiler functions to the MID is the existence of information at execution time that is not available at compile time. The primary drawback is that the functions must be performed in an interpretive manner whenever a given code segment is executed. To the degree that it is possible to do the analysis fast enough with logic that is significantly less costly than the "computing portion" of the machine, this is not a major drawback. The only way to get a good hold on what the tradeoffs are is to do some experimentation. We do not yet have all the techniques required to design a good MID as outlined above. Once we have generated some basic set of building block ICs and have experience with connecting them, similar to the experience we now have in constructing large compilers, I would anticipate this approach to be highly productive.

5.2 PAGING DESCRIPTION

In this section we outline some minimum requirements for a paging algorithm to function with the machine already described, and we describe some information that the MID could make available that would be of use to an intelligent memory manager. One essential requirement is that a page be locked if any instruction accessing it has gotten past the MID. A locked page cannot be transferred to backup memory until all pending requests for access from the EUs have completed. Another required page status is that it be saved. A saved page is one considered essential to the current reasonable execution of a particular program; it cannot be swapped unless the entire program is swapped. Additionally, the MID can look ahead and anticipate what pages are about to be accessed. Thus an additional state a page can be in is that of being about to be required. It should be possible for the MID to provide a rough estimate of how imminent the access is as a basis for determining priorities.
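These page states can be collected into a small status record. The sketch below is a hypothetical encoding of the requirements just listed; the field names and the eviction rule are our own assumptions, not part of the design.

```c
/* Hypothetical encoding of the page states described above: locked
 * while EU accesses are pending, saved with its program, and marked
 * in advance when the MID anticipates a reference. */
#include <stdbool.h>

typedef struct {
    int  pending_eu_accesses;  /* > 0 means the page is locked         */
    bool saved;                /* swaps only with its entire program   */
    bool anticipated;          /* MID expects a reference to this page */
    int  imminence;            /* MID's estimate: smaller = sooner     */
} page_status;

/* A page may be moved to backup memory only if it is neither locked
 * nor saved. */
bool swappable(const page_status *p)
{
    return p->pending_eu_accesses == 0 && !p->saved;
}

/* Higher score = better eviction candidate: pages the MID expects to
 * touch soonest get the lowest scores, so the manager keeps them. */
int eviction_score(const page_status *p)
{
    if (!swappable(p))
        return -1;                  /* not a candidate at all */
    return p->anticipated ? p->imminence : 1 << 20;
}
```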
5.3 EXTERNAL EXPANDABILITY

The remarks in this section will be primarily philosophical. They might be thought of as an expansion on the ideas that led to the two-level clock discussed in Section 3.1.3.1. The fundamental concept is that the physical size of a computing structure imposes constraints on the interface and control structures of subunits. The primary constraint is that the larger the physical size, the longer the delays that must be tolerated. A secondary constraint is that the amount of information passed between subunits should be kept reasonably small. The interface scheme for two clocks at different structural levels could be generalized to more levels. An especially serious constraint, related to the discussions of non-determinism in the previous section, is that of the control structure. Traditional computers have a hierarchical control structure. The operating system resident in the CPU controls the entire computing system. Some units, like I/O channels, may have a limited degree of autonomy. Computer networks like the ARPA net have a democratic structure: there is no central source of control. The larger the physical size of a computing unit, the more desirable a democratic structure becomes. If one wished to use a large computing system for a single problem and it possessed a democratic structure, then one's program would need to reflect that structure. Basically, what is required is fork and join operations and the ability for independent processes to interrupt and in other ways communicate with each other. There do exist computing languages with these features; they are primarily used in real-time computing systems. Our computer structure, with its operating system computer, the MID, independent of the number crunching part of the machine, could provide an excellent candidate for a democratically structured computing system.

6 CONCLUSION

To perform a detailed and complete analysis of the structure we have designed would require an extremely elaborate and costly computer simulation. Such a process could provide much information about ironing out details, refining design parameters, and in general improving implementation details. Such a process is not necessary to provide general estimates of the performance of this structure. For this purpose we can use the generalized measures on FORTRAN programs which have been experimentally obtained. We justify this approach as being a useful and meaningful first iteration in the process of developing the design techniques and structural approach we have adopted.

The basic postulate is that this structure can obtain the potential speed-up and efficiency that has been measured in FORTRAN programs. We justify this statement by the flow analysis that we have provided throughout Chapter 4 and by the structure of the arithmetic units. The effective width of our machine is 38. This includes four 8-word wide parallel units and six scalar arithmetic units. Although some of the FORTRAN programs could benefit from a wider machine, most used roughly this amount of parallelism or less. The multiprogramming structure of the machine allows the entire machine to be fully utilized while executing individual programs that could not effectively utilize it. Our hardware-based real-time scheduling will allow less compile-time analysis, and it allows non-deterministic breaks, if they are sparse enough, to occur without degrading utilization.
We did not start out to design a machine to accommodate arbitrary FORTRAN programs of the type measured, and we would not propose that the machine be devoted to an essentially random mix of FORTRAN programs. The most cost-effective way to execute a small FORTRAN program is to find the smallest available minicomputer that will accommodate it. Showing that such a structure can be effectively utilized on such a random mix of jobs guarantees that it works for the worst cases it is likely to encounter. Our original aim was to design a good, flexible, easy-to-program parallel computer based to a large degree on our experience and intuition obtained from working with ILLIAC IV and thinking about other parallel computers. Thus, for example, our independent vector and scalar execution units evolved directly from problems in programming ILLIAC. Just as the FORTRAN program measurements were intended as a sort of benchmark establishing a minimum degree of utilizable parallelism over a broad class of problems, we use them here as a minimum benchmark of this machine's performance.

One objective of our work is simply not measurable: to provide a machine that is easy to program. It is my belief that one of the major difficulties in using current parallel machines effectively is that few people understand how to program them. Our primary inspiration for this approach was the B5500 machines and their use of hardware to handle many of the tedious details of programming, and to do so in an execution-time dynamic way. The ultimate measurement of the value of that approach as it was applied in those machines was the economic success of a machine that, if it were rated on a multiplies-per-dollar basis, would come out very poorly. Because the problems of programming parallel machines are significantly more complex, such hardware aids seem to us to be even more desirable for them. In summary, the design can effectively exploit the parallelism that has been measured in a broad class of problems. It has a great many features that should significantly ease the burden of exploiting parallelism.

SPECIFIC RESULTS

The result of this work is not a detailed plan for constructing a computer, but rather the development of a general approach and techniques for implementing that approach. The detailed design work and its relationship to measures on FORTRAN programs is intended as a justification that the approach and techniques are practical and effective. Here we will sort out which of our techniques and approaches appear particularly successful and which areas call for additional study.

The generalization of the technique for designing a carry look-ahead adder seems to be a useful technique for designing fast, complex combinatorial circuits. This is the technique described in Section 4.6.3.2.7. One area where this technique might be productively employed is in providing real-time dynamic control for a multi-level crossbar switch as defined in [4]. This unit allows the arbitrary permutation of a vector in an extremely cost-effective way, but it requires a highly complex scheduling algorithm. If one could construct a relatively inexpensive combinatorial circuit to schedule such a network, one would probably have the ideal crossbar switch for large applications. This scheduling problem may well be suited to the type of analysis we developed. One could begin by designing the logic for a two-level 4x4 crossbar, then gradually move up to higher levels. The analysis technique would certainly provide a reasonable hardware scheduling algorithm for the initial small switches, and the intuition developed might well lead to generalizations valid for larger switches.

Most of the logic design we have done is certainly far from optimal. Our circuitry for very fast conflict resolution may be an exception. It requires few gates, is extremely fast, and we have proven that it can be
The analysis technique would certainly provide a reasonable hard- ware scheduling algorithm for the initial small switches and the intuition developed might well lead to generalizations valid for larger switches. Most of the logic design we have done is certainly far from optimal. Our circuitry for very fast conflict resolution may be an exception. It requires few gates, is extremely fast, and we have proven that it can be 230 generalized to an arbitrary number of units in possible conflict. We have used it in a great many situations in our overall design. This unit is described in Section 4.4.2.4. The observations about block structure and universal building blocks in Section 3.1 seem to us to be particularly significant and an area re- quiring much further development. Certainly designing a set of basic building block ICs is a problem of major significance for the super- computers of the future. We have made some very preliminary steps in that direction. The concept of instruction level multiprogramming seems to be useful in the environment we have employed it. The advent of cheap mini and microcomputers has certainly greatly reduced the need for multiprogramming. It does seem to us to be an important feature for very large parallel com- puters for two reasons. First, a great many runs on such computers will be short debugging runs. The availability of the machine for such pur- poses can be a very critical factor in program development time. Multi- programming can allow short high-priority jobs to be run while longer production jobs are also using the machine. The second reason is provid- ing two independent processes may be an effective way to program some large tasks. Providing hardware to execute these is desirable. Instruction level multiprogramming is particularly nice in that there is no overhead involved in swapping out programs from registers. The operating system control resides in a processor entirely independent from the various arithmetic units, and as long as any MID is feeding them instructions, they can be utilized efficiently. 231 LIST OF REFERENCES 1 Budnik, P. P., "Tranquil Arithmetic," M.S. Thesis, University of Illinois, 1969. 2 Budnik, P. P., "An Intuitive Interpretation ofthe Hyperarithmetic Sets," talk presented at the Spring 1972 meeting of the Association for Symbolic Logic, abstract printed in the Journal of Symbolic Logic , volume 37, number 4, p. 778, December 1972. 3 Davis, E. W. , "A Multiprocessor for Simulation Applications," Ph.D. Thesis, University of Illinois at Urbana-Champaign, Department of Computer Science Report No. 527, June 1972. 4 Kuck, D. J., D. H. Lawrie, and Y. Muraoka, "Interconnection Networks for Processors and Memories in Large Systems," COMPCON 72 Digest of Papers , pp. 131-134. 5 Kuck, D. J., Y. Muraoka, and S. C. Chen, "On the Number of Operations Simultaneously Executable in FORTRAN-Like Programs and Their Result- ing Speedups," IEEE Trans, on Computers , volume 21, number 12, December 1972, pp. 1293-1310. 6 Muraoka, Y. , "Parallelism Exposure and Exploitation in Programs," Ph.D. Thesis, University of Illinois at Urbana-Champaign, Department of Computer Science Report No. 424, 1971. 7 Rogers, H. , Theory of Recursive Functions and Eff ective Computability , McGraw Hill, 1967. ~~~ 8 Tomasulo, R. M. , "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of Res, and Devel . , volume 11, number 1 January 1967. 9 Turn, Rein, Computers in the 1980s , Columbia University Press, 1974. 
APPENDIX A

DETAILED LOGIC FOR VECTOR EXECUTION UNIT SELECTOR

This appendix describes in detail the unit outlined in Section 4.6.3.2.7. The notational conventions and the structure of this appendix are explained in Section 4.6.3.2.7. The following conventions will be observed throughout this appendix:

1. Superscript i ranges over (0,1,2,3) and refers to an instruction.
2. Superscript n ranges over (0,1,2,3) and refers to a VEU.
3. Superscripts or subscripts may be omitted when they are uniform and unambiguous throughout an equation.
4. All weight registers are 5 bits wide.
5. value(X_j) indicates the value of the binary integer defined by Boolean values X_0, X_1, ..., X_4; X_0 has the highest significance.

In addition, the following notation will be used:

U_j^n = bit j of the amount added to the size of the queue for VEU n.

A-1 QUEUE WEIGHT REGISTERS AND ADDERS

Inputs

U_j^n
d^n = indicates that queue n is to be decremented by 1.

Function

value(U_j^n) - d^n must be added to the contents of each queue weight register a_j^{ln}, where l = (0,1,...,5) ranges over the 6 queue weights for a single queue.

Algorithm

We will use a serial counter and a serial adder cascaded together.

Equations

t_j indicates bit j of the counter output.
ct_j indicates the carry from place j of the counter.
ar_j is the new value of a_j^{ln}.
c_j indicates the carry from bit j in producing the output.

t_0 = U_0 \bar{d} \vee \bar{U}_0 d
ct_0 = \bar{U}_0 d
t_j = U_j \overline{ct}_{j-1} \vee \bar{U}_j ct_{j-1}        (j = 1,2,3)
ct_j = \bar{U}_j ct_{j-1}        (j = 1,2,3)

ar_0 = t_0 \bar{a}_0 \vee \bar{t}_0 a_0
c_0 = t_0 a_0
ar_j = t_j \bar{a}_j \bar{c}_{j-1} \vee \bar{t}_j a_j \bar{c}_{j-1} \vee \bar{t}_j \bar{a}_j c_{j-1} \vee t_j a_j c_{j-1}        (j = 1,2,3)
c_j = t_j a_j \vee t_j c_{j-1} \vee a_j c_{j-1}        (j = 1,2,3)

Logic Levels: 6
Gates: 115

(These are just standard counter and adder circuits. We include them as an example of our notation and because we wish to design this unit in complete detail.)

A-2 WEIGHT SELECTION LOGIC

Inputs

Y^{ik} indicates that operand k of instruction i is unknown (k = 0,1).
X_j^{ik} = bit j of the queue address where operand k of instruction i resided (j = 0,1,2,3). We will assume that (X_0^{ik} \vee X_1^{ik}) implies the operand is not assigned to one of these four VEUs.
A_j^i = bit j of the VEU assigned to instruction i.

Outputs

W_l^{in} indicates whether weight l for instruction i, queue n is to be switched. l has the following meanings:

Number of Operands from    Instruction i       Instruction i not
Instruction i in Queue n   Assigned to VEU n   Assigned to VEU n
0                          W_0^{in}            W_3^{in}
1                          W_1^{in}            W_4^{in}
2                          W_2^{in}            W_5^{in}

An unknown operand will count as being in the assigned VEU.

Algorithm

First we generate:

P_k^{in}, which indicates whether operand k of instruction i is in queue n.
AP_k^{in}, the same as P_k^{in} but counting an unknown operand as assigned to queue n.
AA^{in}, which indicates whether value(A_j^i) = n.

We will then use these to generate the W_l^{in} in a fairly obvious way.

P_k^{in} = [value(X_j^{ik}) = n]; for operand 0:

P_0^{i0} = \bar{X}_0^{i0} \bar{X}_1^{i0} \bar{X}_2^{i0} \bar{X}_3^{i0}
P_0^{i1} = \bar{X}_0^{i0} \bar{X}_1^{i0} \bar{X}_2^{i0} X_3^{i0}
P_0^{i2} = \bar{X}_0^{i0} \bar{X}_1^{i0} X_2^{i0} \bar{X}_3^{i0}
P_0^{i3} = \bar{X}_0^{i0} \bar{X}_1^{i0} X_2^{i0} X_3^{i0}

P_1^{in} = [value(X_j^{i1}) = n]; the expansion is similar to the above.

AP_0^{in} = [value(X_j^{i0}) = n] \vee Y^{i0}
AP_1^{in} = [value(X_j^{i1}) = n] \vee Y^{i1}
AA^{in} = [value(A_j^i) = n]

W_0^{in} = AA^{in} \overline{AP}_0^{in} \overline{AP}_1^{in}
W_1^{in} = AA^{in} AP_0^{in} \overline{AP}_1^{in} \vee AA^{in} \overline{AP}_0^{in} AP_1^{in}
W_2^{in} = AA^{in} AP_0^{in} AP_1^{in}
W_3^{in} = \overline{AA}^{in} \bar{P}_0^{in} \bar{P}_1^{in}
W_4^{in} = \overline{AA}^{in} P_0^{in} \bar{P}_1^{in} \vee \overline{AA}^{in} \bar{P}_0^{in} P_1^{in}
W_5^{in} = \overline{AA}^{in} P_0^{in} P_1^{in}

Logic Levels: 2
Gates: 114

A-3 INCREMENT AND MIN SELECTOR DETAILS

Inputs

a_j^{in} = bit j of a queue weight from the weight selection switch.
U_j^n

Functions

1. Subtract U_j^n from each of the 4 weights.
2. Find the minimum weight.
3. Subtract the minimum weight from all weights.
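Before developing the gate-level equations, the intended behavior of this unit can be restated in software form. The sketch below is a behavioral model only, holding the 5-bit weights as ordinary unsigned values; the real unit performs all three steps in a few levels of combinational logic, and the function name is our own.

```c
/* Behavioral model of the A-3 unit: apply the u adjustment to each
 * of the four queue weights, find the minimum, and subtract the
 * minimum from every weight so that the smallest becomes zero.
 * Masking to 5 bits mirrors the 5-bit weight registers. */
void adjust_and_normalize(unsigned w[4], const unsigned u[4])
{
    unsigned min;
    for (int n = 0; n < 4; n++)
        w[n] = (w[n] + u[n]) & 0x1f;   /* function 1: apply adjustment */
    min = w[0];
    for (int n = 1; n < 4; n++)        /* function 2: find the minimum */
        if (w[n] < min)
            min = w[n];
    for (int n = 0; n < 4; n++)        /* function 3: normalize        */
        w[n] -= min;
}
```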
Functions 1 and 2 are combined in one set of equations. We will discuss these functions first.

Algorithm

We will break the operation into parts by computing the following intermediate values:

b_j^{in} = bit j of value(U^n) + value(a^{in}).
c_j = the carry from bit j used in computing b_j^{in}.
ASZ_k^{in} indicates that value(a_0^{in}, a_1^{in}) is at most k.
BSZ_k^{in} indicates that value(b_0^{in}, b_1^{in}, b_2^{in}) is at most k.
CSZ_k^{in} indicates that value(b_3^{in}, b_4^{in}) is at most k.
SB^{in} = b_3^{in} \vee b_4^{in}
M_j^{in} indicates that, counting only bits 0 through j, b^{in} is a minimum over n = (0,1,2,3).
MCZ_l^{in} indicates that M_2^{in} \wedge [value(b_3^{in}, b_4^{in}) is at most l].

Time versus variable computed:

Logic Level   Variables
1             c_4, b_4, c_2, ASZ_k, b_2
2             b_1, b_3, BSZ_k, c_1, SB
3             b_0, CSZ_k, M_2, MCZ_l

Equations, Level = 1:

c_4 = U_4 a_4
b_4 = U_4 \bar{a}_4 \vee \bar{U}_4 a_4
c_2 = U_2 a_2 \vee a_2 U_3 a_3 \vee a_2 a_3 a_4 U_4 \vee a_2 U_3 a_4 U_4

The above uses value(U_j) at most 4.

ASZ_0 = \bar{a}_0 \bar{a}_1
ASZ_1 = \bar{a}_0
ASZ_2 = \bar{a}_0 \vee \bar{a}_1
ASZ_3 = TRUE

b_2 = a_2 \bar{U}_2 \bar{U}_3 \bar{a}_3 \vee a_2 \bar{U}_2 \bar{U}_3 \bar{U}_4 \vee a_2 \bar{U}_2 \bar{U}_3 \bar{a}_4 \vee a_2 \bar{U}_2 \bar{a}_3 \bar{U}_4 \vee a_2 \bar{U}_2 \bar{a}_3 \bar{a}_4 \vee \bar{a}_2 U_2 \vee \bar{a}_2 U_3 a_3 \vee \bar{a}_2 U_3 U_4 a_4 \vee \bar{a}_2 a_3 U_4 a_4

Equations, Logic Level = 2:

b_1 = a_1 \bar{c}_2 \vee \bar{a}_1 c_2
b_3 = a_3 \bar{c}_4 \bar{U}_3 \vee \bar{a}_3 c_4 \bar{U}_3 \vee \bar{a}_3 \bar{c}_4 U_3 \vee a_3 c_4 U_3
c_1 = a_1 c_2

BSZ_0 = ASZ_0 \bar{c}_2 \bar{b}_2
BSZ_1 = ASZ_0 \bar{c}_2
BSZ_2 = ASZ_0 \vee ASZ_1 \bar{c}_2 \bar{b}_2
BSZ_3 = ASZ_0 \vee ASZ_1 \bar{c}_2
BSZ_4 = ASZ_1 \vee ASZ_2 \bar{c}_2 \bar{b}_2
BSZ_5 = ASZ_1 \vee ASZ_2 \bar{c}_2
BSZ_6 = ASZ_2 \vee \bar{b}_2
BSZ_7 = TRUE

SB = b_4 \vee a_3 \bar{c}_4 \bar{U}_3 \vee \bar{a}_3 c_4 \bar{U}_3 \vee \bar{a}_3 \bar{c}_4 U_3 \vee a_3 c_4 U_3

Equations, Level = 3:

b_0 = a_0 \vee c_1
CSZ_k = the same as ASZ_k, but with b_3, b_4 replacing a_0, a_1.

M_2^{in} = BSZ_0^{in} \vee \sum_{k=1}^{6} BSZ_k^{in} \prod_{m=0, m \ne n}^{3} \overline{BSZ}_{k-1}^{im}

MCZ_0^{in} = \overline{SB}^{in} [BSZ_0^{in} \vee \sum_{k=1}^{6} BSZ_k^{in} \prod_{m \ne n} \overline{BSZ}_{k-1}^{im}]
MCZ_1^{in} = \bar{b}_3^{in} [BSZ_0^{in} \vee \sum_{k=1}^{6} BSZ_k^{in} \prod_{m \ne n} \overline{BSZ}_{k-1}^{im}]
MCZ_2^{in} = (\bar{b}_3^{in} \vee \bar{b}_4^{in}) [BSZ_0^{in} \vee \sum_{k=1}^{6} BSZ_k^{in} \prod_{m \ne n} \overline{BSZ}_{k-1}^{im}]
MCZ_3^{in} = M_2^{in}

Equations, Level = 4:

M_4^{in} = M_2^{in} CSZ_0^{in} \vee M_2^{in} \sum_{l=1}^{3} CSZ_l^{in} \prod_{m=0, m \ne n}^{3} \overline{MCZ}_{l-1}^{im}

We still need to perform the subtraction of the minimum from all weights.

Algorithm

1. Select the minimum using the M_4^{in}.
2. Subtract the minimum.

Equations

d_j = bit j of the minimum:

d_j = \sum_{n=0}^{3} b_j^{in} M_4^{in}

We wish to do the subtraction in two logic levels. Let a_j be bit j of the result and c_j the borrow from bit j:

a_4 = b_4 \bar{d}_4 \vee \bar{b}_4 d_4
c_4 = \bar{b}_4 d_4
c_3 = \bar{b}_3 d_3 \vee \bar{b}_3 c_4 \vee d_3 c_4

aa_j = a_j assuming \bar{c}_3; ac_j = a_j assuming c_3.

aa_0 = b_0 \bar{d}_0 \bar{w} \vee b_0 d_0 w, where w = \bar{b}_1 d_1 \vee \bar{b}_1 \bar{b}_2 d_2 \vee d_1 \bar{b}_2 d_2 is the borrow out of bit 1 assuming \bar{c}_3.
ac_0 = b_0 \bar{d}_0 \bar{w}' \vee b_0 d_0 w', where w' is the same borrow computed assuming c_3.

The above two equations take advantage of the fact that we are subtracting the minimum and no negative result is possible.

aa_1 = b_1 \bar{d}_1 b_2 \vee b_1 \bar{d}_1 \bar{d}_2 \vee \bar{b}_1 d_1 b_2 \vee \bar{b}_1 d_1 \bar{d}_2 \vee \bar{b}_1 \bar{d}_1 \bar{b}_2 d_2 \vee b_1 d_1 \bar{b}_2 d_2
ac_1 = b_1 \bar{d}_1 b_2 \bar{d}_2 \vee \bar{b}_1 d_1 b_2 \bar{d}_2 \vee b_1 d_1 \bar{b}_2 \vee b_1 d_1 d_2 \vee \bar{b}_1 \bar{d}_1 \bar{b}_2 \vee \bar{b}_1 \bar{d}_1 d_2
aa_2 = b_2 \bar{d}_2 \vee \bar{b}_2 d_2
ac_2 = b_2 d_2 \vee \bar{b}_2 \bar{d}_2

a_3 = b_3 \bar{d}_3 \bar{c}_4 \vee \bar{b}_3 d_3 \bar{c}_4 \vee \bar{b}_3 \bar{d}_3 c_4 \vee b_3 d_3 c_4
a_2 = aa_2 \bar{c}_3 \vee ac_2 c_3
a_1 = aa_1 \bar{c}_3 \vee ac_1 c_3
a_0 = aa_0 \bar{c}_3 \vee ac_0 c_3

Logic Levels:
  Increment and Select Min   4
  Switch Min                 1
  Subtract Min               2
  Total                      7

Gates:
  Increment and Select Min   4976
  Switch Min                 160
  Subtract Min               1728

A-4 INCREMENT AND DECODER DETAILS

Inputs

a_j^{in} as finally computed in Section A-3.
u_j^n

Function

We need to perform the following three functions:
1. Compute value(b^{in}) = value(a^{in}) - value(u^n).
2. Normalize the result so that the smallest is 0; call the normalized bits an_j^{in}.
3. Compute Bim_n = [value(an^{in}) = m] for (i,m) = (0,0), (1,0), (2,0), (2,1), (3,0), (3,1), (3,2). B00_n and B10_n will also be written as B0_n and B1_n.

Algorithm

Taking advantage of value(u_j^n) being at most 4, we will compute the b^{in} in two levels of logic. To do the normalization fast, we will actually compute b^{in,l}, where value(b^{in,l}) = value(b^{in}) - l, l = (0,1,2,3,4). We will detect the smallest l for which an overflow occurs and switch that set of b_j out as the an_j. We will then generate the Bim_n in the obvious way in one clock.

Equations

The equations for computing the b^{in,l} are similar to those for subtracting the minimum in Section A-3, and we will not describe them in detail. The switch is also standard, so we will not describe it either.

Gates:
  Subtract u_j^n and offsets   8640
  Switch correct offset        800
  Compute Bim_n                168
  Total                        9608

A-5 FINAL SELECTION UNIT

Inputs

The Bim_n generated as described in Section A-4.

Functions

Select the minimum weights for up to four instructions, taking into account the fact that the queue selected by instruction i must have one added to it before determining the queue for instruction i+1.

Algorithm

Output: Ri_n, which indicates that instruction i is to use queue n.

R0_n is set true for the minimum n such that B0_n is true.

R1_n is set true for the minimum n such that \overline{R0}_n B1_n. If there is no such n, then R1_n is set true for the unique n such that B1_n.

R2_n is set depending on the following:

1. If \exists n[B20_n(\overline{R0}_n \vee \overline{R1}_n)], then the minimum n satisfying B20_n \overline{R0}_n \overline{R1}_n is chosen. If none exists, then the minimum n such that B20_n is chosen.
2. If \forall n[B20_n \rightarrow R0_n R1_n], then the minimum n such that B21_n \overline{R0}_n is chosen. If none exists, the unique n such that B20_n is chosen.

R3_n is set depending on the following conditions:

1. \exists n[B30_n(\overline{R0}_n \overline{R1}_n \vee \overline{R1}_n \overline{R2}_n \vee \overline{R0}_n \overline{R2}_n)]. The minimum n such that B30_n \overline{R0}_n \overline{R1}_n \overline{R2}_n is chosen. If none exists, then the minimum n such that B30_n is chosen.
2. \forall n[B30_n \rightarrow (R0_n R1_n \overline{R2}_n \vee R0_n \overline{R1}_n R2_n \vee \overline{R0}_n R1_n R2_n)]. Note that both this and the next condition imply that B30_n is unique; also, \overline{B30}_n \rightarrow (B31_n \vee B32_n). The minimum n such that B31_n \overline{R0}_n \overline{R1}_n \overline{R2}_n is chosen. If none exists, then the unique n such that B30_n is chosen.
3. \forall n[B30_n \rightarrow R0_n R1_n R2_n]. The minimum n such that B31_n is chosen. If none exists, then the minimum n such that B32_n is chosen. If none exists, the unique n such that B30_n is chosen.

Equations

For a description of the notation and conventions used in generating these equations, see Section 4.6.3.2.6.

R0_n

Detection: only one case.
Selection:

R0_n = B0_n \prod_{j=0}^{n-1} \overline{B0}_j        (Level = 1, Gates = 10)

R1_n

Case 1.1: \exists n[B1_n \prod_{j \ne n} \overline{B1}_j], i.e., only one B1_n is true.

Detection and selection:

B1_n \prod_{j=0, j \ne n}^{3} \overline{B1}_j        (Gates = 20)

Case 1.2: \exists n[B1_n \sum_{j \ne n} B1_j], i.e., more than one B1_n is true.

Detection: there is no need to detect this case. We always choose a true B1_n; since in case 1.1 B1_n is unique, we cannot make a selection in conflict with case 1.1.

Selection:

Case 1.2.1: \exists n[B1_n \overline{B0}_n \prod_{j=0}^{n-1} \overline{B1}_j].

Detection and selection: clearly, we can select the unique B1_n satisfying the above. Thus we have

B1_n \overline{B0}_n \prod_{j=0}^{n-1} \overline{B1}_j        (Gates = 18)

Case 1.2.2: \forall n[\overline{B1}_n \vee B0_n \vee \sum_{j=0}^{n-1} B1_j], i.e., the first true B1_n occurs with a true B0_n.

Detection and selection: in this case we must assure ourselves that there is a smaller n with B0_n true before we can make the selection. To insure that the selection will be unique, we must choose the first B1_n with a smaller B0:

B1_n [(\sum_{j=0}^{n-1} B0_j)(\prod_{j=0}^{n-1} \overline{B1}_j) \vee \sum_{j=0}^{n-1} (B1_j B0_j \prod_{k=0}^{j-1} \overline{B0}_k)]        (n = 1,2,3)
To insure that the selection will be unique, we must choose the first Bl with a n smaller BO n n-1 n-1 (n = 1,2,3) Bl [( Z BO.) (77 Bl .) v n j=0 J j=0 J n-1 j-1 Z (Bl. BO. 7T BO. )] j=0 J J k=0 J 250 Gates = 8 n=l 15 n=2 2-3 n=3 46 Total Summary for Rl R1 n = B1 n 3 TT Bl . V j=0 J #1 B1 n n-1 B0 M TT Bl . n j=o J B1 n n-1 ( Z BO.) ( j=0 J n-1 TT Bl. ) v j=0 J n-1 j-1 Bl z (Bl. BO. TT BO. ) n j=0 J J k=0 J (note for the last two terms, n = 1,2,3) (Gates = 84, Level = 1) 2 R2 "n 2.1 Vn[RO n = Rl n J Detection D2.1 = I RO Rl (Gates = 12, Level = 2) i=0 n n Selection: 2.2.1 Vi[B20 n ■* ROn] 251 Detection D2.1-1 = 7T (B20 n v R0 n ) (Gates = 20, Level = 2) This gate count takes advantage of R0 RO . = n=j Selection: 2.1.1.1 3n[B21 n ] Detection: 3 D2. 1.1.1 = z B21 (Gates = 4, Level = 1) n=0 n Selection: n-1 S2. 1.1.1 = B2.1 7T B2.1. (Gates = 40, Level = 1) n n j=0 J 2.1.1.2 V [B21 ] n L n J Detection: D2. 1.1.1 Selection B20. n 2.1-2 3n[B20 M R0~] n n Detection: D2.1.1 Selection: n-1 S.2.1.2 n = B20 n 7T (B20. v R0 n ) (Gates = 25, Level = 2) "n • n . =0 v j n' 252 This gate count takes advantage of RO RO. = n=j J 2.2 3n[R0 n Rl n ] Detection: D2.1 Selection: 2.2-1 Vn[B20 * (RO v Rl )1 L n v n n /J Detection: D2.2.2 Selection: First get B20 f RO 3 n n n-1 TS2.2.2 n = B20jB0 n v B0„ tt BO J (Gates = 26, Level = 1) •n n L n n j=Q w j n-1 S2.2.2 n = TS2.2.1 n Rl n tt (TS2.2.1 . v Rl .) (Gates = 30, Level = 2) Selection: n-1 S2.2-1 = B20„ tt B20. n n j=0 J 2.2.2 3n[B20 RO Rl J n n n J 253 Detection: D2.2-2 = Z B20 RO Rl n=0 n n n Summary 2 R2 n = D2.1 D2. 1.1 D2. 1.1.1 S2. 1.1.1 v D2.1 D2.1.1 D2. 1.1.1 B20 v n D2.1 D2.1.1 S2.1.2 v n D2.1 D2.2.2 S2.2.1 v n D27T D2.2.2 S2.2.2 n (Gates = 22, Level = 3) 3.1 Vn(RO n Rl n v Rl n R2 n v R0 n Rl n ) i.e., for each of RO, Rl , R2, they are true for different values of n. No two of them are true for the same n. Detection: D27T means RO f Rl T3,1 n = RO n v R1 n (Gates = 8 » Level = 2 ) D3.1 = D2.1 I (T3.1 R20.) (Gates = 16, Level = 2) n=0 n n Selection: 3.1.1 Vn(B30„ -► R0 n v Rl v R2 ) n n n n Detection: D3.1.1 = I B30 RO Rl R2 n (Gates = 20, Level = 4) n=0 n n n n 254 Selection Choose first B30 n which is true. n-1 j=o S3.1.1 n = B30 n _ t 7r rt B30j (Gates = 10, Level = 1) 3.1.2 3n ( B3 ° n ^^^n~) Detection: D3.1.1 Selection: We must choose first B30 n not equal RO or Rl or R2 . First n n n n we get the B30 n not equal RO and Rl . TS3.1.2 n = B30 n RTRr (Gates = 12, Level = 2) n-1 S3.1.2 n = TS3.1.2 n tt CTS2.1.2, v R20.) (Gates = 25, Level = 4) 3.2 3n (R0 n Rl n v Rl n R2 n v R0 n Rl n ) Two or more of RO, Rl , R2 agree. Detection: D3.1 Selection : 3.2.1 3n R0„ Rl R2 n n n 255 Detection: D3.2.1 = E R() Rl R2 n=0 n n n (Gates = 16, Level = 4) Selection: 3.2.1.1 3n(B30 n v R0 n ) Detection: D3.2.1.1 = Z B30 RO n=0 n n (Gates = 12, Level = 2) Selection: S3. 2. 1.1 = R0 n B30 n 7T (RO v B30 ) n n n m=Q m m ' 3.2.1.2 Vn(B30 + RO ) n n Detection (Gates = 96, Level = 2) D3.2.1.2 Selection 3.2.1.2-1 Vn B31 Detection: D3. 2. 1.2-1 = 77 B31 n=0 (Gates = 4, Level = 1) Selection: 3.2.1.2-1.1 3n B32 256 Detection: 3 D3. 2. 1.2-1.1 = z B32 n (Gates = 4, Level = 1) n=0 " Selection: ' n-1 S3. 2. 1.2-1.1 = B32„ A tt B32 n n n n m=0 (Gates = 40, Level = 1) 3.2.1.2.1.2 Vn B3T" n Detection: D3. 2. 1.2-1.1 Selection R30 i 3.2.1.2-2 3n B31 n Detection: D3. 2. 1.2-2 Selection: n-1 B31 tt B31 n m=0 n Summary for 3.2.1.2 S3. 2. 1.2 = D3. 2. 1.2-1 A D3. 2. 1.2-1.1 A S3. 2.1. 2-1.1 v " n D3. 2. 1.2-1 A D3. 2. 1.2-1.1 A R30. v _____ ""I D3. 2. 
Selection: choose the first B30_n which is true:

S3.1.1_n = B30_n \prod_{j=0}^{n-1} \overline{B30}_j        (Gates = 10, Level = 1)

Case 3.1.2: \exists n[B30_n (R0_n \vee R1_n \vee R2_n)].

Detection: D3.1.1.

Selection: we must choose the first B30_n equal to none of R0, R1, R2. First we get the B30_n equal to neither R0 nor R1:

TS3.1.2_n = B30_n \overline{R0}_n \overline{R1}_n        (Gates = 12, Level = 2)
S3.1.2_n = TS3.1.2_n \overline{R2}_n \prod_{j=0}^{n-1} (\overline{TS3.1.2}_j \vee R2_j)        (Gates = 25, Level = 4)

Case 3.2: \exists n(R0_n R1_n \vee R1_n R2_n \vee R0_n R2_n), i.e., two or more of R0, R1, R2 agree.

Detection: \overline{D3.1}.

Selection:

Case 3.2.1: \exists n[R0_n R1_n R2_n].

Detection:

D3.2.1 = \sum_{n=0}^{3} R0_n R1_n R2_n        (Gates = 16, Level = 4)

Selection:

Case 3.2.1.1: \exists n(B30_n \overline{R0}_n).

Detection:

D3.2.1.1 = \sum_{n=0}^{3} B30_n \overline{R0}_n        (Gates = 12, Level = 2)

Selection:

S3.2.1.1_n = \overline{R0}_n B30_n \prod_{m=0}^{n-1} (R0_m \vee \overline{B30}_m)        (Gates = 96, Level = 2)

Case 3.2.1.2: \forall n(B30_n \rightarrow R0_n).

Detection: \overline{D3.2.1.1}.

Selection:

Case 3.2.1.2-1: \forall n[\overline{B31}_n].

Detection:

D3.2.1.2-1 = \prod_{n=0}^{3} \overline{B31}_n        (Gates = 4, Level = 1)

Selection:

Case 3.2.1.2-1.1: \exists n[B32_n].

Detection:

D3.2.1.2-1.1 = \sum_{n=0}^{3} B32_n        (Gates = 4, Level = 1)

Selection:

S3.2.1.2-1.1_n = B32_n \prod_{m=0}^{n-1} \overline{B32}_m        (Gates = 40, Level = 1)

Case 3.2.1.2-1.2: \forall n[\overline{B32}_n].

Detection: \overline{D3.2.1.2-1.1}.
Selection: B30_n.

Case 3.2.1.2-2: \exists n[B31_n].

Detection: \overline{D3.2.1.2-1}.
Selection:

B31_n \prod_{m=0}^{n-1} \overline{B31}_m

Summary for 3.2.1.2:

S3.2.1.2_n = D3.2.1.2-1 D3.2.1.2-1.1 S3.2.1.2-1.1_n \vee D3.2.1.2-1 \overline{D3.2.1.2-1.1} B30_n \vee \overline{D3.2.1.2-1} B31_n \prod_{m=0}^{n-1} \overline{B31}_m        (Gates = 46, Level = 4)

Case 3.2.2: \forall n(\overline{R0_n R1_n R2_n}), i.e., R0, R1, R2 do not all agree.

Detection: \overline{D3.2.1}.

Selection:

Case 3.2.2.1: \exists n(B30_n \overline{R0}_n \overline{R1}_n \vee B30_n \overline{R1}_n \overline{R2}_n \vee B30_n \overline{R0}_n \overline{R2}_n), i.e., B30_n is true for an n where at most one of R0, R1, R2 is true.

Detection: first we get all B30_n equal to neither R0 nor R1:

TA3.2.2.1_n = B30_n \overline{R0}_n \overline{R1}_n        (Gates = 12, Level = 2)

In addition, we need:

TB3.2.2.1_n = B30_n \overline{R0}_n \vee B30_n \overline{R1}_n        (Gates = 24, Level = 2)

D3.2.2.1 = \sum_{n=0}^{3} (TA3.2.2.1_n \vee TB3.2.2.1_n \overline{R2}_n)        (Gates = 12, Level = 4)

Selection:

Case 3.2.2.1.1: \exists n(B30_n \overline{R0}_n \overline{R1}_n \overline{R2}_n).

Detection:

D3.2.2.1.1 = \sum_{n=0}^{3} B30_n \overline{R0}_n \overline{R1}_n \overline{R2}_n        (Gates = 20, Level = 4)

Selection: first we get all B30_n equal to neither R0 nor R1:

T3.2.2.1.1_n = B30_n \overline{R0}_n \overline{R1}_n        (Gates = 12, Level = 2)
S3.2.2.1.1_n = T3.2.2.1.1_n \overline{R2}_n \prod_{m=0}^{n-1} (\overline{T3.2.2.1.1}_m \vee R2_m)        (Gates = 23, Level = 4)

Case 3.2.2.1.2: \forall n(\overline{B30}_n \vee R0_n \vee R1_n \vee R2_n).

Detection: \overline{D3.2.2.1.1}.

Selection: from case 3.2.2.1 we know B30_n is true for an n where only one of R0, R1, R2 is true. From case 3.2.2 we know R0, R1, R2 agree for one value of n. Since each is true for exactly one n, there can only be a single n for which those conditions and the above condition hold. We will use previously generated terms:

S3.2.2.1.2_n = TA3.2.2.1_n \vee TB3.2.2.1_n \overline{R2}_n        (Gates = 16, Level = 4)

Case 3.2.2.2: \forall n(B30_n \rightarrow R0_n R1_n \vee R1_n R2_n \vee R0_n R2_n).

Detection: \overline{D3.2.2.1}.

Selection:

Case 3.2.2.2.1: \exists n(B31_n).

Detection:

D3.2.2.2.1 = \sum_{n=0}^{3} B31_n        (Gates = 4, Level = 1)

Selection:

Case 3.2.2.2.1.1: \exists n(B31_n \overline{R0}_n \overline{R1}_n \overline{R2}_n).

Detection:

D3.2.2.2.1.1 = \sum_{n=0}^{3} B31_n \overline{R0}_n \overline{R1}_n \overline{R2}_n        (Gates = 20, Level = 4)

Selection:

T3.2.2.2.1.1_n = \overline{R0}_n \overline{R1}_n B31_n
S3.2.2.2.1.1_n = \overline{R2}_n T3.2.2.2.1.1_n \prod_{m=0}^{n-1} (R2_m \vee \overline{T3.2.2.2.1.1}_m)        (Gates = 48, Level = 4)

Case 3.2.2.2.1.2: \forall n[B31_n \rightarrow (R0_n \vee R1_n \vee R2_n)].

Detection: \overline{D3.2.2.2.1.1}.

Selection: the B31_n satisfying the above must be unique. This is true because from case 3.2.2.2 we know R0, R1, R2 agree for some value of n; thus there is only one value of n for which exactly one of them is true. B30_n is true for the value where two of R0, R1, R2 agree, and B30_n \rightarrow \overline{B31}_n. Thus there is a unique n for which B31_n is true and exactly one of R0, R1, R2 is true.

TA3.2.2.2.1.2_n = B31_n \overline{R0}_n \overline{R1}_n        (Gates = 12, Level = 2)
TB3.2.2.2.1.2_n = B31_n \overline{R0}_n \vee B31_n \overline{R1}_n        (Gates = 24, Level = 2)
S3.2.2.2.1.2_n = TA3.2.2.2.1.2_n \vee TB3.2.2.2.1.2_n \overline{R2}_n        (Gates = 8, Level = 4)

Case 3.2.2.2.2: \forall n[\overline{B31}_n].

Detection: \overline{D3.2.2.2.1}.

Selection: there is a unique B30_n true, which agrees with two of R0, R1, R2. Select this n: B30_n.

Summary for R3:

R3_n = D3.1 \overline{D3.1.1} S3.1.1_n \vee D3.1 D3.1.1 S3.1.2_n \vee \overline{D3.1} D3.2.1 D3.2.1.1 S3.2.1.1_n \vee \overline{D3.1} D3.2.1 \overline{D3.2.1.1} S3.2.1.2_n \vee \overline{D3.1} \overline{D3.2.1} D3.2.2.1 D3.2.2.1.1 S3.2.2.1.1_n \vee \overline{D3.1} \overline{D3.2.1} D3.2.2.1 \overline{D3.2.2.1.1} S3.2.2.1.2_n \vee \overline{D3.1} \overline{D3.2.1} \overline{D3.2.2.1} D3.2.2.2.1 D3.2.2.2.1.1 S3.2.2.2.1.1_n \vee \overline{D3.1} \overline{D3.2.1} \overline{D3.2.2.1} D3.2.2.2.1 \overline{D3.2.2.2.1.1} S3.2.2.2.1.2_n \vee \overline{D3.1} \overline{D3.2.1} \overline{D3.2.2.1} \overline{D3.2.2.2.1} B30_n        (Gates = 48, Level = 5)

VITA

Paul Peter Budnik, Jr. was born in Chicago, Illinois in 1945. He received the Bachelor of Science in Physics degree from the University of Illinois in 1967 and the Master of Science in Computer Science degree from the same university in 1969. During the 1970 to 1971 academic year he was an Acting Assistant Professor at the University of California at Los Angeles.
During the 1971 to 1972 academic year he was employed by the University of Illinois on a project involving finding parallelism in FORTRAN programs. From 1973 to the present he has been employed by Systems Control, Incorporated. During this period he designed and implemented a correlation program on ILLIAC IV. Mr. Budnik is a member of the ACM, the IEEE, the Association for Symbolic Logic, the American Association for the Advancement of Science, and Sigma Xi.