Report No. UIUCDCS-R-75-763

TECHNIQUES FOR PARALLEL COMPUTER DESIGN

by

Paul Peter Budnik, Jr.

October 1975

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

This work was supported in part by the National Science Foundation under Grant No. US NSF DCR73-07980 A02 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, October 1975.

TECHNIQUES FOR PARALLEL COMPUTER DESIGN

Paul Peter Budnik, Jr., Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1975

In the future, major increases in computer performance must result primarily from architectural innovations involving increased parallelism. Large Scale Integrated circuits will provide the basic technology to make this possible. This thesis discusses various techniques for constructing parallel computers. It does so by applying those techniques to the design of a specific machine. These techniques include the determination of a basic set of building block components, the establishment of interface and timing structures, and the development of techniques for pipelining and parallelizing complex control functions. The entire machine is broken up into small functional units which are independently queue driven.

ACKNOWLEDGMENT

The author wishes to thank his advisor, Professor David J. Kuck, for ideas, criticism, and discussions, and for moral support over an extended period of time. Professor Duncan Lawrie also provided helpful comments and suggestions. Bernice Shimabukuro proofread this thesis, drafted the figures, and typed it. Her help was invaluable.

TABLE OF CONTENTS

1 INTRODUCTION
2 OVERALL STRUCTURE
3 BASIC BUILDING BLOCKS AND DESIGN TECHNIQUES
  3.1 BUILDING BLOCKS
    3.1.1 Motivation
    3.1.2 Basic Building Blocks
      3.1.2.1 Queues
      3.1.2.2 Controls
      3.1.2.3 Switches
      3.1.2.4 Access Controllers
      3.1.2.5 Descriptive Tables
      3.1.2.6 Traditional Components
    3.1.3 Block Interfaces
      3.1.3.1 Timing Structure
      3.1.3.2 Pipeline or Parallel Units
      3.1.3.3 Additional Advantages of the Interconnection and Timing Structures
        3.1.3.3.1 Error Detection and Correction
        3.1.3.3.2 Hardware Performance Monitoring
  3.2 GENERAL DISCUSSION OF DESIGN TECHNIQUES
    3.2.1 Pipeline and Parallel Design Techniques
      3.2.1.1 IUD Design Analysis
      3.2.1.2 Queuing Techniques
      3.2.1.3 Resolving Buffer Access Conflicts
    3.2.2 Tables
    3.2.3 Deadlock
  3.3 PARALLELISM - AN ABSTRACT DISCUSSION
4 COMPUTATION UNIT - DETAILED LOGICAL DESIGN
  4.1 OVERALL STRUCTURE
  4.2 FUNCTIONAL STRUCTURE
  4.3 SCALAR PORTION OF COMPUTATION UNIT
    4.3.1 Overall Structure of the Scalar Portion of Computation Unit
    4.3.2 Scalar Execution Units
      4.3.2.1 Scalar Instruction Sequencing
      4.3.2.2 Scalar Queues
      4.3.2.3 SEU Sequence Controller
    4.3.3 Scalar Execution Unit Buffers
    4.3.4 Scalar Switch
  4.4 VECTOR PORTION OF COMPUTATION UNIT
    4.4.1 Overall Structure of Vector Portion of Computation Unit
    4.4.2 Vector Execution Units
      4.4.2.1 Standard Arithmetic Units
      4.4.2.2 Vector Routers
      4.4.2.3 Other Vector Units
      4.4.2.4 Detailed Internal Structure of a VEU
    4.4.3 Vector Buffer
    4.4.4 Vector Switch
  4.5 MAIN MEMORY
  4.6 INSTRUCTION UNIT DISPATCHER
    4.6.1 Introduction
    4.6.2 IUD Functional Structure
      4.6.2.1 Data Rate Analysis
      4.6.2.2 Memory Operands and Results
      4.6.2.3 Scalar Operands and Results
      4.6.2.4 Vector Operands and Results
      4.6.2.5 Scalar EU Assignment
      4.6.2.6 Vector EU Assignment
      4.6.2.7 Generating Vector Switch and Internal Switch Queue Entries
      4.6.2.8 Generating Instructions for the EUs and Memory
    4.6.3 Logical Structure
      4.6.3.1 IUD Pipeline Structure
        4.6.3.1.1 OFFL Instruction Format
        4.6.3.1.2 OFFL Syntax
        4.6.3.1.3 Analysis of Pipeline Requirements
        4.6.3.1.4 Switching Instruction Components into the Pipe
        4.6.3.1.5 Global Structure of IUD Pipe
      4.6.3.2 Detailed Structure and Gate Counts for Internal IUD Pipe Functions
        4.6.3.2.1 Details of the Parallel Update of the Scalar Table (SPU)
        4.6.3.2.2 Generating Time Indexes
        4.6.3.2.3 Parallel Update of Vector Buffer Table (VPU)
        4.6.3.2.4 Searching the Scalar Table
        4.6.3.2.5 Searching the Vector Table (SVT)
        4.6.3.2.6 Selecting the Scalar Execution Unit (SSE)
        4.6.3.2.7 VEU Queue Selector (SVE)
        4.6.3.2.8 Reserve Vector Buffer Storage (RVS)
        4.6.3.2.9 Update Vector Tables and Fill in Vector Operands (UV and FVO)
    4.6.4 Tail End of Main IUD Pipe
      4.6.4.1 Assembling Instructions (AV, AS, AM)
      4.6.4.2 Initiating Transfer of Instructions (IM, ISC, IV)
      4.6.4.3 Generating Vector Switch Instructions (GSI)
    4.6.5 Scalar Instruction Dispatcher Subsystem
      4.6.5.1 SIDS Functional Summary
      4.6.5.2 Detailed Design of SIDS
    4.6.6 Vector Instruction Dispatcher Subsystem
  4.7 GATE COUNT SUMMARY
5 MACRO INSTRUCTION DECODER, I/O CONTROL AND EXTERNAL EXPANDABILITY
  5.1 MACRO INSTRUCTION DECODER
  5.2 PAGING DESCRIPTION
  5.3 EXTERNAL EXPANDABILITY
6 CONCLUSION
LIST OF REFERENCES
APPENDIX A DETAILED LOGIC FOR VECTOR EXECUTION UNIT SELECTOR
VITA

LIST OF TABLES

 1 MAJOR COMPONENT FUNCTIONS
 2 OFFL PROGRAM
 3 INSTRUCTION QUEUE OPERATION
 4 GATE COUNT FOR INSTRUCTION QUEUE
 5 SCALAR UNIT STATUS TABLES SPECIFICATIONS
 6 DESIGN PARAMETERS AND GATE COUNTS FOR SCALAR BUFFERS
 7 DATA AND INSTRUCTION PORTS FOR THE SCALAR SWITCH
 8 GATE COUNT FOR SCALAR SWITCH
 9 GATE COUNT FOR SCALAR ASSEMBLING UNIT
10 DATA BUFFER OPERATION
11 VEU GATE COUNT
12 VECTOR SWITCH PORTS
13 VECTOR SWITCH HARDWARE
14 MEMORY LOGIC SUMMARY
15 CLASS PAIRS
16 OPERAND DISTRIBUTION PAIRS
17 OFFL SYNTAX
18 OFFL INSTRUCTION CONSTRAINTS
19 INSTRUCTION STREAM CONSTRAINTS
20 PIPE COMPONENTS
21 LOGIC EQUATIONS AND GATE COUNTS FOR ASSIGNING INSTRUCTION NUMBERS
22 LOGIC EQUATIONS FOR THE CONTROL OF THE IUD FRONT END SWITCHES
23 GATE COUNT FOR IUD FRONT END
24 IUD PIPE FUNCTIONS, TIMINGS, AND DEPENDENCIES
25 IUD PIPE TIMING CHART
26 SCALAR COMPARISON TREE GATE COUNT
27 GATE COUNT FOR TIME INDEX GENERATORS
28 VECTOR BUFFER COMPARISON TREE GATE COUNT
29 VECTOR STATUS TABLE GATE COUNT
30 SEU QUEUE SELECTOR GATE COUNT
31 CONNECTIONS FROM OPERAND PIPES TO VEU QUEUE SELECTOR
32 GATE COUNTS FOR INDEXING OPERANDS AND PARTIAL INSTRUCTION DETECTION
33 TIMINGS FOR UPDATING WEIGHT SELECTION REGISTERS
34 TIMING AND GATE COUNT FOR VEU QUEUE SELECTION
35 LOGIC SUMMARY FOR ASSEMBLING INSTRUCTIONS
36 LOGIC FOR INITIATING INSTRUCTION TRANSFERS
37 LOGIC FOR GENERATING VECTOR SWITCH INSTRUCTIONS
38 SIDS FUNCTIONS
39 SPECIFICATIONS AND GATE COUNTS FOR SIDS TABLES
40 VIDS TABLE SPECIFICATIONS
41 COMPUTATION UNIT SUMMARY GATE COUNT

LIST OF FIGURES

 1 COMPUTATION NODE
 2 COMPUTATION UNIT
 3 PROGRAM TREE
 4 ERROR CORRECTION CONNECTIONS
 5 SCALAR PORTION OF EXECUTION UNIT
 6 SCALAR EXECUTION UNIT
 7 INSTRUCTION QUEUE
 8 ALGORITHMS FOR ACCESSING SCALAR STATUS TABLES
 9 SCALAR SWITCH DETAIL
10 VEU OVERALL STRUCTURE
11 ASSEMBLING SCALARS INTO A VECTOR
12 DATA BUFFER
13 8-WAY PRIORITY SELECTOR
14 NON-POWER-OF-2 PRIORITY SELECTOR
15 MEMORY ORGANIZATION
16 INTERNAL STRUCTURE OF MEMORY PAGE
17 IUD FRONT END
18 IUD PIPE OVERALL STRUCTURE
19 COMPARISON TREE
20 TIME INDEX LOGIC
21 VECTOR COMPARISON TREE
22 SEU QUEUE SELECTOR
23 LOGIC TO INDEX OPERANDS AND DETECT A PARTIAL INSTRUCTION
24 VEU QUEUE SELECTOR
25 RESULT PROCESSING PORTION OF UNIT TO RESERVE VECTOR STORAGE
26 VECTOR BUFFER STATUS DETAILS
27 TAIL END OF IUD PIPE VECTOR OPERATOR PORTION
28 SIDS FLOWCHARTS

1 INTRODUCTION

The brief history of computer design has been one of constant and rapid change. This change has centered on the technology of circuit design. Logic has grown cheaper, smaller, and faster at dramatic rates. With few exceptions, the art of computer design has been largely one of doing more and more of the same thing faster and faster. There have, of course, been many design innovations. However, one could undoubtedly explain the operation of any modern computer to Babbage's ghost in a day or two.

The nature of the game of computer design is changing radically. The cost and size of logic continue to decline at an accelerating rate. The speed of logic is nearing theoretical limits, with the speed of electrical signals being a major design consideration in all current high-speed computers. Thus, barring some extremely dramatic revolution in physics, logic is not going to get much faster. On the other hand, if one wants to obtain a perspective on the limits of cost and gate density for logic, one might compute the cost per gate and gate density of a pigeon brain. We have only the crudest ideas on how to effectively utilize the logic technology that currently exists. With each new advance in circuit technology, the depth of this ignorance increases. This thesis is one primitive attempt to begin to plumb these depths.

The problems associated with effectively utilizing this technology can be divided into two broad categories. The first is to determine what useful structures one can concoct using a very large number of gates. The second is to determine which of these structures can be practically implemented given the constraints of an existing technology, and how this may be done.
This thesis concentrates on the first of these problems and treats the second only in the most general sense. We justify this approach because simultaneously solving both problems is extraordinarily complex and time consuming. Determining useful theoretical designs must inevitably be the first step in generating practical designs. Further, many of the practical constraints are in a rapid state of change. The practical constraints that we do take into account are ones that we expect to exist in any technology. These include restrictions on fan-in and fan-out, and on logic levels per clock. In addition, we have adopted a general structural approach of attempting to find basic building blocks at a fairly high level of complexity. We discuss this approach and its relevance to IC technology in Chapter 3.

There are two broad categorizations of approaches to determining how to utilize this technology effectively. One is to consider existing hardware and software techniques and see how these may be expanded and generalized. The alternative is to start from scratch. Existing programming, hardware, and software techniques have evolved from the notion of a simple arithmetic unit, a set of memory registers, and a single control unit. This structure is a natural one for humans to understand and use. It is unlikely to be of universal significance for all data processing problems. It is, in fact, our belief that the problem of investigating useful computing structures is coextensive with the problem of investigating useful mathematical structures. We consider this open-ended approach to the problem to be of deep intellectual fascination. In Section 3.3, we discuss some observations about this approach. However, for pragmatic reasons, the bulk of this thesis takes the other approach. We will now define our approach and objectives in more detail.

Our overall goal is to design a good general-purpose, expandable, fast parallel computer. This statement of objective is both very vague and internally inconsistent. Parallelism implies a structuring of multiple computing elements. Inevitably some algorithms will fit the structure better than others. More parallelism implies less generality. However, we do know that eight arithmetic units can be effectively utilized by most FORTRAN programs [Kuck 5]. Thus, a limited degree of parallelism is not incompatible with a fairly general-purpose computer. The vagueness in our statement of overall objectives is intentional. Good computer design involves many complex and ill-defined factors. A precise statement of objectives is impossible. We can enumerate the important factors, and we do so by first dividing them into two broad categories of programmability and hardware.

A machine with good programmability should allow for the easy implementation of a wide range of languages. These should be easily expandable, both to facilitate the evolution of special-purpose languages and to parallel the expandability of the hardware. The languages should be efficient both in terms of the code they produce and in their own execution time. Programs should be easy to debug. The machine design should allow for the easy implementation of a powerful, efficient operating system. Facilities should be provided which ease the burden of managing a hierarchy of memories, and in many instances eliminate it entirely. Multiprogramming and multiprocessing features should be provided to facilitate fast turnaround on short jobs and to allow the system to efficiently handle a broad range of problems.
The hardware should be modular, reliable, expandable, cheap, and easy and inexpensive to maintain. These goals are still quite vague, and it is essential that they remain so. Computer design is more art than science, and to pretend otherwise can lead to disaster. The objectives of computer design cannot be precisely defined without gross oversimplification.

I should say a few words about the results of this research. It does not consist of any one technique or conclusion. Rather, it consists of showing how a number of techniques may be developed and integrated to produce a good machine.

Although our overall objectives must remain vague, we can be more specific about the techniques we will employ in meeting these objectives. Hardware designed for specialized purposes is potentially much more efficient than that designed for more general purposes. Our goal is to design a general-purpose computer. There are two ways in which specialized hardware can be employed in such a machine. First, we can directly implement some operating system and compiler functions. Secondly, we can allow for the inclusion of specialized but unspecified operational units. The great danger in designing specialized hardware is that it becomes extremely efficient at doing what nobody cares to have done. There do exist universal compiler and operating system functions that can be completely specified at machine design time and can thus benefit from specialized hardware. Operating system functions include interrupt processing, multiprogramming allocation of resources, and memory management. Obvious compiler functions for which some existing machines have specialized hardware include array indexing and subroutine calls. Functions which are candidates for such hardware include mapping the parallel structure of a language onto the parallel structure of a machine, hardware management of loops to minimize execution time nondeterminism, and anticipatory I/O scheduling.

There can be no way to anticipate what sort of specialized hardware may be desirable or even necessary for various applications. It is feasible to design a general-purpose computer that will allow for the later inclusion of various specialized pieces of hardware. To allow for this, and because it is a good basic technique, our overall design philosophy will involve the construction of various functional units. The interfaces between these units will be as simple, and at as high a level of abstraction, as seems practical. As an example, we will have units to perform unspecified operations on vectors of a fixed size and similar units which operate on scalars. The interfaces between these units and the rest of the machine will be limited to the operands and results and a minimum of information specifying whether or not the unit is in a position to accept operands and the exact operation to perform. This should allow for the evolution of specialized hardware.

As the complexity of any design project increases, it becomes important to break the problem up into a hierarchy of more tractable problems. Further, there are advantages to having this structure reflected in physical units. It would be particularly desirable if the lowest level of structure could be implemented on LSI chips, making a theoretical design constraint compatible with practical implementation constraints. One result of this research is the observation that a few functional units are required in many contexts and might be ideal candidates for a basic set of chips.
Along with the structured hierarchy of functions, we require simple, well defined interfaces between units. This constraint also serves to keep the design problem tractable and to improve the chances for a smooth LSI adaptation. Another advantage of such a structure is to ease hardware debugging and maintenance. In Section 3.1.3.3 we will discuss how this structure could facilitate the construction of a super-reliable computer. A final advantage of this structure is that any unit that meets the interface specifications can be plugged in at any time after the machine is built. This could allow for some use of new technologies in an existing machine, as well as facilitating the addition of the specialized hardware mentioned earlier. These ideas are simply good engineering practice and similar to those of structured programming.

2 OVERALL STRUCTURE

In this section we describe the overall structure that evolves from the objectives and techniques we have mentioned. We will first list the basic structural features and the objectives they are intended to meet, and then go on to describe these features in more detail.

We begin by discussing expandability. The basic "computer" we design will be called a Computation Node, and we will refer to external expandability and to internal expandability within a node. External expandability refers to the fact that these nodes will be especially well suited to being hooked together as a network of computers. We will briefly discuss this subject in a later chapter. The bulk of this thesis is concerned with the design of a single computation node. Internal expandability refers to the fact that our modular approach will allow for varying numbers of all the major control, memory, and computation portions of the machine. This is not fundamentally different from existing computers which allow for varying numbers of CPUs, memory modules, I/O channels, etc. Our approach will allow for significantly greater flexibility in this area than currently exists.

To allow for the implementation of compiler and operating system functions in hardware, we will employ three levels of machine languages, ranging from an APL-like, high-level vector language to a language which is basically a set of queue entries specifying physical machine addresses and types of operations. These various levels of machine language also help in keeping the design modular with well defined interfaces. These languages, in conjunction with an overall philosophy of having all processing driven by local queues and control, will ease implementation of multiprogramming and multiprocessing.

We have already mentioned that we will employ a minimum parallelism of 8. We will extend the potential parallelism without restricting the generality. We do this by extending conventional multiprogramming and multiprocessing to allow these functions at the queued instruction level. In particular, the largest version of this machine will allow four programs to be running simultaneously, distributing their vector instructions to up to six 8-wide arithmetic units. Of course, additional parallelism could be obtained through the external expandability we have mentioned.

Before going on to a more detailed description of the machine's structure, some additional comments on our objectives are in order. We are not attempting to provide maximum potential computation power at minimal cost, or even maximum usable computation power at minimum cost.
Instead, we are considering what we believe to be the correct problem: providing the most cost-effective overall system. Overall system cost includes both the cost of developing system software and, ultimately, the cost of doing applications programming. Thus, much of the structure we will discuss is intended to make the machine more useful in this general sense.

Figure 1 gives the overall structure of the Computation Node, Figure 2 gives the structure of the Computation Unit within the Computation Node, and Table 1 lists the units in these figures and briefly describes their functions. In order to provide a general idea of the operation of these units, we will provide a brief example. The example will raise more questions than it answers, but it is only intended to provide an initial impression of the functional structure we have in mind. Later chapters will describe the structure and function of these units in more detail.

[Figure 1: Computation Node]

[Figure 2: Computation Unit]

TABLE 1
MAJOR COMPONENT FUNCTIONS

Instruction Unit Dispatcher (IUD):
(1) Map logical registers of OFFL onto the various physical registers within the computation unit and by so doing schedule the various execution units. (2) Do conflict resolution between the competing MIDs.

Vector Execution Unit (VEU):
(1) Perform the actual processing of all vector operations of OFFL. These will include arithmetic and routing, and may include special-purpose vector operations.

Scalar Execution Unit (SEU):
(1) Perform the actual processing of all scalar operations of OFFL.

Vector Buffer:
(1) Provide temporary storage for vectors.

Scalar Buffer:
(1) Provide storage for scalars.

Vector Switch:
(1) Provide paths for routing vectors between the various vector execution units, buffers, primary memory, and the MIDs.

Scalar Switch:
(1) Provide paths for routing scalars between the scalar execution units and the scalar buffer.

Computation Unit:
(1) Includes all of the above functional units and their interconnections and connections to the external world.

Macro Instruction Decoder (MID):
(1) Decomposition of macro instructions into OFFL. (2) Program control. (3) Initiation of page faults. (4) Anticipatory I/O.

Memory Manager:
(1) Initiates requests for page swappings. (2) Assures that sufficient core is available to maintain a high level of efficiency.

Memory Controller:
(1) Maps logical memory addresses into physical addresses. (2) Does the actual addressing of memory.

Main Memory:
(1) High-speed, random access storage.

Backup Storage:
(1) All other storage devices in the machine not mentioned above.

Computation Node:
(1) All of the above functional units, their interconnections, and connections to the external world.

Our example will consist of a brief APL program segment. A, B, and C are vectors of length 24; D, F, and G are scalars. The program consists of the element-by-element add of A to B and the dot product of the result with C. This result is stored in F and also added to D, and that result is stored in G. The program for this is:

    F ← +/C×A+B
    G ← F+D
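For reference, the effect of this program in ordinary scalar terms is sketched below in Python; this rendering is purely illustrative (assuming A, B, and C are the 24-element vectors and D the scalar above) and is not one of the machine languages discussed in this thesis.

    # Element-by-element add of A and B, dot product of the result with C,
    # then the two scalar operations.
    f = sum(c * (a + b) for a, b, c in zip(A, B, C))   # F ← +/C×A+B
    g = f + D                                          # G ← F+D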
We will refer to the highest level vector language as Universal Assembly Language (UAL). The name is intended to reflect its machine-independent character. We will describe how the above is translated into UAL and how it is processed by our machine. UAL basically consists of 3-address instructions. In order to minimize the size of these instructions, their operands will refer to a small group of special registers, which will contain descriptor information for the actual operands. Special 2-address instructions are used to relate these registers to program defined variables. The UAL for the above will be as follows:

    Instruction   Operand       Comment
    SETADR        T1 ← A        Register T1 now refers to variable A.
    SETADR        T2 ← B
    ADD           T1+T2 → T1
    SETADR        T2 ← C
    MULTIPLY      T2×T1 → T1
    VECSUM        T1 → T1       T1 now refers to the sum of the components of T1 before this statement.
    STORE         T1 → F
    SETADR        T2 ← D        T1 is not used here because its value is needed later.
    ADD           T2+T1 → T1
    STORE         T1 → G

An MID must translate these UAL instructions into a machine-dependent Operand Fixed Format Language (OFFL). This name is intended to refer to the fact that this language explicitly recognizes the vector width of the machine. The IUD must in turn translate OFFL into a sequence of queue entries within the Computation Unit. These queue entries will ultimately result in a logically correct execution of the code. Figure 3 shows the tree for this program. Table 2 shows what the OFFL instructions might look like.

TABLE 2
OFFL PROGRAM

    Instruction   Operand           Comment
    LOAD          T1 ← A0-7         T refers to vector registers.
    LOAD          T2 ← B0-7
    VECADD        T1+T2 → T1        Vector addition.
    LOAD          T2 ← C0-7         There is no reason not to reuse T1 and T2 in this and the previous statement.
    VECMUL        T2×T1 → T1        Vector multiplication.
    LOAD          T3 ← A8-15        We do not reuse T1 or T2 here, so we can overlap the following sequence of five statements with the above.
    LOAD          T4 ← B8-15
    VECADD        T3+T4 → T4
    LOAD          T3 ← C8-15
    VECMUL        T3×T4 → T4
    VECADD        T1+T4 → T1        This add cannot be overlapped, so there is no reason not to reuse T1.
    LOAD          T5 ← A16-23
    LOAD          T6 ← B16-23
    VECADD        T5+T6 → T6
    LOAD          T5 ← C16-23
    VECMUL        T5×T6 → T6
    VECADD        T1+T6 → T1
    MOVE          T1 → Ts0-7        Move the result vector into 8 separate scalars (Ts0 through Ts7) to complete the summation.
    SADD          Ts0+Ts1 → Ts8     Scalar addition.
    SADD          Ts2+Ts3 → Ts9
    SADD          Ts8+Ts9 → Ts8
    SADD          Ts4+Ts5 → Ts10
    SADD          Ts6+Ts7 → Ts11
    SADD          Ts10+Ts11 → Ts10
    SADD          Ts8+Ts10 → Ts8
    STORE         Ts8 → F
    LOAD          Ts1 ← D           At this point all scalar temporaries except Ts8 are available.
    SADD          Ts8+Ts1 → Ts8
    STORE         Ts8 → G

[Figure 3: Program Tree (operations arranged by level)]

In constructing this table and figure, we have chosen temporary register locations to show how they can be used to control the sequencing of the program. This is the basis of the method by which the hardware permutes the order of execution of instructions in any way that optimizes resource utilization without affecting the logical outcome of the code. In particular, the MID has a small number of logical temporary registers available to it. These refer to 8-word wide vectors. It uses these to break up instructions operating on arbitrary sized vectors in UAL into OFFL instructions. As soon as all instructions that use one of these temporaries have been generated, that temporary may be reused. The IUD assigns physical register locations to these logical locations. It does this in a way that allows for the maximum possible parallelism. In particular, the reuse of a logical temporary in OFFL will always be assigned a different physical location. If this were not the case, the store to this temporary would have to wait until all loads from it requiring its earlier value have completed. Thus, the assignment of temporaries in our example reflects the way in which the IUD might assign physical registers. The MID can be more careless about reassigning logical registers, because the IUD operates as just described.
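The IUD's assignment of a fresh physical location on every reuse of a logical temporary is, in modern terms, register renaming. The Python sketch below illustrates only that policy; the table and free-list names are our own, and the actual mechanism (including the use counts discussed in Section 3.2.2) is developed in Chapter 4.

    # Sketch: every OFFL store to a logical temporary binds it to a fresh
    # physical register, so reuse of a logical name never forces a wait.
    free_physical = list(range(32))    # assumed pool of free physical registers
    mapping = {}                       # logical temporary -> physical register

    def define(temp):
        """Process a store to a logical temporary."""
        mapping[temp] = free_physical.pop(0)
        return mapping[temp]

    def operand(temp):
        """Process a load from a logical temporary."""
        return mapping[temp]

    # T1 is stored to twice in Table 2; the second store gets a new physical
    # register, so pending loads of the first value are unaffected.
    first, second = define("T1"), define("T1")
    assert first != second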
The scheme just described applies only to vector instructions. Scalars are handled in a different way but with essentially the same result. There can exist several copies of the same "physical" scalar at the same time. Associative tables and a time indexing scheme keep them straight. The parallelism shown in Figure 3 could be utilized by our machine.

3 BASIC BUILDING BLOCKS AND DESIGN TECHNIQUES

In the process of designing a family of machines to meet the objectives discussed previously, we found ourselves using the same sort of functional units and the same design techniques in many different contexts. Before we describe this detailed design work, we will provide a generalized discussion of the basic building blocks and techniques which we have evolved. We regard this set of very high-level building blocks as significant. The cost of integrated circuits is much more critically dependent on the number of circuits ultimately to be produced than on the number of gates in the circuit. Thus, if one can establish a canonical set of ICs from which a broad class of computers can be constructed, one will be able to keep the cost of the computers themselves down. In this thesis we have only undertaken the first step in the complex process that could ultimately lead to the fabrication of such a canonical set of ICs. That step is the recognition of the functional similarity of many of the units we will construct. We have made no attempt to provide sets of logical designs that will be of universal validity for the various function types. Such a process would be desirable if one were planning to construct machines of the sort we have designed. Section 3.1 is a discussion of these basic building blocks as well as how they can be combined to form more complex blocks. The resulting structures are in some ways similar to a well designed program made up of small subroutines existing at many levels in an overall hierarchy.

The techniques of pipelining and parallelism are well known and widely used, but they are in a fairly primitive state of development. Using the building blocks just mentioned, we have applied these techniques in a somewhat systematic way in the course of doing detailed design. Section 3.2 is a description of the various ways in which we use pipelining and parallelism to achieve our objectives. Section 3.3 is a theoretical analysis of parallelism. Its main purpose is to provide some perspective on the dimensions of this field and to suggest some unconventional approaches to gaining greater understanding of this subject.

3.1 BUILDING BLOCKS

In this section we first discuss the motivation for structuring the machine as we have. We then discuss the lowest level of building blocks, or functional units, from which the entire machine is constructed. Finally, we discuss how more global units are built up from these basic units.

3.1.1 Motivation

The building blocks we have chosen arise from three aspects of our overall approach. These are our attempt to provide local distributed control, hardware implementation of compiler and operating system functions, and parallelism itself.
The building blocks these give rise to are: queues, controls, switches, access controllers, and descriptive tables. We will define each of these units and relate them to the three aspects of our approach in the next section.

3.1.2 Basic Building Blocks

We now describe the basic building blocks. These differ from the primitives in Bell and Newell's PMS notation. They are not basic components from which any computer can be constructed; they are fairly complex units. They also differ from the various boxes that one inevitably draws when discussing conventional computers. They are more primitive in the sense that they occur repeatedly in many different contexts in the overall design.

3.1.2.1 Queues

Conceptually, a queue is a linearly ordered list of requests for resources. Our basic building block queue arises from the need to provide local control. By allowing hardware to determine the sequence of instruction execution, we can allow for more efficient utilization of resources. To provide for this, our queues attempt to provide first in first out service. They also attempt to keep the unit they drive as active as possible. The algorithm to meet these objectives will be to examine the oldest entry in the queue first, and then successively newer entries, until one is found for which all required resources are available. Such FIRFO queues will be used to drive the vector and scalar execution units and the vector switch. The control of the queue itself, the testing of outside resources, and the actual decisions about what action to take will involve other components. The queue is a memory that allows for the access of entries at the data rates and in the sequence required for implementing the above functions. It also allows new queue entries to be made at the data rate required.
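To make the search discipline concrete, a minimal Python model follows. The resource test is deliberately left abstract; as Section 3.2.1.2 describes, it differs among vector units, scalar units, and memory pages.

    # Sketch of a FIRFO queue search: scan from the oldest entry toward newer
    # ones and dispatch the first entry whose resources are all available.
    def firfo_select(queue, resources_available):
        """queue: list of entries, oldest first.
        resources_available: predicate on the driven unit and the operands."""
        for i, entry in enumerate(queue):
            if resources_available(entry):
                return queue.pop(i)    # dispatch to the unit this queue drives
        return None                    # nothing ready; retry on a later clock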
3.1.2.2 Controls

A control is a unit capable of sensing various states of its environment and initiating actions based on this information. The most general sort of control would be a full scale Turing Machine with I/O. Our controls will not correspond to the standard definition in that memory of previous states will not reside in the control itself, but will always be in queues and status tables which the control can interrogate. Controls in general will need to respond very quickly, often within a few gate delays. Thus, controls will consist of some combinatorial logic with sequencing circuits driving the unit. They can range from a very simple to a rather complex combinatorial circuit driven by the machine clock. They will be used throughout the machine: driving switches, using queues to sequence various units, determining the operation of the IUD pipe at all stages within it, and in general keeping the machine operating by individually keeping each of its parts going.

3.1.2.3 Switches

The process of transferring data and instructions to the units requiring them will frequently involve the use of small crossbar switches. In general we will confine the use of the word switch to just such units used for such purposes. We will use the term router to refer to units which sort data under program control.

3.1.2.4 Access Controllers

Many of the resources of this machine can service different controls. Access controllers referee this competition. They may provide some priority scheduling scheme and usually have some memory of the previous allocations of the resource they referee. This memory allows them to ensure that no requesting unit can be totally locked out.

3.1.2.5 Descriptive Tables

A descriptive table is simply a memory that contains information about the current state of the machine and about programs that are executing. One traditional use of such tables that will occur frequently in our design is tables which provide information about the status of registers or buffer memories. The difference between tables and queues is the method of access. Tables may be either associative or addressable memories, and they will sometimes be accessible by either method. In addition, the high data rates required in some instances may necessitate the ability for multiple simultaneous access.

3.1.2.6 Traditional Components

All of the components we have discussed arise from existing concepts. Our definitions have been restricted in ways to suit our purposes.
Both of these constraints would be ex- tremely troublesome in a machine with yery high gate counts. On the other hand, faster logic with less power consumption does seem to be in the offing. The one constraint on machine speed that is not likely to change is the delay times for signal transmission, i.e., the speed of light. Our two-level clock will be especially useful in accommodating this fact of life. Our basic approach is not geared to developing optimal techniques for a current techno- logy, but rather for developing techniques that will become increasingly attractive in the near future given the direction that technology is moving. There are structural problems inherent in our two clock levels, and we will now describe how we deal with these. We can think of the clocks as being centrally located, synchronized, and broadcasting pulses to all parts of the machine. Major components must be constructed to operate internally at the fast clock rate. The interfaces between these must operate at the 26 slower clock rate, and special Interfacing logic will be designed so that the major units can internally behave as if there were a single fast clock. We will not specify absolutely what constitutes a major component since this is a technology dependent decision. We will indicate in the process of doing detailed logical design the various levels at which the machine can and cannot be partitioned. Major components will be transmitting both data and control information among themselves at the major clock rate. We would like to minimize the width of the paths and thus, if possible, pipeline transmission on them at the minor clock rate. This can be done without losing the advantage of the longer clock rate between major components. The sending unit will transmit its output at the minor clock rate and will also send a copy of its clock pulse in parallel with the information. Provided all parallel paths are the same length, this pulse can be used for reading the transmission line. We can design a single interface that accepts input from such a line and the clock pulse of the receiving unit and that inputs data to the receiving unit according to its clocking. This can be accomplished by a simple circular buffering technique in which registers are written with one timing pulse and are read by the other Dulse. Since both pulses originate from the same mas- ter clock, there will be a constant phase difference between them. There is an additional buffering problem associated with this timing structure that our interface will handle. Not all of the units can neces- sarily process an arbitrary input stream at the maximum possible rate. There is usually some internal buffering to minimize the effects of transients, but it is possible for these buffers to become full. Just prior to this occurring, the receiving unit must notify its interface to stop transmitting information. Because of the long delays possible between major units, as 27 long as two major clocks worth of information may be transmitted before the interface has a chance to notify the sending unit to halt. The inter- face must buffer this amount of information. It would be possible to pro- vide a single design for such units with varying capacities and use them at all long delay interfaces. 3.1.3.2 Pipeline or Parallel Units The timing structure we have just described is well suited to either pipeline or parallel execution units or a combination of both types. 
3.1.3.2 Pipeline or Parallel Units

The timing structure we have just described is well suited to either pipeline or parallel execution units, or a combination of both types. We have established that transfers between units will be structured in a pipelined manner to minimize interconnections. This structure is perfectly suited to full parallel operation. We need only introduce an 8-word wide buffer to accumulate the pipelined inputs in preparation for their access by a fully parallel execution unit. It is assumed that the execution time of a fully parallel unit would be at least 8 minor clock periods. The interconnections are suitable for direct input to a pipeline processor. If there is a set-up time associated with the processor, then changing to a different operation might require some buffering similar to that for a parallel unit. On a machine of the nature we are constructing, a pipe with a long set-up time for general-purpose arithmetic would be impractical. On the other hand, such a pipe might be desirable for some specialized processing unit designed for a specific function in a specific algorithm. The timing structure provides substantial but not unlimited flexibility in choosing the structure of computational hardware.
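The buffer in front of a fully parallel unit can be modeled in a few lines of Python; the width of 8 is the machine's vector width, and the rest is illustrative.

    # Sketch: accumulate one word per minor clock; after 8 minor clocks (one
    # major clock) hand the complete 8-wide operand to a parallel unit.
    WIDTH = 8

    def make_accumulator(parallel_unit):
        pending = []
        def on_minor_clock(word):
            pending.append(word)
            if len(pending) == WIDTH:
                parallel_unit(tuple(pending))  # unit runs >= 8 minor clocks
                pending.clear()
        return on_minor_clock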
3.1.3.3 Additional Advantages of the Interconnection and Timing Structures

The interconnection structure we have described is particularly well suited to error correction and detection techniques and to performance monitoring. We regard both of these functions as being particularly important, given our overall approach. The larger the gate count of a machine, the lower the MTBF. In addition, the more costly a machine is, the more expensive down-time is, both directly in terms of lost computer time and, even worse, in the delay of projects dependent on the machine. The most serious obstacle to effectively using ILLIAC IV is a very small MTBF combined with something like an hour required to isolate a failing PE, replace it, and verify that no new errors have been introduced in the process. Fixing the CU requires several hours to days. To some degree, these problems are undoubtedly due to the decision to build ILLIAC out of the fastest logic available instead of logic that has been more highly developed and is better understood. Larger scale integration is likely to significantly improve circuit reliability [9]. Nonetheless, in constructing computers with very large gate counts, reliability problems inevitably increase. Providing a design which allows for the easy addition of architectural features that improve reliability is an extremely desirable feature for a paper machine of the sort we are proposing. It allows decisions to be made after hard information on reliability has been obtained. It also allows for the construction of machines with various cost-versus-reliability tradeoffs. Since many of the applications for large computers involve real-time processing, there may exist a need for super-reliable versions of such machines.

Performance monitoring is important for any complex, expensive system. Complexity inevitably implies that deep analytical understanding becomes extremely difficult and expensive to obtain, or simply impossible. Many existing computer architectures have reached the point where such understanding is at least pragmatically impossible to obtain. Our design has evolved from existing concepts but is significantly different from, and more complex than, existing systems. Thus we are in a position where analytical understanding is impossible and experience with existing machines is inadequate. Providing detailed critical information on performance in prototype models would be a mandatory requirement for perfecting the design concepts we are describing. Providing such information in existing systems would be extremely important in developing operating systems and compilers. Because of our modular structure, hardware improvements would also be possible at this stage. Finally, such monitoring would provide excellent feedback on what constitutes good programming techniques for this architecture. We will now describe how improved reliability and performance monitoring are obtainable from our structure.

3.1.3.3.1 Error Detection and Correction

A traditional, expensive, but simple method of providing error detection or correction is replicated hardware. Providing duplicates for all components allows detection of errors as discrepancies in the outputs. Error correction is provided by triplicated hardware and majority vote when an error is encountered. In the case of main memory and other back-up memory devices, we would propose that single error correction, double error detection codes be used. This provides protection essentially equivalent to triple redundancy at a modest cost in additional logic. We propose the more costly triple redundancy for the remainder of the logic because of its simplicity and suitability for the structure. In particular, the triple redundancy can be provided at the major component level. We can enhance the interface units we have described to include the error detection and correction function. This could be done without imposing more than a couple of additional minor clock delays in actual processing. The delays needed to synchronize the two or three input signals may impose some additional delay as a function of the physical layout of the machine.

One major advantage of this approach would be the ability to do real-time, automated error isolation. Upon detecting an error, the operating system could notify the operator to "Please replace Module X12 in Cabinet 5, Rack 4 with Part 210Z in Storage Cabinet 3, Shelf C." In double redundancy, the operating system could, in most instances, lock out the affected unit, restart any affected program, and continue operation with somewhat reduced capabilities. In the case of triple redundancy, no error should be introduced in any running program, and no reduction in capacity would occur. Clearly, if the MTBF of the individual components is at a reasonable level, the overall system could approach 100 percent reliability and availability. In addition, maintenance and repair in almost all cases could be done by unskilled personnel. Of course, repair of the individual modules would require different approaches, but this can be done in a leisurely manner if a reasonable inventory of spares is maintained.

By associating the error correction function with the interface unit, we can essentially eliminate the problem of who referees the referees. Figure 4 illustrates this structure. A, B, C, and D refer to different functional units. The subscripts refer to the three copies of each unit. There is an interface for each copy of each unit and each connection to a unit. The interfaces are labeled with the source name followed by the name of the particular copy of the destination unit. For error isolation purposes, all interfaces are to be considered as part of the unit they input to. Thus, if there is an error in AB0, this will affect the outputs of B0 and will be detected by BD0, BD1, and BD2. All three will signal the operating system that there is an error in B0, which includes the interface AB0. From that point on, until it is replaced, the outputs of B0 will be ignored.

[Figure 4: Error Correction Connections]
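The refereeing done at a receiving interface amounts to a majority vote plus a report naming the dissenting copy. A minimal sketch follows, with the operating system notification reduced to a callback and all names our own.

    # Sketch: compare the three copies of one input word at an interface.
    def vote(copies, report_error):
        """copies: dict such as {'B0': w0, 'B1': w1, 'B2': w2}."""
        values = list(copies.values())
        if values[0] == values[1] == values[2]:
            return values[0]
        for name, value in copies.items():
            others = [v for n, v in copies.items() if n != name]
            if others[0] == others[1] and others[0] != value:
                report_error(name)     # e.g. B0 (including interface AB0) bad
                return others[0]       # majority value passes through
        report_error(None)             # no majority: treat as double failure
        return None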
All three will signal the operating system 31 UNITS INTERFACES UNITS INTERFACES UNITS FIGURE 4 ERROR CORRECTION CONNECTIONS 32 that there is an error in B Q which includes the interface AB Q . From that point on until it is replaced, the outputs of B Q will be ignored. 3.1.3.3.2 Hardware Performance Monitoring The interfaces also provide an obvious source of information for very detailed performance monitoring. It would be practical to include within each interface a microcomputer to monitor the communication and selectively transmit information to a central performance monitoring system. If facili- ties were provided for altering the programming of these microcomputers, this structure would provide an extremely powerful and flexible system. As we have already mentioned, the somewhat radical and very complex nature of the structure we are proposing makes such a facility extremely desirable if not essential. It is our belief that as computers become more complex, real- time performance monitoring will become an essential element in the feedback loop that should lead to "better" computers. 33 3.2 GENERAL DISCUSSION OF DESIGN TECHNIQUES We will somewhat arbitrarily divide this discussion of techniques into a discussion of pipelining and parallelism analysis, and a discussion of techniques for reading and updating descriptive tables. The former refers to a flow analysis used to determine the degree of parallelism and pipelin- ing required to insure that all components of the machine are able to keep up with each other. It also refers to the queueing techniques used to smooth the flow between units. The processing of descriptive tables refers to the algorithms for maintaining an adequate description of the state of the machine and programs. Our hardware implementation of operating system and compiler functions and our queueing techniques require hardware maintenance of some fairly sophisticated tables. Before we describe these techniques in detail, we need to say a few words about our overall approach to pipelining and parallelism. More speci- fically, we will discuss what we consider to be the primary obstacle to effectively utilizing these techniques. This is program nondeterminism. We have kept the parallelism of individual computation units small enough that we know most programs can, in theory, effectively use it. To turn this theo- retical possibility into a practical reality and for other reasons which we have discussed, we will construct an elaborate system for hardware control of the individual execution units. The operation of this analysis and control hardware must be overlapped with actual program execution. This implies a pipeline structure. The more complex the analysis, the longer this pipe must be. We will do a flow analysis in the process of doing detailed design, which should insure that the machine will be operating efficiently as long as instructions keep flowing in at the head of the pipe. Conditional trans- fers can break up this flow and have a devastating effect on overall 34 efficiency. There are several aspects of our approach that will minimize this problem. The problem cannot be eliminated for all programs. In Section 3.3 we discuss parallelism from a completely abstract perspective. In particular, we will discuss the general question of the structure of algorithms and transformations to map them onto various parallel computing structures. For now we simply concede that there are some algorithms poorly suited to the structure we will develop. 
We believe this group of algorithms is quite a small percentage of all useful algorithms. We will now describe how the problems associated with conditional transfers can be overcome for most algorithms.

We have three complementary approaches to this problem. These are the use of an if tree analyzer, compilation and execution time analysis of flow of control, and instruction level multiprogramming. In analyzing FORTRAN programs for parallelism [5], it was determined that there are "bursts" of assignment, go to, and if statements. Special hardware has been designed [3] to process such if nodes in parallel. This results in converting a sequence of nondeterministic nodes to a single nondeterministic node. Such an execution unit can be included in the vector portion of our machine.

In referring to analysis of flow of control, we have in mind differentiating deterministic program loops from true program nondeterminism. The critical parameter is the time between when it is known which alternative of a branch must be taken and the moment when the branch occurs. Counting loops with a limit computed outside of them can be made completely deterministic. The compiler can recognize this situation, and the MID can be constructed to use this knowledge. In general, the compiler can attempt to move any branch-dependent computation as far ahead of the branch as possible. Back substitution and the introduction of redundant computations could be used in some instances. We will discuss these alternatives in more detail in Chapter 5.

Instruction level multiprogramming refers to the fact that we have up to four MIDs simultaneously processing different programs or different parallel paths of the same program. These MIDs simply load various queues which drive memory and other resources. If one of the MIDs is held up, it simply stops feeding the queues. Data rates are such that the other MIDs can take up the slack. In fact, one MID is capable of fully utilizing the computing resources. Further, no instruction ever gets past the MID unless it can proceed to completion. In particular, all operands must be in memory. Thus, in multiprogramming mode, if the nondeterministic branches are relatively sparse, utilization can approach 100 percent.

There are algorithms with a very high level of program nondeterminism, and they would not be well suited to an architecture of the sort we are proposing. However, we do believe that most programs will be able to run efficiently on our machine. As one measure of the level of nondeterminism in programs, we can examine some of the parameters measured in the analysis of FORTRAN programs [5]. Attempting to speed up a highly nondeterministic program with parallel execution inevitably results in very poor efficiency compared to executing the same program on a serial machine. Yet, it was possible to maintain an efficiency of 0.3 to 0.4 over a broad class of programs while using, in almost all cases, more than 16 parallel units and, in the majority of cases, more than 30.
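For perspective, speedup is simply efficiency multiplied by the number of units, so an efficiency of 0.3 to 0.4 sustained on 16 to 30 parallel units corresponds to speedups of roughly 5 to 12 over serial execution of the same programs.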
3.2.1.1 IUD Design Analysis The steps taken in designing the IUD were as follows: 1. Compute the instruction emergence rate required to keep the rest of the machine active. 2. List the functions the IUD was required to perform. Estimate how long each of these would take and list other functions they may be dependent on. (Table 24) 3. Make an IUD pipeline diagram giving time versus function(s) performed. (Table 25) 4. Do detailed logical design of each of the functional units. If any of the units cannot be designed to meet the estimates of step 2, modify the pipe diagram of 3 appropriately. In performing the internal design of the various units, a similar approach was applied in a less systematic way. For the most part, this pro- cess worked fairly well. Like any moderately complex subroutine in a com- puter program, we are quite certain that any of our individual designs could be improved upon by additional work. The final structure of the IUD pipe did turn out to be significantly different than the initial diagram we constructed. 37 The IUD processes instructions for all the various computation and memory units. In the process of doing the design, it was noted that the scalar instructions could be processed for the most part independently of the other instructions. There was a definite advantage to doing this processing independently after the instructions emerged from the main IUD pipe. The instructions could be processed at the maximum rate for scalar instructions as opposed to the maximum rate for all types of instructions at a considerable savings in hardware. In Section 4.6 we describe both our original structure and how it was modified in the course of design. The one unit in the IUD pipe that did not quite meet our time con- straint of 8 levels of logic was the unit that allocated functionally equivalent VEUs. This unit is designed in Section 4.6.3.2.7 and is probably the kludgiest of any of the units we designed. We choose to discuss that design here, not out of masochism, but because we learned the most in con- structing that unit. In the process of designing a portion of tnis unit, we developed a systematic notation for generalizing the techniques used to construct a carry save save adder. Of course the technique is not likely to be appli- cable to all problems of speeding up logical circuits. Further, the sys- tematic portion of the procedure is the notation. The notation must be applied in an intelligent and sometimes imaginative way to provide a high- speed logical design for a specific functional unit. Nonetheless, the notation described in Section 4.6.3.2.7 and used in the appendix does seem likely to be a powerful tool for designing fast and complex functional units. 38 3.2.1.2 Queuing Techniques We have described the basic operation of the FIRFO queues. In this section we will provide a somewhat more detailed analysis of their operation and an analysis of their size. A queued instruction is allowed to proceed when it is the oldest one in the queue and when all required resources are available. Resources refer to the unit the queue drives and the operands for each queued instruction. The unit becomes available at a time deter- mined by the previous instruction. The unit will instruct the queue con- trol of its becoming available in enough time to allow for the queue search and any preliminary set-up steps required. The determination of when operands are available differs with different types of units. 
We will briefly describe the operation of the vector and scalar execution unit queues and the memory queues. These units will be described in detail in various sections of Chapter 4.

There will be associated with each Vector Execution Unit physical registers for operands and results. When a vector instruction is processed by the IUD, the IUD will transmit instructions to switch the operands to the VEU assigned to the instruction. It will assign specific physical registers for those operands. The queued instruction within the VEU is ready to execute when those specified registers are loaded. A physical register for the result is also allocated within the VEU. Since this allocation is done by the IUD, no instruction which reaches the VEU queue can be held up for lack of a place to put the result. A range of 8 to 16 seems a reasonable size for this queue. This estimate is based on the fact that twice the number of operand registers as queue entries would be required for a binary unit, and that probably no more than 8 queue entries could be checked in one major clock. The first constraint is important because of the cost of the 8-word wide parallel buffer. Because there is likely to be something like a 6 major clock delay between when buffers are reserved by the IUD and when the instruction enters the VEU queue, we would want more than two registers per queue entry for a unit that only processes binary vector instructions. The second constraint is important because once it takes longer to search the entire queue than it does for a vector instruction to execute, it becomes increasingly likely that in doing a full queue search an earlier instruction that was not ready to execute when it was tested will become ready to execute. Thus, long queue searches can defeat the FIRFO philosophy.

Scalar Execution Units only have internal buffers for the current and next operands and the current and next result. All results from scalar instructions are assigned a time index. A scalar operand may or may not have a time index; if it has been recently computed, it will. There are many fewer time indexes than physical scalar buffer locations, and the time indexes are constantly being recycled. Thus, a scalar operand may refer to a physical location whose time index has been reused and is no longer associated with it. The mechanism for assigning these indexes is described in detail in Section 4.3. Any scalar operand without a time index is available. A scalar operand with a time index may be available in either of two places. For each of these buffers there is a set of bits, one for each time index, that indicates if the corresponding operand is in the respective buffer. The first of these is simply the main scalar buffer. Two queued scalar instructions may produce results destined for the same physical scalar buffer location. If the logically later of these is ready to proceed before the earlier, it will store its result in a special result buffer. The time indexes will assure that the correct value is ultimately stored in the scalar buffer and that intermediate instructions access the correct values. Because their operand buffers are not separate from each other, it makes sense to have a single Scalar Execution Unit queue drive all equivalent SEUs. This fact, combined with our earlier observation about queue size versus queue search time, means that we would probably want larger scalar queues than vector queues. A size of 16 would probably be reasonable. Each memory page of 8 x 1K words will have its own queue.
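Before turning to the memory queues in detail, the following is a minimal sketch of the scalar presence-bit test just described. The table names are ours, and the 256-index size is an illustrative assumption consistent with the tables later specified in Table 5.

    #include <stdbool.h>
    #include <stdint.h>

    #define TIME_INDEXES 256   /* assumed number of time indexes */

    /* One presence bit per time index for each buffer that may hold the
     * operand.  Bits are set as results return to the buffers and reset
     * as time indexes are recycled. */
    struct presence_tables {
        bool in_scalar_buffer[TIME_INDEXES];
        bool in_result_buffer[TIME_INDEXES];
    };

    /* An operand without a time index is always available in the scalar
     * buffer; one with a time index is available if either presence bit
     * for that index is set. */
    bool scalar_operand_available(const struct presence_tables *t,
                                  bool has_time_index, uint8_t time_index)
    {
        if (!has_time_index)
            return true;
        return t->in_scalar_buffer[time_index] ||
               t->in_result_buffer[time_index];
    }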
For the memory queues, the queue control must insure for every instruction that all indexes and modes are available, i.e., have been transmitted to the appropriate buffers within the page. Further, it must insure that the instruction can proceed without producing a logical error. Various schemes could be used to determine this. The simplest would require that all instructions proceed in exactly the sequence in which they entered the queue. One could allow non-indexed instructions to be executed out of sequence if their addresses insured that no conflicts would result. In the most general case, one could do arithmetic on all available and relevant indexes and modes to see if any instruction could proceed. As soon as any instruction with an unavailable index is encountered, the queue search must stop. A queue size of 8 to 16 would probably be reasonable for a memory page. Experimentation might reveal that queue sizes smaller than the ones we have suggested for all units might be practical.

In all the above cases involving local buffers, the various units must notify the IUD as buffer locations become available. There is an additional problem associated with the vector result buffers. Values from these buffers may be accessed as operands for other vector instructions. Thus, these locations may be used until the corresponding logical location is reused in the OFFL instruction stream. We must, however, insure there is space in these buffers for new instructions. Thus, the local control must initiate a transfer of some of these operands to the main Vector Buffer when it becomes too full.

3.2.1.3 Resolving Buffer Access Conflicts

Our local control and queue driven structure can often result in buffer access conflicts. Two methods for handling this are to allow multiple simultaneous accesses to the same memory and to provide hardware for conflict resolution. The first method is employed in some of the IUD tables because of the necessity for very high access rates. It involves providing multiple addressing logic and a larger fanout from each bit of storage. This makes the memory considerably more expensive, and it is thus only used when required by the data rates. For the other, more common case, we have developed a very simple, fast, and cheap circuit for conflict resolution. It is described in Section 4.4.2.4.

3.2.2 Tables

In this section we provide a general description of the hardware maintained tables and the algorithms for updating and accessing them. Vector tables are provided to map physical buffer addresses to logical buffer addresses and to maintain use counts for active buffer locations. A use count is the number of accesses to a particular register that have been processed by the IUD but have not yet occurred. Its use count must be zero before a physical buffer location can be reused. The original assignment of physical to logical address is made by the IUD when a store to a logical address occurs. This assignment is made from a known list of free physical registers. A table whose addresses correspond to logical addresses contains the physical address for each assigned logical address. This table is used to determine the physical location of an operand. Every access to a location in this table by the IUD results in an increase in the use count also stored in that location. Periodically, information is obtained from the various execution units giving a list of physical addresses which have been accessed. This list is used to access the table in an associative fashion and decrement the use counts.
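A minimal sketch of this use-count bookkeeping follows; the field names and table size are our own illustrative assumptions (the actual tables are specified in Section 4.6).

    #include <stdbool.h>
    #include <stdint.h>

    #define LOGICAL_ADDRS 256   /* assumed table size */

    struct map_entry {
        uint16_t physical;       /* assigned physical buffer address */
        uint8_t  use_count;      /* IUD-issued accesses not yet performed */
        bool     logical_reused; /* logical address reassigned since mapping */
    };

    /* Every IUD access to a logical address bumps the use count. */
    void iud_access(struct map_entry tbl[], uint16_t logical)
    {
        tbl[logical].use_count++;
    }

    /* Completed accesses reported by the execution units are matched
     * associatively against the physical field and decrement the count. */
    void report_completed(struct map_entry tbl[], uint16_t physical)
    {
        for (int i = 0; i < LOGICAL_ADDRS; i++)
            if (tbl[i].physical == physical && tbl[i].use_count > 0)
                tbl[i].use_count--;
    }

    /* Both conditions of the text: no outstanding accesses remain, and
     * the logical address has been reused, so no instruction not yet
     * processed by the IUD can still want the value. */
    bool physical_reusable(const struct map_entry *e)
    {
        return e->use_count == 0 && e->logical_reused;
    }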
When a use count goes to zero after the corresponding logical address has been reused, the physical address can be reused. Both conditions are necessary: the former insures that there is no access to the register that has yet to be made, and the latter insures that no instruction not yet processed by the IUD will require the value. A similar structure is provided to keep track of scalars. In particular, use counts must be maintained to insure that no physical scalar buffer address is overwritten while there is a queued instruction requiring that value.

3.2.3 Deadlock

In designing a machine with this structure, one must be certain that no deadlocks can occur. By consistently following two basic design constraints, we have assured this. First, no instruction gets past the IUD unless all resources required for its execution are immediately available. In particular, instructions which would cause a memory page fault do not get past the MID. All instructions which require the allocation of temporary registers have that allocation made within the IUD from a set of known registers that are physically available. The second constraint is that whenever a required resource is not available to the IUD, it ceases processing all instructions until the resource becomes available. For example, if an instruction requires space in a queue that is full, later instructions which may not directly use that queue will also be held up. Thus, no instruction will enter and possibly block a queue because it is dependent on the results of an instruction that is not yet in a queue. In conjunction with the first constraint, this assures that once an instruction enters a queue, all its operands will eventually become available and it can proceed. Thus, certain badly balanced instruction sequences could degrade the performance of this structure, but no instruction sequence could completely block it.

3.3 PARALLELISM - AN ABSTRACT DISCUSSION

This section is a general philosophical discussion of the nature and scope of parallel computing structures and is not immediately related to the remainder of this thesis. We will suggest a possible basis for relating the problem of understanding computing structures to the general problem of understanding mathematical structures. We will not be presenting established results, but rather pointing out similarities and suggesting possible approaches.

It is a great luxury in conventional computer architecture that all words of main memory are equally accessible by the processing portion of the machine. Parallelism replaces this "amorphous" topology of data interaction with a specific structure. In a totally abstract sense, the problem of parallel computer design is that of determining classes of data interaction topologies that correspond to significant real problems and that can be mapped in an efficient way to a single computer topology. Mathematics is the study of arbitrary abstract structures. Some of these are obviously and directly related to problems of computer architecture. We will describe some of these direct relationships for a very substantial portion of all mathematics and suggest possible approaches to investigating computer architecture utilizing this body of mathematics. We will then briefly explain why we believe the study of structures relevant to computation includes all of mathematics. Finally, we will discuss some of the implications of this point of view for a theory of mathematical truth.
Two fundamentally different measures of the strength of a mathematical system are provability and definability. The former refers to what questions can be decided by the system, and the latter refers to what questions can be stated within the system. As we have suggested elsewhere [2], one can directly relate mathematics through the hyperarithmetical sets to computation related structures. We can begin with some language adequate to describe all finite state machines. We Gödel number all statements within this language. We have a separate Gödel numbering for all Turing machines with blank input tapes. We code the outputs of these so that each output either represents the Gödel number of another Turing machine or the Gödel number of a statement in our language describing finite state machines. We now assign truth values to each of the Turing machines as follows: the truth value of a Turing machine is true if it has an unbounded number of outputs and the truth value for each member in some unbounded subset of these is true; the truth value for any output corresponding to a finite state machine statement is true if the statement is true. This structure is completely adequate to define all hyperarithmetical statements. This encompasses most mathematical questions and includes a broad area that Intuitionist mathematicians consider to be meaningless.

Central to this level of mathematical definability are the two related concepts of constructive ordinals in mathematics and non-deterministic Turing machines in computer science. The proof that every constructive ordinal has a recursive notation, defined in a particularly technical way [7], can be interpreted as demonstrating that there is a non-deterministic Turing machine that recursively and completely describes the structure of any constructive ordinal. Mathematical questions about hyperarithmetical sets are those which result from "iterating," up to some constructive ordinal, the question: is there an infinite subset of all true statements in a recursively enumerable collection of statements about finite state machines?

Constructive or recursive ordinals can also be used as a measure of the power of a mathematical system in terms of provability. Loosely speaking, the larger the recursive ordinal that can be proven to be a recursive ordinal in a system, the more powerful the system is in terms of provability. There are many mathematical languages rich enough to define all recursive ordinals, but no mathematical theory is rich enough to prove, for each recursive ordinal, that some definition in the language does define it. The concept of recursive ordinal can be thought of as a sort of measure or classification of level of complexity for an initial segment of mathematical structures. We suggest that this classification of structures might be a good starting point in a search for classifying various topologies of data interaction. As an example, the initial recursive ordinals correspond in a fairly direct way to the elementary mathematical operations of addition, multiplication, and exponentiation. These each have different and increasingly complex topologies of bit interactions. Different techniques of logical design are required in providing time versus gate count tradeoffs in implementing them. The concept of recursive ordinals provides a detailed and direct method of extending this hierarchy to more complex structures.
Further, it is my belief that the concept of recursive ordinals is directly connected to the computer science concept of iteration. This relationship tends to be obscured by the modern set theory treatment of ordinals. Modern set theory originated from an attempt to avoid the paradoxes discovered by Bertrand Russell in earlier versions of set theory. It seems to do so in an extremely elegant and powerful way. However, returning to the intuition that led Russell to discover the paradoxes, and to the resulting less elegant and less powerful theory of types that he proposed as a solution, will shed considerable light on the relationship between set theory and a computer related theory of iteration. The paradoxes arose from sets with self referencing definitions constructed in such a manner that if some element was a member of the set then one could show it was not a member of the set. Russell's solution was to provide a sort of index associated with all statements used in defining sets. This index provided a limit on the type of set used in the definition. The set being defined would have a higher index or type. Ordinal numbers, including the recursive ordinals, implicitly form a similar indexing scheme for set theory. We can consider the problem of iteration as that of applying various algorithms to each other. The problem of possible contradictions is replaced by the problem of whether the resulting algorithm computes a value or simply loops forever.

It is possible to consider iterations on a hierarchy of functions. For example, we can start with algorithms which compute integers from integers. We can then consider functions which, given a function of this first type and an integer, compute an integer. Given any type, we can consider a function of all lower types. Given an effective procedure for listing an infinite number of types, we can consider a function of a Turing Machine which enumerates an infinite sequence of such types. Using such types, we can construct more powerful techniques of iteration. We can also construct larger recursive ordinals. Finally, the topology of the interaction of the original operands becomes more complex and more general as we go to higher types.

We are not suggesting that any of these approaches is uniquely correct, but rather pointing out similarities and suggesting that each field and each approach may benefit from insights of the others. We would now like to outline why we believe reasoning about physically implementable processes is relevant to the outer reaches of mathematical research. We consider both the problems of definability and provability.

Problems associated with the set of all real numbers provide the first obstacle to providing constructive interpretations for all of mathematics. Cantor's proof that there cannot exist a one-to-one map from the integers to the reals makes it impossible to provide any constructive method of naming all the reals. Cantor did not prove that there were more reals than integers, since the existential status of the reals is in question. A possible interpretation of the reals is that they represent properties of Turing Machines. One can consider that the "meaningful properties" of Turing machines that one could invent might be limitless. By a meaningful property we mean a property that is either true or false for any given Turing Machine. Thus, each such property, under a particular Gödel numbering of Turing Machines, defines a real.
We can reflect the open-ended nature of the situation by employing a language for describing properties in which an infinite sequence of words is always left undefined. This seems to me to be a particularly desirable approach since it more closely reflects the reality of the situation. We know from the Löwenheim-Skolem theorem that any mathematical theory with recursively enumerable axioms has a countable model. This approach by itself would not be adequate to construct a constructive version of set theory. However, examining the actual combinatorial power of the axioms of set theory and seeing if similar constructive interpretations are possible seems to me likely to be successful.

We now consider the problems associated with providing constructive interpretations for set theory in the domain of provability. In doing so, we will confront what is probably the major philosophical problem with the approach we are suggesting. Mathematics is the one area of human endeavor that is generally considered to have a claim to absolute truths. Gödel's Incompleteness theorem showed that there exist fundamental problems with allowing mathematics to grow and at the same time retain the property of possessing absolute truths. One school of mathematics has jettisoned all but totally constructive proofs as a means of insuring the absoluteness of mathematical truth. The Intuitionists do not even accept the statement that any Turing Machine must either halt or continue indefinitely. On the opposite end of the spectrum we have what might be considered the mystical school of mathematics. This is the belief that intuition about infinite sets allows mathematicians to transcend the limits of Gödel's Incompleteness theorem when dealing with constructive processes. As far as I am aware, no one has seriously considered the possibility that mathematics should give up its claim to absolute truth outside of a narrow domain and become a speculative and experimental science. Our suggestion for handling the concept of real numbers is made in this spirit.

Gödel's Incompleteness theorem established that no mathematical theory in which a Universal Turing Machine is embeddable, and in which the halting problem can be defined, can decide within itself whether it is consistent. This establishes severe limits for any formal mathematical system with respect to its power of provability. For any such "true" system one can adjoin the statement that the system is consistent and obtain a more powerful system. In fact, one can regard the high power that set theory has in a provability sense as deriving from the powerful methods available within it for taking a powerful kernel system and iterating the statement that the system is consistent. This is accomplished via the strong axioms of infinity that allow one to construct models of increasingly more powerful subsystems. If one can construct a model for a system, one has a proof that it is consistent. An alternative approach would be to directly study and attempt to enhance the combinatorial power of this "iterative" process. But to attack the problem from that direction would require giving up the notion that the results are absolute truth.

This non-absolutist approach to mathematical truth has a philosophical appeal. Perhaps the severest problem associated with the accomplishments of Western mathematics, science, and technology is recognizing the limits of these endeavors.
It is fitting that the queen of the sciences be the first to establish precise limits for its power and scope. It is essential that we know what we do not know; otherwise we know nothing. That is why mathematics is so concerned with avoiding contradiction.

4 COMPUTATION UNIT - DETAILED LOGICAL DESIGN

In this section we provide a detailed logical design for the computation unit of Figure 2. We will briefly describe its overall physical structure. We will then describe its overall functional structure. We will then proceed to a functional and logical design of sufficient detail to provide realistic gate counts. In various subsections we will provide tables giving approximate gate counts for individual units. In a concluding section we will provide a summary gate count for the entire computation unit. In this section we will group buffers by their access times and compute their gate counts separately. This scheme is intended to give a very rough notion of the logical complexity and cost of this design.

4.1 OVERALL STRUCTURE

Figure 2 can be partitioned into four major units. We will refer to these as the scalar portion, the vector portion, memory, and the Instruction Unit Dispatcher. The scalar and vector portions are symmetric in the sense that they both consist of up to six execution units, a buffer, and a switch. The execution units are the portions of the machine that do all actual computation. The switches operate under hardware control and are responsible for transferring data between buffers, memory, and execution units. The vector buffer is a more or less conventional high-speed buffer for the main vector memory. There is also vector buffer space within the VEUs. The Scalar Buffer is the primary memory for scalars. It can be loaded via the Vector Switch from main memory for initialization. There are additional buffers associated with the scalar portion. They exist to enhance the throughput of the scalar portion and will be described in detail. The Instruction Unit Dispatcher is the most complex and unconventional of the major units. It has responsibility for mapping OFFL instructions into queue entries which drive the other units.

4.2 FUNCTIONAL STRUCTURE

The functional structure can be thought of as a generalization of the algorithms used to sequence the arithmetic on the IBM 360/91 [8]. All of the resources of the machine are queue driven. The queues are not strictly first in, first out, but rather first in which is able to begin using the resource, first out. We will refer to these as FIRFO, i.e., first in and ready, first out. An instruction is ready when its operands become available. What constitutes an available operand will vary with different types of functional units. This structure allows the sequence of instructions to be permuted in any way which enhances resource utilization without altering the logical structure of the original program. It is the responsibility of the IUD to insure the logical integrity of the original program. Most of the complexity of the IUD is a result of this function.

4.3 SCALAR PORTION OF COMPUTATION UNIT

The scalar portion of the computation unit allows us to perform operations on scalars without tying up the vector execution units. In addition, it contains a high-speed memory with sufficient space for the scalars in almost any program.
The units that actually perform scalar operations are constructed in a modular fashion to allow for the construction and use of specialized hardware at any time during the operational life of the machine.

4.3.1 Overall Structure of the Scalar Portion of Computation Unit

Figure 5 shows the structure of the scalar portion of the execution unit. We will briefly describe the functions of each of the units in the figure and the nature of their interconnections. The Scalar Execution Units contain the queues, control, and logic to sequence and perform the scalar operations. These units receive instructions from the SIDS through the instruction switch. The execution units make use of the tables in the Scalar Buffer Status unit to determine admissible instruction sequencing. The execution units also provide information for updating these status tables as instructions are executed. Every major clock, the Scalar Buffer Status unit and the SIDS exchange information to update their respective status tables. The functional structure of the SIDS relevant to sequencing instructions will be described in Section 4.3.2.1. Detailed design of the entire SIDS is in Section 4.6.5. The Result Buffer is used to buffer results that would otherwise overwrite operands needed by instructions waiting to execute. Its contents may be accessed as operands and will eventually be transferred to the Scalar Buffer through the Scalar Switch. The Vector-Scalar Buffer is used for transferring scalars between the VEUs and the SEUs. It is addressable as if it were an extension of the Scalar Buffer, but it has a special status table in the Scalar Buffer Status unit that must be updated with information from both the SEUs and VEUs.

[Figure 5, "Scalar Portion of Execution Unit," is a block diagram showing the Result Buffer, Vector-Scalar Buffer, Scalar Buffer, Scalar Switch, Scalar Buffer Control, Scalar Switch Control, and Scalar Buffer Status unit, up to six Scalar Execution Units, and the instruction switch, with paths to and from the Vector Switch, the VEUs, the vector execution units, and the SIDS (IUD subsystem).]
A special switch is provided to allow results to be used as ope- rands without going through the scalar switch. The computation hardware contains the logic to perform the actual scalar operations. Working registers are included in the figure to emphasize the buffering function of the other registers. If an interrupt condition occurs, the MID will be notified and processing will continue. 57 z I 1 z o o I— I- 1 •— t Q- 1— 1— SB c_> o ttMQ ZD t— 1 cc c£ •— • (V CJ UIOS \- 1— CO CO Z3 Z UJ o z: ct: uj l-t O |— I— r 3Z 1— =) <_> CO UJ z 211— z r> o O •— i ►-" cr l-H CC ZS. 1— LU z i UJ ^D o- UJ CO ! 1 1 I 1 cc UJ a: UJ CC LU 1 1— h- 1— cc CO co CO LU l^-l i— 1 1 — 1 2TU- DC CD CT5 CD O U_ O UJ UJ LU CCZD sri— cc cc CC U_ CO O t-* q; s as o o a a: (s > U_ CO z z z: z h- • or 13 UJ I— UJ CO Q_ M D_ LU o 3 CO o CC 1 o: I i s: —i L — o U_ CO l' UJ i in O O >- l— CO •—< LU IS cc co o UJ X cc CO CO UJ cc ZZ> CO cc Ml B n q q S(A q ) A 01 B q q P q R(B q ) A o B 0-1 q q P q S(B q ) e e E o R < W e> N Q W 0-1 e e s(w e ) N Q W 0-1 e p S(H p ) N Q B 01 e q W e S(R e ) N e B q @l M P S(R p ) R e R(w e ) R P R(W p ) P q R(W e ) R(W ) P„ F~ q e S o Po B q q R(B q ) P o B„@l q q S(B q ) 68 TABLE 4 GATE COUNT FOR INSTRUCTION QUEUE Symbols Description Number of gates to shift one bit Number of gates in control bit logic Number of bits in shift control logic Gates to store one bit Number of bits per register Number of registers Number of control bits Symbol Sample Value Explanation N s 2 From Table 3 C L 25 From Table 3 S L 3 From Table 3 G m 4 N b 60 Must hold up to three data buffer addresses and an operation code N r 16 Queue length t 4 From Table 3 Gate Estimates Functional Unit Shift Control Control Bit Logic Test Selector Control Bits Register Formula Vs G N m c G N. m b Subtotal Sample Value 3 25 64 16 240 348 Multiply by N to get total for queue = 5568 69 status tables required. We will explain how the algorithms can be imple- mented within the required time constraints and provide gate estimates. We will not do detailed logical design for these algorithms. The principal complication of this unit is the variety of possible sources of operands. Most of these sources are redundant in the sense that their only purpose is to allow for rapid processing. Except for timing con- siderations, the same operands would be available from other sources. Since only extensive experimentation could provide accurate estimates of the cost benefit tradeoffs, we do not claim that the ideal design would incorporate all these features. We include them as suggestions and state what advantages they seem to provide. The scalar buffer is the primary source of operands. All operands with- out time indexes are in the scalar buffer. In addition, a table provides a list of what time indexed operands are in the scalar buffer. It is logically possible to eliminate all sources of operands other than the scalar buffer. The result buffer allows for the existence of multiple occurrences of the same physical scalar buffer address. In addition, it eliminates the sub- stantial delay between the time all accesses to a particular physical scalar address have completed and the SEUs will be aware of that fact and be able to overwrite that physical address with a new result. It is fairly certain that this latter function of the result buffer is essential to providing a reasonable throughput for the scalar execution units. With it all instruc- tions can be processed as soon as their operands are available. 
The result goes to the result buffer unless the SEUs know that the physical address of the result can be overwritten.

It is less clear to what degree the remaining two sources of operands we will now discuss are important for efficient utilization of the SEU. Both of them will help in lessening the load on the scalar switch and, in some instances, provide for more rapid instruction processing. First we consider the case where a request has been made to transfer an operand on the scalar switch. If the same operand is required for a queued instruction, a request can be entered that the operand also be transferred to the SEU that will be assigned the queued instruction. This is the only method the 360/91 uses in sequencing its various arithmetic units. The other case occurs when a result being computed is required for a subsequent instruction. The controller can be aware of this and can simply transfer the result to an operand buffer within the SEU. This is likely to be a desirable feature since it is very common to have the result of one operation be required by the next. A final possibility would be to allow one to reuse the operands of one instruction for the next. We do not include this alternative because of the switch required within each SEU for non-symmetric operations and because it is a less likely occurrence.

We will now describe the tables required to keep track of all these sources of operands. The scalar buffer requires only a single bit for each time index to indicate if that operand is present. The same is true of the result buffer. These bits are set as results are returned to the specified buffers and reset as the time indexes are recycled. There will exist short queues to drive the entries to the scalar switch. By allowing associative reads of these queues, we can determine if an operand is about to appear in the scalar switch. The final source of operands is results in the process of being computed. Again, an associative memory is required. Figure 8 describes the detailed algorithms for accessing these tables.

[Figure 8, "Algorithms for Accessing Scalar Status Tables," is a three-part flowchart, with both operands handled simultaneously. For each operand it tests, in order, whether the operand has a time index (if not, the operand flag is simply set); whether it is present in the scalar or result buffer (setting the OP1B/OP2B or OP1R/OP2R flags); whether it is about to appear on the scalar switch (setting the OP1S/OP2S flags, adding to the switch queue, and requesting delivery to the selected SEU); and whether it is still being computed (setting the OP1C/OP2C flags). A third chart gives SEU allocation: an SEU already containing an operand is preferred if available (setting OP1U or OP2U); otherwise the next available SEU is allocated, the operands-present test is made, and the next queue entry is fetched.]
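The per-operand decision chain of Figure 8 can be summarized in software. In this minimal sketch the flag names follow the figure, while the lookup helpers stand in for the hardware tables of Table 5 and are our own assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    /* Flags recorded for one operand, following Figure 8. */
    struct op_flags {
        bool op;   /* operand accounted for                        */
        bool b;    /* present in the scalar buffer (OPnB)          */
        bool r;    /* present in the result buffer (OPnR)          */
        bool s;    /* about to appear on the scalar switch (OPnS)  */
        bool c;    /* still being computed (OPnC)                  */
    };

    /* Lookup helpers standing in for the status tables of Table 5. */
    extern bool in_scalar_buffer(uint8_t time_index);
    extern bool in_result_buffer(uint8_t time_index);
    extern bool pending_on_switch(uint8_t time_index); /* associative queue read */
    extern bool being_computed(uint8_t time_index);    /* associative search     */

    void resolve_operand(bool has_time_index, uint8_t ti, struct op_flags *f)
    {
        f->op = true;
        if (!has_time_index || in_scalar_buffer(ti)) {
            f->b = true;              /* physical address: use scalar buffer */
        } else if (in_result_buffer(ti)) {
            f->r = true;
        } else if (pending_on_switch(ti)) {
            f->s = true;              /* also: add to switch queue and request
                                       * delivery to the selected SEU */
        } else if (being_computed(ti)) {
            f->c = true;
        }
    }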
Table 5 provides detailed specifications for these tables, including gate counts for the tables and estimated gate counts for implementing the accessing algorithms.

TABLE 5 SCALAR UNIT STATUS TABLES SPECIFICATIONS

    I.   Scalar Buffer Table
         Size: 256 entries
         Fields: 1 bit to indicate the presence of each entry
         Parallel Accesses: 6 reads, 2 writes
         Gate Count: 4*6*256 = 6144

    II.  Result Buffer Table
         Same as for the scalar buffer table.

    III. Pending Requests to Use Scalar Switch
         This table will be described in Section 4.4.3.

    IV.  Results Being Computed
         Size: one entry for each SEU, or a total of 6
         Fields: destination address (12 bits); time index (8 bits);
         1 bit indicating use of result buffer or scalar buffer; 6 bits
         indicating other SEUs requesting the result
         Parallel Accesses: 6 associative searches of time index;
         6 stores to the SEU result request bit (each SEU has its own
         bit); 1 initial store for all fields
         Gate Count: (6*12*8 + 6*4 + 27)*6 = 3762

4.3.3 Scalar Execution Unit Buffers

In this section we will discuss the scalar buffer, result buffer, and vector-scalar buffer shown in Figure 5. We will describe their internal structure, their external connections, and conflict resolution. The memories are organized into independently accessible modules. The number of these is determined by the maximum data rate at which the memories can operate. This in turn is determined by the data rates of the SEUs. Table 6 provides this analysis for the three buffers. The connections to the outside world include data paths to the scalar switch and other units as well as switches between the memories and these data paths. Table 6 lists the sizes of the required switches. Conflicts may arise when there are simultaneous requests to read and write the same memory module from the scalar switch. In addition, conflicts may arise between the scalar switch and other units requesting access to the same memory module. Multiple requests for the same memory module by the scalar switch are resolved from within the switch. All other conflicts are handled by a simple rotating priority scheme. One of the requests is given priority and honored; the others must wait until they are given priority. The priority shifts between units in such a way that all are given priority once before any receives it twice. A detailed design of this type of priority logic for a more complex case will be given in Section 4.4.2.4. Table 6 provides a gate count for all the logic discussed.

TABLE 6 DESIGN PARAMETERS AND GATE COUNTS FOR SCALAR BUFFERS

    Buffer                Communicates With  Max Output Rate*  Max Input Rate*
    Scalar Buffer         6 SEUs             12                6**
                          Result Buffer      --                6**
                          Vector Switch      --                8
                          TOTALS             12                14
    Result Buffer         SEUs               12                6
                          Scalar Buffer      6                 --
                          TOTALS             18                6
    Vector-Scalar Buffer  SEUs               6                 6
                          VEUs               6                 6
                          TOTALS             12                12

    *Data rates in words per major clock.
    **The sum of these must be <= 6.

The above data rates are based on the assumption of 6 SEUs operating at full capacity. The maximum rates are in general the maximum possible rates for an individual unit, and not all maximum total rates could be maintained simultaneously. In converting these rates to actual access rates, we take full advantage of the fact that these units are 8-word parallel buffers. Conflicts keep this assumption from being totally correct, but given the highly queued nature of the design and the initial remarks in this note, the assumption seems reasonable.
TABLE 6 (cont.)

SCALAR UNIT BUFFER SPECIFICATIONS

    Buffer                Size   Read/Write Rate  Rate per Mod     Access Time      Gate Count
                                 per Major Clock  per Minor Clock  in Minor Clocks  (64 bit word)
    Scalar Buffer         8*256  26               0.41             2                524,288
    Result Buffer         8*32   24               0.38             2                65,536
    Vector-Scalar Buffer  8*32   30               0.47             2                65,536
    TOTAL                                                                           655,360

MEMORY SWITCH SIZES

    Buffer                Read   Write
    Scalar Buffer         8 x 2  8 x 2
    Result Buffer         8 x 3  8 x 1
    Vector-Scalar Buffer  8 x 2  8 x 1

4.3.4 Scalar Switch

The Scalar Switch transmits data between the buffers and execution units of the scalar portion of the Computation Unit. We have discussed its functional operation in the preceding sections. In this section we provide a detailed design. Table 7 summarizes the data and instruction paths of the switch. Figure 9 gives the structure of a representative portion of the switch. In discussing the SEU sequence controller, we did not specify how many SEUs each controller drives. The scalar switch ties together all the previously discussed scalar components, and thus, at least for the purpose of providing gate estimates, we need to assume some realistic configuration. In this section we have assumed three sequence controllers driving six SEUs. This would allow for three independent types of Scalar Execution Units.

The principal complexity of the switch design results from possible conflicts. Such problems may arise both in the instruction switch and in the data switch. Conflicts are resolved by priority logic like that mentioned in the previous section and discussed in detail in Section 4.4.2.4. Conflicts can occur for any of three reasons. All requests for accessing data originate from the queues. These requests must enter "source associated" queues. If more than one attempt is made at a time to make an entry, a conflict results. Another source of conflicts is the simultaneous attempt to access the same memory mod. The final source of conflicts is the limited number of ports into any unit. There may be too many simultaneous requests to use these ports. The mechanism for resolving all these conflicts involves two basic principles. First, once a request is made, the requesting unit waits until it receives confirmation. The priority hardware mentioned above assures that this will always happen fairly soon. The second principle is that requests are always made and honored for the earliest time at which the requesting unit is capable of honoring them. In other words, the requesting process is pipelined with the hardware that executes requests. Two of the above mentioned conflicts are interdependent. Both a memory mod and a switch port must be reserved in accessing any of the buffers. This is handled by not requesting a memory mod until a port has been reserved. It also requires an additional minor clock of pipelining in the requests for ports to insure that they can be used at full capacity. Table 8 provides gate estimates for this logic.

We add a final note that all this pipelining and requesting circuitry is not likely to create problems in throughput. This is because the SEUs only process one instruction per major clock, whereas all this conflict resolution occurs at the minor clock rate. In addition, the input and output of each SEU is buffered. Thus there should be adequate slack as long as the data rates can be maintained. The most likely source of trouble would be a poor distribution of data across the buffers. If this proved to be a problem, it could be alleviated by doubling the memory speed to a 1 minor clock cycle.
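As a minimal sketch of the request-confirm discipline just described, the following state machine models one requester's view of the two-stage, port-then-mod reservation. The state names and the simplified timing are our own; the bounded wait relies on the rotating-priority logic of Section 4.4.2.4.

    #include <stdbool.h>

    /* A switch port is requested first; only after the port is confirmed
     * is the memory mod requested.  Rotating-priority hardware guarantees
     * each confirmation arrives within a bounded number of minor clocks. */
    enum req_state { IDLE, WANT_PORT, WANT_MOD, TRANSFER };

    enum req_state step(enum req_state s, bool port_granted, bool mod_granted)
    {
        switch (s) {
        case IDLE:      return WANT_PORT;                        /* issue port request */
        case WANT_PORT: return port_granted ? WANT_MOD : WANT_PORT;
        case WANT_MOD:  return mod_granted  ? TRANSFER : WANT_MOD;
        case TRANSFER:  return IDLE;                             /* data moves this slot */
        }
        return IDLE;
    }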
TABLE 7 DATA AND INSTRUCTION PORTS FOR THE SCALAR SWITCH

DATA PORTS

    Unit                  Input Ports  Output Ports
    6 SEUs                6            6
    Scalar Buffer         2            2
    Result Buffer         1            3
    Vector-Scalar Buffer  1            2
    TOTALS                10           13

    See Table for the source of these figures.

INSTRUCTION PORTS

There are three SEU queues, each of which has a path to all three buffers.

[Figure 9, "Scalar Switch," is a block diagram of a representative portion of the switch; its labels were not recoverable from this copy.]

TABLE 8 GATE COUNT FOR SCALAR SWITCH

    Unit                         Number  Gate Estimate  Source of Estimate                   Total
    Source Queue (examines 2
      entries simultaneously)    2       2 400          4 entries in queue, Section 4.3.2.1  4 800
    Source 3x2 Switch            2       480            20 bit instructions                    960
    Source Queue Control         5       1 000          Figure                               5 000
    Source Queue                 1       1 200          4 entries in queue, Section 4.3.2.1  1 200
    Source 3x1 Switch            1       240            20 bit instructions                    240
    Local Data Switches:
      SEU 1x2                    6       256            64 bit word                          1 536
      Scalar and Result
        Buffers 8x5              2       10 240         64 bit word                         20 480
      Vector-Scalar Buffer 8x3   1       6 144          64 bit word                          6 144
    Local Switch Controls        3       3 000          Figure                               9 000
    Scalar Switch 10x13          1       39 520         64 bit words, 12 bit addresses      39 520
    TOTAL                                                                                   88 880

4.4 VECTOR PORTION OF COMPUTATION UNIT

The vector portion of the execution unit is intended to do the bulk of the actual processing. The primary purpose of the scalar unit just discussed is to avoid having to use full vector processors when these are not required. The justification for having the vector units is a combination of utility and economy. We know that most FORTRAN programs can effectively utilize vector units of at least width 8. We have already observed in Section 4.3 the substantial overhead that is involved with the queue driven and pipelined approach. By allowing each instruction to drive the equivalent of 8 parallel execution units, we minimize the cost of this overhead. The tradeoff in determining how wide such units should be is overhead cost versus utilization of the potential parallelism. As discussed in Section 3.3, we consider the whole area of parallel computing to be in a very primitive state. Thus, we justify the width we have chosen solely on the grounds that we know it will work for a very broad class of problems and that we can implement it with an acceptable level of overhead. We do not wish to enter into the extraordinarily complex question of quantifying the tradeoffs. We will now discuss the overall structure of the vector unit.

4.4.1 Overall Structure of Vector Portion of Computation Unit

Those portions of Figure 2 that constitute the vector unit are the VEUs, the Vector Switch, and the Vector Buffer. The VEUs perform the actual vector processing and are fairly complex units containing instruction queues, buffers, and other hardware in addition to that which does the actual computation. The Vector Buffer acts as a back-up reserve storage for the buffers within the VEUs. The Vector Switch is responsible for transferring data among these units and between them and the memory. In addition, it can transmit data to the Scalar Buffer and to the MIDs. It also contains its own internal queues. We will now describe each of these units in detail.

4.4.2 Vector Execution Units

Figure 10 gives the overall structure of a typical VEU. The unit is controlled by the sequencer, which reads instructions from the instruction queue. The sequencer tests successive entries in the queue until one is encountered with all operands present in the operand buffer. The sequencer will then set up this instruction to commence execution as soon as the current instruction is finished.
The access controllers resolve any conflicts that may occur in accessing the buffers. The hardware associated with the internal switch allows results to be used as operands without going through the Vector Switch. We will discuss all the units of Figure 10 in more detail in subsequent sections. We will first discuss the various possibilities for the computation hardware.

[Figure 10, "Vector Execution Unit," is a block diagram; its labels were not recoverable from this copy.]

4.4.2.1 Standard Arithmetic Units

The computation hardware may be a standard arithmetic unit. In other words, it may perform standard floating point and fixed point arithmetic and logical operations. We will not discuss the logic to do arithmetic or similar operations. There are a number of ways in which the parallel units can be organized. We will discuss several of these alternatives and their advantages and limitations.

The simplest structure would be an ILLIAC IV type of parallelism, that is, eight units driven by a single control sequencer such that the units all perform identical operations. An alternative method of implementing the same logical structure would be an eight stage pipeline. Since data transfers are pipelined within the unit as we have discussed, a pipelined arithmetic unit would fit in quite nicely. In the case of parallel units, we would have to phase them or buffer them to accommodate the pipelined data transfers. The advantages of this type of parallelism are that fewer gates are required for control purposes and the instructions are relatively simple. The disadvantage is the lack of flexibility. The statement that most FORTRAN programs can effectively utilize a parallelism of eight was based on the assumption that each unit can perform a different arithmetic operation in parallel.
This is that tree units have a scalar output and vector units a vector output. There are three possible destinations for such a scalar. These are the scalar portion of the Compu- tation Unit, the MID, and a vector operand as one element of it. In this latter case, the vector may be used in full vector computations and/or be used in more tree computations. We need to include hardware to accommodate these possibilities. We will now discuss each of these alternatives. We have already mentioned in Section 4.3 how some scalar buffer addresses refer to data from outside the scalar buffer. Results of vector instructions that have the scalar unit as destination need to be sent to the above men- tioned portion of the scalar buffer. The logic for handling conflict reso- lution for multiple VEUs will be in the scalar unit as discussed in Section 4.3.3.3. The VEU need only interpret this destination from the queue in- struction and send the data and its destination address out over the appro- priate path. The same procedure can be followed in the case of data headed for the MID. We do not do a detailed logical design of the MID, but the techniques discussed in Section 3.2.1.3 can be used to handle conflict resolution. The final alternative we have to consider is that of scalars that are to become part of a vector. We require special hardware within the vector portion of the computation unit for this case. This must include a scalar switch to transfer the data to the correct position in the destination vector and controls that are able to recognize when a complete vector has been assembled and is available for further processing. In the case of a single 88 SCALAR SOURCE and address ""8x1 VECTOR SWITCH BUFFER VECTOR OUTPUT PATH CONTROL VECTOR ELEMENT PRESENCE BITS VECTOR PRESENCE BITS When all 8 Vector Element Presence Bits are set for a single vector, then the corresponding Vector Presence Bit is set. FIGURE 11 ASSEMBLING SCALARS INTO A VECTOR 89 tree unit, it would probably be desirable to have this hardware within that unit. With multiple units we would probably want this hardware as part of the vector buffer. Figure 11 describes this logic and Table 9 provides gate estimations. Another type of tree we might wish to include is the if tree analyzer [3]. This would be especially helpful in reducing the amount of non- determinism in a program. Our highly pipelined structure makes this espe- cially desirable. Including such a feature is also necessary to obtain the theoretical speed of FORTRAN programs we have discussed earlier. Function- ally, the if tree analyzer is no different than the trees we have discussed except that it has a single output that goes to the MID. TABLE 9 GATE COUNT FOR SCALAR ASSEMBLING UNIT Unit 8 x 1 Switch Vector Buffer (8 vectors) Vector Element Presence Bits Control Vector Presence Bits TOTAL 25,000 Gate E: stima te 2 200 20 000 600 2 000 200 90 4.4.2.2 Vector Routers At least one of the VEUs will be devoted to a vector router or full crossbar switch. Given the relatively small size of our vectors and the broad spectrum of permutations that various algorithms may require, a full crossbar switch is justified. In addition to allowing the arbitrary permu- tations of a vector, this unit should allow for the combining of two oper- ands under mode control. It would also be desirable to allow selective partial broadcasting. Mode bits and routing patterns may either be included as part of the instruction or be dynamically computed within the EUs. 
4.4.2.3 Other Vector Units

There is no need to limit the computation hardware to the alternatives just discussed. Even after the machine has been constructed, different sorts of units could be added or used to replace existing units. In the next section we will discuss in detail the internal queues, switches, and controls for a single VEU. All of this hardware could be used "as is" for any type of VEU. Not all of it will necessarily be included in every VEU. The point is that at any time we could add specialized hardware without designing more than that hardware and some very simple interfaces.

4.4.2.4 Detailed Internal Structure of a VEU

In this section we will finish our discussion of the remaining units of Figure 10. We now list those units requiring elaboration. The instruction queue is logically the same as that discussed in Section 4.3.2.2. The operand and result buffers operate in a special phased array fashion which we will describe in detail. We will provide a detailed design for the general purpose access controllers which we have referred to in previous sections. The boxes associated with the internal switch will require no new specialized design. We will provide gate estimates for the entire unit at the end of this section.

We begin our detailed design with the phased array buffers. This memory is eight modules wide, corresponding to our vector width. All accesses are to a single vector stored in the same relative position across the memory. The data paths themselves are only one word wide, and so the data transfer must be pipelined. Once an access has started for one vector in the memory, we cannot always afford to wait eight clocks before starting a new access. Thus, we essentially shift the decoded address from one memory module to the next. We do this in a manner that allows a new address to enter at any clock. Similarly, we have a switch that allows the data to be transferred to several places. The addresses for this switch are shifted in parallel. Figure 12 shows the structure of such a unit. Table 10 describes its operation.

[Figure 12, "Data Buffer": address decoders feed slotted addressing units that shift enable patterns across the eight memory modules, with a fanned-out data path switching each selected word to the correct data path.]

TABLE 10 DATA BUFFER OPERATION

    Minor Cycle  Event
    0            Two addresses simultaneously enter the decoders and are
                 decoded.
    1            The enable patterns enter the addressing units and cause
                 the selected memory location to be switched to the correct
                 data path. A new pair of addresses is decoded.
    2            The enable pattern in the addressing unit is shifted one
                 to the right and a new pattern takes its place. These are
                 both used to switch two words to two data paths. A third
                 address enters the addressing unit.
    3            All enable patterns are shifted one right. A new address
                 enters the address decoder and a new enable pattern the
                 addressing unit. Three words are accessed.
    4            Same as minor cycle 3, but four words are accessed.
    5            Same as minor cycle 3, but five words are accessed.
    6            Same as minor cycle 3, but six words are accessed.
    7            Same as minor cycle 3, but seven words are accessed.
    8            Same as minor cycle 3, but eight words are accessed.
    9            Same as minor cycle 8, except the right-most address is
                 dropped.
    etc.         Operation continues in this manner.

During any minor cycle in which no new addresses are presented, there will be a vacant slot that will move to the right in the same manner as the enable patterns. Along with the enable patterns, there is a single bit which indicates whether a read or a store is being performed.

We now turn to the general problem of designing priority access controllers. The logic we design must be fast enough to operate in one minor clock. It must treat all requests equally. It must ensure that each requesting unit receives top priority once before any unit receives it twice. Finally, the design should be general enough to accommodate any number of requesting units up to 32. Actually, no application within this machine will require that many units, but our design will meet the other constraints for up to that many units. We will first describe our algorithm for the case when the number of requesting units is a power of 2. We will then show how the algorithm can be modified to handle the remaining cases.

In describing the power of 2 case, we will assume 8 requesting units. It will be obvious how to generalize to larger or smaller powers of 2. Functionally, our unit is presented with 8 bits, any combination of which may be set. We must produce an output of 8 bits, only one of which is set. This bit must correspond to one of the bits that was originally set. Over a period of time, the selection process must conform to the requirements listed above.

Physically, the unit consists of three levels, or log base 2 of the number of bits in the general case. At the first level, there are 4 two-state devices through which pairs of bits pass. At each level the number of devices is halved, and the number of bits passing through each device doubles. All the devices have two states. The bits passing through each device are divided into two groups. The device will pass on to the next stage a one bit from only one of these groups. The two groups passing through a single device form a single group for the next stage. Thus, by induction, each group has at most one bit set. The choice of which group to pass on is a function of the state of the device and its input. The state of the device indicates a preference for one group or, in the other state, the other group. By preference, we mean simply that if the preferred group has a one in it, that one is passed on; otherwise the other group's one is passed on. Of course, if neither group has a one, then nothing is passed on. Figure 13 gives an example. It should be clear that only one of the originally set bits can emerge. By changing the devices' states in an appropriate sequence, we are able to provide the uniform scheduling required. The lowest level changes state at every clock. Each higher level changes state in twice the number of clocks as the next lower level. Thus, every bit position will go through all eight priority states in every eight clocks. Figure 13 illustrates this.

We now consider the case where the number of requesting units is not a power of 2. We start with the design just described for the smallest power of 2 greater than the number we are considering. By appropriately allocating the requesting units to the excess available slots, by sequencing the entire unit correctly, and by allowing some requesting units to use either of two slots, we can meet the design constraints. We will present an informal constructive proof.

First, we precisely restate the problem. We have N units. We must construct a circuit which, when presented with N bits, any subset of which may be set, will select a single bit. It must perform this selection on a priority basis. These priorities must rotate in such a way that any bit will go through all possible priorities from 1 to N before any priority is repeated. We have already demonstrated how to construct such a circuit when N is a power of 2. We will prove the more general case by induction. It is clear we can construct such a circuit for N = 1. Now we assume we can construct the circuit for all integers less than or equal to M, the greatest power
We now consider the case where the number of requesting units is not a power of 2. We start with the design just described for the smallest power of 2 greater than the number we are considering. By appropriately allocating the requesting units to the excess available slots, by sequencing the entire unit correctly, and by allowing some requesting units to use either of two slots, we can meet the design constraints. We will present an informal constructive proof.

First, we precisely restate the problem. We have N units. We must construct a circuit which, when presented with N bits, any subset of which may be set, will select a single bit. It must perform this selection on a priority basis. These priorities must rotate in such a way that any bit will go through all possible priorities from 1 to N before any priority is repeated. We have already demonstrated how to construct such a circuit when N is a power of 2. We will prove the more general case by induction. It is clear we can construct such a circuit for N = 1. Now we assume we can construct the circuit for all integers less than or equal to M, the greatest power of 2 that is less than N. We will use these circuits to show how to construct a circuit for N.

We need to consider two cases, N even and N odd. First, if N is even, we use two circuits of size N/2. We then add one additional level that chooses between the outputs of these two circuits. By varying this selection choice every N/2 selections, we have the desired circuit.

The case where N is odd is somewhat more complex. Let K be such that N = 2K + 1. We begin with two circuits of size K + 1. Again, we will add an additional level to select between these. We will assign K of the inputs to circuit A and K + 1 to circuit B. In addition, we install binary switches to allow any of the K + 1 inputs of B to use the vacant input of A. We will need to assume that a circuit of size L + 1, with L even, can be used with L inputs. This is clearly true for L = 1. It will be true for larger L because of the way these circuits are constructed out of smaller circuits, as we have outlined above. In particular, the circuit of size L + 1 is made up of two circuits of size L/2 + 1, with a global level for selecting between them. Clearly, if these smaller circuits can be made to work for size L/2, then we can sequence the larger circuit in a way that it will work for size L. Thus, given the way we are constructing our circuit of size N, we may assume the circuits of size K + 1 can work for K inputs.

We now proceed with the construction of our size N circuit. For the first K states of the entire device, we sequence A and B in any way that ensures no input will have its priority repeated. We do this by giving B highest priority at the highest added level of the circuit and by sequencing A and B individually so they do not repeat any states.

For the remaining K + 1 states, we must give priority to A and, in addition, during each of these states, switch one of the B inputs into the vacant spot of A. We need to pay special attention to the element assigned priority K + 1 during each of these remaining states. All but one (we will call it Z) of the elements of B had that priority during the first K states. We will sequence A as if it were an ordinary K + 1 state device during these remaining states. Thus, during the state in which the vacant input of A has priority K + 1, we must switch Z into the vacant slot. During each of the other K final states, we must switch a different element of B into A. The element that must be switched during each of these other states is also uniquely determined.
Whatever priority the vacant element receives will correspond to a priority already assigned to K of the elements of B. Thus, the unique remaining element must always be switched. To complete the proof, then, we must show how we can do this switching and still ensure that none of the last priorities will be repeated. The algorithm for this is quite simple. We begin with any correct K + 1 sequence for B. We choose row X from this sequence as the one to be switched. In other words, we will always switch the element of B that would have been assigned priority X within B. This assures us that none of the last K priorities will be repeated. We note that we can arbitrarily permute the sequence in which these states occur. Thus, we permute them in such a way as to ensure that the element of B having priority X is the unique element that must be switched during each of the last K + 1 states. This completes the construction. Figure 14 gives an example for N = 7 and also provides a count of the number of switches for arbitrary N. Table 11 provides a summary gate count for the VEU.

[FIGURE 13 8 WAY PRIORITY SELECTOR: input bits, the states and outputs of the three device levels, and a possible sequencing of priority states with corresponding priorities.]

[FIGURE 14 NON-POWER-OF-2 PRIORITY SELECTOR: the N = 7 example, showing the inputs of circuits A and B, the vacant input, the final-level switch settings (R and S), the level states, and the sequence of states with priorities, together with the gate count summary below.]

Gate Count Summary. First we consider N a power of 2. The gate count for the basic selection logic is 8. Thus, for N = 2^K we have

    (N/2)(8+2) + (N/4)(8+4) + (N/8)(8+8) + ... + (8+2^K) = 8(N-1) + N - K.

For N not a power of 2, the gate count is at most the gate count for M, the smallest power of 2 greater than N, plus twice the gate count for N-1 switches (2^K = M). Thus, the total is

    <= 8(M-1) + M - K + 8(N-1) = 8(N+M-2) + M - K.
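Taking the printed formulas at face value (we have not re-derived them), the estimate can be computed mechanically. The function below is an illustration of the two cases, not part of the design; other tables in this chapter quote somewhat larger figures for complete controllers, which presumably include the state-sequencing logic.

    /* Gate estimate for an N-way priority selector, transcribing the
     * formulas printed above: 8(N-1) + N - K gates when N = 2^K, and
     * at most 8(N+M-2) + M - K otherwise, where M = 2^K is the
     * smallest power of two >= N (the 8(N-1) term covers the extra
     * input switches). */
    unsigned selector_gates(unsigned n)
    {
        unsigned m = 1, k = 0;
        while (m < n) { m <<= 1; k++; }      /* M = 2^K >= N */
        if (m == n)
            return 8 * (n - 1) + n - k;      /* N an exact power of 2 */
        return 8 * (n + m - 2) + m - k;      /* upper bound otherwise */
    }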
TABLE 11 VEU GATE COUNT
(Gate counts for the units in Figure 10)

Unit                               Source of Gate Estimate            Gate Count
Operand Buffer                     8x16x64 bits + addressing logic        40 000
Result Buffer                      8x16x64 bits + addressing logic        40 000
Instruction Queue                  Table 4                                 5 600
Internal Switch Queue              Table 4                                 5 600
Operand Buffer Access Controller   Figure 14                                 300
Result Buffer Access Controller    Figure 14                                 300
Sequencer                          Estimate based on function              1 000
Internal Switch Controller         Estimate based on function              1 000
Internal Switch                    64 bit words                              256
TOTAL                                                                     94 056

4.4.3 Vector Buffer

The Vector Buffer serves two purposes. It provides a source of operands that can be used by multiple instructions without accessing main memory. In addition, it provides space where intermediate results can be stored. These values are also stored within the VEUs, but the number allowed in a single VEU is quite small, probably 16. The detailed allocation of Vector Buffer storage is handled by the IUD. In this section we will provide a general functional discussion of this storage allocation and a detailed design of the Vector Buffer itself.

In Section 2, where we described OFFL, we noted that all instructions which perform operations have addresses referring to an intermediate buffer. All loads and stores to main memory are to locations in this virtual buffer. The physical buffer corresponding to this virtual buffer is distributed within the VEUs and the Vector Buffer. These virtual locations can be divided into two classes: those that were initially defined by an instruction to load from memory, and those that were defined as the result of some operation. All of the first class are assigned space in the Vector Buffer. All in the second class are initially assigned space within the VEU that is assigned the corresponding instruction. Elements of the second class will be transferred to the Vector Buffer if that is necessary to keep the storage space within the VEU from being exhausted.

We will refer to the Vector Buffer plus the storage space within the VEUs for results as the total vector buffer. Once a physical address within the total vector buffer has been allocated, it must remain allocated until the corresponding virtual address is reused. Once the virtual address is reused and all instructions with pending requests for the corresponding physical address have completed accessing it, that physical address may be reused. The contents of that physical location are then no longer accessible by any executing program. It would be possible to keep an associative memory that relates such buffer locations to main memory and thus in some instances possibly save some memory accesses. We do not include this option as part of our design, because it does not appear to us to provide much of a return for the logic that would be required, given our overall structure.
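The allocation rule just stated (a physical location may be reallocated only after its virtual address has been redefined and every pending access to it has completed) can be sketched in a few lines. The structure and field names below are ours, for illustration only.

    /* Illustrative bookkeeping for the total vector buffer.  An entry
     * is freed only when (a) its virtual address has been reused by a
     * later instruction and (b) its pending-access count has reached
     * zero; until then the physical location stays allocated. */
    typedef struct {
        int virt;        /* virtual buffer address this entry backs  */
        int use_count;   /* accesses still pending against the entry */
        int redefined;   /* the virtual address has been reused      */
        int in_use;      /* entry currently allocated                */
    } PhysEntry;

    /* An instruction has completed one access to entry e. */
    void access_done(PhysEntry *e)
    {
        if (--e->use_count == 0 && e->redefined)
            e->in_use = 0;               /* may now be reallocated */
    }

    /* The instruction stream has reused virtual address v; the old
     * physical copy is no longer accessible to any executing program. */
    void virt_redefined(PhysEntry table[], int n, int v)
    {
        for (int i = 0; i < n; i++)
            if (table[i].in_use && table[i].virt == v && !table[i].redefined) {
                table[i].redefined = 1;
                if (table[i].use_count == 0)
                    table[i].in_use = 0;
            }
    }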
Clearly, it is essential that the number of available virtual addresses not exceed the number of physical locations within the total vector buffer. Given the highly pipelined nature of the machine and the inevitable delays between the time when a virtual address is reused and the time that all pending instructions have completed access to the corresponding physical address, we require an excess of physical locations over logical or virtual locations. We will first discuss the number of virtual locations likely to be desirable and, on this basis, estimate a reasonable number of physical locations.

In determining the virtual buffer size, we will concentrate on the pipelining delay between memory and the VEUs. Considering other aspects, which are program dependent, makes the problem extremely complex. Further, our queued and pipelined structure is intended to ameliorate such problems across a broad spectrum of programs. Thus, it is reasonable to concentrate our attention on the buffer size required to keep the pipe flowing. The essential constraint in determining this will be the time for a transfer from a VEU to primary memory and back to the VEU. We need enough virtual memory space to ensure that a memory value that is reused within this time interval can be left in virtual storage. This leads us to the observation that the size of the virtual buffer is primarily dependent on the rate of reuse of memory locations within the specified delay time. This time cannot be computed exactly, but we can provide a rough conservative estimate. The overall delay is a sum of the following delays (the number in parentheses is the delay in minor clocks; the second number is the section in which the unit is discussed in detail):

1. Delay in the Vector Switch queue (4), 4.4.4
2. Delay in the Vector Switch (8), 4.4.4
3. Delay in the memory buffer queue (4), 4.5
4. Delay in the memory switch (11), 4.5
5. Delay in the memory page buffer (4), 4.5
6. Delay in memory store (8), 4.5

The total delay is twice the sum of the individual delays plus an additional trip through the Vector Switch, or 82 minor clocks. One VEU can generate 10 results in this time (one every 8 minor clocks). Thus, our 6 VEUs can generate roughly 60 results, and 64 would be a reasonable conservative size for the virtual vector buffer. This estimate would be adequate for a 100 percent reuse of memory values within the specified delay. Since the number of locations required for this assumption is reasonably small, and this case may be approximated over some program segments, it is reasonable to allow for this worst case.
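The sizing arithmetic can be replayed mechanically. The program below simply uses the figures quoted above (an 82 minor clock round trip, one result every 8 minor clocks per VEU, 6 VEUs) and rounds the total up to a power of two; it is a check on the estimate, not part of the design.

    /* Back-of-envelope check of the virtual buffer size estimate. */
    #include <stdio.h>

    int main(void)
    {
        const int round_trip = 82;        /* minor clocks, from the text   */
        const int clocks_per_result = 8;  /* one result per 8 minor clocks */
        const int veus = 6;

        int per_veu = round_trip / clocks_per_result;   /* about 10 */
        int total   = per_veu * veus;                   /* about 60 */

        int size = 1;                     /* round up to a power of two */
        while (size < total)
            size <<= 1;
        printf("virtual vector buffer size: %d\n", size);  /* 64 */
        return 0;
    }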
We now turn our attention to the physical buffer size required to achieve the specified virtual buffer size. The primary factor we have to consider here is the delay between the time a virtual memory location is reused in the instruction stream and the time the corresponding physical location can be reallocated. The start and end of this delay refer to the IUD. More specifically, it is the time beginning when the IUD notes that an instruction reuses an allocated virtual memory address and ending when the IUD is able to reallocate that address. The total time for this process is a function not only of the various pipe and queue delays, but also of the total number of pending requests to access the virtual memory location when it is reused. Instead of explicitly considering this general case, we will consider a particular case for which the delays are relatively easy to estimate and which should in most instances be the worst case. We will assume the following OFFL instruction sequence:

    LOAD A to T1
    LOAD B to T2
    COMPUTE T3 from A and B
    Instruction which uses T3
    Instruction which reallocates virtual location T3

In addition, we assume A and B are in the same memory page. Because accesses to virtual memory locations are buffered within each VEU, it is unlikely that these accesses will be delayed by a greater time than that required to fetch a single operand from memory. We now estimate the delays encountered by the above sequence. Again we give the time in minor clocks and the section in which the unit performing the function is described in detail.

1. IUD delay to complete processing memory instructions (8), 4.6
2. Delay in switching instruction into memory page queue (5), 4.5
3. Delay in memory page queue (4), 4.5
4. Delay in accessing memory (8), 4.5
5. Delay in memory switch (11), 4.5
6. Delay in Vector Switch queue (4), 4.4.4
7. Delay in Vector Switch (8), 4.4.4
8. Delay in Vector Switch queue (4), 4.4.4
9. Delay in Vector Switch (8), 4.4.4
10. Additional delay to access the second operand (8)
11. Delay in VEU queue (32), 4.4.2.4
12. Computation time (8), 4.4.2.4
13. Delay in Vector Switch queue (4), 4.4.4
14. Delay in Vector Switch (8), 4.4.4
15. Time to transmit information about the available virtual location to the VIDS (8), 4.6.6
16. Time to transmit information to the IUD (8), 4.6.3

These delays total 136 minor clocks, or 17 major clocks. During this period our 6 VEUs could generate up to 102 new results, each of which might require a new physical buffer location. Adding this figure to our earlier estimate of 64 different virtual addresses, we can see that a buffer size of 256 seems reasonable and leaves a substantial margin for error. The VEUs will contain 96 of these locations as their result buffers, and the remainder will be within the Vector Buffer. The internal design of the Vector Buffer will be functionally the same as the data buffer described in Section 4.4.2.4.

4.4.4 Vector Switch

The design of the Vector Switch requires that one solve two basic problems. First of all, one must determine the number of ports to and from the various units. Secondly, there is the problem of the internal structure of the switch. We begin with a discussion of the ports.

We will assume a machine with four binary VEUs and two unary VEUs. This could correspond to two routers and four vector/tree arithmetic units. This will require eight ports going to the binary VEUs and four ports coming from them. The unary units require two input ports and two output ports. A single port is required going to the Scalar Buffer. The remaining units requiring ports are the Vector Buffer and primary memory. The optimal size for the paths to these units depends on the ratio of primary memory references to total operand references. This figure varies across programs and within an individual program. In designing the ports to the Vector Buffer, we will assume two-thirds of all instructions access buffer locations already available. In designing the main memory ports, we will assume two-thirds of all instructions require a memory access. These assumptions should assure us that even in the worst cases the capacity of the Vector Switch will not slow the machine by more than a factor of one-third. Experimentation with an existing machine would undoubtedly provide the data for determining more cost effective distributions of ports. In the case of memory ports, our assumptions lead to a requirement of eight ports coming from memory and four ports going to memory. In the case of the Vector Buffer, things are a bit more complex. All operands that originate in primary memory are stored in the Vector Buffer. Those operands that were computed by earlier instructions may be in the VEU which computed them. Providing eight input and eight output ports for the Vector Buffer should roughly conform to our assumption. Table 12 summarizes these conclusions.
TABLE 12 VECTOR SWITCH PORTS

Unit             Input Ports (to unit)   Output Ports (from unit)
2 Unary VEUs              2                        2
4 Binary VEUs             8                        4
Scalar Buffer             1                        0
Memory                    4                        8
Vector Buffer             8                        8
TOTAL                    23                       22

We now turn our attention to the internal structure of the Vector Switch. It is a pipelined crossbar switch with queued instructions associated with each of its entry ports. Once a path in the switch has been reserved, it will remain active for 8 minor clocks and allow the transfer of an 8-word vector. Thus, there is a fairly long time available for searching the queues. This is important because requests to use the Vector Switch may be made long before the operand is available. Thus, in searching its queues, the Vector Switch must not only be sure that a path is available, but must also determine that the data is present. The presence of data is indicated by a single bit which is set whenever data is stored in any of the vector buffer locations. This bit is reset whenever the corresponding physical location is freed; i.e., when its use count is zero and the corresponding logical location has been reused. Note that the algorithms for keeping track of vector buffer storage are simpler than those for scalar buffer storage, because each different-valued vector has a different physical address, and there is no need for time indexes to keep track of them. On the other hand, the scalar switch does not have to test for the presence of data, since requests are never entered in its queues until the data is available.

Most of the logical design for the above mentioned functions is similar to work we have already done. However, the large number of "functionally identical" ports going to and from the vector buffer and memory does present us with a new allocation problem. The solution is to assume that these paths become available in a time-skewed fashion. There are in all cases either 4 or 8 paths, which are tied up for 8 minor clocks once they are reserved. Further, because they feed memories that must be allocated in a time-skewed fashion, some form of time skewing is required. Thus, we can assume that only one of these paths becomes available in each minor clock, and the standard priority hardware from Section 4.4.2.4 can be used. The same priority unit will be used to schedule all the paths in any equivalent set. This scheme will accommodate the problems associated with multiple input ports.

We have a related problem associated with multiple output ports. We are searching queues to drive the ports, and at least one minor clock is required for each test of a queue entry. Thus, unless every entry tested is ready to transfer, we cannot run at the maximum possible data rate. We solve this problem with multiple queues. One queue for every four paths will allow every other entry to be unavailable and still run at the maximum rate. Table 13 summarizes the hardware in the Vector Switch; a sketch of the queue-entry test follows the table.

TABLE 13 VECTOR SWITCH HARDWARE

I. QUEUES FOR ALL UNIT OUTPUT PORTS

Unit           Number of Units  Paths/Unit  Queues/Unit  Queue Size (a)  Gate Count/Queue (b)  Total Gates
Unary VEUs           2               1           1            16               6 696              13 392
Binary VEUs          4               1           1            16               6 696              26 784
Memory               1               4           1            64              23 784              23 784
Vector Buffer        1               8           2            64              23 784              47 568

II. CONFLICT RESOLUTION CIRCUITS FOR ALL UNIT INPUT PORTS

Unit           Number of Units  Paths/Unit  Requesting Units (c)  Gate Count (d)  Total Gates
Unary VEUs           2               1               8                 168             336
Binary VEUs          4               2               8                 168           1 344
Scalar Buffer        1               1               8                 168             168
Memory               1               8               8                 168             168
Vector Buffer        1               8               7                 160             160

III. CROSSBAR SWITCH

Switch Size: 21 x 22 x 80 bits
Gates: 147,840

TOTAL GATES: 262,544

a. See the previous section for the basis of the estimates.
b. See Section 4.3.2.2 for the queue gate count formula.
c. This is the total number of queues minus the queues for this unit.
d. See Section 4.4.2.4 for priority logic gate counts.
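The test a queue search must apply to each entry is simple to state. The sketch below uses our own field names and shows the two conditions that must hold before a transfer can start.

    /* A Vector Switch queue entry may start its transfer only when a
     * crossbar path is free at both ends and the operand's
     * data-present bit is set; requests may be queued long before
     * their data exists.  Field names are illustrative. */
    typedef struct {
        int src_port, dst_port;  /* ports the reserved path must join      */
        int data_present;        /* set when the vector is actually stored */
    } SwitchRequest;

    int can_start(const SwitchRequest *r,
                  const int src_busy[], const int dst_busy[])
    {
        return r->data_present
            && !src_busy[r->src_port]
            && !dst_busy[r->dst_port];
    }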
4.5 MAIN MEMORY

Logically, main memory is divided into pages. These pages are 8 words wide. A reasonable length would be 1K. Physically, each page is independently queue driven. Load and store systems of switches connect these pages to buffers in front of the memory ports in the Vector Switch. Other switches distribute queue entries, modes, and indexes to the control portion of each page. No indexing is allowed across pages. Figure 15 shows the overall structure of memory and the load system of switches.

[FIGURE 15 OVERALL STRUCTURE OF MEMORY AND THE LOAD SWITCHING NETWORK]

We now discuss the organization and operation of this system. It may frequently happen that for a short period of time it is desirable to access an individual memory page at the maximum rate possible for that page. On the other hand, the number of memory ports in the Vector Switch makes it pointless to be able to simultaneously access all pages at their maximum possible rate. There is a virtually unlimited number of ways a memory may be organized, considering these constraints. We have chosen one that seems reasonable and workable. We will consider a 1 megaword memory. It will be clear from the discussion how to generalize to other sizes. The Vector Switch has 8 input ports for communicating with memory (see Section 4.4.4). Thus, we want to design our switch to accommodate this data rate coming from anywhere in memory. We will assume the cycle time for main memory is 8 minor clocks. Thus, the data path leaving one memory page need only be one word wide if it is pipelined at one transfer every minor cycle. To exactly accommodate the Vector Switch data rate, we need to allow at most 8 pages transferring data at any given instant. The pages are grouped into blocks of 8. There are 16 of these blocks. We allow a maximum simultaneous transfer of up to 8 words from each of these groupings. All transfers are pipelined at the rate of one per minor clock.

A combination of crossbar switches and global control is used to referee conflicts. Before a page is allowed to initiate a transfer into this structure, it must have a path reserved all the way to the highest level of the structure. This is not to say that the path must be clear at the time the transfer begins, but only that it will become clear at each stage when required. We will now discuss the algorithms for allocating these paths. Since 8 minor clocks are required to complete a vector transfer, we need only allocate our various groups of 8 paths at the rate of one per minor clock. Up through the first level of crossbar switches, every page has its own path. However, the outputs of these paths are ganged together so that one output from each of the level 1 switches is an input to the same path in the level 2 switch. Thus, allocating paths consists of determining which memory pages may initiate transfers through the level 2 switch and transmitting to the switches the identity of the paths available. At a given clock, any number of paths in the level 2 switch may be available. However, to keep our allocation algorithm to a reasonable size, we will consider that at most one path becomes available during each minor clock.
At most we introduce brief transient delays by this restriction. There will be no loss in assuming that a given fixed path becomes available at a given clock. In other words, the global control attempts to allocate the level 2 paths on a round-robin basis. If there are no outstanding requests at a given clock, then the path assigned that time slot will remain vacant at least for the next 8 minor clocks. We must keep the number of pages requesting a path at a given clock to a number we can handle. This can be done by having local controllers limit the requests from each group of 8 pages to one. Thus, the global controller will have at most 8 requests to deal with in any clock. The controllers at both levels will use the access controller described in Section 4.4.2.4.

We can now describe the complete functioning of the memory in transferring data to the Vector Switch. The numbered paragraphs correspond to successive minor clocks; a skeleton of the arbitration in steps 1 through 4 appears below.

1. All memory pages with queue entries ready to initiate a memory access send a bit to the local controller.

2. Each local controller selects one of these pages for possible transfer and, if it had any requests, sends the global controller a request for a path.

3. The global controller selects one of the requests from the local controllers to honor and notifies that local controller.

4. The local controllers notify the winning page.

5. The transfer begins through the first level of the crossbar. (At clock 3 the global controller also notified the local controller which path(s) to use in the crossbar.)

6. The transfer from the lower level crossbar to the global crossbar begins.

7. The transfer from the global crossbar to the buffer begins.

The global crossbar works in a fundamentally different way from the local crossbars. It is successively transferring data to different modules in one of the buffers. Thus, with each minor clock, it changes its configuration.

Several remarks about this process are necessary. First, the entire unit must be pipelined so that each function is occurring at every clock. In practice, this is not particularly difficult; it only requires some buffering of information. The requests for transfer always come from the pages 4 clocks prior to the time they are actually able to begin the transfer. Thus, the only loss from the decision delays occurs when a new entry arrives in the intervening 4 clocks. The switches and memories may all operate all the time and at the maximum data rate the Vector Switch allows. With the transfer of the data, a queue entry is also transferred. This queue entry will be used to request use of the Vector Switch. The data paths must be slightly larger than one word to accommodate the queue entry, which would probably need to be divided into 8 parts.
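Steps 1 through 4 amount to a two-level arbitration. The skeleton below is ours: plain rotating scans stand in for the rotating priority circuits of Section 4.4.2.4, and the 16 groups of 8 pages follow the 128-page organization described earlier. Only the structure, not the exact selection logic, is intended.

    /* Two-level path arbitration: each local controller reduces its
     * group of 8 pages to one candidate; the global controller honors
     * one candidate per minor clock. */
    #define GROUPS 16
    #define PAGES_PER_GROUP 8

    /* requests[g][p] is nonzero if page p of group g wants a path.
     * On success the winner is returned through win_group/win_page. */
    int arbitrate(const int requests[GROUPS][PAGES_PER_GROUP],
                  unsigned clock, int *win_group, int *win_page)
    {
        int candidate[GROUPS];

        for (int g = 0; g < GROUPS; g++) {        /* local controllers */
            candidate[g] = -1;
            for (int i = 0; i < PAGES_PER_GROUP; i++) {
                int p = (int)((clock + i) % PAGES_PER_GROUP);  /* rotate */
                if (requests[g][p]) { candidate[g] = p; break; }
            }
        }
        for (int i = 0; i < GROUPS; i++) {        /* global controller */
            int g = (int)((clock + i) % GROUPS);
            if (candidate[g] >= 0) {
                *win_group = g;
                *win_page  = candidate[g];
                return 1;
            }
        }
        return 0;                          /* no requests this clock */
    }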
A request to store data always takes precedence over a request to load data. To initiate a store, a path must be reserved through a switching network similar to the one just discussed. In addition, it must be verified that there is room in the store buffer of the destination page. A buffer for stores within each page is desirable because of possible sequencing problems, which we will discuss later in this section when we describe the internal operation of a single memory page. Since the Vector Switch only has four output ports to memory, allowing one store to be initiated in each minor clock allows for transients of twice the maximum long-term data rate. Thus, we will provide a switching network to allow the transfer of one store request per minor clock to the specified page. If that page has space available, it sends back to the requesting unit a signal to proceed. This sequence is pipelined and requires two minor clocks. For stores, there can be no conflicts like those encountered with loads. Thus, we simply use the next slot in the highest level crossbar and ask to set up the level 1 crossbar that services our destination page.

In addition to the load and store switch paths just discussed, we need an instruction switch to transfer load queue entries to the appropriate page. Since only one instruction is required for each vector load, this network can be similar to, but less complex than, the load and store switching networks. This same switch can be used for the transfer of scalar indexes and scalars used for mode control.

We now come to the internal structure of the individual pages. Figure 16 shows this structure. Entries from the instruction switch may be either load queue entries, modes, or scalar indexes and are switched either into a scalar buffer or into an instruction queue. Entries may arrive from this switch at the rate of one per minor clock. Since a vector access can only occur once every major clock, this will be a more than adequate data rate. Vectors arriving from the store switch may be transferred either to the vector index buffer or to the store buffer, depending on their intended use. The store buffer allows the load switch and the store switch to be transferring data with the same memory page at the same time. The load buffer allows memory to be synchronized with the load switch. The control processes the queued instructions and referees possible conflicts between the load and store switches.

[FIGURE 16 INTERNAL STRUCTURE OF MEMORY PAGE: the instruction switch feeds a scalar index and mode buffer and the memory instruction queue; the store switch feeds the vector index buffer and the store buffer; memory feeds the load buffer toward the load switch; the control exchanges information with all internal units and with the switching networks.]

We will now outline the operation of the memory page control. The queue which contains both load and store instructions is continuously interrogated to see if an instruction can proceed. The conditions which must be met are as follows (a sketch of this test appears below):

1. All required indexes and modes are present.

2. No earlier instruction which conflicts with this instruction is still in the queues.

3. In the case of stores, the required data is present.

Condition 2 requires further explanation. Clearly, no load or store can proceed if there is an earlier store with unknown indexing and unknown or overlapping modes. Similarly, no store can proceed if there is an earlier load with unknown indexes and unknown or overlapping modes. These are the weakest possible conditions for the existence of conflicts, and it would be possible to test for these specific conditions among the first few queue entries. We will base our gate estimates on this capability, although a somewhat weaker condition might prove more practical.

Space in the indexing buffers is reserved by the IUD in the same manner as space within the Vector Buffer. Thus, these buffers have to notify the IUD when the values they contain have been used and the space is free. One can estimate sizes for these buffers by an analysis like that in Section 4.4.3.
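The dispatch test can be summarized as follows. This is a conservative sketch of conditions 1 through 3 with invented field names: any unknown index or mode information is treated as a potential overlap, and a real design would also compare the known index and mode patterns of the leading queue entries.

    /* Conservative dispatch test for a memory page queue. */
    typedef struct {
        int is_store;
        int indexes_known;   /* condition 1: indexes present */
        int modes_known;     /* condition 1: modes present   */
        int data_present;    /* condition 3: stores only     */
    } PageOp;

    /* Earlier loads conflict only with later stores; earlier stores
     * conflict with everything behind them unless the addressing on
     * both sides is fully known (and, in a real design, disjoint). */
    static int may_conflict(const PageOp *early, const PageOp *late)
    {
        if (!early->is_store && !late->is_store)
            return 0;                     /* load after load is safe */
        return !early->indexes_known || !early->modes_known
            || !late->indexes_known  || !late->modes_known;
    }

    int can_issue(const PageOp q[], int i)   /* may entry q[i] proceed? */
    {
        if (!q[i].indexes_known || !q[i].modes_known)
            return 0;                                    /* condition 1 */
        if (q[i].is_store && !q[i].data_present)
            return 0;                                    /* condition 3 */
        for (int j = 0; j < i; j++)                      /* condition 2 */
            if (may_conflict(&q[j], &q[i]))
                return 0;
        return 1;
    }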
Determining a size for the store buffer is more complex, since it is dependent on how much instruction reordering is done by the queue and control. We will estimate a size of 8 as being reasonably small and probably larger than will usually be needed. Table 14 provides a summary of all the hardware described in this section.

There are two capabilities that we do not provide in this design that might be of considerable practical value. One would be the ability to provide index arithmetic within each memory page. Loops might often involve performing simple operations on the same base index set and within the same memory page. The second capability is to provide memory-to-memory and memory-to-index-register transfers without going through the Vector Buffer and Vector Switch. Both of these capabilities could be provided without major increases in the logical complexity of the system and should undoubtedly be considered.

TABLE 14 MEMORY LOGIC SUMMARY

First we list the gate counts for the units in a memory page (Figure 16).

Unit                  Source of Gate Estimate        Gate Count
Instruction Queue     Table 4                             5 500
Control               Estimate based on function          1 000
Switch                1x2, 64 bits                          384
Scalar Buffer         16 words by 10 bits                   640
Vector Index Buffer   8x8 words by 10 bits                2 560
Store Buffer          8 words by 64 bits                  2 048
Load Buffer           8 words by 64 bits                  2 048

Total for memory page excluding memory:                  14 664
128 pages are required for a million words:           1 876 992

Now we compute accessing network gate counts, as illustrated by Figure 15.

8x8 Switch            72 bit words                       18 432
Local Controller      Estimate based on function          2 000

Load Network (17 switches and 16 controllers):          345 344
Store Network (one controller and 17 switches):         315 344
I/O Network (16 switches and one controller):           296 912

Total Exclusive of Memory:                            2 834 592

4.6 INSTRUCTION UNIT DISPATCHER

4.6.1 Introduction

The Instruction Unit Dispatcher (IUD) has the responsibility of mapping OFFL instructions from up to four MIDs onto some collection of execution units. It must ensure that the correct operands for an instruction will meet in the unit assigned that instruction. The principal problem in designing this unit is maintaining a high instruction rate while providing an "intelligent" scheduling algorithm. The scheduling algorithm must, as a minimum, assure that no blockages result and maintain the correct logical sequence of operations.

In describing the IUD, we will first outline its functional structure, ignoring all problems associated with maintaining the necessary high data rate. We will then determine what degree of pipelining and parallelism will be necessary. We will discuss in more detail the various operations that the IUD performs. In this discussion we will bring in any algorithm modifications necessitated by the combination of pipeline and parallel processing required. We will then provide a detailed logical design of the IUD, complete with gate counts.

4.6.2 IUD Functional Structure

The IUD's operation is partitioned into several tasks. Three broad categories are: work on operands, work on results, and construction of queue entries. The three types of operands are vectors, scalars, and main memory vectors. For main memory vectors, the IUD merely passes on the specified address to the correct memory box. For scalar instructions, a time index is necessary to uniquely identify the operand. An associative memory table is accessed to obtain this time index. The use made of this time index is discussed in Section 4.3. A logical vector operand must be mapped onto the correct physical vector register. Another associative memory is provided for this function.
Scalar results are used to update the scalar status table mentioned above. Similarly, vector results are used to update the vector status table which maps physical to logical registers. Both operands and results, as well as the operation fields, are used by the IUD in generating various queue entries. Where execution units are not unique, the IUD must decide which to use. In the case of vector instructions, it must reserve space in the VEU and set up queue entries to route data as required. Finally, it must set up the queue entry for the execution of the operation itself.

4.6.2.1 Data Rate Analysis

The instruction rate the IUD must handle is a function of the processing rate of the various execution units. A reasonable average is one operation per major clock. For this computation, we will assume all operands originate in memory and are returned to memory. This is an extremely conservative assumption. It will be somewhat balanced by neglecting memory-to-memory instructions and transfers between scalar memory and vector memory. The overall assumption is still somewhat conservative. Each vector operation counts as 4 instructions for a binary operation (3 memory instructions plus the actual operation) and 3 instructions for a unary operation. Each scalar operation counts as one instruction. Reasonable values are 4 vector binary units plus 2 vector unary units. In addition, 6 scalar units is a likely value. Thus, we should be able to process roughly 28 instructions through the IUD in one major clock. This comes to roughly 4 instructions per minor clock with full pipelining.

4.6.2.2 Memory Operands and Results

Sequencing of instructions referring to memory is controlled by the individual memory boxes. The instructions need only be passed on to the appropriate memory box in the correct sequence. The IUD need not perform additional processing on these operands.

4.6.2.3 Scalar Operands and Results

In order to ensure proper sequencing of scalar instructions, operands must have both a time and a place index. These designate a particular location in the scalar buffer and a particular "time index" which uniquely identifies a store to that location. To ensure that no operand will be over-written while it is still required by a queued instruction, the SEU must be provided with a count of the number of pending requests for a given instruction result. The scalar status table provides the information necessary to construct the time index and generate the operand use count. The range of the time index was discussed in Section 4.3; the values 128 and 256 were determined as reasonable options in that section.

The status table is an associative memory containing entries for all recent stores to scalar memory that may be ambiguous. It contains a time index and a scalar memory location for each entry. In addition, it contains a disable bit and a bit to indicate if the address refers to a vector stored across the scalar memory. Whenever an instruction with a result to scalar memory is processed and there is another store to that location in the queues, a new entry is made in an available location in the status table. In addition, a previous entry using that same location has its disable bit set and its location recorded in another table as available. An illustrative layout for this table follows.
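The sketch below shows one possible layout and the associative search; the field widths are suggested by the figures above, the names are ours, and the hardware of course examines all entries in parallel rather than looping.

    /* Illustrative scalar status table entry and associative lookup. */
    #include <stdint.h>

    typedef struct {
        uint16_t location;    /* scalar buffer address                  */
        uint16_t time_index;  /* uniquely identifies a store (range    */
                              /* 128 or 256, per Section 4.3)          */
        uint8_t  disabled;    /* superseded by a newer store           */
        uint8_t  is_vector;   /* a vector stored across scalar memory  */
        uint8_t  valid;
    } ScalarStatusEntry;

    /* Return the index of the enabled entry for location loc, or -1
     * if no ambiguous store is outstanding for that location. */
    int status_lookup(const ScalarStatusEntry t[], int n, uint16_t loc)
    {
        for (int i = 0; i < n; i++)   /* done in parallel in hardware */
            if (t[i].valid && !t[i].disabled && t[i].location == loc)
                return i;
        return -1;
    }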
We still have the problem that some of the operands may refer to results being processed in parallel with these operands. We take care of this case with special circuitry containing the scalar results being processed. This special circuitry finds the entry with the latest time index earlier than the time index for a particular operand. This requires a comparison tree, but since only a small number of results are processed in parallel, this tree is quite small. Additional circuitry selects the time index from either this comparison tree or from the full table search when the tree finds no match. The full table update for the results being processed occurs in the same clock. The search for current operands will not see these entries until one clock later.

Finally, the scalar status table must be kept from becoming full. Thus, we want to remove entries from it as quickly as possible. As soon as any instruction which causes an entry to be made in this table has completed, the associated entry in the table can be freed. Thus, there is an additional bookkeeping table containing instruction indices and associated scalar status table locations. When notification comes that a scalar instruction has completed, this table is used to determine if a scalar status table entry may be freed. Since a scalar status table location may be freed before the associated instruction is complete, this bookkeeping table needs to be updated whenever this occurs.

4.6.2.4 Vector Operands and Results

In the case of the vector buffer, the status table must map every physical register in use onto the corresponding logical register. The number of physical vector registers is much smaller than the number of physical scalar registers. Thus, 128 or 256 is a reasonable size for this table, corresponding to the size of the vector buffer discussed in Section 4.4.3. Like the SEU, the VEU must be provided with a count of the number of accesses to a particular set of data values. The vector status table must contain two pieces of information to perform the functions described above: the physical location of a logical register and the logical register identification.

We will first discuss the use of this table, ignoring the fact that more than one instruction is being processed in parallel. When a vector operand is encountered, this table is accessed as an associative memory to find the physical register corresponding to the designated logical register. If there is no corresponding entry, this is an error condition which should cause a program interrupt. Since the vector buffer is distributed among the VEUs as well as being in a central Vector Buffer, the physical location identifies the unit as well as the location within the unit. This information will be used in selecting the VEU when there is more than one which may be used. A vector result causes the associated identification of that logical register to be altered to correspond to the new physical register. In addition, it causes a signal to be sent to the unit containing the register, indicating that the physical register may be freed once all pending requests on it have cleared. In turn, when all requests have cleared, this unit notifies the IUD.

The problems associated with parallel processing of instructions are more complex than those encountered in the scalar case. This is a consequence of the fact that the physical location of a logical result is not known until processing of that instruction is nearly complete. To accommodate this situation, a special bit will signify a not-yet-known address. In addition, the time index of the corresponding result will be provided in place of the physical address.
Special logic to fill in this information will be described in Section 4.6.3.7. This same logic provides the information to be added to the vector status table when it becomes available. In addition, we need a comparison tree similar to that for the scalar status table, discussed in the previous section.

4.6.2.5 Scalar EU Assignment

There may be several SEUs that are functionally equivalent. We must provide a method of selecting which SEU will be used for a given instruction. Since the SEUs, unlike the VEUs, do not contain any operands (see the following section), the only consideration that seems reasonable to take into account is the size of the various queues. Thus, logic will be provided to keep track of where the next n scalar instructions should be assigned for each set of equivalent SEUs, where n is the maximum number of instructions that can be processed in any clock. The logic will update this information every clock, based on which SEUs were assigned in the previous clock and on information from the SEUs on instructions completed.

4.6.2.6 Vector EU Assignment

Vector operands may reside in a specific vector execution unit, and there exists logic to use these as operands. In order to lessen the load on the vector switch as well as to minimize transfer delays, we want to encourage using these features. The question of what constitutes an optimal scheduling scheme is extraordinarily complex. In addition, we have severe constraints imposed by the required rapid and comparatively cheap hardware implementation. We will propose a scheme that seems workable and reasonable.

We begin our discussion by considering different relations between the number of active programs (P_a) and the number of equivalent EUs (R_e). When possible, we will assign specific EUs to specific programs and try to keep all computation within assigned EUs. When queue size discrepancies become too large, we will start distributing the operands.

We first consider cases where R_e > P_a. Each program may have its own resource or resources; i.e., if R_e >= nP_a with n >= 1, then each program has n resources allocated to it. First we consider the case of a single resource assigned to each program. There are three pairs of threshold values that determine which EU it chooses. These threshold values all represent differences in queue sizes. The first two are limited to queues of EUs assigned to the particular program:

θ_20I   Size of the queue containing both operands minus the size of the smallest queue.
θ_10I   Size of the smallest queue containing either operand minus the size of the smallest queue.

The next two values represent the difference of a queue size assigned to the program minus a queue size not assigned to the program:

θ_20IE  Size of the queue containing both operands minus the size of the smallest queue.
θ_10IE  Size of the smallest queue containing one operand minus the size of the smallest queue.

The final two thresholds refer to queues not assigned to the program:

θ_20E   Size of the queue containing both operands minus the size of the smallest queue.
θ_10E   Size of the smallest queue containing either operand minus the size of the smallest queue.

The threshold values can be assigned dynamically or be hard-wired constants. Experimentation should be conducted to determine optimal values. Associated at any instant in time with each threshold value is the actual value of the corresponding condition. We will label these as θA with the same subscript.
Thus, θA_10IE is the actual difference of the smallest queue assigned to this program branch which contains a single operand minus the smallest queue size not assigned to this program. Both queues are restricted to those capable of performing the operation which we are now assigning to a queue. We will define Q(θ), where θ is a threshold parameter, to be the index of the first queue associated with this parameter; i.e., Q(θA_20IE) is the queue with two operands. Whenever the actual value for a given threshold is not defined, it will behave as infinity.

Finally, we will consider the six threshold values as being ordered in the sequence they were defined, and will abbreviate the threshold and actual values as θ_i and θA_i, i = 0, 1, ..., 5. Thus, θ_0 is the same as θ_20I. The algorithm for queue selection in the case where R_e > P_a is to select Q(θA_i) for the least i such that

    θA_i < θ_i.

If no i satisfies this condition, then the smallest queue is chosen. A transcription of this rule appears below. Two observations seem important here. First, we probably have more threshold values than are useful, and experiments would probably show us how to limit these. Second, other threshold values are possibly meaningful. One example would be a threshold referring to the queue size difference of a queue assigned to the program containing one operand minus the queue size of a queue not assigned to the program containing one operand. As stated previously, our algorithm is not necessarily optimal, only practical and reasonable.

In the case of R_e < P_a, there will be no assignment of EUs to programs. In this case only the external thresholds θ_20E and θ_10E will be meaningful. Otherwise, the same algorithm applies.
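The single-resource rule transcribes almost directly. In the sketch below the arrays and the UNDEFINED convention are ours; the ordered scan and the fall-back to the smallest queue follow the text.

    /* Queue selection by ordered thresholds (single resource case). */
    #include <limits.h>

    #define NTHRESH 6
    #define UNDEFINED INT_MAX  /* an undefined actual value acts as infinity */

    /* theta[i]  - threshold values (hard-wired or set dynamically)
     * actual[i] - current queue-size differences thetaA_i
     * queue[i]  - Q(thetaA_i), the queue associated with actual[i]
     * smallest  - index of the smallest eligible queue               */
    int choose_queue(const int theta[NTHRESH], const int actual[NTHRESH],
                     const int queue[NTHRESH], int smallest)
    {
        for (int i = 0; i < NTHRESH; i++)
            if (actual[i] != UNDEFINED && actual[i] < theta[i])
                return queue[i];
        return smallest;       /* no threshold satisfied */
    }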
Then the remainder of the above operations can be performed, in some cases, using this information about results be- cause it corresponds to one of the operands. 135 4.6.2.3 Generating Instructions for the EUs and Memory At this point we have discussed how all the information necessary for the EU instructions is generated. Thus the only remaining problems are to assemble the information together and transfer it to the appropriate EU. To assemble the instruction, we will provide addresses with each of its component parts. Using these addresses to assemble the instructions poses no special problems. In designing the logic for transmitting the instruc- tions to the EUs, we wish to minimize the size of the data paths. In any given minor clock all the instructions emerging may be scalar instructions (or any other single type). Since no single unit can process instructions at this rate, it makes sense to buffer the output to the verious units. 4.6.3 Logical Structure In this section we will provide a logical design for the functions discussed in Section 4.6.2. We will do this in sufficient detail to pro- vide realistic estimates of the gate counts for various individual compo- nents and for the entire unit. We will first discuss the overall structure of the IUD pipeline and then go on to describe each of the stages and units in detail . 4.6.3.1 IUD Pipeline Structure Before we begin our overall analysis of the pipeline structure, we need to provide more details about OFFL. Thus in this section we will de- scribe the OFFL instruction format and syntax. We will use this information to analyze the pipeline requirements. Then we will present a time-versus- function diagram of the IUD's operation. In this section we will describe instructions and design in considerable detail. We do this, not because we believe the design is optimal or that there is anything sacred about the 136 particular design decisions we have made, but because our approach differs radically from conventional control unit design. Thus, details are worked out both as a discipline onto ourselves to ensure that we are not over- looking any devastating problem, and as a means of convincing the reader that our overall approach is workable. 4.6.3.1.1 OFFL Instruction Format OFFL instructions consist of a variable number of 16-bit bytes. The first of these specifies the operation to be performed. It has the follow- ing format: Fields Meaning 0:1 1. This byte is 1 only for the operator portion of an instruction. 1:1 indicates an SEU instruction. 1 indicates a VEU instruction. (All memory instructions are to or from the VEU.) 2:2 Contains the program number. (Up to four different programs can be executing simultaneously with instruc- tion level multiprogramming.) 4:4 EU address. (Specifies the type of EU that this instruc- tion requires.) 9:8 Information to be interpreted by the specified EU. (This field is used by the EU to determine what operation to perform. If it is identically 0, then the next byte con- tains information to be passed on to the EU.) 137 The EU operation field or literal information for the EU can be extended over up to four additional bytes. Bit of each of these bytes contains a 1. The remainder of the byte contains information for the EU. The following is the format of all possible OFFL operands and results Fields Meaning 0:1 1:1 Indicates that this is a result and also signifies the physical end of the instruction. 
In the case of a memory result, the physical end of the instruction occurs not at this byte, but at the next. 2:1 indicates a scalar. 1 indicates a vector. 3:1 Vector only. indicates a logical vector buffer address; 1 indicates a main memory address. 3:13 Scalar only. Physical scalar buffer address. 4:12 Vector only. In the case of a memory address (bit 2), this field contains the physical location within the physical memory box specified by the next byte. In the case of a logical vector buffer address (bit 2), this field contains that address. As mentioned in the above table, memory results are two bytes long. The second byte, except for bit which is 0, is completely used to specify a memory box. This completes the specification of the format of OFFL. 4.6.3.1.2 OFFL Syntax In this section we specify the syntax of OFFL instructions in BNF using the metalanguage of the report on ALGOL 60. We will describe the 138 semantics of the terminal and non-terminal symbols used in terms of the previous section and then present the brief formal syntax. Note that if there are two scalar operands, the second will be a mode pattern. If there is only one scalar operand, then the operation will specify whether it is a mode or an index. Table 17 is the syntax of OFFL. 4.6.3.1.3 Analysis of Pipeline Requirements In Section 4.6.2.1 we determined that the IUD must allow for the emergence of instructions at a rate of approximately four instructions per minor clock. In this section we will analyze how that requirement, in con- junction with the specification of OFFL we have described in the previous two sections, translates into physical requirements of the IUD structure. Because of the extreme variability of instruction length and complexity, it would be unnecessarily costly to allow the IUD to be able to process any possible combination of instructions at an emergence rate of four per clock. As a first step in determining a reasonable design, we will enumerate rele- vant constraints on OFFL instructions. These constraints can all be easily derived from the previous two sections. They are listed in Table 18. 139 TABLE 17 OFFL SYNTAX Symbol Instruction Memory Not Memory NM Operator NM Operand NM Result M Operator M Operand V Result V Operand M Result VS Operand S Operand S Result OM Address Meaning A complete OFFL instruction from operation to result. A complete OFFL instruction which makes reference to main memory. A complete OFFL instruction which does not make reference to main memory. An operation of one to five bytes as specified in the previous section that does not refer to main memory. All the operands in a complete non-memory OFFL instruction. A vector buffer or scalar buffer result. An operation of one to five bytes as specified in the pre- vious section that does refer to main memory. An operand which specifies a main memory address, including indexing and mode specification. A result which specifies a logical vector buffer address. An operand which specifies a logical vector buffer address. A result which specifies a main memory address, including indexing and mode specification. An operand giving a vector buffer or scalar buffer address. An operand specifying a physical scalar buffer address. A result specifying a physical scalar buffer address. A two-byte operand which specifies a memory box and an address within the box. Symbol RM Address 140 TABLE 17 OFFL SYNTAX (cont.) Meaning A two-byte result which specifies a memory box and an address within the box. 
Index and Mode A string containing or 1 vector operands and to 2 scalar operands which serve as indexes and modes relative to the physical memory address specified by the associated memory address. All indexing is limited to addresses within a single memory box. The OFFL syntax follows. Instruction : : = Memory | Not Memory ; Not Memory ::= NM Operator NM Operand NM Result ; Memory. ::= M Operator M Operand V ResuU | M Operator V Operand M Result; NM Operand ::= VS Operand | VS Operand VS Operand ; VS Operand : : = V Operand | S Operand ; NM Result : : = V Result | S Result ; M Operand : : = Index and Mode OM Address ; M Result : : = Index and Mode RM Address ; Index and Mode ::= S Operand | V Operand | V Operand | S Operand | V Operand S Operand S Operand ; 141 TABLE 18 OFFL INSTRUCTION CONSTRAINTS Instruction Parameter Minimum Maximum Instruction length in bytes Number of scalar operands Number of vector operands Number of memory operands Number of results, any type Operator length in bytes Memory operand or result length in bytes Total length of all operands for a non- memory instruction in bytes 3 12 2 2 1 1 1 1 5 2 5 142 The physical paths between the MIDs and the IUD must be of some fixed size. At this point we need to translate an emergence rate of four instruc- tions per clock into a size for these paths. Since the path size determined will be a fundamental physical limit to the IUD's processing rate, we will design the IUD to handle the full bandwidth of these paths. The tradeoff in deciding on this parameter is the possibility of the IUD slowing down the EUs versus the cost of the IUD. Because of its pipelined parallel nature and the sophisticated functions it must perform as outlined in Section 4.6.2, the cost of the IUD rises dramatically with increased bandwidth. This will become even more evident in the remainder of Section 4.6.3 as we do detailed logical design. The emergence rate of instructions required is actually 3 1/2, not 4, and the assumptions that gave rise to that figure were some- what conservative. (See Section 4.6.2.1 for details.) Thus, a bandwidth of 12 bytes per clock total coming into the IUD seems to be a reasonable figure that will allow for little or no delay in the EUs. There are addi- tional reasons for choosing the precise figure 12. These will become appa- rent in the remainder of this and the following section. The IUD is required to perform various types of operations in parallel as outlined in Section 4.6.2. We will now determine what degree of paral- lelism will be required. As a first step, we will determine how many of the various types of operations, operands, and results may occur in segments of the instruction stream of different lengths. Table 19 gives maximum counts versus instruction stream length and can be easily derived from the instruc- tion constraints listed above. This table gives the maximum number of instruction components that may occur in a length of instruction stream segment. Note that 12 is a particularly good number since at 13 most of the counts go up by 1 . This table will also be used in the next section. 
TABLE 19  INSTRUCTION STREAM CONSTRAINTS

Instruction                      Memory Operands and   Vector and Scalar   Vector and Scalar
Stream         Operators         Results Combined      Operands*           Results*
Length       Number   Size       Number   Size         Number   Size       Number   Size
  1             1       1           1       1             1       1           1       1
  2             1       2           1       2             2       2           1       1
  3             1       3           2       2             2       2           1       1
  4             2       4           2       3             2       2           2       2
  5             2       5           2       4             3       3           2       2
  6             2       5           2       4             4       4           2       2
  7             3       5           3       4             4       4           3       3
  8             3       6           3       5             4       4           3       3
  9             3       7           3       6             5       5           3       3
 10             4       8           3       6             6       6           4       4
 11             4       9           4       6             6       6           4       4
 12             4      10           4       7             6       6           4       4
 13             5      10           4       8             7       7           5       5
 14             5      10           4       8             8       8           5       5
 15             5      11           5       8             8       8           5       5
 16             6      12           5       9             8       8           6       6

*These columns refer to either of the specified types and not to both types combined.

4.6.3.1.4 Switching Instruction Components into the Pipe

The first stages of the IUD pipe will consist of units designed to process each of the various instruction components. Since these components may occur at any point in each 12-byte segment, we need some means of transmitting the various components to the appropriate type of processors. At the same time we must maintain the identity and sequence of the original instructions. We will number the instructions 0 through 3 and assign this index to each instruction component. Either or both instructions 0 and 3 may be incomplete. Thus, in the later stages of the pipe we must take this into account. Table 21 gives the logic equations for assigning instruction numbers and an associated gate count. The component pipes that our 12 instruction bytes may need to be switched into are listed below.

TABLE 20  PIPE COMPONENTS

Component                     Number of Units   Size of Unit in Bytes
Operator                            4                    5
Memory Operands and Results         4                    2
Vector Operand                      6                    1
Vector Result                       4                    1
Scalar Operand                      6                    1
Scalar Result                       4                    1

TABLE 21  LOGIC EQUATIONS AND GATE COUNTS FOR ASSIGNING INSTRUCTION NUMBERS

The symbols A-L represent a logical input as to whether the corresponding instruction byte is the first byte of a new instruction. This can be determined by bit 0 of the preceding byte being 0 and bit 1 of this byte being 1. A prime denotes the complement of a variable. Q, R, and S are the following logical functions:

Q = BEHK        R = I ∨ J ∨ K        S = I ∨ J

X, Y, T, and W are intermediate functions which are used in the other equations.

Byte   High-order Bit of Instruction Number   Low-order Bit of Instruction Number      Number of Gates
A      0                                      0                                              0
B      0                                      B                                              1
C      0                                      B ∨ C                                          2
D      0                                      B ∨ C ∨ D                                      3
E      X = BE                                 Y = (BE ∨ B'C'D'E')'                           7
F      X ∨ YF                                 YF' ∨ Y'F                                      7
G      X ∨ YF ∨ YG                            YF'G' ∨ Y'F ∨ Y'G                             12
H      T = X ∨ YF ∨ YG ∨ YH                   W = YF'G'H' ∨ Y'F ∨ Y'G ∨ Y'H                 18
I      T ∨ WI                                 W'I ∨ WI'                                      7
J      T ∨ WI ∨ WJ                            WI'J' ∨ W'I ∨ W'J                             12
K      Q'T ∨ Q'WI ∨ Q'WJ ∨ Q'WK               WI'J'K' ∨ W'I ∨ W'J ∨ W'Q'K                   23
L      T'W'IL ∨ T'W(R ∨ L) ∨ TW'(IL)' ∨ TWK'L'   W'RL' ∨ W'R'L ∨ WR'L' ∨ WIL               18
TOTAL                                                                                      110

With fan out of 20 and fan in of 4, the entire decoding can be implemented in 7 levels of logic. The gate counts in this case would be 110, plus 9 for Q, R, and S, plus 24 to generate the initial values A-L. As a practical matter, at least one additional level of logic and a slightly higher gate count are likely to be required to avoid the large fan out. It may be possible to implement the decoder in one minor clock with under 200 gates, but two minor clocks may be required.

TABLE 22  LOGIC EQUATIONS FOR THE CONTROL OF THE IUD FRONT END SWITCHES

Variables A - F represent logical values associated with byte positions. They are true if the corresponding byte is of the type this switch fetches. Variables xG - xK (where x is one of A - F) represent enables of switch paths. In particular, AG equal to true enables the path from byte position A to the first pipe entry of the type the switch is for. Similarly, BH enables the path from the second byte position to the second pipe entry.
The designs will not be for fully general switches. They will take advantage of restrictions from Table 19. A prime denotes the complement of a variable.

I. Control for the 4 x 2 switch used by vector and scalar operands.

AG = A            DH = D
BG = A'B          CH = D'C
CG = A'B'CD       BH = D'C'AB

(Note: Paths AH and DG are not required, saving both switch and control logic.)

II. Control for the 3 x 1 switch used by vector and scalar results.

AG = A
BG = B
CG = C

III. Control for the 6 x 4 selector used for memory operands and results combined.

AG = A                     FJ = F
BG = A'B                   EJ = F'E
CG = A'B'C                 DJ = F'E'D
BH = AB                    EI = FE
CH = A'BC ∨ AB'C           DI = F'ED ∨ FE'D

IV. Control for the 6 x 5 selector used for operators.

AG = A                     FK = F
BG = A'B                   EK = F'E
BH = AB                    EJ = FE
CH = A'BC ∨ AB'C           DJ = F'ED ∨ FE'D
CI = ABC                   DI = FED

A full cross-bar switch for this transfer would need to be 12 bytes by 56 bytes. The logic to control this switch would be particularly cumbersome. Referring to the list of components versus instruction length in the previous section, we see that there is a possibility of partitioning the 12 instruction bytes into 4 groups of 3 each in the case of vector and scalar results. In the case of vector and scalar operands, 3 partitions of length 4 make sense. In the case of memory operands or results, and in the case of operators, two partitions of length 6 will work. All these partitions have the advantage of not requiring any additional processing units, while at the same time reducing the complexity of the switch and its control.

With these smaller partitions, we can design very simple and fast control logic for the switches. The basic idea of the design is to start at both ends and work towards the middle. Thus, if we are checking 4 bytes, 2 of which may be of the same type, then the first output path accessed by this switch will, in a sense, be assigned to the first two bytes, and the other output path to the remaining bytes. The logic will be symmetric around the middle. Detailed logical design of control for all the partitions required is contained in Table 22.

At this point we can complete the design of the front end of the IUD pipe. There is a 12-byte wide data path into the IUD which inputs a new segment of the instruction stream. There are registers to receive this input and a second set of registers to serve as a buffer if the IUD becomes blocked. It takes longer than 1 minor clock to notify the MIDs that there is a block.

[FIGURE 17  IUD FRONT END: the 12-byte path from the MIDs feeds the instruction index generator, a blockage buffer, and the switch controllers, which drive the partitioned switches (two 6 x 4 memory operand/result switches, two 6 x 5 operator switches, three 4 x 2 vector operand and three 4 x 2 scalar operand switches, and four 3 x 1 vector result and four 3 x 1 scalar result switches) into the component pipes.]
TABLE 23  GATE COUNT FOR IUD FRONT END

Unit                          How Derived                                      Gate Count/Unit   Units   Total
16 bit x 12 byte buffer       4 gates/bit                                            768           2      1536
19 bit x 12 byte buffer       4 gates/bit                                            912           2      1824
Instruction index generator   Table 21                                               200           1       200

Switch Controllers (SK) and Switches (S)
4 x 2 S                       (no. of bits)(lines in)(lines out)(4)                  608           6      3648
3 x 1 S                       same as above                                          228           8      1824
6 x 4 S                       same as above                                         1536           2      3072
6 x 5 S                       same as above                                         2280           2      4560
4 x 2 SK                      Table 22 (number of variables in equations times 2)     28           6       168
3 x 1 SK                      same as above                                            6           8        48
6 x 4 SK                      same as above                                           56           2       112
6 x 5 SK                      same as above                                           56           2       112
TOTAL                                                                                                   17104

However, all but one minor clock of this delay is buffered by the path transducer discussed in Section 3.1.3. The following functions are performed by the IUD front end.

(1) Instruction indexes are generated and produced for each instruction byte.
(2) One clock later, control signals for the various partitioned switches are generated.
(3) One clock later, the instructions meet their indexes.
(4) One clock later, the instruction components are switched into the various pipes.

Figure 17 gives the overall structure of the IUD front end. Table 23 provides a gate count for the IUD front end.

4.6.3.1.5 Global Structure of IUD Pipe

We have just seen how the stream of IUD instructions is broken up into bytes of various types by the IUD front end. This breaking up is done in such a way that the identity of the complete instruction can be recovered later. In this section we will describe the overall structure and timing of the IUD pipe as it performs the functions outlined in Section 4.6.2. In the remainder of Section 4.6.3 we will provide detailed logical design of the various components of the pipe as well as gate counts.

Table 24 gives a list of functions to be performed (including how many parallel units are required), the delays involved, and the dependency relationships. This table is then used to generate Table 25, which describes the timing sequence for all functional components in the IUD pipe. Finally, using these two tables, we construct Figure 18, which is an overall diagram of the pipeline components.

TABLE 24  IUD PIPE FUNCTIONS, TIMINGS, AND DEPENDENCIES

Function                                    Section Where   Abbre-            Number of       Requires
                                            Outlined        viation   Time    Parallel Units  Output From
Use scalar result to update parallel
  search portion of scalar table            4.6.2.3         SPU        1          4           none
Use vector result to update parallel
  search portion of vector table            4.6.2.4         VPU        1          4           none
Use scalar operand to search scalar table   4.6.2.3         SST        2          6           SPU
Use vector operand to search vector table   4.6.2.4         SVT        2          6           VPU
Select scalar execution unit                4.6.2.5         SSE        1          4           none
Select vector execution unit                4.6.2.6         SVE        2          4           SVT
Reserve vector buffer storage               4.6.2.7         RVS        1          6           SVE
Update scalar operand table                 4.6.2.3         US         2          6           none
Update vector operand table                 4.6.2.4         UV         2          6           RVS
Generate vector switch operations           4.6.2.7         GSI        2          9           FVO
Fill in vector operand fields not
  known at SVT                              4.6.2.7         FVO        2          6           RVS
Assemble complete memory instructions      4.6.2.8         AM         2          3           FVO
Assemble complete scalar instructions       4.6.2.8         AS         2                      FVO
Assemble complete vector instructions       4.6.2.8         AV         2                      FVO
Initiate buffered transfer of vector
  switch instructions                       4.6.2.8         ISW        1                      GSI
Initiate buffered transfer of memory
  instructions                              4.6.2.8         IM         1                      AM
Initiate buffered transfer of scalar
  instructions                              4.6.2.8         ISC        1                      AS
Initiate buffered transfer of vector
  instructions                              4.6.2.8         IV         1                      AV

TABLE 25  IUD PIPE TIMING CHART

Stage of Pipe   Functions Just Completed   Functions Which May Begin
 1                                         SPU  VPU  SSE  US
 2              SPU  VPU  SSE              SST  SVT
 3              US
 4              SST  SVT                   SVE
 5
 6              SVE                        RVS
 7              RVS                        UV  FVO
 8
 9              UV  FVO                    AM  AS  AV  GSI
10
11              AM  AS  AV  GSI            IM  ISC  IV  ISW
12              IM  ISC  IV  ISW

FIGURE 18  IUD PIPE OVERALL STRUCTURE

Entry Port from         Number     Bytes/
IUD Front End           of Ports   Port     Functions vs. Pipe Stage (1-8)
Operator                   2         5      SSE  H  H  *SVE  H  H  H
Memory Operands and
  Results                  2         4      H  H  H  H  H  H  H
Vector Operands            3         2      H  SVT  H  *SVE  H  RVS  GSI  FVO
Scalar Operands            3         2      H  SST  H  H  H  H  H
Vector Results             4         1      VPU  H  H  H  H  RVS  UV  GSI
Scalar Results             4         1      US SPU  H  H  H  H  H  H

H means hold and pass on to the next stage with no function initiated. *SVE indicates that SVE requires both these inputs, and not that SVE is done for each, as is the case in other duplications down a column. No functions are initiated at stage 8. At stage 9, we initiate another set of switches like that in the IUD front end in order to assemble complete instructions. We will describe this tail end of the pipe in Section 4.6.4.

4.6.3.2 Detailed Structure and Gate Counts for Internal IUD Pipe Functions

In this section we will present gate counts and, where necessary, logical design for the functions listed in Table 24, up to the point where we begin assembling complete instructions. We will do logical design to the minimum degree of detail required to obtain reasonably accurate gate counts. Many of the units we discuss will be involved with several functions. We will introduce each unit as required. At the end of this section we will provide a summary of these units and their interconnections as well as a total gate count for the internal IUD pipe.

4.6.3.2.1 Details of the Parallel Update of the Scalar Table (SPU)

As discussed in Section 4.6.2.3, there must be a comparison tree for assigning time indexes to scalar operands which may coincide with scalar results being processed in parallel with the operands. From Table 25 we see that the complete update of the scalar table (US) is complete at clock 3. Since the search of the scalar table (SST) does not begin until clock 2, the comparison tree we are designing need only contain the scalar results being processed in one clock. This is a maximum of 4 (see Table 19). The information contained in this comparison tree must include the physical address of the scalar result and the time index for the instruction. In the next section we will discuss the hardware for generating time indexes for the various classes of instructions. For now we will assume they are directly accessible by using the instruction indexes discussed in Section 4.6.3.1.4. The loading of this comparison tree consists of gating the required information into the associated registers and clearing any of the 4 registers not used. This is the SPU function.
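Behaviorally, the SPU comparison tree acts as a small associative buffer: up to four in-flight scalar results, each a (physical address, time index) pair, are held so that a later search can find the newest time index for a given address before the table proper has been updated. A minimal Python sketch of that behavior follows; the class and method names are mine, not the thesis's.

```python
class SPUTree:
    """Behavioral model of the 4-entry scalar result comparison tree."""
    def __init__(self):
        self.entries = []  # (physical_address, time_index), newest last

    def load(self, results):
        # Gate in up to 4 (address, time index) pairs; unused registers clear.
        assert len(results) <= 4
        self.entries = list(results)

    def search(self, address):
        # Return the time index of the most recent matching result, if any.
        for addr, t in reversed(self.entries):
            if addr == address:
                return t
        return None

tree = SPUTree()
tree.load([(0x12, 7), (0x30, 8), (0x12, 9)])
assert tree.search(0x12) == 9     # the newest match wins
assert tree.search(0x55) is None  # no in-flight result for this address
```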
This comparison tree is shown in Figure 19.

[FIGURE 19]

4.6.3.2.2 Generating Time Indexes

[FIGURE 20  TIME INDEX LOGIC: a 4-output counter built from plus-1 through plus-4 adders, carry and no-carry switches, and logic for changing the meaning of the most significant bit.]

SWITCH CONTROL (1)

i = first instruction is of the specified type and is not a continuation of a previous instruction
j = second instruction is of the specified type
k = third instruction is of the specified type
l = fourth instruction is of the specified type

no change = i'j'k'l'
A = ij'k'l' ∨ i'jk'l' ∨ i'j'kl' ∨ i'j'k'l
B = ijk'l' ∨ ij'kl' ∨ ij'k'l ∨ i'jkl' ∨ i'jk'l ∨ i'j'kl
C = ijkl' ∨ ijk'l ∨ ij'kl ∨ i'jkl
D = ijkl

SWITCH CONTROL (2)

i, j, k, and l are the same as in Switch Control (1).

SA = i          SB = i'j          SC = i'j'k          SD = i'j'k'l

AW = i
AX = i'j              BX = ij
AY = i'j'k            BY = ij'k ∨ i'jk                 CY = ijk
AZ = i'j'k'l          BZ = ij'k'l ∨ i'jk'l ∨ i'j'kl    CZ = ijk'l ∨ ij'kl ∨ i'jkl    DZ = ijkl

FIGURE 20  TIME INDEX LOGIC (cont.)

To minimize hardware costs, we wish to keep the range of possible indexes to a minimum. We then need to do the indexing in a circular manner. In order to keep the compare logic to a minimum, we will add an additional bit to the minimum word size needed. We can then do circular indexing by alternately considering this high-order bit or its complement as the most significant bit of the index. That is, whenever we use up all indexes and start over, the new indexes we assign will have this bit set opposite to what it was just before we started over. Numbers whose high-order bit is the same as that we are now assigning will all be considered greater than numbers with the opposite value for this high-order bit. In the case of vector results, a maximum of 36 may be processed in 9 clocks. Thus a total of 7 bits will be required. In both cases of vector and scalar results, we only need to time index instructions with results of the specified type.

Generating these indexes requires logic similar to, but a bit more complex than, that required for generating the instruction numbers. This logic can operate in parallel with the IUD front-end logic and thus has three clocks available to it. Once generated, these indexes will be carried along through the pipe in their own set of registers. They will be accessible at any time merely by providing the instruction index as an address to the registers containing the time indexes. In the case of instructions that do not have a result of the specified type, they will receive a time index that is one greater than that of the most recent instruction with a result of the specified type. This will be the same index as the next instruction with a result of that type. Figure 20 gives the structure of the time index logic. Table 27 gives a gate count for the various index generators which may be required.
TABLE 27  GATE COUNT FOR TIME INDEX GENERATORS

                                                     9 Bits                  7 Bits
Unit                                    Units   Count/Unit   Total     Count/Unit   Total
Bit 0-4 (9) / Bit 0-2 (7)                 1         50          50         30          30
Plus-1 through plus-4 adders              4         30         120         30         120
Carry or no-carry switches                4         12          36         12          36
Logic to alter meaning of most
  significant bit                         1         10          10         10          10
4-output counter (above subtotal)         1                    216                    196
Switch Control (1)                        1         64          64         64          64
Switch (1)                                1    5*9*3=135       135        115         115
Switch Control (2)                        1         98          98         98          98
Switch (2)                                1   14*9*3=378       378   14*7*3=294       294
Registers                                 6         36         216         28         168
TOTALS                                                        1107                    935

[FIGURE 21  VECTOR COMPARISON TREE. Main structure: the vector time index pipes feed a 4 x 36 switch into 36 tree registers; the vector operand pipes feed 36 pipe registers, the comparison tree logic, and a 36 x 1 switch, which outputs the selected time index; purge logic, driven by the number of results completely updated this clock, maintains a next-free counter and a next-purge counter. Control detail: the next-free counter drives a 6-bit, 36-output address decoder whose outputs, with 1-to-4 fanout, control the 4 x 36 switch paths. Purge detail: each purge signal combines 4 consecutive address decoder outputs, gated by the number of instructions with vector results completely updated in the vector status table this clock. Comparison tree detail: 36 compare elements feed a 3-level last match selector controlling the 36 x 1 switch. The 3-level selector is built from last match selectors (LMS) and 4- and 16-element pass/no-pass units, with detail shown for the first 16 of 36 outputs. LMS logic detail: A - D are address decoder inputs, X, Y, and Z are numbers of complete inputs, and the output is P = AZ ∨ BX ∨ CYZ ∨ DX.]

TABLE 28  VECTOR BUFFER COMPARISON TREE GATE COUNT

Unit                      Gate Count
4 x 36 Switch Control     16+2*36+4*36 = 232
Purge Logic               16+2*36+8*36 = 376
Next Free Counter         100
Next Purge Counter        100
4 x 36 Switch             4*36*20*4 = 11520
36 Registers              4*36*20 = 2880

Comparison Tree Logic
Unit                      Gate Count/Unit   Number   Total
LMS                             13            14       182
4-input PS                       8             9        72
16-input PS                     32             3        96
36 x 1 Switch (10 bits)      36*10*3           1      1080
COMPARISON TREE TOTAL                                 1436
6 TREES                                               8616
UNIT TOTAL                                          23,824

4.6.3.2.3 Parallel Update of Vector Buffer Table (VPU)

As mentioned in the previous section, up to 36 vector buffer results may be in an incomplete state inside the IUD pipe. The search of the vector buffer table (SVT) must be able to detect this fact. To allow this, we need a comparison tree similar to that described in Section 4.6.3.2.1. This tree must be much larger to allow for 36 entries. Since entries can remain in the tree for up to 9 clocks, we need some additional control logic to properly purge and update the tree. Its functional operation is identical to that of the scalar result comparison tree described in Section 4.6.3.2.1. Figure 21 gives a diagram of this tree. Table 28 gives a gate count for the tree.

Most of Figure 21 is self-explanatory. The three-level last match selector does require some additional explanation. It is built from the four-input last match selectors whose logic equations occur in Figure 21. Outputs from the 36 compares are routed into nine of these last match selectors (LMS).
These nine are divided into three groups, two of size 4 and one containing a single element. The match/no-match output (N) from each element in a size 4 subgroup is one of the inputs to another LMS. Finally, the N outputs from these two LMS, plus the output from the solo element at level 1, are fed into a final LMS. The N output from this final LMS is the N output for the entire comparison tree. The four select outputs from each LMS are used either for control or as input to a pass/no-pass selector (PS). The level 1 LMS outputs are inputs to nine PS. The control for each of these PS is a selected output from a level 2 LMS. In particular, if the corresponding level 1 LMS were the last in its group of four to have a match, then its PS will be enabled, and otherwise not. The same game is played at level 3, but each PS controls 16 inputs. The extra level 1 LMS is handled in the obvious way to minimize package costs without requiring specially designed units.

4.6.3.2.4 Searching the Scalar Table

If we examine Table 24, we see from the dependency column that no function requires output from SST. We also know from Section 4.6.2.3 that close communication is required between the SEUs and the circuitry maintaining the scalar status tables. Thus it is both possible and desirable to move the scalar status tables and the SST hardware to be physically just ahead of the SEUs. One additional advantage in doing this is that it will no longer be required to process these instructions at the maximum rate that can occur in the IUD pipe, but only at the lower rate at which the SEUs can accept them. This advantage cannot be realized in the case of vector instructions because of the manner in which logical vector addresses are distributed among the VEUs. Memory and scalar instructions that interact with the vector unit must be processed together in a manner that logically corresponds to the actual sequence in which the instructions occur. We will move the SST and SPU functions into the SIDS mentioned in Section 4.3.2.1. We have already described the detailed hardware for SPU, which was then used in the previous section on the VPU. We will describe the remainder of the hardware in detail in Section 4.6.5, where we provide a complete description of the SIDS.

4.6.3.2.5 Searching the Vector Table (SVT)

The vector status table is the comparison tree described in Section 4.6.3.2.3 and an associative memory that maps logical buffer addresses to physical buffer addresses. Since most of the accessing of this table is by logical addresses, it would be desirable to have each logical vector buffer address be a physical address to this table. At a given instant in time, there may exist in the queues several instructions with the same logical address. However, only the most recent store to a logical vector location is required for this table. Thus we can make the logical address of a vector be a physical address to the status table. The only information that will then be needed in the table is the current physical address of the logical buffer. When we first discussed this table in Section 4.6.2.4, we described it as an associative memory. We see here that this is no longer necessary. In that previous discussion, we mentioned the necessity of providing use counts for the VEU as a means of noting when a logical register was available for re-use. For the same reasons discussed in the previous section, these functions are best transferred to just in front of the vector portion of the machine.
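Functionally, the directly addressed status table just described keeps only the newest logical-to-physical binding. A behavioral sketch of that reading follows (the parallel read and write ports are modeled as simple loops; names are mine):

```python
class VectorStatusTable:
    """Logical vector buffer address -> current physical address."""
    def __init__(self, size=256):
        self.phys = [None] * size   # one slot per logical address

    def store_parallel(self, writes):
        # Up to 3 results per clock; only the most recent store to a
        # logical address matters, so later writes simply overwrite.
        for logical, physical in writes:
            self.phys[logical] = physical

    def read_parallel(self, reads):
        # Up to 6 operand lookups per clock.
        return [self.phys[logical] for logical in reads]

t = VectorStatusTable()
t.store_parallel([(5, 17), (9, 3), (5, 21)])   # newest binding for 5 wins
assert t.read_parallel([5, 9]) == [21, 3]
```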
We will describe this unit, the Vector Instruction Dispatcher Subsystem (VIDS), in Section 4.6.6.

There are two functions that this table must perform. It must allow for accesses, in parallel, up to the number of vector operand pipes. There are six of these pipes. Two minor cycles are allowed for this access and for selecting the output from this table or the comparison tree. The access must be pipelined so that a set of six is complete in every minor cycle. The same table must be able to accept stores in parallel at the rate that the vector results can emerge from the pipe. Again, two clocks are allowed for this function, but it must be pipelined to allow a set of stores to be processed starting at every clock. To do this, we require only a standard memory, but with six read address decoders with their own output lines and three write address decoders with their own output lines. The two-clock pipelining can consist of one clock for address decoding and one clock to do the read or write. Table 29 gives a gate count for this memory and the circuitry that selects either the memory output or the comparison tree output.

TABLE 29  VECTOR STATUS TABLE GATE COUNT

Unit                                        Gate Count                Number Units   Total
2 Level (128) Address Decoder               16*4+8*3+128*2 = 344           9          3096
2 Level (256) Address Decoder               656                            9          5904
Bits to Address (128), Store Decoded
  for Pipelining                            4*128 = 512                    9          4608
Bits to Address (256), Store Decoded
  for Pipelining                            4*256 = 1024                   9          9216
Memory Location, 16 Bits with 3-Way
  Fan In and 6-Way Fan Out                  16*(4+3+2+6) = 240           128         30720
                                                                         256         61440
Comparison Tree / Memory Selection Logic    (4+16)*2+16*2*3 = 136          1           136

Total (128): 7,840 not including memory; 38,560 with memory
Total (256): 15,256 not including memory; 76,696 with memory

4.6.3.2.6 Selecting the Scalar Execution Unit (SSE)

This is another function which can and should be moved to the SIDS. However, since this design is a simplified version of the hardware for selecting the vector execution unit, we will provide a detailed design at this point. We will take advantage of the lowered processing rate required by transferring this function to the SIDS. Thus we will only need to process instructions at the rate the SEUs can process them or, at most, one per minor clock.

A small table containing the current queue size of each SEU is required. Whenever an instruction has completed execution, a table entry must be decremented. Whenever an instruction is entered into a queue, a table entry must be incremented. Instruction codes are logical addresses referring to a group of functionally identical SEUs. If for a particular logical address there is only one element in the specified group, then the SSE function is simply to convert the logical address to a physical address and, if the queue and the SIDS buffer are full, send a signal to hold up the IUD. If there is more than one functionally equivalent SEU involved, then the conversion to a physical address involves selecting the unit with the smallest queue and holding up the IUD if all the queues and the SIDS buffer are full. Figure 22 is a unit to perform this function for up to six equivalent SEUs. Table 30 provides a gate count for this unit.

This unit must provide an indication of the smallest queue, which is updated every clock. To allow for this, we have 12 registers. Six of these contain the current queue sizes. These are divided into two groups of three. For each of these groups there is an additional set of three registers containing the differences in queue sizes.
These differences are not generated by doing subtracts, but rather directly from the increment and decrement signals to the queue sizes. The signs of these differences are used to determine a minimum in each group of three and to control a 3 x 1 switch to transfer that minimum to another register. In one minor clock, all register incrementing and decrementing and the transfer of a minimum can be performed. In the next clock, a true subtract can be done on the two minima generated in the previous clock. The results of this subtract can then be used to choose from which of the two groups the global minimum is to be chosen. Because of this two-stage pipelining, we do not necessarily have the absolute minimum at a given clock. However, only one instruction per queue can complete in a major clock, and only one queue location can be reserved in a minor clock. Thus this additional one clock delay will, at most, allow a difference of two from the minimum, and this cannot happen often.

[FIGURE 22  SEU QUEUE SELECTOR: six queue counters a-f with +1 and -1 inputs, grouped in threes; pseudo carry difference counters for a-b, b-c, c-a and for d-e, e-f, f-d; 3 x 1 switches and controls that transfer each group minimum to a register; and pass/no-pass units that select the global minimum from the two group minima.]

LOGIC EQUATIONS FOR INTEGRATOR

Output:  S = sign (TRUE means negative), X = high-order bit, Y = low-order bit
Input:   P0, P1 = increment inputs; N0, N1 = decrement inputs
(A prime denotes the complement.)

S = N0N1P0' ∨ N0N1P1' ∨ N0N1'P0'P1' ∨ N0'N1P0'P1'
X = N0N1P0'P1' ∨ N0'N1'P0P1
Y = N0'N1'P0P1' ∨ N0'N1'P0'P1 ∨ N0N1'P0P1 ∨ N0'N1P0P1 ∨ N0N1'P0'P1' ∨ N0'N1P0'P1' ∨ N0N1P0P1' ∨ N0N1P0'P1

LOGIC EQUATIONS FOR THE a-b PSEUDO CARRY DIFFERENCE COUNTERS

Input: S, X, and Y just defined, plus a0, a1, a2, a3, and Sd (sign and 4 bits of the difference)
Output: Sr, r0, r1, r2, r3 (sign and 4 bits of the new difference)
Note: X and Y cannot both be TRUE.

r3 = a3Y' ∨ a3'Y
Sa = SSd ∨ S'Sd'   (TRUE means addition, FALSE means subtraction)

[The remaining equations of the figure develop r2, r1, r0, and Sr by cases on Sa, using the positive and negative pseudo carries Cp and Cn, with separate terms for addition, subtraction without overflow, and subtraction overflow.]

LOGIC TO SELECT THE MINIMUM ELEMENT FROM THE SIGNS OF THE DIFFERENCES

a > b → Sab        b > c → Sbc        c > a → Sca

a minimal = Sab'Sca ∨ Sab'Sbc'Sca'
b minimal = SabSbc'
c minimal = Sca'Sbc

FIGURE 22  SEU QUEUE SELECTOR (cont.)

TABLE 30  SEU QUEUE SELECTOR GATE COUNT

Unit                                    Gate Count   Number Units   Total
Integrator                                  54             6          324
Queue Counters (Ripple Carry, 4 Bits)       16             6           96
Difference Counters                        140             6          840
3 x 1 Switch (4 bits)                       36             2           72
Adder for M0 - M1                           24             1           24
Registers for M0, M1                        16             2           32
Registers for Sign Bits                     12             2           24
Pass/No-Pass Units                          15             2           30
TOTAL                                                                1442

TABLE 31  CONNECTIONS FROM OPERAND PIPES TO VEU QUEUE SELECTOR

Operand Pipe   Queue Selector Port
1              1, 2
2              2, 3, 4
3              3, 4, 5, 6
4              4, 5, 6, 7
5              5, 6, 7
6              6, 7
7              7

The two-bit instruction index forms the high-order bits of the port address. The bit indicating the first or second operand of binary instructions is the low-order bit.
Gate Count
Unit                    Gate Count   Number Units   Total
3-bit address decoder       32            6           192
2 x 1 switch (4 bits)       24            2            48
3 x 1 switch                32            2            64
4 x 1 switch                40            2            80
Direct path                  8            2            16
TOTAL                                                 400

All operands of the same type are adjacent. Thus we need only look at the type of adjacent instruction bytes. Let A, B, C represent the types of adjacent instruction bytes (TRUE = the type we are testing for). x true indicates that the byte corresponding to B is the second operand of a binary instruction.

x = AB

Detecting a partial instruction is only slightly more complex. The question is, "Does a terminal byte exist at or after this byte?" T1-T12 indicate whether the corresponding byte is terminal. Pi is true if byte i is part of a partial instruction.

Pi = Ti' T(i+1)' ... T12'

FIGURE 23  LOGIC TO INDEX OPERANDS AND DETECT A PARTIAL INSTRUCTION

TABLE 32  GATE COUNTS FOR INDEXING OPERANDS AND PARTIAL INSTRUCTION DETECTION

Indexing Operands
Operation                 Gate Count   Number Units   Total
Operand type detection        5            12*2         120
Index generation              3            12*2          72
TOTAL                                                   192

Detecting Partial Instructions
Number of gates = 76

4.6.3.2.7 VEU Queue Selector (SVE)

There are two important differences between the VEU and SEU queue selectors. First, this function cannot be moved outside the main IUD pipe, and thus we must allow for the processing of up to four instructions in parallel. The second difference is the added complexity that results from VEUs being assigned to individual programs, as discussed in Section 4.6.2.6. This requires that the operands of a vector instruction be combined with the operator in the unit which selects the VEU. We have allowed a two minor clock delay for this processing, but with a fully pipelined processing rate of four instructions per clock. We will first discuss how the operands and operation get together, and then the details of the queue selection hardware.

Since each instruction element has an instruction index associated with it, switching the operands to the queue selection elements is relatively straightforward. We require a switch from the 8 vector operand pipes to the 4 sets of input ports for selecting a VEU. This does not require a full 8 x 8 crossbar. Table 31 lists the connections required and gives a gate count for this unit.

One problem occurs with binary instructions. It is necessary that each operand be switched to a different entry in the set being used for a particular instruction. To allow for this, it would be desirable to have associated with each vector operand a single bit indicating if this is the first or second operand in a binary instruction. The same information would be desirable for scalar instructions when they are recombined to be routed to the SIDS. This information is recoverable from the index of the pipe in which the operand occurs, but having it instantly available is necessary to maintain the high processing rate. Figure 23 gives the logic for this process, and Table 32 provides a gate count. This unit will occur in the pipe front end where the instruction index generator occurs (see Section 4.6.3.1.4).
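The logic of Figure 23 can be checked against a direct restatement in software. In the sketch below, T[i] marks byte i as the terminal byte of an instruction; the helper names are mine, and byte indices run from 0 rather than 1.

```python
def second_operand_flags(types, target):
    # x = A AND B over adjacent bytes: byte i is the second operand of a
    # binary instruction when both it and its predecessor are of the type.
    return [i > 0 and types[i] == target and types[i - 1] == target
            for i in range(len(types))]

def partial_instruction_flags(terminal):
    # P_i is true when no terminal byte occurs at or after byte i, i.e.
    # byte i belongs to an instruction that spills into the next segment.
    n = len(terminal)
    flags = [False] * n
    none_after = True
    for i in range(n - 1, -1, -1):
        none_after = none_after and not terminal[i]
        flags[i] = none_after
    return flags

types = ['V', 'V', 'S', 'V', 'V']
assert second_operand_flags(types, 'V') == [False, True, False, False, True]
assert partial_instruction_flags([0, 0, 1, 0, 0]) == [False] * 3 + [True] * 2
```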
If we provide additional registers to hold any initial segment of an instruction with the highest queue index, we can then complete proces- sing of that instruction in the next clock. This is the first point in the pipe where this problem arises. Thus we can add switches and buffers to hold a partially completed instruction. In order to minimize the circuitry to do this, we will provide one bit associated with each instruction component to indicate if it is part of a partial instruction. The logic for this is in- cluded in Figure 23 and the gate count for this logic is in Table 32. Figure 24, which we will discuss shortly, gives the switch control and buf- fers for retaining partial instructions. Now that we have designed hardware to retain and collect the information necessary for VEU queue selection, we can proceed to design the hardware to perform the algorithms described in Section 4.6.2.6. There are two ways in which this unit is more complex than the SEU queue selection unit. First, it is not simply the queue size that is relevant to selecting a VEU, but also the number of operands already resident in the queues in which VEU has been assigned to the instruction. These complications are handled by having six effective queue sizes for each VEU. There are effective queue sizes for the following cases: A. Cases where the VEU is assigned to this program. 1. No operands for this instruction in this queue. 2. One operand for this instruction in this queue. 3. Two operands for this instruction in this queue. 190 B. Cases where the VEU is not assigned to this program. 1. No operands for this instruction in this queue. 2. One operand for this instruction in this queue. 3. Two operands for this instruction in this queue. Figure 24 shows the entire design of the VEU queue selector. The four groups of six registers in this figure contain the above effective weights for each of the four VEU queues. The second factor which makes this unit more complex than the SEU queue selector is the necessity of processing up to four instructions in parallel. In Table 24 we have allowed two clocks for this processing. The problem with meeting these time constraints results from the way the queue use of one in- struction can affect queue assignment for later instructions. Since we are processing up to four instructions in parallel, we need to somehow simulta- neously take into account the queue use interactions of four instructions. Although we are allowed two minor clocks to do the processing, we must pipe- line this with an emergence rate of four instructions every minor clock. This will have the consequence of having to start processing of a given set of four instructions before the queue weight registers have been updated for the previous two sets of four instructions. Table 33 summarizes the depen- dency relationships of the instructions. We now outline the algorithms employed in meeting the above constraints. Each of the three major functions we will list are performed in parallel for different sets of four instructions. The subfunctions are performed sequen- tially on the same group of four instructions. 191 I. Update queue weight registers. 1. Use a bit serial counter to decrement the weight register by 1 if the corresponding VEU has notified the IUD that a queued instruction has started to execute. 2. 
Cascaded with the bit serial counter, use a bit serial adder to increment the corresponding weight register by the queue use of the instructions which had their queue reservation processing completed in the previous minor clock. II. First minor clock of queue reservation processing. 1. Switch instructions and their operands into Unit. 2. Determine which weights to use. 3. Select the determined weights. 4. Increment each weight by the corresponding queue usage of the queue reservations made in the previous clock. 5. Determine the minimum weight. This and the previous function are done simultaneously. 6. Subtract the minimum weights from all weights. III. Second minor clock of queue reservation processing. 1. Increment each weight by the corresponding queue usage of the queue reservations completed in the previous clock. Simul- taneously subtract 0, 1, 2, 3, and 4 from each of these sums. 2. Select the set of weights that produced a zero sum and had the smallest value subtracted from it. 3. Decode the selected weights as follows: a. For instruction generate BO which is true if the weight for instruction queue n is 0. 192 b. For instruction 1 generate Bl similar to BO . n n c. For instruction 2 generate B20 and B21 . B20 is n n n similar to B0 n - B21 n is true if the weight for instruc- tion 2 queue n is 1 . d. For instruction 3 generate B30 , B31 , B32 . 4. Simultaneously select the queues for instructions and 1 according to the following algorithms: a. For instruction select the minimum n such that BO . n b. For instruction 1 select the minimum n such that Bl n and B0 n were not selected for instruction 0. If this is not possible, select the unique n for which Bl . ^ n 5. Select the queue for instruction 2, taking into account the queues selected for instructions and 1. 6. Select the queue for instruction 3, taking into account the queue selections for the previous three instructions. 7. Decode the queues selected into: a. Binary integers, giving the queue use for each queue. b. Queue addresses for the instructions. The detailed logic for performing these functions is described in Appen- dix A. Most of the logic equations in this appendix are fairly straight- forward. However, in order to meet the serious time constraints, the logic to perform functions 1 1 1-4 through III-6 above require a somewhat complex technique for constructing the logic equations. We will now describe this technique. The method may be thought of as a generalization of the trick used in constructing a pseudo carry adder. In general, we are construction a func- tion R n from many Boolean inputs. We divide the function into two case. 193 We then construct factors D and D which will detect these cases. We also construct Boolean selection functions SI and S2 n which will be true for the correct value of n in cases 1 and 2 respectively. Thus we will get an equation for R as: R n - DSl n V DS2 n This is essentially the way a pseudo carry adder is constructed. Just as one can generate a multi -level pseudo carry adder, we can generalize our technique to many levels. Doing this for the adder results in very symmetric equations. In our case, that symmetry is not present, and a major part of the design problem is providing a notation to keep track of the terms we have generated and the cases we have considered. Thus, in the appendix, we have indexed detection and selection terms as Di.j.k and Si.j.k where i, j, and k are integers equal to 1 or 2. 
Each integer separated by a dot represents another level of cases, and there is no precise limit to how many levels are allowed other than the necessity of keeping the equations to a reasonable size. We do not necessarily go to deeper levels in a symmetric way. For example, we might develop S2.1.1 down three more levels until we are con- sidering S2.1.1.i.j.k, whereas S2.1.2 may not be developed to any deeper level. This notation does make it fairly easy to construct such complex functions. We can always determine the cases we have not yet considered by simply reading an index backwards until we reach the first 1. The negation of that case is the next one to be considered. One other technique we employ to keep the equations from becoming too large is to construct a subcase in a lower level of logic and to simply use the output from this lower level in the final equation. When we do this, we replace the corresponding "." with a "-". Thus, we might construct an S2.1 .1 n to use in the equation for S2 n< 194 H h 3 -*- F R M F R M 2 — kR M QUEUE WEIGHT 4x1 SWITCHES REGISTERS AND ADDRESS u n F I N A L S E L E C T I N L G I C J I — a 0n .1 n u i J — *• •i n B.m l n F D I E N C A L D E R u? J •r *" n u. J t a 3 " J p TDC1 TMI rir ll i m_ | REGISTERS INCREntiNi anu MIN SELECTORS u n q! SECOND 5x1 SWITCHES INCREMENT UNIT FIGURE 24 VEU QUEUE SELECTOR 195 TABLE 33 TIMINGS FOR UPDATING WEIGHT SELECTION REGISTERS Minor Clock Instruction Group Instruction 12 3 Instruction Group 1 Instruction 12 3 Instruction Group 2 Instruction 12 3 Instruction Group 3 Instruction 12 3 E E E E 1 D 1 D 1 D 1 D 1 E E E E 2 D 2 D 2 D 2 D 2 D lWl E E E E 3 R R R R D 2 D 2 D 2 D 2 Wl D l E E E E 4 R R R R D 2 D 2 D 2 D 2 Wl D l 5 R R R R D 2 D 2 D 2 D 2 6 R R R R E = Enter Unit; D, ,D 2 = Determine Queue Use; R = Reset weight Registers Instruction, Instruction Group (2,0) (2,1) (2,2) (2,3) (3,0) (3,1) (3,2) (3,3) Requires information about the follow- ing instructions with weight registers not yet updated (1,0) (1,1) (0,0) (0,1) Same as (2,0 Same as (2,1 Same as (2,2 (2,0) (2,1) (1,0) (1,1) Same as (3,0 Same as (3,1 Same as (3,2 1,2) (1,3) 0,2) (0,3) plus (2,0) plus (2,1) plus (2,2) 2,2) (2,3) 1,2) (1,3) plus (3,0) plus (3,1) plus (3,2) 196 TABLE 34 TIMING AND GATE COUNT FOR VEU QUEUE SELECTION TIMING FOR FIRST CLOCK OF PIPELINE Logic Level 1 2 3 4 5 6 7 8 9 10 11 Function Completed Information switched into unit Weights selected Weights switched Minimum of weights with U. added found J Minimum selected Minimum subtracted from all weights TIMING FOR SECOND CLOCK OF PIPELINE Logic Level 1 2 3 4 5 6 7 8 9 Function Completed U. added, constant weights subtracted Group with first overflow switched Queue use determined Queue use decoded 197 TABLE 34 TIMING AND GATE COUNT FOR VEU QUEUE SELECTION (cont.) GATE COUNT Unit Queue Weight Registers Queue Weight Adders and Counters Weight Selection Logic Weight Switches Increment and Decode Loc lie Second Increment Final Selection Decoding TOTAL Number Gates 120 2 760 456 240 6 864 9 608 895 64 21 ,007 198 In turn, S2.1.1 could have terms such as D2.1 .l-i.j.kS2.1 .1-i i k n ,%J * n in it. We sometimes use a similar notation when values computed at a lower level of logic are required but do not exactly fit our detection selection scheme. In such cases, the values are usually defined, e.g., T2.1.1 , and are simply used in the equation for S2.1.1 . 
This notational scheme does seem to be an effective tool in generating complex multi-level Boolean functions where it is important to keep the number of levels small. Unfortunately, the scheme gives no algorithms for determining what cases are likely to be good ways to break up the function. Intuition and trial and error are required for that part of the process. Table 34 gives the overall timing and gate count of the unit. These figures are derived from Appendix A. These figures are not unreasonably large, but they probably cound be reduced substantially by some more playing with the design. The 11 levels of logic or 22 gate delays in one minor clock is probably the one figure that one would most want to reduce. 4.6.3.2.8 Reserve Vector Buffer Storage (RVS) This unit must reserve storage for the operands and results of each vector instruction. In the case of operands, the space must be within the VEU in which the instruction is to be executed. The same is true of results except for results from a memory load instruction which have space in the vector buffer. The operand and result portions of this unit operate on dis- joint storage spaces and are functionally independent. We will now describe the structure and operation of these two units. Figure 25 gives the overall structure of the result processing portion of this unit. From this point on we will not provide detailed logical design unless the unit is in some major way dissimilar from units already designed. 199 We will make rough conservative estimates of gate counts and logic levels required. In some instances limited design of parts of a unit may be re- quired to verify that the estimates are conservative. We will try to be explicit about the assumptions on which the estimates are based. Thus we will describe how this unit works and, within this commentary, provide estimates of gate counts and timing. This same approach will be used throughout the remainder of Chapter 4. The first step in assigning storage for vector results is to determine how many spaces are required in each VEU and in the Vector Buffer. This function is performed by the field decoder. We are processing up to four instructions in parallel, so up to four results may be required in a given VEU. The field decoder generates a count of the number of results required of each VEU by looking at the VEU address portion of the result field for each vector instruction. Decoding the address fields requires one gate for each bit in each field or a total 12 bits for each VEU. Thus the total will be under 100. One gate delay of 1/2 level of logic is required to do this decoding. The encoding of the counts for individual VEUs requires roughly 10 gates for each bit of the encoded results or under 300 gates total. One level of logic is required for this encoding. The counts produced are sent to the buffer status units and to the final switch. The vector buffer status unit differs from the VEU status units mainly in the size of the memory it is working with. The size of the VEU result memory is likely to be about 16 as discussed in Section 4.4.2.1. The size of the Vector Buffer is likely to be about 256 as discussed in Section 4.4.3, The other differences between these units is how they free locations and the possibility of buffer overflow. If the Vector Buffer ever overflows, this is a hardware or programming error as discussed in Section 3.2.1.2.1. 
The vector buffer status units try to keep their respective buffers from becoming full by outputting a "too full" signal when their size crosses a certain threshold. As discussed in Section 3.2.1.2.1, this threshold should be variable so that an optimal value can be determined by experience. It might also be varied with different programs or program mixes.

We will first discuss the VEU status units. The functions of these units are:

1. To maintain in the buffer registers four available buffer locations.
2. To signal to the VIDS when a buffer has exceeded its threshold size.
3. To signal to the entire IUD to pause when a request is made for space that cannot be honored.
4. To process signals that indicate a given VEU buffer location is available for reuse.

Since there are only 16 locations to be accounted for, it is reasonable to maintain these in a stack of registers that can be shifted four positions in a single minor clock. The logic for such registers should be less than the total number of bits times 12, or less than 1000. The logic to keep track of the size of the stack, to control the shifting, and to interrupt the IUD should be under 200 gates. If we allow a new entry to be made at every fourth position, the logic for the necessary switch and control should be under 200 gates. It is reasonable to allow at most one location to be freed in a minor clock and to buffer requests at their initiation whenever they are generated at a faster rate. Thus the total gate count for one of these units will be under 1400.

Figure 26 gives the detailed structure of the vector buffer status unit. A stack of registers, or push-up buffer, is also employed in this unit. This stack only contains 12 of the possible 256 available locations. A single status bit is maintained for each physical location, and a register is available if its status bit is one or its address is contained in the push-up buffer. The status bits are grouped into four portions of 64 each, and these are searched and set separately. The purpose of the push-up buffer is to allow for no pauses in instruction processing in the case when one or more of the groups of 64 may have no free locations. Experiments might be desirable to obtain an ideal size for this push-up buffer. However, given that we are assuming less than half the instructions are vector instructions, and given that this algorithm for assigning locations will tend to distribute locations uniformly, 12 is likely to be an adequate size. The size of this buffer should also be less than 1000 gates.

The clear tree's function is to decode a 6-bit address into a signal to set a status bit. This can be done in under 160 gates. The search tree must output a 6-bit address corresponding to some bit that is set and at the same time reset that bit. This can be done by having one set of logic that is pyramided up from the status bits and indicates whether a given set of four, and then sixteen, status bits has a one in it. Logic pyramided down to the status bits can choose the lowest set of four at each level, simultaneously decode two bits of the address of the bit that will be ultimately chosen, and send a signal to the correct one of the four groups it is looking at to choose a bit from that group. Only the decoding of the least significant two bits of the address is done at the base of the pyramid. This requires less than 400 gates. There are 256 status bits. At four gates each, this comes to 1024.
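The pyramided search just described is a find-first-set tree: one pass up records which 4-bit and 16-bit groups contain a set bit, and one pass down picks the lowest nonempty group at each level while decoding two address bits per level. The following software rendering covers one 64-bit group (the unit itself uses four such groups of 64); it is a sketch under that assumption, with names of my own choosing.

```python
def find_and_clear_lowest(bits):
    """bits: list of 64 status bits (1 = location free).
    Returns the 6-bit address of the lowest set bit and clears it,
    mimicking the combined search/reset tree; None if no bit is set."""
    # Pass up: does each group of 4, and then each group of 16, contain a 1?
    any4 = [any(bits[4*g:4*g+4]) for g in range(16)]
    any16 = [any(any4[4*g:4*g+4]) for g in range(4)]
    if not any(any16):
        return None
    # Pass down: choose the lowest nonempty group, two address bits per level.
    g16 = any16.index(True)
    g4 = 4*g16 + any4[4*g16:4*g16+4].index(True)
    bit = 4*g4 + bits[4*g4:4*g4+4].index(True)
    bits[bit] = 0      # reset the chosen bit as its address is read out
    return bit

free = [0] * 64
free[37] = free[50] = 1
assert find_and_clear_lowest(free) == 37
assert find_and_clear_lowest(free) == 50
assert find_and_clear_lowest(free) is None
```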
The free decoder decodes the first two bits of a free address and switches the remainder to the appropriate clear tree. This requires less than 60 gates. The control for the push-up buffer requires less than 200 gates. The total gate count for the vector buffer status unit is under 4600.

[FIGURE 25  RESULT PROCESSING PORTION OF UNIT TO RESERVE VECTOR STORAGE: the VEU fields of the vector instructions feed a field decoder; its counts drive the vector buffer status unit and the six VEU buffer status units (VEU 0 through VEU 5), each with full and free signals; buffer registers and a final switch deliver physical addresses to the vector address result fields.]

[FIGURE 26  VECTOR BUFFER STATUS DETAILS: free decoder, clear and search trees over the four groups of status bits, push-up buffer, and push-up buffer control.]

Returning to Figure 25, we still need to provide a gate count for the buffer registers and the final switch. The buffer registers require 720 gates, and the final switch and its control require less than 400 gates. Thus the total for the entire unit will be less than 14,500.

4.6.3.2.9 Update Vector Tables and Fill in Vector Operands (UV and FVO)

Now that we have determined a physical address for all vector results, we need to update the vector status table discussed in Section 4.6.3.2.5. In addition, we need to fill in the vector operands which were not known at that stage in the pipe. No additional logic is required for the first of these functions, since we provided sufficient address decoders when we originally discussed the vector status table. For the missing operands we have only a time index of the instruction which generates the result. We need to construct a table which allows us to map this time index into a physical address. A simple way to do this is to provide a buffer with one physical location for each possible time index of an originally undefined operand. Then, if we load and address this buffer in a circular fashion and have six independent ports to address it, we will have the problem solved. This unit will be similar to the non-comparison-tree portion of the vector buffer comparison tree designed in Section 4.6.3.2.3. We will use the gate counts of that unit. We do not require the 4 x 36 switch listed there, and we can get by with four 1 x 9 switches, thereby dropping the gate count for the switch to 720. The six address decoders require less than 2000 gates. This gives a total of less than 6500 gates.

4.6.4 Tail End of Main IUD Pipe

The main IUD pipe consists of those functions listed in Figure 18. In the course of designing the pipe we have decided to move functions related to scalar operands and results to the SIDS. We now must complete the IUD pipe by assembling bytes into complete instructions and shipping these instructions to the VIDS, SIDS, or main memory for further processing and ultimate execution. In addition, after the instructions are assembled, any instruction that requires the vector switch must have queue entries generated.

4.6.4.1 Assembling Instructions (AV, AS, AM)

Referring again to Figure 18, we see that all the pipes, except the operator pipes, are separated into vector, scalar, and memory instruction bytes. Thus the operators must first enter a 3 x 1 switch as they emerge. From this switch, they enter a buffer for either vector operators, scalar operators, or memory operators. Simultaneously with making entries in each of these buffers, we will set up a word of presence bits to be used in removing entries from these buffers. A portion of this hardware is illustrated in Figure 27.
[FIGURE 27  TAIL END OF IUD PIPE, VECTOR OPERATOR PORTION: operator bytes from the main IUD pipe pass through 3 x 1 switches (repeated 10 times) to the memory operator, scalar operator, and vector operator buffers and their presence bits; the vector operator buffer, together with the parallel inputs for vector operands and results, feeds a 4 x 1 switch into the vector instruction buffer.]

The non-operator pipes do not require the initial switch. They do require a buffer and set of presence bits for the process of assembling complete instructions. After these buffers, another set of switches is required to merge the bytes of an instruction into a complete instruction.

The size of the buffers and switches is determined by the various data rates involved. We will not consider the question of what constitutes optimal size, but will suggest some reasonable sizes. Both vector and scalar execution units can process instructions at the rate of one per minor clock. Since there will be six of each of these units, the overall emergence rate will average to slightly less than one instruction per minor clock. Assuming two memory instructions for each vector instruction is probably conservative. To relate these figures to the parameters we are determining, we need to consider instruction sizes and emergence rates from the pipe. The constraints on instruction sizes are listed in Table 18.

There will be two levels of buffering involved. A sparse buffer will collect the output as it emerges from the IUD. From here the output is transmitted to a dense instruction buffer, from which it will be transmitted to its final destination. In this section we are determining the size of the first buffer and the size of the intervening switch. The input width of this switch determines the rate at which the sparse buffer is emptied. The output width of the switch determines the rate at which complete instructions can be assembled. There are three of these switches operating in parallel for each type of instruction. The three switches are for operators, operands, and results. It is these parallel switches which reassemble the instructions. The output width of these switches must at least accommodate the average instruction processing rate. We will use widths slightly larger than this, as listed in Table 35. The input widths must be adequate to accommodate the output widths. This can be determined by consulting Table 19. The size of the sparse vector buffer should be large enough to accommodate uneven instruction distribution without stopping the IUD. Determining an optimal value for this size is probably an impossibility. A smart compiler could probably do quite well with very little buffering by distributing instructions. A size between 4 and 8 words long would probably be reasonable.
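The reassembly step pairs bytes from the operator, operand, and result buffers by instruction identity. A toy model of the dense-side assembly follows; dictionaries stand in for the presence bits, and all names are mine.

```python
def assemble(operators, operands, results):
    """Merge per-component buffers into complete instructions.
    Each buffer maps an instruction tag to its bytes; an instruction
    leaves only when all three of its components are present."""
    done = []
    for tag in sorted(set(operators) & set(operands) & set(results)):
        done.append((tag, operators.pop(tag) + operands.pop(tag) + results.pop(tag)))
    return done

ops  = {7: [0x2A], 8: [0x11]}
args = {7: [0x03, 0x04]}
res  = {7: [0x90], 8: [0x91]}
assert assemble(ops, args, res) == [(7, [0x2A, 0x03, 0x04, 0x90])]
assert 8 in ops and 8 in res      # instruction 8 waits for its operands
```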
TABLE 35
LOGIC SUMMARY FOR ASSEMBLING INSTRUCTIONS

Unit                                  Size   Number   Gate Count
Switches for Operators                3x1    10       2400
Sparse Operator Buffers               10x6   3        14400
Switch and Buffer Input Controls             3        3720
Vector and Scalar Operand Buffers     6x6    2        5760
Buffer Input Controls                        2        408
Vector and Scalar Result Buffers      4x5    2        3840
Buffer Input Controls                        2        272
Memory Operand and Result Buffers     7x6    2        6720
Buffer Input Controls                        2        476
Vector and Scalar Operator Switches   4x6    2        1440
Switch Controls                              2        240
Memory Operator Switch                6x8    1        1920
Switch Control                               1        160
Vector and Scalar Operand Switches*   4x6    2        1440
Switch Controls                              2        240
Memory Operand Switches*              6x8    1        1920
Switch Control                               1        160
Vector and Scalar Result Switches*    2x6    2        720
Switch Controls                              2        240
Memory Result Switch*                 3x8    1        960
Switch Control                               1        160
                                             TOTAL    47,595

*The attribute refers to instruction type, not operand or result type. We have assumed 20-bit words and data paths, 4 gates/bit of storage, and 2 gates/switch bit junction.

Table 35 lists actual choices for all the design parameters required and provides approximate gate counts. We do not do any design of the switch controls. They are driven by instruction indices and instruction types carried along with the instructions. Techniques used in previous sections should easily produce units within the specified gate estimates and timing constraints.

4.6.4.2 Initiating Transfer of Instructions (IM, ISC, IV)

The next function is to ship the assembled instructions to the appropriate units. One of these destinations will be a unit for generating vector switch instructions. We need to estimate buffer sizes and data path widths using the methods and estimates of the previous section. These estimates and gate counts are summarized in Table 36.

4.6.4.3 Generating Vector Switch Instructions (GSI)

Either memory or vector instructions may require use of the vector switch. We must scan these instructions for vector operands and results and, where present, generate the appropriate queue entries for the vector switch. What is required is that the source and destination for each word to be switched be selected from the instruction streams and combined to make a vector switch queue entry. Physical addresses will always be used. In the case of memory instructions, we must reserve space in the vector memory buffer. Finding a space in this buffer is simply a matter of allocating a free location. Thus, simplified versions of the logic described in Section 4.6.3.2.8 can be used. The data rates must be adequate to handle the maximum rate at which instructions can emerge. Table 37 summarizes the logic requirements for this function.

TABLE 36
LOGIC FOR INITIATING INSTRUCTION TRANSFERS

Unit                                    Size   Number   Gate Count
Vector and Scalar Instruction Buffers   6x4    2        3840
Memory Instruction Buffer               8x4    1        2560
Vector and Scalar Buffer Controls              2        400
                                               TOTAL    6800

TABLE 37
LOGIC FOR GENERATING VECTOR SWITCH INSTRUCTIONS

Unit                                                     Gate Count
Select up to 2 out of 32 Available Memory Source
  Buffer Locations                                       400
Select up to 2 out of 32 Available Memory Destination
  Buffer Locations                                       300
Generate Vector Switch Queue Entries for
  Memory Instructions                                    800
Generate Vector Switch Queue Entries for
  Vector Instructions                                    400
                                                TOTAL    1900

4.6.5 Scalar Instruction Dispatcher Subsystem

The SIDS must provide time indexes and maintain use counts for physical scalar addresses. The logical functioning of this unit is described in detail in Sections 4.3.2.1, 4.6.2.3, 4.6.2.5, and 4.6.3.2.1.
We will summarize these descriptions, provide an overall design of the unit, and give gate count estimates. It should be noted here that the SPU, SST, US, and SSE functions listed in Section 4.6.3.1.5 are performed in this unit.

4.6.5.1 SIDS Functional Summary

The functions listed in Table 24 are pipelined with an emergence rate of one instruction per minor clock. They operate on the scalar status table and the scalar use table. The scalar status table contains one location for each possible active time index. It allows an associative search to be made for the correct time index of a scalar operand. Each new result causes the corresponding time index location to be loaded with the physical address for that result. Simultaneously, an associative search is made to delete any entry with the same physical address. The scalar use table consists of two parts. There is a section addressable by time indexes and another section associatively addressable by physical address. This second portion is for scalars in use with a time index that is about to be or has been reused. These are referred to as the index use table and the old operand table.

4.6.5.2 Detailed Design of SIDS

We now provide a detailed description of the SIDS structure. Table 39 provides a description and gate count for all the tables we refer to. The function US consists of two parallel stores to the scalar status table. The function SPU merely retains a result associated with its time index to be searched in performing the SST function. The SSE function selects the scalar execution unit. Since one scalar queue can drive several equivalent SEUs, this function will ordinarily be null, with instructions being routed to the unique queue required. If it is desired to have independent queues, then the logic of Section 4.6.3.2.6 can be used. The SST function consists of a parallel search of the scalar use tables for the most recent reference to the specified physical addresses. Flow charts for the functions UU, USU, and RU are provided in Figure 28. The AL function consists of accumulating a list of result locations and time indexes as use counts with non-zero links go to zero. These control functions can all be implemented in under 10,000 gates.

TABLE 38
SIDS FUNCTIONS

Function                                        Abbreviation   Time   Dependency
Use Result to Update Scalar Status Table        US             2      None
Retain Result for Pipelined Search of Scalar
  Status Table that will happen before this
  Entry is Complete                             SPU            1      None
Select Scalar Execution Unit                    SSE            1      None
Find Time Indexes for Operands                  SST            2      SPU
Update Use Counts for Operands                  UU             2      SST
Update Scalar Use Table as Instructions are
  Executed                                      USU            2      Instruction Execution
Accumulate List of Time Index and Physical
  Location Pairs for Stores that can Proceed    AL             8      Continuous Function
Use Result to Update Scalar Use Table           RU                    None

TABLE 39
SPECIFICATIONS AND GATE COUNTS FOR SIDS TABLES

I SCALAR STATUS TABLE

Size: 256 entries
Fields: 12 bits for physical scalar address
Parallel Accesses:
  2 associative reads (SST)
  1 store (US)
Gate Count: 256(12*8*2 + 4*12) = 61,440

II INDEX USE TABLE

Size: 256 entries
Fields:
  12 bits for physical address (associatively addressable)
  2 bits for top and bottom list flags (associatively addressable)
  8 bits for link
  6 bits for use count
Parallel Accesses:
  2 increments of use count (UU)
  2 decrements of use count (USU)
  1 associative read (RU)
  1 store (RU)
  1 store (UU)
Gate Count: 256(14*8 + 8*8 + 6*16 + 4*28) = 98,304
III OLD OPERAND TABLE

Size: 64 entries
Fields:
  12 bits for physical address (associatively addressable)
  1 bit for top of list (associatively addressable)
  8 bits for link to index use table
  6 bits for use counter
Parallel Accesses:
  1 associative search (RU)
  1 set for new result (RU)
  2 associative searches and increment counter (UU)
  2 associative searches and decrement counter (USU)
  1 read of link when use count goes to zero (AL)
  1 store (RU)
Gate Count: 64(13*8*5 + 6*16 + 12*4 + 4*27) = 49,408

Assumptions used in gate counts:
  8 gates per bit per associative access
  4 gates per bit per regular access
  16 gates per counter bit

[Figure 28: SIDS Flowcharts (Result Update; Update Use Count for Operands; Update Use Count for Instruction Execution)]

4.6.6 Vector Instruction Dispatcher Subsystem

The VIDS has the responsibility of freeing physical vector addresses as soon as possible. The algorithm for doing this is to maintain a use count for each logical buffer address. If a store going to a logical address is processed, the corresponding physical location can be reused when the use count goes to zero. Use counts are incremented each time an operand appears in the instruction stream in the VIDS. They are decremented each time the vector switch or internal switch transfers an operand. Since the physical address of each active vector buffer location is unique, we may organize the table on this basis. Table 40 gives the specifications of this table. Less than 4000 gates will be required for control purposes.

TABLE 40
VIDS TABLE SPECIFICATIONS

Size: 256 entries
Fields:
  8 bits for logical address
  6 bits for use count
  1 bit indicating the location may be freed when the use count is zero
Parallel Accesses:
  1 store for a new result from the instruction stream
  1 associative search on a result from the instruction stream
  2 increments of use count for each operand in the instruction stream
  2 decrements of use count for each operand used by the VEUs
  1 decoding of a physical address when it becomes available
Gate Count: 256(8*8 + 4*15 + 6*16) + 512 = 56,832

See Table 39 for gate count assumptions.

4.7 GATE COUNT SUMMARY

Table 41 provides a summary gate count for all the logic discussed in this chapter. It is divided into buffer-type gates and other logic. Summaries are provided for the computation portion, the IUD, and memory control. The counts do not include memory itself, which consists of one million words of 64 bits with a 1 major clock access rate. The gate counts for buffers assume 4 gates per bit. We have assumed 5000 gates per parallel computing element in each VEU. We have assumed 10,000 gates per SEU.
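As a check on the table arithmetic before it is summarized, the gate count conventions of Table 39 (8 gates per bit per associative access, 4 gates per bit per regular access, 16 gates per counter bit) can be applied mechanically. The fragment below recomputes the Table 40 figure; the grouping of fields into access classes is our reading of that table, not an additional specification.

```c
/* Recompute the VIDS table gate count from the stated conventions:
 * 8 gates/bit per associative access, 4 gates/bit per regular access,
 * and 16 gates per counter bit.  Entry layout follows Table 40. */
#include <stdio.h>

int main(void)
{
    int entries = 256;
    int assoc   = 8 * 8;   /* 8-bit logical address, one associative access */
    int regular = 4 * 15;  /* 15 bits (8 + 6 + 1) accessed regularly        */
    int counter = 6 * 16;  /* 6-bit use counter                             */
    int control = 512;     /* decoding of freed physical addresses          */

    int total = entries * (assoc + regular + counter) + control;
    printf("VIDS table gates: %d\n", total);   /* prints 56832 */
    return 0;
}
```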
TABLE 41
COMPUTATION UNIT SUMMARY GATE COUNT

Unit                          Gate Type   Source of Count   Count
6 SEUs                        1           Estimate           60,000
Scalar Status Tables          1           Table 5            16,050
Scalar Buffers                2           Table 6           655,360
Scalar Switch                 1           Table 8            88,880
Scalar Assembling Unit        1           Table 9            25,000
6 VEUs (control)              1           Table 11           87,936
6 VEUs (buffers)              3           Table 11          480,000
6 VEUs (arithmetic)           1           Estimate          240,000
Vector Buffer                 3           Section 4.4.3     131,072
Vector Switch                 1           Table 13          252,544
Memory Switches and Control   3           Table 14          992,040

Gate Types:
1. Ordinary logic.
2. Simple memory, access rate 1 word per 2 minor clocks.
3. Simple memory, access rate 1 word per minor clock.

TABLE 41
COMPUTATION UNIT SUMMARY GATE COUNT (cont.)

The remaining components are in the IUD.

Unit                              Gate Type   Source of Count     Count
Assign Instruction No.            1           Table 21              1,400
IUD Front End                     1           Table 23             17,104
Time Index Generator              1           Table 27              2,042
Vector Buffer Comparison Tree     1           Table 28             23,824
Vector Status Table               1           Table 29             76,696
Ports to VEU Queue Selector       1           Table 31                400
Partial Instruction Detection     1           Table 32                192
VEU Queue Selection               1           Table 34             21,007
Reserve Vector Buffer Storage     1           Section 4.6.3.2.8    14,500
Update Vector Tables, etc.        1           Section 4.6.3.2.9     6,500
Assembling Instructions           1           Table 35             47,595
Initiating Instruction Transfer   1           Table 36              6,800
Vector Switch Instructions        1           Table 37              1,900
SIDS Control                      1           Section 4.6.5.2      10,000
SIDS Tables                       1           Table 39            209,152
VIDS                              1           Table 40             56,832

Scalar and Computation Summary: Type 1, 770,410; Type 2, 655,360; Type 3, 1,603,112.
Memory Summary: Type 1, 1,851,552; Type 3, 992,040.
IUD Summary: Type 1, 495,944.

5 MACRO INSTRUCTION DECODER, I/O CONTROL AND EXTERNAL EXPANDABILITY

The machine designed in Chapter 4, with the addition of some I/O control, could be a complete CPU. In this chapter we briefly discuss possible additions to it that could significantly enhance its performance. The Macro Instruction Decoder, as described in Chapter 2, converts UAL instructions into Operand Fixed Format Instructions. Its primary purpose is to provide a high level of flexibility and to help eliminate program non-determinism as discussed in Section 3.2. These are also the reasons for including a scheme for anticipatory I/O. We will also describe the paging algorithms for this machine. Finally, we will discuss external expandability, or the connection of many of these computers to form a single working unit. This chapter is an outline of projects we would undertake if we had unlimited time, energy, and resources.

5.1 MACRO INSTRUCTION DECODER

The Macro Instruction Decoder may be regarded as a combination interpreter of UAL and operating system. Its primary function is to convert UAL instructions to OFFL instructions. Involved in this process are the following major tasks:

1. Convert instructions operating on arbitrary sized vectors to operate on the fixed vector width of the machine.
2. Insure that all memory accesses refer to pages present in Primary Memory.
3. Execute all transfers of control. In the case of conditional transfers, attempt to cause any required values to be computed by the execution units at the earliest feasible time. (The MIDs can request values from the EUs for use in evaluating conditional transfers.)
4. Attempt to anticipate I/O requests at the earliest possible time.
5. Perform normal operating system functions.

We have described in detail algorithms for converting vector instructions operating on arbitrary sized vectors to instructions for a fixed vector width [1]. We will not discuss this function further here.
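The flavor of the first task can be suggested with a fragment that decomposes an arbitrary-length vector add into operations of the machine's fixed width. This is a sketch of the general idea only, not the algorithm of [1]; the width of 8 and all names are our assumptions.

```c
/* Illustrative decomposition of an arbitrary-length vector add into
 * fixed-width operations.  The width of 8 matches an 8-word-wide
 * parallel unit; the masking of the final partial strip stands in
 * for whatever mechanism the hardware would actually use. */
#define VW 8   /* assumed fixed vector width of the machine */

void vadd(double *c, const double *a, const double *b, int n)
{
    int i = 0;
    /* full-width strips */
    for (; i + VW <= n; i += VW)
        for (int j = 0; j < VW; j++)     /* one fixed-width vector op */
            c[i + j] = a[i + j] + b[i + j];
    /* final partial strip, shorter than the machine width */
    for (int j = 0; i + j < n; j++)
        c[i + j] = a[i + j] + b[i + j];
}
```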
The second MID function may require subscript evaluation. Whenever this occurs, the effect is the same as a conditional branch. The MID cannot continue processing instructions until it can be assured that the required pages are available. Two features should be included to minimize the problems associated with this situation. First, both the compiler and the MID should attempt to insure that subscript expressions are evaluated as early as is practical. The MID should be constructed to use this information. Second, it should be possible to declare various arrays as save core during execution of various program segments. If it becomes necessary to swap a page of save core, then the entire associated program should be swapped out. The programmer, the compiler, and possibly even the MID should have the option of requesting save core.

The last three functions are all standard ones, with the observations we have already made about minimizing the effects of non-determinism. The techniques used in Chapter 4 should allow one to implement the functions described in hardware in an efficient manner. A great deal of analysis and experimentation would be required to obtain a good final result. In Chapter 6 we will outline some pragmatic considerations about constructing the entire system.

We will conclude these remarks on the MID by considering one major issue that should significantly influence its detailed design. The MID performs many compiler-like functions, and it is an open question which functions should be performed by the compiler and which by the MID. The primary motivation for moving compiler functions to the MID is the existence of information at execution time that is not available at compile time. The primary drawback is that the functions must be performed in an interpretive manner whenever a given code segment is executed. To the degree that it is possible to do the analysis fast enough with logic that is significantly less costly than the "computing portion" of the machine, this is not a major drawback. The only way to get a good hold on what the tradeoffs are is to do some experimentation. We do not yet have all the techniques required to design a good MID as outlined above. Once we have generated some basic set of building block ICs and have experience with connecting them, similar to the experience we now have in constructing large compilers, I would anticipate this approach to be highly productive.

5.2 PAGING DESCRIPTION

In this section we outline some minimum requirements for a paging algorithm to function with the machine already described, and we describe some information that the MID could make available that would be of use to an intelligent memory manager. One essential requirement is that a page be locked if any instruction accessing it has gotten past the MID. A locked page cannot be transferred to backup memory until all pending requests for access from the EUs have completed. Another required page status is that it be saved. A saved page is one considered essential to the current reasonable execution of a particular program; it cannot be swapped unless the entire program is swapped. Additionally, the MID can look ahead and anticipate what pages are about to be accessed. Thus an additional state a page can be in is that of being about to be required. It should be possible for the MID to provide a rough estimate of how imminent the access is as a basis for determining priorities.
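These page states can be collected into a small status record. The sketch below is a hypothetical encoding of the requirements just listed; the field names and the eviction rule are our own assumptions, not part of the design.

```c
/* Hypothetical encoding of the page states described above: locked
 * while EU accesses are pending, saved with its program, and marked
 * in advance when the MID anticipates a reference. */
#include <stdbool.h>

typedef struct {
    int  pending_eu_accesses;  /* > 0 means the page is locked         */
    bool saved;                /* swaps only with its entire program   */
    bool anticipated;          /* MID expects a reference to this page */
    int  imminence;            /* MID's estimate: smaller = sooner     */
} page_status;

/* A page may be moved to backup memory only if it is neither locked
 * nor saved. */
bool swappable(const page_status *p)
{
    return p->pending_eu_accesses == 0 && !p->saved;
}

/* Higher score = better eviction candidate: pages the MID expects to
 * touch soonest get the lowest scores, so the manager keeps them. */
int eviction_score(const page_status *p)
{
    if (!swappable(p))
        return -1;                  /* not a candidate at all */
    return p->anticipated ? p->imminence : 1 << 20;
}
```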
5.3 EXTERNAL EXPANDABILITY

The remarks in this section will be primarily philosophical. They might be thought of as an expansion on the ideas that led to the two-level clock discussed in Section 3.1.3.1. The fundamental concept is that the physical size of a computing structure imposes constraints on the interface and control structures of subunits. The primary constraint is that the larger the physical size, the longer the delays that must be tolerated. A secondary constraint is that the amount of information passed between subunits should be kept reasonably small. The interface scheme for two clocks at different structural levels could be generalized to more levels. An especially serious constraint, related to the discussions of non-determinism in the previous section, is that of the control structure. Traditional computers have a hierarchical control structure. The operating system resident in the CPU controls the entire computing system. Some units, like I/O channels, may have a limited degree of autonomy. Computer networks like the ARPA net have a democratic structure: there is no central source of control. The larger the physical size of a computing unit, the more desirable a democratic structure becomes. If one wished to use a large computing system for a single problem and it possessed a democratic structure, then one's program would need to reflect that structure. Basically, what is required is fork and join operations and the ability for independent processes to interrupt and in other ways communicate with each other. There do exist computing languages with these features; they are primarily used in real-time computing systems. Our computer structure, with its operating system computer, the MID, independent of the number crunching part of the machine, could provide an excellent candidate for a democratically structured computing system.

6 CONCLUSION

To perform a detailed and complete analysis of the structure we have designed would require an extremely elaborate and costly computer simulation. Such a process could provide much information about ironing out details, refining design parameters, and in general improving implementation details. Such a process is not necessary to provide general estimates of the performance of this structure. For this purpose we can use the generalized measures on FORTRAN programs which have been experimentally obtained. We justify this approach as being a useful and meaningful first iteration in the process of developing the design techniques and structural approach we have adopted.

The basic postulate is that this structure can obtain the potential speed-up and efficiency that has been measured in FORTRAN programs. We justify this statement by the flow analysis that we have provided throughout Chapter 4 and by the structure of the arithmetic units. The effective width of our machine is 38. This includes four 8-word wide parallel units and six scalar arithmetic units. Although some of the FORTRAN programs could benefit from a wider machine, most used roughly this amount of parallelism or less. The multiprogramming structure of the machine allows the entire machine to be fully utilized while executing individual programs that could not effectively utilize it. Our hardware-based real-time scheduling will allow less compile-time analysis, and it allows non-deterministic breaks, if they are sparse enough, to occur without degrading utilization.
We did not start out to design a machine to accommodate arbitrary FORTRAN programs of the type measured, and we would not propose that the machine be devoted to an essentially random mix of FORTRAN programs. The most cost-effective way to execute a small FORTRAN program is to find the smallest available minicomputer that will accommodate it. Showing that such a structure can be effectively utilized on such a random mix of jobs guarantees that it works for the worst cases it is likely to encounter. Our original aim was to design a good, flexible, easy-to-program parallel computer based to a large degree on our experience and intuition obtained from working with ILLIAC IV and thinking about other parallel computers. Thus, for example, our independent vector and scalar execution units evolved directly from problems in programming ILLIAC. Just as the FORTRAN program measurements were intended as a sort of benchmark establishing a minimum degree of utilizable parallelism over a broad class of problems, we use them here as a minimum benchmark of this machine's performance.

One objective of our work is simply not measurable: to provide a machine that is easy to program. It is my belief that one of the major difficulties in using current parallel machines effectively is that few people understand how to program them. Our primary inspiration for this approach was the B5500 machines and their use of hardware to handle many of the tedious details of programming, and to do so in an execution-time dynamic way. The ultimate measurement of the value of that approach as it was applied in those machines was the economic success of a machine that, if it were rated on a multiplies-per-dollar basis, would come out very poorly. Because the problems of programming parallel machines are significantly more complex, such hardware aids seem to us to be even more desirable for them. In summary, the design can effectively exploit the parallelism that has been measured in a broad class of problems. It has a great many features that should significantly ease the burden of exploiting parallelism.

SPECIFIC RESULTS

The result of this work is not a detailed plan for constructing a computer, but rather the development of a general approach and techniques for implementing that approach. The detailed design work and its relationship to measures on FORTRAN programs is intended as a justification that the approach and techniques are practical and effective. Here we will sort out which of our techniques and approaches appear particularly successful and which areas call for additional study.

The generalization of the technique for designing a carry look-ahead adder seems to be a useful technique for designing fast, complex combinatorial circuits. This is the technique described in Section 4.6.3.2.7. One area where this technique might be productively employed is in providing real-time dynamic control for a multi-level crossbar switch as defined in [4]. This unit allows the arbitrary permutation of a vector in an extremely cost-effective way, but it requires a highly complex scheduling algorithm. If one could construct a relatively inexpensive combinatorial circuit to schedule such a network, one would probably have the ideal crossbar switch for large applications. This scheduling problem may well be suited to the type of analysis we developed. One could begin by designing the logic for a two-level 4x4 crossbar, then gradually move up to higher levels. The analysis technique would certainly provide a reasonable hardware scheduling algorithm for the initial small switches, and the intuition developed might well lead to generalizations valid for larger switches.

Most of the logic design we have done is certainly far from optimal. Our circuitry for very fast conflict resolution may be an exception. It requires few gates, is extremely fast, and we have proven that it can be
The analysis technique would certainly provide a reasonable hard- ware scheduling algorithm for the initial small switches and the intuition developed might well lead to generalizations valid for larger switches. Most of the logic design we have done is certainly far from optimal. Our circuitry for very fast conflict resolution may be an exception. It requires few gates, is extremely fast, and we have proven that it can be 230 generalized to an arbitrary number of units in possible conflict. We have used it in a great many situations in our overall design. This unit is described in Section 4.4.2.4. The observations about block structure and universal building blocks in Section 3.1 seem to us to be particularly significant and an area re- quiring much further development. Certainly designing a set of basic building block ICs is a problem of major significance for the super- computers of the future. We have made some very preliminary steps in that direction. The concept of instruction level multiprogramming seems to be useful in the environment we have employed it. The advent of cheap mini and microcomputers has certainly greatly reduced the need for multiprogramming. It does seem to us to be an important feature for very large parallel com- puters for two reasons. First, a great many runs on such computers will be short debugging runs. The availability of the machine for such pur- poses can be a very critical factor in program development time. Multi- programming can allow short high-priority jobs to be run while longer production jobs are also using the machine. The second reason is provid- ing two independent processes may be an effective way to program some large tasks. Providing hardware to execute these is desirable. Instruction level multiprogramming is particularly nice in that there is no overhead involved in swapping out programs from registers. The operating system control resides in a processor entirely independent from the various arithmetic units, and as long as any MID is feeding them instructions, they can be utilized efficiently. 231 LIST OF REFERENCES 1 Budnik, P. P., "Tranquil Arithmetic," M.S. Thesis, University of Illinois, 1969. 2 Budnik, P. P., "An Intuitive Interpretation ofthe Hyperarithmetic Sets," talk presented at the Spring 1972 meeting of the Association for Symbolic Logic, abstract printed in the Journal of Symbolic Logic , volume 37, number 4, p. 778, December 1972. 3 Davis, E. W. , "A Multiprocessor for Simulation Applications," Ph.D. Thesis, University of Illinois at Urbana-Champaign, Department of Computer Science Report No. 527, June 1972. 4 Kuck, D. J., D. H. Lawrie, and Y. Muraoka, "Interconnection Networks for Processors and Memories in Large Systems," COMPCON 72 Digest of Papers , pp. 131-134. 5 Kuck, D. J., Y. Muraoka, and S. C. Chen, "On the Number of Operations Simultaneously Executable in FORTRAN-Like Programs and Their Result- ing Speedups," IEEE Trans, on Computers , volume 21, number 12, December 1972, pp. 1293-1310. 6 Muraoka, Y. , "Parallelism Exposure and Exploitation in Programs," Ph.D. Thesis, University of Illinois at Urbana-Champaign, Department of Computer Science Report No. 424, 1971. 7 Rogers, H. , Theory of Recursive Functions and Eff ective Computability , McGraw Hill, 1967. ~~~ 8 Tomasulo, R. M. , "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of Res, and Devel . , volume 11, number 1 January 1967. 9 Turn, Rein, Computers in the 1980s , Columbia University Press, 1974. 
APPENDIX A

DETAILED LOGIC FOR VECTOR EXECUTION UNIT SELECTOR

This appendix describes in detail the unit outlined in Section 4.6.3.2.7. The notational conventions and the structure of this appendix are explained in Section 4.6.3.2.7. The following conventions will be observed throughout this appendix:

1. Superscript i ranges over (0,1,2,3) and refers to an instruction.
2. Superscript n ranges over (0,1,2,3) and refers to a VEU.
3. Superscripts or subscripts may be omitted when they are uniform and unambiguous throughout an equation.
4. All weight registers are 5 bits wide.
5. value(X_j) indicates the value of the binary integer defined by Boolean values X_0, X_1, ..., X_4; X_0 has the highest significance.

In addition, the following notation will be used:

U_j^n = bit j of the amount added to the size of the queue for VEU n.

A-1 QUEUE WEIGHT REGISTERS AND ADDERS

Inputs

U_j^n
d^n = indicates that queue n is to be decremented by 1.

Function

value(U_j^n) - d^n must be added to the contents of each queue weight register a_j^{ln}, where l = (0,1,...,5) ranges over the 6 queue weights for a single queue.

Algorithm

We will use a serial counter and a serial adder cascaded together.

Equations

t_j indicates bit j of the counter output.
ct_j indicates the carry from place j of the counter.
ar_j is the new value of a_j^{ln}.
c_j indicates the carry from bit j in producing the output.

t_0 = U_0 \bar{d} \vee \bar{U}_0 d
ct_0 = \bar{U}_0 d
t_j = U_j \overline{ct}_{j-1} \vee \bar{U}_j ct_{j-1}        (j = 1,2,3)
ct_j = \bar{U}_j ct_{j-1}        (j = 1,2,3)

ar_0 = t_0 \bar{a}_0 \vee \bar{t}_0 a_0
c_0 = t_0 a_0
ar_j = t_j \bar{a}_j \bar{c}_{j-1} \vee \bar{t}_j a_j \bar{c}_{j-1} \vee \bar{t}_j \bar{a}_j c_{j-1} \vee t_j a_j c_{j-1}        (j = 1,2,3)
c_j = t_j a_j \vee t_j c_{j-1} \vee a_j c_{j-1}        (j = 1,2,3)

Logic Levels: 6
Gates: 115

(These are just standard counter and adder circuits. We include them as an example of our notation and because we wish to design this unit in complete detail.)

A-2 WEIGHT SELECTION LOGIC

Inputs

Y^{ik} indicates that operand k of instruction i is unknown (k = 0,1).
X_j^{ik} = bit j of the queue address where operand k of instruction i resided (j = 0,1,2,3). We will assume that (X_0^{ik} \vee X_1^{ik}) implies the operand is not assigned to one of these four VEUs.
A_j^i = bit j of the VEU assigned to instruction i.

Outputs

W_l^{in} indicates whether weight l for instruction i, queue n is to be switched. l has the following meanings:

Number of Operands from    Instruction i       Instruction i not
Instruction i in Queue n   Assigned to VEU n   Assigned to VEU n
0                          W_0^{in}            W_3^{in}
1                          W_1^{in}            W_4^{in}
2                          W_2^{in}            W_5^{in}

An unknown operand will count as being in the assigned VEU.

Algorithm

First we generate:

P_k^{in}, which indicates whether operand k of instruction i is in queue n.
AP_k^{in}, the same as P_k^{in} but counting an unknown operand as assigned to queue n.
AA^{in}, which indicates whether value(A_j^i) = n.

We will then use these to generate the W_l^{in} in a fairly obvious way.

P_k^{in} = [value(X_j^{ik}) = n]; for operand 0:

P_0^{i0} = \bar{X}_0^{i0} \bar{X}_1^{i0} \bar{X}_2^{i0} \bar{X}_3^{i0}
P_0^{i1} = \bar{X}_0^{i0} \bar{X}_1^{i0} \bar{X}_2^{i0} X_3^{i0}
P_0^{i2} = \bar{X}_0^{i0} \bar{X}_1^{i0} X_2^{i0} \bar{X}_3^{i0}
P_0^{i3} = \bar{X}_0^{i0} \bar{X}_1^{i0} X_2^{i0} X_3^{i0}

P_1^{in} = [value(X_j^{i1}) = n]; the expansion is similar to the above.

AP_0^{in} = [value(X_j^{i0}) = n] \vee Y^{i0}
AP_1^{in} = [value(X_j^{i1}) = n] \vee Y^{i1}
AA^{in} = [value(A_j^i) = n]

W_0^{in} = AA^{in} \overline{AP}_0^{in} \overline{AP}_1^{in}
W_1^{in} = AA^{in} AP_0^{in} \overline{AP}_1^{in} \vee AA^{in} \overline{AP}_0^{in} AP_1^{in}
W_2^{in} = AA^{in} AP_0^{in} AP_1^{in}
W_3^{in} = \overline{AA}^{in} \bar{P}_0^{in} \bar{P}_1^{in}
W_4^{in} = \overline{AA}^{in} P_0^{in} \bar{P}_1^{in} \vee \overline{AA}^{in} \bar{P}_0^{in} P_1^{in}
W_5^{in} = \overline{AA}^{in} P_0^{in} P_1^{in}

Logic Levels: 2
Gates: 114

A-3 INCREMENT AND MIN SELECTOR DETAILS

Inputs

a_j^{in} = bit j of a queue weight from the weight selection switch.
U_j^n

Functions

1. Subtract U_j^n from each of the 4 weights.
2. Find the minimum weight.
3. Subtract the minimum weight from all weights.
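Before developing the gate-level equations, the intended behavior of this unit can be restated in software form. The sketch below is a behavioral model only, holding the 5-bit weights as ordinary unsigned values; the real unit performs all three steps in a few levels of combinational logic, and the function name is our own.

```c
/* Behavioral model of the A-3 unit: apply the u adjustment to each
 * of the four queue weights, find the minimum, and subtract the
 * minimum from every weight so that the smallest becomes zero.
 * Masking to 5 bits mirrors the 5-bit weight registers. */
void adjust_and_normalize(unsigned w[4], const unsigned u[4])
{
    unsigned min;
    for (int n = 0; n < 4; n++)
        w[n] = (w[n] + u[n]) & 0x1f;   /* function 1: apply adjustment */
    min = w[0];
    for (int n = 1; n < 4; n++)        /* function 2: find the minimum */
        if (w[n] < min)
            min = w[n];
    for (int n = 0; n < 4; n++)        /* function 3: normalize        */
        w[n] -= min;
}
```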
Functions 1 and 2 are combined in one set of equations. We will discuss these functions first.

Algorithm

We will break the operation into parts by computing the following intermediate values:

b_j^{in} = bit j of value(U^n) + value(a^{in}).
c_j = the carry from bit j used in computing b_j^{in}.
ASZ_k^{in} indicates that value(a_0^{in}, a_1^{in}) is at most k.
BSZ_k^{in} indicates that value(b_0^{in}, b_1^{in}, b_2^{in}) is at most k.
CSZ_k^{in} indicates that value(b_3^{in}, b_4^{in}) is at most k.
SB^{in} = b_3^{in} \vee b_4^{in}
M_j^{in} indicates that, counting only bits 0 through j, b^{in} is a minimum over n = (0,1,2,3).
MCZ_l^{in} indicates that M_2^{in} \wedge [value(b_3^{in}, b_4^{in}) is at most l].

Time versus variable computed:

Logic Level   Variables
1             c_4, b_4, c_2, ASZ_k, b_2
2             b_1, b_3, BSZ_k, c_1, SB
3             b_0, CSZ_k, M_2, MCZ_l

Equations, Level = 1:

c_4 = U_4 a_4
b_4 = U_4 \bar{a}_4 \vee \bar{U}_4 a_4
c_2 = U_2 a_2 \vee a_2 U_3 a_3 \vee a_2 a_3 a_4 U_4 \vee a_2 U_3 a_4 U_4

The above uses value(U_j) at most 4.

ASZ_0 = \bar{a}_0 \bar{a}_1
ASZ_1 = \bar{a}_0
ASZ_2 = \bar{a}_0 \vee \bar{a}_1
ASZ_3 = TRUE

b_2 = a_2 \bar{U}_2 \bar{U}_3 \bar{a}_3 \vee a_2 \bar{U}_2 \bar{U}_3 \bar{U}_4 \vee a_2 \bar{U}_2 \bar{U}_3 \bar{a}_4 \vee a_2 \bar{U}_2 \bar{a}_3 \bar{U}_4 \vee a_2 \bar{U}_2 \bar{a}_3 \bar{a}_4 \vee \bar{a}_2 U_2 \vee \bar{a}_2 U_3 a_3 \vee \bar{a}_2 U_3 U_4 a_4 \vee \bar{a}_2 a_3 U_4 a_4

Equations, Logic Level = 2:

b_1 = a_1 \bar{c}_2 \vee \bar{a}_1 c_2
b_3 = a_3 \bar{c}_4 \bar{U}_3 \vee \bar{a}_3 c_4 \bar{U}_3 \vee \bar{a}_3 \bar{c}_4 U_3 \vee a_3 c_4 U_3
c_1 = a_1 c_2

BSZ_0 = ASZ_0 \bar{c}_2 \bar{b}_2
BSZ_1 = ASZ_0 \bar{c}_2
BSZ_2 = ASZ_0 \vee ASZ_1 \bar{c}_2 \bar{b}_2
BSZ_3 = ASZ_0 \vee ASZ_1 \bar{c}_2
BSZ_4 = ASZ_1 \vee ASZ_2 \bar{c}_2 \bar{b}_2
BSZ_5 = ASZ_1 \vee ASZ_2 \bar{c}_2
BSZ_6 = ASZ_2 \vee \bar{b}_2
BSZ_7 = TRUE

SB = b_4 \vee a_3 \bar{c}_4 \bar{U}_3 \vee \bar{a}_3 c_4 \bar{U}_3 \vee \bar{a}_3 \bar{c}_4 U_3 \vee a_3 c_4 U_3

Equations, Level = 3:

b_0 = a_0 \vee c_1
CSZ_k = the same as ASZ_k, but with b_3, b_4 replacing a_0, a_1.

M_2^{in} = BSZ_0^{in} \vee \sum_{k=1}^{6} BSZ_k^{in} \prod_{m=0, m \ne n}^{3} \overline{BSZ}_{k-1}^{im}

MCZ_0^{in} = \overline{SB}^{in} [BSZ_0^{in} \vee \sum_{k=1}^{6} BSZ_k^{in} \prod_{m \ne n} \overline{BSZ}_{k-1}^{im}]
MCZ_1^{in} = \bar{b}_3^{in} [BSZ_0^{in} \vee \sum_{k=1}^{6} BSZ_k^{in} \prod_{m \ne n} \overline{BSZ}_{k-1}^{im}]
MCZ_2^{in} = (\bar{b}_3^{in} \vee \bar{b}_4^{in}) [BSZ_0^{in} \vee \sum_{k=1}^{6} BSZ_k^{in} \prod_{m \ne n} \overline{BSZ}_{k-1}^{im}]
MCZ_3^{in} = M_2^{in}

Equations, Level = 4:

M_4^{in} = M_2^{in} CSZ_0^{in} \vee M_2^{in} \sum_{l=1}^{3} CSZ_l^{in} \prod_{m=0, m \ne n}^{3} \overline{MCZ}_{l-1}^{im}

We still need to perform the subtraction of the minimum from all weights.

Algorithm

1. Select the minimum using the M_4^{in}.
2. Subtract the minimum.

Equations

d_j = bit j of the minimum:

d_j = \sum_{n=0}^{3} b_j^{in} M_4^{in}

We wish to do the subtraction in two logic levels. Let a_j be bit j of the result and c_j the borrow from bit j:

a_4 = b_4 \bar{d}_4 \vee \bar{b}_4 d_4
c_4 = \bar{b}_4 d_4
c_3 = \bar{b}_3 d_3 \vee \bar{b}_3 c_4 \vee d_3 c_4

aa_j = a_j assuming \bar{c}_3; ac_j = a_j assuming c_3.

aa_0 = b_0 \bar{d}_0 \bar{w} \vee b_0 d_0 w, where w = \bar{b}_1 d_1 \vee \bar{b}_1 \bar{b}_2 d_2 \vee d_1 \bar{b}_2 d_2 is the borrow out of bit 1 assuming \bar{c}_3.
ac_0 = b_0 \bar{d}_0 \bar{w}' \vee b_0 d_0 w', where w' is the same borrow computed assuming c_3.

The above two equations take advantage of the fact that we are subtracting the minimum and no negative result is possible.

aa_1 = b_1 \bar{d}_1 b_2 \vee b_1 \bar{d}_1 \bar{d}_2 \vee \bar{b}_1 d_1 b_2 \vee \bar{b}_1 d_1 \bar{d}_2 \vee \bar{b}_1 \bar{d}_1 \bar{b}_2 d_2 \vee b_1 d_1 \bar{b}_2 d_2
ac_1 = b_1 \bar{d}_1 b_2 \bar{d}_2 \vee \bar{b}_1 d_1 b_2 \bar{d}_2 \vee b_1 d_1 \bar{b}_2 \vee b_1 d_1 d_2 \vee \bar{b}_1 \bar{d}_1 \bar{b}_2 \vee \bar{b}_1 \bar{d}_1 d_2
aa_2 = b_2 \bar{d}_2 \vee \bar{b}_2 d_2
ac_2 = b_2 d_2 \vee \bar{b}_2 \bar{d}_2

a_3 = b_3 \bar{d}_3 \bar{c}_4 \vee \bar{b}_3 d_3 \bar{c}_4 \vee \bar{b}_3 \bar{d}_3 c_4 \vee b_3 d_3 c_4
a_2 = aa_2 \bar{c}_3 \vee ac_2 c_3
a_1 = aa_1 \bar{c}_3 \vee ac_1 c_3
a_0 = aa_0 \bar{c}_3 \vee ac_0 c_3

Logic Levels:
  Increment and Select Min   4
  Switch Min                 1
  Subtract Min               2
  Total                      7

Gates:
  Increment and Select Min   4976
  Switch Min                 160
  Subtract Min               1728

A-4 INCREMENT AND DECODER DETAILS

Inputs

a_j^{in} as finally computed in Section A-3.
u_j^n

Function

We need to perform the following three functions:
1. Compute value(b^{in}) = value(a^{in}) - value(u^n).
2. Normalize the result so that the smallest is 0; call the normalized bits an_j^{in}.
3. Compute Bim_n = [value(an^{in}) = m] for (i,m) = (0,0), (1,0), (2,0), (2,1), (3,0), (3,1), (3,2). B00_n and B10_n will also be written as B0_n and B1_n.

Algorithm

Taking advantage of value(u_j^n) being at most 4, we will compute the b^{in} in two levels of logic. To do the normalization fast, we will actually compute b^{in,l}, where value(b^{in,l}) = value(b^{in}) - l, l = (0,1,2,3,4). We will detect the smallest l for which an overflow occurs and switch that set of b_j out as the an_j. We will then generate the Bim_n in the obvious way in one clock.

Equations

The equations for computing the b^{in,l} are similar to those for subtracting the minimum in Section A-3, and we will not describe them in detail. The switch is also standard, so we will not describe it either.

Gates:
  Subtract u_j^n and offsets   8640
  Switch correct offset        800
  Compute Bim_n                168
  Total                        9608

A-5 FINAL SELECTION UNIT

Inputs

The Bim_n generated as described in Section A-4.

Functions

Select the minimum weights for up to four instructions, taking into account the fact that the queue selected by instruction i must have one added to it before determining the queue for instruction i+1.

Algorithm

Output: Ri_n, which indicates that instruction i is to use queue n.

R0_n is set true for the minimum n such that B0_n is true.

R1_n is set true for the minimum n such that \overline{R0}_n B1_n. If there is no such n, then R1_n is set true for the unique n such that B1_n.

R2_n is set depending on the following:

1. If \exists n[B20_n(\overline{R0}_n \vee \overline{R1}_n)], then the minimum n satisfying B20_n \overline{R0}_n \overline{R1}_n is chosen. If none exists, then the minimum n such that B20_n is chosen.
2. If \forall n[B20_n \rightarrow R0_n R1_n], then the minimum n such that B21_n \overline{R0}_n is chosen. If none exists, the unique n such that B20_n is chosen.

R3_n is set depending on the following conditions:

1. \exists n[B30_n(\overline{R0}_n \overline{R1}_n \vee \overline{R1}_n \overline{R2}_n \vee \overline{R0}_n \overline{R2}_n)]. The minimum n such that B30_n \overline{R0}_n \overline{R1}_n \overline{R2}_n is chosen. If none exists, then the minimum n such that B30_n is chosen.
2. \forall n[B30_n \rightarrow (R0_n R1_n \overline{R2}_n \vee R0_n \overline{R1}_n R2_n \vee \overline{R0}_n R1_n R2_n)]. Note that both this and the next condition imply that B30_n is unique; also, \overline{B30}_n \rightarrow (B31_n \vee B32_n). The minimum n such that B31_n \overline{R0}_n \overline{R1}_n \overline{R2}_n is chosen. If none exists, then the unique n such that B30_n is chosen.
3. \forall n[B30_n \rightarrow R0_n R1_n R2_n]. The minimum n such that B31_n is chosen. If none exists, then the minimum n such that B32_n is chosen. If none exists, the unique n such that B30_n is chosen.

Equations

For a description of the notation and conventions used in generating these equations, see Section 4.6.3.2.6.

R0_n

Detection: only one case.
Selection:

R0_n = B0_n \prod_{j=0}^{n-1} \overline{B0}_j        (Level = 1, Gates = 10)

R1_n

Case 1.1: \exists n[B1_n \prod_{j \ne n} \overline{B1}_j], i.e., only one B1_n is true.

Detection and selection:

B1_n \prod_{j=0, j \ne n}^{3} \overline{B1}_j        (Gates = 20)

Case 1.2: \exists n[B1_n \sum_{j \ne n} B1_j], i.e., more than one B1_n is true.

Detection: there is no need to detect this case. We always choose a true B1_n; since in case 1.1 B1_n is unique, we cannot make a selection in conflict with case 1.1.

Selection:

Case 1.2.1: \exists n[B1_n \overline{B0}_n \prod_{j=0}^{n-1} \overline{B1}_j].

Detection and selection: clearly, we can select the unique B1_n satisfying the above. Thus we have

B1_n \overline{B0}_n \prod_{j=0}^{n-1} \overline{B1}_j        (Gates = 18)

Case 1.2.2: \forall n[\overline{B1}_n \vee B0_n \vee \sum_{j=0}^{n-1} B1_j], i.e., the first true B1_n occurs with a true B0_n.

Detection and selection: in this case we must assure ourselves that there is a smaller n with B0_n true before we can make the selection. To insure that the selection will be unique, we must choose the first B1_n with a smaller B0:

B1_n [(\sum_{j=0}^{n-1} B0_j)(\prod_{j=0}^{n-1} \overline{B1}_j) \vee \sum_{j=0}^{n-1} (B1_j B0_j \prod_{k=0}^{j-1} \overline{B0}_k)]        (n = 1,2,3)
To insure that the selection will be unique, we must choose the first Bl with a n smaller BO n n-1 n-1 (n = 1,2,3) Bl [( Z BO.) (77 Bl .) v n j=0 J j=0 J n-1 j-1 Z (Bl. BO. 7T BO. )] j=0 J J k=0 J 250 Gates = 8 n=l 15 n=2 2-3 n=3 46 Total Summary for Rl R1 n = B1 n 3 TT Bl . V j=0 J #1 B1 n n-1 B0 M TT Bl . n j=o J B1 n n-1 ( Z BO.) ( j=0 J n-1 TT Bl. ) v j=0 J n-1 j-1 Bl z (Bl. BO. TT BO. ) n j=0 J J k=0 J (note for the last two terms, n = 1,2,3) (Gates = 84, Level = 1) 2 R2 "n 2.1 Vn[RO n = Rl n J Detection D2.1 = I RO Rl (Gates = 12, Level = 2) i=0 n n Selection: 2.2.1 Vi[B20 n ■* ROn] 251 Detection D2.1-1 = 7T (B20 n v R0 n ) (Gates = 20, Level = 2) This gate count takes advantage of R0 RO . = n=j Selection: 2.1.1.1 3n[B21 n ] Detection: 3 D2. 1.1.1 = z B21 (Gates = 4, Level = 1) n=0 n Selection: n-1 S2. 1.1.1 = B2.1 7T B2.1. (Gates = 40, Level = 1) n n j=0 J 2.1.1.2 V [B21 ] n L n J Detection: D2. 1.1.1 Selection B20. n 2.1-2 3n[B20 M R0~] n n Detection: D2.1.1 Selection: n-1 S.2.1.2 n = B20 n 7T (B20. v R0 n ) (Gates = 25, Level = 2) "n • n . =0 v j n' 252 This gate count takes advantage of RO RO. = n=j J 2.2 3n[R0 n Rl n ] Detection: D2.1 Selection: 2.2-1 Vn[B20 * (RO v Rl )1 L n v n n /J Detection: D2.2.2 Selection: First get B20 f RO 3 n n n-1 TS2.2.2 n = B20jB0 n v B0„ tt BO J (Gates = 26, Level = 1) •n n L n n j=Q w j n-1 S2.2.2 n = TS2.2.1 n Rl n tt (TS2.2.1 . v Rl .) (Gates = 30, Level = 2) Selection: n-1 S2.2-1 = B20„ tt B20. n n j=0 J 2.2.2 3n[B20 RO Rl J n n n J 253 Detection: D2.2-2 = Z B20 RO Rl n=0 n n n Summary 2 R2 n = D2.1 D2. 1.1 D2. 1.1.1 S2. 1.1.1 v D2.1 D2.1.1 D2. 1.1.1 B20 v n D2.1 D2.1.1 S2.1.2 v n D2.1 D2.2.2 S2.2.1 v n D27T D2.2.2 S2.2.2 n (Gates = 22, Level = 3) 3.1 Vn(RO n Rl n v Rl n R2 n v R0 n Rl n ) i.e., for each of RO, Rl , R2, they are true for different values of n. No two of them are true for the same n. Detection: D27T means RO f Rl T3,1 n = RO n v R1 n (Gates = 8 » Level = 2 ) D3.1 = D2.1 I (T3.1 R20.) (Gates = 16, Level = 2) n=0 n n Selection: 3.1.1 Vn(B30„ -► R0 n v Rl v R2 ) n n n n Detection: D3.1.1 = I B30 RO Rl R2 n (Gates = 20, Level = 4) n=0 n n n n 254 Selection Choose first B30 n which is true. n-1 j=o S3.1.1 n = B30 n _ t 7r rt B30j (Gates = 10, Level = 1) 3.1.2 3n ( B3 ° n ^^^n~) Detection: D3.1.1 Selection: We must choose first B30 n not equal RO or Rl or R2 . First n n n n we get the B30 n not equal RO and Rl . TS3.1.2 n = B30 n RTRr (Gates = 12, Level = 2) n-1 S3.1.2 n = TS3.1.2 n tt CTS2.1.2, v R20.) (Gates = 25, Level = 4) 3.2 3n (R0 n Rl n v Rl n R2 n v R0 n Rl n ) Two or more of RO, Rl , R2 agree. Detection: D3.1 Selection : 3.2.1 3n R0„ Rl R2 n n n 255 Detection: D3.2.1 = E R() Rl R2 n=0 n n n (Gates = 16, Level = 4) Selection: 3.2.1.1 3n(B30 n v R0 n ) Detection: D3.2.1.1 = Z B30 RO n=0 n n (Gates = 12, Level = 2) Selection: S3. 2. 1.1 = R0 n B30 n 7T (RO v B30 ) n n n m=Q m m ' 3.2.1.2 Vn(B30 + RO ) n n Detection (Gates = 96, Level = 2) D3.2.1.2 Selection 3.2.1.2-1 Vn B31 Detection: D3. 2. 1.2-1 = 77 B31 n=0 (Gates = 4, Level = 1) Selection: 3.2.1.2-1.1 3n B32 256 Detection: 3 D3. 2. 1.2-1.1 = z B32 n (Gates = 4, Level = 1) n=0 " Selection: ' n-1 S3. 2. 1.2-1.1 = B32„ A tt B32 n n n n m=0 (Gates = 40, Level = 1) 3.2.1.2.1.2 Vn B3T" n Detection: D3. 2. 1.2-1.1 Selection R30 i 3.2.1.2-2 3n B31 n Detection: D3. 2. 1.2-2 Selection: n-1 B31 tt B31 n m=0 n Summary for 3.2.1.2 S3. 2. 1.2 = D3. 2. 1.2-1 A D3. 2. 1.2-1.1 A S3. 2.1. 2-1.1 v " n D3. 2. 1.2-1 A D3. 2. 1.2-1.1 A R30. v _____ ""I D3. 2. 
Selection: choose the first B30_n which is true:

S3.1.1_n = B30_n \prod_{j=0}^{n-1} \overline{B30}_j        (Gates = 10, Level = 1)

Case 3.1.2: \exists n[B30_n (R0_n \vee R1_n \vee R2_n)].

Detection: D3.1.1.

Selection: we must choose the first B30_n equal to none of R0, R1, R2. First we get the B30_n equal to neither R0 nor R1:

TS3.1.2_n = B30_n \overline{R0}_n \overline{R1}_n        (Gates = 12, Level = 2)
S3.1.2_n = TS3.1.2_n \overline{R2}_n \prod_{j=0}^{n-1} (\overline{TS3.1.2}_j \vee R2_j)        (Gates = 25, Level = 4)

Case 3.2: \exists n(R0_n R1_n \vee R1_n R2_n \vee R0_n R2_n), i.e., two or more of R0, R1, R2 agree.

Detection: \overline{D3.1}.

Selection:

Case 3.2.1: \exists n[R0_n R1_n R2_n].

Detection:

D3.2.1 = \sum_{n=0}^{3} R0_n R1_n R2_n        (Gates = 16, Level = 4)

Selection:

Case 3.2.1.1: \exists n(B30_n \overline{R0}_n).

Detection:

D3.2.1.1 = \sum_{n=0}^{3} B30_n \overline{R0}_n        (Gates = 12, Level = 2)

Selection:

S3.2.1.1_n = \overline{R0}_n B30_n \prod_{m=0}^{n-1} (R0_m \vee \overline{B30}_m)        (Gates = 96, Level = 2)

Case 3.2.1.2: \forall n(B30_n \rightarrow R0_n).

Detection: \overline{D3.2.1.1}.

Selection:

Case 3.2.1.2-1: \forall n[\overline{B31}_n].

Detection:

D3.2.1.2-1 = \prod_{n=0}^{3} \overline{B31}_n        (Gates = 4, Level = 1)

Selection:

Case 3.2.1.2-1.1: \exists n[B32_n].

Detection:

D3.2.1.2-1.1 = \sum_{n=0}^{3} B32_n        (Gates = 4, Level = 1)

Selection:

S3.2.1.2-1.1_n = B32_n \prod_{m=0}^{n-1} \overline{B32}_m        (Gates = 40, Level = 1)

Case 3.2.1.2-1.2: \forall n[\overline{B32}_n].

Detection: \overline{D3.2.1.2-1.1}.
Selection: B30_n.

Case 3.2.1.2-2: \exists n[B31_n].

Detection: \overline{D3.2.1.2-1}.
Selection:

B31_n \prod_{m=0}^{n-1} \overline{B31}_m

Summary for 3.2.1.2:

S3.2.1.2_n = D3.2.1.2-1 D3.2.1.2-1.1 S3.2.1.2-1.1_n \vee D3.2.1.2-1 \overline{D3.2.1.2-1.1} B30_n \vee \overline{D3.2.1.2-1} B31_n \prod_{m=0}^{n-1} \overline{B31}_m        (Gates = 46, Level = 4)

Case 3.2.2: \forall n(\overline{R0_n R1_n R2_n}), i.e., R0, R1, R2 do not all agree.

Detection: \overline{D3.2.1}.

Selection:

Case 3.2.2.1: \exists n(B30_n \overline{R0}_n \overline{R1}_n \vee B30_n \overline{R1}_n \overline{R2}_n \vee B30_n \overline{R0}_n \overline{R2}_n), i.e., B30_n is true for an n where at most one of R0, R1, R2 is true.

Detection: first we get all B30_n equal to neither R0 nor R1:

TA3.2.2.1_n = B30_n \overline{R0}_n \overline{R1}_n        (Gates = 12, Level = 2)

In addition, we need:

TB3.2.2.1_n = B30_n \overline{R0}_n \vee B30_n \overline{R1}_n        (Gates = 24, Level = 2)

D3.2.2.1 = \sum_{n=0}^{3} (TA3.2.2.1_n \vee TB3.2.2.1_n \overline{R2}_n)        (Gates = 12, Level = 4)

Selection:

Case 3.2.2.1.1: \exists n(B30_n \overline{R0}_n \overline{R1}_n \overline{R2}_n).

Detection:

D3.2.2.1.1 = \sum_{n=0}^{3} B30_n \overline{R0}_n \overline{R1}_n \overline{R2}_n        (Gates = 20, Level = 4)

Selection: first we get all B30_n equal to neither R0 nor R1:

T3.2.2.1.1_n = B30_n \overline{R0}_n \overline{R1}_n        (Gates = 12, Level = 2)
S3.2.2.1.1_n = T3.2.2.1.1_n \overline{R2}_n \prod_{m=0}^{n-1} (\overline{T3.2.2.1.1}_m \vee R2_m)        (Gates = 23, Level = 4)

Case 3.2.2.1.2: \forall n(\overline{B30}_n \vee R0_n \vee R1_n \vee R2_n).

Detection: \overline{D3.2.2.1.1}.

Selection: from case 3.2.2.1 we know B30_n is true for an n where only one of R0, R1, R2 is true. From case 3.2.2 we know R0, R1, R2 agree for one value of n. Since each is true for exactly one n, there can only be a single n for which those conditions and the above condition hold. We will use previously generated terms:

S3.2.2.1.2_n = TA3.2.2.1_n \vee TB3.2.2.1_n \overline{R2}_n        (Gates = 16, Level = 4)

Case 3.2.2.2: \forall n(B30_n \rightarrow R0_n R1_n \vee R1_n R2_n \vee R0_n R2_n).

Detection: \overline{D3.2.2.1}.

Selection:

Case 3.2.2.2.1: \exists n(B31_n).

Detection:

D3.2.2.2.1 = \sum_{n=0}^{3} B31_n        (Gates = 4, Level = 1)

Selection:

Case 3.2.2.2.1.1: \exists n(B31_n \overline{R0}_n \overline{R1}_n \overline{R2}_n).

Detection:

D3.2.2.2.1.1 = \sum_{n=0}^{3} B31_n \overline{R0}_n \overline{R1}_n \overline{R2}_n        (Gates = 20, Level = 4)

Selection:

T3.2.2.2.1.1_n = \overline{R0}_n \overline{R1}_n B31_n
S3.2.2.2.1.1_n = \overline{R2}_n T3.2.2.2.1.1_n \prod_{m=0}^{n-1} (R2_m \vee \overline{T3.2.2.2.1.1}_m)        (Gates = 48, Level = 4)

Case 3.2.2.2.1.2: \forall n[B31_n \rightarrow (R0_n \vee R1_n \vee R2_n)].

Detection: \overline{D3.2.2.2.1.1}.

Selection: the B31_n satisfying the above must be unique. This is true because from case 3.2.2.2 we know R0, R1, R2 agree for some value of n; thus there is only one value of n for which exactly one of them is true. B30_n is true for the value where two of R0, R1, R2 agree, and B30_n \rightarrow \overline{B31}_n. Thus there is a unique n for which B31_n is true and exactly one of R0, R1, R2 is true.

TA3.2.2.2.1.2_n = B31_n \overline{R0}_n \overline{R1}_n        (Gates = 12, Level = 2)
TB3.2.2.2.1.2_n = B31_n \overline{R0}_n \vee B31_n \overline{R1}_n        (Gates = 24, Level = 2)
S3.2.2.2.1.2_n = TA3.2.2.2.1.2_n \vee TB3.2.2.2.1.2_n \overline{R2}_n        (Gates = 8, Level = 4)

Case 3.2.2.2.2: \forall n[\overline{B31}_n].

Detection: \overline{D3.2.2.2.1}.

Selection: there is a unique B30_n true, which agrees with two of R0, R1, R2. Select this n: B30_n.

Summary for R3:

R3_n = D3.1 \overline{D3.1.1} S3.1.1_n \vee D3.1 D3.1.1 S3.1.2_n \vee \overline{D3.1} D3.2.1 D3.2.1.1 S3.2.1.1_n \vee \overline{D3.1} D3.2.1 \overline{D3.2.1.1} S3.2.1.2_n \vee \overline{D3.1} \overline{D3.2.1} D3.2.2.1 D3.2.2.1.1 S3.2.2.1.1_n \vee \overline{D3.1} \overline{D3.2.1} D3.2.2.1 \overline{D3.2.2.1.1} S3.2.2.1.2_n \vee \overline{D3.1} \overline{D3.2.1} \overline{D3.2.2.1} D3.2.2.2.1 D3.2.2.2.1.1 S3.2.2.2.1.1_n \vee \overline{D3.1} \overline{D3.2.1} \overline{D3.2.2.1} D3.2.2.2.1 \overline{D3.2.2.2.1.1} S3.2.2.2.1.2_n \vee \overline{D3.1} \overline{D3.2.1} \overline{D3.2.2.1} \overline{D3.2.2.2.1} B30_n        (Gates = 48, Level = 5)

VITA

Paul Peter Budnik, Jr. was born in Chicago, Illinois in 1945. He received the Bachelor of Science in Physics degree from the University of Illinois in 1967 and the Master of Science in Computer Science degree from the same university in 1969. During the 1970 to 1971 academic year he was an Acting Assistant Professor at the University of California at Los Angeles.
During the 1971 to 1972 academic year he was employed by the University of Illinois on a project involving finding parallelism in FORTRAN programs. From 1973 to the present he has been employed by Systems Control, Incorporated. During this period he designed and implemented a correlation program on ILLIAC IV. Mr. Budnik is a member of the ACM, the IEEE, the Association for Symbolic Logic, the American Association for the Advancement of Science, and Sigma Xi.