AN ARRAY PROCESSOR WITH A LARGE NUMBER OF PROCESSING ELEMENTS

By Nelson Castro Machado

CAC Document No. 25
UIUCDCS-R-72-499

Center for Advanced Computation
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

January 1, 1972

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign. This work was supported in part by the Advanced Research Projects Agency of the Department of Defense and was monitored by the U.S. Army Research Office-Durham under Contract No. DAHC04-72-C-0001.

ABSTRACT

This paper describes a new type of array processor (SPEAC) which could be characterized as intermediate between ILLIAC IV and the Associative Processor. The number of processing elements (PE's) is typically 1K but could go as high as 8K. Each PE is a relatively simple unit with about 1K equivalent gates, designed to allow implementation either on a single very complex LSI chip or on several MSI chips. Each PE plus its memory (PEM) could then be assembled on one single printed circuit board or ceramic substrate.

Processing is performed in groups of four bits, which allows variable word length. Maximum freedom in data format and instruction format is made possible by the use of a microprogrammable control unit (CU).
Therefore, the machine is quite versatile and can be used efficiently either on large-precision floating-point problems (matrix operations, signal processing, etc.) or on small-precision fixed-point ones (character manipulation, picture processing, etc.).

PE design is carried out in great detail and a general sketch of the CU is presented. Operations are described and timed, with particular emphasis on floating-point addition (20 µsec per PE for 32 bits) and floating-point multiplication (25 µsec per PE for 32 bits). A few typical applications are presented along with their time estimates.

ACKNOWLEDGMENTS

I wish to express my deep gratitude to Professor Daniel L. Slotnick for the constant aid and encouragement in every phase of the research herein described. My colleague Robert L. Mercer is to be thanked for the many helpful discussions and suggestions.

I would also like to express my appreciation for the financial support given by the ILLIAC IV Project and the Center for Advanced Computation of the University of Illinois.

Finally, special thanks are due to Suzanne Sluizer for the efficient and accurate typing of the final document, to Fred Hancock and Jose Martinez for their careful execution of many complex drawings, and to my wife Arlene for unmatched patience and constant incentive.

TABLE OF CONTENTS

Chapter

1. INTRODUCTION
2. THE ARRAY COMPUTER AND ITS APPLICATIONS
   2.1 General Description of an Array Computer
   2.2 Typical Applications and Their Requirements
   2.3 Considerations on the Number and Complexity of the PE's
3. SPEAC's HARDWARE
   3.1 General Considerations
   3.2 The Multiplication Algorithm
   3.3 The System as a Whole
   3.4 The Processing Unit
       3.4.1 PE Memory
       3.4.2 PE Data Registers
       3.4.3 PE Description
             3.4.3.1 Registers and Buses
             3.4.3.2 The Arithmetic/Logic Unit
             3.4.3.3 Scratchpad Memory
             3.4.3.4 Address Registers
             3.4.3.5 Register A
       3.4.4 Local Control
             3.4.4.1 Direct Local Control
             3.4.4.2 Indirect Local Control
       3.4.5 Mode Control
       3.4.6 Interrupts
       3.4.7 Implementation Remarks
   3.5 The Control Unit
       3.5.1 General Structure
       3.5.2 Machine Synchronization - Events
       3.5.3 Queue System and FINST
             3.5.3.1 Queue Structure
             3.5.3.2 FINST Structure and Operation
       3.5.4 The Instruction Processor
       3.5.5 IDU and Instruction Format
   3.6 Memory
   3.7 I/O Buffer Register
4. SPEAC's OPERATION
   4.1 Generalities - Data Format
   4.2 Local Indexing
   4.3 Multiplication
       4.3.1 Floating-point Multiplication
   4.4 Addition and Subtraction
       4.4.1 Signed Addition and Subtraction
       4.4.2 Floating-point Addition and Subtraction
   4.5 Other Operations
       4.5.1 [illegible]
       4.5.2 Logic Operations
       4.5.3 Comparisons
       4.5.4 Shifts
   4.6 I/O
   4.7 Routing
   4.8 Summary of Timings
5. APPLICATIONS
   5.1 General Considerations
   5.2 Relaxation
   5.3 Matrix Multiplication
   5.4 Pattern Matching
   5.5 Sparse Matrices
6. CONCLUSIONS

APPENDIX
   A. PACKAGE LOGICAL DIAGRAMS
   B. MICROSEQUENCE FOR 32-BIT FLOATING-POINT MULTIPLICATION
   C. MICROSEQUENCE FOR 32-BIT FLOATING-POINT ADDITION

LIST OF REFERENCES
VITA

LIST OF TABLES

1. PE Registers
2. Functions Provided by the A/L Unit
3. Control Wires and Their Functions
4. Connections to Each PU
5. Some IC Chips that Might Be Used in the PE
6. Packages Used in the PE and Their Contents
7. Rough Estimates for the Number of Chips Per PU
8. Microinstruction Repertoire
9. Number of Elementary Shifts for Each Shifting Distance
10. Microsequence for Local Indexing
11. Summary of Timing Estimates

LIST OF FIGURES

1. A Classical Computer
2. An Array Computer
3. A Family of Array Computers with Constant Average Speed
4. Versatility as a Function of the Number of PE's
5. Cost-efficiency as a Function of the Number of PE's
6. Instruction Format
7. Fetches in Multiplication
8. Global Structure
9. Block Diagram of a Possible PEM Chip
10. Basic Data Register Structure
11. Simplified PE Diagram
12. Conventions Used in PE Logical Diagram
13. Complete PE Logical Diagram
14. A Generalized Local Control
15. Diagram of a Local Control FF
16. CU Structure
17. Queues and FINST Structure
18. FINST Action Flow-graph
19. Final Microsequence Assembly in FINST
20. Basic PEIP Structure
21. Detailed Instruction Format
22. I/O Buffer Register Structure
23. Standard Floating-point Format

1. INTRODUCTION

Faster computers may be obtained either by improving the raw speed of the circuits and components or by adopting a better organization, i.e., using the same circuits in a more efficient architecture. Indefinite improvements in circuit speed cannot be expected due to fundamental physical constants, the most obvious of these being the speed of light. Therefore, new approaches to computer organization must be found if projected demands of computer users are to be met, particularly in the area of large scientific problems.

In recent years, a fair amount of attention has been given to non-conventional organizations and the first two super-computers utilizing these new concepts will become operational within a few months: the pipeline processor CDC-STAR [1] and the array computer ILLIAC IV [2] [3].
Several other approaches have been proposed in the literature. Two deserve special mention: the parallel processor, extensively studied by IBM [4], and the associative processor, a type of array processor utilizing an associative memory and distributed logic [5]. Goodyear Aerospace Corporation has been working on an associative processor and successful tests have been performed on a reduced-scale prototype.

An endless number of questions, discussions and comparisons can be and have been raised when the capabilities and handicaps of the different organizations are considered [6]. As usual, one can find a specific application in which a given architecture excels and a pathological case in which the same approach fails miserably. It is not the purpose of this paper to engage in such comparisons. It will instead deal only with a particular organization: the array computer.

The array processor family of computers has been widely accepted by the computer community as a cost-effective approach in a particular but rather important set of applications. In the sequel, this type of architecture is examined and a new approach to the design of an array processor is proposed in order to take advantage of recent and contemplated developments in the fields of LSI circuits and solid state memories.

2. THE ARRAY COMPUTER AND ITS APPLICATIONS

2.1 General Description of an Array Computer

ILLIAC IV will be taken here as the "typical" array computer. This section is not intended to be a complete description of ILLIAC IV, and a certain familiarity with [2] and [3] is assumed. Only a few basic concepts are considered here in order to set the stage for the following discussion.

Figure 1 shows the functional diagram of a classical computer.
It consists of: 1) a memory to hold operands and instructions, 2) a control unit that fetches instructions from the memory, decodes them and issues control signals to 3) an arithmetic unit that performs the operations on operands taken from the memory. The most radical approach to parallelism would obviously be to duplicate the elements shown in Figure 1 a number (n) of times, providing adequate interconnections between the elements. This is the multiprocessor or parallel processor approach. Although powerful, this organization leads to several implementation problems and seems to be impractical for large n. (The Burroughs B6500 uses this organization with $n_{\max} = 4$.) One of these problems is the economic burden caused by the multiplicity of control units, since in a sophisticated classical machine the control unit accounts for rather more than fifty percent of the total gate count. This leads to the array computer approach, whose functional diagram is shown in Figure 2. Only arithmetic units and memories are duplicated and one single control unit (CU) drives the "array" of arithmetic units.

[Figure 1. A Classical Computer]

[Figure 2. An Array Computer]

Actually, not the whole control unit can be made central since certain control decisions are operand-dependent (normalization, for example). Therefore, a minimum amount of control is kept local and each arithmetic unit plus its local control will be called a processing element (PE). Each PE operates on its own memory (PEM). The term processing unit (PU) will be used to designate a PE with its PEM. Instructions can be stored either across the PEM's or in a special instruction memory.
Therefore, an array computer is characterized by the fact that a single instruction stream is executed simultaneously by at most n PU's. The concepts of local indexing, routing and mode control will now be introduced.

The biggest restriction imposed by this type of organization is that every PE must be performing precisely the same instruction on the same addresses of its own PEM. These constraints can be relaxed to a good extent with the introduction of extra hardware to allow:

a) Local indexing: each central base address, "broadcast" by the CU to each PE, is locally indexed.

b) Mode control: each instruction is locally modified by the PE's. The simplest form of mode control is to locally decide if central instruction "I" will be locally executed as "I" or as a no-op; i.e., each PE can be turned on or off. This is the only type of mode control available in ILLIAC IV (extreme mode control capability would obviously lead to a multiprocessor approach).

c) Routing: for most applications, at a certain point in the computation $PE_i$ may need an operand which is stored in $PEM_j$, $i \neq j$. Therefore, some way of "routing" operands from one PE to another is highly desirable. The most complete freedom of routing would be obtained if a cross-bar switch were provided linking each PEM to each PE. Naturally, this solution is prohibitively expensive for large values of n. The simplest type of routing is to link $PE_i$ to $PE_{i-1}$ and $PE_{i+1}$. This is called "neighbor routing." Non-neighbor routing is then obtained with a sequence of neighbor routings.

2.2 Typical Applications and Their Requirements

The obvious application for an array computer is on problems in which the same operations must be repeated over a set of operands. Matrix operations fit nicely in this category and therefore this type of machine will work well on solving systems of linear equations, Fourier transforms, systems of partial differential equations, etc.
Several areas of major scientific interest are included in such formalizations and the best known proposed applications for an array computer are: weather analysis and prediction, linear programming, seismic data processing, hydrodynamic flow analysis, phased array radar processing, picture processing, etc.

Since a new type of array processor was contemplated, the first step was to elaborate a list of questions about the features of an array processor and submit it to several users in different areas of application. In this way an opinion could be formed as to which features are needed for each application and which compromises would be acceptable. Users in four areas of application were interviewed: 1) weather problem (WP), 2) seismic signal processing (SP), 3) linear programming (LP), and 4) hydrodynamic flow problem (HP). The basic questions asked were:

a) How many floating-point operations does your application need? Could you do with fixed-point only?

b) What precision is needed for your application? How many bits is the typical precision in the input data?

c) How important is local indexing in your application? To what extent is local indexing used only as a solution to poor routing facilities?

d) How much routing is done? Would neighbor routing alone be sufficient? What are typical numbers for non-neighbor routing?

e) Mention any other problems encountered and facilities desired in your area of application.

It should be pointed out that all persons interviewed are ILLIAC IV users. ILLIAC IV contains 64 extremely powerful PE's with a complete repertoire of floating and fixed point instructions. Words are 64 bits long and can be used in submultiple precision variants of two 32-bit words or eight 8-bit words. There are facilities for local indexing and routing (accomplished through an optimal combination of distance 1 (neighbor) routings and distance 8 routings). Mode control is on-off only.
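As an illustration of what "an optimal combination of distance 1 and distance 8 routings" means, the following Python sketch counts the fewest elementary routings needed to cover a given routing distance. It is not taken from the ILLIAC IV documentation: it assumes that routings are available in both directions (±1 and ±8) and that the 64 PE's are connected end-around, and the function name `min_route_steps` and its parameters are hypothetical.

```python
from collections import deque


def min_route_steps(distance, n_pe=64, moves=(1, -1, 8, -8)):
    """Fewest elementary routings needed to move an operand `distance` PE's away.

    Assumptions (not from the source): routings of +/-1 and +/-8 exist and
    PE indices wrap around modulo n_pe.
    """
    target = distance % n_pe
    steps = {0: 0}            # offset from the source PE -> routings used
    queue = deque([0])
    while queue:              # breadth-first search over reachable offsets
        pos = queue.popleft()
        if pos == target:
            return steps[pos]
        for move in moves:
            nxt = (pos + move) % n_pe
            if nxt not in steps:
                steps[nxt] = steps[pos] + 1
                queue.append(nxt)
    raise ValueError("distance cannot be reached with the given moves")


if __name__ == "__main__":
    for d in (1, 8, 9, 28):
        print(d, min_route_steps(d))
```

Under these assumptions a distance of 9 costs two routings (one of distance 8 and one of distance 1), and no distance among 64 PE's costs more than seven.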
The following facts were established by the survey above:

a) Floating point: Floating point seems to be a luxury turned necessity. All users admitted that they could probably do without floating point by careful scaling of the quantities. They also admitted that they would hate to be forced to do that. The consensus is that presently a viable machine should have, if not hardware floating-point instructions, at least a good, fast set of floating-point subroutines.

b) Precision: Naturally, the precision requirements are heavily dependent on the particular application and method of solution. WP uses 32-bit words although the initial data has a typical precision of 8 bits only. It is felt that performing computation on 32-bit words is good insurance against precision erosion due to severe numerical error propagation with the methods presently used. SP receives data from sensors in 13 to 14 bits precision and operates in 32-bit mode. Incidentally, simple format conversion of the input data accounts for a considerable amount of processing time in this application. SP could conceivably be performed with less precision than 32 bits: 18 or 24 bits should be adequate. LP is the application with the heaviest requirements on precision: I/O is performed in 32-bit mode but internal calculations use 64 bits to avoid severe error buildup in LP problems with about 400 equations. In fact, even 64-bit precision is inadequate for larger problems and the use of multiple precision routines is envisioned. HP has been using 32 bits, which is adequate for low precision inputs. However, 48 to 64 bits would be ideal for future applications. Finally, a few special but important applications need much less precision. Picture processing can be done with 4 to 8 bits of precision, and a recently developed area, linear programming with Boolean variables, uses 1-bit precision for the variables and "small" integers for the coefficients.
The conclusion is obvious: a versatile machine should have as many precision modes as possible. This was the case with serial-by-bit machines, which featured variable word length. Speed requirements forced the introduction of parallel processing of a word, and the convenience and efficiency of variable word length was lost except for some low-precision instruction variants such as the ones featured in ILLIAC IV.

c) Local indexing: This seems to be a very important feature, heavily used by almost all applications. Its main use is definitely to avoid slow routings in a "skewed" type of matrix storage [3]. However, a few other types of use for local indexing did appear.

d) Routing: Routing is the most difficult problem in an array computer. Complete and unlimited routing facilities are economically impossible for large values of n. The ILLIAC IV approach did satisfy its users, however. Definitely the most frequent type of routing is neighbor routing. Odd routing distances do appear, however, in a few important cases: table look-ups and log-sums (i.e., the problem of obtaining $\sum_{i=1}^{n} a_i$ where each $a_i$ is stored in a different PE) are two examples.

2.3 Considerations on the Number and Complexity of the PE's

The array-processor family of computers has at present two well established members: ILLIAC IV and the Associative Processor (AP). Both these machines were extensively studied and are actually being built. In a sense, however, they represent two extremes in this design philosophy: ILLIAC IV has a relatively small number (64) of PE's, each an extremely powerful floating-point word-parallel unit with 13K gates. The AP, described in [5], has a very large number (on the order of $2^{12}$ to $2^{15}$) of PE's, each an extremely simple fixed-point serial-by-bit unit containing only 32 gates. Each ILLIAC IV PE has a floating-point add time of 175 nsec and a floating-point multiply time of 225 nsec for 32-bit operands.
The AP has a fixed-point add time of 35 µsec and a fixed-point multiply time of approximately 1 msec for 32-bit operands. Therefore, a 12K-PE AP could add fixed point about as fast as ILLIAC IV. Multiplication would still be much slower (about 20 times slower even for a 12K-PE AP). Routing capability in the AP is extremely limited: only neighbor routing is permitted, on a bit-by-bit basis. PEM is 2K 64-bit words long in ILLIAC IV and only 256 bits long in the AP. However, the AP's PEM is an associative memory allowing simultaneous interrogation of n bits (n is the number of PE's). Obviously, ILLIAC IV's conventional PEM's could also be considered as an associative memory allowing simultaneous interrogation of 64 words.

It seems obvious that the AP is a much less versatile machine than ILLIAC IV, i.e., its field of application is quite limited. However, it may come as a surprise that in the problems to which it is well suited (especially radar tracking applications), the AP is quite cost-effective. In fact, its proponents argue that it can perform those special jobs at the same rate as ILLIAC IV but at 1/30th of the cost.

A few generalizations are in order. One could consider a sequence of array computers $M_1, M_2, \ldots$, each with a simpler (slower) PE than its predecessor but with a larger number of PE's in order to keep the average speed constant. Figure 3 illustrates the number of PE's versus the speed of each PE for these machines. Figures 4 and 5 represent some rough qualitative estimates of the versatility of these machines (i.e., how large is the set of applications for which they are well suited, meaning that they can compute approximately n times faster than a sequential machine with the same speed as each PE) and of the cost-efficiency of such machines for such suitable problems. The estimate in Figure 4 is practically obvious: the sequential machine (n = 1) is the most versatile. As n grows, the number of problems that the machine can handle
As n grows, the number of problems that the machine can handle 11 J ' speed of each PE I \ V \ \ V \ N 1 N ^M^ to- n number of PE' s Figure 3« A Family of Array Computers with Constant Average Speed versatility \ \ \ \ \ m M 2 M,, n number of PE' s Figure k. Versatility as a Function of the Number of PE's i cost-efficiency for suitable problems / / / >» M 1 J»M 2 »> n number of PE's Figure 5. Cost efficiency as a Function of the Number of PE's 12 efficiently obviously decreases. Figure 5 is harder to justify. In fact, it is a guess "based in two extremes: ILLIAC IV and the AP. A third machine, however, to be introduced later, does seem to verify this hypothesis: as n grows and each PE is simplified, modern integrated circuit techniques (LSI) allow a very rapid decrease in the cost per PE. These considerations justify the idea of exploring the possibilities of a third type of array computer: the SPEAC (for small PE Array Computer). This machine would be between the AP and ILLIAC IV in number of PE's and PE power and hopefully would achieve a happy compromise between ILLIAC IV s rela- tive versatility and the AP's cost-efficiency. The initial goals were: n SPEAC ~ 10 n iLL IV t0 10 ° n iLL IV PE speed spEAC ~ ^ PE speedy Jy to -^ PE speedy Jy gates per PE gpMC ~ ±- gates per PE^ Jy to JL gates per PE^ J The remainder of this paper is dedicated to exploring the feasi- bility and characteristics of this new machine. 13 3. SPEAC's HARDWARE Initially, a few general considerations are made in order to estab- lish the design goals that dictated the structure chosen for the hardware. The multiplication algorithm is also presented as a preface to the actual hardware description since the PE has been specifically designed to implement this algorithm efficiently. 3*1 General Considerations a) The PE will be simple enough and built in a quantity high enough to warrant the expense of building special-purpose MSI to LSI integrated circuits . 
At first, it was hoped that a whole PE could be contained in a single LSI chip. This still seems to be possible, at least with the kind of technology foreseeable within a decade: a bipolar integrated chip with density on the order of 1 to 2K equivalent gates would be needed. However, even if one does not count on such extremes of built-to-order LSI, the proposed design could be implemented using a few dozen standard or nearly standard MSI chips, allowing an entire PU to be packed in one printed circuit card.

b) The results of the survey mentioned in Section 2.2 indicate the need of some floating-point capability. Naturally, entirely hardware-implemented floating-point is out of the question in a simple PE. However, the hardware should allow efficient implementation of floating-point routines. Serial processing, by bit or by groups of bits, is the only way to keep the gate count low. This leads naturally to variable word length as a means of satisfying the conflicting precision requirements outlined in Section 2.2.

c) Most contemplated applications have a high frequency of multiplications, typical of scientific problems. Therefore, multiplication should be as fast as possible, ideally almost as fast as addition, as is the case in the ILLIAC IV PE.

d) Due to the existence of a CU, the PE must be strictly synchronous and local control must be minimized. Any asynchronism or data-dependent optimization is wasted since the CU must always wait for the worst case, which almost certainly occurs for large n. This rules out certain classical methods such as increasing the speed of multiplication by adding only when the multiplier bit is one and simply shifting when it is zero.
Instead, the CU must always output micro-orders for the worst case and either:

- the method is such that the extra operations are no-ops for non-worst-case conditions (example: add on a zero multiplier bit); or
- some local control (typically a flip-flop) will inhibit certain steps in non-worst-case conditions (example: normalization, recomplementation).

e) An accumulator is impractical in a variable word length machine since it would have to be as long as the worst-case length. Therefore, variable word length machines are typically 2- or 3-address machines. Three addresses are quite desirable since they avoid the frequent duplication of operands (to avoid their destruction) found in 2-address machines. The classical shortcoming of 3-address machines, unnecessarily large instructions when the third address is equal to a previous one, can easily be avoided by adopting a variable length instruction format. Therefore, each instruction (op-code) will have a large number of variants with different lengths, from a minimum of zero addresses (in this case the old contents of the address registers would be used as addresses) to a maximum of six addresses: three basic addresses plus three addresses for local indexing. Word length of each operand and of the result might also be specified in the address part. The resulting instruction format is illustrated in Figure 6.

[Figure 6. Instruction Format: basic op-code, variant, then as many addresses as specified by the variant code]

f) Timing considerations: In order to satisfy the initial estimates set forth in Section 2.3, an addition time of 3 to 30 µsec and a multiplication time of 4 to 40 µsec are needed.
Considering the basic PEM cycle time to be of the order of one-half µsec (this assumption will be explained in Section 3.2), and noticing that 1 to 3 PEM cycle times (depending on the amount of interleave) are required per serial operation of the PE, one concludes that a 30 to 60 µsec addition time is obtained in a bit-by-bit PE for 32-bit fixed-point addition. Straight multiplication will take 32 times as much, or about 1 msec. This is far too slow, and a serial-by-hexadecimal-digit PE (i.e., one that serially processes chunks of 4 bits) is now considered. Addition time (32 bits, fixed point) now goes down to 8 to 15 µsec, which is convenient. Straight multiplication, taking 32 times longer, is still quite slow. The next step would be a serial-by-byte PE but this presents two problems: firstly, normalization is either rather complicated and slow or it is done in 8-bit increments, causing an unacceptable erosion in precision; secondly, the number of gates in the PE will be considerably larger. Therefore, a serial-by-hexadecimal-digit PE seems to be the best compromise: normalization in 4-bit increments (i.e., exponent base = 16) is quite acceptable and widely used in present computers. A somewhat elaborate multiplication algorithm (described in the next section) will be adopted to bring the multiplication time down to acceptable values.

g) Since the basic unit of data in the PE is one hexadecimal digit instead of a whole word, the machine is capable of accepting several different word formats provided the CU is able to generate an appropriate microsequence for that format. This immediately suggests the idea of micro-programming. Therefore, no particular word format will be picked and the PE control wire set will be chosen as carefully as possible in order to maximize the number of formats and operations that can be dealt with by writing adequate micro-programs at the CU level.
The variable format feature can be quite useful in certain applications (like seismic signal processing) in which format conversion accounts for a significant percentage of the processing time.

Summing up, the following design goals are thus established for SPEAC:

- PE built with MSI and LSI integrated circuits.
- One printed circuit card per PU.
- Variable word length.
- Multiplication not much slower than addition.
- Up to 3 addresses (possibly indexed) per instruction.
- Variable instruction length.
- PE serial by hexadecimal digits.
- Variable word format.
- Microprogramming capability.

3.2 The Multiplication Algorithm

As pointed out in Section 3.1, "straight" multiplication techniques (i.e., bit-by-bit) yield an unacceptably high multiplication time as compared to the addition time. On the other hand, very-high-speed multiplication of the type used in ILLIAC IV requires a massive increase in the number of gates. The best compromise for SPEAC seems to be some form of hexadecimal multiplication algorithm allowing multiplication times roughly proportional to $N^2$, where $N$ is the number of hexadecimal digits in the operands rather than the number of bits. It is also required that the algorithm be able to generate the product without the need to store double precision partial products, since the PE has no register capable of holding long numbers and storing partial products in the memory would be slow and require the use of a portion of PEM as "scratchpad area."

The following multiplication algorithm satisfies the requirements above and is proposed for SPEAC. Consider the multiplication of two numbers A and B, each containing n+1 hexadecimal digits:

$$A = a_0 + a_1 2^{4} + a_2 2^{8} + \cdots + a_i 2^{4i} + \cdots + a_n 2^{4n} \qquad (1)$$

$$B = b_0 + b_1 2^{4} + b_2 2^{8} + \cdots + b_i 2^{4i} + \cdots + b_n 2^{4n} \qquad (2)$$

The double precision product M will be written as:

$$M = A \times B = m_0 + m_1 2^{4} + m_2 2^{8} + \cdots + m_i 2^{4i} + \cdots + m_{2n+1} 2^{4(2n+1)} \qquad (3)$$

By multiplying (1) and (2) as polynomials:

$$M = A \times B = a_0 b_0 + (a_0 b_1 + a_1 b_0) 2^{4} + (a_0 b_2 + a_1 b_1 + a_2 b_0) 2^{8} + \cdots + \Big(\sum_{j=0}^{i} a_j b_{i-j}\Big) 2^{4i} + \cdots + \Big(\sum_{j=0}^{n} a_j b_{n-j}\Big) 2^{4n} + \cdots + \Big(\sum_{j=i}^{2n} a_{j-n} b_{n-j+i}\Big) 2^{4i} + \cdots + a_n b_n 2^{4(2n)}$$

or:

$$M = \sum_{i=0}^{n} \Big(\sum_{j=0}^{i} a_j b_{i-j}\Big) 2^{4i} + \sum_{i=n+1}^{2n} \Big(\sum_{j=i}^{2n} a_{j-n} b_{n-j+i}\Big) 2^{4i} \qquad (4)$$

From (3) = (4):

$$
\begin{aligned}
m_0 &= (a_0 b_0) \bmod 16; & c_0 &= (a_0 b_0) \operatorname{div} 16 \quad (\text{i.e., } a_0 b_0 = c_0 2^{4} + m_0) \\
m_1 &= (c_0 + a_0 b_1 + a_1 b_0) \bmod 16; & c_1 &= (c_0 + a_0 b_1 + a_1 b_0) \operatorname{div} 16 \\
i \le n:\quad m_i &= \Big(c_{i-1} + \sum_{j=0}^{i} a_j b_{i-j}\Big) \bmod 16; & c_i &= \Big(c_{i-1} + \sum_{j=0}^{i} a_j b_{i-j}\Big) \operatorname{div} 16 \\
i > n:\quad m_i &= \Big(c_{i-1} + \sum_{j=i}^{2n} a_{j-n} b_{n-j+i}\Big) \bmod 16; & c_i &= \Big(c_{i-1} + \sum_{j=i}^{2n} a_{j-n} b_{n-j+i}\Big) \operatorname{div} 16 \\
m_{2n} &= (c_{2n-1} + a_n b_n) \bmod 16; & c_{2n} &= (c_{2n-1} + a_n b_n) \operatorname{div} 16 \\
m_{2n+1} &= c_{2n}
\end{aligned}
\qquad (5)
$$

Therefore, the product may be computed as follows:

- Multiply $a_0$ and $b_0$, the two low order digits of A and B; the result has two hexadecimal digits: $c_0 2^{4} + m_0$; $m_0$ is the low order digit of the product and can be stored (in double precision multiplication) or discarded; $c_0$ is kept in an accumulator.
- Multiply $a_0 \times b_1$ and add to the accumulator; multiply $a_1 \times b_0$ and add to the accumulator; the accumulator then contains $c_1 2^{4} + m_1$; store or discard $m_1$ and keep $c_1$ in the accumulator; and so on, using the equations (5) to determine each $c_i$ and $m_i$.

It is easy to see that $(n+1)^2$ pairs of hexadecimal digits must be multiplied to compute the product of two numbers each with $(n+1)$ hexadecimal digits.

It should also be noticed that if a single precision product is desired, the product can replace one of the operands: $m_0, m_1, \ldots, m_{n-1}$ are computed only to accumulate the carry and are then discarded; $m_n$ is the first digit that may be in the final product and can be stored "on top" of either $a_0$ or $b_0$, since these two digits are not needed anymore to form the product. Finally, $m_{2n}$ replaces $a_n$ (or $b_n$). If $m_{2n+1} = c_{2n} = 0$, then the product is stored correctly. However, if $m_{2n+1} = c_{2n} \neq 0$, the product must be normalized, i.e., each digit is shifted one place to the right, $m_n$ is discarded and $c_{2n} = m_{2n+1}$ is then stored at the address of $a_n$ (or $b_n$).

The number of memory accesses required is:

$$\text{Memory accesses} = \underbrace{2N^2}_{\text{operand fetches}} + \underbrace{N}_{\text{stores}} + \underbrace{(N-1)}_{\text{fetches}} + \underbrace{N}_{\text{stores}}$$

where the first two terms belong to the multiplication proper, the last two to the normalization of the mantissas, and $N = n+1$ is the number of hexadecimal digits in each operand. Notice, however, that in the computation of each $m_i$, one operand fetch may be saved since the operand is already available from the last operation in the previous computation. This saves $N-1$ operand fetches. Therefore:

$$\text{Total number of memory accesses} = 2N(N+1), \text{ including normalization.}$$

Finally, it should be pointed out that the operations may be arranged in such a way that not only are $(N-1)$ fetches saved as described, but also each address is modified only in unitary decrements or increments. Since the address registers will have the capability of unitary increment or decrement, only the addresses of $a_0$ and $b_0$ are needed initially. These addresses are then possibly indexed and the rest of the multiplication does not require further address broadcasts. Figure 7 illustrates the order of operations for the multiplication of two 4-digit numbers.

3.3 The System as a Whole

A summary description of the complete system is initially presented in order to establish the function of each component and their interconnections. Figure 8 is a diagram of the global structure. The components are:

a) The PU array, containing "a large number" of PU's arranged in rows. Each row has 128 PU's and the number of rows is not fixed: with the exception of "row gating," nothing in the machine is a logical function of the number of PU rows. Therefore, any number of PU rows can be used in SPEAC provided that the row gating contains that same number of inputs.
There are, however, some practical limits: too few rows (say 1 or 2) will lead to an uneconomical machine since each PE is relatively slow and good average speed can only be obtained by using a large number of PE's. Therefore, the speed obtainable with 1 or 2 rows would not justify the investment represented by the components needed to drive the array: CU, mass memory, etc. On the other hand, too many rows will result in poor I/O speed and routing speed (since these operations are performed on a per-row basis), causing a degradation in system performance. Based on these considerations, an interval of 4-64 PU rows has been established as the most useful range. In particular, 8 rows were chosen for the "typical" SPEAC. Therefore, for the remainder of this paper, a 1024 PE machine will be described.

[Figure 7. Fetches in Multiplication. Legend: no fetch or address modification; add 1 to the address and fetch; subtract 1 from the address and fetch.]

[Figure 8. Global Structure. Labels: common address bus; corner memory; control; data (4 bits); wide data path.]

b) The row gating switch, which is a 512-bit, bidirectional, 1-out-of-8 selector driven by a row address supplied by the CU. This switch selects one of the PE rows for I/O transactions with the mass memory.

c) The I/O buffer register, which is a long, shiftable register to buffer the I/O flow between mass memory and PE array. It should be pointed out that this register has twice the length of the mass-memory word and can be shifted by any multiple of 4 bits in a maximum of 7 clock pulses. These two features enable the I/O buffer register to provide routing facilities for SPEAC.
The method will be detailed in Sections 3.7 and 4.7.

d) A mass memory system with at least 10^[illegible] bits of relatively fast (1 to 3 μsec cycle time) random-access memory. Bulk core is the present choice for the mass memory, probably backed up by a hierarchy of large capacity disk and tape. The random-access mass memory serves as a common pool of data for the different parts of the system and is directly accessible to the CU, PU array, corner-memory and other peripherals.

e) A corner-memory, which is a special-purpose peripheral device operating on the mass memory in the same fashion as an independent I/O channel. This device is capable of reading from mass memory 128 words with 128 hexadecimal digits each; the i-th word read can be written as: a_{i,1} a_{i,2} ... a_{i,128}, where each a_{i,j} is a hexadecimal digit. After being loaded with rows in this way, the corner-memory can write back to mass memory in a column-wise fashion; i.e., the i-th word written will be: a_{1,i} a_{2,i} ... a_{128,i}. Therefore, the device can read a matrix of 128 X 128 hexadecimal digits row-by-row and rewrite the same matrix column-by-column. This function is desirable in SPEAC to convert data written in mass memory by the array into a form that will allow the same data to be easily handled by the CU. The corner-memory is not an essential part of the system but has been included for the sake of completeness. It should also be mentioned that several other peripheral devices (tape decks, printers, etc.) can be attached to the system in the same way as the corner-memory.

f) A control unit (CU), which sends control pulses to all other units in the system besides having full processing capability of its own. Actually, the CU can be considered a standard serial high-speed general purpose computer in which several modifications were introduced. It must accept three different types of instructions: CU instructions, PE instructions and I/O instructions.
CU instructions are completely processed in the CU, although operands can be received from the array and results "broadcast" to the array via the common data bus (CDB), which will be described shortly. PE instructions are decoded in the CU and each corresponds to a microprogram which is executed and generates a set of control pulses or microsequences. These are sent to every PE in the array via the control lines. Finally, I/O instructions are decoded in the CU and sent to one or more independent I/O channel(s) which drive the row gating, mass memory, I/O buffer register and corner-memory. The CU must also be compatible with the mass memory used in the system since this memory will be shared by the CU and PE and serves as a common pool of data. The CU can interchange data with the PE's via the common data bus, one hexadecimal digit at a time. However, the only high capacity data link between CU and array is via the mass memory. Notice also that SPEAC's programs are not stored in the PEM's but in the CU's own internal fast memory and, for large overlayable programs, also partly in the mass memory.

The control unit is linked to the PE's by three buses and one interrupt wire. The first bus is a 12-bit common address bus (CAB) in the direction of CU to PE only. The CU can send addresses to the array via CAB. These addresses can then be stored by each PE in internal address registers and used to access PEM. The second bus is a 4-bit bidirectional common data bus (CDB) whose use has already been described. The last bus is a set of approximately 80 control lines which control every PE function. The interrupt wire is a single line connecting every PE to the CU. It is used to send to the CU an interrupt request which originated in a PE and must be serviced by the CU.

Each PE is linked to the row gating by a bidirectional 4-bit I/O bus (IOB) which is not common.
All the I/O buses (one from each PE) are connected to the row gating, which selects one group of 128 IOB's (corresponding to one PU row) for connection to the I/O buffer register (IOBR).

It is now possible to describe how a program is processed in SPEAC. Program and data are assumed to be initially on tape. The tape is loaded into SPEAC's mass memory and from there the program is loaded into the CU memory and a portion of the data is transferred to PEM. Processing is then performed simultaneously with further transfers between PEM and mass memory, with the latter serving as overlay memory for the relatively small PEM. The results of the computation are transferred from PEM to mass memory and can then be printed or stored on tape via a peripheral device.

Each component of the system will now be analyzed with special emphasis on the PU.

3.4 The Processing Unit

3.4.1 PE Memory

Semiconductor memories were chosen for the PEM's for two basic reasons:

a) Small size, compatible with the LSI chips that make up the PE. This way each PU could be entirely mounted on a single printed circuit card or on a ceramic substrate.

b) Low price per bit even in small size. This characteristic was needed since each PEM in SPEAC is necessarily small for economic reasons: 8K bits is the proposed basic size, with provision for expansion up to a maximum of 32K bits per PEM.

The next step was to choose between bipolar and MOS memories. At the beginning of the investigation, a survey of semiconductor memories [7] indicated that MOS LSI held the greatest potential for this application: large densities (1000 bits per chip is already commercially available), minute power dissipations (50 μW per bit is obtainable), acceptable speeds (less than 1 μsec cycle time is typical) and low price ($.02 per bit is commercially available).
Therefore, the following PEM chip was postulated for use in SPEAC: MOS LSI, 1024 bits, 50 μW per bit power dissipation, 500 nsec cycle time, price less than $20 in quantities.

Since progress in the area of semiconductor memories has been so fast, a reevaluation of the design choice for SPEAC's PEM was undertaken at the end of the investigation. It was then discovered that the case for MOS was not as clear cut as before, due to the following factors:

a) Although MOS currently appears to have a distinct density and price advantage, it should be noted that recently announced bipolar processing technology will allow 1024-bit and larger bipolar memories with not much increase in power requirements. These devices will be available for delivery about mid-1972 at about MOS prices. With power reduction techniques they take about the same or less power than MOS and are considerably faster, with an 80 to 100 nsec cycle time.

b) It should be noted that the choice of MOS requires an additional power supply level. If bipolar is chosen, the same supply used for the PE logic can be used by PEM. This is more economical since it is less expensive to buy "x" additional amps on an existing supply than to buy the first "x" amps on a new voltage level.

c) If MOS is used, an interface is normally needed to adjust the MOS voltage level to bipolar, thus increasing the number of gates per PE. Moreover, the larger densities in MOS are obtainable in dynamic memories; i.e., memories in which the information is stored as charge in MOS P-N junction capacitance. These memories are thus volatile and must be refreshed as often as every 2 μsec at higher temperatures. This is unacceptable in SPEAC since it would introduce frequent delays in processing to refresh PEM. Therefore, static MOS memories must be used, and density with these memories is not better than with bipolar. Static MOS is also slower unless decoding is separately performed with bipolar logic.
In conclusion, the factors considered above indicate that PEM would probably be built with bipolar devices, or at least static MOS with bipolar decoding, if prices drop as much as predicted. In fact, a hybrid chip already exists which, if obtainable at a price small enough, would be an excellent choice for PEM. It consists of 8 MOS static memory chips with 256 bits each, mounted on a ceramic pack with bipolar decoding. The organization is 1024 2-bit words, so that only four of these elements are needed for the PEM. The devices are made by T.I. (SMA 2002) and have a typical cycle time of only 150 nsec. A block diagram is presented in Figure 9. Therefore, although the basic cycle time of 500 nsec (300 nsec access time) is retained for the remainder of the paper, it now appears that it is a little pessimistic. Significant gains in performance could be obtained in some operations with the faster memories which would probably be available if SPEAC were to be built in the near future.

[Figure 9. Block Diagram of a Possible PEM Chip. Labels: array select; address; data in; R/W; read strobe; MOS storage; read/write control; sense amplifier; data out.]

Since the basic unit of data in the PE is one hexadecimal digit, PEM is organized in 4-bit words. Each hexadecimal digit is addressable in the memory. It is also extremely important to adopt an access technique for PEM which will avoid I/O bounding of programs as much as possible: PEM contains only 2K hexadecimal digits or 256 32-bit words. Therefore, for many problems the data will not fit entirely in PEM and mass memory is used as back-up. It would be desirable then to be able to exchange data between PEM and mass memory and, simultaneously, allow the PE to access PEM to perform normal processing. This justifies the adoption of a two-port system: PEM is divided in two modules, each with 1K hexadecimal digits, and the two modules can be accessed simultaneously. Basically, one module is replenished from mass memory while the other module is used for operations. In this way, PEM can almost be considered as a fast scratchpad memory for the PE's, with mass memory being the main memory.

Since (as will be shown in Sections 3.5 and 4) a row of 1024 32-bit numbers can be exchanged between PEM and mass memory in about 128 μsec and the basic floating-point operations take on the order of 25 μsec, a number brought to PEM must be used at least six times in operations before being overwritten in order to avoid I/O bounding. This ratio of 1 to 6 is a comfortable figure for a machine intended for scientific applications. It should also be pointed out that I/O-PE overlap is not the only use of the two-module system: if I/O is not occurring, the two modules can be used to overlap fetches for CU operations and PE operations, or even for the simultaneous fetch of two operands in a PE operation if each operand happens to be in a different module. It is the responsibility of the CU's final station (FINST) to assign use of the two PEM modules in an optimum way (see Section 3.5).

3.4.2 PE Data Registers

The algorithm described in Section 3.2 can be very efficiently mechanized using the register structure presented in Figure 10.

[Figure 10. Basic Data Register Structure. Labels: A_c, A_m, A_r; increment A_c; register A; add conditional; add unconditional; register B.]

There are two data registers: A and B. Register B is a simple, non-shiftable 4-bit unit. Register A is divided into three parts: A_r and A_m (for right and medium) with 4 bits each, and A_c (for carry) with 12 bits. Register A is fully shiftable, right or left, bit-by-bit. There is also a fast 4-bit shift mode in which the contents of register A are shifted (left or right) one hexadecimal digit in one operation.
The right fast 4-bit shift is not essential to implement the multiplication algorithm efficiently but can be very useful in other applications. It should also be pointed out that part A_c of register A is connected as a counter, and a pulse to the "increment A_c" control will cause the contents of A_c to be incremented by one unit. Finally, registers A and B are linked by a 4-bit parallel adder which, when activated, replaces the contents of A_m with the sum of the contents of A_m and B. The adder can be used unconditionally or conditioned to the presence of a "one" in location A_r. The carry generated by the adder can be fed to the "increment A_c" control.

To use the structure of Figure 10 to multiply using the polynomial algorithm, two hexadecimal digits a_i and b_j are placed in registers A and B respectively. Multiplication is accomplished with a sequence of four add conditionals and shifts right 1 bit. Register A is then shifted left fast 4 bits and a new multiplication can be performed, with the new product automatically added to the previous one(s). Registers A_c and A_m then work as a small accumulator in multiplication. Note that in the polynomial multiplication of two numbers, each n hexadecimal digits long, the worst case carry that can occur is less than log_2 n + 4 bits. Therefore, the number of bits needed in A_c is given by log_2 n_max + 4. A reasonable value for n_max is 64, which leads to an A_c 10 bits long. Since in SPEAC register length is naturally a multiple of 4 bits, 12 bits were reserved for A_c. For the same reason, the address register's length was chosen as 12 bits, allowing up to 4K hexadecimal digits per PEM module although only 1K is contemplated at this stage.

3.4.3 PE Description

The data register configuration described in the previous section was used as a kernel around which the whole PE was designed. Figure 11 presents a simplified PE diagram showing all registers and data paths.
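The register-level multiplication procedure of Section 3.4.2 can be illustrated with a short simulation. The following Python sketch is only a toy model under stated assumptions, not the PE logic itself: the class RegisterA, its method names, and the helper to_digits are hypothetical names introduced here; the add conditional is assumed to test the low order bit of A_r; and each add conditional is merged with the following one-bit right shift into a single step. Digit products are accumulated column by column as in equations (5).

```python
class RegisterA:
    """Toy model of register A: A_c (12 bits), A_m (4 bits), A_r (4 bits)."""

    def __init__(self):
        self.ac = 0
        self.am = 0
        self.ar = 0

    def add_conditional_and_shift(self, b):
        # Assumption: the add is conditioned on the low order bit of A_r.
        if self.ar & 1:
            total = self.am + b
            self.am = total & 0xF
            if total > 0xF:                      # adder carry increments A_c
                self.ac = (self.ac + 1) & 0xFFF
        # Shift the whole 20-bit register right one bit (end-off).
        self.ar = (self.ar >> 1) | ((self.am & 1) << 3)
        self.am = (self.am >> 1) | ((self.ac & 1) << 3)
        self.ac >>= 1

    def shift_left_fast_4(self):
        # Move A_r into A_m and A_m into the low order digit of A_c.
        self.ac = ((self.ac << 4) & 0xFFF) | self.am
        self.am = self.ar
        self.ar = 0


def to_digits(value, count):
    """Split an integer into `count` hexadecimal digits, low order first."""
    return [(value >> (4 * k)) & 0xF for k in range(count)]


def multiply_hex(a, b):
    """Multiply two equal-length digit lists (low order digit first)."""
    n = len(a) - 1
    reg = RegisterA()
    product = []
    for i in range(2 * n + 1):                   # one column per digit m_i
        pairs = [(j, i - j) for j in range(max(0, i - n), min(i, n) + 1)]
        for k, (j, l) in enumerate(pairs):
            if k > 0:
                reg.shift_left_fast_4()          # running sum back to (A_c, A_m)
            reg.ar = a[j]                        # multiplier digit in A_r
            for _ in range(4):                   # multiplicand digit b[l] in B
                reg.add_conditional_and_shift(b[l])
        product.append(reg.ar)                   # m_i; (A_c, A_m) now holds c_i
    product.append(reg.am)                       # m_{2n+1} = c_{2n}
    return product
```

For example, multiply_hex(to_digits(0x12, 2), to_digits(0x34, 2)) returns [8, 10, 3, 0], the digits of 0x12 X 0x34 = 0x03A8 with the low order digit first. Note that in this model no left shift is needed between columns: after the last product of a column, A_r holds m_i and (A_c, A_m) already holds the carry c_i.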
For a complete logical diagram, Figure 13 should be consulted. In order to reduce the size and complexity of Figure 13, a number of special symbols were adopted. These are defined in Figure 12 and deal with representing groups of 4 or 12 wires in a concise way. Only a few logic elements appear explicitly in Figure 13; most logic is represented as logical blocks called packages. These packages are numbered and labeled with a name describing their function; i.e., 1-of-8 selector, type D flip-flop, inverter, etc. The complete diagrams of the logic inside each package are presented in Appendix A. It should be noted that most packages perform standard logic functions and are available as SSI or MSI chips. This aspect will be further pursued in the section on implementation.

3.4.3.1 Registers and Buses

Each PE contains nine registers with a total capacity of 65 bits. Table 1 lists each register, its capacity, function, and special features.

[Figure 11]

[Figure 12. Conventions Used in PE Logical Diagram]

[Figure 13. Complete PE Logical Diagram]

Buses are used to provide data paths between the different registers. This allows maximum flexibility (since each register can be directly loaded from any other register) at a reasonable cost. Two types of buses are needed: a 12-bit address bus, linking all address registers and the CAB, and a 4-bit data bus linking all the remaining registers, the CDB and the IOB. Since it was decided that both PEM modules should be simultaneously accessible, one pair of buses is dedicated to each PEM module. Therefore, there are four buses altogether: two address buses (A1 and A2) and two data buses (D1 and D2). Buses A1 and D1 are linked to PEM module 1 and buses A2 and D2 are linked to PEM module 2.
Figure 11 clearly shows all the connections to each bus. In this figure, an arrow into a bus indicates that the given data can be gated into the bus; an arrow out of a bus indicates that the contents of the bus can be gated into the given unit; a dot at the intersection of a wire and a bus indicates a permanent connection of the wire to the bus. It should also be noticed that every line connected to an address bus represents in fact 12 wires (except the line into sM, which is a 4-bit line), while lines connected to a data bus stand for 4 wires, with the exception of the line into EE, which is a single bit line. A very rough estimate of the number of gates needed to implement the bus system can now be obtained: counting each arrow associated with a data bus as 4 gates and each arrow associated with an address bus as 12 gates, one obtains a total of 354 gates. This represents about a third of the total number of gates used in the PE, with flip-flops accounting for the second third and arithmetic, decoding and local control using the remaining gates.

It is important to point out that PEM module 1 is permanently connected to bus 1 and module 2 to bus 2. Therefore, if an operand is in module i then bus i must be used to fetch that operand. On the other hand, inter-register transfers can use any bus that is available. This fact will be important in the design of the CU's final station (FINST).

Register   Capacity (bits)   Function        Special Features
A          (whole)           -               Shiftable (bidirectional, 1- and 4-bit distances)
  A_c      12                address/data    Can count up
  A_m      4                 data            Each bit is individually enabled
  A_r      4                 data            None
B          4                 data            None
X_1        12                address         Can count up or down
X_2        12                address         Can count up or down
X_3        12                address         Can count up or down
LC         4                 local control   Each bit is individually enabled
EE         1                 mode            None

Table 1.
PE Registers

3.4.3.2 The Arithmetic/Logic Unit

The simple adder of Figure 10 was replaced in the final design by a more sophisticated arithmetic/logic unit (A/L unit) which is capable not only of adding but also of performing several other arithmetic and logic functions as well as comparisons. This unit, whose logical diagram can be seen in package 9 (Appendix A), is currently available from several manufacturers in a 24-pin MSI bipolar chip. There are five control lines in the A/L unit, allowing a choice between 32 functions (not all different). Table 2 shows these 32 functions. There is also an A = B output to test for equality. Other comparisons can be performed by subtracting the two inputs and analyzing the output carry.

Input B to the A/L unit is always register A_m. Input A can be selected among D1, D2, reg B and the complement of reg B. This allows one to compute not only (reg B) - (reg A_m) (by picking reg B as the A input to the A/L unit and subtracting) but also (reg A_m) - (reg B) (by picking the complement of reg B as the A input and adding). Inputting to the A/L unit directly from D1 or D2 is not essential but speeds up several operations by avoiding unnecessary loads into B only to use the A/L unit. The output of the unit can be gated either into A_m or into A_r. Another important feature is the possibility to gate the output of the A/L unit into A shifted one to the right. This speeds up multiplications considerably since two hexadecimal digits can be multiplied in 4 clocks instead of 8 (i.e., 4 add-and-shifts as opposed to 4 adds and 4 shifts).

3.4.3.3 Scratchpad Memory

A small (16 hexadecimal digits), fast scratchpad memory (sM) has been added to the final version of the PE. This unit is available in a 16-pin MSI chip (see package 8, Appendix A) and can read or write one hexadecimal digit in one PE clock.
Although not essential to the PE, sM can be added at a low cost and provides a dramatic improvement in performance. Floating-point addition, for example, is speeded up by a factor of three. The main use of sM is to avoid repeated fetches of the same digit in multiplication and to store partial results before normalization. It should be noticed that since sM receives addresses from the address buses (only the four low order bits are used), it can be locally indexed, i.e., each PE can locally modify an address in sM before performing an sM fetch. This is extremely valuable in floating-point normalization. Therefore, sM is the fourth element in SPEAC's memory hierarchy which is, from the smallest and fastest unit to the slowest and largest: sM - PEM - mass memory (random access) - large capacity disk.

                 M = 1               M = 0 (arithmetic operations)
S3 S2 S1 S0      (logic functions)   C_n = 0                 C_n = 1
0000             F = ¬A              F = A                   F = A + 1
0001             F = ¬(A v B)        F = A v B               F = (A v B) + 1
0010             F = (¬A)B           F = A v ¬B              F = (A v ¬B) + 1
0011             F = 0               F = 1111                F = 0
0100             F = ¬(AB)           F = A + A(¬B)           F = A + A(¬B) + 1
0101             F = ¬B              F = (A v B) + A(¬B)     F = (A v B) + A(¬B) + 1
0110             F = A ⊕ B           F = A - B - 1           F = A - B
0111             F = A(¬B)           F = A(¬B) - 1           F = A(¬B)
1000             F = ¬A v B          F = A + AB              F = A + AB + 1
1001             F = ¬(A ⊕ B)        F = A + B               F = A + B + 1
1010             F = B               F = (A v ¬B) + AB       F = (A v ¬B) + AB + 1
1011             F = AB              F = AB - 1              F = AB
1100             F = 1               F = A + A               F = A + A + 1
1101             F = A v ¬B          F = (A v B) + A         F = (A v B) + A + 1
1110             F = A v B           F = (A v ¬B) + A        F = (A v ¬B) + A + 1
1111             F = A               F = A - 1               F = A

(¬ denotes the complement, v the logical OR, juxtaposition the logical AND; + and - are arithmetic.)

Table 2. Functions Provided by the A/L Unit

3.4.3.4 Address Registers

There are three address registers in the PE: X_1, X_2 and X_3. These are simple, non-shiftable 12-bit units with additional logic to enable them to act as up/down counters (see package 11, Appendix A). The address registers are normally loaded from the CAB with a base address broadcast by the CU to all PE's. This base address can then be locally indexed.
Successive hexadecimal digits of an operand can be accessed by incrementing or decrementing an address register using the up/down counter feature, avoiding frequent use of CAB and repeated local indexing operations. It is now clear that a memory transaction may use as address one of four sources: registers X_1, X_2, X_3, and CAB. The common address bus can be directly used as the address source in I/O transactions or in operand fetches when local indexing is not necessary. This use of CAB indicates that one could possibly eliminate X_3 and still obtain good performance since, in most cases, only two addresses are simultaneously needed for PE operations: in the fetch phase of the operation, the addresses of the two operands are stored in X_1 and X_2. In writing the result, two other addresses are needed in X_1 and X_2: the address of the result and an sM address. X_3 is used, most of the time, to hold I/O transaction addresses. It is felt that eliminating X_3 would cause frequent conflicts in CAB use and a degradation in performance. Only extensive simulation can indicate whether such degradation is small enough to warrant removal of X_3 for a very significant saving in the number of gates.

3.4.3.5 Register A

There are eight possible sources of input data to each of the parts of register A. Six of these eight are common to A_c, A_m and A_r. They are: 1) shift A right one, 2) shift A left one, 3) shift A fast 4 right, 4) shift A fast 4 left, 5) load with D1 (A1 in the case of A_c), and 6) load with D2 (A2 in the case of A_c). The seventh input option is the add and shift especially implemented to speed up multiplication. The effect of this input is the following: the output of the A/L unit is loaded, shifted one position to the right, into the three low order bits of A_m and the high order bit of A_r; A_r is shifted right one; and A_c is either shifted right one (if the output carry from the A/L unit is zero) or is incremented by one and shifted right one (if the output carry from the A/L unit is one).
Finally, the eighth and final possible input to A is: for A_m and A_r, the output of the A/L unit (used for addition, subtraction and logical operations); for A_c, the last input possibility is simply A_c incremented by one (i.e., the counter feature of A_c).

Input control is independent for each of the three parts of register A. Therefore, register A shifts end-around as a whole only when A_c, A_m and A_r are simultaneously loaded with the same shift input. Several other useful results may be obtained when only one or two of the parts of A receive a shift command. For example, loading A_r with a shift fast 4 right enables one to copy A_m directly into A_r without having to use D1 or D2. A direct swap of the contents of A_m and A_r can be achieved by simultaneously loading A_m with a shift fast 4 left and A_r with a shift fast 4 right. There is a control wire to determine whether a distance 1 shift is to be end-off or end-around. Distance 4 end-off shifts are obtained by shifting only two of the parts of A.

A_c and A_r have a single load control which, when OFF, preserves the contents of the register and, when ON, loads the register with the selected input. Load control for A_m is more sophisticated and allows not only "load" and "no-load" but also a conditional load dependent on the value in D1 or D2. In this conditional load, bit i of A_m is loaded only if bit i in D1 or D2 is ON. This is very useful in "assembling" a hexadecimal digit out of specific bits of two other digits, as is the case in inserting a sign bit in a number.

It is important to notice that A1 or A2 can be gated into A_c, thus allowing addresses to reach the data handling part of the PE. This feature is used to modify addresses in local indexing. Also, A_c is a counter and can be used as such when not needed to accumulate a carry in multiplication. This provides a general purpose 12-bit counter in the PE which is extremely useful in several applications.
Therefore, A_c has a quadruple function: a) it provides linkage between the address portion and the data portion of the PE, b) it serves as a general purpose counter, c) it accumulates the carry in multiplication, d) for special applications, A_c could be used as an additional address register.

For a more complete idea of the whole PE as well as all the available controls, the reader is directed to Figure 13, where each control wire is indicated as a line ending in an open circle with a code name associated with it. There is a total of 78 control wires in the PE, and Table 3 lists these wires in alphabetical order along with a description of their function.

Control Wire                  Controls     Function
AcC1 to AcC3                  A_c          Select one out of eight possible inputs
...                           ...          ... OR of carry from X_1, X_2 and X_3; i=4, output carry from A/L unit
LCiC3, LCiC4 (i=1,2,3,4)      lcFFi        00 - do nothing; 10 - gate lcFFi into interrupt wire; 01 - enable clock if lcFFi is OFF; 11 - enable clock if lcFFi is ON
PEMiC1 (i=1,2)                PEM mod i    Select read or write
PEMiC2 (i=1,2)                PEM mod i    Do not obey mode control
sMC1                          sM           Select between D1 and D2 as input to be read into sM
sMC2                          sM           Select between four low order bits of A1 and A2 as address to sM
sMC3                          sM           Select read or write in sM
XiC1 (i=1,2,3)                X_i          Load input selected by XiC3
XiC2 (i=1,2,3)                X_i          Count X_i up or down as selected by XiC3
XiC3 (i=1,2,3)                X_i          If counting, select between up or down; if loading, select input between A1 and A2

Table 3 (Continued)

3.4.4 Local Control

It has already been pointed out that a certain minimum amount of local control must be present at each PE to take care of data-dependent actions. This takes the form of gates which, when activated, allow or inhibit an attempted action depending on some internal PE state. When the information used for local control is stored at some PE register at the same time it is needed, no additional memory elements are necessary.
This is the case, for example, with the use of A_r as local control for the "add conditional" in multiplication (see Figure 10). In other instances, however, the local control information is not available any more when it is needed. In this case local control flip-flops must be introduced to store this information. Specifically, there are in the PE six "dynamic outputs" which must be stored somehow since they may be needed for local control. These dynamic outputs are:

- Equality output (A = B) from the A/L unit
- Carry (C_{n+12}) from the A_c counter
- Carry/borrow (C_{n+12}) from the address registers X_1, X_2 and X_3
- Output carry (C_{n+4}) from the A/L unit

Four local control flip-flops designated by lcFFi (i=1,2,3,4) are used to store the dynamic outputs: A = B can be stored in lcFF1; C_{n+12} from A_c can be stored in lcFF2; the OR of C_{n+12} from X_1, X_2 and X_3 can be stored in lcFF3; and C_{n+4} can be stored in lcFF4. Notice that only one lcFF is used to store the OR of the carry/borrows from the three address registers. This results in a saving of two lcFF's and does not introduce any serious disadvantage, since a carry/borrow in an address register is normally an error condition and will cause an interrupt regardless of the particular register in which the overflow occurred.

It is easy to see that local control is the most serious obstacle in achieving the goal of a PE as general as possible, able to cope with a wide range of word formats and instructions. Normally, a lcFF may be loaded only with a specific bit of information and controls only a certain PE function. This tends to freeze conventions like negative number representation and sign bit location. These shortcomings suggest the possibility of some generalized local control logic as illustrated in Figure 14. This could be viewed as allowing microprogramming at the PE level. Obviously, a generalized local control such as the one proposed in Figure 14 is prohibitively expensive.
Therefore, the subject was intensively researched and a satisfactory compromise has been found.

[Figure 14. A Generalized Local Control. Labels: set of inputs allowing access to every bit in the PE; set of local control flip-flops; set of control wires; gating allowing any local control flip-flop to be set from any input or boolean function of inputs; gating allowing any control wire to be inhibited by any local control flip-flop or any boolean combination of the outputs of local control flip-flops.]

Initially, one should notice that any type of local control can be achieved using only enable control; i.e., being able to enable or disable the whole PE according to the presence of a ZERO or a ONE in a lcFF. To prove this proposition, simply consider the fact that local control can be of two types: a) IF (lcFFi) THEN action 1, and b) IF (lcFFi) THEN action 1 ELSE action 2. For the moment, a disabled PE is defined as one in which the clock is inhibited, causing all registers to retain their old values. Local control of type a can be implemented by enabling only the PE's in which lcFFi is ON, executing the microsequence to perform action 1 and then enabling all PE's again. For local control of type b a second step is needed, in which only PE's in which lcFFi is OFF are enabled and then action 2 is executed, followed by enabling all PE's again.

This type of local control, achieved through enabling and disabling PE's, will be called indirect local control, as opposed to direct local control, in which one or more control wires are directly inhibited by some lcFF or other register in the PE. Although indirect local control is universal and can achieve any desired effect, it is obviously slower since extra time is needed to turn PE's ON and OFF. Therefore, local control in SPEAC will be primarily of the indirect type except for a few extremely important functions in which one cannot afford the extra time; these will be implemented directly.
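The two-step scheme for indirect local control can be illustrated with a small simulation. The Python sketch below is a minimal model under stated assumptions: the class PE, its fields, and the functions broadcast, enable_where, enable_all and if_then_else are hypothetical names introduced here for illustration, a single integer stands in for the PE registers, and only one local control flip-flop per PE is modeled.

```python
class PE:
    """Toy processing element with one register and one local control flip-flop."""

    def __init__(self, value):
        self.value = value      # stands in for the PE registers
        self.lcff = False       # one local control flip-flop (lcFFi)
        self.enabled = True     # a disabled PE has its clock inhibited


def broadcast(pes, action):
    """Send one microsequence to the whole array; only enabled PE's obey it."""
    for pe in pes:
        if pe.enabled:
            action(pe)


def enable_where(pes, lcff_state):
    """Enable exactly those PE's whose flip-flop equals lcff_state."""
    for pe in pes:
        pe.enabled = (pe.lcff == lcff_state)


def enable_all(pes):
    for pe in pes:
        pe.enabled = True


def if_then_else(pes, action1, action2=None):
    """IF (lcFF) THEN action1 [ELSE action2], using enable control only."""
    enable_where(pes, True)     # step 1: PE's with the flip-flop ON
    broadcast(pes, action1)
    enable_all(pes)
    if action2 is not None:     # step 2 (type b only): PE's with it OFF
        enable_where(pes, False)
        broadcast(pes, action2)
        enable_all(pes)


def negate(pe):
    pe.value = -pe.value


def increment(pe):
    pe.value += 1


# Example: negate the negative values, add one to the others.
pes = [PE(v) for v in (5, -3, 0, -8)]
for pe in pes:
    pe.lcff = pe.value < 0      # data-dependent bit stored in the flip-flop
if_then_else(pes, negate, increment)
result = [pe.value for pe in pes]
```

The sketch also shows why the indirect form is slower: every conditional costs extra passes over the array to switch PE's ON and OFF, and a type b conditional executes action 1 and action 2 one after the other rather than simultaneously.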
3.4.4.1 Direct Local Control

Direct local control is used in SPEAC for four functions:

a) Input carry (C_n) to the A/L unit. This is controlled by wires ALCC1 and ALCC2 (see Figure 13 and Table 3). C_n can thus be chosen among four values: ONE, ZERO, the complement of lcFF1, and the same value as in lcFF4. C_n = ZERO is used in initiating unsigned addition and C_n = ONE in initiating unsigned subtraction (using also the complement of reg B as operand A to the A/L unit). Signed addition must be locally controlled since it can be an actual addition (if both operands have the same sign) or a subtraction (if the signs are different). A sign comparison can easily be stored in lcFF1 since A = B can be stored in this flip-flop. Therefore, lcFF1 = ONE if the signs are equal, ZERO otherwise, and C_n = complement of lcFF1 can be used in initiating a signed addition. The last possible value of C_n is lcFF4. This is used in the middle of an addition or subtraction, when C_n must have the value that C_{n+4} had in the previous step. Therefore, when adding (or subtracting) hexadecimal digits a_i and b_i of A and B, the value of lcFF4 is the carry C_{n+4} from the addition (or subtraction) of a_{i-1} and b_{i-1} and will be used as C_n. At the same time, lcFF4 will be changed to C_{n+4} from a_i + b_i, to be used in the next step.

b) Input A to the A/L unit. This is controlled by wires ALIC1, ALIC2, and ALIC3. The first two wires choose among B, the complement of B, D1 and D2. The last one, ALIC3, implements a direct local control: when ALIC3 is ON, input A to the A/L unit will be the complement of B instead of B if lcFF1 is OFF. If lcFF1 contains a comparison of signs in signed addition, as explained above, then this local control transforms an addition into a subtraction for the PE's in which the signs are unequal.

c) Gating of input A to the A/L unit. This local control is actuated by a ONE in wire ALIC4. When this happens, the gating of input A to the A/L unit is inhibited by the presence of a ZERO in A_r.
Therefore, if A_r is ZERO and ALIC4 is ON, operand A to the A/L unit is ZERO regardless of the values of ALIC1, ALIC2 and ALIC3. Obviously, this implements the "add conditional" needed for multiplication.

d) Finally, there is local control built into the input gating to register A_c. When "add and shift" is chosen as the input to register A_c, A_c is either shifted right one (if C_{n+4} is ZERO) or is incremented by one and shifted right one (if C_{n+4} is ONE) as explained in Section 3.4.3.5.

3.4.4.2 Indirect Local Control

All control functions not directly implemented are obtained using the lcFF's to enable chosen PE's. In order to do this, one must be able to store the controlling bit in one of the lcFF's. It has already been explained that the "dynamic outputs" can be directly stored in lcFF's. There are four lcFF's in the PE and Figure 15 presents a simplified diagram of the controls at the input and output of each lcFF. For the precise logic, the reader is referred to Figure 13 and package 6 in Appendix A.

The local control structure illustrated in Figure 15 is actually a simplification of the generalized local control described in Figure 14; the number of gates was considerably reduced to make the unit practical for use in a "small" PE like SPEAC's. Nevertheless, the unit is as powerful as the generalized local control although not as fast.

In order to perform indirect local control, every bit in the PE should be accessible to a lcFF. This is achieved by linking LC, the register composed of the four lcFF's, to data buses D1 and D2 like all other data registers, thus allowing any bit in the PE to be fed as input to a lcFF. It should also be recalled that the dynamic outputs can be stored in the lcFF's. Therefore, the input gates of Figure 14 have been reduced in Figure 15 to a 1-out-of-3 selector for each lcFF. The selector for lcFFi is controlled by two wires: LCiC1 and LCiC2.
The four possible input actions are:

a) do nothing (i.e., retain the previous value stored),
b) store in lcFFi the i-th bit in D1,
c) store in lcFFi the i-th bit in D2, and
d) store in lcFFi the dynamic output associated with that flip-flop as described in Section 3.4.4.

It is often necessary to set a lcFF to a Boolean combination of other bits, sometimes to a Boolean combination of bits in other lcFF's. In order to save the gates needed to implement this directly, the output of LC is made available as a possible value of D1 or D2 like any other data register. Therefore, the contents of LC can be brought to register A and one can perform shifts and logical operations. When the desired function is obtained, it can be stored back in LC from D1 or D2.

[Figure 15. Diagram of a Local Control FF. Inputs to lcFFi: D1i, D2i and dynamic output i, chosen by the input control; outputs chosen by the output control: enable when FF is ON, enable when FF is OFF, gate FF to interrupt wire; the FF output also goes to the D1i and D2i gate selectors.]

The output gates of the generalized local control have also been reduced in Figure 15 to a 1-out-of-3 selector controlled by two wires: LCiC3 and LCiC4. These wires control the function performed by each lcFF. The four possible functions performed by lcFFi are:

a) do nothing (i.e., the state of the flip-flop has no effect on the PE),
b) enable PE only if lcFFi is ON,
c) enable PE only if lcFFi is OFF, and
d) gate the output of lcFFi to the interrupt wire.

Function d, used when it is desired to send an interrupt signal to the CU, will be discussed in Section 3.4.6. Functions b and c are used to perform indirect local control. Since it is possible to enable either on a ONE or on a ZERO of a lcFF, one avoids moving LC to A only for complementing. This is important because it is often necessary to enable PE's in which lcFFi is ON, perform an action and then enable only PE's in which it is OFF to perform another action, thus obtaining control of the type IF (lcFFi) THEN action 1 ELSE action 2.
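The input and output selections of one lcFF can be summarized in a short model. This is a sketch under stated assumptions: the numeric encodings of the wire pairs (LCiC1, LCiC2) and (LCiC3, LCiC4) are not given in the text and are chosen arbitrarily here, the names `InputSel`, `OutputFn`, `next_state`, `pe_enabled` and `interrupt_wire` are hypothetical, and combining the enable conditions of several lcFF's as a logical AND is an assumption.

```python
from enum import Enum


class InputSel(Enum):
    """Input action of lcFFi, chosen by wires LCiC1, LCiC2 (encoding assumed)."""
    HOLD = 0      # a) retain the previous value
    FROM_D1 = 1   # b) store the i-th bit of D1
    FROM_D2 = 2   # c) store the i-th bit of D2
    DYNAMIC = 3   # d) store the dynamic output tied to this flip-flop


class OutputFn(Enum):
    """Function of lcFFi, chosen by wires LCiC3, LCiC4 (encoding assumed)."""
    NONE = 0           # a) no effect on the PE
    ENABLE_IF_ON = 1   # b) enable PE only if lcFFi is ON
    ENABLE_IF_OFF = 2  # c) enable PE only if lcFFi is OFF
    INTERRUPT = 3      # d) gate lcFFi to the interrupt wire


def next_state(state, sel, d1_bit, d2_bit, dynamic_bit):
    """Value held by the flip-flop after one clock."""
    if sel is InputSel.FROM_D1:
        return d1_bit
    if sel is InputSel.FROM_D2:
        return d2_bit
    if sel is InputSel.DYNAMIC:
        return dynamic_bit
    return state


def pe_enabled(states, functions):
    """Internal enable: every lcFF given an enabling function must agree (assumed AND)."""
    for state, fn in zip(states, functions):
        if fn is OutputFn.ENABLE_IF_ON and not state:
            return False
        if fn is OutputFn.ENABLE_IF_OFF and state:
            return False
    return True


def interrupt_wire(states, functions):
    """OR of the lcFF's currently given the interrupt function."""
    return any(state and fn is OutputFn.INTERRUPT
               for state, fn in zip(states, functions))
```

Because the function is chosen anew on every clock, the same flip-flop can serve the THEN pass with `ENABLE_IF_ON` and the ELSE pass with `ENABLE_IF_OFF` without being complemented in between.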
It is then clear that a lcFF does not have a certain fixed function but is attributed, for each clock cycle, one among four possible functions. Also, each lcFF is controlled completely independently from the others, which makes this type of local control rather costly in terms of control wires; 16 wires are required altogether. It is felt, however, that the performance and versatility obtainable with this local control justify the cost.

3.4.5 Mode Control

Mode control is simply of the ON-OFF type as in ILLIAC IV. Register M (also called EE for external enable) is in charge of this control. This is a single bit register which can be loaded with any bit of D1 or D2. Therefore, the input gating for register M is a 1-out-of-8 selector controlled by wires EEC1, EEC2 and EEC3. A fourth wire (EEC4) completes the control of register M. When EEC4 is ON, M is loaded with the input bit selected by the three other wires; when it is OFF, M retains its old value.

The mode control register has a fixed function which is to enable the PE on a ONE (i.e., whenever M is ON, the PE is enabled and whenever it is OFF, the PE is disabled). The mode register can also be called the "external enable" register, which points out the fact that it is an enable register reserved for user (or macro-instruction) manipulations, as opposed to the internal enable, which is the function attributable to lcFF's. The latter is normally used only by the systems programmer in micro-instructions.

It is now convenient to define precisely what is meant by a disabled PE. Most registers in the PE are clocked by the signal Ck, which is the main clock sent by the CU ("Clock"), inhibited by register M and possibly by the lcFF's. Therefore, when a PE is disabled, all registers clocked by Ck are frozen; i.e., they retain their old values. The elements not clocked by Ck are registers M and X_3, and the two PEM modules. Register M is directly clocked by "Clock" and cannot be disabled.
This is obviously needed or else, once M were disabled, the PE could never be enabled again. There is a special problem with PEM and X_3: as described in Section 3.4.1, one must be able to overlap PE operation with replenishment of PEM. Therefore, I/O operations must be able to reach a disabled PE since PEM in all PE's must be replenished regardless of the fact that some PE's may be temporarily OFF. In order to accomplish this, each PEM module receives both clock signals: the direct signal "Clock" and the possibly inhibited Ck. A control wire (PEMiC2, where i is the module number) decides whether "Clock" or Ck is to be used, thus choosing between ignoring and respecting disabling. Also, X_3 is clocked by "Clock" instead of Ck since it is mainly used to hold addresses for I/O operations.

Finally, it can be pointed out that the contents of M are not accessible to the PE. Therefore, if the setting of M is to be used later, it must be temporarily stored in sM at the time it is being loaded into M.

3.4.6 Interrupts

The interrupt system is very simple; every PE has one interrupt wire and the CU also receives only one wire, which is the OR of the data in the interrupt wires of each PE. If one or more PE's are interrupted, the CU will sense a "1" in the interrupt wire and the operating system will have to interrogate the PE's to find out which are responsible. This scheme has the advantage of making the number of interrupt wires independent of the number of PE's, allowing for system expansion.

It has already been described (in Section 3.4.4.2) that one of the functions attributable to each lcFF is the gating of its contents into the interrupt wire. Conditions that should cause an interrupt are detected in the PE and stored as a ONE in some lcFF. The interrupt can then be sent to the CU by attributing the interrupt function to that lcFF.
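A minimal sketch of this scheme follows. The function names `pe_interrupt_wire`, `cu_interrupt_line` and `interrogate` are hypothetical, and the interrogation is modelled as a simple scan over the PE wires; the text does not specify how the operating system actually interrogates the PE's.

```python
def pe_interrupt_wire(lcff_values, interrupt_selected):
    """Wire of one PE: OR of the lcFF's currently given the interrupt function."""
    return any(value and selected
               for value, selected in zip(lcff_values, interrupt_selected))


def cu_interrupt_line(pe_wires):
    """Single wire seen by the CU: OR of the interrupt wires of all PE's."""
    return any(pe_wires)


def interrogate(pe_wires):
    """Stand-in for the operating system's search for the responsible PE's."""
    return [index for index, wire in enumerate(pe_wires) if wire]


# Three PE's; only lcFF3 (index 2) is given the interrupt function.
select = [False, False, True, False]
wires = [
    pe_interrupt_wire([False, False, False, False], select),
    pe_interrupt_wire([True, False, True, False], select),
    pe_interrupt_wire([True, False, False, False], select),
]
```

The CU sees one bit regardless of the array size, so adding PE's changes only the fan-in of the OR, not the number of wires entering the CU.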
It should be noticed that the propagation times of the PE interrupt signals are assumed short compared to the PE clock period. This is what allows only one interrupt flip-flop to be used for different conditions like the following: exponent overflow, exponent underflow, fixed point overflow, division by zero, etc. It is assumed that the CU will notice the interrupt soon enough to be able to distinguish the different conditions by an analysis of which step of which operation was being performed.

It is also interesting to point out that the interrupt system is used not only to detect error conditions, but can be very useful to detect the end of a recurrence process or to optimize certain programs. For example, assume that a recurrence process is being executed by all PE's. At the end of each step, the error is computed and compared with the maximum acceptable. All PE's in which the error is smaller than the maximum are turned OFF, via lcFF3 for example. Sending lcFF3 via the interrupt wire will enable the CU to detect if all PE's have been turned OFF. If this is the case, the recurrence is ended. It may also be quite useful to add a control wire enabling one to send M via the interrupt wire.

3.4.7 Implementation Remarks

This section considers some of the design problems that would have to be solved if the PE previously described were to be actually built. T²L integrated circuits will be used in the implementation of the PE logic due to their medium cost, speed, and power dissipation. MOS logic was initially considered and it offered considerable advantages in cost and power dissipation; however, it does not seem to be fast enough for the purpose of making the memory cycle (1/2 μsec) the basic speed limiting factor. This cannot be achieved with conventional MOS logic in the PEM (although silicon-on-sapphire technology promises for the near future an order of magnitude increase in the speed of MOS logic).
T²L, although not as fast as desirable, will allow a good balance between memory fetch time and PE operation time; assuming 10 nsec as the typical gate propagation delay time, and considering that there are no long logic chains in the PE, it is realistic to assume a PE clock period of 100 nsec (PE clock frequency = 10 Mc/s). Therefore, a PE clock takes only a fraction of a PEM cycle, depending on how fast the PEM is used.

Since the PU's will be pluggable, it is important that the number of connections to each PU be minimized as this is, in integrated circuitry, a cost factor probably more important than mere gate count. Table 4 shows the actual number of PE connections achieved. A total of 103 to 110 is needed, probably making two connectors necessary in each PU if a conventional printed circuit is used. Three power wires are needed instead of two if MOS PEM's are used since they need an extra voltage level. IOB and CDB must be bidirectional. This is achieved either by running two independent buses, one in and one out as indicated in Figures 11 and 13, or by using only one bus with additional logic in the PE and one extra control wire to choose in which direction the bus is to be used. The cost of six extra connections seems small enough to save the extra complications of using only one bus. Also, if both in and out buses are present, they could be used simultaneously in some operations like I/O and routing. Therefore, eight wires are used for CDB and eight more for IOB.

    Function            Number of Connections
    Control wires        80 -  78
    CDB                   4 -   8
    IOB                   4 -   8
    CAB                  12
    Interrupt wire        1
    Power                 2 -   3
    Total               103 - 110

    Table 4. Connections to Each PU

The number of control wires (78) is quite large, but this is the price to pay for retaining maximum PE versatility for the micro-programmer. Of course, the number of control wires could be reduced by adding encoding logic in each PE. However, this would increase the gate count per PE and reduce the flexibility of the controls.
Therefore, encoding of control wires was used only when flexibility was not affected (as in the input to a register; in any case, the register cannot be loaded with two different inputs) and when the extra gating comes automatically in the IC's used or can be added economically.

T²L MSI chips manufactured by Texas Instruments provide a preliminary guideline in the discussion of questions related to: number of gates, IC's available off-the-shelf, power dissipation, etc. Therefore, the suggested IC's are limited to the ones listed in [8] and this information is only useful in rough evaluations for a breadboard PE. In actual construction, a few made-to-order LSI IC's would be used in place of several smaller chips. Table 5 lists a few MSI T²L chips available off-the-shelf that could be of interest in the construction of a breadboard PE. Table 6 lists all the packages used in Figure 13 and also gives the number of FF's per package and a very rough evaluation of the number of equivalent gates per package. Memory elements were not included in the evaluation of the totals for the PE. Roughly, the proposed implementation requires 1K gates and 64 type D flip-flops for a total of approximately 1.3K gates.

Table 7 presents a preliminary evaluation of the number of IC chips that would be needed in each PU. Two numbers are given: one, for a breadboard PU, uses the chips introduced in Table 5; in this case more than one hundred chips are necessary.
The second number assumes the availability of a few custom made IC's with up to 24 pins per DIP. These IC's are only slight modifications of the ones in Table 5. In this case, the number of chips goes down to about 64.

    Chip  Type      Equiv.  DIP   Avg. power   Description
    no.             gates   pins  diss. (mW)
     1    SMA2002   na      28    1331         Memory: MOS, 1024 x 2, T²L compatible, fully decoded
     2    Fair3532  na      16    150          Memory: MOS, 512 x 2, T²L compatible, fully decoded
     3    SN7489    na      16    375          Memory: 16 x 4, scratchpad
     4    SN74175   na      16    na           Register: D-type, 4 bits
     5    SN74174   na      16    na           Register: D-type, 6 bits
     6    SN74191   58      16    325          Counter: parallel in/out, synchronized, up/down, 4 bits
     7    SN74181   75      24    ~375         A/L unit: 4 bits
     8    SN74157   ~15     16    125          Data selector: quad 2-to-1
     9    SN74153   ~16     16    180          Data selector: dual 4-to-1 with strobe
    10    SN74152   ~15     16    130          Data selector: 8-to-1
    11    SN74L98   ~40     16    25           Data selector/storage register: 2-to-1, 4 bits
    12    SN74LS83  ~42     16    75           4-bit binary full adder

    Table 5. Some IC Chips that Might Be Used in the PE

    Pk   Function                                                No.   Approx. gates  FF's per  Total  Total
    no.                                                          used  per package    package   gates  FF's
     1   1-out-of-8 selector; no strobe                          29    9              -         261    -
     2   Quad D type FF; clock enabled for all FF's at once       5    5              4          25    20
     3   Type D FF with enable on the clock                       9    2              1          18     9
     4   1-out-of-4 selector; no strobe                           5    5              -          25    -
     5   1-out-of-3 selector with enable decoding                 4    5              -          20    -
     6   Enable and interrupt control                             1    18             -          18    -
     7   PEM, 1 module                                            2    -              -          -     -
     8   sM: 64 bit memory, 16 4-bit words                        1    -              -          -     -
     9   A/L unit                                                 1    ~60            -          60    -
    10   1-out-of-2 selector without strobe                      59    3              -         187    -
    11   4 bit add/subtract counter, parallel in/parallel out     9    ~25            4         225    36
    12   1-out-of-4 selector with strobe                          4    5              -          20    -
    13   Quad inverter                                            1    4              -           4    -
    14   Increment by 1 network (6 bits)                          2    25             -          50    -
    15   1-of-5 selector                                         24    6              -         144    -
         TOTALS                                                                                 1057    65

    Table 6. Packages Used in the PE and Their Contents

    Used in                     Breadboard                                  Actual implementation
                                Chips used                     No. of       Chips used                No. of
                                                               chips                                  chips
    PEM                         chip 1 = 1/2 Pk 7                4          as in breadboard            4
    reg B and input gates       chip 11 as Pk 2 + (4 x Pk 10)    1          as in breadboard            1
    input to D1, D2, A_m,       chip 10 as Pk 1                 28          2 x Pk 1                   14
      A_r, A_c
    input to A1, A2             chip 10 as Pk 15                24          2 x Pk 15                  12
    output to IOB, CDB          chip 8 as 4 x Pk 10              2          as in breadboard            2
    inputs to sM                chip 8 as 4 x Pk 10              2          as in breadboard            2
    sM                          chip 3 as Pk 8                   1          as in breadboard            1
    A/L unit                    chip 7 as Pk 9                   1          as in breadboard            1
    X_1, X_2, X_3               chip 6 as Pk 11                  9          1 1/2 x Pk 11               6
    inputs to X_1, X_2, X_3     chip 8 as 4 x Pk 10              9          6 x Pk 10                   6
    input A to A/L              chip 9 as 2 x Pk 12              2          as in breadboard            2
    Increment net               chip 12 as 2/3 x Pk 14           3          Pk 14                       2
    A_c                         chip 5 as 1 1/2 x Pk 2           2          as in breadboard            2
    A_r                         chip 4 as Pk 2                   1          as in breadboard            1
    A_m                         SSI dual FF                      2          4 x Pk 3                    1
    enable control in A_m       chip 9 as 2 x Pk 4               2          4 x Pk 4                    1
    M and M input               SSI FF; chip 10 as Pk 1          2          Pk 3 + Pk 1                 1
    LC and LC input             SSI dual FF; chip 9 as Pk 5      4          2 x Pk 3 + 2 x Pk 5         2
    enable control              chip 9 as 1/4 x Pk 6             4          Pk 6                        1
    others                      SSI chips                        5          Pk 13 + Pk 4; 3 x Pk 5      2
    Total                                                      108          Total                      64

    Table 7. Rough Estimates for the Number of Chips Per PU

This number of chips will readily fit in one printed circuit board or, better yet, a new packaging technology could be used: a multi-chip on a ceramic substrate technique which is being developed at Fairchild. As far as design is concerned, the substrate is analogous to a two-sided printed circuit board with single devices installed. In addition, a system package is being developed to connect these devices together with simple cam-operated connectors and backplanes.

It is important to point out that the number of 64 chips was obtained with a very superficial analysis of the circuit and only assuming the availability of quasi-standard IC's.
It is expected that with careful computer analysis of the possible partitions of the circuit and wide use of custom-made IC's, the number of MSI chips could go down to about 30 (this is the number reached if one divides the total number of equivalent gates in the PE (1.3K) by 60 to 70, the number of equivalent gates easily obtained nowadays in one MSI T²L chip).

The power dissipation per PE is quite acceptable. It is on the order of 15 watts, assuming an average of 10 mW per gate for T²L. A new low power T²L could be used to reduce this number by a factor of 5 to 10.

Finally, it should be mentioned that a number of simplifications could be adopted in the PE at a small cost in performance. Only careful simulation can decide whether the saving thus obtained justifies the loss in performance or versatility. Some of these simplifications are:

- make B unavailable as a value to D1 or D2
- do not use X_3
- use only 10 bits in address lines instead of 12
- make X_i count up only instead of up/down
- reduce A_c to 8 or 10 bits.

3.5 The Control Unit

The control unit has already been summarily described in Section 3.3. In this section, a few more details of the CU's structure and functions are presented but only in a macroscopic way, without getting to the gate level as was done with the PE.

3.5.1 CU General Structure

Figure 16 presents a diagram of the control unit structure. The components are:

a) CU Memory (CUM), which is a conventional, high speed random access memory in which SPEAC's instructions and CU data are stored. It can be replenished from mass memory and is accessed by the central processing unit and by the instruction lookahead unit.

b) Instruction Lookahead Unit (ILA) which fetches instructions from CUM and sends them to the instruction decoding unit. Since CUM is very fast, a sophisticated ILA is probably not necessary.

c) Instruction Decoding Unit (IDU) which performs basic instruction decoding and central indexing.
The instructions are identified as CU, PE, or I/O instructions and sent to the respective instruction processor along with their indexed addresses and other data.

d) Central Processing Unit (CPU) which is the CU instruction processor and responsible for the execution of CU instructions. It is basically a fast, highly parallel unit similar to one of ILLIAC IV's PE's. It should be compatible with the data formats used in the PE's. Therefore, for maximum versatility, it should also be microprogrammable like the PE instruction processor. The CPU is not completely independent from the PE array since it can send common operands to all PE's via CDB ("broadcasting") and can also receive data from the PE's. For this purpose, the CPU can send microsequences to the PE's via the CU Queue.

[Figure 16. CU Structure. Blocks shown: CUM (CU memory), to and from mass memory; ILA (instruction lookahead); MMI (mass memory interchange), control to mass memory; CPU (CU instruction processor); IOC (IO instruction processor and row gating), IO requests to MMI; PEIP (PE instruction processor), containing the μP (microprocessor) and the micro memory; CUQ (CU queue); PEQ1 (PE queue 1); IOQ (IO queue); FINST (final station), to PU array.]

e) I/O Instruction Channel (IOC) which is the I/O instruction processor and executes array I/O instructions. Like the other two instruction processors, it could be microprogrammable for maximum versatility. The IOC sends I/O requests to the mass memory interchange and control pulses to the row gating and I/O Buffer Register (IOBR). It can also send microsequences to the PE via the IO Queue.

f) PE Instruction Processor (PEIP) which is the third and last instruction processor, in charge of PE instructions. It is fully microprogrammable and can be divided into two parts. The first part is a microprocessor (μP) which executes the microprograms and sends microsequences to the PU via two queues, PE Queue 1 and PE Queue 2. The second part is a micromemory (μM) which stores the microprograms.
μM does not have to be a separate memory; part of CUM may be used as micromemory if this is the most economical scheme.

g) Four Queues which are: CU Queue (CUQ), PE Queue 1 (PEQ1), PE Queue 2 (PEQ2), and IO Queue (IOQ). These queues store microsequences sent by each instruction processor, absorbing fluctuations in the rate of generation of these microsequences, which enables the final station to keep the array as busy as possible.

h) Final Station (FINST) which analyzes the entries at the bottom of each queue and decides which microsequences to send to the array for optimum PE performance. It must also combine two queue entries into one PE microsequence since each queue entry is not a complete μsequence but a request to use one of the two pairs of buses in the PE's. FINST action will be explained in considerable detail in Section 3.5.3.

i) Mass Memory Interchange (MMI) which utilizes the several modules of mass memory in an optimum fashion, solving memory request conflicts. It receives requests from the following sources: CUP, IOC, Corner Memory and Peripherals.

3.5.2 Machine Synchronization - Events

Events are the means of synchronization in the machine; not only are they accessible to the user for problem-dependent synchronization (I/O and operations, for example) but they are also used by the microprograms to synchronize different microsteps executed in the PE's, CU and IOC. Each event is assigned an absolute number and it is basically a flip-flop; when OFF, the event did not occur and when ON, the event has occurred. A reasonable number of events is needed; 64 as a first approach, for example.

Therefore, synchronization is obtained with commands to "WAIT on event N" or "CAUSE event N." WAIT and CAUSE commands are attached to instructions and are recognized and obeyed at three units: CUP, IOC and FINST. Consider, for example, a CU instruction which needs as one operand a PE value sent via CDB.
The instruction goes to CUP which does any local processing needed and then issues the microsequence to CUQ. The microsequence contains a "CAUSE event N." The CU then idles on a "WAIT on event N." When the microsequence is executed, i.e., when the data needed from the array reaches the CUP, event N happens and CUP finishes execution of the instruction. This waiting time could be used by the CU for multiprocessing a serial program (a compilation, for example) being run simultaneously.

One must make sure that an event will not be considered "occurred" because the FF is ON from another use of the same event number. Therefore, the user does have the responsibility of "releasing" an event when the present use of that event number terminates. This may be done when the event is waited on for the last time, with a special type of wait (WAIT and RELEASE), or an event may be specifically reset with a RESET EVENT command.

The following event manipulation commands are desirable:

- Wait on a boolean function of events
- Cause an event depending on a boolean function of others
- Cause several events simultaneously.

Basically those commands are for program use only since microsequence synchronization must be very fast and must be done with single events.

It should be noticed that one would never wait directly on a boolean combination of events since this would require the boolean function to be evaluated at each clock to determine if the wait is over. The way to do this is to have, after each cause of the events that appear in the boolean function, a statement that evaluates the boolean combination and places the result on an extra event number N. Then the wait is simply on event N.

Care must be taken to avoid re-use of an event before its previous use is completed. Certain complicated cases may be confusing.
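Before turning to such a case, the basic event commands described above can be summarized in a small model. This is only a sketch: the class `EventFile` and its method names are hypothetical, `wait` is modelled as a non-blocking test that a unit would repeat on each clock, and the boolean combination is re-evaluated at cause time as suggested in the text.

```python
class EventFile:
    """Toy model of the event flip-flops (64 assumed, as a first approach)."""

    def __init__(self, n_events=64):
        self.flags = [False] * n_events
        self.combinations = []  # (input event numbers, boolean function, target)

    def define_combination(self, inputs, function, target):
        """Route a boolean function of events onto one extra event number."""
        self.combinations.append((tuple(inputs), function, target))

    def cause(self, n):
        """CAUSE event n, then re-evaluate every combination that mentions it."""
        self.flags[n] = True
        for inputs, function, target in self.combinations:
            if n in inputs:
                values = [self.flags[i] for i in inputs]
                self.flags[target] = bool(function(*values))

    def wait(self, n, release=False):
        """WAIT on event n; returns True when the waiting unit may proceed.

        With release=True this is WAIT and RELEASE: the flip-flop is turned
        OFF so that the event number can be used again.
        """
        if not self.flags[n]:
            return False
        if release:
            self.flags[n] = False
        return True

    def reset(self, n):
        """RESET EVENT n."""
        self.flags[n] = False


events = EventFile()
# Event 10 stands for "event 1 AND event 2"; the wait is then on event 10 only.
events.define_combination((1, 2), lambda a, b: a and b, 10)
events.cause(1)
first_try = events.wait(10)                  # event 2 has not occurred yet
events.cause(2)
second_try = events.wait(10, release=True)   # proceeds and frees event 10
```

Evaluating the combination only when one of its inputs is caused keeps the per-clock test down to a single flip-flop, which is the point made in the text.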
Consider, for example, the following program:

    Input 1         cause event #3
    PE-multiply     wait on event #3 and release it
    Input 2         cause event #3
    CU operation    wait on event #3

In the situation above, Input 1 may occur and cause event #3. Then, before the PE-multiply or Input 2 occurs, the CU operation may be executed; event #3 is ON, so there is no wait.

The possibility of symbolic event names handled by the hardware could be investigated; the hardware would automatically assign symbolic event numbers to the first available physical event flip-flop. This would free the user from keeping track of which events are available, and no set of events would have to be reserved for μsequence use. However, the user would still have to release events.

Note also that with the present scheme, it is necessary to divide the events into two sets: user events and internal events. The latter will be used by the microprograms to synchronize the execution of microsequences.

3.5.3 Queue System and FINST

Queue entries can be considered as requests to use part of a PE. These requests are serviced by FINST which, if possible, combines two entries from different queues into a PE microsequence and sends the microsequence to the PE's. The purpose of FINST and the queue system is to keep both pairs of PE buses (A1, D1 and A2, D2) as busy as possible.

The basic principle involved is dynamic bus allocation; i.e., each queue entry does not ask specifically for use of bus 1 or 2, it asks for either a) any bus, or b) the bus that has access to the PEM module containing the address stored in X_i (i = 1, 2, or 3). Requests of type a are made for inter-register transfers, in which it is immaterial which bus is actually used; requests of type b are necessary for memory transactions since for these a specific bus must be used.
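The two request types can be sketched as a small eligibility test. The sketch makes several assumptions that are not stated at this point in the text: buses and PEM modules are both numbered 0 and 1, bus j is the one with access to module j, a request that uses no address register is coded as 0, and the CU keeps a record (`module_of_x`) of the module each X_i currently points into. The names `eligible_buses` and `pick_bus` are hypothetical.

```python
from typing import Dict, Optional, Set

BUSES = (0, 1)


def eligible_buses(x_field: int, module_of_x: Dict[int, int]) -> Set[int]:
    """Buses that may serve a queue entry.

    x_field == 0: type a request (inter-register transfer), any bus will do.
    x_field == i: type b request (memory transaction through X_i), only the
    bus with access to the module X_i points into (assumed: bus j <-> module j).
    """
    if x_field == 0:
        return set(BUSES)
    return {module_of_x[x_field]}


def pick_bus(x_field: int, module_of_x: Dict[int, int],
             idle: Set[int]) -> Optional[int]:
    """Lowest-numbered idle bus that can serve the request, or None."""
    candidates = eligible_buses(x_field, module_of_x) & idle
    return min(candidates) if candidates else None


# X_1 and X_3 point into module 0, X_2 into module 1 (example values).
pointers = {1: 0, 2: 1, 3: 0}
```

Leaving the choice open for type a requests is what lets the allocator fill whichever bus happens to be free.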
Therefore, under dynamic bus allocation, CUP, PEIP and IOC do not specify the microsequences completely; FINST will dynamically allocate buses to the partial microsequences in the best possible way.

3.5.3.1 Queue Structure

Each queue entry contains basically a partial microsequence and information which is used by FINST. The fields of a queue entry are illustrated in the upper part of Figure 17. All four queues have the same structure although only Queue 2 has been detailed.

[Figure 17. Queues and FINST Structure. Queue 0: IOQ; queue 1: CUQ; queue 2: PEQ1; queue 3: PEQ2. Queue entry fields: X (2 bits), C (6), μSB, μSC, CAU, CA (12), CDU, CDR, CD (4), WEVU, WEV (6), CEVU, CEV (6). FINST registers: FFX1, FFX2, FFX3, FFC0, FFC1, BA0, BA1, BC0, BC1, CDBR, CDBRU.]

The fields are as follows:

X: address field (2 bits). X = 0 means the address register is not used; i.e., we have a data transfer and not a memory fetch. X = i (where 1 ≤ i ≤ 3) means the address register X_i in the PE's will be used in this microsequence.

C: counter field (~6 bits). C = 0 means the microsequence is a no-op. C = n means that when a bus is assigned to that queue, then this microsequence and the next n-1 will be processed consecutively.

μS: these are fields that contain the partial microsequence.

    μSB: bus-dependent microsequence field (~23 bits). This is the part of the microsequence related to the bus used.

    μSC: bus-independent microsequence field (~55 bits). This is the part of the microsequence related to control that does not use buses.

CAU: use of CAB field (1 bit). CAU ON means CAB will be used and must be set to the value stored in CA.

CA: common address field (12 bits). This contains the value to be used as common address.

CDU: use of CDB field (1 bit). CDU ON means CDB_in will be used and must be set to the value stored in CD.

CDR: common data receive field (1 bit). When ON, CDB_out will be used to receive data from the PU's; this data must be stored in CDBR.

CD: common data field (4 bits).
This contains the value to be used as common data.

EV: these are fields that control events.

    WEVU: wait event use field (1 bit). When ON, this entry must await an event whose number is stored in WEV.

    WEV: wait event field (~6 bits). This contains the number of an event to be waited on.

    CEVU: cause event use field (1 bit). When ON, this entry must cause an event whose number is stored in CEV.

    CEV: cause event field (~6 bits). This contains the number of an event to be caused.

The bus-dependent microsequence field must be further explained. It can be divided into two sub-fields: μSBa and μSBb. μSBa, with 8 bits, corresponds to the control wires to gate into buses D and A (3 wires for each) and to control PEM (2 wires). In the actual microsequence, this field appears twice: once for each bus pair. μSBb, with about 15 bits, corresponds to the control wires to gate from buses D and A. The values of the bits in this field of a queue entry have a special meaning: a ZERO means that the corresponding control is not used in this microsequence and a ONE means that the control is used (i.e., the final microsequence must have in that position the appropriate bit to load from the bus that has been assigned to that queue entry).

3.5.3.2 FINST Structure and Operation

The structure of the final station will not be presented in detail; only the major registers and their uses are discussed and a few considerations are offered on the output logic of FINST (i.e., the part that merges together two queue entries and assembles the microsequences).

The major registers of FINST are illustrated in Figure 17 and are as follows:

FFXi (i = 1, 2, 3): address control FF (1 bit). FFXi = j means that in the array, all Xi registers have addresses pointing into memory module j (j = 0, 1). These flip-flops are automatically set by the CU (i.e., the FINST) every time a microsequence is sent in which the bit that controls gating into Xi is ON.
The setting is based on the contents of the CA field in that microsequence. Local modifications (as in local indexing) of Xi cannot change the module it points to. This condition can easily be checked within each PE and causes an interrupt (just monitor the carry from the address registers). Besides the automatic setting, FFXi should also be settable by the programmer for special applications.
FFCi (i=0,1): conflict FF (1 bit). These are the conflict flip-flops, set either when the bus could not be assigned or when one or two of the bus assignments is not used on a particular clock because of bus conflicts or because the queue is empty.
BAi (i=0,1): bus assignment register (2 bits). When BAi = j, bus i is assigned to queue j, with j ∈ {0,1,2,3}.
BCi (i=0,1): bus counter (~6 bits). When BCi = j, there are j microsequences left to be performed before the bus can be reassigned; BCi = 0 means that bus i is idle.
CDBR: common data bus register (4 bits). This is the register where values placed in CDB_out by the PE's are stored.
CDBRU: common data bus register use (1 bit). When equal to 1 it means that CDBR is in use; i.e., a result placed in it has not been removed by the CU and therefore CDBR cannot be reused before the CU frees it by resetting CDBRU.

The FINST decision procedure is now described: at each clock, FINST must decide to which of the four candidates the use of the PE buses will be assigned. Once a request from a queue is granted, the next (C) requests from that same queue must be obeyed before the bus can be reassigned (where (C) is the contents of the counter field). This assures the microprogrammer that, once control is obtained, it will be retained for a number of microsequences, enabling the completion of a procedure before a new bus assignment destroys needed data.
Therefore, groups of microsequences that must be executed sequentially, without interruption, are "linked" together by placing in the counter field of the first queue entry the number of microsequences in the group.

The FINST decision procedure is illustrated in Figure 18 by a flowgraph. If a bus counter register in FINST is zero, the corresponding bus is idle and an attempt is made to assign it. The order in which assignment attempts are made is, in Figure 18: IOQ, CUQ, PEQ1, and PEQ2. This attempts first to get the I/O done. This assignment hierarchy, in an actual implementation, would probably be dynamic and selectable by the programmer instead of fixed. Section 3.5.5 discusses a situation in which a dynamic assignment hierarchy is required.

[Figure 18. FINST Action Flowgraph]

The following observations should be made with respect to the flowgraph in Figure 18:

- The notation (Top Queue j:C) means the contents of field C of the entry at the top of Queue j.
- A queue is empty either when it is physically empty or when it is flagged WAIT on an Event that has not occurred yet.
- There is a CAB or CDB conflict when the following expression (where TQi means top queue i) is true:

  (a) (TQ(BA0):CAU)=1 AND (TQ(BA1):CAU)=1 AND (TQ(BA0):CA)≠(TQ(BA1):CA)
  OR
  (b) (TQ(BA0):CDU)=1 AND (TQ(BA1):CDU)=1 AND (TQ(BA0):CD)≠(TQ(BA1):CD)
  OR
  (c) (TQ(BA0):CDR)=1 AND (TQ(BA1):CDR)=1
  OR
  (d) ((TQ(BA0):CDR)=1 OR (TQ(BA1):CDR)=1) AND CDBRU=1

where the term (a) takes care of CAB conflicts, the term (b) detects CDB_in
conflicts, the term (c) detects CDB_out conflicts, and the term (d) takes care of CDBR use conflict (i.e., CDBR has not yet been used after being set by a previous operation).

It should be pointed out that the decision procedure outlined in Figure 18 is only a basic algorithm. A few sophistications would have to be introduced in an actual implementation; specifically: a) the procedure should also be able to handle efficiently microsequences that do not require the use of any bus, and b) the possibility of deadlock should be considered and steps taken to avoid it.

[Figure 19. Final Microsequence Assembly in FINST]

Figure 19 illustrates the part of FINST that merges the two selected queue entries together and "assembles" the microsequence. Gate control selects which of the four possible inputs to each bus is actually gated into the bus; queue i is gated into the bus if i is the value of the expression written in each gate control box. Briefly, the assembly procedure is as follows: CDB_out is gated into CDBR if the CDR field of any of the two selected queue entries is ON; CDB_in is set from the CD field of the selected entry, if any, that has field CDU ON; CAB is obtained from the CA field of the selected entry, if any, that has field CAU ON. Field µSC of the final microsequence is the OR of these fields in the two selected entries. A check for conflicts would be necessary at this point to make sure that the two µSC fields are compatible to be OR'ed together; i.e., the actions determined by one of the entries must not conflict with the actions determined by the other.
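The CAB/CDB conflict expression with terms (a) through (d) can be illustrated with a short sketch. This is only a software model under stated assumptions, not the FINST logic itself: the `QueueEntry` class and the `bus_conflict` function are hypothetical names, and only the five queue-entry fields that appear in the expression are modeled.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    """Hypothetical model of the top-of-queue fields used by the conflict test."""
    cau: bool = False  # CAU: entry wants to drive CAB
    ca: int = 0        # CA: common address value (12 bits)
    cdu: bool = False  # CDU: entry wants to drive CDB_in
    cd: int = 0        # CD: common data value (4 bits)
    cdr: bool = False  # CDR: entry wants to receive on CDB_out


def bus_conflict(tq0: QueueEntry, tq1: QueueEntry, cdbru: bool) -> bool:
    """Evaluate terms (a)-(d) for the entries selected by BA0 and BA1."""
    cab = tq0.cau and tq1.cau and tq0.ca != tq1.ca       # term (a)
    cdb_in = tq0.cdu and tq1.cdu and tq0.cd != tq1.cd    # term (b)
    cdb_out = tq0.cdr and tq1.cdr                        # term (c)
    cdbr_busy = (tq0.cdr or tq1.cdr) and cdbru           # term (d)
    return cab or cdb_in or cdb_out or cdbr_busy


# Two entries broadcasting the same common address can share CAB.
same = bus_conflict(QueueEntry(cau=True, ca=5), QueueEntry(cau=True, ca=5), False)
# Two entries broadcasting different common addresses cannot.
diff = bus_conflict(QueueEntry(cau=True, ca=5), QueueEntry(cau=True, ca=6), False)
print(same, diff)
```

Note that two entries asking for the same CA (or the same CD) value do not conflict, which is why terms (a) and (b) compare the values and not just the use bits.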
As explained previously, field µSBa appears twice in the microsequence, once for each bus pair. Therefore, µSBa0 is obtained from the µSBa field of the entry selected by BA0 and µSBa1 is obtained from the µSBa field of the entry selected by BA1. Finally, field µSBb is simply taken out of field µSBb of the entry selected by BA1. A conflict is also possible at this point: fields µSBb of the two selected entries should yield a zero when AND'ed together, bit by bit. If this is not the case, there is a conflict in the µSBb fields. It should also be pointed out that every gate control box is inhibited by the conflict flip-flops FFCi; i.e., when FFCi is ON, no field from the entry selected by BAi is used in the assembly of the microsequence.

3.5.4 The PE Instruction Processor

The basic structure of the PE instruction processor is presented in Figure 20. The components are:

a) A macroinstruction register (MIR) which holds the op code and variant field of the macroinstruction being processed. This register is initialized by IDU and is accessible to the microprocessor to be used in controlling microprogram fetch and in arithmetic and masking operations.

b) A microinstruction register (µIR) which holds the op code and addresses of the microinstruction being executed.

c) A micro-memory (µM) which holds the microprograms.

d) A PEIP busy flip-flop (PEIPB) which is turned ON by IDU when a macroinstruction is delivered to the microprocessor and is turned OFF by the PEIP logic when the last microinstruction of the macroinstruction has been processed. This signals IDU that the microprocessor is idle and ready to receive the next macroinstruction.

e) A subroutine push-down stack used in controlling execution of subroutines by the microprocessor. Each entry in the stack contains three fields: a start address field which holds the address in which the subroutine starts; a return address field which holds the address of the first instruction following the subroutine; and a repeat count field containing the number of times the subroutine is to be executed.

f) A group of local registers which is used to hold intermediate results in arithmetic operations. The contents of the local registers can be used in assembling the different fields of the partial microsequences to be fed into the PE queues: PEQ1 and PEQ2. Finally, the local registers are also accessible to the IDU which initializes them with the instruction addresses and other instruction data. In this connection, MIR can be considered a local register and it is assigned local register number 1. The other local registers are numbered in sequence and they are accessed by their local register number. Sixteen local registers are proposed, each 12 to 16 bits long.

g) An arithmetic unit capable of performing fixed-point operations on short words: 12 to 16 bits is enough. At least addition, subtraction and multiplication are available (integer division and modulo operations are also useful). The operands are either the contents of specified local registers or literals. The results are placed in a specified local register.

[Figure 20. Basic PEIP Structure. Components shown: macroinstruction register MIR (op code, variant), PEIPB, microinstruction register µIR (op code, reg, imm bit, 1st address, 2nd address), micro-memory µM, subroutine stack (start address, return address, repeat count), arithmetic unit, and the local registers (~16 bits each). Preferential use of the local registers: 2 to 4, first to third address; 5 to 7, first to third address index; 8, precision of first operand; 9, precision of second operand; 10, precision of result; 11 and above, scratch.]

An arithmetic unit is needed to enable microprograms to accept dynamically specified parameters such as word length, number of addresses, etc., since it is obviously extremely inefficient to have one complete microsequence stored for each small variant of a basic instruction.
This also determines the need for a number of relatively sophisticated microinstructions; for example, subroutine calls. The suggested microinstruction repertoire is presented in Table 8. This repertoire allows very efficient microprograms with respect to µM use. It is assumed that the microprocessor is fast enough to allow an average output of one partial microsequence each 100 nsec. Fluctuations in this rate are absorbed by the queues.

As indicated in Figure 20, the microinstruction format uses the following fields: op code, local register number (LR), immediate bit (IMM), and two addresses, A1 and A2, each as long as a local register. The use of these fields for each microinstruction is detailed in Table 8. The immediate bit qualifies the first address; if IMM is ONE the first address contains an immediate operand instead of a local register number.

The partial microsequences are generated in pairs, assuming optimal conditions; i.e., assuming that both buses will be available. The first partial microsequence in each pair is placed in PEQ1 and the other one is placed in PEQ2 so that if both buses are available they will be executed simultaneously and if not they will be executed sequentially. Events are used to coordinate the draining of the queues as needed. One extra bit in the queues may be needed to signal a request for the simultaneous execution of a partial microsequence from PEQ1 and one from PEQ2 as is required in a swap of registers. This will also necessitate a change in assignment hierarchy or else the array will idle for a long period waiting for both buses to become available.

  Op Code (mnemonic)   Description
  CALL                 Subroutine call; executes (A2) times the subroutine starting at µM address (A1)
  RETURN               Marks the end of a subroutine or the end of a microprogram
  GOTO                 Transfers control to the microinstruction in µM address (A2)
  IF                   If (LR) masked by (A1) is all 1's then transfers control to the microinstruction in µM address A2
  ADD                  Add (A1) and (A2) and place the result in LR
  SUB                  Subtract (A1) from (A2) and place the result in LR
  MULT                 Multiply (A1) and (A2) and place the result in LR
  µSEQ                 Emit a partial microsequence to PEQ1 or PEQ2

  Table 8. Microinstruction Repertoire

The microinstruction µSEQ must be able to "assemble" a partial microsequence (placing in each field either a literal or the contents of a specified local register) and place it either in PEQ1 or PEQ2. Therefore, this microinstruction is unreasonably large and requires about 100 bits of data. This shows the need for a microinstruction with a variable number of bits (just as is the case of macroinstructions) to optimize memory use since the µSEQ microinstruction takes so much more space than the other microinstructions.

3.5.5 IDU and Instruction Format

Central indexing is decoded and performed by the IDU which hands the resulting addresses to the three instruction processors. The detailed instruction format is illustrated in Figure 21. Instructions are composed of a variable number of "chunks," each 12 to 16 bits long. A chunk may be an address, an op code or some other type of data. The smallest instruction contains only two chunks: IDU information and op code.

[Figure 21. Detailed Instruction Format. First chunk, IDU information: instruction type, indexed addresses, number of addresses, number of chunks. Second chunk: variant and op code. Following chunks: addresses and other chunks. The number of addresses, number of chunks, and variant together form the total variant field.]

The four fields in the first chunk (11 bits) contain information used by IDU:

a) The instruction type field, with 2 bits, indicates whether the instruction is a CU, I/O or PE instruction, enabling IDU to send the instruction to the appropriate processor.

b) The indexed addresses field has 3 bits.
If bit i is on, then the i-th address is to be indexed. The following convention is adopted for the order in which base addresses and index addresses are presented:

   third chunk: first base address
   fourth chunk: if the first address is indexed, then it is the address index for the first address, else it is the second base address
   etc.

c) The number of addresses field indicates how many of the chunks following the first two are addresses.

d) The number of chunks field gives the total length of the instruction.

These last two fields are also sent as part of the variant field since they are needed by the processors.

IDU places an instruction in an instruction processor as follows: initialize the instruction register with op code and total variant field; initialize the first three local registers with the addresses, but do not change a register to which an address was not given in the present instruction; then initialize the next local registers with the extra chunks in the order given; the instruction processor decides what to do with them.

3.6 Mass Memory

A survey was conducted on the state of the art of mass storage systems including bulk magnetic core, fast disks, fast drums and semiconductor memories. Fast magnetic drum (at one-half cent per bit) or disk (as low as one-twentieth of one cent per bit) could be used as the mass memory since they have a significant price advantage over the other two systems. However, being cyclic, these systems would introduce synchronization problems and/or latency time waits. Therefore, while disks are still being considered as a possible very-large-capacity back-up for mass memory, the choice for the actual mass memory is a random access system: bulk core or semiconductor.

CDC bulk core model 6636 was picked as a sample of what is now available.
Its characteristics are:

- 7.5 million bits per module
- the maximum number of modules is four
- cycle time: 3.2 µsec; access time: 1.6 µsec
- up to four modules can be interleaved
- the transfer rate is 25 to 100 million 6-bit chars per second
- it fetches in long words of 480 bits
- its cost is approximately three cents per bit.

It is expected that in the near future, the price of bulk core will drop to below one cent per bit. Assuming the availability of units of this price and with cycle times as above, a unit fetching in 512-bit words could be used as SPEAC's mass memory.

As for semiconductor memories, the main advantage core has over any semiconductor type is the ability to be non-volatile. Semiconductor memories are already available for less than three cents per bit although the price always goes up for special configurations like the long word that is needed in SPEAC's mass memory. Since semiconductor is so much faster than bulk core, one might attempt to multiplex a narrower word but faster semiconductor memory to achieve the desired word length and access time. In addition, a large memory of shift registers might be considered. A special design would be easier to achieve and control can be maintained over synchronization and latency problems.

Therefore, mass memory will be a random-access unit: bulk core or semiconductor, depending on economic considerations. It is assumed that several modules of mass memory will be overlapped under the control of the mass memory interchange (MMI) so that conflicts between mass memory access requests from different sources will be infrequent. An average cycle time of 2 µsec (1 µsec access time) for the mass memory has been assumed in all timing estimates.

3.7 I/O Buffer Register

The structure of the I/O buffer register (IOBR) is illustrated in Figure 22.

[Figure 22. I/O Buffer Register Structure. IOBR_r is connected to and from mass memory and receives from the row gating; IOBR_l sends to the row gating.]

The register is divided in two parts: a right part (IOBR_r) and a left part (IOBR_l). Each part is as long as a mass memory word: 512 bits. IOBR_r is connected to the mass memory and is the actual buffer register; it can also receive data from the row gating (128 hexadecimal digits, one from each PE in a PE row). IOBR_l is needed to achieve routing capability in SPEAC; it can send data to the row gating. IOBR as a whole can be shifted end around, left or right in 4-bit (one hexadecimal digit) increments.

In order to achieve good routing speed, it is vital that IOBR can be shifted by any distance (from 1 to 127 digits) in only a few clock periods. This poses an interesting minimization problem: how many direct shift paths should be implemented in order to obtain any shift in a given number of clocks? Also, a few distances are especially important and the corresponding shifts should be particularly fast; this is the case with powers of two since routes by a power of two appear much more frequently than other routing distances as they are used in log-sums, Fast Fourier transforms, etc. Finally, there is the important economic restriction of keeping the number of direct shift paths at a minimum since for each path one needs roughly one gate per bit and there are 1024 bits in IOBR.

It was decided that a minimum of 7 direct shift paths are needed with the following direct shift distances: 128 left (this is vital to the operation of both I/O's and routes), 1 right and left, 32 right and left, and 8 (or 4) right and left. This scheme enables one to perform any shift in not more than 7 clocks. The worst case is distance 52 (50 if one uses 4 instead of 8). Moreover, shifts by a power of two take not more than 4 clocks and most take only one or two. At a cost of 2K more gates, one could implement 9 direct paths (128 left, 1 left, 1 right, 2 left, 2 right, 8 left, 8 right, 32 left, and 32 right) for a worst case shift of 5 clocks.
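The shift counts quoted above can be checked with a small breadth-first search. This is a sketch under simplifying assumptions, not part of the design: `min_shift_counts` is a hypothetical helper, shifts are modeled as signed integer offsets in digits (left positive, right negative), every listed distance is available in both directions, and the 128-left path is left out because only distances 1 to 64 are examined.

```python
from collections import deque


def min_shift_counts(step_sizes, max_distance, bound=256):
    """Minimum number of elementary shifts for each distance 1..max_distance.

    step_sizes: direct shift distances, each usable left (+) or right (-).
    bound: assumed limit on intermediate offsets explored by the search.
    """
    moves = [m for s in step_sizes for m in (s, -s)]
    cost = {0: 0}
    queue = deque([0])
    while queue:
        position = queue.popleft()
        for move in moves:
            nxt = position + move
            if abs(nxt) <= bound and nxt not in cost:
                cost[nxt] = cost[position] + 1  # BFS: first visit is shortest
                queue.append(nxt)
    return {d: cost[d] for d in range(1, max_distance + 1)}


with_8 = min_shift_counts([1, 8, 32], 64)
with_4 = min_shift_counts([1, 4, 32], 64)
print("paths 1, 8, 32: distance 52 needs", with_8[52], "shifts")
print("paths 1, 4, 32: distance 50 needs", with_4[50], "shifts")
print("paths 1, 4, 32: worst case over 1..64 is", max(with_4.values()))
```

Under these assumptions the search gives 7 elementary shifts for distance 52 with paths of 1, 8 and 32, and 7 for distance 50 with paths of 1, 4 and 32, in agreement with the worst cases stated in the text.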
It is assumed for the remainder of the paper that 7 paths were implemented. This represents an investment of about 12K gates in IOBR which is a reasonable price to pay to achieve routing and I/O buffering for the whole machine. Table 9 presents the number of elementary shifts needed to shift a number by any distance from 1 to 64 when the direct paths are: 128 left, 1 left, 1 right, 32 left, 32 right, 4 left, and 4 right.

   A  B     A  B     A  B     A  B     A  B     A  B     A  B     A  B
   1  1     9  3    17  5    25  4    33  2    41  4    49  6    57  5
   2  2    10  4    18  6    26  4    34  3    42  5    50  7    58  5
   3  2    11  4    19  5    27  3    35  3    43  5    51  6    59  4
   4  1    12  3    20  4    28  2    36  2    44  4    52  5    60  3
   5  2    13  4    21  5    29  3    37  3    45  5    53  6    61  4
   6  3    14  5    22  5    30  3    38  4    46  6    54  6    62  4
   7  3    15  5    23  4    31  2    39  4    47  6    55  5    63  3
   8  2    16  4    24  3    32  1    40  3    48  5    56  4    64  2

   A - shift distance
   B - number of elementary shifts

   Table 9. Number of Elementary Shifts for Each Shifting Distance

4. SPEAC's OPERATION

4.1 Generalities - Data Format

The algorithms used in performing the most important instructions will be outlined in this section and timing estimates will be presented. The timing is based only on a count of the PE clocks necessary to perform the instruction; no CU delays were taken into account. Therefore, the estimates neglect CU instruction fetching, decoding and central indexing times. Also neglected is the time taken by the CU to execute microprogram control instructions; i.e., microinstructions that do not generate microsequences. These approximations are justified by the assumption that the CU is, on the average, faster than the PE's (CU clock rate is about twice PE clock rate) and the queues insure that PE's will not have to wait for the CU.

The timings are also a function of how much overlap is possible when the instruction is executed; i.e., how many buses are available for the PE instruction use.
This factor depends on the assignment hierarchy used by FINST, on the location of the operands in PEM and on how much I/O is taking place when the instruction is executed. In the timings, at least one bus is assumed always available for PE instructions (or else the worst case times will obviously be infinity). Sometimes two timings are given: the "normal" one, with only one bus available and the "optimum" timing, assuming maximum overlap (two buses are available). CDB and CAB bus conflict is also a possible cause of delays which were not taken into account in the times since they depend on how much I/O is going on. However, these delays are expected to be negligible in a PE with three address registers.

As discussed in Section 3.1, item g, the machine accepts any word format, since there is nothing in the hardware to "freeze" the data representation. Of course, adequate microprograms must be written to deal with a desired word format. An arbitrary (and quite conventional) format for floating-point numbers was picked and used in the timings. This representation will be called the "standard format" and is as follows: a number appears in PEM as indicated in Figure 23.

[Figure 23. Standard Floating-Point Format: exponent digits e_{n_e} ... e_1 e_0 followed by mantissa digits m_{n_m} ... m_1 m_0]

Each PEM location contains one hexadecimal digit. The location of m_0 is the low memory address. There are N_e = n_e + 1 exponent digits and N_m = n_m + 1 mantissa digits. The mantissa is in sign and magnitude and the sign is in bit e_00; i.e., the low order bit of the LSD of the exponent. Therefore the exponent has 4N_e - 1 bits since one bit of the exponent is used for the mantissa sign. The exponent base is 16 and the exponent is represented in excess notation. The number A represented in Figure 23 has a value given by:

   A = (-1)^{e_{00}} \times \left( m_{n_m} 2^{-4} + \dots + m_1 2^{-4 n_m} + m_0 2^{-4(n_m+1)} \right) \times 16^{E}

where E, the exponent, is given by:

   E = e_{01} + e_{02}(2) + e_{03}(2^2) + e_{10}(2^3) + \dots + (e_{n_e 3} - 1)(2^{4 n_e + 2})

If a floating-point number is normalized, m_{n_m} ≠ 0; i.e., at least one of the four bits of m_{n_m} is one.

A particularly important length for a floating-point number is 32 bits, which was often taken as the standard floating-point number in this section. A 32-bit floating-point number has one mantissa sign bit, a base 16 exponent with 7 bits and a 24-bit mantissa.

4.2 Local Indexing

Operand addresses are sent to one of the PE address registers via CAB. Only one clock is needed to transmit an address in this fashion. Then, if required, any address may be locally indexed at a maximum cost of 1.6 µsec (16 PE clocks) per indexing.

The microsequence to perform local indexing is presented in Table 10; the notation is explained in the introduction of Appendix B. It is assumed that the address to be indexed (x_2 x_1 x_0) is loaded in X_1 and the index is i_2 i_1 i_0.

   Microsequence                                                Comments
   X_2 <- CAB (address (i_0))                                   Put in X_2 the address of the index
   A_c <- X_1                                                   Transfer address to be indexed to A_c
   B <- PEM (X_2); shift A right 4                              Fetch i_0 and place x_0 in A_m
   Wait for PEM fetch
   Wait for PEM fetch
   A_r <- (B + A_m); lcFF4 <- C_{n+1}; Incr X_2                 Add i_0 and x_0 and place in A_r
   B <- PEM (X_2); shift A right 4                              Fetch i_1 and place x_1 in A_m
   Wait for PEM fetch
   Wait for PEM fetch
   A_r <- (B + A_m); C_n = lcFF4; lcFF4 <- C_{n+1}; Incr X_2    Add i_1 and x_1 and place in A_r
   B <- PEM (X_2); shift A right 4                              Fetch i_2 and place x_2 in A_m
   Wait for PEM fetch
   Wait for PEM fetch
   A_r <- (B + A_m); C_n = lcFF4; lcFF4 <- C_{n+1}              Add i_2 and x_2
   Shift A right 4; interrupt on lcFF4 ON                       Shifting A will place x+i in A_c
   X_1 <- A_c                                                   x+i is returned to X_1; an overflow in the indexing causes an interrupt

   Table 10. Microsequence for Local Indexing

In conclusion, local indexing is relatively fast (about 7% of the time for a 32-bit floating-point multiplication) and the procedure does not penalize the users that do not need it since it is performed only when the instruction variant field is adequately set. Also, the microsequence presented can be significantly speeded up if one knows that the index is less than three hexadecimal digits long.

4.3 Multiplication

Two mantissas A and B, each with N hexadecimal digits, are to be multiplied. Using the notation of expressions 1 and 2 in Section 3.2, the following steps are performed:

1) load a_0 from memory into register B
2) load b_0 from memory into register A_r
3) set to zero the remainder of register A; i.e., A_c and A_m
4) multiply a_0 and b_0 using four "add and shift" commands. At the end of this step, A_m and A_r will contain the two-digit product of a_0 and b_0; b_0 was destroyed and A_r now contains m_0
5) if a double precision product is desired, store m_0 = (A_r) into memory; skip this step if a single precision product is to be obtained
6) increment by 1 the contents of register X_2 (it is assumed that initially X_1 contains the address of a_0 and X_2 contains the address of b_0). Therefore, X_2 now contains the address of b_1
7) load b_1 from memory into register A_r
8) multiply b_1 (in register A_r) and a_0 (in register B) as described in step 4; note that the "carry" of the previous multiplication is automatically added to the product
9) increment X_1 by 1, decrement X_2 by 1
10) shift register A left 4 bits which vacates A_r
11) reload registers A_r and B; multiply
12) A_r now contains m_1, which can be stored or discarded

And so on, following the algorithm of Section 3.2. To determine digit m_i of the product (i ≤ n), the cycle: [increment X_1, decrement X_2, load B, load A_r, multiply, shift left 4] is repeated i+1 times.
On the first cycle only one increment-load is performed and on the last cycle there is no shift left 4. It has already been mentioned that in single precision, product digit m_n is the first digit of the product which is not discarded and it can be stored in b_0's position (or a_0's). If at the end of the multiplication m_{2n+1} does not equal zero, a normalization is needed; each hexadecimal digit is read and restored shifted right. Therefore, m_n is discarded, m_{n+1} becomes the low order digit and m_{2n+1} the high order one.

This algorithm is general and can handle mantissas with any number N of digits. The introduction of the scratchpad memory, however, results in a remarkable improvement in the procedure, especially for N not greater than 16 (which includes most practical applications). The method consists of overlapping a multiplication of two digits with the fetch of a third digit which is temporarily stored in sM and will be used in a subsequent multiplication. This is always possible because the "add and shift" command used in multiplication does not need any PE bus; a bus is thus left available for the fetch. Since the multiplication takes 4 clocks and a fetch only 3, there is still time to increment the address register used (preparing for the next fetch) and to reload B concurrently with its last use in an "add and shift." A fifth clock is required to reload A_r and a sixth if it is necessary to store A_r in sM before reloading A_r. The procedure described is listed in Appendix B, note a, under the name MF (for multiply and fetch). It should also be pointed out that the result is now first stored in sM and only after normalization is written in PEM, which avoids the relatively slow process of rereading and restoring in PEM only to normalize.
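The digit-serial product can be modeled in software as follows. This is a minimal sketch, assuming that each product digit m_i is formed from the column sum of the a_j × b_(i-j) terms plus the carry from the previous column; the function names are hypothetical, and the register-level "add and shift" sequence and the sM overlap are not modeled.

```python
def to_digits(value, count):
    """Split an integer into `count` hexadecimal digits, low-order digit first."""
    return [(value >> (4 * k)) & 0xF for k in range(count)]


def from_digits(digits):
    """Rebuild an integer from hexadecimal digits, low-order digit first."""
    return sum(d << (4 * k) for k, d in enumerate(digits))


def multiply_hex_digits(a, b):
    """Multiply two N-digit mantissas column by column; returns 2N digits."""
    n = len(a)
    product = []
    carry = 0
    for i in range(2 * n - 1):
        # Column i collects every a[j] * b[i - j] with both indices in range;
        # for i <= n - 1 this is exactly i + 1 digit multiplications.
        for j in range(max(0, i - n + 1), min(i, n - 1) + 1):
            carry += a[j] * b[i - j]
        product.append(carry & 0xF)  # digit m_i
        carry >>= 4                  # carried into the next column
    product.append(carry & 0xF)      # high-order digit of the product
    return product


a = to_digits(0xABCDEF, 6)
b = to_digits(0x123456, 6)
m = multiply_hex_digits(a, b)
print(hex(from_digits(m)))
single_precision = m[len(a):]  # truncation keeps the N high-order digits
```

The slice on the last line reflects only simple truncation of the double precision product; normalization and the exponent adjustment are separate steps in the text.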
The time required to multiply two mantissas (each N digits long) can now be estimated: N² executions of MF are required, taking 5 clocks each; N product digits must be stored, which takes 6 clocks per digit (see function ST in Appendix B, note d) and N more clocks to store the product temporarily in sM. Finally, about 13 clocks are necessary for initialization and control. Therefore:

   T_m \approx 5N^2 + 7N + 13, \quad N \le 16        (1)

where T_m is the time for mantissa multiplication in clocks. Since each clock takes 100 nsec, for N = 8 a T_m of 40 µsec is obtained.

4.3.1 Floating-point Multiplication

The algorithm is relatively simple: initially the mantissas are multiplied as described previously and the normalized single precision product is stored. A_m is left with a 1 if normalization was performed (i.e., if m_{2n+1} = 0) and with a 0 otherwise. A_m is then subtracted from the first exponent and the second exponent is added to the difference, which yields the exponent of the result. Five extra clocks are needed to detect exponent overflow or underflow and to recode the exponent of the result in excess representation as explained in Appendix B, note f. The sign of the result is obtained from the exclusive-OR of the signs of the factors.

Timing estimate: for two floating-point numbers with N_m digits in the mantissa and N_e digits in the exponent, the mantissa product will take (from (1)) about 5N_m² + 7N_m + 13 clocks; exponent manipulation takes about 4 clocks per digit plus 6 clocks per digit for storage and about 5 clocks for control. The final expression is:

   T_{fpm} = 5N_m^2 + 7N_m + 10N_e + 18, \quad N_m + N_e \le 16        (2)

where T_fpm is the time for floating-point multiplication in clocks.

For the "standard" 32-bit floating-point number, N_e = 2 and N_m = 6 which yields T_fpm = 27 µsec. For this case, the precise microsequence is presented in Appendix B and the results obtained are as follows: normal time = 25 µsec; optimum time = 24 µsec.
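Expressions (1) and (2) are easy to evaluate for other word lengths. The helper below is a sketch only: the function names are hypothetical, and the 100 nsec clock is the figure assumed in the text.

```python
CLOCK_NSEC = 100  # PE clock period assumed in the timing estimates


def mantissa_multiply_clocks(n):
    """Expression (1): clocks to multiply two n-digit mantissas (n <= 16)."""
    return 5 * n * n + 7 * n + 13


def fp_multiply_clocks(n_m, n_e):
    """Expression (2): clocks for a floating-point multiplication."""
    return 5 * n_m * n_m + 7 * n_m + 10 * n_e + 18


def clocks_to_usec(clocks):
    """Convert a PE clock count to microseconds."""
    return clocks * CLOCK_NSEC / 1000


# Mantissa product for N = 8 digits: 389 clocks, i.e. about 40 microseconds.
print(mantissa_multiply_clocks(8), clocks_to_usec(mantissa_multiply_clocks(8)))
# A 64-bit operand pair with N_m = 12 and N_e = 4: 862 clocks, about 86 microseconds.
print(fp_multiply_clocks(12, 4), clocks_to_usec(fp_multiply_clocks(12, 4)))
```

Since the N² term dominates, doubling the mantissa length roughly quadruples the multiplication time, which is the main reason the shorter formats are so much cheaper.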
Two 64-bit floating-point numbers (N_m = 12, N_e = 4) can be multiplied in about 86 µsec.

It should be remarked that the algorithm illustrated obtains the single precision product by truncation of the double precision product. If simple truncation is not satisfactory and rounding is to be performed, then a small addition is needed in the microsequence. This is not too time consuming, however.

4.4 Addition and Subtraction

Unsigned addition or subtraction is quite straightforward and can be performed in the following steps:

1) load from PEM address (X_1) into register B.
2) load from PEM address (X_2) into register A_m.
3) add or subtract using input carry (C_n) zero (one in subtraction) for the first cycle and C_n = lcFF4 for the remaining cycles. Also, at each cycle, lcFF4 stores the output carry C_{n+1}. Therefore, at every cycle after the first one, lcFF4 contains C_{n+1} from the previous step.
4) increment X_1 and X_2 by 1.
5) go to step 1.

On the last cycle, lcFF4 is gated to the interrupt wire since lcFF4 ON (OFF in subtraction) at this point indicates an overflow.

Timing estimate: for two unsigned fixed-point numbers with N digits each, one needs, per digit, 6 clocks to fetch the two operands, 1 clock to add and 5 clocks to write the result in PEM. Therefore:

   T_a \approx 12N        (3)

where T_a is the time for unsigned addition or subtraction in clocks. Thus, T_a = 10 µsec for N = 8 digits.

4.4.1 Signed Addition and Subtraction

There are several different ways to perform signed addition and subtraction. Signed numbers can be stored in PEM either in a complement form or as sign and magnitude. The latter seems to be preferable since it speeds up multiplication, even though it slows addition.

To add two signed numbers represented in sign and magnitude notation, it is necessary first to compare the signs. The result of the comparison is stored in lcFF1 which will be ON if the signs are equal, OFF otherwise.
lcFF1 is then used to control whether an addition or a subtraction is actually performed. The two numbers are then added (or subtracted, if lcFF1 is OFF) and the final output carry, which is stored in lcFF4, is analyzed to determine the sign of the result, whether recomplementation is needed or not and if there was overflow. The rules are presented in Appendix C, note f.

Signed addition (or subtraction) takes 6 clocks per digit to fetch the two operands, one clock to add/subtract, one clock to temporarily store the result in sM (assuming $N \le 16$, this is possible and speeds up recomplementation considerably), 2 clocks per digit to recomplement (PE's in which this operation is not needed are disabled), 6 clocks per digit to store the result in PEM and about 10 clocks for control and sign manipulations. Therefore,

$$T_{sa} = 16N + 10, \qquad N \le 16 \tag{4}$$

where $T_{sa}$ is the time for signed addition or subtraction in clocks; for $N = 8$, $T_{sa} = 14$ µsec.

4.4.2 Floating-point Addition and Subtraction

The algorithm is quite complex and can be divided into six distinct phases: a) exponent comparison, b) exponent subtraction, c) hexadecimal point alignment, d) mantissa addition, e) recomplementation, and f) normalization. The basic steps are the following:

1) Set up $X_1$ and $X_2$ with the addresses of the two exponents.
2) Fetch the exponents (storing them temporarily in scratchpad memory to avoid subsequent fetches) and compare them, exchanging the exponents and addresses in PE's with the "wrong" order so that in all PE's the same X register holds the address of the number with the larger exponent.
3) Compute the difference $d$ of the exponents and add it to the address of the operand with the smaller exponent, thus performing hexadecimal point alignment.
4) Set up $A_c$ with (FFF $- N_m + 1$) via CAB and add $d$, which prepares in $A_c$ a trap that will overflow when $N_m - d + 1$ is added to it; this will indicate that all valid digits of the smaller operand have been used and zeros must be substituted for the remaining digits.
5) Perform the actual addition following the algorithm described in the previous section with one extra step: after loading B from PEM, B is zeroed if a carry has already occurred in $A_c$. $A_c$ contains initially the trap described in step 4 and is incremented by one as each pair of digits is added. lcFF2 is used to store the first carry from $A_c$. The sum is temporarily stored in sM for possible recomplementation and normalization before it is finally stored in PEM.
6) The final carry is analyzed to determine if there is a need for recomplementation or if an "overflow" occurred; i.e., if one extra MSD containing a ONE should be added to the mantissa. The rules are presented in Appendix C, note f.
7) Recomplementation is performed; only PE's in which this operation is necessary are enabled. The recomplemented result goes back to sM.
8) $X_1$ and $X_2$ are used as counters: $X_1$ is initialized to FFF (all ones) and $X_2$ is initialized with the larger exponent. Then both registers are decremented by one for each leading zero in the mantissa of the result. Therefore, at the end of the process $X_2$ will contain the exponent of the result and $X_1$ will contain a trap to be used in $A_c$ in the next step.
9) The mantissa of the result is written in PEM using one X register to hold the address of the result in PEM and the other to hold the address of the result in sM. The mantissa is written from LSD to MSD and the trap in $A_c$ is used to write initially as many trailing zeros as there were leading zeros before normalization.
10) The exponent is written from sM into PEM.

Timing estimate: since the procedure is so complex, it is quite difficult to obtain a precise formula for the number of clocks in addition. As a rough estimate, it takes for each pair of mantissa digits: 9 clocks to add, 2 clocks to recomplement, 2 clocks to count leading zeros and 8 clocks to write in PEM; for each pair of exponent digits: 3 clocks to compare, 3 clocks to subtract and 6 clocks to write in PEM.
Adding about 50 clocks for control, sign manipulation and other housekeeping actions, the final expression is:

$$T_{fpa} = 21N_m + 12N_e + 50, \qquad N_e + N_m \le 16 \tag{5}$$

where $T_{fpa}$ is the time for floating-point addition in clocks, $N_m$ is the number of digits in the mantissa and $N_e$ is the number of digits in the exponent. Thirty-two bit floating-point numbers with $N_m = 6$ and $N_e = 2$ take about 20 µsec to add. For this case, a precise microsequence is presented in Appendix C and the results are: normal time = 21 µsec, optimum time = 19 µsec. Two 64-bit floating-point numbers can be added in about 35 µsec.

4.5 Other Operations

A few other important operations are now considered and a quick sketch is presented describing how they would be performed in SPEAC.

4.5.1 Division

This operation has not been considered in detail and while it is probably possible to design a sophisticated division algorithm that will use the PE very efficiently, this will take considerable research. On the other hand, even a very straightforward restoring division algorithm can be performed in an acceptable time. For $N_m \le 8$, the two mantissas can be stored in sM; then the divisor is repeatedly subtracted from the dividend until a final borrow results and disables the PE. This is performed a maximum of 15 times; then all PE's add the divisor to the remainder to restore a positive remainder. The number of subtractions is counted in register A. Each subtraction takes only 2 clocks per digit once the operands are in sM. Therefore, it takes at least $32N$ clocks to determine each digit of the quotient. Adding about 3 extra clocks per subtraction for control, one obtains the following rough timing estimate for mantissa division:

$$T_d \approx 32N_m^2 + 50N_m, \qquad N_m \le 8 \tag{6}$$

This yields about 130 µsec for 24-bit mantissa division and not more than 140 µsec for 32-bit floating-point division.
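The restoring scheme described above can be sketched for a single PE as follows. This is only an illustration under stated assumptions: mantissas are modeled as plain Python integers rather than digit strings in sM, the function names are hypothetical, and the restore step is applied only when a borrow actually occurred (the text does not spell out what happens in a PE whose quotient digit is 15).

```python
# Sketch of one-PE restoring division by repeated subtraction.
# Names and the integer representation are assumptions for illustration.

def quotient_digit(remainder, divisor):
    """Determine one hexadecimal quotient digit by repeated subtraction.

    The CU issues 15 subtractions; a PE disables itself on the first borrow
    and later restores its remainder by adding the divisor back.
    """
    if divisor <= 0 or not 0 <= remainder < 16 * divisor:
        raise ValueError("need 0 <= remainder < 16 * divisor")
    count = 0
    enabled = True
    for _ in range(15):
        if enabled:
            remainder -= divisor
            if remainder < 0:      # final borrow: the PE is disabled
                enabled = False
            else:
                count += 1         # successful subtraction is counted
    if not enabled:
        remainder += divisor       # restore a positive remainder
    return count, remainder


def divide_mantissas(dividend, divisor, n_digits):
    """Return n_digits hexadecimal digits of dividend / divisor (dividend < divisor)."""
    if not 0 <= dividend < divisor:
        raise ValueError("need 0 <= dividend < divisor")
    digits = []
    remainder = dividend
    for _ in range(n_digits):
        remainder *= 16            # move to the next hexadecimal digit position
        digit, remainder = quotient_digit(remainder, divisor)
        digits.append(digit)
    return digits


def division_clocks(n_m):
    """Equation (6): rough clock count for mantissa division, n_m <= 8."""
    if not 1 <= n_m <= 8:
        raise ValueError("n_m must be between 1 and 8")
    return 32 * n_m * n_m + 50 * n_m
```

For example, `divide_mantissas(1, 3, 4)` produces the digits of 1/3 in hexadecimal (0.5555), one digit per group of at most 15 subtractions.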
The ratio of about six between floating-point division and floating-point multiplication times is adequate for this type of machine (in ILLIAC IV, this ratio is 7).

4.5.2 Logic Operations

Logic operations are quite straightforward in this machine since the A/L unit in the PE's can directly perform all sixteen logical functions of two variables. Therefore, to obtain any bit-by-bit logic function of two operands, each $N$ digits long, the same algorithm described for unsigned addition (Section 4.4) can be performed; the timing is also as given by (3):

$$T_\ell \approx 12N \tag{7}$$

where $T_\ell$ is the time required to perform one bit-by-bit logical operation.

4.5.3 Comparisons

In SPEAC, the result of a comparison is normally stored either in an lcFF or in the mode register. It can also be stored in sM or PEM for future use or sent to the CU via CDB. The six different types of comparisons ($>$, $<$, $\ge$, $\le$, $=$, $\ne$) can readily be performed by the A/L unit. The algorithm for comparing two unsigned numbers is similar to the algorithm to add two unsigned numbers; as each pair of digits is compared, the result of the comparison for $=$ is always stored in lcFF1. This is needed even to perform a comparison for $>$, $<$, $\ge$, or $\le$ since lcFF1 is used to "freeze" the result of the comparison once the first pair of unequal digits is found. For example, the typical microsequence for a $<$ compare is as follows:

- load first operand from PEM into $A_m$
- load second operand from PEM into B
- enabling on lcFF1 ON, store the comparison A = B in lcFF1 and A < B in lcFF4.

When all the digits have been compared, lcFF4 will have the resulting bit. Therefore, the timing is:

$$T_c = 7N \tag{8}$$

where $T_c$ is the time in clocks for comparisons of two unsigned numbers, each $N$ digits long, leaving the result in the PE. Signed and floating-point comparisons require a little more control but the linear dependence on $N$ is as in (8).
Rough estimates are:

$$T_{sc} \approx 7N + 10 \tag{9}$$

$$T_{fpc} \approx 7(N_e + N_m) + 20 \tag{10}$$

where $T_{sc}$ is the time for signed comparisons in clocks and $T_{fpc}$ is the time for floating-point comparisons in clocks.

4.5.4 Shifts

Shifts by a total distance of $b$ bits are easily performed in two phases: address indexing is used to shift by ($b$ div 4) and register A shifts are used to shift by ($b$ mod 4). sM is also frequently used as temporary storage, especially in end-around shifts. If $b$ is global (i.e., all PE's will shift by the same distance) then the address indexing is performed in the CU. In general, it takes in the worst case 3 clocks to shift each digit, one to store it in sM and 6 to store the shifted digit back in PEM. Therefore:

$$T_s \approx 12N, \qquad N \le 16 \tag{11}$$

where $T_s$ is the time in clocks to shift a number with $N$ digits by a global distance.

The operation is a little more complex if $b$ is local; i.e., the shifting distance is different in each PE. In this case, local indexing is initially performed, taking about 20 clocks, to "shift" by "$b$ div 4." The quantity "$b$ mod 4" is then stored in LC and three successive shifts are performed which are enabled by lcFF1, lcFF2 and lcFF3 respectively. The remainder of the operation is as for global shifts. Therefore:

$$T_{ls} \approx 12N + 20, \qquad N \le 16 \tag{12}$$

where $T_{ls}$ is the time in clocks to shift a number with $N$ digits by a local distance.

It should also be pointed out that the PE, besides shifting, has very good bit manipulation capability in general due to the locally controlled gating into $A_m$.

4.6 I/O

Both I/O and routing are performed using the row gating and IOBR. I/O will be described first. An elementary I/O operation consists of interchanging the data words D1, initially in PEM, and D2, initially in mass memory (MM). Both words contain 512 bits and D1 is stored across one PE row: row $j$ (PE $i$ in row $j$ contains the $i$-th hexadecimal digit of D1).
Recalling the IOBR structure presented in Figure 22, the general procedure is the following:

clock 0 - Initiate a MM read of word D2 to IOBRr.
clock 8 - Initiate a PEM read of word D1.
clock 10 - MM read is completed and D2 is in IOBRr. The PEM read will be completed during the next clock period; therefore gate D1 through row gating to IOBRr and simultaneously shift IOBR left 128 digits (i.e., IOBRℓ ← IOBRr). This can be done in one clock.
clock 11 - At this instant, IOBRr contains D1 and IOBRℓ contains D2. Initiate now the MM rewriting which will replace D2 by D1 in MM. Also initiate a PEM write which will write D2 from IOBRℓ into any PEM row selected by row gating. If the row selected is row $j$, then D2 will replace D1 in that PEM row.
clock 16 - PEM write is complete; D2 is now available in PEM.
clock 21 - MM rewrite is finished; ready to start a new I/O transaction at this clock.

One elementary I/O transaction then takes 1 MM cycle and 1 PE clock, or approximately one MM cycle, which was assumed to be 2 µsec (1 µsec access time, 1 µsec rewrite time). Eight of these elementary I/O's are needed to exchange one digit in every PEM with MM since there are eight PE rows. Therefore:

$$T_{I/O} = 168N \tag{13}$$

where $T_{I/O}$ is the time in clocks to interchange a word $N$ digits long between PE's and MM. For $N = 8$, $T_{I/O} = 135$ µsec. This indicates that since a typical 32-bit floating-point operation takes about 25 µsec, each word brought to PEM should be used on at least six operations (before being overlaid to MM) in order to completely overlap execution and I/O.

The procedure described above for I/O transactions is based on the assumption that MM is bulk core. In this case, IOBRr is in fact the memory data register for MM. If MM is implemented with semiconductor memory, then it would be better to modify the structure in Figure 22 and have the output data from MM linked to IOBRℓ and the input data linked to IOBRr.
This would avoid the IOBR shift in clock 10 and would save one clock in each transaction.

4.7 Routing

The following algorithm is employed to perform routing left of one digit by a distance $R$, $R \le 1023$. This is obviously general since a routing right by $n$ is equivalent to a route left by $1024 - n$.

1) IOC, which processes routings, decomposes $R$ into $r' = R$ div 128 and $r = R$ mod 128. $r'$ will be taken care of by row gating and $r$ by shifting IOBR.
2) IOBRr is loaded with row $r'$ from PEM (rows are numbered from 0 through 127).
3) IOBRr is shifted left 128, thus placing row $r'$ in IOBRℓ; simultaneously row $r'+1$ is brought to IOBRr.
4) IOBR is shifted left by a distance $r$.
5) IOBRℓ now contains the routed word for row 0. Therefore, IOBRℓ is written into row 0.
6) IOBR is now shifted by $(128 - r)$, which places row $r'+1$ into IOBRℓ; simultaneously, row $r'+2$ is brought to IOBRr.
7) Repeat step 4, and so on.

It should be noticed that row $r'$ has to be brought to IOBR twice, once at the beginning and once at the end of the routing. This is necessary to recover the leftmost digits of $r'$ which are lost when step 4 is first executed.

The actions performed are: 9 row loads into IOBRr, 1 shift by 128, 8 shifts by $r$, 7 shifts by $(128 - r)$ and 8 stores of IOBRℓ into rows. Also the first clock of all but the first row loads is overlapped with the last clock of a shift, and the first clock of all but the last IOBR stores is overlapped with the first clock of a shift. Therefore, the timing for routing will be given by:

$$T_r = t_l + 8(t_l - 1) + 8\,t_{sh(r)} + 7\,t_{sh(128-r)} + t_{sh(128)} + 7(t_s - 1) + t_s$$

where $T_r$ is the time in clocks for routing one digit by a distance $R = 128r' + r$; $t_l$ is the number of clocks for a row load; $t_{sh(r)}$ is the number of clocks to shift IOBR by $r$; and $t_s$ is the number of clocks for a row store. It is known that $t_{sh(128)} = 1$.
The values for $t_l$ and $t_s$ depend on where the digit to be routed is: if it is in some PE register, then these times are only one clock; if the digit is in PEM, then $t_l$ requires one PEM read or 3 clocks and $t_s$ takes 5 clocks for a PEM write. Therefore, there are four different types of routing. They are, from the fastest to the slowest: 1) PE to PE, 2) PEM to PE, 3) PE to PEM and 4) PEM to PEM.

For routings of type 1:

$$T_{r1} = (8\,t_{sh(r)} + 7\,t_{sh(128-r)} + 3)N \tag{14}$$

where $T_{r1}$ is the time in clocks to route a number with $N$ digits and $t_{sh(r)}$ is given by Table 9 for $r \le 64$; shifts by $r > 64$ in a given direction are simply obtained by first shifting by 128 (end around) and then shifting $(128 - r)$ in the opposite direction. $t_{sh(r)}$ for $r > 64$ can thus be written as $1 + t_{sh(128-r)}$, where $t_{sh(128-r)}$ is taken from Table 9.

For $N = 8$ and $r = 1$, one obtains $T_{r1} = 20$ µsec. This is the best possible routing time and it is on the order of one floating-point operation time. Other distances may take longer. For example, when $N = 8$ and $r = 2$, $T_{r1}$ is 32 µsec. Note also that routing must always be from one location to another or else the row that must be loaded twice would be changed when accessed for the second time.

For routings of types 2, 3, and 4 the expressions are:

$$T_{r2} = (8\,t_{sh(r)} + 7\,t_{sh(128-r)} + 21)N \tag{15}$$

$$T_{r3} = (8\,t_{sh(r)} + 7\,t_{sh(128-r)} + 35)N \tag{16}$$

$$T_{r4} = (8\,t_{sh(r)} + 7\,t_{sh(128-r)} + 53)N \tag{17}$$

It is also important to notice that since routing is performed in chunks of 128 each, several other special purpose types of partial routings can be microprogrammed and are very useful in specific applications.

4.8 Summary of Timings

Table 11 presents a summary of the timing estimates for several operations and four "typical" word lengths: 16 bits ($N_m = 3$, $N_e = 1$), 32 bits ($N_m = 6$, $N_e = 2$), 48 bits ($N_m = 9$, $N_e = 3$), 64 bits ($N_m = 12$, $N_e = 4$).
                                        Formula            Time in µsec
Operation                               Number    16 bits  32 bits  48 bits  64 bits

Local indexing, per address               -         1.6      1.6      1.6      1.6
Mantissa multiplication                   1         12       39       82       141
Floating-point multiplication             2         9.4      26       52       86
Fixed-point unsigned addition             3         4.8      9.6      15       19
Fixed-point signed addition               4         7.4      14       20       27
Floating-point addition                   5         12.5     20       28       35
Mantissa division                         6         44       145      na       na
Logic operations                          7         4.8      9.6      15       19
Comparison of unsigned numbers            8         2.8      5.6      8.4      11
Comparison of signed numbers              9         3.8      6.6      9.4      12
Comparison of floating-point numbers      10        4.8      7.6      11       13
Global shifts                             11        4.8      9.6      15       19
Locally indexed shifts                    12        6.8      12       17       21
I/O (PEM <-> MM)                          13        67       135      200      269
Routing PE - PE, distance 1               14        10       20       30       40
Routing PEM - PE, distance 1              15        17       35       52       69
Routing PE - PEM, distance 1              16        23       46       69       91
Routing PEM - PEM, distance 1             17        30       60       90       120

Table 11. Summary of Timing Estimates

5. APPLICATIONS

5.1 General Considerations

In general, SPEAC can handle efficiently most problems in which ILLIAC IV performs well since most of the features of ILLIAC IV are also available in SPEAC. A large number of parallel algorithms to implement many important applications in ILLIAC IV have been developed [9 through 17]. Obviously, these algorithms can be used as a starting point when the use of SPEAC for the same applications is contemplated. A few modifications or a new approach are sometimes required due to the following differences:

a) PEM is much smaller in SPEAC and many problems which are "core contained" in ILLIAC IV must use memory overlay in SPEAC. On the other hand, MM in SPEAC is random-access and the machine was especially designed to allow efficient PEM overlay so it is normally possible to use SPEAC efficiently even in non-core contained problems. In ILLIAC IV, non-core contained problems, while not as frequent as in SPEAC, are harder to program efficiently due to the latency problem in its disk mass memory.

b) Routing is relatively slow in SPEAC.
While in ILLIAC IV a route takes about half the time required for a floating-point operation regardless of distance, in SPEAC it takes from one to several times as much as a typical floating-point operation, depending on the distance. On the other hand, in SPEAC routing is an I/O operation and can be overlapped with PE processing. Also special route instructions can be microprogrammed, "customized" to particular problems.

c) ILLIAC IV is primarily intended for computations on floating-point numbers with 32 or 64 bits precision. While SPEAC can also handle these problems, floating-point multiplication becomes relatively slow for very long word lengths since it is proportional to the square of the number of digits in the word. Furthermore, there is a very important area of applications which is much more "natural" to program for SPEAC than for ILLIAC IV. This area includes problems involving a large quantity of fixed-point numbers with small precision, typically only a few bits. Examples of these problems are: picture processing, non-numerical processing in strings of characters, etc. These problems can be handled very efficiently by SPEAC due to its digit-by-digit processing and fast operation for small words.

d) In ILLIAC IV, the number of PE's ($n_{PE}$) is 64 and for most applications one is interested in tackling problems in which the number $n$ of parallel computations is equal to or greater than $n_{PE}$. In matrix computations, for example, $n$ is the order of the matrix and in discrete Fourier transforms $n$ is the number of points. Therefore, a frequent problem in ILLIAC IV is to partition a large data set into "chunks" of 64 or 64 × 64 so that each chunk can "fit" in the machine. Chunks are then processed sequentially. In SPEAC, $n_{PE} = 1024$ and for most problems one will be interested in $n < n_{PE}$; the typical problem is to subdivide a data set into several pieces and to process all the pieces in parallel to "fill" the whole machine when $n < n_{PE}$.
In the next sections a few specific representative applications of SPEAC are considered in detail. Of course, they are only meant as a sample since many other interesting applications could possibly be efficiently handled by the machine.

Timing estimates were based on counting PE clocks by hand. Some attempt has been made to take into account PE/IO overlap but precise numbers could only be obtained with a very sophisticated simulator for CU and specific detailed microprograms for every instruction. Therefore, the estimates can be a little pessimistic if the overlap was not fully accounted for.

5.2 Relaxation

The problem consists of: given an initial matrix $U^0$, $n \times n$, find a succession of matrices $U^1$, $U^2$, ... where each term of matrix $U^{k+1}$ is a function of the four "neighbors" of the term in the previous matrix $U^k$. In general,

$$U^{k+1}_{i,j} = f(U^k_{i+1,j},\ U^k_{i-1,j},\ U^k_{i,j+1},\ U^k_{i,j-1},\ U^k_{i,j})$$

This is a general formulation for a series of problems that can be very efficiently solved using an array computer. If the elements of $U$ are floating-point numbers, then this type of expression can be used to find the equilibrium temperatures or potentials at every point of a plane submitted to given initial conditions at the edges; if the elements of $U$ are small integers, then each element can represent a point of a picture coded according to a gray scale. In this case, the formulation can be used to implement a "smoothing" filter or a number of other picture processing problems.

As an example, the following case will be studied:

$$U^{k+1}_{i,j} = \frac{U^k_{i+1,j} + U^k_{i-1,j} + U^k_{i,j+1} + U^k_{i,j-1}}{4}$$

The loop condition is the following: if $|U^{k+1}_{i,j} - U^k_{i,j}| < \epsilon$ for all $i,j$ then exit the loop; otherwise repeat.

Two values of $n$ are considered: 32 and 1024, although other powers of two can also be handled efficiently.

a) $n = 32$; the elements of $U$ are 32-bit floating-point numbers.
The most straightforward (and most inefficient) way of coding the loop is: the elements of $U$ are stored across PE's, row after row; i.e., numbering PE's from 0 to 1023 and rows from 0 to 31, element $U_{i,j}$ is stored in PE $32i + j$. $U^k$ is in PEM location a. The loop is:

1) Route distance 1 left from PEM location a to PEM location b.
2) Route distance 1 right from PEM location a to sM(0).
3) Add sM(0) to PEM(b) and store in PEM(b).
4) Route distance 32 left from PEM(a) to sM(0).
5) Add PEM(b) ← (PEM(b) + sM(0)).
6) Route distance 32 right from PEM(a) to sM(0).
7) Add sM(0) ← (PEM(b) + sM(0)).
8) Multiply the addition of the four neighbors by .25, sent via CDB: sM(0) ← (sM(0) × CDB(.25)).
9) Test for ending condition; sM(0), which now contains $U^{k+1}$, is subtracted from $U^k$, which is in PEM location a, and the difference is compared against $\epsilon$, sent via CDB. In PE's in which the ending condition is satisfied, a zero is gated to the interrupt wire and register M is reset, which disables the PE.
10) Write sM(0) in PEM location a and go back to step 1.

The process ends when, in step 9, CU receives a zero via the interrupt wire; this indicates that all PE's are disabled. At this point the result of the last iteration is stored in sM(0); all PE's are enabled and the result can then be stored in PEM. The procedure requires three additions (20 µsec), one subtraction (20 µsec), one multiplication (25 µsec), three routes of type 2 (35 µsec), one route of type 4 (60 µsec), and one comparison (7.6 µsec) for a total of 278 µsec per execution of the loop. Obviously, each execution of the loop computes a new iteration matrix $U^{k+1}$ out of the previous value $U^k$.

It should be noticed that sM was used as temporary storage in some steps. sM can store two 32-bit numbers: one in sM(0) through sM(7) and the other in sM(8) through sM(15).
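The data flow of this loop can be mimicked in a few lines. The sketch below is only an illustration under stated assumptions: the 1024 PE's are modeled as one flat Python list with element $U_{i,j}$ at index $32i + j$, routing is modeled as a cyclic shift of that list, and the names `route_left`, `relax_step` and `relax` are hypothetical. It reproduces the arithmetic of one iteration and the convergence test, not the PE microsequence or its timing.

```python
# Sketch of the relaxation loop with cyclic routing over 1024 PE's.
# All names are illustrative assumptions; only the storage scheme
# (U[i][j] in PE 32*i + j) and the averaging rule come from the text.
N_PE = 1024   # number of PE's
ROW = 32      # matrix order n for case a


def route_left(values, distance):
    """Cyclic route: PE p receives the value held by PE (p + distance) mod N."""
    n = len(values)
    return [values[(p + distance) % n] for p in range(n)]


def relax_step(u):
    """One iteration: every PE averages its four routed neighbors."""
    n = len(u)
    left = route_left(u, 1)
    right = route_left(u, n - 1)      # a route right by 1 is a route left by n - 1
    up = route_left(u, ROW)
    down = route_left(u, n - ROW)
    return [0.25 * (a + b + c + d) for a, b, c, d in zip(left, right, up, down)]


def relax(u, eps, max_iter=10000):
    """Iterate until every element changes by less than eps.

    Returns the final field and the number of iterations performed.
    """
    for iteration in range(1, max_iter + 1):
        new = relax_step(u)
        converged = all(abs(x - y) < eps for x, y in zip(new, u))
        u = new
        if converged:
            return u, iteration
    raise RuntimeError("no convergence within max_iter iterations")
```

Because the route by distance 1 acts on the flat PE numbering, the last element of one matrix row is a neighbor of the first element of the next row; the sketch keeps that behavior rather than wrapping each row separately.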
It is obviously possible to write microsequences for variants of addition and multiplication which take one or both operands from sM instead of PEM and also possibly have the results in sM instead of storing the numbers back in PEM. These operations will be faster than the normal PEM to PEM ones (from 1.6 to 6.4 µsec faster) but this will not normally be taken into account in these worst-case timings. It is also important to notice that since sM is used as scratchpad in most operations, if the two operands are in sM, one is destroyed during the operation unless sM is enlarged to contain four or eight 32-bit numbers instead of only two.

A few improvements are possible in the straightforward algorithm presented above and they are as follows:

1) The routing in step 1 does not have to be of type 4 since sM is available. Therefore, one can load the data in sM(0) (2.4 µsec), route from sM(0) to sM(8) (20 µsec), and store sM(8) in PEM (4 µsec). These last four µsec can be overlapped with the routing and total time is roughly 25 µsec.
2) A special microsequence can be written for an instruction to divide by 4 by shifting and normalizing. This will take much less than 25 µsec; since the operand is in sM and the result is also left in sM, 5 µsec is a reasonable upper bound.
3) All additions except the last can be overlapped with routings of type 2. The routings to be overlapped must be of type 2 because there is no space in sM to keep the elements of $U$ permanently in sM, which would enable one to use only type 1 routings. If more space were available in sM, the sum could also be kept in sM and PEM location b would not be used.

The improved algorithm is as follows:

1) sM(0) ← PEM(a); $U^k$ is now in sM(0) (2.4 µsec).
2) Route distance 1 left from sM(0) to sM(8) (20 µsec). Simultaneously, write sM(8) in PEM(b) (~2.6 µsec).
3) Route distance 1 right from sM(0) to sM(8) (20 µsec).
4) PEM(b) ← (PEM(b) + sM(8)).
Simultaneously, route distance 32 left from PEM(a) to sM(0) (35 µsec).
5) PEM(b) ← (PEM(b) + sM(0)). Simultaneously, route distance 32 right from PEM(a) to sM(8) (35 µsec).
6) sM(0) ← (PEM(b) + sM(8)) (~16 µsec). Note that this addition takes less time because the result is not stored back in PEM.
7) sM(8) ← sM(0) shifted 2 right (i.e., divided by 4) and normalized (~5 µsec).
8) Test end condition. This is the same as step 9 in the original algorithm (~8 µsec).
9) PEM(a) ← sM(8). Go to step 1 (~4 µsec).

The total time is now only ~148 µsec for each complete relaxation. Further improvement is possible if sM can store four 32-bit numbers instead of only two. In this case all routings are of type 1 (which saves 30 µsec) and the whole problem can be done in sM, which saves all PEM reads and writes except the initial read and final write. In this case a total time on the order of 110 µsec is possible.

The algorithms considered assume a toroidal geometry; i.e., there are no edges, $U_{0,j}$ is considered a neighbor of $U_{31,j}$ and $U_{i,0}$ a neighbor of $U_{i,31}$. This is not desirable for most actual applications. In most cases, there is an outside edge: $U_{-1,j}$, $U_{32,j}$, $U_{i,-1}$ and $U_{i,32}$, with fixed values. This can be easily included in the program in the following way: a digit D is stored in each PE containing the LSB ON if the element stored in that PE belongs to row 0, the second LSB ON if it belongs to row 31, the third LSB ON for column 0, and the MSB ON for column 31. The fixed edge values are stored in PEM locations c, d, e and f (each is only needed in 32 PE's, but it is probably easier to store them in all PE's). A new step is needed between 2 and 3 in the improved algorithm. This step is number 2½ and is identical with step 1. Before steps 1, 2½, 4, and 5, a local indexing is added.
This local indexing is enabled by the bits of D and makes PE's that have an edge neighbor take the edge value instead of the "end-around" neighbor. This adds only about 8 µsec to the procedure.

It should also be pointed out that overlaps of two operations both using sM can be less than perfect since sM has only one port. Normally, however, operations that use sM do so 50% or less of the clocks and thus very good overlap is possible. Multiplication is an exception since it uses sM very heavily.

b) $n = 1024$; the elements of $U$ are floating-point 32-bit numbers.

In this case each row of $U$ is stored across PE's and 1024 rows are needed. Therefore, the problem is not "core contained" and PEM overlay is necessary. Routing is only needed now to access the "left" and "right" neighbors; the "upper" and "lower" neighbors of an element and the element itself are stored in the same PE. Therefore, at least three complete rows of $U$ must always be present in PEM. Assuming they are in locations a, b, and c respectively, the algorithm is:

1) sM(0) ← PEM(b); $U^k$ is now in sM(0) (2.4 µsec).
2) PEM(d) ← (PEM(a) + PEM(c)); do not destroy sM(0) (20 µsec).
3) Route distance 1 left from sM(0) to sM(8) (20 µsec).
4) PEM(d) ← (sM(8) + PEM(d)); do not destroy sM(0) (20 µsec).
5) Route distance 1 right from sM(0) to sM(8) (20 µsec).
6) sM(0) ← (sM(8) + PEM(d)) (~16 µsec).
7) sM(8) ← sM(0) shifted 2 right and normalized (~5 µsec).
8) Test end condition (~8 µsec).
9) PEM(b) ← sM(8); go to step 1 (~4 µsec).

Steps (2,3) and (4,5) could overlap for a total time of 58 µsec per row. However, this would leave only 5.7 µsec in which both buses are not simultaneously used and I/O overlay could not occur. Since FINST normally assigns priority to I/O, on the average each loop will take the maximum time of 116 µsec and will have to wait for 20 more µsec for I/O. Therefore, the procedure is I/O bound and each loop takes 135 µsec, which is the time needed for an I/O transaction.
One iteration is then performed in about 135 msec. Fixed edge conditions can be introduced as discussed in case a and do not cost any extra time since the procedure is I/O bound.

c) $n = 1024$; the elements of $U$ are one-digit integers.

This case would be used in picture processing. The problem is "core contained" since 2K digits are available in PEM and only 1K are needed. Storage is as in case b, each row across PE's. All elements of the same column are in the same PEM. Only one PEM read and one PEM write are needed per row since sM is now capable of storing sixteen 4-bit elements. Assume that sM(a) contains the upper neighbor, sM(b) the present element and sM(c) will contain the lower neighbor. The algorithm is:

1) sM(c) ← PEM(address of element of next row).
2) reg A ← sM(a) + sM(c).
3) Route distance 1 left from sM(b) to sM(d) (2.5 µsec).
4) reg A ← reg A + sM(d) (.2 µsec).
5) Route distance 1 right from sM(b) to sM(e) (2.5 µsec).
6) reg A ← reg A + sM(e) (.2 µsec).
7) Shift reg A right 2 bits (.2 µsec).
8) Test end condition (~.5 µsec).
9) Go to step 1.

The whole procedure then takes only about 6 µsec since steps 1, 2, and 4 are overlapped with routes. Therefore, one iteration can be performed in about 6 msec. Fixed edge conditions could be introduced without difficulty since there is space in sM to keep the data for the edges. This problem could also use two digits per element for a gray scale with 256 shades. Since sM can still be used, the time increases linearly to 12 msec per iteration.

In conclusion, SPEAC performs exceedingly well in relaxation type problems.

5.3 Matrix Multiplication

Given two matrices, $A_{n \times n}$ and $B_{n \times n}$, the problem consists of finding the matrix $C_{n \times n}$ which is the product of $A$ and $B$:

$$C = A \times B, \qquad c_{ij} = \sum_{k=1}^{n} a_{ik} \times b_{kj}$$

Two basic methods can be used to store matrices in an array computer:

a) Straight storage, in which each row is stored across PE's and all elements of a column are stored in the same PE.
Therefore, $a_{ij}$ is stored in PE $j$.

b) Skewed storage, in which each row is stored across PE's but it is also rotated one position farther than the preceding row in an end-around fashion. Thus, $a_{ij}$ is stored in PE $(i+j-2) \bmod n + 1$.

In either storage scheme one row of $A$ can be accessed by fetching one row of PE memory. When a matrix is skewed one column can also be accessed in one memory fetch by indexing each PE to a different memory location. To fetch the first column of $A$, for example, each PE simply loads from location A plus the number of that PE. By routing this indexing pattern, any column of $A$ can be accessed in one operation. It would take many memory fetches to access a column of a matrix which is not skewed since all elements of a column are stored within one PE.

Three methods have been proposed ([11] and [12]) to perform matrix multiplication in an array computer. Briefly, they are as follows:

a) The log-sum method, which is used to multiply skewed matrices since rows and columns must be accessible. A row of the first matrix is fetched and multiplied, in parallel, by a column of the second matrix. The results are summed across PE's to produce one element of the solution. There are two major causes of inefficiency in this method. First of all, the operation of summing across PE's, known as a log-sum, is at best only 20% efficient in using PE's. Secondly, excessive routing is required to properly index columns and line them up with rows.

b) The broadcast method, which generates one row of the result matrix at a time rather than just one element. It operates on matrices which are stored straight in memory and produces a result matrix which is also stored straight. Each row of the result is obtained after $n$ multiplications and accumulations (the result of each multiplication is added to the sum of all previous multiplications).
To obtain row i of the result, the k-th element $a_{ik}$ of row i of matrix A is multiplied by the k-th row of matrix B and all n rows thus obtained are added together. The expression is:

$$\mathrm{row}(c_i) = \sum_{k=1}^{n} a_{ik}\,\mathrm{row}(b_k)$$

The CU must be able to broadcast the elements $a_{ik}$ to the PE's and the PE's must have access to rows of B (i.e., row across PE's). As opposed to the skewed matrix multiplication, this method is almost 100% efficient. There is no log-sum involved and no routing is required.

c) Knapp's method, of which only a brief description is offered here; for a detailed treatment see [12]. A and B are stored straight and C will also be obtained straight. As in the broadcast method, each row of the result is obtained after n multiplications and accumulations. However, no broadcast takes place. To obtain row i of the result, row i of A is multiplied by each diagonal of B and then routed right one. Defining the k-th diagonal of B as:

$$b_{1,k},\ b_{2,k+1},\ \ldots,\ b_{i,(k+i-2) \bmod n + 1},\ \ldots,\ b_{n,(k+n-2) \bmod n + 1}$$

then Knapp's method is expressed by the following:

$$\mathrm{row}(c_i) = \sum_{k=1}^{n} (\mathrm{row}\ a_i\ \text{routed right}\ (k-1)\ \text{times}) \times (k\text{-th diagonal of}\ B)$$

To access the first diagonal of a matrix stored straight, each PE is locally indexed with the PE number (starting with 0); this pattern is routed right (k-1) times to access the k-th diagonal. The efficiency of Knapp's method is very good because no log-sum operations are performed, but not as good as straight multiplication since routing is required. Its major use is to perform several small matrix multiplies simultaneously using only a small group of PE's for each one.

The three methods can be used in SPEAC but the log-sum method is not considered in detail since it is the least efficient. Two cases are studied:

a) n=1024 and each element is a 32-bit floating-point number. Each matrix is stored straight and the broadcast method will be used.
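The broadcast method and Knapp's method described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not SPEAC microcode: position j of a Python list stands in for PE j, indices are 0-based, and the helper names (`route_right`, `diagonal`, `matmul_broadcast`, `matmul_knapp`) are hypothetical names introduced here.

```python
def matmul_reference(A, B):
    """Plain triple-loop product, used only to check the two sketches."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_broadcast(A, B):
    """Broadcast method: row(c_i) = sum over k of a_ik * row(b_k)."""
    n = len(A)
    C = []
    for i in range(n):
        acc = [0] * n                      # one accumulator per PE
        for k in range(n):
            a_ik = A[i][k]                 # scalar broadcast by the CU
            row_b = B[k]                   # row k of B, one element per PE
            acc = [acc[j] + a_ik * row_b[j] for j in range(n)]
        C.append(acc)
    return C


def route_right(row, distance):
    """End-around route: PE j receives the value held by PE j - distance."""
    n = len(row)
    return [row[(j - distance) % n] for j in range(n)]


def diagonal(B, k):
    """Diagonal k (0-based): PE j indexes row (j - k) mod n of its own column j."""
    n = len(B)
    return [B[(j - k) % n][j] for j in range(n)]


def matmul_knapp(A, B):
    """Knapp's method: row i of A, routed right k times, times diagonal k of B."""
    n = len(A)
    C = []
    for i in range(n):
        acc = [0] * n
        for k in range(n):
            a_routed = route_right(A[i], k)
            diag = diagonal(B, k)
            acc = [acc[j] + a_routed[j] * diag[j] for j in range(n)]
        C.append(acc)
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
assert matmul_broadcast(A, B) == matmul_reference(A, B)
assert matmul_knapp(A, B) == matmul_reference(A, B)
```

In the broadcast sketch the inner list comprehension corresponds to one multiply and add performed by all PE's at once; in the Knapp sketch no scalar is broadcast, at the cost of one route per diagonal.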
One slight modification is needed, however, to avoid I/O bounding since the problem is not "core contained." In the broadcast method the rows of B are used in order from row 1 to row n (to compute row 1 of C) and then again from row 1 to row n (to compute row 2 of C) and so on. Therefore, each row is used only for one multiply and add each time it is in PEM. Since one multiply and add can be performed in 45 µsec, there is no time to overlay a row, which takes 135 µsec. The solution is simple; each row of B must be used several times each time it is brought to PEM. In this way, several rows of the product are computed simultaneously. For example, the first row of B ($\mathrm{row}(b_1)$) is brought and is multiplied by 64 broadcast elements $a_{1,1}, a_{2,1}, \ldots, a_{64,1}$; the 64 rows thus obtained are stored in PEM. $\mathrm{Row}(b_2)$ is then accessed and is multiplied by $a_{1,2}, a_{2,2}, \ldots, a_{64,2}$; each of the 64 rows thus obtained is added to the corresponding row of the first 64. At the end of 1024 cycles, all rows of B have been accessed and used 64 times each, and the first 64 rows of C are completed. The method is repeated sixteen times to obtain the 1024 rows of C. Since each multiply and add takes 45 µsec, 64 take 2880 µsec, in which there is time to interchange 21 rows. Therefore, I/O can be easily overlapped with execution; while the 1024 rows of B are used, there is time to interchange 21K rows and all that is needed is to interchange 1024 rows plus the 64 result rows.

CU obtains the elements to broadcast either directly from mass memory or from the PE's via CDB. The latter is the most straightforward scheme and can be efficiently used since overlap is possible with execution due to the fact that I/O takes a relatively small percentage of the execution time. 64 rows of A are needed in the PE at all times to obtain the broadcast elements. Patterns are also stored in PEM and used to turn off all but one PE each time $CDB_{out}$ is used to send a broadcast element to CU.
Note also that CU can simultaneously broadcast a previous element since $CDB_{in}$ is used for this purpose.

In the worst case, there are 194 rows in PEM at one time: the 64 rows of C that are being computed, the 64 rows of C that have just been completed and have not been placed in MM yet, 64 rows of A that are being used to obtain broadcast elements, and 2 rows of B, one being used to multiply and a new one being prepared for the next step. When the 64 completed rows of C are overlaid to MM, the space is used to load the next group of 64 rows of A. When a new step begins, the locations of the old rows of A are used to place the new partial rows of C. Therefore, complete overlay of I/O and CU instructions is possible and the timing is simply given by: $n^2$ (multiplications and additions). sM can contain the row of B being used 64 times and also the result of the multiplication. Only the result of the addition must be stored. In these conditions, multiply and add takes about 43 µsec and the final result is 43 sec.

b) $n = 1024/2^k$ (k=1,2,3,4) and each element is a 32-bit floating-point number.

This is the submultiple case, in which the size of the matrix is a submultiple of the size of the array. In order to keep all PE's busy, one can either divide the matrix in $PE/n$ parts and use all PE's to compute one multiplication, or $PE/n$ multiplications can be computed simultaneously. The two approaches are very similar and only the first is considered. Two methods can be used: the broadcast method, which is especially suitable when $PE/n$ is small (2 or 4 ideally), and Knapp's method, which is best when $PE/n \gg 8$.

In the broadcast method, $PE/n$ repetitions of a row of B can be concatenated across one row of PEM's and the method is used as before, but instead of generating k rows of C at the end of each step (k=64 in the example presented in part a), $k \times PE/n$ rows of C are constructed simultaneously.
For $PE/n < 8$ this repetition is easily obtained by writing in PEM $PE/n$ times the same row of B read only once from MM. Obviously, there is one difficulty: the broadcast element must be different for each of the $PE/n$ copies of the row of B. Up to four different broadcast elements may be sent during a multiplication of two digits without any extra delay. The only problem is to enable sM's in only a portion of the PE's without disabling the multiplication itself. This suggests the introduction of an enable flip-flop for I/O and CU use, and sM may be directed to obey either the PE enable or the I/O/CU enable. If this is available, the broadcast method can be used without any extra cost since the multiple broadcasts are overlapped with multiplication. Therefore:

$$\text{time} = \frac{n^2}{PE/n} \times 43 = \frac{n^3}{PE} \times 43\ \mu\text{sec},$$

and for a 256 × 256 matrix the time = 740 msec.

If the above mentioned control of sM is not available, about $8(PE/n + 2)$ additional clocks are needed per multiplication to select the broadcast elements. For $PE/n = 4$, this adds 5 µsec per multiplication. The expression is:

$$\text{time} = \frac{n^3}{PE} \times \left(43 + .8\left(\frac{PE}{n} + 2\right)\right)\ \mu\text{sec}.$$

This method is then convenient only when $PE/n$ is small so that the extra time spent in selective broadcast is not excessive.

Knapp's method avoids selective broadcasts but introduces routings. $PE/n$ rows of A are concatenated across one row of PEM's and B is repeated $PE/n$ times, once for each concatenated row of A. For n=128, this operation is easily obtained by writing in PEM eight times the same row of B read only once from MM. For n < 128 this repetition may require initial routes. Each diagonal of B is obtained by local indexing ($PE/n$ copies of the diagonal are actually obtained) and multiplied by the rows of A. The result is accumulated and when all diagonals have been used, $PE/n$ rows of C are computed. After each diagonal is used, the rows of A must be routed right by a distance of 1.
Since this route is end-around with respect to n and not to the whole array, a second route is needed (by a distance n) unless n=128. The rows of A can be kept in sM while in use so routing is of type 1. The time is given roughly by the following:

$$\text{time} = \frac{n^3}{PE}\,(\text{add time} + \text{multiply time} + 2\ \text{route times}) = 90\,\frac{n^3}{PE}\ \mu\text{sec}$$

if the routes are all fast. Therefore, the selective broadcast method is best for all cases in which $PE/n < 32$.

5.4 Pattern Matching

This application was chosen to test the character manipulation capabilities of SPEAC. The problem, fully described in [9], is briefly stated as: given two strings of characters, S (with $n_s$ characters: $s_1, s_2, \ldots, s_{n_s}$) and P (with $n_p$ characters: $p_1, p_2, \ldots, p_{n_p}$), find out how many times and/or in which positions P occurs in S. P is called the pattern string and S the source string. Normally $n_s \gg n_p$. The problem can be considered in two different aspects: 1) $n_p$ is very small (typically 1 to 3) and only the count of occurrences is desired. This is what is needed in analysis of texts to obtain the frequency of occurrence of given letters or combinations of letters, and 2) $n_p$ can be a small integer up to about 15 and the positions in S in which P occurs are desired. This is the type of algorithm needed, for example, to find all occurrences of the words BEGIN and END in a segment of a program, as would be necessary in a parallel compiling technique as proposed in [10].

The source string S can be arranged in memory in two different ways: 1) S is distributed across PE's in rows, one element per PE; i.e., character $s_i$ is in $PE_{(i \bmod n_{PE})}$, and 2) S is distributed across PE's in $n_{PE}$ chunks, each with $n_s/n_{PE}$ adjacent characters; i.e., character $s_i$ is in $PE_{(i\ \mathrm{div}\ (n_s/n_{PE}))}$. Storage scheme 2, called storage in chunks, leads to much more efficient programs in SPEAC than storage scheme 1, called storage across PE's.
This is due to the fact that with storage in chunks, routing is practically eliminated. However, both storage schemes are considered since it may be difficult to use storage in chunks if the input data is not initially manipulated by corner memory.

a) Storage in chunks; only a count of the number of occurrences is required.

Each character is assumed to be four bits long and is coded in one digit. Obviously, this introduces no restriction since the same algorithms can be applied if more than one digit is needed to code each character. No character manipulation instructions were considered in Chapter 4. Therefore, most instructions used in these algorithms are custom-made, that is, they are described in terms of their microsequences.

Initially, the first ($n_p$-1) characters in each chunk must be routed left by a distance of one in order to enable the recognition of truncated occurrences of P (i.e., an occurrence of P in which $p_1$ is the right-most character in chunk i and $p_2, p_3, \ldots$ are the [...]

5.5 Sparse Matrices

The problem deals with the elimination of the need to store in PEM the zero elements of sparse matrices and the resulting problem of remembering in some form the positions of the non-zero elements in the actual matrix. The term actual matrix will be used to refer to a sparse matrix represented with its zeroes and actual row to refer to a row of such a matrix also with its zeroes. The form decided upon clearly must be useful in completing the task of sparse matrix multiplication. This section is concerned with describing two forms of storing sparse matrices for SPEAC, discussing their program adaptability, and demonstrating their use in programming.

The two general forms for storing sparse matrices are the individual-tag method and the bit-matrix method [11].
These two methods are similar in that for both, the non-zero elements of a matrix are stored in the same way; for a sparse matrix A, 1024 × 1024, the j-th column is stored in $PE_j$ and zeroes are eliminated by pushing each non-zero number up the column until no zero elements remain between it and the next higher non-zero element, if one exists.

1) The bit-matrix method consists of storing a 1 or a 0 bit for each element of the actual matrix depending on whether an element is non-zero or zero respectively. The result of this procedure is a matrix with the same dimensions as the actual matrix, but which requires less space to store in memory since each element of this matrix is only a bit wide. These bits are stored packed four in each digit and require 256 digits in each PEM. The LSB in this string B of 256 digits (1024 bits) in $PE_i$ indicates whether $a_{1i}$ is zero or not; in general, the j-th bit in the string in $PE_i$ refers to element $a_{ji}$. This method allows very efficient reconstitution of the actual rows but may still need too much storage space if the matrix is very large and very sparse. In this case, the following method is used instead.

2) The individual-tag method associates with each non-zero element of a matrix A a related positive integer t, called a tag. A tag matrix is constructed in which $t_{ij}$ is zero if $a_{ij}$ is zero and $t_{ij}=i$ if $a_{ij}$ is non-zero. The tag matrix is then stored with column j in $PE_j$ and compacted in the same way used to compact A. Therefore, $PEM_j$ will contain two strings of numbers: $a_1, a_2, \ldots, a_{n_j}$ and $t_1, t_2, \ldots, t_{n_j}$, where $n_j$ is the number of non-zero elements of A in column j. $a_i$ is the i-th non-zero element in column j of A; if this is element $a_{kj}$, then $t_i=k$. Each element t takes only three digits for matrices up to 4K × 4K. Note that $n_j$
is normally different for each column of A but hopefully, if the zero elements of A are randomly distributed, no large variations exist between the number of non-zero elements in two columns.

The problem of multiplying two sparse matrices stored in either of the methods above is now considered. The broadcast method of multiplication (see Section 5.3) is used. Therefore, the only extra procedure needed is an efficient way to reconstruct the actual rows of the matrices. This is the purpose of the algorithms now described.

a) Expand in actual rows a sparse matrix stored according to the bit-matrix method.

The rows must be expanded in order, from the first to the last. Fortunately, this is the order in which they are used in the broadcast method. Initially, the first digit $b_1$ of the bit string B is fetched from PEM in each PE and stored in sM(0). The address of the first element of each compacted column (i.e., the address of $a_1$) is sent via CAB to an address register X. When each digit of the elements of the first row must be fetched, the PE's are enabled by the LSB of sM(0) during both the fetch and the subsequent increment of X to point to the next digit. If the register to which the fetch is made is initially zeroed, the register will contain the correct row element after the fetch. X is also kept pointing to the appropriate element of the compacted column since it is not advanced in $PE_j$ when the actual row had a zero in column j. For the fetches of the three next rows, the three next bits of $b_1$ are used as enabling bits and then the next element $b_2$ of B is fetched, and so on.

The extra time required to fetch the elements of B is probably easily overlapped with PE multiplications, and the time to multiply two sparse matrices A × B (each 1024 × 1024) stored according to the bit-matrix method is D × 43 sec, where D is the density of matrix A.
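The bit-matrix compaction and the in-order row expansion of case a can be sketched as follows. This is a simplified model written for illustration only: each inner list stands in for one PEM, the bits are kept as a plain list per column rather than packed four to a digit, and the names `compact_bit_matrix` and `expand_rows` are hypothetical.

```python
def compact_bit_matrix(A):
    """Return (columns, bits) for a matrix A given as a list of rows.

    columns[j] holds the non-zero elements of column j pushed together,
    bits[j][i] is 1 when A[i][j] is non-zero and 0 otherwise.
    """
    n_rows, n_cols = len(A), len(A[0])
    columns = [[A[i][j] for i in range(n_rows) if A[i][j] != 0]
               for j in range(n_cols)]
    bits = [[1 if A[i][j] != 0 else 0 for i in range(n_rows)]
            for j in range(n_cols)]
    return columns, bits


def expand_rows(columns, bits):
    """Yield the actual rows in order, from the first to the last."""
    n_cols = len(columns)
    n_rows = len(bits[0])
    pointer = [0] * n_cols            # models the per-PE address register X
    for i in range(n_rows):
        row = [0] * n_cols            # destination register is zeroed first
        for j in range(n_cols):
            if bits[j][i]:            # PE j is enabled only when its bit is 1
                row[j] = columns[j][pointer[j]]
                pointer[j] += 1       # X advances only in enabled PE's
        yield row


A = [[0, 5, 0],
     [7, 0, 0],
     [0, 3, 2]]
columns, bits = compact_bit_matrix(A)
assert list(expand_rows(columns, bits)) == A
```

The inner loop over j is what all PE's would do simultaneously; disabled PE's keep both the zeroed register and their pointer unchanged, which is why the expansion only works when rows are requested in order.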
Obviously, CU can analyze the broadcast elements and avoid broadcast of each zero element, which decreases the multiplication time proportionately to the density of matrix A. It should be noticed that the optimum reduction factor is not simply $D_A$ but $D_A \times D_B$. It is possible to devise an algorithm that achieves a reduction in time approaching the optimum value [13]; i.e., the algorithm also takes advantage of the sparseness of B to reduce multiplication time. However, the procedure is quite complex and will not be discussed here. It is also easy to see that the rows of the result can easily be compacted in the same bit-matrix representation if need be (i.e., if the product matrix is also sparse).

b) Expand in actual rows a sparse matrix stored according to the individual-tag method.

As in case a, the rows must be expanded in order. In this case, however, the expansion procedure is less efficient. Initially, the first tag $t_1$ is fetched from PEM and compared for equality with the row number (i.e., one for the first row) sent via CDB. This fetch and comparison takes about 12 clocks since three digits must be compared and one of the operands is broadcast and does not have to be fetched. The result of the comparison, left in lcFF1, is then used to enable the fetch from PEM address X and the subsequent increment of X. Therefore, an additional 1.2 µsec is needed to fetch each row in the individual-tag method. This cannot be overlapped with multiplications, as in case a, because the arithmetic part of the PE must be used for the comparison.

6. CONCLUSIONS

The concept of an array computer with a very large number of relatively simple processing elements has been proven feasible; the PE hardware was described in great detail and the sections on operations and applications show that this hardware can be used quite efficiently.
Obviously, several problems remain to be studied and the following considerations analyze these problems and offer some suggestions for further research.

Two areas are considered: 1) problems related to SPEAC in particular, and 2) problems related to the general architecture of array computers with many processing elements.

With respect to SPEAC in particular, the PE hardware has been painstakingly refined and optimized as far as one can get without an actual commitment to build the machine; a few questions remain to be answered and final "tuning" of the PE hardware must be performed, but these could be accomplished only with definite cost figures to analyze the cost-efficiency of different alternatives. Some of these alternatives were discussed in the section on implementation. A few specific points are:

a) The scratchpad memory sM, introduced in the PE at a late stage in development, has proven to be an impressive improvement, making possible a reduction by a factor of two to three in the times of floating-point operations. The study of applications also revealed that an increase in the capacity of sM will improve the performance in several areas. Therefore, the final size of sM must be carefully determined to optimize cost-efficiency. It is also interesting to notice that sM has performed so well because of the relatively large values attributed to PEM access and cycle times (300 nsec and 500 nsec respectively). It now appears that these values are unduly pessimistic and, depending on the final times obtained, the importance of sM will decrease and sM may be eliminated altogether.

b) CU architecture was only sketched and a much more detailed design would be needed if the machine were to be built. Specifically, the system of two queues for PE operation did not result in any substantial improvement for most operations.
Since the system is quite expensive to implement and introduces serious complications in microprogramming, it should be dropped and only three queues used: one for I/O, one for PE, and one for CU instructions.

c) The possibility of overlapping PE instructions with I/O or CU instructions has proven very valuable in several applications. The system should be refined as suggested in Section 5.3-b to allow overlap not only in the use of PEM, but also in the use of sM.

d) Final minimizations in the number of connections and the number of chips per PE must be performed in view of the state of the art in integrated circuitry at the time of implementation. This field has advanced so rapidly that the picture has changed substantially within the last year. Specifically, one would need data about MOS-T²L relative performance, equivalent gate densities obtainable per chip and cost of custom-built chips.

With respect to the field of array computers with a large number of processing elements, the following considerations are offered:

a) Software development for an array computer is a troublesome area, as demonstrated by the arduous and sometimes frustrated efforts to develop a high-level language for ILLIAC IV. This was probably to be expected if one takes as a parallel the development of high-level software for sequential computers; it started only after a decade of painstaking machine-language programming. The lapse in the case of array computers should be much shorter since a whole body of knowledge about languages does exist and will be used as a basis. Nevertheless, array computer users seem to be condemned to a few years of assembly-language programming while software researchers gain the insight and experience needed to provide efficient and reliable high-level compilers.
It was expected at the beginning of this research that programming SPEAC would be one order of magnitude more difficult than programming ILLIAC IV, just as programming ILLIAC IV is one order of magnitude harder than programming conventional computers. Fortunately this has not been the case; programming SPEAC has been about as difficult as programming ILLIAC IV. Of course, this was mainly due to the fact that the size of the sample problems was selected to facilitate programming. The problem becomes more difficult when problems "smaller" than the size of the array must be handled efficiently, and this is more and more frequent as the number of PE's increases.

If large array computers are to perform the role that is expected of them, the user must be spared the task of knowing what each specific PE is doing, much in the same way as in conventional computers the user has been spared the task of keeping track of absolute memory addresses.

An initial step in this direction is provided by N. R. Lincoln. In a recent paper [10], he proposes a radically new technique for using array computers in such problems as compiling, which have so far been considered typically non-parallel (that is, unsuitable for these machines). Such techniques, if successful, could increase tremendously the area of application of SPEAC. The study of the performance of SPEAC in pattern matching problems, which was discussed in Section 5.4, has shown that it can perform very efficiently the basic tasks required in Lincoln's scheme.

b) One very promising idea has been recently proposed to help solve the problem of handling efficiently problems "smaller" than the size of the array in computers of the type of SPEAC. It consists of linking groups of PE's together in a hardware-implemented fashion so that a group of PE's would be able to function as a single PE with speed roughly proportional to the number of actual PE's in the group.
The problem is reasonably complex and will require considerable research but the possibilities are far-reaching; this method would not only make it much easier to use efficiently computers of the scale of SPEAC, but it would also make practical array computers with tens and even hundreds of thousands of very simple PE's.

c) Finally, one very long-range research project would be to investigate how far one could go with the number of elements in a parallel processor. The approach described above allows one to envision a processing unit composed of many similar "PE's" linked together in a fail-soft configuration, much like the individual cells in a brain. If one PE fails, the only immediate effect would be a slight reduction in the speed of the processor as a whole.

APPENDIX A

PACKAGE LOGICAL DIAGRAMS

[Figure: data inputs; data select (address); output W]
Package 1. One-out-of-eight Selector without Strobe

[Figure: C (clock output); I (interrupt output)]
Note: The function is as follows for each lcFF:

A  B  Function
0  0  Do nothing, i.e., the lcFF is not used
1  0  Use the lcFF to control the interrupt wire; the interrupt wire will assume the logical level of the lcFF
0  1  Enable the PE (i.e., allow the clock to reach the registers) when the lcFF contains a ZERO
1  1  Enable the PE when the lcFF contains a ONE

Package 6. Enable and Interrupt Control

Package 7. PEM-1 Module

[Figure: data select (address); output]
Package 10. One-out-of-two Selector without Strobe

[Figure: inputs clock C, down/up M, enable G, data inputs D0 to D3, load L; outputs ripple clock, max/min, $C_{n+12}$, Q0 to Q3]
Note: When cascading, G input goes to least significant hexadecimal digit and C output comes only from most significant hexadecimal digit; $G_i$
is connected to $(C_{n+12})_{i-1}$ for all not externally connected G and $C_{n+12}$.
Package 11. 4-bit Up/Down Counter, Parallel In/Out

[Figure: S (strobe); data inputs; data select (address)]
Package 12. One-out-of-four Selector with Strobe

[Figure: data inputs]
Package 13. Quad Inverter

[Figure: output carry; outputs; data inputs; input carry]
Note: When cascading, the $C_{n+12}$ output comes only from the most significant package; the $C_n$ input to the least significant package is "1." Input $(C_n)_i$ is connected to output $(C_{n+12})_{i-1}$.
Package 14. Increment-by-one Network (16 bits)

[Figure: data inputs; select (address); output W]
Package 15. One-out-of-five Selector without Strobe

APPENDIX B

MICROSEQUENCE FOR 32-BIT FLOATING-POINT MULTIPLICATION

This is a detailed listing of the microsequences sent by CU to each PE to perform the multiplication: a × b = c, where each number is in the following format:

$a_7\ a_6\ a_5\ a_4\ a_3\ a_2\ a_1\ a_0$

$a_0$ is the mantissa LSD (least significant digit)
$a_5$ is the mantissa MSD (most significant digit)
$a_{60}$ (i.e., the low order bit of $a_6$) is the mantissa sign bit
$a_7$, $a_{63}$, $a_{62}$ and $a_{61}$ constitute the exponent; $a_{61}$ is the low order bit of the exponent.

The exponent is in excess notation and the mantissa in sign and magnitude. The exponent base is 16. $a_0$ is the low address in the PEM.

The following abbreviations are used in the microsequences:

A ← B, which means that register A is loaded with the contents of register B.

sM(X) or PEM(X), which means the contents of the location with address X in sM or PEM; X can be a literal or a register, in which case the contents of the register are taken as the address. When X is a literal, it is sent via CAB.

CAB(a) or CDB(a), which means that data a is sent via the common bus.

En(i,ON) or En(i,OFF), which means that the enable function is attributed to lcFFi ON or OFF.
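The eight-digit number format described above can be illustrated with a small decoding sketch. Several points are assumptions made here for illustration and are not stated in the text: the excess value is taken to be 64 (the text only says "excess notation"), the mantissa is treated as a six-digit hexadecimal fraction with the radix point to the left of its MSD, and the function names `unpack_digits` and `to_float` are hypothetical.

```python
def unpack_digits(digits):
    """Split eight 4-bit digits into (sign, exponent field, mantissa).

    digits[0] is a_0 (lowest PEM address) and digits[7] is a_7.
    """
    if len(digits) != 8 or any(not 0 <= d <= 15 for d in digits):
        raise ValueError("expected eight digits in the range 0..15")
    mantissa = 0
    for d in reversed(digits[:6]):        # a_5 is the mantissa MSD
        mantissa = (mantissa << 4) | d
    sign = digits[6] & 0b0001             # low order bit of a_6
    exponent = (digits[7] << 3) | (digits[6] >> 1)   # a_7 plus top 3 bits of a_6
    return sign, exponent, mantissa


def to_float(digits, bias=64):
    """Convert to a Python float; bias=64 is an assumption, not from the text."""
    sign, exponent, mantissa = unpack_digits(digits)
    fraction = mantissa / 16 ** 6         # assumed: radix point left of a_5
    value = fraction * 16.0 ** (exponent - bias)
    return -value if sign else value


one = [0, 0, 0, 0, 0, 1, 2, 8]            # a_0 ... a_7
assert unpack_digits(one) == (0, 65, 0x100000)
assert to_float(one) == 1.0
```

Masking a_6 with 0001 to isolate the sign and with 1110 to clear it mirrors the CDB(0001) and CDB(1110) constants that appear in the microsequences.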
Each microsequence is numbered with two PE clock counts: maximum and minimum. The minimum count assumes that the two buses are available and maxi mum overlap can be achieved; the maximum count assumes that only one bus is available at all times for PE operation. CAB and CDB are assumed always | available . 155 SO SO 3 o 3 o S H S H •HO «H O l! S gS Microsequence Comments 1 1 X *- CAB (address (a Q )) Address registers are loaded with mantissas' LSD addresses 2 2 X 2 - CAB( address (b Q )) 3 3 A «- PEM(X )j sM(0) *- PEM(X ) PEM read; takes 3 clocks 6 3 B *- PEM(X 2 ) If overlap is possible, one extra clock is needed to store in sM 8 6 sM(6) «- pem(x 2 ) 9 6 A <- CDB(O); A <- CAB(O); m J c ' Ready to start multiplication; X _, Incr X, ; Incr X X are ready to access the next digits 10 7 MF(1, X 1 , *, 1, *) See note a for the meaning of MF; m is completed 15 12 MF(7, \, 1, 0, S) 20 17 MF(2, X r 6, 2, *) m is completed 25 22 MF(*, *, 7, 1, S) 30 27 MF(8, X 2 , 8, 0, S) 35 32 mf(3, x 1? 6, 3, *) m is completed 1+0 37 MF(*, *, 1, 2, S) ^5 k2 MF(*, *, 8, 1, S) 50 hi mf(9, x 2 , 9, o, S) 55 52 MF(1+, X^ 6, fc, *) m is completed 60 57 MF(*, *, 1, 3, S) 65 62 MF(*, *, 8, 2, S) 70 67 MF(*, *, 9, 1, S) 75 72 MF(10, X , 10, 0, S) 156 ^ ° i s 3 o a o S H S H .SO -rj O S w .hw Micro sequence 80 85 75 82 mf(5. X 1 , 6, 5, *) MF(*, *. 7. k > S ^ 90 J 87 mf(*, *, 8, 3; s) 92 |MF(*, *, 9. 2, s) mf(*, *, 10, 1, s) MF(11, x 2 , 11. 0, S) 95 LOO r 05 L10 97 102 107 Ll6 121 L26 131 136 1^2 ^7 152 157 163 ML(7. 5. 0) 113 MF(7; \> 8 ^ ^ S ^ 118 MF(*, *, 9. 3, S) 123 MF(*, *, 10* 2 ' s ^ 128 MF(*, *, 11; 1* s ) 133 ML(8, 5, 1) 139 MF(8, X^ 9. h, s) II4.I4 mf(*, *. 10. 3, s) 1I4.9 MF(*, *, 11. 2 > S ) 15U ML(9. 5. 2) 160 mf(9. x 2 , 10, ^, s) 168 173 165 170 179 176 MF(*, *, 11. 3, S) ML(10, 5. 3) MF(10, Xg, 11. 
^ s) Comments ,; a is completed See note b for the meaning of ML, m is loaded in sM(0) 5 a^ is loaded in sM(T) for future 6 use nu is loaded in sM(l) 6 a is loaded in sM(8) for future use liru is loaded in sM(2) "b, is loaded in sM(9) ^r future 6 use m is loaded in sM(3) b is loaded in sM(10) for future use 157 _ £i X SO BO 3 o 3 o •HO -HO |^ i! S Microsequence 18k 190 191 192 196 197 198 199 205 211 217 223 ?29 >35 Ikl 06 12 18 181 187 188 189 193 193 19^ 195 201 207 213 219 225 231 237 200 201 206 207 ML(ll, 5, h) ML(*, *, 5) X ± *- CAB( address (c )) X 2 - CAB(0) lcFFl «- (A m =CDB(0)) ; sM(6) «- A A <- CDB(0) rn m En(l,0N); A m - CDB(0010); Incr X ST ST ST ST ST ST ST; wait on Event #1 ST j wait on Event #2 B - sM(7); A - CAB(O) A 4- (B-Aj; C =1; lcFFi+ *- C , m tor ' n ' n+4 Shift A r , A m right 1+; B ♦- sM(8) A m*- (B - A m ^ c n = lc ^ Comments m. .q is loaded in sM(U) rrL is loaded in sM(5) X 1 and X 2 are prepared to write the result m ,, is loaded in sM(6) See note c c Q is stored in PEM; see note d for the meaning of ST c is stored in PEM c is stored in PEM c is stored in PEM c. is stored in PEM c._ is stored in PEM cv is stored in PEM; see note e c is stored in PEM Exponent computation starts now B is loaded with LSD of exponent of a A is still as in note c m — B is loaded with MSD of exponent of a 158 lo 3 o a h a h .3 O -HO 5 w Micro sequence ==1 n, 212 Shift A left ^j B sM(9) (A.0B) (A AND CDB(lllO)) m ( A m +B)s C rT°> lcFFif *" C nA <_ A • cause Event #1 nr Shift A m right k; B «- sM(lO) 2l+6 2l+8 2i+9 250 251 251 225 A -(A+B); C =lcFF^; m m ' lc¥F k +. 
c n+k 226 A - (A eCDB(lOOO)); shift A right *+ 230 J sM(8) -A m j cause Event #2; shift A m left U 231 232 233 2^1 a A , A «- LC V V m o lcFFl <- (A m =LC) Interrupt on lcFFl ON Comment s Now have in A q , A^ exp(a)-l if m =0 and exp(a) if n^]/ A now contains the sign of c '0 Set sign hit to zero in A m A now contains LSD of exp(c) m A has MSD of exp(a) and B has m MSD of exp(b) Correct sum in excess notation f complementing MSB Start detection of exponent over flow or underflow; see note f End of the operation 159 Notes: a) MP(a, b, c, d, S) is defined as the following set of five microsequences: 1) Add and shift; sM(a) - PEM(b) ' I 2) Add and shift 3) Add and shift II jjj ' » r k) Add and shift; Incr (b); B ^ sM(c) 5) A - sM(d) ; shift A , A left k 1 Cm IV If a and b are *' s then portions I and II are absent; if c is a * then portion II is absent; if B is rep iaced by a * then portion IV is absent. MF can perform the following: a) multiply two digits, b) fetch from PEM and store in sM a digit to be used in the next multiplication, and c) load A and m 1 with the two digits needed in the next multiplication. *) ML(a, b, c) is defined as the following set of six microsequences: 1) Add and shift 2) Add and shift 3) Add and shift h) Add and shift; B «- sM(a) 5) sM(c) ♦. a I 6) A r «- sM(b) II , If a is a * then portion I is absent; if b is a * then portion II is absent. ML multiplies two digits, stores the MSD of the product in sM and «is A m and B with the two digits needed in the next multiplication. i6o c) At this stage, X points to m (in sM(O)) if ^=0 and to m^ (in sM(l)) if m /0. Therefore, X points to c . Also, A contains 0000 if m__/0 and 0010 if m =0 to prepare for the correction in the exponent. d) ST is defined as the following set of six microsequences: 1) PEM(X ) *- sm(x 2 ) 2) Wait for writing in PEM 3) Wait for writing in PEM k) Wait for writing in PEM 5) Wait for writing in PEM 6) Incr X-. , Incr X ST stores the digits of the product in PEM. 
   This is overlapped as much as possible with the computation of the
   exponent.

e) The wait in this microsequence assures that the exponent will be written
   in PEM only after it is computed.

f) In excess notation addition, there is an overflow if the carry from the
   MSB is equal to the MSB of the sum before the necessary correction, which
   consists of complementing the MSB.


APPENDIX C

MICROSEQUENCE FOR 32-BIT FLOATING-POINT ADDITION

This is a detailed listing of the microsequences sent by CU to each PE to
perform the addition: a + b = c. Number format, notation and abbreviations
used are as listed in the introduction to Appendix B.

Time      Time
(normal)  (min.)   Microsequence                                   Comments

                   X_1 <- CAB(address(a_7))                        Address registers are loaded with
                   X_2 <- CAB(address(b_7))                          address of MSD of the exponents
                   A_m <- PEM(X_1); sM(3) <- PEM(X_1)              PEM read; takes 3 clocks
                   B <- PEM(X_2)
                   sM(1) <- PEM(X_2)                               If overlap is possible, one extra clock
                                                                     is needed to store in sM
                   lcFF1 <- (A_m = B); lcFF4 <- (A_m < B);         Comparison of exponents starts
                     shift A_c left 4; Decr X_1; Decr X_2
10        8        A_m <- PEM(X_1); sM(2) <- PEM(X_1)              Read the LSD's of the exponents
13        8        B <- PEM(X_2)
15        11       sM(0) <- PEM(X_2)
16        12       En(1,ON); lcFF4 <- (A_m < B);                   lcFF4 is now ON iff exp(a) > exp(b);
                     shift A_c left 4                                A_c contains exp(a)
17        13       A_m <- (A_m AND CDB(0001))                      All bits except sign are zeroed
18        14       Shift A right 4; A_r <- A_m; A_m <- B
19        15       A_m <- (A_m AND CDB(0001))                      All bits except sign are zeroed
20        16       lcFF1 <- (A_m = A_r)                            lcFF1 is now ON if sign(a) = sign(b)
21        17       En(4,ON); A <- sM(0); A_c <- X_1                Interchange exponents and addresses in
22        18       En(4,ON); A <- sM(1); X_1 <- X_2                  PE's in which exp(a) > exp(b)
23        19       En(4,ON); B <- sM(2); X_2 <- A_c
24        20       En(4,ON); sM(0) <- B; shift A left 4
25        21       En(4,ON); B <- sM(3); shift A left 4
26        22       En(4,ON); sM(1) <- B
27        23       Shift A right 4; sM(2) <- LC                    See note a
28        24       B <- sM(0)                                      Exponent subtraction now starts
29        24       A_m <- (A_m OR CDB(0001))                       Sets sign bit to one so that it does
                                                                     not interfere with subtraction
30        25       A_r <- (B - A_m); C_n = 1; lcFF4 <- C_n+4;
                     B <- sM(1); shift A right 4
31        26       A_m <- (B - A_m); C_n = lcFF4; A_c <- CAB(0)
10        8        Decr X_1, Decr X_2                              These six clocks are overlapped with
15        11       Decr X_1, Decr X_2                                previous ones; they make X_1 point to
16        12       Decr X_1, Decr X_2                                a_0 and X_2 point to b_0
17        13       Decr X_1, Decr X_2
18        14       Decr X_1, Decr X_2
19        15       Decr X_1, Decr X_2
32        27       Shift A right 1; A_c <- X_2                     See note b
33        28       lcFF1 <- (A_m = CDB(0)); B <- A_r
34        29       lcFF2 <- CDB(0); A <- CDB(0)
35        30       En(1,OFF); lcFF2 <- CDB(0010);                  Ready now to perform mantissa
                     shift A right 4                                 alignment; see note c
36        31       A_m <- (A_m + B); C_n = 0; lcFF4 <- C_n+4
37        32       En(4,ON); Incr A_c
38        33       Shift A left 4
39        34       X_2 <- A_c                                      Mantissa alignment completed
40        34       A_c <- CAB(FFF - N_m + 1)                       Prepare trap in A_c; see note d
41        35       Shift A right 4; A <- CAB(FFF)
42        36       A_m <- (A_m + B); C_n = 0; lcFF4 <- C_n+4;      B still had the difference of the exps;
                     B <- PEM(X_2)                                   it is reloaded with the first operand
43        37       En(4,ON); En(2,OFF); Incr A_c;
                     lcFF2 <- C_n+12
44        37       lcFF1 <- sM(2); shift A_c left 4                Trap is completed; lcFF1 is ON only
                                                                     if signs are equal
45        38       A_m <- PEM(X_1)                                 Fetch the second operand
48        41       ADFI(4)                                         The actual addition starts now; see
                                                                     note e for the meaning of ADFI, ADF
                                                                     and AD
57        47       ADF(5)
66        53       ADF(6)
75        59       ADF(7)
84        65       ADF(8)
93        71       AD(9)                                           Addition completed; now find out sign
                                                                     of result and if recomplementation
                                                                     is needed; see note f
96        74       A_m <- LC; B <- LC
97        75       sM(3) <- A_m; shift A right 4
98        76       Shift A left 1
99        77       A_m <- (Ā_m AND B)
100       78       lcFF1 <- A_m                                    lcFF1 is ON if recomplementation is
                                                                     needed
101       78       B <- CDB(0)
102       79       A_m <- sM(4)                                    Ready to start recomplementation
103       80       RCI(5,4)                                        See note g for meaning of RC and RCI
105       82       RC(6,5)
107       84       RC(7,6)
109       86       RC(8,7)
111       88       RC(9,8)
113       90       RC(*,9)                                         Recomplementation completed
115       92       A_m <- sM(0)                                    Now set up sign of result; i.e., change
116       93       En(1,ON); A_m <- (A_m ⊕ CDB(0001))                the sign of the exponent in sM(0),
117                sM(0) <- A_m                                      sM(1) if recomplementation was needed
118       95       A_m <- sM(3); B <- sM(3)                        sM(3) contains MSB ON if there was
                                                                     final output carry and LSB ON if
                                                                     sign(a) = sign(b)
119       96       Shift A_c left 4; X_1 <- CAB(FFF)
120       97       Shift A_m right 1; lcFF1 <- CDB(0001)
121       98       A_m <- (A_m AND B)
122       99       lcFF4 <- A_m                                    lcFF4 is now ON if there was an
                                                                     "overflow"
123       99       A_m <- sM(1); A_c <- CAB(0)
124       100      Shift A_c left 4; A_m <- sM(0)
125       101      Shift A_c left 4; A_r <- CDB(0)
126       102      Shift A right 1; sM(10) <- CDB(0001)
127       103      X_2 <- A_c; A_m <- sM(9)                        X_2 now contains exp(a) without the
                                                                     sign
128       104      CZ(8)                                           See note h for the meaning of CZ
130       106      CZ(7)
132       108      CZ(6)
134       110      CZ(5)
136       112      CZ(4)
138       114      CZ(*)
140       116      En(4,ON); Incr X_2                              This adds 1 to the exp if there was
                                                                     "overflow"
141       117      A_c <- X_2; A_r <- CDB(0); A_m <- CDB(0)
142       118      Shift A right 4
143       119      Shift A left 1
144       120      A_m <- sM(0)                                    Insert the sign back in the exponent
145       121      sM(0) <- A_m
146       122      Shift A right 4
147       123      sM(1) <- A_m; shift A right 4; B <- CDB(0)      Final exponent is now in sM(0), sM(1);
                                                                     prepare to detect exponent overflow
                                                                     or underflow
148       124      lcFF1 <- (A_m = B); A_c <- X_1
149       125      Interrupt on lcFF1 OFF; X_2 <- CAB(4)           lcFF1 OFF means exponent overflow or
                                                                     underflow
150       126      En(4,ON); X_2 <- CAB(5)
151       127      Incr A_c; lcFF2 <- C_n+12;                      Ready to start storing the result
                     X_1 <- CAB(address(c_0)); B <- CDB(0)
152       128      WR                                              See note i for the meaning of WR
160       136      WR
168       144      WR
176       152      WR
184       160      WR
192       168      WR                                              Mantissa is stored in PEM
200       176      PEM(X_1) <- sM(0)                               Now store exponent in PEM
205       181      Incr X_1
206       182      PEM(X_1) <- sM(1)
214       190                                                      End of the operation

Notes:

a) At this stage, the situation is as follows: lcFF1 is ON if the signs are
   equal, OFF otherwise; A_c, A_m contains the smaller exponent; the larger
   exponent is stored in sM(0), sM(1); in sM(2) the LSB is a one if the signs
   are equal and a zero otherwise.

b) At this point the situation is that A_c, A_m contains the difference of
   the exponents. If A_m is non-zero, then b will not participate in the sum
   (since the exponent difference is too large) and lcFF2 is set ON in PE's
   in which this happens.

c) Mantissa alignment is performed by adding the exponent difference (which
   is in B) to the address of b which is in A_c, A_m. The modified address of
   b is then returned to X_2.

d) A_c will be used as a counter which yields an overflow when all digits of
   b have been used. For PE's in which this overflow (which is stored in
   lcFF2) has appeared, digits of b are replaced by zeros before the
   addition.

e) ADF(a) (add and fetch) is defined as the following set of microsequences:

      1,1 - En(2,ON); B <- CDB(0); Incr X_1; Incr X_2
      2,2 - En(2,OFF); Incr A_c; lcFF2 <- C_n+12
      3,3 - A_r <- (A ± B); C_n = lcFF4; lcFF4 <- C_n+4; A_m <- PEM(X_1);
            lcFF1 OFF causes subtraction instead of addition
      6,3 - B <- PEM(X_2)
      9,6 - sM(a) <- A_r

   ADF takes a minimum of six clocks and the normal time is nine clocks.
   ADFI is similar to ADF but in clock (2,2) C_n is set to lcFF1 instead of
   to lcFF4.
   ADFI is used for the first addition and takes as long as ADF.
   AD is similar to ADF but no new fetch is performed. It is used for the
   last addition and takes only three clocks.

f) The rules are: for a ± b = c, sign(c) = sign(a) and no recomplementation
   is needed unless sign(a) ≠ sign(b) AND lcFF4 is OFF at the end of the
   operation. In this case, sign(result) = NOT sign(a) = sign(b) and
   recomplementation must be performed. An overflow occurs when
   sign(a) = sign(b) and lcFF4 is ON at the end of the operation.

g) RC(a,b) (recomplement) is defined as the two following microsequences:

      1) A_r <- ((Ā ∨ B) + B + 1); A_m <- sM(a); C_n = lcFF4; lcFF4 <- C_n+4
      2) En(1,ON); sM(b) <- A_r

   If a is a *, then A_m is not loaded on the first microsequence. The
   arithmetic function above performs recomplementation when B = 0.
   RCI(a,b) is used for the recomplementation of the first digit; it is
   similar to RC but in the first microsequence C_n = 1 instead of
   C_n = lcFF4.

h) CZ(a) (count zeros) is defined as the following set of two
   microsequences:

      1) En(1,ON); En(4,OFF); lcFF1 <- (A_m = B); A_m <- sM(a)
      2) En(1,ON); En(4,OFF); Decr X_1; Decr X_2

   If a is a * then A_m is not reloaded in the first microsequence. This
   function decrements X_1 and X_2 if A_m is zero (and has always been zero
   previously) and if there was no "overflow", which is signaled by lcFF4
   OFF. Since X_1 contains initially all 1's, a trap is formed to yield a
   carry when the number of leading zeros is added to it. Since X_2 contains
   initially the larger exponent, the exponent of the result is formed by
   subtracting one out of X_2 for each leading zero.

i) WR (write) stands for the following set of eight microsequences:

      1) En(2,ON); B <- sM(X_2); Incr X_2
      2) En(2,OFF); Incr A_c; lcFF2 <- C_n+12
      3) PEM(X_1) <- B
      4) Wait for writing in PEM
      5) Wait for writing in PEM
      6) Wait for writing in PEM
      7) Wait for writing in PEM
      8) Incr X_1

   WR stores the sum of the mantissas in PEM and also takes care of
   eliminating leading zeros. The trap in A_c signals when all leading zeros
   (which are transformed in trailing zeros) are eliminated.


LIST OF REFERENCES

[1]  Control Data Corporation. "The STAR Computing System." A technical
     proposal to The Atomic Energy Commission. December 1966.
[2]  Slotnick, D. L., et al. "The ILLIAC IV Computer," IEEE Transactions on
     Computers. Volume C-17, Number 8 (August 1968), pp. 746-757.
[3]  Kuck, D. J. "ILLIAC IV Software and Application Programming," IEEE
     Transactions on Computers. Volume C-17, Number 8 (August 1968),
     pp. 758-770.
[4]  Lehman, M. "A Survey of Problems and Preliminary Results Concerning
     Parallel Processing and Parallel Processors," Proceedings of the IEEE.
     December 1966, pp. 1889-1901.
[5]  Fulmer, L. C., and W. C. Meilander. "A Modular Plated Wire Associative
     Processor," Proceedings of the IEEE Computer Group Conference.
     June 1970, pp. 325-335.
[6]  Graham, W. R. "The Parallel and the Pipeline Computers," Datamation.
     Volume 16, Number 4 (April 1970), pp. 68-71.
[7]  Bremer, J. W. "A Survey of Mainframe Semiconductor Memories," Computer
     Design. Volume 9, Number 5 (May 1970), pp. 63-73.
[8]  Texas Instruments Incorporated. The Integrated Circuits Catalog for
     Design Engineers. First edition.
[9]  Yasui, T. "Pattern Matching Problem - Benchmark on ILLIAC IV," ILLIAC IV
     Document Number 217. University of Illinois at Urbana-Champaign.
     May 1970.
[10] Lincoln, N. R. "Parallel Programming Techniques," presented at the
     "SIGPLAN Symposium on Compiler Optimization." University of Illinois at
     Urbana-Champaign. July 1970.
[11] Wilhelmson, R., et al. "Matrix Operations on ILLIAC IV," ILLIAC IV
     Document Number 52. University of Illinois at Urbana-Champaign.
     March 1967.
[12] Stevens, J.
     E. "Matrix Multiplication Algorithm for ILLIAC IV," ILLIAC IV Document
     Number 231. University of Illinois at Urbana-Champaign. August 1970.
[13] Troyer, S. "Sparse Matrix Multiplication," ILLIAC IV Document Number
     137. University of Illinois at Urbana-Champaign. June 1968.
[14] Carr, R. "Gauss-Seidel on ILLIAC IV," ILLIAC IV Document Number 67.
     University of Illinois at Urbana-Champaign. May 1967.
[15] Ackins, G. "Fast Fourier Transform," ILLIAC IV Document Number 146.
     University of Illinois at Urbana-Champaign. July 1968.
[16] Stevens, J. "Fast Fourier Transform Subroutine for ILLIAC IV," ILLIAC IV
     Document Number 226. University of Illinois at Urbana-Champaign.
     July 1970.
[17] McIntyre, D. "ILLIAC IV Language Evaluation - A Preliminary Experiment,"
     ILLIAC IV Document Number 213. University of Illinois at
     Urbana-Champaign. November 1970.


VITA

Born in 1941 in Santos, Brazil, Nelson Castro Machado received in December
1964 the degree of "Electronic Engineer" from the Instituto Tecnologico de
Aeronautica in Sao Jose dos Campos, Brazil. He then worked there for one and
one half years as a teaching assistant, being responsible for courses in
applied electronics, pulse circuits laboratory and automata theory. In
September 1966 he came to the University of Illinois where he received the
M.S. degree in October of 1969. Since his arrival in the U.S.A., Mr. Machado
has been working as a research assistant, initially with the ILLIAC IV
Project and later with the Center for Advanced Computation of the University
of Illinois. In this activity, he was responsible for the semantics part of a
Translator Writing System developed to help implement ILLIAC IV languages.
His M.S. thesis entitled "ISL - A Semantics Language for a Translator Writing
System" is a result of this research. From January 1970 until January 1972,
Mr. Machado worked on the topic of parallel computer organization, exploring
new approaches to the concept of array processor utilized in ILLIAC IV.
This activity resulted in his Ph.D. dissertation entitled "An Array Processor
with a Large Number of Processing Elements."


DOCUMENT CONTROL DATA - R & D                    Security Classification: UNCLASSIFIED

Originating Activity:           Center for Advanced Computation
                                University of Illinois at Urbana-Champaign
                                Urbana, Illinois 61801
Report Security Classification: UNCLASSIFIED
Report Title:                   AN ARRAY PROCESSOR WITH A LARGE NUMBER OF
                                PROCESSING ELEMENTS
Descriptive Notes:              Research Report
Author:                         Nelson C. Machado
Report Date:                    January 1, 1972
Total No. of Pages:             184
No. of Refs:                    [illegible]
Contract or Grant No.:          DAHC04-72-C-0001
Project No.:                    ARPA Order 1899
Originator's Report Number:     CAC Document No. 25
Other Report No.:               UIUCDCS-R-72-499
Distribution Statement:         Copies may be obtained from the address given
                                in (1) above. Distribution unlimited; approved
                                for public release.
Supplementary Notes:            None
Sponsoring Military Activity:   U.S. Army Research Office-Durham
                                Durham, North Carolina

Abstract:

This paper describes a new type of array processor (SPEAC) which could be
characterized as an intermediate between ILLIAC IV and the Associative
Processor. The number of processing elements (PE's) is typically 1K but could
go as high as 8K. Each PE is a relatively simple unit with about 1K equivalent
gates, designed to allow implementation either on a single very complex LSI
chip or on several MSI chips. Each PE plus its memory (PEM) could then be
assembled on one single printed circuit board or ceramic substrate.

Processing is performed in groups of four bits which allows variable word
length. Maximum freedom in data format and instruction format is made
possible by the use of a microprogrammable control unit (CU). Therefore, the
machine is quite versatile and can be used efficiently either on
floating-point large precision problems (matrix operations, signal
processing, etc.) or on fixed-point small precision ones (character
manipulation, picture processing, etc.).

PE design is carried out in great detail and a general sketch of the CU is
presented. Operations are described and timed, with particular emphasis on
floating-point addition (20 µsec per PE for 32 bits) and floating-point
multiplication (25 µsec per PE for 32 bits). A few typical applications are
presented along with their time estimates.

DD Form 1473                                     Security Classification: UNCLASSIFIED

Key Words:                      Design and Construction
                                General Purpose Computer
                                Arithmetic Units


BIBLIOGRAPHIC DATA SHEET

 1. Report No.:                         UIUCDCS-R-72-499
 3. Recipient's Accession No.:
 4. Title and Subtitle:                 AN ARRAY PROCESSOR WITH A LARGE NUMBER
                                        OF PROCESSING ELEMENTS
 5. Report Date:                        January 1, 1972
 7. Author(s):                          Nelson Castro Machado
 8. Performing Organization Rept. No.:
 9. Performing Organization Name and Address:
                                        Center for Advanced Computation
                                        University of Illinois at Urbana-Champaign
                                        Urbana, Illinois 61801
10. Project/Task/Work Unit No.:
11. Contract/Grant No.:                 DAHC04-72-C-0001
12. Sponsoring Organization Name and Address:
                                        U.S. Army Research Office-Durham
                                        Duke Station
                                        Durham, North Carolina
13. Type of Report & Period Covered:    Research
15. Supplementary Notes:                None
16. Abstracts:                          See DD Form Number 1473.
17. Key Words and Document Analysis.
    17a. Descriptors:                   Design and Construction
                                        General Purpose Computer
                                        Arithmetic Unit
    17b. Identifiers/Open-Ended Terms:
    17c. COSATI Field/Group:
18. Availability Statement:             Copies may be obtained from the address
                                        in (9) above. Distribution unlimited.
19. Security Class (This Report):       UNCLASSIFIED
20. Security Class (This Page):         UNCLASSIFIED
21. No. of Pages:                       184
22. Price:                              NC

Form NTIS-35 (10-70)                    USCOMM-DC 40329-P71