 AN ARRAY PROCESSOR WITH A LARGE NUMBER 
 OF PROCESSING ELEMENTS 
 
 By 
 
 Nelson Castro Machado 
 
 January 1, 1972 
 
 CAC Document No. 25 
 
 UIUCDCS-R-72-499 
 
 Center for Advanced Computation 
 University of Illinois at Urbana-Champaign 
 Urbana, Illinois 61801 
 
 
 Submitted in partial fulfillment of the requirements for the degree of Doctor
 of Philosophy in Computer Science in the Graduate College of the University
 of Illinois at Urbana-Champaign. This work was supported in part by the
 Advanced Research Projects Agency of the Department of Defense and was
 monitored by the U.S. Army Research Office-Durham under Contract No.
 DAHC04-72-C-0001.
 
 
 ABSTRACT 
 
 This paper describes a new type of array processor (SPEAC) which
 could be characterized as an intermediate between ILLIAC IV and the
 Associative Processor. The number of processing elements (PE's) is typically
 1K but could go as high as 8K. Each PE is a relatively simple unit with about
 1K equivalent gates, designed to allow implementation either on a single very
 complex LSI chip or on several MSI chips. Each PE plus its memory (PEM) could
 then be assembled on one single printed circuit board or ceramic substrate.

 Processing is performed in groups of four bits, which allows variable
 word length. Maximum freedom in data format and instruction format is
 made possible by the use of a microprogrammable control unit (CU). Therefore,
 the machine is quite versatile and can be used efficiently either on
 floating-point large precision problems (matrix operations, signal
 processing, etc.) or on fixed-point small precision ones (character
 manipulation, picture processing, etc.).

 PE design is carried out in great detail and a general sketch of
 the CU is presented. Operations are described and timed, with particular
 emphasis on floating-point addition (20 μsec per PE for 32 bits) and
 floating-point multiplication (25 μsec per PE for 32 bits). A few typical
 applications are presented along with their time estimates.
 
 
 ACKNOWLEDGMENTS 
 
 I wish to express my deep gratitude to Professor Daniel L. Slotnick 
 for the constant aid and encouragement in every phase of the research herein 
 described. 
 
 My colleague Robert L. Mercer is to be thanked for the many helpful
 discussions and suggestions.
 
 I would also like to express my appreciation for the financial sup- 
 port given by the ILLIAC IV Project and the Center for Advanced Computation of 
 the University of Illinois. 
 
 Finally, special thanks are due to Suzanne Sluizer for the efficient 
 and accurate typing of the final document, to Fred Hancock and Jose Martinez 
 for their careful execution of many complex drawings, and to my wife Arlene 
 for unmatched patience and constant incentive. 
 
 
 TABLE OF CONTENTS

 Chapter

 1. INTRODUCTION

 2. THE ARRAY COMPUTER AND ITS APPLICATIONS
    2.1 General Description of an Array Computer
    2.2 Typical Applications and Their Requirements
    2.3 Considerations on the Number and Complexity of the PE's

 3. SPEAC's HARDWARE
    3.1 General Considerations
    3.2 The Multiplication Algorithm
    3.3 The System as a Whole
    3.4 The Processing Unit
        3.4.1 PE Memory
        3.4.2 PE Data Registers
        3.4.3 PE Description
              3.4.3.1 Registers and Buses
              3.4.3.2 The Arithmetic/Logic Unit
              3.4.3.3 Scratchpad Memory
              3.4.3.4 Address Registers
              3.4.3.5 Register A
        3.4.4 Local Control
              3.4.4.1 Direct Local Control
              3.4.4.2 Indirect Local Control
        3.4.5 Mode Control
        3.4.6 Interrupts
        3.4.7 Implementation Remarks
    3.5 The Control Unit
        3.5.1 General Structure
        3.5.2 Machine Synchronization - Events
        3.5.3 Queue System and FINST
              3.5.3.1 Queue Structure
              3.5.3.2 FINST Structure and Operation
        3.5.4 The Instruction Processor
        3.5.5 IDU and Instruction Format
    3.6 Memory
    3.7 I/O Buffer Register

 4. SPEAC's OPERATION
    4.1 Generalities - Data Format
    4.2 Local Indexing
    4.3 Multiplication
        4.3.1 Floating-point Multiplication
    4.4 Addition and Subtraction
        4.4.1 Signed Addition and Subtraction
        4.4.2 Floating-point Addition and Subtraction
    4.5 Other Operations
        4.5.1 .On
        4.5.2 Logic Operations
        4.5.3 Comparisons
        4.5.4 Shifts
    4.6 I/O
    4.7 Routing
    4.8 Summary of Timings

 5. APPLICATIONS
    5.1 General Considerations
    5.2 Relaxation
    5.3 Matrix Multiplication
    5.4 Pattern Matching
    5.5 Sparse Matrices

 6. CONCLUSIONS

 APPENDIX
    A. PACKAGE LOGICAL DIAGRAMS
    B. MICROSEQUENCE FOR 32-BIT FLOATING-POINT MULTIPLICATION
    C. MICROSEQUENCE FOR 32-BIT FLOATING-POINT ADDITION

 LIST OF REFERENCES

 VITA
 
 LIST OF TABLES

 Table

 1. PE Registers
 2. Functions Provided by the A/L Unit
 3. Control Wires and Their Functions
 4. Connections to Each PU
 5. Some IC Chips that Might Be Used in the PE
 6. Packages Used in the PE and Their Contents
 7. Rough Estimates for the Number of Chips Per PU
 8. Microinstruction Repertoire
 9. Number of Elementary Shifts for Each Shifting Distance
 10. Microsequence for Local Indexing
 11. Summary of Timing Estimates
 
 LIST OF FIGURES

 Figure

 1. A Classical Computer
 2. An Array Computer
 3. A Family of Array Computers with Constant Average Speed
 4. Versatility as a Function of the Number of PE's
 5. Cost-efficiency as a Function of the Number of PE's
 6. Instruction Format
 7. Fetches in Multiplication
 8. Global Structure
 9. Block Diagram of a Possible PEM Chip
 10. Basic Data Register Structure
 11. Simplified PE Diagram
 12. Conventions Used in PE Logical Diagram
 13. Complete PE Logical Diagram
 14. A Generalized Local Control
 15. Diagram of a Local Control FF
 16. CU Structure
 17. Queues and FINST Structure
 18. FINST Action Flow-graph
 19. Final Microsequence Assembly in FINST
 20. Basic PEIP Structure
 21. Detailed Instruction Format
 22. I/O Buffer Register Structure
 23. Standard Floating-point Format
 
1. INTRODUCTION
 
 Faster computers may be obtained either by improving the raw speed 
 of the circuits and components or by adopting a better organization, i.e., 
 using the same circuits in a more efficient architecture. Indefinite im- 
 provements in circuit speed cannot be expected due to fundamental physical 
 constants, the most obvious of these being the speed of light. Therefore, 
 new approaches to computer organization must be found if projected demands 
 of computer users are to be met, particularly in the area of large scientific 
 problems.
 
 In recent years, a fair amount of attention has been given to
 nonconventional organizations and the first two super-computers utilizing
 these new concepts will become operational within a few months: the pipeline
 processor CDC-STAR [1] and the array computer ILLIAC IV [2] [3]. Several other
 approaches have been proposed in the literature. Two deserve special mention:
 the parallel processor, extensively studied by IBM [4], and the associative
 processor, a type of array processor utilizing an associative memory and
 distributed logic [5]. Goodyear Aerospace Corporation has been working on
 an associative processor and successful tests have been performed on a
 reduced scale prototype.
 
 An endless number of questions, discussions and comparisons can be and
 have been raised when the capabilities and handicaps of the different
 organizations are considered [6]. As usual, one can find a specific
 application in which a given architecture excels and a pathological case in
 which the same approach fails miserably. It is not the purpose of this paper
 to engage in such comparisons. It will instead deal only with a particular
 organization: the array computer.
 
 The array processor family of computers has been widely accepted 
 by the computer community as a cost-effective approach in a particular but 
 rather important set of applications. In the sequel, this type of architec- 
 ture is examined and a new approach to the design of an array processor is 
 proposed in order to take advantage of recent and contemplated developments 
 in the fields of LSI circuits and solid state memories. 
 
2. THE ARRAY COMPUTER AND ITS APPLICATIONS 
 
 2.1 General Description of an Array Computer 
 
 ILLIAC IV will be taken here as the "typical" array computer. This 
 section is not supposed to be a complete description of ILLIAC IV and a cer- 
 tain familiarity with [2] and [3] is assumed. Only a few basic concepts are 
 considered here in order to set the stage for the following discussion. 
 
 Figure 1 shows the functional diagram of a classical computer. It
 consists of: 1) a memory to hold operands and instructions, 2) a control
 unit that fetches instructions from the memory, decodes them and issues
 control signals to 3) an arithmetic unit that performs the operations on
 operands taken from the memory. The most radical approach to parallelism would
 obviously be to duplicate the elements shown in Figure 1 a number (n) of
 times, providing adequate interconnections between the elements. This is the
 multiprocessor or parallel processor approach. Although powerful, this
 organization leads to several implementation problems and seems to be
 impractical for large n. (The Burroughs B6500 uses this organization with
 n = K.) One of these problems is the economic burden caused by the
 multiplicity of control units, since in a sophisticated classical machine the
 control unit accounts for rather more than fifty percent of the total gate
 count. This leads to the array computer approach, whose functional diagram is
 shown in Figure 2. Only arithmetic units and memories are duplicated and one
 single control unit (CU) drives the "array" of arithmetic units. Actually, the
 whole control unit cannot be made central since certain control decisions are
 operand-dependent (normalization, for example). Therefore, a minimum amount
 of control is kept local and each arithmetic unit plus its local control will
 be called a processing element (PE). Each PE operates on its own memory (PEM).
 The term processing unit (PU) will be used to designate a PE with its PEM.
 Instructions can be stored either across the PEM's or in a special
 instruction memory.

 Figure 1. A Classical Computer (blocks: instructions, control unit,
 arithmetic unit, memory)

 Figure 2. An Array Computer (blocks: instructions, instruction memory,
 control unit, PE_1 ... PE_n, memory 1 ... memory n)
 
 Therefore, an array computer is characterized by the fact that a
 single instruction stream is executed simultaneously by at most n PU's.
 The concepts of local indexing, routing and mode control will now be
 introduced.
 
 The biggest restriction imposed by this type of organization is
 obviously that every PE must be performing precisely the same instruction on
 the same addresses of its own PEM. These constraints can be relaxed to a
 good extent with the introduction of extra hardware to allow:

 a) local indexing: each central base address, "broadcast" by the CU to
    each PE, is locally indexed.

 b) mode control: each instruction is locally modified by the PE's. The
    simplest form of mode control is to locally decide if central
    instruction "I" will be locally executed as "I" or as a no-op; i.e.,
    each PE can be turned on or off. This is the only type of mode control
    available in ILLIAC IV (extreme mode control capability would obviously
    lead to a multiprocessor approach).

 c) routing: obviously, for most applications, at a certain point in the
    computation PE_i may need an operand which is stored in PEM_j, i ≠ j.
    Therefore, some way of "routing" operands from one PE to another is
    highly desirable. The most complete freedom of routing would be
    obtained if a cross-bar switch were provided linking each PEM to each
    PE. Naturally, this solution is prohibitively expensive for large
    values of n. The simplest type of routing is to link PE_i to PE's i-1
    and i+1. This is called "neighbor routing." Obviously, non-neighbor
    routing is obtained with a sequence of neighbor routings.
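
 The three relaxations just described (local indexing, mode control and
 routing) can be illustrated with a small software model. The Python sketch
 below is only an illustration, not a description of ILLIAC IV or SPEAC
 hardware: the function names, the use of Python lists to stand for PE's and
 PEM's, and the end-around connection between the last and the first PE are
 all assumptions made for the example.

```python
def execute(pe_values, mode_bits, instruction):
    """Broadcast one central instruction; PE's whose mode bit is off run a no-op."""
    return [instruction(v) if on else v for v, on in zip(pe_values, mode_bits)]


def fetch_indexed(pems, base_address, local_index):
    """Each PE reads its own PEM at the broadcast base address plus its local index."""
    return [mem[base_address + ix] for mem, ix in zip(pems, local_index)]


def route_neighbor(pe_values, direction):
    """One neighbor routing: PE i receives the operand held by PE i - direction.

    direction is +1 or -1; the end-around wrap is an assumption of this sketch.
    """
    n = len(pe_values)
    return [pe_values[(i - direction) % n] for i in range(n)]


def route(pe_values, distance):
    """Non-neighbor routing built as a sequence of neighbor routings."""
    step = 1 if distance >= 0 else -1
    for _ in range(abs(distance)):
        pe_values = route_neighbor(pe_values, step)
    return pe_values


if __name__ == "__main__":
    # Four PE's, each with a private three-word memory.
    pems = [[10, 11, 12], [20, 21, 22], [30, 31, 32], [40, 41, 42]]
    print(fetch_indexed(pems, 0, [0, 1, 2, 0]))
    print(execute([1, 2, 3, 4], [True, False, True, False], lambda v: v * 10))
    print(route([0, 1, 2, 3], 3))
```

 In this model a routing of distance d costs d neighbor routings, which is
 why the choice of directly wired routing distances matters for large n.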
 
 2.2 Typical Applications and Their Requirements 
 
 The obvious application for an array computer is on problems in
 which the same operations must be repeated over a set of operands. Matrix
 operations fit nicely in this category and therefore this type of machine
 will work well on solving systems of linear equations, Fourier transforms,
 systems of partial differential equations, etc. Several areas of major
 scientific interest are included in such formalizations and the best known
 proposed applications for an array computer are: weather analysis and
 prediction, linear programming, seismic data processing, hydrodynamic flow
 analysis, phased array radar processing, picture processing, etc.
 
 Since a new type of array processor was contemplated, the first 
 step was to elaborate a list of questions about the features of an array pro- 
 cessor and submit it to several users in different areas of applications. In 
 this way an opinion could be formed as to which features are needed for each 
 application and which compromises would be acceptable. 
 
 Users in four areas of application were interviewed: 1) weather
 problem (WP), 2) seismic signal processing (SP), 3) linear programming (LP),
 and 4) hydrodynamic flow problem (HP).

 The basic questions asked were:

 a) How many floating-point operations does your application
    need? Could you do with fixed-point only?

 b) What precision is needed for your application? How many bits
    is the typical precision in the input data?

 c) How important is local indexing in your application? To
    what extent is local indexing used only as a solution to
    poor routing facilities?

 d) How much routing is done? Would only neighbor routing be
    sufficient? What are typical numbers for non-neighbor routing?

 e) Mention any other problems encountered and facilities desired
    in your area of application.
 
 It should be pointed out that all persons interviewed are ILLIAC IV
 users. ILLIAC IV contains 64 extremely powerful PE's with a complete
 repertoire of floating and fixed point instructions. Words are 64 bits long
 and can be used in submultiple precision variants of two 32-bit words or eight
 8-bit words. There are facilities for local indexing and routing
 (accomplished through an optimal combination of distance 1 (neighbor)
 routings and distance 8 routings). Mode control is on-off only.
 
 The following facts were established by the survey above: 
 
 a) Floating point: Floating point seems to be a luxury turned
    necessity. All users admitted that they could probably do
    without floating point by careful scaling of the quantities. They
    also admitted that they would hate to be forced to do that. The
    consensus is that presently a viable machine should have, if
    not hardware floating-point instructions, at least a good, fast
    set of floating-point subroutines.
 
 b) Precision: Naturally, the precision requirements are heavily
    dependent on the particular application and method of solution:
    WP uses 32-bit words although the initial data has a typical
    precision of 8 bits only. It is felt that performing computation
    on 32-bit words is good insurance against precision erosion
    due to severe numerical error propagation with the methods
    presently used. SP receives data from sensors in 13 to 14 bits
    precision and operates in 32-bit mode. Incidentally, simple
    format conversion of the input data accounts for a considerable
    amount of processing time in this application. SP could
    conceivably be performed with less precision than 32 bits: 18 or
    24 bits should be adequate. LP is the application with the
    heaviest requirements on precision: I/O is performed in 32-bit
    mode but internal calculations use 64 bits to avoid severe
    error buildup in LP problems with about 400 equations. In fact,
    even 64-bit precision is inadequate for larger problems and the
    use of multiple precision routines is envisioned. HP has been
    using 32 bits, which is adequate for low precision inputs.
    However, 48 to 64 bits would be ideal for future applications.
    Finally, a few special but important applications need much less
    precision. Picture processing can be done with 4 to 8 bits of
    precision and a recently developed area, linear programming with
    Boolean variables, uses 1-bit precision for the variables and
    "small" integers for the coefficients.

    The conclusion is obvious: a versatile machine should
    have as many precision modes as possible. This was the case
    with serial by bit machines, which featured variable word length.
    Speed requirements forced the introduction of parallel
    processing of a word, and the variable word convenience and
    efficiency was lost except for some low-precision instruction
    variants such as the ones featured in ILLIAC IV.
 
 c) Local Indexing: This seems to be a very important feature,
    heavily used by almost all applications. Its main use is
    definitely to avoid slow routings in a "skewed" type of matrix
    storage [3]. However, a few other types of use for local
    indexing did appear.
 
 d) Routing: Routing is the most difficult problem in an array
    computer. Complete and unlimited routing facilities are
    economically impossible for large values of n. The ILLIAC IV
    approach did satisfy its users, however. Definitely the most
    frequent type of routing is neighbor routing. Odd routing
    distances do appear, however, in a few important cases: table
    look-ups and log-sums (i.e., the problem of obtaining
    $\sum_{i=1}^{n} a_i$ where each $a_i$ is stored in a different PE)
    are two examples.
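
 The log-sum mentioned in d) shows why routing distances other than 1 arise.
 The text does not spell out the procedure, so the Python sketch below
 assumes the usual recursive-doubling scheme: at each step every PE adds the
 operand routed to it from a distance that doubles (1, 2, 4, ...), so that n
 values are summed in log2(n) routing steps. The function names, the
 end-around routing and the restriction to n being a power of two are
 assumptions of this illustration.

```python
def route(pe_values, distance):
    """PE i receives the value held by PE i - distance (end-around, assumed)."""
    n = len(pe_values)
    return [pe_values[(i - distance) % n] for i in range(n)]


def log_sum(pe_values):
    """Sum one value per PE in log2(n) route-and-add steps.

    Returns the per-PE results and the number of routing steps used.
    Assumes the number of PE's is a power of two.
    """
    n = len(pe_values)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes n is a power of two"
    distance, steps = 1, 0
    while distance < n:
        routed = route(pe_values, distance)
        # Every PE executes the same add on its own operand pair.
        pe_values = [own + received for own, received in zip(pe_values, routed)]
        distance *= 2
        steps += 1
    return pe_values, steps


if __name__ == "__main__":
    totals, steps = log_sum(list(range(8)))
    print(totals, steps)
```

 With end-around routing every PE ends up holding the full sum; the routing
 distances used are 1, 2, 4, ..., n/2, most of which are not neighbor
 routings.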
 
 2.3 Considerations on the Number and Complexity of the PE's 
 
 The array-processor family of computers has at present two well
 established members: ILLIAC IV and the Associative Processor (AP). Both
 these machines were extensively studied and are actually being built. In a
 sense, however, they represent two extremes in this design philosophy: ILLIAC
 IV has a relatively small (64) number of PE's, each an extremely powerful
 floating-point word-parallel unit with 13K gates. The AP, described in [5],
 has a very large number (on the order of $2^{12}$ to $2^{15}$) of PE's, each
 an extremely simple fixed-point serial-by-bit unit containing only 32 gates.
 Each ILLIAC IV PE has a floating-point add time of 175 nsec and a
 floating-point multiply time of 225 nsec for 32-bit operands. The AP has a
 fixed-point add time of 35 μsec and a fixed-point multiply time of
 approximately 1 msec for 32-bit operands. Therefore, a 12K PE AP could add
 fixed point about as fast as ILLIAC IV. Multiplication would still be much
 slower (about 20 times slower even for a 12K PE AP). Routing capability in
 the AP is extremely limited: only neighbor routing is permitted, on a
 bit-by-bit basis. PEM is 2K 64-bit words long in ILLIAC IV and only 256 bits
 long in the AP. However, the AP's PEM is an associative memory allowing
 simultaneous interrogation of n bits (n is the number of PE's). Obviously,
 ILLIAC IV's conventional PEM's could also be considered as an associative
 memory allowing simultaneous interrogation of 64 words.
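
 The comparison of aggregate speeds quoted above can be checked with a few
 lines of arithmetic. The Python sketch below simply divides the number of
 PE's by the per-operation time; it reads the AP add time as 35 microseconds
 and takes "12K" as 12 x 1024, both of which are interpretations made for this
 illustration, and it ignores routing, I/O and control overhead.

```python
def aggregate_rate(num_pes, op_time_seconds):
    """Operations per second when all PE's execute the same operation at once."""
    return num_pes / op_time_seconds


ILLIAC_PES = 64
AP_PES = 12 * 1024  # "12K PE AP", read here as 12 x 1024

# 32-bit operation times taken from the text.
illiac_add = aggregate_rate(ILLIAC_PES, 175e-9)
illiac_mul = aggregate_rate(ILLIAC_PES, 225e-9)
ap_add = aggregate_rate(AP_PES, 35e-6)
ap_mul = aggregate_rate(AP_PES, 1e-3)

add_ratio = illiac_add / ap_add  # close to 1: comparable add throughput
mul_ratio = illiac_mul / ap_mul  # roughly 20: the AP multiplies much slower

if __name__ == "__main__":
    print(f"add: ILLIAC IV / AP = {add_ratio:.2f}")
    print(f"multiply: ILLIAC IV / AP = {mul_ratio:.1f}")
```

 Under these assumptions the add ratio comes out near 1 and the multiply
 ratio a little above 20, in line with the statements in the text.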
 
 It seems obvious that the AP is a much less versatile machine than
 ILLIAC IV, i.e., its field of application is quite limited. However, it may
 come as a surprise that in the problems to which it is well suited (especially
 radar tracking applications), the AP is quite cost-effective. In fact, its
 proponents argue that it can perform those special jobs at the same rate as
 ILLIAC IV but at 1/30th of the cost.
 
 A few generalizations are in order: One could consider a set of
 array computers M_1, M_2, ..., M_k, each with a simpler (slower) PE than its
 predecessor but with a larger number of PE's in order to keep the average
 speed constant. Figure 3 illustrates the number of PE's versus the speed of
 each PE for these machines. Figures 4 and 5 represent some rough qualitative
 estimates of the versatility of these machines (i.e., how large is the set of
 applications for which they are well suited, that is, on which they can
 compute approximately n times faster than a sequential machine with the same
 speed as each PE) and of the cost-efficiency of such machines for such
 suitable problems. The estimate in Figure 4 is practically obvious: the
 sequential machine (n=1) is the most versatile. As n grows, the number of
 problems that the machine can handle efficiently obviously decreases.
 Figure 5 is harder to justify. In fact, it is a guess based on two extremes:
 ILLIAC IV and the AP. A third machine, however, to be introduced later, does
 seem to verify this hypothesis: as n grows and each PE is simplified, modern
 integrated circuit techniques (LSI) allow a very rapid decrease in the cost
 per PE.

 Figure 3. A Family of Array Computers with Constant Average Speed
 (speed of each PE versus n, the number of PE's)

 Figure 4. Versatility as a Function of the Number of PE's

 Figure 5. Cost-efficiency as a Function of the Number of PE's
 
 These considerations justify the idea of exploring the possibilities
 of a third type of array computer: the SPEAC (for Small PE Array Computer).
 This machine would be between the AP and ILLIAC IV in number of PE's and PE
 power and hopefully would achieve a happy compromise between ILLIAC IV's
 relative versatility and the AP's cost-efficiency. The initial goals were:
 
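 $$n_{\mathrm{SPEAC}} \approx 10\, n_{\mathrm{ILL\,IV}} \ \text{to}\ 100\, n_{\mathrm{ILL\,IV}}$$

 $$(\text{PE speed})_{\mathrm{SPEAC}} \approx \tfrac{1}{10}\,(\text{PE speed})_{\mathrm{ILL\,IV}} \ \text{to}\ \tfrac{1}{100}\,(\text{PE speed})_{\mathrm{ILL\,IV}}$$

 $$(\text{gates per PE})_{\mathrm{SPEAC}} \approx \tfrac{1}{10}\,(\text{gates per PE})_{\mathrm{ILL\,IV}} \ \text{to}\ \tfrac{1}{100}\,(\text{gates per PE})_{\mathrm{ILL\,IV}}$$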
 
 The remainder of this paper is dedicated to exploring the feasi- 
 bility and characteristics of this new machine. 
 
 3. SPEAC's HARDWARE 
 
 Initially, a few general considerations are made in order to estab- 
 lish the design goals that dictated the structure chosen for the hardware. 
 The multiplication algorithm is also presented as a preface to the actual 
 hardware description since the PE has been specifically designed to implement 
 this algorithm efficiently. 
 
 3.1 General Considerations
 
 a) The PE will be simple enough and built in a quantity high enough
    to warrant the expense of building special-purpose MSI to LSI
    integrated circuits. At first, it was hoped that a whole PE
    could be contained in a single LSI chip. This still seems to be
    possible, at least with the kind of technology foreseeable within
    a decade: a bipolar integrated chip with density on the order
    of 1 to 2K equivalent gates would be needed. However, even if
    one does not count on such extremes of built-to-order LSI, the
    proposed design could be implemented using a few dozen standard
    or nearly standard MSI chips, allowing an entire PU to be packed
    in one printed circuit card.
 
 b) The results of the survey mentioned in Section 2.2 indicate the
    need of some floating-point capability. Naturally, entirely
    hardware-implemented floating-point is out of the question in a
    simple PE. However, the hardware should allow efficient
    implementation of floating-point routines. Serial processing, by
    bit or by groups of bits, is the only way to keep the gate count
    low. This leads naturally to variable word length as a means
    of satisfying the conflicting precision requirements outlined
    in Section 2.2.
 
 c) Most contemplated applications have a high frequency of
    multiplications, typical of scientific problems. Therefore,
    multiplication should be as fast as possible, ideally almost as fast
    as addition, as is the case in the ILLIAC IV PE.
 
 d) Due to the existence of a CU, the PE must be strictly
    synchronous and local control must be minimized. Any asynchronism
    or data-dependent optimization is wasted since the CU must
    always wait for the worst case, which almost certainly occurs for
    large n. This rules out certain classical methods, such as
    increasing the speed of multiplication by adding only when the
    multiplier bit is one and simply shifting when it is zero.
    Instead, the CU must always output micro-orders for the worst
    case and:

    either: the method is such that the extra operations are no-ops
            for non-worst-case conditions (example: add on a zero
            multiplier bit);

    or:     some local control (typically a flip-flop) will inhibit
            certain steps in non-worst-case conditions (example:
            normalization, recomplementation).
 
 e) An accumulator is impractical in a variable word length machine
    since it would have to be as long as the worst-case length.
    Therefore, variable word length machines are typically 2- or
    3-address machines. Three addresses are quite desirable since
    they avoid the frequent duplication of operands (to avoid their
    destruction) found in 2-address machines. The classical
    shortcoming of 3-address machines, unnecessarily large instructions
    when the third address is equal to a previous one, can easily be
    avoided by adopting a variable length instruction format.
    Therefore, each instruction (op-code) will have a large number of
    variants with different lengths, from a minimum of zero
    addresses (in this case the old contents of the address registers
    would be used as addresses) to a maximum of six addresses:
    three basic addresses plus three addresses for local indexing.
    Word length of each operand and of the result might also be
    specified in the address part. The resulting instruction format
    is illustrated in Figure 6.

    Figure 6. Instruction Format (fields, in order: basic op-code;
    variant; as many addresses as specified by the variant code)
 
 f) Timing considerations: In order to satisfy the initial
    estimates set forth in Section 2.3, an addition time of 3 to 30 μsec
    and a multiplication time of 4 to 40 μsec are needed.
    Considering a basic PEM cycle time on the order of one-half μsec (this
    assumption will be explained in Section 3.2), and noticing that
    1 to 3 PEM cycle times (depending on the amount of interleave)
    are required per serial operation of the PE, one concludes that
    a 30 to 60 μsec addition time is obtained in a bit-by-bit PE for
    32-bit fixed-point addition. Straight multiplication will take
    32 times as much or about 1 msec. This is far too slow and a
    serial by hexadecimal digit PE (i.e., one that serially processes
    chunks of 4 bits) is now considered. Addition time (32 bits, fixed
    point) now goes down to 8 to 15 μsec, which is convenient.
    Straight multiplication, taking 32 times longer, is still quite
    slow. The next step would be a serial by byte PE but this
    presents two problems: firstly, normalization is either rather
    complicated and slow or it is done in 8-bit increments, causing
    an unacceptable erosion in precision; secondly, the number of
    gates in the PE will be considerably larger. Therefore, a serial by
    hexadecimal digit PE seems to be the best compromise:
    normalization in 4-bit increments (i.e., exponent base = 16) is quite
    acceptable and widely used in present computers. A somewhat
    elaborate multiplication algorithm (described in the next
    section) will be adopted to bring the multiplication time down to
    acceptable values.
 g) Since the basic unit of data in the PE is one hexadecimal digit
    instead of a whole word, the machine is capable of accepting
    several different word formats provided the CU is able to
    generate an appropriate microsequence for that format. This
    immediately suggests the idea of microprogramming. Therefore, no
    particular word format will be picked and the PE control wire set
    will be chosen as carefully as possible in order to maximize the
    number of formats and operations that can be dealt with by
    writing adequate microprograms at the CU level. The variable
    format feature can be quite useful in certain applications
    (like seismic signal processing) in which format conversion
    accounts for a significant percentage of the processing time.
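
 The point made in d) about worst-case timing can be made concrete with a
 small model. The Python sketch below is an illustration only: it uses a
 plain bit-serial add-and-shift multiplication (not the SPEAC algorithm of
 Section 3.2), and the function names are hypothetical. Every PE executes the
 same fixed number of steps; a zero multiplier bit turns the add into a
 no-op. The second function shows why skipping adds on zero bits would gain
 nothing for the array as a whole: the CU would still have to wait for the
 PE with the most one bits.

```python
def lockstep_multiply(multiplicands, multipliers, bits):
    """All PE's run the same `bits` add-and-shift steps, whatever their data."""
    products = [0] * len(multiplicands)
    for step in range(bits):  # one broadcast micro-order per step
        for pe in range(len(products)):
            bit = (multipliers[pe] >> step) & 1
            addend = multiplicands[pe] if bit else 0  # add on a zero bit is a no-op
            products[pe] += addend << step
    return products, bits  # step count does not depend on the operands


def skipping_steps_needed(multipliers):
    """Adds the slowest PE would need if adds were skipped on zero bits."""
    return max(bin(m).count("1") for m in multipliers)


if __name__ == "__main__":
    products, steps = lockstep_multiply([3, 5, 7], [2, 0, 15], bits=4)
    print(products, steps)
    print(skipping_steps_needed([2, 0, 15]))
```

 In the example one PE holds a multiplier with all four bits set, so even a
 data-dependent scheme would need four adds before the CU could proceed; with
 many PE's such a worst case becomes almost certain.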
 Summing up, the following design goals are thus established for SPEAC: 
 
 - PE built with MSI and LSI integrated circuits. 
 
 - One printed circuit card per PU. 
 
 - Variable word length. 
 
 - Multiplication not much slower than addition. 
 
 - Up to 3 addresses (possibly indexed) per instruction. 
 
 - Variable instruction length. 
 
 - PE serial by hexadecimal digits. 
 
 - Variable word format. 
 
 - Microprogramming capability. 
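
 The combination of "serial by hexadecimal digits" and "variable word length"
 can be pictured with a short model of digit-serial addition. The Python
 sketch below is an illustration under stated assumptions, not the SPEAC
 microsequence: operands are held as lists of 4-bit digits, least significant
 digit first, and the helper names are hypothetical. The same loop handles
 any word length, because the length is just the number of digits supplied.

```python
def to_hex_digits(value, count):
    """Split a non-negative integer into `count` 4-bit digits, low digit first."""
    return [(value >> (4 * i)) & 0xF for i in range(count)]


def from_hex_digits(digits):
    """Inverse of to_hex_digits."""
    return sum(d << (4 * i) for i, d in enumerate(digits))


def digit_serial_add(a_digits, b_digits):
    """Add two equal-length operands one hexadecimal digit per step.

    Returns the sum digits and the carry out of the top digit.
    """
    assert len(a_digits) == len(b_digits)
    result, carry = [], 0
    for a, b in zip(a_digits, b_digits):
        total = a + b + carry
        result.append(total & 0xF)  # low four bits become the sum digit
        carry = total >> 4          # at most 1 for an addition
    return result, carry


if __name__ == "__main__":
    # A 32-bit word is 8 digit steps; an 8-bit word is only 2.
    digits, carry = digit_serial_add(to_hex_digits(0x0FFFFFFF, 8), to_hex_digits(1, 8))
    print(hex(from_hex_digits(digits)), carry)
    digits, carry = digit_serial_add(to_hex_digits(0xFF, 2), to_hex_digits(0x01, 2))
    print(digits, carry)
```

 A 32-bit addition takes 8 digit steps in this model instead of the 32 steps
 of a bit-serial unit, which is the source of the roughly fourfold reduction
 in addition time quoted in f).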
 
 3.2 The Multiplication Algorithm 
 
 As pointed out in Section 3.1, "straight" multiplication techniques
 (i.e., bit-by-bit) yield an unacceptably high multiplication time as compared
 to the addition time. On the other hand, very-high-speed multiplication of the
 type used in ILLIAC IV requires a massive increase in the number of gates. The
 best compromise for SPEAC seems to be some form of hexadecimal multiplication
 algorithm allowing multiplication times roughly proportional to $\ell^2$,
 where $\ell$ is the number of hexadecimal digits in the operands rather than
 the number of bits. It is also required that the algorithm be able to generate
 the product without the need to store double precision partial products, since
 the PE has no register capable of holding long numbers and storing partial
 products in the memory will be slow and require the use of a portion of PEM as
 "scratchpad area."
 
 
 The following multiplication algorithm satisfies the requirements
 above and is proposed for SPEAC: Consider the multiplication of two numbers
 A and B, each containing n+1 hexadecimal digits:

 $$A = a_0 + a_1 2^{4} + a_2 2^{8} + \cdots + a_i 2^{4i} + \cdots + a_n 2^{4n} \tag{1}$$

 $$B = b_0 + b_1 2^{4} + b_2 2^{8} + \cdots + b_i 2^{4i} + \cdots + b_n 2^{4n} \tag{2}$$

 The double precision product M will be written as:

 $$M = A \times B = m_0 + m_1 2^{4} + m_2 2^{8} + \cdots + m_i 2^{4i} + \cdots + m_{2n+1} 2^{4(2n+1)} \tag{3}$$

 Multiplying (1) and (2) as polynomials:

 $$M = A \times B = a_0 b_0 + (a_0 b_1 + a_1 b_0)\, 2^{4} + (a_0 b_2 + a_1 b_1 + a_2 b_0)\, 2^{8} + \cdots + \Big(\sum_{j=0}^{i} a_j b_{i-j}\Big) 2^{4i} + \cdots + \Big(\sum_{j=0}^{n} a_j b_{n-j}\Big) 2^{4n} + \cdots + \Big(\sum_{j=i}^{2n} a_{j-n}\, b_{n-j+i}\Big) 2^{4i} + \cdots + a_n b_n\, 2^{4(2n)}$$

 or:

 $$M = \sum_{i=0}^{n} \Big(\sum_{j=0}^{i} a_j b_{i-j}\Big) 2^{4i} + \sum_{i=n+1}^{2n} \Big(\sum_{j=i}^{2n} a_{j-n}\, b_{n-j+i}\Big) 2^{4i} \tag{4}$$

 From (3) = (4):

 $$
 \begin{aligned}
 m_0 &= (a_0 b_0) \bmod 16; & c_0 &= (a_0 b_0) \operatorname{div} 16 \quad (\text{i.e., } a_0 b_0 = c_0\, 2^{4} + m_0) \\
 m_1 &= (c_0 + a_0 b_1 + a_1 b_0) \bmod 16; & c_1 &= (c_0 + a_0 b_1 + a_1 b_0) \operatorname{div} 16 \\
 &\;\;\vdots \\
 i \le n:\quad m_i &= \Big(c_{i-1} + \sum_{j=0}^{i} a_j b_{i-j}\Big) \bmod 16; & c_i &= \Big(c_{i-1} + \sum_{j=0}^{i} a_j b_{i-j}\Big) \operatorname{div} 16 \\
 i > n:\quad m_i &= \Big(c_{i-1} + \sum_{j=i}^{2n} a_{j-n}\, b_{n-j+i}\Big) \bmod 16; & c_i &= \Big(c_{i-1} + \sum_{j=i}^{2n} a_{j-n}\, b_{n-j+i}\Big) \operatorname{div} 16 \\
 &\;\;\vdots \\
 m_{2n} &= (c_{2n-1} + a_n b_n) \bmod 16; & c_{2n} &= (c_{2n-1} + a_n b_n) \operatorname{div} 16 \\
 m_{2n+1} &= c_{2n}
 \end{aligned}
 \tag{5}
 $$
 
 
Therefore, the product may be computed as follows:

- multiply $a_0$ and $b_0$, the two low order digits of A and B; the
  result has two hexadecimal digits: $c_0 (2^4) + m_0$; $m_0$ is the low
  order digit of the product and can be stored (in double precision
  multiplication) or discarded; $c_0$ is kept in an accumulator.

- multiply $a_0 \times b_1$; add to the accumulator;
  multiply $a_1 \times b_0$; add to the accumulator; the accumulator then
  contains $c_1 m_1$; store or discard $m_1$ and keep $c_1$ in the accumulator

and so on, using the equations (5) to determine each $c_i$ and $m_i$.

It is easy to see that $(n+1)^2$ pairs of hexadecimal digits must be
multiplied to compute the product of two numbers each with $(n+1)$ hexadecimal
digits. It should also be noticed that if a single precision product is
desired, the product can replace one of the operands: $m_0, m_1, \dots, m_{n-1}$ are
computed only to accumulate the carry and are discarded; $m_n$ is the first digit
that may be in the final product and can be stored "on top" of either $a_0$ or
$b_0$, since these two digits are not needed anymore to form the product. Finally,
$m_{2n}$ replaces $a_n$ (or $b_n$). If $m_{2n+1} = c_{2n} = 0$, then the product is
stored correctly. However, if $m_{2n+1} = c_{2n} \neq 0$, the product must be
normalized, i.e., each digit is shifted one to the right, $m_n$ is discarded and
$c_{2n} = m_{2n+1}$ is then stored on the address of $a_n$ (or $b_n$).
 
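The number of memory accesses required is:

$$\text{Memory accesses} = \underbrace{\underbrace{2N^2}_{\text{operand fetches}} + \underbrace{N}_{\text{stores}}}_{\text{multiplication of the mantissas}} + \underbrace{\underbrace{(N-1)}_{\text{fetches}} + \underbrace{N}_{\text{stores}}}_{\text{normalization}}$$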
 
 where N = n+1 is the number of hexadecimal digits in each operand. Notice, 
 however, that in the computation of each m. , one operand fetch may be saved 
 since the operand is already available from the last operation in the previous 
 computation. This saves N-l operand fetches. 
 
Therefore: Total number of memory accesses = 2N(N+1), including
normalization.
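
As a quick check of the count above, the following Python sketch tallies the accesses term by term and compares the result with 2N(N+1). The function name and the variable names are illustrative assumptions and do not come from the SPEAC design.

```python
def memory_accesses(num_digits):
    """Memory accesses for an N-digit multiplication, including normalization."""
    n = num_digits
    operand_fetches = 2 * n * n   # two digits fetched for each of the N^2 digit products
    product_stores = n            # digits of the single precision product
    norm_fetches = n - 1          # normalization: digits fetched to be shifted
    norm_stores = n               # normalization: digits stored back
    saved_fetches = n - 1         # operand already available from the previous step
    return (operand_fetches + product_stores + norm_fetches + norm_stores
            - saved_fetches)


# The term-by-term tally agrees with the closed form 2N(N+1).
for n in (1, 6, 8, 64):
    assert memory_accesses(n) == 2 * n * (n + 1)
```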
 
Finally, it should be pointed out that the operations may be arranged
in such a way that not only are (N-1) fetches saved as described but also each
address is modified only in unitary decrements or increments. Since the
address registers will have the capability of unitary increment or decrement,
only the addresses of $a_0$ and $b_0$ are needed initially. These addresses are then
possibly indexed and the rest of the multiplication does not require further
address broadcasts. Figure 7 illustrates the order of operations for the
multiplication of two 4-digit numbers.
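
The digit-serial scheme of equations (5) can be sketched in Python as follows. This is a minimal illustration, not the SPEAC microprogram: the function names are assumptions, operands are held as lists of hexadecimal digits with the low order digit first, and the two summation ranges of (5) are merged into a single range over j, which selects the same digit pairs.

```python
def to_digits(value, count):
    """Split a non-negative integer into `count` hexadecimal digits, low order first."""
    return [(value >> (4 * k)) & 0xF for k in range(count)]


def from_digits(digits):
    """Reassemble an integer from hexadecimal digits, low order first."""
    return sum(d << (4 * k) for k, d in enumerate(digits))


def multiply_digits(a, b):
    """Product digits m_0 .. m_{2n+1} of two (n+1)-digit operands."""
    assert len(a) == len(b)
    n = len(a) - 1
    carry = 0                          # c_{i-1}, kept in the accumulator
    product = []
    for i in range(2 * n + 1):
        total = carry
        # All pairs a_j * b_{i-j} whose indices lie in 0..n.
        for j in range(max(0, i - n), min(i, n) + 1):
            total += a[j] * b[i - j]
        product.append(total % 16)     # m_i
        carry = total // 16            # c_i
    product.append(carry)              # m_{2n+1} = c_{2n}
    return product


x, y = 0xBEEF, 0x1234
digits = multiply_digits(to_digits(x, 4), to_digits(y, 4))
assert from_digits(digits) == x * y
```

Only one digit of the product and a short carry are live at any time, which is the property the text requires: no double precision partial product is ever stored.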
 
[Figure 7. Fetches in Multiplication. Each operand in the sequence of digit products is marked as one of: no fetch or address modification; add 1 to the address and fetch; subtract 1 from the address and fetch.]

3.3 The System as a Whole

A summary description of the complete system is initially presented
in order to establish the function of each component and their
interconnections. Figure 8 is a diagram of the global structure.

[Figure 8. Global Structure. Labeled elements include the common address bus, the corner memory, control, 4-bit data, and a wide data path.]

The components are:

a) The PU array, containing "a large number" of PU's arranged in
   rows. Each row has 128 PU's and the number of rows is not fixed:
   with the exception of "row gating," nothing in the machine is a
   logical function of the number of PU rows. Therefore, any number
   of PU rows can be used in SPEAC provided that the row gating
   contains that same number of inputs. There are, however, some
   practical limits: too few rows (say 1 or 2) will lead to an
   uneconomical machine since each PE is relatively slow and good
   average speed can only be obtained by using a large number of
   PE's. Therefore, the speed obtainable with 1 or 2 rows would
   not justify the investment represented by the components needed
   to drive the array: CU, mass memory, etc. On the other hand,
   too many rows will result in poor I/O speed and routing speed
   (since these operations are performed on a per-row basis),
   causing a degradation in system performance. Based on these
   considerations, an interval of 4-64 PU rows has been established
   as the most useful range. In particular, 8 rows were chosen
   for the "typical" SPEAC. Therefore, for the remainder of this
   paper, a 1024 PE machine will be described.
 
b) The row gating switch, which is a 512-bit, bidirectional, 1-out-of-8
   selector driven by a row address supplied by the CU. This
   switch selects one of the PE rows for I/O transactions with the
   mass memory.
 
c) The I/O buffer register, which is a long, shiftable register to
   buffer the I/O flow between mass memory and PE array. It should
   be pointed out that this register has twice the length of the
   mass-memory word and can be shifted by any multiple of 4 bits in
   a maximum of 7 clock pulses. These two features enable the I/O
   buffer register to provide routing facilities for SPEAC. The
   method will be detailed in Sections 3.7 and 4.7.
 
 o 
 
 d) A mass memory system with at least 10 bits of relatively fast 
   (1 to 3 μsec cycle time) random-access memory. Bulk core is the
   present choice for the mass-memory, probably backed-up by a
   hierarchy of large capacity disk and tape. The random-access
   mass memory serves as a common pool of data for the different
   parts of the system and is directly accessible to the CU, PU
   array, corner-memory and other peripherals.
 
e) A corner-memory, which is a special-purpose peripheral device
   operating on the mass memory in the same fashion as an independent
   I/O channel. This device is capable of reading from mass
   memory 128 words with 128 hexadecimal digits each; the i-th word
   read can be written as $a_{i,1}\, a_{i,2} \dots a_{i,128}$, where each $a_{i,j}$ is a
   hexadecimal digit. After being loaded with rows in this way, the
   corner-memory can write back in mass memory in a column-wise
   fashion; i.e., the i-th word written will be $a_{1,i}\, a_{2,i} \dots a_{128,i}$.
   Therefore, the device can read a matrix of 128 X 128 hexadecimal
   digits row-by-row and rewrite the same matrix column-by-column.
   This function is desirable in SPEAC to convert data written in
   mass memory by the array into a form that will allow the same
   data to be easily handled by the CU. The corner-memory is not
   an essential part of the system but has been included for the
   sake of completeness. It should also be mentioned that several
   other peripheral devices (tape decks, printers, etc.) can be
   attached to the system in the same way as the corner-memory.
 
 f ) A control unit (CU) which sends control pulses to all other units 
 in the system besides having full processing capability on its 
 own. Actually, the CU can be considered a standard serial high- 
 speed general purpose computer in which several modifications 
 were introduced. It must accept three different types of 
 
 instructions: CU instructions, PE instructions and I/O instructions.
 CU instructions are completely processed in the CU
 although operands can be received from the array and results
 "broadcast" to the array via the common data bus (CDB) which
 will be described shortly. PE instructions are decoded in the 
 CU and each corresponds to a micro-program which is executed and 
 generates a set of control pulses or micro- sequences. These are 
 sent to every PE in the array via the control lines. Finally, 
 I/O instructions are decoded in the CU and sent to one or more 
 independent I/O channel(s) which drive the row gating, mass mem- 
 ory, I/O buffer register and corner -memory. The CU must also be 
 compatible with the mass memory used in the system since this 
 memory will be shared by the CU and PE and serves as a common 
 pool of data. The CU can interchange data with the PE's via the 
 common data bus, one hexadecimal digit at a time. However, the 
 only high capacity data link between CU and array is via the mass 
 memory. Notice also that SPEAC's programs are not stored in the 
 PEM's but in the CU's own internal fast memory and, for large 
 overlayable programs, also partly in the mass memory. 
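
The row-to-column rewriting performed by the corner-memory in item e) amounts to a matrix transposition. A minimal Python sketch follows; the function name and the list-of-lists representation are assumptions made for illustration, since the device itself is a hardware peripheral.

```python
def corner_turn(rows):
    """Rewrite a square matrix of hexadecimal digits column by column.

    rows[i][j] is digit a_ij of the i-th word read; the i-th word written
    is a_1i a_2i ... a_ni.
    """
    size = len(rows)
    assert all(len(row) == size for row in rows)
    return [[rows[r][c] for r in range(size)] for c in range(size)]


# 128 words of 128 hexadecimal digits each, filled with an arbitrary pattern.
words_in = [[(3 * r + c) % 16 for c in range(128)] for r in range(128)]
words_out = corner_turn(words_in)
assert words_out[5][17] == words_in[17][5]
```

Applying the operation twice returns the original matrix, so the same device can convert data in either direction between the array's layout and the CU's layout.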
 
The control unit is linked to the PE's by three buses and one interrupt
wire. The first bus is a 12-bit common address bus (CAB) in the direction
of CU to PE only. The CU can send addresses to the array via CAB. These
addresses can then be stored by each PE in internal address registers and used to
access PEM. The second bus is a 4-bit bidirectional common data bus (CDB)
whose use has already been described. The last bus is a set of approximately
80 control lines which control every PE function. The interrupt wire is a
single line connecting every PE to the CU. It is used to send to the CU an
interrupt request which originated in a PE and must be serviced by the CU.

Each PE is linked to the row gating by a bidirectional 4-bit I/O
bus (IOB) which is not common. All the I/O buses (one from each PE) are
connected to the row gating, which selects one group of 128 IOB's (corresponding
to one PU row) for connection to the I/O buffer register (IOBR).
 
It is now possible to describe how a program is processed in SPEAC:
Program and data are assumed to be initially on tape. The tape is loaded into
SPEAC's mass-memory and from there the program is loaded in the CU memory and
a portion of the data is transferred to PEM. Processing is then performed
simultaneously with further transfers between PEM and mass memory, with the
latter serving as overlay memory for the relatively small PEM. The results of
the computation are transferred from PEM to mass memory and can then be printed
or stored on tape via a peripheral device.
 
 Each component of the system will now be analyzed with special em- 
 phasis on the PU. 
 
3.4 The Processing Unit

3.4.1 PE Memory
 
 Semiconductor memories were chosen for the PEM's for two basic reasons: 
 
 a) Small size, compatible with the LSI chips that make up the PE. 
 This way each PU could be entirely mounted on a single printed 
 circuit card or on a ceramic substrate. 
 
b) Low price per bit even in small size. This characteristic was
   needed since each PEM in SPEAC is necessarily small for economic
   reasons: 8K bits is the proposed basic size with provision for
   expansion up to a maximum of 32K bits per PEM.

The next step was to choose between bipolar and MOS memories. At
the beginning of the investigation, a survey of semiconductor memories [7]
indicated that MOS LSI held the greatest potential for this application:
large densities (1000 bits per chip is already commercially available), minute
power dissipations (50 μW per bit is obtainable), acceptable speeds (less than
1 μsec cycle time is typical) and low price ($.02 per bit is commercially
available). Therefore, the following PEM chip was postulated for use in SPEAC:
MOS LSI, 1024 bits, 50 μW per bit power dissipation, 500 nsec cycle time, price
less than $20 in quantities.
 
Since progress in the area of semiconductor memories has been so fast,
a reevaluation of the design choice for SPEAC's PEM was undertaken at the end
of the investigation. It was then discovered that the case for MOS was not as
clear cut as before, due to the following factors:

a) Although MOS currently appears to have a distinct density and
   price advantage, it should be noted that recently announced bipolar
   processing technology will allow 1024 bit and larger bipolar
   memories with not much increase in power requirements. These
   devices will be available for delivery about mid-1972 at about
   MOS prices. With power reduction techniques they take about the
   same or less power than MOS and are considerably faster with an
   80 to 100 nsec cycle time.
 
b) It should be noted that the choice of MOS requires an additional
   power supply level. If bipolar is chosen, the same supply used
   for the PE logic can be used by PEM. This is more economical
   since it is less expensive to buy "x" additional amps on an
   existing supply than to buy the first "x" amps on a new voltage
   level.

c) If MOS is used, an interface is normally needed to adjust MOS
   voltage level to bipolar, thus increasing the number of gates
   per PE. Moreover, the larger densities in MOS are obtainable in
   dynamic memories; i.e., memories in which the information is
   stored as charge in MOS P-N junction capacitance. These memories
   are thus volatile and must be refreshed as often as every 2 μsec
   at higher temperatures. This is unacceptable in SPEAC since it
   would introduce frequent delays in processing to refresh PEM.
   Therefore, static MOS memories must be used and density with these
   memories is not better than with bipolar. Static MOS is also
   slower unless decoding is separately performed with bipolar logic.

In conclusion, the factors considered above indicate that PEM would
probably be built with bipolar devices or at least static MOS with bipolar
decoding if prices drop as much as predicted. In fact a hybrid chip already
exists which, if obtainable at a price small enough, would be an excellent
choice for PEM: It consists of 8 MOS static memory chips with 256 bits each,
mounted on a ceramic pack with bipolar decoding. The organization is 1024 2-bit
words, so that only four of these elements are needed for the PEM.
 
The devices are made by T.I. (SMA 2002) and have a typical cycle time
of only 150 nsec. A block diagram is presented in Figure 9.

Therefore, although the basic cycle time of 500 nsec (300 nsec access
time) is retained for the remainder of the paper, it now appears that it is a
little pessimistic. Significant gains in performance could be obtained in some
operations with the faster memories which would probably be available if SPEAC
were to be built in the near future.
 
[Figure 9. Block Diagram of a Possible PEM Chip. Labeled elements include array select, address, data in, R/W and read strobe inputs, MOS storage, read/write control, sense amplifier, and data out.]
 
Since the basic unit of data in the PE is one hexadecimal digit, PEM
is organized in 4-bit words. Each hexadecimal digit is addressable in the
memory. It is also extremely important to adopt an access technique for PEM which
will avoid I/O bounding of programs as much as possible: PEM contains only 2K
hexadecimal digits or 256 32-bit words. Therefore, for many problems the data
will not fit entirely in PEM and mass memory is used as back-up. It would be
desirable then to be able to exchange data between PEM and mass memory and,
simultaneously, allow the PE to access PEM to perform normal processing. This
justifies the adoption of a two-port system: PEM is divided in two modules,
each with 1K hexadecimal digits, and the two modules can be accessed
simultaneously. Basically, one module is replenished from mass memory while the other
module is used for operations. In this way, PEM can almost be considered as a
fast scratchpad memory for the PE's with mass memory being the main memory.
Since (as will be shown in Sections 3.5 and 4) a row of 1024 32-bit numbers
can be exchanged between PEM and mass-memory in about 128 μsec and the basic
floating-point operations take on the order of 25 μsec, a number brought to
PEM must be used at least six times in operations before being overwritten in
order to avoid I/O bounding. This ratio of 1 to 6 is a comfortable figure for
a machine intended for scientific applications. It should also be pointed out
that I/O-PE overlap is not the only use of the two module system: if I/O is not
occurring, the two modules can be used to overlap fetches for CU operations and
PE operations or even for the simultaneous fetch of two operands in a PE
operation if each operand happens to be in a different module. It is the
responsibility of CU's final station (FINST) to assign use of the two PEM modules in an
optimum way (see Section 3.5).
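
The sizing figures quoted above can be checked with a few lines of arithmetic. The Python sketch below uses only numbers stated in the text (8K bits per PEM, 4-bit digits, 128 μsec per exchange, about 25 μsec per floating-point operation); the constant and function names are illustrative assumptions.

```python
import math

PEM_BITS = 8 * 1024        # proposed basic PEM size
DIGIT_BITS = 4             # one hexadecimal digit
WORD_BITS = 32

pem_digits = PEM_BITS // DIGIT_BITS          # hexadecimal digits per PEM
pem_words = PEM_BITS // WORD_BITS            # 32-bit words per PEM
digits_per_module = pem_digits // 2          # two modules per PEM


def min_uses_per_number(exchange_usec, operation_usec):
    """Operations needed per number so that computing hides the I/O exchange."""
    return math.ceil(exchange_usec / operation_usec)


assert pem_digits == 2048 and pem_words == 256 and digits_per_module == 1024
assert min_uses_per_number(128, 25) == 6
```

The ratio 128/25 is slightly above 5, which is why six uses per number is the smallest whole count that keeps the PE busy for the duration of an exchange.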
 
3.4.2 PE Data Registers

The algorithm described in Section 3.2 can be very efficiently
mechanized using the register structure presented in Figure 10.
 
[Figure 10. Basic Data Register Structure. Shows register A (parts $A_c$, $A_m$, $A_r$) with its "increment $A_c$" control, register B, and the adder with "add conditional" and "add unconditional" controls.]
 
There are two data registers: A and B. Register B is a simple,
non-shiftable 4-bit unit. Register A is divided into three parts: $A_r$, $A_m$
(for right and medium) with 4 bits each and $A_c$ (for carry) with 12 bits.
Register A is fully shiftable, right or left, bit-by-bit. There is also a
fast 4-bit shift mode in which the contents of register A are shifted (left or
right) one hexadecimal digit in one operation. The right fast 4-bit shift is
not essential to implement the multiplication algorithm efficiently but can
be very useful in other applications. It should also be pointed out that part
$A_c$ of register A is connected as a counter and a pulse to the "increment $A_c$"
control will cause the contents of $A_c$ to be incremented by one unit. Finally,
registers A and B are linked by a 4-bit parallel adder which, when activated,
replaces the contents of $A_m$ with the sum of the contents of $A_m$ and B. The
adder can be used unconditionally or conditioned to the presence of a "one" in
location $A_r$. The carry generated by the adder can be fed to the "increment
$A_c$" control.
 
To use the structure of Figure 10 to multiply using the polynomial
algorithm, two hexadecimal digits $a_i$ and $b_j$ are placed in registers A and B
respectively. Multiplication is accomplished with a sequence of four add
conditionals and shifts right 1 bit. Register A is then shifted left fast 4 bits
and a new multiplication can be performed with the new product automatically
added to the previous one(s). Registers $A_c$ and $A_m$ then work as a small
accumulator in multiplication. Note that in the polynomial multiplication of two
numbers, each n hexadecimal digits long, the worst case carry that can occur
is less than $\log_2 n + 4$ bits. Therefore, the number of bits needed in $A_c$ is
given by $\log_2 n_{max} + 4$. A reasonable value for $n_{max}$ is 64, which leads to an
$A_c$ 10 bits long. Since in SPEAC register length is naturally a multiple of 4
bits, 12 bits were reserved for $A_c$. For the same reason, the address
register's length was chosen as 12 bits, allowing up to 4K hexadecimal digits per
PEM module although only 1K is contemplated at this stage.
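
The add-conditional-and-shift sequence can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the PE logic itself: register A is modeled as one integer whose low 4 bits stand for $A_r$, the multiplier digit is assumed to sit in $A_r$ and the multiplicand digit in B, and the function names are hypothetical. The second function gives an upper bound on the carry that $A_c$ must hold.

```python
def multiply_accumulate(acc, a_digit, b_digit):
    """Add one digit product to the running sum held in (A_c, A_m)."""
    reg_a = (acc << 4) | a_digit      # fast 4-bit left shift, then multiplier into A_r
    for _ in range(4):
        if reg_a & 1:                 # "one" in the low order bit of A_r
            reg_a += b_digit << 4     # add B into A_m; the carry increments A_c
        reg_a >>= 1                   # shift register A right 1 bit
    return reg_a                      # acc + a_digit * b_digit


def carry_upper_bound(num_digits):
    """Bound on the carry when every column sums num_digits products of 15 x 15."""
    carry = 0
    for _ in range(2 * num_digits):
        carry = (carry + num_digits * 15 * 15) // 16
    return carry


assert multiply_accumulate(0, 0xF, 0xF) == 225
assert multiply_accumulate(100, 7, 9) == 163
assert carry_upper_bound(64).bit_length() == 10   # log2(64) + 4 bits
```

After the four steps the multiplier bits have all been shifted out of $A_r$, so the next product can reuse the same register without clearing it.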
 
3.4.3 PE Description

The data register configuration described in the previous section
was used as a kernel around which the whole PE was designed. Figure 11
presents a simplified PE diagram showing all registers and data paths. For a
complete logical diagram, Figure 13 should be consulted. In order to reduce the
size and complexity of Figure 13, a number of special symbols were adopted.
These are defined in Figure 12 and deal with representing groups of 4 or 12
wires in a concise way. Only a few logic elements appear explicitly in Figure
13; most logic is represented as logical blocks called packages. These
packages are numbered and labeled with a name describing their function; i.e.,
1-of-8 selector, type D flip-flop, inverter, etc. The complete diagrams of
the logic inside each package are presented in Appendix A. It should be noted
that most packages perform standard logic functions and are available as SSI
or MSI chips. This aspect will be further pursued in the section on
implementation.
 
[Figure 11. Simplified PE Diagram. Labeled blocks include EE, LC, the A/L unit, PEM module 1, PEM module 2 and sM.]

[Figure 12. Conventions Used in PE Logical Diagram. Concise notation for 4-wire and 12-wire buses and its equivalent in standard notation.]

[Figure 13. Complete PE Logical Diagram.]

3.4.3.1 Registers and Buses

Each PE contains nine registers with a total capacity of 65 bits.
Table 1 lists each register, its capacity, function, and special features.
Buses are used to provide data paths between the different registers. This
allows maximum flexibility (since each register can be directly loaded from
any other register) at a reasonable cost. Two types of buses are needed: a
12-bit address bus, linking all address registers and the CAB, and a 4-bit
data bus linking all the remaining registers, the CDB and the IOB. Since it
was decided that both PEM modules should be simultaneously accessible, one
pair of buses is dedicated to each PEM module. Therefore, there are four
buses altogether: two address buses (A1 and A2) and two data buses (D1 and
D2). Buses A1 and D1 are linked to PEM module 1 and buses A2 and D2 are linked
to PEM module 2. Figure 11 clearly shows all the connections to each bus. In
this figure, an arrow into a bus indicates that the given data can be gated
into the bus; an arrow out of a bus indicates that the contents of the bus can
be gated into the given unit; a dot in the intersection of a wire and a bus
indicates a permanent connection of the wire to the bus. It should also be
noticed that every line connected to an address bus represents in fact 12
wires (except the line into sM which is a 4-bit line) while lines connected to
a data bus stand for 4 wires with the exception of the line into EE which is
a single bit line. A very rough estimate of the number of gates needed to
implement the bus system can now be obtained: counting each arrow associated
with a data bus as 4 gates and each arrow associated with an address bus as 12
gates, one obtains a total of 354 gates. This represents about a third of the
total number of gates used in the PE, with flip-flops accounting for the second
third and arithmetic, decoding and local control using the remaining gates.
 
 It is important to point out that PEM module 1 is permanently con- 
 nected to bus 1 and module 2 to bus 2. Therefore, if an operand is in module 
 i then bus i must be used to fetch that operand. On the other hand, inter- 
 register transfers can use any bus that is available. This fact will be im- 
 portant in the design of the CU's final station (FINST). 
 
 
Register   Capacity (bits)   Function        Special Features
--------   ---------------   -------------   -------------------------------------------------
A                                            Shiftable (bidirectional, 1- and 4-bit distances)
  A_c      12                address/data    Can count up
  A_m      4                 data            Each bit is individually enabled
  A_r      4                 data            None
B          4                 data            None
X_1        12                address         Can count up or down
X_2        12                address         Can count up or down
X_3        12                address         Can count up or down
LC         4                 local control   Each bit is individually enabled
EE         1                 mode            None

Table 1. PE Registers
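
As a consistency check on Table 1, the register capacities can be summed and compared with the "nine registers with a total capacity of 65 bits" quoted in the text. The dictionary below is an illustrative restatement of the table in Python, with the three address registers written X_1, X_2 and X_3; it is not part of the original design documentation.

```python
# Capacity in bits of each PE register, as listed in Table 1.
PE_REGISTER_BITS = {
    "A_c": 12, "A_m": 4, "A_r": 4,     # the three parts of register A
    "B": 4,
    "X_1": 12, "X_2": 12, "X_3": 12,   # address registers
    "LC": 4,                           # local control
    "EE": 1,                           # mode
}

assert len(PE_REGISTER_BITS) == 9
assert sum(PE_REGISTER_BITS.values()) == 65
```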
 
3.4.3.2 The Arithmetic/Logic Unit
 
The simple adder of Figure 10 was replaced in the final design by a
more sophisticated arithmetic/logic unit (A/L unit) which is capable not only
of adding but also of performing several other arithmetic and logic functions
as well as comparisons. This unit, whose logical diagram can be seen in
package 9 (Appendix A), is currently available from several manufacturers in
a 24-pin MSI bipolar chip. There are five control lines in the A/L unit,
allowing a choice between 32 functions (not all different). Table 2 shows
these 32 functions. There is also an A = B output to test for equality. Other
comparisons can be performed by subtracting the two inputs and analyzing the
output carry. Input B to the A/L unit is always register $A_m$. Input A can be
selected among D1, D2, reg B and the complement of reg B. This allows one to
compute not only (reg B) - (reg $A_m$) (by picking reg B as the A input to the
A/L unit and subtracting) but also (reg $A_m$) - (reg B) (by picking the
complement of reg B as the A input and adding). Inputting to the A/L unit
directly from D1 or D2 is not essential but speeds up several operations by
avoiding unnecessary loads into B only to use the A/L unit. The output of the
unit can be gated either into $A_m$ or into $A_r$. Another important feature is the
possibility to gate the output of the A/L unit into A shifted one to the
right. This speeds up multiplications considerably since two hexadecimal
digits can be multiplied in 4 clocks instead of 8 (i.e., 4 add-and-shifts as
opposed to 4 adds and 4 shifts).
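
The two techniques mentioned above, subtracting by adding a complemented input and comparing by inspecting the output carry, can be illustrated with a small Python model of a 4-bit adder. This is a sketch only: the function names are assumptions, and the model treats the carry as a plain logical value rather than reproducing the chip's pin polarities.

```python
MASK = 0xF  # 4-bit data path


def add4(x, y, carry_in=0):
    """4-bit addition returning (sum digit, output carry)."""
    total = x + y + carry_in
    return total & MASK, total >> 4


def subtract_via_complement(a_m, b):
    """Compute a_m - b by adding the complement of b with a carry-in of one."""
    return add4(a_m, ~b & MASK, 1)


def compare(a_m, b):
    """Classify a_m against b from the difference digit and the output carry."""
    diff, carry = subtract_via_complement(a_m, b)
    if diff == 0 and carry == 1:
        return "equal"
    return "greater" if carry == 1 else "less"


assert subtract_via_complement(9, 4) == (5, 1)
assert compare(3, 12) == "less"
```

An output carry of one means the subtraction did not borrow, i.e., the first operand is greater than or equal to the second.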
 
In the table below, X' denotes the logical complement of X, "v" is logical OR,
juxtaposition is logical AND, and "+" and "-" are arithmetic.

                  M = 1               M = 0 (arithmetic operations)
S3 S2 S1 S0   (logic functions)   C_n = 0                C_n = 1
-----------   -----------------   --------------------   ------------------------
0000          F = A'              F = A                  F = A + 1
0001          F = (A v B)'        F = A v B              F = (A v B) + 1
0010          F = A'B             F = A v B'             F = (A v B') + 1
0011          F = 0               F = 1111               F = 0
0100          F = (AB)'           F = A + AB'            F = A + AB' + 1
0101          F = B'              F = (A v B) + AB'      F = (A v B) + AB' + 1
0110          F = A ⊕ B           F = A - B - 1          F = A - B
0111          F = AB'             F = AB' - 1            F = AB'
1000          F = A' v B          F = A + AB             F = A + AB + 1
1001          F = (A ⊕ B)'        F = A + B              F = A + B + 1
1010          F = B               F = (A v B') + AB      F = (A v B') + AB + 1
1011          F = AB              F = AB - 1             F = AB
1100          F = 1               F = A + A              F = A + A + 1
1101          F = A v B'          F = (A v B) + A        F = (A v B) + A + 1
1110          F = A v B           F = (A v B') + A       F = (A v B') + A + 1
1111          F = A               F = A - 1              F = A

Table 2. Functions Provided by the A/L Unit

3.4.3.3 Scratchpad Memory

A small (16 hexadecimal digits), fast scratchpad memory (sM) has
been added to the final version of the PE. This unit is available in a 16-pin
MSI chip (see package 8, Appendix A) and can read or write one hexadecimal
digit in one PE clock. Although not essential to the PE, sM can be added at a
low cost and provides a dramatic improvement in performance. Floating-point
addition, for example, is speeded up by a factor of three. The main use of
sM is to avoid repeated fetches of the same digit in multiplication and to
store partial results before normalization. It should be noticed that since
sM receives addresses from the address buses (four low order bits only are
used), it can be locally indexed, i.e., each PE can locally modify an address
in sM before performing an sM fetch. This is extremely valuable in
floating-point normalization. Therefore, sM is the fourth element in SPEAC's memory
hierarchy which is, from the smallest and fastest unit to the slowest and
largest: sM - PEM - mass memory (random access) - large capacity disk.
 
3.4.3.4 Address Registers

There are three address registers in the PE: $X_1$, $X_2$ and $X_3$.
These are simple, non-shiftable 12-bit units with additional logic to enable
them to act as up/down counters (see package 11, Appendix A). The address
registers are normally loaded from the CAB with a base address broadcast by
CU to all PE's. This base address can then be locally indexed. Successive
hexadecimal digits of an operand can be accessed by incrementing or decrementing
an address register using the up/down counter feature and avoiding frequent use
of CAB and repeated local indexing operations. It is now clear that a memory
transaction may use as address one of four sources: registers $X_1$, $X_2$, $X_3$, and
CAB. The common address bus can be directly used as the address source in I/O
transactions or in operand fetches when local indexing is not necessary. This
use of CAB indicates that one could possibly eliminate one of the address
registers and still obtain good performance since, in most cases, for PE
operations only two addresses are simultaneously needed; in the fetch phase of
the operation, the addresses of the two operands are stored in two of the
address registers. In writing the result two other addresses are needed in two
address registers: the address of the result and an sM address. The remaining
register is used, most of the time, to hold I/O transaction addresses. It
is felt that eliminating this register would cause frequent conflicts in CAB use
and a degradation in performance. Only extensive simulation can indicate whether such
degradation is small enough to warrant removal of the register for a very significant
saving in the number of gates.
 
 3- J+-3-5 Register A 
 
 There are eight possible sources of input data to each of the 
 
 parts of register A. Six of these eight are common to A , A and A . They 
 
 c m r 
 
 are: l) shift A right one, 2) shift A left one, 3) shift A fast k right, 
 
 k) shift A fast k left, 5) load with Dl (Al in the case of A ), and 6) load 
 
 with D^ (A^ in the case of A ) . The seventh input option is the add and shift 
 2 2 c 
 
 especially implemented to speed up multiplication. The effect of this input 
 
 is the following: the output of the A/L unit is loaded into (A , A , A , 
 
 to ^ ' m ' m ' m ' 
 
 A ), A is shifted' right one and A is either shifted right one( if the out- 
 r 3 r ° 
 
 put carry for the A/L unit is zero) or is incremented by one and shifted right 
 one (if the output carry from the A/L unit is one). Finally, the eighth and 
 final possible input to A is: for A and A , the output of the A/L unit (used 
 for addition, subtraction and logical operations); for A , the last input 
 possibility is simply A incremented by one (i.e., the counter feature of A ) . 
 
Input control is independent for each of the three parts of register A.
Therefore, register A shifts end-around as a whole only when A_c, A_m and A_r
are simultaneously loaded with the same shift input. Several other useful
results may be obtained when only one or two of the parts of A receive a shift
command. For example, loading A_r with a shift fast 4 right enables one to
copy A_m directly into A_r without having to use D1 or D2. A direct swap of the
contents of A_m and A_r can be achieved by simultaneously loading A_m with a
shift fast 4 left and A_r with a shift fast 4 right. There is a control wire to
determine whether a distance 1 shift is to be end-off or end-around. Distance
4 end-off shifts are obtained by shifting only two of the parts of A.
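The copy and swap described above can be sketched in a few lines. This is only an illustrative model, not the PE logic of Figure 13: the class `RegisterA`, its `clock` method and the input names are hypothetical, and the 4-bit widths of A_m and A_r are an assumption based on the 4-bit digit processing (only the 12-bit width of A_c is stated in the text).

```python
from dataclasses import dataclass

DIGIT = 0xF  # mask for one 4-bit digit


@dataclass
class RegisterA:
    c: int  # A_c: 12 bits (the counter part)
    m: int  # A_m: 4 bits (assumed width)
    r: int  # A_r: 4 bits (assumed width)

    def clock(self, m_input=None, r_input=None):
        """One clock with independent input selection for A_m and A_r.

        Inputs are None (no load), "fast4_right" or "fast4_left".
        Every part samples the values its neighbours held before the clock.
        """
        old_c, old_m, old_r = self.c, self.m, self.r
        if m_input == "fast4_right":
            self.m = old_c & DIGIT      # low digit of A_c moves into A_m
        elif m_input == "fast4_left":
            self.m = old_r              # A_r moves into A_m
        if r_input == "fast4_right":
            self.r = old_m              # A_m moves into A_r
        # "fast4_left" into A_r needs the end-around path; omitted in this sketch.


copied = RegisterA(c=0x123, m=0xA, r=0x5)
copied.clock(r_input="fast4_right")     # only A_r is loaded: A_m copied into A_r

swapped = RegisterA(c=0x123, m=0xA, r=0x5)
swapped.clock(m_input="fast4_left", r_input="fast4_right")  # A_m and A_r exchanged
```

Because both parts sample the old values on the same clock, the swap needs no temporary register, which is the point made in the text.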
 
A_c and A_r have a single load control which, when OFF, preserves the
contents of the register and when ON loads the register with the selected
input. Load control for A_m is more sophisticated and allows not only "load"
and "no-load" but also a conditional load dependent on the value in D1 or D2.
In this conditional load, bit i of A_m is loaded only if bit i in D1 or D2 is
ON. This is very useful in "assembling" a hexadecimal digit out of specific
bits of two other digits, as is the case in inserting a sign bit in a number.
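As a rough illustration of the conditional load, the sketch below follows the prose description (bit i is loaded only where the mask bit is ON). The function name `conditional_load` is hypothetical, and placing the sign in bit 3 of the digit is an assumption made only for the example.

```python
def conditional_load(a_m: int, selected: int, mask: int) -> int:
    """Return the new 4-bit A_m: bit i comes from `selected` where bit i of
    `mask` (the value on D1 or D2) is ON, and is preserved elsewhere."""
    return ((a_m & ~mask) | (selected & mask)) & 0xF


# Insert a sign bit (assumed to live in bit 3) into a digit held in A_m.
digit = 0b0101
source = 0b1010          # digit whose bit 3 carries the sign
with_sign = conditional_load(digit, source, mask=0b1000)   # 0b1101
```

Only bit 3 is taken from the second digit; the three low-order bits of A_m are left untouched.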
 
It is important to notice that A1 or A2 can be gated into A_c, thus
allowing addresses to reach the data handling part of the PE. This feature
is used to modify addresses in local indexing. Also, A_c is a counter and can
be used as such when not needed to accumulate a carry in multiplication. This
provides a general purpose 12-bit counter in the PE which is extremely useful
in several applications. Therefore, A_c has a quadruple function: a) it
provides linkage between the address portion and the data portion of the PE,
b) it serves as a general purpose counter, c) it accumulates the carry in
multiplication, d) for special applications, A_c could be used as an additional
address register.
 
For a more complete idea of the whole PE as well as all the available
controls the reader is directed to Figure 13, where each control wire is
indicated as a line ending in an open circle with a code name associated to it.
There is a total of 78 control wires in the PE and Table 3 lists these wires in
alphabetical order along with a description of their function.

Control Wire           Controls    Function
---------------------  ----------  ---------------------------------------------
AcC1 to AcC3           A_c         Select one out of eight possible inputs
AcC4                   A_c         Load A_c with the selected input
ALCC1, ALCC2           A/L unit    Select input carry C_n between 0, 1, the
                                   complement of lcFF1, and lcFF4
ALC1 to ALC5           A/L unit    Select function performed by A/L unit (see
                                   Table 2)
ALIC1, ALIC2           A/L unit    Select operand A for A/L unit between B, the
                                   complement of B, D1 and D2
ALIC3                  A/L unit    Use the complement of B instead of B as
                                   operand A if lcFF1 is OFF
ALIC4                  A/L unit    Use ZERO instead of selected data as operand
                                   A if A_r0 is OFF
AmC1 to AmC3           A_m         Select one out of eight possible inputs
AmC4, AmC5             A_m         00 - do not load A_m; 11 - load A_m (all
                                   bits) with selected input; 10 - load A_m with
                                   AND of selected input and D1; 01 - load A_m
                                   with AND of selected input and D2
ArC1 to ArC3           A_r         Select one out of eight possible inputs
ArC4                   A_r         Load A_r with selected input
AShC                   A           Distance 1 shift is end-around
A1C1 to A1C3           A1          Select one value out of five to gate into A1
A2C1 to A2C3           A2          Select one value out of five to gate into A2
BC1                    B           Select among D1 and D2 as inputs to B
BC2                    B           Load B with the selected input
CDBC                   CDB         Select between D1 and D2 to gate into CDB
Clock                  All FF's    Clock pulse
D1C1 to D1C3           D1          Select one value out of eight to gate into D1
D2C1 to D2C3           D2          Select one value out of eight to gate into D2
EEC1 to EEC3           EE          Select one bit out of the eight in D1, D2 as
                                   input to EE
EEC4                   EE          Load EE with the selected input bit
IOBC                   IOB         Select between D1 and D2 to gate into IOB
LCiC1, LCiC2           lcFFi       00 - do not load lcFFi; 11 - load lcFFi with
(i=1,2,3,4)                        bit i of D1; 10 - load lcFFi with bit i of
                                   D2; 01 - load lcFFi with: i=1, A=B output
                                   from A/L unit; i=2, output carry from A_c;
                                   i=3, OR of carry from X1, X2 and X3; i=4,
                                   output carry from A/L unit
LCiC3, LCiC4           lcFFi       00 - do nothing; 10 - gate lcFFi into
(i=1,2,3,4)                        interrupt wire; 01 - enable clock if lcFFi is
                                   OFF; 11 - enable clock if lcFFi is ON
PEMiC1 (i=1,2)         PEM mod i   Select read or write
PEMiC2 (i=1,2)         PEM mod i   Do not obey mode control
sMC1                   sM          Select between D1 and D2 as input to be read
                                   into sM
sMC2                   sM          Select between four low order bits of A1 and
                                   A2 as address to sM
sMC3                   sM          Select read or write in sM
XiC1 (i=1,2,3)         Xi          Load input selected by XiC3
XiC2 (i=1,2,3)         Xi          Count Xi up or down as selected by XiC3
XiC3 (i=1,2,3)         Xi          If counting, select between up or down; if
                                   loading, select input between A1 and A2

Table 3. Control Wires and Their Functions
 
3.4.4 Local Control
 
It has already been pointed out that a certain minimum amount of local
control must be present at each PE to take care of data-dependent actions.
This takes the form of gates which, when activated, allow or inhibit an
attempted action depending on some internal PE state. When the information
used for local control is stored in some PE register at the same time it is
needed, no additional memory elements are necessary. This is the case, for
example, with the use of A_r as local control for the "add conditional" in
multiplication (see Figure 10). In other instances, however, the local control
information is no longer available when it is needed. In this case local
control flip-flops must be introduced to store this information. Specifically,
there are in the PE six "dynamic outputs" which must be stored somehow since
they may be needed for local control. These dynamic outputs are:

- Equality output (A = B) from the A/L unit
- Carry (C_{n+12}) from the A_c counter
- Carry/borrow (C_{n+12}) from each of the address registers X1, X2 and X3
- Output carry (C_{n+4}) from the A/L unit
 
Four local control flip-flops designated by lcFFi (i=1,2,3,4) are used to
store the dynamic outputs: A = B can be stored in lcFF1; C_{n+12} from A_c can
be stored in lcFF2; the OR of C_{n+12} from X1, X2 and X3 can be stored in
lcFF3; and C_{n+4} can be stored in lcFF4. Notice that only one lcFF is used to
store the OR of the carry/borrows from the three address registers. This
results in a saving of two lcFF's and does not introduce any serious
disadvantage since a carry/borrow in an address register is normally an error
condition and will cause an interrupt regardless of the particular register in
which the overflow occurred.
 
It is easy to see that local control is the most serious obstacle in
achieving the goal of a PE as general as possible, able to cope with a wide
range of word formats and instructions. Normally, a lcFF may be loaded only
with a specific bit of information and is tied to a certain PE function. This
tends to freeze conventions like negative number representation and sign bit
location. These shortcomings suggest the possibility of some generalized local
control logic as illustrated in Figure 14. This could be viewed as allowing
microprogramming at the PE level. Obviously, a generalized local control such
as the one proposed in Figure 14 is prohibitively expensive. Therefore, the
subject was intensively researched and a satisfactory compromise has been
found.
 
Initially, one should notice that any type of local control can be
achieved using only enable control; i.e., being able to enable or disable the
whole PE according to the presence of a ZERO or a ONE in a lcFF. To prove this
proposition, simply consider the fact that local control can be of two types:
a) IF (lcFFi) THEN action 1, and b) IF (lcFFi) THEN action 1 ELSE action 2.
For the moment, a disabled PE is defined as one in which the clock is inhibited
causing all registers to retain their old values. Local control of type a can
be implemented by enabling only the PE's in which lcFFi is ON, executing the
microsequence to perform action 1 and then enabling all PE's again. For local
control of type b a second step is needed in which only PE's in which lcFFi is
OFF are enabled and then action 2 is executed followed by enabling all PE's
again. This type of local control, achieved through enabling and disabling
PE's, will be called indirect local control as opposed to direct local control
in which one or more control wires are directly inhibited by some lcFF or other
register in the PE. Although indirect local control is universal and can
achieve any desired effect, it is obviously slower since extra time is needed
to turn PE's ON and OFF. Therefore, local control in SPEAC will be primarily of
the indirect type except for a few extremely important functions in which one
cannot afford the extra time; these will be implemented directly.

[Figure 14. A Generalized Local Control. Block diagram showing: a set of
inputs allowing access to every bit in the PE; gating allowing any local
control flip-flop to be set from any input or Boolean function of inputs; the
set of local control flip-flops; gating allowing any control wire to be
inhibited by any local control flip-flop or any Boolean combination of the
outputs of local control flip-flops; and the set of control wires.]
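The two-step scheme for IF ... THEN ... ELSE can be sketched as follows. This is a minimal software model under stated assumptions, not SPEAC microcode: each PE state is reduced to a single value, and `run_masked` and `if_then_else` are hypothetical helper names.

```python
def run_masked(pes, enabled, action):
    """Apply `action` only to enabled PE's; disabled PE's keep their old state."""
    return [action(pe) if on else pe for pe, on in zip(pes, enabled)]


def if_then_else(pes, flags, action1, action2):
    """Indirect local control of type b, driven by one lcFF value per PE."""
    # Step 1: enable only the PE's whose lcFF is ON and execute action 1.
    pes = run_masked(pes, flags, action1)
    # Step 2: enable only the PE's whose lcFF is OFF and execute action 2.
    pes = run_masked(pes, [not f for f in flags], action2)
    return pes  # all PE's are enabled again afterwards


values = [3, -4, 5, -6]
flags = [v < 0 for v in values]                 # lcFF ON where the value is negative
result = if_then_else(values, flags,
                      action1=lambda v: -v,     # THEN: negate
                      action2=lambda v: v * 2)  # ELSE: double
```

Every PE sees both microsequences broadcast in turn, which is why the indirect form costs extra time compared with direct inhibition of a control wire.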
 
3.4.4.1 Direct Local Control
 
Direct local control is used in SPEAC for four functions:

a) Input carry (C_n) to the A/L unit. This is controlled by wires ALCC1
   and ALCC2 (see Figure 13 and Table 3). C_n can thus be chosen between
   four values: ONE, ZERO, the complement of lcFF1, and the same value as
   in lcFF4. C_n = ZERO is used in initiating unsigned addition and
   C_n = ONE in initiating unsigned subtraction (using also the complement
   of reg B as operand A to the A/L unit). Signed addition must be locally
   controlled since it can be an actual addition (if both operands have
   the same sign) or a subtraction (if the signs are different). A sign
   comparison can easily be stored in lcFF1 since A = B can be stored in
   this flip-flop. Therefore, lcFF1 = ONE if signs are equal, ZERO
   otherwise, and C_n = complement of lcFF1 can be used in initiating a
   signed addition. The last possible value of C_n is lcFF4. This is used
   in the middle of an addition or subtraction, when C_n must have the
   value that C_{n+4} had in the previous step. Therefore, when adding (or
   subtracting) hexadecimal digits a_i and b_i of A and B, the value of
   lcFF4 is the carry C_{n+4} from the addition (or subtraction) of
   a_{i-1} and b_{i-1} and will be used as C_n. At the same time, lcFF4
   will be changed to C_{n+4} from a_i + b_i, to be used in the next step.

b) Input A to the A/L unit. This is controlled by wires ALIC1, ALIC2 and
   ALIC3. The first two wires choose between B, the complement of B, D1
   and D2. The last one, ALIC3, implements a direct local control: when
   ALIC3 is ON, input A to the A/L unit will be the complement of B
   instead of B if lcFF1 is OFF. If lcFF1 contains a comparison of signs
   in signed addition, as explained above, then this local control
   transforms an addition into a subtraction for the PE's in which the
   signs are unequal.

c) Gating of input A to the A/L unit. This local control is actuated by a
   ONE in wire ALIC4. When this happens, the gating of input A to the A/L
   unit is inhibited by the presence of a ZERO in A_r. Therefore, if A_r
   is ZERO and ALIC4 is ON, operand A to the A/L unit is ZERO regardless
   of the values of ALIC1, ALIC2 and ALIC3. Obviously, this implements the
   "add conditional" needed for multiplication.

d) Finally, there is local control built into the input gating to
   register A_c. When "add and shift" is chosen as the input to register
   A, A_c is either shifted right one (if C_{n+4} is ZERO) or is
   incremented by one and shifted right one (if C_{n+4} is ONE) as
   explained in Section 3.4.3.5.
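Items a) and b) can be illustrated together by a digit-serial sketch. It assumes operands held as lists of hexadecimal digits, low-order digit first, and models only the carry handling and operand selection; the function name `digit_serial_add` and this data layout are hypothetical, and sign handling of the result is left out.

```python
def digit_serial_add(a_digits, b_digits, signs_equal):
    """Add or subtract two numbers one hexadecimal digit per step.

    `signs_equal` plays the role of lcFF1: ON gives a true addition, OFF
    turns the operation into a subtraction (complement of B, first C_n = 1).
    Returns the result digits (low digit first) and the final carry.
    """
    lcff1 = 1 if signs_equal else 0
    lcff4 = 0
    result = []
    for i, (a, b) in enumerate(zip(a_digits, b_digits)):
        operand = b if lcff1 else (~b & 0xF)      # ALIC3-style choice of B or its complement
        c_n = (1 - lcff1) if i == 0 else lcff4    # first step uses the complement of lcFF1
        total = a + operand + c_n
        result.append(total & 0xF)
        lcff4 = total >> 4                        # C_{n+4}, kept for the next digit
    return result, lcff4


# 0x1234 + 0x0FFF and 0x1234 - 0x0FFF, digits listed low-order first
sum_digits, sum_carry = digit_serial_add([4, 3, 2, 1], [0xF, 0xF, 0xF, 0], True)
diff_digits, diff_carry = digit_serial_add([4, 3, 2, 1], [0xF, 0xF, 0xF, 0], False)
```

The same broadcast microsequence therefore performs an addition in some PE's and a subtraction in others, with lcFF1 as the only locally varying control.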
 
3.4.4.2 Indirect Local Control
 
All control functions not directly implemented are obtained using the
lcFF's to enable chosen PE's. In order to do this, one must be able to store
the controlling bit in one of the lcFF's. It has already been explained that
the "dynamic outputs" can be directly stored in lcFF's. There are four lcFF's
in the PE and Figure 15 presents a simplified diagram of the controls at the
input and output of each lcFF. For the precise logic, the reader is referred
to Figure 13 and package 6 in Appendix A.
 
The local control structure illustrated in Figure 15 is actually a
simplification of the generalized local control described in Figure 14; the
number of gates was considerably reduced to make the unit practical for use
in a "small" PE like SPEAC's. Nevertheless, the unit is as powerful as the
generalized local control although not as fast.
 
In order to perform indirect local control, every bit in the PE
should be accessible to a lcFF. This is achieved by linking LC, the register
composed of the four lcFF's, to data buses D1 and D2 like all other data
registers, thus allowing any bit in the PE to be fed as input to a lcFF. It
should also be recalled that the dynamic outputs can also be stored in the
lcFF's. Therefore, the input gates of Figure 14 have been reduced in Figure
15 to a 1-out-of-3 selector for each lcFF. The selector for lcFFi is
controlled by two wires: LCiC1 and LCiC2. The four possible input actions are:
a) do nothing (i.e., retain the previous value stored), b) store in lcFFi
the i-th bit in D1, c) store in lcFFi the i-th bit in D2, and d) store in
lcFFi the dynamic output associated with that flip-flop as described in
Section 3.4.4.
 
It is often necessary to set a lcFF to a Boolean combination of
other bits, sometimes to a Boolean combination of bits in other lcFF's. In
order to save the gates needed to implement this directly, the output of LC is
made available as a possible value of D1 or D2 like any other data register.
Therefore, the contents of LC can be brought to register A and one can
perform shifts and logical operations. When the desired function is obtained,
it can be stored back in LC from D1 or D2.

[Figure 15. Diagram of a Local Control FF. The D input of lcFFi is selected,
under input control, among D1i, D2i and dynamic output i. The Q output goes to
the D1i and D2i gate selectors and, under output control, to three functions:
enable when FF is ON, enable when FF is OFF, and gate FF to interrupt wire.]
 
The output gates of the generalized local control have also been
reduced in Figure 15 to a 1-out-of-3 selector controlled by two wires: LCiC3
and LCiC4. These wires control the function performed by each lcFF. The
four possible functions performed by lcFFi are: a) do nothing (i.e., the
state of the flip-flop has no effect on the PE), b) enable PE only if lcFFi
is ON, c) enable PE only if lcFFi is OFF, and d) gate the output of lcFFi to
the interrupt wire. Function d, used when it is desired to send an interrupt
signal to the CU, will be discussed in Section 3.4.6. Functions b and c are
used to perform indirect local control. Since it is possible to enable either
on a ONE or on a ZERO of a lcFF, one avoids moving LC to A only for
complementing. This is important because it is often necessary to enable PE's
in which lcFFi is ON, perform an action and then enable only PE's in which it
is OFF to perform another action, thus obtaining control of the type IF
(lcFFi) THEN action 1 ELSE action 2. It is then clear that a lcFF does not
have a fixed function but is attributed, for each clock cycle, one among four
possible functions. Also, each lcFF is controlled completely independently
from the others, which makes this type of local control rather costly in terms
of control wires; 16 wires are required altogether. It is felt, however, that
the performance and versatility obtainable with this local control justify the
cost.
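The input and output selection of a single lcFF can be modelled roughly as below, using the two-bit codes listed in Table 3. The class and method names are hypothetical, and two details are assumptions of this sketch: that a code such as "10" is read as (first wire, second wire), and that lcFFi is fed from bus bit i-1.

```python
class LocalControlFF:
    """One lcFF with its 2-wire input selector and 2-wire output selector."""

    def __init__(self, index):
        self.index = index          # i = 1..4
        self.value = 0

    def clock(self, in_code, d1, d2, dynamic):
        """in_code = (LCiC1, LCiC2); d1 and d2 are 4-bit bus values."""
        bit = self.index - 1        # assumed mapping of lcFFi to a bus bit
        if in_code == (1, 1):
            self.value = (d1 >> bit) & 1
        elif in_code == (1, 0):
            self.value = (d2 >> bit) & 1
        elif in_code == (0, 1):
            self.value = dynamic & 1
        # (0, 0): retain the previous value

    def outputs(self, out_code):
        """out_code = (LCiC3, LCiC4); returns (PE enable, interrupt wire)."""
        if out_code == (1, 1):
            return bool(self.value), 0      # enable only if lcFFi is ON
        if out_code == (0, 1):
            return not self.value, 0        # enable only if lcFFi is OFF
        if out_code == (1, 0):
            return True, self.value         # gate lcFFi to the interrupt wire
        return True, 0                      # (0, 0): no effect on the PE


WIRES_PER_FF = 4                            # two input wires plus two output wires
total_wires = 4 * WIRES_PER_FF              # 16 wires for the four lcFF's
```

Since the output code is chosen anew on every clock, the same flip-flop can enable on ONE in one step and on ZERO in the next without being complemented.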
 
3.4.5 Mode Control
 
Mode control is simply the ON-OFF type as in ILLIAC IV. Register M
(also called EE for external enable) is in charge of this control. This is a
single bit register which can be loaded with any bit of D1 or D2. Therefore,
the input gating for register M is a 1-out-of-8 selector controlled by wires
EEC1, EEC2 and EEC3. A fourth wire (EEC4) completes the control of register
M. When EEC4 is ON, M is loaded with the input bit selected by the three other
wires; when it is OFF, M retains its old value. The mode control register has
a fixed function which is to enable the PE on a ONE (i.e., whenever M is ON,
the PE is enabled and whenever it is OFF, the PE is disabled).
 
The mode register can also be called the "external enable" register,
which points out the fact that it is an enable register reserved for user (or
macro-instruction) manipulations, as opposed to the internal enable, which is
the function attributable to lcFF's. The internal enable is normally used only
by the systems programmer in micro-instructions.
 
It is now convenient to define precisely what is meant by a disabled
PE. Most registers in the PE are clocked by the signal Ck, which is the main
clock sent by the CU ("Clock"), inhibited by register M and possibly by the
lcFF's. Therefore, when a PE is disabled, all registers clocked by Ck are
frozen; i.e., they retain their old values. The elements not clocked by Ck
are: registers M and X3, and the two PEM modules. Register M is directly
clocked by "Clock" and cannot be disabled. This is obviously needed or else,
once M were disabled, the PE could never be enabled again. There is a special
problem with PEM and X3: as described in Section 3.4.1, one must be able to
overlap PE operation with replenishment of PEM. Therefore, I/O operations must
be able to reach a disabled PE since PEM in all PE's must be replenished
regardless of the fact that some PE's may be temporarily OFF. In order to
accomplish this, each PEM module receives both clock signals: the direct signal
"Clock" and the possibly inhibited Ck. A control wire (PEMiC2, where i is the
module number) decides whether "Clock" or Ck is to be used, thus choosing
between ignoring and respecting disabling. Also, X3 is clocked by "Clock"
instead of Ck since it is mainly used to hold addresses for I/O operations.
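The clock gating just described reduces to two small Boolean functions. The sketch below is an illustrative model only; `gated_clock` and `pem_clock` are hypothetical names, and the lcFF enable conditions are passed in as already-evaluated Booleans.

```python
def gated_clock(clock, m, lc_enables):
    """Ck: the CU clock, inhibited by register M and by any lcFF enable
    condition that is currently selected and not satisfied."""
    return bool(clock and m and all(lc_enables))


def pem_clock(clock, ck, pemic2):
    """Clock seen by a PEM module: PEMiC2 ON means 'do not obey mode
    control', so the direct Clock is used instead of Ck."""
    return bool(clock) if pemic2 else ck


# A PE disabled by M = OFF: its registers are frozen, but an I/O
# replenishment can still reach its PEM when PEMiC2 is ON.
ck = gated_clock(clock=1, m=0, lc_enables=[True])
pem_during_io = pem_clock(clock=1, ck=ck, pemic2=1)
pem_normal = pem_clock(clock=1, ck=ck, pemic2=0)
```

Register M and X3 would simply be clocked by `clock` directly in such a model, since neither of them obeys the enable state.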
 
Finally, it can be pointed out that the contents of M are not
accessible to the PE. Therefore, if the setting of M is to be used later, it
must be temporarily stored in sM at the time it is being loaded into M.
 
3.4.6 Interrupts
 
The interrupt system is very simple; every PE has one interrupt wire
and the CU receives also only one wire which is the OR of the data in the
interrupt wires of each PE. If one or more PE's are interrupted, the CU will
sense a "1" in the interrupt wire and the operating system will have to
interrogate the PE's to find out which are responsible. This scheme has the
advantage of making the number of interrupt wires independent of the number of
PE's, allowing for system expansion.
 
It has already been described (in Section 3.4.4.2) that one of the
functions attributable to each lcFF is the gating of its contents into the
interrupt wire. Conditions that should cause an interrupt are detected in the
PE and stored as a ONE in some lcFF. The interrupt can then be sent to the CU
by attributing the interrupt function to that lcFF. It should be noticed that
the propagation times of the PE interrupt signals are assumed short compared
to the PE clock period. This is what allows only one interrupt flip-flop to
be used for different conditions like the following: exponent overflow,
exponent underflow, fixed point overflow, division by zero, etc. It is
assumed that the CU will notice the interrupt soon enough to be able to
distinguish the different conditions by an analysis of which step of which
operation was being performed.
 
 It is also interesting to point out that the interrupt system is used 
 not only to detect error conditions, but can be very useful to detect the end 
 of a recurrence process or to optimize certain programs. For example, assume 
 that a recurrence process is being executed by all PE's. At the end of each 
 step, the error is computed and compared with the maximum acceptable. All PE's 
 in which the error is smaller than the maximum are turned OFF, via lcFF3 for 
 example. Sending lcFF3 via the interrupt wire will enable the CU to detect if 
 all PE's have been turned OFF. If this is the case, the recurrence is ended. 
 It may also be quite useful to add a control wire enabling one to send M via 
 the interrupt wire. 
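The recurrence example can be sketched as follows. This is a loose software analogue under stated assumptions: each PE runs a Newton iteration for a square root (an arbitrary stand-in for "a recurrence process"), a per-PE flag plays the role of lcFF3, and the helper names are hypothetical.

```python
def interrupt_wire(flags):
    """The CU sees only the OR of the interrupt wires of all PE's."""
    return int(any(flags))


def run_recurrence(values, tolerance, max_steps=100):
    """Iterate x <- (x + a/x)/2 in every PE until the interrupt wire drops."""
    xs = [max(a, 1.0) for a in values]      # one starting estimate per PE
    active = [True] * len(values)           # role of lcFF3 in each PE
    steps = 0
    while interrupt_wire(active) and steps < max_steps:
        # Only enabled PE's execute the step; the others keep their result.
        xs = [0.5 * (x + a / x) if on else x
              for x, a, on in zip(xs, values, active)]
        # PE's whose error is below the maximum acceptable are turned OFF.
        active = [on and abs(x * x - a) >= tolerance
                  for x, a, on in zip(xs, values, active)]
        steps += 1
    return xs, steps


roots, steps_used = run_recurrence([4.0, 9.0, 2.0], tolerance=1e-9)
```

The CU never inspects individual PE's here; a single wire is enough to tell it that every PE has converged.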
 
3.4.7 Implementation Remarks
 
This section considers some of the design problems that would have
to be solved if the PE previously described were to be actually built. T²L
integrated circuits will be used in the implementation of the PE logic due to
their medium cost, speed, and power dissipation. MOS logic was initially
considered and it offered considerable advantages in cost and power
dissipation; however, it does not seem to be fast enough for the purpose of
making the memory cycle (1/2 µsec) the basic speed limiting factor. This
cannot be achieved with conventional MOS logic in the PEM (although
silicon-on-sapphire technology promises for the near future an order of
magnitude increase in the speed of MOS logic). T²L, although not as fast as
desirable, will allow a good balance between memory fetch time and PE
operation time; assuming 10 nsec as the typical gate propagation delay time,
and considering that there are no long logic chains in the PE, it is realistic
to assume a PE clock period of 100 nsec
 (PE clock frequency = 10 Mc/s). Therefore, a PE clock takes — to — of a PEM 
 
cycle, depending on how fast the PEM is used.
 
Since the PU's will be pluggable, it is important that the number of
connections to each PU be minimized as this is, in integrated circuitry, a cost
factor probably more important than mere gate count. Table 4 shows the actual
number of PE connections achieved. A total of 103 to 110 is needed, probably
making necessary two connectors in each PU if a conventional printed circuit
is used. Three power wires are needed instead of two if MOS PEM's are used
since they need an extra voltage level. IOB and CDB must be bidirectional.
This is achieved either by running two independent buses, one in and one out
as indicated in Figures 11 and 13, or by using only one bus with additional
logic in the PE and one extra control wire to choose in which direction the
bus is to be used. The cost of six extra connections seems small enough to
save the extra complications of using only one bus. Also, if both in and out
buses are present, they could be simultaneously used in some operations like
I/O and routing. Therefore, eight wires are used for CDB and eight more for
IOB.
 
Function          Number of Connections
----------------  ---------------------
Control wires     80 - 78
CDB               4 - 8
IOB               4 - 8
CAB               12
Interrupt wire    1
Power             2 - 3
Total             103 - 110

Table 4. Connections to Each PU
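The totals of Table 4 can be checked with a short tally. The grouping of the table's paired values into two configurations (single bidirectional buses versus separate in and out buses with MOS PEM power) is an assumption of this sketch, chosen because it reproduces the printed totals; the dictionary names are hypothetical.

```python
# Left-hand values of Table 4: one bidirectional bus each for CDB and IOB
single_bus = {"control wires": 80, "CDB": 4, "IOB": 4,
              "CAB": 12, "interrupt wire": 1, "power": 2}

# Right-hand values: separate in and out buses, third power wire for MOS PEM's
dual_bus = {"control wires": 78, "CDB": 8, "IOB": 8,
            "CAB": 12, "interrupt wire": 1, "power": 3}

total_single = sum(single_bus.values())     # 103
total_dual = sum(dual_bus.values())         # 110


def bus_and_control(cfg):
    """Connections affected by the choice between one bus and two."""
    return cfg["control wires"] + cfg["CDB"] + cfg["IOB"]


# Net cost of running separate in and out buses
extra_connections = bus_and_control(dual_bus) - bus_and_control(single_bus)  # 6
```

The difference of six matches the "six extra connections" quoted in the text: eight more bus wires, less the two direction-control wires that are no longer needed.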
 
The number of control wires (78) is quite large, but this is the
price to pay for retaining maximum PE versatility for the micro-programmer.
Of course, the number of control wires could be reduced by adding encoding
logic in each PE. However, this would increase the gate count per PE and
reduce the flexibility of the controls. Therefore, encoding of control wires
was used only when flexibility was not affected (as in the input to a
register, which in any case cannot be loaded with two different inputs) and
when the extra gating comes automatically in the IC's used or can be added
economically.

T²L MSI chips manufactured by Texas Instruments provide a preliminary
guideline in the discussion of questions related to: number of gates, IC's
available off-the-shelf, power dissipation, etc. Therefore, the suggested IC's
are limited to the ones listed in [8] and this information is only useful in
rough evaluations for a breadboard PE. In actual construction, a few
made-to-order LSI IC's would be used in place of several smaller chips. Table
5 lists a few MSI T²L chips available off-the-shelf that could be of interest
in the construction of a breadboard PE. Table 6 lists all the packages used in
Figure 13 and also gives the number of FF's per package and a very rough
evaluation of the number of equivalent gates per package. Memory elements were
not included in the evaluation of the totals for the PE. Roughly, the proposed
implementation requires 1K gates and 64 type D flip-flops for a total of
approximately 1.3K gates. Table 7 presents a preliminary evaluation of the
number of IC chips that would be needed in each PU. Two numbers are given:
one, for a breadboard PU, uses the chips introduced in Table 5; in this case
more than one hundred chips are necessary. The second number assumes the
availability of a few custom made IC's with up to 24 pins
 
Chip  Type      Equivalent  DIP   Average power  Description
No.             Gates       Pins  diss. (mW)
----  --------  ----------  ----  -------------  ------------------------------
1     SMA2002   na          28    1331           Memory: MOS, 1024 x 2, T²L
                                                 compatible, fully decoded
2     Fair3532  na          16    150            Memory: MOS, 512 x 2, T²L
                                                 compatible, fully decoded
3     SN7489    na          16    375            Memory: 16 x 4, scratchpad
4     SN74175   na          16    na             Register: D-type, 4 bits
5     SN74174   na          16    na             Register: D-type, 6 bits
6     SN74191   58          16    325            Counter: parallel in/out,
                                                 synchronized, up/down, 4 bits
7     SN74181   75          24    ~375           A/L unit: 4 bits
8     SN74157   ~15         16    125            Data selector: Quad 2-to-1
9     SN74153   ~16         16    180            Data selector: Dual 4-to-1
                                                 with strobe
10    SN74152   ~15         16    130            Data selector: 8-to-1
11    SN74L98   ~40         16    25             Data selector/storage
                                                 register: 2-to-1, 4 bits
12    SN74LS83  ~42         16    75             4-bit binary full adder

Table 5. Some IC Chips that Might Be Used in the PE
 
Package  Function                              No.   Approx.    FF's per  Total  Total
Number                                         Used  Gates per  package   gates  FF's
                                                     package
-------  ------------------------------------  ----  ---------  --------  -----  -----
1        1-out-of-8 selector; no strobe        29    9                    261
2        Quad D type FF; clock enabled for     5     5          4         25     20
         all FF's simultaneously
3        Type D FF with enable on the clock    9     2          1         18     9
4        1-out-of-4 selector; no strobe        5     5                    25
5        1-out-of-3 selector with enable       4     5                    20
         decoding
6        Enable and interrupt control          1     18                   18
7        PEM, 1 mod                            2     -          -         -
8        sM: 64 bit memory, 16 4-bit words     1     -          -         -
9        A/L unit                              1     ~60                  60
10       1-out-of-2 selector without strobe    59    3                    187
11       4 bit add/subtract counter,           9     ~25        4         225    36
         parallel in/parallel out
12       1-out-of-4 selector with strobe       4     5                    20
13       Quad inverter                         1     4                    4
14       Increment by 1 network (6 bits)       2     25                   50
15       1-of-5 selector                       24    6                    144
TOTALS                                                                    1057   65

Table 6. Packages Used in the PE and Their Contents
 
                          Breadboard                           Actual Implementation
Used In                   Chips Used                   No. of  Chips Used            No. of
                                                       Chips                         Chips
------------------------  ---------------------------  ------  --------------------  ------
PEM                       chip 1 = 1/2 Pk 7            4       as in breadboard      4
reg B and input gates     chip 11 as Pk 2 +            1       as in breadboard      1
                          (4 x Pk 10)
input to D1, D2, A_m,     chip 10 as Pk 1              28      2 x Pk 1              14
A_r, A_c
input to A1, A2           chip 10 as Pk 15             24      2 x Pk 15             12
output to IOB, CDB        chip 8 as 4 x Pk 10          2       as in breadboard      2
inputs to sM              chip 8 as 4 x Pk 10          2       as in breadboard      2
sM                        chip 3 as Pk 8               1       as in breadboard      1
A/L unit                  chip 7 as Pk 9               1       as in breadboard      1
X1, X2, X3                chip 6 as Pk 11              9       1 1/2 Pk 11           6
inputs to X1, X2, X3      chip 8 as 4 x Pk 10          9       6 x Pk 10             6
input A to A/L            chip 9 as 2 x Pk 12          2       as in breadboard      2
Increment net             chip 12 as 2/3 x Pk 14       3       Pk 14                 2
A_c                       chip 5 as 1 1/2 x Pk 2       2       as in breadboard      2
A_r                       chip 4 as Pk 2               1       as in breadboard      1
A_m                       SSI dual FF                  2       4 x Pk 3              1
enable control in A_m     chip 9 as 2 x Pk 4           2       4 x Pk 4              1
M and M input             SSI FF; chip 10 as Pk 1      2       Pk 3 + Pk 1           1
LC and LC input           SSI dual FF; chip 9 as Pk 5  4       2 x Pk 3 + 2 x Pk 5   2
enable control            chip 9 as 1/4 x Pk 6         4       Pk 6                  1
others                    SSI chips                    5       Pk 13 + Pk 4;         2
                                                               3 x Pk 5
Total                                                  108                           64

Table 7. Rough Estimates for the Number of Chips Per PU
 
per DIP. These IC's are only slight modifications of the ones in Table 5. In
this case, the number of chips goes down to about 64. This number of chips
will readily fit in one printed circuit board or, better yet, a new packaging
technology could be used: a multi-chip on a ceramic substrate technique which
is being developed at Fairchild. As far as design is concerned, the substrate
is analogous to a two-sided printed circuit board with single devices installed.
In addition, a system package is being developed to connect these devices
together with simple cam-operated connectors and backplanes.
 
It is important to point out that the number of 64 chips was obtained
with a very superficial analysis of the circuit and only assuming the
availability of quasi-standard IC's. It is expected that with careful computer
analysis of the possible partitions of the circuit and wide use of custom-made
IC's, the number of MSI chips could go down to about 30 (this is the number
reached if one divides the total number of equivalent gates in the PE (1.3K)
by 60 to 70, the number of equivalent gates easily obtained nowadays in one
MSI T²L chip).
 
The power dissipation per PE is quite acceptable. It is on the order
of 15 watts, assuming an average of 10 mW per gate for T²L. A new low power
T²L could be used to reduce this number by a factor of 5 to 10.
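The order of magnitude quoted above follows from a one-line estimate. The sketch below takes the 1.3K equivalent-gate total and the 10 mW per gate figure from the text; the function name `pe_power_watts` is hypothetical, and the result is only as good as those two rough inputs.

```python
def pe_power_watts(gates, mw_per_gate):
    """Rough PE dissipation: equivalent gate count times average power per gate."""
    return gates * mw_per_gate / 1000.0


standard = pe_power_watts(1300, 10)             # about 13 W with standard T2L
low_power = (standard / 10, standard / 5)       # range with low power T2L
```

With these inputs the estimate comes to about 13 W, of the same order as the 15 W quoted, and to roughly 1.3 to 2.6 W for the low power family.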
 
Finally, it should be mentioned that a number of simplifications
could be adopted in the PE at a small cost in performance. Only careful
simulation can decide whether the saving thus obtained justifies the loss in
performance or versatility. Some of these simplifications are:

- make B unavailable as a value to D1 or D2
- do not use X3
- use only 10 bits in address lines instead of 12
- make Xi count up only instead of up/down
- reduce A_c to 8 or 10 bits.
 
3.5 The Control Unit
 
The control unit has already been summarily described in Section 3.3.
In this section, a few more details of the CU's structure and functions are
presented but only in a macroscopic way, without getting to the gate level as
was done with the PE.
 
 3- 5-1 CU General Structure 
 
 Figure 16 presents a diagram of the control unit structure. The 
 components are: 
 
 a) CU Memory (CUM) , which is a conventional, high speed random 
 access memory in which SPEAC's instructions and CU data are 
 stored. It can be replenished from mass memory and is accessed 
 by the central processing unit and by the instruction lookahead 
 unit. 
 
 b) Instruction Lookahead Unit (ILA) which fetches instructions from 
 CUM and sends them to the instruction decoding unit. Since CUM 
 is very fast, a sophisticated ILA is probably not necessary. 
 
 c) Instruction Decoding Unit (IDU) which performs basic instruction
 decoding and central indexing. The instructions are identified
 as CU, PE, or I/O instructions and sent to the respective
 instruction processor along with their indexed addresses and other
 data.
 
 d) Central Processing Unit (CPU) which is the CU instruction processor
 and is responsible for the execution of CU instructions. It
 is basically a fast, highly parallel unit similar to one of
 ILLIAC IV's PE's. It should be compatible with the data formats
 used in the PE's. Therefore, for maximum versatility, it should
 also be microprogrammable like the instruction processor. The
 CPU is not completely independent from the PE array since it can
 send common operands to all PE's via CDB ("broadcasting") and
 also can receive data from the PE's. For this purpose, the CPU
 can send microsequences to the PE's via the CU Queue.

 [Figure 16. CU Structure. Blocks shown: CUM (CU memory), ILA
 (instruction look ahead), MMI (mass memory interchange), CPU (CU
 instruction processor), IOC (IO instruction processor), PEIP (PE
 instruction processor, with microprocessor and micromemory), the CU, PE
 and IO queues, and FINST (final station), which feeds the PU array. Paths
 shown: to and from mass memory, control to mass memory, IO requests to
 MMI, and row gating.]
 
 e) I/O Instruction Channel (IOC) which is the I/O instruction
 processor and executes array I/O instructions. Like the other two
 instruction processors, it could be microprogrammable for maximum
 versatility. The IOC sends I/O requests to the mass memory
 interchange and control pulses to the row gating and I/O Buffer
 Register (IOBR). It can also send microsequences to the PE via
 the IO Queue.
 
 f) PE Instruction Processor (PEIP) which is the third and last
 instruction processor, in charge of PE instructions. It is fully
 microprogrammable and can be divided into two parts. The first
 part is a microprocessor (μP) which executes the microprograms
 and sends microsequences to the PU via two queues--PE Queue 1
 and PE Queue 2. The second part is a micromemory (μM) which
 stores the microprograms. μM does not have to be a separate
 memory; part of CUM may be used as micromemory if this is the
 most economical scheme.
 
 g) Four Queues which are: CU Queue (CUQ), PE Queue 1 (PEQ1), PE Queue 2
 (PEQ2), and IO Queue (IOQ). These queues store microsequences
 sent by each instruction processor, absorbing fluctuations in
 the rate of generation of these microsequences, which enables the
 final station to keep the array as busy as possible.
 
 h) Final Station (FINST) which analyzes the entries at the bottom
 of each queue and decides which microsequences to send to the
 array for optimum PE performance. It must also combine two queue
 entries into one PE microsequence since each queue entry is not
 a complete μsequence but a request to use one of the two pairs
 of buses in the PE's. FINST action will be explained in
 considerable detail in Section 3.5.3.
 
 i) Mass Memory Interchange (MMI) which utilizes the several modules
 of mass memory in an optimum fashion, resolving memory request
 conflicts. It receives requests from the following sources: CUP,
 IOC, Corner Memory and Peripherals.
 
3.5.2 Machine Synchronization - Events
 
Events are the means of synchronization in the machine; not only
are they accessible to the user for problem-dependent synchronization (I/O and
operations, for example) but they are also used by the microprograms to
synchronize different microsteps executed in the PE's, CU and IOC. Each event
is assigned an absolute number and is basically a flip-flop; when OFF, the
event did not occur and when ON, the event has occurred. A reasonable number
of events is needed; 64 as a first approach, for example.
 
Therefore, synchronization is obtained with commands to "WAIT on event
N" or "CAUSE event N." WAIT and CAUSE commands are attached to instructions
and are recognized and obeyed at three units: CUP, IOC and FINST. Consider,
for example, a CU instruction which needs as one operand a PE value sent via
CDB. The instruction goes to CUP which does any local processing needed and
then issues the microsequence to CUQ. The microsequence contains a "CAUSE
event N." The CU then idles on a "WAIT on event N." When the microsequence
is executed, i.e., when the data needed from the array reaches the CUP, event
N happens and CUP finishes execution of the instruction. This waiting time
could be used by the CU for multiprocessing a serial program (a compilation,
for example) being run simultaneously. One must make sure that an event will
not be considered "occurred" because the FF is ON from another use of the same
event number. Therefore, the user does have the responsibility of "releasing"
an event when the present use of that event number terminates. This may be
done when the event is waited on for the last time, with a special type of
wait--WAIT and RELEASE--or an event may be specifically reset with a RESET
EVENT command.
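
The event commands described above can be sketched in software as follows.
This is only an illustrative model, not the hardware design: the class name
EventBank and its method names are assumptions introduced here, and a wait
that cannot proceed is modeled by returning False (the unit would simply try
again at the next clock).

```python
class EventBank:
    """A bank of event flip-flops: OFF = has not occurred, ON = has occurred."""

    def __init__(self, count=64):
        # 64 events is the "first approach" figure suggested in the text.
        self.ff = [False] * count

    def cause(self, n):
        """CAUSE event n: turn its flip-flop ON."""
        self.ff[n] = True

    def wait(self, n, release=False):
        """WAIT on event n. Returns True if the waiter may proceed.

        With release=True this models WAIT and RELEASE: the flip-flop is
        turned OFF again so the event number can be reused.
        """
        if not self.ff[n]:
            return False
        if release:
            self.ff[n] = False
        return True

    def reset(self, n):
        """RESET EVENT n: explicitly turn the flip-flop OFF."""
        self.ff[n] = False


events = EventBank()
blocked = events.wait(5)                  # CU idles: event 5 has not occurred
events.cause(5)                           # the microsequence delivers the value
proceeds = events.wait(5, release=True)   # last wait on this use of event 5
reusable = not events.ff[5]               # flip-flop is OFF again
```

The release on the last wait is what prevents a later, unrelated use of the
same event number from finding the flip-flop already ON.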
 
 The following event manipulation commands are desirable: 
 
 - Wait on a boolean function of events 
 
 - Cause an event depending on a boolean function of others 
 
 - Cause several events simultaneously. 
 
 Basically those commands are for program use only since microsequence synchron- 
 ization must be very fast and must be done with single events. 
 
It should be noticed that one would never actually wait on a boolean
combination of events, since this would require the boolean function to be
evaluated at each clock to determine if the wait is over. The way to do this
is to have, after each cause of the events that appear in the boolean
function, a statement that evaluates the boolean combination and places the
result on an extra event number N. Then the wait is simply on event N.
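
A minimal sketch of this scheme is given below, assuming the events are
modeled as a plain list of booleans. The helper name cause_with_derived and
the choice of event numbers are hypothetical; the point is only that the
boolean function is evaluated at each cause, never at each clock of the wait.

```python
def cause_with_derived(events, n, derived, condition):
    """Cause event n, then re-evaluate `condition` and latch it into `derived`.

    events: list of booleans, one per event flip-flop.
    condition: function of the event list implementing the boolean function.
    """
    events[n] = True
    if condition(events):
        events[derived] = True


events = [False] * 64


def both(ev):
    # The waiter wants "event 1 AND event 2"; event 10 is the extra event.
    return ev[1] and ev[2]


cause_with_derived(events, 1, 10, both)
after_first = events[10]     # only event 1 has occurred so far
cause_with_derived(events, 2, 10, both)
after_second = events[10]    # both have occurred: a single wait on 10 succeeds
```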
 
Care must be taken to avoid re-use of an event before its previous
use is completed. Certain complicated cases may be confusing. Consider, for
example, the following program:

    Input 1         cause event #3

    PE-multiply     wait on event #3 and release it

    Input 2         cause event #3

    CU operation    wait on event #3

In the situation above, Input 1 may occur and cause event #3. Then,
before the PE-multiply or Input 2 occur, the CU operation may be executed;
it finds event #3 ON, so there is no wait, although it was meant to wait
for Input 2.
 
The possibility of symbolic event names handled by the hardware
could be investigated; the hardware would automatically assign symbolic event
numbers to the first available physical event flip-flop. This would free the
user from keeping track of which events are available, and no set of events
would have to be reserved for μsequence use. However, the user would still
have to release events.
 
 Note also that with the present scheme, it is necessary to divide 
 the events into two sets: user events and internal events. The latter will 
 be used by the microprograms to synchronize the execution of microsequences. 
 
3.5.3 Queue System and FINST
 
 Queue entries can be considered as requests to use part of a PE. 
 These requests are serviced by FINST which, if possible, combines two entries 
 from different queues into a PE microsequence and sends the microsequence to 
 the PE's. The purpose of FINST and the queue system is to keep both pairs of 
 PE buses (Al, Dl and A2, D2) as busy as possible. 
 
 
The basic principle involved is dynamic bus allocation; i.e., each
queue entry does not ask specifically for use of bus 1 or 2, it asks for
either a) any bus, or b) the bus that has access to the PEM module containing
the address stored in Xi (i=1, 2, or 3). Requests of type a are made for
inter-register transfers, in which it is immaterial which bus is actually
used; requests of type b are necessary for memory transactions since for these
a specific bus must be used. Therefore, under dynamic bus allocation, CUP,
PEIP and IOC do not specify the microsequences completely--FINST will
dynamically allocate buses to the partial microsequences in the best possible
way.
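
The two kinds of request can be illustrated with the short sketch below. It
rests on assumptions that are not stated in the text: buses and PEM modules
are numbered 0 and 1, bus k is taken to be the one with access to module k,
and the table module_of_x (a hypothetical name) records the module each
address register currently points into.

```python
def required_bus(x_field, module_of_x):
    """Return None for a type (a) request, or a bus number for type (b).

    x_field: 0 if no address register is used (inter-register transfer),
             otherwise the number i of the address register Xi.
    module_of_x: dict mapping i to the PEM module (0 or 1) Xi points into.
    Assumption: bus k is the bus that can reach PEM module k.
    """
    if x_field == 0:
        return None                 # type (a): any bus will do
    return module_of_x[x_field]     # type (b): one specific bus


def can_use(x_field, module_of_x, bus):
    """True if the request can be served by the given bus."""
    need = required_bus(x_field, module_of_x)
    return need is None or need == bus


module_of_x = {1: 0, 2: 1, 3: 1}
```

With this table, a register transfer (x_field 0) is acceptable on either
bus, while a memory transaction through X2 can only be given bus 1.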
 
3.5.3.1 Queue Structure

Each queue entry contains basically a partial microsequence and
information which is used by FINST. The fields of a queue entry are
illustrated in the upper part of Figure 17. All four queues have the same
structure although only Queue 2 has been detailed.
 
[Figure 17. Queues and FINST Structure. The figure shows the four queues
(queue 0: IOQ, queue 1: CUQ, queue 2: PEQ1, queue 3: PEQ2), the fields of a
queue entry (X, C, μSB, μSC, CAU, CA, CDU, CDR, CD, WEVU, WEV, CEVU, CEV),
and the FINST registers (FFX1, FFX2, FFX3, FFC0, FFC1, BA0, BA1, BC0, BC1,
CDBR, CDBRU).]
 
The fields are as follows:

X: address field (2 bits). X=0 means the address register is not used;
   i.e., we have a data transfer and not a memory fetch. X=i (where
   1 ≤ i ≤ 3) means the address register Xi in the PE's will be used
   in this microsequence.
C: counter field (~6 bits). C=0 means the microsequence is a no-op.
   C=n (n>0) means that when a bus is assigned to that queue, this
   microsequence and the next n-1 will be processed consecutively.
μS: these are fields that contain the partial microsequence.
   μSB: bus-dependent microsequence field (~23 bits). This is the part
        of the microsequence related to the bus used.
   μSC: bus-independent microsequence field (~55 bits). This is the part
        of the microsequence related to control that does not use buses.
CAU: use of CAB field (1 bit). CAU ON means CAB will be used and must
   be set to the value stored in CA.
CA: common address field (12 bits). This contains the value to be used
   as common address.
CDU: use of CDB field (1 bit). CDU ON means CDB_in will be used and must
   be set to the value stored in CD.
CDR: common data receive field (1 bit). When ON, CDB_out will be used to
   receive data from the PU's; this data must be stored in CDBR.
CD: common data field (4 bits). This contains the value to be used as
   common data.
EV: these are fields that control events.
   WEVU: wait event use field (1 bit). When ON, this entry must await an
         event whose number is stored in WEV.
   WEV: wait event field (~6 bits). This contains the number of an event
        to be waited on.
   CEVU: cause event use field (1 bit). When ON, this entry must
         cause an event whose number is stored in CEV.
   CEV: cause event field (~6 bits). This contains the number of an
        event to be caused.

The bus-dependent microsequence field must be further explained. It
can be divided into two sub-fields: μSBa and μSBb. μSBa, with 8 bits,
corresponds to the control wires to gate into buses D and A (3 wires for each)
and to control PEM (2 wires). In the actual microsequence, this field appears
twice: once for each bus pair. μSBb, with about 15 bits, corresponds to the
control wires to gate from buses D and A. The values of the bits in this
field of a queue entry have a special meaning: a ZERO means that the
corresponding control is not used in this microsequence and a ONE means that
the control is used (i.e., the final microsequence must have in that position
the appropriate bit to load from the bus that has been assigned to that queue
entry).
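
As an illustration only, the queue entry described above could be modeled
as the record below. The field names are lower-case transliterations of the
names in the text; the widths marked "~" in the text are approximate, so the
total entry length computed here is an estimate under that assumption, and
the packing order is an arbitrary choice made for this sketch.

```python
from dataclasses import dataclass, fields

# Field widths in bits, in the order used for packing (an assumed order).
WIDTHS = {"x": 2, "c": 6, "usb": 23, "usc": 55, "cau": 1, "ca": 12,
          "cdu": 1, "cdr": 1, "cd": 4, "wevu": 1, "wev": 6, "cevu": 1,
          "cev": 6}


@dataclass
class QueueEntry:
    x: int = 0      # address field; 0 = no address register used
    c: int = 0      # counter field; 0 = no-op
    usb: int = 0    # bus-dependent partial microsequence
    usc: int = 0    # bus-independent partial microsequence
    cau: int = 0    # CAB is used
    ca: int = 0     # common address
    cdu: int = 0    # common data is sent to the array
    cdr: int = 0    # common data is received from the array
    cd: int = 0     # common data
    wevu: int = 0   # entry waits on event WEV
    wev: int = 0
    cevu: int = 0   # entry causes event CEV
    cev: int = 0


def pack(entry):
    """Pack an entry into one integer, first field most significant."""
    word = 0
    for f in fields(entry):
        value = getattr(entry, f.name)
        width = WIDTHS[f.name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{f.name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


ENTRY_BITS = sum(WIDTHS.values())   # estimated entry length in bits
```

Under these assumed widths an entry is a little under 120 bits long, most of
it taken by the two partial microsequence fields.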
 
3.5.3.2 FINST Structure and Operation

The structure of the final station will not be presented in detail;
only the major registers and their uses are discussed and a few considerations
are offered on the output logic of FINST (i.e., the part that merges together
two queue entries and assembles the microsequences).
 
 The major registers of FINST are illustrated in Figure 17 and are 
 as follows: 
 
FFXi (i=1,2,3): address control FF (1 bit). FFXi = j means that in
the array, all Xi registers have addresses pointing into memory
module j (j=0,1). These flip-flops are automatically set by the CU
(i.e., the FINST) every time a microsequence is sent in which the
bit that controls gating into Xi is ON. The setting is based on the
contents of the CA field in that microsequence. Local modifications
(as in local indexing) of Xi must not change the module it points to.
A violation can easily be detected within each PE and causes an
interrupt (just monitor the carry from the address registers).
Besides the automatic setting, FFXi should also be settable by the
programmer for special applications.

FFCi (i=0,1): conflict FF (1 bit). These are the conflict flip-flops,
set either when the bus could not be assigned or when one or two
of the bus assignments are not used on a particular clock because
of bus conflicts or because the queue is empty.

BAi (i=0,1): bus assignment register (2 bits). When BAi = j, bus i
is assigned to queue j, j ∈ {0,1,2,3}.

BCi (i=0,1): bus counter (~6 bits). When BCi = j, there are j
microsequences left to be performed before the bus can be reassigned;
BCi = 0 means that bus i is idle.

CDBR: common data bus register (4 bits). This is the register where
values placed in CDB_out by the PE's are stored.

CDBRU: common data bus register use (1 bit). When equal to 1 it means
that CDBR is in use; i.e., a result placed in it has not been removed
by the CU and therefore CDBR cannot be reused before the CU frees it
by resetting CDBRU.
 
The FINST decision procedure is now described: at each clock, FINST
must decide to which of the four candidates the use of the PE buses will be
assigned. Once a request from a queue is granted, the next (C) requests from
that same queue must be obeyed before the bus can be reassigned (where (C) is
the contents of the counter field). This assures the microprogrammer that,
once control is obtained, it will be retained for a number of microsequences,
enabling the completion of a procedure before a new bus assignment destroys
needed data. Therefore, groups of microsequences that must be executed
sequentially, without interruption, are "linked" together by placing in the
counter field of the first queue entry the number of microsequences in the
group.
 
The FINST decision procedure is illustrated in Figure 18 by a
flow-graph. If a bus counter register in FINST is zero, the corresponding bus
is idle and an attempt is made to assign it. The order in which assignment
attempts are made is, in Figure 18: IOQ, CUQ, PEQ1, and PEQ2. This order
attempts to get the I/O done first. In an actual implementation, this
assignment hierarchy would probably be dynamic and selectable by the
programmer instead of fixed. Section 3.5.5 discusses a situation in which a
dynamic assignment hierarchy is required.
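
The assignment attempt for idle buses might be sketched as follows. This is
a simplified reading of the procedure, not the flow-graph itself: queue
entries are modeled as dictionaries with a counter "c" and an optional wait
event "wev" (None when the entry does not wait), the fixed hierarchy IOQ,
CUQ, PEQ1, PEQ2 is used, and the requirement that a memory transaction be
given one specific bus is ignored. All names are assumptions of this sketch.

```python
PRIORITY = [0, 1, 2, 3]   # queue numbers: IOQ, CUQ, PEQ1, PEQ2


def assign_idle_buses(bc, ba, queues, event_on):
    """Assign each idle bus (bc[i] == 0) to the first eligible queue.

    bc, ba: two-element lists holding the bus counters and bus assignments.
    queues: list of four lists of entries, head of each queue at index 0.
    event_on: list of booleans, one per event flip-flop.
    """
    for i in (0, 1):
        if bc[i] != 0:
            continue                      # bus i is still linked to a queue
        taken = {ba[k] for k in (0, 1) if bc[k] != 0}
        for j in PRIORITY:
            if j in taken or not queues[j]:
                continue                  # queue already served, or empty
            top = queues[j][0]
            if top["wev"] is not None and not event_on[top["wev"]]:
                continue                  # waiting queue counts as empty
            bc[i] = top["c"]              # next (C) requests stay together
            ba[i] = j
            break
    return bc, ba


queues = [[],                              # IOQ is empty
          [{"c": 2, "wev": None}],         # CUQ
          [{"c": 1, "wev": 7}],            # PEQ1 waits on event 7
          [{"c": 3, "wev": None}]]         # PEQ2
bc, ba = assign_idle_buses([0, 0], [0, 0], queues, [False] * 64)
```

In this example bus 0 goes to CUQ for two microsequences and bus 1 goes to
PEQ2 for three, because PEQ1 is waiting on an event that has not occurred.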
 
 The following observations should be made with respect to the flow- 
 graph in Figure 18: 
 
 - The notation (Top Queue j:C) means the contents of field C of the 
 entry at the top of Queue j . 
 
 - A queue is empty either when it is physically empty or when it is 
 flagged WAIT on an Event that has not occurred yet. 
 
 - There is a CAB or CDB conflict when the following expression
 (where TQi means top queue i) is true:

 (a) (TQ(BA0):CAU)=1 AND (TQ(BA1):CAU)=1 AND (TQ(BA0):CA)≠(TQ(BA1):CA)
 OR
 (b) (TQ(BA0):CDU)=1 AND (TQ(BA1):CDU)=1 AND (TQ(BA0):CD)≠(TQ(BA1):CD)
 OR
 (c) (TQ(BA0):CDR)=1 AND (TQ(BA1):CDR)=1
 OR
 (d) ((TQ(BA0):CDR)=1 OR (TQ(BA1):CDR)=1) AND CDBRU=1

 where the term (a) takes care of CAB conflicts, the term (b) detects CDB_in
 conflicts, the term (c) detects CDB_out conflicts, and the term (d) takes
 care of CDBR use conflict (i.e., CDBR has not yet been used after being set
 by a previous operation).

[Figure 18. FINST Action Flow-graph. For each bus i (i=0,1) and queue j
(j=0,1,2,3): an idle bus is assigned by setting BCi = (top queue j:C) and
BAi = j; on a conflict, FFCi is set to 1; the top entries of queues (BA0) and
(BA1) are merged into a PE μsequence, inhibited by the conflict flip-flops,
CAB and CDB are set as needed, and the μsequence is sent to the array;
finalization decrements BC0 and BC1, resets FFC0 and FFC1, pops the queues
used and causes events.]
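
The conflict expression can be written out directly as a predicate. The
sketch below assumes the two selected queue entries are given as dictionaries
keyed by lower-case field names; the function name is hypothetical.

```python
def cab_cdb_conflict(e0, e1, cdbru):
    """True when the entries selected by BA0 and BA1 cannot share a clock.

    e0, e1: top entries of the queues selected by BA0 and BA1, as dicts
            with the fields cau, ca, cdu, cd and cdr.
    cdbru:  1 while CDBR still holds a result the CU has not removed.
    """
    a = e0["cau"] == 1 and e1["cau"] == 1 and e0["ca"] != e1["ca"]
    b = e0["cdu"] == 1 and e1["cdu"] == 1 and e0["cd"] != e1["cd"]
    c = e0["cdr"] == 1 and e1["cdr"] == 1
    d = (e0["cdr"] == 1 or e1["cdr"] == 1) and cdbru == 1
    return a or b or c or d


idle = {"cau": 0, "ca": 0, "cdu": 0, "cd": 0, "cdr": 0}
same_address = cab_cdb_conflict(dict(idle, cau=1, ca=40),
                                dict(idle, cau=1, ca=40), 0)
different_address = cab_cdb_conflict(dict(idle, cau=1, ca=40),
                                     dict(idle, cau=1, ca=41), 0)
```

Note that two entries asking for the same common address do not conflict;
only differing values on the single CAB do.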
 
 It should be pointed out that the decision procedure outlined in 
 Figure 18 is only a basic algorithm. A few sophistications would have to be 
 introduced in an actual implementation; specifically: a) the procedure should 
 also be able to handle efficiently microsequences that do not require the use 
 of any bus, and b) the possibility of deadlock should be considered and steps 
 taken to avoid it. 
 
Figure 19 illustrates the part of FINST that merges the two selected
queue entries together and "assembles" the microsequence. Gate control
selects which of the four possible inputs to each bus is actually gated into
the bus; queue i is gated into the bus if i is the value of the expression
written in each gate control box. Briefly, the assembly procedure is as
follows: CDB_out is gated into CDBR if the CDR field of any of the two
selected queue entries is ON; CDB_in is set from the CD field of the selected
entry, if any, that has field CDU ON; CAB is obtained from the CA field of the
selected entry, if any, that has field CAU ON. Field μSC of the final
microsequence is the OR of these fields in the two selected entries. A check
for conflicts would be necessary at this point to make sure that the two μSC
fields are compatible to be OR'ed together; i.e., the actions determined by
one of the entries must not conflict with the actions determined by the other.
As explained previously, field μSBa appears twice in the microsequence, once
for each bus pair. Therefore, μSBa0 is obtained from the μSBa field of the
entry selected by BA0 and μSBa1 is obtained from the μSBa field of the entry
selected by BA1. Finally, field μSBb is simply taken out of field μSBb of the
entry selected by BA1. A conflict is also possible at this point: fields μSBb
of the two selected entries should yield a zero when AND'ed together, bit by
bit. If this is not the case, there is a conflict in the μSBb fields. It
should also be pointed out that every gate control box is inhibited by the
conflict flip-flops FFCi; i.e., when FFCi is ON, no field from the entry
selected by BAi is used in the assembly of the microsequence.

[Figure 19. Final Microsequence Assembly in FINST. The entries at the top of
queues 0 to 3 (IOQ, CUQ, PEQ1, PEQ2) pass through gate control boxes,
controlled by (BA0), (BA1) and the CAU, CDU and CDR fields of the selected
entries, to form the assembled microsequence: μSBa0, μSBa1, μSBb, μSC, CAB,
CDB_in, CDB_out.]
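
The assembly rules just described can be summarized by the sketch below. It
is a software paraphrase under stated assumptions, not the gate-level logic:
entries are dictionaries keyed by lower-case field names, an inhibited entry
is treated as an all-zero entry, and the compatibility check on the two μSC
fields is omitted because the text does not define it.

```python
EMPTY = {"usba": 0, "usbb": 0, "usc": 0,
         "cau": 0, "ca": 0, "cdu": 0, "cd": 0, "cdr": 0}


def assemble(e0, e1, ffc0=0, ffc1=0):
    """Merge the entries selected by BA0 (e0) and BA1 (e1).

    ffc0, ffc1: conflict flip-flops; when ON, the corresponding entry
    contributes nothing to the assembled microsequence.
    """
    e0 = EMPTY if ffc0 else e0
    e1 = EMPTY if ffc1 else e1
    if e0["usbb"] & e1["usbb"]:
        raise ValueError("conflict in the uSBb fields")
    return {
        "usba0": e0["usba"],                  # gating into bus pair 0
        "usba1": e1["usba"],                  # gating into bus pair 1
        "usbb": e1["usbb"],                   # taken from the BA1 entry
        "usc": e0["usc"] | e1["usc"],         # OR of bus-independent parts
        "cab": next((e["ca"] for e in (e0, e1) if e["cau"]), None),
        "cdb_in": next((e["cd"] for e in (e0, e1) if e["cdu"]), None),
        "load_cdbr": bool(e0["cdr"] or e1["cdr"]),
    }


e0 = dict(EMPTY, usba=0b101, usbb=0b0011, usc=0b01, cau=1, ca=100)
e1 = dict(EMPTY, usba=0b010, usbb=0b0100, usc=0b10, cdr=1)
merged = assemble(e0, e1)
```

Here None stands for a common bus that is simply not driven on this clock.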
 
3.5.4 The PE Instruction Processor
 
 The basic structure of the PE instruction processor is presented in 
 Figure 20. The components are: 
 
 a) A macro-instruction register (MIR) which holds the op code and 
 variant field of the macroinstruction being processed. This 
 register is initialized by IDU and is accessible to the micro- 
 processor to be used in controlling microprogram fetch and in 
 arithmetic and masking operations. 
 
 b) A microinstruction register (μIR) which holds the op code and
 addresses of the microinstruction being executed.

 c) A micro-memory (μM) which holds the microprograms.

 d) A PEIP busy flip-flop (PEIPB) which is turned ON by IDU when a
 macroinstruction is delivered to the microprocessor and is turned
 OFF by the PEIP logic when the last microinstruction of the
 macroinstruction has been processed. This signals IDU that the
 microprocessor is idle and ready to receive the next
 macroinstruction.

 [Figure 20. Basic PEIP Structure. The figure shows the local registers
 (~16 bits each) with their preferential use (1: MIR, holding op code and
 variant; 2 to 4: 1st, 2nd and 3rd address; 5 to 7: 1st, 2nd and 3rd
 address index; 8 and 9: precision of 1st and 2nd operand; 10: precision
 of result; 11: scratch), the PEIPB flip-flop, the μ instruction register
 (op code, reg, imm bit, 1st address, 2nd address), the micro memory μM,
 the subroutine stack (start address, return address, repeat count) and
 the arithmetic unit.]
 
 e) A subroutine push- down stack used in controlling execution of 
 subroutines by the microprocessor. Each entry in the stack 
 contains three fields: a start address field which holds the 
 address in which the subroutine starts; a return address field 
 which holds the address of the first instruction following the 
 subroutine; and a repeat count field containing the number of 
 times the subroutine is to be executed. 
 
 f) A group of local registers which is used to hold intermediate
 results in arithmetic operations. The contents of the local
 registers can be used in assembling the different fields of
 the partial microsequences to be fed into the PE queues: PEQ1
 and PEQ2. Finally, the local registers are also accessible to
 the IDU which initializes them with the instruction addresses
 and other instruction data. In this connection, MIR can be
 considered a local register and it is assigned local register
 number 1. The other local registers are numbered in sequence
 and they are accessed by their local register number. Sixteen
 local registers are proposed, each 12 to 16 bits long.

 g) An arithmetic unit capable of performing fixed-point operations
 on short words: 12 to 16 bits is enough. At least addition,
 subtraction and multiplication are available (integer division
 and modulo operations are also useful). The operands are
 either the contents of specified local registers or literals.
 The results are placed in a specified local register.
 
An arithmetic unit is needed to enable microprograms to accept
dynamically specified parameters such as word length, number of addresses,
etc., since it is obviously extremely inefficient to have one complete
microsequence stored for each small variant of a basic instruction.

This also determines the need for a number of relatively sophisticated
microinstructions; for example, subroutine calls. The suggested
microinstruction repertoire is presented in Table 8. This repertoire allows
very efficient microprograms with respect to μM use. It is assumed that the
microprocessor is fast enough to allow an average output of one partial
microsequence each 100 nsec. Fluctuations in this rate are absorbed by the
queues.
 
 As indicated in Figure 20, the microinstructions' format uses four 
 fields: op-code, local register number (LR), immediate bit (IMM), and two 
 addresses, Al and A2, each as long as a local register. The use of these 
 fields for each microinstruction is detailed in Table 8. The immediate bit 
 qualifies the first address; if IMM is ONE the first address contains an 
 immediate operand instead of a local register number. 
 
The partial microsequences are generated in pairs, assuming optimal
conditions; i.e., assuming that both buses will be available. The first
partial microsequence in each pair is placed in PEQ1 and the other one is
placed in PEQ2 so that if both buses are available they will be executed
simultaneously and if not they will be executed sequentially. Events are used
to coordinate the draining of the queues as needed. One extra bit in the
queues may be needed to signal a request for the simultaneous execution of a
partial microsequence from PEQ1 and one from PEQ2 as is required in a swap of
registers. This will also necessitate a change in assignment hierarchy or
else the array will idle for a long period waiting for both buses to become
available.

 Op Code
 (mnemonic)   Description

 CALL         Subroutine call; executes (A2) times the subroutine
              starting at μM address (A1).
 RETURN       Marks the end of a subroutine or the end of a microprogram.
 GOTO         Transfers control to the microinstruction in μM address (A2).
 IF           If (LR) masked by (A1) is all 1's then transfers control to
              the microinstruction in μM address A2.
 ADD          Add (A1) and (A2) and place the result in LR.
 SUB          Subtract (A1) from (A2) and place the result in LR.
 MULT         Multiply (A1) and (A2) and place the result in LR.
 μSEQ         Emit a partial microsequence to PEQ1 or PEQ2.

 Table 8. Microinstruction Repertoire
 
The microinstruction μSEQ must be able to "assemble" a partial
microsequence (placing in each field either a literal or the contents of a
specified local register) and place it either in PEQ1 or PEQ2.

Therefore, this microinstruction is unreasonably large and requires
about 100 bits of data. This shows the need for a microinstruction with a
variable number of bits (just as is the case of macroinstructions) to optimize
memory use since the μSEQ microinstruction takes so much more space than the
other microinstructions.
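
To make the role of the subroutine stack and repeat count concrete, a toy
interpreter for part of the repertoire of Table 8 is sketched below. It is a
deliberately simplified model with several assumptions that go beyond the
text: microinstructions are tuples (op, LR, A1, A2), CALL takes a literal
start address and a literal repeat count of at least 1, ADD takes two local
register numbers, and the emit instruction (written USEQ here) places the
contents of one local register in the named queue instead of assembling a
full partial microsequence.

```python
def run(program, regs, start=0):
    """Execute a microprogram; return the final registers and emitted items."""
    regs = dict(regs)
    stack = []       # entries: [start address, return address, repeat count]
    emitted = []     # (queue name, value) pairs sent to PEQ1 or PEQ2
    pc = start
    while True:
        op, lr, a1, a2 = program[pc]
        if op == "CALL":
            stack.append([a1, pc + 1, a2])
            pc = a1
        elif op == "RETURN":
            if not stack:
                return regs, emitted          # end of the microprogram
            entry = stack[-1]
            entry[2] -= 1
            if entry[2] > 0:
                pc = entry[0]                 # run the subroutine again
            else:
                stack.pop()
                pc = entry[1]                 # resume after the CALL
        elif op == "GOTO":
            pc = a2
        elif op == "ADD":
            regs[lr] = regs[a1] + regs[a2]
            pc += 1
        elif op == "USEQ":
            emitted.append((a1, regs[lr]))
            pc += 1
        else:
            raise ValueError(f"unsupported op code {op}")


program = [
    ("CALL", None, 3, 2),          # 0: run the subroutine at 3 twice
    ("USEQ", 2, "PEQ2", None),     # 1: emit the final address
    ("RETURN", None, None, None),  # 2: end of microprogram
    ("ADD", 2, 2, 3),              # 3: step the address by the word length
    ("USEQ", 2, "PEQ1", None),     # 4: emit one partial microsequence
    ("RETURN", None, None, None),  # 5: end of subroutine
]
regs, emitted = run(program, {2: 100, 3: 4})
```

The loop body is stored once in μM and repeated by the stack's repeat count,
which is the economy in micromemory use that the text argues for.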
 
3.5.5 IDU and Instruction Format

Central indexing is decoded and performed by the IDU which hands the
resulting addresses to the three instruction processors. The detailed
instruction format is illustrated in Figure 21. Instructions are composed of a
variable number of "chunks," each 12 to 16 bits long. A chunk may be an
address, an op code or some other type of data. The smallest instruction
contains only two chunks: IDU information and op code.
 
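[Figure 21. Detailed Instruction Format. First chunk (IDU information):
indexed addresses, instruction type, # of chunks, # of addresses. Second
chunk: variant and op code. Following chunks: addresses and other chunks.
The total variant field comprises the variant together with the # of
addresses and # of chunks fields.]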
 
The four fields in the first chunk (11 bits) contain information
used by IDU:

 a) The instruction type field, with 2 bits, indicates whether the
 instruction is a CU, IO or PE instruction, enabling IDU to send
 the instruction to the appropriate processor.

 b) The indexed addresses field has 3 bits. If bit i is on, then
 the i-th address is to be indexed. The following convention is
 adopted for the order in which base addresses and index addresses
 are presented:

 third chunk: first base address

 fourth chunk: if first address is indexed, then it is the
 address index for the first address, else it
 is the second base address.

 etc.

 c) The number of addresses field indicates how many of the chunks
 following the first two are addresses.

 d) The number of chunks field gives the total length of the
 instruction.
 
 These last two fields are also sent as part of the variant field since they are 
 needed by the processors. 
 
 IDU places an instruction in an instruction processor as follows: 
 initialize instruction register with op code and total variant field; initial- 
 ize the three first local registers with the addresses, but do not change a 
 register to which an address was not given in the present instruction; then 
 initialize the next local registers with the extra chunks in the order given-- 
 the instruction processor decides what to do with them. 
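
The chunk ordering convention can be illustrated by the following sketch,
which separates the chunks after the first two into (base, index) pairs and
extra chunks. The function name is hypothetical, and one point is an
assumption of this sketch: the number of addresses field is taken to count
index chunks as well as base address chunks.

```python
def split_addresses(indexed, n_address_chunks, chunks):
    """Pair base addresses with their index chunks.

    indexed: three booleans; indexed[i] is True if address i+1 is indexed.
    n_address_chunks: how many of `chunks` are addresses (bases and indexes).
    chunks: the chunks that follow the IDU information and op code chunks.
    Returns (list of (base, index or None) pairs, list of extra chunks).
    """
    address_chunks = chunks[:n_address_chunks]
    extras = chunks[n_address_chunks:]
    pairs = []
    pos = 0
    while pos < len(address_chunks):
        base = address_chunks[pos]
        pos += 1
        index = None
        if len(pairs) < len(indexed) and indexed[len(pairs)]:
            index = address_chunks[pos]   # the chunk after an indexed base
            pos += 1
        pairs.append((base, index))
    return pairs, extras


pairs, extras = split_addresses([True, False, False], 3,
                                [0x100, 0x005, 0x200, 0x007])
```

In the example the first address is indexed, so the fourth chunk of the
instruction is its index and the fifth chunk is the second base address.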
 
3.6 Mass Memory

A survey was conducted on the state of the art of mass storage
systems including bulk magnetic core, fast disks, fast drums and semiconductor
memories. Fast magnetic drum (at one-half cent per bit) or disk (as low as
one-twentieth of one cent per bit) could be used as the mass memory since they
have a significant price advantage over the other two systems. However, being
cyclic, these systems would introduce synchronization problems and/or latency
time waits. Therefore, while disks are still being considered as a possible
very-large-capacity back-up for mass memory, the choice for the actual mass
memory is a random access system: bulk core or semiconductor.
 
CDC bulk core model 6636 was picked as a sample of what is now
available. Its characteristics are:

 - 7.5 million bits per module

 - the maximum number of modules is four

 - cycle time: 3.2 μsec; access time: 1.6 μsec

 - up to four modules can be interleaved

 - the transfer rate is 25 to 100 million 6-bit chars per second

 - it fetches in long words of 480 bits

 - its cost is approximately three cents per bit.
 
 It is expected that in the near future, price of bulk core will 
 drop to below one cent per bit. Assuming the availability of units of this 
 price and with cycle times as above, a unit fetching in 512-bit words could 
 be used as SPEAC's mass memory. 
 
As for semiconductor memories, the main advantage core has over any
semiconductor type is that it is non-volatile. Semiconductor memories
are already available for less than three cents per bit although the price
always goes up for special configurations like the long word that is needed in
SPEAC's mass memory. Since semiconductor is so much faster than bulk core, one
might attempt to multiplex a narrower word but faster semiconductor memory to
achieve the desired word length and access time. In addition, a large memory
of shift registers might be considered. A special design would be easier to
achieve and control can be maintained over synchronization and latency
problems.
 
Therefore, mass memory will be a random-access unit: bulk core or
semiconductor, depending on economic considerations. It is assumed that
several modules of mass memory will be overlapped under the control of the mass
memory interchange (MMI) so that conflicts between mass memory access requests
from different sources will be infrequent. An average cycle time of 2 μsec
(1 μsec access time) for the mass memory has been assumed in all timing
estimates.
 
3.7 I/O Buffer Register

The structure of the I/O buffer register (IOBR) is illustrated in
Figure 22.

[Figure 22. I/O Buffer Register Structure. The figure shows the two halves of
IOBR with paths to and from mass memory, to row gating, and from row gating.]
 
 The register is divided in two parts: a right part (IOBRr) and a
 left part (IOBRl). Each part is as long as a mass memory word: 512 bits.
 IOBRr is connected to the mass memory and is the actual buffer register; it can
 also receive data from the row gating (128 hexadecimal digits, one from each
 PE in a PE row). IOBRl is needed to achieve routing capability in SPEAC; it
 can send data to the row gating. IOBR as a whole can be shifted end around,
 left or right in 4-bit (one hexadecimal digit) increments. In order to achieve
 good routing speed, it is vital that IOBR can be shifted by any distance (from
 1 to 127 digits) in only a few clock periods. This poses an interesting
 minimization problem: how many direct shift paths should be implemented in
 order to obtain any shift in a given number of clocks? Also, a few distances
 are especially important and the corresponding shifts should be particularly
 fast; this is the case with powers of two since routes by a power of two
 appear much more frequently than other routing distances as they are used in
 log-sums, Fast Fourier transforms, etc. Finally, there is the important economic
 restriction of keeping the number of direct shift paths at a minimum since
 for each path one needs roughly one gate per bit and there are 1024 bits in
 IOBR. It was decided that a minimum of 7 direct shift paths are needed with
 the following direct shift distances: 128 left (this is vital to the
 operation of both I/O's and routes), 1 right and left, 32 right and left, and 8
 (or 4) right and left. This scheme enables one to perform any shift in not
 more than 7 clocks. The worst case is distance 52 (50 if one uses 4 instead of
 8). Moreover, shifts by a power of two take not more than 4 clocks and most
 take only one or two. At a cost of 2K more gates, one could implement 9
 direct paths (128 left, 1 left, 1 right, 2 left, 2 right, 8 left, 8 right, 32
 left, and 32 right) for a worst case shift of 5 clocks.
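
 The minimization problem described above can be explored with a short
 search. The sketch below is not taken from the report: it assumes that IOBR
 behaves as a 256-digit end-around register, that every direct path costs one
 clock, and it uses a breadth-first search (the names min_shift_clocks and
 SEVEN_PATHS_4 are illustrative only) to find the fewest elementary shifts for
 each net distance.

```python
from collections import deque

REGISTER_DIGITS = 256  # IOBR holds 1024 bits = 256 hexadecimal digits


def min_shift_clocks(paths, size=REGISTER_DIGITS):
    """Breadth-first search over net end-around shift distances.

    `paths` lists the signed direct shift distances (left positive,
    right negative); each direct shift is assumed to cost one clock.
    Entry d of the returned list is the minimum number of clocks for a
    net left shift of d digits (None if the distance is unreachable).
    """
    clocks = [None] * size
    clocks[0] = 0
    queue = deque([0])
    while queue:
        here = queue.popleft()
        for step in paths:
            there = (here + step) % size
            if clocks[there] is None:
                clocks[there] = clocks[here] + 1
                queue.append(there)
    return clocks


# Seven-path scheme with 4 as the intermediate distance.
SEVEN_PATHS_4 = [128, 1, -1, 4, -4, 32, -32]
# Seven-path scheme with 8 as the intermediate distance.
SEVEN_PATHS_8 = [128, 1, -1, 8, -8, 32, -32]
# Nine-path alternative mentioned in the text.
NINE_PATHS = [128, 1, -1, 2, -2, 8, -8, 32, -32]

if __name__ == "__main__":
    counts = min_shift_clocks(SEVEN_PATHS_4)
    worst = max(range(1, 65), key=lambda d: counts[d])
    print("worst distance in 1..64:", worst, "clocks:", counts[worst])
```

 Passing NINE_PATHS instead of SEVEN_PATHS_4 gives the corresponding counts
 for the nine-path alternative, so the gate cost of extra paths can be weighed
 against the clocks saved.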
 
 It is assumed for the remainder of the paper that 7 paths were
 implemented. This represents an investment of about 12 K gates in IOBR which is
 a reasonable price to pay to achieve routing and I/O buffering for the whole
 machine. Table 9 presents the number of elementary shifts needed to shift a
 number by any distance from 1 to 64 when the direct paths are: 128 left,
 1 left, 1 right, 32 left, 32 right, 4 left, and 4 right.

   A*  B*     A   B     A   B     A   B     A   B     A   B     A   B     A   B
   --  --    --  --    --  --    --  --    --  --    --  --    --  --    --  --
    1   1     9   3    17   5    25   4    33   2    41   4    49   6    57   5
    2   2    10   4    18   6    26   4    34   3    42   5    50   7    58   5
    3   2    11   4    19   5    27   3    35   3    43   5    51   6    59   4
    4   1    12   3    20   4    28   2    36   2    44   4    52   5    60   3
    5   2    13   4    21   5    29   3    37   3    45   5    53   6    61   4
    6   3    14   5    22   5    30   3    38   4    46   6    54   6    62   4
    7   3    15   5    23   4    31   2    39   4    47   6    55   5    63   3
    8   2    16   4    24   3    32   1    40   3    48   5    56   4    64   2

 *A - shift distance
 *B - number of elementary shifts

 Table 9. Number of Elementary Shifts for Each Shifting Distance
 
 4. SPEAC's OPERATION

 4.1 Generalities - Data Format
 
 The algorithms used in performing the most important instructions
 will be outlined in this section and timing estimates will be presented. The
 timing is based only on a count of the PE clocks necessary to perform the
 instruction; no CU delays were taken into account. Therefore, the estimates
 neglect CU instruction fetching, decoding and central indexing times. Also
 neglected is the time taken by the CU to execute microprogram control
 instructions; i.e., microinstructions that do not generate microsequences. These
 approximations are justified by the assumption that the CU is, on the average,
 faster than the PE's (the CU clock rate is about twice the PE clock rate) and the
 queues ensure that the PE's will not have to wait for the CU.
 
 The timings are also a function of how much overlap is possible
 when the instruction is executed; i.e., how many buses are available for the
 PE instruction use. This factor depends on the assignment hierarchy used by
 FINST, on the location of the operands in PEM and on how much I/O is taking
 place when the instruction is executed. In the timings, at least one bus is
 assumed always available for PE instructions (or else the worst case times
 would obviously be infinite). Sometimes two timings are given: the "normal"
 one, with only one bus available, and the "optimum" timing, assuming maximum
 overlap (two buses are available). CDB and CAB bus conflicts are also a
 possible cause of delays which were not taken into account in the times since
 they depend on how much I/O is going on. However, these delays are expected to
 be negligible in a PE with three address registers.
 
 As discussed in Section 3-1 - g> the machine accepts any word format, 
 
 
 since there is nothing in the hardware to "freeze" the data representation. 
 Of course, adequate microprograms must be written to deal with a desired word 
 format. 
 
 An arbitrary (and quite conventional) format for floating-point
 numbers was chosen and used in the timings. This representation will be called
 the "standard format" and is as follows: a number appears in PEM as indicated
 in Figure 23.

 | e_{n_e} | ... | e_1 | e_0 | m_{n_m} | ... | m_1 | m_0 |

 Figure 23. Standard Floating-Point Format
 
 Each PEM location contains one hexadecimal digit. The location of
 m_0 is the low memory address. There are N_e = n_e + 1 exponent digits and
 N_m = n_m + 1 mantissa digits. The mantissa is in sign and magnitude and the
 sign is in bit e_00; i.e., the low order bit of the LSD of the exponent.
 Therefore the exponent has 4N_e - 1 bits since one bit of the exponent is used
 for the mantissa sign. The exponent base is 16 and the exponent is represented
 in excess notation. The number A represented in Figure 23 has a value given by:

 \[ A = (-1)^{e_{00}} \times \left( m_{n_m} 2^{-4} + \dots + m_1 2^{(-4) n_m} + m_0 2^{(-4)(n_m+1)} \right) \times 16^{\,E - \mathrm{excess}} \]

 where E, the exponent, is given by:

 \[ E = e_{01} + e_{02}\,2 + e_{03}\,2^2 + e_{10}\,2^3 + \dots + e_{n_e 3}\,2^{4n_e+2} \]

 If a floating-point number is normalized, m_{n_m} ≠ 0; i.e., at least
 one of the four bits of m_{n_m} is one.
 
 A particularly important length for a floating-point number is 32
 bits, which was often taken as the standard floating-point number in this
 section. A 32-bit floating-point number has one mantissa sign bit, a base
 16 exponent with 7 bits and a 24-bit mantissa.
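
 As an illustration of the standard format, the following sketch decodes a
 number held as a list of hexadecimal digits, low PEM address first. It is only
 a reading of the description above: the function name decode_standard is
 hypothetical, and the default excess of half the exponent range (64 for the
 7-bit exponent) is an assumption, since the report's exact excess constant is
 not reproduced here.

```python
def decode_standard(digits, n_exp_digits, excess=None):
    """Decode a number stored as hexadecimal digits, low PEM address first.

    Assumed layout (after Figure 23): mantissa digits m_0 .. m_nm, then
    exponent digits e_0 .. e_ne.  Bit 0 of e_0 is the mantissa sign.
    `excess` is the exponent bias; defaulting to half the exponent
    range is an assumption of this sketch.
    """
    mantissa = digits[:-n_exp_digits]
    exponent = digits[-n_exp_digits:]
    sign = exponent[0] & 1
    # Pack the exponent digits into one integer, then drop the sign bit.
    packed = 0
    for position, digit in enumerate(exponent):
        packed |= digit << (4 * position)
    e_value = packed >> 1
    if excess is None:
        excess = 1 << (4 * n_exp_digits - 2)
    # The high-order mantissa digit has weight 16**-1, m_0 has 16**-N_m.
    n_digits = len(mantissa)
    fraction = sum(d * 16.0 ** -(n_digits - i) for i, d in enumerate(mantissa))
    return (-1) ** sign * fraction * 16.0 ** (e_value - excess)


if __name__ == "__main__":
    # 32-bit case: six mantissa digits followed by two exponent digits.
    print(decode_standard([0, 0, 0, 0, 0, 1, 2, 8], n_exp_digits=2))
```

 Under these assumptions the digit list in the usage line has high-order
 mantissa digit 1, sign bit 0 and exponent field 65, i.e. (1/16) x 16^1.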
 
 4.2 Local Indexing

 Operand addresses are sent to one of the PE address registers via
 CAB. Only one clock is needed to transmit an address in this fashion. Then,
 if required, any address may be locally indexed at a maximum cost of 1.6 µsec
 (16 PE clocks) per indexing.

 The microsequence to perform local indexing is presented in Table
 10; the notation is explained in the introduction of Appendix B. It is
 assumed that the address to be indexed (x_2 x_1 x_0) is loaded in X_1 and the
 index is i_2 i_1 i_0.
 
 In conclusion, local indexing is relatively fast (about 7% of 
 the time for a 32-bit floating-point multiplication) and the procedure does 
 not penalize the users that do not need it since it is performed only when 
 the instruction variant field is adequately set. Also, the microsequence 
 presented can be significantly speeded up if one knows that the index is less 
 than three hexadecimal digits long. 
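
 The digit-serial character of local indexing can be sketched in a few
 lines. The code below is an illustration rather than the report's
 microsequence: it assumes a three-digit (12-bit) address, models the carry
 flip-flop as an ordinary variable, and reports a final carry as the overflow
 that would cause the interrupt. The name local_index is hypothetical.

```python
def local_index(address, index, digits=3):
    """Add `index` to `address` one hexadecimal digit at a time.

    The carry out of each digit is held in a flip-flop (lcFF4 in the
    text) and fed into the next digit.  Returns (sum, overflow).
    """
    carry = 0      # plays the role of lcFF4; the first input carry is zero
    result = 0
    for position in range(digits):
        x_digit = (address >> (4 * position)) & 0xF
        i_digit = (index >> (4 * position)) & 0xF
        total = x_digit + i_digit + carry
        result |= (total & 0xF) << (4 * position)
        carry = total >> 4
    return result, bool(carry)   # a final carry would raise the interrupt


if __name__ == "__main__":
    print(local_index(0x123, 0x0FF))
    print(local_index(0xFFF, 0x001))
```

 If the index is known to be shorter than three digits, the loop can stop
 early once the remaining index digits are zero and no carry is pending, which
 corresponds to the speed-up mentioned above.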
 
 4.3 Multiplication

 Two mantissas A and B, each with N hexadecimal digits, are to be
 multiplied. Using the notation of expressions 1 and 2 in Section 3.2, the
 following steps are performed:

 1) load a_0 from memory into register B

 2) load b_0 from memory into register A_r

 3) set to zero the remainder of register A; i.e., A_c and A_m

 4) multiply a_0 and b_0 using four "add and shift" commands. At the
 Step  Microsequence                                           Comments
 ----  ------------------------------------------------------  -------------------------------------
   1   X_2 <- CAB (address (i_0))                              Put in X_2 the address of the index
   2   A_c <- X_1                                              Transfer address to be indexed to A_c
   3   B <- PEM (X_2); shift A right 4                         Fetch i_0 and place x_0 in A_m
   4   Wait for PEM fetch
   5   Wait for PEM fetch
   6   A_r <- (B+A_m); C_n = 0; lcFF4 <- C_n+1; Incr X_2       Add i_0 and x_0 and place in A_r
   7   B <- PEM (X_2); shift A right 4                         Fetch i_1 and place x_1 in A_m
   8   Wait for PEM fetch
   9   Wait for PEM fetch
  10   A_r <- (B+A_m); C_n = lcFF4; lcFF4 <- C_n+1; Incr X_2   Add i_1 and x_1 and place in A_r
  11   B <- PEM (X_2); shift A right 4                         Fetch i_2 and place x_2 in A_m
  12   Wait for PEM fetch
  13   Wait for PEM fetch
  14   A_r <- (B+A_m); C_n = lcFF4; lcFF4 <- C_n+1             Add i_2 and x_2; shifting A will place
  15   Shift A right 4; interrupt on lcFF4 ON                  x+i in A_c from which it is returned
  16   X_1 <- A_c                                              to X_1; an overflow in the indexing
                                                               causes an interrupt.

 Table 10. Microsequence for Local Indexing
 
 end of this step, A_m and A_r will contain the two-digit product
 of a_0 and b_0; b_0 was destroyed and A_r now contains m_0

 5) if a double precision product is desired, store m_0 = (A_r) into
 memory; skip this step if a single precision product is to be
 obtained

 6) increment by 1 the contents of register X_2 (it is assumed that
 initially X_1 contains the address of a_0 and X_2 contains the
 address of b_0). Therefore, X_2 now contains the address of b_1

 7) load b_1 from memory into reg A_r

 8) multiply b_1 (in reg A_r) and a_0 (in reg B) as described in step
 4; note that the "carry" of the previous multiplication is
 automatically added to the product

 9) increment X_1 by 1, decrement X_2 by 1

 10) shift register A left 4 bits which vacates A_r

 11) reload register A_r and B; multiply

 12) A_r now contains m_1, which can be stored or discarded

 And so on, following the algorithm of Section 3.2. To determine
 digit m_i of the product (i ≤ n), the cycle: [increment X_1, decrement X_2, load
 B, load A_r, multiply, shift left 4] is repeated i+1 times. On the first cycle
 only one increment-load is performed and on the last cycle there is no shift
 left 4. It has already been mentioned that in single precision, product m_n is
 the first digit of the product which is not discarded and it can be stored in
 b_0's position (or a_0's). If at the end of the multiplication m_{2n+1} does not
 equal zero, a normalization is needed; each hexadecimal digit is read and
 restored shifted right. Therefore, m_n is discarded, m_{n+1} becomes the low
 order digit and m_{2n+1} the high order one.
 
 This algorithm is general and can handle mantissas with any number 
 N of digits. The introduction of the scratchpad memory, however, results in a 
 remarkable improvement in the procedure, especially for N not greater than 16 
 (which includes most practical applications). 
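
 The digit-by-digit structure of the product can be sketched as follows.
 This is not the report's microsequence and ignores the register-level detail
 (A_r, A_m, the address registers); it only shows, under the assumption that
 product digit m_i is the sum of all a_j b_k with j + k = i plus the carry from
 the previous column, how the 2N product digits emerge low-order first. The
 helper names are hypothetical.

```python
def to_digits(value, n):
    """Split an integer into n hexadecimal digits, low-order digit first."""
    return [(value >> (4 * i)) & 0xF for i in range(n)]


def from_digits(digits):
    """Inverse of to_digits."""
    return sum(d << (4 * i) for i, d in enumerate(digits))


def multiply_mantissas(a_digits, b_digits):
    """Multiply two N-digit hexadecimal mantissas, low-order digit first.

    Product digit m_i accumulates every digit product a_j * b_k with
    j + k = i on top of the carry left over from the previous column.
    Returns the 2N product digits m_0 .. m_{2N-1}.
    """
    n = len(a_digits)
    assert len(b_digits) == n
    product = []
    accumulator = 0                          # running column sum plus carry
    for i in range(2 * n):
        for j in range(n):
            k = i - j
            if 0 <= k < n:
                accumulator += a_digits[j] * b_digits[k]
        product.append(accumulator & 0xF)    # digit m_i
        accumulator >>= 4                    # carry into column i + 1
    return product


if __name__ == "__main__":
    digits = multiply_mantissas(to_digits(0xFF, 2), to_digits(0xFF, 2))
    print([hex(d) for d in digits])
```

 In single precision only the upper N digits are kept, which is why the low
 columns contribute nothing but their carries to the stored result.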
 
 The method consists of overlapping a multiplication of two digits
 with the fetch of a third digit which is temporarily stored in sM and will be
 used in a subsequent multiplication. This is always possible because the "add
 and shift" command used in multiplication does not need any PE bus; a bus is
 thus left available for the fetch. Since the multiplication takes 4 clocks
 and a fetch only 3, there is still time to increment the address register used
 (preparing for the next fetch) and to reload B concurrently with its last use
 in an "add and shift." A fifth clock is required to reload A_r and a sixth if
 it is necessary to store A_r in sM before reloading A_r. The procedure described
 is listed in Appendix B, note a, under the name MF (for multiply and fetch).
 It should also be pointed out that the result is now first stored in sM and
 only after normalization is written in PEM, which avoids the relatively slow
 process of rereading and restoring in PEM only to normalize.
 
 The time required to multiply two mantissas (each N digits long) can
 now be estimated: N^2 executions of MF are required, taking 5 clocks each; N
 product digits must be stored, which takes 6 clocks per digit (see function
 ST in Appendix B, note d) and N more clocks to store the product temporarily
 in sM. Finally, about 13 clocks are necessary for initialization and control.
 Therefore:

 \[ T_m \approx 5N^2 + 7N + 13, \quad N \le 16 \tag{1} \]

 where T_m is the time for mantissa multiplication in clocks. Since each clock
 takes 100 nsec, for N = 8 a T_m of about 40 µsec is obtained.
 
 4.3.1 Floating-point Multiplication

 The algorithm is relatively simple: initially the mantissas are
 multiplied as described previously and the normalized single precision product
 is stored. A_m is left with a 1 if normalization was performed (i.e., if
 m_{2n+1} = 0) and with a 0 otherwise. A_m is then subtracted from the first
 exponent and the second exponent is added to the difference, which yields the
 exponent of the result. Five extra clocks are needed to detect exponent
 overflow or underflow and to recode the exponent of the result in excess
 representation as explained in Appendix B, note f. The sign of the result is
 obtained from the exclusive-OR of the signs of the factors.
 
 Timing estimate: for two floating-point numbers with N_m digits in
 the mantissa and N_e digits in the exponent, the mantissa product will take
 (from (1)) about 5N_m^2 + 7N_m + 13 clocks; exponent manipulation takes about 4
 clocks per digit plus 6 clocks per digit for storage and about 5 clocks for
 control. The final expression is:

 \[ T_{fpm} = 5N_m^2 + 7N_m + 10N_e + 18, \quad N_e + N_m \le 16 \tag{2} \]

 where T_fpm is the time for floating-point multiplication in clocks.

 For the "standard" 32-bit floating-point number, N_e = 2 and N_m = 6
 which yields T_fpm = 26 µsec. For this case, the precise microsequence is
 presented in Appendix B and the results obtained are as follows: normal time =
 25 µsec; optimum time = 24 µsec. Two 64-bit floating-point numbers (N_m = 12,
 N_e = 4) can be multiplied in about 86 µsec.
 
 It should be remarked that the algorithm illustrated obtains the
 single precision product by truncation of the double precision product. If
 simple truncation is not satisfactory and rounding is to be performed, then a
 small addition is needed in the microsequence. This is not too time consuming,
 however.
 
 4.4 Addition and Subtraction

 Unsigned addition or subtraction is quite straightforward and can
 be performed in the following steps:

 1) load from PEM address (X_1) into register B.

 2) load from PEM address (X_2) into register A_m.

 3) add or subtract using input carry (C_n) zero (one in subtraction)
 for the first cycle and C_n = lcFF4 for the remaining cycles. Also,
 at each cycle, lcFF4 stores the output carry C_{n+1}. Therefore,
 at every cycle after the first one, lcFF4 contains C_{n+1} from the
 previous step.

 4) increment X_1 and X_2 by 1.

 5) go to step 1.

 On the last cycle, lcFF4 is gated to the interrupt wire since lcFF4
 ON (OFF in subtraction) at this point indicates an overflow.
 
 Timing estimate: for two unsigned fixed-point numbers with N digits
 each, one needs, per digit, 6 clocks to fetch the two operands, 1 clock to add
 and 5 clocks to write the result in PEM. Therefore:

 \[ T_a \approx 12N \tag{3} \]

 where T_a is the time for unsigned addition or subtraction in clocks. Thus,
 T_a ≈ 10 µsec for N = 8 digits.
 
 4.4.1 Signed Addition and Subtraction

 There are several different ways to perform signed addition and
 subtraction. Signed numbers can be stored in PEM either in a complement form
 or as sign and magnitude. The latter seems to be preferable since it speeds up
 multiplication, although it slows addition.

 To add two signed numbers represented in sign and magnitude notation,
 it is necessary first to compare the signs. The result of the comparison is
 stored in lcFF1 which will be ON if the signs are equal, OFF otherwise. lcFF1
 is then used to control whether an addition or a subtraction is actually
 performed. The two numbers are then added (or subtracted, if lcFF1 is OFF) and
 the final output carry, which is stored in lcFF4, is analyzed to determine the
 sign of the result, whether recomplementation is needed or not and if there
 was overflow. The rules are presented in Appendix C, note f.
 
 Signed addition (or subtraction) takes 6 clocks per digit to fetch
 the two operands, one clock to add/subtract, one clock to temporarily store
 the result in sM (assuming N ≤ 16, this is possible and speeds up
 recomplementation considerably), 2 clocks per digit to recomplement (PE's in
 which this operation is not needed are disabled), 6 clocks per digit to store
 the result in PEM and about 10 clocks for control and sign manipulations.
 Therefore,

 \[ T_{sa} = 16N + 10, \quad N \le 16 \tag{4} \]

 where T_sa is the time for signed addition or subtraction in clocks; for N = 8,
 T_sa ≈ 14 µsec.
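
 The sign-and-magnitude procedure can be illustrated with a small sketch.
 The carry rules used below are the conventional ones for sign-and-magnitude
 arithmetic and merely stand in for the rules of Appendix C, note f, which are
 not reproduced in this section; the function name signed_add is hypothetical.

```python
def signed_add(sign_a, mag_a, sign_b, mag_b, digits):
    """Add two sign-and-magnitude numbers of `digits` hexadecimal digits.

    Signs are 0 (plus) or 1 (minus).  Returns (sign, magnitude, overflow).
    The comparison of the signs corresponds to lcFF1 in the text and the
    final carry to lcFF4; the rules themselves are an assumption here.
    """
    modulus = 1 << (4 * digits)
    if sign_a == sign_b:                  # lcFF1 ON: a true addition
        total = mag_a + mag_b
        return sign_a, total % modulus, total >= modulus
    # lcFF1 OFF: subtract by adding the complement with input carry one.
    total = mag_a + (modulus - 1 - mag_b) + 1
    if total >= modulus:                  # final carry ON: |a| >= |b|
        return sign_a, total - modulus, False
    # No final carry: the difference is in complement form, so it is
    # recomplemented and takes the sign of the second operand.
    return sign_b, modulus - total, False


if __name__ == "__main__":
    print(signed_add(0, 5, 1, 3, digits=2))   # (+5) + (-3)
    print(signed_add(0, 3, 1, 5, digits=2))   # (+3) + (-5)
```

 Only the second call needs the recomplementation branch, which mirrors the
 remark that PE's not requiring recomplementation are disabled during that
 phase.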
 
 4.4.2 Floating-point Addition and Subtraction
 
 The algorithm is quite complex and can be divided into six distinct 
 phases: a) exponent comparison, b) exponent subtraction, c) hexadecimal point 
 alignment, d) mantissa addition, e) recomplementation, and f) normalization. 
 
 The basic steps are the following: 
 
 1) Set up X_1 and X_2 with the addresses of the two exponents.

 2) Fetch the exponents (storing them temporarily in scratchpad
 memory to avoid subsequent fetches) and compare them, exchanging
 the exponents and addresses in PE's with the "wrong" order so
 that all PE's will have in X the address of the number with
 the larger exponent.

 3) Compute the difference d of the exponents and add it to address
 X, thus performing hexadecimal point alignment.

 4) Set up A_c with (FFF-N_m+1) via CAB and add d, which prepares in A_c
 a trap that will overflow when N_m-d+1 is added to it; this will
 indicate that all valid digits of the smaller operand have been
 used and zeros must be substituted for the remaining digits.

 5) Perform the actual addition following the algorithm described in
 the previous section with one extra step: after loading B from
 PEM, B is zeroed if a carry has already occurred in A_c. A_c
 contains initially the trap described in step 4 and is incremented
 by one as each pair of digits is added. lcFF2 is used to store
 the first carry from A_c. The sum is temporarily stored in sM
 for possible recomplementation and normalization before it is
 finally stored in PEM.

 6) The final carry is analyzed to determine if there is a need for
 recomplementation or if an "overflow" occurred; i.e., if one
 extra MSD containing a ONE should be added to the mantissa. The
 rules are presented in Appendix C, note f.

 7) Recomplementation is performed; only PE's in which this operation
 is necessary are enabled. The recomplemented result goes back
 to sM.

 8) X and X_2 are used as counters: X is initialized to FFF (all
 ones) and X_2 is initialized with the larger exponent. Then both
 registers are decremented by one for each leading zero in the
 mantissa of the result. Therefore, at the end of the process X_2
 will contain the exponent of the result and X will contain a
 trap to be used in A_c in the next step.

 9) The mantissa of the result is written in PEM using X to store
 the address of the result in PEM and X to store the address of
 the result in sM. The mantissa is written from LSD to MSD and
 the trap in A_c is used to write initially as many trailing zeros
 as there were leading zeros before normalization.

 10) The exponent is written from sM into PEM.
 Timing estimate: since the procedure is so complex, it is quite
 difficult to obtain a precise formula for the number of clocks in addition.
 As a rough estimate, it takes for each pair of mantissa digits: 9 clocks to
 add, 2 clocks to recomplement, 2 clocks to count leading zeros and 8 clocks
 to write in PEM; for each pair of exponent digits: 3 clocks to compare, 3
 clocks to subtract and 6 clocks to write in PEM. Adding about 50 clocks for
 control, sign manipulation and other housekeeping actions, the final
 expression is:

 \[ T_{fpa} = 21N_m + 12N_e + 50, \quad N_e + N_m \le 16 \tag{5} \]

 where T_fpa is the time for floating-point addition in clocks, N_m is the
 number of digits in the mantissa and N_e is the number of digits in the
 exponent. Thirty-two bit floating-point numbers with N_m = 6 and N_e = 2 take
 about 20 µsec to add. For this case, a precise microsequence is presented in
 Appendix C and the results are: normal time = 21 µsec, optimum time = 19 µsec.
 Two 64-bit floating-point numbers can be added in about 35 µsec.
 
 4.5 Other Operations

 A few other important operations are now considered and a quick
 sketch is presented describing how they would be performed in SPEAC.

 4.5.1 Division
 
 This operation has not been considered in detail and while it is
 probably possible to design a sophisticated division algorithm that will use
 the PE very efficiently, this will take considerable research. On the other
 hand, even a very straightforward restoring division algorithm can be
 performed in an acceptable time. For N_m ≤ 8, the two mantissas can be stored
 in sM; then the divisor is repeatedly subtracted from the dividend until a final
 borrow results and disables the PE. This is performed a maximum of 15 times;
 then all PE's add the divisor to the remainder to restore a positive remainder.
 The number of subtractions is counted in A. Each subtraction takes only 2
 clocks per digit once the operands are in sM. Therefore, it takes at least
 32N_m clocks to determine each digit of the quotient. Adding about 3 extra
 clocks per subtraction for control, one obtains the following rough timing
 estimate for mantissa division:

 \[ T_d \approx 32N_m^2 + 50N_m, \quad N_m \le 8 \tag{6} \]

 This yields about 130 µsec for 24-bit mantissa division and not more than 140
 µsec for 32-bit floating-point division. The ratio of about six between
 floating-point division and floating-point multiplication times is adequate
 for this type of machine (in ILLIAC IV, this ratio is 7).
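
 A restoring division of the kind outlined above can be sketched as
 follows. The code is an illustration only: it works on integers rather than on
 digits held in sM, assumes a dividend smaller than the divisor so that every
 quotient digit fits in one hexadecimal digit, and the name divide_mantissas is
 hypothetical.

```python
def divide_mantissas(dividend, divisor, digits):
    """Restoring division producing `digits` hexadecimal quotient digits.

    For each quotient digit the divisor is subtracted at most 15 times;
    a borrow stops the counting and the last subtraction is undone to
    restore a non-negative remainder.  Assumes 0 <= dividend < divisor.
    Returns (quotient digits, most significant first; final remainder).
    """
    assert 0 <= dividend < divisor
    remainder = dividend
    quotient = []
    for _ in range(digits):
        remainder *= 16                   # move to the next digit position
        count = 0
        for _ in range(15):
            remainder -= divisor
            if remainder < 0:             # borrow: this digit is finished
                remainder += divisor      # restore a positive remainder
                break
            count += 1
        quotient.append(count)
    return quotient, remainder


if __name__ == "__main__":
    print(divide_mantissas(1, 3, digits=4))
```

 The inner loop bound of 15 matches the maximum number of subtractions
 quoted in the text, which is what leads to the quadratic term in the timing
 estimate.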
 
 4.5.2 Logic Operations

 Logic operations are quite straightforward in this machine since
 the A/L unit in the PE's can directly perform all sixteen logical functions
 of two variables. Therefore, to obtain any bit-by-bit logic function of two
 operands, each N digits long, the same algorithm described for unsigned
 addition (Section 4.4) can be performed; the timing is also as given by (3):

 \[ T_l \approx 12N \tag{7} \]

 where T_l is the time required to perform one bit-by-bit logical operation.
 
 4.5.3 Comparisons

 In SPEAC, the result of a comparison is normally stored either in an
 lcFF or in the mode register. It can also be stored in sM or PEM for future
 use or sent to the CU via CDB. The six different types of comparisons (>, <,
 ≥, ≤, =, ≠) can readily be performed by the A/L unit. The algorithm for
 comparing two unsigned numbers is similar to the algorithm to add two unsigned
 numbers; as each pair of digits is compared, the result of the comparison for
 = is always stored in lcFF1. This is needed even to perform a comparison for
 >, <, ≥, or ≤ since lcFF1 is used to "freeze" the result of the comparison once
 the first pair of unequal digits is found. For example, the typical
 microsequence for a < compare is as follows:

 - load first operand from PEM into A_m

 - load second operand from PEM into B

 - enabling on lcFF1 ON, store the comparison A_m = B in lcFF1 and
 A_m < B in lcFF4.

 When all the digits have been compared, lcFF4 will have the resulting bit.
 Therefore, the timing is:
 \[ T_c = 7N \tag{8} \]

 where T_c is the time in clocks for comparisons of two unsigned numbers, each
 N digits long, leaving the result in the PE.

 Signed and floating-point comparisons require a little more control
 but the linear dependence on N is as in (8). Rough estimates are:

 \[ T_{sc} \approx 7N + 10 \tag{9} \]

 \[ T_{fpc} \approx 7(N_e + N_m) + 20 \tag{10} \]

 where T_sc is the time for signed comparisons in clocks and T_fpc is the time
 for floating-point comparisons in clocks.
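
 The "freeze" mechanism used for unsigned comparisons can be illustrated
 with a short sketch. It assumes that the digits are scanned from the
 high-order end, which is what makes freezing on the first unequal pair give
 the right answer, and that lcFF1 is ON before the first pair; both points are
 assumptions of this example, and the function name is hypothetical.

```python
def less_than(a_digits, b_digits):
    """Digit-serial unsigned a < b, most significant digit first.

    `still_equal` plays the role of lcFF1: while it is ON the current
    pair of digits may still decide the outcome; after the first
    unequal pair it goes OFF and `result` (lcFF4) stays frozen.
    """
    still_equal = True     # lcFF1, assumed ON before the first pair
    result = False         # lcFF4
    for a, b in zip(a_digits, b_digits):
        if still_equal:    # the stores are enabled only while lcFF1 is ON
            result = a < b
            still_equal = a == b
    return result


if __name__ == "__main__":
    print(less_than([1, 2, 3], [1, 2, 4]))
    print(less_than([2, 0, 0], [1, 15, 15]))
```

 The other five comparisons differ only in which relation is stored in
 the result flip-flop at each step.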
 
 4.5.4 Shifts

 Shifts by a total distance of b bits are easily performed in two
 phases: address indexing is used to shift by (b div 4) and register A shifts
 are used to shift by (b mod 4). sM is also frequently used as temporary
 storage, especially in end-around shifts. If b is global (i.e., all PE's will
 shift by the same distance) then the address indexing is performed in the CU.
 In general, it takes in the worst case 3 clocks to shift each digit, one to
 store it in sM and 6 to store the shifted digit back in PEM. Therefore:

 \[ T_s \approx 12N, \quad N \le 16 \tag{11} \]

 where T_s is the time in clocks to shift a number with N digits by a global
 distance. The operation is a little more complex if b is local; i.e., the
 shifting distance is different in each PE. In this case, local indexing is
 initially performed, taking about 20 clocks, to "shift" by "b div 4." The
 quantity "b mod 4" is then stored in LC and three successive shifts are
 performed which are enabled by lcFF1, lcFF2 and lcFF3 respectively. The
 remainder of the operation is as for global shifts. Therefore:

 \[ T_{ls} \approx 12N + 20, \quad N \le 16 \tag{12} \]

 where T_ls is the time in clocks to shift a number with N digits by a local
 distance.

 It should also be pointed out that the PE, besides shifting, has
 very good bit manipulation capability in general due to the locally controlled
 gating into A_m.
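
 The two-phase decomposition of a shift can be sketched as below. The
 example assumes a logical right shift with zero fill on a digit list stored
 low-order digit first; the whole-digit move stands in for address indexing and
 the residual 0 to 3 bit move for the register A shifts. The function names are
 hypothetical.

```python
def split_shift(b):
    """Split a shift of b bits into (whole digits, leftover bits)."""
    return divmod(b, 4)


def shift_right(digits, b):
    """Logical right shift of a digit list (low-order digit first) by b bits.

    Zero fill is assumed; end-around shifts are not modelled here.
    """
    digit_shift, bit_shift = split_shift(b)
    n = len(digits)
    # Phase 1: move whole digits (address indexing in the machine).
    moved = [digits[i + digit_shift] if i + digit_shift < n else 0
             for i in range(n)]
    # Phase 2: move the remaining 0..3 bits across digit boundaries.
    out = []
    for i in range(n):
        upper = moved[i + 1] if i + 1 < n else 0
        pair = (upper << 4) | moved[i]
        out.append((pair >> bit_shift) & 0xF)
    return out


if __name__ == "__main__":
    # 0xABC stored low-order digit first, shifted right by 5 bits.
    print(shift_right([0xC, 0xB, 0xA], 5))
```

 For a local shift each PE would apply its own value of b, so the first
 phase becomes a local indexing operation, as described above.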
 
 4.6 I/O

 Both I/O and routing are performed using the row gating and IOBR.
 I/O will be described first. An elementary I/O operation consists of
 interchanging the data words D1, initially in PEM, and D2, initially in mass
 memory (MM). Both words contain 512 bits and D1 is stored across one PE row:
 row j (PE i in row j contains the i-th hexadecimal digit of D1). Recalling the
 IOBR structure presented in Figure 22, the general procedure is the following:

 clock 0  - Initiate a MM read of word D2 to IOBRr.
 clock 8  - Initiate a PEM read of word D1.
 clock 10 - MM read is completed and D2 is in IOBRr. The PEM read
            will be completed during the next clock period; therefore
            gate D1 through row gating to IOBRr and simultaneously
            shift IOBRr left 128 digits (i.e., IOBRl <- IOBRr). This
            can be done in one clock.
 clock 11 - At this instant, IOBRr contains D1 and IOBRl contains D2.
            Initiate now the MM rewriting which will replace D2 by
            D1 in MM. Also initiate a PEM write which will write D2
            from IOBRl into any PEM row selected by row gating. If
            the row selected is row j, then D2 will replace D1 in
            that PEM row.
 clock 16 - PEM write is complete; D2 is now available in PEM.
 clock 21 - MM rewrite is finished; ready to start a new I/O
            transaction at this clock.
 
 One elementary I/O transaction then takes 1 MM cycle and 1 PE clock,
 or approximately one MM cycle, which was assumed to be 2 µsec (1 µsec access
 time, 1 µsec rewrite time). Eight of these elementary I/O's are needed to
 exchange one digit in every PEM with MM since there are eight PE rows.
 Therefore:

 \[ T_{I/O} = 168N \tag{13} \]

 where T_I/O is the time in clocks to interchange a word N digits long between
 PE's and MM. For N = 8, T_I/O = 135 µsec. This indicates that since a typical
 32-bit floating-point operation takes about 25 µsec, each word brought to PEM
 should be used in at least six operations (before being overlaid to MM) in
 order to completely overlap execution and I/O.
 
 The procedure described above for I/O transactions is based on the
 assumption that MM is bulk core. In this case, IOBRr is in fact the memory
 data register for MM. If MM is implemented with semiconductor memory, then it
 would be better to modify the structure in Figure 22 and have the output data
 from MM linked to IOBRl and the input data linked to IOBRr. This would avoid
 the IOBR shift in clock 10 and would save one clock in each transaction.
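
 The arithmetic behind the I/O estimate and the "at least six operations"
 remark can be checked with a few lines. The constants below restate figures
 from the text (21 clocks per elementary transaction, eight PE rows, a 100 nsec
 clock); the function names are hypothetical.

```python
import math

CLOCK_NSEC = 100               # PE clock period
CLOCKS_PER_TRANSACTION = 21    # one MM cycle (20 clocks) plus one PE clock
PE_ROWS = 8


def io_clocks(n_digits):
    """Clocks to interchange an N-digit word in every PE with mass memory."""
    return CLOCKS_PER_TRANSACTION * PE_ROWS * n_digits


def operations_to_hide_io(n_digits, op_usec):
    """Smallest number of operations whose total time covers the I/O time."""
    io_usec = io_clocks(n_digits) * CLOCK_NSEC / 1000
    return math.ceil(io_usec / op_usec)


if __name__ == "__main__":
    print(io_clocks(8), "clocks")
    print(operations_to_hide_io(8, op_usec=25), "operations")
```

 With the semiconductor variant that saves one clock per transaction, the
 per-transaction constant would drop to 20 under the same assumptions.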
 
 4.7 Routing

 The following algorithm is employed to perform routing left of one
 digit by a distance R, R ≤ 1023. This is obviously general since a routing
 right by n is equivalent to a route left by 1024-n.

 1) IOC, which processes routings, decomposes R into r' = R div 128
 and r = R mod 128. r' will be taken care of by row gating and r
 by shifting IOBR.

 2) IOBRr is loaded with row r' from PEM (rows are numbered from 0
 through 7).

 3) IOBRr is shifted left 128, thus placing row r' in IOBRl;
 simultaneously row r'+1 is brought to IOBRr.

 4) IOBR is shifted left by a distance r.

 5) IOBRl now contains the routed word for row 0. Therefore, IOBRl
 is written into row 0.

 6) IOBR is now shifted by (128-r), which places row r'+1 into IOBRl;
 simultaneously, row r'+2 is brought to IOBRr.

 7) Repeat step 4.

 And so on.

 It should be noticed that row r' has to be brought to IOBR twice,
 once at the beginning and once at the end of the routing. This is necessary
 to recover the leftmost digits of r' which are lost when step 4 is first
 executed.
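
 The routing procedure can be simulated on a small model of PEM and IOBR.
 The sketch below is illustrative only: it assumes 8 rows of 128 PEs, treats
 IOBR as a 256-entry list whose left half is IOBRl, and takes "routing left by
 R" to mean that the PE at linear position p receives the digit from position
 (p + R) mod 1024. That direction convention and the name route_left are
 assumptions of this example.

```python
ROWS, ROW_DIGITS = 8, 128


def route_left(pem, distance):
    """Route one digit per PE left by `distance` using a model of IOBR.

    `pem` is a list of 8 rows of 128 values.  Returns the routed rows.
    """
    row_part, digit_part = divmod(distance % (ROWS * ROW_DIGITS), ROW_DIGITS)
    routed = []
    left = list(pem[row_part])          # row r' sits in IOBRl after the shift by 128
    for target_row in range(ROWS):
        # Row load into IOBRr; the ninth load wraps around to row r' again.
        right = list(pem[(row_part + target_row + 1) % ROWS])
        iobr = left + right
        # End-around shift of IOBR left by r; IOBRl is the word for target_row.
        shifted = iobr[digit_part:] + iobr[:digit_part]
        routed.append(shifted[:ROW_DIGITS])
        # The further shift by 128 - r brings the row just loaded into IOBRl.
        left = right
    return routed


if __name__ == "__main__":
    pem = [[row * ROW_DIGITS + col for col in range(ROW_DIGITS)]
           for row in range(ROWS)]
    result = route_left(pem, 200)
    print(result[0][:4], result[7][-4:])
```

 Counting the loads in the loop plus the initial one gives the nine row
 loads that enter the timing expression.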
 
 The actions performed are: 9 row loads into IOBRr, 1 shift by 128,
 8 shifts by r, 7 shifts by (128-r) and 8 stores of IOBRl into rows. Also
 the first clock of all but the first row loads is overlapped with the last
 clock of a shift and the first clock of all but the last IOBR stores is
 overlapped with the first clock of a shift. Therefore, the timing for routing
 will be given by:

 \[ T_r = t_l + 8(t_l - 1) + 8t_{sh(r)} + 7t_{sh(128-r)} + t_{sh(128)} + 7(t_s - 1) + t_s \]

 where T_r is the time in clocks for routing one digit by a distance R = 128r'+r;
 t_l is the number of clocks for a row load; t_sh(r) is the number of clocks to
 shift IOBR by r; and t_s is the number of clocks for a row store. It is known
 that t_sh(128) = 1. The values for t_s and t_l depend on where the digit to be
 routed is: if it is in some PE register, then these times are only one clock;
 if the digit is in PEM, then t_l requires one PEM read or 3 clocks and t_s takes
 5 clocks for a PEM write. Therefore, there are four different types of
 routing. They are, from the fastest to the slowest: 1) PE to PE, 2) PEM to
 PE, 3) PE to PEM and 4) PEM to PEM.
 
 For routings of type 1:

 \[ T_{r1} = (8t_{sh(r)} + 7t_{sh(128-r)} + 3)N \tag{14} \]

 where T_r1 is the time in clocks to route a number with N digits; t_sh(r) is
 given by Table 9 for r ≤ 64; shifts by r > 64 in a given direction are simply
 obtained by first shifting by 128 (end around) and then shifting (128-r) in
 the opposite direction. t_sh(r) for r > 64 can thus be written as
 1 + t_sh(128-r), where t_sh(128-r) is taken from Table 9.

 For N=8 and r=1, one obtains T_r1 = 20 µsec. This is the best
 possible routing time and it is on the order of one floating-point operation
 time. Other distances may take longer. For example, when N=8 and r=2, T_r1 is
 32 µsec. Note also that routing must always be from one location to another
 or else the row that must be loaded twice would be changed when accessed for
 the second time.

 For routings of types 2, 3, and 4 the expressions are:

 \[ T_{r2} = (8t_{sh(r)} + 7t_{sh(128-r)} + 21)N \tag{15} \]

 \[ T_{r3} = (8t_{sh(r)} + 7t_{sh(128-r)} + 35)N \tag{16} \]

 \[ T_{r4} = (8t_{sh(r)} + 7t_{sh(128-r)} + 53)N \tag{17} \]
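
 The routing expressions can be evaluated directly once the Table 9 counts
 are available. The sketch below embeds the Table 9 values for distances 1 to
 64 and applies the general expression for T_r; the helper names, and the
 convention that a shift by 0 costs no clocks, are assumptions of this example.

```python
# Table 9: elementary shifts for distances 1..64 (paths 128, 1, 4, 32).
TABLE_9 = [
    1, 2, 2, 1, 2, 3, 3, 2,  3, 4, 4, 3, 4, 5, 5, 4,
    5, 6, 5, 4, 5, 5, 4, 3,  4, 4, 3, 2, 3, 3, 2, 1,
    2, 3, 3, 2, 3, 4, 4, 3,  4, 5, 5, 4, 5, 6, 6, 5,
    6, 7, 6, 5, 6, 6, 5, 4,  5, 5, 4, 3, 4, 4, 3, 2,
]


def shift_clocks(r):
    """Clocks to shift IOBR by r digits, 0 <= r <= 128."""
    if r == 0:
        return 0                      # assumed: no shift, no clocks
    if r == 128:
        return 1
    if r <= 64:
        return TABLE_9[r - 1]
    return 1 + TABLE_9[128 - r - 1]   # shift by 128, then back by 128 - r


def routing_clocks(n_digits, r, load_clocks, store_clocks):
    """Clocks to route an N-digit number by r digits within a row."""
    per_digit = (load_clocks + 8 * (load_clocks - 1)
                 + 8 * shift_clocks(r) + 7 * shift_clocks(128 - r)
                 + shift_clocks(128)
                 + 7 * (store_clocks - 1) + store_clocks)
    return per_digit * n_digits


if __name__ == "__main__":
    # PE to PE (one-clock load and store) and PEM to PEM (3 and 5 clocks).
    print(routing_clocks(8, 1, load_clocks=1, store_clocks=1))
    print(routing_clocks(8, 1, load_clocks=3, store_clocks=5))
```

 Dividing the clock counts by ten gives microseconds at the 100 nsec clock
 used throughout this section.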
 
 It is also important to notice that since routing is performed in 
 chunks of 128 each, several other special purpose types of partial routings 
 can be microprogrammed and are very useful in specific applications. 
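
 The routing expressions can be checked numerically with a minimal sketch
 like the one below. It assumes a 100 ns clock (consistent with 200 clocks
 giving 20 µsec) and uses a hypothetical two-entry excerpt of Table 9,
 T_SH = {1: 1, 2: 2}, inferred only from the 20 and 32 µsec figures quoted
 for r=1 and r=2; the function and variable names are illustrative.

```python
CLOCK_USEC = 0.1  # assumption: one clock = 100 ns

# Hypothetical excerpt of Table 9 (clocks to shift IOBR by r, r <= 64).
# Only these two entries are inferred from the text; the full table is not shown here.
T_SH = {1: 1, 2: 2}

# (t_load, t_store) in clocks for routing types 1..4:
# a PE register access is 1 clock, a PEM read 3 clocks, a PEM write 5 clocks.
ROUTE_TYPES = {1: (1, 1), 2: (3, 1), 3: (1, 5), 4: (3, 5)}


def shift_clocks(r, t_sh=T_SH):
    """Clocks to shift IOBR by r (1 <= r <= 127)."""
    if r <= 64:
        return t_sh[r]
    # r > 64: shift by 128 end-around (1 clock), then back by 128 - r.
    return 1 + t_sh[128 - r]


def overhead_clocks(t_load, t_store):
    """Load/store part of the per-digit time, with t_sh(128) = 1."""
    return t_load + 8 * (t_load - 1) + 1 + 7 * (t_store - 1) + t_store


def route_clocks(r, n_digits, route_type):
    """Clocks to route an n_digits-digit number by distance r."""
    t_load, t_store = ROUTE_TYPES[route_type]
    per_digit = (8 * shift_clocks(r) + 7 * shift_clocks(128 - r)
                 + overhead_clocks(t_load, t_store))
    return per_digit * n_digits


if __name__ == "__main__":
    for kind in (1, 2, 3, 4):
        clocks = route_clocks(1, 8, kind)
        print(f"type {kind}: {clocks} clocks = {clocks * CLOCK_USEC:.1f} usec")
```

 Under these assumptions the constants 3, 21, 35 and 53 of expressions (14)
 through (17) fall out of the load and store times alone.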
 
 4.8 Summary of Timings

 Table 11 presents a summary of the timing estimates for several
 operations and four "typical" word lengths: 16 bits (N_m=3, N_e=1), 32 bits
 (N_m=6, N_e=2), 48 bits (N_m=9, N_e=3), 64 bits (N_m=12, N_e=4).
 
                                         Formula         Time in µsec
 Operation                               Number   16 bits  32 bits  48 bits  64 bits

 Local indexing, per address                        1.6      1.6      1.6      1.6
 Mantissa multiplication                    1       12       39       82       141
 Floating-point multiplication              2       9.4      26       52       86
 Fixed-point unsigned addition              3       4.8      9.6      15       19
 Fixed-point signed addition                4       7.4      14       20       27
 Floating-point addition                    5       12.5     20       28       35
 Mantissa division                          6       44       145      na       na
 Logic operations                           7       4.8      9.6      15       19
 Comparison of unsigned numbers             8       2.8      5.6      8.4      11
 Comparison of signed numbers               9       3.8      6.6      9.4      12
 Comparison of floating-point numbers      10       4.8      7.6      11       13
 Global shifts                             11       4.8      9.6      15       19
 Locally indexed shifts                    12       6.8      12       17       21
 I/O (PEM <-> MM)                          13       67       135      200      269
 Routing PE - PE, distance 1               14       10       20       30       40
 Routing PEM - PE, distance 1              15       17       35       52       69
 Routing PE - PEM, distance 1              16       23       46       69       91
 Routing PEM - PEM, distance 1             17       30       60       90       120

 Table 11. Summary of Timing Estimates
 
 5. APPLICATIONS

 5.1 General Considerations
 
 In general, SPEAC can handle efficiently most problems in which 
 ILLIAC IV performs well since most of the features of ILLIAC IV are also 
 available in SPEAC. A large number of parallel algorithms to implement many 
 important applications in ILLIAC IV have been developed [9 through 17] . Ob- 
 viously, these algorithms can be used as a starting point when the use of 
 SPEAC for the same applications is contemplated. A few modifications or a 
 new approach are sometimes required due to the following differences: 
 
 a) PEM is much smaller in SPEAC and many problems which are "core 
 contained" in ILLIAC IV must use memory overlay in SPEAC. On 
 the other hand, MM in SPEAC is random-access and the machine 
 was especially designed to allow efficient PEM overlay so it is 
 normally possible to use SPEAC efficiently even in non-core
 contained problems. In ILLIAC IV, non-core contained problems,
 while not as frequent as in SPEAC, are harder to program
 efficiently due to the latency problem in its disk mass memory.
 b) Routing is relatively slow in SPEAC. While in ILLIAC IV a
 route takes about half the time required for a floating-point
 operation regardless of distance, in SPEAC it takes from one to 
 several times as much as a typical floating-point operation, 
 depending on the distance. On the other hand, in SPEAC routing 
 is an I/O operation and can be overlapped with PE processing. 
 Also special route instructions can be microprogrammed, "cus- 
 tomized" to particular problems. 
 
 c) ILLIAC IV is primarily intended for computations on floating-
 point numbers with 32 or 64 bits precision. While SPEAC can
 also handle these problems, floating-point multiplication
 becomes relatively slow for very long word lengths since it is
 proportional to the square of the number of digits in the word.
 Furthermore, there is a very important area of applications
 which is much more "natural" to program for SPEAC than for ILLIAC
 IV. This area includes problems involving a large quantity of
 fixed-point numbers with small precision, typically only a few
 bits. Examples of these problems are: picture processing, non-
 numerical processing in strings of characters, etc. These
 problems can be handled very efficiently by SPEAC due to its
 digit-by-digit processing and fast operation for small words.

 d) In ILLIAC IV, the number of PE's (n_PE) is 64 and for most
 applications one is interested in tackling problems in which the
 number n of parallel computations is equal to or greater than n_PE.
 In matrix computations, for example, n is the order of the matrix
 and in discrete Fourier transforms n is the number of points.
 Therefore, a frequent problem in ILLIAC IV is to partition a
 large data set into "chunks" of 64 or 64 × 64 so that each
 chunk can "fit" in the machine. Chunks are then processed
 sequentially. In SPEAC, n_PE = 1024 and for most problems one will
 be interested in n < n_PE; the typical problem is to subdivide a
 data set into several pieces and to process all the pieces in
 parallel to "fill" the whole machine when n < n_PE.
 
 In the next sections a few specific representative applications of
 SPEAC are considered in detail. Of course, they are only meant as a sample
 since many other interesting applications could possibly be efficiently
 handled by the machine.

 Timing estimates were based on counting PE clocks by hand. Some
 attempt has been made to take into account PE/IO overlap but precise numbers
 could only be obtained with a very sophisticated simulator for CU and specific
 detailed microprograms for every instruction. Therefore, the estimates can
 be a little pessimistic if the overlap was not fully accounted for.
 
 5.2 Relaxation

 The problem consists of: given an initial matrix U^0, n × n, find a
 succession of matrices U^1, U^2, ... where each term of matrix U^{k+1} is a
 function of the four "neighbors" of the term in the previous matrix U^k.

 In general,

 U^{k+1}_{i,j} = f(U^k_{i+1,j}, U^k_{i-1,j}, U^k_{i,j+1}, U^k_{i,j-1}, U^k_{i,j})
 
 This is a general formulation for a series of problems that can be very ef- 
 ficiently solved using an array computer. If the elements of U are floating- 
 point numbers, then this type of expression can be used to find the equili- 
 brium temperatures or potentials at every point of a plane submitted to given 
 initial conditions at the edges; if the elements of U are small integers, then 
 each element can represent a point of a picture coded according to a gray 
 scale. In this case, the formulation can be used to implement a "smoothing" 
 filter or a number of other picture processing problems. 
 
 As an example, the following case will be studied:

 U^{k+1}_{i,j} = (U^k_{i+1,j} + U^k_{i-1,j} + U^k_{i,j+1} + U^k_{i,j-1}) / 4

 The loop condition is the following: if |U^{k+1}_{i,j} - U^k_{i,j}| < ε
 for all i,j then exit the loop; otherwise repeat.
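
 The averaging step and the exit test can be sketched sequentially as
 follows. This is only an illustration of the arithmetic, not of the PE
 microsequences; it assumes end-around (toroidal) neighbors, and the names
 relax_step and relax, as well as the max_iters guard, are hypothetical.

```python
def relax_step(u):
    """One relaxation: each element becomes the mean of its four neighbors.

    Neighbors wrap around at the edges (toroidal geometry).
    """
    n = len(u)
    return [
        [
            (u[(i + 1) % n][j] + u[(i - 1) % n][j]
             + u[i][(j + 1) % n] + u[i][(j - 1) % n]) / 4
            for j in range(n)
        ]
        for i in range(n)
    ]


def relax(u, eps, max_iters=10_000):
    """Iterate until every element changes by less than eps.

    Returns the last matrix and the number of iterations performed.
    max_iters is a safety guard added for this sketch only.
    """
    for k in range(1, max_iters + 1):
        new = relax_step(u)
        converged = all(
            abs(new[i][j] - u[i][j]) < eps
            for i in range(len(u))
            for j in range(len(u))
        )
        u = new
        if converged:
            return u, k
    return u, max_iters


if __name__ == "__main__":
    result, iterations = relax([[1.0, 1.0], [1.0, 1.0]], eps=1e-6)
    print(result, iterations)
```

 In the array machine the inner loops disappear: every PE evaluates the
 same expression for its own element in parallel.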
 
 Two values of n are considered: 32 and 1024, although other powers
 of two can also be handled efficiently.

 a) n=32; the elements of U are 32-bit floating-point numbers.
 The most straightforward (and most inefficient) way of coding the loop is:
 the elements of U are stored across PE's, row after row; i.e., numbering PE's
 from 0 to 1023 and rows from 0 to 31, element U_{i,j} is stored in
 PE_{32i+j}. U^k is in PEM location a. The loop is:

 1) Route distance 1 left from PEM location a to PEM location b.

 2) Route distance 1 right from PEM location a to sM(0).

 3) Add sM(0) to PEM(b) and store in PEM(b).

 4) Route distance 32 left from PEM(a) to sM(0).

 5) Add PEM(b) <- (PEM(b)+sM(0)).

 6) Route distance 32 right from PEM(a) to sM(0).

 7) Add sM(0) <- (PEM(b)+sM(0)).

 8) Multiply the sum of the four neighbors by .25, sent via CDB:
 sM(0) <- (sM(0) × CDB(.25)).

 9) Test for ending condition; sM(0), which now contains U^{k+1}, is
 subtracted from U^k, which is in PEM location a, and the difference
 is compared against ε, sent via CDB. In PE's in which the ending
 condition is satisfied, a zero is gated to the interrupt
 wire and register M is reset, which disables the PE.

 10) Write sM(0) in PEM location a and go back to step 1.

 The process ends when, in step 9, CU receives a zero via the interrupt
 wire; this indicates that all PE's are disabled. At this point the result
 of the last iteration is stored in sM(0); all PE's are enabled and the
 result can then be stored in PEM. The procedure requires three additions
 (20 µsec), one subtraction (20 µsec), one multiplication (25 µsec), three
 routes of type 2 (35 µsec), one route of type 4 (60 µsec), and one comparison
 (7.6 µsec) for a total of 278 µsec per execution of the loop. Obviously, each
 execution of the loop computes a new iteration matrix U^{k+1} out of the
 previous value U^k. It should be noticed that sM was used as temporary
 storage in some steps. sM can store two 32-bit numbers: one in sM(0) through
 sM(7) and the other in sM(8) through sM(15). It is obviously possible to
 write microsequences for variants of addition and multiplication which take
 one or both operands from sM instead of PEM and also possibly have the
 results in sM instead of storing the numbers back in PEM. These operations
 will be faster than the normal PEM to PEM ones (from 1.6 to 6.4 µsec faster)
 but this will not normally be taken into account in these worst-case timings.
 It is also important to notice that since sM is used as scratchpad in most
 operations, if the two operands are in sM, one is destroyed during the
 operation unless sM is enlarged to contain four or eight 32-bit numbers
 instead of only two.
 
 A few improvements are possible in the straightforward algorithm
 presented above and they are as follows:

 1) The routing in step 1 does not have to be of type 4 since sM is
 available. Therefore, one can load the data in sM(0) (2.4 µsec),
 route from sM(0) to sM(8) (20 µsec), and store sM(8) in PEM
 (4 µsec). These last four µsec can be overlapped with the
 routing and total time is roughly 25 µsec.

 2) A special microsequence can be written for an instruction to
 divide by 4 by shifting and normalizing. This will take much
 less than 25 µsec; since the operand is in sM and the result is
 also left in sM, 5 µsec is a reasonable upper bound.

 3) All additions except the last can be overlapped with routings of
 type 2. The routings to be overlapped must be of type 2 because
 there is no space in sM to keep the elements of U permanently
 in sM, which would enable one to use only type 1 routings. If
 more space were available in sM, the sum could also be kept in
 sM and PEM location b would not be used.
 
 The improved algorithm is as follows:

 1) sM(0) <- PEM(a); U^k is now in sM(0) (2.4 µsec).

 2) Route distance 1 left from sM(0) to sM(8) (20 µsec). Simultaneously,
 write sM(8) in PEM(b) (~2.6 µsec).

 3) Route distance 1 right from sM(0) to sM(8) (20 µsec).

 4) PEM(b) <- (PEM(b)+sM(8)). Simultaneously, route distance 32 left
 from PEM(a) to sM(0) (35 µsec).

 5) PEM(b) <- (PEM(b)+sM(0)). Simultaneously, route distance 32
 right from PEM(a) to sM(8) (35 µsec).

 6) sM(0) <- (PEM(b)+sM(8)) (~16 µsec). Note that this addition
 takes less time because the result is not stored back in PEM.

 7) sM(8) <- sM(0) shifted 2 right (i.e., divided by 4) and normalized
 (~5 µsec).

 8) Test end condition. This is the same as step 9 in the original
 algorithm (~8 µsec).

 9) PEM(a) <- sM(8). Go to step 1 (~4 µsec).

 The total time is now only ~148 µsec for each complete relaxation.
 Further improvement is possible if sM can store four 32-bit numbers instead
 of only two. In this case all routings are of type 1 (which saves 30 µsec)
 and the whole problem can be done in sM, which saves all PEM reads and writes
 except the initial read and final write. In this case a total time on the
 order of 110 µsec is possible.
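
 The two loop totals quoted for case a can be reproduced by adding the
 per-step figures. The short tally below is only a bookkeeping sketch; it
 takes the step times exactly as listed (including the ~2.6 µsec write in
 step 2 of the improved algorithm), and the dictionary and list names are
 illustrative.

```python
# Straightforward algorithm: operation counts times their unit cost (usec).
straightforward = {
    "three additions": 3 * 20,
    "one subtraction": 20,
    "one multiplication": 25,
    "three type-2 routes": 3 * 35,
    "one type-4 route": 60,
    "one comparison": 7.6,
}

# Improved algorithm: figures attached to steps 1..9, in order (usec).
# Step 2 contributes both its route (20) and its write (~2.6).
improved = [2.4, 20, 2.6, 20, 35, 35, 16, 5, 8, 4]

total_straightforward = sum(straightforward.values())
total_improved = sum(improved)

if __name__ == "__main__":
    print(f"straightforward loop: about {round(total_straightforward)} usec")
    print(f"improved loop:        about {round(total_improved)} usec")
```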
 
 The algorithms considered assume a toroidal geometry; i.e., there are
 no edges, U_{0,j} is considered a neighbor of U_{31,j} and U_{i,0} a neighbor
 of U_{i,31}. This is not desirable for most actual applications. In most
 cases, there is an outside edge: U_{-1,j}, U_{32,j}, U_{i,-1} and U_{i,32},
 with fixed values. This can be easily included in the program in the
 following way: a digit D is stored in each PE containing the LSB ON if the
 element stored in that PE belongs to row 0, the second LSB ON if it belongs
 to row 31, the third LSB ON for column 0, and the MSB ON for column 31. The
 fixed edge values are stored in PEM locations c, d, e and f (each is only
 needed in 32 PE's, but it is probably easier to store them in all PE's). A
 new step is needed between 2 and 3 in the improved algorithm. This step is
 number 2½ and is identical with step 1. Before steps 1, 2½, 4, and 5, a
 local indexing is added. This local indexing is enabled by the bits of D and
 makes PE's that have an edge neighbor take the edge value instead of the
 "end-around" neighbor. This adds only about 8 µsec to the procedure.
 
 It should also be pointed out that overlaps of two operations both
 using sM can be less than perfect since sM has only one port. Normally,
 however, operations that use sM do so 50% or less of the clocks and thus very
 good overlap is possible. Multiplication is an exception since it uses sM
 very heavily.
 
 b) n=1024; the elements of U are floating-point 32-bit numbers.
 In this case each row of U is stored across PE's and 1024 rows are needed.
 Therefore, the problem is not "core contained" and PEM overlay is necessary.
 Routing is only needed now to access the "left" and "right" neighbor; the
 "upper" and "lower" neighbors of an element and the element itself are
 stored in the same PE. Therefore, at least three complete rows of U must
 always be present in PEM. Assuming they are in locations a, b, and c
 respectively, the algorithm is:

 1) sM(0) <- PEM(b); U^k is now in sM(0) (2.4 µsec).

 2) PEM(d) <- (PEM(a)+PEM(c)); do not destroy sM(0) (20 µsec).

 3) Route distance 1 left from sM(0) to sM(8) (20 µsec).

 4) PEM(d) <- (sM(8)+PEM(d)); do not destroy sM(0) (20 µsec).

 5) Route distance 1 right from sM(0) to sM(8) (20 µsec).

 6) sM(0) <- (sM(8)+PEM(d)) (~16 µsec).

 7) sM(8) <- sM(0) shifted 2 right and normalized (~5 µsec).

 8) Test end condition (~8 µsec).

 9) PEM(b) <- sM(8); go to step 1 (~4 µsec).

 Steps (2,3) and (4,5) could overlap for a total time of 58 µsec per
 row. However, this would leave only 5-7 µsec in which both buses are not
 simultaneously used and I/O overlay could not occur. Since FINST normally
 assigns priority to I/O, on the average each loop will take the maximum time
 of 116 µsec and will have to wait for 20 more µsec for I/O. Therefore, the
 procedure is I/O bound and each loop takes 135 µsec, which is the time needed
 for an I/O transaction. One iteration is then performed in about 135 msec.
 Fixed edge conditions can be introduced as discussed in case a and do not cost
 any extra time since the procedure is I/O bound.
 
 c) n=1024; the elements of U are one-digit integers. This
 case would be used in picture processing. The problem is "core contained"
 since 2K digits are available in PEM and only 1K are needed. Storage is as
 in case b, each row across PE's. All elements of the same column are in the
 same PEM. Only one PEM read and one PEM write are needed per row since sM is
 now capable of storing sixteen 4-bit elements. Assume that sM(a) contains the
 upper neighbor, sM(b) the present element and sM(c) will contain the lower
 neighbor. The algorithm is:

 1) sM(c) <- PEM(address of element of next row).

 2) reg A <- sM(a)+sM(c).

 3) Route distance 1 left from sM(b) to sM(d) (2.5 µsec).

 4) reg A <- reg A+sM(d) (.2 µsec).

 5) Route distance 1 right from sM(b) to sM(e) (2.5 µsec).

 6) reg A <- reg A+sM(e) (.2 µsec).

 7) Shift reg A right 2 bits (.2 µsec).

 8) Test end condition (~.5 µsec).

 9) Go to step 1.

 The whole procedure then takes only about 6 µsec since steps 1, 2,
 and 4 are overlapped with routes. Therefore, one iteration can be performed
 in about 6 msec. Fixed edge conditions could be introduced without
 difficulty since there is space in sM to keep the data for the edges. This
 problem could also use two digits per element for a gray scale with 256
 shades. Since sM can still be used, the time increases linearly to 12 msec
 per iteration.
 
 In conclusion, SPEAC performs exceedingly well in relaxation type 
 problems. 
 
 5.3 Matrix Multiplication

 Given two matrices, A_{n×n} and B_{n×n}, the problem consists of finding
 the matrix C_{n×n} which is the product of A and B:

 C = A × B      (c_{ij} = \sum_{k=1}^{n} a_{ik} × b_{kj}).
 
 Two basic methods can be used to store matrices in an array
 computer:

 a) Straight storage, in which each row is stored across PE's and
 all elements of a column are stored in the same PE. Therefore,
 a_{ij} is stored in PE_j.

 b) Skewed storage, in which each row is stored across PE's but it
 is also rotated one position farther than the preceding row in
 an end-around fashion. Thus, a_{ij} is stored in
 PE_{(i+j-2) mod n + 1}.

 In either storage scheme one row of A can be accessed by fetching
 one row of PE memory. When a matrix is skewed one column can also be accessed
 in one memory fetch by indexing each PE to a different memory location. To
 fetch the first column of A, for example, each PE simply loads from location
 A plus the number of that PE. By routing this indexing pattern, any column
 of A can be accessed in one operation. It would take many memory fetches to
 access a column of a matrix which is not skewed since all elements of a column
 are stored within one PE.
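
 The two storage schemes can be modelled with a small sketch in which
 pem[p][loc] stands for location loc of the memory of PE p. The model is
 zero-based (PE's, rows and columns all start at 0), so the skewed mapping
 becomes PE (i+j) mod n; all function names are illustrative and not part of
 the machine description.

```python
def store_straight(a):
    """Straight storage: a[i][j] goes to PE j, location i."""
    n = len(a)
    return [[a[i][pe] for i in range(n)] for pe in range(n)]


def store_skewed(a):
    """Skewed storage: row i is rotated i positions, so a[i][j] goes to PE (i+j) mod n."""
    n = len(a)
    pem = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            pem[(i + j) % n][i] = a[i][j]
    return pem


def fetch_row(pem, i):
    """One memory fetch: every PE reads the same location i."""
    return [pem[pe][i] for pe in range(len(pem))]


def fetch_column_skewed(pem, c):
    """One memory fetch with local indexing: PE p reads location (p - c) mod n.

    The result is column c in PE order, i.e. rotated by c positions.
    """
    n = len(pem)
    return [pem[pe][(pe - c) % n] for pe in range(n)]


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    skewed = store_skewed(a)
    print(fetch_column_skewed(skewed, 0))  # column 0, one element per PE
    print(fetch_column_skewed(skewed, 1))  # column 1, rotated by one PE
```

 For column 0 the index pattern is simply the PE number; shifting that
 pattern by c positions selects column c, which is what routing the indexing
 pattern achieves in the array.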
 
 Three methods have been proposed ([11] and [12]) to perform matrix
 multiplication in an array computer. Briefly, they are as follows:

 a) the log-sum method, which is used to multiply skewed matrices
 since columns and lines must be accessible. A row of the first
 matrix is fetched and multiplied, in parallel, by a column of the
 second matrix. The results are summed across PE's to produce one
 element of the solution. There are two major causes of
 inefficiency in this method. First of all, the operation of summing
 across PE's, known as a log-sum, is at best only 20% efficient
 in using PE's. Secondly, excessive routing is required to
 properly index columns and line them up with rows.
 
 b) the broadcast method, which generates one row of the result
 matrix at a time rather than just one element. It operates on
 matrices which are stored straight in memory and produces a
 result matrix which is also stored straight. Each row of the
 result is obtained after n multiplications and accumulations
 (the result of each multiplication is added to the sum of all
 previous multiplications). To obtain row i of the result, the
 k-th element a_{ik} of row i of matrix A is multiplied by the k-th
 row of matrix B and all n rows thus obtained are added together.
 The expression is:

 row(c_i) = \sum_{k=1}^{n} a_{ik} row(b_k)

 The CU must be able to broadcast the elements a_{ij} to the PE's
 and the PE's must have access to rows of B (i.e., row across
 PE's). As opposed to the skewed matrix multiplication, this
 method is almost 100% efficient. There is no log-sum involved
 and no routing is required.
 
 c) Knapp's method, of which only a brief description is offered
 here; for a detailed treatment see [12]. A and B are stored
 straight and C will also be obtained straight. As in the
 broadcast method, each row of the result is obtained after n
 multiplications and accumulations. However, no broadcast takes
 place. To obtain row i of the result, row i of A is multiplied
 by each diagonal of B and then routed right one. Defining the
 k-th diagonal of B as:

 b_{1,k}, b_{2,k+1}, ..., b_{i,(k+i-2) mod n + 1}, ..., b_{n,(k+n-2) mod n + 1}

 then Knapp's method is expressed by the following:

 row(c_i) = \sum_{k=1}^{n} (row a_i routed right (k-1) times) × (k-th diagonal of B)

 To access the first diagonal of a matrix stored straight, each
 PE is locally indexed with the PE number (starting with 0); this
 pattern is routed right (k-1) times to access the k-th diagonal.
 The efficiency of Knapp's method is very good because no log-sum
 operations are performed, but not as good as straight
 multiplication since routing is required. Its major use is to perform
 several small matrix multiplies simultaneously using only a
 small group of PE's for each one.
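
 The broadcast method and Knapp's method can be compared on a small example
 with the sequential sketch below. Each list position stands for one PE, B is
 assumed stored straight, and indices are zero-based; the function names are
 hypothetical and the code models only the data movement, not the timing.

```python
def broadcast_row(a, b, i):
    """Row i of C by the broadcast method: sum over k of a[i][k] * row k of B."""
    n = len(b)
    acc = [0] * n
    for k in range(n):
        a_ik = a[i][k]  # scalar broadcast by the CU to every PE
        acc = [acc[j] + a_ik * b[k][j] for j in range(n)]
    return acc


def knapp_row(a, b, i):
    """Row i of C by Knapp's method: row of A times successive diagonals of B."""
    n = len(b)
    acc = [0] * n
    routed = list(a[i])  # row i of A, one element per PE
    for k in range(n):
        # k-th diagonal by local indexing: PE j reads location (j - k) mod n.
        diagonal = [b[(j - k) % n][j] for j in range(n)]
        acc = [acc[j] + routed[j] * diagonal[j] for j in range(n)]
        # Route the row of A right by one PE, end-around.
        routed = [routed[-1]] + routed[:-1]
    return acc


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print([broadcast_row(a, b, i) for i in range(2)])
    print([knapp_row(a, b, i) for i in range(2)])
```

 Both functions accumulate the same n products per PE; the difference is
 that the first needs a scalar from the CU at every step while the second
 needs a one-position route of the row of A.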
 The three methods can be used in SPEAC but the log-sum method is not
 considered in detail since it is the least efficient. Two cases are studied:

 a) n=1024 and each element is a 32-bit floating-point number.
 Each matrix is stored straight and the broadcast method will be used. One 
 slight modification is needed, however, to avoid I/O bounding since the problem 
 is not "core contained. " In the broadcast method the rows of B are used in 
 order from row 1 to row n (to compute row 1 of C) and then again from row 1 to 
 row n (to compute row 2 of C) and so on. Therefore, each row is used only for 
 one multiply and add each time it is in PEM. Since a multiply and add can be
 performed in 45 µsec, there is no time to overlay a row, which takes 135 µsec.
 The solution is simple: each row of B must be used several times each time it
 is brought to PEM. In this way, several rows of the product are computed
 simultaneously. For example, the first row of B (row(b_1)) is brought and is
 multiplied by 64 broadcast elements a_{1,1}, a_{2,1}, ..., a_{64,1}; the 64
 rows thus obtained are stored in PEM. Row(b_2) is then accessed and is
 multiplied by a_{1,2}, a_{2,2}, ..., a_{64,2}; each of the 64 rows thus
 obtained is added to the corresponding row of the first 64. At the end of
 1024 cycles, all rows of B have been accessed and used 64 times each, and the
 first 64 rows of C are completed. The method is repeated sixteen times to
 obtain the 1024 rows of C. Since each multiply and add takes 45 µsec, 64 take
 2880 µsec, in which there is time to interchange 21 rows. Therefore, I/O can
 be easily overlapped with execution; while the 1024 rows of B are used, there
 is time to interchange 21K rows and all that is needed is to interchange 1024
 rows plus the 64 result rows.
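
 The overlap argument is simple arithmetic and can be restated as a short
 calculation. The constants are the figures quoted in the text (45 µsec per
 multiply and add, 135 µsec per row interchange, groups of 64 rows); the
 variable names are illustrative.

```python
N = 1024            # order of the matrices
GROUP = 64          # rows of C computed per pass over B
MADD_USEC = 45      # one multiply and add
ROW_IO_USEC = 135   # one row interchange between PEM and MM

# Compute time spent on each row of B while it is resident in PEM.
compute_per_b_row = GROUP * MADD_USEC

# Row interchanges that fit in that time.
io_rows_per_b_row = compute_per_b_row // ROW_IO_USEC

# Over one pass (all N rows of B): I/O capacity versus I/O actually required.
io_capacity_rows = N * io_rows_per_b_row
io_required_rows = N + GROUP

passes = N // GROUP

if __name__ == "__main__":
    print(f"compute per row of B: {compute_per_b_row} usec")
    print(f"row interchanges that fit: {io_rows_per_b_row}")
    print(f"capacity per pass: {io_capacity_rows} rows, required: {io_required_rows} rows")
    print(f"passes needed: {passes}")
```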
 
 CU obtains the elements to broadcast either directly from mass memory
 or from the PE's via CDB. The latter is the most straightforward scheme
 and can be efficiently used since overlap is possible with execution due to
 the fact that I/O takes a relatively small percentage of the execution time.
 64 rows of A are needed in the PE at all times to obtain the broadcast
 elements. Patterns are also stored in PEM and used to turn off all but one PE
 each time CDB_out is used to send a broadcast element to CU. Note also that
 CU can simultaneously broadcast a previous element since CDB_in is used for
 this purpose.
 
 In the worst case, there are 194 rows in PEM at one time: the 64
 rows of C that are being computed, the 64 rows of C that have just been
 completed and have not been placed in MM yet, 64 rows of A that are being
 used to obtain broadcast elements, and 2 rows of B, one being used to multiply
 and a new one being prepared for the next step. When the 64 completed rows
 of C are overlaid to MM, the space is used to load the next group of 64 rows
 of A. When a new step begins, the locations of the old rows of A are used
 to place the new partial rows of C.
 
 Therefore, complete overlay of I/O and CU instructions is possible
 and the timing is simply given by: n^2 (multiplications and additions). sM
 can contain the row of B being used 64 times and also the result of the
 multiplication. Only the result of the addition must be stored. In these
 conditions, multiply and add takes about 43 µsec and the final result is
 43 sec.
 
 b) n = 1024/2^k (k = 1, 2, 3, 4) and each element is a 32-bit floating-
 point number. This is the submultiple case, in which the size of the
 matrix is a submultiple of the size of the array. In order to keep all PE's
 busy, one can either divide the matrix in n_PE/n parts and use all PE's to
 compute one multiplication, or n_PE/n multiplications can be computed
 simultaneously. The two approaches are very similar and only the first is
 considered. Two methods can be used: the broadcast method, which is
 especially suitable when n_PE/n is small (2 or 4 ideally), and Knapp's
 method, which is best when n_PE/n >> 8.

 In the broadcast method, n_PE/n repetitions of a row of B can be
 catenated across one row of PEM's and the method is used as before, but
 instead of generating k rows of C at the end of each step (k=64 in the
 example presented in part a), k × n_PE/n rows of C are constructed
 simultaneously. For n_PE/n ≤ 8 this repetition is easily obtained by writing
 in PEM n_PE/n times the same row of B read only once from MM. Obviously,
 there is one difficulty: the broadcast element must be different for each of
 the n_PE/n copies of the row of B. Up to four different broadcast elements
 may be sent during a multiplication of two digits without any extra delay.
 The only problem is to enable sM's in only a portion of the PE's without
 disabling the multiplication itself. This suggests the introduction of an
 enable flip-flop for I/O and CU use, and sM may be directed to obey either
 the PE enable or the I/O/CU enable. If this is available, the broadcast
 method can be used without any extra cost since the multiple broadcasts are
 overlapped with multiplication. Therefore:

 time = 43 n^2 / (n_PE/n) = (n^3 / n_PE) × 43 µsec,

 and for a 256 × 256 matrix the time = 700 msec.

 If the above mentioned control of sM is not available, about
 8(n_PE/n + 2) additional clocks are needed per multiplication to select the
 broadcast elements. For n_PE/n = 4, this adds 5 µsec per multiplication. The
 expression is:

 time = (n^3 / n_PE) × (43 + .8(n_PE/n + 2)) µsec.

 This method is then convenient only when n_PE/n is small so that the
 extra time spent in selective broadcast is not excessive.
 
 Knapp's method avoids selective broadcasts but introduces routings.
 n_PE/n rows of A are concatenated across one row of PEM's and B is repeated
 n_PE/n times, once for each concatenated row of A. For n=128, this operation
 is easily obtained by writing in PEM eight times the same row of B read only
 once from MM. For n < 128 this repetition may require initial routes. Each
 diagonal of B is obtained by local indexing (n_PE/n copies of the diagonal
 are actually obtained) and multiplied by the rows of A. The result is
 accumulated and when all diagonals have been used, n_PE/n rows of C are
 computed. After each diagonal is used, the rows of A must be routed right by
 a distance of 1. Since this route is end-around with respect to n and not to
 n_PE, a second route is needed (by a distance n) unless n=128. The rows of A
 can be kept in sM while in use so routing is of type 1. The time is given
 roughly by the following:

 time = (n^3 / n_PE)(add time + multiply time + 2 route times) = 90 n^3 / n_PE µsec

 if the routes are all fast. Therefore, the selective broadcast method is
 best for all cases in which n_PE/n ≤ 32.
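
 The three time estimates for the submultiple case can be evaluated side by
 side with a small sketch. It assumes n_PE = 1024, a 100 ns clock (so 8 clocks
 equal .8 µsec) and the rounded constants 43 and 90 µsec from the text; the
 function names are hypothetical.

```python
N_PE = 1024          # number of PE's
MADD_USEC = 43.0     # multiply and add with operands kept in sM
KNAPP_USEC = 90.0    # add + multiply + 2 fast routes, as rounded in the text
CLOCK_USEC = 0.1     # assumption: one clock = 100 ns


def broadcast_usec(n, n_pe=N_PE):
    """Broadcast method when sM obeys a separate I/O/CU enable (no extra cost)."""
    return n ** 3 / n_pe * MADD_USEC


def selection_overhead_usec(n, n_pe=N_PE):
    """Extra time per multiplication to select the broadcast elements."""
    return CLOCK_USEC * 8 * (n_pe / n + 2)


def selective_broadcast_usec(n, n_pe=N_PE):
    """Broadcast method without the separate enable flip-flop."""
    return n ** 3 / n_pe * (MADD_USEC + selection_overhead_usec(n, n_pe))


def knapp_usec(n, n_pe=N_PE):
    """Knapp's method with all routes fast."""
    return n ** 3 / n_pe * KNAPP_USEC


if __name__ == "__main__":
    for k in (1, 2, 3, 4):
        n = N_PE // 2 ** k
        print(f"n={n:4d}  broadcast={broadcast_usec(n) / 1000:9.1f} msec  "
              f"selective={selective_broadcast_usec(n) / 1000:9.1f} msec  "
              f"Knapp={knapp_usec(n) / 1000:9.1f} msec")
```

 For n = 256 the selection overhead evaluates to 4.8 µsec per multiplication,
 which the text rounds to 5 µsec.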
 5.4 Pattern Matching

 This application was chosen to test the character manipulation
 capabilities of SPEAC. The problem, fully described in [9], is briefly stated
 as: given two strings of characters, S (with n_s characters: s_1, s_2, ...,
 s_{n_s}) and P (with n_p characters: p_1, p_2, ..., p_{n_p}), find out how
 many times and/or in which position does P occur in S. P is called the
 pattern string and S the source string. Normally n_s >> n_p. The problem can
 be considered in two different aspects: 1) n_p is very small (typically 1 to
 3) and only the count of occurrences is desired. This is what is needed in
 analysis of texts to obtain the frequency of occurrence of given letters or
 combinations of letters, and 2) n_p can be a small integer up to about 15 and
 the positions in S in which P occurs are desired. This is the type of
 algorithm needed, for example, to find all occurrences of the words BEGIN and
 END in a segment of a program as would be necessary in a parallel compiling
 technique as proposed in [10].

 The source string S can be arranged in memory in two different ways:
 1) S is distributed across PE's in rows, one element per PE; i.e., character
 s_i is in PE_{(i mod n_PE)}, and 2) S is distributed across PE's in n_PE
 chunks each with n_s/n_PE adjacent characters; i.e., character s_i is in
 PE_{(i div n_s/n_PE)}. Storage scheme 2, called storage in chunks, leads to
 much more efficient programs in SPEAC than storage scheme 1, called storage
 across PE's. This is due to the fact that with storage in chunks, routing is
 practically eliminated. However, both storage schemes are considered since it
 may be difficult to use storage in chunks if the input data is not initially
 manipulated by corner memory.
 
 a) Storage in chunks; only a count of the number of occurrences
 is required. Each character is assumed to be four bits long and is coded in
 one digit. Obviously, this introduces no restriction since the same algorithms
 can be applied if more than one digit is needed to code each character. No
 character manipulation instructions were considered in Chapter 4. Therefore,
 most instructions used in these algorithms are custom-made, that is, they
 are described in terms of their microsequences.

 Initially, the first (n_p - 1) characters in each chunk must be routed
 left by a distance of one in order to enable the recognition of truncated
 occurrences of P (i.e., an occurrence of P in which p_1 is the right-most
 character in chunk i and p_2, p_3, ... are the first characters in chunk
 i+1). The initialization thus takes (n_p - 1) routings distance 1 or
 (n_p - 1) × 2.5 µsec.

 Ideally, for best efficiency, the length n_c of each chunk
 (n_c = n_s/n_PE) is a large number. n_c is here considered to be on the order
 of 1K; i.e., the source string has one million characters. If S is longer,
 the whole procedure is repeated a number of times; each execution analyzes
 one million characters.

 The following algorithm can be used: X_1 contains the address of the
 next character in S to be analyzed. The pattern string is initially brought
 to CU and will be repeatedly broadcast via CDB.
 
 1) X_1 is loaded via CAB with the address of the next character of S
    to be analyzed as a possible start of an occurrence of P:
    X_1 <- CAB(address of s_i) (1 clock). Simultaneously, all lcFF1
    are turned ON: lcFF1 <- ON via CDB.

 2) Compare the characters of S and P and turn off PE's in which no
    match is found: A_m <- PEM(X_1); lcFF1 <- (A_m = CDB(p_j)); Increment
    X_1; Enable function is attributed to lcFF1 ON (4 clocks).

 3) Step 2 is repeated n_p times, for j = 1, 2, ..., n_p. At the end
    of this loop, lcFF1 is ON only if there was a match.

 4) Count the match by incrementing A_c in PE's in which there was
    a match. A_c is initially zero; increment A_c, enabled by lcFF1
    ON (1 clock).

 5) Go to step 1. The whole procedure is repeated n_c times (once for
    each character position in the chunk), using as s_i:
    s_0, s_1, ..., s_{n_c}.

 6) At the end of the chunk, A_c contains the number of matches in
    each PE; no overflow is possible since A_c can store up to 4K
    and only 2K matches are possible if n_c = n_c maximum = 2K. A
    log-sum of the contents of A_c is then performed and the final
    total may be sent to CU via CAB.
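
 The six steps above can be illustrated with a small sequential emulation.
 This is only a sketch under stated assumptions, not the SPEAC microcode:
 each PE is modeled as one list entry, the function name
 count_occurrences_chunked is hypothetical, the source length is assumed to
 divide evenly by the number of PE's, and the final log-sum is replaced by
 an ordinary sum.

```python
def count_occurrences_chunked(source, pattern, num_pes):
    """Emulate the chunk-storage counting algorithm; one list entry per PE."""
    n_p = len(pattern)
    n_c = len(source) // num_pes  # chunk length (assumes exact division)

    # Each PE memory holds its own chunk plus the first n_p - 1 characters
    # of the next chunk, which is the effect of the initial left routings.
    pem = [source[i * n_c:i * n_c + n_c + n_p - 1] for i in range(num_pes)]

    a_c = [0] * num_pes  # per-PE match counters (register A_c)
    for s in range(n_c):  # step 5: try every start position in the chunk
        flag = [True] * num_pes  # step 1: all lcFF1 turned ON
        for j in range(n_p):  # steps 2-3: broadcast p_j and compare
            for pe in range(num_pes):
                pos = s + j
                if flag[pe]:  # comparison is enabled only where lcFF1 is ON
                    in_range = pos < len(pem[pe])
                    flag[pe] = in_range and pem[pe][pos] == pattern[j]
        for pe in range(num_pes):  # step 4: count where the flag survived
            if flag[pe]:
                a_c[pe] += 1
    return sum(a_c)  # step 6: stands in for the log-sum


if __name__ == "__main__":
    print(count_occurrences_chunked("abababab", "aba", 4))
```

 Overlapping occurrences are counted, and occurrences that straddle a chunk
 boundary are found because of the extra n_p - 1 characters held by each PE.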
 The kernel in the algorithm above can now be timed; step 2 is repeated
 n_p times and the loop is repeated n_c times, for a total of n_c(4n_p + 1)
 clocks. The initialization takes 20(n_p - 1) clocks and the finalization takes
 ten routings of type 1 and ten additions of 16-bit unsigned numbers for the
 log-sum, for a total of 150 clocks. Therefore:

 Total time = (n_p - 1)20 + n_c(4n_p + 1) + 150 clocks.

 For n_c = 1K and n_p = 5, the total time to search one million characters for
 a match is only 2.1 msec.
 
 b) Storage in chunks; the location of each occurrence is required.
 The algorithm is very similar to the one in case a, but now X_2 is
 also used to hold the location in PEM where the address of the next occurrence
 will be stored. Step 4 is replaced by the following:

 4) Store the occurrence of the match in each PE by writing the
    address of s_i, the first character of the occurrence, in PEM(X_2);
    X_2 is then incremented by one. Since the whole step is enabled
    only in PE's in which there was a match, each list of occurrences
    is compact, with no vacant locations: PEM(X_2) <- CDB(address of
    s_i); Incr X_2; attribute enable function to lcFF1 ON. Three PEM
    writes are needed since an address has three digits.

 The new step 4 takes 12 clocks and the new total time is:

 Total time = (n_p - 1)20 + n_c(4n_p + 12) + 150 clocks.

 For n_c = 1K and n_p = 5, the total time is now 3.2 msec.
 
 c) Storage across PE's; only a count of the number of occurrences
 is required. Since in this storage scheme adjacent digits are in adjacent
 PE's, left routings of distance 1 are needed between comparisons. There
 is also a problem with the right-most PE's; at a routing, these PE's should
 receive characters from the next row of characters rather than end-around
 characters from the present row. For each row of characters, the algorithm
 is as follows:

 1) Load in A_m the characters of the old next row (the present row)
    which are in sM(0): A_m <- sM(0) (1 clock).

 2) Fetch the next row of characters from PEM and store in sM(0):
    sM(0) <- PEM(X_1) where X_1 contains the address of the characters
    in the next row (3 clocks). Simultaneously, all lcFF1 are
    turned ON via CDB.

 3) Compare character in A_m with the first character of P, sent via
    CDB; the result is stored in lcFF1, enabled by lcFF1 ON:
    lcFF1 <- (A_m = CDB(p_1)) (1 clock).

 4) Store A_m in B to prepare for the routing: B <- A_m (1 clock).

 5) Replace B by sM(0) only in the first PE. In this way, B will
    contain the row needed for routing. A <- sM(0) enabled only
    in the first PE (2 clocks).

 6) Route 1 character left, distance 1 from B to A_m (20 clocks).

 7) Same as step 3 but using p_2.

 8) Repeat steps 4 through 7 (n_p - 1) times. For the i-th execution,
    character p_{i+1} of P is used and the first i PE's are enabled in
    step 5. Therefore, n_p - 1 different patterns are needed to enable
    PE's in step 5. Since n_p is small and each pattern takes only 1
    bit per PE, these patterns may be stored in sM and enabling
    takes only 1 clock.

 9) lcFF1 is now ON only if a match occurred; A_c is incremented to
    store this fact: Incr A_c enabled by lcFF1 ON (1 clock).

 The whole algorithm is repeated once for each row. At the end of
 the procedure, a log-sum of A_c is performed to obtain the total number of
 occurrences. This takes 150 clocks. The total time to process n_r rows is
 then:

 Total time = n_r(6 + 24(n_p - 1)) + 150 clocks.

 To analyze one million characters when n_r = 1K and n_p = 5, 10.2 msec are
 required. Therefore, this algorithm is about five times slower than the one
 for chunk storage.
 
 d) Storage across PE's; the location of each occurrence is required.
 Only a small modification is needed in the algorithm of case c,
 similar to the modification introduced in case b. Instead of using A_c to keep
 the number of matches, X_2 is used to keep the address in PEM where the address
 of the next match will be stored. This step adds 12 clocks per row to the
 algorithm of case c, thus yielding a total time of:

 Total time = n_r(17 + 24(n_p - 1)) + 150 clocks.

 Or, for 1K rows and n_p = 5, 11.3 msec.
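
 The four timing formulas of cases a through d can be checked numerically
 with a few lines of code. The sketch below is illustrative only. It assumes
 a PE clock period of 100 nsec (consistent with the figure of 12 clocks for
 about 1.2 µsec quoted in Section 5.5) and takes 1K as 1000; the function
 names are hypothetical.

```python
CLOCK_NSEC = 100  # assumed PE clock period, not stated in this section


def clocks_chunk_count(n_c, n_p):
    """Case a: chunk storage, count only."""
    return 20 * (n_p - 1) + n_c * (4 * n_p + 1) + 150


def clocks_chunk_locate(n_c, n_p):
    """Case b: chunk storage, locations stored (step 4 takes 12 clocks)."""
    return 20 * (n_p - 1) + n_c * (4 * n_p + 12) + 150


def clocks_across_count(n_r, n_p):
    """Case c: storage across PE's, count only."""
    return n_r * (6 + 24 * (n_p - 1)) + 150


def clocks_across_locate(n_r, n_p):
    """Case d: storage across PE's, locations stored."""
    return n_r * (17 + 24 * (n_p - 1)) + 150


def msec(clocks):
    """Convert a clock count to milliseconds under the assumed clock."""
    return clocks * CLOCK_NSEC / 1e6


if __name__ == "__main__":
    for name, fn in [("a", clocks_chunk_count), ("b", clocks_chunk_locate),
                     ("c", clocks_across_count), ("d", clocks_across_locate)]:
        print(name, round(msec(fn(1000, 5)), 2), "msec")
```

 Under these assumptions the four cases come out near 2.1, 3.2, 10.2 and
 11.3 msec, in line with the figures quoted in the text.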
 
 Therefore, pattern matching can be performed very efficiently in
 SPEAC. One final sophistication to improve performance if the number of
 occurrences is small is the following: when testing for each possible match,
 gate lcFF1 to the interrupt wire after each comparison. If the CU receives a
 zero, this means that that match failed in all PE's and the present attempt
 can be abandoned without testing all the remaining digits of P. This step
 costs no extra time and could provide an impressive improvement for large
 values of n_p (i.e., n_p > 10).
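
 The early-abandonment idea can be sketched in a few lines. This is an
 illustration under assumptions, not the CU microprogram: the interrupt wire
 is modeled as the logical OR of all PE flags, each chunk is assumed to have
 the same length, matches crossing chunk boundaries are ignored for brevity,
 and the name count_with_early_exit is hypothetical.

```python
def count_with_early_exit(chunks, pattern):
    """chunks: one string per PE. Returns (matches, broadcasts_used)."""
    starts = min(len(c) for c in chunks) - len(pattern) + 1
    matches = 0
    broadcasts = 0
    for s in range(starts):
        flag = [True] * len(chunks)  # all lcFF1 turned ON
        for j, p in enumerate(pattern):
            broadcasts += 1  # one broadcast of p_j via CDB
            flag = [f and c[s + j] == p for f, c in zip(flag, chunks)]
            if not any(flag):  # interrupt wire reads zero: all PE's failed
                break  # abandon the remaining digits of P
        matches += sum(flag)
    return matches, broadcasts


if __name__ == "__main__":
    print(count_with_early_exit(["abcx", "xbca"], "bc"))
```

 In the small example, four broadcasts are used instead of the six that a
 full comparison at every start position would need.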
 
 5.5 Sparse Matrices

 The problem deals with the elimination of the need to store in PEM
 the zero elements of sparse matrices and the resulting problem of remembering
 in some form the positions of the non-zero elements in the actual matrix. The
 term actual matrix will be used to refer to a sparse matrix represented with
 its zeroes and actual row to refer to a row of such a matrix also with its
 zeroes. The form decided upon clearly must be useful in completing the task
 of sparse matrix multiplication. This section is concerned with describing
 two forms of storing sparse matrices for SPEAC, discussing their program
 adaptability, and demonstrating their use in programming.
 
 The two general forms for storing sparse matrices are the individual-tag
 method and the bit-matrix method [11]. These two methods are similar in
 that for both, the non-zero elements of a matrix are stored in the same way;
 for a sparse matrix A, 1024 x 1024, the j-th column is stored in PE_j and
 zeroes are eliminated by pushing each non-zero number up the column until no
 zero elements remain between it and the next higher non-zero element, if one
 exists.
 
 1) The bit-matrix method consists of storing a 1 or a 0 bit
    for each element of the actual matrix depending on whether an
    element is non-zero or zero respectively. The result of this
    procedure is a matrix with the same dimensions as the actual
    matrix, but which requires less space to store in memory since
    each element of this matrix is only a bit wide. These bits are
    stored packed four in each digit and require 256 digits in each
    PEM. The LSB in this string B of 256 digits (1024 bits) in PE_j
    indicates whether a_1j is zero or not; in general, the i-th bit
    in the string in PE_j refers to element a_ij. This method allows
    very efficient reconstitution of the actual rows but may still
    need too much storage space if the matrix is very large and
    very sparse. In this case, the following method is used instead.
 
 2) The individual-tag method associates with each non-zero element
    of a matrix A a related positive integer t, called a tag.
    A tag matrix is constructed in which t_ij is zero if a_ij is zero
    and t_ij = i if a_ij is non-zero. The tag matrix is then stored
    with column j in PE_j and compacted in the same way used to
    compact A. Therefore, PEM_j will contain two strings of numbers:
    a_1, a_2, ..., a_{n_j} and t_1, t_2, ..., t_{n_j}, where n_j is the
    number of non-zero elements of A in column j. a_i is the i-th
    non-zero element in column j of A; if this is element a_kj, then
    t_i = k. Each element t takes only three digits for matrices up to
    4K x 4K. Note that n_j is normally different for each column of A
    but hopefully, if the zero elements of A are randomly distributed,
    no large variations exist between the number of non-zero elements
    in two columns.
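
 The two storage forms can be made concrete with a short sketch that builds
 both representations for one column (the contents of one PEM). This is an
 illustration only: the helper names compact_column and pack_bits are
 hypothetical, and packing the bits four per digit with the first row in the
 least significant bit is an assumption based on the description of the LSB
 above.

```python
def compact_column(column):
    """Return (values, bits, tags) for one actual column of the matrix."""
    values = [x for x in column if x != 0]        # compacted non-zero elements
    bits = [1 if x != 0 else 0 for x in column]   # bit-matrix method
    tags = [i for i, x in enumerate(column, start=1) if x != 0]  # tag method
    return values, bits, tags


def pack_bits(bits):
    """Pack bits four per digit, first row in the least significant bit."""
    digits = []
    for k in range(0, len(bits), 4):
        group = bits[k:k + 4]
        digits.append(sum(b << n for n, b in enumerate(group)))
    return digits


if __name__ == "__main__":
    values, bits, tags = compact_column([0, 5, 0, 0, 7, 3, 0, 0])
    print(values, tags, pack_bits(bits))
```

 For a 1024-row column, pack_bits yields the 256 digits mentioned in the
 text, while the tag list grows only with the number of non-zero elements.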
 The problem of multiplying two sparse matrices stored in either of
 the methods above is now considered. The broadcast method of multiplication
 (see Section 5.3) is used. Therefore, the only extra procedure needed is an
 efficient way to reconstruct the actual rows of the matrices. This is the
 purpose of the algorithms now described.
 
 a) Expand in actual rows a sparse matrix stored according to the
 bit-matrix method. The rows must be expanded in order, from the first to the
 last. Fortunately, this is the order in which they are used in the broadcast
 method. Initially, the first digit b of the bit string B is fetched from PEM
 in each PE and stored in sM(0). The address of the first element of each
 compacted column (i.e., the address of a_1) is sent via CAB to X_1. When each
 digit of the elements of the first row must be fetched, the PE's are enabled
 by the LSB of sM(0) during both the fetch and the subsequent increment of X_1
 to point to the next digit. If the register to which the fetch is made is
 initially zeroed, the register will contain the correct row element after the
 fetch. X_1 is also kept pointing to the appropriate element of the compact
 column since it is not advanced in PE_j when the actual row had a zero in
 column j. For the fetches of the three next rows, the three next bits of b
 are used as enabling bits and then the next element b of B is fetched, and so
 on. The extra time required to fetch the elements of B is probably easily
 overlapped with PE multiplications, and the time to multiply two sparse
 matrices A x B (each 1024 x 1024) stored according to the bit-matrix method
 is D_A x 4.3 sec, where D_A is the density of matrix A. Obviously, CU can
 analyze the broadcast elements and avoid broadcast of each zero element,
 which decreases the multiplication time proportionately to the density of
 matrix A. It should be noticed that the optimum reduction factor is not
 simply D_A but D_A x D_B. It is possible to devise an algorithm that achieves
 a reduction in time approaching the optimum value [13]; i.e., the algorithm
 also takes advantage of the sparseness of B to reduce multiplication time.
 However, the procedure is quite complex and will not be discussed here. It is
 also easy to see that the rows of the result can easily be compacted in the
 same bit-matrix representation if need be (i.e., if the product matrix is
 also sparse).
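
 The row expansion for the bit-matrix method can be sketched as follows.
 This is a sequential illustration, not the PE microsequence: each PE is a
 list index, the per-PE pointer plays the role of X_1, and the name
 expand_rows_bit_matrix is hypothetical.

```python
def expand_rows_bit_matrix(values_per_pe, bits_per_pe, n_rows):
    """Rebuild the actual rows from compacted columns and their bit strings."""
    pointers = [0] * len(values_per_pe)  # X_1 in each PE
    rows = []
    for i in range(n_rows):  # rows must be expanded in order
        row = []
        for pe, (values, bits) in enumerate(zip(values_per_pe, bits_per_pe)):
            if bits[i]:  # PE enabled by its bit for this row
                row.append(values[pointers[pe]])
                pointers[pe] += 1  # pointer advances only on a fetch
            else:
                row.append(0)  # register was zeroed before the fetch
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Columns of [[1, 0], [0, 2], [3, 4]], one column per PE.
    print(expand_rows_bit_matrix([[1, 3], [2, 4]], [[1, 0, 1], [0, 1, 1]], 3))
```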
 
 b) Expand in actual rows a sparse matrix stored according to the
 individual-tag method. As in case a, the rows must be expanded in order. In
 this case, however, the expansion procedure is less efficient. Initially, the
 first tag t is fetched from PEM and compared for equality with the row number
 (i.e., one for the first row) sent via CDB. This fetch and comparison takes
 about 12 clocks since three digits must be compared and one of the operands is
 broadcast and does not have to be fetched. The result of the comparison, left
 in lcFF1, is then used to enable the fetch from PEM address X_1 and the
 subsequent increment of X_1. Therefore, an additional 1.2 µsec is needed to
 fetch each row in the individual-tag method. This cannot be overlapped with
 multiplications, as in case a, because the arithmetic part of the PE must be
 used for the comparison.
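
 The tag-based expansion differs from the bit-matrix one only in how each PE
 decides whether it holds an element of the current row. A minimal sketch,
 with the same modeling assumptions as before (one list index per PE, a
 hypothetical function name, and the broadcast row number represented by the
 loop variable):

```python
def expand_rows_tags(values_per_pe, tags_per_pe, n_rows):
    """Rebuild the actual rows from compacted columns and their tag lists."""
    pointers = [0] * len(values_per_pe)  # next unread element in each PE
    rows = []
    for row_number in range(1, n_rows + 1):  # row number sent via CDB
        row = []
        for pe, (values, tags) in enumerate(zip(values_per_pe, tags_per_pe)):
            k = pointers[pe]
            # Compare the next tag with the broadcast row number (lcFF1).
            if k < len(tags) and tags[k] == row_number:
                row.append(values[k])
                pointers[pe] += 1
            else:
                row.append(0)
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Columns of [[1, 0], [0, 2], [3, 4]] with tags giving the row of each value.
    print(expand_rows_tags([[1, 3], [2, 4]], [[1, 3], [2, 3]], 3))
```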
 
 6. CONCLUSIONS
 
 The concept of an array computer with a very large number of relatively
 simple processing elements has been proven feasible; the PE hardware
 was described in great detail and the sections on operations and applications
 show that this hardware can be used quite efficiently. Obviously, several
 problems remain to be studied and the following considerations analyze these
 problems and offer some suggestions for further research.

 Two areas are considered: 1) problems related to SPEAC in particular,
 and 2) problems related to the general architecture of array computers
 with many processing elements.
 
 With respect to SPEAC in particular, the PE hardware has been
 painstakingly refined and optimized as far as one can get without an actual
 commitment to build the machine; a few questions remain to be answered and
 final "tuning" of the PE hardware must be performed, but these could be
 accomplished only with definite cost figures to analyze the cost-efficiency
 of different alternatives. Some of these alternatives were discussed in the
 section on implementation. A few specific points are:
 
 a) The scratchpad memory sM introduced in the PE at a late stage in
    development has proven to be an impressive improvement, making
    possible a reduction by a factor of two to three in the times of
    floating-point operations. The study of applications also revealed
    that an increase in the capacity of sM will improve the
    performance in several areas. Therefore, the final size of sM
    must be carefully determined to optimize cost-efficiency. It
    is also interesting to notice that sM has performed so well
    because of the relatively large values attributed to PEM access
    and cycle times (300 nsec and 500 nsec respectively). It now
    appears that these values are unduly pessimistic and, depending
    on the final times obtained, the importance of sM will decrease
    and sM may be eliminated altogether.
 
 b) CU architecture was only sketched and a much more detailed design
    would be needed if the machine were to be built. Specifically,
    the system of two queues for PE operation did not result
    in any substantial improvement for most operations. Since the
    system is quite expensive to implement and introduces serious
    complications in microprogramming, it should be dropped and only
    three queues used: one for I/O, one for PE, and one for CU
    instructions.
 
 c) The possibility of overlapping PE instructions with I/O or CU
    instructions has proven very valuable in several applications.
    The system should be refined as suggested in Section 5.3.b to
    allow overlap not only in the use of PEM, but also in the use
    of sM.
 
 d) Final minimizations in the number of connections and the number
    of chips per PE must be performed in view of the state of the
    art in integrated circuitry at the time of implementation. This
    field has advanced so rapidly that the picture has changed
    substantially within the last year. Specifically, one would need
    data about MOS - T^2L relative performance, equivalent gate
    densities obtainable per chip and cost of custom-built chips.

 With respect to the field of array computers with a large number of
 processing elements, the following considerations are offered:
 
 a) Software development for an array computer is a troublesome 
 area as demonstrated by the arduous and sometimes frustrated 
 efforts to develop a high-level language for ILLIAC IV. This 
 was probably to be expected if one takes as a parallel the 
 development of high-level software for sequential computers; it 
 started only after a decade of painstaking machine-language 
 programming. The lapse in the case of array computers should 
 be much shorter since a whole body of knowledge about languages 
 does exist and will be used as a basis. Nevertheless, array 
 computer users seem to be condemned to a few years of assembly- 
 language programming while software researchers gain the insight 
 and experience needed to provide efficient and reliable high- 
 level compilers. 
 
 It was expected at the beginning of this research that program- 
 ming SPEAC would be one order of magnitude more difficult than 
 programming ILLIAC IV just as programming ILLIAC IV is one order
 of magnitude harder than programming conventional computers. 
 Fortunately this has not been the case; programming SPEAC has 
 been about as difficult as programming ILLIAC IV. Of course, 
 this was mainly due to the fact that the size of the sample prob- 
 lems was selected to facilitate programming. The problem be- 
 comes more difficult when problems "smaller" than the size of 
 the array must be handled efficiently and this is more and more 
 frequent as the number of PE's increases. 
 If large array computers are to perform the role that is
 expected of them, the user must be spared the task of knowing
 what each specific PE is doing, much in the same way as in 
 conventional computers the user has been spared the task of 
 keeping track of absolute memory addresses. An initial step in 
 this direction is provided by N. R. Lincoln. In a recent paper 
 [10], he proposes a radically new technique for using array 
 computers in such problems as compiling, which have so far been 
 considered typically non-parallel (that is, unsuitable for these 
 machines). Such techniques, if successful, could increase tre- 
 mendously the area of application of SPEAC. The study of the 
 performance of SPEAC in pattern matching problems, which was
 discussed in Section 5.4, has shown that it can perform very
 efficiently the basic tasks required in Lincoln's scheme.
 
 b) One very promising idea has been recently proposed to help 
 solve the problem of handling efficiently problems "smaller" 
 than the size of the array in computers of the type of SPEAC. 
 It consists of linking groups of PE's together in a hardware- 
 implemented fashion so that a group of PE's would be able to 
 function as a single PE with speed roughly proportional to the 
 number of actual PE's in the group. The problem is reasonably 
 complex and will require considerable research but the possi- 
 bilities are far-reaching; this method would not only make it 
 much easier to use efficiently computers of the scale of SPEAC, 
 but it would also make practical array computers with tens and 
 even hundreds of thousands of very simple PE's. 
 
 c) Finally, one very long-range research project would be to
    investigate how far one could go with the number of elements in a
    parallel processor. The approach described above allows one to
    envision a processing unit composed of many similar "PE's"
    linked together in a fail-soft configuration, much like the
    individual cells in a brain. If one PE fails, the only immediate
    effect would be a slight reduction in the speed of the
    processor as a whole.
 
 APPENDIX A 
 PACKAGE LOGICAL DIAGRAMS 
 
 [Logic diagram: data inputs, data select (address) inputs, output W.]

 Package 1. One-out-of-eight Selector without Strobe

 [Logic diagram: four D flip-flops with enable clock input E and clock input C.]

 Package 2. Quad Type D Flip-flop with Enable on the Clock

 [Logic diagram: D flip-flop with enable clock input E and clock input C.]

 Package 3. Type D Flip-flop with Enable on the Clock

 [Logic diagram: data inputs and address inputs.]

 Package 4. One-out-of-four Selector without Strobe

 [Logic diagram: data inputs, address inputs, enable output E.]

 Package 5. One-out-of-three Selector with Enable Decoding
 
 [Logic diagram: A, B and D inputs, enable input E, clock input C,
 clock output C and interrupt output I.]

 Note: The function is as follows for each lcFF:

 A  B  Function
 0  0  Do nothing, i.e., the lcFF is not used
 1  0  Use the lcFF to control the interrupt wire; the interrupt wire will
       assume the logical level of the lcFF
 0  1  Enable the PE (i.e., allow the clock to reach the registers) when the
       lcFF contains a ZERO
 1  1  Enable the PE when the lcFF contains a ONE

 Package 6. Enable and Interrupt Control
 
 [Logic diagram not recoverable.]

 Package 7. PEM - 1 Module

 [Logic diagram: memory enable input, select inputs.]

 Package 8. 64-bit Scratchpad Memory
 (16 4-bit Words)

 [Logic diagram: outputs G or Y (not used), C_{n+4}, P or X (not used).]

 Package 9. Arithmetic/Logic Unit

 [Logic diagram: data inputs, data select (address) input A, output.]

 Package 10. One-out-of-two Selector without Strobe
 
 [Logic diagram: clock input C, down/up input M, enable input G, load input L,
 data inputs D_0 through D_3; ripple clock output (not used), max/min output
 C_{n+12}, outputs Q_0 through Q_3.]

 Note: When cascading, G input goes to least significant hexadecimal digit and
 C_{n+12} output comes only from most significant hexadecimal digit; G_{i+1}
 is connected to (C_{n+12})_i for all not externally connected G and C_{n+12}.

 Package 11. 4-bit Up/Down Counter, Parallel In/Out
 
 [Logic diagram: strobe input S, data inputs, data select (address) inputs.]

 Package 12. One-out-of-four Selector with Strobe

 [Logic diagram: data inputs.]

 Package 13. Quad Inverter
 
 [Logic diagram: data inputs, input carry, outputs, output carry.]

 Note: When cascading, C_{n+12} output comes only from most significant
 package; C_n input to the least significant package is "1." Input
 (C_n)_{i+1} is connected to output (C_{n+12})_i.

 Package 14. Increment-by-one Network (16 bits)
 
 [Logic diagram: data inputs, select (address) inputs, output W.]

 Package 15. One-out-of-five Selector without Strobe
 
 APPENDIX B 
 MICROSEQUENCE FOR 32-BIT FLOATING-POINT MULTIPLICATION 
 
 This is a detailed listing of the microsequences sent by CU to each
 PE to perform the multiplication: a x b = c, where each number is in the
 following format:

 a_7 a_6 a_5 a_4 a_3 a_2 a_1 a_0

 a_0 is the mantissa LSD (least significant digit)
 a_5 is the mantissa MSD (most significant digit)
 a_60 (i.e., the low order bit of a_6) is the mantissa sign bit
 a_7, a_63, a_62 and a_61 constitute the exponent; a_61 is the low order bit
 of the exponent.

 The exponent is in excess notation and the mantissa in sign and magnitude.
 The exponent base is 16. a_0 is the low address in the PEM.
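
 The field layout just described can be illustrated by unpacking a word into
 its three fields. This is a sketch under assumptions: the eight digits are
 held in one 32-bit integer with a_7 as the most significant hexadecimal
 digit (the text only fixes the PEM address order), and the function name
 unpack_fields is hypothetical. No numeric value is computed, since the
 excess bias and the position of the radix point are not restated here.

```python
def unpack_fields(word):
    """Split a 32-bit word into (sign, exponent_field, mantissa_field)."""
    digits = [(word >> (4 * k)) & 0xF for k in range(8)]  # digits[k] is a_k
    mantissa = word & 0xFFFFFF               # six digits a_5 ... a_0
    sign = digits[6] & 1                     # a_60, low order bit of a_6
    # Seven exponent bits: a_7 followed by a_63, a_62, a_61.
    exponent = (digits[7] << 3) | (digits[6] >> 1)
    return sign, exponent, mantissa


if __name__ == "__main__":
    sign, exponent, mantissa = unpack_fields(0x83123456)
    print(sign, exponent, hex(mantissa))
```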
 The following abbreviations are used in the microsequences:

 A <- B which means that register A is loaded with the contents of
    register B.
 sM(X) or PEM(X) which means the contents of the location with address
    X in sM or PEM; X can be a literal or a register, in
    which case the contents of the register are taken as the address.
    When X is a literal, it is sent via CAB.
 CAB(a) or CDB(a) which means that data a is sent via the common bus.
 En(i,ON) or En(i,OFF) which means that the enable function is attributed
    to lcFFi ON or OFF.

 Each microsequence is numbered with two PE clock counts: maximum and
 minimum. The minimum count assumes that the two buses are available and
 maximum overlap can be achieved; the maximum count assumes that only one bus
 is available at all times for PE operation. CAB and CDB are assumed always
 available.
 
 Max.   Min.
 clock  clock  Microsequence                        Comments

    1      1   X_1 <- CAB(address(a_0))             Address registers are loaded with
                                                    mantissas' LSD addresses
    2      2   X_2 <- CAB(address(b_0))
    3      3   A <- PEM(X_1); sM(0) <- PEM(X_1)     PEM read; takes 3 clocks
    6      3   B <- PEM(X_2)                        If overlap is possible, one extra
                                                    clock is needed to store in sM
    8      6   sM(6) <- PEM(X_2)
    9      6   A_m <- CDB(0); A_c <- CAB(0);        Ready to start multiplication; X_1,
               Incr X_1; Incr X_2                   X_2 are ready to access the next
                                                    digits
   10      7   MF(1, X_1, *, 1, *)                  See note a for the meaning of MF;
                                                    m_0 is completed
   15     12   MF(7, X_2, 7, 0, S)
   20     17   MF(2, X_1, 6, 2, *)                  m_1 is completed
   25     22   MF(*, *, 7, 1, S)
   30     27   MF(8, X_2, 8, 0, S)
   35     32   MF(3, X_1, 6, 3, *)                  m_2 is completed
   40     37   MF(*, *, 7, 2, S)
   45     42   MF(*, *, 8, 1, S)
   50     47   MF(9, X_2, 9, 0, S)
   55     52   MF(4, X_1, 6, 4, *)                  m_3 is completed
   60     57   MF(*, *, 7, 3, S)
   65     62   MF(*, *, 8, 2, S)
   70     67   MF(*, *, 9, 1, S)
   75     72   MF(10, X_2, 10, 0, S)
   80     77   MF(5, X_1, 6, 5, *)                  m_4 is completed
   85     82   MF(*, *, 7, 4, S)
   90     87   MF(*, *, 8, 3, S)
   95     92   MF(*, *, 9, 2, S)
  100     97   MF(*, *, 10, 1, S)
  105    102   MF(11, X_2, 11, 0, S)
  110    107   ML(7, 5, 0)                          See note b for the meaning of ML;
                                                    m_5 is loaded in sM(0)
  116    113   MF(7, X_1, 8, 4, S)                  a_6 is loaded in sM(7) for future
                                                    use
  121    118   MF(*, *, 9, 3, S)
  126    123   MF(*, *, 10, 2, S)
  131    128   MF(*, *, 11, 1, S)
  136    133   ML(8, 5, 1)                          m_6 is loaded in sM(1)
  142    139   MF(8, X_1, 9, 4, S)                  a_7 is loaded in sM(8) for future
                                                    use
  147    144   MF(*, *, 10, 3, S)
  152    149   MF(*, *, 11, 2, S)
  157    154   ML(9, 5, 2)                          m_7 is loaded in sM(2)
  163    160   MF(9, X_2, 10, 4, S)                 b_6 is loaded in sM(9) for future
                                                    use
  168    165   MF(*, *, 11, 3, S)
  173    170   ML(10, 5, 3)                         m_8 is loaded in sM(3)
  179    176   MF(10, X_2, 11, 4, S)                b_7 is loaded in sM(10) for future
                                                    use
 
157 
 
 _ £i X 
 
 SO BO 
 
 3 o 3 o 
 
 •HO -HO 
 
 |^ i! S Microsequence 
 
 
 18k 
 190 
 
 191 
 192 
 
 196 
 
 197 
 198 
 
 199 
 
 205 
 211 
 217 
 223 
 ?29 
 >35 
 Ikl 
 
 06 
 
 12 
 
 18 
 
 181 
 
 187 
 188 
 189 
 
 193 
 193 
 19^ 
 
 195 
 
 201 
 207 
 213 
 219 
 225 
 231 
 237 
 
 200 
 
 201 
 206 
 
 207 
 
 ML(ll, 5, h) 
 
 ML(*, *, 5) 
 
 X ± *- CAB( address (c )) 
 
 X 2 - CAB(0) 
 
 lcFFl «- (A m =CDB(0)) ; sM(6) «- A 
 A <- CDB(0) 
 
 rn 
 
 m 
 
 En(l,0N); A m - CDB(0010); 
 Incr X 
 
 ST 
 
 ST 
 
 ST 
 
 ST 
 
 ST 
 
 ST 
 
 ST; wait on Event #1 
 
 ST j wait on Event #2 
 
 B - sM(7); A - CAB(O) 
 
 A 4- (B-Aj; C =1; lcFFi+ *- C , 
 m tor ' n ' n+4 
 
 Shift A r , A m right 1+; B ♦- sM(8) 
 
 A m*- (B - A m ^ c n = lc ^ 
 
 Comments 
 
 m. 
 
 .q is loaded in sM(U) 
 rrL is loaded in sM(5) 
 
 X 1 and X 2 are prepared to write 
 the result 
 
 m 
 
 ,, is loaded in sM(6) 
 
 See note c 
 
 c Q is stored in PEM; see note d 
 for the meaning of ST 
 
 c is stored in PEM 
 
 c is stored in PEM 
 
 c is stored in PEM 
 
 c. is stored in PEM 
 
 c._ is stored in PEM 
 
 cv is stored in PEM; see note e 
 
 c is stored in PEM 
 
 Exponent computation starts now 
 
 B is loaded with LSD of exponent 
 of a 
 
 A is still as in note c 
 m — 
 
 B is loaded with MSD of exponent 
 of a 
 
158 
 
 lo 3 o 
 a h a h 
 
 .3 O -HO 
 
 5 w Micro sequence 
 
 ==1 n, 
 
 
 212 Shift A left ^j B 
 
 sM(9) 
 
 (A.0B) 
 
 (A AND CDB(lllO)) 
 m 
 
 ( A m +B)s C rT°> lcFFif *" C nA 
 
 <_ A • cause Event #1 
 nr 
 
 Shift A m right k; B «- sM(lO) 
 
 2l+6 
 
 2l+8 
 2i+9 
 
 250 
 251 
 251 
 
 225 
 
 A -(A+B); C =lcFF^; 
 
 m m ' 
 
 lc¥F k +. c n+k 
 
 226 A - (A eCDB(lOOO)); shift A 
 
 right *+ 
 230 J sM(8) -A m j cause Event #2; 
 
 shift A m left U 
 
 231 
 
 232 
 233 
 2^1 
 
 a A , A «- LC 
 
 V V m o 
 
 lcFFl <- (A m =LC) 
 Interrupt on lcFFl ON 
 
 Comment s 
 
 Now have in A q , A^ exp(a)-l if 
 
 m =0 and exp(a) if n^]/ 
 
 A now contains the sign of c 
 
 '0 
 
 Set sign hit to zero in A 
 
 m 
 
 A now contains LSD of exp(c) 
 m 
 
 A has MSD of exp(a) and B has 
 m 
 
 MSD of exp(b) 
 
 Correct sum in excess notation f 
 complementing MSB 
 
 Start detection of exponent over 
 flow or underflow; see note f 
 
 End of the operation 
 
159 
 
 Notes:

 a) MF(a, b, c, d, S) is defined as the following set of five microsequences:

    1) Add and shift; sM(a) <- PEM(b)                     (portion I)
    2) Add and shift
    3) Add and shift
    4) Add and shift; Incr (b); B <- sM(c)                (portions II, III)
    5) A_r <- sM(d); shift A_c, A_m left 4                (portion IV)

 If a and b are *'s then portions I and II are absent; if c is a * then
 portion III is absent; if S is replaced by a * then portion IV is absent.
 MF can perform the following: a) multiply two digits, b) fetch from PEM and
 store in sM a digit to be used in the next multiplication, and c) load A_m
 and B with the two digits needed in the next multiplication.

 b) ML(a, b, c) is defined as the following set of six microsequences:

    1) Add and shift
    2) Add and shift
    3) Add and shift
    4) Add and shift; B <- sM(a)                          (portion I)
    5) sM(c) <- A_m
    6) A_r <- sM(b)                                       (portion II)

 If a is a * then portion I is absent; if b is a * then portion II is
 absent. ML multiplies two digits, stores the MSD of the product in sM and
 loads A_m and B with the two digits needed in the next multiplication.

 c) At this stage, X_2 points to m_5 (in sM(0)) if m_11 = 0 and to m_6 (in
 sM(1)) if m_11 /= 0. Therefore, X_2 points to c_0. Also, A_m contains 0000 if
 m_11 /= 0 and 0010 if m_11 = 0 to prepare for the correction in the exponent.

 d) ST is defined as the following set of six microsequences:

    1) PEM(X_1) <- sM(X_2)
    2) Wait for writing in PEM
    3) Wait for writing in PEM
    4) Wait for writing in PEM
    5) Wait for writing in PEM
    6) Incr X_1, Incr X_2

 ST stores the digits of the product in PEM. This is overlapped as much as
 possible with the computation of the exponent.

 e) The wait in this microsequence assures that the exponent will be written
 in PEM only after it is computed.

 f) In excess notation addition, there is an overflow if the carry from the
 MSB is equal to the MSB of the sum before the necessary correction, which
 consists of complementing the MSB.
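
 The rule in note f can be checked with a short sketch. It is an
 illustration only, not the exponent microsequence itself: the field width of
 7 bits is an assumption taken from the number format (a_7 plus three bits of
 a_6), the bias is assumed to be half the field range, and the function name
 add_excess is hypothetical.

```python
def add_excess(e1, e2, bits=7):
    """Add two excess-notation fields; return (corrected_sum, out_of_range)."""
    mask = (1 << bits) - 1
    msb = 1 << (bits - 1)            # also the assumed bias
    raw = e1 + e2                    # plain binary addition of the two fields
    carry = raw >> bits              # carry out of the MSB
    msb_set = (raw & msb) != 0       # MSB of the sum before correction
    out_of_range = (carry == 1) == msb_set   # carry equal to MSB: out of range
    corrected = (raw & mask) ^ msb   # correction: complement the MSB
    return corrected, out_of_range


if __name__ == "__main__":
    bias = 64
    print(add_excess(bias + 3, bias + 2))    # true exponents 3 and 2
    print(add_excess(bias + 63, bias + 63))  # sum too large for the field
```

 With a bias of 64, adding the fields for exponents 3 and 2 gives the field
 for exponent 5 with no error flag, while adding two fields for exponent 63
 raises the flag.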
 
 APPENDIX C 
 MICROSEQUENCE FOR 32-BIT FLOATING-POINT ADDITION 
 
 This is a detailed listing of the microsequences sent by CU to each 
 PE to perform the addition: a + b = c. Number format, notation and abbre- 
 viations used are as listed in the introduction to Appendix B. 
 
163 
 
 So SO 
 
 3 o p o 
 
 e h a h 
 
 •HO -HO 
 
 Micro sequence 
 
 Comment s 
 
 10 
 13 
 15 
 16 
 
 17 
 18 
 
 19 
 20 
 21 
 
 22 
 23 
 21+ 
 
 25 
 
 8 
 
 8 
 
 11 
 
 12 
 
 13 
 14 
 
 15 
 16 
 
 IT 
 
 18 
 19 
 
 20 
 21 
 
 X «- CAB( address (a )) 
 
 X 2 - CAB(address (b )) 
 
 A *-PEM(xJ: sM(3) *-PEM(X.) 
 m 1 '1 
 
 B «- PEM(X 2 ) 
 
 sM(1) - PEM(X ) 
 
 lcFFl <- (A =B); lcFFU <- (A < B) j 
 m ' m " 
 
 Shift A left h, Deer X. , 
 c 1 
 
 Deer X 
 
 A m «- PEM(X 1 ); sM(2) «- PEM^) 
 
 B *- PEM(X ) 
 
 sM(0) <- PEM(X 2 ) 
 
 En(l,0N); lcFF*+ <- (A < B) ; 
 
 Shift A left k m 
 
 c 
 
 A *- (A AND CDB(OOOl)) 
 mm 
 
 Address registers are loaded with 
 address of MSD of the exponents 
 
 PEM read; takes 3 clocks 
 
 If overlap is possible, one extra 
 clock is needed to store in sM 
 
 Comparison of exponents starts 
 
 Read the LSD's of the exponents 
 
 Shift A right k: A 
 r ' m 
 
 A *- (A AND CDB(OOOl)) 
 
 B 
 
 m 
 
 m 
 
 lcFFl *- (A =A ) 
 m r 
 
 En(4,0N)j A «-sM(0); A c *- X 
 
 En(l+,0N); A -sM(l); X n - X 
 
 En(i+,0N); B «- sM(2); X 2 *- A Q 
 
 En(U,0N); sM(0) +- B; shift A 
 left k 
 
 En(4,0N); B *- sM(3); shift A 
 left k 
 
 lcFFU is now ON iff exp(a)> exp(b); 
 A contains exp(a) 
 
 All bits except sign are zeroed 
 
 All bits except sign are zeroed 
 
 lcFFl is now ON if sign(a)=sign(b) 
 
 Interchange exponents and addresses 
 in PE's in which exp(a)> exp(b) 
 
16U 
 
 Clock     Clock
 (normal)  (min.)  Microsequence                            Comments

   26        22    En(4,ON); sM(1) <- B
   27        23    Shift A right 4; sM(2) <- LC             See note a
   28        24    B <- sM(0)                               Exponent subtraction now starts
   29        24    A_m <- (A_m OR CDB(0001))                Sets sign bit to one so that it
                                                            does not interfere with subtraction
   30        25    A_r <- (B - A_m); C_n = 1;
                   lcFF4 <- C_{n+4}; B <- sM(1);
                   shift A right 4
   31        26    A_m <- (B - A_m); C_n = lcFF4;
                   A_c <- CAB(0)
   10         8    Decr X_1, Decr X_2                       These six clocks are overlapped
                                                            with previous ones; they make X_1
                                                            point to a and X_2 point to b
   15        11    Decr X_1, Decr X_2
   16        12    Decr X_1, Decr X_2
   17        13    Decr X_1, Decr X_2
   18        14    Decr X_1, Decr X_2
   19        15    Decr X_1, Decr X_2
   32        27    Shift A right 1; A <- X                  See note b
   33        28    lcFF1 <- (A_m = CDB(0)); B <- A_r
   34        29    lcFF2 <- CDB(0); A <- CDB(0)
   35        30    En(1,OFF); lcFF2 <- CDB(0010);           Ready now to perform mantissa
                   shift A right 4                          alignment; see note c
   36        31    A_m <- (A_m + B); C_n = 0;
                   lcFF4 <- C_{n+4}
   37        32    En(4,ON); Incr A_c
   38        33    Shift A left 4
   39        34    X_2 <- A_c                               Mantissa alignment completed
   40        34    A_c <- CAB(FFF - N_m + 1)                Prepare trap in A_c; see note d
 
 
 Clock     Clock
 (normal)  (min.)  Microsequence                            Comments

   41        35    Shift A right 4; A <- CAB(FFF)
   42        36    A_m <- (A_m + B); C_n = 0;               B still had the difference of the
                   lcFF4 <- C_{n+4}; B <- PEM(X)            exps; it is reloaded with the first
                                                            operand
   43        37    En(4,ON); En(2,OFF); Incr A_c;
                   lcFF2 <- C_{n+12}
   44        37    lcFF1 <- sM(2); shift A_c left 4         Trap is completed; lcFF1 is ON only
                                                            if signs are equal
   45        38    A_m <- PEM(X)                            Fetch the second operand
   48        41    ADFI(4)                                  The actual addition starts now; see
                                                            note e for the meaning of ADFI, ADF
                                                            and AD
   57        47    ADF(5)
   66        53    ADF(6)
   75        59    ADF(7)
   84        65    ADF(8)
   93        71    AD(9)                                    Addition completed; now find out
                                                            sign of result and if recomplemen-
                                                            tation is needed; see note f
   96        74    A_m <- LC; B <- LC
   97        75    sM(3) <- A; shift A right 4
   98        76    Shift A left 1
   99        77    A_m <- ((NOT A_m) AND B)
  100        78    lcFF1 <- A_m                             lcFF1 is ON if recomplementation is
                                                            needed
  101        78    B <- CDB(0)
  102        79    A_m <- sM(4)                             Ready to start recomplementation
  103        80    RCI(5,4)                                 See note g for meaning of RC and
                                                            RCI
  105        82    RC(6,5)
 
 
 Clock
 (normal)  Microsequence                            Comments

  107      RC(7,6)
  109      RC(8,7)
  111      RC(9,8)
  113      RC(*,9)                                  Recomplementation completed
  115      A_m <- sM(0)                             Now set up sign of result; i.e.,
                                                    change the sign of the exponent in
                                                    sM(0), sM(1) if recomplementation
                                                    was needed
  116      En(1,ON); A_m <- (A_m XOR CDB(0001))
  117      sM(0) <- A_m
  118      A_m <- sM(3); B <- sM(3)                 sM(3) contains MSB ON if there was
                                                    final output carry and LSB ON if
                                                    sign(a) = sign(b)
  119      Shift A_c left 4; X_1 <- CAB(FFF)
  120      Shift A_m right 1; lcFF1 <- CDB(0001)
  121      A_m <- (A_m AND B)
  122      lcFF4 <- A_m                             lcFF4 is now ON if there was an
                                                    "overflow"
  123      A_m <- sM(1); A_c <- CAB(0)
  124      Shift A_c left 4; A_m <- sM(0)
  125      Shift A_c left 4; A_r <- CDB(0)
  126      Shift A right 1; sM(10) <- CDB(0001)
  127      X_2 <- A_c; A_m <- sM(9)                 X_2 now contains exp(a) without the
                                                    sign
  128      CZ(8)                                    See note h for the meaning of CZ

 Clock (minimum), as listed on this page: 90, 92, 93, 95, 96, 97, 98, 99, 99,
 100, 101, 102, 103, 104
 
 
 Clock     Clock
 (normal)  (min.)  Microsequence                            Comments

  130       106    CZ(7)
  132       108    CZ(6)
  134       110    CZ(5)
  136       112    CZ(4)
  138       114    CZ(*)
  140       116    En(4,ON); Incr X                         This adds 1 to the exp if there was
                                                            "overflow"
  141       117    A_c <- X_2; A_r <- CDB(0);
                   A_m <- CDB(0)
  142       118    Shift A right 4
  143       119    Shift A left 1
  144       120    A_m <- sM(0)                             Insert the sign back in the expo-
                                                            nent
  145       121    sM(0) <- A_m
  146       122    Shift A right 4
  147       123    sM(1) <- A_m; shift A right 4;           Final exponent is now in sM(0),
                   B <- CDB(0)                              sM(1); prepare to detect exponent
                                                            overflow or underflow
  148       124    lcFF1 <- (A_m = B); A_c <- X_1
  149       125    Interrupt on lcFF1 OFF;                  lcFF1 OFF means exponent overflow
                   X_2 <- CAB(4)                            or underflow
  150       126    En(4,ON); X_2 <- CAB(5)
  151       127    Incr A_c; lcFF2 <- C_{n+12};             Ready to start storing the result
                   X_1 <- CAB(address(c));
                   B <- CDB(0)
  152       128    WR                                       See note i for the meaning of WR
  160       136    WR
 
 
 
 
 
 
 
 Clock     Clock
 (normal)  (min.)  Microsequence                            Comments

  168       144    WR
  176       152    WR
  184       160    WR
  192       168    WR                                       Mantissa is stored in PEM
  200       176    PEM(X) <- sM(0)                          Now store exponent in PEM
  205       181    Incr X
  206       182    PEM(X) <- sM(1)
  214       190                                             End of the operation
 
 
 Notes: 
 
 a) At this stage, the situation is as follows: lcFF1 is ON if the signs are
 equal, OFF otherwise; A_c, A_m contain the smaller exponent; the larger
 exponent is stored in sM(0), sM(1); in sM(2) the LSB is a one if the signs
 are equal and a zero otherwise.

 b) At this point A_c, A_m contain the difference of the exponents. If A_m is
 non-zero, then b will not participate in the sum (since the exponent
 difference is too large) and lcFF2 is set ON in PE's in which this happens.

 c) Mantissa alignment is performed by adding the exponent difference (which
 is in B) to the address of b, which is in A_c, A_m. The modified address of b
 is then returned to X_2.

 d) A_c will be used as a counter which yields an overflow when all digits of
 b have been used. For PE's in which this overflow (which is stored in lcFF2)
 has appeared, digits of b are replaced by zeros before the addition.
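
Notes c) and d) can be pictured with the following sketch. It is a simplified model, not the microcode: it assumes the mantissa digits of b sit at consecutive offsets with the least significant digit first, that the counter is 12 bits wide (suggested by the C_{n+12} carry in the listing), and that the trap value FFF - N + 1 has been advanced by the exponent difference. The names `aligned_digits`, `counter` and `exhausted` are hypothetical.

```python
COUNTER_BITS = 12  # assumed width of the trap counter (carry C_{n+12})


def aligned_digits(b_digits: list[int], exp_diff: int) -> list[int]:
    """Return the digits of b as the adder would see them after alignment.

    b_digits[0] is taken to be the least significant 4-bit digit.
    """
    n = len(b_digits)
    if exp_diff >= n:
        return [0] * n                      # b does not participate in the sum
    address = exp_diff                      # address of b plus exponent difference
    counter = (1 << COUNTER_BITS) - n + exp_diff   # trap: overflows after n - exp_diff steps
    exhausted = False                       # plays the role of lcFF2
    out = []
    for _ in range(n):
        out.append(0 if exhausted else b_digits[address])
        address += 1
        if not exhausted:
            counter += 1
            if counter >> COUNTER_BITS:     # carry out of the counter
                exhausted = True            # remaining digits are replaced by zeros
    return out
```

Reading b from a higher starting address is equivalent to shifting its mantissa right by `exp_diff` hexadecimal digits, which is what the test values below check.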
 
 e) ADF(a) (add and fetch) is defined as the following set of microsequences
 (each is labelled with its normal and minimum clock):

 1,1 - En(2,ON); B <- CDB(0); Incr X_1; Incr X_2

 2,2 - En(2,OFF); Incr A_c; lcFF2 <- C_{n+12}

 3,3 - A_r <- (A ± B); C_n = lcFF4; lcFF4 <- C_{n+4}; A_m <- PEM(X); lcFF1 OFF
       causes subtraction instead of addition

 6,3 - B <- PEM(X_2)

 9,6 - sM(a) <- A_r

 ADF takes a minimum of six clocks and the normal time is nine clocks.

 ADFI is similar to ADF but in clock (2,2) C_n is set to lcFF1 instead of to
 lcFF4. ADFI is used for the first addition and takes as long as ADF.

 AD is similar to ADF but no new fetch is performed. It is used for the
 last addition and takes only three clocks.
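
The arithmetic core of ADF, one 4-bit digit per step with the carry held between steps, can be sketched as follows. This models only the data path, not the fetch overlap or the timing. The subtraction here uses the fifteen's complement of each digit with an initial carry of one, which is an assumption about the convention and is not taken from the report. `digit_serial_add` is a hypothetical name.

```python
def digit_serial_add(a_digits: list[int], b_digits: list[int],
                     subtract: bool = False) -> tuple[list[int], int]:
    """Add (or subtract) two numbers digit by digit, least significant first.

    Returns (result_digits, final_carry). The carry variable plays the role
    of lcFF4: it is read as C_n and rewritten with C_{n+4} on every step.
    """
    carry = 1 if subtract else 0            # assumed initial carry
    result = []
    for a, b in zip(a_digits, b_digits):
        if subtract:
            b = ~b & 0xF                    # fifteen's complement of the digit
        total = a + b + carry
        result.append(total & 0xF)          # 4-bit digit of the result
        carry = total >> 4                  # carry into the next digit
    return result, carry
```

With this convention a final carry of 1 after a subtraction means the first operand was at least as large as the second, which is the condition note f) tests through lcFF4.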
 
 f) The rules are: for a ± b = c, sign(c) = sign(a) and no recomplementation is
 needed unless sign(a) ≠ sign(b) AND lcFF4 is OFF at the end of the operation.
 In this case, sign(result) = NOT sign(a) = sign(b) and recomplementation must
 be performed. An overflow occurs when sign(a) = sign(b) AND lcFF4 is ON at
 the end of the operation.
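
The rules in f) amount to ordinary sign-and-magnitude addition. The sketch below applies them to whole n-bit magnitudes rather than digit by digit; the function name and the representation (sign as 0 or 1, magnitude as an integer) are illustrative assumptions.

```python
def sign_magnitude_add(sign_a: int, mag_a: int, sign_b: int, mag_b: int,
                       n_bits: int) -> tuple[int, int, bool]:
    """Return (sign_c, mag_c, overflow) for c = a + b in sign-and-magnitude form."""
    modulus = 1 << n_bits
    if sign_a == sign_b:
        total = mag_a + mag_b
        carry = total >= modulus            # final carry (lcFF4)
        return sign_a, total % modulus, carry   # overflow iff carry is ON
    total = mag_a + (modulus - mag_b)       # add the complement of |b|
    carry = total >= modulus
    mag = total % modulus
    if carry:                               # |a| >= |b|: sign(c) = sign(a)
        return sign_a, mag, False
    # carry OFF with unequal signs: recomplement, sign(c) = sign(b)
    return sign_b, (modulus - mag) % modulus, False
```

In the floating-point sequence the "overflow" case is not an error: it is absorbed later by adding one to the exponent.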
 
 g) RC(a,b) (recomplement) is defined as the two following microsequences:

 1) A_r <- (((NOT A) OR B) + B + 1); A_m <- sM(a); C_n = lcFF4; lcFF4 <- C_{n+4}

 2) En(1,ON); sM(b) <- A_r

 - If a is a *, then A_m is not loaded on the first microsequence.
 - The arithmetic function above performs recomplementation when B = 0.
 - RCI(a,b) is used for the recomplementation of the first digit; it is
 similar to RC but in the first microsequence C_n = 1 instead of C_n = lcFF4.
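
With B = 0 the function in g) reduces to the complement of the digit plus the incoming carry, so the chain RCI, RC, RC, ... forms a two's complement one digit at a time. The sketch below assumes that reading (the carry-in supplies the one for the first digit and the propagated carry afterwards); `recomplement` is a hypothetical name.

```python
def recomplement(digits: list[int]) -> list[int]:
    """Two's-complement a number held as 4-bit digits, least significant first."""
    carry = 1                               # RCI: C_n = 1 for the first digit
    out = []
    for d in digits:
        total = (~d & 0xF) + carry          # (NOT A) + 0 + carry, i.e. B = 0
        out.append(total & 0xF)
        carry = total >> 4                  # RC: the next digit takes C_n from lcFF4
    return out
```
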
 h) CZ(a) (count zeros) is defined as the following set of two microsequences:

 1) En(1,ON); En(4,OFF); lcFF1 <- (A_m = B); A_m <- sM(a)

 2) En(1,ON); En(4,OFF); Decr X_1; Decr X_2

 If a is a * then A_m is not reloaded in the first microsequence. This
 function decrements X_1 and X_2 if A_m is zero (and has always been zero
 previously) and if there was no "overflow", which is signaled by lcFF4 OFF.
 Since X_1 contains initially all 1's, a trap is formed to yield a carry when
 the number of leading zeros is added to it. Since X_2 contains initially the
 larger exponent, the exponent of the result is formed by subtracting one out
 of X_2 for each leading zero.
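
The effect of the CZ chain on the result can be sketched as a normalization step: one is subtracted from the exponent for each leading zero digit, and those digits reappear as trailing zeros. This is a minimal sketch assuming a hexadecimal (4-bit digit) mantissa listed most significant digit first; leaving an all-zero mantissa unchanged is an assumption of the sketch, not something stated in the report. `normalize` is a hypothetical name.

```python
def normalize(mantissa: list[int], exponent: int) -> tuple[list[int], int]:
    """Remove leading zero digits, decrementing the exponent once per digit."""
    zeros = 0
    still_leading = True                    # "has always been zero previously"
    for digit in mantissa:                  # most significant digit first
        still_leading = still_leading and digit == 0
        if still_leading:
            zeros += 1
    if zeros == len(mantissa):
        return list(mantissa), exponent     # assumed handling of a zero mantissa
    shifted = mantissa[zeros:] + [0] * zeros    # leading zeros become trailing zeros
    return shifted, exponent - zeros
```

For instance, `normalize([0, 0, 3, 1], 5)` returns `([3, 1, 0, 0], 3)`, which represents the same value.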
 
 
 i) WR (write) stands for the following set of eight microsequences:

 En(2,ON); B <- sM(X_2); Incr X_2

 En(2,OFF); Incr A_c; lcFF2 <- C_{n+12}

 PEM(X_1) <- B

 Wait for writing in PEM

 Wait for writing in PEM

 Wait for writing in PEM

 Wait for writing in PEM

 Incr X_1

 WR stores the sum of the mantissas in PEM and also takes care of eliminating
 leading zeros. The trap in A_c signals when all leading zeros (which are
 transformed into trailing zeros) are eliminated.
 
 
 LIST OF REFERENCES 
 
 [1] Control Data Corporation. "The STAR Computing System." A technical
 proposal to The Atomic Energy Commission. December 1966.

 [2] Slotnick, D. L., et al. "The ILLIAC IV Computer," IEEE Transactions on
 Computers. Volume C-17, Number 8 (August 1968), pp. 746-757.

 [3] Kuck, D. J. "ILLIAC IV Software and Application Programming," IEEE
 Transactions on Computers. Volume C-17, Number 8 (August 1968),
 pp. 758-770.

 [4] Lehman, M. "A Survey of Problems and Preliminary Results Concerning
 Parallel Processing and Parallel Processors," Proceedings of
 the IEEE. December 1966, pp. 1889-1901.

 [5] Fulmer, L. C., and W. C. Meilander. "A Modular Plated Wire Associative
 Processor," Proceedings of the IEEE Computer Group Conference.
 June 1970, pp. 325-335.

 [6] Graham, W. R. "The Parallel and the Pipeline Computers," Datamation.
 Volume 16, Number 4 (April 1970), pp. 68-71.

 [7] Bremer, J. W. "A Survey of Mainframe Semiconductor Memories," Computer
 Design. Volume 9, Number 5 (May 1970), pp. 63-73.

 [8] Texas Instruments Incorporated. The Integrated Circuits Catalog for
 Design Engineers. First edition.

 [9] Yasui, T. "Pattern Matching Problem - Benchmark on ILLIAC IV," ILLIAC IV
 Document Number 217. University of Illinois at Urbana-Champaign.
 May 1970.

 [10] Lincoln, N. R. "Parallel Programming Techniques," presented at the
 "SIGPLAN Symposium on Compiler Optimization." University of Illinois
 at Urbana-Champaign. July 1970.

 [11] Wilhelmson, R., et al. "Matrix Operations on ILLIAC IV," ILLIAC IV
 Document Number 52. University of Illinois at Urbana-Champaign.
 March 1967.

 [12] Stevens, J. E. "Matrix Multiplication Algorithm for ILLIAC IV," ILLIAC
 IV Document Number 231. University of Illinois at Urbana-Champaign.
 August 1970.

 [13] Troyer, S. "Sparse Matrix Multiplication," ILLIAC IV Document Number 137.
 University of Illinois at Urbana-Champaign. June 1968.

 [14] Carr, R. "Gauss-Seidel on ILLIAC IV," ILLIAC IV Document Number 67.
 University of Illinois at Urbana-Champaign. May 1967.

 [15] Ackins, G. "Fast Fourier Transform," ILLIAC IV Document Number 146.
 University of Illinois at Urbana-Champaign. July 1968.

 [16] Stevens, J. "Fast Fourier Transform Subroutine for ILLIAC IV," ILLIAC
 IV Document Number 226. University of Illinois at Urbana-Champaign.
 July 1970.

 [17] McIntyre, D. "ILLIAC IV Language Evaluation - A Preliminary Experiment,"
 ILLIAC IV Document Number 213. University of Illinois at Urbana-
 Champaign. November 1970.
 
 
 VITA 
 
 Born in 1941 in Santos, BRAZIL, Nelson Castro Machado received in
 December 1964 the degree of "Electronic Engineer" from the Instituto
 Tecnologico de Aeronautica in Sao Jose dos Campos, BRAZIL. He then worked there
 for one and one half years as a teaching assistant, being responsible for
 courses in applied electronics, pulse circuits laboratory and automata theory.
 In September 1966 he came to the University of Illinois, where he received the
 M.S. degree in October of 1969. Since his arrival in the U.S.A., Mr. Machado
 has been working as a research assistant, initially with the ILLIAC IV Project
 and later with the Center for Advanced Computation of the University of
 Illinois. In this activity, he was responsible for the semantics part of a
 Translator Writing System developed to help implement ILLIAC IV languages. His
 M.S. thesis, entitled "ISL-A Semantics Language for a Translator Writing System,"
 is a result of this research. From January 1970 until January 1972, Mr.
 Machado worked on the topic of parallel computer organization, exploring new
 approaches to the concept of array processor utilized in ILLIAC IV. This
 activity resulted in his Ph.D. dissertation entitled "An Array Processor with
 a Large Number of Processing Elements."
 
UNCLASSIFIED
 Security Classification

 DOCUMENT CONTROL DATA - R & D

 ORIGINATING ACTIVITY: Center for Advanced Computation, University of
 Illinois at Urbana-Champaign, Urbana, Illinois 61801

 REPORT SECURITY CLASSIFICATION: UNCLASSIFIED

 GROUP:

 REPORT TITLE: AN ARRAY PROCESSOR WITH A LARGE NUMBER OF PROCESSING ELEMENTS

 DESCRIPTIVE NOTES (Type of report and inclusive dates): Research Report

 AUTHOR(S) (First name, middle initial, last name): Nelson C. Machado

 REPORT DATE: January 1, 1972

 TOTAL NO. OF PAGES: 184

 NO. OF REFS:

 CONTRACT OR GRANT NO.: DAHC04-72-C-0001

 PROJECT NO.: ARPA Order 1899

 ORIGINATOR'S REPORT NUMBER(S): CAC Document No. 25

 OTHER REPORT NO.: UIUCDCS-R-72-499

 DISTRIBUTION STATEMENT: Copies may be obtained from the address given in (1)
 above. Distribution unlimited; approved for public release.

 SUPPLEMENTARY NOTES: None

 SPONSORING MILITARY ACTIVITY: U.S. Army Research Office-Durham, Durham,
 North Carolina

 ABSTRACT

 This paper describes a new type of array processor (SPEAC) which could be
 characterized as an intermediate between ILLIAC IV and the Associative
 Processor. The number of processing elements (PE's) is typically 1K but could
 go as high as 8K. Each PE is a relatively simple unit with about 1K equivalent
 gates, designed to allow implementation either on a single very complex LSI
 chip or on several MSI chips. Each PE plus its memory (PEM) could then be
 assembled on one single printed circuit board or ceramic substrate.

 Processing is performed in groups of four bits which allows variable
 word length. Maximum freedom in data format and instruction format is made
 possible by the use of a microprogrammable control unit (CU). Therefore, the
 machine [...] efficiently either on floating-point large precision problems
 (matrix [...] processing, etc.) or on fixed-point small precision ones
 (character manipulation, picture processing, [...]

 [...] a general [...] of the CU is presented. Operations are described and
 timed, with particular emphasis on floating-point addition ([...] bits) and
 floating-point multiplication (25 µsec per PE for [...]). [...] are presented
 along with their time estimates.

 DD FORM 1473

 KEY WORDS: Design and Construction; General Purpose Computer; Arithmetic Units

 UNCLASSIFIED
 Security Classification
 
BIBLIOGRAPHIC DATA SHEET

 1. Report No.: UIUCDCS-R-72-499

 3. Recipient's Accession No.:

 Title and Subtitle: AN ARRAY PROCESSOR WITH A LARGE NUMBER OF PROCESSING
 ELEMENTS

 5. Report Date: January 1, 1972

 Author(s): Nelson Castro Machado

 8. Performing Organization Rept. No.:

 Performing Organization Name and Address: Center for Advanced Computation,
 University of Illinois at Urbana-Champaign, Urbana, Illinois 61801

 10. Project/Task/Work Unit No.:

 11. Contract/Grant No.: DAHC04-72-C-0001

 Sponsoring Organization Name and Address: U.S. Army Research Office-Durham,
 Duke Station, Durham, North Carolina

 13. Type of Report & Period Covered: Research

 14.

 Supplementary Notes: None

 Abstracts: See DD Form Number 1473.

 Key Words and Document Analysis. 17a. Descriptors: Design and Construction;
 General Purpose Computer; Arithmetic Unit

 Identifiers/Open-Ended Terms:

 COSATI Field/Group:

 Availability Statement: Copies may be obtained from the address in (9)
 above. Distribution unlimited.

 19. Security Class (This Report): UNCLASSIFIED

 20. Security Class (This Page): UNCLASSIFIED

 21. No. of Pages: 184

 22. Price: NC

 FORM NTIS-35 (10-70)    USCOMM-DC 40329-P71
 