Report No. UIUCDCS-R-74-638
NSF-OCA-GJ-36936-000004

PROGRAM SPEEDUP THROUGH CONCURRENT RECORD PROCESSING*

by
Richard Ernest Strebendt

October 1974

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the National Science Foundation under Grant No. US NSF-GJ-36936 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, October 1974.

ACKNOWLEDGMENT

The author wishes to express his gratitude to Professor David J. Kuck for the suggestions, incisive questions, and advice which were invaluable in developing this thesis. Thanks is due also to the personnel of the Administrative Data Processing Office of the Urbana-Champaign campus of the University of Illinois who provided the COBOL programs for the analyses presented in this paper. Special thanks in this regard is expressed to Mr. Pinaki R. Das.
Above all, the author wishes to express his gratitude to his wife, Frances, for her encouragement, patience, and assistance.

PROGRAM SPEEDUP THROUGH CONCURRENT RECORD PROCESSING

Richard Ernest Strebendt, Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1974

Much effort in the past has been devoted to speeding up computational programs through the use of multiprocessing. This paper examines the problem of speeding up data processing programs which typically do not contain a great deal of computation.

A machine organization is proposed which is capable of executing several instruction streams concurrently. Compiler algorithms are described which automatically insert the necessary commands to start and stop instruction streams and to protect common variables which must be accessed sequentially.

TABLE OF CONTENTS

1. INTRODUCTION  1
   1.1 Approaches to Program Speedup  1
   1.2 Characteristics of COBOL  3
   1.3 Assumptions and Restrictions  4
2. DESCRIPTION OF THE METHOD  8
   2.1 Types of Hardware Units Needed  11
   2.2 Address Counters and Interlocks  13
3. COMPILER ALGORITHMS  18
   3.1 Source Text Scan  18
   3.2 Phase and Link Identification  23
   3.3 Statement Migration  24
   3.4 Variable Type Identification  32
   3.5 Storage Assignment  37
   3.6 Positioning FORK, HOLD, and QUIT Instructions  42
   3.7 Inserting Interlocks  47
4. PROOF OF METHOD  53
   4.1 Theorem 4.1  53
   4.2 Theorem 4.2  54
   4.3 Discussion  57
5. MACHINE DESIGN  60
   5.1 Over-all Structure  60
   5.2 Program Memory  62
   5.3 Address Counters  65
   5.4 Instruction Dispatch Unit  67
   5.5 Address Counter Coordinator  71
   5.6 Processors  81
   5.7 Data Memory and Buses  84
   5.8 I/O Processors  88
   5.9 Routing Network  90
   5.10 Modifications Needed for Multiprogramming  90
6. EXPERIMENTAL RESULTS  92
   6.1 Introduction  92
   6.2 Variable Size Counts  92
   6.3 Statement Type Counts  93
   6.4 Program Analyses  97
   6.5 Program Simulation  106
7. MACHINE PARAMETERS  115
   7.1 Speed Limitation  115
   7.2 Number of Address Counters  116
   7.3 Data Memory Word Size  116
   7.4 Data Character Size  118
   7.5 Number of Data Memory Units  119
   7.6 Size of Data Memory  119
   7.7 Size of Program Memory  122
   7.8 Numbers of Processors  125
       7.8.1 IF Tree Processor  125
       7.8.2 Arithmetic/Logical Processors  125
       7.8.3 I/O Processors  126
       7.8.4 SORT Processor  127
   7.9 Instruction Dispatch Unit Memory Sizes  127
   7.10 Other Devices  134
       7.10.1 Program Memory  134
       7.10.2 Address Counter  135
       7.10.3 Instruction Dispatch Unit  136
       7.10.4 Address Counter Coordinator  137
       7.10.5 Data Memory Unit  138
       7.10.6 Routing Network  139
       7.10.7 Arithmetic/Logical Unit  139
       7.10.8 IF Tree Processor  140
       7.10.9 I/O Processor  140
       7.10.10 SORT Network  142
   7.11 Package Counts  143
       7.11.1 Memories  143
       7.11.2 Queues  145
       7.11.3 Bus Drivers and Receivers  145
       7.11.4 Other Devices  145
       7.11.5 Total Package Requirement  145
8. PROBLEM PROGRAMS  151
   8.1 Characteristics of Problem Programs  151
   8.2 Speeding Up Problem Programs  151
9. SOFTWARE DESIGN  155
   9.1 Language Features Which Hinder  155
       9.1.1 ALTER  155
       9.1.2 Subroutines  156
   9.2 Language Features Which Help  158
       9.2.1 Complex Operations  158
       9.2.2 SORT  158
   9.3 Programming Techniques Which Hinder  159
       9.3.1 Sequence Checking  159
   9.4 Programming Techniques Which Help  159
       9.4.1 Super-records  159
       9.4.2 Parallel Tasking  160
10. CONCLUSIONS  161
   10.1 Summary of Results  161
   10.2 Areas for Further Inquiry  163
   10.3 Final Comment  165
LIST OF REFERENCES  166
APPENDIX - Program Analysis Example  171
   A.1 Source Text Scan  171
   A.2 Phase and Link Identification  179
   A.3 Statement Migration  181
   A.4 Variable Type Identification  181
   A.5 Storage Assignment  185
   A.6 Positioning FORK, HOLD, and QUIT Instructions  185
   A.7 Inserting Interlocks  185
VITA  191

LIST OF TABLES

6.1 Frequency Count of Variable Sizes  94
6.2 Frequency Count of Statement Types  98
6.3 Frequency Count of Operator Types  101
6.4 Statistics for Analyzed Programs  104
6.5 Memory Allocation Results  107
6.6 Speedup Results  112
7.1 Memory Requirements  120
7.2 Program Size Statistics  124
7.3 Memory Package Requirements  144
7.4 Queue Package Requirements  146
7.5 Bus Driver and Receiver Package Requirements  147
7.6 Other Device Package Requirements  148
7.7 Total Package Requirements  150
A.1 Program Listing  172
A.2 Variable Names  175
A.3 Program Summary  176
A.4 Variable Types  184
A.5 Affinity and Segregation Sets  186
A.6 Storage Unit Assignment  187

LIST OF FIGURES

2.1 Concurrent Execution Example  9
3.1 IF Tree Example  22
3.2 Statement Migration Example 1  25
3.3 Statement Migration Example 2  28
3.4 Statement Migration Example 3  30
3.5 FORK, HOLD, QUIT Insertion  48
3.6 Interlock Insertion Example  51
4.1 FORK Insertion Example  59
5.1 Over-all Machine Structure  63
5.2 Program Memory  64
5.3 Address Counter  66
5.4 Instruction Dispatch Unit  68
5.5 Fetch and Tag Generator Logic  72
5.6 Instruction Dispatch Controller Logic  74
5.7 Tag Status Register Logic  75
5.8 FORK Control Sequence  76
5.9 HOLD Control Sequence  78
5.10 QUIT Control Sequence  80
5.11 TEST Control Sequence  82
5.12 RELEASE Control Sequence  82
5.13 Data Memory Unit  85
6.1 Frequency Count Plot - Variable Sizes  96
6.2 Histogram - Statement Types by Class  99
6.3 Histogram - Statement Types by Statement  100
6.4 Histogram - Operator Types by Class  102
6.5 Histogram - Operator Types by Operator  103
7.1 Cumulative Variable Size Distribution  117
7.2 Memory Requirements vs. Program Size  121
7.3 Instruction Transfer Rate Model  129
9.1 ALTER Instruction Example  157
A.1 Program Graph  180
A.2 Phase I Statement Migration  182
A.3 Phase II Statement Migration  183
A.4 Final Program Graph  188
1. INTRODUCTION

1.1 Approaches to Program Speedup

A continuing concern [CHE71a, BEL72, WIT72] in Computer Science is the problem of speeding up the execution of programs. In the early days of computers this was primarily attacked by speeding up the circuitry of the machine itself. Faster devices were developed as relays gave way to vacuum tubes, which gave way in turn to semiconductors. More efficient algorithms for arithmetic were found and continue to be investigated. Now that physical limits are in sight for the speed of devices, the emphasis [GIL58, MUR66] on program speedup is being placed on parallelism in the execution of the program. Machines have been developed [BAR68] to exploit the parallelism inherent in array operations. The parallelism present in algorithms for arithmetic operations has also been utilized in pipelined arithmetic units [SEN65, SEN67, HIN72, WAT72]. Some consideration has been given [ASC67, FOS71, FLY71, FLY72, CUR73, BAE73] to the possibility of using more than one processing unit to execute a program, with each processor executing independently of the others. Two problems arise when this is done. The first is the problem of conflicts in accessing data common to several processors. For a particular class of programs this problem is solved by inserting a complex set of tests around the instructions referencing the common data [DIJ68a, DIJ68b, COU71, EIS72, HAB72] to allow any processor to access the common data so long as no other processor is accessing that data. More commonly, however, the problem is avoided by allowing processors to simultaneously execute only tasks which are independent. This leads to the second problem, that of identifying independent tasks. Mechanically finding independent tasks within a program can be done [BER66, RAM69, RUS69, TJA70], but for a large program this can be expensive in machine time.
The approach suggested by several investigators [CON63, AND65, OPL65, WIR66] is that of requiring the programmer to specify in his program where he thinks the processors should be started and stopped. For an occasional program this might be a workable technique, but for a programmer with a heavy work load it would be too time consuming and error-prone to be useful.

In this paper we also attack the problem of program speedup through the concurrent operation of more than one processor. Our approach, however, is different from that of previous work and yields a potentially very high speedup without searching for independent tasks within a program. We attain a program speedup by executing the program concurrently with itself, with each instruction stream (or copy of the program) processing a different set of input data. No parallel tasking is attempted within an instruction stream. The bulk of this paper assumes that only one program at a time is in execution; the modifications needed to extend the machine proposed in this paper to multiprogramming are discussed in section 5.10. It is shown in this paper that this method of achieving concurrency has the following advantages:

1) It is not necessary to compare all tasks with all others to find those which are parallel executable.

2) The instructions to start and stop processors can be inserted very easily by the compiler into a program written for a single processor machine. This relieves the programmer of this burden. Also, the locations of these instructions can change with each compilation as the program changes. Thus the programmer can concentrate on what he wants the program to do and not on how the machine does it.

3) The interlocking operations can be inserted by the compiler without the intervention of the programmer.

4) The interlocking conditions are fairly simple and can be relegated to an inexpensive hardware unit.
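The idea of executing a program concurrently with itself can be illustrated, in modern terms, by a small thread-based sketch. This is only an analogy to the proposed hardware: Python threads stand in for instruction streams, and process_record, run_concurrently, and the record layout are invented for illustration.

```python
import queue
import threading

def process_record(record):
    # Invented stand-in for one pass through a record-processing
    # loop: D <- B + C, then X <- D / A.
    a, b, c = record
    return (b + c) / a

def run_concurrently(records, n_streams=4):
    # Each worker thread plays the role of one instruction stream,
    # i.e., one copy of the program applied to a different record.
    work = queue.Queue()
    for i, r in enumerate(records):
        work.put((i, r))
    results = [None] * len(records)

    def stream():
        while True:
            try:
                i, r = work.get_nowait()
            except queue.Empty:
                return
            results[i] = process_record(r)

    threads = [threading.Thread(target=stream) for _ in range(n_streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# The concurrent result agrees with a purely sequential run.
records = [(2, 3, 5), (4, 1, 7), (5, 10, 10)]
assert run_concurrently(records) == [process_record(r) for r in records]
```

Because the streams here share nothing but the work queue and disjoint result slots, no interlocking is needed in this simple case; the interlocks required when streams share variables are the subject of Chapter 2.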
1.2 Characteristics of COBOL

Much recent work on the speedup of programs through multiprocessing emphasizes the speedup of arithmetic [MUR71, KRA72, KUC72b]. In many computational programs the speedups thus gained are substantial. In many data processing programs written in COBOL, however, little is gained in this way since there is little arithmetic in them.

It might be asked: why worry about COBOL programs? The answer is quite straightforward; more programs are written in COBOL than in any other language. Indeed, a recent survey [PHI73] of language users indicates that more programs are written in COBOL than in all of the other languages combined. The economic benefits resulting from improving the execution speed of COBOL programs should be well worth the effort.

The characteristics which make COBOL programs as long running as they often are suggested our method of speeding up such programs. A typical program, judging from the examination of a number of programs,* involves very little processing of the data compared to, say, a FORTRAN numerical program. Commonly, a set of input data is acquired, particular items are selected, simple calculations (if any) are carried out, then the data is reformatted and written out, and another set of data is acquired for processing. While the amount of processing done on each set of data is relatively small, the number of sets of data processed per run may be quite large.

This profile of a typical COBOL program suggests three things. First, any arithmetic speedup we can obtain is useful, although it may not be as dramatic as that in a numerical program. Second, since much of the work in a COBOL program involves manipulating data, it seems desirable to build these capabilities into the memory [STO70] where this would avoid transferring large amounts of data back and forth to special processors. Finally, our greatest speed improvement can be expected to come from overlapping the processing of the sets of input data.
1.3 Assumptions and Restrictions

In attacking any problem of the potential magnitude of this one, it is necessary to make some assumptions about the environment of the proposed solution and to set bounds on the degree to which we are willing to modify the original COBOL programs.

* Our samples are described in Chapter 6, "Experimental Results."

An important consideration is the hardware available for implementing the machine. We show in Chapter 7 that the machine proposed in this paper could be built with components which are currently available or are within the capabilities of the current state-of-the-art. In that chapter it is also indicated where capabilities which are not yet available could be put to good use.

Another consideration is the software with which the compiler is implemented. The compiler algorithms presented in this paper are intended to demonstrate the feasibility of concurrent record processing. In an actual implementation of these algorithms we assume that the implementer would use techniques which take advantage of the capabilities of the machine for which he was designing the compiler. Such capabilities would include the ability to execute more than one instruction stream at a time.

For the purposes of this thesis we assume no extensions to COBOL to facilitate the solution of the problem, although we suggest some extensions in Chapter 9 that might be useful. We attack the problem of speeding up COBOL programs which are presented to us as they are now written for single processor sequential machines. This is done for several reasons. First, concocting parallel-COBOL test programs for parallel machines could result in a very small set of test programs which is not representative of real data processing programs.
Second, were a machine such as the one proposed in this paper actually put into use, it would have to be able to handle the huge number of previously existing programs in some way, or else force the users to rewrite all of their programs. While many programs should be rewritten to make best use of the abilities of the machine, we show in this paper that many programs intended for a sequential machine can be made to run well on a concurrent machine without requiring the user to modify them. Third, while there is a constant quest for higher throughput rates for business computers, a language extension which could improve throughput but complicates programming is likely to be shunned. Quite often in a business data processing environment the efficiency of a program is less important than the ease with which a programmer, unfamiliar with the program, can make changes in it. To bring about the kind of speed improvement possible in a concurrent machine, it is not necessary to complicate a program by requiring that the programmer insert additional instructions to control instruction sequencing. These additional instructions can be inserted by the compiler and need not appear in the language. Finally, because of the wide use of COBOL and the normal conservative tendency of people to resist change, any language radically different from COBOL would be slow to gain acceptance among business programmers.

In attempting a solution to an interesting problem, it is easy to get carried away and to propose grandiose schemes which would be costly to implement and might have relatively little applicability to real programs. To avoid this pitfall we limit ourselves to adding code to, and rearranging the code in, the original COBOL program. We do, of course, make use of the hardware we must introduce. We do not attempt to transform the algorithm used in the program by attempting to discover the programmer's intentions and implementing them in a better way than he did.
Besides the obvious pitfalls inherent in attempting to outprogram the programmer, such an approach could lead to a very expensive system whose execution speedup was offset by a very long compilation time. Since programs are constantly being revised, compilation cost is not to be ignored. Likewise, we do not try to restructure the data files into forms more amenable to concurrent processing. Again, the cost of restructuring each file on every run to suit the needs of each program could eliminate any benefits derived from the resulting faster execution. We do, in Chapter 9, point out programming techniques and file structures which are good for concurrent processing and should be used by programmers in programming our machine.

2. DESCRIPTION OF THE METHOD

The technique described in this thesis achieves program speedup by concurrently processing as many input records as possible, while interlocking the processing to preserve any sequentiality which is essential to the correct operation of the program. This is done by starting the processing of a record of input data as soon as it is known which READ statement in the program is the next to be executed (i.e., when it is known what processing is to be done on the next record). At those points in the program at which a sequential execution constraint exists, processing is suspended until the condition inhibiting further processing has been removed.

To indicate how this technique works, consider Figure 2.1. Figure 2.1(a) shows a program written for a single processor machine. A record is read in block 1 to obtain values for A, B, and C. A test is made in block 3 which compares A with its preceding value. If A satisfies the test, X is computed in block 4 and written out in block 5. In block 6 the value of A is saved. Figure 2.1(b) shows the same program after we have inserted instructions to overlap processing of different input records and to provide the necessary interlocking between the concurrent processing.
Block b causes the next record to be read as soon as the decision at block 3 is made to remain in the 1-2-3-4-5-6 loop. Block d breaks this loop by releasing the hardware used to process a record. Block a tests an interlock indication to be certain that the correct value of OLD-A is used in block 3. Block c releases the interlock as soon as OLD-A has been assigned its proper value for the next record's processing. Block e is used to guarantee that the processing occurring beyond that block is not entered until all of the processing of preceding records is completed.

[Figure 2.1: Concurrent Execution Example. (a) The sequential program: READ A,B,C; D <- B + C; a test on A; X <- D/A; OLD-A <- A; with WRITE ERROR-MSG on the failure path. (b) The same program with inserted blocks: WAIT UNTIL CORRECT VALUE OF OLD-A IS AVAILABLE; SIGNAL TO START PROCESSING OF NEXT RECORD; SIGNAL THAT OLD-A HAS CORRECT VALUE FOR NEXT RECORD; TERMINATE PROCESSING OF THIS RECORD; WAIT UNTIL OTHER RECORDS TERMINATE.]

2.1 Types of Hardware Units Needed

To implement the speedup technique discussed in this paper we need the following hardware units:

1) Multiple processors are needed. By the word "processor" we do not mean a complete Central Processing Unit. A processor, as referred to in this paper, is either an Arithmetic Unit, a Conditional Branch Tree (IF Tree) Processor, or another type of special purpose unit.

a) In view of the fact that there may be many records in process concurrently, we expect enough demand for computation to require a number of Arithmetic Units, even if there is very little arithmetic per record.

b) A Conditional Branch Tree Processor [DAV72a] is a device which accepts as input the results of a collection of comparisons as a set of Boolean values, and returns the identity of the path to be taken for subsequent processing.
It has been shown [DAV72b] that such a processor can select the appropriate exit point for up to an eight level tree of IF statements in about two major clock cycles. The formation of IF Trees is discussed in section 3.1.

c) Other types of processors include a unit used to sort files, such as that described by Batcher [BAT68], and a collection of I/O processors.

2) To attain the necessary memory bandwidth to satisfy the demands of a number of processors for data, we need a number of memory units. In addition, more memory units are needed to hold the program being executed. We propose to separate the data memory from the program memory so that there is no interference between them. This also allows the design of the program memory to take advantage of the fact that fetches of instructions tend to be from locations relatively close together in memory [COF72]. The data memory can be designed to include the capability of doing some types of processing, such as replacing all occurrences of one character by another, without the need to send the data to another unit for processing.

3) To allow any processor to fetch data from any memory and to allow transfers of data from any memory to any other, we need some sort of Routing Network.

4) To control instruction sequencing, an Address Counter is needed for each record being concurrently processed. When we reach a point, during the execution of a program, at which another input record could begin to be processed, we activate a previously inactive Address Counter. It then fetches instructions to perform the processing for the next input record. When an Address Counter completes the processing of a set of input data, it is deactivated and returns to a pool of units available for assignment to subsequent records.
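As a rough software model of the FORK, QUIT, and HOLD behavior of the Address Counter pool, consider the following sketch. Python threads stand in for Address Counters, the class and method names are invented, and HOLD is simplified to "wait until every forked stream has quit" (cf. block e of Figure 2.1(b)); it is an illustration, not the hardware design.

```python
import threading

class CounterPool:
    # Toy software model of the Address Counter pool.
    def __init__(self):
        self.active = 0
        self.cv = threading.Condition()

    def fork(self, target, args=()):
        # FORK: activate an inactive Address Counter at the
        # initiation point (here, a Python function).
        with self.cv:
            self.active += 1
        threading.Thread(target=self._run, args=(target, args)).start()

    def _run(self, target, args):
        target(*args)
        # QUIT: deactivate and return the counter to the pool.
        with self.cv:
            self.active -= 1
            self.cv.notify_all()

    def hold(self):
        # HOLD (simplified): do not proceed past this point until
        # the other instruction streams have terminated.
        with self.cv:
            while self.active > 0:
                self.cv.wait()

pool = CounterPool()
processed = []
guard = threading.Lock()

def handle_record(r):
    with guard:
        processed.append(r)

for r in range(5):
    pool.fork(handle_record, (r,))
pool.hold()
assert sorted(processed) == [0, 1, 2, 3, 4]
```

Note that active is incremented inside fork itself, before the new thread starts, so a stream is counted from the moment it is activated rather than from the moment it begins running.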
2.2 Address Counters and Interlocks

Since the use of multiple Address Counters is a key part of this method of speeding up a program's execution, we examine here the starting and stopping of Address Counters and the constraints under which they must operate.

To start an inactive Address Counter into activity, an Address Counter which is already active executes a FORK instruction. One of the operands of the FORK instruction is the program location at which the new Address Counter is to begin execution, which we call the initiation point for that FORK.

While active, the jobs of the Address Counter are instruction sequencing and address calculation. An Address Counter fetches instructions, computes the effective data addresses, and passes the instruction on to the rest of the machine for execution until one of the following conditions arises:

1) A conditional branch instruction is encountered. In this case the Address Counter generates the request for the evaluation of the condition, then awaits the result. It resumes execution at the address computed from the information supplied by the IF Tree Processor.

2) A QUIT instruction is encountered.

3) A HOLD instruction is encountered. Either of two things causes the instruction stream to be resumed. If the Address Counter was the only one active, it is signalled to resume execution at the next instruction. If other Address Counters are still active, this Address Counter halts as though it had encountered a QUIT instruction. The last active Address Counter, after it executes a QUIT, resumes execution where the one executing the HOLD was halted.

4) An instruction is encountered which causes a value to be transferred from one of the data memory units into one of the Address Counter's index registers. Execution resumes at the next instruction after the value is received.

5) An instruction testing an interlock is encountered.
Execution resumes at the next instruction when the Address Counter is signalled that the interlocking condition has been removed.

An Address Counter is restricted to only two classes of addressing in the calculation of data addresses. The first is the class of addresses which all Address Counters can access: those corresponding to common variables. The second is the class of addresses which only a single Address Counter can access: those corresponding to private variables. To accomplish this separation of data we conceptually partition memory into one area common to all Address Counters, and a number of private areas, each accessible to only one Address Counter. By using base registers containing the appropriate base addresses for the partitions allowed to particular Address Counters, and by making the layout of each copy of the private areas the same, we can easily implement this partitioning.

In order to insure that the results of a program executed using our concurrent record processing technique are correct, three types of interlocks are needed in a program.

1) Those required to insure that instructions in the same instruction stream which access the same variable are executed in the correct sequence.

2) Those required to protect common variables which can be modified by any instruction stream at any time. It is necessary in this case to make sure that only one instruction stream accesses such a variable at a time.

3) Those required to protect variables for which sequential execution constraints exist. These variables, such as OLD-A in Figure 2.1, must not be accessed by an instruction stream until the preceding instruction stream is no longer able to access them.

The first type of interlock is obtained if we do not allow an instruction to go into execution until all of its operands are available for use.
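The third type of interlock, under which each instruction stream must wait for its predecessor, can be modeled in software by a sequence-numbered indicator with operations resembling the TEST and RELEASE control sequences of Chapter 5. The sketch below is an illustration under that reading, not the hardware design; the class and variable names are invented.

```python
import threading

class SequenceInterlock:
    """Toy model of the third interlock type: instruction stream k
    may pass test(k) only after stream k-1 has executed release(k-1),
    so interlocked variables are accessed in activation order."""
    def __init__(self):
        self.next_allowed = 0
        self.cv = threading.Condition()

    def test(self, seq):
        # TEST: block until this stream's predecessor has released.
        with self.cv:
            while self.next_allowed < seq:
                self.cv.wait()

    def release(self, seq):
        # RELEASE: allow the successor stream to proceed.
        with self.cv:
            self.next_allowed = max(self.next_allowed, seq + 1)
            self.cv.notify_all()

# Streams started in reverse order still enter the interlocked
# region in activation (sequence) order.
interlock = SequenceInterlock()
order = []

def stream(k):
    interlock.test(k)
    order.append(k)          # stands in for the access to OLD-A
    interlock.release(k)

threads = [threading.Thread(target=stream, args=(k,))
           for k in reversed(range(4))]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert order == [0, 1, 2, 3]
```

Applied to Figure 2.1, stream k would execute test(k) before reading OLD-A (block a) and release(k) immediately after assigning it (block c).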
We show in Chapter 5 that this operation, and the others necessary to handle this interlock problem, can be handled by a hardware unit we call an Instruction Dispatch Unit.

For the second type of interlock, we could associate a bit with each such variable and use it as a semaphore [DIJ65]. It turns out, however, that the Instruction Dispatch Unit intended to handle the first type of interlock problem also solves the second type of interlock problem. It should be noted that no work by the compiler is needed for either of these interlocks.

The third type of interlock problem does require both compiler algorithms and hardware to handle it. First, the variables which require this type of interlock must be identified. These variables are those which must be accessed by Address Counters (or, equivalently, instruction streams) in the order in which the Address Counters are activated. Then those blocks of code (nodes in the program graph) which contain references to these variables must be identified. Finally, the compiler must insert instructions in appropriate places to test and to release interlock indicators for each of these interlocked variables. These indicators are included in the circuitry of the Address Counter Coordinator.

We could implement this type of interlock by constructing an unlocking function attached to the variable, but this could lead to a problem. The simpler interlocks pose no problem with respect to degrading performance by tying up resources while an instruction waits for access to a variable. The reason for this is that the expected length of the wait should be relatively short. For this third type of interlock it is not sufficient that the variable is not being accessed; it must no longer be able to be accessed by a given Address Counter's predecessor in order for that Address Counter to be able to access the variable.
The length of the wait for that condition to be satisfied could be quite long, especially if there are many active Address Counters. With the interlock attached to the variable we could have several statements per Address Counter which are half executed waiting for locked variables, with intermediate results and unfillable fetch requests tying up a great deal of hardware. Instead, we do not allow a block to enter execution until all of the interlocks attached to that block are satisfied. Equipment is thus available to handle the processing for the Address Counters which are not locked out of their data, and we avoid a possible deadlock-producing condition.

3. COMPILER ALGORITHMS

This chapter is intended to demonstrate the feasibility of our speedup technique by presenting a set of algorithms which could be used to implement this process. The algorithms are presented in the same order that they could appear in a compiler.

3.1 Source Text Scan

The two sections of a COBOL program which provide the bulk of the information in which we are interested are the Data Division and the Procedure Division, as they are named in the language [IBM72]. The former describes the attributes of the files used by the program and defines all of the variables used in the program. The Procedure Division contains the executable instructions of the program.

During the scanning of the program we need to collect the following information in addition to that normally collected by a compiler for a sequential machine. No great difficulty, however, is entailed in accumulating this information since it is readily available during the usual scanning process.

1) For each statement in the program we build two sets of variables:

a) The identities of those variables fetched for use in the statement comprise the input set, or set of input variables, for the statement.
b) The identities of those variables whose values are set by the execution of the statement comprise the output set, or set of output variables, for that statement.

2) A graph of the control flow of the program is built from the contents of the Procedure Division, as is often done for purposes of optimizing code. Each statement in the program is represented by a single node in the graph except for the following special cases:

a) Contiguous assignment statements, such as arithmetic and MOVE statements, are lumped together in a single node, so long as no other type of statement or an intervening label is encountered. Thus, a single node in the program graph is generated for a block of assignment statements.

b) PERFORM statements are expanded wherever possible. Where the PERFORM statement simply calls for a single execution of a section of the program, that section is copied into the program in place of the PERFORM statement. Where there is a fixed number of iterations specified in the PERFORM statement, the performed section of code is replicated with the appropriate value of the iteration variable inserted for each replication. For more complex PERFORM statements the performed section of code is copied in place of the PERFORM statement and imbedded in a construct similar to the PL/1 DO block. These blocks are handled by a compiler as in the FORTRAN analyzer described by Kuck et al. [KUC72b].

3) For each variable used in the program two sets have to be constructed.

a) The set of input references is the set of statements for which the variable appears as an input variable.

b) The set of output references is the set of statements for which the variable appears as an output variable.

The compiler can do two things during the construction of the internal data base that can yield a speedup at little additional cost. The first is to forward substitute [KUC72b] within a block of assignment statements. In this technique, any occurrence of an output variable as an input variable in a subsequent statement is replaced by the expression in the assignment statement for that output variable. For example, the block of assignment statements

A <- B + C + D
E <- A + F
G <- H + E + A
In this technique, any occurrence of an output variable as an input variable in a subsequent statement is replaced by the expression in the assignment statement for that output variable. For example, the block of assignment statements

A ← B + C + D
E ← A + F
G ← H + E + A

would become, after forward substitution,

A ← B + C + D
E ← B + C + D + F
G ← H + B + C + D + F + B + C + D

In the latter case, unlike the former, there are no interdependences between the statements in the block, so that all three statements could be executed in parallel.

The second thing the compiler can do during this phase to improve execution speed is to form IF Trees. In this technique we combine individual IF statements into a tree structure which can be executed by an IF Tree Processor. Unlike Davis [DAV72a, DAV72b], however, when we are building an IF Tree, we do not move all assignment statements from within the tree upward to a point ahead of the tree. Instead, we move upward all statements upon which the execution of the conditional branches in the tree depend. All other statements are moved down to be collected at the exits from the tree. This is illustrated in Figure 3.1. Figure 3.1(a) shows a collection of conditional branch statements with assignment statements occurring between them. Figure 3.1(b) shows the IF Tree and associated assignment blocks we form. The conditions have been transformed into assignments of logical values to a set of temporary variables &1, &2, and &3 which we call the conditional result set. This result set is used by the IF Tree Processor to determine which exit is to be used. The identity of the exit is then used by an Address Counter to select the next instruction to be executed.

[Figure 3.1: IF Tree Example]

3.2 Phase and Link Identification

A program typically consists of a collection of loops connected by code which is not included in the loops.
There can, of course, be many loops within the outer loops. In terms of the program graph, we define a phase to be a maximal strongly connected set of nodes [CHE71b]. That is, a phase is defined in such a way that any node in the phase can be reached from any node in the phase (including itself) by way of some directed path in the program graph. Any node not found in a phase is in a link. In terms of program execution, control remains within a phase until a link is entered. Once a link is entered, control never re-enters the exited phase.

We are particularly interested in phases for a couple of reasons. Obviously, the address mapping used for variables referenced within a phase must be invariant within the phase to avoid ambiguities in calculating data addresses. Different mappings can be used in different phases. More importantly, we are concentrating our efforts on speeding up the execution of a phase rather than of a link, because the link is executed no more than once, while the number of times the code in a phase is executed is potentially very large.

Identification of phases is simply the problem of identifying maximal strongly connected subgraphs in the program graph. An algorithm for this problem has been given by Ramamoorthy [RAM66], and a more efficient technique has been found by Chappell [CHA69].

3.3 Statement Migration

It has been found during our analyses that a program can be prevented from being sped up as much as possible because the programmer happened to code a crucial instruction at a point late in the program, while it was actually possible to have placed the instruction earlier in the instruction stream. On a sequential machine this is no problem. When such an instruction involves the assignment of a value to an interlocked variable, however, it prevents the associated interlock on our concurrent machine from being released as early as it could be released.
This likewise unnecessarily delays the processing of data by succeeding Address Counters.

Consider, for example, Figure 3.2. The loops in Figure 3.2 are identical except for the location of the assignment of SEQ to OSEQ. In Figure 3.2(a), if an Address Counter is waiting to execute the conditional branch, it cannot be allowed to proceed until its predecessor has executed the assignment of DATA to WDATA, written out WDATA, and assigned OSEQ the proper value. In Figure 3.2(b), the assignments can both take place concurrently, thus requiring a shorter wait by an Address Counter before the value of OSEQ is set. Since the speedup attainable in a situation such as that in Figure 3.2(b) is potentially much greater than in one such as that in Figure 3.2(a), it is worth our while to reorder statements so that they are executed as early as possible.

[Figure 3.2: Statement Migration Example 1]

Because we migrate statements, instructions whose operands are not available until late in the instruction stream tend to be placed after instructions whose operands are available earlier. Because of this ordering, statements do not usually wait for a long time in the Instruction Dispatch Unit for their operands to become available, thus reducing the amount of queue space required in that unit.

When we migrate statements, we only move those statements which change the values of variables. Such statements as IF and WRITE should not be moved. The algorithm which follows is a modified version of one reported by Foster and Riseman [FOS72a, FOS72b].

Algorithm 3.1 - Statement Migration

1) Start at the head node, corresponding to the first entry point, of the program graph.

2) Compute the earliest possible dispatch time, t_d, for each of the output variables.
This is done by computing the execution time, t_e, for the statement (minimum tree height for blocks of assignment statements in [KUC72b]) and finding the maximum of the dispatch times over the input variable set,

t_m = max {t_d(input variables)}.

Then

t_d = t_e + t_m.

The dispatch time for a variable is, thus, the earliest time along a particular path in the program graph that the variable is available for use as an input variable.

3) If the node under consideration is the destination of one or more branch instructions, examine the locations of all branches to this node. There are two possibilities for each. Either the branch is looping back to an earlier point on the path reaching it, or the branch causes the reconvergence of paths which separated at an earlier point in the processing. In the first case we do not attempt further migration, since we could end up moving this block endlessly around the loop without real gain. If all of the paths are reconvergent, we attempt to carry out the migration process along all of the paths. If we are able to migrate a statement up any of the paths, then we move the statement into the other paths as well. In Figure 3.3(b) the assignment statement G ← B + D can be migrated farther up the first two paths, but it cannot be migrated farther up the third path because of the conflict between the input variable set of the WRITE statement and the output variable set of the assignment statement, as discussed in step (5) below.

4) If the node prior to the current node corresponds to a conditional branch, we can migrate a statement upward past the conditional branch only if the same statement appears at all destinations of the branch.

[Figure 3.3: Statement Migration Example 2]
In Figure 3.4(a) the assignment to A of B + C occurs on all paths leading from the conditional branch. Also, as discussed in step (5), there is no conflict between the output set of this assignment and the input set of the conditional branch instruction. Thus, this assignment statement can be migrated up past the conditional branch as shown in Figure 3.4(b). In the same example, two paths from the conditional branch contain the assignment of -3 to E. The third branch, however, does not contain this statement. Until the conditional branch is executed, it is not known which value E takes on. This prevents us from migrating this assignment statement.

[Figure 3.4: Statement Migration Example 3]

5) For assignment statements there are two cases to consider. The block of assignment statements may be preceded by another block of assignment statements or by some other type of block. If the preceding block is another block of assignment statements, the two blocks should be concatenated, subject to the constraints in step (3). If the preceding block is of another type, an assignment can be moved ahead of the predecessor if the following are true:

a) The dispatch time of the assignment statement is less than that of the predecessor.

b) The relation

(I_i ∩ O_j) ∪ (O_i ∩ I_j) ∪ (O_i ∩ O_j) = φ    (3.1)

is satisfied [BER66, RUS69], where I_i and O_i are the input and output variable sets for the assignment statements, I_j and O_j are the input and output variable sets for the predecessor, and φ denotes the empty set. With test (a) having been performed, the test I_i ∩ O_j = φ is redundant. Relation 3.1 thus reduces to

O_i ∩ (I_j ∪ O_j) = φ
(3.2)

6) If the movable node under consideration corresponds to other than a block of assignment statements, it can be moved ahead of its predecessor if the following are satisfied:

a) Its dispatch time is less than that of all statements in the previous block.

b) Relation 3.2 is satisfied, where I_i and O_i are the input and output variable sets for the movable node, and O_j is the output variable set for the predecessor.

7) If any assignment statements were moved in step (5), forward substitute them if possible in their new position. Then continue working them up the path starting at step (2).

8) If the statement moved is not an assignment statement, continue migration with step (3).

9) If no migration was done in steps (5) or (6), attempt migration starting at step (2) with the next node which has not been examined for migration on the path. At a conditional branch, take one of the paths emanating from it and put the identities of the initial nodes of the rest of the paths into a queue.

10) If there is no further node on the path which has not been examined for migration possibilities, take the next node from the queue built in step (9).

11) If the queue is empty, the algorithm is completed.

3.4 Variable Type Identification

We next separate the set of variables referenced in a phase into four classes. Identifying these classes of variables accomplishes two things. First, it identifies those for which interlock instructions must be generated. These variables are required to be accessed by instruction streams in the order in which the streams are started into execution; that is, they are variables for which sequential execution constraints exist. Second, we can identify the private and common sets of variables. The former have to be provided for each Address Counter, while the latter are shared by all Address Counters. The four classes in which we are interested are the following:

1) Constants.
These are storage locations whose values are set outside of the phase under consideration. Constants do not impose sequential execution constraints because they are never assigned values during the phase. Thus they may be accessed by Address Counters in any order.

2) Local variables. During the execution of a phase a separate copy of each of these variables is maintained by each of the active Address Counters. Separate copies are needed since these variables include the ones being used to contain the data from several records undergoing processing concurrently. Local variables do not impose sequential execution constraints, since each Address Counter has its own copy of the Local variable set for the phase and no Address Counter can change the value of another Address Counter's Local variables.

3) Reference Independent variables. The remaining variables in the program are shared by all Address Counters active in the phase. All of them must be protected by interlocks to guarantee that they have correct values when referenced. The Reference Independent variables have less stringent requirements for their use than the Reference Dependent variables described below. Reference Independent variables characteristically are modified during the phase, but their values do not influence the choice of paths through the program. For example, a counter which is incremented for each input record read is of this type. There is no sequential execution constraint generated by the presence of a Reference Independent variable in a phase, since the only place the value can be tested is beyond the range of the phase. Only the final value of the variable must be correct; intermediate values are never examined. We must, of course, require that only one Address Counter at a time have access to the variable, but this can be implemented without including interlock instructions in the program by using the Instruction Dispatch Unit described in section 5.4.
4) Reference Dependent variables. These are the variables for which sequential execution constraints exist. Included in this class of variables are the files used in the phase. No Address Counter is allowed to access one of these variables until the nearest active predecessor to that Address Counter is no longer able to access that variable. The following section of COBOL code demonstrates the need for interlocks on these variables:

1   MOVE DATA INTO PRINT-LINE.
2   IF LINE-COUNTER > 60 THEN
3       WRITE PRINT-FILE FROM PAGE-HEADER
4           AFTER POSITIONING NEW-PAGE LINES,
5       MOVE ZERO TO LINE-COUNTER.
6   WRITE PRINT-FILE FROM PRINT-LINE
7       AFTER POSITIONING 1 LINES.
8   ADD 1 TO LINE-COUNTER.

Since the variable LINE-COUNTER is tested in line 2, we must force Address Counters to access LINE-COUNTER in the order in which the Address Counters were activated, and make each wait until its predecessors no longer alter the value of LINE-COUNTER. Otherwise it could not be guaranteed that each Address Counter would follow its proper path through this section of code.

To accomplish variable classification, we make use of the set of input references, I, and the set of output references, O, for each variable.

Algorithm 3.2 - Variable Type Identification

1) If I and O are both empty, the variable is not referenced during the phase and can be discarded.

2) If the variable is the name of a file, it is considered a Reference Dependent variable.

3) If the variable appears as an argument in a CALL statement, it is considered a Reference Dependent variable.

4) If O is empty, then the variable is never assigned a value during the execution of the phase. The variable is thus a Constant during the phase.

5) If I is empty or I ⊆ O, then the variable is never used in a conditional branch test (i.e., I contains no elements not in O), so that it does not determine the flow of control in the phase. Thus the variable is a Reference Independent variable for the phase.
6) If, for any path through the phase from one primary READ statement to another primary READ statement, the variable appears as an input variable before it appears as an output variable, then it is a Reference Dependent variable.

7) If none of the above conditions is met, the variable is a Local variable.

8) If any item in a record has been made a Reference Dependent variable, then all the items in that record must also be made Reference Dependent, to insure that the whole record is available with the correct values assigned when it is to be written out. Otherwise part of the record could be lost when a predecessor to the outputting Address Counter is deactivated and releases its storage.

Since one and only one of the conditions tested in this algorithm applies to each variable, it is possible to determine uniquely the class to which a variable should be assigned.

3.5 Storage Assignment

Within each block of statements we would like to fetch and store all variables without storage access conflicts. We also would like, while avoiding access conflicts, to assign source and destination locations for a data movement (COBOL MOVE instruction) to the same memory unit, to avoid needlessly using the Inter-Memory Bus. We also would like to group elements of a data structure which can be fetched together into the same memory word. To accomplish these objectives, we assign variables to memory units according to Algorithm 3.3. In this algorithm, the following four sets of variables are constructed for each variable v:

D_v - The Data Division Affinity Set. This is the set of variables which appear in the same record description in the Data Division of a COBOL program. By assigning v to the same word as an element of D_v, we can fetch both items in the same memory cycle, and we can also simplify the transfer of information between the Data Memory and the I/O Processors.

A_v - The Procedure Division Affinity Set.
This is the set of variables with which we would like to group v. Variables are grouped in A_v because of their relation to v in statements in the Procedure Division of the program.

S_v - The Segregation Set. This set consists of variables from which we must separate v in assigning memory units, to avoid access conflicts.

Q_v - The Indefinite Set. This set consists of variables which we would like to put into the same memory word as v if possible; but, if it is not possible, we must place them in separate memory units.

Algorithm 3.3 - Storage Assignment

1) Passing through the Data Division of the program, form D_v for each variable v. Include in D_v all variables appearing in the same record description as v.

2) Passing through the Procedure Division of the program, form the A, S, and Q sets for each variable.

a) If u, v ∈ I (where I is the set of input variables for a block of code), put u into S_v. However, if u ∈ (A_v ∪ D_v), put u into Q_v. Similarly, put v into S_u, unless v ∈ (A_u ∪ D_u), in which case put v into Q_u.

b) If u, v ∈ O (where O is the set of output variables for a block of code), then put u into S_v and v into S_u.

c) If u ∈ I_i and v ∈ O_i (where I_i and O_i are the input and output sets for statement i), then put u into A_v. However, if u ∈ S_v, put u into Q_v. Similarly, put v into A_u unless v ∈ S_u, in which case put v into Q_u.

3) For each variable v examine A_v, S_v, and Q_v.

a) If ∃ u ∈ Q_v and u ∈ S_v, then delete u from S_v.

b) If ∃ u ∈ Q_v and u ∈ A_v, then delete u from A_v.

4) Assign variables to memory units. We do this by examining S_v, A_v, D_v, and Q_v for each variable v in turn.

a) Arbitrarily assign some variable to memory unit 1. A heuristic for selecting this variable is to use the one with the largest S set.

b) For each variable u ∈ S_v and assigned to memory unit m_u, mark m_u unavailable to v.

c) If all of the memory units to which variables have been assigned are marked unavailable to variable v, assign v to a previously unassigned memory unit.
d) For each variable u ∈ A_v and assigned to memory unit m_u, determine whether or not m_u is available to v. Assign v to the memory, in the set of available m_u units, which has the fewest words of the appropriate type (common or Local) assigned.

e) If v is not assigned in step (d), then for each variable u ∈ Q_v assigned to memory unit m_u, compute

L = length(v) + length(u').

We introduce u' to represent u and all other items assigned to the same word as u. That is, if s and t are assigned to the same word,

s' = t' = {s, t}, and
length(s') = length(t') = length(s) + length(t).

If L ≤ length(1 memory word), assign v to the same word as u. Otherwise mark m_u unavailable to v.

f) If v is not assigned in one of the steps above, then for each variable u ∈ D_v and assigned to memory unit m_u, compute

L = length(v) + length(u').

If L ≤ length(1 memory word), assign v to the same word as u.

g) If no assignment is made for a variable during steps (a), (c), (d), (e), or (f), mark the variable "unassigned" and start processing the next variable at step (b).

h) After the first pass through the variables, try again to assign the variables marked "unassigned" in step (g). Iterate this step until either all variables are assigned, or until no variable is assigned during the iteration.

i) If some variables are still unassigned, assign them to memory units in such a way as to balance the number of words used for each type (common and Local) across the memories.

5) Calculate the address function for each variable. We do not present an algorithm here but, instead, offer the following remarks which are germane to this problem.

a) Constants, Reference Independent, and Reference Dependent variables are assigned to the common area of each memory unit, using the same base register. Local variables are assigned to a replicated area using a different base register or set of base registers.
b) It is helpful in assigning memory locations to locate variables which are in each other's affinity sets at the same displacements from the start of a memory word, to allow data to be moved without the need for shifting to align the data with the destination of the move.

c) Since links are transitions from one phase to another, we need instructions in the links to move data items used in both phases from their locations in the storage mapping of the exited phase to their locations in the storage mapping of the entered phase. Generating these instructions and the storage mapping for the links after the storage allocation for the phases has been done should not present any great difficulties.

While this algorithm has generated good storage allocations for the programs we have analyzed, no claim of optimality is made for it.

3.6 Positioning FORK, HOLD, and QUIT Instructions

We define the FORK, HOLD, and QUIT instructions as follows:

FORK - When a FORK instruction is encountered by an Address Counter, it causes another Address Counter to start executing at the program address included in the FORK instruction. We refer to this address as the initiation point for that FORK.

QUIT - When a QUIT instruction is encountered in an instruction stream, it results in the release of the private storage that had been assigned to the Address Counter executing the instruction. That Address Counter then returns to the pool of inactive Address Counters available for assignment to new processing work.

HOLD - If the Address Counter executing a HOLD is the last active Address Counter, it executes the next instruction in its instruction stream. If it is not the only active Address Counter, the location of the HOLD instruction is saved and the Address Counter is released. The last active Address Counter then resumes processing at the instruction following the HOLD instruction after it executes a QUIT instruction.
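The semantics of these three instructions can be loosely sketched in software, with Python threads standing in for Address Counters. This is only an illustrative analogy, not the proposed hardware: the class, its method names, and the use of a saved continuation in place of a saved instruction address are all our own choices.

```python
import threading

class AddressCounterPool:
    """Toy model of FORK / HOLD / QUIT semantics (illustrative only)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.active = 0
        self.hold_continuation = None  # stands in for the saved HOLD location

    def start(self):
        # Register the initial Address Counter that begins the program.
        with self.lock:
            self.active += 1

    def fork(self, initiation_point, *args):
        # FORK: activate another Address Counter at the initiation point.
        with self.lock:
            self.active += 1
        t = threading.Thread(target=initiation_point, args=(self,) + args)
        t.start()
        return t

    def hold(self, continuation):
        # HOLD: if this is the last active Address Counter, simply continue;
        # otherwise save the resume point and release this counter.
        with self.lock:
            run_now = self.active <= 1
            if not run_now:
                self.hold_continuation = continuation
                self.active -= 1
        if run_now:
            continuation()

    def quit(self):
        # QUIT: release this Address Counter; the last one to quit resumes
        # the program at the saved HOLD location.
        cont = None
        with self.lock:
            self.active -= 1
            if self.active == 0 and self.hold_continuation is not None:
                cont, self.hold_continuation = self.hold_continuation, None
        if cont:
            cont()
```

A HOLD executed while other counters remain active parks its resume point, and the final QUIT runs it, mirroring the definitions above.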
Our rules for inserting FORK instructions guarantee that only one HOLD instruction can be executed to leave a phase. Note that the HOLD instruction is not the same as the JOIN instruction defined by Conway [CON63]. In Conway's machine, only the n-th processor to reach a JOIN instruction is allowed to proceed beyond it, where n is set by a FORK instruction. In our machine, the last active Address Counter executes the code following the HOLD instruction.

The idea of using FORK instructions to initiate parallel processing is not a new one [CON63]. Usually it is proposed [OPL65, AND65, WIR66] that the programmer insert these instructions into his code at places he believes will yield correct parallel operation. The type of concurrency we are attempting to exploit, however, leads to very simple rules for inserting FORK, HOLD, and QUIT instructions, so that this can be done by the compiler. Note that we make no assumptions when inserting these instructions about the independence of the processing that may coincide in time during program execution.

Our goal in inserting FORK instructions is to cause the next input record to enter processing as early as possible. Only when a path has been selected to a specific READ statement can the appropriate FORK be executed. A FORK, then, is always located after a conditional branch instruction which selects between paths leading to different READ statements.

In a program involving more than one input file, we select only one of the files as the one from which we concurrently process records. This file is referred to as the primary input file. An initiation point is associated with each READ statement accessing the primary input file. READs of other files are not specially handled. We want, as the primary input file, the file which in some way controls the processing, such as a "finder" deck identifying records to be selected from another file, or an update deck which selects particular records from a master file for updating.
Because this controlling file is typically accessed less often than, say, a master file, a programmer tends to put READ statements for the controlling file into the outer loop of a phase. Heuristically, by selecting the first file encountered in the outermost loop of a phase as the primary file, we did not select the wrong one in any of our sample programs.*

To locate the position at which we want to place the FORK instruction, we use the following algorithm for each primary READ statement.

Algorithm 3.4 - FORK Insertion

1) Starting at the node in the program graph corresponding to the primary READ statement, follow one path within the phase at a time backward. Ignore any node which does not correspond to a conditional branch.

2) At a conditional branch,

a) If the paths leaving the conditional branch lead to more than one different primary READ statement, or any path has not yet been traced, position a FORK instruction on the path we are following at a point immediately after the conditional branch, setting the initiation point address to the appropriate statement as in Algorithm 3.5.

b) If all of the paths from the conditional branch reach the same primary READ, follow back along the path(s) entering the conditional branch as in step (1).

To locate the initiation point for a READ statement, we must examine the node immediately preceding the READ on each path to it.

Algorithm 3.5 - Initiation Point Identification

1) If the block preceding the READ is not a block of assignment statements, then the initiation point address is the address of the READ statement.

2) If the block preceding the READ is a block of assignment statements, we have to split it into two blocks as follows. Since forward substitution was done as a part of the source text scan, each of the statements in the block of assignment statements is independent of the rest.

* We consider the end-of-file option of a READ statement to be a conditional branch instruction.
We can thus reorder them so that all of the statements assigning values to Local variables follow the statements having other types of variables as output variables. This block is then split to put the assignments to Local variables into a separate block which follows the remainder of the block. The initiation point address is the address of this block of assignments to Local variables. This modification of the original block of assignment statements is necessary to insure that all needed initialization is done.

Algorithm 3.6 - QUIT Insertion

1) The QUIT instructions are located after all terminal nodes in the program. Terminal nodes, which include such instructions as STOP and GOBACK, are nodes whose execution on a single Address Counter machine would cause termination of the program.

2) QUIT instructions are positioned immediately before all initiation points.

Algorithm 3.7 - HOLD Insertion

At the entries to links, other than the link from the program entry point, we position HOLD instructions to insure that all of the processing in a phase is completed before the processing in the link and the subsequent phase is entered.

Figure 3.5(a) illustrates a simple program before FORK, HOLD, and QUIT instructions are inserted. Figure 3.5(b) shows the same program after these instructions are inserted.

3.7 Inserting Interlocks

In Chapter 2 it was pointed out that there are three different types of interlocking problems. It was also noted that two of these problems could be solved by the use of an Instruction Dispatch Unit, which is discussed in section 5.4. Since compiler algorithms are not needed to handle these two problems, we need concern ourselves now with only the third type of interlock.

We need to insert two types of instructions for each interlocked variable, RELEASE and TEST.
For each path from a primary READ to a QUIT we want to insert a TEST immediately before the first use of the interlocked variable on that path. Similarly, for each path from a primary READ to a QUIT we want to insert a RELEASE immediately after the last use of the interlocked variable on that path.

[Figure 3.5: FORK, HOLD, and QUIT Insertion Example]

Algorithm 3.8 - Interlock Instruction Insertion

1) Working backward from each QUIT, for each Reference Dependent variable except the primary input file, examine each block for a reference to the Reference Dependent variable. Immediately after the first such block, insert a RELEASE for that variable. At a conditional branch, put a RELEASE immediately after the exit from the conditional branch unless all exits have RELEASEs for the same variable. In this case delete all of those RELEASEs and continue following the path backward. If the READ statement is reached before a RELEASE instruction is positioned on that path, insert it immediately after the READ.

2) Working forward from each initiation point, for each Reference Dependent variable except the primary input file, examine each block for a reference to the Reference Dependent variable. Immediately ahead of the first such block, insert a TEST for that variable.

As an example of this process consider Figure 3.6. We assume in this example that the variable C and the file FILE-O, associated with OUTREC, are Reference Dependent variables. Figure 3.6(a) shows the program after FORK, HOLD, and QUIT instructions have been inserted. Figure 3.6(b) shows the same program after TEST and RELEASE instructions have been inserted. Note in Figure 3.6(b) that only one TEST and one RELEASE have been inserted for each variable on each path.
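The core of this insertion process, restricted to a single straight-line path, can be sketched as follows. This is only an illustration of the TEST-before-first-use and RELEASE-after-last-use rule: the function name and the representation of blocks as sets of referenced variables are our own, and the sketch deliberately ignores the conditional-branch merging of step (1).

```python
def insert_interlocks(path, dep_vars):
    """Insert TEST/RELEASE pseudo-instructions on one straight-line path.

    path: list of blocks, each modeled as the set of variable names
          the block references.
    dep_vars: set of Reference Dependent variable names needing interlocks.
    Returns the path with "TEST v" inserted immediately before the first
    block referencing v and "RELEASE v" immediately after the last.
    """
    first, last = {}, {}
    for i, block in enumerate(path):
        for v in dep_vars & block:
            first.setdefault(v, i)   # remember the first referencing block
            last[v] = i              # keep updating the last referencing block
    out = []
    for i, block in enumerate(path):
        for v in sorted(v for v in first if first[v] == i):
            out.append(f"TEST {v}")
        out.append(block)
        for v in sorted(v for v in last if last[v] == i):
            out.append(f"RELEASE {v}")
    return out
```

Exactly one TEST and one RELEASE appear per variable on the path, matching the observation made about Figure 3.6(b).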
[Figure 3.6: Interlock Insertion Example]

4. PROOF OF THE METHOD

In section 3.4 we demonstrated that it is possible to unambiguously separate the set of variables used in a phase into four subsets. It was further shown that only one subset, the set of Reference Dependent variables, causes the existence of sequential execution constraints. In order to show that our method of executing a program will yield the same output as a single processor sequential machine, we need to prove the following:

1) If our machine and a single processor both access nodes containing references to Reference Dependent variables in the same order, then both yield the same output.

2) The interlocking method we proposed in section 3.7 guarantees the proper sequence of Reference Dependent variable accessing.

In developing these proofs we are concerned with only an individual phase, since this is where we apply our method to achieve a speedup.

4.1 Theorem 4.1

Given the sets of references to Reference Dependent variables in a phase, a single processor machine and a multiple Address Counter machine both yield the same output if both machines execute, in the same order, nodes containing references to Reference Dependent variables.

Consider, first, the processing of two consecutive records as it is done by a single processor machine. The first record follows some path through the program, executing a sequence of nodes:

S_1 = n_11, n_12, n_13, ..., n_1i

until the flow of control in the program causes the next record to be read.
The second record then follows a path through the program executing a sequence of nodes:

    S_2 = n_21, n_22, ..., n_2j

Thus the processing of the two records requires the execution of a sequence of nodes:

    S_12 = n_11, n_12, n_13, ..., n_1i, n_21, ..., n_2j

However, we found previously that not all of the variables involved in the execution of these nodes caused problems in concurrently processing the data. In fact, we need consider only those nodes which contain references to Reference Dependent variables. The sequence of nodes S_12 then reduces to the sequence:

    S = n'_1, n'_2, n'_3, ..., n'_k

when we omit any node in S_12 that involves only Constants, Local variables, and Reference Independent variables. As long as the nodes in set S are executed in the sequence given in S, the results of the execution are the same regardless of what the method is.

4.2 Theorem 4.2

Given the execution sequence S from Theorem 4.1, the interlocking method proposed in section 3.7 preserves this sequence.

For the case of the Reference Independent variables in the program, we do not have to worry that accessing them affects the output unless more than one Address Counter can access the variable at one time. The Instruction Dispatch Unit discussed in section 5.4 prevents this occurrence.

For Reference Dependent variables we must return to the discussion of the required sequence, S, for the execution of nodes containing Reference Dependent variables. Note that the sequence S_12 is composed of two sequences:

    S_12 = S_1 S_2

where S_1 and S_2 are defined as above. Since S is formed from S_12 by dropping nodes with no rearrangement, it follows that S is composed of two subsequences:

    S = S' S''

where S' is contained in S_1 and S'' is contained in S_2. Consider now the sequence of nodes, σ_i, in which a variable, v_i, is referenced, with σ_i contained in S, σ_i' contained in S', and σ_i'' contained in S''. It is apparent that two different Reference Dependent variables, v_i and v_j, can be accessed independently of each other and their sets of nodes, σ_i
and σ_j, can be executed concurrently unless there is a node, n_c, which is common to both sequences. When node n_c is encountered during the execution of one of the sequences, the execution of that sequence is halted until the execution of the other sequence reaches n_c, whereupon both sequences are continued in execution. Thus it is possible to rearrange sequence S into a sequence having three parts. The first part is composed of the portions of σ_i and σ_j preceding n_c, interleaved in any manner. The second part is n_c. The third part is composed of the portions of σ_i and σ_j which follow n_c in S, interleaved in some way. This argument extends in a straightforward way to any number of variables and any collection of nodes in S which are common to more than one of the σ_i sequences.

In order for us to guarantee the correctness of the results of the program while concurrently executing the processing of the records, we must obey the following:

1) We allow the sequences of nodes containing references to different Reference Dependent variables to proceed independently until a node is encountered which contains references to more than one such variable. This node cannot be executed until all sequences in which it appears have reached it.

2) For a given variable, v_i, the execution of the program must preserve the sequence σ_i. In particular, the subsequence σ_i'' must follow the subsequence σ_i'.

The first condition is satisfied by the fact that we do not allow a block to be executed until all of the interlocks on that block have been satisfied. The second condition is met through three features of our technique.

1) Each Address Counter executes its own subset of nodes sequentially, taking advantage of arithmetic parallelism where possible, just as a single Address Counter machine would.

2) The Instruction Dispatch Unit protects the ordering within the sequences σ_i' and σ_i''.
3) The interlocks allow a variable to be accessed by an Address Counter only after that Address Counter's predecessor is finished with the variable, thus guaranteeing that σ_i'' does not start being executed until after σ_i' is finished.

4.3 Discussion

Having shown that the constraints contained in the method we are proposing are sufficient to guarantee the correctness of the results of the program, we now ask if they are necessary. There are two questions which we must investigate.

1) If each Address Counter has access to only its own set of Local variables in addition to the global (or common) variables (a restriction we examine in Chapter 8), can we start Address Counters into operation sooner than we do now?

2) Can we relax or remove some of the interlock constraints?

Considering question (1), we demonstrate that the Address Counters cannot be put into operation any sooner than they are now. Figure 4.1(a) shows a portion of a program graph. The rules for setting our FORK instructions cause a FORK instruction to be placed at the earliest point in the program at which the identity of the next primary READ statement to be executed has been decided. Thus our technique would position the FORKs at the beginning of path 1 and of path 2, immediately after the decision node as shown in Figure 4.1(b). If we attempt to bring another record into processing at any point prior to the locations of the FORKs we clearly get into trouble since we do not know until the decision node has been executed just which READ we execute next and what processing ensues.

As to question (2) concerning removal of interlock constraints, we have already demonstrated the necessity of interlocking Reference Dependent variables. However, we acknowledge that it is not necessary to protect these variables by preventing the whole block in which such a variable is referenced from entering execution.
This decision, presented in section 3.7, is an engineering decision based on the belief that it is more important to prevent a potentially deadlocking condition than to achieve the ultimate in speedup between deadlocks.

[Figure 4.1: FORK placement example, showing FORKs positioned at the start of path 1 and path 2, immediately after the decision node]

5. MACHINE DESIGN

5.1 Over-all Structure

The following features must be available in a machine designed for concurrent record processing:

1) A number of independent program counters to bring about the concurrent execution of different instruction streams.

2) A number of arithmetic units capable of operating independently of one another [FLY72b]. There also must be no correspondence between instruction streams and arithmetic units; an instruction from any instruction stream is executed by any available arithmetic unit.

3) A number of memories which can each be addressed by any of the arithmetic units.

4) A device which prevents instructions which access the same variables from being executed in an incorrect sequence.

The following items are desirable in a machine designed for concurrent record processing:

1) Program and data storage should be kept separate to reduce the problem of access conflicts.

2) Each program counter should have associated with it the necessary logic and registers to calculate the effective addresses of all operands. By the appropriate settings of the base and index registers, each program counter could execute the same instructions but refer to different data storage areas when necessary.

3) A device should be included to supervise the operation of all program counters. Any communications between program counters could travel through this device. When all units are active but requests are generated for the activation of further units, this device could handle the enqueuing of these requests until program counters are available to satisfy the requests.
4) Since much of the activity in a COBOL program is memory oriented (e.g.: the MOVE, TRANSFORM, and EXAMINE verbs), it seems desirable to build into the memory units some processing capability to avoid the necessity of transferring this data back and forth, to and from a processor. Thus, operations which require only one operand could be done in the memory processor, while those operations needing more operands, which would be found in separate memory units, would be handled by separate processing units.

The design we are proposing, shown in Figure 5.1, incorporates the necessary and the desirable features. It is composed of the following units:

1) Address Counters
2) Address Counter Coordinator
3) Instruction Dispatch Unit
4) Processors
5) Program Memory
6) Data Memory
7) I/O Processors
8) Routing Network

The design of each of these units is now discussed, but the various numbers and sizes of units recommended on the basis of our experimental results are deferred until Chapter 7.

5.2 Program Memory

The Program Memory, shown in Figure 5.2, is designed as a hierarchy [KUC70, MAT72] of memory devices. The program comes initially from an external storage medium, such as disk or drum storage, to the Primary Program Memory. Address Counters obtain instructions from the fast Cache memory [BAR72a, CON69, KAP73, MEA71] which holds several segments of the program. The design of this memory is similar to that of the IBM 360/85 memory [LIP68], with the addition of the Fetch Queuing and Routing Unit. This unit allows any of the Address Counters to obtain data from the Cache.
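The segmented Cache just described can be sketched in a few lines. The segment size, the LRU replacement policy, and the class name below are assumptions for illustration, not details given in the thesis.

```python
# Sketch of the Program Memory Cache: a small fast store holding
# several program segments, refilled from the Primary Program Memory
# on a miss. Segment size and LRU replacement are assumed.
from collections import OrderedDict

SEGMENT_WORDS = 64            # assumed segment size

class ProgramCache:
    def __init__(self, primary, segments=4):
        self.primary, self.capacity = primary, segments
        self.segments = OrderedDict()   # segment number -> list of words
        self.misses = 0

    def fetch(self, address):
        seg = address // SEGMENT_WORDS
        if seg not in self.segments:
            self.misses += 1
            if len(self.segments) >= self.capacity:
                self.segments.popitem(last=False)   # evict least recent
            base = seg * SEGMENT_WORDS
            self.segments[seg] = self.primary[base:base + SEGMENT_WORDS]
        self.segments.move_to_end(seg)              # mark recently used
        return self.segments[seg][address % SEGMENT_WORDS]

primary = list(range(1024))   # stand-in for Primary Program Memory
cache = ProgramCache(primary)
assert cache.fetch(130) == 130   # miss: segment 2 is loaded
assert cache.fetch(131) == 131   # hit in the same segment
assert cache.misses == 1
```

The Fetch Queuing and Routing Unit would sit in front of `fetch`, serializing requests from the several Address Counters.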
[Figure 5.1 Over-all Machine Structure]

[Figure 5.2 Program Memory: Primary Program Memory, Cache, Fetch Control Unit, and Fetch Queuing & Routing Unit serving Address Counters 1 through n]

5.3 Address Counters

An Address Counter, shown in Figure 5.3, operates in the following manner:

1) The instruction whose address is in the Program Address Register is fetched from the Program Memory and placed in the Memory Buffer.

2) The Op Code Decoder examines a portion of the operation code of the instruction to determine the instruction type. The six instruction types recognized are unconditional branch, conditional branch, set internal registers, fetch index, Address Counter control, and other.

When an unconditional branch instruction is encountered, the effective address is calculated and inserted into the Program Address Register for the next instruction fetch.

When a conditional branch is found, the effective address of the conditional result set is calculated and the conditional test instruction is sent to the Instruction Dispatch Unit. The Address Counter ID Match Unit is also armed to respond to the appearance on the Index Bus of this Address Counter's identification number. After the IF Tree Processor evaluates the conditional test, it sends out on the Index Bus the Address Counter identification number and the jump displacement from the current instruction to the next instruction to be executed. This displacement is then stored in an index register.
The Address Calculation Unit uses the program address and the jump displacement to compute the address of the appropriate entry in a transfer vector table and inserts this address into the Program Address Register for the next instruction fetch.

[Figure 5.3 Address Counter: Program Address Register, Memory Buffer, Incrementer, Op Code Decoder & Control Logic, Address Counter ID Match Unit, Address Calculation Unit, Base & Index Registers, and Instruction Buffer]

When the operation code indicates that the current instruction loads one of the internal registers, either the operand of the instruction or the contents of the Program Address Register, as the instruction requires, is placed in the selected index register.

When the operation code indicates that the current instruction fetches an item from the Data Memory to an index register, the effective address of the data item is computed and the instruction is passed on to the Instruction Dispatch Unit. The Address Counter ID Match Unit is also armed to respond to the appearance on the Index Bus of this Address Counter's identification number. When the Data Memory puts this identification number and the data item on the Index Bus, the Address Counter loads the appropriate index register from the bus.

An Address Counter must recognize the QUIT, HOLD, and TEST instructions and halt after passing them on to the Instruction Dispatch Unit. At an appropriate time, the Address Counter is restarted by the Address Counter Coordinator.

For any of the other instructions, the effective addresses of the operands are calculated and the instruction, now containing full memory addresses for the operands, is sent to the Instruction Dispatch Unit.

5.4 Instruction Dispatch Unit

Figure 5.4 shows a design for the Instruction Dispatch Unit.
The primary function of this unit is to insure that no instruction is allowed to go into execution until all of its operands have been set to the proper value and are available in the Data Registers of the Data Memory.

[Figure 5.4 Instruction Dispatch Unit: Arriving Instruction Queue (AIQ), Fetch & Tag Generator (FTG), Instruction Waiting Registers (IWR), Processor Instruction Queue (PIQ), Instruction Dispatch Controller (IDC), Tag Status Register Array (TSR), and Tag Queue (TQ)]

The technique for accomplishing this objective was inspired by the method used to solve the same sort of problem in the IBM 360/91 [TOM67] but differs from that solution in several particulars. The method reported by Tomasulo made use of a tag associated with each operand. The tag was attached to the register(s) into which the operand would be placed when it became available and represented the identity of the source of that operand. In our method the tag has no correlation with the identity of the source of the operand. Rather, the tag is the identity of the Tag Register which contains the Data Memory address and status of that operand. The tags are passed around the machine to identify results and operands when needed, with the tag always eventually returning to the Instruction Dispatch Unit as an indication that the associated data item is available for use. The following description of the operation of the Instruction Dispatch Unit explains our technique:

1) An instruction is accepted from the Arriving Instruction Queue.

2) If the instruction is destined for the Address Counter Coordinator, it is immediately dispatched to that unit.
3) For other instruction types (memory and processor instructions), the first operand is sent to the Tag Status Register Array. That unit returns to the Fetch and Tag Generator the identity of the register assigned to the operand and an indication of whether or not the operand was previously in a Tag Status Register.

4) If the operand was new to the Tag Status Registers, the Fetch and Tag Generator issues a fetch request to the memory, sending both the operand address and the register identity, the tag, sent by the Tag Status Registers.

5) Steps (3) and (4) are repeated for a second operand if it exists.

6) The address of the result of the operation is then sent to the Tag Status Register Array and the resulting tag is returned.

7) The instruction with tags appended is moved into an idle Instruction Waiting Register.

8) When the tags for all of the operands return from data memory, the instruction is transferred either onto the Memory Bus or into a Processing Instruction Queue. In the former case the appropriate memory accepts the instruction for processing. In the latter case an idle processor of the appropriate type is selected and the instruction is routed to it.

9) When the operation has completed, the memory sends the tag of the result to the Tag Queue in the Instruction Dispatch Unit.

10) When the tag is processed by the Instruction Dispatch Controller, the corresponding Tag Status Register is released and any instruction for which all the other operands are also available is started into execution.

Logic flow diagrams for the Fetch and Tag Generator, the Instruction Dispatch Controller, and the Tag Status Register Array appear as Figures 5.5 to 5.7.

5.5 Address Counter Coordinator

The Address Counter Coordinator is one unit which could be implemented with a large portion of it contained in the operating system software. It could also, on the other hand, be implemented completely in hardware.
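The tag scheme of steps (1) through (10) in section 5.4 can be sketched in a few lines. The class and method names below are invented, and the memory fetch is reduced to a comment:

```python
# Sketch of tag-based dispatch: each operand of a waiting instruction
# is assigned a Tag Status Register; when the Data Memory signals a
# tag back, that operand is ready, and any instruction whose tags are
# all ready is dispatched.

class InstructionDispatch:
    def __init__(self):
        self.next_tag = 0
        self.tag_for_addr = {}   # Tag Status Register Array, keyed by address
        self.ready = set()       # tags whose data sits in a Data Register
        self.waiting = []        # Instruction Waiting Registers

    def tag_of(self, addr):
        """Steps (3)-(4): return (tag, was_new); a new tag would also
        trigger a fetch request to the Data Memory."""
        if addr in self.tag_for_addr:
            return self.tag_for_addr[addr], False
        tag = self.next_tag
        self.next_tag += 1
        self.tag_for_addr[addr] = tag
        return tag, True

    def accept(self, op, operand_addrs):
        """Steps (5)-(7): tag every operand, then park the instruction."""
        tags = [self.tag_of(addr)[0] for addr in operand_addrs]
        self.waiting.append((op, tags))

    def tag_returned(self, tag):
        """Steps (9)-(10): a tag comes back from memory; dispatch every
        waiting instruction whose operands are now all ready."""
        self.ready.add(tag)
        started, still = [], []
        for op, tags in self.waiting:
            (started if all(t in self.ready for t in tags)
             else still).append((op, tags))
        self.waiting = still
        return [op for op, _ in started]

idu = InstructionDispatch()
idu.accept("ADD", [100, 104])        # operands get tags 0 and 1
assert idu.tag_returned(0) == []     # one operand back: still waiting
assert idu.tag_returned(1) == ["ADD"]  # both tags back: dispatch
```

Note that a second instruction naming address 100 would share tag 0, which is how one returning tag can release several waiting instructions.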
Because the type of implementation of this unit would be affected by many considerations beyond the scope of this paper, no structure is proposed here. Rather, Figures 5.8 to 5.12 give the control sequence for each of the five instructions executed by this unit. The functions given in these figures would have to be executed regardless of the software/hardware proportion of the implementation.

In Figure 5.8 the control sequence of the FORK instruction is given. In the event that all Address Counters are already active, the FORK request is enqueued until one does become available. When an Address Counter is assigned to begin execution at the initiation point specified in the FORK instruction, several things must be done before that Address Counter can begin execution.

[Figures 5.5 through 5.12 occupy the intervening pages; the remainder of section 5.5 and the opening of section 5.6, Processors, are lost, and the text resumes mid-sentence below.]

... arithmetic and logical units. As with past machines intended for business data processing [ADA60], these arithmetic processors should be designed to do decimal, rather than binary, arithmetic.

In addition to the arithmetic processors, several other types are included. The IF Tree Processor proposed by Davis [DAV72b] accepts as input the conditional result set. Each bit in the conditional result set represents the result of evaluating the conditional expression from one IF statement. The processor returns as output the identification of the branch of a conditional tree traversed.
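The conditional result set idea lends itself to a small sketch. The binary-tree encoding below is our assumption for illustration, not the design from [DAV72b]: the bits of the result set select a path through the decision tree, and the resulting leaf index identifies the branch taken.

```python
# Sketch: map a conditional result set (one bit per IF statement along
# the path) to the identification of the branch of the decision tree
# traversed. The leaf-numbering convention is assumed.

def branch_taken(result_bits):
    """Walk a binary decision tree: each bit picks a child, and the
    final leaf index identifies the branch reached."""
    leaf = 0
    for bit in result_bits:
        leaf = 2 * leaf + (1 if bit else 0)
    return leaf

# Two nested IFs evaluated concurrently into a 2-bit result set:
assert branch_taken([False, False]) == 0
assert branch_taken([True, False]) == 2
assert branch_taken([True, True]) == 3
```

In the machine, this leaf index becomes the jump displacement placed on the Index Bus, selecting an entry in the transfer vector table of section 5.3.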
By evaluating all of the conditional expressions concurrently, passing the results to an IF Tree Processor, then executing a small number of assignment statements concurrently, this device allows parallel execution of conditional branches which would otherwise degrade speedup badly [TJA70, RIS72]. Because COBOL programs tend to have large and complex decision trees compared to those in a typical numerical program, and because several data records are undergoing processing concurrently, several of these IF Tree Processors are needed.

Very commonly COBOL programs sort a file on one or more keys contained in each record. Because of this use of the SORT operation and the time-consuming nature of software methods of sorting large amounts of data, there should be a sorting network included as one of the processors. The networks described by Batcher [BAT68] are good candidates for this job.

5.7 Data Memory and Buses

Each Data Memory Unit, shown in Figure 5.13, includes a Primary Memory Module, a Data Register Array, and a Function Unit.

Data items are stored in the Primary Memory Module until requested by a fetch request from the Instruction Dispatch Unit. Fetch requests are enqueued by the Control Logic. The requested data is transferred to one of the Data Registers before it is required by a processor, rather than after as in the case of slave memories [WIL65] or cache memories [LIP68]. Associated with each register is a word in the Address Memory, an associative memory which contains the Primary Memory address of the contents of the register. This address is set during a fetch from the Primary Memory Module or during the transfer of a result value from a processor. To avoid unnecessary Primary Memory fetches, the Address Memory is searched for each fetch request in hopes of recovering previously used data. For each operand request from a processor, it is searched to determine which register contains the requested operand.
When a result is received from a processor, the Address Memory is searched. If a register has already been allocated for this item, the item is placed in that register. Otherwise, the least-recently used register is allocated for this item. The address in the Address Memory is altered only when a new item is to be written into the register. Flag bits are provided for each register to indicate its status. The indications are:

1) Waiting for Request - Data has recently been fetched but has not been requested by a processor.

2) Waiting for Store - A result has been sent by a processor but has not yet been stored in the Primary Memory.

3) Not Waiting - Data has been fetched and has been sent on to a processor, or a store has been completed. The register is available for reassignment.

[Figure 5.13 Data Memory Unit: Primary Memory Module, Data Register Array, Function Unit, Control Logic, Routing Network Interface, and Inter-Memory Bus Interface]

Within each Data Memory Unit a set of special processors is provided to handle those functions which do not require a full processor of one of the types described in section 5.6. These memory processors reduce the demand on the Routing Network. Functions which these memory processors perform include the following:

1) Data Transformation - In COBOL the TRANSFORM statement is used to change all occurrences of one set of characters into another set of characters within a data item. For example, TRANSFORM A FROM '$¢' TO 'DC'. results in the change of all occurrences of the $ character in data item A to the letter D and all occurrences of the character ¢ to the letter C.

2) Character Examination - In COBOL the EXAMINE statement is used to count the number of occurrences of a character within a data item. It can also be used to transform each occurrence of that character to another character.
COBOL also includes tests to determine if a data item is numeric or if it is alphabetic.

3) Counter Incrementing - A very common statement in COBOL programs is ADD 1 TO item. Since we can regard incrementing a value as a monadic operation on that variable, there is no need to route the value through the Routing Network to a processor, perform the operation, and return the result through the Routing Network back to the same memory unit.

4) Another common type of COBOL statement is MOVE SPACES TO item. or MOVE ZEROS TO item., where the number of spaces or zeros is determined by the length of the item. The operation of jamming one of these values into an item could be done by logic built into the Data Register Array [STO70].

In Figure 5.1 it can be seen that communications between memories occur over the Inter-Memory Bus. This time-shared bus is adequate since the storage assignment algorithm described in section 3.5 attempts to keep data items which have an affinity for each other in the same memory unit. Calculations related to the necessary bus bandwidth are given in Chapter 7.

5.8 I/O Processors

In the paper up to this point we have been assuming that reading and writing data records takes very little time, little enough that an I/O Processor can keep up with the demands of several instruction streams. Obviously, a very sophisticated I/O Processor is needed. We are not attempting to design such a processor here since such a design depends heavily on the capabilities of the bulk storage devices with which it interfaces and the technologies available for its implementation. There are some comments we can make regarding such a design, however, which derive from observations of our example programs.

Since some COBOL programs operate on several input and output files, it seems advantageous, in view of the throughput rates needed, to have several I/O Processors.
As long as the number of files being accessed is not greater than the number of I/O Processors, each file should be assigned to a separate I/O Processor.

One way of achieving a very fast I/O rate is to place all of the data in a random access memory. Reading or writing then amounts to a transfer of information from one memory to another. If the file sizes are very large, the amount of such buffer memory needed becomes prohibitively expensive. However, it is apparent that the larger the buffer memory can be made the closer we can approach this ideal. A large memory, filled before program execution starts, could be at least partially refilled while the original data is processed.

When an Address Counter encounters a READ or WRITE instruction which cannot be executed immediately, because of the unavailability of data or buffer space, there are several things that could be done. One is simply to wait until it is possible to proceed. In view of the disparity between machine operation times and rotating storage access times, this approach could be very wasteful of execution resources. A better way is to create an I/O queue and an I/O save area in the Address Counter Coordinator similar to those used for the enqueued HOLD instructions. Subsequent instructions accessing the same file would have to be chained together to insure that they are executed in the proper order, and a mechanism for restarting an instruction stream when the data is available would also have to be implemented; but neither of these problems seems especially difficult.

To keep the amount of data transferred between memories during an I/O instruction small, it seems apparent that the I/O Processor should have a description of the record format. This information allows the following economies:

1) Only those items actually used from an input record would be transferred from an input record to the Data Memory Units.
It is quite common for only a few items, from a large set of items in a record, to be used during the execution of the program. Fillers and unused data items would be discarded.

2) Any constants appearing in output records, such as page headings, could be retained in the I/O Processor, eliminating the need to continually transfer this invariant information between data memory and the I/O Processor.

5.9 Routing Network

It is necessary to provide some method [KUC72a] of allowing any processor to interrogate any Data Memory Unit. Two methods are combined in Figure 5.1. The first, the Switching Network, is a crossbar switch, using a few of the high order bits of the operand to select a path through the network. The second method is a system of time-shared buses connecting groups of Data Memory Units and groups of processors to the Switching Network. The sizes of the groups are determined by the number of memories and processors, and are affected by tradeoffs between the size of the Switching Network, bus bandwidth, and the bus holding times. Some calculations relating to this problem are given in Chapter 7.

5.10 Modifications for Multiprogramming

Thus far we have been assuming that only one program at a time is in execution. We now briefly consider the modifications needed to execute more than one program concurrently.

There need be no changes in the algorithms used in the compiler. All of the FORKs, HOLDs, QUITs, and interlocks operate only between different records being processed by the same program. No variables are shared between different programs, although they may share files of data.

The processors and memories need not be changed, except in memory size, since they are indifferent to the source of the instructions that pass through them.

The Address Counters do not operate differently for different programs. Each is independent of the others except for the handling of interlock conditions.
Since interlock handling is done by the Address Counter Coordinator and not by individual Address Counters, the design of the Address Counter need not be modified to support multiprogramming.

The largest changes must be made in the Instruction Dispatch Unit and in the Address Counter Coordinator. Since instruction streams for different programs are independent, it is possible to route instructions from different programs through different Instruction Dispatch Units. This appears to be a good thing to do since it is the Instruction Dispatch Unit which limits the speed of our machine, as discussed in section 7.1. In order to implement this modification, a switch has to be inserted to allow any Address Counter to send instructions to any Instruction Dispatch Unit. The Address Counter Coordinator would, then, have to be modified to allow it to control this switch and to keep interlocks from one program distinct from those in another program.

6. EXPERIMENTAL RESULTS

6.1 Introduction

Several analyses were made of a sample set of COBOL programs. These programs were obtained from the Student Data Area, the Institutional Area, and the Financial Area in the University of Illinois Office of Administrative Data Processing at the Urbana campus. Most of the programs were considered by the programmers to be typical of the types of processing done at that installation, but some programs were selected because they were atypical and might be expected to tax any method chosen for their execution. While this sample is limited to those programs available from a single university administrative computer center, we feel that they are comparable to programs found in various businesses. For one example, a program which generates a report from a student master file is similar to one which generates a report from a file of an insurance company's policy holders.
As another example, consider the similarity of a program which prints student grade reports and one which prints charge account bills. Of course, there are also many functions common to the business and academic worlds, such as maintenance of inventory records and payroll records. Thus, while our sample is limited, it does seem to be representative of COBOL programs in general.

6.2 Variable Size Counts

One analysis we made was of the frequency of occurrence of variables of various sizes in our sample of 42 programs. Fortunately, the Data Division of a COBOL program contains a description of every variable used in the program. This description contains a PICTURE clause which contains information about the number of characters needed to hold an item's value. Also included in the Data Division are a number of entries with the name FILLER. These are typically used to indicate items in a file which are not accessed, or to hold spaces or text for printout format. Since we discard unused data items, such as those in the former case, and since we hold format information and constant printout text in the I/O Processors, we discarded all FILLER items in this analysis.

A summary of the results of this analysis is given in Table 6.1. It should be noted that some entries in this table have two values. In these cases it was found that one or more of the programs in our sample used an unusually large table (i.e.: a vector or 2- or 3-dimensional array). Had we had a very large sample, we would expect such single occurrences of large counts to be swamped out by the rest of the mass of the data. In our sample of 42 programs, however, such single occurrences swamped out the rest of the counts. To overcome this problem we have deleted the portion of the data attributable to large single tables; but we have given, as the parenthesized value in the table, the counts which do include such tables. A plot of the frequency counts in Table 6.1 is given in Figure 6.1.
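For concreteness, the character size that such an analysis reads off a PICTURE clause can be computed as sketched below. This is our simplification, covering only the symbols X, 9, A, and V with repeat factors, a small subset of full COBOL picture syntax.

```python
# Sketch: character size of a variable from a simple PICTURE clause.
# Handles X (any character), 9 (digit), A (alphabetic), and V (implied
# decimal point, which occupies no storage), with (n) repeat factors.
import re

def picture_size(picture):
    size = 0
    for sym, rep in re.findall(r"([X9AV])(?:\((\d+)\))?", picture.upper()):
        if sym == "V":                    # implied point takes no storage
            continue
        size += int(rep) if rep else 1
    return size

assert picture_size("X(10)") == 10       # ten-character item
assert picture_size("9(5)V99") == 7      # five digits plus two decimals
assert picture_size("XXA9") == 4
```

Applying this to every non-FILLER Data Division entry and tallying by size yields counts of the kind summarized in Table 6.1.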
6.3 Statement Type Counts Another type of analysis we did was count the frequency of occurrence of each type of statement. This analysis was done for 34 of 94 Table 6.1 Frequency Count of Variable Sizes Number Size of (Char) Occurrences 1 4711 (24912) 2 2461 3 2217 ( 3917) 4 2062 (12618) 5 2980 ( 4996) 6 2438 ( 4201) 7 118 8 142 9 285 ( 410) 10 593 ( 1093) 11 27 12 34 13 24 ( 524) 14 28 ( 94) 15 24 16 16 17 14 18 55 ( 73) 19 5 20 103 ( 299) 21 31 22 2 ( 70) 23 24 24 4 25 7 26 2 27 4 28 3 ( 41) 29 8 30 8 ( 28) 31 2 32 33 5 ( 13) 34 1 35 1 36 37 2 38 39 2 40 6 41 2 42 1 43 1 Mean Median Percent per per of Program Program Sample 112 33 25.29 58.7 18 13.21 52.8 13 11.90 49.1 11 11.07 71.0 14 16.00 58.0 7 13.09 2.8 0.63 3.4 1 0.76 6.8 3 1.53 14.1 5 3.18 0.6 0.14 0.8 0.18 0.6 0.13 0.7 0.15 0.6 0.13 0.4 0.09 0.3 0.08 1.4 0.30 0.1 0.03 2.5 1 0.55 0.7 0.17 0.01 0.6 0.13 0.1 0.02 0.2 0.04 0.01 0.1 0.02 0.02 0.2 0.04 0.2 0.04 0.01 - 0.1 0.03 0.01 0.01 - 0.01 - 0.01 0.1 0.03 0.01 0.01 0.01 95 Table 6.1 (continued) Frequency Count of Variable Sizes Number Mean Median Percent Size of per per of (Char) Occu rrences Program Program Sample 44 2 0.01 45 - 46 1 0.01 47 1 0.01 48 4 0.1 0.02 49 4 0.1 0.02 50 4 0.1 0.02 Counts for Sizes > 50 Size Count Size Count 51 1 52 1 54 2 55 2 57 3 58 3 60 8 62 1 63 1 66 27 70 4 72 2 74 2 76 2 80 14 81 8 83 1 84 1 85 2 89 2 90 1 92 1 93 2 100 12 104 1 107 1 109 2 no 1 116 1 120 2 126 1 132 2 133 24 136 1 150 4 162 1 181 1 191 1 203 1 205 1 264 1 287 1 310 1 312 1 379 1 392 1 398 1 405 1 449 1 700 1 1500 1 96 5000 Figure 6.1 Frequency Count Plot Variable Sizes 1 * 1 I l I I I I I I M M I I I | I M I I I M I I ■ 20 25 30 35 40 45 50 10 15 Variable Size (Characters) 97 our programs, since eight of the 42 programs were too large for us to transform into an analyzable form. Table 6.2 gives a summary of the results of this analysis. Figure 6.2 is a plot of the data in Table 6.2 broken down into classes of statements. 
Figure 6.3 is a plot of the data in Table 6.2 broken down into individual statement types.

Table 6.2  Frequency Count of Statement Types

  Statement        Number of     Percent of all   Mean per   Median per
  Type             Occurrences   Occurrences      Program    Program
  All Statements     16945          100            498.4       143
  I/O                 1583            9.3            46.5        26
    READ               134            0.8             3.9         2
    WRITE              900            5.3            26.5         7
    REWRITE              3             -              0.1         -
    ACCEPT              18            0.1             0.5         -
    DISPLAY            242            1.4             7.1         5
    EXHIBIT             56            0.3             1.6         -
    RETURN              22            0.1             0.6         -
    RELEASE             67            0.4             2.0         -
    OPEN                69            0.4             2.0         2
    CLOSE               72            0.4             2.1         2
  Assignment         10735           63.4           315.7        84
    MOVE              8852           52.2           260.3        62
    TRANSFORM          101            0.6             3.0         -
    Arithmetic        1782           10.5            52.4        16
  Control             4037           23.8           118.7        27
    IF                3998           23.6           117.6        27
    PERFORM             39            0.2             1.1         -
  Misc.                590            3.5            17.4         4
    EXAMINE              4             -              0.1         -
    SORT                65            0.4             1.9         -
    ON, AT             492            2.9            14.5         4
    CALL                29            0.2             0.9         -

[Figures 6.2 and 6.3, histograms of statement types by class and by individual type, are not reproduced here.]

As a part of the same analysis, we counted the frequency of occurrence of each of the operators available in COBOL. A summary of this data is shown in Table 6.3. Since incrementing is very common, it was counted separately from other types of ADD instructions. Figure 6.4 presents the data of Table 6.3 broken down by operator class. Figure 6.5 shows the data of Table 6.3 broken down by individual operator type.

Table 6.3  Frequency Count of Operator Types

  Operator      Number of     Percent of all   Mean per   Median per
  Type          Occurrences   Occurrences      Program    Program
  Operators       7391           100             217.4       59
  Arithmetic      2122            28.7            62.4       16
    Increment      390             5.3            11.5        9
    +              934            12.6            27.8        4
    -               50             0.7             1.5        -
    *              375             5.1            11.0        -
    /              341             4.6            10.0        -
    **              15             0.2             0.4        -
  Comparison      4655            62.9           136.9       29
    =             3216            43.5            94.6       21
    <              170             2.3             4.9        2
    >              496             6.7            14.6        3
    NOT =          641             8.7            18.9        5
    NOT <           24             0.3             0.7        -
    NOT >           68             0.9             2.0        -
  Connective       584             7.9            17.2        4
    OR             216             2.9             6.4        2
    AND            368             5.0            10.8        -
  Misc.             30             0.4             0.9        -
    NUMERIC         30             0.4             0.9        -
    ALPHABETIC       -               -              -         -

[Figures 6.4 and 6.5, histograms of operator types by class and by individual type, are not reproduced here.]

6.4 Program Analyses

A number of the programs were examined to determine what sort of speedup our method might actually deliver. In the process, information relating to machine parameters was also generated. Table 6.4 gives a brief summary of the statistics relating to the sizes of the programs that were analyzed. It should be noted that these are not large programs, since the analyses were done by hand. They seem to be typical of the small- to medium-sized programs found at any COBOL computer center.

Table 6.4  Statistics for Analyzed Programs

  Program ID   Data Cards   Variables   Proc. Cards   Statements   Phases
  B7510363        149           60          156           140         1
  15156040        378          202          212           172         2
  15156050         46           94           70            49         2
  15210030        200           75           88            65         1
  15212005        158           66          229           183         3
  SSN512           69          160           81            84         2
  S7510025         68            4            7             5         1
  S7550180        217           85           84            73         2
  S7550181        215           33           73            59         2
  S7550182         68           40          106           101         2
  S7550183        230           90          121           169         2

The following is a list of the programs we analyzed, with a brief description of each one:

B7510363   A report and an error listing are generated from one input file. This program required heavy interlocking.

15156040   From one input file this program prints Avery labels, sorts the file, then prints a report.

15156050   Records from the input file are read, and selected items are copied to an intermediate file which is then sorted.
A report is generated from the sorted file.

15210030   From an old master and finder cards, this program copies the old master into a new master, updating selected records.

15212005   This includes two programs in one. The first one reads one file, edits the input, then outputs a modified file. The second one uses a finder file to select master file records, then outputs records with data from both finder and master records.

SSN512   Data from one input file, with additional data from a master file, is copied to the output file.

S7510025   This program generates an output file of 1350 identical records. It does no input.

S7550180   This program generates an output file from an input file plus information from matching master records.

S7550181   This program is similar to S7550180, but it does more calculation of output data.

S7550182   This program is similar to S7550180, but it copies more data into an output file.

S7550183   This program is similar to S7550182.

As a part of the analysis, storage allocation was done for each phase. In view of the peaks in Figure 6.1 at variable sizes of 5, 10, and 20 characters, an allocation was made for each of these sizes. Table 6.5 presents the number of words needed and the percentage utilization for each of these allocations. The significance of these results is discussed in Chapter 7.

6.5 Program Simulation

After the detailed analysis of a program was complete, its execution was simulated to gather further information about machine parameters and to obtain an estimate of the speedup possible using our method. For this simulation the following rules were used:

1) The effects of storage access conflicts were ignored. This is justified since the multiplicity of Data Memory Units and the use of the Storage Allocation Algorithm of section 3.5 should keep the number of such conflicts small.
2) IF Trees of four or fewer levels were given a concurrent execution time of one unit time, while larger IF Trees were given an execution time of two units of time. All of the IF Trees we found could be arranged in eight or fewer levels. These execution times are in line with Davis's results [DAV72a].

[Table 6.5, giving the number of words needed and the percentage utilization for storage allocations at word sizes of 5, 10, and 20 characters, is illegible in the scanned copy.]

3) Reading records from the primary input file was assumed to take no time. We assume that data from the primary input file is transferred from the I/O Processor to the Data Memory before the READ statement is executed. For statements accessing other files, we assume that it takes one cycle to move the data between the Data Memory and the I/O Processor.

4) Fetching and storing data were assumed to take no execution time. In the case of store instructions, data to be stored is kept in one of the Data Registers in a Data Memory Unit until the Primary Memory has a free cycle. Allotting no time to a fetch of data by a processor is justified on the following grounds. We can consider the Instruction Dispatch Unit to be a "black box" which releases each instruction some time after it arrives from an Address Counter. While the delay is not the same for each instruction, we can say that there is some, unspecified, average delay for each program. We would expect this average delay to be much smaller than the total execution time of the program, hence we ignore this delay in our calculation of program speedup. Since the Instruction Dispatch Unit does not release an instruction until its data is available in a Data Register, and since we expect transfers of data from a Data Register to a processor to take much less than the time needed to execute the operation, it seems reasonable to ignore data fetch time.

5) An exception to Rule (4) was the MOVE operation.
Since a MOVE is nothing more than a fetch followed by a store, it is tempting to say that MOVEs take no time. In view of the large fraction of a COBOL program that MOVEs comprise, however, this would not be realistic. Thus, we used a time of one unit for a MOVE instruction. Further, we assumed that a MOVE CORRESPONDING instruction required one unit of time for each item to be moved.

6) All arithmetic, comparison, and connective operations were assumed to take a single unit of execution time.

7) FORK and RELEASE instructions were assumed to take no execution time. This is justified since these instructions go directly to the Address Counter Coordinator, which executes them while the Address Counter continues on with the next instruction.

8) QUIT instructions were assumed to take one unit of time to execute. This was done to simulate the delay between the start of execution of a QUIT and the time the Address Counter is again available for assignment.
13) Each exit from a conditional branch was assigned a probability of execution according to the following criteria: Ill a) Exits leading to error processing were assigned a probability of zero. While these paths in a program are undoubtedly executed in the real world, they should normally be executed in yery small proportion to the non-error processing. b) Any exit leading to an early termination of the program was assigned an execution probability of zero. c) For some programs, information was available to us in the form of counter values accumulated during the actual execution of the program. From these counter values, rough estimates of execution probabilities for some paths could be made. d) In all other cases, it was assumed that exits from a conditional branch instruction were equally probable. The results of the simulations are given in Table 6.6. The following items are tabulated for each phase: # STMT - Number of statements found within the phase. Note that this is a static, rather than dynamic value, t-, - The amount of time a sequential processor needs to process one record. 
  T_1    - The amount of time a sequential processor needs to execute the segment of the simulated code on which the speedup calculation is based.
  T_p    - The amount of time our concurrent processor needs to execute the same segment of code which generated the value given for T_1.
  p_max  - The maximum number of arithmetic units required to execute the simulated code.
  a_avg  - The average number of Address Counters in use during the execution of the program.
  n      - The maximum number of Address Counters needed during the execution of the program.
  S_p    - The average speedup found from the ratio:

             S_p = T_1 / T_p

Table 6.6  Speedup Results

  Program ID   Phase   # STMT   t_1     T_1        T_p     S_p
  B7510363       1        74     34       34        12      2.8
  15156040       1       105    136     136p        16      8.5p
                 2        22     15      15p         6      2.5p
                All      127    103     151p        22      6.9p
  15156050       1        14     11      11p         5      2.2p
                 2        16     13       24        10      2.4
                All       30     12.3    11p+24     15      0.7p+1.6
  15210030       1        50     42       42         9      4.7
  15212005       1       100     49.5   257.4       13.5   19.1
                 2        36     14.1    14.1        7.0    2.0
                 3         9      9       24         6.0    4.0
                All      145     36.8   295.8       26.5   11.2
  SSN512         1        54     37.5     75        14      5.4
                 2        18     17      17p         4      4.3p
                All       72     32.9    17p+75     18      0.9p+4.2
  S7510025       1         4      4        4         4      1
  S7550180       1        64     34      102        36      2.8
                 2         5      6       6p         6      8p
                All       69     30.0    6p+102     42      1.1p+2.4
  S7550181       1        40     25       50        12      4.2
                 2         8     10      10p         3      3.3p
                All       48     22.0    10p+50     15      0.7p+3.3
  S7550182       1        63     65.5    131         9     14.6
                 2        29     60      120         5     24.0
                All       92     63.5    251        14     17.9
  S7550183       1        93     68.5    137        17      8.1
                 2        47     45      45p         3     15p
                All      140     65.0    45p+137    20      2.3p+6.9

  [The p_max, a_avg, and n columns of Table 6.6 are illegible in the scanned copy.]

In addition, these items are also tabulated for the program as a whole. In generating this information, however, the effect of the execution time of links was ignored. Compared to the execution time of a phase, we expect link execution time to be negligible.
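For reference, the unit-time charges implied by simulation rules (2) through (12) can be collected into a single per-statement cost function. The sketch below is a loose paraphrase of those rules, with illustrative category names that are assumptions of this sketch, not the simulator's actual representation.

```python
# Unit execution times implied by simulation rules (2)-(12).
# The statement-category labels are illustrative assumptions.
def cost(stmt_type, if_tree_levels=0, subscripted=False):
    base = {
        "READ_PRIMARY": 0,   # rule (3): primary-file READs take no time
        "IO_OTHER": 1,       # rule (3): one cycle for other files
        "MOVE": 1,           # rule (5): one unit per MOVE
        "ARITH": 1,          # rule (6): arithmetic/comparison/connective
        "FORK": 0,           # rule (7): handled by the Coordinator
        "RELEASE": 0,        # rule (7)
        "QUIT": 1,           # rule (8): Address Counter reassignment delay
        "TRANSFORM": 1,      # rule (10)
        "ON_AT": 0,          # rule (12): implemented as interrupts
    }
    if stmt_type == "IF_TREE":
        # rule (2): one unit for trees of <= 4 levels, two otherwise
        t = 1 if if_tree_levels <= 4 else 2
    else:
        t = base[stmt_type]
    if subscripted:
        t += 1               # rule (11): effective-address calculation
    return t

print(cost("IF_TREE", if_tree_levels=6))   # 2
print(cost("MOVE", subscripted=True))      # 2
```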
The total values, then, are for all of the phases combined, as follows:

    T_1   = sum (i = 1, 2, ..., N) of T_1,i

    T_p   = sum (i = 1, 2, ..., N) of T_p,i

    p_max = max {p_max,i}    (i = 1, 2, ..., N)

    a_avg = (1 / T_p) * sum (i = 1, 2, ..., N) of a_avg,i * T_p,i

    n     = max {n_i}    (i = 1, 2, ..., N)

    S_p   = T_1 / T_p

where N = number of phases in the program. We have assumed that the same number of records are processed by each phase.

The factor p appearing in some of the entries is defined as the minimum of the number of records available to be processed and the number of Address Counters available to direct the processing. For example, if there are 10,000 records available and 10 Address Counters, a speedup of 5p is a speedup of 50.

7. MACHINE PARAMETERS

To complete the design of our concurrent record processing machine, we present in this chapter the values of various parameters for the machine design proposed in Chapter 5.

7.1 Speed Limitation

By selecting the appropriate numbers of memories and processors, we can supply any memory and computational bandwidth needed. To provide the necessary instruction fetch rate from the Address Counters, we can use replicated interleaved Program Memories, multiple-instruction fetches, and a pipelined Address Counter design. Compared to the instruction rates handled by the Data Memory and the processors, the rate handled by the Address Counter Coordinator should be quite small. Thus, none of these units should limit the speed of our machine.

The speed of the Instruction Dispatch Unit, however, does impose a limitation on the speed in our design. All instructions except branch instructions have to pass through it. Pipelining steps (1) through (7) of the operation of this unit (section 5.4) and executing steps (3) through (6) in parallel by maintaining three copies of the operand address in the Tag Status Register Array should allow us [GUN70, VAD71] to process a single instruction in on the order of 250 nanoseconds. With a five-stage pipeline, this gives us a dispatch time of 50 nanoseconds per instruction.
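The dispatch-rate arithmetic is simple division; a one-microsecond Primary Memory cycle is assumed here, as in the surrounding discussion:

```python
instruction_time_ns = 250   # time to process one instruction (section 7.1)
pipeline_stages = 5
dispatch_ns = instruction_time_ns / pipeline_stages
print(dispatch_ns)                      # 50.0 ns per dispatched instruction

memory_cycle_ns = 1000                  # one-microsecond Primary Memory cycle
print(memory_cycle_ns / dispatch_ns)    # 20.0 instructions per memory cycle
```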
Taking one microsecond as a Primary Memory cycle time, this gives us a throughput rate of about 20 instructions per cycle.

7.2 Number of Address Counters

From the estimated throughput rate for the Instruction Dispatch Unit, it is apparent that 20 Address Counters, each fetching an instruction per cycle, would generate the maximum load the Instruction Dispatch Unit could handle. In order to have a power of two to simplify addressing, we choose 16 as the number of Address Counters in our machine.

7.3 Data Memory Word Size

As a result of our analysis of variable size (Table 6.1) and of storage allocation (Table 6.5), we can select a word length for the Data Memory. Considering Figure 6.1, we note that there are peaks at values of 5, 10, and 20 characters. Clearly, we would prefer to use these sizes rather than, say, 4, 8, or 16. We note, from data presented in Table 6.5, that the efficiency of memory usage is lowest for a word size of 20 characters, much better for a word size of 10 characters, but only marginally improved, beyond that, for a word length of 5 characters.

The choice of word size is made more obvious if we plot the percentage of the variables in our sample having a length less than or equal to a particular size, against that variable size. This is shown in Figure 7.1. As can be seen from this plot, almost 80% of the variables have a length less than or equal to 5, and over 96% have a length less than or equal to 10. On the basis of these statistics, it seems apparent that 10 characters per word is a good Data Memory word length.

[Figure 7.1, the cumulative distribution of variable sizes, is not reproduced here.]

7.4 Data Character Size

Since the number of bits in a character did not enter into any of our analyses, we lack a statistical basis from which we can derive a good character size. Instead, we present the following remarks as justification for our choice.
Judging from the programs we have examined and from discussions with COBOL programmers, it appears that data items are most commonly viewed as character strings. This includes alphabetic information such as a student's name, non-computational numeric data such as his Social Security Number, and computational data such as the number of hours of credit he has accumulated. Furthermore, even high-usage numeric data such as record counters are often retained in character form. During the execution of an IBM/360 COBOL program, this results in much transformation of data from character form to packed decimal form and back again. It seems apparent that the use of a single representation for all data could result in a modest improvement in execution speed by simply avoiding all of these unnecessary transformations. Doing arithmetic with such a representation should prove to be no problem [SCH72].

Another consideration in determining character size is the number of characters needed in the character set. The COBOL programs we examined had no need for a huge set of characters. Some machines [BUR67] executing COBOL programs today have as few as 64 characters in their character sets without ill effect.

Thus, we propose that our machine use 6-bit characters. Further, we propose that this be the only way data is encoded. Our arithmetic units use the low-order four bits of a character as a BCD encoding of a number, signalling an error condition if the high-order bits are not the proper code for a numeric character.

7.5 Number of Data Memory Units

There is no apparent correlation between program size and the number of memory units needed for the program's data. This is shown in Table 7.1 and in Figure 7.2. It is apparent that 16 memory units would be adequate for all but a few programs. The data for these few programs could be made to fit into 16 memories at some sacrifice in execution speed due to memory access conflicts.
However, this degradation would be mitigated through our use of high-speed Data Registers in the Data Memory. Thus, we propose that 16 Data Memory Units be used.

7.6 Size of Data Memory

From Table 7.1 we find that the number of words needed to hold the globally accessible data is small compared to that needed for locally accessible data. The largest number of words needed for local storage by our sample programs is 19 words per memory unit. Allowing for the size of our programs and the growth of memory requirements with program size, 128 words per memory unit per Address Counter seems a reasonable amount. Programs needing more than this will not be able to use all 16 Address Counters, but will be able to execute. This gives a requirement of about 2K* words of local memory. In terms of characters and bits, this would provide 20K characters and 120K bits of storage.

* 1K = 1024; 1M = 1,048,576.

[Table 7.1, Memory Requirements — the number of words of global data, words of local data, and Data Memory Units for each phase of each analyzed program — is illegible in the scanned copy, as is Figure 7.2, a plot of the number of memory units against program size.]

[The opening of section 7.7 is missing from the scanned copy. The surviving fragment lists the largest Financial Area programs for which data was at hand (3348, 3293, and 5627 cards) and this distribution of program sizes:

  Program Size   Number
  <  500 cards     292
  < 1000           147
  < 2000            60
  < 3000            17
  < 4000             5
  < 5000             3
  > 5000             - ]

However, allowing for the fact that our sample programs are small, we propose a Cache size of 1K words.

7.8 Numbers of Processors

7.8.1 IF Tree Processor

Table 6.2 shows that IF statements form about a quarter of the statements in our sample.
If we assume that we can group an average of five IF statements per IF Tree, which is a fairly conservative assumption, then roughly 5% of the operations executed in our machine will be IF Tree execution operations. With 16 instruction streams active, this implies that we need one IF Tree Processor.

7.8.2 Arithmetic/Logical Processors

For the Arithmetic/Logical Processor we look first at Table 6.2 and find that the Arithmetic and IF statements, in which the operators of Table 6.3 are found, comprise 34.1% of the program. Comparing the number of operators with the number of Arithmetic and IF statements, we find an average of 1.2 operators per statement. This implies that almost 41% of the instructions in the program require an Arithmetic/Logical Processor.* Doing a similar calculation for memory-only operations (MOVE, Increment, etc.), we arrive at an estimate that about half the operations involve only the memory.

* For this calculation we have not included the Increment operator and the NUMERIC and ALPHABETIC tests, which are executed in the Data Memory.

Next, we consider the interaction between the memories and the processors. Using access times and cycle times of an existing machine as a guide [IBM73a, IBM73b], it seems reasonable to assume the following operation times:

  Fetch from Primary Memory to a Data Register   1000 nsec.
  Transfer between Data Register and processor    200 nsec.
  Add instruction time                            600 nsec.

With operation times on this order, it is apparent that a pair of memory units could keep a pair of processors supplied with data and still have time for two to three memory-only operations per Primary Memory cycle. This is roughly the 40% : 50% proportion of Arithmetic/Logical to memory-only operations we found in the first part of this section. Thus, there should be approximately the same number of Arithmetic/Logical Processors as Data Memory Units.
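The "almost 41%" figure used above follows directly from the statement and operator counts; the values below are taken from Tables 6.2 and 6.3:

```python
arith_if_fraction = 0.341   # Arithmetic (10.5%) + IF (23.6%) statements, Table 6.2
operators_per_stmt = 1.2    # average operators per Arithmetic/IF statement
alp_fraction = arith_if_fraction * operators_per_stmt
print(round(alp_fraction * 100, 1))   # 40.9 -> "almost 41%" of instructions
```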
To make the total number of Arithmetic/Logical and IF Tree Processors a power of two, we choose 15 as the number of Arithmetic/Logical Processors.

7.8.3 I/O Processors

For most programs we would like to be able to assign an I/O Processor to each file in use in a phase. For our sample of 42 programs we found that the average number of files used per phase was 3.4, with a maximum of nine. Since the larger programs tended to use more files, about six I/O Processors are needed. To get a power of two for addressing purposes, we select eight as the number of I/O Processors.

7.8.4 SORT Processor

None of the programs we examined had more than one file being sorted at one time. Thus, one sorting network should be sufficient.

7.9 Instruction Dispatch Unit Memory Sizes

There are two memories in the Instruction Dispatch Unit whose sizes we must find. These are the Tag Status Register Array and the Instruction Waiting Register Array. The following analysis yields these sizes.

Let 1 unit time = 1 memory cycle time = 1 add time.

Assume:

1) n = number of Address Counters.

2) m = number of memory units.

3) p = number of Arithmetic/Logical Processors + number of IF Tree Processors.

4) Each Address Counter issues r instructions per unit time at highest speed.

5) Each Address Counter is fetching instructions for a fraction f_f <= 1 of the time.

6) The instruction stream contains the following fractions of each instruction type:

     f_A   Arithmetic
     f_M   Memory-only
     f_C   Address Counter Control
f 7 - 9 " 1 ' 7) The number of operands for instructions reaching the Instruction Dispatch Unit are: 3 Arithmetic 2 Memory 1 Address Counter Control 8) Average holding times for tags are: Arithmetic instructions Sources 2 Results 3 Memory-only instructions Sources 1 Results 2 9) Average holding times in the Instruction Dispatch Unit for instructions are: Arithmetic 2 (1 fetch, 1 previous operation) Memory 1 (1 previous operation) 10) Consider a condition in which all Address Counters are active, with conditional and unconditional branch instructions being encountered, but no interlocks inhibiting any of the Address Counters. Considering the machine to be represented by the model in Figure 7.3, we define: 129 2 o o < o o 'c ID o E ZE o ro CD CD = b 5 -a o CD +-> q; i_ cu 4- t/» c c o (_> 3 S- +-> if) ^r >/ ^i k^ >v << ^<" CM • • • c ~1 I 1 1 i i «^ ••- •^ -< ^< y< v> M w « «- 2 ° E >. 2 I a. 2E 130 X f = rate at which instructions are being fetched from Program Memory by each Address Counter - f - x (maximum fetch rate) X. = rate at which instructions are being transferred from each Address Counter to the Instruction Dispatch Unit A, = rate at which instructions reach the Instruction Dispatch Unit = rate at which instructions leave the Instruction Dispatch Unit A c = rate at which Address Counter Coordinator instructions leave the Instruction Dispatch Unit A p = rate at which processor instructions leave the Instruction Dispatch Unit A = rate at which processor instructions reach each processor A M = rate at which memory-only operations leave the Instruction Dispatch Unit A = rate at which memory-only operations reach each Data Memory Unit Since we are assuming that interlock instructions are not encountered, we neglect the contribution of Ap. 131 Since each processor is capable of executing an operation each cycle, A p < p operations per cycle. 
Since each Data Memory Unit can handle about twice as many memory-only operations each cycle (in addition to processor operations) as there are memories X M <_ 2m operations per cycle. Thus, L < 2m + p operations per cycle. (7.9-2) Comparing this result with the discussions of sections 7.1, 7.6, and 7.8 ; we see that Xj = 20 m = 16 p = 16 which does satisfy relation 7.9-2. Since we are assuming that interlocks do not interfere with Address Counter functioning, the only things that do interfere are branches, both conditional and unconditional. If we assume that a con- ditional branch takes one unit of time to be resolved, then a fraction 1 - fx - f B of the instructions fetched are passed on to the Instruction Dispatch Unit. Thus, x i - (1 - f, - f B ) x f negl . ■ (f A + f M + X> X f Since the instruction streams coming from each Address Counter are inde- pendent (by our assumption of no interlock activity, the rate at which 132 instructions reach the Instruction Dispatch Unit is n times the rate instructions are issued by an Address Counter. h = nA i = nX f (f A + f M ) (7.9-3) The rate at which instructions leave the Instruction Dispatch Unit is equal to the rate at which they arrive. Clearly, this must be true or infinite queues would be needed. From relation 7.9-3 the arrival rate for tag requests is A T = nX f (3f A + 2f M ) using assumption (7). The number of tags in use at any point in time is t Y (number of arrivals at t.) n t = y i=k types of requests where t. is the smallest average time such that there are no tags remaining at time zero from those that arrived t. units of time earlier. 
From assumption (8) we have

    N_T = n (2 + 2 + 3)(3 f_A lambda_f) / 3 + n (1 + 2)(2 f_M lambda_f) / 2
        = 7 f_A lambda_f n + 3 f_M lambda_f n
        = n lambda_f (7 f_A + 3 f_M) .                      (7.9-4)

In a similar manner we find that the number of instructions waiting, per unit time, for their operands to become available is

    N_I = n lambda_f (f_A + f_M) .                          (7.9-5)

We have previously calculated the following values:

    n = 16 Address Counters
    f_A ~ 0.4
    f_M ~ 0.5
    lambda_I = 20 operations per cycle

From equation 7.9-3 we find

    lambda_f = lambda_I / (n (f_A + f_M)) = 20 / (16 x 0.9) = 1.4

operations per cycle per Address Counter. From equation 7.9-4 we then have

    N_T = 16 (1.4)(7 x 0.4 + 3 x 0.5) = 96.3 tags in use.

From equation 7.9-5 we find

    N_I = 16 (1.4)(0.9) = 20.2 instructions waiting.

Clearly, 128 Tag Status Registers and 32 Instruction Waiting Registers are adequate. The total sizes of these two memories are found as follows:

1) For each Tag Status Register we need three content-addressable address fields, each of which is comprised of 15 bits (log2(32K)). Thus we need a 128 word x 48 bit content-addressable memory for the Tag Status Register Array.

2) Each Instruction Waiting Register is comprised of the following fields:

     a) Instruction Operation Code       ( 8 bits)
     b) Operand and result addresses     (3 x 15 bits)
     c) Operand and result lengths       (3 x 6 bits)
     d) Operand and result tags          (3 x 5 bits)
     e) Status bits                      ( 2 bits)

   This gives a total of 88 bits, of which the 15 tag storage bits must be content addressable. Thus, the Instruction Waiting Registers can be constructed of 32 words x 73 bits of random-access memory and 32 words x 15 bits of content-addressable memory.

7.10 Other Devices

Within each unit of the machine there are a number of devices which we have not yet discussed. To allow us to estimate in section 7.11 the number of packages needed to build the machine, we briefly describe, in this section, the circuitry for each unit.
Not included in this section, however, are the control circuitry for each unit and the bus drivers and receivers needed for the various control signals sent between units. These are included in the calculations of section 7.11.

7.10.1 Program Memory

In addition to the memory itself we need the following devices in the Program Memory:

Queue
  - 16 words × 16 bits to hold instruction addresses (12 bits) and Address Counter numbers (4 bits) in the Fetch Queuing and Routing Unit.

Decoder
  - 4 bits to 1-of-16 for routing instructions to the proper Address Counter.

Bus Drivers and Receivers
  - 68 drivers for sending instructions to Address Counters.
  - 12 receivers for instruction addresses.

7.10.2 Address Counter

Each Address Counter requires the following devices:

Incrementer
  - One 12 bit counter.

Registers
  - One 12 bit Program Address Register.
  - One 68 bit Memory Buffer.
  - Eight 15 bit index registers.

Decoder
  - 3 bits to 1-of-6 for the Op Code Decoder.

Matcher
  - 4 bit matcher for the Address Counter ID Match Unit.

Adder
  - Three 15 bit adders for the Address Calculation Unit.

Bus Drivers and Receivers
  - 12 drivers for sending the program address to the Program Memory.
  - 71 drivers for sending the instruction to the Instruction Dispatch Unit.
  - 15 drivers for transferring index registers to the Address Counter Coordinator.
  - 68 receivers for instructions from the Program Memory.
  - 12 receivers for the initiation point address from the Address Counter Coordinator.
  - 15 receivers for index register values from the Address Counter Coordinator.
  - 19 receivers for the Index Bus.

7.10.3 Instruction Dispatch Unit

In addition to the Instruction Waiting Registers and the Tag Status Register Array, the following devices are needed in the Instruction Dispatch Unit:

Queues
  - 16 words × 71 bits for the Arriving Instruction Queue.
  - 48 words × 5 bits for the Tag Queue.
  - 8 words × 76 bits for the Processor Instruction Queue.
Registers
  - One 16 bit register for processor status information.
  - Five 71 bit registers used for pipelining the operation of the Fetch and Tag Generator.

Bus Drivers and Receivers
  - 76 drivers for the Memory Operation Bus.
  - 71 drivers for sending instructions to the Address Counter Coordinator.
  - 76 drivers for the Processor Operation Bus.
  - 71 receivers for instructions from the Address Counters.
  - 5 receivers for the Tag Bus.

7.10.4 Address Counter Coordinator

A detailed design for this unit was not presented in Chapter 5; thus it is difficult to specify precisely the necessary devices for the unit. If we assume an implementation completely of hardware, however, it is apparent that devices of the following types would be needed:

Queues
  - 8 words × ~32 bits for the FORK queue.
  - 2 words × ~32 bits for the HOLD queue.

Registers
  - Sixteen 16 bit registers for interlock status information.
  - Three 16 bit registers for Address Counter status information.
  - Sixteen 8 bit registers for predecessor/successor information.
  - Nine 15 bit registers for saving index registers for a HOLDing Address Counter.
  - Fifteen 4 bit registers for the list of interlocks whose RELEASE is being awaited.

Bus Drivers and Receivers
  - 12 drivers for sending an initiation point address to an Address Counter.
  - 15 drivers for sending index register values.
  - 15 receivers for receiving index register values from an Address Counter.
  - 71 receivers for instructions from the Instruction Dispatch Unit.

7.10.5 Data Memory Unit

The Data Memory Unit includes the following devices:

Queue
  - 2 words × 76 bits for the Memory Operation Bus.

Incrementer
  - A 10 BCD digit counter.

Match Circuits
  - Ten 6 bit matchers in the Function Logic for EXAMINE and TRANSFORM operations.

Register
  - One 6 bit register associated with the match circuits.

Bus Drivers and Receivers
  - 19 drivers for the Index Bus.
  - 60 drivers for the Inter-Memory Bus.
  - 60 drivers to the Routing Network.
  - 60 drivers for the I/O Bus.
  - 5 drivers for the Tag Bus.
  - 60 receivers for the Inter-Memory Bus.
  - 60 receivers from the Routing Network.
  - 60 receivers for the I/O Bus.
  - 76 receivers for the Memory Operation Bus.

7.10.6 Routing Network

The Routing Network can be implemented as a 16 × 16 crossbar switch. For 60 lines per path through the network, we then need a total of 15,360 crosspoints in the network.

7.10.7 Arithmetic/Logical Unit

Each Arithmetic/Logical Unit has the capability of operating on a pair of 10 digit BCD numbers. In addition, logic is included to interpret two low-order digits as an exponent to simulate floating point operations. This requires the following devices for each of these units:

Adder
  - 10 BCD digits.

Multiplier
  - 10 BCD digits.

Logical Unit
  - Ten 6 bit characters.

Shift Register
  - 8 BCD digits.

Bus Drivers and Receivers
  - 60 drivers to send to the Routing Network.
  - 60 receivers to receive from the Routing Network.
  - 76 receivers to receive from the Instruction Dispatch Unit.

7.10.8 IF Tree Processor

The design of the IF Tree Processor has been detailed by Davis [DAV72b: page 32] and is not repeated here.

7.10.9 I/O Processor

No design for an I/O Processor is given in Chapter 5. In order to arrive at an estimate of the hardware required for each of these units, we first examine the throughput required of a unit. From Table 6.6 we find that the average time spent processing a record, for phases able to use p Address Counters, is on the order of 10 units of time. During this time we can have 16 records being processed concurrently. If we assume that each READ statement results in the transfer to the Data Memory of the equivalent of an entire card image of useful data, then we must transfer 480 bits for each READ. If we assume that a unit of time is about 1 microsecond, we have

    R = (480 bits/READ)(16 READs) / 10 μsec = 768 Mbits/sec

for the rate at which an I/O Processor must supply data to the Data Memory.
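The rate R can be checked with a few lines of Python. This is our own sketch; the figures are the assumptions stated above, and the variable names are ours:

```python
# Check the I/O Processor supply rate R of section 7.10.9.
bits_per_read = 480      # one card image of useful data per READ
records_in_process = 16  # records being processed concurrently
record_time_us = 10      # ~10 time units per record, at ~1 microsecond each

# Bits per microsecond equals megabits per second.
rate_mbps = bits_per_read * records_in_process / record_time_us
print(rate_mbps)          # 768.0 Mbits/sec

# Sending a full 480-bit card image per transmission:
trans_per_sec_M = rate_mbps / bits_per_read
print(trans_per_sec_M)    # 1.6 million transmissions per second
```

With only a quarter of each raw record useful, the file-side rates are four times these figures, as derived in the text.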
Transmitting all data in parallel, the I/O Processor must execute 1.6 M transmissions per second. With all eight I/O Processors active (a rare occurrence), the I/O Bus must handle 12.8 M transmissions per second. If three quarters of each record read is discarded by the I/O Processor, then each file must supply data to its I/O Processor at four times the rate the I/O Processor supplies data to the Data Memory. The data rate required from a file is then 6.4 M transmissions per second, or 3072 Mbits per second.

We can achieve this transfer rate if we can obtain large head-per-track disks capable of reading five words (300 tracks) in parallel at a rate of 10 Mbits per second. Such disks are not yet available, but bulk storage devices which have the necessary bandwidth, notably those built of semiconductor devices, are becoming feasible.

If we can obtain large head-per-track disks which can supply data at high rates, then the amount of buffer storage needed in the I/O Processor can be quite small. We will assume for the purposes of this section that the I/O Processor is composed of a register for the current five words of the current record, a register for data to be transferred to the Data Memory, a memory containing record format information, and a network to route ten characters at a time to one of the words in the buffer register. On this basis it appears that the following devices are needed for each I/O Processor:

Memory
  - 256 words × 60 bits of format memory.

Registers
  - One 300 bit record buffer.
  - One 960 bit data buffer.

Crosspoints
  - A 50 × 50 array of crosspoints, where each crosspoint carries six bits in parallel.

Bus Drivers and Receivers
  - 960 drivers for the I/O Bus, to supply all 16 Data Memory Units simultaneously.
  - 300 drivers to send data to a file.
  - 960 receivers for the I/O Bus.
  - 300 receivers to receive data from a file.

7.10.10 SORT Network

As in the case of the I/O Processor, no design has been proposed for the SORT Network.
Consequently, only a rough estimate of the required hardware is given here. By supplying the SORT Network with data from the Data Memory Units, we can sort 16 words at a time. From Batcher's results [BAT68] the number of comparators needed to sort 2^p numbers is

    N = (p² + p) 2^(p−2) = (16 + 4)(4) = 80

for our case (p = 4), arranged in 10 levels with 8 comparators in each level. Thus we need eighty 60 bit comparators in the network. We also need drivers and receivers for sixteen 60 bit words.

7.11 Package Counts

In this section we present an estimate of the number of packages of circuitry needed to build this machine. The packages we use for these counts are the Dual In-line Packages available from a number of manufacturers. For those cases for which appropriate devices exist, we have used them for these counts. For those cases for which we found no existing device, we estimated the number of packages over which it would be necessary to split the device. Further comments are included in the discussion of each type of device.

7.11.1 Memories

For the purposes of counting memory packages we have different package sizes. For slow bulk memory we assume a package contains up to a 4096 × 1 bit array. For fast register memory we assume a 1024 × 1 bit array per package. For content-addressable memory we assume a 16 × 1 bit array per chip. For a small memory, such as the Data Register set in a Data Memory Unit, we use a 64 × 4 bit array. For individual registers we use a 4 bit register chip. Table 7.3 gives a summary of the memory package requirement.
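The per-row package counts in Table 7.3 follow mechanically from these chip geometries. The following Python sketch is ours; the chip table encodes the assumptions just listed, and the helper name is our own:

```python
from math import ceil

# Chip geometries assumed in section 7.11.1: (words, bits) per package.
CHIPS = {
    "slow":  (4096, 1),   # slow bulk memory
    "fast":  (1024, 1),   # fast register memory
    "cam":   (16, 1),     # content-addressable memory
    "small": (64, 4),     # small memories such as the Data Register set
}

def packages(words, bits, chip):
    """Packages needed to tile a (words x bits) memory with the given chip."""
    chip_words, chip_bits = CHIPS[chip]
    return ceil(words / chip_words) * ceil(bits / chip_bits)

# Reproducing a few rows of Table 7.3:
print(packages(4096, 68, "slow"))   # Program Memory: 68 packages
print(packages(2048, 60, "slow"))   # Data Memory, one unit: 60 packages
print(packages(64, 11, "cam"))      # Data Memory CAM: ceil(64/16) * 11 = 44
print(packages(128, 48, "cam"))     # Tag Status Register Array: 8 * 48 = 384
# The 64 x 60 Data Register set is listed at 15 packages, which matches the
# 64 x 4 "small memory" chip rather than the 1024 x 1 fast chip:
print(packages(64, 60, "small"))    # 15
```

The same rule, applied row by row with the appropriate chip, yields the totals quoted below Table 7.3.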
Totaling the number of packages, we have the following requirements:

    Slow Memory                  1028 packages
    Fast Memory                  3927 packages
    Content Addressable Memory   1120 packages

Table 7.3 Memory Package Requirements

    Unit                          Memory Size   Type   Pkgs/Unit   # Units   # Pkgs
    Program Memory                4096 × 68     Slow       68          1        68
                                  1024 × 68     Fast       68          1        68
    Data Memory                   2048 × 60     Slow       60         16       960
                                    64 × 60     Fast       15         16       240
                                    64 × 11     CAM        44         16       704
                                     6 × 1      Fast        2         16        32
    Address Counter                 12 × 1      Fast        3         16        48
                                    68 × 1      Fast       17         16       272
                                    15 × 8      Fast        2         16        32
    Instruction Dispatch Unit      128 × 48     CAM       384          1       384
                                    73 × 32     Fast       16          1        16
                                    15 × 32     CAM        32          1        32
                                    16 × 1      Fast        4          1         4
                                    71 × 5      Fast        4          1         4
    Address Counter Coordinator     16 × 16     Fast        4          1         4
                                    16 × 3      Fast        1          1         1
                                     8 × 16     Fast        4          1         4
                                    15 × 9      Fast        2          1         2
                                     4 × 15     Fast        4          1         4
    Arithmetic/Logical Unit       8 digit shift reg.  Fast  8         15       120
    IF Tree Processor              255 × 1      Fast       64          1        64
                                    12 × 1      Fast        3          1         3
                                    34 × 1      Fast        9          1         9
    I/O Processor                  256 × 60     Fast       60          8       480
                                     1 × 300    Fast       75          8       600
                                     1 × 960    Fast      240          8      1920

If we can obtain 16 bit register chips for the I/O Processor registers, then the number of packages needed for fast memory would be reduced to 2037 packages.

7.11.2 Queues

In order to allow various units in the machine to operate asynchronously from one another, we have included a number of queues. For the purposes of Table 7.4, we have assumed that a single package contains four bits of storage plus logic to control the first-in/first-out operation of the queue. From Table 7.4 it can be seen that we need 1072 of these packages.

7.11.3 Bus Drivers and Receivers

Because we have a number of buses connecting units in the machine, we are able to have a number of items of data and a number of instructions in transit simultaneously. For this same reason we require a number of bus drivers and receivers. We assume that we can have four drivers/receivers per package for the purposes of Table 7.5. From this table we find that we require 5349 packages.

7.11.4 Other Devices

A number of other devices are needed to complete the machine.
These are summarized in Table 7.6. The number of packages needed for these other devices is 7380.

7.11.5 Total Package Requirement

Adding together the numbers of packages computed in this section, we find that we need 19,876 packages for the devices we have discussed.

Table 7.4 Queue Package Requirements

    Unit                          Queue Size   Pkgs/Unit   # Units   # Pkgs
    Program Memory                 16 × 16         64          1        64
    Instruction Dispatch Unit      16 × 71         72          1        72
                                   48 × 5          96          1        96
                                    8 × 76        152          1       152
    Address Counter Coordinator     8 × 32         64          1        64
                                    2 × 32         16          1        16
    Data Memory Unit                2 × 76         38         16       608

Table 7.5 Bus Driver and Receiver Package Requirements

    Unit                          # Drivers   # Receivers   # Units   Total Pkgs
    Program Memory                     68           12          1          20
    Address Counter                    98          114         16         800
    Address Counter Coordinator        27           86          1          25
    Instruction Dispatch Unit         223           76          1          75
    Data Memory Unit                  185          196         16        1120
    Arithmetic/Logical Unit            60          136         15         510
    I/O Processor                    1260         1260          8        2520
    SORT Network                      960          960          1         240
    IF Tree Processor                  19           36          1          39

[Table 7.6, Other Device Package Requirements, is illegible in this copy, as are the opening pages of the following chapter; the example below resumes partway through its listing.]

    [...]   OLD-SEQ
        THEN MOVE CORRESPONDING IN-DATA TO OUT-DATA
             WRITE OUT-DATA
        ELSE DISPLAY IN-SEQ ' SEQ ERROR'.
    MOVE IN-SEQ TO OLD-SEQ.
    GO TO LOOP.

Under the constraint that each Address Counter has access only to its own Local variable set, this loop is completely sequential. The problem is that OLD-SEQ cannot be accessed by Address Counter i until Address Counter i-1 has set its proper value. Relaxing this constraint, let us substitute IN-SEQ(i-1) for OLD-SEQ, and use IN-SEQ(i) where the program indicates IN-SEQ. Now, by allowing Address Counter i to access IN-SEQ(i-1), we can process the input records concurrently.
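The effect of the substitution can be seen in miniature outside COBOL. In this Python sketch (entirely ours; the ">" comparison stands in for the sequence test, whose exact form is lost in this copy of the example), each record carries its own copy of IN-SEQ, and record i consults its predecessor's copy instead of the shared OLD-SEQ:

```python
# A shared OLD-SEQ serializes the loop: record i cannot run until record
# i-1 has moved its IN-SEQ into OLD-SEQ.  Giving each record its own copy
# of IN-SEQ, and letting record i read its predecessor's copy, removes the
# loop-carried dependence.

records = [
    {"IN-SEQ": 10, "IN-DATA": "A"},
    {"IN-SEQ": 20, "IN-DATA": "B"},
    {"IN-SEQ": 15, "IN-DATA": "C"},   # out of sequence
    {"IN-SEQ": 30, "IN-DATA": "D"},
]

def process(i, prev_seq):
    """Body of the LOOP paragraph with OLD-SEQ replaced by prev_seq,
    the predecessor's private IN-SEQ (None for the first record)."""
    rec = records[i]
    if prev_seq is None or rec["IN-SEQ"] > prev_seq:
        return ("WRITE", rec["IN-DATA"])
    return ("SEQ ERROR", rec["IN-SEQ"])

# Each call depends only on records[i-1]["IN-SEQ"], which is known as soon
# as record i-1 has been read; all records could be processed concurrently.
results = [process(i, records[i - 1]["IN-SEQ"] if i > 0 else None)
           for i in range(len(records))]
print(results)
```

Every `process(i, ...)` call is independent of the others' results, which is exactly the property the Address Counters need in order to run the loop bodies in parallel.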
In order to allow an Address Counter to access its predecessor's Local variable set, the following conditions must exist:

1) It must be possible to replace the Reference Dependent variable causing the problem (OLD-SEQ in the example) with the same Local variable in all cases.

2) The phase being transformed by replacement must contain only one primary READ statement.

3) Every path within the phase must contain an assignment of the value of the Local variable to the variable we want to remove.

4) There is no assignment statement in the phase for which the Local variable is an output variable.

These conditions can be identified as follows:

1) Examine all occurrences of a Reference Dependent variable as an output variable. If it is always assigned a value from the same Local variable, the Reference Dependent variable is a candidate for removal.

2) The presence of only one primary READ statement can be detected during the source text scan (section 3.1).

3) Each path in the phase must be traced, starting at the primary READ. If the primary READ is again encountered, this attempt at replacement must be abandoned. If a statement causing the assignment of the value of the Local variable to the Reference Dependent variable is reached, the path satisfies condition (3) above.

4) By finding that the set of output references for the Local variable is empty, we know that no assignment is made to the Local variable, satisfying condition (4) above.

If the Reference Dependent variable can be removed, we can delete the statements in which the variable to be removed occurs as an output variable. In each statement in which the removed variable appears as an input variable, we substitute the preceding Address Counter's copy of the Local variable. To give an Address Counter access to its predecessor's copy of the Local variable, we use a different index register for the base address of the predecessor's storage than for the Address Counter's own storage.
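The four tests above can be phrased in terms of the reference sets the compiler already maintains. The following Python sketch is ours; the data layout and names are illustrative, not the thesis algorithms:

```python
# Candidate test for removing a Reference Dependent (RD) variable, following
# the four conditions above.  'refs' maps a variable to the statements that
# read ("in") and write ("out") it; 'assigned_from' lists, for each
# assignment to the RD variable, the variable whose value was copied.

def removable(local_var, refs, assigned_from, n_primary_reads,
              every_path_assigns):
    # (1) every assignment to the RD variable copies the same Local variable
    if any(src != local_var for src in assigned_from):
        return False
    # (2) the phase contains exactly one primary READ (from the source scan)
    if n_primary_reads != 1:
        return False
    # (3) every path from the primary READ reaches such an assignment before
    #     re-encountering the READ (established by tracing the phase graph)
    if not every_path_assigns:
        return False
    # (4) the Local variable is never an output variable in the phase
    if refs[local_var]["out"]:
        return False
    return True

# The OLD-SEQ / IN-SEQ situation of the earlier example: OLD-SEQ is always
# assigned from IN-SEQ, there is one primary READ, every path performs the
# MOVE, and IN-SEQ is never written within the phase.
refs = {"IN-SEQ": {"in": ["the IF", "the MOVE"], "out": []}}
print(removable("IN-SEQ", refs, ["IN-SEQ"], 1, True))   # True
```

If any test fails, the replacement is abandoned and the phase is left sequential, which is why the transformation is applied so rarely in practice.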
This index register's value is set when an Address Counter is activated. It is also necessary to mark Address Counters used in this way so that an Address Counter's storage is not released until both that Address Counter and its successor are ready to return to the pool of available Address Counters.

Unfortunately, allowing an Address Counter access to its predecessor's Local variable set, while conceptually simple, causes an increase in the complexity of the machine, a larger number of algorithms in the compiler, and a more complicated operating system. Since the conditions under which variable replacement can be done are seldom satisfied in a real program, it appears that the cost of this improvement outweighs the benefits derived from it.

9. SOFTWARE DESIGN

While examining many COBOL programs we found various language features and programming practices which hindered concurrent record processing. Conversely, some features and techniques were found which aided our method of speedup. The language features and programming techniques which help and which hinder our method are discussed in this chapter.

9.1 Language Features Which Hinder

9.1.1 ALTER

A COBOL feature [IBM72] which makes our method impossible to apply is the ALTER command. This instruction causes the program code to be modified during execution by overwriting the destination of a branch instruction. For example, consider the code:

    A.  READ FILE-1.
    B.  GO TO P1.
    P1. ALTER B TO PROCEED TO P2.
        GO TO A.
    P2. ADD 1 TO FILE-1-IN.
        READ FILE-1.

The program analysis needed for concurrent record processing is dependent on a priori knowledge of the program graph. Because the ALTER command allows any branch in the program to be changed during execution, the program as analyzed by the compiler and the program as executed may have entirely different program graphs. Our algorithms would form the graph in Figure 9.1(a) after insertion of FORKs, HOLDs, and QUITs.
After the block labeled P1 is executed, however, the program graph would look like the one in Figure 9.1(b), which will give erroneous results. If a dialect of COBOL is implemented for our machine, the ALTER command should be dropped from that dialect. The same function can be performed by setting and testing in-line switches, as in the following code:

    77 FIRST-CARD-SW PICTURE 9 VALUE 1.
    A.  READ FILE-1.
    B.  IF FIRST-CARD-SW = 1, GO TO P1 ELSE GO TO P2.
    P2. ADD 1 TO FILE-1-IN.
        READ FILE-1.

This is properly handled by the algorithms presented in Chapter 3.

9.1.2 Subroutines

Subroutine calls are seldom used in COBOL since the PERFORM command provides a very powerful alternative. Those calls which do appear, however, cause a problem. Since we lack information about the way variables are used in a user-supplied subroutine, we must assume the worst. We must assume that all variables in the argument list are [...]

[Figure 9.1 and the remaining pages of this chapter are illegible in this copy; the legible text resumes partway through the sample COBOL program of the appendix.]

    [...]       SSNO, GO TO READ-SDT.
    000108  IF SSN OF SORT-REC < SSNO,
    000109      DISPLAY ' NO MASTER RECORD FOR STUDENT ' SORT-REC,
    000110      PERFORM RETURN-DEPT,
    000111      GO TO MATCH-RECS.
    000112  MOVE CORRESPONDING SORT-REC TO PRINT-REC.
    000113  MOVE CORRESPONDING MAST-REC TO PRINT-REC.
    000114  IF LINES > 50,
    000115      ADD 1 TO PAGES,
    000116      MOVE PAGES TO PAGENO,
    000117      WRITE PRT-BUF FROM PRT-HEAD,
    000118      MOVE 0 TO LINES.
    000119  WRITE PRT-BUF FROM PRINT-REC.
    000120  ADD 1 TO LINES.
    000121  GO TO RETURN-DEPT.
    000122  END-JOB.
    000123  CLOSE SDT-MAST, PRT-FILE.
    000124  DISPLAY DEPT-CNT, DEPT-SELECTED, SORT-CNT, MAST-CNT.

Table A.2 Variable Names
    No.  Name
     1   DEPT-MAST
     2   DEPT-REC.NAME
     3   .SSN
     4   .CODE-1
     5   .CODE-2
     6   SDT-MAST
     7   MAST-REC.SSNO
     8   .UNITS
     9   .GPA
    10   PRT-FILE
    11   PRT-BUF
    12   SORT-FILE
    13   SORT-REC.SSN
    14   .NAME
    15   .CODE-2
    16   DEPT-CNT
    17   DEPT-SELECTED
    18   SORT-CNT
    19   MAST-CNT
    20   PAGES
    21   LINES
    22   PRINT-REC.SSN
    23   .NAME
    24   .CODE-2
    25   .UNITS
    26   .GPA
    27   PRT-HEAD.PAGENO
    28   SYSOUT

[The intervening pages, containing Table A.3 and several figures, are illegible in this copy.]

Table A.4 Variable Types

                        Phase 1                        Phase 2
    Variable   I-set    O-set   Type   I-set            O-set   Type
     1         2        -       RD     -                -       NU
     2         8        2       L      -                -       NU
     3         8        2       L      -                -       NU
     4         5,6,8    2       L      -                -       NU
     5         6,8      2,7     L      -                -       NU
     6         -        -       NU     17               -       RD
     7         -        -       NU     21,22            17      L
     8         -        -       NU     28               17      L
     9         -        -       NU     28               17      L
    10         -        -       NU     -                32,34   RD
    11         -        -       NU     -                32,34   L
    12         -        9       RD     14,24            -       RD
    13         9        8       L      19,21,22,23,27   14,24   L
    14         9        8       L      19,23,27         14,24   L
    15         9        8       L      19,23,27         14,24   L
    16         4        4       RI     -                -       NU
    17         10       10      RI     -                -       NU
    18         -        -       NU     16,26            16,26   RI
    19         -        -       NU     20               20      RI
    20         -        -       NU     30,31            30      RI
    21         -        -       NU     29,35            33,35   RD
    22         -        -       NU     34               27
    23         -        -       NU     34               27
    24         -        -       NU     34               27
    25         -        -       NU     34               28
    26         -        -       NU     34               28
    27         -        -       NU     32               31      RD
    28         -        -       NU     -                19,23   RD

A.5 Storage Assignment

Table A.5 shows the D, A, Q, and S sets, defined in section 3.5, for each variable for which storage is to be assigned. Note that the files and the buffer named PRT-BUF have been dropped.

Table A.6 shows our storage unit assignment. This assignment is shown for a memory size of 10 characters per word. The utilization of the memory is given as a percentage of the number of characters available in the memories needed for this program.

A.6 Positioning FORK, HOLD, and QUIT Instructions

Figure A.
4 shows the program after we have inserted the FORK, HOLD, and QUIT instructions needed in each phase.

A.7 Inserting Interlocks

Figure A.4 also shows the program after we have inserted the necessary TEST and RELEASE instructions for each phase.

[Table A.5, giving the D, A, Q, and S sets for each variable, is illegible in this copy.]

Table A.
6 Storage Unit Assignment

Phase 1: word size = 10 characters.

    # Variables = 9          # Memories = 6
    # Local words = 2        # Global words = 1
    # Local characters available  = 120
    # Local characters used       = 82 (68.3%)
    # Global characters available = 60
    # Global characters used      = 10 (16.7%)

Phase 2: word size = 10 characters.

    # Variables = 15         # Memories = 6
    # Local words = 2        # Global words = 1
    # Local characters available  = 120
    # Local characters used       = 102 (85.0%)
    # Global characters available = 60
    # Global characters used      = 17 (28.3%)

[The word-by-word assignment of variables to Memory Units in Table A.6, and Figure A.4, the Final Program Graph showing the inserted FORK, HOLD, QUIT, TEST, and RELEASE nodes, cannot be reproduced in this copy.]

VITA

Richard Ernest Strebendt was born in Detroit, Michigan, in 1943. He received the BSEE degree in 1966 and the MSEE degree in 1968 from Wayne State University in Detroit. From 1968 to 1971 he was employed by Bell Telephone Laboratories in Naperville, Illinois, as a Member of Technical Staff. During this time he participated in the design of a computer-assisted Directory Assistance system, and he was responsible for the design and implementation of a descriptive language for logic circuits which is now in use in a design automation system. In 1971 he obtained an educational leave-of-absence.

In 1971 he entered the University of Illinois Department of Computer Science as a Graduate Research Assistant. While at the University of Illinois he was a member of a group investigating problems in multiprocessor system design. Mr.
Strebendt is a member of Tau Beta Pi, Eta Kappa Nu, Sigma Xi, IEEE, and ACM.

BIBLIOGRAPHIC DATA SHEET

    Report No.:              UIUCDCS-R-74-638
    Title and Subtitle:      PROGRAM SPEEDUP THROUGH CONCURRENT RECORD PROCESSING
    Report Date:             October 1974
    Author:                  Richard Ernest Strebendt
    Performing Organization: University of Illinois at Urbana-Champaign,
                             Department of Computer Science, Urbana, Illinois 61801
    Contract/Grant No.:      US NSF GJ 36936
    Sponsoring Organization: National Science Foundation, Washington, D.C.
    Type of Report:          Doctoral - 1974
    Abstract:                Much effort in the past has been devoted to speeding up
                             computational programs through the use of multiprocessing.
                             This paper examines the problem of speeding up data
                             processing programs which typically do not contain a great
                             deal of computation. A machine organization is proposed
                             which is capable of executing several instruction streams
                             concurrently. Compiler algorithms are described which
                             automatically insert the necessary commands to start and
                             stop instruction streams and to protect common variables
                             which must be accessed sequentially.
    Descriptors:             Concurrent Processing; Data Interlocks; Multiprocessors;
                             COBOL; Data Processing
    Availability Statement:  RELEASE UNLIMITED
    Security Class:          UNCLASSIFIED
    No. of Pages:            201