Report No. UIUCDCS-R-72-527

A MULTIPROCESSOR FOR SIMULATION APPLICATIONS*

by

Edward Willmore Davis, Jr.

June 1972

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

*This work was supported in part by the National Science Foundation under Grant No. US NSF GJ 2746 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, June 1972.

A MULTIPROCESSOR FOR SIMULATION APPLICATIONS

Edward Willmore Davis, Jr., Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1972

Multiprocessor systems have generally been designed for applications with arrays of data which can be operated on in parallel. In this thesis an application area which does not contain such readily identifiable parallelism is examined. Discrete time simulation is found to contain several distinct levels at which potential for concurrent execution exists. The levels are used to guide the organization of a multiprocessor designed for simulation applications. Both software and hardware aspects of the problem are covered. Features of the system include a special processor used to evaluate conditional jump trees; clusters of simple, fixed point arithmetic processors; a unit to form and dispatch tasks to the processors; and a memory system which includes a read only program memory.

ACKNOWLEDGMENT

The author wishes to express sincere gratitude to his advisor, Professor David J. Kuck. The many hours of discussion, suggestions, advice, and encouragement were fundamental to the development and completion of this dissertation.

The team working on the Fortran program analysis system deserves thanks. In particular, R. Towle did an excellent job maintaining the system and revising it for the needs of the author, and R. Strebendt provided instruction on system use and interpretation of the results.

The dedicated effort of Mrs. Vivian Alsip in typing the final manuscript, as well as much of the draft copy, is gratefully acknowledged. The typing done by Mrs. Diana Mercer also deserves credit. Fred Hancock, of the Center for Advanced Computation, voluntarily drew several figures when the Department of Computer Science drafting group became overloaded.

The support of the research by the Department of Computer Science and the Center for Advanced Computation is greatly appreciated.

Finally, the author wishes to thank his wife, Jo Ann, for her patience and help in this effort.

TABLE OF CONTENTS
                                                                    Page
1. INTRODUCTION ..... 1
   1.1 Overview ..... 1
   1.2 Thesis Organization ..... 2

2. CONCURRENT PROCESSING OF CONDITIONAL JUMP STATEMENTS ..... 4
   2.1 Introduction ..... 4
   2.2 Decision Trees in Programs and Processors ..... 5
   2.3 Software Aspects of IF Trees and Decision Trees ..... 11
       2.3.1 Overall Processing Scheme ..... 12
       2.3.2 Arguments of IF Statements ..... 12
             2.3.2.1 Relational Expressions ..... 12
             2.3.2.2 Logical Expressions ..... 14
       2.3.3 Assignment Statement Movement ..... 15
   2.4 Decision Tree Processor Hardware ..... 19
       2.4.1 Input Node Register ..... 21
       2.4.2 Tree Decoder ..... 25
       2.4.3 Result Memory ..... 27
       2.4.4 Path Register ..... 30
       2.4.5 Sector Decoder ..... 31
       2.4.6 Gate Counts and Timing ..... 31
   2.5 Processor Operation and Performance ..... 34
       2.5.1 An Example of IF Tree Processing ..... 34
       2.5.2 Mapping: Folding and Multiple Input Nodes ..... 41
             2.5.2.1 Folding ..... 41
             2.5.2.2 Multiple Input Nodes ..... 44
       2.5.3 Processor Efficiency ..... 45
       2.5.4 Performance Tradeoffs Between Iterations and Cycles ..... 48

3. SIMULATION OF DISCRETE TIME SYSTEMS ..... 54
   3.1 Languages ..... 54
   3.2 A Description of GPSS ..... 55
   3.3 GPSS Execution ..... 60
       3.3.1 Serial Execution: GPSS/360 ..... 60
       3.3.2 Concurrent Execution ..... 67
             3.3.2.1 Parallelism Within a Block ..... 67
             3.3.2.2 Concurrency Between Blocks ..... 76
             3.3.2.3 Concurrent Transaction Movement ..... 85
       3.3.3 Processing Code Assignment ..... 91

4. MACHINE DESIGN FOR CONCURRENT EXECUTION OF DISCRETE TIME SIMULATIONS ..... 97
   4.1 Introduction ..... 97
   4.2 Compilation Observations ..... 99
   4.3 Task Processors ..... 101
       4.3.1 Unit Processors ..... 101
       4.3.2 Decision Processor Use ..... 103
       4.3.3 Task Processor Configuration ..... 107
       4.3.4 Hardware Design Considerations ..... 111
   4.4 Coordination Unit ..... 112
       4.4.1 Transaction Selection ..... 115
       4.4.2 Processing Code Evaluation ..... 119
       4.4.3 Task Queues ..... 119
             4.4.3.1 Process Queue ..... 119
             4.4.3.2 Delay Queue ..... 120
       4.4.4 Task Output ..... 121
       4.4.5 Coordination Unit Hardware ..... 121
   4.5 Memories ..... 125
       4.5.1 Data Memory ..... 125
             4.5.1.1 Main Memory ..... 126
             4.5.1.2 Task Memory ..... 127
       4.5.2 Program Memory ..... 129
   4.6 Machine Design Summary and Performance Estimates ..... 131

5. CONCLUSION ..... 135

APPENDIX ..... 137
   A.1 GPSS Scanner ..... 137
   A.2 Trace Data Extraction: XTRAC ..... 137
   A.3 Trace Data Insertion: INSRT ..... 140
   A.4 Coordination Unit Simulation ..... 141

LIST OF REFERENCES ..... 150

VITA ..... 152

LIST OF TABLES
                                                                    Page
2.1 Relational Expression Conversion ..... 13
2.2 Decision Processor Logic Counts ..... 32
2.3 Decision Processor Logic Delays ..... 33
2.4 Nodes Evaluated on Succeeding Cycles for Linear Decision Trees ..... 49
2.5 Cycles Required, per Iteration, to Evaluate a Linear Decision Tree ..... 51
3.1 Fortran Block Analysis ..... 74
3.2 Block Routine Speedup Factors ..... 77
3.3 Precedence Partitioning Guide ..... 81
3.4 Discrete Events Comparison of Figure 3.10 ..... 88
3.5 Program Partitions and Processing Codes ..... 95
4.1 Simulation Results: Concurrent Transactions ..... 109
4.2 Processing Code Interpretation ..... 114
4.3 Total Machine Hardware Summary ..... 132

LIST OF FIGURES
                                                                    Page
2.1 A Decision Tree ..... 6
2.2 Decision Tree Labels ..... 7
2.3 Processor Tree ..... 9
2.4 Node Classification ..... 10
2.5 Use of a Free Node ..... 10
2.6 An IF Tree ..... 16
2.7 The Decision Tree Derived from Figure 2.6 ..... 18
2.8 Decision Tree Processor ..... 20
2.9 Logic Design for a Three Level Decision Tree Processor ..... 22
2.10 Logical Expression Reduction ..... 26
2.11 Cascading Reduction Modules ..... 26
2.12 Tree with Decoder Equation Labels ..... 28
2.13 IF Tree Corresponding to Example Program Segment ..... 36
2.14 Decision Tree Derived from Figure 2.13 ..... 38
2.15 Example Decision Tree Mapped into Processor Structure ..... 39
2.16 Result Memory Contents for Example IF Tree ..... 40
2.17 Transmit Node Pairings ..... 46
2.18 Remapping for Better Efficiency ..... 47
2.19 Variations in Decision Tree Evaluation Time ..... 53
3.1 Example GPSS Program ..... 57
3.2 Overall GPSS/360 Scan: Update Clock to Next Most Imminent Event ..... 62
3.3 Overall GPSS/360 Scan: Scan of Current Events Chain (Start of Scan) ..... 63
3.4 Overall GPSS/360 Scan: Try to Move Individual Transaction into Some Next Block ..... 64
3.5 Serial Execution Trace of Example Program ..... 66
3.6 QUEUE Block; Fortran Version ..... 70
3.7 QUEUE Block Flow Chart ..... 71
3.8 Precedence Partition Algorithm ..... 80
3.9 Precedence Partitions of Program in Figure 3.1 ..... 84
3.10 Example of Time Overlap on Independent Events ..... 87
4.1 Machine Organization ..... 100
4.2 Connections Between Decision and Unit Processors ..... 106
4.3 Coordination Unit ..... 116
4.4 Word Format of Transaction Status Memory ..... 117
A.1 Simulation Test System ..... 138

1. INTRODUCTION

1.1 Overview

Speeding up the execution of programs by means of compile time algorithms and machine organization is the general topic of this thesis. The approach taken is to study an application area, then design a machine to take advantage of characteristics of programs for the application, parallelism in the problem, and processing requirements. This study leads to a multiprocessor configuration with a hierarchy of processors and memories.

Discrete time simulation is the application area selected. Simulation languages, particularly the General Purpose Simulation System [7], GPSS, are examined.

The purpose of this study is to achieve speedup through machine organization with the use of available logic and memory devices. The speedup does not come from faster execution of individual instructions. Instead, the approach taken is to have more than one instruction in execution simultaneously. GPSS is studied to detect parallelism and provide guidelines for the design of a multiprocessor machine.

Simulation programming does not include operations on arrays of data, the feature most commonly associated with machines having more than, say, two processors. An early observation on simulation program characteristics was that conditional jump statements occur with great frequency. To significantly speed up the execution it is necessary to do better than serial processing of conditional jumps. In this thesis algorithms and a special hardware unit are designed for processing trees of conditional jumps.

The thesis is experimental in nature. It includes the analysis, for execution on a multiprocessor, of Fortran programs with approximately 1000 statements. Several GPSS programs are individually analyzed for prospects of concurrency in execution. A simulation system is used to test the performance of the proposed machine organization in the execution of GPSS programs.

1.2 Thesis Organization

This thesis is organized as three chapters which present the details of the problems and solutions, a final chapter which summarizes results, and an appendix which describes a software test system for generating and verifying some of the results.

Chapter 2 is concerned with the problem of speeding up the execution of programs with many conditional jumps. Software algorithms to increase the execution concurrency are presented. The algorithms modify the original program, without changing the logic, such that better use of the multiprocessor system is made. A hardware unit for evaluating conditional jump statement trees is presented.
This "decision processor" operates in conjunction with the arithmetic processors to select a path through trees of many levels.

Chapter 3 discusses discrete time simulation languages and examines GPSS in some detail. Parallelism in GPSS and the potential for concurrent execution are the major topics.

A multiprocessor machine organization for languages like GPSS is designed in Chapter 4. The machine consists of several clusters of processors, a memory system matched to the processor requirements, and a unit to coordinate the processor clusters in their execution of a program. Processors within a cluster have a common control unit; however, the instruction stream to each processor can differ. Clusters operate independently from each other and are individually capable of executing a complete program.

Conclusions are presented in Chapter 5. The Appendix is concerned with a system for simulating the machine organization of Chapter 4.

2. CONCURRENT PROCESSING OF CONDITIONAL JUMP STATEMENTS

2.1 Introduction

The purpose of a multiprocessor machine organization is to speed up program execution. Speedup is achieved by using the parallelism or concurrency that can be extracted from a program to keep more than one processor busy. For machines such as ILLIAC IV [9,15], STAR [3], or the array associative processor STARAN [5], the parallelism is largely due to operations on arrays of data. Other machine organizations or parallelism extraction schemes use tree height reduction, DO loop and recurrence relation expansion, back substitution, and independent blocks of assignment statements to get a more general form of parallelism [10]. These schemes may add redundant operations or increase the number of useful operations, but they do reduce the number of steps required for execution.

All of the above techniques work on sections of code that exist between the statements that control the flow of program execution. When a conditional jump (an IF statement) occurs there is a reversion to serial execution. For programs with very few conditional jumps this is not a serious weakness. For programs or parts of programs with a high ratio of IF to assignment statements, this serial execution can significantly degrade the efficiency of a multiprocessor.

In this chapter algorithms and hardware for speeding up the execution of programs with many IF statements are examined. The hardware is a special purpose processor designed into the multiprocessor system of Chapter 4. Terms related to the algorithms and hardware are defined in section 2.2. Compile time preparation of programs to use the processor is introduced in 2.3. Section 2.4 covers the processor and 2.5 discusses its operation.

2.2 Decision Trees in Programs and Processors

Decision statements in programs are those that determine the next instruction to be executed from two or more possible choices. These statements typically begin with the conjunction "IF". The IF statements considered here are the logical type where a boolean variable is the basis of choice between two next instructions. Other types of IFs can be converted to the two way jump form.

When at least one of the instructions selected by an IF is also an IF, a tree of IF statements called a decision tree is formed. A single IF in the tree is a node. Establish the convention that the branch taken when the boolean variable is true is pictured as leaving the node to the right.
Then the decision tree corresponding to the code:

    IF (A) THEN IF (B) THEN W; ELSE X;
           ELSE IF (C) THEN Y; ELSE Z;

can be drawn as in Figure 2.1. An exit from the tree occurs when a statement other than an IF is next. In Figure 2.1, W, X, Y, and Z represent exits. The directed line segments between nodes are branches. Any sequence of branches followed to reach an exit is a path through the tree. The path taken on a given execution of the tree identifies the exit and is the result for that execution. A single input node is a node with only one branch into it. The discussion and examples in this chapter, with the exception of section 2.5.2.2, assume all nodes are single input nodes.

Figure 2.1. A Decision Tree

Elements of the decision tree are labeled according to the rules below. Figure 2.2 shows the naming scheme.

1. The root node is λ.
2. The name of a branch directed out of a node is the name of the node concatenated with 1 or 0 according to whether the branch leaves the node to the right or left. For concatenation, λ is a null element.
3. Nodes other than λ are given the name of the branch directed into them.
4. Paths and exits are identified by the name of the branch which is the exit.

Figure 2.2. Decision Tree Labels

The nodes on a given level of the tree are the set of nodes which have the same number of bits in their name. Levels are numbered sequentially beginning with one at the root, such that at level i there are 2^(i-1) possible nodes. Let ℓ be the number of levels in the tree. A tree is full if all of the 2^ℓ - 1 possible nodes are present. There are 2^ℓ exits from a full tree. In an informal way "length" will refer to the number of levels and "shape" will refer to the number and distribution of nodes in a tree.

Programs in general have assignment statements interspersed with decision statements. Those parts of a program where the ratio of decision statements to assignment statement operations is larger than an experimentally determined threshold are called IF trees. When the assignment statements are removed from an IF tree a decision tree is formed. An algorithm is presented in section 2.3.3 for movement of assignment statements out of an IF tree such that correct execution of the program is not disturbed. The algorithm allows the formation of larger decision trees than exist naturally in a program.

Now consider a processor to evaluate decision trees in a parallel way. The length and shape of a programmed tree can vary, limited only by the syntax of the language being used. The length and shape of processing equipment is fixed by hardware design. The fixed hardware must be capable of processing trees of any length or shape.

Let k be the number of levels in the decision processor. The processor is designed for a full tree, so 2^k - 1 nodes can be evaluated. Longer full trees require repeated use of the processor. Hardware nodes are numbered from left to right within levels and from level one to level k. These numbers are the decimal equivalent of decision tree binary node numbers with a leading one attached. The tree that descends from each node is a sector which is given the name of the sector root node. Sector 1 includes the complete processor tree. All other sectors represent sub-trees. Control is provided in the processor to select a particular sector for evaluation, providing isolation and independence from other sectors. A labeled two level processor tree is shown in Figure 2.3.
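The numbering scheme can be made concrete with a short sketch. The following Python fragment is an illustration only; the function names are ours, not the thesis's. It converts a binary node name to its decimal hardware number and enumerates the hardware nodes of a sector.

    def hardware_number(name):
        # Decimal equivalent of the binary node name with a leading
        # one attached; the root (the null name) becomes node 1.
        return int("1" + name, 2)

    def sector_nodes(root, k):
        # All hardware node numbers in the sector whose root is `root`,
        # in a k level processor tree (nodes are numbered 1 to 2**k - 1).
        nodes, frontier = [], [root]
        while frontier:
            i = frontier.pop()
            if i <= 2 ** k - 1:
                nodes.append(i)
                frontier += [2 * i, 2 * i + 1]
        return sorted(nodes)

    assert hardware_number("") == 1         # the root node
    assert hardware_number("01") == 5       # left branch, then right branch
    assert sector_nodes(3, 3) == [3, 6, 7]  # sector 3 of a three level tree

Note that with the leading-one convention the two successors of hardware node i are simply 2i and 2i+1, which is what makes the fixed wiring of the processor tree regular.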
The mapping of j < k levels of decision tree nodes into the k levels of processor tree nodes is a one-one mapping. It is not in general "onto" since the decision tree shape may differ from the fixed processor tree. A processor node which corresponds to a decision tree node is a decision node. All processor exits are from level k. Any decision tree exit from a node which does not map onto level k must "use" a node at each succeeding level down to k. Nodes that are used to transmit the output of a decision node to an exit, but which do not take part in the decision making process, are transmit nodes. Transmit nodes are assigned a logical value of zero.

Figure 2.3. Processor Tree

There is a third class of nodes in the processor tree. A free node is one which is neither a decision nor a transmit node. In drawings to follow, distinct symbols represent decision, transmit, and free nodes.

As an example of several of the above definitions consider the three level trees in Figure 2.4. The exits from the free node will never be on a selected path since the free node is on the "1" output branch of a transmit node, which always has the output "0". Suppose the decision tree had an additional node on level four. That node can be mapped into the free node if the tree processing is controlled as follows. Evaluate sector one, the complete three level processor tree. If the exit is the branch that leads to the level four node, evaluate sector five to get the final exit. The situation is illustrated in Figure 2.5.

Figure 2.4. Node Classification

Figure 2.5. Use of a Free Node

The mapping of a sub-tree of a decision tree node into a higher level sector in the processor tree is called folding. The process of evaluating a sector is a cycle. If folding is used to fill free nodes, evaluation may take as many cycles as there are folds, plus one for sector one.

If not all decision tree nodes can be mapped into the processor, the tree is evaluated by repeated use of the processor. Each use related to a given tree is an iteration. An iteration may involve many cycles.

At this point much of the terminology has been explained and the execution scheme introduced. Now consider the software required for this processor.

2.3 Software Aspects of IF Trees and Decision Trees

Locating IF trees in programs is based on an algorithm described in [2]. The algorithm forms a trace of all paths through a program. Detection of an IF statement on a path activates a counter of assignment statement operations. As long as the operation count is below a specified threshold, another IF statement on the path is classified as being in a tree with the predecessor. The counter is reset and operation counting begins again. When the operation count exceeds the threshold, an exit from the IF tree has been found. The occurrence of input or output statements or subroutine calls also marks an exit from the tree. Examination of several programs has shown that a threshold value near 10 gives trees which can be processed reasonably well using the techniques of this chapter.

Assuming an IF tree has been located, the preparation for processing that must be done at compile time is described in this section.
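The locating scan just described can be sketched in a few lines of Python. The statement representation below (tuples tagged "IF", "ASSIGN", or "BARRIER") is hypothetical; only the thresholding rule follows the text.

    THRESHOLD = 10   # the experimentally determined value cited above

    def locate_if_trees(path, threshold=THRESHOLD):
        # `path` is one trace path through the program: a list of
        # ("IF", 0), ("ASSIGN", op_count), or ("BARRIER", 0) tuples,
        # the last standing for I/O statements and subroutine calls.
        trees, current, ops = [], [], 0
        for kind, op_count in path:
            if kind == "IF":
                current.append(kind)   # joins the tree of its predecessor
                ops = 0                # reset; operation counting restarts
            elif kind == "ASSIGN":
                ops += op_count
                if ops > threshold and current:
                    trees.append(current)    # an exit from the IF tree
                    current, ops = [], 0
            else:                      # a barrier also marks an exit
                if current:
                    trees.append(current)
                current, ops = [], 0
        if current:
            trees.append(current)
        return trees

    print(locate_if_trees([("IF", 0), ("ASSIGN", 3), ("IF", 0),
                           ("ASSIGN", 12), ("IF", 0)]))
    # [['IF', 'IF'], ['IF']]

In the small example, the 12-operation assignment exceeds the threshold and separates the first two IFs from the third.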
2.3.1 Overall Processing Scheme

Three actions are involved in the compile time preparation of IF trees for concurrent evaluation. They are mentioned here, then described in more detail in following sections. First, each relational expression which is part of the argument of an IF statement is converted to an assignment statement the left hand side (lhs) of which is a logical variable. The logical variable replaces the relational expression in the argument. The second item is the movement of assignment statements to a position ahead of the remaining decision tree. Third is the mapping of nodes from the decision tree to the processor.

Four steps are required for execution. In the first step assignment statements are evaluated in parallel. The second step is determination of the boolean value of logical arguments. Third, evaluation of the tree in the processor. Fourth, selection of assignments from step one that were on the execution path.

2.3.2 Arguments of IF Statements

An IF statement argument may contain a boolean variable, a relational expression (rel ex), or a logical expression (log ex). For input to the processor tree the argument must be a boolean value. This section discusses the handling of arguments to arrive at the boolean value.

2.3.2.1 Relational Expressions

Relational expressions can be converted to assignment statements (assign sts) which yield the correct boolean value upon examination of the sign or magnitude of the result. As an example, X > Y can be converted to "B = X-Y" where the sign of B is a boolean variable indicating the result. Algorithm 2.1 converts a rel ex to an assign st. The algorithm is given for a machine where the smallest quantity that can be added is e. For an integer machine e = 1. Each assign st formed creates a new logical variable in the program which must be given a unique name. The names generated by the algorithm are the concatenation of a name which does not otherwise exist in the program and the state of a counter.

Let S(X) be a boolean variable representing the sign of X with "+" = 1 and "-" = 0. The sign of X = 0 is "+". Let M(X) be a boolean variable representing the magnitude of X. If the magnitude is zero, M = 0. If the magnitude is non-zero, M = 1. An overbar represents inversion.

Algorithm 2.1: Relational Expression Conversion

0. On the first use of this algorithm on a program, select an unused variable name, Y, and set I = 1. On all uses subsequent to the first, enter at step 1.

1. Use Table 2.1 to change the relational expression given in column one to the corresponding assignment statement in column two. Variable X is Y concatenated with I.

       Relational    Assignment    Boolean
       Expression    Statement     Variable

       L <  R        X = R-L-e     S(X)
       L ≤  R        X = R-L       S(X)
       L =  R        X = L-R       M̄(X)
       L ≠  R        X = L-R       M(X)
       L ≥  R        X = L-R       S(X)
       L >  R        X = L-R-e     S(X)

       Table 2.1. Relational Expression Conversion

2. Replace the relational expression in the argument with the boolean variable from the corresponding column three entry.

3. Increment I.
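Algorithm 2.1 is direct enough to render as a small Python sketch. The string templates and the symbolic "not M" / "S" attributes are our representation; the conversion rules follow Table 2.1 (the = and ≥ rows of which are reconstructed from the sign and magnitude conventions).

    EPS = 1   # the smallest addable quantity e; 1 on an integer machine

    CONVERSION = {
        # relation: (assignment template, boolean attribute of result)
        "<":  ("{R}-{L}-EPS", "S"),
        "<=": ("{R}-{L}",     "S"),
        "=":  ("{L}-{R}",     "not M"),   # true when magnitude is zero
        "/=": ("{L}-{R}",     "M"),
        ">=": ("{L}-{R}",     "S"),
        ">":  ("{L}-{R}-EPS", "S"),
    }

    counter = 0
    def convert(L, rel, R, name="Y"):
        # One application of Algorithm 2.1: returns the generated
        # assignment statement and the boolean variable that replaces
        # the relational expression in the IF argument.
        global counter
        counter += 1
        X = f"{name}{counter}"
        template, attribute = CONVERSION[rel]
        return f"{X} = " + template.format(L=L, R=R), f"{attribute}({X})"

    print(convert("X", ">", "Y"))   # ('Y1 = X-Y-EPS', 'S(Y1)')

The subtraction of e in the strict cases is what turns "greater than" into a pure sign test: L > R holds exactly when L-R-e is non-negative on a machine whose smallest increment is e.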
2.3.2.2 Logical Expressions

Decision trees fan out from a root node to more than one possible next statement. Logical expressions as arguments of IFs do the inverse. Given a log ex of at least two variables, a fan-in tree can be formed to give the boolean value. A decision processor designed for the fan-out case is not well suited for evaluating log exs. This section considers means of treating log exs.

Assume the decision processor has no capability to evaluate log exs. The logical IF can be rewritten equivalently as more than one IF where each new argument is a boolean variable. The logical operator connectives are achieved by the way IFs are connected in the program. The basic AND, OR connectives are shown by examples below. Larger expressions are managed by repeated application of these connections.

(a) AND operator
    Given:    IF (A·B) THEN T; ELSE F;
    Rewrite:  IF A THEN IF B THEN T; ELSE F;
              ELSE F;

(b) OR operator
    Given:    IF (A ∨ B) THEN T; ELSE F;
    Rewrite:  IF A THEN T;
              ELSE IF B THEN T; ELSE F;

A recent study of a large number of Fortran programs uncovered very few logical IFs with more than one operator in a log ex [10]. The study did reveal reasonably frequent use of the one operator argument. It therefore seems practical to provide in the decision processor the capability to accept two operand logical expressions. This will be discussed further in section 2.4.1 on input to the processor.

2.3.3 Assignment Statement Movement

Decision trees of more than a few levels will not appear often in programs. Rather, the more general IF tree, the mixture of IF and assignment statements, will be present. This section gives an algorithm for moving assign sts out of an IF tree, leaving a decision tree with possibly more levels and certainly more nodes.

Assignment statements are tagged before moving to identify their position in the IF tree. After movement the statements can be analyzed to determine parallelism and executed in parallel. Statements which may not be on the result path are executed concurrently with those that are, since the path is unknown. Thus the results of the block of assign sts are considered temporary pending determination of the result path.

A typical IF tree is shown in Figure 2.6. Here b and f represent boolean and arithmetic functions.

Figure 2.6. An IF Tree

In Figure 2.6, removal of the three assign sts from the IF tree implies a means of distinguishing the two assignments to A and of knowing which of the three were on the result path. The boolean functions of A must be evaluated with the proper value for A. To accomplish this a descriptor is attached to each lhs variable. If such a variable is later read, in the same path through the tree, the same descriptor is attached to that occurrence of the variable also.

The intuitive concept of a predecessor is used in Algorithm 2.2. Some properties of this relation as used here are given. The relation applies to both nodes and branches. "Immediate" is used to mean closest or most direct. The immediate node predecessor of nodes a0 and a1 is node a. A branch is the immediate branch predecessor of the node with the same name. A predecessor of node or branch a is a predecessor of aj, j a binary number.
Algorithm 2.2: Assignment Statement Movement

1. Scan the IF tree from level one to level ℓ, applying steps 2, 3, and 4.

2. Attach as a descriptor to the lhs variable of each assignment statement the name of the branch in which the statement occurs. Move the statement to a position above the IF tree.

3. Examine the logical argument of each node, and the rhs of each assignment statement, for variables which have been given descriptors in a predecessor branch. Attach the corresponding descriptor to every such variable. If the variable has been given multiple predecessor descriptors, use the most immediate one. Since each higher level contributes one bit to the length of a descriptor assigned at a level, the most recent assignment of a value to a variable is represented by the longest descriptor.

4. Form assignment statements from relational expressions in nodes according to Algorithm 2.1. Move these statements to a position above the IF tree.

At this point node arguments consist of boolean variables or logical expressions and the IF tree has been cleared of assign sts. The IF tree has been converted to a block of assign sts followed by a decision tree. Figure 2.6 can be drawn as in Figure 2.7 after application of Algorithm 2.2.

Figure 2.7. The Decision Tree Derived from Figure 2.6

All assign sts in the block are candidates for execution in parallel. The decision tree can be evaluated in parallel in the decision processor. Section 2.4, which follows, describes the decision processor hardware.
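Before turning to the hardware, the descriptor rule of step 3 can be made concrete. In the toy rendering below, the data structures are ours, and the branch names are invented (Figure 2.6 itself is not reproducible); only the longest-prefix rule follows the algorithm.

    def make_renamer(assignments):
        # `assignments` maps a branch name to the variable assigned in
        # that branch, e.g. {"0": "A", "11": "A"} for a tree with two
        # assignments to A.
        def rename(var, branch):
            # The most immediate predecessor descriptor is the longest
            # branch name that is a proper prefix of `branch`.
            preds = [b for b, v in assignments.items()
                     if v == var and branch.startswith(b) and b != branch]
            return f"{var}[{max(preds, key=len)}]" if preds else var
        return rename

    rename = make_renamer({"0": "A", "11": "A"})
    print(rename("A", "01"))    # A[0]   : the assignment made in branch 0
    print(rename("A", "110"))   # A[11]  : the longer descriptor wins
    print(rename("A", "10"))    # A      : no assignment on this path

Because branch names grow by one bit per level, the longest prefix is automatically the most recently assigned value on the path, which is exactly the property step 3 relies on.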
2.4 Decision Tree Processor Hardware

Evaluation of the decision trees defined in the previous sections is carried out on a decision processor. This special purpose processor is designed to operate in conjunction with a multiple arithmetic processor configuration. The function of the decision processor is to accept boolean values corresponding to decision tree nodes and to return information related to the path through the tree. Figure 2.8 is a block diagram of the hardware.

The processor has a tree structure with a capacity of 2^k - 1 nodes. That is, k is the number of tree levels in the processor. An input register is used to receive boolean node values from the arithmetic processors. The register has a bit per node of decision processor capacity and a fixed structure relating bits to positions of nodes in trees. A tree decoder identifies the path through the tree. The result of decoding points to the address of the next statement to be executed. Potential next statement addresses are stored in a small memory which is loaded for each use of the decision processor. For a k level processor tree there are 2^k possible exits. In reality the number of exits programmed is often less than a fourth of that number.

When a programmed tree has more than k levels it is necessary to use more than one evaluation cycle to determine the final result. A register is provided to save the path results on each cycle. When processing of a tree is completed the path register identifies the total path taken and likewise provides a means of identifying the assignment statements which were originally on the path but were moved ahead of the tree.

Figure 2.8. Decision Tree Processor

The design and construction of these components is explained more fully in following sections. Optional designs are given for several components. Detailed logic for a three level processor is shown in Figure 2.9. Tables 2.2 and 2.3 contain summary gate counts and timing information.

Notation used in drawings and equations is explained here. Individual nodes and sectors are labeled n(i) and S(i) respectively, 1 ≤ i ≤ (2^k - 1), following the convention in section 2.2. When a label is used as a signal name it corresponds to a logical one. Let R(ℓ,b) be the signal name for branch b at level ℓ in the tree decoder, where b is the decimal equivalent of the binary branch name. For example, R(2,0) is the level 2 branch "00"; R(2,3) is branch "11" at level 2. The output of the k level decoder is R(k,b), 0 ≤ b ≤ 2^k - 1.

Figure 2.9. Logic Design for a Three Level Decision Tree Processor

2.4.1 Input Node Register

Figure 2.10. Logical Expression Reduction

Figure 2.11. Cascading Reduction Modules

2.4.2 Tree Decoder

    R(ℓ,2i)   = n̄(2^(ℓ-1)+i)·[S(2^(ℓ-1)+i) ∨ R(ℓ-1,i)]     Equation 2.1(a)
    R(ℓ,2i+1) = n(2^(ℓ-1)+i)·[S(2^(ℓ-1)+i) ∨ R(ℓ-1,i)]     Equation 2.1(b)

    for 1 ≤ ℓ ≤ k and 0 ≤ i ≤ 2^(ℓ-1) - 1, with R(0,i) = 0.

Examples of Equation 2.1:

    R(1,0) = n̄(1)·S(1)
    R(1,1) = n(1)·S(1)
    R(2,0) = n̄(2)·[S(2) ∨ R(1,0)] = n̄(2)·[S(2) ∨ n̄(1)·S(1)]
    R(2,1) = n(2)·[S(2) ∨ R(1,0)]
    R(2,2) = n̄(3)·[S(3) ∨ R(1,1)] = n̄(3)·[S(3) ∨ n(1)·S(1)]
    R(2,3) = n(3)·[S(3) ∨ R(1,1)]

Figure 2.12, picturing a labeled two level tree, will clarify the equation. It is clear, for example, that for R(2,3) to be true n(3) must be true. One of two other conditions must also be satisfied: either S(3) must be the selected sector, or n(1) must be true with S(1) selected.

Figure 2.12. Tree with Decoder Equation Labels
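Equation 2.1 is easy to check in software. The recursive sketch below assumes n maps hardware node numbers to their boolean values and S marks selected sectors; the names are ours.

    def R(level, b, n, S):
        # Branch b (0 <= b <= 2**level - 1) at `level`; R(0,i) is
        # defined as 0.  The node deciding branch b sits at hardware
        # number 2**(level-1) + b//2.
        if level == 0:
            return False
        i, bit = b // 2, b % 2
        node = 2 ** (level - 1) + i
        enabled = S.get(node, False) or R(level - 1, i, n, S)
        return (n[node] if bit else not n[node]) and enabled

    # Two level example with sector 1 selected: n(1) and n(3) true
    # select the path "11", so the only true decoder output is R(2,3).
    n = {1: True, 2: False, 3: True}
    S = {1: True}
    print([b for b in range(4) if R(2, b, n, S)])    # [3]

The S(·) term in the bracket is what gives each sector its independence: a fold can be started at an interior node simply by asserting that node's sector select, without any support from the levels above it.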
2.4.3 Result Memory

When the tree decoder has selected a result the decision processor must convert that result into the path taken through the tree and the address of the next program statement. This is accomplished by reading the result memory word corresponding to the decoder result. Since only one path leads to a given result the path data can be hard wired in each memory word. A bit per level is required.

Three possibilities exist for the next program statement. The next statement could be a node currently in the input node register, in which case the address is interpreted as a sector address selection for another cycle of decoding. For the second possibility, the next statement could be a node in the same decision tree which is not currently in the node register. In this case the node register is reloaded. Evaluation of trees which cannot totally be mapped into the processor requires this iteration. The third possibility is that an exit has been reached; that the tree has been evaluated. Distinction between the three possibilities is made by the D bit, which is zero when the address is to be interpreted as a sector for another cycle, and the E bit, which is one when an exit has been reached. For all three cases the address portion of the memory word and bits D and E are loaded from the arithmetic processors for each iteration. A memory word is detailed in Figure 2.9(b).

Memory output is stored in the result memory output register. This register can be separated into path, control, and address fields. The path field is the input to the path register. The control field accepts bits D and E. The address field is a processor output register when bit D is one; it is a sector address when bit D is zero. The example register is in Figure 2.9(c).

This memory is a large part of the total tree processor in terms of gate and flip flop counts, so an option is mentioned here to reduce the counts. For maximum flexibility and simplicity of operation the design provides for a full tree. This is sensible for most of the processor since the hardware per node is reasonable. However, the hardware may be unreasonable and unnecessary for the memory. The memory requirement is for one word per exit. The question is how many exits the vast majority of trees, processed with a k level processor, will have. For k = 8 the maximum number of exits is 256, while the majority of trees may have fewer than 64, or even 16, exits. Thus an indirect addressing scheme may be used in which a smaller memory holds words only for the exits actually programmed.

2.4.4 Path Register

2.4.5 Sector Decoder

2.4.6 Gate Counts and Timing

Table 2.2. Decision Processor Logic Counts

Table 2.3. Decision Processor Logic Delays

2.5 Processor Operation and Performance

2.5.1 An Example of IF Tree Processing

Figure 2.13. IF Tree Corresponding to Example Program Segment

Figure 2.14. Decision Tree Derived from Figure 2.13

Figure 2.15. Example Decision Tree Mapped into Processor Structure

Figure 2.16. Result Memory Contents for Example IF Tree

2.5.2 Mapping: Folding and Multiple Input Nodes

2.5.2.1 Folding

For a linear decision tree, the number of levels L_k that can be evaluated in a single iteration on a k level processor is obtained by summing the nodes examined on successive folds:

    L_k = k + (k-2) + (k-3) + (k-3) + (k-4) + (k-4) + (k-4) + (k-4) + ... + 1     Equation 2.2

Equation 2.2 is rewritten below with the terms grouped.

    L_k = k + (k-2) + 2(k-3) + 4(k-4) + 8(k-5) + ... + 2^(k-3)·(1)

        = k + Σ (i = 3 to k) 2^(i-3)·(k-i+1)

For a six level processor L_6 = 6 + 4 + 2(3) + 4(2) + 8(1) = 32. That is, a 32 level linear decision tree can be evaluated in one iteration on a six level processor.

Two bounds are now known for single iteration processing of trees on a k level processor. The maximum number of nodes is 2^k - 1. The maximum number of levels is given by Equation 2.2. Section 2.5.3 will establish a third bound: the maximum number of nodes for which processing in a single iteration can be guaranteed.
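Equation 2.2 in executable form, as a check on the grouped sum:

    def linear_levels(k):
        # Levels of a linear decision tree evaluable in one iteration
        # on a k level processor (Equation 2.2, grouped form).
        return k + sum(2 ** (i - 3) * (k - i + 1) for i in range(3, k + 1))

    print([(k, linear_levels(k)) for k in range(2, 9)])
    # [(2, 2), (3, 4), (4, 8), (5, 16), (6, 32), (7, 64), (8, 128)]

The values coincide with 2^(k-1), which anticipates the "linear tree nodes" column of Table 2.4 below.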
2.5.2.2 Multiple Input Nodes

All previous discussion has been concerned with trees in which every node had only one input. In this section some of the ways nodes can have multiple inputs are mentioned, along with suggested means for dealing with them. The suggestions are all compile time operations and have not been examined thoroughly.

For a single node within an IF tree to have multiple inputs means there is a way to reach the node other than by the branch from the node above. Consider first an input from outside the IF tree. Unless the multiple input node maps onto the root node of the processor, the program must be modified to let the processor perform properly. Recall that the sector controls initially select sector one, the whole processor tree. The compiler must compensate for this by inserting dummy IF statements in the program to build a path from the root node to the node in question. For example, assume an IF statement labeled HERE maps onto processor node five and the program includes a GO TO HERE statement. Let the compiler insert an "IF 0 THEN IF 1" preceding the GO TO to build a path to node five.

As a second instance of a multiple input node consider branch b which does not go to node b or exit b but becomes a multiple input for another node. That is, a transfer is made within the IF tree and the IF tree becomes a network rather than a tree. Connections of this type do not exist in the processor. This situation is complicated by the necessity for maintaining information on the path taken. A solution is to duplicate the sub-tree descending from the multiple input node, where needed, to produce a network free tree. Variable descriptors, as mentioned in relation to assignment statement movement, must be attached according to the shape of the tree with duplicated sub-trees.

A loop exists if a path returns to a predecessor node. The loop can be valid (non-infinite) if it includes an assignment statement which can change the decision made at some node in the loop. Duplicating sub-trees is useful only to the extent of the single iteration capability of the processor. Loops can be expanded to fill the processor. Further expansion should be dependent on the results of the iteration.
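The sub-tree duplication suggested above is simple to demonstrate. In the sketch below the nested-tuple representation and the shared sub-tree table are entirely our invention; the point is only that expanding each reference in place yields a network free tree.

    def as_tree(node, shared):
        # node is ("exit", name), ("if", arg, left, right), or
        # ("ref", key); a ("ref", key) edge points into the shared
        # sub-tree table and models a multiple input node.
        kind = node[0]
        if kind == "ref":
            return as_tree(shared[node[1]], shared)  # duplicate per use
        if kind == "if":
            _, arg, left, right = node
            return ("if", arg, as_tree(left, shared), as_tree(right, shared))
        return node                                  # an exit

    shared = {"T": ("if", "C", ("exit", "Y"), ("exit", "Z"))}
    network = ("if", "A",
               ("ref", "T"),
               ("if", "B", ("ref", "T"), ("exit", "X")))
    print(as_tree(network, shared))   # T appears twice: a proper tree

Each copy of the duplicated sub-tree sits below a different branch, so the positional descriptors of Algorithm 2.2 remain well defined, at the cost of repeating the sub-tree's nodes.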
2.5.3 Processor Efficiency

Examination of many programs has shown that decision trees tend to be sparse rather than full. This would seem to indicate many transmit nodes in the processor and inefficient use of the hardware. It can be shown, however, that a processor with n = 2^k - 1 nodes and complete sector control can process an (n+1)/2 node decision tree of any shape in one iteration. That is, regardless of tree shape, more than half of the processor nodes are available for use as decision nodes. For an ℓ level decision tree with ℓ > k, folding is obviously required.

Define processor efficiency as the percentage ratio of decision plus free nodes to the total number of nodes in the processor. Free nodes are available for decision use and thus are grouped with decision nodes.

Statement: Processor efficiency for a k level processor can always be greater than 50% for every iteration required to evaluate a decision tree which has at least k levels.

When the decision tree has fewer than k levels, processor nodes are used to transmit results to level k rather than make decisions. In this situation the processor capacity is not fully used.

Proof of the Statement: The root is by definition a decision node. The three possibilities for successor nodes of a decision node will be examined.

Case 1. Both immediate successors are decision nodes. This clearly represents 100% efficiency locally.

Case 2. One immediate successor is a transmit node, the other a decision node. Using Fact 4 from section 2.5.2.1, construct Figure 2.17. This figure represents the Case 2 successors of any decision node, a. A pairing can be made such that each transmit node has a corresponding decision or free node without involving a in the pairing. If a is the root node it is always unpaired. If a is not the root it may be paired with the other successor of its predecessor, as a1 is paired with a0.

Figure 2.17. Transmit Node Pairings

Case 3. Both immediate successors are transmit nodes. The three nodes can be remapped equivalently as shown in Figure 2.18. Remapping yields a free node which can be paired with the transmit. If the decision node was previously paired with a transmit in Case 2, let that pairing continue.

Figure 2.18. Remapping for Better Efficiency

Applying the three possibilities to every decision node results in a pairing for every transmit node, with at least the root decision node left unpaired. The efficiency is therefore greater than 50% by at least one decision node for which there is no paired transmit node. This concludes the proof.

The third bound mentioned in section 2.5.2.1 is established by the above statement. For a k level processor there are n = 2^k - 1 nodes. At least (n+1)/2 = 2^(k-1) can always be decision nodes. Thus 2^(k-1) is the maximum number of decision nodes per iteration which can be guaranteed to map into the processor.

A note on the significance of this is useful. The maximum number of decision nodes which can be guaranteed to map into the processor is essentially a minimum number of decision nodes per iteration, excluding trees which do not use the capacity of the processor. A linear decision tree is the only one for which the minimum holds. The linear decision tree is also the one for which the maximum number of levels can be processed. Now let k = 6. The minimum number of nodes is 32. The maximum number of levels is 32, from Equation 2.2 in section 2.5.2.1. Thus it would take the uncommon program segment consisting of 32 sequential IFs for the minimum decision node bound to apply.

2.5.4 Performance Tradeoffs Between Iterations and Cycles

Gate and flip flop delays for the first cycle of an iteration and for each succeeding cycle are nearly the same, from Table 2.3. In operation, however, the first cycle of an iteration requires communication with arithmetic processors whereas succeeding cycles are internal to the decision processor. The real times are thus not nearly equal, and for this section the assumption is made that the execution time for the first cycle of an iteration is M times as long as for any succeeding cycle.

Cycles are the result of folding trees into free sectors. At the lower levels of the processor the nodes per sector are few, which means that the nodes evaluated per cycle are few. At some point it becomes more time effective to discontinue folding in small tree segments and resort to a new iteration. That point is apparently where the probability of reaching an exit is less in the next M cycles within the current iteration than in the first cycle of the next iteration. If all exits are equally likely it is a simple matter to count their occurrence if mapped as the next M cycles versus one cycle in the next iteration.

Linear trees are examined to demonstrate the tradeoff. Equation 2.2 is a summation in which each term is the number of nodes mapped per fold. Table 2.4, which follows, uses that equation to determine the entries in the nodes per fold column.

    levels   processor     linear tree nodes,   nodes per fold
    k        nodes 2^k-1   single iteration     (terms of Equation 2.2)

    2        3             2                    2
    3        7             4                    3, 1
    4        15            8                    4, 2, 1, 1
    5        31            16                   5, 3, 2, 2, 1 (x4)
    6        63            32                   6, 4, 3, 3, 2 (x4), 1 (x8)
    7        127           64                   7, 5, 4, 4, 3 (x4), 2 (x8), 1 (x16)
    8        255           128                  8, 6, 5, 5, 4 (x4), 3 (x8), 2 (x16), 1 (x32)

    Table 2.4. Nodes Evaluated on Succeeding Cycles for Linear Decision Trees
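The nodes per fold column is mechanical to generate. The sketch below reconstructs it from Equation 2.2 (the column itself is only partly legible in the scanned table, so the series should be read as our reconstruction; it is consistent with the worked k = 6 numbers in the text).

    def nodes_per_fold(k):
        # The first cycle evaluates k nodes; each later fold
        # contributes one term of Equation 2.2.
        series = [k]
        if k >= 3:
            series.append(k - 2)
        for i in range(3, k):
            series += [k - i] * 2 ** (i - 2)
        return series

    print(nodes_per_fold(6))        # [6, 4, 3, 3, 2, 2, 2, 2, 1, ..., 1]
    print(sum(nodes_per_fold(6)))   # 32, the single iteration capacity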
If more than one iteration is used, the larger numbers at the beginning of the nodes per fold series are reused, eliminating the long strings of ones and twos. Table 2.5 is a compilation of the cycles required to evaluate various length linear trees using multiple iterations. The entries are derived from Table 2.4 as in the following example. The entry at k = 6 for three iterations is determined by first noting that the linear tree to be evaluated has 32 nodes. If at most three iterations are to be used, ⌈32/3⌉ = 11 nodes must be examined in at least one iteration, say the first. From Table 2.4, in one iteration the first cycle examines six nodes, the second four, and the third three, bringing the total to 13. Three cycles were required to get the total above 11, and three becomes the first number in the table entry being determined. There are 32 - 13 = 19 nodes for the remaining two iterations. One of them must examine at least 10 nodes and the other the remainder. Again from Table 2.4 it can be seen that two cycles cover 10 nodes, so two becomes the second number in the entry and nine nodes remain. Two cycles pick up the remaining nodes, so the last number in the entry is also two.

Table 2.5. Cycles Required, per Iteration, to Evaluate a Linear Decision Tree

Behind these tables is the goal of selecting the best compromise between iterations and cycles. The best operating point is a function of the multiplier, M, relating the execution times of iterations and cycles. The time required to evaluate a decision tree on the processor is the time for the first cycle of each iteration plus the time for all succeeding cycles. The first cycle includes communication with the arithmetic processors and has been defined as taking M times longer than succeeding cycles to execute.

Let I represent the number of iterations used, corresponding to columns of Table 2.5. Let C_I represent the total number of cycles used for a particular value of I. Then I is the number of first cycles and C_I - I is the number of succeeding cycles required to evaluate a decision tree. The time required is thus proportional to

    M·I + (C_I - I) = (M-1)·I + C_I.

This function is graphed in Figure 2.19 for k = 8 and M = 1, 2, 3, and 4. Execution time is in units equivalent to the time required to cycle the decision processor. The operating point to select is the one which minimizes the execution time. Thus for M = 1 it is best to use 16 iterations of one cycle each, whereas for M = 3, four iterations of six cycles each is best.

Figure 2.19. Variations in Decision Tree Evaluation Time

This section has dealt with the fringe case of linear decision trees to simplify data gathering for the tables. The principle is applicable to any shape tree. Sector control, in the form of sector decoding logic and control gates on the tree decoder, is required for every node that can accept a fold. The return, in decreased execution time, seems non-existent at the lower levels except perhaps in some situation where just a few more nodes would complete the tree. The cost, in logic, is high since there are more gates on the bottom level than in the rest of the tree. In conclusion it is recommended that sector control stop at some level above k.
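The operating point selection can be recomputed rather than read from the (largely illegible) scan of Table 2.5. In the sketch below, the even ceiling-division split of remaining nodes across iterations follows the worked k = 6 example; treating that split as the general rule is our assumption.

    def cycles_to_cover(nodes, k):
        # Cycles (and nodes actually covered) when one iteration takes
        # folds in the order of the Table 2.4 series.
        series = [k] + ([k - 2] if k >= 3 else []) \
                     + [k - i for i in range(3, k) for _ in range(2 ** (i - 2))]
        covered = cycles = 0
        for term in series:
            if covered >= nodes:
                break
            covered += term
            cycles += 1
        return cycles, covered

    def best_operating_point(k, M):
        # Minimize (M-1)*I + C_I over the number of iterations I for a
        # 2**(k-1) node linear tree.
        nodes = 2 ** (k - 1)
        best = None
        for I in range(1, nodes + 1):
            left, C = nodes, 0
            for remaining in range(I, 0, -1):
                c, covered = cycles_to_cover(-(-left // remaining), k)
                C += c
                left -= covered
            time = (M - 1) * I + C
            if best is None or time < best[0]:
                best = (time, I, C)
        return best

    print(best_operating_point(8, 1))   # (16, 16, 16): 16 one-cycle iterations
    print(best_operating_point(8, 3))   # (32, 4, 24): four iterations is best

Both printed results agree with the operating points read off Figure 2.19 in the text for M = 1 and M = 3.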
3. SIMULATION OF DISCRETE TIME SYSTEMS

This chapter examines some software questions involved in the concurrent processing of discrete time simulation programs. A simulation language is studied to determine the natural parallelism and to develop a machine design philosophy for using that parallelism.

3.1 Languages

Applications programming can be done in any general purpose language. When a particular application comes into frequent use, particular languages tend to be developed to simplify programming. This has been done for discrete time simulation. At the present there are many simulation languages, some being minor variations of others. Among the more widely known are GPSS, the General Purpose Simulation System [7], Simscript [8], and Simula [4]. GPSS is implemented as a fixed set of routines which can be thought of as blocks in a block diagram of the system to be simulated. Simscript, developed at the Rand Corporation, is a powerful language with similarities to Fortran and PL/1 but also including features useful in simulating systems that change over time. Simula is an Algol based language that includes Algol as a subset.

These three significantly different languages have characteristics in common that provide for simulation. In all cases a simulated time clock is an inherent device which affects the progress of the simulation. An event, defined as a change in the status of the simulated system, occurs at a scheduled clock time or causes the clock to increment to the event time. Let model mean the representation in a programming language of a system to be simulated. Simulation programs have temporary entities which move through the model. Temporary entities are those which are not required to exist for the duration of the simulation. The model is described by permanent entities which do exist throughout the simulation. Finally, means to control the progress and interaction of entities and simulated time is provided.

GPSS will be described in more detail since it is used extensively in the remainder of this thesis. There were several reasons for selection of GPSS. A primary reason was availability of the GPSS/360 system and documentation on its implementation and use. The block diagram structure, essentially making it a higher level language than the procedural languages, clarifies the ideas of the thesis. Also, while GPSS is considerably different from other languages it is neither unique nor unused.

Acceptance of GPSS has prompted the development of several other similar languages. One of these is BOSS, the Burroughs Operational Systems Simulator [15]. As stated in that reference, "BOSS is a block-diagram-oriented, data base driven simulator program, in the general class of GPSS...". Another is QUICKSIM, an attempt to impart a block diagram structure to Simscript [16]. As a third similar language there is the Computer System Simulator, CSS. An application of CSS, reported in [11], described it as follows: "CSS/360 was used in this study. It is a simulation program.... In concept it is similar to the General Purpose System Simulator (GPSS), differing in one aspect: it is not general, but applies specifically to computer systems."

3.2 A Description of GPSS

The purpose of this section is to present a description of GPSS sufficient for understanding the algorithms and examples that follow. For more complete information refer to the User's Manual [7].

An example of a GPSS program is given in Figure 3.1. The example, intended to show features of GPSS, is not meant to be a meaningful model. A more significant program is listed with comments in the appendix.

Figure 3.1. Example GPSS Program

A GPSS program is made up of control, definition, and executable statements written in a fixed format, one statement per card. An executable statement is called a block. A fixed set of blocks is provided to represent the components and control of the model to be studied. The format for a block allows up to seven operands. The definition of each block identifies the required and optional operands and their meaning. Each block is in actuality the name of a routine with the operands being parameters. Execution of a block is execution of its routine.

Temporary entities are transactions. Blocks are provided for causing transactions to become active in the model and for removing them from active status. Once activated, a transaction normally moves sequentially through the blocks of the program. The movement of a transaction into a block is a call for execution of the routine that corresponds to the block. Exceptions to sequential movement are caused by blocks which unconditionally or conditionally cause a transaction to transfer to a non-sequential block. These control blocks affect transactions in GPSS much the way GO TO or IF statements affect the instruction counter in procedure oriented languages.

A simulation study typically involves many transactions which interact with each other. It frequently becomes necessary to suspend processing of one transaction and begin work on another. Even so, the movement of one transaction through a GPSS program can be thought of as an execution of the program.

A clock is maintained which automatically updates to the time of the next event. Simulated time has an origin of one and is incremented by integer values.
A GPSS program is made up of control, definition, and executable statements written in a fixed format, one statement per card. An executable statement is called a block. A fixed set of blocks is provided to represent the components and control of the model to be studied. The format for a block allows up to seven operands. The definition of each block identifies the required and optional operands and their meaning. Each block is in actuality the name of a routine with the operands being parameters. Execution of a block is execution of its routine. Temporary entities are transactions . Blocks are provided for causing transactions to become active in the model and for removing them from active status. Once activated, a transaction normally moves sequentially through the blocks of the program. The movement of a transaction into a block is a call for execution of the routine that corresponds to the block. Exceptions to sequential movement are caused by blocks which unconditionally or conditionally cause a transaction to transfer to a non- sequential block. These control blocks affect transactions in GPSS much the way GO TO or IF statements affect the instruction counter in procedure oriented languages . A simulation study typically involves many transactions which inter- act with each other. It frequently becomes necessary to suspend processing of one transaction and begin work on another. Even so, the movement of one transaction through a GPSS program can be thought of as an execution of the program. A clock is maintained which automatically updates to the time of the next event. Simulated time has an origin of one and is incremented by integer 57 ec — (M^-rir-of-aDC'o — *vj f* * tr .$ h- U Z -«««*JIMir» eo o z UJ X x o o ►- z 5 a. ae UJ UJ to UJ X a UJ CO v> X ec o a • •> a. »- < x CO UJ CO ►» z a O UJ O uj ■- t- I p- « i/> ►- < _) *- OC 3 X Z UJ X H- <-■ a. •-> O to 0» -* en — c 0"> • CD o> o ro ir\ • • «*> rsj • • U f\j <\j 00 <. eg • ae • «vj z >»• O j • X eo ec <. UJ ec *~ to to CO a o z 3 «*> u o « « • z a a x UJ c* o • # — • o K UJ U -4 X z x uj 3 ae u. UJ z o — a. x x at ui ui U. Z • * u. o o O *XNp3p O on »• — • 5. u. ck. — • <-» r- UJ -J < Z KUO aeoxcuyjuiaez* ZV)OHlUMfi,>K IUI/)ZUOUJUJO< to UJ u> UJ o X o o o z CO a ae < Z X - o H -J — eo z UJ > Q eo UJ O 3 M •j uj a. v o eo to u o. «■* i-t < 3 tat oc o uj < z »- > < UJ uj < a to to z UI UI << — •-. x a uj ui x ae K a -J _J Ct < Q UJ UJ UJ I— Z ac ec t- to uj W) o & CO CO P-. t5 & H bO ae ui rsi o • » • • * • • ae x ui -i s eo z **u>49^eo O — 0* tai -* 58 values. The unit of real time represented by a unit of simulated time is programmer defined. Events are scheduled through the use of the ADVANCE block. Each transaction has an associated set of words including one called "block departure time", BDT . When a transaction moves into an ADVANCE block a time increment is calculated and added to its BDT. Processing of the transaction is suspended until all transactions with smaller BDT values have been processed. Permanent entities of several types can be used to describe the physical equipment of a model. A facility is an entity which can be occupied by only one transaction at a time . Blocks are provided that let a transaction use (SEIZE, PREEMPT) or relinquish use (RELEASE, RETURN) of a facility. A ' storage is an entity which can be occupied by more than one transaction. The storage capacity is given for each storage by a definition card. ENTER and LEAVE blocks are provided to let transactions use or relinquish use of a storage. 
A logic switch is a two state entity used to control the movement of transactions. The LOGIC block results in the setting, resetting, or inverting of a switch. Other blocks can be used to simulate various queuing schemes, assign values to variables, control transaction movement, group transactions or numbers, gather statistics, and give diagnostic or partial result information. GPSS provides a set of Standard Numerical Attributes, SNA ' s, which are names of variables. These variables are selected attributes of GPSS entities. An SNA consists of a one or two letter mnemonic and, in most cases, an index to identify a particular entity. For example, a facility can either be in use or available. The status is given by SNA Fj, where j is the index of the facility. In this example F3 = indicates facility three is available. The major use of SNA's is in the operand field of blocks. They are in fact the only variables a programmer can use. Symbolic naming of entities 59 is allowed in which case the symbol replaces the index number. Since all variables are SNA's defined in GPSS, all possible variable names and certain of their characteristics are known at compile time. A characteristic of interest in section 3.3-2 concerns a limitation on accessing certain variables. In particular, each transaction has a priority, PR, and a set of parameters, Pj, < j < 100. When PR or Pj are used as block operands they refer only to the transaction which is executing the block. A given transaction cannot use the value of the priority or any parameter of any other transaction as an operand. These SNA's are considered transaction related variables. System related variables can be accessed by any transaction. An example is the storage location called a savevalue, referenced by Xj . Any transaction executing a block with operand Xj refers to the same physical storage location. A similar distinction can be made for block types based upon the block definition. The PRIORITY and ASSIGN blocks are used to write values into PR and Pj of the transaction executing the block. They affect directly only the transaction being moved. Other blocks in the same category are ADVANCE, TEST, TRANSFER, etc. Blocks can be identified that give system variables new values. The SAVEVALUE block writes a value into savevalue location Xj which can be accessed by any transaction. Similarly, ENTER, LOGIC, QUEUE, and other blocks can change the value of system variables . GPSS provides features for using tapes to store intermediate results of large simulations. The features are not considered in this thesis. Processing a GPSS program includes assembly, input, execution, and output phases. The execution phase, in which transactions are moving through 6o blocks, is of most importance in this thesis. Execution is described in the next section, 3 -3, and the description of GPSS is continued, especially in section 3.3.1. 3-3 GPSS Execution A major goal of this thesis is to design a multiprocessor system for concurrent execution of individual programs written in a language like GPSS. To better understand the problem and the proposed system, serial execu- tion of GPSS is described. Parallelism within GPSS, and thus the potential for concurrent execution, is studied. The constraints limiting concurrency, as imposed by the simulated time aspect of simulation, are demonstrated. When simulation is being discussed in this section, simulated time will be referred to simply as "time" whereas real time will be identified as such. 
3.3.1 Serial Execution: GPSS/360

Processing in the execution phase consists entirely of moving transactions through blocks. One function of GPSS that simplifies the programmer's effort is the selection of which transaction should be moved and into which block it should be moved. With a single processor only one of the potentially many transactions can be selected. Once a selection has been made, one word of the set of words associated with each transaction identifies the next block to be executed.

Selection of the transaction to move is based on two chains maintained by GPSS. The current events chain contains all transactions whose block departure times, BDT's, are equal to or less than the current time. Items in this chain are ordered first by priority then, within a priority class, on a first-in first-out basis. The future events chain contains all transactions whose BDT's are greater than the current time. For this chain, transactions are ordered by their BDT, most imminent event first; then for all transactions at a given time the ordering is as in the current events chain.

Transactions are selected for processing in order from the current events chain. Certain conditions which can prevent the movement of a transaction, and therefore its processing, are explained in following paragraphs. When processing of transactions on the current events chain is complete, the first transaction in the future events chain, and all transactions with the same block departure time, are transferred to the current events chain.

The flowchart for the overall GPSS/360 scan is given in Figures 3.2, 3.3, and 3.4. The chart, from [7], is included to show the complexity of the algorithm and to provide a basis for comparison with the proposed machine organization. Figure 3.4 also gives the conditions under which the processing of one transaction is stopped and another is started.

Now the conditions mentioned above which can prevent the movement of transactions are discussed. A facility was described in section 3.2 as a unit capacity device. If a facility is in use at a time when another transaction would like to use (SEIZE) it, the second transaction must wait for the first to relinquish use. The second transaction has reached what is called a blocking condition. Processing is suspended until the facility becomes available, removing the blocking condition. A similar situation arises with respect to a storage when its capacity will be exceeded by the transaction that would like to use it. Thus when moving a transaction would violate the specifications of the system being simulated, the transaction is blocked.

Two blocks, GATE and TEST, can also stop the processing or divert the movement of a transaction. A gating condition can depend on the status of a logic switch, facility, storage, or block.

[Figure 3.2. Overall GPSS/360 Scan: Update Clock to Next Most Imminent Event. The flowchart increases the simulator clock to the BDT of the first (next most imminent) transaction in the future events chain, then moves that transaction, and every following transaction with the same BDT, to the end of its priority class on the current events chain. The future events chain holds transactions in positive time ADVANCE blocks, transactions waiting to leave GENERATE blocks, and operators for tables operating in the arrival rate mode.]
[Figure 3.3. Overall GPSS/360 Scan: Scan of Current Events Chain. The status change flag is reset to off and the first transaction in the current events chain is examined. A transaction in an active scan status is moved toward some next block (Figure 3.4); a transaction inactive in a delay chain is skipped. The scan advances to the next sequential transaction unless the status change flag has been set, in which case it transfers to the start of the chain. The scan also restarts from a BUFFER block or a PRIORITY block with the BUFFER option. When the scan has gone all the way through the current events chain, no further transactions can be moved at this clock time, and the clock is updated to the next most imminent BDT (Figure 3.2).]

[Figure 3.4. Overall GPSS/360 Scan: Try to Move Individual Transaction into Some Next Block. If the transaction can move, the block type subroutine is executed; a TERMINATE, ASSEMBLE, GATHER, or MATCH block can stop processing of the transaction immediately. If the transaction cannot move, because of a GATE M, GATE NM, or TEST block, or a TRANSFER block in the both or all selection mode, it is placed in a pushdown delay chain and its scan status indicator is set on. When the particular facility, storage, or logic switch changes status, the scan status indicator is reset to off for all transactions in the associated delay chains. A transaction that completes its moves is removed from the current events chain and merged into the future events chain.]

The TEST block examines an algebraic relation between two standard numerical attributes. Thus a transaction can be blocked waiting for the model to satisfy certain programmer defined conditions. When a transaction is blocked it depends on another transaction to change the model status and unblock it. In many cases a time change must take place before the model status changes. Time is a very important entity in control of the simulated system.
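Before turning to an example trace, the structure of the serial scan can be summarized in a few lines of code. The sketch below is a loose modern Python paraphrase of Figures 3.2 through 3.4, not a rendering of the GPSS/360 implementation; delay chains and priority ordering are omitted, and try_to_move is assumed to execute block routines for one transaction and report whether the model status changed.

    def overall_scan(current_chain, future_chain, try_to_move):
        # current_chain: transactions with BDT <= clock, in scan order.
        # future_chain: transactions ordered by BDT, most imminent first.
        clock = 0
        while current_chain or future_chain:
            status_change = False
            for txn in list(current_chain):
                if try_to_move(txn):          # a facility freed, a switch
                    status_change = True      # set, etc.: restart the scan
                    break
            if status_change:
                continue
            if not future_chain:
                break                         # no further movement possible
            clock = future_chain[0].bdt       # update clock to the next
            while future_chain and future_chain[0].bdt == clock:
                current_chain.append(future_chain.pop(0))
        return clock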
Serial execution of the program in Figure 3.1 is illustrated in Figure 3.5. The chart shows the sequence of transactions and blocks executed to complete the simulation. Processing starts at the bottom line. It progresses horizontally, moving the transaction that corresponds to the line through blocks until the processing must be stopped for one of the conditions given in Figure 3.4. Processing resumes with the transaction on the line above. Numbers on the lines are the times at which the transaction moved into the block. The final number on each line is the time when processing of that transaction was suspended, either temporarily or finally. The chart is derived from data gathered by tracing the execution of the example program.

[Figure 3.5. Serial Execution Trace of Example Program. The chart itself is not recoverable from the scan; it plots one line per transaction, numbered 1 through 24, against block numbers 1 through 11, with entry times marked along each line.]

There are several things to observe on the chart. Transactions one, two, and three move through all blocks without interruption. This is because no other transaction is active during the time they are in the model. There is no interaction between these three transactions, so the movement of each through the program is equivalent to an execution of the program. Other transactions took two to four separate processing intervals to move through all blocks. Moving each of these through the program is like an execution of the program with interaction between the separate executions. Completion of the simulation came with completion of ten transactions, but more than ten started.

The order in which transactions are activated is not necessarily maintained throughout a simulation. An example of changing the order occurs on the chart. Transactions 10, 11, and 14 execute blocks four and five before transaction nine. Movement of transactions through the program is governed by run time determination of the model status. In general, the order of block executions is not known at compile time. Figure 3.5 demonstrates this. No amount of study of the program, short of actual execution, leads one to the knowledge that some transactions will move through the entire program with the processing of no other transaction intervening, whereas other transactions will require four distinct processing intervals.

3.3.2 Concurrent Execution

In this section three levels of parallelism within a GPSS program are described. The parallelism is used in the design of the multiprocessor machine of Chapter 4.

3.3.2.1 Parallelism Within a Block

The first level of parallelism is that which exists within the routine that a block type represents. This is precisely the type of parallelism that can be found in a procedural language. If it is known that a block is going to be executed, clearly the parallelism within the block can be fully used by a multiprocessor independent of consideration of other blocks or transactions. Use of this parallelism does not change the overall scan of Figures 3.2 through 3.4.

To measure the parallelism at this level, 21 frequently used GPSS block types were converted from their original 360 Assembler Language version to Fortran. The 21 Fortran program equivalents were analyzed on the system described in [10]. Results of the analysis are given in Tables 3.1 and 3.2 at the end of this section. Comments on the conversions to Fortran are given here. An attempt was made to have each Fortran program perform the same functions as the original, although the methods had to differ slightly due to differences in the languages and operation of the analyzer program. The assembler language version makes use of many subroutines. In Fortran, subroutine calls were replaced by the subroutine itself so the analyzer could examine the complete program. Bits manipulated in assembler language were considered variables in Fortran. Specification statements were not given since they do not affect the parallelism and are not examined by the analyzer.
Each block type is analyzed as a separate program so any exit from the block is an END or RETURN. This includes error checking statements which normally branch to the GPSS output phase, write an appropriate message, and terminate the simulation. In GPSS/360, completion of a block results in a call to the overall scan, Figure 3.4, and processing continues. For parallelism analysis each block is a separate program. No attempt is made to analyze sequences of blocks, although this would tend to increase the parallelism. Program conversion and analysis was not done for parts of the GPSS/360 system concerned with the selection of transaction-block pairs for execution. Only the box labeled "Execute block type subroutine" in Figure 3.4 of the overall scan algorithm was converted.

The Fortran version listing and flow chart of the QUEUE block are given in Figures 3.6 and 3.7. This block is a typical example of the type of statements and length of a block routine. Reference to these figures is made in Chapter 4 in a discussion of processor capability requirements.

[Figure 3.6. QUEUE Block; Fortran Version. The listing is too badly garbled in the scan to reproduce; it consists largely of IF tests and simple integer assignments, with the QUENR and UPQ subroutines expanded inline.]

[Figure 3.7. QUEUE Block Flow Chart. The boxes, in order: increment the block entry count; if there is a B operand, set QUNITS to the decoded B operand, otherwise set QUNITS to 1; call QUENR to check the legality of the queue number; call UPQ to update the queue statistics. If this transaction is not already in a queue, record the queue number in T13 and the entry time in T14. Otherwise, if the multi-queue bit is not set, set it, enter the first queue number (T13) and entry time (T14) and then the current queue number and time in the multi-queue table, and set the queue count to 2; if the multi-queue bit is already set, enter the queue number and time in the multi-queue table and increment the queue count by 1, writing a one-time warning message (and setting the warning message bit) when the queue count reaches 5 and no table location is available. Finally, increment the total entry count and the current contents by QUNITS, and if the current contents exceed the maximum contents, set the maximum contents equal to the current contents.]
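The flow chart can also be paraphrased in a few lines of modern code. The sketch below is only one reading of Figure 3.7 in Python; the field names on the queue record and the transaction are hypothetical, the UPQ statistics update is inferred from the recoverable fragments of the listing, and the five queue limit and warning message path are omitted.

    def queue_block(txn, qnr, queues, clock, b_operand=None):
        q = queues[qnr]                     # QUENR: queue number legality check
        q["entries"] += 1                   # increment block entry count
        qunits = b_operand if b_operand is not None else 1
        # UPQ: accumulate time-weighted contents since the last update
        q["area"] += (clock - q["last_time"]) * q["contents"]
        q["last_time"] = clock
        if txn.queue_count == 0:            # transaction's first queue:
            txn.t13, txn.t14 = qnr, clock   # record queue number and time
        elif txn.queue_count == 1:          # second queue: open the
            txn.multi = [(txn.t13, txn.t14), (qnr, clock)]  # multi-queue table
        else:
            txn.multi.append((qnr, clock))
        txn.queue_count += 1
        q["total"] += qunits                # increment total entry count
        q["contents"] += qunits             # and current contents by QUNITS
        if q["contents"] > q["max"]:        # track maximum contents
            q["max"] = q["contents"]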
Table 3.1 lists the results of analyzing the 21 block type programs, including QUEUE, to determine the number of operations that can be done concurrently and the speedup in execution using a multiprocessor configuration. The techniques and algorithms used in the analysis are fully described in [10]. One change was made to increase the accuracy of the analyzer for this set of programs. The analyzer previously did not count memory fetches, under the assumption that they were overlapped with operations on the data. The listing of QUEUE shows very few arithmetic operations, reducing the overlap of memory fetches. Since other blocks are similar it is necessary to count fetches in the measurement of these programs.

The analyzer currently has a limited ability in handling IF trees. The maximum number of levels per tree is eight. An algorithm for folding trees with more levels has not been implemented yet, so longer trees are artificially broken to give two or more trees. This results in less speedup than can be achieved with the hardware of Chapter 2.

Column headings for Table 3.1 are explained in the referenced paper and given again here. Minimum and maximum values are given when a range of results occurred.

(1) The names are the GPSS block type names, plus DECODE and FUNCTION, two routines for decoding operand values which required function evaluation or used indirect addressing for index values.

(2) This is the approximate number of source cards, excluding comments and multiple RETURN statements. The number of cards in the scopes of DO loops and IF loops is given.

(3) This is the maximum number of iterations assumed for any DO loop or IF loop in the program.

(4) The number of traces is the sum of all paths from the beginning to all END or RETURN statements plus the number of IF loops.

(5) T1 is the time required to execute the program on a uniprocessor.

(6) Tp is the time required to execute the program using a multiprocessor capable of executing a maximum of p operations at once; a p-multiprocessor.

(7) Sp is the ratio of column (5) to column (6).

(8) This is the number, p, of processors required to achieve the Tp value in (6).

(9) Ep is the efficiency, defined for the number of processors in (8) as

    Ep = T1/(p Tp) ≤ 1.

(10) Up is the utilization for the number of processors in (8). The techniques used to reduce execution time may introduce extra operations. Let Op be the number of operations in the execution of a program on a p-multiprocessor. Call Rp the operation redundancy and let Rp = Op/O1 ≥ 1. The p-multiprocessor utilization is Up = Ep Rp.
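A small worked example may make the relations among these measures concrete. The numbers below are invented for illustration and do not come from Table 3.1.

    # Hypothetical block routine: T1 = 120 steps serially, Tp = 20 steps
    # on p = 8 processors, with O1 = 120 and Op = 160 operations executed.
    T1, Tp, p = 120, 20, 8
    O1, Op = 120, 160

    S = T1 / Tp          # speedup: 6.0
    E = T1 / (p * Tp)    # efficiency: 0.75, and always E <= 1
    R = Op / O1          # operation redundancy: 1.33, and always R >= 1
    U = E * R            # utilization: 1.0; every processor step is busy,
                         # though one third of the operations are redundant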
[Table 3.1. Fortran Block Analysis. The tabulated entries are not recoverable from the scan.]

Table 3.2 summarizes the program speedup information derived from the analysis. For each program it gives the number of traces with the speedup indicated in the column heading. Each column covers a range of ±0.5 from the heading value.

[Table 3.2. Block Routine Speedup Factors. The entries are too badly garbled in the scan to reproduce reliably; for each block routine the table gives the number of traces at each speedup factor, 1 through 9, and the modal value. A footnote records that the FUNCTION subroutine was removed for the analysis of the starred block types.]

A conservative approach was taken in the design of the Fortran analyzer. The measures will change in the direction of improvement with improvement of the analyzer. These tables will be referenced in Chapter 4 on the design of a multiprocessor for executing these programs.

3.3.2.2 Concurrency Between Blocks

Since a single transaction moves through the blocks of a simulation program in much the same way as the execution of a conventional program proceeds from one instruction to the next, there is potential for concurrent execution of several blocks. This second level of parallelism is analogous to that between instructions in a procedural language but is at the level of routines rather than instructions. This section is limited to simulations with one active transaction. The multiple transaction case is covered in the next two sections.
If it is known that a sequence of blocks is going to be executed by one transaction, clearly the parallelism between blocks in the sequence can be exploited. A GPSS program can be partitioned into groups of sequential blocks which do not violate the precedence requirements of variables. That is, from the definition of each block type and examination of its operands it is possible to group blocks such that no variable is written then read by the blocks in one group.

For a simulation with one transaction the blocks in a precedence partition are those that can be processed concurrently. Most simulations involve multiple transactions which affect each other and the system being modeled. As a result, the processing of one transaction may be interrupted at a block in the midst of a precedence partition. The reduction of concurrency is covered in section 3.3.3.

A segment of the program in Figure 3.1 is reproduced here to serve as a precedence partitioning example. Selected assignment statements from the corresponding routines are given at the right.

    Block Number   Block    Operands       Selected Routine Statements
    2              ASSIGN   12,FN$TERMI    P12 <- Function value
    3              INDEX    3,10
    4              SEIZE    P12            J <- P12
                                           F(J) <- 1

P12 appears on the left side of an assignment statement in block two and on the right in block four. Since block four cannot be executed concurrently with two, they must be in separate partitions.

The order of block execution is known for all segments of the program free of conditional jumps. If interaction between transactions is not considered, all blocks in a partition can be processed concurrently. The parallelism within each block can be simultaneously applied. The overall scan in Figures 3.2 through 3.4 is changed to the extent that all blocks in a partition are executed concurrently rather than serially.

Algorithm 3.1 for the precedence partitioning of GPSS programs is presented in Figure 3.8. Basically, the GPSS card deck is scanned sequentially. When a block is found, the type and standard numerical attributes, SNA's, used as operands are noted. If processing the block simultaneously with preceding blocks in the current partition would violate precedence requirements, the current partition is closed and the block becomes the first in a new partition. Due to the structured nature of GPSS it is not necessary to examine the statements of each routine to perform the partitioning. The algorithm uses the Precedence Partitioning Guide, Table 3.3, which is based upon the definition of each block type, to identify the blocks or SNA's that cannot be in the same partition as the current block. Entries in column B of the guide are those blocks which read a variable written by the block type in the corresponding position of the left most column.
When the entry for a block type is ALL, the block type is the last in a partition regardless of what follows. Block types EXECUTE and GENERATE are permanent entries in column B for all block types. Optional block operands cause an EXCLUSIVE OR situation for some entries. The guide is conservative in several respects. The TRANSFER block causes a partition boundary even when it is used for unconditional jumps which do not violate any operand precedence requirements. Block types which are used in pairs, such as QUEUE and DEPART, are not permitted in the same partition. The actual requirement is that pair types should not be in the same partition if they are operating on entities with the same index value.

Entries in column S are those SNA's which are given new values by the block type in the left column. In several cases a single letter is used to represent a set of SNA's. This is the case for example with ENTER, where S is used for the set S, SR, SA, SM, SC, and ST. The ENTER block can change values for each set member, so the use of any member causes a partition boundary and detection of S is sufficient. The WRITE block is omitted from the guide since it involves tape features which are not being considered.

[Figure 3.8. Algorithm 3.1, Precedence Partitioning of GPSS Programs. The flowchart reads: while the source deck has not been completely read, read the next card; if it is not a block, continue; otherwise set N = N+1 and scan the operand field to determine the SNA's read by this block. If the block type is in table B, or any SNA read is in table S, set P(N-1) = 0 and clear tables B and S. If ALL is in column B of the partitioning guide for this block type, the block is the last in its partition; otherwise, using the guide, put the column B entries in table B and the column S entries in table S.]

    [The rows above the BUFFER entry are lost in the scan.]
    BUFFER       B: ALL
    CHANGE       B: ALL
    COUNT        S: Pj, j = operand A
    DEPART       B: QUEUE;  S: Qj, j = operand A
    ENTER        B: LEAVE;  S: Sj, Rj, j = operand A
    EXAMINE      B: ALL
    EXECUTE      B: ALL
    GATE         B: ALL
    GATHER       B: ALL
    GENERATE     B: ADVANCE, ENTER, GATE, LOGIC, MARK, PREEMPT, SEIZE
    INDEX        S: P1
    JOIN         B: ALTER, EXAMINE, REMOVE, SCAN;  S: Gj, j = operand A
    LEAVE        S: Sj, Rj, j = operand A
    LINK         B: ALL
    LOGIC        S: Lj, j = operand A
    LOOP         B: ALL
    MARK         S: (MPj, Pj, j = operand A) ⊕ M1
    MATCH        B: ALL
    MSAVEVALUE   S: MXj(k,l) ⊕ MHj(k,l); j = operand A, k = operand B,
                 l = operand C
    PREEMPT      B: RETURN;  S: Fj, j = operand A
    PRINT        (no entries)
    PRIORITY     B: ALL if operand F is specified;  S: PR
    QUEUE        B: DEPART;  S: Qj, j = operand A
    RELEASE      S: Fj, j = operand A
    REMOVE       B: (ALTER, EXAMINE, JOIN, SCAN) ⊕ ALL if operand F is
                 specified;  S: Gj, j = operand A
    RETURN       S: Fj, j = operand A
    SAVEVALUE    S: Xj ⊕ XHj, j = operand A
    SCAN         B: (ALTER, JOIN, REMOVE) ⊕ ALL if operand F is
                 specified;  S: Pj, j = operand E
    SEIZE        B: RELEASE;  S: Fj, j = operand A
    SELECT       S: Pj, j = operand A
    SPLIT        S: Pj, j = operand C
    TABULATE     S: Tj, j = operand A
    TERMINATE    B: ALL
    TEST         B: ALL
    TRACE        B: ALL
    TRANSFER     B: ALL
    UNLINK       S: Cj, j = operand A
    UNTRACE      (no entries)

    Table 3.3. Precedence Partitioning Guide

Block precedence partition membership is indicated by a subscripted variable P(n), where n is the block number assigned in order of appearance in the source deck. P(n) = 1 when the following block belongs to the same partition. P(n) = 0 for the last block in a partition. Applying the algorithm to the example program gives the result in Figure 3.9. Simulation of one transaction moving through this program can be done in four processing "intervals", corresponding to the four partitions, on a suitable multiprocessor.
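Algorithm 3.1 is compact enough to restate in modern code before turning to Figure 3.9. The sketch below is a Python paraphrase of Figure 3.8 under simplifying assumptions: each block is reduced to its type and the set of SNA's it reads, and the guide is a dictionary supplying the column B and column S entries and an ALL flag.

    def partition(blocks, guide):
        # blocks: list of (block_type, snas_read) pairs in deck order.
        # Returns P, where P[n] = 1 if block n+1 is in the same partition.
        P = {}
        table_b, table_s = set(), set()
        n = 0
        for block_type, snas_read in blocks:
            n += 1
            if block_type in table_b or snas_read & table_s:
                P[n - 1] = 0                    # close the current partition
                table_b, table_s = set(), set()
            col_b, col_s, is_all = guide.get(block_type,
                                             (set(), set(), False))
            if is_all:                          # ALL in column B: this block
                P[n] = 0                        # is the last in its partition
                table_b, table_s = set(), set()
            else:
                P[n] = 1                        # provisional; the next block
                table_b |= col_b | {"EXECUTE", "GENERATE"}   # may reset it
                table_s |= col_s
        if n:
            P[n] = 0                            # the final block ends a partition
        return P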
    Block Number   Block       Operands       P(n)   Partition Number
    1              GENERATE    300,FN$EXPON   1      1
    2              ASSIGN      12,FN$TERMI    1      1
    3              INDEX       3,10           0      1
    4              SEIZE       P12            1      2
    5              QUEUE       CPU            1      2
    6              SEIZE       CPU            0      2
    7              DEPART      CPU            1      3
    8              ADVANCE     V1             0      3
    9              RELEASE     CPU            1      4
    10             RELEASE     P12            1      4
    11             TERMINATE   1              0      4

    Figure 3.9. Precedence Partitions of Program in Figure 3.1

3.3.2.3 Concurrent Transaction Movement

The interesting feature of simulation programs is the prospect for concurrently processing more than one transaction. This third level of parallelism is without analogue in conventional programs. In section 3.3.1 the point was made that a single transaction moving through a simulation program was like a single execution of a conventional program. If multiple transactions moving through a simulation program were like multiple executions of a conventional program, it would be a straightforward matter, in theory, to concurrently process as many transactions as desired. This is not, however, the case. There is an interrelationship of transactions with the model, including other transactions, that must be taken into account.

It has been mentioned, in section 3.3.1, that the processing of a transaction can be interrupted for several reasons and that transactions can move through the simulation program at different rates, causing changes in their ordering in the program. These effects come about because all transactions are related to the model and each transaction can affect variables to which other transactions have access. A system variable is a variable which can be both written and read by more than one transaction. Any block which uses a system variable as an operand that will be read must be processed in a certain order with respect to blocks that write that variable. This means that precedence requirements extend across transactions and become a run time function rather than being defined by the program at compile time.

A restricted application of multiple transaction concurrency would be to select for processing only the set of transactions which represent current events. This set consists of the members of the current events chain described in the serial execution section, 3.3.1. The restriction is not necessary if proper care is exercised in the selection of transactions to move. That is, members of the future events chain are also candidates for concurrent processing. The essence of this is that a range of the real time being simulated can be involved in processing being done concurrently. At an instant of real time on a multiprocessor, more than one instant of simulated time can be in process. All of the interactions that take place in real time in the model must be considered, plus the interactions caused by overlapping multiple instances of simulated real time into a single instant on the computer.

As an example of time overlap consider a computer system with two independent terminals and one processor that can be called into exclusive use by either terminal. If the processor is in use when a terminal requests it, the terminal must wait for the processor to become available. Figure 3.10(a) shows a possible use of the equipment on an arbitrary real time scale. Solid lines represent equipment in use, dashed lines represent time spent waiting for the processor, and vertical dashes are transitions between equipment. Figure 3.10(b) shows the same equipment usage as it can be simulated.
An event at real time one can be processed simultaneously with an event at real time zero if the events are independent, as they are in this example. Likewise for events at real times two and three. In Figure (a) there are six distinct times at which events to be simulated take place. They are tabulated below. On a multiprocessor restricted to simulating all events at one instant of time before advancing to the next, six distinct processing intervals would be required. In Figure (b), with the overlap of times (0,1) and (2,3), there are only four distinct times. On a multiprocessor which allows such overlap only four distinct processing intervals are required.

[Figure 3.10. Example of Time Overlap on Independent Events. Panel (a), Timing of System to be Simulated, plots Terminal 1, the Processor, and Terminal 2 against real time 0 through 5. Panel (b), Timing with Overlap, plots the same usage against simulated real time, with times 0 and 1 overlapped and times 2 and 3 overlapped.]

    (a) Processing of Figure 3.10(a)

    Real Time   Event                                             Processing Interval
    0           Initiate Terminal 1                               1
    1           Initiate Terminal 2                               2
    2           Terminal 1 takes Processor                        3
    3           Terminal 2 requests Processor                     4
    4           Terminal 1 frees and Terminal 2 takes Processor   5
    5           Terminal 2 frees Processor                        6

    (b) Processing of Figure 3.10(b)

    Simulated Real Time   Event                                             Processing Interval
    (0,1)                 Initiate Terminals 1 and 2                        1
    (2,3)                 Terminal 1 takes and Terminal 2 requests
                          Processor                                         2
    4                     Terminal 1 frees and Terminal 2 takes Processor   3
    5                     Terminal 2 frees Processor                        4

    Table 3.4. Distinct Events Comparison of Figure 3.10

For this example, the time overlap allowance results in a savings of two processing intervals. The error that must be avoided, however, is letting terminal two take the processor before terminal one, since they both appear ready at the same place on the scale in Figure 3.10(b). That is an example of the interaction introduced by overlapping multiple instances of simulated real time.

A specific GPSS example of the precedence requirements due to multiple transactions is now taken from the program in Figure 3.1. Again, a program segment is reproduced. The single transaction precedence partition numbers are listed as determined in the previous section.

    Block Number   Block     Operands   Partition Number   Selected Routine Instructions
    4              SEIZE     P12        2                  N4 = N4+1
    5              QUEUE     CPU        2
    6              SEIZE     CPU        2
    7              DEPART    CPU        3
    8              ADVANCE   V1         3                  TIME = TIME+100*N4

    1   VARIABLE   K100*N4

Blocks four and eight are in separate partitions and therefore will not be processed concurrently for any single transaction. The writing and reading of N4 appears to be properly separated. Consider the following multiple transaction situation, which uses times that actually occurred as taken from Figure 3.5. Suppose transaction six is processing partition two, including block four, at time 1791 and, concurrently, transaction five is processing partition three, including block eight, at time 1953. Each transaction is following the precedence partition, but six is writing N4 while five is reading it. Even though transaction five has executed block four, it cannot correctly execute block eight until all transactions with times less than 1953 have moved through block four. It is therefore not sufficient just to follow the precedence that can be detected at compile time.

A similar situation does not exist with operand P12, which occurs in three different partitions. P12 refers to parameter 12 of the transaction executing the block. In this program no transaction can change the P12 value of another.
Fortunately the variables and block types involved in precedence requirements between transactions are known. The variables have been previously defined in this section as system variables. There is a special category of three standard numerical attributes, the function, variable, and boolean variable, which are programmer defined. They are system variables if the definition uses a system variable. The system block types are those which, by definition, implicitly use system variables.

The use of system variables is indicated for each block by S(n), where n is the block number, as in the precedence partition algorithm. S(n) = 1 when the block does not use any system variables, due to either the block type or the operands. S(n) = 0 when it does. The assignment of values to S(n) is made on a block by block basis simply by comparing the type and operands against a table of system blocks and variables.

Allowing concurrent processing of multiple transactions changes the concept of the overall scan, Figures 3.2 through 3.4, drastically. The current and future events chains are eliminated. A single clock is replaced by a time word for each transaction. The serial nature of the algorithm must be revised to take advantage of the parallelism in the program. In Chapter 4 a hardware unit is proposed as a replacement for the current software overall scan algorithm.

Elimination of the two chains makes it necessary to define some new terms. Transaction time is the simulated time word associated with each transaction. It is similar to, and replaces, the block departure time. A min time transaction is one whose transaction time is equal to or less than the transaction time of all other transactions which are not blocked. Refer to section 3.3.1 for the meaning of blocked. The set of min time transactions is the set of transactions which would have been members of the current events chain.

The significance of S(n), mentioned above, is that any block with S(n) = 0 must be processed only by a min time transaction. The effect, referring to the example in Figure 3.10, is that the action of taking the processor can be restricted to the min time transaction. This assures terminal one of success in taking the processor at time two, avoiding the potential error mentioned in the example. Note that if S(n) = 0 for all blocks, processing reverts to the time ordered case. S(n) = 1 allows those blocks which are time independent to be processed by non-min time transactions.
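Because the S bit depends only on a table comparison, its assignment is simple to sketch. The block type and SNA tables in the Python below are illustrative placeholders seeded from the examples in the text, not the complete tables the thesis used.

    SYSTEM_BLOCK_TYPES = {"SAVEVALUE", "ENTER", "LOGIC", "QUEUE"}  # partial
    SYSTEM_SNA_CLASSES = {"X", "F", "S", "Q", "L"}                 # illustrative

    def assign_s(block_type, operand_snas):
        # S(n) = 0: the block uses a system variable and must be executed
        # by a min time transaction. operand_snas is a set of (class, index)
        # pairs naming the SNA's appearing in the operand field.
        if block_type in SYSTEM_BLOCK_TYPES:
            return 0
        if any(cls in SYSTEM_SNA_CLASSES for cls, index in operand_snas):
            return 0
        return 1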
This was not necessary in the single transaction case since a blocked transaction would be a deadlock. No other transaction would ever be available to clear the blocking condition. The pro- gram itself is faulty if this happens. Simulations with multiple transactions do cause a reduction in the length of precedence partitions and thus a decrease in the number of concurrently executable blocks per transaction. The total concurrency increases however, due to more than one transaction being processed. The BUFFER block is used to stop the movement of a transaction when it could normally move to the next block. The effect of this is achieved by assigning P = to the previous block and removing the BUFFER. PRIORITY with the BUFFER option is assigned P = for the same reason. Normally GATE and TEST blocks, which are like conditional jumps for transactions, receive P = since the next block is unknown. The entry ALL in column B of the partitioning guide causes this. When GATE or TEST blocks with alternate exits appear In a sequence, an IF tree for transactions is formed. Appropriate revision of the routines for these two block types will allow combined execution of the sequence as one block of greater length, making effec- tive use of the hardware in Chapter 2. When the sequence occurs the precedence partitioning algorithm is applied to all blocks, except the last, as if the column B entry in the guide were blank. Thus operand precedence is checked to determine P rather than assigning P = directly. 93 The above modifications change the partitioning algorithm from defining single transaction precedence partition boundaries to defining sets of blocks which are simultaneously executable in a multiple transaction environ- ment. In the remainder of this paper the sets will be referred to as partitions. Now consider the modifications to S(n) . The purpose of this bit is to flag blocks which must be executed only by min time transactions. If two such blocks are in sequence with no possible change in transaction time from one to the other, it is only necessary to flag the first one. If the trans- action is at the minimum time on execution of the first block it will neces- sarily be there for the second also, regardless of the value of S. Therefore assign the second block S = 1. The choice is made to improve performance in the machine and will be explained in section k.k. This asssignment rule can be extended to allow any number of blocks in the sequence and also allow intervening blocks which do not use system var- iables. One condition is necessary however. If a labeled block, or other block which can be the destination of a jump, occurs, it breaks the sequence. A transaction which is not at the minimum time can transfer to such a block and must be prevented from executing a system variable block. A program has been written that scans GPSS decks, simultaneously applying the partition algorithm and system variable table look-up, with modi- fications mentioned above, to generate the code bits SP(n) for each block. The result is an indication of the concurrency in the program but is of course a static measure. It does not measure the effects of multiple transactions or selected paths through the program. Results of scanning several programs are given in Table 3.5. The meanings of the column headings are: 9h (1) The name identifies the application of the program. The source of the listing is given by reference. (2) Blocks are the executable GPSS statements. 
(3) Blocks which unconditionally transfer transactions are tabulated. They shorten partitions since they are considered the last block in a partition without inspecting the destination of the transfer.

(4) The maximum partition length is the largest number of blocks that are simultaneously executable.

(5) The average partition length is the average number of simultaneously executable blocks.

(6) This column is the number of blocks for which S = 1. The processing of these blocks is never delayed to wait for min time transactions.

(7) This column is the number of blocks for which P = 1.

(8) This column is the number of blocks for which both S and P are one.

The maximum and average partition lengths are the figures of most interest. For all the programs that were scanned the average is near two. Thus, on the average, any transaction at any time can be executing two blocks concurrently. It is felt that the average could be raised to three blocks with a more sophisticated scan program. The current program implements the algorithm of Figure 3.8 with the modifications of this section. This algorithm does a sequential scan of the source deck. Improvements could be made by following traces through the program so transaction decision trees could be found and the destination of unconditional transfers could be examined. Loops made with the LOOP block could be expanded. This has been done by hand for two of the programs.

[Table 3.5. Program Partitions and Processing Codes. The table body, and the pages that followed it through the opening sections of Chapter 4, are not recoverable from the scan. The text resumes below in the midst of section 4.3.2, on the use of the decision processor within a task processor.]

A statement of the form IF (A ≥ B) THEN X; ELSE Y; can be rewritten equivalently as IF (A < B) THEN Y; ELSE X;.
Use of the biased connection requires only one node calculation in five of the unit processors for the same linear 106 BIT 12 3 4 I I 1 1.1 PROCESSOR 1 A I ■ 1 \ >. i 1 1 ^ i / TO U.R 4 TO U.R ! i i . , TO U.R 6 , \ • • • i / , 1 BIT 4 i i BIT 2 BIT 3 a i U.R 2 DECISION PROCESSOR -CONNECTION BUSES NODE OUTPUT REGISTERS (a). Level Connection Scheme s « > >. l BIT 2 5 4 \ l\ X ' . X / • \ • • \ • N. • # \t_ i \ L_ I \ I k t ! | BIT 2 A | BIT 3 M 1 BIT 4 " " ' '"' ' "" " " " | | ■ ■ U. Rl 2 3 4 5 6 7 8 'CONNECTION BUSES DECISION PROCESSOR NODE OUTPUT REGISTERS [b) . Biased Connection Scheme Figure k.2. Connections Between Decision and Unit Processors 107 tree. The connection scheme for a given tree is selected at compile time. The criterion for selection is that the number of node values which must be calculated by any one unit processor should be minimized. The original philosophy of the decision processor was to make better use of existing arithmetic processors when a program had many con- ditional jumps. It was assumed that any multiprocessor application would require a reasonably large number of processors. An interesting result of the Fortran block analysis was that, excluding decision trees, the required number of processors was small. For most programs the number was two to four. The number of unit processors in this machine is in the range of eight to l6 because of the IF trees. It is reasonably accurate to say that the block programs consist primarily of IF trees. k.3-3 Task Processor Configuration Each task processor is capable of processing any block type. The parallelism within a block is exploited by providing parallel unit proces- sors and a decision processor within the task processor. The number of tasks that can be in execution simultaneously is dependent on the concurrency between blocks and transactions. The number is the sum of the concurrently executable blocks per transaction, taken over the number of concurrently moving transactions. Section 3«3«3 showed that the number of blocks per transaction was slightly over two for six significantly different GPSS programs. The number of transactions which can be moved concurrently cannot be measured by study of the program. It is a run time function of the program being executed on this parallel machine. A simulator of the machine was written, primarly in GPSS, to measure the transaction concurrency of actual GPSS programs. Details 108 of the simulation system are given in the Appendix. The results are given here. Programs were tested in the simulator under the condition that executing tasks uses an amount of time determined by the analysis of Fortran equivalents for GPSS blocks. In the first test the coordination unit, which distributes tasks to processors, was assumed to take zero time to operate. This measures the best performance that can be expected. The results are given in Table k .1 as the rows marked (l) . Column headings for Table k.l have the following meanings: (1) The program name corresponds to the names used in Table 5»5« (2) "Total transactions" is the number of transactions that were active at some time in the simulation. For the thesis example program it can be seen in Figure 3-5 that 2k transactions existed although only 10 went through the entire program. (3) "Blocks executed" is the number of blocks for which execution was simulated. (k) Beginning with this column there are multiple entries for each test program. The number of task processors is a design parameter for the machine. 
Simulations were run with four, eight, and 16 processors to observe the effects on the entries in columns (5) and (6).

(5) "Concurrent transactions" is the number of transactions which were being processed simultaneously. Entries were generated in the simulator by sampling transactions being processed at a time interval equal to the shortest block execution time.

(6) This column lists the simulated execution time.

[Table 4.1. Simulation Results: Concurrent Transactions. The table body is not recoverable from the scan.]

A second test was run in which operation of the coordination unit
A transaction which uses a system variable needs to be delayed only until it is known that no transaction at a lower simulated time can change that variable. The run time information on the model status required to do this is much greater than what is needed for the machine presented here. This implies much greater complexity in the coordination unit. The machine organization conclusion drawn from Table k .1 is that eight task processors are reasonable . Limiting the design to four task processors imposes a physical limit of four on transaction concurrency and increases execution time significantly. There is improvement in execution time going from eight to l6 but it is not as great, even in absolute terms, as the improvement in going from four to eight task processors. k.J.k Hardware Design Considerations Design objectives for this machine are to use currently available, familiar, inexpensive parts and balance the speeds and bandwidths of all system components. As a starting point it is assumed that the task processor is designed with TTL gates having propagation delays near 10 nanoseconds (ns) . A 5 MHz clock generates pulses at 200 ns intervals, allowing combinational logic chains up to 20 gates. Note that one cycle of an eight level decision 112 processor takes just one clock. If instructions take an average of two or three clocks the in- struction execution time averages out near 500 ns. Table J.l gives a count of instruction steps for the execution of many GPSS block types on a multi- processor capable of executing p operations simultaneously. Using the 500 ns average instruction execution time, average block execution times can be stated. These times are used in the simulator described in the Appendix. Now consider the amount of hardware required for the task processor configuration, excluding memories and their interconnections. In summary there are eight to l6 unit processors, one decision processor of six levels, and one task processor control unit. The unit processor is a simple device with on the order of 1000 gates and flip flops. From Table 2.2 the six level decision processor, including total sector control, has about 500 logic circuits. Estimate the task processor control unit at 3500 gates. A task processor with eight unit processors has on the order of 12,000 gates. For l6 unit processors the number is 20,000. A configuration of eight task pro- cessors requires approximately l60, 000 gates. k.k Coordination Unit This unit is concerned with the selection of tasks for execution in the task processors. Selection is based on the simulation model status, the transaction time of all transactions, and the SP processing code assigned to each block in the simulation program. A transaction available for movement is selected, then tasks are formed from the sequence of blocks the transaction will move through. In GPSS/360 the only transactions selected for movement are those 113 on the current events chain. The equivalent of this chain is the set of run time transactions, those which are farthest behind in simulated time. The selection of transactions is still oriented toward the min time set but will, as explained in section 3.3.2.3, select transactions which are not at the min time. When all transactions in the min time set are in the process of being moved, the next transaction to move is chosen from the set nearest to the min time. This selection method reduces the time spread on transactions being processed and minimizes the conflicts caused by overlapped times. 
Recall that blocks with S = can be processed only by min time transactions. By always working on transactions which are min time, or near to it, the min time (which represents the upper limit of completed simulation time) advances as rapidly as possible. The result is that the real time spent by transactions at blocks with S = 0, waiting for the model to "catch up" to them, is reduced. Thus selection of the transaction to move is based on the transaction time with smaller values being selected. Following selection of a transaction, a task is set up for each block in the partition the transaction is in. Part of the transaction status information is a word which identifies the next block it is to process. Se- quential blocks are in the same partition until the code bit P = 0. The tasks formed fall into the categories of being available for processing im- mediately or being available when the transaction becomes a min time trans- action. Two queues, called "process" and "delay" respectively, are main- tained for the two categories. Task formation and routing to one of the two queues is controlled by the processing code bits, SP, assigned to each block in the program. The interpretation of the code is given now. The significance of S = is that 114 the block uses a system variable and processing must be delayed until the transaction time is the minimum of all transaction times. The significance of P = is that the block is the last in a partition of simultaneously executable blocks. The transaction cannot continue moving until the current partition has been processed. The action taken by the coordination unit when examining the block code is listed in Table k.2. Code S P Coordination Unit Action 1 1 Add task to process queue. Examine next block. 1 Add task to process queue. Select next transaction. 1 Add task to delay queue. Examine next block. Add task to delay queue. Select next transaction. Table k.2. Processing Code Interpretation The actual implementation has one modification. All tasks in a partition following the first one that is added to the delay queue are also added to the delay queue. The reason is that blocks which use system vari- ables are assigned S = 1 if a prior block has been assigned S = and there can be no change in transaction time between the two blocks. In summary, the coordination unit contains logic to select the transaction with the minimum value of simulated time, to examine the code for the block it is to process next, and to place the transaction-block pair, a task, in one of two queues. The process queue receives tasks that can be executed as soon as a processor is available. The delay queue receives tasks which must wait for the min time of the model to reach their scheduled event time. The coordination unit releases tasks from the queues for distribution 115 to the processors. Figure k.3 is a diagram of the unit. Discussion of the components is covered in following sections starting with the selection of a transaction to move. k.k.l Transaction Selection Transactions in this machine can be in several states, some of which make a transaction ineligible for movement. The logic of this part keeps a record of the state of all transactions and allows selection of movable ones according to their simulated time ordering. When a transaction is selected for movement, all blocks in the current partition are set up as tasks for processing. The transaction is not eligible to move again until all tasks in the partition are completed. 
A transaction in this state is considered "selected".

A transaction may reach a blocking condition, defined in section 4.4.1. Such a transaction cannot move, but note that its simulated time is implicitly updated to the time when the blocking condition is removed. A similar situation exists for transactions which are put on user chains by means of the LINK block. They cannot be moved until they are removed from the chain by an UNLINK block. Transactions which are blocked or are on user chains are considered to be in the same state with respect to selection for movement. A transaction in this state is both "selected" and "blocked".

A third state is the transaction which has entered the delay queue. The minimum simulated time in the model must be known so that transactions in this "delayed" state can be removed from the queue.

All of the above transactions are in the state of having been selected for movement. The final state consists of all unselected transactions. It is from these that one must be selected for movement.

Figure 4.3. Coordination Unit. (The original diagram shows: status returning from the task processors into a 1024 word by 42 bit input associative memory holding the transaction number, 32 bit time, 8 bit priority, and a 10 bit next block field; the transaction selection logic of section 4.4.1; a 1024 by 2 bit processing code register feeding the processing code evaluation and task routing logic of section 4.4.2; the delay queue of 128 words by 22 bits (section 4.4.3.2) and the process queue of 16 words by 20 bits (section 4.4.3.1), each with read and write address registers and an address comparator to inhibit loading when full, the delay queue adding a transaction number register and comparator; and the queue source selection and task output logic of section 4.4.4, which accepts task requests from, and delivers tasks to, the task processors.)

The logic in this part is primarily a moderate size memory on which a minimum value search can be performed associatively. One word per transaction is required. The number of words in the memory is the number of transactions defined as the maximum for the particular machine. The normal GPSS allocation is 600 transactions for the 128K memory size and 1200 for 256K or higher. Thus a 1024 word memory would be adequate for quite large simulations. Each word must contain fields for the transaction time, priority, and status indicators. The time and priority fields in GPSS use 32 and eight bits respectively. These fields, plus two bits to indicate "selected," C, and "blocked," B, transactions give a word length of 42 bits. The format for each word is given in Figure 4.4.

    bits 1-2     "Blocked" B, "Selected" C
    bits 3-34    transaction time
    bits 35-42   priority
    (each word also carries response indicators R1 and R2 and a "Delayed" bit D)

    Figure 4.4. Word Format of Transaction Status Memory

Priority is an extension of the time field on the least significant end. If a low value in the priority field is defined to represent a high priority, the highest priority transaction can be found by continuing the minimum value search through the priority field.

The C and B control bits are at the most significant end of the time field. C is set to one when a transaction is selected for movement. It is reset when processing of its partition is complete and it becomes eligible for selection again. B is set to one when a transaction is blocked or put on a user chain. It is reset when the blocking condition is removed or the transaction is removed from the user chain.
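The packing of these fields can be made concrete in software. The sketch below is illustrative only, assuming the bit numbering of Figure 4.4 with bit 1 most significant; the function names are invented for the example and are not part of the design.

```python
# Minimal sketch (not from the thesis): packing the 42-bit transaction
# status word of Figure 4.4.  B and C occupy the two most significant
# bits, then the 32-bit time, then the 8-bit priority.
def pack_status(blocked: int, selected: int, time: int, priority: int) -> int:
    assert 0 <= time < 2**32 and 0 <= priority < 2**8
    return (blocked << 41) | (selected << 40) | (time << 8) | priority

def unpack_status(word: int):
    return ((word >> 41) & 1,          # B: "blocked"
            (word >> 40) & 1,          # C: "selected"
            (word >> 8) & 0xFFFFFFFF,  # transaction time
            word & 0xFF)               # priority (low value = high priority)

# Because B and C sit above the time field, and priority extends it on
# the least significant end, an ordinary minimum-value comparison of the
# whole word reproduces the hardware search order: unblocked, unselected
# transactions first, then earliest time, then highest priority.
eligible = [w for w in (pack_status(0, 0, 120, 3), pack_status(0, 1, 100, 1))
            if not (w >> 40) & 0b11]   # mask out words with B or C set
winner = min(eligible) if eligible else None
```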
If either bit is one, the transaction will not be a responder to a minimum value search unless there are no unselected transactions. The "delayed," D, bit is not part of the minimum value search. The D bit is set when a transaction is routed to the delay queue. It is reset when the transaction is removed from the queue.

Each word has two response indicators, enabled on slightly different search bits. The first, R1, is enabled for bit B and the time and priority fields. Responders indicated by R1 are transactions in the min time set, whether previously selected or not. The second response indicator, R2, is enabled for bit C in addition to the bits for R1. R2 responders are nonselected transactions with the smallest value of time.

Any R1 responders which are also in the delay queue are now eligible for execution. Bit D is ANDed with R1 to eliminate responders that are not in the delay queue. The D bit for all remaining responders is reset, since the transactions will be released from the queue. After releasing transactions from the delay queue, responders in R2 are selected for movement. The action upon selection is covered in the next section. When all R2 responders have been selected, the interrogation of the memory begins again. Clearly it is necessary to update the memory when the status of any control bit, or any value in the time or priority fields, changes.

4.4.2 Processing Code Evaluation

Each transaction has a word identifying the block into which it will move next. The identifier is the number of the block in the program. In the coordination unit the next block number is used as a pointer to the needed processing code register entry. When a transaction is selected for movement, the processing code for the next block it is to execute is examined and interpreted according to Table 4.2. Sequential block codes are examined until P = 0, indicating the end of the partition and the need to select the next transaction. For each block in the partition a task is set up and routed to either the process or delay queue, described below. If a task is routed to the delay queue, the D bit for that transaction is set to indicate the transaction is being delayed.

A task is fully defined by the transaction and next block numbers. Since 1024 is the maximum number of both transactions and blocks, 10 bit fields are required for each. In the task processors the next block number is used as the address of a word which identifies the block type and the particular occurrence of this type.

4.4.3 Task Queues

4.4.3.1 Process Queue

Tasks in this queue are available for processing at the request of task processors. Each entry is a transaction number and the number of the block it is to execute. At 10 bits for each of these numbers, an entry is 20 bits. This queue provides a backlog of tasks for the processors. When it becomes full, the formation of tasks can be halted. The queue only needs to be long enough to assure that it cannot be emptied before task formation can resume. For a machine with eight task processors a queue of length 16 is easily long enough.

Design of the queue is circular with a first-in first-out discipline. Separate read and write address registers point to the oldest entry and the first empty location. The registers are incremented following every read or write. An address comparator detects a full queue and inhibits the formation of any more tasks.
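A software analogy of sections 4.4.2 and 4.4.3.1 can make the task formation path concrete. The sketch below is illustrative Python, not the hardware: it routes tasks by the SP code of Table 4.2, with the sticky delay-queue modification of section 4.4, into circular queues with read/write pointers and full detection. All names are invented for the example, and the one-slot-unused full test is a software convention standing in for the address comparator.

```python
from collections import namedtuple

Task = namedtuple("Task", "xact block")

class CircularQueue:
    """FIFO with separate read/write pointers, as in section 4.4.3.1."""
    def __init__(self, length):
        self.slots = [None] * length
        self.read = 0    # points at the oldest entry
        self.write = 0   # points at the first empty location

    def full(self):      # plays the role of the address comparator
        return (self.write + 1) % len(self.slots) == self.read

    def empty(self):
        return self.read == self.write

    def put(self, task):
        assert not self.full(), "loading inhibited when full"
        self.slots[self.write] = task
        self.write = (self.write + 1) % len(self.slots)

    def get(self):
        task = self.slots[self.read]
        self.read = (self.read + 1) % len(self.slots)
        return task

def form_tasks(xact, next_block, code, process_q, delay_q):
    """Walk sequential blocks, routing per Table 4.2, until P = 0.
    code[b] holds the (S, P) bits assigned to block number b."""
    delayed = False
    b = next_block
    while True:
        s, p = code[b]
        if s == 0:
            delayed = True   # first S = 0 task goes to the delay queue...
        queue = delay_q if delayed else process_q  # ...and so do the rest
        queue.put(Task(xact, b))
        if p == 0:           # last block in the partition:
            return           # select the next transaction
        b += 1
```

For the parameters chosen in the text, the process queue would be `CircularQueue(16)` and the delay queue `CircularQueue(128)`.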
4.4.3.2 Delay Queue

Tasks which must be processed in a simulated time order, because of their use of system variables, are routed to this queue. Processing is delayed until the transaction becomes a min time transaction. The release of tasks is controlled by the transaction selection logic, which interrogates the time value of all transactions to identify those in the min time set.

Recall that all blocks in a partition following the first block with S = 0 were also routed to the delay queue. Thus when a min time transaction is found to be in the delay queue, the task that caused transaction movement to be delayed is released, as well as the subsequent tasks in the same partition. The processing code P bit is used to identify these tasks.

Tasks in the delay queue can be in the state of having been released but not yet removed from the queue for processing. An indicator is needed to distinguish these tasks from those not yet released. Each entry in this queue needs, therefore, the 20 bits that identify a task, a bit for the P portion of the processing code, and a bit to indicate released tasks. Actual setting of the release bit is based on a comparison of the number of the transaction to be released and the transaction numbers in the queue.

This queue holds tasks which become available for processing when the model reaches the condition that no transaction has a simulated time less than that of the transaction being delayed. The queue length in this case must provide for potentially many tasks. It is suggested that this queue be of length 128. When it becomes full, further loading must be inhibited. The design is similar to the process queue except that the removal of entries is based on matching transaction numbers rather than length of time in the queue. Movement of tasks from the queues to the task processors is controlled by the task output logic described next.

4.4.4 Task Output

The two sources of tasks for output to the task processors are the process and delay queues. The overall processing algorithm is to move transactions which have the smallest time values. Transactions in both queues have time values which range upward from the minimum completed time. Tasks in the delay queue which are marked as released belong to the min time transaction set by definition. These tasks have priority over tasks from the process queue for output. If any release indicator bit in the delay queue is set, that queue is the source of tasks for output. When all released tasks have been output for processing, tasks are taken sequentially from the process queue.

The task output logic communicates with the task processors. It accepts requests for tasks. It identifies, for the requesting processor, the transaction and next block number that make up the task.

4.4.5 Coordination Unit Hardware

This unit must be able to form and dispatch tasks at a rate to match the task processor consumption. While the number of task processors is unlimited in theory, this unit will place a limit on the number it can support. The design and hardware suggested here are for the eight task processor configuration proposed in section 4.3.3. Suggestions will be given for increasing that number.

A pipeline effect exists in the selection of transactions and formation of tasks. Transactions which have been moved through the blocks in a partition enter at the top of the coordination unit with time and next block data. They filter through the unit to exit as tasks for further block movement.
The memory can be updated while responders from the previous search are being resolved and examined by the code evaluation logic. Note that one interrogation may yield more than one responder, and each responder will yield an average of two tasks. Loading of the two queues can be interleaved with unloading by the task output logic.

Average task execution time, as determined from Table 3.1 with the assumption of 500 nanoseconds (ns) per instruction, is 10 microseconds. For eight task processors, with full utilization, the task output rate should therefore average one task per 1.25 microseconds. With ordinary TTL logic having typical propagation delays of 10 ns, the task output logic will have no trouble satisfying that rate.

In the worst case for the process queue, it must be both loaded and unloaded at that rate. Since the two actions cannot be concurrent, the rate for either must be near 600 ns. Clearly there will be adequate time to decode a four bit address and either read or write a 20 bit word.

The delay queue is more complicated. Loading is similar to the process queue, but unloading requires transaction number comparisons. When the transaction selection logic has an R1 responder, that transaction number is known to be in the delay queue. Entries in the queue are not required to be in simulated time order but will tend to be so due to the selection algorithm. The comparison begins with the oldest entry and moves sequentially through the queue. When the matching transaction is located, the task is released. All other tasks for this transaction, identified as all subsequent tasks in the queue until the partition code bit, P, is zero, have the release indicator set. They are thereby marked as executable. The transaction location sequence of reading, followed by a comparison and either an increment of the read address register or release of a task, is a two clock sequence.

If the process queue is empty and all tasks must come from the delay queue, the release rate must match the task output rate, 1.25 microseconds per task. At this point a 100 ns clock rate is specified for the coordination unit. At two clocks to load a task, 200 ns are used. This leaves 1000 ns to locate and release a task, so five transaction number comparisons can be performed in the allotted time. This is thought to be quite adequate. The conclusion, with respect to the delay queue, is that it is capable of being loaded and unloaded in the required time even when it is the only source of tasks.

Feeding the two queues is the processing code evaluation logic. The output requirement here is again an average of one task per 1.25 microseconds. There is clearly no time problem if selected transactions can arrive at a fast enough rate. Since there is an average of two tasks per partition, the arrival rate must be one transaction every 2.5 microseconds from the transaction selection logic.

A bit serial interrogation at the rate of 100 ns per bit requires 4.2 microseconds to cover the 42 bit associative memory. Thus transaction selection appears to limit the operating speed of the coordination unit. Comments here will show that the situation is not desperate. First note that it is possible to have more than one transaction respond to a search. If there is an average of two responders, the output rate is satisfactory. In general it is not necessary to interrogate all bits to determine the minimum value. Interrogation can cease when the minimum is found, thereby reducing the search time.
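The bit serial search, and the early exit just mentioned, can be sketched in software. This is an illustrative Python model of the associative search, not the hardware, and the refinements described next (grouped leading bits and near-minimum clusters) operate on the same loop.

```python
def min_search(words, width=42):
    """Bit-serial minimum search over an associative memory, modeled in
    software.  words are integers; bits are interrogated from the most
    significant bit down, one 100 ns bit time per step."""
    responders = set(range(len(words)))      # initially all words respond
    for bit in range(width - 1, -1, -1):
        zeros = {i for i in responders if not (words[i] >> bit) & 1}
        if zeros:                 # a word with 0 here beats all with 1
            responders = zeros
        if len(responders) == 1:  # minimum isolated: interrogation can
            break                 # cease, saving the remaining bit times
    return responders             # all words tied at the minimum value

# e.g. min_search([0b101100, 0b100111, 0b100101], width=6) -> {2}
```

Grouping the 16 most significant bits into a single parallel interrogation, as suggested below, replaces the first 16 iterations of this loop with one step whenever any responder survives the group.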
For a minimum value search all words are initially considered responders and the interrogation begins with the most significant bit. It is seldom that the discrimination which produces the final responders occurs in the most significant bits. For this reason it is possible to group the interrogation of these bits. Suppose, for example, the 16 most significant bits were interrogated in parallel. If responders still exist, serial interrogation continues with a savings of 15 interrogation times. If no responders exist, it is necessary to re-initialize the response stores and begin a bit serial interrogation at the most significant bit. The time penalty has been one interrogation plus initialization.

Another scheme which reduces the number of interrogations and increases the number of responders is based on the fact that R2 responders need not be true minimums. By always stopping the interrogation a small number of bits short of the least significant bit, a cluster of transactions near the minimum is selected. It was noted earlier that just two responders eliminated the time problem.

Finally, note that the original calculation was based on the maximum task output rate.

The conclusion here is that the proposed coordination unit hardware, implemented with 10 ns logic, will perform with sufficient speed to support eight task processors. Suggestions have been given on ways to increase the speed in the area which appears to be most limiting. The estimated number of logic circuits in the unit is 10,000, plus the memory chips.

4.5 Memories

Data and program memories are separated in this machine and will be discussed separately.

4.5.1 Data Memory

Processors are organized as a cluster of unit processors to form a task processor, then a cluster of task processors. The data required by a task processor is the data associated with the transaction and with the block for the task. The memory organization given here uses a small memory at each task processor for the data required for that task, and a main memory to which each task processor has access. The main memory provides the only communication between task processors. Execution of a task involves a transfer of all data related to the task from the main memory to the task memory, followed by restoration of the variables changed in the execution.

The maximum amount of transaction data in GPSS/360 occurs when 100 full word transaction parameters are used. In this case the total transaction data is 436 bytes. Blocks use a basic allocation of 12 bytes plus four bytes per operand. There is a maximum of seven operands per block, with the average number near three. At four bytes per operand the average allocation is 24 bytes. Transaction and block data, totalling 460 bytes for maximum transaction parameter allocation and typical blocks, must be transferred to each task processor for each task. Each task may also use system variables which are accessed directly from the main memory. The number of such variables per block is small; to design for 10 variables, or 40 bytes, would be liberal. Total task data is then 500 bytes. Given this background, the data memory sizes, bandwidths, and interconnections are discussed in this section.

4.5.1.1 Main Memory

The starting assumption for calculations on the main memory is that the time required to transfer task data from main to task memory is equal to the compute time. The time to restore changed variables is negligible. Average task execution time is 10 microseconds (us).
The bandwidth per task, for maximum size transactions, is

    B.W.(main-task) = (500 bytes x 8 bits/byte) / (10 x 10^-6 seconds)
                    = 400 x 10^6 bits/second.

The total main memory bandwidth, for eight task processors, is

    B.W.(main) = 8 x 400 x 10^6 b/s = 3200 x 10^6 b/s.

An ordinary core memory can supply a 64 bit word in 0.5 us. To achieve the required maximum bandwidth the number of memories is

    M(max) = (3200 x 10^6 b/s) / (128 x 10^6 b/s per memory) = 25 memories.

Design of the machine with 32 main memory units more than satisfies the maximum bandwidth requirements. A 16 memory design provides a bandwidth of

    B.W.(16) = 16 x 128 x 10^6 b/s ~ 2000 x 10^6 b/s.

Each task processor would receive one-eighth of this, or 250 x 10^6 b/s. This leads to a reduction of 200 bytes in the task bandwidth, or 50 full word parameters. Thus 16 memories can supply 50 fullword or 100 halfword parameters. The choice between 16 and 32 memories is a design option. The lower number is satisfactory for a high percentage of actual simulations.

Total variable memory size is somewhat flexible. The largest GPSS/360 option begins at 256K bytes and extends upward to the core limits. Suppose 400K bytes were chosen for a large system using 16 memories. Each memory is then 25K bytes, or less than 4K 64 bit words. Main memory is therefore 16 modules of 4K word, 64 bit memories with a 0.5 microsecond cycle time. Optionally, 32 slower modules could be used.

The switch to connect this memory to the task processors is a fanout tree. Such a tree has on the order of three gates per bit per destination. The number of bits in the path is the product of the number of memories and the bits per memory, or 16 x 64 ~ 1000. The eight task processors are the destinations. The gates in the switch thus total 3 x 1000 x 8 = 24K.

4.5.1.2 Task Memory

Each task memory must be capable of storing the data for the current task plus the data being set up for the next task. That is, it must have storage for twice the maximum task data requirement. The maximum amount of task data is 436 bytes for the transaction and 40 bytes for the block, for a total near 500 bytes. Let the memory be 1K bytes.

The bandwidth of data moving between this memory and the main memory was previously calculated as B.W.(main-task) = 400 x 10^6 bits per second. The task memory must simultaneously provide operands for the task processor. The number of memory accesses by a block program with the 10 us average execution time is near 50. If these are all 32 bit words, the memory to processor bandwidth is

    B.W.(task-proc) = (50 words x 32 bits/word) / (10 x 10^-6 seconds)
                    = 160 x 10^6 b/s.

This bandwidth is satisfactory only if there are no conflicts in accessing memory. Conflict free access would require one memory capable of supplying the total bandwidth going between the task memory and both the main memory and the task processors, that is, 560 x 10^6 b/s. For 32 bit words, the cycle time would have to be

    t(cycle) <= 32 bits / (560 x 10^6 b/s) ~ 60 x 10^-9 seconds.

A 1K byte memory with a cycle time of 60 ns per 32 bits does not meet the design objective of inexpensive parts. To lower the cycle time an array of eight memories is suggested. The memories should have 32 bit word lengths. A total of 250 words is needed, so each memory has only 32 words. The connection to the task processors is then a crossbar switch; a not unreasonable connection for the small numbers involved.
For example, connecting eight processors and memories with a 32 bit path takes 8 x 8 x 32 ~ 2K gates. If the design uses 16 unit processors the connection takes 4K gates.

Memory conflicts are possible with this modular scheme. Recall from section 4.2 that compilation of the routines can finely tune the code produced so as to minimize conflicts. Assume, however, that conflicts do occur one-fourth of the time, increasing the required bandwidth between the memories and processors to 200 x 10^6 b/s. Total task memory bandwidth is then 600 x 10^6 b/s. The cycle time is

    t(cycle) <= 32 bits / ((600 x 10^6 b/s) / 8 memories) ~ 400 x 10^-9 seconds.

A 400 ns cycle is within the capabilities of current memory devices. The task memory, in summary, is an array of eight memories, each with 32 words of 32 bits. The cycle time of each is 400 ns. The memory is connected to the unit processors by a crossbar switch.

4.5.2 Program Memory

Section 4.2 mentioned that a GPSS program is a series of calls for execution of the routines that correspond to blocks. The routines are compiled one time and remain valid until the language definition is changed. Since this executable code does not change, it can be stored in a read only memory.

Consider the size of the memory required to store the program for all 44 block types. From Table 3.1, the number of Fortran statements per block averages near 50. Assume a factor of four is needed to convert Fortran statements to machine instructions. Then 200 machine instructions per block, and 8800 instructions altogether, are required.

The instruction format for unit processors will now be mentioned. The instruction set is small and will be fixed at 16, so four bits are needed for the operation code. Unit processors communicate with the task memory, a 256 word memory requiring eight bits to address. A single address instruction format therefore uses 12 bits. Assume four additional bits can be used, as for indirect addressing, giving a total of 16 bits per instruction.

A long instruction is defined as one suitable to drive all the unit processors within a task processor. It is 128 bits for the eight unit processor design, or 256 bits for 16 unit processors. With eight task processors, each capable of executing any task, the demand for instructions is great. Assuming a 500 ns average instruction execution time, this memory must be capable of delivering a long instruction at intervals of 62 ns. Such capability fails to meet the objective of inexpensive parts. Alternatives are examined.

The program memory size is 8800 x 16 = 140,800 bits. Organized as 64 bit words it is 2200 words. If instructions are distributed from a read only main program memory to local program memories at each task processor, the local memories must be capable of holding the two longest block programs. For the blocks converted to Fortran, this would be near 200 Fortran instructions, or 800 machine instructions. The local memories, which must be read/write, would have a bandwidth requirement composed of program loading and instruction reading. These components are

    B.W.(load) = (200 instructions/program x 16 bits/instruction)
                 / (10 x 10^-6 seconds/program) = 320 x 10^6 b/s

    B.W.(read) = (128 bits/instruction) / (0.5 x 10^-6 seconds/instruction)
               = 256 x 10^6 b/s.

The combined bandwidth of 576 x 10^6 b/s is the same as that required for the smaller task memories. Since the program can be stored in a read only memory, it becomes attractive to consider duplicating the program at each task processor.
The tradeoff is eight read/write memories capable of storing the two longest blocks against seven additional full program read only memories (at roughly half the bandwidth). The read only memory scheme will be chosen.

With a read only memory at each task processor the bandwidth is the B.W.(read) calculated above, 256 x 10^6 b/s. That is the figure for eight unit processors. If the memory is organized such that one cycle is capable of supplying one long instruction, the cycle time is the instruction execution time, which averages near 500 ns. An example of such an organization, for the eight unit processor case, would be two memories with 64 bit words.

The program memory is thus 2200 words of 64 bits. It is read only with a 500 ns cycle time. The memory is duplicated at each of the eight task processors.

4.6 Machine Design Summary and Performance Estimates

Throughout this chapter estimates have been made on the hardware involved in the components. The estimates are tabulated and totaled in Table 4.3. Estimates of program speedup and concurrency in processing have been given in various sections. These estimates will be summarized and discussed here.

The speedup in executing an individual GPSS block was determined in section 3.3.2.1. Many block types were converted to Fortran and analyzed.

    Component                    Option       Gates and     Memory
                                              Flip Flops    (words x bits/word)
    Unit Processor (U.P.)                       1,000       -
    Decision Processor           six levels       500       64 x 12
    Task Processor Control Unit                 3,500       -
    Task Data Memory and Switch  8 U.P.         2,000       256 x 32
                                 16 U.P.        4,000       256 x 32
    Task Program Memory                             -       2200 x 64 (read only)
    Task Processor Totals        8 U.P.        14,000
                                 16 U.P.       24,000
    Coordination Unit                          10,000       1024 x 42 (associative),
                                                            plus delay and process
                                                            queues (128 x 22, 16 x 20)
    Main Data Memory and Switch  16 modules    24,000       64K x 64
    Total Machine                8 U.P.       146,000       65K x 64;
    (8 Task Processors)          16 U.P.      226,000       17K x 64 (read only);
                                                            1024 x 42 (associative)

    Table 4.3. Total Machine Hardware Summary

Results are given in Tables 3.1 and 3.2. From Table 3.2 it can be seen that execution speedup ranged from one to eight, and that the most frequently occurring speedup factor was four. It is reasonable to expect a speedup of four by multiprocessing individual blocks. The unit processors of section 4.3.1 provide this processing.

Further speedup can be achieved by simultaneous processing of more than one block. The number of blocks that can be processed simultaneously, related to one transaction, is covered in sections 3.3.2.2 and 3.3.3. Table 3.5 gives the results of analyzing several GPSS programs and shows, in column (5), that an average of two blocks can be processed simultaneously. This gives a speedup factor of two for each transaction, which is multiplicative with the factor of four for each block.

Further speedup can be achieved by concurrently moving more than one transaction. This was studied in sections 3.3.2.3 and 3.3.3. The number of concurrent transactions was measured on a simulation system described in the Appendix. The results, given in Table 4.1, vary with the program being simulated. The number, in the range of two to five, is the speedup factor due to multiple transaction processing. The configuration of multiple task processors is needed for the speedup due to processing more than one block per transaction and moving more than one transaction.

There is yet another area which contributes to the speedup of this machine over a serial machine. This area is the selection of a transaction to move and a block to execute.
In a serial machine the selection is through the software overall scan algorithm shown in Figures 3.2, 3.3, and 3.4. In the machine designed here, the selection is carried out by the hardware coordination unit of section 4.4. The software algorithm is used following the execution of each block. It is a fairly complicated algorithm and is estimated to take as much execution time as the actual processing of the block it selects. The hardware algorithm operates in parallel with the processing of blocks. This should realize another factor of two speedup over any serial execution machine.

Combining the speedup factors gives the total expected execution speedup of the organization presented here over a serial organization. The result covers a range due to the range of the transaction concurrency factor, f_tc. Excluding f_tc, the speedup is by a factor of 16: four within each block, two for concurrent blocks, and two for hardware task selection. The total speedup is 16 * f_tc, and with f_tc between two and five it therefore ranges from 32 to 80. This improvement in execution speed was achieved with the use of currently available, moderately priced hardware.

5. CONCLUSION

This thesis has resulted in the design of a machine which yields a significant improvement in execution time for discrete time simulation languages similar to GPSS. A multiprocessor organization, using currently available logic and memory elements, is employed. The machine includes a device to assist in the evaluation of decision trees. Specific results are presented at several places in the preceding chapters. The locations of the results are given for reference in the following paragraphs.

Chapter 2 was concerned with a "decision processor" for finding paths through decision trees. This device, designed for use in a general multiprocessor environment, is shown in block diagram form in Figure 2.8. Logic circuit counts are given in Table 2.2 and delays are given in Table 2.3. As an example of the values, a processor capable of evaluating trees with up to 255 nodes requires fewer than 2000 gates plus a memory on the order of 64 words by 12 bits. The same processor can evaluate up to eight tree levels in approximately one clock cycle. An example of the processor operation is given in section 2.5.1.

Simulation was first mentioned in Chapter 3. GPSS is discussed and analyzed for execution concurrency. Results of analyzing Fortran versions of 21 GPSS block types on the system described in [10] are given in Tables 3.1 and 3.2. One conclusion taken from those tables is that a factor of four speedup is possible within the blocks. Concurrency between blocks is shown to exist, providing potential for additional speedup.

In Chapter 4 a machine was designed which exploits the concurrency
It implements Algorithm 3.1 for partitioning, and the system variable table look-up, to do the code assignment. It also generates the statistics presented in Table 3-5« Another function of this program has not been previously mentioned. It produces punched card output to modify the original test GPSS program such that a run time "trace" of the original program can be gathered. The trace is explained in the next section, which covers the program that gathers the trace data. A. 2 Trace Data Extraction: XTRAC The goal of the complete test system is to simulate the execution of some real GPSS programs on the proposed machine. To simulate the execution it is necessary to know certain things about the actual execution. It was noted in Chapter 3 that the execution sequence for a GPSS program cannot be . determined from the source code. It can only be determined by tracing the execution. Any trace should identify the block being executed and the 138 Statistics on partitions . Table 3-5 GPSS to Fortran conversion. Section 3*3 .2.1 Fortran block type analysis. Section 3-3-2.1 Block execution time tables. Section A. k H Test GPSS Program GPSS Scanner Program (PL/I) Sections 3-3-3, A.l, A. 2, and A.k Test program block types and processing code. Section A.k 1 Modified GPSS test program. XTPAC routine added. (360 Assem. Lang.) Section A. 2 in- coordination Unit Simulator (GPSS) Section A.k Data insertion routine INSRT. (360 Assembler Language) Section A. 3 1 GPSS/36O System New machine performance Chapter k- Actual execution trace data. Section A. 2 Figure A.l. Simulation Test System 139 transaction which called for the execution. A third piece of information, the simulated time at which the block is "being executed, is needed for this test system. An assembly language routine was written to gather this data and output it on punched cards. The routine is called by the special GPSS HELP block. The HELP block allows a user to write his own routines to supplement the fixed set provided by GPSS. It was needed here to access the trans- action number, a variable not available to the programmer as a standard numerical attribute, and to provide the punched output. Trace data is not needed at every block in a program. It is needed at those blocks which are the first in a partition of simultaneously executable blocks, or at labeled blocks which may be the first in a run time partition due to a transfer to that block. The scanner program of the previous section generates a card deck containing all the appropriate HELP blocks which are then merged into the original GPSS program. The modified program thus formed is logically equivalent to the original. Specifically, when a partition ends, the next block must be the first block in the next partition. The scanner program can simply punch a HELP block card following every block for which P, the partition code bit, is zero. For labeled blocks the action is slightly more involved. For any labeled block it is necessary to give that label to the HELP block, then punch a card for the original block without the label. The original labeled block is replaced by a labeled HELP and an unlabeled copy of itself. If the labeled block is not the first in a partition, any transaction which moves into it from the block directly above does not need the trace data. In this lUO case a TRANSFER block is used to let such a transaction branch around the HELP. Thus three cards are punched. 
Note that the addition of these HELP blocks changes the block numbering of the original program when a trace is being taken. For this reason the GPSS internal block number variable cannot be used for the next block data in the trace. The format of the HELP block allows the use of parameters which can be read by the routine. Thus the scanner program, which keeps a block counter, can provide the original block number as a parameter.

The card deck generated by this program contains the essential information on the actual execution of a test program. This deck is input data for the machine simulator. The operation of inserting it into the simulator is done by the program of the next section.

A.3 Trace Data Insertion: INSRT

GPSS is weak in the areas of input and output. It was necessary to write an assembly language routine to load the trace data described in the previous section. The routine is called by a HELP block, as was the data extraction routine.

Savevalue locations in the simulator are reserved for the trace data. A table of 300 half words is used for transaction number data. A table of 300 full words, with the same savevalue index, is used for simulated time data. Within the simulator the same number can be used as the index for a half word savevalue to get a transaction number and as a full word savevalue index to get the corresponding simulated time. The trace information is completed by the next block number data. This is stored in a 300 half word table with the index offset by 300 from the transaction number table.

The routine to insert this data can initially load 300 trace records. Trace data is time ordered. Its use is biased towards, but not limited to, the time ordering. Once a record is read it is no longer needed. When data not currently in the tables is needed, the insertion routine will compact the existing unused data and load new records until the tables are full or an end of the data file is reached.

At this point the gathering of data needed for simulating the execution of real GPSS programs, and a routine for loading the data into the simulator, have been described. The remaining step is the simulation.

A.4 Coordination Unit Simulation

This is a simulator, written in GPSS, of a machine for processing the execution phase of GPSS. The purpose of the simulator is twofold. First, the number of transactions which can be processed concurrently is desired. Multiple transaction concurrency offers parallelism of a type that does not exist in the procedural languages. It cannot be measured by examining the source program. Simulation of the machine of Chapter 4 executing real GPSS programs, using the trace data from actual executions, does provide a measure. The results of this first purpose of the simulator were used in deciding on a reasonable number of task processors. The results are given in Table 4.1. The second purpose is to determine the capability of the coordination unit to select tasks and keep the task processors busy. These results are also given in Table 4.1.

Programming emphasis is on the simulation of the coordination unit. It is assumed that the memory system of the machine does not cause delays which must be simulated. Task processors are simulated only to the extent of requesting a task from the coordination unit, accepting one, and advancing time by the execution time for the task. The task execution time for a block type is the average over all traces of the parallel machine execution time.
Analysis of block types for parallelism, covered in section 3-3 .2.1, is the source of the time figures. A conversion of 500 nanoseconds per time step, T , was used. This allows an average of two or three clock pulses per operation with a 5MHz clock. Information on the test program whose execution is being simulated is unique to each program and must be supplied to the simulator. Static information consisting of a representation of the program and the processing code is given as a GPSS function. The dynamic trace information is read into the model by the INSRT routine of the previous section. The function is in the simulator listing with the label TYPCD. There is a function point for each block in the test program. The point gives the block type and processing code for the block. Block types are numbered from one to hk in alphabetic order. The function definition and follower cards are another part of the scanner program output. A listing of the simulator follows. 1*3 // EXEC DUMMY //D01 DO DSN=£X,SPACE=(TRK, (20,5,2) ),OISP=I, PASS) ,UNIT=DISK // EXEC LKEDASM,PARM=«LIST, MAP, REUS* //LKED.SYSLMOD DO DSN=tX< INSRT ) ,DI SP* ( CLD, PASS ) //LKED.SYSIN DO * < OBJECT DECK FOR SUBRCUTINE INSPT > // EXEC PGM=DAG01,PARM=«B* //DINTERO DD UNITED! SK, SPACE=(CYL ,( 1 , 1 ) ) //DINTWORK DD UNI T=D I SK , SP A CE= ( CYL , ( 1 , 1 ) ) //DOUTPUT CD SYSOLT=A //OREPTGEN DD UN IT=DI SK , SP ACE=(CYL , ( I ♦ 1 ) ) //DSYMTAB DD UN I T = Dl SK, SPACE= ( CYL , ( I , I ) ) //STEPLIB DD DSN=£X,DISP=(CLC,PASS) // OD DSN=SYS1 .GPSSL IB,DISP=SHR //SYSPRINT DD SYSOUT=A //DINPUT1 DD CCNAME=SYSIN //SYSIN DD * REALLOCATE XAC , 20 J , F SV ,6C0 ,HSV ,8C0 , BLO, 30 SIMULATE * * THIS GPSS PROGRAM IS A SIMULATION OF THE CONTROL UNIT OF A MACHINE * DESIGNED TO PUN SIMULATION PROGRAMS USING CONCURRENT PROCESSING * TECHNIQUES. * * THF BASIC UNIT OF TIME IN THIS SIMULATION IS 100 NANOSECONDS. * TRANSACTION PARAMETER USACE * EXECUTION TIKE IN TASK PROCESSORS. SERIAL NUMBER OF THE XACT BEING SIMULATED. NUMBER OF NE XT BLOCK IN THE PROGRAM FOR THIS TASK. VALUE CCMES FRCM TRACE TABLE FOR MASTFR XACTS. VALUE CCMFS FROM INCREMENTING MASTER XACT VALUE FOR SPLIT XACTS. TRANSACTION SIMULATED TIME AT THF START OF THE CONCUR- RENCY GROUP. VALUE TAKEN FROM TRACE TABLES. COUNTER OF TASKS IN A CONCURRENCY GROUP. USED TO DETERMINE WHEN ALL TASKS IN THE GROUP HAVE BEEN PROCESSED. POINTER TO FULLWORO SAVEVALUE IN TIMST WHICH HAS THF TIME, SIMULATFD, FOR THE XACT WITH SERIAL NUMBER P6-4CC. P6=P2+400. POINTER TO HALFWOPD SAVEVALUE IN NOFPE WHICH HAS THE NUMBER C c PROCESSORS FOR BLOCK NR P7-44. P7 = Pll«-44. INCICATES TO OUTPUT UNIT THE SOURCE OF THE TASK. P8 =0 FCR TASKS FROM DELAY QUEUE. Pfl =1 FCR TASKS FROM PROCESS QUEUE. VALUE OF FUNCTION TYPCD IN FORMAT DBBSP. CONTAINS BITS CBB OF P9. NUMERICAL DESIGNATION OF BLOCK TYPE FOR THIS TASK. BITS RB OF pg POINTER TO HALFWORC SAVEVALUE GIVING EXECUTION TIME FOP THIS BLOCK TYPE. BLOCK PROCESSING COCE. BITS SP OF P9. POINTER TO FLLLWORD SAVEVALUE IN TIMST WITH MIN VALUF. POINTER TO MLFWORD SAVEVALUE WITH SERIAL NUMBER AND TO FULLWORC SAVEVALUE WITH SIMULATED TIME IN THE DYNAMIC TRACE TABLES. POINTER TO HJLFWORD SAVEVALUE WITH NEXT BLOCK NUMBFP IN THE TRACE TABLES. P 15 = P14«-30C . 
* PI * P2 * P3 * * * • P^ * * P5 * * * P6 * * * P7 * * P8 * * * P<9 * OK * Pll * * * « P12 * pn * P14 * * * P15 # Ikk HALFWORD SAVEVALUE TABLES AND DATA TABLE OF EXECUTION TIMES FOR GPSS BLOCKS IN THE PARALLEL MACHINE EXECUTION TIMES ARE TAKEN FRCM THE ANALYSIS OF BLOCKS FOR PARALLELISM CN THE SYSTEM REPCRTED IN CHAPTER 3. THE CONVERSION FRCM TIP) TO TIME IS: ONE STEP IN TIP) »500 NANOSECONDS. EXTIM EOU 1(44), H INIT XH1,60 ADVANCE BLOCK TYPICAL EXECUTION TIME TMT XH4,65 ASSIGN INIT XH8.80 DEPART INIT XH9,175 ENTER INIT XH14.U0 GENERATE INIT XH16.50 INDEX INIT XH18,150 LEAVE INIT XH19,90 LINK INIT XH20.6P LOGIC INIT XH22.40 MARK INIT XH24,9C MSAVEVALUE INIT XH27.45 PRIORITY INIT XH28.1C0 QUEUE INIT XH29.70 RELEASE INIT XH32,6C SAVEVALUE INIT XH3A.70 SEIZE INIT XH36,180 SPLIT INIT XH38,100 TERMINATE INIT XH39.5D TEST INIT XH*1,60 TRANSFER INIT XH42,200 UNLINK INIT XH$DCOD,lCO DECCOE RCUTINE * * AN ASSEMBLY LANGUAGE FCUTINE, INSRT, IS CALLED BY THE HELP BLOCK. * THIS ROUTINE LOADS CATA WHICH REPRESENTS THE EXECUTION TRACE OF * TRANSACTIONS IN TEST GPSS PROGRAMS. SIMULATED TRANSACTION SERIAL * NUMBERS ARE PUT IN HW SAVEVALUE LOCATIONS 100 TQ 299. THE NEXT * BLOCK THAT THE TRANSACTION WILL EXECLTF IS LOADED INTO HW * SAVEVALUE LOCATIONS 3C0 TO 599. CO«R ESPONDI NG TRANSACTION AND * BLOCK DATA ARE THUS OFFSET 300 LOCATIONS. SIMULATED TIME FOR * EACH PFCORD IS LOADED INTC FW SAVEVALUE LOCATIONS 1*0 TO 299. * THE TRANSACTION NUMBER ANC TIME DATA HAVE THE SAME INDICES. * * WHEN A TRACE RFCORD IS USED THE TRANSACTION NUMBER FIELD IS * SET TC ZERO. WHEN MOPE TRACE RECORDS ARE NEEDED, INSRT WILL * COMPACT THE TRACE DATA AND LOAD ADDITIONAL RECORDS UNTIL THE * END OF THE TRACE DATA FILE OCCURS. * POINTER TO LAST VALID ENTRY OF DYNAMIC TRACE DATA. LAST EOU 98, H INIT XH$LAST,9<5 * COUNT OF EMPTY LOCATICNS IN TRACE CATA TABLES LDCNT EOU 99, H INIT XHSLDCNT,2C0 * TEST PROGRAM TRACE CATA. DYNAMIC DATA LOADED BY 'HELP INSRT*. XACNR EOU 100(300), F XACT SERIAL NUMBER NXT8K EOU 400(300),F NEXT BLOCK FOR XACT IN XH(*-30O) INIT XH$TSKPR,4 ASSUMES TOTAL OF u PROCESSORS * * THE FOLLOWING TWO SAVFVALLES DEFINE THE NUMBER OF TRANSACTIONS BEING * SIMULATED. THEY ARE PARTICULAR TO THE PROGRAM BEING TESTED. INIT XH$SIMX,56 NOF XACTS IN RAIL FLEET II INIT XH$DMSIM,55 NR OF ACTIVE XACTS MINUS ONE * FULLWORD SAVEVALUE TABLES * TABLE SXCTR HAS ONE ENTRY PER TRANSACTION. THE ENTRY IS A * COUNTER OF TASKS BEING PROCESSED FOR THE TRANSACTION. THE * NUMBER OF TRANSACTIONS BEING PROCESSED CONCURRENTLY IS THE * NUMBER OF ENTRIES WITf A VALUE GREATER THAN OR EQUAL TO ONE. * SINCE THIS TABLE BEGUS AT SAVEVALUE 1, THE TRANSACTION SERIAL * NUMBER, P2* CAN BE USED AS THE INDEX. SXCTR EOU 1<99J,X USED TO TABULATE SIMULTANEOUS XACTS * * TEST PROGRAM TRACE DATA. SIMULATED TIME OF XACT WITH SERIAL NR * GIVEN IN THE HALFWORD SAVEVALUE WITH THE SAME INDEX SIMTM EQU 100(300), > * SIMULATED TIME OF XACLS BEING SIMULATED TIMST EOU 4Cl(iOOI,X X(4C0*J) HAS SIM TIME FOR XACT(J) INIT X401-X500.2147483647 INIT X$MINTM,1 SIMULATED TIME INITIAL VALUE INIT XSXMAX, 2147483647 * * LOGIC SWITCH USAGF PARTN EOU l(10OI,L SWITCH! J) FOR XACT(J) ENABLES ASSEMBLY INIT LSI-LS100 * * FUNCTIONS: * * TYPCD IS PROCUCED AUTCMAT IC ALLY BY THE GPSS ANALYZER PROGRAM. IT * IS A LIST OF THE BLOCKS IN THE PROGRAM BEING ANALYZED, THE * PROCESSING CODE, AND AN INDICATOR OF BLOCKS WHICH USE FN, V, RV, * OR * IN THE OPERAND F IELC AND THUS REQUIRE DECODING. 
* BLOCKS ARE LISTED WITF NUMBERS CORRESPONDING TO ALPHABETIC ORDER * BEGINNING WITH 1 FOR ADVANCE AND ENDING WITH 44 FOR WRITE. * THE FORMAT FOR THIS PACKEC CATA IS CBBSP WHERE: * D IS 1 FOR BLOCKS WHICH REQUIRE OPERAND DECODING; OTHERWISE, * BR IS THE NUMERICAL BLOCK TYPE, * S IS FOR BLOCKS VHICH LSE SYSTEM VARIABLES AND MUST RE * DELAYED; 1 OTHERWI SE , AND * P IS FOR THE LAST BLOCK IN A PARTITION; 1 OTHERWISE. * TYPCO FUNCTION P3,LC49 RAIL FLEET PROGRAM 1 01411 2 T0410 3 03211 4 03601 5 HC411 6 00111 7 f;391^ fl 03810 9 00411 10 04110 11 0C411 12 10400 13 1391? 14 10410 15 10400 16 03910 17 03611 18 1O101 19 13901 20 13910 21 C2810 22 C0900 23 OC 8 1 1 24 01610 25 00110 26 101 1 1 27 13901 28 13911 29 139n 30 ^100 31 01811 32 03810 33 03810 34 13201 35 04110 36 C0110 37 00111 38 C4U0 39 OCUC *C 00111 41 04110 42 01411 43 00410 44 001 11 45 13900 46 13211 47 13903 48 rr 411 49 041 K * * INTER GIVES THE CUMULATIVE PROBABILITY THAT X INTERROGATIONS * ARE REQUIRED IN THE ASSOCIATIVE MEMORY TO DETERMINE THE ENTRY * WITH MIMMUM TIME. INTER FUNC RN4,D16 .01,1/. 02, 2/. 04, 3/. 06 ,4/. 09, 5/. 13,6/. 19, 7/. 27, 8 .3 8, 9/. 52.1C/. 64, 11/. 74, I 2/. 92, 13/. 89, 14/. 95, 15/1. 00,16 * FNDCD FUNC RN2.D2 FUNCTION EVALUATION REQUIRED .5,1/1.0,2 * * THE FIRST PART OF TFE PROGRAM LOADS THE FIRST 300 RECORDS OF TRACF 146 * TABLE DATA AND INITIALIZES CTHER VARIABLES. GENE tttl GENERATE 1 XACT TO LOAD TRACES HELP INSRT LOAD DYNAMIC TRACE DATA QUEUE REQST,XH$TSKPR LET ALL PROCESSORS BE AVAILABLE SAVEVAL LSTMN,V1 TRACE PER XACT START SELECT E 14 , 100 , XH ILAST , P 16 , XH, HOLDS ASSI 4,X*14 PA IS GIVEN THE XACT SIMULATED TIME ASSI 2,XH*14 P2 IS GIVEN THE XACT SERIAL NR SAVEVAL *14,K0,H RESET THE XACT NR SAVEVAL. DATA USED. ASSI 6, VI P6 POINTS TO X{J) IN TIMST 1 VARI P2+4C0 TIMST SAVEVALUES BEGIN AT XAC1 SAVEVAL P6.P4 X(JJ GETS SIM TIME FOR XACT(J-AOC) ASSI 15, V? P15 IS PCINTER TO NEXT BLOCK XH 7 VARI P14+300 NEXT BLOCK XH OFFSET 300 FROM XACT NR ASSI 3,XH*15 P3 IS GIVEN THE XACT NEXT BLOCK * * ASSOCIATIVE MEMORY: SELECT THE XACT WITH MINIMUM SIMULATED TIME AS * THE BEST CANDICATE FCR PROCESSING AMEM ENTER AM EM ENTER AM GATE LR RDWRT ALLOW LOAD IF AM NOT BUSY LOGIC S RDWRT SET SWITCH TO INDICATE BUSY ADVA 2 TIME TO LOAD AM LOGIC R RDWRT RESET BUSY SWITCH JOIN AVAIL, P2 XACT IS AVAILABLE FOR SELFCTION LOGIC S CTL SWITCH SET WHEN XACTS ARE ON AMEM LINK AMEM,P4 ORDERED CHAIN TO SIMULATE AM * * THE TRANSACTION GENERATED NEXT SIMULATES INTERROGATION OF THE * ASSOCIATIVE MEMORY. GENE ,,,1,,1 XACT TO INTERROGATE AM INT GATE LR RDWRT ALLOW INTERROGATION IF AM NOT BUSY LOGIC S RDWRT SET SWITCH TO INDICATE BUSY ADVA 1,FN$INTEP INTERROGATION TIME LOGIC R RDWRT RESET BLSY SWITCH UNLINK AMEM, AAA, Kl,, , XXX SELECT MIN TIME XACT TEST E XH$XCTR,KC WiAf T FOR UNLINKED XACTS TO BE MOVED TRANS ,INT RESUME INTERROGATION XXX LGGIC R CTL SWITCH TO CONTROL INTERROGATION GATE LS CTL WAIT FOR XACTS TO GET ON CHAIN TRANS ,INT AAA UNLINK AMEM, QTES T , ALL ,4 UNLINK ALL XACTS WITH SAME SIM TIME OTEST SAVEVAL XCTR*,Kl,H CCUNT UNLINKED XACTS ADVA 1 RESOLVE EACH RESPONDER EXAM DEL0,P2,RCUTE IS XACT CN DELAY QUEUE? TFST LE P4,X$MINT*,NCTMN YES. IS IT MIN TIME? REMCVE AVAIL, ,P2 YES. 
XACT HAS BEFN SELECTED SAVEVAL XCTR-.K1.H CECREMENT CTR AS XACTS ARE HANDLED LEAVE AMEM TRANS ,LVDLY XACT IS LEAVING DELAY QUEUE NOTMN SAVEVAL XCTR-,Kl,h CECREMENT CTR AS XACTS ARE HANDLED LINK TEMPL,LIFC XACT CANNOT BE SELECTED RELNK LINK AMEM,LIFO * * CODE EVALUATION UNIT: ROUTES TASKS TO PROPER QUEUE AND SWITCHES Ik7 TRANSACTIONS AT PARTITION BOUNDARIES ROUTE PAC 3 4 5 6 BDPAC PAS BCPAS DAC POAC DAS PDAS E LR SAVEVAL SEIZE REMOVE LEAVE TEST E LOGIC S ASSI ADVA ASSI VARI ASSI VARI ASSI VARI ASSI VARI ASSI TEST GATE SPLIT ASSI TRANS TEST E GATE LR RELE LOGIC B TRANS TEST E GATE LP LOGIC S SPLIT ASSI TRANS TEST E GATE LR LOGIC R RELF QUEUE JOIN ASSI TRANS DELAY QUEUE DLQA MERGE LVDLY ASSI QUEUE ADVA JCIN ASS I TEST GATE TRANS LINK REMOVE ADVA UNLINK E SE XCTR-,K1, h CODE AVAIL, tP2 AMEM P4,X$MINT*tPAC BYPAS 9,FN$TYPCC 1 10, V3 P9/100 lit V4 picaioc 12, V5 P92100 7.V6 Pll+44 5*,K1 P12,K11,PAS,1 DELAY, PCAC Kl,PROA 3+,Kl .PAC P12,KlO,CAC.l DELAY, PDAS CODE BYPAS ,PPQB PI 2, KOI, CAS BYPAS, RCPJC DELAY Kl ,DLQA 3+.K1 ,PAC P12,K0r,EPRCD BYPAS, BCPAS DELAY CODE DELAY D6LQ,P2 B,KO , AMEM OECREMENT CTR AS XACTS ARE HANDLED GET CODE EVALUATION UNIT XACT HAS BEEN SELECTED IS THIS A MIN TIME XACT? YES. ENABLE DELAY QUEUE BYPASS P9 GETS BLOCK TYPE AND CODE EVALUATE EACH PROCESSING COOE P10 >99 MEANS BLOCK OPERANDS NEED DECODED, INCREASING EXECUTION TIME Pll GETS BLOCK TYPE. NUMBERS 1 THRU 44 = BLOCKS ADVANCE THRU WRITE P12 GETS THE 2 BIT BLOCK CODE P7 POINTS TO THE XH GIVING THE NR OF PROCESSORS BLOCK Pll USES INCR COUNT OF TASKS IN CONCURRENCY GRP DOES CODE SAY PROCESS Q AND CONTINUE? YES. IS DELAY SWITCH RESET? YES. SEND A TASK TO THE PROCESS Q. INCR BLOCK POINTER TRY THE SAME XACT ON THE NFXT RLOCK DOES CODE SAY PROCESS Q AND SWITCH? YES. IS DELAY SWITCH RESET? YES. RELEASE COOE TO SWITCH XACTS RESET DELAY QUEUE BYPASS SENC THE MASTER TASK TO THE PROCESS Q DOES CODE SAY HELAY Q AND CONTINUE? YES. SHOULD THIS XACT BYPASS DELAY 0? NO. SET THE DELAY SWITCH SEND THE TASK TO THE DELAY Q INCR BLCCK POINTER TRY SAME XACT ON THE NEXT BLOCK COES CODE SAY DELAY Q AND SWITCH? YES. SHOULD THIS XACT BYPASS DELAY NO. RESET THE DELAY SWITCH PREPARE TO SWITCH XACTS Q? GROUP DELAYED XACTS P8 =0 INDICATES DELAY QUEUE IS DELAYED XACTS REMAIN IN THE AM SOURCE WHEN THE SIMLLATFD TIME OF A XACT IN THIS QUEUE IS EQUAL TO THE MIN TIME OF ALL XACTS, ALL TASKS IN THE PARTITION ARE ELIGIBLE TO LEAVE THE QUEUE. TASKS FROM THIS QUEUE HAVE HIGHER PRIORITY FOR USE OF THE PROCESSORS. 5,K0 RESET PARTITION TASK COUNTER. DELAY 2 TIME TO LOAD THE QUEUE DELQ.P2 GROUP DELAYFD XACTS 8, KG PB =0 INDICATES DFLAY Q IS SOURCE GSAVAILtKO, MERGE, 1 IS AVAILABLE GROUP EMPTY? PRPUL, MERGE, ,1 YES. ARE PROCESSORS IDLE? ,TASK YES. PROCESS THIS XACT DELAY, P4 MERGE BY TIME. DELQ,,P2 XACT IS LEAVING THE DELAY 2 REMOVE TASKS FROM DELAY QUEUE DELAY, LNKCT, ALL ,2 UNLINK 4LL TASKS FOR THIS XACT 11+8 LNKCT LINK OUTPT , P8 . T ASK t PROCESS QUEUE: TASKS ARE AVAILABLE FOR PROCESSING UPON REQUEST * . „ n RESET CCNCURRENCY GROUP TASK COUNTER PROA ASSI 5.KO PROB QUEUE PROCS tihe TQ LQA0 THE QUEUE * DVA I „. Pa xi WHEN PROCESS Q IS THE SOURCE I"I OUTPT. P8.1ASK PR IS PRIORITY FOR OUTPUT UNIT ! 0UTPUT unit: JhiS 6 UNU s SElects the^ext task^sent - v THE ppiQRiTY * * n n CD , C T DELAY THE DELAY Q IS OUTPUTTING DLYQ DEPART UtLAT I2?« OUTPT PLACE TASK IN OUTPUT UNIT TASK SEIZE OUTHl ^^^^^ task THR0UGH output UNIT TEST E P8.K1.DLYC DID THE TASK COME PROM THE PROCESS 0? °rlTl SSfMT.KO DOe's A TASK REQUEST EXIST? 
DE0 UNUNK O^TPT.TAsJ.Kl "MOVE NEXT TASK FROM Q * ,.r^ tack nccc A PROCESSOR FOR THE TIME DETERMINED : ™sk processors: ejc^tas^uses^process^^^ * nDO ,„ the task uses processors from a pool VZ11U* OUTPT YES. MOVE FROM OUTPUT TO PROCESSORS RELEASE °^;^; RFMCVE THE TASK REQUEST DEPART REOST COUNT TASKS IN PROCESS FOR THIS XACT SAVEVAL P2^,K1 COUN ^ ^ ^^ ex ASSI 111 HI MRTIM WILL JUST THE BLOCK EXECUTE TIME DO? TES T L PC.K99, MRTIM WILL ju EXECUTION TIME TIME ACVA PI TABULATE TIMEX IN pR0CESS F0R THIS XACT $ t\ll PRPUL PROCESSES BECOME AVAILABLE AGAIN of.cnp REQST REQUEST A TASK ?-f E ^.MASTR.* IS THIS THE MASTER XACT?^ ^^ GATE LS P2 JU. ASCEMRLE TASKS , N CONCURRENCY GP ASSEM ASSEM j>5 ^^ gate fqr nexT ITERATION TABULATE PART ^ rniAO TC TW r pi nCK TYPE OTHER THAN TERMINATE TEST NE PU.K38.TERMB IS HE LOCK TYP^ ^ ^ Qp ASSI 5,KC ,„„ , Ltl ACT P? XH.HOLDC G«=T NEXT DATA FOR XACT CONT SELFCT F 1*. K WO . XF$l AST . P2.XH.H OLDC G ^ ^ ^ ASSI A ' X * 1 i c THE XACT NP SAVEVAL. DATA USED. SA VEVAL JiJ;;;-" H ccUNT EMPTY TRACE TABLE LOCATIONS SAVEVAL EMPTY*. Kl.H ^LUN ^ CURRENT C , M T1MFS SAVEVAL P6.P4 UKU p0INTER TQ NFXT RLOCK XH ASSI 1 vu*,c P3 IS GIVEN THE XACT NEXT BLOCK »i» L StV.KFO .LC.K « TFE*E «» TH«» T «"»»V SAVEV u «;£;;*» ,. Lt ^^".".crs fko- te„p chain UNLINK TERMB ?«!% «H t O,s r .KC.TFM «| THERE OJH« JJCTS. ^ 7 -- E SSSSRS^S ; ? f jc« -muho^ ;;;« -a s s :;ix: l l s^s-"" * ^e^'^/of ^ve *.* - ts li+9 SELECT MIN 13 ,401 , XH*L STMN , , X P13 WILL POINT TO MIN TIME SAVEV TEST G X*13,X$MINTM,TRM HAS MIN TIME CHANGED? SAVEVAL MINTM,X*12 YES. UPCATE IT UNLINK TEMPL,RELNK,ALL,,,TRM UNLINK DELAY XACTS FROM TEMPL LOGIC S CTL TRM TERM 1 MASTR LOGIC S P2 OPEN GATE FOR TASKS IN CONCURRRENCY GP TRANS ,ASSEM MRTIM ASSI U,XH$DC0C,3 THE BLOCK USES FN, *, V, OR 8V TRANS .TIME LDATA SAVEVAL EMPTY, K0,H RESET EMPTIES COUNTER TFST G V7,K0,GTMIN ARE XACTS WAITING FOR TRACE DATA? HELP INSRT YES. LOAD DATA ASSI 1,125 IDENTIFY OCCURRENCE OF INSRT PRINT ,,MOV,X IDENTIFY OCCURRENCE OF INSRT TEST NF XH$L0CNT,K499,GT>'IN WAS INSERT GOOD UNLINK HOLDS, START, ALL YES. ATTEMPT TO START HOLDS XACTS UNLINK HOLDCCONT.ALL ATTEMPT TO CONTINUE HOLDC XAC^S TRANS ,GTMIN HOLDC TEST NE XH$LDCNT , K999, TERPB IS THERE MORE TRACE DATA? LINK HOLOCFIFC YES. HOLD XACT FOR NEXT INSRT CALL HOLDS TEST Nfc XHtLDCNT , K999, DLETF IS THERE MORE TRACE DATA? LINK HOLDS, FIFC YES. HOLD XACT FOR NEXT INSRT CALL UNHLD HELP INSRT LOAD TRACE DATA ASSI 1,152 IDENTIFY OCCURRENCE OF INSRT PRINT ,,MOV,X IDENTIFY OCCURRENCE OF INSRT TEST NE XH$LDCNT,K499,ALL WAS INSERT SUCCESSFUL? UNLINK HOLDS, START, ALL YES. ATTEMPT TO START HOLDS XACTS UNLINK HOLCC,CONT,ALL ATTEMPT TO CONTINUE HOLDC XACTS ACVA 1 ALLCW UNLINKING TEST G XHSDMSIM, V7,ALL WAS THF INSRT OK? TRANS ,TERMB YES. ALL TEP M XHtSIMX STOP THE SIMULATION. TRACE DATA FAILURE CLETE SAVEVAL DMSIM-,K1,H DECREMENT COUNT OF ACTIVE SIM XACTS TERM 1 ERRCD ASSI 12, KO BLOCK CCDE HAD ERROR. SET CODE TO 00 TRANS ,POAS RESUME bITH WORST CASE COOE GENE 40,,,, ,2 GATHER STATISTICS AT 40 TIME UNIT INT COUNT GF 2,1,XH$SINX,1,X CCUNT NR OF SIMULTANEOUS XACTS TABULATE NRSXl TABULATE PRLSE TERM PRUSE TABLE S $PR PUL ,0 , 1 , 120 NRSXl TABLE P2,C,1,50 NUMBER OF SIMULTANEOUS XACTS IN PRS PART TABLE P5,l,l,20 RUN TIME PARTITION LENGTH TIMEX TABLE PI, 40, 10,53 EXECUTION TIME PER BLOCK PRPUL STORAGE 4 START 56 NUMBER OF SIMULATED XACTS END //INDATA CC * < TRACE DATA READ EY ROUTINE INSRT > 150 LIST OF REFERENCES [l] ACM Profession Development Seminar, "Simulation of Discrete Systems," ACM, pp. 123-135- [2] C. 
Cartegini, "Scanner for the Analysis of Parallelism in Fortran Programs and IF-Tree Detection," (M.S. Thesis) University of Illinois at Urbana-Champaign, Department of Computer Science; 1971.

[3] Control Data Corporation, "The STAR Computing System." A technical proposal to the Atomic Energy Commission; December 1966.

[4] O. Dahl and K. Nygaard, "SIMULA: An ALGOL-Based Simulation Language," Communications of the ACM, p. 671; September 1966.

[5] L. C. Fulmer and W. C. Meilander, "A Modular Plated Wire Associative Processor," Proceedings of the IEEE Computer Group Conference, pp. 325-335; June 1970.

[6] International Business Machines Corporation, "Capital Investment Studies Using GPSS: Bulk Material Movement Problems," First Edition, p. 39; 1968.

[7] International Business Machines Corporation, "General Purpose Simulation System/360 User's Manual," Fourth Edition; January 1970.

[8] P. J. Kiviat, R. Villanueva, and H. M. Markowitz, The SIMSCRIPT II Programming Language, Prentice-Hall, Inc.; 1968.

[9] D. J. Kuck, "ILLIAC IV Software and Application Programming," IEEE Transactions on Computers, Vol. C-17, No. 8, pp. 758-770; August 1968.

[10] D. J. Kuck, Y. Muraoka, and S. C. Chen, "On the Number of Operations Simultaneously Executable in FORTRAN-Like Programs and Their Resulting Speed-Up," to be published in IEEE Transactions on Computers.

[11] S. E. McAulay, "Job Stream Simulation Using a Channel Multiprogramming Feature," Fourth Conference on Applications of Simulation, ACM, pp. 190-194; 1970.

[12] T. B. Pinkerton, "Program Behavior and Control in Virtual Storage Computer Systems," (Ph.D. Thesis) The University of Michigan, CONCOMP Technical Report 4; April 1968.

[13] Paul F. Roth, "The BOSS Simulator - An Introduction," Fourth Conference on Applications of Simulation, ACM, pp. 244-250; 1970.

[14] R. A. Schwarz and T. J. Schriber, "Application of GPSS/360 to Job Shop Scheduling," Digest of the Second Conference on Applications of Simulation, ACM, pp. 237-248; 1968.

[15] D. L. Slotnick, et al., "The ILLIAC IV Computer," IEEE Transactions on Computers, Vol. C-17, No. 8, pp. 746-757; August 1968.

[16] D. G. Weamer, "QUICKSIM - A Block Structured Simulation Language Written in SIMSCRIPT," Third Conference on Applications of Simulation, ACM, pp. 1-11; 1969.

VITA

Edward Willmore Davis, Jr. was born in Akron, Ohio, in 1941. He graduated from The University of Akron in 1964 with a Bachelor of Science in Electrical Engineering degree and earned the Master of Science in Engineering degree there in 1967.

From 1964 to 1968 he was employed in the Computer Engineering Department of Goodyear Aerospace Corporation, Akron, Ohio. In 1968 he entered the University of Illinois Department of Computer Science. He was a research assistant with the Illiac IV Project from 1968 to 1970 and with the Center for Advanced Computation in 1970 and 1971. In 1971 he joined a group studying computer organization and software, where he did research on concurrent processing systems.