Report No. UIUCDCS-R-72-527

A MULTIPROCESSOR FOR SIMULATION APPLICATIONS*

by

Edward Willmore Davis, Jr.

June 1972

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

*This work was supported in part by the National Science Foundation under Grant No. US NSF GJ 2746 and was submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, June 1972.

A MULTIPROCESSOR FOR SIMULATION APPLICATIONS

Edward Willmore Davis, Jr., Ph.D.
Department of Computer Science
University of Illinois at Urbana-Champaign, 1972

Multiprocessor systems have generally been designed for applications with arrays of data which can be operated on in parallel. In this thesis an application area which does not contain such readily identifiable parallelism is examined. Discrete time simulation is found to contain several distinct levels at which potential for concurrent execution exists. The levels are used to guide the organization of a multiprocessor designed for simulation applications. Both software and hardware aspects of the problem are covered. Features of the system include a special processor used to evaluate conditional jump trees; clusters of simple, fixed point arithmetic processors; a unit to form and dispatch tasks to the processors; and a memory system which includes a read only program memory.

ACKNOWLEDGMENT

The author wishes to express sincere gratitude to his advisor, Professor David J. Kuck. The many hours of discussion, suggestions, advice, and encouragement were fundamental to the development and completion of this dissertation.

The team working on the Fortran program analysis system deserves thanks. In particular, R. Towle did an excellent job maintaining the system and revising it for the needs of the author, and R. Strebendt provided instruction on system use and interpretation of the results.

The dedicated effort of Mrs. Vivian Alsip in typing the final manuscript, as well as much of the draft copy, is gratefully acknowledged. The typing done by Mrs. Diana Mercer also deserves credit. Fred Hancock, of the Center for Advanced Computation, voluntarily drew several figures when the Department of Computer Science drafting group became overloaded.

The support of the research by the Department of Computer Science and the Center for Advanced Computation is greatly appreciated.

Finally, the author wishes to thank his wife, Jo Ann, for her patience and help in this effort.

TABLE OF CONTENTS
                                                                    Page
1. INTRODUCTION ..... 1
   1.1 Overview ..... 1
   1.2 Thesis Organization ..... 2

2. CONCURRENT PROCESSING OF CONDITIONAL JUMP STATEMENTS ..... 4
   2.1 Introduction ..... 4
   2.2 Decision Trees in Programs and Processors ..... 5
   2.3 Software Aspects of IF Trees and Decision Trees ..... 11
       2.3.1 Overall Processing Scheme ..... 12
       2.3.2 Arguments of IF Statements ..... 12
             2.3.2.1 Relational Expressions ..... 12
             2.3.2.2 Logical Expressions ..... 14
       2.3.3 Assignment Statement Movement ..... 15
   2.4 Decision Tree Processor Hardware ..... 19
       2.4.1 Input Node Register ..... 21
       2.4.2 Tree Decoder ..... 25
       2.4.3 Result Memory ..... 27
       2.4.4 Path Register ..... 30
       2.4.5 Sector Decoder ..... 31
       2.4.6 Gate Counts and Timing ..... 31
   2.5 Processor Operation and Performance ..... 34
       2.5.1 An Example of IF Tree Processing ..... 34
       2.5.2 Mapping: Folding and Multiple Input Nodes ..... 41
             2.5.2.1 Folding ..... 41
             2.5.2.2 Multiple Input Nodes ..... 44
       2.5.3 Processor Efficiency ..... 45
       2.5.4 Performance Tradeoffs Between Iterations and Cycles ..... 48

3. SIMULATION OF DISCRETE TIME SYSTEMS ..... 54
   3.1 Languages ..... 54
   3.2 A Description of GPSS ..... 55
   3.3 GPSS Execution ..... 60
       3.3.1 Serial Execution: GPSS/360 ..... 60
       3.3.2 Concurrent Execution ..... 67
             3.3.2.1 Parallelism Within a Block ..... 67
             3.3.2.2 Concurrency Between Blocks ..... 76
             3.3.2.3 Concurrent Transaction Movement ..... 85
       3.3.3 Processing Code Assignment ..... 91

4. MACHINE DESIGN FOR CONCURRENT EXECUTION OF DISCRETE TIME SIMULATIONS ..... 97
   4.1 Introduction ..... 97
   4.2 Compilation Observations ..... 99
   4.3 Task Processors ..... 101
       4.3.1 Unit Processors ..... 101
       4.3.2 Decision Processor Use ..... 103
       4.3.3 Task Processor Configuration ..... 107
       4.3.4 Hardware Design Considerations ..... 111
   4.4 Coordination Unit ..... 112
       4.4.1 Transaction Selection ..... 115
       4.4.2 Processing Code Evaluation ..... 119
       4.4.3 Task Queues ..... 119
             4.4.3.1 Process Queue ..... 119
             4.4.3.2 Delay Queue ..... 120
       4.4.4 Task Output ..... 121
       4.4.5 Coordination Unit Hardware ..... 121
   4.5 Memories ..... 125
       4.5.1 Data Memory ..... 125
             4.5.1.1 Main Memory ..... 126
             4.5.1.2 Task Memory ..... 127
       4.5.2 Program Memory ..... 129
   4.6 Machine Design Summary and Performance Estimates ..... 131

5. CONCLUSION ..... 135

APPENDIX ..... 137
   A.1 GPSS Scanner ..... 137
   A.2 Trace Data Extraction: XTRAC ..... 137
   A.3 Trace Data Insertion: INSRT ..... 140
   A.4 Coordination Unit Simulation ..... 141

LIST OF REFERENCES ..... 150

VITA ..... 152

LIST OF TABLES
                                                                    Page
2.1 Relational Expression Conversion ..... 13
2.2 Decision Processor Logic Counts ..... 32
2.3 Decision Processor Logic Delays ..... 33
2.4 Nodes Evaluated on Succeeding Cycles for Linear Decision Trees ..... 49
2.5 Cycles Required, per Iteration, to Evaluate a Linear Decision Tree ..... 51
3.1 Fortran Block Analysis ..... 74
3.2 Block Routine Speedup Factors ..... 77
3.3 Precedence Partitioning Guide ..... 81
3.4 Discrete Events Comparison of Figure 3.10 ..... 88
3.5 Program Partitions and Processing Codes ..... 95
4.1 Simulation Results: Concurrent Transactions ..... 109
4.2 Processing Code Interpretation ..... 114
4.3 Total Machine Hardware Summary ..... 132

LIST OF FIGURES
                                                                    Page
2.1 A Decision Tree ..... 6
2.2 Decision Tree Labels ..... 7
2.3 Processor Tree ..... 9
2.4 Node Classification ..... 10
2.5 Use of a Free Node ..... 10
2.6 An IF Tree ..... 16
2.7 The Decision Tree Derived from Figure 2.6 ..... 18
2.8 Decision Tree Processor ..... 20
2.9 Logic Design for a Three Level Decision Tree Processor ..... 22
2.10 Logical Expression Reduction ..... 26
2.11 Cascading Reduction Modules ..... 26
2.12 Tree with Decoder Equation Labels ..... 28
2.13 IF Tree Corresponding to Example Program Segment ..... 36
2.14 Decision Tree Derived from Figure 2.13 ..... 38
2.15 Example Decision Tree Mapped into Processor Structure ..... 39
2.16 Result Memory Contents for Example IF Tree ..... 40
2.17 Transmit Node Pairings ..... 46
2.18 Remapping for Better Efficiency ..... 47
2.19 Variations in Decision Tree Evaluation Time ..... 53
3.1 Example GPSS Program ..... 57
3.2 Overall GPSS/360 Scan: Update Clock to Next Most Imminent Event ..... 62
3.3 Overall GPSS/360 Scan: Scan of Current Events Chain (Start of Scan) ..... 63
3.4 Overall GPSS/360 Scan: Try to Move Individual Transaction into Some Next Block ..... 64
3.5 Serial Execution Trace of Example Program ..... 66
3.6 QUEUE Block; Fortran Version ..... 70
3.7 QUEUE Block Flow Chart ..... 71
3.8 Precedence Partition Algorithm ..... 80
3.9 Precedence Partitions of Program in Figure 3.1 ..... 84
3.10 Example of Time Overlap on Independent Events ..... 87
4.1 Machine Organization ..... 100
4.2 Connections Between Decision and Unit Processors ..... 106
4.3 Coordination Unit ..... 116
4.4 Word Format of Transaction Status Memory ..... 117
A.1 Simulation Test System ..... 138

1. INTRODUCTION

1.1 Overview

Speeding up the execution of programs by means of compile time algorithms and machine organization is the general topic of this thesis. The approach taken is to study an application area, then design a machine to take advantage of characteristics of programs for the application, parallelism in the problem, and processing requirements. This study leads to a multiprocessor configuration with a hierarchy of processors and memories.

Discrete time simulation is the application area selected. Simulation languages, particularly the General Purpose Simulation System [7], GPSS, are examined.

The purpose of this study is to achieve speedup through machine organization with the use of available logic and memory devices. The speedup does not come from faster execution of individual instructions. Instead, the approach taken is to have more than one instruction in execution simultaneously. GPSS is studied to detect parallelism and provide guidelines for the design of a multiprocessor machine.

Simulation programming does not include operations on arrays of data, the feature most commonly associated with machines having more than, say, two processors. An early observation on simulation program characteristics was that conditional jump statements occur with great frequency. To significantly speed up the execution it is necessary to do better than serial processing of conditional jumps. In this thesis algorithms and a special hardware unit are designed for processing trees of conditional jumps.

The thesis is experimental in nature. It includes the analysis, for execution on a multiprocessor, of Fortran programs with approximately 1000 statements. Several GPSS programs are individually analyzed for prospects of concurrency in execution. A simulation system is used to test the performance of the proposed machine organization in the execution of GPSS programs.

1.2 Thesis Organization

This thesis is organized as three chapters which present the details of the problems and solutions, a final chapter which summarizes results, and an appendix which describes a software test system for generating and verifying some of the results.

Chapter 2 is concerned with the problem of speeding up the execution of programs with many conditional jumps. Software algorithms to increase the execution concurrency are presented. The algorithms modify the original program, without changing the logic, such that better use of the multiprocessor system is made. A hardware unit for evaluating conditional jump statement trees is presented.
This "decision processor" operates in conjunction with the arithmetic processors to select a path through trees of many levels.

Chapter 3 discusses discrete time simulation languages and examines GPSS in some detail. Parallelism in GPSS and the potential for concurrent execution are the major topics.

A multiprocessor machine organization for languages like GPSS is designed in Chapter 4. The machine consists of several clusters of processors, a memory system matched to the processor requirements, and a unit to coordinate the processor clusters in their execution of a program. Processors within a cluster have a common control unit; however, the instruction stream to each processor can differ. Clusters operate independently from each other and are individually capable of executing a complete program.

Conclusions are presented in Chapter 5. The Appendix is concerned with a system for simulating the machine organization of Chapter 4.

2. CONCURRENT PROCESSING OF CONDITIONAL JUMP STATEMENTS

2.1 Introduction

The purpose of a multiprocessor machine organization is to speed up program execution. Speedup is achieved by using the parallelism or concurrency that can be extracted from a program to keep more than one processor busy. For machines such as ILLIAC IV [9,15], STAR [3], or the array associative processor STARAN [5], the parallelism is largely due to operations on arrays of data. Other machine organizations or parallelism extraction schemes use tree height reduction, DO loop and recurrence relation expansion, back substitution, and independent blocks of assignment statements to get a more general form of parallelism [10]. These schemes may add redundant operations or increase the number of useful operations, but they do reduce the number of steps required for execution.

All of the above techniques work on sections of code that exist between the statements that control the flow of program execution. When a conditional jump (an IF statement) occurs there is a reversion to serial execution. For programs with very few conditional jumps this is not a serious weakness. For programs or parts of programs with a high ratio of IF to assignment statements, this serial execution can significantly degrade the efficiency of a multiprocessor.

In this chapter algorithms and hardware for speeding up the execution of programs with many IF statements are examined. The hardware is a special purpose processor designed into the multiprocessor system of Chapter 4. Terms related to the algorithms and hardware are defined in section 2.2. Compile time preparation of programs to use the processor is introduced in 2.3. Section 2.4 covers the processor and 2.5 discusses its operation.

2.2 Decision Trees in Programs and Processors

Decision statements in programs are those that determine the next instruction to be executed from two or more possible choices. These statements typically begin with the conjunction "IF". The IF statements considered here are the logical type where a boolean variable is the basis of choice between two next instructions. Other types of IFs can be converted to the two way jump form.

When at least one of the instructions selected by an IF is also an IF, a tree of IF statements called a decision tree is formed. A single IF in the tree is a node. Establish the convention that the branch taken when the boolean variable is true is pictured as leaving the node to the right.
Then the decision tree corresponding to the code:

    IF (A) THEN IF (B) THEN W; ELSE X;
           ELSE IF (C) THEN Y; ELSE Z;

can be drawn as in Figure 2.1. An exit from the tree occurs when a statement other than an IF is next. In Figure 2.1, W, X, Y, and Z represent exits. The directed line segments between nodes are branches. Any sequence of branches followed to reach an exit is a path through the tree. The path taken on a given execution of the tree identifies the exit and is the result for that execution. A single input node is a node with only one branch into it. The discussion and examples in this chapter, with the exception of section 2.5.2.2, assume all nodes are single input nodes.

Figure 2.1. A Decision Tree

Elements of the decision tree are labeled according to the rules below. Figure 2.2 shows the naming scheme.

1. The root node is λ.
2. The name of a branch directed out of a node is the name of the node concatenated with 1 or 0 according to whether the branch leaves the node to the right or left. For concatenation, λ is a null element.
3. Nodes other than λ are given the name of the branch directed into them.
4. Paths and exits are identified by the name of the branch which is the exit.

Figure 2.2. Decision Tree Labels

The nodes on a given level of the tree are the set of nodes which have the same number of bits in their name. Levels are numbered sequentially beginning with one at the root, such that at level i there are 2^(i-1) possible nodes. Let ℓ be the number of levels in the tree. A tree is full if all of the 2^ℓ - 1 possible nodes are present. There are 2^ℓ exits from a full tree. In an informal way "length" will refer to the number of levels and "shape" will refer to the number and distribution of nodes in a tree.

Programs in general have assignment statements interspersed with decision statements. Those parts of a program where the ratio of decision statements to assignment statement operations is larger than an experimentally determined threshold are called IF trees. When the assignment statements are removed from an IF tree a decision tree is formed. An algorithm is presented in section 2.3.3 for movement of assignment statements out of an IF tree such that correct execution of the program is not disturbed. The algorithm allows the formation of larger decision trees than exist naturally in a program.

Now consider a processor to evaluate decision trees in a parallel way. The length and shape of a programmed tree can vary, limited only by the syntax of the language being used. The length and shape of processing equipment is fixed by hardware design. The fixed hardware must be capable of processing trees of any length or shape.

Let k be the number of levels in the decision processor. The processor is designed for a full tree, so 2^k - 1 nodes can be evaluated. Longer full trees require repeated use of the processor. Hardware nodes are numbered from left to right within levels and from level one to level k. These numbers are the decimal equivalent of decision tree binary node numbers with a leading one attached. The tree that descends from each node is a sector which is given the name of the sector root node. Sector 1 includes the complete processor tree. All other sectors represent sub-trees. Control is provided in the processor to select a particular sector for evaluation, providing isolation and independence from other sectors. A labeled two level processor tree is shown in Figure 2.3.
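The numbering scheme can be made concrete with a short sketch. The following Python fragment is an illustration only; the function names are ours, not the thesis's. It converts a binary node name to its decimal hardware number and enumerates the hardware nodes of a sector.

    def hardware_number(name):
        # Decimal equivalent of the binary node name with a leading
        # one attached; the root (the null name) becomes node 1.
        return int("1" + name, 2)

    def sector_nodes(root, k):
        # All hardware node numbers in the sector whose root is `root`,
        # in a k level processor tree (nodes are numbered 1 to 2**k - 1).
        nodes, frontier = [], [root]
        while frontier:
            i = frontier.pop()
            if i <= 2 ** k - 1:
                nodes.append(i)
                frontier += [2 * i, 2 * i + 1]
        return sorted(nodes)

    assert hardware_number("") == 1         # the root node
    assert hardware_number("01") == 5       # left branch, then right branch
    assert sector_nodes(3, 3) == [3, 6, 7]  # sector 3 of a three level tree

Note that with the leading-one convention the two successors of hardware node i are simply 2i and 2i+1, which is what makes the fixed wiring of the processor tree regular.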
The mapping of j < k levels of decision tree nodes into the k levels of processor tree nodes is a one-one mapping. It is not in general "onto" since the decision tree shape may differ from the fixed processor tree. A processor node which corresponds to a decision tree node is a decision node. All processor exits are from level k. Any decision tree exit from a node which does not map onto level k must "use" a node at each succeeding level down to k. Nodes that are used to transmit the output of a decision node to an exit, but which do not take part in the decision making process, are transmit nodes. Transmit nodes are assigned a logical value of zero.

Figure 2.3. Processor Tree

There is a third class of nodes in the processor tree. A free node is one which is neither a decision nor a transmit node. In drawings to follow, distinct symbols represent decision, transmit, and free nodes.

As an example of several of the above definitions consider the three level trees in Figure 2.4. The exits from the free node will never be on a selected path since the free node is on the "1" output branch of a transmit node, which always has the output "0". Suppose the decision tree had an additional node on level four. That node can be mapped into the free node if the tree processing is controlled as follows. Evaluate sector one, the complete three level processor tree. If the exit is the branch that leads to the level four node, evaluate sector five to get the final exit. The situation is illustrated in Figure 2.5.

Figure 2.4. Node Classification

Figure 2.5. Use of a Free Node

The mapping of a sub-tree of a decision tree node into a higher level sector in the processor tree is called folding. The process of evaluating a sector is a cycle. If folding is used to fill free nodes, evaluation may take as many cycles as there are folds, plus one for sector one.

If not all decision tree nodes can be mapped into the processor, the tree is evaluated by repeated use of the processor. Each use related to a given tree is an iteration. An iteration may involve many cycles.

At this point much of the terminology has been explained and the execution scheme introduced. Now consider the software required for this processor.

2.3 Software Aspects of IF Trees and Decision Trees

Locating IF trees in programs is based on an algorithm described in [2]. The algorithm forms a trace of all paths through a program. Detection of an IF statement on a path activates a counter of assignment statement operations. As long as the operation count is below a specified threshold, another IF statement on the path is classified as being in a tree with the predecessor. The counter is reset and operation counting begins again. When the operation count exceeds the threshold, an exit from the IF tree has been found. The occurrence of input or output statements or subroutine calls also marks an exit from the tree. Examination of several programs has shown that a threshold value near 10 gives trees which can be processed reasonably well using the techniques of this chapter.

Assuming an IF tree has been located, the preparation for processing that must be done at compile time is described in this section.
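The locating scan just described can be sketched in a few lines of Python. The statement representation below (tuples tagged "IF", "ASSIGN", or "BARRIER") is hypothetical; only the thresholding rule follows the text.

    THRESHOLD = 10   # the experimentally determined value cited above

    def locate_if_trees(path, threshold=THRESHOLD):
        # `path` is one trace path through the program: a list of
        # ("IF", 0), ("ASSIGN", op_count), or ("BARRIER", 0) tuples,
        # the last standing for I/O statements and subroutine calls.
        trees, current, ops = [], [], 0
        for kind, op_count in path:
            if kind == "IF":
                current.append(kind)   # joins the tree of its predecessor
                ops = 0                # reset; operation counting restarts
            elif kind == "ASSIGN":
                ops += op_count
                if ops > threshold and current:
                    trees.append(current)    # an exit from the IF tree
                    current, ops = [], 0
            else:                      # a barrier also marks an exit
                if current:
                    trees.append(current)
                current, ops = [], 0
        if current:
            trees.append(current)
        return trees

    print(locate_if_trees([("IF", 0), ("ASSIGN", 3), ("IF", 0),
                           ("ASSIGN", 12), ("IF", 0)]))
    # [['IF', 'IF'], ['IF']]

In the small example, the 12-operation assignment exceeds the threshold and separates the first two IFs from the third.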
2.3.1 Overall Processing Scheme

Three actions are involved in the compile time preparation of IF trees for concurrent evaluation. They are mentioned here, then described in more detail in following sections. First, each relational expression which is part of the argument of an IF statement is converted to an assignment statement the left hand side (lhs) of which is a logical variable. The logical variable replaces the relational expression in the argument. The second item is the movement of assignment statements to a position ahead of the remaining decision tree. Third is the mapping of nodes from the decision tree to the processor.

Four steps are required for execution. In the first step assignment statements are evaluated in parallel. The second step is determination of the boolean value of logical arguments. Third, evaluation of the tree in the processor. Fourth, selection of assignments from step one that were on the execution path.

2.3.2 Arguments of IF Statements

An IF statement argument may contain a boolean variable, a relational expression (rel ex), or a logical expression (log ex). For input to the processor tree the argument must be a boolean value. This section discusses the handling of arguments to arrive at the boolean value.

2.3.2.1 Relational Expressions

Relational expressions can be converted to assignment statements (assign sts) which yield the correct boolean value upon examination of the sign or magnitude of the result. As an example, X > Y can be converted to "B = X-Y" where the sign of B is a boolean variable indicating the result. Algorithm 2.1 converts a rel ex to an assign st. The algorithm is given for a machine where the smallest quantity that can be added is e. For an integer machine e = 1. Each assign st formed creates a new logical variable in the program which must be given a unique name. The names generated by the algorithm are the concatenation of a name which does not otherwise exist in the program and the state of a counter.

Let S(X) be a boolean variable representing the sign of X with "+" = 1 and "-" = 0. The sign of X = 0 is "+". Let M(X) be a boolean variable representing the magnitude of X. If the magnitude is zero, M = 0. If the magnitude is non-zero, M = 1. An overbar represents inversion.

Algorithm 2.1: Relational Expression Conversion

0. On the first use of this algorithm on a program, select an unused variable name, Y, and set I = 1. On all uses subsequent to the first, enter at step 1.

1. Use Table 2.1 to change the relational expression given in column one to the corresponding assignment statement in column two. Variable X is Y concatenated with I.

       Relational    Assignment    Boolean
       Expression    Statement     Variable

       L <  R        X = R-L-e     S(X)
       L ≤  R        X = R-L       S(X)
       L =  R        X = L-R       M̄(X)
       L ≠  R        X = L-R       M(X)
       L ≥  R        X = L-R       S(X)
       L >  R        X = L-R-e     S(X)

       Table 2.1. Relational Expression Conversion

2. Replace the relational expression in the argument with the boolean variable from the corresponding column three entry.

3. Increment I.
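Algorithm 2.1 is direct enough to render as a small Python sketch. The string templates and the symbolic "not M" / "S" attributes are our representation; the conversion rules follow Table 2.1 (the = and ≥ rows of which are reconstructed from the sign and magnitude conventions).

    EPS = 1   # the smallest addable quantity e; 1 on an integer machine

    CONVERSION = {
        # relation: (assignment template, boolean attribute of result)
        "<":  ("{R}-{L}-EPS", "S"),
        "<=": ("{R}-{L}",     "S"),
        "=":  ("{L}-{R}",     "not M"),   # true when magnitude is zero
        "/=": ("{L}-{R}",     "M"),
        ">=": ("{L}-{R}",     "S"),
        ">":  ("{L}-{R}-EPS", "S"),
    }

    counter = 0
    def convert(L, rel, R, name="Y"):
        # One application of Algorithm 2.1: returns the generated
        # assignment statement and the boolean variable that replaces
        # the relational expression in the IF argument.
        global counter
        counter += 1
        X = f"{name}{counter}"
        template, attribute = CONVERSION[rel]
        return f"{X} = " + template.format(L=L, R=R), f"{attribute}({X})"

    print(convert("X", ">", "Y"))   # ('Y1 = X-Y-EPS', 'S(Y1)')

The subtraction of e in the strict cases is what turns "greater than" into a pure sign test: L > R holds exactly when L-R-e is non-negative on a machine whose smallest increment is e.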
2.3.2.2 Logical Expressions

Decision trees fan out from a root node to more than one possible next statement. Logical expressions as arguments of IFs do the inverse. Given a log ex of at least two variables, a fan-in tree can be formed to give the boolean value. A decision processor designed for the fan-out case is not well suited for evaluating log exs. This section considers means of treating log exs.

Assume the decision processor has no capability to evaluate log exs. The logical IF can be rewritten equivalently as more than one IF where each new argument is a boolean variable. The logical operator connectives are achieved by the way IFs are connected in the program. The basic AND, OR connectives are shown by examples below. Larger expressions are managed by repeated application of these connections.

(a) AND operator
    Given:    IF (A·B) THEN T; ELSE F;
    Rewrite:  IF A THEN IF B THEN T; ELSE F;
              ELSE F;

(b) OR operator
    Given:    IF (A ∨ B) THEN T; ELSE F;
    Rewrite:  IF A THEN T;
              ELSE IF B THEN T; ELSE F;

A recent study of a large number of Fortran programs uncovered very few logical IFs with more than one operator in a log ex [10]. The study did reveal reasonably frequent use of the one operator argument. It therefore seems practical to provide in the decision processor the capability to accept two operand logical expressions. This will be discussed further in section 2.4.1 on input to the processor.

2.3.3 Assignment Statement Movement

Decision trees of more than a few levels will not appear often in programs. Rather, the more general IF tree, the mixture of IF and assignment statements, will be present. This section gives an algorithm for moving assign sts out of an IF tree, leaving a decision tree with possibly more levels and certainly more nodes.

Assignment statements are tagged before moving to identify their position in the IF tree. After movement the statements can be analyzed to determine parallelism and executed in parallel. Statements which may not be on the result path are executed concurrently with those that are, since the path is unknown. Thus the results of the block of assign sts are considered temporary pending determination of the result path.

A typical IF tree is shown in Figure 2.6. Here b and f represent boolean and arithmetic functions.

Figure 2.6. An IF Tree

In Figure 2.6, removal of the three assign sts from the IF tree implies a means of distinguishing the two assignments to A and of knowing which of the three were on the result path. The boolean functions of A must be evaluated with the proper value for A. To accomplish this a descriptor is attached to each lhs variable. If such a variable is later read, in the same path through the tree, the same descriptor is attached to that occurrence of the variable also.

The intuitive concept of a predecessor is used in Algorithm 2.2. Some properties of this relation as used here are given. The relation applies to both nodes and branches. "Immediate" is used to mean closest or most direct. The immediate node predecessor of nodes a0 and a1 is node a. A branch is the immediate branch predecessor of the node with the same name. A predecessor of node or branch a is a predecessor of aj, j a binary number.
Algorithm 2.2: Assignment Statement Movement

1. Scan the IF tree from level one to level ℓ, applying steps 2, 3, and 4.

2. Attach as a descriptor to the lhs variable of each assignment statement the name of the branch in which the statement occurs. Move the statement to a position above the IF tree.

3. Examine the logical argument of each node, and the rhs of each assignment statement, for variables which have been given descriptors in a predecessor branch. Attach the corresponding descriptor to every such variable. If the variable has been given multiple predecessor descriptors, use the most immediate one. Since each higher level contributes one bit to the length of a descriptor assigned at a level, the most recent assignment of a value to a variable is represented by the longest descriptor.

4. Form assignment statements from relational expressions in nodes according to Algorithm 2.1. Move these statements to a position above the IF tree.

At this point node arguments consist of boolean variables or logical expressions and the IF tree has been cleared of assign sts. The IF tree has been converted to a block of assign sts followed by a decision tree. Figure 2.6 can be drawn as in Figure 2.7 after application of Algorithm 2.2.

Figure 2.7. The Decision Tree Derived from Figure 2.6

All assign sts in the block are candidates for execution in parallel. The decision tree can be evaluated in parallel in the decision processor. Section 2.4, which follows, describes the decision processor hardware.
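Before turning to the hardware, the descriptor rule of step 3 can be made concrete. In the toy rendering below, the data structures are ours, and the branch names are invented (Figure 2.6 itself is not reproducible); only the longest-prefix rule follows the algorithm.

    def make_renamer(assignments):
        # `assignments` maps a branch name to the variable assigned in
        # that branch, e.g. {"0": "A", "11": "A"} for a tree with two
        # assignments to A.
        def rename(var, branch):
            # The most immediate predecessor descriptor is the longest
            # branch name that is a proper prefix of `branch`.
            preds = [b for b, v in assignments.items()
                     if v == var and branch.startswith(b) and b != branch]
            return f"{var}[{max(preds, key=len)}]" if preds else var
        return rename

    rename = make_renamer({"0": "A", "11": "A"})
    print(rename("A", "01"))    # A[0]   : the assignment made in branch 0
    print(rename("A", "110"))   # A[11]  : the longer descriptor wins
    print(rename("A", "10"))    # A      : no assignment on this path

Because branch names grow by one bit per level, the longest prefix is automatically the most recently assigned value on the path, which is exactly the property step 3 relies on.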
2.4 Decision Tree Processor Hardware

Evaluation of the decision trees defined in the previous sections is carried out on a decision processor. This special purpose processor is designed to operate in conjunction with a multiple arithmetic processor configuration. The function of the decision processor is to accept boolean values corresponding to decision tree nodes and to return information related to the path through the tree. Figure 2.8 is a block diagram of the hardware.

The processor has a tree structure with a capacity of 2^k - 1 nodes. That is, k is the number of tree levels in the processor. An input register is used to receive boolean node values from the arithmetic processors. The register has a bit per node of decision processor capacity and a fixed structure relating bits to positions of nodes in trees. A tree decoder identifies the path through the tree. The result of decoding points to the address of the next statement to be executed. Potential next statement addresses are stored in a small memory which is loaded for each use of the decision processor. For a k level processor tree there are 2^k possible exits. In reality the number of exits programmed is often less than a fourth of that number.

When a programmed tree has more than k levels it is necessary to use more than one evaluation cycle to determine the final result. A register is provided to save the path results on each cycle. When processing of a tree is completed the path register identifies the total path taken and likewise provides a means of identifying the assignment statements which were originally on the path but were moved ahead of the tree.

Figure 2.8. Decision Tree Processor

The design and construction of these components is explained more fully in following sections. Optional designs are given for several components. Detailed logic for a three level processor is shown in Figure 2.9. Tables 2.2 and 2.3 contain summary gate counts and timing information.

Notation used in drawings and equations is explained here. Individual nodes and sectors are labeled n(i) and S(i) respectively, 1 ≤ i ≤ (2^k - 1), following the convention in section 2.2. When a label is used as a signal name it corresponds to a logical one. Let R(ℓ,b) be the signal name for branch b at level ℓ in the tree decoder, where b is the decimal equivalent of the binary branch name. For example, R(2,0) is the level 2 branch "00"; R(2,3) is branch "11" at level 2. The output of the k level decoder is R(k,b), 0 ≤ b ≤ 2^k - 1.

Figure 2.9. Logic Design for a Three Level Decision Tree Processor

2.4.1 Input Node Register

Figure 2.10. Logical Expression Reduction

Figure 2.11. Cascading Reduction Modules

2.4.2 Tree Decoder

    R(ℓ,2i)   = n̄(2^(ℓ-1)+i)·[S(2^(ℓ-1)+i) ∨ R(ℓ-1,i)]     Equation 2.1(a)
    R(ℓ,2i+1) = n(2^(ℓ-1)+i)·[S(2^(ℓ-1)+i) ∨ R(ℓ-1,i)]     Equation 2.1(b)

    for 1 ≤ ℓ ≤ k and 0 ≤ i ≤ 2^(ℓ-1) - 1, with R(0,i) = 0.

Examples of Equation 2.1:

    R(1,0) = n̄(1)·S(1)
    R(1,1) = n(1)·S(1)
    R(2,0) = n̄(2)·[S(2) ∨ R(1,0)] = n̄(2)·[S(2) ∨ n̄(1)·S(1)]
    R(2,1) = n(2)·[S(2) ∨ R(1,0)]
    R(2,2) = n̄(3)·[S(3) ∨ R(1,1)] = n̄(3)·[S(3) ∨ n(1)·S(1)]
    R(2,3) = n(3)·[S(3) ∨ R(1,1)]

Figure 2.12, picturing a labeled two level tree, will clarify the equation. It is clear, for example, that for R(2,3) to be true n(3) must be true. One of two other conditions must also be satisfied: either S(3) must be the selected sector, or n(1) must be true with S(1) selected.

Figure 2.12. Tree with Decoder Equation Labels
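Equation 2.1 is easy to check in software. The recursive sketch below assumes n maps hardware node numbers to their boolean values and S marks selected sectors; the names are ours.

    def R(level, b, n, S):
        # Branch b (0 <= b <= 2**level - 1) at `level`; R(0,i) is
        # defined as 0.  The node deciding branch b sits at hardware
        # number 2**(level-1) + b//2.
        if level == 0:
            return False
        i, bit = b // 2, b % 2
        node = 2 ** (level - 1) + i
        enabled = S.get(node, False) or R(level - 1, i, n, S)
        return (n[node] if bit else not n[node]) and enabled

    # Two level example with sector 1 selected: n(1) and n(3) true
    # select the path "11", so the only true decoder output is R(2,3).
    n = {1: True, 2: False, 3: True}
    S = {1: True}
    print([b for b in range(4) if R(2, b, n, S)])    # [3]

The S(·) term in the bracket is what gives each sector its independence: a fold can be started at an interior node simply by asserting that node's sector select, without any support from the levels above it.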
2.4.3 Result Memory

When the tree decoder has selected a result the decision processor must convert that result into the path taken through the tree and the address of the next program statement. This is accomplished by reading the result memory word corresponding to the decoder result. Since only one path leads to a given result the path data can be hard wired in each memory word. A bit per level is required.

Three possibilities exist for the next program statement. The next statement could be a node currently in the input node register, in which case the address is interpreted as a sector address selection for another cycle of decoding. For the second possibility, the next statement could be a node in the same decision tree which is not currently in the node register. In this case the node register is reloaded. Evaluation of trees which cannot totally be mapped into the processor requires this iteration. The third possibility is that an exit has been reached; that the tree has been evaluated. Distinction between the three possibilities is made by the D bit, which is zero when the address is to be interpreted as a sector for another cycle, and the E bit, which is one when an exit has been reached. For all three cases the address portion of the memory word and bits D and E are loaded from the arithmetic processors for each iteration. A memory word is detailed in Figure 2.9(b).

Memory output is stored in the result memory output register. This register can be separated into path, control, and address fields. The path field is the input to the path register. The control field accepts bits D and E. The address field is a processor output register when bit D is one; it is a sector address when bit D is zero. The example register is in Figure 2.9(c).

This memory is a large part of the total tree processor in terms of gate and flip flop counts, so an option is mentioned here to reduce the counts. For maximum flexibility and simplicity of operation the design provides for a full tree. This is sensible for most of the processor since the hardware per node is reasonable. However, the hardware may be unreasonable and unnecessary for the memory. The memory requirement is for one word per exit. The question is how many exits the vast majority of trees, processed with a k level processor, will have. For k = 8 the maximum number of exits is 256, while the majority of trees may have fewer than 64, or even 16, exits. Thus an indirect addressing scheme may be used in which a smaller memory holds words only for the exits actually programmed.

2.4.4 Path Register

2.4.5 Sector Decoder

2.4.6 Gate Counts and Timing

Table 2.2. Decision Processor Logic Counts

Table 2.3. Decision Processor Logic Delays

2.5 Processor Operation and Performance

2.5.1 An Example of IF Tree Processing

Figure 2.13. IF Tree Corresponding to Example Program Segment

Figure 2.14. Decision Tree Derived from Figure 2.13

Figure 2.15. Example Decision Tree Mapped into Processor Structure

Figure 2.16. Result Memory Contents for Example IF Tree

2.5.2 Mapping: Folding and Multiple Input Nodes

2.5.2.1 Folding

For a linear decision tree, the number of levels L_k that can be evaluated in a single iteration on a k level processor is obtained by summing the nodes examined on successive folds:

    L_k = k + (k-2) + (k-3) + (k-3) + (k-4) + (k-4) + (k-4) + (k-4) + ... + 1     Equation 2.2

Equation 2.2 is rewritten below with the terms grouped.

    L_k = k + (k-2) + 2(k-3) + 4(k-4) + 8(k-5) + ... + 2^(k-3)·(1)

        = k + Σ (i = 3 to k) 2^(i-3)·(k-i+1)

For a six level processor L_6 = 6 + 4 + 2(3) + 4(2) + 8(1) = 32. That is, a 32 level linear decision tree can be evaluated in one iteration on a six level processor.

Two bounds are now known for single iteration processing of trees on a k level processor. The maximum number of nodes is 2^k - 1. The maximum number of levels is given by Equation 2.2. Section 2.5.3 will establish a third bound: the maximum number of nodes for which processing in a single iteration can be guaranteed.
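Equation 2.2 in executable form, as a check on the grouped sum:

    def linear_levels(k):
        # Levels of a linear decision tree evaluable in one iteration
        # on a k level processor (Equation 2.2, grouped form).
        return k + sum(2 ** (i - 3) * (k - i + 1) for i in range(3, k + 1))

    print([(k, linear_levels(k)) for k in range(2, 9)])
    # [(2, 2), (3, 4), (4, 8), (5, 16), (6, 32), (7, 64), (8, 128)]

The values coincide with 2^(k-1), which anticipates the "linear tree nodes" column of Table 2.4 below.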
2.5.2.2 Multiple Input Nodes

All previous discussion has been concerned with trees in which every node had only one input. In this section some of the ways nodes can have multiple inputs are mentioned, along with suggested means for dealing with them. The suggestions are all compile time operations and have not been examined thoroughly.

For a single node within an IF tree to have multiple inputs means there is a way to reach the node other than by the branch from the node above. Consider first an input from outside the IF tree. Unless the multiple input node maps onto the root node of the processor, the program must be modified to let the processor perform properly. Recall that the sector controls initially select sector one, the whole processor tree. The compiler must compensate for this by inserting dummy IF statements in the program to build a path from the root node to the node in question. For example, assume an IF statement labeled HERE maps onto processor node five and the program includes a GO TO HERE statement. Let the compiler insert an "IF 0 THEN IF 1" preceding the GO TO to build a path to node five.

As a second instance of a multiple input node consider branch b which does not go to node b or exit b but becomes a multiple input for another node. That is, a transfer is made within the IF tree and the IF tree becomes a network rather than a tree. Connections of this type do not exist in the processor. This situation is complicated by the necessity for maintaining information on the path taken. A solution is to duplicate the sub-tree descending from the multiple input node, where needed, to produce a network free tree. Variable descriptors, as mentioned in relation to assignment statement movement, must be attached according to the shape of the tree with duplicated sub-trees.

A loop exists if a path returns to a predecessor node. The loop can be valid (non-infinite) if it includes an assignment statement which can change the decision made at some node in the loop. Duplicating sub-trees is useful only to the extent of the single iteration capability of the processor. Loops can be expanded to fill the processor. Further expansion should be dependent on the results of the iteration.
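The sub-tree duplication suggested above is simple to demonstrate. In the sketch below the nested-tuple representation and the shared sub-tree table are entirely our invention; the point is only that expanding each reference in place yields a network free tree.

    def as_tree(node, shared):
        # node is ("exit", name), ("if", arg, left, right), or
        # ("ref", key); a ("ref", key) edge points into the shared
        # sub-tree table and models a multiple input node.
        kind = node[0]
        if kind == "ref":
            return as_tree(shared[node[1]], shared)  # duplicate per use
        if kind == "if":
            _, arg, left, right = node
            return ("if", arg, as_tree(left, shared), as_tree(right, shared))
        return node                                  # an exit

    shared = {"T": ("if", "C", ("exit", "Y"), ("exit", "Z"))}
    network = ("if", "A",
               ("ref", "T"),
               ("if", "B", ("ref", "T"), ("exit", "X")))
    print(as_tree(network, shared))   # T appears twice: a proper tree

Each copy of the duplicated sub-tree sits below a different branch, so the positional descriptors of Algorithm 2.2 remain well defined, at the cost of repeating the sub-tree's nodes.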
2.5.3 Processor Efficiency

Examination of many programs has shown that decision trees tend to be sparse rather than full. This would seem to indicate many transmit nodes in the processor and inefficient use of the hardware. It can be shown, however, that a processor with n = 2^k - 1 nodes and complete sector control can process an (n+1)/2 node decision tree of any shape in one iteration. That is, regardless of tree shape, more than half of the processor nodes are available for use as decision nodes. For an ℓ level decision tree with ℓ > k, folding is obviously required.

Define processor efficiency as the percentage ratio of decision plus free nodes to the total number of nodes in the processor. Free nodes are available for decision use and thus are grouped with decision nodes.

Statement: Processor efficiency for a k level processor can always be greater than 50% for every iteration required to evaluate a decision tree which has at least k levels.

When the decision tree has fewer than k levels, processor nodes are used to transmit results to level k rather than make decisions. In this situation the processor capacity is not fully used.

Proof of the Statement: The root is by definition a decision node. The three possibilities for successor nodes of a decision node will be examined.

Case 1. Both immediate successors are decision nodes. This clearly represents 100% efficiency locally.

Case 2. One immediate successor is a transmit node, the other a decision node. Using Fact 4 from section 2.5.2.1, construct Figure 2.17. This figure represents the Case 2 successors of any decision node, a. A pairing can be made such that each transmit node has a corresponding decision or free node without involving a in the pairing. If a is the root node it is always unpaired. If a is not the root it may be paired with the other successor of its predecessor, as a1 is paired with a0.

Figure 2.17. Transmit Node Pairings

Case 3. Both immediate successors are transmit nodes. The three nodes can be remapped equivalently as shown in Figure 2.18. Remapping yields a free node which can be paired with the transmit. If the decision node was previously paired with a transmit in Case 2, let that pairing continue.

Figure 2.18. Remapping for Better Efficiency

Applying the three possibilities to every decision node results in a pairing for every transmit node, with at least the root decision node left unpaired. The efficiency is therefore greater than 50% by at least one decision node for which there is no paired transmit node. This concludes the proof.

The third bound mentioned in section 2.5.2.1 is established by the above statement. For a k level processor there are n = 2^k - 1 nodes. At least (n+1)/2 = 2^(k-1) can always be decision nodes. Thus 2^(k-1) is the maximum number of decision nodes per iteration which can be guaranteed to map into the processor.

A note on the significance of this is useful. The maximum number of decision nodes which can be guaranteed to map into the processor is essentially a minimum number of decision nodes per iteration, excluding trees which do not use the capacity of the processor. A linear decision tree is the only one for which the minimum holds. The linear decision tree is also the one for which the maximum number of levels can be processed. Now let k = 6. The minimum number of nodes is 32. The maximum number of levels is 32, from Equation 2.2 in section 2.5.2.1. Thus it would take the uncommon program segment consisting of 32 sequential IFs for the minimum decision node bound to apply.

2.5.4 Performance Tradeoffs Between Iterations and Cycles

Gate and flip flop delays for the first cycle of an iteration and for each succeeding cycle are nearly the same, from Table 2.3. In operation, however, the first cycle of an iteration requires communication with arithmetic processors whereas succeeding cycles are internal to the decision processor. The real times are thus not nearly equal, and for this section the assumption is made that the execution time for the first cycle of an iteration is M times as long as for any succeeding cycle.

Cycles are the result of folding trees into free sectors. At the lower levels of the processor the nodes per sector are few, which means that the nodes evaluated per cycle are few. At some point it becomes more time effective to discontinue folding in small tree segments and resort to a new iteration. That point is apparently where the probability of reaching an exit is less in the next M cycles within the current iteration than in the first cycle of the next iteration. If all exits are equally likely it is a simple matter to count their occurrence if mapped as the next M cycles versus one cycle in the next iteration.

Linear trees are examined to demonstrate the tradeoff. Equation 2.2 is a summation in which each term is the number of nodes mapped per fold. Table 2.4, which follows, uses that equation to determine the entries in the nodes per fold column.

    levels   processor     linear tree nodes,   nodes per fold
    k        nodes 2^k-1   single iteration     (terms of Equation 2.2)

    2        3             2                    2
    3        7             4                    3, 1
    4        15            8                    4, 2, 1, 1
    5        31            16                   5, 3, 2, 2, 1 (x4)
    6        63            32                   6, 4, 3, 3, 2 (x4), 1 (x8)
    7        127           64                   7, 5, 4, 4, 3 (x4), 2 (x8), 1 (x16)
    8        255           128                  8, 6, 5, 5, 4 (x4), 3 (x8), 2 (x16), 1 (x32)

    Table 2.4. Nodes Evaluated on Succeeding Cycles for Linear Decision Trees
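The nodes per fold column is mechanical to generate. The sketch below reconstructs it from Equation 2.2 (the column itself is only partly legible in the scanned table, so the series should be read as our reconstruction; it is consistent with the worked k = 6 numbers in the text).

    def nodes_per_fold(k):
        # The first cycle evaluates k nodes; each later fold
        # contributes one term of Equation 2.2.
        series = [k]
        if k >= 3:
            series.append(k - 2)
        for i in range(3, k):
            series += [k - i] * 2 ** (i - 2)
        return series

    print(nodes_per_fold(6))        # [6, 4, 3, 3, 2, 2, 2, 2, 1, ..., 1]
    print(sum(nodes_per_fold(6)))   # 32, the single iteration capacity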
If more than one iteration is used, the larger numbers at the beginning of the nodes per fold series are reused, eliminating the long strings of ones and twos. Table 2.5 is a compilation of the cycles required to evaluate various length linear trees using multiple iterations. The entries are derived from Table 2.4 as in the following example. The entry at k = 6 for three iterations is determined by first noting that the linear tree to be evaluated has 32 nodes. If at most three iterations are to be used, ⌈32/3⌉ = 11 nodes must be examined in at least one iteration, say the first. From Table 2.4, in one iteration the first cycle examines six nodes, the second four, and the third three, bringing the total to 13. Three cycles were required to get the total above 11, and three becomes the first number in the table entry being determined. There are 32 - 13 = 19 nodes for the remaining two iterations. One of them must examine at least 10 nodes and the other the remainder. Again from Table 2.4 it can be seen that two cycles cover 10 nodes, so two becomes the second number in the entry and nine nodes remain. Two cycles pick up the remaining nodes, so the last number in the entry is also two.

Table 2.5. Cycles Required, per Iteration, to Evaluate a Linear Decision Tree

Behind these tables is the goal of selecting the best compromise between iterations and cycles. The best operating point is a function of the multiplier, M, relating the execution times of iterations and cycles. The time required to evaluate a decision tree on the processor is the time for the first cycle of each iteration plus the time for all succeeding cycles. The first cycle includes communication with the arithmetic processors and has been defined as taking M times longer than succeeding cycles to execute.

Let I represent the number of iterations used, corresponding to columns of Table 2.5. Let C_I represent the total number of cycles used for a particular value of I. Then I is the number of first cycles and C_I - I is the number of succeeding cycles required to evaluate a decision tree. The time required is thus proportional to

    M·I + (C_I - I) = (M-1)·I + C_I.

This function is graphed in Figure 2.19 for k = 8 and M = 1, 2, 3, and 4. Execution time is in units equivalent to the time required to cycle the decision processor. The operating point to select is the one which minimizes the execution time. Thus for M = 1 it is best to use 16 iterations of one cycle each, whereas for M = 3, four iterations of six cycles each is best.

Figure 2.19. Variations in Decision Tree Evaluation Time

This section has dealt with the fringe case of linear decision trees to simplify data gathering for the tables. The principle is applicable to any shape tree. Sector control, in the form of sector decoding logic and control gates on the tree decoder, is required for every node that can accept a fold. The return, in decreased execution time, seems non-existent at the lower levels except perhaps in some situation where just a few more nodes would complete the tree. The cost, in logic, is high since there are more gates on the bottom level than in the rest of the tree. In conclusion it is recommended that sector control stop at some level above k.
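The operating point selection can be recomputed rather than read from the (largely illegible) scan of Table 2.5. In the sketch below, the even ceiling-division split of remaining nodes across iterations follows the worked k = 6 example; treating that split as the general rule is our assumption.

    def cycles_to_cover(nodes, k):
        # Cycles (and nodes actually covered) when one iteration takes
        # folds in the order of the Table 2.4 series.
        series = [k] + ([k - 2] if k >= 3 else []) \
                     + [k - i for i in range(3, k) for _ in range(2 ** (i - 2))]
        covered = cycles = 0
        for term in series:
            if covered >= nodes:
                break
            covered += term
            cycles += 1
        return cycles, covered

    def best_operating_point(k, M):
        # Minimize (M-1)*I + C_I over the number of iterations I for a
        # 2**(k-1) node linear tree.
        nodes = 2 ** (k - 1)
        best = None
        for I in range(1, nodes + 1):
            left, C = nodes, 0
            for remaining in range(I, 0, -1):
                c, covered = cycles_to_cover(-(-left // remaining), k)
                C += c
                left -= covered
            time = (M - 1) * I + C
            if best is None or time < best[0]:
                best = (time, I, C)
        return best

    print(best_operating_point(8, 1))   # (16, 16, 16): 16 one-cycle iterations
    print(best_operating_point(8, 3))   # (32, 4, 24): four iterations is best

Both printed results agree with the operating points read off Figure 2.19 in the text for M = 1 and M = 3.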
3. SIMULATION OF DISCRETE TIME SYSTEMS

This chapter examines some software questions involved in the concurrent processing of discrete time simulation programs. A simulation language is studied to determine the natural parallelism and to develop a machine design philosophy for using that parallelism.

3.1 Languages

Applications programming can be done in any general purpose language. When a particular application comes into frequent use, particular languages tend to be developed to simplify programming. This has been done for discrete time simulation. At the present there are many simulation languages, some being minor variations of others. Among the more widely known are GPSS, the General Purpose Simulation System [7], Simscript [8], and Simula [4]. GPSS is implemented as a fixed set of routines which can be thought of as blocks in a block diagram of the system to be simulated. Simscript, developed at the Rand Corporation, is a powerful language with similarities to Fortran and PL/1 but also including features useful in simulating systems that change over time. Simula is an Algol based language that includes Algol as a subset.

These three significantly different languages have characteristics in common that provide for simulation. In all cases a simulated time clock is an inherent device which affects the progress of the simulation. An event, defined as a change in the status of the simulated system, occurs at a scheduled clock time or causes the clock to increment to the event time. Let model mean the representation in a programming language of a system to be simulated. Simulation programs have temporary entities which move through the model. Temporary entities are those which are not required to exist for the duration of the simulation. The model is described by permanent entities which do exist throughout the simulation. Finally, means to control the progress and interaction of entities and simulated time is provided.

GPSS will be described in more detail since it is used extensively in the remainder of this thesis. There were several reasons for selection of GPSS. A primary reason was availability of the GPSS/360 system and documentation on its implementation and use. The block diagram structure, essentially making it a higher level language than the procedural languages, clarifies the ideas of the thesis. Also, while GPSS is considerably different from other languages it is neither unique nor unused.

Acceptance of GPSS has prompted the development of several other similar languages. One of these is BOSS, the Burroughs Operational Systems Simulator [15]. As stated in that reference, "BOSS is a block-diagram-oriented, data base driven simulator program, in the general class of GPSS...". Another is QUICKSIM, an attempt to impart a block diagram structure to Simscript [16]. As a third similar language there is the Computer System Simulator, CSS. An application of CSS, reported in [11], described it as follows: "CSS/360 was used in this study. It is a simulation program.... In concept it is similar to the General Purpose System Simulator (GPSS), differing in one aspect: it is not general, but applies specifically to computer systems."

3.2 A Description of GPSS

The purpose of this section is to present a description of GPSS sufficient for understanding the algorithms and examples that follow. For more complete information refer to the User's Manual [7].

An example of a GPSS program is given in Figure 3.1. The example, intended to show features of GPSS, is not meant to be a meaningful model. A more significant program is listed with comments in the appendix.

Figure 3.1. Example GPSS Program

A GPSS program is made up of control, definition, and executable statements written in a fixed format, one statement per card. An executable statement is called a block. A fixed set of blocks is provided to represent the components and control of the model to be studied. The format for a block allows up to seven operands. The definition of each block identifies the required and optional operands and their meaning. Each block is in actuality the name of a routine with the operands being parameters. Execution of a block is execution of its routine.

Temporary entities are transactions. Blocks are provided for causing transactions to become active in the model and for removing them from active status. Once activated, a transaction normally moves sequentially through the blocks of the program. The movement of a transaction into a block is a call for execution of the routine that corresponds to the block. Exceptions to sequential movement are caused by blocks which unconditionally or conditionally cause a transaction to transfer to a non-sequential block. These control blocks affect transactions in GPSS much the way GO TO or IF statements affect the instruction counter in procedure oriented languages.

A simulation study typically involves many transactions which interact with each other. It frequently becomes necessary to suspend processing of one transaction and begin work on another. Even so, the movement of one transaction through a GPSS program can be thought of as an execution of the program.

A clock is maintained which automatically updates to the time of the next event. Simulated time has an origin of one and is incremented by integer values.
A GPSS program is made up of control, definition, and executable statements written in a fixed format, one statement per card. An executable statement is called a block. A fixed set of blocks is provided to represent the components and control of the model to be studied. The format for a block allows up to seven operands. The definition of each block identifies the required and optional operands and their meaning. Each block is in actuality the name of a routine with the operands being parameters. Execution of a block is execution of its routine. Temporary entities are transactions . Blocks are provided for causing transactions to become active in the model and for removing them from active status. Once activated, a transaction normally moves sequentially through the blocks of the program. The movement of a transaction into a block is a call for execution of the routine that corresponds to the block. Exceptions to sequential movement are caused by blocks which unconditionally or conditionally cause a transaction to transfer to a non- sequential block. These control blocks affect transactions in GPSS much the way GO TO or IF statements affect the instruction counter in procedure oriented languages . A simulation study typically involves many transactions which inter- act with each other. It frequently becomes necessary to suspend processing of one transaction and begin work on another. Even so, the movement of one transaction through a GPSS program can be thought of as an execution of the program. A clock is maintained which automatically updates to the time of the next event. Simulated time has an origin of one and is incremented by integer 57 ec — (M^-rir-of-aDC'o — *vj f* * tr .$ h- U Z -«««*JIMir» eo o z UJ X x o o ►- z 5 a. ae UJ UJ to UJ X a UJ CO v> X ec o a • •> a. »- < x CO UJ CO ►» z a O UJ O uj ■- t- I p- « i/> ►- < _) *- OC 3 X Z UJ X H- <-■ a. •-> O to 0» -* en — c 0"> • CD o> o ro ir\ • • «*> rsj • • U f\j <\j 00 <. eg • ae • «vj z >»• O j • X eo ec <. UJ ec *~ to to CO a o z 3 «*> u o « « • z a a x UJ c* o • # — • o K UJ U -4 X z x uj 3 ae u. UJ z o — a. x x at ui ui U. Z • * u. o o O *XNp3p O on »• — • 5. u. ck. — • <-» r- UJ -J < Z KUO aeoxcuyjuiaez* ZV)OHlUMfi,>K IUI/)ZUOUJUJO< to UJ u> UJ o X o o o z CO a ae < Z X - o H -J — eo z UJ > Q eo UJ O 3 M •j uj a. v o eo to u o. «■* i-t < 3 tat oc o uj < z »- > < UJ uj < a to to z UI UI << — •-. x a uj ui x ae K a -J _J Ct < Q UJ UJ UJ I— Z ac ec t- to uj W) o & CO CO P-. t5 & H bO ae ui rsi o • » • • * • • ae x ui -i s eo z **u>49^eo O — 0* tai -* 58 values. The unit of real time represented by a unit of simulated time is programmer defined. Events are scheduled through the use of the ADVANCE block. Each transaction has an associated set of words including one called "block departure time", BDT . When a transaction moves into an ADVANCE block a time increment is calculated and added to its BDT. Processing of the transaction is suspended until all transactions with smaller BDT values have been processed. Permanent entities of several types can be used to describe the physical equipment of a model. A facility is an entity which can be occupied by only one transaction at a time . Blocks are provided that let a transaction use (SEIZE, PREEMPT) or relinquish use (RELEASE, RETURN) of a facility. A ' storage is an entity which can be occupied by more than one transaction. The storage capacity is given for each storage by a definition card. ENTER and LEAVE blocks are provided to let transactions use or relinquish use of a storage. 
A logic switch is a two state entity used to control the movement of transactions. The LOGIC block results in the setting, resetting, or inverting of a switch. Other blocks can be used to simulate various queuing schemes, assign values to variables, control transaction movement, group transactions or numbers, gather statistics, and give diagnostic or partial result information. GPSS provides a set of Standard Numerical Attributes, SNA ' s, which are names of variables. These variables are selected attributes of GPSS entities. An SNA consists of a one or two letter mnemonic and, in most cases, an index to identify a particular entity. For example, a facility can either be in use or available. The status is given by SNA Fj, where j is the index of the facility. In this example F3 = indicates facility three is available. The major use of SNA's is in the operand field of blocks. They are in fact the only variables a programmer can use. Symbolic naming of entities 59 is allowed in which case the symbol replaces the index number. Since all variables are SNA's defined in GPSS, all possible variable names and certain of their characteristics are known at compile time. A characteristic of interest in section 3.3-2 concerns a limitation on accessing certain variables. In particular, each transaction has a priority, PR, and a set of parameters, Pj, < j < 100. When PR or Pj are used as block operands they refer only to the transaction which is executing the block. A given transaction cannot use the value of the priority or any parameter of any other transaction as an operand. These SNA's are considered transaction related variables. System related variables can be accessed by any transaction. An example is the storage location called a savevalue, referenced by Xj . Any transaction executing a block with operand Xj refers to the same physical storage location. A similar distinction can be made for block types based upon the block definition. The PRIORITY and ASSIGN blocks are used to write values into PR and Pj of the transaction executing the block. They affect directly only the transaction being moved. Other blocks in the same category are ADVANCE, TEST, TRANSFER, etc. Blocks can be identified that give system variables new values. The SAVEVALUE block writes a value into savevalue location Xj which can be accessed by any transaction. Similarly, ENTER, LOGIC, QUEUE, and other blocks can change the value of system variables . GPSS provides features for using tapes to store intermediate results of large simulations. The features are not considered in this thesis. Processing a GPSS program includes assembly, input, execution, and output phases. The execution phase, in which transactions are moving through 6o blocks, is of most importance in this thesis. Execution is described in the next section, 3 -3, and the description of GPSS is continued, especially in section 3.3.1. 3-3 GPSS Execution A major goal of this thesis is to design a multiprocessor system for concurrent execution of individual programs written in a language like GPSS. To better understand the problem and the proposed system, serial execu- tion of GPSS is described. Parallelism within GPSS, and thus the potential for concurrent execution, is studied. The constraints limiting concurrency, as imposed by the simulated time aspect of simulation, are demonstrated. When simulation is being discussed in this section, simulated time will be referred to simply as "time" whereas real time will be identified as such. 
3.3.1 Serial Execution: GPSS/360

Processing in the execution phase consists entirely of moving transactions through blocks. One function of GPSS that simplifies the programmer's effort is the selection of which transaction should be moved and into which block it should be moved. With a single processor only one of the potentially many transactions can be selected. Once a selection has been made, one word of the set of words associated with each transaction identifies the next block to be executed.

Selection of the transaction to move is based on two chains maintained by GPSS. The current events chain contains all transactions whose block departure times, BDT's, are equal to or less than the current time. Items in this chain are ordered first by priority then, within a priority class, on a first-in first-out basis. The future events chain contains all transactions whose BDT's are greater than the current time. For this chain, transactions are ordered by their BDT, most imminent event first; then for all transactions at a given time the ordering is as in the current events chain.

Transactions are selected for processing in order from the current events chain. Certain conditions which can prevent the movement of a transaction, and therefore its processing, are explained in following paragraphs. When processing of transactions on the current events chain is complete, the first transaction in the future events chain, and all transactions with the same block departure time, are transferred to the current events chain.

The flowchart for the overall GPSS/360 scan is given in Figures 3.2, 3.3, and 3.4. The chart, from [7], is included to show the complexity of the algorithm and to provide a basis for comparison with the proposed machine organization. Figure 3.4 also gives the conditions under which the processing of one transaction is stopped and another is started.

Now the conditions mentioned above which can prevent the movement of transactions are discussed. A facility was described in section 3.2 as a unit capacity device. If a facility is in use at a time when another transaction would like to use (SEIZE) it, the second transaction must wait for the first to relinquish use. The second transaction has reached what is called a blocking condition. Processing is suspended until the facility becomes available, removing the blocking condition. A similar situation arises with respect to a storage when its capacity will be exceeded by the transaction that would like to use it. Thus when moving a transaction would violate the specifications of the system being simulated, the transaction is blocked.

Two blocks, GATE and TEST, can also stop the processing or divert the movement of a transaction. A gating condition can depend on the status of a logic switch, facility, storage, or block.

[Figure 3.2. Overall GPSS/360 Scan: Update Clock to Next Most Imminent Event. The flowchart increases the simulator clock to the BDT of the first (next most imminent) transaction in the future events chain, then moves that transaction, and every following transaction with the same BDT, to the end of its priority class on the current events chain. The future events chain holds transactions in positive time ADVANCE blocks, transactions waiting to leave GENERATE blocks, and operators for tables operating in the arrival rate mode.]
[Figure 3.3. Overall GPSS/360 Scan: Scan of Current Events Chain. The status change flag is reset to off and the first transaction in the current events chain is examined. A transaction in an active scan status is moved toward some next block (Figure 3.4); a transaction inactive in a delay chain is skipped. The scan advances to the next sequential transaction unless the status change flag has been set, in which case it transfers to the start of the chain. The scan also restarts from a BUFFER block or a PRIORITY block with the BUFFER option. When the scan has gone all the way through the current events chain, no further transactions can be moved at this clock time, and the clock is updated to the next most imminent BDT (Figure 3.2).]

[Figure 3.4. Overall GPSS/360 Scan: Try to Move Individual Transaction into Some Next Block. If the transaction can move, the block type subroutine is executed; a TERMINATE, ASSEMBLE, GATHER, or MATCH block can stop processing of the transaction immediately. If the transaction cannot move, because of a GATE M, GATE NM, or TEST block, or a TRANSFER block in the both or all selection mode, it is placed in a pushdown delay chain and its scan status indicator is set on. When the particular facility, storage, or logic switch changes status, the scan status indicator is reset to off for all transactions in the associated delay chains. A transaction that completes its moves is removed from the current events chain and merged into the future events chain.]

The TEST block examines an algebraic relation between two standard numerical attributes. Thus a transaction can be blocked waiting for the model to satisfy certain programmer defined conditions. When a transaction is blocked it depends on another transaction to change the model status and unblock it. In many cases a time change must take place before the model status changes. Time is a very important entity in control of the simulated system.
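Before turning to an example trace, the structure of the serial scan can be summarized in a few lines of code. The sketch below is a loose modern Python paraphrase of Figures 3.2 through 3.4, not a rendering of the GPSS/360 implementation; delay chains and priority ordering are omitted, and try_to_move is assumed to execute block routines for one transaction and report whether the model status changed.

    def overall_scan(current_chain, future_chain, try_to_move):
        # current_chain: transactions with BDT <= clock, in scan order.
        # future_chain: transactions ordered by BDT, most imminent first.
        clock = 0
        while current_chain or future_chain:
            status_change = False
            for txn in list(current_chain):
                if try_to_move(txn):          # a facility freed, a switch
                    status_change = True      # set, etc.: restart the scan
                    break
            if status_change:
                continue
            if not future_chain:
                break                         # no further movement possible
            clock = future_chain[0].bdt       # update clock to the next
            while future_chain and future_chain[0].bdt == clock:
                current_chain.append(future_chain.pop(0))
        return clock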
Serial execution of the program in Figure 3.1 is illustrated in Figure 3.5. The chart shows the sequence of transactions and blocks executed to complete the simulation. Processing starts at the bottom line. It progresses horizontally, moving the transaction that corresponds to the line through blocks until the processing must be stopped for one of the conditions given in Figure 3.4. Processing resumes with the transaction on the line above. Numbers on the lines are the times at which the transaction moved into the block. The final number on each line is the time when processing of that transaction was suspended, either temporarily or finally. The chart is derived from data gathered by tracing the execution of the example program.

[Figure 3.5. Serial Execution Trace of Example Program. The chart itself is not recoverable from the scan; it plots one line per transaction, numbered 1 through 24, against block numbers 1 through 11, with entry times marked along each line.]

There are several things to observe on the chart. Transactions one, two, and three move through all blocks without interruption. This is because no other transaction is active during the time they are in the model. There is no interaction between these three transactions, so the movement of each through the program is equivalent to an execution of the program. Other transactions took two to four separate processing intervals to move through all blocks. Moving each of these through the program is like an execution of the program with interaction between the separate executions. Completion of the simulation came with completion of ten transactions, but more than ten started.

The order in which transactions are activated is not necessarily maintained throughout a simulation. An example of changing the order occurs on the chart. Transactions 10, 11, and 14 execute blocks four and five before transaction nine. Movement of transactions through the program is governed by run time determination of the model status. In general, the order of block executions is not known at compile time. Figure 3.5 demonstrates this. No amount of study of the program, short of actual execution, leads one to the knowledge that some transactions will move through the entire program with the processing of no other transaction intervening, whereas other transactions will require four distinct processing intervals.

3.3.2 Concurrent Execution

In this section three levels of parallelism within a GPSS program are described. The parallelism is used in the design of the multiprocessor machine of Chapter 4.

3.3.2.1 Parallelism Within a Block

The first level of parallelism is that which exists within the routine that a block type represents. This is precisely the type of parallelism that can be found in a procedural language. If it is known that a block is going to be executed, clearly the parallelism within the block can be fully used by a multiprocessor independent of consideration of other blocks or transactions. Use of this parallelism does not change the overall scan of Figures 3.2 through 3.4.

To measure the parallelism at this level, 21 frequently used GPSS block types were converted from their original 360 Assembler Language version to Fortran. The 21 Fortran program equivalents were analyzed on the system described in [10]. Results of the analysis are given in Tables 3.1 and 3.2 at the end of this section. Comments on the conversions to Fortran are given here. An attempt was made to have each Fortran program perform the same functions as the original, although the methods had to differ slightly due to differences in the languages and operation of the analyzer program. The assembler language version makes use of many subroutines. In Fortran, subroutine calls were replaced by the subroutine itself so the analyzer could examine the complete program. Bits manipulated in assembler language were considered variables in Fortran. Specification statements were not given since they do not affect the parallelism and are not examined by the analyzer.
Each block type is analyzed as a separate program so any exit from the block is an END or RETURN. This includes error checking statements which normally branch to the GPSS output phase, write an appropriate message, and terminate the simulation. In GPSS/360, completion of a block results in a call to the overall scan, Figure 3.4, and processing continues. For parallelism analysis each block is a separate program. No attempt is made to analyze sequences of blocks, although this would tend to increase the parallelism. Program conversion and analysis was not done for parts of the GPSS/360 system concerned with the selection of transaction-block pairs for execution. Only the box labeled "Execute block type subroutine" in Figure 3.4 of the overall scan algorithm was converted.

The Fortran version listing and flow chart of the QUEUE block are given in Figures 3.6 and 3.7. This block is a typical example of the type of statements and length of a block routine. Reference to these figures is made in Chapter 4 in a discussion of processor capability requirements.

[Figure 3.6. QUEUE Block; Fortran Version. The listing is too badly garbled in the scan to reproduce; it consists largely of IF tests and simple integer assignments, with the QUENR and UPQ subroutines expanded inline.]

[Figure 3.7. QUEUE Block Flow Chart. The boxes, in order: increment the block entry count; if there is a B operand, set QUNITS to the decoded B operand, otherwise set QUNITS to 1; call QUENR to check the legality of the queue number; call UPQ to update the queue statistics. If this transaction is not already in a queue, record the queue number in T13 and the entry time in T14. Otherwise, if the multi-queue bit is not set, set it, enter the first queue number (T13) and entry time (T14) and then the current queue number and time in the multi-queue table, and set the queue count to 2; if the multi-queue bit is already set, enter the queue number and time in the multi-queue table and increment the queue count by 1, writing a one-time warning message (and setting the warning message bit) when the queue count reaches 5 and no table location is available. Finally, increment the total entry count and the current contents by QUNITS, and if the current contents exceed the maximum contents, set the maximum contents equal to the current contents.]
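The flow chart can also be paraphrased in a few lines of modern code. The sketch below is only one reading of Figure 3.7 in Python; the field names on the queue record and the transaction are hypothetical, the UPQ statistics update is inferred from the recoverable fragments of the listing, and the five queue limit and warning message path are omitted.

    def queue_block(txn, qnr, queues, clock, b_operand=None):
        q = queues[qnr]                     # QUENR: queue number legality check
        q["entries"] += 1                   # increment block entry count
        qunits = b_operand if b_operand is not None else 1
        # UPQ: accumulate time-weighted contents since the last update
        q["area"] += (clock - q["last_time"]) * q["contents"]
        q["last_time"] = clock
        if txn.queue_count == 0:            # transaction's first queue:
            txn.t13, txn.t14 = qnr, clock   # record queue number and time
        elif txn.queue_count == 1:          # second queue: open the
            txn.multi = [(txn.t13, txn.t14), (qnr, clock)]  # multi-queue table
        else:
            txn.multi.append((qnr, clock))
        txn.queue_count += 1
        q["total"] += qunits                # increment total entry count
        q["contents"] += qunits             # and current contents by QUNITS
        if q["contents"] > q["max"]:        # track maximum contents
            q["max"] = q["contents"]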
Table 3.1 lists the results of analyzing the 21 block type programs, including QUEUE, to determine the number of operations that can be done concurrently and the speedup in execution using a multiprocessor configuration. The techniques and algorithms used in the analysis are fully described in [10]. One change was made to increase the accuracy of the analyzer for this set of programs. The analyzer previously did not count memory fetches, under the assumption that they were overlapped with operations on the data. The listing of QUEUE shows very few arithmetic operations, reducing the overlap of memory fetches. Since other blocks are similar it is necessary to count fetches in the measurement of these programs.

The analyzer currently has a limited ability in handling IF trees. The maximum number of levels per tree is eight. An algorithm for folding trees with more levels has not been implemented yet, so longer trees are artificially broken to give two or more trees. This results in less speedup than can be achieved with the hardware of Chapter 2.

Column headings for Table 3.1 are explained in the referenced paper and given again here. Minimum and maximum values are given when a range of results occurred.

(1) The names are the GPSS block type names, plus DECODE and FUNCTION, two routines for decoding operand values which required function evaluation or used indirect addressing for index values.

(2) This is the approximate number of source cards, excluding comments and multiple RETURN statements. The number of cards in the scopes of DO loops and IF loops is given.

(3) This is the maximum number of iterations assumed for any DO loop or IF loop in the program.

(4) The number of traces is the sum of all paths from the beginning to all END or RETURN statements plus the number of IF loops.

(5) T1 is the time required to execute the program on a uniprocessor.

(6) Tp is the time required to execute the program using a multiprocessor capable of executing a maximum of p operations at once; a p-multiprocessor.

(7) Sp is the ratio of column (5) to column (6).

(8) This is the number, p, of processors required to achieve the Tp value in (6).

(9) Ep is the efficiency, defined for the number of processors in (8) as

    Ep = T1/(p Tp) ≤ 1.

(10) Up is the utilization for the number of processors in (8). The techniques used to reduce execution time may introduce extra operations. Let Op be the number of operations in the execution of a program on a p-multiprocessor. Call Rp the operation redundancy and let Rp = Op/O1 ≥ 1. The p-multiprocessor utilization is Up = Ep Rp.
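A small worked example may make the relations among these measures concrete. The numbers below are invented for illustration and do not come from Table 3.1.

    # Hypothetical block routine: T1 = 120 steps serially, Tp = 20 steps
    # on p = 8 processors, with O1 = 120 and Op = 160 operations executed.
    T1, Tp, p = 120, 20, 8
    O1, Op = 120, 160

    S = T1 / Tp          # speedup: 6.0
    E = T1 / (p * Tp)    # efficiency: 0.75, and always E <= 1
    R = Op / O1          # operation redundancy: 1.33, and always R >= 1
    U = E * R            # utilization: 1.0; every processor step is busy,
                         # though one third of the operations are redundant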
[Table 3.1. Fortran Block Analysis. The tabulated entries are not recoverable from the scan.]

Table 3.2 summarizes the program speedup information derived from the analysis. For each program it gives the number of traces with the speedup indicated in the column heading. Each column covers a range of ±0.5 from the heading value.

[Table 3.2. Block Routine Speedup Factors. The entries are too badly garbled in the scan to reproduce reliably; for each block routine the table gives the number of traces at each speedup factor, 1 through 9, and the modal value. A footnote records that the FUNCTION subroutine was removed for the analysis of the starred block types.]

A conservative approach was taken in the design of the Fortran analyzer. The measures will change in the direction of improvement with improvement of the analyzer. These tables will be referenced in Chapter 4 on the design of a multiprocessor for executing these programs.

3.3.2.2 Concurrency Between Blocks

Since a single transaction moves through the blocks of a simulation program in much the same way as the execution of a conventional program proceeds from one instruction to the next, there is potential for concurrent execution of several blocks. This second level of parallelism is analogous to that between instructions in a procedural language but is at the level of routines rather than instructions. This section is limited to simulations with one active transaction. The multiple transaction case is covered in the next two sections.
If it is known that a sequence of blocks is going to be executed by one transaction, clearly the parallelism between blocks in the sequence can be exploited. A GPSS program can be partitioned into groups of sequential blocks which do not violate the precedence requirements of variables. That is, from the definition of each block type and examination of its operands it is possible to group blocks such that no variable is written then read by the blocks in one group.

For a simulation with one transaction the blocks in a precedence partition are those that can be processed concurrently. Most simulations involve multiple transactions which affect each other and the system being modeled. As a result, the processing of one transaction may be interrupted at a block in the midst of a precedence partition. The reduction of concurrency is covered in section 3.3.3.

A segment of the program in Figure 3.1 is reproduced here to serve as a precedence partitioning example. Selected assignment statements from the corresponding routines are given at the right.

    Block Number   Block    Operands       Selected Routine Statements
    2              ASSIGN   12,FN$TERMI    P12 <- Function value
    3              INDEX    3,10
    4              SEIZE    P12            J <- P12
                                           F(J) <- 1

P12 appears on the left side of an assignment statement in block two and on the right in block four. Since block four cannot be executed concurrently with two, they must be in separate partitions.

The order of block execution is known for all segments of the program free of conditional jumps. If interaction between transactions is not considered, all blocks in a partition can be processed concurrently. The parallelism within each block can be simultaneously applied. The overall scan in Figures 3.2 through 3.4 is changed to the extent that all blocks in a partition are executed concurrently rather than serially.

Algorithm 3.1 for the precedence partitioning of GPSS programs is presented in Figure 3.8. Basically, the GPSS card deck is scanned sequentially. When a block is found, the type and standard numerical attributes, SNA's, used as operands are noted. If processing the block simultaneously with preceding blocks in the current partition would violate precedence requirements, the current partition is closed and the block becomes the first in a new partition. Due to the structured nature of GPSS it is not necessary to examine the statements of each routine to perform the partitioning. The algorithm uses the Precedence Partitioning Guide, Table 3.3, which is based upon the definition of each block type, to identify the blocks or SNA's that cannot be in the same partition as the current block. Entries in column B of the guide are those blocks which read a variable written by the block type in the corresponding position of the left most column.
When the entry for a block type is ALL, the block type is the last in a partition regardless of what follows. Block types EXECUTE and GENERATE are permanent entries in column B for all block types. Optional block operands cause an EXCLUSIVE OR situation for some entries. The guide is conservative in several respects. The TRANSFER block causes a partition boundary even when it is used for unconditional jumps which do not violate any operand precedence requirements. Block types which are used in pairs, such as QUEUE and DEPART, are not permitted in the same partition. The actual requirement is that pair types should not be in the same partition if they are operating on entities with the same index value.

Entries in column S are those SNA's which are given new values by the block type in the left column. In several cases a single letter is used to represent a set of SNA's. This is the case for example with ENTER, where S is used for the set S, SR, SA, SM, SC, and ST. The ENTER block can change values for each set member, so the use of any member causes a partition boundary and detection of S is sufficient. The WRITE block is omitted from the guide since it involves tape features which are not being considered.

[Figure 3.8. Algorithm 3.1, Precedence Partitioning of GPSS Programs. The flowchart reads: while the source deck has not been completely read, read the next card; if it is not a block, continue; otherwise set N = N+1 and scan the operand field to determine the SNA's read by this block. If the block type is in table B, or any SNA read is in table S, set P(N-1) = 0 and clear tables B and S. If ALL is in column B of the partitioning guide for this block type, the block is the last in its partition; otherwise, using the guide, put the column B entries in table B and the column S entries in table S.]

    [The rows above the BUFFER entry are lost in the scan.]
    BUFFER       B: ALL
    CHANGE       B: ALL
    COUNT        S: Pj, j = operand A
    DEPART       B: QUEUE;  S: Qj, j = operand A
    ENTER        B: LEAVE;  S: Sj, Rj, j = operand A
    EXAMINE      B: ALL
    EXECUTE      B: ALL
    GATE         B: ALL
    GATHER       B: ALL
    GENERATE     B: ADVANCE, ENTER, GATE, LOGIC, MARK, PREEMPT, SEIZE
    INDEX        S: P1
    JOIN         B: ALTER, EXAMINE, REMOVE, SCAN;  S: Gj, j = operand A
    LEAVE        S: Sj, Rj, j = operand A
    LINK         B: ALL
    LOGIC        S: Lj, j = operand A
    LOOP         B: ALL
    MARK         S: (MPj, Pj, j = operand A) ⊕ M1
    MATCH        B: ALL
    MSAVEVALUE   S: MXj(k,l) ⊕ MHj(k,l); j = operand A, k = operand B,
                 l = operand C
    PREEMPT      B: RETURN;  S: Fj, j = operand A
    PRINT        (no entries)
    PRIORITY     B: ALL if operand F is specified;  S: PR
    QUEUE        B: DEPART;  S: Qj, j = operand A
    RELEASE      S: Fj, j = operand A
    REMOVE       B: (ALTER, EXAMINE, JOIN, SCAN) ⊕ ALL if operand F is
                 specified;  S: Gj, j = operand A
    RETURN       S: Fj, j = operand A
    SAVEVALUE    S: Xj ⊕ XHj, j = operand A
    SCAN         B: (ALTER, JOIN, REMOVE) ⊕ ALL if operand F is
                 specified;  S: Pj, j = operand E
    SEIZE        B: RELEASE;  S: Fj, j = operand A
    SELECT       S: Pj, j = operand A
    SPLIT        S: Pj, j = operand C
    TABULATE     S: Tj, j = operand A
    TERMINATE    B: ALL
    TEST         B: ALL
    TRACE        B: ALL
    TRANSFER     B: ALL
    UNLINK       S: Cj, j = operand A
    UNTRACE      (no entries)

    Table 3.3. Precedence Partitioning Guide

Block precedence partition membership is indicated by a subscripted variable P(n), where n is the block number assigned in order of appearance in the source deck. P(n) = 1 when the following block belongs to the same partition. P(n) = 0 for the last block in a partition. Applying the algorithm to the example program gives the result in Figure 3.9. Simulation of one transaction moving through this program can be done in four processing "intervals", corresponding to the four partitions, on a suitable multiprocessor.
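Algorithm 3.1 is compact enough to restate in modern code before turning to Figure 3.9. The sketch below is a Python paraphrase of Figure 3.8 under simplifying assumptions: each block is reduced to its type and the set of SNA's it reads, and the guide is a dictionary supplying the column B and column S entries and an ALL flag.

    def partition(blocks, guide):
        # blocks: list of (block_type, snas_read) pairs in deck order.
        # Returns P, where P[n] = 1 if block n+1 is in the same partition.
        P = {}
        table_b, table_s = set(), set()
        n = 0
        for block_type, snas_read in blocks:
            n += 1
            if block_type in table_b or snas_read & table_s:
                P[n - 1] = 0                    # close the current partition
                table_b, table_s = set(), set()
            col_b, col_s, is_all = guide.get(block_type,
                                             (set(), set(), False))
            if is_all:                          # ALL in column B: this block
                P[n] = 0                        # is the last in its partition
                table_b, table_s = set(), set()
            else:
                P[n] = 1                        # provisional; the next block
                table_b |= col_b | {"EXECUTE", "GENERATE"}   # may reset it
                table_s |= col_s
        if n:
            P[n] = 0                            # the final block ends a partition
        return P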
    Block Number   Block       Operands       P(n)   Partition Number
    1              GENERATE    300,FN$EXPON   1      1
    2              ASSIGN      12,FN$TERMI    1      1
    3              INDEX       3,10           0      1
    4              SEIZE       P12            1      2
    5              QUEUE       CPU            1      2
    6              SEIZE       CPU            0      2
    7              DEPART      CPU            1      3
    8              ADVANCE     V1             0      3
    9              RELEASE     CPU            1      4
    10             RELEASE     P12            1      4
    11             TERMINATE   1              0      4

    Figure 3.9. Precedence Partitions of Program in Figure 3.1

3.3.2.3 Concurrent Transaction Movement

The interesting feature of simulation programs is the prospect for concurrently processing more than one transaction. This third level of parallelism is without analogue in conventional programs. In section 3.3.1 the point was made that a single transaction moving through a simulation program was like a single execution of a conventional program. If multiple transactions moving through a simulation program were like multiple executions of a conventional program, it would be a straightforward matter, in theory, to concurrently process as many transactions as desired. This is not, however, the case. There is an interrelationship of transactions with the model, including other transactions, that must be taken into account.

It has been mentioned, in section 3.3.1, that the processing of a transaction can be interrupted for several reasons and that transactions can move through the simulation program at different rates, causing changes in their ordering in the program. These effects come about because all transactions are related to the model and each transaction can affect variables to which other transactions have access. A system variable is a variable which can be both written and read by more than one transaction. Any block which uses a system variable as an operand that will be read must be processed in a certain order with respect to blocks that write that variable. This means that precedence requirements extend across transactions and become a run time function rather than being defined by the program at compile time.

A restricted application of multiple transaction concurrency would be to select for processing only the set of transactions which represent current events. This set consists of the members of the current events chain described in the serial execution section, 3.3.1. The restriction is not necessary if proper care is exercised in the selection of transactions to move. That is, members of the future events chain are also candidates for concurrent processing. The essence of this is that a range of the real time being simulated can be involved in processing being done concurrently. At an instant of real time on a multiprocessor, more than one instant of simulated time can be in process. All of the interactions that take place in real time in the model must be considered, plus the interactions caused by overlapping multiple instances of simulated real time into a single instant on the computer.

As an example of time overlap consider a computer system with two independent terminals and one processor that can be called into exclusive use by either terminal. If the processor is in use when a terminal requests it, the terminal must wait for the processor to become available. Figure 3.10(a) shows a possible use of the equipment on an arbitrary real time scale. Solid lines represent equipment in use, dashed lines represent time spent waiting for the processor, and vertical dashes are transitions between equipment. Figure 3.10(b) shows the same equipment usage as it can be simulated.
An event at real time one can be processed simultaneously with an event at real time zero if the events are independent, as they are in this example. Likewise for events at real times two and three. In Figure (a) there are six distinct times at which events to be simulated take place. They are tabulated below. On a multiprocessor restricted to simulating all events at one instant of time before advancing to the next, six distinct processing intervals would be required. In Figure (b), with the overlap of times (0,1) and (2,3), there are only four distinct times. On a multiprocessor which allows such overlap only four distinct processing intervals are required.

[Figure 3.10. Example of Time Overlap on Independent Events. Panel (a), Timing of System to be Simulated, plots Terminal 1, the Processor, and Terminal 2 against real time 0 through 5. Panel (b), Timing with Overlap, plots the same usage against simulated real time, with times 0 and 1 overlapped and times 2 and 3 overlapped.]

    (a) Processing of Figure 3.10(a)

    Real Time   Event                                             Processing Interval
    0           Initiate Terminal 1                               1
    1           Initiate Terminal 2                               2
    2           Terminal 1 takes Processor                        3
    3           Terminal 2 requests Processor                     4
    4           Terminal 1 frees and Terminal 2 takes Processor   5
    5           Terminal 2 frees Processor                        6

    (b) Processing of Figure 3.10(b)

    Simulated Real Time   Event                                             Processing Interval
    (0,1)                 Initiate Terminals 1 and 2                        1
    (2,3)                 Terminal 1 takes and Terminal 2 requests
                          Processor                                         2
    4                     Terminal 1 frees and Terminal 2 takes Processor   3
    5                     Terminal 2 frees Processor                        4

    Table 3.4. Distinct Events Comparison of Figure 3.10

For this example, the time overlap allowance results in a savings of two processing intervals. The error that must be avoided, however, is letting terminal two take the processor before terminal one, since they both appear ready at the same place on the scale in Figure 3.10(b). That is an example of the interaction introduced by overlapping multiple instances of simulated real time.

A specific GPSS example of the precedence requirements due to multiple transactions is now taken from the program in Figure 3.1. Again, a program segment is reproduced. The single transaction precedence partition numbers are listed as determined in the previous section.

    Block Number   Block     Operands   Partition Number   Selected Routine Instructions
    4              SEIZE     P12        2                  N4 = N4+1
    5              QUEUE     CPU        2
    6              SEIZE     CPU        2
    7              DEPART    CPU        3
    8              ADVANCE   V1         3                  TIME = TIME+100*N4

    1   VARIABLE   K100*N4

Blocks four and eight are in separate partitions and therefore will not be processed concurrently for any single transaction. The writing and reading of N4 appears to be properly separated. Consider the following multiple transaction situation, which uses times that actually occurred as taken from Figure 3.5. Suppose transaction six is processing partition two, including block four, at time 1791 and, concurrently, transaction five is processing partition three, including block eight, at time 1953. Each transaction is following the precedence partition, but six is writing N4 while five is reading it. Even though transaction five has executed block four, it cannot correctly execute block eight until all transactions with times less than 1953 have moved through block four. It is therefore not sufficient just to follow the precedence that can be detected at compile time.

A similar situation does not exist with operand P12, which occurs in three different partitions. P12 refers to parameter 12 of the transaction executing the block. In this program no transaction can change the P12 value of another.
Fortunately the variables and block types involved in precedence requirements between transactions are known. The variables have been previously defined in this section as system variables. There is a special category of three standard numerical attributes, the function, variable, and boolean variable, which are programmer defined. They are system variables if the definition uses a system variable. The system block types are those which, by definition, implicitly use system variables.

The use of system variables is indicated for each block by S(n), where n is the block number, as in the precedence partition algorithm. S(n) = 1 when the block does not use any system variables, due to either the block type or the operands. S(n) = 0 when it does. The assignment of values to S(n) is made on a block by block basis simply by comparing the type and operands against a table of system blocks and variables.

Allowing concurrent processing of multiple transactions changes the concept of the overall scan, Figures 3.2 through 3.4, drastically. The current and future events chains are eliminated. A single clock is replaced by a time word for each transaction. The serial nature of the algorithm must be revised to take advantage of the parallelism in the program. In Chapter 4 a hardware unit is proposed as a replacement for the current software overall scan algorithm.

Elimination of the two chains makes it necessary to define some new terms. Transaction time is the simulated time word associated with each transaction. It is similar to, and replaces, the block departure time. A min time transaction is one whose transaction time is equal to or less than the transaction time of all other transactions which are not blocked. Refer to section 3.3.1 for the meaning of blocked. The set of min time transactions is the set of transactions which would have been members of the current events chain.

The significance of S(n), mentioned above, is that any block with S(n) = 0 must be processed only by a min time transaction. The effect, referring to the example in Figure 3.10, is that the action of taking the processor can be restricted to the min time transaction. This assures terminal one of success in taking the processor at time two, avoiding the potential error mentioned in the example. Note that if S(n) = 0 for all blocks, processing reverts to the time ordered case. S(n) = 1 allows those blocks which are time independent to be processed by non-min time transactions.
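Because the S bit depends only on a table comparison, its assignment is simple to sketch. The block type and SNA tables in the Python below are illustrative placeholders seeded from the examples in the text, not the complete tables the thesis used.

    SYSTEM_BLOCK_TYPES = {"SAVEVALUE", "ENTER", "LOGIC", "QUEUE"}  # partial
    SYSTEM_SNA_CLASSES = {"X", "F", "S", "Q", "L"}                 # illustrative

    def assign_s(block_type, operand_snas):
        # S(n) = 0: the block uses a system variable and must be executed
        # by a min time transaction. operand_snas is a set of (class, index)
        # pairs naming the SNA's appearing in the operand field.
        if block_type in SYSTEM_BLOCK_TYPES:
            return 0
        if any(cls in SYSTEM_SNA_CLASSES for cls, index in operand_snas):
            return 0
        return 1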
This was not necessary in the single transaction case since a blocked transaction would be a deadlock. No other transaction would ever be available to clear the blocking condition. The pro- gram itself is faulty if this happens. Simulations with multiple transactions do cause a reduction in the length of precedence partitions and thus a decrease in the number of concurrently executable blocks per transaction. The total concurrency increases however, due to more than one transaction being processed. The BUFFER block is used to stop the movement of a transaction when it could normally move to the next block. The effect of this is achieved by assigning P = to the previous block and removing the BUFFER. PRIORITY with the BUFFER option is assigned P = for the same reason. Normally GATE and TEST blocks, which are like conditional jumps for transactions, receive P = since the next block is unknown. The entry ALL in column B of the partitioning guide causes this. When GATE or TEST blocks with alternate exits appear In a sequence, an IF tree for transactions is formed. Appropriate revision of the routines for these two block types will allow combined execution of the sequence as one block of greater length, making effec- tive use of the hardware in Chapter 2. When the sequence occurs the precedence partitioning algorithm is applied to all blocks, except the last, as if the column B entry in the guide were blank. Thus operand precedence is checked to determine P rather than assigning P = directly. 93 The above modifications change the partitioning algorithm from defining single transaction precedence partition boundaries to defining sets of blocks which are simultaneously executable in a multiple transaction environ- ment. In the remainder of this paper the sets will be referred to as partitions. Now consider the modifications to S(n) . The purpose of this bit is to flag blocks which must be executed only by min time transactions. If two such blocks are in sequence with no possible change in transaction time from one to the other, it is only necessary to flag the first one. If the trans- action is at the minimum time on execution of the first block it will neces- sarily be there for the second also, regardless of the value of S. Therefore assign the second block S = 1. The choice is made to improve performance in the machine and will be explained in section k.k. This asssignment rule can be extended to allow any number of blocks in the sequence and also allow intervening blocks which do not use system var- iables. One condition is necessary however. If a labeled block, or other block which can be the destination of a jump, occurs, it breaks the sequence. A transaction which is not at the minimum time can transfer to such a block and must be prevented from executing a system variable block. A program has been written that scans GPSS decks, simultaneously applying the partition algorithm and system variable table look-up, with modi- fications mentioned above, to generate the code bits SP(n) for each block. The result is an indication of the concurrency in the program but is of course a static measure. It does not measure the effects of multiple transactions or selected paths through the program. Results of scanning several programs are given in Table 3.5. The meanings of the column headings are: 9h (1) The name identifies the application of the program. The source of the listing is given by reference. (2) Blocks are the executable GPSS statements. 
(3) Blocks which unconditionally transfer transactions are tabulated. They shorten partitions since they are considered the last block in a partition without inspecting the destination of the transfer.

(4) The maximum partition length is the largest number of blocks that are simultaneously executable.

(5) The average partition length is the average number of simultaneously executable blocks.

(6) This column is the number of blocks for which S = 1. The processing of these blocks is never delayed to wait for min time transactions.

(7) This column is the number of blocks for which P = 1.

(8) This column is the number of blocks for which both S and P are one.

The maximum and average partition lengths are the figures of most interest. For all the programs that were scanned the average is near two. Thus, on the average, any transaction at any time can be executing two blocks concurrently. It is felt that the average could be raised to three blocks with a more sophisticated scan program. The current program implements the algorithm of Figure 3.8 with the modifications of this section. This algorithm does a sequential scan of the source deck. Improvements could be made by following traces through the program so transaction decision trees could be found and the destination of unconditional transfers could be examined. Loops made with the LOOP block could be expanded. This has been done by hand for two of the programs.

[Table 3.5. Program Partitions and Processing Codes. The table body, and the pages that followed it through the opening sections of Chapter 4, are not recoverable from the scan. The text resumes below in the midst of section 4.3.2, on the use of the decision processor within a task processor.]

A statement of the form IF (A ≥ B) THEN X; ELSE Y; can be rewritten equivalently as IF (A < B) THEN Y; ELSE X;.
Use of the biased connection requires only one node calculation in five of the unit processors for the same linear 106 BIT 12 3 4 I I 1 1.1 PROCESSOR 1 A I ■ 1 \ >. i 1 1 ^ i / TO U.R 4 TO U.R ! i i . , TO U.R 6 , \ • • • i / , 1 BIT 4 i i BIT 2 BIT 3 a i U.R 2 DECISION PROCESSOR -CONNECTION BUSES NODE OUTPUT REGISTERS (a). Level Connection Scheme s « > >. l BIT 2 5 4 \ l\ X ' . X / • \ • • \ • N. • # \t_ i \ L_ I \ I k t ! | BIT 2 A | BIT 3 M 1 BIT 4 " " ' '"' ' "" " " " | | ■ ■ U. Rl 2 3 4 5 6 7 8 'CONNECTION BUSES DECISION PROCESSOR NODE OUTPUT REGISTERS [b) . Biased Connection Scheme Figure k.2. Connections Between Decision and Unit Processors 107 tree. The connection scheme for a given tree is selected at compile time. The criterion for selection is that the number of node values which must be calculated by any one unit processor should be minimized. The original philosophy of the decision processor was to make better use of existing arithmetic processors when a program had many con- ditional jumps. It was assumed that any multiprocessor application would require a reasonably large number of processors. An interesting result of the Fortran block analysis was that, excluding decision trees, the required number of processors was small. For most programs the number was two to four. The number of unit processors in this machine is in the range of eight to l6 because of the IF trees. It is reasonably accurate to say that the block programs consist primarily of IF trees. k.3-3 Task Processor Configuration Each task processor is capable of processing any block type. The parallelism within a block is exploited by providing parallel unit proces- sors and a decision processor within the task processor. The number of tasks that can be in execution simultaneously is dependent on the concurrency between blocks and transactions. The number is the sum of the concurrently executable blocks per transaction, taken over the number of concurrently moving transactions. Section 3«3«3 showed that the number of blocks per transaction was slightly over two for six significantly different GPSS programs. The number of transactions which can be moved concurrently cannot be measured by study of the program. It is a run time function of the program being executed on this parallel machine. A simulator of the machine was written, primarly in GPSS, to measure the transaction concurrency of actual GPSS programs. Details 108 of the simulation system are given in the Appendix. The results are given here. Programs were tested in the simulator under the condition that executing tasks uses an amount of time determined by the analysis of Fortran equivalents for GPSS blocks. In the first test the coordination unit, which distributes tasks to processors, was assumed to take zero time to operate. This measures the best performance that can be expected. The results are given in Table k .1 as the rows marked (l) . Column headings for Table k.l have the following meanings: (1) The program name corresponds to the names used in Table 5»5« (2) "Total transactions" is the number of transactions that were active at some time in the simulation. For the thesis example program it can be seen in Figure 3-5 that 2k transactions existed although only 10 went through the entire program. (3) "Blocks executed" is the number of blocks for which execution was simulated. (k) Beginning with this column there are multiple entries for each test program. The number of task processors is a design parameter for the machine. 
Simulations were run with four, eight, and 16 processors to observe the effects on the entries in columns (5) and (6).

(5) "Concurrent transactions" is the number of transactions which were being processed simultaneously. Entries were generated in the simulator by sampling transactions being processed at a time interval equal to the shortest block execution time.

(6) This column lists the simulated execution time.

[Table 4.1. Simulation Results: Concurrent Transactions. The table body is not recoverable from the scan.]

A second test was run in which operation of the coordination unit
A transaction which uses a system variable needs to be delayed only until it is known that no transaction at a lower simulated time can change that variable. The run time information on the model status required to do this is much greater than what is needed for the machine presented here. This implies much greater complexity in the coordination unit. The machine organization conclusion drawn from Table k .1 is that eight task processors are reasonable . Limiting the design to four task processors imposes a physical limit of four on transaction concurrency and increases execution time significantly. There is improvement in execution time going from eight to l6 but it is not as great, even in absolute terms, as the improvement in going from four to eight task processors. k.J.k Hardware Design Considerations Design objectives for this machine are to use currently available, familiar, inexpensive parts and balance the speeds and bandwidths of all system components. As a starting point it is assumed that the task processor is designed with TTL gates having propagation delays near 10 nanoseconds (ns) . A 5 MHz clock generates pulses at 200 ns intervals, allowing combinational logic chains up to 20 gates. Note that one cycle of an eight level decision 112 processor takes just one clock. If instructions take an average of two or three clocks the in- struction execution time averages out near 500 ns. Table J.l gives a count of instruction steps for the execution of many GPSS block types on a multi- processor capable of executing p operations simultaneously. Using the 500 ns average instruction execution time, average block execution times can be stated. These times are used in the simulator described in the Appendix. Now consider the amount of hardware required for the task processor configuration, excluding memories and their interconnections. In summary there are eight to l6 unit processors, one decision processor of six levels, and one task processor control unit. The unit processor is a simple device with on the order of 1000 gates and flip flops. From Table 2.2 the six level decision processor, including total sector control, has about 500 logic circuits. Estimate the task processor control unit at 3500 gates. A task processor with eight unit processors has on the order of 12,000 gates. For l6 unit processors the number is 20,000. A configuration of eight task pro- cessors requires approximately l60, 000 gates. k.k Coordination Unit This unit is concerned with the selection of tasks for execution in the task processors. Selection is based on the simulation model status, the transaction time of all transactions, and the SP processing code assigned to each block in the simulation program. A transaction available for movement is selected, then tasks are formed from the sequence of blocks the transaction will move through. In GPSS/360 the only transactions selected for movement are those 113 on the current events chain. The equivalent of this chain is the set of run time transactions, those which are farthest behind in simulated time. The selection of transactions is still oriented toward the min time set but will, as explained in section 3.3.2.3, select transactions which are not at the min time. When all transactions in the min time set are in the process of being moved, the next transaction to move is chosen from the set nearest to the min time. This selection method reduces the time spread on transactions being processed and minimizes the conflicts caused by overlapped times. 
Recall that blocks with S = can be processed only by min time transactions. By always working on transactions which are min time, or near to it, the min time (which represents the upper limit of completed simulation time) advances as rapidly as possible. The result is that the real time spent by transactions at blocks with S = 0, waiting for the model to "catch up" to them, is reduced. Thus selection of the transaction to move is based on the transaction time with smaller values being selected. Following selection of a transaction, a task is set up for each block in the partition the transaction is in. Part of the transaction status information is a word which identifies the next block it is to process. Se- quential blocks are in the same partition until the code bit P = 0. The tasks formed fall into the categories of being available for processing im- mediately or being available when the transaction becomes a min time trans- action. Two queues, called "process" and "delay" respectively, are main- tained for the two categories. Task formation and routing to one of the two queues is controlled by the processing code bits, SP, assigned to each block in the program. The interpretation of the code is given now. The significance of S = is that 114 the block uses a system variable and processing must be delayed until the transaction time is the minimum of all transaction times. The significance of P = is that the block is the last in a partition of simultaneously executable blocks. The transaction cannot continue moving until the current partition has been processed. The action taken by the coordination unit when examining the block code is listed in Table k.2. Code S P Coordination Unit Action 1 1 Add task to process queue. Examine next block. 1 Add task to process queue. Select next transaction. 1 Add task to delay queue. Examine next block. Add task to delay queue. Select next transaction. Table k.2. Processing Code Interpretation The actual implementation has one modification. All tasks in a partition following the first one that is added to the delay queue are also added to the delay queue. The reason is that blocks which use system vari- ables are assigned S = 1 if a prior block has been assigned S = and there can be no change in transaction time between the two blocks. In summary, the coordination unit contains logic to select the transaction with the minimum value of simulated time, to examine the code for the block it is to process next, and to place the transaction-block pair, a task, in one of two queues. The process queue receives tasks that can be executed as soon as a processor is available. The delay queue receives tasks which must wait for the min time of the model to reach their scheduled event time. The coordination unit releases tasks from the queues for distribution 115 to the processors. Figure k.3 is a diagram of the unit. Discussion of the components is covered in following sections starting with the selection of a transaction to move. k.k.l Transaction Selection Transactions in this machine can be in several states, some of which make a transaction ineligible for movement. The logic of this part keeps a record of the state of all transactions and allows selection of movable ones according to their simulated time ordering. When a transaction is selected for movement, all blocks in the current partition are set up as tasks for processing. The transaction is not eligible to move again until all tasks in the partition are completed. 
A transaction in this state is considered "selected".

A transaction may reach a blocking condition, defined in section 4.4.1. Such a transaction cannot move, but note that its simulated time is implicitly updated to the time when the blocking condition is removed. A similar situation exists for transactions which are put on user chains by means of the LINK block. They cannot be moved until they are removed from the chain by an UNLINK block. Transactions which are blocked or are on user chains are considered to be in the same state with respect to selection for movement. A transaction in this state is both "selected" and "blocked".

A third state is the transaction which has entered the delay queue. The minimum simulated time in the model must be known so that transactions in this "delayed" state can be removed from the queue.

All of the above transactions are in the state of having been selected for movement. The final state consists of all unselected transactions. It is from these that one must be selected for movement.

Figure 4.3. Coordination Unit. (The original diagram shows: status returning from the task processors into a 1024 word by 42 bit input associative memory holding the transaction number, 32 bit time, 8 bit priority, and a 10 bit next block field; the transaction selection logic of section 4.4.1; a 1024 by 2 bit processing code register feeding the processing code evaluation and task routing logic of section 4.4.2; the delay queue of 128 words by 22 bits (section 4.4.3.2) and the process queue of 16 words by 20 bits (section 4.4.3.1), each with read and write address registers and an address comparator to inhibit loading when full, the delay queue adding a transaction number register and comparator; and the queue source selection and task output logic of section 4.4.4, which accepts task requests from, and delivers tasks to, the task processors.)

The logic in this part is primarily a moderate size memory on which a minimum value search can be performed associatively. One word per transaction is required. The number of words in the memory is the number of transactions defined as the maximum for the particular machine. The normal GPSS allocation is 600 transactions for the 128K memory size and 1200 for 256K or higher. Thus a 1024 word memory would be adequate for quite large simulations. Each word must contain fields for the transaction time, priority, and status indicators. The time and priority fields in GPSS use 32 and eight bits respectively. These fields, plus two bits to indicate "selected," C, and "blocked," B, transactions give a word length of 42 bits. The format for each word is given in Figure 4.4.

    bits 1-2     "Blocked" B, "Selected" C
    bits 3-34    transaction time
    bits 35-42   priority
    (each word also carries response indicators R1 and R2 and a "Delayed" bit D)

    Figure 4.4. Word Format of Transaction Status Memory

Priority is an extension of the time field on the least significant end. If a low value in the priority field is defined to represent a high priority, the highest priority transaction can be found by continuing the minimum value search through the priority field.

The C and B control bits are at the most significant end of the time field. C is set to one when a transaction is selected for movement. It is reset when processing of its partition is complete and it becomes eligible for selection again. B is set to one when a transaction is blocked or put on a user chain. It is reset when the blocking condition is removed or the transaction is removed from the user chain.
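The packing of these fields can be made concrete in software. The sketch below is illustrative only, assuming the bit numbering of Figure 4.4 with bit 1 most significant; the function names are invented for the example and are not part of the design.

```python
# Minimal sketch (not from the thesis): packing the 42-bit transaction
# status word of Figure 4.4.  B and C occupy the two most significant
# bits, then the 32-bit time, then the 8-bit priority.
def pack_status(blocked: int, selected: int, time: int, priority: int) -> int:
    assert 0 <= time < 2**32 and 0 <= priority < 2**8
    return (blocked << 41) | (selected << 40) | (time << 8) | priority

def unpack_status(word: int):
    return ((word >> 41) & 1,          # B: "blocked"
            (word >> 40) & 1,          # C: "selected"
            (word >> 8) & 0xFFFFFFFF,  # transaction time
            word & 0xFF)               # priority (low value = high priority)

# Because B and C sit above the time field, and priority extends it on
# the least significant end, an ordinary minimum-value comparison of the
# whole word reproduces the hardware search order: unblocked, unselected
# transactions first, then earliest time, then highest priority.
eligible = [w for w in (pack_status(0, 0, 120, 3), pack_status(0, 1, 100, 1))
            if not (w >> 40) & 0b11]   # mask out words with B or C set
winner = min(eligible) if eligible else None
```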
If either bit is one, the transaction will not be a responder to a minimum value search unless there are no unselected transactions. The "delayed," D, bit is not part of the minimum value search. The D bit is set when a transaction is routed to the delay queue. It is reset when the transaction is removed from the queue.

Each word has two response indicators, enabled on slightly different search bits. The first, R1, is enabled for bit B and the time and priority fields. Responders indicated by R1 are transactions in the min time set, whether previously selected or not. The second response indicator, R2, is enabled for bit C in addition to the bits for R1. R2 responders are nonselected transactions with the smallest value of time.

Any R1 responders which are also in the delay queue are now eligible for execution. Bit D is ANDed with R1 to eliminate responders that are not in the delay queue. The D bit for all remaining responders is reset, since the transactions will be released from the queue. After releasing transactions from the delay queue, responders in R2 are selected for movement. The action upon selection is covered in the next section. When all R2 responders have been selected, the interrogation of the memory begins again. Clearly it is necessary to update the memory when the status of any control bit, or any value in the time or priority fields, changes.

4.4.2 Processing Code Evaluation

Each transaction has a word identifying the block into which it will move next. The identifier is the number of the block in the program. In the coordination unit the next block number is used as a pointer to the needed processing code register entry. When a transaction is selected for movement, the processing code for the next block it is to execute is examined and interpreted according to Table 4.2. Sequential block codes are examined until P = 0, indicating the end of the partition and the need to select the next transaction. For each block in the partition a task is set up and routed to either the process or delay queue, described below. If a task is routed to the delay queue, the D bit for that transaction is set to indicate the transaction is being delayed.

A task is fully defined by the transaction and next block numbers. Since 1024 is the maximum number of both transactions and blocks, 10 bit fields are required for each. In the task processors the next block number is used as the address of a word which identifies the block type and the particular occurrence of this type.

4.4.3 Task Queues

4.4.3.1 Process Queue

Tasks in this queue are available for processing at the request of task processors. Each entry is a transaction number and the number of the block it is to execute. At 10 bits for each of these numbers, an entry is 20 bits. This queue provides a backlog of tasks for the processors. When it becomes full, the formation of tasks can be halted. The queue only needs to be long enough to assure that it cannot be emptied before task formation can resume. For a machine with eight task processors a queue of length 16 is easily long enough.

Design of the queue is circular with a first-in first-out discipline. Separate read and write address registers point to the oldest entry and the first empty location. The registers are incremented following every read or write. An address comparator detects a full queue and inhibits the formation of any more tasks.
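A software analogy of sections 4.4.2 and 4.4.3.1 can make the task formation path concrete. The sketch below is illustrative Python, not the hardware: it routes tasks by the SP code of Table 4.2, with the sticky delay-queue modification of section 4.4, into circular queues with read/write pointers and full detection. All names are invented for the example, and the one-slot-unused full test is a software convention standing in for the address comparator.

```python
from collections import namedtuple

Task = namedtuple("Task", "xact block")

class CircularQueue:
    """FIFO with separate read/write pointers, as in section 4.4.3.1."""
    def __init__(self, length):
        self.slots = [None] * length
        self.read = 0    # points at the oldest entry
        self.write = 0   # points at the first empty location

    def full(self):      # plays the role of the address comparator
        return (self.write + 1) % len(self.slots) == self.read

    def empty(self):
        return self.read == self.write

    def put(self, task):
        assert not self.full(), "loading inhibited when full"
        self.slots[self.write] = task
        self.write = (self.write + 1) % len(self.slots)

    def get(self):
        task = self.slots[self.read]
        self.read = (self.read + 1) % len(self.slots)
        return task

def form_tasks(xact, next_block, code, process_q, delay_q):
    """Walk sequential blocks, routing per Table 4.2, until P = 0.
    code[b] holds the (S, P) bits assigned to block number b."""
    delayed = False
    b = next_block
    while True:
        s, p = code[b]
        if s == 0:
            delayed = True   # first S = 0 task goes to the delay queue...
        queue = delay_q if delayed else process_q  # ...and so do the rest
        queue.put(Task(xact, b))
        if p == 0:           # last block in the partition:
            return           # select the next transaction
        b += 1
```

For the parameters chosen in the text, the process queue would be `CircularQueue(16)` and the delay queue `CircularQueue(128)`.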
4.4.3.2 Delay Queue

Tasks which must be processed in a simulated time order, because of their use of system variables, are routed to this queue. Processing is delayed until the transaction becomes a min time transaction. The release of tasks is controlled by the transaction selection logic, which interrogates the time value of all transactions to identify those in the min time set.

Recall that all blocks in a partition following the first block with S = 0 were also routed to the delay queue. Thus when a min time transaction is found to be in the delay queue, the task that caused transaction movement to be delayed is released, as well as the subsequent tasks in the same partition. The processing code P bit is used to identify these tasks.

Tasks in the delay queue can be in the state of having been released but not yet removed from the queue for processing. An indicator is needed to distinguish these tasks from those not yet released. Each entry in this queue needs, therefore, the 20 bits that identify a task, a bit for the P portion of the processing code, and a bit to indicate released tasks. Actual setting of the release bit is based on a comparison of the number of the transaction to be released and the transaction numbers in the queue.

This queue holds tasks which become available for processing when the model reaches the condition that no transaction has a simulated time less than that of the transaction being delayed. The queue length in this case must provide for potentially many tasks. It is suggested that this queue be of length 128. When it becomes full, further loading must be inhibited. The design is similar to the process queue except that the removal of entries is based on matching transaction numbers rather than length of time in the queue. Movement of tasks from the queues to the task processors is controlled by the task output logic described next.

4.4.4 Task Output

The two sources of tasks for output to the task processors are the process and delay queues. The overall processing algorithm is to move transactions which have the smallest time values. Transactions in both queues have time values which range upward from the minimum completed time. Tasks in the delay queue which are marked as released belong to the min time transaction set by definition. These tasks have priority over tasks from the process queue for output. If any release indicator bit in the delay queue is set, that queue is the source of tasks for output. When all released tasks have been output for processing, tasks are taken sequentially from the process queue.

The task output logic communicates with the task processors. It accepts requests for tasks. It identifies, for the requesting processor, the transaction and next block number that make up the task.

4.4.5 Coordination Unit Hardware

This unit must be able to form and dispatch tasks at a rate to match the task processor consumption. While the number of task processors is unlimited in theory, this unit will place a limit on the number it can support. The design and hardware suggested here are for the eight task processor configuration proposed in section 4.3.3. Suggestions will be given for increasing that number.

A pipeline effect exists in the selection of transactions and formation of tasks. Transactions which have been moved through the blocks in a partition enter at the top of the coordination unit with time and next block data. They filter through the unit to exit as tasks for further block movement.
The memory can be updated while responders from the previous search are being resolved and examined by the code evaluation logic. Note that one interrogation may yield more than one responder, and each responder will yield an average of two tasks. Loading of the two queues can be interleaved with unloading by the task output logic.

Average task execution time, as determined from Table 3.1 with the assumption of 500 nanoseconds (ns) per instruction, is 10 microseconds. For eight task processors, with full utilization, the task output rate should therefore average one task per 1.25 microseconds. With ordinary TTL logic having typical propagation delays of 10 ns, the task output logic will have no trouble satisfying that rate.

In the worst case for the process queue, it must be both loaded and unloaded at that rate. Since the two actions cannot be concurrent, the rate for either must be near 600 ns. Clearly there will be adequate time to decode a four bit address and either read or write a 20 bit word.

The delay queue is more complicated. Loading is similar to the process queue, but unloading requires transaction number comparisons. When the transaction selection logic has an R1 responder, that transaction number is known to be in the delay queue. Entries in the queue are not required to be in simulated time order but will tend to be so due to the selection algorithm. The comparison begins with the oldest entry and moves sequentially through the queue. When the matching transaction is located, the task is released. All other tasks for this transaction, identified as all subsequent tasks in the queue until the partition code bit, P, is zero, have the release indicator set. They are thereby marked as executable. The transaction location sequence of reading, followed by a comparison and either an increment of the read address register or release of a task, is a two clock sequence.

If the process queue is empty and all tasks must come from the delay queue, the release rate must match the task output rate, 1.25 microseconds per task. At this point a 100 ns clock rate is specified for the coordination unit. At two clocks to load a task, 200 ns are used. This leaves 1000 ns to locate and release a task, so five transaction number comparisons can be performed in the allotted time. This is thought to be quite adequate. The conclusion, with respect to the delay queue, is that it is capable of being loaded and unloaded in the required time even when it is the only source of tasks.

Feeding the two queues is the processing code evaluation logic. The output requirement here is again an average of one task per 1.25 microseconds. There is clearly no time problem if selected transactions can arrive at a fast enough rate. Since there is an average of two tasks per partition, the arrival rate must be one transaction every 2.5 microseconds from the transaction selection logic.

A bit serial interrogation at the rate of 100 ns per bit requires 4.2 microseconds to cover the 42 bit associative memory. Thus transaction selection appears to limit the operating speed of the coordination unit. Comments here will show that the situation is not desperate. First note that it is possible to have more than one transaction respond to a search. If there is an average of two responders, the output rate is satisfactory. In general it is not necessary to interrogate all bits to determine the minimum value. Interrogation can cease when the minimum is found, thereby reducing the search time.
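The bit serial search, and the early exit just mentioned, can be sketched in software. This is an illustrative Python model of the associative search, not the hardware, and the refinements described next (grouped leading bits and near-minimum clusters) operate on the same loop.

```python
def min_search(words, width=42):
    """Bit-serial minimum search over an associative memory, modeled in
    software.  words are integers; bits are interrogated from the most
    significant bit down, one 100 ns bit time per step."""
    responders = set(range(len(words)))      # initially all words respond
    for bit in range(width - 1, -1, -1):
        zeros = {i for i in responders if not (words[i] >> bit) & 1}
        if zeros:                 # a word with 0 here beats all with 1
            responders = zeros
        if len(responders) == 1:  # minimum isolated: interrogation can
            break                 # cease, saving the remaining bit times
    return responders             # all words tied at the minimum value

# e.g. min_search([0b101100, 0b100111, 0b100101], width=6) -> {2}
```

Grouping the 16 most significant bits into a single parallel interrogation, as suggested below, replaces the first 16 iterations of this loop with one step whenever any responder survives the group.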
For a minimum value search all words are initially considered responders and the interrogation begins with the most significant bit. It is seldom that the discrimination which produces the final responders occurs in the most significant bits. For this reason it is possible to group the interrogation of these bits. Suppose, for example, the 16 most significant bits were interrogated in parallel. If responders still exist, serial interrogation continues with a savings of 15 interrogation times. If no responders exist, it is necessary to re-initialize the response stores and begin a bit serial interrogation at the most significant bit. The time penalty has been one interrogation plus initialization.

Another scheme which reduces the number of interrogations and increases the number of responders is based on the fact that R2 responders need not be true minimums. By always stopping the interrogation a small number of bits short of the least significant bit, a cluster of transactions near the minimum is selected. It was noted earlier that just two responders eliminated the time problem.

Finally, note that the original calculation was based on the maximum task output rate.

The conclusion here is that the proposed coordination unit hardware, implemented with 10 ns logic, will perform with sufficient speed to support eight task processors. Suggestions have been given on ways to increase the speed in the area which appears to be most limiting. The estimated number of logic circuits in the unit is 10,000, plus the memory chips.

4.5 Memories

Data and program memories are separated in this machine and will be discussed separately.

4.5.1 Data Memory

Processors are organized as a cluster of unit processors to form a task processor, then a cluster of task processors. The data required by a task processor is the data associated with the transaction and with the block for the task. The memory organization given here uses a small memory at each task processor for the data required for that task, and a main memory to which each task processor has access. The main memory provides the only communication between task processors. Execution of a task involves a transfer of all data related to the task from the main memory to the task memory, followed by restoration of the variables changed in the execution.

The maximum amount of transaction data in GPSS/360 occurs when 100 full word transaction parameters are used. In this case the total transaction data is 436 bytes. Blocks use a basic allocation of 12 bytes plus four bytes per operand. There is a maximum of seven operands per block, with the average number near three. At four bytes per operand the average allocation is 24 bytes. Transaction and block data, totalling 460 bytes for maximum transaction parameter allocation and typical blocks, must be transferred to each task processor for each task. Each task may also use system variables which are accessed directly from the main memory. The number of such variables per block is small; to design for 10 variables, or 40 bytes, would be liberal. Total task data is then 500 bytes. Given this background, the data memory sizes, bandwidths, and interconnections are discussed in this section.

4.5.1.1 Main Memory

The starting assumption for calculations on the main memory is that the time required to transfer task data from main to task memory is equal to the compute time. The time to restore changed variables is negligible. Average task execution time is 10 microseconds (us).
The bandwidth per task, for maximum size transactions, is

    B.W.(main-task) = (500 bytes x 8 bits/byte) / (10 x 10^-6 seconds)
                    = 400 x 10^6 bits/second.

The total main memory bandwidth, for eight task processors, is

    B.W.(main) = 8 x 400 x 10^6 b/s = 3200 x 10^6 b/s.

An ordinary core memory can supply a 64 bit word in 0.5 us. To achieve the required maximum bandwidth the number of memories is

    M(max) = (3200 x 10^6 b/s) / (128 x 10^6 b/s per memory) = 25 memories.

Design of the machine with 32 main memory units more than satisfies the maximum bandwidth requirements. A 16 memory design provides a bandwidth of

    B.W.(16) = 16 x 128 x 10^6 b/s ~ 2000 x 10^6 b/s.

Each task processor would receive one-eighth of this, or 250 x 10^6 b/s. This leads to a reduction of 200 bytes in the task bandwidth, or 50 full word parameters. Thus 16 memories can supply 50 fullword or 100 halfword parameters. The choice between 16 and 32 memories is a design option. The lower number is satisfactory for a high percentage of actual simulations.

Total variable memory size is somewhat flexible. The largest GPSS/360 option begins at 256K bytes and extends upward to the core limits. Suppose 400K bytes were chosen for a large system using 16 memories. Each memory is then 25K bytes, or less than 4K 64 bit words. Main memory is therefore 16 modules of 4K word, 64 bit memories with a 0.5 microsecond cycle time. Optionally, 32 slower modules could be used.

The switch to connect this memory to the task processors is a fanout tree. Such a tree has on the order of three gates per bit per destination. The number of bits in the path is the product of the number of memories and the bits per memory, or 16 x 64 ~ 1000. The eight task processors are the destinations. The gates in the switch thus total 3 x 1000 x 8 = 24K.

4.5.1.2 Task Memory

Each task memory must be capable of storing the data for the current task plus the data being set up for the next task. That is, it must have storage for twice the maximum task data requirement. The maximum amount of task data is 436 bytes for the transaction and 40 bytes for the block, for a total near 500 bytes. Let the memory be 1K bytes.

The bandwidth of data moving between this memory and the main memory was previously calculated as B.W.(main-task) = 400 x 10^6 bits per second. The task memory must simultaneously provide operands for the task processor. The number of memory accesses by a block program with the 10 us average execution time is near 50. If these are all 32 bit words, the memory to processor bandwidth is

    B.W.(task-proc) = (50 words x 32 bits/word) / (10 x 10^-6 seconds)
                    = 160 x 10^6 b/s.

This bandwidth is satisfactory only if there are no conflicts in accessing memory. Conflict free access would require one memory capable of supplying the total bandwidth going between the task memory and both the main memory and the task processors, that is, 560 x 10^6 b/s. For 32 bit words, the cycle time would have to be

    t(cycle) <= 32 bits / (560 x 10^6 b/s) ~ 60 x 10^-9 seconds.

A 1K byte memory with a cycle time of 60 ns per 32 bits does not meet the design objective of inexpensive parts. To lower the cycle time an array of eight memories is suggested. The memories should have 32 bit word lengths. A total of 250 words is needed, so each memory has only 32 words. The connection to the task processors is then a crossbar switch; a not unreasonable connection for the small numbers involved.
For example, connecting eight processors and memories with a 32 bit path takes 8 x 8 x 32 ~ 2K gates. If the design uses 16 unit processors the connection takes 4K gates.

Memory conflicts are possible with this modular scheme. Recall from section 4.2 that compilation of the routines can finely tune the code produced so as to minimize conflicts. Assume, however, that conflicts do occur one-fourth of the time, increasing the required bandwidth between the memories and processors to 200 x 10^6 b/s. Total task memory bandwidth is then 600 x 10^6 b/s. The cycle time is

    t(cycle) <= 32 bits / ((600 x 10^6 b/s) / 8 memories) ~ 400 x 10^-9 seconds.

A 400 ns cycle is within the capabilities of current memory devices. The task memory, in summary, is an array of eight memories, each with 32 words of 32 bits. The cycle time of each is 400 ns. The memory is connected to the unit processors by a crossbar switch.

4.5.2 Program Memory

Section 4.2 mentioned that a GPSS program is a series of calls for execution of the routines that correspond to blocks. The routines are compiled one time and remain valid until the language definition is changed. Since this executable code does not change, it can be stored in a read only memory.

Consider the size of the memory required to store the program for all 44 block types. From Table 3.1, the number of Fortran statements per block averages near 50. Assume a factor of four is needed to convert Fortran statements to machine instructions. Then 200 machine instructions per block, and 8800 instructions altogether, are required.

The instruction format for unit processors will now be mentioned. The instruction set is small and will be fixed at 16, so four bits are needed for the operation code. Unit processors communicate with the task memory, a 256 word memory requiring eight bits to address. A single address instruction format therefore uses 12 bits. Assume four additional bits can be used, as for indirect addressing, giving a total of 16 bits per instruction.

A long instruction is defined as one suitable to drive all the unit processors within a task processor. It is 128 bits for the eight unit processor design, or 256 bits for 16 unit processors. With eight task processors, each capable of executing any task, the demand for instructions is great. Assuming a 500 ns average instruction execution time, this memory must be capable of delivering a long instruction at intervals of 62 ns. Such capability fails to meet the objective of inexpensive parts. Alternatives are examined.

The program memory size is 8800 x 16 = 140,800 bits. Organized as 64 bit words it is 2200 words. If instructions are distributed from a read only main program memory to local program memories at each task processor, the local memories must be capable of holding the two longest block programs. For the blocks converted to Fortran, this would be near 200 Fortran instructions, or 800 machine instructions. The local memories, which must be read/write, would have a bandwidth requirement composed of program loading and instruction reading. These components are

    B.W.(load) = (200 instructions/program x 16 bits/instruction)
                 / (10 x 10^-6 seconds/program) = 320 x 10^6 b/s

    B.W.(read) = (128 bits/instruction) / (0.5 x 10^-6 seconds/instruction)
               = 256 x 10^6 b/s.

The combined bandwidth of 576 x 10^6 b/s is the same as that required for the smaller task memories. Since the program can be stored in a read only memory, it becomes attractive to consider duplicating the program at each task processor.
The tradeoff is eight read/write memories capable of storing the two longest blocks against seven additional full program read only memories (at roughly half the bandwidth). The read only memory scheme will be chosen.

With a read only memory at each task processor the bandwidth is the B.W.(read) calculated above, 256 x 10^6 b/s. That is the figure for eight unit processors. If the memory is organized such that one cycle is capable of supplying one long instruction, the cycle time is the instruction execution time, which averages near 500 ns. An example of such an organization, for the eight unit processor case, would be two memories with 64 bit words.

The program memory is thus 2200 words of 64 bits. It is read only with a 500 ns cycle time. The memory is duplicated at each of the eight task processors.

4.6 Machine Design Summary and Performance Estimates

Throughout this chapter estimates have been made on the hardware involved in the components. The estimates are tabulated and totaled in Table 4.3. Estimates of program speedup and concurrency in processing have been given in various sections. These estimates will be summarized and discussed here.

The speedup in executing an individual GPSS block was determined in section 3.3.2.1. Many block types were converted to Fortran and analyzed.

    Component                    Option       Gates and     Memory
                                              Flip Flops    (words x bits/word)
    Unit Processor (U.P.)                       1,000       -
    Decision Processor           six levels       500       64 x 12
    Task Processor Control Unit                 3,500       -
    Task Data Memory and Switch  8 U.P.         2,000       256 x 32
                                 16 U.P.        4,000       256 x 32
    Task Program Memory                             -       2200 x 64 (read only)
    Task Processor Totals        8 U.P.        14,000
                                 16 U.P.       24,000
    Coordination Unit                          10,000       1024 x 42 (associative),
                                                            plus delay and process
                                                            queues (128 x 22, 16 x 20)
    Main Data Memory and Switch  16 modules    24,000       64K x 64
    Total Machine                8 U.P.       146,000       65K x 64;
    (8 Task Processors)          16 U.P.      226,000       17K x 64 (read only);
                                                            1024 x 42 (associative)

    Table 4.3. Total Machine Hardware Summary

Results are given in Tables 3.1 and 3.2. From Table 3.2 it can be seen that execution speedup ranged from one to eight, and that the most frequently occurring speedup factor was four. It is reasonable to expect a speedup of four by multiprocessing individual blocks. The unit processors of section 4.3.1 provide this processing.

Further speedup can be achieved by simultaneous processing of more than one block. The number of blocks that can be processed simultaneously, related to one transaction, is covered in sections 3.3.2.2 and 3.3.3. Table 3.5 gives the results of analyzing several GPSS programs and shows, in column (5), that an average of two blocks can be processed simultaneously. This gives a speedup factor of two for each transaction, which is multiplicative with the factor of four for each block.

Further speedup can be achieved by concurrently moving more than one transaction. This was studied in sections 3.3.2.3 and 3.3.3. The number of concurrent transactions was measured on a simulation system described in the Appendix. The results, given in Table 4.1, vary with the program being simulated. The number, in the range of two to five, is the speedup factor due to multiple transaction processing. The configuration of multiple task processors is needed for the speedup due to processing more than one block per transaction and moving more than one transaction.

There is yet another area which contributes to the speedup of this machine over a serial machine. This area is the selection of a transaction to move and a block to execute.
In a serial machine the selection is through the software overall scan algorithm shown in Figures 3.2, 3.3, and 3.4. In the machine designed here, the selection is carried out by the hardware coordination unit of section 4.4. The software algorithm is used following the execution of each block. It is a fairly complicated algorithm and is estimated to take as much execution time as the actual processing of the block it selects. The hardware algorithm operates in parallel with the processing of blocks. This should realize another factor of two speedup over any serial execution machine.

Combining the speedup factors gives the total expected execution speedup of the organization presented here over a serial organization. The result covers a range due to the range of the transaction concurrency factor, f_tc. Excluding f_tc, the speedup is by a factor of 16: four within each block, two for concurrent blocks, and two for hardware task selection. The total speedup is 16 * f_tc, and with f_tc between two and five it therefore ranges from 32 to 80. This improvement in execution speed was achieved with the use of currently available, moderately priced hardware.

5. CONCLUSION

This thesis has resulted in the design of a machine which yields a significant improvement in execution time for discrete time simulation languages similar to GPSS. A multiprocessor organization, using currently available logic and memory elements, is employed. The machine includes a device to assist in the evaluation of decision trees. Specific results are presented at several places in the preceding chapters. The locations of the results are given for reference in the following paragraphs.

Chapter 2 was concerned with a "decision processor" for finding paths through decision trees. This device, designed for use in a general multiprocessor environment, is shown in block diagram form in Figure 2.8. Logic circuit counts are given in Table 2.2 and delays are given in Table 2.3. As an example of the values, a processor capable of evaluating trees with up to 255 nodes requires fewer than 2000 gates plus a memory on the order of 64 words by 12 bits. The same processor can evaluate up to eight tree levels in approximately one clock cycle. An example of the processor operation is given in section 2.5.1.

Simulation was first mentioned in Chapter 3. GPSS is discussed and analyzed for execution concurrency. Results of analyzing Fortran versions of 21 GPSS block types on the system described in [10] are given in Tables 3.1 and 3.2. One conclusion taken from those tables is that a factor of four speedup is possible within the blocks. Concurrency between blocks is shown to exist, providing potential for additional speedup.

In Chapter 4 a machine was designed which exploits the concurrency
It implements Algorithm 3.1 for partitioning, and the system variable table look-up, to do the code assignment. It also generates the statistics presented in Table 3-5« Another function of this program has not been previously mentioned. It produces punched card output to modify the original test GPSS program such that a run time "trace" of the original program can be gathered. The trace is explained in the next section, which covers the program that gathers the trace data. A. 2 Trace Data Extraction: XTRAC The goal of the complete test system is to simulate the execution of some real GPSS programs on the proposed machine. To simulate the execution it is necessary to know certain things about the actual execution. It was noted in Chapter 3 that the execution sequence for a GPSS program cannot be . determined from the source code. It can only be determined by tracing the execution. Any trace should identify the block being executed and the 138 Statistics on partitions . Table 3-5 GPSS to Fortran conversion. Section 3*3 .2.1 Fortran block type analysis. Section 3-3-2.1 Block execution time tables. Section A. k H Test GPSS Program GPSS Scanner Program (PL/I) Sections 3-3-3, A.l, A. 2, and A.k Test program block types and processing code. Section A.k 1 Modified GPSS test program. XTPAC routine added. (360 Assem. Lang.) Section A. 2 in- coordination Unit Simulator (GPSS) Section A.k Data insertion routine INSRT. (360 Assembler Language) Section A. 3 1 GPSS/36O System New machine performance Chapter k- Actual execution trace data. Section A. 2 Figure A.l. Simulation Test System 139 transaction which called for the execution. A third piece of information, the simulated time at which the block is "being executed, is needed for this test system. An assembly language routine was written to gather this data and output it on punched cards. The routine is called by the special GPSS HELP block. The HELP block allows a user to write his own routines to supplement the fixed set provided by GPSS. It was needed here to access the trans- action number, a variable not available to the programmer as a standard numerical attribute, and to provide the punched output. Trace data is not needed at every block in a program. It is needed at those blocks which are the first in a partition of simultaneously executable blocks, or at labeled blocks which may be the first in a run time partition due to a transfer to that block. The scanner program of the previous section generates a card deck containing all the appropriate HELP blocks which are then merged into the original GPSS program. The modified program thus formed is logically equivalent to the original. Specifically, when a partition ends, the next block must be the first block in the next partition. The scanner program can simply punch a HELP block card following every block for which P, the partition code bit, is zero. For labeled blocks the action is slightly more involved. For any labeled block it is necessary to give that label to the HELP block, then punch a card for the original block without the label. The original labeled block is replaced by a labeled HELP and an unlabeled copy of itself. If the labeled block is not the first in a partition, any transaction which moves into it from the block directly above does not need the trace data. In this lUO case a TRANSFER block is used to let such a transaction branch around the HELP. Thus three cards are punched. 
Note that the addition of these HELP blocks changes the block numbering of the original program when a trace is being taken. For this reason the GPSS internal block number variable cannot be used for the next block data in the trace. The format of the HELP block allows the use of parameters which can be read by the routine. Thus the scanner program, which keeps a block counter, can provide the original block number as a parameter.

The card deck generated by this program contains the essential information on the actual execution of a test program. This deck is input data for the machine simulator. The operation of inserting it into the simulator is done by the program of the next section.

A.3 Trace Data Insertion: INSRT

GPSS is weak in the areas of input and output. It was necessary to write an assembly language routine to load the trace data described in the previous section. The routine is called by a HELP block, as was the data extraction routine.

Savevalue locations in the simulator are reserved for the trace data. A table of 300 half words is used for transaction number data. A table of 300 full words, with the same savevalue index, is used for simulated time data. Within the simulator the same number can be used as the index for a half word savevalue to get a transaction number and as a full word savevalue index to get the corresponding simulated time. The trace information is completed by the next block number data. This is stored in a 300 half word table with the index offset by 300 from the transaction number table.

The routine to insert this data can initially load 300 trace records. Trace data is time ordered. Its use is biased towards, but not limited to, the time ordering. Once a record is read it is no longer needed. When data not currently in the tables is needed, the insertion routine will compact the existing unused data and load new records until the tables are full or an end of the data file is reached.

At this point the gathering of data needed for simulating the execution of real GPSS programs, and a routine for loading the data into the simulator, have been described. The remaining step is the simulation.

A.4 Coordination Unit Simulation

This is a simulator, written in GPSS, of a machine for processing the execution phase of GPSS. The purpose of the simulator is twofold. First, the number of transactions which can be processed concurrently is desired. Multiple transaction concurrency offers parallelism of a type that does not exist in the procedural languages. It cannot be measured by examining the source program. Simulation of the machine of Chapter 4 executing real GPSS programs, using the trace data from actual executions, does provide a measure. The results of this first purpose of the simulator were used in deciding on a reasonable number of task processors. The results are given in Table 4.1. The second purpose is to determine the capability of the coordination unit to select tasks and keep the task processors busy. These results are also given in Table 4.1.

Programming emphasis is on the simulation of the coordination unit. It is assumed that the memory system of the machine does not cause delays which must be simulated. Task processors are simulated only to the extent of requesting a task from the coordination unit, accepting one, and advancing time by the execution time for the task. The task execution time for a block type is the average over all traces of the parallel machine execution time.
Analysis of block types for parallelism, covered in section 3-3 .2.1, is the source of the time figures. A conversion of 500 nanoseconds per time step, T , was used. This allows an average of two or three clock pulses per operation with a 5MHz clock. Information on the test program whose execution is being simulated is unique to each program and must be supplied to the simulator. Static information consisting of a representation of the program and the processing code is given as a GPSS function. The dynamic trace information is read into the model by the INSRT routine of the previous section. The function is in the simulator listing with the label TYPCD. There is a function point for each block in the test program. The point gives the block type and processing code for the block. Block types are numbered from one to hk in alphabetic order. The function definition and follower cards are another part of the scanner program output. A listing of the simulator follows. 1*3 // EXEC DUMMY //D01 DO DSN=£X,SPACE=(TRK, (20,5,2) ),OISP=I, PASS) ,UNIT=DISK // EXEC LKEDASM,PARM=«LIST, MAP, REUS* //LKED.SYSLMOD DO DSN=tX< INSRT ) ,DI SP* ( CLD, PASS ) //LKED.SYSIN DO * < OBJECT DECK FOR SUBRCUTINE INSPT > // EXEC PGM=DAG01,PARM=«B* //DINTERO DD UNITED! SK, SPACE=(CYL ,( 1 , 1 ) ) //DINTWORK DD UNI T=D I SK , SP A CE= ( CYL , ( 1 , 1 ) ) //DOUTPUT CD SYSOLT=A //OREPTGEN DD UN IT=DI SK , SP ACE=(CYL , ( I ♦ 1 ) ) //DSYMTAB DD UN I T = Dl SK, SPACE= ( CYL , ( I , I ) ) //STEPLIB DD DSN=£X,DISP=(CLC,PASS) // OD DSN=SYS1 .GPSSL IB,DISP=SHR //SYSPRINT DD SYSOUT=A //DINPUT1 DD CCNAME=SYSIN //SYSIN DD * REALLOCATE XAC , 20 J , F SV ,6C0 ,HSV ,8C0 , BLO, 30 SIMULATE * * THIS GPSS PROGRAM IS A SIMULATION OF THE CONTROL UNIT OF A MACHINE * DESIGNED TO PUN SIMULATION PROGRAMS USING CONCURRENT PROCESSING * TECHNIQUES. * * THF BASIC UNIT OF TIME IN THIS SIMULATION IS 100 NANOSECONDS. * TRANSACTION PARAMETER USACE * EXECUTION TIKE IN TASK PROCESSORS. SERIAL NUMBER OF THE XACT BEING SIMULATED. NUMBER OF NE XT BLOCK IN THE PROGRAM FOR THIS TASK. VALUE CCMES FRCM TRACE TABLE FOR MASTFR XACTS. VALUE CCMFS FROM INCREMENTING MASTER XACT VALUE FOR SPLIT XACTS. TRANSACTION SIMULATED TIME AT THF START OF THE CONCUR- RENCY GROUP. VALUE TAKEN FROM TRACE TABLES. COUNTER OF TASKS IN A CONCURRENCY GROUP. USED TO DETERMINE WHEN ALL TASKS IN THE GROUP HAVE BEEN PROCESSED. POINTER TO FULLWORO SAVEVALUE IN TIMST WHICH HAS THF TIME, SIMULATFD, FOR THE XACT WITH SERIAL NUMBER P6-4CC. P6=P2+400. POINTER TO HALFWOPD SAVEVALUE IN NOFPE WHICH HAS THE NUMBER C c PROCESSORS FOR BLOCK NR P7-44. P7 = Pll«-44. INCICATES TO OUTPUT UNIT THE SOURCE OF THE TASK. P8 =0 FCR TASKS FROM DELAY QUEUE. Pfl =1 FCR TASKS FROM PROCESS QUEUE. VALUE OF FUNCTION TYPCD IN FORMAT DBBSP. CONTAINS BITS CBB OF P9. NUMERICAL DESIGNATION OF BLOCK TYPE FOR THIS TASK. BITS RB OF pg POINTER TO HALFWORC SAVEVALUE GIVING EXECUTION TIME FOP THIS BLOCK TYPE. BLOCK PROCESSING COCE. BITS SP OF P9. POINTER TO FLLLWORD SAVEVALUE IN TIMST WITH MIN VALUF. POINTER TO MLFWORD SAVEVALUE WITH SERIAL NUMBER AND TO FULLWORC SAVEVALUE WITH SIMULATED TIME IN THE DYNAMIC TRACE TABLES. POINTER TO HJLFWORD SAVEVALUE WITH NEXT BLOCK NUMBFP IN THE TRACE TABLES. P 15 = P14«-30C . 
* PI * P2 * P3 * * * • P^ * * P5 * * * P6 * * * P7 * * P8 * * * P<9 * OK * Pll * * * « P12 * pn * P14 * * * P15 # Ikk HALFWORD SAVEVALUE TABLES AND DATA TABLE OF EXECUTION TIMES FOR GPSS BLOCKS IN THE PARALLEL MACHINE EXECUTION TIMES ARE TAKEN FRCM THE ANALYSIS OF BLOCKS FOR PARALLELISM CN THE SYSTEM REPCRTED IN CHAPTER 3. THE CONVERSION FRCM TIP) TO TIME IS: ONE STEP IN TIP) »500 NANOSECONDS. EXTIM EOU 1(44), H INIT XH1,60 ADVANCE BLOCK TYPICAL EXECUTION TIME TMT XH4,65 ASSIGN INIT XH8.80 DEPART INIT XH9,175 ENTER INIT XH14.U0 GENERATE INIT XH16.50 INDEX INIT XH18,150 LEAVE INIT XH19,90 LINK INIT XH20.6P LOGIC INIT XH22.40 MARK INIT XH24,9C MSAVEVALUE INIT XH27.45 PRIORITY INIT XH28.1C0 QUEUE INIT XH29.70 RELEASE INIT XH32,6C SAVEVALUE INIT XH3A.70 SEIZE INIT XH36,180 SPLIT INIT XH38,100 TERMINATE INIT XH39.5D TEST INIT XH*1,60 TRANSFER INIT XH42,200 UNLINK INIT XH$DCOD,lCO DECCOE RCUTINE * * AN ASSEMBLY LANGUAGE FCUTINE, INSRT, IS CALLED BY THE HELP BLOCK. * THIS ROUTINE LOADS CATA WHICH REPRESENTS THE EXECUTION TRACE OF * TRANSACTIONS IN TEST GPSS PROGRAMS. SIMULATED TRANSACTION SERIAL * NUMBERS ARE PUT IN HW SAVEVALUE LOCATIONS 100 TQ 299. THE NEXT * BLOCK THAT THE TRANSACTION WILL EXECLTF IS LOADED INTO HW * SAVEVALUE LOCATIONS 3C0 TO 599. CO«R ESPONDI NG TRANSACTION AND * BLOCK DATA ARE THUS OFFSET 300 LOCATIONS. SIMULATED TIME FOR * EACH PFCORD IS LOADED INTC FW SAVEVALUE LOCATIONS 1*0 TO 299. * THE TRANSACTION NUMBER ANC TIME DATA HAVE THE SAME INDICES. * * WHEN A TRACE RFCORD IS USED THE TRANSACTION NUMBER FIELD IS * SET TC ZERO. WHEN MOPE TRACE RECORDS ARE NEEDED, INSRT WILL * COMPACT THE TRACE DATA AND LOAD ADDITIONAL RECORDS UNTIL THE * END OF THE TRACE DATA FILE OCCURS. * POINTER TO LAST VALID ENTRY OF DYNAMIC TRACE DATA. LAST EOU 98, H INIT XH$LAST,9<5 * COUNT OF EMPTY LOCATICNS IN TRACE CATA TABLES LDCNT EOU 99, H INIT XHSLDCNT,2C0 * TEST PROGRAM TRACE CATA. DYNAMIC DATA LOADED BY 'HELP INSRT*. XACNR EOU 100(300), F XACT SERIAL NUMBER NXT8K EOU 400(300),F NEXT BLOCK FOR XACT IN XH(*-30O) INIT XH$TSKPR,4 ASSUMES TOTAL OF u PROCESSORS * * THE FOLLOWING TWO SAVFVALLES DEFINE THE NUMBER OF TRANSACTIONS BEING * SIMULATED. THEY ARE PARTICULAR TO THE PROGRAM BEING TESTED. INIT XH$SIMX,56 NOF XACTS IN RAIL FLEET II INIT XH$DMSIM,55 NR OF ACTIVE XACTS MINUS ONE * FULLWORD SAVEVALUE TABLES * TABLE SXCTR HAS ONE ENTRY PER TRANSACTION. THE ENTRY IS A * COUNTER OF TASKS BEING PROCESSED FOR THE TRANSACTION. THE * NUMBER OF TRANSACTIONS BEING PROCESSED CONCURRENTLY IS THE * NUMBER OF ENTRIES WITf A VALUE GREATER THAN OR EQUAL TO ONE. * SINCE THIS TABLE BEGUS AT SAVEVALUE 1, THE TRANSACTION SERIAL * NUMBER, P2* CAN BE USED AS THE INDEX. SXCTR EOU 1<99J,X USED TO TABULATE SIMULTANEOUS XACTS * * TEST PROGRAM TRACE DATA. SIMULATED TIME OF XACT WITH SERIAL NR * GIVEN IN THE HALFWORD SAVEVALUE WITH THE SAME INDEX SIMTM EQU 100(300), > * SIMULATED TIME OF XACLS BEING SIMULATED TIMST EOU 4Cl(iOOI,X X(4C0*J) HAS SIM TIME FOR XACT(J) INIT X401-X500.2147483647 INIT X$MINTM,1 SIMULATED TIME INITIAL VALUE INIT XSXMAX, 2147483647 * * LOGIC SWITCH USAGF PARTN EOU l(10OI,L SWITCH! J) FOR XACT(J) ENABLES ASSEMBLY INIT LSI-LS100 * * FUNCTIONS: * * TYPCD IS PROCUCED AUTCMAT IC ALLY BY THE GPSS ANALYZER PROGRAM. IT * IS A LIST OF THE BLOCKS IN THE PROGRAM BEING ANALYZED, THE * PROCESSING CODE, AND AN INDICATOR OF BLOCKS WHICH USE FN, V, RV, * OR * IN THE OPERAND F IELC AND THUS REQUIRE DECODING. 
* BLOCKS ARE LISTED WITF NUMBERS CORRESPONDING TO ALPHABETIC ORDER * BEGINNING WITH 1 FOR ADVANCE AND ENDING WITH 44 FOR WRITE. * THE FORMAT FOR THIS PACKEC CATA IS CBBSP WHERE: * D IS 1 FOR BLOCKS WHICH REQUIRE OPERAND DECODING; OTHERWISE, * BR IS THE NUMERICAL BLOCK TYPE, * S IS FOR BLOCKS VHICH LSE SYSTEM VARIABLES AND MUST RE * DELAYED; 1 OTHERWI SE , AND * P IS FOR THE LAST BLOCK IN A PARTITION; 1 OTHERWISE. * TYPCO FUNCTION P3,LC49 RAIL FLEET PROGRAM 1 01411 2 T0410 3 03211 4 03601 5 HC411 6 00111 7 f;391^ fl 03810 9 00411 10 04110 11 0C411 12 10400 13 1391? 14 10410 15 10400 16 03910 17 03611 18 1O101 19 13901 20 13910 21 C2810 22 C0900 23 OC 8 1 1 24 01610 25 00110 26 101 1 1 27 13901 28 13911 29 139n 30 ^100 31 01811 32 03810 33 03810 34 13201 35 04110 36 C0110 37 00111 38 C4U0 39 OCUC *C 00111 41 04110 42 01411 43 00410 44 001 11 45 13900 46 13211 47 13903 48 rr 411 49 041 K * * INTER GIVES THE CUMULATIVE PROBABILITY THAT X INTERROGATIONS * ARE REQUIRED IN THE ASSOCIATIVE MEMORY TO DETERMINE THE ENTRY * WITH MIMMUM TIME. INTER FUNC RN4,D16 .01,1/. 02, 2/. 04, 3/. 06 ,4/. 09, 5/. 13,6/. 19, 7/. 27, 8 .3 8, 9/. 52.1C/. 64, 11/. 74, I 2/. 92, 13/. 89, 14/. 95, 15/1. 00,16 * FNDCD FUNC RN2.D2 FUNCTION EVALUATION REQUIRED .5,1/1.0,2 * * THE FIRST PART OF TFE PROGRAM LOADS THE FIRST 300 RECORDS OF TRACF 146 * TABLE DATA AND INITIALIZES CTHER VARIABLES. GENE tttl GENERATE 1 XACT TO LOAD TRACES HELP INSRT LOAD DYNAMIC TRACE DATA QUEUE REQST,XH$TSKPR LET ALL PROCESSORS BE AVAILABLE SAVEVAL LSTMN,V1 TRACE PER XACT START SELECT E 14 , 100 , XH ILAST , P 16 , XH, HOLDS ASSI 4,X*14 PA IS GIVEN THE XACT SIMULATED TIME ASSI 2,XH*14 P2 IS GIVEN THE XACT SERIAL NR SAVEVAL *14,K0,H RESET THE XACT NR SAVEVAL. DATA USED. ASSI 6, VI P6 POINTS TO X{J) IN TIMST 1 VARI P2+4C0 TIMST SAVEVALUES BEGIN AT XAC1 SAVEVAL P6.P4 X(JJ GETS SIM TIME FOR XACT(J-AOC) ASSI 15, V? P15 IS PCINTER TO NEXT BLOCK XH 7 VARI P14+300 NEXT BLOCK XH OFFSET 300 FROM XACT NR ASSI 3,XH*15 P3 IS GIVEN THE XACT NEXT BLOCK * * ASSOCIATIVE MEMORY: SELECT THE XACT WITH MINIMUM SIMULATED TIME AS * THE BEST CANDICATE FCR PROCESSING AMEM ENTER AM EM ENTER AM GATE LR RDWRT ALLOW LOAD IF AM NOT BUSY LOGIC S RDWRT SET SWITCH TO INDICATE BUSY ADVA 2 TIME TO LOAD AM LOGIC R RDWRT RESET BUSY SWITCH JOIN AVAIL, P2 XACT IS AVAILABLE FOR SELFCTION LOGIC S CTL SWITCH SET WHEN XACTS ARE ON AMEM LINK AMEM,P4 ORDERED CHAIN TO SIMULATE AM * * THE TRANSACTION GENERATED NEXT SIMULATES INTERROGATION OF THE * ASSOCIATIVE MEMORY. GENE ,,,1,,1 XACT TO INTERROGATE AM INT GATE LR RDWRT ALLOW INTERROGATION IF AM NOT BUSY LOGIC S RDWRT SET SWITCH TO INDICATE BUSY ADVA 1,FN$INTEP INTERROGATION TIME LOGIC R RDWRT RESET BLSY SWITCH UNLINK AMEM, AAA, Kl,, , XXX SELECT MIN TIME XACT TEST E XH$XCTR,KC WiAf T FOR UNLINKED XACTS TO BE MOVED TRANS ,INT RESUME INTERROGATION XXX LGGIC R CTL SWITCH TO CONTROL INTERROGATION GATE LS CTL WAIT FOR XACTS TO GET ON CHAIN TRANS ,INT AAA UNLINK AMEM, QTES T , ALL ,4 UNLINK ALL XACTS WITH SAME SIM TIME OTEST SAVEVAL XCTR*,Kl,H CCUNT UNLINKED XACTS ADVA 1 RESOLVE EACH RESPONDER EXAM DEL0,P2,RCUTE IS XACT CN DELAY QUEUE? TFST LE P4,X$MINT*,NCTMN YES. IS IT MIN TIME? REMCVE AVAIL, ,P2 YES. 
XACT HAS BEFN SELECTED SAVEVAL XCTR-.K1.H CECREMENT CTR AS XACTS ARE HANDLED LEAVE AMEM TRANS ,LVDLY XACT IS LEAVING DELAY QUEUE NOTMN SAVEVAL XCTR-,Kl,h CECREMENT CTR AS XACTS ARE HANDLED LINK TEMPL,LIFC XACT CANNOT BE SELECTED RELNK LINK AMEM,LIFO * * CODE EVALUATION UNIT: ROUTES TASKS TO PROPER QUEUE AND SWITCHES Ik7 TRANSACTIONS AT PARTITION BOUNDARIES ROUTE PAC 3 4 5 6 BDPAC PAS BCPAS DAC POAC DAS PDAS E LR SAVEVAL SEIZE REMOVE LEAVE TEST E LOGIC S ASSI ADVA ASSI VARI ASSI VARI ASSI VARI ASSI VARI ASSI TEST GATE SPLIT ASSI TRANS TEST E GATE LR RELE LOGIC B TRANS TEST E GATE LP LOGIC S SPLIT ASSI TRANS TEST E GATE LR LOGIC R RELF QUEUE JOIN ASSI TRANS DELAY QUEUE DLQA MERGE LVDLY ASSI QUEUE ADVA JCIN ASS I TEST GATE TRANS LINK REMOVE ADVA UNLINK E SE XCTR-,K1, h CODE AVAIL, tP2 AMEM P4,X$MINT*tPAC BYPAS 9,FN$TYPCC 1 10, V3 P9/100 lit V4 picaioc 12, V5 P92100 7.V6 Pll+44 5*,K1 P12,K11,PAS,1 DELAY, PCAC Kl,PROA 3+,Kl .PAC P12,KlO,CAC.l DELAY, PDAS CODE BYPAS ,PPQB PI 2, KOI, CAS BYPAS, RCPJC DELAY Kl ,DLQA 3+.K1 ,PAC P12,K0r,EPRCD BYPAS, BCPAS DELAY CODE DELAY D6LQ,P2 B,KO , AMEM OECREMENT CTR AS XACTS ARE HANDLED GET CODE EVALUATION UNIT XACT HAS BEEN SELECTED IS THIS A MIN TIME XACT? YES. ENABLE DELAY QUEUE BYPASS P9 GETS BLOCK TYPE AND CODE EVALUATE EACH PROCESSING COOE P10 >99 MEANS BLOCK OPERANDS NEED DECODED, INCREASING EXECUTION TIME Pll GETS BLOCK TYPE. NUMBERS 1 THRU 44 = BLOCKS ADVANCE THRU WRITE P12 GETS THE 2 BIT BLOCK CODE P7 POINTS TO THE XH GIVING THE NR OF PROCESSORS BLOCK Pll USES INCR COUNT OF TASKS IN CONCURRENCY GRP DOES CODE SAY PROCESS Q AND CONTINUE? YES. IS DELAY SWITCH RESET? YES. SEND A TASK TO THE PROCESS Q. INCR BLOCK POINTER TRY THE SAME XACT ON THE NFXT RLOCK DOES CODE SAY PROCESS Q AND SWITCH? YES. IS DELAY SWITCH RESET? YES. RELEASE COOE TO SWITCH XACTS RESET DELAY QUEUE BYPASS SENC THE MASTER TASK TO THE PROCESS Q DOES CODE SAY HELAY Q AND CONTINUE? YES. SHOULD THIS XACT BYPASS DELAY 0? NO. SET THE DELAY SWITCH SEND THE TASK TO THE DELAY Q INCR BLCCK POINTER TRY SAME XACT ON THE NEXT BLOCK COES CODE SAY DELAY Q AND SWITCH? YES. SHOULD THIS XACT BYPASS DELAY NO. RESET THE DELAY SWITCH PREPARE TO SWITCH XACTS Q? GROUP DELAYED XACTS P8 =0 INDICATES DELAY QUEUE IS DELAYED XACTS REMAIN IN THE AM SOURCE WHEN THE SIMLLATFD TIME OF A XACT IN THIS QUEUE IS EQUAL TO THE MIN TIME OF ALL XACTS, ALL TASKS IN THE PARTITION ARE ELIGIBLE TO LEAVE THE QUEUE. TASKS FROM THIS QUEUE HAVE HIGHER PRIORITY FOR USE OF THE PROCESSORS. 5,K0 RESET PARTITION TASK COUNTER. DELAY 2 TIME TO LOAD THE QUEUE DELQ.P2 GROUP DELAYFD XACTS 8, KG PB =0 INDICATES DFLAY Q IS SOURCE GSAVAILtKO, MERGE, 1 IS AVAILABLE GROUP EMPTY? PRPUL, MERGE, ,1 YES. ARE PROCESSORS IDLE? ,TASK YES. PROCESS THIS XACT DELAY, P4 MERGE BY TIME. DELQ,,P2 XACT IS LEAVING THE DELAY 2 REMOVE TASKS FROM DELAY QUEUE DELAY, LNKCT, ALL ,2 UNLINK 4LL TASKS FOR THIS XACT 11+8 LNKCT LINK OUTPT , P8 . T ASK t PROCESS QUEUE: TASKS ARE AVAILABLE FOR PROCESSING UPON REQUEST * . „ n RESET CCNCURRENCY GROUP TASK COUNTER PROA ASSI 5.KO PROB QUEUE PROCS tihe TQ LQA0 THE QUEUE * DVA I „. Pa xi WHEN PROCESS Q IS THE SOURCE I"I OUTPT. P8.1ASK PR IS PRIORITY FOR OUTPUT UNIT ! 0UTPUT unit: JhiS 6 UNU s SElects the^ext task^sent - v THE ppiQRiTY * * n n CD , C T DELAY THE DELAY Q IS OUTPUTTING DLYQ DEPART UtLAT I2?« OUTPT PLACE TASK IN OUTPUT UNIT TASK SEIZE OUTHl ^^^^^ task THR0UGH output UNIT TEST E P8.K1.DLYC DID THE TASK COME PROM THE PROCESS 0? °rlTl SSfMT.KO DOe's A TASK REQUEST EXIST? 
DE0 UNUNK O^TPT.TAsJ.Kl "MOVE NEXT TASK FROM Q * ,.r^ tack nccc A PROCESSOR FOR THE TIME DETERMINED : ™sk processors: ejc^tas^uses^process^^^ * nDO ,„ the task uses processors from a pool VZ11U* OUTPT YES. MOVE FROM OUTPUT TO PROCESSORS RELEASE °^;^; RFMCVE THE TASK REQUEST DEPART REOST COUNT TASKS IN PROCESS FOR THIS XACT SAVEVAL P2^,K1 COUN ^ ^ ^^ ex ASSI 111 HI MRTIM WILL JUST THE BLOCK EXECUTE TIME DO? TES T L PC.K99, MRTIM WILL ju EXECUTION TIME TIME ACVA PI TABULATE TIMEX IN pR0CESS F0R THIS XACT $ t\ll PRPUL PROCESSES BECOME AVAILABLE AGAIN of.cnp REQST REQUEST A TASK ?-f E ^.MASTR.* IS THIS THE MASTER XACT?^ ^^ GATE LS P2 JU. ASCEMRLE TASKS , N CONCURRENCY GP ASSEM ASSEM j>5 ^^ gate fqr nexT ITERATION TABULATE PART ^ rniAO TC TW r pi nCK TYPE OTHER THAN TERMINATE TEST NE PU.K38.TERMB IS HE LOCK TYP^ ^ ^ Qp ASSI 5,KC ,„„ , Ltl ACT P? XH.HOLDC G«=T NEXT DATA FOR XACT CONT SELFCT F 1*. K WO . XF$l AST . P2.XH.H OLDC G ^ ^ ^ ASSI A ' X * 1 i c THE XACT NP SAVEVAL. DATA USED. SA VEVAL JiJ;;;-" H ccUNT EMPTY TRACE TABLE LOCATIONS SAVEVAL EMPTY*. Kl.H ^LUN ^ CURRENT C , M T1MFS SAVEVAL P6.P4 UKU p0INTER TQ NFXT RLOCK XH ASSI 1 vu*,c P3 IS GIVEN THE XACT NEXT BLOCK »i» L StV.KFO .LC.K « TFE*E «» TH«» T «"»»V SAVEV u «;£;;*» ,. Lt ^^".".crs fko- te„p chain UNLINK TERMB ?«!% «H t O,s r .KC.TFM «| THERE OJH« JJCTS. ^ 7 -- E SSSSRS^S ; ? f jc« -muho^ ;;;« -a s s :;ix: l l s^s-"" * ^e^'^/of ^ve *.* - ts li+9 SELECT MIN 13 ,401 , XH*L STMN , , X P13 WILL POINT TO MIN TIME SAVEV TEST G X*13,X$MINTM,TRM HAS MIN TIME CHANGED? SAVEVAL MINTM,X*12 YES. UPCATE IT UNLINK TEMPL,RELNK,ALL,,,TRM UNLINK DELAY XACTS FROM TEMPL LOGIC S CTL TRM TERM 1 MASTR LOGIC S P2 OPEN GATE FOR TASKS IN CONCURRRENCY GP TRANS ,ASSEM MRTIM ASSI U,XH$DC0C,3 THE BLOCK USES FN, *, V, OR 8V TRANS .TIME LDATA SAVEVAL EMPTY, K0,H RESET EMPTIES COUNTER TFST G V7,K0,GTMIN ARE XACTS WAITING FOR TRACE DATA? HELP INSRT YES. LOAD DATA ASSI 1,125 IDENTIFY OCCURRENCE OF INSRT PRINT ,,MOV,X IDENTIFY OCCURRENCE OF INSRT TEST NF XH$L0CNT,K499,GT>'IN WAS INSERT GOOD UNLINK HOLDS, START, ALL YES. ATTEMPT TO START HOLDS XACTS UNLINK HOLDCCONT.ALL ATTEMPT TO CONTINUE HOLDC XAC^S TRANS ,GTMIN HOLDC TEST NE XH$LDCNT , K999, TERPB IS THERE MORE TRACE DATA? LINK HOLOCFIFC YES. HOLD XACT FOR NEXT INSRT CALL HOLDS TEST Nfc XHtLDCNT , K999, DLETF IS THERE MORE TRACE DATA? LINK HOLDS, FIFC YES. HOLD XACT FOR NEXT INSRT CALL UNHLD HELP INSRT LOAD TRACE DATA ASSI 1,152 IDENTIFY OCCURRENCE OF INSRT PRINT ,,MOV,X IDENTIFY OCCURRENCE OF INSRT TEST NE XH$LDCNT,K499,ALL WAS INSERT SUCCESSFUL? UNLINK HOLDS, START, ALL YES. ATTEMPT TO START HOLDS XACTS UNLINK HOLCC,CONT,ALL ATTEMPT TO CONTINUE HOLDC XACTS ACVA 1 ALLCW UNLINKING TEST G XHSDMSIM, V7,ALL WAS THF INSRT OK? TRANS ,TERMB YES. ALL TEP M XHtSIMX STOP THE SIMULATION. TRACE DATA FAILURE CLETE SAVEVAL DMSIM-,K1,H DECREMENT COUNT OF ACTIVE SIM XACTS TERM 1 ERRCD ASSI 12, KO BLOCK CCDE HAD ERROR. SET CODE TO 00 TRANS ,POAS RESUME bITH WORST CASE COOE GENE 40,,,, ,2 GATHER STATISTICS AT 40 TIME UNIT INT COUNT GF 2,1,XH$SINX,1,X CCUNT NR OF SIMULTANEOUS XACTS TABULATE NRSXl TABULATE PRLSE TERM PRUSE TABLE S $PR PUL ,0 , 1 , 120 NRSXl TABLE P2,C,1,50 NUMBER OF SIMULTANEOUS XACTS IN PRS PART TABLE P5,l,l,20 RUN TIME PARTITION LENGTH TIMEX TABLE PI, 40, 10,53 EXECUTION TIME PER BLOCK PRPUL STORAGE 4 START 56 NUMBER OF SIMULATED XACTS END //INDATA CC * < TRACE DATA READ EY ROUTINE INSRT > 150 LIST OF REFERENCES [l] ACM Profession Development Seminar, "Simulation of Discrete Systems," ACM, pp. 123-135- [2] C. 
Cartegini, "Scanner for the Analysis of Parallelism in Fortran Programs and IF-Tree Detection," (M.S. Thesis) University of Illinois at Urbana-Champaign, Department of Computer Science; 1971.

[3] Control Data Corporation, "The STAR Computing System." A technical proposal to the Atomic Energy Commission; December 1966.

[4] O. Dahl and K. Nygaard, "SIMULA: An ALGOL-Based Simulation Language," Communications of the ACM, p. 671; September 1966.

[5] L. C. Fulmer and W. C. Meilander, "A Modular Plated Wire Associative Processor," Proceedings of the IEEE Computer Group Conference, pp. 325-335; June 1970.

[6] International Business Machines Corporation, "Capital Investment Studies Using GPSS: Bulk Material Movement Problems," First Edition, p. 39; 1968.

[7] International Business Machines Corporation, "General Purpose Simulation System/360 User's Manual," Fourth Edition; January 1970.

[8] P. J. Kiviat, R. Villanueva, and H. M. Markowitz, The SIMSCRIPT II Programming Language, Prentice-Hall, Inc.; 1968.

[9] D. J. Kuck, "ILLIAC IV Software and Application Programming," IEEE Transactions on Computers, Vol. C-17, No. 8, pp. 758-770; August 1968.

[10] D. J. Kuck, Y. Muraoka, and S. C. Chen, "On the Number of Operations Simultaneously Executable in FORTRAN-Like Programs and Their Resulting Speed-Up," to be published in IEEE Transactions on Computers.

[11] S. E. McAulay, "Job Stream Simulation Using a Channel Multiprogramming Feature," Fourth Conference on Applications of Simulation, ACM, pp. 190-194; 1970.

[12] T. B. Pinkerton, "Program Behavior and Control in Virtual Storage Computer Systems," (Ph.D. Thesis) The University of Michigan, CONCOMP Technical Report 4; April 1968.

[13] Paul F. Roth, "The BOSS Simulator - An Introduction," Fourth Conference on Applications of Simulation, ACM, pp. 244-250; 1970.

[14] R. A. Schwarz and T. J. Schriber, "Application of GPSS/360 to Job Shop Scheduling," Digest of the Second Conference on Applications of Simulation, ACM, pp. 237-248; 1968.

[15] D. L. Slotnick, et al., "The ILLIAC IV Computer," IEEE Transactions on Computers, Vol. C-17, No. 8, pp. 746-757; August 1968.

[16] D. G. Weamer, "QUICKSIM - A Block Structured Simulation Language Written in SIMSCRIPT," Third Conference on Applications of Simulation, ACM, pp. 1-11; 1969.

VITA

Edward Willmore Davis, Jr. was born in Akron, Ohio, in 1941. He graduated from The University of Akron in 1964 with a Bachelor of Science in Electrical Engineering degree and earned the Master of Science in Engineering degree there in 1967.

From 1964 to 1968 he was employed in the Computer Engineering Department of Goodyear Aerospace Corporation, Akron, Ohio. In 1968 he entered the University of Illinois Department of Computer Science. He was a research assistant with the Illiac IV Project from 1968 to 1970 and with the Center for Advanced Computation in 1970 and 1971. In 1971 he joined a group studying computer organization and software, where he did research on concurrent processing systems.