LIBRARY OF THE 
 
 UNIVERSITY OF ILLINOIS 
 
 AT URBANA-CHAMPAIGN 
 
 510.84 
 oop 2. 
 
Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/schedulingonpara545chin 
 
y^uLZA 
 
 5/0' *f 
 
 M X C' /2y Report No . UIUCDCS-R- 72-5^5 
 
 7co.6'fS 
 
 £ SCHEDULING ON PARALLEL PROCESSORS FOR WEIGHTED-NODE GRAPHS 
 
 by 
 
 Janet Sau-Ying Chin 
 
 October 1972 
 
 e. 
 
 TH E LIBRARY OF THE 
 
 DEC 1^1972 
 
 ^VERS.TY^.U^ 
 
 *?3 
 
Report No. UIUCDCS-R- 72-5^5 
 
 SCHEDULING ON PARALLEL PROCESSORS FOR WEIGHTED-NODE GRAPHS 
 
 by 
 
 Janet Sau-Ying Chin 
 
 October 1972 
 
 Department of Computer Science 
 University of Illinois at Urbana- Champaign 
 Urbana, Illinois 6l801 
 
 *This work was supported in part by the National Science Foundation 
 under Grant No. US NSF GJ 2"jKk6 and was submitted in partial ful- 
 fillment of the requirements for the degree of Master of Science in 
 Computer Science, October 1972. 
 
SCHEDULING ON PARALLEL PROCESSORS FOR WEIGHTED-NODE GRAPHS 
 
 BY 
 
 JANET SAU-YING CHIN 
 B.S., University of Illinois, 1970 
 
 THESIS 
 
 Submitted in partial fulfillment of the requirements 
 for the degree of Master of Science in Computer Science 
 in the Graduate College of the 
 University of Illinois at Urbana- Champaign, 1972 
 
 Urbana, Illinois 
 
Ill 
 
 ACKNOWLEDGMENT 
 
 The author expresses her gratitude to Professor David J. Kuck, 
 the Department of Computer Science of the University of Illinois at Urbana- 
 Champaign, without whom this thesis would not have been successfully completed, 
 Duncan Lawrie and Paul Kraska are much appreciated for their help and guidance , 
 
 Special thanks to P. Budnik, S. C. Chen, R. Towle, and R. E. 
 Strebendt for their help. 
 
IV 
 
 TABLE OF CONTENTS 
 
 Page 
 
 1. INTRODUCTION 1 
 
 2. ASPECTS 2 
 
 2.1 Introduction to the Scheduling Problem 2 
 
 2.2 General Description of Algorithm k 
 
 2.3 Preparation of FORTRAN Programs 6 
 
 3- A DESCRIPTION OF WSCHED 12 
 
 3-1 A General Description of WSCHED 12 
 
 3.2 The Relaxation Process lU 
 
 3.3 General Description of JCOMP Ik 
 
 k. EXPERIMENTS AND RESULTS l8 
 
 k.l Background Discussion of the Experiments l8 
 
 k.2 Discussion of the Distribution Results 22 
 
 k.3 Background for the Second Set of Experiments 27 
 
 k.k Results of the Second Set of Experiments 29 
 
 APPENDIX 32 
 
 A 32 
 
 B 35 
 
 LIST OF REFERENCES kQ 
 
V 
 
 LIST OF TABLES 
 
 Page 
 
 1. For Weighted Nodes 2k- 
 
 2. For Unweighted Nodes 25 
 
LIST OF FIGURES 
 
 Page 
 
 1. An Example of Partial Ordering 5 
 
 2.1 A Connection Graph for (3) 8 
 
 2.2 A Relaxed Graph 9 
 
 2.3 Another Relaxed Graph ..... 9 
 
 2.4 The Relaxed Connection Matrix 10 
 
 2.5 Weight Vector • • 10 
 
 2.6 Type Vector j_0 
 
 3.1 The Original Graph 15 
 
 3-2 A Relaxed Graph With Four Levels of Nodes 15 
 
 3 A Relaxed Graph Representing Released Matrix l6 
 
 4.1 Unweighted Case 19 
 
 4.2 Weighted Case 19 
 
 4.3 Unweighted Case Distributed 19 
 
 4.4 Weighted Case Distributed 19 
 
 4.5 Unweighted Case 20 
 
 4.6 Weighted Case 20 
 
 4-7 Unweighted Case Distributed 20 
 
 4.8 Weighted Case Distributed 20 
 
 4 .9 Unweighted Case 21 
 
 4.10 Weighted Case 21 
 
 4.11 Unweighted Case Distributed 21 
 
 4.12 Weighted Case Distributed 21 
 
 4.13 The 6 FORTRAN Statements 23 
 
 4.14 Graph for P Using Common Results 27 
 
 4.15 P 7 Tree Without Common Results 28 
 
1. INTRODUCTION 
 
 In their search for increased computer speed and throughput, Hobbs 
 and Theis [2] maintain that parallel processing is a solution for problems 
 with inherent parallelism. This property of parallelism allows various oper- 
 ations to be performed concurrently. To demonstrate, assume that the following 
 two FORTRAN statements are given. 
 
 (1) F = A + B + C+D 
 
 (2) T = V + X + Y + Z 
 
 Statement 2 does not rely on statement 1. Therefore, they may be 
 executed concurrently. On two parallel processors then, both statements would 
 be completed in the amount of time needed to process one statement. We can 
 also consider parallelism on the operator level. In statement 1, A + B can be 
 computed at the same time C + D is being calculated. Increasing the number of 
 computers would again decrease the amount of time needed to complete the oper- 
 ations. In other words, parallel processing, in our example, would increase 
 computer speed and throughput. 
 
 Parallelism is inherent in FORTRAN programs according to David Kuck, 
 Yoichi Muraoka, and S. C. Chen [8]. Therefore, if we had parallel processors 
 with arithmetic units which can do any of four operations (add, subtract, 
 multiply, and divide) as well as initiating two other operations (fetch and 
 store), then we will have greater speed and throughput than a conventional 
 serial processor. 
 
 This paper will present some of the aspects behind the project under- 
 taken to implement Kuck's proposal [7] and describe a compile-time operation 
 scheduler for parallel processors. In addition, the experiments which were 
 run with the scheduler and their results will be discussed. 
 
2. ASPECTS 
 
 2.1 Introduction to the Scheduling Problem 
 
 Ever since computers came into being, there have been tradeoffs of 
 one kind or another. The most emphasized seems to be the ratio of cost to 
 speed of computation. Another which deserves some attention is the one be- 
 tween compilation time (which includes any time spent in preparation prior to 
 actual program execution) and execution time. The considerations that follow 
 deal with a fixed number of parallel processors. 
 
 The tradeoff concerns the time when scheduling of operations is done. 
 If we execute operations as they are required, then operands must be fetched. 
 As a result, execution time would be longer than if we had some amount of look- 
 ahead so that the operands could have been available at the time the operation 
 was required. Other aspects of this problem are the minimization of processor 
 idle times and the optimization or processor run speeds. These two can be ac- 
 complished by establishing for each of the parallel processors a sequence of 
 operations to be done. However, this ordering process would require a longer 
 compilation time. If we were given statement 1, on page 1, along with two 
 parallel processing elements (PEs) and the four operands already fetched, we 
 may assign the execution of A + B to PE 1 and the execution of C + D to PE 2. 
 When these operations are completed, PE 2 would add the two results. This then 
 describes an order of operation for each of the processors. 
 
 Furthermore, Paul Kraska has developed an algorithm (algorithm a) 
 which calculates an order of operations in reduced time [5l« The PEs are 
 assumed to be capable of doing addition, subtraction, multiplication, and 
 division. Each processor, when done with its task, does not wait for other 
 processors to finish before it proceeds to another operation. Along with the 
 
four arithmetic operations named above, the processors are capable of initiating 
 stores and fetches of data to and from memory, respectively. 
 
 The minimal time taken to process a set of operations is dependent 
 upon how many operations must be performed in sequence, as opposed to oper- 
 ations being completed in parallel; and how long each of the operations in the 
 sequence takes. If we represent the sequences in this set of operations as a 
 tree, each operation would be represented by a node. The time taken to complete 
 an operation corresponds to the weight assigned to the associated node. 
 
 Since the operations must be done in some order, the operations tree 
 is a partial ordering on nodes. For an illustration of the ordering, consider 
 Figure 1. The operations associated with nodes 2 and 3 are to be completed 
 before the operation represented by node 1 is begun. Nodes 2 and 3 are said to 
 be predecessors of node 1 in the tree. In this ordering nodes k, 5, and 6 are 
 root nodes; and node 1 is a terminal node. 
 
 Figure 1. An Example of Partial Ordering 
 
 The "by-demand" scheduling is discussed in a paper by Larry Swanson 
 [12]. He is concerned with the data management problem. He presents a machine 
 
configuration of a tree processor system associated with a number of routing 
 networks. He has programs which simulate various combinations of the log 
 router, the Illiac IV router, the Semmelhaack router, the Batcher network, and 
 the crossbar switch. He discusses the amount of time taken to transfer data 
 to and from memory and between PEs. The time delays influence the size and 
 speed of the tree machine since the larger the machine, the longer the route 
 data must travel. 
 
 This paper will discuss the program written to implement Kraska's 
 algorithm a . It will also present some experiments which used the program 
 along with their results. 
 
 2.2 General Description of Algorithm 
 
 Starting with a lower bound on the number of processors, m, Kraska's 
 algorithm a provides a way of finding the least upper bound on the number of 
 processors needed to complete the operations represented by an operations tree 
 in the minimal amount of time. 
 
 As operations are scheduled, their nodes are placed into lists which 
 correspond to the processors. The lists would, at the end, tell us which 
 operations are to be executed by which PE and in what order. Since the weight 
 of each node corresponds to the amount of time required to complete the asso- 
 ciated operation, the lists then indicate when an operation has been executed 
 by the position of the node in a list and the weights of the nodes below it. 
 During scheduling then, each list has a height. 
 
 For each node, n., find the latest time it may be executed. This is 
 done by finding the largest sum of the weights of all the nodes between node 
 n. and a root node, including the weight of the root node. The path length for 
 node n. at any particular time is the sum of the minimal height for the lists 
 
and the largest n.-to-root weight sum. This path length indicates the time 
 the operations on the longest path through the operations tree will he com- 
 pleted. The minimal time is just the longest path length through the tree. 
 This time is called the critical path length. 
 
 To find a schedule which will be completed in minimal time, we 
 consider the terminal nodes in the operations tree. ^Insert the node(s) with 
 the maximum path length into the lists. Then consider the tree with the in- 
 serted node(s) removed. With the new tree we start all over again at *. If 
 we can continue down to all the terminal nodes and insert them into the lists 
 without getting a path length greater than the critical path length for the 
 operations tree, we will have succeeded in our scheduling for minimal time; 
 and the least upper bound on the number of processors will have been reached. 
 If, however, a path length is calculated which is greater than the critical 
 path length, we must start from the beginning again, but this time with an 
 added processor. Hence, another list is available for scheduling, for a total 
 of m + 1 lists. 
 
 Algorithm a also provides a way of scheduling an operation tree for 
 a fixed number of processors regardless of the critical path length. In this 
 case it is necessary to provide a back-up facility so that when an assignment 
 is made which denies assignment to a node awaiting scheduling and which 
 results in a longer path length than the critical path length, we are able to 
 backtrack to the point the node was denied assignment, insert it into the 
 lists and start off from there. This method provides a schedule for reduced 
 time given a fixed number of processors. 
 
 No matter how many processors are used, the critical path length 
 still represents the least amount of time necessary for the operation tree to 
 
be processed. Thus, if we are to schedule on a greater number of PEs than the 
 least upper bound, the main difference in the results would be a greater per- 
 centage of processor idle time than if we scheduled on the least upper bound 
 number of processors. 
 
 The weights associated with each node is dependent upon the time 
 needed to complete the corresponding operation. Therefore, if we assign the 
 weights such that certain operations take longer than others, each processor, 
 in effect, would seem to proceed independent of the other PEs. On the other 
 hand, if we assign equal weights to all operations, then each processor would 
 start and finish an operation in time with the other processors. We say that 
 the nodes are unweighted or unit weighted since the time spans are all the 
 same. 
 
 2.3 Preparation of FORTRAN Programs 
 
 A FORTRAN program has many other types of statements besides a 
 simple assignment statement. Functions and subroutine calls, at present, are 
 ignored. Built-in functions are replaced by a number of adds and multiplies. 
 Transcendental functions are replaced by an expression consisting of a sum of 
 products. Arithmetic functions are regarded as subscripted variables. DO 
 loops and IF statements are taken care of by back substitution, recursion, and 
 tree height reduction. These processes are described in Han's paper [l] . Aftei 
 these operations have been applied to the FORTRAN program, it becomes a series 
 of assignment statement blocks which involve only the four arithmetic oper- 
 ations named earlier along with store, fetch, and a transfer of control oper- 
 ation. 
 
 Calls to subprograms and certain other FORTRAN statements like GO TO 
 signal the end of each assignment block [11] . Along the length of each block, 
 
variables are back substituted recursively, variables and sub-expressions are 
 distributed over each arithmetic expression as specified by the Multiplication 
 Distribution Algorithm and the Division Distribution Algorithm as described by 
 Han [l] and Krasha [5]. These two algorithms specify distribution only if the 
 number of operation levels, the number of nodes in the longest path from a 
 terminal node to a root node, will decrease by the distribution process. Thus, 
 the time taken to complete execution of any statement will be reduced, never 
 lengthened by the process. 
 
 One aspect which should be considered is how the distribution process 
 affects the number of processing elements. Although the number of levels may 
 be decreased by the distribution process, the number of processors needed for 
 executing in the reduced time never decreases. Since distribution introduces 
 operations into the levels, it is often true that more processors are needed 
 than before distribution. Experiments conducted bear out this theory. 
 
 After an assignment statement block has been back substituted and 
 distributed, the corresponding trees' heights are reduced; and a connectivity 
 graph is made in the following way. Define all "fetches" to be root nodes. 
 Define all "stores" to be terminal nodes. The rest of the matrix is constructed 
 by precedence. As an example, consider the statement 
 
 (3) x=b-a*c + d/e*f 
 
 The graph for it is shown as Figure 2.1. 
 
 Relax this graph in the following manner. First, place all terminal 
 nodes at the bottom level. Secondly, place at the next to the bottom level all 
 nodes which are predecessors to the terminal nodes only. Place at the third 
 from the bottom level, all predecessors of the nodes at the lower levels. Con- 
 tinue in this fashion until all the root nodes have been placed into the graph. 
 
Figure 2.1. A Connection Graph for (3) 
 
 This resulting graph is relaxed. Now create the n x n connectivity matrix 
 where the graph has n nodes. The ij entry in the matrix is 1 only if node i 
 precedes node j in the graph. Notice, however, that the relaxed graph will 
 have a node ordering different from the original connection graph if we start 
 numbering at the root level and proceed toward the terminal level. Notice 
 also that permutations on the numbering at each level will also result in a 
 relaxed graph as illustrated in Figures 2.2 and 2.3. Figure 2.k shows the 
 connectivity matrix for (3) based on the relaxed graph in Figure 2.2. 
 
 Notice that the relaxed matrix is upper triangular. This is a 
 consequence of the relaxation process. The row(s) of zeroes at the bottom 
 
Figure 2.2. A Relaxed Graph 
 
 Figure 2.3- Another Relaxed Graph 
 
10 
 
 
 1 
 
 2 
 
 3 
 
 k 
 
 5 
 
 6 
 
 7 
 
 8 
 
 9 
 
 10 1112 
 
 1 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 
 
 
 
 
 
 
 
 
 
 2 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 
 
 
 
 
 
 
 
 
 
 3 
 
 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 
 
 
 
 
 
 
 
 k 
 
 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 
 
 
 
 
 
 
 
 5 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 
 
 
 
 6 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 
 
 
 
 7 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 
 
 8 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 
 
 9 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 10 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 11 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 12 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 1 
 
 12 
 
 1 
 
 13 
 
 1 
 
 57 
 
 1 
 
 97 
 
 1 
 
 111 
 
 3 
 
 6 
 
 5 
 
 7 
 
 1 
 
 15 
 
 2 
 
 5 
 
 3 
 
 6 
 
 2 
 
 k 
 
 1 
 
 li 
 
 Figure 2.k. The Relaxed Figure 2-5- Figure 2.6. 
 Connection Matrix Weight Vector Type Vector 
 
11 
 
 indicate the terminal node(s), while the columns of zeroes at the left indicate 
 root node(s) . 
 
 A weight vector, NW, is made from the weights of the nodes. A type 
 vector, NT, defines the operation involved at each node. Thus, NW. and NT. 
 tell how much time is required to complete the operation of type NT. for node i 
 in the graph. The weight vector for the relaxed matrix is shown in Figure 2.5 
 with the assumption that fetches and stores take 1 time unit, adds and subtracts 
 take 2, multiplies take 3, and divides take 5 time units. If each fetch is 
 coded with a number greater than 11, symbolizing the variable fetched, add is 
 coded as h, subtract as 5, multiply as 6, divide as 7> and store as 11, then 
 the type vector is shown in Figure 2.6. 
 
 The weight vector is then used to compute path lengths and lists' 
 heights. The type vector is a convenience supplied to provide some statistics 
 on the operations at each level. A table of occurrence is outputted to show 
 the frequency of each operation at the different levels. 
 
12 
 3 . A DESCRIPTION OF WSCHED 
 
 3.1 A General Description of WSCHED 
 
 WSHED is a PL/I procedure which implements Kraska's algorithm a , 
 described earlier in this paper. It receives as parameters a connectivity 
 matrix along with a weight vector, a type vector, and the dimension N. First, 
 the N X N BIT(l) connectivity matrix is relaxed. (See pages 10-11 and 16-I7 
 for discussions of the relaxation procedure) . Relaxation produces another 
 result: the number of operation levels in the tree. The levels give us a 
 precedence relation. If node i is in level m and node j is in level n, m > n; 
 then if there is a path between i and j, i is a predecessor of j . The levels 
 are numbered from bottom to top. Therefore, the level of terminal nodes is 
 level 1. 
 
 Second, since the levels of the tree represent a precedence relation i 
 and the matrix represents a connectivity relation, we can compute the critical 
 path length and the paths for any node to a root node. From these, we calculate, 
 the lower bound on the number of processors needed to complete the operations 
 represented by the matrix and the two vectors. This lower bound is found as 
 follows. Calculate at each level the sum of all the node weights for nodes on 
 and below the level. Divide the sum by the maximum path length up to the level. 
 The smallest integer not less than the largest ratio is the lower bound on the 
 number of processors. 
 
 Since algorithm a contains an option, the signal indicator SCHEDPE is 
 provided by the user to determine whether WSCHED is to find the least upper 
 bound on the number of processors to complete operation in minimal time or 
 WSCHED is to simply find a schedule for the number of processors indicated. 
 
 If SCHEDPE (l) = 0, we want to find the least upper bound. Starting 
 
13 
 
 with the lower bound obtained above, procedure JCOMP is called which determines 
 the scheduling of the processors. Should JCOMP fail to find a schedule which 
 would complete execution in minimal time, it returns a KODE = 1 indicating 
 failure. Upon seeing this code, WSCHED will add another processor to the 
 group and recall JCOMP. Eventually JCOMP will return a KODE = 0, indicating 
 that an ordering of operations exists which will have the operations completed 
 in the minimal time. At this point, the number of processors available to 
 JCOMP is the least upper bound on the number of processors needed to execute 
 the operations represented by the matrix and the two vectors in minimal time. 
 We also know the order in which the operations are done, which operation is 
 executed by which processor, and what percentage of the time each of the 
 processors is idle. 
 
 If SCHEDPE(l) / 0, it contains the number of scheduling trials to be 
 made for certain numbers of processing elements. SCHEDPE(2) through SCHEDPE 
 (number of trials plus one) determine the number of processors to be used during 
 trials 1 through SCHEDPE (l), respectively. During each trial, JCOMP will 
 schedule the nodes on the specified number of arithmetic processors, regardless 
 of the critical path length. The resulting order is a schedule which will 
 complete the operations as quickly as possible. 
 
 It should be noted that for any number of processors less than the 
 least upper bound (LUB) calculated in WSCHED, the amount of time needed for 
 completion necessarily increases. There is no use having the number of proces- 
 sors greater than the LUB since the critical path length cannot decrease. 
 Therefore, having extra processors only results in a higher percentage of idle 
 time for the processors. 
 
 
Ik 
 
 3 .2 The Relaxation Process 
 
 The relaxation of the connectivity matrix requires row and column 
 interchanges since we may want to move a node in level i down to level j, 
 i > j . We usually have more than one node moving through levels in a relax- 
 ation. Therefore, it would he too time consuming to physically interchange 
 the rows and columns. As a result, pointers are used to effect the exchange; 
 and it is not "until the very end that the physical change takes place. 
 
 Since the position of a node after relaxation does not necessarily 
 correspond to its position prior to the exchanges, another set of pointers 
 keeps the corresponding old position for each node. Hence, at the end of the 
 scheduling, node 6 which corresponded to row 6 and column 6 in the original 
 connectivity matrix became node 10 corresponding to row 10 and column 10 in 
 the relaxed matrix; if JCOMP finds that node 10 is in list 2, we actually 
 have PE 2 executing the original node 6. An example of tree relaxation was 
 given earlier. An example of a graph relaxation may be seen in Figures 3-1, 
 3-2, and 3.3- 
 
 Notice that Figure 3*3 would give rise to a relaxed matrix where 
 node i corresponds to row i and column i, while Figure 3-2 would not. The 
 node numbering in Figure 3*2 show what happened to each original node. This 
 would be the intermediate result of the phase when pointers are utilized to 
 effect row and column exchanges. The renumbering process after relaxation 
 would yield Figure 3«3« 
 
 3 -3 General Description of JCOMP 
 
 Starting with the terminal nodes, JCOMP calculates their longest 
 paths through the tree. Assign processors to the terminal node(s) which are 
 on the longest paths. Remove the scheduled nodes from the tree and start over 
 
15 
 
 Figure 3.1. The Original Graph 
 
 level k 
 
 level 3 
 
 level 2 
 
 level 1 
 
 Figure 3*2. A Relaxed Graph With Four Levels of Nodes 
 
16 
 
 
 
 Figure 3.3 • A Relaxed Graph Representing Released Matrix 
 
 with the new tree. The procedure followed is described in Kraska's paper [5]. 
 The processor (s) with the minimum amount of activity will be assigned first. 
 
 When JCOMP has finished establishing the order of node processing, 
 it prints out the nodes according to the original node numbering system rather 
 than according to the relaxed numbering. Thus, the initial input to WSCHED 
 has the same node numbering as the final output of WSCHED, the scheduled lists. 
 
 JCOMP is an internal procedure in WSCHED. It receives the number of 
 processors for which it should schedule, the number of levels in the matrix, 
 and the number of nodes in the operations graph. It returns a code of 1 to 
 signal an unsuccessful attempt and a code of for success in scheduling the 
 nodes on the number of processors provided. JCOMP includes a switch which 
 determines whether the LUB is to be found or a straight scheduling is to be 
 done. 
 
 The input for the scheduler involves only assignment statements, 
 
17 
 
 and each of these begins with a series of variable fetches—one of the fastest 
 operations. Therefore, if the stage of having only root nodes (fetches) left 
 unscheduled is reached, then they may be scheduled in any order as long as the 
 scheduling does not increase the critical path length of the LUB calculation. 
 If other operations with weights greater than that of fetches were to be in- 
 cluded at the root level in LUB calculation, then Johnson's algorithm [3] 
 would have to be implemented to perform the final phase of scheduling. 
 Johnson's algorithm considers the amount of time left to each processor before 
 minimal time is exceeded. Start with the PE with the least amount of time left. 
 Find an unscheduled operation or a set of unscheduled operations which occupy 
 the PE the longest without exceeding its time left. Then proceed to each of 
 the other processors in order of time left, so the PE with the most time left 
 is scheduled only if there are still unscheduled operations left to be done. 
 
18 
 
 k. EXPERIMENTS AND RESULTS 
 
 k.l Background Discussion of the Experiments 
 
 Let us discuss two differences between weighted and unweighted nodes. 
 We want to see how weights affect the scheduling of operations and how the two 
 types of nodes are affected by the distribution process. 
 
 Consider the expressions: 
 
 (k) C / A + D + B * (A + E) 
 
 (5) c/a + d + b*a + b*e 
 
 For the unit weighted case, statement k can be calculated in 3 time steps by 
 2 PEs, assuming that the operands are already available. For the weighted 
 case, with the weights as defined in section 2.3, statement k- would be done in 
 9 time units by 2 PEs. However, their scheduling process is different from the 
 one with unit weights (see Figures k.l and k.2). Since statement 5 would not 
 be executed faster than statement k and it would take an extra PE in both the 
 weighted and the unweighted case (see Figures k.3 and k.k), distribution would 
 not be done . Note that the operations performed are in different sequences . 
 In the weighted case one of the PEs only has to do the divide operation while 
 the other PE is performing two adds and a multiply. 
 Let us look at another set of statements: 
 
 (6) A + D * (B + C + E) 
 
 (7) A + D*B + D*C+D*E 
 
 As we see in Figures k.5 and k.6, statement 6 requires 1 PE in both cases. 
 The weighted case takes 9 time units; the unweighted case takes k. Notice that 
 the scheduling is the same for both the weighted case and the unweighted case. 
 Once we distribute D, however, as shown in Figures ^.7 and k.Q, the scheduling 
 
19 
 
 changes and the time shortens with 3 PEs "being used in both cases. Therefore, 
 we would proceed with the distribution process in both cases. 
 The last set of expressions to be considered are: 
 
 (8) ((A + B) * C * D) * E 
 
 (9) (A + B) * C * D * E 
 
 Figure k.l. Unweighted Case 
 
 7 + 2 = 9 
 
 Figure k.2. Weighted Case 
 
 C/A + D + B * A + B * E timing 
 
 V / \/ V 
 
 * 1 
 
 Figure 1|- -3 - Unweighted Case Distributed Figure k.h. 
 
 Weighted Case 
 Distributed 
 
20 
 
 A + D*(B + C + E) timing 
 
 A + D * (B v + C + E) timing 
 
 V / 
 
 V 
 
 Figure k-5- Unweighted Case 
 
 Figure k.6. Weighted Case 
 
 A + D*B + D * C + D * E timing A + D*B + D*C + D*E timing 
 
 V v v- \ V V V 
 
 # # * 1 \ ~> 
 
 V V. V 
 
 »■ 5 
 
 Figure k.^. Unweighted Case 
 Distributed 
 
 Figure k.Q. Weighted Case 
 Distributed 
 
21 
 
 ((A + B) * C * D) * E timing 
 
 V v 
 
 * / 1 
 
 * C * D) * E timi 
 
 ing 
 
 Figure 4.9. Unweighted Case 
 
 Figure 4.10. Weighted Case 
 
 (A^+ B) * C. * D v * 7 E timing 
 
 V 
 
 (A + B) * C *D*E timing 
 
 v / V . 
 
 Figure 4.11. Unweighted Case 
 Distributed 
 
 Figure 4.12. Weighted Case 
 Distributed 
 
 This time distribution will be done for the weighted case, but it 
 will not be done for the unweighted case. This discussion bears out our 
 previous observation that the distributed case tends to require more PEs than 
 the undistributed case, but it may have shorter execution time. 
 
 If subexpressions are evaluated only once and the results are used 
 whenever the corresponding subexpressions appear, the number of processors 
 needed will usually decrease; but the number of levels in the operations tree 
 will remain constant. The reason for this is apparent. If a set of operations 
 has to be done in several places, there will be PEs at each level working on 
 the same operands producing the same results that some other PE at that level 
 
 ; 
 
 is also producing. Therefore, a greater number of PEs would be required than 
 
22 
 
 if we were to route the previous results to wherever that subexpression is 
 needed as an operand. Since we have already found that several processors 
 executing the same set of operations take the same amount of time as one 
 processor executing that set of instructions, the time needed to compute 
 remains constant. 
 
 To verify these ideas, two sets of experiments were run. The first 
 set compares the effects of distribution on weighted and unweighted nodes for 
 the same statements. The statements used are similar to our previous examples, 
 The second set applies WSCHED to two versions of a polynomial of degree 30. 
 The connection graph of the first version requires each instance of common 
 subexpressions to be evaluated separately. Therefore, this graph is a tree. 
 For the second version communication between processors is allowed, i.e., 
 
 common subexpressions are computed once, and the results are routed to wher- 
 
 5 
 ever they are needed. For example, the value of x is calculated only once; 
 
 -1-2 O c 
 
 and x may be calculated by using x and x . 
 
 K.2 Discussion of the Distribution Results 
 
 Six FORTRAN statements (Figure if-. 13) were run through WSCHED both 
 with weighted operators and unweighted operators. The results are shown in 
 Table 1 and Table 2. These statements were chosen to show various distri- 
 butions as well as how the distribution process can decrease the critical path 
 length and how it affects the number of processing elements. The entries in 
 the E do idle time) columns have been rounded to the nearest percent to avoid 
 clutter. 
 
 Statements A, B, E, and F were put through the distribution process 
 in both the weighted case and the unweighted case. For these four statements, 
 the number of levels decreased (the tree height was reduced) ; and the amount 
 
23 
 
 Set # Statement 
 
 A BA1=A*(B*C*D+E) 
 
 B BA2 = A * (B * C + D) + E 
 
 C BE = (A + B * C * D) * (E + F) 
 
 D BE = ((A + B) * C * D) E 
 
 E BE = (A * (B * C))/D * (E * F * G) 
 
 F BE = (((A * (BC)) * (DE))/F) * G 
 
 Figure l+.lj. The 6 FORTRAN Statements 
 
 of time required to completely process each statement was cut. However, the 
 number of processing elements increased. These results are as expected from 
 our previous discussion. 
 
 Consider statement C. It should go through distribution in the 
 weighted case since clock time decreases while the tree height remains constant 
 and the number of processors required increases. However, in the unweighted 
 case no distribution would be done. Not only does the number of levels remain 
 constant (and, hence, since this is the unit weighted case, the clock time 
 remains constant), but the number of processors increases causing a greater 
 percentage of idle time on the processors used. 
 
 In statement D we have the case that distribution is no help at all 
 in the unweighted case as far as number of levels and time are concerned. 
 However, with distribution we see a decrease in the number of PEs, thereby 
 decreasing idle time. So distribution should be done if we are interested in 
 conserving machine power. In the weighted case not only is the number of 
 
2k 
 
 Set # 
 
 No. of levels 
 
 Distribute? 
 
 No. of PEs 
 
 Time 
 
 Z $ idle 
 
 time 
 
 A 
 
 6 
 
 no 
 
 2 
 
 13 
 
 69 
 
 A 
 
 5 
 
 yes 
 
 k 
 
 10 
 
 190 
 
 B 
 
 6 
 
 no 
 
 2 
 
 12 
 
 67 
 
 B 
 
 5 
 
 yes 
 
 3 
 
 10 
 
 100 
 
 C 
 
 6 
 
 no 
 
 2 
 
 13 
 
 i+6 
 
 C 
 
 6 
 
 yes 
 
 3 
 
 12 
 
 75 
 
 D 
 
 5 
 
 no 
 
 k 
 
 11 
 
 lh5 
 
 D 
 
 5 
 
 yes 
 
 3 
 
 10 
 
 130 
 
 E 
 
 6 
 
 no 
 
 3 
 
 16 
 
 125 
 
 E 
 
 5 
 
 yes 
 
 5 
 
 13 
 
 2*+6 
 
 F 
 
 7 
 
 no 
 
 2 
 
 19 
 
 52 
 
 F 
 
 5 
 
 yes 
 
 5 
 
 13 
 
 2k6 
 
 (E % idle time is summed over the number of processing elements used) 
 
 Table 1. For Weighted Nodes 
 
25 
 
 Set # 
 
 No. of levels 
 
 Distribute? 
 
 No. of PEs 
 
 Time 
 
 Z $ idle time 
 
 A 
 
 6 
 
 no 
 
 2 
 
 6 
 
 33 
 
 A 
 
 5 
 
 yes 
 
 k 
 
 5 
 
 160 
 
 B 
 
 6 
 
 no 
 
 2 
 
 6 
 
 33 
 
 B 
 
 5 
 
 yes 
 
 k 
 
 5 
 
 180 
 
 C 
 
 6 
 
 no 
 
 3 
 
 6 
 
 100 
 
 C 
 
 6 
 
 yes 
 
 If 
 
 6 
 
 150 
 
 D 
 
 5 
 
 no 
 
 k 
 
 5 
 
 200 
 
 D 
 
 5 
 
 yes 
 
 3 
 
 5 
 
 100 
 
 E 
 
 6 
 
 no 
 
 U 
 
 6 
 
 167 
 
 E 
 
 5 
 
 yes 
 
 6 
 
 5 
 
 320 
 
 F 
 
 7 
 
 no 
 
 3 
 
 7 
 
 100 
 
 F 
 
 5 
 
 yes 
 
 6 
 
 5 
 
 320 
 
 u 
 
 N 
 W 
 E 
 I 
 G 
 H 
 T 
 E 
 D 
 
 Table 2. For Unweighted Nodes 
 
26 
 
 processors decreased, but time is also shortened, so distribution should be 
 done. This is the only statement in the group of six in which distribution 
 causes the amount of idle time to decrease. 
 
 In this set of experiments, two-thirds of the cases yielded higher 
 amounts of idle time for the unweighted case than for the weighted case. This 
 is due to the fact that the number of processors needed by the unweighted nodes 
 is not less than the number of processors needed by the weighted case. The 
 reason is simple to see. If we restrict processors to writing for fellow PEs, 
 no matter how long any one of them takes, we waste process time. In Figure 
 J+.10, the processor executing A + B had to wait for the results from the 
 processor calculating C * D since multiplies take longer than adds in our 
 example. This creates a time slot during which the former processor is un- 
 occupied. There was also a 6 -unit time hole while the E was waiting to be 
 multiplied for a total of 7 time units unoccupied. After distribution in 
 Figure 4.12, there are still two holes; however, they total up to only k- time 
 units unoccupied. We notice that the A + B processor could switch right into 
 multiplying (A + B) by C without waiting for D * E to be computed. A dis- 
 cussion on time holes may be found in Kraska's paper [5] • 
 
 As expected, the number of PEs required by the unweighted case is 
 greater than or equal to the number of processors needed by the weighted. The 
 argument is presented above. A processor is not tied up waiting for another PE 
 to finish. It is allowed to proceed to another operation. If this operation 
 is on the same level as the one the PE just finished, it is then taking the 
 place of another PE so that we could, for example, do with one less processor 
 for the weighted case than for the unweighted case. 
 
27 
 
 4 .3 Background for the Second Set of Experiments 
 
 Let us, for simplicity's sake, first consider a polynomial of 
 degree 7 "with unweighted nodes "before tackling the polynomial of degree 30. 
 Muraoka ' s folding method [10] factors out powers of x which are Fibonacci 
 numbers, and the subexpression left after factoring is in the form of the sub- 
 expression to the left of it. For example, consider 
 
 23^+567 
 (10) P r7 (x) = a^ + a_.x + a^x + a^x + a, x + a^x + a r x + a^x 1 
 v/ 7 y 012 3 ^ 5 6 7 
 
 which, using Muraoka 's method, is folded into 
 
 2 3 2 5 
 
 (11) P 7 (x) = a + a x + a p x + (a + a, x) yc + (a + a^x + ax ) x 
 
 2 
 (see Figure k. 14 for the operation tree). In this example, a + a^-x + ax is 
 
 2 
 in the same form as a + a x + ax ; and the former subexpressions take no 
 
 longer to calculate than the latter. The highest powers of x in the equation 
 
 can be calculated by multiplying two previously calculated powers, e.g., 
 
 5 2 3 
 x = x * x , the property of Fibonacci numbers. 
 
 a + a x + a x + (a + a,, x) x + (a + a^x + a„ x ) 
 
 2n „5 
 
 Figure 4.14. Graph for P Using Common Results 
 
28 
 
 With data transmission between processors, as shown in Figure k-.lk, 
 Kraska [6] proposed that no more than nw + nw + 2.75 ln(n) + 1 arithmetic 
 
 operations are needed to compute Pn, a polynomial of degree n where w is the 
 
 a 
 
 weight of an add node and w is the weight of a multiply node. Thus, for P„, 
 to m 7 
 
 the upper bound is computed as being 25 arithmetic operations, 9 fetches, and 
 
 1 store. In actuality, P requires 17 arithmetic operations, 9 fetches, and 
 
 1 store. 
 
 Now let us assume that each common subexpression must be recalculated. 
 
 Then the P tree becomes the following: 
 
 store 
 
 Figure 4.15- P 7 Tree Without Common Results 
 
 By counting the nodes in this tree, we find that there are 21 arithmetic nodes. 
 However, for polynomials with large degrees it would be desirable to have an 
 easier method for calculating the number of nodes rather than having to count 
 them up after forming the trees. 
 
29 
 
 Therefore, we begin "by noting that for a polynomial of degree n 
 there are n + 1 coefficients and an x to he fetched plus 1 store. Then we 
 
 note that x may be calculated in [log_n] + N - 1. where N is the number of 
 
 2 n n 
 
 l's in the binary representation of n, [q] represents the largest integer not 
 
 greater than q [k] . From this, we find that to compute a polynomial of degree 
 
 n, using Muraoka ' s folding method, we need exactly arithmetic operations, 
 
 where is defined recursively as 
 n 
 
 (12) n = n _ x + [log 2 k] + N k + 1 
 
 where k = n - .E T F., F. are elements of F., which is the set of Fibonacci 
 
 numbers which are factored out of the a a term. In other words, k is the 
 
 n ' 
 
 highest power of x remaining inside the parentheses. For example, in P , k = 2. 
 By equation 12, P requires = l8 + 1 + 1 + 1 = 21 arithmetic operations, 
 which agrees with our node count in Figure 4.15. 
 
 k.k Results of the Second Set of Experiments 
 
 Returning to the polynomial of degree 30 with unit weighted nodes, 
 Kraska's upper bound is calculated to be 70 arithmetic operations. These 70 
 plus 32 fetches and 1 store yield a total of 103 nodes in the graph for P* n « 
 
 In reality, there are 99 nodes. Without using common results, there are 133 
 
 nodes calculated from (12) which agrees with the node count done on the tree. 
 
 We formed the graph and the tree of P_. as we have done here for P_, and it is 
 
 30 ( 
 
 with these that we verified the above results. 
 
 Putting these two versions of P through WSCHED with unweighted nodes 
 
 produced the following results. Both had 10 levels of operation. The tree 
 
30 
 
 (133 nodes) required 27 PEs to complete calculation while for the graph (99 
 nodes), l8 processors sufficed. 
 
 The results bear out intuitive conjectures. Graphs, not requiring 
 as many operations, should need fewer processors. However, since the powers 
 of x still must be calculated, our previous discussion stipulates that with 
 extra processors the time required to perform one set of calculations is equal 
 to the time required to execute several copies of the set of calculations in 
 parallel. 
 
 In our discussion of data transmission, we had not mentioned any 
 time allotted to the routing of information. That is, because in our ex- 
 periments we did not take it into consideration. It should be very interesting 
 to see how the added data transmission time would affect our results. 
 
 It would be interesting to compare the amount of time taken to 
 prepare and execute a FORTRAN program using our scheduler (with optimization 
 done by back substitution, recursion, and tree height reduction) with the 
 time taken by the FORTRAN code which is optimized by hand plus the preparation 
 time. 
 
 Duncan Lawrie [9] has started such an experiment for unit weighted 
 nodes. The results are as yet unavailable. 
 
 Another question time did not allow us to investigate is the original 
 question. Does the amount of time saved at execution speed up on our parallel 
 processor machine as compared to execution on a single processor justify the 
 amount of time we needed to prepare programs for execution on our system? In 
 other words, does the program spend less total time on our parallel system than 
 on the single processor? 
 
 We see that a program has been written which implements a compile- 
 time operation scheduler for parallel processors. We have used It to investigate 
 
31 
 
 the distribution process and the tree versus graph question. The results of 
 the investigations support the theories we have presented. 
 
32 
 APPENDIX A 
 
 N 
 
 INCM 
 
 NW 
 
 nt 
 
 Parameter to WSCHED, the number of nodes in the 
 graph including fetches and stores. 
 
 Parameter to WSCHED, the connectivity matrix 
 representing a graph with N nodes. It is a 
 N x N BIT (1) matrix. 
 
 Parameter to WSCHED, the weight vector where 
 
 NW (i) is the weight of node I. It is a H vector 
 
 of integers. 
 
 Parameter to WSCHED, the type vector where NT (i) 
 represents the operator involved at node I. If the 
 operation is a fetch, NT (i) specifies which variable 
 is to be fetched. 
 
 SCHEDPE 
 
 SCEAP 
 
 TIME 
 
 STEPZ 
 
 External vector with a maximum of 5 elements. 
 SCHEDPE (1) = indicates that WSCHED is to find the 
 least upper bound on the number of processors needed 
 to perform all the operations represented by the graph 
 in minimal time. SCHEDPE (l) = k, k a positive 
 integer less than 5> indicates the number of straight 
 schedulings to be made for the numbers of processors 
 specified by SCHEDPE (2) through SCHEDPE (k + l) . 
 
 Indicator based on SCHEDPE (l) . SCRAP = 1 means to 
 calculate the least upper bound, SCRAP = means to 
 schedule the graph using the number of PEs specified 
 in SCHEDPE (i). 
 
 Saves the timer results so that we know the elapsed 
 time from part to part. 
 
 A function provided by the system library which 
 returns the amount of time remaining before the task's 
 time limit is exceeded. 
 
 TABLE 
 
 LVL 
 
 IPTR 
 
 BACK, PTR 
 
 B 
 
 A matrix which for the levels of operation in the 
 relaxed graph indicate the number of occurrences of 
 each of the 6 operators (including fetch and store) . 
 
 A vector such that LVL (i) points to the node which is 
 the first on level I, the terminal level being level 1. 
 
 Vector of N pointers, IPTR (i) = J indicates that 
 relaxed graph is node J in the original graph. 
 
 Work vectors which save pointers during the relaxation 
 process. 
 
 N x N BIT (l) matrix used in the relaxation process. 
 
33 
 
 The number of levels of operation in the relaxed graph. 
 
 The critical path length. 
 
 The sum of weights up through the levels. 
 
 The ratio of IP to ICWM. 
 
 The maximum ratio . 
 
 The greatest lower bound on the number of processors 
 needed to compute in minimal time. It is the least 
 integer not less than EMAX. It is later the number of 
 processors to be scheduled upon in JCOMP. 
 
 Success (0) or failure (l) indicator from JCOMP. 
 
 A procedure internal to WSCHED which develops the 
 schedule . 
 
 Parameter to JCOMP, same as N in WSCHED. 
 
 Parameter to JCOMP, the success-failure indicator. 
 
 Parameter to JCOMP, the number of processors on which 
 to schedule. 
 
 Parameter to JCOMP, the number of levels in the graph. 
 
 The number of nodes assigned so far. 
 
 The number of nodes previously assigned. 
 
 Indicator of nodes having been denied assignment. 
 
 A N x K matrix the jth column of which is the assign- 
 ment list for the jth processor. 
 
 T (i) represents the longest path length from node I 
 to a root node. 
 
 DEPTH (I, l) is the number of nodes in list I. 
 DEPTH (I, 2) is the length of list I calculated from 
 the weights of the nodes in list I. 
 
 An operation's completion time if its node were 
 assigned now. 
 
 The critical path length. 
 
 The set of terminal nodes for the graph being considered. 
 
3h 
 
 R 
 
 B 
 
 P 
 
 LP 
 
 NOTN 
 
 NTNA 
 
 RESTOR 
 
 S 
 
 SP 
 
 NS 
 
 NSP 
 
 SAVEM 
 
 L 
 
 PP, LPP, PW, LL, PAS 
 
 ASIGN 
 
 LU 
 
 U 
 
 X 
 
 NX 
 
 MJ 
 
 LV 
 
 WSJ 
 
 LD, DELTA 
 
 Q plus the nodes which would become terminals after 
 this time. 
 
 N x N BIT (1) matrix which is the relaxed connectivity- 
 matrix workarea. 
 
 The processors whose lists have minimal length. 
 
 The number in P. 
 
 The cardinality of Q. 
 
 The cardinality of R. 
 
 Matrix saving columns of B. 
 
 The set of nodes with max (W. + T.) i in Q or E. 
 
 11 
 
 S n Q. 
 
 The cardinality of S. 
 
 The cardinality of SP. 
 
 A save matrix. 
 
 The node under consideration. 
 
 Save variables. 
 
 The list of nodes which have been assigned thus far. 
 
 min {¥. - m. I n. e (S - SP)} . 
 
 fn. | w. - mr. = lui. 
 
 {i | T i < LU). 
 Cardinality of X. 
 Cardinality of U. 
 LP-NX + NU (0 if S-SP = t) . 
 max {NW. | N. e SP] . 
 
 Indicate if we can still schedule without exceeding 
 the critical path length. 
 
 All variables not listed here are used only in the capacity of work areas 
 
35 
 
 APPENDIX B 
 
 (SUBSCRIPTRANGFtSTRINGPANGE) : 
 TEST:PRGC OPT I CNS( MA IN ) ; 
 
 DCL SCHEDPE(5) FIXEO B IN( 15 ) EXTERNAL; 
 GET LIST(L > ; 
 B: BEGIN; 
 
 DCL M(L,L) BIT( 1) ,NW(L ),NT(L) ; 
 
 M^O'B; /* INITIALIZING THE WHOLE MATRIX TO */ 
 
 GET DATA(M); 
 GET LIST(NW,NT ) ; 
 ON ENDFILE GC TC ZFPC; 
 
 GET LIST(SCHEDPE( 1)1; 
 
 GET LISTM SCHEDPEI I) DC 1=2 TO SCHEDPE ( I) + 1) ) ; 
 GO TO CALL*; 
 ZERO: SCHECPEt 1)=0; /* NO FORCED SCHEDULING */ 
 
 CALLW: CALL W SC HED( L, M , NW ,NT ) ; 
 
 PUT SKIP LISTMTEST FINISHED 1 ); 
 END B; 
 END TEST; 
 
36 
 
 (SUBSCRIPTRANGE): 
 
 WSCHED: PRCC(N,INCP,NW,NT); 
 
 DCL SCHEDPE(5) FIXED BIN(15) EXTERNAL; 
 DCL TIME FIXED BIN<31), STEPZ ENTRY RETURNSC FIXED BINOlll; 
 DCL L,I , J,K,M,N,KK,RATIG,RMAX,NP,ICWM; 
 DCL INCM(N,NIBIT<1 ),NW(N), IPTR ( N I , NT< N I ; 
 DCL TABLE(30,6 IFIXED BIN; 
 DCL LVL<0:N) ; 
 
 DCL SCRAP BITCH, KCDE BIT(l); 
 /* SCHEDPE(l) IS THE NUMBER OF SETS OF PE'S TO BE DONE 
 SCHEDPE(2I TO SCHEDPE< SCEDPE < 1 1 + 1 1 CONTAINS 
 THE NUMBER OF PE'S OF EACH SET */ 
 /* SCRAP = 1 INDICATES CALCULATION OF L.U.B. OF 
 
 THE NUMBER OF PROCESSORS NEEDED TO DO THE MATRIX 
 IN THE LEAST AMOUNT OF TIME. SCRAP = MEANS TO 
 JUST SCHEDULE THE MATRIX USING THE NUMBER OF PE»S 
 WHICH IS READ IN AS DATA */ 
 
 /* RIGHT NCW f THE LISTS MAY NOT EXCEED 32767 IN LENGTH 
 TO CHANGE THIS, ALL THE FIXED BINC15) VARIABLES WILL 
 HAVE TO BE CHANGEO TO FIXED BIN! 311 AND CERTAIN OF 
 THE FORMAT STATEMENTS WILL HAVE TO BE ALTERED ALSO */ 
 /* LEVEL RECOGNITION: THE ALGORITHM IS NECESSARY TO 
 
 PRODUCE A RELAXED GRAPH ( MATRIX! - HOWEVER, IT MUST 
 BE NOTED THAT THIS PROCEDURE INVOLVES ORDER OF N 
 SQUARED CALCULATIONS AND HENCE VERY SLOW */ 
 TIME=STEPZ; 
 
 IF SCHECPE(1)=0 THEN SCRAP=1B; ELSE SCRAP=OB; 
 DO l=i TG N; 
 
 PUT SKIP ECIT((INCM( I, J I DO J=l TO N I I ( < 1001 B< 1 1 1 ; 
 END; 
 
 PUT SKIP EDIT( (NW(II,NT(II DO 1=1 TO N ) I ( < 60 IF ( 3 I I ; 
 DO 1=1 TC N; 
 IPTR (1 1 = 1 ; 
 END; 
 
 /* LAST LEVEL IS ROWS WITH ALL ZEROS - GETTING LAST LEVEL * 
 K,LVL(OI=N+l; 
 Bl: BEGIN; 
 
 DCL (PTR(N),BACK(NI»FIXED BIN; 
 DCL MS,MT, e<N,N|BIT(l); 
 
 /* PTR SAVES A LEVEL OF POINTERS, BACK SPECIFIED WHICH NODE 
 REFERENCED NCDE I */ 
 PTR,BACK=IPTR; 
 DO I=N TO 1 BY -1 ; 
 IF ANY< INCMCI ,* I )=«O f 8 THEN DO; 
 K= K~ 1 * 
 
 IF K-*= IPTR( I) THEN DO; 
 MT=IPTR(K) ; 
 MS,IPTR(KI = IPTR( II ; 
 BACK(MT)=MS; 
 IPTRCI )=MT; 
 BACKC I l=K; 
 
37 
 
 /* END DG LOOP */ 
 
 END; 
 END; 
 END; 
 
 PTR=IPTR; 
 
 M=l; 
 
 IF K=l THEN GO TO BLK; 
 KK,LVL( 1)=K; 
 /* LAST LEVEL GOTTEN, PROCEED WITH ALL THE OTHERS */ 
 JO: DO J=LVL(M-1I-1 TO K BY -1; 
 NP=IPTR( J) ; 
 DC I=N TO 1 BY -1; 
 MT=BACK< I ); 
 
 IF MT<K I INCM( I,NP)=1B THEN DO; 
 DO L=J-1 TC 1 BY -IS 
 
 IF INCM{ I, IPTR(L ))=1B THEN GO TO J2; 
 END; 
 
 KK=KK-1 ; 
 
 IF KK-. = MT THEN DO; 
 PS,PTR(MT) =PTR (KK) ; 
 PTR(KK )=I; 
 BACK( MS>=M; 
 BACK( I )=KK; 
 END; 
 ENDS 
 J2: FND; 
 END JO; 
 
 IPTR=PTR; /* SET NEW POINTERS */ 
 
 M=M+1; /* GOT NEXT LEVEL */ 
 
 IF K=KK THEN GO TO BLK; 
 K,LVL(M) =KK; 
 J3: IF K>1 THEN GO TO JO; 
 BLK:LVL(M) = 1 ; 
 /* GET THE RELAXED MATRIX 
 DO 1=1 TO N; 
 fi( PACK( N,*)=INCM(I ,*> ; 
 FNC; 
 
 DC 1=1 TC N; 
 INCM(*,RACK( I) ) = B(*,I) ; 
 END: 
 
 DO 1=1 TO N; 
 
 PUT SKIP EDIT( (INCMU ,J) DO J=l TO NM 
 
 ((100) flllll; 
 END; 
 
 PUT SKIP DATA( IPTR,BACK) ; 
 END Bl ; 
 
 /* M LEVELS ESTABLISHED, INDEX IN LVL - LAST LEVEL FIRST 
 NCW CCMPUTE THE SUMS AND RATIOS OF THE WEIGHTS AND 
 THE PATHLENGTHS */ 
 BLOCK: BEGIN; 
 
 /* GET NEXT LEVEL */ 
 */ 
 
38 
 
 DCL IL,JL,LL,IKW,IP,RMAX,LEV,ICW(N); 
 TABLE=0; 
 
 ICW=0; 
 ip=o; 
 
 ICWM=0; 
 
 rmax=o; 
 
 DO LEV = M TO 1 BY -1; 
 
 LL=LVL(LEV-1)-1; 
 
 DC IL=LVL<LEV) TO LL; 
 
 K=NT( IPTR( ID) ; 
 IF K<11 THEN DO; 
 /* THEN IT IS ONE OF THE FOUR OPERATORS */ 
 
 K=K-3; 
 
 TABLE(LEV,K)=TABLE(LEV,K)+l; 
 
 END; 
 ELSE IF K=ll THEN TABL E ( LEV, 5 )=T ABLE ( LEV, 5 Kl ; 
 /•* IT WAS A STORE INSTRUCTION */ 
 
 ELSE TABLE(LEV,6 )=TABLE(LE V, 6 >+l ; /* A FETCH */ 
 ICW(IL)=NW(IPTR(IL) ); 
 IF ICWM<ICW(IL) THEN ICWM= IC W( IL ) ; 
 IP = IP+ICkUL); 
 END; 
 
 DO IL=LVL<LEV)-1 TO 1 BY -1; 
 IKW=0; 
 
 DO JL=IL+1 TO LL; 
 IF INCMUL, JU= , l t B THEN 
 
 IF ICW(JL)>IKW THEN IKW=ICW(JL); 
 END; 
 
 IK«=NW( IPTR( IL ) )*IKW; 
 
 IF IKWMCWUL) THEN ICW(IL)=IKW; 
 
 IF ICWUDMCWM THEN I CWM=ICW( I L ) ; 
 
 END; 
 
 RATIO=IP/ICWM; 
 
 PUT SKIP DATA(RATIO) ; 
 
 IF RMAX<RATIC THEN RMAX=RATIO; 
 
 END; 
 
 /* NP IS THE NUMBER OF PROCESSORS NEED TO DO THE MATRIX IN 
 
 THE SHORTEST AMOUNT OF TIME */ 
 
 PUT SKIP<2> LISTt'THERE ARE • ,M, • LE VELS « ) ; 
 PUT SKIP DATA((LVL(I) DO 1=1 TO M)); 
 NP=CEIL(RMAX); 
 PUT SKIP EDIT( 'NUMBER OF PROCESSCRS NEEDED FOR •, 
 
 •CONNECTION MATRIX • ,NP ) ( A, A ,F{ 6 ) ) ; 
 
 PUT SKIPC2) LISTC THE TABLE OF TYPES :•); 
 
 PUT SKIP EOIT<«LEVEL», •♦•,»-•, •*•,•/•, 'STORE*, 
 •FETCH* ) < A,X<7), A,X< 10) ,A,X(9),A,X< 10),A,X(8) , 
 A,X(5) ,A); 
 
 PUT SKIP(O) EDIT((61 )•_• )(X(7) ,A); 
 
 CO 1=1 TC f; 
 
 
39 
 
 PUT SKIP E 
 (F<5) 
 END; 
 END BLOCK; 
 
 PUT SKIP EDIT( 
 
 IF S 
 
 NP= 
 
 CAL 
 
 END 
 
 ELSE 
 
 AGA 
 
 IF 
 
 NP 
 
 GG 
 
 EN 
 
 END; 
 
 JCCM 
 
 PUT 
 
 /* 
 
 DCL 
 
 CRAP=OB THEN D 
 
 SCHEDPE( I) ; 
 
 L JCCMP(N,KODE,NP,M) ; 
 
 CIT(I,» | • ,(TABLE( I, J) DO J= 1 TO 6)) 
 
 ,a, <6)F(ion; 
 
 •TIME ELAPSED = • ,T I ME-ST EPZ , ■ CENTI SECONDS* ) 
 
 (X(70) ,A,F(16) ,A); 
 
 C 1=2 TO SCHEDPE(1H-1; 
 
 DO; 
 IN: CALL JCOMP 
 
 KODE=» I'B THE 
 =NP+l; 
 
 TO AGAIN; 
 
 D; 
 
 (N,KODE,NP,M) ; 
 N DO; 
 
 P: PR 
 
 SKIP 
 
 DECLA 
 
 (REST 
 
 PW(N) 
 
 CCL 
 DCL 
 DCL 
 /* 
 
 / + 
 
 AS = 
 PAS= 
 DENY 
 PP = 
 LI ST 
 I* R 
 
 /* 
 
 T=o; 
 
 PUT 
 
 /* 
 /' 
 
 Pi: 
 
 <LU,M 
 RATIO 
 T(N) 
 
 TIME = 
 STFP 
 
 OC(N,KODE 
 
 LIST(»N £ 
 
 RAT IONS 
 
 OR(N,K) ,B 
 
 ,LPP,PP(K 
 
 LM,M,AS» 
 
 NS,NSP,L 
 
 P(K) ,Q(N 
 
 MI »WSJtD 
 
 INIFIXED 
 
 fSAVFHN* 
 
 FIXED BIN 
 
 */ 
 STEPZ; 
 41 +/ 
 
 F,K,LM ); 
 
 K« ,N,K) ; 
 
 */ 
 (N,N>) BITU), (DI (N) ,LIST (N,K ),DEPTH<K,2) ,R(N), 
 If 
 
 ASIGN(N), MN,MJ, 
 
 DtC,LV,NSPCtLWD,REST(K) ,MStNU,NX,U(N) ,X(N) , 
 ),S(N),SP(N),W(N)»LP,I ,J f K, L,NOTN,NR,BP, 
 ELTA,NN,KL,NTNA) FIXEO BIN; 
 BINI31 ),KODEF BIT( 1) ; 
 
 MBIT(1),PAS FIXED BIN,LL,DENY BITU); 
 , LEVEL, MT; 
 
 o; 
 
 = '0« B; 
 
 ; LPP=0; 
 =0; 
 
 OUT INE TO CALC 
 
 NODE I ANC ITS 
 
 THE Dl ROUTIN 
 
 SKIP LISTCSTA 
 STARTING DOWN 
 LM IS THE NUVB 
 DC LEVEL=LM-1 
 MS = LVL(LEVEL) ; 
 MT=LVL(LEVEL-1 
 DC J=MS TC MT; 
 DO 1=1 TC MS-1 
 
 /+ THE U OF NODES ASSIGNED SO FAR "/ 
 /* THE SAVED ¥ OF NODES ASSI GNED-PRE VI OUS AS 
 /« 1 IF A NODE HAS BEEN DENIED ASSIGNMENT */ 
 PW=0; /* THESE ARE THE SAVE VARIABLES */ 
 
 ULATE T( I ) , THE LONGEST PATH BETWEEN 
 
 SET OF STARTING NODES */ 
 
 E CAN BE MOVED TO AN OUTTER PROCEDURE LATER 
 
 PTING DOWN GRAPH* ) ; 
 THE GRAPH */ 
 
 ER OF LEVELS */ 
 
 TO 1 BY -l; 
 
 )-l; 
 
 */ 
 
ho 
 
 IF INCM( I, J)=«1 , B THEN DO; 
 MN=NW< IPTR(IM+T(I ) ; 
 -IF MN> T(J) THEN TU) = MN; 
 
 END; 
 END Dl; 
 
 PUT SKIP CATA(T); 
 
 /* DEPTH(I,1)=# OF ENTRIES OF STACK I */ 
 /* DEPTH(I,2)= DEPTH OF STACK I */ 
 
 PUT SKIP LISTCACTUAL CRITICAL PATH LENGTHS ICWM); 
 C=ICWM; /* CRITICAL PATH LENGTH */ 
 
 DEPTH=0; 
 W=0; 
 
 kodef=«o»b; 
 di=o; 
 
 B=INCM; /* B IS THE WORK CONNECTIVITY MATRIX */ 
 
 NR=0; /* # OF COLUMNS IN RESTOR */ 
 
 CO I = LVL (1) TO N; 
 
 W( I)=NW(IPTRU>) ; 
 
 END; 
 
 /* */ 
 
 /* STEP #2 */ 
 
 S2:MIN=99999999; 
 
 DC 1=1 TC K; 
 
 IF MIN>DEPTH( 1,2) THEN DO; 
 
 LP = 1; 
 
 P(1)=I ; 
 
 MIN=DEPTH( 1,2) ; 
 
 GO TO S21; 
 END; 
 IF MIN=DEPTHU ,2 ) THEN DC; 
 
 LP=LP+l; 
 
 P(LP)=I; 
 END; 
 
 S21: END; 
 DO 1=1 TC LP; 
 IF DEPTH(P( I »,2)-.= THEN DO; 
 
 L = LIST(DEPTH(P(I),1) ,P{ I)) ; 
 
 IF L<0 THEN GO TO S22; /* SKIP OVER DUMMY VARIABLES */ 
 
 B<*,L)=»0«B; /* ELIMINATE THE COLUMN FROM B */ 
 
 END; 
 S22: END; 
 /* */ 
 
 /* STEP #3 */ 
 
 NOTN=0; /* * OF TERMINALS, I.E. 101 */ 
 
 DO 1=1 TO n; 
 
 /* HAS THE NODE BEEN ASSIGNED ALREADY? */ 
 
 DO J=l TO AS; 
 
 IF I=ASIGN(J> THEN GO TO S3; 
 FND; 
 
in 
 
 IF 
 
 S3 
 IF 
 
 /♦ 
 
 ANY(B(I,*n = '0«B THEN DO; 
 
 NCTN=NCTN+1; 
 
 Q(NOTN J = l; 
 
 ENC; 
 : END; 
 NGTN=0 THEN DO; 
 
 MIN=99999999; 
 
 PO I=LP+1 TO K; 
 
 IF MIN > DEPTH(I,2) THEN MI N=DE PTH ( I ,2 > ; 
 
 END; 
 
 CO 1=1 TO LP; 
 
 KL=DEPTH(I ,2)-MIN; 
 
 IF KL=0 THEN GO TO S31; 
 
 MN,DEPTH< I ,1 )=DEPTH( 1,1 »♦!; 
 
 DEPTH( I,2)=M IN ; 
 SETTING UP THE DUPMY BLOCKS,- THE DEPTH OF THE X BLOCK */ 
 
 LIST(MN, I)=KL; 
 S31: END; 
 
 S2 
 
 /* 
 
 IR 
 
 #4 I 5 
 
 
 /* RESTORATION */ 
 
 GO TO 
 ENC; 
 NTNA=0; 
 /* 
 
 I* STEPS 
 J=l; 
 
 S4: CO 1=1 TC K; 
 IF I=P(J» THEN DC; 
 IF J<LP THEN J = JU ; 
 GO TC ENDS*; 
 ENC; 
 IF (DEPTH( I ,2)-=0) THEN DC; 
 
 L = LIST(DEPTH(I ,1),I) ; 
 
 IF L<0 THEN GO TO ENCSA; 
 
 NR=NP+l; 
 
 RESTOR( SNR)=B(»,L ) ; 
 
 REST(NR) =L; 
 
 6(«,L)= , 0»B; 
 
 END; 
 ENDS4: END S4 ; 
 DO 1=1 TO N; 
 DO J=l TO AS; 
 IF I=ASIGN(J) 
 END; 
 IF ANY(R(I ,* M^O'B THEN DO; 
 
 NTNA=MNA+1 ; 
 
 R(NTNA)=I ; 
 
 END; 
 S5: END; 
 CO 1=1 TO NUTN; 
 V»(Q( I )) =NW( IPTRIQI im+DEPTH(P(L)t2l 
 
 /* GET THE REST OF R */ 
 
 THEN GO TO S5; 
 
14-2 
 
 END; 
 
 /* WANT SUCCESSORS CF NCRUM, ROW R( I) TELLS */ 
 IF NOTN-NTNA=0 THEN GO TO S06; 
 DO 1=1 TC NTNA; 
 
 DO J=l TC NOTN; /* FOR NODES IN R-Q */ 
 IF R(I)=Q(J) THEN GO TO S51; 
 END; 
 MN=0; 
 
 DC J=R(l )+l TO N; 
 
 IF INCM<R( I) ,J)=»1«B THEN IF W(J)>MN THEN MN=W(J); 
 /* GET THE PREDECESSOR FROM THE ORIGINAL GRAPH */ 
 END; 
 
 W<R( I) )=NW( IPTR(R(I> ll+MN; 
 S51: END; 
 /* */ 
 
 /* STEP #6 */ 
 
 S06: CALL SU86 ( M ,R ,NTNA , S, NS I ; 
 NSP=0; 
 
 DO 1=1 TO NS; 
 
 DO J=l TO NOTN; 
 IF S( I )=Q( J) THEN DO; 
 NSP=NSP+l; 
 SP(NSP)=S(I); 
 GO TO S6; 
 END; 
 END; 
 S6: END; 
 IF M<C THEN M=C; 
 LD=M-C; 
 IF SCRAP=»1 , B £ LD>0 THEN DO; 
 
 KODEF=«l»B; /* FAILURE SIGNALED */ 
 
 PUT SKIP EDIT(«TIME ELAPSED = • , TIME-STEPZ , • CEN TI SECONDS' ) 
 
 <X(70) ,A,F(16) ,A) ; 
 RETURN; 
 END; 
 IF LD >0 ThEN IF DENY=»1»B THEN DO; 
 DC J=l 7C K; 
 DC I=PAS+1 TO AS; 
 IF LIST(DEPTH( J , 1 ) , J ) = AS IGN ( I ) THEN DO; 
 
 LIST(DEPTH(J ,1), J)=0; /* REMOVE NODE FROM LIST */ 
 DEPTH ( J,l) = DEPTH(J f 1 )-l; 
 
 IF DEPTH(J,1)<0 THEN DEPTH ( J, 1 ) =0; /* PRECAUTION ONLY */ 
 DEPTH<J,2)=DEPTH(J,2)~NW(IPTR(ASIGN( 1)1); 
 GO TO S61; 
 END; 
 END; 
 /* REMOVE DUMMY BLOCKS IF THEY EXIST ON THE TOP OF THE LISTS */ 
 S6l: IF LIST(DEPTH(J,l ) , J )<0 THEN DO; 
 
 DEPTH( J f 2)=DEPTH(J f 2)>LIST(DEPTH(J,l) f J»; 
 
^ 
 
 LIST(DEPTH(J,1 ), J) = 0; 
 DEPTH ( J,l )=DEPTH(J,1 )-l; 
 END; 
 END; 
 
 AS=PAS; /* SET ASSIGNMENT COUNTER AT PREVIOUS */ 
 
 B=SAVEM; /* RESET MATRIX */ 
 
 P=PP; /* RESET P VECTOR */ 
 
 LP=LPP; /* RESET P COUNTER */ 
 
 V»=PW; /* RESET U VECTOR */ 
 
 DENY= , , B; /* RESET DENY BIT TO OFF */ 
 
 L=LL; /* SET NODE */ 
 
 /* PUSH NODE INTO THE STACK */ 
 
 NN,DEPTH(P(LP) , 1 I =CEPTH( P ( LP ) , 1 ) +1 ; 
 LIST(NN,P< LP))=L; 
 
 PUT SKIP ECIT(»STACK IN 6- 8 • , P ( LP ) , • # • »L > ( A , F ( 5 ) ,X( 2 ) ) 
 DEPTH(PILP) ,2) =W(L) ; 
 
 PUT SKIP LIST( 'STACK* ,P(LP), •# IN 6-8»,L); 
 
 AS = AS + 1*. 
 
 ASIGN(AS)=L; 
 LP=LP-1; 
 B(Lt*l='0'B5 
 GO TC S9; 
 END; 
 /* */ 
 
 /* STEP 47 */ 
 C=M; 
 
 IF NS-NSP=0 THEN LV=0; 
 ELSE DO; 
 
 NSPC=l; 
 
 LU=9S999999; 
 
 DO 1=1 TO NS; /* GET LU BY FINDING S-S» */ 
 
 IF NSPONSP THEN GO TO S70; 
 
 IF S( I ) = SP(NSPC) THEN DO; 
 
 NSPC=NSPC+l; 
 
 GO TC S700; 
 
 END; 
 S70: LWD=W(S(I ))-NW( IPTR(S( n n; 
 
 IF LWD<LU THEN LU=LWD; 
 S700: END; 
 
 S7: NU=0; /* |U| */ 
 
 NX=0; /* I X | */ 
 
 DO 1=1 TO N; /* CCMPUTE LU,U,X »/ 
 LWD=W( I)-NW< IPTR( II); 
 IF(LWD-. = LU) THFN GO TO ST70; 
 NU=NU+l; 
 
 u(NU) = i ; 
 
 ST70: END; 
 
 DC 1=1 TC K; 
 
 IF DEPTH( I ,2 )>LU THEN GO TO ST71; 
 
kk 
 
 END; 
 
 TO ST8; 
 
 IN S») */ 
 
 I))); 
 
 CO; 
 
 nx=nx*1; 
 
 X(NX)=I; 
 
 ST71: END; 
 
 LV=LP-NX+NU; 
 
 IF LV<0 THEN LV=0; 
 
 IF LP<=LV|NSP=0 THEN GO 
 DO WHILE(LP>LV6NSP>0); 
 /* GET MAX(W( I) |N( I) IS 
 
 mj=o; 
 
 DO 1=1 TC nsp; 
 MN=NW(IPTR(SP< 
 IF HJ<MN THEN 
 MJ=MN; 
 MI=I; 
 END; 
 END; 
 /* PUSH MMI) INTO THE LIST */ 
 
 MN,DEPTH(P(LP),1)=DEPTH(P( LP), 1)4-1; 
 LIST(MN,P(LP))=SP<MI ) ; 
 PUT EDIT( 'STACK* ,P(LP) , • #• , SP(MI )) 
 
 <X<5),A,F<5),X(2),A,F< 5) ); 
 /* ELIMINATE THE SP(MI) ROW FROM B */ 
 
 /* G* = G« - MSP (Mil ) */ 
 
 AS=AS+1; 
 
 ASIGN(AS) = SP(MI) ; 
 mn=o; 
 
 W(SP(MI) ) t DEPTH (P( LP) , 2) =DEPTH ( P ( LP) , 2 ) +NW ( I PTR( SP (MI ) )); 
 DO 1=1 TC NSP; /* S* = S* - N(MI) */ 
 
 IF I=MI THEN GC TO ST72; 
 MN=MN*l; 
 SP(MN)=SP( I) ; 
 ST72: ENC; 
 LP=LP-l; 
 NSP=NSP-1; 
 END; 
 GO TO S9; 
 /* */ 
 
 /* STEP #8 */ 
 
 ST8: CALL SUB6( M ,Q ,NCTN, SP, NSP) ; 
 BP=LP; 
 PUT SKIP LIST( «S" »,NSP); 
 PUT LISTUSP(I) DO 1=1 TC NSP)); 
 /* FIND WSJ=MAX(NW(I ) |N( I) IS IN S'l */ 
 WSJ=0; 
 
 DO 1=1 TC NSP; 
 MN=NW(IPTR(SP( I) )) ; 
 IF WSJ<MN THEN WSJ=MN; 
 END; 
 
^5 
 
 MN=NSP; 
 
 MT=0; 
 ST80: DO J = l TC MN WHILE (LP>0 & NSP>0 ); 
 
 L=SP( Jl; 
 IF NW< IPTR(L) )=WSJ THEN DO; 
 
 SP( J)=0; 
 
 M T= 1 ; 
 
 NSP=NSP-1; 
 
 DELTA=W(L)-LU; END; 
 
 IF DELTA<=0 | LP-LV>0 THEN /* PUSH INTO STACK */ 
 
 DC; 
 
 NN, DEPTH (P( LP) ,1 )=D6 PTH( P(LP)tll + l! 
 LIST(NN,P(LP) >=L; 
 PUT SKIP LIST( 'STEP 8' ,L); 
 DEPTH(P(LP) ,2)=W(L ); 
 
 PUT SKIP LISTCSTACK' ,P( LP) ,• 4 IN 8*,L); 
 
 AS=AS+l; 
 
 ASIGN( AS)=L; 
 LP=LP-l; 
 B<L,*) = , , B; 
 END; 
 ELSE IF DELTA<=DI(L) |DI( L ) =0 THEN DO; 
 DO 1=1 TO NR ; 
 B<* tRESTU ))=RESTOR(*, I ) ; 
 END; 
 
 NR=0; 
 
 
 
 di (D=delta; 
 
 
 
 PP=P; 
 
 /* 
 
 SAVE P VECTOR +/ 
 
 LPP=LP; 
 
 /* 
 
 SAVE P COUNTER */ 
 
 pw=w; 
 
 /* 
 
 SAVE W VECTOR */ 
 
 SAVEM=B; 
 
 /* 
 
 SAVE MATRIX */ 
 
 pas=as; 
 
 /* 
 
 SAVE ASSIGNED COUNTER 
 
 ll = l; 
 
 /* 
 
 SAVE NODE +/ 
 
 DENY=» 1»B; 
 
 /* 
 
 SET DENY BIT */ 
 
 END; 
 
 
 
 END ST80; 
 
 
 
 /* COMPRESS S« */ 
 
 
 
 IF MT=1 I NSP-=0 
 
 THEN DO; 
 
 j=o; 
 
 
 
 DO 1=1 TO MN; 
 
 
 IF SP(I)-=L 
 
 THEN DC; 
 
 J=J+1; 
 
 
 
 SP(J)=SP(I ); 
 
 
 END; 
 
 
 
 END; 
 
 
 
 end; 
 
 
 
 IF LP=RP THEN CO; 
 
 
 
 MIN=99999999; 
 
 
 
 1 = 1; 
 
 
 
 */ 
 
ke 
 
 00 MN=1 TC K; 
 
 IF MN=PU)€I<=LP THEN DO; 
 
 1=1+1; 
 
 GO TO ST805; 
 END; 
 
 IF MIN>DEPTH(I ,2) THEN MIN=DEPTH U ,2 ) ; 
 
 ST805: END; 
 
 DC 1=1 TC LP; 
 
 KL=DEPTH(P(I ),2)-MIN; 
 
 IF KL=0 THEN GO TO ST806; 
 
 MN.DEPTHCPU ),1)=DEPTH<P(II,1)+1; 
 
 LIST(MN,P( I) )=KL; 
 
 DEPTH(P( I) ,2 ) = MN; 
 ST806: END; 
 END; 
 
 /* */ 
 
 /* STEP #9 */ 
 
 S9: IF ANY(B)= l O«B THEN GO TC SIO; 
 S95: DO 1=1 TO NR ; 
 
 B<* ,REST(I l)=RESTGR(*,I ) ; /* RESTORE COLUMNS */ 
 
 END; 
 
 NR=0; /* HAVING RESTORED, SET COUNTER TO ZERO */ 
 
 GO TO S2; 
 /* */ 
 
 /* STEP #10 */ 
 SIO: L=0; 
 
 DC J=l TC N; 
 
 DO 1=1 TC AS; 
 
 IF ASIGN(I)=J THEN GO TC S101; 
 
 END; 
 
 L = L + NW(IPTR(J) ) ; 
 S101: END; 
 
 nn=o; 
 
 ms=o; 
 
 DO 1=1 TO K; 
 
 MN=DEPTH(I,2 )i 
 
 IF MN> MS THEN MS=MN; 
 
 NN=NN+MN; 
 
 END; 
 
 NN=K*MS-NN; 
 
 IF NN<L | AS<N THEN GOTC S95; 
 
 NU=0; 
 
 DO 1=1 TC K; 
 
 IF NU<DEPTH( 1,2) THEN NU=DEPTH( I ,2 I ; 
 
 END; 
 
 PUT SKIP(2) EDIT(«THE PROCESSORS ARE SCHEDULED FOR «,NU, 
 
 • TIME UNITS' KA,F(8),A) ; 
 /" SET UP THE ORIGINAL NODE NUMBERING INTO LISTS */ 
 S102: DO J=l TC K; 
 
^7 
 
 L=OEPTH< J,l » ; 
 MN=0; 
 
 DC 1=1 TC L; 
 IF LIST( I.JKO THEN DO; 
 MN=MN-LIST(I, J) ; 
 GO TC SI 03; 
 END; 
 LIST( I,J)=IPTR(LIST< [,JM; 
 S1C3: ENO; 
 
 PUT SKIP(2) EDIT(»LIST», J, • : • )(A f F(5l,AI ; 
 
 PUT SKIP LIST( <LIST( I, J)CO 1=1 TC LM; 
 
 PUT SKIP ECIT(«LENGTH OF LIST = • , DEPTH( J, 2) ) ( A, F ( 6 ) ) ; 
 
 IF DEPTH(J t 2KNU THEN MN=NU-DEPT H ( J , 2 H-MN; 
 
 RATIO=MN/NU*100; 
 
 PIT SKIP EDIT(»? OF IDLE TIME IN THIS LIST = • , RATIO) 
 
 (A,F (10, 611; 
 END S102; 
 
 KODEF=«0«B; /* SUCCESS */ 
 
 PUT SKIP ECIT('TIME ELAPSED = • , T I ME- STEPZ , * C EN TI SECONDS' ) 
 
 <X<70) ,A,F(16) ,A); 
 RETURN; 
 SUB6: PRCC(N,R,NTNA,StNS); 
 DCL R(N) FIXEC RIN,S(N) FIXED BIN; 
 CCL I; 
 M=0; 
 
 NS=0; 
 
 DC 1=1 TC NTNA; 
 
 MN = W(R( I ) )+T(R ( I ) » ; 
 
 IF M<MN THEN M=MN; 
 
 END; 
 
 DC 1=1 TC NTNA; 
 
 IF T(R(I ) )+W(R< I ) )=M THEN DO; 
 NS=NS+l; 
 S(NS)=R( I) ; 
 ENC SUB6; 
 ENC JCCMP; 
 END WSCHED; 
 
K8 
 
 LIST OF REFERENCES 
 
 [1] J. C. Han, "Tree Height Reduction for Parallelism Processing of Blocks 
 of FORTRAN Assignment Statements/' (M.S. Thesis) University of 
 Illinois at Urbana- Champaign, Department of Computer Science Report 
 No. k-93, 1972. 
 
 [2] L. C. Hobbs and D. J. Theis, "Survey of Parallel Processors Approaches 
 and Techniques, " Parallel Processor Systems, Technologies, and 
 Applications, Spartan Books, Washington, D.C., 1970- 
 
 [3] S. M. Johnson, "Optimal Two-and Three-Stage Production Schedules with 
 Set-Up Time Included," ed. by J. F. Muth, et al., Industrial 
 Scheduling , Prentice -Hall, Inc., N.J., 19&3- 
 
 [k] D. E. Knuth, The Art of Computer Programming , vol. 2, Addi son-Wesley, 
 N.Y., 190^ 
 
 [5] P. W. Kraska, "Parallelism Exploitation and Scheduling, " (Ph.D. Thesis) 
 University of Illinois at Urbana- Champaign, Department of Computer 
 Science Report No. 518, 1972. 
 
 [6] P. W. Kraska, "Utilization of Finite Parallel Machines, " a paper written 
 for preliminary examination, University of Illinois at Urbana- 
 Champaign, 1971 • 
 
 [7] D. J. Kuck, private meeting. 
 
 [8] D. J. Kuck, Y. Muraoka, and S. C. Chen, "On the Number of Operations 
 Simultaneously Executable in FORTRAN- Like Programs and Their 
 Resulting Speed-Up, " to be published in IEEE Transactions on 
 Computers . 
 
 [9] D. H. Lawrie, private meeting. 
 
 [10] Y. Muraoka, "Parallelism Exposure and Exploitation in Programs," (Ph.D. 
 Thesis) University of Illinois at Urbana- Champaign, Department of 
 Computer Science Report No. ^-2^, 1971 • 
 
 [11] R. E. Strebendt, private meeting. 
 
 [12] L. A. Swanson, "Simulation of a Tree Processor," (M.S. Thesis) University 
 of Illinois at Urbana- Champaign, Department of Computer Science 
 Report No. 503, 1972. 
 
LIOGRAPHIC DATA 
 ET 
 
 1. Report No. 
 
 UIUCDCS-R-72-5^5 
 
 3. Recipient's Accession No. 
 
 itle and Subtitle 
 
 cheduling on Parallel Processors for Weighted- Node Graphs 
 
 5. Report Date 
 
 October, 1972 
 
 uthor(s) 
 
 anet Sau-Ying Chin 
 
 8. Performing Organization Rept. 
 
 °' UIUCDCS-R-72- [ 5if5 
 
 erforming Organization Name and Address 
 
 niversity of Illinois at Urb ana -Champaign 
 epartment of Computer Science 
 rbana, Illinois 6l801 
 
 10. Project/Task/Work Unit No. 
 
 11. Contract/Grant No. 
 
 US NSF GJ 27^6 
 
 Sponsoring Organization Name and Address 
 
 ational Science Foundation 
 ashington, D. C. 
 
 13. Type of Report & Period 
 Covered 
 
 Master's Thesis 
 
 14. 
 
 supplementary Notes 
 
 \bstracts 
 
 arallelism is inherent in FORTRAN programs. Therefore, if we had parallel processors 
 ith arithmetic units which can do any of four operations (add, subtract, multiply, 
 nd divide) as well as initiating two other operations (fetch and store), then we will 
 ave greater speed and throughput than a conventional serial processor. 
 
 his paper will present some of the aspects behind the project undertaken to imple- 
 ent Kuck's proposal [7] and describe a compile-time operation scheduler for parallel 
 rocessors. In addition, the experiments which were run with the scheduler and their 
 esults will be discussed. 
 
 Cey Words and Document Analysis. 17a. Descriptors 
 
 ask Scheduling 
 irallel Processors 
 
 Identifiers/Open-Ended Terms 
 
 COSATI Fie Id /Group 
 
 f'ailability Statement 
 
 Release Unlimited 
 
 19. Security Class (This 
 Report ) 
 
 UflCLASSlFlliP 
 
 20. Security Class (This 
 
 Page 
 UNCLASSIFIED 
 
 21. No. of Pages 
 
 22. Price 
 
 NTIS-1B ( 10-70) 
 
 USCOMM-DC 40329-P7 1 
 

Mj 
 
ii! I III Mll?i!» Y of ilun °is-urbana 
 
 3 0112 054457137