Report No. UIUCDCS-R-72-545

SCHEDULING ON PARALLEL PROCESSORS FOR WEIGHTED-NODE GRAPHS

by

Janet Sau-Ying Chin
B.S., University of Illinois, 1970

THESIS submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, October 1972.*

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

*This work was supported in part by the National Science Foundation under Grant No. US NSF GJ 27446.

ACKNOWLEDGMENT

The author expresses her gratitude to Professor David J. Kuck of the Department of Computer Science of the University of Illinois at Urbana-Champaign, without whom this thesis would not have been successfully completed. Duncan Lawrie and Paul Kraska are much appreciated for their help and guidance. Special thanks to P. Budnik, S. C. Chen, R. Towle, and R. E. Strebendt for their help.

TABLE OF CONTENTS

1. INTRODUCTION
2. ASPECTS
   2.1 Introduction to the Scheduling Problem
   2.2 General Description of Algorithm a
   2.3 Preparation of FORTRAN Programs
3. A DESCRIPTION OF WSCHED
   3.1 A General Description of WSCHED
   3.2 The Relaxation Process
   3.3 General Description of JCOMP
4. EXPERIMENTS AND RESULTS
   4.1 Background Discussion of the Experiments
   4.2 Discussion of the Distribution Results
   4.3 Background for the Second Set of Experiments
   4.4 Results of the Second Set of Experiments
APPENDIX A
APPENDIX B
LIST OF REFERENCES

LIST OF TABLES

1. For Weighted Nodes
2. For Unweighted Nodes

LIST OF FIGURES

1. An Example of Partial Ordering
2.1 A Connection Graph for (3)
2.2 A Relaxed Graph
2.3 Another Relaxed Graph
2.4 The Relaxed Connection Matrix
2.5 Weight Vector
2.6 Type Vector
3.1 The Original Graph
3.2 A Relaxed Graph With Four Levels of Nodes
3.3 A Relaxed Graph Representing the Relaxed Matrix
4.1 Unweighted Case
4.2 Weighted Case
4.3 Unweighted Case Distributed
4.4 Weighted Case Distributed
4.5 Unweighted Case
4.6 Weighted Case
4.7 Unweighted Case Distributed
4.8 Weighted Case Distributed
4.9 Unweighted Case
4.10 Weighted Case
4.11 Unweighted Case Distributed
4.12 Weighted Case Distributed
4.13 The 6 FORTRAN Statements
4.14 Graph for P_7 Using Common Results
4.15 P_7 Tree Without Common Results

1. INTRODUCTION

In their search for increased computer speed and throughput, Hobbs and Theis [2] maintain that parallel processing is a solution for problems with inherent parallelism. This property of parallelism allows various operations to be performed concurrently. To demonstrate, assume that the following two FORTRAN statements are given.

(1) F = A + B + C + D
(2) T = V + X + Y + Z

Statement 2 does not rely on statement 1. Therefore, they may be executed concurrently. On two parallel processors, then, both statements would be completed in the amount of time needed to process one statement. We can also consider parallelism on the operator level. In statement 1, A + B can be computed at the same time C + D is being calculated. Increasing the number of computers would again decrease the amount of time needed to complete the operations.
In other words, parallel processing, in our example, would increase computer speed and throughput. Parallelism is inherent in FORTRAN programs according to David Kuck, Yoichi Muraoka, and S. C. Chen [8]. Therefore, if we had parallel processors with arithmetic units which can do any of four operations (add, subtract, multiply, and divide) as well as initiate two other operations (fetch and store), then we would have greater speed and throughput than a conventional serial processor.

This paper will present some of the aspects behind the project undertaken to implement Kuck's proposal [7] and describe a compile-time operation scheduler for parallel processors. In addition, the experiments which were run with the scheduler and their results will be discussed.

2. ASPECTS

2.1 Introduction to the Scheduling Problem

Ever since computers came into being, there have been tradeoffs of one kind or another. The most emphasized seems to be the ratio of cost to speed of computation. Another which deserves some attention is the one between compilation time (which includes any time spent in preparation prior to actual program execution) and execution time. The considerations that follow deal with a fixed number of parallel processors.

The tradeoff concerns the time when scheduling of operations is done. If we execute operations as they are required, then operands must be fetched. As a result, execution time would be longer than if we had some amount of look-ahead so that the operands could have been available at the time the operation was required. Other aspects of this problem are the minimization of processor idle time and the optimization of processor run speeds. These two can be accomplished by establishing for each of the parallel processors a sequence of operations to be done. However, this ordering process would require a longer compilation time.
If we were given statement 1 above, along with two parallel processing elements (PEs) and the four operands already fetched, we may assign the execution of A + B to PE 1 and the execution of C + D to PE 2. When these operations are completed, PE 2 would add the two results. This then describes an order of operations for each of the processors. Furthermore, Paul Kraska has developed an algorithm (algorithm a) which calculates an order of operations in reduced time [5]. The PEs are assumed to be capable of doing addition, subtraction, multiplication, and division. Each processor, when done with its task, does not wait for other processors to finish before it proceeds to another operation. Along with the four arithmetic operations named above, the processors are capable of initiating stores and fetches of data to and from memory, respectively.

The minimal time taken to process a set of operations depends upon how many operations must be performed in sequence, as opposed to operations being completed in parallel, and how long each of the operations in the sequence takes. If we represent the sequences in this set of operations as a tree, each operation would be represented by a node. The time taken to complete an operation corresponds to the weight assigned to the associated node. Since the operations must be done in some order, the operations tree is a partial ordering on nodes. For an illustration of the ordering, consider Figure 1. The operations associated with nodes 2 and 3 are to be completed before the operation represented by node 1 is begun. Nodes 2 and 3 are said to be predecessors of node 1 in the tree. In this ordering nodes 4, 5, and 6 are root nodes, and node 1 is a terminal node.

Figure 1. An Example of Partial Ordering

The "by-demand" scheduling is discussed in a paper by Larry Swanson [12]. He is concerned with the data management problem.
He presents a machine configuration of a tree processor system associated with a number of routing networks. He has programs which simulate various combinations of the log router, the Illiac IV router, the Semmelhaack router, the Batcher network, and the crossbar switch. He discusses the amount of time taken to transfer data to and from memory and between PEs. The time delays influence the size and speed of the tree machine since the larger the machine, the longer the route data must travel.

This paper will discuss the program written to implement Kraska's algorithm a. It will also present some experiments which used the program along with their results.

2.2 General Description of Algorithm a

Starting with a lower bound on the number of processors, m, Kraska's algorithm a provides a way of finding the least upper bound on the number of processors needed to complete the operations represented by an operations tree in the minimal amount of time.

As operations are scheduled, their nodes are placed into lists which correspond to the processors. The lists would, at the end, tell us which operations are to be executed by which PE and in what order. Since the weight of each node corresponds to the amount of time required to complete the associated operation, the lists then indicate when an operation has been executed by the position of the node in a list and the weights of the nodes below it. During scheduling, then, each list has a height.

For each node, n_i, find the latest time it may be executed. This is done by finding the largest sum of the weights of all the nodes between node n_i and a root node, including the weight of the root node. The path length for node n_i at any particular time is the sum of the minimal height of the lists and the largest n_i-to-root weight sum. This path length indicates the time at which the operations on the longest path through the operations tree will be completed. The minimal time is just the longest path length through the tree.
This time is called the critical path length.

To find a schedule which will be completed in minimal time, we consider the terminal nodes in the operations tree. *Insert the node(s) with the maximum path length into the lists. Then consider the tree with the inserted node(s) removed. With the new tree we start all over again at *. If we can continue down to all the terminal nodes and insert them into the lists without getting a path length greater than the critical path length for the operations tree, we will have succeeded in our scheduling for minimal time, and the least upper bound on the number of processors will have been reached. If, however, a path length is calculated which is greater than the critical path length, we must start from the beginning again, but this time with an added processor. Hence, another list is available for scheduling, for a total of m + 1 lists.

Algorithm a also provides a way of scheduling an operations tree for a fixed number of processors regardless of the critical path length. In this case it is necessary to provide a back-up facility so that when an assignment is made which denies assignment to a node awaiting scheduling and which results in a path length longer than the critical path length, we are able to backtrack to the point at which the node was denied assignment, insert it into the lists, and start off from there. This method provides a schedule for reduced time given a fixed number of processors.

No matter how many processors are used, the critical path length still represents the least amount of time necessary for the operations tree to be processed. Thus, if we are to schedule on a greater number of PEs than the least upper bound, the main difference in the results would be a greater percentage of processor idle time than if we scheduled on the least upper bound number of processors.

The weight associated with each node depends upon the time needed to complete the corresponding operation.
Therefore, if we assign the weights such that certain operations take longer than others, each processor, in effect, would seem to proceed independently of the other PEs. On the other hand, if we assign equal weights to all operations, then each processor would start and finish an operation in time with the other processors. We say that the nodes are unweighted or unit weighted since the time spans are all the same.

2.3 Preparation of FORTRAN Programs

A FORTRAN program has many other types of statements besides a simple assignment statement. Functions and subroutine calls, at present, are ignored. Built-in functions are replaced by a number of adds and multiplies. Transcendental functions are replaced by an expression consisting of a sum of products. Arithmetic functions are regarded as subscripted variables. DO loops and IF statements are taken care of by back substitution, recursion, and tree height reduction. These processes are described in Han's paper [1]. After these operations have been applied to the FORTRAN program, it becomes a series of assignment statement blocks which involve only the four arithmetic operations named earlier along with store, fetch, and a transfer-of-control operation. Calls to subprograms and certain other FORTRAN statements like GO TO signal the end of each assignment block [11].

Along the length of each block, variables are back substituted recursively, and variables and sub-expressions are distributed over each arithmetic expression as specified by the Multiplication Distribution Algorithm and the Division Distribution Algorithm as described by Han [1] and Kraska [5]. These two algorithms specify distribution only if the number of operation levels, the number of nodes in the longest path from a terminal node to a root node, will decrease by the distribution process. Thus, the time taken to complete execution of any statement will be reduced, never lengthened, by the process.
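The level-count criterion that drives distribution can be checked mechanically. The sketch below is purely illustrative (Python, with a nested-tuple tree encoding that is an assumption of this sketch, not the connectivity-matrix form the thesis programs use): for A * (B * C * D + E), distributing A to get A * B * C * D + A * E lowers the tree-height-reduced form from 4 arithmetic levels to 3.

```python
# Illustrative encoding only: expression trees as nested tuples
# (op, left, right), with strings as already-fetched operand leaves.

def height(t):
    """Number of arithmetic levels in an expression tree."""
    if isinstance(t, str):              # a leaf operand
        return 0
    _, left, right = t
    return 1 + max(height(left), height(right))

# A * (B * C * D + E), tree-height-reduced: 4 arithmetic levels.
undistributed = ('*', 'A',
                 ('+', ('*', ('*', 'B', 'C'), 'D'), 'E'))

# After distributing A, i.e. A * B * C * D + A * E: 3 arithmetic levels.
distributed = ('+',
               ('*', ('*', 'A', 'B'), ('*', 'C', 'D')),
               ('*', 'A', 'E'))
```

Adding one level of fetches and one store gives the 6-versus-5 level counts reported for statement A in Tables 1 and 2 of Section 4.2; note also that the distributed form performs five operations instead of four, which is why distribution tends to raise the processor count.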
One aspect which should be considered is how the distribution process affects the number of processing elements. Although the number of levels may be decreased by the distribution process, the number of processors needed for executing in the reduced time never decreases. Since distribution introduces operations into the levels, it is often true that more processors are needed than before distribution. Experiments conducted bear out this theory.

After an assignment statement block has been back substituted and distributed, the corresponding trees' heights are reduced, and a connectivity graph is made in the following way. Define all "fetches" to be root nodes. Define all "stores" to be terminal nodes. The rest of the graph is constructed by precedence. As an example, consider the statement

(3) x = b - a * c + d / e * f

The graph for it is shown as Figure 2.1.

Relax this graph in the following manner. First, place all terminal nodes at the bottom level. Secondly, place at the next-to-bottom level all nodes which are predecessors to the terminal nodes only. Place at the third level from the bottom all predecessors of the nodes at the lower levels. Continue in this fashion until all the root nodes have been placed into the graph.

Figure 2.1. A Connection Graph for (3)

The resulting graph is relaxed. Now create the n x n connectivity matrix where the graph has n nodes. The ij entry in the matrix is 1 only if node i precedes node j in the graph. Notice, however, that the relaxed graph will have a node ordering different from the original connection graph if we start numbering at the root level and proceed toward the terminal level. Notice also that permutations on the numbering at each level will also result in a relaxed graph, as illustrated in Figures 2.2 and 2.3. Figure 2.4 shows the connectivity matrix for (3) based on the relaxed graph in Figure 2.2. Notice that the relaxed matrix is upper triangular. This is a consequence of the relaxation process.
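The relaxation procedure just described can be sketched as follows (a hedged illustration: the successor-list encoding and the name relax_levels are assumptions of this sketch, not part of the thesis programs). Terminal nodes receive level 1, and every other node is placed one level above the highest level of any node it precedes.

```python
def relax_levels(succ, nodes):
    """Level of each node: terminals (no successors) at level 1,
    each other node one above the highest level of any node it
    precedes."""
    level = {}
    def lv(n):
        if n not in level:
            outs = succ.get(n, [])
            level[n] = 1 if not outs else 1 + max(lv(s) for s in outs)
        return level[n]
    for n in nodes:
        lv(n)
    return level
```

For a tiny statement x = a + b (two fetches, an add, a store), the store lands at level 1, the add at level 2, and both fetches, the root nodes, at level 3.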
Figure 2.2. A Relaxed Graph

Figure 2.3. Another Relaxed Graph

Figure 2.4. The Relaxed Connection Matrix

Figure 2.5. Weight Vector

Figure 2.6. Type Vector

The row(s) of zeroes at the bottom of the matrix indicate the terminal node(s), while the column(s) of zeroes at the left indicate root node(s).

A weight vector, NW, is made from the weights of the nodes. A type vector, NT, defines the operation involved at each node. Thus, NW_i and NT_i tell how much time is required to complete the operation of type NT_i for node i in the graph. The weight vector for the relaxed matrix is shown in Figure 2.5 with the assumption that fetches and stores take 1 time unit, adds and subtracts take 2, multiplies take 3, and divides take 5 time units. If each fetch is coded with a number greater than 11, symbolizing the variable fetched, add is coded as 4, subtract as 5, multiply as 6, divide as 7, and store as 11, then the type vector is as shown in Figure 2.6. The weight vector is then used to compute path lengths and lists' heights. The type vector is a convenience supplied to provide some statistics on the operations at each level. A table of occurrences is output to show the frequency of each operation at the different levels.

3. A DESCRIPTION OF WSCHED

3.1 A General Description of WSCHED

WSCHED is a PL/I procedure which implements Kraska's algorithm a, described earlier in this paper. It receives as parameters a connectivity matrix along with a weight vector, a type vector, and the dimension N. First, the N x N BIT(1) connectivity matrix is relaxed. (See Sections 2.3 and 3.2 for discussions of the relaxation procedure.) Relaxation produces another result: the number of operation levels in the tree. The levels give us a precedence relation.
If node i is in level m and node j is in level n, m > n, then if there is a path between i and j, i is a predecessor of j. The levels are numbered from bottom to top. Therefore, the level of terminal nodes is level 1.

Second, since the levels of the tree represent a precedence relation and the matrix represents a connectivity relation, we can compute the critical path length and the paths from any node to a root node. From these, we calculate the lower bound on the number of processors needed to complete the operations represented by the matrix and the two vectors. This lower bound is found as follows. Calculate at each level the sum of all the node weights for nodes on and below the level. Divide the sum by the maximum path length up to the level. The smallest integer not less than the largest ratio is the lower bound on the number of processors.

Since algorithm a contains an option, the signal indicator SCHEDPE is provided by the user to determine whether WSCHED is to find the least upper bound on the number of processors needed to complete the operations in minimal time or WSCHED is simply to find a schedule for the number of processors indicated.

If SCHEDPE(1) = 0, we want to find the least upper bound. Starting with the lower bound obtained above, procedure JCOMP is called, which determines the scheduling of the processors. Should JCOMP fail to find a schedule which would complete execution in minimal time, it returns KODE = 1, indicating failure. Upon seeing this code, WSCHED will add another processor to the group and recall JCOMP. Eventually JCOMP will return KODE = 0, indicating that an ordering of operations exists which will have the operations completed in the minimal time. At this point, the number of processors available to JCOMP is the least upper bound on the number of processors needed to execute the operations represented by the matrix and the two vectors in minimal time.
We also know the order in which the operations are done, which operation is executed by which processor, and what percentage of the time each of the processors is idle.

If SCHEDPE(1) ≠ 0, it contains the number of scheduling trials to be made for certain numbers of processing elements. SCHEDPE(2) through SCHEDPE(number of trials plus one) determine the number of processors to be used during trials 1 through SCHEDPE(1), respectively. During each trial, JCOMP will schedule the nodes on the specified number of arithmetic processors, regardless of the critical path length. The resulting order is a schedule which will complete the operations as quickly as possible. It should be noted that for any number of processors less than the least upper bound (LUB) calculated in WSCHED, the amount of time needed for completion necessarily increases. There is no use having the number of processors greater than the LUB since the critical path length cannot decrease. Therefore, having extra processors only results in a higher percentage of idle time for the processors.

3.2 The Relaxation Process

The relaxation of the connectivity matrix requires row and column interchanges since we may want to move a node in level i down to level j, i > j. We usually have more than one node moving through levels in a relaxation. Therefore, it would be too time-consuming to physically interchange the rows and columns. As a result, pointers are used to effect the exchange, and it is not until the very end that the physical change takes place. Since the position of a node after relaxation does not necessarily correspond to its position prior to the exchanges, another set of pointers keeps the corresponding old position for each node.
Hence, suppose that node 6, which corresponded to row 6 and column 6 in the original connectivity matrix, became node 10, corresponding to row 10 and column 10 in the relaxed matrix; if JCOMP finds that node 10 is in list 2, we actually have PE 2 executing the original node 6.

An example of tree relaxation was given earlier. An example of a graph relaxation may be seen in Figures 3.1, 3.2, and 3.3. Notice that Figure 3.3 would give rise to a relaxed matrix where node i corresponds to row i and column i, while Figure 3.2 would not. The node numbering in Figure 3.2 shows what happened to each original node. This would be the intermediate result of the phase when pointers are utilized to effect row and column exchanges. The renumbering process after relaxation would yield Figure 3.3.

Figure 3.1. The Original Graph

Figure 3.2. A Relaxed Graph With Four Levels of Nodes

Figure 3.3. A Relaxed Graph Representing the Relaxed Matrix

3.3 General Description of JCOMP

Starting with the terminal nodes, JCOMP calculates their longest paths through the tree. Assign processors to the terminal node(s) which are on the longest paths. Remove the scheduled nodes from the tree and start over with the new tree. The procedure followed is described in Kraska's paper [5]. The processor(s) with the minimum amount of activity will be assigned first.

When JCOMP has finished establishing the order of node processing, it prints out the nodes according to the original node numbering system rather than according to the relaxed numbering. Thus, the initial input to WSCHED has the same node numbering as the final output of WSCHED, the scheduled lists.

JCOMP is an internal procedure in WSCHED. It receives the number of processors for which it should schedule, the number of levels in the matrix, and the number of nodes in the operations graph.
It returns a code of 1 to signal an unsuccessful attempt and a code of 0 for success in scheduling the nodes on the number of processors provided. JCOMP includes a switch which determines whether the LUB is to be found or a straight scheduling is to be done.

The input for the scheduler involves only assignment statements, and each of these begins with a series of variable fetches, one of the fastest operations. Therefore, if the stage of having only root nodes (fetches) left unscheduled is reached, then they may be scheduled in any order as long as the scheduling does not increase the critical path length of the LUB calculation. If other operations with weights greater than that of fetches were to be included at the root level in the LUB calculation, then Johnson's algorithm [3] would have to be implemented to perform the final phase of scheduling. Johnson's algorithm considers the amount of time left to each processor before minimal time is exceeded. Start with the PE with the least amount of time left. Find an unscheduled operation or a set of unscheduled operations which occupy the PE the longest without exceeding its time left. Then proceed to each of the other processors in order of time left, so the PE with the most time left is scheduled only if there are still unscheduled operations left to be done.

4. EXPERIMENTS AND RESULTS

4.1 Background Discussion of the Experiments

Let us discuss two differences between weighted and unweighted nodes. We want to see how weights affect the scheduling of operations and how the two types of nodes are affected by the distribution process. Consider the expressions:

(4) C / A + D + B * (A + E)
(5) C / A + D + B * A + B * E

For the unit weighted case, statement 4 can be calculated in 3 time steps by 2 PEs, assuming that the operands are already available. For the weighted case, with the weights as defined in Section 2.3, statement 4 would be done in 9 time units by 2 PEs.
However, the scheduling process is different from the one with unit weights (see Figures 4.1 and 4.2). Since statement 5 would not be executed faster than statement 4 and it would take an extra PE in both the weighted and the unweighted case (see Figures 4.3 and 4.4), distribution would not be done. Note that the operations are performed in different sequences. In the weighted case one of the PEs only has to do the divide operation while the other PE is performing two adds and a multiply.

Let us look at another set of statements:

(6) A + D * (B + C + E)
(7) A + D * B + D * C + D * E

As we see in Figures 4.5 and 4.6, statement 6 requires 1 PE in both cases. The weighted case takes 9 time units; the unweighted case takes 4. Notice that the scheduling is the same for both the weighted case and the unweighted case. Once we distribute D, however, as shown in Figures 4.7 and 4.8, the scheduling changes and the time shortens, with 3 PEs being used in both cases. Therefore, we would proceed with the distribution process in both cases.

The last set of expressions to be considered are:

(8) ((A + B) * C * D) * E
(9) (A + B) * C * D * E

Figure 4.1. Unweighted Case
Figure 4.2. Weighted Case
Figure 4.3. Unweighted Case Distributed
Figure 4.4. Weighted Case Distributed
Figure 4.5. Unweighted Case
Figure 4.6. Weighted Case
Figure 4.7. Unweighted Case Distributed
Figure 4.8. Weighted Case Distributed
Figure 4.9. Unweighted Case
Figure 4.10. Weighted Case
Figure 4.11. Unweighted Case Distributed
Figure 4.12. Weighted Case Distributed

This time distribution will be done for the weighted case, but it will not be done for the unweighted case.
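Schedules of the kind shown in Figures 4.1 through 4.12 can be approximated by an ordinary critical-path list scheduler. The sketch below is a simplified stand-in for the WSCHED/JCOMP pair, not their actual logic: it greedily starts ready nodes in order of longest remaining path, and adds a processor whenever the resulting schedule misses the critical path length, mirroring the outer loop of algorithm a from Section 2.2. All names and the predecessor/successor-dictionary encoding are illustrative.

```python
import heapq

def downward_lengths(succ, w):
    """For each node, the largest weight sum from it down to a terminal."""
    memo = {}
    def down(n):
        if n not in memo:
            memo[n] = w[n] + max((down(s) for s in succ.get(n, [])),
                                 default=0)
        return memo[n]
    for n in w:
        down(n)
    return memo

def critical_path(succ, w):
    return max(downward_lengths(succ, w).values())

def schedule_length(preds, succ, w, m):
    """Finish time of a greedy list schedule on m PEs,
    longest-remaining-path nodes first."""
    prio = downward_lengths(succ, w)
    started, finished, running, t = set(), set(), [], 0
    while len(finished) < len(w):
        ready = sorted((n for n in w if n not in started
                        and set(preds.get(n, ())) <= finished),
                       key=lambda n: -prio[n])
        for n in ready[:m - len(running)]:
            heapq.heappush(running, (t + w[n], n))   # (finish time, node)
            started.add(n)
        t, done = heapq.heappop(running)
        finished.add(done)
        while running and running[0][0] == t:        # same-time finishers
            finished.add(heapq.heappop(running)[1])
    return t

def least_upper_bound(preds, succ, w, m=1):
    """Add a PE until the greedy schedule meets the critical path."""
    cp = critical_path(succ, w)
    while schedule_length(preds, succ, w, m) > cp:
        m += 1
    return m
```

For statement 1's additions grouped as (A + B) + (C + D), with adds weighted 2 as in Section 2.3 and operands prefetched, the critical path is 4: one PE finishes in 6 time units, two finish in 4, so the sketch reports a least upper bound of 2 PEs.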
This discussion bears out our previous observation that the distributed case tends to require more PEs than the undistributed case, but it may have a shorter execution time.

If subexpressions are evaluated only once and the results are used wherever the corresponding subexpressions appear, the number of processors needed will usually decrease, but the number of levels in the operations tree will remain constant. The reason for this is apparent. If a set of operations has to be done in several places, there will be PEs at each level working on the same operands, producing the same results that some other PE at that level is also producing. Therefore, a greater number of PEs would be required than if we were to route the previous results to wherever that subexpression is needed as an operand. Since we have already found that several processors executing the same set of operations take the same amount of time as one processor executing that set of instructions, the time needed to compute remains constant.

To verify these ideas, two sets of experiments were run. The first set compares the effects of distribution on weighted and unweighted nodes for the same statements. The statements used are similar to our previous examples. The second set applies WSCHED to two versions of a polynomial of degree 30. The connection graph of the first version requires each instance of common subexpressions to be evaluated separately. Therefore, this graph is a tree. For the second version communication between processors is allowed, i.e., common subexpressions are computed once, and the results are routed to wherever they are needed. For example, the value of x^5 is calculated only once; and x^5 may be calculated by using x^2 and x^3.

4.2 Discussion of the Distribution Results

Six FORTRAN statements (Figure 4.13) were run through WSCHED both with weighted operators and unweighted operators. The results are shown in Table 1 and Table 2. These statements were chosen to show various distributions as well as how the distribution process can decrease the critical path length and how it affects the number of processing elements. The entries in the Σ % idle time columns have been rounded to the nearest percent to avoid clutter.

Set #   Statement
A       BA1 = A * (B * C * D + E)
B       BA2 = A * (B * C + D) + E
C       BE = (A + B * C * D) * (E + F)
D       BE = ((A + B) * C * D)
E       BE = (A * (B * C)) / D * (E * F * G)
F       BE = (((A * (BC)) * (DE)) / F) * G

Figure 4.13. The 6 FORTRAN Statements

Statements A, B, E, and F were put through the distribution process in both the weighted case and the unweighted case. For these four statements, the number of levels decreased (the tree height was reduced), and the amount of time required to completely process each statement was cut. However, the number of processing elements increased. These results are as expected from our previous discussion.

Consider statement C. It should go through distribution in the weighted case since clock time decreases while the tree height remains constant and the number of processors required increases. However, in the unweighted case no distribution would be done. Not only does the number of levels remain constant (and, hence, since this is the unit weighted case, the clock time remains constant), but the number of processors increases, causing a greater percentage of idle time on the processors used.

In statement D we have the case that distribution is no help at all in the unweighted case as far as number of levels and time are concerned. However, with distribution we see a decrease in the number of PEs, thereby decreasing idle time. So distribution should be done if we are interested in conserving machine power. In the weighted case not only is the number of processors decreased, but time is also shortened, so distribution should be done. This is the only statement in the group of six in which distribution causes the amount of idle time to decrease.

Set #   No. of levels   Distribute?   No. of PEs   Time   Σ % idle time
A       6               no            2            13     69
A       5               yes           4            10     190
B       6               no            2            12     67
B       5               yes           3            10     100
C       6               no            2            13     46
C       6               yes           3            12     75
D       5               no            4            11     145
D       5               yes           3            10     130
E       6               no            3            16     125
E       5               yes           5            13     246
F       7               no            2            19     52
F       5               yes           5            13     246

(Σ % idle time is summed over the number of processing elements used)

Table 1. For Weighted Nodes

Set #   No. of levels   Distribute?   No. of PEs   Time   Σ % idle time
A       6               no            2            6      33
A       5               yes           4            5      160
B       6               no            2            6      33
B       5               yes           4            5      180
C       6               no            3            6      100
C       6               yes           4            6      150
D       5               no            4            5      200
D       5               yes           3            5      100
E       6               no            4            6      167
E       5               yes           6            5      320
F       7               no            3            7      100
F       5               yes           6            5      320

Table 2. For Unweighted Nodes

In this set of experiments, two-thirds of the cases yielded higher amounts of idle time for the unweighted case than for the weighted case. This is due to the fact that the number of processors needed by the unweighted nodes is not less than the number of processors needed by the weighted case. The reason is simple to see. If we restrict processors to waiting for fellow PEs, no matter how long any one of them takes, we waste process time. In Figure 4.10, the processor executing A + B had to wait for the results from the processor calculating C * D since multiplies take longer than adds in our example. This creates a time slot during which the former processor is unoccupied. There was also a 6-unit time hole while the E was waiting to be multiplied, for a total of 7 time units unoccupied. After distribution, in Figure 4.12, there are still two holes; however, they total up to only 4 time units unoccupied. We notice that the A + B processor could switch right into multiplying (A + B) by C without waiting for D * E to be computed. A discussion on time holes may be found in Kraska's paper [5].

As expected, the number of PEs required by the unweighted case is greater than or equal to the number of processors needed by the weighted case. The argument is presented above. A processor is not tied up waiting for another PE to finish. It is allowed to proceed to another operation. If this operation is on the same level as the one the PE just finished, it is then taking the place of another PE so that we could, for example, do with one less processor for the weighted case than for the unweighted case.

4.3 Background for the Second Set of Experiments

Let us, for simplicity's sake, first consider a polynomial of degree 7 with unweighted nodes before tackling the polynomial of degree 30. Muraoka's folding method [10] factors out powers of x which are Fibonacci numbers, and the subexpression left after factoring is in the form of the subexpression to the left of it. For example, consider

(10) P_7(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6 + a_7 x^7

which, using Muraoka's method, is folded into

(11) P_7(x) = a_0 + a_1 x + a_2 x^2 + (a_3 + a_4 x) x^3 + (a_5 + a_6 x + a_7 x^2) x^5

(see Figure 4.14 for the operation tree). In this example, a_5 + a_6 x + a_7 x^2 is in the same form as a_0 + a_1 x + a_2 x^2, and the former subexpression takes no longer to calculate than the latter. The highest powers of x in the equation can be calculated by multiplying two previously calculated powers, e.g., x^5 = x^2 * x^3, the property of Fibonacci numbers.

Figure 4.14. Graph for P_7 Using Common Results
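The folded form (11) can be checked numerically against the power form (10); a quick sketch with arbitrary coefficients (function names are illustrative):

```python
def p7_direct(a, x):
    """P_7 in the power form of equation (10)."""
    return sum(a[i] * x**i for i in range(8))

def p7_folded(a, x):
    """P_7 folded as in equation (11)."""
    x2 = x * x
    x3 = x2 * x
    x5 = x2 * x3          # Fibonacci-number powers: x^5 from x^2 * x^3
    return (a[0] + a[1]*x + a[2]*x2
            + (a[3] + a[4]*x) * x3
            + (a[5] + a[6]*x + a[7]*x2) * x5)
```

With integer coefficients and arguments both forms agree exactly, since only exact arithmetic is involved.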
A discussion of time holes may be found in Kraska's paper [5]. As expected, the number of PEs required by the unweighted case is greater than or equal to the number of processors needed by the weighted case; the argument is presented above. A processor is not tied up waiting for another PE to finish. It is allowed to proceed to another operation. If this operation is on the same level as the one the PE just finished, it is then taking the place of another PE, so that we could, for example, do with one less processor for the weighted case than for the unweighted case.

4.3 Background for the Second Set of Experiments

Let us, for simplicity's sake, first consider a polynomial of degree 7 with unweighted nodes before tackling the polynomial of degree 30. Muraoka's folding method [10] factors out powers of x which are Fibonacci numbers, and the subexpression left after factoring is in the form of the subexpression to the left of it. For example, consider

(10)  P_7(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + a_6 x^6 + a_7 x^7

which, using Muraoka's method, is folded into

(11)  P_7(x) = a_0 + a_1 x + a_2 x^2 + (a_3 + a_4 x) x^3 + (a_5 + a_6 x + a_7 x^2) x^5

(see Figure 4.14 for the operation tree). In this example, a_5 + a_6 x + a_7 x^2 is in the same form as a_0 + a_1 x + a_2 x^2, and the former subexpression takes no longer to calculate than the latter. The highest powers of x in the equation can be calculated by multiplying two previously calculated powers, e.g., x^5 = x^2 * x^3, the property of Fibonacci numbers.

Figure 4.14. Graph for P_7 Using Common Results

With data transmission between processors, as shown in Figure 4.14, Kraska [6] proposed that no more than n w_a + n w_m + 2.75 ln(n) + 1 arithmetic operations are needed to compute P_n, a polynomial of degree n, where w_a is the weight of an add node and w_m is the weight of a multiply node.
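Equation (11) can be checked mechanically. The sketch below is illustrative Python (the thesis's own tools are PL/I programs, and the coefficient values here are arbitrary): it evaluates P_7 both directly and in the folded form, forming x^5 as x^2 * x^3 from two already-computed powers, the Fibonacci property noted above.

```python
# Mechanical check of equation (11); coefficients a[0..7] are arbitrary.
def p7_direct(a, x):
    """Evaluate P7 term by term."""
    return sum(a[i] * x**i for i in range(8))

def p7_folded(a, x):
    """Evaluate P7 in Muraoka's folded form, equation (11)."""
    x2 = x * x
    x3 = x2 * x
    x5 = x2 * x3          # product of two previously computed powers
    return (a[0] + a[1] * x + a[2] * x2
            + (a[3] + a[4] * x) * x3
            + (a[5] + a[6] * x + a[7] * x2) * x5)

a = [1, 2, 3, 4, 5, 6, 7, 8]
assert all(p7_direct(a, x) == p7_folded(a, x) for x in (-3, 0, 1, 2, 10))
```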
Thus, for P_7, the upper bound is computed as being 25 arithmetic operations, 9 fetches, and 1 store. In actuality, P_7 requires 17 arithmetic operations, 9 fetches, and 1 store.

Now let us assume that each common subexpression must be recalculated. Then the P_7 tree becomes the tree of Figure 4.15.

Figure 4.15. P_7 Tree Without Common Results

By counting the nodes in this tree, we find that there are 21 arithmetic nodes. However, for polynomials with large degrees it would be desirable to have an easier method for calculating the number of nodes rather than having to count them up after forming the trees.

Therefore, we begin by noting that for a polynomial of degree n there are n + 1 coefficients and an x to be fetched, plus 1 store. Then we note that x^n may be calculated in [log_2 n] + N_n - 1 multiplications, where N_n is the number of 1's in the binary representation of n and [q] represents the largest integer not greater than q [4]. From this, we find that to compute a polynomial of degree n using Muraoka's folding method, we need exactly θ_n arithmetic operations, where θ_n is defined recursively as

(12)  θ_n = θ_(n-1) + [log_2 k] + N_k + 1,  where k = n - Σ F_i,

the F_i being the elements of the set of Fibonacci numbers which are factored out of the a_n term. In other words, k is the highest power of x remaining inside the parentheses. For example, in P_7, k = 2. By equation (12), P_7 requires θ_7 = 18 + 1 + 1 + 1 = 21 arithmetic operations, which agrees with our node count in Figure 4.15.

4.4 Results of the Second Set of Experiments

Returning to the polynomial of degree 30 with unit-weighted nodes, Kraska's upper bound is calculated to be 70 arithmetic operations. These 70, plus 32 fetches and 1 store, yield a total of 103 nodes in the graph for P_30. In reality, there are 99 nodes. Without using common results, there are 133 nodes calculated from (12), which agrees with the node count done on the tree. We formed the graph and the tree of P_30
as we have done here for P_7, and it is with these that we verified the above results.

Putting these two versions of P_30 through WSCHED with unweighted nodes produced the following results. Both had 10 levels of operation. The tree (133 nodes) required 27 PEs to complete the calculation, while for the graph (99 nodes), 18 processors sufficed.

The results bear out intuitive conjectures. Graphs, not requiring as many operations, should need fewer processors. However, since the powers of x still must be calculated, our previous discussion stipulates that with extra processors the time required to perform one set of calculations is equal to the time required to execute several copies of the set of calculations in parallel.

In our discussion of data transmission, we did not mention any time allotted to the routing of information; that is because in our experiments we did not take it into consideration. It should be very interesting to see how the added data transmission time would affect our results.

It would be interesting to compare the amount of time taken to prepare and execute a FORTRAN program using our scheduler (with optimization done by back substitution, recursion, and tree-height reduction) with the time taken by FORTRAN code which is optimized by hand, plus its preparation time. Duncan Lawrie [9] has started such an experiment for unit-weighted nodes. The results are as yet unavailable.

Another question time did not allow us to investigate is the original question: does the speedup gained at execution on our parallel processor machine, as compared to execution on a single processor, justify the amount of time we needed to prepare programs for execution on our system? In other words, does the program spend less total time on our parallel system than on the single processor?

We see that a program has been written which implements a compile-time operation scheduler for parallel processors.
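As a computational footnote to the powers-of-x discussion, the count [log_2 n] + N_n - 1 quoted from [4] is exactly the number of multiplications used by square-and-multiply exponentiation. The sketch below is illustrative Python, not part of the thesis's PL/I tooling, and simply confirms the count against an actual run.

```python
# Count multiplications in right-to-left square-and-multiply exponentiation.
def pow_counted(x, n):
    """Return (x**n, number of multiplications performed), for n >= 1."""
    result, sq, mults = None, x, 0
    while n:
        if n & 1:                 # this bit contributes the current square
            if result is None:
                result = sq       # the first one-bit costs no multiply
            else:
                result *= sq
                mults += 1
        n >>= 1
        if n:                     # square again only if bits remain
            sq *= sq
            mults += 1
    return result, mults

# [log2 n] + N_n - 1: for n = 30 (binary 11110) this is 4 + 4 - 1 = 7.
for n in range(1, 64):
    value, mults = pow_counted(2, n)
    assert value == 2**n
    assert mults == (n.bit_length() - 1) + bin(n).count("1") - 1
```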
We have used it to investigate the distribution process and the tree-versus-graph question. The results of the investigations support the theories we have presented.

APPENDIX A

N          Parameter to WSCHED, the number of nodes in the graph, including fetches and stores.
INCM       Parameter to WSCHED, the connectivity matrix representing a graph with N nodes. It is an N x N BIT(1) matrix.
NW         Parameter to WSCHED, the weight vector, where NW(I) is the weight of node I. It is an N-vector of integers.
NT         Parameter to WSCHED, the type vector, where NT(I) represents the operator involved at node I. If the operation is a fetch, NT(I) specifies which variable is to be fetched.
SCHEDPE    External vector with a maximum of 5 elements. SCHEDPE(1) = 0 indicates that WSCHED is to find the least upper bound on the number of processors needed to perform all the operations represented by the graph in minimal time. SCHEDPE(1) = k, k a positive integer less than 5, indicates the number of straight schedulings to be made for the numbers of processors specified by SCHEDPE(2) through SCHEDPE(k + 1).
SCRAP      Indicator based on SCHEDPE(1). SCRAP = 1 means to calculate the least upper bound; SCRAP = 0 means to schedule the graph using the number of PEs specified in SCHEDPE(I).
TIME       Saves the timer results so that we know the elapsed time from part to part.
STEPZ      A function provided by the system library which returns the amount of time remaining before the task's time limit is exceeded.
TABLE      A matrix which, for each level of operation in the relaxed graph, indicates the number of occurrences of each of the 6 operators (including fetch and store).
LVL        A vector such that LVL(I) points to the node which is the first on level I, the terminal level being level 1.
IPTR       Vector of N pointers; IPTR(I) = J indicates that node I of the relaxed graph is node J in the original graph.
BACK, PTR  Work vectors which save pointers during the relaxation process.
B          N x N BIT(1) matrix used in the relaxation process.
M          The number of levels of operation in the relaxed graph.
IP         The critical path length.
ICWM       The sum of weights up through the levels.
RATIO      The ratio of IP to ICWM.
RMAX       The maximum ratio.
NP         The greatest lower bound on the number of processors needed to compute in minimal time. It is the least integer not less than RMAX. It is later the number of processors to be scheduled upon in JCOMP.
KODE       Success (0) or failure (1) indicator from JCOMP.
JCOMP      A procedure internal to WSCHED which develops the schedule.
N          Parameter to JCOMP, same as N in WSCHED.
KODEF      Parameter to JCOMP, the success-failure indicator.
K          Parameter to JCOMP, the number of processors on which to schedule.
M          Parameter to JCOMP, the number of levels in the graph.
AS         The number of nodes assigned so far.
PAS        The number of nodes previously assigned.
DENY       Indicator of nodes having been denied assignment.
LIST       An N x K matrix, the Jth column of which is the assignment list for the Jth processor.
T          T(I) represents the longest path length from node I to a root node.
DEPTH      DEPTH(I,1) is the number of nodes in list I. DEPTH(I,2) is the length of list I calculated from the weights of the nodes in list I.
W          An operation's completion time if its node were assigned now.
           The critical path length.
Q          The set of terminal nodes for the graph being considered.
R          Q plus the nodes which would become terminals after this time.
B          N x N BIT(1) matrix which is the relaxed connectivity matrix workarea.
P          The processors whose lists have minimal length.
LP         The number in P.
NOTN       The cardinality of Q.
NTNA       The cardinality of R.
RESTOR     Matrix saving columns of B.
S          The set of nodes with max (W(I) + T(I)), node I in Q or R.
SP         The intersection of S and Q.
NS         The cardinality of S.
NSP        The cardinality of SP.
SAVEM      A save matrix.
L          The node under consideration.
PP, LPP, PW, LL, PAS   Save variables.
ASIGN      The list of nodes which have been assigned thus far.
LU         min {W(I) - NW(I) | node I in (S - SP)}.
U          {node I | W(I) - NW(I) = LU}.
X          {node I | T(I) < LU}.
NX         The cardinality of X.
NU         The cardinality of U.
LV         LP - NX + NU (0 if S - SP is empty).
MJ, WSJ    max {NW(I) | node I in SP}.
LD, DELTA  Indicate whether we can still schedule without exceeding the critical path length.

All variables not listed here are used only in the capacity of work areas.

APPENDIX B

(SUBSCRIPTRANGE, STRINGRANGE):
TEST: PROC OPTIONS(MAIN);
DCL SCHEDPE(5) FIXED BIN(15) EXTERNAL;
GET LIST(L);
B: BEGIN;
DCL M(L,L) BIT(1), NW(L), NT(L);
M = '0'B;  /* INITIALIZING THE WHOLE MATRIX TO 0 */
GET DATA(M);
GET LIST(NW, NT);
ON ENDFILE GO TO ZERO;
GET LIST(SCHEDPE(1));
GET LIST((SCHEDPE(I) DO I = 2 TO SCHEDPE(1) + 1));
GO TO CALLW;
ZERO: SCHEDPE(1) = 0;  /* NO FORCED SCHEDULING */
CALLW: CALL WSCHED(L, M, NW, NT);
PUT SKIP LIST('TEST FINISHED');
END B;
END TEST;

(SUBSCRIPTRANGE):
WSCHED: PROC(N, INCM, NW, NT);
DCL SCHEDPE(5) FIXED BIN(15) EXTERNAL;
DCL TIME FIXED BIN(31), STEPZ ENTRY RETURNS(FIXED BIN(31));
DCL L, I, J, K, M, N, KK, RATIO, RMAX, NP, ICWM;
DCL INCM(N,N) BIT(1), NW(N), IPTR(N), NT(N);
DCL TABLE(30,6) FIXED BIN;
DCL LVL(0:N);
DCL SCRAP BIT(1), KODE BIT(1);
/* SCHEDPE(1) IS THE NUMBER OF SETS OF PE'S TO BE DONE.
   SCHEDPE(2) TO SCHEDPE(SCHEDPE(1)+1) CONTAINS THE NUMBER OF PE'S
   OF EACH SET */
/* SCRAP = 1 INDICATES CALCULATION OF L.U.B. OF THE NUMBER OF PROCESSORS
   NEEDED TO DO THE MATRIX IN THE LEAST AMOUNT OF TIME.
   SCRAP = 0 MEANS TO JUST SCHEDULE THE MATRIX USING THE NUMBER OF PE'S
   WHICH IS READ IN AS DATA */
/* RIGHT NOW, THE LISTS MAY NOT EXCEED 32767 IN LENGTH.
   TO CHANGE THIS, ALL THE FIXED BIN(15) VARIABLES WILL HAVE TO BE
   CHANGED TO FIXED BIN(31), AND CERTAIN OF THE FORMAT STATEMENTS
   WILL HAVE TO BE ALTERED ALSO */
/* LEVEL RECOGNITION: THE ALGORITHM IS NECESSARY TO PRODUCE
   A RELAXED GRAPH (MATRIX)
   - HOWEVER, IT MUST BE NOTED THAT THIS PROCEDURE INVOLVES ORDER OF
   N SQUARED CALCULATIONS AND HENCE IS VERY SLOW */
TIME=STEPZ;
IF SCHEDPE(1)=0 THEN SCRAP='1'B; ELSE SCRAP='0'B;
DO I=1 TO N;
PUT SKIP EDIT((INCM(I,J) DO J=1 TO N))((100) B(1));
END;
PUT SKIP EDIT((NW(I),NT(I) DO I=1 TO N))((60) F(3));
DO I=1 TO N; IPTR(I)=I; END;
/* LAST LEVEL IS ROWS WITH ALL ZEROS - GETTING LAST LEVEL */
K,LVL(0)=N+1;
B1: BEGIN;
DCL (PTR(N),BACK(N)) FIXED BIN;
DCL MS,MT, ... ;
... BACK(I)=KK; END;
J2: END;
END J0;
IPTR=PTR; /* SET NEW POINTERS */
M=M+1; /* GOT NEXT LEVEL */
IF K=KK THEN GO TO BLK;
K,LVL(M)=KK;
J3: IF K>1 THEN GO TO J0;
BLK: LVL(M)=1;
/* GET THE RELAXED MATRIX */
DO I=1 TO N;
B(BACK(I),*)=INCM(I,*);
END;
DO I=1 TO N;
INCM(*,BACK(I))=B(*,I);
END;
DO I=1 TO N;
PUT SKIP EDIT((INCM(I,J) DO J=1 TO N))((100) B(1));
END;
PUT SKIP DATA(IPTR,BACK);
END B1;
/* M LEVELS ESTABLISHED, INDEX IN LVL - LAST LEVEL FIRST.
   NOW COMPUTE THE SUMS AND RATIOS OF THE WEIGHTS AND THE PATHLENGTHS */
BLOCK: BEGIN; /* GET NEXT LEVEL */
DCL IL,JL,LL,IKW,IP,RMAX,LEV,ICW(N);
TABLE=0; ICW=0; IP=0; ICWM=0; RMAX=0;
DO LEV=M TO 1 BY -1;
LL=LVL(LEV-1)-1;
DO IL=LVL(LEV) TO LL;
... /* A FETCH */
ICW(IL)=NW(IPTR(IL));
...
IF ICW(JL)>IKW THEN IKW=ICW(JL);
END;
IKW=NW(IPTR(IL))+IKW;
IF IKW>ICW(IL) THEN ICW(IL)=IKW;
IF ICW(IL)>ICWM THEN ICWM=ICW(IL);
END;
RATIO=IP/ICWM;
PUT SKIP DATA(RATIO);
IF RMAX<RATIO THEN RMAX=RATIO;
...
END;
PUT SKIP LIST('THERE ARE ',M,' LEVELS');
PUT SKIP DATA((LVL(I) DO I=1 TO M));
NP=CEIL(RMAX);
PUT SKIP EDIT('NUMBER OF PROCESSORS NEEDED FOR ','CONNECTION MATRIX ',NP)(A,A,F(6));
PUT SKIP(2) LIST('THE TABLE OF TYPES:');
PUT SKIP EDIT('LEVEL','+','-','*','/','STORE','FETCH')(A,X(7),A,X(10),A,X(9),A,X(10),A,X(8),A,X(5),A);
PUT SKIP(0) EDIT((61)'_')(X(7),A);
DO I=1 TO M;
PUT SKIP EDIT(I,' | ',(TABLE(I,J) DO J=1 TO 6))(F(5),A,(6)F(10));
END;
END BLOCK;
PUT SKIP EDIT('TIME ELAPSED = ',TIME-STEPZ,' CENTISECONDS')(X(70),A,F(16),A);
IF SCRAP='0'B THEN DO I=2 TO SCHEDPE(1)+1;
NP=SCHEDPE(I);
CALL JCOMP(N,KODE,NP,M);
END;
ELSE DO;
AGAIN: CALL JCOMP(N,KODE,NP,M);
IF KODE='1'B THEN DO;
NP=NP+1;
GO TO AGAIN;
END;
END;

JCOMP: PROC(N,KODEF,K,M);
DCL (RESTOR ...) BIT(1), (... (N)), LIST(N,K), DEPTH ..., PW(N), ... ;
AS=0; PAS=0; DENY='0'B; PP=0; T=0;
/* ... */
P1: ...
IF ... >DEPTH(I,2) THEN MIN=DEPTH(I,2);
END;
DO I=1 TO LP;
KL=DEPTH(I,2)-MIN;
IF KL=0 THEN GO TO S31;
MN,DEPTH(I,1)=DEPTH(I,1)+1;
DEPTH(I,2)=MIN;
/* SETTING UP THE DUMMY BLOCKS - THE DEPTH OF THE X BLOCK */
LIST(MN,I)=KL;
S31: END;
S2: /* ... */
/* RESTORATION */
GO TO ...;
NTNA=0;
/* STEPS #4 AND #5 */
J=1;
S4: DO I=1 TO K;
IF I=P(J) THEN DO;
...
IF W(J)>MN THEN MN=W(J); /* GET THE PREDECESSOR FROM THE ORIGINAL GRAPH */
END;
W(...)= ... +MN;
S51: END;
/* STEP #6 */
S06: CALL SUB6(M,R,NTNA,S,NS);
NSP=0;
DO I=1 TO NS;
DO J=1 TO NOTN;
IF S(I)=Q(J) THEN DO;
NSP=NSP+1;
SP(NSP)=S(I);
GO TO S6;
END;
END;
S6: END;
IF ... THEN DO;
KODEF='1'B; /* FAILURE SIGNALED */
PUT SKIP EDIT('TIME ELAPSED = ',TIME-STEPZ,' CENTISECONDS') ...;
...
IF DENY='1'B THEN DO;
DO J=1 TO K;
DO I=PAS+1 TO AS;
IF LIST(DEPTH(J,1),J)=ASIGN(I) THEN DO;
LIST(DEPTH(J,1),J)=0; /* REMOVE NODE FROM LIST */
DEPTH(J,1)=DEPTH(J,1)-1;
IF DEPTH(J,1)<0 THEN DEPTH(J,1)=0; /* PRECAUTION ONLY */
...
LIST(DEPTH(J,1),J)=0;
DEPTH(J,1)=DEPTH(J,1)-1;
END;
END;
AS=PAS; /* SET ASSIGNMENT COUNTER AT PREVIOUS */
B=SAVEM; /* RESET MATRIX */
P=PP; /* RESET P VECTOR */
LP=LPP; /* RESET LP COUNTER */
W=PW; /* RESET W VECTOR */
DENY='0'B; /* RESET DENY BIT TO OFF */
L=LL; /* SET NODE */
/* PUSH NODE INTO THE STACK */
NN,DEPTH(P(LP),1)=DEPTH(P(LP),1)+1;
LIST(NN,P(LP))=L;
PUT SKIP EDIT('STACK IN 6-8 ',P(LP),' # ',L)(A,F(5),X(2));
DEPTH(P(LP),2)=W(L);
PUT SKIP LIST('STACK',P(LP),'# IN 6-8',L);
AS=AS+1;
ASIGN(AS)=L;
LP=LP-1;
B(L,*)='0'B;
GO TO S9;
END;
/* STEP #7 */
...
IF NS-NSP=0 THEN LV=0;
ELSE DO;
NSPC=1;
LU=99999999;
DO I=1 TO NS; /* GET LU BY FINDING S-SP */
IF NSPC>NSP THEN GO TO S70;
IF S(I)=SP(NSPC) THEN DO;
NSPC=NSPC+1;
GO TO S700;
END;
S70: LWD=W(S(I))-NW(IPTR(S(I)));
IF LWD>LU THEN GO TO ST71;
...
NX=NX+1;
X(NX)=I;
ST71: END;
LV=LP-NX+NU;
IF LV<0 THEN LV=0;
IF LP<=LV | NSP=0 THEN GO TO ST8;
DO WHILE(LP>LV & NSP>0);
/* GET MAX(W(I)) FOR N(I) IN SP */
MJ=0;
DO I=1 TO NSP;
MN=NW(IPTR(SP(I)));
IF MJ<MN THEN MJ=MN;
END;
...
L=SP(J);
IF NW(IPTR(L))=WSJ THEN DO;
SP(J)=0;
MT=1;
NSP=NSP-1;
DELTA=W(L)-LU;
END;
IF DELTA<=0 | LP-LV>0 THEN DO; /* PUSH INTO STACK */
NN,DEPTH(P(LP),1)=DEPTH(P(LP),1)+1;
LIST(NN,P(LP))=L;
PUT SKIP LIST('STEP 8',L);
DEPTH(P(LP),2)=W(L);
PUT SKIP LIST('STACK',P(LP),'# IN 8',L);
AS=AS+1;
ASIGN(AS)=L;
LP=LP-1;
...
IF ... >DEPTH(I,2) THEN MIN=DEPTH(I,2);
ST805: END;
DO I=1 TO LP;
KL=DEPTH(P(I),2)-MIN;
IF KL=0 THEN GO TO ST806;
MN,DEPTH(P(I),1)=DEPTH ...
... MS THEN MS=MN;
NN=NN+MN;
END;
NN=K*MS-NN;
IF NN
ASIGN(AS)=L; LP=LP-1; B(Lt*l='0'B5 GO TC S9; END; /* */ /* STEP 47 */ C=M; IF NS-NSP=0 THEN LV=0; ELSE DO; NSPC=l; LU=9S999999; DO 1=1 TO NS; /* GET LU BY FINDING S-S» */ IF NSPONSP THEN GO TO S70; IF S( I ) = SP(NSPC) THEN DO; NSPC=NSPC+l; GO TC S700; END; S70: LWD=W(S(I ))-NW( IPTR(S( n n; IF LWDLU THEN GO TO ST71; kk END; TO ST8; IN S») */ I))); CO; nx=nx*1; X(NX)=I; ST71: END; LV=LP-NX+NU; IF LV<0 THEN LV=0; IF LP<=LV|NSP=0 THEN GO DO WHILE(LP>LV6NSP>0); /* GET MAX(W( I) |N( I) IS mj=o; DO 1=1 TC nsp; MN=NW(IPTR(SP< IF HJ0 & NSP>0 ); L=SP( Jl; IF NW< IPTR(L) )=WSJ THEN DO; SP( J)=0; M T= 1 ; NSP=NSP-1; DELTA=W(L)-LU; END; IF DELTA<=0 | LP-LV>0 THEN /* PUSH INTO STACK */ DC; NN, DEPTH (P( LP) ,1 )=D6 PTH( P(LP)tll + l! LIST(NN,P(LP) >=L; PUT SKIP LIST( 'STEP 8' ,L); DEPTH(P(LP) ,2)=W(L ); PUT SKIP LISTCSTACK' ,P( LP) ,• 4 IN 8*,L); AS=AS+l; ASIGN( AS)=L; LP=LP-l; BDEPTH(I ,2) THEN MIN=DEPTH U ,2 ) ; ST805: END; DC 1=1 TC LP; KL=DEPTH(P(I ),2)-MIN; IF KL=0 THEN GO TO ST806; MN.DEPTHCPU ),1)=DEPTH MS THEN MS=MN; NN=NN+MN; END; NN=K*MS-NN; IF NN