CENTRAL CIRCULATION BOOKSTACKS The person charging this material is re- sponsible for its renewal or its return to the library from which it was borrowed on or before the Latest Date stamped below. You may be charged a minimum fee of $75.00 for each lost book. Theft, mutilation, and underlining of books are reasons for disciplinary action and may result In dismissal from the University. TO RENEW CALL TELEPHONE CENTER, 333-8400 UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN DEC 1 4 1998 JAN l n 20f MAY 01 MJST When renewing by phone, write new due date below previous due date. L162 Digitized by the Internet Archive in 2013 http://archive.org/details/techniquesforimp929wolf , YO, S{ * y Report No. UIUCDCS-R-78-929 z- - UILU-ENG 78 1722 TECHNIQUES FOR IMPROVING THE INHERENT PARALLELISM IN PROGRAMS 6-*d [••I - by - Michael Joseph Wolfe July 1978 NSF-OCA-MCS73-07980-000034 DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN URBANA, ILLINOIS Report No. UIUCDCS-R-78-929 TECHNIQUES FOR IMPROVING THE INHERENT PARALLELISM IN PROGRAMS by Michael Joseph Wolfe July 1978 Department of Computer Science University of Illinois at Urbana-Champaign Urbana, Illinois 61801 * This work was supported in part by the National Science Foundation under Grant No. US NSF MCS73-07980 and was submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, July 1978. 7 -?3f iii ACKNOWLEDGEMENT I would like to thank all those whose help and advice have made the pursuit of this project an interesting and fruitful challenge. Special thanks go to David Kuck, Ahmed Sameh, Ross Towle, Bruce Leasure, Utpal Banerjee and Robert Kuhn for bearing with me as work progressed, or regressed, at various points in time. This project used computers operated by the Computing Services Office at the University of Illinois in Urbana. iv TABLE OF CONTENTS Page 1 . Introduction 1 2. Recognizing Vector Operations 3 2.1. DO Loop Distribution 3 2.2. Data Dependence 5 2.3. Method of DO Loop Distribution 10 3. Classifying Recurrences 14 3.1. Types of Recurrences 14 3.2. Splitting Recurrences 17 4. IF statements 18 4.1. IF trees 18 4.2. IFs inside DO loops 19 5. Other Manipulations on DO loops 22 5.1. Simple Case, One Loop 23 5.2. More Interesting Case, Two Loops 24 5-3- Triple Loop Case 27 5.4. General Loop Nesting 31 5.5. Future Work 32 6. Induction Variables 33 6.1. Conditions for Induction Variables 34 6.2. Substitution for Induction Variables 37 6.3- Usefulness 40 7. Subscript Addition 41 7.1. Scalar Expansion 43 V 7.2. Complications with IF statements 46 7.3- Method of Scalar Expansion 48 7.4. Array Expansion 52 The PARAFRASE Compiler 56 List of References 62 Appendices A. Recurrence Formulae 63 B. Compiler User's Guide 72 C. Compiler Options 77 CHAPTER 1 Introduction Recently, there has been a great deal of work done to try to make computer programs execute faster. One promising approach is to utilize special hardware. Several new machines have a parallel or pipelined machine architecture. These machines operate efficiently when performing the same operations on a vector of data. This approach is taken because it is easier to optimize a sequence of similar operations than it is to optimize a sequence of very different operations. Programs written today are designed for ordinary serial machines. It would be useful to be able to compile these programs for a vector machine, without having to rewrite the program. The PARAFRASE project at the University of Illinois has been working on a compiler to do just that. This compiler accepts ordinary programs in a serial language (FORTRAN). It then recognizes and isolates vector-type operations. Of primary concern to the compiler is that results are preserved. The output of the compiler is a transformed program reflecting the changes, and pseudo-machine code for an idealized vector machine. This paper will try to introduce the theory used in compiling programs for vector machines. The examples will be in pseudo-FORTRAN . Chapter 2 discusses vector operations, and finding them in programs. Chapter 3 defines types of recurrences. Chapter 4 mentions IFs and conditional statements, and what to do about them. Chapter 5 talks about useful methods to make DO loops more efficient under certain circumstances. Chapters 6 and 7 introduce induction variable substitution and scalar expansion; these are new ways of enhancing the detection of vector operations. Chapter 8 describes the PARAFRASE FORTRAN compiler. The appendices include a list of the time bounds to solve different kinds of recurrences. Also a user's guide for the compiler and a list of the switches and options available are given. CHAPTER 2 Recognizing Vector Operations To efficiently utilize the machine architecture of a vector machine, a compiler should find vector operations in programs. Vector operations can be found in DO loops, where the same operations are performed on streams or vectors of data . 2.1 DO Loop Distribution Suppose a program is composed of DO loops and assignment statements. + DO 1=1 ,UI +-- DO J=1 ,UJ I ! +-- CONTINUE + CONTINUE Example 1 - Sample Program A method to transform this general DO loop structure into a series of vector operations is to execute each statement separately for the entire DO loop index set. This is equivalent to distributing the DO loops over each statement. DO loop distribution is described in [MURAOKA]. + DO 1=1 ,01 | <3tatement 1> + CONTINUE + DO 1=1 ,UI -- DO J=1 ,UJ CONTINUE + CONTINUE + DO 1=1 ,UI +-- DO J=1 ,UJ ! +-- CONTINUE CONTINUE + DO 1=1 ,UI I + CONTINUE Example 1 - Sample Program, Distributed Distributing DO loops may not always yield the correct results . S1 S2 original program + DO 1=1 ,UI ! A(I+1)=B(I)+5 ! B(I+1 )=C(I)»2 + CONTINUE distributed program ( incorrect) ♦ DO 1=1 ,01 S1 i • ! A(I+1 )=B(I)+5 CONTINUE + DO 1=1 ,01 S2 ' : ! B(I+1 )aC(I)*2 + CONTINUE Example 2 - Incorrect Distribution Statement S1 in the original program always reads the value of B computed in statement S2 during the previous iteration of the I loop. In the distributed program, S1' always reads a value of B computed somewhere outside the loop. For this example, the correctly distributed program involves just a little statement reordering. + DO 1=1 ,UI S2 B(I+1)=C(I)»2 + CONTINUE + DO 1=1 ,UI S1':| A(I+1 )=B(I)+5 + CONTINUE Example 2 - Correct Distribution Not all loop distribution problems can be solved statement reordering. by original program + DO 1=1 ,UI 31: ! A(I+1 )=B(I)+5 S2: i B(I+1 )=A(I+1 )«2 + CONTINUE distributed program ( incorrect ) + DO 1=1 ,UI S1 ' : ! A(I+1 )=B(I)+5 CONTINUE + DO 1=1 ,UI S2' : ! B(I+1 )=A(I+1 )«2 + CONTINUE Example 3 - Undistributable Loop In this example, the distributed program is wrong. Statement S1 ' will always read "old" values of B. Reordering the statements does not solve the problem. Putting S2 ' first would make S2 ' always read old values of A, whereas it should read values computed in the loop. The loop in this program cannot be distributed. 2.2 Data Dependence To decide when distributing DO loops is valid, one must look at the data flow of the program. For each statement, one must ask "where do the values used here come from," and "where does the value computed here get used." These questions do not always have simple or unique answers. A value computed in one statement may be used in many places. A value read in one statement may come from one of several places. This is particularly true when IF statements or other conditionals are present in the program. IFs will be discussed later, and so will be ignored for now. The analysis of data flow in a program is called the study of data dependence. A more complete description of data dependence can be found in [TOWLE]. Briefly, a statement Sq is data dependent on statement Sp if Sq reads a value that is computed in statement Sp. This can occur when the left hand side variable in Sp appears on the right hand side of Sq . In EX. 2, statement S1 is data dependent on statement S2 , since S1 reads the value of B computed in statement S2 . In EX. 3, S1 and S2 are data dependent on each other. Towle also defines two other types of data dependence, but for the moment we ignore them. The first requirement for a statement Sq to be data dependent on Sp is that there must exist a control path from Sp to Sq . Second, the variable being computed in Sp , the LHS variable of Sp , must be read in Sq . If this variable is a scalar, the test is satisfied, and Sq is data dependent on Sp . If this variable is an array, the value of its subscript in Sp must be equal to the value of its . subscript in Sq . When this condition is satisfied, then Sq is data dependent on Sp . Equality of subscript expressions is not so easy to check when the statements are in DO loops, and the expressions change value with each iteration. This happens whenever the subscript expressions involve the DO loop index variable. In a DO loop, a particular statement is executed many times. Let Sp[i] be the instance of statement Sp during the iteration of the DO loop when I=i, where I is the DO loop index variable. For multiply nested loops, Sp[ i1 , i2 , . . . , in] is the instance of Sp when I1 = i1, I2 = i2, ..., In=in. Given a DO loop, we can "unroll" it, listing each statement for each iteration of the loop. This removes the loop, and only a serial program remains. Each statement can now be checked with following statements for any data dependence. If any Sq[i'] is data dependent on any Sp[i], where i*>=i, then in the original program with the loop, Sq is data dependent on Sp . Likewise for Sp[i'] being data dependent on Sq[i], where i'>i. + DO 1=1 ,UI Sp : ! Sq : I (statement q> + CONTINUE becomes Sp[1]: (statement p> Sq[1]: (statement q> Sp[2]: (statement p> Sq[2]: (statement q> Sp[9]: (statement p> Sq[9]: (statement q> Sp[10]: Sq[10]: (statement p> (statement q> Example 4 - Unrolling a DO Loop If Sq[i'] is data dependent on Sp[i], for any i' and i such that i'>i, then we say that Sq is data dependent on Sp across the I loop . This data dependence crosses the DO loop boundary. For instance, if Sq[9] is data dependent on Sp[8], then Sq is data dependent on Sp across the I loop boundary. If Sq[i] is data dependent on Sp[i], for any i, then we say that Sq is data dependent on Sp within the i loop . This is not mutually exclusive with data dependence across the loop. Here S2[2] is data dependent on S1[2] and S1[1]. So S2 is data dependent on S1 both within and across the loop boundary. + DO 1=1 ,UI S1 : ! A(I+1 )= . . . S2: ! . . . = A(I+1 )+A(I) + CONTINUE Example 5 - Data Dependence Within and Across Loop Likewise, if Sp[i*] is data dependent on Sq[i], for any i' and i such that i'>i, then we say that Sp is data dependent on Sq across the I loop. Notice that Sp cannot be data dependent on Sq within the I loop boundary, since there is no control path from Sq[i] to Sp[i], for any i. Unrolling the DO loop and doing an exhaustive data dependence test is time consuming and generally unnecessary. A method has been described in [BANERJEE] for computing data dependence, just by studying the subscript expressions as polynomials in the DO loop index variables. For some forms of simple subscripts, a necessary and sufficient test for data dependence can be applied. Although in general the method is not exact, it is conservative, so that whenever a data dependence exists, the test recognizes it. In practice, this test is quite satisfactory. In only a small percentage of cases is the test fooled by not recognizing a non-dependence situation. Some forms of subscripts cannot be handled by any test. An example is a subscripted subscript, when an array is used 10 in a subscript expression. This is often used for array permutations. Whenever an array has a subscript expression which cannot be tested, data dependence must be assumed, to be conservative. In the PARAFRASE compiler, only subscripts which are linear functions of the index variables are tested. A linear function looks like AO+A 1 *I 1+A2*I2+ . . . , where 11, 12, ... are the index variables. Banerjee's general test handles nonlinear functions of the index variables, but they are rare enough that little harm is done by assuming data dependence when they occur. 2.3 Method of DO Loop Distribution The first step in distributing DO loops is to form a data dependence graph of the program. The second step is to find all the cycles in the data dependence graph. Suppose we denote "Sq is data dependent on Sp" by Sp->Sq. If Sp->Sp, then Sp forms a cycle by itself. If Sp->Sq->Sp, then Sp and Sq form a cycle. Likewise, if Sp->Sq 1 ->Sq 2 -> . . .->Sq n ->Sp , then Sp , Sq , ..., Sq form a cycle . Once the cycles have been found, the third step of DO loop distribution is to partition the program into Pi-parti tions . Any assignment statement that is not in any data dependence cycle forms a Pi-partition by itself. Any 1 1 assignment statement that is in a data dependence . cycle is in a Pi-partition with all the statements in that cycle. Each Pi-partition corresponds to some sort of a parallel operation. Finally, DO loops can be distributed over each Pi-partition . This is the same as distributing DO loops over statements, except that the loops are not distributed over statements in the same Pi-partition . S1 : S2: S3: S4: S5: original program + DO 1=1 ,UI A(I)=C(1 ,1-1) +__ DO J=1 ,UJ C( J,I)=B( J-1 ) B(J)=A(I) CONTINUE . . .=B(UJ) . . .=C(UJ,I) CONTINUE i i i i + -- S1 • : S2' : S3' : distributed program + DO 1=1 ,UI A(I)=C(1 ,1-1) +-- DO J=1 ,UJ ! C(J,I)=B(J-1) ! B(J)=A(I) +-- CONTINUE + CONTINUE + DO 1=1 ,UI S4':| ...=B(UJ) + CONTINUE + DO 1=1 ,UI S5':l ...=C(UJ,I) + CONTINUE Example 6 - Distribution over Pl-Partit ions After the loops have been distributed over each Pi-partition , a partial ordering relation between the Pi-partitions must be found. As mentioned before, the original ordering of the statements may no longer be valid after DO loop distribution. The partial ordering relation 12 between Pi-partitions can be used to generate a valid statement ordering after DO loop distribution. The partial ordering relation between Pi-partitions is easily found from the data dependence graph. If any statement in Pi-partition P2 is data dependent on any statement in Pi-partition P1, then P2 must follow P1. Notice that this relation will indeed be a partial ordering. If it were not, then there would be a cycle of data dependence between statements in different Pi-partitions . However, any two statements in a cyclic dependence are by definition in the same Pi-partition . The final step in DO loop distribution is to classify the Pi-partitions . A Pi-partition composed of a single statement which is not in a data dependence cycle is a vector operation. In a vector operation, all the input data can be fetched at once, all the operations can be performed at once, and all the results can be stored at once. If a Pi-partition is composed of statements which are involved in a data dependence cycle, then the Pi-partition is some sort of a recurrence. Recurrences can be further subdivided into linear and nonlinear recurrences. Linear recurrences can be solved using special algorithms which are fast on a parallel machine, described in [CHEN]. Some 13 nonlinear recurrences can be linearized and solved using algorithms similar to Chen's. Other recurrences may have to be executed serially. One non-obvious benefit of DO loop distribution is that each Pi-partition is a Single-Instruction-Mul tiple-Execut ion (SIME) block of code. A vector operation can be executed one operation at a time for the entire vector. The algorithms described in Chen's thesis for solving linear recurrences are SIME. Even serial operations can be considered a limiting case of SIME code, since only one operation is being performed at a time. 14 CHAPTER 3 Classifying Recurrences After DO loop distribution, each Pi-partition is classified as either a vector operation or a recurrence. Recurrences are broken down into several types and classes. Recurrences which can be identified as of a simpler type can be solved using faster or more efficient versions of the basic algorithm. 3.1 Types of Recurrences The first division of recurrences is between linear and nonlinear recurrences. A linear recurrence is a recurrence where each new computation is a linear function of previous computations. A nonlinear recurrence is a recurrence where each new computation is a nonlinear function of previous computations. A linear recurrence can always be transformed into a standard format, with a recurrence matrix A, an initial value vector c, and a result vector x. 15 + DO 1=1 ,N X(I)=C(I) -- DO J=1 ,1-1 X(I)=X(I)+A(I, J)«X(J) CONTINUE CONTINUE Example 7 - Standard Recurrence The recurrence matrix is always strictly lower triangular. If the recurrence matrix is full, then each new computation depends on all of the previous computations. This is called a full recurrence . Chen shows that with 3 12 N /68 processors, a full recurrence can be solved in -z-lg N + 3 ■jlgN time steps, where a time step is a add or a multiply. If the recurrence matrix is banded, that is, it has at most M non-zero subdiagonal bands, then each new computation depends only on the M previous computations. + DO 1=1 ,N X(I)=C(I) +-- DO J=I-M,I-1 I X(I)=X(I)+A(I,J)*X(J) +-- CONTINUE + CONTINUE Example 8 - Standard Banded Recurrence This is called a banded recurrence . In [SAMEH], it is shown 1 2 that with -rM N processors, a banded recurrence can be solved in (lgM+2)lgN time steps 16 Sometimes the recurrence matrix takes a special form, called a Toeplitz form. In this case, A(I,I-b)sa(b) , for all I. That is, each subdiagonal band is constant. A recurrence of this form is called a constant coefficient recurrence . + DO 1=1 ,N X(I)=C(I) + -- DO J=1 ,1-1 ! X(I)=X(I)+A(I-J)«X(J) + -- CONTINUE + CONTINUE Example 9 - Constant Coefficient Recurrence A constant coefficient recurrence may be either full or banded. The time bounds for solving a constant coefficient recurrence are the same as the time bounds for solving a general recurrence, although fewer processors are needed to achieve this bound. Another special type of recurrence arises when only the last element of the result vector is used outside the computation of the recurrence itself. This is called a remote term recurrence . An example is an inner product. + DO 1=1 ,N ! C(I)=G1(I)*G2(I) + CONTINUE + DO 1=1 ,N ! X=X+C(I) + CONTINUE Example 10 - Inner Product is Remote Term Recurrence 17 None of the intermediate terms need to be saved.- A remote term recurrence can be banded or full, and can be constant coefficient or not. Again, the time bounds for solving a remote term recurrence are the same as for solving a general recurrence, but fewer processors are needed. Derivations of the time steps, processor bounds, and operation counts for solving these types of recurrences are given in the Appendix . 3.2 Splitting Recurrences Sometimes a single recurrence can be split into several smaller independent recurrences. This is presented in [TOWLE]. In real programs, this often occurs when a DO loop surrounds a recurrence. + DO K=V,G +-- DO 1=1 ,N X(K)=X(K)+C(I,K) CONTINUE CONTINUE Example 11 - Recurrence that can be Split Initially, this may look like a recurrence of size G*N. In fact, this is G independent recurrences, each of size N. By reducing the size of the recurrence, the amount of time to solve the system, as well as the number of processors needed, is reduced. 18 CHAPTER 4 IF statements When compiling programs for special architecture machines, extra care must be taken with IF statements. An IF can change the flow of control of the program, and so figures into the data dependence graph. IF statements in loops can prevent loop distribution. A large number of IFs can adversely affect the amount of parallelism detectable in the program. Special methods are used to handle IFs. 4.1 IF trees When the ratio of IFs to assignment statements is relatively large, then it is reasonable to use the method of IF-trees, described in [DAVIS]. Essentially ,. this method computes the results for all possible paths, then chooses the desired results with one large conditional. The number of operations is increased, but the number of conditionals is decreased . 19 Example 12 - IF tree made from 3 IFs 4.2 IFs inside DO loops When an IF is inside a DO loop, that IF must be executed for every iteration of that DO loop. Sometimes the condition being tested does not change inside the loop. In this case, the branch taken will be the same for every iteration . 20 + DO 1=1 ,UI (statements not changing B> IF (B.GT.O) C(I)=C(I)/B + CONTINUE Example 13 - Loop Invariant IF Here, the IF can be removed from the scope of the loop entirely. Two copies of the loop are used, and the condition is tested to choose which loop to execute. DO 1=1 ,UI < statements> CONTINUE + DO 1=1 ,UI C(I)=C(I)/B -- CONTINUE Example 13 - Removing a Loop Invariant IF More often, the condition being tested does change between iterations of the DO loop. In this case, the best result that can be achieved is to keep everything a vector operation. When the condition being tested is independent of the results of previous iterations of the loop, the IF can be precomputed. The results of the condition can be saved in a logical vector. This logical vector can be used as a mask in later operations. 21 + DO 1=1 ,UI S1: ! IF (B(I).NE.O) C( I )=C( I ) /B( I) + CONTINUE becomes + DO 1=1 ,UI ! MASK(I)=B(I) .NE.O + CONTINUE + DO 1=1 ,UI S1 i . ! C(I)=C(I)/B(I) , masked by MASK(I) CONTINUE Example 14 - Precomputed IF Now S1 1 is easily recognizable as a vector operation. On a parallel or vector machine, the concept of a bit vector as a mask for operations is inherent in the structure of the machine . Other types of IFs can be handled by more complex or more costly methods. For a more complete discussion of IFs, see [TOWLE]. 22 CHAPTER 5 Other Manipulations on DO loops It may not always be efficient to compute each PI partition using the fastest known methods. For instance, if a program contains a large full recurrence, the fastest 3 method to solve this recurrence uses N /68 processors. + DO 1=1 ,N +-- DO J=1 ,1-1 I X(I)=X(I)+A(I,J)*X(J) + -- CONTINUE + CONTINUE Example 15 - Full Recurrence of size N 1 2 Using this many processors, the solution would require ylg N time steps. If only N processors are available, using this 2 2 algorithm would require (N /136)lg N. A different method of solution would serialize the outer loop, and run only the inner loop in parallel. This is equivalent to a program with the outer loop "unrolled". This method uses at most N processors at one time, and requires at most NlgN time steps. While in general this algorithm is slower than the fastest method, it is more efficient with a smaller number of processors. This chapter describes methods which can be used to manipulate the program into a sometimes more efficient or more accommodating form. 23 DO Js1 , 1 X(2)=X(2)+A(2, J)»X(J) CONTINUE DO J=1 ,2 X(3)=X(3)+A(3, J)*X(J) CONTINUE + DO Js1 , N-1 ! X(N)=X(N)+A(N, J)*X(J) + CONTINUE Example 15 - Same Program with Outer Loop Serialized 5.1 Simple Case, One Loop Consider a simple loop with one statement. In this loop, the program is computing a vector of data. + DO 1=1 ,UI S: I A(I)=... + CONTINUE Example 16 - Simple Loop with One Statement Suppose that for some i and j, where jJ] S[I,J]. So S[I,J] may depend only on S[J]. Again, all the data dependences cross the I loop. Here however, the results are preserved if the inner loop is executed "backwards", from the upper bound to the lower bound . + DO 1=1 ,UI +-- DO J=1 ,UJ I A(I, J)=A(I-1 , J)+A(I-1 , J+1 ) +-- CONTINUE + CONTINUE Example 21 - Executing J Backwards Preserves Results Two interesting things can be done now. The wavefront method can be applied, with an angle of 135 . Or, the two loops can be interchanged, providing that J is executed backwards. + DO J=UJ, 1 ,-1 +-- DO 1=1 ,UI I A(I, J)=A(I-1 , J)+A(I-1 , J+1 ) +-- CONTINUE + CONTINUE Example 21 - With Loops Interchanged 5-3 Triple Loop Case To further complicate matters, consider a single statement inside three loops. Now the program is computing a cube of data. 28 + DO 1=1 ,UI -- DO J=1 ,UJ +- DO K=1 ,UK i A(I, J,K)= . . . +- CONTINUE +-- CONTINUE + CONTINUE Example 22 - Three Loops The statement S[I,J,K] might be data dependent on any of the following . S[I,K] S[I, J,K] S[K] S[J,J, K] S[J,>K] Serializing the outer loop satisfies all dependences of the form "S[I,J,K] depends on S[ J,*] or S[I,J,*]. This happens when S[I,J,K] depends only on S[K]. So S[I,J,K] can depend only on S[I,J,K] interchange J and K + DO 1=1 ,UI +-- DO K=1 ,UK + - DO J=1 ,UJ ! A(I,J,K)=A(I,J,K-1)+A(I-1,J+1,K) +- CONTINUE {+-- CONTINUE + CONTINUE S[I,K,J] depends on S[I,J] interchange I and K + DO K=1 ,UK + -- DO 1=1 ,UI +- DO J=1 ,UJ ! A(I, J,K)=A(I, J,K-1 )+A(I-1 , J+1 ,K) +- CONTINUE +-- CONTINUE +„_- CONTINUE S[K,I,J] depends on S[J] Example 25 - Multiple Interchanging Any number of outer loops can be executed serially to satisfy all data dependences across that loop boundary. If all the dependences are satisfied this way, a vector operation remains. Otherwise, a recurrence must still be solved . 31 5.4 General Loop Nesting Now consider a statement nested inside d DO loops. The program is computing a d-dimensional cube of data. This may correspond to a Pi-partition after DO loop distribution. + DO 11=1 ,UI1 •-- DO 12=1 ,UI2 + - i DO Id=1 ,UId A(I1 ,12, . . . ,Id) = . . . CONTINUE CONTINUE CONTINUE Example 26 - d DO Loops Now, S[ 1 1 , 12 , . . . , Id] may depend on S[I1+ 1 , * ] or S[ 1 1 , . . . , II , 11+ 1 , * ] . With the new loop structure, the interchanging process can be repeated. Loop II can be serialized if all loops containing loop II have been serialized. Executing loop II serially has the effect of satisfying the dependence of S[I1,...,Id] on any statement S[I1 II- 1 , K = K + 3 (statements) CONTINUE + -- - DO 1=1 ,UI (statements) K = K + N (statements) - CONTINUE invalid examples of induction variables + DO 1=1 ,UI (statements) K = K+N N=. . . + CONTINUE + DO 1=1 ,UI (statements) K=K+N(I) (statements) -- CONTINUE Example 29 - Induction and Non-Induction Variables Also, the incrementing statement must be executed for every iteration of the DO loop. It cannot be the object of an IF statement, for example. + DO 1=1 ,UI (statements) IF(...) K=K+3 (statements) -- CONTINUE Example 30 - Not an Induction Variable 36 Induction variables are not restricted to a single increment statement. Multiple increment statements are allowed, as long as each increment is executed during every loop iteration, and the increment expressions are all loop invariant . Multiply-nested loops are also allowed. Naturally, the induction variable may be an induction on any one of the nested loops. It may also be an induction on two or more loops . induction on inner loop + DO 1=1 ,UI K = + -- DO J=1 ,UJ I K=K+3 i +-- CONTINUE induction on both loops K = + DO 1=1 ,UI +-- DO J=1 ,UJ ! K=K+3 ! +-- CONTINUE + CONTINUE + CONTINUE Example 31 - Induction Variables in Multiple Loops The increment statement may also be a decrement statement. As long as the value added to the scalar in the loop does not change within the loop, it may be positive or negative . 37 6.2 Substitution for Induction Variables If the conditions for induction variable substitution have been satisfied, the variable can be replaced by an expression. In each statement in the loop, the variable can be replaced by an expression which is equal to the value of the variable. This expression will be composed of the initial value of the variable, the index variables, and the increment values. The expression will not be the same for all statements in the loop. K = + DO 1 = 1 ,UI K = K + 3 + CONTINUE Example 32 - Simple Induction Variable Expressions Consider this simple example. The value of K at the beginning of the loop is zero, and the increment is 3. In statements below the increment statement, K has the value ( increment*index) , since it has then been incremented I times. In statements above the increment statement, K has been incremented only (1-1) times, since it hasn't been incremented for the current iteration yet. Suppose the initial value is non-zero. 38 K= 1 7 + DO 1=1 ,UI K = K + 3 -- CONTINUE Example 33 - Non-Zero Initial Value Below the increment statement, the value of K is ( increment*index+initial) . Above the increment statement, K has been incremented only (1-1) times, as before. Now add a second increment statement. K= 1 7 + DO 1=1 ,UI K = K + 3 K = K + 5 + CONTINUE Example 34 - Multiple Increments Below the last increment statement, the replacement expression for K is ( index* ( increments) ^initial ) . Above each increment statement, that increment has been executed only ( I- 1 ) times . In a nested loop, the Expressions are a little more complex . 39 K=17 + DO 1=1 ,UI - DO J=1 ,UJ K = K + 3 CONTINUE + CONTINUE Example 35 - Multiple Loops Below the inner loop, K has been incremented (I*UJ) times. It gets incremented UJ times for each iteration of the outer loop. Above the inner loop, K has been incremented only ((I-1)*UJ) times. Within the inner loop, the value is similar to a simple induction variable with an initial value of ( (1-1 )»UJ)*3+17. The last example is with two increment statements in nested loops. K=17 + DO 1=1 ,UI +-- DO J=1 ,UJ K = K+3 +-- CONTINUE K = K + 5 CONTINUE Example 36 - Multiple Increments in Multiple Loops 40 A general procedure could be developed, but the notation would only be confusing. The idea is too simple to spend too much time on formality. The most common case seen is a simple induction with one increment statement in one loop. The second most common case has one increment in two loops. This is used to step through arrays linearly inside a nested loop. 6.3 Usefulness Induction variable substitution reduces the number of data dependences. Before the substitution, a statement that uses the induction variable is dependent on the increment statement. After the substitution, the statement is no longer dependent on the induction variable at all. The clearest advantage of induction variable substitution is the increase of knowledge available about the indexing patterns through the arrays in the loops. It also permits the use of simpler data dependence tests on the subscripts . 41 CHAPTER 7 Subscript Addition Often a programmer will use a scalar temporary in a loop to hold a common subexpression. Sometimes forward substitution of this expression can be done. After forward substitution, DO loop distribution may continue. This may not be desirable, since it increases the total number of operations . original + DO 1=1 ,UI T= A(I)=T«. .. B(I)=T+. . . C(I)=T/. . . + CONTINUE forward substituted + DO 1=1 ,UI A(I) = ». . . B( I) = +. . . C(I) = /. . . + CONTINUE Example 37 - Scalar Temporary Forward Substituted Forward substitution may not always be possible. original + DO 1=1 ,UI T=A(I) A(I)=Z(I) Z(I)=T + CONTINUE not the same + DO 1=1 ,UI A(I)=Z(I) Z(I)=A(I) -- CONTINUE Example 38 - Invalid Forward Substitution Loop distribution on the original program results nonsense . in 42 + DO 1=1 ,UI ! T=A(I) + CONTINUE + DO 1=1 ,UI i A(I)=Z(I) + CONTINUE + DO 1=1 ,UI ! Z(I)=T + CONTINUE Example 39 - Nonsense Loop Distribution However, the desired result can be obtained by making the temporary an array, instead of a scalar. + DO 1=1 ,UI ! T'(I)=A(I) + CONTINUE + DO 1=1 ,UI I ACI)=Z(I) + CONTINUE + DO 1=1 ,UI ! Z(I)=T'(I) + CONTINUE Example 40 - Valid Loop Distribution The same idea of using an array temporary instead of a scalar temporary may be used in Ex. 37 to circumvent the addition of redundant operations. + DO 1 = 1 ,UI T f (I)= A(I)=T'(I)«. • . B(I)=T'(I) + . . . C(I)=T f (I)/. . . + CONTINUE Example 37 - With Array Temporary This idea is called scalar expansion . 43 7.1 Scalar Expansion The idea behind scalar expansion is to use a new temporary for each iteration of the loop. This is most easily done by making the temporary an array. Then by using a different element of the array for each iteration of the loop, we get a new temporary for each iteration. Expansion of scalars into arrays is not always straightforward. Suppose an initial value of the scalar is used inside the loop. T= I A(I)=T»... + CONTINUE Example 41 - Initial Value Used in Loop Somehow, the element used in S1 must address the initial value, when 1=1, and the last value assigned, for other iterations. This can be done by using the element T'(I-1) for T in S1. The initial value is then assigned to T'(0) S1 : T ' (0 )= A(I) = T'(I)«. .. + CONTINUE Example 41 - With Expanded Scalar 44 When the final value assigned to the temporary T is used outside of the loop, an assignment back to the scalar at the end of the loop must be added. T' (0)= A(I)=T'(I)»... + CONTINUE T = T' (UI) Example 41 - With Post-Loop Reassignment Until now, only single loops have been considered. For nested loops, a similar approach is used. In a doubly nested loop, two subscripts are added. With a DO loop nesting of D, D subscripts are added. Care must be taken that the value read for the expanded temporary is always the most recent value assigned. T= S3: ! ! A(I, J) = T*. . . |+— CONTINUE + CONTINUE Example 42 - Doubly Nested Loops In this example, when [ I , J ] = [ 1 , 1 ] , S1 must read the initial value of T. In later iterations of J, S1 must read 45 the value assigned to T by S2 in the previous iteration of J. That is, when [ I , J] = [ 1 , j ] , S1 must read the value assigned to T by S2[1,j-1]. Also, S1[i,1] must read the value assigned to T by S2[ i- 1 , UJ] . Statement S1 oan read a value of T from the previous iteration of the J loop, or from the previous iteration of the I loop. T • (0 , 0)= S3: ! I A(I, J)=T' (1-1 , J)». . . !+-- CONTINUE S4: | T' (I,0)=T» (1-1 ,UJ) + CONTINUE S5: T=T'(UI,0) Example 42 - With Expanded Scalar The assignment added in S4 is similar to the assignment in S5. Statement S4 passes the value of T' to the next iteration of the I loop. This assignment can be costly in terms of extra memory movement. However, it is necessary only when a statement such as S1 , which can read "old" values of T, is present. Proper scalar expansion allows the DO loops to be distributed, although the statements may have to be reordered. + S2: S4: ! + -- T' (0 ,0)= CONTINUE - CONTINUE - DO 1=1 ,01 T' (I,0)=T' (1-1 f UJ) - CONTINUE 46 S3 S1 S5 + DO 1=1 ,01 -- DO J=1 ,0J A(I, J)=T» (1-1 ,J)». . . CONTINOE + CONTINOE + DO 1=1 ,01 +__ DO J=1 ,0J ! D(I, J)=T» (1-1 , J-1 ) + . . . +-- CONTINOE + CONTINOE T = T' (01,0) Example 42 - Distributed This statement ordering is not unique. 7.2 Complications with IF statements As always, the addition of IF statements complicates matters. Extra assignments often must be added when scalar expansion is done and IFs are present. One case is when the scalar temporary is conditionally assigned. Simple subscript addition can be incorrect. + DO 1=1 ,01 I IF( . . .) T= I A(I)=T»... + DO 1=1 ,01 I IF(...) T!(I)= i A(I)=T'(I)«. . . + CONTINOE + CONTINOE + original incorrect expansion of T Example 43 - Incorrect Expansion with IF Statement When the condition is not satisfied, T'(I) is not assigned anything. In this program, the IF can be changed into an IF-THEN-ELSE. 47 + DO 1=1 ,UI IF( . . .) THEN T» (I) = ELSE T' (I)sT' (1-1 ) A(I)=T' (I)*. . . -- CONTINUE Example 43 - Correct Expansion with IF Statement This generates an obscure kind of recurrence. However, this recurrence is restricted to T'. The loop can be distributed, whereas in the original program, it could not be. + DO 1=1 ,UI ! IF (...) THEN T'(I)= I ELSE T' (I)=T' (1-1 ) + CONTINUE + DO 1 = 1 ,UI i A(I)=T'(D*. . . + CONTINUE Example 43 - Eistributed An IF followed by a forward GOTO presents a similar problem. The assignment to the scalar may be skipped over. The problem and the solution are similar to that of conditional assignment. + DO 1=1 ,UI IF( . . .) GOTO 7 T= A(I)=T«. . . + CONTINUE 7: original + DO 1=1 ,UI IF( . . .) THEN DO T' (I)=T f (1-1 ) GOTO 7 END T' (I)= A(I)=T(I)»... + CONTINUE correct expansion Example 44 - Correct Expansion with Forward GOTO 48 Backwards GOTOs present another problem. A backwards GOTO may branch to the program before the assignment to the temporary. The solution is somewhat similar to the above. + DO Ia1 ,UI A(I) = T*. . . T= IF( . . .) GOTO 8 + CONTINUE + DO 1=1 ,UI 8: ! A(I) = T»(I-1)». . . T' (I)= IF( . . .) THEN DO T' (1-1 )=T' (I) | GOTO 8 ! END + CONTINUE Example 45 - Correct Expansion with Backwards GOTO Thi3 may not help in loop distribution. The whole problem of IF loops needs to be studied. One feasible suggestion is to structure an IF-GOTO loop as a DO-WHILE loop, which can be treated much like ordinary DO loops. Other types of GOTOs may be encountered, for example, GOTOs out of the loop, or between loop nesting levels. The general rule is to make sure that the next use of the temporary reads the most recent assignment to that temporary. The extra memory movement necessary to insure this may be too costly to do scalar expansion. 7.3 Method of Scalar Expansion For each scalar T to be expanded, declare a new array variable T'. Find the assignment to T with the deepest DO loop nest level. Give T 1 that many dimensions. Throughout 49 the loop, each occurrence of that scalar will be re-placed by an expression which will consist of the new array, subscripted by expressions of the index variables. Associate with each dimension of the new array a DO loop nesting level. So, the first dimension is associated with the outer nest level, the second dimension with the next DO nest level, etc. An assignment to the zeroeth element of T' may have to be made, if T is read in the loop before it is assigned. This assigns an initial value to the array. The assignment T« (0,0)sT is adequate. Initially, the replacement array expression is T * (0 , , . . . , ) . Travel through the loop, replacing each occurrence of the scalar with the replacement array expression. When a loop of nest level L is entered, change the L subscript of the replacement array expression from "0" to "1-1", where I is the index variable for that DO loop . -- DO 1=1 ,UI . . ,sT' (1-1 ,0) . . . -- DO J=1 ,UJ . . ,=T» (1-1 , J-1 ) CONTINUE + CONTINUE 50 When the first assignment to the temporary within a DO loop is reached, replace any occurrence of the temporary on the right hand side by the current replacement array expression. Then, change the replacement array expression so the L subscript is "I" instead of "1-1". Do not change any other subscript expression. Do not change anything upon reaching a second assignment within the same loop. + -- + - + -- - DO Is1 ,UI DO J=1 ,UJ T' (1-1 , J) = T« (1-1 , J-1 ) CONTINUE CONTINUE When leaving an inner loop, it may be necessary to generate an assignment to carry out the last value assigned in the loop. If the loop just exited is at nest level L, change the replacement array expression so the L subscript is "0" instead of "J". Change the L-1 st to "I" instead of "1-1", if it is not already. The generated assignment uses this array on the left hand side. The right hand side is the old replacement array expression with the L subscript replaced by "UJ", where UJ is the upper bound expression for the loop just exited. + + — + — DO 1=1 ,UI DO J=1 ,UJ CONTINUE T» (I ,0)=T' (1-1 ,UJ) CONTINUE 51 When leaving the outer loop, it may be necessary to generate an assignment to replace the most recent assignment to the temporary back to the original scalar. + DO Ir1 ,UI CONTINUE TsT 1 (UI,0) Conditional assignments and COTOs must be handled carefully. In all such cases, the replacement array expression must refer to the most recent assignment to the temporary . Scalar expansion may seem to be self-defeating. Quite often, many extra memory movements must be added to be correct. Always, the memory requirements increase dramatically, especially for deeply nested loops. However, the idea is to be able to distribute loops around the statements which use the temporary. Also, the use of an array allows the machine to be filled, if it is a parallel architecture, or for large vectors to be operated on otherwise. Limited expansion of scalars, adding only 1 or 2 subscripts, may be sufficient to make the use of the hardware effienct. It may not be wise to allocate large temporary arrays. 52 7.4 Array Expansion The idea behind scalar expansion is to make the temporary "large" enough so that each iteration of the loop refers to a new variable. This is done by giving the temporary as many dimensions as needed, so each DO loop has its own subscript. One may wonder about arrays inside of DO loops, which do not have this many dimensions. The question arises whether arrays can be, and should be, expanded. This is the question of array expansion. A surprisingly common practice is for a programmer to use an array with constant subscripts in a loop. This is often done to pass a large amount of information in a single parameter to a subroutine. Each element of the array can be considered an independent scalar. The same strategy employed for scalar expansion will work for this case. 53 + DO 1=1 ,UI T(5)=. . . T(6)=. . . . ..=T(4) ...=T(5) T(7)=. . . original Example 46 T5'(0)=T(5) T6 f (0)=T(6) T7'(0)=T(7) + DO 1=1 ,UI T5'(D=. . • T6»(I)=... . . .=T(4) CONTINUE i ...=T5'(I) T7'(D=. • . -- CONTINUE T(5)=T5'(UI) T(6)sT6» (UI) T(7)=T7' (UI) proper expansion Array Expansion with Constant Subscripts Another common practice is for a singly-subscripted array to be used inside a doubly nested DO loop, with only one index variable used in its subscript. If only the outer DO loop index variable appears, then a reference to the array in the inner DO loop is essentially a reference to a scalar, since it does not change for different iterations of the inner loop. + -- + - + -- - DO 1=1 ,UI DO J=1 ,UJ A(I)=. . . A(I+1)=. CONTINUE CONTINUE i i + -- + DO 1=1 ,UI A' (I,0) = A(I) A' (1+1 ,0)=A(I+1 ) DO J=1 ,UJ A' (I, J)=. . . A' (1+1 , J) = . . . CONTINUE A(I)=A« (I,UJ) A(I+1 )=A» (1+1 ,UJ) - CONTINUE Example 47 - Array Expansion in Inner Loop Only This is done similarly to scalar expansion. 54 If only the inner DO loop index variable ever appears, then the entire inner loop and the array can be treated somewhat like a large scalar, looking from the outer loop. + DO 1=1 ,UI ...=A(3) - DO J=1 ,UJ . . .sA( J) A(J) = . . . . . .=A(J) +-- CONTINUE A(4)=. .. CONTINUE + DO Js1 ,UJ j A'(J,0)=A(J) + CONTINUE + DO 1=1 ,UI . ..=A'(3,I-D + -- DO J=1 ,UJ . . .=A' (J, 1-1 ) A' (J,I)=. . . . ..=A'(J,I) CONTINUE A'(4,I)=. . . + CONTINUE + DO J=1 ,UJ i A(J)=A'(J,UI) + CONTINUE Example 48 - Array Expansion of Outer Loop Only Again, this is similar to scalar expansion. However, array expansion includes a heavy penalty. All the initialization assignments and other added statements are vector operations. This causes a great deal, of memory movement. The entire procedure was introduced to facilitate loop distribution. For non-trivial cases, this goal cannot be guaranteed . The general case of array expansion can be described, and a method defined for implementing it. It may not serve any useful purpose. If the DO loops cannot be distributed, 55 then the added statements merely make the Pi-partitions larger and more complex. Small problems, like IFs, can make the method very complicated. The tradeoff between added operations, algorithmic complexity, and possible enhancement of loop distribution should be considered before trying to implement general array expansion. 56 CHAPTER 8 The PARAFRASE Compiler This chapter describes the PARAFRASE FORTRAN compiler. This compiler consists of 13,000 PL/I statements, and is currently running on an IBM 360/75 and an IBM 370/158. It accepts ANS FORTRAN with many of the IBM extensions. The compiler is divided into many passes. Each pass makes some transformation on the program. The program is manipulated in essentially source form. The compiler uses many special algorithms to detect parallelism in the program, as well as other standard compiler methods. 1 . Lexical Scanning The first pass over the program . scans the source text and saves the program in the standard compiler data structures. While it is scanning the program, some cosmetic changes in the program are made. The data structures are organized so that the original program can be reconstructed. 2. DO Loop Normalization 57 Several of the algorithms described depend on DO loops having a lower bound of 1 and an increment of 1 . In particular, the data dependence tests, and induction variable substitution depend on this. After lexical scanning, this pass changes DO loops to satisfy this condition. The new upper bound is ( upper-lower+1 ) /increment . The index variable is replaced by an expression in the loop, ( index- 1 ) •increment+lower to reflect the change. + DO 1=4, N, 3 i A(I)r... + CONTINUE + DO 1=1 , (N-4+1 )/3, 1 I A((I-1)«3+4)s... + CONTINUE Example 49 - DO Loop Normalization IF Pattern Matching In real programs, IF statements are often used to replace MAX/MIN builtin function calls. This enhances transportability, since differect implementations have different names for these functions. This pass recognizes some of these patterns. Soon, it will also recognize vector MAX/MINs. 58 scalar MIN IF(A.LT.B) B = A vector MAX B=MIN(B, A) + -- M = - DO 1=1 ,01 IF(A(I) .GT.M)M=A(I) - CONTINOE M = - DO 1=1 ,01 M=MAX(A(I) ,M) - CONTINOE Example 50 - IF Pattern Matching 4. Scalar Renaming This is a standard compiler algorithm used to decrease the total amount of data dependence in the program. Whenever possible, scalars in the program are renamed. A=. . . . . .=A A • • ■ • Example 51 - Scalar Renaming As. . . • < • — n Aa= . . . • • • — fi d 5. Induction Variable Substitution Induction variables were talked about in Chapter 6. This pass handles most common cases of induction variables, with one increment statement, and inside of one or two DO loops. 59 + -- i i i i K = 3 - DO 1=1 ,UI K = K + 5 A(K)=... CONTINUE + K = 3 DO 1=1 ,UI K=I«5*3 A(I»5 + 3) = CONTINUE Example 52 - Induction Variable Substitution 6. Scalar Expansion Scalar expansion was described in Chapter 7. This pass handles the general case of scalar expansion. It expands all scalars in DO loops to arrays. The initialization and the other assignment statements are always added, for correctness. No checking is made to see if they are necessary. This can cause much unnecessary memory movement. + DO 1=1 ,UI 1 T=. . . 1 1 -T* ¥ CONTINUE + -- I I + -- T' (0)=T - DO 1=1 ,UI T ' ( I ) = . . . . . .=T»(I)« - CONTINUE T = T' (UI)' Example 53 - Scalar Expansion 7. Array Expansion Array expansion was also described. This pass handles the limited case of a singly subscripted array in multiple loops, or with constant 60 subscripts. For most purposes, this is sufficient. 8. IF Removal from DO Loops The problem of IFs in DO loops was mentioned. This pass removes Towle's "A" and type "B-prefix" IFs from the scope of DO loops. The theory of IF removal is currently being reconsidered to make it more general . 9. IF Treeing IF treeing was described to be a method to reduce the total number of conditionals in the program. This pass combines IFs outside of DO loops when the ratio of IFs to assignment statements reaches a certain threshhold. 10. Code Generation Three address vector oriented pseudo-code is generated. This code can be analyzed to get bounds on the speedup gained by the parallelism mechanisms, and to see how effective these mechanisms are. Also, this code can be simulated on various machine architectures, to see what kinds of machines are good for what kinds of algorithms. 61 1 1 . Analysis and Statistics After the program is compiled, the code generated is analyzed. The results of the analysis are speedup bounds: how much faster this program could run on a suitable machine than on a serial machine. In addition, global statistics about the program are collected and saved for comparison to other programs. Much work has been done in the theory of compiler for parallel and vector machines. In the PARAFRASE FORTRAN compiler, we have implemented many of the algorithms for parallelism detection and parallelism enhancement. In the course of testing the compiler, several new methods have been discovered for parallelism enhancement, and these too have been implemented. Using this compiler, parallel or vector oriented machines will be able to execute many ordinary programs efficiently on their own special architecture. When more of the theory is understood, and more methods implemented, it is hoped that a large class of ordinary programs will be able to be compiled for these new machines, utilizing their special hardware in a cost effective way. 62 List of References [BANERJEE] Utpal Banerjee "Data Dependence in Ordinary Programs" Master's Thesis, University of Illinois at Urbana UIUCDCS-R-76-837, November, 1976. [CHEN] Shyh-ching Chen "Speedup of Iterative Programs in Multiprocessing Systems" Ph.D. Thesis, University of Illinois at Urbana UIUCDCS-R-75-694, January, 1975. [CHEN 76] Shyh-ching Chen, David Kuck, Ahmed Sameh "Practical Parallel Triangular System Solvers" to appear in !£££ Transactions c_& Mathematical Software [DAVIS] Edward Davis "A Multiprocessor for Simulation Applications" Ph.D. Thesis, University of Illinois at Urbana UIUCDCS-R-72-527, June, 1972. [LEASURE] Bruce Leasure "Compiling Serial Languages for Parallel Machines" Master's Thesis, University of Illinois at Urbana UIUCDCS-R-76-837, November, 1976. [MURAOKA] Yoichi Muraoka "Parallelism Exposure and Exploitation in Programs" Ph.D. Thesis, University of Illinois at Urbana UIUCDCS-R-71-424, February, 1971. [SAMEH] Ahmed Sameh, Richard Brent "Solving Triangular Systems on a Parallel Computer" 3IAM Jpurqfrl on Numerical Analysis Vol. 14, No. 6; December, 1977; pp. 1101-1113. [TOWLE] Ross Towle "Control and Data Dependence for Program Transformations" Ph.D. Thesis, University of Illinois at Urbana UIUCDCS-R-76-788, March 1976. 63 APPENDIX A Recurrence Formulae After we compile a program, we try to see how fast the compiled program would execute on a suitable multiprocessor. To do this, we must know how fast a processor can do any operation. We make the assumption that any arithmetic operation (+,-»*»/) can be performed by a processor in one time step. Furthermore, p processors can perform n independent operations in — time steps. Using these assumptions, and ignoring data alignment (fetch and store) time, we can find the time necessary to perform any sequence of vector operations, by summing the time necessary for each one. We assume the machine is SIME, so no overlap of two distinct operations is done. This would simplify the design of a machine and a compiler. The only other type of operation considered is a linear recurrence. To compute how fast a recurrence could execute on a machine, we must know what algorithm is being used to solve the recurrence. Several algorithms have been proposed to solve linear recurrences on parallel machines [CHEN, CHEN76, SAMEH]. Here we give the number of time steps 64 necessary, along with the number of arithmetic operations performed, to solve a linear recurrence when using these fast parallel algorithms. Throughout this section, lg(n) is logarithm base 2, and [x] is the least integer greater than or equal to x . A.1 Full Recurrences with unlimited processors The algorithm for solving full recurrences with an unlimited number of processing elements available is given in [CHEN] or [SAMEH]. A bound is given on the number of processors used in the algorithm. A. 1.1 General Full Recurrence, R time steps s ^lg 2 (n) + |lg(n) multiplications = additions = 13 12 1 1 _ n + _ n . _ n . _ processors = 15 3 12 1 T024 n + o 70 + T6 n A. 1.2 R with ( n+ 1 ) Right Hand Sides 65 time steps = llg 2 (n) + |lg(n) multiplications = additions = 23 3 12 1 1 42 n " 6 n " 3° " IT processors = 21 3 5 2 1 __ n + _ n + _ n A. 1.3 R, Remote Term only time steps = jlg 2 (n) + flg(n) multiplications = additions = 13 12 2 28 n + 4 n "7 processors = 3 3 9 2 1 256 n + 6T n + 8 n A. 1.4 Full Recurrence with Constant Coefficients, R time steps = 66 ^■lg 2 (n) + |lg(n) multiplications = additions = 13 13 2 5 1 48 n + 24 n " 6 n + 3 processors = 13 3 2 1 T28 n + T6 n + 8 n A. 1.5 R with ( n+ 1 ) Right Hand Sides time steps r jlg 2 (n) + |lg(n) multiplications = additions = 25 n3 12 5 n 1 ¥8 n + 24 n " 6 n + 3 processors = 21 3 5 2 1 T28 n + T6 n + 8 n A. 1.6 R, Remote Term only time steps = ^■lg 2 (n) ♦ |lg(n) multiplications = additions = 1 n 3 5 2 1 2 F4 n + T2 n - I" " 2T 67 processors s 13 7 2 1 T28 n + 64 n + 8 n A. 2 Banded Recurrences with unlimited processors The algorithm for solving banded recurrences with an unlimited number of processors available is given in [CHEN] or [SAMEH]. A bound is given on tne number of processors used in the algorithm. A. 2.1 General Banded Recurrence, R time steps = lg(n) (lg(m)+2) - ^(lg 2 (m)+lg(m)) multiplications = 1/2 » . ,n> 1921 1 1 n 1,3 2, :r(m n+mn)lg( — ) - -r-fm n-rmn-Tn-Tr- + :r(m J +m ) m 42 6" mn -3 n "2T¥ additions = 12 , ,n N 2 m n lg( m" } 192 5 1 1n 1/3 m 2v 42 m n+ 6 mn -3 n -2TS + 2 (m - B } processors = 12 3 •x(m n + mn) - m A. 2. 2 R, Remote Term only 68 time steps = lg(n)(lg(m)+2) - j(lg 2 (m)+lg(m)) multiplications = additions = T |m 2 n--imn-^n - |m 3 +-lm -jji - (m 3 -m 2 )lg(£) processors = 1/2 , 3 — (m n+mn) - m A. 2. 3 Banded Recurrence with Constant Coefficients, R time steps lg(n) (lg(m)+2) - -l(lg 2 (m)+lg(m)) multiplications = 1 , ,n, 12 23 3 1 2 5 1 2 mn 1 « ( m ) * 2 m n " 48° + 24 m '"6 n+ 3 additions = i (n 2 - n ,„ . §f° 3 4!° 2 -M processors 1 1 2 ■xmn + -r-m n A. 2. 4 R, Remote Term only time steps = lg(n)(lg(m)+2) - j(lg 2 (m)+lg(m)) 69 multiplications = n 3 , ,ik 25 3 23 2 5 1 mn + m lg(-) + ^m -p-m -^m+y additions = , 3 2x. ,'iu 25 3 23 2 5 1 mn + (m -m )lg(-) + jj^m -^m -jm+3 processors = 1 3 Tmn + m A. 3 Banded Recurrences with a limited number of processors The algorithms for solving banded recurrences with a limited number of processors are given in [CHEN] or [CHEN 76]. ■ The number of processors, p, is assumed to be in the range 2m <. p <. n . Notice that a simpler algorithm, such as column sweeping, may actually be faster than using this more complicated algorithm. A. 3-1 General Banded Recurrence, R 70 time steps = (2m 2 +3m)^ + (m 2 +^m+1 ) lg(£) - (2m 2 +|m+3) multiplications = O 10 r\ Q Q O (m + 2m)n + j^ +m )P iS^ + 2^ m * - (2m 2 +2m+j)p - (m 3 +m 2 )^ additions = (m +2m-1)n + -r-m p lg(^) + (*nr+^m -m) 1 "3 O n - (2m +2m--r-)p - (m +m -co- processors = p A. 3« 2 R, Remote Term only time steps = (2m 2 +m)- + (m 2 +|m+1)lg(^) - (2m 2 +4-m+2) p d m 2 multiplications = (m 2 +m)n - m 3 lg(|) - (\vr> -\m 2 ) + (|m 2 -2m-^)p - m 3 - m 2 2 2 2 p additions = (m 2 + m)n - (m 3 -m 2 )lg(£) - (I m 3 -|m 2 ) + (|m 2 -3m-j)p - -ra 3 processors = p A. 3-3 Banded Recurrence with Constant Coefficients, R 71 time steps = 4m-^- + (m 2 4m + 1)lg(-^) - (m 2 + fm+3) + (2m-1)[-m 2 ] p-m 2 m 2 p multiplications = i 3 1 2 1 x. ,p-nu ,3.3 , 2 1 , „ n mn + (m -— m +— mpHgC* — ) - (-^-m -2m +-r-m) - 2mp + mp 22 m 2 2 p-m additions = mn + (m 3 -|m 2 +4-mp)lg(- E r^) + (4m 3 +m 2 +^-m) - (2m+1)p + mp~ 2 2 m p-m processors s p A. 3. 4 R, Remote Term only time steps = 2m p?m + ( m3+n » 2 +mp+p)-l i g (2^) + (2m-1)[-im 2 ] multiplications - ,3121 » . , p-m N , 1 3 5 2 x ,3 \\ "' (m J --m +2-mp)lg( Ji — -) - {^ -^m ) - (-*m+ ? )p + mp— m 2 2 ' x 2 2 p-m additions = (m 3 -|m 2 4mp)lg(-^) - 4m 3 -|m 2 ) - <|«4>P + m P^ 22 m 22 22 p-m processors = p 72 APPENDIX B Compiler User's Guide The Parafrase FORTRAN Compiler may be run on the IBM 360/75 at the University of Illinois at Urbana. In addition to the job control cards necessary to execute the compiler, the user may set any compiler options he wishes. The user must also include the FORTRAN program to be compiled. The available options are described in Appendix C. A typical job to use the compiler will look like this: //ANALYZE JOB /•ID PS=1234,NAME='J0E SCHMOE' /•ID CODE=PUBLIC /•ID REGION=2 50K,TIME=2,IOREQ=4000 //PROCLIB DD DSNAME=USER.P6543.MACUOI,DISP=SHR // EXEC COMPILE //OPTIONS DD • specify any compiler options here //SYSIN DD * JNINCR INCREMENT AN ARGUMENT SUBROUTINE INCR (A) A = A+1 . RETURN END JNSUM FORM A CHAIN SUM FUNCTION SUM(A,N) INTEGER A(N) ,X X = DO 1 1=1 ,N 1 X=X+A(I) SUM = X RETURN END *DATA N 10 JNINNER FORM AN INNER PRODUCT 73 JDATA N 100 1NRAN 20 SUBROUTINE INNER (A,B,C,N) REAL A(N) ,B(N) ,C C = DO 7 1=1 ,N C=C+A(I)»B(I) RETURN END DOM A RANDOM PROGRAM SUBROUTINE FIND (A,B,C,N) INTEGER I, J REAL A(N) ,B(N) ,C(N) A(1)=1 DO 20 1=1 ,N A(2) = 2 DO 20 J=1 ,N A(3) = 3 C(I) = C(I) « A(3) CONTINUE JDATA N 30 URRA SIZE = BLOCK A 92 C 103 JNnam JDATA varia varia URRA SIZEr BLOCK array DO 30 1=1 ,N B(I) = B(I) * A(I) CONTINUE RETURN END Y 2 = 1 2 1 X&01 X&02 2 1 2 X&01 X&02 e title source program END value value ble ble Y number of =do block lhs-use arrays to expand to work with n urn -dimensions nestl nest2 X&nn X&nn BL0CK=do block JLOOP indexvariable indexvariable /» 74 The control cards should appear as indicated. More REGION, TIME, or IOREQ may be needed for more or larger programs. Compiler options may be chosen from Appendix C. The format of the SYSIN file follows. 1 . %U card Each program in the input stream must be preceded by a %H card. Immediately following the N is a program name, up to 8 characters long. A title may follow the name, separated by a space, up to 65 characters long. The name and title are used for later identification. 2. source program The actual FORTRAN source follows the %H card. The last card should be an END card. JDATA card Optionally, data cards may be inserted. These are used to set the values of integer scalars. Usually this is used to set DO loop upper bounds. This information is used in the calculation of speedup. To use this feature, include a JDATA card, and follow it with an arbitrary number of cards which 75 have an integer scalar name in column 1 , and its value following the name, separated by a blank. 4. *ARRAY If the user wishes to do array expansion, then he must include a JARRAY card. The cards following it are input to the array expansion program. A 'SIZE=n' card tells the program the maximum number of arrays which will be expanded. The default is 'SIZEslO'. The 'BLOCK=n' card tells the program in which DO-loop block to expand the following arrays. The default is 'BL0CK=1'. Several 'BLOCKm' cards may be included, to expand arrays in several different DO loops. Each array to be expanded requires another input card. The name of the array to be expanded comes first on the card. Following the name is the "program pointer" pointing to the statement which is the first left-hand-side use of the array in that DO loop. Then follows an integer, d, giving the new number of dimensions of the array, the number of dimensions desired for the array. Following that is a list of d numbers. Each of these associates a do nest level with a dimension of the new array. Nest level 1 is the outer DO loop, nest level 2 is the next inner level, nest level is outside the DO loops 76 (constant level), and so on. Expressions in the corresponding subscript position will involve only DO loop indices at that DO nest level. Finally is a list of the index variables for the DO loops in which the array is to be expanded. Note that these are index variables after DO loop normalization, and so are of the form 'X&nn' , where 'nn' is the DO loop number. 5. JLOOP Optionally, the user may wish to execute some DO loops serially, rather than distribute them. If so, include a $LOOP card. Follow it with an arbitrary number of cards which have a DO loop index variable name in column one. Notice that this is done after DO loop normalization, so each DO loop has a unique index variable name. The DO loop with that index variable will not be distributed . As many programs may be compiled in one job as desired, as long as the TIME and IOREQ available are sufficient. The output consists of listings of the program after the transformations, and optionally a disk file containing the generated code, for later simulation. 77 APPENDIX C Compiler Options Compiler options may be set by inserting cards after the //OPTIONS DD * card. A binary switch is set ON by a card : SWITCH=' 1 'B A binary switch is reset OFF by a card: SWITCH= f 'B A numeric option is given a value by a card: 0PTI0N=1 OPTION=77 The OPTIONS file is read with a PL/I GET DATA statement. C.1 FLAGs A FLAG is a binary switch used to enable or disable certain passes of the compiler. A FLAG is set by a card: FLAG.SWITCHr' 1 'B 78 1. FLAG.CLEAN_IF Enables IF pattern matching. 2. FLAG.CLEAN_SUBSCRIPT Enables subscript cleaning, which simplifies subscripts . 3. FLAG.CONSOLIDATE_COMMON Enables a small program which cleans the compiler data structures containing COMMON variables. 4. FLAG.DISTRIBUTE_LOOP Enables a program which will physically distribute the loops around the PI partitions. 5. FLAG.EXPAND_ARRAY Enables array expansion. 6. FLAG.EXPAND_SCALAR Enables scalar expansion. 7 . FLAG. EXPAND_STATEMENT_FUNCTIONS Enables a program which expands statement function uses to the expressions defined by the statement function . 79 8. FLAG.EXPAND_SUBROUTINES Enables a program which expands in line external subroutines called in the program. 9. FLAG.FORWARD_SUBSTITUTE_SUBSCRIPT Enables scalar expression forward substitution into subscripts. FLAG . RENAME_SCALAR must also be set. 10. FLAG.FORWARD_SUBSTITUTE__IF Enables scalar expression forward substitution into IF conditions. FLAG . FORWARD_SUBSTITUTE_SUBSCRlPT must also be set . 11. FLAG.GENERATE_CODE Enables code generation. FLAG. SEGMENT must also be set . 12. FLAG.GRAPHICS_PARTITION When set, the program will be Pi-partitioned , before segmentation, and a file will be created with this information to be used with a graphing program . 13. FLAG.HASP_SYSTEM_LOG When set, a line will be written to the HASP system log, which appears on the first burst page of the 80 output, for each program compiled. 14. FLAG.IFTREE Enables IF tree creation. 15. FLAG.INDUCTION_SUBSTITUTION Enables induction variable substitution. 16. FLAG.INSERT_DATA_CARD When set, cards after the JDATA card are used to initialize scalar integer variables. When reset, the $DATA card is ignored. 17. FLAG. LEXICON Enables lexical scanning of the program. 18. FLAG.LINEARIZE_ARRAY Enables a program multi-dimensional arrays. which linearizes 19. FLAG.NORMALIZE_DO Enables DO loop normalization. 20. FLAG. PARALLEL Master switch to enable IF removal, IF tree creation, tr iad iza tion , segmentation, and code 81 generation . 21. FLAG.REMOVE_A_IF Enables removal of Towle's type A IFs. 22. FLAG.REMOVE_B_PREFIX_IF Enables removal of Towle's type B-prefix IFs. 23- FLAG.REMOVE_CALL When set, CALL statements will be changed into CONTINUE statements as the program is lexically scanned. This may be useful since data dependence around CALL statements is unknown. 24. FLAG.REMOVE_IO When set, input and output statements will be changed into CONTINUE statements as the program is lexically scanned. This may be useful, since input/output is not especially interesting from the viewpoint of speeding things up. 25. FLAG.RENAME_SCALAR Enables scalar renaming. 26. FLAG. SEGMENT Enables program segmentation, which divides the 82 program into segments of code. A segment of code is defined as block of statements with only one control entry point, and one control exit point. 27. FLAG.SERIALIZE_LOOPS When set, the user may specify any DO loops he does not want distributed by using the JLOOP card. 28. FLAG. STANDARD A master switch controlling DO loop normalization, scalar renaming, scalar forward substitution, induction variable substitution, scalar expansion, array expansion, array linearization, subscript standardization, and subscript cleaning. 29- FLAG.STANDARDIZE_SUBSCRIPT Enables subscript standardization, which transforms subscripts into parenthesis-free expressions, when possible . 30. FLAG. STATISTICS Enables the statistics collection. When set, the global statistics collected will be saved for later processing. FLAG. BOUNDS must also be set. 31. FLAG. TRIAD 83 Enables triadization . Tr iadization reduces assignment statements to three address code, with a result, and two operands. This is in preparation for code generation. C.2 PRINT switches A PRINT switch is used to enable or disable the printing of the program after each pass. The following PRINT switches are available: 1. PRINT. AFTER_IF_REMOVAL Enables the printing of the program after removing type A and type B-prefix IFs. 2. PRINT. CLEAN_IF Enables the printing of the program after IF pattern matching. 3. PRINT. CLEAN_SUBSCRIPT Enables the printing of the program after subscript cleaning . 4. PRINT. CODE 84 Enables the printing of the code generated by the compiler . 5. PRINT. DISTRIBUTE_LOOP Enables the printing of the program after distributing loops. 6. PRINT. DURING_IF_REMOVAL Enables the printing of the program after each pass of IF removal. Notice that each pass will remove at most one IF from each DO loop. 7. PRINT. EXPAND_ARRAY Enables the printing of the program after array expansion . 8. PRINT. EXPAND_SCALAR Enables the printing of the program after scalar expansion . 9. PRINT. FORWARD_SUBSTITUTE Enables the printing of the program after scalar forward substitution. 10. PRINT. GENERATE_CODE Enables the printing of each segment of the program 85 as code is generated for that segment. 11. PRINT. GRAPHICS_PARTITION Enables the printing of the program after Pi-partitioning for the graphing program. 12. PRINT. IFTREE Enables the printing of the program after creation of IF trees. 13- PRINT. INDUCTION_SUBSTITUTE Enables the printing of the program after induction variable substitution. 14. PRINT. LEXICON Enables the printing of the program just after it has been lexically scanned, which will show all the cosmetic changes made to the program. 15. PRINT. LINEARIZE_ARRAY Enables the printing of the program after array linearization . 16. PRINT. NORMALIZE_DO Enables the printing of the program after DO loop normalization . 86 17. PRINT. PARALLEL_VERSION Enables the printing of the entire parallel version of the program, after all transformations are done. 18. PRINT. RENAME_SCALAR Enables the printing of the program after scalar renaming . 19. PRINT. SEGMENT Enables the printing of the program after segmentation . 20. PRINT. SERIAL_VERSION Enables the printing of the program before any transformations have been done, but after subroutines and statement functions have been expanded. This is the version of the program being compiled . 21. PRINT. SH0RT_C0DE Enables the printing of a short one line description of each element of code generated for the program, rather than a complete description. 22. PRINT. SOURCE Enables the printing of the original source 87 program . 23- PRINT. STANDARDIZE_SUBSCRIPT Enables the printing of the program after subscript standardization. 24. PRINT. TRIAD Enables the printing of the program after tr iadization . C.3 OPTIONS Other compiler options are kept in the structure OPTION. These are switches and numeric values, used in the compilation process. 1. OPTION. CHECK_DO_BOUND When set, DO loop limits will all be compared to the limit stored in OPTION . DO_BOUND , and will be reduced to this value, if greater. 2. OPTION. COUNT_STORE_OPERATION When set, a store to a variable will be counted as an operation, similar to a multiply or add 88 operation. This is used in the speedup calculations . 3. OPTION.COUNT_STORE_TEMPORARY_OPERATION When set, a store to a compiler temporary will be counted as an operation as above. 4. OPTION. DEFINE_SCALAR When set, any undefined scalar found inside of subscripts will be assumed to have a default value, which is in OPTION . SCALAR_VALUE . 5. OPTION. DO_BOUND This is the default DO loop limit, used for speedup calculations, whenever the actual limit cannot be discovered . 6. OPTION. IFTREE_THRESHOLD This is the number of assignment statements per IF statement allowed when creating IF trees. OPTION. ISOLATE_CALL When set, CALL statements are isolated in the data dependence graph. That is, CALLs are not considered in the computation of data dependence. 89 8. OPTION. SCALAR_VALUE This is the default value for scalars found inside of subscripts. 9. OPTION. SERIALIZE_CALL When set, any DO loop containing a CALL statement is automatically serialized, not distributed. 10. OPTION. SER I ALIZE_FULL_RECURRENCE When set, a full recurrence is executed serially, rather than generating code to solve the recurrence the fastest known way. 11. OPTION. SET_D0_B0UND When set, all DO loops upper bounds are set to the value in OPTION . D0_B0UND , for the purposes of speedup calculation. 12. OPTION. SPECIAL^LISTING When set, a special summary listing of the IFs and recurrences for the programs compiled is produced for easy tabulation. C.4 DEBUG switches In addition, there are DEBUG switches for each program in 90 the compiler. Generally speaking, there is one switch per program, which can be set by a card: DEBUG. programs ■ 1 'B Most users should not need to use DEBUG switches. BLIOGRAPHIC DATA EET Title and Subtitle 1. Report No. UIUCDCS-R-78-929 TECHNIQUES FOR IMPROVING THE INHERENT PARALLELISM IN PROGRAMS Author(s) Michael Joseph Wolfe 3. Recipient's Accession No. 5. Report Date July, 1978 8. Performing Organization Rept. No. UIUCDCS-R-78-929 Performing Organization Name and Address University of Illinois at Urb ana- Champaign Department of Computer Science Urbana, Illinois 61801 10. Project/Task/Work Unit No. 11. Contract /Grant No. US NSF MCS73-07980 Sponsoring Organization Name and Address National Science Foundation Washington, D. C. Supplementary Notes 13. Type of Report & Period Covered Master's Thesis 14. Abstracts Much work has been done recently in the area of improving the performance of a serial program on a parallel machine. This document describes some of this work as has been implemented in the FORTRAN ANALYZER as of summer, 1977. Key Words and Document Analysis. 17o. Descriptors DO- loops Induction variables Recurrences Scalar expansion Vectorizing compilers Vector operations Identifiers /Open-Ended Terms • COSATI Field/Group Availability Statement Release Unlimited 19. Security Class (This Report) UNCLASSIFIED cuntv Class (Thi 20. Security Class (This Page UNCLASSIFIED 21. No. of Pages 95 22. Price IM NTIS-38 (10-70) USCOMM-DC 40329-P7 1