CENTRAL CIRCULATION BOOKSTACKS 
 
 The person charging this material is re- 
 sponsible for its renewal or its return to 
 the library from which it was borrowed 
 on or before the Latest Date stamped 
 below. You may be charged a minimum 
 fee of $75.00 for each lost book. 
 
 Theft, mutilation, and underlining of books are reasons 
 for disciplinary action and may result In dismissal from 
 the University. 
 
 TO RENEW CALL TELEPHONE CENTER, 333-8400 
 UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN 
 
 DEC 1 4 1998 
 
 JAN l n 20f 
 MAY 01 MJST 
 
 When renewing by phone, write new due date below 
 previous due date. L162 
 
Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/techniquesforimp929wolf 
 
, YO, S{ 
 
 * y Report No. UIUCDCS-R-78-929 
 
 z- - 
 
 UILU-ENG 78 1722 
 
 TECHNIQUES FOR IMPROVING THE 
 INHERENT PARALLELISM IN PROGRAMS 
 
 6-*d 
 
 [••I 
 
 - by - 
 
 Michael Joseph Wolfe 
 
 July 1978 
 
 NSF-OCA-MCS73-07980-000034 
 
 DEPARTMENT OF COMPUTER SCIENCE 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 
 
Report No. UIUCDCS-R-78-929 
 
 TECHNIQUES FOR IMPROVING THE 
 INHERENT PARALLELISM IN PROGRAMS 
 
 by 
 
 Michael Joseph Wolfe 
 
 July 1978 
 
 Department of Computer Science 
 University of Illinois at Urbana-Champaign 
 Urbana, Illinois 61801 
 
 * 
 
 This work was supported in part by the National Science Foundation under 
 
 Grant No. US NSF MCS73-07980 and was submitted in partial fulfillment of 
 the requirements for the degree of Master of Science in Computer Science, 
 July 1978. 
 
7 -?3f 
 
 iii 
 
 ACKNOWLEDGEMENT 
 
 I would like to thank all those whose help and advice 
 have made the pursuit of this project an interesting and 
 fruitful challenge. Special thanks go to David Kuck, Ahmed 
 Sameh, Ross Towle, Bruce Leasure, Utpal Banerjee and Robert 
 Kuhn for bearing with me as work progressed, or regressed, 
 at various points in time. 
 
 This project used computers operated by the Computing 
 Services Office at the University of Illinois in Urbana. 
 
iv 
 
 TABLE OF CONTENTS 
 
 Page 
 1 . Introduction 1 
 
 2. Recognizing Vector Operations 3 
 
 2.1. DO Loop Distribution 3 
 
 2.2. Data Dependence 5 
 
 2.3. Method of DO Loop Distribution 10 
 
 3. Classifying Recurrences 14 
 
 3.1. Types of Recurrences 14 
 
 3.2. Splitting Recurrences 17 
 
 4. IF statements 18 
 
 4.1. IF trees 18 
 
 4.2. IFs inside DO loops 19 
 
 5. Other Manipulations on DO loops 22 
 
 5.1. Simple Case, One Loop 23 
 
 5.2. More Interesting Case, Two Loops 24 
 
 5-3- Triple Loop Case 27 
 
 5.4. General Loop Nesting 31 
 
 5.5. Future Work 32 
 
 6. Induction Variables 33 
 
 6.1. Conditions for Induction Variables 34 
 
 6.2. Substitution for Induction Variables 37 
 
 6.3- Usefulness 40 
 
 7. Subscript Addition 41 
 
 7.1. Scalar Expansion 43 
 
V 
 
 7.2. Complications with IF statements 46 
 
 7.3- Method of Scalar Expansion 48 
 
 7.4. Array Expansion 52 
 
 The PARAFRASE Compiler 56 
 
 List of References 62 
 
 Appendices 
 
 A. Recurrence Formulae 63 
 
 B. Compiler User's Guide 72 
 
 C. Compiler Options 77 
 
CHAPTER 1 
 
 Introduction 
 
 Recently, there has been a great deal of work done to 
 try to make computer programs execute faster. One promising 
 approach is to utilize special hardware. Several new 
 machines have a parallel or pipelined machine architecture. 
 These machines operate efficiently when performing the same 
 operations on a vector of data. This approach is taken 
 because it is easier to optimize a sequence of similar 
 operations than it is to optimize a sequence of very 
 different operations. 
 
 Programs written today are designed for ordinary serial 
 machines. It would be useful to be able to compile these 
 programs for a vector machine, without having to rewrite the 
 program. The PARAFRASE project at the University of 
 Illinois has been working on a compiler to do just that. 
 This compiler accepts ordinary programs in a serial language 
 (FORTRAN). It then recognizes and isolates vector-type 
 operations. Of primary concern to the compiler is that 
 results are preserved. The output of the compiler is a 
 transformed program reflecting the changes, and 
 pseudo-machine code for an idealized vector machine. 
 
This paper will try to introduce the theory used in 
 compiling programs for vector machines. The examples will 
 be in pseudo-FORTRAN . Chapter 2 discusses vector 
 operations, and finding them in programs. Chapter 3 defines 
 types of recurrences. Chapter 4 mentions IFs and 
 conditional statements, and what to do about them. Chapter 
 5 talks about useful methods to make DO loops more efficient 
 under certain circumstances. Chapters 6 and 7 introduce 
 induction variable substitution and scalar expansion; these 
 are new ways of enhancing the detection of vector 
 operations. Chapter 8 describes the PARAFRASE FORTRAN 
 compiler. The appendices include a list of the time bounds 
 to solve different kinds of recurrences. Also a user's 
 guide for the compiler and a list of the switches and 
 options available are given. 
 
CHAPTER 2 
 
 Recognizing Vector Operations 
 
 To efficiently utilize the machine architecture of a 
 vector machine, a compiler should find vector operations in 
 programs. Vector operations can be found in DO loops, where 
 the same operations are performed on streams or vectors of 
 data . 
 
 2.1 DO Loop Distribution 
 
 Suppose a program is composed of DO loops and 
 assignment statements. 
 
 + DO 1=1 ,UI 
 
 <statement 1> 
 +-- DO J=1 ,UJ 
 I <statement 2> 
 ! <statement 3> 
 +-- CONTINUE 
 
 <statement 4> 
 + CONTINUE 
 
 Example 1 - Sample Program 
 A method to transform this general DO loop structure into a 
 series of vector operations is to execute each statement 
 separately for the entire DO loop index set. This is 
 equivalent to distributing the DO loops over each statement. 
 DO loop distribution is described in [MURAOKA]. 
 
+ DO 1=1 ,01 
 
 | <3tatement 1> 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 -- DO J=1 ,UJ 
 
 <statement 2> 
 
 CONTINUE 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 +-- DO J=1 ,UJ 
 ! <statement 3> 
 +-- CONTINUE 
 CONTINUE 
 
 + DO 1=1 ,UI 
 
 I <statement 4> 
 + CONTINUE 
 
 Example 1 - Sample Program, Distributed 
 Distributing DO loops may not always yield the correct 
 results . 
 
 S1 
 S2 
 
 original program 
 
 + DO 1=1 ,UI 
 
 ! A(I+1)=B(I)+5 
 
 ! B(I+1 )=C(I)»2 
 
 + CONTINUE 
 
 distributed program 
 ( incorrect) 
 
 ♦ DO 1=1 ,01 
 
 S1 
 
 i • ! 
 
 A(I+1 )=B(I)+5 
 CONTINUE 
 
 + DO 1=1 ,01 
 
 S2 ' : ! B(I+1 )aC(I)*2 
 + CONTINUE 
 
 Example 2 - Incorrect Distribution 
 
 Statement S1 in the original program always reads the value 
 
 of B computed in statement S2 during the previous iteration 
 
 of the I loop. In the distributed program, S1' always reads 
 
 a value of B computed somewhere outside the loop. For this 
 
 example, the correctly distributed program involves just a 
 
 little statement reordering. 
 
+ DO 1=1 ,UI 
 
 S2 
 
 B(I+1)=C(I)»2 
 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 S1':| A(I+1 )=B(I)+5 
 + CONTINUE 
 
 Example 2 - Correct Distribution 
 
 Not all loop distribution problems can be solved 
 
 statement reordering. 
 
 by 
 
 original program 
 
 + DO 1=1 ,UI 
 
 31: ! A(I+1 )=B(I)+5 
 
 S2: i B(I+1 )=A(I+1 )«2 
 
 + CONTINUE 
 
 distributed program 
 ( incorrect ) 
 
 + DO 1=1 ,UI 
 
 S1 ' : ! 
 
 A(I+1 )=B(I)+5 
 CONTINUE 
 
 + DO 1=1 ,UI 
 
 S2' : ! B(I+1 )=A(I+1 )«2 
 + CONTINUE 
 
 Example 3 - Undistributable Loop 
 
 In this example, the distributed program is wrong. 
 
 Statement S1 ' will always read "old" values of B. 
 
 Reordering the statements does not solve the problem. 
 
 Putting S2 ' first would make S2 ' always read old values of 
 
 A, whereas it should read values computed in the loop. The 
 
 loop in this program cannot be distributed. 
 
 2.2 Data Dependence 
 
 To decide when distributing DO loops is valid, one must 
 look at the data flow of the program. For each statement, 
 one must ask "where do the values used here come from," and 
 
"where does the value computed here get used." These 
 questions do not always have simple or unique answers. A 
 value computed in one statement may be used in many places. 
 A value read in one statement may come from one of several 
 places. This is particularly true when IF statements or 
 other conditionals are present in the program. IFs will be 
 discussed later, and so will be ignored for now. 
 
 The analysis of data flow in a program is called the 
 study of data dependence. A more complete description of 
 data dependence can be found in [TOWLE]. Briefly, a 
 statement Sq is data dependent on statement Sp if Sq reads a 
 value that is computed in statement Sp. This can occur when 
 the left hand side variable in Sp appears on the right hand 
 side of Sq . In EX. 2, statement S1 is data dependent on 
 statement S2 , since S1 reads the value of B computed in 
 statement S2 . In EX. 3, S1 and S2 are data dependent on 
 each other. Towle also defines two other types of data 
 dependence, but for the moment we ignore them. 
 
 The first requirement for a statement Sq to be data 
 dependent on Sp is that there must exist a control path from 
 Sp to Sq . Second, the variable being computed in Sp , the 
 LHS variable of Sp , must be read in Sq . If this variable is 
 a scalar, the test is satisfied, and Sq is data dependent on 
 Sp . If this variable is an array, the value of its 
 
subscript in Sp must be equal to the value of its . subscript 
 in Sq . When this condition is satisfied, then Sq is data 
 dependent on Sp . 
 
 Equality of subscript expressions is not so easy to 
 check when the statements are in DO loops, and the 
 expressions change value with each iteration. This happens 
 whenever the subscript expressions involve the DO loop index 
 variable. In a DO loop, a particular statement is executed 
 many times. Let Sp[i] be the instance of statement Sp 
 during the iteration of the DO loop when I=i, where I is the 
 DO loop index variable. For multiply nested loops, 
 Sp[ i1 , i2 , . . . , in] is the instance of Sp when I1 = i1, I2 = i2, 
 ..., In=in. Given a DO loop, we can "unroll" it, listing 
 each statement for each iteration of the loop. This removes 
 the loop, and only a serial program remains. Each statement 
 can now be checked with following statements for any data 
 dependence. If any Sq[i'] is data dependent on any Sp[i], 
 where i*>=i, then in the original program with the loop, Sq 
 is data dependent on Sp . Likewise for Sp[i'] being data 
 dependent on Sq[i], where i'>i. 
 
+ DO 1=1 ,UI 
 
 Sp : ! <statement p> 
 
 Sq : I (statement q> 
 
 + CONTINUE 
 
 becomes 
 
 Sp[1]: (statement p> 
 
 Sq[1]: (statement q> 
 
 Sp[2]: (statement p> 
 
 Sq[2]: (statement q> 
 
 Sp[9]: (statement p> 
 Sq[9]: (statement q> 
 
 Sp[10]: 
 Sq[10]: 
 
 (statement p> 
 (statement q> 
 
 Example 4 - Unrolling a DO Loop 
 
 If Sq[i'] is data dependent on Sp[i], for any i' and i 
 such that i'>i, then we say that Sq is data dependent on Sp 
 across the I loop . This data dependence crosses the DO loop 
 boundary. For instance, if Sq[9] is data dependent on 
 Sp[8], then Sq is data dependent on Sp across the I loop 
 boundary. If Sq[i] is data dependent on Sp[i], for any i, 
 then we say that Sq is data dependent on Sp within the i 
 loop . This is not mutually exclusive with data dependence 
 across the loop. Here S2[2] is data dependent on S1[2] and 
 S1[1]. So S2 is data dependent on S1 both within and across 
 the loop boundary. 
 
+ DO 1=1 ,UI 
 
 S1 : ! A(I+1 )= . . . 
 
 S2: ! . . . = A(I+1 )+A(I) 
 
 + CONTINUE 
 
 Example 5 - Data Dependence Within and Across Loop 
 
 Likewise, if Sp[i*] is data dependent on Sq[i], for any 
 i' and i such that i'>i, then we say that Sp is data 
 dependent on Sq across the I loop. Notice that Sp cannot be 
 data dependent on Sq within the I loop boundary, since there 
 is no control path from Sq[i] to Sp[i], for any i. 
 
 Unrolling the DO loop and doing an exhaustive data 
 dependence test is time consuming and generally unnecessary. 
 A method has been described in [BANERJEE] for computing data 
 dependence, just by studying the subscript expressions as 
 polynomials in the DO loop index variables. For some forms 
 of simple subscripts, a necessary and sufficient test for 
 data dependence can be applied. Although in general the 
 method is not exact, it is conservative, so that whenever a 
 data dependence exists, the test recognizes it. In 
 practice, this test is quite satisfactory. In only a small 
 percentage of cases is the test fooled by not recognizing a 
 non-dependence situation. 
 
 Some forms of subscripts cannot be handled by any test. 
 An example is a subscripted subscript, when an array is used 
 
10 
 
 in a subscript expression. This is often used for array 
 permutations. Whenever an array has a subscript expression 
 which cannot be tested, data dependence must be assumed, to 
 be conservative. In the PARAFRASE compiler, only subscripts 
 which are linear functions of the index variables are 
 tested. A linear function looks like AO+A 1 *I 1+A2*I2+ . . . , 
 where 11, 12, ... are the index variables. Banerjee's 
 general test handles nonlinear functions of the index 
 variables, but they are rare enough that little harm is done 
 by assuming data dependence when they occur. 
 
 2.3 Method of DO Loop Distribution 
 
 The first step in distributing DO loops is to form a 
 data dependence graph of the program. The second step is to 
 find all the cycles in the data dependence graph. Suppose 
 we denote "Sq is data dependent on Sp" by Sp->Sq. If 
 Sp->Sp, then Sp forms a cycle by itself. If Sp->Sq->Sp, 
 then Sp and Sq form a cycle. Likewise, if 
 Sp->Sq 1 ->Sq 2 -> . . .->Sq n ->Sp , then Sp , Sq , ..., Sq form a 
 cycle . 
 
 Once the cycles have been found, the third step of DO 
 loop distribution is to partition the program into 
 Pi-parti tions . Any assignment statement that is not in any 
 data dependence cycle forms a Pi-partition by itself. Any 
 
1 1 
 
 assignment statement that is in a data dependence . cycle is 
 in a Pi-partition with all the statements in that cycle. 
 Each Pi-partition corresponds to some sort of a parallel 
 operation. 
 
 Finally, DO loops can be distributed over each 
 Pi-partition . This is the same as distributing DO loops 
 over statements, except that the loops are not distributed 
 over statements in the same Pi-partition . 
 
 S1 : 
 
 S2: 
 S3: 
 
 S4: 
 S5: 
 
 original program 
 
 + DO 1=1 ,UI 
 
 A(I)=C(1 ,1-1) 
 +__ DO J=1 ,UJ 
 
 C( J,I)=B( J-1 ) 
 B(J)=A(I) 
 CONTINUE 
 . . .=B(UJ) 
 . . .=C(UJ,I) 
 CONTINUE 
 
 i 
 i 
 i 
 i 
 
 + -- 
 
 S1 • : 
 
 S2' : 
 S3' : 
 
 distributed program 
 
 + DO 1=1 ,UI 
 
 A(I)=C(1 ,1-1) 
 +-- DO J=1 ,UJ 
 ! C(J,I)=B(J-1) 
 ! B(J)=A(I) 
 +-- CONTINUE 
 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 S4':| ...=B(UJ) 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 S5':l ...=C(UJ,I) 
 + CONTINUE 
 
 Example 6 - Distribution over Pl-Partit ions 
 
 After the loops have been distributed over each 
 Pi-partition , a partial ordering relation between the 
 Pi-partitions must be found. As mentioned before, the 
 original ordering of the statements may no longer be valid 
 after DO loop distribution. The partial ordering relation 
 
12 
 
 between Pi-partitions can be used to generate a valid 
 statement ordering after DO loop distribution. 
 
 The partial ordering relation between Pi-partitions is 
 easily found from the data dependence graph. If any 
 statement in Pi-partition P2 is data dependent on any 
 statement in Pi-partition P1, then P2 must follow P1. 
 Notice that this relation will indeed be a partial ordering. 
 If it were not, then there would be a cycle of data 
 dependence between statements in different Pi-partitions . 
 However, any two statements in a cyclic dependence are by 
 definition in the same Pi-partition . 
 
 The final step in DO loop distribution is to classify 
 the Pi-partitions . A Pi-partition composed of a single 
 statement which is not in a data dependence cycle is a 
 vector operation. In a vector operation, all the input data 
 can be fetched at once, all the operations can be performed 
 at once, and all the results can be stored at once. 
 
 If a Pi-partition is composed of statements which are 
 involved in a data dependence cycle, then the Pi-partition 
 is some sort of a recurrence. Recurrences can be further 
 subdivided into linear and nonlinear recurrences. Linear 
 recurrences can be solved using special algorithms which are 
 fast on a parallel machine, described in [CHEN]. Some 
 
13 
 
 nonlinear recurrences can be linearized and solved using 
 algorithms similar to Chen's. Other recurrences may have to 
 be executed serially. 
 
 One non-obvious benefit of DO loop distribution is that 
 each Pi-partition is a Single-Instruction-Mul tiple-Execut ion 
 (SIME) block of code. A vector operation can be executed 
 one operation at a time for the entire vector. The 
 algorithms described in Chen's thesis for solving linear 
 recurrences are SIME. Even serial operations can be 
 considered a limiting case of SIME code, since only one 
 operation is being performed at a time. 
 
14 
 
 CHAPTER 3 
 
 Classifying Recurrences 
 
 After DO loop distribution, each Pi-partition is 
 classified as either a vector operation or a recurrence. 
 Recurrences are broken down into several types and classes. 
 Recurrences which can be identified as of a simpler type can 
 be solved using faster or more efficient versions of the 
 basic algorithm. 
 
 3.1 Types of Recurrences 
 
 The first division of recurrences is between linear and 
 nonlinear recurrences. A linear recurrence is a recurrence 
 where each new computation is a linear function of previous 
 computations. A nonlinear recurrence is a recurrence where 
 each new computation is a nonlinear function of previous 
 computations. A linear recurrence can always be transformed 
 into a standard format, with a recurrence matrix A, an 
 initial value vector c, and a result vector x. 
 
15 
 
 + DO 1=1 ,N 
 
 X(I)=C(I) 
 -- DO J=1 ,1-1 
 
 X(I)=X(I)+A(I, J)«X(J) 
 CONTINUE 
 CONTINUE 
 
 Example 7 - Standard Recurrence 
 The recurrence matrix is always strictly lower triangular. 
 
 If the recurrence matrix is full, then each new 
 computation depends on all of the previous computations. 
 
 This is called a full recurrence . Chen shows that with 
 
 3 12 
 
 N /68 processors, a full recurrence can be solved in -z-lg N + 
 
 3 
 
 ■jlgN time steps, where a time step is a add or a multiply. 
 
 If the recurrence matrix is banded, that is, it has at 
 most M non-zero subdiagonal bands, then each new computation 
 depends only on the M previous computations. 
 
 + DO 1=1 ,N 
 
 X(I)=C(I) 
 
 +-- DO J=I-M,I-1 
 
 I X(I)=X(I)+A(I,J)*X(J) 
 
 +-- CONTINUE 
 + CONTINUE 
 
 Example 8 - Standard Banded Recurrence 
 
 This is called a banded recurrence . In [SAMEH], it is shown 
 
 1 2 
 that with -rM N processors, a banded recurrence can be solved 
 
 in (lgM+2)lgN time steps 
 
16 
 
 Sometimes the recurrence matrix takes a special form, 
 called a Toeplitz form. In this case, A(I,I-b)sa(b) , for 
 all I. That is, each subdiagonal band is constant. A 
 recurrence of this form is called a constant coefficient 
 recurrence . 
 
 + DO 1=1 ,N 
 
 X(I)=C(I) 
 
 + -- DO J=1 ,1-1 
 ! X(I)=X(I)+A(I-J)«X(J) 
 + -- CONTINUE 
 + CONTINUE 
 
 Example 9 - Constant Coefficient Recurrence 
 A constant coefficient recurrence may be either full or 
 banded. The time bounds for solving a constant coefficient 
 recurrence are the same as the time bounds for solving a 
 general recurrence, although fewer processors are needed to 
 achieve this bound. 
 
 Another special type of recurrence arises when only the 
 last element of the result vector is used outside the 
 computation of the recurrence itself. This is called a 
 remote term recurrence . An example is an inner product. 
 
 + DO 1=1 ,N 
 
 ! C(I)=G1(I)*G2(I) 
 + CONTINUE 
 
 + DO 1=1 ,N 
 
 ! X=X+C(I) 
 + CONTINUE 
 
 Example 10 - Inner Product is Remote Term Recurrence 
 
17 
 
 None of the intermediate terms need to be saved.- A remote 
 term recurrence can be banded or full, and can be constant 
 coefficient or not. Again, the time bounds for solving a 
 remote term recurrence are the same as for solving a general 
 recurrence, but fewer processors are needed. Derivations of 
 the time steps, processor bounds, and operation counts for 
 solving these types of recurrences are given in the 
 Appendix . 
 
 3.2 Splitting Recurrences 
 
 Sometimes a single recurrence can be split into several 
 smaller independent recurrences. This is presented in 
 [TOWLE]. In real programs, this often occurs when a DO loop 
 surrounds a recurrence. 
 
 + DO K=V,G 
 
 +-- DO 1=1 ,N 
 
 X(K)=X(K)+C(I,K) 
 CONTINUE 
 CONTINUE 
 
 Example 11 - Recurrence that can be Split 
 Initially, this may look like a recurrence of size G*N. In 
 fact, this is G independent recurrences, each of size N. By 
 reducing the size of the recurrence, the amount of time to 
 solve the system, as well as the number of processors 
 needed, is reduced. 
 
18 
 
 CHAPTER 4 
 
 IF statements 
 
 When compiling programs for special architecture 
 machines, extra care must be taken with IF statements. An 
 IF can change the flow of control of the program, and so 
 figures into the data dependence graph. IF statements in 
 loops can prevent loop distribution. A large number of IFs 
 can adversely affect the amount of parallelism detectable in 
 the program. Special methods are used to handle IFs. 
 
 4.1 IF trees 
 
 When the ratio of IFs to assignment statements is 
 relatively large, then it is reasonable to use the method of 
 IF-trees, described in [DAVIS]. Essentially ,. this method 
 computes the results for all possible paths, then chooses 
 the desired results with one large conditional. The number 
 of operations is increased, but the number of conditionals 
 is decreased . 
 
19 
 
 
 Example 12 - IF tree made from 3 IFs 
 
 4.2 IFs inside DO loops 
 
 When an IF is inside a DO loop, that IF must be 
 executed for every iteration of that DO loop. Sometimes the 
 condition being tested does not change inside the loop. In 
 this case, the branch taken will be the same for every 
 iteration . 
 
20 
 
 + DO 1=1 ,UI 
 
 (statements not changing B> 
 
 IF (B.GT.O) C(I)=C(I)/B 
 
 <more statements not changing B> 
 
 + CONTINUE 
 
 Example 13 - Loop Invariant IF 
 Here, the IF can be removed from the scope of the loop 
 entirely. Two copies of the loop are used, and the 
 condition is tested to choose which loop to execute. 
 
 
 DO 1=1 ,UI 
 <statements> 
 
 < statements> 
 CONTINUE 
 
 + DO 1=1 ,UI 
 
 <statements> 
 C(I)=C(I)/B 
 <statements> 
 -- CONTINUE 
 
 Example 13 - Removing a Loop Invariant IF 
 
 More often, the condition being tested does change 
 between iterations of the DO loop. In this case, the best 
 result that can be achieved is to keep everything a vector 
 operation. When the condition being tested is independent 
 of the results of previous iterations of the loop, the IF 
 can be precomputed. The results of the condition can be 
 saved in a logical vector. This logical vector can be used 
 as a mask in later operations. 
 
21 
 
 + DO 1=1 ,UI 
 
 S1: ! IF (B(I).NE.O) C( I )=C( I ) /B( I) 
 + CONTINUE 
 
 becomes 
 
 + DO 1=1 ,UI 
 
 ! MASK(I)=B(I) .NE.O 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 S1 
 
 i . ! 
 
 C(I)=C(I)/B(I) , masked by MASK(I) 
 CONTINUE 
 
 Example 14 - Precomputed IF 
 Now S1 1 is easily recognizable as a vector operation. On a 
 parallel or vector machine, the concept of a bit vector as a 
 mask for operations is inherent in the structure of the 
 machine . 
 
 Other types of IFs can be handled by more complex or 
 more costly methods. For a more complete discussion of IFs, 
 see [TOWLE]. 
 
22 
 
 CHAPTER 5 
 
 Other Manipulations on DO loops 
 
 It may not always be efficient to compute each PI 
 partition using the fastest known methods. For instance, if 
 
 a program contains a large full recurrence, the fastest 
 
 3 
 method to solve this recurrence uses N /68 processors. 
 
 + DO 1=1 ,N 
 
 +-- DO J=1 ,1-1 
 
 I X(I)=X(I)+A(I,J)*X(J) 
 
 + -- CONTINUE 
 
 + CONTINUE 
 
 Example 15 - Full Recurrence of size N 
 
 1 2 
 Using this many processors, the solution would require ylg N 
 
 time steps. If only N processors are available, using this 
 
 2 2 
 algorithm would require (N /136)lg N. A different method of 
 
 solution would serialize the outer loop, and run only the 
 
 inner loop in parallel. This is equivalent to a program 
 
 with the outer loop "unrolled". This method uses at most N 
 
 processors at one time, and requires at most NlgN time 
 
 steps. While in general this algorithm is slower than the 
 
 fastest method, it is more efficient with a smaller number 
 
 of processors. This chapter describes methods which can be 
 
 used to manipulate the program into a sometimes more 
 
 efficient or more accommodating form. 
 
23 
 
 DO Js1 , 1 
 
 X(2)=X(2)+A(2, J)»X(J) 
 CONTINUE 
 
 DO J=1 ,2 
 X(3)=X(3)+A(3, J)*X(J) 
 CONTINUE 
 
 + DO Js1 , N-1 
 
 ! X(N)=X(N)+A(N, J)*X(J) 
 + CONTINUE 
 
 Example 15 - Same Program with Outer Loop Serialized 
 
 5.1 Simple Case, One Loop 
 
 Consider a simple loop with one statement. In this 
 loop, the program is computing a vector of data. 
 
 + DO 1=1 ,UI 
 
 S: I A(I)=... 
 + CONTINUE 
 
 Example 16 - Simple Loop with One Statement 
 
 Suppose that for some i and j, where j<i, S[i] is data 
 
 dependent on S[j]. This dependence can be considered as a 
 
 condition which must be satisfied to produce correct 
 
 results. One way to satisfy the condition is to execute the 
 
 loop serially. This is the way employed by an ordinary 
 
 serial machine. Another way to satisfy the condition is to 
 
 solve for A using a fast parallel recurrence algorithm. 
 
 Each method has its advantages and disadvantages, but both 
 
 will correctly compute the result. As a notational 
 
24 
 
 convenience, "S[i] is data dependent on S[j], for some j<i" 
 is expressed as "S[i] depends on S[<i]." When S[i] does not 
 depend on S[<i], there is no condition to be satisfied. The 
 vector A may be computed simultaneously, if desired. 
 
 5.2 More Interesting Case, Two Loops 
 
 Consider now two loops containing one statement. Now 
 the program is computing an array of data. 
 
 + DO 1=1 ,UI 
 
 +-- DO J=1 ,UJ 
 I A(I, J)=. . . 
 +-- CONTINUE 
 
 + CONTINUE 
 
 Example 17 - Computing a Plane of Data 
 In a particular iteration of I and J, S[I,J] might be data 
 dependent on 
 
 S[<I,<J] 
 S[<I, J] 
 S[<I, >J] 
 
 S[I,<J] 
 
 If S[i,j] is data dependent on any S[i',j']i then this 
 dependence will fall into one of these four cases, depending 
 on the relationship between i' and i, and j 1 and j. This 
 gives four possible conditions which might need to be 
 satisfied . 
 
 If none of these conditions apply, then S is a vector 
 operation, and all the elements of A can be computed at once 
 
25 
 
 or in any convenient order. If any data dependences exist, 
 those conditions must be satisfied. One way to satisfy all 
 the conditions is to execute all the loops serially. 
 Another way is to use the fast recurrence solving methods. 
 These are not the only possible things which can be done. 
 
 Suppose S[I,J] does not depend on S[I,<J]. This means 
 that S[I,J] may depend on only S[<I,*]. The only possible 
 dependences cross the I loop. These conditions can be 
 satisfied by executing the outer loop, the I loop, serially, 
 and the inner loop can be executed in parallel. This gives 
 UI vector operations, each UJ long. This solution is 
 equivalent to the wavefront method described in [MURAOKA] 
 with a wave angle of . 
 
 + DO 1=1 ,UI 
 
 t--- DO Js1 ,UJ 
 
 i A(I, J)sA(I-1 , J+1 )+A(I-1 , J-1 ) 
 
 ►-- CONTINUE 
 + CONTINUE 
 
 Example 18 - All Data Dependences Cross the I Loop 
 
 Suppose S[I,J] does not depend on S[I,>J]. So S[I,J] 
 may depend only on S[<I,<J], S[<I,J], and S[I,<J]. These 
 conditions can still be satisfied when the two DO loops are 
 interchanged . 
 
26 
 
 + DO 1=1 ,UI 
 
 -- DO J=1 ,UJ 
 
 A(I, J)aA(I-1 ,J) 
 + A(I, J-1 ) 
 CONTINUE 
 -- CONTINUE 
 
 + DO J=1 ,UJ 
 
 + -- DO 1=1 ,UI 
 
 j A(I, J)=A(I-1 , J) 
 
 I +A(I,J-1) 
 
 + -- CONTINUE 
 
 CONTINUE 
 
 Example 19 - DO Loops Can Be Interchanged 
 Interchanging DO loops can be advantageous at times. It can 
 change the bandwidth of a recurrence, or allow a recurrence 
 to be split. The dependence conditions can be satisfied 
 using one of the fast recurrence solvers, in either form of 
 the loop. The wavefront method again applies, this time 
 with a wave angle of 45 • 
 
 Now suppose S[I,J] depends only on S[<I,<J] and 
 S[I,<J]. This is a subset of the above case, so the DO 
 loops can be interchanged. 
 
 + DO 1=1 ,UI 
 
 -- DO J=1 ,UJ 
 
 A(I, J)=A(I, J-1 ) 
 
 + A(I-1 , J-1 ) 
 CONTINUE 
 CONTINUE 
 
 + DO J=1 ,UJ 
 
 +-- DO 1=1 ,UI 
 
 I A(I,J)=A(I,J-1) 
 
 ! +A(I-1,J-1) 
 
 +-- CONTINUE 
 
 CONTINUE 
 
 Example 20 - Interchange Loops, Run J Serially 
 
 Now, however, executing the outer loop serially after 
 
 interchanging will satisfy the dependence conditions. The I 
 
 loop can be executed in parallel. This results in UJ vector 
 
 operations, each UI long. This method is equivalent to the 
 
 wavefront method with an angle of 90 . 
 
27 
 
 Finally, consider the case when S[I,J] depends only on 
 S[<I,J] and S[<I,>J]. Again, all the data dependences cross 
 the I loop. Here however, the results are preserved if the 
 inner loop is executed "backwards", from the upper bound to 
 the lower bound . 
 
 + DO 1=1 ,UI 
 
 +-- DO J=1 ,UJ 
 
 I A(I, J)=A(I-1 , J)+A(I-1 , J+1 ) 
 
 +-- CONTINUE 
 
 + CONTINUE 
 
 Example 21 - Executing J Backwards Preserves Results 
 
 Two interesting things can be done now. The wavefront 
 
 method can be applied, with an angle of 135 . Or, the two 
 
 loops can be interchanged, providing that J is executed 
 backwards. 
 
 + DO J=UJ, 1 ,-1 
 
 +-- DO 1=1 ,UI 
 
 I A(I, J)=A(I-1 , J)+A(I-1 , J+1 ) 
 
 +-- CONTINUE 
 
 + CONTINUE 
 
 Example 21 - With Loops Interchanged 
 
 5-3 Triple Loop Case 
 
 To further complicate matters, consider a single 
 statement inside three loops. Now the program is computing 
 a cube of data. 
 
28 
 
 + DO 1=1 ,UI 
 
 -- DO J=1 ,UJ 
 
 +- DO K=1 ,UK 
 
 i A(I, J,K)= . . . 
 
 +- CONTINUE 
 
 +-- CONTINUE 
 
 + CONTINUE 
 
 Example 22 - Three Loops 
 The statement S[I,J,K] might be data dependent on any of the 
 following . 
 
 S[I,<J,<K] 
 S[I,<J, K] 
 S[I,<J f >K] 
 
 S[I, J,<K] 
 
 S[<I,<J,<K] 
 S[<I,<J, K] 
 S[<I,<J,>K] 
 S[<I, J,<K] 
 S[<I, J, K] 
 S[<I, J,>K] 
 S[<I,>J,<K] 
 S[<I,>J, K] 
 S[<I,>J,>K] 
 
 Serializing the outer loop satisfies all dependences of the 
 
 form "S[I,J,K] depends on S[ <I ,*,*]" . Serializing the outer 
 
 two loops satisfies all dependences of the form "S[I,J,K] 
 
 depends on S[<I,*,*] and S[I,<J,*]". Naturally, serializing 
 
 all three loops satisfies all the dependences. 
 
 The I and J loops can be interchanged when S[I,J,K] 
 does not depend on S[<I,>J,*] or S[I,J,*]. This happens 
 when S[I,J,K] depends only on S[<I,<J,»], S[<I,J,»], and 
 S[I,<J,*]. This is very similar to the two loop case. 
 Since these dependences all cross the I or the J loop, the K 
 loop does not figure into the picture. Dependences with 
 respect to the K loop are preserved. 
 
29 
 
 + DO 1=1 ,UI 
 
 +-- DO J=1 ,UJ 
 
 +- DO K=1 ,UK 
 
 ! A(I,J,K)=A(I-1,J-1,K+1)+A(I,J-1,K-1) 
 
 +- CONTINUE 
 
 +-- CONTINUE 
 
 + CONTINUE 
 
 Example 23 - I and J loops Can Be Interchanged 
 
 The J and K loops can be interchanged when S[I,J,K] 
 does not depend on S[I,<J,>K]. So S[I,J,K] can depend only 
 on S[I,<J,<K], S[I,<J,K], S[I,J,<K], and S[<I,*,*]. Again, 
 this is somewhat similar to the two loop case. Dependences 
 which cross the I loop are not affected by the interchange. 
 
 + DO 1=1 ,UI 
 
 -- DO J=1 ,UJ 
 
 +- DO K=1 , UK 
 
 ! A(I,J,K)=A(I,J,K-1)+A(I-1,J+1,K) 
 
 +- CONTINUE 
 CONTINUE 
 
 + CONTINUE 
 
 Example 24 - J and K loops Can Be Interchanged 
 
 Loops can be interchanged iteratively. After 
 interchanging two DO loops, the dependence conditions must 
 be reordered. Then, two more DO loops may be eligible for 
 interchanging . 
 
30 
 
 + DO 1=1 ,UI 
 
 +-- DO J=1 ,UJ 
 
 +- DO K=1 ,UK 
 
 ! A(I,J,K)=A(I,J,K-1)+A(I-1,J+1,K) 
 
 +- CONTINUE 
 
 +-- CONTINUE 
 
 + CONTINUE 
 
 S[I,J,K] depends on S[I,J,<K] and S[<I,>J,K] 
 interchange J and K 
 
 + DO 1=1 ,UI 
 
 +-- DO K=1 ,UK 
 
 + - DO J=1 ,UJ 
 
 ! A(I,J,K)=A(I,J,K-1)+A(I-1,J+1,K) 
 
 +- CONTINUE 
 
 {+-- CONTINUE 
 
 + CONTINUE 
 
 S[I,K,J] depends on S[I,<K,J] and S[<I,K,>J] 
 interchange I and K 
 
 + DO K=1 ,UK 
 
 + -- DO 1=1 ,UI 
 
 +- DO J=1 ,UJ 
 
 ! A(I, J,K)=A(I, J,K-1 )+A(I-1 , J+1 ,K) 
 
 +- CONTINUE 
 
 +-- CONTINUE 
 
 +„_- CONTINUE 
 
 S[K,I,J] depends on S[<K,I,J] and S[K,<I,>J] 
 
 Example 25 - Multiple Interchanging 
 
 Any number of outer loops can be executed serially to 
 satisfy all data dependences across that loop boundary. If 
 all the dependences are satisfied this way, a vector 
 operation remains. Otherwise, a recurrence must still be 
 solved . 
 
31 
 5.4 General Loop Nesting 
 
 Now consider a statement nested inside d DO loops. The 
 program is computing a d-dimensional cube of data. This may 
 correspond to a Pi-partition after DO loop distribution. 
 
 + DO 11=1 ,UI1 
 
 •-- DO 12=1 ,UI2 
 
 + - 
 i 
 
 DO Id=1 ,UId 
 A(I1 ,12, . . . ,Id) = . . . 
 CONTINUE 
 
 CONTINUE 
 CONTINUE 
 
 Example 26 - d DO Loops 
 Now, S[ 1 1 , 12 , . . . , Id] may depend on 
 
 S[<I1,», ...,*] 
 S[I1,<I2,»,.. .,»] 
 
 S[I1 ,12, . . . ,Id-1 ,<Id] 
 
 Any two loops II and 11+1 can be interchanged if 
 
 S[I1,...,Id] is not data dependent on S[ 1 1 , . . . , <I1 , >I1+ 1 , * ] 
 
 or S[ 1 1 , . . . , II , 11+ 1 , * ] . With the new loop structure, the 
 interchanging process can be repeated. 
 
 Loop II can be serialized if all loops containing loop 
 II have been serialized. Executing loop II serially has the 
 effect of satisfying the dependence of S[I1,...,Id] on any 
 
 statement S[I1 II- 1 , <I1 ,*,...,*] . This is the same as 
 
 satisfying all data dependences that cross the II loop. 
 
32 
 
 If, after loop interchanging and possible serialization 
 of the outer loops, no data dependences are unsatisfied, the 
 statement is a vector operation with respect to the 
 remaining loops. Otherwise, the statement forms a 
 recurrence . 
 
 5.5 Future Work 
 
 This theory can be extended far beyond what has been 
 presented here. The biggest extension will be the addition 
 of multiple statements in the DO loops. This includes 
 several statements at the same DO loop nest level, as well 
 as statements at different nest levels. 
 
 The applicability of the wavefront method to statements 
 in many loops can also be studied. When interchanging 
 loops, the loop boundary of the inner loop may depend on the 
 outer loop index. A change of DO loop upper bound 
 expressions may be necessary. Finally, the addition of the 
 other types of dependence described in [TOWLE] can be 
 considered . 
 
33 
 
 CHAPTER 6 
 
 Induction Variables 
 
 Data dependence testing is easier when array subscript 
 expressions are linear expressions of the DO loop index 
 variables. Many times the value of the expression is always 
 linear in the index variables, but some programming trick 
 prevents this from being obvious. An example of such a 
 trick is to use a scalar temporary to hold a common 
 expression. This scalar temporary is then used in 
 subscripts throughout the loop. 
 
 + DO 1=1 ,UI 
 
 K = I»3 
 
 A(I)=B(K) . . . 
 C(K)=. . . 
 
 + CONTINUE 
 
 Example 27 - Scalar Temporary Used in Subscript 
 This is used to save on redundant operations. However it 
 impairs the calculation of data dependence and the detection 
 of vector operations. This problem is solved by doing 
 expression forward substitution. 
 
 Another programming trick is to increment a scalar 
 inside a DO loop and use that scalar in subscripts. 
 
34 
 
 K = 
 
 + DO 1=1 ,UI 
 
 K = K + 3 
 A(I)=B(K) 
 C(K)=. .. 
 
 + CONTINUE 
 
 Example 28 - An Induction Variable 
 This is done when the programmer wants a subscripting 
 pattern different from that given by the index variable. 
 Note that Ex. 28 is equivalent to Ex. 27, but the operation 
 in the loop of Ex. 28 is an add, while in Ex. 27, it is a 
 multiply. For the purposes of parallelism detection, Ex. 27 
 is preferable, since the expression for K can be forward 
 substituted into the subscripts. In Ex. 28, K is called an 
 induction variable . An induction variable can always be 
 replaced by an expression which is linear in the DO loop 
 index variables. 
 
 6.1 Conditions for Induction Variables 
 
 Several conditions must be satisfied for a scalar to be 
 an induction variable. First, the value of the scalar 
 variable must be known at the beginning of the loop. 
 Usually this is satisfied by an assignment just prior to the 
 loop . 
 
 Second, the scalar must be incremented in each 
 iteration of the loop by some invariant value. Often, the 
 
35 
 
 scalar is incremented by a constant. It may. also be 
 incremented by another scalar, so long as the value of this 
 increment scalar does not change in the scope of the DO 
 loop. It may not be incremented by an indexed array, or any 
 expression which may change values between iterations of the 
 loop . 
 
 valid examples of induction variables 
 
 DO 1=1 ,UI 
 <statements> 
 K = K + 3 
 
 (statements) 
 CONTINUE 
 
 + -- 
 
 - DO 1=1 ,UI 
 
 (statements) 
 K = K + N 
 (statements) 
 
 - CONTINUE 
 
 invalid examples of induction variables 
 
 + DO 1=1 ,UI 
 
 (statements) 
 K = K+N 
 N=. . . 
 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 (statements) 
 K=K+N(I) 
 (statements) 
 -- CONTINUE 
 
 Example 29 - Induction and Non-Induction Variables 
 Also, the incrementing statement must be executed for every 
 iteration of the DO loop. It cannot be the object of an IF 
 statement, for example. 
 
 + DO 1=1 ,UI 
 
 (statements) 
 IF(...) K=K+3 
 (statements) 
 -- CONTINUE 
 
 Example 30 - Not an Induction Variable 
 
36 
 
 Induction variables are not restricted to a single 
 increment statement. Multiple increment statements are 
 allowed, as long as each increment is executed during every 
 loop iteration, and the increment expressions are all loop 
 invariant . 
 
 Multiply-nested loops are also allowed. Naturally, the 
 
 induction variable may be an induction on any one of the 
 
 nested loops. It may also be an induction on two or more 
 loops . 
 
 induction on inner loop 
 
 + DO 1=1 ,UI 
 
 K = 
 + -- DO J=1 ,UJ 
 I K=K+3 
 i <statements> 
 +-- CONTINUE 
 
 induction on both loops 
 
 K = 
 
 + DO 1=1 ,UI 
 
 +-- DO J=1 ,UJ 
 
 ! K=K+3 
 
 ! <statements> 
 
 +-- CONTINUE 
 
 + CONTINUE 
 
 + CONTINUE 
 
 Example 31 - Induction Variables in Multiple Loops 
 
 The increment statement may also be a decrement 
 
 statement. As long as the value added to the scalar in the 
 
 loop does not change within the loop, it may be positive or 
 
 negative . 
 
37 
 
 6.2 Substitution for Induction Variables 
 
 If the conditions for induction variable substitution 
 have been satisfied, the variable can be replaced by an 
 expression. In each statement in the loop, the variable can 
 be replaced by an expression which is equal to the value of 
 the variable. This expression will be composed of the 
 initial value of the variable, the index variables, and the 
 increment values. The expression will not be the same for 
 all statements in the loop. 
 
 K = 
 + DO 1 = 1 ,UI 
 
 <K is (1-1 )*3> 
 
 K = K + 3 
 
 <K is I«3> 
 + CONTINUE 
 
 Example 32 - Simple Induction Variable Expressions 
 Consider this simple example. The value of K at the 
 beginning of the loop is zero, and the increment is 3. In 
 statements below the increment statement, K has the value 
 ( increment*index) , since it has then been incremented I 
 times. In statements above the increment statement, K has 
 been incremented only (1-1) times, since it hasn't been 
 incremented for the current iteration yet. 
 
 Suppose the initial value is non-zero. 
 
38 
 
 K= 1 7 
 
 + DO 1=1 ,UI 
 
 <K is (1-1 )*3 + 17> 
 K = K + 3 
 
 <K is 1*3 + 17> 
 -- CONTINUE 
 
 Example 33 - Non-Zero Initial Value 
 Below the increment statement, the value of K is 
 ( increment*index+initial) . Above the increment statement, K 
 has been incremented only (1-1) times, as before. 
 
 Now add a second increment statement. 
 
 K= 1 7 
 
 + DO 1=1 ,UI 
 
 <K is (1-1 )*3 + (1-1 )*5 + 17> 
 
 K = K + 3 
 
 <K is 1*3 + (I-1)*5 + 17> 
 
 K = K + 5 
 
 <K is 1*3 + 1*5 + 17> 
 + CONTINUE 
 
 Example 34 - Multiple Increments 
 Below the last increment statement, the replacement 
 expression for K is ( index* ( increments) ^initial ) . Above 
 each increment statement, that increment has been executed 
 only ( I- 1 ) times . 
 
 In a nested loop, the Expressions are a little more 
 
 complex . 
 
39 
 
 K=17 
 
 + DO 1=1 ,UI 
 
 <K is ( (1-1 )*UJ)»3 + 17> 
 - DO J=1 ,UJ 
 
 <K is ((I-1)»UJ)»3 + (J-1)«3 + 17> 
 K = K + 3 
 
 <K is ((I-1)»UJ)*3 + J*3 + 17> 
 CONTINUE 
 <K is (I*UJ)*3 + 17> 
 + CONTINUE 
 
 Example 35 - Multiple Loops 
 Below the inner loop, K has been incremented (I*UJ) times. 
 It gets incremented UJ times for each iteration of the outer 
 loop. Above the inner loop, K has been incremented only 
 ((I-1)*UJ) times. Within the inner loop, the value is 
 similar to a simple induction variable with an initial value 
 of ( (1-1 )»UJ)*3+17. 
 
 The last example is with two increment statements in 
 nested loops. 
 
 K=17 
 
 + DO 1=1 ,UI 
 
 <K is ((I-1)«UJ)*3 + (I-1)*5 + 17> 
 +-- DO J=1 ,UJ 
 
 <K is ((I-1)*UJ)»3 + (J-1)*3 + (1-1 )*5 + 17> 
 K = K+3 
 
 <K is ((I-1)«UJ)*3 + J*3 + (I-1)*5 + 17> 
 +-- CONTINUE 
 
 <K is (I*UJ)«3 + (I-1)*5 + 17> 
 K = K + 5 
 
 <K is (I*UJ)*3 + 1*5 + 17> 
 CONTINUE 
 
 Example 36 - Multiple Increments in Multiple Loops 
 
40 
 
 A general procedure could be developed, but the 
 notation would only be confusing. The idea is too simple to 
 spend too much time on formality. The most common case seen 
 is a simple induction with one increment statement in one 
 loop. The second most common case has one increment in two 
 loops. This is used to step through arrays linearly inside 
 a nested loop. 
 
 6.3 Usefulness 
 
 Induction variable substitution reduces the number of 
 data dependences. Before the substitution, a statement that 
 uses the induction variable is dependent on the increment 
 statement. After the substitution, the statement is no 
 longer dependent on the induction variable at all. 
 
 The clearest advantage of induction variable 
 substitution is the increase of knowledge available about 
 the indexing patterns through the arrays in the loops. It 
 also permits the use of simpler data dependence tests on the 
 subscripts . 
 
41 
 
 CHAPTER 7 
 
 Subscript Addition 
 
 Often a programmer will use a scalar temporary in a 
 loop to hold a common subexpression. Sometimes forward 
 substitution of this expression can be done. After forward 
 substitution, DO loop distribution may continue. This may 
 not be desirable, since it increases the total number of 
 operations . 
 
 original 
 
 + DO 1=1 ,UI 
 
 T=<expr> 
 A(I)=T«. .. 
 B(I)=T+. . . 
 C(I)=T/. . . 
 
 + CONTINUE 
 
 forward substituted 
 
 + DO 1=1 ,UI 
 
 A(I) = <expr>». . . 
 B( I) = <expr>+. . . 
 C(I) = <expr>/. . . 
 + CONTINUE 
 
 Example 37 - Scalar Temporary Forward Substituted 
 Forward substitution may not always be possible. 
 
 original 
 
 + DO 1=1 ,UI 
 
 T=A(I) 
 
 A(I)=Z(I) 
 
 Z(I)=T 
 
 + CONTINUE 
 
 not the same 
 
 + DO 1=1 ,UI 
 
 A(I)=Z(I) 
 Z(I)=A(I) 
 -- CONTINUE 
 
 Example 38 - Invalid Forward Substitution 
 Loop distribution on the original program results 
 nonsense . 
 
 in 
 
42 
 
 + DO 1=1 ,UI 
 
 ! T=A(I) 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 i A(I)=Z(I) 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 ! Z(I)=T 
 + CONTINUE 
 
 Example 39 - Nonsense Loop Distribution 
 
 However, the desired result can be obtained by making the 
 
 temporary an array, instead of a scalar. 
 
 + DO 1=1 ,UI 
 
 ! T'(I)=A(I) 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 I ACI)=Z(I) 
 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 ! Z(I)=T'(I) 
 + CONTINUE 
 
 Example 40 - Valid Loop Distribution 
 The same idea of using an array temporary instead of a 
 scalar temporary may be used in Ex. 37 to circumvent the 
 addition of redundant operations. 
 
 + DO 1 = 1 ,UI 
 
 T f (I)=<expr> 
 A(I)=T'(I)«. • . 
 B(I)=T'(I) + . . . 
 C(I)=T f (I)/. . . 
 
 + CONTINUE 
 
 Example 37 - With Array Temporary 
 This idea is called scalar expansion . 
 
43 
 
 7.1 Scalar Expansion 
 
 The idea behind scalar expansion is to use a new 
 temporary for each iteration of the loop. This is most 
 easily done by making the temporary an array. Then by using 
 a different element of the array for each iteration of the 
 loop, we get a new temporary for each iteration. 
 
 Expansion of scalars into arrays is not always 
 straightforward. Suppose an initial value of the scalar is 
 used inside the loop. 
 
 T=<initial value) 
 
 + DO 1=1 ,01 
 
 S1: I D(I)=T+... 
 ! T=<expr> 
 I A(I)=T»... 
 + CONTINUE 
 
 Example 41 - Initial Value Used in Loop 
 
 Somehow, the element used in S1 must address the initial 
 
 value, when 1=1, and the last value assigned, for other 
 
 iterations. This can be done by using the element T'(I-1) 
 
 for T in S1. The initial value is then assigned to T'(0) 
 
 S1 : 
 
 T ' (0 )=<initial value) 
 + DO 1=1 ,UI 
 
 D(I) = T' (1-1 ) + . . . 
 
 T' (I)=<expr> 
 
 A(I) = T'(I)«. .. 
 + CONTINUE 
 
 Example 41 - With Expanded Scalar 
 
44 
 
 When the final value assigned to the temporary T is 
 used outside of the loop, an assignment back to the scalar 
 at the end of the loop must be added. 
 
 T' (0)=<initial value) 
 -- DO 1=1 ,UI 
 
 D(I)sT» (1-1 ) + . . . 
 T' (I)=<expr> 
 A(I)=T'(I)»... 
 
 + CONTINUE 
 
 T = T' (UI) 
 
 Example 41 - With Post-Loop Reassignment 
 
 Until now, only single loops have been considered. For 
 nested loops, a similar approach is used. In a doubly 
 nested loop, two subscripts are added. With a DO loop 
 nesting of D, D subscripts are added. Care must be taken 
 that the value read for the expanded temporary is always the 
 most recent value assigned. 
 
 T=<initial value) 
 
 + DO 1=1 ,UI 
 
 !+-- DO J=1 ,UJ 
 
 S1 : ! i D(I, J)=T+. . . 
 
 S2 : ! ! T=<expr> 
 
 S3: ! ! A(I, J) = T*. . . 
 
 |+— CONTINUE 
 
 + CONTINUE 
 
 Example 42 - Doubly Nested Loops 
 
 In this example, when [ I , J ] = [ 1 , 1 ] , S1 must read the 
 initial value of T. In later iterations of J, S1 must read 
 
45 
 
 the value assigned to T by S2 in the previous iteration of 
 J. That is, when [ I , J] = [ 1 , j ] , S1 must read the value 
 assigned to T by S2[1,j-1]. Also, S1[i,1] must read the 
 value assigned to T by S2[ i- 1 , UJ] . Statement S1 oan read a 
 value of T from the previous iteration of the J loop, or 
 from the previous iteration of the I loop. 
 
 T • (0 , 0)=<initial value) 
 
 + DO 1=1 ,UI 
 
 !+-- DO J=1 ,UJ 
 SI: 1 ! D(I, J)=T' (1-1 , J-1 ) + . . . 
 S2: ! ! T f (1-1 , J)=<expr> 
 S3: ! I A(I, J)=T' (1-1 , J)». . . 
 
 !+-- CONTINUE 
 
 S4: | T' (I,0)=T» (1-1 ,UJ) 
 
 + CONTINUE 
 
 S5: T=T'(UI,0) 
 
 Example 42 - With Expanded Scalar 
 
 The assignment added in S4 is similar to the assignment in 
 
 S5. Statement S4 passes the value of T' to the next 
 
 iteration of the I loop. This assignment can be costly in 
 
 terms of extra memory movement. However, it is necessary 
 
 only when a statement such as S1 , which can read "old" 
 
 values of T, is present. Proper scalar expansion allows the 
 
 DO loops to be distributed, although the statements may have 
 
 to be reordered. 
 
 + 
 
 S2: 
 
 S4: ! 
 
 + -- 
 
 T' (0 ,0)=<initial value) 
 DO 1=1 ,UI 
 
 - DO J=1 ,UJ 
 
 T' (1-1 , J)=<expr> 
 CONTINUE 
 
 - CONTINUE 
 
 - DO 1=1 ,01 
 
 T' (I,0)=T' (1-1 f UJ) 
 
 - CONTINUE 
 
46 
 
 S3 
 
 S1 
 
 S5 
 
 + DO 1=1 ,01 
 
 -- DO J=1 ,0J 
 
 A(I, J)=T» (1-1 ,J)». . . 
 CONTINOE 
 
 + CONTINOE 
 
 + DO 1=1 ,01 
 
 +__ DO J=1 ,0J 
 ! D(I, J)=T» (1-1 , J-1 ) + . . . 
 +-- CONTINOE 
 + CONTINOE 
 
 T = T' (01,0) 
 Example 42 - Distributed 
 This statement ordering is not unique. 
 
 7.2 Complications with IF statements 
 
 As always, the addition of IF statements complicates 
 matters. Extra assignments often must be added when scalar 
 expansion is done and IFs are present. One case is when the 
 scalar temporary is conditionally assigned. Simple 
 subscript addition can be incorrect. 
 
 + DO 1=1 ,01 
 
 I IF( . . .) T=<expr> 
 I A(I)=T»... 
 
 + DO 1=1 ,01 
 
 I IF(...) T!(I)=<expr> 
 
 i A(I)=T'(I)«. . . 
 
 + CONTINOE 
 
 + CONTINOE + 
 
 original incorrect expansion of T 
 
 Example 43 - Incorrect Expansion with IF Statement 
 When the condition is not satisfied, T'(I) is not assigned 
 anything. In this program, the IF can be changed into an 
 IF-THEN-ELSE. 
 
47 
 
 + DO 1=1 ,UI 
 
 IF( . . .) THEN T» (I) = <expr> 
 
 ELSE T' (I)sT' (1-1 ) 
 A(I)=T' (I)*. . . 
 -- CONTINUE 
 
 Example 43 - Correct Expansion with IF Statement 
 This generates an obscure kind of recurrence. However, this 
 recurrence is restricted to T'. The loop can be 
 distributed, whereas in the original program, it could not 
 be. 
 
 + DO 1=1 ,UI 
 
 ! IF (...) THEN T'(I)=<expr> 
 I ELSE T' (I)=T' (1-1 ) 
 
 + CONTINUE 
 
 + DO 1 = 1 ,UI 
 
 i A(I)=T'(D*. . . 
 + CONTINUE 
 
 Example 43 - Eistributed 
 
 An IF followed by a forward GOTO presents a similar 
 
 problem. The assignment to the scalar may be skipped over. 
 
 The problem and the solution are similar to that of 
 conditional assignment. 
 
 + DO 1=1 ,UI 
 
 IF( . . .) GOTO 7 
 T=<expr> 
 A(I)=T«. . . 
 
 + CONTINUE 
 
 7: 
 
 original 
 
 + DO 1=1 ,UI 
 
 IF( . . .) THEN DO 
 
 T' (I)=T f (1-1 ) 
 GOTO 7 
 END 
 T' (I)=<expr> 
 A(I)=T(I)»... 
 + CONTINUE 
 
 correct expansion 
 
 Example 44 - Correct Expansion with Forward GOTO 
 
48 
 
 Backwards GOTOs present another problem. A backwards 
 GOTO may branch to the program before the assignment to the 
 temporary. The solution is somewhat similar to the above. 
 
 + DO Ia1 ,UI 
 
 A(I) = T*. . . 
 
 T=<expr> 
 
 IF( . . .) GOTO 8 
 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 8: ! A(I) = T»(I-1)». . . 
 T' (I)=<expr> 
 IF( . . .) THEN DO 
 
 T' (1-1 )=T' (I) 
 | GOTO 8 
 
 ! END 
 
 + CONTINUE 
 
 Example 45 - Correct Expansion with Backwards GOTO 
 Thi3 may not help in loop distribution. The whole problem 
 of IF loops needs to be studied. One feasible suggestion is 
 to structure an IF-GOTO loop as a DO-WHILE loop, which can 
 be treated much like ordinary DO loops. 
 
 Other types of GOTOs may be encountered, for example, 
 GOTOs out of the loop, or between loop nesting levels. The 
 general rule is to make sure that the next use of the 
 temporary reads the most recent assignment to that 
 temporary. The extra memory movement necessary to insure 
 this may be too costly to do scalar expansion. 
 
 7.3 Method of Scalar Expansion 
 
 For each scalar T to be expanded, declare a new array 
 variable T'. Find the assignment to T with the deepest DO 
 loop nest level. Give T 1 that many dimensions. Throughout 
 
49 
 
 the loop, each occurrence of that scalar will be re-placed by 
 an expression which will consist of the new array, 
 subscripted by expressions of the index variables. 
 Associate with each dimension of the new array a DO loop 
 nesting level. So, the first dimension is associated with 
 the outer nest level, the second dimension with the next DO 
 nest level, etc. 
 
 An assignment to the zeroeth element of T' may have to 
 be made, if T is read in the loop before it is assigned. 
 This assigns an initial value to the array. The assignment 
 
 T« (0,0)sT 
 is adequate. 
 
 Initially, the replacement array expression is 
 T * (0 , , . . . , ) . Travel through the loop, replacing each 
 occurrence of the scalar with the replacement array 
 expression. When a loop of nest level L is entered, change 
 the L subscript of the replacement array expression from 
 "0" to "1-1", where I is the index variable for that DO 
 loop . 
 
 -- DO 1=1 ,UI 
 
 . . ,sT' (1-1 ,0) . . . 
 -- DO J=1 ,UJ 
 
 . . ,=T» (1-1 , J-1 ) 
 CONTINUE 
 + CONTINUE 
 
50 
 
 When the first assignment to the temporary within a DO loop 
 is reached, replace any occurrence of the temporary on the 
 right hand side by the current replacement array expression. 
 Then, change the replacement array expression so the L 
 subscript is "I" instead of "1-1". Do not change any other 
 subscript expression. Do not change anything upon reaching 
 a second assignment within the same loop. 
 
 + -- 
 
 + - 
 + -- 
 
 - DO Is1 ,UI 
 DO J=1 ,UJ 
 
 T' (1-1 , J) = T« (1-1 , J-1 ) 
 
 CONTINUE 
 CONTINUE 
 
 When leaving an inner loop, it may be necessary to 
 generate an assignment to carry out the last value assigned 
 in the loop. If the loop just exited is at nest level L, 
 change the replacement array expression so the L subscript 
 is "0" instead of "J". Change the L-1 st to "I" instead of 
 "1-1", if it is not already. The generated assignment uses 
 this array on the left hand side. The right hand side is 
 the old replacement array expression with the L subscript 
 replaced by "UJ", where UJ is the upper bound expression for 
 the loop just exited. 
 
 + 
 
 + — 
 
 + — 
 
 
 DO 1=1 ,UI 
 DO J=1 ,UJ 
 
 CONTINUE 
 T» (I ,0)=T' (1-1 ,UJ) 
 CONTINUE 
 
51 
 
 When leaving the outer loop, it may be necessary to 
 generate an assignment to replace the most recent assignment 
 to the temporary back to the original scalar. 
 
 + DO Ir1 ,UI 
 
 CONTINUE 
 TsT 1 (UI,0) 
 
 Conditional assignments and COTOs must be handled 
 carefully. In all such cases, the replacement array 
 expression must refer to the most recent assignment to the 
 temporary . 
 
 Scalar expansion may seem to be self-defeating. Quite 
 often, many extra memory movements must be added to be 
 correct. Always, the memory requirements increase 
 dramatically, especially for deeply nested loops. However, 
 the idea is to be able to distribute loops around the 
 statements which use the temporary. Also, the use of an 
 array allows the machine to be filled, if it is a parallel 
 architecture, or for large vectors to be operated on 
 otherwise. Limited expansion of scalars, adding only 1 or 2 
 subscripts, may be sufficient to make the use of the 
 hardware effienct. It may not be wise to allocate large 
 temporary arrays. 
 
52 
 
 7.4 Array Expansion 
 
 The idea behind scalar expansion is to make the 
 temporary "large" enough so that each iteration of the loop 
 refers to a new variable. This is done by giving the 
 temporary as many dimensions as needed, so each DO loop has 
 its own subscript. One may wonder about arrays inside of DO 
 loops, which do not have this many dimensions. The question 
 arises whether arrays can be, and should be, expanded. This 
 is the question of array expansion. 
 
 A surprisingly common practice is for a programmer to 
 use an array with constant subscripts in a loop. This is 
 often done to pass a large amount of information in a single 
 parameter to a subroutine. Each element of the array can be 
 considered an independent scalar. The same strategy 
 employed for scalar expansion will work for this case. 
 
53 
 
 + DO 1=1 ,UI 
 
 T(5)=. . . 
 T(6)=. . . 
 . ..=T(4) 
 ...=T(5) 
 T(7)=. . . 
 
 original 
 Example 46 
 
 T5'(0)=T(5) 
 T6 f (0)=T(6) 
 T7'(0)=T(7) 
 + DO 1=1 ,UI 
 
 T5'(D=. . • 
 T6»(I)=... 
 . . .=T(4) 
 
 CONTINUE i ...=T5'(I) 
 
 T7'(D=. • . 
 
 -- CONTINUE 
 T(5)=T5'(UI) 
 T(6)sT6» (UI) 
 T(7)=T7' (UI) 
 
 proper expansion 
 Array Expansion with Constant Subscripts 
 
 Another common practice is for a singly-subscripted 
 
 array to be used inside a doubly nested DO loop, with only 
 
 one index variable used in its subscript. If only the outer 
 
 DO loop index variable appears, then a reference to the 
 
 array in the inner DO loop is essentially a reference to a 
 scalar, since it does not change for different iterations of 
 the inner loop. 
 
 + -- 
 
 + - 
 + -- 
 
 - DO 1=1 ,UI 
 DO J=1 ,UJ 
 
 A(I)=. . . 
 
 A(I+1)=. 
 
 CONTINUE 
 CONTINUE 
 
 i 
 i 
 
 + -- 
 
 + DO 1=1 ,UI 
 
 A' (I,0) = A(I) 
 
 A' (1+1 ,0)=A(I+1 ) 
 
 DO J=1 ,UJ 
 
 A' (I, J)=. . . 
 
 A' (1+1 , J) = . . . 
 
 CONTINUE 
 A(I)=A« (I,UJ) 
 A(I+1 )=A» (1+1 ,UJ) 
 - CONTINUE 
 
 Example 47 - Array Expansion in Inner Loop Only 
 This is done similarly to scalar expansion. 
 
54 
 
 If only the inner DO loop index variable ever appears, 
 then the entire inner loop and the array can be treated 
 somewhat like a large scalar, looking from the outer loop. 
 
 + DO 1=1 ,UI 
 
 ...=A(3) 
 - DO J=1 ,UJ 
 . . .sA( J) 
 A(J) = . . . 
 . . .=A(J) 
 +-- CONTINUE 
 
 A(4)=. .. 
 CONTINUE 
 
 + DO Js1 ,UJ 
 
 j A'(J,0)=A(J) 
 + CONTINUE 
 
 + DO 1=1 ,UI 
 
 . ..=A'(3,I-D 
 + -- DO J=1 ,UJ 
 
 . . .=A' (J, 1-1 ) 
 A' (J,I)=. . . 
 . ..=A'(J,I) 
 CONTINUE 
 A'(4,I)=. . . 
 + CONTINUE 
 
 + DO J=1 ,UJ 
 
 i A(J)=A'(J,UI) 
 + CONTINUE 
 
 Example 48 - Array Expansion of Outer Loop Only 
 Again, this is similar to scalar expansion. 
 
 However, array expansion includes a heavy penalty. All 
 the initialization assignments and other added statements 
 are vector operations. This causes a great deal, of memory 
 movement. The entire procedure was introduced to facilitate 
 loop distribution. For non-trivial cases, this goal cannot 
 be guaranteed . 
 
 The general case of array expansion can be described, 
 and a method defined for implementing it. It may not serve 
 any useful purpose. If the DO loops cannot be distributed, 
 
55 
 
 then the added statements merely make the Pi-partitions 
 larger and more complex. Small problems, like IFs, can make 
 the method very complicated. The tradeoff between added 
 operations, algorithmic complexity, and possible enhancement 
 of loop distribution should be considered before trying to 
 implement general array expansion. 
 
56 
 
 CHAPTER 8 
 
 The PARAFRASE Compiler 
 
 This chapter describes the PARAFRASE FORTRAN compiler. 
 This compiler consists of 13,000 PL/I statements, and is 
 currently running on an IBM 360/75 and an IBM 370/158. It 
 accepts ANS FORTRAN with many of the IBM extensions. The 
 compiler is divided into many passes. Each pass makes some 
 transformation on the program. The program is manipulated 
 in essentially source form. The compiler uses many special 
 algorithms to detect parallelism in the program, as well as 
 other standard compiler methods. 
 
 1 . Lexical Scanning 
 
 The first pass over the program . scans the 
 source text and saves the program in the standard 
 compiler data structures. While it is scanning the 
 program, some cosmetic changes in the program are 
 made. The data structures are organized so that 
 the original program can be reconstructed. 
 
 2. DO Loop Normalization 
 
57 
 
 Several of the algorithms described depend on 
 DO loops having a lower bound of 1 and an increment 
 of 1 . In particular, the data dependence tests, 
 and induction variable substitution depend on this. 
 After lexical scanning, this pass changes DO loops 
 to satisfy this condition. The new upper bound is 
 ( upper-lower+1 ) /increment . The index variable is 
 replaced by an expression in the loop, 
 ( index- 1 ) •increment+lower to reflect the change. 
 
 + DO 1=4, N, 3 
 
 i A(I)r... 
 
 + CONTINUE 
 
 + DO 1=1 , (N-4+1 )/3, 1 
 
 I A((I-1)«3+4)s... 
 
 + CONTINUE 
 
 Example 49 - DO Loop Normalization 
 
 IF Pattern Matching 
 
 In real programs, IF statements are often used 
 to replace MAX/MIN builtin function calls. This 
 enhances transportability, since differect 
 implementations have different names for these 
 functions. This pass recognizes some of these 
 patterns. Soon, it will also recognize vector 
 MAX/MINs. 
 
58 
 
 scalar MIN 
 IF(A.LT.B) B = A 
 
 vector MAX 
 
 B=MIN(B, A) 
 
 + -- 
 
 M = 
 
 - DO 1=1 ,01 
 
 IF(A(I) .GT.M)M=A(I) 
 
 - CONTINOE 
 
 M = 
 
 - DO 1=1 ,01 
 
 M=MAX(A(I) ,M) 
 
 - CONTINOE 
 
 Example 50 - IF Pattern Matching 
 
 4. Scalar Renaming 
 
 This is a standard compiler algorithm used to 
 decrease the total amount of data dependence in the 
 program. Whenever possible, scalars in the program 
 are renamed. 
 
 A=. . . 
 
 . . .=A 
 
 A • • ■ • 
 
 Example 51 - Scalar Renaming 
 
 As. . . 
 
 • < • — n 
 
 Aa= . . . 
 
 • • • — fi d 
 
 5. Induction Variable Substitution 
 
 Induction variables were talked about in 
 Chapter 6. This pass handles most common cases of 
 induction variables, with one increment statement, 
 and inside of one or two DO loops. 
 
59 
 
 + -- 
 
 i 
 i 
 
 i 
 i 
 
 K = 3 
 
 - DO 1=1 ,UI 
 K = K + 5 
 A(K)=... 
 CONTINUE 
 
 + 
 
 K = 3 
 
 DO 1=1 ,UI 
 
 K=I«5*3 
 A(I»5 + 3) = 
 CONTINUE 
 
 Example 52 - Induction Variable Substitution 
 
 6. Scalar Expansion 
 
 Scalar expansion was described in Chapter 7. 
 This pass handles the general case of scalar 
 expansion. It expands all scalars in DO loops to 
 arrays. The initialization and the other 
 assignment statements are always added, for 
 correctness. No checking is made to see if they 
 are necessary. This can cause much unnecessary 
 memory movement. 
 
 + 
 
 DO 1=1 ,UI 
 
 1 
 
 T=. . . 
 
 1 
 1 
 
 -T* 
 
 ¥ 
 
 CONTINUE 
 
 + -- 
 
 I 
 I 
 
 + -- 
 
 T' (0)=T 
 
 - DO 1=1 ,UI 
 
 T ' ( I ) = . . . 
 . . .=T»(I)« 
 
 - CONTINUE 
 T = T' (UI)' 
 
 Example 53 - Scalar Expansion 
 
 7. Array Expansion 
 
 Array expansion was also described. This pass 
 handles the limited case of a singly subscripted 
 array in multiple loops, or with constant 
 
60 
 subscripts. For most purposes, this is sufficient. 
 
 8. IF Removal from DO Loops 
 
 The problem of IFs in DO loops was mentioned. 
 This pass removes Towle's "A" and type "B-prefix" 
 IFs from the scope of DO loops. The theory of IF 
 removal is currently being reconsidered to make it 
 more general . 
 
 9. IF Treeing 
 
 IF treeing was described to be a method to 
 reduce the total number of conditionals in the 
 program. This pass combines IFs outside of DO 
 loops when the ratio of IFs to assignment 
 statements reaches a certain threshhold. 
 
 10. Code Generation 
 
 Three address vector oriented pseudo-code is 
 generated. This code can be analyzed to get bounds 
 on the speedup gained by the parallelism 
 mechanisms, and to see how effective these 
 mechanisms are. Also, this code can be simulated 
 on various machine architectures, to see what kinds 
 of machines are good for what kinds of algorithms. 
 
61 
 
 1 1 . Analysis and Statistics 
 
 After the program is compiled, the code 
 generated is analyzed. The results of the analysis 
 are speedup bounds: how much faster this program 
 could run on a suitable machine than on a serial 
 machine. In addition, global statistics about the 
 program are collected and saved for comparison to 
 other programs. 
 
 Much work has been done in the theory of compiler for 
 parallel and vector machines. In the PARAFRASE FORTRAN 
 compiler, we have implemented many of the algorithms for 
 parallelism detection and parallelism enhancement. In the 
 course of testing the compiler, several new methods have 
 been discovered for parallelism enhancement, and these too 
 have been implemented. 
 
 Using this compiler, parallel or vector oriented 
 machines will be able to execute many ordinary programs 
 efficiently on their own special architecture. When more of 
 the theory is understood, and more methods implemented, it 
 is hoped that a large class of ordinary programs will be 
 able to be compiled for these new machines, utilizing their 
 special hardware in a cost effective way. 
 
62 
 
 List of References 
 
 [BANERJEE] Utpal Banerjee 
 
 "Data Dependence in Ordinary Programs" 
 
 Master's Thesis, University of Illinois at Urbana 
 
 UIUCDCS-R-76-837, November, 1976. 
 
 [CHEN] Shyh-ching Chen 
 
 "Speedup of Iterative Programs in Multiprocessing 
 
 Systems" 
 
 Ph.D. Thesis, University of Illinois at Urbana 
 
 UIUCDCS-R-75-694, January, 1975. 
 
 [CHEN 76] Shyh-ching Chen, David Kuck, Ahmed Sameh 
 
 "Practical Parallel Triangular System Solvers" 
 to appear in !£££ Transactions c_& Mathematical 
 Software 
 
 [DAVIS] Edward Davis 
 
 "A Multiprocessor for Simulation Applications" 
 Ph.D. Thesis, University of Illinois at Urbana 
 UIUCDCS-R-72-527, June, 1972. 
 
 [LEASURE] Bruce Leasure 
 
 "Compiling Serial Languages for Parallel 
 
 Machines" 
 
 Master's Thesis, University of Illinois at Urbana 
 
 UIUCDCS-R-76-837, November, 1976. 
 
 [MURAOKA] Yoichi Muraoka 
 
 "Parallelism Exposure and Exploitation in 
 
 Programs" 
 
 Ph.D. Thesis, University of Illinois at Urbana 
 
 UIUCDCS-R-71-424, February, 1971. 
 
 [SAMEH] Ahmed Sameh, Richard Brent 
 
 "Solving Triangular Systems on a Parallel 
 
 Computer" 
 
 3IAM Jpurqfrl on Numerical Analysis 
 
 Vol. 14, No. 6; December, 1977; pp. 1101-1113. 
 
 [TOWLE] Ross Towle 
 
 "Control and Data Dependence for Program 
 
 Transformations" 
 
 Ph.D. Thesis, University of Illinois at Urbana 
 
 UIUCDCS-R-76-788, March 1976. 
 
63 
 
 APPENDIX A 
 
 Recurrence Formulae 
 
 After we compile a program, we try to see how fast the 
 compiled program would execute on a suitable multiprocessor. 
 To do this, we must know how fast a processor can do any 
 operation. We make the assumption that any arithmetic 
 operation (+,-»*»/) can be performed by a processor in one 
 time step. Furthermore, p processors can perform n 
 independent operations in — time steps. Using these 
 assumptions, and ignoring data alignment (fetch and store) 
 time, we can find the time necessary to perform any sequence 
 of vector operations, by summing the time necessary for each 
 one. We assume the machine is SIME, so no overlap of two 
 distinct operations is done. This would simplify the design 
 of a machine and a compiler. 
 
 The only other type of operation considered is a linear 
 recurrence. To compute how fast a recurrence could execute 
 on a machine, we must know what algorithm is being used to 
 solve the recurrence. Several algorithms have been proposed 
 to solve linear recurrences on parallel machines [CHEN, 
 CHEN76, SAMEH]. Here we give the number of time steps 
 
64 
 
 necessary, along with the number of arithmetic operations 
 performed, to solve a linear recurrence when using these 
 fast parallel algorithms. Throughout this section, lg(n) is 
 logarithm base 2, and [x] is the least integer greater than 
 or equal to x . 
 
 A.1 Full Recurrences with unlimited processors 
 
 The algorithm for solving full recurrences with an unlimited 
 number of processing elements available is given in [CHEN] 
 or [SAMEH]. A bound is given on the number of processors 
 used in the algorithm. 
 
 A. 1.1 General Full Recurrence, R<n> 
 
 time steps s 
 
 ^lg 2 (n) + |lg(n) 
 
 multiplications = additions = 
 
 13 12 1 1 
 _ n + _ n . _ n . _ 
 
 processors = 
 
 15 3 12 1 
 
 T024 n + o 70 + T6 n 
 
 A. 1.2 R<n> with ( n+ 1 ) Right Hand Sides 
 
65 
 
 time steps = 
 llg 2 (n) + |lg(n) 
 
 multiplications = additions = 
 
 23 3 12 1 1 
 42 n " 6 n " 3° " IT 
 
 processors = 
 
 21 3 5 2 1 
 __ n + _ n + _ n 
 
 A. 1.3 R<n>, Remote Term only 
 
 time steps = 
 
 jlg 2 (n) + flg(n) 
 
 multiplications = additions = 
 
 13 12 2 
 28 n + 4 n "7 
 
 processors = 
 
 3 3 9 2 1 
 256 n + 6T n + 8 n 
 
 A. 1.4 Full Recurrence with Constant Coefficients, R<n> 
 
 time steps = 
 
66 
 
 ^■lg 2 (n) + |lg(n) 
 
 multiplications = additions = 
 
 13 13 2 5 1 
 48 n + 24 n " 6 n + 3 
 
 processors = 
 
 13 3 2 1 
 T28 n + T6 n + 8 n 
 
 A. 1.5 R<n> with ( n+ 1 ) Right Hand Sides 
 
 time steps r 
 
 jlg 2 (n) + |lg(n) 
 
 multiplications = additions = 
 
 25 n3 12 5 n 1 
 ¥8 n + 24 n " 6 n + 3 
 
 processors = 
 
 21 3 5 2 1 
 T28 n + T6 n + 8 n 
 
 A. 1.6 R<n>, Remote Term only 
 
 time steps = 
 ^■lg 2 (n) ♦ |lg(n) 
 
 multiplications = additions = 
 
 1 n 3 5 2 1 2 
 F4 n + T2 n - I" " 2T 
 
67 
 
 processors s 
 
 13 7 2 1 
 T28 n + 64 n + 8 n 
 
 A. 2 Banded Recurrences with unlimited processors 
 
 The algorithm for solving banded recurrences with an 
 unlimited number of processors available is given in [CHEN] 
 or [SAMEH]. A bound is given on tne number of processors 
 used in the algorithm. 
 
 A. 2.1 General Banded Recurrence, R<n,m> 
 
 time steps = 
 lg(n) (lg(m)+2) - ^(lg 2 (m)+lg(m)) 
 
 multiplications = 
 
 1/2 » . ,n> 1921 1 1 n 1,3 2, 
 
 :r(m n+mn)lg( — ) - -r-fm n-rmn-Tn-Tr- + :r(m J +m ) 
 
 m 
 
 42 
 
 6" mn -3 n "2T¥ 
 
 additions = 
 
 12 , ,n N 
 2 m n lg( m" } 
 
 192 5 1 1n 1/3 m 2v 
 42 m n+ 6 mn -3 n -2TS + 2 (m - B } 
 
 processors = 
 
 12 3 
 
 •x(m n + mn) - m 
 
 A. 2. 2 R<n,m>, Remote Term only 
 
68 
 
 time steps = 
 lg(n)(lg(m)+2) - j(lg 2 (m)+lg(m)) 
 
 multiplications = 
 
 additions = 
 T |m 2 n--imn-^n - |m 3 +-lm -jji 
 
 - (m 3 -m 2 )lg(£) 
 
 processors = 
 
 1/2 , 3 
 — (m n+mn) - m 
 
 A. 2. 3 Banded Recurrence with Constant Coefficients, R<n,m> 
 
 time steps 
 
 lg(n) (lg(m)+2) - -l(lg 2 (m)+lg(m)) 
 
 multiplications = 
 
 1 , ,n, 12 23 3 1 2 5 1 
 2 mn 1 « ( m ) * 2 m n " 48° + 24 m '"6 n+ 3 
 
 additions = 
 
 i (n 2 - n ,„ . §f° 3 4!° 2 -M 
 
 processors 
 
 1 1 2 
 
 ■xmn + -r-m n 
 
A. 2. 4 R<n,m>, Remote Term only 
 
 time steps = 
 lg(n)(lg(m)+2) - j(lg 2 (m)+lg(m)) 
 
 69 
 
 multiplications = 
 n 
 
 3 , ,ik 25 3 23 2 5 1 
 
 mn + m lg(-) + ^m -p-m -^m+y 
 
 additions = 
 
 , 3 2x. ,'iu 25 3 23 2 5 1 
 mn + (m -m )lg(-) + jj^m -^m -jm+3 
 
 processors = 
 
 1 3 
 Tmn + m 
 
 A. 3 Banded Recurrences with a limited number of processors 
 
 The algorithms for solving banded recurrences with a limited 
 number of processors are given in [CHEN] or [CHEN 76]. ■ The 
 number of processors, p, is assumed to be in the range 
 2m <. p <. n . Notice that a simpler algorithm, such as column 
 sweeping, may actually be faster than using this more 
 complicated algorithm. 
 
 A. 3-1 General Banded Recurrence, R<n,m> 
 
70 
 
 time steps = 
 
 (2m 2 +3m)^ + (m 2 +^m+1 ) lg(£) - (2m 2 +|m+3) 
 
 multiplications = 
 
 O 10 r\ Q Q O 
 
 (m + 2m)n + j^ +m )P iS^ + 2^ m * 
 - (2m 2 +2m+j)p - (m 3 +m 2 )^ 
 
 additions = 
 
 (m +2m-1)n + -r-m p lg(^) + (*nr+^m -m) 
 
 1 "3 O n 
 
 - (2m +2m--r-)p - (m +m -co- 
 
 processors = p 
 
 A. 3« 2 R<n,m>, Remote Term only 
 
 time steps = 
 
 (2m 2 +m)- + (m 2 +|m+1)lg(^) - (2m 2 +4-m+2) 
 p d m 2 
 
 multiplications = 
 
 (m 2 +m)n - m 3 lg(|) - (\vr> -\m 2 ) + (|m 2 -2m-^)p - m 3 - 
 
 m 2 2 2 2 p 
 
 additions = 
 (m 2 + m)n - (m 3 -m 2 )lg(£) - (I m 3 -|m 2 ) + (|m 2 -3m-j)p - -ra 3 
 
 processors = p 
 
 A. 3-3 Banded Recurrence with Constant Coefficients, R<n,m> 
 
71 
 
 time steps = 
 
 4m-^- + (m 2 4m + 1)lg(-^) - (m 2 + fm+3) + (2m-1)[-m 2 ] 
 p-m 2 m 2 p 
 
 multiplications = 
 
 i 3 1 2 1 x. ,p-nu ,3.3 , 2 1 , „ n 
 
 mn + (m -— m +— mpHgC* — ) - (-^-m -2m +-r-m) - 2mp + mp 
 
 22 m 2 2 p-m 
 
 additions = 
 
 mn + (m 3 -|m 2 +4-mp)lg(- E r^) + (4m 3 +m 2 +^-m) - (2m+1)p + mp~ 
 
 2 2 
 
 m 
 
 p-m 
 
 processors s p 
 
 A. 3. 4 R<n,m>, Remote Term only 
 
 time steps = 
 
 2m p?m + ( m3+n » 2 +mp+p)-l i g (2^) + (2m-1)[-im 2 ] 
 
 multiplications - 
 
 ,3121 » . , p-m N , 1 3 5 2 x ,3 \\ "' 
 
 (m J --m +2-mp)lg( Ji — -) - {^ -^m ) - (-*m+ ? )p + mp— 
 
 m 
 
 2 2 ' x 2 2 
 
 p-m 
 
 additions = 
 
 (m 3 -|m 2 4mp)lg(-^) - 4m 3 -|m 2 ) - <|«4>P + m P^ 
 22 m 22 22 p-m 
 
 processors = p 
 
72 
 
 APPENDIX B 
 
 Compiler User's Guide 
 
 The Parafrase FORTRAN Compiler may be run on the IBM 
 360/75 at the University of Illinois at Urbana. In addition 
 to the job control cards necessary to execute the compiler, 
 the user may set any compiler options he wishes. The user 
 must also include the FORTRAN program to be compiled. The 
 available options are described in Appendix C. A typical 
 job to use the compiler will look like this: 
 
 //ANALYZE JOB 
 
 /•ID PS=1234,NAME='J0E SCHMOE' 
 
 /•ID CODE=PUBLIC 
 
 /•ID REGION=2 50K,TIME=2,IOREQ=4000 
 
 //PROCLIB DD DSNAME=USER.P6543.MACUOI,DISP=SHR 
 
 // EXEC COMPILE 
 
 //OPTIONS DD • 
 
 specify any compiler options here 
 //SYSIN DD * 
 JNINCR INCREMENT AN ARGUMENT 
 
 SUBROUTINE INCR (A) 
 
 A = A+1 . 
 
 RETURN 
 
 END 
 JNSUM FORM A CHAIN SUM 
 
 FUNCTION SUM(A,N) 
 
 INTEGER A(N) ,X 
 
 X = 
 
 DO 1 1=1 ,N 
 1 X=X+A(I) 
 
 SUM = X 
 
 RETURN 
 
 END 
 *DATA 
 N 10 
 JNINNER FORM AN INNER PRODUCT 
 
73 
 
 JDATA 
 N 100 
 
 1NRAN 
 
 20 
 
 SUBROUTINE INNER (A,B,C,N) 
 
 REAL A(N) ,B(N) ,C 
 
 C = 
 
 DO 7 1=1 ,N 
 
 C=C+A(I)»B(I) 
 
 RETURN 
 
 END 
 
 DOM A RANDOM PROGRAM 
 SUBROUTINE FIND (A,B,C,N) 
 INTEGER I, J 
 REAL A(N) ,B(N) ,C(N) 
 A(1)=1 
 DO 20 1=1 ,N 
 A(2) = 2 
 DO 20 J=1 ,N 
 A(3) = 3 
 
 C(I) = C(I) « A(3) 
 CONTINUE 
 
 JDATA 
 N 30 
 
 URRA 
 
 SIZE = 
 
 BLOCK 
 A 92 
 C 103 
 
 JNnam 
 
 JDATA 
 varia 
 varia 
 
 URRA 
 
 SIZEr 
 
 BLOCK 
 array 
 
 DO 30 1=1 ,N 
 
 B(I) = B(I) * A(I) 
 
 CONTINUE 
 RETURN 
 END 
 
 Y 
 
 2 
 
 = 1 
 
 2 1 X&01 X&02 
 
 2 1 2 X&01 X&02 
 e title 
 
 source 
 
 program 
 
 END 
 
 value 
 value 
 
 ble 
 ble 
 Y 
 
 number of 
 
 =do block 
 
 lhs-use 
 
 arrays to expand 
 to work with 
 n urn -dimensions 
 
 nestl nest2 X&nn X&nn 
 
 BL0CK=do block 
 
 JLOOP 
 
 indexvariable 
 indexvariable 
 /» 
 
74 
 
 The control cards should appear as indicated. More REGION, 
 TIME, or IOREQ may be needed for more or larger programs. 
 Compiler options may be chosen from Appendix C. The format 
 of the SYSIN file follows. 
 
 1 . %U card 
 
 Each program in the input stream must be preceded 
 by a %H card. Immediately following the N is a 
 program name, up to 8 characters long. A title may 
 follow the name, separated by a space, up to 65 
 characters long. The name and title are used for 
 later identification. 
 
 2. source program 
 
 The actual FORTRAN source follows the %H card. The 
 last card should be an END card. 
 
 JDATA card 
 
 Optionally, data cards may be inserted. These are 
 
 used to set the values of integer scalars. Usually 
 this is used to set DO loop upper bounds. This 
 
 information is used in the calculation of speedup. 
 
 To use this feature, include a JDATA card, and 
 
 follow it with an arbitrary number of cards which 
 
75 
 
 have an integer scalar name in column 1 , and its 
 value following the name, separated by a blank. 
 
 4. *ARRAY 
 
 If the user wishes to do array expansion, then he 
 must include a JARRAY card. The cards following it 
 are input to the array expansion program. A 
 'SIZE=n' card tells the program the maximum number 
 of arrays which will be expanded. The default is 
 'SIZEslO'. The 'BLOCK=n' card tells the program in 
 which DO-loop block to expand the following arrays. 
 The default is 'BL0CK=1'. Several 'BLOCKm' cards 
 may be included, to expand arrays in several 
 different DO loops. Each array to be expanded 
 requires another input card. The name of the array 
 to be expanded comes first on the card. Following 
 the name is the "program pointer" pointing to the 
 statement which is the first left-hand-side use of 
 the array in that DO loop. Then follows an 
 integer, d, giving the new number of dimensions of 
 the array, the number of dimensions desired for the 
 array. Following that is a list of d numbers. 
 Each of these associates a do nest level with a 
 dimension of the new array. Nest level 1 is the 
 outer DO loop, nest level 2 is the next inner 
 level, nest level is outside the DO loops 
 
76 
 
 (constant level), and so on. Expressions in the 
 corresponding subscript position will involve only 
 DO loop indices at that DO nest level. Finally is 
 a list of the index variables for the DO loops in 
 which the array is to be expanded. Note that these 
 are index variables after DO loop normalization, 
 and so are of the form 'X&nn' , where 'nn' is the DO 
 loop number. 
 
 5. JLOOP 
 
 Optionally, the user may wish to execute some DO 
 loops serially, rather than distribute them. If 
 so, include a $LOOP card. Follow it with an 
 arbitrary number of cards which have a DO loop 
 index variable name in column one. Notice that 
 this is done after DO loop normalization, so each 
 DO loop has a unique index variable name. The DO 
 loop with that index variable will not be 
 distributed . 
 
 As many programs may be compiled in one job as desired, as 
 long as the TIME and IOREQ available are sufficient. The 
 output consists of listings of the program after the 
 transformations, and optionally a disk file containing the 
 generated code, for later simulation. 
 
77 
 
 APPENDIX C 
 
 Compiler Options 
 
 Compiler options may be set by inserting cards after 
 the //OPTIONS DD * card. A binary switch is set ON by a 
 card : 
 
 SWITCH=' 1 'B 
 A binary switch is reset OFF by a card: 
 
 SWITCH= f 'B 
 A numeric option is given a value by a card: 
 
 0PTI0N=1 
 
 OPTION=77 
 The OPTIONS file is read with a PL/I GET DATA statement. 
 
 C.1 FLAGs 
 
 A FLAG is a binary switch used to enable or disable certain 
 passes of the compiler. A FLAG is set by a card: 
 
 FLAG.SWITCHr' 1 'B 
 
78 
 
 1. FLAG.CLEAN_IF 
 
 Enables IF pattern matching. 
 
 2. FLAG.CLEAN_SUBSCRIPT 
 
 Enables subscript cleaning, which simplifies 
 subscripts . 
 
 3. FLAG.CONSOLIDATE_COMMON 
 
 Enables a small program which cleans the compiler 
 data structures containing COMMON variables. 
 
 4. FLAG.DISTRIBUTE_LOOP 
 
 Enables a program which will physically distribute 
 the loops around the PI partitions. 
 
 5. FLAG.EXPAND_ARRAY 
 Enables array expansion. 
 
 6. FLAG.EXPAND_SCALAR 
 Enables scalar expansion. 
 
 7 . FLAG. EXPAND_STATEMENT_FUNCTIONS 
 
 Enables a program which expands statement function 
 uses to the expressions defined by the statement 
 function . 
 
79 
 
 8. FLAG.EXPAND_SUBROUTINES 
 
 Enables a program which expands in line external 
 subroutines called in the program. 
 
 9. FLAG.FORWARD_SUBSTITUTE_SUBSCRIPT 
 
 Enables scalar expression forward substitution into 
 subscripts. FLAG . RENAME_SCALAR must also be set. 
 
 10. FLAG.FORWARD_SUBSTITUTE__IF 
 
 Enables scalar expression forward substitution into 
 IF conditions. FLAG . FORWARD_SUBSTITUTE_SUBSCRlPT 
 must also be set . 
 
 11. FLAG.GENERATE_CODE 
 
 Enables code generation. FLAG. SEGMENT must also be 
 set . 
 
 12. FLAG.GRAPHICS_PARTITION 
 
 When set, the program will be Pi-partitioned , 
 before segmentation, and a file will be created 
 with this information to be used with a graphing 
 program . 
 
 13. FLAG.HASP_SYSTEM_LOG 
 
 When set, a line will be written to the HASP system 
 log, which appears on the first burst page of the 
 
80 
 output, for each program compiled. 
 
 14. FLAG.IFTREE 
 
 Enables IF tree creation. 
 
 15. FLAG.INDUCTION_SUBSTITUTION 
 
 Enables induction variable substitution. 
 
 16. FLAG.INSERT_DATA_CARD 
 
 When set, cards after the JDATA card are used to 
 initialize scalar integer variables. When reset, 
 the $DATA card is ignored. 
 
 17. FLAG. LEXICON 
 
 Enables lexical scanning of the program. 
 
 18. FLAG.LINEARIZE_ARRAY 
 
 Enables a program 
 multi-dimensional arrays. 
 
 which linearizes 
 
 19. FLAG.NORMALIZE_DO 
 
 Enables DO loop normalization. 
 
 20. FLAG. PARALLEL 
 
 Master switch to enable IF removal, IF tree 
 creation, tr iad iza tion , segmentation, and code 
 
81 
 generation . 
 
 21. FLAG.REMOVE_A_IF 
 
 Enables removal of Towle's type A IFs. 
 
 22. FLAG.REMOVE_B_PREFIX_IF 
 
 Enables removal of Towle's type B-prefix IFs. 
 
 23- FLAG.REMOVE_CALL 
 
 When set, CALL statements will be changed into 
 CONTINUE statements as the program is lexically 
 scanned. This may be useful since data dependence 
 around CALL statements is unknown. 
 
 24. FLAG.REMOVE_IO 
 
 When set, input and output statements will be 
 changed into CONTINUE statements as the program is 
 lexically scanned. This may be useful, since 
 input/output is not especially interesting from the 
 viewpoint of speeding things up. 
 
 25. FLAG.RENAME_SCALAR 
 Enables scalar renaming. 
 
 26. FLAG. SEGMENT 
 
 Enables program segmentation, which divides the 
 
82 
 
 program into segments of code. A segment of code 
 is defined as block of statements with only one 
 control entry point, and one control exit point. 
 
 27. FLAG.SERIALIZE_LOOPS 
 
 When set, the user may specify any DO loops he does 
 not want distributed by using the JLOOP card. 
 
 28. FLAG. STANDARD 
 
 A master switch controlling DO loop normalization, 
 scalar renaming, scalar forward substitution, 
 induction variable substitution, scalar expansion, 
 array expansion, array linearization, subscript 
 standardization, and subscript cleaning. 
 
 29- FLAG.STANDARDIZE_SUBSCRIPT 
 
 Enables subscript standardization, which transforms 
 subscripts into parenthesis-free expressions, when 
 
 possible . 
 
 30. FLAG. STATISTICS 
 
 Enables the statistics collection. When set, the 
 global statistics collected will be saved for later 
 processing. FLAG. BOUNDS must also be set. 
 
 31. FLAG. TRIAD 
 
 
83 
 
 Enables triadization . Tr iadization reduces 
 assignment statements to three address code, with a 
 result, and two operands. This is in preparation 
 for code generation. 
 
 C.2 PRINT switches 
 
 A PRINT switch is used to enable or disable the printing of 
 the program after each pass. The following PRINT switches 
 are available: 
 
 1. PRINT. AFTER_IF_REMOVAL 
 
 Enables the printing of the program after removing 
 type A and type B-prefix IFs. 
 
 2. PRINT. CLEAN_IF 
 
 Enables the printing of the program after IF 
 pattern matching. 
 
 3. PRINT. CLEAN_SUBSCRIPT 
 
 Enables the printing of the program after subscript 
 cleaning . 
 
 4. PRINT. CODE 
 
84 
 
 Enables the printing of the code generated by the 
 compiler . 
 
 5. PRINT. DISTRIBUTE_LOOP 
 
 Enables the printing of the program after 
 distributing loops. 
 
 6. PRINT. DURING_IF_REMOVAL 
 
 Enables the printing of the program after each pass 
 of IF removal. Notice that each pass will remove 
 at most one IF from each DO loop. 
 
 7. PRINT. EXPAND_ARRAY 
 
 Enables the printing of the program after array 
 expansion . 
 
 8. PRINT. EXPAND_SCALAR 
 
 Enables the printing of the program after scalar 
 expansion . 
 
 9. PRINT. FORWARD_SUBSTITUTE 
 
 Enables the printing of the program after scalar 
 forward substitution. 
 
 10. PRINT. GENERATE_CODE 
 
 Enables the printing of each segment of the program 
 
85 
 as code is generated for that segment. 
 
 11. PRINT. GRAPHICS_PARTITION 
 
 Enables the printing of the program after 
 Pi-partitioning for the graphing program. 
 
 12. PRINT. IFTREE 
 
 Enables the printing of the program after creation 
 of IF trees. 
 
 13- PRINT. INDUCTION_SUBSTITUTE 
 
 Enables the printing of the program after induction 
 variable substitution. 
 
 14. PRINT. LEXICON 
 
 Enables the printing of the program just after it 
 has been lexically scanned, which will show all the 
 cosmetic changes made to the program. 
 
 15. PRINT. LINEARIZE_ARRAY 
 
 Enables the printing of the program after array 
 linearization . 
 
 16. PRINT. NORMALIZE_DO 
 
 Enables the printing of the program after DO loop 
 normalization . 
 
86 
 
 17. PRINT. PARALLEL_VERSION 
 
 Enables the printing of the entire parallel version 
 of the program, after all transformations are done. 
 
 18. PRINT. RENAME_SCALAR 
 
 Enables the printing of the program after scalar 
 renaming . 
 
 19. PRINT. SEGMENT 
 
 Enables the printing of the program after 
 segmentation . 
 
 20. PRINT. SERIAL_VERSION 
 
 Enables the printing of the program before any 
 transformations have been done, but after 
 subroutines and statement functions have been 
 expanded. This is the version of the program being 
 compiled . 
 
 21. PRINT. SH0RT_C0DE 
 
 Enables the printing of a short one line 
 description of each element of code generated for 
 the program, rather than a complete description. 
 
 22. PRINT. SOURCE 
 
 Enables the printing of the original source 
 
87 
 program . 
 
 23- PRINT. STANDARDIZE_SUBSCRIPT 
 
 Enables the printing of the program after subscript 
 standardization. 
 
 24. PRINT. TRIAD 
 
 Enables the printing of the program after 
 tr iadization . 
 
 C.3 OPTIONS 
 
 Other compiler options are kept in the structure OPTION. 
 These are switches and numeric values, used in the 
 compilation process. 
 
 1. OPTION. CHECK_DO_BOUND 
 
 When set, DO loop limits will all be compared to 
 the limit stored in OPTION . DO_BOUND , and will be 
 reduced to this value, if greater. 
 
 2. OPTION. COUNT_STORE_OPERATION 
 
 When set, a store to a variable will be counted as 
 an operation, similar to a multiply or add 
 
88 
 
 operation. This is used in the speedup 
 calculations . 
 
 3. OPTION.COUNT_STORE_TEMPORARY_OPERATION 
 
 When set, a store to a compiler temporary will be 
 counted as an operation as above. 
 
 4. OPTION. DEFINE_SCALAR 
 
 When set, any undefined scalar found inside of 
 subscripts will be assumed to have a default value, 
 which is in OPTION . SCALAR_VALUE . 
 
 5. OPTION. DO_BOUND 
 
 This is the default DO loop limit, used for speedup 
 calculations, whenever the actual limit cannot be 
 discovered . 
 
 6. OPTION. IFTREE_THRESHOLD 
 
 This is the number of assignment statements per IF 
 statement allowed when creating IF trees. 
 
 OPTION. ISOLATE_CALL 
 
 When set, CALL statements are isolated in the data 
 dependence graph. That is, CALLs are not 
 considered in the computation of data dependence. 
 
89 
 
 8. OPTION. SCALAR_VALUE 
 
 This is the default value for scalars found inside 
 of subscripts. 
 
 9. OPTION. SERIALIZE_CALL 
 
 When set, any DO loop containing a CALL statement 
 is automatically serialized, not distributed. 
 
 10. OPTION. SER I ALIZE_FULL_RECURRENCE 
 
 When set, a full recurrence is executed serially, 
 rather than generating code to solve the recurrence 
 the fastest known way. 
 
 11. OPTION. SET_D0_B0UND 
 
 When set, all DO loops upper bounds are set to the 
 value in OPTION . D0_B0UND , for the purposes of 
 speedup calculation. 
 
 12. OPTION. SPECIAL^LISTING 
 
 When set, a special summary listing of the IFs and 
 recurrences for the programs compiled is produced 
 for easy tabulation. 
 
 C.4 DEBUG switches 
 
 In addition, there are DEBUG switches for each program in 
 
90 
 
 the compiler. Generally speaking, there is one switch per 
 program, which can be set by a card: 
 
 DEBUG. programs ■ 1 'B 
 Most users should not need to use DEBUG switches. 
 
BLIOGRAPHIC DATA 
 EET 
 
 Title and Subtitle 
 
 1. Report No. 
 
 UIUCDCS-R-78-929 
 
 TECHNIQUES FOR IMPROVING THE 
 INHERENT PARALLELISM IN PROGRAMS 
 
 Author(s) 
 
 Michael Joseph Wolfe 
 
 3. Recipient's Accession No. 
 
 5. Report Date 
 
 July, 1978 
 
 8. Performing Organization Rept. 
 
 No. UIUCDCS-R-78-929 
 
 Performing Organization Name and Address 
 
 University of Illinois at Urb ana- Champaign 
 Department of Computer Science 
 Urbana, Illinois 61801 
 
 10. Project/Task/Work Unit No. 
 
 11. Contract /Grant No. 
 
 US NSF MCS73-07980 
 
 Sponsoring Organization Name and Address 
 
 National Science Foundation 
 Washington, D. C. 
 
 Supplementary Notes 
 
 13. Type of Report & Period 
 Covered 
 
 Master's Thesis 
 
 14. 
 
 Abstracts 
 
 Much work has been done recently in the area of improving the performance of a 
 serial program on a parallel machine. This document describes some of this work as 
 has been implemented in the FORTRAN ANALYZER as of summer, 1977. 
 
 Key Words and Document Analysis. 17o. Descriptors 
 
 DO- loops 
 
 Induction variables 
 
 Recurrences 
 
 Scalar expansion 
 
 Vectorizing compilers 
 
 Vector operations 
 
 Identifiers /Open-Ended Terms 
 
 • COSATI Field/Group 
 Availability Statement 
 
 Release Unlimited 
 
 19. Security Class (This 
 Report) 
 
 UNCLASSIFIED 
 cuntv Class (Thi 
 
 20. Security Class (This 
 Page 
 
 UNCLASSIFIED 
 
 21. No. of Pages 
 
 95 
 
 22. Price 
 
 IM NTIS-38 (10-70) 
 
 USCOMM-DC 40329-P7 1