Report No. UIUCDCS-R-75-767

PARALLEL PROCESSING OF ORDINARY PROGRAMS*

by

David J. Kuck

November 1975

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the National Science Foundation under Grant No. US NSF DCR73-07980 A02.

ABSTRACT

This paper is concerned with executing programs on a number of processors. Various machine configurations are discussed. Algorithms for transforming programs are given and the results of transforming a number of real programs are presented. The relationships between machine organization and program organization are emphasized.

Key Words

Array operations
Parallel machines
Pipeline machines
Program transformations
Recurrence methods
Speedup bounds

TABLE OF CONTENTS

1. Introduction
2. Theoretical Fundamentals
3. Program Analysis
4. Machine Considerations
5. Conclusions
References

LIST OF FIGURES

1. Expression of 8 Atoms
2. Tree-Height Reduction by Associativity
3. Tree-Height Reduction by Commutativity
4. Tree-Height Reduction by Distributivity
5. Program Graph
6. Distributed Graph
7. Speedup vs. Processors
8. Overall Machine Organization
9. One-Dimensional Array Storage
10. Straight Storage (m = 4)
11. Skewed Storage (m = 4)
12. Skewed Storage (m = 5)
13. Skewed Storage (m = 2n = 8)
14. S_j, C_P Classes

LIST OF TABLES

1. 1973 CACM DO Loop Summary

1. Introduction

In the 1965-75 decade we have seen a number of changes in computer technology and computer use. Integrated circuits have arrived and with them have come large, fast semiconductor memories; microprocessors which can be used as components; and the potential for a variety of new system architectures. Users of computers in this period have become quite concerned about the reliability of their hardware and software. They have also come to expect computer services to fit their needs, whether this be through a personal minicomputer, a supercomputer at the other end of a network, or a special purpose computer in their computer center.

In the midsixties, there were many debates about which direction computer organization would go. Stacks vs. registers, binary vs. hexadecimal, time sharing vs. batch processing vs. remote batch were all being discussed. Whether fast computers should be organized as multiprocessors, array processors, pipeline processors, or associative processors was widely discussed. The discussions were often mainly emotional, with no substantive arguments to back them up. Not surprisingly, everybody won in many of these contests. Indeed, computers are wonderfully flexible devices and can be twisted into many forms with great success. This is not to say that every new computer organization is a good idea and will survive.
In fact, in this decade the entire computer divisions of several major companies have failed. Nor is it to say that we lack ideas of universal applicability. As examples, hierarchies of virtual memory and microprogrammed control units have at last been adopted, if not discovered, by just about everybody.

In the midseventies, one can still hear a number of debates. Some of them have not changed from the midsixties. There are also new ones about how to make reliable hardware and software, how to bring computer services to ordinary people, and how to exploit ever higher levels of integrated circuit technology. The latter subject obviously provides one of the cornerstones of the whole subject of computer design and use. While circuit speeds have improved in the past decade, their costs have improved even more. Thus designers can afford to use more and more gates in computer systems. But some of the traditional design considerations have changed. At the present time, printed circuit board and wire delays often dominate gate delays in system design. Thus computer organization itself would now seem to be a more important question than circuit type and gate count minimization at the hardware level of computer design.

It is also clear that by properly organizing a machine, various software features can be more easily supported. Stacks, extra bits, special registers and instructions are examples of this. And the availability of low cost integrated circuits makes all of these feasible, even in low cost machines.

Besides these hardware and software considerations, a computer designer must worry about what applications are to be served by his machine. In the days of truly general purpose computers, "Any color you want as long as it's black," was sufficient. But general purpose machines in this sense have disappeared. Even the IBM 360 series provided specialization in terms of size and speed. And the list of specialization improvements over that series is very long. It ranges from minis to supers, from scientific to banking, and from real time to background specializations.

Some people would argue that software design is a much more important question than computer system design. As support they would offer the skyrocketing costs of software and the sharply dropping hardware costs. However, these observations probably support the converse position even more. In view of decreasing hardware costs, how can we better organize machines so that software and applications are better and less expensively handled? Indeed, software is not an end in itself.

One of the main points of confusion and disagreement over the design of better hardware and software has always been in deciding on goals. People of one background or another tend to have biases of one kind or another. And an entire computer and software system has so much complexity that it is difficult for one person to think about all the interrelating details. Even if a "perfect" system could be conceptualized by one person, he could not build it by himself. Many people must be involved. Indeed this has been the downfall of many systems, the first such having been Charles Babbage's Analytical Engine.

The above remarks are well understood by any serious hardware or software designer. A point that is not so well understood, or at least is widely ignored, is that a good system's main goal is to serve its end users well.
Many computers and many software systems are designed from beginning to end with only a nod to the ultimate users, the main design goals being whatever "improvements" the designers can make over their previous designs. Such "improvements" are often intrinsic to the hardware or software and may or may not be reflected in what the end user sees.

The standard way to design any big system, hardware, software, highway or bridge, is to break it up into hierarchies and subparts. When parts which are analytically tractable are found, the proper analysis provides a solution for that part. Other parts are solved using the best intuition the designer has. A key question in improving the system design procedure would seem to be the following. How can we integrate users' problems into the design procedure? The answer to this is not obvious. Usually, users do not know exactly what they want. They often know what was wrong with their previous system. But solving these problems is often similar to making improvements of an intrinsic hardware or software nature as mentioned above. They do not lead to a global, system improvement. A partial answer to what the user wants may be found in looking at the programs he runs. While these may not reflect exactly the algorithms he wishes to run, they at least provide an objective measure of what he is doing. If a system could handle these programs well, then at least for a short while, the user would be happy.

In this paper we will consider several aspects of the problem of integrating the analysis of users' programs with the design of computer systems. To tackle the problem in a concrete way, it is reasonable to restrict our initial study. We will first deal with Fortran programs because there are many of them in existence and because the language is quite simple. Our primary design goal will be the fast execution of programs. This goal is indeed probably the primary objective of "users" ranging from computer center managers who want high system throughput to individual programmers who want fast turnaround. Of course, reliability, ease of use, quality of results and so on are also important, but we will deal with one problem at a time. The ideas we will discuss are applicable to programs in languages other than Fortran--we use it because it is popular and of relatively low level--as we shall mention in section 5.

Speed Limits

What are the factors that limit the speed of machine computation? Or, to sharpen the question a bit: Given a set of programs, what determines how fast they can be executed? Basically, there are two aspects to the answer. One aspect concerns the physics of the hardware being used. The other aspect concerns the logic of the machine organization and the program organization.

The physical limits on computer speed are rather widely understood. Gates can switch in some fixed time, and we must pay for the sum of a number of gate delays to perform various computer operations. Additionally, signals must propagate along printed circuit boards and on wires between boards, and these delays are often larger than gate delays. We shall not concern ourselves with these questions. Rather, we shall assume that circuit speeds are fixed and consider the logical problems involved.

The logic of machine organization has been studied for many years. Greater machine speed through simultaneity or parallelism has been a subject of much study. Parallelism has been used at many levels, between bits, between words, between functions, and so on.
Shortly we shall give more details of this. The relations between the organization of a machine and the organization of a program to be executed on that machine have not been studied much. Of course, the compilation of a Fortran program for execution on a serial computer is a special case of this. But compiler theory has mainly been developed with respect to languages. The semantic or machine related aspects of compilation are usually handled in ad hoc ways. Beyond this, we are really concerned with the syntactic transformation of algorithms, given in the form of programs, into forms which exhibit high amounts of parallelism. At the same time, we are interested in clarifying what kinds of machine organizations correspond to the parallel forms of programs. Thus we seek program transformations and machine organizations which together allow for high speed and efficient execution of any given serial programs.

A properly developed theory will have a number of benefits. For one thing, it will allow us to see what the logical limits of computation speed are (in contrast to the physical limits), and to see how close to them we are operating. It will also give us constructive procedures for designing machines and compilers for those machines. Another benefit, which we discuss below, is that we can obtain a unified approach to logic design and compiler design, since abstractly, many of the problems are identical.

Our analysis of Fortran-like programs can be carried out at several levels. First, we can consider the elementary statements in the language, e.g., assignment, IF, DO, etc. Then we can consider whole programs and see how these statements fit together. This can be done at an abstract level and also by studying real programs using the abstract theory. The paper also contains several discussions of algorithms in other programming languages, but we will not develop these points at much length here.

With a good understanding of the structure of programs behind us, it is proper to consider machine organizations. In this paper we will mainly discuss processor, switch and primary memory design. Control units and memory hierarchies are being studied in a similar way, but these areas are not as well developed at this point.

Our long term objective is to develop methods for the rational design of computer systems which are well matched to the classes of programs they are to execute. By developing our ideas theoretically, we can see what our ultimate objectives in terms of bounds might be. We can also observe that several dissimilar aspects of computer system design consist of ideas which are identical at a theoretical level. Thus a coherent body of theoretical material can be used at the logic design level and also at the compiler design level.

Logic Design and Compiler Uniformity

To give an intuitive overview of our ideas, let us begin with a few simple examples. The basic question is: how fast can we carry out certain functions? First, consider the problem of performing logical operations on two n-bit computer words a = (a_1 ... a_n) and b = (b_1 ... b_n). If we have a = (101101) and b = (011001), then the result of a logical OR, defined bitwise as (a_i + b_i), is c = (111101), and the result of a logical AND, defined bitwise as (a_i · b_i), is d = (001001). Note that either the AND or the OR function is performed on pairs of bits of a and b, independently of all other bits in the words.
Hence it is obvious that either of these functions can be computed in a time (say, one gate delay) which is independent of the number of bits involved. This assumes that we can use as many gates as we need; in this case the number of AND and OR gates will be proportional to n.

Now let us turn our attention to arithmetic operations rather than logical operations. Again consider a = (a_1, ..., a_n) and b = (b_1, ..., b_n), but now let the a_i and b_i be full computer words, each representing a number. If we have a = (3,5,2,1,0,7) and b = (1,2,3,4,5,6), then the result of a vector add is c = (4,7,5,5,5,13) and the result of a vector multiply is d = (3,10,6,4,0,42). Just as in the logical case, we can perform all of the above arithmetic operations independently of one another. Thus, regardless of n, the dimension of the vectors, we can form c in one add time or d in one multiply time. This assumes that we have n adders or n multipliers available.

Next, let us consider some more difficult problems at the bit and arithmetic level. Suppose we have one computer word a = (a_1 ... a_n) in which bit a_i corresponds to the occurrence of some event. In other words, let a_i = 1 if event e_i has occurred and a_i = 0 otherwise. Further, let b and c be one-bit indicators defined as follows. If any of events e_1 ... e_n have occurred, we want b to be 1, otherwise b = 0. And if all of events e_1 ... e_n have occurred, we want c to be 1, otherwise c = 0. What are the fastest possible designs for logical circuits which compute b and c?

It is intuitively clear that these problems are more difficult than those we discussed above, namely, the pairwise logical AND and OR problems. Here, all bits in the word a are involved in the computation of b and c. A simple way to solve this, which also turns out to be the fastest, is the following. To form b, we compute a_1 + a_2, a_3 + a_4, ..., a_{n-1} + a_n simultaneously using n/2 OR gates (assuming n = 2^k). Then we compute (a_1 + a_2) + (a_3 + a_4) and so on, fanning in the result to a single result bit b, as shown in Figure 1. If we replace the logical OR by a logical AND in the above discussion, we form the result c. It is not difficult to prove that for such problems, this kind of fan-in approach yields the best possible result. The technique is useful in many logic design problems.

Now we consider the arithmetic problems corresponding to the above logic design questions. If a = (a_1, ..., a_n) is a vector of n numbers stored in a computer, suppose we want to compute the sum and product

    b = Σ_{i=1}^{n} a_i ,    c = Π_{i=1}^{n} a_i .

Instead of dealing with gates, we must now consider adders or multipliers as our basic building blocks. Again, the best solutions to these problems are obtained by simply fanning in the arguments. The tree of Figure 1 illustrates this. If we now interpret the + as an arithmetic addition, the result is b, and similarly for c.

We see from the above discussion that for a class of computations which require the interaction of more than two data elements, more time is required than was needed by our first type of computation. In particular, for these calculations, if n arguments are involved then we need ⌈log n⌉ operation times to compute the result. An operation time may be a logical OR or AND, or it may be an arithmetic add or multiply.

Finally, let us consider an even messier kind of computation. Suppose we have two n-bit words a = (a_1 ... a_n) and b = (b_1 ... b_n) and we want to compute

    c_i = 0    for i = 0 ,
    c_i = a_i + b_i · c_{i-1}    for 1 ≤ i ≤ n .
This may seem to be a strange logic design problem. It is not very unusual, however, since it forms the heart of the design of a binary adder. In particular, this recurrence relation accounts for carry generation, which is the main time delay in an adder circuit. How much time is required to compute the vector of carry bits c? The solution of this problem is not as obvious as were the solutions of our earlier problems. Before discussing how to solve it, let us consider an analogous arithmetic problem.

It frequently occurs in numerical computation that we wish to evaluate a polynomial

    P(x) = a_n + a_{n-1} x + a_{n-2} x^2 + ... + a_0 x^n .

Traditionally, we are told to do this using Horner's rule

    h(x) = a_n + x(a_{n-1} + x(a_{n-2} + ... + x(a_1 + x a_0) ...))     (1)

since it requires only O(n) operations. This can be restated as a program of the form

    P = A(0)
    FOR I = 1 TO N
      P = A(I) + X * P

which computes the expression h(x) from the inside out in N iterations.

It is clear that both problems share an important property. Both are recurrences in which step i depends on the result which was computed during step i-1. This may initially give us the sinking feeling that no speedup is possible here. To show that O(n) steps are not required to compute such linear recurrence functions is in general a nontrivial problem which has been studied in many forms. We will give more attention to this at the program level later. At the logic design level, it is discussed in Chen and Kuck [21], where algorithms are given for transforming any linear sequential circuit specification into a fast combinational circuit. Time and component bounds are given for such circuits as adders, multipliers and one's position counters, which compare favorably with those derived by traditional methods.

We can derive two important theoretical problems from the above. One is tree-height reduction for arithmetic expression parse trees or for combinational logic expressions. The other is the fast solution of linear recurrences derived either from programs or from logic design problems. If we have fast, efficient procedures for solving both types of problems in a parallel way, we will have a good understanding of important theoretical aspects of compiler writing and logic design automation, respectively. In section 2, we will discuss some details of these problems. It can, in fact, be shown that any arithmetic expression or simple linear recurrence containing O(n) arguments can be solved in O(log n) time steps. The width of the tree (number of processors) needed is just O(n) in either case. Thus for any of these calculations, which require O(n) time steps if performed serially, we can speed them up by a factor of O(n/log n), using just O(n) processors. As we saw above, some computations can be speeded up even more (e.g., by O(n)). And as we shall see in section 2, the best known speedups for some computations are much less than this.

Using this theoretical background, we will turn our attention to the analysis of whole programs in section 3. We will consider transformations of blocks of assignment statements, loops, conditional statements and program graphs. Algorithms for such transformations, as well as resulting time and processor bounds, will be discussed. Such algorithms can serve as the basis for program measurement to aid in the design of effective machines. They can also be used as a model of a compiler for parallel or pipeline computers.
Above the bit-level of logic design, these are our primary motivations, but we can also interpret our work in several other ways. Since we are really engaged in a study of the structure of programs, our results seem useful in the contexts of structured programming (e.g., since we remove GOTOs) and also memory hierarchy management (e.g., since we can reduce program page space-time products).

In section 4, we discuss some aspects of real computers, including the accessing, aligning, and processing of data. We also sketch some results from our analysis of a number of real Fortran programs. To relate parallel machine organizations to algorithm organizations, we give a cost/effectiveness measure and a number of examples of its use.

2. Theoretical Fundamentals

The two basic building blocks of any numerical program are arithmetic expressions and linear recurrences. In this section we will give upper bounds on the time and number of processors needed for the fast parallel evaluation of both of these. In order to keep our discussion simple, we will (in this section) ignore memory and data alignment times as well as control unit activity. Although we will bound the number, we assume as many processors as needed are available. We will assume that each arithmetic operation takes one unit of time.

Our recurrence methods allow all processors to perform the same operation at the same time. This SIMD (single instruction, multiple data [31]) operation is the simplest for a parallel or pipeline machine organization to perform. Our arithmetic expression bounds assume that different processors can perform different operations at the same time. This MIMD (multiple instruction, multiple data [31]) behavior assumes a more complex control unit. However, it is obvious that the bounds need be adjusted by only a small constant to allow them to be used for SIMD machines. In the worst case we can assume a machine which simply cycles through each of the four arithmetic operations on each "macro-step", although more delicate schemes are easy to devise. In any case, most of our speedup in most programs comes from the speedup of linear recurrences. Subscripted arithmetic expressions (which are not recurrences) inside loops can simply be handled as trees of arrays, so SIMD operation holds.

We will give a number of results about tree-height reduction first. This theory has been well developed and we will give more results than are justified for practical compilation. But we will indulge ourselves a bit, since the material is interesting in an abstract sense, at least.

If T_p is the number of unit time steps required to perform some calculation using p ≥ 1 processors, we define the speedup of the p processor calculation over a uniprocessor as

    S_p = T_1 / T_p ≥ 1

and we define the efficiency of the calculation as

    E_p = S_p / p ≤ 1

which may be regarded as actual speedup divided by the maximum possible speedup using p processors. For various computations we will discuss the maximum possible speedup known according to some algorithm, and in such cases we use P to denote the minimum number of processors known to achieve this maximum speedup. In such cases we will use the notation T_P, S_P and E_P to denote the corresponding time, speedup and efficiency, respectively. Time and processor bounds for some computation A will be expressed as T_P[A] and P[A] in the minimum time cases and T_p[A] in the restricted processor (p < P) case.
When no ambiguity can result, we will write T[A] or just T in place of T_P[A], and P in place of P[A], for simplicity. We write log x to denote log_2 x and ⌈x⌉ for the ceiling of x.

Arithmetic Expression Tree-Height Reduction

Now we consider time and processor bounds for arithmetic expression evaluation. We restrict our attention to transforming expressions using associativity, commutativity and distributivity, which leads us to speedups of O(n/log n) at efficiencies of O(1/log n). Since this is asymptotic to the best possible speedup, more complex transformations (e.g., factoring, partial fraction expansion) seem unnecessary.

Definition 1

An arithmetic expression is any well-formed string composed of the four arithmetic operations (+, -, *, /), left and right parentheses, and atoms which are constants or variables. We denote an arithmetic expression E of n distinct atoms by E<n>.

If we use one processor, then the evaluation of an expression containing n operands requires n - 1 units of time. But suppose we may use as many processors as we wish. Then it is obvious that some expressions E<n> may be evaluated in ⌈log n⌉ units of time, as illustrated in Fig. 1. In fact, we can establish, by a simple fan-in argument, the following lower bound:

Lemma 1

Given any arithmetic expression E<n>,

    T[E<n>] ≥ ⌈log n⌉ .

On the other hand, it is easy to construct expressions E<n> whose evaluation appears to require O(n) time units regardless of the number of processors available. Consider the evaluation of a polynomial by Horner's rule as in section 1. A strict sequential order is imposed by the parentheses in Eq. 1, and more processors than one are of no use in speeding up this expression's evaluation.

However, we are not restricted to dealing with arithmetic expressions as they are presented to us. For example, the associative, commutative, and distributive laws of arithmetic operations may be used to transform a given expression into a form which is numerically equivalent to the original but which may be evaluated more quickly. We now consider examples of each of these.

Fig. 2a shows the only parse tree possible (except for isomorphic images) for the expression (((a + b) + c) + d). This tree requires three steps for its evaluation and we refer to this as a tree height of three. However, by using the associative law for addition we may rearrange the parentheses and transform this to the expression (a + b) + (c + d), which may be evaluated as shown in Fig. 2b with a tree height of two. It should be noted that in both cases, three addition operations are performed.

Fig. 3a shows a parse tree for the expression a + bc + d; again we have a tree of height three. In this case the tree is not unique, but it is obvious that no lower height tree can be found for the expression by use of associativity. But by use of the commutative law for addition we obtain the expression a + d + bc and the tree of Fig. 3b, whose height is just two. Again we remark that both trees contain three operations.

Now consider the expression a(bcd + e) and the tree for it given in Fig. 4a. This tree has height four and contains four operations. By use of associativity and commutativity, no lower height tree can be found. But, using the arithmetic law for the distribution of multiplication over addition, we obtain the expression abcd + ae, which has a tree of minimum height three, as shown in Fig. 4b. However, unlike the two previous transformations, distribution has introduced an extra operation; the tree of Fig. 4b has five operations compared to the four operations of the undistributed form.
Having seen a few examples of arithmetic expression tree-height reduction, we are naturally led to ask a number of questions. For any arithmetic expression, how much tree-height reduction can be achieved? Can general bounds and algorithms for tree-height reduction be given? How many processors are needed? To answer these questions, we present a brief survey of results concerning the evaluation of arithmetic expressions. Details and further references may be found in the papers cited.

Assuming that only associativity and commutativity are used to transform expressions, Baer and Bovet [2] gave a comprehensive tree-height reduction algorithm based on a number of earlier papers. Beatty [7] showed the optimality of this method. An upper bound on the reduced tree height, assuming only associativity and commutativity are used, given by Kuck and Muraoka [43], is the following.

Theorem 1

Let E<n> be any arithmetic expression with depth d of parenthesis nesting. By the use of associativity and commutativity only, E<n> can be transformed such that

    T_P[E<n>] ≤ ⌈log n⌉ + 2d + 1    with    P ≤ ⌈n/2⌉ .

Note that if the depth of parenthesis nesting d is small, then this bound is quite close to the lower bound of ⌈log n⌉. The complexity of this algorithm has been studied in [13], where it is shown that in addition to the standard parsing time, tree-height reduction can be performed in O(n) steps.

Unfortunately, there are classes of expressions, e.g., Horner's rule polynomials or continued fractions, for which no speed increase can be achieved by using only associativity and commutativity. Muraoka [54] studied the use of distributivity as well as associativity and commutativity for tree-height reduction and developed comprehensive tree-height reduction algorithms using all three transformations. An algorithm which considers operations which take different amounts of time is presented by Kraska [37]. Bounds using associativity, commutativity and distributivity have been given by a number of people [12,42,53]. In [12] the following theorem is proved.
If any of these four operations take one matrix operation time step, then any matrix expression of n arrays can be evaluated in 61og n matrix operation steps. The coefficient is the sum of three addition times, two multiplication times and one inversion time. Matrix addition and multiplication are straightforward, but the time required to invert a matrix measured in standard arithmetic 21 operations varies, depending on the method used (c.f., [62]). Most arithmetic expressions appearing in real programs have a rather small number of atoms. If the atoms are subscripted, then the arrays may be quite large, but it is usually advisable to evaluate these as a tree of arrays, one array operation at a time. If tree-height reduction techniques are used on such expressions, there are two possibly bad consequences. One is the passing from SIMD to MIMD operation as dis- cussed earlier. The other is that redundant operations are generally introduced, making the overall computation less efficient. However, for expressions outside loops, for unsubscripted expressions inside loops or for expressions of small arrays inside loops, tree-height reduction can be of value. The number of processors required to evaluate such expressions is usually less than are required for recurrence solving, as we shall see later. However, there may be cases where tree-height reduction is desirable, but the number of available processors is wery small. The following results cover this case and are theoretically interesting. Corollary 1 Given any expression E and p processors for its evaluation, by the use of associativity, commutativity and distributivity, E can be transformed such that T p [E] £41og n + 10(n-l)/p . This is a corollary of Theorem 2 and was proved by Brent [12]. This result has been improved by Winograd [69], who shows that if p processors 5n 2 are available, we can evaluate any E in y- + 0(log n) steps. For small p, this result is an improvement on Corollary 1. 22 We have given a number of different upper bounds on time and processors for arithmetic expression evaluation. While the only lower bound (Lemma 1) is naive, the upper bounds are close enough to it for practical purposes. It is clear that the theory is quite well developed and improve- ments on these results will be quite difficult to obtain. We conclude that any arithmetic expression E can be evaluated in 0(log n) steps at an efficiency of 0(1 /log n). 23 Recurrence Relations Linear recurrences share with arithmetic expressions a role of central importance in computer design and use. But they are somewhat more difficult to deal with. While an expression specifies a static computational scheme for a scalar result, a recurrence specifies a dynamic procedure for computing a scalar or an array of results. Linear recurrences are found in computer design, numerical analysis and program analysis, so it is important to find fast, efficient ways to solve them. Recurrences arise in any logic design problem which is expressed as a sequential machine. Also, almost every practical program which has an iterative loop contains a recurrence. While not all recurrences are linear, the vast majority found in practice are, and we shall concentrate first on linear recurrences. We shall begin with several examples. First, consider the problem of computing an inner product of vectors a = (a,,..., a ) and b = (b,,...,b ). 
This can be written as a linear recurrence of the form

    x = x + a_i b_i ,    1 ≤ i ≤ n     (2)

where x is initially set to zero and finally set to the value of the inner product of a and b. As another example of a linear recurrence which produces a scalar result, the evaluation of a degree n polynomial P_n(x) in Horner's rule form can be expressed as

    p = a_i + x p ,    2 ≤ i ≤ n     (3)

where p is initially set to a_1 and finally set to the value of P_n(x).

Techniques to handle both of these recurrences should be familiar from our discussion of expression evaluation. Note that Eq. 2 can be expanded by substituting the right-hand side into itself (statement substitution) as follows:

    x = a_1 b_1
    x = a_1 b_1 + a_2 b_2
    x = a_1 b_1 + a_2 b_2 + a_3 b_3

After n iterations we have an expression which can be mapped onto a tree similar to that of Figure 1. Earlier, we have also discussed polynomial evaluation. Thus by carrying out a procedure similar to the above, we could obtain an expression which could be handled by tree-height reduction. Thus, we would expect that these and similar recurrences could be solved in T_P = O(log n) time steps using P = O(n) processors.

But there are other, more difficult looking linear recurrences. For example, a Fibonacci sequence can be generated by

    f_i = f_{i-1} + f_{i-2} ,    3 ≤ i ≤ n     (4)

where f_1 = f_2 = 1. As another example, consider the addition of two n-bit binary numbers a = a_n ... a_1 and b = b_n ... b_1. The propagation of the carry across the sum can be described by

    c_i = y_i + x_i · c_{i-1} ,    1 ≤ i ≤ n     (5)

where c_0 = 0, x_i = a_i + b_i and y_i = a_i · b_i. Here we use + to denote logical or and · to denote logical and. This is an example of a bit level linear recurrence, in contrast to our previous examples whose arguments were assumed to be real numbers.

In both Eq. 4 and Eq. 5 we are required to generate a vector result because of the subscripted left-hand side. This is in contrast to the scalar results of Eqs. 2 and 3. Because of this, we can expect a good deal more difficulty in trying to obtain a fast, efficient solution to these recurrences.

With the above as an introduction, we now turn to a formalization of the general problem. We will then give bounds for the solution of the general problem and several important special cases.

Definition 2

An m-th order linear recurrence system of n equations, R<n,m>, is defined for m ≤ n by

    x_i = 0    for i ≤ 0 ,
    x_i = c_i + Σ_{j=i-m}^{i-1} a_{ij} x_j    for 1 ≤ i ≤ n .

If m = n we call the system a general linear recurrence system and denote it by R<n>.

Note that we can express any linear recurrence system in matrix terms as x = c + Ax, where c = (c_1, ..., c_n)^t, x = (x_1, ..., x_n)^t, and A is a strictly lower triangular (banded if m < n) matrix with a_{ij} = 0 for i ≤ j or i - j > m. We refer to A as the coefficient matrix, c as the constant vector and x as the solution vector.

It should be observed that the constant vector and coefficient matrix generally contain values which can be computed before the recurrence evaluation begins. Thus, the x_i and y_i values of Eq. 5 would be precomputed from the a_i and b_i. We will assume that the elements of c and A are precomputed (if necessary) in all cases so that our bounds on recurrence evaluation can be simply stated, and that m and n are powers of 2.

How can we solve an R<n> system in a fast, efficient way using many simultaneous operations? The following is a straightforward way which uses O(n) processors to solve the system in O(n) steps.
Column Sweep Algorithm

Given any R<n> system, we initially know the value of x_1. On step 1 we broadcast this value, c_1, to all other equations, multiply it by a_{i1}, and add the result to c_i. Since we now know the value of x_2, this leads to an R<n-1> system which can be treated in exactly the same way. Thus after n - 1 steps, each of which consists of a broadcast, a multiply and an add, and each of which generates another x_i, we have the solution vector x. The method requires n - 1 processors on step 1, and fewer thereafter, so T_P = 2(n - 1) with P = n - 1.

What speedup and efficiency have we achieved by this method? The time required to solve this system using a single processor, which might sweep the array by rows or columns, would be

    T_1 = 2[1 + 2 + ... + (n - 1)] = 2[n(n - 1)/2] = n(n - 1) .

Hence the above method achieves a speedup of

    S_P = n(n - 1) / 2(n - 1) = n/2

with an efficiency of

    E_P = S_P / P = n / 2(n - 1) > 1/2 .

Thus we can conclude that the Column Sweep Algorithm is a reasonable method of solving an R<n> system. But how does it perform in the R<n,m> case for m ≪ n? It can be seen that the Column Sweep Algorithm will achieve S_P = O(m) for an R<n,m> system. So if m is very small, the method performs poorly, particularly if we have a large number of processors available. It should be noted that the m ≪ n case occurs very often in practice. Note that all of our examples (Eqs. 2-5) had m ≤ 2.

What are our prospects for finding a faster algorithm? First, we observe that the total number of initial data values in an R<n,m> system is O(mn). This is the total of the constant vector c and the coefficient matrix A. Assuming that these numbers all interact in obtaining a solution, a fan-in argument (cf. Lemma 1) indicates that we need at least O(log mn) steps to solve an R<n,m> system; since m ≤ n, O(log mn) = O(log n). The Column Sweep Algorithm required O(n) steps, so we still have a big gap in time.

Product Form Recurrence Method

The next theorem is based on an algorithm for the fastest known method of evaluating an R<n,m> system. For large m, the number of processors required is rather large, but for small m, the number of processors is quite reasonable. We also give bounds for the case of a small number of processors; Corollary 4 is particularly important in the case of m < p < P. This theorem's proof can easily be stated in terms of the product form of the inverse of the coefficient matrix A [25], [60]. It is also proved in [19] and [22].

Theorem 3

Any R<n,m> can be computed in

    T_P ≤ (2 + log m) log n - (1/2)(log² m + log m)

with

    P ≤ m²n/2 + O(mn)    for m ≪ n ,
    P ≤ n³/68 + O(n²)    for m ≤ n .

The details of transforming a system to meet this bound are fairly straightforward [19]. We will give a simple example here as a basis for some intuition about how the technique of Theorem 3 works. Consider an R<4,2> system. This method would generate the following expressions for the evaluation of the x_i:

    x_1 = c_1
    x_2 = (c_2 + a_{21} c_1)
    x_3 = (c_3 + a_{31} c_1) + a_{32}(c_2 + a_{21} c_1)
    x_4 = c_4 + (a_{42} + a_{43} a_{32})(c_2 + a_{21} c_1) + a_{43}(c_3 + a_{31} c_1) .

Note that all of the parenthesized expressions can be computed simultaneously in two steps (there are just three distinct ones). Then x_4, the largest calculation, can be completed in three more steps, for

    T_P = (2 + log 2)(log 4) - (1/2)(log² 2 + log 2) = 5 .

This time bound may be achieved using just three processors in this case. But as n grows larger, the number of processors required becomes very large, as shown in the tables of [22].

In practice we may have a machine with a limited number of processors p < P, so Theorem 3 cannot be used directly. Several schemes are available for mapping a computation onto a smaller set of processors and generally increasing the efficiency of the computation as well.
In practice we may have a machine with a limited number of processors p < P so Theorem 3 cannot be used directly. Several schemes are available for mapping a computation onto a smaller set of processors and generally increasing the efficiency of the computation as well. While 30 the techniques described below may be applied to arithmetic expressions as derived from Theorems 1 or 2, the expressions found in typical programs usually do not require enough processors to warrant such reductions [36]. First, we describe a folding scheme which reduces the number of processors at a much faster rate than the computation time increases. The 31 P processor computation for R resulting from Theorem 3 contains log n stages, each stage consisting of many independent tree computations of height (log m+1) resulting from inner products of two m-vectors. Such a tree of height t will contain 2 -1 operation nodes and its evaluation requires 2 " processors. P is the maximum of the total number of proces- sors used at each stage. It is easy to show that given such a tree its height increases only one step by halving the number of processors (called one fold ), and after f folds (f <^ t - 2) are performed the tree height is (t+2 -f-2) while the number of processors is reduced to 2 /2 . If all trees at the same stage are folded uniformly, then this folding scheme can provide us T as stated below. P Corollary g Let R and P be as in Theorem 3. Then if f < log m - 1 and p = TP/2 f l, we have T < T p + (2 f+1 -f-2) log n. Another technique which is useful in mapping any computation onto a limited number of processors p < P is the sweeping scheme [44]. If the i-tri'Step of any parallel computation requires 0. operations using P proces- sors, it can be executed on p processors in leads to the following: 0. i steps. This observation Lemma 2 [12] If a computation C can be completed in T p with p operations on P processors, then C can be confuted in T < T D .+ (0 D -T D )/p for p < P. P — r r r To apply this technique directly on the algorithm of Theorem 3 , the C value can be obtained by the summation of 2»p(k) for k = 2, 4, 8, ...,n, where p(k) is the number of processors required at each stage [19]. The result of this technique can be found in [22]. 32 Our third scheme for reducing the number of processors required for an R system is called the cutting scheme . The idea is to cut the original system into a number of smaller systems and evaluate these in sequence, using the algorithm underlying Theorem 3 on each such system. We have used this scheme in [22], [61], and a detailed proof is given in [25]. Corollary 3 ' Let R and P be as in Theorem 3. Then any R can be computed with 1 < p < P processors in T D i. 2fm/pl (n-1 ) for 1 < p <_ m , < 1^ np 3 (log 2 p+271og p+144) + ^y- for m < p < m 2 , -1 - 72 np 3 0°9 2 P + 271og P+144) for m <_ p <_ m , cL C O "7 ^ <_ B^-^(log m log p+21og 2 p-p- log m-j log m+1) for m < p < P, where $(m,n,p) is a small constant. For most practical R systems in which m is wery small compared to n, if the number of processors is also very limited then a new computational algorithm developed in [19] can be used more efficiently. This method gives the following time bounds. Corollary 4 |_et R and P be as in Theorem 3 . If m < p < P> then any R can be computed in T < (?m 2 +3m)j}+ 0(m 2 log (p/m)) . 33 In summary, for 1 < p < P the time bound for evaluating a ■given R system can be determined by choosing the minimum value obtained from Corollaries 2, 3 and 4. 
Constant Coefficient Recurrences

In numerical computation, we are frequently faced with linear recurrences having constant coefficients, i.e., Toeplitz form matrices. For example, Eqs. 2, 3 and 4 are such recurrences. Thus, Eq. 2 could be rewritten as x_i = 1·x_{i-1} + a_i b_i, and similarly for Eq. 3. Intuitively, we might expect to be able to compute such systems more efficiently than the more general recurrences we have been considering. Indeed, this is the case, as we shall see below. We formalize the problem with the following definition, which should be contrasted with Definition 2.

Definition 3

An m-th order linear recurrence system with constant coefficients of n equations is defined for m ≤ n by

    x_i = 0    for i ≤ 0 ,
    x_i = c_i + Σ_{j=1}^{m} a_j x_{i-j}    for 1 ≤ i ≤ n .

If n = m we call the system a general linear recurrence system with constant coefficients.

The fastest known method for solving such a system can be summarized by the following theorem of Chen [19]. The proof follows the lines of Theorem 3, but avoids computations which are unnecessary due to the constant coefficients.

Theorem 4

Any R<n,m> with constant coefficients can be computed in

    T_P ≤ (3 + log m) log n - (log² m + log m + 1)

with

    P ≤ mn    for m ≪ n ,
    P ≤ n²/4    for m ≤ n .

By exercising special care in avoiding redundant computations, the proof of Theorem 4 can be modified [19] to give us the following.

Corollary 5

Any such R can be computed in

    T_P ≤ 2 log n    with    P ≤ n .

By comparing Theorems 3 and 4, we see that while the time bounds are about the same, substantial processor savings can be made in the constant coefficient case. In the case of small m, we have saved a factor of O(m) in processors, while in the general case we have saved a factor of O(n).

To test the quality of these bounds, we can compare them with some simple calculations. Consider the inner product of Eq. 2. This can obviously be handled using n processors (for the n multiplications) in 1 + log n steps. Since we have been assuming the coefficient matrix and constant vector are set up before the recurrence solution begins, the multiplication is really outside our present scope, so just n/2 processors would be required for the summation. The bound of Corollary 5 is thus high by a factor of two in each of processor count and time for this trivial recurrence. However, the recurrence method produces not only the inner product, but also all of the "partial inner products" x_1, x_2, ..., x_{n-1}, as well as x_n. Chen [19] has also given other variations on the above to handle these special cases of evaluating only the remote terms of recurrences.

As a final example, note that the entire Fibonacci sequence of Eq. 4 can be evaluated (since m = 2) in

    T_P ≤ 4 log n - 3    with    P ≤ 2n .

3. Program Analysis

In this section we discuss techniques for the analysis of whole programs. These techniques can be used to compile programs for parallel or pipeline computers. They can also be used to specify machine organizations for high speed computation. Since we are really just studying the structure of ordinary serial programs, our results have interpretations for ordinary virtual memory machines and structured programming as well.

Our discussion is centered on methods developed and used by the author and his students. Several significant efforts in this general area were carried out elsewhere earlier. We will briefly sketch two of the most important of these. The first major study of whole programs was carried out by Estrin and his students at UCLA in the 1960s.
They studied graphs of programs in an attempt to isolate independent tasks for parallel execution. The lowest level object considered as a task was the assignment statement, while other tasks ranged in complexity up to whole subroutines. A number of papers [29, 50, 59] reported algorithms for the analysis of programs and the results of analyzing a limited number of real programs. This group also worked on various scheduling strategies for executing program graphs.

Another effort was initiated by Bingham, Fisher and Semon at Burroughs in 1966. This group studied various aspects of the structure of programs and developed algorithms for the automatic detection of parallelism. They also investigated certain aspects of machine design related to this--particular attention was given to control unit features. Their work was reported in a series of reports including [10, 11].

Our own effort has incorporated graphs of programs as well as the two key ideas of the previous section--tree-height reduction and fast recurrence evaluation. In this way we can carry parallel execution down to the level of individual operations in assignment statements. This is of course necessary if one wants to execute programs in the fastest possible way.

In order to formalize our discussion of program graphs and their manipulation, we now present a number of definitions.

Definition 4

An assignment statement is denoted by x = E, where x is a scalar or array variable and E is a well-formed arithmetic expression. A block of assignment statements (BAS) is a sequence of one or more assignment statements with no intervening statements of any other kind.

Any BAS can be transformed by a process called statement substitution to obtain a set of expressions which can be evaluated simultaneously. For example, the BAS

    X = BCD + E
    Y = AX
    Z = X + FG

can be evaluated using one processor in 6 steps, ignoring memory activity. By statement substitution we obtain three statements which can be transformed by tree-height reduction to obtain:

    X = BCD + E;  Y = ABCD + AE;  Z = BCD + E + FG .

Since the resulting expressions can be evaluated simultaneously in three steps, we obtain a speedup of 2. By properly arranging the parse trees it may be seen that just five processors are required. Thus we have efficiency E_P = 2/5. In general, the number of processors required to evaluate a set of trees in a fixed number of steps may be minimized using an algorithm of Hu [35]. Note that the speedup here results from two effects: the simultaneous evaluation of independent trees and tree-height reduction by associativity, commutativity and distributivity.
Step 2: A set of Boolean functions, all of which may be evaluated simultaneously. Step 3: A binary decision tree through which one path will be followed for each execution of the program. No Boolean function or arithmetic expression evaluation is included in the tree. Step 4: A collection of blocks of assignment statements, each with a single variable or constant on the right-hand side. One such block is associated with each path through the tree. The details of an algorithm for the discovery and transformation of an IF block to this canonical form are given by Davis [28]. Note that the IF 39 block may be a graph with or without cycles. Such graphs are converted to trees called IF trees in the cited references. Definition 6 A loop is denoted by L • (I v + N ]s I 2 - N 2 , ..., I d - N d )(S r S 2 , .... S s ) or = (1^ I 2 , ..., I d )(S r S 2 , ..., S $ ) where I. is a loop index , N. is an ordered index set , and S. is a body statement j j j which may be an assignment statement, an IF statement or another loop. We use 0UT(S.) and IN(S.) to denote for S. the LHS (output) variable name and the j j j set of RHS (input) variable names, respectively. We will write S.(i\, i 2 , ..., ij) to refer to S. during a particular iteration step, i.e., when the index variables of S, are assigned the specific values I-j = i-j» I 2 = i^' ■••» *d = ^d" If S. is executed before S-, we will write S. < S.. We say that the relation i J i o j < defines the execution order of the statements. If a loop execution leads to the execution of n statements, we sometimes denote their execution order by writing Y-:x. = E. , 1 £ i <_ n, implying that Y. < Y. + , , 1 <_ i <_ n - 1 . Definition 7 Given a loop L = (I, + N, , . . ., I . <- N,)(S, , . . ., S ), all possible data dependencies between statement pairs S and S. are given by 0UT(S 1 (k 1 ,...,k d ))niN(S j U 1 ,...,2 d )) f for S.(k r ...,k d ) < S j (^ ] ,. . . ,* d ) . Whenever this condition is satisfied, we say that S. is data dependent on S.. and is denoted by 5-6S.. 6 is a transitive relation. All of the data ■ j dependencies can be represented by a data dependence g raph G-. of s nodes for Sj, 1 < i < s. For each S.6S. there is an arc from S. to S. . Statement S. is i — — i j • j J 40 Indirectly data dependent on S^ denoted S.AS., if there exist statements S k, * ••*■ S k such that S i 6S k 6 *" S k 5S r Practical details on determining 1 m 1 m J 1f S.6S. can be found in [67]. ' J Our definition of data dependence is much more delicate than the usua definitions [ 9, 30 ]. These definitions include the condition 0UT(S i )|lIN(Sj) f , i.e., they ignore subscripts and only check variable names Thus statements like S^ A(I) = A(I+i") + B are said to be data dependent (S.6S However, by Definition 7 we would not say S i 6S i because the values of A(I+1) are not those from A(I). In terms of Definitions 6 and 7, we can further classify loops as follows. Definition 8 We use D for data dependence relation, to denote the set of loops with at least one S.5S.., 1 < i,j < s . In other words, there is at j least one E k , 1 < k < n, which is a function of x, , for m, > 0. If [e D and none of its S i is a nonlinear function of x., 1 < j < s, we call it a linear dependence and write ULD (LDCD). The complement of D is denoted by D, for non-dependence relation. 41 Definition 7 can be applied to any (d-u+1), 1 <_ u < d, innermost nest of L as it is also a loop. This is described below. Definition 9 Let L u be the (d-u+1) innermost nest of L, 1 <_ u <_ d, I.e.) L = \ l-i , 1«» . . 
. » 1j J ^j-i » bp > . • . > j ; = -(i 1 ,i 2 ,...,i u _ 1 )(i u ,i u+ -|,...,i c j)(s. 1 ,s 2 ,...,s s ) = (I 1 .I 2 .,...I U . 1 )(L U ) . Then for fixed values of I-,, I 2 , ...» I -. » we can obtain all pairs of data dependence for L u according to Definition 8 (note that now k, = £,, ..., k , s *u~l^' which defines graph G u - Example 1 Given a loop L: DO S, DO S. DO S. Sp* B(I-i ,!«» = 1, 10 2 = 1, 10 : 3 -i. io : 3 ) = B(i r i,i 2 ,i 3 )*c(i 1 ,i 2 ) + D*E : 3 ) = a(i 1 ,i 2 -i,i 3 )*f(i 2S i 3 ) > The corresponding data dependence graphs G, , G 2 , and G~ are ft,: G 2 : G 3 : 1 42 For the set of data dependent loops, we can easily distinguish two cases: acyclic and cyclic graphs. Formally, we define these as follows: Definition 10 An acyclic dependence graph is a dependence graph of s nodes, S. for 1 <_ i <_ s, with no pair (S.,S.) such that S-AS. and S-AS.. A dependence graph which is not acyclic will be called cyclic . Given a data dependence graph, we wish to partition it into blocks that contain only one statement or a cyclic dependence graph. Formally, we define these as follows: Definition 11 On each dependence graph, G , 1 <_ u < d, for a given loop L, we define a node parti tion t of {S, ,Sp, . . . ,S } in such a way that S. and S, are in the same subset if and only if S.AS. and S.AS. . On the partition tt = {tt-, , tt ~> •••) for 1 £u <_d, define a partial ordering relatio n a in such a way that tt . a tt . (reflexive), and for i t j, tt . a tt . iff there is an arc in G from some element of tt . to some element of tt . . The a relation is also anti -symmetric and transitive. The tt . are called Tr-blocks. 43 Wave-Front Method If there are cyclic dependencies in a DO loop, we may turn to our next method, the wave-front method. This is a well-known method which effectively extracts array operations from the loop and we can then apply the above bounds to these. If the maximum speedup given by the wave-front method is insufficient, i.e., if the available processors are not all being used, we may turn to the recurrence method which gives the fastest known speedup for such problems. Example 2 L2: DO 10 I = 1, N DO 10 J = 1, N 10 W(I,J) = A(I-1,J) * W(I-1,J) + B(I,J-1) * W(I,J-1) For one or more assignment statements containing cyclic dependencies, the wave-front method yields moderate speedups with high efficiency. The idea of this method can be illustrated by the loop L2 of Example 2 in which statement 10 has a cyclic dependence in that the LHS depends on RHS values computed earlier in the loop. Note that generally, one or more statements may form a cyclic dependence. This method proceeds as follows: if W(l,l) is computed from boundary values, then we can compute W(2,l) and W(l,2) in terms of W(l,l) and boundary values. Next we can compute W(3,l), W(2,2) and W(l,3) and so on, as a wave-front passes through the W array at a 45° angle. Thus o we can compute this loop in 0(N) steps instead of the 0(N ) serial steps required. The wave-front method was first described in detail by Muraoka in [54] and was later used in [44] and was also implemented in [46]. The formalization below removes some of the restrictions included in the original formulation. In [24] a revised wave-front algorithm is presented. This includes 44 a method of determining the angle a at which the wave-front passes through the array. It also includes a method for computing the speedup as a function of a . Note that these ideas can be extended to arrays of higher dimension, as well. 
However, the wave-front method is of no value for one-dimensional arrays, since it degenerates to a serial computation in this case. A similar thing happens if a is slightly greater than 0° or slightly less than 90°. In such cases we may treat the cyclic dependence as a linear recurrence (assuming it is linear).

Loop Speedup Hierarchy
With the above fundamentals, it is possible to give some easy bounds on overall loop speedup in terms of the uniprocessor time T_1. We will present a simple hierarchy here, based on the maximum known speedups for various classes of programs. Sharper bounds will be presented later in the paper, based on more detailed loop parameters. The hierarchy of this section will provide good intuition for the following sections.

The simplest loop is L ∈ D̄, which by Definition 8 has no dependence relation between any pair of statements. Thus, following the notation of Definition 6, all x_i = E_i, 1 ≤ i ≤ n, can be computed in parallel. The following loop, which performs a matrix addition and a scalar product, has this property.

        DO S_2  I_1 = 1, 10, 2
        DO S_2  I_2 = 1, 10, 1
    S_1:    G(I_1,I_2) = A(I_1,I_2) + B(I_1,I_2)
    S_2:    Z(I_1,I_2) = C(I_1,I_2) * D(I_1,I_2)

The total time required by any L ∈ D̄ is, by Theorems 1 and 2, T_p ≤ O(log e), where e is the maximum number of atoms in E_i, 1 ≤ i ≤ n. Hence, we have for L ∈ D̄

    S_p ≥ T_1 / O(log e) = O(T_1 / log e) .

Now, let us study a slightly more complicated loop L ∈ LD, such as one that performs a vector inner product:

        DO 5 I = 1, 10
    5   T = T + A(I)*B(I) .

For any L ∈ LD, if we precompute simultaneously all subexpressions in E_i, 1 ≤ i ≤ n, which do not depend on any value computed in the loop, i.e., on any x_j, 1 ≤ j ≤ n, then the resultant statements x_i = E'_i, 1 ≤ i ≤ n, can be treated as an R system. The precomputation requires O(log e) time steps by Theorems 1 and 2, to which we add the time to solve the R system, which is stated in Theorem 3. Since n ≤ T_1 ≤ ne, we have a speedup for this subset of LD of

    S_p ≥ T_1 / (O(log m log n) + O(log e)) = O(T_1 / log T_1) .

Next, consider the subset of loops which have m = n, or m a function of n. For example, given an upper triangular matrix A, to solve Ax = b by the traditional back-substitution method we may write a loop like

        DO 5 I = 10, 1, -1
        X(I) = B(I)/A(I,I)
        DO 5 J = I + 1, 10, 1
    5   X(I) = X(I) - (A(I,J)/A(I,I))*X(J) .

In this example, if we preprocess B(I)/A(I,I) for all I, and A(I,J)/A(I,I) for all I, J, we obtain an R system. Since m = n, this is the worst-case loop of LD. Hence, we can say that the computation time of any L ∈ LD is less than O(log e) plus the time stated in Theorem 3, i.e., for any L ∈ LD

    S_p ≥ T_1 / (O(log² n) + O(log e)) = O(T_1 / log² T_1) .

Finally, we study a simple-looking, but more complicated loop:

        DO 5 I = 1, 10
    5   X(I) = (X(I-1) + A/X(I-1))/2 .

This is a familiar iterative program for approximating √A. For this loop, L ∈ D but L ∉ LD. Muraoka [54] shows that by using statement substitution, any loop with E_i being a d-th degree polynomial of x_{i-1}, d > 1, can be speeded up at most by a constant factor. Later, Kung also studied this problem [45] in a similar way. However, since we have been able to linearize a number of nonlinear recurrences, it remains an open question which techniques besides statement substitution may be used to speed up such loops.
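For the inner-product loop above, the R system that remains after precomputing the loop-invariant subexpressions is just a running sum, so it can be collapsed by pairwise (fan-in) combination in O(log n) steps. The following Python sketch is my own illustration of that idea, not the report's Theorem 3 algorithm; a and b are assumed to be numpy arrays of equal length.

    import numpy as np

    def parallel_inner_product(a, b):
        """T = sum of A(I)*B(I): first all products at once (one parallel
        step), then pairwise summation in O(log n) further steps."""
        p = a * b                                   # all products, independent
        n = len(p)
        while n > 1:
            half = n // 2
            p[:half] = p[:half] + p[n - half:n]     # independent pairwise adds
            n = half + (n - 2 * half)               # an odd leftover element stays
        return p[0]

Each pass of the while loop performs its additions independently, so with enough processors the whole reduction takes about log2(n) time steps.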
Summarizing the above, we are able to classify all loops in terms of their best known speedups over the serial computation time T_1, i.e.,

    S_p = O(T_1 / log^i T_1)   for 0 ≤ i ≤ 2 .      (6)

We call a loop Type i, 0 ≤ i ≤ 2, if its maximum speedup has the form of Equation 6, or Type 3 if its maximum speedup is of a lower order of magnitude. This was also discussed in [39].

By the wave-front method we are at best able to achieve T_p = O(T_1^{1/2}), with T_p = O(T_1) in the worst case. Thus we have S_p ≤ O(T_1^{1/2}). Since the wave-front method's speedup is always inferior to that of the recurrence method for such problems, this is consistent with our claim that Equation 6 represents a maximum speedup hierarchy.

Loop Distribution
Now we turn to the problem of compiling parallel operations from serial loops which do not contain IFs and GOTOs (we will consider these later). There are two key ideas involved here: one is the reduction of data dependence, the other is the distribution of loop control over the loop statements. We give our Loop Distribution Algorithm, which includes both of them and summarizes our handling of loops without IFs.

When constructing a data dependence graph, we wish to avoid any apparent dependencies which do not really hold for the given subscripts and index sets. This problem was first studied in a general way by Bernstein [9] for unsubscripted variables. A powerful test for subscripted variables was given by Muraoka [54]. This has been refined by Towle in [24] and in [67]. By avoiding the inclusion of spurious data dependencies, we may be able to execute more statements in parallel on machines capable of executing multiple array operations (e.g., the Texas Instruments ASC). Also, we may be able to break cyclic dependencies, thereby reducing i in Eq. 6 and yielding higher speedup.

Another way to achieve statement independence is through statement substitution. This yields increased speedup, sometimes at the cost of redundant operations. It should be used with discretion, and only in machines with a high degree of parallelism. For acyclic graphs, it is easy to demonstrate that we can perform statement substitution between any pair of nodes which have a dependence relation. As in a BAS (c.f. Definition 4), we substitute for each LHS variable of S_i appearing on the RHS of S_j (which is the cause of a dependence relation) the corresponding arithmetic expression on the RHS of S_i, with all subscript expressions properly shifted. By applying statement substitution, the dependence relation is removed and a set of independent assignment statements results. Each of these represents a vector assignment statement, all of which can be executed simultaneously. Theorems 2 and 3 can be used to bound the time and processors. A sketch of this transformation is given below.

In loops with acyclic graphs, we can thus reduce the graph for the entire loop to a set of independent nodes representing simultaneously executable array statements. However, in general, we must deal with cyclic graphs containing several interdependent nodes. Our loop distribution algorithm will be useful in handling these cases. By loop distribution we mean the distribution of the loop control statements over individual assignment statements, or collections of them, contained in the loop. The idea of loop distribution was introduced by Muraoka [54], and it was later implemented in our Fortran program analyzer to measure potential parallelism in ordinary programs [41], [44].
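The following Python sketch illustrates the statement substitution described above on a hypothetical two-statement loop of my own invention (it is not one of the report's examples). S_1 writes A and S_2 reads A(I-1), so S_1 δ S_2; substituting the shifted RHS of S_1 for A(I-1) leaves two independent vector assignments.

    import numpy as np

    def original(B, C, E, N):
        """Serial form: S2 depends on S1 of the previous iteration."""
        A = np.zeros(N + 1); D = np.zeros(N + 1)     # A(0) = 0 comes from outside
        for I in range(1, N + 1):
            A[I] = B[I] * C[I]                       # S1
            D[I] = A[I - 1] + E[I]                   # S2
        return A, D

    def substituted(B, C, E, N):
        """After statement substitution both statements are vector assignments."""
        A = np.zeros(N + 1); D = np.zeros(N + 1)
        A[1:] = B[1:] * C[1:]                        # S1 as one array operation
        D[2:] = B[1:N] * C[1:N] + E[2:]              # S2 with S1's RHS shifted by one
        D[1] = 0.0 + E[1]                            # I = 1 still uses A(0) from outside
        return A, D

Both versions produce the same A and D; the substituted form can be executed as two simultaneous array statements, at the cost of recomputing the products B*C.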
The purpose of distributing a given type i loop is to obtain a set of smaller loops of type j, 0 ≤ j ≤ i, which upon execution give results equivalent to the original loop. This is essentially to reduce i in Eq. 6 (and hence increase speedup) as much as possible. In fact, the loop distribution algorithm, resembling the distribution algorithm for the reduction of tree height of an arithmetic expression, may introduce more parallelism into a program loop than can be obtained from the undistributed one. We now give the algorithm to accomplish this distribution, as presented in [24].

Loop Distribution Algorithm

Step 1  Given a loop, construct a dependence graph G^u (c.f. Definitions 7 and 9) for 1 ≤ u ≤ d by analyzing subscript expressions and indexing patterns.

Step 2  On G^u, 1 ≤ u ≤ d, establish a node partition π^u as in Definition 11.

Step 3  On the partition π^u, 1 ≤ u ≤ d, establish a partial ordering relation α as in Definition 11.

Step 4  Let the (d-u+1) innermost nest of L be L^u, 1 ≤ u ≤ d, i.e.,

    L = (I_1, I_2, ..., I_d)(S_1, S_2, ..., S_s)
      = (I_1, I_2, ..., I_{u-1}){(I_u, I_{u+1}, ..., I_d)(S_1, S_2, ..., S_s)}
      = (I_1, I_2, ..., I_{u-1})(L^u) .

Replace L^u according to π^u with a set of loops {(I)(π_u1), (I)(π_u2), ...} where (I) = (I_u, I_{u+1}, ..., I_d).

The condition of the partial ordering relation α insures that data are updated before being used. Hence, any execution order of the set of loops which replaces L^u will be valid as long as this relation is not violated. Thus, for fixed values of I_1, I_2, ..., I_{u-1}, if π_ui α π_uj then loop (I)(π_ui) must be evaluated before (I)(π_uj); otherwise they may be computed in parallel. In general, we can also use statement substitution to remove this relation between some or all of the distributed loops. But by not allowing statement substitution we have a somewhat simpler compiler technique, one which generally requires fewer processors and yields less speedup.

As an example of the use of loop distribution, consider the following pseudo-Fortran program.

Example 3

        DO 10 I = 1, N
    S_1:    A(I) = B(I) * C(I)
            DO 20 J = 1, N
    S_2:        D(J) = A(I-3) + E(J-1)
    S_3: 20     E(J) = D(J-1) + F
            DO 30 K = 1, N
    S_4: 30     G(K) = H(I-5) + 1
    S_5: 10 H(I) = SQRT(A(I-2))

Following step 1 of the Loop Distribution Algorithm, we obtain the dependence graph shown in Fig. 5, in which brackets are used to denote loop nesting. For simplicity and speedup in this program, we consider only the case u = 1. In step 2, we form the partition π_1 = {π_11, π_12, π_13, π_14} where π_11 = {S_1}, π_12 = {S_2, S_3}, π_13 = {S_4}, and π_14 = {S_5}. These partitions are partially ordered in step 3 as follows: π_11 α π_12, π_11 α π_14, and π_14 α π_13. Since we are considering only the case u = 1 here, we ignore step 4. The result of this transformation is shown in Fig. 6.

We could use this graph to compile array operations as follows. First, S_1 yields a vector multiply. Next, we can execute π_12 or π_14. π_12 leads to a linear recurrence of the form R, which can be solved by the method of Theorem 3 by combining the D and E arrays as an unknown vector in which x_1 represents D(1), x_2 represents E(1), x_3 represents D(2), x_4 represents E(2), etc. π_14 leads to the execution of S_5 as a vector of square roots. Finally, S_4 may be executed for all I and K simultaneously. Note that this requires the broadcasting of elements of the H array to all elements in the columns of G. Here, the time required to execute π_11, π_13, and π_14 is independent of N using O(N) processors.
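The Python sketch below is my own rendering of Example 3 after distribution with u = 1; it is not the report's compiler output. It assumes every array has length at least N + OFF + 1, with conceptual index i stored at position i + OFF so that the boundary elements assumed to exist before the loop (A(-1), A(0), H(-4), ..., D(0), E(0)) are ordinary array slots. The π_12 recurrence is left serial here; the report would solve it by the method of Theorem 3.

    import numpy as np

    OFF = 8   # index offset so boundary elements like A(-1) or H(-4) are real slots

    def distributed_example3(A, B, C, D, E, F, G, H, N):
        idx = np.arange(1, N + 1) + OFF

        # (I)(pi_11): one vector multiply
        A[idx] = B[idx] * C[idx]

        # (I)(pi_14): one vector of square roots; must follow pi_11
        H[idx] = np.sqrt(A[idx - 2])

        # (I)(pi_12): a coupled linear recurrence in D and E for each I, kept
        # serial in this sketch.  pi_12 and pi_14 are unordered by alpha, so
        # they could run concurrently.
        for I in range(1, N + 1):
            for J in range(1, N + 1):
                D[J + OFF] = A[I - 3 + OFF] + E[J - 1 + OFF]
                E[J + OFF] = D[J - 1 + OFF] + F

        # (I)(pi_13): for each I, one vector assignment over K = 1..N that
        # broadcasts H(I-5) to a whole row of results; must follow pi_14
        for I in range(1, N + 1):
            G[idx] = H[I - 5 + OFF] + 1

        return A, D, E, G, H

The single original I loop has become four loops, one per π-block, executed in any order consistent with α; π_11 and π_14 are pure array operations, and π_13 is a broadcast.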
The overall execution time is dominated by π_12 and is O(log N), so this is a type 1 loop. The number of processors required to achieve this time is O(N). Notice that in this example we avoided statement substitution. Using statement substitution, we would have been able to obtain four π-blocks, all of which could be executed at once. This would require the execution of several different operations at one time, while the technique we used allows all operations at each step to be identical. Furthermore, very little additional speedup would be possible by this method, since π_12 dominates the time here.

IFs in Loops
To this point we have considered DO loops without conditional statements. The addition to DO loops of IF and GOTO, as well as computed GOTO statements, can cause major problems. In particular, data dependencies can be changed at execution time by the existence of such conditional statements. Thus, knowledge at compile time of what can be executed in parallel may be difficult to obtain. In the worst case, not knowing about control flow may force us to compile loops for serial execution which in fact could be executed in a highly parallel way.

In this section we will consider the addition to DO loops of IF and computed GOTO statements, denoted as IFs. The first part of this section contains definitions and preliminary results concerning IFs. This leads to a distribution algorithm which allows for IFs. The importance of this algorithm is that using it we can often execute all of the DO loop simultaneously, or at worst, localize the part of the DO loop that must be done sequentially. Then we present a summary of the analysis of one year of CACM Fortran programs.

Before proceeding, we need a few definitions concerning IFs and control notions.

Definition 12
Statement S_j is an immediate control successor of S_i, denoted S_i γ S_j, if S_j is executed immediately after S_i. S_j is not unique when S_i is an IF. Statement S_j is a control successor of S_i, denoted S_i Γ S_j, if there exist statements S_{k_1}, ..., S_{k_m} such that S_i γ S_{k_1} γ S_{k_2} γ ... γ S_{k_m} γ S_j.

Definition 13
Given a loop L = (I_1 ← N_1, ..., I_d ← N_d)(S_1, ..., S_s) with IF statement S_i = (C)(S_{i_1}, ..., S_{i_n}), the j-th follower, 1 ≤ j ≤ n, of S_i, denoted by F_j(S_i), is the set {S_k | S_k = S_{i_j} or S_{i_j} Γ S_k}. We will refer to an arbitrary j-th follower as a follower. The set of common followers of IF statement S_i, denoted by CF(S_i), is the set {S_k | S_k ∈ ∩_{j=1}^n F_j(S_i) such that if S_i is executed then S_k is executed}. The set of parent IFs of statement S_i, denoted by PIF(S_i), is the set {S_l | S_l is an IF, S_l Γ S_i, and there is no IF statement S_m such that S_l Γ S_m Γ S_i where S_i ∉ CF(S_m)}. When there is more than one IF in the loop, it is possible that some of the followers will resemble trees.

Example 4

        DO S_5  I_1 = 5, 14, 1
        DO S_5  I_2 = 5, 14, 1
    S_1:  IF I_1 > I_2 THEN (S_2:)  A(I_1,I_2) = A(I_1-2,I_2-2) + A(I_1-4,I_2-4) + A(I_1+2,I_2+2) + A(I_1+4,I_2+4)
    S_3:  IF I_1 < I_2 THEN (S_4:)  B(I_1,I_2) = B(I_1-1,I_2)*C(I_1,I_2) + B(I_1-1,I_2)*D(I_1,I_2)
    S_5:  IF I_1 = I_2 THEN (S_6:)  IF MOD(I_1,2) = 0 THEN (S_7:)  E(I_1,I_2) = E(I_1-1,I_2-1) + F(I_1,I_2)

In Example 4, we have F_1(S_1) = {S_2, S_3, S_4, S_5, S_6, S_7}, F_2(S_1) = {S_3, S_4, S_5, S_6, S_7}, F_1(S_3) = {S_4, S_5, S_6, S_7}, F_2(S_3) = {S_5, S_6, S_7}, F_1(S_5) = {S_6, S_7},
F_2(S_5) = ∅, F_1(S_6) = {S_7}, F_2(S_6) = ∅, CF(S_1) = {S_3, S_4, S_5, S_6, S_7}, CF(S_3) = {S_5, S_6, S_7}, CF(S_5) = ∅, CF(S_6) = ∅, PIF(S_2) = PIF(S_3) = S_1, PIF(S_4) = PIF(S_5) = S_3, PIF(S_6) = S_5, and PIF(S_7) = S_6.

Definition 14
Given a π-block π_ui and an IF statement S_j in π_ui, the k-th control path with respect to S_j through π_ui is the set {S_l ∈ π_ui | S_l Γ S_j, or S_l ∈ F_k(S_j) and S_l ≠ S_j}. We will refer to an arbitrary k-th control path as a control path. Note that the number of control paths through π_ui is the same as the number of followers of S_j.

Example 5

        DO S_7  I = 1, N
    S_1:    A(I) = B(I-2) + C(I)
    S_2:    IF (A(I-1).EQ.0) THEN DO
    S_3:        B(I) = 3
    S_4:        D = .FALSE.
            END
            ELSE DO
    S_5:        E(I) = E(I-3) + 1
    S_6:        D = .TRUE.
            END
    S_7:    F(J) = 0

Using Example 5 we will point out the differences between followers and control paths. F_1(S_2) = {S_3, S_4, S_7}, F_2(S_2) = {S_5, S_6, S_7}, and CF(S_2) = {S_7}. In the π-partition π_1 = {{S_1, S_2, S_3}, {S_4}, {S_5}, {S_6}, {S_7}}, notice that F_1(S_2) is contained in three π-blocks and F_2(S_2) is not in the same π-blocks as F_1(S_2). There are two control paths in {S_1, S_2, S_3}. One control path, {S_1, S_3}, contains a statement not in any follower of S_2 and not all the statements in F_1(S_2). The other control path, {S_1}, does not contain any statements of F_2(S_2). In general, a control path can contain statements not in any follower of the IF, and it does not have to contain any statements of a follower.

In an IF-free DO loop L, L ∈ D, we know that the data dependencies for each iteration do not change as a function of the input data. The addition of IFs can give rise to several distinct data dependencies for different input data sets. The data dependencies on different iterations can be distinct for all iterations, for some iterations, or for only one iteration. Thus we may have statements that can have several different combinations of data dependence.

Example 6

        DO S_5  I = 1, N
    S_1:    IF C(I-1) = D(I) THEN (S_2:)  A(I) = 3*P(I)
            ELSE DO
    S_3:        A(I) = 4*Q(I)
    S_4:        B(I) = R(I) + 1
            END
    S_5:    C(I) = A(I) + B(I-1) + 2

Definition 15
The set of different combinations of data dependence into statement S_k is called the data dependence combinations of statement S_k and is denoted by DDC(S_k). The number of followers of IF statement S_i that contain S_k is denoted f_ik.

Statement S_5 in Example 6 can be computed in four different ways. First, the value of A(I) can be from S_2 and B(I-1) from outside the DO loop. Second, the value of A(I) can be from S_2 and B(I-1) from S_4. Third, the value of A(I) can be from S_3 and B(I-1) from S_4. Fourth, the value of A(I) can be from S_3 and B(I-1) from outside the DO loop.

Using Definitions 8, 12, 13 and 15, we can classify IFs into three types:

Definition 16
Given a loop L = (I_1 ← N_1, ..., I_d ← N_d)(S_1, ..., S_s) and IF S_i = (C)(S_{i_1}, ..., S_{i_n}), 1 ≤ i ≤ s, we say S_i is

a) Type A iff there does not exist S_j, 1 ≤ j ≤ s, such that S_j δ S_i, and {I_1, ..., I_d} ∩ IN(S_i) = ∅.

b) Type B iff one of the following holds:
   1) All but one of the F_j(S_i) branch out of the loop.
   2) For each S_k ∈ ∪_{j=1}^n F_j(S_i), |DDC(S_k)|/f_ik ≤ 1, and each of the data dependence combinations of S_k includes data dependence only on statements in a single follower of S_i and/or statements not in any follower of S_i.

c) Type C iff S_i is not type A or type B.

Type B IFs can be further subdivided. A prefix type B IF is a type B IF that is not data dependent on any statement in its followers. Postfix type B IFs are all other type B IFs.
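The follower and common-follower sets of Example 5 can be computed mechanically from the immediate-control-successor (γ) edges. The short Python sketch below is my own illustration: the edge table is transcribed by hand from Example 5, and the simple set intersection used here happens to equal CF(S_2) for this example (it ignores the extra executed-whenever condition of Definition 13).

    # Immediate control successors (gamma) of Example 5; statement k is the integer k.
    GAMMA = {1: [2], 2: [3, 5], 3: [4], 4: [7], 5: [6], 6: [7], 7: []}

    def control_successors(start):
        """All statements reachable from start via gamma, including start itself."""
        seen, work = set(), [start]
        while work:
            s = work.pop()
            if s not in seen:
                seen.add(s)
                work.extend(GAMMA[s])
        return seen

    def followers(if_stmt):
        """F_j(S_i): one set per branch target of the IF (Definition 13)."""
        return [control_successors(t) for t in GAMMA[if_stmt]]

    F1, F2 = followers(2)
    print(F1, F2)      # {3, 4, 7} {5, 6, 7}
    print(F1 & F2)     # {7}, matching CF(S_2) in Example 5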
Next we will discuss compiler algorithms for array machines. These machines have two characteristics of which we want to take advantage. First, they operate on whole arrays; thus we need to transform programs to operate on whole arrays at once rather than element by element. Second, these machines can selectively omit certain elements of an array during array operations. We define mode bits as the indicators of which array elements are to be operated on. Thus we want to transform IFs to generate mode bits. Both of these characteristics let us obtain speedups over uniprocessor machines.

As we saw earlier for IF-free loops, by distributing DO loop indices over π-blocks we are able to transform a given loop into one or more loops, each containing a vector or recurrence operation. In [23] we gave a modification of that algorithm to allow IFs. The first goal of the algorithm is to localize data dependence and the effects of IFs. Second, the algorithm handles the IFs in four different ways.

1) Type A IFs are the easiest to handle. One loop is compiled for each follower, and an IF is used at execution time to select which loop to execute.

2) Prefix type B IFs use mode bits to "prefix" the body statements. The IF is used only to set up the mode bits. The mode bits are set up once and then used by the body statements as necessary.

3) Postfix type B IFs require execution of each control path for the full DO loop index set. Then we postfix by merging the outputs from each control path.

4) Type C IFs are executed serially.

Since a type A IF depends only on variables not set inside the loop, such IFs can be removed from loops trivially. However, good programmers seldom write such statements, so this is a moot point.

In the prefix type B case, we set up the π-block containing the IF and the other π-blocks that are α-dependent on it to be executed for the full index set. However, before these π-blocks are executed we precompute (at compile or run time) which follower will be taken on each particular iteration of the DO loop. This is expressed as a vector of mode bits for each follower. A vector of mode bits is a mask that is applied to the index set for a given operation. In this way we can selectively omit certain elements at run time. The end result is that each statement is executed only for the proper elements. By precomputing these mode bits, we are able to fix the results a priori, hence the name prefix type B.

As an example of a prefix type B IF, consider the following program.

        DO 5 I = 1, N
        IF (I.LE.5) THEN A(I) = B(I) + C(I)
                    ELSE A(I) = B(I)/C(I)
    5   CONTINUE

Let M_i[a,b] be a vector of mode bits denoting vector elements from a to b, inclusive. Then we can compile the above as

    M1 = [1,5]
    M2 = [6,N]
    DO SIM {A(M1) = B(M1) + C(M1), A(M2) = B(M2)/C(M2)}

where DO SIM indicates that the bracketed statements can be executed as array statements and simultaneously.

The postfix type B IFs are more complicated. Using only the statements in the π-block that contains the IF, we compute all control paths in parallel for the full index set. Using these results we evaluate the IF. Finally, the results are merged to produce the correct values, and a set of mode bits is passed on to the other π-blocks. The details of the test and merge are now given. What we want to do is find which follower is taken for each iteration. Thus, given the outputs of each control path for each iteration, we want to thread our way through the outputs, picking up the proper results.
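The effect of the DO SIM compilation above can be mimicked with Boolean masks. The Python sketch below is my own illustration of the prefix type B mechanism, not the report's generated code; arrays are treated as 1-indexed, with position 0 unused.

    import numpy as np

    def prefix_type_b(B, C, N):
        """Mode-bit form of the prefix type B example above."""
        I = np.arange(N + 1)
        M1 = (I >= 1) & (I <= 5)       # mode bits for the THEN follower
        M2 = (I >= 6) & (I <= N)       # mode bits for the ELSE follower
        A = np.empty(N + 1)            # position 0 is unused
        A[M1] = B[M1] + C[M1]          # the two masked array statements are
        A[M2] = B[M2] / C[M2]          # independent and could run simultaneously
        return A

Note that M1 and M2 are disjoint and together cover the whole index set, which is exactly the property of mode-bit vectors discussed next for the postfix case.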
It is possible to do this recursively, by using the previously selected results and the results from the current iteration to determine which follower is to be taken. Using subscripts to denote the control path, in Example 7 we compute A_1(I) = D(I) and A_2(I) = B(I)*C(I) for all values of the index set. Let M_1 and M_2 be the vectors of mode bits for follower 1 and follower 2, respectively. Whenever M_i(j) is a 1, the statements in the i-th follower are executed on the j-th iteration of the serial program.

Example 7

        DO 10 I = 6, N
        IF A(I-3) = 5 THEN A(I) = D(I)
                      ELSE A(I) = B(I) * C(I)
    10  CONTINUE

It should be noted, in general, that the ANDing of any two distinct vectors of mode bits associated with an IF statement results in a vector of 0 bits. The ORing of all the vectors results in a vector of 1 bits corresponding to the iterations on which the IF statement is executed in the serial program.

In Example 7, control path 1 is taken whenever we were on control path 1 three iterations ago and the output was 5, or we were on control path 2 three iterations ago and the output was 5. Control path 2 is taken in either case if the output was not 5. This can be expressed as the following coupled recurrence relation:

    M_1(I) ← (A_1(I-3) = 5 & M_1(I-3)) ∨ (A_2(I-3) = 5 & M_2(I-3))
    M_2(I) ← (A_1(I-3) ≠ 5 & M_1(I-3)) ∨ (A_2(I-3) ≠ 5 & M_2(I-3))

The techniques to solve this recurrence are described in [21], with specific reference to this application. This is a bit-level recurrence and can be done in O(log n) gate delays, where n is the number of iterations. In reality, we could approximate the log n gate delays by one clock, i.e., one time step. Finally, we mask A_1 with M_1 and A_2 with M_2, and merge the results to get the proper elements of A. Also, M_1 and M_2 are passed on to other π-blocks as needed.

We should point out that postfix type B loops yield the same speedups as prefix type B loops, in general. The difference is that, since more processors are required for the redundant operations here, postfix type B loops generally have lower efficiencies.

As a test of the usefulness of our methods, we have analyzed the 16 Fortran programs which appeared in 1973 in the CACM Algorithms Section. Nested DOs were counted as one loop at the outermost level. There were a total of 124 such DO loops. Each loop was characterized in terms of the worst recurrence type (c.f. Eq. 6) and the worst IF type it contained. Table 1 is a summary of our results. We observe that type A and prefix type B IFs, together with recurrence types 0 and 1, can be handled with good speedup and efficiency. This accounts for 85% of the loops. Four programs (type 3 and type C) are disasters for our methods and in part must be handled serially. The remaining programs can be handled by the postfix and wave-front (type 2) methods. Overall, this seems to imply that the methods given would be very effective for the general mix of CACM Fortran algorithms.

4. Machine Considerations
Since the early 1960s, we have seen a sequence of high speed machines which have some kind of multioperation capability. The CDC 6600 of the 60s was succeeded by the 7600 in the early 70s. IBM introduced the 360/91 and its successors. The 7600 and /91 are both pipelined machines and achieve high performance by operating on arrays of data. Their instruction sets are rather traditional, however. In contrast, the pipelined Control Data STAR [33] and Texas Instruments ASC [68] both have vector instruction sets, which should make compilation for them substantially easier.
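The following Python sketch is my own illustration of the postfix handling of Example 7, not the report's compiler output. Both control paths are evaluated for the full index set; the mode bits are then obtained from the coupled recurrence (solved serially here, where the report does it in O(log n) gate delays); finally the two result vectors are merged. A is assumed to hold its pre-loop values in positions 0 through 5.

    import numpy as np

    def postfix_type_b_example7(A, B, C, D, N):
        A1 = D[6:N + 1].copy()             # control path 1 for every iteration
        A2 = B[6:N + 1] * C[6:N + 1]       # control path 2 for every iteration

        M1 = np.zeros(N + 1, dtype=bool)
        M2 = np.zeros(N + 1, dtype=bool)
        sel = A.copy()                     # selected outputs; 0..5 are boundary values
        for i in range(6, N + 1):          # serial form of the coupled recurrence:
            M1[i] = (sel[i - 3] == 5)      # path 1 iff the value selected three
            M2[i] = not M1[i]              # iterations back (or a boundary value) is 5
            sel[i] = A1[i - 6] if M1[i] else A2[i - 6]

        A[6:N + 1] = np.where(M1[6:N + 1], A1, A2)   # mask A1 with M1, A2 with M2, merge
        return A, M1, M2

The sketch computes the condition directly from the selected value three iterations back, which is equivalent to the M_1/M_2 recurrence given above; M1 and M2 are then available for other π-blocks.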
On the other hand, the Burroughs ILLIAC IV [4] is a parallel array machine, but its instruction set is also traditional in nature. Vectors must be broken up into partitions of size 64 and loops performed over a sequence of such partitions. The Goodyear Aerospace STARAN IV [6] is a parallel array of processors, each of which operates in a bit-serial fashion. This is an example of an associative processor. It is interesting to note that the highest speed pipeline processors, the STAR and ASC, both resort to parallelism by providing several parallel pipelines to achieve their desired operating speeds.

The processing speedups achieved by all of these machines are due to parallelism between operations as well as parallelism between memory and processor activities. We shall discuss memories, alignment networks, and control units later. Our point here is that in order to compile ordinary serial languages for these processors, two things are desirable:
1) powerful translation techniques to detect parallelism, and
2) array-type machine languages.

The main contributions to program speedup discussed in section 3 arise from our loop distribution procedure. This leads to array operations and recurrences. Both of these are well suited for computation on machines which must perform the same operation on many data elements to achieve high performance. Thus, the methods of section 3 could serve as compiler algorithms for such machines.

Some time ago, we implemented a comprehensive analyzer of Fortran programs. It used algorithms like those of sections 2 and 3, although some of the techniques were much more primitive than those discussed here. Details of our algorithms and results may be found in [39], [41], [44] and [54]. We will summarize a few points very briefly.

Altogether some 140 ordinary Fortran programs, gathered from many sources, were analyzed. The programs ranged from numerical computations on two-dimensional arrays (e.g., EISPACK) to essentially nonnumerical programs (e.g., Fortran equivalents of GPSS blocks). We set all loops to 10 or fewer iterations and analyzed all paths through the programs, computing T_1, T_p, S_p, E_p, etc. These were averaged over all traces and also over collections of programs. A plot of our results for S_p vs. p is shown in Fig. 7. Some of the points are labelled with the name of a collection of programs.

Our experiments lead us to conclude that multioperation machines could be quite effective for most ordinary FORTRAN computations. Fig. 7 shows that even the simplest sets of programs (GPSS, for example, has almost no DO loops) could be effectively executed using 16 processors. The overall average (ALL in Fig. 7) is 35 processors when all DO loop limits are set to 10 or less. As the programs become more complex, 128 or more processors would be effective in executing our programs. Note that for all of our studies T_1 ≤ 10,000, so most of the programs would be classed as short jobs in a typical computer center. In all cases, the average efficiency for each group of programs was no less than 30%. While we have not analyzed any decks with more than 100 cards, we would expect extrapolations of our results to hold. In fact, we obtained some decks by breaking larger ones at convenient points.

These numbers should be contrasted with current computer organizations. Presently, general purpose machines performing two to four simultaneous operations are quite common.
The pipeline, parallel and associative machines mentioned above perform 8 to 64 simultaneous operations, but these are largely intended for special purpose use. Thus, we feel that our numbers indicate the possibility of perhaps an order of magnitude speedup increase over the current situation. Furthermore, we took a subset of the programs, again a random cross-section, and varied the DO loop limits from 10 to 40. The points labelled 10, 20, 30, 40 in Fig. 7 correspond to these results. We conclude that for our sample of ordinary Fortran programs, speedup is a linear function of T_1 and hence of p. This is quite different from some of the folklore which has arisen about parallel computation [1], [31], [52] (e.g., S_p = O(log p)). We will next give an outline of two commonly held beliefs about machine organization.

Let us assume that for 0 ≤ β_k ≤ 1, a fraction (1-β_k) of the serial execution time of a given program uses p processors, while a fraction β_k of it must be performed on k < p processors. Then we may write

    T_p = β_k T_1/k + (1-β_k) T_1/p ,   and   E_p = T_1/(p T_p) = 1/(1 + β_k(p/k - 1)) .

For example, if k = 1, p = 33, and β_1 = 1/16, then we have E_33 = 1/3. This means that to achieve E_33 = 1/3, 15/16 of T_1 must be executed using all 33 processors, while only 1/16 of T_1 may use a single processor. While E_33 = 1/3 is typical of our results (see Fig. 7), it would be extremely surprising to learn that 15/16 of T_1 could be executed using fully 33 processors. This kind of observation led Amdahl [1] and others [20], [63] to conclude that computers capable of executing a large number of simultaneous operations would not be reasonably efficient; to paraphrase them, "Ordinary programs have too much serial code to be executed efficiently on a multioperation processor."

Such arguments have an invalidating flaw, however, in that they assume k = 1 in the above efficiency expression. Evidently no one who repeated this argument ever considered the obvious fact that k will generally assume many integer values in the course of executing most programs. Thus, the expression for E_p given above must be generalized to allow all values of k up to some maximum. The technique used in our experiments for computing E_p is such a generalization. For some execution trace through a program, at each time step i some number of processors k(i) will be required. If the maximum number of processors required on any step is p, we compute the efficiency for any trace as

    E_p = ( Σ_{i=1}^{T_p} k(i) ) / (p T_p) ,

assuming p processors are available.

Apparently no previous attempt to quantify the parameters discussed above has been successful for a wide class of programs. Besides Kuck, et al. [41], the only other published results are by Baer and Estrin [3], who report on five programs.

Another commonly held opinion, which has been mentioned by Minsky [52], is that the speedup S_p is proportional to log p. Flynn [31] further discusses this, assuming that all the operations simultaneously executed are identical. This may be interpreted to hold 1) over many programs of different characteristics, 2) for one fixed program with a varying number of processors, or 3) for one program with varying DO loop limits. That the claim is false under interpretation 1 for our analysis is obvious from Fig. 7. Similarly, it is false under interpretation 2 as the number of processors is varied between 1 and some number, as plotted in Fig. 7.
As p is increased still farther, the speedup and efficiency may be regarded as constant, or the speedup may increase at a decreasing rate together with a decreasing efficiency. Eventually, as p becomes arbitrarily large, the speedup becomes constant, and in some region the curve may appear logarithmic. Under interpretation 3 there are many possibilities: programs with multiply nested DO loops may have speedups which grow much faster than linearly, and programs without DO loops of course do not change at all.

One of Flynn's arguments in support of such behavior concerned branching inside loops. For example, if on each pass through a loop we branched down some new path, not taken on any other iteration, then indeed we might have disastrous results. However, we have not observed such intense splitting; in fact, most computations which do branch inside a loop come together in common followers rather quickly. Furthermore, there are usually relatively few distinct paths through a loop. The results of Table 1 lend further support to our conclusion that IFs in loops do not pose a serious practical problem.

Abstractly, it seems of more interest to relate speedup to T_1 than to p. Based on our data, we offer the following observation: for many ordinary FORTRAN programs (with T_1 ≤ 10,000), we can find p such that

    1)  T_p = a log_2 T_1   for 2 ≤ a ≤ 10 , and
    2)  p ≤ T_1/(.6 log_2 T_1) , such that
    3)  S_p ≥ T_1/(10 log_2 T_1)   and   E_p > .3 .

The average a value in our experiments was about 9. However, the median value was less than 4, since there were several very large values.

In terms of our present theoretical results, it is not hard to justify such estimates. Consider, for example, Corollary 4, which is intended for solving R systems with small n using a limited number of processors. Let us approximate the times by T_1 ≈ mn and T_p ≈ 2m²n/p = 2m T_1/p, since most programs have m = 1 or m = 2. Following the observation, we choose p ≈ T_1/log T_1, so T_p ≈ 2m log T_1, S_p ≈ T_1/(2m log T_1), and E_p ≈ 1/2m, all of which agree reasonably with the observation.

We should point out that our previous analyzer allowed each processor to be executing a different operation on any time step. This could degrade our results, possibly by a factor of two. However, the newly discovered algorithms perform better than the ones we used in our old analyzer. We are currently implementing a new analyzer with which we expect to obtain better results than those summarized above.

For a more complete discussion of theoretical bounds on the time and processors required for executing whole Fortran loops see [19], [24]. These references show how certain details about overall program graphs can be combined with the results of section 3 to obtain loop bounds. From these, whole program bounds can be obtained.
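Before turning to machine organization, here is a small Python check (my own, not from the report) of the efficiency expressions used in this section: the two-speed formula with the k = 1, p = 33, β = 1/16 example, and the generalized trace-based efficiency. The processor-count trace in the last line is invented purely for illustration.

    def amdahl_efficiency(beta, k, p):
        """E_p = 1 / (1 + beta*(p/k - 1)) for the two-speed model above."""
        return 1.0 / (1.0 + beta * (p / k - 1.0))

    def trace_efficiency(ks):
        """Generalized E_p = sum k(i) / (p * T_p) for a trace ks of the number
        of processors actually used on each time step."""
        p, Tp = max(ks), len(ks)
        return sum(ks) / (p * Tp)

    print(amdahl_efficiency(beta=1/16, k=1, p=33))    # 0.333..., as in the text
    print(trace_efficiency([33, 16, 33, 8, 33, 33]))  # a made-up trace using many k values

The second call shows the point made above: when k takes many values across the trace, the efficiency is not governed by the pessimistic k = 1 formula.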
Control Units
A well-designed control unit is one which never gets in the way of the processor(s) and memories. In other words, it operates fast enough to supply instructions whenever they are needed in the processing and moving of data. Control units tend to become complex, mainly in a timing sense, because they may have a number of tasks to control. One way to ease some control unit difficulties is to use parallelism at the control unit level. A multiprocessor is an example; several complete control units are used. This may be rather expensive, so the multifunction, pipeline and parallel processor machines use one shared control unit. Such control units often contain a number of independently operating parts. For example, the first use of pipelining was in control units [14]. A detailed study of the control unit of any high-speed computer will reveal a number of simultaneously operating, independent functions. While this may allow the functions to operate more slowly, it also causes some synchronization problems.

As we mentioned in our processor discussion, the level of machine language is very important in modern high-speed computers. Vector instruction sets make compiler writing easier. They also focus control unit design on the correct questions, namely, how to execute vector functions at high speed.

Control units for high-speed computers must handle the traditional functions, including instruction decoding and sequencing, I/O and interrupt handling, and address mapping and memory indexing. In addition, we can list several new functions. For one, memory indexing becomes somewhat more complex when whole arrays are being accessed in large parallel memories. Also, array computers (parallel or pipeline) often rely on the control unit for scalar computations. Broadcasting of scalars to an array must also be handled. Special control features such as IF tree processing [27] can be effectively handled in the control unit. Special stacks and queues may be required to handle a number of processors and programs in rapid succession. Indeed, instruction level multiprogramming may even be attempted. Rather than discuss any of these in detail, we simply refer the reader to the several high-speed machine papers mentioned earlier.

We conclude this section with the computer organization of Fig. 8. The control unit can really be regarded as four control units, one for each of the four other major subsystems shown. The operation of this machine can be regarded as a pipeline from memory to memory. For move instructions (memory to memory) the processors can be bypassed. Fig. 8 represents parallelism at a number of levels: within the control unit, processors, memories and data alignment networks. Also, it contains parallelism in the simultaneity of operation of each of these, which forms a pipeline. Note that pipelining can also be used within each of the five major subsystems to match bandwidths between them. The details of accessing parallel memories and of aligning the accessed data will be discussed in the next section.

Parallel Memory Access
As effective speeds of processing units have increased, memory speeds have been forced to keep up. This has partly been achieved by new technologies (magnetic cores to semiconductors). But technology has not been enough, as evidenced by the fact that in 1953 the first operating core memory (in Whirlwind I) had an 8 μs memory cycle time. Today, most computer designers cannot afford to use memories much faster than one hundred nanoseconds. Thus, we have achieved an increase of only two orders of magnitude in memory speed over the past twenty-odd years. In the same period, the fastest processor operation times have advanced from a few tens of microseconds to a few tens of nanoseconds, or three orders of magnitude. Memory system speeds have kept up with processors only through the use of parallelism at the word level. In the late 1950s, ILLIAC II and the IBM STRETCH introduced the first two-way interleaved memories. At the present time, high speed computers have on the order of 100 parallel memory units.
If a word can be fetched from each of m memory units at once, then the effective memory bandwidth is increased by a factor of m.

Array Access
Parallel memories are particularly important in array computers (parallel or pipeline). Thus, if a machine has m memory units we can store one-dimensional arrays across the units as shown in Fig. 9 for m = 4. While the first m operands are being processed, we can fetch m more, and so on. But if the array is indexed such that, say, only the odd elements are to be fetched, then the effective bandwidth is cut in half due to access conflicts, as shown in the underlined elements of Fig. 9. These conflicts can be avoided by choosing m to be a prime number; then any index distance relatively prime to m can be accessed without conflicts.

Many programs contain multidimensional arrays. These can lead to more difficult memory access problems, since we may want to access rows, columns, diagonals, back diagonals (as in the wave-front method of section 3), square blocks, and so on. For simplicity, consider two-dimensional arrays and assume we want to access n-element partitions of arrays held in parallel memories with m units. Consider the storage scheme shown in Fig. 10, where m = n = 4. Clearly, using this storage scheme we can access any row or diagonal (e.g., the circled main diagonal) without conflict. But all the elements of a column (e.g., the underlined first column) are stored in the same memory unit, so accessing a column would result in memory conflicts, i.e., we would have to cycle the memories n times to get the n elements of a column. In order to allow access to row and column n-vectors, we can skew the data as shown in Fig. 11 [38]. Now, however, we can no longer access diagonals without conflict. It can be shown, in fact, that there is no way to store an m×m matrix in m memories, when m is even, so that arbitrary rows, columns, and diagonals can all be fetched without conflicts. However, as we shall see, by using more than m memories we can have conflict-free access to any row, column, or diagonal, as well as other useful m-vectors.

It is easy to show that if m = 2^{2k} + 1, for any integer k, we have conflict-free access to rows, columns, diagonals, back diagonals, and square blocks. For an example with k = 1, see Fig. 12. This and other similar results are discussed in [15]. If m is not a power of two, certain difficulties arise in indexing the memory. Also, note that the elements of various partitions are accessed in scrambled order. A crossbar switch could be used to unscramble the data, but much cheaper schemes can be devised. The question of unscrambling the accessed elements using a rather simple network is discussed by Swanson [66]. In order to simplify indexing and unscrambling, systems of the form m = 2n were considered by Lawrie in [47] and [48]. He shows that conflict-free access to a number of partitions is possible using such a memory. We illustrate this in Fig. 13 with m = 2n = 8. We will discuss data alignment networks for unscrambling the data accessed in such a memory later in this section.

In order to implement a skewing scheme, we must have a properly designed parallel memory system. In particular, each of the m memory units must have an independent indexing mechanism. This allows us to access a different relative location in each memory unit.
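To make the skewing idea concrete, the following Python sketch (mine, not the report's) maps element (i, j) of a 4×4 array to a memory unit via unit = (d*i + j) mod m and checks whether a given partition lands in distinct units. With m = n = 4 and the simple skew d = 1 (as in Fig. 11), rows and columns are conflict-free but the diagonal is not; with m = 5 (the m = 2^{2k} + 1 case for k = 1) and skew distance d = 2, all three come out of distinct units. The particular skew distance used here is this sketch's choice, not a detail quoted from the report.

    def unit_and_address(i, j, m, d=1):
        """Skewed storage: element (i, j) lives in unit (d*i + j) mod m, address i."""
        return (d * i + j) % m, i

    def conflict_free(cells, m, d=1):
        """True if all the (i, j) cells fall in distinct memory units."""
        units = [unit_and_address(i, j, m, d)[0] for i, j in cells]
        return len(set(units)) == len(units)

    n = 4
    row      = [(0, j) for j in range(n)]
    column   = [(i, 0) for i in range(n)]
    diagonal = [(i, i) for i in range(n)]

    print([conflict_free(p, m=4) for p in (row, column, diagonal)])        # [True, True, False]
    print([conflict_free(p, m=5, d=2) for p in (row, column, diagonal)])   # [True, True, True]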
It is interesting to observe that several presently existing high speed computers have handled their parallel memories in different ways. The Control Data STAR, for example, does not allow independent indexing of each memory unit. Instead, it has an instruction by which arrays can be physically transposed in memory to provide access to, say, rows and columns. The transpose time is essentially wasted time, and some algorithms for these machines are slowed down by as much as a factor of two in this way. ILLIAC IV has independent index registers and index adders on each of its m = 64 memories. Since it has 64 processors, access to partitions of n = 64 elements is usually required. Thus the skew scheme of Fig. 11 is easily implemented. Of course, since m is even, conflict-free access to rows, columns, and diagonals is impossible. But as Fig. 11 shows, diagonals may be accessed in just two memory cycles.

Parallel Random Access
The execution of Fortran-like programs frequently leads to memory access requirements which include one-dimensional arrays and various partitions of multidimensional arrays, as we have been discussing. However, we sometimes face access problems which have much less regularity. For example, consider the subscripted subscript case:

        DO I = 1, N
        X(I) = A(B(I)) .

Here we have no idea at compile time about which elements of A are to be fetched, assuming that B was computed earlier in the program. This easily generalizes to multidimensional arrays. Frequently, table lookup problems are programmed in this way.

To deal with this kind of memory access problem is, in general, to deal with random access to a parallel memory. Note that this is a problem which has been given a good deal of attention for multiprocessor systems, using rather abstract models of various kinds. There are two key questions on which the validity and usefulness of these models turn:
1) What kind of data dependence is assumed in the memory access sequence?
2) What kind of queueing mechanism is assumed for retaining unserviced accesses?
In these terms, we briefly summarize some of the results.

Hellerman's model [32] can most reasonably be interpreted to assume no data dependence between successive memory accesses and to have no provision to queue conflicting addresses. It is also a steady state model, ignoring control dependence. Thus, it scans an infinite string of addresses, blocking when it finds the first duplicate memory unit access request. In various models, Coffman and his co-workers [16], [17], [26] extended the above to include a type of queueing and to separate data accesses from instruction accesses. These papers further introduced address sequences which were not necessarily uniformly distributed. These models also assumed that no data dependencies existed in the address sequence.

Ravi [58] introduced a model which was more realistic for multiprocessor machines. He allows each processor to generate an address and computes the number of them which can be accessed without conflict, in a steady state sense. Effectively, he assumes a sequential data dependence in the addresses generated by each processor. In [18] the above results are extended in several ways. First, it is shown analytically that the model of [58] yields an effective memory bandwidth which is linear in the number of memory units. Several models are given with queues in the processors and in the memories, to show the differing effects on bandwidth of such queues and of the methods used for managing them. Several types of data dependencies are assumed to exist, some as in the Ravi model and others which include dependencies between the processors.
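The no-queueing, no-data-dependence assumption described above for Hellerman's model is easy to simulate. The Python sketch below (my own illustration) scans a random stream of unit requests, counts how many are serviced before the first duplicate unit appears, and averages over many trials; the estimate it produces grows roughly like the square root of m, in line with the O(m^{1/2}) figure mentioned in the next paragraph.

    import random

    def accesses_before_conflict(m, rng):
        """Requests serviced before the first duplicate memory unit (no queueing)."""
        seen = set()
        while True:
            unit = rng.randrange(m)
            if unit in seen:
                return len(seen)
            seen.add(unit)

    def estimated_bandwidth(m, trials=20000, seed=1):
        rng = random.Random(seed)
        return sum(accesses_before_conflict(m, rng) for _ in range(trials)) / trials

    for m in (4, 16, 64):
        print(m, round(estimated_bandwidth(m), 2))

Under the queueing and processor-dependence assumptions of [18], described above, the behavior is quite different, as the next paragraph notes.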
In all of these models, we show that the effective bandwidth of m memories can be made to be O(m). The models are useful for either multiprocessor or parallel machines. Thus, we conclude that for parallel or multiprocessor machines, the proper use of m parallel memories can lead to effective bandwidths which are O(m). This is much more encouraging than the O(m^{1/2}) which was derived from earlier, more naive models.

Alignment Networks
Finally, we consider the problem of interconnecting the processors and memories we have been examining. Data alignment requirements depend on the programs to be run and on the machine organization. A simple way to connect several memories to a processor is to use a shared bus. For higher speed operation, multiprocessors often use a crossbar switch, which allows each processor to be connected to a different memory simultaneously. In the ILLIAC IV array, the i-th processor can pass data to processors i ± 1 and i ± 8, modulo 64. Here all processors must route data the same distance in a uniform way.

None of the above techniques is well suited to a high performance parallel computer. Indeed, the alignment network should be driven by an independent control unit, so as to operate concurrently with the processor and memory operation. The requirements include more than uniform shifts and at times even more than permutations. Often broadcasts are needed, including partial and multiple simultaneous broadcasts, e.g., n^{1/2} numbers, each broadcast to n^{1/2} processors for matrix multiplication in an n processor machine [48]. The alignment network should be able to transmit data from memory to memory and processor to processor, as well as back and forth between memories and processors. The connections it must provide are derived from two sources. For one, it must be able to handle the indexing patterns found in existing programs, for example, the uniform shift of 5 necessary in A(I) + A(I+5). For another, it must be able to scramble and unscramble the data for memory accesses. For example, to add a row to a column, one of the partitions must be "unskewed". More details can be found in [39], [47] and [43].

With a crossbar switch we can perform any one-to-one connection of inputs to outputs, and with some modification we can also make one-to-many connections for broadcasting. The switch can be set and data can be transmitted in O(log n) gate delays. However, the cost of such a switch is quite high, namely O(n²) gates. Thus, for large systems a crossbar alignment network is out of the question.

Another possibility is the rearrangeable network. It is shown by Benes [8] that such a switch, with the same connection capabilities as a crossbar, can be implemented using only O(n log n) gates. The time required to transmit data through the network is just O(log n). Unfortunately, the best known time to set up the network for transmission is O(n log n) [56]. This control time renders the network impractical as an alignment network, unless all connection patterns can be set up at compile time.

The Batcher sorting network [5] is another possibility. Not only can it perform the connections of a crossbar switch, it can also sort its inputs, if desired. This network has O(n log² n) gates, so it is an improvement over the crossbar. However, it requires time of O(log² n) gate delays for control and data transmission, making it faster than the Benes approach.
As a final possibility, the Ω network [48], proposed specifically for this purpose, can be controlled and can transmit data in O(log n) gate delays, but contains only O(n log n) gates. Thus it has the speed of a crossbar with the cost of a Benes network. Its shortcoming is that it cannot perform arbitrary interconnections. However, as discussed above, we seek an alignment network which can handle the requirements posed by program subscripts and memory skewing schemes. Lawrie has examined a number of such questions, and the Ω network satisfies many of them.

It is interesting to note that the Ω network consists of a sequence of identical interconnection paths called shuffles; see, e.g., [48], [57]. We call transmission from left to right a shuffle and from right to left an unshuffle. It can be shown that the Benes and Batcher networks, as well as the Ω network, can all be constructed from a series of shuffle and unshuffle interconnections of 2×2 switching elements. The switching elements are basically 2×2 crossbars. In the Batcher network, they have the further capability of comparing their inputs and switching on this basis. In the Ω network they can also broadcast either of their inputs to both outputs.

A Cost-Effectiveness Measure
One important motivation for studying the structure of parallel algorithms is to determine computer architectures for effectively executing these algorithms. We are interested in computer requirements as measured in terms such as the number of processors, the type of processors, the number of memories, and the type of interconnections. We measure the effectiveness of an architecture for some algorithm in terms of its speedup over a single processor and the efficiency of the evaluation of that algorithm.

In section 3, we defined speedup categories as a way of characterizing parallel algorithms. These were given in terms of T_1 rather than, say, n, the size of some array, in order to compare dissimilar algorithms. Such a hierarchy is useful for comparing algorithms, but it may not be useful as a quality or effectiveness measure for machine computation. In practice, which algorithm one chooses depends on many factors, an important one being the number of processors available.

The speedup types mentioned above are useful in algorithm selection if one has a machine which is very large relative to the sizes of one's problems. But categorizing algorithms on the basis of speedup alone is unsatisfactory in general, since efficiency is important when unlimited processors are not available. On the other hand, using efficiency as the sole measure of effectiveness is too conservative, since a serial computation always has maximum efficiency, i.e., E_1 = 1. Thus, we turn to a measure which includes both effectiveness and cost.

The cost of a computation is clearly related to the number of processors required and the time they must be used. We will assume the processor-time product reasonably represents cost, so we define the cost as

    C_p = p T_p .

Assume that we have two algorithms, A_1 and A_2, which compute the same function using p_1 and p_2 processors, respectively. Then we should decide which algorithm to use on the basis of max_{p_i} S_p/C_p, the ratio of effectiveness to cost. We can rewrite this measure as

    S_p/C_p = S_p/(p T_p) = E_p/T_p .

Note that since E_p ≤ 1 and T_p ≥ 1, we have S_p/C_p ≤ 1.
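As a small worked illustration of the measure just defined, the Python fragment below compares two hypothetical algorithms for the same problem; the processor counts and times are invented for illustration only and are not taken from Fig. 14.

    def cost(p, Tp):
        """C_p = p * T_p, the processor-time product defined above."""
        return p * Tp

    def s_over_c(T1, p, Tp):
        """S_p / C_p = S_p / (p * T_p) = E_p / T_p."""
        Sp = T1 / Tp
        return Sp / cost(p, Tp)

    T1 = 1024
    print(s_over_c(T1, p=512, Tp=16))   # A1: faster, but E_p = 0.125
    print(s_over_c(T1, p=64,  Tp=32))   # A2: slower, E_p = 0.5, and a higher S/C

Here the leaner algorithm A2 wins under S/C even though A1 has the larger speedup, which is the kind of tradeoff discussed next.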
p " p " p " p " E E S Another way of viewing this measure is to write J^" = t ^ ■ P 'l Note that for a given function, T-, is fixed so for various parallel algorithms we are attempting to maximize the product of efficiency and speedup. Since E <_ 1 and S < T, , division by T, is simply a normalization of the maximum product to unity. In Fig. 14 we present a summary of S/C values for a number of parallel computations. This is discussed in detail in [40]. We shall make a few summary remarks here. Generally speaking, the best parallel algorithms in the S/C sense are at the top and the worst are at the bottom. The column at the left represents algorithms running on machines in a less than maximum possible speedup way. The right column contains the best known speedups, in general. By studying various algorithm and machine organizations for a given problem, we could hope to maximize the S/C measure for some set of computations. For example, the problem of multiplying n matrices is shown 83 ? 3 4 . in three places, for p = n , p = n and p = n . Note that S/C is maximum for p = n , less than the maximum possible speedup for this problem. In [40] several algorithms are compared in this way. Finally, we point out that for an analysis such as the one we have presented here to be practically meaningful, some machine details must also be included. Analyses similar to the above may be performed on various parts of a computer system. For example, bit serial and bit parallel arithmetic units may be compared in a similar way. Various interconnection networks can also be studied in this way. Thus we would be able to make tradeoffs in parallelism from the level of arithmetic algorithms (i.e., hardware) up through the level of computational algorithms (i.e., programs). Consider the extra gates required for carry lookahead arithmetic over bit serial arithmetic. These could be more cost-effectively used in multiple micro- processors in a machine properly designed for some set of algorithms. 84 5. Conclusions Parallelism in machine organization has been observed and ex- ploited since the time of Babbage [44]. Since the early 1960s, multioperation machines have existed in implementations like the CDC 6600, CDC STAR, TI ASC, Burroughs' ILLIAC IV, and so on. Compilers for these machines have been uniformly difficult to write—mainly for two reasons. First, it has been difficult to discover operations in ordinary programs, which can be executed simultaneously on such machines. Second, the organizations of these machines have often made it difficult to implement array operations in efficient ways. In this paper we have discussed both of these problems. The first can be solved by using several compilation techniques aimed at parallelism detection and exploitation. This involves building a dependence graph in a careful way and then being able to compile trees, recurrences and IF statements for execution in fast ways. For most ordinary FORTRAN programs, this will probably lead to theoretical speedups which grow (nearly) linearly in the number of processors used in a computation. The second problem is closely related to the first. If one can transform most programs into array form, then proper hardware is necessary to support their fast execution. This means we must have a control unit capable of executing array instructions in a more or less direct way. For example, sufficiently many registers are required to contain all of the relevant parameters of loops, their indexing and array subscripting. 
Furthermore, we must have a high bandwidth memory which provides conflict-free access to arrays and various commonly needed partitions. We must also be able to align such partitions with others for processing. Finally, sufficiently high processing bandwidth must be provided, most likely by a combination of parallel and pipeline techniques. To complete the circle, the control unit must be sufficiently pipelined to provide a continuous flow of data in a memory to memory (or register to register) way. While many aspects of the problem are well understood now, no machine exists which combines all of the necessary features in an adequate way.

These ideas are generally regarded as techniques for very high-speed computation. However, the supercomputers of one day are often the ordinary computers of a later day, and perhaps the minicomputers of the future. Clearly, good architectural ideas for machine speedup and ease of compiler writing will be widely useful. Indeed, with the arrival of sixteen-bit microprocessors, we seem to be quite close to the time when large numbers of simple processors can be used as components in one super processor.

We emphasize the need for program analysis to determine which machine organizations are useful for various classes of programs. In this paper we have concentrated on Fortran-like programs. The techniques described here have been adapted to other languages. For example, in [28], all of the important blocks of GPSS were analyzed. Using these results, a machine organization was proposed for the fast execution of GPSS programs. Since little arithmetic is involved, designs for the memory, control unit and alignment network were emphasized. Similarly, in [65], a number of COBOL programs were analyzed using these techniques. There the memory hierarchy requirements became obvious. Also, the ability to execute one program on a number of successive, independent data sets was necessary. Again, little arithmetic was required. The control unit became a key to obtaining high performance; too much complexity in the control unit was a danger.

We have also attempted to exploit parallel and pipeline techniques in information retrieval and file processing problems [34] and [64]. Here, since there are no standard programming languages, standard algorithms (e.g., list merging) were studied. It seems clear that by combining the control notions discussed in this paper with various data structure transformations, other programming languages could be analyzed and good machine parameters discovered. For example, instead of arithmetic tree-height reduction and fast recurrence handling methods, various string manipulations or tree and graph algorithms could be used; these could be interpreted in appropriate languages, e.g., SNOBOL, LISP, etc.

Another approach to machine design is to avoid programs and go directly to the algorithms themselves. We have carried out the direct analysis of several algorithms, including [61] and [62]. In general, better speedup results can be obtained in this way, since one avoids the difficulties of programming language artifacts. In fact, while few nonlinear recurrences are found in real programs, we often are limited in obtaining faster numerical algorithms by nonlinear recurrences. By hand, the algorithm is reorganized into a potentially fast form which contains a nonlinear recurrence and, hence, cannot be handled by known methods.
While some nonlinear recurrences can be treated analytically and others can be shown to be intractable, this area is generally not well understood. The hand analysis of whole algorithms provides good nonlinear recurrence research problems. Of course, even if we cannot analytically speed up nonlinear recurrences, this does not mean that there is no hope. In fact, hardware tricks can be used to get around some nonlinearities. At the bit level, several examples of this, for example binary multiplication, are discussed in detail in [21].

To summarize, we have been discussing the structure of programs, the structure of machines, and the relation between the two. To discover ultimate speed machines, or simply to find low cost, high performance machine designs, such studies are important. By obtaining a better understanding of the structure of real programs we can determine good compiler algorithms for exotic machine organizations. Furthermore, there are benefits for standard machines; for example, we can hope to improve paging performance for virtual memory machines by understanding the control and data flow. Also, by transforming a program into simpler forms (for example, by removing IFs from loops and bringing out the array nature of the program) we can hope to aid programmers in understanding and debugging their programs. Finally, by viewing compilation in a broad way we can see that certain traditional speedup techniques of logic design are identical to methods useful in compilers for multioperation machines. These include tree-height reduction and fast linear recurrence solving.

ACKNOWLEDGMENT

I am particularly indebted to S. C. Chen, Duncan H. Lawrie, Ahmed H. Sameh, and Ross A. Towle for discussions about these ideas.

REFERENCES

[1] Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proc. AFIPS Conf., 1967, pp. 483-485.
[2] Baer, J. L., and Bovet, D. P. (1968). Compilation of arithmetic expressions for parallel computations. Proc. IFIP Congress, 1968, pp. 340-346.
[3] Baer, J. L., and Estrin, G. (1969). Bounds for maximum parallelism in a bilogic graph model of computations. IEEE Trans. Comput. C-18, 1012-1014.
[4] Barnes, G., Brown, R., Kato, M., Kuck, D., Slotnick, D., and Stokes, R. (1968). The ILLIAC IV computer. IEEE Trans. Comput. C-17, 746-757.
[5] Batcher, K. E. (1968). Sorting networks and their applications. Proc. AFIPS Spring Jt. Comput. Conf., 1968, pp. 307-314.
[6] Batcher, K. E. (1973). STARAN/RADCAP hardware architecture. Proc. Sagamore Conf. on Parallel Processing, 1973, pp. 147-152.
[7] Beatty, J. C. (1972). An axiomatic approach to code optimization for expressions. J. Ass. Comput. Mach. 19, 613-640.
[8] Benes, V. E. (1965). Mathematical Theory of Connecting Networks and Telephone Traffic. Academic Press, New York.
[9] Bernstein, A. (1966). Analysis of programs for parallel processing. IEEE Trans. Electronic Comput. EC-15, 757-763.
[10] Bingham, H. W., Fisher, D. A., and Semon, W. L. (1966). Detection of implicit computational parallelism from input-output sets. ECOM-02463-1, AD645438.
[11] Bingham, H. W., Reigel, E. W., and Fisher, D. A. (1968). Control mechanisms for parallelism in programs. ECOM-02463-7.
[12] Brent, R. (1974). The parallel evaluation of general arithmetic expressions. J. Ass. Comput. Mach. 21, 201-206.
[13] Brent, R., and Towle, R. (1975). On the time required to parse an arithmetic expression for parallel processing. Submitted for publication.
[14] Buchholz, W.
(1962). Planning a Computer System. McGraw-Hill, New York.
[15] Budnik, P., and Kuck, D. (1971). The organization and use of parallel memories. IEEE Trans. Comput. C-20, 1566-1569.
[16] Burnett, G. J., and Coffman, E. G., Jr. (1970). A study of interleaved memory systems. Proc. AFIPS Spring Jt. Comput. Conf., 1970, pp. 467-474.
[17] Burnett, G. J., and Coffman, E. G., Jr. (1973). A combinatorial problem related to interleaved memory systems. J. ACM 20, 39-45.
[18] Chang, D., Kuck, D. J., and Lawrie, D. (1975). On the effective bandwidth of parallel memories. Submitted for publication.
[19] Chen, S. C. (1975). Speedup of iterative programs in multiprocessor systems. Rep. No. 75-694, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[20] Chen, T. C. (1971). Unconventional superspeed computer systems. Proc. AFIPS Spring Jt. Comput. Conf., 1971, pp. 365-371.
[21] Chen, S. C., and Kuck, D. J. (1975). Combinational circuit synthesis with time and component bounds. Submitted for publication.
[22] Chen, S. C., and Kuck, D. (1975). Time and parallel processor bounds for linear recurrence systems. IEEE Trans. Comput. C-24, 701-717.
[23] Chen, S. C., Kuck, D. J., and Towle, R. (1975). Control and data dependence in ordinary programs. Submitted for publication.
[24] Chen, S. C., Kuck, D., and Towle, R. (1975b). Time and parallel processor bounds for Fortran-like loops. Submitted for publication.
[25] Chen, S. C., and Sameh, A. H. (1975). On parallel triangular system solvers. Sagamore Comput. Conf. on Parallel Processing, 1975.
[26] Coffman, E. G., Jr., Burnett, G. J., and Snowdon, R. A. (1971). On the performance of interleaved memories with multiple-word bandwidths. IEEE Trans. Comput. C-20, 1570-1572.
[27] Davis, E. W., Jr. (1972). Concurrent processing of conditional jump trees. Compcon 72, IEEE Computer Society Conf. Proc., 1972, pp. 279-281.
[28] Davis, E. W., Jr. (1972b). A multiprocessor for simulation applications. Rep. No. 72-527, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[29] Estrin, G., and Turn, R. (1963). Automatic assignment of computations in a variable structure computer system. IEEE Trans. Comput. EC-12, 755-773.
[30] Fisher, D. (1967). Program Analysis for Multiprocessing. Burroughs Corp., TR-67-2.
[31] Flynn, M. (1972). Some computer organizations and their effectiveness. IEEE Trans. Comput. C-21, 948-960.
[32] Hellerman, H. (1967). Computer System Principles. McGraw-Hill, New York.
[33] Hintz, R. G., and Tate, D. P. (1972). Control Data STAR-100 processor design. Proc. IEEE Comput. Soc. Conf., 1972, pp. 1-4.
[34] Hollaar, L. A. (1975). A list merging processor for inverted file information retrieval systems. Rep. No. 75-762, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[35] Hu, T. C. (1961). Parallel sequencing and assembly line problems. Oper. Res. 9, 841-848.
[36] Knuth, D. E. (1970). An empirical study of FORTRAN programs. Rep. No. CS-186, Dept. of Comput. Sci., Stanford Univ.
[37] Kraska, P. W. (1972). Parallelism exploitation and scheduling. Rep. No. 72-518, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[38] Kuck, D. J. (1968). ILLIAC IV software and application programming. IEEE Trans. Comput. C-17, 758-770.
[39] Kuck, D. (1973). Multioperation machine computational complexity. Proc. Symposium on Complexity of Sequential and Parallel Numerical Algorithms, 1973 (J. Traub, ed.), pp. 17-47. Academic Press, New York.
[40] Kuck, D. J. (1974).
On the speedup and cost of parallel computation. To appear in Proc. on Complexity of Computational Problem Solving, The Australian National University.
[41] Kuck, D., Budnik, P., Chen, S. C., Davis, E., Jr., Han, J., Kraska, P., Lawrie, D., Muraoka, Y., Strebendt, R., and Towle, R. (1974). Measurements of parallelism in ordinary FORTRAN programs. IEEE Computer 7, 37-46.
[42] Kuck, D., and Maruyama, K. (1975). Time bounds on the parallel evaluation of arithmetic expressions. SIAM J. Comput. 4, 147-162.
[43] Kuck, D., and Muraoka, Y. (1974). Bounds on the parallel evaluation of arithmetic expressions using associativity and commutativity. Acta Informatica 3, 203-216.
[44] Kuck, D., Muraoka, Y., and Chen, S. C. (1972). On the number of operations simultaneously executable in FORTRAN-like programs and their resulting speed-up. IEEE Trans. Comput. C-21, 1293-1310.
[45] Kung, H. T. (1974). New algorithms and lower bounds for the parallel evaluation of certain rational expressions. Proc. Sixth Annual ACM Symposium on Theory of Computing, 1974.
[46] Lamport, L. (1974). The parallel execution of DO loops. Comm. ACM 17, 83-93.
[47] Lawrie, D. (1973). Memory-processor connection networks. Rep. No. 73-557, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[48] Lawrie, D. (1975). Access and alignment of data in an array processor. IEEE Trans. Comput. C-24.
[49] Leondes, C., and Rubinoff, M. (1952). DINA, a digital analyzer for Laplace, Poisson, diffusion, and wave equations. AIEE Trans. (Commun. Electron.) 71, 303-309.
[50] Martin, D., and Estrin, G. (1967). Models of computations and systems: evaluation of vertex probabilities in graph models of computations. J. ACM 14, 281-299.
[51] Maruyama, K. (1973). On the parallel evaluation of polynomials. IEEE Trans. Comput. C-22, 2-5.
[52] Minsky, M. (1970). Form and content in computer science. J. ACM 17, 197-215.
[53] Muller, D. E., and Preparata, F. P. (1975). Restructuring of arithmetic expressions for parallel evaluation. Rep. No. R-676, Coordinated Science Lab., Univ. of Illinois at Urbana-Champaign.
[54] Muraoka, Y. (1971). Parallelism exposure and exploitation in programs. Rep. No. 71-424, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[55] Muraoka, Y., and Kuck, D. (1973). On the time required for a sequence of matrix products. Comm. ACM 16, 22-26.
[56] Opferman, D. C., and Tsao-Wu, N. T. (1971). On a class of rearrangeable switching networks. Bell Syst. Tech. Journ. 50, 1579-1618.
[57] Pease, M. C. (1968). An adaptation of the fast Fourier transform for parallel processing. J. ACM 15, 252-264.
[58] Ravi, C. V. (1972). On the bandwidth and interference in interleaved memory systems. IEEE Trans. Comput. C-21, 899-901.
[59] Russell, E. C. (1963). Automatic assignment of computational tasks in a variable structure computer. M.S. thesis, Univ. of Calif., Los Angeles.
[60] Sameh, A. H., and Brent, R. P. (1975). Solving triangular systems on a parallel computer. Submitted for publication.
[61] Sameh, A. H., Chen, S. C., and Kuck, D. (1974). Parallel direct Poisson and biharmonic solvers. Submitted for publication; also Rep. No. 74-684, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[62] Sameh, A. H., and Kuck, D. J. (1975). Linear system solvers for parallel computers. Submitted for publication; also Rep. No. 75-701, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[63] Senzig, D. (1967). Observations on high-performance machines. Proc. AFIPS Fall Jt. Comput. Conf., 1967, pp. 791-799.
[64] Stellhorn, W. H. (1974). A specialized computer for information retrieval. Rep. No. 74-637, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[65] Strebendt, R. E. (1974). Program speedup through concurrent record processing. Rep. No. 74-638, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[66] Swanson, R. C. (1974). Interconnections for parallel memories to unscramble p-ordered vectors. IEEE Trans. Comput. C-23, 1105-1115.
[67] Towle, R. (1974). Control structures inside DO loops. Ph.D. thesis proposal, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[68] Watson, W. J., and Carr, H. M. (1974). Operational experiences with the TI advanced scientific computer. Proc. National Comput. Conf., 1974, pp. 389-397.
[69] Winograd, S. (1975). On the parallel evaluation of certain arithmetic expressions. J. ACM 22, 477-492.

[Figure 1. Expression of 8 Atoms]
[Figure 2. Tree-Height Reduction by Associativity: T_1 = 3, T_2 = 2, S_2 = 3/2, E_2 = 3/4]
[Figure 3. Tree-Height Reduction by Commutativity: T_1 = 3, T_2 = 2, S_2 = 3/2, E_2 = 3/4]
[Figure 4. Tree-Height Reduction by Distributivity: T_1 = 4, T_3 = 3, S_3 = 4/3, E_3 = 4/9]
[Figure 5. Program Graph (original G_1)]
[Figure 6. Distributed Graph (distributed G_1)]
[Figure 7. Speedup vs. Processors]
[Figure 8. Overall Machine Organization: M memories, alignment networks, N arithmetic-logic units]
[Figure 9. One-Dimensional Array Storage]
[Figure 10. Straight Storage (m = 4)]
[Figure 11. Skewed Storage (m = 4)]
[Figure 12. Skewed Storage (m = 5)]
[Figure 13. Skewed Storage (m = 2n = 8)]
[Figure 14. S/C Classes. Left column (less than maximum possible speedup): array of arithmetic expressions; search of an unsorted list; multiply n matrices (p = n^3); multiply two matrices (p = n^2); linear system; wave-front, 3 dimensions; multiply n matrices (p = n^2); triangular system (p = n); Poisson, iterative; wave-front, 2 dimensions; FFT; Poisson, direct; search of a sorted list; nonlinear recurrences; serial computation. Right column (best known speedups): multiply two matrices (p = n^3); arithmetic expression; simple linear recurrence; tridiagonal linear system; tridiagonal matrix eigenvalues; multiply n matrices (p = n^4); small triangular system (p = 64). All arrays are n x n.]

Table I. 1973 CACM DO Loop Summary

                     Recurrence Type
IF Type          1        2        3
No IF           51       24        2        3
A
Prefix B        24        6        1        1
Postfix B        3        8        1
C                1        1
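The skewed storage figures above (Figs. 10 through 13) can be read as a mapping from array subscripts to memory units. The sketch below assumes one simple skewing rule, placing a(i, j) in unit (i + j) mod m; this is illustrative only and not necessarily the exact rule behind the figures, but it shows the point of skewing: with m = 5 units the rows, columns, and main diagonal of a 4 x 4 array are all fetched conflict-free, while straight storage concentrates a column in one unit and m = 4 skewing still conflicts on the diagonal.

```python
def straight_unit(i, j, m):
    """Straight storage (cf. Fig. 10): a(i, j) goes to unit j mod m."""
    return j % m

def skewed_unit(i, j, m):
    """Simple skewed storage (cf. Figs. 11-13): a(i, j) goes to unit (i + j) mod m.
    (One common skewing rule, assumed here for illustration.)"""
    return (i + j) % m

def conflict_free(mapping, index_pairs, m):
    """True if the accessed elements land in pairwise distinct memory units."""
    units = [mapping(i, j, m) for i, j in index_pairs]
    return len(set(units)) == len(units)

n = 4
row = [(2, j) for j in range(n)]        # a row of a 4 x 4 array
col = [(i, 1) for i in range(n)]        # a column
diag = [(i, i) for i in range(n)]       # the main diagonal

for name, mapping, m in (("straight, m=4", straight_unit, 4),
                         ("skewed,   m=4", skewed_unit, 4),
                         ("skewed,   m=5", skewed_unit, 5)):
    results = {label: conflict_free(mapping, acc, m)
               for label, acc in (("row", row), ("column", col), ("diagonal", diag))}
    print(name, results)

# Expected: straight m=4 fails on columns, skewed m=4 fails on the diagonal,
# and skewed m=5 is conflict-free for all three access patterns.
```

Choosing m larger than, and relatively prime to, the access strides (here m = 5 for a 4 x 4 array, as in Fig. 12) is what makes these partitions simultaneously conflict-free; an alignment network of the kind shown in Fig. 8 then unscrambles the skewed order for the arithmetic units (cf. [15], [48]).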