Report No. UIUCDCS-R-75-767

PARALLEL PROCESSING OF ORDINARY PROGRAMS*

by

David J. Kuck

November 1975

Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

* This work was supported in part by the National Science Foundation under Grant No. US NSF DCR73-07980 A02.

ABSTRACT

This paper is concerned with executing programs on a number of processors. Various machine configurations are discussed. Algorithms for transforming programs are given and the results of transforming a number of real programs are presented. The relationships between machine organization and program organization are emphasized.

Key Words

Array operations
Parallel machines
Pipeline machines
Program transformations
Recurrence methods
Speedup bounds

TABLE OF CONTENTS

1. Introduction
2. Theoretical Fundamentals
3. Program Analysis
4. Machine Considerations
5. Conclusions
References

LIST OF FIGURES

1. Expression of 8 Atoms
2. Tree-Height Reduction by Associativity
3. Tree-Height Reduction by Commutativity
4. Tree-Height Reduction by Distributivity
5. Program Graph
6. Distributed Graph
7. Speedup vs. Processors
8. Overall Machine Organization
9. One-Dimensional Array Storage
10. Straight Storage (m = 4)
11. Skewed Storage (m = 4)
12. Skewed Storage (m = 5)
13. Skewed Storage (m = 2n = 8)
14. S_j, C_P Classes

LIST OF TABLES

1. 1973 CACM DO Loop Summary

1. Introduction

In the 1965-75 decade we have seen a number of changes in computer technology and computer use. Integrated circuits have arrived and with them have come large, fast semiconductor memories; microprocessors which can be used as components; and the potential for a variety of new system architectures. Users of computers in this period have become quite concerned about the reliability of their hardware and software. They have also come to expect computer services to fit their needs, whether this be through a personal minicomputer, a supercomputer at the other end of a network, or a special purpose computer in their computer center.

In the midsixties, there were many debates about which direction computer organization would go. Stacks vs. registers, binary vs. hexadecimal, time sharing vs. batch processing vs. remote batch were all being discussed. Whether fast computers should be organized as multiprocessors, array processors, pipeline processors, or associative processors was widely discussed. The discussions were often mainly emotional, with no substantive arguments to back them up. Not surprisingly, everybody won in many of these contests. Indeed, computers are wonderfully flexible devices and can be twisted into many forms with great success. This is not to say that every new computer organization is a good idea and will survive.
In fact, in this decade the entire computer divisions of several major companies have failed. Nor is it to say that we lack ideas of universal applicability. As examples, hierarchies of virtual memory and microprogrammed control units have at last been adopted, if not discovered, by just about everybody.

In the midseventies, one can still hear a number of debates. Some of them have not changed from the midsixties. There are also new ones about how to make reliable hardware and software, how to bring computer services to ordinary people, and how to exploit ever higher levels of integrated circuit technology. The latter subject obviously provides one of the cornerstones of the whole subject of computer design and use. While circuit speeds have improved in the past decade, their costs have improved even more. Thus designers can afford to use more and more gates in computer systems. But some of the traditional design considerations have changed. At the present time, printed circuit board and wire delays often dominate gate delays in system design. Thus computer organization itself would now seem to be a more important question than circuit type and gate count minimization at the hardware level of computer design.

It is also clear that by properly organizing a machine, various software features can be more easily supported. Stacks, extra bits, special registers and instructions are examples of this. And the availability of low cost integrated circuits makes all of these feasible, even in low cost machines.

Besides these hardware and software considerations, a computer designer must worry about what applications are to be served by his machine. In the days of truly general purpose computers, "Any color you want as long as it's black," was sufficient. But general purpose machines in this sense have disappeared. Even the IBM 360 series provided specialization in terms of size and speed. And the list of specialization improvements over that series is very long. It ranges from minis to supers, from scientific to banking, and from real time to background specializations.

Some people would argue that software design is a much more important question than computer system design. As support they would offer the skyrocketing costs of software and the sharply dropping hardware costs. However, these observations probably support the converse position even more. In view of decreasing hardware costs, how can we better organize machines so that software and applications are better and less expensively handled? Indeed, software is not an end in itself.

One of the main points of confusion and disagreement over the design of better hardware and software has always been in deciding on goals. People of one background or another tend to have biases of one kind or another. And an entire computer and software system has so much complexity that it is difficult for one person to think about all the interrelating details. Even if a "perfect" system could be conceptualized by one person, he could not build it by himself. Many people must be involved. Indeed this has been the downfall of many systems, the first such having been Charles Babbage's Analytical Engine.

The above remarks are well understood by any serious hardware or software designer. A point that is not so well understood, or at least is widely ignored, is that a good system's main goal is to serve its end users well.
Many computers and many software systems are designed from beginning to end with only a nod to the ultimate users, the main design goals being whatever "improvements" the designers can make over their previous designs. Such "improvements" are often intrinsic to the hardware or software and may or may not be reflected in what the end user sees.

The standard way to design any big system, hardware, software, highway or bridge, is to break it up into hierarchies and subparts. When parts which are analytically tractable are found, the proper analysis provides a solution for that part. Other parts are solved using the best intuition the designer has. A key question in improving the system design procedure would seem to be the following. How can we integrate users' problems into the design procedure? The answer to this is not obvious. Usually, users do not know exactly what they want. They often know what was wrong with their previous system. But solving these problems is often similar to making improvements of an intrinsic hardware or software nature as mentioned above. They do not lead to a global, system improvement. A partial answer to what the user wants may be found in looking at the programs he runs. While these may not reflect exactly the algorithms he wishes to run, they at least provide an objective measure of what he is doing. If a system could handle these programs well, then at least for a short while, the user would be happy.

In this paper we will consider several aspects of the problem of integrating the analysis of users' programs with the design of computer systems. To tackle the problem in a concrete way, it is reasonable to restrict our initial study. We will first deal with Fortran programs because there are many of them in existence and because the language is quite simple. Our primary design goal will be the fast execution of programs. This goal is indeed probably the primary objective of "users" ranging from computer center managers who want high system throughput to individual programmers who want fast turnaround. Of course, reliability, ease of use, quality of results and so on are also important, but we will deal with one problem at a time. The ideas we will discuss are applicable to programs in languages other than Fortran--we use it because it is popular and of relatively low level--as we shall mention in section 5.

Speed Limits

What are the factors that limit the speed of machine computation? Or, to sharpen the question a bit: Given a set of programs, what determines how fast they can be executed? Basically, there are two aspects to the answer. One aspect concerns the physics of the hardware being used. The other aspect concerns the logic of the machine organization and the program organization.

The physical limits on computer speed are rather widely understood. Gates can switch in some fixed time, and we must pay for the sum of a number of gate delays to perform various computer operations. Additionally, signals must propagate along printed circuit boards and on wires between boards, and these delays are often larger than gate delays. We shall not concern ourselves with these questions. Rather, we shall assume that circuit speeds are fixed and consider the logical problems involved.

The logic of machine organization has been studied for many years. Greater machine speed through simultaneity or parallelism has been a subject of much study. Parallelism has been used at many levels, between bits, between words, between functions, and so on.
Shortly we shall give more details of this. The relations between the organization of a machine and the organization of a program to be executed on that machine have not been studied much. Of course, the compilation of a Fortran program for execution on a serial computer is a special case of this. But compiler theory has mainly been developed with respect to languages. The semantic or machine related aspects of compilation are usually handled in ad hoc ways. Beyond this, we are really concerned with the syntactic transformation of algorithms, given in the form of programs, into forms which exhibit high amounts of parallelism. At the same time, we are interested in clarifying what kinds of machine organizations correspond to the parallel forms of programs. Thus we seek program transformations and machine organizations which together allow for high speed and efficient execution of any given serial programs.

A properly developed theory will have a number of benefits. For one thing, it will allow us to see what the logical limits of computation speed are (in contrast to the physical limits), and to see how close to them we are operating. It will also give us constructive procedures for designing machines and compilers for those machines. Another benefit, which we discuss below, is that we can obtain a unified approach to logic design and compiler design, since abstractly, many of the problems are identical.

Our analysis of Fortran-like programs can be carried out at several levels. First, we can consider the elementary statements in the language, e.g., assignment, IF, DO, etc. Then we can consider whole programs and see how these statements fit together. This can be done at an abstract level and also by studying real programs using the abstract theory. The paper also contains several discussions of algorithms in other programming languages, but we will not develop these points at much length here.

With a good understanding of the structure of programs behind us, it is proper to consider machine organizations. In this paper we will mainly discuss processor, switch and primary memory design. Control units and memory hierarchies are being studied in a similar way, but these areas are not as well developed at this point.

Our long term objective is to develop methods for the rational design of computer systems which are well matched to the classes of programs they are to execute. By developing our ideas theoretically, we can see what our ultimate objectives in terms of bounds might be. We can also observe that several dissimilar aspects of computer system design consist of ideas which are identical at a theoretical level. Thus a coherent body of theoretical material can be used at the logic design level and also at the compiler design level.

Logic Design and Compiler Uniformity

To give an intuitive overview of our ideas, let us begin with a few simple examples. The basic question is: how fast can we carry out certain functions? First, consider the problem of performing logical operations on two n-bit computer words a = (a_1 ... a_n) and b = (b_1 ... b_n). If we have a = (101101) and b = (011001), then the result of a logical OR, defined bitwise as (a_i + b_i), is c = (111101), and the result of a logical AND, defined bitwise as (a_i · b_i), is d = (001001). Note that either the AND or the OR function is performed on pairs of bits of a and b, independently of all other bits in the words.
Hence it is obvious that either of these functions can be computed in a time (say, one gate delay) which is independent of the number of bits involved. This assumes that we can use as many gates as we need; in this case the number of AND and OR gates will be proportional to n.

Now let us turn our attention to arithmetic operations rather than logical operations. Again consider a = (a_1, ..., a_n) and b = (b_1, ..., b_n), but now let the a_i and b_i be full computer words, each representing a number. If we have a = (3,5,2,1,0,7) and b = (1,2,3,4,5,6), then the result of a vector add is c = (4,7,5,5,5,13) and the result of a vector multiply is d = (3,10,6,4,0,42). Just as in the logical case, we can perform all of the above arithmetic operations independently of one another. Thus, regardless of n, the dimension of the vectors, we can form c in one add time or d in one multiply time. This assumes that we have n adders or n multipliers available.

Next, let us consider some more difficult problems at the bit and arithmetic level. Suppose we have one computer word a = (a_1 ... a_n) in which bit a_i corresponds to the occurrence of some event. In other words, let a_i = 1 if event e_i has occurred and a_i = 0 otherwise. Further, let b and c be one-bit indicators defined as follows. If any of events e_1 ... e_n have occurred, we want b to be 1, otherwise b = 0. And if all of events e_1 ... e_n have occurred, we want c to be 1, otherwise c = 0. What are the fastest possible designs for logical circuits which compute b and c?

It is intuitively clear that these problems are more difficult than those we discussed above, namely, the pairwise logical AND and OR problems. Here, all bits in the word a are involved in the computation of b and c. A simple way to solve this, which also turns out to be the fastest, is the following. To form b, we compute a_1 + a_2, a_3 + a_4, ..., a_{n-1} + a_n simultaneously using n/2 OR gates (assuming n = 2^k). Then we compute (a_1 + a_2) + (a_3 + a_4) and so on, fanning in the result to a single result bit b, as shown in Figure 1. If we replace the logical OR by a logical AND in the above discussion, we form the result c. It is not difficult to prove that for such problems, this kind of fan-in approach yields the best possible result. The technique is useful in many logic design problems.

Now we consider the arithmetic problems corresponding to the above logic design questions. If a = (a_1, ..., a_n) is a vector of n numbers stored in a computer, suppose we want to compute the sum and product

    b = Σ_{i=1}^{n} a_i ,    c = Π_{i=1}^{n} a_i .

Instead of dealing with gates, we must now consider adders or multipliers as our basic building blocks. Again, the best solutions to these problems are obtained by simply fanning in the arguments. The tree of Figure 1 illustrates this. If we now interpret the + as an arithmetic addition, the result is b, and similarly for c.

We see from the above discussion that for a class of computations which require the interaction of more than two data elements, more time is required than was needed by our first type of computation. In particular, for these calculations, if n arguments are involved then we need ⌈log n⌉ operation times to compute the result. An operation time may be a logical OR or AND, or it may be an arithmetic add or multiply.

Finally, let us consider an even messier kind of computation. Suppose we have two n-bit words a = (a_1 ... a_n) and b = (b_1 ... b_n) and we want to compute

    c_i = 0    for i = 0 ,
    c_i = a_i + b_i · c_{i-1}    for 1 ≤ i ≤ n .
This may seem to be a strange logic design problem. It is not very unusual, however, since it forms the heart of the design of a binary adder. In particular, this recurrence relation accounts for carry generation, which is the main time delay in an adder circuit. How much time is required to compute the vector of carry bits c? The solution of this problem is not as obvious as were the solutions of our earlier problems. Before discussing how to solve it, let us consider an analogous arithmetic problem.

It frequently occurs in numerical computation that we wish to evaluate a polynomial

    P(x) = a_n + a_{n-1} x + a_{n-2} x^2 + ... + a_0 x^n .

Traditionally, we are told to do this using Horner's rule

    h(x) = a_n + x(a_{n-1} + x(a_{n-2} + ... + x(a_1 + x a_0) ...))     (1)

since it requires only O(n) operations. This can be restated as a program of the form

    P = A(0)
    FOR I = 1 TO N
      P = A(I) + X * P

which computes the expression h(x) from the inside out in N iterations.

It is clear that both problems share an important property. Both are recurrences in which step i depends on the result which was computed during step i-1. This may initially give us the sinking feeling that no speedup is possible here. To show that O(n) steps are not required to compute such linear recurrence functions is in general a nontrivial problem which has been studied in many forms. We will give more attention to this at the program level later. At the logic design level, it is discussed in Chen and Kuck [21], where algorithms are given for transforming any linear sequential circuit specification into a fast combinational circuit. Time and component bounds are given for such circuits as adders, multipliers and one's position counters, which compare favorably with those derived by traditional methods.

We can derive two important theoretical problems from the above. One is tree-height reduction for arithmetic expression parse trees or for combinational logic expressions. The other is the fast solution of linear recurrences derived either from programs or from logic design problems. If we have fast, efficient procedures for solving both types of problems in a parallel way, we will have a good understanding of important theoretical aspects of compiler writing and logic design automation, respectively. In section 2, we will discuss some details of these problems. It can, in fact, be shown that any arithmetic expression or simple linear recurrence containing O(n) arguments can be solved in O(log n) time steps. The width of the tree (number of processors) needed is just O(n) in either case. Thus for any of these calculations, which require O(n) time steps if performed serially, we can speed them up by a factor of O(n/log n), using just O(n) processors. As we saw above, some computations can be speeded up even more (e.g., by O(n)). And as we shall see in section 2, the best known speedups for some computations are much less than this.

Using this theoretical background, we will turn our attention to the analysis of whole programs in section 3. We will consider transformations of blocks of assignment statements, loops, conditional statements and program graphs. Algorithms for such transformations, as well as resulting time and processor bounds, will be discussed. Such algorithms can serve as the basis for program measurement to aid in the design of effective machines. They can also be used as a model of a compiler for parallel or pipeline computers.
Above the bit-level of logic design, these are our primary motivations, but we can also interpret our work in several other ways. Since we are really engaged in a study of the structure of programs, our results seem useful in the contexts of structured programming (e.g., since we remove GOTOs) and also memory hierarchy management (e.g., since we can reduce program page space-time products).

In section 4, we discuss some aspects of real computers, including the accessing, aligning, and processing of data. We also sketch some results from our analysis of a number of real Fortran programs. To relate parallel machine organizations to algorithm organizations, we give a cost/effectiveness measure and a number of examples of its use.

2. Theoretical Fundamentals

The two basic building blocks of any numerical program are arithmetic expressions and linear recurrences. In this section we will give upper bounds on the time and number of processors needed for the fast parallel evaluation of both of these. In order to keep our discussion simple, we will (in this section) ignore memory and data alignment times as well as control unit activity. Although we will bound the number, we assume as many processors as needed are available. We will assume that each arithmetic operation takes one unit of time.

Our recurrence methods allow all processors to perform the same operation at the same time. This SIMD (single instruction, multiple data [31]) operation is the simplest for a parallel or pipeline machine organization to perform. Our arithmetic expression bounds assume that different processors can perform different operations at the same time. This MIMD (multiple instruction, multiple data [31]) behavior assumes a more complex control unit. However, it is obvious that the bounds need be adjusted by only a small constant to allow them to be used for SIMD machines. In the worst case we can assume a machine which simply cycles through each of the four arithmetic operations on each "macro-step", although more delicate schemes are easy to devise. In any case, most of our speedup in most programs comes from the speedup of linear recurrences. Subscripted arithmetic expressions (which are not recurrences) inside loops can simply be handled as trees of arrays, so SIMD operation holds.

We will give a number of results about tree-height reduction first. This theory has been well developed and we will give more results than are justified for practical compilation. But we will indulge ourselves a bit, since the material is interesting in an abstract sense, at least.

If T_p is the number of unit time steps required to perform some calculation using p ≥ 1 processors, we define the speedup of the p processor calculation over a uniprocessor as

    S_p = T_1 / T_p ≥ 1

and we define the efficiency of the calculation as

    E_p = S_p / p ≤ 1

which may be regarded as actual speedup divided by the maximum possible speedup using p processors. For various computations we will discuss the maximum possible speedup known according to some algorithm, and in such cases we use P to denote the minimum number of processors known to achieve this maximum speedup. In such cases we will use the notation T_P, S_P and E_P to denote the corresponding time, speedup and efficiency, respectively. Time and processor bounds for some computation A will be expressed as T_P[A] and P[A] in the minimum time cases and T_p[A] in the restricted processor (p < P) case.
When no ambiguity can result, we will write T[A] or just T in place of T_P[A], and P in place of P[A], for simplicity. We write log x to denote log_2 x and ⌈x⌉ for the ceiling of x.

Arithmetic Expression Tree-Height Reduction

Now we consider time and processor bounds for arithmetic expression evaluation. We restrict our attention to transforming expressions using associativity, commutativity and distributivity, which leads us to speedups of O(n/log n) at efficiencies of O(1/log n). Since this is asymptotic to the best possible speedup, more complex transformations (e.g., factoring, partial fraction expansion) seem unnecessary.

Definition 1

An arithmetic expression is any well-formed string composed of the four arithmetic operations (+, -, *, /), left and right parentheses, and atoms which are constants or variables. We denote an arithmetic expression E of n distinct atoms by E<n>.

If we use one processor, then the evaluation of an expression containing n operands requires n - 1 units of time. But suppose we may use as many processors as we wish. Then it is obvious that some expressions E<n> may be evaluated in ⌈log n⌉ units of time, as illustrated in Fig. 1. In fact, we can establish, by a simple fan-in argument, the following lower bound:

Lemma 1

Given any arithmetic expression E<n>,

    T[E<n>] ≥ ⌈log n⌉ .

On the other hand, it is easy to construct expressions E<n> whose evaluation appears to require O(n) time units regardless of the number of processors available. Consider the evaluation of a polynomial by Horner's rule as in section 1. A strict sequential order is imposed by the parentheses in Eq. 1, and more processors than one are of no use in speeding up this expression's evaluation.

However, we are not restricted to dealing with arithmetic expressions as they are presented to us. For example, the associative, commutative, and distributive laws of arithmetic operations may be used to transform a given expression into a form which is numerically equivalent to the original but which may be evaluated more quickly. We now consider examples of each of these.

Fig. 2a shows the only parse tree possible (except for isomorphic images) for the expression (((a + b) + c) + d). This tree requires three steps for its evaluation and we refer to this as a tree height of three. However, by using the associative law for addition we may rearrange the parentheses and transform this to the expression (a + b) + (c + d), which may be evaluated as shown in Fig. 2b with a tree height of two. It should be noted that in both cases, three addition operations are performed.

Fig. 3a shows a parse tree for the expression a + bc + d; again we have a tree of height three. In this case the tree is not unique, but it is obvious that no lower height tree can be found for the expression by use of associativity. But by use of the commutative law for addition we obtain the expression a + d + bc and the tree of Fig. 3b, whose height is just two. Again we remark that both trees contain three operations.

Now consider the expression a(bcd + e) and the tree for it given in Fig. 4a. This tree has height four and contains four operations. By use of associativity and commutativity, no lower height tree can be found. But, using the arithmetic law for the distribution of multiplication over addition, we obtain the expression abcd + ae, which has a tree of minimum height three, as shown in Fig. 4b. However, unlike the two previous transformations, distribution has introduced an extra operation; the tree of Fig. 4b has five operations compared to the four operations of the undistributed form.
Having seen a few examples of arithmetic expression tree-height reduction, we are naturally led to ask a number of questions. For any arithmetic expression, how much tree-height reduction can be achieved? Can general bounds and algorithms for tree-height reduction be given? How many processors are needed? To answer these questions, we present a brief survey of results concerning the evaluation of arithmetic expressions. Details and further references may be found in the papers cited.

Assuming that only associativity and commutativity are used to transform expressions, Baer and Bovet [2] gave a comprehensive tree-height reduction algorithm based on a number of earlier papers. Beatty [7] showed the optimality of this method. An upper bound on the reduced tree height, assuming only associativity and commutativity are used, given by Kuck and Muraoka [43], is the following.

Theorem 1

Let E<n> be any arithmetic expression with depth d of parenthesis nesting. By the use of associativity and commutativity only, E<n> can be transformed such that

    T_P[E<n>] ≤ ⌈log n⌉ + 2d + 1    with    P ≤ ⌈n/2⌉ .

Note that if the depth of parenthesis nesting d is small, then this bound is quite close to the lower bound of ⌈log n⌉. The complexity of this algorithm has been studied in [13], where it is shown that in addition to the standard parsing time, tree-height reduction can be performed in O(n) steps.

Unfortunately, there are classes of expressions, e.g., Horner's rule polynomials or continued fractions, for which no speed increase can be achieved by using only associativity and commutativity. Muraoka [54] studied the use of distributivity as well as associativity and commutativity for tree-height reduction and developed comprehensive tree-height reduction algorithms using all three transformations. An algorithm which considers operations which take different amounts of time is presented by Kraska [37]. Bounds using associativity, commutativity and distributivity have been given by a number of people [12,42,53]. In [12] the following theorem is proved.
If any of these four operations take one matrix operation time step, then any matrix expression of n arrays can be evaluated in 61og n matrix operation steps. The coefficient is the sum of three addition times, two multiplication times and one inversion time. Matrix addition and multiplication are straightforward, but the time required to invert a matrix measured in standard arithmetic 21 operations varies, depending on the method used (c.f., [62]). Most arithmetic expressions appearing in real programs have a rather small number of atoms. If the atoms are subscripted, then the arrays may be quite large, but it is usually advisable to evaluate these as a tree of arrays, one array operation at a time. If tree-height reduction techniques are used on such expressions, there are two possibly bad consequences. One is the passing from SIMD to MIMD operation as dis- cussed earlier. The other is that redundant operations are generally introduced, making the overall computation less efficient. However, for expressions outside loops, for unsubscripted expressions inside loops or for expressions of small arrays inside loops, tree-height reduction can be of value. The number of processors required to evaluate such expressions is usually less than are required for recurrence solving, as we shall see later. However, there may be cases where tree-height reduction is desirable, but the number of available processors is wery small. The following results cover this case and are theoretically interesting. Corollary 1 Given any expression E and p processors for its evaluation, by the use of associativity, commutativity and distributivity, E can be transformed such that T p [E] £41og n + 10(n-l)/p . This is a corollary of Theorem 2 and was proved by Brent [12]. This result has been improved by Winograd [69], who shows that if p processors 5n 2 are available, we can evaluate any E in y- + 0(log n) steps. For small p, this result is an improvement on Corollary 1. 22 We have given a number of different upper bounds on time and processors for arithmetic expression evaluation. While the only lower bound (Lemma 1) is naive, the upper bounds are close enough to it for practical purposes. It is clear that the theory is quite well developed and improve- ments on these results will be quite difficult to obtain. We conclude that any arithmetic expression E can be evaluated in 0(log n) steps at an efficiency of 0(1 /log n). 23 Recurrence Relations Linear recurrences share with arithmetic expressions a role of central importance in computer design and use. But they are somewhat more difficult to deal with. While an expression specifies a static computational scheme for a scalar result, a recurrence specifies a dynamic procedure for computing a scalar or an array of results. Linear recurrences are found in computer design, numerical analysis and program analysis, so it is important to find fast, efficient ways to solve them. Recurrences arise in any logic design problem which is expressed as a sequential machine. Also, almost every practical program which has an iterative loop contains a recurrence. While not all recurrences are linear, the vast majority found in practice are, and we shall concentrate first on linear recurrences. We shall begin with several examples. First, consider the problem of computing an inner product of vectors a = (a,,..., a ) and b = (b,,...,b ). 
This can be written as a linear recurrence of the form

    x = x + a_i b_i ,    1 ≤ i ≤ n     (2)

where x is initially set to zero and finally set to the value of the inner product of a and b. As another example of a linear recurrence which produces a scalar result, the evaluation of a degree n polynomial P_n(x) in Horner's rule form can be expressed as

    p = a_i + x p ,    2 ≤ i ≤ n     (3)

where p is initially set to a_1 and finally set to the value of P_n(x).

Techniques to handle both of these recurrences should be familiar from our discussion of expression evaluation. Note that Eq. 2 can be expanded by substituting the right-hand side into itself (statement substitution) as follows:

    x = a_1 b_1
    x = a_1 b_1 + a_2 b_2
    x = a_1 b_1 + a_2 b_2 + a_3 b_3

After n iterations we have an expression which can be mapped onto a tree similar to that of Figure 1. Earlier, we have also discussed polynomial evaluation. Thus by carrying out a procedure similar to the above, we could obtain an expression which could be handled by tree-height reduction. Thus, we would expect that these and similar recurrences could be solved in T_P = O(log n) time steps using P = O(n) processors.

But there are other, more difficult looking linear recurrences. For example, a Fibonacci sequence can be generated by

    f_i = f_{i-1} + f_{i-2} ,    3 ≤ i ≤ n     (4)

where f_1 = f_2 = 1. As another example, consider the addition of two n-bit binary numbers a = a_n ... a_1 and b = b_n ... b_1. The propagation of the carry across the sum can be described by

    c_i = y_i + x_i · c_{i-1} ,    1 ≤ i ≤ n     (5)

where c_0 = 0, x_i = a_i + b_i and y_i = a_i · b_i. Here we use + to denote logical or and · to denote logical and. This is an example of a bit level linear recurrence, in contrast to our previous examples whose arguments were assumed to be real numbers.

In both Eq. 4 and Eq. 5 we are required to generate a vector result because of the subscripted left-hand side. This is in contrast to the scalar results of Eqs. 2 and 3. Because of this, we can expect a good deal more difficulty in trying to obtain a fast, efficient solution to these recurrences.

With the above as an introduction, we now turn to a formalization of the general problem. We will then give bounds for the solution of the general problem and several important special cases.

Definition 2

An m-th order linear recurrence system of n equations, R<n,m>, is defined for m ≤ n by

    x_i = 0    for i ≤ 0 ,
    x_i = c_i + Σ_{j=i-m}^{i-1} a_{ij} x_j    for 1 ≤ i ≤ n .

If m = n we call the system a general linear recurrence system and denote it by R<n>.

Note that we can express any linear recurrence system in matrix terms as x = c + Ax, where c = (c_1, ..., c_n)^t, x = (x_1, ..., x_n)^t, and A is a strictly lower triangular (banded if m < n) matrix with a_{ij} = 0 for i ≤ j or i - j > m. We refer to A as the coefficient matrix, c as the constant vector and x as the solution vector.

It should be observed that the constant vector and coefficient matrix generally contain values which can be computed before the recurrence evaluation begins. Thus, the x_i and y_i values of Eq. 5 would be precomputed from the a_i and b_i. We will assume that the elements of c and A are precomputed (if necessary) in all cases so that our bounds on recurrence evaluation can be simply stated, and that m and n are powers of 2.

How can we solve an R<n> system in a fast, efficient way using many simultaneous operations? The following is a straightforward way which uses O(n) processors to solve the system in O(n) steps.
Column Sweep Algorithm

Given any R<n> system, we initially know the value of x_1. On step 1 we broadcast this value, c_1, to all other equations, multiply it by a_{i1}, and add the result to c_i. Since we now know the value of x_2, this leads to an R<n-1> system which can be treated in exactly the same way. Thus after n - 1 steps, each of which consists of a broadcast, a multiply and an add, and each of which generates another x_i, we have the solution vector x. The method requires n - 1 processors on step 1, and fewer thereafter, so T_P = 2(n - 1) with P = n - 1.

What speedup and efficiency have we achieved by this method? The time required to solve this system using a single processor, which might sweep the array by rows or columns, would be

    T_1 = 2[1 + 2 + ... + (n - 1)] = 2[n(n - 1)/2] = n(n - 1) .

Hence the above method achieves a speedup of

    S_P = n(n - 1) / 2(n - 1) = n/2

with an efficiency of

    E_P = S_P / P = n / 2(n - 1) > 1/2 .

Thus we can conclude that the Column Sweep Algorithm is a reasonable method of solving an R<n> system. But how does it perform in the R<n,m> case for m ≪ n? It can be seen that the Column Sweep Algorithm will achieve S_P = O(m) for an R<n,m> system. So if m is very small, the method performs poorly, particularly if we have a large number of processors available. It should be noted that the m ≪ n case occurs very often in practice. Note that all of our examples (Eqs. 2-5) had m ≤ 2.

What are our prospects for finding a faster algorithm? First, we observe that the total number of initial data values in an R<n,m> system is O(mn). This is the total of the constant vector c and the coefficient matrix A. Assuming that these numbers all interact in obtaining a solution, a fan-in argument (cf. Lemma 1) indicates that we need at least O(log mn) steps to solve an R<n,m> system; since m ≤ n, O(log mn) = O(log n). The Column Sweep Algorithm required O(n) steps, so we still have a big gap in time.

Product Form Recurrence Method

The next theorem is based on an algorithm for the fastest known method of evaluating an R<n,m> system. For large m, the number of processors required is rather large, but for small m, the number of processors is quite reasonable. We also give bounds for the case of a small number of processors; Corollary 4 is particularly important in the case of m < p < P. This theorem's proof can easily be stated in terms of the product form of the inverse of the coefficient matrix A [25], [60]. It is also proved in [19] and [22].

Theorem 3

Any R<n,m> can be computed in

    T_P ≤ (2 + log m) log n - (1/2)(log² m + log m)

with

    P ≤ m²n/2 + O(mn)    for m ≪ n ,
    P ≤ n³/68 + O(n²)    for m ≤ n .

The details of transforming a system to meet this bound are fairly straightforward [19]. We will give a simple example here as a basis for some intuition about how the technique of Theorem 3 works. Consider an R<4,2> system. This method would generate the following expressions for the evaluation of the x_i:

    x_1 = c_1
    x_2 = (c_2 + a_{21} c_1)
    x_3 = (c_3 + a_{31} c_1) + a_{32}(c_2 + a_{21} c_1)
    x_4 = c_4 + (a_{42} + a_{43} a_{32})(c_2 + a_{21} c_1) + a_{43}(c_3 + a_{31} c_1) .

Note that all of the parenthesized expressions can be computed simultaneously in two steps (there are just three distinct ones). Then x_4, the largest calculation, can be completed in three more steps, for

    T_P = (2 + log 2)(log 4) - (1/2)(log² 2 + log 2) = 5 .

This time bound may be achieved using just three processors in this case. But as n grows larger, the number of processors required becomes very large, as shown in the tables of [22].

In practice we may have a machine with a limited number of processors p < P, so Theorem 3 cannot be used directly. Several schemes are available for mapping a computation onto a smaller set of processors and generally increasing the efficiency of the computation as well.
In practice we may have a machine with a limited number of processors p < P so Theorem 3 cannot be used directly. Several schemes are available for mapping a computation onto a smaller set of processors and generally increasing the efficiency of the computation as well. While 30 the techniques described below may be applied to arithmetic expressions as derived from Theorems 1 or 2, the expressions found in typical programs usually do not require enough processors to warrant such reductions [36]. First, we describe a folding scheme which reduces the number of processors at a much faster rate than the computation time increases. The 31 P processor computation for R resulting from Theorem 3 contains log n stages, each stage consisting of many independent tree computations of height (log m+1) resulting from inner products of two m-vectors. Such a tree of height t will contain 2 -1 operation nodes and its evaluation requires 2 " processors. P is the maximum of the total number of proces- sors used at each stage. It is easy to show that given such a tree its height increases only one step by halving the number of processors (called one fold ), and after f folds (f <^ t - 2) are performed the tree height is (t+2 -f-2) while the number of processors is reduced to 2 /2 . If all trees at the same stage are folded uniformly, then this folding scheme can provide us T as stated below. P Corollary g Let R and P be as in Theorem 3. Then if f < log m - 1 and p = TP/2 f l, we have T < T p + (2 f+1 -f-2) log n. Another technique which is useful in mapping any computation onto a limited number of processors p < P is the sweeping scheme [44]. If the i-tri'Step of any parallel computation requires 0. operations using P proces- sors, it can be executed on p processors in leads to the following: 0. i steps. This observation Lemma 2 [12] If a computation C can be completed in T p with p operations on P processors, then C can be confuted in T < T D .+ (0 D -T D )/p for p < P. P — r r r To apply this technique directly on the algorithm of Theorem 3 , the C value can be obtained by the summation of 2»p(k) for k = 2, 4, 8, ...,n, where p(k) is the number of processors required at each stage [19]. The result of this technique can be found in [22]. 32 Our third scheme for reducing the number of processors required for an R system is called the cutting scheme . The idea is to cut the original system into a number of smaller systems and evaluate these in sequence, using the algorithm underlying Theorem 3 on each such system. We have used this scheme in [22], [61], and a detailed proof is given in [25]. Corollary 3 ' Let R and P be as in Theorem 3. Then any R can be computed with 1 < p < P processors in T D i. 2fm/pl (n-1 ) for 1 < p <_ m , < 1^ np 3 (log 2 p+271og p+144) + ^y- for m < p < m 2 , -1 - 72 np 3 0°9 2 P + 271og P+144) for m <_ p <_ m , cL C O "7 ^ <_ B^-^(log m log p+21og 2 p-p- log m-j log m+1) for m < p < P, where $(m,n,p) is a small constant. For most practical R systems in which m is wery small compared to n, if the number of processors is also very limited then a new computational algorithm developed in [19] can be used more efficiently. This method gives the following time bounds. Corollary 4 |_et R and P be as in Theorem 3 . If m < p < P> then any R can be computed in T < (?m 2 +3m)j}+ 0(m 2 log (p/m)) . 33 In summary, for 1 < p < P the time bound for evaluating a ■given R system can be determined by choosing the minimum value obtained from Corollaries 2, 3 and 4. 
Constant Coefficient Recurrences

In numerical computation, we are frequently faced with linear recurrences having constant coefficients, i.e., Toeplitz form matrices. For example, Eqs. 2, 3 and 4 are such recurrences. Thus, Eq. 2 could be rewritten as x_i = 1·x_{i-1} + a_i b_i, and similarly for Eq. 3. Intuitively, we might expect to be able to compute such systems more efficiently than the more general recurrences we have been considering. Indeed, this is the case, as we shall see below. We formalize the problem with the following definition, which should be contrasted with Definition 2.

Definition 3

An m-th order linear recurrence system with constant coefficients of n equations is defined for m ≤ n by

    x_i = 0    for i ≤ 0 ,
    x_i = c_i + Σ_{j=1}^{m} a_j x_{i-j}    for 1 ≤ i ≤ n .

If n = m we call the system a general linear recurrence system with constant coefficients.

The fastest known method for solving such a system can be summarized by the following theorem of Chen [19]. The proof follows the lines of Theorem 3, but avoids computations which are unnecessary due to the constant coefficients.

Theorem 4

Any R<n,m> with constant coefficients can be computed in

    T_P ≤ (3 + log m) log n - (log² m + log m + 1)

with

    P ≤ mn    for m ≪ n ,
    P ≤ n²/4    for m ≤ n .

By exercising special care in avoiding redundant computations, the proof of Theorem 4 can be modified [19] to give us the following.

Corollary 5

Any such R can be computed in

    T_P ≤ 2 log n    with    P ≤ n .

By comparing Theorems 3 and 4, we see that while the time bounds are about the same, substantial processor savings can be made in the constant coefficient case. In the case of small m, we have saved a factor of O(m) in processors, while in the general case we have saved a factor of O(n).

To test the quality of these bounds, we can compare them with some simple calculations. Consider the inner product of Eq. 2. This can obviously be handled using n processors (for the n multiplications) in 1 + log n steps. Since we have been assuming the coefficient matrix and constant vector are set up before the recurrence solution begins, the multiplication is really outside our present scope, so just n/2 processors would be required for the summation. The bound of Corollary 5 is thus high by a factor of two in each of processor count and time for this trivial recurrence. However, the recurrence method produces not only the inner product, but also all of the "partial inner products" x_1, x_2, ..., x_{n-1}, as well as x_n. Chen [19] has also given other variations on the above to handle these special cases of evaluating only the remote terms of recurrences.

As a final example, note that the entire Fibonacci sequence of Eq. 4 can be evaluated (since m = 2) in

    T_P ≤ 4 log n - 3    with    P ≤ 2n .

3. Program Analysis

In this section we discuss techniques for the analysis of whole programs. These techniques can be used to compile programs for parallel or pipeline computers. They can also be used to specify machine organizations for high speed computation. Since we are really just studying the structure of ordinary serial programs, our results have interpretations for ordinary virtual memory machines and structured programming as well.

Our discussion is centered on methods developed and used by the author and his students. Several significant efforts in this general area were carried out elsewhere earlier. We will briefly sketch two of the most important of these. The first major study of whole programs was carried out by Estrin and his students at UCLA in the 1960s.
They studied graphs of programs in an attempt to isolate independent tasks for parallel execution. The lowest level object considered as a task was the assignment statement, while other tasks ranged in complexity up to whole subroutines. A number of papers [29, 50, 59] reported algorithms for the analysis of programs and the results of analyzing a limited number of real programs. This group also worked on various scheduling strategies for executing program graphs.

Another effort was initiated by Bingham, Fisher and Semon at Burroughs in 1966. This group studied various aspects of the structure of programs and developed algorithms for the automatic detection of parallelism. They also investigated certain aspects of machine design related to this--particular attention was given to control unit features. Their work was reported in a series of reports including [10, 11].

Our own effort has incorporated graphs of programs as well as the two key ideas of the previous section--tree-height reduction and fast recurrence evaluation. In this way we can carry parallel execution down to the level of individual operations in assignment statements. This is of course necessary if one wants to execute programs in the fastest possible way.

In order to formalize our discussion of program graphs and their manipulation, we now present a number of definitions.

Definition 4

An assignment statement is denoted by x = E, where x is a scalar or array variable and E is a well-formed arithmetic expression. A block of assignment statements (BAS) is a sequence of one or more assignment statements with no intervening statements of any other kind.

Any BAS can be transformed by a process called statement substitution to obtain a set of expressions which can be evaluated simultaneously. For example, the BAS

    X = BCD + E
    Y = AX
    Z = X + FG

can be evaluated using one processor in 6 steps, ignoring memory activity. By statement substitution we obtain three statements which can be transformed by tree-height reduction to obtain:

    X = BCD + E;  Y = ABCD + AE;  Z = BCD + E + FG .

Since the resulting expressions can be evaluated simultaneously in three steps, we obtain a speedup of 2. By properly arranging the parse trees it may be seen that just five processors are required. Thus we have efficiency E_P = 2/5. In general, the number of processors required to evaluate a set of trees in a fixed number of steps may be minimized using an algorithm of Hu [35]. Note that the speedup here results from two effects: the simultaneous evaluation of independent trees and tree-height reduction by associativity, commutativity and distributivity.
Step 2: A set of Boolean functions, all of which may be evaluated simultaneously. Step 3: A binary decision tree through which one path will be followed for each execution of the program. No Boolean function or arithmetic expression evaluation is included in the tree. Step 4: A collection of blocks of assignment statements, each with a single variable or constant on the right-hand side. One such block is associated with each path through the tree. The details of an algorithm for the discovery and transformation of an IF block to this canonical form are given by Davis [28]. Note that the IF 39 block may be a graph with or without cycles. Such graphs are converted to trees called IF trees in the cited references. Definition 6 A loop is denoted by L • (I v + N ]s I 2 - N 2 , ..., I d - N d )(S r S 2 , .... S s ) or = (1^ I 2 , ..., I d )(S r S 2 , ..., S $ ) where I. is a loop index , N. is an ordered index set , and S. is a body statement j j j which may be an assignment statement, an IF statement or another loop. We use 0UT(S.) and IN(S.) to denote for S. the LHS (output) variable name and the j j j set of RHS (input) variable names, respectively. We will write S.(i\, i 2 , ..., ij) to refer to S. during a particular iteration step, i.e., when the index variables of S, are assigned the specific values I-j = i-j» I 2 = i^' ■••» *d = ^d" If S. is executed before S-, we will write S. < S.. We say that the relation i J i o j < defines the execution order of the statements. If a loop execution leads to the execution of n statements, we sometimes denote their execution order by writing Y-:x. = E. , 1 £ i <_ n, implying that Y. < Y. + , , 1 <_ i <_ n - 1 . Definition 7 Given a loop L = (I, + N, , . . ., I . <- N,)(S, , . . ., S ), all possible data dependencies between statement pairs S and S. are given by 0UT(S 1 (k 1 ,...,k d ))niN(S j U 1 ,...,2 d )) f for S.(k r ...,k d ) < S j (^ ] ,. . . ,* d ) . Whenever this condition is satisfied, we say that S. is data dependent on S.. and is denoted by 5-6S.. 6 is a transitive relation. All of the data ■ j dependencies can be represented by a data dependence g raph G-. of s nodes for Sj, 1 < i < s. For each S.6S. there is an arc from S. to S. . Statement S. is i — — i j • j J 40 Indirectly data dependent on S^ denoted S.AS., if there exist statements S k, * ••*■ S k such that S i 6S k 6 *" S k 5S r Practical details on determining 1 m 1 m J 1f S.6S. can be found in [67]. ' J Our definition of data dependence is much more delicate than the usua definitions [ 9, 30 ]. These definitions include the condition 0UT(S i )|lIN(Sj) f , i.e., they ignore subscripts and only check variable names Thus statements like S^ A(I) = A(I+i") + B are said to be data dependent (S.6S However, by Definition 7 we would not say S i 6S i because the values of A(I+1) are not those from A(I). In terms of Definitions 6 and 7, we can further classify loops as follows. Definition 8 We use D for data dependence relation, to denote the set of loops with at least one S.5S.., 1 < i,j < s . In other words, there is at j least one E k , 1 < k < n, which is a function of x, , for m, > 0. If [e D and none of its S i is a nonlinear function of x., 1 < j < s, we call it a linear dependence and write ULD (LDCD). The complement of D is denoted by D, for non-dependence relation. 41 Definition 7 can be applied to any (d-u+1), 1 <_ u < d, innermost nest of L as it is also a loop. This is described below. Definition 9 Let L u be the (d-u+1) innermost nest of L, 1 <_ u <_ d, I.e.) L = \ l-i , 1«» . . 
. » 1j J ^j-i » bp > . • . > j ; = -(i 1 ,i 2 ,...,i u _ 1 )(i u ,i u+ -|,...,i c j)(s. 1 ,s 2 ,...,s s ) = (I 1 .I 2 .,...I U . 1 )(L U ) . Then for fixed values of I-,, I 2 , ...» I -. » we can obtain all pairs of data dependence for L u according to Definition 8 (note that now k, = £,, ..., k , s *u~l^' which defines graph G u - Example 1 Given a loop L: DO S, DO S. DO S. Sp* B(I-i ,!«» = 1, 10 2 = 1, 10 : 3 -i. io : 3 ) = B(i r i,i 2 ,i 3 )*c(i 1 ,i 2 ) + D*E : 3 ) = a(i 1 ,i 2 -i,i 3 )*f(i 2S i 3 ) > The corresponding data dependence graphs G, , G 2 , and G~ are ft,: G 2 : G 3 : 1 42 For the set of data dependent loops, we can easily distinguish two cases: acyclic and cyclic graphs. Formally, we define these as follows: Definition 10 An acyclic dependence graph is a dependence graph of s nodes, S. for 1 <_ i <_ s, with no pair (S.,S.) such that S-AS. and S-AS.. A dependence graph which is not acyclic will be called cyclic . Given a data dependence graph, we wish to partition it into blocks that contain only one statement or a cyclic dependence graph. Formally, we define these as follows: Definition 11 On each dependence graph, G , 1 <_ u < d, for a given loop L, we define a node parti tion t of {S, ,Sp, . . . ,S } in such a way that S. and S, are in the same subset if and only if S.AS. and S.AS. . On the partition tt = {tt-, , tt ~> •••) for 1 £u <_d, define a partial ordering relatio n a in such a way that tt . a tt . (reflexive), and for i t j, tt . a tt . iff there is an arc in G from some element of tt . to some element of tt . . The a relation is also anti -symmetric and transitive. The tt . are called Tr-blocks. 43 Wave-Front Method If there are cyclic dependencies in a DO loop, we may turn to our next method, the wave-front method. This is a well-known method which effectively extracts array operations from the loop and we can then apply the above bounds to these. If the maximum speedup given by the wave-front method is insufficient, i.e., if the available processors are not all being used, we may turn to the recurrence method which gives the fastest known speedup for such problems. Example 2 L2: DO 10 I = 1, N DO 10 J = 1, N 10 W(I,J) = A(I-1,J) * W(I-1,J) + B(I,J-1) * W(I,J-1) For one or more assignment statements containing cyclic dependencies, the wave-front method yields moderate speedups with high efficiency. The idea of this method can be illustrated by the loop L2 of Example 2 in which statement 10 has a cyclic dependence in that the LHS depends on RHS values computed earlier in the loop. Note that generally, one or more statements may form a cyclic dependence. This method proceeds as follows: if W(l,l) is computed from boundary values, then we can compute W(2,l) and W(l,2) in terms of W(l,l) and boundary values. Next we can compute W(3,l), W(2,2) and W(l,3) and so on, as a wave-front passes through the W array at a 45° angle. Thus o we can compute this loop in 0(N) steps instead of the 0(N ) serial steps required. The wave-front method was first described in detail by Muraoka in [54] and was later used in [44] and was also implemented in [46]. The formalization below removes some of the restrictions included in the original formulation. In [24] a revised wave-front algorithm is presented. This includes 44 a method of determining the angle a at which the wave-front passes through the array. It also includes a method for computing the speedup as a function of a . Note that these ideas can be extended to arrays of higher dimension, as well. 
However, the wave-front method is of no value for one-dimensional arrays, since it degenerates to a serial computation in this case. A similar thing happens if a is slightly greater than 0° or slightly less than 90°. In such cases we may treat the cyclic dependence as a linear recurrence (assuming it is linear).

Loop Speedup Hierarchy
With the above fundamentals, it is possible to give some easy bounds on overall loop speedup in terms of the uniprocessor time T_1. We will present a simple hierarchy here, based on the maximum known speedups for various classes of programs. Sharper bounds will be presented later in the paper, based on more detailed loop parameters. The hierarchy of this section will provide good intuition for the following sections.

The simplest loop is L ∈ D̄, which by Definition 8 has no dependence relation between any pair of statements. Thus, following the notation of Definition 6, all x_i = E_i, 1 ≤ i ≤ n, can be computed in parallel. The following loop, which performs a matrix addition and a scalar product, has this property.

        DO S_2  I_1 = 1, 10, 2
        DO S_2  I_2 = 1, 10, 1
    S_1:    G(I_1,I_2) = A(I_1,I_2) + B(I_1,I_2)
    S_2:    Z(I_1,I_2) = C(I_1,I_2) * D(I_1,I_2)

The total time required by any L ∈ D̄ is, by Theorems 1 and 2, T_p ≤ O(log e), where e is the maximum number of atoms in E_i, 1 ≤ i ≤ n. Hence, we have for L ∈ D̄

    S_p ≥ T_1 / O(log e) = O(T_1 / log e) .

Now, let us study a slightly more complicated loop L ∈ LD, such as one that performs a vector inner product:

        DO 5 I = 1, 10
    5   T = T + A(I)*B(I) .

For any L ∈ LD, if we precompute simultaneously all subexpressions in E_i, 1 ≤ i ≤ n, which do not depend on any value computed in the loop, i.e., on any x_j, 1 ≤ j ≤ n, then the resultant statements x_i = E'_i, 1 ≤ i ≤ n, can be treated as an R system. The precomputation requires O(log e) time steps by Theorems 1 and 2, to which we add the time to solve the R system, which is stated in Theorem 3. Since n ≤ T_1 ≤ ne, we have a speedup for this subset of LD of

    S_p ≥ T_1 / (O(log m log n) + O(log e)) = O(T_1 / log T_1) .

Next, consider the subset of loops which have m = n, or m a function of n. For example, given an upper triangular matrix A, to solve Ax = b by the traditional back-substitution method we may write a loop like

        DO 5 I = 10, 1, -1
        X(I) = B(I)/A(I,I)
        DO 5 J = I + 1, 10, 1
    5   X(I) = X(I) - (A(I,J)/A(I,I))*X(J) .

In this example, if we preprocess B(I)/A(I,I) for all I, and A(I,J)/A(I,I) for all I, J, we obtain an R system. Since m = n, this is the worst-case loop of LD. Hence, we can say that the computation time of any L ∈ LD is less than O(log e) plus the time stated in Theorem 3, i.e., for any L ∈ LD

    S_p ≥ T_1 / (O(log² n) + O(log e)) = O(T_1 / log² T_1) .

Finally, we study a simple-looking, but more complicated loop:

        DO 5 I = 1, 10
    5   X(I) = (X(I-1) + A/X(I-1))/2 .

This is a familiar iterative program for approximating √A. For this loop, L ∈ D but L ∉ LD. Muraoka [54] shows that by using statement substitution, any loop with E_i being a d-th degree polynomial of x_{i-1}, d > 1, can be speeded up at most by a constant factor. Later, Kung also studied this problem [45] in a similar way. However, since we have been able to linearize a number of nonlinear recurrences, it remains an open question which techniques besides statement substitution may be used to speed up such loops.
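For the inner-product loop above, the R system that remains after precomputing the loop-invariant subexpressions is just a running sum, so it can be collapsed by pairwise (fan-in) combination in O(log n) steps. The following Python sketch is my own illustration of that idea, not the report's Theorem 3 algorithm; a and b are assumed to be numpy arrays of equal length.

    import numpy as np

    def parallel_inner_product(a, b):
        """T = sum of A(I)*B(I): first all products at once (one parallel
        step), then pairwise summation in O(log n) further steps."""
        p = a * b                                   # all products, independent
        n = len(p)
        while n > 1:
            half = n // 2
            p[:half] = p[:half] + p[n - half:n]     # independent pairwise adds
            n = half + (n - 2 * half)               # an odd leftover element stays
        return p[0]

Each pass of the while loop performs its additions independently, so with enough processors the whole reduction takes about log2(n) time steps.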
Summarizing the above, we are able to classify all loops in terms of their best known speedups over the serial computation time T_1, i.e.,

    S_p = O(T_1 / log^i T_1)   for 0 ≤ i ≤ 2 .      (6)

We call a loop Type i, 0 ≤ i ≤ 2, if its maximum speedup has the form of Equation 6, or Type 3 if its maximum speedup is of a lower order of magnitude. This was also discussed in [39].

By the wave-front method we are at best able to achieve T_p = O(T_1^{1/2}), with T_p = O(T_1) in the worst case. Thus we have S_p ≤ O(T_1^{1/2}). Since the wave-front method's speedup is always inferior to that of the recurrence method for such problems, this is consistent with our claim that Equation 6 represents a maximum speedup hierarchy.

Loop Distribution
Now we turn to the problem of compiling parallel operations from serial loops which do not contain IFs and GOTOs (we will consider these later). There are two key ideas involved here: one is the reduction of data dependence, the other is the distribution of loop control over the loop statements. We give our Loop Distribution Algorithm, which includes both of them and summarizes our handling of loops without IFs.

When constructing a data dependence graph, we wish to avoid any apparent dependencies which do not really hold for the given subscripts and index sets. This problem was first studied in a general way by Bernstein [9] for unsubscripted variables. A powerful test for subscripted variables was given by Muraoka [54]. This has been refined by Towle in [24] and in [67]. By avoiding the inclusion of spurious data dependencies, we may be able to execute more statements in parallel on machines capable of executing multiple array operations (e.g., the Texas Instruments ASC). Also, we may be able to break cyclic dependencies, thereby reducing i in Eq. 6 and yielding higher speedup.

Another way to achieve statement independence is through statement substitution. This yields increased speedup, sometimes at the cost of redundant operations. It should be used with discretion, and only in machines with a high degree of parallelism. For acyclic graphs, it is easy to demonstrate that we can perform statement substitution between any pair of nodes which have a dependence relation. As in a BAS (c.f. Definition 4), we substitute for each LHS variable of S_i appearing on the RHS of S_j (which is the cause of a dependence relation) the corresponding arithmetic expression on the RHS of S_i, with all subscript expressions properly shifted. By applying statement substitution, the dependence relation is removed and a set of independent assignment statements results. Each of these represents a vector assignment statement, all of which can be executed simultaneously. Theorems 2 and 3 can be used to bound the time and processors. A sketch of this transformation is given below.

In loops with acyclic graphs, we can thus reduce the graph for the entire loop to a set of independent nodes representing simultaneously executable array statements. However, in general, we must deal with cyclic graphs containing several interdependent nodes. Our loop distribution algorithm will be useful in handling these cases. By loop distribution we mean the distribution of the loop control statements over individual assignment statements, or collections of them, contained in the loop. The idea of loop distribution was introduced by Muraoka [54], and it was later implemented in our Fortran program analyzer to measure potential parallelism in ordinary programs [41], [44].
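The following Python sketch illustrates the statement substitution described above on a hypothetical two-statement loop of my own invention (it is not one of the report's examples). S_1 writes A and S_2 reads A(I-1), so S_1 δ S_2; substituting the shifted RHS of S_1 for A(I-1) leaves two independent vector assignments.

    import numpy as np

    def original(B, C, E, N):
        """Serial form: S2 depends on S1 of the previous iteration."""
        A = np.zeros(N + 1); D = np.zeros(N + 1)     # A(0) = 0 comes from outside
        for I in range(1, N + 1):
            A[I] = B[I] * C[I]                       # S1
            D[I] = A[I - 1] + E[I]                   # S2
        return A, D

    def substituted(B, C, E, N):
        """After statement substitution both statements are vector assignments."""
        A = np.zeros(N + 1); D = np.zeros(N + 1)
        A[1:] = B[1:] * C[1:]                        # S1 as one array operation
        D[2:] = B[1:N] * C[1:N] + E[2:]              # S2 with S1's RHS shifted by one
        D[1] = 0.0 + E[1]                            # I = 1 still uses A(0) from outside
        return A, D

Both versions produce the same A and D; the substituted form can be executed as two simultaneous array statements, at the cost of recomputing the products B*C.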
The purpose of distributing a given type i loop is to obtain a set of smaller loops of type j, 0 ≤ j ≤ i, which upon execution give results equivalent to the original loop. This is essentially to reduce i in Eq. 6 (and hence increase speedup) as much as possible. In fact, the loop distribution algorithm, resembling the distribution algorithm for the reduction of tree height of an arithmetic expression, may introduce more parallelism into a program loop than can be obtained from the undistributed one. We now give the algorithm to accomplish this distribution, as presented in [24].

Loop Distribution Algorithm

Step 1  Given a loop, construct a dependence graph G^u (c.f. Definitions 7 and 9) for 1 ≤ u ≤ d by analyzing subscript expressions and indexing patterns.

Step 2  On G^u, 1 ≤ u ≤ d, establish a node partition π^u as in Definition 11.

Step 3  On the partition π^u, 1 ≤ u ≤ d, establish a partial ordering relation α as in Definition 11.

Step 4  Let the (d-u+1) innermost nest of L be L^u, 1 ≤ u ≤ d, i.e.,

    L = (I_1, I_2, ..., I_d)(S_1, S_2, ..., S_s)
      = (I_1, I_2, ..., I_{u-1}){(I_u, I_{u+1}, ..., I_d)(S_1, S_2, ..., S_s)}
      = (I_1, I_2, ..., I_{u-1})(L^u) .

Replace L^u according to π^u with a set of loops {(I)(π_u1), (I)(π_u2), ...} where (I) = (I_u, I_{u+1}, ..., I_d).

The condition of the partial ordering relation α insures that data are updated before being used. Hence, any execution order of the set of loops which replaces L^u will be valid as long as this relation is not violated. Thus, for fixed values of I_1, I_2, ..., I_{u-1}, if π_ui α π_uj then loop (I)(π_ui) must be evaluated before (I)(π_uj); otherwise they may be computed in parallel. In general, we can also use statement substitution to remove this relation between some or all of the distributed loops. But by not allowing statement substitution we have a somewhat simpler compiler technique, one which generally requires fewer processors and yields less speedup.

As an example of the use of loop distribution, consider the following pseudo-Fortran program.

Example 3

        DO 10 I = 1, N
    S_1:    A(I) = B(I) * C(I)
            DO 20 J = 1, N
    S_2:        D(J) = A(I-3) + E(J-1)
    S_3: 20     E(J) = D(J-1) + F
            DO 30 K = 1, N
    S_4: 30     G(K) = H(I-5) + 1
    S_5: 10 H(I) = SQRT(A(I-2))

Following step 1 of the Loop Distribution Algorithm, we obtain the dependence graph shown in Fig. 5, in which brackets are used to denote loop nesting. For simplicity and speedup in this program, we consider only the case u = 1. In step 2, we form the partition π_1 = {π_11, π_12, π_13, π_14} where π_11 = {S_1}, π_12 = {S_2, S_3}, π_13 = {S_4}, and π_14 = {S_5}. These partitions are partially ordered in step 3 as follows: π_11 α π_12, π_11 α π_14, and π_14 α π_13. Since we are considering only the case u = 1 here, we ignore step 4. The result of this transformation is shown in Fig. 6.

We could use this graph to compile array operations as follows. First, S_1 yields a vector multiply. Next, we can execute π_12 or π_14. π_12 leads to a linear recurrence of the form R, which can be solved by the method of Theorem 3 by combining the D and E arrays as an unknown vector in which x_1 represents D(1), x_2 represents E(1), x_3 represents D(2), x_4 represents E(2), etc. π_14 leads to the execution of S_5 as a vector of square roots. Finally, S_4 may be executed for all I and K simultaneously. Note that this requires the broadcasting of elements of the H array to all elements in the columns of G. Here, the time required to execute π_11, π_13, and π_14 is independent of N using O(N) processors.
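The Python sketch below is my own rendering of Example 3 after distribution with u = 1; it is not the report's compiler output. It assumes every array has length at least N + OFF + 1, with conceptual index i stored at position i + OFF so that the boundary elements assumed to exist before the loop (A(-1), A(0), H(-4), ..., D(0), E(0)) are ordinary array slots. The π_12 recurrence is left serial here; the report would solve it by the method of Theorem 3.

    import numpy as np

    OFF = 8   # index offset so boundary elements like A(-1) or H(-4) are real slots

    def distributed_example3(A, B, C, D, E, F, G, H, N):
        idx = np.arange(1, N + 1) + OFF

        # (I)(pi_11): one vector multiply
        A[idx] = B[idx] * C[idx]

        # (I)(pi_14): one vector of square roots; must follow pi_11
        H[idx] = np.sqrt(A[idx - 2])

        # (I)(pi_12): a coupled linear recurrence in D and E for each I, kept
        # serial in this sketch.  pi_12 and pi_14 are unordered by alpha, so
        # they could run concurrently.
        for I in range(1, N + 1):
            for J in range(1, N + 1):
                D[J + OFF] = A[I - 3 + OFF] + E[J - 1 + OFF]
                E[J + OFF] = D[J - 1 + OFF] + F

        # (I)(pi_13): for each I, one vector assignment over K = 1..N that
        # broadcasts H(I-5) to a whole row of results; must follow pi_14
        for I in range(1, N + 1):
            G[idx] = H[I - 5 + OFF] + 1

        return A, D, E, G, H

The single original I loop has become four loops, one per π-block, executed in any order consistent with α; π_11 and π_14 are pure array operations, and π_13 is a broadcast.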
The overall execution time is dominated by π_12 and is O(log N), so this is a type 1 loop. The number of processors required to achieve this time is O(N). Notice that in this example we avoided statement substitution. Using statement substitution, we would have been able to obtain four π-blocks, all of which could be executed at once. This would require the execution of several different operations at one time, while the technique we used allows all operations at each step to be identical. Furthermore, very little additional speedup would be possible by this method, since π_12 dominates the time here.

IFs in Loops
To this point we have considered DO loops without conditional statements. The addition to DO loops of IF and GOTO, as well as computed GOTO statements, can cause major problems. In particular, data dependencies can be changed at execution time by the existence of such conditional statements. Thus, knowledge at compile time of what can be executed in parallel may be difficult to obtain. In the worst case, not knowing about control flow may force us to compile loops for serial execution which in fact could be executed in a highly parallel way.

In this section we will consider the addition to DO loops of IF and computed GOTO statements, denoted as IFs. The first part of this section contains definitions and preliminary results concerning IFs. This leads to a distribution algorithm which allows for IFs. The importance of this algorithm is that using it we can often execute all of the DO loop simultaneously, or at worst, localize the part of the DO loop that must be done sequentially. Then we present a summary of the analysis of one year of CACM Fortran programs.

Before proceeding, we need a few definitions concerning IFs and control notions.

Definition 12
Statement S_j is an immediate control successor of S_i, denoted S_i γ S_j, if S_j is executed immediately after S_i. S_j is not unique when S_i is an IF. Statement S_j is a control successor of S_i, denoted S_i Γ S_j, if there exist statements S_{k_1}, ..., S_{k_m} such that S_i γ S_{k_1} γ S_{k_2} γ ... γ S_{k_m} γ S_j.

Definition 13
Given a loop L = (I_1 ← N_1, ..., I_d ← N_d)(S_1, ..., S_s) with IF statement S_i = (C)(S_{i_1}, ..., S_{i_n}), the j-th follower, 1 ≤ j ≤ n, of S_i, denoted by F_j(S_i), is the set {S_k | S_k = S_{i_j} or S_{i_j} Γ S_k}. We will refer to an arbitrary j-th follower as a follower. The set of common followers of IF statement S_i, denoted by CF(S_i), is the set {S_k | S_k ∈ ∩_{j=1}^n F_j(S_i) such that if S_i is executed then S_k is executed}. The set of parent IFs of statement S_i, denoted by PIF(S_i), is the set {S_l | S_l is an IF, S_l Γ S_i, and there is no IF statement S_m such that S_l Γ S_m Γ S_i where S_i ∉ CF(S_m)}. When there is more than one IF in the loop, it is possible that some of the followers will resemble trees.

Example 4

        DO S_5  I_1 = 5, 14, 1
        DO S_5  I_2 = 5, 14, 1
    S_1:  IF I_1 > I_2 THEN (S_2:)  A(I_1,I_2) = A(I_1-2,I_2-2) + A(I_1-4,I_2-4) + A(I_1+2,I_2+2) + A(I_1+4,I_2+4)
    S_3:  IF I_1 < I_2 THEN (S_4:)  B(I_1,I_2) = B(I_1-1,I_2)*C(I_1,I_2) + B(I_1-1,I_2)*D(I_1,I_2)
    S_5:  IF I_1 = I_2 THEN (S_6:)  IF MOD(I_1,2) = 0 THEN (S_7:)  E(I_1,I_2) = E(I_1-1,I_2-1) + F(I_1,I_2)

In Example 4, we have F_1(S_1) = {S_2, S_3, S_4, S_5, S_6, S_7}, F_2(S_1) = {S_3, S_4, S_5, S_6, S_7}, F_1(S_3) = {S_4, S_5, S_6, S_7}, F_2(S_3) = {S_5, S_6, S_7}, F_1(S_5) = {S_6, S_7},
F_2(S_5) = ∅, F_1(S_6) = {S_7}, F_2(S_6) = ∅, CF(S_1) = {S_3, S_4, S_5, S_6, S_7}, CF(S_3) = {S_5, S_6, S_7}, CF(S_5) = ∅, CF(S_6) = ∅, PIF(S_2) = PIF(S_3) = S_1, PIF(S_4) = PIF(S_5) = S_3, PIF(S_6) = S_5, and PIF(S_7) = S_6.

Definition 14
Given a π-block π_ui and an IF statement S_j in π_ui, the k-th control path with respect to S_j through π_ui is the set {S_l ∈ π_ui | S_l Γ S_j, or S_l ∈ F_k(S_j) and S_l ≠ S_j}. We will refer to an arbitrary k-th control path as a control path. Note that the number of control paths through π_ui is the same as the number of followers of S_j.

Example 5

        DO S_7  I = 1, N
    S_1:    A(I) = B(I-2) + C(I)
    S_2:    IF (A(I-1).EQ.0) THEN DO
    S_3:        B(I) = 3
    S_4:        D = .FALSE.
            END
            ELSE DO
    S_5:        E(I) = E(I-3) + 1
    S_6:        D = .TRUE.
            END
    S_7:    F(J) = 0

Using Example 5 we will point out the differences between followers and control paths. F_1(S_2) = {S_3, S_4, S_7}, F_2(S_2) = {S_5, S_6, S_7}, and CF(S_2) = {S_7}. In the π-partition π_1 = {{S_1, S_2, S_3}, {S_4}, {S_5}, {S_6}, {S_7}}, notice that F_1(S_2) is contained in three π-blocks and F_2(S_2) is not in the same π-blocks as F_1(S_2). There are two control paths in {S_1, S_2, S_3}. One control path, {S_1, S_3}, contains a statement not in any follower of S_2 and not all the statements in F_1(S_2). The other control path, {S_1}, does not contain any statements of F_2(S_2). In general, a control path can contain statements not in any follower of the IF, and it does not have to contain any statements of a follower.

In an IF-free DO loop L, L ∈ D, we know that the data dependencies for each iteration do not change as a function of the input data. The addition of IFs can give rise to several distinct data dependencies for different input data sets. The data dependencies on different iterations can be distinct for all iterations, for some iterations, or for only one iteration. Thus we may have statements that can have several different combinations of data dependence.

Example 6

        DO S_5  I = 1, N
    S_1:    IF C(I-1) = D(I) THEN (S_2:)  A(I) = 3*P(I)
            ELSE DO
    S_3:        A(I) = 4*Q(I)
    S_4:        B(I) = R(I) + 1
            END
    S_5:    C(I) = A(I) + B(I-1) + 2

Definition 15
The set of different combinations of data dependence into statement S_k is called the data dependence combinations of statement S_k and is denoted by DDC(S_k). The number of followers of IF statement S_i that contain S_k is denoted f_ik.

Statement S_5 in Example 6 can be computed in four different ways. First, the value of A(I) can be from S_2 and B(I-1) from outside the DO loop. Second, the value of A(I) can be from S_2 and B(I-1) from S_4. Third, the value of A(I) can be from S_3 and B(I-1) from S_4. Fourth, the value of A(I) can be from S_3 and B(I-1) from outside the DO loop.

Using Definitions 8, 12, 13 and 15, we can classify IFs into three types:

Definition 16
Given a loop L = (I_1 ← N_1, ..., I_d ← N_d)(S_1, ..., S_s) and IF S_i = (C)(S_{i_1}, ..., S_{i_n}), 1 ≤ i ≤ s, we say S_i is

a) Type A iff there does not exist S_j, 1 ≤ j ≤ s, such that S_j δ S_i, and {I_1, ..., I_d} ∩ IN(S_i) = ∅.

b) Type B iff one of the following holds:
   1) All but one of the F_j(S_i) branch out of the loop.
   2) For each S_k ∈ ∪_{j=1}^n F_j(S_i), |DDC(S_k)|/f_ik ≤ 1, and each of the data dependence combinations of S_k includes data dependence only on statements in a single follower of S_i and/or statements not in any follower of S_i.

c) Type C iff S_i is not type A or type B.

Type B IFs can be further subdivided. A prefix type B IF is a type B IF that is not data dependent on any statement in its followers. Postfix type B IFs are all other type B IFs.
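The follower and common-follower sets of Example 5 can be computed mechanically from the immediate-control-successor (γ) edges. The short Python sketch below is my own illustration: the edge table is transcribed by hand from Example 5, and the simple set intersection used here happens to equal CF(S_2) for this example (it ignores the extra executed-whenever condition of Definition 13).

    # Immediate control successors (gamma) of Example 5; statement k is the integer k.
    GAMMA = {1: [2], 2: [3, 5], 3: [4], 4: [7], 5: [6], 6: [7], 7: []}

    def control_successors(start):
        """All statements reachable from start via gamma, including start itself."""
        seen, work = set(), [start]
        while work:
            s = work.pop()
            if s not in seen:
                seen.add(s)
                work.extend(GAMMA[s])
        return seen

    def followers(if_stmt):
        """F_j(S_i): one set per branch target of the IF (Definition 13)."""
        return [control_successors(t) for t in GAMMA[if_stmt]]

    F1, F2 = followers(2)
    print(F1, F2)      # {3, 4, 7} {5, 6, 7}
    print(F1 & F2)     # {7}, matching CF(S_2) in Example 5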
Next we will discuss compiler algorithms for array machines. These machines have two characteristics of which we want to take advantage. First, they operate on whole arrays; thus we need to transform programs to operate on whole arrays at once rather than element by element. Second, these machines can selectively omit certain elements of an array during array operations. We define mode bits as the indicators of which array elements are to be operated on. Thus we want to transform IFs to generate mode bits. Both of these characteristics let us obtain speedups over uniprocessor machines.

As we saw earlier for IF-free loops, by distributing DO loop indices over π-blocks we are able to transform a given loop into one or more loops, each containing a vector or recurrence operation. In [23] we gave a modification of that algorithm to allow IFs. The first goal of the algorithm is to localize data dependence and the effects of IFs. Second, the algorithm handles the IFs in four different ways.

1) Type A IFs are the easiest to handle. One loop is compiled for each follower, and an IF is used at execution time to select which loop to execute.

2) Prefix type B IFs use mode bits to "prefix" the body statements. The IF is used only to set up the mode bits. The mode bits are set up once and then used by the body statements as necessary.

3) Postfix type B IFs require execution of each control path for the full DO loop index set. Then we postfix by merging the outputs from each control path.

4) Type C IFs are executed serially.

Since a type A IF depends only on variables not set inside the loop, such IFs can be removed from loops trivially. However, good programmers seldom write such statements, so this is a moot point.

In the prefix type B case, we set up the π-block containing the IF and the other π-blocks that are α-dependent on it to be executed for the full index set. However, before these π-blocks are executed we precompute (at compile or run time) which follower will be taken on each particular iteration of the DO loop. This is expressed as a vector of mode bits for each follower. A vector of mode bits is a mask that is applied to the index set for a given operation. In this way we can selectively omit certain elements at run time. The end result is that each statement is executed only for the proper elements. By precomputing these mode bits, we are able to fix the results a priori, hence the name prefix type B.

As an example of a prefix type B IF, consider the following program.

        DO 5 I = 1, N
        IF (I.LE.5) THEN A(I) = B(I) + C(I)
                    ELSE A(I) = B(I)/C(I)
    5   CONTINUE

Let M_i[a,b] be a vector of mode bits denoting vector elements from a to b, inclusive. Then we can compile the above as

    M1 = [1,5]
    M2 = [6,N]
    DO SIM {A(M1) = B(M1) + C(M1), A(M2) = B(M2)/C(M2)}

where DO SIM indicates that the bracketed statements can be executed as array statements and simultaneously.

The postfix type B IFs are more complicated. Using only the statements in the π-block that contains the IF, we compute all control paths in parallel for the full index set. Using these results we evaluate the IF. Finally, the results are merged to produce the correct values, and a set of mode bits is passed on to the other π-blocks. The details of the test and merge are now given. What we want to do is find which follower is taken for each iteration. Thus, given the outputs of each control path for each iteration, we want to thread our way through the outputs, picking up the proper results.
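The effect of the DO SIM compilation above can be mimicked with Boolean masks. The Python sketch below is my own illustration of the prefix type B mechanism, not the report's generated code; arrays are treated as 1-indexed, with position 0 unused.

    import numpy as np

    def prefix_type_b(B, C, N):
        """Mode-bit form of the prefix type B example above."""
        I = np.arange(N + 1)
        M1 = (I >= 1) & (I <= 5)       # mode bits for the THEN follower
        M2 = (I >= 6) & (I <= N)       # mode bits for the ELSE follower
        A = np.empty(N + 1)            # position 0 is unused
        A[M1] = B[M1] + C[M1]          # the two masked array statements are
        A[M2] = B[M2] / C[M2]          # independent and could run simultaneously
        return A

Note that M1 and M2 are disjoint and together cover the whole index set, which is exactly the property of mode-bit vectors discussed next for the postfix case.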
It is possible to do this recursively, by using the previously selected results and the results from the current iteration to determine which follower is to be taken. Using subscripts to denote the control path, in Example 7 we compute A_1(I) = D(I) and A_2(I) = B(I)*C(I) for all values of the index set. Let M_1 and M_2 be the vectors of mode bits for follower 1 and follower 2, respectively. Whenever M_i(j) is a 1, the statements in the i-th follower are executed on the j-th iteration of the serial program.

Example 7

        DO 10 I = 6, N
        IF A(I-3) = 5 THEN A(I) = D(I)
                      ELSE A(I) = B(I) * C(I)
    10  CONTINUE

It should be noted, in general, that the ANDing of any two distinct vectors of mode bits associated with an IF statement results in a vector of 0 bits. The ORing of all the vectors results in a vector of 1 bits corresponding to the iterations on which the IF statement is executed in the serial program.

In Example 7, control path 1 is taken whenever we were on control path 1 three iterations ago and the output was 5, or we were on control path 2 three iterations ago and the output was 5. Control path 2 is taken in either case if the output was not 5. This can be expressed as the following coupled recurrence relation:

    M_1(I) ← (A_1(I-3) = 5 & M_1(I-3)) ∨ (A_2(I-3) = 5 & M_2(I-3))
    M_2(I) ← (A_1(I-3) ≠ 5 & M_1(I-3)) ∨ (A_2(I-3) ≠ 5 & M_2(I-3))

The techniques to solve this recurrence are described in [21], with specific reference to this application. This is a bit-level recurrence and can be done in O(log n) gate delays, where n is the number of iterations. In reality, we could approximate the log n gate delays by one clock, i.e., one time step. Finally, we mask A_1 with M_1 and A_2 with M_2, and merge the results to get the proper elements of A. Also, M_1 and M_2 are passed on to other π-blocks as needed.

We should point out that postfix type B loops yield the same speedups as prefix type B loops, in general. The difference is that, since more processors are required for the redundant operations here, postfix type B loops generally have lower efficiencies.

As a test of the usefulness of our methods, we have analyzed the 16 Fortran programs which appeared in 1973 in the CACM Algorithms Section. Nested DOs were counted as one loop at the outermost level. There were a total of 124 such DO loops. Each loop was characterized in terms of the worst recurrence type (c.f. Eq. 6) and the worst IF type it contained. Table 1 is a summary of our results. We observe that type A and prefix type B IFs, together with recurrence types 0 and 1, can be handled with good speedup and efficiency. This accounts for 85% of the loops. Four programs (type 3 and type C) are disasters for our methods and in part must be handled serially. The remaining programs can be handled by the postfix and wave-front (type 2) methods. Overall, this seems to imply that the methods given would be very effective for the general mix of CACM Fortran algorithms.

4. Machine Considerations
Since the early 1960s, we have seen a sequence of high speed machines which have some kind of multioperation capability. The CDC 6600 of the 60s was succeeded by the 7600 in the early 70s. IBM introduced the 360/91 and its successors. The 7600 and /91 are both pipelined machines and achieve high performance by operating on arrays of data. Their instruction sets are rather traditional, however. In contrast, the pipelined Control Data STAR [33] and Texas Instruments ASC [68] both have vector instruction sets, which should make compilation for them substantially easier.
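The following Python sketch is my own illustration of the postfix handling of Example 7, not the report's compiler output. Both control paths are evaluated for the full index set; the mode bits are then obtained from the coupled recurrence (solved serially here, where the report does it in O(log n) gate delays); finally the two result vectors are merged. A is assumed to hold its pre-loop values in positions 0 through 5.

    import numpy as np

    def postfix_type_b_example7(A, B, C, D, N):
        A1 = D[6:N + 1].copy()             # control path 1 for every iteration
        A2 = B[6:N + 1] * C[6:N + 1]       # control path 2 for every iteration

        M1 = np.zeros(N + 1, dtype=bool)
        M2 = np.zeros(N + 1, dtype=bool)
        sel = A.copy()                     # selected outputs; 0..5 are boundary values
        for i in range(6, N + 1):          # serial form of the coupled recurrence:
            M1[i] = (sel[i - 3] == 5)      # path 1 iff the value selected three
            M2[i] = not M1[i]              # iterations back (or a boundary value) is 5
            sel[i] = A1[i - 6] if M1[i] else A2[i - 6]

        A[6:N + 1] = np.where(M1[6:N + 1], A1, A2)   # mask A1 with M1, A2 with M2, merge
        return A, M1, M2

The sketch computes the condition directly from the selected value three iterations back, which is equivalent to the M_1/M_2 recurrence given above; M1 and M2 are then available for other π-blocks.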
On the other hand, the Burroughs ILLIAC IV [4] is a parallel array machine, but its instruction set is also traditional in nature. Vectors must be broken up into partitions of size 64 and loops performed over a sequence of such partitions. The Goodyear Aerospace STARAN IV [6] is a parallel array of processors, each of which operates in a bit-serial fashion. This is an example of an associative processor. It is interesting to note that the highest speed pipeline processors, the STAR and ASC, both resort to parallelism by providing several parallel pipelines to achieve their desired operating speeds.

The processing speedups achieved by all of these machines are due to parallelism between operations as well as parallelism between memory and processor activities. We shall discuss memories, alignment networks, and control units later. Our point here is that in order to compile ordinary serial languages for these processors, two things are desirable:
1) powerful translation techniques to detect parallelism, and
2) array-type machine languages.

The main contributions to program speedup discussed in section 3 arise from our loop distribution procedure. This leads to array operations and recurrences. Both of these are well suited for computation on machines which must perform the same operation on many data elements to achieve high performance. Thus, the methods of section 3 could serve as compiler algorithms for such machines.

Some time ago, we implemented a comprehensive analyzer of Fortran programs. It used algorithms like those of sections 2 and 3, although some of the techniques were much more primitive than those discussed here. Details of our algorithms and results may be found in [39], [41], [44] and [54]. We will summarize a few points very briefly.

Altogether some 140 ordinary Fortran programs, gathered from many sources, were analyzed. The programs ranged from numerical computations on two-dimensional arrays (e.g., EISPACK) to essentially nonnumerical programs (e.g., Fortran equivalents of GPSS blocks). We set all loops to 10 or fewer iterations and analyzed all paths through the programs, computing T_1, T_p, S_p, E_p, etc. These were averaged over all traces and also over collections of programs. A plot of our results for S_p vs. p is shown in Fig. 7. Some of the points are labelled with the name of a collection of programs.

Our experiments lead us to conclude that multioperation machines could be quite effective for most ordinary FORTRAN computations. Fig. 7 shows that even the simplest sets of programs (GPSS, for example, has almost no DO loops) could be effectively executed using 16 processors. The overall average (ALL in Fig. 7) is 35 processors when all DO loop limits are set to 10 or less. As the programs become more complex, 128 or more processors would be effective in executing our programs. Note that for all of our studies T_1 ≤ 10,000, so most of the programs would be classed as short jobs in a typical computer center. In all cases, the average efficiency for each group of programs was no less than 30%. While we have not analyzed any decks with more than 100 cards, we would expect extrapolations of our results to hold. In fact, we obtained some decks by breaking larger ones at convenient points.

These numbers should be contrasted with current computer organizations. Presently, general purpose machines performing two to four simultaneous operations are quite common.
The pipeline, parallel and associative machines mentioned above perform 8 to 64 simultaneous operations, but these are largely intended for special purpose use. Thus, we feel that our numbers indicate the possibility of perhaps an order of magnitude speedup increase over the current situation. Furthermore, we took a subset of the programs, again a random cross-section, and varied the DO loop limits from 10 to 40. The points labelled 10, 20, 30, 40 in Fig. 7 correspond to these results. We conclude that for our sample of ordinary Fortran programs, speedup is a linear function of T_1 and hence of p. This is quite different from some of the folklore which has arisen about parallel computation [1], [31], [52] (e.g., S_p = O(log p)). We will next give an outline of two commonly held beliefs about machine organization.

Let us assume that for 0 ≤ β_k ≤ 1, a fraction (1-β_k) of the serial execution time of a given program uses p processors, while a fraction β_k of it must be performed on k < p processors. Then we may write

    T_p = β_k T_1/k + (1-β_k) T_1/p ,   and   E_p = T_1/(p T_p) = 1/(1 + β_k(p/k - 1)) .

For example, if k = 1, p = 33, and β_1 = 1/16, then we have E_33 = 1/3. This means that to achieve E_33 = 1/3, 15/16 of T_1 must be executed using all 33 processors, while only 1/16 of T_1 may use a single processor. While E_33 = 1/3 is typical of our results (see Fig. 7), it would be extremely surprising to learn that 15/16 of T_1 could be executed using fully 33 processors. This kind of observation led Amdahl [1] and others [20], [63] to conclude that computers capable of executing a large number of simultaneous operations would not be reasonably efficient; to paraphrase them, "Ordinary programs have too much serial code to be executed efficiently on a multioperation processor."

Such arguments have an invalidating flaw, however, in that they assume k = 1 in the above efficiency expression. Evidently no one who repeated this argument ever considered the obvious fact that k will generally assume many integer values in the course of executing most programs. Thus, the expression for E_p given above must be generalized to allow all values of k up to some maximum. The technique used in our experiments for computing E_p is such a generalization. For some execution trace through a program, at each time step i some number of processors k(i) will be required. If the maximum number of processors required on any step is p, we compute the efficiency for any trace as

    E_p = ( Σ_{i=1}^{T_p} k(i) ) / (p T_p) ,

assuming p processors are available.

Apparently no previous attempt to quantify the parameters discussed above has been successful for a wide class of programs. Besides Kuck, et al. [41], the only other published results are by Baer and Estrin [3], who report on five programs.

Another commonly held opinion, which has been mentioned by Minsky [52], is that the speedup S_p is proportional to log p. Flynn [31] further discusses this, assuming that all the operations simultaneously executed are identical. This may be interpreted to hold 1) over many programs of different characteristics, 2) for one fixed program with a varying number of processors, or 3) for one program with varying DO loop limits. That the claim is false under interpretation 1 for our analysis is obvious from Fig. 7. Similarly, it is false under interpretation 2 as the number of processors is varied between 1 and some number, as plotted in Fig. 7.
As p is increased still farther, the speedup and efficiency may be regarded as constant, or the speedup may increase at a decreasing rate together with a decreasing efficiency. Eventually, as p becomes arbitrarily large, the speedup becomes constant, and in some region the curve may appear logarithmic. Under interpretation 3 there are many possibilities: programs with multiply nested DO loops may have speedups which grow much faster than linearly, and programs without DO loops of course do not change at all.

One of Flynn's arguments in support of such behavior concerned branching inside loops. For example, if on each pass through a loop we branched down some new path, not taken on any other iteration, then indeed we might have disastrous results. However, we have not observed such intense splitting; in fact, most computations which do branch inside a loop come together in common followers rather quickly. Furthermore, there are usually relatively few distinct paths through a loop. The results of Table 1 lend further support to our conclusion that IFs in loops do not pose a serious practical problem.

Abstractly, it seems of more interest to relate speedup to T_1 than to p. Based on our data, we offer the following observation: for many ordinary FORTRAN programs (with T_1 ≤ 10,000), we can find p such that

    1)  T_p = a log_2 T_1   for 2 ≤ a ≤ 10 , and
    2)  p ≤ T_1/(.6 log_2 T_1) , such that
    3)  S_p ≥ T_1/(10 log_2 T_1)   and   E_p > .3 .

The average a value in our experiments was about 9. However, the median value was less than 4, since there were several very large values.

In terms of our present theoretical results, it is not hard to justify such estimates. Consider, for example, Corollary 4, which is intended for solving R systems with small n using a limited number of processors. Let us approximate the times by T_1 ≈ mn and T_p ≈ 2m²n/p = 2m T_1/p, since most programs have m = 1 or m = 2. Following the observation, we choose p ≈ T_1/log T_1, so T_p ≈ 2m log T_1, S_p ≈ T_1/(2m log T_1), and E_p ≈ 1/2m, all of which agree reasonably with the observation.

We should point out that our previous analyzer allowed each processor to be executing a different operation on any time step. This could degrade our results, possibly by a factor of two. However, the newly discovered algorithms perform better than the ones we used in our old analyzer. We are currently implementing a new analyzer with which we expect to obtain better results than those summarized above.

For a more complete discussion of theoretical bounds on the time and processors required for executing whole Fortran loops see [19], [24]. These references show how certain details about overall program graphs can be combined with the results of section 3 to obtain loop bounds. From these, whole program bounds can be obtained.
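Before turning to machine organization, here is a small Python check (my own, not from the report) of the efficiency expressions used in this section: the two-speed formula with the k = 1, p = 33, β = 1/16 example, and the generalized trace-based efficiency. The processor-count trace in the last line is invented purely for illustration.

    def amdahl_efficiency(beta, k, p):
        """E_p = 1 / (1 + beta*(p/k - 1)) for the two-speed model above."""
        return 1.0 / (1.0 + beta * (p / k - 1.0))

    def trace_efficiency(ks):
        """Generalized E_p = sum k(i) / (p * T_p) for a trace ks of the number
        of processors actually used on each time step."""
        p, Tp = max(ks), len(ks)
        return sum(ks) / (p * Tp)

    print(amdahl_efficiency(beta=1/16, k=1, p=33))    # 0.333..., as in the text
    print(trace_efficiency([33, 16, 33, 8, 33, 33]))  # a made-up trace using many k values

The second call shows the point made above: when k takes many values across the trace, the efficiency is not governed by the pessimistic k = 1 formula.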
Control Units
A well-designed control unit is one which never gets in the way of the processor(s) and memories. In other words, it operates fast enough to supply instructions whenever they are needed in the processing and moving of data. Control units tend to become complex, mainly in a timing sense, because they may have a number of tasks to control. One way to ease some control unit difficulties is to use parallelism at the control unit level. A multiprocessor is an example; several complete control units are used. This may be rather expensive, so the multifunction, pipeline and parallel processor machines use one shared control unit. Such control units often contain a number of independently operating parts. For example, the first use of pipelining was in control units [14]. A detailed study of the control unit of any high-speed computer will reveal a number of simultaneously operating, independent functions. While this may allow the functions to operate more slowly, it also causes some synchronization problems.

As we mentioned in our processor discussion, the level of machine language is very important in modern high-speed computers. Vector instruction sets make compiler writing easier. They also focus control unit design on the correct questions, namely, how to execute vector functions at high speed.

Control units for high-speed computers must handle the traditional functions, including instruction decoding and sequencing, I/O and interrupt handling, and address mapping and memory indexing. In addition, we can list several new functions. For one, memory indexing becomes somewhat more complex when whole arrays are being accessed in large parallel memories. Also, array computers (parallel or pipeline) often rely on the control unit for scalar computations. Broadcasting of scalars to an array must also be handled. Special control features such as IF tree processing [27] can be effectively handled in the control unit. Special stacks and queues may be required to handle a number of processors and programs in rapid succession. Indeed, instruction level multiprogramming may even be attempted. Rather than discuss any of these in detail, we simply refer the reader to the several high-speed machine papers mentioned earlier.

We conclude this section with the computer organization of Fig. 8. The control unit can really be regarded as four control units, one for each of the four other major subsystems shown. The operation of this machine can be regarded as a pipeline from memory to memory. For move instructions (memory to memory) the processors can be bypassed. Fig. 8 represents parallelism at a number of levels: within the control unit, processors, memories and data alignment networks. Also, it contains parallelism in the simultaneity of operation of each of these, which forms a pipeline. Note that pipelining can also be used within each of the five major subsystems to match bandwidths between them. The details of accessing parallel memories and of aligning the accessed data will be discussed in the next section.

Parallel Memory Access
As effective speeds of processing units have increased, memory speeds have been forced to keep up. This has partly been achieved by new technologies (magnetic cores to semiconductors). But technology has not been enough, as evidenced by the fact that in 1953 the first operating core memory (in Whirlwind I) had an 8 μs memory cycle time. Today, most computer designers cannot afford to use memories much faster than one hundred nanoseconds. Thus, we have achieved an increase of only two orders of magnitude in memory speed over the past twenty-odd years. In the same period, the fastest processor operation times have advanced from a few tens of microseconds to a few tens of nanoseconds, or three orders of magnitude. Memory system speeds have kept up with processors only through the use of parallelism at the word level. In the late 1950s, ILLIAC II and the IBM STRETCH introduced the first two-way interleaved memories. At the present time, high speed computers have on the order of 100 parallel memory units.
If a word can be fetched from each of m memory units at once, then the effective memory bandwidth is increased by a factor of m.

Array Access
Parallel memories are particularly important in array computers (parallel or pipeline). Thus, if a machine has m memory units we can store one-dimensional arrays across the units as shown in Fig. 9 for m = 4. While the first m operands are being processed, we can fetch m more, and so on. But if the array is indexed such that, say, only the odd elements are to be fetched, then the effective bandwidth is cut in half due to access conflicts, as shown in the underlined elements of Fig. 9. These conflicts can be avoided by choosing m to be a prime number; then any index distance relatively prime to m can be accessed without conflicts.

Many programs contain multidimensional arrays. These can lead to more difficult memory access problems, since we may want to access rows, columns, diagonals, back diagonals (as in the wave-front method of section 3), square blocks, and so on. For simplicity, consider two-dimensional arrays and assume we want to access n-element partitions of arrays held in parallel memories with m units. Consider the storage scheme shown in Fig. 10, where m = n = 4. Clearly, using this storage scheme we can access any row or diagonal (e.g., the circled main diagonal) without conflict. But all the elements of a column (e.g., the underlined first column) are stored in the same memory unit, so accessing a column would result in memory conflicts, i.e., we would have to cycle the memories n times to get the n elements of a column. In order to allow access to row and column n-vectors, we can skew the data as shown in Fig. 11 [38]. Now, however, we can no longer access diagonals without conflict. It can be shown, in fact, that there is no way to store an m×m matrix in m memories, when m is even, so that arbitrary rows, columns, and diagonals can all be fetched without conflicts. However, as we shall see, by using more than m memories we can have conflict-free access to any row, column, or diagonal, as well as other useful m-vectors.

It is easy to show that if m = 2^{2k} + 1, for any integer k, we have conflict-free access to rows, columns, diagonals, back diagonals, and square blocks. For an example with k = 1, see Fig. 12. This and other similar results are discussed in [15]. If m is not a power of two, certain difficulties arise in indexing the memory. Also, note that the elements of various partitions are accessed in scrambled order. A crossbar switch could be used to unscramble the data, but much cheaper schemes can be devised. The question of unscrambling the accessed elements using a rather simple network is discussed by Swanson [66]. In order to simplify indexing and unscrambling, systems of the form m = 2n were considered by Lawrie in [47] and [48]. He shows that conflict-free access to a number of partitions is possible using such a memory. We illustrate this in Fig. 13 with m = 2n = 8. We will discuss data alignment networks for unscrambling the data accessed in such a memory later in this section.

In order to implement a skewing scheme, we must have a properly designed parallel memory system. In particular, each of the m memory units must have an independent indexing mechanism. This allows us to access a different relative location in each memory unit.
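To make the skewing idea concrete, the following Python sketch (mine, not the report's) maps element (i, j) of a 4×4 array to a memory unit via unit = (d*i + j) mod m and checks whether a given partition lands in distinct units. With m = n = 4 and the simple skew d = 1 (as in Fig. 11), rows and columns are conflict-free but the diagonal is not; with m = 5 (the m = 2^{2k} + 1 case for k = 1) and skew distance d = 2, all three come out of distinct units. The particular skew distance used here is this sketch's choice, not a detail quoted from the report.

    def unit_and_address(i, j, m, d=1):
        """Skewed storage: element (i, j) lives in unit (d*i + j) mod m, address i."""
        return (d * i + j) % m, i

    def conflict_free(cells, m, d=1):
        """True if all the (i, j) cells fall in distinct memory units."""
        units = [unit_and_address(i, j, m, d)[0] for i, j in cells]
        return len(set(units)) == len(units)

    n = 4
    row      = [(0, j) for j in range(n)]
    column   = [(i, 0) for i in range(n)]
    diagonal = [(i, i) for i in range(n)]

    print([conflict_free(p, m=4) for p in (row, column, diagonal)])        # [True, True, False]
    print([conflict_free(p, m=5, d=2) for p in (row, column, diagonal)])   # [True, True, True]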
It is interesting to observe that several presently existing high speed computers have handled their parallel memories in different ways. The Control Data STAR, for example, does not allow independent indexing of each memory unit. Instead, it has an instruction by which arrays can be physically transposed in memory to provide access to, say, rows and columns. The transpose time is essentially wasted time, and some algorithms for these machines are slowed down by as much as a factor of two in this way. ILLIAC IV has independent index registers and index adders on each of its m = 64 memories. Since it has 64 processors, access to partitions of n = 64 elements is usually required. Thus the skew scheme of Fig. 11 is easily implemented. Of course, since m is even, conflict-free access to rows, columns, and diagonals is impossible. But as Fig. 11 shows, diagonals may be accessed in just two memory cycles.

Parallel Random Access
The execution of Fortran-like programs frequently leads to memory access requirements which include one-dimensional arrays and various partitions of multidimensional arrays, as we have been discussing. However, we sometimes face access problems which have much less regularity. For example, consider the subscripted subscript case:

        DO I = 1, N
        X(I) = A(B(I)) .

Here we have no idea at compile time about which elements of A are to be fetched, assuming that B was computed earlier in the program. This easily generalizes to multidimensional arrays. Frequently, table lookup problems are programmed in this way.

To deal with this kind of memory access problem is, in general, to deal with random access to a parallel memory. Note that this is a problem which has been given a good deal of attention for multiprocessor systems, using rather abstract models of various kinds. There are two key questions on which the validity and usefulness of these models turn:
1) What kind of data dependence is assumed in the memory access sequence?
2) What kind of queueing mechanism is assumed for retaining unserviced accesses?
In these terms, we briefly summarize some of the results.

Hellerman's model [32] can most reasonably be interpreted to assume no data dependence between successive memory accesses and to have no provision to queue conflicting addresses. It is also a steady state model, ignoring control dependence. Thus, it scans an infinite string of addresses, blocking when it finds the first duplicate memory unit access request. In various models, Coffman and his co-workers [16], [17], [26] extended the above to include a type of queueing and to separate data accesses from instruction accesses. These papers further introduced address sequences which were not necessarily uniformly distributed. These models also assumed that no data dependencies existed in the address sequence.

Ravi [58] introduced a model which was more realistic for multiprocessor machines. He allows each processor to generate an address and computes the number of them which can be accessed without conflict, in a steady state sense. Effectively, he assumes a sequential data dependence in the addresses generated by each processor. In [18] the above results are extended in several ways. First, it is shown analytically that the model of [58] yields an effective memory bandwidth which is linear in the number of memory units. Several models are given with queues in the processors and in the memories, to show the differing effects on bandwidth of such queues and of the methods used for managing them. Several types of data dependencies are assumed to exist, some as in the Ravi model and others which include dependencies between the processors.
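The no-queueing, no-data-dependence assumption described above for Hellerman's model is easy to simulate. The Python sketch below (my own illustration) scans a random stream of unit requests, counts how many are serviced before the first duplicate unit appears, and averages over many trials; the estimate it produces grows roughly like the square root of m, in line with the O(m^{1/2}) figure mentioned in the next paragraph.

    import random

    def accesses_before_conflict(m, rng):
        """Requests serviced before the first duplicate memory unit (no queueing)."""
        seen = set()
        while True:
            unit = rng.randrange(m)
            if unit in seen:
                return len(seen)
            seen.add(unit)

    def estimated_bandwidth(m, trials=20000, seed=1):
        rng = random.Random(seed)
        return sum(accesses_before_conflict(m, rng) for _ in range(trials)) / trials

    for m in (4, 16, 64):
        print(m, round(estimated_bandwidth(m), 2))

Under the queueing and processor-dependence assumptions of [18], described above, the behavior is quite different, as the next paragraph notes.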
In all of these models, we show that the effective bandwidth of m memories can be made to be O(m). The models are useful for either multiprocessor or parallel machines. Thus, we conclude that for parallel or multiprocessor machines, the proper use of m parallel memories can lead to effective bandwidths which are O(m). This is much more encouraging than the O(m^{1/2}) which was derived from earlier, more naive models.

Alignment Networks
Finally, we consider the problem of interconnecting the processors and memories we have been examining. Data alignment requirements depend on the programs to be run and on the machine organization. A simple way to connect several memories to a processor is to use a shared bus. For higher speed operation, multiprocessors often use a crossbar switch, which allows each processor to be connected to a different memory simultaneously. In the ILLIAC IV array, the i-th processor can pass data to processors i ± 1 and i ± 8, modulo 64. Here all processors must route data the same distance in a uniform way.

None of the above techniques is well suited to a high performance parallel computer. Indeed, the alignment network should be driven by an independent control unit, so as to operate concurrently with the processor and memory operation. The requirements include more than uniform shifts and at times even more than permutations. Often broadcasts are needed, including partial and multiple simultaneous broadcasts, e.g., n^{1/2} numbers, each broadcast to n^{1/2} processors for matrix multiplication in an n processor machine [48]. The alignment network should be able to transmit data from memory to memory and processor to processor, as well as back and forth between memories and processors. The connections it must provide are derived from two sources. For one, it must be able to handle the indexing patterns found in existing programs, for example, the uniform shift of 5 necessary in A(I) + A(I+5). For another, it must be able to scramble and unscramble the data for memory accesses. For example, to add a row to a column, one of the partitions must be "unskewed". More details can be found in [39], [47] and [43].

With a crossbar switch we can perform any one-to-one connection of inputs to outputs, and with some modification we can also make one-to-many connections for broadcasting. The switch can be set and data can be transmitted in O(log n) gate delays. However, the cost of such a switch is quite high, namely O(n²) gates. Thus, for large systems a crossbar alignment network is out of the question.

Another possibility is the rearrangeable network. It is shown by Benes [8] that such a switch, with the same connection capabilities as a crossbar, can be implemented using only O(n log n) gates. The time required to transmit data through the network is just O(log n). Unfortunately, the best known time to set up the network for transmission is O(n log n) [56]. This control time renders the network impractical as an alignment network, unless all connection patterns can be set up at compile time.

The Batcher sorting network [5] is another possibility. Not only can it perform the connections of a crossbar switch, it can also sort its inputs, if desired. This network has O(n log² n) gates, so it is an improvement over the crossbar. However, it requires time of O(log² n) gate delays for control and data transmission, making it faster than the Benes approach.
As a final possibility, the Ω network [48], proposed specifically for this purpose, can be controlled and can transmit data in O(log n) gate delays, but contains only O(n log n) gates. Thus it has the speed of a crossbar with the cost of a Benes network. Its shortcoming is that it cannot perform arbitrary interconnections. However, as discussed above, we seek an alignment network which can handle the requirements posed by program subscripts and memory skewing schemes. Lawrie has examined a number of such questions, and the Ω network satisfies many of them.

It is interesting to note that the Ω network consists of a sequence of identical interconnection paths called shuffles; see, e.g., [48], [57]. We call transmission from left to right a shuffle and from right to left an unshuffle. It can be shown that the Benes and Batcher networks, as well as the Ω network, can all be constructed from a series of shuffle and unshuffle interconnections of 2×2 switching elements. The switching elements are basically 2×2 crossbars. In the Batcher network, they have the further capability of comparing their inputs and switching on this basis. In the Ω network they can also broadcast either of their inputs to both outputs.

A Cost-Effectiveness Measure
One important motivation for studying the structure of parallel algorithms is to determine computer architectures for effectively executing these algorithms. We are interested in computer requirements as measured in terms such as the number of processors, the type of processors, the number of memories, and the type of interconnections. We measure the effectiveness of an architecture for some algorithm in terms of its speedup over a single processor and the efficiency of the evaluation of that algorithm.

In section 3, we defined speedup categories as a way of characterizing parallel algorithms. These were given in terms of T_1 rather than, say, n, the size of some array, in order to compare dissimilar algorithms. Such a hierarchy is useful for comparing algorithms, but it may not be useful as a quality or effectiveness measure for machine computation. In practice, which algorithm one chooses depends on many factors, an important one being the number of processors available.

The speedup types mentioned above are useful in algorithm selection if one has a machine which is very large relative to the sizes of one's problems. But categorizing algorithms on the basis of speedup alone is unsatisfactory in general, since efficiency is important when unlimited processors are not available. On the other hand, using efficiency as the sole measure of effectiveness is too conservative, since a serial computation always has maximum efficiency, i.e., E_1 = 1. Thus, we turn to a measure which includes both effectiveness and cost.

The cost of a computation is clearly related to the number of processors required and the time they must be used. We will assume the processor-time product reasonably represents cost, so we define the cost as

    C_p = p T_p .

Assume that we have two algorithms, A_1 and A_2, which compute the same function using p_1 and p_2 processors, respectively. Then we should decide which algorithm to use on the basis of max_{p_i} S_p/C_p, the ratio of effectiveness to cost. We can rewrite this measure as

    S_p/C_p = S_p/(p T_p) = E_p/T_p .

Note that since E_p ≤ 1 and T_p ≥ 1, we have S_p/C_p ≤ 1.
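As a small worked illustration of the measure just defined, the Python fragment below compares two hypothetical algorithms for the same problem; the processor counts and times are invented for illustration only and are not taken from Fig. 14.

    def cost(p, Tp):
        """C_p = p * T_p, the processor-time product defined above."""
        return p * Tp

    def s_over_c(T1, p, Tp):
        """S_p / C_p = S_p / (p * T_p) = E_p / T_p."""
        Sp = T1 / Tp
        return Sp / cost(p, Tp)

    T1 = 1024
    print(s_over_c(T1, p=512, Tp=16))   # A1: faster, but E_p = 0.125
    print(s_over_c(T1, p=64,  Tp=32))   # A2: slower, E_p = 0.5, and a higher S/C

Here the leaner algorithm A2 wins under S/C even though A1 has the larger speedup, which is the kind of tradeoff discussed next.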
p " p " p " p " E E S Another way of viewing this measure is to write J^" = t ^ ■ P 'l Note that for a given function, T-, is fixed so for various parallel algorithms we are attempting to maximize the product of efficiency and speedup. Since E <_ 1 and S < T, , division by T, is simply a normalization of the maximum product to unity. In Fig. 14 we present a summary of S/C values for a number of parallel computations. This is discussed in detail in [40]. We shall make a few summary remarks here. Generally speaking, the best parallel algorithms in the S/C sense are at the top and the worst are at the bottom. The column at the left represents algorithms running on machines in a less than maximum possible speedup way. The right column contains the best known speedups, in general. By studying various algorithm and machine organizations for a given problem, we could hope to maximize the S/C measure for some set of computations. For example, the problem of multiplying n matrices is shown 83 ? 3 4 . in three places, for p = n , p = n and p = n . Note that S/C is maximum for p = n , less than the maximum possible speedup for this problem. In [40] several algorithms are compared in this way. Finally, we point out that for an analysis such as the one we have presented here to be practically meaningful, some machine details must also be included. Analyses similar to the above may be performed on various parts of a computer system. For example, bit serial and bit parallel arithmetic units may be compared in a similar way. Various interconnection networks can also be studied in this way. Thus we would be able to make tradeoffs in parallelism from the level of arithmetic algorithms (i.e., hardware) up through the level of computational algorithms (i.e., programs). Consider the extra gates required for carry lookahead arithmetic over bit serial arithmetic. These could be more cost-effectively used in multiple micro- processors in a machine properly designed for some set of algorithms. 84 5. Conclusions Parallelism in machine organization has been observed and ex- ploited since the time of Babbage [44]. Since the early 1960s, multioperation machines have existed in implementations like the CDC 6600, CDC STAR, TI ASC, Burroughs' ILLIAC IV, and so on. Compilers for these machines have been uniformly difficult to write—mainly for two reasons. First, it has been difficult to discover operations in ordinary programs, which can be executed simultaneously on such machines. Second, the organizations of these machines have often made it difficult to implement array operations in efficient ways. In this paper we have discussed both of these problems. The first can be solved by using several compilation techniques aimed at parallelism detection and exploitation. This involves building a dependence graph in a careful way and then being able to compile trees, recurrences and IF statements for execution in fast ways. For most ordinary FORTRAN programs, this will probably lead to theoretical speedups which grow (nearly) linearly in the number of processors used in a computation. The second problem is closely related to the first. If one can transform most programs into array form, then proper hardware is necessary to support their fast execution. This means we must have a control unit capable of executing array instructions in a more or less direct way. For example, sufficiently many registers are required to contain all of the relevant parameters of loops, their indexing and array subscripting. 
Furthermore, we must have a high bandwidth memory which provides conflict-free access to arrays and various commonly needed partitions. We must also be able to align such partitions with others for processing. Finally, sufficiently high processing bandwidth must be provided, most likely by a combination of parallel and pipeline techniques. To complete the circle, the control unit must be sufficiently pipelined to provide a continuous flow of data in a memory to memory (or register to register) way. While many aspects of the problem are well understood now, no machine exists which combines all of the necessary features in an adequate way.

These ideas are generally regarded as techniques for very high-speed computation. However, the supercomputers of one day are often the ordinary computers of a later day, and perhaps the minicomputers of the future. Clearly, good architectural ideas for machine speedup and ease of compiler writing will be widely useful. Indeed, with the arrival of sixteen-bit microprocessors, we seem to be quite close to the time when large numbers of simple processors can be used as components in one super processor.

We emphasize the need for program analysis to determine which machine organizations are useful for various classes of programs. In this paper we have concentrated on Fortran-like programs. The techniques described here have been adapted to other languages. For example, in [28], all of the important blocks of GPSS were analyzed. Using these results, a machine organization was proposed for the fast execution of GPSS programs. Since little arithmetic is involved, designs for the memory, control unit and alignment network were emphasized. Similarly, in [65], a number of COBOL programs were analyzed using these techniques. There the memory hierarchy requirements became obvious. Also, the ability to execute one program on a number of successive, independent data sets was necessary. Again, little arithmetic was required. The control unit became a key to obtaining high performance; too much complexity in the control unit was a danger.

We have also attempted to exploit parallel and pipeline techniques in information retrieval and file processing problems [34] and [64]. Here, since there are no standard programming languages, standard algorithms (e.g., list merging) were studied. It seems clear that by combining the control notions discussed in this paper with various data structure transformations, other programming languages could be analyzed and good machine parameters discovered. For example, instead of arithmetic tree-height reduction and fast recurrence handling methods, various string manipulations or tree and graph algorithms could be used; these could be interpreted in appropriate languages, e.g., SNOBOL, LISP, etc.

Another approach to machine design is to avoid programs and go directly to the algorithms themselves. We have carried out the direct analysis of several algorithms, including [61] and [62]. In general, better speedup results can be obtained in this way, since one avoids the difficulties of programming language artifacts. In fact, while few nonlinear recurrences are found in real programs, we often are limited in obtaining faster numerical algorithms by nonlinear recurrences. By hand, the algorithm is reorganized into a potentially fast form which contains a nonlinear recurrence and, hence, cannot be handled by known methods.
While some nonlinear recurrences can be treated analytically and others can be shown to be intractable, this area is generally not well understood. The hand analysis of whole algorithms provides good nonlinear recurrence research problems. Of course, even if we cannot analytically speed up nonlinear recurrences, this does not mean that there is no hope. In fact, hardware tricks can be used to get around some nonlinearities. At the bit level, several examples of this, for example binary multiplication, are discussed in detail in [21].

To summarize, we have been discussing the structure of programs, the structure of machines, and the relation between the two. To discover ultimate speed machines, or simply to find low cost, high performance machine designs, such studies are important. By obtaining a better understanding of the structure of real programs we can determine good compiler algorithms for exotic machine organizations. Furthermore, there are benefits for standard machines; for example, we can hope to improve paging performance for virtual memory machines by understanding the control and data flow. Also, by transforming a program into simpler forms (for example, by removing IFs from loops and bringing out the array nature of the program) we can hope to aid programmers in understanding and debugging their programs. Finally, by viewing compilation in a broad way we can see that certain traditional speedup techniques of logic design are identical to methods useful in compilers for multioperation machines. These include tree-height reduction and fast linear recurrence solving.

ACKNOWLEDGMENT

I am particularly indebted to S. C. Chen, Duncan H. Lawrie, Ahmed H. Sameh, and Ross A. Towle for discussions about these ideas.

REFERENCES

[1] Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proc. AFIPS Conf., 1967, pp. 483-485.
[2] Baer, J. L., and Bovet, D. P. (1968). Compilation of arithmetic expressions for parallel computations. Proc. IFIP Congress, 1968, pp. 340-346.
[3] Baer, J. L., and Estrin, G. (1969). Bounds for maximum parallelism in a bilogic graph model of computations. IEEE Trans. Comput. C-18, 1012-1014.
[4] Barnes, G., Brown, R., Kato, M., Kuck, D., Slotnick, D., and Stokes, R. (1968). The ILLIAC IV computer. IEEE Trans. Comput. C-17, 746-757.
[5] Batcher, K. E. (1968). Sorting networks and their applications. Proc. AFIPS Spring Jt. Comput. Conf., 1968, pp. 307-314.
[6] Batcher, K. E. (1973). STARAN/RADCAP hardware architecture. Proc. Sagamore Conf. on Parallel Processing, 1973, pp. 147-152.
[7] Beatty, J. C. (1972). An axiomatic approach to code optimization for expressions. J. Ass. Comput. Mach. 19, 613-640.
[8] Benes, V. E. (1965). Mathematical Theory of Connecting Networks and Telephone Traffic. Academic Press, New York.
[9] Bernstein, A. (1966). Analysis of programs for parallel processing. IEEE Trans. Electronic Comput. EC-15, 757-763.
[10] Bingham, H. W., Fisher, D. A., and Semon, W. L. (1966). Detection of implicit computational parallelism from input-output sets. ECOM-02463-1, AD645438.
[11] Bingham, H. W., Reigel, E. W., and Fisher, D. A. (1968). Control mechanisms for parallelism in programs. ECOM-02463-7.
[12] Brent, R. (1974). The parallel evaluation of general arithmetic expressions. J. Ass. Comput. Mach. 21, 201-206.
[13] Brent, R., and Towle, R. (1975). On the time required to parse an arithmetic expression for parallel processing. Submitted for publication.
[14] Buchholz, W.
(1962). Planning a Computer System. McGraw-Hill, New York.
[15] Budnik, P., and Kuck, D. (1971). The organization and use of parallel memories. IEEE Trans. Comput. C-20, 1566-1569.
[16] Burnett, G. J., and Coffman, E. G., Jr. (1970). A study of interleaved memory systems. Proc. AFIPS Spring Jt. Comput. Conf., 1970, pp. 467-474.
[17] Burnett, G. J., and Coffman, E. G., Jr. (1973). A combinatorial problem related to interleaved memory systems. J. ACM 20, 39-45.
[18] Chang, D., Kuck, D. J., and Lawrie, D. (1975). On the effective bandwidth of parallel memories. Submitted for publication.
[19] Chen, S. C. (1975). Speedup of iterative programs in multiprocessor systems. Rep. No. 75-694, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[20] Chen, T. C. (1971). Unconventional superspeed computer systems. Proc. AFIPS Spring Jt. Comput. Conf., 1971, pp. 365-371.
[21] Chen, S. C., and Kuck, D. J. (1975). Combinational circuit synthesis with time and component bounds. Submitted for publication.
[22] Chen, S. C., and Kuck, D. (1975). Time and parallel processor bounds for linear recurrence systems. IEEE Trans. Comput. C-24, 701-717.
[23] Chen, S. C., Kuck, D. J., and Towle, R. (1975). Control and data dependence in ordinary programs. Submitted for publication.
[24] Chen, S. C., Kuck, D., and Towle, R. (1975b). Time and parallel processor bounds for Fortran-like loops. Submitted for publication.
[25] Chen, S. C., and Sameh, A. H. (1975). On parallel triangular system solvers. Sagamore Comput. Conf. on Parallel Processing, 1975.
[26] Coffman, E. G., Jr., Burnett, G. J., and Snowdon, R. A. (1971). On the performance of interleaved memories with multiple-word bandwidths. IEEE Trans. Comput. C-20, 1570-1572.
[27] Davis, E. W., Jr. (1972). Concurrent processing of conditional jump trees. Compcon 72, IEEE Computer Society Conf. Proc., 1972, pp. 279-281.
[28] Davis, E. W., Jr. (1972b). A multiprocessor for simulation applications. Rep. No. 72-527, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[29] Estrin, G., and Turn, R. (1963). Automatic assignment of computations in a variable structure computer system. IEEE Trans. Comput. EC-12, 755-773.
[30] Fisher, D. (1967). Program Analysis for Multiprocessing. Burroughs Corp., TR-67-2.
[31] Flynn, M. (1972). Some computer organizations and their effectiveness. IEEE Trans. Comput. C-21, 948-960.
[32] Hellerman, H. (1967). Computer System Principles. McGraw-Hill, New York.
[33] Hintz, R. G., and Tate, D. P. (1972). Control Data STAR-100 processor design. Proc. IEEE Comput. Soc. Conf., 1972, pp. 1-4.
[34] Hollaar, L. A. (1975). A list merging processor for inverted file information retrieval systems. Rep. No. 75-762, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[35] Hu, T. C. (1961). Parallel sequencing and assembly line problems. Oper. Res. 9, 841-848.
[36] Knuth, D. E. (1970). An empirical study of FORTRAN programs. Rep. No. CS-186, Dept. of Comput. Sci., Stanford Univ.
[37] Kraska, P. W. (1972). Parallelism exploitation and scheduling. Rep. No. 72-518, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[38] Kuck, D. J. (1968). ILLIAC IV software and application programming. IEEE Trans. Comput. C-17, 758-770.
[39] Kuck, D. (1973). Multioperation machine computational complexity. Proc. Symposium on Complexity of Sequential and Parallel Numerical Algorithms, 1973 (J. Traub, ed.), pp. 17-47. Academic Press, New York.
[40] Kuck, D. J. (1974).
On the speedup and cost of parallel computation. To appear in Proc. on Complexity of Computational Problem Solving, The Australian National University.
[41] Kuck, D., Budnik, P., Chen, S. C., Davis, E., Jr., Han, J., Kraska, P., Lawrie, D., Muraoka, Y., Strebendt, R., and Towle, R. (1974). Measurements of parallelism in ordinary FORTRAN programs. IEEE Computer 7, 37-46.
[42] Kuck, D., and Maruyama, K. (1975). Time bounds on the parallel evaluation of arithmetic expressions. SIAM J. Comput. 4, 147-162.
[43] Kuck, D., and Muraoka, Y. (1974). Bounds on the parallel evaluation of arithmetic expressions using associativity and commutativity. Acta Informatica 3, 203-216.
[44] Kuck, D., Muraoka, Y., and Chen, S. C. (1972). On the number of operations simultaneously executable in FORTRAN-like programs and their resulting speed-up. IEEE Trans. Comput. C-21, 1293-1310.
[45] Kung, H. T. (1974). New algorithms and lower bounds for the parallel evaluation of certain rational expressions. Proc. Sixth Annual ACM Symposium on Theory of Computing, 1974.
[46] Lamport, L. (1974). The parallel execution of DO loops. Comm. ACM 17, 83-93.
[47] Lawrie, D. (1973). Memory-processor connection networks. Rep. No. 73-557, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[48] Lawrie, D. (1975). Access and alignment of data in an array processor. IEEE Trans. Comput. C-24.
[49] Leondes, C., and Rubinoff, M. (1952). DINA, a digital analyzer for Laplace, Poisson, diffusion, and wave equations. AIEE Trans. (Commun. Electron.) 71, 303-309.
[50] Martin, D., and Estrin, G. (1967). Models of computations and systems: evaluation of vertex probabilities in graph models of computations. J. ACM 14, 281-299.
[51] Maruyama, K. (1973). On the parallel evaluation of polynomials. IEEE Trans. Comput. C-22, 2-5.
[52] Minsky, M. (1970). Form and content in computer science. J. ACM 17, 197-215.
[53] Muller, D. E., and Preparata, F. P. (1975). Restructuring of arithmetic expressions for parallel evaluation. Rep. No. R-676, Coordinated Science Lab., Univ. of Illinois at Urbana-Champaign.
[54] Muraoka, Y. (1971). Parallelism exposure and exploitation in programs. Rep. No. 71-424, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[55] Muraoka, Y., and Kuck, D. (1973). On the time required for a sequence of matrix products. Comm. ACM 16, 22-26.
[56] Opferman, D. C., and Tsao-Wu, N. T. (1971). On a class of rearrangeable switching networks. Bell Syst. Tech. Journ. 50, 1579-1618.
[57] Pease, M. C. (1968). An adaptation of the fast Fourier transform for parallel processing. J. ACM 15, 252-264.
[58] Ravi, C. V. (1972). On the bandwidth and interference in interleaved memory systems. IEEE Trans. Comput. C-21, 899-901.
[59] Russell, E. C. (1963). Automatic assignment of computational tasks in a variable structure computer. M.S. thesis, Univ. of Calif., Los Angeles.
[60] Sameh, A. H., and Brent, R. P. (1975). Solving triangular systems on a parallel computer. Submitted for publication.
[61] Sameh, A. H., Chen, S. C., and Kuck, D. (1974). Parallel direct Poisson and biharmonic solvers. Submitted for publication; also Rep. No. 74-684, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[62] Sameh, A. H., and Kuck, D. J. (1975). Linear system solvers for parallel computers. Submitted for publication; also Rep. No. 75-701, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[63] Senzig, D. (1967). Observations on high-performance machines. Proc. AFIPS Fall Jt. Comput. Conf., 1967, pp. 791-799.
[64] Stellhorn, W. H. (1974). A specialized computer for information retrieval. Rep. No. 74-637, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[65] Strebendt, R. E. (1974). Program speedup through concurrent record processing. Rep. No. 74-638, Ph.D. thesis, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[66] Swanson, R. C. (1974). Interconnections for parallel memories to unscramble p-ordered vectors. IEEE Trans. Comput. C-23, 1105-1115.
[67] Towle, R. (1974). Control structures inside DO loops. Ph.D. thesis proposal, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign.
[68] Watson, W. J., and Carr, H. M. (1974). Operational experiences with the TI advanced scientific computer. Proc. National Comput. Conf., 1974, pp. 389-397.
[69] Winograd, S. (1975). On the parallel evaluation of certain arithmetic expressions. J. ACM 22, 477-492.

[Figure 1. Expression of 8 Atoms]
[Figure 2. Tree-Height Reduction by Associativity: T_1 = 3, T_2 = 2, S_2 = 3/2, E_2 = 3/4]
[Figure 3. Tree-Height Reduction by Commutativity: T_1 = 3, T_2 = 2, S_2 = 3/2, E_2 = 3/4]
[Figure 4. Tree-Height Reduction by Distributivity: T_1 = 4, T_3 = 3, S_3 = 4/3, E_3 = 4/9]
[Figure 5. Program Graph (original G_1)]
[Figure 6. Distributed Graph (distributed G_1)]
[Figure 7. Speedup vs. Processors]
[Figure 8. Overall Machine Organization: M memories, alignment networks, N arithmetic-logic units]
[Figure 9. One-Dimensional Array Storage]
[Figure 10. Straight Storage (m = 4)]
[Figure 11. Skewed Storage (m = 4)]
[Figure 12. Skewed Storage (m = 5)]
[Figure 13. Skewed Storage (m = 2n = 8)]
[Figure 14. S/C Classes. Left column (less than maximum possible speedup): array of arithmetic expressions; search of an unsorted list; multiply n matrices (p = n^3); multiply two matrices (p = n^2); linear system; wave-front, 3 dimensions; multiply n matrices (p = n^2); triangular system (p = n); Poisson, iterative; wave-front, 2 dimensions; FFT; Poisson, direct; search of a sorted list; nonlinear recurrences; serial computation. Right column (best known speedups): multiply two matrices (p = n^3); arithmetic expression; simple linear recurrence; tridiagonal linear system; tridiagonal matrix eigenvalues; multiply n matrices (p = n^4); small triangular system (p = 64). All arrays are n x n.]

Table I. 1973 CACM DO Loop Summary

                     Recurrence Type
IF Type          1        2        3
No IF           51       24        2        3
A
Prefix B        24        6        1        1
Postfix B        3        8        1
C                1        1
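The skewed storage figures above (Figs. 10 through 13) can be read as a mapping from array subscripts to memory units. The sketch below assumes one simple skewing rule, placing a(i, j) in unit (i + j) mod m; this is illustrative only and not necessarily the exact rule behind the figures, but it shows the point of skewing: with m = 5 units the rows, columns, and main diagonal of a 4 x 4 array are all fetched conflict-free, while straight storage concentrates a column in one unit and m = 4 skewing still conflicts on the diagonal.

```python
def straight_unit(i, j, m):
    """Straight storage (cf. Fig. 10): a(i, j) goes to unit j mod m."""
    return j % m

def skewed_unit(i, j, m):
    """Simple skewed storage (cf. Figs. 11-13): a(i, j) goes to unit (i + j) mod m.
    (One common skewing rule, assumed here for illustration.)"""
    return (i + j) % m

def conflict_free(mapping, index_pairs, m):
    """True if the accessed elements land in pairwise distinct memory units."""
    units = [mapping(i, j, m) for i, j in index_pairs]
    return len(set(units)) == len(units)

n = 4
row = [(2, j) for j in range(n)]        # a row of a 4 x 4 array
col = [(i, 1) for i in range(n)]        # a column
diag = [(i, i) for i in range(n)]       # the main diagonal

for name, mapping, m in (("straight, m=4", straight_unit, 4),
                         ("skewed,   m=4", skewed_unit, 4),
                         ("skewed,   m=5", skewed_unit, 5)):
    results = {label: conflict_free(mapping, acc, m)
               for label, acc in (("row", row), ("column", col), ("diagonal", diag))}
    print(name, results)

# Expected: straight m=4 fails on columns, skewed m=4 fails on the diagonal,
# and skewed m=5 is conflict-free for all three access patterns.
```

Choosing m larger than, and relatively prime to, the access strides (here m = 5 for a 4 x 4 array, as in Fig. 12) is what makes these partitions simultaneously conflict-free; an alignment network of the kind shown in Fig. 8 then unscrambles the skewed order for the arithmetic units (cf. [15], [48]).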