Report No. 424

PARALLELISM EXPOSURE AND EXPLOITATION IN PROGRAMS

by Yoichi Muraoka

February, 1971

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS

PARALLELISM EXPOSURE AND EXPLOITATION IN PROGRAMS

BY
YOICHI MURAOKA
B.Eng., Waseda University, 1965
M.S., University of Illinois, 1969

THESIS
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1971

Urbana, Illinois

ACKNOWLEDGEMENT

The author would like to express his deepest gratitude to Professor David J. Kuck of the Department of Computer Science of the University of Illinois, whose encouragement and good advice have led this work to successful completion. Paul Kraska read the thesis and provided valuable comments. Special thanks go to Mrs. Linda Bridges, without whose excellent job of typing the final form would never have come out. Thanks are also extended to Mrs. Diana Mercer, who helped in getting the thesis finished on time.

TABLE OF CONTENTS

1. INTRODUCTION
2. PARALLEL COMPUTATION OF SUMMATIONS, POWERS AND POLYNOMIALS
   2.1 Introduction
   2.2 Summation of n Numbers
   2.3 Computation of Powers
   2.4 Computation of Polynomials
       2.4.1 Computation of a Polynomial on an Arbitrary Size Machine
             2.4.1.1 k-th Order Horner's Rule
             2.4.1.2 Estrin's Method
             2.4.1.3 Tree Method
             2.4.1.4 Folding Method
             2.4.1.5 Comparison of Four Methods
       2.4.2 Polynomial Computation by the k-th Order Horner's Rule
3. TREE HEIGHT REDUCTION ALGORITHM
   3.1 Introduction
   3.2 Tree Height and Distribution
   3.3 Holes and Spaces
       3.3.1 Introduction
       3.3.2 Holes
       3.3.3 Space
   3.4 Algorithm
       3.4.1 Distribution Algorithm
       3.4.2 Implementation
   3.5 Discussion
       3.5.1 The Height of a Tree
       3.5.2 Introduction of Other Operators
             3.5.2.1 Subtraction and Division
             3.5.2.2 Relational Operators
4. COMPLETE PROGRAM HANDLING
   4.1 Back Substitution -- A Block of Assignment Statements and an Iteration
   4.2 Loops
   4.3 Jumps
   4.4 Error Analysis
5. PARALLELISM BETWEEN STATEMENTS
   5.1 Program
   5.2 Equivalent Relations Between Executions
6. PARALLELISM IN PROGRAM LOOPS
   6.1 Introduction
       6.1.1 Replacement of a for Statement with Many Statements
       6.1.2 A Restricted Loop
   6.2 A Loop With a Single Body Statement
       6.2.1 Introduction
       6.2.2 Type 1 Parallelism
             6.2.2.1 General Case
             6.2.2.2 A Restricted Loop
             6.2.2.3 Temporary Locations
       6.2.3 Type 2 Parallelism
       6.2.4 Conclusion
   6.3 A Loop With Many Body Statements
       6.3.1 Introduction
       6.3.2 Parallel Computation with Respect to a Loop Index
       6.3.3 Separation of a Loop
             6.3.3.1 Introduction
             6.3.3.2 The Ordering Relation and Separation of a Loop
             6.3.3.3 Temporary Storage
       6.3.4 Parallelism Between Body Statements
             6.3.4.1 Introduction
             6.3.4.2 The Statement Dependence Graph and the Algorithm
       6.3.5 Discussion
7. EQUALLY WEIGHTED -- TWO PROCESSOR SCHEDULING PROBLEM
   7.1 Introduction
   7.2 Job Graph
   7.3 Scheduling of a Tight Graph
   7.4 Scheduling of a Loose Graph
   7.5 Supplement
8. CONCLUSION
LIST OF REFERENCES
VITA

LIST OF TABLES

2.1. The Parallel Computation Time for Summation, Power and Polynomial
2.2. The Number of Steps Required to Compute Σ_{i=1}^{n} a_i on P(m), h_a(m,n), for n ≤ 10
2.3. Computation of p_n(x) by the Folding Method
2.4. The Number of Steps Required to Compute p_n(x), h_p(m,n), for n ≤ 10
4.1. Comparison of Back Substituted and Non-Back Substituted Computation -- Iteration Formulas
4.2. Comparison of Back Substituted and Non-Back Substituted Computation -- General Cases

LIST OF FIGURES

1.1. Statement Dependence Relation
1.2. Trees for ((a+b)+(c+d)) and (((a+b)+c)+d)
1.3. Trees for a + b×c + d and b×c + a + d
1.4. Trees for a(bcd+e) and abcd + ae
2.1. The Minimum Number, M, of PE's Required to Add n Numbers in the Minimum Time
2.2. Computation of x^i (1)
2.3. Computation of x^i (2)
2.4. Computation of x^i (3)
2.5. Computation of x^i (4)
2.6. Computation of a_i x^i
2.7. A Tree for p_{s+t-j}(x)
2.8. A Tree for p_{s+t}(x)
2.9. Comparison of the Four Parallel Polynomial Computation Schemes
2.10. k-th Order Horner's Rule
2.11. The Number of Steps, h_p(m,n), to Compute p_n(x) on P(m) by the m-th Order Horner's Rule
2.12. The Minimum Number, M, of PE's Required to Compute p_n(x) in the Minimum Time
3.1. An Arithmetic Expression Tree (1)
3.2. An Arithmetic Expression Tree (2)
3.3. Free Nodes
3.4. Free Nodes in a Tree
3.5. An Example of F_A and F_R
3.6. Elimination of a Free Node
3.7. A Minimum Height Tree
3.8. Attachment of T[t'] to a Free Node
3.9. An Example of Space (1)
3.10. An Example of Space (2)
3.11. Distribution of t' over A
3.12. Tree Height Reduction by Hole Creation
3.13. Stacks for an Arithmetic Expression
4.1. A Back Substituted Tree
4.2. Loop Analysis
4.3. A Tree with a Boolean Expression
4.4. Trees for a(bc+d) + e and abc + ad + e
5.1. Conditions for the Output Equivalence
6.1. E_0
6.2. E[I]
6.3. Conditions of Parallel Computation in a Loop
6.4. An Illustration of t
6.5. Wave Front
6.6. Wave Front Travel
6.7. An Illustration for Theorem 4
6.8. An Execution by a Wave Front
6.9. Simultaneous Execution of Body Statements
6.10. Execution of P_B
6.11. An Introduction of Temporary Locations
6.12. Wave Front for Simultaneous Execution of Body Statements
6.13. A Wave Front for Example 10
7.1. Computation of Nondistributed and Distributed Arithmetic Expressions on P(2)
7.2. Common Expression
7.3. A Loose Graph and a Tight Graph
7.4. A Graph G
7.5. An Illustration for Lemma 3
7.6. An Example of a Tight Graph Scheduling
7.7. An Illustration for Lemma 11
7.8. A Loose Node
7.9. The p-line Relation in B_n
7.10. An Example of the Maximum p-connectable Distance
7.11. An Illustration for Lemma 13
7.12. An Example for A(-)
7.13. An Example for p-connectivity Discovery

1. INTRODUCTION

1.1 Introduction

The purpose of this research is to study compiling techniques for parallel processing machines.

Due to remarkable innovations in technology today, such as the introduction of LSI, it has become feasible to introduce more hardware into computer systems to attain otherwise impossible high speeds.
For example, Winograd [42] showed that the minimum amount of time required to add two t-bit numbers is ⌈log₂ t⌉·d (⌈x⌉ denotes the smallest integer not smaller than x), where we assume that an adder consists of two-input binary logic elements, e.g. AND or OR gates, and d is the delay time per gate. An adder which realizes this speed requires a huge number of gates, e.g. approximately 1300 gates for t = 6 [12], and it had been out of the question to build such an adder. However, the introduction of LSI has reduced the cost of a gate significantly; e.g. it has been anticipated that by 1974 the cost of LSI would be reduced to 0.7 cent per gate [33].

Another example is the class of parallel processing machines. The Illiac IV [7], the CDC 6600 [4] and the D825 [41] are included in this class. A machine in this class has, for example, many arithmetic units to allow simultaneous execution of arithmetic operations. As an extreme case it has been suggested to include special arithmetic units, e.g. a log taking unit (ln x) and an exponent unit (e^x) [16]. (Such being the case, this decade may be marked as a "computer architecture" race, reminiscent of the cycle-time and multiprogramming races of the 60's [12].) We shall not go into the details of machines further. An extensive survey of parallel processing machines is found in [30].

Having a parallel processing machine which is capable of processing many operations simultaneously, we are faced with the problem of exploiting parallelism in a program so that computational resources are kept as busy as possible and the program is processed in the shortest time. We now discuss the problem in detail.

In this thesis, by a parallel (processing) machine P we understand a set of arbitrarily many identical elements called processing elements (PE's). A PE is assumed to be capable of performing any binary arithmetic operation, e.g. addition and multiplication, in the same amount of time. Furthermore, we assume that data can be transferred between any PE's instantaneously. We write P(m) if P has only m PE's. A machine of this nature may be considered as a generalization of the Illiac IV.

To date, two types of parallelism exploitation techniques are known to compile a program written in a conventional programming language (e.g. ALGOL) for a parallel processing machine [36]. They may be termed the intra-statement and the inter-statement parallelism exploitation techniques.

The first technique is to analyze the parallelism which exists within a statement, e.g. an arithmetic expression; this has been explored by Stone [40], Squire [39], Hellerman [20], and Baer and Bovet [6]. For example, consider the arithmetic expression a + b + c + d + e + f + g + h and a syntactic tree for it: the balanced binary tree which forms the four sums a+b, c+d, e+f and g+h on level 1, two sums on level 2, and the final sum on level 3. The tree is such that operations on the same level may be done in parallel. The height of a tree is the maximum level of the tree and indicates the number of steps required to evaluate an arithmetic expression in parallel. Note that there may be many different syntactic trees for an arithmetic expression, and among them a tree with the minimum height should be chosen to attain the minimum parallel computation time. Baer and Bovet's algorithm is claimed to achieve this end, i.e. to build the minimum height syntactic tree for an arithmetic expression [6].
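To make the level-by-level schedule concrete, here is a small sketch (added for illustration, in Python; it is not part of the thesis) that evaluates the eight-variable sum above one level at a time and counts the parallel steps; the count equals the tree height ⌈log₂ 8⌉ = 3:

    def parallel_sum_steps(operands):
        """Reduce the operand list pairwise and count parallel steps."""
        steps = 0
        while len(operands) > 1:
            # all of these pairwise additions occur on one level
            paired = [operands[i] + operands[i + 1]
                      for i in range(0, len(operands) - 1, 2)]
            if len(operands) % 2:          # an odd operand is carried upward
                paired.append(operands[-1])
            operands = paired
            steps += 1
        return operands[0], steps

    total, height = parallel_sum_steps([1, 2, 3, 4, 5, 6, 7, 8])
    print(total, height)   # 36, 3 : eight variables are summed in three steps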
Exploitation of inter-statement parallelism has also been studied [10], [37]. An outcome of these works is an algorithm (the dependence relation detection algorithm [10]) which detects the dependence relation between statements in a loop- and jump-free sequence of statements. The dependence relation between S and S' holds if S precedes S' in the sequence and S' uses the output of S as an input. For example, the algorithm detects that the statement S1 in Figure 1.1 must be computed before the statement S2, but that S1 may be computed simultaneously with S3.

    S1: x := f1(y);
    S2: u := f2(x);
    S3: v := f3(w);
    S4: z := f4(v,u);

    (a) program        (b) dependence relation

Figure 1.1. Statement Dependence Relation

Since in a real program the major part of the execution time is spent within loops if it is executed sequentially, the major effort should be directed toward detecting inter-statement parallelism in loops. For example, we would like to find out that all fifty statements, A[1] := f(B[1]), ..., A[50] := f(B[50]), in the loop

    E: for I := 1 step 1 until 50 do A[I] := f(B[I])

may be executed simultaneously, to reduce the computation time to one fiftieth of the original. A technique available now which detects inter-statement parallelism inside a loop requires the loop to be first replaced with (expanded to) a sequence of statements, e.g. E in the above example must be replaced with the sequence of fifty statements A[1] := f(B[1]), ..., A[50] := f(B[50]), so that the dependence relation detection algorithm can be applied [10]. Obviously this approach obscures an advantage of the introduction of loops into a program, because essentially all loops are required to be removed from the program and replaced with straight-line programs before the dependence relation detection algorithm can be applied to them.
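The following sketch illustrates the dependence relation on the program of Figure 1.1. The representation, one output variable and a set of input variables per statement, is an assumption made for this example only; the actual algorithm is that of reference [10]:

    statements = [
        ("S1", "x", {"y"}),
        ("S2", "u", {"x"}),
        ("S3", "v", {"w"}),
        ("S4", "z", {"v", "u"}),
    ]

    def dependence_relation(stmts):
        """S depends on an earlier S' if S reads the output of S'."""
        edges = []
        for i, (name_i, out_i, _) in enumerate(stmts):
            for name_j, _, ins_j in stmts[i + 1:]:
                if out_i in ins_j:
                    edges.append((name_i, name_j))
        return edges

    print(dependence_relation(statements))
    # [('S1', 'S2'), ('S2', 'S4'), ('S3', 'S4')] : S1 must precede S2,
    # but S3 may run simultaneously with S1 and S2.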
The techniques described above find parallelism inside and between statements as they are presented. If the size of a machine (i.e. the number of PE's) is unlimited, however, then it becomes necessary to exploit more parallelism from a program than the above approaches provide. One obvious strategy is to write a completely new program using, e.g., parallel numerical methods [32], [38]. The other approach, which we will pursue here, is to transform a given program to "squeeze" more parallelism from it. While the first approach requires programmers (or users) to reanalyze problems and reprogram, the second approach tries to accept existing sequential programs written in, e.g., ALGOL and execute them in parallel.

First we study parallel computation of an arithmetic expression more carefully along this line. For the sake of argument, let us assume that an arithmetic expression consists of additions, multiplications and possibly parentheses. Then the associative, the commutative and the distributive laws hold. The first and second laws have already been used to exploit more parallelism from an arithmetic expression. For example, the associative law allows one to compute the arithmetic expression a + b + c + d as ((a+b) + (c+d)) in two steps rather than as (((a+b)+c)+d), which requires three steps.

Figure 1.2. Trees for ((a+b)+(c+d)) and (((a+b)+c)+d)

Also it has been recognized that the commutative law together with the associative law gives a lower height tree. For example, ((a + b×c) + d) requires three steps while (b×c + (a + d)) requires two [39].

Figure 1.3. Trees for a + b×c + d and b×c + a + d

Now we turn our interest to the third law, the distributive law, and see if it can help speed up computation. As we can readily see, there are cases when distribution helps. For example, a(bcd + e) requires four steps while its equivalent abcd + ae, which is obtained by distributing a over bcd + e, can be computed in three steps.

Figure 1.4. Trees for a(bcd+e) and abcd + ae

However, distribution does not always speed up computation. For example, the undistributed form ab(c+d) can be computed in fewer steps than the distributed form abc + abd. Hence indiscriminate distribution is not the solution to the problem. Chapter 3 of this thesis studies this situation and gives an algorithm which we call the distribution algorithm. Given an arithmetic expression A, the distribution algorithm derives the arithmetic expression A^d by distributing multiplications over additions properly so that the height of A^d (we write h[A^d] for this) is minimized. The algorithm works from the innermost parenthesis level to the outermost parenthesis level of an arithmetic expression and requires only one scan through the entire arithmetic expression. Chapter 3 concludes by giving a measure of the height of the minimum height tree for A as well as A^d as a function of fundamental values such as the number of single variable occurrences in A.

The idea is extended to handle a sequence of assignment statements in Chapter 4. The distribution algorithm is applied to the arithmetic expression which is obtained by back substituting the statements into one another. Suppose we have a sequence of n assignment statements A_1, A_2, ..., A_n, and we get the assignment statement A from this sequence by back substitution. If the sequence is computed sequentially, i.e. one statement after another, but each statement is computed in parallel, then it will take h[A_1] + h[A_2] + ... + h[A_n] steps to compute the sequence (where h[A_i] is the height of the minimum height tree for A_i). Instead we may compute the back substituted statement A in parallel, which requires h[A] steps. Obviously

    h[A_1] + ... + h[A_n] ≥ h[A]

holds. Chapter 4 discusses cases when the strict inequality holds. The cases include iteration formulas such as x_{i+1} := a × x_i + b.

Next we study inter-statement parallelism in terms of program loops. Chapter 6 first establishes a new algorithm which detects inter-statement parallelism in a loop. The algorithm is such that it only examines index expressions and the way index values vary in a loop to detect parallel computability. For example, the algorithm checks the index expressions I and I + 1 as well as the clause "I := 1 step 1 until 20" in the loop

    for I := 1 step 1 until 20 do A[I] := A[I+1] + B

and detects that all twenty statements, A[1] := A[2] + B, ..., A[20] := A[21] + B, may be computed simultaneously. Thus it is not necessary to expand a loop into a sequence of statements, as was required before, to check inter-statement parallelism. In general, the amount of work (i.e. the time) required by the algorithm is proportional to the number of index expression occurrences in statements in a loop.

Having established the algorithm, Chapter 6 further introduces two techniques which help to exploit more inter-statement parallelism in loops. These are the introduction of temporary locations and the distribution of a loop. The second technique resembles the idea introduced in Chapter 3, i.e. reduction of tree height for an arithmetic expression by distribution.
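A brute-force version of such a test can be sketched as follows (an illustration only: the thesis's algorithm works symbolically on the index expressions rather than by enumeration, and the lock-step model below, in which every iteration reads before any iteration writes, is our assumption for the example):

    def simultaneous_ok(write_idx, read_idxs, lo, hi):
        """True if running all iterations in lock step gives the sequential
        result: no iteration reads a location that an earlier iteration
        writes, and no two iterations write the same location."""
        writes = [write_idx(i) for i in range(lo, hi + 1)]
        if len(set(writes)) != len(writes):
            return False
        written_so_far = set()
        for i in range(lo, hi + 1):
            for read in read_idxs:
                if read(i) in written_so_far:
                    return False       # flow dependence across iterations
            written_so_far.add(write_idx(i))
        return True

    print(simultaneous_ok(lambda i: i, [lambda i: i + 1], 1, 20))  # True
    print(simultaneous_ok(lambda i: i, [lambda i: i - 1], 1, 20))  # False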
Let us write I, J, K(S1, S2, S3) for an ALGOL-like program

    for I := i_1, i_2, ..., i_m do
      for J := j_1, j_2, ..., j_n do
        for K := k_1, k_2, ..., k_p do
          begin S1; S2; S3 end.

Furthermore, by e.g. [I, J], K(S1, S2, S3) we understand a loop*

    for (I, J) := (i_1, j_1), (i_1, j_2), ..., (i_1, j_n), (i_2, j_1), ..., (i_m, j_n) do
      for K := k_1, k_2, ..., k_p do
        begin S1; S2; S3 end.

*This is equivalent to a TRANQUIL expression [2]: for (I, J) seq ((i_1, j_1); (i_1, j_2); ...; (i_m, j_n)) do.

Then, as in the case of arithmetic expressions, we may establish the following:

(a) Association: introduction of brackets, e.g. I, [J, K](S1, S2, S3).
(b) Commutation: change of the order of I, J, K, e.g. I, K, J(S1, S2, S3).
(c) Distribution: distribution of I, J, K over S1, S2, S3, e.g. I, J, K(S1), I, J, K(S2, S3).

While the associative law always holds, e.g. I, J, K(S) = [I, J], K(S), the commutative and the distributive laws do not necessarily hold for all loops; e.g. I, J(S) ≠ J, I(S) if I, J(S) represents the loop

    for I := 1, 2, 3 do
      for J := 1, 2, 3 do
        A[I,J] := A[I+1, J-1].

In short, Chapter 6 shows that commutation indicates the possibility of computing a loop in parallel as it is, and distribution indicates the possibility of introducing more parallelism into a program. For example, if I, J, K(S) = K, I, J(S), then S can be computed simultaneously for all values of K while I and J vary sequentially. Next suppose a loop I(S1, S2) cannot be computed in parallel for all values of I. Then in a certain case it is possible to distribute and obtain two loops I(S1), I(S2) which are equivalent to the original loop I(S1, S2), and execute each of the two loops in parallel for all values of I separately. Chapter 6 gives an algorithm to distribute to attain this end.

The thesis thus introduces new techniques which transform a given program to expose hidden parallelism.

All results in this thesis are also readily applicable to another type of machine, i.e. machines with a pipeline arithmetic unit such as the CDC STAR [18] (we regard this type of machine as a special type of parallel machine and call them serial array machines). Each stage of a pipeline unit may be regarded as an independent PE in the sense that an operation being processed in one stage of a pipeline unit must not depend on an operation being processed in a different stage. Hence exploiting parallelism results in busying many stages at once.

Two more chapters are included in this thesis to make it complete. Chapter 2 studies parallel computation of special cases of arithmetic expressions, e.g. powers and polynomials, in detail to give a measure of the power of a parallel processing machine. As was mentioned before, unless specially mentioned, it will be assumed that there are a sufficient number of PE's available to perform the desired task. In reality, however, that may not be the case and nontrivial scheduling problems may arise. To give some insight into this problem, Chapter 7 discusses a solution to the two processor, equally weighted job scheduling problem.

We conclude this chapter by defining the following symbols:

⌈x⌉ ... the smallest integer not smaller than x,
⌊x⌋ ... the largest integer not larger than x, and
⟨x⟩ ... the smallest power of 2 not smaller than x.

Also, unless specified, the base of logarithms is assumed to be 2, e.g. log n is log₂ n.
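For later illustrations, the three symbols can be realized by the following small helpers (a sketch; the function names are ours, not the thesis's, and these helpers are reused in the sketches of Chapter 2):

    def ceil_div(a, b):            # smallest integer not smaller than a/b
        return -(-a // b)

    def ceil_log2(x):              # smallest k with 2**k >= x
        return max(0, (x - 1).bit_length())

    def floor_log2(x):             # largest k with 2**k <= x
        return x.bit_length() - 1

    def pow2_ceiling(x):           # smallest power of 2 not smaller than x
        return 1 << ceil_log2(x)

    assert ceil_log2(17) == 5 and floor_log2(17) == 4 and pow2_ceiling(17) == 32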
2. PARALLEL COMPUTATION OF SUMMATIONS, POWERS AND POLYNOMIALS

2.1 Introduction

In this chapter, we study the parallel computation of summations, powers and polynomials. We first assume that m processors (PE's) are available. Then the parallel computation times for the summation Σ_{i=1}^{n} a_i and the power x^n are given as functions of m and n. The minimum time to evaluate Σ_{i=1}^{n} a_i or x^n, as well as the minimum number of PE's required to attain it, is also derived.

Polynomial computation is first studied assuming the availability of an arbitrary number of PE's. The lower bound on the computation time for a polynomial of degree n, p_n(x), is presented. A scheme which computes p_n(x) in less time than any known scheme is obtained. Because of its simplicity in scheduling, the k-th order Horner's rule is studied further in detail. It is shown that for this algorithm the availability of more PE's sometimes increases the computation time.

Table 2.1 summarizes a part of the results of this chapter. Before we go further, a few comments are in order. The base of logarithms in this chapter is 2, e.g. log n is actually log₂ n. The following lemma will be frequently referred to in the text.

Table 2.1. The Parallel Computation Time for Summation, Power and Polynomial

Lemma 1: For nonzero positive real numbers a and b,
(1) ⌈log a⌉ - ⌈log b⌉ ≤ ⌈log a - log b⌉,
(2) a ≤ ⌈a⌉ < a + 1, and
(3) b - 1 < ⌊b⌋ ≤ b.

Proof: (1) Let a = 2^h + k and b = 2^f + g where 0 ≤ k ≤ 2^h - 1 and 0 ≤ g ≤ 2^f - 1. The proof is divided into four cases: (i) k, g > 0; (ii) k = g = 0; (iii) k = 0, g > 0; and (iv) k > 0, g = 0.
(i) k, g > 0. Then ⌈log a⌉ - ⌈log b⌉ = (h + 1) - (f + 1) = h - f. Also let log a = h + x and log b = f + y where 0 < x, y < 1. Then ⌈log a - log b⌉ ≥ h - f. Thus ⌈log a⌉ - ⌈log b⌉ ≤ ⌈log a - log b⌉. The other three cases may be proved similarly and the details are omitted.
(2) and (3) follow from the definitions. (Q.E.D.)

2.2 Summation of n Numbers

Theorem 1: The minimum number of steps, h_a(m,n), to add n numbers on P(m) is

    h_a(m,n) = (1) n - 1, if m = 1;
               (2) ⌊n/m⌋ - 1 + ⌈log(m + n - ⌊n/m⌋·m)⌉, if ⌊n/2⌋ > m ≥ 2;
               (3) ⌈log n⌉, if m ≥ ⌊n/2⌋.

Proof: (1) is self evident. (3) uses the so-called log sum method [22], or the tree method (see Theorem 1 of Chapter 3). It is clear that ⌈log n⌉ steps suffice and also that n numbers cannot be added in fewer steps. Now we prove (2). First each PE adds ⌊n/m⌋ numbers independently. This takes ⌊n/m⌋ - 1 steps and produces m partial sums. Then there will be m + (n - ⌊n/m⌋·m) numbers left. Clearly m + n - ⌊n/m⌋·m < 2m. Those numbers are added by the log sum method, which takes ⌈log(m + n - ⌊n/m⌋·m)⌉ steps. (Q.E.D.)
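The closed form can be checked against a direct simulation of the schedule used in the proof (a sketch; h_a below is the theorem's formula, and the simulation replays the two phases of the proof):

    def h_a(m, n):
        """Closed form of Theorem 1 (n >= 2, m >= 1)."""
        if m == 1:
            return n - 1
        if m >= n // 2:
            return (n - 1).bit_length()          # ceil(log2 n)
        return n // m - 1 + (m + n - (n // m) * m - 1).bit_length()

    def simulate(m, n):
        if m == 1:
            return n - 1
        steps = n // m - 1                        # independent serial phase
        remaining = m + n - (n // m) * m          # partial sums left over
        while remaining > 1:                      # log sum phase
            remaining = -(-remaining // 2)
            steps += 1
        return steps

    for n in range(2, 60):
        for m in range(1, n // 2 + 1):
            assert h_a(m, n) == simulate(m, n)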
Now we show that for a fixed n, ⌊n/2⌋ ≥ m > m' implies that h_a(m,n) ≤ h_a(m',n). To prove this it is enough to show that h_a(m+1,n) ≤ h_a(m,n) where m + 1 ≤ ⌊n/2⌋. There are two cases:

(1) ⌊n/m⌋ = ⌊n/(m+1)⌋ = k > 1. Let n = km + p (p < m). Then

    m + n - ⌊n/m⌋·m = m + p

and

    (m+1) + n - ⌊n/(m+1)⌋·(m+1) = m + p + 1 - k.

Hence ⌈log(m + n - ⌊n/m⌋·m)⌉ ≥ ⌈log((m+1) + n - ⌊n/(m+1)⌋·(m+1))⌉, or h_a(m,n) ≥ h_a(m+1,n).

(2) ⌊n/m⌋ > ⌊n/(m+1)⌋. Let ⌊n/m⌋ = k and ⌊n/(m+1)⌋ = k - g, where k, g ≥ 1. Then

    n = km + p (p < m)                                   (1)

and

    n = (k - g)(m+1) + p' (p' < m + 1).                  (2)

Suppose h_a(m,n) < h_a(m+1,n), i.e.

    ⌊n/m⌋ - 1 + ⌈log(m + n - ⌊n/m⌋·m)⌉ < ⌊n/(m+1)⌋ - 1 + ⌈log((m+1) + n - ⌊n/(m+1)⌋·(m+1))⌉.   (3)

By substituting Eq. (1) and (2) into Eq. (3) and rearranging, we get

    g < ⌈log(m + 1 + p')⌉ - ⌈log(m + p)⌉.                (4)

By Lemma 1(1), Eq. (4) can hold only if

    g < ⌈log((m + 1 + p')/(m + p))⌉,                     (5)

and Eq. (5) holds only if

    2^g < (m + 1 + p')/(m + p).                          (6)

Since

    2 + 1/m > (m + 1 + p')/(m + p),                      (7)

Eq. (6) can hold only if g = 1 (remember that g ≥ 1). By letting g = 1 in Eq. (2), we get p' = km + p - (k - 1)(m + 1). Then by substituting this and g = 1 into Eq. (6) and rearranging, we get p < 2 - k. This only holds if k = 1 and p = 0, which implies that n = m (see Eq. (1)) and contradicts our assumption that m + 1 ≤ ⌊n/2⌋. Hence h_a(m,n) < h_a(m+1,n) never holds.

The above two cases prove that h_a(m+1,n) ≤ h_a(m,n). Thus we have the following lemma.

Lemma 2: h_a(m',n) ≥ h_a(m,n) if m' ≤ m ≤ ⌊n/2⌋.

The above lemma may seem insignificant. In Section 2.4.2, however, it will be shown that for a certain algorithm to compute an n-th degree polynomial on P(m), the computation time, h_p(m,n), is not a nonincreasing function of m, i.e. m > m' does not necessarily imply that h_p(m,n) ≤ h_p(m',n). As will be described later, that algorithm is such that all PE's are forced to participate in the computation. It is true that if we are allowed to "turn off" some PE's, then we always get h_p(m,n) ≤ h_p(m',n) if m > m'. Then a question is how many PE's are to be turned off. These problems will be studied in Section 2.4.2.

It should be noted that the minimum number of PE's required to achieve the minimum computation time is not necessarily ⌊n/2⌋. For example, let n = 17. Then ⌊17/2⌋ = 8 and the minimum is h_a(8,17) = 5; but also h_a(6,17) = 5. As we know, the minimum computation time to add n numbers is ⌈log n⌉. Now we present the minimum number of PE's, M, which achieves this bound.

Theorem 2: For a fixed n, let M be the smallest m such that

    ⌊n/m⌋ - 1 + ⌈log(m + n - ⌊n/m⌋·m)⌉ = ⌈log n⌉.

Then

    M = (1) 1 + ⌊(n-1)/2⌋ - 2^{k-2}, if 2^k < n ≤ 2^k + 2^{k-1};
        (2) n - 2^k, if 2^k + 2^{k-1} < n ≤ 2^{k+1},

where k = ⌊log(n-1)⌋.
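Before turning to the proof, the claimed closed form can be checked numerically against a direct search (a sketch, reusing h_a from the previous sketch):

    def M_closed(n):
        k = (n - 1).bit_length() - 1              # floor(log2(n - 1))
        if n <= 2 ** k + 2 ** (k - 1):
            return 1 + (n - 1) // 2 - 2 ** (k - 2)
        return n - 2 ** k

    def M_search(n):                              # least m reaching ceil(log2 n)
        m = 1
        while h_a(m, n) != (n - 1).bit_length():
            m += 1
        return m

    for n in range(5, 200):
        assert M_closed(n) == M_search(n), n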
Proof: For k ≤ 3, direct examination shows that the theorem holds (see Table 2.2). Therefore we assume that k > 3. The proof is divided into two parts: (1) 2^k < n ≤ 2^k + 2^{k-1} and (2) 2^k + 2^{k-1} < n ≤ 2^{k+1}.

(1) Let

    n = 2^k + p (1 ≤ p ≤ 2^{k-1}).                       (8)

By the statement of the theorem,

    M = 1 + ⌊(n-1)/2⌋ - 2^{k-2},                          (9)

and since m = M falls under case (2) of Theorem 1,

    h_a(M,n) = ⌊n/M⌋ - 1 + ⌈log(M + n - ⌊n/M⌋·M)⌉.        (10)

First we prove that h_a(M,n) ≤ ⌈log n⌉ = k + 1. From Eq. (8) and (9) we get

    M = 2^{k-2} + 1 + ⌊(p-1)/2⌋.                          (11)

Using Lemma 1(2) and (3), a direct computation shows that for all p (1 ≤ p ≤ 2^{k-1}, k ≥ 4)

    ⌊n/M⌋ = 3.                                            (12)

Substituting this into Eq. (10), we get h_a(M,n) = 2 + ⌈log(n - 2M)⌉. Subtracting two times Eq. (11) from Eq. (8),

    n - 2M = 2^{k-1} + p - 2 - 2⌊(p-1)/2⌋.                (13)

Eq. (13) is evaluated in three ways according to the value of p:
(i) p = 2g (g ≥ 1): n - 2M = 2^{k-1};
(ii) p = 2g + 1 (g ≥ 1): n - 2M = 2^{k-1} - 1 < 2^{k-1};
(iii) p = 1: n - 2M = 2^{k-1} - 1 < 2^{k-1}.
Hence in any case n - 2M ≤ 2^{k-1}, or ⌈log(n - 2M)⌉ ≤ k - 1. Thus h_a(M,n) ≤ k + 1 = ⌈log n⌉. This proves the first part of (1).

Next we prove that h_a(m,n) > ⌈log n⌉ for all m < M. It suffices to show that h_a(M-1,n) > ⌈log n⌉; Lemma 2 then extends the claim to all m < M. In the same way as Eq. (12), one shows

    ⌊n/(M-1)⌋ = 3,                                        (14)

so h_a(M-1,n) = 2 + ⌈log(n - 2(M-1))⌉. From Eq. (8) and (11),

    n - 2(M-1) = 2^{k-1} + p - 2⌊(p-1)/2⌋ ≥ 2^{k-1} + 1,

or ⌈log(n - 2(M-1))⌉ ≥ k. Hence h_a(M-1,n) ≥ k + 2 > k + 1 = ⌈log n⌉, and this proves (1).

(2) 2^k + 2^{k-1} < n ≤ 2^{k+1}. Let

    n = 2^k + 2^{k-1} + p (1 ≤ p ≤ 2^{k-1}),              (15)

so that M = n - 2^k = 2^{k-1} + p. Since 2^{k-1} < M ≤ 2^k and n = 2^k + M, we have ⌊n/M⌋ = 2 for all such n. Then M + n - ⌊n/M⌋·M = n - M = 2^k, so h_a(M,n) = 1 + ⌈log 2^k⌉ = k + 1 = ⌈log n⌉. Finally, one shows as in part (1) that h_a(M-1,n) > h_a(M,n), which together with Lemma 2 completes the proof of (2). (Q.E.D.)

Table 2.2. The Number of Steps Required to Compute Σ_{i=1}^{n} a_i on P(m), h_a(m,n), for n ≤ 10

Figure 2.1. The Minimum Number, M, of PE's Required to Add n Numbers in the Minimum Time

2.3 Computation of Powers

Lemma 3 [23]: Let N be the number of ones in the binary representation of n. Then the near minimum number of steps to compute x^n on P(1), h_e(1,n), is

    h_e(1,n) = ⌊log n⌋ + N - 1.

Up to now, there is no result on the minimum computation time to evaluate x^n [23]. Thus we shall settle for an approximation. For example, let n = 15. Then h_e(1,15) = ⌊log 15⌋ + 4 - 1 = 6. On the other hand, x^15 can be evaluated in fewer steps:

    x·x = x^2,  x^2·x = x^3,  x^3·x^2 = x^5,  (x^5)^2 = x^10,  x^10·x^5 = x^15.

This takes only 5 steps. For n ≤ 70, the lemma gives the correct values in more than 70% of the cases.
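As a quick illustration of the lemma's count (a sketch):

    def binary_method_steps(n):
        """floor(log2 n) + (number of one bits) - 1, per Lemma 3."""
        return n.bit_length() - 1 + bin(n).count("1") - 1

    print(binary_method_steps(15))   # 6, though x^15 is reachable in 5 steps,
                                     # so the lemma is only near minimal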
While we cannot give the definite answer for the sequential case, we can prove the following.

Theorem 3: The minimum number of steps to compute x^n on P(m) (m > 1), h_e(m,n), is h_e(m,n) = ⌈log n⌉.

Proofs of Lemma 3 and Theorem 3: Let α = ⌊log n⌋ for convenience, and let I_j be the coefficient of 2^j in the binary representation of n, so that n = Σ_j I_j·2^j. Write X_j = (x^{2^j})^{I_j}.

If m = 1, then x^n is computed as follows. We first compute all x^{2^i} (i = 1, 2, ..., α) in α steps. Then

    x^n = (X_0) × (X_1) × ... × (X_α),

and this computation takes N - 1 steps (note that if I_j = 0, then X_j = 1). Thus in total ⌊log n⌋ + N - 1 steps are needed.

If two PE's are used, then x^n is computed by the following two recursive equations, for k = 1, ..., α:

    t_k^(1) = (t_{k-1}^(1)) × (t_{k-1}^(1)),        t_0^(1) = x,
    t_k^(2) = (t_{k-1}^(2)) × (t_{k-1}^(1))^{I_{k-1}},   t_0^(2) = 1,

so that t_k^(1) = x^{2^k}, and finally x^n = (t_α^(2)) × (t_α^(1)) since I_α = 1. Two PE's are required for the simultaneous computation of t_k^(1) and t_k^(2). That the above process for P(2) is optimum is clear, because x^n cannot be computed in fewer than ⌊log n⌋ steps, and when n is not a power of two at least ⌊log n⌋ + 1 (= ⌈log n⌉) steps are required to compute x^n. (Q.E.D.)

From the above discussion, we have the following corollary.

Corollary: h_e(m,n) = h_e(2,n) for all m ≥ 2.

Now let us study simultaneous computation of all x^i (i = 1, 2, ..., n).

Theorem 4: The minimum number of steps, h_w(m,n), required for simultaneous evaluation of all x^i (i = 1, 2, ..., n) on P(m) is

    h_w(m,n) = (1) n - 1, if m = 1;
               (2) ⌊log m⌋ + 1 + ⌈(n - 2^{⌊log m⌋+1})/m⌉, if max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}) > m ≥ 2;
               (3) ⌈log n⌉, if m ≥ max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}).

Proof: (1) is obvious. (3) is illustrated in Figure 2.2. At the k-th step, the x^i (i = 2^{k-1}+1, 2^{k-1}+2, ..., 2^k) are computed using the results of earlier steps, e.g. x^{2^k} = x^{2^{k-1}} × x^{2^{k-1}}. The number of PE's required at this step is then 2^k - (2^{k-1} + 1) + 1 = 2^{k-1}.

Figure 2.2. Computation of x^i (1)

The maximum number of PE's required is the larger of n - 2^{⌈log n⌉-1} (the number of PE's required at the last step) and 2^{⌈log n⌉-2} (the number required at the (⌈log n⌉ - 1)-th step). This proves (3). Clearly this procedure is optimum in the sense that it gives the minimum computation time.

Next suppose that the number of PE's available, m, is less than max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}). First all x^i (1 ≤ i ≤ 2^{⌊log m⌋+1}) are computed in ⌊log m⌋ + 1 steps in the same manner as the above procedure.

Figure 2.3. Computation of x^i (2)

Now there are n - 2^{⌊log m⌋+1} powers x^i left (2^{⌊log m⌋+1} < i ≤ n) to be computed. This takes ⌈(n - 2^{⌊log m⌋+1})/m⌉ steps on P(m). Clearly at each step all necessary data to perform the operations are available. To show this, let us take two successive steps. Assume that the first step computes x^a, ..., x^b where b - a + 1 = m. Then the second step computes x^{b+1}, ..., x^{b+m}. Since b + m = 2b + 1 - a ≤ 2b (a ≥ 1), all inputs required at the second step are available from the first step. Thus in total ⌊log m⌋ + 1 + ⌈(n - 2^{⌊log m⌋+1})/m⌉ steps are required. This proves (2). (Q.E.D.)

Clearly for fixed n, m > m' implies that h_w(m,n) ≤ h_w(m',n). Thus we have:

Lemma 4: For fixed n, h_w(m,n) is a nonincreasing function of m.

We again call the reader's attention to the fact that the number of PE's required to compute all x^i in the minimum number of steps, i.e. ⌈log n⌉, is not necessarily max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}). For example, let n = 18. Then max(18 - 2^4, 2^3) = 8 and ⌈log 18⌉ = 5. Yet P(5) achieves the same result, i.e. h_w(5,18) = 5.
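The schedule in this proof is easy to replay; the following sketch checks the closed form against a greedy simulation in which each step produces at most m new powers, none exceeding twice the largest power already available:

    def h_w(m, n):
        """Closed form of Theorem 4 (n >= 3, m >= 2)."""
        q = (n - 1).bit_length()                  # ceil(log2 n)
        if m >= max(n - 2 ** (q - 1), 2 ** (q - 2)):
            return q
        c = 1 << m.bit_length()                   # 2 ** (floor(log2 m) + 1)
        return m.bit_length() + -(-(n - c) // m)

    def simulate_all_powers(m, n):
        have, steps = 1, 0                        # highest exponent so far
        while have < n:
            have += min(m, have)                  # at most m new powers
            steps += 1
        return steps

    for n in range(3, 80):
        for m in range(2, n):
            assert h_w(m, n) == simulate_all_powers(m, n), (m, n)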
Suppose that there are only m PE's where m < n ■■ 2 Q-l step\ p i P 2 P J •• P m riog nl - 1 r log nl a X a+ni X n X s- m left out a= 2 fl0g nl - X + 1 Figure 2.k. Computation of x (3) Then x(2°""+l+mn- (2 urc + m + l) + 1 i.e. n ,cr2 m = "(n - (2°^ + 1) -l)/2». Also since 2(2°~ 2 + m) > 2(2^ 2 + n/2 - 2 a " 3 ) = n + 2 a ~ 2 > n, 31 all inputs required at the cr th step are ready at the (a - l)-th step or earlier steps. (Q.E.D.) 2.k Computation of Polynomials In this section, we study polynomial computation. First we assume that there are arbitrarily many PE's. Then four schemes are studied and compared. Two of them are known as Estrin's method [16] and the k-th order Horner's rule [15]. Two new methods are also introduced. They are called a tree method (see Chapter 3) and a folding method. It is shown that if there are arbitrarily many PE's, then the folding method gives a faster computation time than any known method. Then we study the case where only a limited number of PE's are available. Because of the simplicity of scheduling, the k-th order Horner's rule is studied in detail. It is shown that on P(m) the m-th order Horner's rule does not necessarily guarantee the fastest computation, i.e., there is a case where the m'-th order Horner's rule (m' < m) gives a better result. Thus availability of more computational resources does not necessarily "speed up" the computation for a certain class of feasible parallel computation algorithms. . I . 1 "omputation of a Polynomial on an Arbitrary Size Machine Definition We write p (x) for a polynomial of degree n i \ n n-1 p ( x ) = a x + a ,x + . . . + a rt . r n' n n-1 32 2. if. 1.1 k-th Order Horner's Rule [15] The details will "be presented in Section 2.1J-.2. Theorem 5 shows that the minimum time required to compute p (x) by this method is n h P . = flog n"l + (log (n+l)~l + 1 mm 2.4.1.2 Estrin's Method fl5]["l6l We first compute C° = a + xa i = 0, 2, ..., 2 L n/2j i+1 Then successively compute c} = C° + x 2 C° ±+2 i = 0, k, ..., k^/kj C? = c] + xV , i = 0, 8, ..., 8 L n/8j ii i 44 m „m-l 2m^m-l . n _m+l m+1 /o m+l C. : C. + x C. +2m i = 0, 2 , ..., 2 (_n/2 J where m - [log nj and more over P n ** n _ 2 riog(n+l)l-l +1 Figure 2.6. Computation of a.x Thus, for example, two variables at the third step are reduced to 8 variables at the first step. Repeating this procedure we get o-l Z ( i=l o-l Z 'c i=l variables on the first level where a - 1og(n + if" . To add these n' numbers by the log sum method, it takes Z (2 1 ' 1 2 i ) + 2 a (n - 2 0rl + 1) + Z2 2i_1 + 2 a (n - 2 Qrl + 1) h = H nrr n"H mm log n' 1 steps. 35 2.U.1.U Folding Method This is 3. method which computes P n (x) in shorter time than any known method. Assume that p , (x) can be computed in h - 1 steps, p.(x) (t < i < s - 1) are computed in h steps and p (x) can be computed in h + 1 steps. Then we show that all p.(x) roof Steps Degree h - 1 ~ t - 1 h t ~ s - 1 h + 1 s ~ (s + t - 1) h + 2 s + t ~ (s < j < s + t - 1) can be computed in h + 1 steps and further p (x) can be computed in h + 2 steps. (1) First we show that p g+t _.(x) (l < ,j < t) can be computed in h + 1 steps. Figure 2.7- A Tree for p .(x) *s+t-j v 36 We write p .(x) as s+t-j v / v \ / t-J v s s-1 .,+. .(xj = (a. .x d + ... + a )x + a n x + =+t-j t+S-J s s-1 . . + a s = p t (x)x +P s . 1 (x). s Now we show that x can be computed in less than h steps. From Theorem 1 we know that x can be computed in flog nl g steps. Suppose that the computation of x takes longer than h, i.e. h < r io£ si . 
2.4.1.3 Tree Method

In the tree method, each term a_i x^i of p_n(x) is computed by a minimum height tree, and the n + 1 term trees are combined by the log sum method; the general construction is the subject of Chapter 3.

Figure 2.6. Computation of a_i x^i

Thus, for example, two variables combined at the third step correspond to 8 variables at the first level. Repeating this reduction, we obtain n' variables on the first level, where σ = ⌈log(n+1)⌉, and adding these n' numbers by the log sum method gives the step count of the tree method plotted in Figure 2.9.

2.4.1.4 Folding Method

This is a method which computes p_n(x) in shorter time than any known method. Assume that polynomials of degree up to t - 1 can be computed in h - 1 steps, that p_i(x) (t ≤ i ≤ s - 1) can be computed in h steps, and that p_s(x) can be computed in h + 1 steps:

    Steps     Degree
    h - 1     up to t - 1
    h         t to s - 1
    h + 1     s to s + t - 1
    h + 2     s + t

Then we show that all p_{s+t-j}(x) (1 ≤ j ≤ t), i.e. all polynomials of degree s to s + t - 1, can be computed in h + 1 steps, and further that p_{s+t}(x) can be computed in h + 2 steps.

(1) First we show that p_{s+t-j}(x) (1 ≤ j ≤ t) can be computed in h + 1 steps.

Figure 2.7. A Tree for p_{s+t-j}(x)

We write p_{s+t-j}(x) as

    p_{s+t-j}(x) = (a_{s+t-j} x^{t-j} + ... + a_s)·x^s + a_{s-1}x^{s-1} + ... + a_0 = q_{t-j}(x)·x^s + p_{s-1}(x),

where q_{t-j} is a polynomial of degree t - j ≤ t - 1. Now we show that x^s can be computed in fewer than h steps. From Theorem 3 we know that x^s can be computed in ⌈log s⌉ steps. Suppose that ⌈log s⌉ ≥ h. Then s > 2^{h-1}. On the other hand, p_{s-1}(x) takes h steps, and by the operation-count bound of Section 2.4.1.5, h ≥ ⌈log(2(s-1) + 1)⌉ = ⌈log(2s-1)⌉, so 2^h ≥ 2s - 1, or 2^{h-1} ≥ s. Thus s > 2^{h-1} ≥ s, a contradiction. Hence x^s can be computed in at most h - 1 steps.

From the assumption, q_{t-j}(x) can be computed in h - 1 steps. Hence q_{t-j}(x)·x^s takes at most h steps, and since p_{s-1}(x) can be computed in h steps, p_{s+t-j}(x) can be computed in h + 1 steps.

(2) Next we show that p_{s+t}(x) can be computed in h + 2 steps.

Figure 2.8. A Tree for p_{s+t}(x)

We write

    p_{s+t}(x) = (a_{s+t}x^{s-1} + ... + a_{t+1})·x^{t+1} + a_t x^t + ... + a_0 = q_{s-1}(x)·x^{t+1} + p_t(x).

From the previous argument we know that x^{t+1} can be computed in at most h steps (t + 1 ≤ s). Since q_{s-1}(x) and p_t(x) each take h steps, we can compute q_{s-1}(x)·x^{t+1} in at most h + 1 steps and p_{s+t}(x) in at most h + 2 steps. (Q.E.D.)

It is easy to check that p_2(x) takes 3 steps, that p_3(x) and p_4(x) can be computed in 4 steps, and that p_5(x) can be computed in 5 steps. By induction we obtain the following table for h = 3, 4, ..., 11.

    Minimum Steps     Degree of Polynomial
    3                 2
    4                 3 - 4
    5                 5 - 7
    6                 8 - 12
    7                 13 - 20
    8                 21 - 33
    9                 34 - 54
    10                55 - 88
    11                89 - 143

Table 2.3. Computation of p_n(x) by the Folding Method

For example, any polynomial of degree 34 through 54 takes 9 steps to compute. Note that the first numbers in the right column form a Fibonacci sequence.

2.4.1.5 Comparison of Four Methods

It has been proved that at least 2n operations are required to compute p_n(x). Proofs have appeared in several papers. We owe Ostrowski [34] and Motzkin the original works; Pan [35] summarized the results; an excellent review of the problem appears in [23]; and Winograd [43] generalized the results of Ostrowski and Motzkin.

Now assume that to compute p_n(x) in parallel, h steps are required. Theorem 5 gives h ≤ ⌈log n⌉ + ⌈log(n+1)⌉ + 1 ≤ 2⌈log(n+1)⌉ + 1. Also, since at most 2^k - 1 operations can be performed in a parallel computation tree of height k, we have 2^h - 1 ≥ 2n, or h ≥ ⌈log(2n+1)⌉. Thus

    2⌈log(n+1)⌉ + 1 ≥ h ≥ ⌈log(2n+1)⌉.

In Figure 2.9, these upper and lower limits are plotted together with the results from the previous sections. It is clear that the folding method is the best in terms of computational speed. It is yet an open question whether there is a better method.

Figure 2.9. Comparison of the Four Parallel Polynomial Computation Schemes

2.4.2 Polynomial Computation by the k-th Order Horner's Rule

Now let us study computation of a polynomial by the k-th order Horner's rule, as shown by the following procedure. We use k PE's. First we compute all x^i (1 ≤ i ≤ k) simultaneously. Then we compute k polynomials p^(i)(x) on the k PE's simultaneously, where

    p^(i)(x) = a_i + x^k(a_{i+k} + x^k(a_{i+2k} + ...)),   0 ≤ i ≤ k - 1.

Then we get k partial results x^i·p^(i)(x), which are added to get p_n(x). Figure 2.10 illustrates this. This scheduling may not be the best, yet it is easy to implement and is adaptable to any k.

Figure 2.10. k-th Order Horner's Rule

Theorem 5: The minimum number of steps, h_p(m,n), required to compute a degree n polynomial p_n(x) on P(m) by the m-th order Horner's rule is

    h_p(m,n) = (1) 2n, if m = 1;
               (2) 2⌈log m⌉ + 2⌊n/m⌋ + 1, if n + 1 > m ≥ 2;
               (3) ⌈log n⌉ + ⌈log(n+1)⌉ + 1, if m ≥ n + 1.
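Before the proof, the following sketch tabulates the count of Theorem 5 (powers of x in ⌈log m⌉ steps, m interleaved Horner chains in 2⌊n/m⌋ steps, then the multiplications by x^i and a log sum in ⌈log m⌉ + 1 steps) and shows the non-monotone behavior discussed below:

    def h_p(m, n):
        """Steps of the m-th order Horner's rule (Theorem 5)."""
        if m == 1:
            return 2 * n
        if m >= n + 1:
            return (n - 1).bit_length() + n.bit_length() + 1
        return 2 * (m - 1).bit_length() + 2 * (n // m) + 1

    print([h_p(m, 15) for m in range(1, 17)])
    # h_p(4, 15) = 11 but h_p(5, 15) = 13: more PE's can mean more steps.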
Proof: It is enough to show that

    2⌈log m⌉ + 2⌊n/m⌋ + 1 ≥ 2⌈log(n+1)⌉ + 1,

or

    ⌈log m⌉ + ⌊n/m⌋ ≥ ⌈log(n+1)⌉ for m ≤ n,

because 2⌈log(n+1)⌉ ≥ ⌈log n⌉ + ⌈log(n+1)⌉, and if m ≥ n + 1, then by Theorem 5(3) h_p(n+1, n) = h_p,min.

Assume that n = 2^g + t (1 ≤ t ≤ 2^g) and m = 2^k + s (1 ≤ s ≤ 2^k). Then g ≥ k; if g = k, then t > s since n > m.

(1) g = k and t > s. Then ⌈log m⌉ = k + 1 = g + 1 and ⌊n/m⌋ ≥ 1. Hence

    ⌈log m⌉ + ⌊n/m⌋ ≥ g + 2 > ⌈log(n+1)⌉.

(2) g > k. There are two subcases: (i) 1 ≤ t < 2^g or (ii) t = 2^g.
(i) 1 ≤ t < 2^g. Then ⌈log(n+1)⌉ = g + 1, ⌈log m⌉ = k + 1 and

    ⌊n/m⌋ = ⌊(2^g + t)/(2^k + s)⌋ ≥ 2^{g-(k+1)}.

Hence ⌈log m⌉ + ⌊n/m⌋ ≥ k + 1 + 2^{g-(k+1)}. Now we show that for all k < g,

    f(k) = k + 1 + 2^{g-(k+1)} ≥ g + 1.

Treating k as a continuous variable, df/dk = 1 - (log_e 2)·2^{g-(k+1)}, and the minimum of f over 0 ≤ k < g is g + 1. Hence f(k) ≥ g + 1, or ⌈log m⌉ + ⌊n/m⌋ ≥ g + 1 = ⌈log(n+1)⌉.
(ii) t = 2^g. Similarly to the above, we can show that ⌈log m⌉ + ⌊n/m⌋ ≥ ⌈log(n+1)⌉. The details are omitted. (Q.E.D.)

Unlike h_a(m,n) and h_w(m,n), h_p(m,n) is not necessarily a nonincreasing function of m. (A few curves in Figure 2.11 illustrate this.) Therefore it becomes important to choose an appropriate m for a given n to compute in an optimum way.

Figure 2.11. The Number of Steps, h_p(m,n), to Compute p_n(x) on P(m) by the m-th Order Horner's Rule

Theorem 6: Given an n-th degree polynomial, let M be the smallest m such that h_p(m,n) = ⌈log n⌉ + ⌈log(n+1)⌉ + 1, where h_p(m,n) = 2⌈log m⌉ + 2⌊n/m⌋ + 1. Then

    M = (1) n + 1, if n = 2^g;
        (2) ⌈(n+1)/3⌉, if 2^g < n < 2^g + 2^{g-1};
        (3) ⌈(n+1)/2⌉, if 2^g + 2^{g-1} ≤ n < 2^{g+1},

where g = ⌊log n⌋.

Proof: The proof is given for each case independently.

(1) n = 2^g. The proof is divided into two parts. First, it is clear that if we have n + 1 PE's, then p_n(x) can be computed in h_p,min steps (see Theorem 5). Next we show that if the number of PE's, m, is less than or equal to M - 1 (m ≤ n), then h_p,min < h_p(m,n), where

    h_p,min = ⌈log n⌉ + ⌈log(n+1)⌉ + 1 = 2g + 2.

Let m = 2^k + p (k < g, 1 ≤ p ≤ 2^k). Then

    h_p(m,n) = 2k + 3 + 2⌊2^g/(2^k + p)⌋.                 (18)

Let P = 2^{g-k}·p/(2^k + p), so that 2^g/(2^k + p) = 2^{g-k} - P. P is increasing in p, so

    max P = 2^{g-k-1} for 1 ≤ p ≤ 2^k.

From this and Eq. (18), we get

    h_p(m,n) ≥ 2k + 3 + 2⌊2^{g-k} - 2^{g-k-1}⌋ = 2k + 3 + 2^{g-k}.   (19)

Now let f(k) = (2k + 3 + 2^{g-k}) - (2g + 2) = 2(k - g) + 1 + 2^{g-k}. Since 2^a + 1 - 2a > 0 for a ≥ 1, we have f(k) > 0 for all k < g. Since h_p(m,n) - (2g + 2) ≥ f(k) > 0, h_p(m,n) > 2g + 2 = h_p,min. This proves (1).

Now since ⌈log n⌉ = ⌈log(n+1)⌉ when n ≠ 2^g for any g, we use

    h_p,min = 2⌈log(n+1)⌉ + 1

to prove (2) and (3). Then it is enough to show that M is the smallest m with

    ⌈log(n+1)⌉ = ⌈log m⌉ + ⌊n/m⌋.

Since 2^g < n < 2^{g+1} for (2) and (3), we have ⌈log(n+1)⌉ = g + 1. By direct computation, the theorem can be shown to hold for n ≤ 10 (see Table 2.4).

Table 2.4. The Number of Steps Required to Compute p_n(x), h_p(m,n), for n ≤ 10. (Case (1) is 2^g < n ≤ 2^g + 2^{g-1} and case (2) is 2^g + 2^{g-1} < n ≤ 2^{g+1}, where g = ⌊log n⌋; n is the degree of the polynomial, m the number of PE's, and M the minimum number of PE's.)

(2) 2^g < n < 2^g + 2^{g-1}. We show that h_p,min = 2⌈log M⌉ + 2⌊n/M⌋ + 1 by first showing that ⌈log M⌉ + ⌊n/M⌋ = g + 1. Since 2^g < n < 2^g + 2^{g-1}, we have

    2^{g-2} < (n+1)/3 ≤ 2^{g-1}                           (20)

and

    ⌈log M⌉ = ⌈log⌈(n+1)/3⌉⌉ = g - 1.                     (21)

Now let ⌈(n+1)/3⌉ = c (c ≥ 3, as we may assume). Then n + 1 = 3c - p (p ≤ 2), or

    n = 3c - p - 1.                                       (22)

Using this and the relation 0 ≤ (p+1)/c < 1, we have

    ⌊n/M⌋ = 2.                                            (23)

Thus from Eq. (21) and (23),

    ⌈log M⌉ + ⌊n/M⌋ = g + 1 = ⌈log(n+1)⌉,

or h_p(M,n) = h_p,min.

Next we show that if m' < ⌈(n+1)/3⌉, then ⌈log m'⌉ + ⌊n/m'⌋ > g + 1 = ⌈log(n+1)⌉ (or, equivalently, h_p(m',n) > h_p,min).
We have two cases: (i) 2^i ≤ m' < 2^{i+1} where i + 1 ≤ g - 2, and (ii) 2^{g-2} ≤ m' < ⌈(n+1)/3⌉.

(i) 2^i ≤ m' < 2^{i+1}. Since i + 1 ≤ g - 2, write g - 2 = i + 1 + j (j ≥ 0). Then ⌈log m'⌉ ≤ i + 1 = g - 2 - j, and since 2^g < n < 2^g + 2^{g-1},

    ⌊n/m'⌋ ≥ 2^{g-i-1} = 2^{2+j}.

Thus

    ⌈log m'⌉ + ⌊n/m'⌋ ≥ (g - 2 - j) + 2^{2+j} ≥ g + 1 = ⌈log(n+1)⌉,

because 2^a ≥ a + 1 if a ≥ 1.

(ii) 2^{g-2} ≤ m' < ⌈(n+1)/3⌉. Let us write m' = ⌈(n+1)/3⌉ - q = c - q (q ≥ 1). Then by Eq. (20) and (22), we get

    ⌈log m'⌉ + ⌊n/m'⌋ ≥ (g - 1) + ⌊(3c - p - 1)/(c - q)⌋ ≥ (g - 1) + 3 > g + 1 = ⌈log(n+1)⌉,

because q ≥ 1, p ≤ 2 and hence 3q - p - 1 ≥ 0. This ends the proof for (2).

(3) The case 2^g + 2^{g-1} ≤ n < 2^{g+1} can be proved in a similar manner, and the details are omitted. (Q.E.D.)

It should be noted that the function h_p(m,n) is not a nonincreasing function of m even for m ≤ M, for some n. However, if n ≤ 50, then for more than 70% of the cases h_p(m,n) turns out to be a nonincreasing function. (The only cases where h_p(m,n) is not nonincreasing are n = 15, 27, 28, 29, 30, 31, 36, 37, 38, 39, 45, 46, and 47; in any case, h_p(m,n) increases by at most one.)

Figure 2.12. The Minimum Number, M, of PE's Required to Compute p_n(x) in the Minimum Time

3. TREE HEIGHT REDUCTION ALGORITHM

3.1 Introduction

In this chapter, recognition of parallelism within an arithmetic statement or a block of statements is discussed. There are several existing algorithms which produce a syntactic tree to achieve this end. The tree is such that operations on the same level can be done in parallel. Among them, the algorithm by Baer and Bovet [6] is claimed to give the best result. For example, the statement a + b + c + d×e×f + g + h can be computed in four steps by their algorithm (Figure 3.1).

Figure 3.1. An Arithmetic Expression Tree (1)

The algorithm reorders some terms in a statement to decrease tree height. However, it does not always take advantage of distribution of multiplication over addition. The arithmetic expression a(bcd + e) takes four steps as it is, whereas the equivalent distributed expression abcd + ae requires only three steps. A further example is Horner's rule. To compute a polynomial

    p_n(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n,                 (1)

Horner's rule

    p_n(x) = a_0 + x(a_1 + x(a_2 + ... x(a_{n-1} + x·a_n) ...))     (2)

gives a good result for serial machines. However, if a parallel machine is to be used, (1) gives a better result than (2). Namely, if we apply Baer and Bovet's algorithm [6] to (2), we get 2n steps, whereas (1) requires only 2⌈log₂(n+1)⌉ steps (see Chapter 2). Thus it is desirable for the compiler to be able to obtain (1) from (2) by distributing multiplications over additions properly. An algorithm to distribute multiplications properly over additions to obtain more parallelism (henceforth called the distribution algorithm or the tree height reduction algorithm) is discussed now.
3.2 Tree Height and Distribution

Definition 1: An arithmetic expression A consists of additions, multiplications, and possibly parentheses. We assume that addition and multiplication require the same amount of time (see Chapter 1). Subtractions and divisions will be introduced later. Small letters (a, b, c, ...), possibly with subscripts, denote single variables. Upper case letters and t, possibly with subscripts, denote arbitrary arithmetic expressions, including single variables; t is used to single out particular subexpressions, i.e. terms.

Then A can always be written as either (1) A = Σ_{i=1}^{n} t_i or (2) A = Π_{i=1}^{n} (t_i); e.g. A = abc + d(ef+g) = t_1 + t_2 where t_1 = abc and t_2 = d(ef+g), or A = (a+b)c(de+f) = (t_1)t_2(t_3) where t_1 = a + b, t_2 = c and t_3 = de + f. Note that when we write A = Π_{i=1}^{n}(t_i), we implicitly assume that for each i, t_i = Σ_{j=1}^{n(i)} t_j^i.

h[A] denotes the height of a tree T for A which is of minimum height among all possible trees for A in its presented form. A minimum height tree (henceforth by a tree we mean a minimum height tree) for A, T[A], is built as follows [6].

Let us assume that A = Σ_{i=1}^{n} t_i or A = Π_{i=1}^{n} (t_i), and that for each i a minimum height tree T[t_i] has been built. Then first we choose two trees, say T[t_p] and T[t_q], each of whose height is smaller than the height of any other tree. We combine these two trees and replace them by a new tree whose height is one greater than max(h[t_p], h[t_q]). This procedure is repeated until all trees are combined into one tree, which is T[A]. (Figures are drawn to scale as much as possible.)

The procedure is formalized as follows:

(1) Let ST = {(1), (2), (3), ..., (n)} and let h'[i] = h[t_i] for all i ∈ ST.
(2) Choose two elements of ST, say p and q, such that h'[p], h'[q] ≤ M, where M = min{h'[u]} for all u ∈ ST - {p, q}.*
(3) Now let ST = (ST - {p, q}) ∪ {(p, q)} and h'[(p, q)] = max{h'[p], h'[q]} + 1.
(4) If |ST| = 1, then stop; else go to Step 2.

*If there are many choices, then choose those subtrees with smaller #(Sh[t]) values first (see Definition 8).

After we apply the above procedure to A, we get e.g. ST = {((((1)(2))(3))((4)(5)))}, where a pair (ab) indicates that the trees corresponding to a and b are to be combined. Thus in this case the tree which combines (((1)(2))(3)) with ((4)(5)) is a minimum height tree of A. In general, the procedure is applied from the lowest parenthesis level (see Definition 3) to the higher parenthesis levels.
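The procedure is easy to state in executable form (a sketch; the heap stands in for the set ST and carries only the heights, which by the discussion above is all that matters):

    import heapq

    def combine_heights(heights):
        """Repeatedly merge the two lowest trees; returns h[A]."""
        heap = list(heights)
        heapq.heapify(heap)
        while len(heap) > 1:
            p = heapq.heappop(heap)
            q = heapq.heappop(heap)
            heapq.heappush(heap, max(p, q) + 1)
        return heap[0]

    # five terms of height 0 (single variables) give height 3, as in the
    # tree ((((1)(2))(3))((4)(5))) obtained above
    print(combine_heights([0, 0, 0, 0, 0]))   # 3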
Then we can replace each t. by a product of i=l x x e[t.] single variables without affecting the total tree height h[A]. (Note that each t. must be computed before the summation over t. is e[t.] p L i J taken.) Thus A becomes A' = Z tt a . and h[A] = MA']. Let us i=l j=l eft.] 1 call a tree for tt a. a subtree. Then a tree for A' is built ,1=1 J using subtrees in the increasing order of their heights. Since a binary tree of height h cannot accomodate more than 2 leaves, we have 2 h[A'] > Z e[ t.l > 2 " i=l X h[A']-l or h[A'] = h[A] = log. P Z e[ t ] i=l (3) can be proved in a similar manner. (Q.E.D.) 58 Definition 2 : The additive length a and the multiplicative length m of an arithmetic expression A is defined as follows: P (2.1) If A = ir a.., then i=l X (i) a[A] = e[A] and (ii) m[ A] = p. P (2.2) If A - Z t., then i=l X P (i) a[A] = I e[t.] and i=l x (ii) m[A] = e[A]. P I (2.3) If A = tt a. x 7r (t.), then i=l x d=l J (i) a[A] = e[A] and (ii) m[A] = p + I e[t ]. 0=1 J It is to be noted that (!) £[AI |a[A]l 2 '.m[A]j - (2) h[A] = log p |a[A] i = log m[A] I ; d ! * 2 ' '2 compare this definition with Theorem 1. Definition 3 : The level £ of a parenthesis pair in an arithmetic expression is defined as follows : 59 First we start numbering parentheses at the left of the formula, pro- ceeding from left to right, counting each left parenthesis, (, as +1 and each right parenthesis,), as -1 and adding as we go. We call the maximum number m the depth of parentheses. Now the level 1 of each parenthesis pair is obtained as I = p, where p is the count for each parenthesis. The arithmetic expressions enclosed by the level I parenthesis pair are called the level I arithmetic I expressions, A • Also for convenience we assume that there is an outermost parenthesis pair which encloses A. Example 2 A = 123 3 3 3 21 (ab((cd + e)(f + g)+k)) 3 3 2 Now several lemmas are in order. Lemma 1 n n Let A = T t. or A = ir (t.). Also let A 1 = t, + t. + . , l . , l 12 1=1 1=1 + t. ' + l ... + t or A' = (t. ) x (t) x ... x (t. ') x ... x (t ), and A" = A + t . or A" = n 12' l n" n+1 A x (t n+1 ). Then 6o (i) h[A'] >h[A] if h[t.'] > h[t ± ] (ii) h[A"] > h[A]. Proof: Obvious from Theorem 1. What Lemma 1 implies is that the height of the tree for an arithmetic expression is a non-decreasing function of term heights, and the number of terms involved. In an arithmetic expression, there are four possible ways of parenthesis occurence: P x ) ... + (A) + ... P_) ... 6(t n x t_ ... x t ) x (t ' x t ' ... x t ') 9 ... 2 12 n 1 2 m P 2 ) . .. a x a x ... a x (A) 8 . . . 3 12 n p u ) ... e(t 1 + t 2 + ... + t n ) x (t^ + 1 2 ' + ... + t m ») e ... where 6 represents +, x, or no operation. Lemma 2 : Let D - B + ( A) + C and D n = B+(t n x ... xt ) x(t' x ... xt ') +C. 1 1 n 1 m Also let D = B + A + C and D n =B+t n x ...xt xt' x ... xt ' + C. Then 11 n 1 m r\ d h[D] >h[D] and h[ D ] > h[ D ] . Proof : Obvious Prom Theorem 1. 6l As an example, let D=(a+b+c)+d and D = (abc)(defgh) . Then A A A D = a + b + c + d and DJ = abcdefgh, h[ D] = 3 > h[D ] = 2 and h[D ] = k > h[Dj] = 3- Lemma > Let D =| Z t .|| Z t . ] and D d = t, t , ' • t, t n ' ■ ... +tt' ... t t ,1=1 *A.1-1 « Then h[D d ] > h[D]. Proof : n m Let D = (A)(B) where A = Z t. and B = St.'. Also without losing i=l x 1=1 x nerality, assume that h[A] > h[ B] . Then h[D] = h[A] + 1. For each j, let d. = J t.' t n +t.' t„ + ... + t.' t. It is clear that h[ t . ' t.] > h[t.l for all i and j 1 2 j n j i J - i J d m d Thus from Lemma 1, we have h[d.] > h[A] for all j. 
Since D = Z d., h[D ] > 3 0=1 J min(h[d ]) + log J ml > h[A} + log 2 rml 2 , or since h[D] = h[A] + 1, h[ D ] > h[ D] . (Q.E.D.) Note that the above lemma does not imply necessarily that if D = ft \ H" H" Zt.HB), and D = t (B) + t (B) + ... + t (B), then h[ D ] > h[D]. Actually, i=l 7 n it can be shown that there is a case when h[ D] > hf D ] . What Lemma 3 says is that D = (A)(B) should not be fully distributed, but partial distribution, as in D , may be done in some cases. 62 Lemmas 2 and 3 together indicate that distribution in case (P^) and partial distribution for case (P, ) are the only cases which should be considered for lowering tree height. In casos (P ) and (Pg), removal of parenthesis leads to a better result or at least gives the same tree height. Full distribution in case (P. ) always increases tree height and should not be done. Also it should be clear that in any case tree height of an arithmetic expression can not be lower than that of a component term even after A \ ■ d distributions are done. For example, let D = t(A) = t x J F t J and D = n F ft v t.). Then from Lemma 1, we have h[tt.] > h[t ] for each i. Thus i=l h[r> d ] > h[A]. The same argument holds for all four cases. This assures that evaluation of distributions can be done locally. That is, if some distribution increases tree height for a term then that distribution should not be performed because once tree height is increased, it can never be remedied by further distributions. Actually, there are two cases where distribution pays. For example, if A = a(bcd + e), then h r A] = k. However, if we distribute a, then we get A = abed + ae and h r A ] 3- The idea is to balance a tree by filling the "holes" because a balanced tree can accommodate the largest number of variables among equal height trees. The situation is, however, not totally trivial, because by distribution, the number of variables in an expression is also increased. Next let A = a(bc + d) +e=t+e and A = abc + ad + e = t + e. In this case h[A] = k but h[A ] = h[ t ] =3- What happened here is that t is "opened" by distributing a over (be + d) and the "space" to put e in is created. 63 At each level of parenthesis pair, cases (P3) and (P^), i.e., instances of "holes" and "spaces", are checked and proper distribution is performed. Next we give definitions of holes and spaces, and formalize these ideas. 3-3 Holes and Spaces 3-3«l Introduction Before we proceed further, let us study trees for arithmetic expressions more carefully. P Let A = Z t.. By Definition 1 we first build minimum height trees T[t.] i=l 1 x for all i, and T[AJ is built by combining these T[t.]. Once T[t.] is built the details of t. do not matter, and the only thing that matters is its height h[t.]. Suppose T[t.] and T[t.] are combined to build T[A]. Assume also that hrt.l = hrt.l + s. Then we will get s nodes to which no trees are attached other l 3 than T[t.]. We call these free nodes whose heights are h[t.] +1, h[t.] +2, ..., J J J h r t.]. i T[t.: .h. Free Nodes in a Tree Free nodes in a tree T P it (t.) L i=l are defined similarly. Let us emphasize that once we get Tft.] we treat it as a whole and do not care about its details when we build T[A]. That is, when we consider free nodes in T[A] we mean free nodes "in" T[A] but "outside" of Tft.]. For example let A = (a+b)(cde+f) = (\){t ). Then T[A] d e 65 a and 3 are free nodes in T[A] while y and 8 are free nodes in T[t ] and not in T[A]. Now suppose there are m free nodes in T[A]. We number them arbitrarily from 1 to m. 
Also let us denote the height of a free node a '»y h[al« Given a free node q- whose height is hi" a] in T [A = I t.] (or T [A - tt( t . ) ] ), by definition we can attach a tree T[t] whose height is h[a]-l (or whose effective length is 2 ^ aJ " ) to a without affecting the height of A: Definition k: For A = 7r(t.) or A = Zt . , we build a tree. Then l i (k.l) define F [A] to be a set of all free nodes in T[A], and (U.2) for each i define F [A, t.] to be a set of all free nodes which r\ 1 exist between the roots of T[A] and T[ t . ] , i.e. the free nodes which we encounter when we traverse T[A] down to the root of T[t.]. For example let us consider the following tree (see Figure J. 5)- 66 Figure J5-5- An Example of F and F Then F A [A] = fa, 3, 7, 6, e] and F R [A, t ± ] = fa, 3, y), F R [A, t g ] = { c& 3} etc. Lemma h : Suppose h[ a] = h[ 3] for some free nodes a and 3 in T[A]. Then without changing the tree height h[A], we can replace two free nodes a and 3 hy one free node 3' whose height is h[a]+l. Proof: original T[A] Figure 3-6. Elimination of a Free Node 67 modified T[A] N Figure 3- 6- (continued) We can combine subtrees 1 and 3, and hence eliminate free nodes a and 3 and create a new free node 3' (see Figure 3-6). (Q.E.D.) Given F.[A], two free nodes a and (3 of equal height can be replaced by one free node 3 1 whose height is h[ a] + 1« Repeating this procedure finally we get a new F '[A] in which no two free nodes have the same height. Let y and 8 be free nodes in F.[A] and F '[A] respectively. Assume that for all free nodes a in F [A] (or F '[A]), h[ 7] > h[ a] (or h[61 > h[ a] ) . Then obviously h[ 7] leSh[t ]. In this case e[ f ] < le Z b(H [A"] ) . Then there is a free node a in F A [ A 1 sucn that u = 2 and e[t'] < u means that T[t'] can be attached to a with- out increasing tree height. Hence A can be multiplied by t' without increasing tree height. P (2) A = I t . i=l We show that if e[t'] < le Sh[A] then t can be distributed over A without increasing tree height. # - d We write (ft) for the expression obtained after distributing f over t such that the tree height is deduced, e.g. (a, b+cd) = ab + acd. 7^ First note that for all i, if a€F [A,t.] then 2 h ^ a ^ _1 > le Sh[t.] Now assume that e[t'] < le Sh[A]. For fixed k, that u = le Sh[A] implies either ( i ) u < le Sh[ t k ] or (ii) there is a in F_,[A, t, ] such that u < 2 "- a ^ (or equivalently u < le Ib(H [A,t ])). In the first case we have h[ t ] = h[(f t ) ] by assumption and h[A] - h[ 7 t + (ft ) d ]. i/k X k In the second case we attach T[ f ] to a (Figure 2.8(a)) without increasing the tree height h[A], i.e. h[A] = h\ T t + (f ) x (t )]. i/k x * In general let 1" be a subtree in T[A] whose root is a (Figure 3.8(b)). Then \ T* (a) (b) Figure 3-8. Attachment of T[ f ] to a Free Node 75 there may be other term trees besides T[ t ] in T", e.g. T[t.] and T[t ] in Figure 3-8(b). Hence we get, h[A] = h[ L t. + (f ) x (t^+t.+tj^)] i#c,j,h in this case. Note that a is also in F [A, t.] and TJA, t ]. a j a n. Repeating this procedure for all k we can get an arithmetic expres- P sion equivalent to X (t 1 ) x (t.) or (t')(A) without increasing i=l x tree height. This proves (2). It is obvious that in both (l) and (2) if e[t'] > le Sh[A], then t' can not. be distributed (multiplied) over A without increasing tree height. (Q.E.D.) Lemma 6 : Let A = F t. and e[t»] < le Sh[A]. 
Then after t' is distributed over A, we have Sh[ (t'A) d ] = (Sh[A] del (u} ) uni T b(u-e[t'] ) uni Sh[t'] where u is the smallest element in Sh[A] bigger than e[t']. Now let us summarize what we have so far. Let A and t' be arithmetic expressions where A = T t. and t' = 7t'.. If e[t'] h[(t*A) = hfA]. A set of holes in (t'A) is given by the above lemma. Since it is obvious that le Sh[ A] < e[A], we have the following lemma. 76 Lemma J : Let f = Zt.'. Then h[(t')(A)] > h[(t'A) d ] implies that h[t'] < h[A]. q In general let t' = Ti(t.'). For convenience let us assume that e[t. ' ] < e[t '] < e[t '] < ... < e[t ']. Then if the following procedure can be - 2 - 3 ~ - q accomplished successfully, we say that A has holes to accomodate all t.. ' (i=l, 2,...,q). # Procedure: i (1) Let V = (t i '(t i _ 1 *,...,(t 1 , A) d ) d ...) d and V Q = A. Let i = 1. (2) Check if e[t.'] < le Sh[V. .]. i l-l (3) If so, then distribute t. ' over V. , and we have V.. (k) Evaluate Sh[V.]. (5) If i=q then stop, else let i = i + 1 and go to (l). The procedure may be accomplished successfully if m[t'] < #(Sh[A]). 3- 3. 3 Space Now the second possible distribution case, i.e. space is studied. The idea of the second distribution case is that given an arithmetic expression D of the following form D = ... 9 (f) x (A) + t e ... 1 s distribute t' over A so that t can be hidden under the combined tree as shown s in Figure 3.9* We assume that each t. ' does not have any holes, i.e. Sh[t.'] = for all i. Hence Sh[(t*A) d ] = (Sh[A] del { u }) uni b(u-e[t']) for example . 77 t'^ t't 2 (a) A A A A ft (b) Figure 3.9. An Example of Space (l) In other words, in case of D, the addition +, cannot be done before (f )(A) (we write f (A)) is computed while in case of D it may be done earlier. Note that if h[f (A)] > h[(f A) ] then A has enough holes to accomodate f and the distribution of f over A is done anyway. Henceforth throughout the rest of this section we assume that A does not have any holes to accomodate f . Thus we now deal with the case when h[f (A)] < h[(f A) ]. However, if h[f (A)] < h[(f A) ] holds, then clearly there is no way to 78 get h[J) ] < h[ D] by Lemma 1. Thus we have: Lemma 8: Proof: (1) h[t* (A)] = h[(t*A) d ] must hold to get h[D d ] < h[ D] . n _ d (2) Let A = it.. Then h[t'(A)] > h[(t'A) ] if and only if e[t'] < i=l X e[A] (i.e. h[t f ] < h[A]). This implies that h[t'(A)] = h[A] + 1. By inspection. P Intuitively the space Sp in an arithmetic expression A = it. with i=l 1 respect to t' is defined as P Sp[A,f] - e[(A) x (t')] - I e[(f) x (t )]. i=l For example let A = ab + c and t* = d. Then S [A, t'] = 2. free node (a) (b) Figure 3.10. An Example of Space (2) Let D = t'(A) + t - (t'A) + t. Note that a free node in T[t»(A)] cannot be used to attach T[ t] while a free node in T[(t'A) ] may be used to attach T[t]. Now the formal definition of space follows. 79 Definition 9 : p q Given arithmetic expressions A = Ft. and t' = Ft. ' , the space function i=l i=l Sp of A w.r.t. t' is defined as follows. First we build trees T[A] and T[t'], and in T[A] let F be a set of free nodes f higher than T[t'] (h[f] > h[t']). Also we define a set I as follows. We let iel if h[t.] > h[t'] and e[ V ] < le Sh[ t. ] . Now the space function is obtained as : h[f] h ^V Sp[A,f] = Z 2 L J + T 2 fe F i€ I To show how Definition 9 works, we first describe how to build T[(t'A) ] by attaching T[t'] to T[A] properly (i.e. by distributing t' over A properly). 
Since h[(t'A) ] = h[A] +1 (Lemma 8(1)), we first study to build T'[A] which is obtained by replacing T[t.] in T[A] by T'[t.] whose height is h[t.] + 1. Then the height of T'[A] is h[A] + 1. Building T[(t'A) ] from T[A] may be explained in an analogous way. As stated before the only case to be considered is when h[t'(Aj] = h[(t'A) d ] = h[A] +1 holds (Lemma 8(1 )). Suppose that all T[t.] in T[A] are re- placed by T'[t.] whose height is h[t.] + 1. Then the new tree T'[A], whose height is h[A] + 1, is obtained. Note that a free node a in T[A] now becomes a free node a' in T'[A] with height h[a] + 1. In T'[A] if T'[t.] is replaced by T[t.] again, then a new free node 3', whose height is h[t.] +1, is created. Having these facts in mind, now we describe the way t' is distributed over A to create space. The tree T[(t'A) d ] is built from T[A] as follows (note h[(t'A) d ] = h[A] +1). Depending on the height of T[t.], we have two cases. J 8o (1) h[t.] > h[f ]. If T[t.] has a hole to put T[t'], then we fill it by T[t']. In J this case h[t'(t.)] = h[t J and a new free node 3' whose height is h[t.] + 1 is created in T[(t'A) ]. Otherwise h[t'(t.)] = h[t.] J J J + 1. (2) h[t.] < h[t']. J Find the tree T[Zt ] whose height is h[t'] and which includes s T[t.]. We multiply Zt by t' and get h[t*(Zt )] = h[Zt ] + 1. J s s s Note that t. and Zt are treated as terms of A. In the resultant tree T[(t*A) ], j s J ' those free nodes in T[A] whose heights are less than or equal to h[t T ] (i.e. free nodes in T[Zt ] ) do not appear, s (a) (b) Figure 3-H- Distribution of t' over A (c) .m*- A free node a in T[A] (h[a] > h[t']) appears in T[(t*A) ] as a free node a' where h[a' ] = h[a] + 1. Thus T[(t'A) ] has those a' and 3' described in (l) as free nodes. 81 If 6' is a free node in T[ (t f A) d ], then a tree T[t] (h[t] h[t']). Also let I be a set such that i is in I if h[t.] >h[t'] and T[t.] has enough holes to put all T[t.'] (i.e. a[t'] < /Sh[t.])). Then Sp[A,t f ] = feF iel Informally we say that space to put t is created by distributing t' over A if h[t'(A) + t] > h[(t'A) + t] . Then the procedure is called space filling. Now we study how much we can reduce tree height by space filling. Let B = Z't.')(A.) + Zt . and assume that by distributing t. ' over A. space can li. j l l i J be created for all i. Lemma 9 " Let B = Z(t. ')(A.) + It. and B d = l(t. 'A. ) d + Ft.. Then h[B d ] = h[ Bl .11 .3 .11 .j LJ LJ i i 3 1 if l!3p[A.,t '] >a[B] - e[B]/2, otherwise h[ B d ] = h[ B] . i 82 Proof : First note that to lower the height of a tree for B, some terms must be removed from B so that an effective length of a resultant expression becomes e[B]/2. Hence ZSp[A. ,t.'] must be greater than or equal to a[B] - e[B]/2. r d n Next we show that h|_B J cannot be smaller than h[B] - 1. As before we assume that h[t i *(A.)]= h[(t.'A.) d ] for all i (see lemma 8(1)). It is equivalent to ( — erB]/2 -a[B] — Figure 3-12. Tree Height Reduction by Hole Creation the assumption that Sp[A.,t.'] < e[t.'(A. )]/2 = e[(t.'A. ) ]/2. First we get B ! from B as follows. We replace every (t.'A.) in B by a product P. of e[(t 'A ) ]/2 single variables. This amounts to assuming that Sp[A. ,t '] = i i ii e[(t 'A ) ]/2. Futher we get B" from B' by replacing every t. in B' by a i i product Q. of e[t.]/2 single variables. Then it is clear that h[B] > h[B ] > J J h[B'] > h[B"] = h[B] - 1. If ZSp[A.,t.*] > a[B] - e[B]/2, then h[B] > h[B d ]. Hence h[B d ] = h[B] - 1. (Q.E.D.) 
What the lemma implies is that by space filling we can lower tree height at most by one . In other words to see if the space creation by distri- bution is effective it is enough to see if total tree height can be lowered by one, and we know if tree height is once lowered by one it is not necessary 83 (i.e. useless) to try to lower tree height further by creating more space by further distribution. Unlike a set of holes (Theorem 2), the space function for A = 2t. does not carry any information about space in components of A, t.. For example let B = a(b(c+defg) + irl6) + Trl6 + irk = a(A) + tt16 + irk where Ti is a product of i single variables. Then h[ B] = 7- Now Sh[A] = and space creation is tried. Note that SpfA.a] = 16 < a[ B] - e[B]/2 = 20, but Sp[ c+defg, ab] = k. That is, a as well as b should be distributed over c+defg. Thus we c -et B 1 = abc + abcdefg + a(7rlo) + tt16 + wh where h[B'] = 6. Now this situation is studied in detail. In general we have a form: F = ... +t'(... +t" ( C ) +. . . +D+. ..) + ... + E + ... ^ J V A V J w~ :-n if Sp[A,t'] is not enough to reduce the tree height h[ F] , we have to further check components of A, e.g. Sp[C,t't"]. As we will show later (see Substep 2 of Step i\ of Algorithm given in Section j.k-.l) an arithmetic expression is examined from the inner most pairs of parentheses to the outer most pair. In the above diagram, the distribution of t" over C is first checked to see if it reduces the tree height h[A] and then the distribution of t' over A is examined. If the itribution of t" over C creates space and reduces the tree height h[A], then there is no problem. However if that distribution does not lower the tree height .], then t" will not be distributed over C (see Algorithm). As we showed in the above example when we check the possibility of reducing the tree height h[ F] by creating space by the distribution of t' over A, it may be necessary to check , ft"] as well. Qk Let A' = ... + (t"C) d + ... + D + ..., G* = (t'A) d and G" = (t'A') d - Now we show that if Sp[A, t'] = 0, then it is not necessary to examine components of A, e.g. Sp[C,t't"] further. This helps to reduce the number of checks required. Lemma 10 : (1) If Sp[A, t'] =0, then h[...+G"+...+E+. ..] < h[...+G'+. ..+E+. ..] never holds. (2) If Sp[A,t'] = 0, then h[G"] < h[G'] never holds. Proof: (1) We prove this by showing that if Sp[A,t'] = then Sp[A',t'] = 0. Note that Sp[A, t'] = implies that either (i) h[t'] > h[t"(C)] or (ii) h[t n ] = h[C] (Definition 9). By Lemma 9, in either case we get h[(t'(t"C) ) ] > h[t't"(C)]. Note that the only difference between G' and G" is that a term -— sr d t't"(C) in G' is being replaced by (t'(t"C) ) in G". Since h[(t'(t"C) d ) ] > h[t't"(C)], G" cannot have more free nodes than G'. Hence Sp[A f ,t'] = 0. (2) This may be proved in a similar way and the details are omitted. (Q.E.D.) Thus in F = ... + t' ( . . .+t" (C)+. . .+D+. . . ) + ... + E ... the distri- bution of t" over C should be done if it reduces the tree height of T[A], otherwise it should be left untouched. In the latter step when the distribution of t' over A is examined, the possibility of distributing t" over C as well shall be checked if and only if Sp[A, t'] / 0. Otherwise we shall leave t'(A) 85 as it is and we need not check inside of A again. 3.4 Algorithm Having these results, an algorithm to reduce tree height of an arithmetic expression is now described. Given an arithmetic expression, the algorithm works from the inner most pairs of parentheses to the outer most pair. 
We assume that cases (P, ) and (P ) (see Lemma 2) are already taken care of. At each level of a parenthesis pair, first upon finding a form t'(A), a hole of A is tried to be filled by t' (Theorem 2). After all holes are filled, a form t'(A) + t" is checked, i.e. if the distribution of t' over A creates enough space to accomodate t", then the distribution is made. Note that it is not necessary to fully distribute (A)(b) = (lt.)(Xt. T ) (see Lemma 3)- It is not necessarily true that reducing tree height of a term t of an arithmetic expression A reduces tree height of A. However, we show that reduction of tree height should be made in any case to help later steps of the distribution algorithm. n-1 n-1 Let A = Tt. + t (or ir (t . ) x (t ) ) . Assume that the distribution . , 1 n . , 1 n i=l i=l somehow reduced the height of T[ t ], i.e. the distribution algorithm n-1 n-1 reduced A to A' = I t. + t ' (or it (t. ) x (t ')) where h[ t ] > h[ t ']. Also . , 1 n . n i 7 n ; ' L n J L n J i=l i=l assume that MA] = h[A']. Yet it is obvious that #(Sh[ A]) < #(Sh[ A' ] ) (i.e. le £h r ;.l < le Sh[A']) and also for any t", Sp[A,t"] < Sh[A,t"]. That is, even if distribution only reduces the tree height h[ t ] and .does not reduce the tree height h[A], that distribution does not cause any bad effect on the later steps when A appears as a term of a bigger expression d with respect to holes and spaces , 86 The arithmetic expression thus obtained may give the lowest tree height, i.e. the fewest number of computation time steps. 3.k.l Distribution Algorithm In the steps below we refer to the notation: . k - 1 k - 1 I . k 1 I r— k — r i k ! 1 «(...«( k )&.. .)©... ©(...©( )e...e( )©...)© ' s-l,i I I sj I I s,j+l I H — tl — h k << A = .. . + t . , 7T (t - ) + ... k p where t . , = tt a„ or empty. s,J-l l=1 t Step 1 : Go to an arithmetic expression enclosed in an innermost parenthesis pair which is not checked. Let this level be k-1. In the above diagram we are «k-l now working on, say, A s Step 2 : Obtain a set of holes for all t . which are enclosed in the k-th k-1 parenthesis pair and are components of A , as well as their heights h and effective lengths e Step 3 : k-1 In this step, the (k-l)th parenthesis pair level A "" is examined. 87 Substep 1 : Hole filling (see Theorem 2) Let 0+n k P t where t = ir a or empty. Also without loss of generality we assume that e[t g£ ] < e[t g i+J ] for l = j, j+1, ..., j+n-1. k-1 (1) Find an occurrence of a form 7r(t.) or Tra. x 7r(t.) in A .If there J i s is no such occurrence, then go to Substep 2. If an occurence of a form ir(t.) is found, then skip (2) and (3). Otherwise go to (2). k-1 (2) Suppose we find t in A as an instance of Tra. x Tr(t.). s 1 j k k Fill holes in Sh[t ] (h=j, j+1, . . ., j+n) using a.'s in t . If there are many holes to be filled in, fill the smallest ones first, i.e. in order of increasing size. Reevaluate Sh by Lemma 8 for those t . whose holes are filled, sh (3) If t , (h=j, j+1, . . ., j'+n) do not have enough holes to accomodate all a.'s then go back to (l) to find out another occurrence of Tr(t.) or j+n k Tra. x Trft.) form. Otherwise we work on tt (t 1 , ) which we get from i j . . sh' & P J+n v ira, x tt (t\ ) after (2). 1-1 J h=i Sh (J+) We start from h = j. Check if t' can fill in one of holes in u sh 88 Sh[t' ] (l=h+l, . . ., j+n). If there are many holes which can s x. ]^ accomodate t' , fill them in order of increasing size. Continue the procedure until all t' (h=j+l, . . 
., j+n-l) are put in some holes or there is no hole to accomodate t' . Go back to (l) to find sn out another occurrence of 7r(t.) or ira.. x 7r(t.) form. J i J Substep 2 : Space filling k-1 k After Substep 1, we again check A , where all holes in t . have s sj been filled in as much as possible by Substep 1. (1) Let Ex = a[A k_1 ] - e[A k_1 ]/2 (see Lemma 9)- (2) Let k-1 k ^ +n-1 k n k A "=...+t . -, x it (t,)x(t .,)+... s S ' J_1 h=j S cf' J t^ ~Y f t k p k where t . n = tt a, or empty. We also assume that eft . 1 is B,d-1 ! = i * s,j+n J k k the largest among all e[t ,] (h=j, . . ., j+n) . Let t' = t . . x sn s f j -± j+n-1 . . . tt_ (tg h ). If h[t'] «h[t* ], then evaluate Sp[t* ^ f ] . Otherwise leave it as it is. (3) Repeat (2) for every occurrence of a form tt( t . ) (or ira.. x ir(t .)) J -'-J k-1 in A . Assume that there are m such occurrences. Arrange all s ~p[t,t'] in order of decreasing size. For convenience we write Sp r Sp 2 , ..., SpJSp. > s P . +1 ). 89 m d-1 d (+) If Z Sp. > Ex then let d be such that Z Sp. < Ex and Z Sp > Ex. i=l x i=l x i=l k ^ +n_1 k k (5) Let t -tX it ( t >.)x('t • ) be a form which corresponds to s,j-± ' ._. sn s, j+n 1 2 ' t' t Sp.(i < d). Then distribute t' over t, and create space Sp.. Repeat the same procedure for all i = i, 2 f . .., d. m (6) In the case where enough space to accomodate Ex (i.e. Z Sp. < Ex) i=l x ]^ is not found, a check is made against the component terms of t (see Lemma 11). For example let t .-=a.a rt ..«a,n=l s,j-l 12 p' and t k . = b n b ... b (t^ 1 ) + Z t*!" 1 . s,j 12 q v sf i=1 si Then A k-1 . Jc ,,. ,. ,. ^k+l N . ™ , s .. +t* _ x (b n b ... b (t"I x ) + lO + .. s, j-1 12 q sf . _ si i=l i/f k+1 Then the distribution is done if the sum of Sp. and Sp[t „ , t . _ x b n . . . b ' is greater than or equal to Ex. Here the dis- s,J-l 1 q J k+1 tribution of a, ... a b, ... b over Z t .' is to be made as well 1 pi q si k+1 as the distribution of a, ... a over t _ . This checking is to 1 p sf 90 be made until enough space to accomodate Ex is found or else until the innermost level of parenthesis pair is reached. Step k ; k-1 Mark A as checked, s Step 5 : If all levels are checked, then stop, otherwise go back to Step 1. For example let us consider the following A = ... + a 1 a 2 a 5 (t 1 )(t 2 )(t 5 ) + t Further assume that Sh[t x ] = Ul, e[t 2 ] = 16 Sh[t 2 ] = (16,21, e[t 2 ] = 61+ and Sh[t 3 ] , e[t 3 ] = 64. Then a, a p a, can be distributed over t,, and in turn this whole thing can be distri- buted over t, . . + a.. a 2 a-, { t, 4 o (t )(tj + ... and we get h[t'] = 7 whereas h[t] = 8. 91 3.^-2 Implementation A few words about implementation of the algorithm described above are given as well as the total number of checks required to process an arithmetic expression. Suppose we are given the aritmetic expression A = ... + (7r24^rl)(7rlO+(7r4-Pn-l)(7rl7-H7r2)-P7r3) + ... = ... + ( d 1 +d 2 )( e 11 + ( e 2i +e 22^ e 31 +e 32^ + e l+l^ + "•• = ... + (D)(E 1 +(E 2 )(E 3 ) + E^) + ... = ... + (D)(E) + . . . where iri represents a product of i single variables. Then we build nested stacks as shown in Figure 3.13(a). Note that a new stack is created for each form ir(t.) or Z t . . 1 e 21 (x) B(+) e u (x) E 2 (+) (+) TVT f v \ 1 im^x; "2 KAJ «= e 00 (x) A <— ^ — <— >> E ? 
(+) D(+) \ I \l lXJ ~'~~~- e ^n d.(x) 31 d 2 (x) ^^ e 32 level - m-k m-3 m-2 m-1 m Figure 3-15- Stacks for an Arithmetic Expression 92 (b) E 4- w 2 (x; Sh A _L 1,2, If, 8, 8,16,32 t,5,6 E 2 (+) Sh if a iii Sh A 1,2,4,8 2,3,4,5 2, 3 ,4, 5 e 21 (x) h 2 Sh F A e 22 (x) /" e 31 (x) e„(x) h 1 Sh F A N l «■ (c) Sh A El£L 2,4 3,^,5 3,4 "MJ. N 2 '(x) h 5 Sh 1,2,4,6 F A 1,2,3,5 /* e 3i N 2 "(x) Sh e 21 (x) E ( + ) h 3 Sh F A 1,2 F E <- 1,2 '22 Figure 3- 13 • (continued) 93 Each stack is assigned a level number (cf. Definition 3) where the first stack which corresponds to A receives the level number (Figure 3« 12(a)). We start working from a stack with the largest level number, say m. For each stack t, where t = It. or t = ir(t.), h[t] is evaluated. Also if a stack represents a form t = It., then Sh[t], F [t] and F [t, t.] are evaluated. If a stack represents a form t = 7r(t. ), then Sh[t] and F [t] are evaluated. These values are obtained by Definitions 1, h- and 8. Note that this information is sufficient to evaluate Sp. Figure 3.12(b) gives an illustration. Upon finding a form 7r(t.) (or Tra . x 7r(t.)) (e.g. the stack N ), we apply the distribution algorithm and decide if distribution is to be made. If a stack represents a form t = 7r(t.), then Substep 1 of the distribution algorithm, i.e. hole filling, is tried. Otherwise a stack represents a form t = I t. and Substep 2 of the distribution algorithm, i.e. space filling is applied. In our example E is distributed over E,(e[E p ] < le Sh[E,]). Then stacks are revised as shown in Figure 3 .12(c). Note that the stack N is replaced by two new stacks N ' and N ", and the stack E disappears. If all stacks with the level number k have been checked, then stacks with the level number k-1 will be checked. In our example, stacks E(or E' since it has been revised) and D are now checked. The total number of checks required to process a whole arithmetic expression thus depends on the number of parenthesis occurrences in it. Assume that there are p parenthesis pairs in an arithmetic expression A. For each pair, space creation should be examined. Hence in total p space creation checks are required. Now for each ir(t.) form hole filling should be tried. The number of Qk occurrences of a form ir(t. ) in A is obviously less than p. Hence the total number of checks required is less than 2p (i.e. the order of p). 3.5 Discussion 3.5.1 The Height of a Tree Given a tree for an arithmetic expression, the distribution algorithm tries to lower tree height by distribution if possible. However, in general it may not give the minimum tree height. For example let A = ac + ad + be + bd whose tree height is 3> and since no further distribution is possible, the distribution algorithm yields the same value. There is, however, an equivalent expression A' = (a+b)(c+d), whose tree height is lower than 3, i.e. 2. That is, even though factorization lowers tree height sometimes, the distribution algorithm does not take care of it. The question we ask now is how much the distribution algorithm lowers tree height. Before giving an answer to this question let us study tree height in more detail. Given an arithmetic expression, Theorem 1 gives the exact height of a tree obtained by Bovet and Baer's algorithm. It is also desirable if we can get an approximate tree height without actually building a tree for an arithmetic expression. 
Since the number of single variable occurrences (the number one less than this gives the number of operators in an arithmetic expression) and the depth of parenthesis nesting may well represent the complexity of arithmetic expressions, let us try to approximate tree height in terms of them. Let A be an arithmetic expression with n single variable occurrences and depth d of parenthesis nesting. Now build a tree for A by Bovet and Baer's algorithm. Then it can be proved that: 95 jemma 11 : log 2 rnl 2 < h[A] < n - 1. Moreover we can prove the following theorem. Theorem 3 • h[A] < 1 + 2d + log[nl 2 . The following lemma is helpful to prove Theorem 3< Lemma 12 (1) 2a>ra], (2) f2al 2 = 2fal 2 (3) log r P r p -[ml i=l < log (2 • 2 r m 1=1 i 2 Proof: (l) and (2) are obvious and (3) can be proved by (l) and (2) (Q.E.D.) Proof of Theorem 2 : Proof is given by induction on d. First let us prove the theorem for d = 0- Then A has the following pattern: A - F it a.. i=l j=l r Then by Theorem 1, h[A] = log Z T q 1 i=l 96 < log 2 r p by Lemma 5(3) = 1 + log[n] 2 . Nov assume that the theorem holds for d < f . Let t. be an arithmetic expression with depth d.(< f) parenthesis J J i i ^ nesting and n. single variable occurences. Then by assumption h[t.] < 1 + 2d't J ■ J J + log[n_. lp- Now an arithmetic expression with f + 1 parenthesis nesting can be built from t. as follows: (q. m. n. . i . ir a' tt (tih j=l 3 k=l K where a. are single variables and at least one of t. has f nested parentheses. Now each t. can be replaced with a product of e[t.] single variables without J J affecting the total tree height. Instead of using the value e[t.], let us use 2d' the value 2-2 J . Tn 1 ^. (Note that h[t^] < 1 + 2d 1 + logrn 1 ! = log (2 • 2d 1 . . . h[t X ] 2d* 2 J • TrulJ and eft 1 ] =2 J < 2 • 2 J • rn*"L.) Since d 1 . < f, e[t*] < 2f i 2-2 fn 1 • Now from Theorem 1, we have J 2 h[A] < log r p TCI. 1 2+ Z 2- S 2frn j 1 2l2 3=1 r P m. < log 2 s 2f i- 2 [q + r 2 • 2 nt] 3=1 J 1-= 97 |2.2.2.2 2f r (q, + rnbl < logl2-2-2-2 = 1 + 2(f+l) + logrnl 2 . Thus the theorem holds for d = f + 1 and this proves the theorem. (Q.E.D.) Now let us examine the original question i.e. how effective is the algorithm presented in this chapter. Let A and A be arithmetic expressions d where A is the resultant expression obtained from A after the application of the distribution algorithm. Now build trees for A and A by Bovet and Baer's algorithm. Then it should be clear that h[A] > h[A ]. Moreover experience suggests that: Conjecture : h[A d ] ^ 2 log 2 rnl 2 where n is the number of single variable occurrences in the original arithmetic expression A. Note that the distribution algorithm speeds up a Horner's rule polynomial in a logarithmic way. Also note that the distribution algorithm does no distributions in the case of which takes 21ogr nl steps as it is presented but which would take (n+l) logrn] steps if fully distributed. Thus the algorithm can save a factor of n/2 steps over a scheme which would distribute indiscriminately and in some cases achieves a logarithmic speed up. 98 3.5.2 Introduction of Other Operators 3.5.2.1 Subtraction and Division Subtractions can be introduced into an arithmetic expression without causing any effect on the distribution algorithm. It may be necessary to change operators to build a minimum height tree. For example let A = a + b - c + d. This will be computed as A = a + b - (c-d): Divisions may require special treatment since the distributive law does not hold in certain cases, e.g. 
(a+b+c)/d = a/d + b/d + c/d but a/(b+c+d) / a/b + a/c + a/d. Hence in general minimization of the height of trees for a numerator and a denominator is tried independently, and then distribution of a denominator over a numerator is tried if appropriate. Also let A = t/t'. Then T[A] is built from T[t] and T[ t ' ] as follows: T[t], or If h[t] / h[t'], then we get nodes to which only one tree is attached, e.g. a and (3 if h[t] < h[ t'], and a' and 3* if h[t] > h[t']i Then a and p are treated as free nodes in T[A] while a' and 3' are not treated as free nodes in T[A], because later when another expression, say t", is multiplied to A, t must 99 be multiplied by t" not t', i.e. t"(A) = t"(t/t') = (t"t)/t' ^ t/(t't**): T[t" T[f] 3- 5-2.2 Relational Operators If an arithmetic expression A contains relational operators e.g. B I ) C -where RO = [>. <, =, >, <, . ••}, trees can be built for B and C independently: A = T[B] T[C] If h[B] -*' h[C], then operators may be moved from one side to the other to balance two trees. For example let A=a+b+c>d. Then we modify this as A' = a + b > d - c and get h[A'] =2 while h[A] = 3: b d 100 k. COMPLETE PROGRAM HANDLING Chapter 3 presented the algorithm which reduced tree height for a single arithmetic expression by distributing multiplications over additions properly. In this chapter we will discuss some ideas about how to handle complete programs, i.e. given one program, how can it be executed in the shortest time by building a tree as well as executing a statement in a for statement simultaneously for all index values. Ideas include back substitution. We do not have the solution to the problem, but this chapter presents some details of the problem and some ways to attack them. We conclude this chapter by comparing serial and parallel computations in terms of a generated error. It is shown that in general we could expect less error from parallel computation than serial computation. It is also shown that distribution would not increase the size of an error significantly. 4.1 Back Substitution - A Block of Assignment Statements and an Iteration While the distribution algorithm in the previous chapter discusses tree height reduction for a single arithmetic expression, it can be used for any jump free block of assignment statements. If we define those variables which appear only on the right hand sides of assignment statements or in read statements in a block as inputs to the block, and those variables which appear only on the left hand sides of assignment statements or in write statements as outputs from the block, then we can rewrite the block with one assignment statement per output by substitution of assignment statements into one another. For example 101 a := b + c; d := e x f; g := a + c; h := a + g x d can be rewritten as g := (b+c) + c; h := (b+c) + ((b+c) + c) x (exf ) . After such a reduction only input variables appear on the right hand sides of assignment statements. At this point, the distribution algorithm could be applied to each remaining assignment statement and if sufficient computer resources were available, all of the reduced assignment statements could be executed at once- In the above example if each statement is computed in parallel (by building a tree) independently then 5 steps are required, while if the back substitution is done then the computation requies only + steps. Suppose we have assignment statements, A , A , ...,A . Also suppose that by back substitution we can rewrite this block as A. 
We build minimum height trees for A, ,A , ...,A and A. Now we apply the distribution algorithm on those trees. Let the resultant tree heights be h| A ],..., h[A ] and h[A] . Then obviously h r A-,] + ... + hi" A ] > h[A], i.e. back substitution never increases the computation time (in the sense of tree height) (Figure l+.l). Our main interest here is the case where strict inequality in the above relation holds, because that h[A, ] + ... + h[A ] > h[A] holds is equivalent to a speed up of the computation by back' substitution. Note that back substitution amounts to symbol manipulation (i.e. replacement) and should not be confused with arithmetic simplification. For example from 102 w V h [An] Figure k.l. A Back Substituted Tree a := x + y b := a + y we get b := (x+y) + y orb :=x+y+y but we do not get b := x + 2y . Now we shall study this kind of speed up. We shall discuss a limited class of assignment statements, i.e. an iteration. This may serve to give some insight to the problem of speed up by back substitution in the general assignment statement case. By an iteration we mean a statement 'i ■ f(y i-i'- Usually a statement is executed repeatedly for 1 = 1, 2,'. . .,n. An example is: for I := 1 step 1 until 10 do A[I] := A[I-1] + A[I]; 103 Also a block of assignment statements such as: S 1 : a := h + i + j; S ? : b :-■ a + k + m; S,: c := b + n + p; S. : d := c + q + r; falls into this category (note that all statements have a form output of S. : = output of S. + x + y where x and y are pure inputs in the sense that they do not appear as outputs). Assume that we are only interested in the value of y (the other results, i.e. y ,,-y rt , . . . , y, may be obtained similarly to y but in less n-1 n-2 1 n time). Then instead of n statements, i.e. y, = f(..), y p = f(..),...,y = f(..), we n^ay obtain one statement for y by back substitution. For example, let y. = a. y. . Then y = a n y , = a .(a ^y _) = a .(a _(a ,y _)) 1 l-l l-l n n-l^n-1 n-1 n-2 J n-2 n-1 n-2 n-3 n-3 n-1 = • • • - y tt a, . We use the superscript "b" to distinguish the back b n_1 substituted form from an iteration form, e.g. y = a ,y . and y = y^ tt a. . J n n-1 n-1 n . _ k k=0 Then instead of computing each y. repeatedly for i = 1, 2, . .,n, y may be computed directly. In the above example y. can be computed in one step* and to get y n steps are required while y can be computed in r log nl steps in parallel (i.e. by building a tree for y ). The following table summarizes the results for some primitive yet typical iteration formulas. 1C4 y i b y n T s nT s T P ay i-l n a v~ 1 n ri g 2 (n + ljl yi.i + b n y Q +*b +"/."+ b A 1 n flog 2 (n + 1? a i-i y i-i n-1 Vo k=0 K u 1 n log^ y -, + a °i-l i-1 n-1 z \ + y o k=0 K 1 n flog 2 n1 ay._ x + b *\ + P n-l (a ^ 2 2n *2fiog 2 (n + 1? ay i-l + X i-1 . . ■*-* p;(a) 2 2n ar2flog 2 (n + 1)1 y i-l + bx i-l n-1 z bx k + y o k=0 K u 2 2n « 'log nl + 1 2 ay i-l + bx i-l p" a) *n 3 3n -2riog 2 (n + 1)1 /• \ -l. n-1 , n-2 p - (a) = ba + ba + . . . + b „ . . / \ n , n-1 , n-2 ** P n (a) = a y + a X + a X l + "• + "n-2 + X n-1 „v.j, ti / \ n , n— 1 , n-2 , *** p (a) = a y + ba x^ + ba at, + . . . + bax n + bx . n J 1 n-2 n-1 T : The time required to compute y. in parallel, i.e. h[y.]. T : The time required to compute y in parallel, i.e. h[ y ]. Table k.l. Comparison of Back Substituted, y, and Non-Back Substituted Computation, y. — Iteration Formulas 105 From Table *4-.l, the following lemma is obtained by exhaustion. Lemma 1 : Let y. = f (y. , ) be linear in y. . 
where we assume that in the presented form additions are reduced to multiplications as much as possible, e.g. y. = 2y i _ 1 instead of y ± = y i _ 1 + y i _ 1 - Then n x h[y.,] > h[y n l- Thus if we have enough FE's, then instead of computing each y. repeatedly for i = 1, 2, ...,n, we should obtain y by back substitution and compute it by building a minimum height tree. If an iteration y. = f(y. , ) is not linear in y. _, e.g. y. = a y? , + b y^ + c, or if it is linear in y. but there are some additions not being reduced to multiplications, e.g. y. = y. , + a y. , then it is not clear if back substitution speeds it up. For example, back substitution does not speed up the computation of y ± = y i _ 1 + y i _ 1 + y ± _ 1 + Y ± _ 1 ' Also let y i = f (y i _ 1 ) be a polynomial of y. , where in the presented form additions are again reduced to multiplications as much as possible. Then it is not likely that we can speed up the computation by back substitution. Let „/ \ m f (y. , ) = a y. . + ... where y. is the highest power of y. , among those which appear in f(y. ). Note that f (y. ) is not necessarily a dense polynomial (a polynomial in which all powers of y. , i.e. y. ,, y. ,, ..., y. ., appear). While the exact height of T[ f (y. , )] depends on f(y. ), we may content ourselves with (see Chapter 2) 106 h [ f(y i-l )] ~ 2 r io g 2 m1 ' Hence 2nriog ? ml steps are required to compute y . Now let us consider y . Then n b , b m n m n-l m m = a (a (y ) + . . . ) + m v m w n-2 m m = a (a ( . . .a y n . . . ) . . . ) + ... mm nrO ' i n n-l m = a m y Q + ... . That is, y becomes a polynomial of y of degree m . Leaving the computation of a out of consideration, we have (see Chapter 2) h[y£] * 2[-log 2 m n l * 2nr logpinl . Hence back substitution does not help to speed up computation significantly in this case. To gain a better understanding of more general cases, let us study the situation from a different point of view. Given an iteration y. = f (y. , ), let us consider the number of single variable occurrences in y as a measure of e J n the complexity. We study two cases separately, i.e. (i) y. .. appears only once in y. and (ii) y. appears k times in y. . In both cases we assume that there 107 are m single variable occurrences (including the occurrences of y. ) in y. . For convenience we write N(y) for the number of single variable occurrences in y, e.g. N(a+b+cd4a-e) = 6. (l) y. , appears only once in y. . In this case we have N(y 1 ) = m N(yg) = N( y;L ) + m - 1 = 2m - 1 N(y^) = N(y^) + m - 1 = 2m - 2 N(y ) = N(y ,)+m-l = mn-n + l~ ran. J n n-1 (2) y. appears k times in y. . In this case we have N(y 1 ) = m N(y^) - k • N(y^) + m - k = m + k(m-l) N(y^) = k ■ N(Vg) + m - k = m + (k+k 2 )(m-l) • • • N(y^) - k ■ N(y^_ 1 ) + m - k = m + (k + . . . +k~ )(m-l) = 1 +^-i(m-l). If k n » 1 and m » 1, then N(y b ) s k n_1 m. > w n Now if we use 21"log p N(y)l as a measure of the height of a tree, then we have (see Section 3- 5*1 of Chapter 2): 108 h[y ± ] n x HV ± ] «# (1) 2flog 2 m] 2n[ log 2 ml 2flog 2 (mn)l * 2(riog 2 m] + rioggii"!) (2) 2riog 2 ml 2nf log ml 2riog 2 (k n " 1 m)l « 2((n-l)riog 2 kl + floggml) Table k.2. Comparison of Back Substituted, y , and Non-Back Substituted Computation, y^ — General Cases For example let m = 5, k = 2 and n = 20 in (2). Then we have n . hfy.l = k0 • T log 51 =120 and h[y£] = 2(l9flog 2 2l + flog^l) = kk. Also if we let m = 5 and n = 20 in (l), then we get n • h[y.] = i+oriog 2 5l = 120 and h[y*] = 2(riog 2 5l + Tlog^Ol) = 16. 
Now a few comments about implementation are in order. As for back substitution of a block of assigment statements, the step by step substitution is the only possible scheme. In case of an iteration formula, we may use the z-transformation technique to obtain y [8]. For example let y. = y. n + x. . Then by applying z-transformation on it, we get Y(z) = zY(z) + X(z) or Y(z) = ^ Z ^ . Hence Y(z) = X(z)(l + z + z 2 + 7? + . . . ) 2 2 = (x + X-.Z + X z + . . . ) (l + z + z +...) = x Q + (x 1 + x Q ) z + (x 2 + x ± + X Q ) z + . . . 109 or y = Z \> 1 k=0 * Two other related problems become evident in the example presented above. First is algebraic simplification. For example, a := b + kc could be executed more quickly than a := b+c+c+c+c We shall not discuss this subject further here. A second problem is the discovery of common subexpressions. In our example, (b+c) appears twice in the right hand side of it. If we had an algorithm, e.g. [11] which discovered common sub- expressions in one (or more) tree which could be simultaneously evaluated, the number of FE's required could be reduced by evaluating the common subexpressions once for all occurrences. On the other hand, by removing common subexpressions the execution time (the height of a tree) may be increased in some cases. For example, if we have x := a(b+c+de) and y := f(g+c+de), then we might try to replace c + de in x and y by z as follows to save the number of PE's required: x However, note that x = a(b+z) or y = f(g+z) takes k steps while the original x and y require only 3 steps, i.e. h[a(b+c+de)] = 3 and h[a(b+z)] = k. Thus an overall strategy must be developed for the use of a common subexpression discovery algorithm in conjunction with overall tree height reduction. 110 h . 2 Loops This section is included here to complete this chapter, and discusses the subject superficially. Details will be presented in the following chapters. Consider the following example. El: for I := 1 step 1 until 10 do for J := 1 step 1 until 10 do S3: A[I,J1 := A[I,J-1] + B[j]; In this case ten statements, A[l, J] := A[1,J-1] + B[J], A[2, J] := A[2, J-l] + B[J], ..., A[10, J] := A[10,J-1] + B[J] can be computed simultaneously while J takes values 1,2, ...,10 sequentially. We say that S3 can be computed in parallel with respect to I. Note that originally the computation of El takes 100 steps . (One step corresponds to the computation of S3, i.e. addition. For the sake of brevity we only take arithmetic operations into account and shall not concern with e.g. operations involved in indexing.) By computing S3 simultaneously for all values of l(l = 1,2, . ..,10) the computation time can be reduced to 10 steps . b 10 Finally by building a minimum height tree for A [1,10] (:= A[I,0] + Z B[J]) J=l for each I (i = 1,2,..., 10), we can compute all ten trees simultaneously in h_ steps . To help understanding, let us further consider L: for I := 1 step 1 until N do for J := 1 step 1 until N do S; Then Figure ^.2 (a) shows the execution of L as it is presented. The total computation time required (t) is N, x N x m, where we assume that m arithmetic operators are in S. Now suppose S can be computed in parallel with respect to Ill I, (Figure 4.2(b)). In Figure 4.2(b), each box has the form shown in Figure lj-.2(c). Here S 11 computed sequentially, i.e. T. = mN p . Now let us compute S in parallel i.e. by building a tree (Figure 4.2(d)). Then we have T Q = N h[S]. Note that m > hJ"S]. Further if we back substitute S for J = 1,2 N and get S , we have T = h[ S ]. 
As stated before (Section 4.1), N 2 h[S] > h[S b ], or T Q > T ± > T g > T . general we have L: for I := 1 step 1 until N, do for I : = 1 step 1 until N do for i := 1 step 1 until N do S; n *- n — re S is an assignment statement. Then the computation of L takes T = n 7T N. m steps as it is presented, where we assume that m arithmetic operations ] are involved in the computation of S. If S can be computed for all values of I = 1,2, ...,N. ) simultaneously, then the computation time can be reduced to K T, ~ N.lm) stens, i.e. N. statements can be computed simultaneously , I , ...,I _,I, .,..., I change sequentially. In general there are n possibilities, i.e. we examine if S can be computed in parallel with respect to I for k -• 1,2, ...,n. Let P = [k|o can be computed in parallel with respect to I, . hen we would compute 3 in parallel with respect to I where T a = 112 I=N C3> (a) (b) J: =J+1 J T m <^> -^ ^J+l ) h[S] 3 h[s b ] (c) (d) 3 (e) Figure k.2. Loop Analysis 113 min T . Clearly each statement of the resultant N statements can be computed kcF k g by building a tree. Further if it is appropriate we perform back substitution and obtain a big tree as the above example (El) illustrates. If a loop is a limit loop which terminates when e < 6 for some pre- determine! B and computed c, it may be approximated by a counting loop (e.g. or T :- 1 i s - er 1 until II do ) which is executed a fixed number of times before "est is made, and then repeated if the test fails. Consider a program containing n two way forward jump statements (or if statements). Let the tests for jumps be Boolean expressions B„,B ,...,B . — 1 2' n Assume that there are m output variables from the program given as expressions ,A_, ...,A , where parts of A. may depend on B.'s. In a program when B. is 1' 2' ' nr l j j encountered, one of two choices is taken depending on the value of B . . It is possible to start computing all of these possible alternatives at the earliest time, and choose proper results as soon as values of B.'s become available. For example a := g + c; B : a > ':): if B then d := e + f + s else d := a + g + t; : i + f + i x j x k x p x q; yield := B x (e+f+s) + ( not B) x ((g+c)+g+t) +f+ixjxkxpxq or 114 h := ((g+c) > 0) x (e+f+s) + ((g+c) < 0) x ((g+c)+g+t) + f + i x j x k x p x q, where we let B = 1 for true , B = otherwise. Then we may build a tree for h as follows . gcefsg eg t f ijkp q Figure k.J. A Tree with a Boolean Expression The box \B • produces or 1 depending on the value of (g+c>0). In general, Boolean expressions can be embedded in arithmetic expressions as shown in the above example, and a minimum height tree can be built for it. h.k Error Analysis In this section parallel and serial computation are compared in terms of error. We are only concerned with a generated error , i.e. an error which is introduced as a result of arithmetic operations. It is shown that in general parallel computation would produce less error than serial computation. It is also shown that distribution would not increase the size of an error significantly. Let co represent any arithmetic operation. In general, we do not perform the operation co exactly but rather a pseudo-operation (jx>) . Hence instead of obtaining the result xo^y, we obtain a result x(a?)y. We may write 115 x y = (^y)(i+e ) (1) where <: represr I in error introduced by performing a pseudo-operation. For example, we have -: y = (x+y)(l-f€ a ) and x Q y = Cxy)(l+e m ). 
Let us write A for an approximation to an arithmetic expression A with an error obtained by computing A using pseudo operations, e.g. fy\ or (+) . Then ^an also be written as (xuoy) = (xo>y)(l+e ). Now let us consider the computation A = la.. i=l X First we compute A serially, i.e. A = . . .f((a 1 +a 2 )+a,)+a^)+...+a N ). .e have * = a i a 2 = (a 1 +a 2 )(l+e & ) - & + a 2 + e (a»j+a ), {+) a = (a +a +e (a +a,,)+a )(l+€ ) J 3 -L^SlJ-.^ 3, = a, + a,- + a, *■ e (2a., +2a^+a-. ) . 1 2 3 a 1 2 3 • higher terms of e are neglected. St 116 # * \ = A 3 © % = a l + a 2 + a 3 + \ + e a ^ a i + 5a 2 +2a 3 + ai+ ), \ = \-l O *N = ._\ a i + e a ((N-l)a 1+ (N-l)a 2+ (N-2)a 3+ ... +aN ) N 2 a. + € (Na +(N-l)a +(N-2)a,+. . .+0 i=l We let E = e (Na +(N-l)a +(N-2)a +. . . +a ) . Next let us compute A in parallel, i.e. by building a tree: A Without loss of generality we assume that N is a power of 2. Then A I 2 ■ a l © a 2 = a l + a 2 + e a ( V a 2 ) 1-k = A 12 © A lk = a l + a 2 + a 3 + % + 2c a (a l + V a 3 4 V 1-8 = A l-ii A 5-8 = a l + a 2 + •" + a 8 + 3e a (a 1+ a 2+ ...+a 8 ) A A A l-N = l-N/2 O V+l-N " J/i + ri0g 2 N1 £ a .f^i 117 We let -r = r log Nl € £ a. . To compare E with E , let a = a = a = ... = a Then we get N' and a 2 a :: r log 2 Nl a • e a , or E S > E P. a a N An error for B = it b. can be analyzed in a similar manner. In this case we i=l ^ N E S -- E P = (N-l) e Trb. m m m . , i i=l Hence, Ln general, we could expect that parallel computation produces less error than serial computation. * hat if higher terms of e and e are neglected, then A can be a m ' written as A + i E (A) + ( v E (A) a a mm where an I ''A) are arithmetic expressions consisting of variables in A. For example, if we compute A = afbc+d) serially (i.e. A = a((bc)+d)) then we get 118 A = (ax((bxc)(l+€ m )+d)(l-f€ ))(l+€ ) ~ a(bc-d) + e a(bc+d) + € (2abc+ad), am " and E (A) = a(bc+d) and E (A) = 2abc+ad. m Usually E (A) and E (A) depend on how A is computed as we have shown for A -- E a.. Now let us compare parallel computation of two arithmetic expressions A and A , where A is the resultant expression obtained by applying the distribution algorithm on A, in terms of a generated error. Note that we can write A = A + e x E (A) + e x E (A) a a m m and A d * = A d + e x E (A d ) + e x E (A d ) a a ' m nr = A + g x E (A d ) + e x E (A d ). a a m m / n d As an example let us study A = a(bc+dj + e and A = abc + ad + e. bed e (a) A = a(bc+d) + e (b) A = abc + ad + e Figure k.k. Trees for a(bc+d) + e and abc + ad + e 119 Then we have * A = (a(bc(l+€ )+d)(l+e )(l+€ )+e)(l+€ ) m a nr ' v a y = (a(bc+d)+e)+e (2abc+2ad+e)+e (2abc+ad) a m. and A d * = (abc(l+€ ) 2 + (ad(l+e )+e)(l+e ))(l+€ ) m m &' J a = (abc+ad+e) + e (abc+2ad+2e) + e (2abc+ad). Note that E (A) = E (A ) in the above example, which is not mere chance. We can show that this holds for all cases. Lemma 2 : E (A) = E (A d ) m m Proof : where First let us consider t#a \ © *2 © ••' © *n t* = t. + € E (t.) + e E (t. ). 1 i aai mmi Then clearly n E (t) = € E E (t.) m m . ., m l i=l regardless of the order of additions whereas E (t) depends on the order of additions. Hence we may write * n n t = Z t. + e E (t J + e ZE (t.) . ., i a a Z m . n m i i=l 1=1 where Z indicates that E (t) depends on the order of additions. 120 and Now let us consider A* = t* (tj* © t* © ... © t*) A = t t x © t © t 2 © ... © t t n where and t = t + e E (t) + e E (t) a a mm t. = t. + e E (t. ) + e E (t. ). 
i i a a l m irr i' Then we have n n and Hence A = (t-K E(t)+e E(t))( Z t 4* E (t )+€ Z E(t ,)(l4€ ni )) aa mm . ., i a a i. m..mi m 1=1 1=1 n n n = t St. + e ( Z (t.E (t)) + tE (t ))+ € ( Z (tE (t )+t E (t)+tt.)) . ,i a . , i a a a m.,miim i i=l i=l 1=1 A = (t4€ E (t)+€ E (t))(t 1 4€ E (tJ+€ E (t 1 ))(l+€ ) (+) ... a a mm 1 a a 1 mm 1 m v-/ = (tt n + e (tE (tJ+t n E (t)) + e (tE (tJ+t.E (t)+tt n )) (+) v 1 a a 1 1 a ' m m 1 1 nr 1" v-^ t Z t, + e E (a£) + € ( Z (tE (t )+t ,E (t)+tt )) . , l a a L m . , mi lm l i=l 1=1 E(A) = E (A u ). m m (Q.E.D.) 121 As for E (A) and E (A ), they depend on the order of additions and cannot be compared simply. However, they may not differ significantly. As a simplified case, let us study the following: N A = t( £ a. ) i=l X and H N A = ^ (ta.). i=l X Again we assume that N is a power of two. Then to compute A, we first compute :; n ^ n n T. a. in parallel. As we showed before, (La..) = L a. + riog_Nle_ L a.. i=l i=l i=l i=l Hence * N N A = t( L a . +r log^Nl e Za.)(l+e ) . , i d a . _ i m i=l i=l N N N = t L a. + e r log Nit Z a. +£ t Z a. . . , l a . _ i m . - l i=l i=l i=l On the other hand, we have A. = (ta. ) = ta. + e ta. 11 l mi d + and A is obtained by summing A. in parallel, i.e. A d * = (..(((a* © a 2 ) © ( A ; © a*)) © ( (A ; © $ © (a; © Ag)))...) N N N . = t Z a. + e riog^Nlt Z a. + e t Z a. . .,i a & 2 •-,! m . . i i=l i=l i=l Hence in this case E (A) = E (A a ) as well as E (A) = E (A ). a a mm 122 5- PARALLELISM BETWEEN STATEMENTS This chapter should be read as an introduction to the following chapter which discusses loops in a program. In this chapter we study parallelism between statements, i.e. inter- statement parallelism. Given a loop and jump free sequence of statements (we call this a program), it is expected that they are executed according to the given (i.e. presented) order. However if two statements do not depend on each other, they may be executed simultaneously in hopes of reducing the total computation time. In general, statements in a program may be executed in any order other than the given order as long as they produce the same results as they will produce when they are executed in accordance with the given sequence. In this chapter we give an algorithm which checks if the execution of statements in a program by some sequence gives the same results as the execution of statements by the given sequence does. Also a technique which exploits more parallelism between statements by introducing temporary locations is introduced. 5-1 Program A program P with a memory M is a sequence of assignment statements S(i), i.e. P -= (S(l); S(2); ...; S(i); ...; S(r)) where i is a statement number and r is the length of a program P (we write r = lg(P)). The memory M is a set of all variables (or identifiers ) which appear in P. Associated with each S(i) is a set of input variables, IN(S(i)) and an output variable, OUT(S(i)). Then M =.U, (lN(S(i)) UOUT(S(i))). Further 123 we define two regions in a memory; a primary input region M and a final output region M as Mj. = (m | meIN(S(i)) and Vk < i, m/OUT(S(k) )} . and M = {m | meOUT(S(i) )1 . A program uses the values of those variables in VL. as primary input data and puts final results into M . C(m) refers to a content ( value ) of a variable m. C(M) refers to the contents of variables in the memory M as a whole and is called a config - uration of M. Also C T (m) refers to a value which m has before a computation (i.e. an initial value of m). 
Thus C(M ) refers to primary input data given to a program. We call it an initial configuration . The following relations are established among statements in P. A triple (id, i, j) where id e M (id for an identifier) and i, j e {0,1, . . . ,r,r+l] (r = lg(P)) is in the dependence relation DR(P) if and only if: (1) (i) i < j and (ii) id e 0UT(S(i)) and id e IN(S(j)) and (iii) Vk, i < k < j, id ft 0UT(S(k)), or (2) (i) i = and (ii) id e IN(sCj)) and (iii) Vk, < k < j, id I 0UT(S(k)), (s(,j) is the first statement to use id), or (3) (i) $ = r + 1 and (ii) id e 0UT(S(i)) and (iii) Vk, i < k < r + 1, id / 0UT(S(k)). 3.214- only if: (S(i) is the last statement to update id). Similarly a triple (id, i, j) is in the locking relation LR(P) if and (i) i < j and (ii) id e IN(S(i)) and id e OUT(S(j)) and (iii) Vk, i < k < j, id / CUT(s(k)). Example 1 (The notations follow ALGOL 60| 3] ) : Let P be S(l): a I: d ): f ): g = b + c; - a + e; = g + d; - h + i. Tli en DE(P) = {(b,0,l),(c,0,l),(e,0,2),(g,0,3),(h,0,i|),(e,0,^),(a,l,2), (d,2,5),(f,3,5),(g,^,5)) and LR(P) = [(g,3,^)). Since we are only interested in meaningful programs, we assume that there is no superfluous statement, i.e. there is no id e M such that (i) id e OUT(S(i)) and id e OUT(S(j)) where i < j, and (ii) Vk, i < k < j, id / IN(S(k)). Also v/e assume that there is no statement that has no inputs other than constant numbers, e.g. "a := 5" • Now we define an execution order E of a program P as : E(P) = {(i,j)|ie{l,2,...,lg(P)},je(l,2,...}}. 125 JL We also write E (i) = j if (i, j) e E(T). W To execute a program by E(P) means that at step j, all statements with statement number E_ (j) are computed simultaneously using data available before the j-th computation as inputs. A pair (P, E) is used to denote this execution . Also by E n (P), we understand the execution order given by a program, i.e. E (P) = ((i,i) | Vi e [l,2,...,lg(P))}. E n is called a primitive execution order . We assume that at each time step at least one statement of P must be executed. That is, for any E there is k such that Vj > k, E (j) = empty and Vj < k, E (j) / empty. We call k the length of an execution and write lg(E). As stated before, C(OUT(S(i))) refers to the contents of a variable OUT(S(i)). This value, as we expect, varies from time to time throughout an execution. Thus it is essential to specify the time when a variable is referred to. S(i)(P,E) refers to a computation of S(i) of P in an execution (P,E). C(m) after S(i)(P,E) refers to the value of a variable m right after S(i)(P,E). C(m) after (P, E) refers to the value of a variable m after an execution of a whole program. 5-2 Equivalent Relations Between Executions Now we define two equivalent relations between executions. J For convenience we define that Vi, E_(0) < E_(i) and E-^i) < E p (lg(P)+l) 126 Definition 1 ; Given a program P and two execution orders E and E , (P,E, ) and (P, E_) (or simply E.. and E ? ) are said to be output equivalent if and only if: for all initial memory configurations C_(M_), Vi C(OUT(S(i))) after s(i)(P,E 1 ) = C(OUT(S(i))) after S(i)(P,Eg). We write (P,E )~(P,E ) if (P,E.) is output equivalent to (P,E ). Definition 2 : Given two programs P and P , let their execution orders be E, and E p respectively. Also let their memories be M, and M p . 
Then two executions (P ,E,) and (P ,E ) are .said to be memory equivalent if and only if: (l) there is a one-to-one function fi <"ll U V - lM SI U «20> such that f(M ) = M 1 II 21 and f(M 1Q ) = M 2Q , and (2) for all initial memory configuration pairs C_(M-]_j) and C T (M ?T ) such that Vm € U ir CjU) = C I (f(m)). Vn € M 1Q , C(n) after (P^E^ = C(f(n)) after (P 2 ,E 2 ). M We write (P ,E,)~(P ,E ) if (P-,E, ) is memory equivalent to (P ,E p). 127 In principle, a program is written assuming that it will be executed sequentially, i.e. by E Q . It, however, need not necessarily be executed by E~ as long as it produces the same results as (P,E n ) when it terminates, i.e. (V lM, it may be executed by any E as long as (P,E)=(P,E_) holds. Now the following theorems can be proved directly from the above definitions. Theorem 1 (P,E)=(P,E ) if and only if: (1) Vi, (id, i,.j)eDR implies that E (i) < E (j). and (2) for any two triples (id, i, j) and (id, i', j ' ) in DR with the same identifier id, either E p (j') < E (i) or E (j) < E (i') holds. What condition (l) implies is that variables must be properly up- dated before used, and condition (2) prevents variables from being updated before they are used by all pertinent statements. (a) Condition (l) «p(i) A \ Ep(j) Q id i) (b) Condition (2) E p (i) © Kp(iO © A i id A E p (d) © Ep(j«) 4 id © /A or /A Ep(i') © E p (i) 1 ld A A A Ep(j') © E p (d) 4 id Figure 5»1« Conditions for the Output Equivalence 128 Proof of Theorem 1: (1) if part: Assume that a statement S(i) receives data from statements S(0,S(i ), . . .,S(i, ), i.e. for each pair i and i (s=l,2, . . .,k) there is an identifier id such that (id ,i ,i)eDR. Now let E — s — s s be an execution order which guarantees that (l) before S(i) is computed, all S(i ) are computed, and (2) between the computation s of S(i ) and S(i), no statement updates id , then it is clear s s that (C(OUT(S(i))) after S(i)(P,E) = C(OUT(s(i))) after S(i)(P,E Q ) providing that all OUT(S(i )) have appropriate values. Note that 5 the above two requirements are equivalent to conditions (l) and (2) of the theorem. Then by induction, we can show that if conditions (l) and (2) hold for all statements, then (P, E)~(P, E~). (2) only if part: We give an example to show that if an execution order violates condition (l) or (2) then we cannot get an output equivalent execution. Now let P be S(l): a := b; S(2): c := a; S(3): b := e. Then DR = [(b,0,l),(e,0,l),(a,l,2),(c,2,4),(b,3A)}, and (P,E Q ) gives 129 C(0UT(S(1))) after S(l)(P,E Q ) = C^b), C(0UT(S(2))) after S(2)(P,E Q ) = C^b), and C(0UT(S(3))) after S(3)(P,E Q ) = C^b). Now let E (P) = {(1,2), (2,1), (3,3)} which violates the first condition of the theorem, and Eg(P) = { (l, 2), (2,3), (3,1)) which violates the second condition. Then C(0UT(S(2))) after S(2)(P,E 1 ) = C (a) and C(0UT(S(1))) after S(l)(P,Eg) = C (e) which do not agree with corresponding values produced by (P, E»). (Q.E.D.) Theorem 1 gives more meaningful executions compared to the previous results [5l [10]. For example let P be: S(l): a:=f 1 (x) 8(2): b:=f 2 (a) 8(5): c:=f 3 (b) S(k): b:=f u (x) S(5): d:=f (b,c). Fisher [10], for example, would give the following execution (P,E) as an "equivalent" execution to (P,E ). 130 Step 6) Q 1 aJ |b E= ((1 ' 1) ' (2 > 2) > (5 > 5) ' ^'V' (5A)) s I- - This, however, does not give correct results unless P is properly modified. Note that the variable b carries two different values between steps 2 and 3 which is physically impossible. 
Theorem 1 does not recognize such an execution as "equivalent" to (P,E n ). Theorem 'c M (P,E)=(P,E Q ) if and only if (1) Vi, (id, i, j) eDR implies that E p (i) < E p (j), and (2) V., (id, i, j) eLR implies that E (i) < E (j). J Example 2 : P: S(l): a:=b+c; S(2): d:=a+e; S(3): a:=q+r; SCO: h:=a+s. Let E(P) = {(3,1), (U,2), (1,3), (2,i+)}- Then (P,E)£(P,E Q ). E, however, violates the second condition of Theorem 2, i.e. (a, 2, 3) eLR but E p (2) i £ v £ J and - k < w < m, there is no id' such that (id',v, w) eDP. Then replace every occurrence of id in S, by id' where id' ^ M. Gold [17] presented a similar transformation to describe his model for linear programming optimization of programs. After the transformation is applied S, and S can oe processed in M parallel, and still (P,E )~(P',E ) holds where P' is the result^ic program a: the application of T, on P. This shows that the second condition of Theorem 2 is not essential, i.e. it can be removed by introducing extra locations if necessnry. 135 6. PARALLELISM IN PROGRAM LOOPS 6.1 Introduction 6.1.1 leplacement of a j f p r . Statement with Many Statements Using the results from the previous chapter, now let us study loops in a program, e.g. ALGOL for statements or FORTRAN DO loops, to extract potential parallelism among statements. Given a loop P, we seek an execution order E with the minimum length among all possible ones. Sometimes it may be appropriate to get a loop P' from P by the previously introduced transformation for which there M is an execution order E' such that (P',E* ) = (P,E ) and lg(E' ) is the minimum (For the definition of =, see Chapter 5)- As stated before, in this chapter our main concern is the parallelism among statements (inter statement parallelism). For example, we are interested in finding out that all 10 statements (A[I] := A[ I + 1] + FUNC(B[I]); (1=1, 2, ..., 10)) in Fl can be computed in parallel, whereas statements in F2 cannot be (The notation follows ALGOL 60 [3]): Fl: for I := 1 step 1 until 10 do A[I] := A[I + 1] + FUNC(B[I]); F2: for I := 1 step 1 until 10 do A[I] := A[I - 1] + FUNC(B[I]). First several notations are presented. According to the ALGOL 60 report [3], a for statement has the following syntax: 136 < for statement > ::= < for clause > + < statement > + < for clause > ::= for < variable > :» < for Hat > do. An instance of this is : for I. := ... do ■ m 9 for I := ... do begin EL ; S ; . . . ; S end . For the sake of brevity, we shall write (i. *-L n , I_ *■ L_, .... I *- L ) (S, , S^, 112 2 . n n 1' 2 . .., S ) or (I , I , ..., I )(S. , S_, . .., S ) for the above for statMMnt m ± d n ± d m ■ instance where I, is called a loop index, L is an ordered set and called a loop list set , and S is called a loop body statement with a statement identification number p (which is different from a statement number (see Chapter 5))- As its name suggests, a loop list set represents a < for list >, e.g. L, = (1,2, 3 A, 5,6) represents "I. := 1 step 1 until 6." In general we write L, (i) for the i-th element of L thus L. (|L. |) is the last element of L, . Now to facilitate later discussions we introduce the following notation. Let B = (b.,,b . ...,b ) (b. > for all i) and (i n ,i ,....i ) be n-tunles 1 2 n i v 1 2 n' # of integers. Then we define the value of (i.,i , ...,i ) w.r.t. B as follows.* — — — id n 7r +=* where * is the Kleene star. JUL 7nr F or convenience we write i(s..t) for (t-s+l) integers i , i , ..,, S S ^* A. i + i) i + , e.g. (i(l..s), i(s+2..n)) means (i.,....i ,i ^,...,i ). 
L ~- L b is s+2 a Also (|L(s..t)|) means ( |L g | , |L g+1 | , . . ., |L t | ). Finally (i(n)) means n i's e.g. (1(3)) = (1,1,1). 137 n n V ((i(i..n))|B) = 2 i.B - IB. +1 n+1 where B. = f b. ,. and B = b ,, = 1. j , . k+1 n n+1 This notation is introduced so that the relations V((1,1,...,1,D!B) = 1, V((1,1,...,1,2)|b) = 2, V((l,l,...,l,b )|B) = b , n V((l,l,...,l,2,l)|B) = b n + 1, V((1,1,...,1,2,2)|B) = b n +2, hold. and V((b 1 ,b 2 ,...,b n )|B) = b^x-.^ For example V( (2,3,1) I (3A> 5) ) = 31. An n -tuple B is called a base. The inverse function of V is also defined as V~ (t | B) = (i(l..n)) if V( (i(l. .n)) |b) = t. Note that V" 1 is not one-one e.g. V _1 (l5| (J>,k, 5) ) = (2,0,0) or V~ 1 (15| (3A, 5) ) = (l,i+,0). An n-tuple (i(l..n)) is said to be normalized if b . > i. > for all J J n n j. Let (i(l..n)) be normalized. Then 1 < V( (i( i. .n) ) | B) < 2 b.B. - 2 B.+l. 3-1 J J 0=1 ° n n , If 1 < t < Z b.B. - ZB.+l, then V* (t | B) has unique normalized (i(l..n)) as d-i J J j=i J its value. 138 VTe say that normalized (i(i..n)) ranges over B = (b(l..n)) in n increasing order if V( (i(i. .n) ) |B) takes all values, between 1 and £ b.B. - J=l J J n £ B.+l in increasing order as (i(l..n)) changes. Notationally we write J=l J (l(n)) < (i(l..n)) < (b(l..m)). Finally we let (i(l..n)) > (j(l..n)) if V((i(l..n))|B) > V( (j(l. .n)) |b) and (i(l..n)) = (j(l..n)) if V((i(l..n))|B) - V((j(l..n))|B) The following lemma is an immediate consequence of the above definition. Lemma 1: Let B = (b(l..n)) where V.b. > 2. Then li — (1) v((a(l..n))|B) < V((a'(l..n))|B) implies that n v(( ai - a 1 ',...,a n - a n ')|B) < - z B . or v((a 1 - e^', . . ,,a -a n ' ) |b) < V((0(n))|B). (2) V((a 1 '+c 1 ',...,a n '+c n ')|B) = V( (a^, . . . , a n +c n ) | B) ana V((a'(l..n))|B) > V((a(l..n))|B) imply that V((c-- C,',,.., n C n' C n' )|B) - ' . ZB j or V (( c i- c 1 '^-^ c n - c n ')l B ) < V((0(n))|B). (3) Let < [a. | < b for all j. Then V( (a(l. .n) |b) < Vi(0(n) ) |b) if and only if there is h such that Vk (1 < k < h), a. = ana a. < 0. 139 A loop must be replaced with a sequence of statements so that we can use the results of the previous chapter. For example we replace for I := 1 step 1 until 10 do SI: A[I] := A[I] + B[I]; with the sequence of ten statements A[l] := A[l] + B[l]; A[2] := A[2] + B[2]; A[10] := A[10] + B[10]. /" \ In general we will get J it |L.|-nJ statements after the replacement of a loop P: (I, . I_, . . , I )(S -S^;..:S ). Any statement in the set of replaced r 1' 2 n 1' 2 m statements can be identified by an n-tuple (i(l..n)) which corresponds to values of I 1 , I 2 , ...,I n (i.e. L 1 (i 1 ),L 2 (i 2 ), . .,L n (i n )), and p which represents a statement identification number. Thus an (n + l) -tuple (i(l..n),p) serves as a statement number , and we write S((i(l..n),p)) to denote a particular statement in the set of replaced statements, e.g. in the above example S( (3, l) ) = A[ 3] := A[3l + B[31- The actual statement which corresponds to this is the statement S with L n (i n ),..., L (i ) substituted into every occurrence of I,.....I in S , p ll'nn 1' ' n p and we also write S [L, (i, ),..., L (i )] for this. p 1 1 ' n n / n , \ These ir |L. m I statements are to be executed according to the presented order (i.e. the order specified by for loop lists). In other words, the statement S( (l, 1, . . ., 1, l) ) is executed first, S((l, 1, . . ., 1,2) ) second, ..., 1U0 the statement S( (i(l. .n),p) ) is executed V( (i(l. .n),p) | ( |L(l. .n) | ,m) )-th, .. ., and the statement S(( IL^J, . . 
., |L |,m)) is executed lastly. Formally, as the essential execution order we have: E Q (P) = {((i(l..n),p),V((i(l..n),p)|B)) | (l(n),l) < (i(l..n),p) < (|L(l..n)|,m)) where B = ( |L(l. .n) | ,m) . Example 1 : for I, := 1 step 1 until 10 do for I := 1 step 1 until 10 do begin SI: A 1 [I,,I ] := A 2 [I--1,I ] +B 1 [I 1 ,I ]; ■1^2- '1 •l'*2- S2: B^I^Ig-l] := A 5 [I 1+ 1,I 2 ] + B 5 [I 1 , Ig+1] ; end is executed as S( (1,1,1)): A X [1,1] := A 2 [0,1] +B 1 [1,1]J S((l,l,2)) S( (1,2,1)) S((l,2,2)) B 2 [1,0] : = A 5 [2,l] + B 3 [l,2]; S((10,10,2)): B 2 [10,9] := A 3 [ll,10] +B^[10,11]; The superscript is used to distinguish different occurences of A and B. 11H A[(i(l..n))] represents a form in which L.. (i. ),..., L (i ) are substituted into T , .. .,1 in index expressions, e.g. in the above example A 2 [(i 1 ,i 2 )] = A 2 ti x -l,i 2 ] and A 2 r(3,2)] = A 2 [2,2]. Finally a set of inputs to a statement S( (i(l. .n),p)) is denoted by IN(S((i(l..n),p))). Similary OUT(S((i(l. .n),p))) represents a set of outputs from S((i(l..n),p)). From the above example we have, e.g. IN(S(l,l,2)) = (A 5 [2,1]*B 5 [1,2]) and 0UT(S(1,1,2)) = {B 2 [1,0]}. 6.1.2 A Restricted Loop In what follows, we mainly deal with a restricted class of for statements. Two restrictions are introduced. Let a loop with m body statements be P™ = (l 1 ,I 2 ,...,I n )(S 1 ;S 2 ;...;S m ). Restriction 1: A for list set L. must be an arithmetic sequence, i.e. L = (.1,2,3,..., t) (1) for all i Restriction 2 Let (A ,A , ...,A 1 be a set of all array identifiers in F where the h-th occurrence of A, in P has the following form (where the superscript h is 142 used only if it is important to distinguish different occurrences of A, ): A^[F(k,h,l), F(k,h,2), ..., F(k,h,n)]. (2) For fixed k and j, F(k,h, j) has an identical form for all h, i.e. either F(k,h,j) = I. + w(k,h,j) J or F(k,h, j) = (i.e. vacant). w(k,h, j) is a constant number. Also we assume that each A, appears on the left hand side of statements at most once. An example of a restricted loop is: for I := 1 step 1 until 20 do for I := 1 step 1 until 30 do for I, := 1 step 1 until kO do begin SI: A 3 [ 1^1,12+3,0] := AjCIylg-J,)*] + ^[0,0,1^]; S2: A 2 [0,I 2 ,I 3 -1] := A ? [ 1^1,12,0] + A.^0,0, 1^1] ; S3: A 1 [0,0,I 5 +1] := A 2 [0,I 2 -1,I 3 ]; end ; Note that, for example, A, always appears as A,[I, +w(3,h,l), I +w(3,h,2),0], thus the first occurrence of A, is A 3 [F(3,1,1), F(3,l,2),0] = A 3 [ 1^(3,1,1), I 2 +w(3,l,2),0] • AjCXj-1, Ig+3, 0] - If there is no ambiguity, we write e.g. A,[I, -1, I 2 +3] f° r A,f I, -1, I 2 +3>0] (which is the conventional form). 1U3 We also write F(k,h, j)(i) for the resultant expression obtained by- substituting i into I. in F(k,h,j), e.g. A,[F(3, 1, l)(2), F(3,l,2)(3),0] = A,[ 1,6,0] (= A,[l,6] conventionally). A single variable may be introduced as a special case of array indentifiers, e.g. we write for I := 1 step 1 until 1 do A[I] := for T := .... 6.2 A Loop With a Single Body Statement 6.2.1 Introduction First we shall deal with the case where a loop has only one body- statement (i.e. m = l). Let a loop with a single body statement be P = (I , I , ...,I )S. Since there is only one statement we may drop the statement identification number. Then a statement number for a replaced statement becomes (i(l..n)) and as the essential execution order we have: EqCP 1 ) = {((i(l..n)), V((i(l..n))|(|L(l..n)|)) | (l(n)) < (i(l..n)) < (|L(l..n)|)}. Also in this case we only have to consider the array identifier which appears on the left hand side of S. 
Hence instead of s array identifiers we only have one array identifier (see Restriction 2 of Section 6.1.2). Hence we drop k and write A h [F(h,l), F(h,2), ..., F(h,n)] and Ikk F(h,j) = I + w(h,j) J for the h-th occurrence of A for Eq. (2) of Section 6.1.2 (the superscript is used if it is necessary to distinguish the different occurrences of A). Furthermore we assume that F(h, j) ^ for any h and j. Now let us study the following two examples. Gl: for I := 1 step 1 until 10 do A[I] := A[I] + 5; G2: for I := 1 step 1 until 2 do for J := 1 step 1 until 10 do A[I,J] := A[I-1,J+1] + 5; Assume that an arbitrary number of PE's are available. Then: Gl: All ten statements (A[I] :- A[I] + 5) can be computed simultaneously by 10 PE's. G2 : A[1,J] and A[2,J-2] can be computed simultaneously by two PE's at the J-th step (J=l,2, . ..,10). In what follows, the above two types of the interstatement parallelism are studied. Before we go into the details, a few comments are in order with regard to real programs. A for statement with a single body statement, (i, ,--.,I )S, can be classified from several different points of view. First of all let us take a for list set L.. As a simplified case we have L. = (s ., s .+1, . . ., t . ) (t . = (|l.|-1) + s.) which is equivalent to an ALGOL statement "for I. := s. step 1 until t. do". Knuth stated [9] that examination of published algorithms KJ showed that well over 99 percent of the use of 'the ALGOL for statement, the ll*5 value of the step was ' +1 1 , and in the majority of the exceptions the step was a constant. This statement was confirmed by checking all Algorithms published in the Communications of the ACM in 1969- There were 23 programs and 263 for statements used. Only six uses were exceptions (z 3 percent). Next let us examine a body statement S. Then either (l) the left hand side variable of S (i.e. OUT(S)) is a single variable, or (2) OUT(S) is an array identifier. In case of (2) S is of a form A°[F(0,l),...,F(0,n)] := f (A^FCl, l), . . . ,F(l, n)], . . ., A P [ f (p, l), . . . ] ) . Now S has either one of the following five forms. M (1) OUT(S) is a single variable t. (i) t := a function which does not depend on t, e.g. t := a + 5, (ii) t := f(t), e.g. t:= t + a, (2) OUT(S) is an array variable A: (i) A [F , ...,F ] := a function which does not depend on A, e.g. A[I,J] := b + 5, (ii) for all h F(0, j) - F(h, j) is a constant for each j, e.g. A[I,J] := A[I-5,J+3] + A[I+l,J-3] + 5 (iii) other cases, e.g. A[I,J] :- A[2I,J-5] + a. Note that if S is of Form (l-i), then P 1 = Sfl^ClLj), L 2 (|L 2 |), ..., L n (|L n |)]. For example let P 1 be We use a lower case letter for a single variable and an upper case letter for an array variable. 146 for I := 1 step 1 until 5 do t := A[I] - 1. Then after the execution of r, t = A[ 5] - 1« Again all Algorithms published in the CACM were checked (this time the check was made against Algorithms published in 1968 and 1969. ) There were 52 programs altogether and 117 for statements with a single body statement. The details were: (1) (2) No . of Exam pies Percentage (i) (ii) 1+2 35.8 (i) 18 15.4 (ii) 33 28.2 (iii) 2k 20.6 117 100.0 In what follows we deal with Forms (2-i), (2-ii) and (2-iii). Form (l-ii) has been discussed in Chapter h. 6.2.2 Type 1 Parallelism 6.2.2.1 General Case As stated in Chapter 5> a block of statements P need not be executed according to the essential execution order E„ and may be executed by any execution order E as long as (P, E n )=(P, E) holds. 
In this section we study a special class of execution orders called type 1 execution orders. This execution order is defined for each loop index I (u=l,2, . . . ,n) and hence there are n of these. ll+7 Definition 1: A type 1 parallel execution order with respect to I (we write 1-p w.r.t. I ) is given by E(P) = {((i(l..n)),V((i(l..u-l),i(u+l..n))|(|L(l..u-l)|, |L(u+l. .n) | )) |(l(n)) < (i(l..n)) < (|L(l..n)|)), and is represented by E[ I ] . Figures 6.1 and 6.2 illustrate execution orders E~ and Efl 1. D L u J Note that Efl ]((i(l..n))) = E[I ] ( (i f (l. .n) ) ) if i = i* for all U li K. K. k = 1,2, . . . , u-l,u+l, . . . ,n. Furthermore note that if V((i(l..u-l))|(|L(l..u-l)|)) > V((i'(l..u-l))|(|L(l..u-l)|)), then E[I u ]((i(l..n))) > E[ y ( (i» (1. .n) ) ) . By introducing extra |L | PE's, the computation time becomes one n n |L I -th of the original, i.e. tt |L.| steps instead of it |L.| steps, where one U j=l J j=l 3 step corresponds to the computation of a body statement. We now introduce TRANQUIL notation [2] to illustrate Definition 1. In TRANQUIL for (I) sec} (L) do S st^oids for for I := (for list set) do S. Also in TRANQUIL for (i) sim (L) do S indicates that statements S(L(i)) are executed simultaneously for all L(i) in L. Then Definition 1 amounts to obtaining H+8 C\J \H ?■ r^ : ' O — -H • V > V + 3 + -<><^o h *1 H O a; -p fit v CO • •H -> •H * bO • v c\? 0) -P 3 ti 49 ' E -s C CO • o o ^ h - O -P 3 co C cd ,£ •P O a5 -P^ 0) -^ o fa £i -P co 0) -p ctJ •H TJ •H SB H O o co c u- ( O CO W ft co CM 3 3 + 3 3 + 0 Figure 6.3* Conditions of Parallel Computation in a Loop computation proceeds from the leftmost column to the rightmost column while in a column the computation proceeds from the top to the bottom sequentially. On the other hand if P is executed by E[ I ] then we proceed to compute from the top row to the bottom row while we perform computation in each row simultaneously. Each computation S( (i(l. .u-l), i(u. .n) ) ) uses inputs IN(S( (i(l. .n))) and updates the output 0UT(S((i(l..n)))). Then as we studied in Chapter 5, we have to make sure that the computation S((i(l..n))) (marked x in Figure 6.3) does not receive any data which are to be updated by the computation in the region R, i.e. there must be no id such that (id, (i(l..u-l),i'(u..n)), (i(l..n))) cDR 151 holds where (i'(u..n)) > (i(u..n)) and 1^ < i^. Similarly the computation S((i(l..n)}) must not use any data which are to be updated by the computation in the region Q, i.e. there must be no id' such that (id\ (i(l..n)),(i(l..u-l),i"(u..n))) €LR holds where (i"(u..n)) < (i(u..n)) and i u " > i u « The above observation gives the following theorem. Theorem 1 : Let E[I ] be a type 1 parallel execution order w.r.t. I . Then M (P^Efl ]) = (P 1 ,E Q ) if and only if there are no id, id', (1(1. .n)), (i(l..u-l), J i'(u..n)) and (i(l. .u-l), i"(u. .n) ) for which either (1) (i) i ' < i and ' u u (ii) (i'(u+l..n)) > (i(u+l..n)) and (iii) (id, (i(l. .u-l),i' (u. .n)), (i(l. .n))) eDR, where id e OUT(S((i(l..u-l),i"(u..n)))) and id € IN(S( (i(l. .n) ) ) or (2) (i) i " > i and u u (ii) (i"(u+l..n)) < (i(u+l..n)) and (iii) (id*, (i(l..n)), (i(l.. u-l), i"(u..n))) eLR where id'e OUT(S((i(i..u-l),i"(u..n)))) and id e IN(S((i(l. .n))) hold. T (±' (u+1. .n)) ^ (i(u+l..n)), for example, means V( (i 1 (u+1. .n) ) | B) > V((i(u+1. .n))|B) where B = ( |L(u+l. .n) | ) . Unless specified, the base ( | L(s. .n) | ) (=( |L |,...,|L |)) is to be understood for (i(s. 
.n)) 152 Let S be of a form A := f(A , ...,A P ) where A is an array identifier and the superscript is used to distinguish different occurrences of A. Then id in the first condition of Theorem 1 corresponds to those A [(i(l..n))] for which A [(i(l..u-l),i'(u..n))] = A [(i(l..n))] holds together with the three conditions (i), (ii), and (iii) of (1) between (i'(u..n)) and (i(u..n)). Similarly id' corresponds to those A [(i(l..n))] for which A [(i(l..u-l), i' (u. .n))] = A [ (i(l..n))] holds. Thus A (l < h < p) can be classified into three groups : CI = {h|A satisfies the first condition} C2 = {h|A satisfies the second condition) C3 = (1,2, ...,p) - CI - C2. Note that CI n C2 = 0. Example 2 : Let P 1 be (I x - (1,2,3), I 2 - (1,2,3))(A°[I 1 ,I 2 ] := f(k\ I^Ig+l])). Then for i^ = 1< i g = 2 and (i 2 ') = (3) > (ig) = (2), we have A°[ (i^i^ )] = A [(i ,i )] = A[l,3], or (A[l,3], (1,3), (2,2)) eDR. Thus P cannot be computed in 1-p w.r.t. I,, and CI = [1] . From this argument it should be clear that if a body statement is of Form (2-i) (see Section 6.2.1), then the loop can be computed in 1-p w.r.t. any I u (u=l,2, ...,n). 153 6.2.2.2 A Restricted Loop If a loop is a restricted loop, then Theorem 1 may be simplified. First we define a vector R(h) for each h = 1,2, . ..,p: R(h) = (R 1 (h),...,R n (h)), where R (h) = F(0,j) - F(h,j) J = I, + w(0,j) - (I + w(h,j)) J J = w(0,j) - w(h,j). For example we get R(l) = (-1,8) from a statement A°[ I r l,I 2 +3] := f(A X [ I^Ig-5]). Then we use these vectors to check parallel computability as follows. Also for convenience we write R'(u,h) = (R 1 (h),...,R u _ 1 (h)) and R"(u,h) = (R u+1 (h),...,R n (h)). Theorem 2 : If one of the following two holds for any of R(h) (h = 1,2, ...,p), then P cannot he computed in 1-p w.r.t. I . (1) (i) R'(u,h) = (0,...,0) and (ii) R (h) > and u (iii) R"(u,h) < (0, ...,0) and V.(u + 1 < j < n) | R.(h)|<|L.| - 1. 15U (2) (i) R*(u,h) = (0, ...,0) and (ii) R (h) < and (iii) R"(u,h) > (0,...,0) and V (u + 1 < j < n) | R (h)|<|L.| - 1. That the theorem is valid is the direct consequence of Theorem 1, i.e. the first check of the theorem corresponds to the first condition of Theorem 1 and the second check corresponds to the second condition. For example the first condition of Theorem 1 says that if (id, (i(l..u-l), i'(u..n)), (i(l..n))) eDR holds f or i ' < i and (i'(u+l..n)) > (i(u+l..n)), then P cannot be computed in 1-p w.r.t. I , where u* id e 0UT(S((i(l..u-l), i'(u..n))) and id e IN(S( (i(l. .n) ) ) ) . Then id represents the element of A for which A h [(i(l..n))] = A°[(i(l..u-1), i'(u..n))] holds. Now this implies that F(h,j)(L.(i.)) = F(0,J)(L (i .)) J J J J for j < u and F(h,d)(L,(i,)) = F(0,j)(L.(i ')) for j > n. Hence L (i ) + w(h,j) = L (i ) + w(0,j) J J J J or R.(h) = for j < u, and J or L (i ) + w(h,j) = L.(i') + w(0,d) J J J J i.' = i. - R.(h) 3 J d 155 for J > u. Also (i'(u+l..n)) > (i(u+l..n)) with B = (|L(u+l..n)|) becomes V((i u+1 -R u+1 (h),...,i n -R n (h))|B) > V((i(u+l..n))|B). Then by Lemma 1, V((R u+1 (h),...,R n (h))|B) < V((0, ...,0)|b). Thus the first check of Theorem 2 is verified. The second check can be varified similarly. Now let us consider the number of checks required. For each A (h=l, 2, . ..,p) which appears on the right hand side of S, we first obtain a vector R(h). Then for each loop index I , we perform the two checks given by Theorem 2 for all R(h) (h=l, 2, . . .,p). Since there are n loop indicies, in total we perform 2np checks . 
The procedure described in this section can be extended to cover nonrestricted loops, too. Let S be of a form A°[F(0,l),...,F(0,n)] := f (aV(1, l), . . . ,F(l,n)] , . . . , A P [ (F(p, l), . . . ] ) and we define a vector R(h) for each h = 1,2, • ..,p as we did before, i.e. R(h) = (R 1 (h), ..., R n (h)) and R (h) = F(0,j) - F(h,j). J Since a loop is not restricted, F(0, j) and F(h, j) may take any form and hence R ^h) d 156 may not be a constant number but rather a function of loop indicies, e.g. R(h) = I, + 21, - 5. Hence, in the most general case, it is necessary to check the two conditions of Theorem 2 for all values of (i(l..n)) (i.e. (l(n)) < (i(l..n)) < ( lL-1, |L |, • .., |L | )) to examine type 1 parallel comput ability, i.e. n 2( it |L.|) checks are required for each R(h) (h=l, 2, . . .,p). In many cases, we d=i J can expect that the number of checks required is far smaller than that. For example if R(D = (21^21^,1^21^), then only 2( | L, |x|L, | ) checks are required, i.e. it is not necessary to check for those loop indicies, e.g. 1^, which do not appear in R.(l) (d"l, 2,3**0. 6.2.2.3 Temporary Locations In this section we mean a restricted loop by a loop. The second condition of Theorem 1 (or 2) may be dropped by introducing extra temporary locations by applying Transformation T of Chapter 5 on P , i.e. if CI = and and C2 / 0, then temporary locations may be set up so that P can be computed in parallel (for CI and C2, see Section 6.2.2.1). Let heC2. This implies that there are (i(l..n)), (i(l. .u-1), i T (u. .n) ) and id (see Figure 6.^) for which (id, (i(l..n)),(i(l..u-l),i'(u..n))) eLR holds and id = A h [(i(l..n))] € IN(S((i(l..n)))) and 157 id = A°[(i(l..u-l),i'(u..n))] e OUT(S( (i(l. .u-l), i ' (u. .n)) )) ) . If a loop is confuted in 1-p w.r.t. I , then we have E[I u ]((i(l..u-l),i'(u..n))) < E[I u ]((i(l..n))) while E ((i(l..u-l),i'(u..n))) > E ((i(l..n))). Hence A [(i(l..n))] will be updated "by S( (i(l. .u-l), i' (u. .n) ) ) before being used by S( (i(l. .n)) ). That is, if we compute P in 1-p w.r.t. I we must keep the old value of A [(i(l..n))] which otherwise will be updated by S( (i(l. .u-l), i'(u..n))) at the E[I ] ((i(l. .u-l), i* (u. .n) ) )-th step separately until it is used by S((i(l..n))) at the E[ I ] ( (i(l. .n)))-th step. The period of time, t , through which the old value of A [(i(l..n))] must be kept for the computation S((i(l..n))) is given by t h = E[I u ]((i(l..n))) - E[I u ]((i(l..u-l),i'(u..n))) = V((i(u+l..n))|B) - V((i'(u+l..n))|B), where B = |L(u+l. .n) | ). Then as we showed in Section 6.2.2.2, in case of a restricted loop, we can show that n t = V(R"(u,h) B) + Z B - 1 h , t s s=u+l n where B = ir I L , . I . The details are omitted. s ' t+1 ' t=s 158 T u+r • * * vl ' n\ i u i ' u •n)) •n)) (i'(u+l. /> y (i(u+l. o" A. Figure G.k. An Illustration of t Now max t gives the maximum period of time through which A [(i(l..n))] must be h€C2 kept. Since we have | L | of them (i.e. |L | statements are computed simultaneously), the total amount of temporary locations required will he |L | x u' max t, . Additional |L | locations are required for buffering (see Example 5)« u Hence we have the following theorem. Theorem 3: The maximum number of temporary storage locations required is L | x (max [V((R (h),...,R (h))|B) + Z B ] ) heC2 U+1 n s=u+l S where B = ( |L(u+l. .n) | ) and B = w |L , , | and B = 1. 
t=s 159 Example 3 : Let P 1 be for (I ) se£ (1,2, ...,U0) do for (I ) se£ (1,2, ...,1j-0) do A[I X ,I 2 ] := A[I 1+ 2,I 2 -3] + 2; P as it is cannot "be computed in 1-p w.r.t. I because it violates the second condition of Theorem 2. Now we modify P as follows by introducing temporary arrays Tl(UOxl) and T2 (1*0x3). for (I ) se£ (1,2,. ..,^0) do for (I ) se£ (1,2, ,..,kQ) do begin SI: T1[I 1 ] := Afl^Ig] ; S2: k[l v l 2 ] := T2[I ] _,I 2 mod 3] +2;^ S3: T2[I 1 ,I 2 mod 3] := Tl[ Ij ; end . Then all three statements can be computed in 1-p w.r.t. I , i.e. we can replace seq in the first for statement by sim . The original P , if executed sequentially, takes 1600 steps whereas the modified P takes only 120 steps if executed in parallel with respect to I . JL "a mod b = a Also we assume that T2 is properly initialized before the computation of the loop, i.e. store A[l,*], A[2,*] and A[3,*] in T2[l,*], T[2,*] and T[0, *]. i6o 6.2.3 Type 2 Parallelism In this section we mean a restricted loop by a loop. Since the conflict between two statements S(i) and S(j) due to the existence of an identifier id such that (id, i, j) eLR may be resolved by introducing temporary locations (see Chapter 5 and the previous section), such conflict will not be taken into account to check parallel computability throughout the rest of this chapter. This section describes the second type of parallelism, i.e. type 2 parallelism, in a for statement with a single body statement. Type 2 parallelism is introduced to resolve the conflict due to the first condition of Theorem 1. The following example illustrates it. Example k : P: for I := 1 step 1 until kO do for I := 1 step 1 until kO do A°[I 1 ,I 2 ] := a\ 1^1,12+1] + A 2 [I 1 ,I 2 -1]; Since R 1 (l) a 1 > and (Rp(l)) = (-1) < (0) hold, P cannot be computed in 1-p w.r.t. I . Now let us consider the I -I plane (Figure 6.5). Suppose that all S((i, ,i )) in the shaded area have been computed. Then at the next step those S((i * , i ' ) ) marked as HJ can be computed simultaneously, and at the following step all (2) can be computed simultaneously, and so forth. We can see that a heavy zigzag line travels from left to right like a "wave front" indicating that all statements on that front can be computed simultaneously. 161 Figure 6.5- Wave Front Note that computation of P by this scheme takes approximately 120 steps, while if P is computed sequentially it takes k-0 x h-0 = 1600 steps. Given a loop P , if P is computed in 1-p w.r.t. I , then a "wave front" is in parallel with the I axis of the I - I , x ••• X I plane, and ^ u u u+1 n * ' it travels in the increasing order of (i(u+l..n)). If P cannot be computed in 1-p w.r.t. I then it may be possible to find a "wave front" which is diagonal rather than horizontal as in Example k on the I - I - x ... x I plane, u u+1 n ^ 162 The direction of wave front travel tan a = slope of a wave front Figure 6.6. Wave Front Travel This wave front is such that all computations S((i(l..n))) which corresponds to points (i ,i ,,..., i ) which lie right next to a wave front can be computed simultaneously. In other words all necessary data to compute S( (i(l. .u-l), i(u..n))) have been already computed in the shaded area. The direction of a wave front's travel is perpendicular to the wave front. Now let us obtain the slope of a possible wave front for a restricted loop. Let P be a restricted loop. Assume that P cannot be computed in 1-p w.r.t. I . Then according to Theorem 2, this means that there are R(h) for which R (h) > and (R (h), . . 
,,R (h)) < (0, ...,0) hold (i.e. CI / 0). 163 Theorem k : The slope of a possible wave front in the I - I , x ••• x I plane is given by max he CI ^ ((-V((R u+1 (h),...,R n (h))| B ) - r B s +2)1 u s=u+l -1 n where B = ( lL(u+l. .n) | ) and B = 7r L, ,. and B = 1. s t+1 n t=s In Example k, the slope of the wave front is ■r(-("l"l + l)-l + 2) * 2. Proof of Theorem k : Let us consider S( (i(l. .u-l), l(u. .n) ) ) on the I - I x ... x I u u+l n plane. Assume that there is a variable id such that (Figure 6.7) (id, (i(l..u-l),i'(u..n)),(i(l..u-l),i(u..n))) eDR holds together with i ' < i and (i"(u+l..n)) > (i(u+l..n)), u u — i.e. id e IN(S((i(l..u-l),i(u..n)))) and id e OUT(S((i(l..u-l),i'(u..n)))). This implies that there is he CI such that A h [(i(l..u-l),i(u..n))] = A°[(i(l..u-l),i'(u..n))] holds. l£k (i(u+l..n)) (i'(u+l..n)) 1 1 T < "T" t / h // / 7\ a , \L. _x k R u (h) — ■* Figure 6.7. An Illustration for Theorem k In case of a restricted loop, we have i ' = 1 lUh) for j > u. Now let t, = V((i'(u+l..n))|B) - V((i(u+l..n))|B) then we get n t. = -V((R (h),...,R (h))|B) - Z B + 1 where B e n u+1 n . s s s=u+l n 7T |L, I and t=s t+1 B - 1. Now if we let the slope of a wave front be equal to 165 t + l t, + 1 h h i - i ' " R (h) u u u then A [ (i(l. .n))] and A [ (i(l. .U-l), i' (u. .n))] will be separated by it (Figure 6.7). The actual wave front is a zigzag line, rather than a straight line as shown in Figure 6.7- If there are more than one h in CI, then we choose a to be large enough so that all inputs to S( (i(l. .u-1), i(u. .n)) ) be inside of a wave front, i.e. t + 1 tan 3 = max ^-^p. h£Cl u (Q.E.D.) Now suppose we compute F in parallel w.r.t. I using a diagonal wave front whose slope is D = tan q> Then how many steps (one step corresponds to the computation of a body statement S) does it take to compute P ? Theorem c / \ The total number of steps required to compute P in parallel w.r.t. I using a diagonal wave front whose slope is D is given by /u-1 \ i n |L |D^ 1 u ' i T - P ' TT |L.| '.j=u+l J + 166 Proof: Let us consider the I - I ,, x ... x I plane. u u+i n end Figure 6.8. An Execution by a Wave Front Wave front W must travel from the start position to the end position on the n plane. How long does it take? It takes L + |L |d steps where L = it |L. U j=u+l 3 u-1 Since we have to process 7rlL.ll - I ,-■ x ••• X I planes, in total it ^ ._, ' j ' u u+1 n * ' u-1 becomes T = tt |L . I (L + |L |D). (Q.E.D.) Note that if a wave front is horizontal (i.e. if P can be computed in 1-p w.r.t. I ), then D = and T = tt IL.I. 167 6.2.4 Conclusion Assume that there are an arbitrary number of PE's available. Given a restricted single body statement loop, P 1 = (l r ...,I n )(A rF°,...,^] := f(A 1 [F^,...,Fj],...,A P [FP,... > ^])) we can check if F can be computed in 1-p w.r.t. I (u=l, ...,n) by Theorem 2. If it cannot be, then we can check for type 2 parallel computability w.r.t. I , i.e. find a possible wave front. In either case we obtain the number of u' n u-1 n computational steps required, i.e. T = ir |L.| or T = ( tt |L.|)( ir |L.| + U j=l 3 u j=l 3 j=u+l J lL I'D), where one step corresponds to the computation of the body statement S. Then among all possible choices, we would choose to compute in parallel w.r.t. I where T = min T . - 6.3 A Loop With Many Body Statements 6.3-1 Introduction In what follows we mean a restricted loop by a loop. 
Again a check against all published Algorithms in 1968 and 1969 CACM issues has been done, and it has been revealed that well over 50 percent of the cases of for statement usage (with more than two-body statements) are instances of restricted loops. Also as stated in Section 6.2.2.3 and Chapter 5, the LR relation may be disregarded by introducing temporary locations. Hence it will not be taken into account throughout the rest of this chapter. 168 Given a loop with m body statements, P , there are three different approaches to compute it in parallel. First it is possible to extend the procedure described in Section 6.2 by treating m body statements as if they were one statement. That is, we consider body statements as a function OUT(S(i(l..n))) = f(lN(S((i(l..n)))). For example SI: A[I,J] := f(A[I,J-l],B[I-l,J-l]); S2: B[I,J] := g(A[I-l,J-l],B[I,J-l]) yield S: {A[I,J],B[I,J]} := f(A[I,J-l],A[I-l,J-l],B[I-l,J-l],B[I,J-l]). Then we can apply Theorem 1 directly to check if e.g. S can be computed in 1-p w.r.t. I. The second and the third approaches can be illustrated by the follow- ing two examples. El: for I := 1 step 1 until kO do begin SI: A[I] := f(A[I],B[I]); S2: B[I] := g(A[ I],B[ 1-1] ); end ; E2: for I := 1 step 1 until i+0 do begin SI: A[I] := f(A[I-l],B[I-2]); S2: B[I] := g(A[I]); end. 169 In El, note that SI and S2 cannot be computed in parallel for all values of I because S2 has an iteration form B[I] := g'(B[I-l]). However, El may be replaced with two for statements : for I := 1 step 1 until kO do SI: A[I] := f(A[I],B[I]); for I := 1 step 1 until kO do S2: B[I] := g(A[ I], B[ 1-1] ) ; Now the first loop can be computed in parallel for all values of I while the second for statement is still an iteration. In general by replacing a single or statement with two or more for statements the parallel part may be exposed. In the second example, SI and S2 can not be computed in parallel for all values of I, nor can they be separated into two independent loops because SI uses values which are updated by S2 (i.e. B[I-2]), and S2 uses values being updated by SI (i.e. A[I]). However SI and S2 could be computed simultaneously while I varies sequentially if the index expression in S2 is A "skewed" as follows. E2' for I := 1 step 1 until kO do begin SI: Ari] := f(A[I-l],B[I-2]); S2»: B[I-1] := g(A[I-l])j end. JL Strictly speaking S2' should not be executed when 1=1 and an extra statement S2" : Bl^O] := g( k[kO]) is required after this loop. For the sake of brevity those minor boundary effects are ignored through- out this section. 170 Figure 6.9 illustrates the computation of the modified loop as well as the original loop. I SI S2 • * i-2 B[l-2] 1-1 A[ 1-1] 1 / / i 1 / A[i]- -»B[i] i-2 1-1 SI A[i-1] ^ A[i] S2' ^ B[i-2] B[i-1] Figure 6.9. Simultaneous Execution of Body Statements In general, the above three approaches could be tried in any combination. For example, we may first try the first approach, i.e. we try to execute body statements simultaneously for all values of some loop index. If this Tails, then we may use the second approach, i.e. we separate a loop or we replace a loop with as many for statements as possible. On a resultant for statement we again try the first approach (if it has only one body statement, then the results of the previous section can be used). If we fail again, then the third approach can be taken. We now describe each approach separately. Before we go further, we define the following notations. 
Without loss of generality we assume that the p-th occurrence of A, appears in S and k p also assume that S and S have forms P q 171 and S * (AP[F(k,p,l),...,F(k,p,n)] := f _.(...)) S = (-.. := f(...,A*[F(k,q,l),...,F(k,q,n)], ...))• q q Then we define a vector R(k,p,q) as follows R(k,p,q) = (R 1 (k,p,q),...,R n (k,p,q)) where R,(k,p,q) - F(k,p,i) - F(k,q,j). J - w(k,p,j) - w(k,q,j). If F(k,p, c i) = F(k, q, j) = 0, then we let R.(k,p, q) = 0. Finally we write J R'(u,k,p,q) = (R x (k,p,q), ...,R u _ 1 (k,p,q)) and R"(u,k,p,q) = (R u (k,p,q),...,R n (k,p,q)). 6.3*2 Parallel Computation with Respect to a Loop Index We first study the first approach described in the previous section, i.e. we treat body statements as if they were one statement and try to execute them in parallel with respect to some loop index. Let us consider P = (I. , I_, .... I ) (S., ; . . . :S ) . Then we treat m body 12 ' n 1 m J statements as one statement S where m 0UT(S((i 1 ,i 2 ,...,i n ))) = U OUT(S (d r i 2 , .••,!))) P=l and 172 m IN(S((i r i 2 ,...,i n ))) = IN(S 1 ((i 1 ,i 2 ,...,i n ))) U U [IN(S ((i^ig, p-1 ...,i n ))) - U OUTCS^dpig,...,^)))]. Having these two sets, we can use results of Section 6.3 directly. For example let us consider Theorem 2. Then we may modify Theorem 2 as follows. First suppose an array A£ appears in OUT(S( (i(l. .n) ) ) ) and A^ appears in IN(3((i(l..n)))). Then obtain R(k,p,q). Theorem 6 ; (cf. Theorem 2) For each A^ in 0UT(S( (i(l. .n) ) ) ), we obtain R(k,p, q) for all q such that A^ is in IN(S( (i(l. .n) ) ) ). Then if there is any R(k,p, q) which satisfies all three conditions described below, then S cannot be computed in type 1 parallel w.r.t. I . Conditions: (1) R.(k,p,q) = or for all j = 1,2, ...,u-l, J (2) R u (k,p,q) > 0, and (3) there is £(u+l<£ (F(k,h q ,u),...,F(k,h q ,n)) must hold- In the second case (F(k,h P ,u),...,F(k,h P ,n)) = (F(k,h q ,u),...,F(k,h q ,n)) must hold. These two make up the second condition. 177 From we can construct a dependence graph D with m nodes each of which represents a body statement, e.g. from Example 6 we get: In D u we call a series of © u , ^(p^Pg), © u (p 2 ,P 5 ), . . .^(p^p^), . . ., © (p k TtV^) a chain and write ch(p., p, ) for it. If p, = p.. then it is called a mesh M. We say that anode p. is in the chain ch(p ,p ) (or in the mesh M), or the chain ch(p, ,p, ) (or the mesh M) includes p.. Note that for nodes p and q there may be more than one chain which connects p to q. Now let Z = (p | there is no q such that 6 (q, p) holds) and Z = (p | there is no q such that 6 (p, q) holds}. Furthermore let PD(p) = {q | ch(q,p) exists} U (p} and SC(p) - (q | ch(p,q) exists} U {p} • (PD for predecessors and SC for successors). Then we classify nodes in D as _ u follows : Z 3_ = (P | For all r e PD(p), there is no mesh in D which includes r} , Z = {p I For all r e SC(p), there is no mesh in D which includes r) , j u and 178 Z 2 = N - Z 1 - Z 3 - Let Z 1 (or Z ) = (p^Pg, . ..,p } • Then we can order this set as p ' p ',..., p ' M in such a way that 6 (p.', p.') does not hold if i > j. Let us write 9 9 Z.(or Z,) for a resultant ordered set. Also we order Z. = {q.,q_, . . .,a } as Q L 2 = ^i''^'* '••> C V ) in such a way that ^i' < q i' if i < J- Now given a loop t where F = (I^,Ig, . . ., I n ) (S^jSgj . . . ;S ; ■ (i r ---,i u . 1 )UV.-.,i n )(s 1 ;S 2 ;...;s m )) = (i n ,...,i J?*, 1 u-l u we build the dependence graph D and obtain sets Z , Z and Z,, say Z. 
= {P 1 *P 2 >'"*P U ^ z 2 = ( r 2> "^ r w^ " ( m=u+v+w )- 6 6 6 From Z and Z, we obtain ordered sets Z and Z,, say Z = (p ',p p ', . . . ,p ') 6 6 and Z^ = (r 1 l ,P 2 , ,...,P w ')- A1 so we have Z g = (q^ , q^*, . . ., q^' ) . Then M p^~ (i)(s ,);(i)(s ,);...;I(S ,);(i)(s ,;...;S ,); ^1 y 2 *u 4 1 ^v (l)(S r ,);(I)(S ,);...j(l)(S ,) 12 w where (i) = (I ,1 .,...,1 ). u' u+1 n Note that Z (or Z,) together with 9 makes a graph which does not contain any mesh. To order Z (or Z,) the technique discussed in Chapter 7 niay be used. 179 Thus we have replaced a loop P with as many for statements as possible. We say that p is separable from !F if peZ (or peZ_) with respect to u, and that F is separable with respect to I . Also we say that p is separated with respect to I if P°J is replaced by many for statements as we showed above. 6.3-3.3 Temporary Storage Now let us study the following : 2 P : for I := 1 step 1 until 1 do for I : = 1 step 1 until kO do begin SI: A^y := Ag[I 2 ] + A^Ig]; S2: A k [I 2 ] := A.J Ij +A^[Ig]; end Then we have R(l,l,2) = (0,0) or © (1,2) holds and: V © 0] + AJlfO] . However the second loop, (I ,1 p)(S2), requires forty different inputs, i.e. Ap[l] + A,[l], . ,,A,JkO] + A,[^0]. Hence 2 it becomes necessary to modify P ' as follows: P l 2; (\^ 2 ^ S1: ^l'V := A 2 [I 2 ] + A 3 [I 2 ] ) ; (I 1 ,I 2 )(S2: AJI 2 ] := AJI^y ♦ A^]). >A[1] )A[1] Figure 6.11. An Introduction of Temporary Locations In general we apply the following transformation rule on a loop when it is separated. Assume that S and S are body statements in a loop F, and 6 (p, q) holds. Further assume that p is separated from TT with respect to 181 I (i.e. peZ.,qeZ ). Now let us consider the vector R"(u, k,p, q) . Let the value of the first element which is neither nor be e and its position he i. Then we let R(u,k,p,q) = (J | u < j < i and R,(k,p, q) = 0} . We order elements of R(u,k,p, q) by their positions in R"(u,k,p, q) and write R(u,k,p,q) = (r(l),r(2),...,r(t)). Then we apply the following on the loop F . Transformation T p : Transformation T is defined for the cases e < and e > separately. (1) e > 0. Change F(k,p,r(j)) and F(k,q,r(j)) to I y /j\ for j = 1,2, . ..,t. (2) e < 0. (i) Change F(k,p,r(j)) to 1,/j) for j = 1, ...,t. (ii) Change F(k, q,r(j)) to the following ALGOL program for j - l,...,t. -If (I r( . +1) = 1) and (l r( . +2) = 1) and ... (I r(t) = 1) then (if i r( . } = 1 then |L r(j) | else I r(j) - 1) else I r(i)'" Also chan g e F(k,h q , r(t) ) to the following ALGOL program: "if (I , , = l) then |L ,, J else I ,. % - 1." — r(t; ' r(t) 1 r(t) Example 7 • Let R"(5,k,P,q) = (R 5 (k,p,q),...,R 9 (k,p,q)) = (0,0,0,0,-1). Then we get e = -1 and l = 9 and R(5,k,p,q) = (6,8). Also assume that \lA = 182 |Lq| = 3« Originally S and S may look like y WyW :=f p ( -- ); S q : .. := f (...^[I^I^^yi],...); Now after Transformation T^ is applied, S and S become: y : w^'VW :=f ( -- ); S q' : " := t q^'" fA T/S I l' I 5 > B 6>I 7 >Bg,I 9 ],...), where Bg = if Ig = 1 then (if Ig = 1 then 3 else Ig-l) else I, and Bq = if In = 1 then 3 else I« - 1. Note that by applying Transformation T ? , temporary locations are eventually introduced. For example in Example 7, A, is changed to a seven- dimensional array from a four dimensional array by Transformation T . 6.3.k Parallelism Between Body Statements 6-3- i +-l Introduction Now we describe the parallelism between body statements. As stated before it becomes necessary to modify index expressions. 
In this section we give an algorithm which modifies index expressions properly. We first describe the algorithm in terms of a restricted loop with only one loop index, i.e. F^ = (i ) (S ;S ; . . . ;S ). Accordingly every array identifier in P™ is of a form A.[F(k,h, 1)] (= A. [I,+w(k,h, l)] ) where this is the h-th occurrence of A in F^. For convenience we drop the subscript of 183 loop index. The primitive execution order for p becomes E Q (P rn ) = {((i,p),V((i,p)|(|L|,m))|(l,l) < (i,p) < (|L|,m)}. For a given loop p , we consider the I-S plane which is an L by m grid. For example we have the following 1+0x3 grid for: for I := 1 step 1 until kO do begin 81: A^I-1] := A 2 [I] + A^I-l]; S2: A^I] := A^I+1] - A^I]; S3: A 3 [I] := A^I] + Ag[I]; end. On this grid, we only show the relation DR, e.g. (A,[i-1], (i-l,3)> (i>l)) e DR, SI S2 S3 # • i-1 i V .St. -+ i+1 • The direction of wave front travel Figure 6.12. Wave Front for Simultaneous Execution of Body Statements 184 Then the objective of this section is to discover a wave front W (cf. Section 6.2.4) which separates all inputs from the computation, e.g. in Figure 6.12 inputs (shown by 0) to S( (i,l)), S( (i+1,2)) and S((i,3)) (shown by t) lie above the wave front indicated by a dotted line. Hence S((i, l)),S( (i+1,2)) and S((i,3)) can be computed simultaneously while I takes values 1,2, ...,40 sequentially. In general to discover a wave front is equivalent to discovering a constant C(p) for each body statement S so that all statements S((i-C(l),l) ), .. .,S((i-C (p),p )),..., S((i-C(m),m)) can be computed simultaneously. 6.3.4.2 The Statement Dependence Graph and the Algorithm Let us consider the I-S plane again and consider the computation S((i,p))- Assume that there is id such that (id,(j,q),(i,p))eDR where either (i) j = i and q < p, or (ii) j < i and p / q, then clearly S((i,p)) and S((j,q)) cannot be computed simultaneously. Definition 3 ' The statement dependence graph (cf . the dependence graph in Section 6.3-3), D(p ), is defined by a set N of nodes 1, 2, . ..,m each of which corresponds to a body statement of P and the arrow relation a. From node p to q there is an arrow a(p, q) if and only if either one of the following two conditions hold. (l) For fixed i, there is k such that AJJ[F(k,h,l)(i)] € OUT(S((i,p))), A*[F(k,g,l)(i)] e IN(S((i,q))), 185 F(k,h,l)(i) = F(k,g,l)(i) and p < q. (2) For fixed i, there exist k and i' such that A£[F(k,h,l)(i')l € OUT(S((i',p))), A*[F(k,g,l)(i)] € IN(S((i,q))), F(k,h,l)(i«) = F(k,g,l)(i) and i" < i. In the first case we label the arrow and write f(p, q) = 0. In the second case the arrow is labeled 1 and we write f(p, q) = 1. The statement dependence graph for the previous example is: © © — °-*6) A chain of arrows, a(p r p 2 ), a(p 2 ,p^), . . ., a(p k _ 1 ,p k ),a(p k ,p 1 ) in D(P ) is called a mesh M and we say e.g. a(p.,p. ) is in M. If i(p.,p. ) = for some arrow in M, then M is called a part zero mesh . The following lemma is obtained immediately. Lemma 2 : If D(P ) contains a part zero mesh, then there is no wave front for P*. Henceforth we assume that D(F ) has no part zero mesh. Given D(FJ, we define a subset Z of N as follows: Z = (p there is no q such that i(p, q) = or f(q,p) = 0} . Z together with arrows gives a subgraph D g of D(F m ). Further we let 186 Zy. = (P peZ and there is no q such that i(q, p) = 0} . Now we give an algorithm to find a wave front for D(P m ). Algorithm 1 ; (1) Let C(p) = + oo for all p e N. (2) (i) Take any p from ZL. If Z. = 0, then go to Step (5). 
(ii) Let C(p) = 0. (3) (i) If there are nodes s and t such that a(s,t) exists, £(s,t) = 1 and C's) > C(t), then we let the value of C(s) be equal to C(t). (ii) If there are nodes s and t such that a(s,t) exists, f(s,t) = and C(s) > C(t), then let the value of C(s) he equal to C(t) - 1. Repeat (i) and (ii) until there are no s and t which satisfy (i) or (ii) in D(F m ). (h) (i) If for all p in Z C(p) ^ + », then go to Step (5). 'Otherwise take any p from Z for which there is q in Z such that a(p,q) exists and C(q) / + »• Let M = max (C(s)] where s 6 Z and C(s) ^ + oo. Then let C(p) = max (c(q) + 1,M} . Go to Step (3). For all p in Z with C(p) = + oo, let C(p) = M where M = max (C(s)} and seZ c(s) ^ + co. If Z ■ $, then let C(p) = for all p in Z. Example ( (1) D: 187 (2) Z = fl,2,3,U,6,7,8l, KD- J ^© ©-^^(D (3) Let C(l) = and apply Step (3) of Algorithm 1. Then we get C(l) = since there is no q such that a(q, l) exists. (k) Let C(2) = 1. (5) Let C(3) = 1. (6) Let C(*0 = 2. Then we apply Step (3) of Algorithm 1 on a(S,l+), a(8, 5), a(5,2),a(2,l),a(7,8),a(6,7),a(9,6),a(3,9),a(3,l) and a(9,l)« And we get C(5) = C(8) = 2. C(2) = 1. C(7) = 1, C(6) = 0, C(9) = 0, C(3) = 0, and C(l) = -1. (7) There is no p in Z with C(p) = + ». Hence Algorithm 1 terminates and we get C(l) = -1, C(2) = 1, C(3) = 0, COO = 2, C(5) = 2, C(6) = 0, C(7) = 1, C(8) = 2 and c(9) = 0. I^ 1 2 5 h 5 6 7 8 9 f\„. /> y- •N i-2 o- .• V >^ o- >• i-1 9- *• % O- »• o\ V> S^i / l (J s *w • ^ / i+1 Figure 6.13- A Wave Front for Example 10 188 Now we show that Algorithm 1 gives a valid wave front. To prove this first we show that Algorithm 1 is effective, i.e. every step of Algorithm 1 is always applicable and terminates. Lemma ~> : Algorithm 1 is effective. Proof : That Step (l),(2),(k) and (5) are effective is clear. Now we show that Step (3) is effective. First we define U(p) to be a set of nodes such that U(p) = (q I there is a chain of arrows a(p, ,p ),a(p ,p_), . . ., a(p n _ x ,P n ) exist where j> ± = q and p n = p) U {p}, e.g. 3 6 © •Ch KpyA^ »k: D(P 8 ): and U(5) = (1>2, h, 5>7>9) . U(p) = U(q) implies that there is a mesh which includes p and q. By assumption this mesh is not a part zero mesh and c(q) will be assigned the same value as C(p) in a finite number of steps after c(p) has been assigned a value. If U(p) 3 U(q), then c(q) will be assigned a value less than or equal to C(p) in a finite number of steps after C(p) has been assigned a value. Thus after a finite number of applications Step (3) eventually terminates. Hence Algorithm 1 is effective. (Q.E.D.) 189 Theorem 6 : Algorithm 1 gives a valid wave front . ■ Proof : To prove this, it is enough to show that' (i) if f(p, q) = 0, then C(p) < C(q) and (ii) if i(p,q) = 1, then C(p) < C(q). However, from Steps (3) and (k) of Algorithm 1, clearly the above conditions hold. Also if p is assigned a value C(p) by Step (5) it implies that either (i) there is r e U(q) where q e Z, and f(r, q) = 1 or (ii) there is no such r. In the second case C(p) may take any value (S may be computed at any time), and in the first case C(q) > C(r) must hold. Hence we let C(q) = max{C(s)} . (Q.E.D.) To handle a restricted loop with more than one loop indicies, we modify Definition 3 as follows. For each S and S in P we first obtain a vector R(k,p, q). Definition 3' '> The statement dependence graph of t f D(f ), is defined by a set N of nodes 1, 2, ...,m each of which corresponds to a body statement of P^ and the arrow relation a. 
There is an arrow a(p, q) if and only if either one of the following s holds : (1) R.(k,p, q) -- or for all j = 1, 2, ..,n and p < q. We let i(p, q) = 0. (2) V((R 1 (k,p,q),...,R n (k,p,q)|B) > V( (0, . . . , 0) | B) where B = ( |Lj, . . ., |Lj ). We let i(p, q) = 1. 190 From Definition 3' clearly (1) !(p,q) = if S ((i-^ig, ...,1^)) US6S the out P ut of S p(( i 1 ; i 2>"-.» i )) and p < q. (2) f(p,q) = 1 if S ((i^ig, •••»i n )) uses the output of S ((^^ig',...,! •)) where (i^ig', . . .,1^ ) < (i^ig', . . ,,i n ). Algorithm 1 is then applied on D(Fj. For example let P^ be SI: A^Ig^g] := A 2 [I 2 ] + A 3 [I 2 -1,I 3+ 1]; S2: Ag[I 2 +l] := A^I^Ig] + k^I^I^i S3: A 3 [I 2 ,I 3 ] := Ag[Ig-l]; Then we have jn 6.3*5 Discussion Given a loop P = (I , . . ., I ) (S, ; . . . ;£L ), we first try to execute body statements in parallel with respect to some loop index. If this fails for any loop index, or if this does not give a satisfactory result, then we try to replace the loop with many for statements. Then we can attempt to execute a body statement (or body statements) of a resultant for statement in parallel with respect to some loop index. If this fails, then we may try the third 191 approach, i.e. we try to execute all body statements simultaneously while loop indices vary sequentially. Often the number of loop indiciea, n, is very small (typically n = 2), and it will be easy to try all variations. 192 7- EQUALLY WEIGHTED— TWO PROCESSOR SCHEDULING PROBLEM T'l Introduction This chapter gives a solution to the so-called equally weighted- two processor scheduling problem. Informally the problem may be stated as follows. Given a set of tasks along with a set of operational precedence relationships that exist between certain of these tasks, and given two identical processors (PE),P(2), how does one schedule these tasks on the two processors so that they execute in the minimum time? It is assumed that either one of two processors is capable of processing any task in the same amount of time, say 1 unit of time. Informally a set of tasks together with procedence relations forms a graph. Clearly the problem of scheduling any given equally weighted task graph on k identical processors, P(k), in an optimal way is effectively solvable by exhaustion. But this is far from possible in practice. The only practical solution so far obtained is a result for scheduling a rooted tree (a restricted class of graphs) with equally weighted tasks on k identical processors, P(k) [21]. Now let us study how the equally weighted— two processor scheduling problem is related to the computation of arithmetic expressions on a parallel machine. In Chapter 3, the parallel computation of an arithmetic expression by building a syntactic tree was studied. There we were only concerned with the height of a tree and reducing it by distribution, and we did not introduce any 193 physical restrictions of a machine. For example, in reality, the size of a machine, i.e. the number of PE's is limited rather than arbitrarily big. One problem which will arise immediately is whether the distribution algorithm should be applied or not to reduce tree height since distribution introduces additional operations. For example assume that we have a two PE machine, P(2). Now let us consider two arithmetic expressions, A = a(bc+d) + e and B = abc(defgh+i). Then we have h[A] = k, h[B] = 5, h[A ] = h[abc+ad+e] = 3 and h[B ] = h[abcdefgh+abci] = k. Thus distribution reduces the height of T[A] and T[3]. 
On the other hand Figure 7-1 shows that if A, B,A and B are computed on P(2), A is still computed in less time than A while B now takes more time than B. Assume that we get A from A by the distribution algorithm. If the size of a machine is limited, then it may not necessarily be true that A can be computed in less time than A even if h[A ] < h[A] holds. Actually it is a nontrivial problem to decide whether distribution is to be made or not to reduce computation time (which is different from tree height) if the size of a machine is limited. It depends on the form of an arithmetic expression as well as the machine organization. We will not go into this problem any further. Now let us look at the situation from a different point of view. Given an arithmetic expression A and its minimum height tree, it is possible to take advantage of common expressions to reduce the number of operations to be performed in hopes of reducing computation time. For example let us consider the computation of A = (a+b+c+d)ef + (a+b)g on P(2). If we evaluate 19^ (a+b) only once then A can be computed in k steps on P(2) while if (a+b) is evaluated twice, then it takes 5 steps to compute A (see Figure 7*2). e b c cab a d e (a) A (k steps) (b) A (3 steps) i h d e f b c c 1 (c) B (5 steps) (d) B (6 steps) Figure 7-1. Computation of Nondistributed and Distributed Arithmetic Expressions on P(2) Our main concern in Chapter 2 was to reduce tree height assuming that the size of a machine is unlimited. Hence we were not interested in reducing the number of operations. As mentioned there, it was an open problem to find out common expressions while keeping the height of a tree minimum. However, if we could take advantage of common expressions while 195 1* 3 2 1 level (a) A Minimum Height Tree for A 5 k 3 2 1 Step (b) (a+b) computed twice (c) (a+b) computed once Figure T«2. Common Expression keeping the height of a tree minimum, then we would obtain a graph of operations rather than a tree for an arithmetic expression (see Figure 7-2. (c)). While we do not know how to compute an arithmetic expression A on P(2) in the minimum time (e.g. should distribution be done?), the scheduling algorithm presented in this chapter schedules a given graph of operations for an arithmetic expression on P(2) so that the given graph is processed in the 196 minimum time, assuming that each FE of P(2) may perform addition or multipli- cation independently but in the same amount of time, say 1 unit of time. Note that we may be able to construct many graphs for A. Hence while the scheduling algorithm schedules a given graph for A on P(2) in an optimal way, the algorithm does not necessarily compute A itself in the minimum amount of time. 7-2 Job Graph Let G be an acyclic graph with nodes N. (i=l,2, . . .,n) and a set of directed arrows connecting pairs of nodes. For nodes N and N' we write N -* N' if there is an arrow from N to N' . We say that N is an immediate predecessor of N' and N' is an immediate successor of N. Also we let SR(N) = {N 1 |N -* N'l (a set of successors of N) and PR(N) = {N' |N* - Nl (a set of predecessors of N). Nodes which have no incoming arrows are called initial nodes , and nodes which have no outgoing arrows are called terminal nodes . For the sake of simplicity we assume that a graph has one initial node and one terminal node. If there are more than two, then we can add a dummy initial/ terminal node. We write N and N for them, respectively. We also write N =5> N' if there is a chain N, ,N_, . . . ,N such that N -»1 ■♦ ... 
-»M -> N' , or N -* N' . 12m 1 m Furthermore we write N / N' or N ^> N' to show that the relation N -» N' or N» I' does not hold. Definition 1 : The forward distance (or level ) from the initial node to a node N, d (N), is the length of the longest path from the initial node to N, thus 197 d_(N T ) = 0. Similarly the backward distance from the terminal node to N, d (n)j is defined, thus •!. (lO = 0. Thus a node N cannot be initiated before time cL(n) but may be initiated at cL.(N) or at any time after that. Definition 2 ; The height of a graph G, h(G), is defined as h(G) = d^V- Then we say that a graph G is tight if for all nodes N, cL^N) + d T (N) = h(G). Otherwise we say that a graph G is loose. Example 1 ; Figure T>3- A Loose Graph and a Tight Graph Th e graph G, is a loose graph because d (N ) + d_(N p ) = 2 ^ h(G. ) whereas the graph G p is a tight graph. 198 First we shall study an optimum scheduling for a tight graph. A scheduling for a loose graph will be discussed in a latter section. In what follows, we use words "process" and "schedule" interchangably. Definition 3 : Let A(i) be the set of all nodes of forward distance i, i.e. A(i) = (Nld^N) = il. Tli is is called a node set . All nodes in A(i) can be scheduled independently of each other since there can be no precendence relations between nodes in A(i). In other words, if N => N 1 then N and N' cannot be processed simultaneously. Now we have the following lemma which characterizes a tight graph. Lemma 1 : If a graph G is tight, then for every node N, there exists N'e SR(N) and N"e PR(n) such that d (N 1 ) = d-^N) + 1 and d^N") = ci^(N) - 1. (For the terminal node SR(N ) = and for the initial node PR(N ) = 0. Those are exceptions. ) Proof : Obvious by Definition 2. (Q.E.D.) Corollary 1 ; Let G be a tight graph. Let N be a node of G. Let Ne A(t). Then for every i, < i < t - 1, there is at least one node N'e A(i) such that N' => N. Also for every j, t + 1 < j < h(G), there is at least one node N"e A(j) such that N => N" . 199 Definition k : To p-schedule a set Q of nodes is to partition Q, into subsets of size 2 in an arbitrary way (if |q| is odd, there will be one subset of size l) and to order those subsets in an arbitrary way. A node N is said to be available if all predecessors of N have been processed. 7.3 Scheduling of a Tight Graph Having these definitions, now we discuss a scheduling of a tight graph G on two processors. The idea of this scheduling scheme is rather simple. We start checking |A(i)| from i = 1 to h(G). For each i, if |A(i)| is even, then we p-schedule A(i), and no processor time will be wasted. If | A ( ± ) | is odd, and if we simply p-schedule A(i), then there will be one node, N, left which cannot be processed in parallel with another node in A(i). Thus we will waste processor time. Therefore a node which can be processed in parallel with the above left out node N must be found. Where can that node be found? It will be shown that we have to look as far as the smallest i' larger than i with |A(i')| = odd to find it. Thus the amount of work to look ahead is always bounded. Before we go further, a few more definitions are in order. t n For some t and n, let us consider a set A = U A(t+i). Now let us n i=0 take a node from each of A(t+,j) and A(t+i) (j < i). Let them be N J and N 1 . If N £> IT", then N and N may be processed simultaneously, providing that all predecessors of N and N" have been processed. Now we establish this relati on formally on A . 200 Definition 5 : ,t . 
Definition 5: The p-line relation between two nodes N and N' in A_n^t is defined as follows: N and N' are p-line related if and only if (1) N ⇏ N' and N' ⇏ N, and (2) d_I(N) ≠ d_I(N'). We write (N, N') for the relation; a pair (N, N') is called a p-line pair. Note that in general (N, N') and (N', N'') do not necessarily imply (N, N'').

Further we define A_n^t(p) = {(N, N') | N ∈ A(t+i), N' ∈ A(t+j), 0 ≤ i, j ≤ n, and the p-line relation holds between N and N'}; i.e., A_n^t(p) is the set of all pairs of nodes in A_n^t between which the p-line relation holds. Since (N, N') ∈ A_n^t(p) implies that (N', N) ∈ A_n^t(p), we shall in general put only one of them in A_n^t(p) and drop the other. An algorithm to find A_n^t(p) is given in Section 7.5.

Definition 6: A p-line set L_n^t on A_n^t is an ordered set of p-line pairs

    L_n^t = ((N_0, N_1'), (N_1, N_2'), ..., (N_k, N_{k+1}'))

where (1) N_0 ∈ A(t) and N_{k+1}' ∈ A(t+n), and (2) for all g (1 ≤ g ≤ k), d_I(N_g') = d_I(N_g) but N_g' ≠ N_g.

We say that A_n^t is p-connectable if there is a p-line set L_n^t on A_n^t. We write L_n^t(N_0, N_{k+1}') when the first and last nodes in L_n^t are of special interest. Further, we write L_n^t(N_0, N_{k+1}') = L_i^t(N_0, N_g') ∪ L_{n-i}^{t+i}(N_g, N_{k+1}') if (N_{g-1}, N_g') and (N_g, N_{g+1}') are two adjacent elements of L_n^t(N_0, N_{k+1}') and d_I(N_g') = d_I(N_g) = t + i. An algorithm to build a p-line set for A_n^t is given in Section 7.5.

Example 2:

[Figure 7.4. A Graph G: A(1) = {b, c, d}, A(2) = {e, f}, A(3) = {g, h, i}]

A_2^1 = A(1) ∪ A(2) ∪ A(3) = {b, c, d, e, f, g, h, i}, and A_2^1(p) = {(b, f), (d, e), (f, h), (e, i)}. A typical L_2^1 is ((b, f), (e, i)). Hence A_2^1 is p-connectable.

Further we define a special p-line set called a p-line (1) set.

Definition 7: Given a set A_n^t, we call a p-line set L_n^t = ((N_0, N_1'), (N_1, N_2'), ..., (N_k, N_{k+1}')) a p-line (1) set (written L_n^t(1)) if d_I(N_i) + 1 = d_I(N_{i+1}') for all i (0 ≤ i ≤ k). Note that in this case k = n - 1. Also we write L_n^t(1)(N_0, N_{k+1}') when the first and last nodes are of particular interest.

Now a few lemmas are in order.

Lemma 2: Suppose N ∈ A(t) and N' ∈ A(t+n) for some t and n in a tight graph G, and assume that (N, N') holds. Then there is a p-line (1) set L_n^t(1)(N, N') = ((N_0, N_1'), (N_1, N_2'), ..., (N_{n-1}, N_n')) where N_0 = N and N_n' = N'.

Proof: The proof is by induction on n. First note that |A(t+i)| ≥ 2 for all i, 1 ≤ i ≤ n - 1; otherwise N ⇒ N' would hold and (N, N') would not.

(1) Let n = 2. By Lemma 1 there are nodes N_1, N_2 ∈ A(t+1) with N → N_1 and N_2 → N'; otherwise the graph G would not be tight. If N_1 = N_2, then N ⇒ N', contradicting the assumption; hence N_1 ≠ N_2. Also (N, N_2) holds, since N ⇒ N_2 would give N ⇒ N', and similarly (N_1, N') holds. Thus L_2^t(1) = ((N, N_2), (N_1, N')).

(2) Now assume that the lemma holds for n ≤ i, and let n = i + 1, with N ∈ A(t), N' ∈ A(t+i+1) and (N, N'). By Lemma 1 and Corollary 1 there are two distinct nodes N_1, N_2 ∈ A(t+i) such that (N, N_1) and N ⇒ N_2: take N_2 with N ⇒ N_2 and N_1 with N_1 ⇒ N'; then N ⇏ N_1, since otherwise N ⇒ N', so (N, N_1) holds and N_1 ≠ N_2. Also (N_2, N') holds, since N_2 ⇒ N' would again give N ⇒ N'. By the induction hypothesis there is an L_i^t(1)(N, N_1), and appending the pair (N_2, N') to it (the joint nodes N_1 ≠ N_2 lie at the same level t + i) gives a p-line (1) set L_{i+1}^t(1)(N, N'). (Q.E.D.)
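The thesis builds p-line sets via the path-finding formulation of Section 7.5; the brute-force search below is only a sketch of Definition 6, assuming each pair of A_n^t(p) is given oriented upward (lower-level node first). Function and variable names are ours.

    # find an ordered chain of p-line pairs: the first starts in A(t), the
    # last ends in A(t+n), and adjacent pairs meet at a common level with
    # distinct joint nodes (condition (2) of Definition 6)
    def find_p_line_set(pairs, level, t, n):
        def extend(chain):
            last_lo, last_hi = chain[-1]
            if level[last_hi] == t + n:          # reached A(t+n): done
                return chain
            for lo, hi in pairs:
                if level[lo] == level[last_hi] and lo != last_hi:
                    result = extend(chain + [(lo, hi)])
                    if result:
                        return result
            return None

        for lo, hi in pairs:
            if level[lo] == t:                   # a chain starts in A(t)
                result = extend([(lo, hi)])
                if result:
                    return result
        return None                              # A_n^t is not p-connectable

    # Example 2: the levels and p-line pairs as given in the text
    level = {'b': 1, 'c': 1, 'd': 1, 'e': 2, 'f': 2, 'g': 3, 'h': 3, 'i': 3}
    pairs = [('b', 'f'), ('d', 'e'), ('f', 'h'), ('e', 'i')]
    print(find_p_line_set(pairs, level, t=1, n=2))   # [('b', 'f'), ('e', 'i')]

Run on Example 2's data, the search returns exactly the p-line set ((b, f), (e, i)) quoted above.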
Lemma 3: Suppose that N ∈ A(t), N^1, N^2 ∈ A(t+i), N^3, N^4 ∈ A(t+j), and N' ∈ A(t+n), where i < j < n. Also assume that (N, N^1), (N^2, N^3) and (N^4, N') hold. Then there is a p-line (1) set L_n^t(1)(N, N').

Proof:

[Figure 7-5. An Illustration for Lemma 3]

Since (N, N^1), (N^2, N^3) and (N^4, N') hold, the previous lemma gives p-line (1) sets on each of the three segments; by definition N ⇏ N', N' ⇏ N, and N^1 ≠ N^2, so the first two segments may be joined at level t + i. Now we have two cases.

(1) N^3 = N^4. Then N^2 ⇏ N', so (N^2, N') holds and Lemma 2 gives a p-line (1) set L_{n-i}^{t+i}(1)(N^2, N'). Let L_n^t(1)(N, N') = L_i^t(1)(N, N^1) ∪ L_{n-i}^{t+i}(1)(N^2, N').

(2) N^3 ≠ N^4. Then let L_n^t(1)(N, N') = L_j^t(1)(N, N^3) ∪ L_{n-j}^{t+j}(1)(N^4, N'), where L_j^t(1)(N, N^3) = L_i^t(1)(N, N^1) ∪ L_{j-i}^{t+i}(1)(N^2, N^3).

Thus in either case there is a p-line (1) set on A_n^t. (Q.E.D.)

From Lemmas 2 and 3 we can prove the following lemma.

Lemma 4: If A_n^t is p-connectable, then there is a p-line (1) set L_n^t(1)(N, N'), where N ∈ A(t) and N' ∈ A(t+n).

What Lemma 4 implies is the following. Let L_n^t(1)(N, N') = ((N, N_1'), (N_1, N_2'), ..., (N_{n-1}, N')), i.e., d_I(N_i') = d_I(N_i) = t + i and d_I(N_i) + 1 = d_I(N_{i+1}'). Since (N_i, N_{i+1}') is a p-line pair, we can process both nodes at the same time. To do this we first process A(t+i) - {N_i} (notice that d_I(N_i) = t + i), then process {N_i, N_{i+1}'}, and finally process A(t+i+1) - {N_{i+1}'}. This leads us to the following scheduling.

Definition 8: Assume A_n^t is p-connectable, with L_n^t(1) = ((N_0, N_1'), (N_1, N_2'), ..., (N_{n-1}, N_n')). To p-line schedule A_n^t by L_n^t(1) means the following scheduling:

(1) p-schedule A(t) - {N_0}.
(2) g = 1.
(3) p-schedule {N_{g-1}, N_g'}.
(4) p-schedule A(t+g) - {N_g', N_g}.
(5) g = g + 1. If g < n, then go to (3).
(6) p-schedule {N_{n-1}, N_n'}.
(7) p-schedule A(t+n) - {N_n'}.
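Definition 8 expands mechanically into a sequence of p-scheduled groups. The sketch below, with names of our own choosing, emits that sequence; p_schedule pairs up a set arbitrarily, as Definition 4 allows.

    def p_schedule(nodes):
        # partition a set into subsets of size 2 (one of size 1 if odd)
        ns = sorted(nodes)
        return [set(ns[i:i + 2]) for i in range(0, len(ns), 2)]

    def p_line_schedule(A, t, n, L):
        # L = [(N_0, N_1'), (N_1, N_2'), ..., (N_{n-1}, N_n')], a p-line (1) set
        groups = p_schedule(A[t] - {L[0][0]})            # step (1)
        for g, (lo, hi) in enumerate(L):
            groups.append({lo, hi})                      # steps (3) and (6)
            rest = A[t + g + 1] - {hi}                   # steps (4) and (7)
            if g + 1 < n:
                rest -= {L[g + 1][0]}
            groups += p_schedule(rest)
        return groups

For Example 3 below, L = ((d, f), (g, h)) expands to the groups {a, b}, {c, e}, {d, f}, {g, h}, {i, j}, which is exactly the 5-step optimum schedule shown there.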
Now an algorithm to schedule a tight graph on two processors is described. Scheduling is done according to the node sets A(i), for i = 1, 2, ..., h(G). All nodes in A(i) can be processed independently of each other, i.e., in any order. If |A(i)| is even, then the two processors can be kept busy all the time and A(i) can be processed in time |A(i)|/2. If |A(i)| is odd, then a machine may become idle: one node of A(i) will be left without a partner to be processed with. Let it be N. Then a partner must be found in some A(j), j > i. First we may try to find N' ∈ A(i+1) which can be a partner of N. However, if |A(i+1)| is even, then |A(i+1) - {N'}| becomes odd and we have the same problem again, and may try to find a node from A(i+2) to fill the idle machine, and so on. If this cycle is ever to stop, it must stop when an A(i+k) with |A(i+k)| odd is reached. Otherwise there is no way to remedy the cycle, and machine time must be wasted.

Tight Graph Scheduling Algorithm:

Step 1: t = 0.
Step 2: If t = h(G), then p-schedule A(t) and stop; else go to Step 3.
Step 3: If |A(t)| is even, then (3-1) p-schedule A(t) and (3-2) go to Step 7.
Step 4: |A(t)| is odd. Find A(t+1).
Step 5: If for all N ∈ A(t), |SR(N)| = |A(t+1)|, then (5-1) p-schedule A(t) and (5-2) go to Step 7.
Step 6: There is N' ∈ A(t) such that |SR(N')| < |A(t+1)|.
(6-1) If |A(t+1)| is odd, then
  (6-1-1) p-schedule A(t) - {N'}.
  (6-1-2) p-schedule {N'} ∪ {N''}, where N'' ∈ A(t+1) - SR(N').
  (6-1-3) p-schedule A(t+1) - {N''}.
  (6-1-4) Go to Step 7.
(6-2) |A(t+1)| is even.
  (6-2-1) Find the smallest k greater than 1 such that |A(t+k)| is odd.
  (6-2-2) If there is no such k (i.e., we have checked up to A(h(G))), then p-schedule each A(i), t ≤ i ≤ h(G), individually, and stop.
  (6-2-3) Else we have the collection {A(t), A(t+1), ..., A(t+k)}, where |A(t)| and |A(t+k)| are odd and the other |A(t+i)| are all even. Check the p-connectability of A_k^t.
    (1) A_k^t is not p-connectable:
      (1-i) p-schedule A(t), A(t+1), ..., A(t+k-1) individually.
      (1-ii) Let t = t + k - 1.
      (1-iii) Go to Step 7.
    (2) A_k^t is p-connectable:
      (2-i) Find a p-line (1) set L_k^t(1) = ((N_0, N_1'), (N_1, N_2'), ..., (N_{k-1}, N_k')).
      (2-ii) p-line schedule A_k^t by L_k^t(1).
      (2-iii) t = t + k.
      (2-iv) Go to Step 7.
Step 7: t = t + 1. Go to Step 2.

Example 3:

[Figure 7-6. An Example of a Tight Graph Scheduling: A(1) = {a, b, c, d, e}, A(2) = {f, g}, A(3) = {h, i, j}]

(1) |A(1)| and |A(3)| are odd, and |A(2)| is even. Thus we have the collection {A(1), A(2), A(3)} (by Step 6 of the algorithm).
(2) For A_2^1 we have L_2^1(1) = ((d, f), (g, h)).
(3) According to Step (6-2-3)(2), we schedule as follows:
  (i) p-schedule A(1) - {d} = {a, b, c, e}.
  (ii) p-schedule {d, f}.
  (iii) p-schedule A(2) - {f, g} = ∅.
  (iv) p-schedule {g, h}.
  (v) p-schedule A(3) - {h} = {i, j}.

Thus we have an optimum schedule:

    Step:      1  2  3  4  5
    Machine A: a  c  d  g  i
    Machine B: b  e  f  h  j
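The whole tight-graph algorithm can now be sketched, assuming the JobGraph, distances, node_sets, p_schedule and p_line_schedule helpers above. The sketch condenses Steps 4 through 6 into a single p-line (1) search; find_p_line_1 is a plain depth-first stand-in for the discovery procedure of Section 7.5. The trailing lines re-run Example 3, but note that Figure 7-6's arrows are not reproduced in this text, so the arrow list below is a hypothetical graph of our own, chosen only to be tight and to admit a p-line set such as ((d, f), (g, h)).

    from functools import lru_cache

    def make_reaches(g):
        @lru_cache(maxsize=None)
        def reaches(a, b):                     # the relation N => N'
            return b in g.succ[a] or any(reaches(m, b) for m in g.succ[a])
        return reaches

    def find_p_line_1(A, t, k, reaches):
        # search for ((N_0,N_1'),...,(N_{k-1},N_k')): pairs span adjacent
        # levels, joint nodes at a common level are distinct
        def ok(a, b):
            return not reaches(a, b) and not reaches(b, a)
        def extend(chain, lvl):
            if lvl == t + k:
                return chain
            banned = chain[-1][1] if chain else None
            for lo in A[lvl] - {banned}:
                for hi in A[lvl + 1]:
                    if ok(lo, hi):
                        got = extend(chain + [(lo, hi)], lvl + 1)
                        if got:
                            return got
            return None
        return extend([], t)

    def schedule_tight(g):
        d_i, _ = distances(g)
        A = node_sets(d_i)
        h = max(A)
        reaches = make_reaches(g)
        groups, t = [], 0
        while t <= h:
            if t == h or len(A[t]) % 2 == 0 \
               or all(len(g.succ[n]) == len(A[t + 1]) for n in A[t]):
                groups += p_schedule(A[t]); t += 1; continue
            # |A(t)| odd: find the smallest k with |A(t+k)| odd
            k = next((k for k in range(1, h - t + 1) if len(A[t + k]) % 2), None)
            if k is None:                      # Step 6-2-2
                while t <= h:
                    groups += p_schedule(A[t]); t += 1
                break
            L = find_p_line_1(A, t, k, reaches)
            if L is None:                      # Step 6-2-3 (1)
                for _ in range(k):
                    groups += p_schedule(A[t]); t += 1
            else:                              # Step 6-2-3 (2)
                groups += p_line_schedule(A, t, k, L)
                t += k + 1
        return groups

    # Example 3 on a hypothetical tight graph (Figure 7-6 is not shown here)
    g = JobGraph()
    for lo, hi in [("r", x) for x in "abcde"] + \
                  [("a", "f"), ("b", "f"), ("c", "g"), ("d", "g"), ("e", "g"),
                   ("f", "h"), ("f", "i"), ("g", "i"), ("g", "j")]:
        g.add_arrow(lo, hi)
    groups = schedule_tight(g)
    done = set()
    for group in groups:                       # no node before its predecessors
        assert all(g.pred[n] <= done for n in group)
        done |= group
    assert len(groups) == 6                    # {r} plus the 5 steps of Example 3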
Proof for the algorithm:

Lemma 5: Step 3 is optimum, and whatever p-schedule is made, it does not affect the later stages.

Lemma 6: Step 5 is optimum, and whatever p-schedule is made, it does not affect the later stages.

Proof: First note that after A(t) has been processed, only nodes in A(t+1) can become available. Since for all N ∈ A(t), |SR(N)| = |A(t+1)|, all nodes in A(t) must have been processed before any node in A(t+1) can be processed. (Q.E.D.)

Lemma 7: Step 6-1 is optimum, and whatever p-schedule is made, it does not affect the later stages.

Proof: The algorithm actually processes A(t) and A(t+1) (where |A(t)| and |A(t+1)| are odd) in time (|A(t)| + |A(t+1)|)/2, which is optimum. (Q.E.D.)

Lemma 8: Step 6-2-2 is optimum.

Proof: Consider the collection {A(t), A(t+1), ..., A(h(G))}. Since |A(t)| is odd and all other |A(i)| are even (t < i ≤ h(G)), it takes at least time

    ⌈(|A(t)| + |A(t+1)| + ... + |A(h(G))|)/2⌉

to process the collection. Step 6-2-2 achieves this. (Q.E.D.)

Lemma 9: Assume |A(t)| and |A(t+k)| are odd and the |A(t+i)| are all even (0 < i < k). If A_k^t is p-connectable, then Step 6-2-3(2) is optimum.

Proof: The collection {A(t), A(t+1), ..., A(t+k)} can be processed in time (|A(t)| + |A(t+1)| + ... + |A(t+k)|)/2. Step 6-2-3(2) achieves this. (Q.E.D.)

Lemma 10: Assume |A(t)| and |A(t+k)| are odd and the |A(t+i)| are all even (0 < i < k). If A_k^t is not p-connectable, then Step 6-2-3(1) is optimum.

Proof: Since A_k^t is not p-connectable, there is no p-line set L_k^t. Thus there will be a node N in some A(t+i) (0 ≤ i ≤ k) without a partner to be processed with it, making a machine idle. Now, if this situation could be remedied, it would only be because there is some N' ∈ A(t+k+j) (j > 0) which can be done in parallel with N.

[Figure 7.7. An Illustration for Lemma 11]

Suppose that the above could be done. The parallel processing of N and N' can, however, be advantageous only if there is a p-line (1) set L(1)(N_0, N') with N_0 ∈ A(t); otherwise the processors cannot be kept busy for A(t), A(t+1), ..., A(t+i) - {N}, and we gain nothing by delaying the processing of N. Now, from the assumption, N ⇒ N'' for all N'' ∈ A(t+k), because otherwise A_k^t would become p-connectable. Also, by Corollary 1, there is a node N'' ∈ A(t+k) such that N'' ⇒ N'. This implies that N ⇒ N'. Thus N' cannot be processed in parallel with N. This proves the lemma. (Q.E.D.)

Essentially, upon finding A(t) with |A(t)| odd, the algorithm tries to delay the processing of a node N in A(t) till the next node set A(t+1). If |A(t+1)| is even, then again the algorithm tries to delay the processing of a node N' in A(t+1) till the next level A(t+2), and so on, until it finds some A(t+k) where |A(t+k)| is odd, or |A(t+k) ∪ {N''}| is even, where N'' is a node whose processing is being delayed from the previous stage. In other words, the algorithm tries to establish p-connectability between two adjacent node sets both of which have an odd number of elements. Note that it is not necessary to try to establish p-connectability among more than two odd node sets, say A(t), A(t+k) and A(t+m) (m > k), because A_m^t cannot be p-connectable if A_k^t is not (see Lemma 10).

Now the above argument, together with Lemmas 5-10, proves the following theorem.

Theorem 1: The algorithm gives an optimum schedule of a tight graph.

7.4 Scheduling of a Loose Graph

Now let us extend the above algorithm so that it can handle a loose graph. To facilitate the presentation, a few definitions are in order. (From now on, by "a graph" a loose graph is to be understood.)

Definition 9: A node N in a graph G for which d_I(N) + d_T(N) ≠ h(G) is called a loose node. Otherwise a node is tight. Let N be a loose node. Then we define the far distance d_F(N) as [an illegible stretch of the scan follows; it defines the far distance d_F(N), the node sets B(i) and C(i), and the p-line schedule of B_n^t (Definition 8'), whose closing steps read:] then p-schedule B(t+b) for b = a_{g-1} + 1, a_{g-1} + 2, ..., a_g - 1; then p-schedule B(t+a_g) - {N_g', N_g}.
(5) g = g + 1. If g < k, then go to (3).
(6) p-schedule {N_k, N_{k+1}'}.
(7) p-schedule B(t+n) - {N_{k+1}'}.

Finally, in connection with Definition 8', we define the following. Suppose that B_n^t is not p-connectable, i.e., there is no p-line set L_n^t on B_n^t. It may, however, be possible to find a p-line set L_s^t on B_s^t for some s, 0 < s < n.

Definition 11: Given B_n^t which is not p-connectable, we check whether B_s^t is p-connectable for s = 1, 2, ..., n-1. Let m be the smallest s such that B_s^t is p-connectable but B_{s+1}^t is not. We call m the maximum p-connectable distance, and L_m^t a maximum p-line set.

The following example illustrates the above definition.

Example 5:

[Figure 7.10. An Example of the Maximum p-connectable Distance]

Consider the p-connectability of the set B_2^t shown in Figure 7.10. B_1^t is p-connectable but B_2^t is not; hence the maximum p-connectable distance in this B_2^t is 1, and the corresponding L_1^t is a maximum p-line set.

Using a technique similar to the one described in Section 7.5, we can check the p-connectability of B_s^t.

An algorithm to schedule a loose graph on two processors is now given. The algorithm resembles the one for a tight graph; the major difference lies in the treatment of loose nodes. Loose nodes are scheduled as late as possible, and are used when it would otherwise become inevitable to waste processor time. In what follows, the definitions of B(i) and C(i) are modified so that they do not include those loose nodes which have already been scheduled.
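Because the precise definitions of d_F(N), B(i) and C(i) fall in the illegible stretch noted above, the sketch below is an assumption-laden reading rather than the thesis' text: it takes d_F(N) = h(G) - d_T(N), i.e., the latest node set a loose node can join without lengthening the schedule; B(i) to be the tight nodes at level i; and C(i) to be the still-unscheduled loose nodes whose window [d_I(N), d_F(N)] covers i. It assumes the JobGraph and distances helpers above.

    def loose_structure(g):
        d_i, d_t = distances(g)
        h = max(d_i.values())
        tight = {n for n in g.nodes if d_i[n] + d_t[n] == h}
        loose = g.nodes - tight
        d_f = {n: h - d_t[n] for n in loose}        # assumed far distance
        B = {i: {n for n in tight if d_i[n] == i} for i in range(h + 1)}
        def C(i, scheduled):
            # loose nodes usable to fill an idle slot at level i
            return {n for n in loose - scheduled
                    if d_i[n] <= i <= d_f[n]}
        return B, C, d_f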
Loose Graph Scheduling Algorithm:

Step 1: t = 0.
Step 2: If t = h(G), then p-schedule all unscheduled nodes and stop; else go to Step 3.
Step 3: If |B(t)| is even, then (3-1) p-schedule B(t) and (3-2) go to Step 7.
Step 4: |B(t)| is odd. Find B(t+1).
Step 5: If for all N ∈ B(t), |SR(N)| = |B(t+1)|, then
  (5-1) Check C(t). If C(t) = ∅, then p-schedule B(t) and go to Step 7.
  (5-2) Otherwise pick N with the minimum d_F(N) in C(t). (If there is more than one such N, pick any one.) Now p-schedule B(t) ∪ {N} and go to Step 7.
Step 6: There is N' ∈ B(t) such that |SR(N')| < |B(t+1)|.
(6-1) If |B(t+1)| is odd, then
  (6-1-1) p-schedule B(t) - {N'}.
  (6-1-2) p-schedule {N'} ∪ {N''}, where N'' ∈ B(t+1) - SR(N').
  (6-1-3) p-schedule B(t+1) - {N''}.
  (6-1-4) Go to Step 7.
(6-2) |B(t+1)| is even.
  (6-2-1) Find the smallest k greater than 1 such that |B(t+k)| is odd.
  (6-2-2) If there is no such k (i.e., we have checked up to B(h(G))), then p-schedule each B(i) (t ≤ i ≤ h(G)) individually and stop.
  (6-2-3) Else we have the collection {B(t), B(t+1), ..., B(t+k)}, where |B(t)| and |B(t+k)| are odd and the other |B(t+i)| are all even. Check whether B_k^t is p-connectable.
    (1) B_k^t is not p-connectable:
      (1-i) Find the maximum p-connectable distance in B_k^t; let it be m.
      (1-ii) Check C(t+m). If C(t+m) = ∅, then p-schedule B(t), B(t+1), ..., B(t+k-1) individually; let t = t + k - 1 and go to Step 7.
      (1-iii) Otherwise let B'(t+m) = B(t+m) ∪ {N}, where N ∈ C(t+m) has the minimum d_F(N). Let B'_m^t = B(t) ∪ B(t+1) ∪ ... ∪ B(t+m-1) ∪ B'(t+m). Then p-line schedule B'_m^t by a maximum p-line set L_m^t. Let t = t + m and go to Step 7.
    (2) B_k^t is p-connectable:
      (2-i) Find a p-line set L_k^t.
      (2-ii) p-line schedule B_k^t by L_k^t.
      (2-iii) t = t + k.
      (2-iv) Go to Step 7.
Step 7: t = t + 1. Go to Step 2.

Proof for the algorithm: That the above algorithm is optimum can be proved by an argument similar to the one used for the previous algorithm. For example, Steps 3, 5-1, 6-1, 6-2-2 and 6-2-3(2) are easily shown optimum by the previous lemmas. It remains to show that Steps 5-2 and 6-2-3(1) are optimum.

Lemma 11: Step 5-2 is optimum.

Proof: Suppose that we do not use N, the node with the minimum d_F(N) in C(t), to fill up otherwise wasted processor time. Saving N for later use does not improve the situation, because a node cannot be used more effectively than to fill up wasted processor time anyway. Also, the choice of N from C(t) is optimum. Suppose, for example, that N' with d_F(N') > d_F(N) is also in C(t), and consider those two nodes (see Figure 7.11). Whether we use N or N' to schedule with B(t) makes no difference to the scheduling of B(t+1), B(t+2), ..., B(d_F(N)), because one of the two nodes remains available if needed. Suppose we used N' with B(t). Then it is not possible to fill a later request which may arise when some B(u) (d_F(N) + 1 ≤ u ≤ d_F(N')) is scheduled, whereas if we used N with B(t), then we can fill that request. This proves the lemma.

[Figure 7.11. An Illustration for Lemma 13]

(Q.E.D.)

That Step 6-2-3(1) is optimum is proved similarly. Now we have the following theorem.

Theorem 2: The algorithm gives an optimum schedule of a loose graph.

7.5 Supplement

(1) An algorithm to establish A_n^t(p) on A_n^t is now discussed.

Let m = |A(t)| + |A(t+1)| + ... + |A(t+n)|, and let B be an m x m connection matrix where the first |A(t)| columns and rows are labeled by the nodes in A(t), the next |A(t+1)| columns and rows by the nodes in A(t+1), and so forth. An element b_ij of B is 1 if and only if N_i → N_j, where N_i and N_j are the labels of the i-th row and the j-th column. Now define the multiplication of connection matrices as follows: let A = B x C, where A, B and C are m x m connection matrices; then a_ij = ∨_{k=1..m} (b_ik ∧ c_kj). Now compute B* = B ∨ B² ∨ ... ∨ B^m. In B*, b*_ij = 1 implies that N_i ⇒ N_j, while b*_ij = 0 and b*_ji = 0 (with d_I(N_i) ≠ d_I(N_j)) imply that (N_i, N_j) is a p-line pair. For example, for the graph in Figure 7.12 we have A_n^t(p) = {(a, f), (a, g), (b, e), (b, h), (b, g), (c, d), (d, h), (d, j), (f, h), (g, h), (g, i)}.
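The procedure of part (1) is exactly a Boolean matrix closure. A sketch, assuming numpy, a 0/1 integer connection matrix B, and a list `level` giving d_I for each row index; one orientation of each pair is kept, as in the text.

    import numpy as np

    def p_line_pairs(B, level):
        # B[i][j] = 1 iff N_i -> N_j; closure = B v B^2 v ... v B^m
        m = len(B)
        closure = np.zeros_like(B)
        power = B.copy()
        for _ in range(m):
            closure |= power
            power = (power @ B > 0).astype(B.dtype)
        # (N_i, N_j) is a p-line pair iff neither node reaches the other
        # and they lie at different levels (Definition 5)
        return [(i, j) for i in range(m) for j in range(i + 1, m)
                if not closure[i, j] and not closure[j, i]
                and level[i] != level[j]]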
[Figure 7.12. An Example for A_n^t(p): (a) a graph on the nodes a-j; (b) its connection matrix B; (c) the closure B ∨ B² ∨ ... ∨ B^m]

(2) Given A_n^t(p) for A_n^t, an algorithm to find p-connectivity and L_n^t(1) is now described. According to Lemma 4, if A_n^t is p-connectable, then there is a p-line (1) set L_n^t(1). Thus, to check p-connectability it is enough to examine whether there is a p-line (1) set L_n^t(1).

First let A_n^t(p)(1) = A_n^t(p) - {(N, N') | |d_I(N) - d_I(N')| > 1}. Now we construct a graph Ā_n^t as follows.

(1) Ā_n^t has the following nodes:
  (i) an initial node N_s,
  (ii) a terminal node N_f,
  (iii) all nodes in A_n^t,
  (iv) for each node N (N ∈ A_n^t, N ∉ A(t), N ∉ A(t+n)), a new duplicate node N'.

(2) Ā_n^t has the following arrows:
  (i) an arrow from N_s to every node in A(t),
  (ii) an arrow from every node in A(t+n) to N_f,
  (iii) if (N_1, N_2) ∈ A_n^t(p)(1), then an arrow from N_1 to N_2,
  (iv) for every A(t+k), 1 ≤ k ≤ n-1, and every N ∈ A(t+k), an arrow from N to the duplicate of every N'' ∈ A(t+k) not identical to N.

To illustrate the above definition, let us consider the following example.

Example: Let A_2^t = {a, b, c, d, e, f, g}, where a, b ∈ A(t), c, d, e ∈ A(t+1) and f, g ∈ A(t+2), and let A_2^t(p)(1) = {(b, c), (d, f), (d, g), (c, f)}.

[Figure 7.13. An Example for p-connectivity Discovery]

Then Ā_2^t has the nodes {N_s, N_f, a, b, c, d, e, f, g (the nodes of A_2^t), c', d', e' (duplicates of the nodes in A(t+1))} and the arrows {(N_s, a), (N_s, b), (f, N_f), (g, N_f), (b, c), (d, f), (d, g), (c, f) (the arrows of (i)-(iii)), (c, d'), (c, e'), (d, c'), (d, e'), (e, c'), (e, d') (the arrows from a node N to the duplicates of the nodes not identical to N)}.

Now it is clear that A_n^t has a p-line (1) set L_n^t(1) if and only if there is a path from N_s to N_f in Ā_n^t. There is a well-known algorithm for path finding; see, e.g., [19].
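Part (2) thus reduces p-line (1) discovery to path finding; the sketch below builds the auxiliary graph and runs a breadth-first search. One detail is changed from the text's arrow list: the pair arrows of interior levels are attached to the duplicate nodes rather than to the originals, since that is what enforces the joint condition N_g' ≠ N_g of Definition 6. That attachment, like the names, is our reading rather than a quotation.

    from collections import defaultdict, deque

    def p_connectable(A, t, n, pairs1):
        # A: level -> set of nodes; pairs1: A_n^t(p)(1), pairs spanning
        # adjacent levels, oriented upward (lower-level node first)
        arcs = defaultdict(set)
        for node in A[t]:
            arcs["N_s"].add(node)
        for node in A[t + n]:
            arcs[node].add("N_f")
        for lo, hi in pairs1:
            if lo in A[t]:                     # a chain starts in A(t)
                arcs[lo].add(hi)
        for k in range(1, n):                  # interior levels get duplicates
            for node in A[t + k]:
                for other in A[t + k] - {node}:
                    arcs[node].add(("dup", other))
                    for lo, hi in pairs1:      # a duplicate inherits the
                        if lo == other:        # upward pairs of its original
                            arcs[("dup", other)].add(hi)
        seen, queue = {"N_s"}, deque(["N_s"])
        while queue:                           # breadth-first search
            v = queue.popleft()
            if v == "N_f":
                return True
            for w in arcs[v] - seen:
                seen.add(w)
                queue.append(w)
        return False

    # the example above: the path N_s -> b -> c -> dup(d) -> f -> N_f exists,
    # corresponding to the p-line (1) set ((b, c), (d, f))
    A = {0: {"a", "b"}, 1: {"c", "d", "e"}, 2: {"f", "g"}}
    print(p_connectable(A, 0, 2, [("b", "c"), ("d", "f"), ("d", "g"), ("c", "f")]))
    # prints True

Keeping parent pointers during the search would recover the p-line (1) set itself rather than a yes/no answer.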
8. CONCLUSION

This thesis introduced new techniques to expose hidden parallelism in a program. The techniques made extensive use of one of the fundamental arithmetic laws, the distributive law. Furthermore, it was suggested that with the help of these techniques the computation of a program might be speeded up logarithmically, in the sense that computation time becomes a logarithmic function of the number of single variable occurrences in a program rather than a linear function of it. Even though the discussion was based on an ILLIAC IV type machine, as mentioned before, the results are readily applicable to pipeline machines such as the CDC STAR.

Chapter 2 of the thesis studied the parallel computation of summations, powers and polynomials. The minimum time to evaluate summations or powers, as well as the minimum number of PE's required to attain it, was given. A scheme which computes a polynomial in parallel in less time than any previously known scheme was also introduced. Because of its simplicity in scheduling, the k-th order Horner's rule for parallel polynomial computation was studied in detail. It was shown that for this algorithm the availability of more PE's sometimes increases the computation time, since the algorithm forces all PE's to participate in the computation.

Chapter 3 presented an algorithm which reduces tree height for an arithmetic expression by distribution. The algorithm works from the innermost parenthesis pair to the outermost one and scans an arithmetic expression only once. A measure for the height of the minimum height tree for an arithmetic expression was given as a function of the depth of parenthesis-pair nesting and the number of single variable occurrences in the expression.

Chapter 4 extended the above idea to cover a sequence of arithmetic expressions. It was shown that by replacing a sequence of arithmetic expressions with a single arithmetic expression by back substitution, the computation can be speeded up in a logarithmic way for a certain class of iteration formulas, e.g., x_{i+1} := a × x_i + b. The chapter also showed that parallel computation is in general more favorable than sequential computation in terms of round-off error, and that distribution does not introduce a significant amount of round-off error.

Chapter 5 studied inter-statement parallelism as an introduction to the following chapter. An algorithm was given which checks whether executing the statements of a program in some other sequence gives the same results as executing them in the given sequence. The algorithm was new in the sense that it prevents variables from being updated before they are used, which had not been taken into account by previous work. Also, a technique which exposes more parallelism between statements by introducing temporary locations was introduced.

Chapter 6 presented an algorithm which checks whether a statement in a loop can be executed simultaneously for all values of the loop index. The algorithm checks only the index expressions and the way the values of the indices vary, and does not require a loop to be replaced with a sequence of statements. In case a statement in a loop cannot be executed in parallel with respect to the loop index as it stands, the algorithm "skews" the computation of the statement with respect to the loop index so that it can be executed in parallel for all values of the index. Also, to expose hidden parallelism in a loop, the replacement of a loop with several loops was discussed.

A solution for the equally weighted two-processor scheduling problem was given in Chapter 7. The only practical result obtained so far was for scheduling a rooted tree of equally weighted tasks on k identical processors. The solution given in Chapter 7 schedules a graph of equally weighted tasks on two identical processors. If common expressions in an arithmetic expression are considered, then a graph of operations rather than a tree is obtained for the expression, and the scheduling algorithm is readily applicable to scheduling that graph on P(2).

Suggestions for further research have been given in several places throughout the thesis and need not be repeated here. We conclude by giving two possible extensions that deserve brief mention.

(1) The design of a better machine. Even though we assumed that a PE can communicate with any other PE instantaneously, this may not be the case in reality, because it is costly and impractical to provide data paths between every PE pair. Hence it is necessary to design a PE interconnection which is economical yet powerful enough to simulate the above idealized interconnection [25], [26].

(2) Generalization of the ideas given in this thesis. The three laws of arithmetic were utilized in this thesis for parallelism exploitation. We should, however, pay more attention to these laws even in terms of serial computation. For example, suppose an arithmetic expression which involves matrices, row vectors and column vectors is given.
Then by the appropriate application of the associative law, the number of multiplications required may be reduced drastically.

LIST OF REFERENCES

[1] Abrahams, P. W., "A Formal Solution to the Dangling else of ALGOL 60 and Related Languages", Comm. ACM, 9 (September, 1966), pp. 679-682.

[2] Abel, N. E., et al., "TRANQUIL: A Language for an Array Processing Computer", Proc. of the Spring Joint Computer Conference (1969), pp. 57-73.

[3] Naur, P., et al., "Revised Report on the Algorithmic Language ALGOL 60", Comm. ACM, 6 (January, 1963), pp. 1-17.

[4] Allard, R. W., Wolf, K. A. and Zemlin, R. A., "Some Effects of the 6600 Computer on Language Structures", Comm. ACM, 7 (February, 1964), pp. 112-119.

[5] Baer, J. L., "Graph Models of Computations in Computer Systems", Ph.D. Dissertation, University of California, Los Angeles, Report No. 68-46 (October, 1968).

[6] Baer, J. L. and Bovet, D. P., "Compilation of Arithmetic Expressions for Parallel Computations", Proc. of the IFIP Congress (1968), pp. 340-346.

[7] Barnes, G. H., et al., "The ILLIAC IV Computer", IEEE Trans. of Computers, C-17 (August, 1968), pp. 746-757.

[8] Beightler, C. S., et al., "A Short Table of z-Transforms and Generating Functions", Operations Research, 9 (July-August, 1961), pp. 574-578.

[9] Bingham, H. W., Reigel, W. E. and Fisher, D. A., "Control Mechanisms for Parallelism in Programs", Burroughs Corporation, ECOM-02463-7 (October, 1968).

[10] Bingham, H. W. and Reigel, W. E., "Parallelism Exposure and Exploitation in Digital Computing Systems", Burroughs Corporation, ECOM-02463-F (June, 1969).

[11] Breuer, M. A., "Generation of Optimal Code for Expressions via Factorization", Comm. ACM, 12 (June, 1969), pp. 333-340.

[12] "Newsdata", Computer Decisions (March, 1970), p. 2.

[13] Conway, M. E., "A Multiprocessor System Design", Proc. of the Fall Joint Computer Conference (1963), pp. 139-146.

[14] Conway, R. W., Maxwell, W. L. and Miller, L. W., Theory of Scheduling, Addison-Wesley Publishing Company, Inc., New York (1967).

[15] Dorn, W. S., "Generalizations of Horner's Rule for Polynomial Evaluation", IBM Journal of Research and Development, 6 (April, 1962), pp. 239-245.

[16] Estrin, G., "Organization of Computer Systems--the Fixed plus Variable Structure Computer", Proc. of the Western Joint Computer Conference (May, 1960), pp. 33-40.

[17] Gold, D. E., "A Model for Linear Programming Optimization of I/O-Bound Programs", M.S. Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, Report No. 340 (June, 1969).

[18] Graham, W. R., "The Parallel and the Pipeline Computers", Datamation (April, 1970), pp. 68-71.

[19] Harary, F., Norman, R. Z. and Cartwright, D., Structural Models: An Introduction to the Theory of Directed Graphs, John Wiley and Sons, Inc., New York (1965).

[20] Hellerman, H., "Parallel Processing of Algebraic Expressions", IEEE Trans. of Electronic Computers, EC-15 (January, 1966), pp. 82-91.

[21] Hu, T. C., "Parallel Sequencing and Assembly Line Problems", Operations Research, 9 (November-December, 1961), pp. 841-848.

[22] Knowls, M., et al., "Matrix Operations on ILLIAC IV", Department of Computer Science, University of Illinois at Urbana-Champaign, Report No. 222 (March, 1967).

[23] Knuth, D. E., The Art of Computer Programming, Vol. 2, Addison-Wesley Publishing Company, Inc., New York (1969).

[24] Kuck, D. J., "ILLIAC IV Software and Application Programming", IEEE Trans. of Computers, C-17 (August, 1968), pp. 758-769.
[25] Kuck, D. J. and Muraoka, Y., "A Machine Organization for Arithmetic Expression Evaluation and an Algorithm for Tree Height Reduction", unpublished (September, 1969).

[26] Kuck, D. J., "A Preprocessing High Speed Memory System", to be published.

[27] Logan, J. R., "A Design Technique for Digital Squaring Networks", Computer Design (February, 1970), pp. 84-88.

[28] Minsky, M. L., Computation: Finite and Infinite Machines, Prentice-Hall, Inc., New Jersey (1967).

[29] Motzkin, T. S., "Evaluation of Polynomials and Evaluation of Rational Functions", Bull. A.M.S., 61 (1955), p. 163.

[30] Murtha, J. C., "Highly Parallel Information Processing Systems", in Advances in Computers, Vol. 7, Academic Press, Inc., New York (1966), pp. 2-116.

[31] Muntz, R. R. and Coffman, E. G., "Optimal Preemptive Scheduling on Two-Processor Systems", IEEE Trans. of Computers, C-18 (November, 1969), pp. 1014-1020.

[32] Nievergelt, J., "Parallel Methods for Integrating Ordinary Differential Equations", Comm. ACM, 7 (December, 1964), pp. 731-733.

[33] Noyce, R. N., "Making Integrated Electronics Technology Work", IEEE Spectrum, 5 (May, 1968), pp. 63-66.

[34] Ostrowski, A. M., "On Two Problems in Abstract Algebra Connected with Horner's Rule", in Studies in Mathematics and Mechanics Presented to R. von Mises, Academic Press, New York (1954), pp. 40-48.

[35] Pan, V. Ya., "Methods of Computing Values of Polynomials", Russian Mathematical Surveys, 21 (January-February, 1966), pp. 105-136.

[36] Ramamoorthy, C. V. and Gonzalez, M. J., "A Survey of Techniques for Recognizing Parallel Processable Streams in Computer Programs", Proc. of the Fall Joint Computer Conference (1969), pp. 1-15.

[37] Russel, E. C., "Automatic Program Analysis", University of California, Los Angeles, Report No. 69-72 (March, 1969).

[38] Shedler, G. S. and Lehman, M. M., "Evaluation of Redundancy in a Parallel Algorithm", IBM Systems Journal, 6, 3 (1967), pp. 142-149.

[39] Squire, J. S., "A Translation Algorithm for a Multiple Processor Computer", Proc. of the 18th ACM National Conference (1963).

[40] Stone, H. S., "One-Pass Compilation of Arithmetic Expressions for a Parallel Processor", Comm. ACM, 10 (April, 1967), pp. 220-223.

[41] Thompson, R. N. and Wilkinson, J. A., "The D825 Automatic Operating and Scheduling Program", in Programming Systems and Languages, McGraw-Hill, New York (1967), pp. 647-660.

[42] Winograd, S., "On the Time Required to Perform Addition", J. ACM, 12 (April, 1965), pp. 277-285.

[43] Winograd, S., "The Number of Multiplications Involved in Computing Certain Functions", Proc. of the IFIP Congress (1968), pp. 276-279.

VITA

Yoichi Muraoka was born in Sendai, Japan, on July 20, 1942. He graduated from Waseda University, Tokyo, Japan, in Electrical Engineering in March, 1965, and started his graduate study at the Graduate College of Waseda University. Since September 1966 he has been a research assistant with the ILLIAC IV project in the Department of Computer Science of the University of Illinois at Urbana-Champaign. In 1969 he received the degree of Master of Science in Computer Science. He is a member of the Association for Computing Machinery and the Institute of Electrical and Electronics Engineers.