Report No. 424

PARALLELISM EXPOSURE AND EXPLOITATION IN PROGRAMS

by Yoichi Muraoka

February, 1971

DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
URBANA, ILLINOIS

PARALLELISM EXPOSURE AND EXPLOITATION IN PROGRAMS

BY
YOICHI MURAOKA
B.Eng., Waseda University, 1965
M.S., University of Illinois, 1969

THESIS
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, 1971

Urbana, Illinois

ACKNOWLEDGEMENT

The author would like to express his deepest gratitude to Professor David J. Kuck of the Department of Computer Science of the University of Illinois, whose encouragement and good advice have led this work to successful completion. Paul Kraska read the thesis and provided valuable comments. Special thanks go to Mrs. Linda Bridges, without whose excellent job of typing the final form would never have come out. Thanks are also extended to Mrs. Diana Mercer, who helped in getting the thesis finished on time.

TABLE OF CONTENTS

1. INTRODUCTION
2. PARALLEL COMPUTATION OF SUMMATIONS, POWERS AND POLYNOMIALS
   2.1 Introduction
   2.2 Summation of n Numbers
   2.3 Computation of Powers
   2.4 Computation of Polynomials
       2.4.1 Computation of a Polynomial on an Arbitrary Size Machine
             2.4.1.1 k-th Order Horner's Rule
             2.4.1.2 Estrin's Method
             2.4.1.3 Tree Method
             2.4.1.4 Folding Method
             2.4.1.5 Comparison of Four Methods
       2.4.2 Polynomial Computation by the k-th Order Horner's Rule
3. TREE HEIGHT REDUCTION ALGORITHM
   3.1 Introduction
   3.2 Tree Height and Distribution
   3.3 Holes and Spaces
       3.3.1 Introduction
       3.3.2 Holes
       3.3.3 Space
   3.4 Algorithm
       3.4.1 Distribution Algorithm
       3.4.2 Implementation
   3.5 Discussion
       3.5.1 The Height of a Tree
       3.5.2 Introduction of Other Operators
             3.5.2.1 Subtraction and Division
             3.5.2.2 Relational Operators
4. COMPLETE PROGRAM HANDLING
   4.1 Back Substitution -- A Block of Assignment Statements and an Iteration
   4.2 Loops
   4.3 Jumps
   4.4 Error Analysis
5. PARALLELISM BETWEEN STATEMENTS
   5.1 Program
   5.2 Equivalent Relations Between Executions
6. PARALLELISM IN PROGRAM LOOPS
   6.1 Introduction
       6.1.1 Replacement of a for Statement with Many Statements
       6.1.2 A Restricted Loop
   6.2 A Loop With a Single Body Statement
       6.2.1 Introduction
       6.2.2 Type 1 Parallelism
             6.2.2.1 General Case
             6.2.2.2 A Restricted Loop
             6.2.2.3 Temporary Locations
       6.2.3 Type 2 Parallelism
       6.2.4 Conclusion
   6.3 A Loop With Many Body Statements
       6.3.1 Introduction
       6.3.2 Parallel Computation with Respect to a Loop Index
       6.3.3 Separation of a Loop
             6.3.3.1 Introduction
             6.3.3.2 The Ordering Relation and Separation of a Loop
             6.3.3.3 Temporary Storage
       6.3.4 Parallelism Between Body Statements
             6.3.4.1 Introduction
             6.3.4.2 The Statement Dependence Graph and the Algorithm
       6.3.5 Discussion
7. EQUALLY WEIGHTED -- TWO PROCESSOR SCHEDULING PROBLEM
   7.1 Introduction
   7.2 Job Graph
   7.3 Scheduling of a Tight Graph
   7.4 Scheduling of a Loose Graph
   7.5 Supplement
8. CONCLUSION
LIST OF REFERENCES
VITA

LIST OF TABLES

2.1. The Parallel Computation Time for Summation, Power and Polynomial
2.2. The Number of Steps Required to Compute Σ_{i=1}^{n} a_i on P(m), h_a(m,n), for n ≤ 10
2.3. Computation of p_n(x) by the Folding Method
2.4. The Number of Steps Required to Compute p_n(x), h_p(m,n), for n ≤ 10
4.1. Comparison of Back Substituted and Non-Back Substituted Computation -- Iteration Formulas
4.2. Comparison of Back Substituted and Non-Back Substituted Computation -- General Cases

LIST OF FIGURES

1.1. Statement Dependence Relation
1.2. Trees for ((a+b)+(c+d)) and (((a+b)+c)+d)
1.3. Trees for a + b×c + d and b×c + a + d
1.4. Trees for a(bcd+e) and abcd + ae
2.1. The Minimum Number, M, of PE's Required to Add n Numbers in the Minimum Time
2.2. Computation of x^i (1)
2.3. Computation of x^i (2)
2.4. Computation of x^i (3)
2.5. Computation of x^i (4)
2.6. Computation of a_i x^i
2.7. A Tree for p_{s+t-j}(x)
2.8. A Tree for p_{s+t}(x)
2.9. Comparison of the Four Parallel Polynomial Computation Schemes
2.10. k-th Order Horner's Rule
2.11. The Number of Steps, h_p(m,n), to Compute p_n(x) on P(m) by the m-th Order Horner's Rule
2.12. The Minimum Number, M, of PE's Required to Compute p_n(x) in the Minimum Time
3.1. An Arithmetic Expression Tree (1)
3.2. An Arithmetic Expression Tree (2)
3.3. Free Nodes
3.4. Free Nodes in a Tree
3.5. An Example of F_A and F_R
3.6. Elimination of a Free Node
3.7. A Minimum Height Tree
3.8. Attachment of T[t'] to a Free Node
3.9. An Example of Space (1)
3.10. An Example of Space (2)
3.11. Distribution of t' over A
3.12. Tree Height Reduction by Hole Creation
3.13. Stacks for an Arithmetic Expression
4.1. A Back Substituted Tree
4.2. Loop Analysis
4.3. A Tree with a Boolean Expression
4.4. Trees for a(bc+d) + e and abc + ad + e
5.1. Conditions for the Output Equivalence
6.1. E_0
6.2. E[I]
6.3. Conditions of Parallel Computation in a Loop
6.4. An Illustration of t
6.5. Wave Front
6.6. Wave Front Travel
6.7. An Illustration for Theorem 4
6.8. An Execution by a Wave Front
6.9. Simultaneous Execution of Body Statements
6.10. Execution of P_B
6.11. An Introduction of Temporary Locations
6.12. Wave Front for Simultaneous Execution of Body Statements
6.13. A Wave Front for Example 10
7.1. Computation of Nondistributed and Distributed Arithmetic Expressions on P(2)
7.2. Common Expression
7.3. A Loose Graph and a Tight Graph
7.4. A Graph G
7.5. An Illustration for Lemma 3
7.6. An Example of a Tight Graph Scheduling
7.7. An Illustration for Lemma 11
7.8. A Loose Node
7.9. The p-line Relation in B_n
7.10. An Example of the Maximum p-connectable Distance
7.11. An Illustration for Lemma 13
7.12. An Example for A(-)
7.13. An Example for p-connectivity Discovery

1. INTRODUCTION

1.1 Introduction

The purpose of this research is to study compiling techniques for parallel processing machines.

Due to remarkable innovations in technology today, such as the introduction of LSI, it has become feasible to introduce more hardware into computer systems to attain otherwise impossible high speeds.
For example, Winograd [42] showed that the minimum amount of time required to add two t-bit numbers is ⌈log₂ t⌉·d (⌈x⌉ denotes the smallest integer not smaller than x), where we assume that an adder consists of two-input binary logic elements, e.g. AND or OR gates, and d is the delay time per gate. An adder which realizes this speed requires a huge number of gates, e.g. approximately 1300 gates for t = 6 [12], and it had been out of the question to build such an adder. However, the introduction of LSI has reduced the cost of a gate significantly; e.g. it has been anticipated that by 1974 the cost of LSI would be reduced to 0.7 cent per gate [33].

Another example is the class of parallel processing machines. The Illiac IV [7], the CDC 6600 [4] and the D825 [41] are included in this class. A machine in this class has, for example, many arithmetic units to allow simultaneous execution of arithmetic operations. As an extreme case it has been suggested to include special arithmetic units, e.g. a log taking unit (ln x) and an exponent unit (e^x) [16]. (Such being the case, this decade may be marked as a "computer architecture" race, reminiscent of the cycle-time and multiprogramming races of the 60's [12].) We shall not go into the details of machines further. An extensive survey of parallel processing machines is found in [30].

Having a parallel processing machine which is capable of processing many operations simultaneously, we are faced with the problem of exploiting parallelism in a program so that computational resources are kept as busy as possible and the program is processed in the shortest time. We now discuss the problem in detail.

In this thesis, by a parallel (processing) machine P we understand a set of arbitrarily many identical elements called processing elements (PE's). A PE is assumed to be capable of performing any binary arithmetic operation, e.g. addition and multiplication, in the same amount of time. Furthermore, we assume that data can be transferred between any PE's instantaneously. We write P(m) if P has only m PE's. A machine of this nature may be considered as a generalization of the Illiac IV.

To date, two types of parallelism exploitation techniques are known to compile a program written in a conventional programming language (e.g. ALGOL) for a parallel processing machine [36]. They may be termed the intra-statement and the inter-statement parallelism exploitation techniques.

The first technique is to analyze the parallelism which exists within a statement, e.g. an arithmetic expression; this has been explored by Stone [40], Squire [39], Hellerman [20], and Baer and Bovet [6]. For example, consider the arithmetic expression a + b + c + d + e + f + g + h and a syntactic tree for it: the balanced binary tree which forms the four sums a+b, c+d, e+f and g+h on level 1, two sums on level 2, and the final sum on level 3. The tree is such that operations on the same level may be done in parallel. The height of a tree is the maximum level of the tree and indicates the number of steps required to evaluate an arithmetic expression in parallel. Note that there may be many different syntactic trees for an arithmetic expression, and among them a tree with the minimum height should be chosen to attain the minimum parallel computation time. Baer and Bovet's algorithm is claimed to achieve this end, i.e. to build the minimum height syntactic tree for an arithmetic expression [6].
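To make the level-by-level schedule concrete, here is a small sketch (added for illustration, in Python; it is not part of the thesis) that evaluates the eight-variable sum above one level at a time and counts the parallel steps; the count equals the tree height ⌈log₂ 8⌉ = 3:

    def parallel_sum_steps(operands):
        """Reduce the operand list pairwise and count parallel steps."""
        steps = 0
        while len(operands) > 1:
            # all of these pairwise additions occur on one level
            paired = [operands[i] + operands[i + 1]
                      for i in range(0, len(operands) - 1, 2)]
            if len(operands) % 2:          # an odd operand is carried upward
                paired.append(operands[-1])
            operands = paired
            steps += 1
        return operands[0], steps

    total, height = parallel_sum_steps([1, 2, 3, 4, 5, 6, 7, 8])
    print(total, height)   # 36, 3 : eight variables are summed in three steps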
Exploitation of inter-statement parallelism has also been studied [10], [37]. An outcome of these works is an algorithm (the dependence relation detection algorithm [10]) which detects the dependence relation between statements in a loop- and jump-free sequence of statements. The dependence relation between S and S' holds if S precedes S' in the sequence and S' uses the output of S as an input. For example, the algorithm detects that the statement S1 in Figure 1.1 must be computed before the statement S2, but that S1 may be computed simultaneously with S3.

    S1: x := f1(y);
    S2: u := f2(x);
    S3: v := f3(w);
    S4: z := f4(v,u);

    (a) program        (b) dependence relation

Figure 1.1. Statement Dependence Relation

Since in a real program the major part of the execution time is spent within loops if it is executed sequentially, the major effort should be directed toward detecting inter-statement parallelism in loops. For example, we would like to find out that all fifty statements, A[1] := f(B[1]), ..., A[50] := f(B[50]), in the loop

    E: for I := 1 step 1 until 50 do A[I] := f(B[I])

may be executed simultaneously, to reduce the computation time to one fiftieth of the original. A technique available now which detects inter-statement parallelism inside a loop requires the loop to be first replaced with (expanded to) a sequence of statements, e.g. E in the above example must be replaced with the sequence of fifty statements A[1] := f(B[1]), ..., A[50] := f(B[50]), so that the dependence relation detection algorithm can be applied [10]. Obviously this approach obscures an advantage of the introduction of loops into a program, because essentially all loops are required to be removed from the program and replaced with straight-line programs before the dependence relation detection algorithm can be applied to them.
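The following sketch illustrates the dependence relation on the program of Figure 1.1. The representation, one output variable and a set of input variables per statement, is an assumption made for this example only; the actual algorithm is that of reference [10]:

    statements = [
        ("S1", "x", {"y"}),
        ("S2", "u", {"x"}),
        ("S3", "v", {"w"}),
        ("S4", "z", {"v", "u"}),
    ]

    def dependence_relation(stmts):
        """S depends on an earlier S' if S reads the output of S'."""
        edges = []
        for i, (name_i, out_i, _) in enumerate(stmts):
            for name_j, _, ins_j in stmts[i + 1:]:
                if out_i in ins_j:
                    edges.append((name_i, name_j))
        return edges

    print(dependence_relation(statements))
    # [('S1', 'S2'), ('S2', 'S4'), ('S3', 'S4')] : S1 must precede S2,
    # but S3 may run simultaneously with S1 and S2.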
The techniques described above find parallelism inside and between statements as they are presented. If the size of a machine (i.e. the number of PE's) is unlimited, however, then it becomes necessary to exploit more parallelism from a program than the above approaches provide. One obvious strategy is to write a completely new program using, e.g., parallel numerical methods [32], [38]. The other approach, which we will pursue here, is to transform a given program to "squeeze" more parallelism from it. While the first approach requires programmers (or users) to reanalyze problems and reprogram, the second approach tries to accept existing sequential programs written in, e.g., ALGOL and execute them in parallel.

First we study parallel computation of an arithmetic expression more carefully along this line. For the sake of argument, let us assume that an arithmetic expression consists of additions, multiplications and possibly parentheses. Then the associative, the commutative and the distributive laws hold. The first and second laws have already been used to exploit more parallelism from an arithmetic expression. For example, the associative law allows one to compute the arithmetic expression a + b + c + d as ((a+b) + (c+d)) in two steps rather than as (((a+b)+c)+d), which requires three steps.

Figure 1.2. Trees for ((a+b)+(c+d)) and (((a+b)+c)+d)

Also it has been recognized that the commutative law together with the associative law gives a lower height tree. For example, ((a + b×c) + d) requires three steps while (b×c + (a + d)) requires two [39].

Figure 1.3. Trees for a + b×c + d and b×c + a + d

Now we turn our interest to the third law, the distributive law, and see if it can help speed up computation. As we can readily see, there are cases when distribution helps. For example, a(bcd + e) requires four steps while its equivalent abcd + ae, which is obtained by distributing a over bcd + e, can be computed in three steps.

Figure 1.4. Trees for a(bcd+e) and abcd + ae

However, distribution does not always speed up computation. For example, the undistributed form ab(c+d) can be computed in fewer steps than the distributed form abc + abd. Hence indiscriminate distribution is not the solution to the problem. Chapter 3 of this thesis studies this situation and gives an algorithm which we call the distribution algorithm. Given an arithmetic expression A, the distribution algorithm derives the arithmetic expression A^d by distributing multiplications over additions properly so that the height of A^d (we write h[A^d] for this) is minimized. The algorithm works from the innermost parenthesis level to the outermost parenthesis level of an arithmetic expression and requires only one scan through the entire arithmetic expression. Chapter 3 concludes by giving a measure of the height of the minimum height tree for A as well as A^d as a function of fundamental values such as the number of single variable occurrences in A.

The idea is extended to handle a sequence of assignment statements in Chapter 4. The distribution algorithm is applied to the arithmetic expression which is obtained by back substituting the statements into one another. Suppose we have a sequence of n assignment statements A_1, A_2, ..., A_n, and we get the assignment statement A from this sequence by back substitution. If the sequence is computed sequentially, i.e. one statement after another, but each statement is computed in parallel, then it will take h[A_1] + h[A_2] + ... + h[A_n] steps to compute the sequence (where h[A_i] is the height of the minimum height tree for A_i). Instead we may compute the back substituted statement A in parallel, which requires h[A] steps. Obviously

    h[A_1] + ... + h[A_n] ≥ h[A]

holds. Chapter 4 discusses cases when the strict inequality holds. The cases include iteration formulas such as x_{i+1} := a × x_i + b.

Next we study inter-statement parallelism in terms of program loops. Chapter 6 first establishes a new algorithm which detects inter-statement parallelism in a loop. The algorithm is such that it only examines index expressions and the way index values vary in a loop to detect parallel computability. For example, the algorithm checks the index expressions I and I + 1 as well as the clause "I := 1 step 1 until 20" in the loop

    for I := 1 step 1 until 20 do A[I] := A[I+1] + B

and detects that all twenty statements, A[1] := A[2] + B, ..., A[20] := A[21] + B, may be computed simultaneously. Thus it is not necessary to expand a loop into a sequence of statements, as was required before, to check inter-statement parallelism. In general, the amount of work (i.e. the time) required by the algorithm is proportional to the number of index expression occurrences in statements in a loop.

Having established the algorithm, Chapter 6 further introduces two techniques which help to exploit more inter-statement parallelism in loops. These are the introduction of temporary locations and the distribution of a loop. The second technique resembles the idea introduced in Chapter 3, i.e. reduction of tree height for an arithmetic expression by distribution.
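A brute-force version of such a test can be sketched as follows (an illustration only: the thesis's algorithm works symbolically on the index expressions rather than by enumeration, and the lock-step model below, in which every iteration reads before any iteration writes, is our assumption for the example):

    def simultaneous_ok(write_idx, read_idxs, lo, hi):
        """True if running all iterations in lock step gives the sequential
        result: no iteration reads a location that an earlier iteration
        writes, and no two iterations write the same location."""
        writes = [write_idx(i) for i in range(lo, hi + 1)]
        if len(set(writes)) != len(writes):
            return False
        written_so_far = set()
        for i in range(lo, hi + 1):
            for read in read_idxs:
                if read(i) in written_so_far:
                    return False       # flow dependence across iterations
            written_so_far.add(write_idx(i))
        return True

    print(simultaneous_ok(lambda i: i, [lambda i: i + 1], 1, 20))  # True
    print(simultaneous_ok(lambda i: i, [lambda i: i - 1], 1, 20))  # False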
Let us write I, J, K(S1, S2, S3) for an ALGOL-like program

    for I := i_1, i_2, ..., i_m do
      for J := j_1, j_2, ..., j_n do
        for K := k_1, k_2, ..., k_p do
          begin S1; S2; S3 end.

Furthermore, by e.g. [I, J], K(S1, S2, S3) we understand a loop*

    for (I, J) := (i_1, j_1), (i_1, j_2), ..., (i_1, j_n), (i_2, j_1), ..., (i_m, j_n) do
      for K := k_1, k_2, ..., k_p do
        begin S1; S2; S3 end.

*This is equivalent to a TRANQUIL expression [2]: for (I, J) seq ((i_1, j_1); (i_1, j_2); ...; (i_m, j_n)) do.

Then, as in the case of arithmetic expressions, we may establish the following:

(a) Association: introduction of brackets, e.g. I, [J, K](S1, S2, S3).
(b) Commutation: change of the order of I, J, K, e.g. I, K, J(S1, S2, S3).
(c) Distribution: distribution of I, J, K over S1, S2, S3, e.g. I, J, K(S1), I, J, K(S2, S3).

While the associative law always holds, e.g. I, J, K(S) = [I, J], K(S), the commutative and the distributive laws do not necessarily hold for all loops; e.g. I, J(S) ≠ J, I(S) if I, J(S) represents the loop

    for I := 1, 2, 3 do
      for J := 1, 2, 3 do
        A[I,J] := A[I+1, J-1].

In short, Chapter 6 shows that commutation indicates the possibility of computing a loop in parallel as it is, and distribution indicates the possibility of introducing more parallelism into a program. For example, if I, J, K(S) = K, I, J(S), then S can be computed simultaneously for all values of K while I and J vary sequentially. Next suppose a loop I(S1, S2) cannot be computed in parallel for all values of I. Then in a certain case it is possible to distribute and obtain two loops I(S1), I(S2) which are equivalent to the original loop I(S1, S2), and execute each of the two loops in parallel for all values of I separately. Chapter 6 gives an algorithm to distribute to attain this end.

The thesis thus introduces new techniques which transform a given program to expose hidden parallelism.

All results in this thesis are also readily applicable to another type of machine, i.e. machines with a pipeline arithmetic unit such as the CDC STAR [18] (we regard this type of machine as a special type of parallel machine and call them serial array machines). Each stage of a pipeline unit may be regarded as an independent PE in the sense that an operation being processed in one stage of a pipeline unit must not depend on an operation being processed in a different stage. Hence exploiting parallelism results in busying many stages at once.

Two more chapters are included in this thesis to make it complete. Chapter 2 studies parallel computation of special cases of arithmetic expressions, e.g. powers and polynomials, in detail to give a measure of the power of a parallel processing machine. As was mentioned before, unless specially mentioned, it will be assumed that there are a sufficient number of PE's available to perform the desired task. In reality, however, that may not be the case and nontrivial scheduling problems may arise. To give some insight into this problem, Chapter 7 discusses a solution to the two processor, equally weighted job scheduling problem.

We conclude this chapter by defining the following symbols:

⌈x⌉ ... the smallest integer not smaller than x,
⌊x⌋ ... the largest integer not larger than x, and
⟨x⟩ ... the smallest power of 2 not smaller than x.

Also, unless specified, the base of logarithms is assumed to be 2, e.g. log n is log₂ n.
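For later illustrations, the three symbols can be realized by the following small helpers (a sketch; the function names are ours, not the thesis's, and these helpers are reused in the sketches of Chapter 2):

    def ceil_div(a, b):            # smallest integer not smaller than a/b
        return -(-a // b)

    def ceil_log2(x):              # smallest k with 2**k >= x
        return max(0, (x - 1).bit_length())

    def floor_log2(x):             # largest k with 2**k <= x
        return x.bit_length() - 1

    def pow2_ceiling(x):           # smallest power of 2 not smaller than x
        return 1 << ceil_log2(x)

    assert ceil_log2(17) == 5 and floor_log2(17) == 4 and pow2_ceiling(17) == 32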
2. PARALLEL COMPUTATION OF SUMMATIONS, POWERS AND POLYNOMIALS

2.1 Introduction

In this chapter, we study the parallel computation of summations, powers and polynomials. We first assume that m processors (PE's) are available. Then the parallel computation times for the summation Σ_{i=1}^{n} a_i and the power x^n are given as functions of m and n. The minimum time to evaluate Σ_{i=1}^{n} a_i or x^n, as well as the minimum number of PE's required to attain it, is also derived.

Polynomial computation is first studied assuming the availability of an arbitrary number of PE's. The lower bound on the computation time for a polynomial of degree n, p_n(x), is presented. A scheme which computes p_n(x) in less time than any known scheme is obtained. Because of its simplicity in scheduling, the k-th order Horner's rule is studied further in detail. It is shown that for this algorithm the availability of more PE's sometimes increases the computation time.

Table 2.1 summarizes a part of the results of this chapter. Before we go further, a few comments are in order. The base of logarithms in this chapter is 2, e.g. log n is actually log₂ n. The following lemma will be frequently referred to in the text.

Table 2.1. The Parallel Computation Time for Summation, Power and Polynomial

Lemma 1: For nonzero positive real numbers a and b,
(1) ⌈log a⌉ - ⌈log b⌉ ≤ ⌈log a - log b⌉,
(2) a ≤ ⌈a⌉ < a + 1, and
(3) b - 1 < ⌊b⌋ ≤ b.

Proof: (1) Let a = 2^h + k and b = 2^f + g where 0 ≤ k ≤ 2^h - 1 and 0 ≤ g ≤ 2^f - 1. The proof is divided into four cases: (i) k, g > 0; (ii) k = g = 0; (iii) k = 0, g > 0; and (iv) k > 0, g = 0.
(i) k, g > 0. Then ⌈log a⌉ - ⌈log b⌉ = (h + 1) - (f + 1) = h - f. Also let log a = h + x and log b = f + y where 0 < x, y < 1. Then ⌈log a - log b⌉ ≥ h - f. Thus ⌈log a⌉ - ⌈log b⌉ ≤ ⌈log a - log b⌉. The other three cases may be proved similarly and the details are omitted.
(2) and (3) follow from the definitions. (Q.E.D.)

2.2 Summation of n Numbers

Theorem 1: The minimum number of steps, h_a(m,n), to add n numbers on P(m) is

    h_a(m,n) = (1) n - 1, if m = 1;
               (2) ⌊n/m⌋ - 1 + ⌈log(m + n - ⌊n/m⌋·m)⌉, if ⌊n/2⌋ > m ≥ 2;
               (3) ⌈log n⌉, if m ≥ ⌊n/2⌋.

Proof: (1) is self evident. (3) uses the so-called log sum method [22], or the tree method (see Theorem 1 of Chapter 3). It is clear that ⌈log n⌉ steps suffice and also that n numbers cannot be added in fewer steps. Now we prove (2). First each PE adds ⌊n/m⌋ numbers independently. This takes ⌊n/m⌋ - 1 steps and produces m partial sums. Then there will be m + (n - ⌊n/m⌋·m) numbers left. Clearly m + n - ⌊n/m⌋·m < 2m. Those numbers are added by the log sum method, which takes ⌈log(m + n - ⌊n/m⌋·m)⌉ steps. (Q.E.D.)
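The closed form can be checked against a direct simulation of the schedule used in the proof (a sketch; h_a below is the theorem's formula, and the simulation replays the two phases of the proof):

    def h_a(m, n):
        """Closed form of Theorem 1 (n >= 2, m >= 1)."""
        if m == 1:
            return n - 1
        if m >= n // 2:
            return (n - 1).bit_length()          # ceil(log2 n)
        return n // m - 1 + (m + n - (n // m) * m - 1).bit_length()

    def simulate(m, n):
        if m == 1:
            return n - 1
        steps = n // m - 1                        # independent serial phase
        remaining = m + n - (n // m) * m          # partial sums left over
        while remaining > 1:                      # log sum phase
            remaining = -(-remaining // 2)
            steps += 1
        return steps

    for n in range(2, 60):
        for m in range(1, n // 2 + 1):
            assert h_a(m, n) == simulate(m, n)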
Now we show that for a fixed n, ⌊n/2⌋ ≥ m > m' implies that h_a(m,n) ≤ h_a(m',n). To prove this it is enough to show that h_a(m+1,n) ≤ h_a(m,n) where m + 1 ≤ ⌊n/2⌋. There are two cases:

(1) ⌊n/m⌋ = ⌊n/(m+1)⌋ = k > 1. Let n = km + p (p < m). Then

    m + n - ⌊n/m⌋·m = m + p

and

    (m+1) + n - ⌊n/(m+1)⌋·(m+1) = m + p + 1 - k.

Hence ⌈log(m + n - ⌊n/m⌋·m)⌉ ≥ ⌈log((m+1) + n - ⌊n/(m+1)⌋·(m+1))⌉, or h_a(m,n) ≥ h_a(m+1,n).

(2) ⌊n/m⌋ > ⌊n/(m+1)⌋. Let ⌊n/m⌋ = k and ⌊n/(m+1)⌋ = k - g, where k, g ≥ 1. Then

    n = km + p (p < m)                                   (1)

and

    n = (k - g)(m+1) + p' (p' < m + 1).                  (2)

Suppose h_a(m,n) < h_a(m+1,n), i.e.

    ⌊n/m⌋ - 1 + ⌈log(m + n - ⌊n/m⌋·m)⌉ < ⌊n/(m+1)⌋ - 1 + ⌈log((m+1) + n - ⌊n/(m+1)⌋·(m+1))⌉.   (3)

By substituting Eq. (1) and (2) into Eq. (3) and rearranging, we get

    g < ⌈log(m + 1 + p')⌉ - ⌈log(m + p)⌉.                (4)

By Lemma 1(1), Eq. (4) can hold only if

    g < ⌈log((m + 1 + p')/(m + p))⌉,                     (5)

and Eq. (5) holds only if

    2^g < (m + 1 + p')/(m + p).                          (6)

Since

    2 + 1/m > (m + 1 + p')/(m + p),                      (7)

Eq. (6) can hold only if g = 1 (remember that g ≥ 1). By letting g = 1 in Eq. (2), we get p' = km + p - (k - 1)(m + 1). Then by substituting this and g = 1 into Eq. (6) and rearranging, we get p < 2 - k. This only holds if k = 1 and p = 0, which implies that n = m (see Eq. (1)) and contradicts our assumption that m + 1 ≤ ⌊n/2⌋. Hence h_a(m,n) < h_a(m+1,n) never holds.

The above two cases prove that h_a(m+1,n) ≤ h_a(m,n). Thus we have the following lemma.

Lemma 2: h_a(m',n) ≥ h_a(m,n) if m' ≤ m ≤ ⌊n/2⌋.

The above lemma may seem insignificant. In Section 2.4.2, however, it will be shown that for a certain algorithm to compute an n-th degree polynomial on P(m), the computation time, h_p(m,n), is not a nonincreasing function of m, i.e. m > m' does not necessarily imply that h_p(m,n) ≤ h_p(m',n). As will be described later, that algorithm is such that all PE's are forced to participate in the computation. It is true that if we are allowed to "turn off" some PE's, then we always get h_p(m,n) ≤ h_p(m',n) if m > m'. Then a question is how many PE's are to be turned off. These problems will be studied in Section 2.4.2.

It should be noted that the minimum number of PE's required to achieve the minimum computation time is not necessarily ⌊n/2⌋. For example, let n = 17. Then ⌊17/2⌋ = 8 and the minimum is h_a(8,17) = 5; but also h_a(6,17) = 5. As we know, the minimum computation time to add n numbers is ⌈log n⌉. Now we present the minimum number of PE's, M, which achieves this bound.

Theorem 2: For a fixed n, let M be the smallest m such that

    ⌊n/m⌋ - 1 + ⌈log(m + n - ⌊n/m⌋·m)⌉ = ⌈log n⌉.

Then

    M = (1) 1 + ⌊(n-1)/2⌋ - 2^{k-2}, if 2^k < n ≤ 2^k + 2^{k-1};
        (2) n - 2^k, if 2^k + 2^{k-1} < n ≤ 2^{k+1},

where k = ⌊log(n-1)⌋.
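Before turning to the proof, the claimed closed form can be checked numerically against a direct search (a sketch, reusing h_a from the previous sketch):

    def M_closed(n):
        k = (n - 1).bit_length() - 1              # floor(log2(n - 1))
        if n <= 2 ** k + 2 ** (k - 1):
            return 1 + (n - 1) // 2 - 2 ** (k - 2)
        return n - 2 ** k

    def M_search(n):                              # least m reaching ceil(log2 n)
        m = 1
        while h_a(m, n) != (n - 1).bit_length():
            m += 1
        return m

    for n in range(5, 200):
        assert M_closed(n) == M_search(n), n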
Proof: For k ≤ 3, direct examination shows that the theorem holds (see Table 2.2). Therefore we assume that k > 3. The proof is divided into two parts: (1) 2^k < n ≤ 2^k + 2^{k-1} and (2) 2^k + 2^{k-1} < n ≤ 2^{k+1}.

(1) Let

    n = 2^k + p (1 ≤ p ≤ 2^{k-1}).                       (8)

By the statement of the theorem,

    M = 1 + ⌊(n-1)/2⌋ - 2^{k-2},                          (9)

and since m = M falls under case (2) of Theorem 1,

    h_a(M,n) = ⌊n/M⌋ - 1 + ⌈log(M + n - ⌊n/M⌋·M)⌉.        (10)

First we prove that h_a(M,n) ≤ ⌈log n⌉ = k + 1. From Eq. (8) and (9) we get

    M = 2^{k-2} + 1 + ⌊(p-1)/2⌋.                          (11)

Using Lemma 1(2) and (3), a direct computation shows that for all p (1 ≤ p ≤ 2^{k-1}, k ≥ 4)

    ⌊n/M⌋ = 3.                                            (12)

Substituting this into Eq. (10), we get h_a(M,n) = 2 + ⌈log(n - 2M)⌉. Subtracting two times Eq. (11) from Eq. (8),

    n - 2M = 2^{k-1} + p - 2 - 2⌊(p-1)/2⌋.                (13)

Eq. (13) is evaluated in three ways according to the value of p:
(i) p = 2g (g ≥ 1): n - 2M = 2^{k-1};
(ii) p = 2g + 1 (g ≥ 1): n - 2M = 2^{k-1} - 1 < 2^{k-1};
(iii) p = 1: n - 2M = 2^{k-1} - 1 < 2^{k-1}.
Hence in any case n - 2M ≤ 2^{k-1}, or ⌈log(n - 2M)⌉ ≤ k - 1. Thus h_a(M,n) ≤ k + 1 = ⌈log n⌉. This proves the first part of (1).

Next we prove that h_a(m,n) > ⌈log n⌉ for all m < M. It suffices to show that h_a(M-1,n) > ⌈log n⌉; Lemma 2 then extends the claim to all m < M. In the same way as Eq. (12), one shows

    ⌊n/(M-1)⌋ = 3,                                        (14)

so h_a(M-1,n) = 2 + ⌈log(n - 2(M-1))⌉. From Eq. (8) and (11),

    n - 2(M-1) = 2^{k-1} + p - 2⌊(p-1)/2⌋ ≥ 2^{k-1} + 1,

or ⌈log(n - 2(M-1))⌉ ≥ k. Hence h_a(M-1,n) ≥ k + 2 > k + 1 = ⌈log n⌉, and this proves (1).

(2) 2^k + 2^{k-1} < n ≤ 2^{k+1}. Let

    n = 2^k + 2^{k-1} + p (1 ≤ p ≤ 2^{k-1}),              (15)

so that M = n - 2^k = 2^{k-1} + p. Since 2^{k-1} < M ≤ 2^k and n = 2^k + M, we have ⌊n/M⌋ = 2 for all such n. Then M + n - ⌊n/M⌋·M = n - M = 2^k, so h_a(M,n) = 1 + ⌈log 2^k⌉ = k + 1 = ⌈log n⌉. Finally, one shows as in part (1) that h_a(M-1,n) > h_a(M,n), which together with Lemma 2 completes the proof of (2). (Q.E.D.)

Table 2.2. The Number of Steps Required to Compute Σ_{i=1}^{n} a_i on P(m), h_a(m,n), for n ≤ 10

Figure 2.1. The Minimum Number, M, of PE's Required to Add n Numbers in the Minimum Time

2.3 Computation of Powers

Lemma 3 [23]: Let N be the number of ones in the binary representation of n. Then the near minimum number of steps to compute x^n on P(1), h_e(1,n), is

    h_e(1,n) = ⌊log n⌋ + N - 1.

Up to now, there is no result on the minimum computation time to evaluate x^n [23]. Thus we shall settle for an approximation. For example, let n = 15. Then h_e(1,15) = ⌊log 15⌋ + 4 - 1 = 6. On the other hand, x^15 can be evaluated in fewer steps:

    x·x = x^2,  x^2·x = x^3,  x^3·x^2 = x^5,  (x^5)^2 = x^10,  x^10·x^5 = x^15.

This takes only 5 steps. For n ≤ 70, the lemma gives the correct values in more than 70% of the cases.
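As a quick illustration of the lemma's count (a sketch):

    def binary_method_steps(n):
        """floor(log2 n) + (number of one bits) - 1, per Lemma 3."""
        return n.bit_length() - 1 + bin(n).count("1") - 1

    print(binary_method_steps(15))   # 6, though x^15 is reachable in 5 steps,
                                     # so the lemma is only near minimal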
While we cannot give the definite answer for the sequential case, we can prove the following.

Theorem 3: The minimum number of steps to compute x^n on P(m) (m > 1), h_e(m,n), is h_e(m,n) = ⌈log n⌉.

Proofs of Lemma 3 and Theorem 3: Let α = ⌊log n⌋ for convenience, and let I_j be the coefficient of 2^j in the binary representation of n, so that n = Σ_j I_j·2^j. Write X_j = (x^{2^j})^{I_j}.

If m = 1, then x^n is computed as follows. We first compute all x^{2^i} (i = 1, 2, ..., α) in α steps. Then

    x^n = (X_0) × (X_1) × ... × (X_α),

and this computation takes N - 1 steps (note that if I_j = 0, then X_j = 1). Thus in total ⌊log n⌋ + N - 1 steps are needed.

If two PE's are used, then x^n is computed by the following two recursive equations, for k = 1, ..., α:

    t_k^(1) = (t_{k-1}^(1)) × (t_{k-1}^(1)),        t_0^(1) = x,
    t_k^(2) = (t_{k-1}^(2)) × (t_{k-1}^(1))^{I_{k-1}},   t_0^(2) = 1,

so that t_k^(1) = x^{2^k}, and finally x^n = (t_α^(2)) × (t_α^(1)) since I_α = 1. Two PE's are required for the simultaneous computation of t_k^(1) and t_k^(2). That the above process for P(2) is optimum is clear, because x^n cannot be computed in fewer than ⌊log n⌋ steps, and when n is not a power of two at least ⌊log n⌋ + 1 (= ⌈log n⌉) steps are required to compute x^n. (Q.E.D.)

From the above discussion, we have the following corollary.

Corollary: h_e(m,n) = h_e(2,n) for all m ≥ 2.

Now let us study simultaneous computation of all x^i (i = 1, 2, ..., n).

Theorem 4: The minimum number of steps, h_w(m,n), required for simultaneous evaluation of all x^i (i = 1, 2, ..., n) on P(m) is

    h_w(m,n) = (1) n - 1, if m = 1;
               (2) ⌊log m⌋ + 1 + ⌈(n - 2^{⌊log m⌋+1})/m⌉, if max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}) > m ≥ 2;
               (3) ⌈log n⌉, if m ≥ max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}).

Proof: (1) is obvious. (3) is illustrated in Figure 2.2. At the k-th step, the x^i (i = 2^{k-1}+1, 2^{k-1}+2, ..., 2^k) are computed using the results of earlier steps, e.g. x^{2^k} = x^{2^{k-1}} × x^{2^{k-1}}. The number of PE's required at this step is then 2^k - (2^{k-1} + 1) + 1 = 2^{k-1}.

Figure 2.2. Computation of x^i (1)

The maximum number of PE's required is the larger of n - 2^{⌈log n⌉-1} (the number of PE's required at the last step) and 2^{⌈log n⌉-2} (the number required at the (⌈log n⌉ - 1)-th step). This proves (3). Clearly this procedure is optimum in the sense that it gives the minimum computation time.

Next suppose that the number of PE's available, m, is less than max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}). First all x^i (1 ≤ i ≤ 2^{⌊log m⌋+1}) are computed in ⌊log m⌋ + 1 steps in the same manner as the above procedure.

Figure 2.3. Computation of x^i (2)

Now there are n - 2^{⌊log m⌋+1} powers x^i left (2^{⌊log m⌋+1} < i ≤ n) to be computed. This takes ⌈(n - 2^{⌊log m⌋+1})/m⌉ steps on P(m). Clearly at each step all necessary data to perform the operations are available. To show this, let us take two successive steps. Assume that the first step computes x^a, ..., x^b where b - a + 1 = m. Then the second step computes x^{b+1}, ..., x^{b+m}. Since b + m = 2b + 1 - a ≤ 2b (a ≥ 1), all inputs required at the second step are available from the first step. Thus in total ⌊log m⌋ + 1 + ⌈(n - 2^{⌊log m⌋+1})/m⌉ steps are required. This proves (2). (Q.E.D.)

Clearly for fixed n, m > m' implies that h_w(m,n) ≤ h_w(m',n). Thus we have:

Lemma 4: For fixed n, h_w(m,n) is a nonincreasing function of m.

We again call the reader's attention to the fact that the number of PE's required to compute all x^i in the minimum number of steps, i.e. ⌈log n⌉, is not necessarily max(n - 2^{⌈log n⌉-1}, 2^{⌈log n⌉-2}). For example, let n = 18. Then max(18 - 2^4, 2^3) = 8 and ⌈log 18⌉ = 5. Yet P(5) achieves the same result, i.e. h_w(5,18) = 5.
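The schedule in this proof is easy to replay; the following sketch checks the closed form against a greedy simulation in which each step produces at most m new powers, none exceeding twice the largest power already available:

    def h_w(m, n):
        """Closed form of Theorem 4 (n >= 3, m >= 2)."""
        q = (n - 1).bit_length()                  # ceil(log2 n)
        if m >= max(n - 2 ** (q - 1), 2 ** (q - 2)):
            return q
        c = 1 << m.bit_length()                   # 2 ** (floor(log2 m) + 1)
        return m.bit_length() + -(-(n - c) // m)

    def simulate_all_powers(m, n):
        have, steps = 1, 0                        # highest exponent so far
        while have < n:
            have += min(m, have)                  # at most m new powers
            steps += 1
        return steps

    for n in range(3, 80):
        for m in range(2, n):
            assert h_w(m, n) == simulate_all_powers(m, n), (m, n)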
Suppose that there are only m PE's where m < n ■■ 2 Q-l step\ p i P 2 P J •• P m riog nl - 1 r log nl a X a+ni X n X s- m left out a= 2 fl0g nl - X + 1 Figure 2.k. Computation of x (3) Then x(2°""+l+mn- (2 urc + m + l) + 1 i.e. n ,cr2 m = "(n - (2°^ + 1) -l)/2». Also since 2(2°~ 2 + m) > 2(2^ 2 + n/2 - 2 a " 3 ) = n + 2 a ~ 2 > n, 31 all inputs required at the cr th step are ready at the (a - l)-th step or earlier steps. (Q.E.D.) 2.k Computation of Polynomials In this section, we study polynomial computation. First we assume that there are arbitrarily many PE's. Then four schemes are studied and compared. Two of them are known as Estrin's method [16] and the k-th order Horner's rule [15]. Two new methods are also introduced. They are called a tree method (see Chapter 3) and a folding method. It is shown that if there are arbitrarily many PE's, then the folding method gives a faster computation time than any known method. Then we study the case where only a limited number of PE's are available. Because of the simplicity of scheduling, the k-th order Horner's rule is studied in detail. It is shown that on P(m) the m-th order Horner's rule does not necessarily guarantee the fastest computation, i.e., there is a case where the m'-th order Horner's rule (m' < m) gives a better result. Thus availability of more computational resources does not necessarily "speed up" the computation for a certain class of feasible parallel computation algorithms. . I . 1 "omputation of a Polynomial on an Arbitrary Size Machine Definition We write p (x) for a polynomial of degree n i \ n n-1 p ( x ) = a x + a ,x + . . . + a rt . r n' n n-1 32 2. if. 1.1 k-th Order Horner's Rule [15] The details will "be presented in Section 2.1J-.2. Theorem 5 shows that the minimum time required to compute p (x) by this method is n h P . = flog n"l + (log (n+l)~l + 1 mm 2.4.1.2 Estrin's Method fl5]["l6l We first compute C° = a + xa i = 0, 2, ..., 2 L n/2j i+1 Then successively compute c} = C° + x 2 C° ±+2 i = 0, k, ..., k^/kj C? = c] + xV , i = 0, 8, ..., 8 L n/8j ii i 44 m „m-l 2m^m-l . n _m+l m+1 /o m+l C. : C. + x C. +2m i = 0, 2 , ..., 2 (_n/2 J where m - [log nj and more over P n ** n _ 2 riog(n+l)l-l +1 Figure 2.6. Computation of a.x Thus, for example, two variables at the third step are reduced to 8 variables at the first step. Repeating this procedure we get o-l Z ( i=l o-l Z 'c i=l variables on the first level where a - 1og(n + if" . To add these n' numbers by the log sum method, it takes Z (2 1 ' 1 2 i ) + 2 a (n - 2 0rl + 1) + Z2 2i_1 + 2 a (n - 2 Qrl + 1) h = H nrr n"H mm log n' 1 steps. 35 2.U.1.U Folding Method This is 3. method which computes P n (x) in shorter time than any known method. Assume that p , (x) can be computed in h - 1 steps, p.(x) (t < i < s - 1) are computed in h steps and p (x) can be computed in h + 1 steps. Then we show that all p.(x) roof Steps Degree h - 1 ~ t - 1 h t ~ s - 1 h + 1 s ~ (s + t - 1) h + 2 s + t ~ (s < j < s + t - 1) can be computed in h + 1 steps and further p (x) can be computed in h + 2 steps. (1) First we show that p g+t _.(x) (l < ,j < t) can be computed in h + 1 steps. Figure 2.7- A Tree for p .(x) *s+t-j v 36 We write p .(x) as s+t-j v / v \ / t-J v s s-1 .,+. .(xj = (a. .x d + ... + a )x + a n x + =+t-j t+S-J s s-1 . . + a s = p t (x)x +P s . 1 (x). s Now we show that x can be computed in less than h steps. From Theorem 1 we know that x can be computed in flog nl g steps. Suppose that the computation of x takes longer than h, i.e. h < r io£ si . 
2.4.1.3 Tree Method

In the tree method, each term a_i x^i of p_n(x) is computed by a minimum height tree, and the n + 1 term trees are combined by the log sum method; the general construction is the subject of Chapter 3.

Figure 2.6. Computation of a_i x^i

Thus, for example, two variables combined at the third step correspond to 8 variables at the first level. Repeating this reduction, we obtain n' variables on the first level, where σ = ⌈log(n+1)⌉, and adding these n' numbers by the log sum method gives the step count of the tree method plotted in Figure 2.9.

2.4.1.4 Folding Method

This is a method which computes p_n(x) in shorter time than any known method. Assume that polynomials of degree up to t - 1 can be computed in h - 1 steps, that p_i(x) (t ≤ i ≤ s - 1) can be computed in h steps, and that p_s(x) can be computed in h + 1 steps:

    Steps     Degree
    h - 1     up to t - 1
    h         t to s - 1
    h + 1     s to s + t - 1
    h + 2     s + t

Then we show that all p_{s+t-j}(x) (1 ≤ j ≤ t), i.e. all polynomials of degree s to s + t - 1, can be computed in h + 1 steps, and further that p_{s+t}(x) can be computed in h + 2 steps.

(1) First we show that p_{s+t-j}(x) (1 ≤ j ≤ t) can be computed in h + 1 steps.

Figure 2.7. A Tree for p_{s+t-j}(x)

We write p_{s+t-j}(x) as

    p_{s+t-j}(x) = (a_{s+t-j} x^{t-j} + ... + a_s)·x^s + a_{s-1}x^{s-1} + ... + a_0 = q_{t-j}(x)·x^s + p_{s-1}(x),

where q_{t-j} is a polynomial of degree t - j ≤ t - 1. Now we show that x^s can be computed in fewer than h steps. From Theorem 3 we know that x^s can be computed in ⌈log s⌉ steps. Suppose that ⌈log s⌉ ≥ h. Then s > 2^{h-1}. On the other hand, p_{s-1}(x) takes h steps, and by the operation-count bound of Section 2.4.1.5, h ≥ ⌈log(2(s-1) + 1)⌉ = ⌈log(2s-1)⌉, so 2^h ≥ 2s - 1, or 2^{h-1} ≥ s. Thus s > 2^{h-1} ≥ s, a contradiction. Hence x^s can be computed in at most h - 1 steps.

From the assumption, q_{t-j}(x) can be computed in h - 1 steps. Hence q_{t-j}(x)·x^s takes at most h steps, and since p_{s-1}(x) can be computed in h steps, p_{s+t-j}(x) can be computed in h + 1 steps.

(2) Next we show that p_{s+t}(x) can be computed in h + 2 steps.

Figure 2.8. A Tree for p_{s+t}(x)

We write

    p_{s+t}(x) = (a_{s+t}x^{s-1} + ... + a_{t+1})·x^{t+1} + a_t x^t + ... + a_0 = q_{s-1}(x)·x^{t+1} + p_t(x).

From the previous argument we know that x^{t+1} can be computed in at most h steps (t + 1 ≤ s). Since q_{s-1}(x) and p_t(x) each take h steps, we can compute q_{s-1}(x)·x^{t+1} in at most h + 1 steps and p_{s+t}(x) in at most h + 2 steps. (Q.E.D.)

It is easy to check that p_2(x) takes 3 steps, that p_3(x) and p_4(x) can be computed in 4 steps, and that p_5(x) can be computed in 5 steps. By induction we obtain the following table for h = 3, 4, ..., 11.

    Minimum Steps     Degree of Polynomial
    3                 2
    4                 3 - 4
    5                 5 - 7
    6                 8 - 12
    7                 13 - 20
    8                 21 - 33
    9                 34 - 54
    10                55 - 88
    11                89 - 143

Table 2.3. Computation of p_n(x) by the Folding Method

For example, any polynomial of degree 34 through 54 takes 9 steps to compute. Note that the first numbers in the right column form a Fibonacci sequence.

2.4.1.5 Comparison of Four Methods

It has been proved that at least 2n operations are required to compute p_n(x). Proofs have appeared in several papers. We owe Ostrowski [34] and Motzkin the original works; Pan [35] summarized the results; an excellent review of the problem appears in [23]; and Winograd [43] generalized the results of Ostrowski and Motzkin.

Now assume that to compute p_n(x) in parallel, h steps are required. Theorem 5 gives h ≤ ⌈log n⌉ + ⌈log(n+1)⌉ + 1 ≤ 2⌈log(n+1)⌉ + 1. Also, since at most 2^k - 1 operations can be performed in a parallel computation tree of height k, we have 2^h - 1 ≥ 2n, or h ≥ ⌈log(2n+1)⌉. Thus

    2⌈log(n+1)⌉ + 1 ≥ h ≥ ⌈log(2n+1)⌉.

In Figure 2.9, these upper and lower limits are plotted together with the results from the previous sections. It is clear that the folding method is the best in terms of computational speed. It is yet an open question whether there is a better method.

Figure 2.9. Comparison of the Four Parallel Polynomial Computation Schemes

2.4.2 Polynomial Computation by the k-th Order Horner's Rule

Now let us study computation of a polynomial by the k-th order Horner's rule, as shown by the following procedure. We use k PE's. First we compute all x^i (1 ≤ i ≤ k) simultaneously. Then we compute k polynomials p^(i)(x) on the k PE's simultaneously, where

    p^(i)(x) = a_i + x^k(a_{i+k} + x^k(a_{i+2k} + ...)),   0 ≤ i ≤ k - 1.

Then we get k partial results x^i·p^(i)(x), which are added to get p_n(x). Figure 2.10 illustrates this. This scheduling may not be the best, yet it is easy to implement and is adaptable to any k.

Figure 2.10. k-th Order Horner's Rule

Theorem 5: The minimum number of steps, h_p(m,n), required to compute a degree n polynomial p_n(x) on P(m) by the m-th order Horner's rule is

    h_p(m,n) = (1) 2n, if m = 1;
               (2) 2⌈log m⌉ + 2⌊n/m⌋ + 1, if n + 1 > m ≥ 2;
               (3) ⌈log n⌉ + ⌈log(n+1)⌉ + 1, if m ≥ n + 1.
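Before the proof, the following sketch tabulates the count of Theorem 5 (powers of x in ⌈log m⌉ steps, m interleaved Horner chains in 2⌊n/m⌋ steps, then the multiplications by x^i and a log sum in ⌈log m⌉ + 1 steps) and shows the non-monotone behavior discussed below:

    def h_p(m, n):
        """Steps of the m-th order Horner's rule (Theorem 5)."""
        if m == 1:
            return 2 * n
        if m >= n + 1:
            return (n - 1).bit_length() + n.bit_length() + 1
        return 2 * (m - 1).bit_length() + 2 * (n // m) + 1

    print([h_p(m, 15) for m in range(1, 17)])
    # h_p(4, 15) = 11 but h_p(5, 15) = 13: more PE's can mean more steps.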
Proof: It is enough to show that

    2⌈log m⌉ + 2⌊n/m⌋ + 1 ≥ 2⌈log(n+1)⌉ + 1,

or

    ⌈log m⌉ + ⌊n/m⌋ ≥ ⌈log(n+1)⌉ for m ≤ n,

because 2⌈log(n+1)⌉ ≥ ⌈log n⌉ + ⌈log(n+1)⌉, and if m ≥ n + 1, then by Theorem 5(3) h_p(n+1, n) = h_p,min.

Assume that n = 2^g + t (1 ≤ t ≤ 2^g) and m = 2^k + s (1 ≤ s ≤ 2^k). Then g ≥ k; if g = k, then t > s since n > m.

(1) g = k and t > s. Then ⌈log m⌉ = k + 1 = g + 1 and ⌊n/m⌋ ≥ 1. Hence

    ⌈log m⌉ + ⌊n/m⌋ ≥ g + 2 > ⌈log(n+1)⌉.

(2) g > k. There are two subcases: (i) 1 ≤ t < 2^g or (ii) t = 2^g.
(i) 1 ≤ t < 2^g. Then ⌈log(n+1)⌉ = g + 1, ⌈log m⌉ = k + 1 and

    ⌊n/m⌋ = ⌊(2^g + t)/(2^k + s)⌋ ≥ 2^{g-(k+1)}.

Hence ⌈log m⌉ + ⌊n/m⌋ ≥ k + 1 + 2^{g-(k+1)}. Now we show that for all k < g,

    f(k) = k + 1 + 2^{g-(k+1)} ≥ g + 1.

Treating k as a continuous variable, df/dk = 1 - (log_e 2)·2^{g-(k+1)}, and the minimum of f over 0 ≤ k < g is g + 1. Hence f(k) ≥ g + 1, or ⌈log m⌉ + ⌊n/m⌋ ≥ g + 1 = ⌈log(n+1)⌉.
(ii) t = 2^g. Similarly to the above, we can show that ⌈log m⌉ + ⌊n/m⌋ ≥ ⌈log(n+1)⌉. The details are omitted. (Q.E.D.)

Unlike h_a(m,n) and h_w(m,n), h_p(m,n) is not necessarily a nonincreasing function of m. (A few curves in Figure 2.11 illustrate this.) Therefore it becomes important to choose an appropriate m for a given n to compute in an optimum way.

Figure 2.11. The Number of Steps, h_p(m,n), to Compute p_n(x) on P(m) by the m-th Order Horner's Rule

Theorem 6: Given an n-th degree polynomial, let M be the smallest m such that h_p(m,n) = ⌈log n⌉ + ⌈log(n+1)⌉ + 1, where h_p(m,n) = 2⌈log m⌉ + 2⌊n/m⌋ + 1. Then

    M = (1) n + 1, if n = 2^g;
        (2) ⌈(n+1)/3⌉, if 2^g < n < 2^g + 2^{g-1};
        (3) ⌈(n+1)/2⌉, if 2^g + 2^{g-1} ≤ n < 2^{g+1},

where g = ⌊log n⌋.

Proof: The proof is given for each case independently.

(1) n = 2^g. The proof is divided into two parts. First, it is clear that if we have n + 1 PE's, then p_n(x) can be computed in h_p,min steps (see Theorem 5). Next we show that if the number of PE's, m, is less than or equal to M - 1 (m ≤ n), then h_p,min < h_p(m,n), where

    h_p,min = ⌈log n⌉ + ⌈log(n+1)⌉ + 1 = 2g + 2.

Let m = 2^k + p (k < g, 1 ≤ p ≤ 2^k). Then

    h_p(m,n) = 2k + 3 + 2⌊2^g/(2^k + p)⌋.                 (18)

Let P = 2^{g-k}·p/(2^k + p), so that 2^g/(2^k + p) = 2^{g-k} - P. P is increasing in p, so

    max P = 2^{g-k-1} for 1 ≤ p ≤ 2^k.

From this and Eq. (18), we get

    h_p(m,n) ≥ 2k + 3 + 2⌊2^{g-k} - 2^{g-k-1}⌋ = 2k + 3 + 2^{g-k}.   (19)

Now let f(k) = (2k + 3 + 2^{g-k}) - (2g + 2) = 2(k - g) + 1 + 2^{g-k}. Since 2^a + 1 - 2a > 0 for a ≥ 1, we have f(k) > 0 for all k < g. Since h_p(m,n) - (2g + 2) ≥ f(k) > 0, h_p(m,n) > 2g + 2 = h_p,min. This proves (1).

Now since ⌈log n⌉ = ⌈log(n+1)⌉ when n ≠ 2^g for any g, we use

    h_p,min = 2⌈log(n+1)⌉ + 1

to prove (2) and (3). Then it is enough to show that M is the smallest m with

    ⌈log(n+1)⌉ = ⌈log m⌉ + ⌊n/m⌋.

Since 2^g < n < 2^{g+1} for (2) and (3), we have ⌈log(n+1)⌉ = g + 1. By direct computation, the theorem can be shown to hold for n ≤ 10 (see Table 2.4).

Table 2.4. The Number of Steps Required to Compute p_n(x), h_p(m,n), for n ≤ 10. (Case (1) is 2^g < n ≤ 2^g + 2^{g-1} and case (2) is 2^g + 2^{g-1} < n ≤ 2^{g+1}, where g = ⌊log n⌋; n is the degree of the polynomial, m the number of PE's, and M the minimum number of PE's.)

(2) 2^g < n < 2^g + 2^{g-1}. We show that h_p,min = 2⌈log M⌉ + 2⌊n/M⌋ + 1 by first showing that ⌈log M⌉ + ⌊n/M⌋ = g + 1. Since 2^g < n < 2^g + 2^{g-1}, we have

    2^{g-2} < (n+1)/3 ≤ 2^{g-1}                           (20)

and

    ⌈log M⌉ = ⌈log⌈(n+1)/3⌉⌉ = g - 1.                     (21)

Now let ⌈(n+1)/3⌉ = c (c ≥ 3, as we may assume). Then n + 1 = 3c - p (p ≤ 2), or

    n = 3c - p - 1.                                       (22)

Using this and the relation 0 ≤ (p+1)/c < 1, we have

    ⌊n/M⌋ = 2.                                            (23)

Thus from Eq. (21) and (23),

    ⌈log M⌉ + ⌊n/M⌋ = g + 1 = ⌈log(n+1)⌉,

or h_p(M,n) = h_p,min.

Next we show that if m' < ⌈(n+1)/3⌉, then ⌈log m'⌉ + ⌊n/m'⌋ > g + 1 = ⌈log(n+1)⌉ (or, equivalently, h_p(m',n) > h_p,min).
We have two cases: (i) 2^i ≤ m' < 2^{i+1} where i + 1 ≤ g - 2, and (ii) 2^{g-2} ≤ m' < ⌈(n+1)/3⌉.

(i) 2^i ≤ m' < 2^{i+1}. Since i + 1 ≤ g - 2, write g - 2 = i + 1 + j (j ≥ 0). Then ⌈log m'⌉ ≤ i + 1 = g - 2 - j, and since 2^g < n < 2^g + 2^{g-1},

    ⌊n/m'⌋ ≥ 2^{g-i-1} = 2^{2+j}.

Thus

    ⌈log m'⌉ + ⌊n/m'⌋ ≥ (g - 2 - j) + 2^{2+j} ≥ g + 1 = ⌈log(n+1)⌉,

because 2^a ≥ a + 1 if a ≥ 1.

(ii) 2^{g-2} ≤ m' < ⌈(n+1)/3⌉. Let us write m' = ⌈(n+1)/3⌉ - q = c - q (q ≥ 1). Then by Eq. (20) and (22), we get

    ⌈log m'⌉ + ⌊n/m'⌋ ≥ (g - 1) + ⌊(3c - p - 1)/(c - q)⌋ ≥ (g - 1) + 3 > g + 1 = ⌈log(n+1)⌉,

because q ≥ 1, p ≤ 2 and hence 3q - p - 1 ≥ 0. This ends the proof for (2).

(3) The case 2^g + 2^{g-1} ≤ n < 2^{g+1} can be proved in a similar manner, and the details are omitted. (Q.E.D.)

It should be noted that the function h_p(m,n) is not a nonincreasing function of m even for m ≤ M, for some n. However, if n ≤ 50, then for more than 70% of the cases h_p(m,n) turns out to be a nonincreasing function. (The only cases where h_p(m,n) is not nonincreasing are n = 15, 27, 28, 29, 30, 31, 36, 37, 38, 39, 45, 46, and 47; in any case, h_p(m,n) increases by at most one.)

Figure 2.12. The Minimum Number, M, of PE's Required to Compute p_n(x) in the Minimum Time

3. TREE HEIGHT REDUCTION ALGORITHM

3.1 Introduction

In this chapter, recognition of parallelism within an arithmetic statement or a block of statements is discussed. There are several existing algorithms which produce a syntactic tree to achieve this end. The tree is such that operations on the same level can be done in parallel. Among them, the algorithm by Baer and Bovet [6] is claimed to give the best result. For example, the statement a + b + c + d×e×f + g + h can be computed in four steps by their algorithm (Figure 3.1).

Figure 3.1. An Arithmetic Expression Tree (1)

The algorithm reorders some terms in a statement to decrease tree height. However, it does not always take advantage of distribution of multiplication over addition. The arithmetic expression a(bcd + e) takes four steps as it is, whereas the equivalent distributed expression abcd + ae requires only three steps. A further example is Horner's rule. To compute a polynomial

    p_n(x) = a_0 + a_1 x + a_2 x^2 + ... + a_n x^n,                 (1)

Horner's rule

    p_n(x) = a_0 + x(a_1 + x(a_2 + ... x(a_{n-1} + x·a_n) ...))     (2)

gives a good result for serial machines. However, if a parallel machine is to be used, (1) gives a better result than (2). Namely, if we apply Baer and Bovet's algorithm [6] to (2), we get 2n steps, whereas (1) requires only 2⌈log₂(n+1)⌉ steps (see Chapter 2). Thus it is desirable for the compiler to be able to obtain (1) from (2) by distributing multiplications over additions properly. An algorithm to distribute multiplications properly over additions to obtain more parallelism (henceforth called the distribution algorithm or the tree height reduction algorithm) is discussed now.
3.2 Tree Height and Distribution

Definition 1: An arithmetic expression A consists of additions, multiplications, and possibly parentheses. We assume that addition and multiplication require the same amount of time (see Chapter 1). Subtractions and divisions will be introduced later. Small letters (a, b, c, ...), possibly with subscripts, denote single variables. Upper case letters and t, possibly with subscripts, denote arbitrary arithmetic expressions, including single variables; t is used to single out particular subexpressions, i.e. terms.

Then A can always be written as either (1) A = Σ_{i=1}^{n} t_i or (2) A = Π_{i=1}^{n} (t_i); e.g. A = abc + d(ef+g) = t_1 + t_2 where t_1 = abc and t_2 = d(ef+g), or A = (a+b)c(de+f) = (t_1)t_2(t_3) where t_1 = a + b, t_2 = c and t_3 = de + f. Note that when we write A = Π_{i=1}^{n}(t_i), we implicitly assume that for each i, t_i = Σ_{j=1}^{n(i)} t_j^i.

h[A] denotes the height of a tree T for A which is of minimum height among all possible trees for A in its presented form. A minimum height tree (henceforth by a tree we mean a minimum height tree) for A, T[A], is built as follows [6].

Let us assume that A = Σ_{i=1}^{n} t_i or A = Π_{i=1}^{n} (t_i), and that for each i a minimum height tree T[t_i] has been built. Then first we choose two trees, say T[t_p] and T[t_q], each of whose height is smaller than the height of any other tree. We combine these two trees and replace them by a new tree whose height is one greater than max(h[t_p], h[t_q]). This procedure is repeated until all trees are combined into one tree, which is T[A]. (Figures are drawn to scale as much as possible.)

The procedure is formalized as follows:

(1) Let ST = {(1), (2), (3), ..., (n)} and let h'[i] = h[t_i] for all i ∈ ST.
(2) Choose two elements of ST, say p and q, such that h'[p], h'[q] ≤ M, where M = min{h'[u]} for all u ∈ ST - {p, q}.*
(3) Now let ST = (ST - {p, q}) ∪ {(p, q)} and h'[(p, q)] = max{h'[p], h'[q]} + 1.
(4) If |ST| = 1, then stop; else go to Step 2.

*If there are many choices, then choose those subtrees with smaller #(Sh[t]) values first (see Definition 8).

After we apply the above procedure to A, we get e.g. ST = {((((1)(2))(3))((4)(5)))}, where a pair (ab) indicates that the trees corresponding to a and b are to be combined. Thus in this case the tree which combines (((1)(2))(3)) with ((4)(5)) is a minimum height tree of A. In general, the procedure is applied from the lowest parenthesis level (see Definition 3) to the higher parenthesis levels.
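The procedure is easy to state in executable form (a sketch; the heap stands in for the set ST and carries only the heights, which by the discussion above is all that matters):

    import heapq

    def combine_heights(heights):
        """Repeatedly merge the two lowest trees; returns h[A]."""
        heap = list(heights)
        heapq.heapify(heap)
        while len(heap) > 1:
            p = heapq.heappop(heap)
            q = heapq.heappop(heap)
            heapq.heappush(heap, max(p, q) + 1)
        return heap[0]

    # five terms of height 0 (single variables) give height 3, as in the
    # tree ((((1)(2))(3))((4)(5))) obtained above
    print(combine_heights([0, 0, 0, 0, 0]))   # 3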
Then we can replace each t. by a product of i=l x x e[t.] single variables without affecting the total tree height h[A]. (Note that each t. must be computed before the summation over t. is e[t.] p L i J taken.) Thus A becomes A' = Z tt a . and h[A] = MA']. Let us i=l j=l eft.] 1 call a tree for tt a. a subtree. Then a tree for A' is built ,1=1 J using subtrees in the increasing order of their heights. Since a binary tree of height h cannot accomodate more than 2 leaves, we have 2 h[A'] > Z e[ t.l > 2 " i=l X h[A']-l or h[A'] = h[A] = log. P Z e[ t ] i=l (3) can be proved in a similar manner. (Q.E.D.) 58 Definition 2 : The additive length a and the multiplicative length m of an arithmetic expression A is defined as follows: P (2.1) If A = ir a.., then i=l X (i) a[A] = e[A] and (ii) m[ A] = p. P (2.2) If A - Z t., then i=l X P (i) a[A] = I e[t.] and i=l x (ii) m[A] = e[A]. P I (2.3) If A = tt a. x 7r (t.), then i=l x d=l J (i) a[A] = e[A] and (ii) m[A] = p + I e[t ]. 0=1 J It is to be noted that (!) £[AI |a[A]l 2 '.m[A]j - (2) h[A] = log p |a[A] i = log m[A] I ; d ! * 2 ' '2 compare this definition with Theorem 1. Definition 3 : The level £ of a parenthesis pair in an arithmetic expression is defined as follows : 59 First we start numbering parentheses at the left of the formula, pro- ceeding from left to right, counting each left parenthesis, (, as +1 and each right parenthesis,), as -1 and adding as we go. We call the maximum number m the depth of parentheses. Now the level 1 of each parenthesis pair is obtained as I = p, where p is the count for each parenthesis. The arithmetic expressions enclosed by the level I parenthesis pair are called the level I arithmetic I expressions, A • Also for convenience we assume that there is an outermost parenthesis pair which encloses A. Example 2 A = 123 3 3 3 21 (ab((cd + e)(f + g)+k)) 3 3 2 Now several lemmas are in order. Lemma 1 n n Let A = T t. or A = ir (t.). Also let A 1 = t, + t. + . , l . , l 12 1=1 1=1 + t. ' + l ... + t or A' = (t. ) x (t) x ... x (t. ') x ... x (t ), and A" = A + t . or A" = n 12' l n" n+1 A x (t n+1 ). Then 6o (i) h[A'] >h[A] if h[t.'] > h[t ± ] (ii) h[A"] > h[A]. Proof: Obvious from Theorem 1. What Lemma 1 implies is that the height of the tree for an arithmetic expression is a non-decreasing function of term heights, and the number of terms involved. In an arithmetic expression, there are four possible ways of parenthesis occurence: P x ) ... + (A) + ... P_) ... 6(t n x t_ ... x t ) x (t ' x t ' ... x t ') 9 ... 2 12 n 1 2 m P 2 ) . .. a x a x ... a x (A) 8 . . . 3 12 n p u ) ... e(t 1 + t 2 + ... + t n ) x (t^ + 1 2 ' + ... + t m ») e ... where 6 represents +, x, or no operation. Lemma 2 : Let D - B + ( A) + C and D n = B+(t n x ... xt ) x(t' x ... xt ') +C. 1 1 n 1 m Also let D = B + A + C and D n =B+t n x ...xt xt' x ... xt ' + C. Then 11 n 1 m r\ d h[D] >h[D] and h[ D ] > h[ D ] . Proof : Obvious Prom Theorem 1. 6l As an example, let D=(a+b+c)+d and D = (abc)(defgh) . Then A A A D = a + b + c + d and DJ = abcdefgh, h[ D] = 3 > h[D ] = 2 and h[D ] = k > h[Dj] = 3- Lemma > Let D =| Z t .|| Z t . ] and D d = t, t , ' • t, t n ' ■ ... +tt' ... t t ,1=1 *A.1-1 « Then h[D d ] > h[D]. Proof : n m Let D = (A)(B) where A = Z t. and B = St.'. Also without losing i=l x 1=1 x nerality, assume that h[A] > h[ B] . Then h[D] = h[A] + 1. For each j, let d. = J t.' t n +t.' t„ + ... + t.' t. It is clear that h[ t . ' t.] > h[t.l for all i and j 1 2 j n j i J - i J d m d Thus from Lemma 1, we have h[d.] > h[A] for all j. 
Since D = Z d., h[D ] > 3 0=1 J min(h[d ]) + log J ml > h[A} + log 2 rml 2 , or since h[D] = h[A] + 1, h[ D ] > h[ D] . (Q.E.D.) Note that the above lemma does not imply necessarily that if D = ft \ H" H" Zt.HB), and D = t (B) + t (B) + ... + t (B), then h[ D ] > h[D]. Actually, i=l 7 n it can be shown that there is a case when h[ D] > hf D ] . What Lemma 3 says is that D = (A)(B) should not be fully distributed, but partial distribution, as in D , may be done in some cases. 62 Lemmas 2 and 3 together indicate that distribution in case (P^) and partial distribution for case (P, ) are the only cases which should be considered for lowering tree height. In casos (P ) and (Pg), removal of parenthesis leads to a better result or at least gives the same tree height. Full distribution in case (P. ) always increases tree height and should not be done. Also it should be clear that in any case tree height of an arithmetic expression can not be lower than that of a component term even after A \ ■ d distributions are done. For example, let D = t(A) = t x J F t J and D = n F ft v t.). Then from Lemma 1, we have h[tt.] > h[t ] for each i. Thus i=l h[r> d ] > h[A]. The same argument holds for all four cases. This assures that evaluation of distributions can be done locally. That is, if some distribution increases tree height for a term then that distribution should not be performed because once tree height is increased, it can never be remedied by further distributions. Actually, there are two cases where distribution pays. For example, if A = a(bcd + e), then h r A] = k. However, if we distribute a, then we get A = abed + ae and h r A ] 3- The idea is to balance a tree by filling the "holes" because a balanced tree can accommodate the largest number of variables among equal height trees. The situation is, however, not totally trivial, because by distribution, the number of variables in an expression is also increased. Next let A = a(bc + d) +e=t+e and A = abc + ad + e = t + e. In this case h[A] = k but h[A ] = h[ t ] =3- What happened here is that t is "opened" by distributing a over (be + d) and the "space" to put e in is created. 63 At each level of parenthesis pair, cases (P3) and (P^), i.e., instances of "holes" and "spaces", are checked and proper distribution is performed. Next we give definitions of holes and spaces, and formalize these ideas. 3-3 Holes and Spaces 3-3«l Introduction Before we proceed further, let us study trees for arithmetic expressions more carefully. P Let A = Z t.. By Definition 1 we first build minimum height trees T[t.] i=l 1 x for all i, and T[AJ is built by combining these T[t.]. Once T[t.] is built the details of t. do not matter, and the only thing that matters is its height h[t.]. Suppose T[t.] and T[t.] are combined to build T[A]. Assume also that hrt.l = hrt.l + s. Then we will get s nodes to which no trees are attached other l 3 than T[t.]. We call these free nodes whose heights are h[t.] +1, h[t.] +2, ..., J J J h r t.]. i T[t.: .h. Free Nodes in a Tree Free nodes in a tree T P it (t.) L i=l are defined similarly. Let us emphasize that once we get Tft.] we treat it as a whole and do not care about its details when we build T[A]. That is, when we consider free nodes in T[A] we mean free nodes "in" T[A] but "outside" of Tft.]. For example let A = (a+b)(cde+f) = (\){t ). Then T[A] d e 65 a and 3 are free nodes in T[A] while y and 8 are free nodes in T[t ] and not in T[A]. Now suppose there are m free nodes in T[A]. We number them arbitrarily from 1 to m. 
Also let us denote the height of a free node a '»y h[al« Given a free node q- whose height is hi" a] in T [A = I t.] (or T [A - tt( t . ) ] ), by definition we can attach a tree T[t] whose height is h[a]-l (or whose effective length is 2 ^ aJ " ) to a without affecting the height of A: Definition k: For A = 7r(t.) or A = Zt . , we build a tree. Then l i (k.l) define F [A] to be a set of all free nodes in T[A], and (U.2) for each i define F [A, t.] to be a set of all free nodes which r\ 1 exist between the roots of T[A] and T[ t . ] , i.e. the free nodes which we encounter when we traverse T[A] down to the root of T[t.]. For example let us consider the following tree (see Figure J. 5)- 66 Figure J5-5- An Example of F and F Then F A [A] = fa, 3, 7, 6, e] and F R [A, t ± ] = fa, 3, y), F R [A, t g ] = { c& 3} etc. Lemma h : Suppose h[ a] = h[ 3] for some free nodes a and 3 in T[A]. Then without changing the tree height h[A], we can replace two free nodes a and 3 hy one free node 3' whose height is h[a]+l. Proof: original T[A] Figure 3-6. Elimination of a Free Node 67 modified T[A] N Figure 3- 6- (continued) We can combine subtrees 1 and 3, and hence eliminate free nodes a and 3 and create a new free node 3' (see Figure 3-6). (Q.E.D.) Given F.[A], two free nodes a and (3 of equal height can be replaced by one free node 3 1 whose height is h[ a] + 1« Repeating this procedure finally we get a new F '[A] in which no two free nodes have the same height. Let y and 8 be free nodes in F.[A] and F '[A] respectively. Assume that for all free nodes a in F [A] (or F '[A]), h[ 7] > h[ a] (or h[61 > h[ a] ) . Then obviously h[ 7] leSh[t ]. In this case e[ f ] < le Z b(H [A"] ) . Then there is a free node a in F A [ A 1 sucn that u = 2 and e[t'] < u means that T[t'] can be attached to a with- out increasing tree height. Hence A can be multiplied by t' without increasing tree height. P (2) A = I t . i=l We show that if e[t'] < le Sh[A] then t can be distributed over A without increasing tree height. # - d We write (ft) for the expression obtained after distributing f over t such that the tree height is deduced, e.g. (a, b+cd) = ab + acd. 7^ First note that for all i, if a€F [A,t.] then 2 h ^ a ^ _1 > le Sh[t.] Now assume that e[t'] < le Sh[A]. For fixed k, that u = le Sh[A] implies either ( i ) u < le Sh[ t k ] or (ii) there is a in F_,[A, t, ] such that u < 2 "- a ^ (or equivalently u < le Ib(H [A,t ])). In the first case we have h[ t ] = h[(f t ) ] by assumption and h[A] - h[ 7 t + (ft ) d ]. i/k X k In the second case we attach T[ f ] to a (Figure 2.8(a)) without increasing the tree height h[A], i.e. h[A] = h\ T t + (f ) x (t )]. i/k x * In general let 1" be a subtree in T[A] whose root is a (Figure 3.8(b)). Then \ T* (a) (b) Figure 3-8. Attachment of T[ f ] to a Free Node 75 there may be other term trees besides T[ t ] in T", e.g. T[t.] and T[t ] in Figure 3-8(b). Hence we get, h[A] = h[ L t. + (f ) x (t^+t.+tj^)] i#c,j,h in this case. Note that a is also in F [A, t.] and TJA, t ]. a j a n. Repeating this procedure for all k we can get an arithmetic expres- P sion equivalent to X (t 1 ) x (t.) or (t')(A) without increasing i=l x tree height. This proves (2). It is obvious that in both (l) and (2) if e[t'] > le Sh[A], then t' can not. be distributed (multiplied) over A without increasing tree height. (Q.E.D.) Lemma 6 : Let A = F t. and e[t»] < le Sh[A]. 
Then after t' is distributed over A, we have Sh[ (t'A) d ] = (Sh[A] del (u} ) uni T b(u-e[t'] ) uni Sh[t'] where u is the smallest element in Sh[A] bigger than e[t']. Now let us summarize what we have so far. Let A and t' be arithmetic expressions where A = T t. and t' = 7t'.. If e[t'] h[(t*A) = hfA]. A set of holes in (t'A) is given by the above lemma. Since it is obvious that le Sh[ A] < e[A], we have the following lemma. 76 Lemma J : Let f = Zt.'. Then h[(t')(A)] > h[(t'A) d ] implies that h[t'] < h[A]. q In general let t' = Ti(t.'). For convenience let us assume that e[t. ' ] < e[t '] < e[t '] < ... < e[t ']. Then if the following procedure can be - 2 - 3 ~ - q accomplished successfully, we say that A has holes to accomodate all t.. ' (i=l, 2,...,q). # Procedure: i (1) Let V = (t i '(t i _ 1 *,...,(t 1 , A) d ) d ...) d and V Q = A. Let i = 1. (2) Check if e[t.'] < le Sh[V. .]. i l-l (3) If so, then distribute t. ' over V. , and we have V.. (k) Evaluate Sh[V.]. (5) If i=q then stop, else let i = i + 1 and go to (l). The procedure may be accomplished successfully if m[t'] < #(Sh[A]). 3- 3. 3 Space Now the second possible distribution case, i.e. space is studied. The idea of the second distribution case is that given an arithmetic expression D of the following form D = ... 9 (f) x (A) + t e ... 1 s distribute t' over A so that t can be hidden under the combined tree as shown s in Figure 3.9* We assume that each t. ' does not have any holes, i.e. Sh[t.'] = for all i. Hence Sh[(t*A) d ] = (Sh[A] del { u }) uni b(u-e[t']) for example . 77 t'^ t't 2 (a) A A A A ft (b) Figure 3.9. An Example of Space (l) In other words, in case of D, the addition +, cannot be done before (f )(A) (we write f (A)) is computed while in case of D it may be done earlier. Note that if h[f (A)] > h[(f A) ] then A has enough holes to accomodate f and the distribution of f over A is done anyway. Henceforth throughout the rest of this section we assume that A does not have any holes to accomodate f . Thus we now deal with the case when h[f (A)] < h[(f A) ]. However, if h[f (A)] < h[(f A) ] holds, then clearly there is no way to 78 get h[J) ] < h[ D] by Lemma 1. Thus we have: Lemma 8: Proof: (1) h[t* (A)] = h[(t*A) d ] must hold to get h[D d ] < h[ D] . n _ d (2) Let A = it.. Then h[t'(A)] > h[(t'A) ] if and only if e[t'] < i=l X e[A] (i.e. h[t f ] < h[A]). This implies that h[t'(A)] = h[A] + 1. By inspection. P Intuitively the space Sp in an arithmetic expression A = it. with i=l 1 respect to t' is defined as P Sp[A,f] - e[(A) x (t')] - I e[(f) x (t )]. i=l For example let A = ab + c and t* = d. Then S [A, t'] = 2. free node (a) (b) Figure 3.10. An Example of Space (2) Let D = t'(A) + t - (t'A) + t. Note that a free node in T[t»(A)] cannot be used to attach T[ t] while a free node in T[(t'A) ] may be used to attach T[t]. Now the formal definition of space follows. 79 Definition 9 : p q Given arithmetic expressions A = Ft. and t' = Ft. ' , the space function i=l i=l Sp of A w.r.t. t' is defined as follows. First we build trees T[A] and T[t'], and in T[A] let F be a set of free nodes f higher than T[t'] (h[f] > h[t']). Also we define a set I as follows. We let iel if h[t.] > h[t'] and e[ V ] < le Sh[ t. ] . Now the space function is obtained as : h[f] h ^V Sp[A,f] = Z 2 L J + T 2 fe F i€ I To show how Definition 9 works, we first describe how to build T[(t'A) ] by attaching T[t'] to T[A] properly (i.e. by distributing t' over A properly). 
Since h[(t'A) ] = h[A] +1 (Lemma 8(1)), we first study to build T'[A] which is obtained by replacing T[t.] in T[A] by T'[t.] whose height is h[t.] + 1. Then the height of T'[A] is h[A] + 1. Building T[(t'A) ] from T[A] may be explained in an analogous way. As stated before the only case to be considered is when h[t'(Aj] = h[(t'A) d ] = h[A] +1 holds (Lemma 8(1 )). Suppose that all T[t.] in T[A] are re- placed by T'[t.] whose height is h[t.] + 1. Then the new tree T'[A], whose height is h[A] + 1, is obtained. Note that a free node a in T[A] now becomes a free node a' in T'[A] with height h[a] + 1. In T'[A] if T'[t.] is replaced by T[t.] again, then a new free node 3', whose height is h[t.] +1, is created. Having these facts in mind, now we describe the way t' is distributed over A to create space. The tree T[(t'A) d ] is built from T[A] as follows (note h[(t'A) d ] = h[A] +1). Depending on the height of T[t.], we have two cases. J 8o (1) h[t.] > h[f ]. If T[t.] has a hole to put T[t'], then we fill it by T[t']. In J this case h[t'(t.)] = h[t J and a new free node 3' whose height is h[t.] + 1 is created in T[(t'A) ]. Otherwise h[t'(t.)] = h[t.] J J J + 1. (2) h[t.] < h[t']. J Find the tree T[Zt ] whose height is h[t'] and which includes s T[t.]. We multiply Zt by t' and get h[t*(Zt )] = h[Zt ] + 1. J s s s Note that t. and Zt are treated as terms of A. In the resultant tree T[(t*A) ], j s J ' those free nodes in T[A] whose heights are less than or equal to h[t T ] (i.e. free nodes in T[Zt ] ) do not appear, s (a) (b) Figure 3-H- Distribution of t' over A (c) .m*- A free node a in T[A] (h[a] > h[t']) appears in T[(t*A) ] as a free node a' where h[a' ] = h[a] + 1. Thus T[(t'A) ] has those a' and 3' described in (l) as free nodes. 81 If 6' is a free node in T[ (t f A) d ], then a tree T[t] (h[t] h[t']). Also let I be a set such that i is in I if h[t.] >h[t'] and T[t.] has enough holes to put all T[t.'] (i.e. a[t'] < /Sh[t.])). Then Sp[A,t f ] = feF iel Informally we say that space to put t is created by distributing t' over A if h[t'(A) + t] > h[(t'A) + t] . Then the procedure is called space filling. Now we study how much we can reduce tree height by space filling. Let B = Z't.')(A.) + Zt . and assume that by distributing t. ' over A. space can li. j l l i J be created for all i. Lemma 9 " Let B = Z(t. ')(A.) + It. and B d = l(t. 'A. ) d + Ft.. Then h[B d ] = h[ Bl .11 .3 .11 .j LJ LJ i i 3 1 if l!3p[A.,t '] >a[B] - e[B]/2, otherwise h[ B d ] = h[ B] . i 82 Proof : First note that to lower the height of a tree for B, some terms must be removed from B so that an effective length of a resultant expression becomes e[B]/2. Hence ZSp[A. ,t.'] must be greater than or equal to a[B] - e[B]/2. r d n Next we show that h|_B J cannot be smaller than h[B] - 1. As before we assume that h[t i *(A.)]= h[(t.'A.) d ] for all i (see lemma 8(1)). It is equivalent to ( — erB]/2 -a[B] — Figure 3-12. Tree Height Reduction by Hole Creation the assumption that Sp[A.,t.'] < e[t.'(A. )]/2 = e[(t.'A. ) ]/2. First we get B ! from B as follows. We replace every (t.'A.) in B by a product P. of e[(t 'A ) ]/2 single variables. This amounts to assuming that Sp[A. ,t '] = i i ii e[(t 'A ) ]/2. Futher we get B" from B' by replacing every t. in B' by a i i product Q. of e[t.]/2 single variables. Then it is clear that h[B] > h[B ] > J J h[B'] > h[B"] = h[B] - 1. If ZSp[A.,t.*] > a[B] - e[B]/2, then h[B] > h[B d ]. Hence h[B d ] = h[B] - 1. (Q.E.D.) 
What the lemma implies is that by space filling we can lower tree height at most by one . In other words to see if the space creation by distri- bution is effective it is enough to see if total tree height can be lowered by one, and we know if tree height is once lowered by one it is not necessary 83 (i.e. useless) to try to lower tree height further by creating more space by further distribution. Unlike a set of holes (Theorem 2), the space function for A = 2t. does not carry any information about space in components of A, t.. For example let B = a(b(c+defg) + irl6) + Trl6 + irk = a(A) + tt16 + irk where Ti is a product of i single variables. Then h[ B] = 7- Now Sh[A] = and space creation is tried. Note that SpfA.a] = 16 < a[ B] - e[B]/2 = 20, but Sp[ c+defg, ab] = k. That is, a as well as b should be distributed over c+defg. Thus we c -et B 1 = abc + abcdefg + a(7rlo) + tt16 + wh where h[B'] = 6. Now this situation is studied in detail. In general we have a form: F = ... +t'(... +t" ( C ) +. . . +D+. ..) + ... + E + ... ^ J V A V J w~ :-n if Sp[A,t'] is not enough to reduce the tree height h[ F] , we have to further check components of A, e.g. Sp[C,t't"]. As we will show later (see Substep 2 of Step i\ of Algorithm given in Section j.k-.l) an arithmetic expression is examined from the inner most pairs of parentheses to the outer most pair. In the above diagram, the distribution of t" over C is first checked to see if it reduces the tree height h[A] and then the distribution of t' over A is examined. If the itribution of t" over C creates space and reduces the tree height h[A], then there is no problem. However if that distribution does not lower the tree height .], then t" will not be distributed over C (see Algorithm). As we showed in the above example when we check the possibility of reducing the tree height h[ F] by creating space by the distribution of t' over A, it may be necessary to check , ft"] as well. Qk Let A' = ... + (t"C) d + ... + D + ..., G* = (t'A) d and G" = (t'A') d - Now we show that if Sp[A, t'] = 0, then it is not necessary to examine components of A, e.g. Sp[C,t't"] further. This helps to reduce the number of checks required. Lemma 10 : (1) If Sp[A, t'] =0, then h[...+G"+...+E+. ..] < h[...+G'+. ..+E+. ..] never holds. (2) If Sp[A,t'] = 0, then h[G"] < h[G'] never holds. Proof: (1) We prove this by showing that if Sp[A,t'] = then Sp[A',t'] = 0. Note that Sp[A, t'] = implies that either (i) h[t'] > h[t"(C)] or (ii) h[t n ] = h[C] (Definition 9). By Lemma 9, in either case we get h[(t'(t"C) ) ] > h[t't"(C)]. Note that the only difference between G' and G" is that a term -— sr d t't"(C) in G' is being replaced by (t'(t"C) ) in G". Since h[(t'(t"C) d ) ] > h[t't"(C)], G" cannot have more free nodes than G'. Hence Sp[A f ,t'] = 0. (2) This may be proved in a similar way and the details are omitted. (Q.E.D.) Thus in F = ... + t' ( . . .+t" (C)+. . .+D+. . . ) + ... + E ... the distri- bution of t" over C should be done if it reduces the tree height of T[A], otherwise it should be left untouched. In the latter step when the distribution of t' over A is examined, the possibility of distributing t" over C as well shall be checked if and only if Sp[A, t'] / 0. Otherwise we shall leave t'(A) 85 as it is and we need not check inside of A again. 3.4 Algorithm Having these results, an algorithm to reduce tree height of an arithmetic expression is now described. Given an arithmetic expression, the algorithm works from the inner most pairs of parentheses to the outer most pair. 
We assume that cases (P, ) and (P ) (see Lemma 2) are already taken care of. At each level of a parenthesis pair, first upon finding a form t'(A), a hole of A is tried to be filled by t' (Theorem 2). After all holes are filled, a form t'(A) + t" is checked, i.e. if the distribution of t' over A creates enough space to accomodate t", then the distribution is made. Note that it is not necessary to fully distribute (A)(b) = (lt.)(Xt. T ) (see Lemma 3)- It is not necessarily true that reducing tree height of a term t of an arithmetic expression A reduces tree height of A. However, we show that reduction of tree height should be made in any case to help later steps of the distribution algorithm. n-1 n-1 Let A = Tt. + t (or ir (t . ) x (t ) ) . Assume that the distribution . , 1 n . , 1 n i=l i=l somehow reduced the height of T[ t ], i.e. the distribution algorithm n-1 n-1 reduced A to A' = I t. + t ' (or it (t. ) x (t ')) where h[ t ] > h[ t ']. Also . , 1 n . n i 7 n ; ' L n J L n J i=l i=l assume that MA] = h[A']. Yet it is obvious that #(Sh[ A]) < #(Sh[ A' ] ) (i.e. le £h r ;.l < le Sh[A']) and also for any t", Sp[A,t"] < Sh[A,t"]. That is, even if distribution only reduces the tree height h[ t ] and .does not reduce the tree height h[A], that distribution does not cause any bad effect on the later steps when A appears as a term of a bigger expression d with respect to holes and spaces , 86 The arithmetic expression thus obtained may give the lowest tree height, i.e. the fewest number of computation time steps. 3.k.l Distribution Algorithm In the steps below we refer to the notation: . k - 1 k - 1 I . k 1 I r— k — r i k ! 1 «(...«( k )&.. .)©... ©(...©( )e...e( )©...)© ' s-l,i I I sj I I s,j+l I H — tl — h k << A = .. . + t . , 7T (t - ) + ... k p where t . , = tt a„ or empty. s,J-l l=1 t Step 1 : Go to an arithmetic expression enclosed in an innermost parenthesis pair which is not checked. Let this level be k-1. In the above diagram we are «k-l now working on, say, A s Step 2 : Obtain a set of holes for all t . which are enclosed in the k-th k-1 parenthesis pair and are components of A , as well as their heights h and effective lengths e Step 3 : k-1 In this step, the (k-l)th parenthesis pair level A "" is examined. 87 Substep 1 : Hole filling (see Theorem 2) Let 0+n k P t where t = ir a or empty. Also without loss of generality we assume that e[t g£ ] < e[t g i+J ] for l = j, j+1, ..., j+n-1. k-1 (1) Find an occurrence of a form 7r(t.) or Tra. x 7r(t.) in A .If there J i s is no such occurrence, then go to Substep 2. If an occurence of a form ir(t.) is found, then skip (2) and (3). Otherwise go to (2). k-1 (2) Suppose we find t in A as an instance of Tra. x Tr(t.). s 1 j k k Fill holes in Sh[t ] (h=j, j+1, . . ., j+n) using a.'s in t . If there are many holes to be filled in, fill the smallest ones first, i.e. in order of increasing size. Reevaluate Sh by Lemma 8 for those t . whose holes are filled, sh (3) If t , (h=j, j+1, . . ., j'+n) do not have enough holes to accomodate all a.'s then go back to (l) to find out another occurrence of Tr(t.) or j+n k Tra. x Trft.) form. Otherwise we work on tt (t 1 , ) which we get from i j . . sh' & P J+n v ira, x tt (t\ ) after (2). 1-1 J h=i Sh (J+) We start from h = j. Check if t' can fill in one of holes in u sh 88 Sh[t' ] (l=h+l, . . ., j+n). If there are many holes which can s x. ]^ accomodate t' , fill them in order of increasing size. Continue the procedure until all t' (h=j+l, . . 
., j+n-l) are put in some holes or there is no hole to accomodate t' . Go back to (l) to find sn out another occurrence of 7r(t.) or ira.. x 7r(t.) form. J i J Substep 2 : Space filling k-1 k After Substep 1, we again check A , where all holes in t . have s sj been filled in as much as possible by Substep 1. (1) Let Ex = a[A k_1 ] - e[A k_1 ]/2 (see Lemma 9)- (2) Let k-1 k ^ +n-1 k n k A "=...+t . -, x it (t,)x(t .,)+... s S ' J_1 h=j S cf' J t^ ~Y f t k p k where t . n = tt a, or empty. We also assume that eft . 1 is B,d-1 ! = i * s,j+n J k k the largest among all e[t ,] (h=j, . . ., j+n) . Let t' = t . . x sn s f j -± j+n-1 . . . tt_ (tg h ). If h[t'] «h[t* ], then evaluate Sp[t* ^ f ] . Otherwise leave it as it is. (3) Repeat (2) for every occurrence of a form tt( t . ) (or ira.. x ir(t .)) J -'-J k-1 in A . Assume that there are m such occurrences. Arrange all s ~p[t,t'] in order of decreasing size. For convenience we write Sp r Sp 2 , ..., SpJSp. > s P . +1 ). 89 m d-1 d (+) If Z Sp. > Ex then let d be such that Z Sp. < Ex and Z Sp > Ex. i=l x i=l x i=l k ^ +n_1 k k (5) Let t -tX it ( t >.)x('t • ) be a form which corresponds to s,j-± ' ._. sn s, j+n 1 2 ' t' t Sp.(i < d). Then distribute t' over t, and create space Sp.. Repeat the same procedure for all i = i, 2 f . .., d. m (6) In the case where enough space to accomodate Ex (i.e. Z Sp. < Ex) i=l x ]^ is not found, a check is made against the component terms of t (see Lemma 11). For example let t .-=a.a rt ..«a,n=l s,j-l 12 p' and t k . = b n b ... b (t^ 1 ) + Z t*!" 1 . s,j 12 q v sf i=1 si Then A k-1 . Jc ,,. ,. ,. ^k+l N . ™ , s .. +t* _ x (b n b ... b (t"I x ) + lO + .. s, j-1 12 q sf . _ si i=l i/f k+1 Then the distribution is done if the sum of Sp. and Sp[t „ , t . _ x b n . . . b ' is greater than or equal to Ex. Here the dis- s,J-l 1 q J k+1 tribution of a, ... a b, ... b over Z t .' is to be made as well 1 pi q si k+1 as the distribution of a, ... a over t _ . This checking is to 1 p sf 90 be made until enough space to accomodate Ex is found or else until the innermost level of parenthesis pair is reached. Step k ; k-1 Mark A as checked, s Step 5 : If all levels are checked, then stop, otherwise go back to Step 1. For example let us consider the following A = ... + a 1 a 2 a 5 (t 1 )(t 2 )(t 5 ) + t Further assume that Sh[t x ] = Ul, e[t 2 ] = 16 Sh[t 2 ] = (16,21, e[t 2 ] = 61+ and Sh[t 3 ] , e[t 3 ] = 64. Then a, a p a, can be distributed over t,, and in turn this whole thing can be distri- buted over t, . . + a.. a 2 a-, { t, 4 o (t )(tj + ... and we get h[t'] = 7 whereas h[t] = 8. 91 3.^-2 Implementation A few words about implementation of the algorithm described above are given as well as the total number of checks required to process an arithmetic expression. Suppose we are given the aritmetic expression A = ... + (7r24^rl)(7rlO+(7r4-Pn-l)(7rl7-H7r2)-P7r3) + ... = ... + ( d 1 +d 2 )( e 11 + ( e 2i +e 22^ e 31 +e 32^ + e l+l^ + "•• = ... + (D)(E 1 +(E 2 )(E 3 ) + E^) + ... = ... + (D)(E) + . . . where iri represents a product of i single variables. Then we build nested stacks as shown in Figure 3.13(a). Note that a new stack is created for each form ir(t.) or Z t . . 1 e 21 (x) B(+) e u (x) E 2 (+) (+) TVT f v \ 1 im^x; "2 KAJ «= e 00 (x) A <— ^ — <— >> E ? 
(+) D(+) \ I \l lXJ ~'~~~- e ^n d.(x) 31 d 2 (x) ^^ e 32 level - m-k m-3 m-2 m-1 m Figure 3-15- Stacks for an Arithmetic Expression 92 (b) E 4- w 2 (x; Sh A _L 1,2, If, 8, 8,16,32 t,5,6 E 2 (+) Sh if a iii Sh A 1,2,4,8 2,3,4,5 2, 3 ,4, 5 e 21 (x) h 2 Sh F A e 22 (x) /" e 31 (x) e„(x) h 1 Sh F A N l «■ (c) Sh A El£L 2,4 3,^,5 3,4 "MJ. N 2 '(x) h 5 Sh 1,2,4,6 F A 1,2,3,5 /* e 3i N 2 "(x) Sh e 21 (x) E ( + ) h 3 Sh F A 1,2 F E <- 1,2 '22 Figure 3- 13 • (continued) 93 Each stack is assigned a level number (cf. Definition 3) where the first stack which corresponds to A receives the level number (Figure 3« 12(a)). We start working from a stack with the largest level number, say m. For each stack t, where t = It. or t = ir(t.), h[t] is evaluated. Also if a stack represents a form t = It., then Sh[t], F [t] and F [t, t.] are evaluated. If a stack represents a form t = 7r(t. ), then Sh[t] and F [t] are evaluated. These values are obtained by Definitions 1, h- and 8. Note that this information is sufficient to evaluate Sp. Figure 3.12(b) gives an illustration. Upon finding a form 7r(t.) (or Tra . x 7r(t.)) (e.g. the stack N ), we apply the distribution algorithm and decide if distribution is to be made. If a stack represents a form t = 7r(t.), then Substep 1 of the distribution algorithm, i.e. hole filling, is tried. Otherwise a stack represents a form t = I t. and Substep 2 of the distribution algorithm, i.e. space filling is applied. In our example E is distributed over E,(e[E p ] < le Sh[E,]). Then stacks are revised as shown in Figure 3 .12(c). Note that the stack N is replaced by two new stacks N ' and N ", and the stack E disappears. If all stacks with the level number k have been checked, then stacks with the level number k-1 will be checked. In our example, stacks E(or E' since it has been revised) and D are now checked. The total number of checks required to process a whole arithmetic expression thus depends on the number of parenthesis occurrences in it. Assume that there are p parenthesis pairs in an arithmetic expression A. For each pair, space creation should be examined. Hence in total p space creation checks are required. Now for each ir(t.) form hole filling should be tried. The number of Qk occurrences of a form ir(t. ) in A is obviously less than p. Hence the total number of checks required is less than 2p (i.e. the order of p). 3.5 Discussion 3.5.1 The Height of a Tree Given a tree for an arithmetic expression, the distribution algorithm tries to lower tree height by distribution if possible. However, in general it may not give the minimum tree height. For example let A = ac + ad + be + bd whose tree height is 3> and since no further distribution is possible, the distribution algorithm yields the same value. There is, however, an equivalent expression A' = (a+b)(c+d), whose tree height is lower than 3, i.e. 2. That is, even though factorization lowers tree height sometimes, the distribution algorithm does not take care of it. The question we ask now is how much the distribution algorithm lowers tree height. Before giving an answer to this question let us study tree height in more detail. Given an arithmetic expression, Theorem 1 gives the exact height of a tree obtained by Bovet and Baer's algorithm. It is also desirable if we can get an approximate tree height without actually building a tree for an arithmetic expression. 
Since the number of single variable occurrences (the number one less than this gives the number of operators in an arithmetic expression) and the depth of parenthesis nesting may well represent the complexity of arithmetic expressions, let us try to approximate tree height in terms of them. Let A be an arithmetic expression with n single variable occurrences and depth d of parenthesis nesting. Now build a tree for A by Bovet and Baer's algorithm. Then it can be proved that: 95 jemma 11 : log 2 rnl 2 < h[A] < n - 1. Moreover we can prove the following theorem. Theorem 3 • h[A] < 1 + 2d + log[nl 2 . The following lemma is helpful to prove Theorem 3< Lemma 12 (1) 2a>ra], (2) f2al 2 = 2fal 2 (3) log r P r p -[ml i=l < log (2 • 2 r m 1=1 i 2 Proof: (l) and (2) are obvious and (3) can be proved by (l) and (2) (Q.E.D.) Proof of Theorem 2 : Proof is given by induction on d. First let us prove the theorem for d = 0- Then A has the following pattern: A - F it a.. i=l j=l r Then by Theorem 1, h[A] = log Z T q 1 i=l 96 < log 2 r p by Lemma 5(3) = 1 + log[n] 2 . Nov assume that the theorem holds for d < f . Let t. be an arithmetic expression with depth d.(< f) parenthesis J J i i ^ nesting and n. single variable occurences. Then by assumption h[t.] < 1 + 2d't J ■ J J + log[n_. lp- Now an arithmetic expression with f + 1 parenthesis nesting can be built from t. as follows: (q. m. n. . i . ir a' tt (tih j=l 3 k=l K where a. are single variables and at least one of t. has f nested parentheses. Now each t. can be replaced with a product of e[t.] single variables without J J affecting the total tree height. Instead of using the value e[t.], let us use 2d' the value 2-2 J . Tn 1 ^. (Note that h[t^] < 1 + 2d 1 + logrn 1 ! = log (2 • 2d 1 . . . h[t X ] 2d* 2 J • TrulJ and eft 1 ] =2 J < 2 • 2 J • rn*"L.) Since d 1 . < f, e[t*] < 2f i 2-2 fn 1 • Now from Theorem 1, we have J 2 h[A] < log r p TCI. 1 2+ Z 2- S 2frn j 1 2l2 3=1 r P m. < log 2 s 2f i- 2 [q + r 2 • 2 nt] 3=1 J 1-= 97 |2.2.2.2 2f r (q, + rnbl < logl2-2-2-2 = 1 + 2(f+l) + logrnl 2 . Thus the theorem holds for d = f + 1 and this proves the theorem. (Q.E.D.) Now let us examine the original question i.e. how effective is the algorithm presented in this chapter. Let A and A be arithmetic expressions d where A is the resultant expression obtained from A after the application of the distribution algorithm. Now build trees for A and A by Bovet and Baer's algorithm. Then it should be clear that h[A] > h[A ]. Moreover experience suggests that: Conjecture : h[A d ] ^ 2 log 2 rnl 2 where n is the number of single variable occurrences in the original arithmetic expression A. Note that the distribution algorithm speeds up a Horner's rule polynomial in a logarithmic way. Also note that the distribution algorithm does no distributions in the case of which takes 21ogr nl steps as it is presented but which would take (n+l) logrn] steps if fully distributed. Thus the algorithm can save a factor of n/2 steps over a scheme which would distribute indiscriminately and in some cases achieves a logarithmic speed up. 98 3.5.2 Introduction of Other Operators 3.5.2.1 Subtraction and Division Subtractions can be introduced into an arithmetic expression without causing any effect on the distribution algorithm. It may be necessary to change operators to build a minimum height tree. For example let A = a + b - c + d. This will be computed as A = a + b - (c-d): Divisions may require special treatment since the distributive law does not hold in certain cases, e.g. 
(a+b+c)/d = a/d + b/d + c/d but a/(b+c+d) / a/b + a/c + a/d. Hence in general minimization of the height of trees for a numerator and a denominator is tried independently, and then distribution of a denominator over a numerator is tried if appropriate. Also let A = t/t'. Then T[A] is built from T[t] and T[ t ' ] as follows: T[t], or If h[t] / h[t'], then we get nodes to which only one tree is attached, e.g. a and (3 if h[t] < h[ t'], and a' and 3* if h[t] > h[t']i Then a and p are treated as free nodes in T[A] while a' and 3' are not treated as free nodes in T[A], because later when another expression, say t", is multiplied to A, t must 99 be multiplied by t" not t', i.e. t"(A) = t"(t/t') = (t"t)/t' ^ t/(t't**): T[t" T[f] 3- 5-2.2 Relational Operators If an arithmetic expression A contains relational operators e.g. B I ) C -where RO = [>. <, =, >, <, . ••}, trees can be built for B and C independently: A = T[B] T[C] If h[B] -*' h[C], then operators may be moved from one side to the other to balance two trees. For example let A=a+b+c>d. Then we modify this as A' = a + b > d - c and get h[A'] =2 while h[A] = 3: b d 100 k. COMPLETE PROGRAM HANDLING Chapter 3 presented the algorithm which reduced tree height for a single arithmetic expression by distributing multiplications over additions properly. In this chapter we will discuss some ideas about how to handle complete programs, i.e. given one program, how can it be executed in the shortest time by building a tree as well as executing a statement in a for statement simultaneously for all index values. Ideas include back substitution. We do not have the solution to the problem, but this chapter presents some details of the problem and some ways to attack them. We conclude this chapter by comparing serial and parallel computations in terms of a generated error. It is shown that in general we could expect less error from parallel computation than serial computation. It is also shown that distribution would not increase the size of an error significantly. 4.1 Back Substitution - A Block of Assignment Statements and an Iteration While the distribution algorithm in the previous chapter discusses tree height reduction for a single arithmetic expression, it can be used for any jump free block of assignment statements. If we define those variables which appear only on the right hand sides of assignment statements or in read statements in a block as inputs to the block, and those variables which appear only on the left hand sides of assignment statements or in write statements as outputs from the block, then we can rewrite the block with one assignment statement per output by substitution of assignment statements into one another. For example 101 a := b + c; d := e x f; g := a + c; h := a + g x d can be rewritten as g := (b+c) + c; h := (b+c) + ((b+c) + c) x (exf ) . After such a reduction only input variables appear on the right hand sides of assignment statements. At this point, the distribution algorithm could be applied to each remaining assignment statement and if sufficient computer resources were available, all of the reduced assignment statements could be executed at once- In the above example if each statement is computed in parallel (by building a tree) independently then 5 steps are required, while if the back substitution is done then the computation requies only + steps. Suppose we have assignment statements, A , A , ...,A . Also suppose that by back substitution we can rewrite this block as A. 
We build minimum height trees for A, ,A , ...,A and A. Now we apply the distribution algorithm on those trees. Let the resultant tree heights be h| A ],..., h[A ] and h[A] . Then obviously h r A-,] + ... + hi" A ] > h[A], i.e. back substitution never increases the computation time (in the sense of tree height) (Figure l+.l). Our main interest here is the case where strict inequality in the above relation holds, because that h[A, ] + ... + h[A ] > h[A] holds is equivalent to a speed up of the computation by back' substitution. Note that back substitution amounts to symbol manipulation (i.e. replacement) and should not be confused with arithmetic simplification. For example from 102 w V h [An] Figure k.l. A Back Substituted Tree a := x + y b := a + y we get b := (x+y) + y orb :=x+y+y but we do not get b := x + 2y . Now we shall study this kind of speed up. We shall discuss a limited class of assignment statements, i.e. an iteration. This may serve to give some insight to the problem of speed up by back substitution in the general assignment statement case. By an iteration we mean a statement 'i ■ f(y i-i'- Usually a statement is executed repeatedly for 1 = 1, 2,'. . .,n. An example is: for I := 1 step 1 until 10 do A[I] := A[I-1] + A[I]; 103 Also a block of assignment statements such as: S 1 : a := h + i + j; S ? : b :-■ a + k + m; S,: c := b + n + p; S. : d := c + q + r; falls into this category (note that all statements have a form output of S. : = output of S. + x + y where x and y are pure inputs in the sense that they do not appear as outputs). Assume that we are only interested in the value of y (the other results, i.e. y ,,-y rt , . . . , y, may be obtained similarly to y but in less n-1 n-2 1 n time). Then instead of n statements, i.e. y, = f(..), y p = f(..),...,y = f(..), we n^ay obtain one statement for y by back substitution. For example, let y. = a. y. . Then y = a n y , = a .(a ^y _) = a .(a _(a ,y _)) 1 l-l l-l n n-l^n-1 n-1 n-2 J n-2 n-1 n-2 n-3 n-3 n-1 = • • • - y tt a, . We use the superscript "b" to distinguish the back b n_1 substituted form from an iteration form, e.g. y = a ,y . and y = y^ tt a. . J n n-1 n-1 n . _ k k=0 Then instead of computing each y. repeatedly for i = 1, 2, . .,n, y may be computed directly. In the above example y. can be computed in one step* and to get y n steps are required while y can be computed in r log nl steps in parallel (i.e. by building a tree for y ). The following table summarizes the results for some primitive yet typical iteration formulas. 1C4 y i b y n T s nT s T P ay i-l n a v~ 1 n ri g 2 (n + ljl yi.i + b n y Q +*b +"/."+ b A 1 n flog 2 (n + 1? a i-i y i-i n-1 Vo k=0 K u 1 n log^ y -, + a °i-l i-1 n-1 z \ + y o k=0 K 1 n flog 2 n1 ay._ x + b *\ + P n-l (a ^ 2 2n *2fiog 2 (n + 1? ay i-l + X i-1 . . ■*-* p;(a) 2 2n ar2flog 2 (n + 1)1 y i-l + bx i-l n-1 z bx k + y o k=0 K u 2 2n « 'log nl + 1 2 ay i-l + bx i-l p" a) *n 3 3n -2riog 2 (n + 1)1 /• \ -l. n-1 , n-2 p - (a) = ba + ba + . . . + b „ . . / \ n , n-1 , n-2 ** P n (a) = a y + a X + a X l + "• + "n-2 + X n-1 „v.j, ti / \ n , n— 1 , n-2 , *** p (a) = a y + ba x^ + ba at, + . . . + bax n + bx . n J 1 n-2 n-1 T : The time required to compute y. in parallel, i.e. h[y.]. T : The time required to compute y in parallel, i.e. h[ y ]. Table k.l. Comparison of Back Substituted, y, and Non-Back Substituted Computation, y. — Iteration Formulas 105 From Table *4-.l, the following lemma is obtained by exhaustion. Lemma 1 : Let y. = f (y. , ) be linear in y. . 
where we assume that in the presented form additions are reduced to multiplications as much as possible, e.g. y. = 2y i _ 1 instead of y ± = y i _ 1 + y i _ 1 - Then n x h[y.,] > h[y n l- Thus if we have enough FE's, then instead of computing each y. repeatedly for i = 1, 2, ...,n, we should obtain y by back substitution and compute it by building a minimum height tree. If an iteration y. = f(y. , ) is not linear in y. _, e.g. y. = a y? , + b y^ + c, or if it is linear in y. but there are some additions not being reduced to multiplications, e.g. y. = y. , + a y. , then it is not clear if back substitution speeds it up. For example, back substitution does not speed up the computation of y ± = y i _ 1 + y i _ 1 + y ± _ 1 + Y ± _ 1 ' Also let y i = f (y i _ 1 ) be a polynomial of y. , where in the presented form additions are again reduced to multiplications as much as possible. Then it is not likely that we can speed up the computation by back substitution. Let „/ \ m f (y. , ) = a y. . + ... where y. is the highest power of y. , among those which appear in f(y. ). Note that f (y. ) is not necessarily a dense polynomial (a polynomial in which all powers of y. , i.e. y. ,, y. ,, ..., y. ., appear). While the exact height of T[ f (y. , )] depends on f(y. ), we may content ourselves with (see Chapter 2) 106 h [ f(y i-l )] ~ 2 r io g 2 m1 ' Hence 2nriog ? ml steps are required to compute y . Now let us consider y . Then n b , b m n m n-l m m = a (a (y ) + . . . ) + m v m w n-2 m m = a (a ( . . .a y n . . . ) . . . ) + ... mm nrO ' i n n-l m = a m y Q + ... . That is, y becomes a polynomial of y of degree m . Leaving the computation of a out of consideration, we have (see Chapter 2) h[y£] * 2[-log 2 m n l * 2nr logpinl . Hence back substitution does not help to speed up computation significantly in this case. To gain a better understanding of more general cases, let us study the situation from a different point of view. Given an iteration y. = f (y. , ), let us consider the number of single variable occurrences in y as a measure of e J n the complexity. We study two cases separately, i.e. (i) y. .. appears only once in y. and (ii) y. appears k times in y. . In both cases we assume that there 107 are m single variable occurrences (including the occurrences of y. ) in y. . For convenience we write N(y) for the number of single variable occurrences in y, e.g. N(a+b+cd4a-e) = 6. (l) y. , appears only once in y. . In this case we have N(y 1 ) = m N(yg) = N( y;L ) + m - 1 = 2m - 1 N(y^) = N(y^) + m - 1 = 2m - 2 N(y ) = N(y ,)+m-l = mn-n + l~ ran. J n n-1 (2) y. appears k times in y. . In this case we have N(y 1 ) = m N(y^) - k • N(y^) + m - k = m + k(m-l) N(y^) = k ■ N(Vg) + m - k = m + (k+k 2 )(m-l) • • • N(y^) - k ■ N(y^_ 1 ) + m - k = m + (k + . . . +k~ )(m-l) = 1 +^-i(m-l). If k n » 1 and m » 1, then N(y b ) s k n_1 m. > w n Now if we use 21"log p N(y)l as a measure of the height of a tree, then we have (see Section 3- 5*1 of Chapter 2): 108 h[y ± ] n x HV ± ] «# (1) 2flog 2 m] 2n[ log 2 ml 2flog 2 (mn)l * 2(riog 2 m] + rioggii"!) (2) 2riog 2 ml 2nf log ml 2riog 2 (k n " 1 m)l « 2((n-l)riog 2 kl + floggml) Table k.2. Comparison of Back Substituted, y , and Non-Back Substituted Computation, y^ — General Cases For example let m = 5, k = 2 and n = 20 in (2). Then we have n . hfy.l = k0 • T log 51 =120 and h[y£] = 2(l9flog 2 2l + flog^l) = kk. Also if we let m = 5 and n = 20 in (l), then we get n • h[y.] = i+oriog 2 5l = 120 and h[y*] = 2(riog 2 5l + Tlog^Ol) = 16. 
Now a few comments about implementation are in order. As for back substitution of a block of assigment statements, the step by step substitution is the only possible scheme. In case of an iteration formula, we may use the z-transformation technique to obtain y [8]. For example let y. = y. n + x. . Then by applying z-transformation on it, we get Y(z) = zY(z) + X(z) or Y(z) = ^ Z ^ . Hence Y(z) = X(z)(l + z + z 2 + 7? + . . . ) 2 2 = (x + X-.Z + X z + . . . ) (l + z + z +...) = x Q + (x 1 + x Q ) z + (x 2 + x ± + X Q ) z + . . . 109 or y = Z \> 1 k=0 * Two other related problems become evident in the example presented above. First is algebraic simplification. For example, a := b + kc could be executed more quickly than a := b+c+c+c+c We shall not discuss this subject further here. A second problem is the discovery of common subexpressions. In our example, (b+c) appears twice in the right hand side of it. If we had an algorithm, e.g. [11] which discovered common sub- expressions in one (or more) tree which could be simultaneously evaluated, the number of FE's required could be reduced by evaluating the common subexpressions once for all occurrences. On the other hand, by removing common subexpressions the execution time (the height of a tree) may be increased in some cases. For example, if we have x := a(b+c+de) and y := f(g+c+de), then we might try to replace c + de in x and y by z as follows to save the number of PE's required: x However, note that x = a(b+z) or y = f(g+z) takes k steps while the original x and y require only 3 steps, i.e. h[a(b+c+de)] = 3 and h[a(b+z)] = k. Thus an overall strategy must be developed for the use of a common subexpression discovery algorithm in conjunction with overall tree height reduction. 110 h . 2 Loops This section is included here to complete this chapter, and discusses the subject superficially. Details will be presented in the following chapters. Consider the following example. El: for I := 1 step 1 until 10 do for J := 1 step 1 until 10 do S3: A[I,J1 := A[I,J-1] + B[j]; In this case ten statements, A[l, J] := A[1,J-1] + B[J], A[2, J] := A[2, J-l] + B[J], ..., A[10, J] := A[10,J-1] + B[J] can be computed simultaneously while J takes values 1,2, ...,10 sequentially. We say that S3 can be computed in parallel with respect to I. Note that originally the computation of El takes 100 steps . (One step corresponds to the computation of S3, i.e. addition. For the sake of brevity we only take arithmetic operations into account and shall not concern with e.g. operations involved in indexing.) By computing S3 simultaneously for all values of l(l = 1,2, . ..,10) the computation time can be reduced to 10 steps . b 10 Finally by building a minimum height tree for A [1,10] (:= A[I,0] + Z B[J]) J=l for each I (i = 1,2,..., 10), we can compute all ten trees simultaneously in h_ steps . To help understanding, let us further consider L: for I := 1 step 1 until N do for J := 1 step 1 until N do S; Then Figure ^.2 (a) shows the execution of L as it is presented. The total computation time required (t) is N, x N x m, where we assume that m arithmetic operators are in S. Now suppose S can be computed in parallel with respect to Ill I, (Figure 4.2(b)). In Figure 4.2(b), each box has the form shown in Figure lj-.2(c). Here S 11 computed sequentially, i.e. T. = mN p . Now let us compute S in parallel i.e. by building a tree (Figure 4.2(d)). Then we have T Q = N h[S]. Note that m > hJ"S]. Further if we back substitute S for J = 1,2 N and get S , we have T = h[ S ]. 
As stated before (Section 4.1), N 2 h[S] > h[S b ], or T Q > T ± > T g > T . general we have L: for I := 1 step 1 until N, do for I : = 1 step 1 until N do for i := 1 step 1 until N do S; n *- n — re S is an assignment statement. Then the computation of L takes T = n 7T N. m steps as it is presented, where we assume that m arithmetic operations ] are involved in the computation of S. If S can be computed for all values of I = 1,2, ...,N. ) simultaneously, then the computation time can be reduced to K T, ~ N.lm) stens, i.e. N. statements can be computed simultaneously , I , ...,I _,I, .,..., I change sequentially. In general there are n possibilities, i.e. we examine if S can be computed in parallel with respect to I for k -• 1,2, ...,n. Let P = [k|o can be computed in parallel with respect to I, . hen we would compute 3 in parallel with respect to I where T a = 112 I=N C3> (a) (b) J: =J+1 J T m <^> -^ ^J+l ) h[S] 3 h[s b ] (c) (d) 3 (e) Figure k.2. Loop Analysis 113 min T . Clearly each statement of the resultant N statements can be computed kcF k g by building a tree. Further if it is appropriate we perform back substitution and obtain a big tree as the above example (El) illustrates. If a loop is a limit loop which terminates when e < 6 for some pre- determine! B and computed c, it may be approximated by a counting loop (e.g. or T :- 1 i s - er 1 until II do ) which is executed a fixed number of times before "est is made, and then repeated if the test fails. Consider a program containing n two way forward jump statements (or if statements). Let the tests for jumps be Boolean expressions B„,B ,...,B . — 1 2' n Assume that there are m output variables from the program given as expressions ,A_, ...,A , where parts of A. may depend on B.'s. In a program when B. is 1' 2' ' nr l j j encountered, one of two choices is taken depending on the value of B . . It is possible to start computing all of these possible alternatives at the earliest time, and choose proper results as soon as values of B.'s become available. For example a := g + c; B : a > ':): if B then d := e + f + s else d := a + g + t; : i + f + i x j x k x p x q; yield := B x (e+f+s) + ( not B) x ((g+c)+g+t) +f+ixjxkxpxq or 114 h := ((g+c) > 0) x (e+f+s) + ((g+c) < 0) x ((g+c)+g+t) + f + i x j x k x p x q, where we let B = 1 for true , B = otherwise. Then we may build a tree for h as follows . gcefsg eg t f ijkp q Figure k.J. A Tree with a Boolean Expression The box \B • produces or 1 depending on the value of (g+c>0). In general, Boolean expressions can be embedded in arithmetic expressions as shown in the above example, and a minimum height tree can be built for it. h.k Error Analysis In this section parallel and serial computation are compared in terms of error. We are only concerned with a generated error , i.e. an error which is introduced as a result of arithmetic operations. It is shown that in general parallel computation would produce less error than serial computation. It is also shown that distribution would not increase the size of an error significantly. Let co represent any arithmetic operation. In general, we do not perform the operation co exactly but rather a pseudo-operation (jx>) . Hence instead of obtaining the result xo^y, we obtain a result x(a?)y. We may write 115 x y = (^y)(i+e ) (1) where <: represr I in error introduced by performing a pseudo-operation. For example, we have -: y = (x+y)(l-f€ a ) and x Q y = Cxy)(l+e m ). 
Let us write A for an approximation to an arithmetic expression A with an error obtained by computing A using pseudo operations, e.g. fy\ or (+) . Then ^an also be written as (xuoy) = (xo>y)(l+e ). Now let us consider the computation A = la.. i=l X First we compute A serially, i.e. A = . . .f((a 1 +a 2 )+a,)+a^)+...+a N ). .e have * = a i a 2 = (a 1 +a 2 )(l+e & ) - & + a 2 + e (a»j+a ), {+) a = (a +a +e (a +a,,)+a )(l+€ ) J 3 -L^SlJ-.^ 3, = a, + a,- + a, *■ e (2a., +2a^+a-. ) . 1 2 3 a 1 2 3 • higher terms of e are neglected. St 116 # * \ = A 3 © % = a l + a 2 + a 3 + \ + e a ^ a i + 5a 2 +2a 3 + ai+ ), \ = \-l O *N = ._\ a i + e a ((N-l)a 1+ (N-l)a 2+ (N-2)a 3+ ... +aN ) N 2 a. + € (Na +(N-l)a +(N-2)a,+. . .+0 i=l We let E = e (Na +(N-l)a +(N-2)a +. . . +a ) . Next let us compute A in parallel, i.e. by building a tree: A Without loss of generality we assume that N is a power of 2. Then A I 2 ■ a l © a 2 = a l + a 2 + e a ( V a 2 ) 1-k = A 12 © A lk = a l + a 2 + a 3 + % + 2c a (a l + V a 3 4 V 1-8 = A l-ii A 5-8 = a l + a 2 + •" + a 8 + 3e a (a 1+ a 2+ ...+a 8 ) A A A l-N = l-N/2 O V+l-N " J/i + ri0g 2 N1 £ a .f^i 117 We let -r = r log Nl € £ a. . To compare E with E , let a = a = a = ... = a Then we get N' and a 2 a :: r log 2 Nl a • e a , or E S > E P. a a N An error for B = it b. can be analyzed in a similar manner. In this case we i=l ^ N E S -- E P = (N-l) e Trb. m m m . , i i=l Hence, Ln general, we could expect that parallel computation produces less error than serial computation. * hat if higher terms of e and e are neglected, then A can be a m ' written as A + i E (A) + ( v E (A) a a mm where an I ''A) are arithmetic expressions consisting of variables in A. For example, if we compute A = afbc+d) serially (i.e. A = a((bc)+d)) then we get 118 A = (ax((bxc)(l+€ m )+d)(l-f€ ))(l+€ ) ~ a(bc-d) + e a(bc+d) + € (2abc+ad), am " and E (A) = a(bc+d) and E (A) = 2abc+ad. m Usually E (A) and E (A) depend on how A is computed as we have shown for A -- E a.. Now let us compare parallel computation of two arithmetic expressions A and A , where A is the resultant expression obtained by applying the distribution algorithm on A, in terms of a generated error. Note that we can write A = A + e x E (A) + e x E (A) a a m m and A d * = A d + e x E (A d ) + e x E (A d ) a a ' m nr = A + g x E (A d ) + e x E (A d ). a a m m / n d As an example let us study A = a(bc+dj + e and A = abc + ad + e. bed e (a) A = a(bc+d) + e (b) A = abc + ad + e Figure k.k. Trees for a(bc+d) + e and abc + ad + e 119 Then we have * A = (a(bc(l+€ )+d)(l+e )(l+€ )+e)(l+€ ) m a nr ' v a y = (a(bc+d)+e)+e (2abc+2ad+e)+e (2abc+ad) a m. and A d * = (abc(l+€ ) 2 + (ad(l+e )+e)(l+e ))(l+€ ) m m &' J a = (abc+ad+e) + e (abc+2ad+2e) + e (2abc+ad). Note that E (A) = E (A ) in the above example, which is not mere chance. We can show that this holds for all cases. Lemma 2 : E (A) = E (A d ) m m Proof : where First let us consider t#a \ © *2 © ••' © *n t* = t. + € E (t.) + e E (t. ). 1 i aai mmi Then clearly n E (t) = € E E (t.) m m . ., m l i=l regardless of the order of additions whereas E (t) depends on the order of additions. Hence we may write * n n t = Z t. + e E (t J + e ZE (t.) . ., i a a Z m . n m i i=l 1=1 where Z indicates that E (t) depends on the order of additions. 120 and Now let us consider A* = t* (tj* © t* © ... © t*) A = t t x © t © t 2 © ... © t t n where and t = t + e E (t) + e E (t) a a mm t. = t. + e E (t. ) + e E (t. ). 
i i a a l m irr i' Then we have n n and Hence A = (t-K E(t)+e E(t))( Z t 4* E (t )+€ Z E(t ,)(l4€ ni )) aa mm . ., i a a i. m..mi m 1=1 1=1 n n n = t St. + e ( Z (t.E (t)) + tE (t ))+ € ( Z (tE (t )+t E (t)+tt.)) . ,i a . , i a a a m.,miim i i=l i=l 1=1 A = (t4€ E (t)+€ E (t))(t 1 4€ E (tJ+€ E (t 1 ))(l+€ ) (+) ... a a mm 1 a a 1 mm 1 m v-/ = (tt n + e (tE (tJ+t n E (t)) + e (tE (tJ+t.E (t)+tt n )) (+) v 1 a a 1 1 a ' m m 1 1 nr 1" v-^ t Z t, + e E (a£) + € ( Z (tE (t )+t ,E (t)+tt )) . , l a a L m . , mi lm l i=l 1=1 E(A) = E (A u ). m m (Q.E.D.) 121 As for E (A) and E (A ), they depend on the order of additions and cannot be compared simply. However, they may not differ significantly. As a simplified case, let us study the following: N A = t( £ a. ) i=l X and H N A = ^ (ta.). i=l X Again we assume that N is a power of two. Then to compute A, we first compute :; n ^ n n T. a. in parallel. As we showed before, (La..) = L a. + riog_Nle_ L a.. i=l i=l i=l i=l Hence * N N A = t( L a . +r log^Nl e Za.)(l+e ) . , i d a . _ i m i=l i=l N N N = t L a. + e r log Nit Z a. +£ t Z a. . . , l a . _ i m . - l i=l i=l i=l On the other hand, we have A. = (ta. ) = ta. + e ta. 11 l mi d + and A is obtained by summing A. in parallel, i.e. A d * = (..(((a* © a 2 ) © ( A ; © a*)) © ( (A ; © $ © (a; © Ag)))...) N N N . = t Z a. + e riog^Nlt Z a. + e t Z a. . .,i a & 2 •-,! m . . i i=l i=l i=l Hence in this case E (A) = E (A a ) as well as E (A) = E (A ). a a mm 122 5- PARALLELISM BETWEEN STATEMENTS This chapter should be read as an introduction to the following chapter which discusses loops in a program. In this chapter we study parallelism between statements, i.e. inter- statement parallelism. Given a loop and jump free sequence of statements (we call this a program), it is expected that they are executed according to the given (i.e. presented) order. However if two statements do not depend on each other, they may be executed simultaneously in hopes of reducing the total computation time. In general, statements in a program may be executed in any order other than the given order as long as they produce the same results as they will produce when they are executed in accordance with the given sequence. In this chapter we give an algorithm which checks if the execution of statements in a program by some sequence gives the same results as the execution of statements by the given sequence does. Also a technique which exploits more parallelism between statements by introducing temporary locations is introduced. 5-1 Program A program P with a memory M is a sequence of assignment statements S(i), i.e. P -= (S(l); S(2); ...; S(i); ...; S(r)) where i is a statement number and r is the length of a program P (we write r = lg(P)). The memory M is a set of all variables (or identifiers ) which appear in P. Associated with each S(i) is a set of input variables, IN(S(i)) and an output variable, OUT(S(i)). Then M =.U, (lN(S(i)) UOUT(S(i))). Further 123 we define two regions in a memory; a primary input region M and a final output region M as Mj. = (m | meIN(S(i)) and Vk < i, m/OUT(S(k) )} . and M = {m | meOUT(S(i) )1 . A program uses the values of those variables in VL. as primary input data and puts final results into M . C(m) refers to a content ( value ) of a variable m. C(M) refers to the contents of variables in the memory M as a whole and is called a config - uration of M. Also C T (m) refers to a value which m has before a computation (i.e. an initial value of m). 
Thus C(M ) refers to primary input data given to a program. We call it an initial configuration . The following relations are established among statements in P. A triple (id, i, j) where id e M (id for an identifier) and i, j e {0,1, . . . ,r,r+l] (r = lg(P)) is in the dependence relation DR(P) if and only if: (1) (i) i < j and (ii) id e 0UT(S(i)) and id e IN(S(j)) and (iii) Vk, i < k < j, id ft 0UT(S(k)), or (2) (i) i = and (ii) id e IN(sCj)) and (iii) Vk, < k < j, id I 0UT(S(k)), (s(,j) is the first statement to use id), or (3) (i) $ = r + 1 and (ii) id e 0UT(S(i)) and (iii) Vk, i < k < r + 1, id / 0UT(S(k)). 3.214- only if: (S(i) is the last statement to update id). Similarly a triple (id, i, j) is in the locking relation LR(P) if and (i) i < j and (ii) id e IN(S(i)) and id e OUT(S(j)) and (iii) Vk, i < k < j, id / CUT(s(k)). Example 1 (The notations follow ALGOL 60| 3] ) : Let P be S(l): a I: d ): f ): g = b + c; - a + e; = g + d; - h + i. Tli en DE(P) = {(b,0,l),(c,0,l),(e,0,2),(g,0,3),(h,0,i|),(e,0,^),(a,l,2), (d,2,5),(f,3,5),(g,^,5)) and LR(P) = [(g,3,^)). Since we are only interested in meaningful programs, we assume that there is no superfluous statement, i.e. there is no id e M such that (i) id e OUT(S(i)) and id e OUT(S(j)) where i < j, and (ii) Vk, i < k < j, id / IN(S(k)). Also v/e assume that there is no statement that has no inputs other than constant numbers, e.g. "a := 5" • Now we define an execution order E of a program P as : E(P) = {(i,j)|ie{l,2,...,lg(P)},je(l,2,...}}. 125 JL We also write E (i) = j if (i, j) e E(T). W To execute a program by E(P) means that at step j, all statements with statement number E_ (j) are computed simultaneously using data available before the j-th computation as inputs. A pair (P, E) is used to denote this execution . Also by E n (P), we understand the execution order given by a program, i.e. E (P) = ((i,i) | Vi e [l,2,...,lg(P))}. E n is called a primitive execution order . We assume that at each time step at least one statement of P must be executed. That is, for any E there is k such that Vj > k, E (j) = empty and Vj < k, E (j) / empty. We call k the length of an execution and write lg(E). As stated before, C(OUT(S(i))) refers to the contents of a variable OUT(S(i)). This value, as we expect, varies from time to time throughout an execution. Thus it is essential to specify the time when a variable is referred to. S(i)(P,E) refers to a computation of S(i) of P in an execution (P,E). C(m) after S(i)(P,E) refers to the value of a variable m right after S(i)(P,E). C(m) after (P, E) refers to the value of a variable m after an execution of a whole program. 5-2 Equivalent Relations Between Executions Now we define two equivalent relations between executions. J For convenience we define that Vi, E_(0) < E_(i) and E-^i) < E p (lg(P)+l) 126 Definition 1 ; Given a program P and two execution orders E and E , (P,E, ) and (P, E_) (or simply E.. and E ? ) are said to be output equivalent if and only if: for all initial memory configurations C_(M_), Vi C(OUT(S(i))) after s(i)(P,E 1 ) = C(OUT(S(i))) after S(i)(P,Eg). We write (P,E )~(P,E ) if (P,E.) is output equivalent to (P,E ). Definition 2 : Given two programs P and P , let their execution orders be E, and E p respectively. Also let their memories be M, and M p . 
Then two executions (P ,E,) and (P ,E ) are .said to be memory equivalent if and only if: (l) there is a one-to-one function fi <"ll U V - lM SI U «20> such that f(M ) = M 1 II 21 and f(M 1Q ) = M 2Q , and (2) for all initial memory configuration pairs C_(M-]_j) and C T (M ?T ) such that Vm € U ir CjU) = C I (f(m)). Vn € M 1Q , C(n) after (P^E^ = C(f(n)) after (P 2 ,E 2 ). M We write (P ,E,)~(P ,E ) if (P-,E, ) is memory equivalent to (P ,E p). 127 In principle, a program is written assuming that it will be executed sequentially, i.e. by E Q . It, however, need not necessarily be executed by E~ as long as it produces the same results as (P,E n ) when it terminates, i.e. (V lM, it may be executed by any E as long as (P,E)=(P,E_) holds. Now the following theorems can be proved directly from the above definitions. Theorem 1 (P,E)=(P,E ) if and only if: (1) Vi, (id, i,.j)eDR implies that E (i) < E (j). and (2) for any two triples (id, i, j) and (id, i', j ' ) in DR with the same identifier id, either E p (j') < E (i) or E (j) < E (i') holds. What condition (l) implies is that variables must be properly up- dated before used, and condition (2) prevents variables from being updated before they are used by all pertinent statements. (a) Condition (l) «p(i) A \ Ep(j) Q id i) (b) Condition (2) E p (i) © Kp(iO © A i id A E p (d) © Ep(j«) 4 id © /A or /A Ep(i') © E p (i) 1 ld A A A Ep(j') © E p (d) 4 id Figure 5»1« Conditions for the Output Equivalence 128 Proof of Theorem 1: (1) if part: Assume that a statement S(i) receives data from statements S(0,S(i ), . . .,S(i, ), i.e. for each pair i and i (s=l,2, . . .,k) there is an identifier id such that (id ,i ,i)eDR. Now let E — s — s s be an execution order which guarantees that (l) before S(i) is computed, all S(i ) are computed, and (2) between the computation s of S(i ) and S(i), no statement updates id , then it is clear s s that (C(OUT(S(i))) after S(i)(P,E) = C(OUT(s(i))) after S(i)(P,E Q ) providing that all OUT(S(i )) have appropriate values. Note that 5 the above two requirements are equivalent to conditions (l) and (2) of the theorem. Then by induction, we can show that if conditions (l) and (2) hold for all statements, then (P, E)~(P, E~). (2) only if part: We give an example to show that if an execution order violates condition (l) or (2) then we cannot get an output equivalent execution. Now let P be S(l): a := b; S(2): c := a; S(3): b := e. Then DR = [(b,0,l),(e,0,l),(a,l,2),(c,2,4),(b,3A)}, and (P,E Q ) gives 129 C(0UT(S(1))) after S(l)(P,E Q ) = C^b), C(0UT(S(2))) after S(2)(P,E Q ) = C^b), and C(0UT(S(3))) after S(3)(P,E Q ) = C^b). Now let E (P) = {(1,2), (2,1), (3,3)} which violates the first condition of the theorem, and Eg(P) = { (l, 2), (2,3), (3,1)) which violates the second condition. Then C(0UT(S(2))) after S(2)(P,E 1 ) = C (a) and C(0UT(S(1))) after S(l)(P,Eg) = C (e) which do not agree with corresponding values produced by (P, E»). (Q.E.D.) Theorem 1 gives more meaningful executions compared to the previous results [5l [10]. For example let P be: S(l): a:=f 1 (x) 8(2): b:=f 2 (a) 8(5): c:=f 3 (b) S(k): b:=f u (x) S(5): d:=f (b,c). Fisher [10], for example, would give the following execution (P,E) as an "equivalent" execution to (P,E ). 130 Step 6) Q 1 aJ |b E= ((1 ' 1) ' (2 > 2) > (5 > 5) ' ^'V' (5A)) s I- - This, however, does not give correct results unless P is properly modified. Note that the variable b carries two different values between steps 2 and 3 which is physically impossible. 
Theorem 1 does not recognize such an execution as "equivalent" to (P,E n ). Theorem 'c M (P,E)=(P,E Q ) if and only if (1) Vi, (id, i, j) eDR implies that E p (i) < E p (j), and (2) V., (id, i, j) eLR implies that E (i) < E (j). J Example 2 : P: S(l): a:=b+c; S(2): d:=a+e; S(3): a:=q+r; SCO: h:=a+s. Let E(P) = {(3,1), (U,2), (1,3), (2,i+)}- Then (P,E)£(P,E Q ). E, however, violates the second condition of Theorem 2, i.e. (a, 2, 3) eLR but E p (2) i £ v £ J and - k < w < m, there is no id' such that (id',v, w) eDP. Then replace every occurrence of id in S, by id' where id' ^ M. Gold [17] presented a similar transformation to describe his model for linear programming optimization of programs. After the transformation is applied S, and S can oe processed in M parallel, and still (P,E )~(P',E ) holds where P' is the result^ic program a: the application of T, on P. This shows that the second condition of Theorem 2 is not essential, i.e. it can be removed by introducing extra locations if necessnry. 135 6. PARALLELISM IN PROGRAM LOOPS 6.1 Introduction 6.1.1 leplacement of a j f p r . Statement with Many Statements Using the results from the previous chapter, now let us study loops in a program, e.g. ALGOL for statements or FORTRAN DO loops, to extract potential parallelism among statements. Given a loop P, we seek an execution order E with the minimum length among all possible ones. Sometimes it may be appropriate to get a loop P' from P by the previously introduced transformation for which there M is an execution order E' such that (P',E* ) = (P,E ) and lg(E' ) is the minimum (For the definition of =, see Chapter 5)- As stated before, in this chapter our main concern is the parallelism among statements (inter statement parallelism). For example, we are interested in finding out that all 10 statements (A[I] := A[ I + 1] + FUNC(B[I]); (1=1, 2, ..., 10)) in Fl can be computed in parallel, whereas statements in F2 cannot be (The notation follows ALGOL 60 [3]): Fl: for I := 1 step 1 until 10 do A[I] := A[I + 1] + FUNC(B[I]); F2: for I := 1 step 1 until 10 do A[I] := A[I - 1] + FUNC(B[I]). First several notations are presented. According to the ALGOL 60 report [3], a for statement has the following syntax: 136 < for statement > ::= < for clause > + < statement > + < for clause > ::= for < variable > :» < for Hat > do. An instance of this is : for I. := ... do ■ m 9 for I := ... do begin EL ; S ; . . . ; S end . For the sake of brevity, we shall write (i. *-L n , I_ *■ L_, .... I *- L ) (S, , S^, 112 2 . n n 1' 2 . .., S ) or (I , I , ..., I )(S. , S_, . .., S ) for the above for statMMnt m ± d n ± d m ■ instance where I, is called a loop index, L is an ordered set and called a loop list set , and S is called a loop body statement with a statement identification number p (which is different from a statement number (see Chapter 5))- As its name suggests, a loop list set represents a < for list >, e.g. L, = (1,2, 3 A, 5,6) represents "I. := 1 step 1 until 6." In general we write L, (i) for the i-th element of L thus L. (|L. |) is the last element of L, . Now to facilitate later discussions we introduce the following notation. Let B = (b.,,b . ...,b ) (b. > for all i) and (i n ,i ,....i ) be n-tunles 1 2 n i v 1 2 n' # of integers. Then we define the value of (i.,i , ...,i ) w.r.t. B as follows.* — — — id n 7r +=* where * is the Kleene star. JUL 7nr F or convenience we write i(s..t) for (t-s+l) integers i , i , ..,, S S ^* A. i + i) i + , e.g. (i(l..s), i(s+2..n)) means (i.,....i ,i ^,...,i ). 
L ~- L b is s+2 a Also (|L(s..t)|) means ( |L g | , |L g+1 | , . . ., |L t | ). Finally (i(n)) means n i's e.g. (1(3)) = (1,1,1). 137 n n V ((i(i..n))|B) = 2 i.B - IB. +1 n+1 where B. = f b. ,. and B = b ,, = 1. j , . k+1 n n+1 This notation is introduced so that the relations V((1,1,...,1,D!B) = 1, V((1,1,...,1,2)|b) = 2, V((l,l,...,l,b )|B) = b , n V((l,l,...,l,2,l)|B) = b n + 1, V((1,1,...,1,2,2)|B) = b n +2, hold. and V((b 1 ,b 2 ,...,b n )|B) = b^x-.^ For example V( (2,3,1) I (3A> 5) ) = 31. An n -tuple B is called a base. The inverse function of V is also defined as V~ (t | B) = (i(l..n)) if V( (i(l. .n)) |b) = t. Note that V" 1 is not one-one e.g. V _1 (l5| (J>,k, 5) ) = (2,0,0) or V~ 1 (15| (3A, 5) ) = (l,i+,0). An n-tuple (i(l..n)) is said to be normalized if b . > i. > for all J J n n j. Let (i(l..n)) be normalized. Then 1 < V( (i( i. .n) ) | B) < 2 b.B. - 2 B.+l. 3-1 J J 0=1 ° n n , If 1 < t < Z b.B. - ZB.+l, then V* (t | B) has unique normalized (i(l..n)) as d-i J J j=i J its value. 138 VTe say that normalized (i(i..n)) ranges over B = (b(l..n)) in n increasing order if V( (i(i. .n) ) |B) takes all values, between 1 and £ b.B. - J=l J J n £ B.+l in increasing order as (i(l..n)) changes. Notationally we write J=l J (l(n)) < (i(l..n)) < (b(l..m)). Finally we let (i(l..n)) > (j(l..n)) if V((i(l..n))|B) > V( (j(l. .n)) |b) and (i(l..n)) = (j(l..n)) if V((i(l..n))|B) - V((j(l..n))|B) The following lemma is an immediate consequence of the above definition. Lemma 1: Let B = (b(l..n)) where V.b. > 2. Then li — (1) v((a(l..n))|B) < V((a'(l..n))|B) implies that n v(( ai - a 1 ',...,a n - a n ')|B) < - z B . or v((a 1 - e^', . . ,,a -a n ' ) |b) < V((0(n))|B). (2) V((a 1 '+c 1 ',...,a n '+c n ')|B) = V( (a^, . . . , a n +c n ) | B) ana V((a'(l..n))|B) > V((a(l..n))|B) imply that V((c-- C,',,.., n C n' C n' )|B) - ' . ZB j or V (( c i- c 1 '^-^ c n - c n ')l B ) < V((0(n))|B). (3) Let < [a. | < b for all j. Then V( (a(l. .n) |b) < Vi(0(n) ) |b) if and only if there is h such that Vk (1 < k < h), a. = ana a. < 0. 139 A loop must be replaced with a sequence of statements so that we can use the results of the previous chapter. For example we replace for I := 1 step 1 until 10 do SI: A[I] := A[I] + B[I]; with the sequence of ten statements A[l] := A[l] + B[l]; A[2] := A[2] + B[2]; A[10] := A[10] + B[10]. /" \ In general we will get J it |L.|-nJ statements after the replacement of a loop P: (I, . I_, . . , I )(S -S^;..:S ). Any statement in the set of replaced r 1' 2 n 1' 2 m statements can be identified by an n-tuple (i(l..n)) which corresponds to values of I 1 , I 2 , ...,I n (i.e. L 1 (i 1 ),L 2 (i 2 ), . .,L n (i n )), and p which represents a statement identification number. Thus an (n + l) -tuple (i(l..n),p) serves as a statement number , and we write S((i(l..n),p)) to denote a particular statement in the set of replaced statements, e.g. in the above example S( (3, l) ) = A[ 3] := A[3l + B[31- The actual statement which corresponds to this is the statement S with L n (i n ),..., L (i ) substituted into every occurrence of I,.....I in S , p ll'nn 1' ' n p and we also write S [L, (i, ),..., L (i )] for this. p 1 1 ' n n / n , \ These ir |L. m I statements are to be executed according to the presented order (i.e. the order specified by for loop lists). In other words, the statement S( (l, 1, . . ., 1, l) ) is executed first, S((l, 1, . . ., 1,2) ) second, ..., 1U0 the statement S( (i(l. .n),p) ) is executed V( (i(l. .n),p) | ( |L(l. .n) | ,m) )-th, .. ., and the statement S(( IL^J, . . 
., |L |,m)) is executed lastly. Formally, as the essential execution order we have: E Q (P) = {((i(l..n),p),V((i(l..n),p)|B)) | (l(n),l) < (i(l..n),p) < (|L(l..n)|,m)) where B = ( |L(l. .n) | ,m) . Example 1 : for I, := 1 step 1 until 10 do for I := 1 step 1 until 10 do begin SI: A 1 [I,,I ] := A 2 [I--1,I ] +B 1 [I 1 ,I ]; ■1^2- '1 •l'*2- S2: B^I^Ig-l] := A 5 [I 1+ 1,I 2 ] + B 5 [I 1 , Ig+1] ; end is executed as S( (1,1,1)): A X [1,1] := A 2 [0,1] +B 1 [1,1]J S((l,l,2)) S( (1,2,1)) S((l,2,2)) B 2 [1,0] : = A 5 [2,l] + B 3 [l,2]; S((10,10,2)): B 2 [10,9] := A 3 [ll,10] +B^[10,11]; The superscript is used to distinguish different occurences of A and B. 11H A[(i(l..n))] represents a form in which L.. (i. ),..., L (i ) are substituted into T , .. .,1 in index expressions, e.g. in the above example A 2 [(i 1 ,i 2 )] = A 2 ti x -l,i 2 ] and A 2 r(3,2)] = A 2 [2,2]. Finally a set of inputs to a statement S( (i(l. .n),p)) is denoted by IN(S((i(l..n),p))). Similary OUT(S((i(l. .n),p))) represents a set of outputs from S((i(l..n),p)). From the above example we have, e.g. IN(S(l,l,2)) = (A 5 [2,1]*B 5 [1,2]) and 0UT(S(1,1,2)) = {B 2 [1,0]}. 6.1.2 A Restricted Loop In what follows, we mainly deal with a restricted class of for statements. Two restrictions are introduced. Let a loop with m body statements be P™ = (l 1 ,I 2 ,...,I n )(S 1 ;S 2 ;...;S m ). Restriction 1: A for list set L. must be an arithmetic sequence, i.e. L = (.1,2,3,..., t) (1) for all i Restriction 2 Let (A ,A , ...,A 1 be a set of all array identifiers in F where the h-th occurrence of A, in P has the following form (where the superscript h is 142 used only if it is important to distinguish different occurrences of A, ): A^[F(k,h,l), F(k,h,2), ..., F(k,h,n)]. (2) For fixed k and j, F(k,h, j) has an identical form for all h, i.e. either F(k,h,j) = I. + w(k,h,j) J or F(k,h, j) = (i.e. vacant). w(k,h, j) is a constant number. Also we assume that each A, appears on the left hand side of statements at most once. An example of a restricted loop is: for I := 1 step 1 until 20 do for I := 1 step 1 until 30 do for I, := 1 step 1 until kO do begin SI: A 3 [ 1^1,12+3,0] := AjCIylg-J,)*] + ^[0,0,1^]; S2: A 2 [0,I 2 ,I 3 -1] := A ? [ 1^1,12,0] + A.^0,0, 1^1] ; S3: A 1 [0,0,I 5 +1] := A 2 [0,I 2 -1,I 3 ]; end ; Note that, for example, A, always appears as A,[I, +w(3,h,l), I +w(3,h,2),0], thus the first occurrence of A, is A 3 [F(3,1,1), F(3,l,2),0] = A 3 [ 1^(3,1,1), I 2 +w(3,l,2),0] • AjCXj-1, Ig+3, 0] - If there is no ambiguity, we write e.g. A,[I, -1, I 2 +3] f° r A,f I, -1, I 2 +3>0] (which is the conventional form). 1U3 We also write F(k,h, j)(i) for the resultant expression obtained by- substituting i into I. in F(k,h,j), e.g. A,[F(3, 1, l)(2), F(3,l,2)(3),0] = A,[ 1,6,0] (= A,[l,6] conventionally). A single variable may be introduced as a special case of array indentifiers, e.g. we write for I := 1 step 1 until 1 do A[I] := for T := .... 6.2 A Loop With a Single Body Statement 6.2.1 Introduction First we shall deal with the case where a loop has only one body- statement (i.e. m = l). Let a loop with a single body statement be P = (I , I , ...,I )S. Since there is only one statement we may drop the statement identification number. Then a statement number for a replaced statement becomes (i(l..n)) and as the essential execution order we have: EqCP 1 ) = {((i(l..n)), V((i(l..n))|(|L(l..n)|)) | (l(n)) < (i(l..n)) < (|L(l..n)|)}. Also in this case we only have to consider the array identifier which appears on the left hand side of S. 
Hence instead of s array identifiers we only have one array identifier (see Restriction 2 of Section 6.1.2). Hence we drop k and write A h [F(h,l), F(h,2), ..., F(h,n)] and Ikk F(h,j) = I + w(h,j) J for the h-th occurrence of A for Eq. (2) of Section 6.1.2 (the superscript is used if it is necessary to distinguish the different occurrences of A). Furthermore we assume that F(h, j) ^ for any h and j. Now let us study the following two examples. Gl: for I := 1 step 1 until 10 do A[I] := A[I] + 5; G2: for I := 1 step 1 until 2 do for J := 1 step 1 until 10 do A[I,J] := A[I-1,J+1] + 5; Assume that an arbitrary number of PE's are available. Then: Gl: All ten statements (A[I] :- A[I] + 5) can be computed simultaneously by 10 PE's. G2 : A[1,J] and A[2,J-2] can be computed simultaneously by two PE's at the J-th step (J=l,2, . ..,10). In what follows, the above two types of the interstatement parallelism are studied. Before we go into the details, a few comments are in order with regard to real programs. A for statement with a single body statement, (i, ,--.,I )S, can be classified from several different points of view. First of all let us take a for list set L.. As a simplified case we have L. = (s ., s .+1, . . ., t . ) (t . = (|l.|-1) + s.) which is equivalent to an ALGOL statement "for I. := s. step 1 until t. do". Knuth stated [9] that examination of published algorithms KJ showed that well over 99 percent of the use of 'the ALGOL for statement, the ll*5 value of the step was ' +1 1 , and in the majority of the exceptions the step was a constant. This statement was confirmed by checking all Algorithms published in the Communications of the ACM in 1969- There were 23 programs and 263 for statements used. Only six uses were exceptions (z 3 percent). Next let us examine a body statement S. Then either (l) the left hand side variable of S (i.e. OUT(S)) is a single variable, or (2) OUT(S) is an array identifier. In case of (2) S is of a form A°[F(0,l),...,F(0,n)] := f (A^FCl, l), . . . ,F(l, n)], . . ., A P [ f (p, l), . . . ] ) . Now S has either one of the following five forms. M (1) OUT(S) is a single variable t. (i) t := a function which does not depend on t, e.g. t := a + 5, (ii) t := f(t), e.g. t:= t + a, (2) OUT(S) is an array variable A: (i) A [F , ...,F ] := a function which does not depend on A, e.g. A[I,J] := b + 5, (ii) for all h F(0, j) - F(h, j) is a constant for each j, e.g. A[I,J] := A[I-5,J+3] + A[I+l,J-3] + 5 (iii) other cases, e.g. A[I,J] :- A[2I,J-5] + a. Note that if S is of Form (l-i), then P 1 = Sfl^ClLj), L 2 (|L 2 |), ..., L n (|L n |)]. For example let P 1 be We use a lower case letter for a single variable and an upper case letter for an array variable. 146 for I := 1 step 1 until 5 do t := A[I] - 1. Then after the execution of r, t = A[ 5] - 1« Again all Algorithms published in the CACM were checked (this time the check was made against Algorithms published in 1968 and 1969. ) There were 52 programs altogether and 117 for statements with a single body statement. The details were: (1) (2) No . of Exam pies Percentage (i) (ii) 1+2 35.8 (i) 18 15.4 (ii) 33 28.2 (iii) 2k 20.6 117 100.0 In what follows we deal with Forms (2-i), (2-ii) and (2-iii). Form (l-ii) has been discussed in Chapter h. 6.2.2 Type 1 Parallelism 6.2.2.1 General Case As stated in Chapter 5> a block of statements P need not be executed according to the essential execution order E„ and may be executed by any execution order E as long as (P, E n )=(P, E) holds. 
In this section we study a special class of execution orders called type 1 execution orders. This execution order is defined for each loop index I (u=l,2, . . . ,n) and hence there are n of these. ll+7 Definition 1: A type 1 parallel execution order with respect to I (we write 1-p w.r.t. I ) is given by E(P) = {((i(l..n)),V((i(l..u-l),i(u+l..n))|(|L(l..u-l)|, |L(u+l. .n) | )) |(l(n)) < (i(l..n)) < (|L(l..n)|)), and is represented by E[ I ] . Figures 6.1 and 6.2 illustrate execution orders E~ and Efl 1. D L u J Note that Efl ]((i(l..n))) = E[I ] ( (i f (l. .n) ) ) if i = i* for all U li K. K. k = 1,2, . . . , u-l,u+l, . . . ,n. Furthermore note that if V((i(l..u-l))|(|L(l..u-l)|)) > V((i'(l..u-l))|(|L(l..u-l)|)), then E[I u ]((i(l..n))) > E[ y ( (i» (1. .n) ) ) . By introducing extra |L | PE's, the computation time becomes one n n |L I -th of the original, i.e. tt |L.| steps instead of it |L.| steps, where one U j=l J j=l 3 step corresponds to the computation of a body statement. We now introduce TRANQUIL notation [2] to illustrate Definition 1. In TRANQUIL for (I) sec} (L) do S st^oids for for I := (for list set) do S. Also in TRANQUIL for (i) sim (L) do S indicates that statements S(L(i)) are executed simultaneously for all L(i) in L. Then Definition 1 amounts to obtaining H+8 C\J \H ?■ r^ : ' O — -H • V > V + 3 + -<><^o h *1 H O a; -p fit v CO • •H -> •H * bO • v c\? 0) -P 3 ti 49 ' E -s C CO • o o ^ h - O -P 3 co C cd ,£ •P O a5 -P^ 0) -^ o fa £i -P co 0) -p ctJ •H TJ •H SB H O o co c u- ( O CO W ft co CM 3 3 + 3 3 + 0 Figure 6.3* Conditions of Parallel Computation in a Loop computation proceeds from the leftmost column to the rightmost column while in a column the computation proceeds from the top to the bottom sequentially. On the other hand if P is executed by E[ I ] then we proceed to compute from the top row to the bottom row while we perform computation in each row simultaneously. Each computation S( (i(l. .u-l), i(u. .n) ) ) uses inputs IN(S( (i(l. .n))) and updates the output 0UT(S((i(l..n)))). Then as we studied in Chapter 5, we have to make sure that the computation S((i(l..n))) (marked x in Figure 6.3) does not receive any data which are to be updated by the computation in the region R, i.e. there must be no id such that (id, (i(l..u-l),i'(u..n)), (i(l..n))) cDR 151 holds where (i'(u..n)) > (i(u..n)) and 1^ < i^. Similarly the computation S((i(l..n)}) must not use any data which are to be updated by the computation in the region Q, i.e. there must be no id' such that (id\ (i(l..n)),(i(l..u-l),i"(u..n))) €LR holds where (i"(u..n)) < (i(u..n)) and i u " > i u « The above observation gives the following theorem. Theorem 1 : Let E[I ] be a type 1 parallel execution order w.r.t. I . Then M (P^Efl ]) = (P 1 ,E Q ) if and only if there are no id, id', (1(1. .n)), (i(l..u-l), J i'(u..n)) and (i(l. .u-l), i"(u. .n) ) for which either (1) (i) i ' < i and ' u u (ii) (i'(u+l..n)) > (i(u+l..n)) and (iii) (id, (i(l. .u-l),i' (u. .n)), (i(l. .n))) eDR, where id e OUT(S((i(l..u-l),i"(u..n)))) and id € IN(S( (i(l. .n) ) ) or (2) (i) i " > i and u u (ii) (i"(u+l..n)) < (i(u+l..n)) and (iii) (id*, (i(l..n)), (i(l.. u-l), i"(u..n))) eLR where id'e OUT(S((i(i..u-l),i"(u..n)))) and id e IN(S((i(l. .n))) hold. T (±' (u+1. .n)) ^ (i(u+l..n)), for example, means V( (i 1 (u+1. .n) ) | B) > V((i(u+1. .n))|B) where B = ( |L(u+l. .n) | ) . Unless specified, the base ( | L(s. .n) | ) (=( |L |,...,|L |)) is to be understood for (i(s. 
.n)) 152 Let S be of a form A := f(A , ...,A P ) where A is an array identifier and the superscript is used to distinguish different occurrences of A. Then id in the first condition of Theorem 1 corresponds to those A [(i(l..n))] for which A [(i(l..u-l),i'(u..n))] = A [(i(l..n))] holds together with the three conditions (i), (ii), and (iii) of (1) between (i'(u..n)) and (i(u..n)). Similarly id' corresponds to those A [(i(l..n))] for which A [(i(l..u-l), i' (u. .n))] = A [ (i(l..n))] holds. Thus A (l < h < p) can be classified into three groups : CI = {h|A satisfies the first condition} C2 = {h|A satisfies the second condition) C3 = (1,2, ...,p) - CI - C2. Note that CI n C2 = 0. Example 2 : Let P 1 be (I x - (1,2,3), I 2 - (1,2,3))(A°[I 1 ,I 2 ] := f(k\ I^Ig+l])). Then for i^ = 1< i g = 2 and (i 2 ') = (3) > (ig) = (2), we have A°[ (i^i^ )] = A [(i ,i )] = A[l,3], or (A[l,3], (1,3), (2,2)) eDR. Thus P cannot be computed in 1-p w.r.t. I,, and CI = [1] . From this argument it should be clear that if a body statement is of Form (2-i) (see Section 6.2.1), then the loop can be computed in 1-p w.r.t. any I u (u=l,2, ...,n). 153 6.2.2.2 A Restricted Loop If a loop is a restricted loop, then Theorem 1 may be simplified. First we define a vector R(h) for each h = 1,2, . ..,p: R(h) = (R 1 (h),...,R n (h)), where R (h) = F(0,j) - F(h,j) J = I, + w(0,j) - (I + w(h,j)) J J = w(0,j) - w(h,j). For example we get R(l) = (-1,8) from a statement A°[ I r l,I 2 +3] := f(A X [ I^Ig-5]). Then we use these vectors to check parallel computability as follows. Also for convenience we write R'(u,h) = (R 1 (h),...,R u _ 1 (h)) and R"(u,h) = (R u+1 (h),...,R n (h)). Theorem 2 : If one of the following two holds for any of R(h) (h = 1,2, ...,p), then P cannot he computed in 1-p w.r.t. I . (1) (i) R'(u,h) = (0,...,0) and (ii) R (h) > and u (iii) R"(u,h) < (0, ...,0) and V.(u + 1 < j < n) | R.(h)|<|L.| - 1. 15U (2) (i) R*(u,h) = (0, ...,0) and (ii) R (h) < and (iii) R"(u,h) > (0,...,0) and V (u + 1 < j < n) | R (h)|<|L.| - 1. That the theorem is valid is the direct consequence of Theorem 1, i.e. the first check of the theorem corresponds to the first condition of Theorem 1 and the second check corresponds to the second condition. For example the first condition of Theorem 1 says that if (id, (i(l..u-l), i'(u..n)), (i(l..n))) eDR holds f or i ' < i and (i'(u+l..n)) > (i(u+l..n)), then P cannot be computed in 1-p w.r.t. I , where u* id e 0UT(S((i(l..u-l), i'(u..n))) and id e IN(S( (i(l. .n) ) ) ) . Then id represents the element of A for which A h [(i(l..n))] = A°[(i(l..u-1), i'(u..n))] holds. Now this implies that F(h,j)(L.(i.)) = F(0,J)(L (i .)) J J J J for j < u and F(h,d)(L,(i,)) = F(0,j)(L.(i ')) for j > n. Hence L (i ) + w(h,j) = L (i ) + w(0,j) J J J J or R.(h) = for j < u, and J or L (i ) + w(h,j) = L.(i') + w(0,d) J J J J i.' = i. - R.(h) 3 J d 155 for J > u. Also (i'(u+l..n)) > (i(u+l..n)) with B = (|L(u+l..n)|) becomes V((i u+1 -R u+1 (h),...,i n -R n (h))|B) > V((i(u+l..n))|B). Then by Lemma 1, V((R u+1 (h),...,R n (h))|B) < V((0, ...,0)|b). Thus the first check of Theorem 2 is verified. The second check can be varified similarly. Now let us consider the number of checks required. For each A (h=l, 2, . ..,p) which appears on the right hand side of S, we first obtain a vector R(h). Then for each loop index I , we perform the two checks given by Theorem 2 for all R(h) (h=l, 2, . . .,p). Since there are n loop indicies, in total we perform 2np checks . 
The procedure described in this section can be extended to cover nonrestricted loops, too. Let S be of a form A°[F(0,l),...,F(0,n)] := f (aV(1, l), . . . ,F(l,n)] , . . . , A P [ (F(p, l), . . . ] ) and we define a vector R(h) for each h = 1,2, • ..,p as we did before, i.e. R(h) = (R 1 (h), ..., R n (h)) and R (h) = F(0,j) - F(h,j). J Since a loop is not restricted, F(0, j) and F(h, j) may take any form and hence R ^h) d 156 may not be a constant number but rather a function of loop indicies, e.g. R(h) = I, + 21, - 5. Hence, in the most general case, it is necessary to check the two conditions of Theorem 2 for all values of (i(l..n)) (i.e. (l(n)) < (i(l..n)) < ( lL-1, |L |, • .., |L | )) to examine type 1 parallel comput ability, i.e. n 2( it |L.|) checks are required for each R(h) (h=l, 2, . . .,p). In many cases, we d=i J can expect that the number of checks required is far smaller than that. For example if R(D = (21^21^,1^21^), then only 2( | L, |x|L, | ) checks are required, i.e. it is not necessary to check for those loop indicies, e.g. 1^, which do not appear in R.(l) (d"l, 2,3**0. 6.2.2.3 Temporary Locations In this section we mean a restricted loop by a loop. The second condition of Theorem 1 (or 2) may be dropped by introducing extra temporary locations by applying Transformation T of Chapter 5 on P , i.e. if CI = and and C2 / 0, then temporary locations may be set up so that P can be computed in parallel (for CI and C2, see Section 6.2.2.1). Let heC2. This implies that there are (i(l..n)), (i(l. .u-1), i T (u. .n) ) and id (see Figure 6.^) for which (id, (i(l..n)),(i(l..u-l),i'(u..n))) eLR holds and id = A h [(i(l..n))] € IN(S((i(l..n)))) and 157 id = A°[(i(l..u-l),i'(u..n))] e OUT(S( (i(l. .u-l), i ' (u. .n)) )) ) . If a loop is confuted in 1-p w.r.t. I , then we have E[I u ]((i(l..u-l),i'(u..n))) < E[I u ]((i(l..n))) while E ((i(l..u-l),i'(u..n))) > E ((i(l..n))). Hence A [(i(l..n))] will be updated "by S( (i(l. .u-l), i' (u. .n) ) ) before being used by S( (i(l. .n)) ). That is, if we compute P in 1-p w.r.t. I we must keep the old value of A [(i(l..n))] which otherwise will be updated by S( (i(l. .u-l), i'(u..n))) at the E[I ] ((i(l. .u-l), i* (u. .n) ) )-th step separately until it is used by S((i(l..n))) at the E[ I ] ( (i(l. .n)))-th step. The period of time, t , through which the old value of A [(i(l..n))] must be kept for the computation S((i(l..n))) is given by t h = E[I u ]((i(l..n))) - E[I u ]((i(l..u-l),i'(u..n))) = V((i(u+l..n))|B) - V((i'(u+l..n))|B), where B = |L(u+l. .n) | ). Then as we showed in Section 6.2.2.2, in case of a restricted loop, we can show that n t = V(R"(u,h) B) + Z B - 1 h , t s s=u+l n where B = ir I L , . I . The details are omitted. s ' t+1 ' t=s 158 T u+r • * * vl ' n\ i u i ' u •n)) •n)) (i'(u+l. /> y (i(u+l. o" A. Figure G.k. An Illustration of t Now max t gives the maximum period of time through which A [(i(l..n))] must be h€C2 kept. Since we have | L | of them (i.e. |L | statements are computed simultaneously), the total amount of temporary locations required will he |L | x u' max t, . Additional |L | locations are required for buffering (see Example 5)« u Hence we have the following theorem. Theorem 3: The maximum number of temporary storage locations required is L | x (max [V((R (h),...,R (h))|B) + Z B ] ) heC2 U+1 n s=u+l S where B = ( |L(u+l. .n) | ) and B = w |L , , | and B = 1. 
t=s 159 Example 3 : Let P 1 be for (I ) se£ (1,2, ...,U0) do for (I ) se£ (1,2, ...,1j-0) do A[I X ,I 2 ] := A[I 1+ 2,I 2 -3] + 2; P as it is cannot "be computed in 1-p w.r.t. I because it violates the second condition of Theorem 2. Now we modify P as follows by introducing temporary arrays Tl(UOxl) and T2 (1*0x3). for (I ) se£ (1,2,. ..,^0) do for (I ) se£ (1,2, ,..,kQ) do begin SI: T1[I 1 ] := Afl^Ig] ; S2: k[l v l 2 ] := T2[I ] _,I 2 mod 3] +2;^ S3: T2[I 1 ,I 2 mod 3] := Tl[ Ij ; end . Then all three statements can be computed in 1-p w.r.t. I , i.e. we can replace seq in the first for statement by sim . The original P , if executed sequentially, takes 1600 steps whereas the modified P takes only 120 steps if executed in parallel with respect to I . JL "a mod b = a Also we assume that T2 is properly initialized before the computation of the loop, i.e. store A[l,*], A[2,*] and A[3,*] in T2[l,*], T[2,*] and T[0, *]. i6o 6.2.3 Type 2 Parallelism In this section we mean a restricted loop by a loop. Since the conflict between two statements S(i) and S(j) due to the existence of an identifier id such that (id, i, j) eLR may be resolved by introducing temporary locations (see Chapter 5 and the previous section), such conflict will not be taken into account to check parallel computability throughout the rest of this chapter. This section describes the second type of parallelism, i.e. type 2 parallelism, in a for statement with a single body statement. Type 2 parallelism is introduced to resolve the conflict due to the first condition of Theorem 1. The following example illustrates it. Example k : P: for I := 1 step 1 until kO do for I := 1 step 1 until kO do A°[I 1 ,I 2 ] := a\ 1^1,12+1] + A 2 [I 1 ,I 2 -1]; Since R 1 (l) a 1 > and (Rp(l)) = (-1) < (0) hold, P cannot be computed in 1-p w.r.t. I . Now let us consider the I -I plane (Figure 6.5). Suppose that all S((i, ,i )) in the shaded area have been computed. Then at the next step those S((i * , i ' ) ) marked as HJ can be computed simultaneously, and at the following step all (2) can be computed simultaneously, and so forth. We can see that a heavy zigzag line travels from left to right like a "wave front" indicating that all statements on that front can be computed simultaneously. 161 Figure 6.5- Wave Front Note that computation of P by this scheme takes approximately 120 steps, while if P is computed sequentially it takes k-0 x h-0 = 1600 steps. Given a loop P , if P is computed in 1-p w.r.t. I , then a "wave front" is in parallel with the I axis of the I - I , x ••• X I plane, and ^ u u u+1 n * ' it travels in the increasing order of (i(u+l..n)). If P cannot be computed in 1-p w.r.t. I then it may be possible to find a "wave front" which is diagonal rather than horizontal as in Example k on the I - I - x ... x I plane, u u+1 n ^ 162 The direction of wave front travel tan a = slope of a wave front Figure 6.6. Wave Front Travel This wave front is such that all computations S((i(l..n))) which corresponds to points (i ,i ,,..., i ) which lie right next to a wave front can be computed simultaneously. In other words all necessary data to compute S( (i(l. .u-l), i(u..n))) have been already computed in the shaded area. The direction of a wave front's travel is perpendicular to the wave front. Now let us obtain the slope of a possible wave front for a restricted loop. Let P be a restricted loop. Assume that P cannot be computed in 1-p w.r.t. I . Then according to Theorem 2, this means that there are R(h) for which R (h) > and (R (h), . . 
,,R (h)) < (0, ...,0) hold (i.e. CI / 0). 163 Theorem k : The slope of a possible wave front in the I - I , x ••• x I plane is given by max he CI ^ ((-V((R u+1 (h),...,R n (h))| B ) - r B s +2)1 u s=u+l -1 n where B = ( lL(u+l. .n) | ) and B = 7r L, ,. and B = 1. s t+1 n t=s In Example k, the slope of the wave front is ■r(-("l"l + l)-l + 2) * 2. Proof of Theorem k : Let us consider S( (i(l. .u-l), l(u. .n) ) ) on the I - I x ... x I u u+l n plane. Assume that there is a variable id such that (Figure 6.7) (id, (i(l..u-l),i'(u..n)),(i(l..u-l),i(u..n))) eDR holds together with i ' < i and (i"(u+l..n)) > (i(u+l..n)), u u — i.e. id e IN(S((i(l..u-l),i(u..n)))) and id e OUT(S((i(l..u-l),i'(u..n)))). This implies that there is he CI such that A h [(i(l..u-l),i(u..n))] = A°[(i(l..u-l),i'(u..n))] holds. l£k (i(u+l..n)) (i'(u+l..n)) 1 1 T < "T" t / h // / 7\ a , \L. _x k R u (h) — ■* Figure 6.7. An Illustration for Theorem k In case of a restricted loop, we have i ' = 1 lUh) for j > u. Now let t, = V((i'(u+l..n))|B) - V((i(u+l..n))|B) then we get n t. = -V((R (h),...,R (h))|B) - Z B + 1 where B e n u+1 n . s s s=u+l n 7T |L, I and t=s t+1 B - 1. Now if we let the slope of a wave front be equal to 165 t + l t, + 1 h h i - i ' " R (h) u u u then A [ (i(l. .n))] and A [ (i(l. .U-l), i' (u. .n))] will be separated by it (Figure 6.7). The actual wave front is a zigzag line, rather than a straight line as shown in Figure 6.7- If there are more than one h in CI, then we choose a to be large enough so that all inputs to S( (i(l. .u-1), i(u. .n)) ) be inside of a wave front, i.e. t + 1 tan 3 = max ^-^p. h£Cl u (Q.E.D.) Now suppose we compute F in parallel w.r.t. I using a diagonal wave front whose slope is D = tan q> Then how many steps (one step corresponds to the computation of a body statement S) does it take to compute P ? Theorem c / \ The total number of steps required to compute P in parallel w.r.t. I using a diagonal wave front whose slope is D is given by /u-1 \ i n |L |D^ 1 u ' i T - P ' TT |L.| '.j=u+l J + 166 Proof: Let us consider the I - I ,, x ... x I plane. u u+i n end Figure 6.8. An Execution by a Wave Front Wave front W must travel from the start position to the end position on the n plane. How long does it take? It takes L + |L |d steps where L = it |L. U j=u+l 3 u-1 Since we have to process 7rlL.ll - I ,-■ x ••• X I planes, in total it ^ ._, ' j ' u u+1 n * ' u-1 becomes T = tt |L . I (L + |L |D). (Q.E.D.) Note that if a wave front is horizontal (i.e. if P can be computed in 1-p w.r.t. I ), then D = and T = tt IL.I. 167 6.2.4 Conclusion Assume that there are an arbitrary number of PE's available. Given a restricted single body statement loop, P 1 = (l r ...,I n )(A rF°,...,^] := f(A 1 [F^,...,Fj],...,A P [FP,... > ^])) we can check if F can be computed in 1-p w.r.t. I (u=l, ...,n) by Theorem 2. If it cannot be, then we can check for type 2 parallel computability w.r.t. I , i.e. find a possible wave front. In either case we obtain the number of u' n u-1 n computational steps required, i.e. T = ir |L.| or T = ( tt |L.|)( ir |L.| + U j=l 3 u j=l 3 j=u+l J lL I'D), where one step corresponds to the computation of the body statement S. Then among all possible choices, we would choose to compute in parallel w.r.t. I where T = min T . - 6.3 A Loop With Many Body Statements 6.3-1 Introduction In what follows we mean a restricted loop by a loop. 
Again a check against all published Algorithms in 1968 and 1969 CACM issues has been done, and it has been revealed that well over 50 percent of the cases of for statement usage (with more than two-body statements) are instances of restricted loops. Also as stated in Section 6.2.2.3 and Chapter 5, the LR relation may be disregarded by introducing temporary locations. Hence it will not be taken into account throughout the rest of this chapter. 168 Given a loop with m body statements, P , there are three different approaches to compute it in parallel. First it is possible to extend the procedure described in Section 6.2 by treating m body statements as if they were one statement. That is, we consider body statements as a function OUT(S(i(l..n))) = f(lN(S((i(l..n)))). For example SI: A[I,J] := f(A[I,J-l],B[I-l,J-l]); S2: B[I,J] := g(A[I-l,J-l],B[I,J-l]) yield S: {A[I,J],B[I,J]} := f(A[I,J-l],A[I-l,J-l],B[I-l,J-l],B[I,J-l]). Then we can apply Theorem 1 directly to check if e.g. S can be computed in 1-p w.r.t. I. The second and the third approaches can be illustrated by the follow- ing two examples. El: for I := 1 step 1 until kO do begin SI: A[I] := f(A[I],B[I]); S2: B[I] := g(A[ I],B[ 1-1] ); end ; E2: for I := 1 step 1 until i+0 do begin SI: A[I] := f(A[I-l],B[I-2]); S2: B[I] := g(A[I]); end. 169 In El, note that SI and S2 cannot be computed in parallel for all values of I because S2 has an iteration form B[I] := g'(B[I-l]). However, El may be replaced with two for statements : for I := 1 step 1 until kO do SI: A[I] := f(A[I],B[I]); for I := 1 step 1 until kO do S2: B[I] := g(A[ I], B[ 1-1] ) ; Now the first loop can be computed in parallel for all values of I while the second for statement is still an iteration. In general by replacing a single or statement with two or more for statements the parallel part may be exposed. In the second example, SI and S2 can not be computed in parallel for all values of I, nor can they be separated into two independent loops because SI uses values which are updated by S2 (i.e. B[I-2]), and S2 uses values being updated by SI (i.e. A[I]). However SI and S2 could be computed simultaneously while I varies sequentially if the index expression in S2 is A "skewed" as follows. E2' for I := 1 step 1 until kO do begin SI: Ari] := f(A[I-l],B[I-2]); S2»: B[I-1] := g(A[I-l])j end. JL Strictly speaking S2' should not be executed when 1=1 and an extra statement S2" : Bl^O] := g( k[kO]) is required after this loop. For the sake of brevity those minor boundary effects are ignored through- out this section. 170 Figure 6.9 illustrates the computation of the modified loop as well as the original loop. I SI S2 • * i-2 B[l-2] 1-1 A[ 1-1] 1 / / i 1 / A[i]- -»B[i] i-2 1-1 SI A[i-1] ^ A[i] S2' ^ B[i-2] B[i-1] Figure 6.9. Simultaneous Execution of Body Statements In general, the above three approaches could be tried in any combination. For example, we may first try the first approach, i.e. we try to execute body statements simultaneously for all values of some loop index. If this Tails, then we may use the second approach, i.e. we separate a loop or we replace a loop with as many for statements as possible. On a resultant for statement we again try the first approach (if it has only one body statement, then the results of the previous section can be used). If we fail again, then the third approach can be taken. We now describe each approach separately. Before we go further, we define the following notations. 
Without loss of generality we assume that the p-th occurrence of A, appears in S and k p also assume that S and S have forms P q 171 and S * (AP[F(k,p,l),...,F(k,p,n)] := f _.(...)) S = (-.. := f(...,A*[F(k,q,l),...,F(k,q,n)], ...))• q q Then we define a vector R(k,p,q) as follows R(k,p,q) = (R 1 (k,p,q),...,R n (k,p,q)) where R,(k,p,q) - F(k,p,i) - F(k,q,j). J - w(k,p,j) - w(k,q,j). If F(k,p, c i) = F(k, q, j) = 0, then we let R.(k,p, q) = 0. Finally we write J R'(u,k,p,q) = (R x (k,p,q), ...,R u _ 1 (k,p,q)) and R"(u,k,p,q) = (R u (k,p,q),...,R n (k,p,q)). 6.3*2 Parallel Computation with Respect to a Loop Index We first study the first approach described in the previous section, i.e. we treat body statements as if they were one statement and try to execute them in parallel with respect to some loop index. Let us consider P = (I. , I_, .... I ) (S., ; . . . :S ) . Then we treat m body 12 ' n 1 m J statements as one statement S where m 0UT(S((i 1 ,i 2 ,...,i n ))) = U OUT(S (d r i 2 , .••,!))) P=l and 172 m IN(S((i r i 2 ,...,i n ))) = IN(S 1 ((i 1 ,i 2 ,...,i n ))) U U [IN(S ((i^ig, p-1 ...,i n ))) - U OUTCS^dpig,...,^)))]. Having these two sets, we can use results of Section 6.3 directly. For example let us consider Theorem 2. Then we may modify Theorem 2 as follows. First suppose an array A£ appears in OUT(S( (i(l. .n) ) ) ) and A^ appears in IN(3((i(l..n)))). Then obtain R(k,p,q). Theorem 6 ; (cf. Theorem 2) For each A^ in 0UT(S( (i(l. .n) ) ) ), we obtain R(k,p, q) for all q such that A^ is in IN(S( (i(l. .n) ) ) ). Then if there is any R(k,p, q) which satisfies all three conditions described below, then S cannot be computed in type 1 parallel w.r.t. I . Conditions: (1) R.(k,p,q) = or for all j = 1,2, ...,u-l, J (2) R u (k,p,q) > 0, and (3) there is £(u+l<£ (F(k,h q ,u),...,F(k,h q ,n)) must hold- In the second case (F(k,h P ,u),...,F(k,h P ,n)) = (F(k,h q ,u),...,F(k,h q ,n)) must hold. These two make up the second condition. 177 From we can construct a dependence graph D with m nodes each of which represents a body statement, e.g. from Example 6 we get: In D u we call a series of © u , ^(p^Pg), © u (p 2 ,P 5 ), . . .^(p^p^), . . ., © (p k TtV^) a chain and write ch(p., p, ) for it. If p, = p.. then it is called a mesh M. We say that anode p. is in the chain ch(p ,p ) (or in the mesh M), or the chain ch(p, ,p, ) (or the mesh M) includes p.. Note that for nodes p and q there may be more than one chain which connects p to q. Now let Z = (p | there is no q such that 6 (q, p) holds) and Z = (p | there is no q such that 6 (p, q) holds}. Furthermore let PD(p) = {q | ch(q,p) exists} U (p} and SC(p) - (q | ch(p,q) exists} U {p} • (PD for predecessors and SC for successors). Then we classify nodes in D as _ u follows : Z 3_ = (P | For all r e PD(p), there is no mesh in D which includes r} , Z = {p I For all r e SC(p), there is no mesh in D which includes r) , j u and 178 Z 2 = N - Z 1 - Z 3 - Let Z 1 (or Z ) = (p^Pg, . ..,p } • Then we can order this set as p ' p ',..., p ' M in such a way that 6 (p.', p.') does not hold if i > j. Let us write 9 9 Z.(or Z,) for a resultant ordered set. Also we order Z. = {q.,q_, . . .,a } as Q L 2 = ^i''^'* '••> C V ) in such a way that ^i' < q i' if i < J- Now given a loop t where F = (I^,Ig, . . ., I n ) (S^jSgj . . . ;S ; ■ (i r ---,i u . 1 )UV.-.,i n )(s 1 ;S 2 ;...;s m )) = (i n ,...,i J?*, 1 u-l u we build the dependence graph D and obtain sets Z , Z and Z,, say Z. 
= {P 1 *P 2 >'"*P U ^ z 2 = ( r 2> "^ r w^ " ( m=u+v+w )- 6 6 6 From Z and Z, we obtain ordered sets Z and Z,, say Z = (p ',p p ', . . . ,p ') 6 6 and Z^ = (r 1 l ,P 2 , ,...,P w ')- A1 so we have Z g = (q^ , q^*, . . ., q^' ) . Then M p^~ (i)(s ,);(i)(s ,);...;I(S ,);(i)(s ,;...;S ,); ^1 y 2 *u 4 1 ^v (l)(S r ,);(I)(S ,);...j(l)(S ,) 12 w where (i) = (I ,1 .,...,1 ). u' u+1 n Note that Z (or Z,) together with 9 makes a graph which does not contain any mesh. To order Z (or Z,) the technique discussed in Chapter 7 niay be used. 179 Thus we have replaced a loop P with as many for statements as possible. We say that p is separable from !F if peZ (or peZ_) with respect to u, and that F is separable with respect to I . Also we say that p is separated with respect to I if P°J is replaced by many for statements as we showed above. 6.3-3.3 Temporary Storage Now let us study the following : 2 P : for I := 1 step 1 until 1 do for I : = 1 step 1 until kO do begin SI: A^y := Ag[I 2 ] + A^Ig]; S2: A k [I 2 ] := A.J Ij +A^[Ig]; end Then we have R(l,l,2) = (0,0) or © (1,2) holds and: V © 0] + AJlfO] . However the second loop, (I ,1 p)(S2), requires forty different inputs, i.e. Ap[l] + A,[l], . ,,A,JkO] + A,[^0]. Hence 2 it becomes necessary to modify P ' as follows: P l 2; (\^ 2 ^ S1: ^l'V := A 2 [I 2 ] + A 3 [I 2 ] ) ; (I 1 ,I 2 )(S2: AJI 2 ] := AJI^y ♦ A^]). >A[1] )A[1] Figure 6.11. An Introduction of Temporary Locations In general we apply the following transformation rule on a loop when it is separated. Assume that S and S are body statements in a loop F, and 6 (p, q) holds. Further assume that p is separated from TT with respect to 181 I (i.e. peZ.,qeZ ). Now let us consider the vector R"(u, k,p, q) . Let the value of the first element which is neither nor be e and its position he i. Then we let R(u,k,p,q) = (J | u < j < i and R,(k,p, q) = 0} . We order elements of R(u,k,p, q) by their positions in R"(u,k,p, q) and write R(u,k,p,q) = (r(l),r(2),...,r(t)). Then we apply the following on the loop F . Transformation T p : Transformation T is defined for the cases e < and e > separately. (1) e > 0. Change F(k,p,r(j)) and F(k,q,r(j)) to I y /j\ for j = 1,2, . ..,t. (2) e < 0. (i) Change F(k,p,r(j)) to 1,/j) for j = 1, ...,t. (ii) Change F(k, q,r(j)) to the following ALGOL program for j - l,...,t. -If (I r( . +1) = 1) and (l r( . +2) = 1) and ... (I r(t) = 1) then (if i r( . } = 1 then |L r(j) | else I r(j) - 1) else I r(i)'" Also chan g e F(k,h q , r(t) ) to the following ALGOL program: "if (I , , = l) then |L ,, J else I ,. % - 1." — r(t; ' r(t) 1 r(t) Example 7 • Let R"(5,k,P,q) = (R 5 (k,p,q),...,R 9 (k,p,q)) = (0,0,0,0,-1). Then we get e = -1 and l = 9 and R(5,k,p,q) = (6,8). Also assume that \lA = 182 |Lq| = 3« Originally S and S may look like y WyW :=f p ( -- ); S q : .. := f (...^[I^I^^yi],...); Now after Transformation T^ is applied, S and S become: y : w^'VW :=f ( -- ); S q' : " := t q^'" fA T/S I l' I 5 > B 6>I 7 >Bg,I 9 ],...), where Bg = if Ig = 1 then (if Ig = 1 then 3 else Ig-l) else I, and Bq = if In = 1 then 3 else I« - 1. Note that by applying Transformation T ? , temporary locations are eventually introduced. For example in Example 7, A, is changed to a seven- dimensional array from a four dimensional array by Transformation T . 6.3.k Parallelism Between Body Statements 6-3- i +-l Introduction Now we describe the parallelism between body statements. As stated before it becomes necessary to modify index expressions. 
In this section we give an algorithm which modifies index expressions properly. We first describe the algorithm in terms of a restricted loop with only one loop index, i.e. F^ = (i ) (S ;S ; . . . ;S ). Accordingly every array identifier in P™ is of a form A.[F(k,h, 1)] (= A. [I,+w(k,h, l)] ) where this is the h-th occurrence of A in F^. For convenience we drop the subscript of 183 loop index. The primitive execution order for p becomes E Q (P rn ) = {((i,p),V((i,p)|(|L|,m))|(l,l) < (i,p) < (|L|,m)}. For a given loop p , we consider the I-S plane which is an L by m grid. For example we have the following 1+0x3 grid for: for I := 1 step 1 until kO do begin 81: A^I-1] := A 2 [I] + A^I-l]; S2: A^I] := A^I+1] - A^I]; S3: A 3 [I] := A^I] + Ag[I]; end. On this grid, we only show the relation DR, e.g. (A,[i-1], (i-l,3)> (i>l)) e DR, SI S2 S3 # • i-1 i V .St. -+ i+1 • The direction of wave front travel Figure 6.12. Wave Front for Simultaneous Execution of Body Statements 184 Then the objective of this section is to discover a wave front W (cf. Section 6.2.4) which separates all inputs from the computation, e.g. in Figure 6.12 inputs (shown by 0) to S( (i,l)), S( (i+1,2)) and S((i,3)) (shown by t) lie above the wave front indicated by a dotted line. Hence S((i, l)),S( (i+1,2)) and S((i,3)) can be computed simultaneously while I takes values 1,2, ...,40 sequentially. In general to discover a wave front is equivalent to discovering a constant C(p) for each body statement S so that all statements S((i-C(l),l) ), .. .,S((i-C (p),p )),..., S((i-C(m),m)) can be computed simultaneously. 6.3.4.2 The Statement Dependence Graph and the Algorithm Let us consider the I-S plane again and consider the computation S((i,p))- Assume that there is id such that (id,(j,q),(i,p))eDR where either (i) j = i and q < p, or (ii) j < i and p / q, then clearly S((i,p)) and S((j,q)) cannot be computed simultaneously. Definition 3 ' The statement dependence graph (cf . the dependence graph in Section 6.3-3), D(p ), is defined by a set N of nodes 1, 2, . ..,m each of which corresponds to a body statement of P and the arrow relation a. From node p to q there is an arrow a(p, q) if and only if either one of the following two conditions hold. (l) For fixed i, there is k such that AJJ[F(k,h,l)(i)] € OUT(S((i,p))), A*[F(k,g,l)(i)] e IN(S((i,q))), 185 F(k,h,l)(i) = F(k,g,l)(i) and p < q. (2) For fixed i, there exist k and i' such that A£[F(k,h,l)(i')l € OUT(S((i',p))), A*[F(k,g,l)(i)] € IN(S((i,q))), F(k,h,l)(i«) = F(k,g,l)(i) and i" < i. In the first case we label the arrow and write f(p, q) = 0. In the second case the arrow is labeled 1 and we write f(p, q) = 1. The statement dependence graph for the previous example is: © © — °-*6) A chain of arrows, a(p r p 2 ), a(p 2 ,p^), . . ., a(p k _ 1 ,p k ),a(p k ,p 1 ) in D(P ) is called a mesh M and we say e.g. a(p.,p. ) is in M. If i(p.,p. ) = for some arrow in M, then M is called a part zero mesh . The following lemma is obtained immediately. Lemma 2 : If D(P ) contains a part zero mesh, then there is no wave front for P*. Henceforth we assume that D(F ) has no part zero mesh. Given D(FJ, we define a subset Z of N as follows: Z = (p there is no q such that i(p, q) = or f(q,p) = 0} . Z together with arrows gives a subgraph D g of D(F m ). Further we let 186 Zy. = (P peZ and there is no q such that i(q, p) = 0} . Now we give an algorithm to find a wave front for D(P m ). Algorithm 1 ; (1) Let C(p) = + oo for all p e N. (2) (i) Take any p from ZL. If Z. = 0, then go to Step (5). 
(ii) Let C(p) = 0. (3) (i) If there are nodes s and t such that a(s,t) exists, £(s,t) = 1 and C's) > C(t), then we let the value of C(s) be equal to C(t). (ii) If there are nodes s and t such that a(s,t) exists, f(s,t) = and C(s) > C(t), then let the value of C(s) he equal to C(t) - 1. Repeat (i) and (ii) until there are no s and t which satisfy (i) or (ii) in D(F m ). (h) (i) If for all p in Z C(p) ^ + », then go to Step (5). 'Otherwise take any p from Z for which there is q in Z such that a(p,q) exists and C(q) / + »• Let M = max (C(s)] where s 6 Z and C(s) ^ + oo. Then let C(p) = max (c(q) + 1,M} . Go to Step (3). For all p in Z with C(p) = + oo, let C(p) = M where M = max (C(s)} and seZ c(s) ^ + co. If Z ■ $, then let C(p) = for all p in Z. Example ( (1) D: 187 (2) Z = fl,2,3,U,6,7,8l, KD- J ^© ©-^^(D (3) Let C(l) = and apply Step (3) of Algorithm 1. Then we get C(l) = since there is no q such that a(q, l) exists. (k) Let C(2) = 1. (5) Let C(3) = 1. (6) Let C(*0 = 2. Then we apply Step (3) of Algorithm 1 on a(S,l+), a(8, 5), a(5,2),a(2,l),a(7,8),a(6,7),a(9,6),a(3,9),a(3,l) and a(9,l)« And we get C(5) = C(8) = 2. C(2) = 1. C(7) = 1, C(6) = 0, C(9) = 0, C(3) = 0, and C(l) = -1. (7) There is no p in Z with C(p) = + ». Hence Algorithm 1 terminates and we get C(l) = -1, C(2) = 1, C(3) = 0, COO = 2, C(5) = 2, C(6) = 0, C(7) = 1, C(8) = 2 and c(9) = 0. I^ 1 2 5 h 5 6 7 8 9 f\„. /> y- •N i-2 o- .• V >^ o- >• i-1 9- *• % O- »• o\ V> S^i / l (J s *w • ^ / i+1 Figure 6.13- A Wave Front for Example 10 188 Now we show that Algorithm 1 gives a valid wave front. To prove this first we show that Algorithm 1 is effective, i.e. every step of Algorithm 1 is always applicable and terminates. Lemma ~> : Algorithm 1 is effective. Proof : That Step (l),(2),(k) and (5) are effective is clear. Now we show that Step (3) is effective. First we define U(p) to be a set of nodes such that U(p) = (q I there is a chain of arrows a(p, ,p ),a(p ,p_), . . ., a(p n _ x ,P n ) exist where j> ± = q and p n = p) U {p}, e.g. 3 6 © •Ch KpyA^ »k: D(P 8 ): and U(5) = (1>2, h, 5>7>9) . U(p) = U(q) implies that there is a mesh which includes p and q. By assumption this mesh is not a part zero mesh and c(q) will be assigned the same value as C(p) in a finite number of steps after c(p) has been assigned a value. If U(p) 3 U(q), then c(q) will be assigned a value less than or equal to C(p) in a finite number of steps after C(p) has been assigned a value. Thus after a finite number of applications Step (3) eventually terminates. Hence Algorithm 1 is effective. (Q.E.D.) 189 Theorem 6 : Algorithm 1 gives a valid wave front . ■ Proof : To prove this, it is enough to show that' (i) if f(p, q) = 0, then C(p) < C(q) and (ii) if i(p,q) = 1, then C(p) < C(q). However, from Steps (3) and (k) of Algorithm 1, clearly the above conditions hold. Also if p is assigned a value C(p) by Step (5) it implies that either (i) there is r e U(q) where q e Z, and f(r, q) = 1 or (ii) there is no such r. In the second case C(p) may take any value (S may be computed at any time), and in the first case C(q) > C(r) must hold. Hence we let C(q) = max{C(s)} . (Q.E.D.) To handle a restricted loop with more than one loop indicies, we modify Definition 3 as follows. For each S and S in P we first obtain a vector R(k,p, q). Definition 3' '> The statement dependence graph of t f D(f ), is defined by a set N of nodes 1, 2, ...,m each of which corresponds to a body statement of P^ and the arrow relation a. 
There is an arrow a(p, q) if and only if either one of the following s holds : (1) R.(k,p, q) -- or for all j = 1, 2, ..,n and p < q. We let i(p, q) = 0. (2) V((R 1 (k,p,q),...,R n (k,p,q)|B) > V( (0, . . . , 0) | B) where B = ( |Lj, . . ., |Lj ). We let i(p, q) = 1. 190 From Definition 3' clearly (1) !(p,q) = if S ((i-^ig, ...,1^)) US6S the out P ut of S p(( i 1 ; i 2>"-.» i )) and p < q. (2) f(p,q) = 1 if S ((i^ig, •••»i n )) uses the output of S ((^^ig',...,! •)) where (i^ig', . . .,1^ ) < (i^ig', . . ,,i n ). Algorithm 1 is then applied on D(Fj. For example let P^ be SI: A^Ig^g] := A 2 [I 2 ] + A 3 [I 2 -1,I 3+ 1]; S2: Ag[I 2 +l] := A^I^Ig] + k^I^I^i S3: A 3 [I 2 ,I 3 ] := Ag[Ig-l]; Then we have jn 6.3*5 Discussion Given a loop P = (I , . . ., I ) (S, ; . . . ;£L ), we first try to execute body statements in parallel with respect to some loop index. If this fails for any loop index, or if this does not give a satisfactory result, then we try to replace the loop with many for statements. Then we can attempt to execute a body statement (or body statements) of a resultant for statement in parallel with respect to some loop index. If this fails, then we may try the third 191 approach, i.e. we try to execute all body statements simultaneously while loop indices vary sequentially. Often the number of loop indiciea, n, is very small (typically n = 2), and it will be easy to try all variations. 192 7- EQUALLY WEIGHTED— TWO PROCESSOR SCHEDULING PROBLEM T'l Introduction This chapter gives a solution to the so-called equally weighted- two processor scheduling problem. Informally the problem may be stated as follows. Given a set of tasks along with a set of operational precedence relationships that exist between certain of these tasks, and given two identical processors (PE),P(2), how does one schedule these tasks on the two processors so that they execute in the minimum time? It is assumed that either one of two processors is capable of processing any task in the same amount of time, say 1 unit of time. Informally a set of tasks together with procedence relations forms a graph. Clearly the problem of scheduling any given equally weighted task graph on k identical processors, P(k), in an optimal way is effectively solvable by exhaustion. But this is far from possible in practice. The only practical solution so far obtained is a result for scheduling a rooted tree (a restricted class of graphs) with equally weighted tasks on k identical processors, P(k) [21]. Now let us study how the equally weighted— two processor scheduling problem is related to the computation of arithmetic expressions on a parallel machine. In Chapter 3, the parallel computation of an arithmetic expression by building a syntactic tree was studied. There we were only concerned with the height of a tree and reducing it by distribution, and we did not introduce any 193 physical restrictions of a machine. For example, in reality, the size of a machine, i.e. the number of PE's is limited rather than arbitrarily big. One problem which will arise immediately is whether the distribution algorithm should be applied or not to reduce tree height since distribution introduces additional operations. For example assume that we have a two PE machine, P(2). Now let us consider two arithmetic expressions, A = a(bc+d) + e and B = abc(defgh+i). Then we have h[A] = k, h[B] = 5, h[A ] = h[abc+ad+e] = 3 and h[B ] = h[abcdefgh+abci] = k. Thus distribution reduces the height of T[A] and T[3]. 
On the other hand Figure 7-1 shows that if A, B,A and B are computed on P(2), A is still computed in less time than A while B now takes more time than B. Assume that we get A from A by the distribution algorithm. If the size of a machine is limited, then it may not necessarily be true that A can be computed in less time than A even if h[A ] < h[A] holds. Actually it is a nontrivial problem to decide whether distribution is to be made or not to reduce computation time (which is different from tree height) if the size of a machine is limited. It depends on the form of an arithmetic expression as well as the machine organization. We will not go into this problem any further. Now let us look at the situation from a different point of view. Given an arithmetic expression A and its minimum height tree, it is possible to take advantage of common expressions to reduce the number of operations to be performed in hopes of reducing computation time. For example let us consider the computation of A = (a+b+c+d)ef + (a+b)g on P(2). If we evaluate 19^ (a+b) only once then A can be computed in k steps on P(2) while if (a+b) is evaluated twice, then it takes 5 steps to compute A (see Figure 7*2). e b c cab a d e (a) A (k steps) (b) A (3 steps) i h d e f b c c 1 (c) B (5 steps) (d) B (6 steps) Figure 7-1. Computation of Nondistributed and Distributed Arithmetic Expressions on P(2) Our main concern in Chapter 2 was to reduce tree height assuming that the size of a machine is unlimited. Hence we were not interested in reducing the number of operations. As mentioned there, it was an open problem to find out common expressions while keeping the height of a tree minimum. However, if we could take advantage of common expressions while 195 1* 3 2 1 level (a) A Minimum Height Tree for A 5 k 3 2 1 Step (b) (a+b) computed twice (c) (a+b) computed once Figure T«2. Common Expression keeping the height of a tree minimum, then we would obtain a graph of operations rather than a tree for an arithmetic expression (see Figure 7-2. (c)). While we do not know how to compute an arithmetic expression A on P(2) in the minimum time (e.g. should distribution be done?), the scheduling algorithm presented in this chapter schedules a given graph of operations for an arithmetic expression on P(2) so that the given graph is processed in the 196 minimum time, assuming that each FE of P(2) may perform addition or multipli- cation independently but in the same amount of time, say 1 unit of time. Note that we may be able to construct many graphs for A. Hence while the scheduling algorithm schedules a given graph for A on P(2) in an optimal way, the algorithm does not necessarily compute A itself in the minimum amount of time. 7-2 Job Graph Let G be an acyclic graph with nodes N. (i=l,2, . . .,n) and a set of directed arrows connecting pairs of nodes. For nodes N and N' we write N -* N' if there is an arrow from N to N' . We say that N is an immediate predecessor of N' and N' is an immediate successor of N. Also we let SR(N) = {N 1 |N -* N'l (a set of successors of N) and PR(N) = {N' |N* - Nl (a set of predecessors of N). Nodes which have no incoming arrows are called initial nodes , and nodes which have no outgoing arrows are called terminal nodes . For the sake of simplicity we assume that a graph has one initial node and one terminal node. If there are more than two, then we can add a dummy initial/ terminal node. We write N and N for them, respectively. We also write N =5> N' if there is a chain N, ,N_, . . . ,N such that N -»1 ■♦ ... 
-»M -> N' , or N -* N' . 12m 1 m Furthermore we write N / N' or N ^> N' to show that the relation N -» N' or N» I' does not hold. Definition 1 : The forward distance (or level ) from the initial node to a node N, d (N), is the length of the longest path from the initial node to N, thus 197 d_(N T ) = 0. Similarly the backward distance from the terminal node to N, d (n)j is defined, thus •!. (lO = 0. Thus a node N cannot be initiated before time cL(n) but may be initiated at cL.(N) or at any time after that. Definition 2 ; The height of a graph G, h(G), is defined as h(G) = d^V- Then we say that a graph G is tight if for all nodes N, cL^N) + d T (N) = h(G). Otherwise we say that a graph G is loose. Example 1 ; Figure T>3- A Loose Graph and a Tight Graph Th e graph G, is a loose graph because d (N ) + d_(N p ) = 2 ^ h(G. ) whereas the graph G p is a tight graph. 198 First we shall study an optimum scheduling for a tight graph. A scheduling for a loose graph will be discussed in a latter section. In what follows, we use words "process" and "schedule" interchangably. Definition 3 : Let A(i) be the set of all nodes of forward distance i, i.e. A(i) = (Nld^N) = il. Tli is is called a node set . All nodes in A(i) can be scheduled independently of each other since there can be no precendence relations between nodes in A(i). In other words, if N => N 1 then N and N' cannot be processed simultaneously. Now we have the following lemma which characterizes a tight graph. Lemma 1 : If a graph G is tight, then for every node N, there exists N'e SR(N) and N"e PR(n) such that d (N 1 ) = d-^N) + 1 and d^N") = ci^(N) - 1. (For the terminal node SR(N ) = and for the initial node PR(N ) = 0. Those are exceptions. ) Proof : Obvious by Definition 2. (Q.E.D.) Corollary 1 ; Let G be a tight graph. Let N be a node of G. Let Ne A(t). Then for every i, < i < t - 1, there is at least one node N'e A(i) such that N' => N. Also for every j, t + 1 < j < h(G), there is at least one node N"e A(j) such that N => N" . 199 Definition k : To p-schedule a set Q of nodes is to partition Q, into subsets of size 2 in an arbitrary way (if |q| is odd, there will be one subset of size l) and to order those subsets in an arbitrary way. A node N is said to be available if all predecessors of N have been processed. 7.3 Scheduling of a Tight Graph Having these definitions, now we discuss a scheduling of a tight graph G on two processors. The idea of this scheduling scheme is rather simple. We start checking |A(i)| from i = 1 to h(G). For each i, if |A(i)| is even, then we p-schedule A(i), and no processor time will be wasted. If | A ( ± ) | is odd, and if we simply p-schedule A(i), then there will be one node, N, left which cannot be processed in parallel with another node in A(i). Thus we will waste processor time. Therefore a node which can be processed in parallel with the above left out node N must be found. Where can that node be found? It will be shown that we have to look as far as the smallest i' larger than i with |A(i')| = odd to find it. Thus the amount of work to look ahead is always bounded. Before we go further, a few more definitions are in order. t n For some t and n, let us consider a set A = U A(t+i). Now let us n i=0 take a node from each of A(t+,j) and A(t+i) (j < i). Let them be N J and N 1 . If N £> IT", then N and N may be processed simultaneously, providing that all predecessors of N and N" have been processed. Now we establish this relati on formally on A . 200 Definition 5 : ,t . 
Definition 5: The p-line relation between two nodes N and N' in A_n^t is defined as follows: N and N' are p-line related if and only if (1) N ⇏ N' and N' ⇏ N, and (2) d_I(N) ≠ d_I(N'). We write (N, N') for the relation; a pair (N, N') is called a p-line pair. Note that in general (N, N') and (N', N'') do not necessarily imply (N, N'').

Further we define A_n^t(p) = {(N, N') | N ∈ A(t+i), N' ∈ A(t+j), 0 ≤ i, j ≤ n, and the p-line relation holds between N and N'}; i.e., A_n^t(p) is the set of all pairs of nodes in A_n^t between which the p-line relation holds. Since (N, N') ∈ A_n^t(p) implies that (N', N) ∈ A_n^t(p), we shall in general put only one of them in A_n^t(p) and drop the other. An algorithm to find A_n^t(p) is given in Section 7.5.

Definition 6: A p-line set L_n^t on A_n^t is an ordered set of p-line pairs

    L_n^t = ((N_0, N_1'), (N_1, N_2'), ..., (N_k, N_{k+1}'))

where (1) N_0 ∈ A(t) and N_{k+1}' ∈ A(t+n), and (2) for all g (1 ≤ g ≤ k), d_I(N_g') = d_I(N_g) but N_g' ≠ N_g.

We say that A_n^t is p-connectable if there is a p-line set L_n^t on A_n^t. We write L_n^t(N_0, N_{k+1}') when the first and last nodes in L_n^t are of special interest. Further, we write L_n^t(N_0, N_{k+1}') = L_i^t(N_0, N_g') ∪ L_{n-i}^{t+i}(N_g, N_{k+1}') if (N_{g-1}, N_g') and (N_g, N_{g+1}') are two adjacent elements of L_n^t(N_0, N_{k+1}') and d_I(N_g') = d_I(N_g) = t + i. An algorithm to build a p-line set for A_n^t is given in Section 7.5.

Example 2:

[Figure 7.4. A Graph G: A(1) = {b, c, d}, A(2) = {e, f}, A(3) = {g, h, i}]

A_2^1 = A(1) ∪ A(2) ∪ A(3) = {b, c, d, e, f, g, h, i}, and A_2^1(p) = {(b, f), (d, e), (f, h), (e, i)}. A typical L_2^1 is ((b, f), (e, i)). Hence A_2^1 is p-connectable.

Further we define a special p-line set called a p-line (1) set.

Definition 7: Given a set A_n^t, we call a p-line set L_n^t = ((N_0, N_1'), (N_1, N_2'), ..., (N_k, N_{k+1}')) a p-line (1) set (written L_n^t(1)) if d_I(N_i) + 1 = d_I(N_{i+1}') for all i (0 ≤ i ≤ k). Note that in this case k = n - 1. Also we write L_n^t(1)(N_0, N_{k+1}') when the first and last nodes are of particular interest.

Now a few lemmas are in order.

Lemma 2: Suppose N ∈ A(t) and N' ∈ A(t+n) for some t and n in a tight graph G, and assume that (N, N') holds. Then there is a p-line (1) set L_n^t(1)(N, N') = ((N_0, N_1'), (N_1, N_2'), ..., (N_{n-1}, N_n')) where N_0 = N and N_n' = N'.

Proof: The proof is by induction on n. First note that |A(t+i)| ≥ 2 for all i, 1 ≤ i ≤ n - 1; otherwise N ⇒ N' would hold and (N, N') would not.

(1) Let n = 2. By Lemma 1 there are nodes N_1, N_2 ∈ A(t+1) with N → N_1 and N_2 → N'; otherwise the graph G would not be tight. If N_1 = N_2, then N ⇒ N', contradicting the assumption; hence N_1 ≠ N_2. Also (N, N_2) holds, since N ⇒ N_2 would give N ⇒ N', and similarly (N_1, N') holds. Thus L_2^t(1) = ((N, N_2), (N_1, N')).

(2) Now assume that the lemma holds for n ≤ i, and let n = i + 1, with N ∈ A(t), N' ∈ A(t+i+1) and (N, N'). By Lemma 1 and Corollary 1 there are two distinct nodes N_1, N_2 ∈ A(t+i) such that (N, N_1) and N ⇒ N_2: take N_2 with N ⇒ N_2 and N_1 with N_1 ⇒ N'; then N ⇏ N_1, since otherwise N ⇒ N', so (N, N_1) holds and N_1 ≠ N_2. Also (N_2, N') holds, since N_2 ⇒ N' would again give N ⇒ N'. By the induction hypothesis there is an L_i^t(1)(N, N_1), and appending the pair (N_2, N') to it (the joint nodes N_1 ≠ N_2 lie at the same level t + i) gives a p-line (1) set L_{i+1}^t(1)(N, N'). (Q.E.D.)
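The thesis builds p-line sets via the path-finding formulation of Section 7.5; the brute-force search below is only a sketch of Definition 6, assuming each pair of A_n^t(p) is given oriented upward (lower-level node first). Function and variable names are ours.

    # find an ordered chain of p-line pairs: the first starts in A(t), the
    # last ends in A(t+n), and adjacent pairs meet at a common level with
    # distinct joint nodes (condition (2) of Definition 6)
    def find_p_line_set(pairs, level, t, n):
        def extend(chain):
            last_lo, last_hi = chain[-1]
            if level[last_hi] == t + n:          # reached A(t+n): done
                return chain
            for lo, hi in pairs:
                if level[lo] == level[last_hi] and lo != last_hi:
                    result = extend(chain + [(lo, hi)])
                    if result:
                        return result
            return None

        for lo, hi in pairs:
            if level[lo] == t:                   # a chain starts in A(t)
                result = extend([(lo, hi)])
                if result:
                    return result
        return None                              # A_n^t is not p-connectable

    # Example 2: the levels and p-line pairs as given in the text
    level = {'b': 1, 'c': 1, 'd': 1, 'e': 2, 'f': 2, 'g': 3, 'h': 3, 'i': 3}
    pairs = [('b', 'f'), ('d', 'e'), ('f', 'h'), ('e', 'i')]
    print(find_p_line_set(pairs, level, t=1, n=2))   # [('b', 'f'), ('e', 'i')]

Run on Example 2's data, the search returns exactly the p-line set ((b, f), (e, i)) quoted above.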
Lemma 3: Suppose that N ∈ A(t), N^1, N^2 ∈ A(t+i), N^3, N^4 ∈ A(t+j), and N' ∈ A(t+n), where i < j < n. Also assume that (N, N^1), (N^2, N^3) and (N^4, N') hold. Then there is a p-line (1) set L_n^t(1)(N, N').

Proof:

[Figure 7-5. An Illustration for Lemma 3]

Since (N, N^1), (N^2, N^3) and (N^4, N') hold, the previous lemma gives p-line (1) sets on each of the three segments; by definition N ⇏ N', N' ⇏ N, and N^1 ≠ N^2, so the first two segments may be joined at level t + i. Now we have two cases.

(1) N^3 = N^4. Then N^2 ⇏ N', so (N^2, N') holds and Lemma 2 gives a p-line (1) set L_{n-i}^{t+i}(1)(N^2, N'). Let L_n^t(1)(N, N') = L_i^t(1)(N, N^1) ∪ L_{n-i}^{t+i}(1)(N^2, N').

(2) N^3 ≠ N^4. Then let L_n^t(1)(N, N') = L_j^t(1)(N, N^3) ∪ L_{n-j}^{t+j}(1)(N^4, N'), where L_j^t(1)(N, N^3) = L_i^t(1)(N, N^1) ∪ L_{j-i}^{t+i}(1)(N^2, N^3).

Thus in either case there is a p-line (1) set on A_n^t. (Q.E.D.)

From Lemmas 2 and 3 we can prove the following lemma.

Lemma 4: If A_n^t is p-connectable, then there is a p-line (1) set L_n^t(1)(N, N'), where N ∈ A(t) and N' ∈ A(t+n).

What Lemma 4 implies is the following. Let L_n^t(1)(N, N') = ((N, N_1'), (N_1, N_2'), ..., (N_{n-1}, N')), i.e., d_I(N_i') = d_I(N_i) = t + i and d_I(N_i) + 1 = d_I(N_{i+1}'). Since (N_i, N_{i+1}') is a p-line pair, we can process both nodes at the same time. To do this we first process A(t+i) - {N_i} (notice that d_I(N_i) = t + i), then process {N_i, N_{i+1}'}, and finally process A(t+i+1) - {N_{i+1}'}. This leads us to the following scheduling.

Definition 8: Assume A_n^t is p-connectable, with L_n^t(1) = ((N_0, N_1'), (N_1, N_2'), ..., (N_{n-1}, N_n')). To p-line schedule A_n^t by L_n^t(1) means the following scheduling:

(1) p-schedule A(t) - {N_0}.
(2) g = 1.
(3) p-schedule {N_{g-1}, N_g'}.
(4) p-schedule A(t+g) - {N_g', N_g}.
(5) g = g + 1. If g < n, then go to (3).
(6) p-schedule {N_{n-1}, N_n'}.
(7) p-schedule A(t+n) - {N_n'}.
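Definition 8 expands mechanically into a sequence of p-scheduled groups. The sketch below, with names of our own choosing, emits that sequence; p_schedule pairs up a set arbitrarily, as Definition 4 allows.

    def p_schedule(nodes):
        # partition a set into subsets of size 2 (one of size 1 if odd)
        ns = sorted(nodes)
        return [set(ns[i:i + 2]) for i in range(0, len(ns), 2)]

    def p_line_schedule(A, t, n, L):
        # L = [(N_0, N_1'), (N_1, N_2'), ..., (N_{n-1}, N_n')], a p-line (1) set
        groups = p_schedule(A[t] - {L[0][0]})            # step (1)
        for g, (lo, hi) in enumerate(L):
            groups.append({lo, hi})                      # steps (3) and (6)
            rest = A[t + g + 1] - {hi}                   # steps (4) and (7)
            if g + 1 < n:
                rest -= {L[g + 1][0]}
            groups += p_schedule(rest)
        return groups

For Example 3 below, L = ((d, f), (g, h)) expands to the groups {a, b}, {c, e}, {d, f}, {g, h}, {i, j}, which is exactly the 5-step optimum schedule shown there.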
Now an algorithm to schedule a tight graph on two processors is described. Scheduling is done according to the node sets A(i), for i = 1, 2, ..., h(G). All nodes in A(i) can be processed independently of each other, i.e., in any order. If |A(i)| is even, then the two processors can be kept busy all the time and A(i) can be processed in time |A(i)|/2. If |A(i)| is odd, then a machine may become idle: one node of A(i) will be left without a partner to be processed with. Let it be N. Then a partner must be found in some A(j), j > i. First we may try to find N' ∈ A(i+1) which can be a partner of N. However, if |A(i+1)| is even, then |A(i+1) - {N'}| becomes odd and we have the same problem again, and may try to find a node from A(i+2) to fill the idle machine, and so on. If this cycle is ever to stop, it must stop when an A(i+k) with |A(i+k)| odd is reached. Otherwise there is no way to remedy the cycle, and machine time must be wasted.

Tight Graph Scheduling Algorithm:

Step 1: t = 0.
Step 2: If t = h(G), then p-schedule A(t) and stop; else go to Step 3.
Step 3: If |A(t)| is even, then (3-1) p-schedule A(t) and (3-2) go to Step 7.
Step 4: |A(t)| is odd. Find A(t+1).
Step 5: If for all N ∈ A(t), |SR(N)| = |A(t+1)|, then (5-1) p-schedule A(t) and (5-2) go to Step 7.
Step 6: There is N' ∈ A(t) such that |SR(N')| < |A(t+1)|.
(6-1) If |A(t+1)| is odd, then
  (6-1-1) p-schedule A(t) - {N'}.
  (6-1-2) p-schedule {N'} ∪ {N''}, where N'' ∈ A(t+1) - SR(N').
  (6-1-3) p-schedule A(t+1) - {N''}.
  (6-1-4) Go to Step 7.
(6-2) |A(t+1)| is even.
  (6-2-1) Find the smallest k greater than 1 such that |A(t+k)| is odd.
  (6-2-2) If there is no such k (i.e., we have checked up to A(h(G))), then p-schedule each A(i), t ≤ i ≤ h(G), individually, and stop.
  (6-2-3) Else we have the collection {A(t), A(t+1), ..., A(t+k)}, where |A(t)| and |A(t+k)| are odd and the other |A(t+i)| are all even. Check the p-connectability of A_k^t.
    (1) A_k^t is not p-connectable:
      (1-i) p-schedule A(t), A(t+1), ..., A(t+k-1) individually.
      (1-ii) Let t = t + k - 1.
      (1-iii) Go to Step 7.
    (2) A_k^t is p-connectable:
      (2-i) Find a p-line (1) set L_k^t(1) = ((N_0, N_1'), (N_1, N_2'), ..., (N_{k-1}, N_k')).
      (2-ii) p-line schedule A_k^t by L_k^t(1).
      (2-iii) t = t + k.
      (2-iv) Go to Step 7.
Step 7: t = t + 1. Go to Step 2.

Example 3:

[Figure 7-6. An Example of a Tight Graph Scheduling: A(1) = {a, b, c, d, e}, A(2) = {f, g}, A(3) = {h, i, j}]

(1) |A(1)| and |A(3)| are odd, and |A(2)| is even. Thus we have the collection {A(1), A(2), A(3)} (by Step 6 of the algorithm).
(2) For A_2^1 we have L_2^1(1) = ((d, f), (g, h)).
(3) According to Step (6-2-3)(2), we schedule as follows:
  (i) p-schedule A(1) - {d} = {a, b, c, e}.
  (ii) p-schedule {d, f}.
  (iii) p-schedule A(2) - {f, g} = ∅.
  (iv) p-schedule {g, h}.
  (v) p-schedule A(3) - {h} = {i, j}.

Thus we have an optimum schedule:

    Step:      1  2  3  4  5
    Machine A: a  c  d  g  i
    Machine B: b  e  f  h  j
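The whole tight-graph algorithm can now be sketched, assuming the JobGraph, distances, node_sets, p_schedule and p_line_schedule helpers above. The sketch condenses Steps 4 through 6 into a single p-line (1) search; find_p_line_1 is a plain depth-first stand-in for the discovery procedure of Section 7.5. The trailing lines re-run Example 3, but note that Figure 7-6's arrows are not reproduced in this text, so the arrow list below is a hypothetical graph of our own, chosen only to be tight and to admit a p-line set such as ((d, f), (g, h)).

    from functools import lru_cache

    def make_reaches(g):
        @lru_cache(maxsize=None)
        def reaches(a, b):                     # the relation N => N'
            return b in g.succ[a] or any(reaches(m, b) for m in g.succ[a])
        return reaches

    def find_p_line_1(A, t, k, reaches):
        # search for ((N_0,N_1'),...,(N_{k-1},N_k')): pairs span adjacent
        # levels, joint nodes at a common level are distinct
        def ok(a, b):
            return not reaches(a, b) and not reaches(b, a)
        def extend(chain, lvl):
            if lvl == t + k:
                return chain
            banned = chain[-1][1] if chain else None
            for lo in A[lvl] - {banned}:
                for hi in A[lvl + 1]:
                    if ok(lo, hi):
                        got = extend(chain + [(lo, hi)], lvl + 1)
                        if got:
                            return got
            return None
        return extend([], t)

    def schedule_tight(g):
        d_i, _ = distances(g)
        A = node_sets(d_i)
        h = max(A)
        reaches = make_reaches(g)
        groups, t = [], 0
        while t <= h:
            if t == h or len(A[t]) % 2 == 0 \
               or all(len(g.succ[n]) == len(A[t + 1]) for n in A[t]):
                groups += p_schedule(A[t]); t += 1; continue
            # |A(t)| odd: find the smallest k with |A(t+k)| odd
            k = next((k for k in range(1, h - t + 1) if len(A[t + k]) % 2), None)
            if k is None:                      # Step 6-2-2
                while t <= h:
                    groups += p_schedule(A[t]); t += 1
                break
            L = find_p_line_1(A, t, k, reaches)
            if L is None:                      # Step 6-2-3 (1)
                for _ in range(k):
                    groups += p_schedule(A[t]); t += 1
            else:                              # Step 6-2-3 (2)
                groups += p_line_schedule(A, t, k, L)
                t += k + 1
        return groups

    # Example 3 on a hypothetical tight graph (Figure 7-6 is not shown here)
    g = JobGraph()
    for lo, hi in [("r", x) for x in "abcde"] + \
                  [("a", "f"), ("b", "f"), ("c", "g"), ("d", "g"), ("e", "g"),
                   ("f", "h"), ("f", "i"), ("g", "i"), ("g", "j")]:
        g.add_arrow(lo, hi)
    groups = schedule_tight(g)
    done = set()
    for group in groups:                       # no node before its predecessors
        assert all(g.pred[n] <= done for n in group)
        done |= group
    assert len(groups) == 6                    # {r} plus the 5 steps of Example 3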
Proof for the algorithm:

Lemma 5: Step 3 is optimum, and whatever p-schedule is made, it does not affect the later stages.

Lemma 6: Step 5 is optimum, and whatever p-schedule is made, it does not affect the later stages.

Proof: First note that after A(t) has been processed, only nodes in A(t+1) can become available. Since for all N ∈ A(t), |SR(N)| = |A(t+1)|, all nodes in A(t) must have been processed before any node in A(t+1) can be processed. (Q.E.D.)

Lemma 7: Step 6-1 is optimum, and whatever p-schedule is made, it does not affect the later stages.

Proof: The algorithm actually processes A(t) and A(t+1) (where |A(t)| and |A(t+1)| are odd) in time (|A(t)| + |A(t+1)|)/2, which is optimum. (Q.E.D.)

Lemma 8: Step 6-2-2 is optimum.

Proof: Consider the collection {A(t), A(t+1), ..., A(h(G))}. Since |A(t)| is odd and all other |A(i)| are even (t < i ≤ h(G)), it takes at least time

    ⌈(|A(t)| + |A(t+1)| + ... + |A(h(G))|)/2⌉

to process the collection. Step 6-2-2 achieves this. (Q.E.D.)

Lemma 9: Assume |A(t)| and |A(t+k)| are odd and the |A(t+i)| are all even (0 < i < k). If A_k^t is p-connectable, then Step 6-2-3(2) is optimum.

Proof: The collection {A(t), A(t+1), ..., A(t+k)} can be processed in time (|A(t)| + |A(t+1)| + ... + |A(t+k)|)/2. Step 6-2-3(2) achieves this. (Q.E.D.)

Lemma 10: Assume |A(t)| and |A(t+k)| are odd and the |A(t+i)| are all even (0 < i < k). If A_k^t is not p-connectable, then Step 6-2-3(1) is optimum.

Proof: Since A_k^t is not p-connectable, there is no p-line set L_k^t. Thus there will be a node N in some A(t+i) (0 ≤ i ≤ k) without a partner to be processed with it, making a machine idle. Now, if this situation could be remedied, it would only be because there is some N' ∈ A(t+k+j) (j > 0) which can be done in parallel with N.

[Figure 7.7. An Illustration for Lemma 11]

Suppose that the above could be done. The parallel processing of N and N' can, however, be advantageous only if there is a p-line (1) set L(1)(N_0, N') with N_0 ∈ A(t); otherwise the processors cannot be kept busy for A(t), A(t+1), ..., A(t+i) - {N}, and we gain nothing by delaying the processing of N. Now, from the assumption, N ⇒ N'' for all N'' ∈ A(t+k), because otherwise A_k^t would become p-connectable. Also, by Corollary 1, there is a node N'' ∈ A(t+k) such that N'' ⇒ N'. This implies that N ⇒ N'. Thus N' cannot be processed in parallel with N. This proves the lemma. (Q.E.D.)

Essentially, upon finding A(t) with |A(t)| odd, the algorithm tries to delay the processing of a node N in A(t) till the next node set A(t+1). If |A(t+1)| is even, then again the algorithm tries to delay the processing of a node N' in A(t+1) till the next level A(t+2), and so on, until it finds some A(t+k) where |A(t+k)| is odd, or |A(t+k) ∪ {N''}| is even, where N'' is a node whose processing is being delayed from the previous stage. In other words, the algorithm tries to establish p-connectability between two adjacent node sets both of which have an odd number of elements. Note that it is not necessary to try to establish p-connectability among more than two odd node sets, say A(t), A(t+k) and A(t+m) (m > k), because A_m^t cannot be p-connectable if A_k^t is not (see Lemma 10).

Now the above argument, together with Lemmas 5-10, proves the following theorem.

Theorem 1: The algorithm gives an optimum schedule of a tight graph.

7.4 Scheduling of a Loose Graph

Now let us extend the above algorithm so that it can handle a loose graph. To facilitate the presentation, a few definitions are in order. (From now on, by "a graph" a loose graph is to be understood.)

Definition 9: A node N in a graph G for which d_I(N) + d_T(N) ≠ h(G) is called a loose node. Otherwise a node is tight. Let N be a loose node. Then we define the far distance d_F(N) as [an illegible stretch of the scan follows; it defines the far distance d_F(N), the node sets B(i) and C(i), and the p-line schedule of B_n^t (Definition 8'), whose closing steps read:] then p-schedule B(t+b) for b = a_{g-1} + 1, a_{g-1} + 2, ..., a_g - 1; then p-schedule B(t+a_g) - {N_g', N_g}.
(5) g = g + 1. If g < k, then go to (3).
(6) p-schedule {N_k, N_{k+1}'}.
(7) p-schedule B(t+n) - {N_{k+1}'}.

Finally, in connection with Definition 8', we define the following. Suppose that B_n^t is not p-connectable, i.e., there is no p-line set L_n^t on B_n^t. It may, however, be possible to find a p-line set L_s^t on B_s^t for some s, 0 < s < n.

Definition 11: Given B_n^t which is not p-connectable, we check whether B_s^t is p-connectable for s = 1, 2, ..., n-1. Let m be the smallest s such that B_s^t is p-connectable but B_{s+1}^t is not. We call m the maximum p-connectable distance, and L_m^t a maximum p-line set.

The following example illustrates the above definition.

Example 5:

[Figure 7.10. An Example of the Maximum p-connectable Distance]

Consider the p-connectability of the set B_2^t shown in Figure 7.10. B_1^t is p-connectable but B_2^t is not; hence the maximum p-connectable distance in this B_2^t is 1, and the corresponding L_1^t is a maximum p-line set.

Using a technique similar to the one described in Section 7.5, we can check the p-connectability of B_s^t.

An algorithm to schedule a loose graph on two processors is now given. The algorithm resembles the one for a tight graph; the major difference lies in the treatment of loose nodes. Loose nodes are scheduled as late as possible, and are used when it would otherwise become inevitable to waste processor time. In what follows, the definitions of B(i) and C(i) are modified so that they do not include those loose nodes which have already been scheduled.
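Because the precise definitions of d_F(N), B(i) and C(i) fall in the illegible stretch noted above, the sketch below is an assumption-laden reading rather than the thesis' text: it takes d_F(N) = h(G) - d_T(N), i.e., the latest node set a loose node can join without lengthening the schedule; B(i) to be the tight nodes at level i; and C(i) to be the still-unscheduled loose nodes whose window [d_I(N), d_F(N)] covers i. It assumes the JobGraph and distances helpers above.

    def loose_structure(g):
        d_i, d_t = distances(g)
        h = max(d_i.values())
        tight = {n for n in g.nodes if d_i[n] + d_t[n] == h}
        loose = g.nodes - tight
        d_f = {n: h - d_t[n] for n in loose}        # assumed far distance
        B = {i: {n for n in tight if d_i[n] == i} for i in range(h + 1)}
        def C(i, scheduled):
            # loose nodes usable to fill an idle slot at level i
            return {n for n in loose - scheduled
                    if d_i[n] <= i <= d_f[n]}
        return B, C, d_f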
Loose Graph Scheduling Algorithm:

Step 1: t = 0.
Step 2: If t = h(G), then p-schedule all unscheduled nodes and stop; else go to Step 3.
Step 3: If |B(t)| is even, then (3-1) p-schedule B(t) and (3-2) go to Step 7.
Step 4: |B(t)| is odd. Find B(t+1).
Step 5: If for all N ∈ B(t), |SR(N)| = |B(t+1)|, then
  (5-1) Check C(t). If C(t) = ∅, then p-schedule B(t) and go to Step 7.
  (5-2) Otherwise pick N with the minimum d_F(N) in C(t). (If there is more than one such N, pick any one.) Now p-schedule B(t) ∪ {N} and go to Step 7.
Step 6: There is N' ∈ B(t) such that |SR(N')| < |B(t+1)|.
(6-1) If |B(t+1)| is odd, then
  (6-1-1) p-schedule B(t) - {N'}.
  (6-1-2) p-schedule {N'} ∪ {N''}, where N'' ∈ B(t+1) - SR(N').
  (6-1-3) p-schedule B(t+1) - {N''}.
  (6-1-4) Go to Step 7.
(6-2) |B(t+1)| is even.
  (6-2-1) Find the smallest k greater than 1 such that |B(t+k)| is odd.
  (6-2-2) If there is no such k (i.e., we have checked up to B(h(G))), then p-schedule each B(i) (t ≤ i ≤ h(G)) individually and stop.
  (6-2-3) Else we have the collection {B(t), B(t+1), ..., B(t+k)}, where |B(t)| and |B(t+k)| are odd and the other |B(t+i)| are all even. Check whether B_k^t is p-connectable.
    (1) B_k^t is not p-connectable:
      (1-i) Find the maximum p-connectable distance in B_k^t; let it be m.
      (1-ii) Check C(t+m). If C(t+m) = ∅, then p-schedule B(t), B(t+1), ..., B(t+k-1) individually; let t = t + k - 1 and go to Step 7.
      (1-iii) Otherwise let B'(t+m) = B(t+m) ∪ {N}, where N ∈ C(t+m) has the minimum d_F(N). Let B'_m^t = B(t) ∪ B(t+1) ∪ ... ∪ B(t+m-1) ∪ B'(t+m). Then p-line schedule B'_m^t by a maximum p-line set L_m^t. Let t = t + m and go to Step 7.
    (2) B_k^t is p-connectable:
      (2-i) Find a p-line set L_k^t.
      (2-ii) p-line schedule B_k^t by L_k^t.
      (2-iii) t = t + k.
      (2-iv) Go to Step 7.
Step 7: t = t + 1. Go to Step 2.

Proof for the algorithm: That the above algorithm is optimum can be proved by an argument similar to the one used for the previous algorithm. For example, Steps 3, 5-1, 6-1, 6-2-2 and 6-2-3(2) are easily shown optimum by the previous lemmas. It remains to show that Steps 5-2 and 6-2-3(1) are optimum.

Lemma 11: Step 5-2 is optimum.

Proof: Suppose that we do not use N, the node with the minimum d_F(N) in C(t), to fill up otherwise wasted processor time. Saving N for later use does not improve the situation, because a node cannot be used more effectively than to fill up wasted processor time anyway. Also, the choice of N from C(t) is optimum. Suppose, for example, that N' with d_F(N') > d_F(N) is also in C(t), and consider those two nodes (see Figure 7.11). Whether we use N or N' to schedule with B(t) makes no difference to the scheduling of B(t+1), B(t+2), ..., B(d_F(N)), because one of the two nodes remains available if needed. Suppose we used N' with B(t). Then it is not possible to fill a later request which may arise when some B(u) (d_F(N) + 1 ≤ u ≤ d_F(N')) is scheduled, whereas if we used N with B(t), then we can fill that request. This proves the lemma.

[Figure 7.11. An Illustration for Lemma 13]

(Q.E.D.)

That Step 6-2-3(1) is optimum is proved similarly. Now we have the following theorem.

Theorem 2: The algorithm gives an optimum schedule of a loose graph.

7.5 Supplement

(1) An algorithm to establish A_n^t(p) on A_n^t is now discussed.

Let m = |A(t)| + |A(t+1)| + ... + |A(t+n)|, and let B be an m x m connection matrix where the first |A(t)| columns and rows are labeled by the nodes in A(t), the next |A(t+1)| columns and rows by the nodes in A(t+1), and so forth. An element b_ij of B is 1 if and only if N_i → N_j, where N_i and N_j are the labels of the i-th row and the j-th column. Now define the multiplication of connection matrices as follows: let A = B x C, where A, B and C are m x m connection matrices; then a_ij = ∨_{k=1..m} (b_ik ∧ c_kj). Now compute B* = B ∨ B² ∨ ... ∨ B^m. In B*, b*_ij = 1 implies that N_i ⇒ N_j, while b*_ij = 0 and b*_ji = 0 (with d_I(N_i) ≠ d_I(N_j)) imply that (N_i, N_j) is a p-line pair. For example, for the graph in Figure 7.12 we have A_n^t(p) = {(a, f), (a, g), (b, e), (b, h), (b, g), (c, d), (d, h), (d, j), (f, h), (g, h), (g, i)}.
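The procedure of part (1) is exactly a Boolean matrix closure. A sketch, assuming numpy, a 0/1 integer connection matrix B, and a list `level` giving d_I for each row index; one orientation of each pair is kept, as in the text.

    import numpy as np

    def p_line_pairs(B, level):
        # B[i][j] = 1 iff N_i -> N_j; closure = B v B^2 v ... v B^m
        m = len(B)
        closure = np.zeros_like(B)
        power = B.copy()
        for _ in range(m):
            closure |= power
            power = (power @ B > 0).astype(B.dtype)
        # (N_i, N_j) is a p-line pair iff neither node reaches the other
        # and they lie at different levels (Definition 5)
        return [(i, j) for i in range(m) for j in range(i + 1, m)
                if not closure[i, j] and not closure[j, i]
                and level[i] != level[j]]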
[Figure 7.12. An Example for A_n^t(p): (a) a graph on the nodes a-j; (b) its connection matrix B; (c) the closure B ∨ B² ∨ ... ∨ B^m]

(2) Given A_n^t(p) for A_n^t, an algorithm to find p-connectivity and L_n^t(1) is now described. According to Lemma 4, if A_n^t is p-connectable, then there is a p-line (1) set L_n^t(1). Thus, to check p-connectability it is enough to examine whether there is a p-line (1) set L_n^t(1).

First let A_n^t(p)(1) = A_n^t(p) - {(N, N') | |d_I(N) - d_I(N')| > 1}. Now we construct a graph Ā_n^t as follows.

(1) Ā_n^t has the following nodes:
  (i) an initial node N_s,
  (ii) a terminal node N_f,
  (iii) all nodes in A_n^t,
  (iv) for each node N (N ∈ A_n^t, N ∉ A(t), N ∉ A(t+n)), a new duplicate node N'.

(2) Ā_n^t has the following arrows:
  (i) an arrow from N_s to every node in A(t),
  (ii) an arrow from every node in A(t+n) to N_f,
  (iii) if (N_1, N_2) ∈ A_n^t(p)(1), then an arrow from N_1 to N_2,
  (iv) for every A(t+k), 1 ≤ k ≤ n-1, and every N ∈ A(t+k), an arrow from N to the duplicate of every N'' ∈ A(t+k) not identical to N.

To illustrate the above definition, let us consider the following example.

Example: Let A_2^t = {a, b, c, d, e, f, g}, where a, b ∈ A(t), c, d, e ∈ A(t+1) and f, g ∈ A(t+2), and let A_2^t(p)(1) = {(b, c), (d, f), (d, g), (c, f)}.

[Figure 7.13. An Example for p-connectivity Discovery]

Then Ā_2^t has the nodes {N_s, N_f, a, b, c, d, e, f, g (the nodes of A_2^t), c', d', e' (duplicates of the nodes in A(t+1))} and the arrows {(N_s, a), (N_s, b), (f, N_f), (g, N_f), (b, c), (d, f), (d, g), (c, f) (the arrows of (i)-(iii)), (c, d'), (c, e'), (d, c'), (d, e'), (e, c'), (e, d') (the arrows from a node N to the duplicates of the nodes not identical to N)}.

Now it is clear that A_n^t has a p-line (1) set L_n^t(1) if and only if there is a path from N_s to N_f in Ā_n^t. There is a well-known algorithm for path finding; see, e.g., [19].
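Part (2) thus reduces p-line (1) discovery to path finding; the sketch below builds the auxiliary graph and runs a breadth-first search. One detail is changed from the text's arrow list: the pair arrows of interior levels are attached to the duplicate nodes rather than to the originals, since that is what enforces the joint condition N_g' ≠ N_g of Definition 6. That attachment, like the names, is our reading rather than a quotation.

    from collections import defaultdict, deque

    def p_connectable(A, t, n, pairs1):
        # A: level -> set of nodes; pairs1: A_n^t(p)(1), pairs spanning
        # adjacent levels, oriented upward (lower-level node first)
        arcs = defaultdict(set)
        for node in A[t]:
            arcs["N_s"].add(node)
        for node in A[t + n]:
            arcs[node].add("N_f")
        for lo, hi in pairs1:
            if lo in A[t]:                     # a chain starts in A(t)
                arcs[lo].add(hi)
        for k in range(1, n):                  # interior levels get duplicates
            for node in A[t + k]:
                for other in A[t + k] - {node}:
                    arcs[node].add(("dup", other))
                    for lo, hi in pairs1:      # a duplicate inherits the
                        if lo == other:        # upward pairs of its original
                            arcs[("dup", other)].add(hi)
        seen, queue = {"N_s"}, deque(["N_s"])
        while queue:                           # breadth-first search
            v = queue.popleft()
            if v == "N_f":
                return True
            for w in arcs[v] - seen:
                seen.add(w)
                queue.append(w)
        return False

    # the example above: the path N_s -> b -> c -> dup(d) -> f -> N_f exists,
    # corresponding to the p-line (1) set ((b, c), (d, f))
    A = {0: {"a", "b"}, 1: {"c", "d", "e"}, 2: {"f", "g"}}
    print(p_connectable(A, 0, 2, [("b", "c"), ("d", "f"), ("d", "g"), ("c", "f")]))
    # prints True

Keeping parent pointers during the search would recover the p-line (1) set itself rather than a yes/no answer.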
8. CONCLUSION

This thesis introduced new techniques to expose hidden parallelism in a program. The techniques made extensive use of one of the fundamental arithmetic laws, the distributive law. Furthermore, it was suggested that with the help of these techniques the computation of a program might be speeded up logarithmically, in the sense that computation time becomes a logarithmic function of the number of single variable occurrences in a program rather than a linear function of it. Even though the discussion was based on an ILLIAC IV type machine, as mentioned before, the results are readily applicable to pipeline machines such as the CDC STAR.

Chapter 2 of the thesis studied the parallel computation of summations, powers and polynomials. The minimum time to evaluate summations or powers, as well as the minimum number of PE's required to attain it, was given. A scheme which computes a polynomial in parallel in less time than any previously known scheme was also introduced. Because of its simplicity in scheduling, the k-th order Horner's rule for parallel polynomial computation was studied in detail. It was shown that for this algorithm the availability of more PE's sometimes increases the computation time, since the algorithm forces all PE's to participate in the computation.

Chapter 3 presented an algorithm which reduces tree height for an arithmetic expression by distribution. The algorithm works from the innermost parenthesis pair to the outermost one and scans an arithmetic expression only once. A measure for the height of the minimum height tree for an arithmetic expression was given as a function of the depth of parenthesis-pair nesting and the number of single variable occurrences in the expression.

Chapter 4 extended the above idea to cover a sequence of arithmetic expressions. It was shown that by replacing a sequence of arithmetic expressions with a single arithmetic expression by back substitution, the computation can be speeded up in a logarithmic way for a certain class of iteration formulas, e.g., x_{i+1} := a × x_i + b. The chapter also showed that parallel computation is in general more favorable than sequential computation in terms of round-off error, and that distribution does not introduce a significant amount of round-off error.

Chapter 5 studied inter-statement parallelism as an introduction to the following chapter. An algorithm was given which checks whether executing the statements of a program in some other sequence gives the same results as executing them in the given sequence. The algorithm was new in the sense that it prevents variables from being updated before they are used, which had not been taken into account by previous work. Also, a technique which exposes more parallelism between statements by introducing temporary locations was introduced.

Chapter 6 presented an algorithm which checks whether a statement in a loop can be executed simultaneously for all values of the loop index. The algorithm checks only the index expressions and the way the values of the indices vary, and does not require a loop to be replaced with a sequence of statements. In case a statement in a loop cannot be executed in parallel with respect to the loop index as it stands, the algorithm "skews" the computation of the statement with respect to the loop index so that it can be executed in parallel for all values of the index. Also, to expose hidden parallelism in a loop, the replacement of a loop with several loops was discussed.

A solution for the equally weighted two-processor scheduling problem was given in Chapter 7. The only practical result obtained so far was for scheduling a rooted tree of equally weighted tasks on k identical processors. The solution given in Chapter 7 schedules a graph of equally weighted tasks on two identical processors. If common expressions in an arithmetic expression are considered, then a graph of operations rather than a tree is obtained for the expression, and the scheduling algorithm is readily applicable to scheduling that graph on P(2).

Suggestions for further research have been given in several places throughout the thesis and need not be repeated here. We conclude by giving two possible extensions that deserve brief mention.

(1) The design of a better machine. Even though we assumed that a PE can communicate with any other PE instantaneously, this may not be the case in reality, because it is costly and impractical to provide data paths between every PE pair. Hence it is necessary to design a PE interconnection which is economical yet powerful enough to simulate the above idealized interconnection [25], [26].

(2) Generalization of the ideas given in this thesis. The three laws of arithmetic were utilized in this thesis for parallelism exploitation. We should, however, pay more attention to these laws even in terms of serial computation. For example, suppose an arithmetic expression which involves matrices, row vectors and column vectors is given.
Then by the appropriate application of the associative law, the number of multiplications required may be reduced drastically.

LIST OF REFERENCES

[1] Abrahams, P. W., "A Formal Solution to the Dangling else of ALGOL 60 and Related Languages", Comm. ACM, 9 (September, 1966), pp. 679-682.

[2] Abel, N. E., et al., "TRANQUIL: A Language for an Array Processing Computer", Proc. of the Spring Joint Computer Conference (1969), pp. 57-73.

[3] Naur, P., et al., "Revised Report on the Algorithmic Language ALGOL 60", Comm. ACM, 6 (January, 1963), pp. 1-17.

[4] Allard, R. W., Wolf, K. A. and Zemlin, R. A., "Some Effects of the 6600 Computer on Language Structures", Comm. ACM, 7 (February, 1964), pp. 112-119.

[5] Baer, J. L., "Graph Models of Computations in Computer Systems", Ph.D. Dissertation, University of California, Los Angeles, Report No. 68-46 (October, 1968).

[6] Baer, J. L. and Bovet, D. P., "Compilation of Arithmetic Expressions for Parallel Computations", Proc. of the IFIP Congress (1968), pp. 340-346.

[7] Barnes, G. H., et al., "The ILLIAC IV Computer", IEEE Trans. of Computers, C-17 (August, 1968), pp. 746-757.

[8] Beightler, C. S., et al., "A Short Table of z-Transforms and Generating Functions", Operations Research, 9 (July-August, 1961), pp. 574-578.

[9] Bingham, H. W., Reigel, W. E. and Fisher, D. A., "Control Mechanisms for Parallelism in Programs", Burroughs Corporation, ECOM-02463-7 (October, 1968).

[10] Bingham, H. W. and Reigel, W. E., "Parallelism Exposure and Exploitation in Digital Computing Systems", Burroughs Corporation, ECOM-02463-F (June, 1969).

[11] Breuer, M. A., "Generation of Optimal Code for Expressions via Factorization", Comm. ACM, 12 (June, 1969), pp. 333-340.

[12] "Newsdata", Computer Decisions (March, 1970), p. 2.

[13] Conway, M. E., "A Multiprocessor System Design", Proc. of the Fall Joint Computer Conference (1963), pp. 139-146.

[14] Conway, R. W., Maxwell, W. L. and Miller, L. W., Theory of Scheduling, Addison-Wesley Publishing Company, Inc., New York (1967).

[15] Dorn, W. S., "Generalizations of Horner's Rule for Polynomial Evaluation", IBM Journal of Research and Development, 6 (April, 1962), pp. 239-245.

[16] Estrin, G., "Organization of Computer Systems--the Fixed plus Variable Structure Computer", Proc. of the Western Joint Computer Conference (May, 1960), pp. 33-40.

[17] Gold, D. E., "A Model for Linear Programming Optimization of I/O-Bound Programs", M.S. Thesis, Department of Computer Science, University of Illinois at Urbana-Champaign, Report No. 340 (June, 1969).

[18] Graham, W. R., "The Parallel and the Pipeline Computers", Datamation (April, 1970), pp. 68-71.

[19] Harary, F., Norman, R. Z. and Cartwright, D., Structural Models: An Introduction to the Theory of Directed Graphs, John Wiley and Sons, Inc., New York (1965).

[20] Hellerman, H., "Parallel Processing of Algebraic Expressions", IEEE Trans. of Electronic Computers, EC-15 (January, 1966), pp. 82-91.

[21] Hu, T. C., "Parallel Sequencing and Assembly Line Problems", Operations Research, 9 (November-December, 1961), pp. 841-848.

[22] Knowls, M., et al., "Matrix Operations on ILLIAC IV", Department of Computer Science, University of Illinois at Urbana-Champaign, Report No. 222 (March, 1967).

[23] Knuth, D. E., The Art of Computer Programming, Vol. 2, Addison-Wesley Publishing Company, Inc., New York (1969).

[24] Kuck, D. J., "ILLIAC IV Software and Application Programming", IEEE Trans. of Computers, C-17 (August, 1968), pp. 758-769.
[25] Kuck, D. J. and Muraoka, Y., "A Machine Organization for Arithmetic Expression Evaluation and an Algorithm for Tree Height Reduction", unpublished (September, 1969).

[26] Kuck, D. J., "A Preprocessing High Speed Memory System", to be published.

[27] Logan, J. R., "A Design Technique for Digital Squaring Networks", Computer Design (February, 1970), pp. 84-88.

[28] Minsky, M. L., Computation: Finite and Infinite Machines, Prentice-Hall, Inc., New Jersey (1967).

[29] Motzkin, T. S., "Evaluation of Polynomials and Evaluation of Rational Functions", Bull. A.M.S., 61 (1955), p. 163.

[30] Murtha, J. C., "Highly Parallel Information Processing Systems", in Advances in Computers, Vol. 7, Academic Press, Inc., New York (1966), pp. 2-116.

[31] Muntz, R. R. and Coffman, E. G., "Optimal Preemptive Scheduling on Two-Processor Systems", IEEE Trans. of Computers, C-18 (November, 1969), pp. 1014-1020.

[32] Nievergelt, J., "Parallel Methods for Integrating Ordinary Differential Equations", Comm. ACM, 7 (December, 1964), pp. 731-733.

[33] Noyce, R. N., "Making Integrated Electronics Technology Work", IEEE Spectrum, 5 (May, 1968), pp. 63-66.

[34] Ostrowski, A. M., "On Two Problems in Abstract Algebra Connected with Horner's Rule", in Studies in Mathematics and Mechanics Presented to R. von Mises, Academic Press, New York (1954), pp. 40-48.

[35] Pan, V. Ya., "Methods of Computing Values of Polynomials", Russian Mathematical Surveys, 21 (January-February, 1966), pp. 105-136.

[36] Ramamoorthy, C. V. and Gonzalez, M. J., "A Survey of Techniques for Recognizing Parallel Processable Streams in Computer Programs", Proc. of the Fall Joint Computer Conference (1969), pp. 1-15.

[37] Russel, E. C., "Automatic Program Analysis", University of California, Los Angeles, Report No. 69-72 (March, 1969).

[38] Shedler, G. S. and Lehman, M. M., "Evaluation of Redundancy in a Parallel Algorithm", IBM Systems Journal, 6, 3 (1967), pp. 142-149.

[39] Squire, J. S., "A Translation Algorithm for a Multiple Processor Computer", Proc. of the 18th ACM National Conference (1963).

[40] Stone, H. S., "One-Pass Compilation of Arithmetic Expressions for a Parallel Processor", Comm. ACM, 10 (April, 1967), pp. 220-223.

[41] Thompson, R. N. and Wilkinson, J. A., "The D825 Automatic Operating and Scheduling Program", in Programming Systems and Languages, McGraw-Hill, New York (1967), pp. 647-660.

[42] Winograd, S., "On the Time Required to Perform Addition", J. ACM, 12 (April, 1965), pp. 277-285.

[43] Winograd, S., "The Number of Multiplications Involved in Computing Certain Functions", Proc. of the IFIP Congress (1968), pp. 276-279.

VITA

Yoichi Muraoka was born in Sendai, Japan, on July 20, 1942. He graduated from Waseda University, Tokyo, Japan, in Electrical Engineering in March, 1965, and started his graduate study at the Graduate College of Waseda University. Since September 1966 he has been a research assistant with the ILLIAC IV project in the Department of Computer Science of the University of Illinois at Urbana-Champaign. In 1969 he received the degree of Master of Science in Computer Science. He is a member of the Association for Computing Machinery and the Institute of Electrical and Electronics Engineers.