Report No. UIUCDCS-R-75-766                    NSF-OCA-DCR73-07980 A02-000014

              SOLVING TRIANGULAR SYSTEMS ON A PARALLEL COMPUTER *

                                      by

                   Ahmed H. Sameh** and Richard P. Brent***

                                November 1975

                       Department of Computer Science
                 University of Illinois at Urbana-Champaign
                          Urbana, Illinois 61801

  * This work was supported in part by the National Science Foundation
    under Grant No. US NSF DCR73-07980 A02.
 ** Department of Computer Science and the Center for Advanced Computation,
    University of Illinois, Urbana, Illinois 61801.
*** Computer Centre, The Australian National University, Canberra, Australia.

                                   ABSTRACT

In this paper we present alternative formulations of the algorithms of Chen and Kuck [1] in such a way as to make them easier to follow and implement. We also give a detailed error analysis, showing that if x̃ is the computed solution of the triangular system Lx = f, then it satisfies the equation (L + δL)x̃ = f, where ||δL|| = O(n log² n) ε κ²(L) ||L||. Here κ(L) is the condition number of L, ||·|| denotes the ∞-norm, and ε is the unit roundoff.

1. Introduction

Chen and Kuck [1] proposed a parallel algorithm for solving the linear system of equations Lx = f, where L is a unit lower triangular matrix of order n, in O(log² n) steps, using n³/68 + O(n²) processors. They also showed that if L is of bandwidth (m+1), i.e., l_ij = 0 for i - j > m, the time required is O(log m log n) using (3/2)m²n + O(mn) processors. The problem has also been investigated independently by Orcutt [2].
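The serial competitor referred to throughout the paper is ordinary forward substitution. The following sketch (added here for reference, in Python; it is not part of the original report) shows the O(n²)-operation sequential algorithm, whose dependence of x_i on x_1, ..., x_{i-1} is exactly what the parallel algorithms below break up:

```python
def forward_substitution(L, f):
    """Solve L x = f for lower triangular L by forward substitution.
    O(n^2) operations; inherently sequential, since computing x[i]
    requires all of x[0..i-1]."""
    n = len(f)
    x = [0.0] * n
    for i in range(n):
        s = f[i]
        for j in range(i):
            s -= L[i][j] * x[j]   # subtract the already-known terms
        x[i] = s / L[i][i]        # divide by the diagonal entry
    return x

# Example: L = [[2,0],[1,4]], f = [2,9] gives x = [1, 2].
print(forward_substitution([[2.0, 0.0], [1.0, 4.0]], [2.0, 9.0]))
```

This is the algorithm whose backward error bound (Wilkinson [3]) is cited in the next paragraph for comparison.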
Unfortunately, their derivations tend to obscure the fact that both algorithms are rather simple. Our objective here is to present these algorithms in such a way as to make them easier to follow and implement. In fact, using this new formulation we are able to establish that the number of processors required for the banded case is (1/2)m²n + O(mn) instead of (3/2)m²n + O(mn). Furthermore, we present an error analysis of the general algorithm showing that if x̃ is the computed solution then (L + δL)x̃ = f, where δL is bounded by ||δL|| = a(n) ε κ²(L) ||L||. Here ||·|| stands for the ∞-norm, a(n) = O(n log² n), ε is the unit roundoff, κ(L) is the condition number of L, and we assume that terms of O(ε²) are negligible. This bound, however, can be very large compared to that of the sequential algorithm, Wilkinson [3], where ||δL|| ≤ n ε ||L||.

Throughout this paper we assume that any number of processors can be used at any time, but we will give bounds on this number. Each processor can perform any of the four arithmetic operations in one time step. Furthermore, if p processors are used we denote the computation time by T_p. Throughout this paper log n ≡ log₂ n. Without loss of generality, we assume that diag(L) is different from the identity, and that n = 2^γ and m = 2^μ for positive integers γ and μ, where γ ≥ 4.

2. Algorithm I

The inverse of L can be expressed as [4]

    L^{-1} = M_n M_{n-1} ⋯ M_2 M_1 ,                                  (2.1)

where M_i = I - τ_i ℓ_i e_i^T, in which

    ℓ_i = (0, ..., 0, l_ii - 1, l_{i+1,i}, ..., l_{ni})^T

and τ_i = 1/l_ii, i = 1, 2, ..., n-1, and M_n = diag(1, ..., 1, 1/l_nn). The solution x is then given by

    x = M_n M_{n-1} ⋯ M_2 M_1 f .                                     (2.2)

Evaluating the product (M_{n-1} ⋯ M_1 f) in parallel requires γ = log n stages. At each stage (j+1) we evaluate the matrices M_i^{(j+1)} = M_{2i+1}^{(j)} M_{2i}^{(j)}, i = 1, 2, ..., n/2^{j+1} - 1, and the vectors f^{(j+1)} = M_1^{(j)} f^{(j)}, where M_i^{(0)} = M_i and f^{(0)} = f. Each matrix M_i^{(j)} is of the form

                [ I_q    0          0   ]
    M_i^{(j)} = [ 0      L_i^{(j)}  0   ]                             (2.3)
                [ 0      S_i^{(j)}  I_r ]

where L_i^{(j)} is a lower triangular matrix of order 2^j, q = q_i^{(j)} = i·2^j - 1, and r = r_i^{(j)} = (n+1) - (i+1)2^j. Clearly, L_i^{(0)} = τ_i and S_i^{(0)} = -τ_i (l_{i+1,i}, ..., l_{n,i})^T. Let T_i^{(j)} denote the square matrix of order 2^j consisting of the first 2^j rows of S_i^{(j)}, and let Ŝ_i^{(j)} denote the remaining rows of S_i^{(j)}. Thus,

    L_i^{(j+1)} = [ L_{2i}^{(j)}                   0             ]
                  [ L_{2i+1}^{(j)} T_{2i}^{(j)}    L_{2i+1}^{(j)} ]   (2.4)

and

    S_i^{(j+1)} = [ S_{2i+1}^{(j)} T_{2i}^{(j)} + Ŝ_{2i}^{(j)} ,  S_{2i+1}^{(j)} ] .   (2.5)

Furthermore, if f^{(j)} = (g_1^{(j)T}, g_2^{(j)T}, g_3^{(j)T})^T, where the vectors g_1^{(j)} and g_2^{(j)} are of orders 2^j - 1 and 2^j, respectively, then the three subvectors constituting f^{(j+1)} are given by

    f_1^{(j+1)} = g_1^{(j)} ,
    f_2^{(j+1)} = L_1^{(j)} g_2^{(j)} ,                               (2.6)
    f_3^{(j+1)} = S_1^{(j)} g_2^{(j)} + g_3^{(j)} ,

where the first 2^{j+1} - 1 components of f^{(j+1)} constitute the elements x_i (i = 1, 2, ..., 2^{j+1} - 1) of the solution vector x.

Forming the matrices M_i requires 1 step with n(n-1)/2 processors. The products in (2.4)-(2.6) can be evaluated simultaneously in 1 + log 2^j = (1 + j) steps. The additions in (2.5) and in the last of (2.6) are evaluated in one additional step. Hence, the time required for each stage j+1 is t^{(j+1)} = (2 + j) steps. Let s = 2^j; then the number of processors for evaluating the products L_{2i+1}^{(j)} T_{2i}^{(j)} and S_{2i+1}^{(j)} T_{2i}^{(j)} in (2.4) and (2.5) is given by p′ = (1/2)s²(s-1) and p″ = s² r_{2i+1}^{(j)}, respectively. Since the number of processors required for evaluating the products in (2.4)-(2.6) is more than that required for evaluating the additions, evaluating M_i^{(j+1)} requires

    p_i^{(j+1)} = p′ + p″ = (1/2)(2n+1)s² - (1/2)(4i+3)s³

processors, (i = 1, 2, ..., n/2s - 1). From (2.6), we see that evaluating f^{(j+1)} requires p_0 = (1/2)(2n+1)s - (3/2)s² processors. Consequently, stage j+1 requires

    p^{(j+1)} = Σ_{i=0}^{n/2s - 1} p_i^{(j+1)}
              = (3/2)s³ - (1/4)(5n + 12)s² + (1/4)(n² + 7n + 5)s

processors. The total time for solving the system is, therefore, given by

    T_p = 2 + Σ_{j=0}^{γ-1} (2 + j) = 2 + γ(γ+3)/2                    (2.7)

steps, where one further step is added for the final product M_n f^{(γ)}, using no more than

    p = max[ max_{j=0,1,...,γ-1} {p^{(j+1)}}, n(n-1)/2 ]

processors.
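As an illustration of Section 2 (a sketch added here, not from the report), the following NumPy script builds the elementary factors of (2.1) and combines them pairwise, which is the fan-in organization behind (2.2). It works with full n×n matrices, so it reproduces only the log n-stage structure, not the blocked form (2.3)-(2.6) or the processor counts:

```python
import numpy as np

def elementary_factors(L):
    """Elementary matrices M_1, ..., M_n of (2.1): M_i scales row i by
    1/l_ii and eliminates column i below the diagonal, so that
    L^{-1} = M_n M_{n-1} ... M_1."""
    n = L.shape[0]
    Ms = []
    for i in range(n):
        M = np.eye(n)
        M[i, i] = 1.0 / L[i, i]                  # tau_i = 1/l_ii
        M[i + 1:, i] = -L[i + 1:, i] / L[i, i]   # -tau_i * l_ki, k > i
        Ms.append(M)
    return Ms

def solve_by_factor_tree(L, f):
    """Fan-in evaluation of x = M_n ... M_1 f as in (2.2): at each stage
    adjacent factors are combined pairwise, so only about log n stages
    of (simultaneously evaluable) matrix products remain."""
    Ms = elementary_factors(L)
    while len(Ms) > 1:
        paired = [Ms[k + 1] @ Ms[k] for k in range(0, len(Ms) - 1, 2)]
        if len(Ms) % 2:          # odd factor rides along to the next stage
            paired.append(Ms[-1])
        Ms = paired
    return Ms[0] @ f
```

For n = 2^γ the while-loop performs γ pairwise-combination stages, mirroring (2.4)-(2.6); the report's algorithm avoids these full n×n products by exploiting the block structure (2.3).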
For n ≥ 16 the maximum is achieved for s = n/8, yielding

    p = (15/1024)n³ + (11/64)n² + (5/32)n                             (2.8)

or p = (15/1024)n³ + O(n²) processors. For n ≥ 256, p ≤ n³/68. Hence, we have established the following theorem.

Theorem 2.1. The triangular system of equations Lx = f, where L is a lower triangular matrix of order n, can be solved in T_p = (1/2)log²n + (3/2)log n + 2 steps, using no more than p = (15/1024)n³ + O(n²) processors.

Lemma 2.2. Let L be defined as in Theorem 2.1. Then L^{-1} can be obtained in the same time required for solving the system Lx = f, using no more than (21n³ + 60n²)/128 processors.

Proof: The time bound is trivially established by Theorem 2.1. However, the maximum number of processors used occurs at s = n/4, rather than n/8, yielding the above number of processors. In fact, we can show that for h ≥ n right-hand sides the maximum number of processors is given by

    p = (h/128)(21n² + 60n)                                           (2.9)

processors.

We state, without proof, the following lemma regarding the number of operations (the number of nodes in the computational graph) Q_p required for the above algorithm.

Lemma 2.3. The algorithm of Theorem 2.1 requires Q_p = (2/21)n³ + O(n²) operations, with a redundancy R = Q_p/Q_1 = O(n), where Q_1 is the operation count of the sequential algorithm.

Now a lemma in [5] states that if we have p̂ < p processors, then the time required for solving the triangular system is given by

    T_p̂ ≤ T_p + (Q_p - T_p)/p̂ .                                      (2.10)

Let p̂ = n³/K; then

    T_p̂ ≤ T_p + (2/21) K (1 + δ)                                     (2.11)

where δ < 9/n. For n ≥ 256,

    T_p̂ ≤ T_p + 0.10 K .                                             (2.12)

Hence, as long as K = O(log² n), T_p̂ remains O(log² n).

3. Algorithm II

In this section we present an algorithm that requires the same time as Algorithm I but roughly twice the number of processors. The advantage of this algorithm is that it can be adapted to band systems. Let L^{(0)} = DL and f^{(0)} = Df, where D = diag(1/l_11, 1/l_22, ..., 1/l_nn), so that the elements of L^{(0)} are l_ik^{(0)} = l_ik/l_ii. We form matrices D^{(j)}, (j = 0, 1, ..., γ-1), such that if L^{(j+1)} = D^{(j)} L^{(j)} and f^{(j+1)} = D^{(j)} f^{(j)}, then L^{(γ)} = I and x = f^{(γ)}. Each matrix L^{(j)} is of the form

              [ I                              ]
    L^{(j)} = [ G_21^{(j)}   I                 ]
              [ G_31^{(j)}   G_32^{(j)}   I    ]                      (3.1)
              [    ⋮             ⋮         ⋱   ]

in which the diagonal identity blocks are of order s = 2^j, and f^{(j)} = (f_1^{(j)T}, f_2^{(j)T}, ..., f_{n/s}^{(j)T})^T, each f_i^{(j)} being of order s. Therefore,

    (D^{(j)})^{-1} = diag( L_1^{(j)}, L_2^{(j)}, ..., L_{n/2s}^{(j)} ),

where

    L_i^{(j)} = [ I                  0 ]
                [ G_{2i,2i-1}^{(j)}  I ] .                            (3.2)

Thus, for stage (j+1) we have

    G_{ik}^{(j+1)} =
      [ G_{2i-1,2k-1}^{(j)}                                    G_{2i-1,2k}^{(j)}                                ]
      [ G_{2i,2k-1}^{(j)} - G_{2i,2i-1}^{(j)} G_{2i-1,2k-1}^{(j)}    G_{2i,2k}^{(j)} - G_{2i,2i-1}^{(j)} G_{2i-1,2k}^{(j)} ]   (3.3)

for i = 2, 3, ..., n/2s and k = 1, 2, ..., i-1, and

    f_i^{(j+1)} = [ f_{2i-1}^{(j)}                                  ]
                  [ f_{2i}^{(j)} - G_{2i,2i-1}^{(j)} f_{2i-1}^{(j)} ]   (3.4)

for i = 1, 2, ..., n/2s. We can simultaneously evaluate the products G_{2i,2i-1}^{(j)} G_{2i-1,2k-1}^{(j)}, G_{2i,2i-1}^{(j)} G_{2i-1,2k}^{(j)}, and G_{2i,2i-1}^{(j)} f_{2i-1}^{(j)} in t′ = 1 + log s = (1 + j) steps using p′ = s²(2s + 1) processors per block. Performing the addition of the two pairs of matrices in (3.3) and the pair of vectors in (3.4) requires t″ = 1 step using p″ = s(2s + 1) < p′ processors. Since we have n(n - 2s)/8s² matrices G_{ik}^{(j+1)} and n/2s vectors f_i^{(j+1)}, each stage j+1 requires t^{(j+1)} = t′ + t″ = (2 + j) steps using

    p^{(j+1)} = (1/4)sn² - (1/2)s²n + sn

processors. Forming L^{(0)} and f^{(0)} requires 1 step using n(n+1)/2 processors. Hence, the total time for solving the linear system Lx = f is given by

    T_p = 1 + Σ_{j=0}^{γ-1} (2 + j) = 1 + γ(γ+3)/2                    (3.5)

steps, using p = max[ max_{j=0,1,...,γ-1} {p^{(j+1)}}, n(n+1)/2 ] processors. For n ≥ 16 the maximum occurs at s = n/4, yielding

    p = n³/32 + n²/4                                                  (3.6)

processors. Now we introduce the following theorem for solving banded lower triangular systems.

Theorem 3.1. Let L be a banded lower triangular matrix of order n and bandwidth (m+1), (l_ik = 0 for i - k > m). Then the system Lx = f can be solved in T = (2 + log m) log n - (1/2)(log² m + log m) steps using no more than p = (1/2)m²n + O(mn) processors.

Proof: The matrix L and the vector f can be written in the form

        [ L_1              ]          [ f_1     ]
    L = [ R_2  L_2         ]      f = [ f_2     ]                     (3.7)
        [      R_3  L_3    ]          [  ⋮      ]
        [          ⋱    ⋱  ]          [ f_{n/m} ]

where the L_i and R_i are m×m lower triangular and upper triangular matrices, respectively.
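The dense recursion (3.2)-(3.4) can be simulated serially as follows (a NumPy sketch added for illustration, assuming n is a power of two; it is not the processor-optimized formulation, and it updates whole block rows rather than only the nonzero blocks):

```python
import numpy as np

def solve_block_elimination(L, f):
    """Serial simulation of Algorithm II for n a power of two.
    L^{(0)} = D L has unit diagonal; stage j applies
    D^{(j)} = diag([I 0; -G I]) to eliminate the s x s subdiagonal block
    coupling each adjacent pair of block rows (s = 2^j), as in
    (3.2)-(3.4).  After log n stages the matrix is the identity."""
    n = len(f)
    d = np.diag(L)
    A = L / d[:, None]          # L^{(0)} = D L
    b = f / d                   # f^{(0)} = D f
    s = 1
    while s < n:
        for i in range(0, n, 2 * s):
            # Subdiagonal block of the current pair of block rows.
            G = A[i + s:i + 2 * s, i:i + s].copy()
            # Premultiply the pair by [I 0; -G I].
            A[i + s:i + 2 * s, :] -= G @ A[i:i + s, :]
            b[i + s:i + 2 * s] -= G @ b[i:i + s]
        s *= 2
    return b                    # L^{(gamma)} = I, so x = f^{(gamma)}
```

In the parallel model, one sweep of the inner loop is a single stage costing (2 + j) steps; the Python loop over i is serial only because this is a simulation.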
Premultiplying both sides of Lx = f by the matrix D = diag[ L_1^{-1}, L_2^{-1}, ..., L_{n/m}^{-1} ], we obtain the system L^{(0)} x = f^{(0)}, where

              [ I                      ]              [ f_1^{(0)}     ]
    L^{(0)} = [ G_2^{(0)}  I           ]    f^{(0)} = [ f_2^{(0)}     ]   (3.8)
              [       G_3^{(0)}  I     ]              [    ⋮          ]
              [             ⋱      ⋱   ]              [ f_{n/m}^{(0)} ]

and

    G_i^{(0)} = L_i^{-1} R_i  (i = 2, 3, ..., n/m),
    f_i^{(0)} = L_i^{-1} f_i  (i = 1, 2, ..., n/m).                   (3.9)

From Theorem 2.1 and Lemma 2.2 we can show that solving the systems in (3.9) requires t^{(0)} = (1/2)log²m + (3/2)log m + 2 steps and, by (2.9), no more than p^{(0)} = (21/128)m²n + O(mn) processors.

Now we follow the above algorithm and form the sequence L^{(j+1)} = D^{(j)} L^{(j)}, f^{(j+1)} = D^{(j)} f^{(j)}, for j = 0, 1, ..., log(n/m) - 1. Each matrix L^{(j)} is of the form

              [ I                     ]
    L^{(j)} = [ G_2^{(j)}  I          ]                               (3.10)
              [       G_3^{(j)}  I    ]
              [             ⋱     ⋱   ]

where the identity blocks are of order r = 2^j·m. Therefore, D^{(j)} = diag[ (L_i^{(j)})^{-1} ], i = 1, 2, ..., n/2r, in which

    L_i^{(j)} = [ I             0 ]
                [ G_{2i}^{(j)}  I ] .                                 (3.11)

Hence, for stage j+1 we have

    G_{i+1}^{(j+1)} = [ 0   G_{2i+1}^{(j)}                 ]
                      [ 0   -G_{2i+2}^{(j)} G_{2i+1}^{(j)} ] ,   i = 1, 2, ..., n/2r - 1   (3.12)

and

    f_i^{(j+1)} = [ f_{2i-1}^{(j)}                          ]
                  [ f_{2i}^{(j)} - G_{2i}^{(j)} f_{2i-1}^{(j)} ] ,   i = 1, 2, ..., n/2r.   (3.13)

Observing that all except the last m columns of each matrix G_i^{(j)} are zero, the products G_{2i+2}^{(j)} G_{2i+1}^{(j)} and G_{2i}^{(j)} f_{2i-1}^{(j)} can be evaluated simultaneously in 1 + log m steps using p′ = (1/2)m(m+1)n - rm² processors. Finally, one more step is needed to evaluate the f_i^{(j+1)}, using p″ < p′ processors. Therefore, t^{(j+1)} = 2 + log m steps and p^{(j+1)} = max{p′, p″} = (1/2)m(m+1)n - rm² processors. The total time is thus given by

    T = t^{(0)} + Σ_{j=0}^{log(n/m) - 1} t^{(j+1)} = γ(2 + μ) - (μ/2)(μ + 1)   (3.14)

steps, where γ = log n and μ = log m, with p = max_j {p^{(j)}} processors. For m = n/2, p = p^{(0)}; otherwise

    p = p^{(1)} = (1/2)m(m+1)n - m³                                   (3.15)

processors.

Lemma 3.2. The algorithm of Theorem 3.1 requires Q_p = m²n log(n/m) + O(mn log(n/m)) operations, with redundancy R = Q_p/Q_1 = O(m log(n/m)).

Again, we can show that for p̂ = m²n/K < p processors, the time required for the algorithm of Theorem 3.1 becomes [5]

    T_p̂ ≤ T_p + 2K log(n/m) .                                        (3.16)

Hence, provided K = O(log m), T_p̂ remains O(log m log n).

4. Error Analysis

We present an error analysis of the algorithm of Section 2.
Let * denote any of the four arithmetic operations; then a floating-point operation satisfies

    fl(a * b) = (a * b)(1 + δ) ,   |δ| ≤ ε ,

where ε is the unit roundoff,

    ε = (1/2)β^{1-t}   (rounded operations)                           (4.1)
        β^{1-t}        (chopped operations)

in which β is the radix of the machine and t is the length of the fraction. At each stage j of the algorithm (j = 1, 2, ..., γ), where γ = log n, we perform inner products of vectors u and v the maximum order of which is (1 + 2^j). Performing each inner product in ⌈log(1 + 2^j)⌉ = (j + 1) steps, we can show that

    |fl(u^T v) - u^T v| ≤ (j + 2) ε |u|^T |v| ,                       (4.2)

where |u| denotes the vector with elements |u_i|; terms of O(ε²) are neglected here and below. Without loss of generality, we assume that ||L|| ≤ 1. If M̃_i^{(j)} = fl(M_i^{(j)}) and f̃^{(j)} = fl(f^{(j)}), then the algorithm of Theorem 2.1 is given by

    M̃_i^{(j+1)} = M̃_{2i+1}^{(j)} M̃_{2i}^{(j)} + E_i^{(j+1)} ,   i = 1, 2, ..., n/2^{j+1} - 1,
    f̃^{(j+1)} = M̃_1^{(j)} f̃^{(j)} + e^{(j+1)}                       (4.3)

for j = 0, 1, ..., γ-1, where E_i^{(j+1)} is lower triangular and

    |E_i^{(j+1)}| ≤ (j + 2) ε |M̃_{2i+1}^{(j)}| |M̃_{2i}^{(j)}| ,
    |e^{(j+1)}| ≤ (j + 2) ε |M̃_1^{(j)}| |f̃^{(j)}| .                  (4.4)

Again, |M| denotes the matrix with elements |μ_kℓ|, and an inequality in matrix or vector form applies separately to corresponding elements on each side. We introduce the following lemma in order to establish bounds on the norms of E_i^{(j)} and e^{(j)}.

Lemma 4.1. Suppose

    L = [ L_11   0    ]
        [ L_21   L_22 ]

with ||L|| ≤ 1. Then

    ||L_11^{-1}|| ≤ ||L^{-1}|| ,   ||L_22^{-1}|| ≤ ||L^{-1}|| ,   and   ||L̂^{-1}|| ≤ ||L^{-1}|| ,   (4.5)

where

    L̂ = [ L_11   0 ]
        [ L_21   I ] .

Proof: Since

    L^{-1} = [ L_11^{-1}                     0        ]
             [ -L_22^{-1} L_21 L_11^{-1}     L_22^{-1} ]

it is clear that the first two inequalities of (4.5) are satisfied. Also,

    L̂ = L [ I   0        ]
          [ 0   L_22^{-1} ] .

Consequently,

    L̂^{-1} = [ I   0    ] L^{-1} ,
             [ 0   L_22 ]

so that ||L̂^{-1}|| ≤ max(1, ||L_22||) ||L^{-1}|| ≤ ||L^{-1}||, since ||L_22|| ≤ ||L|| ≤ 1.

From the above lemma we see that

    ||M_i^{(j)}|| ≤ ||L^{-1}|| .                                      (4.6)

Also, from the fact that f^{(j+1)} = M_1^{(j)} f^{(j)}, (2.1), and (2.2), we obtain

    f^{(j)} = M_{2^j - 1} ⋯ M_2 M_1 f = M_{2^j}^{-1} ⋯ M_{n-1}^{-1} M_n^{-1} x .   (4.7)

Hence, from Lemma 4.1,

    ||f^{(j)}|| ≤ ||L|| ||x|| ≤ ||x|| .

Consequently, from (4.4),

    ||E_i^{(j+1)}|| ≤ (j + 2)(1 + 2^j) ε ||L^{-1}||² ,
    ||e^{(j+1)}|| ≤ (j + 2)(1 + 2^j) ε ||L^{-1}|| ||x|| .             (4.8)

For a complete analysis we need to consider the structure of E_i^{(j+1)}. From (2.3)-(2.5) we see that the first t_i^{(j+1)} = (2i+1)2^j - 1 rows of E_i^{(j+1)} are zero; hence M_ℓ E_i^{(j+1)} = E_i^{(j+1)} for ℓ ≤ t_i^{(j+1)}. Furthermore, the last n - t_i^{(j+1)} columns of E_i^{(j+1)} are zero; thus E_i^{(j+1)} M_ℓ = E_i^{(j+1)} for ℓ > t_i^{(j+1)}.

Let L(k, ℓ) = M_k^{-1} M_{k+1}^{-1} ⋯ M_ℓ^{-1} and γ_{ij} = i·2^j - 1. We verify that

    M̃_i^{(j)} = L^{-1}(1, γ_{i+1,j}) (I + P_i^{(j)}) L(1, γ_{ij})     (4.9)

where

    P_i^{(j)} = Σ_{s=1}^{j}  Σ_{k = i·2^{j-s}}^{(i+1)2^{j-s} - 1}  L(1, γ_{k+1,s}) E_k^{(s)} L^{-1}(1, γ_{k,s}) .   (4.10)

Unfortunately, however, we cannot take advantage of the above structural properties of E_i^{(j+1)} to simplify the expression under the summation in (4.10). It is trivial to show that (4.9) holds for j = 1. From (4.9) and the first of (4.3) we obtain

    M̃_i^{(j+1)} = L^{-1}(1, γ_{i+1,j+1}) (I + P_{2i+1}^{(j)})(I + P_{2i}^{(j)}) L(1, γ_{i,j+1}) + E_i^{(j+1)} .

Using (4.10) we easily show that

    P_i^{(j+1)} = P_{2i}^{(j)} + P_{2i+1}^{(j)} + L(1, γ_{i+1,j+1}) E_i^{(j+1)} L^{-1}(1, γ_{i,j+1}) .

Neglecting terms of O(ε²), we establish (4.9). From the second equation in (4.3), we have

    x̃ = f̃^{(γ+1)} = (M̃_n M̃_1^{(γ-1)} ⋯ M̃_1^{(1)} M̃_1^{(0)}) f + Σ_{k=1}^{γ+1} L^{-1}(2^k, n) e^{(k)} ,   (4.11)

where L^{-1}(2n, n) ≡ I. Also, from (4.9),

    M̃_n M̃_1^{(γ-1)} ⋯ M̃_1^{(1)} M̃_1^{(0)} = L^{-1}(I + P) ,

where P = Σ_{j=1}^{γ+1} P_j is a lower triangular matrix whose terms are of the form appearing in (4.10). If x̃ is the computed solution, then (4.11) reduces to

    x̃ - x = L^{-1} P f + e                                            (4.12)

in which

    e = Σ_{k=1}^{γ+1} L^{-1}(2^k, n) e^{(k)} .                        (4.13)

Let us define the constants a₁ and a₂ by

    a₁ = (1/2) n log² n + O(n log n) ,
    a₂ = 2 n log n + O(n) .                                           (4.14)

Thus, from (4.7)-(4.10) and Lemma 4.1 we have

    ||L^{-1} P f|| ≤ a₁ ε ||L^{-1}||² ||x|| ,
    ||e|| ≤ a₂ ε ||L^{-1}||² ||x|| .

Hence, from (4.12),

    ||x̃ - x|| / ||x|| ≤ a₁ ε ||L^{-1}||²                              (4.15)

(the contribution of a₂ being absorbed in the O(n log n) term of a₁). Premultiplying both sides of (4.12) by L, we obtain

    Lx̃ - f = P f + L e .                                              (4.16)

From the fact that f = Lx, (4.4), (4.10), and (4.13), we have

    |P f + L e| ≤ |L̄| |x̃| ,                                          (4.17)

where L̄ is a lower triangular matrix. Furthermore, from (4.8), ||L̄|| ≤ a₁ ε ||L^{-1}||². Hence, there exists a perturbation δL such that

    (L + δL) x̃ = f                                                    (4.18)

where

    ||δL|| ≤ a₁ ε ||L^{-1}||² .                                       (4.19)
(4.19) If we remove the assumption that ||L|| £l, then (4.19) becomes ||6L|| < a-| e k^(L) 1|L| | in which k(L) is the condition number of L. 18 REFERENCES 1 [1] S. C. Chen and D. J. Kuck, "Time and parallel processor bounds for linear recurrence systems," IEEE Transactions on Computers . Vol. C-24, No. 7, pp. 701-717, July 1975. [2] S. E. Orcutt, "Parallel solution methods for triangular linear systems of equations," Technical Report #77, 1974. Digital Systems Laboratory, Stanford Electronics Laboratories, Stanford, California. [3] J. H. Wilkinson, Rounding Errors in Algebraic Processes . Prentice-Hall. 1963. [4] A. S. Householder. The Theory of Matrices in Numerical Analysis , Blaisdell Publishing Co. . 1964. [5] R. P. Brent. "The parallel evaluation of general arithmetic expressions," JACM , Vol. 21, No. 2, pp. 201-206, April 1974. IBLIOGRAPHIC DATA HEET 1. Report No. UIUCDCS-R-75-766 'lull- and Subtitle SOLVING TRIANGULAR SYSTEMS ON A PARALLEL COMPUTER 3. Recipient's Accession No. 5. Report Date November 1975 Author(s') Ahmed H. Sameh and Richard P. Brent 8. [Performing Organization Rept. ^"UIUCDCS-R-75-766 Pcrlorming Organization Name and Address University of Illinois at Urbana-Champaign Department of Computer Science Urbana, IL 61801 10. Projcct/"rask/Work Unit No. 11. Contract /Grant No. US NSF DCR73-07980 A02 I. Sponsoring Organization Name and Address National Science Foundation Washington, D. C. 20550 13. Type of Report & Period Covered Technical Report 14. i. Supplementary Notes i. Abstracts In this paper we present alternative formulations of the algorithms of Chen and Kuck [1] in such a way as to make them easier to follow and implement. We als'" give a detailed error analysis, showing that if x is the computed solution of the triangular system Lx = f, then it satisfies the equation (L + 6L)x = f where 2 ? ||5L|| = 0(n log n) e k {L)||L||. Here ic(L) is the condition number of L, ||*|| denotes the «-norm and e is the unit roundoff. 
Key Words: error analysis, parallel computing, triangular systems