Report No. UIUCDCS-R-75-766                    NSF-OCA-DCR73-07980 A02-000014

              SOLVING TRIANGULAR SYSTEMS ON A PARALLEL COMPUTER *

                                      by

                   Ahmed H. Sameh** and Richard P. Brent***

                                November 1975

                       Department of Computer Science
                 University of Illinois at Urbana-Champaign
                          Urbana, Illinois 61801

  * This work was supported in part by the National Science Foundation
    under Grant No. US NSF DCR73-07980 A02.
 ** Department of Computer Science and the Center for Advanced Computation,
    University of Illinois, Urbana, Illinois 61801.
*** Computer Centre, The Australian National University, Canberra, Australia.

                                   ABSTRACT

In this paper we present alternative formulations of the algorithms of Chen and Kuck [1] in such a way as to make them easier to follow and implement. We also give a detailed error analysis, showing that if x̃ is the computed solution of the triangular system Lx = f, then it satisfies the equation (L + δL)x̃ = f, where ||δL|| = O(n log² n) ε κ²(L) ||L||. Here κ(L) is the condition number of L, ||·|| denotes the ∞-norm, and ε is the unit roundoff.

1. Introduction

Chen and Kuck [1] proposed a parallel algorithm for solving the linear system of equations Lx = f, where L is a unit lower triangular matrix of order n, in O(log² n) steps, using n³/68 + O(n²) processors. They also showed that if L is of bandwidth (m+1), i.e., l_ij = 0 for i - j > m, the time required is O(log m log n) using (3/2)m²n + O(mn) processors. The problem has also been investigated independently by Orcutt [2].
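The serial competitor referred to throughout the paper is ordinary forward substitution. The following sketch (added here for reference, in Python; it is not part of the original report) shows the O(n²)-operation sequential algorithm, whose dependence of x_i on x_1, ..., x_{i-1} is exactly what the parallel algorithms below break up:

```python
def forward_substitution(L, f):
    """Solve L x = f for lower triangular L by forward substitution.
    O(n^2) operations; inherently sequential, since computing x[i]
    requires all of x[0..i-1]."""
    n = len(f)
    x = [0.0] * n
    for i in range(n):
        s = f[i]
        for j in range(i):
            s -= L[i][j] * x[j]   # subtract the already-known terms
        x[i] = s / L[i][i]        # divide by the diagonal entry
    return x

# Example: L = [[2,0],[1,4]], f = [2,9] gives x = [1, 2].
print(forward_substitution([[2.0, 0.0], [1.0, 4.0]], [2.0, 9.0]))
```

This is the algorithm whose backward error bound (Wilkinson [3]) is cited in the next paragraph for comparison.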
Unfortunately, their derivations tend to obscure the fact that both algorithms are rather simple. Our objective here is to present these algorithms in such a way as to make them easier to follow and implement. In fact, using this new formulation we are able to establish that the number of processors required for the banded case is (1/2)m²n + O(mn) instead of (3/2)m²n + O(mn). Furthermore, we present an error analysis of the general algorithm showing that if x̃ is the computed solution then (L + δL)x̃ = f, where δL is bounded by ||δL|| = a(n) ε κ²(L) ||L||. Here ||·|| stands for the ∞-norm, a(n) = O(n log² n), ε is the unit roundoff, κ(L) is the condition number of L, and we assume that terms of O(ε²) are negligible. This bound, however, can be very large compared to that of the sequential algorithm, Wilkinson [3], where ||δL|| ≤ n ε ||L||.

Throughout this paper we assume that any number of processors can be used at any time, but we will give bounds on this number. Each processor can perform any of the four arithmetic operations in one time step. Furthermore, if p processors are used we denote the computation time by T_p. Throughout this paper log n ≡ log₂ n. Without loss of generality, we assume that diag(L) is different from the identity, and that n = 2^γ and m = 2^μ for positive integers γ and μ, where γ ≥ 4.

2. Algorithm I

The inverse of L can be expressed as [4]

    L^{-1} = M_n M_{n-1} ⋯ M_2 M_1 ,                                  (2.1)

where M_i = I - τ_i ℓ_i e_i^T, in which

    ℓ_i = (0, ..., 0, l_ii - 1, l_{i+1,i}, ..., l_{ni})^T

and τ_i = 1/l_ii, i = 1, 2, ..., n-1, and M_n = diag(1, ..., 1, 1/l_nn). The solution x is then given by

    x = M_n M_{n-1} ⋯ M_2 M_1 f .                                     (2.2)

Evaluating the product (M_{n-1} ⋯ M_1 f) in parallel requires γ = log n stages. At each stage (j+1) we evaluate the matrices M_i^{(j+1)} = M_{2i+1}^{(j)} M_{2i}^{(j)}, i = 1, 2, ..., n/2^{j+1} - 1, and the vectors f^{(j+1)} = M_1^{(j)} f^{(j)}, where M_i^{(0)} = M_i and f^{(0)} = f. Each matrix M_i^{(j)} is of the form

                [ I_q    0          0   ]
    M_i^{(j)} = [ 0      L_i^{(j)}  0   ]                             (2.3)
                [ 0      S_i^{(j)}  I_r ]

where L_i^{(j)} is a lower triangular matrix of order 2^j, q = q_i^{(j)} = i·2^j - 1, and r = r_i^{(j)} = (n+1) - (i+1)2^j. Clearly, L_i^{(0)} = τ_i and S_i^{(0)} = -τ_i (l_{i+1,i}, ..., l_{n,i})^T. Let T_i^{(j)} denote the square matrix of order 2^j consisting of the first 2^j rows of S_i^{(j)}, and let Ŝ_i^{(j)} denote the remaining rows of S_i^{(j)}. Thus,

    L_i^{(j+1)} = [ L_{2i}^{(j)}                   0             ]
                  [ L_{2i+1}^{(j)} T_{2i}^{(j)}    L_{2i+1}^{(j)} ]   (2.4)

and

    S_i^{(j+1)} = [ S_{2i+1}^{(j)} T_{2i}^{(j)} + Ŝ_{2i}^{(j)} ,  S_{2i+1}^{(j)} ] .   (2.5)

Furthermore, if f^{(j)} = (g_1^{(j)T}, g_2^{(j)T}, g_3^{(j)T})^T, where the vectors g_1^{(j)} and g_2^{(j)} are of orders 2^j - 1 and 2^j, respectively, then the three subvectors constituting f^{(j+1)} are given by

    f_1^{(j+1)} = g_1^{(j)} ,
    f_2^{(j+1)} = L_1^{(j)} g_2^{(j)} ,                               (2.6)
    f_3^{(j+1)} = S_1^{(j)} g_2^{(j)} + g_3^{(j)} ,

where the first 2^{j+1} - 1 components of f^{(j+1)} constitute the elements x_i (i = 1, 2, ..., 2^{j+1} - 1) of the solution vector x.

Forming the matrices M_i requires 1 step with n(n-1)/2 processors. The products in (2.4)-(2.6) can be evaluated simultaneously in 1 + log 2^j = (1 + j) steps. The additions in (2.5) and in the last of (2.6) are evaluated in one additional step. Hence, the time required for each stage j+1 is t^{(j+1)} = (2 + j) steps. Let s = 2^j; then the number of processors for evaluating the products L_{2i+1}^{(j)} T_{2i}^{(j)} and S_{2i+1}^{(j)} T_{2i}^{(j)} in (2.4) and (2.5) is given by p′ = (1/2)s²(s-1) and p″ = s² r_{2i+1}^{(j)}, respectively. Since the number of processors required for evaluating the products in (2.4)-(2.6) is more than that required for evaluating the additions, evaluating M_i^{(j+1)} requires

    p_i^{(j+1)} = p′ + p″ = (1/2)(2n+1)s² - (1/2)(4i+3)s³

processors, (i = 1, 2, ..., n/2s - 1). From (2.6), we see that evaluating f^{(j+1)} requires p_0 = (1/2)(2n+1)s - (3/2)s² processors. Consequently, stage j+1 requires

    p^{(j+1)} = Σ_{i=0}^{n/2s - 1} p_i^{(j+1)}
              = (3/2)s³ - (1/4)(5n + 12)s² + (1/4)(n² + 7n + 5)s

processors. The total time for solving the system is, therefore, given by

    T_p = 2 + Σ_{j=0}^{γ-1} (2 + j) = 2 + γ(γ+3)/2                    (2.7)

steps, where one further step is added for the final product M_n f^{(γ)}, using no more than

    p = max[ max_{j=0,1,...,γ-1} {p^{(j+1)}}, n(n-1)/2 ]

processors.
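As an illustration of Section 2 (a sketch added here, not from the report), the following NumPy script builds the elementary factors of (2.1) and combines them pairwise, which is the fan-in organization behind (2.2). It works with full n×n matrices, so it reproduces only the log n-stage structure, not the blocked form (2.3)-(2.6) or the processor counts:

```python
import numpy as np

def elementary_factors(L):
    """Elementary matrices M_1, ..., M_n of (2.1): M_i scales row i by
    1/l_ii and eliminates column i below the diagonal, so that
    L^{-1} = M_n M_{n-1} ... M_1."""
    n = L.shape[0]
    Ms = []
    for i in range(n):
        M = np.eye(n)
        M[i, i] = 1.0 / L[i, i]                  # tau_i = 1/l_ii
        M[i + 1:, i] = -L[i + 1:, i] / L[i, i]   # -tau_i * l_ki, k > i
        Ms.append(M)
    return Ms

def solve_by_factor_tree(L, f):
    """Fan-in evaluation of x = M_n ... M_1 f as in (2.2): at each stage
    adjacent factors are combined pairwise, so only about log n stages
    of (simultaneously evaluable) matrix products remain."""
    Ms = elementary_factors(L)
    while len(Ms) > 1:
        paired = [Ms[k + 1] @ Ms[k] for k in range(0, len(Ms) - 1, 2)]
        if len(Ms) % 2:          # odd factor rides along to the next stage
            paired.append(Ms[-1])
        Ms = paired
    return Ms[0] @ f
```

For n = 2^γ the while-loop performs γ pairwise-combination stages, mirroring (2.4)-(2.6); the report's algorithm avoids these full n×n products by exploiting the block structure (2.3).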
For n ≥ 16 the maximum is achieved for s = n/8, yielding

    p = (15/1024)n³ + (11/64)n² + (5/32)n                             (2.8)

or p = (15/1024)n³ + O(n²) processors. For n ≥ 256, p ≤ n³/68. Hence, we have established the following theorem.

Theorem 2.1. The triangular system of equations Lx = f, where L is a lower triangular matrix of order n, can be solved in T_p = (1/2)log²n + (3/2)log n + 2 steps, using no more than p = (15/1024)n³ + O(n²) processors.

Lemma 2.2. Let L be defined as in Theorem 2.1. Then L^{-1} can be obtained in the same time required for solving the system Lx = f, using no more than (21n³ + 60n²)/128 processors.

Proof: The time bound is trivially established by Theorem 2.1. However, the maximum number of processors used occurs at s = n/4, rather than n/8, yielding the above number of processors. In fact, we can show that for h ≥ n right-hand sides the maximum number of processors is given by

    p = (h/128)(21n² + 60n)                                           (2.9)

processors.

We state, without proof, the following lemma regarding the number of operations (the number of nodes in the computational graph) Q_p required for the above algorithm.

Lemma 2.3. The algorithm of Theorem 2.1 requires Q_p = (2/21)n³ + O(n²) operations, with a redundancy R = Q_p/Q_1 = O(n), where Q_1 is the operation count of the sequential algorithm.

Now a lemma in [5] states that if we have p̂ < p processors, then the time required for solving the triangular system is given by

    T_p̂ ≤ T_p + (Q_p - T_p)/p̂ .                                      (2.10)

Let p̂ = n³/K; then

    T_p̂ ≤ T_p + (2/21) K (1 + δ)                                     (2.11)

where δ < 9/n. For n ≥ 256,

    T_p̂ ≤ T_p + 0.10 K .                                             (2.12)

Hence, as long as K = O(log² n), T_p̂ remains O(log² n).

3. Algorithm II

In this section we present an algorithm that requires the same time as Algorithm I but roughly twice the number of processors. The advantage of this algorithm is that it can be adapted to band systems. Let L^{(0)} = DL and f^{(0)} = Df, where D = diag(1/l_11, 1/l_22, ..., 1/l_nn), so that the elements of L^{(0)} are l_ik^{(0)} = l_ik/l_ii. We form matrices D^{(j)}, (j = 0, 1, ..., γ-1), such that if L^{(j+1)} = D^{(j)} L^{(j)} and f^{(j+1)} = D^{(j)} f^{(j)}, then L^{(γ)} = I and x = f^{(γ)}. Each matrix L^{(j)} is of the form

              [ I                              ]
    L^{(j)} = [ G_21^{(j)}   I                 ]
              [ G_31^{(j)}   G_32^{(j)}   I    ]                      (3.1)
              [    ⋮             ⋮         ⋱   ]

in which the diagonal identity blocks are of order s = 2^j, and f^{(j)} = (f_1^{(j)T}, f_2^{(j)T}, ..., f_{n/s}^{(j)T})^T, each f_i^{(j)} being of order s. Therefore,

    (D^{(j)})^{-1} = diag( L_1^{(j)}, L_2^{(j)}, ..., L_{n/2s}^{(j)} ),

where

    L_i^{(j)} = [ I                  0 ]
                [ G_{2i,2i-1}^{(j)}  I ] .                            (3.2)

Thus, for stage (j+1) we have

    G_{ik}^{(j+1)} =
      [ G_{2i-1,2k-1}^{(j)}                                    G_{2i-1,2k}^{(j)}                                ]
      [ G_{2i,2k-1}^{(j)} - G_{2i,2i-1}^{(j)} G_{2i-1,2k-1}^{(j)}    G_{2i,2k}^{(j)} - G_{2i,2i-1}^{(j)} G_{2i-1,2k}^{(j)} ]   (3.3)

for i = 2, 3, ..., n/2s and k = 1, 2, ..., i-1, and

    f_i^{(j+1)} = [ f_{2i-1}^{(j)}                                  ]
                  [ f_{2i}^{(j)} - G_{2i,2i-1}^{(j)} f_{2i-1}^{(j)} ]   (3.4)

for i = 1, 2, ..., n/2s. We can simultaneously evaluate the products G_{2i,2i-1}^{(j)} G_{2i-1,2k-1}^{(j)}, G_{2i,2i-1}^{(j)} G_{2i-1,2k}^{(j)}, and G_{2i,2i-1}^{(j)} f_{2i-1}^{(j)} in t′ = 1 + log s = (1 + j) steps using p′ = s²(2s + 1) processors per block. Performing the addition of the two pairs of matrices in (3.3) and the pair of vectors in (3.4) requires t″ = 1 step using p″ = s(2s + 1) < p′ processors. Since we have n(n - 2s)/8s² matrices G_{ik}^{(j+1)} and n/2s vectors f_i^{(j+1)}, each stage j+1 requires t^{(j+1)} = t′ + t″ = (2 + j) steps using

    p^{(j+1)} = (1/4)sn² - (1/2)s²n + sn

processors. Forming L^{(0)} and f^{(0)} requires 1 step using n(n+1)/2 processors. Hence, the total time for solving the linear system Lx = f is given by

    T_p = 1 + Σ_{j=0}^{γ-1} (2 + j) = 1 + γ(γ+3)/2                    (3.5)

steps, using p = max[ max_{j=0,1,...,γ-1} {p^{(j+1)}}, n(n+1)/2 ] processors. For n ≥ 16 the maximum occurs at s = n/4, yielding

    p = n³/32 + n²/4                                                  (3.6)

processors. Now we introduce the following theorem for solving banded lower triangular systems.

Theorem 3.1. Let L be a banded lower triangular matrix of order n and bandwidth (m+1), (l_ik = 0 for i - k > m). Then the system Lx = f can be solved in T = (2 + log m) log n - (1/2)(log² m + log m) steps using no more than p = (1/2)m²n + O(mn) processors.

Proof: The matrix L and the vector f can be written in the form

        [ L_1              ]          [ f_1     ]
    L = [ R_2  L_2         ]      f = [ f_2     ]                     (3.7)
        [      R_3  L_3    ]          [  ⋮      ]
        [          ⋱    ⋱  ]          [ f_{n/m} ]

where the L_i and R_i are m×m lower triangular and upper triangular matrices, respectively.
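The dense recursion (3.2)-(3.4) can be simulated serially as follows (a NumPy sketch added for illustration, assuming n is a power of two; it is not the processor-optimized formulation, and it updates whole block rows rather than only the nonzero blocks):

```python
import numpy as np

def solve_block_elimination(L, f):
    """Serial simulation of Algorithm II for n a power of two.
    L^{(0)} = D L has unit diagonal; stage j applies
    D^{(j)} = diag([I 0; -G I]) to eliminate the s x s subdiagonal block
    coupling each adjacent pair of block rows (s = 2^j), as in
    (3.2)-(3.4).  After log n stages the matrix is the identity."""
    n = len(f)
    d = np.diag(L)
    A = L / d[:, None]          # L^{(0)} = D L
    b = f / d                   # f^{(0)} = D f
    s = 1
    while s < n:
        for i in range(0, n, 2 * s):
            # Subdiagonal block of the current pair of block rows.
            G = A[i + s:i + 2 * s, i:i + s].copy()
            # Premultiply the pair by [I 0; -G I].
            A[i + s:i + 2 * s, :] -= G @ A[i:i + s, :]
            b[i + s:i + 2 * s] -= G @ b[i:i + s]
        s *= 2
    return b                    # L^{(gamma)} = I, so x = f^{(gamma)}
```

In the parallel model, one sweep of the inner loop is a single stage costing (2 + j) steps; the Python loop over i is serial only because this is a simulation.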
Premultiplying both sides of Lx = f by the matrix D = diag[ L_1^{-1}, L_2^{-1}, ..., L_{n/m}^{-1} ], we obtain the system L^{(0)} x = f^{(0)}, where

              [ I                      ]              [ f_1^{(0)}     ]
    L^{(0)} = [ G_2^{(0)}  I           ]    f^{(0)} = [ f_2^{(0)}     ]   (3.8)
              [       G_3^{(0)}  I     ]              [    ⋮          ]
              [             ⋱      ⋱   ]              [ f_{n/m}^{(0)} ]

and

    G_i^{(0)} = L_i^{-1} R_i  (i = 2, 3, ..., n/m),
    f_i^{(0)} = L_i^{-1} f_i  (i = 1, 2, ..., n/m).                   (3.9)

From Theorem 2.1 and Lemma 2.2 we can show that solving the systems in (3.9) requires t^{(0)} = (1/2)log²m + (3/2)log m + 2 steps and, by (2.9), no more than p^{(0)} = (21/128)m²n + O(mn) processors.

Now we follow the above algorithm and form the sequence L^{(j+1)} = D^{(j)} L^{(j)}, f^{(j+1)} = D^{(j)} f^{(j)}, for j = 0, 1, ..., log(n/m) - 1. Each matrix L^{(j)} is of the form

              [ I                     ]
    L^{(j)} = [ G_2^{(j)}  I          ]                               (3.10)
              [       G_3^{(j)}  I    ]
              [             ⋱     ⋱   ]

where the identity blocks are of order r = 2^j·m. Therefore, D^{(j)} = diag[ (L_i^{(j)})^{-1} ], i = 1, 2, ..., n/2r, in which

    L_i^{(j)} = [ I             0 ]
                [ G_{2i}^{(j)}  I ] .                                 (3.11)

Hence, for stage j+1 we have

    G_{i+1}^{(j+1)} = [ 0   G_{2i+1}^{(j)}                 ]
                      [ 0   -G_{2i+2}^{(j)} G_{2i+1}^{(j)} ] ,   i = 1, 2, ..., n/2r - 1   (3.12)

and

    f_i^{(j+1)} = [ f_{2i-1}^{(j)}                          ]
                  [ f_{2i}^{(j)} - G_{2i}^{(j)} f_{2i-1}^{(j)} ] ,   i = 1, 2, ..., n/2r.   (3.13)

Observing that all except the last m columns of each matrix G_i^{(j)} are zero, the products G_{2i+2}^{(j)} G_{2i+1}^{(j)} and G_{2i}^{(j)} f_{2i-1}^{(j)} can be evaluated simultaneously in 1 + log m steps using p′ = (1/2)m(m+1)n - rm² processors. Finally, one more step is needed to evaluate the f_i^{(j+1)}, using p″ < p′ processors. Therefore, t^{(j+1)} = 2 + log m steps and p^{(j+1)} = max{p′, p″} = (1/2)m(m+1)n - rm² processors. The total time is thus given by

    T = t^{(0)} + Σ_{j=0}^{log(n/m) - 1} t^{(j+1)} = γ(2 + μ) - (μ/2)(μ + 1)   (3.14)

steps, where γ = log n and μ = log m, with p = max_j {p^{(j)}} processors. For m = n/2, p = p^{(0)}; otherwise

    p = p^{(1)} = (1/2)m(m+1)n - m³                                   (3.15)

processors.

Lemma 3.2. The algorithm of Theorem 3.1 requires Q_p = m²n log(n/m) + O(mn log(n/m)) operations, with redundancy R = Q_p/Q_1 = O(m log(n/m)).

Again, we can show that for p̂ = m²n/K < p processors, the time required for the algorithm of Theorem 3.1 becomes [5]

    T_p̂ ≤ T_p + 2K log(n/m) .                                        (3.16)

Hence, provided K = O(log m), T_p̂ remains O(log m log n).

4. Error Analysis

We present an error analysis of the algorithm of Section 2.
Let * denote any of the four arithmetic operations; then a floating-point operation satisfies

    fl(a * b) = (a * b)(1 + δ) ,   |δ| ≤ ε ,

where ε is the unit roundoff,

    ε = (1/2)β^{1-t}   (rounded operations)                           (4.1)
        β^{1-t}        (chopped operations)

in which β is the radix of the machine and t is the length of the fraction. At each stage j of the algorithm (j = 1, 2, ..., γ), where γ = log n, we perform inner products of vectors u and v the maximum order of which is (1 + 2^j). Performing each inner product in ⌈log(1 + 2^j)⌉ = (j + 1) steps, we can show that

    |fl(u^T v) - u^T v| ≤ (j + 2) ε |u|^T |v| ,                       (4.2)

where |u| denotes the vector with elements |u_i|; terms of O(ε²) are neglected here and below. Without loss of generality, we assume that ||L|| ≤ 1. If M̃_i^{(j)} = fl(M_i^{(j)}) and f̃^{(j)} = fl(f^{(j)}), then the algorithm of Theorem 2.1 is given by

    M̃_i^{(j+1)} = M̃_{2i+1}^{(j)} M̃_{2i}^{(j)} + E_i^{(j+1)} ,   i = 1, 2, ..., n/2^{j+1} - 1,
    f̃^{(j+1)} = M̃_1^{(j)} f̃^{(j)} + e^{(j+1)}                       (4.3)

for j = 0, 1, ..., γ-1, where E_i^{(j+1)} is lower triangular and

    |E_i^{(j+1)}| ≤ (j + 2) ε |M̃_{2i+1}^{(j)}| |M̃_{2i}^{(j)}| ,
    |e^{(j+1)}| ≤ (j + 2) ε |M̃_1^{(j)}| |f̃^{(j)}| .                  (4.4)

Again, |M| denotes the matrix with elements |μ_kℓ|, and an inequality in matrix or vector form applies separately to corresponding elements on each side. We introduce the following lemma in order to establish bounds on the norms of E_i^{(j)} and e^{(j)}.

Lemma 4.1. Suppose

    L = [ L_11   0    ]
        [ L_21   L_22 ]

with ||L|| ≤ 1. Then

    ||L_11^{-1}|| ≤ ||L^{-1}|| ,   ||L_22^{-1}|| ≤ ||L^{-1}|| ,   and   ||L̂^{-1}|| ≤ ||L^{-1}|| ,   (4.5)

where

    L̂ = [ L_11   0 ]
        [ L_21   I ] .

Proof: Since

    L^{-1} = [ L_11^{-1}                     0        ]
             [ -L_22^{-1} L_21 L_11^{-1}     L_22^{-1} ]

it is clear that the first two inequalities of (4.5) are satisfied. Also,

    L̂ = L [ I   0        ]
          [ 0   L_22^{-1} ] .

Consequently,

    L̂^{-1} = [ I   0    ] L^{-1} ,
             [ 0   L_22 ]

so that ||L̂^{-1}|| ≤ max(1, ||L_22||) ||L^{-1}|| ≤ ||L^{-1}||, since ||L_22|| ≤ ||L|| ≤ 1.

From the above lemma we see that

    ||M_i^{(j)}|| ≤ ||L^{-1}|| .                                      (4.6)

Also, from the fact that f^{(j+1)} = M_1^{(j)} f^{(j)}, (2.1), and (2.2), we obtain

    f^{(j)} = M_{2^j - 1} ⋯ M_2 M_1 f = M_{2^j}^{-1} ⋯ M_{n-1}^{-1} M_n^{-1} x .   (4.7)

Hence, from Lemma 4.1,

    ||f^{(j)}|| ≤ ||L|| ||x|| ≤ ||x|| .

Consequently, from (4.4),

    ||E_i^{(j+1)}|| ≤ (j + 2)(1 + 2^j) ε ||L^{-1}||² ,
    ||e^{(j+1)}|| ≤ (j + 2)(1 + 2^j) ε ||L^{-1}|| ||x|| .             (4.8)

For a complete analysis we need to consider the structure of E_i^{(j+1)}. From (2.3)-(2.5) we see that the first t_i^{(j+1)} = (2i+1)2^j - 1 rows of E_i^{(j+1)} are zero; hence M_ℓ E_i^{(j+1)} = E_i^{(j+1)} for ℓ ≤ t_i^{(j+1)}. Furthermore, the last n - t_i^{(j+1)} columns of E_i^{(j+1)} are zero; thus E_i^{(j+1)} M_ℓ = E_i^{(j+1)} for ℓ > t_i^{(j+1)}.

Let L(k, ℓ) = M_k^{-1} M_{k+1}^{-1} ⋯ M_ℓ^{-1} and γ_{ij} = i·2^j - 1. We verify that

    M̃_i^{(j)} = L^{-1}(1, γ_{i+1,j}) (I + P_i^{(j)}) L(1, γ_{ij})     (4.9)

where

    P_i^{(j)} = Σ_{s=1}^{j}  Σ_{k = i·2^{j-s}}^{(i+1)2^{j-s} - 1}  L(1, γ_{k+1,s}) E_k^{(s)} L^{-1}(1, γ_{k,s}) .   (4.10)

Unfortunately, however, we cannot take advantage of the above structural properties of E_i^{(j+1)} to simplify the expression under the summation in (4.10). It is trivial to show that (4.9) holds for j = 1. From (4.9) and the first of (4.3) we obtain

    M̃_i^{(j+1)} = L^{-1}(1, γ_{i+1,j+1}) (I + P_{2i+1}^{(j)})(I + P_{2i}^{(j)}) L(1, γ_{i,j+1}) + E_i^{(j+1)} .

Using (4.10) we easily show that

    P_i^{(j+1)} = P_{2i}^{(j)} + P_{2i+1}^{(j)} + L(1, γ_{i+1,j+1}) E_i^{(j+1)} L^{-1}(1, γ_{i,j+1}) .

Neglecting terms of O(ε²), we establish (4.9). From the second equation in (4.3), we have

    x̃ = f̃^{(γ+1)} = (M̃_n M̃_1^{(γ-1)} ⋯ M̃_1^{(1)} M̃_1^{(0)}) f + Σ_{k=1}^{γ+1} L^{-1}(2^k, n) e^{(k)} ,   (4.11)

where L^{-1}(2n, n) ≡ I. Also, from (4.9),

    M̃_n M̃_1^{(γ-1)} ⋯ M̃_1^{(1)} M̃_1^{(0)} = L^{-1}(I + P) ,

where P = Σ_{j=1}^{γ+1} P_j is a lower triangular matrix whose terms are of the form appearing in (4.10). If x̃ is the computed solution, then (4.11) reduces to

    x̃ - x = L^{-1} P f + e                                            (4.12)

in which

    e = Σ_{k=1}^{γ+1} L^{-1}(2^k, n) e^{(k)} .                        (4.13)

Let us define the constants a₁ and a₂ by

    a₁ = (1/2) n log² n + O(n log n) ,
    a₂ = 2 n log n + O(n) .                                           (4.14)

Thus, from (4.7)-(4.10) and Lemma 4.1 we have

    ||L^{-1} P f|| ≤ a₁ ε ||L^{-1}||² ||x|| ,
    ||e|| ≤ a₂ ε ||L^{-1}||² ||x|| .

Hence, from (4.12),

    ||x̃ - x|| / ||x|| ≤ a₁ ε ||L^{-1}||²                              (4.15)

(the contribution of a₂ being absorbed in the O(n log n) term of a₁). Premultiplying both sides of (4.12) by L, we obtain

    Lx̃ - f = P f + L e .                                              (4.16)

From the fact that f = Lx, (4.4), (4.10), and (4.13), we have

    |P f + L e| ≤ |L̄| |x̃| ,                                          (4.17)

where L̄ is a lower triangular matrix. Furthermore, from (4.8), ||L̄|| ≤ a₁ ε ||L^{-1}||². Hence, there exists a perturbation δL such that

    (L + δL) x̃ = f                                                    (4.18)

where

    ||δL|| ≤ a₁ ε ||L^{-1}||² .                                       (4.19)
(4.19) If we remove the assumption that ||L|| £l, then (4.19) becomes ||6L|| < a-| e k^(L) 1|L| | in which k(L) is the condition number of L. 18 REFERENCES 1 [1] S. C. Chen and D. J. Kuck, "Time and parallel processor bounds for linear recurrence systems," IEEE Transactions on Computers . Vol. C-24, No. 7, pp. 701-717, July 1975. [2] S. E. Orcutt, "Parallel solution methods for triangular linear systems of equations," Technical Report #77, 1974. Digital Systems Laboratory, Stanford Electronics Laboratories, Stanford, California. [3] J. H. Wilkinson, Rounding Errors in Algebraic Processes . Prentice-Hall. 1963. [4] A. S. Householder. The Theory of Matrices in Numerical Analysis , Blaisdell Publishing Co. . 1964. [5] R. P. Brent. "The parallel evaluation of general arithmetic expressions," JACM , Vol. 21, No. 2, pp. 201-206, April 1974. IBLIOGRAPHIC DATA HEET 1. Report No. UIUCDCS-R-75-766 'lull- and Subtitle SOLVING TRIANGULAR SYSTEMS ON A PARALLEL COMPUTER 3. Recipient's Accession No. 5. Report Date November 1975 Author(s') Ahmed H. Sameh and Richard P. Brent 8. [Performing Organization Rept. ^"UIUCDCS-R-75-766 Pcrlorming Organization Name and Address University of Illinois at Urbana-Champaign Department of Computer Science Urbana, IL 61801 10. Projcct/"rask/Work Unit No. 11. Contract /Grant No. US NSF DCR73-07980 A02 I. Sponsoring Organization Name and Address National Science Foundation Washington, D. C. 20550 13. Type of Report & Period Covered Technical Report 14. i. Supplementary Notes i. Abstracts In this paper we present alternative formulations of the algorithms of Chen and Kuck [1] in such a way as to make them easier to follow and implement. We als'" give a detailed error analysis, showing that if x is the computed solution of the triangular system Lx = f, then it satisfies the equation (L + 6L)x = f where 2 ? ||5L|| = 0(n log n) e k {L)||L||. Here ic(L) is the condition number of L, ||*|| denotes the «-norm and e is the unit roundoff. 
Key Words: error analysis, parallel computing, triangular systems