Hll 
 unll 
 
 HHBi 
 1 IHHi 
 
 llKr 
 
 lH H 
 
 1111111 
 
 — msssm 
 
 9& 
 
 villi m 
 
 HHHH1 
 ■BHHBil 
 
 ggg illilSSiI 
 
 W®m Mil 
 lipHiflil 
 
 ffigSm WW®. 
 
LIBRARY OF THE 
 
 UNIVERSITY OF ILLINOIS 
 
 AT URBANA-CHAMPAIGN 
 
 510.84 
 
 IJHor 
 "0.698-702. 
 cod. 2- 
 
 t 
 
 
The person charging this material is re- 
 sponsible £or its return to the library from 
 which it was withdrawn on or before the 
 Latest Date stamped below. 
 
 Theft, mutilation, and underlining of books 
 are reasons for disciplinary action and may 
 result in dismissal from the University. 
 
 UNIVERSITY OF ILLINOIS LIBRARY AT URBANA-CHAMPAIGN 
 
 RECTJ 
 
 L161 — O-1096 
 
u , *jti Report No. UIUCDCS-R-75-700 
 
 NSF - OCA - GJ -36936 - 000008 
 
 A PARALLEL QR-ALGORITHM FOR TRIDIAGONAL SYMMETRIC MATRICES 
 
 by 
 
 Ahmed H. Sameh and David J. Kuck 
 
 February 1975 
 
 DEPARTMENT OF COMPUTER SCIENCE 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 
 
 URBANA, ILLINOIS 
 
Digitized by the Internet Archive 
 in 2013 
 
 
 http://archive.org/details/parallelqralgori700same 
 
* 
 A Parallel QR-Algorithm for Tri diagonal Symmetric Matrices 
 
 by 
 
 Ahmed H. Sameh and David J. Kuck 
 
 This work was supported in part by US NSF GJ-36936; and in part by the 
 Advanced Research Project Agency of the Department of Defense under Contract 
 No. DAHC04-72-C-0001 . 
 
 ** 
 
 Department of Computer Science and the Center for Advanced Computation, 
 
 University of Illinois, Urbana, Illinois 61801. 
 
 *** 
 
 Department of Computer Science 
 
 University of Illinois, Urbana, Illinois 61801. 
 
Abstract 
 
 We show that if the size of the tridiagonal matrix in any given 
 iteration is n, then the parallel QR-algorithm requires 0(log ? n) steps 
 
 with 0(n) processors per iteration and no square roots. This results in 
 a speedup of 0(n/log 2 n) and an efficiency of 0(l/log ? n). 
 
 Keywords : 
 
 Givens reduction 
 Linear algebra 
 LU-decomposition 
 Parallelism 
 
 AMS(MOS) Subject Classification 
 Primary 68A20 
 Secondary 65F15 
 
1 . Introduction 
 
 Reinsch [1] proposed a stable QR algorithm for finding the eigen- 
 values of symmetric tri diagonal matrices that requires no square roots. 
 In this paper we present a version of his algorithm that is suitable for 
 parallel computers. We give upper bounds on the time and number of proces- 
 sors required, per iteration. 
 
 Throughout this paper we assume that any number of processors may 
 be used at any time, but we will give bounds on this number. Each processor 
 can perform any of the four arithmetic operations in one time step, and 
 there are no memory or data alignment time penalties. We also give the 
 following definitions. If p >^ 1 processors are used, we denote the computa- 
 tion time by T . The speedup of a parallel algorithm which uses p processors 
 over a serial algorithm is defined by S = T,/T >_ 1, and the corresponding 
 
 efficiency of the computation is given by E = S /p <_ 1. 
 
 We show that if the size of the tri diagonal matrix in any given 
 iteration is n, then the parallel QR algorithm requires 0(log 2 n) steps per 
 
 iteration (and no square roots) using 0(n) processors. This results in a 
 speedup of 0(n/log 2 n) and an efficiency of 0(l/log ? n). This algorithm 
 
 depends on a parallel scheme developed by Chen and Kuck [2], for solving 
 linear triangular systems. 
 
2. The Algorithm 
 
 One iteration of the QR algorithm can be expressed in the form 
 
 [3], 
 
 Q(A-kl) = R 
 A = RQ t 
 
 (1) 
 
 where k is an origin shift, Q is orthogonal, R is upper triangular, and we 
 assume that the tridiagonal matrix A is of order n = 2 m , for a positive 
 integer m. Note that the tridiagonal matrix A is similar to (A-kl) rather 
 than A. Let the diagonal elements of (A-kl) be denoted by a. (i = 1, 2, ..., 
 n) and the off-diagonal elements by b. (i = 2, 3, ..., n). The orthogonal 
 
 matrix Q is chosen as the product of (n-1) plane rotations, i.e., 
 
 n 1 4- 1 
 
 Q = p n -i P 2 P V where p i rotates rows j and j+1 annihilating the 
 
 element in the position (j+1, j), 
 
 pj+l 
 
 row j 
 
 -s 
 
Therefore, if P, 
 
 P* P^ (A-kl) is given by, 
 
 L 3 
 ^3 
 
 Pr-1 
 
 Y+l 
 
 'r+1 
 
 c r-l b r+l 
 
 'r+1 
 
 Y+2 
 
 then it follows by induction that, 
 
 c r = g r /P r 
 
 s r = W p r 
 
 r = 1 , 2, .. . , n - 1 
 
 Vl " c r-l c r b r+l + s r a r+l 
 
 (2) 
 
 where , 
 
 9 r = c r-1 a r - c r _ 2 s r-1 b r r = 1, 2, ..., n 
 
 (3) 
 
 in which c Q = 1, s Q = 0. From (2) and (3) we obtain the recurrence relation, 
 
 r = 2, 3, ..., n - 1 (4) 
 
 cp=c 1 a-c s,b 
 r K r r-1 r r-2 r-1 r 
 
 Let w = c n p., then (4) reduces to the second order linear recurrence 
 r r i=l ] 
 
relation, 
 
 w Q = 1, W] - 0l 
 
 w r = a r Vl - b r w r _ 2 
 
 r = 2, 3, ..., n - 1 
 
 Solving (5) is equivalent to solving the lower triangular system of 
 equations, 
 
 Lw = e 
 
 1 
 
 where, 
 
 w = (w Q , w^ ..., w n _ 1 ) , 
 
 and 
 
 ij 
 
 " a j 
 
 ii 
 
 
 
 1 = J 
 
 i = j + 1 
 
 i = j + 2 
 otherwise 
 
 2 2 2 2 
 
 Using the identity c + s = 1, and defining z = n p. , r 
 
 r r r ^ l 
 
 n - 1 , we obtain the first order linear recurrence relation 
 
 2 2 , 
 z = w = ] 
 
 = 1, 2, 
 
 2 - k2 2 . 2 
 2 r " b r + l 2 r-l + w r 
 
 r = 1, 2, ..., n - 1 
 
 This is equivalent to the lower triangular system, 
 
 Lz = w 
 
where, 
 
 ~t 2 2 2 
 
 w = (w Q , w r . .., w n-1 ) , 
 
 ~t _ / 2 2 2 x 
 
 Z VZq> Z] » • • • » Zp_] / > 
 
 r 
 
 1 1 - J 
 
 and tlj = 
 
 -b? i = j + 1 
 
 otherwise 
 
 Note that the squares of the subdi agonal elements rather than 
 the elements themselves are the given data. So squaring these elements 
 of the original matrix will cost only one step and (n-1) processors, 
 which is negligible compared to the time of finding all the eigenvalues 
 of the tri diagonal matrix. 
 
 The linear systems (6) and (8) are solved sequentially. Chen 
 and Kuck [2] have shown that any m-th order linear recurrence system of 
 n equations, or any unit lower triangular system of bandwidth m + 1, can 
 be solved in 
 
 1 2 
 T p = (2+log 2 m)log 2 n - ■ 2 -( lo 9 2 m+ ' lo 92 m ) 
 
 2 
 steps with p = 0(m n) processors. Specifically, the system (6) can be 
 
 solved in (31og 2 n - 1) steps with no more than 4n processors. Once w is 
 
 obtained, forming the right-hand side w in (8) requires one step with (n-1) 
 processors, and the system itself can be solved in 21og 2 n steps with (n-1) 
 
 processors. Thus, solving (6) and (8) requires T = 51og 9 n steps 
 
 Pi L 
 
with p, = 4n processors. From w and z we can obtain, 
 
 A - <H-i • 
 
 2 _ ..2, 2 
 
 cT = w/z , j = 1, 2, ..., n - 1 
 j j j 
 
 s j = ii/pj - 
 
 and, p^ = n 2 /!^ 
 
 2 
 
 in which n = (a n w n-1 - t/ w n _ 2 ) and where we have used (3) and (9). 
 
 It can be shown [4] that the orthogonal matrix Q is lower 
 
 Hessenberg in which Q.. = c. -, c. = w. -, w./z.. ■, z i (j = 1 , 2, .. ., n - 1 
 3 33 J-' J J -1 J J" 1 3 
 
 Q = c , , and Q. . +1 = s. . Therefore, the elements of the tridiagonal 
 matrix A = RQ are given by, 
 
 a. = E. t • + a., n s . 
 3 3 3 J + l 3 
 
 n 2 c 2 
 
 b?.i = pLi s ? » j = 1, 2, ..., n - 1 
 
 2 
 
 and a = £ a - c -■ b 
 n ^n n *n-l n 
 
 where, 
 
 Tj ■ (£, ♦ Pj 2 ) , 
 
 5 J = w j-l w j /z j • J-l. 2. .... B-l 
 
 and c n = w n-l /z n-l 
 
We show in Figure 1 that the elements of A can be evaluated in T =4 
 
 steps using p 2 = 3n processors. Thus, the total time required is given by 
 
 T=T +T =4+ 51og n steps with p = 4n processors. Reinsch [1] has 
 p P 1 p 2 3 2 
 
 shown that on a serial computer the time required for one iteration is 
 T, = lln steps. Hence, our parallel algorithm yields a speedup 
 
 S a lln/Blog^n with an efficiency E ■ ll/^Olog^n. 
 
 The linear systems (6) and (8) can be extremely ill -conditioned. 
 For example, for Wilkinson's matrix WZ, [5], a. = 11 - i (i = 1, 2, ..., 21) 
 
 and b. = 1, the condition number of L, in (6), in the first iteration is of 
 
 the order 10" . We simulated the above algorithm on an IBM 360/75, and in 
 spite of such high condition number we obtained all the eigenvalues correct 
 to at least 9 decimal places. Wilkinson [6] has shown that the computed 
 solution of a triangular system has low relative errors. Specifically, 
 system (8), (w. > 0, £.. > 0, £.. <_0 i / j), will have low relative error 
 
 in the solution no matter how ill-conditioned L may be. 
 
a j * W M w j Z J 
 
 (b j + i + z j z m> * Vi Vi 
 
 <4 z ;V 
 
 2 x-1 
 
 b ! +1 ""j + i ' j j + i z J' (z j z J-i> 
 
 j = 1, 2, ..., n - 1 
 
 j = 1, 2, .... n - 2 
 
 n-2 w n-l 'n-1 
 
 Figure 1 
 
REFERENCES 
 
 [1] C. H. Reinsch, "A stable, rational QR algorithm for the computa- 
 tion of the eigenvalues of an Hermitian, tri diagonal matrix," 
 Math. Comp., v. 25, 1971, pp. 591-597. MR 45 # 4621. 
 
 [2] S. C. Chen, and D. J. Kuck, "Time and parallel processor bounds 
 for linear recurrence systems," to appear IEEE Transactions 
 on Computers. 
 
 [3] H. Bowdler, R. S. Martin, C. Reinsch, and J. H. Wilkinson, "The 
 QR and QL algorithms for symmetric matrices," Numer. Math., 
 v. 11, 1968, pp. 293-306. 
 
 [4] P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders, "Methods 
 for modifying matrix factorizations," Math. Comp., v. 28, 
 1974, pp. 505-535. 
 
 [5] J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, 
 Oxford, 1965. MR 32 # 1894. 
 
 [6] J. H. Wilkinson, "Error analysis of direct methods of matrix 
 inversion," J. ACM, v. 8, 1961, pp. 281-330. 
 
iriiOGfiAPHIC DATA 
 
 UIUCDCS-R-75-700 
 
 3. Recipient's Accession No. 
 
 hi it k 
 
 \LLEL QR-ALGORITHM FOR TRIDIAGONAL SYMMETRIC MATRICES 
 
 5- Report bate 
 
 February 1975 
 
 h. Sameh and David J. Kuck 
 
 8. Performing Organization Rept 
 
 N '"UIUCDCS-R-75-700 
 
 . mization \ame and Address 
 
 iversity of Illinois at Urbana-Champaign 
 lartment of Computer Science 
 ana, IL 61801 
 
 10. Project Task Vtork Unit No. 
 
 1 1. ( ontract Grant No. 
 
 US NSF-GJ-36936 
 
 , mization Name and Address 
 
 ional Science Foundation 
 gton, D. C. 20550 
 
 13. I ypc ol Report & Period 
 ( overed 
 
 Technical Report 
 
 14. 
 
 We show that if the size of the tridiagonal matrix in any given iteration 
 -. then the parallel QR-algorithm requires 0(log 2 n) steps with 0(n) processors 
 
 iteration and no square roots. This results in a speedup of 0(n/log ? n) and an 
 iciencv of 0(l/log ? n) . 
 
 im<nt Analysis. 17o. Des< riptors 
 
 'lens reduction 
 iear algebra 
 I decomposition 
 
 aa 11 el isn 1 
 
 i n-I- nded firms 
 
 '< rroup 
 
 M 
 
 f ;i ted 
 
 19. Set urit) < lass (Thi- 
 Repori i 
 
 PN< IV ^[1- III) 
 
 20. sr, urn v ( lass i i h 
 P. it. 
 
 r\( i j - m ii n 
 
 21. No. of Paj 
 
 12 
 
 22. Pri. 
 
 USC OMM- C C 4 ■ 
 
. 
 
 

 E^fiffifiE 
 
 "MHB 
 
 Hi 
 
 HI 
 
 Warn HH 
 
 ■8 raw* Hra RUm 
 
 mkhSKH [fflBnrmnnUitrinrnT 
 
 HI 
 
 i$fiiiK3 Hi Ira IS 
 h m m 
 
 m ffi mm 
 
 KM 
 
 m 
 
 HUB Ihob» 
 tuft " 
 
 tag i4 i ffis In 
 
 H Si 
 
 rat H mh 
 
 ■Ml BMHun 
 
 ■ 
 
 
 
 Hh 
 
 * T<l 
 
 
 
 
 
 UP 
 
 
 
 K9| 
 
 Bn 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 sRH] 
 
 
 UM] 
 
 
 H 
 
 IMfi 
 
 m 
 
 
 KraftN