m 
 
 mm fM 
 
 RSI H BiaW H USffiiS 
 
 - 'S>'HfflHI mm 
 
 H WM WBBBi 
 
 BM SuSphB m 
 
 ■ 
 
 
 f'A 
 
 ■ 
 
 
 li 
 
 ^H 
 
 hb 
 
 « 
 
 
 ■ ■ 
 
 ■CMtj 
 
 1 
 
 
 
 ift 
 
LIBRARY OF THE 
 
 UNIVERSITY OF ILLINOIS 
 
 AT URBANA-CHAMPAIGN 
 
 510.84 
 
 I£6r 
 
 no. 480-489 
 
 cop 
 
Digitized by the Internet Archive 
 in 2013 
 
 http://archive.org/details/parallelcomputat487kuck 
 

PARALLEL COMPUTATION OF 
 EIGENVALUES OF REAL MATRICES 
 
 by 
 
 David J. Kuck 
 Ahmed Sameh 
 
 November 1, 1971 
 
 CAC Document No. 9 
 
 DHIHE 
 
 NOV 9 1972 
 
 y OF ILLINOIS 
 
 AT U iAMPAIGN 
 
 DEPARTMENT OF COMPUTER SCIENCE 
 
 UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN • URBANA, ILLINOIS 
 
CAC Document No. 9 
 DCS Report No. lj-87 
 
 PARALLEL COMPUTATION OF EIGENVALUES 
 OF REAL MATRICES 
 
 hy 
 
 David J. Kuck 
 Ahmed Saraeh 
 
 Center for Advanced Computation 
 University of Illinois at Urbana-Champaign 
 Urbana, Illinois 61801 
 
 November I, 1971 
 
 This work was supported in part by the Advanced Research Projects 
 Agency of the Department of Defense and was monitored by the U. S. 
 Army Research Office-Durham under Contract No. DAHCOU 72-C-0001. 
 
ABSTRACT 
 
 This paper describes the implementation of three standard matrix 
 eigenvalue computation methods on an array machine with high efficiency. 
 A brief description of the ILLIAC IV computer is provided as background 
 material. Three major sections follow- -the first two describe Jacobi and 
 Householder algorithms for real symmetric matrices, and the third describes 
 the ^R algorithm for real nonsymmetric matrices. Each of these sections is 
 divided into four parts. The theoretical background of the method is pre- 
 sented first. An ILLIAC IV implementation of the algorithm is presented and 
 timing estimates are included. Next, the efficiency (ratio of number of 
 computations actually performed to number of computations possible in a 
 given time) of the computation on an array machine is derived for primary 
 memory contained matrices. Finally, eigenvalue computations for matrices too 
 large to be contained in primary memory are discussed in terms of a head-per- 
 track secondary storage device. 
 
 It is shown that for Jacobi and Householder algorithms, parallel 
 computation efficiencies of c )Q°j are possible, while those of the Q,R algorithm 
 are rather low. 
 
TABLE OF CONTENTS 
 
 Page 
 
 1. INTRODUCTION 1 
 
 2. ILLIAC IV OVERVIEW 2 
 
 2.1 System Organization 2 
 
 2.2 Programming ILLIAC IV 1+ 
 
 3. JACOBI'S METHOD 5 
 
 3.1 Theoretical Background 5 
 
 3.2 Implementation 8 
 
 3.3 Efficiency 12 
 
 3.^1 Disk Considerations 15 
 
 k. HOUSEHOLDER'S METHOD 18 
 
 U.l Theoretical Background 18 
 
 k.2 Implementation of Householder Tridiagonalization. 20 
 
 ^+.3 Efficiency of the Tridiagonalization 22 
 
 h.k Disk Considerations 25 
 
 5. THE QR ALGORITHM 26 
 
 5.1 Theoretical Background 26 
 
 5-2 Implementation 31 
 
 5.3 Efficiency 31 
 
 REFERENCES 38 
 
 ACKNOWLEDGMENT 39 
 
1. INTRODUCTION 
 
 The implementations on a parallel computer of three of the most 
 commonly used techniques for finding the eigenvalues and eigenvectors of 
 matrices are discussed. These are Jacobi's, Householder's and the QR method. 
 Jacohi's algorithm for real symmetric matrices was modified in order to take 
 advantage of parallelism in such a way that instead of two, only one ortho- 
 gonal transformation would eliminate n off-diagonal elements, where n is the 
 size of the matrix. Storage schemes for all three methods on the parallel 
 computer are discussed in detail. It is well known that Jacobi's algorithm 
 proved to be highly efficient for parallel computation; yet, its use is 
 recommended primarily when evaluation of all the eigenvalues is required. 
 Even then, if high accuracy is also desired, it should be used in conjunction 
 with a scheme for establishing machine bounds for the eigensystem. 
 
 Householder's algorithm for tridiagonalization of a real symmetric 
 matrix has also proven to be very efficient on a parallel computer; a parallel 
 modification of the method of bisection can be used on the resulting tri- 
 diagonal matrix to obtain one or two eigenvalues or to investigate the distri- 
 bution of all of them. If, however, all the eigenvalues are required, the QR 
 method is usually preferred. The QR algorithm is discussed here only for non- 
 symmetric matrices. Implementation of this method on the parallel machine 
 proved to be less efficient than the other two, though it is the most reliable 
 for obtaining the eigenvalues of any matrix. For Householder's or the QR 
 algorithm, the eigenvectors of the tridiagonal or the upper Hessenberg mat- 
 rices may be obtained by the method of inverse iteration. If the transfor- 
 mation matrices used in reducing the matrix to the condensed form are stored, 
 the eigenvectors of the original matrix can be readily computed. 
 
 -1- 
 
2. ILLIAC IV OVERVIEW 
 
 2. 1 System Organization 
 
 ILLIAC IV is an array of coupled computers driven by instructions 
 from a common control unit. Each of the processing elements (PE's) has 20^-8 
 words of 64-bit memory with a cycle time under 350 nanoseconds. Each PE is 
 capable of 64-bit floating point multiplication in about 500 nanoseconds and 
 addition in about 350 nanoseconds. Two 32-bit floating point operations may 
 be performed in each PE in approximately the same time. The PE instruction 
 set is similar to that of conventiqnal machines, with two exceptions. First, 
 the PE's are capable of communicating data to four neighboring PE's by means 
 of routing instructions . Second, the PE ' s are able to set their own mode 
 registers to effectively disable or enable themselves. 
 
 Figure 1 shows 6k PE's, each having three arithmetic registers 
 (A, B, and C) and one protected programmer register (s). The registers, words, 
 and paths in Figure 1 are all 6k bits wide, except the PE index registers (XR),; 
 mode registers, and as noted. The mode register may be regarded as one bit 
 which may be used to block the participation of its PE in any action. The 
 routing registers are shown connected to neighbors at distances of + 1 and + 8; 
 similar end-around connections are provided at 1, 6k, etc. Programs and data 
 are stored in PE memory. Instructions are fetched by the control unit (CU) as 
 required from PE memory. 
 
 Figure 1 also shows the essential registers and paths in the CU and 
 their relations to the PE's. Instructions are decoded and control signals are 
 sent to the PE array from the control unit. Some instructions are executed 
 directly in the CU, e.g., the loading of CU accumulator registers (CAR's) with 
 program literals. Operand addresses may be indexed once in the CU by a CAR am 
 again separately in each PE by XR. It is possible to load the local data 
 buffer (6k words of 6k bits each) and CAR's from PE memory. Local data buffer 
 registers and CAR's may be loaded from each other. A broadcast instruction 
 allows the contents of a CAR to be transmitted simultaneously to all PE's. It 
 
 A 64-PE ILLIAC IV system is expected to be operating soon. For a more com- 
 plete description of the system see [1]. 
 
FIGURE 1 
 
 1 
 
 INSTRUCTION AND DATA FETCHING 
 
 UJ 
 
 o 
 
 to 
 
 m 
 
 CVI 
 
 r-l 
 
 INSTR. 
 
 FETCH 
 
 REQUEST 
 
 INSTRUCTION 
 BUFFER 
 
 ASSOCIATIVE 
 MEMORY 
 
 CONTROL UNIT 
 
 64 
 
 WORDS 
 
 64 
 WORDS 
 
 DATA 
 BUFFER 
 
 INSTRUCTION 
 DECODER 
 
 C U INDEXING 
 
 « 
 
 CAR 
 
 CAR 3 
 
 PE CONTROL SIGNALS v 
 
 DATA FETCHING a 
 DATA BROADCASTING^ 
 
 64 
 
 4— 
 
 57 
 
 2048 
 WORDS 
 
 MODE 
 
 XR 
 
 MIR 
 
 PE 
 MEMORY 
 
 64 
 
 BITS 
 
 PE 1 
 
 i-1 
 
 9 i-8 
 
 • • • 
 
 MODE 
 
 XR 
 
 MIR 
 
 PE 
 MEMORY 
 
 PEi 
 
 PE64 
 
 -3- 
 
is often convenient to manipulate all PE mode bits or a number from one PE in 
 a CAR. For this purpose, the broadcast path is bi-directional. 
 
 The complete ILLIAC IV system includes a Burroughs 6500 which per- 
 forms input/output operations, compilation, and contains an operating system 
 
 9 
 to control ILLIAC IV. There is a 10 bit, head-per-track disk with a kO 
 
 9 1 
 
 millisecond rotation speed and an effective transfer rate of 10 bits/ second. 
 
 This allows the loading of the 8 x 10 bit ILLIAC IV memory from the disk in 
 32 milliseconds. The input/output controller (IOC) contains a queuer for 
 disk requests and 2 bit buffer memory to smooth transmissions to and 
 from the slower B65OO memory, or an external input/ output device. 
 
 2.2 Programming ILLIAC IV 
 
 It is apparent that ILLIAC IV may be suited to the execution of 
 algorithms defined on arrays of data (e.g., matrix operations and certain 
 partial differential equations). However, there are two major difficulties 
 in using such a machine. One is that an algorithm must be devised which 
 allows identical computations to be performed simultaneously on a large num- 
 ber of operands. The other, closely related to the first, is that the data 
 must be so placed in the ILLIAC IV memory that over the course of the calcu- 
 lation few PE's are disabled. In other words, the algorithm and the data 
 storage allocation must be mutually designed for highly parallel computation. 
 For this reason, ILLIAC IV codes are expressed as two parts, a storage allo- 
 cation block, followed by a computational algorithm block. These codes may 
 be written in a high level language. See, for example, [2]. 
 
 -h- 
 
 
3- JACOBI'S METHOD 
 
 3.1 Theoretical Background 
 
 The classical Jacob i method reduces a symmetric matrix to the 
 diagonal form by a series of orthogonal transformations, [3]> 
 
 A 
 
 r+1 
 
 - <D A cp 
 
 ~Y* "Y* "Y* 
 
 (0) 
 
 Each transformation eliminates two off-diagonal elements. It is possible, 
 however, to eliminate n off-diagonal elements by one orthogonal transforma- 
 tion, where n is the order of the matrix A. This can be achieved if the 
 
 transformation matrices cp are of the form, 
 
 ^r 
 
 cp r = diag. ( T o rT 1 , 
 
 • , T 
 
 assuming that n is even, and where, 
 
 \ = 
 
 cos QL sin a 
 
 -sin OL 
 
 k 
 
 cos CX 
 
 (1) 
 
 Therefore, the matrix A _ will consist of 2 x 2 submatrices of the form 
 
 r+1 
 
 ,t" 
 
 ( Apq )r+1 = V pApqT 4 ' 
 
 For a diagonal submatrix, 
 
 p,q = (0, 1, 
 
 w. 
 
 a 2k,2k a 2k,2k+l 
 
 a 2k,2k+l a 2k+l,2k+l 
 
 r) 
 
 , 5 - 1) a^d. £ = A . 
 ' 2 ' pq qp 
 
 -5- 
 
if the elements of the transformation submatrix T, are chosen such that, 
 
 2 
 cos (X 
 
 where 
 
 W, 
 
 . 2 
 
 , sxn 
 
 a 
 
 (%>, 
 
 . and IV <f 
 
 (2) 
 
 Mt 
 
 m 
 
 SL,2k + l /( a 2 k+l,2k+l - a 2k,2kf V \ r) (3) 
 
 -)) 
 
 then the diagonal submatrix 
 
 w 
 
 will be of the form: 
 
 r+1 
 
 (\k) . 
 
 \ 'r+1 
 
 a. 
 
 2k, 2k 
 
 2k+l,2k+l 
 
 (r+1) 
 
 where, 
 
 2k COS \ + 2a 2k,2k+l 
 
 ( a2k ' 2k )r + l^[ a ^ 
 
 and, trace (a^ = trace {\^ ^ 
 
 cos a k sincx k+ ^ +± ^ k+1 sxn Q^ 
 
 Therefore, after one transformation, the off-diagonal elements 
 
 ( a \ = fa | , where p = (0, 2, ^, . . . , n - 2), and q = p + 1, 
 \ Pq /r+l V Wr 
 
 r r+l 
 
 are eliminated. In order to prepare the. matrix for another transformation, 
 the zero elements in the diagonal submatrices are replaced by carrying out 
 the following permutation: 
 
 t 
 
 r+1 r+1 
 
 (h) 
 
 -6- 
 
where ¥ 
 
 t 
 
 
 e n 
 
 
 
 1 
 
 
 
 
 " T~ 
 
 ~t~~ 
 
 — — "* r 
 
 e 
 
 
 
 , Y¥ = I, [I] being the identity matrix, and 
 
 (1,0), e Q = (0,1) 
 
 In effect, this permutation shifts the second row to the bottom and the 
 
 second column to the far right. The matrix A -, is then ready for another 
 
 r+1 
 
 transformation A 
 
 r+2 
 
 cp , A ., cd , • It can easily be shown that all the 
 ^r+1 r+1 r+1 
 
 off-diagonal elements will be exposed to this process of elimination after 
 every (n - l) orthogonal transformation, if the first row and first column 
 are shifted to the bottom and far right, respectively. This second kind of 
 shifting is equivalent to the permutation 
 
 a = r a r 
 
 n n 
 
 (5) 
 
 where r = 
 
 "* t ! 
 
 — 
 
 
 j 
 
 I 
 
 t i 
 
 _ 6 1 • 
 
 
 
 ^^ 
 
 rr 
 
 To check for convergence the sum, E, of squares of the off-diagonal 
 elements is compared to the sum, D, of squares of the diagonal elements 
 after every orthogonal transformation. If the ratio r\ = E/D is less than 
 an arbitrarily small number £, then the diagonal elements of the matrix A 
 
 m 
 
 are assumed to have converged to the eigenvalues of the original matrix A. 
 
 1' 
 
 In general, each diagonal element 
 
 ^ m 
 
 will lie in an interval of width 
 
 2s/ E. The practically diagonal matrix A will be of the form: 
 
 m 
 
 A --= W A n W 
 m 1 
 
 -7- 
 
where 
 
 W 
 
 = (cp -, . . r cp n 
 
 y co 2 y 
 
 O 
 
 is the matrix whose columns are the eigenvectors. 
 
 3- 2 Implementation 
 
 Consider the implementation of Jac obi 's. method on a 16-PE machine 
 using the storage allocation scheme of Figure 2 for an upper triangular 
 16 x 16 matrix. The mapping of element a. . into PE memory follows,, where 
 P_ (assumed even) is the number of PE's being used (ixl is the greatest 
 integer < x) : 
 
 P„ 
 
 < i <| 
 
 1 
 
 a. . 
 
 loc. i, PE 
 
 [^(■nodjp^^l-j + i(/P E+ l)] 
 
 (mod p ) 
 
 T< i < P E 
 
 & ± . - loc d - I + 1, PE 
 
 nTp e (i(modvTP E + lj + 
 
 (j - P I) fr/5 
 
 P £ + 1) + 1 
 
 1 -2 
 
 (mod P T 
 
 This storage allocation scheme is used because of the negligible amount 
 of routing involved in implementing the Jacobi process. A step-by-step 
 discussion of the process follows. 
 
 First, consider the calculation of the transformation matrices of 
 Equation 1 according to Equations 2 and 3. The transformation matrices may 
 be computed in the pairs of PE's shown in Figure 3. Proper a. . elements may 
 be accessed, as shown in Figure 3, in three memory cycles plus fewer than ten 
 unit routes (distance 1 or k) under mode control. This entire time is negli- 
 
 
 
FIGURE 2 
 
 
 F 
 3WO 
 
 'EO 
 
 1 
 
 2 
 
 3 
 
 4 
 
 5 
 
 6 
 
 7 
 
 8 
 
 9 
 
 10 
 
 II 
 
 12 
 
 13 
 
 14 
 
 15 
 
 LOCO R( 
 
 0,0 
 
 0,4 
 
 0,8 
 
 0,12 
 
 0,1 
 
 0,5 
 
 0,9 
 
 0,13 
 
 0,2 
 
 0,6 
 
 0,10 
 
 0,14 
 
 0,3 
 
 0,7 
 
 0,11 
 
 0,15 
 
 1 1 
 
 M4 
 
 1 .3 
 
 1,7 
 
 l,M 
 
 1 ,15 
 
 8,8 
 
 1,4 
 
 1.8 
 
 1,12 
 
 1, 1 
 
 1,5 
 
 1.9 
 
 1 ,13 
 
 1.2 
 
 1.6 
 
 1 ,10 
 
 2 
 
 2 
 
 2,9 
 
 2,13 
 
 2,2 
 
 2,6 
 
 2,10 
 
 2,14 
 
 2,3 
 
 2,7 
 
 2,11 
 
 2,15 
 
 8,9 
 
 2,4 
 
 2,8 
 
 2,12 
 
 9,9 
 
 2,5 
 
 3 
 
 3 
 
 3,4 
 
 3,8 
 
 3,12 
 
 9,10 
 
 3,5 
 
 3,9 
 
 3,13 
 
 10,10 
 
 3,6 
 
 3,10 
 
 3,14 
 
 3,3 
 
 3,7 
 
 3,11 
 
 3,15 
 
 8,10 
 
 4 
 
 4 
 
 11,11 
 
 4,7 
 
 4.M 
 
 4,15 
 
 8,11 
 
 4,4 
 
 4,8 
 
 4,12 
 
 9 ,1 1 
 
 4,5 
 
 4,9 
 
 4,13 
 
 10,11 
 
 4,6 
 
 4,10 
 
 4,14 
 
 5 
 
 5 
 
 5,13 
 
 10,12 
 
 5,6 
 
 5,10 
 
 5,14 
 
 11,12 
 
 5,7 
 
 5,11 
 
 5,15 
 
 8,12 
 
 12,12 
 
 5,8 
 
 5,12 
 
 9,12 
 
 5,5 
 
 5,9 
 
 6 
 
 6 
 
 6,8 
 
 6,12 
 
 9,13 
 
 13,13 
 
 6,9 
 
 6,13 
 
 10,13 
 
 6,6 
 
 6,10 
 
 6,14 
 
 11,13 
 
 6,7 
 
 6,11 
 
 6,15 
 
 8,13 
 
 2,13 
 
 7 
 
 7 
 
 7,7 
 
 7,11 
 
 7,15 
 
 8,14 
 
 12,14 
 
 7,8 
 
 7,12 
 
 9,14 
 
 13,14 
 
 7,9 
 
 7,13 
 
 10,14 
 
 14,14 
 
 7,10 
 
 7,14 
 
 11,14 
 
 8 
 
 8 
 
 10,15 
 
 14,15 
 
 
 
 11,15 
 
 15,15 
 
 
 
 8,15 
 
 12,15 
 
 
 
 9,15 
 
 13,15 
 
 
 
 FIGURE 3 
 
 PEO 1 
 
 2 
 
 3 
 
 4 5 
 
 6 
 
 7 
 
 8 
 
 9 
 
 10 | 11 
 
 12 | 13 
 
 14 | 15 
 
 ELEMENT 0|j 
 
 0,0 0, 1 
 
 2,2 2,3 
 
 4,4 4,5 
 
 6,6 6,7 
 
 8,8 8,9 
 
 10,10 10,11 
 
 12,12 12,13 
 
 14,14 14,15 
 
 
 1. 1 
 
 3,3 
 
 5,5 
 
 7,7 
 
 9,9 
 
 11,11 
 
 13,13 
 
 15,15 
 
 TRANSFORMATION 
 MATRIX 
 
 To 
 
 Tl 
 
 T 2 
 
 T 3 
 
 T 4 
 
 T 5 
 
 ■ T 6 
 
 T 7 
 
gible with respect to the times for the calculations which follows. 
 
 The computations of Equations 2 and 3 are carried out in the pairs 
 of PE's shown in Figure 3- The time required for these is less than 15 
 microseconds. 
 
 The major part of the computation time is consumed in carrying out 
 the transformations indicated by Equation 0. This computation involves a 
 number of 2 x 2 matrix multiplications, and it is important to see that these 
 may be performed efficiently using the storage allocation scheme of Figure 2. 
 Notice that for < i < 7, a and a. . n are, in blocks of four, four PE's 
 apart, and can be brought to the same PE in one unit route. Across the boun- 
 daries of the blocks of four, elements are distance five apart. Similarly, 
 
 for < i < 6, a and a. _ . are five PE's apart and can be brought to the 
 
 - - ij i+l, J 
 same PE in two unit routes. For 8 < i < 15, a. . and a. . . are five PE's 
 
 - ~ ij i,J+l 
 apart while for 8 < i < 14, a. . and a. _ . are four PE's apart in blocks of 
 
 - - ij i+l, J 
 four with five PE distance at the block boundaries. The distances between 
 
 =7; 
 
 and an. are irregular but these rows do not interact while so labeled. 
 By moving the computed transformation matrices T , . . . , T to 
 
 the control unit, pairs (T , T ), (T , T ), etc. can be broadcast to the 
 appropriate PE's. Figure 4 shows the first two steps of this process. Line 
 1 shows the subscripts of the a. . which may be accessed in one memory cycle 
 
 and line 2 shows the subscripts of the A. . partitions to which the a. . of 
 
 1J 1J t 
 
 line 1 belong. By broadcasting T and T from the control unit, TAT, 
 
 ■£ m 1 00 
 
 T_A. T , and TAT are computed in parallel. Lines 3 and 4 show similar 
 subscripts for the second step of the processs, during which T n A T„, T_A T , 
 
 -j- . Od cL U Uj j 
 
 TAT, and TAT are computed. Since T and T were broadcast in step 1, 
 
 they are still available in step 2. The 2x2 matrix multiplications are 
 
 facilitated by spacing the adjacent row and column elements one or two unit 
 
 routes apart. On the next step A^, , A^, A n , and A n ,_ will be transformed. 
 
 04 05 14, 15 
 
 This processing is continued by rows until the entire matrix has been trans- 
 formed. Notice that at the main diagonal this procedure suffers from an 
 efficiency of approximately 50% if computed in the straightforward way shown 
 above; this point will be discussed later. 
 
 The total time to carry out one transformation on a 6k x 6k upper 
 triangular matrix is approximately 6 milliseconds using 6k PE's. The time 
 
 -10- 
 
CD 
 
 ^1 
 
 0> 
 
 cn 
 
 cn 
 
 cn 
 4> 
 
 4* 
 OJ 
 
 OJ 
 
 ro 
 
 ro 
 
 O 
 
 o 
 p 
 
 b 
 
 o 
 
 CO 
 
 m 
 o 
 
 5 
 
 
 0) 
 F5 
 
 cn 
 oi 
 
 — 
 
 OJ 
 
 CO 
 
 ro 
 
 CO 
 
 o 
 
 
 Cn 
 
 ->l 
 
 cn 
 
 5 
 
 4> 
 
 OJ 
 
 ro 
 
 ro 
 63 
 
 o 
 
 OJ 
 
 CO 
 
 — 
 
 CO 
 O 
 
 cn 
 
 CD 
 
 cn 
 cn 
 
 4* 
 
 OJ 
 
 ro 
 ro 
 
 o 
 ^1 
 
 OJ 
 
 ro 
 
 CO 
 4> 
 
 ro 
 4k 
 
 CM 
 OJ 
 
 Cn 
 O 
 
 4k 
 Ol 
 
 co 
 O 
 
 ro 
 
 o 
 
 
 OJ 
 
 cn 
 
 CO 
 
 cn 
 
 cn 
 
 4> 
 4k 
 
 OJ 
 
 cn 
 
 ro 
 O 
 
 o 
 
 Cn 
 
 
 4* 
 
 c* 
 
 f\5 
 
 cn 
 
 OJ 
 
 f\3 
 
 4k 
 CO 
 
 OJ 
 CO 
 
 ro 
 
 O 
 4k 
 
 o 
 
 Oi 
 
 CO 
 
 ->l 
 
 o 
 
 cn 
 
 ro 
 
 OJ 
 OJ 
 
 ro 
 
 OJ 
 
 O 
 CO 
 
 £ 
 
 0) 
 
 CO 
 
 CO 
 
 cn 
 cn 
 
 cn 
 
 4k 
 
 O 
 O 
 
 ro 
 
 O 
 
 ro 
 
 
 -4 
 
 ro 
 
 CJ1 
 
 OJ 
 4> 
 
 cn 
 
 o 
 
 cn 
 cn 
 
 CO 
 
 OJ 
 
 cn 
 
 ro 
 
 O 
 
 
 CO 
 
 CO 
 CO 
 
 CD 
 
 cn 
 
 5 
 
 rv5 
 
 ro 
 
 4> 
 cn 
 
 OJ 
 
 O 
 
 ro 
 
 oi 
 
 O 
 o 
 
 "~ 
 
 CO 
 
 CO 
 
 R5 
 
 0J 
 
 OJ 
 
 cn 
 
 CO 
 
 4k 
 
 co 
 
 OJ 
 4k 
 
 ro 
 4k 
 
 o 
 
 cn 
 
 cn 
 
 O 
 
 CO 
 
 5 
 
 cn 
 -J 
 
 cn 
 R5 
 
 4k 
 
 OJ 
 
 OJ 
 OJ 
 
 ro 
 
 CO 
 
 O 
 
 CO 
 
 
 — 
 
 CO 
 
 Z 
 £ 
 
 cn 
 
 cn 
 
 5 
 
 OJ 
 
 ro 
 
 F5 
 
 O 
 
 OJ 
 
 
 ro 
 
 oi 
 oi 
 
 5 
 
 cn 
 
 
 4k 
 cn 
 
 OJ 
 
 ro 
 
 O 
 ro 
 
 ro 
 
 OJ 
 
 CO 
 CO 
 
 
 ro 
 
 OJ 
 
 cn 
 cn 
 
 4k 
 
 5 
 
 OJ 
 
 co 
 co 
 
 o 
 cn 
 
 
 5 
 
 CO 
 
 £ 
 
 cn 
 
 CO 
 
 cn 
 CO 
 
 4k 
 
 OJ 
 4k 
 
 ro 
 cn 
 
 O 
 5 
 
 
 cn 
 
 
 
 
 
 
 
 
 
 
 
 H 
 Q 
 
 vn 
 
 
 ^_» 
 
 
 _,-_ 
 
 
 
 . OJ 
 
 
 — • 
 
 
 "0 £ 
 
 k 
 
 "Oro 
 
 "^ 
 
 
 >- 
 
 • m 
 
 J>~ 
 
 m 
 
 
 3J^ 
 
 r~ 
 
 5^ 
 
 r 
 
 
 HfT 
 
 m 
 
 Hm 
 
 m 
 
 
 H^ 
 
 2 
 
 h2 
 
 2 
 
 
 rr CC 
 
 m 
 
 ^ci 
 
 m 
 
 
 OfT 
 
 z 
 
 Vm 
 
 z 
 
 
 z=c 
 
 H 
 
 «ix 
 
 H 
 
 
 >o 
 
 Q 
 
 >o 
 
 Q 
 
 
 «-. -n 
 
 ■ — 
 
 ^— Tl 
 
 «Z! 
 
 
 
 
 
 
 u 
 
 — 
 
 OJ 
 
 o 
 
 o 
 
 m 
 
 «• 
 
 a# 
 
 - 
 
 m 
 
 o 
 
 ro 
 
 4k 
 
 o 
 
 O 
 
 
 O 
 
 O 
 
 o 
 
 — 
 
 
 «* 
 
 ->• 
 
 <* 
 
 «« 
 
 — 
 
 ro 
 
 4k 
 
 — 
 
 OJ 
 
 
 o 
 
 — 
 
 — 
 
 ro 
 
 
 «• 
 
 mm 
 
 - 
 
 mt 
 
 ro 
 
 OJ 
 
 ^1 
 
 ^— 
 
 ro 
 
 
 __ 
 
 ro 
 
 
 
 
 m 
 
 -• 
 
 
 
 OJ 
 
 OJ 
 
 cn 
 
 
 
 
 — 
 
 OJ 
 
 
 
 
 - 
 
 - 
 
 
 
 4k 
 
 ro 
 
 cn 
 
 
 
 
 o 
 
 O 
 
 
 
 
 «• 
 
 a* 
 
 
 
 OI 
 
 ro 
 
 cn 
 
 
 
 
 o 
 
 — 
 
 
 
 
 <m 
 
 ** 
 
 
 
 cn 
 
 ro 
 
 4k 
 
 
 
 
 — 
 
 ro 
 
 
 
 
 m 
 
 m» 
 
 
 
 ^i 
 
 OJ 
 
 ->l 
 
 
 
 
 — 
 
 OJ 
 
 o 
 
 o 
 
 
 - 
 
 mm 
 
 - 
 
 *• 
 
 CO 
 
 OJ 
 
 cn 
 
 — 
 
 ro 
 
 
 o 
 
 O 
 
 o 
 
 — 
 
 
 m 
 
 m 
 
 mm 
 
 <• 
 
 CO 
 
 OJ 
 
 cn 
 
 o 
 
 — 
 
 
 o 
 
 — 
 
 
 
 
 m 
 
 «• 
 
 
 
 o 
 
 ro 
 
 cn 
 
 
 
 
 — 
 
 ro 
 
 — 
 
 OJ 
 
 
 *• 
 
 mm 
 
 mm 
 
 «• 
 
 
 ro 
 
 4k 
 
 _ 
 
 OJ 
 
 
 — 
 
 OJ 
 
 O 
 
 o 
 
 
 - 
 
 mm 
 
 «* 
 
 •# 
 
 ro 
 
 OJ 
 
 -J 
 
 ^~ 
 
 OJ 
 
 
 o 
 
 O 
 
 o 
 
 — 
 
 
 - 
 
 » 
 
 <* 
 
 •a 
 
 OJ 
 
 OJ 
 
 -J 
 
 "" — 
 
 ro 
 
 
 o 
 
 — 
 
 
 
 
 - 
 
 «■ 
 
 
 
 4k 
 
 OJ 
 
 cn 
 
 
 
 
 — 
 
 ro 
 
 
 
 
 <* 
 
 •m 
 
 
 
 cn 
 
 ro 
 
 Ul 
 
 
 
 
 ^1 
 
 g 
 
 -P- 
 
 ■11- 
 
to transform a square 6k x 6k matrix is less than 12 milliseconds. 
 
 Next, the rows as shown in Equation k are permuted. The storage 
 
 scheme of Figure 2 allows this to he done "by moving relatively few elements 
 
 of the matrix. Figure 5 shows the resulting matrix in which rows 0, 1, and 
 
 8 and columns k, 8, and 12 have been moved. Each column is moved left one, 
 
 row is moved right three unit routes, and rows 1 and 8 are moved somewhat 
 
 irregularly. However, the total amount of routing time involved is negligible 
 
 with respect to the transformation time. In general, three rows andvP - 1 
 
 ■ E 
 
 columns will he moved in a similar way. 
 
 Figure 6 is a relabeled version of Figure 5, showing the new names 
 of the shuffled elements. Note by comparison with Figure 2 that the origin 
 has moved nine PE's to the right and one row down in PE memory. This is a 
 matter which is easy to accommodate in a program for carrying out the Jacobi 
 process. At this point the entire process is repeated, started by computing 
 a new set of transformations. After n - 1 such steps the second shuffle is 
 performed according to Equation 5- 
 
 The second shuffle requires less routing than the first. Row 0's 
 routing is similar to the routing of row 1 in the above discussion, with 
 columns K, 8, and 12 pushed back distance one to accommodate the new column. 
 
 3-3 Efficiency 
 
 First, an efficiency calculation for the simple process described 
 
 above shall be performed, operating on a (kvP_ x k -s/P ) matrix using P PE's. 
 
 E E E 
 
 k-1 
 
 Assuming one time unit to operate on a (\Tp x vP ) partition, Z = k(k - l)/2 
 
 E E , 
 
 1=1 
 
 time steps is used for square partitions plus k time steps for triangular par- 
 titions on the main diagonal. If the elements which are to be made zero are 
 not computed (just set to zero), only — steps should be used to transform the 
 triangular partitions on the main diagonal. Thus, there is an efficiency of 
 
 k(k - l)/2 + k/2 k . 
 k(k - l)/2 + k ~ k + 1 
 
 For a 6k x 6k matrix and 6k PE s, (i.e., k = 8 and \Tp = 8), there is an 
 o E 
 
 efficiency of — ~ 
 
 -12- 
 
-J 
 
 O 
 
 Ol 
 
 -l> 
 
 OJ 
 
 ro 
 
 
 
 p 
 
 
 
 T3 
 
 to 
 
 CD 
 00 
 
 Ol 
 
 F5 
 
 F3 
 
 oi 
 
 OJ 
 
 ro 
 
 CD 
 
 ro 
 
 
 
 
 m 
 
 
 CO 
 
 5 
 
 CD 
 CD 
 
 en 
 
 F3 
 
 OJ 
 
 01 
 
 ro 
 
 CD 
 
 OJ 
 
 
 
 — 
 
 to 
 
 CD 
 
 CD 
 O 
 
 01 
 
 F3 
 
 OJ 
 
 OJ 
 CD 
 
 5 
 
 £ 
 
 
 ro 
 
 ro 
 
 6,11 
 
 777 
 
 01 
 
 Ol 
 
 F3 
 
 OJ 
 
 ro 
 
 oi 
 
 
 
 OJ 
 
 01 
 
 6,12 
 7,8 
 
 Ol 
 
 
 OJ 
 00 
 
 ro 
 ro 
 
 oi 
 oi 
 
 
 
 -fr 
 
 1 
 
 ■>! | CD 
 
 CD 1 oj 
 
 -& 
 
 Oi 
 
 OJ 
 CD 
 
 ro 
 oi 
 
 __ 
 
 
 01 
 
 01 
 
 6,14 
 7,10 
 
 OJ 
 
 4> 
 
 CD 
 
 OJ 
 O 
 
 ro 
 
 ro 
 
 
 
 CD 
 
 tD 
 
 11,12 
 
 6,15 
 7.71 
 
 ^1 
 
 OJ 
 
 ro 
 oi 
 
 OJ 
 
 O 
 
 ->i 
 
 11,11 
 10,15 
 
 CO 
 
 OJ 
 
 R5 
 
 oi 
 
 •^ 
 
 O 
 
 CO 
 
 00 
 
 
 _ 
 
 O 
 
 Ol 
 
 CD 
 
 OJ 
 
 oi 
 
 £ 
 
 £ 
 
 01 
 
 
 
 CD 
 
 CD 
 
 
 5 
 OJ 
 
 Ol 
 CD 
 
 O 
 
 OJ 
 
 ro 
 ro 
 
 CD 
 
 O 
 O 
 
 O 
 
 ^1 
 oi 
 
 
 ro 
 
 Ol 
 
 -P* 
 
 OJ 
 
 en 
 
 ro 
 
 OJ 
 
 ->l 
 
 O 
 
 — 
 
 CD 
 
 ui 
 
 5 
 
 CJ1 
 CO 
 
 
 OJ 
 oi 
 
 ro 
 
 00 
 
 O 
 
 ro 
 
 r~o 
 
 ID 
 
 O 
 O 
 
 Ol 
 CD 
 
 -P> 
 
 oi 
 
 OJ 
 
 ro 
 
 OJ 
 
 CD 
 
 O 
 
 OJ 
 
 OJ 
 
 JJ 
 
 CD 
 CD 
 
 Ol 
 
 5 
 
 -l> 
 
 OJ 
 OJ 
 
 ro 
 
 CD 
 
 5 
 
 O 
 
 £ 
 
 D 
 O 
 
 CD 
 ^1 
 
 01 
 
 oi 
 
 OJ 
 OJ 
 
 ro 
 
 — 
 
 O 
 oi 
 
 oi 
 
 
 
 
 
 
 
 
 
 
 H 
 
 Q 
 
 
 r 
 o 
 p 
 
 JO 
 
 o 
 
 S(J)Ol^WM-0(D 
 
 CD 
 
 CD 
 CD 
 
 Ol 
 
 ro 
 
 
 
 
 ro 
 
 00 
 
 O 
 OJ 
 
 
 "0 
 m 
 
 
 
 OJ 
 
 CD 
 5 
 
 1 
 
 Ol 
 oi 
 
 CD 
 
 OJ 
 CD 
 
 ro 
 
 ro 
 
 
 ro 
 
 00 
 oi 
 
 — 
 
 ^1 
 CD 
 
 CD 
 
 00 
 
 ro 
 
 01 
 
 OJ 
 
 O 
 
 ro 
 
 oi 
 
 •— 
 
 O 
 
 CD 
 
 ro 
 oi 
 
 ro 
 
 OJ 
 
 OJ 
 
 ro 
 ro 
 
 co 
 
 OJ 
 
 00 
 CD 
 
 01 
 
 
 
 
 
 OJ 
 
 5 
 
 CD 
 ^1 
 
 01 
 00 
 
 -P> 
 
 OJ 
 
 OJ 
 OJ 
 
 ro 
 
 CD 
 
 
 
 
 ■i> 
 
 £ 
 £ 
 
 CD 
 
 Ol 
 
 TO 
 
 
 
 OJ 
 
 ro 
 a> 
 
 OJ 
 
 
 01 
 
 CD 
 
 oi 
 
 Ol 
 
 ^1 
 5 
 
 CD 
 oi 
 
 CD 
 
 i^5 
 
 CD 
 
 OJ 
 
 ro 
 ro 
 
 ro 
 
 
 
 ^1 
 
 OJ 
 
 oi 
 
 CD 
 
 
 00 
 
 OJ 
 
 Ol 
 Ol 
 
 -0 
 O 
 
 oi 
 
 CD 
 CD 
 
 CD 
 
 
 
 
 ^1 
 
 A 
 
 ro 
 
 OJ 
 
 Ol 
 CD 
 
 -& 
 
 £ 
 
 CO 
 
 
 
 ro 
 
 Ol 
 
 O 
 
 
 
 oi 
 
 
 00 
 
 ^1 
 ^1 
 
 CD 
 CO 
 
 Ol 
 
 ol 
 
 — 
 
 OJ 
 
 ro 
 
 CD 
 
 5 
 
 
 
 
 5 
 oi 
 
 CO 
 
 ^1 
 
 CD 
 
 ro 
 
 
 ro 
 
 
 OJ 
 CO 
 
 ro 
 
 OJ 
 
 OJ 
 
 
 
 oi 
 
 
 
 ^1 
 oi 
 
 CD 
 
 Ol 
 CD 
 
 -P> 
 
 OJ 
 
 ro 
 
 ro 
 ro 
 
 ->l 
 
 
 
 00 
 
 
 — 
 
 00 
 
 OJ 
 
 oi 
 
 Ol 
 
 5 
 
 oi 
 
 CD 
 
 
 
 ro 
 
 CD 
 
 — " 
 
 
 ro 
 
 
 ro 
 
 ro 
 
 1 
 
 CD 
 CD 
 
 01 
 
 00 
 
 OJ 
 Ol 
 
 ro 
 5 
 
 oi 
 
 O 
 
 oi 
 
 OJ 
 
 00 
 
 CD 
 
 OJ 
 
 no 
 
 
 OJ 
 CD 
 
 ro 
 
 03 
 OD 
 
 O 
 
 Ol 
 
 oi 
 oi 
 
 £ 
 
 ^1 
 
 5 
 
 OJ 
 
 Ol 
 
 00 
 
 OJ 
 OJ 
 
 ro 
 
 OJ 
 
 * 
 
 O 
 
 CD 
 
 
 oi 
 
 H 
 
 Q 
 
 ON 
 
 ■13- 
 
This same technique and storage allocation scheme may be used on 
 matrices of dimensions less than the number of PE's. The discussion holds 
 of the transformation of A„ , A n , A , and A operated on asfP x \Tp 
 partition of a larger matrix. The process may be used repeatedly on a large 
 
 matrix or just once on a^/P xsll? matrix. Full efficiency can be achieved 
 
 E E 
 
 on square matrices of dimension k s/ P x k v P . 
 
 The maximum efficiency which can be achieved for the Jacobi process 
 on a parallel machine depends on how the diagonal partitions and any columns 
 
 in addition to kvP are treated. Suppose an attempt is made to transform 
 
 E 
 
 simultaneously A , A , A^, \h> \^> and \cr °^ " t:iie example, and to sub- 
 sequently repeat this pattern on the main diagonal. This procedure will 
 achieve 100$ efficiency during the transformation of the triangular matrices, 
 but an irregular route of two elements will be incurred since a and a 
 are in the same PE as are a and a . 
 
 U<d y , 11 
 
 Matrices of dimension larger than 6k x 6k may be partitioned into 
 a number of blocks of dimension 6k x 6k or smaller. For example, a 100 x 100 
 matrix becomes a 6k x 6k upper triangular partition, a 36 x 36 upper trian- 
 gular partition and a 6k x 36 rectangular partition. All rows of rectangular 
 partitions are stored according to the mapping given earlier for rows between 
 and Pp/2 _ !• In other words, the lower part of the matrix is not packed 
 into holes in the upper part, but is stored in rows just like the upper part. 
 In applying the Jacobi process to such a rectangular matrix, proceed as above 
 
 transforming blocks of dimension v P x\)P and achieving full efficiency on 
 
 E E 
 
 partitions of dimension p \] P x q \J P . If the dimension of the matrix is not 
 
 E E 
 
 a multiple of nTp ^ the leftover columns may be handled with 100$ efficiency 
 if the transformations are treated as column operations. Entire columns are 
 available in one memory cycle using the storage allocation scheme described 
 above. 
 
 Thus it seems reasonable to conclude that efficiencies of 90% or 
 greater may be attained for any size matrix without inordinate programming 
 difficulties. 
 
 ■Ik- 
 
3- k Disk Considerations 
 
 2 
 Large matrices that do not fit into primary memory are 
 
 partitioned into blocks of 6^- x Gk. The number of rows of "blocks is 
 
 m = 
 
 L 
 
 N - 1 
 
 + 1, where N > Gk is the size of the matrix. Let the number 
 
 of any block b be given by 
 
 C-l . 
 b = R + Z k 
 
 k=0 
 
 (C,R = 1, 2, ... , NR) 
 
 where R is the row number of the block and C is the column number of the 
 
 block. The blocks 
 
 R-l 
 R + Z k, R = (1, 2, . . . , NR), 
 k=0 
 
 are brought in from the disk sequentially to be subjected to an orthogonal 
 
 transformation, Equation 0. Thus all the rotation angles are determined and 
 
 the rest of the blocks, C-l 
 
 b = R + Z k, R = (0, 1, . . . , NR) C / R 
 k=0 
 
 are fetched from the disk to complete the given transformation of the 
 whole matrix. Some of the rows and columns of each submatrix are stored 
 in arrays in primary memory, for purposes of the shifting process that 
 takes place after the whole matrix is subjected to a transformation. There- 
 fore, the following elements will be stored in primary memory for shifting 
 the second row and second column: 
 
 Submatrix 
 
 Elements to be stored 
 
 R-l 
 1. b = R + Z k 
 k=0 
 
 (R > 1) 
 
 first row of the submatrix, i.e., 
 the elements a. . . 
 
 i = 6U(R - 1) 
 
 3 = 6U(R - 1), . . . , Gk R - 1 
 
 2 
 
 The results of this section are valid for various secondary storage 
 devices, e.g., drum, delay line or the head per track disk which we 
 discuss. 
 
 •15- 
 
C-l 
 2. b = R + Z k 
 k=0 
 
 (R > 1, C > R) 
 
 C-l 
 3. b = 1 + Z k 
 k=0 
 
 C > 1 
 
 first row and first column of the 
 submatrix, i.e., the elements 
 
 (\i) and ( a rs) ' 
 
 k = 6h(R - 1) 
 
 I = 6k(c - 1), . 
 
 s = 6U(c - l) 
 
 r = 6U(R - 1), . 
 
 . , 6U c - l 
 
 . , 6k R - 1 
 
 second row and first column of each 
 submatrix, i.e., the element 
 
 (a ) and [a, \ . 
 pqy \ tu/ 
 
 p = 1 
 
 q = 6U(C - 1), . . . , 6U C - 1 
 
 u = 6k(c - l) 
 
 t = 0, . . . , 63 
 
 1*. b = 1 
 
 second row and second column of the 
 submatrix, i.e., the elements 
 
 (a ) and [a ) . 
 vwy ^ xyj 
 
 . • , 63 
 
 v = 1 
 
 w = 1, 
 
 x = 
 
 y = i 
 
 For the first row and first column shifting, the elements to be stored 
 in the core are the same as above except those of items (3) and (k) : 
 
 -16- 
 
t 
 
 Submatrix Elements to be stored 
 
 C-l 
 
 3 . b = 1 + £ k first row and column of each sub- 
 
 matrix, i.e., the element /a I , 
 
 \ pq y 
 
 w 
 
 c > 1 p = o 
 
 q = 6k(C - 1), . . . , 6k C- 1 
 
 u = 6k(c - l) 
 
 t = o, . . . , 63 
 
 k . b = 1 first row only, i.e., the elements 
 
 fa ), 
 
 v = 
 
 w = 0, . . . , 63 
 
 The total number of elements to be stored in primary memory in each 
 kind of shifting is N(NR + 1). 
 
 The primary memory of 6k PE's is capable of holding 32 partitions 
 6k x 6k' } the transformation time for a square block is about 12 ms, and the 
 rotational delay of the head-per-track disk is kO ms. If the partitions on 
 the disk are laid out in such a way that four of them are accessed at once, 
 then kQ ms later the four 6k x 6k partitions will have been transformed and 
 the disk will have rotated more than one complete revolution. Thus, if the 
 requests for four partitions in the disk queuer are kept one rotation ahead, 
 the rotational latency of the disk can be masked in computing on large matrices . 
 
 -17- 
 
k. HOUSEHOLDER'S METHOD 
 
 k. 1 Theoretical Background 
 
 This method [3?^] reduces a symmetric matrix 
 
 A = [a. .] {l,j = 0, 1,2, . . . ,n-l) to the tridiagonal form T using elementary 
 
 -'-J 
 Hermitian orthogonal matrices. There are (n-2) steps in this reduction. 
 
 We define the matrices A "by the relations, 
 
 r 
 
 A r = P r A r _ x P r (r = 1,2,..., n-2) (6) 
 
 where A = A is the original matrix. The matrices P are defined by 
 
 P = I - !£r (7) 
 
 2 
 2K 
 r 
 
 in which u is a row vector (whose first r elements are zero) given by, 
 
 u = (0,...,0, a + S , a _ _ , , a _ , N (8) 
 
 r r-l,r - r r-l.r+1' ' r-l,n-l) 
 
 s '|l^ (9) 
 
 2 
 and, 2K = S (S + a . ) (10) 
 
 r r r — r-1, r 
 
 where the elements a. . are those of A n . The signs in both expressions 
 
 ij r-1 & * 
 
 (8) and (10) are chosen such that, 
 
 [a+s|=|a n |+S. (11) 
 
 r-l,r — r 1 ' r-l,r' r 
 
 Therefore, in the r-th step (A f = P f A P ) [n - (r + l)] zeros are in- 
 troduced in the r-th column and row of A without destroying the zeros 
 introduced in the previous step. 
 
 In order to take advantage of symmetry, the matrices A may be 
 defined by the relations [ 3] , 
 
 -18- 
 
 
t t . . 
 
 A=A n -qu-uq (12) 
 
 r r-1 r r r r 
 
 where u is as given above, and q is given "by 
 
 t 
 
 % = p r - 1 r r P r C^) 
 
 2 2 
 
 2K 
 r 
 
 in which, 
 
 P = A _ U r . (Ik) 
 
 r r-1 — 
 
 2K 
 
 r 
 
 The total number of multiplications required to reduce the matrix to the 
 
 3 
 tridiagonal form is essentially 2 n . 
 
 3 
 
 The eigenvalues of the tridiagonal matrix may be obtained by 
 several methods. For this parallel computer it is proposed that if all the 
 eigenvalues are required, then the QR algorithm with origin shift [3;5>6] is 
 used, (this algorithm is discussed for real nonsymmetric matrices in the 
 following section). However, if one is interested in the general distribu- 
 tion of the eigenvalues and not in their accurate values, or the eigenvalues 
 in a given interval, the bisection method [3*7] ma y 1 ° e used. In order to 
 show how the method of bisection would be used for a parallel computer, 
 assume that it is necessary to obtain all the eigenvalues of the resulting 
 tridiagonal matrix T using such a method. For the sake of illustration, 
 assume also that all the subdiagonal elements are non-zero and hence the 
 matrix has simple eigenvalues. The eigenvalues lie in the interval [-b,b], 
 where b = T j = max E I t. . I . This interval is then divided into 
 
 i loo , X3 
 
 63 subintervals and for all ju, (k = l,a, . . . ,6k), where u = -h, and n^, = +b, 
 the quantities, 
 
 ■19- 
 
W - (Vl,!-! " \> f i-lK' " V 2 ,i-1 W^H 1 " 2.3,..., J 
 
 are evaluated in parallel, one PE to each id . For each jll the number of 
 
 K. K. 
 
 agreements in sign s(jil ) of consecutive members of the sequence f (ju, ) , 
 f (ll), . ..., f (jll ) is the number of eigenvalues larger than jll . The 
 process is repeated for those intervals [jll , jll ] that contain more than 
 one eigenvalue until all the eigenvalues are separated. Assuming that the 
 matrix is of order 6k, 6U intervals, each containing an eigenvalue are 
 established. By assigning each interval to a PE and by successive bisection 
 of each interval, using the Sturm sequence property, all the eigenvalues 
 are obtained. 
 
 Once all the eigenvalues are available, the eigenvectors of the 
 tridiagonal matrix are obtained by the method of inverse iteration [3,8]. 
 Again this can be done in parallel, each PE evaluating an eigenvector. 
 Finally if Z. (i = l,2,...,n) is an eigenvector of the tridiagonal matrix 
 T, then the corresponding eigenvector Y. of the original matrix A is com- 
 puted by, 
 
 
 Y. = P, P„ P „ Z. . (16) 
 
 x 12 n-2 l 
 
 k.2 Implementation of Householder Tridiagonalization 
 
 We shall consider the implementation of Householder's method 
 on a 16-PE machine using the storage allocation scheme of Figure 7 f° r 
 an upper triangular 16 X 16 matrix. This scheme is somewhat less com- 
 plicated than Figure 2 since it is not necessary to provide for shuffling. In 
 particular, adjacent row and column elements are always a unit route 
 from each other, except between the seventh and eighth rows. The lower 
 half of the matrix is stored in reverse order to allow efficient com- 
 putation on this part of the matrix. This scheme also allows access to 
 
 -20- 
 
any row with one memory fetch. Operations may be performed on P„ x P 
 
 E E 
 
 blocks at any step, where the size of the matrix n is equal to the number 
 
 of PE's P_. 
 
 P E - 1, 
 
 The memory map for this scheme consists of two parts: 
 
 E 
 
 2 
 
 a. . 
 
 loc I, PE ((i(mod/P E )) nTp e + j) (mod P E ) 1 ; (17) 
 
 P P 
 
 if E<i< E-l 
 
 a. . 
 
 loc P E - I, PE [~(P E - 1) - ((i(mod/P E ))/P E + j) (mod P E) 1 
 
 A step-by-step discussion of the process follows. 
 
 As the first step (r = l), consider row of the matrix, ( a nr y a m > 
 
 . . . , a. ._). This row is in location 0, Figure 7« The first diagonal 
 
 element of the tri diagonal matrix T is t = a_^. The element t_ n takes 
 
 e 00 00 01 
 
 the value of + S determined by Equation 9 (with r = l) and sign opposite to 
 that determined by Equation 11. The rest of the elements of the first row of 
 T are, of course, zero. The vector u~ is then readily constructed by Equa- 
 tions 8, 9, and 10 in a time which is negligible compared to the following 
 calculations. The vectors p.. and q are then computed with almost full 
 efficiency using Equations 13 and lk. 
 
 Once u , q , and the first row of the tri-diagonal matrix T 
 
 are constructed, matrix A is obtained using Equation 12. However, this 
 
 transformation can be applied on the individual \/P x vP submatrices A^„, 
 
 E E -K^ 
 
 where R, C = (0, 1, . . . , nTp^ - l), thus dividing the vectors u and q 
 
 E J. J. 
 
 into \/ P subvectors each of \l P components. The submatrices A^ at any 
 step r are given by, 
 
 ( V ) - <v) , - ^ (E) * (c)t - » (E) ^ (c)t ^ 
 
 IRC r TRC r-1 r r r r 
 
 for all R and C such that R < C. 
 
 -21- 
 
It may be noted from Figure 7 that for the steps r = (l, 2, . . . , 
 \Tp ), the efficiency of parallel computation on the submatrices A , A , A , 
 and A , Figure 8, descreases as r increases, while that of the remaining off- 
 diagonal submatrices A , A , and A is 100%. Moreover, efficiencies of 
 higher than 50% can be obtained for the diagonal submatrices A , ^pp> and - ^v 
 The entire time required for reducing a 6k x 6k symmetric matrix to the tri- 
 di agonal form on a 6U-PE machine is less than k ms. 
 
 k. 3 Efficiency of the Tridiagonalization 
 
 The efficiency of the Householder process s on a symmetric P x P 
 matrix is computed using the storage allocation scheme of Figure 7- Assume 
 that computations are performed on-Jp x ^^ v blocks and that no attempt is 
 made to mask inefficiencies incurred due to the annihilation of rows and 
 
 columns within particular v P x s/P blocks. It is assumed that diagonal 
 
 iii E 
 
 blocks are transformed in pairs and that no inefficiency is incurred when 
 there is an even number of such blocks. In Figure 7 it may be noted that for 
 example, the upper left and lower right triangular blocks may be accessed and 
 transformed simultaneously, i.e., the elements with subscripts (0,0; 0,1; 0,2; 
 0,3; 15, 15; 1,1; 1,2; 1,3; 1^,15; 1^,1^; 2,2; 2,3; 13, 15; 13,1^5 13,13; 3,3). 
 This pairing can be observed along the entire diagonal. It is also possible 
 to overlap parts of square blocks which have been annihilated. For example, 
 -when "J F /2 rows in a strip of block are set to zero, adjacent blocks could be 
 paired and 100% efficiency regained. Similarly after nTp^A, nTP™/8, . . . , 
 etc. rows are annihilated, full efficiency could be regained. This latter 
 matter requires some delicate discussion as well as programming which will 
 not be included here. 
 
 The number of operations required to carry out the equations of 
 Householder's process on a triangular matrix of decreasing dimension is 
 
 a=i i=i d=i 5 
 
 -22- 
 
 
Ol 
 
 13 
 STRIP 3 
 
 14 
 
 ro 
 
 
 00 
 
 H 
 JO 
 
 ■o 
 ro 
 
 O CO 
 
 00 
 
 ->l 
 
 00 
 
 H 
 
 en oi 
 
 4* 
 
 OJ 
 
 00 
 
 H 
 X 
 
 ■o 
 o 
 
 ro _ 
 
 r* 
 o 
 o 
 
 o 
 
 TJ 
 
 
 OJ 
 CD 
 
 4> 
 
 Ol 
 
 Ol 
 
 00 
 
 CO 
 
 5 
 
 o 
 en 
 
 ro 
 
 4k 
 
 oi 
 
 Ol 
 
 en 
 
 OJ 
 
 ro 
 
 ro 
 
 00 
 
 OJ 
 
 4k 
 
 o 
 o 
 
 m 
 
 ° w 
 
 J- 
 
 ro 
 
 oi 
 
 o 
 
 Cn 
 
 oi 
 ro 
 
 00 
 
 oi 
 
 CO 
 
 o 
 
 OJ 
 
 Ol 
 
 ro 
 
 en 
 
 00 
 
 4> 
 
 4k 
 
 O 
 
 OJ 
 
 ro 
 co 
 
 OJ 
 Ol 
 
 o 
 
 o 
 
 o 
 
 — x 
 
 o 
 
 ro 
 
 ro 
 oi 
 
 OJ 
 
 4k 
 
 oi 
 
 OJ 
 
 CO 
 
 ro 
 
 5 
 
 00 
 
 ■0 
 
 00 
 O 
 
 Ol 
 
 oi 
 
 en 
 co 
 
 Ol 
 
 4> 
 
 £ 
 
 ro 
 
 5 
 
 OJ 
 
 en 
 
 o 
 
 ro 
 
 oi 
 ro 
 
 00 
 
 oi 
 4* 
 
 ro 
 O 
 
 CO 
 OJ 
 
 o 
 
 CO 
 
 Ol 
 
 00 
 
 Ol 
 
 en 
 5 
 
 en 
 
 4k 
 
 ro 
 
 oi 
 
 ro 
 
 OJ 
 
 O 
 
 OJ 
 
 OJ 
 
 OJ 
 
 CO 
 
 oi 
 Ol 
 
 ro 
 
 CO 
 
 o 
 o 
 
 en 
 
 00 
 
 ro 
 
 Ol 
 
 oi 
 
 en 
 
 
 4k 
 
 OJ 
 
 ro 
 ro 
 
 OJ 
 00 
 
 O 
 4> 
 
 o 
 
 4k 
 
 
 5 
 
 oi 
 CD 
 
 ro 
 ro 
 
 CO 
 oi 
 
 5 
 
 ->i 
 
 00 
 OJ 
 
 en 
 ro 
 
 00 
 
 4* 
 
 4* 
 
 Ol 
 
 O 
 
 ro 
 
 OJ 
 
 OJ 
 
 co 
 
 o 
 
 Ol 
 
 — 
 
 Ol 
 
 <Ji 
 
 4k 
 
 Ol 
 
 ro 
 
 OJ 
 
 o 
 ro 
 
 00 
 
 00 
 
 4> 
 
 CO 
 
 o 
 
 en 
 
 OJ 
 
 CO 
 
 4* 
 
 Ol 
 
 Ol 
 
 ro 
 5 
 
 OJ 
 
 5 
 
 o 
 en 
 
 ro 
 
 en 
 
 4> 
 ro 
 
 Ol 
 GO 
 
 ro 
 4* 
 
 OJ 
 
 O 
 
 O 
 oi 
 
 CO 
 
 00 
 Ol 
 
 CO 
 
 en 
 
 o 
 
 4^ 
 
 en 
 
 Ol 
 
 ro 
 
 ro 
 
 oi 
 
 OJ 
 
 O 
 
 OJ 
 
 ->i 
 
 4> 
 
 OJ 
 
 Ol 
 CO 
 
 ro 
 
 Ol 
 
 OJ 
 
 o 
 
 o 
 
 00 
 
 en 
 
 CO 
 
 ro 
 
 en 
 oi 
 
 ->1 
 
 4k 
 
 Ol 
 OJ 
 
 OJ 
 
 ro 
 
 o 
 
 00 
 
 4* 
 
 ro 
 O 
 
 00 
 
 £ 
 £ 
 
 oi 
 
 o 
 
 ro 
 en 
 
 oi 
 ro 
 
 5 
 
 oi 
 
 — 
 
 00 
 
 co 
 
 OJ 
 
 ro 
 
 4> 
 00 
 
 Ol 
 4* 
 
 en 
 O 
 
 OJ 
 
 OJ 
 
 o 
 co 
 
 Ol 
 
 ro 
 
 CO 
 
 ro 
 
 o 
 
 oi 
 
 Ol 
 
 ro 
 
 OJ 
 OJ 
 
 ro 
 
 00 
 00 
 
 co 
 4> 
 
 o 
 
 o 
 
 OJ 
 
 4* 
 CO 
 
 Ol 
 Ol 
 
 en 
 
 OJ 
 
 o 
 o 
 
 en 
 
 ro 
 ro 
 
 oi 
 ro 
 
 ro 
 
 00 
 
 oi 
 4> 
 
 o 
 
 OJ 
 
 00 
 
 co 
 
 CO 
 Ol 
 
 5 
 
 4> 
 
 4^ 
 
 O 
 
 Ol 
 
 en 
 
 en 
 ro 
 
 OJ 
 
 Oi 
 
 o 
 
 ^ 
 
 ro 
 
 OJ 
 
 — 
 
 OJ 
 
 ro 
 co 
 
 oi 
 
 Ol 
 
 £ 
 
 5 
 
 00 
 
 o 
 
 co 
 en 
 
 6 
 ro 
 
 oi 
 
 4* 
 
 Ol 
 
 en 
 
 OJ 
 
 o 
 ro 
 
 00 
 
 ro 
 
 4k 
 
 OJ 
 
 O 
 
 ro 
 
 oi 
 
 ro 
 5 
 
 en 
 
 ro 
 
 oi 
 
 00 
 
 CO 
 
 o 
 
 OJ 
 
 4> 
 ro 
 
 Ol 
 
 00 
 
 en 
 
 4* 
 
 O 
 
 O 
 oi 
 
 co 
 
 ro 
 
 Ol 
 
 OJ 
 
 oi 
 
 OJ 
 
 oi 
 oi 
 
 ro 
 
 oi 
 ^1 
 
 OJ 
 
 00 
 
 ro 
 
 CO 
 00 
 
 5 
 
 o 
 
 4> 
 
 OJ 
 
 Ol 
 CO 
 
 en 
 
 Ol 
 
 -j 
 
 O 
 
 5 
 
 ro 
 en 
 
 OJ 
 
 ro 
 
 ro 
 ro 
 
 OJ 
 00 
 
 4k 
 
 oi 
 O 
 
 CO 
 OJ 
 
 CO 
 CO 
 
 o 
 
 Ol 
 
 _ 
 
 4* 
 
 Ol 
 
 o 
 
 en 
 en 
 
 ro 
 
 O 
 oi 
 
 — 
 
 ro 
 
 OJ 
 0. 
 
 oi 
 
The number of operations performed by a parallel maching using 
 
 the storage allocation scheme of Figure 7 , with the above assumptions, 
 
 follows. There are P„ elements per s v XvP„ block and each block is 
 
 ill ±L. ±Li 
 
 swept \/P„ times per pass. This product times the total number of "v/P,-, x^P^ 
 blocks to be transformed in all passes is 
 
 \Tp p 
 
 IE 
 
 /p E-l 
 Z 
 
 3=1 
 
 Z 
 i=l 
 
 i + 2 
 
 Z 
 i=l 
 
 where the first term in the brackets counts the number of rectangular 
 blocks, and the second term counts the number of paired triangular blocks 
 on the diagonal. This is equal to 
 
 p| + 2 p| + i.5P^P E 
 
 6 
 
 Thus for the Householder process on symmetric matrices we have: 
 
 processing efficiency = 
 
 ops, required 
 ops. used 
 
 P E + 3P E ♦ 2 
 
 P + 1.5sTp . P + 2P 
 
 E ^ E E E 
 
 If n = P = 6k this yields an efficiency of approximately 86%. With the 
 introduction of the pairing of partially annihilated rectangular blocks 
 mentioned above, efficiencies of well over 90% can be achieved. 
 
 As in our discussion of the Jacobi process, matrices of larger 
 and smaller dimensions may be handled using the same storage scheme. 
 Matrices of dimension smaller than the number of PE's are handled using 
 
 
 ■2k- 
 
vP X^Pg "blocks. Those of dimension larger than the machine size 
 
 are partitioned into triangular and rectangular blocks. The rectangular 
 
 p 
 blocks are stored using the memory map specified for rows < i < E - 1 
 
 2 
 for all rows, since there are no holes. In the case of either large or 
 
 small matrices of dimension other than a multiple of Jp the above effi- 
 ciency formula is a close approximation of the correct value. 
 
 k.k Disk Considerations 
 
 Large matrices that do not fit the primary memory are parti- 
 tioned into blocks of 6k x 6k. The number of rows of blocks is, 
 
 KR = rs + 1 
 
 [Hr] 
 
 where N > 6k is the size of the matrix. The procedure is the same as 
 
 above. At the r-th step, the first row of the remaining matrix is fetched 
 
 from the disk and the vectors u , p , and q are constructed. The 6^- x 6k 
 
 r r r 
 
 blocks are then fetched from the disk one at a time and transformed as 
 discussed above. The efficiency of the computation increases as the size 
 of the matrix gets larger. 
 
 The elements of the resulting tridiagonal matrix are stored in 
 the primary storage. Because of symmetry, the number of these elements is 
 only (2N - l) . The time taken for the transformation of an off-diagonal 
 6k x 6k block, Equation 18, on a 6U-PE machine is approximately 6 ms., 
 assuming that the vectors u, p, and q are already available. Thus if 
 requests for eight 6k x 6k partitions in the disk queuer are kept one 
 rotation ahead, the rotational latency of the disk would be masked. 
 
 -25- 
 
5- THE QR ALGORITHM 
 
 5-1 Theoretical Background 
 
 Throughout this discussion it is assumed that the matrix 
 A . [a..] (i,J . 0,1,2,- ...n-1) nnder consideration is Doth real and non- 
 singular. The QR transformation consists of forming a sequence of matrices 
 A. (k = 1,2, . ..) by the relation [5], 
 
 \ + i ■ «£ \\ ' (17) 
 
 where Q is an orthogonal matrix such that 
 
 K. 
 
 is an upper triangular matrix. In general, for large k the matrix JL tends 
 to a form in which all the eigenvalues are either isolated on the diagonal 
 or are the eigenvalues of 2 X 2 diagonal submatrices. 
 
 The QR algorithm is always used after the original matrix is 
 reduced to the upper Hessehberg form, because: 
 
 (1) The upper Hessenberg form is invariant under the QR trans- 
 formation [ 5] • 
 
 (2) The number of multiplications involved in a QR transfor- 
 
 2 3 
 
 mat ion is greatly reduced, n after reduction vs. n for the unreduced 
 
 matrix, where n is the order of the matrix. 
 
 (3) The QR algorithm has strong convergence properties when 
 applied to upper Hessenberg matrices [9]- 
 
 There are several stable methods for reducing any square matrix to upper 
 Hessenberg form [3,^]« The method of Householder, which is described in 
 the previous section, may be used for that purpose. The reduction to 
 upper Hessenberg form is therefore completed in (n - 2) steps, the r-th 
 of which is given by, 
 
 A r = A r _ 1 - v r p^ - (q r - ar u r ) v^ (r = 1,2,..., n-2) (19) 
 
 -26- 
 
where A is the original matrix A and, 
 
 *r = &> — > °> \,r-l ± V a r+l,r-l' > a n-l,r-l } (20 > 
 
 (21) 
 
 P 1 = u* A . (22) 
 r r r-1 
 
 q = A _ u (23) 
 
 ^r r-1 r v ' 
 
 v = U r (214-) 
 
 r — 7 
 
 S (S + a 77 
 r r — r,r-l; 
 
 °r = P r V r ' (25) 
 
 The sign of (a , + S ) is chosen as indicated in Equation 11. The total 
 r,r-l - r J u ,- 
 
 number of multiplications involved in this reducation is essentially — n . 
 
 The QR algorithm is invariably used with origin shifts [3,5*6,10] 
 described by, 
 
 < ( \" ? k r) "V Vi = < \\ (k.1,2,...) (26) 
 
 where Q, is orthogonal, R is upper triangular, f is the origin shift, and 
 A^ is the upper Hessenberg form of the original matrix A. In order to speed 
 up convergence, the origin shift £, is chosen such that it ultimately ap- 
 proaches an eigenvalue of A. It often happens that the matrix A has complex 
 conjugate eigenvalues which lead to a complex value of t, . Francis [5] 
 proposed the method of double origin shift which combines two QR iterations 
 in one, avoiding complex arithmetic. This method may be described by the 
 relations, 
 
 W. 
 
 ( \ - ^ x » <\ - w « - \ (27) 
 
 \ + 2 " \\ \ (28) 
 
 ■27- 
 
where W. = Q. Q, n is orthogonal, t and t are the origin shifts, and 
 
 K k K+ -L K K+ L 
 
 IL = R n R. is upper triangular. The origin shifts (\ and t. are usually 
 k k+ Ik. -k K+ JL 
 
 taken as the eigenvalues of the last 2x2 diagonal submatrix, and since A 
 is real they are either both real or complex conjugate. Hence the matrix 
 B = (A - (\ I) (A, - £, i) is real and no complex arithmetic is involved. 
 However, since A is nonsingular, A p and W, are determined by the first 
 
 >nly, and the 
 
 are positive. 
 
 column of W, only, and the determination is unique if the subdiagonal ele- 
 
 K. 
 
 ments of 
 
 \+2 
 
 The first column of W, is the same as the first 
 
 k 
 
 column of the elementary Hermitian orthogonal matrix N that eliminates the 
 
 elements of the first column of the matrix B, below the diagonal. 
 
 k 
 
 The first 
 
 column of B has only three non-zero elements. 
 
 K 
 
 They are given by, 
 
 b (k) _ (k) ( (k) 
 
 e ) + a (k) a (k) + Tl 
 -00 ''~00 V + a oi a io + ''1 
 
 (k) _ (k), (k) (k) 
 
 10 " a io la oo + a n 
 
 (k) _ (k) (k) 
 
 20 " a i0 a 21 
 
 ^ 
 
 (29) 
 
 where, 
 
 e k = ^k + 5 
 
 (k) (k) 
 k+1 " a n-2,n-2 + a n-l,n-l 
 
 30) 
 
 - t r - a (k) a (k) a (l ° a^ 
 
 k ^k ^k+i " n-2,n-2 n-l,n-l n-2,n-l n-l,n-2 
 
 Thus the matrix N is given by, 
 
 N = I 
 o 
 
 t 
 o o 
 
 (3D 
 
 !y 
 
 o' '2 
 
 where, 
 
 -28- 
 
y Q = (1, b , t Q , 0, , 0) 
 
 „(*) 
 
 s = 10 
 o 
 
 ^7 S o 
 
 (32) 
 
 .(*) 
 
 t = 20 
 o 
 
 ^¥TT 
 
 t 
 
 ^ (k) ) 2 + (b (k) ) 2 + (b (k ^ 2 
 
 - 00 ; l 10 ; k 20 
 
 J 
 
 Therefore, the matrix C. = N A N is of the form, 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 
 
 
 X 
 
 X 
 X 
 
 X 
 X 
 X 
 
 X 
 X 
 X 
 
 where the underlined elements are those elements of A, affected by the trans- 
 formation. Now the matrix C can he reduced to the upper Hessenberg from 
 A, ? by the orthogonal transformations, 
 
 \+2" 
 
 n-2 n-3 
 
 N^ N* A N N n 
 
 1 O K O 1 
 
 n-3 n-2 
 
 (33) 
 
 where the elementary Hermitian matrix N is chosen to eliminate the 
 elements of the first column of C below the sub diagonal. Therefore, 
 
 C 2 = N l C l N l Wil1 be ° f the f ° rm 
 
 -29- 
 
X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 X 
 
 
 X 
 
 X 
 
 X 
 
 X 
 X 
 
 X 
 X 
 X 
 
 X 
 X 
 X 
 
 Consequently, N is chosen to eliminate the elements of the second column 
 of C below the sub diagonal, and so forth. In general, the orthogonal 
 matrices N (r = 1,2, ...,n-2) are given by, 
 
 where, 
 
 t 
 2Y y 
 
 N = I - r r 
 r 
 
 y 
 
 3h) 
 
 y r = (o, 
 
 ,o,i,s r ,t r ,o, ,o, 
 
 (35) 
 
 in which the first r elements are zero, 
 
 (r) 
 c 
 s = r+l,r-l 
 
 r "771 
 
 c v ; _ + S 
 r, r-1 — r 
 
 c (r) 
 
 t = °r+2,r-l 
 
 r ~TrT 
 
 c K * + S 
 r, r-1 — r 
 
 S = 
 r 
 
 |7cM 
 
 I r,r- 
 
 1> + lC r + l 
 
 ,r-l A l r+2,r-l ; ! 
 
 36: 
 
 (37) 
 
 (38) 
 
 After each iteration the subdiagonal elements are examined for 
 possible partitioning into two or more smaller upper Hessenberg matrices 
 and the iterations continued with the bottom submatrix. Wilkinson [3] and 
 Martin [6] mention that, using the double origin shift method, the average 
 
 -30- 
 
number of iterations per eigenvalue for most matrices that are encountered 
 in practice is less than two. 
 
 5*2 Implementation 
 
 Figure 8 shows the storage allocation scheme for a 16 x 16 non- 
 symmetric matrix on a 16 EE machine. This storage scheme maps an element 
 a. . into the memory as follows: 
 
 a... - loc >Tp _i_ + . _j_ ., PE( (i(mod Vpg)) x >/p e + . _i_ i + j ) 
 
 (mod P E ) (39) 
 
 where P„ is the number of PE's in the machine and LxJ is the greatest in- 
 
 teger less than x. The implementation of the reduction of the original 
 matrix to the upper-Hessenberg form and the QR-iterations is similar to that 
 of the Householder tridiagonalization explained in Section k.2, and hence 
 will not be discussed here in detail. 
 
 5-3 Efficiency 
 
 As a direct extension of the discussions in Sections k.2 and k<3 
 it can be shown that using Equations 19 - 25 the reduction to the upper- 
 Hessenberg of a 6k x 6k matrix would take less than 10 ms on a 6k- PE 
 machine with an efficiency higher than 86$. It will be shown that the Q,R 
 transformations dominate the overall efficiency calculations. Detailed 
 examination of the efficiency of one step in double QR-iteration C _ = 
 NCI follows, where N is defined by Equations 3^ - 38. By virtue of 
 
 Equations 19 - 25, C , may be written as, 
 
 C r+1 = C r - v r p* - ( % - a r j) v* (k0) 
 
 -31- 
 
where p r = y r C., g^ = C r y r , v y = 2 y r , « p = P r v r , and y r is defined 
 
 lly r ll 2 
 
 by Equations 35-38. On a parallel computer the efficiency of the process 
 used for finding the square roots affects the overall efficiency of the 
 Q,R- algorithm. Therefore, in order to minimize the inefficiencies, we 
 describe the following square root routine [11]. Any positive floating 
 number Z may be expressed as Z = 2 (x) where either - < x < 1 or 
 j- < x < — . Thus v Z = 2 (v x) and the problem is reduced to finding the 
 square root of x. We use the recursive relation, 
 
 Vl = ^ (3 " X W iP 4 = 0, 1, 2, ... (1+1) 
 
 where, 
 
 2 
 w = a + bx + ex (1+2) 
 
 in which, 
 
 a = 3.1621U 
 
 b =-5.86118 for i < x < \ (1+3) 
 
 c = k. 75000 
 
 and, 
 
 a = 2.22152 
 
 b =-2.031^6 ' for i < x < 1 (1+1+) 
 
 c = 0.81250 
 
 W i COnVerges <l uad:cat: i- ca lly "to _1_ and v x = (x) lim oj,. 
 
 -32- 
 
i iterations prove sufficient for evaluating the square root of x to 
 bits of accuracy. Assuming this accuracy satisfactory, six is given 
 
 sfx = o p x - 3 P 3 x 2 - 1 x 2 7 3 (1^5) 
 
 k 5" 2 
 
 here, 
 
 P =3 -. "ix.3 
 2 2 
 
 7 - 3 3 - 1 x p 3 (U6) 
 
 2 2 
 
 a) is given by Equations U2 - UU. 
 
 Figure 9 shows how the arithemtic operations would be performed 
 parallel computer for finding the square root, 
 
 S = 
 r 
 
 r ( r ) \2 /- ( r ) \ 2 / ( r ) n 2 "H 
 L^^r-l) + K + l, r-l> + ( °r + 2, r-l } J > 
 
 dements of the vector y , 
 
 (r) (r) 
 
 c c 
 
 s = r+1. r-1 , t - r+2, r-1 , and the ratio 
 
 r (r) r (r) 
 
 c v ' . + S c V ; , + S 
 
 r, r-1 — r r, r-1 — r 
 
 c^ , S 
 
 r, r-1 + r . 
 
 2 S 
 
 r 
 
 -33- 
 
FIGURE 9 
 
 C P ,r-l C r + |, r -| C r +2, r -l 
 
 OPERATI« 
 
 + 
 + 
 
 2 e 2 
 
 9/4 Sf 3/4 S, S^ 
 
 3/2 
 
 Sf/2 
 
 3/2 
 
 sf/2 
 
 + 
 
 * 
 * 
 
 * 
 
 * 
 * 
 
 •r.r-l 
 
 C r ,r-I +S ' 
 
 S r C r + | |f -| C r + 2>r -| 
 
 \^ \/f \J 
 
 2 • 
 
 yrii 
 
 Sr 
 
 tr 
 
 ■3^- 
 
 10 * 
 9 + 
 
In Figure 9 the operations at a given level in the tree are identical and 
 shown to the right, and the totals are shown below. Unlabeled nodes are 
 assumed to use the result computed immediately above. In this and subsequent 
 discussions, the insignificant routing and broadcasting times are neglected. 
 Assuming that one division is equivalent to five multiplications, and two 
 additions are equivalent to one multiplication, the evaluation of the above 
 quantities on a serial machine would effectively require 35 multiplications. 
 On a P > 8 machine the same process requires only the equivalent of 20 
 multiplications or 10 |U.s. This is clear from the fact that the tree height 
 
 is equivalent to 20 multiplications and its width does not exceed 8. The 
 
 35/P 
 
 efficiency of the parallel computation of this part is ' ' E . This is a 
 
 20 
 rather conservative definition of efficiency. It assumes that the total time 
 
 "necessary" for a parallel machine to evaluate a function is just the equiva- 
 lent number of multiplications divided by the number of PE's available. 
 
 The row vector p = y C has its first (r - l) elements equal to 
 
 zero. Its formation requires the equivalent of 3(n - r+l) multiplications. 
 
 The column vector q = C y has (n - r - k) zero last elements for r = 1, 
 
 r r r 
 
 2, . . . , n - 5- Its construction requires the equivalent of (3r + 10 ) 
 multiplications. Assuming that the size of the matrix is n = (m)P where the 
 integer m > 1, the average number of equivalent multiplications required 
 for computing either p or q on a parallel computer is given by 
 
 1 5 3 
 
 *- X 3J = t: (m + 1) . In order to calculate the overall efficiency, 
 m . ., ° 2 
 
 observe that for the matrix C there are m diagonal P x P submatrices; 
 for each, the corresponding portion of either p or q requires three 
 equivalent multiplications with an average efficiency of 50%, and for each 
 of the - m(m - l) P x P submatrices above the diagonal, the corresponding 
 
 <C J_j J_i 
 
 portion of either p or q also requires three equivalent multiplications, 
 but with an efficiency of 100%. Weighing the individual efficiencies with 
 the corresponding number of equivalent multiplications, the overall average 
 efficiency for computing either p or q is calculated as follows: 
 
 Eff f = (3m) 0.5 + | m(m - l) (1.0) = m 
 
 ^r rr 3 / ~ m + 1 
 
 3m + - m(m - 1) 
 
 -35- 
 
Since v and q contain some zero elements and since the computation of 
 r r 
 
 2 t 
 
 v = v requires only two multiplications and that of a = p v 
 
 r i I I 1 2 °r r ^r r 
 
 I |y r l I 
 
 requires four multiplications, both v and a can be computed simultan- 
 eously with p and q respectively. The computation of the column vector 
 a y which requires two multiplications can also be performed simultan- 
 eously with the computation of the matrix v p • The average number of 
 
 multiplications required for computing v p on a parallel machine is the 
 same as that of p and q and with the same efficiency. The construction 
 of the vector (q - a y ) requires the equivalent of ^ multiplications 
 
 with an efficiency of |^-\. The average number of multiplications required 
 
 »)• 
 
 for forming the matrix '(q - a y ) v on a parallel computer is also 
 
 identical to that of p or q and has the same efficiency. The same holds 
 
 r r t t 
 
 for constructing the sum C -v p - (q - (X y)v. Neglecting the time 
 
 taken to find the origin shifts for every iteration, it can be concluded 
 
 that the total time required for one double QR-iteration on a parallel 
 
 computer of P PE's and for a matrix of size n = (m)P is essentially 
 
 Is E 
 
 ((m)P - 2) [10-75 + 3-75 (m + l)] jli.s. with an efficiency of 
 
 0.5 (35/PJ + 3-75 m + (2.25/P T 
 
 'E 
 
 10.75 + 3-75 (m + l) 
 
 Thus, the time required for one QR-iteration for a 256 x 256 matrix on a 
 16-PE machine (assuming that the matrix can be contained in the primary 
 memory) is 19 ms and the efficiency of parallel computation is 82$. If 
 P E = 6k, then the time taken is 7-5 ms with an efficiency of 52$. Clearly, 
 the minimum efficiency of the QR-iteration occurs when m = 1. For example, 
 for n = P E = 6k, the efficiency is 22$. In order to establish a rough 
 lower bound on the overall efficiency of the QR algorithm, consider the 
 case for which m = 1 and when the matrix gets deflated by two rows and two 
 columns every two iterations. This efficiency is given by, 
 
 -36- 
 
Eff. =* 1 £ f . (U8) 
 
 where, 
 
 j =4, 6, 8, , .., n-2 , N = fn-5 1 for n - even 
 
 r*i 
 
 J = 3, 5, 7. •••, n-2 , N = n-k +1 for n - odd 
 
 M 
 
 and, 
 
 f, - _^ (38.5 + 3-75 j 
 
 J l8.25n 
 
 Thus, for n = P„ = 6k, the QR algorithm efficiency is approximately 11.50$. 
 Is 
 
 -37- 
 
REFERENCES 
 
 1. Barnes, G. H., Brown, et. al. "The ILLIAC IV Computer/' IEEE 
 
 Transactions on Computers . C-17:8 (August, 1968), 7^6- 
 757- 
 
 2. Kuck, D. "ILLIAC IV Software and Application Programming," IEEE 
 
 Transactions on Computers . C-17:8 (August, 1968), 757-770. 
 
 3. Wilkinson, J. V. The Algebraic Eigenvalue Problem . Clarendon Press, 
 
 Oxford, 1965. 
 
 k. Martin, R. S-, et al. "Householder's Tridiagonalization of a Symmetric 
 Matrix," Numerische Mathematik . Vol. 11 (1968), 181-195- 
 
 5- Francis, J. G. F. "The QR Transformation, Parts I and II," Computer 
 Journal . Vol. k (I96I-I962), 265-271 and 332 - 3^5- 
 
 6. Martin, R. S., et al. "The QR Algorithm for Real Hessenberg Matrices," 
 Numerische Mathematik . Vol. Ik (1970), 219-231. 
 
 7- Wilkinson, J. H. "Calculation of the Eigenvalues of a Symmetric 
 Tridiagonal Matrix by the Method of Bisection, " Numerische 
 Mathematik . Vol. k (1962), 362-367. 
 
 8. Wilkinson, J. H. "Calculation of the Eigenvectors of a Symmetric Tri- 
 diagonal Matrix by Inverse Iteration, " Numerische Mathematik . 
 Vol. k (1962), 368-376. 
 
 9- Parlett, B. "Global Convergence of the Basic QR Algorithm on Hessenberg 
 Matrices," Mathematics of Computation . Vol. 22 (1968), 803-817. 
 
 10. Wilkinson, J. H. "Global Convergence of Tridiagonal QR Algorithm with 
 
 Origin Shifts," Linear Algebra and its Applications . Vol. 1 (1968), 
 1+09-1+20. 
 
 11. Katz, M. L. New and Revised Mathematical Subroutines for ILLIAC IV . 
 
 ILLIAC IV Document No. 165, ILLIAC IV Project, University of Illinois, 
 Urbana, Illinois. January 27, 1969. 
 
 ■38- 
 
ACKNOWLEDGMENT 
 
 We are indebted to Professor David Young for helpful discussions 
 and to Mr. James Madden for his comments and computational experiments. 
 Sincere thanks go to Mrs. K. Flessner for her patience and accuracy in 
 typing this report. 
 
 -39- 
 
UNCLASSIFIED 
 
 Security Classification 
 
 DOCUMENT CONTROL DATA -R&D 
 
 (Security claaalllcatlon of till*, body of aba tract and Indeahtg annotation mum I he entered when the overall report la clamallled 
 
 originating ACTIVITY (Corporate author) 
 
 Center for Advanced Computation 
 University of Illinois at Urbana-Champaign 
 Urbana, Illinois 6l801 
 
 2a. REPORT SECURITY C L A SSI F I C * TIO 
 
 UNCLASSIFIED 
 
 2b. CROUP 
 
 3. REPORT TITLE 
 
 PARALLEL COMPUTATION OF EIGENVALUES OF REAL MATRICES 
 
 4. DESCRIPTIVE NO res (Type of report and Inclualre da fa) 
 
 Research Report 
 
 B •UTMORISIfF/nl 
 
 middle Initial, laat name) 
 
 David J. Kuck and Ahmed Sameh 
 
 8 REPORT DATE 
 
 November 1, 1971 
 
 7». TOTAL NO. OF PACES 
 
 >+3 
 
 7b. NO. OF REFS 
 
 11 
 
 •a. CONTRACT OR CRANT NO. 
 
 DAHC01+ 72-C-OOOl 
 b. PROJECT NO. 
 
 ARPA Order 1899 
 
 •a. ORIGINATOR'S REPORT NUMBER(S) 
 
 CAC Document No. 9 
 DCS Report No. 1*87 
 
 9b. OTHER REPORT NOISI (Any other numbers that may be aaalgned 
 thla report) 
 
 10 DISTRIBUTION STATEMENT 
 
 Copies may be obtained from the address given in (l) above, 
 
 II. SUPPLEMENTARY NOTES 
 
 None 
 
 12. SPONSORING MILI TARV ACTIVITY 
 
 U.S. Army Research Office-Durham 
 
 Duke Station 
 
 Durham, North Carolina 
 
 13. ABSTRACT 
 
 This paper describes the implementation of three standard matrix 
 eigenvalue computation methods on an array machine with high efficiency. A 
 brief description of the ILLIAC IV computer is provided as background material. 
 Three major sections follow — the first two describe Jacobi and Householder 
 algorithms for real symmetric matrices, and the third describes the QR algorithm 
 for real nonsymmetric matrices. Each of these sections is divided into four 
 parts. The theoretical background of the method is presented first. An ILLIAC IV 
 implementation of the algorithm is presented and timing estimates are included. 
 Next, the efficiency (ratio of number of computations actually performed to number 
 computations possible in a given time) of the computation on an array machine is 
 derived for primary memory contained matrices. Finally, eigenvalue computations 
 for matrices too large to be contained in primary memory are discussed in terms 
 of a head-per-track secondary storage device. It is shown that for Jacobi and 
 Householder algorithms, parallel computation efficiencies of 90% are possible, 
 while those of the QR algorithm are rather low. 
 
 DD ,T. .1473 
 
 UNCLASSIFIED 
 
 Security Classification 
 
UNCLASSIFIED 
 
 Security Classification 
 
 KEY WORDS 
 
 Mathematics of Computation (General) 
 
 I I 
 
 UNCLASSIFIED 
 
 Security Classification 
 
HBHI I 
 
 
 HB 
 
 UNIVERSITY OF ILLINOIS-URBANA 
 
 \ UN.VEBSITYOFIUUNO.S-URBANA 
 
 \ 510 84IL6Rno.C002no.480-«B(1ffn 
 ■ 8,„c h -onlz.«on S ,«. m to..«.l.vl»lon 
 
 ■ 
 
 ^01^2 088400061 
 
 ^^^1 I '*/• *J I iiJB 
 
 ■ I 
 
 •V tf ■ 
 
 
 
 
 
 
 
 
 . * 
 
 
 ■ 
 
 * I ■ * , " 
 
 
 
 
 ->. 
 
 ,•>,*' 
 
 
 *J 
 
 • •."* ^