CAC Document No. 9
DCS Report No. 487

PARALLEL COMPUTATION OF EIGENVALUES OF REAL MATRICES

by

David J. Kuck
Ahmed Sameh

Center for Advanced Computation
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801

November 1, 1971

This work was supported in part by the Advanced Research Projects Agency of the Department of Defense and was monitored by the U.S. Army Research Office-Durham under Contract No. DAHC04 72-C-0001.

ABSTRACT

This paper describes the implementation of three standard matrix eigenvalue computation methods on an array machine with high efficiency. A brief description of the ILLIAC IV computer is provided as background material. Three major sections follow: the first two describe Jacobi and Householder algorithms for real symmetric matrices, and the third describes the QR algorithm for real nonsymmetric matrices. Each of these sections is divided into four parts. The theoretical background of the method is presented first. An ILLIAC IV implementation of the algorithm is then presented, and timing estimates are included. Next, the efficiency (ratio of the number of computations actually performed to the number of computations possible in a given time) of the computation on an array machine is derived for matrices contained in primary memory. Finally, eigenvalue computations for matrices too large to be contained in primary memory are discussed in terms of a head-per-track secondary storage device. It is shown that for the Jacobi and Householder algorithms parallel computation efficiencies of 90% are possible, while those of the QR algorithm are rather low.

TABLE OF CONTENTS

1. INTRODUCTION
2. ILLIAC IV OVERVIEW
   2.1 System Organization
   2.2 Programming ILLIAC IV
3. JACOBI'S METHOD
   3.1 Theoretical Background
   3.2 Implementation
   3.3 Efficiency
   3.4 Disk Considerations
4. HOUSEHOLDER'S METHOD
   4.1 Theoretical Background
   4.2 Implementation of Householder Tridiagonalization
   4.3 Efficiency of the Tridiagonalization
   4.4 Disk Considerations
5. THE QR ALGORITHM
   5.1 Theoretical Background
   5.2 Implementation
   5.3 Efficiency
REFERENCES
ACKNOWLEDGMENT

1. INTRODUCTION

The implementations on a parallel computer of three of the most commonly used techniques for finding the eigenvalues and eigenvectors of matrices are discussed: Jacobi's method, Householder's method, and the QR method. Jacobi's algorithm for real symmetric matrices was modified to take advantage of parallelism in such a way that, instead of two, one orthogonal transformation eliminates n off-diagonal elements, where n is the size of the matrix. Storage schemes for all three methods on the parallel computer are discussed in detail. Jacobi's algorithm proved to be highly efficient for parallel computation; yet its use is recommended primarily when evaluation of all the eigenvalues is required.
Even then, if high accuracy is also desired, it should be used in conjunction with a scheme for establishing bounds for the eigensystem. Householder's algorithm for tridiagonalization of a real symmetric matrix has also proven to be very efficient on a parallel computer; a parallel modification of the method of bisection can be used on the resulting tridiagonal matrix to obtain one or two eigenvalues or to investigate the distribution of all of them. If, however, all the eigenvalues are required, the QR method is usually preferred. The QR algorithm is discussed here only for nonsymmetric matrices. Implementation of this method on the parallel machine proved to be less efficient than the other two, though it is the most reliable for obtaining the eigenvalues of any matrix. For Householder's or the QR algorithm, the eigenvectors of the tridiagonal or the upper Hessenberg matrices may be obtained by the method of inverse iteration. If the transformation matrices used in reducing the matrix to the condensed form are stored, the eigenvectors of the original matrix can be readily computed.

2. ILLIAC IV OVERVIEW

2.1 System Organization

ILLIAC IV is an array of coupled computers driven by instructions from a common control unit. Each of the processing elements (PE's) has 2048 words of 64-bit memory with a cycle time under 350 nanoseconds. Each PE is capable of 64-bit floating point multiplication in about 500 nanoseconds and addition in about 350 nanoseconds. Two 32-bit floating point operations may be performed in each PE in approximately the same time. The PE instruction set is similar to that of conventional machines, with two exceptions. First, the PE's are capable of communicating data to four neighboring PE's by means of routing instructions. Second, the PE's are able to set their own mode registers to effectively disable or enable themselves.

Figure 1 shows 64 PE's, each having three arithmetic registers (A, B, and C) and one protected programmer register (S). The registers, words, and paths in Figure 1 are all 64 bits wide, except the PE index registers (XR), mode registers, and as noted. The mode register may be regarded as one bit which may be used to block the participation of its PE in any action. The routing registers are shown connected to neighbors at distances of ±1 and ±8; similar end-around connections are provided at 1, 64, etc.

Programs and data are stored in PE memory. Instructions are fetched by the control unit (CU) as required from PE memory. Figure 1 also shows the essential registers and paths in the CU and their relations to the PE's. Instructions are decoded and control signals are sent to the PE array from the control unit. Some instructions are executed directly in the CU, e.g., the loading of CU accumulator registers (CAR's) with program literals. Operand addresses may be indexed once in the CU by a CAR and again separately in each PE by XR. It is possible to load the local data buffer (64 words of 64 bits each) and CAR's from PE memory. Local data buffer registers and CAR's may be loaded from each other. A broadcast instruction allows the contents of a CAR to be transmitted simultaneously to all PE's. (A 64-PE ILLIAC IV system is expected to be operating soon. For a more complete description of the system see [1].)

[Figure 1. Instruction and data fetching. The control unit holds a 64-word instruction buffer with associative memory, a 64-word data buffer, the instruction decoder, CU indexing, and CAR 0 through CAR 3; paths for PE control signals, data fetching, and data broadcasting connect it to PE 1, ..., PE i, ..., PE 64, each with a mode register, XR, MIR, routes to PE's i ± 1 and i ± 8, and a 2048-word, 64-bit PE memory.]
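As a concrete illustration of the mode-register mechanism, the following sketch is ours in Python, not ILLIAC IV code; a NumPy vector stands in for the 64 PE's operating in lock step, and the boolean vector plays the role of the mode registers.

    import numpy as np

    # Illustration only: 64 "PE's" execute one instruction in lock step;
    # a PE whose mode bit is off simply keeps its old accumulator value.
    P_E = 64
    acc = np.arange(P_E, dtype=np.float64)     # one accumulator (A register) per PE
    operand = np.full(P_E, 10.0)               # one operand fetched per PE memory
    mode = (acc % 2 == 0)                      # each PE sets its own mode bit

    # A single broadcast-style instruction: enabled PE's add, disabled PE's hold.
    acc = np.where(mode, acc + operand, acc)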
It is often convenient to manipulate all PE mode bits or a number from one PE in a CAR. For this purpose, the broadcast path is bi-directional.

The complete ILLIAC IV system includes a Burroughs 6500 which performs input/output operations and compilation, and contains an operating system to control ILLIAC IV. There is a 10^9 bit, head-per-track disk with a rotation period of 40 milliseconds and an effective transfer rate of 10^9 bits/second. This allows the loading of the 8 x 10^6 bit ILLIAC IV memory from the disk in 32 milliseconds. The input/output controller (IOC) contains a queuer for disk requests and a buffer memory to smooth transmissions to and from the slower B6500 memory or an external input/output device.

2.2 Programming ILLIAC IV

It is apparent that ILLIAC IV may be suited to the execution of algorithms defined on arrays of data (e.g., matrix operations and certain partial differential equations). However, there are two major difficulties in using such a machine. One is that an algorithm must be devised which allows identical computations to be performed simultaneously on a large number of operands. The other, closely related to the first, is that the data must be so placed in the ILLIAC IV memory that over the course of the calculation few PE's are disabled. In other words, the algorithm and the data storage allocation must be mutually designed for highly parallel computation. For this reason, ILLIAC IV codes are expressed as two parts: a storage allocation block, followed by a computational algorithm block. These codes may be written in a high level language; see, for example, [2].

3. JACOBI'S METHOD

3.1 Theoretical Background

The classical Jacobi method reduces a symmetric matrix to the diagonal form by a series of orthogonal transformations [3],

    A_{r+1} = V_r^t A_r V_r        (0)

where each V_r is orthogonal. As noted in the Introduction, the method is modified here so that each orthogonal transformation eliminates n off-diagonal elements rather than two.
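For illustration, the following serial Python sketch is ours (the parallel scheme of this section instead combines many plane rotations into one orthogonal transformation and distributes them across the PE array); it carries out one cyclic sweep of the classical method.

    import numpy as np

    def jacobi_sweep(A):
        """One cyclic sweep of the classical Jacobi method (serial sketch)."""
        A = A.copy()
        n = A.shape[0]
        for p in range(n - 1):
            for q in range(p + 1, n):
                if A[p, q] == 0.0:
                    continue
                # Rotation angle chosen so the (p,q) element of V^t A V vanishes.
                theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                V = np.eye(n)
                V[p, p] = V[q, q] = c
                V[p, q], V[q, p] = s, -s
                A = V.T @ A @ V        # A_{r+1} = V_r^t A_r V_r, Equation 0
        return A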
3.2 Implementation

[Figures 2 through 4. Figure 2 gives the storage scheme for a 16 x 16 symmetric matrix in the memories of 16 PE's.]

The total time required to transform a square 64 x 64 matrix is less than 12 milliseconds. Next, the rows are permuted as shown in Equation 4. The storage scheme of Figure 2 allows this to be done by moving relatively few elements of the matrix. Figure 5 shows the resulting matrix, in which rows 0, 1, and 8 and columns 4, 8, and 12 have been moved. Each column is moved left one unit route, row 0 is moved right three unit routes, and rows 1 and 8 are moved somewhat irregularly. However, the total routing time involved is negligible with respect to the transformation time. In general, three rows and √P_E - 1 columns will be moved in a similar way. Figure 6 is a relabeled version of Figure 5, showing the new names of the shuffled elements. Note by comparison with Figure 2 that the origin has moved nine PE's to the right and one row down in PE memory. This is a matter which is easy to accommodate in a program for carrying out the Jacobi process. At this point the entire process is repeated, starting with the computation of a new set of transformations. After n - 1 such steps the second shuffle is performed according to Equation 5. The second shuffle requires less routing than the first: row 0's routing is similar to the routing of row 1 in the above discussion, with columns 4, 8, and 12 pushed back distance one to accommodate the new column.

3.3 Efficiency

First, an efficiency calculation for the simple process described above shall be performed, operating on a (k√P_E x k√P_E) matrix using P_E PE's. Assuming one time unit to operate on a (√P_E x √P_E) partition,

    Σ_{i=1}^{k-1} i = k(k - 1)/2

time steps are used for the square partitions, plus k time steps for the triangular partitions on the main diagonal. If the elements which are to be made zero are not computed (just set to zero), only k/2 steps are needed to transform the triangular partitions on the main diagonal. Thus, there is an efficiency of

    [k(k - 1)/2 + k/2] / [k(k - 1)/2 + k] = k/(k + 1).

For a 64 x 64 matrix and 64 PE's (i.e., k = 8 and √P_E = 8), there is an efficiency of 8/9.

[Figures 5 and 6. The matrix of Figure 2 after the first shuffle, and the same matrix relabeled.]

This same technique and storage allocation scheme may be used on matrices of dimensions less than the number of PE's; the discussion holds for the transformation of A_00, A_01, A_10, and A_11 operated on as a √P_E x √P_E partition of a larger matrix. The process may be used repeatedly on a large matrix or just once on a √P_E x √P_E matrix. Full efficiency can be achieved on square matrices of dimension k√P_E x k√P_E.

The maximum efficiency which can be achieved for the Jacobi process on a parallel machine depends on how the diagonal partitions, and any columns in addition to k√P_E, are treated. Suppose an attempt is made to transform several diagonal partitions of the example simultaneously, and to subsequently repeat this pattern on the main diagonal. This procedure will achieve 100% efficiency during the transformation of the triangular matrices, but an irregular route of two elements will be incurred, since pairs of the elements involved lie in the same PE.
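As a quick arithmetic check of the k/(k + 1) bound derived above, the step counts can be tabulated directly (a sketch of ours, using exact fractions to avoid rounding).

    from fractions import Fraction

    def jacobi_efficiency(k):
        """Efficiency of the partitioned sweep on a (k sqrt(P) x k sqrt(P)) matrix."""
        square = Fraction(k * (k - 1), 2)      # steps for square off-diagonal partitions
        needed = square + Fraction(k, 2)       # diagonal zeros set, not computed
        used = square + Fraction(k)            # full triangular partition steps
        return needed / used                   # simplifies to k / (k + 1)

    print(jacobi_efficiency(8))                # 8/9 for a 64 x 64 matrix on 64 PE's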
3.4 Disk Considerations

Suppose N > 64 is the size of the matrix, partitioned into 64 x 64 blocks with NR rows of blocks. Let the number of any block b be given by

    b = R + Σ_{k=0}^{C-1} k        (C, R = 1, 2, ..., NR)

where R is the row number of the block and C is the column number of the block. The blocks

    b = R + Σ_{k=0}^{R-1} k,        R = 1, 2, ..., NR,

are brought in from the disk sequentially to be subjected to an orthogonal transformation, Equation 0. Thus all the rotation angles are determined, and the rest of the blocks,

    b = R + Σ_{k=0}^{C-1} k,        R = 1, 2, ..., NR,  C ≠ R,

are fetched from the disk to complete the given transformation of the whole matrix. (The results of this section are valid for various secondary storage devices, e.g., a drum, a delay line, or the head-per-track disk which we discuss.)

Some of the rows and columns of each submatrix are stored in arrays in primary memory for purposes of the shifting process that takes place after the whole matrix is subjected to a transformation. Therefore, the following elements will be stored in primary memory for shifting the second row and second column:

1. b = R + Σ_{k=0}^{R-1} k (R > 1): the first row of the submatrix, i.e., the elements a_ij with i = 64(R - 1) and j = 64(R - 1), ..., 64R - 1.

2. b = R + Σ_{k=0}^{C-1} k (R > 1, C > R): the first row and first column of the submatrix, i.e., the elements a_kl and a_rs with k = 64(R - 1), l = 64(C - 1), ..., 64C - 1, and s = 64(C - 1), r = 64(R - 1), ..., 64R - 1.

3. b = 1 + Σ_{k=0}^{C-1} k (C > 1): the second row and first column of each submatrix, i.e., the elements a_pq and a_tu with p = 1, q = 64(C - 1), ..., 64C - 1, and u = 64(C - 1), t = 0, ..., 63.

4. b = 1: the second row and second column of the submatrix, i.e., the elements a_vw and a_xy with v = 1, w = 1, ..., 63, and y = 1, x = 1, ..., 63.

For the first row and first column shifting, the elements to be stored in core are the same as above except those of items (3) and (4):

3'. b = 1 + Σ_{k=0}^{C-1} k (C > 1): the first row and first column of each submatrix, i.e., the elements a_pq and a_tu with p = 0, q = 64(C - 1), ..., 64C - 1, and u = 64(C - 1), t = 0, ..., 63.

4'. b = 1: the first row only, i.e., the elements a_vw with v = 0 and w = 0, ..., 63.

The total number of elements to be stored in primary memory in each kind of shifting is N(NR + 1).

The primary memory of 64 PE's is capable of holding 32 partitions of 64 x 64; the transformation time for a square block is about 12 ms, and the rotational delay of the head-per-track disk is 40 ms. If the partitions on the disk are laid out in such a way that four of them are accessed at once, then 48 ms later the four 64 x 64 partitions will have been transformed and the disk will have rotated more than one complete revolution. Thus, if the requests for four partitions in the disk queuer are kept one rotation ahead, the rotational latency of the disk can be masked in computing on large matrices.
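The block numbering and the two fetch phases may be made concrete with a short sketch of ours, using the block indices as defined above.

    def block_number(R, C):
        """b = R + sum_{k=0}^{C-1} k = R + C(C-1)/2, with 1 <= R <= C <= NR."""
        return R + C * (C - 1) // 2

    def fetch_schedule(NR):
        # Phase 1: diagonal blocks, which determine the rotation angles.
        diag = [block_number(R, R) for R in range(1, NR + 1)]
        # Phase 2: remaining blocks of the stored triangle (C != R).
        rest = [block_number(R, C) for C in range(2, NR + 1)
                                   for R in range(1, C)]
        return diag, rest

    print(fetch_schedule(4))    # ([1, 3, 6, 10], [2, 4, 5, 7, 8, 9])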
4. HOUSEHOLDER'S METHOD

4.1 Theoretical Background

This method [3,4] reduces a symmetric matrix A = [a_ij] (i,j = 0, 1, 2, ..., n-1) to the tridiagonal form T using elementary Hermitian orthogonal matrices. There are (n - 2) steps in this reduction. We define the matrices A_r by the relations

    A_r = P_r A_{r-1} P_r        (r = 1, 2, ..., n-2)        (6)

where A_0 = A is the original matrix. The matrices P_r are defined by

    P_r = I - u_r^t u_r / (2K_r)        (7)

in which u_r is a row vector (whose first r elements are zero) given by

    u_r = (0, ..., 0, a_{r-1,r} ± S_r, a_{r-1,r+1}, ..., a_{r-1,n-1}),        (8)

    S_r^2 = Σ_{i=r}^{n-1} a_{r-1,i}^2,        (9)

and

    2K_r = S_r (S_r ± a_{r-1,r})        (10)

where the elements a_ij are those of A_{r-1}. The signs in both expressions (8) and (10) are chosen such that

    |a_{r-1,r} ± S_r| = |a_{r-1,r}| + S_r.        (11)

Therefore, in the r-th step (A_r = P_r A_{r-1} P_r), [n - (r + 1)] zeros are introduced in the r-th column and row of A_{r-1} without destroying the zeros introduced in the previous steps.

In order to take advantage of symmetry, the matrices A_r may be defined by the relations [3]

    A_r = A_{r-1} - q_r^t u_r - u_r^t q_r        (12)

where u_r is as given above and q_r is given by

    q_r = p_r - [p_r u_r^t / (2 · 2K_r)] u_r        (13)

in which

    p_r = u_r A_{r-1} / (2K_r).        (14)

The total number of multiplications required to reduce the matrix to the tridiagonal form is essentially (2/3) n^3.

The eigenvalues of the tridiagonal matrix may be obtained by several methods. For this parallel computer it is proposed that if all the eigenvalues are required, then the QR algorithm with origin shift [3,5,6] is used (this algorithm is discussed for real nonsymmetric matrices in the following section). However, if one is interested in the general distribution of the eigenvalues and not in their accurate values, or in the eigenvalues in a given interval, the bisection method [3,7] may be used.

In order to show how the method of bisection would be used on a parallel computer, assume that it is necessary to obtain all the eigenvalues of the resulting tridiagonal matrix T using such a method. For the sake of illustration, assume also that all the subdiagonal elements are non-zero and hence the matrix has simple eigenvalues. The eigenvalues lie in the interval [-b, b], where b = ||T||_∞ = max_i Σ_j |t_ij|. This interval is then divided into 63 subintervals, and for all μ_k (k = 1, 2, ..., 64), where μ_1 = -b and μ_64 = +b, the quantities

    f_i(μ_k) = (t_{i-1,i-1} - μ_k) f_{i-1}(μ_k) - t_{i-2,i-1}^2 f_{i-2}(μ_k),        i = 2, 3, ..., n,        (15)

with f_0(μ_k) = 1 and f_1(μ_k) = t_{0,0} - μ_k, are evaluated in parallel, one PE to each μ_k. For each μ_k the number of agreements in sign s(μ_k) of consecutive members of the sequence f_0(μ_k), f_1(μ_k), ..., f_n(μ_k) is the number of eigenvalues larger than μ_k. The process is repeated for those intervals [μ_k, μ_{k+1}] that contain more than one eigenvalue until all the eigenvalues are separated. Assuming that the matrix is of order 64, 64 intervals, each containing an eigenvalue, are established. By assigning each interval to a PE and by successive bisection of each interval, using the Sturm sequence property, all the eigenvalues are obtained.

Once all the eigenvalues are available, the eigenvectors of the tridiagonal matrix are obtained by the method of inverse iteration [3,8]. Again this can be done in parallel, each PE evaluating an eigenvector. Finally, if Z_i (i = 1, 2, ..., n) is an eigenvector of the tridiagonal matrix T, then the corresponding eigenvector Y_i of the original matrix A is computed by

    Y_i = P_1 P_2 ··· P_{n-2} Z_i.        (16)
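A serial Python sketch of the Sturm count of Equation 15 and of the bisection step follows (ours; the sign conventions at exact zeros of f_i and the rescaling needed to avoid overflow for large n are glossed over).

    def sign_agreements(d, e, mu):
        """s(mu): agreements in sign of consecutive f_i(mu), Equation 15.
        d = diagonal of T, e = subdiagonal; the result equals the number
        of eigenvalues larger than mu."""
        count = 0
        f_prev, f_cur = 1.0, d[0] - mu          # f_0 and f_1
        if f_cur > 0.0:
            count += 1
        for i in range(1, len(d)):
            f_next = (d[i] - mu) * f_cur - e[i - 1] ** 2 * f_prev
            if f_next * f_cur > 0.0:
                count += 1
            f_prev, f_cur = f_cur, f_next
        return count

    def bisect(d, e, lo, hi, tol=1e-12):
        """Refine an interval [lo, hi] containing exactly one eigenvalue;
        on the array machine each PE runs this on its own interval."""
        s_lo = sign_agreements(d, e, lo)
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if sign_agreements(d, e, mid) == s_lo:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)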
4.2 Implementation of Householder Tridiagonalization

We shall consider the implementation of Householder's method on a 16-PE machine using the storage allocation scheme of Figure 7 for an upper triangular 16 x 16 matrix. This scheme is somewhat less complicated than Figure 2, since it is not necessary to provide for shuffling. In particular, adjacent row and column elements are always a unit route from each other, except between the seventh and eighth rows. The lower half of the matrix is stored in reverse order to allow efficient computation on this part of the matrix. This scheme also allows access to any row with one memory fetch. Operations may be performed on √P_E x √P_E blocks at any step, where the size of the matrix n is equal to the number of PE's P_E. The memory map for this scheme consists of two parts: for the rows 0 ≤ i ≤ P_E/2 - 1,

    a_ij → loc i, PE ((i (mod √P_E)) √P_E + j) (mod P_E),        (17)

the remaining rows being stored in reverse order in the locations left unused by the triangular upper half.

The reduction begins with the first row of A, (a_00, a_01, ..., a_{0,n-1}); this row is in location 0, Figure 7. The first diagonal element of the tridiagonal matrix T is t_00 = a_00. The element t_01 takes the value of ±S_1 determined by Equation 9 (with r = 1) and sign opposite to that determined by Equation 11. The rest of the elements of the first row of T are, of course, zero. The vector u_1 is then readily constructed by Equations 8, 9, and 10 in a time which is negligible compared to the following calculations. The vectors p_1 and q_1 are then computed with almost full efficiency using Equations 13 and 14.

Once u_1, q_1, and the first row of the tridiagonal matrix T are constructed, matrix A_1 is obtained using Equation 12. However, this transformation can be applied to the individual √P_E x √P_E submatrices A_RC, where R, C = (0, 1, ..., √P_E - 1), thus dividing the vectors u and q into √P_E subvectors, each of √P_E components. The submatrices A_RC at any step r are given by

    (A_RC)_r = (A_RC)_{r-1} - (q_r^t)_R (u_r)_C - (u_r^t)_R (q_r)_C.        (18)

The entire time required for reducing a 64 x 64 symmetric matrix to the tridiagonal form on a 64-PE machine is less than 4 ms.
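The update of Equations 12-14 is compact enough to sketch in full. The following is a dense, serial rendering of ours; the √P_E x √P_E blocking of Equation 18 and the PE storage map are omitted.

    import numpy as np

    def tridiagonalize(A):
        """Householder tridiagonalization per Equations 6-14 (dense sketch)."""
        A = np.array(A, dtype=float)
        n = A.shape[0]
        for r in range(1, n - 1):                      # r = 1, ..., n-2
            x = A[r - 1, r:]                           # trailing part of row r-1
            S = np.linalg.norm(x)                      # Eq. 9
            if S == 0.0:
                continue
            u = np.zeros(n)                            # Eq. 8; sign by Eq. 11
            u[r:] = x
            u[r] += np.copysign(S, x[0])
            twoK = S * (S + abs(x[0]))                 # 2K_r, Eq. 10
            p = A @ u / twoK                           # Eq. 14
            q = p - (p @ u / (2.0 * twoK)) * u         # Eq. 13
            A -= np.outer(q, u) + np.outer(u, q)       # Eq. 12
        return A                                       # tridiagonal to roundoff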
4.3 Efficiency of the Tridiagonalization

The efficiency of the Householder process on a symmetric P_E x P_E matrix is computed using the storage allocation scheme of Figure 7. Assume that computations are performed on √P_E x √P_E blocks and that no attempt is made to mask inefficiencies incurred due to the annihilation of rows and columns within particular √P_E x √P_E blocks. It is assumed that diagonal blocks are transformed in pairs and that no inefficiency is incurred when there is an even number of such blocks. In Figure 7 it may be noted, for example, that the upper left and lower right triangular blocks may be accessed and transformed simultaneously, i.e., the elements with subscripts (0,0; 0,1; 0,2; 0,3; 15,15; 1,1; 1,2; 1,3; 14,15; 14,14; 2,2; 2,3; 13,15; 13,14; 13,13; 3,3). This pairing can be observed along the entire diagonal. It is also possible to overlap parts of square blocks which have been annihilated. For example, when √P_E/2 rows in a strip of blocks are set to zero, adjacent blocks could be paired and 100% efficiency regained. Similarly, after √P_E/4, √P_E/8, etc. rows are annihilated, full efficiency could be regained. This latter matter requires some delicate discussion as well as programming which will not be included here.

The number of operations required to carry out the equations of Householder's process on a triangular matrix of decreasing dimension is

    Σ_{r=1}^{P_E} Σ_{i=1}^{r} Σ_{j=1}^{i} 1 = (P_E^3 + 3 P_E^2 + 2 P_E)/6.

[Figure 7. Storage allocation scheme for an upper triangular 16 x 16 matrix on a 16-PE machine; the blocks are grouped into strips, and the lower half of the matrix is stored in reverse order.]

The number of operations performed by a parallel machine using the storage allocation scheme of Figure 7, with the above assumptions, follows. There are P_E elements per √P_E x √P_E block, and each block is swept √P_E times per pass. This product times the total number of √P_E x √P_E blocks to be transformed in all passes is

    P_E √P_E [ Σ_{j=1}^{√P_E} Σ_{i=1}^{j-1} i + 2 Σ_{i=1}^{√P_E/2} i ]

where the first term in the brackets counts the number of rectangular blocks and the second term counts the number of paired triangular blocks on the diagonal. This is equal to

    (P_E^3 + 2 P_E^2 + 1.5 √P_E P_E^2)/6.

Thus for the Householder process on symmetric matrices we have:

    processing efficiency = ops. required / ops. used
                          = (P_E^2 + 3 P_E + 2) / (P_E^2 + 1.5 √P_E P_E + 2 P_E).

If n = P_E = 64 this yields an efficiency of approximately 86%. With the introduction of the pairing of partially annihilated rectangular blocks mentioned above, efficiencies of well over 90% can be achieved.

As in our discussion of the Jacobi process, matrices of larger and smaller dimensions may be handled using the same storage scheme. Matrices of dimension smaller than the number of PE's are handled using √P_E x √P_E blocks. Those of dimension larger than the machine size are partitioned into triangular and rectangular blocks. The rectangular blocks are stored using the memory map specified for rows 0 ≤ i ≤ P_E/2 - 1, applied to all rows, since there are no holes. In the case of either large or small matrices of dimension other than a multiple of √P_E, the above efficiency formula is a close approximation of the correct value.
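The efficiency ratio just derived is easily evaluated (a sketch of ours, using exact fractions).

    from fractions import Fraction
    from math import isqrt

    def householder_efficiency(P):
        s = isqrt(P)                                   # sqrt(P_E), assumed integer
        required = Fraction(P**3 + 3 * P**2 + 2 * P, 6)
        used = Fraction(2 * P**3 + 4 * P**2 + 3 * s * P**2, 12)
        return required / used

    print(float(householder_efficiency(64)))           # ~0.86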
4.4 Disk Considerations

Large matrices that do not fit in the primary memory are partitioned into blocks of 64 x 64. The number of rows of blocks is

    NR = [N/64] + 1

where N > 64 is the size of the matrix and [x] denotes the integer part of x. The procedure is the same as above. At the r-th step, the first row of the remaining matrix is fetched from the disk and the vectors u_r, p_r, and q_r are constructed. The 64 x 64 blocks are then fetched from the disk one at a time and transformed as discussed above. The efficiency of the computation increases as the size of the matrix gets larger. The elements of the resulting tridiagonal matrix are stored in the primary storage; because of symmetry, the number of these elements is only (2N - 1). The time taken for the transformation of an off-diagonal 64 x 64 block, Equation 18, on a 64-PE machine is approximately 6 ms, assuming that the vectors u, p, and q are already available. Thus, if requests for eight 64 x 64 partitions in the disk queuer are kept one rotation ahead, the rotational latency of the disk would be masked.

5. THE QR ALGORITHM

5.1 Theoretical Background

Throughout this discussion it is assumed that the matrix A = [a_ij] (i,j = 0, 1, 2, ..., n-1) under consideration is both real and nonsingular. The QR transformation consists of forming a sequence of matrices A_k (k = 1, 2, ...) by the relation [5]

    A_{k+1} = Q_k^t A_k Q_k        (17)

where Q_k is an orthogonal matrix such that Q_k^t A_k is an upper triangular matrix. In general, for large k the matrix A_k tends to a form in which all the eigenvalues are either isolated on the diagonal or are the eigenvalues of 2 x 2 diagonal submatrices. The QR algorithm is always used after the original matrix is reduced to the upper Hessenberg form, because:

(1) The upper Hessenberg form is invariant under the QR transformation [5].

(2) The number of multiplications involved in a QR transformation is greatly reduced: on the order of n^2 after reduction vs. n^3 for the unreduced matrix, where n is the order of the matrix.

(3) The QR algorithm has strong convergence properties when applied to upper Hessenberg matrices [9].

There are several stable methods for reducing any square matrix to upper Hessenberg form [3,4]. The method of Householder, which is described in the previous section, may be used for that purpose. The reduction to upper Hessenberg form is then completed in (n - 2) steps, the r-th of which is given by

    A_r = A_{r-1} - v_r p_r^t - (q_r - α_r u_r) v_r^t        (r = 1, 2, ..., n-2)        (19)

where A_0 is the original matrix A and

    u_r^t = (0, ..., 0, a_{r,r-1} ± S_r, a_{r+1,r-1}, ..., a_{n-1,r-1}),        (20)

    S_r^2 = Σ_{i=r}^{n-1} a_{i,r-1}^2,        (21)

    p_r^t = u_r^t A_{r-1},        (22)

    q_r = A_{r-1} u_r,        (23)

    v_r = u_r / [S_r (S_r ± a_{r,r-1})],        (24)

    α_r = p_r^t v_r.        (25)

The sign of (a_{r,r-1} ± S_r) is chosen as indicated in Equation 11. The total number of multiplications involved in this reduction is essentially (5/3) n^3.

The QR algorithm is invariably used with origin shifts [3,5,6,10], described by

    A_k - ξ_k I = Q_k R_k,        A_{k+1} = R_k Q_k + ξ_k I        (k = 1, 2, ...)        (26)

where Q_k is orthogonal, R_k is upper triangular, ξ_k is the origin shift, and A_1 is the upper Hessenberg form of the original matrix A. In order to speed up convergence, the origin shift ξ_k is chosen such that it ultimately approaches an eigenvalue of A. It often happens that the matrix A has complex conjugate eigenvalues, which lead to a complex value of ξ_k. Francis [5] proposed the method of double origin shift, which combines two QR iterations in one, avoiding complex arithmetic. This method may be described by the relations

    (A_k - ξ_k I)(A_k - ξ_{k+1} I) = W_k R̄_k        (27)

    A_{k+2} = W_k^t A_k W_k        (28)

where W_k = Q_k Q_{k+1} is orthogonal, ξ_k and ξ_{k+1} are the origin shifts, and R̄_k = R_{k+1} R_k is upper triangular. The origin shifts ξ_k and ξ_{k+1} are usually taken as the eigenvalues of the last 2 x 2 diagonal submatrix, and since A_k is real they are either both real or complex conjugates. Hence the matrix B_k = (A_k - ξ_k I)(A_k - ξ_{k+1} I) is real and no complex arithmetic is involved.
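For orientation, here is a dense serial sketch of ours of the single-shift iteration of Equation 26, using a library QR factorization in place of the Hessenberg-exploiting scheme developed below.

    import numpy as np

    def shifted_qr_step(A):
        """One step of Equation 26 with the simple shift xi = a_{n-1,n-1}."""
        n = A.shape[0]
        xi = A[-1, -1]
        Q, R = np.linalg.qr(A - xi * np.eye(n))    # Q_k R_k = A_k - xi_k I
        return R @ Q + xi * np.eye(n)              # = Q^t A Q, similar to A

Iterating this step drives subdiagonal elements toward zero. The double-shift scheme, in contrast, never forms the product matrix B_k explicitly, as the following development shows.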
However, since A is nonsingular, A_{k+2} and W_k are determined by the first column of W_k only, and the determination is unique if the subdiagonal elements of A_{k+2} are positive. The first column of W_k is the same as the first column of the elementary Hermitian orthogonal matrix N_0 that eliminates the elements of the first column of the matrix B_k below the diagonal. The first column of B_k has only three non-zero elements. They are given by

    b_00^(k) = a_00^(k) (a_00^(k) - e_k) + a_01^(k) a_10^(k) + η_k
    b_10^(k) = a_10^(k) (a_00^(k) + a_11^(k) - e_k)        (29)
    b_20^(k) = a_10^(k) a_21^(k)

where

    e_k = ξ_k + ξ_{k+1} = a_{n-2,n-2}^(k) + a_{n-1,n-1}^(k)        (30)
    η_k = ξ_k ξ_{k+1} = a_{n-2,n-2}^(k) a_{n-1,n-1}^(k) - a_{n-2,n-1}^(k) a_{n-1,n-2}^(k).

Thus the matrix N_0 is given by

    N_0 = I - 2 y_0^t y_0 / (y_0 y_0^t)        (31)

where

    y_0 = (1, s_0, t_0, 0, ..., 0),
    s_0 = b_10^(k) / (b_00^(k) ± S_0),        (32)
    t_0 = b_20^(k) / (b_00^(k) ± S_0),
    S_0^2 = (b_00^(k))^2 + (b_10^(k))^2 + (b_20^(k))^2.

Therefore, the matrix C_1 = N_0 A_k N_0 is of upper Hessenberg form except for a bulge of additional non-zero elements, those affected by the transformation, in its upper left corner. Now the matrix C_1 can be reduced to the upper Hessenberg form A_{k+2} by the orthogonal transformations

    A_{k+2} = N_{n-2} N_{n-3} ··· N_1 N_0 A_k N_0 N_1 ··· N_{n-3} N_{n-2}        (33)

where the elementary Hermitian matrix N_1 is chosen to eliminate the elements of the first column of C_1 below the subdiagonal. The matrix C_2 = N_1 C_1 N_1 is then of upper Hessenberg form except for a bulge shifted one column to the right. Consequently, N_2 is chosen to eliminate the elements of the second column of C_2 below the subdiagonal, and so forth. In general, the orthogonal matrices N_r (r = 1, 2, ..., n-2) are given by

    N_r = I - 2 y_r^t y_r / (y_r y_r^t)        (34)

where

    y_r = (0, ..., 0, 1, s_r, t_r, 0, ..., 0)        (35)

in which the first r elements are zero,

    s_r = c_{r+1,r-1}^(r) / (c_{r,r-1}^(r) ± S_r),        (36)

    t_r = c_{r+2,r-1}^(r) / (c_{r,r-1}^(r) ± S_r),        (37)

    S_r^2 = (c_{r,r-1}^(r))^2 + (c_{r+1,r-1}^(r))^2 + (c_{r+2,r-1}^(r))^2.        (38)

After each iteration the subdiagonal elements are examined for possible partitioning into two or more smaller upper Hessenberg matrices, and the iterations are continued with the bottom submatrix. Wilkinson [3] and Martin [6] mention that, using the double origin shift method, the average number of iterations per eigenvalue for most matrices encountered in practice is less than two.

5.2 Implementation

Figure 8 shows the storage allocation scheme for a 16 x 16 nonsymmetric matrix on a 16-PE machine. This storage scheme maps an element a_ij into the memory as follows:

    a_ij → loc √P_E ⌊i/√P_E⌋ + ⌊j/√P_E⌋, PE ((i (mod √P_E)) √P_E + ⌊i/√P_E⌋ + j) (mod P_E)        (39)

where P_E is the number of PE's in the machine and ⌊x⌋ is the greatest integer not exceeding x. The implementation of the reduction of the original matrix to the upper Hessenberg form, and of the QR iterations, is similar to that of the Householder tridiagonalization explained in Section 4.2, and hence will not be discussed here in detail.

5.3 Efficiency

As a direct extension of the discussions in Sections 4.2 and 4.3, it can be shown that, using Equations 19-25, the reduction to the upper Hessenberg form of a 64 x 64 matrix would take less than 10 ms on a 64-PE machine with an efficiency higher than 86%. It will be shown that the QR transformations dominate the overall efficiency calculations. A detailed examination of the efficiency of one step in the double QR iteration, C_{r+1} = N_r C_r N_r, follows, where N_r is defined by Equations 34-38. By virtue of Equations 19-25, C_{r+1} may be written as

    C_{r+1} = C_r - v_r^t p_r - (q_r - α_r y_r^t) v_r        (40)

where p_r = y_r C_r, q_r = C_r y_r^t, v_r = 2 y_r / (y_r y_r^t), α_r = p_r v_r^t, and y_r is defined by Equations 35-38.

On a parallel computer the efficiency of the process used for finding the square roots affects the overall efficiency of the QR algorithm. Therefore, in order to minimize the inefficiencies, we describe the following square root routine [11]. Any positive floating point number Z may be expressed as Z = 2^{2k} x, where either 1/2 ≤ x < 1 or 1/4 ≤ x < 1/2. Thus √Z = 2^k √x, and the problem is reduced to finding the square root of x. We use the recursive relation

    w_{i+1} = (w_i / 2)(3 - x w_i^2),        i = 0, 1, 2, ...        (41)

where

    w_0 = a + bx + cx^2        (42)

in which

    a = 3.16214, b = -5.86118, c = 4.75000    for 1/4 ≤ x < 1/2        (43)

and

    a = 2.22152, b = -2.03146, c = 0.81250    for 1/2 ≤ x < 1.        (44)

The iterate w_i converges quadratically to 1/√x, so that √x is obtained as x w.
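The routine translates directly into code (a sketch of ours; the argument reduction Z = 2^{2k} x by exponent manipulation is assumed already done, so the input lies in [1/4, 1)).

    def recip_sqrt(x, iterations=3):
        """w -> 1/sqrt(x) for 1/4 <= x < 1, per Equations 41-44."""
        if x < 0.5:
            a, b, c = 3.16214, -5.86118, 4.75000   # Eq. 43, 1/4 <= x < 1/2
        else:
            a, b, c = 2.22152, -2.03146, 0.81250   # Eq. 44, 1/2 <= x < 1
        w = a + b * x + c * x * x                  # seed, Eq. 42
        for _ in range(iterations):
            w = 0.5 * w * (3.0 - x * w * w)        # Newton step, Eq. 41
        return w

    def sqrt_(x):
        return x * recip_sqrt(x)                   # sqrt(x) = x · (1/sqrt(x))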
By virtue of Equations 19 - 25, C , may be written as, C r+1 = C r - v r p* - ( % - a r j) v* (k0) -31- where p r = y r C., g^ = C r y r , v y = 2 y r , « p = P r v r , and y r is defined lly r ll 2 by Equations 35-38. On a parallel computer the efficiency of the process used for finding the square roots affects the overall efficiency of the Q,R- algorithm. Therefore, in order to minimize the inefficiencies, we describe the following square root routine [11]. Any positive floating number Z may be expressed as Z = 2 (x) where either - < x < 1 or j- < x < — . Thus v Z = 2 (v x) and the problem is reduced to finding the square root of x. We use the recursive relation, Vl = ^ (3 " X W iP 4 = 0, 1, 2, ... (1+1) where, 2 w = a + bx + ex (1+2) in which, a = 3.1621U b =-5.86118 for i < x < \ (1+3) c = k. 75000 and, a = 2.22152 b =-2.031^6 ' for i < x < 1 (1+1+) c = 0.81250 W i COnVerges + ( °r + 2, r-l } J > dements of the vector y , (r) (r) c c s = r+1. r-1 , t - r+2, r-1 , and the ratio r (r) r (r) c v ' . + S c V ; , + S r, r-1 — r r, r-1 — r c^ , S r, r-1 + r . 2 S r -33- FIGURE 9 C P ,r-l C r + |, r -| C r +2, r -l OPERATI« + + 2 e 2 9/4 Sf 3/4 S, S^ 3/2 Sf/2 3/2 sf/2 + * * * * * •r.r-l C r ,r-I +S ' S r C r + | |f -| C r + 2>r -| \^ \/f \J 2 • yrii Sr tr ■3^- 10 * 9 + In Figure 9 the operations at a given level in the tree are identical and shown to the right, and the totals are shown below. Unlabeled nodes are assumed to use the result computed immediately above. In this and subsequent discussions, the insignificant routing and broadcasting times are neglected. Assuming that one division is equivalent to five multiplications, and two additions are equivalent to one multiplication, the evaluation of the above quantities on a serial machine would effectively require 35 multiplications. On a P > 8 machine the same process requires only the equivalent of 20 multiplications or 10 |U.s. This is clear from the fact that the tree height is equivalent to 20 multiplications and its width does not exceed 8. The 35/P efficiency of the parallel computation of this part is ' ' E . This is a 20 rather conservative definition of efficiency. It assumes that the total time "necessary" for a parallel machine to evaluate a function is just the equiva- lent number of multiplications divided by the number of PE's available. The row vector p = y C has its first (r - l) elements equal to zero. Its formation requires the equivalent of 3(n - r+l) multiplications. The column vector q = C y has (n - r - k) zero last elements for r = 1, r r r 2, . . . , n - 5- Its construction requires the equivalent of (3r + 10 ) multiplications. Assuming that the size of the matrix is n = (m)P where the integer m > 1, the average number of equivalent multiplications required for computing either p or q on a parallel computer is given by 1 5 3 *- X 3J = t: (m + 1) . In order to calculate the overall efficiency, m . ., ° 2 observe that for the matrix C there are m diagonal P x P submatrices; for each, the corresponding portion of either p or q requires three equivalent multiplications with an average efficiency of 50%, and for each of the - m(m - l) P x P submatrices above the diagonal, the corresponding +3 7b. NO. OF REFS 11 •a. CONTRACT OR CRANT NO. DAHC01+ 72-C-OOOl b. PROJECT NO. ARPA Order 1899 •a. ORIGINATOR'S REPORT NUMBER(S) CAC Document No. 9 DCS Report No. 1*87 9b. OTHER REPORT NOISI (Any other numbers that may be aaalgned thla report) 10 DISTRIBUTION STATEMENT Copies may be obtained from the address given in (l) above, II. 