Given arc weights δ_{i,j}, 1 ≤ i, j ≤ n, with δ_{i,j} = ∞ for (i,j) ∉ A, we have the n by n distance matrix D = [δ_{i,j}]. If a graph G contains no cycles the sum of whose arc weights is less than or equal to zero, then we say that G is definite.

Although extensive work has been done on sequential algorithms for solving the shortest path problem, little work has been done [4, 6, 19] on parallel algorithms for this problem. Levitt and Kautz [19] discuss an iterative algorithm due to Hu [12] which has been tailored to a cellular array machine. Chen and Feng [4] discuss a version of Ford's algorithm [9, 10] for an associative processor machine. Dekel and Sahni [6] present a method for cube connected and perfect shuffle computers. Here, however, we treat the shortest path problem in a more general way, using the regular algebra of Carré [2]. Furthermore, we will assume that:

(i) any number of processors may be used at any time, but we will give bounds on this number,
(ii) each processor may perform either a comparison or any of the four arithmetic operations in one time step, and
(iii) there are no memory or data alignment time penalties.

If p processors are being used, then we denote the computation time as T_p unit steps. Thus T_1 is the time required by a serial machine. We define the speedup of a computation using p processors over the serial computation time as S_p = T_1/T_p ≥ 1.

Carré [2] has shown that the shortest path problem, as well as many other problems, can be posed as linear systems of the form x = Ax + b in a regular algebra. He gives the examples of the scheduling algebra of Cruon and Hervé [5], the two-element Boolean algebra, and a stochastic communication problem (Kalaba [16]; Moisil [20]). Triangular linear systems, or recurrences, in a regular algebra also arise in the analysis of Fortran programs [17]. Consequently, it is important to solve such recurrences efficiently on a parallel computer.

In addition to considering the solution of linear systems in a regular algebra, we will present a parallel version of Dijkstra's algorithm [7], which is applicable only to graphs with positive arc weights. Furthermore, considerations for choosing an algorithm when the graph is sparse, i.e., has a large number of infinite entries in the distance matrix, are discussed.

1.1. Notation

We follow the convention that capital letters denote matrices, lower case letters denote vectors, and lower case Greek letters denote scalars. Except in the cases where we state time and processor bounds, the symbols + and × are taken to represent the generalized addition and generalized multiplication operations of the regular algebra. In the statement of the time and processor bounds, the symbols + and × take on their normal meaning. We will frequently omit the symbol × when it is clear that a generalized product is to be taken. We will also use the Σ and Π notations for the generalized sum or product of a set of items. Unless otherwise stated, log n = ⌈log₂ n⌉ for any positive number n.

1.2. The Regular Algebra

For convenience, we restate the algebra of Carré. We start by defining a semiring (S, +, ×) which satisfies the following properties:

(i) commutativity:    α + β = β + α,    α × β = β × α
(ii) associativity:    α + (β + γ) = (α + β) + γ,    α × (β × γ) = (α × β) × γ
(iii) distributivity:    α × (β + γ) = (α × β) + (α × γ)
(iv) idempotency:    α + α = α

for all α, β, γ ∈ S. The set S has a unit element e such that α × e = α, and a null element θ satisfying α + θ = α, α × θ = θ, for all α ∈ S. Furthermore, we have the law of multiplicative cancellation:

(v) if α ≠ θ and α × β = α × γ, then β = γ.

We define for the semiring (S, +, ×) the order relation ≤: α ≤ β if and only if α + β = α.
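To make the algebra concrete: for the shortest path problem, generalized addition is min, generalized multiplication is ordinary addition, e = 0, and θ = ∞, so that α ≤ β reduces to the usual numerical order. The following Python sketch is ours, not part of the report; it simply spot-checks the semiring laws for this instance.

```python
import math

# Shortest path instance of the regular algebra (assumed for illustration):
# generalized + is min, generalized x is ordinary addition,
# the unit element e is 0, and the null element theta is infinity.

E = 0.0           # unit element e:  a x e = a
THETA = math.inf  # null element theta:  a + theta = a,  a x theta = theta

def gadd(a, b):
    """Generalized addition: keep the shorter of two path lengths."""
    return min(a, b)

def gmul(a, b):
    """Generalized multiplication: concatenate two paths."""
    return a + b

# Spot checks of properties (i)-(iv) and the unit/null laws:
assert gadd(3.0, 5.0) == gadd(5.0, 3.0)                                   # (i)
assert gmul(2.0, gmul(3.0, 4.0)) == gmul(gmul(2.0, 3.0), 4.0)             # (ii)
assert gmul(2.0, gadd(3.0, 4.0)) == gadd(gmul(2.0, 3.0), gmul(2.0, 4.0))  # (iii)
assert gadd(7.0, 7.0) == 7.0                                              # (iv)
assert gadd(4.0, THETA) == 4.0 and gmul(4.0, THETA) == THETA
assert gmul(4.0, E) == 4.0
```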
1.3. Extensions to Matrix Operations

We now extend the definitions of the regular algebra to matrices all of whose elements belong to the set S. The definitions of matrix addition, matrix multiplication and matrix transposition are analogous to those in linear algebra. Note that matrix addition is idempotent: A + A = A. We define a square m by m unit matrix I = [ε_{i,j}] with ε_{i,j} = e if i = j and ε_{i,j} = θ if i ≠ j. Note that IA = AI = A for any m by m matrix A. The null matrix N, all of whose elements are θ, satisfies

    A + N = A,    A × N = N.

The partial ordering for matrices is easily extended: A ≤ B if and only if α_{i,j} ≤ β_{i,j} for all i, j. In particular, it follows that A ≤ B if and only if A + B = A.

The shortest path problem can now be stated as a linear system in the regular algebra. Let G = (V, A) be a graph with distance matrix D. Let the vector b represent an initial labeling of the nodes V (for the vector b, β_i is the initial label on node i). The shortest path problem is equivalent to solving the linear system x = Dx + b in the regular algebra [2]. For the single source problem the vector b is given by β_s = e, where s is the source node, and β_i = θ otherwise.

2. Solution of Recurrences

In this section we study the solution of systems of the form

    x = Lx + f,    (1)

where the matrix L is a strictly lower triangular matrix. These systems arise in the analysis of Fortran programs, and are also used in methods for solving general systems x = Ax + b.

2.1. The Column Sweep Algorithm

The most fundamental of all algorithms for solving recurrences in the regular algebra is the column sweep algorithm. The solution of x = Lx + f, where

    L = | θ                                        |
        | λ_{2,1}   θ                              |
        | λ_{3,1}   λ_{3,2}   θ                    |
        |    ⋮         ⋮         ⋱                 |
        | λ_{n,1}   λ_{n,2}   ⋯   λ_{n,n−1}   θ    |,    x = (ξ_1, ξ_2, …, ξ_n)^T,    f = (φ_1, φ_2, …, φ_n)^T,    (2)

is given by

    ξ_1 = φ_1,
    ξ_j = Σ_{i=1}^{j−1} λ_{j,i} ξ_i + φ_j,    j = 2, 3, …, n.    (3)

Rewriting the system (3) as column operations, we get the

Column Sweep Algorithm:
1. Set x ← f.
2. For i = 1, 2, …, n−1:
       x ← x + ℓ_i ξ_i    (in parallel).

In the algorithm, ℓ_i represents the ith column of the matrix L. A sequential algorithm for solving recurrence (1) requires n² − n operations, i.e., T_1 = n² − n. On the other hand, the column sweep algorithm can be performed in 2(n−1) time steps using at most n−1 processors. This gives a speedup of n/2 + O(1) over the sequential algorithm, with a corresponding efficiency of roughly 1/2.
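As an illustration (ours, not the report's), here is a minimal Python sketch of the column sweep algorithm for the (min, +) instance of the algebra. The inner loop over rows is the column operation the report distributes across n−1 processors in one parallel step.

```python
import math

def column_sweep(L, f):
    """Solve x = Lx + f over the (min, +) semiring by column sweep.

    L is a strictly lower triangular n x n list-of-lists, with math.inf
    playing the role of the null element theta; f is a length-n list.
    Serial sketch: the loop over j below is one parallel step using
    n - i processors on the machine model assumed in the report.
    """
    n = len(f)
    x = list(f)                      # step 1: x <- f
    for i in range(n - 1):           # step 2: sweep columns 1, ..., n-1
        for j in range(i + 1, n):    # done in parallel on a PRAM-style machine
            # generalized x <- x + l_i * xi_i: one x-step and one +-step
            x[j] = min(x[j], L[j][i] + x[i])
    return x

# Tiny example: chain 1 -> 2 -> 3 with unit arc weights, source at node 1.
INF = math.inf
L = [[INF, INF, INF],
     [1.0, INF, INF],
     [INF, 1.0, INF]]
f = [0.0, INF, INF]
print(column_sweep(L, f))            # [0.0, 1.0, 2.0]
```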
2.2. The Product Form of the Solution

In this section, the solution x of (1) is formed as a sum of products. Each product involves a power of the matrix L multiplied by the vector f.

Lemma 1: Let L be strictly lower triangular, i.e., L has the form given in equation (2). Then

    L^i = | N     N |
          | R^T   N |,

where R is upper triangular of order n−i; i.e., L^i is lower triangular with i null diagonals in its lower half.

Proof: From the definition of matrix multiplication and the facts that θ + θ = θ, θ × α = θ for all α ∈ S, and that L is null on the main diagonal, each multiplication of L by L introduces another diagonal of null elements into the product. ∎

Corollary 1: L^n = N. ∎

Lemma 2: Let M_i, i = 1, 2, …, n−1, denote the elementary matrix whose only non-null elements are those of column i of L, all of which lie below the main diagonal. Then L = M_1 + M_2 + ⋯ + M_{n−1}. ∎

Lemma 3: Let M_i and M_j be elementary matrices. If i ≤ j, then M_i M_j = N. If i > j, then P = M_i M_j has all null elements except in column j below the ith row.

Proof: M_i has its first i rows all null, and M_j has all null elements except in column j below the main diagonal. Thus if i ≤ j, all inner products of rows of M_i with columns of M_j must be null, i.e., M_i M_j = N. If i > j, then the only non-null inner products of P = M_i M_j occur when the inner products of rows i+1, i+2, …, n of M_i are taken with column j of M_j. ∎

Lemma 4: Let M_1, M_2, …, M_j, M_{j+1} be elementary matrices. Then

    (I + M_{j+1})(I + M_j) ⋯ (I + M_1)(I + M_1)(I + M_2) ⋯ (I + M_{j+1}) = (I + M_{j+1})(I + M_j) ⋯ (I + M_1).    (6)

The proof is by induction on j.

Basis step: For j = 1, we have (I + M_1)(I + M_1) = I + M_1 + M_1 + M_1². From Lemma 3 and idempotency, I + M_1 + M_1 + M_1² = I + M_1 + M_1 = I + M_1, establishing the basis.

Induction step: Assume that the assertion holds for index j, and let

    M̂_j = (I + M_j)(I + M_{j−1}) ⋯ (I + M_2)(I + M_1).

From the inductive hypothesis we have

    (I + M_{j+1})(I + M_j) ⋯ (I + M_1)(I + M_1)(I + M_2) ⋯ (I + M_{j+1}) = (I + M_{j+1}) M̂_j (I + M_{j+1}).

Furthermore,

    M̂_j (I + M_{j+1}) = M̂_j + M̂_j M_{j+1}.    (7)

By repeated applications of Lemma 3 we obtain M̂_j M_{j+1} = M_{j+1}; hence (7) becomes

    M̂_j (I + M_{j+1}) = M̂_j + M_{j+1}.

Consequently,

    (I + M_{j+1}) M̂_j (I + M_{j+1}) = (I + M_{j+1})(M̂_j + M_{j+1})
        = M̂_j + M_{j+1} M̂_j + M_{j+1} M_{j+1} + M_{j+1}
        = M̂_j + M_{j+1} M̂_j + M_{j+1}
        = M̂_j + M_{j+1}(M̂_j + I).

Since I is an element of the expanded sum of M̂_j, due to idempotency we have

    (I + M_{j+1}) M̂_j (I + M_{j+1}) = M̂_j + M_{j+1} M̂_j = (I + M_{j+1}) M̂_j,

establishing the lemma. ∎

In a similar fashion, it can be shown that

    (I + M_1)(I + M_2) ⋯ (I + M_{j+1})(I + M_{j+1})(I + M_j) ⋯ (I + M_1) = (I + M_{j+1})(I + M_j) ⋯ (I + M_1).    (8)

Let

    (I + L)* = (I + M_{n−1})(I + M_{n−2}) ⋯ (I + M_2)(I + M_1).    (9)

Since (I + L) = (I + M_1)(I + M_2) ⋯ (I + M_{n−2})(I + M_{n−1}), from Lemma 4 and equation (8) we see that

    (I + L)*(I + L) = (I + L)* = (I + L)(I + L)*.    (10)

We now introduce a lemma of fundamental importance to what follows.

Lemma 5: The vector x solves the system x = Ax + f if and only if x solves the system x = x + Ax + f.

Proof: If x = Ax + f, then adding x to both sides and using idempotency gives x + x = x + Ax + f, that is, x = x + Ax + f. Conversely, if x = x + Ax + f, then x + Ax + f ≤ Ax + f, i.e., either x = x or x = Ax + f; since x = x is redundant, x = Ax + f. ∎

The following result was originally given by Carré [2], but we state and prove it in terms of elementary matrices.

Theorem 2: Let x = (I + L)* f; then x satisfies x = Lx + f.

Proof: From Lemma 5, x = Lx + f is equivalent to

    x = (I + L)x + f.    (11)

Substituting x = (I + L)* f into the right-hand side of (11) and applying (10) gives

    (I + L)(I + L)* f + f = (I + L)* f + f.

From equation (9), f is an element of the expanded sum of (I + L)* f; thus by idempotency (I + L)* f + f = (I + L)* f, and x = (I + L)* f satisfies (11). ∎

From Theorem 2, we can write the solution x to equation (1) as

    x = (I + L)* f = (I + M_{n−1})(I + M_{n−2}) ⋯ (I + M_2)(I + M_1) f,

which motivates the following algorithm.

Algorithm 2: Solution of recurrences in product form.
1. Set (I + M_i)^{(0)} = I + M_i, i = 1, …, n−1; f^{(0)} = f.
2. For j = 0, 1, …, ν = log(n/4):
       Form (I + M_i)^{(j+1)} = (I + M_{2i+1})^{(j)} (I + M_{2i})^{(j)} in parallel for i = 1, 2, …, n/2^{j+1} − 1.
       Form f^{(j+1)} = (I + M_1)^{(j)} f^{(j)}.
3. Set x = (I + M_1)^{(ν+1)} f^{(ν+1)}.

Algorithm 2 is a direct analog of Algorithm 1 of Sameh and Brent [22] for triangular linear algebraic equations. Hence, their results (Theorem 2.1) apply: the solution x to the recurrence (1) can be found in

    T = (1/2) log² n + 2 log n + 3

steps using no more than π = (15/1024) n³ + O(n²) = n³/68 + O(n²) processors.
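The following Python sketch (ours, with helper names of our own choosing; a serial rendering of the parallel fan-in) mirrors Algorithm 2 for the (min, +) instance: it builds the factors I + M_i and merges them pairwise, which is the log-depth combination of step 2.

```python
import math
INF = math.inf

def mat_mul(A, B):
    """Generalized (min, +) matrix product."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    """Unit matrix: e = 0 on the diagonal, theta = inf elsewhere."""
    return [[0.0 if i == j else INF for j in range(n)] for i in range(n)]

def product_form_solve(L, f):
    """Sketch of Algorithm 2: x = (I+M_{n-1}) ... (I+M_1) f.

    M_i is the elementary matrix holding column i of L; factors are merged
    pairwise, mirroring the log-depth rounds of step 2 (done here one round
    at a time, sequentially rather than in parallel).
    """
    n = len(f)
    factors = []
    for i in range(n - 1):           # build I + M_i, i = 1, ..., n-1
        F = identity(n)
        for j in range(i + 1, n):
            F[j][i] = L[j][i]
        factors.append(F)
    if not factors:                  # n = 1: x = f
        return list(f)
    while len(factors) > 1:          # one pass per parallel round j
        merged = [mat_mul(factors[k + 1], factors[k])
                  for k in range(0, len(factors) - 1, 2)]
        if len(factors) % 2:         # odd leftover factor carries over
            merged.append(factors[-1])
        factors = merged
    P = factors[0]                   # P = (I + L)*
    return [min(P[i][k] + f[k] for k in range(n)) for i in range(n)]
```

Note that merging adjacent factors preserves the overall order of the product in (9), so the result is exactly (I + L)* f.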
2.4. Limited Parallelism

In this section we present two methods, which are analogs of those of Hyafil and Kung [15], for solving the recurrence (1). These methods are used when the number of processors p is fixed.

The first method applies algorithm decomposition to the column sweep algorithm:

    x^{(i+1)} = (I + M_i) x^{(i)},    i = 1, 2, …, n−1.

It is easy to see that

    (I + M_i) x^{(i)} = (ξ_1^{(i)}, …, ξ_i^{(i)}, λ_{i+1,i} ξ_i^{(i)} + ξ_{i+1}^{(i)}, …, λ_{n,i} ξ_i^{(i)} + ξ_n^{(i)})^T.

Thus, given p processors, the product (I + M_i) x^{(i)} can be formed in 2⌈(n−i)/p⌉ steps. Since

    T_p(n) = Σ_{i=1}^{n−1} 2⌈(n−i)/p⌉,

we have from Hyafil and Kung [15] that

    T_p(n) ≤ n²/p + (2 − 3/p)n + 2/p − 2.

This method is most practical when p ≤ n.

The second method uses the problem decomposition principle. Partition

    L = | L_{1,1}                          |
        | L_{2,1}   L_{2,2}                |
        |    ⋮         ⋮        ⋱          |
        | L_{m,1}   L_{m,2}   ⋯   L_{m,m}  |,    x = (x_1, x_2, …, x_m)^T,    f = (f_1, f_2, …, f_m)^T;

then equation (1) can be written as

    x_1 = L_{1,1} x_1 + f_1,
    x_2 = L_{2,1} x_1 + L_{2,2} x_2 + f_2,
      ⋮
    x_m = (Σ_{j=1}^{m−1} L_{m,j} x_j) + L_{m,m} x_m + f_m.

This leads to the following algorithm.

Algorithm 3: Recurrence solver with a limited number of processors.
For i = 1, …, m:
    Set f̂^{(i)} = f_i + Σ_{j=1}^{i−1} L_{i,j} x_j.
    Form x_i = (I + L_{i,i})* f̂^{(i)} using Algorithm 2.

Hyafil and Kung have shown that for p = ⌈n^r⌉, 1 ≤ r ≤ 3, and m = ⌈n/(68 n^r)^{1/3}⌉,

    T_p(n) = O(n^{2−r} log n)      for 1 ≤ r ≤ 3/2,
    T_p(n) = O(n^{1−r/3} log n)    for 3/2 ≤ r ≤ 3.

3. Banded Recurrences

In this section, we take the matrix L of (1) to be banded, i.e., of the form

    L = | L_1                        |
        | R_1   L_2                  |
        |       ⋱      ⋱             |
        |         R_{k−1}   L_k      |,    (12)

where L_i is a strictly lower triangular matrix of order m, i = 1, 2, …, k, and R_j is an upper triangular matrix of order m, j = 1, 2, …, k−1.

3.1. Unlimited Parallelism

We can show that the time and processors required for solving (1) are the same as given by Sameh and Brent [22]. The proof is exactly like that of their Theorem 3.1. Before this can be proved, however, we need to establish the following.

Lemma 6: Let L be a strictly lower triangular 2n by 2n matrix given by

    L = | L_1   N   |
        | R     L_2 |,    (13)

where R is n by n and upper triangular, and let f be a 2n-vector partitioned correspondingly, f^T = (f_1^T, f_2^T). Then the solution of the linear recurrence

    x = (I_{2n} + L)x + f    (14)

is given by

    x = | I_n   N   | | y_1 |
        | G     I_n | | y_2 |,    (15)

where G = (I_n + L_2)* R and y_i = (I_n + L_i)* f_i, i = 1, 2.

Proof: From the structure of L and Theorem 2, we see that x_1 = (I_n + L_1)* f_1 = y_1. Also, from (13) and (14) we have

    x_2 = R x_1 + (I_n + L_2) x_2 + f_2.

From Theorem 2,

    x_2 = (I_n + L_2)* [R x_1 + f_2] = G y_1 + y_2,

proving the lemma. ∎

Now we state the main result.

Theorem 3: The solution of the linear recurrence x = (I + L)x + f, where L is an n by n strictly lower triangular banded matrix (12), i.e., n = km, is obtained in (3 + 2 log m) log n − (1/2)(log m)(1 + log m) time steps requiring less than m(m+1)n/2 processors.

Proof: Let f^T = (f_1^T, f_2^T, …, f_k^T). The algorithm of Lemma 6 can be generalized to yield a scheme which is exactly similar to that of Sameh and Brent [22], Theorem 3.1, as follows.

Algorithm 4: Banded recurrences.
Step 1: Set y_1^{(0)} = (I + L_1)* f_1, and

    [G_{i−1}^{(0)}, y_i^{(0)}] = (I + L_i)* [R_{i−1}, f_i],    i = 2, 3, …, n/m.

Step 2: For j = 1, 2, …, ν = log(n/m):
    Set r = 2^{j−1} m.
    Set

    G_i^{(j)} = | N   G_{2i}^{(j−1)}                     |
                | N   G_{2i+1}^{(j−1)} G_{2i}^{(j−1)}    |,    i = 1, 2, …, n/(2r) − 1,

    and

    y_i^{(j)} = | y_{2i−1}^{(j−1)}                                       |
                | G_{2i−1}^{(j−1)} y_{2i−1}^{(j−1)} + y_{2i}^{(j−1)}     |,    i = 1, 2, …, n/(2r).

Upon termination, y_1^{(ν)} contains the solution vector x. ∎
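A Python sketch of the two-block combination that drives Algorithm 4 (ours, not the report's; it assumes the (min, +) instance and serializes the star operations): each half is solved independently, and the second half is corrected through G y_1 with G = (I_n + L_2)* R, exactly as in (15).

```python
def star_apply(Lb, v):
    """Compute (I + Lb)* v for strictly lower triangular Lb by column sweep."""
    n = len(v)
    x = list(v)
    for i in range(n):
        for j in range(i + 1, n):
            x[j] = min(x[j], Lb[j][i] + x[i])
    return x

def combine_blocks(L1, L2, R, f1, f2):
    """Lemma 6 sketch: solve the 2n x 2n recurrence (13)-(15).

    y1 = (I+L1)* f1 and y2 = (I+L2)* f2 are the independent half-solves;
    the second half is then x2 = G y1 + y2 with G = (I+L2)* R.
    """
    n = len(f1)
    y1 = star_apply(L1, f1)
    y2 = star_apply(L2, f2)
    # G y1 = (I+L2)* (R y1): apply R first, then the star of L2.
    Ry1 = [min(R[i][k] + y1[k] for k in range(n)) for i in range(n)]
    Gy1 = star_apply(L2, Ry1)
    x2 = [min(Gy1[i], y2[i]) for i in range(n)]   # generalized addition
    return y1 + x2

# Usage: two 2-node blocks; R is the upper triangular coupling block.
INF = float("inf")
L1 = [[INF, INF], [1.0, INF]]
L2 = [[INF, INF], [1.0, INF]]
R  = [[1.0, 1.0], [INF, 1.0]]
print(combine_blocks(L1, L2, R, [0.0, INF], [INF, INF]))   # [0.0, 1.0, 1.0, 2.0]
```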
3.2. Limited Parallelism

In this section we will assume that m ≤ p ≪ n, and we proceed along the same lines as the algorithm of Theorem 2.3.13 of Kuck and Sameh [18] for solving banded triangular systems of linear algebraic equations. Our results are summarized by the following theorem.

Theorem 4: Given p processors, partition the system (1) so that V_0 is of order m, each V_i (except possibly I + V_k) is of order s, i.e., k = ⌈(n−m)/s⌉, and each U_i is of the form (19). Then the solution vector x is given by

    x_0 = (I + V_0)* f_0,    (23)
    x_i = (I + V_i)* (f_i + U_i x_{i−1}),    i = 1, 2, …, k.    (24)

Equation (23) can be evaluated in 2(m−1) steps using the column sweep algorithm, since p ≥ m. The k equations in (24) can be solved one at a time using the algorithm developed for solving recurrence (18). Hence, using p processors, the banded recurrence (12) of order n and bandwidth m can be solved in time

    T_p = 2(m−1) + τ ⌈(n−m)/p(p+m−1)⌉,

where τ is given by equation (17). ∎

4. Linear Systems x = Ax + b

In this section, we will deal with the general linear system

    x = Ax + b.    (25)

In Section 1.3, we saw that the solution of the shortest path problem can be obtained by solving the system (25). Due to the form of (25), the n by n matrix A has, in general, its only null elements on the main diagonal. The matrix A is called sparse if it has a small proportion of non-null elements. If we let n_A be the number of non-null elements of A, then we say that A is sparse if n_A ≪ n².

4.1. Elimination Methods

The system (25) can be reduced by a generalized Gaussian elimination.

Algorithm 5: Parallel Gaussian elimination.
For k = 1, 2, …, n−1:
    Set α_{i,j} = α_{i,k} × α_{k,j} + α_{i,j},
        β_i = α_{i,k} × β_k + β_i,
    in parallel for i, j = k+1, k+2, …, n.

The resulting upper triangular system can be solved using a method of Section 2. The reduction of the matrix A to upper triangular form can be accomplished in 2(n−1) steps using at most (n−1)² processors. The added time and, perhaps, added processors necessary to solve the resulting linear recurrence of Gaussian elimination can be avoided by using the generalized Jordan elimination method of Carré [2, §7.3].

Algorithm 6: Parallel Jordan elimination.
For k = 1, 2, …, n−1:
    Set α_{i,j} = α_{i,k} × α_{k,j} + α_{i,j},
        β_i = α_{i,k} × β_k + β_i,
    in parallel for i = 1, …, n; j = k+1, …, n.

Note that upon termination, the vector b contains the solution x. Clearly, Algorithm 6 can be performed in 2(n−1) steps using at most n² + 1 processors. Thus, we see that by using an additional 2n processors over Gaussian elimination, we can solve the system (25) in the same time as it takes to do the elimination step of Gaussian elimination.

4.2. Iterative Methods

In this section we present two iterative methods for solving the system (25). They correspond to the Jacobi and Gauss-Seidel methods for the iterative solution of linear algebraic equations. The methods of Bellman [1] and Moore [21] are sequential versions of the generalized Jacobi method. Ford and Fulkerson's [10] sequence of scanning the arcs in Ford's method [9] is a sequential version of the generalized Gauss-Seidel method.

The system (25) can be solved using the generalized Jacobi method

    x^{(k)} = Ax^{(k−1)} + b.    (26)

Since the graph G associated with the distance matrix A is assumed to be definite, from Theorem 6.1 of Carré [2], for an initial guess of a null x^{(0)} we see that x^{(n)} = x.

We observe that the matrix-vector product Ax^{(k−1)} can be formed in parallel in log n + 1 steps using at most n² processors. Thus a single iteration requires log n + 2 steps and at most n² processors. The overall algorithm can be accomplished in no more than n(log n + 2) steps using n² processors.
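For illustration (ours, not the report's), the generalized Jacobi iteration (26) in the (min, +) instance is just the Bellman-Moore relaxation; a serial Python sketch:

```python
import math
INF = math.inf

def jacobi_shortest_paths(A, b, n_iter=None):
    """Generalized Jacobi iteration (26): x^(k) = A x^(k-1) + b in (min, +).

    Starting from the null vector x^(0) (all entries theta = inf), n
    iterations reach the solution of x = Ax + b for a definite graph.
    Each iteration is one matrix-vector product plus the addition of b.
    """
    n = len(b)
    if n_iter is None:
        n_iter = n
    x = [INF] * n                    # null initial guess x^(0)
    for _ in range(n_iter):
        # A x^(k-1): n^2 generalized multiplications, log n + 1 step depth
        x_new = [min(min(A[i][k] + x[k] for k in range(n)), b[i])
                 for i in range(n)]
        if x_new == x:               # early exit once a fixed point is hit
            break
        x = x_new
    return x

# Usage: a_{i,k} holds the weight of arc k -> i; source at node 1.
A = [[INF, INF, INF],
     [2.0, INF, INF],
     [5.0, 1.0, INF]]
print(jacobi_shortest_paths(A, [0.0, INF, INF]))   # [0.0, 2.0, 3.0]
```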
In our context, the generalized Jacobi method can be considered as a direct method. Since x^{(n)} is the solution to the system (25), we can write it, by repeatedly using equation (26), as

    x^{(n)} = A^n x^{(0)} + A^{n−1} b + A^{n−2} b + ⋯ + Ab + b.    (27)

Since x^{(0)} was chosen to be null, equation (27) simplifies to

    x^{(n)} = A^{n−1} b + A^{n−2} b + ⋯ + Ab + b,    (28)

which can be evaluated in parallel using the following algorithm.

Algorithm 7: Direct Jacobi method.
Set x_0 = b.
For i = 0, 1, …, ν = log(n/2):
    x_{i+1} = (I + A^{2^i}) x_i    (we can also terminate whenever x_{i+1} = x_i);
    A^{2^{i+1}} = A^{2^i} × A^{2^i}    (if necessary).
Upon termination, x = x_{ν+1}.

This is the method proposed by Dekel and Sahni [6] for cube connected and perfect shuffle computers. By summing the time counts for Algorithm 7, and observing that the maximum number of processors occurs during the formation of A^{2^{i+1}}, we get the following theorem.

Theorem 5: The solution to the system (25) can be obtained in no more than 2 log² n + 2 log n − 1 steps using n³ processors. ∎

We will now discuss a parallel implementation of the generalized Gauss-Seidel method. We begin by observing that the matrix A of the system (25) can be written as

    A = L + U,    (29)

where L is strictly lower triangular and U is strictly upper triangular. Consequently, (25) becomes

    x = (L + U)x + b,    (30)

which leads to the generalized Gauss-Seidel iteration

    x^{(k+1)} = Lx^{(k+1)} + Ux^{(k)} + b.

Solving for x^{(k+1)} yields

    x^{(k+1)} = (I + L)* Ux^{(k)} + (I + L)* b = Mx^{(k)} + c.    (31)

From formula (2.16) of Sameh and Brent [22], we get that the iteration matrix M and vector c can be formed in (1/2) log² n + O(log n) steps using at most n³/6 + O(n²) processors. Thus, the solution x can be found by the generalized Gauss-Seidel method in n log n + O(n) steps using at most n³/6 + O(n²) processors.

Carré [2] has shown that the number of iterations required by the generalized Gauss-Seidel method is not greater than the number required by the generalized Jacobi method. It is thus likely that for many problems the overall work for the generalized Gauss-Seidel method will be less than that of the generalized Jacobi method when used in an iterative fashion. This potential savings in time has a substantial cost in additional processors.

When used as a direct method, the generalized Gauss-Seidel scheme would use Algorithm 7 applied to the matrix M and vector c of equation (31). Note that an additional (1/2) log² n + O(log n) operations are required to form the matrix M and vector c. Hence, for the generalized Gauss-Seidel method to be more efficient than the generalized Jacobi method, Algorithm 7 applied to the matrix M and vector c must take at least 25% fewer iterations than the same algorithm applied to the original matrix A and vector b. Consequently, one would prefer the generalized Jacobi method for solving the system (25).
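A Python sketch of Algorithm 7 for the (min, +) instance (ours; serial, with the doubling of A done by an explicit matrix product). Applied to M and c in place of A and b, the same routine gives the direct form of the generalized Gauss-Seidel method discussed above.

```python
import math
INF = math.inf

def min_plus_matmul(A, B):
    """Generalized matrix product; the n^3-processor step of Theorem 5."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def direct_jacobi(A, b):
    """Sketch of Algorithm 7: evaluate x = (A^{n-1} + ... + A + I) b
    by doubling, x_{i+1} = (I + A^{2^i}) x_i, squaring A between rounds."""
    n = len(b)
    x = list(b)                       # x_0 = b
    P = [row[:] for row in A]         # P holds A^{2^i}
    rounds = max(1, math.ceil(math.log2(n)))   # roughly v + 1 = log n rounds
    for i in range(rounds):
        # x_{i+1} = (I + P) x_i = min(x_i, P x_i)
        x = [min(x[j], min(P[j][k] + x[k] for k in range(n)))
             for j in range(n)]
        if i + 1 < rounds:            # square only if another round follows
            P = min_plus_matmul(P, P)
    return x

# Usage (same arc convention as before): expect [0.0, 2.0, 3.0].
A = [[INF, INF, INF],
     [2.0, INF, INF],
     [5.0, 1.0, INF]]
print(direct_jacobi(A, [0.0, INF, INF]))
```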
4.3. Dijkstra's Method

We now consider the special case when all of the arc weights of the associated graph G are positive. In the regular algebra this means that the elements of the matrix A of the system (25) satisfy α_{i,j} > e for all i and j. Due to this added assumption, it is easy to see that there can be no linear algebraic analogs for algorithms tailored to these problems. Dijkstra's method [7] is such an algorithm, iterative in nature with no linear algebraic analog.

In Dijkstra's method, nodes of the graph G are separated into two classes, temporary and permanent. All nodes start out temporary and become permanent one at a time. The algorithm terminates once all nodes have become permanent.

Let mode[·] be an array of bits used to indicate whether a given node is temporary or permanent. When mode[i] = 0, node i is temporary; when mode[i] = 1, node i is permanent. If we wish to solve the system (25), where α_{i,j} > e for all i and j, then Dijkstra's algorithm can be stated as follows.

Algorithm 8: Parallel Dijkstra's algorithm. The vector b is overwritten by the solution x.
Initialize: mode[i] = 0, i = 1, 2, …, n.
For i = 1, 2, …, n−1:
    Find j such that β_j = Σ_{mode[k]=0} β_k.    (32)
    Set mode[j] = 1.    (33)
    For all k such that mode[k] = 0:
        β_k = β_k + β_j × α_{j,k}.    (34)

We observe that for each iteration, line (32) can be done in log(n+1−i) steps using n−i processors. Overall, line (32) requires

    Σ_{i=2}^{n} log i ≤ n log n

operations and at most n−1 processors. Similarly, line (33) requires one operation each iteration, and line (34) can be performed in parallel at each iteration using two operations and n−i processors. The overall work for a parallel version of Dijkstra's method is n log n + 2n − 3 steps using at most n−1 processors. Compared with n iterations of the generalized Jacobi method, Dijkstra's method requires essentially the same time, but far fewer processors (n−1 vs. n²). Thus, for dense matrices A, Dijkstra's method appears to be the iterative method of choice.

4.4. Sparse Matrix Considerations

When the matrix A of the system (25) is large and sparse, that is, the number of arcs n_A in the graph G is much less than n², direct methods are impractical. Unless the matrix A is banded or has a special structure, undesirable fill-in of the matrix A will occur. Since the economization of storage for large sparse matrices is essential, iterative methods must be used.

Not all iterative methods are appropriate for this problem. The generalized Gauss-Seidel method requires the formation of the iteration matrix M = (I + L)* U. This matrix in general will be dense and hence too costly to store. We will thus restrict ourselves to a discussion of the generalized Jacobi method and Dijkstra's method.

Let q be the maximum number of non-null elements in any row of A. In order to compare the two algorithms, we will use a cost function which is the product of the number of processors and time. For dense matrices, we see that Dijkstra's method has a cost of

    (n−1)(n log n + 2n − 3) = n² log n + O(n²).

Similarly, the generalized Jacobi method has a cost of n³ log n + O(n³), which is about n times greater than that of Dijkstra's method. For sparse matrices, the generalized Jacobi method requires n log q + 2n steps and n_A − n processors.
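Finally, a serial Python sketch of Algorithm 8 in the (min, +) instance (ours, not the report's). The min-search of line (32) and the relaxations of line (34) are the steps the report assigns to n−i processors; here both are ordinary loops.

```python
import math
INF = math.inf

def parallel_dijkstra(A, b):
    """Sketch of Algorithm 8 in the (min, +) algebra.

    b holds the initial labels (e = 0 at the source, theta = inf elsewhere)
    and is overwritten by the solution x, as in the report.
    """
    n = len(b)
    b = list(b)
    mode = [0] * n                    # 0 = temporary, 1 = permanent
    for _ in range(n - 1):
        # line (32): find the temporary node j with the minimal label,
        # a log-depth reduction on the report's machine model
        j = min((k for k in range(n) if mode[k] == 0), key=lambda k: b[k])
        mode[j] = 1                   # line (33): make node j permanent
        for k in range(n):            # line (34): relax all temporary nodes
            if mode[k] == 0:
                b[k] = min(b[k], b[j] + A[j][k])
    return b

# Usage: a_{j,k} > 0 is the weight of arc j -> k (inf if absent); source 1.
A = [[INF, 2.0, 5.0],
     [INF, INF, 1.0],
     [INF, INF, INF]]
print(parallel_dijkstra(A, [0.0, INF, INF]))   # [0.0, 2.0, 3.0]
```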