Given arc weights δ_{i,j}, 1 ≤ i, j ≤ n, with δ_{i,j} = ∞ for (i,j) ∉ A, we have the n by n distance matrix D = [δ_{i,j}]. If a graph G contains no cycles the sum of whose arc weights is less than or equal to zero, then we say that G is definite.

Although extensive work has been done on sequential algorithms for solving the shortest path problem, little work has been done [4, 6, 19] on parallel algorithms for this problem. Levitt and Kautz [19] discuss an iterative algorithm due to Hu [12] which has been tailored to a cellular array machine. Chen and Feng [4] discuss a version of Ford's algorithm [9, 10] for an associative processor machine. Dekel and Sahni [6] present a method for cube connected and perfect shuffle computers. Here, however, we treat the shortest path problem in a more general way, using the regular algebra of Carré [2]. Furthermore, we will assume that:

(i) any number of processors may be used at any time, but we will give bounds on this number,
(ii) each processor may perform either a comparison or any of the four arithmetic operations in one time step, and
(iii) there are no memory or data alignment time penalties.

If p processors are being used, then we denote the computation time as T_p unit steps. Thus T_1 is the time required by a serial machine. We define the speedup of a computation using p processors over the serial computation time as S_p = T_1/T_p ≥ 1.

Carré [2] has shown that the shortest path problem, as well as many other problems, can be posed as linear systems of the form x = Ax + b in a regular algebra. He gives the examples of the scheduling algebra of Cruon and Hervé [5], the two-element Boolean algebra, and a stochastic communication problem (Kalaba [16]; Moisil [20]). Triangular linear systems, or recurrences, in a regular algebra also arise in the analysis of Fortran programs [17]. Consequently, it is important to solve such recurrences efficiently on a parallel computer.

In addition to considering the solution of linear systems in a regular algebra, we will present a parallel version of Dijkstra's algorithm [7], which is applicable only to graphs with positive arc weights. Furthermore, considerations for choosing an algorithm when the graph is sparse, i.e., has a large number of infinite entries in the distance matrix, are discussed.

1.1. Notation

We follow the convention that capital letters denote matrices, lower case letters denote vectors, and lower case Greek letters denote scalars. Except in the cases where we state time and processor bounds, the symbols + and × are taken to represent the generalized addition and generalized multiplication operations of the regular algebra. In the statement of the time and processor bounds, the symbols + and × take on their normal meaning. We will frequently omit the symbol × when it is clear that a generalized product is to be taken. We will also use the Σ and Π notations for the generalized sum or product of a set of items. Unless otherwise stated, log n = ⌈log₂ n⌉ for any positive number n.

1.2. The Regular Algebra

For convenience, we restate the algebra of Carré. We start by defining a semiring (S, +, ×) which satisfies the following properties:

(i) commutativity:    α + β = β + α,    α × β = β × α
(ii) associativity:    α + (β + γ) = (α + β) + γ,    α × (β × γ) = (α × β) × γ
(iii) distributivity:    α × (β + γ) = (α × β) + (α × γ)
(iv) idempotency:    α + α = α

for all α, β, γ ∈ S. The set S has a unit element e such that α × e = α, and a null element θ satisfying α + θ = α, α × θ = θ, for all α ∈ S. Furthermore, we have the law of multiplicative cancellation:

(v) if α ≠ θ and α × β = α × γ, then β = γ.

We define for the semiring (S, +, ×) the order relation ≤: α ≤ β if and only if α + β = α.
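To make the algebra concrete: for the shortest path problem, generalized addition is min, generalized multiplication is ordinary addition, e = 0, and θ = ∞, so that α ≤ β reduces to the usual numerical order. The following Python sketch is ours, not part of the report; it simply spot-checks the semiring laws for this instance.

```python
import math

# Shortest path instance of the regular algebra (assumed for illustration):
# generalized + is min, generalized x is ordinary addition,
# the unit element e is 0, and the null element theta is infinity.

E = 0.0           # unit element e:  a x e = a
THETA = math.inf  # null element theta:  a + theta = a,  a x theta = theta

def gadd(a, b):
    """Generalized addition: keep the shorter of two path lengths."""
    return min(a, b)

def gmul(a, b):
    """Generalized multiplication: concatenate two paths."""
    return a + b

# Spot checks of properties (i)-(iv) and the unit/null laws:
assert gadd(3.0, 5.0) == gadd(5.0, 3.0)                                   # (i)
assert gmul(2.0, gmul(3.0, 4.0)) == gmul(gmul(2.0, 3.0), 4.0)             # (ii)
assert gmul(2.0, gadd(3.0, 4.0)) == gadd(gmul(2.0, 3.0), gmul(2.0, 4.0))  # (iii)
assert gadd(7.0, 7.0) == 7.0                                              # (iv)
assert gadd(4.0, THETA) == 4.0 and gmul(4.0, THETA) == THETA
assert gmul(4.0, E) == 4.0
```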
1.3. Extensions to Matrix Operations

We now extend the definitions of the regular algebra to matrices all of whose elements belong to the set S. The definitions of matrix addition, matrix multiplication and matrix transposition are analogous to those in linear algebra. Note that matrix addition is idempotent: A + A = A. We define a square m by m unit matrix I = [ε_{i,j}] with ε_{i,j} = e if i = j and ε_{i,j} = θ if i ≠ j. Note that IA = AI = A for any m by m matrix A. The null matrix N, all of whose elements are θ, satisfies

    A + N = A,    A × N = N.

The partial ordering for matrices is easily extended: A ≤ B if and only if α_{i,j} ≤ β_{i,j} for all i, j. In particular, it follows that A ≤ B if and only if A + B = A.

The shortest path problem can now be stated as a linear system in the regular algebra. Let G = (V, A) be a graph with distance matrix D. Let the vector b represent an initial labeling of the nodes V (for the vector b, β_i is the initial label on node i). The shortest path problem is equivalent to solving the linear system x = Dx + b in the regular algebra [2]. For the single source problem the vector b is given by β_s = e, where s is the source node, and β_i = θ otherwise.

2. Solution of Recurrences

In this section we study the solution of systems of the form

    x = Lx + f,    (1)

where the matrix L is a strictly lower triangular matrix. These systems arise in the analysis of Fortran programs, and are also used in methods for solving general systems x = Ax + b.

2.1. The Column Sweep Algorithm

The most fundamental of all algorithms for solving recurrences in the regular algebra is the column sweep algorithm. The solution of x = Lx + f, where

    L = | θ                                        |
        | λ_{2,1}   θ                              |
        | λ_{3,1}   λ_{3,2}   θ                    |
        |    ⋮         ⋮         ⋱                 |
        | λ_{n,1}   λ_{n,2}   ⋯   λ_{n,n−1}   θ    |,    x = (ξ_1, ξ_2, …, ξ_n)^T,    f = (φ_1, φ_2, …, φ_n)^T,    (2)

is given by

    ξ_1 = φ_1,
    ξ_j = Σ_{i=1}^{j−1} λ_{j,i} ξ_i + φ_j,    j = 2, 3, …, n.    (3)

Rewriting the system (3) as column operations, we get the

Column Sweep Algorithm:
1. Set x ← f.
2. For i = 1, 2, …, n−1:
       x ← x + ℓ_i ξ_i    (in parallel).

In the algorithm, ℓ_i represents the ith column of the matrix L. A sequential algorithm for solving recurrence (1) requires n² − n operations, i.e., T_1 = n² − n. On the other hand, the column sweep algorithm can be performed in 2(n−1) time steps using at most n−1 processors. This gives a speedup of n/2 + O(1) over the sequential algorithm, with a corresponding efficiency of roughly 1/2.
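As an illustration (ours, not the report's), here is a minimal Python sketch of the column sweep algorithm for the (min, +) instance of the algebra. The inner loop over rows is the column operation the report distributes across n−1 processors in one parallel step.

```python
import math

def column_sweep(L, f):
    """Solve x = Lx + f over the (min, +) semiring by column sweep.

    L is a strictly lower triangular n x n list-of-lists, with math.inf
    playing the role of the null element theta; f is a length-n list.
    Serial sketch: the loop over j below is one parallel step using
    n - i processors on the machine model assumed in the report.
    """
    n = len(f)
    x = list(f)                      # step 1: x <- f
    for i in range(n - 1):           # step 2: sweep columns 1, ..., n-1
        for j in range(i + 1, n):    # done in parallel on a PRAM-style machine
            # generalized x <- x + l_i * xi_i: one x-step and one +-step
            x[j] = min(x[j], L[j][i] + x[i])
    return x

# Tiny example: chain 1 -> 2 -> 3 with unit arc weights, source at node 1.
INF = math.inf
L = [[INF, INF, INF],
     [1.0, INF, INF],
     [INF, 1.0, INF]]
f = [0.0, INF, INF]
print(column_sweep(L, f))            # [0.0, 1.0, 2.0]
```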
2.2. The Product Form of the Solution

In this section, the solution x of (1) is formed as a sum of products. Each product involves a power of the matrix L multiplied by the vector f.

Lemma 1: Let L be strictly lower triangular, i.e., L has the form given in equation (2). Then

    L^i = | N     N |
          | R^T   N |,

where R is upper triangular of order n−i; i.e., L^i is lower triangular with i null diagonals in its lower half.

Proof: From the definition of matrix multiplication and the facts that θ + θ = θ, θ × α = θ for all α ∈ S, and that L is null on the main diagonal, each multiplication of L by L introduces another diagonal of null elements into the product. ∎

Corollary 1: L^n = N. ∎

Lemma 2: Let M_i, i = 1, 2, …, n−1, denote the elementary matrix whose only non-null elements are those of column i of L, all of which lie below the main diagonal. Then L = M_1 + M_2 + ⋯ + M_{n−1}. ∎

Lemma 3: Let M_i and M_j be elementary matrices. If i ≤ j, then M_i M_j = N. If i > j, then P = M_i M_j has all null elements except in column j below the ith row.

Proof: M_i has its first i rows all null, and M_j has all null elements except in column j below the main diagonal. Thus if i ≤ j, all inner products of rows of M_i with columns of M_j must be null, i.e., M_i M_j = N. If i > j, then the only non-null inner products of P = M_i M_j occur when the inner products of rows i+1, i+2, …, n of M_i are taken with column j of M_j. ∎

Lemma 4: Let M_1, M_2, …, M_j, M_{j+1} be elementary matrices. Then

    (I + M_{j+1})(I + M_j) ⋯ (I + M_1)(I + M_1)(I + M_2) ⋯ (I + M_{j+1}) = (I + M_{j+1})(I + M_j) ⋯ (I + M_1).    (6)

The proof is by induction on j.

Basis step: For j = 1, we have (I + M_1)(I + M_1) = I + M_1 + M_1 + M_1². From Lemma 3 and idempotency, I + M_1 + M_1 + M_1² = I + M_1 + M_1 = I + M_1, establishing the basis.

Induction step: Assume that the assertion holds for index j, and let

    M̂_j = (I + M_j)(I + M_{j−1}) ⋯ (I + M_2)(I + M_1).

From the inductive hypothesis we have

    (I + M_{j+1})(I + M_j) ⋯ (I + M_1)(I + M_1)(I + M_2) ⋯ (I + M_{j+1}) = (I + M_{j+1}) M̂_j (I + M_{j+1}).

Furthermore,

    M̂_j (I + M_{j+1}) = M̂_j + M̂_j M_{j+1}.    (7)

By repeated applications of Lemma 3 we obtain M̂_j M_{j+1} = M_{j+1}; hence (7) becomes

    M̂_j (I + M_{j+1}) = M̂_j + M_{j+1}.

Consequently,

    (I + M_{j+1}) M̂_j (I + M_{j+1}) = (I + M_{j+1})(M̂_j + M_{j+1})
        = M̂_j + M_{j+1} M̂_j + M_{j+1} M_{j+1} + M_{j+1}
        = M̂_j + M_{j+1} M̂_j + M_{j+1}
        = M̂_j + M_{j+1}(M̂_j + I).

Since I is an element of the expanded sum of M̂_j, due to idempotency we have

    (I + M_{j+1}) M̂_j (I + M_{j+1}) = M̂_j + M_{j+1} M̂_j = (I + M_{j+1}) M̂_j,

establishing the lemma. ∎

In a similar fashion, it can be shown that

    (I + M_1)(I + M_2) ⋯ (I + M_{j+1})(I + M_{j+1})(I + M_j) ⋯ (I + M_1) = (I + M_{j+1})(I + M_j) ⋯ (I + M_1).    (8)

Let

    (I + L)* = (I + M_{n−1})(I + M_{n−2}) ⋯ (I + M_2)(I + M_1).    (9)

Since (I + L) = (I + M_1)(I + M_2) ⋯ (I + M_{n−2})(I + M_{n−1}), from Lemma 4 and equation (8) we see that

    (I + L)*(I + L) = (I + L)* = (I + L)(I + L)*.    (10)

We now introduce a lemma of fundamental importance to what follows.

Lemma 5: The vector x solves the system x = Ax + f if and only if x solves the system x = x + Ax + f.

Proof: If x = Ax + f, then adding x to both sides and using idempotency gives x + x = x + Ax + f, that is, x = x + Ax + f. Conversely, if x = x + Ax + f, then x + Ax + f ≤ Ax + f, i.e., either x = x or x = Ax + f; since x = x is redundant, x = Ax + f. ∎

The following result was originally given by Carré [2], but we state and prove it in terms of elementary matrices.

Theorem 2: Let x = (I + L)* f; then x satisfies x = Lx + f.

Proof: From Lemma 5, x = Lx + f is equivalent to

    x = (I + L)x + f.    (11)

Substituting x = (I + L)* f into the right-hand side of (11) and applying (10) gives

    (I + L)(I + L)* f + f = (I + L)* f + f.

From equation (9), f is an element of the expanded sum of (I + L)* f; thus by idempotency (I + L)* f + f = (I + L)* f, and x = (I + L)* f satisfies (11). ∎

From Theorem 2, we can write the solution x to equation (1) as

    x = (I + L)* f = (I + M_{n−1})(I + M_{n−2}) ⋯ (I + M_2)(I + M_1) f,

which motivates the following algorithm.

Algorithm 2: Solution of recurrences in product form.
1. Set (I + M_i)^{(0)} = I + M_i, i = 1, …, n−1; f^{(0)} = f.
2. For j = 0, 1, …, ν = log(n/4):
       Form (I + M_i)^{(j+1)} = (I + M_{2i+1})^{(j)} (I + M_{2i})^{(j)} in parallel for i = 1, 2, …, n/2^{j+1} − 1.
       Form f^{(j+1)} = (I + M_1)^{(j)} f^{(j)}.
3. Set x = (I + M_1)^{(ν+1)} f^{(ν+1)}.

Algorithm 2 is a direct analog of Algorithm 1 of Sameh and Brent [22] for triangular linear algebraic equations. Hence, their results (Theorem 2.1) apply: the solution x to the recurrence (1) can be found in

    T = (1/2) log² n + 2 log n + 3

steps using no more than π = (15/1024) n³ + O(n²) = n³/68 + O(n²) processors.
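The following Python sketch (ours, with helper names of our own choosing; a serial rendering of the parallel fan-in) mirrors Algorithm 2 for the (min, +) instance: it builds the factors I + M_i and merges them pairwise, which is the log-depth combination of step 2.

```python
import math
INF = math.inf

def mat_mul(A, B):
    """Generalized (min, +) matrix product."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    """Unit matrix: e = 0 on the diagonal, theta = inf elsewhere."""
    return [[0.0 if i == j else INF for j in range(n)] for i in range(n)]

def product_form_solve(L, f):
    """Sketch of Algorithm 2: x = (I+M_{n-1}) ... (I+M_1) f.

    M_i is the elementary matrix holding column i of L; factors are merged
    pairwise, mirroring the log-depth rounds of step 2 (done here one round
    at a time, sequentially rather than in parallel).
    """
    n = len(f)
    factors = []
    for i in range(n - 1):           # build I + M_i, i = 1, ..., n-1
        F = identity(n)
        for j in range(i + 1, n):
            F[j][i] = L[j][i]
        factors.append(F)
    if not factors:                  # n = 1: x = f
        return list(f)
    while len(factors) > 1:          # one pass per parallel round j
        merged = [mat_mul(factors[k + 1], factors[k])
                  for k in range(0, len(factors) - 1, 2)]
        if len(factors) % 2:         # odd leftover factor carries over
            merged.append(factors[-1])
        factors = merged
    P = factors[0]                   # P = (I + L)*
    return [min(P[i][k] + f[k] for k in range(n)) for i in range(n)]
```

Note that merging adjacent factors preserves the overall order of the product in (9), so the result is exactly (I + L)* f.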
2.4. Limited Parallelism

In this section we present two methods, which are analogs of those of Hyafil and Kung [15], for solving the recurrence (1). These methods are used when the number of processors p is fixed.

The first method applies algorithm decomposition to the column sweep algorithm:

    x^{(i+1)} = (I + M_i) x^{(i)},    i = 1, 2, …, n−1.

It is easy to see that

    (I + M_i) x^{(i)} = (ξ_1^{(i)}, …, ξ_i^{(i)}, λ_{i+1,i} ξ_i^{(i)} + ξ_{i+1}^{(i)}, …, λ_{n,i} ξ_i^{(i)} + ξ_n^{(i)})^T.

Thus, given p processors, the product (I + M_i) x^{(i)} can be formed in 2⌈(n−i)/p⌉ steps. Since

    T_p(n) = Σ_{i=1}^{n−1} 2⌈(n−i)/p⌉,

we have from Hyafil and Kung [15] that

    T_p(n) ≤ n²/p + (2 − 3/p)n + 2/p − 2.

This method is most practical when p ≤ n.

The second method uses the problem decomposition principle. Partition

    L = | L_{1,1}                          |
        | L_{2,1}   L_{2,2}                |
        |    ⋮         ⋮        ⋱          |
        | L_{m,1}   L_{m,2}   ⋯   L_{m,m}  |,    x = (x_1, x_2, …, x_m)^T,    f = (f_1, f_2, …, f_m)^T;

then equation (1) can be written as

    x_1 = L_{1,1} x_1 + f_1,
    x_2 = L_{2,1} x_1 + L_{2,2} x_2 + f_2,
      ⋮
    x_m = (Σ_{j=1}^{m−1} L_{m,j} x_j) + L_{m,m} x_m + f_m.

This leads to the following algorithm.

Algorithm 3: Recurrence solver with a limited number of processors.
For i = 1, …, m:
    Set f̂^{(i)} = f_i + Σ_{j=1}^{i−1} L_{i,j} x_j.
    Form x_i = (I + L_{i,i})* f̂^{(i)} using Algorithm 2.

Hyafil and Kung have shown that for p = ⌈n^r⌉, 1 ≤ r ≤ 3, and m = ⌈n/(68 n^r)^{1/3}⌉,

    T_p(n) = O(n^{2−r} log n)      for 1 ≤ r ≤ 3/2,
    T_p(n) = O(n^{1−r/3} log n)    for 3/2 ≤ r ≤ 3.

3. Banded Recurrences

In this section, we take the matrix L of (1) to be banded, i.e., of the form

    L = | L_1                        |
        | R_1   L_2                  |
        |       ⋱      ⋱             |
        |         R_{k−1}   L_k      |,    (12)

where L_i is a strictly lower triangular matrix of order m, i = 1, 2, …, k, and R_j is an upper triangular matrix of order m, j = 1, 2, …, k−1.

3.1. Unlimited Parallelism

We can show that the time and processors required for solving (1) are the same as given by Sameh and Brent [22]. The proof is exactly like that of their Theorem 3.1. Before this can be proved, however, we need to establish the following.

Lemma 6: Let L be a strictly lower triangular 2n by 2n matrix given by

    L = | L_1   N   |
        | R     L_2 |,    (13)

where R is n by n and upper triangular, and let f be a 2n-vector partitioned correspondingly, f^T = (f_1^T, f_2^T). Then the solution of the linear recurrence

    x = (I_{2n} + L)x + f    (14)

is given by

    x = | I_n   N   | | y_1 |
        | G     I_n | | y_2 |,    (15)

where G = (I_n + L_2)* R and y_i = (I_n + L_i)* f_i, i = 1, 2.

Proof: From the structure of L and Theorem 2, we see that x_1 = (I_n + L_1)* f_1 = y_1. Also, from (13) and (14) we have

    x_2 = R x_1 + (I_n + L_2) x_2 + f_2.

From Theorem 2,

    x_2 = (I_n + L_2)* [R x_1 + f_2] = G y_1 + y_2,

proving the lemma. ∎

Now we state the main result.

Theorem 3: The solution of the linear recurrence x = (I + L)x + f, where L is an n by n strictly lower triangular banded matrix (12), i.e., n = km, is obtained in (3 + 2 log m) log n − (1/2)(log m)(1 + log m) time steps requiring less than m(m+1)n/2 processors.

Proof: Let f^T = (f_1^T, f_2^T, …, f_k^T). The algorithm of Lemma 6 can be generalized to yield a scheme which is exactly similar to that of Sameh and Brent [22], Theorem 3.1, as follows.

Algorithm 4: Banded recurrences.
Step 1: Set y_1^{(0)} = (I + L_1)* f_1, and

    [G_{i−1}^{(0)}, y_i^{(0)}] = (I + L_i)* [R_{i−1}, f_i],    i = 2, 3, …, n/m.

Step 2: For j = 1, 2, …, ν = log(n/m):
    Set r = 2^{j−1} m.
    Set

    G_i^{(j)} = | N   G_{2i}^{(j−1)}                     |
                | N   G_{2i+1}^{(j−1)} G_{2i}^{(j−1)}    |,    i = 1, 2, …, n/(2r) − 1,

    and

    y_i^{(j)} = | y_{2i−1}^{(j−1)}                                       |
                | G_{2i−1}^{(j−1)} y_{2i−1}^{(j−1)} + y_{2i}^{(j−1)}     |,    i = 1, 2, …, n/(2r).

Upon termination, y_1^{(ν)} contains the solution vector x. ∎
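A Python sketch of the two-block combination that drives Algorithm 4 (ours, not the report's; it assumes the (min, +) instance and serializes the star operations): each half is solved independently, and the second half is corrected through G y_1 with G = (I_n + L_2)* R, exactly as in (15).

```python
def star_apply(Lb, v):
    """Compute (I + Lb)* v for strictly lower triangular Lb by column sweep."""
    n = len(v)
    x = list(v)
    for i in range(n):
        for j in range(i + 1, n):
            x[j] = min(x[j], Lb[j][i] + x[i])
    return x

def combine_blocks(L1, L2, R, f1, f2):
    """Lemma 6 sketch: solve the 2n x 2n recurrence (13)-(15).

    y1 = (I+L1)* f1 and y2 = (I+L2)* f2 are the independent half-solves;
    the second half is then x2 = G y1 + y2 with G = (I+L2)* R.
    """
    n = len(f1)
    y1 = star_apply(L1, f1)
    y2 = star_apply(L2, f2)
    # G y1 = (I+L2)* (R y1): apply R first, then the star of L2.
    Ry1 = [min(R[i][k] + y1[k] for k in range(n)) for i in range(n)]
    Gy1 = star_apply(L2, Ry1)
    x2 = [min(Gy1[i], y2[i]) for i in range(n)]   # generalized addition
    return y1 + x2

# Usage: two 2-node blocks; R is the upper triangular coupling block.
INF = float("inf")
L1 = [[INF, INF], [1.0, INF]]
L2 = [[INF, INF], [1.0, INF]]
R  = [[1.0, 1.0], [INF, 1.0]]
print(combine_blocks(L1, L2, R, [0.0, INF], [INF, INF]))   # [0.0, 1.0, 1.0, 2.0]
```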
3.2. Limited Parallelism

In this section we will assume that m ≤ p ≪ n, and we proceed along the same lines as the algorithm of Theorem 2.3.13 of Kuck and Sameh [18] for solving banded triangular systems of linear algebraic equations. Our results are summarized by the following theorem.

Theorem 4: Given p processors, partition the system (1) so that V_0 is of order m, each V_i (except possibly I + V_k) is of order s, i.e., k = ⌈(n−m)/s⌉, and each U_i is of the form (19). Then the solution vector x is given by

    x_0 = (I + V_0)* f_0,    (23)
    x_i = (I + V_i)* (f_i + U_i x_{i−1}),    i = 1, 2, …, k.    (24)

Equation (23) can be evaluated in 2(m−1) steps using the column sweep algorithm, since p ≥ m. The k equations in (24) can be solved one at a time using the algorithm developed for solving recurrence (18). Hence, using p processors, the banded recurrence (12) of order n and bandwidth m can be solved in time

    T_p = 2(m−1) + τ ⌈(n−m)/p(p+m−1)⌉,

where τ is given by equation (17). ∎

4. Linear Systems x = Ax + b

In this section, we will deal with the general linear system

    x = Ax + b.    (25)

In Section 1.3, we saw that the solution of the shortest path problem can be obtained by solving the system (25). Due to the form of (25), the n by n matrix A has, in general, its only null elements on the main diagonal. The matrix A is called sparse if it has a small proportion of non-null elements. If we let n_A be the number of non-null elements of A, then we say that A is sparse if n_A ≪ n².

4.1. Elimination Methods

The system (25) can be reduced by a generalized Gaussian elimination.

Algorithm 5: Parallel Gaussian elimination.
For k = 1, 2, …, n−1:
    Set α_{i,j} = α_{i,k} × α_{k,j} + α_{i,j},
        β_i = α_{i,k} × β_k + β_i,
    in parallel for i, j = k+1, k+2, …, n.

The resulting upper triangular system can be solved using a method of Section 2. The reduction of the matrix A to upper triangular form can be accomplished in 2(n−1) steps using at most (n−1)² processors. The added time and, perhaps, added processors necessary to solve the resulting linear recurrence of Gaussian elimination can be avoided by using the generalized Jordan elimination method of Carré [2, §7.3].

Algorithm 6: Parallel Jordan elimination.
For k = 1, 2, …, n−1:
    Set α_{i,j} = α_{i,k} × α_{k,j} + α_{i,j},
        β_i = α_{i,k} × β_k + β_i,
    in parallel for i = 1, …, n; j = k+1, …, n.

Note that upon termination, the vector b contains the solution x. Clearly, Algorithm 6 can be performed in 2(n−1) steps using at most n² + 1 processors. Thus, we see that by using an additional 2n processors over Gaussian elimination, we can solve the system (25) in the same time as it takes to do the elimination step of Gaussian elimination.

4.2. Iterative Methods

In this section we present two iterative methods for solving the system (25). They correspond to the Jacobi and Gauss-Seidel methods for the iterative solution of linear algebraic equations. The methods of Bellman [1] and Moore [21] are sequential versions of the generalized Jacobi method. Ford and Fulkerson's [10] sequence of scanning the arcs in Ford's method [9] is a sequential version of the generalized Gauss-Seidel method.

The system (25) can be solved using the generalized Jacobi method

    x^{(k)} = Ax^{(k−1)} + b.    (26)

Since the graph G associated with the distance matrix A is assumed to be definite, from Theorem 6.1 of Carré [2], for an initial guess of a null x^{(0)} we see that x^{(n)} = x.

We observe that the matrix-vector product Ax^{(k−1)} can be formed in parallel in log n + 1 steps using at most n² processors. Thus a single iteration requires log n + 2 steps and at most n² processors. The overall algorithm can be accomplished in no more than n(log n + 2) steps using n² processors.
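For illustration (ours, not the report's), the generalized Jacobi iteration (26) in the (min, +) instance is just the Bellman-Moore relaxation; a serial Python sketch:

```python
import math
INF = math.inf

def jacobi_shortest_paths(A, b, n_iter=None):
    """Generalized Jacobi iteration (26): x^(k) = A x^(k-1) + b in (min, +).

    Starting from the null vector x^(0) (all entries theta = inf), n
    iterations reach the solution of x = Ax + b for a definite graph.
    Each iteration is one matrix-vector product plus the addition of b.
    """
    n = len(b)
    if n_iter is None:
        n_iter = n
    x = [INF] * n                    # null initial guess x^(0)
    for _ in range(n_iter):
        # A x^(k-1): n^2 generalized multiplications, log n + 1 step depth
        x_new = [min(min(A[i][k] + x[k] for k in range(n)), b[i])
                 for i in range(n)]
        if x_new == x:               # early exit once a fixed point is hit
            break
        x = x_new
    return x

# Usage: a_{i,k} holds the weight of arc k -> i; source at node 1.
A = [[INF, INF, INF],
     [2.0, INF, INF],
     [5.0, 1.0, INF]]
print(jacobi_shortest_paths(A, [0.0, INF, INF]))   # [0.0, 2.0, 3.0]
```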
In our context, the generalized Jacobi method can be considered as a direct method. Since x^{(n)} is the solution to the system (25), we can write it, by repeatedly using equation (26), as

    x^{(n)} = A^n x^{(0)} + A^{n−1} b + A^{n−2} b + ⋯ + Ab + b.    (27)

Since x^{(0)} was chosen to be null, equation (27) simplifies to

    x^{(n)} = A^{n−1} b + A^{n−2} b + ⋯ + Ab + b,    (28)

which can be evaluated in parallel using the following algorithm.

Algorithm 7: Direct Jacobi method.
Set x_0 = b.
For i = 0, 1, …, ν = log(n/2):
    x_{i+1} = (I + A^{2^i}) x_i    (we can also terminate whenever x_{i+1} = x_i);
    A^{2^{i+1}} = A^{2^i} × A^{2^i}    (if necessary).
Upon termination, x = x_{ν+1}.

This is the method proposed by Dekel and Sahni [6] for cube connected and perfect shuffle computers. By summing the time counts for Algorithm 7, and observing that the maximum number of processors occurs during the formation of A^{2^{i+1}}, we get the following theorem.

Theorem 5: The solution to the system (25) can be obtained in no more than 2 log² n + 2 log n − 1 steps using n³ processors. ∎

We will now discuss a parallel implementation of the generalized Gauss-Seidel method. We begin by observing that the matrix A of the system (25) can be written as

    A = L + U,    (29)

where L is strictly lower triangular and U is strictly upper triangular. Consequently, (25) becomes

    x = (L + U)x + b,    (30)

which leads to the generalized Gauss-Seidel iteration

    x^{(k+1)} = Lx^{(k+1)} + Ux^{(k)} + b.

Solving for x^{(k+1)} yields

    x^{(k+1)} = (I + L)* Ux^{(k)} + (I + L)* b = Mx^{(k)} + c.    (31)

From formula (2.16) of Sameh and Brent [22], we get that the iteration matrix M and vector c can be formed in (1/2) log² n + O(log n) steps using at most n³/6 + O(n²) processors. Thus, the solution x can be found by the generalized Gauss-Seidel method in n log n + O(n) steps using at most n³/6 + O(n²) processors.

Carré [2] has shown that the number of iterations required by the generalized Gauss-Seidel method is not greater than the number required by the generalized Jacobi method. It is thus likely that for many problems the overall work for the generalized Gauss-Seidel method will be less than that of the generalized Jacobi method when used in an iterative fashion. This potential savings in time has a substantial cost in additional processors.

When used as a direct method, the generalized Gauss-Seidel scheme would use Algorithm 7 applied to the matrix M and vector c of equation (31). Note that an additional (1/2) log² n + O(log n) operations are required to form the matrix M and vector c. Hence, for the generalized Gauss-Seidel method to be more efficient than the generalized Jacobi method, Algorithm 7 applied to the matrix M and vector c must take at least 25% fewer iterations than the same algorithm applied to the original matrix A and vector b. Consequently, one would prefer the generalized Jacobi method for solving the system (25).
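A Python sketch of Algorithm 7 for the (min, +) instance (ours; serial, with the doubling of A done by an explicit matrix product). Applied to M and c in place of A and b, the same routine gives the direct form of the generalized Gauss-Seidel method discussed above.

```python
import math
INF = math.inf

def min_plus_matmul(A, B):
    """Generalized matrix product; the n^3-processor step of Theorem 5."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def direct_jacobi(A, b):
    """Sketch of Algorithm 7: evaluate x = (A^{n-1} + ... + A + I) b
    by doubling, x_{i+1} = (I + A^{2^i}) x_i, squaring A between rounds."""
    n = len(b)
    x = list(b)                       # x_0 = b
    P = [row[:] for row in A]         # P holds A^{2^i}
    rounds = max(1, math.ceil(math.log2(n)))   # roughly v + 1 = log n rounds
    for i in range(rounds):
        # x_{i+1} = (I + P) x_i = min(x_i, P x_i)
        x = [min(x[j], min(P[j][k] + x[k] for k in range(n)))
             for j in range(n)]
        if i + 1 < rounds:            # square only if another round follows
            P = min_plus_matmul(P, P)
    return x

# Usage (same arc convention as before): expect [0.0, 2.0, 3.0].
A = [[INF, INF, INF],
     [2.0, INF, INF],
     [5.0, 1.0, INF]]
print(direct_jacobi(A, [0.0, INF, INF]))
```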
4.3. Dijkstra's Method

We now consider the special case when all of the arc weights of the associated graph G are positive. In the regular algebra this means that the elements of the matrix A of the system (25) satisfy α_{i,j} > e for all i and j. Due to this added assumption, it is easy to see that there can be no linear algebraic analogs for algorithms tailored to these problems. Dijkstra's method [7] is such an algorithm, iterative in nature with no linear algebraic analog.

In Dijkstra's method, nodes of the graph G are separated into two classes, temporary and permanent. All nodes start out temporary and become permanent one at a time. The algorithm terminates once all nodes have become permanent.

Let mode[·] be an array of bits used to indicate whether a given node is temporary or permanent. When mode[i] = 0, node i is temporary; when mode[i] = 1, node i is permanent. If we wish to solve the system (25), where α_{i,j} > e for all i and j, then Dijkstra's algorithm can be stated as follows.

Algorithm 8: Parallel Dijkstra's algorithm. The vector b is overwritten by the solution x.
Initialize: mode[i] = 0, i = 1, 2, …, n.
For i = 1, 2, …, n−1:
    Find j such that β_j = Σ_{mode[k]=0} β_k.    (32)
    Set mode[j] = 1.    (33)
    For all k such that mode[k] = 0:
        β_k = β_k + β_j × α_{j,k}.    (34)

We observe that for each iteration, line (32) can be done in log(n+1−i) steps using n−i processors. Overall, line (32) requires

    Σ_{i=2}^{n} log i ≤ n log n

operations and at most n−1 processors. Similarly, line (33) requires one operation each iteration, and line (34) can be performed in parallel at each iteration using two operations and n−i processors. The overall work for a parallel version of Dijkstra's method is n log n + 2n − 3 steps using at most n−1 processors. Compared with n iterations of the generalized Jacobi method, Dijkstra's method requires essentially the same time, but far fewer processors (n−1 vs. n²). Thus, for dense matrices A, Dijkstra's method appears to be the iterative method of choice.

4.4. Sparse Matrix Considerations

When the matrix A of the system (25) is large and sparse, that is, the number of arcs n_A in the graph G is much less than n², direct methods are impractical. Unless the matrix A is banded or has a special structure, undesirable fill-in of the matrix A will occur. Since the economization of storage for large sparse matrices is essential, iterative methods must be used.

Not all iterative methods are appropriate for this problem. The generalized Gauss-Seidel method requires the formation of the iteration matrix M = (I + L)* U. This matrix in general will be dense and hence too costly to store. We will thus restrict ourselves to a discussion of the generalized Jacobi method and Dijkstra's method.

Let q be the maximum number of non-null elements in any row of A. In order to compare the two algorithms, we will use a cost function which is the product of the number of processors and time. For dense matrices, we see that Dijkstra's method has a cost of

    (n−1)(n log n + 2n − 3) = n² log n + O(n²).

Similarly, the generalized Jacobi method has a cost of n³ log n + O(n³), which is about n times greater than that of Dijkstra's method. For sparse matrices, the generalized Jacobi method requires n log q + 2n steps and n_A − n processors.
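Finally, a serial Python sketch of Algorithm 8 in the (min, +) instance (ours, not the report's). The min-search of line (32) and the relaxations of line (34) are the steps the report assigns to n−i processors; here both are ordinary loops.

```python
import math
INF = math.inf

def parallel_dijkstra(A, b):
    """Sketch of Algorithm 8 in the (min, +) algebra.

    b holds the initial labels (e = 0 at the source, theta = inf elsewhere)
    and is overwritten by the solution x, as in the report.
    """
    n = len(b)
    b = list(b)
    mode = [0] * n                    # 0 = temporary, 1 = permanent
    for _ in range(n - 1):
        # line (32): find the temporary node j with the minimal label,
        # a log-depth reduction on the report's machine model
        j = min((k for k in range(n) if mode[k] == 0), key=lambda k: b[k])
        mode[j] = 1                   # line (33): make node j permanent
        for k in range(n):            # line (34): relax all temporary nodes
            if mode[k] == 0:
                b[k] = min(b[k], b[j] + A[j][k])
    return b

# Usage: a_{j,k} > 0 is the weight of arc j -> k (inf if absent); source 1.
A = [[INF, 2.0, 5.0],
     [INF, INF, 1.0],
     [INF, INF, INF]]
print(parallel_dijkstra(A, [0.0, INF, INF]))   # [0.0, 2.0, 3.0]
```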