IflW I HI CENTRAL CIRCULATION BOOKSTACKS The person charging this material is re- sponsible for its renewal or its return to the library from which it was borrowed on or before the Latest Date stamped below. You may be charged a minimum fee of $75.00 for each lost book. Theft, mutilation, and underlining of books are reasons for disciplinary action and may result in dismissal from the University. TO RENEW CALL TELEPHONE CENTER, 333-8400 UNIVERSITY Of ILLINOIS LIBRARY AT URBANACHAMPAIGN JAN 2 8 1996 FEB 1 2 1997 AUG 2 4 2006 When renewing by phone, write new due date below previous due date, L162 b/u- * t UIUCDCS- -R- ■77-862 771 &Jsli UILU- -ENG 77 rjih GAUSSIAN ELIMINATION AND NUMERICAL INSTABILITY by Robert D. Skeel ,ne Library ot «* JIM- 1977 v oi W' 00 '" April 1977 Digitized by the Internet Archive in 2013 http://archive.org/details/gaussianeliminat862skee UIUCDCS-R-77-862 GAUSSIAN ELIMINATION AND NUMERICAL INSTABILITY* by Robert D. Skeel April 1977 Department of Computer Science University of Illinois at Urbana-Champaign Urbana, Illinois 61801 -"Supported in part by the US Air Force under grant AFOSR-75-2854, 1 . Introduction The book by Forsythe and Moler [1967] gives an error analysis for Gaussian elimination drawn from the work of Wilkinson and an analysis for iterative improvement due to Moler. In this paper we do a careful backward error analysis using a different idea of what it means for a perturbation to be small, namely that each datum is subject to a small relative change. A system of n equations n I a..x. = b . , 1 < i < n, in n unknowns x. , 1 < i < n, is often written in matrix notation as l — — Ax = b. The solution x computed in floating point arithmetic is not generally exactly equal to x, but it still may be acceptable if it satisfies one of two criteria: (i) It fulfills the accuracy requirements of the problem poser. This usually involves some measure of how far x is from x. For example, in determining the coefficients of an inter- polating polynomial it is some norm of the residual r = A(x - x) which is of concern. More often though it is some norm of the error x - x which is of concern, (ii) It is the exact solution of a problem which differs from the given problem by less than the uncertainty in the data, so that it is conceivable that x exactly solves the original problem. Uncertainty in the data is generally present if only because of the roundoff error introduced when the numbers are put into the computer. It is this second criterion which motivates backward error analysis. In §2 we give mathematical form to the informal discussion of ill conditioning found in Hamming's [1971] book Introduction to Applied Numerical Analysis. The condition of a system is the sensitivity of the solution to uncertainty in the data, and this sensitivity is often measured by a condition number. The usual definition of the condition number is based on the idea that the uncertainties in the data are roughly the same size in an absolute sense. However, if we suppose that these uncertainties have roughly the same relative size, then we obtain A _1 1 I A I I x as the condition number of a linear system where ||°|| is the max norm. There are at least a couple of reasons for believing that this second approach is more realistic. First, most numerical computations are done in floating point rather than fixed point arithmetic, and for floating point computation the conversion of data to machine represented to numbers results in errors of the same relative size. 
Second, measurement errors are usually more nearly the same in relative size than in absolute size. In §3 we pose the question: What is the least amount by which A and b must be perturbed so that x exactly solves the perturbed problem? The answer to this question turns out to be | (b - Ax) . | max ' i ' i (|b| + |A| |x|) . which we call the "backward error." If this is less than the unit roundoff error u, then the second acceptability criterion is satisfied. And if it can be shown that an algorithm produces a backward error that is always bounded by some fixed multiple K(n)u of the unit roundoff error, then by increasing the precision of the intermediate results by a factor K(n) , the second acceptability criterion can be met. An algorithm with this desirable property is said to be stable. If an algorithm is only stable for infinitesimal values of u, it is said to be asymptotically stable. The stability of Gaussian elimination with row pivoting (usually called partial pivoting) is examined in §4. By using an example of Hamming's it is shown that row pivoting is not asymptotically stable even for systems with equilibrated matrices. Then by means of a careful error analysis performed in Appendix A a bound on the backward error is obtained which contains the quantity max , 1 -1. I I * I N (D. A x ) . 1 '1 ' ' ' l min . 1 -1. i i - i x • ( Pi A x ) • where D is the matrix of row scaling factors. This quantity is minimized by choosing D 1 = diag(|A| |*|), which calls for the i-th row to be divided by la... x, + a._ x„ I +. . .+ 1 ll 1 ' ' i2 2 ' la. x |. It is shown that with such a choice for D n row pivoting would be 1 in n' 1 stable. Of course this is impractical, which explains Stewart's [1973, p. 151 observation that "In spite of intensive theoretical investigation, there is no satisfactory algorithm for scaling a general matrix." Nonetheless, the ratio max i (|A||x|) mm , I . I I * I v • C A x ) . J " J is an excellent a posteriori measure of how poorly scaled the system is. Sometimes, programming considerations (Sherman [1976]) call for the use of column pivoting instead of row pivoting, where by column pivoting we mean that columns are interchanged so that each pivot is the largest in its row. In §5 it is shown that column pivoting could be made stable if it were somehow possible to scale the columns with the matrix of scale factors D 2 = diag(|x|). This calls for each column to be multiplied by its corresponding computed solution value. A measure of ill scaling is given by (|A|e). ||x|| max ' ' 1 " " i (|A||i|). where e is the vector of all ones. Row pivoting may be regarded as the generalization of complete pivoting in which the ordering of the columns is arbitrary, and similarly column pivoting as the generalization in which the ordering of the rows is arbitrary. From this observation it follows that the results of both §4 and §5 apply to complete pivoting. However, one suspects that the error of complete pivoting satisfies an error bound which is appreciably better than simply the smaller of the bounds for row and column pivoting. In §6 iterative improvement is examined in the hope that the poor stability properties of Gaussian elimination can be corrected. It is shown that a single iteration of iterative improvement performed in single precision is enough to make Gaussian elimination asymptotically stable. Before proceeding, it might be interesting to demonstrate the instability of complete pivoting with a simple 2x2 system of equations. 
Consider Ax = b where A = 3 3 -1 and b = The coefficient matrix A is equilibrated according to the definition of Forsythe and Moler [1967, p. 45]. Using rounded t digit decimal (floating point) arithmetic the elimination step yields A'x = b' where A' = 3 3 .99- • -9 and so the computed solution and b* = 1 33---3 x = .33---3 x 10 .33-V3 -t- The backward error is determined by considering perturbed problems of the form 3(1 + 6 U ) 3(1 + 6 12 ) - (1 + 6 21 ) 1 + 6 13 and choosing the relative changes 6.. so as the minimize the maximum <5 . . ij ij In this case, 6_ must be chosen to be -1, and so the backward error is 100% regardless of the precision t. 2. Condition of Linear Systems The condition of a problem is the sensitivity of its solution to uncertainties in the problem data. The importance of this concept is that it indicates the amount of accuracy that one should reasonably expect for the solution of a problem with inexact data. And even for problems with exact data, the conversion of the numbers to the computer's floating point number base usually introduces errors. As the measure of the condition of a problem we take the maximum amount by which an infinitesimal perturbation in the problem data can be amplified in the solution. More precisely if C denotes the given problem data and (?) denotes the solution of a problem with data £, then we define the condition nwnbev to be lim relative distance from <{>(£) to c|>(£) /? -, \ £->-£; relative distance from £ to £ In the case where £ and cj)(£) are scalars, the condition number is the absolute value of the relative derivative, namely (£) = A b. (In roundoff analysis the number of equations n is not considered to be part of the problem data; rather we take the point of view that each value of n defines a separate class of problems.) There are two crucial matters that have to be settled: (i) how to define relative distance in the problem space, and (ii) how to define relative distance in the solution space. Any problem which is to be solved in an approximate sense is incomplete unless there is also some "metric" specified for measuring how good the approximation is. It is this metric that should be used in defining the "relative distance from (C) to such that I a . . - a.. I < e|a..| and lb. -b.l < elb.l This seems to be consistent with the ideas expressed in Hamming's [1971] book Introduction to Applied Numerical Analysis. On page 117 it is stated that "The term 'ill-conditioned' is ill defined. The vague idea is that small changes in the initial system can produce large changes in the final result. If we are to take floating point seriously, then we should say 'relatively small changes' and 'relatively large changes'." and on page 122 it is stated that "...the system is indeed ill conditioned because, no matter how we try, we are unable to solve the system so that the answer is not sensitive to small changes in the original coefficients. " Thus it seems that by "relatively small changes in the initial system" Hamming means "relatively small changes in the coefficients of the initial system." A similar thought is expressed by Kahan [1966, p. 795], It is worth mentioning that this approach has the advantage of forcing the perturbed matrix A to have the same sparsity structure as the original matrix A, making it more plausible to regard A as the result of perturbing the original physical problem. Having chosen our metrics, we are in a position to determine the condition number of a system Ax = b . 
We begin by obtaining bounds on the uncertainty in the solution due to the uncertainty in A and b. Bounds of this type also appear in Bauer [1966]. Our notation uses inequalities between arrays to mean inequality of the corresponding components. The absolute value of an array is also to be understood in a componentwise sense. THEOREM 2.1. Let Ax = b and (A + 6A) (x + 6x) = b + 6b where | HlA^llAllxl + lA^Mblll " (l + .lllA^HAllWWI PROOF. Let I be such that (|a _1 ||a||x| + | a -1 1 1 b | ) ^ = || |a _1 | |A| |x| + |A _1 ||b||| Define 6A and 6b by 6a jk = sgn(a £jXk ) £ |a jk | and 6b . = sgn(a. . ) £ |b . 3 ^J J where A = (a . . ) . Then (A 1 6Ax + A 1 6b) - S T. a 6a x,+ E a. . 6b. = £ C | A -1 1 I A I | x | + |A _1 ||b|) £ = £|||A- 1 ||A||x| + lA^llblH , but from (2.2) 10 (A 1 6Ax + A ^b) = (6x - A 1 6A6x) Ji , and so e II I A 1 | |a| |x| + | A 1 \ lb I || < II 6x|| + £ II |A 1 | |a| II ||6x|| . Q.E.D. THEOREM 2.3. The condition number i as defined by (2.1), of a linear algebraic system Ax = b is A x -1 + lA^llb PROOF. The condition number of a linear algebraic system Ax = b is lim 6x M x e(6A,6b) -y e(6A,6b) where e(6A,6b) = min{e _> 0: J 6 A j ;< e | A | , | 6b | <_ e | b | } and 6x satisfies (A + 6A) (x + 6x) = b + 6b . Consider any sequence (6A , 6b ) for which m m e (6 A , 6b ) -*■ as m -*■ °°. By Theorem 2.1 we have mm 6x , (kk xu ^ || |A 1 | |A| |x| + |A 1 | |b| || < e(6A ,6b ) LLJ LJ — u — ' ' — - — LJ — L - n ~ mm,. ,«. „, v iii.-lii.iiK (1 - e(6A ,6b ) A m m for sufficiently large m. Therefore 1^ Hfix m M|x|l < HlA^llAllxl + lA^MbMI m ^ co e (6A ,6b ) - ||x|| mm 1 1 ii which gives an upper bound on the condition number. Let e be a sequence converging to zero. By Theorem 2.2 there exists a sequence (6A , 6b ) such that e(6A , 6b ) = e and mm m ll«* II ii m ii -1 -1 > e m | A] |x| + |a I |b| || (i + OIIa^iiaIIDMI Therefore — — 6x \\/\ x iiIa _ 1|IaM i I A —1 1 i , i ti lim " m" " " || |A |A| |x| + |A |b | m -> co m x which gives a lower bound on the condition number. Q.E.D. 11 In subsequent sections of this paper, we will consider the effects of perturbing only the elements of the coefficient matrix. THEOREM 2.4. Let Ax = b and (A + 6 A) (x + 6x) = b where |<5A| <_ e | A | . Then fix el llA^llA ll xM I u (l-.HlA-HjAllMWI PROOF. Similar to that of Theorem 2.1. Q.E.D. THEOREM 2.5. Let Ax = b. Then there exists 5A such that | 6A | = £ |a| and such that the solution x + 5x of (A + 6A) (x + 5x) = b satisfies 6x , elllA^llAllxlH " (l + eHlA^llAllDHxIl PROOF. Similar to that of Theorem 2.2. Q.E.D. It follows from these last two theorems that when only A is subject to uncertainty the condition number is A Cond(A, x) = [UJ1 ALkLU -ii x Since || |a _1 | |a| |x| || < || |a _1 | |a| |x| + |A _1 | |b| || £ 2|| |a _1 | |a| |x| ||, Cond(A, x) is also adequate for the case where both A and b are subject to uncertainty. A similar quantity ||A- 1 ||,||Ae ( . ) || |x.| K(A, x) = ^r X is used by Van der Sluis [1970a], which he calls [1970b] the "condition number of the solution." Here e,. x denotes the i-th unit vector. (j) The condition number of a matrix A could be defined as the T maximum value of Cond(A, x), which is achieved with x = e = (1, 1,..., 1) . Thus Cond(A) = Cond(A, e) = || |a _1 | | a| ||. This quantity is more satisfying as a measure of ill condition than the usual cond(A) = ||a ||||a|| for a couple of reasons. 
First, the matrix 12 | A | |A| is a mapping of the solution space into itself, which means that the quantity || |A | |a| || can be defined entirely in -1 terms of the solution space norm. whereas cond(A) = ||A ||A|| is defined in terms of a solution space norm and a residual space norm, which seems quite unnecessary. Second, the quantity Cond(A) is invariant under row scaling. Multiplying a system of equations by a diagonal matrix does not change the problem in any fundamental way. For example, all systems Dx = b where D is diagonal are well conditioned. Accordingly, we have that Cond(D) = 1; whereas cond(D) can be arbitrarily large. Example. According to Hamming [1971, p. 120], the system Ax = b is well conditioned where A = 3 2 1 2 2e 2e 1 2e -e b = 3 + 3e 6e 2e The inverse of the coefficient matrix and the solution are given below: -.6e .4 .2 C 1 = - 1-1. 8e Hence lA" 1 ! lAl = X .4 -J.e 1 -.3 .2e 1 -.6 .2 .2e _1 -.6 -.4e -1 -.6 1+1. 8e 2.4e 1.6e -1 x = 1-1. 8e 4e +1.2 1.4-.6e -1 8e 1.6 l-.6e and 13 a 1 | |a| IxI + |a 1 | Ibl = 1-1. 8e 9.6e + 3.6e' 4.8 + 2.4e 6 - 2.4e which shows that the system is well conditioned. However, Cond(A) = ,8e - 1 + 2.6 - .6e 1-1. 8e which indicates that the system would be ill conditioned for some different right hand side b, and in fact, Hamming [1971, p. 122] gives such an example, 14 3. Stability of Algorithms for Linear Systems Let +, -, *, / denote the floating point operations corresponding to +, -, x, /. Every reference to a floating point result x o y carries with it the assumption that x, o, and y are such that the result is well defined. Nothing is assumed about the floating point arithmetic except that the relative roundoff error is bounded by u/(l + u) where the unit roundoff error u is a small positive number; that is, x 6 y = (x o y) (1 + 6) for some 6 depending on x, o, and y which satisfies 1 ' — 1 + u It follows from the above condition that x O y x o y = — 1 + 6' where |6'| <_ u. Note that for rounding u = -r- and for chopping u = 8 where B is the base and t is the number of base 3 digits in the fraction of the floating point numbers. For any computed solution x we define the relative backward error to be the smallest real number n,~>. such that (A + 6A)(x - 6x) = b + 6b for some 6 A, 6b, and 6x with 1 6 A | <_ r\ ( ^.\k\, |6b| <^ n /^\ |b| , and |6x| <_ ri/^\|x - 6x| . The backward error can be interpreted in the following way: The computed solution x is the rounded solution of a problem with rounded data where H/o\ is the maximum relative roundoff error. Thus, if Hz-n is no larger than the unit roundoff error u, then our solution is as good as our data deserve; otherwise, improved accuracy may be justified, perhaps by using iterative improvement. Further motivation for this definition is given in Miller [1975] and Bauer [1974]. If a solution cannot be computed because the matrix is nearly singular, then the backward error is defined to be 15 the smallest real number H/o\ such that A + 6A is singular for some 6A with | 6A| <_ H/o\ |A[ . Stability of the algorithm means that there exists a stability constant K(n) and a stability threshold u(n) > both independent of the problem data (A, b) such that the relative backward error ^(3% 1 K(n)u provided that u £ u(n). A weaker concept asymptotic stability allows the threshold u(n) to be data dependent. These two types of stability are the same as the "backward stability" and "asymptotic backward stability" used by Miller [1972]. The backward error n . . 
is not easy to determine, and for this reason we introduce two variants of the backward error which are easier to compute. Let H^ 9 >. be the smallest real number such that (A + 6A)x = b + 6b for some 6A and 6b with |6A| <_ n, ? v|A| and |6b| _< n,„v |b|. Let n,-,^ be the smallest real number such that (A + 6A)x = b for some 6A with | 6A | Ji n ,-, >J A | . Naturally r\ (r >\ .1 M/n\ i.n^x- THEOREM 3.1. Let I = {i: (|a||x| + Ibl). = 0}. The backward error r | (b - Ax) . | JiTl (lA l| x l + l b ] ). V (b " A ^i - /« i € I. n (2) = \ + oo otherwise . PROOF. First consider the case where (b - Ax). ^ for some i£ I, Suppose that n,^ < + °°. Then there exist 6A and 6b such that (A + 6A) x = b + 6b where 1 6A | <_ T\,~Ak\ and 1 6b | _< n,^ |b We have 16 |b - Ax| = |fiAx - fib | £ |6A| |x| + | 6b | £ n (2) (|a| |x| + |b|), (3.1) which is impossible since (b - Ax). ^ and (|a||x| + |b|). = 0. Therefore H/ 9 \ = + °°. Second consider the case where (b - Ax). = for all i 6 I, To obtain an upper bound on n^y consider the choice (b - Ax). 6a _ f Sgn( V |a iJ ! (|A||x| + |b|). 1 * 1 - lj L0 16 1, 1* I. and _i b (b - Ax). 6b. - f l ( I A 'I-I + Ib|) ± l 1 16 1. We have 6 Ax - 6b = b - Ax or (A + 6A)x = b + 6b, and so n / n x < max (b - Ax) J (2) - ."I j (|A|[x| + |b|). Since n,.^ < + °°, equation (3.1) must hold; and so I (b - Ax) . | n. . > max / | a I u I ■ I u I n • Q.E.D (2) ~ ± £ j (|A| |x| + |b|) i THEOREM 3.2. Let I = {i :(|a||x|). = 0}. The backward error — if (b - Ax) . = for i 6= I , i |(b - Ax). max n iTi TW ^ ■+ oo otherwise. PROOF. Similar to Theorem 3.1. Q.E.D. Remark. Similar types of results apply to other problems; for example, the backward error for a polynomial equation a» x + a x +. . .+ a = is given by n b j l ~n ^n-1 , | a_. x + a. x +. . .+ a n (2) -_L 1. Hence assume H/«\ < 1. There exist 5A, 6b, fix such that (A + 6k) (x - \) which verifies (3.2). The first inequality 3 3 of (3.3) is obvious if H/^n >. "F. Hence assume H/o\ K T* From (3.5) 18 b - Ax J <_ 2n, 3 JA| |x - 6x| + n, 3 x |b - Ax | + n ^v |A| |x| , and using (3.6) gives 2n Therefore (l ■ rw)|b ■- A*| ( T -^- + n,„.)|A||x|. (3) ' — 1 - ri/o-j (3) n (3) O - n (3) ) (1 - n (3) ) wh < 3n (3) (1 ~ 3^ C3> > (1 - 3n (3) )(l - 3n (3) ) ich proves (3.3). The first unequality of (3.4) is obvious if n,~N >, 1. Hence assume H/ 9 \ < 1. We have |b - Ax | <_ n (2) (|b| + |A| |x|) 1 n (2) (|b - Ax | + 2 1 A | |x|), and so lb - a* I < , n(2) IaIIxI. -l-n (2) Therefore 1/-,>. £ 2n/-N./(l - H/^), which implies (3.4). Q.E.D. A good algorithm should (i) return an acceptable answer most of the time (robustness) , and (ii) signal failure whenever it does not return an acceptable answer (reliability) . We formally define an algorithm to be reliable if there exist K(n) and u(n) such that for any (A, b) and any u _< u(n) either the algorithm computes an answer with n <_ K(n)u or the algorithm signals failure. Any algorithm for solving linear systems can be made reliable by computing the backward error with floating point arithmetic and then accepting the answer only if the computed backward error is less than a prescribed multiple of the unit roundoff error. For example, the next theorem shows that if the computed backward error f) < Ku, then we can conclude that n < (K + n)ue 19 The residual is to be computed in single precision r. = b. - (...(a... x x, + a._ x x_) . . . + a. x x ) 11 ll 1 i2 2 in n or in double precision r. = fl(b. - (...(a., x x. + a._ x x )... 4- a. * x )). 
l i 11 i 12 2 in n Here 6 denotes the double precision counterpart of O where it is assumed that x 6 y = (x o y) (1 + 6) i i 2 2 with |6|<_u/(l + u). In practice the double precision unit roundoff error is either this small (rounding in base two) or smaller. By fl(°) we mean the conversion of a double precision value to a single precision value. It is assumed that fl(xoy) = (xoy)(l+<5) with -u/(l+u) < 6 £ u, which is true for rounding and chopping. The computed backward error f) is determined by n = max (|r 1 |7(...(|a ±1 * k ± \ + \a ±2 * ^|)... + |a ±n * X J). i THEOREM 3.4. If n is the computed value of r\ } then -(n+2)u " _ nu (n+2)u - nu e n-nue ^.n£e n+nue where r\x for single precision residual accumulation, U \ 2 V-u for double precision residual accumulation. PROOF. Let q be the computed value of Ax. By the usual type of error analysis |q - Ax | < [(1 + G) n - 1] |a| |x| < nu e nU | A | |x| . (3.7) We have that (i - y^-) 2 ~ . 1 + u r\ > max b - jm 20 and f\ < (1 + u)(l + 1 + u ) max (1 " U _nT1 b - q| Arm 1 + u' where division of two vectors is defined componentwise This reduces to f\ < max (1 + 2u)(l + u) n from which it follows that -(n+2)u Anrr- (1 + 2u) (1 + u) n-2 n < max b - A^ F* (n+2)u Using and (3.7) gives _ (n-2)u b - q| - |q - Ax | 1 |b - Ax | . . |b ■■ q| + |q " Ax| . _ (n-2)u ,b ■■ q| ■ nC e^ -|A| |x| 1 |b - Ax| < |b - q| + nu e Dividing by |a||x| and taking the maximum yields b - ql L - (n-2)u -f- 1 - + nu e A x max b •■ a I - (n-2)u nu e < ri < max aTH #- A xl Q.E.D. Before concluding this section, it should be mentioned that for some classes of problems it may be unreasonable to expect an algorithm to be stable. If the numbert output values is fairly large compared to the number of input values, then it becomes very difficult for an algorithm to be stable because each output value must arise from the scone perturbation of the input values. For example, Miller [1975] shows that the usual algorithm for inverting triangular matrices is unstable. Hence it seems better to use stability as a relative concept rather than an absolute concept. This idea is used by Miller [1976] in a paper entitled "Roundoff analysis by direct comparison of two algorithms." 21 4. Gaussian Elimination with Row Pivoting and Scaling This section applies the ideas of the preceding section to Gaussian elimination with partial pivoting using row interchanges and implicit row scaling. The reciprocals of the scale factors are to be given as inputs d.. , d_,..., d to the algorithm, and so the pivoting is done as if one 1 z n were solving D Ax = D b where D = diag(d l5 d„,..., d ). To keep the L Z n notation simple, it is assumed that the equations are numbered according to their ordering after all row interchanges have been performed. The computations of the algorithm are as follows: (1) a . . = a . . , 1.1 ij for k - l(l)n - 1, af k+1) =a< k) -m„*a<*>, l.j »k+l. (4.2) ij ij ik kj ' J - '0 if i < j, y m. . if i > j, IJ ij < a ij lf i - j ' U if i > j, (4.3) y i = V for i = 2(l)n, y. = b. - (...(l ±1 x ¥± + l^ x y 2 ) . . . + l ± ±mml x y._ 1 ) , x = y /u , n n nn for i = n - 1(-1)1, (4.4) x. = (y. - (...(u. , - x x ., 1 + u. . x x., )... + u. > x ■ ff L (k) kk i > k. It is necessary for the completeness of the theory that the Gaussian elimination algorithm be extended to permit the use of zero as a scaling factor. 
For any diagonal matrix D = diag(d , d„,..., d ) we let D denote the matrix diag({d } , {d„} ,..., {d } ) where {d} = d if d 4 e if d = 0, That is, D is the matrix D with all the zero diagonal elements replaced by e. The condition (4.5) is replaced by the following: lim e -v u.r 1 .« i e lk {d k } e a kk < 1 , r] . (k) , (k) Let e = minljd. a,, ' a -i. d = 0, d. 4 0}. Then it can be easily shown K. X that for any c with < lei < e the scale matrix D has the same effect on the choice of pivots as does D. Unfortunately, Gaussian elimination with pivoting is unstable. This instability arises in the decomposition stage when the quantities 23 (k+1) (k) . * (k) a . . = a . . - m . , x a, . 13 iJ ik kj (k) (k) are being formed. If |m., a, . >> |a |, then the error in a.. is not (k+1) ik "kj (k) lj 13 very small relative to a.. , and so in our backward error analysis we cannot throw this error back into a (k) The extreme case occurs (k) (k+1) when a.. =0 and a.. ^ 0, which is commonly called "fill in." For sparse systems of equations it is quite common to order the rows so as to avoid fill in. This reduces computational cost, and it apparently may also contribute to stability. The instability of Gaussian elimination has been pointed out bv Hamming, who on page 119 of his book [1971] announces the " Theorem Pivoting can take a well-conditioned system into an ill-conditioned system of simultaneous linear equations." and on page 123 states "We have not justified the pivoting method; rather we have shown that it is an 'old wives' tale.' But like most old wives' tales, it is a mixture of truth and mystic faith." To prove his theorem, Hamming uses the example discussed at the end of §2 For this example, one elimination step with partial or complete pivoting yields the system A'x = b' where A' = -4 -2 -j- + 2e ~ + 2e -2 -1 T + 2e T - e , b' = 3 + 3e -2 + 4e -1 + e (4.6) assuming exact arithmetic. This problem is ill conditioned for small e 3 since Cond(A',x) = Se - 1 - 3 + 3e 1 - 1.8e 24 If the elimination were performed in floating point arithmetic, then a slight perturbation of (4.6) could result, which may have a solution which differs from the true solution by an amount proportional to c . This kind of error could not arise from slightly perturbing the original problem because it has a condition number of about 6. For example, suppose that the computed righthand side of (4.6) was 3 + 3e b' = -2 + 4e - -1 + e + and everything else was exact. Then 1 - e - 1 u/8 1 + e X u/4 and by Theorem 3.2 the backward error is u/(8e + u) . A related observation was made by Gear [1975] "It might be possible to say that 6A represents a perturbation to the original physical problem if the sparsity structure of 6k were the same as that of A. Unfortunately, we will show that such a demand on the structure of 6A can lead to very large bounds on I 6A J |, bounds probably dependent on the condition number of A." This was supported by the example 11-1-1 e e JL 1 for which Cond(A) = 4. b = , x - 1 .-1 -1 25 In one of the theorems that follow, it is shown that the situation is not quite as bad as Gear suggests. The bounds on |6A| | depend not on how ill conditioned the problem is but on how badly scaled the equations are. We begin by obtaining a totally a priori error bound for Gaussian elimination. The proof is modeled after that of Forsythe and Moler [1967], which is mostly borrowed from Wilkinson [1963]. 
However, our error bound, like that of Van der Sluis [1970b], is more informative than that of Forsythe and Moler in that it distinguishes among the columns of AA. THEOREM 4.1. Let the vector x be computed by Gaussian elimination with row pivoting and row scaling where D = diag(d 1 , d_,..., d ) is the 12 n matrix of reciprocal scale factors. Then there exists AA such that (A + AA)x = b with || Id" 1 aa|z|| < X (n)u|| Id" 1 a|z|| for arbitrary z >_ and arbitrary z satisfying < |e| < e" where X (n) = [19- 2 n-2 - n - 81e 2nu . PROOF. See Appendix A. Q.E.D. Note 1 . The factor e appears in the Forsythe and Moler book as the constant 1.01. The advantage of e is that it indicates the nature of the higher order effects and it does not require placing some arbitrary restriction on the size of nu. Note 2. It is actually possible to show that | A A [ <_ L [ A | for some lower triangular matrix L , although the best possible L R is somewhat complicated. With z = e in Theorem 4.1 we get the usual type of bound Hd; 1 AA|| < X (n)u||D- 1 A|| . 26 The bound of this theorem is practically always extremely pessimistic. However, there are cases where this bound can be attained in the limit as u -*■ 0. THEOREM 4.2. There exists a problem Ax = b and a floating point arithmetic <+, -, x, /> suoh that the solution x computed by Gaussian elimination with partial or complete pivoting satisfies (A + AA)x = b only for those matrices AA for which AA > ri9-2 n 2 - n - 81u + 0(u 2 ). Therefore s the bound of Theorem 4.1 is the best possible bound in the limit u ■+ 0. PROOF. See Appendix A for the proof, which employs a modification of Wilkinson's [1963] example. Q.E.D. The next theorem uses Theorem 4.1 to get a bound on the backward error. We are especially interested in the effect of scaling on the error bound. THEOREM 4.3. Let I = {i: ( | A| | x| ) . = 0} . If d . = for i G I and d . ^ for i 4 1 > then the backward error , s max |D a| |x| min (Id A I I x I ) . i#t ' * PROOF. Putting z = |x| in Theorem 4.1 gives |D^(b - Ax)||=||D^ AAx|| < || ID" 1 AA||x||| (4.7) < x (n)u |||d £ 1 A| |x|||. Multiplying this by e and letting e -*■ gives max | (b - Ax) . | _5_ X (n)u max ( | A| |x | ) . , i e I x i G I x from which it follows that (b - Ax). = for i£ T. Hence, from Theorem 3.2 we have that 27 -1 (b - Ax). | (|D (b - Ax)]). 1 max , i . I i - i ; = max I ^ I C|A| |Sfc|> ± . £ T ( | D "l A || ft |) i and from (A. 7) we see that (|D _1 (b - Ax) |) . ^ x(n)u max |d -1 a||x|. Q.E.D. By choosing d. = (|a| |x|). (4.8) i ' . i the bound on the backward error is minimized, giving n <^ x( n ) u « This suggests that a linear system should be scaled by dividing each row by its weighted £.. norm where the weights are the components of the computed solution. Unfortunately, (4.8) represents an implicit equation for the scale factors d. because the computed solution x is a function £ (11(D)) of the scaling matrix D; that is, D must solve the equation D = diag(|A| |?(n(D))|), (4.9) for which a solution may not exist. The nature of this equation becomes more apparent by noting that it is equivalent to solving for a permutation P that satisfies P = n(diag(|A||c(P)|)). (4.10) For if D satisfies (4.9), then P = 11(D) satisfies (4.10); and if P satisfies (4.10), then D = diag( | a| | £ (P) | ) satisfies (4.9). In principle we could determine the solution to (4.9), if there is one, by testing to see if any of the n! permutations P satisfy (4.10). In the cases where a solution exists, the backward error is bounded by x(n)u. 
One suspects that (4.10) almost always has a solution, and it is even conceivable that (4.10) always has a solution, at least whenever £(P) is defined for all P. The existence of a solution for (4.10) implies the existence of an ordering for the rows which makes Gaussian elimination stable. 28 If one wishes to solve (4.10), the following iteration would likely converge and converge quickly for almost every system of equations: P (Q) = II(diag(|A|e)), P (m+1) = n(diag(|A||5(P (m) )|)), m = 1, 2,.... This is not suggested as a practical algorithm though because poorly scaled equations can usually be accurately solved by doing iterative improvement. A more useful application of Theorem 4.3 is the diagnosis of ill-scaling, for a R (A, x) = max A X min A X is an easily computable measure of how badly scaled the rows are. Remark. The bound of Theorem 4.1 can be refined: (|D AA|z). <_ x(n)u max (|d A|z).. 1 ~ J 1 i J This suggests that max ( | A | | x | ) 1 . ii2 . 71X17x1). i > j i 1 1 i 'i would be a better measure of the possible effect of ill scaling. The quantity a D (A, x) is not very satisfactory for theoretical purposes because x depends on the arithmetic used in the computation. We would prefer to use a (A, x) for the theory. For Hamming's example R |a| |x| = (3 + 3e, 6e, 4e) and a (A,x) = -f- + j- . Near optimal row R scaling for this problem is given by 4e D*A- 3 2 1 2 2 2 2 -1 , D _1 b = 3 + 3e 6 2 For Gear's example |a||x| = (2/e + 2, 1, 1, 2) and a R (A,x) optimal row scaling is given by = 2/e + 2. Near 29 £ E £ E D" 1 A = e £ , D 1 b = 1 1 1 1 2 One may wonder about the effect of scaling strategies such as row equilibration. Van der Sluis [1970b, p. 80] gives an example showing "that it is quite possible. .. that there exists no bound depending on n only for the ratios of the errors after and before equilibration." He goes on to describe a cautious equilibration scheme that never worsens the situation at the expense of possibly not improving it. An adaptation of this scheme to our theory is to choose a. . ill d. = min max — i , . 'a. . ' ' which has the effect of leaving no row of A strictly dominated by any other row of A. Note that, since d. < 1, we have min|D a| |x| > a| |x|. Furthermore, (|a| |x|). = min Sl-^l i£i k I a k£' k£ l < min max JJ. a, „ x k j kj £ <_ d . max | A | | x | , whence max | D A||x| <_max|A||x|. Therefore min ID A| I x fr^- 1 ,*, \ max[p A 1 j x | a R (D A, x) - -1 | - max A X min A X = a R (A, x), so that the scaling of the problem is not made worse by our choice of D, 30 The theorem that follows gives bounds on the backward error that in the limit u -> depend only on how ill scaled the problem is and not how ill conditioned it is. First we need a lemma. LEMMA 4.4. If D is nonsingular, then D "l AAx || < X(n)u || |D- 1 A||x ||l 1 - x(n)u Cond(A D) PROOF. We have x + A 1 AAx = x, which implies and D -""AAx = (I + D 1 AA "4)) 1 D X AAx iD^AAxlU— llD_] 1 - || D 1 AA A * D|| The term in the numerator ||d _1 aax|| ± || |d _1 aa| | x | II <_ x( n ) u I I D A I |x | ||, and the term in the denominator ||D _1 AA A _1 D|| = HId^AA A _1 D|e|| ± || |d -1 AA| I A _1 D | e |i <_ x(n)u || |d -1 a| |A _1 D|e|| = x(n)u || |d _1 a| |A _1 d| II . Q.E.D. THEOREM 4.5. Let D be nonsingular and let |a| |x| > 0. Gaussian elimination with row pivoting gives X(n)u a (D~ A, x) n < * 1 - 2x(n)u Cond(A 1 D) a D (D 1 A, x) R provided that the denominator is positive. PROOF. Choose AA as in Theorem 4.1. 
From Theorem 3.2 we have that - AAx | T) = max XfPT 31 Since x = x + A Ax, we have |x| <_ |x| + |A~ 1 D|e||D~ 1 AAx|| , and so I d_1a II x I 1 I d_1a I|x| + |d _1 a| |A _1 D|e)|D~ 1 AAx|| Thus e||D -1 AAx|| n < max | D 1 A | | x | e||D _1 AAx < max D 1 A||x| - e Cond(A 1 D)||d AAx Applying Lemma 4.4 yields x(n)u a (D A, x) _ e||D AAx || < — |D A I |x| , 1 - x(n)u Cond(A D) from which the theorem follows. Q.E.D. Although we are unable to prove that there is always some ordering of the rows for which Gaussian elimination is stable, we can show that this is true asymptotically as u ■*■ 0. THEOREM 4.6. For any problem such that |a||x| > there is some ordering of the rows for which Gaussian elimination is asymptotically stable. PROOF. Using Theorem 4.5 with D = diag(|A||x|) gives n < *L(iOu A 1 - 2x(n)u max J — ' 1| |A||x| A| |x for small enough u. Hence for IaI Ixl u < u(n) = min 4x(n) |a| |a I |a| |x we have n < 2x(n)u. Q.E.D. 32 We end this section by examining the effect of scaling on a good bound for the "forward" error. THEOREM 4.7. The error _ x || < x(n)ullA- 1 DH HlD^Allxl 1 ■ x(n)u Cond(A _1 D) PROOF. We have ||x - x || = ||a _1 (- AAx) II < ||a -1 d|| ||d -1 aax|| , and the theorem follows from Lemma 4.4. Q.E.D. If higher order terms in u are ignored, the bound on the error is minimized by choosing D = diag( |A | |x |) . Thus Ha" 1 !! II IaI U N lllA^llAllxlH is a measure of the possible effect on the "forward" error of how poorly the equations happen to be scaled. For Hamming's example this quantity is fi — 1 —l yy e + 0(1) and for Gear's it is e + 0(1). The usual type of bound on the error is of the form ||ft - x|| < x(n)u|K" 1 li||K' 1 A||||x||+ 0(u 2 ). This is minimized by D = diag(|A|e), which is row equilibration with the £ norm. 33 5. Gaussian Elimination with Column Pivoting and Scaling This section is similar to the previous section except that we examine the variant of Gaussian elimination in which the columns are interchanged in order to ensure that the pivot element is the largest in its row. The algorithm is assumed to do column scaling where the scale factors d.. , d„,..., d are given as inputs to the algorithm. Again the 1 / n selection of the pivot is assumed to be done exactly so that a ki d i (k) a kk < d. , i > k + 1 . — ' k 1 — Writing the condition in this form allows for the use of zero scale factors but does not permit the selection of a zero pivot. An a priori error bound is given by the following theorem: THEOREM 5.1. Let the vector x be computed by Gaussian elimination with column pivoting and column scaling where D = diag(d , d ,..., d ) is the matrix of scale factors. Then there exists A A such that (A + AA)x = b with |AAD)e <_ x(n)u|AD|e where x( n ) =127*2 - 5n - 7je PROOF. See Appendix B. Q.E.D. It is likely that the constant x(n) in this bound could be replaced by a smaller constant. The following theorem indicates how the columns should be scaled in order that Gaussian elimination with column pivoting be stable. THEOREM 5.2. Let x be the value of A b computed by Gaussian elimination with column pivoting and column scaling where D = diag(d , d~,..., d ) is the matrix of scale factors. Let I = {i: (|a| |x|). = 0} and let J={j:x=0}. If d. = for j GJ and d. f for j $ J , then the backward error •J J J 34 (|AD|e), n < x(n)u max , i . i i ^ i ? - max (|d x|) .. PROOF. We have x = DD~ x where D denotes D with all diagonal zero entries replaced by ones. Hence |b - Ax | = |AAx| £ |AAD|e Hd" 1 *!! £ X(n)u |AD|e | JD^x -1. = x(n)u |AD|e max (|d x|).. 
& J We have that (IaIIxI). = implies that a.. = for i ^ J, which implies 1111 l ij ' r that (lADle). = 0. Hence the theorem follows from Theorem 3.2. Q.E.D. COROLLARY. The backward error C |d _1 x| ) . n _< x(n)u max - "•- . i,j 0. Gaussian elimination with column pivoting gives X(n)u a (AD, D _1 x) n < 1 - 2x(n)u Cond(AD) a (AD, D """x) provided that the denominator is positive. PROOF. Choose AA as in Theorem 5.1. From Theorem 3.2 we have that - AAxI n = max I I I „ I . |A| |x| Furthermore x = x + A AAx, and so IaMxI > I A I I x I - I A I I A 1 || AAx I Therefore I AAx n < max I A I I x I - I A II A - 1 1 I AAx I Applying Lemma 5.3 yields a D X x) AAv I < I A I I v I AAX| - 1 - v(n)u Cond(AD) l A H x l» X 1 from which the theorem follows. Q.E.D. THEOREM 5.5. For any problem such that |x| > there is some ordering of the columns for which Gaussian elimination is asymptotically stable. PROOF. Using Theorem 5.4 with D = diag(|x|) gives X(n)u n < A 1 - 2x(n)u max J — -1 | A | |x| for small enough u. Hence for 37 u < u(n) = min -1 4x(n)|A x ||A||x| we have n <_ 2x(n)u. Q.E.D, Recall that the stability threshold in the case of Gaussian elimination with optimal row ordering was lA||x| u(n) = min -1 4x(n)|A| |A X | |A| |x| It is easy to show that this is larger than the stability threshold for optimal column ordering. This is a slight indication that row oivotine may be superior to column pivoting. we have THEOREM 5.6. The error ||£ _ x || < xCnHl l A-'llADl lHlD^xll II* l! - 1 - x(n)u Cond(AD) PROOF. Since (A + AA)(x - x) = - AAx, x-x = -(I+A 1 AA) 1 A ^"AAx ■ - A _1 AA(I + A~ 1 AA)" 1 x = - A -1 AAD(I + D~ 1 A~ 1 AAD)~ 1 D 1 x. Therefore, x - x < 1 - i | D 1 A 1 AAD|| 1a" 1 ! [aadU Itsliidl i-| ! !d -1 a _1 I IaadU < X(n)u H |A 1 1 [API II Hd"^ - 1 - x(n)u Cond(AD) Q.E.D. 38 If higher order terms in u are ignored, the bound on the error is minimized by choosing D = diag(|x|). Thus for column pivoting HlA-'llA » || HlA^llAllxlll is a measure of the possible effect of ill scaling on the "forward" error. 39 6. Iterative Improvement It is often thought that iterative improvement is not worthwhile unless either (i) the uncertainty in the values of A is less than the unit roundoff error (e.g., if the elements of A are integers) or (ii) we wish to diagnose ill conditioning. This thinking is based on the fact that Gaussian elimination with pivoting is stable from the absolute error point of view. But according to the relative error point of view, Gaussian elimination may not give acceptable accuracy, and so it is of interest to examine the stability behavior of iterative improvement. Results of a careful error analysis are given for iterative improvement both with and without double precision accumulation of the residuals. The algorithm being considered is described as follows where subscripts denote iterates rather than components of vectors: x.. = value of A b computed by row pivoting, for m = 1, 2, 3, . . . r = computed value of b - Ax , m m d = value of A r computed by row pivoting, m m y x , .. = x + d . m+1 m m The algorithm for computing r appears in §3 and the algorithm for d is in §4, The theorem which follows shows that just one iteration of iterative improvement with just single precision accumulation of the residuals is enough to make Gaussian elimination asymptotically stable. This may seem to contradict the usual advice (Forsythe and Moler [1967, p. 49]) that "It is absolutely essential that the residual r be computed with a higher precision than that of the rest of the computation." 
Actually there is little conflict because we have shown that poorly scaled systems may be solved with an effective precision of much less than single precision. 40 THEOREM 6.1. Assume that |a||x| > 0. Gaussian elimination followed by one iteration of iterative improvement results in a backward error which satisfies nl 1 (n+l)u + {( X (n) 2 + 2n x (n) + n 2 + 2n) Cond(A _1 ) a (A, x) ^ R + X (n) a R (A, x) + | n 2 + | n}u 2 + 0(u 3 ) for single precision residual accumulation and n£ < u + { X (n) 2 Cond(A _1 ) o R (A, x) + X (n) a R (A, x) + n}u 2 + 0(u 3 ) for double precision residual accumulation. That is 3 the algorithm is asymptotically stable in either case. PROOF. The theorem follows from Theorem C.9 in Appendix C. Q.E.D. Note. The asymptotic nature of these bounds conceals the fact that certain assumptions on the smallness of u are necessary in order to get any bound at all. The actual assumptions, found in Appendix C, are too lengthy to reproduce here; roughly speaking it must be assumed that the coefficient of the second order term is less than 1/u. Recall from Theorem 4.3 that for no iterations we have 2 n 5 x(n) ° R ( A > x ) u + °( u )» and thus one iteration does make a big difference in the size of the bound. However, the presence of the product Cond(A ) a (A, x) in the second order term indicates that this may not be the case for problems which are sufficiently ill scaled and ill conditioned. THEOREM 6.2. Assume |a||x| > 0. The backward error n for iterative improvement of Gaussian elimination with row pivoting satisfies Tim n' < (n+l)u + (2n(x(n) + n+1) Cond(A -1 ) a D (A, x) m — k + X (n) a R (A, x) + \ n 2 + | n + l}u 2 + 0(u 3 ) for single precision residual accumulation and lim n" < u + (x(n) a n (A, x) + n+l}u + 0(u ) m — K for double precision residual accumulation. 41 PROOF. The theorem follows from Theorem C.7 in Appendix C. Q.E.D, The main effect of doing more iterations with single precision accumulations is a moderate reduction in the magnitude of the second term. But for the double precision case there is a striking improvement due to 2-1 2 the disappearance of the x ( n ) Cond(A ) a (A, x)u term so that the bound K -1 3 on n" depends on the condition number of A * only through the 0(u ) term. m This may represent a significant improvement for problems which are both poorly scaled and ill conditioned. 42 7. Practical Implications The comments that follow are suggested by the error analysis, but their usefulness remains to be established. By means of examples it has been shown that Gaussian elimination with (partial or complete) pivoting does not generally provide all the accuracy that the data deserve or even a fixed fraction of that accuracy. Hamming [1971, p. 121] states "It is reasonable to ask how typical these examples are and how often in the past the pivoting method has created the ill conditioning that was reported to occur by some library routines. The answers are not known at this time; all that is claimed is that textbooks and library descriptions rarely, if ever, mention this possibility (though it is apparently known in the folklore)." and so it seems that there have been practical instances where the pivoting method has performed poorly. Perhaps Gaussian elimination without iterative improvement should be regarded as a "quick and dirty" way to solve linear equations. The computation of the backward error is one reliable test for deciding whether or not the solution of a linear system is "reasonably accurate." 
The test can be made quite efficient by accumulating r and |a| |x| + |b| at the same time. If the test is failed, then in most cases the use of iterative improvement would result in a solution which passes the test. One could, of course, forgo the backward error computation and just do iterative improvement until "convergence." But such a procedure may not be completely reliable since it has not been rigorously proven that "convergence" implies a reasonably accurate solution. Stewart [1973, p. 205] mentions "the possibility that, with a violently ill-conditioned matrix, the iteration may appear to converge to a false solution." 43 It is also suggested by the theory that if double precision accumulation of the residuals is costly, then iterative improvement with single precision accumulation might still be beneficial. The success of the pivoting method depends upon a reasonable scaling of the equations, which is at best guesswork unless one has some knowledge about the sizes of the solution components. If c ~ |x|, then (i) for row pivoting one should scale the system to get (D~ 1 A)x = (D^b) where D = diag(|A|c). (ii) for complete pivoting one should scale the system to get (D~ AD 2 )(D~ 1 x) = (D^b) where D - diag(|A|c) and T> 2 = diag(c) It may be worthwhile to allow users of a linear equation solver to provide an estimate of the solution, particularly if the solution variable X is being used only as an output parameter. For simple use of the program X could be set to all ones. 44 Appendix A. Error Bounds for Row Pivoting For Lemmas A.l through A. 8 it is assumed that D has nonzero diagonal elements. This assumption does not apply to the theorem that follows these lemmas. For any n x n matrix C = (c . . ) let c. = c . . /d . . Also, iJ iJ let u) = 1 + u. LEMMA A.l. We have and m.. < a) d . /d, , i > k, ik' — ' i k 1 ij i -(k)i k+1 k _ 1 ,„ ,k-l-£,- | k-1,- a. . < a) I {2b)) a. . + a) a ij i,j :• k. I i (k) , (k) m iJ ^K /a kk PROOF. Equation (4.1) implies i > k, and because of row pivoting (see (4.5)) we get the first inequality of the lemma. Equation (4.2) implies |a< k+1 > 1 ij and therefore |a (k+1) 1 iJ <_oo|a (k) ij + (1 + ZiOlm., a^l, i,j > k + 1 1 ik kj ' J — < Jaf k) | + u(l + 2u)|a 1 (k) |, - ij kj i,j > k + 1, The second inequality of the lemma follows from this by induction on k. Q.E.D. LEMMA A. 2. The matrices L = {I..) and U = (u..) satisfy LD = A + E (1) + E (2) + ... + E (n "^ (k) (k) where the matrices E have elements e.. which satisfy ij -u> ^laf^l + (2 + 3u) a) 2 u|m..a. u for i,j > k , ik kj (k) I.«l<< - 1 I to u a (k) ik L regardless of the pivoting strategy. for i > j = k, otherwise. 45 PROOF. Define the elements of E (k) by r (k+1) (k) (k) a . . - a . . + m., a, . ij ij ik kj . * (k) * (k) m ik a kk " a ik ^' for i,j > k, for i > j = k, otherwise. By separately considering the cases i <_ j and i > j it is straightforward (k) (k) to show that the elements e . . of E satisfy n-1 (k) n E e;?' - £ £., u . - a. . , k= l « k= i lk ^ ij which establishes the equality of the lemma. Let k <^ n - 1 be fixed. Write (4.1) as ™ Ik = C-S^Hl + V- i > k .+ 1, and (4.2) as a (k+1) = (a 00 - m.,a< k) (l + 6..))(1 + 6'..), i,j > k + 1, ij ij ik kj ij ij - where the 6's are relative roundoff errors. Then ^a^V - m.. a, (k) (6.. +6'.. +6.. 6'..) for i,j > k, ij ij xk kj xj xj xj xj (k) U; .. / UK J4 - \ a -,-U 6 IJ ik ik Lo for i > j = k, otherwise, and the lemma follows from the bound u/(l + u) on the 6's. Q.E.D. LEMMA A. 3. 
The matrices L and U satisfy LU = A + E with -1 |D E|z|| £ (3-2 n-1 3)o 2n - 1 xi\\v~ 1 *\z for arbitrary z _> 0, 46 PROOF. Let E - E (1) + E (2) +. . .+ E (n "^ where the E (k) are given by Lemma A. 2. Substituting the bound on m., of Lemma A.l into the bound xk (k) on the elements e . . we get ij ,-(k), -1 ,-(k), , -1 ,-(k), £.. < w u a.. + (2 + 3u)oj u a, . , 1 ij ' - ' ij ' kj which, in fact, is valid for all i,j. This implies E |e< k) |z. < 3d||n -1 A (k> |«|| for arbitrary z >^ 0. From Lemma A.l it immediately follows that || |D- 1 A (k) | Z || < 2 k - 1 a» 2k - 1 1||D" 1 A| Z ||; and so we h* HlD^E^IzIl! 3-2 krl a." 2n - 1 «|||D" 1 A|z||, from which the lemma follows. Q.E.D. LEMMA A. 4. We have lave and £. . < to d./d , ij i 3 - I „ 2i-l r .i-£-l ir , u . . < oo E 2 a, . , i > J, i < 3- Jl=l (i) PROOF. Follows from Lemma A.l because £.. = m. . and u.. = a.. . Q.E.D, ij ij ij ij LEMMA A. 5. The vector x computed by (4.4) satisfies (U + 5U)x = y for some upper triangular matrix 6U such that where 6u..| < g(i,i)oj |u..| and lu.. + 6u..| < r n-i+li to u ij g(i,j) = < n - j + 2, j > i + 2, n - j + 1, j = i + 1, 2, j = i £ n - 1, xj-, j - i = n, regardless of the pivoting strategy. 47 PROOF. From (4.4) we have that xi x (1+6 .)(l+6^) r v • . V,U and u nn 1 + X n 6' n y n Xv + u xx ) i,i+l i+l i,i+2 ±+2 J .+ u. x x ) xn n = y. i < n - 1 where 6. and 6! are relative roundoff errors due to subtraction and division, x x respectively. By the usual type of analysis we can obtain the bounds r 6u ij 1 . n-j+2 , I | (u) J - l)|u |, n-j+1 | | (u> J - 1) |u. . | , (u) - 1) U 1J (O) - 1) u. . , V. 1J j 1 i + 2, j = i + 1, j = i < n - 1, j = i = n, from which the lemma follows. Q.E.D. LEMMA A. 6. The matrix 6U of Lemma A. 5 satisfies |||d _1 l6u|z|| < f5.2 n " 2 - 21 m^Hd^aUII for arbitrary z _> 0. PROOF. From Lemma A. 4 it follows that (|d~ 1 L6u|z). < Z E IdT 1 A., d, I |6u, . |z. x — . . x xk k ' kv i J k < (jo E E 6u, , z . . " J k kj J Applying Lemma A. 5 first and then Lemma A. 4 gives (|D _1 L6u|z). < u E E g(k,j)w n ~ k+1 |u,.|z. j k j 1 j k n ~^u I a 9 ' U:\ ) I > i > J. 1 ij 1 - ' ij jj ' - regardless of the pivoting strategy. PROOF. We have from (4.3) that (...U u * y t + l a * y 2 )...+ t * T^ltij-- b., i » 2 where 6 . is a relative roundoff error. By the usual type of analysis we £,et that -1 v i-l 6 Z ±1 \ < ((1 + a) u) - 1)U ±1 , i 1 2, -1 vi-j+1 + 0) u) J and 62-.. < ((1 + a) u) J - l)\i..\ y i > j > 2, ij ' - ' il ' - 6fc. . I < ult. . n 1 — ' li The lemma follows from the inequalities (1 + of u) - 1 <_ u)~ u k(l + u u) . k-2 < koj u and £..| < min {u, (/ j } I a9 ) /a^ * I . Q.E.D, ij 1 - iJ JJ 49 LEMMA A. 8. The matrices 6U and 6L of Lemmas A. 5 and A. 7 satisfy HlD^LCU + 6U)|z|| < (2 n+1 - n - 3)a) 2n ^| | D _1 a| z \\ for arbitrary z _> 0. PROOF. Using the bound of Lemma A. 7 and the row pivoting inequality (4.5), we get |6£..| < min{n - j + 1, n - l}a) _J u I d . /d . I , i > j. Hence •> J-XT /TT _L XTT\ I _\ ^ .. T ^ ,' ^ / „ -I -L 1 „ 1 \, , D 3 / I W^ (|D <5L(U + 6U)|z). <_ u E min{n - j + 1, n - l}co J (|D (U + 6U) | z) . j J It immediately follows from Lemma A. 5 that (|D _1 (U + 6U)|z). < a) n " :i+1 (|D" 1 u|z). and from Lemma A. 4 that (|D _1 (U + 6U)|z). < 03 n+j 2 j_1 ]||D" 1 A|z||. n _1 xt /« ^ x„nI^ - .. 2n .. v _,•_/_ 4 j. i „ inJ-^ll^ -1 Therefore, (|D -J -6L(U + 6U)|z) . £ c/ n u E min{n - j + 1, n - 1}2 J_I )| |d~ X a|z||, 1_ 3=1 from which the lemma follows. Q.E.D. THEOREM 4.1. 
Let the vector & be computed by Gaussian elimination with row pivoting and row scaling where D = diag(d , d ,..., d ) is the matrix of reciprocal scale factors. Then there exists A A such that (A + AA)x = b with II [D~ 1 AA| Z If < x(n)^||D~ 1 A|z|| for arbitrary z _> and arbitrary e satisfying < |e| < e where x(n) = r n-2 , 2nu 1 19* 2 - n - 8 le PROOF. The restriction on e implies that 5(II(r» )) = K (11(D)), and from Lemmas A. 3, A. 6, and A. 8 we have the bounds 50 -1 ,n-l 2n-l .-1, D; A e|z||< (3.2"- 1 - 3)^-^110^1 z||, D^LfiuHl < [5.2 n ~ 2 - 21a, 2n u|||D- 1 A|z||, and || |D X 6L(U + 6U) |z|| < (2 n+1 - n - 3)u) 2r \J| |d "'"A | z||. The theorem follows from the equation (A + E + 6LU + (L + 6L)6U)x = b. Q.E.D. THEOREM 4.2. There exists a problem Ax = b and a floating point arithmetic <+, -, x, J> suoh that the solution x computed by Gaussian elimination with partial or complete pivoting satisfies (A + AA)x = b only for those matrices AA for which AA > [19-2 n 2 - n - 81u + 0(u 2 ) Therefore, the bound of Theorem 4.1 is the best possible bound in the limit u •* 0. PROOF. Obvious for n = 1. Assume n > 2. Let and A = M -M M -M -M o : M 1 -M 1 1 + (7.2 k X - k - 8)u, ( 1 + (2 n+1 - n - 7)u, 1 + (19.2 n_2 - n - 8)u, k <_ n - 2, k = n - 1, k = n. If M is large enough, then there are no interchanges even with complete pivoting. We have m. . = m = M/(-M), i > j , a (1) - 1, in (k) . (k-1) , (k-1) a. = a . - m x a, _ , in in k-l,n, i > k > 2. 51 Suppose that all these floating point operations increase the magnitude of the result by a factor (1 + 2u)/(l + u) . Then and m = -(1 + u) + 0(u ) 00 - tt + u) a^'-H (1 + 3U) a/":" + 0(u 2 ). in in By induction on k it follows that k-l,n and hence We have a (k) =2 kl + 2 k (k _ 1)u + / u 2j i > k, in — u. = 2 k X + 2 k (k-l)u + 0(u 2 ) kn where Then and y l = b l y k = \" S k-l' k ^ 2 ' S k = ( . . . (m * y;L + m * y 2 ) . . . + m * y fc ) . S = m * b S. = S. . + m * (b. - S. ,)» 2 < k < n - 1. k k-1 k k-1 — — Suppose that all these floating point operations reduce the magnitude of the result by a factor 1/(1 + u) . Hence S 1 = -b 1 + 0(u ) and S = (2 - 3u)S, 1 - (1 - 2u)b + 0(u ), 2 <_ k <_ n - 1. By induction on k it follows that k k+1 9 S k = 1 - 2 - 2 (k - 4)u - (k + 9)u + 0(u ), k < n - 2 and S , = 1 - 2 n 1 - 2 n 2 (4n - 19)u - (n + 8)u + 0(u 2 ) n— l 52 Also we have y = 1 - 2u, y k = (b k " S k-l )(1 " U + 0(u2)) ' k 1 2 - From this we get y k = 2 k_1 + 2 k (k - 2)u + 0(u 2 ), k < n - 2, y - = 2 n " 2 + 2 n ~ 2 (2n - 5)u + 0(u 2 ), y = 2 n_1 + 2 n_1 (2n - l)u + 0(u 2 ). We have x = y /u , n n nn Vl = ||AAx|| = ||b - Ax|| = (19-2 n_2 - n - 8)u + 0(u 2 ). Q.E.D. 53 Appendix B. Error Bounds for Column Pivoting For any matrix C = (c..) let c. = c..d.. Also, let to = 1 + u, LEMMA B.l. We have and -u^li-l^l. i>k.i>k. ,-(k), „ k+1 k v 1 , ,k-l-£i- | k-1,- , , 4 v , a.. k . 13 - £=1 '• i*' 1J PROOF. Equation (4.1) implies I 00, i I (k) (k) J , (k) , • m ik a kj d jl 1 "l a ik a kj d j /a kk I' 1 > k, J > k. and because of (A. 5) we get Im^a^d.l < wlafj^d. I, i > k, j > k, 1 lk kj J ' — ' lk k ' J — which proves the first inequality. Equation (4.2) implies l (k+1) | i (k) | ,. ' N I (k) I a.. < wa.. + (1 + 2u) m..a. . , i,j > k + 1, 1 ij ' - » ij ' ' lk kj " and therefore |af k+1) | k + 1. 1 ij '- ' ij ' /] lk " - The second inequality of the theorem follows from this by induction on k. Q.E.D, LEMMA B.2. The matrices L and U satisfy LU = A + E with |ED|e < L7-2 n " 2 - n-2J co 2n U |AD|e. PROOF. Assume n > 2. Let E = E^ 1 ' + E (2 ^ +. . 
.+ E^ n_1 ' where (k) the E are given by Lemma A. 2. Substituting the bound on m., of Lemma B.l (k) into the bound on the elements e . . we get ij r i-(k) | _ i-(k) I , , , ^ i u|a_ | + 2a)u|a. | for i,j > k, for i > j = k, otherwise. 54 This implies Z|ef k) | < (2n - 2k+l)um|if k) | + " & |a (k) |, 1 > k, « ij ' xk ' . , , ' ii ' J J J=k+1 J and from Lemma B.l it follows that l|e< k) | < (2n - 2k + l). 2k u{^ Z^ 1 "*^ + \ij) f .. 2k-i k : 1 _k-i-£ i - i k-i J i- i + (n-k)io u E 2 a.. + oj u E a.. £=1 " j-k+1 ^ r(2n-l)w 2 u E la. . I if k = 1, < < i 1J L(3n - 3k+l)o) u 2 E | a. . | if k > 2. j 1J The lemma follows because n-1 _ (2n-l) + Z (3n - 3k+l)2 = 7-2 n - n-2. Q.E.D. k=2 LEMMA B.3. We have \l . . u..| < ulaH I i 1 ij ij - ij IJ A=l 1J6 and u . . < u. . I > i< i. i iji - i ii" - PROOF. The first two inequalities follow immediately from Lemma B.l, and the third inequality is a consequence of column pivoting. Q.E.D, LEMMA B.4. The matrix SU of Lemma A. 5 satisfies |L 6U D|e < (2 n+1 - n-2)a) 2n u|AD|e. PROOF. Applying the inequality |u..| <_ |u..| of Lemma B.3 and the bounds of Lemma A. 5, we get for i < n 55 n Z 1 6u . . I < u Z g(i,i)oj lu.J Therefore (n-i+2) (n-i+1) n-i .- < ~ 0) u u. . — 2 ' li ( L5UD e) . = Z £. . S 6u._ 3 k n "" j laf^/af^ I (a-j+l)a> n ' j+1 |u J1 A , - j ' 13 33 ' { 33* and so using Lemma B.3 gives 56 Appendix C. Error Bounds for Iterative Improvement LEMMA C.l. The computed residual satisfies r = b - Ax + c m mm with where c o° + ((f^ 1 - u) limjz || + ^tw||| A || x ||| } 1 - vy " m" i - vy ' m-x» and the lemma follows from Lemma C.5. Q.E.D. THEOREM C.7. Assume |a||x| > 0. Then (1 + uy) (uv + w)o w(g - 1) 1-t "l-u lim n < — m — ., (1 + u) (uv + w)ya ni-xx> l - u - z ' — 1-t provided that (1 + u) (uv + w)ya < (1 - u) (1 - t) where a = o R (A, x) PROOF. From Lemmas C.5 and C.6 we have T~r~ ii ii (uv + w)a I . i i I lim z e < • i — z A x " m" — 1-t 'iii m-x» and -TT- I I r (uv + w)a w(a - l) -. I . I I i lim z < {-z—z »=■ -} Ax. 1 m' — 1-T l-u '!'■ m+oo Hence using Lemma C.3, 59 LEMMA C.2. Define z = A(x + d - x) . Then m mm I z e where v = x( n ) u °- n d y = Cond(A ). PROOF. The correction term d m = (A + F )~ r where F is the AA m mm m of Theorem 4.1. We have z = A(x -x)+r - (I + F A ) FA~r mm m m mm = c - (I + F A -1 )" 1 F (x - x - A _1 c ). m m m m m It follows from the bounds of Theorem 4.1 that -1 z < |c I + (1 - vy) v (|| |a| |x - x| II +yI|c ||)e. m — ' m' " ' ' ' m ' " " m" Substituting the bounds of Lemma C.l into this proves the lemma. Q.E.D. LEMMA C.3. We have and A(x , , -x) < z + ul A| | x + uy z e m+1 ' — ' m' i ii i "m" A x ., - x < u A x + (1 + u)y z e. i i i jn-i-i i _ i i i i " m" PROOF. The new iterate x = x + d + g m+1 m m m where |g <_ u|x + d |. Equivalently m" — m x = x + A z +g m+1 m m where |g and tt— i . / n i r (1 + uy) (uv + wr)a w(a - 1) . • ■, i . i i i lim A(x - x) < {- "-* ^ + u} Ax 1 m ' — 1-t 1-u i i i i < {u+ (1 + u)(uv+w)ya } | A | — 1-T ' ' lxm Ax -x| <{u+ ; L — } |A| |x| . 1 ' ' m nv-*» The theorem follows from and thus So |A(x - x)| |A(x - x)| m ' m H = max I . I I 1 < max i-tt-i — i r-rT-i T • Q.E.D. m Ax — Ax - Ax - x i i i m i i i i i i i i m i LEMMA C.8. Vie have l l l a l l l . r wvY , v ( u + vy+wy)-,|||.|| in z. < wAx + {- — + — ^ ' ^r - ^} Axe. 1 l 1 — ' ' ' ' 1 - vy /n N 2 ' (1 - vy) PROOF. The first iterate x = (A + AA)~ Ax, x - x = -A 1 (I + AA A 1 ) "'"AAx. 
LEMMA C.8. We have

    |z_1| ≤ w |A| |x̄_1| + { wvγ/(1 - vγ) + v(u + vγ + wγ)/(1 - vγ)² } || |A| |x| || e.

PROOF. The first iterate x̄_1 = (A + ΔA)^{-1} A x, so that

    x̄_1 - x = -A^{-1} (I + ΔA A^{-1})^{-1} ΔA x,

and

    ||A(x̄_1 - x)|| ≤ (1 - vγ)^{-1} v || |A| |x| ||,
    || |A| |x̄_1 - x| || ≤ (1 - vγ)^{-1} v γ || |A| |x| ||.

From Lemma C.2 we have

    |z_1| ≤ w |A| |x̄_1| + (1 - vγ)^{-1} { (v + w) || |A| |x̄_1 - x| || + u ||A(x̄_1 - x)|| + vwγ || |A| |x| || } e.

The lemma follows by substituting the previous two inequalities into this. Q.E.D.

THEOREM C.9. Assume that |A| |x| > 0. Then

    η_1 ≤ [ w + u + (v + u) w γ α/(1 - vγ) + (u + vγ + wγ)(1 + uγ) v α/(1 - vγ)² ] / [ 1 - u - (w + uv + v²γ)(1 + u) γ α/(1 - vγ)² ],

provided that the denominator is positive.
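The hypotheses of Theorems C.7 and C.9 involve γ = cond(A) and α = σ_R(A, x). Both are easily computed for small examples; in the sketch below the function names are mine, and the reading of σ_R(A, x) as the row ill-scaling ratio max_i (|A| |x|)_i / min_i (|A| |x|)_i is an assumption suggested by its use alongside cond(A).

    import numpy as np

    def skeel_cond(A):
        """gamma = cond(A) = || |A^{-1}| |A| || in the max norm."""
        return np.linalg.norm(np.abs(np.linalg.inv(A)) @ np.abs(A), ord=np.inf)

    def row_scaling_ratio(A, x):
        """Assumed reading of sigma_R(A, x): max_i (|A| |x|)_i / min_i (|A| |x|)_i,
        defined whenever |A| |x| > 0 as in Theorems C.7 and C.9."""
        s = np.abs(A) @ np.abs(x)
        return s.max() / s.min()

    A = np.array([[1.0e0, 2.0e0],
                  [3.0e-8, 4.0e-8]])   # a small, badly row-scaled matrix
    x = np.ones(2)
    print("cond(A) =", skeel_cond(A))
    print("sigma_R =", row_scaling_ratio(A, x))

Unlike the ordinary condition number, cond(A) = || |A^{-1}| |A| || is invariant under row scaling of A, which is why it is the natural measure of conditioning for the componentwise perturbations considered here.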
BIBLIOGRAPHY

BAUER, F. L. Genauigkeitsfragen bei der Lösung linearer Gleichungssysteme. ZAMM 46, 1 (November 1966), 409-421.

BAUER, F. L. Computational graphs and rounding error. SIAM J. Numer. Anal. 11, 1 (March 1974), 87-96.

FORSYTHE, G. E., AND MOLER, C. B. Computer Solution of Linear Algebraic Systems. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1967.

GEAR, C. W. Numerical errors in sparse linear equations. File F-75-885, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, Illinois, May 1975.

HAMMING, R. W. Introduction to Applied Numerical Analysis. McGraw-Hill, New York, 1971.

KAHAN, W. Numerical linear algebra. Canad. Math. Bull. 9 (1966), 757-801.

MILLER, W. On the stability of finite numerical procedures. Numer. Math. 19 (1972), 425-432.

MILLER, W. Computer search for numerical instability. J. ACM 22, 4 (October 1975), 512-521.

MILLER, W. Roundoff analysis by direct comparison of two algorithms. SIAM J. Numer. Anal. 13, 3 (June 1976), 382-392.

SHERMAN, A. H. Algorithms for sparse Gaussian elimination with partial pivoting. Rpt. R-76-817, Dept. of Comput. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, Illinois, July 1976.

STEWART, G. W. Introduction to Matrix Computations. Academic Press, New York, 1973.

VAN DER SLUIS, A. Stability of solutions of linear algebraic systems. Numer. Math. 14 (1970), 246-251.

VAN DER SLUIS, A. Condition, equilibration, and pivoting in linear algebraic systems. Numer. Math. 15 (1970), 74-86.

WILKINSON, J. H. Rounding Errors in Algebraic Processes. Prentice-Hall, Inc., Englewood Cliffs, N.J., 1963.

ABSTRACT

Roundoff error in the solution of linear algebraic systems is studied using a more realistic notion of what it means to perturb a problem, namely that each datum is subject to a relatively small change. The condition number is determined for this approach. A good computable error bound is given for the "backward error." The effect of scaling on the stability of Gaussian elimination is studied, and it is discovered that the proper way to scale a system depends on knowing the solution. Finally it is shown that Gaussian elimination can be stabilized by doing iterative improvement.

Key words: numerical stability, Gaussian elimination, iterative improvement, ill conditioning, scaling, equilibration, pivoting.