BEBR FACULTY WORKING PAPER NO. 90-1716

College of Commerce and Business Administration
University of Illinois at Urbana-Champaign
December 1990

Convergence of Learning Algorithms with Constant Learning Rates

C.-M. Kuan
Department of Economics
University of Illinois at Urbana-Champaign

K. Hornik
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien, Vienna, Austria

The first author is grateful for support by the Investors in Business Education and the Bureau of Economic and Business Research of the University of Illinois.

Abstract

We investigate the behavior of neural network learning algorithms with a small, constant learning rate $\epsilon$ in stationary, random input environments. It is rigorously established that the sequence of weight estimates can be approximated by a certain ordinary differential equation, in the sense of weak convergence of random processes as $\epsilon$ tends to zero. As applications, back-propagation in feedforward architectures and some feature extraction algorithms are studied in more detail.

1 Introduction

For understanding the performance of neural network learning algorithms, it is of fundamental importance to investigate how they behave in stationary random input environments. This analysis yields information about the asymptotic properties of the learned connection weights as the number of training samples increases without bound. Thus far, only algorithms with learning rates tending to zero have been studied. However, most neural network learning is conducted using a small, constant learning rate. In this paper, we investigate the limiting behavior of such algorithms.

An on-line (local) learning algorithm can be written as
$$\theta_{n+1} = \theta_n + \eta_n Q(z_n, \theta_n), \qquad (1)$$
where $\theta$ is the $k$-dimensional vector of network weights to be learned and its current estimate at time $n$ is denoted by $\theta_n$, $z_n$ is the training pattern presented at time $n$, $\eta_n$ is the learning rate employed at time $n$, and $Q(\cdot,\cdot)$ is a suitable function characteristic of the algorithm.
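To make the recursion (1) concrete, the following minimal sketch iterates it with a constant learning rate $\eta_n = \epsilon$. The particular choice of $Q$ (a least-mean-squares rule), the linear data model, and all names in the code are illustrative assumptions for this demo, not constructs taken from the paper.

```python
import numpy as np

# Sketch of the generic on-line update (1),
#   theta_{n+1} = theta_n + eta_n * Q(z_n, theta_n),
# with constant rate eta_n = eps and, purely for illustration, a
# least-mean-squares Q: for a pattern z = (x, y),
#   Q(z, theta) = (y - x @ theta) * x,
# the negative gradient of the squared prediction error.
# (Assumed example; the paper's Q is left abstract at this point.)

def Q(z, theta):
    x, y = z
    return (y - x @ theta) * x

def run(theta0, patterns, eps):
    """Iterate (1) with eta_n = eps over the pattern sequence."""
    theta = theta0.copy()
    path = [theta.copy()]
    for z in patterns:
        theta = theta + eps * Q(z, theta)
        path.append(theta.copy())
    return np.array(path)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 3
    theta_star = rng.normal(size=k)        # "true" weights (assumed model)
    X = rng.normal(size=(5000, k))         # i.i.d., hence stationary, inputs
    y = X @ theta_star + 0.1 * rng.normal(size=len(X))
    path = run(np.zeros(k), zip(X, y), eps=0.01)
    print(path[-1], theta_star)            # estimate hovers near theta_star
```

With a constant $\epsilon$ the iterate does not converge pointwise; it keeps fluctuating around $\theta^\ast$ with variance of order $\epsilon$, which is exactly why the paper characterizes the limit as a process-level (weak convergence) statement rather than almost sure convergence.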
The key tool in the analysis of the sequence $\{\theta_n\}$ is the so-called interpolated process $(\theta(t),\, t \ge 0)$, usually defined by $\theta(t) = \theta_n$ for $t_n \le t < t_{n+1}$, where $t_n = \sum_{m < n} \eta_m$. When the learning rate is held constant at $\epsilon$, the interpolated process $(\theta^\epsilon(t),\, t \ge 0)$ is given by $\theta^\epsilon(t) = \theta^\epsilon_n$ for $n\epsilon \le t < (n+1)\epsilon$.

Let $\{\xi^\epsilon,\, \epsilon > 0\}$ be a family of random elements with values in some metric space $(X, \rho)$. We say that $\xi^\epsilon$ converges weakly to $\xi^0$, symbolically $\xi^\epsilon \Rightarrow \xi^0$, if $\lim_{\epsilon \to 0} E f(\xi^\epsilon) = E f(\xi^0)$ for all bounded, continuous real functions $f$ on $X$. Weak convergence is an extension of the familiar concept of convergence in distribution of sequences of $\mathbb{R}^k$-valued random variables to families of abstract-valued random elements; as a basic reference, we recommend Billingsley (1968). If $h$ is a continuous function from $X$ to $\mathbb{R}$ and $\xi^\epsilon \Rightarrow \xi^0$, then $h(\xi^\epsilon) \to_D h(\xi^0)$, where "$\to_D$" denotes convergence in distribution (Billingsley, 1968, page 29).

In our case, we shall regard the interpolated processes $\theta^\epsilon(\cdot)$ as random processes with values in $X = D^k[0, T_\infty)$, the space of all functions from $[0, T_\infty)$ to $\mathbb{R}^k$ which are right continuous with left-hand limits at every $0 < t < T_\infty$. Here, $T_\infty$ is the supremum over all $T$ such that the limiting ODE has a unique solution on $[0, T]$ with probability one; in particular, if it has a unique global solution on $[0, \infty)$ for every initial condition, then $T_\infty = \infty$. As a metric on $X$ we shall use
$$\rho(\xi, \zeta) = \sum_{m=1}^{\infty} 2^{-m} \min\Big(1,\, \sup_{0 \le t \le T_m} |\xi(t) - \zeta(t)|\Big), \qquad \xi, \zeta \in X,$$
where $\{T_m\}$ is an increasing sequence with $\lim_{m \to \infty} T_m = T_\infty$, so that $X$ is given the topology of uniform convergence on bounded subintervals of $[0, T_\infty)$.

In order to obtain the ODE limit of $\theta^\epsilon(\cdot)$, we must be able to average out the randomness resulting from the pattern sequence $\{z_n\}$, so that the limiting process (as $\epsilon$ tends to zero) eventually follows the "mean" behavior. This is the basic idea of the so-called direct averaging method of Kushner (1984, chapter 5), cf. also Kushner & Shwartz (1984). We assume the following conditions.

[A 1] $\{z_n\}$ is a (strictly) stationary and ergodic sequence of random vectors.

[A 2] For each $N$, there exists a function $L_N(z)$ such that $L_N(z_n)$ is integrable and
$$\sup_{|\theta|, |\bar\theta| \le N} |Q(z, \theta) - Q(z, \bar\theta)| \le L_N(z)\, |\theta - \bar\theta|. \qquad (5)$$

[A 3] For each $\theta$, $Q(z_n, \theta)$ is square integrable with expectation $\bar{Q}(\theta) := E\, Q(z_n, \theta)$.

These conditions are not the weakest possible, but they can easily be verified or interpreted. The stationarity and ergodicity assumption [A 1] applies when we have time series data; in particular, it is satisfied when the training patterns are independent, identically distributed random variables. [A 2] is a Lipschitz-type smoothness condition on the function $Q$, which is satisfied in many interesting neural network applications, as shown in later examples. [A 3] defines the corresponding ODE. Of course, conditions [A 2] and [A 3] are met if the patterns are bounded and $Q$ is continuously differentiable in both arguments.

The following theorem is based on theorem 5.1 in Kushner (1984); its proof is given in the appendix.

Theorem. Assume [A 1] to [A 3] and let $\theta^\epsilon_0 = \theta_0$, a fixed vector or a random vector independent of $\epsilon$. Then
$$\theta^\epsilon(\cdot) \Rightarrow \bar\theta(\cdot), \qquad (6)$$
where $\bar\theta(\cdot)$ is the solution of the ODE
$$\dot{\bar\theta}(t) = \bar{Q}(\bar\theta(t)), \qquad \bar\theta(0) = \theta_0.$$
In particular, if $0 < t_1 < \cdots$
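The theorem can be probed numerically: at a fixed horizon $T$, run the stochastic iterate for $\lfloor T/\epsilon \rfloor$ steps and compare $\theta^\epsilon(T)$ with an Euler solution of the limiting ODE $\dot{\bar\theta} = \bar{Q}(\bar\theta)$; the gap should shrink as $\epsilon \to 0$. The sketch below again assumes the illustrative LMS form of $Q$ with standard Gaussian inputs, for which $\bar{Q}(\theta) = E[x x^\top](\theta^\ast - \theta) = \theta^\ast - \theta$; none of these modeling choices come from the paper.

```python
import numpy as np

# Hedged illustration of the theorem (6): for the assumed LMS instance
# of Q with x ~ N(0, I), Qbar(theta) = E[Q(z_n, theta)] = theta* - theta,
# so the limiting ODE is  dtheta/dt = theta* - theta.  We compare the
# interpolated process theta^eps(T) with an Euler solve of the ODE on
# [0, T] for shrinking eps.

rng = np.random.default_rng(1)
k, T = 3, 5.0
theta_star = rng.normal(size=k)            # assumed "true" weights

def Qbar(theta):
    # With x ~ N(0, I_k), E[x x'] = I_k, hence Qbar(theta) = theta* - theta.
    return theta_star - theta

for eps in (0.1, 0.01, 0.001):
    n_steps = int(T / eps)
    theta = np.zeros(k)                    # theta^eps_0 = theta_0
    for _ in range(n_steps):               # stochastic iterate of (1)
        x = rng.normal(size=k)
        y = x @ theta_star + 0.1 * rng.normal()
        theta = theta + eps * (y - x @ theta) * x
    ode = np.zeros(k)                      # Euler solve of the limiting ODE
    for _ in range(n_steps):
        ode = ode + eps * Qbar(ode)
    print(f"eps={eps:6.3f}  |theta^eps(T) - thetabar(T)| = "
          f"{np.linalg.norm(theta - ode):.4f}")
```

Because the convergence in (6) is weak convergence of processes, the printed gap shrinks in probability, not surely, as $\epsilon \to 0$; a single run for each $\epsilon$ already makes the trend visible.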