BEBR FACULTY WORKING PAPER NO. 90-1716

College of Commerce and Business Administration
University of Illinois at Urbana-Champaign
December 1990

Convergence of Learning Algorithms with Constant Learning Rates

C.-M. Kuan
Department of Economics
University of Illinois at Urbana-Champaign

K. Hornik
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien, Vienna, Austria

The first author is grateful for support by the Investors in Business Education and the Bureau of Economic and Business Research of the University of Illinois.

Abstract

We investigate the behavior of neural network learning algorithms with a small, constant learning rate $\epsilon$ in stationary, random input environments. It is rigorously established that the sequence of weight estimates can be approximated by a certain ordinary differential equation, in the sense of weak convergence of random processes as $\epsilon$ tends to zero. As applications, back-propagation in feedforward architectures and some feature extraction algorithms are studied in more detail.

1 Introduction

For understanding the performance of neural network learning algorithms, it is of fundamental importance to investigate how they behave in stationary random input environments. This analysis yields information about the asymptotic properties of the learned connection weights as the number of training samples increases without bound. Thus far, only algorithms with learning rates tending to zero have been studied. However, most neural network learning is conducted using a small, constant learning rate. In this paper, we investigate the limiting behavior of such algorithms.

An on-line (local) learning algorithm can be written as
$$\theta_{n+1} = \theta_n + \eta_n Q(z_n, \theta_n), \qquad (1)$$
where $\theta$ is the $k$-dimensional vector of network weights to be learned and its current estimate at time $n$ is denoted by $\theta_n$, $z_n$ is the training pattern presented at time $n$, $\eta_n$ is the learning rate employed at time $n$, and $Q(\cdot,\cdot)$ is a suitable function characteristic of the algorithm.
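To make the recursion (1) concrete, the following minimal sketch iterates it with a constant learning rate $\eta_n = \epsilon$. The particular choice of $Q$ (a least-mean-squares rule), the linear data model, and all names in the code are illustrative assumptions for this demo, not constructs taken from the paper.

```python
import numpy as np

# Sketch of the generic on-line update (1),
#   theta_{n+1} = theta_n + eta_n * Q(z_n, theta_n),
# with constant rate eta_n = eps and, purely for illustration, a
# least-mean-squares Q: for a pattern z = (x, y),
#   Q(z, theta) = (y - x @ theta) * x,
# the negative gradient of the squared prediction error.
# (Assumed example; the paper's Q is left abstract at this point.)

def Q(z, theta):
    x, y = z
    return (y - x @ theta) * x

def run(theta0, patterns, eps):
    """Iterate (1) with eta_n = eps over the pattern sequence."""
    theta = theta0.copy()
    path = [theta.copy()]
    for z in patterns:
        theta = theta + eps * Q(z, theta)
        path.append(theta.copy())
    return np.array(path)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k = 3
    theta_star = rng.normal(size=k)        # "true" weights (assumed model)
    X = rng.normal(size=(5000, k))         # i.i.d., hence stationary, inputs
    y = X @ theta_star + 0.1 * rng.normal(size=len(X))
    path = run(np.zeros(k), zip(X, y), eps=0.01)
    print(path[-1], theta_star)            # estimate hovers near theta_star
```

With a constant $\epsilon$ the iterate does not converge pointwise; it keeps fluctuating around $\theta^\ast$ with variance of order $\epsilon$, which is exactly why the paper characterizes the limit as a process-level (weak convergence) statement rather than almost sure convergence.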
The key tool in the analysis of the sequence $\{\theta_n\}$ is the so-called interpolated process $(\theta(t),\, t \ge 0)$, usually defined by $\theta(t) = \theta_n$ for $t_n \le t < t_{n+1}$, where $t_n = \sum_{m < n} \eta_m$. When the learning rate is held constant at $\epsilon$, the interpolated process $(\theta^\epsilon(t),\, t \ge 0)$ is given by $\theta^\epsilon(t) = \theta^\epsilon_n$ for $n\epsilon \le t < (n+1)\epsilon$.

Let $\{\xi^\epsilon,\, \epsilon > 0\}$ be a family of random elements with values in some metric space $(X, \rho)$. We say that $\xi^\epsilon$ converges weakly to $\xi^0$, symbolically $\xi^\epsilon \Rightarrow \xi^0$, if $\lim_{\epsilon \to 0} E f(\xi^\epsilon) = E f(\xi^0)$ for all bounded, continuous real functions $f$ on $X$. Weak convergence is an extension of the familiar concept of convergence in distribution of sequences of $\mathbb{R}^k$-valued random variables to families of abstract-valued random elements; as a basic reference, we recommend Billingsley (1968). If $h$ is a continuous function from $X$ to $\mathbb{R}$ and $\xi^\epsilon \Rightarrow \xi^0$, then $h(\xi^\epsilon) \to_D h(\xi^0)$, where "$\to_D$" denotes convergence in distribution (Billingsley, 1968, page 29).

In our case, we shall regard the interpolated processes $\theta^\epsilon(\cdot)$ as random processes with values in $X = D^k[0, T_\infty)$, the space of all functions from $[0, T_\infty)$ to $\mathbb{R}^k$ which are right continuous with left-hand limits at every $0 < t < T_\infty$. Here, $T_\infty$ is the supremum over all $T$ such that the limiting ODE has a unique solution on $[0, T]$ with probability one; in particular, if it has a unique global solution on $[0, \infty)$ for every initial condition, then $T_\infty = \infty$. As a metric on $X$ we shall use
$$\rho(\xi, \zeta) = \sum_{m=1}^{\infty} 2^{-m} \min\Big(1,\, \sup_{0 \le t \le T_m} |\xi(t) - \zeta(t)|\Big), \qquad \xi, \zeta \in X,$$
where $\{T_m\}$ is an increasing sequence with $\lim_{m \to \infty} T_m = T_\infty$, so that $X$ is given the topology of uniform convergence on bounded subintervals of $[0, T_\infty)$.

In order to obtain the ODE limit of $\theta^\epsilon(\cdot)$, we must be able to average out the randomness resulting from the pattern sequence $\{z_n\}$, so that the limiting process (as $\epsilon$ tends to zero) eventually follows the "mean" behavior. This is the basic idea of the so-called direct averaging method of Kushner (1984, chapter 5), cf. also Kushner & Shwartz (1984). We assume the following conditions.

[A 1] $\{z_n\}$ is a (strictly) stationary and ergodic sequence of random vectors.

[A 2] For each $N$, there exists a function $L_N(z)$ such that $L_N(z_n)$ is integrable and
$$\sup_{|\theta|, |\bar\theta| \le N} |Q(z, \theta) - Q(z, \bar\theta)| \le L_N(z)\, |\theta - \bar\theta|. \qquad (5)$$

[A 3] For each $\theta$, $Q(z_n, \theta)$ is square integrable with expectation $\bar{Q}(\theta) := E\, Q(z_n, \theta)$.

These conditions are not the weakest possible, but they can easily be verified or interpreted. The stationarity and ergodicity assumption [A 1] applies when we have time series data; in particular, it is satisfied when the training patterns are independent, identically distributed random variables. [A 2] is a Lipschitz-type smoothness condition on the function $Q$, which is satisfied in many interesting neural network applications, as shown in later examples. [A 3] defines the corresponding ODE. Of course, conditions [A 2] and [A 3] are met if the patterns are bounded and $Q$ is continuously differentiable in both arguments.

The following theorem is based on theorem 5.1 in Kushner (1984); its proof is given in the appendix.

Theorem. Assume [A 1] to [A 3] and let $\theta^\epsilon_0 = \theta_0$, a fixed vector or a random vector independent of $\epsilon$. Then
$$\theta^\epsilon(\cdot) \Rightarrow \bar\theta(\cdot), \qquad (6)$$
where $\bar\theta(\cdot)$ is the solution of the ODE
$$\dot{\bar\theta}(t) = \bar{Q}(\bar\theta(t)), \qquad \bar\theta(0) = \theta_0.$$
In particular, if $0 < t_1 < \cdots$
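The theorem can be probed numerically: at a fixed horizon $T$, run the stochastic iterate for $\lfloor T/\epsilon \rfloor$ steps and compare $\theta^\epsilon(T)$ with an Euler solution of the limiting ODE $\dot{\bar\theta} = \bar{Q}(\bar\theta)$; the gap should shrink as $\epsilon \to 0$. The sketch below again assumes the illustrative LMS form of $Q$ with standard Gaussian inputs, for which $\bar{Q}(\theta) = E[x x^\top](\theta^\ast - \theta) = \theta^\ast - \theta$; none of these modeling choices come from the paper.

```python
import numpy as np

# Hedged illustration of the theorem (6): for the assumed LMS instance
# of Q with x ~ N(0, I), Qbar(theta) = E[Q(z_n, theta)] = theta* - theta,
# so the limiting ODE is  dtheta/dt = theta* - theta.  We compare the
# interpolated process theta^eps(T) with an Euler solve of the ODE on
# [0, T] for shrinking eps.

rng = np.random.default_rng(1)
k, T = 3, 5.0
theta_star = rng.normal(size=k)            # assumed "true" weights

def Qbar(theta):
    # With x ~ N(0, I_k), E[x x'] = I_k, hence Qbar(theta) = theta* - theta.
    return theta_star - theta

for eps in (0.1, 0.01, 0.001):
    n_steps = int(T / eps)
    theta = np.zeros(k)                    # theta^eps_0 = theta_0
    for _ in range(n_steps):               # stochastic iterate of (1)
        x = rng.normal(size=k)
        y = x @ theta_star + 0.1 * rng.normal()
        theta = theta + eps * (y - x @ theta) * x
    ode = np.zeros(k)                      # Euler solve of the limiting ODE
    for _ in range(n_steps):
        ode = ode + eps * Qbar(ode)
    print(f"eps={eps:6.3f}  |theta^eps(T) - thetabar(T)| = "
          f"{np.linalg.norm(theta - ode):.4f}")
```

Because the convergence in (6) is weak convergence of processes, the printed gap shrinks in probability, not surely, as $\epsilon \to 0$; a single run for each $\epsilon$ already makes the trend visible.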