title: Implicit Gradient Regularization
authors: Barrett, David G.T.; Dherin, Benoit
date: 2020-09-23

Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test errors are small and solutions are robust to noisy parameter perturbations. Furthermore, we demonstrate that the implicit gradient regularization term can be used as an explicit regularizer, allowing us to control this gradient regularization directly. More broadly, our work indicates that backward error analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter regularization interact to determine the properties of overparameterized models optimized with gradient descent.

The loss surface of a deep neural network is a mountainous terrain: highly non-convex, with a multitude of peaks, plateaus and valleys [1, 2]. Gradient descent provides a path through this landscape, taking discrete steps in the direction of steepest descent. However, this simple strategy can be just as hazardous as it sounds. For small learning rates, our model is likely to get stuck at the local minimum closest to the starting point, which is unlikely to be the most desirable destination. For large learning rates, we run the risk of ricocheting between peaks and diverging. However, for moderate learning rates, gradient descent seems to move away from the closest local minima and toward flatter regions where test errors are often smaller [3-6]. This phenomenon becomes stronger for larger networks, which also tend to have a smaller test error [7-11]. In addition, models with low test errors are more robust to parameter perturbations [12]. Overall, these observations contribute to an emerging view that there is some form of implicit regularization in gradient descent, and several sources of implicit regularization have been identified, including early stopping [13], model initialization [14-19], model architecture [1, 20], stochasticity [3, 11, 21-29], implicit L2 regularity [11, 30-35], low-rank biases [7, 36, 37] and minimum RKHS norm [38-43], among other possibilities (see related work, Section 8). We have found a surprising form of implicit regularization hidden within the discrete numerical flow of gradient descent. Gradient descent iterates in discrete steps along the gradient of the loss, so after each step it actually steps off the exact continuous path that minimizes the loss at each point. Instead of following a trajectory down the steepest local gradient, gradient descent follows a shallower path. We show that this trajectory is closer to an exact path along a modified loss surface, which can be calculated using backward error analysis from numerical integration theory [44].
Our core idea is that the discrepancy between the original loss surface and this modified loss surface is a form of implicit regularization (Theorem 3.1). We begin by calculating the discrepancy between the modified loss and the original loss using backward error analysis and find that it is proportional to the second moment of the loss gradients, which we call Implicit Gradient Regularization (IGR). Using differential geometry, we show that IGR is also proportional to the square of the loss surface slope, indicating that it encourages optimization paths with shallower slopes and optima discovery in flatter regions of the loss surface. Next, we explore the properties of this regularization in deep neural networks such as MLPs trained to classify MNIST digits and ResNets trained to classify CIFAR-10 images, and in a tractable two-parameter model. In these cases, we verify that IGR effectively encourages models toward minima in the vicinity of small gradient values, in flatter regions with shallower slopes, and that these minima have low test error, consistent with previous observations [3-5]. We find that IGR can account for the observation that learning rate size is correlated with test accuracy and model robustness [7-12]. Finally, we demonstrate that IGR can be used as an explicit regularizer, allowing us to directly strengthen this regularization beyond the maximum possible implicit gradient regularization strength.

The general goal of gradient descent is to find a weight vector θ in parameter space R^m that minimizes a loss E(θ). Gradient descent proceeds by iteratively updating the model weights with learning rate h in the direction of the steepest loss gradient:

    θ_{n+1} = θ_n − h ∇_θ E(θ_n).   (1)

Now, even though gradient descent takes steps in the direction of the steepest loss gradient, it does not stay on the exact continuous path of the steepest loss gradient, because each iteration steps off the exact continuous path. Instead, we show that gradient descent follows a path that is closer to the exact continuous path given by θ̇ = −∇_θ Ẽ(θ), along a modified loss Ẽ(θ), which can be calculated analytically using backward error analysis (see Theorem 3.1 and Section 3 for the full analysis), yielding:

    Ẽ(θ) = E(θ) + λ R_IG(θ),   (2)

where

    λ ≡ h m / 4   (3)

and

    R_IG(θ) = (1/m) ||∇_θ E(θ)||^2 = (1/m) Σ_{i=1}^{m} (∇_{θ_i} E(θ))^2.   (4)

Immediately, we see that this modified loss is composed of the original training loss E(θ) and an additional term, which we interpret as a regularizer R_IG(θ) with regularization rate λ. We call R_IG(θ) the implicit gradient regularizer because it penalizes regions of the loss landscape that have large gradient values, and because it is implicit in gradient descent, rather than being explicitly added to our loss.
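To make Equations 2-4 concrete, the short sketch below computes R_IG(θ) and the modified loss Ẽ(θ) for an arbitrary differentiable loss using automatic differentiation. It is a minimal illustration in JAX, not the code used in our experiments; the quadratic toy loss, parameter values, and learning rate are placeholders.

    import jax
    import jax.numpy as jnp

    def implicit_regularizer(loss_fn, theta):
        # R_IG(theta) = (1/m) * ||grad_theta E(theta)||^2   (Equation 4)
        g = jax.grad(loss_fn)(theta)
        return jnp.sum(g ** 2) / theta.size

    def modified_loss(loss_fn, theta, h):
        # E-tilde(theta) = E(theta) + lambda * R_IG(theta), with lambda = h * m / 4   (Equations 2-3)
        lam = h * theta.size / 4.0
        return loss_fn(theta) + lam * implicit_regularizer(loss_fn, theta)

    def toy_loss(theta):
        # Arbitrary stand-in for a training loss E(theta).
        return 0.5 * jnp.sum(theta ** 2)

    theta = jnp.array([1.0, -2.0, 0.5])
    print(float(implicit_regularizer(toy_loss, theta)))   # here equals ||theta||^2 / m
    print(float(modified_loss(toy_loss, theta, h=0.1)))

Because the regularizer is built from the loss gradient, any framework with automatic differentiation can evaluate it with a single extra gradient computation.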
We can now make several predictions about IGR, which we will explore later in numerical experiments.

Prediction 2.1. IGR encourages smaller values of R_IG(θ) relative to the loss E(θ). Given Equation 2 and Theorem 3.1, we expect gradient descent to follow trajectories that have relatively small values of R_IG(θ). It is already well known that gradient descent converges by reducing the loss gradient, so it is important to note that this prediction describes the relative size of R_IG(θ) along the trajectory of gradient descent. To expose this phenomenon in experiments, great care must be taken when comparing different gradient descent trajectories. For instance, in our deep learning experiments, we compare models at the iteration time of maximum test accuracy (and we consider other controls in the appendix), which is an important time point for practical applications and is not trivially determined by the speed of learning (Figures 1, 2). Also, related to this, since the regularization rate λ is proportional to the learning rate h and network size m (Equation 3), we expect that larger models and larger learning rates will encourage smaller values of R_IG(θ) (Figure 2).

Prediction 2.2. IGR encourages the discovery of flatter optima. In Section 6, we show that R_IG(θ) is proportional to the square of the loss surface slope. Given this and Prediction 2.1, we expect that IGR will guide gradient descent to locally seek optimization paths with shallower loss surface slopes, thereby encouraging the discovery of flatter, broader optima. Of course, it is possible to construct loss surfaces at odds with this (such as a Mexican-hat loss surface, where all minima are equally flat). However, we will provide experimental support for this prediction using loss surfaces that are of widespread interest in deep learning, such as MLPs trained on MNIST and ResNets trained on CIFAR-10 (Figures 1, 2, 3).

Prediction 2.3. IGR encourages higher test accuracy. Given Prediction 2.2, we predict that IGR encourages higher test accuracy, since flatter minima are known empirically to coincide with higher test accuracy [3, 5, 7-11]. We confirm this experimentally (Figure 2) and we also demonstrate that the test accuracy can be increased further using explicit gradient regularization (Figure 3).

Prediction 2.4. IGR encourages the discovery of optima that are more robust to parameter perturbations. Given Prediction 2.2, we expect this, since models with parameters in the vicinity of flatter optima will be less sensitive to parameter variations (Figure 3).

There are several important observations to make about the properties of IGR: 1) It does not originate in any specific model architecture or initialization, although our analysis does provide a formula to explain the influence of these model properties through IGR; 2) Other sources of implicit regularization also have an impact on learning, alongside IGR, and the relative importance of these contributions will likely depend on model architecture and initialization; 3) In defining λ and R_IG we chose to set λ proportional to the number of parameters m. To support this choice, we demonstrate in experiments that the test accuracy is controlled by the IGR rate λ; 4) The modified loss and the original loss share the same global minima, so IGR vanishes when the gradient vanishes. Despite this, the presence of IGR has an impact on learning since it changes the trajectory of gradient descent, and in over-parameterized models this can cause the final parameters to reach different solutions; 5) Our theoretical results are derived for full-batch gradient descent, which allows us to isolate this source of implicit regularization from the stochasticity of stochastic gradient descent (SGD). Extending our theoretical results to SGD is considerably more complicated, and as such, is beyond the scope of this paper. However, in some of our experiments, we will demonstrate that IGR persists in SGD, which is especially important for deep learning. Next, we will provide a proof for Theorem 3.1, and we will provide experimental support for our predictions.
In this section, we show that gradient descent follows the gradient flow of the modified loss Ẽ (Equation 2) more closely than that of the original loss E. The argument is a standard argument from the backward error analysis of Runge-Kutta methods (see [44] for a detailed account). We begin by observing that gradient descent (Equation 1) can be interpreted as a Runge-Kutta method numerically integrating the following ODE:

    θ̇ = −∇_θ E(θ).   (5)

In the language of numerical analysis, gradient descent is the explicit Euler method numerically integrating the vector field f(θ) = −∇E(θ). The explicit Euler method is of order 1, which means that after one gradient descent step θ_n = θ_{n−1} − h∇E(θ_{n−1}), the deviation from the gradient flow, θ_n − θ(h), is of order O(h^2), where θ(t) is the solution of Equation 5 starting at θ_{n−1} and evaluated at time h. Backward error analysis was developed to deal with this discrepancy between the discrete steps of a Runge-Kutta method and the continuous exact solutions (or flow) of a differential equation. The main idea is to modify the vector field of the ODE θ̇ = f(θ) with corrections in powers of the step size,

    f̃(θ) = f(θ) + h f_1(θ) + h^2 f_2(θ) + · · · ,   (6)

so that the numerical steps θ_n approximating the original Equation 5 now lie exactly on the solutions of the modified equation θ̇ = f̃(θ). In other words, backward error analysis finds the corrections f_i in Equation 6 such that θ_n = θ̃(nh) for all n, where θ̃(t) is the solution of the modified equation starting at θ_0. In theory, we can now precisely study the flow of the modified equation to infer properties of the numerical method, because its steps follow the modified differential equation solutions perfectly. The following result is a direct application of backward error analysis to gradient descent:

Theorem 3.1. The first-order modified equation for gradient descent (Equation 1) is the gradient flow

    θ̇ = −∇_θ Ẽ(θ),   (7)

where Ẽ = E + λ R_IG is the modified loss introduced in Equation 2. Consider gradient flow with the modified loss, θ̇ = −∇Ẽ(θ), and its solution θ̃(t) starting at θ_{n−1}. Then the local error θ_n − θ̃(h) between θ̃(h) and one step of gradient descent, θ_n = θ_{n−1} − h∇E(θ_{n−1}), is of order O(h^3), while it is of order O(h^2) for gradient flow with the original loss. Finally, if E is positive, both the original loss E and the modified loss Ẽ share the same locus of zeros.

Proof. Using a standard result in backward error analysis, the first correction f_1 in the modified vector field (Equation 6) for the explicit Euler method is given by f_1(θ) = −f′(θ)f(θ)/2 for the case of a general differential equation θ̇ = f(θ) with vector field f (see Appendix A for a derivation). Applying this result to the case of the gradient flow differential equation (Equation 5), where the vector field is given by f(θ) = −∇_θ E(θ), we find that the first-order correction is

    f_1(θ) = −(1/2) ∇^2_θ E(θ) ∇_θ E(θ) = −(1/4) ∇_θ ||∇_θ E(θ)||^2.

Truncating the series at this order, we see that the modified equation for gradient flow has the following form:

    θ̇ = −∇_θ E(θ) − (h/4) ∇_θ ||∇_θ E(θ)||^2 + O(h^2).

Factoring out the gradient in the first two terms above yields (7), since λ R_IG(θ) = (h m / 4) · (1/m) ||∇_θ E(θ)||^2 = (h/4) ||∇_θ E(θ)||^2. As for the local error, if θ(h) is a solution of gradient flow starting at θ_{n−1}, we have in general that θ(h) = θ_n + O(h^2). The correction f_1 to gradient flow is constructed so that it kills the O(h^2) term in the expansion of its solution (see Proposition A.4 in the appendix), yielding θ̃(h) = θ_n + O(h^3) (see also [45] for details on error bounds). Finally, if E is positive and E(θ) = 0, then θ is a global minimum, which implies that ∇E(θ) = 0 as well, and hence Ẽ(θ) = 0. Conversely, for positive E, Ẽ(θ) = 0 trivially implies E(θ) = 0.
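The order claims in Theorem 3.1 can be checked numerically. The sketch below, under an assumed toy non-convex loss (not from the paper), compares one gradient descent step with accurate reference solutions of the original and modified gradient flows, integrated with a fine Runge-Kutta scheme; the first error should shrink roughly like h^2 and the second roughly like h^3 as h is halved.

    import jax
    import jax.numpy as jnp

    def E(theta):
        # Arbitrary smooth, non-convex stand-in loss (not from the paper).
        return jnp.sum(jnp.sin(theta) ** 2) + 0.1 * jnp.sum(theta ** 4)

    grad_E = jax.grad(E)

    def modified_E(theta, h):
        # E-tilde = E + (h/4) * ||grad E||^2  (equal to E + lambda * R_IG with lambda = h*m/4).
        g = grad_E(theta)
        return E(theta) + (h / 4.0) * jnp.sum(g ** 2)

    def rk4_flow(vector_field, theta0, t, n_steps=200):
        # Accurate reference solution of d(theta)/dt = vector_field(theta) over time t.
        dt = t / n_steps
        theta = theta0
        for _ in range(n_steps):
            k1 = vector_field(theta)
            k2 = vector_field(theta + 0.5 * dt * k1)
            k3 = vector_field(theta + 0.5 * dt * k2)
            k4 = vector_field(theta + dt * k3)
            theta = theta + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        return theta

    theta0 = jnp.array([0.9, -1.3])
    for h in [0.1, 0.05, 0.025]:
        gd_step = theta0 - h * grad_E(theta0)                        # one gradient descent step
        grad_E_mod = jax.grad(lambda th: modified_E(th, h))
        flow_orig = rk4_flow(lambda th: -grad_E(th), theta0, h)      # exact flow of E at time h
        flow_mod = rk4_flow(lambda th: -grad_E_mod(th), theta0, h)   # exact flow of E-tilde at time h
        print(h, float(jnp.linalg.norm(gd_step - flow_orig)),        # ~ O(h^2)
                 float(jnp.linalg.norm(gd_step - flow_mod)))         # ~ O(h^3)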
For overparameterized models, we predict that the strength of IGR relative to the original loss can be controlled by increasing the learning rate h (Prediction 2.1). However, gradient descent becomes unstable when the learning rate becomes too large. For applications where we wish to increase the strength of IGR beyond this point, we can take inspiration from implicit gradient regularization to motivate Explicit Gradient Regularization (EGR), which we define as:

    E_µ(θ) = E(θ) + µ R_IG(θ),   (8)

where µ is the explicit regularization rate, a hyper-parameter that we are free to choose, unlike the implicit regularization rate λ (Equation 3), which can only be controlled indirectly. Now, we can do gradient descent on E_µ with small learning rates and large µ. Although EGR is not the primary focus of our work, we will demonstrate the effectiveness of EGR for a simple two-parameter model in the next section (Section 5) and for a ResNet-18 trained on CIFAR-10 (Figure 3c).

In our first experiment, we explore implicit and explicit gradient regularization in a simple two-parameter model with a loss given by E(a, b) = (y − f(x; a, b))^2 / 2, where f(x; a, b) = abx is our model, a, b ∈ R are the model parameters and x, y ∈ R is the training data. We have chosen this model because we can fully visualize the gradient descent trajectories in 2-d space. For a single data point, this model is overparameterized, with global minima located along a curve attractor defined by the hyperbola ab = y/x. For small learning rates, gradient descent follows the gradient flow of the loss from an initial point (a_0, b_0) toward the line attractor. For larger learning rates, gradient descent follows a longer path toward a different destination on the line attractor (Figure 1a). We can understand these observations using Theorem 3.1, which predicts that gradient descent is closer to the modified flow given by ȧ = −∇_a Ẽ(a, b) and ḃ = −∇_b Ẽ(a, b), where Ẽ(a, b) = E(a, b) + λ R_IG(a, b) is the modified loss from Equation 2, R_IG(a, b) = x^2 (a^2 + b^2) E(a, b) is the implicit regularization term from Equation 4, and λ = h/2 is the implicit regularization rate from Equation 3, with learning rate h.

Figure 1: (a) The global minimum in the flattest region of the loss surface and with the lowest L2 norm is indicated with a black cross. For a small learning rate, h_S, gradient descent trajectories follow the exact gradient flow closely (green lines). For a moderate learning rate, h_M (see Appendix D for exact values), gradient descent follows a longer trajectory (red lines). This is closer to the corresponding exact flow following the modified loss gradient, with λ = h_M/2, consistent with backward error analysis. We also calculate the gradient descent trajectory with explicit regularization (dashed blue lines) using large regularization rates (µ ≫ h_M). (b) The impact of larger learning rates can be measured by calculating the strength of implicit gradient regularization R_IG relative to the loss E toward the end of training. This ratio becomes smaller for larger learning rates, and for this model, this leads to smaller L2 norms (see Appendix D for further details).

Although the modified loss and the original loss have the same global minima, they generate different flows. Solving the modified flow equations numerically, starting from the same initial point (a_0, b_0) as before, we find that the gradient descent trajectory is closer to the modified flow than the exact flow, consistent with backward error analysis (Theorem 3.1; Figure 1a).
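A minimal sketch of this two-parameter experiment is given below. The data point, learning rate, and integration step are illustrative choices (only the initial point follows Appendix D): it runs gradient descent on E, integrates the exact flows of E and of Ẽ with a fine step, and lets one compare which flow the gradient descent endpoint is closer to.

    import jax
    import jax.numpy as jnp

    # Single training point and initial parameters; x, y and the learning rate are
    # illustrative, and (a0, b0) follows Appendix D.
    x, y = 1.0, 1.0
    theta0 = jnp.array([2.8, 3.5])

    def E(theta):
        a, b = theta
        return 0.5 * (y - a * b * x) ** 2

    grad_E = jax.jit(jax.grad(E))

    def gradient_descent(theta, h, n_steps):
        for _ in range(n_steps):
            theta = theta - h * grad_E(theta)
        return theta

    def flow(loss_fn, theta, t, dt=1e-3):
        # Fine-step Euler integration as a stand-in for the exact gradient flow of loss_fn.
        g = jax.jit(jax.grad(loss_fn))
        for _ in range(int(t / dt)):
            theta = theta - dt * g(theta)
        return theta

    h = 0.05                      # a "moderate" learning rate for this loss
    T = 10.0                      # physical time; n = T / h gradient descent steps
    lam = h / 2.0                 # lambda = h * m / 4 with m = 2 parameters
    E_mod = lambda th: (1.0 + lam * x ** 2 * jnp.sum(th ** 2)) * E(th)

    gd = gradient_descent(theta0, h, int(T / h))
    print("gradient descent      :", gd)
    print("exact flow of E       :", flow(E, theta0, T))
    print("exact flow of E-tilde :", flow(E_mod, theta0, T))   # gd should sit closer to this one
    print("residual on ab = y/x  :", float(gd[0] * gd[1] - y / x))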
Next, we investigate Prediction 2.1: that the strength of the implicit gradient regularization R_IG(a, b) relative to the original loss E(a, b) can be controlled by increasing the regularization rate λ. In this case, this means that larger learning rates should produce gradient descent trajectories that lead to minima with a smaller value of R_IG(a, b)/E(a, b) = x^2 (a^2 + b^2). It is interesting to note that this ratio is proportional to the squared parameter norm (and that R_IG itself is proportional to the square of the loss surface slope). In our numerical experiments, we find that larger learning rates lead to minima with a smaller L2 norm (Figure 1b), closer to the flatter region in the parameter plane, consistent with Prediction 2.1 and Prediction 2.2. The extent to which we can strengthen IGR in this way is restricted by the learning rate. For excessively large learning rates, gradient descent ricochets from peak to peak until it either diverges or lands in the direct vicinity of a minimum, a behaviour that is sensitively dependent on initialization (Figure A.1). To go beyond the limits of implicit gradient regularization, we can explicitly regularize this model using Equation 8 to obtain a regularized loss E_µ(a, b) = (1 + µ x^2 (a^2 + b^2)) E(a, b). Solving the gradient flow of E_µ(a, b) numerically, starting from the same initial point (a_0, b_0), using a very large explicit regularization rate µ (and using gradient descent with a very small learning rate h for numerical integration, see Appendix D), we find that this flow leads to global minima with a small L2 norm (Figure 1a) in the flattest region of the loss surface. This could not have been achieved with IGR, since it would require us to use learning rates so large that gradient descent diverges.

In this section, we give a purely geometric interpretation of IGR, supporting Prediction 2.2. Consider the loss surface S associated with a loss function E defined over the parameter space θ ∈ R^m. This loss surface is defined as the graph of the loss:

    S = {(θ, E(θ)) : θ ∈ R^m} ⊂ R^{m+1}.

We define α(θ) to be the angle between the tangent space T_θ S to S at θ and the parameter plane, i.e., the linear subspace {(θ, 0) : θ ∈ R^m} in R^{m+1}. We can compute this angle using the inner product between the normal vector N(θ) to S at θ and the normal vector ẑ to the parameter plane:

    cos α(θ) = ⟨N(θ), ẑ⟩.

Now we can define the loss surface slope at θ as the tangent of this angle: slope(θ) := tan α(θ). This is a natural extension of the 1-dimensional notion of slope. With this definition, we can now reformulate the modified loss function in a purely geometric fashion:

Proposition 6.1. The modified loss Ẽ in Equation 2 can be expressed in terms of the loss surface slope as follows:

    Ẽ(θ) = E(θ) + (h/4) slope^2(θ).

This proposition is an immediate consequence of Theorem 3.1 and Corollary B.1.1 in Appendix B. It tells us that gradient descent with higher amounts of implicit regularization (higher learning rate) will implicitly minimize the loss surface slope locally, along with the original training loss, following shallower paths rather than the steepest continuous path of the original loss gradient flow. Prediction 2.2 claims that this local effect of implicit slope regularization accumulates into the global effect of directing gradient descent trajectories toward global minima in regions surrounded by shallower slopes, that is, toward flatter (or broader) minima. This behaviour is very clear in the two-parameter example of Section 5, where gradient descent trajectories with higher amounts of implicit (or explicit) regularization lead to global optima in regions with shallower slopes (see Figure 1).
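The slope identity underlying Proposition 6.1 (slope^2(θ) = m R_IG(θ), i.e., slope(θ) = ||∇_θ E(θ)||) is easy to verify numerically. The snippet below does so for an arbitrary toy loss; the loss and parameter values are placeholders.

    import jax
    import jax.numpy as jnp

    def E(theta):
        # Arbitrary smooth toy loss (a stand-in for a network loss).
        return jnp.sum(jnp.tanh(theta) ** 2) + 0.05 * jnp.sum(theta ** 2)

    theta = jnp.array([0.7, -1.2, 2.0])
    g = jax.grad(E)(theta)
    m = theta.size

    # Normal vector to the loss surface at (theta, E(theta)) in the Monge parameterization.
    N = jnp.concatenate([-g, jnp.ones(1)]) / jnp.sqrt(1.0 + jnp.sum(g ** 2))
    z_hat = jnp.concatenate([jnp.zeros(m), jnp.ones(1)])

    cos_alpha = jnp.dot(N, z_hat)
    slope = jnp.tan(jnp.arccos(cos_alpha))          # slope(theta) = tan(alpha(theta))
    R_IG = jnp.sum(g ** 2) / m                      # implicit gradient regularizer
    print(float(slope), float(jnp.sqrt(m * R_IG)))  # these two numbers agree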
Figure 2: Implicit regularization and test accuracy. (a) Each dot represents a different MLP, with different learning rates and network sizes. Implicit gradient regularization R_IG is reported for each model at the time of maximum MNIST test accuracy and 100% train accuracy. We see that models with a larger implicit regularization rate λ have smaller values of R_IG. (b) Networks with higher values of λ also have higher maximum test accuracy values.

Figure 3: (a) We measure the loss surface slope at the time of maximum test accuracy for models trained on MNIST, and we observe that the loss surface slope is smaller for larger learning rates, and that the loss surface slopes remain small for larger parameter perturbations compared to models trained with small learning rates. We perturb our networks by adding multiplicative Gaussian noise to each parameter (up to 300% of the original parameter size).

Next, we empirically investigate implicit gradient regularization and explicit gradient regularization in deep neural networks. We consider a selection of MLPs trained to classify MNIST digits and we also investigate a ResNet-18 trained to classify CIFAR-10 images. All our models are implemented using Haiku [46]. To begin, we measure the size of implicit regularization in MLPs trained to classify MNIST digits with a variety of different learning rates and network sizes (Figure 2). Next, we measure the loss surface slope in 5-layer MLPs, with 400 units per layer, trained to classify MNIST digits with a range of different learning rates. We find that neural networks with larger learning rates, and hence with stronger IGR, have smaller slopes at the time of maximum test accuracy (Figure 3a). We also measure the loss surface slope in the vicinity of these optima. To do this, we add multiplicative Gaussian noise to every parameter according to θ_p = θ(1 + η), where θ are the parameters of a fully trained model, θ_p are the parameters after the addition of noise, and η ∼ N(0, σ). We find that neural networks trained with larger learning rates have flatter slopes, and these slopes remain small following larger perturbations (Figure 3a). These numerical results are consistent with our prediction that IGR encourages the discovery of flatter optima (Prediction 2.2).

Next, we observe that improvements in test set accuracy are correlated with increases in the regularization rate (Figure 2b), and also with increases in learning rate and network size (Figure A.5). This is consistent with Prediction 2.3. Furthermore, the correlation between test set accuracy and network size m supports our use of network size scaling in our definition of λ in Equation 3 and R_IG(θ) in Equation 4.

Next, we explore the robustness of deep neural networks in response to parameter perturbations. In previous work, it has been reported that deep neural networks are robust to a substantial amount of parameter noise, and that this robustness is stronger in networks with higher test accuracy [12]. We measure the degradation in classification accuracy as we increase the amount of multiplicative Gaussian noise and find that neural networks with larger learning rates, and hence with stronger IGR, are more robust to parameter perturbations after training (Figure 3c), consistent with Prediction 2.4. This may explain, in part, the origin of the deep neural network robustness previously reported [12].
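The two measurements used above, R_IG(θ) over the full training set and the response to multiplicative parameter noise, only require loss gradients and a noise model. The following sketch shows how they can be computed; the small MLP, synthetic data, learning rate and noise levels are all placeholders rather than our experimental settings (which use Haiku models trained on MNIST).

    import jax
    import jax.numpy as jnp
    from jax import flatten_util

    def init_mlp(key, sizes):
        params = []
        for d_in, d_out in zip(sizes[:-1], sizes[1:]):
            key, sub = jax.random.split(key)
            params.append((jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in),
                           jnp.zeros(d_out)))
        return params

    def forward(params, x):
        for W, b in params[:-1]:
            x = jax.nn.relu(x @ W + b)
        W, b = params[-1]
        return x @ W + b

    def loss(params, x, y):
        # Softmax cross-entropy on integer labels.
        logp = jax.nn.log_softmax(forward(params, x))
        return -jnp.mean(jnp.take_along_axis(logp, y[:, None], axis=1))

    def r_ig(params, x, y):
        # R_IG(theta) = (1/m) * ||grad_theta E||^2 (Equation 4), m = total parameter count.
        grads = jax.grad(loss)(params, x, y)
        flat, _ = flatten_util.ravel_pytree(grads)
        return jnp.sum(flat ** 2) / flat.size

    def perturbed_loss(params, x, y, sigma, key):
        # Multiplicative parameter noise theta_p = theta * (1 + eta), eta ~ N(0, sigma).
        leaves, treedef = jax.tree_util.tree_flatten(params)
        keys = jax.random.split(key, len(leaves))
        noisy = [p * (1.0 + sigma * jax.random.normal(k, p.shape))
                 for p, k in zip(leaves, keys)]
        return loss(jax.tree_util.tree_unflatten(treedef, noisy), x, y)

    # Synthetic stand-in data (784 inputs, 10 classes), a small MLP, and a few GD steps.
    key = jax.random.PRNGKey(0)
    params = init_mlp(key, [784, 128, 10])
    x = jax.random.normal(key, (256, 784))
    y = jax.random.randint(key, (256,), 0, 10)
    grad_fn = jax.jit(jax.grad(loss))
    for _ in range(200):
        params = jax.tree_util.tree_map(lambda p, g: p - 0.1 * g, params, grad_fn(params, x, y))
    print("R_IG:", float(r_ig(params, x, y)))
    for sigma in [0.5, 1.0, 2.0, 3.0]:
        print(sigma, float(perturbed_loss(params, x, y, sigma, jax.random.PRNGKey(1))))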
We also explore IGR in several other settings. For ResNet-18 models trained on CIFAR-10, we find that R_IG is smaller and test accuracy is higher for larger learning rates, at the time of maximum test accuracy (Figures A.6 and A.7). Finally, we provide an initial demonstration of explicit gradient regularization (EGR). Specifically, we train a ResNet-18 using our explicit gradient regularizer (Equation 8) and we observe that EGR produces a boost of more than 12% in test accuracy (see Figure 3c). This initial experiment indicates that EGR may be a useful tool for training neural networks in some situations, especially where IGR cannot be increased with larger learning rates, which happens, for instance, when learning rates are so large that gradient descent diverges. EGR is not the primary focus of our work, but this experiment provides further evidence that IGR, which is our primary focus, may play an important role as a regularizer in deep learning.

Implicit regularization: Among the various possible sources of implicit regularization that have been identified [1, 3, 47], the Neural Tangent Kernel (NTK) is especially interesting [38-43, 47, 48] since, in the case of the least square loss, the IGR term R_IG can be related to the NTK (see Appendix C). This is particularly interesting because it suggests that the NTK may play a role beyond the kernel regime, in the rich regime. IGR is also related to the Fisher Information Matrix [49] in this context, and to implicit norm regularization in gradient descent [11, 30-35]. It might also be useful for understanding implicit regularization in deep matrix factorization with gradient descent [7, 36, 37], where implicit regularization seems to act as a bias toward low-rank matrix factorization [37]. IGR may also be useful for understanding adaptive learning rate methods, such as gradient descent with cyclically varying learning rates [50]. Here, cyclically varying the learning rate between large and small values can be interpreted as a cyclical variation between large and small amounts of IGR.

Runge-Kutta methods: Recently, Runge-Kutta methods have been used to understand old (and devise new) gradient-based optimization methods [51-53]. However, these works have not explored implicit regularization or backward error analysis. Backward error analysis was used [54, 55] to study stochastic gradient descent in the context of stochastic differential equations and diffusion equations for the study of convergence and adaptive learning schemes but, to the best of our knowledge, it has not been used to explore implicit regularization in gradient descent.

Following our backward error analysis, we now understand gradient descent as an algorithm that effectively optimizes a modified loss with an implicit regularization term arising through the discrete nature of gradient descent.
This leads to several predictions that we confirm experimentally: (i) IGR penalizes the second moment of the loss gradients (Prediction 2.1), and consequently, (ii) it penalizes minima in the vicinity of large gradients and encourages flat, broad minima in the vicinity of small gradients (Prediction 2.2); (iii) these broad minima are known to have low test errors and, consistent with this, we find that IGR produces minima with low test error (Prediction 2.3); (iv) the strength of regularization is proportional to the learning rate and network size (Equation 3); (v) consequently, networks with small learning rates or fewer parameters, or both, will have less IGR and worse test error; and (vi) solutions with high IGR are more robust to parameter perturbations (Prediction 2.4).

It can be difficult to study implicit regularization experimentally because it is not always possible to control the impact of various alternative sources of implicit regularization. For instance, we must always make a choice about when to stop training (e.g. early stopping, fixed iteration time, fixed physical time) and this choice is itself an unavoidable form of implicit regularization that will have an impact on the properties of the model. Our analytic approach to the study of implicit regularization in gradient descent allows us to identify the properties of implicit gradient regularization independently of other sources of implicit regularization. In our experimental work, we take great care to choose models and datasets that are sufficiently simple to allow us to clearly expose implicit gradient regularization, yet sufficiently expressive to provide insight into larger, less tractable settings. For many state-of-the-art deep neural networks trained on large real-world datasets, the implicit gradient regularization that we identify is likely to be just one component of a more complex recipe of implicit and explicit regularization. However, given that many of the favourable properties of deep neural networks, such as low test error and parameter robustness, are consistent with IGR, it is possible that IGR is an important piece of this regularization recipe.

Our theoretical work has some immediate practical implications. First, it suggests that we should use large learning rates if we want to increase the relative strength of implicit gradient regularization (but not so large that our models diverge). Second, we can strengthen IGR using models that have more parameters. Third, the implicit gradient regularization term that we identify can be used as an explicit gradient regularization term. Although this is not the primary focus of our work, we have provided a demonstration that explicit gradient regularization (EGR) can be used to extend the impact of implicit gradient regularization beyond the limits of gradient descent stability and backward error analysis applicability. This is reminiscent of other regularizers, such as dropout, which also encourage robust parameterizations [6, 12, 56, 57]. The success of these explicit regularizers demonstrates the importance of this type of regularization in deep learning.

There are many worthwhile directions for further work. In particular, it would be interesting to use backward error analysis to calculate the modified loss and implicit regularization for other widely used optimizers such as momentum, Adam, RMSprop and Nesterov momentum.
Indeed, there has already been some recent work formulating Nesterov momentum as a Runge-Kutta method, which is an important step toward a full backward error analysis [51, 53]. It would also be interesting to explore the properties of higher-order modified loss corrections. Although this is outside the scope of our work here, we have provided formulae for several higher-order terms in the appendix. More generally, we hope that our work demonstrates the utility of combining ideas and methods from backward error analysis, geometric numerical integration theory and machine learning, and we hope that our contribution supports future work in this direction.

Gradient descent implicitly regularizes models by penalizing trajectories that have large loss gradients, leading to model parameterizations that have low test error and parameter robustness. Stepping back, our work demonstrates that the sharply peaked loss landscape of a deep neural network is effectively a much less hazardous terrain than it first appears, characterized by shallow paths leading to broad valleys, a consolidating view that emerges through the lens of backward error analysis applied to gradient descent.

Our work provides new insight into an old algorithm. We do not anticipate any direct negative outcomes of this work. The main beneficiaries of this research will be other researchers who are interested in understanding the origin of implicit regularization in gradient descent, which is currently a very active area of research. It is difficult to imagine any negative outcomes that could be indirectly attributed to our theoretical work here, or equally, to the counterfactual act of not doing this research. No individual or group will be put at a disadvantage by this research.

In this section, we provide formulae for higher-order backward error analysis correction terms for the explicit Euler method, including the first-order correction that is required to complete the proof of Theorem 3.1. We start by restating the general problem addressed by backward error analysis. To begin, consider a first-order differential equation

    θ̇ = f(θ),   (A.1)

with vector field f : R^m → R^m. The explicit Euler method with step size h produces a sequence of discrete steps θ_0, θ_1, . . . , θ_n, . . . approximating the solution θ(t) of Equation A.1 with initial condition θ_0. (In other words, θ_n approximates θ(nh) for n ≥ 0.) However, at each discrete step the Euler method steps off the continuous solution θ(t) with a one-step error (or local error) θ_1 − θ(h) of order O(h^2). Backward error analysis was introduced in numerical integration to study the long-term error (or global error) θ_n − θ(nh) of the numerical method. More generally, backward error analysis is useful for studying the long-term behaviour of a discrete numerical method using continuous flows, such as its numerical phase portrait near equilibrium points, its asymptotically stable orbits, and its conserved quantities, among other properties (see [44, 45] for a detailed exposition). It stems from the work of Wilkinson [58] in numerical linear algebra, where the general idea in the study of a numerical solution via backward error analysis is to understand it as an exact solution of the original problem but with modified data.
When applied to numerical integration of ODEs this idea translates into finding a modified vector fieldf with corrections f i 's to the original vector field f in powers of the step-sizẽ so that the numeric method steps now exactly follow the (formal) solution of the modified equation: In other words, if θ(t) is the solution of the modified equation (A.4) and θ n is the n th discrete step of the numerical method, now one has: θ n = θ(nh) (A.5) In general, the sum of the corrections f i 's in (A.3) diverges, and the modified vector fieldf is only a formal power series. For practical purposes, one needs to truncate the modified vector field. If we truncate the modified vector field up to order n (i.e., we discard the higher corrections f l for l ≥ n + 1), the one-step error between the numeric method and the solution of the truncated modified equation is now of order O(h n+1 ). It is also possible to bound the long term error, but in this section we will only formally derive the higher corrections f i for the explicit Euler method. (We refer the reader to [45] , for instance, for precise bounds on the error for the truncated modified equation.) To derive the corrections f i 's for the explicit Euler method, it is enough to consider a single step of gradient descent θ + hf (θ) and identify the corresponding powers of h in by expanding the solution θ(h) of the modified equation (A.4) starting at θ into its Taylor series at zero: Lemma A.1. In the notation above, we havė where d n−2 dt n−2 d dθ g(θ) is shorthand to denote the operator d n−2 dt n−2 t=0 d dθ θ=θ(t) g(θ) and θ(t) is the solution of the modified equation A.4. We will will use this shorthand notation throughout this section. Proof. By definition θ(t) is the solution of (A.4) with initial condition θ(0) = θ, henceθ(0) =f (θ). Differentiating both sides of (A.4) with respect to t, we obtain Now the higher derivatives are obtained by differentiating both sides of the last equation n − 2 times with respect to t and setting t = 0. The previous lemma gives a formula for the derivatives θ (n) (0). However in order to compare the powers of h we need to expand these derivatives into power of h, which is what the next lemma does: Lemma A.2. In the notation above, we have where we define with f 0 (θ) being the original vector field f (θ), and , denoting the inner product of two vectors. Proof. This follows immediately from (A.9) and by expanding f (θ) 2 in powers of h: Putting everything together, we now obtain a Taylor series for the solution of the modified equation as a formal power series in h: Lemma A.3. In the notation above, we have where f 0 is the original vector field f , H 0 = 0 and we define recursively Proof. Replacing θ (n) (0) by their expression in (A.10) in the Taylor series (A.6), we obtain which finishes the proof. for each order of h, we obtain the following proposition: Proposition A. 4 . The corrections f i 's for the Euler method modified equation in (A.3) are given by the general recursive formula: where the L n,k are defined by Equation A.11. Let us use (A.14) to compute the first order correction in the modified equation for the Euler explicit method: Example A.5. For l = 1, we only have a single term with indices n = 2 and k = 0: which is the only correction we use in Theorem 3.1. Also observe that for any vector field f , the first order correction always comes from the gradient of a function. 
When the original vector field itself is a gradient f (θ) = −∇E(θ), then the first two terms of (A.3) can be understood as the gradient of a modified function, yielding the following form for the modified equation (A.4): which is the main result of Theorem 3.1. In this section, we provide all the details for a proof of Proposition 6.1 and for our claim concerning the relationship between the loss surface slope and the implicit gradient regularizer, which we package in Corollary B.1.1. The geometry underlying implicit gradient regularization makes it apparent that gradient descent has a bias toward flat minima [3, 6] . To begin with, consider a loss E over the parameter space θ ∈ R m . The loss surface is defined as the graph of the loss: It is a submanifold of R m+1 of co-dimension 1, which means that the space of directions orthogonal to S at a given point (θ, E(θ)) is spanned by a single unit vector, the normal vector N (θ) to S at (θ, E(θ)). There is a natural parameterization for surfaces given by the graphs of functions, the Monge parameterization, where the local chart is the parameter plane: θ → (θ, E(θ)). Using this parameterization it is easy to see that the tangent space to S at (θ, E(θ)) is spanned by the tangent vectors: v i (θ) = (0, . . . , 1, . . . , 0, ∇ θi E(θ)), for i = 1, . . . , m and where the 1 is at the i th position. Now that we have the tangent vectors, we can verify that the following vector is the normal vector, since its inner product with all the tangent vectors is zero and its norm is one: We can compute the cosine of the angle between the normal vector N (θ) at (θ, E(θ)) and the vector z = (0, . . . , 0, 1) that is perpendicular to the parameter plane by taking the inner product between these two vectors, immediately yielding the following Proposition: Proposition B.1. Consider a loss E and its loss surface S as above. The cosine of the angle between the normal vector N (θ) to the loss surface at (θ, E(θ)) and the vectorẑ = (0, . . . , 0, 1) perpendicular to the parameter plane can be expressed in terms of the implicit gradient regularizer as follows: Now observe that if N (θ),ẑ is zero this means that the tangent plane to S at (θ, E(θ)) is orthogonal to the parameter space, in which case the loss surface slope is maximal and infinite at this point! On the contrary, when N (θ),ẑ is equal to 1, the tangent plane at (θ, E(θ)) is parallel to the parameter plane, making S look like a plateau in a neighborhood of this point. Let us make precise what we mean by loss surface slope. First notice that the angle betweenẑ (which is the unit vector normal to the parameter plane) and N (θ) (which is the vector normal to the tangent plane to S) coincides with the angle between this two planes. We denote by α(θ) this angle: Now, we define the loss surface slope at (θ, E(θ)) by the usual formula slope(θ) = tan α(θ). (A.17) As we expect, when the loss surface slope is zero this means that the tangent plane to the loss surface is parallel to the parameter plane (i.e., α(θ) = 0), while when the slope goes to infinity it means the tangent plane is orthogonal to the parameter plane (i.e., α(θ) = π/2). The following corollary makes it clear that implicit gradient regularization in gradient descent orients the parameter search for minima toward flatter regions of the parameter space, or flat minima, which have been found to be more robust and to possess more generalization power (see [3, 6] ): Corollary B.1.1. 
The slope of the loss surface S at (θ, E(θ)) can be expressed in terms of the implicit gradient regularizer as follows:

    slope(θ) = √(m R_IG(θ)).

Proof. From (A.15) and the fact that cos α(θ) = ⟨N(θ), ẑ⟩, we have that 1/cos^2 α(θ) = 1 + m R_IG(θ). Now basic trigonometry tells us that in general 1/cos^2 α = 1 + tan^2 α, which implies here that tan^2 α(θ) = m R_IG(θ). Taking the square root of this last expression finishes the proof.

Remark B.2. Corollary B.1.1 gives us a very clear understanding of implicit gradient regularization. Namely, the quantity that is regularized is nothing other than the square of the slope, R_IG(θ) = (1/m) slope^2(θ), and the modified loss becomes Ẽ(θ) = E(θ) + (h/4) slope^2(θ). For explicit gradient regularization, we can now also understand the explicitly regularized loss in terms of the slope: E_µ(θ) = E(θ) + (µ/m) slope^2(θ). This makes it clear that this explicit regularization drives the model toward flat minima (with zero slope).

Remark B.3. There is another connection between IGR and the underlying geometry of the loss surface, through the metric tensor. It is a well-known fact from Riemannian geometry that the metric tensor g(θ) for surfaces in the Monge parameterization θ → (θ, E(θ)) has the following form:

    g_ij(θ) = δ_ij + ∇_{θ_i} E(θ) ∇_{θ_j} E(θ),

where δ_ij is the Kronecker delta. Now the determinant |g|, which defines the local infinitesimal volume element on the loss surface, can also be expressed in terms of the implicit gradient regularizer. Namely,

    |g(θ)| = 1 + ||∇E(θ)||^2 = 1 + m R_IG(θ).

Solving this equation for R_IG, we obtain a geometric definition of the implicit gradient regularizer,

    R_IG(θ) = (|g(θ)| − 1) / m,

which incidentally is zero when the surface looks like a Euclidean space.

We conclude this section by showing that the increase in parameter norm can be bounded by the loss surface slope at each gradient descent step.

Proposition B.4. Let θ_n be the parameter vector after n gradient descent updates. Then the increase in parameter norm is controlled by the loss surface slope as follows:

    ||θ_{n+1}|| − ||θ_n|| ≤ h · slope(θ_n).

Proof. The triangle inequality applied to one step of gradient descent, θ_{n+1} = θ_n − h∇E(θ_n), yields ||θ_{n+1}|| − ||θ_n|| ≤ h ||∇E(θ_n)||, which concludes the proof, since the gradient norm coincides with the loss surface slope.

In the case of the least square loss, the modified equation as well as the implicit gradient regularizer take a very particular form, involving the Neural Tangent Kernel (NTK) introduced in [41].

Proposition C.1. Consider a model f_θ : R^d → R^c with parameters θ ∈ R^m and with least square loss E(θ) = Σ_{k=1}^{n} ||ε_k(θ)||^2, where ε_k(θ) = y_k − f_θ(x_k) is the error vector on data point x_k. The modified loss can then be expressed as

    Ẽ(θ) = E(θ) + h Σ_{k,l=1}^{n} ε_k(θ)^T K_θ(x_k, x_l) ε_l(θ),

where K_θ is the Neural Tangent Kernel defined by K_θ(x_k, x_l) = ∇_θ f_θ(x_k) ∇_θ f_θ(x_l)^T. In the particular case when the model output is one-dimensional, we can write the modified loss compactly as follows:

    Ẽ(θ) = ε(θ)^T (1 + h K_θ) ε(θ).

Proof. Let us begin by computing the gradient of the loss:

    ∇_θ E(θ) = −2 Σ_{k=1}^{n} ∇_θ f_θ(x_k)^T ε_k(θ).

Using that result, we can compute the implicit gradient regularizer in this case:

    R_IG(θ) = (1/m) ||∇_θ E(θ)||^2 = (4/m) Σ_{k,l=1}^{n} ε_k(θ)^T K_θ(x_k, x_l) ε_l(θ),

which concludes the first part of the proof, since λ R_IG(θ) = h Σ_{k,l} ε_k(θ)^T K_θ(x_k, x_l) ε_l(θ). Now when the model output is one-dimensional, the ε_i(θ) are no longer vectors but scalars. We can then collect them into a single vector ε(θ) = (ε_1(θ), . . . , ε_n(θ)). Similarly, the terms K_θ(x_i, x_j) are no longer matrices but scalars, allowing us to collect them into a single matrix K_θ. In this notation we now see that the original loss can be written as E = ε^T ε, yielding Ẽ = ε^T(1 + h K_θ)ε for the modified loss, concluding the proof.
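The identity at the heart of Proposition C.1 can be checked numerically for a tiny scalar-output model. The sketch below assumes the squared-error convention E(θ) = Σ_k ε_k(θ)^2 used above (different normalizations of the loss only change the constants); the model, data and parameter values are arbitrary stand-ins.

    import jax
    import jax.numpy as jnp

    # Tiny scalar-output model f_theta(x) = theta_2 * tanh(theta_0 * x + theta_1).
    def f(theta, x):
        return theta[2] * jnp.tanh(theta[0] * x + theta[1])

    xs = jnp.array([0.3, -1.2, 2.0])
    ys = jnp.array([0.5, -0.4, 1.1])
    theta = jnp.array([0.8, -0.1, 1.5])

    def E(theta):
        eps = ys - jax.vmap(lambda x: f(theta, x))(xs)
        return jnp.sum(eps ** 2)                      # E = eps^T eps

    eps = ys - jax.vmap(lambda x: f(theta, x))(xs)
    J = jax.vmap(lambda x: jax.grad(f)(theta, x))(xs)  # per-example gradients of the model output
    K = J @ J.T                                        # empirical NTK Gram matrix K(x_k, x_l)

    lhs = jnp.sum(jax.grad(E)(theta) ** 2)             # ||grad_theta E||^2
    rhs = 4.0 * eps @ K @ eps                          # 4 * sum_{k,l} eps_k K_kl eps_l
    print(float(lhs), float(rhs))                      # these agree, so m * R_IG = 4 eps^T K eps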
For the least square loss, we see that IGR is expressed in terms of the NTK. Therefore the NTK entries will tend to be minimized during gradient descent. In particular, the cross terms K_θ(x_i, x_j) will be pushed toward zero. This means that gradient descent will push the maximum error directions ∇ε_k(θ) at different data points to be orthogonal to each other (i.e., ∇ε_k(θ)^T ∇ε_l(θ) ≈ 0). This is good, since a gradient update is nothing other than a weighted sum of these error directions. If they are orthogonal, the gradient update contribution at point x_k will not affect the gradient update contribution at point x_l, so the individual data point corrections are less likely to nullify each other as gradient descent progresses.

In this section, we provide supplementary results (Figure A.1), hyper-parameter values (Table A.1) and modified loss derivations for the two-parameter model described in Section 5. This model has a loss given by:

    E(a, b) = (y − abx)^2 / 2,

where a, b ∈ R are the model parameters and x, y ∈ R is the training data. The implicit regularization term for this model can be calculated using Equation 4, yielding:

    R_IG(a, b) = x^2 (a^2 + b^2) E(a, b).

The implicit regularization rate can be calculated using Equation 3, yielding:

    λ = h/2.

The modified loss can be calculated using Equation 2, yielding:

    Ẽ(a, b) = (1 + λ x^2 (a^2 + b^2)) E(a, b).

Here, we can see that the global minima of Ẽ(a, b) (i.e. the zeros) are the same as the global minima of E(a, b), since 1 + λ x^2 (a^2 + b^2) is positive. However, as we will see, the corresponding gradient flows are different. The exact modified gradient flow for the modified loss is given by:

    ȧ = −∇_a Ẽ(a, b),   ḃ = −∇_b Ẽ(a, b).

The exact gradient flow for the original loss is given by:

    ȧ = −∇_a E(a, b),   ḃ = −∇_b E(a, b).

The exact numerical flow of gradient descent is given by:

    a_{n+1} = a_n − h ∇_a E(a_n, b_n),   b_{n+1} = b_n − h ∇_b E(a_n, b_n),

where (a_n, b_n) are the parameters at iteration step n. For this model, we have ∇_a E = −bx(y − abx) and ∇_b E = −ax(y − abx).

Remark D.1. In vector notation, with θ = (a, b), we see that the modified gradient flow equation is

    θ̇ = −(1 + λ x^2 ||θ||^2) ∇E(θ) − (2λ x^2 E(θ)) θ.

The last term, −(2λx^2 E)θ, is a central vector field re-orienting the original vector field ∇E away from the steepest slopes and toward the origin, which in this example coincides with the flatter regions where the minimal-norm global minima are located. This phenomenon becomes stronger for parameters further away from the origin where, coincidentally, the slopes are steepest. Specifically, slope(θ) = ||θ|| |x| √(2E).

In Figure 1a, we plot the trajectories of four different flows starting from the same initial point (a_0, b_0) to illustrate the impact of IGR. We look at two initial points, chosen to illustrate the behaviour of gradient descent in different settings. The full set of hyper-parameters used for this experiment is given in Table A.1. We also plot the trajectory of gradient descent for a large learning rate h_L, where backward error analysis is no longer applicable (Figure A.1), and observe that gradient descent ricochets across the loss surface, stepping over line attractors until it lands in a low-gradient region of the loss surface, where it converges toward a global minimum. This large learning rate regime can be unstable. For larger learning rates, or for different initial positions, we observe that gradient descent can diverge in this regime. In Figure 1b, we explore the asymptotic behaviour of gradient descent by measuring R_IG/E after convergence for a range of models, all initialized at a_0 = 2.8 and b_0 = 3.5, with a range of learning rates.
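Both identities in Remark D.1, the decomposition of the modified gradient flow and the slope formula, can be checked numerically with a few lines of JAX; the data point, learning rate and parameter values below are illustrative choices.

    import jax
    import jax.numpy as jnp

    x, y = 1.5, 2.0                   # a single (illustrative) training point
    h = 0.05
    lam = h / 2.0                     # lambda = h * m / 4 with m = 2

    def E(theta):
        a, b = theta
        return 0.5 * (y - a * b * x) ** 2

    def E_mod(theta):
        return (1.0 + lam * x ** 2 * jnp.sum(theta ** 2)) * E(theta)

    theta = jnp.array([2.8, 3.5])
    grad_E = jax.grad(E)(theta)
    lhs = -jax.grad(E_mod)(theta)
    # Decomposition from Remark D.1: an amplified -grad E plus a central field toward the origin.
    rhs = -(1.0 + lam * x ** 2 * jnp.sum(theta ** 2)) * grad_E - 2.0 * lam * x ** 2 * E(theta) * theta
    print(lhs, rhs)                   # identical up to floating point

    slope = jnp.linalg.norm(grad_E)   # slope(theta) = ||grad E(theta)||
    print(float(slope), float(jnp.linalg.norm(theta) * abs(x) * jnp.sqrt(2.0 * E(theta))))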
We also explore the impact of explicit gradient regularization, using Equation 8 to define the explicitly regularized loss for our two-parameter model:

    E_µ(a, b) = (1 + µ x^2 (a^2 + b^2)) E(a, b).

We use this loss for gradient descent:

    a_{n+1} = a_n − h_Euler ∇_a E_µ(a_n, b_n),   b_{n+1} = b_n − h_Euler ∇_b E_µ(a_n, b_n).

Here, we have used a very small learning rate, h_Euler (Table A.1), and a very large value of µ (Table A.1). This allows us to achieve stronger regularization, since µ can be increased to a large value where gradient descent with h = 2µ would diverge.

In this section, we provide further details for the calculation of R_IG(θ) in deep neural networks (Figure 2). For all these experiments, we use JAX [59] and Haiku [46] to automatically differentiate and train deep neural networks for classification. Conveniently, the loss gradients that we compute with automatic differentiation are the same loss gradients that we need for the calculation of R_IG(θ). We calculate the size of implicit gradient regularization, R_IG(θ), during model training using Equation 4. We observe that R_IG(θ), the loss E(θ) and the ratio R_IG(θ)/E(θ) all decrease as training progresses, for all learning rates considered (Figure A.2). We also observe that the parameter magnitudes grow during training, and this growth slows as R_IG(θ) becomes small, in agreement with Proposition B.4. After a sufficiently large fixed number of training steps, we see that models with larger learning rates have much smaller values of R_IG(θ) relative to E(θ), which appears to be consistent with Prediction 2.1. However, the speed of learning clearly depends on the learning rate h, so it may not be reasonable to compare models after a fixed number of training iterations. Instead of stopping after a fixed number of iterations, we could stop training after n = T/h iterations, where T is the fixed physical time that naturally occurs in our backward error analysis (Equation A.5). Again, we find that models with larger learning rates have lower values of R_IG(θ) and E(θ) after a sufficiently large amount of physical training time T.

Figure A.3: The models used in this figure are the same as those reported in Figure 2, but using test data instead of training data. Since physical training time depends on the learning rate h, models with large learning rates are trained for far fewer iterations, and are exposed to far fewer epochs of training data, than models with small learning rates. Nonetheless, following a sufficiently long period of physical training time, the ratio R_IG(θ)/E(θ) is smaller for models with larger learning rates, consistent with Theorem 3.1.

Figure A.5: The relationship between learning rate, network size, R_IG and test accuracy, using the same data as in Figure 2. We see that R_IG typically decreases as the learning rate increases, or as the network size increases. We also see that test accuracy typically increases as the learning rate increases, or as the network size increases. Finally, we observe that test error is strongly correlated with the size of R_IG after training.

We also compare models using the test accuracy at the time of maximum test accuracy (which will be a different iteration time for each model) (Figure A.3). The observation that (i) fixed iteration stopping time, (ii) fixed physical stopping time, and (iii) maximum test accuracy stopping time all produce smaller values of R_IG(θ)/E(θ) for larger values of λ, consistent with Prediction 2.1, indicates that the relationships between these quantities cannot be trivially explained as a consequence of a particular choice of stopping-time regularization. In these examples, we use n_l = 400 (corresponding to ∼ 9.6 × 10^6 parameters) with batch size 32. In Figure 2 and Figure A.5 we report R_IG(θ) and test accuracy at the time of maximum test accuracy for a selection of networks of different size, trained with different learning rates.
For models with sufficient capacity to solve the training task and simultaneously minimize R_IG(θ), we expect R_IG(θ)/E(θ) and the test error to decrease as λ increases (Predictions 2.1 and 2.3). To expose this behaviour, we exclude models that fail to reach 100% MNIST training accuracy, such as models that diverge (in the large learning rate regime) and models with excessively small learning rates, which fail to solve the task even after long training periods. We observe that test error is strongly correlated with the size of R_IG after training (Figure A.5). We also confirm this for a range of batch sizes, including full-batch gradient descent (Figure A.5, top right) with n_l = 400, and for SGD with batch size 32 across a range of learning rates and network sizes (Figure A.5, bottom right).

Finally, to explore IGR and EGR for a larger model, we trained a ResNet-18 to classify CIFAR-10 images (using Haiku [46]). We used stochastic gradient descent for training, with a batch size of 512, for a range of learning rates l ∈ {0.005, 0.01, 0.05, 0.1, 0.2}. We observe the same behaviour as in the MNIST experiments: as the learning rate increases, the values of R_IG decrease (Prediction 2.1), the test accuracy increases (Prediction 2.3) and the optimization paths follow shallower slopes leading to broader minima (Prediction 2.2). The experimental results are summarized by the training curves displayed in Figure A.6 and in Figure A.7, where we plot the relation between learning rate, R_IG, and test accuracy, taken at the time of maximum test accuracy for each training curve.

Figure A.7: The relationship between learning rate, R_IG, and test accuracy, using the same data as in Figure A.6, for various ResNet-18 models trained on CIFAR-10. We see that R_IG typically decreases and the test accuracy increases as the learning rate increases. Finally, we also observe that test accuracy is strongly correlated with the size of R_IG. All values are taken at the time of maximum test accuracy.
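For completeness, the explicit regularizer of Equation 8 used in these experiments can be implemented with one extra differentiation (double backpropagation). The sketch below shows the idea on a tiny placeholder network with synthetic data; the model, hyper-parameters and data are illustrative and are not the ResNet-18 setup described above.

    import jax
    import jax.numpy as jnp
    from jax import flatten_util

    def model(params, x):
        # Placeholder two-layer network (not the ResNet-18 used in the experiments).
        W1, b1, W2, b2 = params
        return jnp.tanh(x @ W1 + b1) @ W2 + b2

    def data_loss(params, x, y):
        return jnp.mean((model(params, x) - y) ** 2)

    def egr_loss(params, x, y, mu):
        # E_mu(theta) = E(theta) + mu * R_IG(theta), with R_IG = (1/m) ||grad E||^2 (Equation 8).
        grads = jax.grad(data_loss)(params, x, y)
        flat, _ = flatten_util.ravel_pytree(grads)
        return data_loss(params, x, y) + mu * jnp.sum(flat ** 2) / flat.size

    @jax.jit
    def sgd_step(params, x, y, mu, lr):
        # Differentiating egr_loss differentiates through grad(data_loss): double backpropagation.
        grads = jax.grad(egr_loss)(params, x, y, mu)
        return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

    key = jax.random.PRNGKey(0)
    k1, k2, k3 = jax.random.split(key, 3)
    params = (jax.random.normal(k1, (4, 16)) * 0.1, jnp.zeros(16),
              jax.random.normal(k2, (16, 1)) * 0.1, jnp.zeros(1))
    x = jax.random.normal(k3, (32, 4))
    y = jnp.sum(x, axis=1, keepdims=True)
    for _ in range(100):
        params = sgd_step(params, x, y, mu=0.1, lr=0.01)
    print(float(data_loss(params, x, y)))

As in our experiments, µ can be made much larger than the values of λ reachable with stable learning rates, which is what makes explicit gradient regularization useful when IGR alone saturates.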
[1] Visualizing the loss landscape of neural nets
[2] Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning
[3] On large-batch training for deep learning: Generalization gap and sharp minima
[4] The large learning rate phase of deep learning: the catapult mechanism
[5] Towards explaining the regularization effect of initial large learning rate in training neural networks
[6] Flat minima
[7] Implicit regularization in deep matrix factorization
[8] Reconciling modern machine-learning practice and the classical bias-variance trade-off
[9] Scaling description of generalization with number of parameters in deep learning
[10] Just interpolate: Kernel "ridgeless" regression can generalize
[11] The implicit bias of gradient descent on separable data
[12] On the importance of single directions for generalization
[13] Train faster, generalize better: Stability of stochastic gradient descent
[14] Understanding the difficulty of training deep feedforward neural networks
[15] Learning overparameterized neural networks via stochastic gradient descent on structured data
[16] Generalization in deep networks: The role of distance from initialization
[17] Characterizing implicit bias in terms of optimization geometry
[18] A type of generalization error induced by initialization in deep neural networks
[19] Gradient descent optimizes over-parameterized deep relu networks
[20] Why does deep and cheap learning work so well
[21] Sgd implicitly regularizes generalization error
[22] The implicit regularization of stochastic gradient flow for least squares
[23] Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks
[24] Batch normalization biases deep residual networks towards shallow paths
[25] Stochastic gradient descent as approximate bayesian inference
[26] The effect of network width on stochastic gradient descent and generalization: an empirical study
[27] Empirical analysis of the hessian of over-parametrized neural networks
[28] A bayesian perspective on generalization and stochastic gradient descent
[29] The marginal value of adaptive gradient methods in machine learning
[30] In search of the real inductive bias: On the role of implicit regularization in deep learning
[31] A continuous-time view of early stopping for least squares regression
[32] The implicit bias of gradient descent on nonseparable data
[33] Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate
[34] Theoretical issues in deep networks: Approximation, optimization and generalization
[35] Connecting optimization and regularization paths
[36] Implicit regularization in matrix factorization
[37] Implicit regularization in deep learning may not be explainable by norms
[38] Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks
[39] On exact computation with an infinitely wide neural net
[40] A note on lazy training in supervised differentiable programming
[41] Neural tangent kernel: Convergence and generalization in neural networks
[42] Wide neural networks of any depth evolve as linear models under gradient descent
[43] Overparameterized nonlinear learning: Gradient descent takes the shortest path
[44] Geometric numerical integration
[45] The life-span of backward error analysis for numerical integrators
[46] Sonnet for JAX
[47] Generalization bounds of stochastic gradient descent for wide and deep neural networks
[48] Kernel and rich regimes in overparametrized models
[49] Universal statistics of fisher information in deep neural networks: Mean field approach
[50] Cyclical learning rates for training neural networks
[51] On symplectic optimization
[52] Integration methods and optimization algorithms
[53] Direct runge-kutta discretization achieves acceleration
[54] Stochastic modified equations and adaptive stochastic gradient algorithms
[55] Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation
[56] Dropout: A simple way to prevent neural networks from overfitting
[57] Learning sparse neural networks via sensitivity-driven regularization
[58] Error analysis of floating-point computation
[59] JAX: composable transformations of Python+NumPy programs

We would like to thank Samuel Smith, Soham De, Mihaela Rosca, Yan Wu, Mélanie Rey, Yee Whye Teh, Razvan Pascanu, Daan Wierstra, Ethan Dyer, Aitor Lewkowycz, Guy Gur-Ari, Michael Munn and Shakir Mohamed for helpful discussion and feedback. We would like to thank Alex Goldin, Guy Scully, Elspeth White and Patrick Cole for their support. We would also like to thank our families, especially Susie, Colm and Fiona; and Wendy, for their support, especially during these coronavirus times.