key: cord-0202768-pruxc8kx
authors: Lee, Sokbae; Liao, Yuan; Seo, Myung Hwan; Shin, Youngki
title: Fast and Robust Online Inference with Stochastic Gradient Descent via Random Scaling
date: 2021-06-06

We develop a new method of online inference for a vector of parameters estimated by the Polyak-Ruppert averaging procedure of stochastic gradient descent (SGD) algorithms. We leverage insights from time series regression in econometrics and construct asymptotically pivotal statistics via random scaling. Our approach is fully operational with online data and is rigorously underpinned by a functional central limit theorem. Our proposed inference method has a couple of key advantages over the existing methods. First, the test statistic is computed in an online fashion with only SGD iterates and the critical values can be obtained without any resampling methods, thereby allowing for efficient implementation suitable for massive online data. Second, there is no need to estimate the asymptotic variance and our inference method is shown to be robust to changes in the tuning parameters for SGD algorithms in simulation experiments with synthetic data.

We consider an inference problem for a vector of parameters defined by

β* := argmin_β Q(β),

where Q(β) := E[q(β, Y)] is a real-valued population objective function, Y is a random vector, and β ↦ q(β, Y) is convex. For a given sample {Y_t}_{t=1}^n, let β_t denote the stochastic gradient descent (SGD) solution path, that is, for each t ≥ 1,

β_t = β_{t−1} − γ_t ∇q(β_{t−1}, Y_t),

where β_0 is the initial starting value, γ_t is a step size, and ∇q(β_{t−1}, Y_t) denotes the gradient of q(β, Y_t) with respect to β evaluated at β = β_{t−1}. We study the classical Polyak (1990)-Ruppert (1988) averaging estimator β̄_n := n^{−1} Σ_{t=1}^n β_t.
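To fix ideas, here is a minimal sketch of the SGD recursion with Polyak-Ruppert averaging, where the average is maintained recursively rather than by storing the path (the function and variable names are ours, not the paper's):

```python
import numpy as np

def sgd_polyak_ruppert(grad, data, beta0, gamma0=0.5, a=0.505):
    """Run SGD and maintain the Polyak-Ruppert average recursively.

    grad(beta, y) should return the stochastic gradient of q(., y) at beta;
    the step size gamma_t = gamma0 * t**(-a) follows the paper's learning rate.
    """
    beta = np.asarray(beta0, dtype=float)
    beta_bar = beta.copy()
    for t, y in enumerate(data, start=1):
        beta = beta - gamma0 * t ** (-a) * grad(beta, y)
        # recursive averaging: beta_bar_t = beta_bar_{t-1} * (t-1)/t + beta_t / t
        beta_bar = beta_bar * (t - 1) / t + beta / t
    return beta_bar
```

For least squares, for instance, grad would be lambda b, obs: -obs[0] * (obs[1] - obs[0] @ b) with obs = (x_t, y_t).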
Polyak and Juditsky (1992) established regularity conditions under which the averaging estimator β̄_n is asymptotically normal:

√n (β̄_n − β*) →_d N(0, Υ),

where the asymptotic variance Υ has the sandwich form Υ := H^{−1} S H^{−1}, H := ∇²Q(β*) is the Hessian matrix, and S := E[∇q(β*, Y) ∇q(β*, Y)'] is the score variance. The Polyak-Ruppert estimator β̄_n can be computed recursively by the updating rule β̄_t = β̄_{t−1} (t−1)/t + β_t/t, which makes it well suited to the online setting. Although the celebrated asymptotic normality result (Polyak and Juditsky, 1992) was established about three decades ago, it is only in the past several years that online inference with β̄_n has gained increasing interest in the literature. It is challenging to estimate the asymptotic variance Υ in an online fashion, because a naive implementation requires storing all the data, thereby losing the advantage of online learning. In the seminal work of Chen et al. (2020), the authors addressed this issue by estimating H and S using the online iterates β_t and recursively updating the estimates whenever a new observation arrives. They called this method a plug-in estimator and showed that it consistently estimates the asymptotic variance and is ready for inference. However, the plug-in estimator requires that the Hessian matrix be computed to estimate H. In other words, it needs strictly more inputs than the SGD solution path β_t to carry out inference, and in applications it can be demanding to compute the Hessian matrix. As an alternative, Chen et al. (2020) proposed a batch-means estimator that avoids separately estimating H^{−1} or S. This method directly estimates the variance of the averaged online estimator β̄_n by dividing {β_1, ..., β_n} into batches with increasing batch size. The batch-means estimator is based on the idea that correlations among batches that are far apart decay exponentially fast; therefore, one can use a nonparametric empirical covariance to estimate Υ.
Along this line, Zhu et al. (2021) extended the batch-means approach to allow for real-time recursive updates, which is desirable in the online setting. The batch-means method produces batches of streaming samples so that data in batches far apart are only weakly correlated. The distance between batches is essential for controlling the dependence among batches, so it must be chosen carefully. In applications, one needs to specify a sequence that determines the batch size as well as the speed at which the dependence among batches diminishes. This sequence is an additional tuning parameter, and it also affects the rate of convergence of the estimated covariance matrix. As shown by Zhu et al. (2021), the optimal choice of this sequence and the batch size is related to the learning rate, and the resulting rate can be very slow: Zhu et al. (2021) showed that the batch-means covariance estimator converges no faster than O_P(n^{−1/4}). Simulation results in both Zhu et al. (2021) and this paper show that the coverage probability indeed converges quite slowly. Instead of estimating the asymptotic variance, Fang et al. (2018) proposed a bootstrap procedure for online inference. Specifically, they proposed to use a large number (say, B) of randomly perturbed SGD solution paths: for each b = 1, ..., B, starting from an initial value β_0^(b), update

β_t^(b) = β_{t−1}^(b) − γ_t η_t^(b) ∇q(β_{t−1}^(b), Y_t),

where η_t^(b) > 0 is an independent and identically distributed random variable with mean one and variance one. The bootstrap procedure needs strictly more inputs than computing β̄_n and can be time-consuming. In this paper, we propose a novel method of online inference for β*. While the batch-means estimator aims to mitigate the effect of dependence among the averages of SGD iterates, we instead embrace the dependence among them and build a test statistic via random scaling. We leverage insights from time series regression in econometrics (e.g., Kiefer et al., 2000) and use a random transformation of the β_t's to construct asymptotically pivotal statistics.
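To illustrate the scale of the bootstrap approach, the perturbed-path recursion can be sketched as follows. This is our own sketch (the weight distribution, names, and defaults are our choices; any positive mean-one, variance-one weight works):

```python
import numpy as np

def bootstrap_sgd_paths(grad, data, beta0, B=200, gamma0=0.5, a=0.505, seed=0):
    """Maintain B randomly perturbed SGD paths and their running averages.

    Each path multiplies its gradient by an i.i.d. weight eta with mean one
    and variance one; Exp(1) is one such choice.
    """
    rng = np.random.default_rng(seed)
    betas = np.tile(np.asarray(beta0, dtype=float), (B, 1))  # B perturbed iterates
    bars = betas.copy()                                      # their running averages
    for t, y in enumerate(data, start=1):
        eta = rng.exponential(1.0, size=(B, 1))              # mean 1, variance 1
        g = np.stack([grad(b, y) for b in betas])
        betas = betas - gamma0 * t ** (-a) * eta * g
        bars = bars * (t - 1) / t + betas / t
    return bars  # the spread across rows approximates the sampling variability
```

Note the B-fold gradient evaluations per observation, which is the extra cost relative to computing β̄_n alone.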
Our approach does not attempt to estimate the asymptotic variance Υ; instead, we studentize √n (β̄_n − β*) by a random scaling matrix V_n constructed from the partial sums of the SGD iterates. The resulting statistic is not asymptotically normal, but it is asymptotically pivotal in the sense that its asymptotic distribution is free of any unknown nuisance parameters; thus, its critical values can be easily tabulated. Furthermore, the random scaling quantity V_n does not require any inputs other than the SGD path β_t and can be updated recursively. As a result, our proposed inference method has a couple of key advantages over the existing methods. First, the test statistic is computed in an online fashion with only SGD iterates, and the critical values can be obtained without any resampling methods, thereby allowing for efficient implementation suitable for massive online data. Second, there is no need to estimate the asymptotic variance, and our inference method is shown to be robust to changes in the tuning parameters for SGD algorithms in simulation experiments with synthetic data.

Related work on SGD. The SGD methods, which are popular in the setting of online learning (e.g., Hoffman et al., 2010; Mairal et al., 2010), have been studied extensively over the past decade. Among other things, probability bounds on statistical errors have been derived. For instance, Bach and Moulines (2013) showed that for both the squared loss and the logistic loss, one may exploit the smoothness of the loss function to obtain algorithms with a fast convergence rate without any strong convexity. See Rakhlin et al. (2012) and Hazan and Kale (2014) for related results on convergence rates. Duchi et al. (2011) proposed AdaGrad, which employs the square root of the inverse diagonal Hessian matrix to adaptively control the gradient steps of SGD, and derived regret bounds for the loss function. Kingma and Ba (2015) introduced Adam, which computes adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.
Liang and Su (2019) employed an idea similar to AdaGrad to adjust the gradient direction and showed that the distribution for inference can be simulated iteratively. Toulis and Airoldi (2017) developed implicit SGD procedures and established the resulting estimator's asymptotic normality. Anastasiou et al. (2019) and Mou et al. (2020) developed results for non-asymptotic inference.

Related work in econometrics. Random scaling by means of the partial sum process of the same summands has been actively employed in time series econometrics to estimate the so-called long-run variance, the sum of all the autocovariances, since it was suggested by Kiefer et al. (2000). The literature has documented evidence that random scaling stabilizes the excessive finite-sample variation of the traditional consistent estimators of the long-run variance; see, e.g., Velasco and Robinson (2001), Sun et al. (2008), and a recent review in Lazarus et al. (2018). The insight has proved valid in broader contexts where the estimation of the asymptotic variance is challenging, such as Kim and Sun (2011), which involves spatially dependent data, and Gupta and Seo (2021) for high-dimensional inference. We show in this paper that it is also useful in online inference, which to the best of our knowledge has not been explored. While our experiments focus on one of the earlier proposals of the random scaling methods, there are numerous alternatives (see, e.g., Sun, 2014), which warrants future research on the optimal random scaling method.

Notation. Let a' and A', respectively, denote the transposes of a vector a and a matrix A. Let |a| denote the Euclidean norm of a vector a and ‖A‖ the Frobenius norm of a matrix A. Also, let ℓ^∞[0, 1] denote the set of bounded continuous functions on [0, 1].

In this section, we first present the asymptotic theory that underpins our inference method and describe our proposed online inference algorithm.
Then, we explain our method in comparison with the existing methods, using the linear regression model as an example. We first extend Polyak and Juditsky (1992)'s central limit theorem (CLT) to a functional CLT (FCLT), that is,

n^{−1/2} Σ_{t=1}^{[nr]} (β_t − β*) ⇒ Υ^{1/2} W(r),    r ∈ [0, 1],

where ⇒ stands for weak convergence in ℓ^∞[0, 1] and W(r) is a vector of independent standard Wiener processes on [0, 1]. That is, the partial sum of the online updated estimates β_t converges weakly to a rescaled Wiener process, with scaling equal to the square root of the asymptotic variance of the Polyak-Ruppert average. The CLT proved in Polyak and Juditsky (1992) is then the special case r = 1. Building on this extension, we propose an online inference procedure. Specifically, using the random scaling matrix V_n defined in (3), we consider the following t-statistic:

t_n = √n (β̄_{n,j} − β*_j) / √(V_n)_{jj},

where the subscripts j and jj, respectively, denote the j-th element of a vector and the (j, j) element of a matrix. The FCLT then yields that the t-statistic is asymptotically pivotal. Note that instead of using an estimate of Υ, we use the random scaling V_n for the proposed t-statistic. As a result, the limit is not the conventional standard normal but a mixed normal. It can be used to construct confidence intervals for β*_j for each j. A substantial advantage of this random scaling is that one does not have to estimate an analytic asymptotic variance formula, and it tends to be more robust in finite samples. As mentioned in the introduction, this random scaling idea has been widely used in the literature on fixed-bandwidth heteroskedasticity and autocorrelation robust (HAR) inference (e.g., Kiefer et al., 2000; Lazarus et al., 2018). More generally, for any ℓ ≤ d linear restrictions H_0: Rβ* = c, where R is an (ℓ × d)-dimensional known matrix of rank ℓ and c is an ℓ-dimensional known vector, the conventional Wald test based on V_n becomes asymptotically pivotal. To establish this result formally, we make the following assumptions à la Polyak and Juditsky (1992).
(ii) The Hessian matrix H is positive definite. (iii) The gradient noise ξ_t is a martingale difference sequence defined on a probability space (Ω, F, F_t, P), i.e., E(ξ_t | F_{t−1}) = 0 almost surely for all t ≥ 1, with finite moments of some order s, and the decomposition of Polyak and Juditsky (1992) takes place. (iv) It holds that γ_t = γ_0 t^{−a} for some 1/2 < a < 1. Assumptions 1(i)-(iii) are identical to Assumptions 3.1-3.3 of Polyak and Juditsky (1992), and Assumption 1(iv) is the standard learning rate. Assumption 2 adds a mixing-type condition and a bounded-moments condition to strengthen the results to uniform convergence, which is needed to prove the functional CLT. The following theorem is a formal statement under the conditions stated above. Note that the proof of the FCLT requires bounding some processes, indexed by r, uniformly over r. Our proof is new, as some of these processes cannot be written as partial sums of martingale differences; hence, we cannot simply apply results such as Doob's inequalities. Recent work, such as Zhu and Dong (2019), developed FCLTs using bounded sequences. Our proof extends theirs to possibly unbounded sequences with finite moments, using new technical arguments.

Theorem 1. Suppose rank(R) = ℓ. Under Assumption 1 and H_0,

n (Rβ̄_n − c)' (R V_n R')^{−1} (Rβ̄_n − c) →_d W(1)' ( ∫_0^1 W̄(r) W̄(r)' dr )^{−1} W(1),

where W is an ℓ-dimensional vector of standard Wiener processes and W̄(r) := W(r) − rW(1).

Proof of Theorem 1. Rewrite (1) as an error recursion. Let Δ_t := β_t − β* and Δ̄_t := β̄_t − β* denote the errors in the t-th iterate and in the average estimate at step t, respectively. Then, subtracting β* from both sides of (6) yields the recursion for Δ_t. Furthermore, for r ∈ [0, 1], introduce a partial sum process whose weak convergence we shall establish. Specifically, we extend Theorem 2 in Polyak and Juditsky (1992, PJ hereafter) to an FCLT. The first step is a uniform approximation of the partial sum process by another partial sum process. According to Part 4 in the proof of PJ's Theorem 2, this is indeed the case. Turning to the weak convergence of √t Δ̄¹_t(r), we extend PJ's Theorem 1.
Following its decomposition, the bound for I^(3) requires sophisticated arguments, as w_j ξ_j is not a martingale difference sequence (mds), even though ξ_j is. We therefore develop new technical arguments to bound this term, whose proof we defer to the end of this proof. The FCLT for mds (see, e.g., Theorem 4.2 in Hall and Heyde, 1980) then applies to I^(2)(r); its regularity conditions are verified in the proof (specifically, Part 1) of the theorem in PJ, which applies the mds CLT to I^(2)(1). This shows that I^(2) converges weakly to a rescaled Wiener process Υ^{1/2} W(r), which establishes the FCLT in (4). Next, define C_n(r) := R n^{−1/2} Σ_{t=1}^{[nr]} (β_t − β*). Also let Λ = (RΥR')^{1/2}, which exists and is invertible as long as ℓ ≤ d. The FCLT (4) then shows that, for some vector of independent standard Wiener processes W*(r), C_n(r) ⇒ ΛW*(r). In addition, the sum is also an integral over r, as C_n(r) is a partial sum process, and Rβ̄_n − c is a continuous functional of C_n(·). The desired result in the theorem then follows from the continuous mapping theorem. It remains to bound I^(3). Note that E[ξ_{j_1} ··· ξ_{j_p}] = 0 for distinct values of j_1, ..., j_p, as ξ_j is an mds. When (j_1, ..., j_p) ∈ A_k, where the set A_k denotes the collection of index vectors consisting of k distinct values from 1, ..., m − 1, the bound holds for any b due to the boundedness of w^m_j. By Lemma 2 in Zhu and Dong (2020), the bound holds uniformly in m and k. For notational simplicity, write the sum as B_1 + B_2, where for B_1 we have 1 + ap/2 − p/2 < 0 because 1 < (1 − a)p/2. We now examine B_2, which is the case k > p/2 in the subindex. We can tighten the bound (7) on E[ξ_{j_1} ··· ξ_{j_p}] using the mixing condition in Assumption 2. The proof holds for general p, but without loss of generality we take p = 6 as an example. Then k ∈ {4, 5}. Suppose k = 4. There are two types of index patterns. One type is j_1 = j_2 and j_3 = j_4, with j_5 and j_6 distinct from each other and from j_1 and j_3. The other type is j_1 = j_2 = j_3 with the other elements distinct.
We only consider j_1 = j_2 > j_3 = j_4 > j_5 > j_6 explicitly, as the other cases are similar. The bound follows from the mixing condition and the mds property of ξ_t. Since w^t_j is bounded and the sequence j^{−η} is summable for η > 1, we conclude the desired bound uniformly in m. Similarly, we reach the same bound for k = 5. The case with larger p proceeds similarly, and the claimed bound follows. As an important special case of Theorem 1, the t-statistic defined in (5) converges in distribution to the following pivotal limiting distribution: for each j = 1, ..., d,

t_n →_d W_1(1) / ( ∫_0^1 (W_1(r) − rW_1(1))² dr )^{1/2},

where W_1 is a one-dimensional standard Wiener process.

Related work on Polyak and Juditsky (1992). Several papers have extended Polyak and Juditsky (1992) to more general settings (e.g., Kushner and Yang, 1993; Godichon-Baggioni, 2017; Su and Zhu, 2018; Zhu and Dong, 2020). The stochastic process defined in Kushner and Yang (1993, equation (2.2)) is different from the partial sum process in (4). Godichon-Baggioni (2017) considers parameters taking values in a separable Hilbert space, and as such it considers a generalization of Polyak-Juditsky of more of an empirical-process type, while our FCLT concerns partial sum processes. Su and Zhu (2018)'s HiGrad tree divides updates into levels, with the idea that correlations among distant SGD iterates decay rapidly. Their Lemma 2.5 considers the joint asymptotic normality of K partial sums for a finite K, while our FCLT is for the partial sum process indexed by real numbers on the interval [0, 1]. Zhu and Dong (2020) appears closer than the others to our FCLT for the partial sums, although their set of sufficient conditions is not the same as ours. However, we emphasize that the more innovative part of our work is how we utilize the FCLT rather than the FCLT itself. Indeed, it appears that prior to this paper, no other work in the literature makes use of the FCLT as this paper does in order to conduct online inference with SGD.
Just as the average SGD estimator can be updated recursively via β̄_t = β̄_{t−1} (t−1)/t + β_t/t, the statistic V_t can also be updated in an online fashion. To state the online updating rule for V_t, note that V_t can be written in terms of the running sums Σ_{s≤t} s² β̄_s β̄_s', Σ_{s≤t} s² β̄_s, and Σ_{s≤t} s². Thus, at step t − 1, we only need to keep these three quantities, together with β̄_{t−1}, to update V_{t−1} to V_t using the new observation β_t. The following algorithm summarizes the arguments above.

Algorithm 1: Online Inference with SGD via Random Scaling
Input: function q(·), parameters (γ_0, a) for the step size γ_t = γ_0 t^{−a} for t ≥ 1
Initialize: set initial values for β_0, β̄_0, A_0

Once β̄_n and V_n are obtained, it is straightforward to carry out inference. For example, we can use the t-statistic in (5) to construct the (1 − α) asymptotic confidence interval for the j-th element:

[ β̄_{n,j} − cv(1 − α/2) √(V_{n,jj}/n),  β̄_{n,j} + cv(1 − α/2) √(V_{n,jj}/n) ],

where the critical value cv(1 − α/2) is tabulated in Abadir and Paruolo (1997, Table I). The limiting distribution in (8) is mixed normal and symmetric around zero. For easy reference, we reproduce the critical values in Table 1. When α = 0.05, the critical value is 6.747. Critical values for testing linear restrictions H_0: Rβ* = c are given in Kiefer et al. (2000, Table II).

In this subsection, we consider least squares estimation of the linear regression model y_t = x_t'β* + ε_t. In this example, the stochastic gradient sequence β_t is given by

β_t = β_{t−1} + γ_t x_t (y_t − x_t'β_{t−1}),

where γ_t = γ_0 t^{−a} with 1/2 < a < 1 is the step size. This linear regression model satisfies Assumption 1, as shown by Polyak and Juditsky (1992, Section 5). The asymptotic variance of β̄_n is Υ = H^{−1} S H^{−1}, where H = E[x_t x_t'] and S = E[x_t x_t' ε_t²]. Our proposed method standardizes using V_n according to (3), which does not consistently estimate Υ. We use the critical values tabulated in Table 1, whereas the existing methods (except for the bootstrap) seek consistent estimation of Υ. For instance, the plug-in method estimates H and S by Ĥ = n^{−1} Σ_{t=1}^n x_t x_t' and Ŝ = n^{−1} Σ_{t=1}^n x_t x_t' ε̂_t², respectively, where ε̂_t = y_t − x_t'β_{t−1}.
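The recursion for V_t and the confidence interval construction described above can be sketched end to end as follows. This is our illustrative implementation: V_n is assembled from three running sums so that no iterates need to be stored, in a form consistent with the nested-batch description in the text, and 6.747 is the 5%-level critical value quoted above (all function and variable names are ours):

```python
import numpy as np

CV_975 = 6.747  # two-sided 95% critical value (Abadir and Paruolo, 1997, Table I)

def online_random_scaling(grad, data, beta0, gamma0=0.5, a=0.505):
    """SGD with Polyak-Ruppert averaging and online random scaling.

    Keeps only beta_bar_t and three running sums, expanding
    V_n = n^{-2} * sum_s s^2 (bar_s - bar_n)(bar_s - bar_n)'.
    """
    beta = np.asarray(beta0, dtype=float)
    d = beta.size
    bar = beta.copy()
    A = np.zeros((d, d))  # running sum of s^2 * bar_s bar_s'
    b = np.zeros(d)       # running sum of s^2 * bar_s
    c = 0.0               # running sum of s^2
    n = 0
    for t, y in enumerate(data, start=1):
        beta = beta - gamma0 * t ** (-a) * grad(beta, y)
        bar = bar * (t - 1) / t + beta / t
        A += t * t * np.outer(bar, bar)
        b += t * t * bar
        c += t * t
        n = t
    V = (A - np.outer(b, bar) - np.outer(bar, b) + c * np.outer(bar, bar)) / n ** 2
    half = CV_975 * np.sqrt(np.diag(V) / n)  # 95% confidence half-widths
    return bar, V, (bar - half, bar + half)
```

For the least squares example, grad is lambda b, obs: -obs[0] * (obs[1] - obs[0] @ b) with obs = (x_t, y_t).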
Note that Ĥ^{−1} does not rely on the updated β_t, but it may not be easy to compute if dim(x_t) is moderately large. Alternatively, the batch-means method first splits the iterates β_t into M + 1 batches, discarding the first batch as the burn-in stage, and estimates Υ directly by an empirical covariance Υ̂_1 across batch means, where β̄_k is the mean of the β_t's in the k-th batch and n_k is the batch size. One may also discard the first batch when calculating β̄_n in Υ̂_1. As noted by Zhu et al. (2021), a serious drawback of this approach is that one needs to know the total sample size n a priori, so Υ̂_1 must be recalculated whenever a new observation arrives. Instead, Zhu et al. (2021) proposed a "fully online" covariance estimator Υ̂_2, which splits the iterates β_t into n batches B_1, ..., B_n and estimates the covariance using S_k, the sum of all elements in B_k, and |B_k|, the size of the k-th batch. The batches overlap. For instance, fixing a pre-determined sequence {1, 3, 5, 7, ...}, we can set the batches accordingly, and subsequent B_t's are defined analogously. Our proposed scaling V_n is similar to Υ̂_2 in the sense that it can be formulated in the same way, with a particular choice of strictly nested batches B*_k. However, there is a key difference between V_n and Υ̂_2: the batches used by Υ̂_2, though they may overlap, are required to be weakly correlated as they become far apart. In contrast, the B*_k are strongly correlated and strictly nested. Thus, we embrace the dependence among the B*_k and arrive at the scaling V_n, which does not consistently estimate Υ. An important advantage of our approach is that there is no need to choose the batch size. In the next section, we provide results of experiments that compare the different methods in the linear regression model.

In this section, we investigate the numerical performance of the random scaling method via Monte Carlo experiments. We consider two baseline models: linear regression and logistic regression.
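For comparison with the experiments that follow, here is a sketch of the plug-in route for the linear model, accumulating Ĥ and Ŝ online and inverting Ĥ once at the end (the inversion being the step that is costly when dim(x_t) is large). This is our own code, and details of Chen et al. (2020)'s implementation may differ:

```python
import numpy as np

def sgd_linear_plugin(X, Y, gamma0=0.5, a=0.505):
    """Least squares SGD with an online plug-in sandwich variance estimate.

    H_hat = mean of x_t x_t' and S_hat = mean of x_t x_t' * eps_t^2,
    with eps_t = y_t - x_t' beta_{t-1} computed before each update.
    """
    n, d = X.shape
    beta = np.zeros(d)
    bar = np.zeros(d)
    H = np.zeros((d, d))
    S = np.zeros((d, d))
    for t in range(1, n + 1):
        x, y = X[t - 1], Y[t - 1]
        resid = y - x @ beta                 # residual at the previous iterate
        xx = np.outer(x, x)
        H += xx
        S += xx * resid ** 2
        beta = beta + gamma0 * t ** (-a) * x * resid  # least squares SGD step
        bar = bar * (t - 1) / t + beta / t
    Hinv = np.linalg.inv(H / n)
    Upsilon = Hinv @ (S / n) @ Hinv          # sandwich H^{-1} S H^{-1}
    return bar, Upsilon
```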
We use the Compute Canada Graham cluster, composed of Intel CPUs (Broadwell, Skylake, and Cascade Lake at 2.1GHz-2.5GHz), each assigned 3GB of memory.

Linear Regression. The data are generated from y_t = x_t'β* + ε_t for t = 1, ..., n, where x_t is a d-dimensional covariate vector generated from the multivariate normal distribution N(0, I_d), ε_t is drawn from N(0, 1), and β* is equi-spaced on the interval [0, 1]. This experimental design is the same as that of Zhu et al. (2021). The dimension of x is set to d = 5, 20. We consider different combinations of the learning rate γ_t = γ_0 t^{−a}, with γ_0 = 0.5, 1 and a = 0.505, 0.667. The sample size is set to n = 100000. The initial value β_0 is set to zero. In the case of d = 20, we burn in around 1% of the observations and start estimating β̄_t from t = 1000. Finally, the simulation results are based on 1000 replications. We compare the performance of the proposed random scaling method with the state-of-the-art methods in the literature, especially the plug-in method of Chen et al. (2020) and the recursive batch-means method of Zhu et al. (2021). The performance is measured by three statistics: the coverage rate, the average length of the 95% confidence interval, and the average computation time. Note that the nominal coverage probability is set at 0.95. For brevity, we focus on the first coefficient β_1 hereafter; the results are similar across different coefficients. Figures 1-2 summarize the simulation results, and the complete set of simulation results is reported in the Appendix. In Figure 1, we adopt the same learning rate parameters as in Zhu et al. (2021): γ_0 = 0.5 and a = 0.505. Overall, the performance of the random scaling method is satisfactory. First, the random scaling and plug-in methods show better coverage rates. The coverage rate of the batch-means method deviates by more than 5% from the nominal rate even at n = 100000.
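The data-generating process just described can be reproduced in a few lines; this is our sketch, and the seed and helper name are our choices:

```python
import numpy as np

def make_linear_data(n=100_000, d=5, seed=0):
    """Linear design: x_t ~ N(0, I_d), eps_t ~ N(0, 1),
    beta* equi-spaced on [0, 1]."""
    rng = np.random.default_rng(seed)
    beta_star = np.linspace(0.0, 1.0, d)
    X = rng.standard_normal((n, d))
    Y = X @ beta_star + rng.standard_normal(n)
    return X, Y, beta_star
```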
Second, the batch-means method shows the smallest average confidence interval length, followed by the plug-in and random scaling methods. Third, the plug-in method requires substantially more computation time than the other two methods; the random scaling method takes slightly more computation time than the batch-means method. Finally, we check the robustness of the performance by changing the learning rates in Figure 2, focusing on the case of d = 5. Both the random scaling method and the plug-in method are robust to the changes in the learning rates in terms of the coverage rate. However, the batch-means method converges slowly when a = 0.667, deviating from the nominal rate by about 15% even at n = 100000.

Logistic Regression. We next turn our attention to the logistic regression model y_t = 1(x_t'β* − ε_t ≥ 0), where ε_t follows the standard logistic distribution and 1(·) is the indicator function. We consider a large dimension of x_t (d = 200) as well as d = 5, 20. All other settings are the same as in the linear model. Overall, the simulation results are similar to those for linear regression. Table 2 summarizes the simulation results for a single design. The coverage rates of Random Scale and Plug-in are satisfactory, while that of Batch-mean is 30% lower when d = 200. Random Scale requires more computation time than Batch-mean but is still much faster than Plug-in. The computation time of Random Scale can be substantially reduced when we are interested in inference on a single parameter: in that case, we need to update only a single element of V̂ rather than the whole d × d matrix. In Table 3, we show that Random Scale can easily be scaled up to d = 800 with only 11.7 seconds of computation time when we are interested in inference on a single parameter. Finally, the results in the appendix reinforce our findings from the linear regression design that the performance of Random Scale is less sensitive to the choice of tuning parameters than Batch-mean.
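The single-parameter shortcut mentioned above amounts to running the random scaling recursion on one coordinate only, with scalar running sums in place of the d × d matrices. A minimal sketch (names and state layout are ours):

```python
def scalar_random_scaling_update(state, beta_j, t):
    """O(1) update of the running sums behind the (j, j) element of V.

    state = (bar, A, b, c): running average of coordinate j, plus the
    scalar sums of t^2*bar^2, t^2*bar, and t^2.
    """
    bar, A, b, c = state
    bar = bar * (t - 1) / t + beta_j / t
    A += t * t * bar * bar
    b += t * t * bar
    c += t * t
    return (bar, A, b, c)

def scalar_V(state, n):
    """Assemble the (j, j) element of V from the running sums after n steps."""
    bar, A, b, c = state
    return (A - 2.0 * b * bar + c * bar * bar) / n ** 2
```

Expanding the square shows this equals n^{-2} times the sum over s of s^2 (bar_s - bar_n)^2, the same quantity the matrix recursion tracks for every pair of coordinates.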
In this section, we provide the complete set of simulation results. The R code to replicate the results is also included in this supplemental package.

Linear Regression: Recall that the linear regression model is generated from y_t = x_t'β* + ε_t, where x_t is a d-dimensional covariate vector generated from the multivariate normal distribution N(0, I_d), ε_t is drawn from N(0, 1), and β* is equi-spaced on the interval [0, 1]. The dimension of x is set to d = 5, 20. We consider different combinations of the learning rate γ_t = γ_0 t^{−a} with 1/2 < a < 1, where γ_0 = 0.5, 1 and a = 0.505, 0.667. The sample size is set to n = 10^5. The initial value β_0 is set to zero. In the case of d = 20, we burn in around 1% of the observations and start estimating β̄_t from t = 1000. Finally, the simulation results are based on 1000 replications. Figure 3 shows the coverage rates and the lengths of the confidence intervals for all five coefficients when d = 5. As discussed in the main text, we observe similar behavior across coefficients over different learning rates. Figures 4-5 show the complete paths of coverage rates, confidence interval lengths, and computation times for each design with different learning rates. Tables 4-5 report the same statistics at n = 25000, 50000, 75000, and 100000. Overall, the results are in line with our discussion in the main text. The proposed random scaling method and the plug-in method fit the size well, while the batch-means method shows less precise coverage rates. The plug-in method is at least five times slower than the other two methods. Because the batch size depends on the learning rate parameter a, the batch-means results are sensitive to different learning rates.

Logistic Regression: Recall that the logistic regression model is generated from y_t = 1(x_t'β* − ε_t ≥ 0), where ε_t follows the standard logistic distribution and 1(·) is the indicator function. We consider d = 5, 20, 200. All other settings are the same as in the linear model.
Figures 6-8 show the complete paths of coverage rates, confidence interval lengths, and computation times for each design with different learning rates. Tables 6-8 report the same statistics at n = 25000, 50000, 75000, and 100000. Overall, the results are similar to those of the linear regression model. The proposed random scaling method and the plug-in method fit the size well, while the batch-means method shows less precise coverage rates. The plug-in method is at least five times slower than the other two methods. Because the batch size depends on the learning rate parameter a, the batch-means results are sensitive to different learning rates. For the large model (d = 200), the plug-in method does not perform well in terms of the coverage rate when a = 0.667.

References
Two mixed normal densities from cointegration analysis
Normal approximation for stochastic gradient descent via non-asymptotic rates of martingale CLT
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
Statistical inference for model parameters in stochastic gradient descent
Adaptive subgradient methods for online learning and stochastic optimization
Online bootstrap confidence intervals for the stochastic gradient descent estimator
A central limit theorem for averaged stochastic gradient algorithms in Hilbert spaces and online estimation of the asymptotic variance: application to the geometric median and quantiles
Robust inference on infinite and growing dimensional regression
Martingale Limit Theory and Its Application
Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization
Online learning for latent Dirichlet allocation
Simple robust testing of regression hypotheses
Spatial heteroskedasticity and autocorrelation consistent estimation of covariance matrix
Adam: A method for stochastic optimization
Stochastic approximation with averaging of the iterates: Optimal asymptotic rate of convergence for general processes
HAR inference: Recommendations for practice
Statistical inference for the population landscape via moment-adjusted stochastic gradients
Online learning for matrix factorization and sparse coding
On linear stochastic approximation: Fine-grained Polyak-Ruppert and non-asymptotic concentration
New method of stochastic approximation type
Acceleration of stochastic approximation by averaging
Making gradient descent optimal for strongly convex stochastic optimization
Efficient estimations from a slowly convergent Robbins-Monro process
Uncertainty quantification for online learning and stochastic approximation via hierarchical incremental gradient descent
Fixed-smoothing asymptotics in a two-step generalized method of moments framework
Optimal bandwidth selection in heteroskedasticity-autocorrelation robust testing
Asymptotic and finite-sample properties of estimators based on stochastic gradients
Edgeworth expansions for spectral density estimates and studentized sample mean
Online covariance matrix estimation in stochastic gradient descent
On constructing confidence region for model parameters in stochastic gradient descent via batch means