Nonconvex Stochastic Scaled-Gradient Descent and Generalized Eigenvector Problems
Chris Junchi Li and Michael I. Jordan
December 29, 2021

Abstract: Motivated by the problem of online canonical correlation analysis, we propose the \emph{Stochastic Scaled-Gradient Descent} (SSGD) algorithm for minimizing the expectation of a stochastic function over a generic Riemannian manifold. SSGD generalizes the idea of projected stochastic gradient descent and allows the use of scaled stochastic gradients instead of stochastic gradients. In the special case of a spherical constraint, which arises in generalized eigenvector problems, we establish a nonasymptotic finite-sample bound of $\sqrt{1/T}$, and show that this rate is minimax optimal, up to a polylogarithmic factor of relevant parameters. On the asymptotic side, a novel trajectory-averaging argument allows us to achieve local asymptotic normality with a rate that matches that of Ruppert-Polyak-Juditsky averaging. We bring these ideas together in an application to online canonical correlation analysis, deriving, for the first time in the literature, an optimal one-time-scale algorithm with an explicit rate of local asymptotic convergence to normality. Numerical studies of canonical correlation analysis are also provided for synthetic data.

Nonconvex optimization has become the algorithmic engine powering many recent developments in statistics and machine learning. Advances in both theoretical understanding and algorithmic implementation have motivated the use of nonconvex optimization formulations with very large datasets, and the striking empirical discovery is that nonconvex models can be successful in this setting, despite the pessimism of classical worst-case analysis. In this paper, we consider the following general constrained nonconvex optimization problem:

min_{v ∈ C} F(v),    (1)

where F(v) is a smooth and possibly nonconvex objective function and C is a feasible set. The workhorse algorithm in this setting is stochastic gradient descent (SGD) and its variants (Robbins and Monro, 1951; Qian, 1999; Duchi et al., 2011; Kingma and Ba, 2015; Zhang and Sra, 2016). Given an unbiased estimate ∇F(v; ζ) of the gradient ∇F(v), SGD performs the following update at the t-th step (t ≥ 1):

v_t = Π_C(v_{t−1} − η ∇F(v_{t−1}; ζ_t)),    (2)

where η > 0 is a step size and Π_C is a projection operator onto the feasible set C. SGD updates use only a single data point, or a small number of data points, and thus significantly reduce computational and storage complexities compared with offline algorithms, which require storing the full dataset and evaluating the full gradient at each iteration. In many applications, however, we do not have access to an unbiased estimate of ∇F(v) when we restrict access to a small number of data points. Instead, for each v ∈ C we have access only to a stochastic vector Γ(v; ζ) which is an unbiased estimate of some scaled gradient:

E[Γ(v; ζ)] = D(v) ∇F(v),    (3)

where D(v) is a deterministic positive scalar that depends on the current state v. Examples of this setup arise most notably in generalized eigenvector (GEV) computation, which finds applications in principal component analysis, partial least squares regression, Fisher's linear discriminant analysis, canonical correlation analysis (CCA), and beyond.
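As a concrete, minimal illustration of the difference between the two oracles, the following Python sketch (our own; the helper names are hypothetical and not from the paper) implements the projected update on the unit sphere driven either by an unbiased gradient estimate as in (2) or by a scaled-gradient estimate in the sense of (3).

```python
import numpy as np

def project_sphere(v):
    # Projection onto the unit sphere: Pi_C(v) = v / ||v||.
    return v / np.linalg.norm(v)

def sgd_step(v, grad_estimate, eta):
    # Projected SGD update (2): v_t = Pi_C(v_{t-1} - eta * grad_F(v_{t-1}; zeta_t)).
    return project_sphere(v - eta * grad_estimate)

def ssgd_step(v, scaled_grad_estimate, eta):
    # Same update driven by a scaled-gradient oracle in the sense of (3):
    # E[Gamma(v; zeta)] = D(v) * grad F(v) for a positive scalar D(v), so the
    # direction is unbiased only up to a state-dependent scaling.
    return project_sphere(v - eta * scaled_grad_estimate)
```

The point is that the two updates have the same form; what changes is only the statistical meaning of the search direction, whose expectation is D(v)∇F(v) rather than ∇F(v).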
Despite this wide range of applications, and their particular relevance to large-scale machine learning problems, there exist few rigorous general frameworks for SGD-based online learning with such models. Our approach is a conceptually straightforward extension of SGD: we propose to continue to use (2), but with ∇F(v_{t−1}; ζ_t) replaced by Γ(v_{t−1}; ζ_t). We refer to this algorithm as the Stochastic Scaled-Gradient Descent (SSGD) algorithm. Specifically, at each step, SSGD performs the update:

v_t = Π_C(v_{t−1} − η Γ(v_{t−1}; ζ_t)).    (4)

We provide a theoretical analysis of this algorithm. While some of our analysis applies to the algorithm in full generality, our most useful results arise when we specialize to the online GEV problem. In this case we aim to minimize the (negative) generalized Rayleigh quotient subject to a unit spherical constraint:

min_{∥v∥=1} F(v) = −(v⊤Av)/(v⊤Bv).    (5)

The first-order derivative of the generalized Rayleigh quotient with respect to v is

∇F(v) = −2[(v⊤Bv)Av − (v⊤Av)Bv]/(v⊤Bv)².    (6)

As pointed out by Arora et al. (2012), the major stumbling block in applying SGD to this problem lies in obtaining an unbiased stochastic sample of the gradient (6), due to the fact that the objective function takes a fractional form of two expectations. In our approach we circumvent this issue by simply replacing the denominator on the right-hand side of (6) by the constant 1 and using the following update:

v_t = Π_{S^{d−1}}(v_{t−1} + η[Ã_t v_{t−1}(v_{t−1}⊤B̃_t v_{t−1}) − B̃_t v_{t−1}(v_{t−1}⊤Ã_t v_{t−1})]).    (7)

We refer to the rule (7) as an online GEV iteration (a code sketch of this update appears after our summary of contributions below). In the special case where B is taken as I, (7) essentially reproduces Oja's online PCA algorithm (Oja, 1982) with an incurred O(η²) error term. To identify the iterative algorithm in (7) as a manifestation of SSGD, we rewrite the term in parentheses as in (8) (we set v = v_{t−1} for brevity). To proceed, we take Ã and B̃ to be mutually independent and unbiased stochastic samples of A and B, respectively. It can easily be seen that the expectation of (8) is a scaled gradient of the generalized Rayleigh quotient, where the scaling is the factor (v⊤Bv)²/2. This approach, which has been referred to as double stochastic sampling in the setting of kernel methods (Dai et al., 2014, 2017), makes it possible to develop an efficient stochastic approximation algorithm. Indeed, Ã and B̃ are often of rank one, so the computation of the matrix-vector products Ãv, B̃v invokes only vector-vector inner products and is hence efficient.

Our contributions relative to previous work on nonconvex stochastic optimization are as follows. First, we propose a novel algorithm, the stochastic scaled-gradient descent (SSGD) algorithm, which generalizes the classical SGD algorithm and has a wider range of applications. Second, we provide a local convergence analysis for spherically constrained objective functions that are locally convex. Starting from a warm initialization, our local convergence rate matches a known information-theoretic lower bound (Mei et al., 2018). Third, by applying SSGD to the GEV problem, we give a positive answer to the question raised by Arora et al. (2012) regarding the existence of an efficient online GEV algorithm. Specifically, in the case of CCA, our SSGD algorithm uses as few as two samples at each update and incurs no expensive intermediate computations, while achieving a polynomial convergence rate guarantee.
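The following Python sketch records our reading of the online GEV iteration (7)-(8); it is written for ascent on the generalized Rayleigh quotient, so the sign convention and normalization may differ from the paper's exact displays.

```python
import numpy as np

def online_gev_step(v, A_hat, B_hat, eta):
    # One online GEV iteration in the spirit of (7)-(8). A_hat and B_hat
    # are mutually independent, unbiased samples of A and B, so the
    # bracketed direction has expectation
    #     (v' B v) A v  -  (v' A v) B v,
    # i.e., the Rayleigh-quotient gradient scaled by (v' B v)^2 / 2.
    Av = A_hat @ v
    Bv = B_hat @ v
    direction = Av * (v @ Bv) - Bv * (v @ Av)
    v_new = v + eta * direction            # ascent on the Rayleigh quotient
    return v_new / np.linalg.norm(v_new)   # project back onto the sphere
```

Because the two samples are independent, the direction is an unbiased estimate of the scaled gradient, matching the double-stochastic-sampling discussion above; with B_hat = I it reduces, up to an O(η²) term from the normalization, to Oja's online PCA iteration.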
The generalized eigenvector problem is at the core of many statistical problems such as principal component analysis (Pearson, 1901; Hotelling, 1933), canonical correlation analysis (Hotelling, 1936), Fisher's linear discriminant analysis (Fisher, 1936; Welling, 2005), partial least squares regression (Stone and Brooks, 1990), sufficient dimension reduction (Li, 1991), and mixture models (Balakrishnan et al., 2017), along with their sparse counterparts. Iterative algorithms for sparse principal component analysis have been proposed by Ma (2013) and Yuan and Zhang (2013) as a special case of the eigenvalue problem: by adding a soft-thresholding step to each power-method step, their algorithms achieve linear convergence. In follow-up work, Tan et al. (2018) proposed a truncated Rayleigh flow algorithm to estimate the leading sparse generalized eigenvector that also achieves a linear convergence rate. Additional work on generalized eigenvector computation includes Ge et al. (2016); Allen-Zhu and Li (2017a); Yuan et al. (2019); Ma et al. (2015); Chaudhuri et al. (2009).

Some recent work has focused on developing efficient online procedures for particular instances of generalized eigenvector problems, among which the online estimation of principal and canonical eigenvectors has been of particular interest. Oja's online PCA iteration (Oja, 1982), which can be recovered from (7) as a special case when B is taken to be I, up to an incurred O(η²) error term, has been shown to provably match the minimax information lower bound (Jain et al., 2016; Li et al., 2018; Allen-Zhu and Li, 2017b). There is also a rich literature on stochastic gradient methods for convex and nonconvex minimization on Riemannian manifolds (Ge et al., 2015; Zhang and Sra, 2016); we refer readers to Hosseini and Sra (2020) for a recent survey. More closely related to our work, procedures for efficient online estimation of canonical eigenvectors have been explored (Arora et al., 2017; Gao et al., 2019; Chen et al., 2019). Among these works, Gao et al. (2019) developed a streaming canonical correlation analysis (CCA) algorithm which involves solving a large linear system at each iteration, and independently Arora et al. (2017) proposed a different stochastic CCA algorithm whose temporal and spatial complexities are quadratic in d. Chen et al. (2019) presented a landscape analysis of GEV/CCA and provided continuous-time insight into a class of primal-dual algorithms when the two matrices in GEV commute; the convergence analysis of Chen et al. (2019), however, does not directly translate into discrete-time convergence rate bounds, and no explicit analysis has been provided when the two matrices do not commute. In a recent paper, Bhatia et al. (2018) studied the CCA problem and proposed a two-time-scale online iteration that they refer to as "Gen-Oja." The notion of two-time-scale analysis has been used widely in stochastic control and reinforcement learning (Borkar, 2008; Kushner and Yin, 2003), and the slow process in Gen-Oja is essentially Oja's iteration (Oja, 1982) for online principal component estimation with Markovian noise (Shamir, 2016; Jain et al., 2016; Li et al., 2018; Allen-Zhu and Li, 2017b). Bhatia et al. (2018) obtained a convergence rate under a bounded-sample assumption that achieves the minimax rate 1/√N in terms of the sample size N.
In comparison, our proposed SSGD algorithm is a single-time-scale algorithm with a single step size and an extra requirement of two (independent) samples per iterate. The algorithm is minimax optimal with respect to local convergence and hence theoretically comparable with Gen-Oja.

The rest of this paper is organized as follows. §2 states the settings and assumptions used throughout our theoretical analysis. §3 presents our local convergence results under a warm initialization condition. §4 presents our two-phase convergence results for arbitrary initialization. §5 investigates the asymptotic properties of our algorithm. §6 uses the example of canonical correlation analysis to demonstrate the practical computation and experimental performance of our algorithm. §7 presents the proofs of our theoretical analysis. §8 summarizes the paper. Owing to space limitations, we relegate all secondary lemmas to the Appendix.

Unless indicated otherwise, C denotes a positive, absolute constant which may change from line to line. For two sequences {a_n} and {b_n} of positive scalars, we write a_n ≳ b_n (resp. a_n ≲ b_n) if a_n ≥ Cb_n (resp. a_n ≤ Cb_n) for all n, and a_n ≍ b_n if a_n ≳ b_n and a_n ≲ b_n hold simultaneously. We also write a_n = O(b_n), a_n = Θ(b_n), and a_n = Ω(b_n) for a_n ≲ b_n, a_n ≍ b_n, and a_n ≳ b_n, respectively. We use ∥v∥ to denote the ℓ₂-norm of v. Let λ_max(A), λ_min(A), and ∥A∥ denote the maximal eigenvalue, the minimal eigenvalue, and the operator norm of a real symmetric matrix A. We will explain other notation at its first appearance.

In this section, we present the settings and assumptions required by our theoretical analysis of the SSGD algorithm for nonconvex optimization. To illustrate the core idea we focus on the case of a spherical constraint, v ∈ S^{d−1}, in which case the proposed SSGD iteration (4) reduces to the following update:

v_t = (v_{t−1} − η Γ(v_{t−1}; ζ_t)) / ∥v_{t−1} − η Γ(v_{t−1}; ζ_t)∥.    (9)

Let F_t = σ(ζ_s : s ≤ t) be the filtration generated by the stochastic process ζ_t. Then, from (3), E[Γ(v_{t−1}; ζ_t) | F_{t−1}] = D(v_{t−1}) ∇F(v_{t−1}); that is, the conditional expectation is a scaled gradient. The ensuing analysis is analogous to that of SGD for locally convex objectives given appropriate Lipschitz smoothness of the scalar function D(v), but it requires delicate treatment given that SSGD effectively has a varying step size embodied in the scaling factor.

Following the classical theory of constrained optimization (Nocedal and Wright, 2006), we introduce a definition of manifold gradient and manifold Hessian in the presence of a unit spherical constraint, C : c(v) = ½(∥v∥² − 1) = 0.¹ For this equality-constrained optimization problem, we utilize the method of Lagrange multipliers and introduce the associated Lagrangian function. We then define the manifold gradient g(v) in (10) and the manifold Hessian H(v) in (11) (one standard reconstruction of these displays is sketched at the end of this section). For a unit vector v, we write T(v) for the tangent space of S^{d−1} at v.

¹Here, for notational simplicity, we incorporate a factor of 1/2.

To prove our main theoretical results, we need the following definitions and assumptions. We first define Lipschitz continuity for a generic mapping G: we say that G is L-Lipschitz if ∥G(v₁) − G(v₂)∥_M ≤ L∥v₁ − v₂∥ for all v₁, v₂ in its domain, where ∥·∥_M is any norm properly defined on the space M. In addition, we need the following assumption on the state-dependent scalar D(v) and covariance matrix Σ(v). For a fixed v, define the state-dependent covariance Σ(v) to be the covariance matrix of the stochastic scaled gradient Γ(v; ζ). For the purposes of our analysis, we assume that the state-dependent parameter D(v) and the covariance Σ(v) are Lipschitz continuous over a neighborhood of v*, where v* is a local minimizer of the constrained optimization problem (5) and δ ∈ (0, 1] is a fixed constant. Within this convex, bounded, compact set, we can also show that F(v) and ∇F(v) are Lipschitz continuous. We explicitly specify these constants in the following Assumption 2.1, in which L_D, L_F, L_K, L_Q are fixed positive constants.
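For concreteness, the following LaTeX sketch records one standard reconstruction of the elided displays for the Lagrangian, the manifold gradient (10), and the manifold Hessian (11) under the constraint c(v) = ½(∥v∥² − 1) = 0; this is our reading of the construction, and the paper's exact normalization may differ.

```latex
% One standard Lagrangian treatment of the spherical constraint (a sketch;
% the paper's exact displays (10)-(11) may normalize differently).
\[
\mathcal{L}(v,\lambda) = F(v) - \lambda\, c(v), \qquad
c(v) = \tfrac{1}{2}\bigl(\|v\|^{2} - 1\bigr), \qquad
\lambda^{*}(v) = v^{\top}\nabla F(v),
\]
\[
g(v) = \bigl(I - v v^{\top}\bigr)\nabla F(v), \qquad
H(v) = \nabla^{2} F(v) - \bigl(v^{\top}\nabla F(v)\bigr) I .
\]
```

Here λ*(v) is the multiplier obtained from the stationarity condition ∇F(v) = λv on the unit sphere, so that g(v) is the tangential component of the Euclidean gradient.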
We now posit a tail condition on the stochastic vectors Γ(v_{t−1}; ζ_t), t ≥ 1, namely that they are vector α-sub-Weibull, as in the following assumption.

Assumption 2.2 (Sub-Weibull Tail) For some fixed α ∈ (0, 2] and for all v ∈ C, we assume that the stochastic vectors Γ(v; ζ) satisfy

E exp(∥Γ(v; ζ)∥^α / V^α) ≤ 2,

where V is called the sub-Weibull parameter of the stochastic vector Γ(v; ζ).

Note that the sub-Weibull parameter here is in the vector-norm sense rather than the maximal-projection sense. The class of sub-Weibull distributions contains the sub-Gaussian (α = 2) and sub-Exponential (α = 1) distribution classes as special cases (Wainwright, 2019; Kuchibhotla and Chakrabortty, 2018). Background on vector α-sub-Weibull distributions (and the associated notion of Orlicz ψ_α-norm) is provided in Appendix §A.

In this section we provide the main local convergence result for our SSGD algorithm. Our local analysis is inspired by both generic (Ge et al., 2015) and dynamics-based (Li et al., 2018; Li and Jordan, 2021) analyses of nonconvex stochastic gradient descent, which we further adapt to our scaled-gradient setup. For notational simplicity, we denote D ≡ D(v*), and we define in (12)-(13) the noise covariance at v* and the constant ρ ≡ D(L_H + L_N + B_H/2) + L_D L_G appearing in our bounds.

For our local convergence analysis, we assume that the initialization v₀ falls into a neighborhood of a local minimizer v* of the constrained optimization problem; that is, ∥v₀ − v*∥ is bounded by the constant radius specified in (14), where µ denotes the minimum positive eigenvalue of the manifold Hessian H(v*): v₁⊤H(v*)v₁ ≥ µ for all v₁ ∈ T(v*) with ∥v₁∥ = 1. We note that the initialization condition (14) has a constant neighborhood radius that does not depend on the dimension d. In the ensuing Theorem 3.1 on local convergence, we take ε ∈ (0, 1) and define in (15) the number of rounds K_{η,ε} and, for η < 1/(Dµ), the round length T*_η in (16). We state our local convergence theorem.

Theorem 3.1 (Local Convergence) Given Assumptions 2.1 and 2.2 as well as the initialization condition (14), for any positive constants η, ε that satisfy the scaling condition (17), and for any T ≥ K_{η,ε}T*_η, there exists an event H_{3.1} with probability bounded below as in (18) such that on H_{3.1} the iterates generated by the SSGD algorithm satisfy, for all t ∈ [K_{η,ε}T*_η, T], a high-probability error bound on ∥v_t − v*∥ of order V√η up to polylogarithmic factors, where

G_α ≡ log₂^{1/α}(1 + e^{1/α}) [1 + log₂^{1/α}(1 + e^{1/α})]

is a positive factor depending on α that enters the explicit bound.

To prove Theorem 3.1, we define ∆_t as the projection of v_t − v* onto the tangent space T(v*):

∆_t ≡ (I − v*v*⊤)(v_t − v*).

We view every T*_η = Θ((Dµ)^{−1}η^{−1}) iterations as one round and interpret K_{η,ε} = Θ(log(η^{−1})) as the number of rounds. Note that K_{η,ε}T*_η can be interpreted as the burn-in time for v_t to arrive in an O(η^{1/2}) neighborhood of the local minimizer v*. We present a proposition that provides an upper bound on ∥∆_t∥ over T iterations and characterizes the descent in ∥∆_t∥ at the end of each round.

Proposition 3.2 Assume that Assumptions 2.1 and 2.2 and the initialization condition (14) hold. For any positive constants η, ε satisfying the scaling condition (17) and any T ≥ 1, with the probability lower bound made explicit in §7.1, we have

∥∆_t∥ ≤ ∥v_t − v*∥ ≤ √2 ∥∆_t∥    (19)

and

∥∆_t∥ ≤ max{2∥∆₀∥, Θ(η^{1/2})}.    (20)

Moreover, if T*_η ∈ [0, T], we have

∥∆_{T*_η}∥ ≤ max{∥∆₀∥/2, Θ(η^{1/2})}.    (21)

The proof of Proposition 3.2 is provided in §7.1. By choosing an asymptotic regime such that εT log(1/ε) → 0, Proposition 3.2 states that (19), (20), and (21) hold with probability tending to one. On that high-probability event, (19) indicates that ∥v_t − v*∥ and its projection ∥∆_t∥ in the tangent space bound each other up to constant factors; (20) guarantees that ∥∆_t∥ does not exceed max{2∥∆₀∥, Θ(η^{1/2})}, that is, v_t stays in a neighborhood of the local minimizer v*; and (21) states that, for ∥∆₀∥ = Ω(η^{1/2}), ∥∆_t∥ decreases by half after T*_η iterations: ∥∆_{T*_η}∥ ≤ max{∥∆₀∥/2, Θ(η^{1/2})}. Proposition 3.2 studies ∥∆_t∥ in a single round, i.e., for T*_η iterations.
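Schematically, Theorem 3.1 follows from Proposition 3.2 by iterating the halving bound (21) over rounds of length T*_η; in our notation (a sketch, not a verbatim display from the paper):

```latex
\[
\|\Delta_{K_{\eta,\epsilon} T^{*}_{\eta}}\|
  \;\le\; \max\Bigl\{\, 2^{-K_{\eta,\epsilon}}\,\|\Delta_{0}\|,\;
                        \Theta\bigl(\eta^{1/2}\bigr) \Bigr\},
\]
% so K_{\eta,\epsilon} = \Theta(\log(1/\eta)) rounds suffice to reach the
% O(\eta^{1/2}) noise floor, after which (20) keeps the iterates there.
```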
We are ready to provide the proof of Theorem 3.1 by applying Proposition 3.2 repeatedly for K_{η,ε} rounds, detailed as follows.

Proof of Theorem 3.1. The algorithm iteration (4) can be viewed as a discrete-time (strong) Markov process. We recall the definition of K_{η,ε} in (15) and repeatedly apply Proposition 3.2 to the sequence {∆_t} for K_{η,ε} rounds, initializing each round with the output ∆_{T*_η} of the previous round. We adopt an adaptive shrinkage argument over multiple rounds. More specifically, for any t ∈ [K_{η,ε}T*_η, T], we first apply (21) in Proposition 3.2 for K_{η,ε} rounds, then apply (20) for the remaining t − K_{η,ε}T*_η iterations, and use (19) to convert between ∥∆_t∥ and ∥v_t − v*∥; the last inequality in the resulting chain is due to the initialization condition (14). Here G_α is the fixed positive factor depending on α defined in Theorem 3.1. Taking a union bound over the K_{η,ε} rounds completes the proof of Theorem 3.1.

Theorem 3.1 establishes the local convergence of v_t in a neighborhood of v* for a fixed step size η and a number of iterations T ≥ K_{η,ε}T*_η. The following corollary provides a finite-sample bound.

Corollary 3.3 Under the conditions of Theorem 3.1, there exists a high-probability event H_{3.3} such that on H_{3.3} the iterates generated by the SSGD algorithm satisfy the finite-sample bound stated therein.

We note that Theorem 3.1 and Corollary 3.3 provide a dimension-free local convergence rate when V is O(1). As we will see later in the example of CCA, the (α = 1/2) sub-Weibull parameter V in that case scales with √d, and thus the local rate is the minimax-optimal rate O(√(d/T)) up to a polylogarithmic factor.

In many situations, finding a warm initialization can itself be a difficult problem. We borrow techniques from Ge et al. (2015) and establish a global convergence result for escaping saddle points via SSGD. In this section we consider a variant of SSGD with a unit spherical constraint, equipped with an artificial noise-injection step: let n_t be an independent spherical noise at each step, independent of F_{t−1} and ζ_t, and let

v_t = Π_{S^{d−1}}(v_{t−1} − η(Γ(v_{t−1}; ζ_t) + n_t)).    (22)

Motivated by recent work on escaping saddle points (Ge et al., 2015; Lee et al., 2016; Jin et al., 2019), one can show that the SSGD algorithm equipped with the aforementioned artificial noise injection escapes from all saddle points, and hence the initialization condition (14) can be dropped. First, we generalize Assumption 2.1 for local convergence to the following condition for global convergence.

Definition 4.1 (Strict-Saddle Function) A twice-differentiable function F(v) with constraint c(v) = 0 is called a (µ, β, γ, δ)-strict-saddle function if every point v with c(v) = 0 satisfies at least one of the following: (i) the manifold gradient is large, ∥g(v)∥ ≥ γ; (ii) v is δ-close to a local minimizer v* around which the manifold Hessian is bounded below by µ in the tangent space; (iii) the manifold Hessian has a strictly negative eigenvalue, λ_min(H(v)) ≤ −β.

In what follows, we show that our algorithm can escape from all saddle points, and thus local initialization is no longer required. We are ready to present the saddle-point-escaping result.

Theorem 4.2 (Escaping from Saddle Points) Let Assumptions 2.2 and 4.1 hold. Let F(v) be a (µ, β, γ, δ)-strict-saddle function with finite sup-norm ∥F∥_∞. Let T₁ be as defined in (23). Then, for any κ > 0 and any step size η > 0 satisfying the scaling condition stated there, within T₁ · log₂(κ^{−1}) iterates, the update (22) outputs a v_t that satisfies (ii) in Definition 4.1 with probability no less than 1 − κ.

The proof of Theorem 4.2 is collected in §7.2. Motivated by this saddle-point-escaping result, one can run SSGD first with a burn-in phase and, once it enters the warm-initialization region, re-run SSGD with step sizes chosen so that the local convergence theorem applies immediately. Using the strong Markov property and combining Theorems 3.1 and 4.2, we immediately obtain the following main theorem. Recall that T₁ is defined as in (23).
Theorem 4.3 (Global Convergence) Let the assumptions of Theorems 3.1 and 4.2 hold, and let η, ε be positive constants satisfying the corresponding scaling conditions. Then there exists a high-probability event A_T such that on A_T the iterates generated by the noise-injected SSGD algorithm satisfy, for all t in the admissible range, the same form of error bound as in Theorem 3.1, where

G_α ≡ log₂^{1/α}(1 + e^{1/α}) [1 + log₂^{1/α}(1 + e^{1/α})]

is a positive factor depending on α.

Note that the class of strict-saddle functions is strictly more general than the locally convex setting of the local convergence Theorem 3.1. We obtain the final complexity by interpreting Theorem 4.3. In the asymptotic relations below we write out the dependence on d and η, and we let L be a generic quantity involving only a polylogarithmic factor of d, η, and T, which is allowed to vary at each appearance. From (15), (16), and (23), if V is set to the model scaling √d, the iteration achieves a high-probability bound of L·√(dη) after K_{η,ε}T*_η + T₁·log₂(κ^{−1}) steps. We conclude that, under the scaling condition L·d/T → 0, if the total number of samples T is given, we can optimize the choice of step size η = η(d, T) to conclude the following convergence rates:

• Local convergence: Given a warm initialization, and choosing η(T) ≍ L·(1/T), SSGD (4) has a local convergence rate of order √(d/T), up to polylogarithmic factors.

• Global convergence: Given an arbitrary initialization, and choosing η(T) ≍ L·(1/√(dT)), SSGD with noise injection (22) has a global convergence rate of order (d/T)^{1/4}, up to polylogarithmic factors, obtained by substituting this step size into the L·√(dη) bound.

We defer the arguments for the proof to §7.2, and turn to the application to the GEV problem. We need to verify that the objective function for the GEV problem indeed lies in the class of strict-saddle functions. For the generalized eigenvector problem, the objective function of interest is (26), namely the spherically constrained problem (5), where A and B are two real symmetric matrices with B strictly positive-definite. We make one additional mild assumption on the eigenstructure of the matrices A and B (Assumption 4.4): the generalized eigenvalues λ₁ > λ₂ > ··· > λ_d of the pair (A, B) are distinct. As our argument proceeds, one can safely assume that B^{−1/2}AB^{−1/2} is diagonal without loss of generality, and we assume this unless otherwise specified. Under Assumption 4.4 we denote the minimal gap of the λ_i's as

λ_gap ≡ min_{1 ≤ i ≤ d−1} (λ_i − λ_{i+1}).

In the following proposition, we verify that the objective function F(v) in the GEV problem (26) satisfies Assumption 2.1.

Proposition 4.5 Assumption 2.1 holds for F(v) in the GEV problem (26), with constants given explicitly in the proof.

The proof of Proposition 4.5 is deferred to §7.3. With the Lipschitz parameters given above, we consider the initialization condition (14). The neighborhood radius on the right-hand side of (14) can be viewed as a function of δ that is maximized at some δ* ∈ (0, 1) when all other constants are fixed. The region covered by the local convergence analysis is maximized with such a choice of δ*. Now we prove that, under the mild Assumption 4.4, the objective function for the generalized eigenvector problem is strict-saddle in the sense of Definition 4.1 if the parameters are chosen properly.

Proposition 4.6 Under Assumption 4.4, the only local minimizers of (26) are ±e₁, and the function satisfies the (µ, β, γ, δ)-strict-saddle condition for the parameter choices specified in (28).

To verify the strict-saddle parameters and conclude Proposition 4.6, we first establish the relevant properties of the objective function of the eigenvector problem.

Lemma 4.7 Under Assumption 4.4, and with the choices of parameters as in (28), the three conditions of Definition 4.1 hold.

It is straightforward from Definition 4.1 of the strict-saddle property that Lemma 4.7 immediately yields Proposition 4.6. We postpone the details to §B. Intuitively, the parameters depend only on the differences of consecutive (generalized) eigenvalues λ₁ − λ₂, ..., λ_{d−1} − λ_d, since we can shift each λ_i by an arbitrary constant and keep the constrained optimization problem (26) unchanged.
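Since Theorems 4.2 and 4.3 now apply to the GEV objective, we make the noise-injected update (22) concrete with a minimal Python sketch of one step; the helper names are ours, and sampling the perturbation uniformly on a sphere of radius σ√d (so that its covariance is σ²I_d) is our implementation choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def spherical_noise(d, sigma):
    # Uniform direction scaled so that Cov(n) = sigma^2 I_d: a uniform
    # vector on the radius-r sphere has covariance (r^2 / d) I_d.
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    return sigma * np.sqrt(d) * u

def noisy_ssgd_step(v, scaled_grad, eta, sigma):
    # Our reading of update (22): an artificial perturbation n_t,
    # independent of the data, drives escape from strict saddle points.
    n = spherical_noise(len(v), sigma)
    v_new = v - eta * (scaled_grad + n)
    return v_new / np.linalg.norm(v_new)
```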
We also remark that, within our analysis, the parameters in (28) might not be the sharpest possible choices. However, we do provide, to the best of our knowledge, the first identification of strict-saddle parameters for the GEV problem, and hence Theorems 4.2 and 4.3 apply (given the proper tail conditions on the stochastic noise).

In this section, we return to the warm-initialization setting of §3. Ruppert (1988) and Polyak and Juditsky (1992) introduced the idea of trajectory averaging for stochastic gradient descent in order to provide fine-grained convergence rates along with an asymptotic normality result. Our goal is to generalize the Polyak-Juditsky analysis of SGD with trajectory averaging to SSGD for nonconvex objectives initialized in a locally convex region. We denote by

M* ≡ (I − v*v*⊤) H(v*) (I − v*v*⊤)

the projection of the manifold Hessian onto the tangent space T(v*). From the initialization condition (14), we have u⊤M*u ≥ µ∥u∥² for all u ∈ T(v*). We consider the eigendecomposition M* = P diag(λ₁, ..., λ_{d−1}, 0) P⊤ for an orthogonal matrix P ∈ R^{d×d} and eigenvalues λ₁ ≥ ··· ≥ λ_{d−1} > 0 with minimum positive eigenvalue λ_{d−1} ≥ µ. We take the inverse of all positive eigenvalues and define the matrix

M*⁻ ≡ P diag(λ₁^{−1}, ..., λ_{d−1}^{−1}, 0) P⊤.

Here, M*⁻ can be interpreted as the inverse of M* in the (d − 1)-dimensional tangent space T(v*), and we can easily verify M*⁻v* = 0. As shown in Theorem 3.1, we need K_{η,ε}T*_η iterations for v_t to fall into a Θ(η^{1/2}) neighborhood of the local minimizer v*. For T ≥ K_{η,ε}T*_η, we define the trajectory average over times K_{η,ε}T*_η + 1, ..., T as

v̄_T^{(η)} ≡ (T − K_{η,ε}T*_η)^{−1} Σ_{t=K_{η,ε}T*_η+1}^{T} v_t,    (32)

where we add the superscript (η) to emphasize the dependence on η. Notice that {v̄_T^{(η)}}_{T,η} is a triangular array over a continuum of η. To obtain asymptotic normality of the trajectory average v̄_T^{(η)}, we additionally impose Assumption 5.1, a local Lipschitz-continuity condition on the stochastic scaled gradient Γ(v; ζ) in a neighborhood of v*. The following theorem states that the trajectory average v̄_T^{(η)} is asymptotically normal.

Theorem 5.2 (Asymptotic Normality) Given Assumptions 2.1, 2.2, and 5.1 and the initialization condition (14), if we choose the step size η such that η → 0 as the total sample size T → ∞ in the manner prescribed by condition (34), then we obtain the Gaussian convergence in distribution stated in (35).

We relegate the proof details of Theorem 5.2 to §7.4. The analysis has the same rationale as the classical asymptotic normality result obtained when minimizing a strongly convex objective function in a Euclidean space using stochastic gradient descent (Ruppert, 1988; Polyak and Juditsky, 1992). Indeed, in the case of a diminishing step size, η(t) ∝ t^{−α}, α ∈ (1/2, 1), SGD with trajectory averaging converges in distribution to a normal distribution. In contrast, due to our choice of a constant step size that is asymptotically small, with η ∝ T^{−α} up to a polylogarithmic factor, we base our analysis on the idea that trajectory averaging begins only after "the burn-in phase," that is, after K_{η,ε}T*_η iterates.

The GEV problem arises in many statistical machine learning tasks. We focus on the example of (rank-one) canonical correlation analysis (CCA) as a core application; we refer to Tan et al. (2018) for other (sparse, high-dimensional) applications, including linear discriminant analysis and sliced inverse regression. Recall that CCA aims at maximizing the correlation between two transformed vectors.
Given two column vectors X and Y, let Σ_XY be the cross-covariance matrix between X and Y, and let Σ_XX and Σ_YY be the covariance matrices of X and Y, respectively. CCA is a special case of the GEV problem (5) with

A = [ 0, Σ_XY ; Σ_YX, 0 ],    B = [ Σ_XX, 0 ; 0, Σ_YY ].

Algorithm 1: Online Canonical Correlation Analysis via Stochastic Scaled-Gradient Descent (Noisy)
  Input: total sample size T, proper step size η, initialization v₀
  for t = 1, ..., T/2 do
    Draw mutually independent sample pairs (X, Y) and (X′, Y′) from the sampling oracle
    Compute the unbiased estimates Â_t, B̂_t as in (36)
    Sample a uniformly spherical noise n_t of covariance σ²I_d and update v_t via the rule (22)
  Return v_T

To obtain Â, B̂ as mutually independent and unbiased stochastic samples of A and B, we draw two independent pairs of samples (X, Y), (X′, Y′) at each iteration and compute

Â = [ 0, XY⊤ ; YX⊤, 0 ],    B̂ = [ X′X′⊤, 0 ; 0, Y′Y′⊤ ],    (36)

where all samples of X, Y are centered so that they have expectation zero. In order to apply the convergence results for the SSGD algorithm to the CCA problem, it remains to verify Assumption 2.2. We assume that the samples X ∈ R^{d_x}, Y ∈ R^{d_y} follow sub-Gaussian distributions (Gao et al., 2019; Li et al., 2018) with parameters V_x, V_y; that is, E exp(∥X∥²/V_x²) ≤ 2 and E exp(∥Y∥²/V_y²) ≤ 2. With these standard assumptions on the samples X, Y, the following result shows that the scaled-gradient noise in the CCA problem satisfies Assumption 2.2 with appropriate V and α. The proof is provided in §7.5.

Proposition 6.1 Assumption 2.2 holds for CCA with parameters V = 400(V_x² + V_y²)V_xV_y and α = 1/2.

Propositions 4.5 and 6.1 certify that Assumptions 2.1 and 2.2 hold in the CCA setting, and hence the local convergence Corollary 3.3 applies, establishing a √(d/T) rate up to a polylogarithmic factor, since the vector sub-Weibull parameter V in Assumption 2.2 implicitly contains a factor of √d. We now demonstrate that our bounds in Corollary 3.3 match the lower bound. Gao et al. (2019) derived a lower bound for Gaussian variables, 1 − align(v, v*) ≳ d/T, in terms of a measure of error align(v, v*) defined therein. It is easy to verify that 1 − align(v, v*) ≳ 1 − v⊤v* = ∥v − v*∥²/2 when both v, v* lie on the unit sphere, in which case the lower bound translates into ∥v_T − v*∥ ≳ √(d/T) for any estimator v_T that consumes T samples; this matches the upper bound of Corollary 3.3 in terms of both d and T. We note that Corollary 3.3 and the results of Gao et al. (2019) have different dimension dependencies, owing to a distinct but connected set of assumptions. We have assumed that each sample X, Y follows a vector sub-Gaussian distribution and satisfies the conditions of Assumption 2.2 as required by Proposition 6.1, whereas Gao et al. (2019) assume that each coordinate of X, Y is sub-Gaussian with a constant parameter. Hence, the vector sub-Gaussian parameter V in our case suffers a dimension-dependent prefactor.

In this subsection, we present simulation results for SSGD in the case of rank-one CCA (Algorithm 1). The dimensions of the synthetic data samples are taken to be d₁ = 65 for X and d₂ = 70 for Y. We generate the covariance matrices for X and Y from diagonal matrices A₁, A₂, with each entry along the diagonal obtained as an independent uniform draw from [0, 1], adjusting the construction to ensure that the eigengap of Σ_YY is significantly large, in particular no less than 0.5. Here A₃ is a d₁ × d₂ matrix in which each entry is generated as an independent N(0, 1/(d₁ + d₂)) variable, and the cross-covariance is constructed from its SVD together with the factors Σ^{1/2}. Note that each step of Algorithm 1 can be computed in time O(d₁ + d₂).
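The following Python sketch assembles the pieces of Algorithm 1 end to end for rank-one CCA; the function names, the ascent sign convention, and the Gaussian surrogate for the spherical noise are our assumptions rather than the paper's exact implementation.

```python
import numpy as np

def ssgd_cca(sample_pair, dx, dy, T, eta, sigma=0.0, seed=0):
    """Sketch of Algorithm 1: online rank-one CCA via (noisy) SSGD.

    sample_pair() returns one centered draw (X, Y), X in R^dx, Y in R^dy.
    Two independent pairs are drawn per iteration to form mutually
    independent unbiased estimates of A = [[0, S_XY], [S_YX, 0]] and
    B = diag(S_XX, S_YY); each step costs O(dx + dy).
    """
    rng = np.random.default_rng(seed)
    d = dx + dy
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    for _ in range(T // 2):                       # two sample pairs per update
        X, Y = sample_pair()
        Xp, Yp = sample_pair()                    # independent second pair
        vx, vy = v[:dx], v[dx:]
        Av = np.concatenate([X * (Y @ vy), Y * (X @ vx)])      # A_hat @ v
        Bv = np.concatenate([Xp * (Xp @ vx), Yp * (Yp @ vy)])  # B_hat @ v
        g = Av * (v @ Bv) - Bv * (v @ Av)         # scaled ascent direction
        n = sigma * rng.standard_normal(d)        # Gaussian noise surrogate
        v = v + eta * (g + n)
        v /= np.linalg.norm(v)
    return v
```

In the synthetic setting of the next subsection one would take dx = 65, dy = 70, T = 1e6, and η = log(T)/(5T).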
Given this setup, we report our numerical findings for Algorithm 1 as follows.

Saddle-point escaping. We first discuss the behavior of our algorithm in the presence of saddle points. When v₀ is chosen exactly at a saddle point, we show that SSGD escapes from a plateau of saddle points in the landscape and converges to the local (and global) minimizer. For illustrative purposes, the initialization v₀ is chosen from four saddle points, each of which corresponds to a component of CCA. We choose the total sample size T = 1e6 and set the (constant) step size η = log(T)/(5T). In Figure 1 we plot the error of the current solution relative to the optimal solution, where the error is measured both in squared Euclidean distance and in sine-squared. The first two plots show the behavior initialized from the four different saddle points, and the last two plots show the behavior initialized from four uniform seeds. The horizontal axis is the number of iterates and the vertical axis is the error ∥v_t − v*∥².

Relationship between the step size and the squared error. We next study the role of the step size η in our SSGD algorithm. We set the sample size T = 1e6 and choose 20 values of η ranging from 1e-5 to 5e-4. We numerically demonstrate that, at stationarity, SSGD exhibits a squared error ∥v − v*∥² or sin²(v, v*) that has a linear relationship with η. We compute the averaged squared error over the last 10% of iterates for each run and plot the results in Figure 3 on a log-log scale; the horizontal axes of Figures 3(a) and 3(b) represent the step size η, and the vertical axes are the squared error ∥v − v*∥² and sin²(v, v*), respectively. Due to ergodicity in the final phase of the algorithm, this average provides a feasible estimate of the variance around the local (and global) minimizer. Moreover, the slope in Figure 3 fitted by least squares is 0.9921 (fairly close to 1), which corroborates our theoretical convergence results in Theorems 3.1 and 4.3. These numerical findings are consistent with our theory that the squared error ∥v − v*∥² at stationarity has a linear relationship with η.

In this section, we provide detailed proofs of our main results. This subsection provides a proof of Proposition 3.2 on convergence to a local minimizer. Under the initialization condition (14), there exists a local minimizer v* within the prescribed radius of v₀. For a positive quantity M to be determined later, let

T_M ≡ inf{t ≥ 1 : ∥Γ(v_{t−1}; ζ_t)∥ > M}.

In words, T_M is the first t such that the norm of the stochastic scaled gradient Γ(v_{t−1}; ζ_t) exceeds M. We first provide the following lemma.

Lemma 7.1 Assume all conditions of Theorem 3.1. For any positive ε, let M = V log^{1/α}(ε^{−1}). Then we have P(T_M ≤ T) ≤ 2Tε.

The proof of Lemma 7.1 is a straightforward corollary of a union bound and Assumption 2.2, and is provided in §C.1. Recall the definitions of the manifold gradient g(v) and the Hessian H(v) in (10) and (11). Under the unit spherical constraint, their definitions simplify as in (41); taking derivatives yields the decomposition (42) of the drift into H(v) and an additional term N(v). The following lemma shows that g(v), H(v), N(v) are Lipschitz continuous.

Lemma 7.2 Given Assumption 2.1, the maps g(v), H(v), and N(v) are Lipschitz continuous, with respective constants L_G, L_H, and L_N.

A proof of Lemma 7.2 is deferred to §C.2. For notational simplicity, we denote H* = H(v*) and N* = N(v*), and recall that F_t is the filtration generated by ζ_t. Then we have the following lemma.

Lemma 7.3 When η ≤ 1/(5M), on the event (∥Γ(v_{t−1}; ζ_t)∥ ≤ M), the update (9) can be written as

v_t = v_{t−1} − ηD(H* + N*)(v_{t−1} − v*) + ηξ_t + ηR_t + η²Q_t,    (43)

where {ξ_t} forms a vector-valued martingale difference sequence with respect to F_t, ξ_t is α-sub-Weibull with parameter G_αV, R_t satisfies ∥R_t∥ ≤ ρ∥v_{t−1} − v*∥², and Q_t satisfies ∥Q_t∥ ≤ 7M².

The proof of Lemma 7.3 is deferred to §C.3.
We define the projection of v_t − v* onto T(v*) as

∆_t ≡ (I − v*v*⊤)(v_t − v*),    (44)

and the projection of H* onto T(v*) as

M* ≡ (I − v*v*⊤) H* (I − v*v*⊤).    (45)

Lemma 7.4 Under the initialization condition (14), the following properties hold:
(i) For all t ≥ 0,

∥∆_t∥² ≤ ∥v_t − v*∥² ≤ 2∥∆_t∥².    (46)

(ii) When η ≤ 1/(DB_H), for all u ∈ T(v*) and all t ≥ 0,

∥(I − ηDM*)^t u∥ ≤ (1 − ηDµ)^t ∥u∥.    (47)

The proof of Lemma 7.4 is deferred to §C.4. To interpret Lemma 7.4(i), write v_t⊤v* = cos θ; then (46) is equivalent to the trigonometric inequality

sin²θ = 4 sin²(θ/2) cos²(θ/2) ≤ 4 sin²(θ/2) = 2(1 − cos θ) ≤ 2(1 − cos θ)(1 + cos θ) = 2 sin²θ.

By combining Lemmas 7.3 and 7.4, we have the following lemma for the update rule in terms of ∆_t.

Lemma 7.5 Under Assumptions 2.1 and 2.2 and the initialization condition (14), when η ≤ 1/(5M), on the event (∥Γ(v_{t−1}; ζ_t)∥ ≤ M), the update (9) can be written in terms of ∆_t as

∆_t = (I − ηDM*)∆_{t−1} + ηχ_t + ηS_t + η²P_t,    (48)

where χ_t, S_t, P_t ∈ T(v*), {χ_t} forms a vector-valued martingale difference sequence with respect to F_t, χ_t is α-sub-Weibull with parameter G_αV, S_t satisfies ∥S_t∥ ≤ ρ∥v_{t−1} − v*∥², and P_t satisfies ∥P_t∥ ≤ 7M².

The proof of Lemma 7.5 is deferred to §C.5. Here we have ρ = D(L_H + L_N + B_H/2) + L_DL_G, which is consistent with its definition in (13). Now, to analyze the iteration for ∆_t we need to control its tail behavior. We define in (49) truncated versions χ̃_t, S̃_t, P̃_t of the terms in (48), let ∆̃₀ = ∆₀, and define the coupled process ∆̃_t iteratively via the recursion (50). The ∆̃_t iteration avoids the potential issues arising in the summation over P_t. We conclude the following lemma, which characterizes the coupling relation ∆̃_t = ∆_t and allows us to analyze the coupled iteration ∆̃_t.

Lemma 7.6 For each t ≥ 0 we have ∆̃_t = ∆_t on the event (T_M > t). Furthermore, for all t ≥ 1, the explicit representation (51) of ∆̃_t, obtained by iterating (50), holds.

We defer the proof of Lemma 7.6 to §C.6. Next we provide a lemma that tightly characterizes the approximation in (51), namely that ∆̃_t ≈ (I − ηDM*)^t∆₀.

Lemma 7.7 Let η ≤ min{1/(DB_H), 1/(5M)} and T ≥ 1. Then, with probability at least the level quantified in §C.7, the error terms in the representation (51) satisfy the stated bounds uniformly over t ∈ [0, T].

The proof of Lemma 7.7 is provided in §C.7. In the following lemma we prove that, when the initial iterate v₀ is sufficiently close to the minimizer v* and r is appropriately chosen to depend on ∥∆₀∥ and Θ(η^{1/2}), the conditioning event occurs almost surely on a high-probability event.

Lemma 7.8 Under the initialization condition (14), for any positive η, ε satisfying the scaling condition (17), with probability at least the level stated in §C.8, the iteration keeps ∥∆_t∥ ≤ 2∥∆₀∥ unless v_t is within a noisy neighborhood of the local minimizer v*, where we recall the definition of ∆_t in (44).

Lemma 7.8 is proved in §C.8. Finally, Proposition 3.2 is proved by combining Lemmas 7.4 and 7.8.

In this subsection, we aim to prove Theorem 4.2. To deal with points with strong gradient, corresponding to (i) in Definition 4.1, we use the following result, adapted from Ge et al. (2015, Lemma 38).

Proposition 7.9 Assume all conditions of Theorem 4.2 as well as 2dV²L_GD + η < β. Then the expected-decrease bound (52) holds.

A core problem involves escaping from saddle points, corresponding to (iii) in Definition 4.1; for this we establish the following modification of Ge et al. (2015, Lemma 40).

Proposition 7.10 Assume all conditions of Theorem 4.2 as well as 2ησ²L_GdD + ε < β. Then, on the event specified in the proof, there is a stopping time T(v₀) ≤ T_max almost surely such that the expected-decrease bound (53) holds, where T_max is fixed, independent of v₀, and defined in (54).

The proofs of Propositions 7.9 and 7.10 are straightforward generalizations of the corresponding proofs in Ge et al. (2015), and hence we omit the details.

Proof. [Proof of Theorem 4.2] While this proof could be carried out in a fashion similar to Theorem 36 of Ge et al. (2015), here we provide a different proof using stopping-time techniques.
(i) Given (24), we split the state space S^{d−1} into three distinct regions Q₁, Q₂, and Q₃ = (Q₁ ∪ Q₂)^c, corresponding respectively to conditions (i), (iii), and (ii) of Definition 4.1. Define a stochastic process {T_i} with T₀ = 0, whose increments are determined by the region containing v_{T_i} and by the stopping time T(v_{T_i}) ≤ T_max defined in Proposition 7.10. By (52) in Proposition 7.9 and (53) in Proposition 7.10, we know that a uniform expected decrease of F holds on (v_{T_i} ∈ Q₁); combining the above two displays and (54), we obtain the analogous expected decrease on {v_{T_i} ∈ Q₁ ∪ Q₂}.

(ii) Let I ∈ [0, ∞] be the (random) first index i such that v_{T_i} ∈ (Q₁ ∪ Q₂)^c. We conclude immediately that (I > i) ∈ F_{T_i} and (I > i) ⊆ (v_{T_i} ∈ Q₁ ∪ Q₂). Applying (55) gives a tail bound in which T ≥ 0 is an arbitrary constant; plugging in T = T₁ as in (23) shows that the event (T_I < T₁) has probability at least 1/2. In words, with probability at least 1/2, the iteration v_t must enter (Q₁ ∪ Q₂)^c by time T₁ at least once.

(iii) Noting that the argument above holds for all initial points v₀ ∈ Q₁ ∪ Q₂, one can use the Markov property and conclude that within T₁ · log₂(κ^{−1}) steps, where T₁ was defined in (23), the iteration {v_t} must enter (Q₁ ∪ Q₂)^c at least once with probability at least 1 − κ. The rest of the proof follows from the definition of a strict-saddle function.

Proof. [Proof of Theorem 4.3] The conclusion is reached by directly combining Theorems 3.1 and 4.2, setting A_T = H_{3.1}, along with an application of the strong Markov property.

Proof. [Proof of Proposition 4.5] For the GEV problem, the gradient and the Hessian of the objective function F(v) take the explicit forms obtained by differentiating (26), which indicates that D(v) has Lipschitz constant L_D ≡ 2∥B∥². Second, we introduce an arbitrary unit vector w and take the derivative of the vector ∇²F(v)w with respect to v. The five terms on the right-hand side have norms bounded by explicit quantities involving ∥A∥, ∥B∥, powers of (1 − δ), and λ_min(B), with overall denominator (1 − δ)⁴λ⁴_min(B), which implies the claimed Lipschitz constant for the Hessian; similar bounds hold for the remaining constants.

To prove Theorem 5.2, we first present the following Lemma 7.11 on a linear representation of the trajectory average.

Lemma 7.11 (Representation Lemma) Under Assumptions 2.1 and 2.2 and given the initialization condition (14), for any T ≥ K_{η,ε}T*_η and positive constants η, ε satisfying the scaling condition (17), the representation (56) of the averaged error holds, where χ_t, S_t, P_t are vectors in the tangent space T(v*). Here χ_t is defined as in (57) and is α-sub-Weibull with parameter G_αV; the sequence {χ_t} forms a vector-valued martingale difference sequence with respect to F_t; and S_t satisfies ∥S_t∥ ≤ ρ∥v_{t−1} − v*∥². On the event H_{3.1} defined in Theorem 3.1, using a total sample size of T + 1, each P_t satisfies ∥P_t∥ ≤ 7V² log^{2/α}(ε^{−1}).

Proof. [Proof of Lemma 7.11] Telescoping (48) in Lemma 7.5 for t = K_{η,ε}T*_η + 2, ..., T + 1 and plugging in the definitions of ∆_t and v̄_T^{(η)} in (44) and (32) gives (56). For the event H_{3.1} defined in Theorem 3.1 with total sample size T + 1, the proof of Lemma 7.8 in §C.8 shows that H_{3.1} ⊆ {∥Γ(v_{t−1}; ζ_t)∥ ≤ M : 1 ≤ t ≤ T + 2}. The rest of Lemma 7.11 follows directly from Lemma 7.5.

With Lemma 7.11 in hand, we are ready to prove Theorem 5.2.

Proof. [Proof of Theorem 5.2] For a given T, we apply Theorem 3.1 and Lemma 7.11 with ε = 1/T², so that P(H_{3.1}) → 1 and the scaling condition (17) is satisfied under condition (34).
Using a coupling approach, we can safely ignore the small-probability event and concentrate on the event H_{3.1}, on which the bounds of Theorem 3.1 are in force. Using the relation ∥∆_t∥ ≤ ∥v_t − v*∥ ≤ √2∥∆_t∥ given in Proposition 3.2, and applying Theorem 3.1 on the event H_{3.1}, we also obtain control of the averaged error terms. Under condition (34), as T → ∞ and η → 0, the corresponding almost-sure convergences hold. From (12) and (57), the covariance matrix of ξ_t (i.e., of the projection of the scaled-gradient noise onto the tangent space T(v*)) can be written in terms of Σ(v); we denote the covariance matrix at the local minimizer v* accordingly. Using the central limit theorem and Slutsky's theorem, we have the corresponding convergence-in-distribution result under condition (34) as T → ∞, η → 0. Combining these results with (56) in Lemma 7.11, under condition (34), as T → ∞, η → 0, we obtain the convergence in distribution (59), which omits the asymptotic analysis in the direction parallel to v*. To study the asymptotics in that direction, we apply Theorem 3.1: on the event H_{3.1} we obtain the corresponding bound, where in the second step we use the first condition in (34). Under condition (34), as T → ∞, η → 0, we have the almost-sure convergence (60) in the parallel direction. Adding up (59) and (60) and applying Slutsky's theorem, we conclude (35) and Theorem 5.2.

Proof. [Proof of Proposition 6.1] For notational simplicity, we write a vector v ∈ R^{d_x+d_y} as v = (v_x, v_y) with v_x ∈ R^{d_x}, v_y ∈ R^{d_y}. For any vectors w₁, w₂ ∈ R^{d_x} with ∥w₁∥ ≤ 1, ∥w₂∥ ≤ 1, using Lemma A.2 we can bound the Orlicz norm of w₁⊤XX⊤w₂. Combining all of the above inequalities and using Lemma A.1 yields a bound on the Orlicz norm of v⊤Âv. By applying Lemmas A.2 and A.3, the stochastic scaled gradient in the CCA problem satisfies the required sub-Weibull tail bound. Hence Assumption 2.2 holds with V = 400(V_x² + V_y²)V_xV_y and α = 1/2.

We have presented the Stochastic Scaled-Gradient Descent (SSGD) algorithm for minimizing a constrained nonconvex objective function. Compared with classical stochastic gradient descent, our method requires access only to an unbiased estimate of a scaled gradient, enabling a broader range of applications. The proposed algorithm requires only a single pass through the data and is memory-efficient, with storage complexity depending linearly on the ambient dimensionality of the problem. For a class of nonconvex stochastic optimization problems, we established local convergence rates of the proposed algorithm to local minimizers, and we proved asymptotic normality of the trajectory average. We also investigated the rate of escape from saddle points for SSGD on the unit sphere. An application of the generalized eigenvector problem to canonical correlation analysis was investigated both theoretically and numerically. In the near future, we will study the global convergence of SSGD on generic Riemannian manifolds, as well as explore alternative methods for other applications.

In a style similar to Li and Jordan (2021, §E), we collect in this section some facts about the Orlicz ψ_α-norm for our usage. We start with its definition.

Definition A.1 (Orlicz ψ_α-norm) For a continuous, monotonically increasing, and convex function ψ(x) defined for all x > 0 satisfying ψ(0) = 0 and lim_{x→∞} ψ(x) = ∞, we define the Orlicz ψ-norm of a random variable X as

∥X∥_ψ ≡ inf{K > 0 : E ψ(|X|/K) ≤ 1}.

As a commonly used special case, we consider the function ψ_α(x) ≡ exp(x^α) − 1 and define the Orlicz ψ_α-norm of a random variable X as ∥X∥_{ψ_α} accordingly.

Lemma A.1 When ψ(x) is monotonically increasing and convex for x > 0, for any random variables X, Y with finite Orlicz ψ-norm, the triangle inequality

∥X + Y∥_ψ ≤ ∥X∥_ψ + ∥Y∥_ψ

holds. For all α ≥ 1, the above inequality holds when ∥·∥_ψ is taken as the Orlicz ψ_α-norm.
Proof. [Proof of Lemma A.1] Let K₁, K₂ denote the Orlicz ψ-norms of X and Y. Because ψ(x) is monotonically increasing and convex, we have

E ψ(|X + Y|/(K₁ + K₂)) ≤ (K₁/(K₁ + K₂)) E ψ(|X|/K₁) + (K₂/(K₁ + K₂)) E ψ(|Y|/K₂) ≤ 1,

yielding the lemma.

Lemma A.2 Let X and Y be random variables with finite ψ_α-norm for some α ≥ 1. Then

∥XY∥_{ψ_{α/2}} ≤ ∥X∥_{ψ_α} ∥Y∥_{ψ_α}.

Proof. Using the elementary inequality |AB| ≤ ¼(|A| + |B|)² with A = X/∥X∥_{ψ_α} and B = Y/∥Y∥_{ψ_α}, together with the triangle inequality in Lemma A.1, we bound the ψ_{α/2}-norm of AB by 1. Multiplying both sides of the inequality by ∥X∥_{ψ_α}∥Y∥_{ψ_α} gives the desired result.

Lemma A.3 For any random variables X, Y with finite Orlicz ψ_α-norm, the following inequalities hold:

∥X + Y∥_{ψ_α} ≤ log₂^{1/α}(1 + e^{1/α}) (∥X∥_{ψ_α} + ∥Y∥_{ψ_α}),
∥EX∥_{ψ_α} ≤ log₂^{1/α}(1 + e^{1/α}) ∥X∥_{ψ_α},
∥X − EX∥_{ψ_α} ≤ log₂^{1/α}(1 + e^{1/α}) [1 + log₂^{1/α}(1 + e^{1/α})] ∥X∥_{ψ_α}.

Proof. [Proof of Lemma A.3] Recall that when α ∈ (0, 1), ψ_α(x) is not convex for x near 0. Let ψ̃_α(x) be the modification of ψ_α that is linear on [0, x*] and agrees with ψ_α on [x*, ∞), for some appropriate x* > 0 chosen so as to make the function convex. Here x* is chosen such that the tangent line of ψ_α at x* passes through the origin, i.e., ψ_α(x*) = x*ψ′_α(x*); simplifying gives the transcendental equation (1 − αx*^α) exp(x*^α) = 1. We easily find that x*^α ≤ 1/α. Because ψ_α(x) is concave on (0, (1/α − 1)^{1/α}) and convex on ((1/α − 1)^{1/α}, ∞), we have ψ_α(x) ≥ ψ̃_α(x) ≥ 0 for all x ≥ 0, and hence the norm comparison (61) follows. Let K₁, K₂ denote the Orlicz ψ_α-norms of X and Y. By applying the triangle inequality in Lemma A.1 to ψ̃_α and using (61), we obtain the first claim. By applying Jensen's inequality to a suitable concave function J_α, we obtain the second claim, and consequently

∥X − EX∥_{ψ_α} ≤ log₂^{1/α}(1 + e^{1/α}) (∥X∥_{ψ_α} + ∥EX∥_{ψ_α}) ≤ log₂^{1/α}(1 + e^{1/α}) [1 + log₂^{1/α}(1 + e^{1/α})] ∥X∥_{ψ_α}.

Now we proceed with the definition of the Orlicz ψ_α-norm for random vectors.

Definition A.2 For a random vector X ∈ R^d, its Orlicz ψ_α-norm is defined as

∥X∥_{ψ_α} ≡ inf{K > 0 : E ψ_α(∥X∥/K) ≤ 1}.

Given the above definition, a random vector X is called sub-Gaussian if ∥X∥_{ψ_2} < ∞, and sub-Exponential if ∥X∥_{ψ_1} < ∞.

Remark A.4 We note that ∥X∥_{ψ_α} equals the Orlicz ψ_α-norm of the random scalar ∥X∥. Using this relation, we can easily extend all of the above results for random variables to random vectors, with the same positive factors and dependence on α.

The goal of this section is to detail the proof of Lemma 4.7, which estimates the strict-saddle parameters. We first compute the manifold gradient and Hessian in the following lemma.

Lemma B.1 The manifold gradient and Hessian can be computed as in (62) and (63).

Proof. The constrained optimization problem has c(x) = ∥x∥² − 1, so the Lagrangian is L(x; µ) = F(x) − µc(x). According to the theory of constrained optimization in Nocedal and Wright (2006), since (i) there is one constraint and (ii) the gradient ∇c(x) = 2x has constant norm 2 on the constraint set, a 2-RLICQ condition is satisfied. The feasible value of the Lagrange multiplier µ*(x) is determined by stationarity: we have ∇L(x; µ) = −2Λ(x)x − 2µx, and solving the stationarity condition gives µ*(x) for x ∈ S^{d−1}. The manifold gradient can then be computed directly, concluding (62). The manifold Hessian is computed analogously, which proves (63) and concludes the lemma.

We next prove Hessian smoothness and give Lipschitz constants for both the manifold gradient and Hessian, as in the following lemmas.

Lemma B.2 There are Lipschitz constants such that for all z, z₁, z₂ ∈ S^{d−1} the bounds (64) and (65) hold; in addition, (66) follows from these two. In fact, in this lemma one can replace ∥A∥ by ∥A − cB∥ for any constant scalar c.

Proof. Direct computation gives a derivative whose norm is bounded by 36∥A∥∥B∥²/λ³_min(B), and a difference whose norm is thus bounded by 20∥A∥∥B∥³/λ⁴_min(B); applying the mean value theorem once more concludes (65) via the definition of the operator norm.
Lastly, to conclude (66), we utilize the properties of the projection matrices: ∥P_{T(z)}∥ ≤ 1 and ∥P_{T(z₁)} − P_{T(z₂)}∥ ≤ ∥z₁ − z₂∥; (66) then follows from matrix operator theory. This completes the proof.

We now explore what the small-gradient condition ∥g(x)∥ ≤ γ, where g(·) is defined in (62), means for a point x in the GEV problem. We first analyze the case where B is the identity matrix, which reduces to the classical eigenvector problem. Define for convenience the quantity γ₁ as in (67).

Lemma B.3 When B = I, ∥w∥ = 1, and γ₁ ∈ (0, 1/2) is an arbitrary constant, the two conditions

∥Λw − (w⊤Λw)w∥ ≤ λ_gap γ₁

and, for some j = 1, ..., d (for consistency we define λ₀ = λ₁ and λ_{d+1} = λ_d),

w⊤Λw ∈ [(λ_{j−1} + λ_j)/2, (λ_j + λ_{j+1})/2],

together imply (e_j⊤w)² ≥ 1 − 4γ₁².

Proof. For the rest of this proof, denote w_i = e_i⊤w. The claim then follows from a direct computation with the two displayed conditions.

To study the case of general B, we first introduce an auxiliary lemma.

Lemma B.4 Let ∥·∥₁, ∥·∥₂ be two equivalent norms; that is, there are constants C_L, C_U > 0 such that C_L∥v∥₂ ≤ ∥v∥₁ ≤ C_U∥v∥₂ for every nonzero vector v. Then, for two given nonzero vectors w₁, w₂, the distance between their ∥·∥₁-normalizations is controlled by the distance between their ∥·∥₂-normalizations, up to a constant factor depending on C_L, C_U.

Proof. Without loss of generality, set ∥w₁∥₂ = 1 = ∥w₂∥₂; the claim then follows from the triangle inequality.

We conclude the following lemma.

Lemma B.5 For x ∈ S^{d−1}, ∥g(x)∥ ≤ γ implies that there exists at least one j = 1, ..., d such that x is aligned with the j-th generalized eigenvector in the sense of (69); furthermore, there exists at least one j such that the Euclidean alignment bound (72) holds.

Proof. Since B is positive definite, letting w = B^{1/2}x/∥B^{1/2}x∥ in Lemma B.3, we have ∥w∥ = 1. Note that (71) gives γ₁ ∈ (0, 1/2), and hence applying Lemma B.3 shows that there is at least one j = 1, ..., d such that (e_j⊤w)² ≥ 1 − 4γ₁². Translating this back in terms of x concludes (69). To conclude (72), define ⟨z₁, z₂⟩_B ≡ z₁⊤Bz₂ and ∥z∥_B ≡ ⟨z, z⟩_B^{1/2}; then (69) yields an alignment bound in the B-norm. Using this and applying Lemma B.4 with ∥·∥₁ = ∥·∥ and ∥·∥₂ = ∥·∥_B, we obtain the corresponding bound in the Euclidean norm for the two nonzero vectors x and ∓v_j = ∓∥e_j⊤B^{−1/2}∥^{−1}B^{−1/2}e_j.

(i) We have Ax = (x⊤Ax/x⊤Bx)Bx if and only if g(x) = 0. For Λ = B^{−1/2}AB^{−1/2} diagonal (without loss of generality), one can verify the required curvature bounds for j = 2, ..., d and v_j on the unit sphere with Av_j = λ_jBv_j. In doing so, we use the fact that v₁ ≠ ±v_j (otherwise 0 = v_j⊤Bv₁ = ±v₁⊤Bv₁, which would force v₁ = 0 by the positive definiteness of B), and the conclusion follows from the two resulting displays. (ii) To handle points that are close to ±v₁, Lemma B.5 gives that for x ∈ S^{d−1}, ∥g(x)∥ ≤ γ implies there exists at least one j = 1, ..., d such that x is aligned with ∓v_j. Without loss of generality, we suppose the minus sign is taken. Then, given the definition of γ₁ in (67) and combining with (72), the required bound holds, where we applied ∥P_{T(x)}v₁∥ ≤ 1. This completes the proof of Lemma 4.7, combining with the definition of β in (28).

C Deferred Auxiliary Proofs of §7.1

We collect here the deferred auxiliary proofs of §7.1.

Proof. [Proof of Lemma 7.1] Since M = V log^{1/α}(ε^{−1}), we have from Assumption 2.2 that for each t ≥ 1, P(∥Γ(v_{t−1}; ζ_t)∥ > M) ≤ 2ε, where we apply the Markov inequality and Assumption 2.2 (together with the law of total expectation). Taking a union bound over t = 1, ..., T completes the proof.

Proof. [Proof of Lemma 7.2] For all u, v ∈ S^{d−1}, the claimed Lipschitz bounds follow by direct computation from Assumption 2.1.

Proof. [Proof of Lemma 7.3] We have by a Taylor series expansion that, for any y ∈ R satisfying |y| ≤ 1/2,

|(1 − y)^{−1/2} − 1 − y/2| ≤ (3y²/8) Σ_{k=0}^∞ |y|^k ≤ 3y²/4.

When η ≤ 1/(5M), on the event (∥Γ(v_{t−1}; ζ_t)∥ ≤ M), letting y = 2ηv_{t−1}⊤Γ(v_{t−1}; ζ_t) − η²∥Γ(v_{t−1}; ζ_t)∥², we have

|y| ≤ 2η∥Γ(v_{t−1}; ζ_t)∥ + η²∥Γ(v_{t−1}; ζ_t)∥² ≤ 2ηM + η²M² ≤ (11/5)ηM < 1/2,

and hence combining the above two displays gives the expansion (73). By defining ξ_t = (I − v_{t−1}v_{t−1}⊤)(Γ(v_{t−1}; ζ_t) − D(v_{t−1})∇F(v_{t−1})), the update formula (9) is equivalent to an expansion with a second-order remainder Q_t; using (73), we have

∥Q_t∥ ≤ η^{−2} · 5η²M² · (1 + ηM) + M² ≤ 7M².

Recall that we denote D = D(v*), H* = H(v*), N* = N(v*). By collecting the remaining first-order terms into R_t, we have

v_t = v_{t−1} − ηD(H* + N*)(v_{t−1} − v*) + ηξ_t + ηR_t + η²Q_t.

Since (I − v_{t−1}v_{t−1}⊤) is F_{t−1}-measurable, we know that E[ξ_t | F_{t−1}] = 0, and hence {ξ_t} is a vector-valued martingale difference sequence. Additionally, we have ∥I − v_{t−1}v_{t−1}⊤∥ ≤ 1, and hence from Assumption 2.2 and Lemma A.3 we conclude that ξ_t is α-sub-Weibull with parameter G_αV.
Finally, we apply the mean value theorem using (41) and g(v*) = 0 to obtain the bound ∥R_t∥ ≤ ρ∥v_{t−1} − v*∥², where we use the Lipschitz continuity of D(v), g(v), H(v), N(v). This completes the proof of Lemma 7.3.

Proof. [Proof of Lemma 7.4] Under the initialization condition (14), we have the following:

(i) For any unit vector v, since ∥v∥ = ∥v*∥ = 1, the Pythagorean theorem yields a decomposition of ∥v − v*∥² into its tangent and normal components at v*. Combining the resulting equalities and plugging in v = v_t gives a quadratic relation that admits the claimed bounds (46), given v_t⊤v* ≥ 0.

(ii) Under the initialization condition (14), for all u ∈ T(v*) we have u⊤H*u ≥ µ∥u∥². Hence, for η ≤ 1/(DB_H), we have ∥(I − ηDM*)^{1/2}u∥ ≤ (1 − ηDµ)^{1/2}∥u∥, which is (78). Noticing that (I − ηDM*)^{(t−1)/2}u ∈ T(v*) for all t ≥ 1, we can inductively substitute (I − ηDM*)^{(t−1)/2}u for u in (78) and obtain, for each t ≥ 0,

∥(I − ηDM*)^t u∥ ≤ (1 − ηDµ)^t ∥u∥.

C.5 Proof of Lemma 7.5

Proof. [Proof of Lemma 7.5] Left-multiplying (43) in Lemma 7.3 by (I − v*v*⊤) and noticing (I − v*v*⊤)N* = 0, we obtain a recursion for ∆_t. We use the decomposition in which (I − v*v*⊤)H*∆_t = M*∆_t and, based on Lemma 7.4 and ∥H*∥ ≤ B_H, the remaining terms can be absorbed into suitably defined χ_t, S_t, P_t. Combining all of the results above, we have

∆_t = (I − ηDM*)∆_{t−1} + ηχ_t + ηS_t + η²P_t,

which proves (48). The rest of Lemma 7.5 can be verified by steps similar to those in the proof of Lemma 7.3.
C.6 Proof of Lemma 7.6

Proof. [Proof of Lemma 7.6] For t = 0 the lemma holds by definition. In general, if it holds for t − 1, then from the definitions in (49) we have, on the event (t < T_M), that S̃_s = S_s and P̃_s = P_s for all s ≤ t, so the conclusion holds for t.
Iteratively applying (50), we obtain (51), which concludes the lemma.

C.7 Proof of Lemma 7.7

Proof. [Proof of Lemma 7.7] For any fixed t ≥ 0, we estimate each term of (51). For the first term on the right-hand side of (51), a modification of the results of Fan et al. (2012) provides a concentration inequality for α-sub-Weibull random vectors, which gives the stated high-probability bound on the martingale sum. For the second term on the right-hand side of (51), applying (47) in Lemma 7.4 and using Lemma 7.5, given ∥v_{s−1} − v*∥ ≤ r for all s = 1, ..., t, we obtain the corresponding deterministic bound. For the third term on the right-hand side of (51), from Lemma 7.5 we know ∥P̃_t∥ ≤ 7M², and the resulting geometric sum is controlled using the definition of M in (39). The lemma is concluded by combining the above three bounds and taking a union bound over the probabilities.

C.8 Proof of Lemma 7.8

Proof. [Proof of Lemma 7.8] From the given assumptions, under the scaling condition (17), the required parameter relations hold. Let J be the event that (51) holds with the bounds of Lemma 7.7 for each t ∈ [0, T]. On the event J, under the scaling condition (17), and because ∥∆₀∥ ≤ r/2, for each t ∈ [0, T] we obtain the claimed bound on ∥∆̃_t∥. Applying Lemma 7.7 and taking a union bound controls the probability of J. Furthermore, using (47) in Lemma 7.4 and the definition of T*_η in (16), we obtain the per-round contraction. In Lemma 7.6 we have shown that, on the event (T < T_M), we have ∆̃_t = ∆_t, and in Lemma 7.1 we have proved P(T < T_M) ≥ 1 − 2Tε. Together with Lemma 7.7, we take an intersection of these events and obtain the claim. At this point we have proved all elements of Lemma 7.8.

Acknowledgments

We thank the Department of Electrical Engineering and Computer Sciences at UC Berkeley for COVID-19 accommodations, during which time this work was completed. We thank Tong Zhang, Huizhuo Yuan, and Yuren Zhou for inspiring discussions at various stages of this project. This work was supported in part by the Mathematical Data Science program of the Office of Naval Research under grant number N00014-18-1-2764.

References

Allen-Zhu and Li (2017a). Doubly accelerated methods for faster CCA and generalized eigendecomposition.
Allen-Zhu and Li (2017b). First efficient convergence for streaming k-PCA: a global, gap-free, and near-optimal rate.
Arora et al. (2012). Stochastic optimization for PCA and PLS.
Arora et al. (2017). Stochastic approximation for canonical correlation analysis.
Balakrishnan et al. (2017). Statistical guarantees for the EM algorithm: from population to sample-based analysis.
Bhatia et al. (2018). Gen-Oja: simple & efficient algorithm for streaming generalized eigenvector computation.
Borkar (2008). Stochastic Approximation: A Dynamical Systems Viewpoint.
Chaudhuri et al. (2009). Multi-view clustering via canonical correlation analysis.
Chen et al. (2019). On constrained nonconvex stochastic optimization: a case study for generalized eigenvalue decomposition.
Dai et al. (2014). Scalable kernel methods via doubly stochastic gradients.
Dai et al. (2017). Learning from conditional distributions via dual embeddings.
Duchi et al. (2011). Adaptive subgradient methods for online learning and stochastic optimization.
Fan et al. (2012). Large deviation exponential inequalities for supermartingales. Electronic Communications in Probability.
Fisher (1936). The use of multiple measurements in taxonomic problems.
Gao et al. (2019). Stochastic canonical correlation analysis.
Ge et al. (2015). Escaping from saddle points: online stochastic gradient for tensor decomposition.
Ge et al. (2016). Efficient algorithms for large-scale generalized eigenvector computation and canonical correlation analysis.
Hosseini and Sra (2020). Recent advances in stochastic Riemannian optimization. In Handbook of Variational Methods for Nonlinear Geometric Data.
Hotelling (1933). Analysis of a complex of statistical variables into principal components.
Hotelling (1936). Relations between two sets of variates.
Jain et al. (2016). Streaming PCA: matching matrix Bernstein and near-optimal finite sample guarantees for Oja's algorithm.
Jin et al. (2019). On nonconvex optimization for machine learning: gradients, stochasticity, and saddle points.
Kingma and Ba (2015). Adam: a method for stochastic optimization.
Kuchibhotla and Chakrabortty (2018). Moving beyond sub-Gaussianity in high-dimensional statistics: applications in covariance estimation and linear regression.
Kushner and Yin (2003). Stochastic Approximation and Recursive Algorithms and Applications.
Lee et al. (2016). Gradient descent only converges to minimizers.
Li and Jordan (2021). Stochastic approximation for online tensorial independent component analysis.
Li et al. (2018). Near-optimal stochastic approximation for online principal component estimation.
Li (1991). Sliced inverse regression for dimension reduction.
Ma (2013). Sparse principal component analysis and iterative thresholding.
Ma et al. (2015). Finding linear structure in large datasets with scalable canonical correlation analysis.
Mei et al. (2018). The landscape of empirical risk for nonconvex losses.
Nocedal and Wright (2006). Numerical Optimization.
Oja (1982). Simplified neuron model as a principal component analyzer.
Pearson (1901). On lines and planes of closest fit to systems of points in space.
Polyak and Juditsky (1992). Acceleration of stochastic approximation by averaging.
Qian (1999). On the momentum term in gradient descent learning algorithms.
Robbins and Monro (1951). A stochastic approximation method.
Ruppert (1988). Efficient estimations from a slowly convergent Robbins-Monro process.
Shamir (2016). Convergence of stochastic gradient descent for PCA.
Stone and Brooks (1990). Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression.
Tan et al. (2018). Sparse generalized eigenvalue problem: optimal statistical rates via truncated Rayleigh flow.
Wainwright (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint.
Welling (2005). Fisher linear discriminant analysis.
Yuan and Zhang (2013). Truncated power method for sparse eigenvalue problems.
Yuan et al. (2019). A decomposition algorithm for the sparse generalized eigenvalue problem.
Zhang and Sra (2016). First-order methods for geodesically convex optimization.