


General Low-rank Matrix Optimization: Geometric Analysis and Sharper Bounds

Haixiang Zhang
Department of Mathematics
University of California, Berkeley
Berkeley, CA 94704
[email protected]

Yingjie Bi
Department of IEOR
University of California, Berkeley
Berkeley, CA 94704
[email protected]

Javad Lavaei
Department of IEOR
University of California, Berkeley
Berkeley, CA 94704
[email protected]

Abstract

This paper considers the global geometry of general low-rank minimization problems via the Burer-Monteiro factorization approach. For the rank-1 case, we prove that there is no spurious second-order critical point for both symmetric and asymmetric problems if the rank-2 RIP constant δ is less than 1/2. Combining with a counterexample with δ = 1/2, we show that the derived bound is the sharpest possible. For the arbitrary rank-r case, the same property is established when the rank-2r RIP constant δ is at most 1/3. We design a counterexample to show that the non-existence of spurious second-order critical points may not hold if δ is at least 1/2. In addition, for any problem with δ between 1/3 and 1/2, we prove that all second-order critical points have a positive correlation to the ground truth. Finally, the strict saddle property, which can lead to the polynomial-time global convergence of various algorithms, is established for both the symmetric and asymmetric problems when the rank-2r RIP constant δ is less than 1/3. The results of this paper significantly extend several existing bounds in the literature.

1 Introduction

Given the natural numbers n, m and r, consider the low-rank matrix optimization problems

min_{M ∈ R^{n×n}} fs(M)   s.t.   rank(M) ≤ r,   M^T = M,   M ⪰ 0        (1)

and

min_{M ∈ R^{n×m}} fa(M)   s.t.   rank(M) ≤ r,        (2)

where the functions fs(·) and fa(·) are twice continuously differentiable. Problems (1)-(2) are referred to as the symmetric and the asymmetric problems, respectively. In addition, we call these problems linear if the objective function is induced by a linear measurement operator, i.e.,

f(M) = (1/2)‖A(M) − b‖²F

for some vector b ∈ R^p and linear operator A mapping each matrix M to a vector in R^p, where f(M) denotes either fs(M) or fa(M). Those problems not fitting into the above model are called nonlinear. One common example with non-linearity is the one-bit matrix sensing problem; please see Zhu et al. (2018); Li et al. (2019); Zhu et al. (2021) for more concrete discussions.
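
As a concrete illustration of the linear model, the following minimal numpy sketch (our own addition, not from the paper; the Gaussian operator and all names are illustrative assumptions) evaluates f(M) = (1/2)‖A(M) − b‖²F for a measurement operator A(M)_i = ⟨A_i, M⟩:

```python
import numpy as np

def make_linear_objective(A_mats, b):
    """f(M) = 0.5 * ||A(M) - b||^2 with A(M)_i = <A_i, M> = sum(A_i * M)."""
    def f(M):
        residual = np.array([np.sum(Ai * M) for Ai in A_mats]) - b
        return 0.5 * residual @ residual
    return f

# Toy instance: p Gaussian sensing matrices and a rank-r ground truth M*.
rng = np.random.default_rng(0)
n, m, r, p = 6, 5, 2, 40
A_mats = [rng.standard_normal((n, m)) for _ in range(p)]
M_star = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
b = np.array([np.sum(Ai * M_star) for Ai in A_mats])  # noiseless measurements
f = make_linear_objective(A_mats, b)
print(f(M_star))  # 0 at the ground truth, consistent with Assumption 1 below
```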

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Low-rank optimization problems arise in a wide range of applications, e.g., matrix completion (Candès & Recht, 2009; Recht et al., 2010), phase synchronization (Singer, 2011; Boumal, 2016), and phase retrieval (Shechtman et al., 2015); see Chen & Chi (2018); Chi et al. (2019) for an overview of the topic. To overcome the non-convex rank constraint, one may resort to convex relaxations. The approach of replacing the rank constraint with a nuclear norm regularization is proven to provide the optimal sample complexity (Candès & Recht, 2009; Recht et al., 2010; Candès & Tao, 2010). However, solving the convexified problems involves computing a Singular Value Decomposition (SVD) in each iteration and results in a heavy computational burden; see the numerical comparison in Zheng & Lafferty (2015). Along with the issue of large space complexity, the convexification approach is impractical for large-scale problems. Therefore, it is important to design efficient alternative methods with similar theoretical guarantees. Another line of work generalizes techniques from Orthogonal Matching Pursuit (OMP) to the low-rank matrix problem (Shalev-Shwartz et al., 2011; Axiotis & Sviridenko, 2020), which also require implementing the SVD r times in their algorithms.

1.1 Burer-Monteiro factorization and basic properties

Instead of directly solving convex relaxations of problems (1)-(2), we consider a computationally efficient approach, namely the Burer-Monteiro factorization (Burer & Monteiro, 2003). The factorization approach is based on the observation that any matrix M ∈ R^{n×m} with rank at most r can be written in the form UV^T, where U ∈ R^{n×r} and V ∈ R^{m×r}. Then, the asymmetric problem (2) is equivalent to

min_{U ∈ R^{n×r}, V ∈ R^{m×r}} ha(U, V),        (3)

where ha(U, V) := fa(UV^T). Similarly, the symmetric problem (1) is equivalent to

min_{U ∈ R^{n×r}} hs(U),        (4)

where hs(U) := fs(UU^T). The Burer-Monteiro factorization provides a natural parameterization of the low-rank structure of the unknown solution, and reformulates problems (1)-(2) as unconstrained optimization problems. In addition, the number of variables reduces from O(n²) or O(nm) to as low as O(rn) or O(r(n+m)) when r ≪ min{n, m}. However, the reformulated problems are highly non-convex, and NP-hard to solve in general. On the other hand, these problems share a specific non-convex structure, which makes it possible to utilize the structure and design efficient algorithms to find a global optimum under some conditions. In addition to the special structure, a regularity condition, named the Restricted Isometry Property, can be used to guarantee the convergence of common iterative algorithms. We state the following two definitions only in the context of the symmetric problem since the corresponding definitions for the asymmetric problem are similar.

Definition 1 (Recht et al. (2010); Zhu et al. (2018)). Given natural numbers r and t, the function fs(·) is said to satisfy the Restricted Isometry Property (RIP) of rank (2r, 2t) for a constant δ ∈ [0, 1), denoted as δ-RIP2r,2t, if

(1 − δ)‖K‖²F ≤ [∇²fs(M)](K, K) ≤ (1 + δ)‖K‖²F

holds for all matrices M, K ∈ R^{n×n} such that rank(M) ≤ 2r and rank(K) ≤ 2t, where [∇²fs(M)](·, ·) is the curvature of the Hessian at the point M.
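
For a linear objective, [∇²fs(M)](K, K) = ‖A(K)‖² is independent of M, so the RIP condition becomes a spectral bound on A restricted to low-rank directions. Computing the exact RIP constant is intractable in general; the sketch below (our own illustrative probe, assuming sensing matrices scaled so that E‖A(K)‖² = ‖K‖²F, e.g., i.i.d. N(0, 1/p) entries) merely estimates a lower bound on δ by random sampling:

```python
import numpy as np

def rip_lower_bound(A_mats, n, m, t, trials=2000, seed=1):
    """Monte Carlo lower bound on the rank-2t RIP constant of the linear
    map A(K)_i = <A_i, K>: sample random rank-2t directions K and track
    the worst deviation of ||A(K)||^2 / ||K||_F^2 from 1."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        K = rng.standard_normal((n, 2 * t)) @ rng.standard_normal((2 * t, m))
        ratio = sum(np.sum(Ai * K) ** 2 for Ai in A_mats) / np.sum(K * K)
        worst = max(worst, abs(ratio - 1.0))
    return worst  # any delta-RIP certificate must have delta >= worst
```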

The RIP condition appears in a variety of applications of the low-rank matrix optimization problem. For instance, in the case of linear measurements with a Gaussian model, Candès & Recht (2009) showed that O(nr/δ²) samples are enough to ensure the δ-RIP2r,2r property with high probability. Please see the survey paper by Chi et al. (2019) for more examples. In certain applications, even though the RIP condition cannot be established over the whole low-rank manifold, we are able to establish similar strongly convex and smooth conditions on part of the manifold. If the iteration points of algorithms are constrained to or regularized (either explicitly or implicitly) towards those benign regions, the proof techniques in this work may still be applicable. Examples include the phase retrieval problem (Ma et al., 2018) and the matrix completion problem (Chen et al., 2020). However, the analysis of the case when the strong convexity does not hold is usually application-specific and cannot be generalized to general low-rank problems. Moreover, the RIP assumption is standard in the literature on general low-rank matrix optimization problems. Furthermore, if we drop the strong convexity assumption, we are unable to achieve linear convergence in general (Bhojanapalli et al., 2016a). The work by Zhang et al. (2019) shows that the existence of RIP is enough to obtain guarantees on the local landscape of the problem, and the size of this local region depends on the RIP constant, which can be anything between 0 and 1 (however, the provided bounds on the RIP constant are not sharp). Although we aim to obtain sharp bounds on the RIP constant for the global landscape of the problem in this paper, we believe that our analysis can be adapted to obtain sharp RIP bounds for local regions. We leave the precise derivation to a future work since it needs a number of lemmas and we have space restrictions. We note that the RIP property is equivalent to the restricted strongly convex and smooth property defined in Wang et al. (2017); Park et al. (2018); Zhu et al. (2021) with the condition number (1 + δ)/(1 − δ). Intuitively, the RIP property implies that the Hessian matrix is close to the identity tensor when the perturbation is restricted to be low-rank. This intuition naturally leads to the following definition.

Definition 2 (Bi & Lavaei (2021)). Given a natural number r, the function fs(·) is said to satisfy the Bounded Difference Property (BDP) of rank 2r for a constant κ ≥ 0, denoted as κ-BDP2r, if

|[∇²fs(M) − ∇²fs(M′)](K, L)| ≤ κ‖K‖F ‖L‖F

holds for all matrices M, M′, K, L ∈ R^{n×n} such that rank(M), rank(M′), rank(K), rank(L) ≤ 2r.

It has been proven in Bi & Lavaei (2021, Theorem 1) that those functions satisfying the δ-RIP2r,2r property also satisfy the 4δ-BDP2r property. With the RIP property, there are basically two categories of algorithms that can solve the factorized problem in polynomial time. Algorithms in the first category require a careful initialization so that the initial point is already in a small neighbourhood of a global optimum, and a certain local regularity condition in the neighbourhood ensures that local search algorithms will converge linearly to a global optimum; see Tu et al. (2016); Bhojanapalli et al. (2016a); Park et al. (2018) for a detailed discussion. The other class of algorithms is able to converge globally from a random initialization. The convergence of these algorithms is usually established via the geometric analysis of the landscape of the objective function. One of the important geometric properties is the strict saddle property (Sun et al., 2018), which combined with the smoothness properties can guarantee the global polynomial-time convergence for various saddle-escaping algorithms (Jin et al., 2017, 2018; Sun et al., 2018; Huang & Becker, 2019). For the linear case, Ge et al. (2016, 2017) proved the strict saddle property for both problems (3)-(4) when the RIP constant is sufficiently small. More recently, Zhu et al. (2021) extended the results to the nonlinear asymmetric case. Moreover, a weaker geometric property, namely the non-existence of spurious (non-global) second-order critical points, has been established for both problems when the RIP constant is small (Li et al., 2019; Ha et al., 2020). We note that second-order critical points are points that satisfy the first-order and the second-order necessary optimality conditions, and thus the non-existence of spurious second-order critical points implies the non-existence of spurious local minima. Under certain regularity conditions, this weaker property is also able to guarantee the global convergence from a random initialization without an explicit convergence rate (Lee et al., 2016; Panageas & Piliouras, 2016). Please refer to Table 1 for a summary of the state-of-the-art results.

Most of the aforementioned papers are based on the following assumption on the low-rank critical points of the functions fs(·) and fa(·):

Assumption 1. The function fa(·) has a first-order critical point M∗a such that rank(M∗a) ≤ r. Similarly, the function fs(·) has a first-order critical point M∗s that is symmetric, positive semi-definite and of rank at most r.

This assumption is inspired by the noiseless matrix sensing problem in the linear case, for which the non-negative objective function becomes zero (the lowest value possible) at the true solution. This is a natural property of the matrix sensing problem for nonlinear measurement models as well. Under the above assumption and the RIP property, Zhu et al. (2018) proved that M∗s and M∗a are the unique global minima of problems (1)-(2).

Theorem 1 (Zhu et al. (2018)). If the functions fs(·) and fa(·) satisfy the δ-RIP2r,2r property, then the critical points M∗s and M∗a are the unique global minima of problems (1)-(2).

Given a solution (U∗, V∗) to problem (3), we observe that (U∗P, V∗P^{-T}) is also a solution for any invertible P ∈ R^{r×r}. This redundancy may induce an extreme non-convexity on the landscape of the objective function. To reduce this redundancy, Tu et al. (2016) considered the regularized problem

min_{U ∈ R^{n×r}, V ∈ R^{m×r}} ρ(U, V),        (5)

Table 1: Comparison of the state-of-the-art results and our results. Here δ2r,2t and κ are the RIP2r,2t and BDP2r constants of fs(·) or fa(·), respectively. Constant α(M∗a) ∈ (0, 1) only depends on M∗a.

| Problem Setup | No Spurious Second-order Critical Pts.: Existing | No Spurious Second-order Critical Pts.: Ours | Strict Saddle Property: Existing | Strict Saddle Property: Ours |
|---|---|---|---|---|
| Rank-1 Sym., Linear | δ2,2 < 1/2 (Zhang et al., 2019) | δ2,2 < 1/2 | - | - |
| Rank-1 Sym., Nonlinear | δ2,2 < (2 − O(κ))/(4 + O(κ)) (Bi & Lavaei, 2021) | δ2,2 < 1/2 | - | - |
| Rank-1 Asym., Linear & Nonlinear | - | δ2,2 < 1/2 | - | - |
| Rank-r Sym., Linear | δ2r,2r < 1/5 (Ge et al., 2016) | δ2r,2r ≤ 1/3 | δ2r,2r < 1/10 (Ge et al., 2017) | δ2r,2r < 1/3 |
| Rank-r Sym., Nonlinear | δ2r,4r < 1/5 (Li et al., 2019) | δ2r,2r ≤ 1/3 | - | δ2r,2r < 1/3 |
| Rank-r Asym., Linear | δ2r,2r < 1/3 (Ha et al., 2020) | δ2r,2r ≤ 1/3 | δ2r,2r < 1/20 (Ge et al., 2017) | δ2r,2r < 1/3 |
| Rank-r Asym., Nonlinear | δ2r,2r < 1/3 (Ha et al., 2020) | δ2r,2r ≤ 1/3 | δ2r,4r < α(M∗a)/100 (Zhu et al., 2021) | δ2r,2r < 1/3 |

where

ρ(U, V) := ha(U, V) + (µ/4) · g(U, V)

with a constant µ > 0 and the regularization term

g(U, V) := ‖U^T U − V^T V‖²F.

The regularization term is introduced to balance the magnitudes of U∗ and V∗. Zhu et al. (2018) showed that the regularization term does not introduce bias, and thus problem (5) is equivalent to the original problem (2) in the sense that any first-order critical point (U, V) of problem (2) corresponds to a first-order critical point of problem (5) with balanced energy, i.e., U^T U = V^T V.

Theorem 2 (Zhu et al. (2018)). Any first-order critical point (U∗, V∗) of problem (5) satisfies (U∗)^T U∗ = (V∗)^T V∗. Moreover, problems (2) and (5) are equivalent.
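
Written out directly from the definitions above, the factorized objectives and the balancing regularizer take only a few lines (a numpy sketch we add for illustration, with hypothetical handles f_a and f_s; not code from the paper):

```python
import numpy as np

def h_a(f_a, U, V):
    """Asymmetric factorized objective h_a(U, V) = f_a(U V^T) in problem (3)."""
    return f_a(U @ V.T)

def h_s(f_s, U):
    """Symmetric factorized objective h_s(U) = f_s(U U^T) in problem (4)."""
    return f_s(U @ U.T)

def g(U, V):
    """Balancing regularizer g(U, V) = ||U^T U - V^T V||_F^2."""
    D = U.T @ U - V.T @ V
    return np.sum(D * D)

def rho(f_a, U, V, mu):
    """Regularized objective rho(U, V) = h_a(U, V) + (mu / 4) * g(U, V)
    in problem (5); by Theorem 2 its critical points are balanced."""
    return h_a(f_a, U, V) + 0.25 * mu * g(U, V)
```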

Detailed optimality conditions for problems (1)-(5) are provided in the appendix.

1.2 Contributions

In this work, we analyze the geometric properties of problems (4)-(5). Novel analysis methods are developed to obtain less conservative conditions for guaranteeing benign landscapes for both problems. We note that, unlike the case of linear measurements, the RIP constant of nonlinear problems may not concentrate to 0 as the number of samples increases. Therefore, a sharper RIP bound leads to theoretical guarantees that hold under less stringent statistical requirements. In addition, even if the RIP constant concentrates to 0 when more samples are included, there may only be a limited number of samples available, either due to the constraints of specific applications or to the great expense of taking more samples. Hence, obtaining a sharper RIP bound is essential for many applications. We summarize our results in Table 1. More concretely, the contributions of this paper are threefold.

First, we derive necessary conditions and sufficient conditions for the existence of spurious second-order critical points for both symmetric and asymmetric problems. Using our necessary conditions, we show that the δ-RIP2r,2r property with δ ≤ 1/3 is enough to guarantee the non-existence of such points. This result provides a marginal improvement over the previous work (Ha et al., 2020), which developed the sufficient condition δ < 1/3 for asymmetric problems, and is a major improvement over Ge et al. (2016) and Li et al. (2019), which require δ < 1/5 for symmetric problems. With this non-existence property and under some common regularity conditions, Lee et al. (2016); Panageas & Piliouras (2016) showed that the vanilla gradient descent method with a small enough step size and a random initialization almost surely converges to a global minimum. We note that the convergence rate was not studied and could theoretically be exponential in the worst case. In addition, by studying our necessary conditions, we show that every second-order critical point has a positive correlation to the global minimum when δ ∈ (1/3, 1/2). When δ = 1/2, a counterexample with spurious second-order critical points is given by utilizing the sufficient conditions. We note that the sufficient conditions can greatly simplify the construction of counterexamples.

Second, we separately study the rank-1 case to further strengthen the bounds. In particular, we utilize the necessary conditions to prove that the δ-RIP2,2 property with δ < 1/2 is enough for the non-existence of spurious second-order critical points. Combining with a counterexample in the δ = 1/2 case, we conclude that δ < 1/2 is the sharpest bound for the rank-1 case. Our results significantly extend the bounds in Zhang et al. (2019) derived for the linear symmetric case to the linear asymmetric case and the general nonlinear case. They also improve the bound in Bi & Lavaei (2021) by dropping the BDP constant.

Third, we prove that in the exact parametrization case, problems (4)-(5) both satisfy the strict saddle property (Sun et al., 2018) when the δ-RIP2r,2r property is satisfied with δ < 1/3. This result greatly improves the bounds in Ge et al. (2017); Zhu et al. (2021) and extends the result in Ha et al. (2020) to approximate second-order critical points. With the strict saddle property and certain smoothness properties, a wide range of algorithms guarantee a global polynomial-time convergence with a random initialization; see Jin et al. (2017, 2018); Sun et al. (2018); Huang & Becker (2019). Due to the special non-convex structure of our problems and the RIP property, it is possible to prove the boundedness of the trajectory of the perturbed gradient descent method using a method similar to that in Jin et al. (2017). Since the smoothness properties are satisfied over a bounded region, combined with the strict saddle property, it follows that the perturbed gradient descent method (Jin et al., 2017) achieves a polynomial-time global convergence when δ < 1/3.

1.3 Notation and organization

The operator 2-norm and the Frobenius norm of a matrix M are denoted as ‖M‖2 and ‖M‖F, respectively. The trace of a matrix M is denoted as tr(M). The inner product between two matrices is defined as ⟨M, N⟩ := tr(M^T N). For any matrix M ∈ R^{n×m}, we denote its singular values by σ1(M) ≥ · · · ≥ σk(M), where k := min{n, m}. For any symmetric matrix M ∈ R^{n×n}, we denote its eigenvalues by λ1(M) ≥ · · · ≥ λn(M). The minimal eigenvalue is denoted as λmin(·). For any matrix U, we use PU to denote the orthogonal projection onto the column space of U. For any matrices A, B ∈ R^{n×m}, we use A ⊗ B to denote the fourth-order tensor whose (i, j, k, ℓ) element is A_{i,j} B_{k,ℓ}. The identity tensor is denoted as I. The notation M ⪰ 0 means that the matrix M is symmetric and positive semi-definite. The sub-matrix R_{i:j,k:ℓ} consists of the i-th to the j-th rows and the k-th to the ℓ-th columns of matrix R. The action of the Hessian ∇²f(M) on any two matrices K and L is given by

[∇²f(M)](K, L) := ∑_{i,j,k,ℓ} [∇²f(M)]_{i,j,k,ℓ} K_{i,j} L_{k,ℓ}.

In Section 2, the Singular Value Projection algorithm is analyzed as an enlightening example for our main results. Sections 3 and 4 are devoted to the non-existence of spurious second-order critical points and the strict saddle property of the low-rank optimization problem in both symmetric and asymmetric cases, respectively.

2 Motivating Example: Singular Value Projection Algorithm

Before providing theoretical results for problems (4)-(5), we first consider the Singular Value Projection (SVP) algorithm (Algorithm 1) as a motivating example, which was proposed in Jain et al. (2010). The SVP algorithm is basically the projected gradient method applied to the original low-rank problems (1)-(2) via the truncated SVD. For the asymmetric problem (2), the low-rank manifold is

Masym := {M ∈ R^{n×m} | rank(M) ≤ r}

and the projection is given by only keeping the components corresponding to the r largest singular values. For the symmetric problem (1), the low-rank manifold is

Msym := {M ∈ R^{n×n} | rank(M) ≤ r, M^T = M, M ⪰ 0}.

We assume without loss of generality that the gradient ∇f(·) is symmetric; see Appendix A for a discussion. The projection is given by only keeping the components corresponding to the r largest eigenvalues and dropping all components with negative eigenvalues.

Algorithm 1 Singular Value Projection (SVP) Algorithm

Input: Low-rank manifold M, initial point M0, number of iterations T, step size η, objective function f(·).
Output: Low-rank solution MT.
1: for t = 0, . . . , T − 1 do
2:   Update M̃t+1 ← Mt − η∇f(Mt).
3:   Set Mt+1 to be the projection of M̃t+1 onto M via truncated SVD.
4: end for
5: return MT.

Since both low-rank manifolds are non-convex, the projection solution may not be unique, and we choose an arbitrary solution when it is not unique. We note that the above projections are orthogonal in the sense that

‖M+ − M‖F = min_{K ∈ M} ‖K − M‖F,

where M+ is the projection of a matrix M. Henceforth, M stands for Msym or Masym, which should be clear from the context. Although each truncated SVD operation can be computed within O(nmr) operations, the constant hidden in the O(·) notation is considerably larger than 1. Thus, the truncated SVD operation is significantly slower than matrix multiplication, which makes the SVP algorithm impractical for large-scale problems. However, the analysis of the SVP algorithm, combined with the equivalence property given in Ha et al. (2020), provides some insights into how to develop proof techniques for problems (4)-(5).
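
For concreteness, the following numpy sketch implements Algorithm 1 for the asymmetric manifold (an illustration we add here; grad_f is an assumed handle, the step size η = 1/(1 + δ) anticipates Theorem 3 below, and projection ties are broken arbitrarily as just discussed):

```python
import numpy as np

def project_rank_r(M, r):
    """Orthogonal projection onto {rank(M) <= r}: keep the r largest
    singular values (an arbitrary choice when the projection is not unique)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def svp(grad_f, M0, r, delta, T):
    """Singular Value Projection (Algorithm 1) with eta = 1/(1 + delta)."""
    eta = 1.0 / (1.0 + delta)
    M = M0
    for _ in range(T):
        M = project_rank_r(M - eta * grad_f(M), r)
    return M
```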

We extend the proof in Jain et al. (2010) and show that Algorithm 1 converges linearly to the global minimum under the δ-RIP2r,2r property with δ < 1/3.

Theorem 3. If the function fs(·) (resp. fa(·)) satisfies the δ-RIP2r,2r property with δ < 1/3 and the step size is chosen to be η = (1 + δ)⁻¹, then Algorithm 1 applied to problem (1) (resp. (2)) returns a solution MT such that MT ∈ M and f(MT) − f(M∗) ≤ ε within

T := ⌈ (1 / log[(1 − δ)/(2δ)]) · log[(f(M0) − f(M∗)) / ε] ⌉

iterations, where f(·) := fs(·) (resp. f(·) := fa(·)), M∗ is the global minimum, M0 is the initial point and ⌈·⌉ is the ceiling function.

The proof is almost identical to that in Jain et al. (2010), except that we have replaced the quadratic function with the RIP bounds. However, the proof provides a key inequality (13) for the subsequent proofs. We note that the above proof can be applied to other low-rank optimization problems with a suitable definition of the orthogonal projection. In Ha et al. (2020), it is proved that the unique global minimum is the only fixed point of the SVP algorithm if the RIP constant δ is less than 1/3. However, that paper did not prove linear convergence (as done in Theorem 3). This difference leads to a strengthened inequality in the following analysis, which further serves as an essential step in proving the strict saddle property. The results in this section provide a hint that the landscape may be benign when the RIP constant is smaller than 1/3 and that we may be able to establish linear convergence under this condition, which is the main topic of the remainder of this paper.

3 No Spurious Second-order Critical Points

In this section, we develop necessary conditions and sufficient conditions for the existence of spurious second-order critical points of problems (4)-(5). Besides the non-existence of spurious local minima, the non-existence of spurious second-order critical points also guarantees the global convergence of many first-order algorithms with random initialization under certain regularity conditions (Lee et al., 2016; Panageas & Piliouras, 2016). More precisely, we require the iterates of the algorithm to converge to a single point and the objective function to have a Lipschitz-continuous gradient. The first condition is satisfied by the gradient descent method applied to a large class of functions known as the KŁ-functions (Attouch et al., 2013). As for the second condition, many objective functions that appear in applications, e.g., the ℓ2-loss function, do not satisfy it globally. However, if the step size is small enough, the special non-convex structure of the Burer-Monteiro decomposition and the RIP property ensure that the trajectory of the gradient descent method stays in a compact set, where the Lipschitz condition is satisfied due to the second-order continuity of the functions fs(·) and fa(·). The proof of this claim is similar to Theorem 8 in Jin et al. (2017) and is omitted here. Therefore, the non-existence of spurious second-order critical points can ensure the global convergence of the gradient descent method for many applications.

The non-existence of spurious second-order critical points has been proved in Ge et al. (2017); Zhu et al. (2018) for problems with linear and nonlinear measurements, respectively. Recently, Ha et al. (2020) proved a relation between the second-order critical points of problem (3) or (5) and the fixed points of the SVP algorithm on problem (2). Using this relation, they showed that problems (3) and (5) have no spurious second-order critical points when the δ-RIP2r,2r property is satisfied with δ < 1/3. In this work, we take a different approach to show that δ ≤ 1/3 is enough for the general case in both symmetric and asymmetric scenarios, and that δ < 1/2 is enough for the rank-1 case. Moreover, we prove that there exists a positive correlation between every second-order critical point and the global minimum when δ ∈ (1/3, 1/2). We also show that there may exist spurious second-order critical points when δ = 1/2 for both the symmetric and asymmetric problems, which extends the construction of such examples for the linear symmetric rank-1 problem in Zhang et al. (2018) to general cases. We first give necessary conditions and sufficient conditions below for the existence of a function that satisfies the δ-RIP2r,2r condition and has spurious second-order critical points.

Theorem 4. Let ℓ := min{m, n, 2r}. For a given δ ∈ [0, 1), there exists a function fa(·) with the δ-RIP2r,2r property such that problems (3) and (5) have a spurious second-order critical point only if 1 − δ < (1 + δ)/2 and there exist a constant α ∈ (1 − δ, (1 + δ)/2], a diagonal matrix Σ ∈ R^{r×r}, a diagonal matrix Λ ∈ R^{(ℓ−r)×(ℓ−r)} and matrices A ∈ R^{r×r}, B ∈ R^{r×r}, C ∈ R^{(ℓ−r)×r}, D ∈ R^{(ℓ−r)×r} such that

(1 + δ) min1≤i≤r Σii ≥ max1≤i≤ℓ−r Λii,   Σ ⪰ 0,   Λ ⪰ 0,

⟨Λ, CD^T⟩ = α[tr(Σ²) − 2⟨Σ, AB^T⟩ + ‖AB^T‖²F + ‖AD^T‖²F + ‖CB^T‖²F + ‖CD^T‖²F],        (6)

tr(Λ²) ≤ α⁻¹(2α − 1 + δ²) · ⟨Λ, CD^T⟩,   ⟨Λ, CD^T⟩ ≠ 0.

If CB^T = 0 and AD^T = 0, then there exists a function fa(·) with the δ-RIP2r,2r property such that problems (3) and (5) have a spurious second-order critical point.

Remark 1. We note that there may exist simpler forms of the above conditions. For instance, we may solve for α via the condition in the second line of (6) and substitute it into the other conditions. In addition, the requirement that α ∈ (1 − δ, (1 + δ)/2] may also be dropped without affecting the conditions. More specifically, the conditions in (6) are equivalent to

(1 + δ) min1≤i≤r Σii ≥ max1≤i≤ℓ−r Λii,   Σ ⪰ 0,   Λ ⪰ 0,   ⟨Λ, CD^T⟩ ≠ 0,

tr(Λ²) ≤ 2 · ⟨Λ, CD^T⟩ − (1 − δ²)[tr(Σ²) − 2⟨Σ, AB^T⟩ + ‖AB^T‖²F + ‖AD^T‖²F + ‖CB^T‖²F + ‖CD^T‖²F].

We state Theorem 4 in the current form since it helps with deriving corollaries more directly.

Intuitively, Σ and Λ correspond to the singular values of the second-order critical point and of the gradient at that point, respectively. Matrices A, B, C, D correspond to the SVD of the global optimum. The original problem of the non-existence of spurious second-order critical points can be viewed as a property of the set of functions satisfying the RIP property, which is a convex set in an infinite-dimensional functional space. The conditions in (6) reduce the infinite-dimensional problem to a finite-dimensional problem by utilizing the optimality conditions and the RIP property, which provides a basis for solving these conditions numerically. We note that the conditions in the third line of (6) are novel and serve as an important step in developing strong theoretical guarantees. Although the conditions in (6) seem complicated, they lead to strong results on the non-existence of spurious second-order critical points. We provide two corollaries below to illustrate the power of the above theorem. The first corollary focuses on the rank-1 case. In this case, we can simplify condition (6) through suitable relaxations to obtain a sharper bound on δ that ensures the non-existence of spurious second-order critical points.

Corollary 1. Consider the case r = 1, and suppose that the function fa(·) satisfies the δ-RIP2,2 property with δ < 1/2. Then, problems (3) and (5) have no spurious second-order critical points.

The following example shows that the counterexample in Zhang et al. (2019) designed for the symmetric case also works for the asymmetric rank-1 case.

Example 1. We note that Example 12 in Zhang et al. (2019) shows that problem (4) may have spurious second-order critical points when δ = 1/2. In general, a second-order critical point for problem (4) is not a second-order critical point for problem (5), since the asymmetric manifold Masym has a larger second-order critical cone than the symmetric manifold Msym. However, it can be verified that the same example also has a spurious second-order critical point in the asymmetric case. For completeness, we verify the claim in the appendix.

It follows from Corollary 1 and Example 1 that 1/2 is the sharpest bound for the rank-1 asymmetric case. The next corollary provides a marginal improvement over the state-of-the-art result for the general-rank case, namely the RIP bound δ < 1/3 (Ha et al., 2020). In addition, we prove that there exists a positive correlation between every second-order critical point and the global minimum when δ < 1/2.

Corollary 2. Given an arbitrary r, suppose that the function fa(·) satisfies the δ-RIP2r,2r property. If δ ≤ 1/3, then both problems (3) and (5) have no spurious second-order critical points. In addition, if δ ∈ [0, 1/2), then every second-order critical point M has a positive correlation with the ground truth M∗a. Namely, there exists a universal function C(δ) : (0, 1/2) ↦ (0, 1] such that

⟨M, M∗a⟩ ≥ C(δ) · ‖M‖F ‖M∗a‖F.

For the general rank-r case, we construct a counterexample with spurious second-order critical points when δ = 1/2.

Example 2. Let n = m = 2r. Now, we use the sufficiency part of Theorem 4 to construct a counterexample. We choose

δ := 1/2,   α := 3/5,   Σ := (1/2)Ir,   Λ := (3/4)Ir,   A = B := 0r,   C = D := Ir.

It can be verified that the conditions in (6) are satisfied and CB^T = AD^T = 0, which means that there exists a function fa(·) satisfying the δ-RIP2r,2r property for which problems (3) and (5) have spurious second-order critical points. We also give a direct construction with linear measurements in the appendix. This example illustrates that Theorem 4 can be used to systematically design instances of the problem with spurious second-order critical points.
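
These choices can be checked mechanically. The short script below (a verification sketch we add for illustration, not part of the paper) evaluates every condition in (6); the last inequality in fact holds with equality:

```python
import numpy as np

r = 3  # any r >= 1 works, with n = m = 2r
delta, alpha = 0.5, 0.6
Sigma, Lam = 0.5 * np.eye(r), 0.75 * np.eye(r)
A = B = np.zeros((r, r))
C = D = np.eye(r)

inner = np.trace(Lam.T @ (C @ D.T))  # <Lam, C D^T> = 3r/4 != 0
rhs = alpha * (np.trace(Sigma @ Sigma) - 2 * np.trace(Sigma.T @ (A @ B.T))
               + np.sum((A @ B.T)**2) + np.sum((A @ D.T)**2)
               + np.sum((C @ B.T)**2) + np.sum((C @ D.T)**2))

assert 1 - delta < alpha <= (1 + delta) / 2
assert (1 + delta) * Sigma.diagonal().min() >= Lam.diagonal().max() - 1e-12
assert abs(inner - rhs) < 1e-12          # second condition in (6), with equality
assert np.trace(Lam @ Lam) <= (2 * alpha - 1 + delta**2) / alpha * inner + 1e-12
assert np.allclose(C @ B.T, 0) and np.allclose(A @ D.T, 0)  # sufficiency part
print("all conditions in (6) hold")
```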

Before closing this section, we note that similar conditions can be obtained for the symmetric problem (4). Although there exists a natural transformation of symmetric problems to asymmetric problems (see the appendix), this approach requires the objective function fs(·) to have the δ-RIP4r,2r property, which provides sub-optimal RIP bounds compared to a direct analysis. We give the results of the direct analysis below and omit the proof due to its similarity to the asymmetric case.

Theorem 5. Let ℓ := min{n, 2r}. For a given δ ∈ [0, 1), there exists a function fs(·) with the δ-RIP2r,2r property such that problem (4) has a spurious second-order critical point only if 1 − δ < (1 + δ)/2 and there exist a constant α ∈ (1 − δ, (1 + δ)/2], a diagonal matrix Σ ∈ R^{r×r}, a diagonal matrix Λ ∈ R^{(ℓ−r)×(ℓ−r)} and matrices A ∈ R^{r×r}, C ∈ R^{(ℓ−r)×r} such that

(1 + δ) min1≤i≤r Σii ≥ max1≤i≤ℓ−r Λii,   Σ ⪰ 0,

⟨Λ, CC^T⟩ = α[tr(Σ²) − 2⟨Σ, AA^T⟩ + ‖AA^T‖²F + 2‖AC^T‖²F + ‖CC^T‖²F],        (7)

tr(Λ²) ≤ α⁻¹(2α − 1 + δ²) · ⟨Λ, CC^T⟩,   ⟨Λ, CC^T⟩ ≠ 0.

If AC^T = 0, then there exists a function fs(·) with the δ-RIP2r,2r property for which problem (4) has a spurious second-order critical point.

Compared to Theorem 4, the diagonal matrix Λ is not enforced to be positive semi-definite. The reason is that the eigenvalue decomposition is used instead of the singular value decomposition in the symmetric case, and therefore some eigenvalues can be negative. Similarly, we can obtain the non-existence and the positive correlation results for the symmetric problem.

Corollary 3. If the function fs(·) satisfies the δ-RIP2r,2r property, then the following statements hold:

• If δ ≤ 1/3, then there are no spurious second-order critical points;

• If δ < 1/2, then there exists a positive correlation between every second-order critical pointand the ground truth;

• If δ = 1/2, then there exists a counterexample with spurious second-order critical points;

• If δ < 1/2 and r = 1, then there are no spurious second-order critical points.

We note that the last statement serves as a generalization of the results in Zhang et al. (2019) to the nonlinear measurement case, and improves upon the bound in Bi & Lavaei (2021) by dropping the BDP constant.

4 Global Landscape: Strict Saddle Property

Although the non-existence of spurious second-order critical points can ensure the global convergence under certain regularity conditions, it cannot guarantee a fast convergence rate in general. Saddle-point escaping algorithms may become stuck at approximate second-order critical points for an exponentially long time. To guarantee the global polynomial-time convergence, the following strict saddle property is commonly considered in the literature:

Definition 3 (Sun et al. (2018)). Consider an arbitrary optimization problem min_{x ∈ X ⊂ R^d} F(x) and let X∗ denote the set of its global minima. It is said that the problem satisfies the (α, β, γ)-strict saddle property for α, β, γ > 0 if at least one of the following conditions is satisfied for every x ∈ X:

dist(x, X∗) ≤ α;   ‖∇F(x)‖F ≥ β;   λmin[∇²F(x)] ≤ −γ.
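
As a small diagnostic (our own sketch, with assumed function handles and a dense Hessian), one can test which branch of Definition 3 a given point satisfies:

```python
import numpy as np

def strict_saddle_branch(x, X_star, grad, hess, alpha, beta, gamma):
    """Return the branch of the (alpha, beta, gamma)-strict saddle property
    (Definition 3) that holds at x, or None if the property fails there."""
    if min(np.linalg.norm(x - xs) for xs in X_star) <= alpha:
        return "near a global minimum"
    if np.linalg.norm(grad(x)) >= beta:
        return "large gradient"
    if np.linalg.eigvalsh(hess(x)).min() <= -gamma:
        return "sufficiently negative curvature"
    return None
```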

For the low-rank problems, we choose the distance to be the Frobenius norm in the factorization space. This distance is equivalent to the Frobenius norm in the matrix space in the sense that there exist constants c1(X∗) > 0 and c2(X∗) > 0 such that

c1(X∗) · ‖U − U∗‖F ≤ ‖UU^T − U∗(U∗)^T‖F ≤ c2(X∗) · ‖U − U∗‖F

holds for all U ∈ X as long as ‖U − U∗‖F is small and X∗ is bounded (Tu et al., 2016). A similar relation holds for the asymmetric case.

In Jin et al. (2017), it has been proved that the perturbed gradient descent method can find an ε-approximate second-order critical point in O(ε⁻²) iterations with high probability if the Hessian of the objective function is Lipschitz. Namely, the algorithm returns a point x ∈ X such that

‖∇F(x)‖F ≤ O(ε),   λmin[∇²F(x)] ≥ −O(√ε)

in O(ε⁻²) iterations with high probability. If we choose ε > 0 to be small enough such that O(ε) < β and −O(√ε) > −γ, then the strict saddle property ensures that the returned point satisfies dist(x, X∗) ≤ α with high probability. We note that the Lipschitz continuity of the Hessian can be similarly guaranteed by the boundedness of the trajectories of the perturbed gradient method, which can be proved similarly to Theorem 8 in Jin et al. (2017). Since the smoothness properties are satisfied over a bounded region, we may apply the perturbed gradient descent method (Jin et al., 2017) to achieve the polynomial-time global convergence with random initialization.
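
A schematic version of that method is sketched below (a simplification we add of Jin et al. (2017): the thresholds, radius, and schedule are placeholder assumptions, and the perturbation uses a uniformly random direction rather than a uniform draw from a ball):

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, eta=1e-2, g_thresh=1e-3,
                               radius=1e-2, t_wait=50, T=10000, seed=0):
    """Gradient descent that injects a small random perturbation whenever
    the gradient is tiny, so the iterates can escape strict saddle points."""
    rng = np.random.default_rng(seed)
    x, last_perturb = np.array(x0, dtype=float), -t_wait
    for t in range(T):
        g = grad(x)
        if np.linalg.norm(g) <= g_thresh and t - last_perturb > t_wait:
            d = rng.standard_normal(x.shape)
            x = x + radius * d / np.linalg.norm(d)  # random escape direction
            last_perturb = t
        else:
            x = x - eta * g
    return x
```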

In this section, we prove that problems (4) and (5) satisfy the strict saddle property with an arbitrary α > 0 in the exact parameterization case, i.e., when the global optimum has rank r.

Assumption 2. The global optimum M∗a or M∗s has rank r.

It has been proved in Zhu et al. (2021) that the regularized problem (5) satisfies the strict saddle property if the function fa(·) has the δ-RIP2r,4r property with

δ < σr(M∗a)^{3/2} / (100 ‖M∗a‖F ‖M∗a‖2^{1/2}).

Our results improve upon their bounds by allowing a larger problem-free RIP constant and by requiring only the RIP2r,2r property (note that there are problems with the RIP2r,2r property for which the RIP2r,4r property does not hold (Bi & Lavaei, 2021)). Our result can also be viewed as a robust version of the results in Ha et al. (2020).

Theorem 6. Suppose that the function fa(·) satisfies the δ-RIP2r,2r property with δ < 1/3. Given an arbitrary constant α > 0, if µ is selected to belong to the interval [(1 − δ)/3, 1 − δ), then there exist positive constants

ε1 := ε1(δ, r, µ, σr(M∗a), ‖M∗a‖F, α),   λ1 := λ1(δ, r, µ, σr(M∗a), ‖M∗a‖F, α)

such that for every ε ∈ (0, ε1] and λ ∈ (0, λ1], problem (5) satisfies the (α, β, γ)-strict saddle property with

β := min{µ(ε/r)^{3/2}, λ},   γ := µε.

We note that the constraint µ ∈ [(1 − δ)/3, 1 − δ) is not optimal, and it can be similarly proved that µ ∈ (δ, 1 − δ) also guarantees the strict saddle property. The key step in the proof is to show that for every point (U, V) at which the gradient of fa(UV^T) is small, it holds that

‖∇fa(UV^T)‖2² ≥ (1 + δ)²σ²r(UV^T) + C · (1 − 3δ)[fa(UV^T) − fa(M∗a)],

where C > 0 is a constant independent of (U, V). This inequality can be viewed as a major extension of the non-existence of spurious second-order critical points when δ < 1/3 (Ha et al., 2020), which shows that every spurious second-order critical point (U, V) satisfies

‖∇fa(UV^T)‖2² > (1 + δ)σ²r(UV^T).

We emphasize that our proof requires a new framework and is not a standard revision of the existing methods, which is the reason why sharper bounds can be established. By replacing ‖∇fa(M)‖2 with −λmin(∇fs(M)), the analysis for the asymmetric case can be extended to the symmetric case with minor modifications and the same bound follows.

Theorem 7. Suppose that the function fs(·) satisfies the δ-RIP2r,2r property with δ < 1/3. Given an arbitrary constant α > 0, there exists a positive constant λ1 := λ1(δ, r, σr(M∗s), ‖M∗s‖F, α) such that for every λ ∈ (0, λ1], problem (4) satisfies the (α, β, γ)-strict saddle property with

β := λ,   γ := 2λ.

The above bound is the first theoretical guarantee of the strict saddle property for the nonlinear symmetric problem.

5 Conclusion

In this work, we analyze the geometric properties of low-rank optimization problems via the non-convex factorization approach. We prove novel necessary conditions and sufficient conditions for the non-existence of spurious second-order critical points in both symmetric and asymmetric cases. We show that these conditions lead to sharper bounds and greatly simplify the construction of counterexamples needed to study the sharpness of the bounds. The developed bounds significantly generalize several of the existing results. In the rank-1 case, the bound is proved to be the sharpest possible. In the general rank case, we show that there exists a positive correlation between second-order critical points and the global minimum for problems whose RIP constants are higher than the developed bound but lower than the fundamental limit obtained by the counterexamples. Finally, the strict saddle property is proved with a weaker requirement on the RIP constant for asymmetric problems. The paper develops the first strict saddle property in the literature for nonlinear symmetric problems.

Acknowledgments and Disclosure of Funding

This work was supported by grants from AFOSR, ARO, ONR, NSF and the C3.ai Digital Transformation Institute.

References

Hedy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Mathematical Programming, 137(1):91–129, 2013.

Kyriakos Axiotis and Maxim Sviridenko. Sparse convex optimization via adaptively regularized hard thresholding. In International Conference on Machine Learning, pp. 452–462. PMLR, 2020.

Srinadh Bhojanapalli, Anastasios Kyrillidis, and Sujay Sanghavi. Dropping convexity for faster semi-definite optimization. In Conference on Learning Theory, pp. 530–582. PMLR, 2016a.

Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Global optimality of local search for low rank matrix recovery. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 3880–3888, 2016b.

Yingjie Bi and Javad Lavaei. On the absence of spurious local minima in nonlinear low-rank matrix recovery problems. In International Conference on Artificial Intelligence and Statistics, pp. 379–387. PMLR, 2021.

Nicolas Boumal. Nonconvex phase synchronization. SIAM Journal on Optimization, 26(4):2355–2377, 2016.

Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

Emmanuel J Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080, 2010.

Ji Chen, Dekai Liu, and Xiaodong Li. Nonconvex rectangular matrix completion via gradient descent without ℓ2,∞ regularization. IEEE Transactions on Information Theory, 66(9):5806–5841, 2020.

Yudong Chen and Yuejie Chi. Harnessing structures in big data via guaranteed low-rank matrix estimation: Recent theory and fast algorithms via convex and nonconvex optimization. IEEE Signal Processing Magazine, 35(4):14–31, 2018.

Yuejie Chi, Yue M Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269, 2019.

Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. Advances in Neural Information Processing Systems, pp. 2981–2989, 2016.

Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In International Conference on Machine Learning, pp. 1233–1242. PMLR, 2017.

Wooseok Ha, Haoyang Liu, and Rina Foygel Barber. An equivalence between critical points for rank constraints versus low-rank factorizations. SIAM Journal on Optimization, 30(4):2927–2955, 2020.

Zhishen Huang and Stephen Becker. Perturbed proximal descent to escape saddle points for non-convex and non-smooth objective functions. In INNS Big Data and Deep Learning Conference, pp. 58–77. Springer, 2019.

Prateek Jain, Raghu Meka, and Inderjit Dhillon. Guaranteed rank minimization via singular value projection. In Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 1, pp. 937–945, 2010.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pp. 1724–1732. PMLR, 2017.

Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference On Learning Theory, pp. 1042–1085. PMLR, 2018.

Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pp. 1246–1257. PMLR, 2016.

Qiuwei Li, Zhihui Zhu, and Gongguo Tang. The non-convex geometry of low-rank matrix optimization. Information and Inference: A Journal of the IMA, 8(1):51–96, 2019.

Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In International Conference on Machine Learning, pp. 3345–3354. PMLR, 2018.

Ioannis Panageas and Georgios Piliouras. Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.

Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Finding low-rank solutions via nonconvex matrix factorization, efficiently and provably. SIAM Journal on Imaging Sciences, 11(4):2165–2204, 2018.

Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

Shai Shalev-Shwartz, Alon Gonen, and Ohad Shamir. Large-scale convex minimization with a low-rank constraint. In Proceedings of the 28th International Conference on Machine Learning, pp. 329–336, 2011.

Yoav Shechtman, Yonina C Eldar, Oren Cohen, Henry Nicholas Chapman, Jianwei Miao, and Mordechai Segev. Phase retrieval with application to optical imaging: a contemporary overview. IEEE Signal Processing Magazine, 32(3):87–109, 2015.

Amit Singer. Angular synchronization by eigenvectors and semidefinite programming. Applied and Computational Harmonic Analysis, 30(1):20–36, 2011.

Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.

Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via Procrustes flow. In International Conference on Machine Learning, pp. 964–973. PMLR, 2016.

Lingxiao Wang, Xiao Zhang, and Quanquan Gu. A unified computational and statistical framework for nonconvex low-rank matrix estimation. In Artificial Intelligence and Statistics, pp. 981–990. PMLR, 2017.

Richard Y Zhang, Cédric Josz, Somayeh Sojoudi, and Javad Lavaei. How much restricted isometry is needed in nonconvex matrix recovery? In NeurIPS, 2018.

Richard Y Zhang, Somayeh Sojoudi, and Javad Lavaei. Sharp restricted isometry bounds for the inexistence of spurious local minima in nonconvex matrix recovery. Journal of Machine Learning Research, 20(114):1–34, 2019.

Qinqing Zheng and John Lafferty. A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pp. 109–117, 2015.

Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B Wakin. Global optimality in low-rank matrix optimization. IEEE Transactions on Signal Processing, 66(13):3614–3628, 2018.

Zhihui Zhu, Qiuwei Li, Gongguo Tang, and Michael B Wakin. The global optimization geometry of low-rank matrix optimization. IEEE Transactions on Information Theory, 67(2):1308–1331, 2021.


A Optimality Conditions

In this section, we develop the optimality conditions for problems (1)-(5). We assume without loss of generality that ∇fs(M) is symmetric for every M ∈ R^{n×n}. This is because we can always optimize the equivalent problem

min_{M ∈ R^{n×n}} (1/2)[fs(M) + fs(M^T)]   s.t.   rank(M) ≤ r,   M^T = M,   M ⪰ 0.

We first consider problems (1) and (2).

Theorem 8 (Li et al. (2019); Ha et al. (2020)). The matrix M = UU^T with U ∈ R^{n×r} is a first-order critical point of the constrained problem (1) if and only if

∇fs(M)U = 0   if rank(M) = r,
∇fs(M) ⪰ 0    if rank(M) < r.

The matrix M = UV^T with U ∈ R^{n×r} and V ∈ R^{m×r} is a first-order critical point of the constrained problem (2) if and only if

[∇fa(M)]^T U = 0, ∇fa(M)V = 0   if rank(M) = r,
∇fa(M) = 0                      if rank(M) < r.

In Ha et al. (2020), the authors proved that each second-order critical point of problem (3) or (5) is a fixed point of the SVP algorithm run on problem (2). We note that this relation can be extended to the symmetric and positive semi-definite case. This relation plays an important role in the analysis of Section 3.

Theorem 9 (Ha et al. (2020)). The matrix M = UU^T with U ∈ R^{n×r} is a fixed point of the SVP algorithm run on problem (1) with the step size 1/(1 + δ) if and only if

∇fs(M)U = 0,   −λmin(∇fs(M)) ≤ (1 + δ)σ²r(U).

The matrix M = UV^T with U ∈ R^{n×r} and V ∈ R^{m×r} is a fixed point of the SVP algorithm run on problem (2) with the step size 1/(1 + δ) if and only if

[∇fa(M)]^T U = 0,   ∇fa(M)V = 0,   ‖∇fa(M)‖2 ≤ (1 + δ)σr(M).
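
Both fixed-point characterizations are directly checkable numerically; a sketch for the asymmetric case (an illustrative helper we add, with assumed handles) follows:

```python
import numpy as np

def is_svp_fixed_point(U, V, grad_f, delta, tol=1e-8):
    """Numerically test the asymmetric conditions of Theorem 9 at M = U V^T:
    G^T U = 0, G V = 0, and ||G||_2 <= (1 + delta) * sigma_r(M)."""
    M = U @ V.T
    G = grad_f(M)
    r = U.shape[1]
    sigma_r = np.linalg.svd(M, compute_uv=False)[r - 1]
    return (np.linalg.norm(G.T @ U) <= tol and np.linalg.norm(G @ V) <= tol
            and np.linalg.norm(G, 2) <= (1 + delta) * sigma_r + tol)
```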

Next, we consider problems (3)-(5). Since the goal is to study only spurious local minima and saddle points, it is enough to focus on the second-order necessary optimality conditions. The following two theorems follow from basic calculations, and we omit the proofs.

Theorem 10. The matrix U ∈ R^{n×r} is a second-order critical point of problem (4) if and only if

∇fs(UU^T)U = 0

and

2⟨∇fs(UU^T), ∆∆^T⟩ + [∇²fs(UU^T)](U∆^T + ∆U^T, U∆^T + ∆U^T) ≥ 0

holds for every ∆ ∈ R^{n×r}.

Theorem 11. The point (U, V) with U ∈ R^{n×r} and V ∈ R^{m×r} is a second-order critical point of problem (3) if and only if

[∇fa(UV^T)]^T U = 0,   ∇fa(UV^T)V = 0

and

2⟨∇fa(UV^T), ∆U ∆V^T⟩ + [∇²fa(UV^T)](U∆V^T + ∆U V^T, U∆V^T + ∆U V^T) ≥ 0

holds for every ∆U ∈ R^{n×r} and ∆V ∈ R^{m×r}. Moreover, the given point is a second-order critical point of problem (5) if and only if

[∇fa(UV^T)]^T U = 0,   ∇fa(UV^T)V = 0,   U^T U = V^T V

and

2⟨∇fa(UV^T), ∆U ∆V^T⟩ + [∇²fa(UV^T)](U∆V^T + ∆U V^T, U∆V^T + ∆U V^T) + (µ/2)‖U^T ∆U + ∆U^T U − V^T ∆V − ∆V^T V‖²F ≥ 0

holds for every ∆U ∈ R^{n×r} and ∆V ∈ R^{m×r}.


B Relation between the Symmetric and Asymmetric Problems

In this section, we study the relationship between problems (4)-(5). This relationship is more general than the topic of this paper, namely the non-existence of spurious second-order critical points and the strict saddle property, and holds for any property that is characterized by the RIP constant δ and the BDP constant κ. Specifically, we show that any property that holds for the symmetric problem (4) with (δ, κ) also holds for the regularized asymmetric problem (5) with another pair of constants determined by δ and κ, and vice versa.

We first consider the transformation from the asymmetric case to the symmetric case. The transformation to the symmetric case has been established in Ge et al. (2017) for the linear problem. Here, we show that the transformation can be revised and extended to the nonlinear measurements case.

Theorem 12. Suppose that the function fa(·) satisfies the δ-RIP2r,2s and the κ-BDP2t properties. If we choose µ := (1 − δ)/2, then problem (5) is equivalent to a symmetric problem whose objective function satisfies the 2δ/(1 + δ)-RIP2r,2s and the 2κ/(1 + δ)-BDP2t properties.

Proof of Theorem 12. For any matrix N ∈ R^{(n+m)×(n+m)}, we divide the matrix into four blocks as

N = [N11, N12; N21, N22],

where N11 ∈ R^{n×n}, N12 ∈ R^{n×m}, N22 ∈ R^{m×m}. Then, we define a new function

f(N) := fa(N12) + fa(N21^T).

We observe that f(WW^T) = 2ha(U, V), where

W := [U; V] ∈ R^{(n+m)×r}.

For any K ∈ R^{(n+m)×(n+m)}, the Hessian of f(·) satisfies

[∇²f(N)](K, K) = [∇²fa(N12)](K12, K12) + [∇²fa(N21^T)](K21^T, K21^T).        (8)

Similarly, we can define

g(N) := ‖N11‖²F + ‖N22‖²F − ‖N12‖²F − ‖N21‖²F.

We can also verify that g(WW^T) = g(U, V) and

[∇²g(N)](K, K) = 2(‖K11‖²F + ‖K22‖²F − ‖K12‖²F − ‖K21‖²F)        (9)

for every K ∈ R^{(n+m)×(n+m)}. The minimization problem (5) is then equivalent to

min_{W ∈ R^{(n+m)×r}} F(WW^T) := f(WW^T) + (µ/2) · g(WW^T),        (10)

which is in the same symmetric form as problem (4). For every N, K ∈ R^{(n+m)×(n+m)} with rank(N) ≤ 2r and rank(K) ≤ 2s, it results from relations (8) and (9) that

[∇²F(N)](K, K) ≥ (1 − δ)(‖K12‖²F + ‖K21‖²F) + µ(‖K11‖²F + ‖K22‖²F − ‖K12‖²F − ‖K21‖²F)
              ≥ min{1 − δ − µ, µ} · ‖K‖²F

and

[∇²F(N)](K, K) ≤ (1 + δ)(‖K12‖²F + ‖K21‖²F) + µ(‖K11‖²F + ‖K22‖²F − ‖K12‖²F − ‖K21‖²F)
              ≤ max{1 + δ − µ, µ} · ‖K‖²F.

Choosing µ := (1 − δ)/2, we obtain

((1 − δ)/2) · ‖K‖²F ≤ [∇²F(N)](K, K) ≤ ((1 + 3δ)/2) · ‖K‖²F.

Hence, it follows that the function 2F(·)/(1 + δ) satisfies the 2δ/(1 + δ)-RIP2r,2s property.

Moreover, for every N, N′, K, L ∈ R^{(n+m)×(n+m)} with

rank(N), rank(N′), rank(K), rank(L) ≤ 2t,

it holds that

[∇²g(N)](K, L) = [∇²g(N′)](K, L) = 2(⟨K11, L11⟩ + ⟨K22, L22⟩ − ⟨K12, L12⟩ − ⟨K21, L21⟩)

and

|[∇²F(N) − ∇²F(N′)](K, L)|
= |[∇²fa(N12) − ∇²fa(N′12)](K12, L12) + [∇²fa(N21^T) − ∇²fa((N′21)^T)](K21^T, L21^T)|
≤ κ‖K12‖F ‖L12‖F + κ‖K21‖F ‖L21‖F ≤ κ‖K‖F ‖L‖F,

which implies that the function 2F(·)/(1 + δ) satisfies the 2κ/(1 + δ)-BDP2t property. Since problem (10) is equivalent to the minimization of 2F(WW^T)/(1 + δ), it is equivalent to a symmetric problem that satisfies the 2δ/(1 + δ)-RIP2r,2s and the 2κ/(1 + δ)-BDP2t properties.

We can see that both constants δ and κ are approximately doubled in the transformation. As an example, Bhojanapalli et al. (2016b) showed that the symmetric linear problem has no spurious local minima if the δ-RIP2r property is satisfied with δ < 1/5. Using Theorem 12, we know that the asymmetric linear problem has no spurious local minima if the δ-RIP2r property is satisfied with δ < 1/9.
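
The lift used in this proof is easy to write down explicitly; the sketch below (an illustration we add, with an assumed handle f_a) builds the symmetric objective so that F(WW^T) = 2ρ(U, V) for W = [U; V]:

```python
import numpy as np

def lift_to_symmetric(f_a, n, mu):
    """Symmetric lift from the proof of Theorem 12:
    F(N) = f_a(N12) + f_a(N21^T) + (mu / 2) * g(N) on (n+m) x (n+m) inputs."""
    def F(N):
        N11, N12 = N[:n, :n], N[:n, n:]
        N21, N22 = N[n:, :n], N[n:, n:]
        g = (np.sum(N11**2) + np.sum(N22**2)
             - np.sum(N12**2) - np.sum(N21**2))
        return f_a(N12) + f_a(N21.T) + 0.5 * mu * g
    return F
```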

The transformation from a symmetric problem to an asymmetric problem is more straightforward. We can equivalently solve the optimization problem

min_{U, V ∈ R^{n×r}} fs[(1/2)(UV^T + VU^T)]        (11)

or its regularized version with any parameter µ > 0. It can be easily shown that the above problem has the same RIP and BDP constants as the original symmetric problem. We omit the proof for brevity.

Theorem 13. Suppose that the function fs(·) satisfies the δ-RIP4r,2s and the κ-BDP4t properties. For every µ > 0, problem (4) is equivalent to an asymmetric problem and its regularized version with the δ-RIP2r,2s and the κ-BDP2t properties.

Note that the transformation from a symmetric problem to an asymmetric problem will not increase the constants κ and δ, but it requires stronger RIP and BDP properties. Hence, a direct analysis of the symmetric case may establish the same property under a weaker condition. In addition to problem (11), we can also directly consider the problem min_{U,V} fa(UV^T). However, in certain applications, the objective function is only defined for symmetric matrices and we can only use the formulation (11) to construct an asymmetric problem. In more restricted cases, when the objective function is only defined for symmetric and positive semi-definite matrices, we can only apply the direct analysis to the symmetric case.

C Proofs for Section 2

C.1 Proof of Theorem 3

Proof of Theorem 3. We denote f(·) := fs(·) and f(·) := fa(·) for the symmetric and asymmetriccase, respectively. Using the mean value theorem and the δ-RIP2r,2r property, there exists a constants ∈ [0, 1] such that

f(Mt+1)− f(Mt)

= 〈∇f(Mt),Mt+1 −Mt〉+1

2[∇2f(Mt + s(Mt+1 −Mt))](Mt+1 −Mt,Mt+1 −Mt)

15

Page 16: General Low-rank Matrix Optimization: Geometric Analysis

≤ 〈∇f(Mt),Mt+1 −Mt〉+1 + δ

2‖Mt+1 −Mt‖2F .

We define

φt(M) := 〈∇f(Mt),M −Mt〉+1 + δ

2‖M −Mt‖2F =

1 + δ

2‖M − Mt+1‖2F + constant,

where the last constant term is independent of M . Since the projection is orthogonal, the projectedmatrix Mt+1 achieves the minimal value of φt(M) over all matrices on the manifoldM. Therefore,we obtain

f(Mt+1)− f(Mt) ≤ φt(Mt+1) ≤ φt(M∗)

= 〈∇f(Mt),M∗ −Mt〉+

1 + δ

2‖M∗ −Mt‖2F . (12)

On the other hand, we can similarly prove that the δ-RIP2r,2r property ensures

f(M∗)− f(Mt) ≥ 〈∇f(Mt),M∗ −Mt〉+

1− δ2‖M∗ −Mt‖2F ,

f(Mt)− f(M∗) ≥ 1− δ2‖M∗ −Mt‖2F .

Substituting the above two inequalities into (12), it follows that

f(Mt+1)− f(Mt) ≤ f(M∗)− f(Mt) + δ‖M∗ −Mt‖2F

≤ f(M∗)− f(Mt) +2δ

1− δ[f(Mt)− f(M∗)]. (13)

Therefore, using the condition that δ < 1/3, we have

f(Mt+1)− f(M∗) ≤ 2δ

1− δ[f(Mt)− f(M∗)] := α[f(Mt)− f(M∗)],

where α := 2δ/(1− δ) < 1. Combining this single-step bound with the induction method proves thelinear convergence of Algorithm 1.

D Proofs for Section 3

D.1 Proof of Theorem 4

Proof of Theorem 4. We only consider the case when m and n are at least 2r. In this case, we have` = 2r. Other cases can be handled similarly. For the notational simplicity, we denote M∗ := M∗a inthis proof.

Necessity. We first consider problem (3). Suppose that M∗ and M are the optimum and a spurioussecond-order critical point of problem (3), respectively. It has been proved in Ha et al. (2020) that thespurious second-order critical point M has rank r and is a fixed point of the SVP algorithm with thestep size (1 + δ)−1. Therefore, the point M should be a minimizer of the projection step of the SVPalgorithm. This implies that

‖M − [M − (1 + δ)−1∇fa(M)]‖2F ≤ ‖M∗ − [M − (1 + δ)−1∇fa(M)]‖2F ,which can be simplified to

〈∇fa(M), M −M∗〉 ≤ 1 + δ

2‖M −M∗‖2F . (14)

Let U and V denote the subspaces spanned by the columns and rows of M and M∗, respectively.Namely, we have

U := {Mv1 +M∗v2 | v1, v2 ∈ Rm}, V := {MTu1 + (M∗)Tu2 | u1, u2 ∈ Rn}.Since the ranks of both matrices are bounded by r, the dimensions of U and V are bounded by 2r.Therefore, we can find orthogonal matrices U ∈ Rn×2r and V ∈ Rm×2r such that

U ⊂ range(U), V ⊂ range(V )

16

Page 17: General Low-rank Matrix Optimization: Geometric Analysis

and write M,M∗ in the form

M = U

[Σ 0r×r

0r×r 0r×r

]V T , M∗ = URV T ,

where Σ ∈ Rr×r is a diagonal matrix and R ∈ R2r×2r has rank at most r. Recalling the firstcondition in Theorem 11, the column space and the row space of ∇fa(M) are orthogonal to thecolumn space and the row space of M , respectively. Then, the δ-RIP2r,2r property gives

∃α ∈ [1− δ, 1 + δ] s.t. − 〈∇fa(M),M∗〉 = 〈∇fa(M), M −M∗〉

=

∫ 1

0

[∇2fa(M∗ + s(M −M∗))](M −M∗, M −M∗) ds

= α‖M −M∗‖2F > 0. (15)

This means thatG := PU∇fa(M)PV 6= 0,

where PU and PV are the orthogonal projections onto U and V , respectively. Combining withinequality (14), we obtain α ≤ (1 + δ)/2. By the definition of G, we have

〈∇fa(M),M∗〉 = 〈G,M∗〉.

Since both the column space and the row space of G are orthogonal to M , the matrix G has the form

G = U

[0r×r 0r×r0r×r −Λ

]V T , (16)

where Λ ∈ Rr×r. We may assume without loss of generality that Λii ≥ 0 for all i; otherwise, onecan flip the sign of some of the last r columns of U . By another orthogonal transformation, we mayassume without loss of generality that Λ is a diagonal matrix. Then, Theorem 9 gives

(1 + δ) min1≤i≤r

Σii = (1 + δ)σr(M) ≥ ‖∇fa(M)‖2 ≥ ‖G‖2 = max1≤i≤(`−r)

Λii. (17)

In addition, condition (15) is equivalent to

〈Λ, Rr+1:2r,r+1:2r〉 = α‖M −M∗‖2F = α[tr(Σ2)− 2〈Σ, R1:r,1:r〉+ ‖R‖2F

]. (18)

By the Taylor expansion, for every Z ∈ Rn×m, we have

〈∇fa(M), Z〉 =

∫ 1

0

[∇2fa(M∗ + s(M −M∗))](M −M∗, Z) ds = (M −M∗) : H : Z,

where the last expression is the tensor multiplication andH is the tensor such that

K : H : L =

∫ 1

0

[∇2fa(M∗ + s(M −M∗))](K,L) ds, ∀K,L ∈ Rn×m.

We defineG := G− α(M −M∗).

By the definition of α, we know that 〈G, M −M∗〉 = 0. Furthermore, using the definition ofH, weobtain

(M −M∗) : H : (M −M∗) = α‖M −M∗‖2F ,(M −M∗) : H : G = G : H : (M −M∗) = ‖G‖2F .

Suppose thatG : H : G = β‖G‖2F

for some β ∈ [1− δ, 1 + δ]. We consider matrices of the form

K(t) := t(M −M∗) + G, ∀t ∈ R.

17

Page 18: General Low-rank Matrix Optimization: Geometric Analysis

Since K(t) is a linear combination of M −M∗ and G, the column space of K(t) is a subspace of U ,and thus K(t) has rank at most 2r and the δ-RIP2r,2r property implies

(1− δ)‖K(t)‖2F ≤ K(t) : H : K(t) ≤ (1 + δ)‖K(t)‖2F . (19)

Using the facts that

‖K(t)‖2F = ‖M −M∗‖2F · t2 + ‖G‖2F ,K(t) : H : K(t) = α‖M −M∗‖2F · t2 + 2‖G‖2F · t+ β‖G‖2F ,

we can write the two inequalities in (19) as quadratic inequalities

[α− (1− δ)]‖M −M∗‖2F · t2 + 2‖G‖2F · t+ [β − (1− δ)]‖G‖2F ≥ 0,

[(1 + δ)− α]‖M −M∗‖2F · t2 − 2‖G‖2F · t+ [(1 + δ)− β]‖G‖2F ≥ 0. (20)

If α = 1 − δ, then we must have ‖G‖F = 0 and thus G = α(M −M∗). Equivalently, we haveM∗ = M −α−1G. Since the column and row spaces of G 6= 0 are orthogonal to M , the rank of M∗

is at least rank(M) + 1 = r+ 1, which is a contradiction. Since α ≤ (1 + δ)/2, we have α < 1 + δ.Thus, we have proved that

1− δ < α < 1 + δ.

Checking the condition for quadratic functions to be non-negative, we obtain

‖G‖2F ≤ [α− (1− δ)][β − (1− δ)] · ‖M −M∗‖2F ,‖G‖2F ≤ [(1 + δ)− α][(1 + δ)− β] · ‖M −M∗‖2F .

Sinceα− (1− δ) > 0, (1 + δ)− α > 0,

the above two inequalities are equivalent to

‖G‖2Fα− (1− δ)

≤ [β − (1− δ)] · ‖M −M∗‖2F ,

‖G‖2F(1 + δ)− α

≤ [(1 + δ)− β] · ‖M −M∗‖2F .

Summing up the two inequalities and dividing both sides by 2δ gives rise to

‖G‖2Fδ2 − (1− α)2

≤ ‖M −M∗‖2F . (21)

We note that the above condition is also sufficient for the inequalities in (20) to hold by choosingβ = 2− α. Using the relation ‖G‖2F = ‖G‖2F + α2‖M −M∗‖2F , one can write

tr(Λ2) = ‖G‖2F ≤ (2α− 1 + δ2)‖M −M∗‖2F = α−1(2α− 1 + δ2)〈Λ, Rr+1:2r,r+1:2r〉. (22)

Now, using the fact that rank(M∗) ≤ r, we can write the matrix R as

R =

[AC

] [BD

]T=

[ABT ADT

CBT CDT

],

where A,B,C,D ∈ Rr×r. Then, conditions (18) and (22) become

〈Λ, CDT 〉 = α[tr(Σ2)− 2〈Σ, ABT 〉+ ‖ABT ‖2F + ‖ADT ‖2F + ‖CBT ‖2F + ‖CDT ‖2F

](23)

and

tr(Λ2) ≤ α−1(2α− 1 + δ2) · 〈Λ, CDT 〉. (24)

If 〈Λ, CDT 〉 = 0, we have

tr(Σ2)− 2〈Σ, ABT 〉+ ‖ABT ‖2F + ‖ADT ‖2F + ‖CBT ‖2F + ‖CDT ‖2F = 0,

which implies thatABT = Σ, ADT = CBT = CDT = 0.

This contradicts the assumption that M 6= M∗. Combining this with conditions (17), (23) and (24),we arrive at the necessity part. For problem (5), Lemma 3 in Ha et al. (2020) ensures that M is still afixed point of the SVP algorithm. Recalling the necessary conditions in Theorem 11, we know thatthe same necessary conditions also hold in this case.

18

Page 19: General Low-rank Matrix Optimization: Geometric Analysis

Sufficiency. Now, we study the sufficiency part. We first consider problem (3). We choose twoorthogonal matrices U ∈ Rn×2r, V ∈ Rm×2r and define

M = U

[Σ 0r×r

0r×r 0r×r

]V T , M∗ := U

([AC

] [BD

]T)V T , G := U

[0r×r 0r×r0r×r −Λ

]V T .

Since 〈Λ, CDT 〉 6= 0, we have M 6= M∗. Then, we know that rank(M) ≤ r and rank(M∗) ≤ r.We define

G := G− α(M −M∗),which satisfies 〈G, M −M∗〉 = 0 by the condition in the second line of (6). If G = 0, then[

0r×r 0r×r0r×r −Λ

]= α ·

[Σ 0r×r

0r×r 0r×r

]− α ·

[AC

] [BD

]T= α ·

[Σ 0r×r

0r×r 0r×r

]− α ·

[ABT 0

0 CDT

],

where the second step is because of CBT = 0 and ADT = 0. The above relation is equivalent to

Σ = ABT , Λ = α · CDT .

Since Σ � 0, the matrix ABT has rank r. Noticing that the decomposition of matrix M∗ ensuresthat the rank of M∗ is at most r, we have CDT = 0, which is a contradiction to the condition that〈CDT ,Λ〉 6= 0. Therefore, we have G 6= 0. We consider the rank-2 symmetric tensor

G1 :=α

‖M −M∗‖2F· (M −M∗)⊗ (M −M∗) +

2− α‖G‖2F

· G⊗ G

+1

‖M −M∗‖2F

[(M −M∗)⊗ G+ G⊗ (M −M∗)

].

For every matrix K ∈ Rn×m, we have the decomposition

K = t(M −M∗) + sG+ K, 〈M −M∗, K〉 = 〈G, K〉 = 0,

where t, s ∈ R are two suitable constants. Then, using the definition of G1, we have

K : G1 : K = α‖M −M∗‖2F · t2 + 2‖G‖2F · ts+ (2− α)‖G‖2F · s2.

By the conditions in the third line of (6), one can write

‖G‖2F ≤ [α− (1− δ)][(1 + δ)− α] · ‖M −M∗‖2F ,

which leads to

[α− (1− δ)]‖M −M∗‖2F · t2 + 2‖G‖2F · ts+ [(1 + δ)− α]‖G‖2F · s2 ≥ 0,

[(1 + δ)− α]‖M −M∗‖2F · t2 − 2‖G‖2F · ts+ [α− (1− δ)]‖G‖2F · s2 ≥ 0.

The above two inequalities are equivalent to

(1− δ)[‖M −M∗‖2F · s2 + ‖G‖2F · t2] ≤ K : G1 : K ≤ (1 + δ)[‖M −M∗‖2F · s2 + ‖G‖2F · t2].(25)

By restricting to the subspace

S := span{M −M∗, G} = {s(M −M∗) + tG | s, t ∈ R},

the tensor G1 can be viewed as a 2× 2 matrix. Then, inequality (25) implies that the matrix has twoeigenvalues λ1 and λ2 such that

1− δ ≤ λ1, λ2 ≤ 1 + δ.

Therefore, we can rewrite the tensor G1 restricted to S as

[G1]S = λ1 ·G1 ⊗G1 + λ2 ·G2 ⊗G2,

where G1, G2 are linear combinations of M −M∗, G and have the unit norm. Since the orthogonalcomplementary S⊥ is in the null space of G1, we have

G1 = [G1]S = λ1 ·G1 ⊗G1 + λ2 ·G2 ⊗G2.

19

Page 20: General Low-rank Matrix Optimization: Geometric Analysis

Now, we choose matrices G3, . . . , GN such that G1, . . . , GN form an orthonormal basis of the linearvector space Rn×m, where N := nm. We define another symmetric tensor by

H := G1 +

N∑i=3

(1 + δ) ·Gi ⊗Gi.

Then, inequality (25) implies that the quadratic form K : H : K satisfies the δ-RIP2r,2r property.

Therefore, we can choose the Hessian to be the constant tensorH and define the function fa(·) as

fa(K) :=1

2(K −M∗) : H : (K −M∗), ∀K ∈ Rn×m.

Combining with the definition ofH, we know

∇fa(M) = H : (M −M∗) = G, ∇2fa(M) = H.

We choose matrices U ∈ Rn×r, V ∈ Rm×r such that M = U V T and UT U = V T V . By thedefinitions of M and G, we know that M and G have orthogonal column and row spaces, i.e.,

UTG = 0, GV = 0.

This means that the first-order optimality conditions are satisfied at the point (U , V ). For thesecond-order necessary optimality conditions, we consider the direction

∆ :=

[∆U

∆V

]∈ R(n+m)×r.

We consider the decomposition

∆U = PU∆U + P⊥U ∆U := ∆1U + ∆2

U , ∆V = PV ∆V + P⊥V ∆V := ∆1V + ∆2

V ,

where PU ,PV are the orthogonal projection onto the column space of U , V , respectively. Then,using the conditions in the first line of (6), we have

〈∇fa(M),∆U∆TV 〉 = 〈G,∆U∆T

V 〉 = 〈G,∆2U (∆2

V )T 〉 ≥ −‖GT∆2U‖F ‖∆2

V ‖F

≥ −(1 + δ)σr(M)‖∆2U‖F ‖∆2

V ‖F ≥ −(1 + δ)σr(M) · ‖∆2U‖2F + ‖∆2

V ‖2F2

.

(26)

We define∆1 := U(∆1

V )T + ∆1U V

T , ∆2 := U(∆2V )T + ∆2

U VT .

Then, we know that 〈∆1,∆2〉 = 0. Using the assumption that CBT = ADT = 0, we know that M∗has the form

M∗ = U

[ABT 0

0 CDT

]V T = PUM∗PV + P⊥UM

∗P⊥V . (27)

Then, the special form (27) implies that

〈M∗,∆2〉 = 〈M∗, U(∆2V )T + ∆2

U VT 〉 =

⟨M∗, U∆T

V P⊥V + P⊥U ∆U VT⟩

= 0.

Using the definitions of M and G, it can be concluded that

〈M,∆2〉 = 0, 〈G,∆2〉 = 〈G, U(∆2V )T + ∆2

U VT 〉 = 0.

Since G1, G2 are linear combinations of M −M∗ and G, the last three relations lead to

〈G1,∆2〉 = 〈G2,∆2〉 = 0.

Therefore, there exist constants a3, . . . , aN such that

∆2 =

N∑i=3

aiGi.

20

Page 21: General Low-rank Matrix Optimization: Geometric Analysis

Suppose that the constants b1, . . . , bN satisfy

∆1 =

N∑i=1

biGi.

Then, the fact 〈∆1,∆2〉 = 0 and the orthogonality of G1, . . . , GN imply that

N∑i=3

aibi = 0.

We can calculate that

[∇2fa(M)](U∆TV + ∆U V

T , U∆TV + ∆U V

T ) = (∆1 + ∆2) : H : (∆1 + ∆2)

=λ1 · b21 + λ2 · b22 + (1 + δ)

N∑i=3

(ai + bi)2 ≥ (1 + δ)

N∑i=3

(ai + bi)2

=(1 + δ)

N∑i=3

(a2i + b2i

)≥ (1 + δ)

N∑i=3

a2i = (1 + δ)‖U(∆2

V )T + ∆2U V

T ‖2F ,

where the third last step is due to∑Ni=3 aibi = 0. Noticing that 〈U(∆2

V )T ,∆2U V

T 〉 = 0, the aboveinequality gives that

[∇2fa(M)](U∆TV + ∆U V

T , U∆TV + ∆U V

T ) ≥ (1 + δ)‖U(∆2V )T ‖2F + (1 + δ)‖∆2

U VT ‖2F

≥(1 + δ)σr(U)2‖∆2V ‖2F + (1 + δ)σr(V )2‖∆2

U‖2F = (1 + δ)σr(M)(‖∆2V ‖2F + ‖∆2

U‖2F ),

where the last equality is because of σr(U)2 = σr(V )2 = σr(M) when UT U = V T V . Combiningwith inequality (26), one can write

[∇2ha(U, V )](∆,∆) = 2〈∇fa(M),∆U∆TV 〉+ [∇2fa(M)](U∆T

V + ∆U VT , U∆T

V + ∆U VT )

≥− (1 + δ)σr(M)(‖∆2V ‖2F + ‖∆2

U‖2F ) + (1 + δ)σr(M)(‖∆2V ‖2F + ‖∆2

U‖2F ) = 0.

This shows that (U , V ) satisfies the second-order necessary optimality conditions, and therefore it isa spurious second-order critical point.

Now, we consider problem (5). Since the point (U , V ) satisfies UT U = V T V , it is also a localminimum of the regularization term. Hence, the point (U , V ) is also a spurious second-order criticalpoint of the regularized problem (5).

D.2 Proof of Corollary 1

Proof of Corollary 1. We assume that problem (3) has a spurious second-order critical point. By thenecessity part of Theorem (6), there exist α ∈ (1− δ, 1 + δ) and real numbers σ, λ, a, b, c, d such that

(1 + δ)σ ≥ λ > 0, α−1(2α− 1 + δ2)cd · λ ≥ λ2 > 0,

cd · λ = α[σ2 − 2ab · σ + (ab)2 + (ad)2 + (cb)2 + (cd)2]. (28)

We first relax the second line to

cd · λ ≥ α[σ2 − 2|ab| · σ + (ab)2 + 2|ab| · |cd|+ (cd)2]. (29)

Then, we denote x := |ab| and consider the quadratic programming problem

minx≥0

x2 + 2(|cd| − σ) · x,

whose optimal value is−(σ − |cd|)2

+,

where (t)+ := max{t, 0}. Substituting into inequality (29), we obtain

cd · λ ≥ α[σ2 − (σ − |cd|)2+ + (cd)2]. (30)

Then, we consider two different cases.

21

Page 22: General Low-rank Matrix Optimization: Geometric Analysis

Case I. We first consider the case when σ ≥ |cd|. In this case, the inequality (30) becomes

cd · λ ≥ 2α · σ|cd| = 2α · σcd,where the last equality is due to cd > 0. Therefore,

λ ≥ 2α · σ.The second inequality in (28) implies λ ≤ α−1(2α − 1 + δ2) · cd. Combining with the aboveinequality and the assumption of this case, it follows that

α−1(2α− 1 + δ2) · σ ≥ α−1(2α− 1 + δ2) · cd ≥ 2α · σ,which is further equivalent to

α−1(2α− 1 + δ2) ≥ 2α ⇐⇒ δ2 ≥ 2α2 − 2α+ 1.

Since 2α2 − 2α+ 1 ≥ 1/2, we arrive at δ2 ≥ 1/2, which is a contradiction to δ < 1/2.

Case II. We then consider the case when σ ≤ |cd|. In this case, the inequality (30) becomes

cd · λ ≥ α[σ2 + (cd)2].

Combining with the second inequality in (28), we obtain λ ≤ α−1(2α− 1 + δ2) · (cd). Therefore,

α−1(2α− 1 + δ2) · (cd)2 ≥ cd · λ ≥ α[σ2 + (cd)2].

Moreover, the first inequality in (28) gives

(1 + δ)σ · cd ≥ cd · λ ≥ α[σ2 + (cd)2].

By denoting y := cd, the above two inequalities become

α−1(2α− 1 + δ2) · y2 ≥ α[σ2 + y2],

(1 + δ)σ · y ≥ α[σ2 + y2]. (31)

By denoting z := y/σ, the first inequality in (31) implies

z2 ≥ α2

δ2 − (1− α)2. (32)

Since δ < 1/2, one can write

(1− α)2 + α2 ≥ 1

2>

1

4> δ2,

which is equivalent to α2 ≥ δ2 − (1− α)2. Therefore, inequality (32) implies that z2 ≥ 1 and

z2 +1

z2≥ α2

δ2 − (1− α)2+δ2 − (1− α)2

α2. (33)

On the other hand, the second inequality in (31) implies

z +1

z≤ 1 + δ

αand thus z2 +

1

z2+ 2 ≤ (1 + δ)2

α2.

Combining with inequality (33), it follows that

α2

δ2 − (1− α)2+δ2 − (1− α)2

α2+ 2 ≤ (1 + δ)2

α2. (34)

By some calculation, the above inequality is equivalent to

(δ2 + 2δ + 5) · α2 + (2δ2 − 4δ − 6) · α+ 2(1 + δ)(1− δ2) ≤ 0.

Checking the discriminant of the above quadratic function, we obtain

(2δ2 − 4δ − 6)2 − 8(δ2 + 2δ + 5)(1 + δ)(1− δ2) ≥ 0,

which is equivalent to4(2δ − 1)(δ + 1)4 ≥ 0.

However, the above claim contradicts the assumption that δ < 1/2.

In summary, the contradictions in the two cases imply that the condition (28) cannot hold, andtherefore there does not exist spurious second-order critical points.

22

Page 23: General Low-rank Matrix Optimization: Geometric Analysis

D.3 Counterexample for the Rank-one Case

Example 3. Let ei ∈ Rn be the i-th standard basis of Rn. We define the tensor

H :=

n∑i,j=1

(eieTj )⊗ (eie

Tj ) +

1

2(e1e

T1 )⊗ (e2e

T2 ) +

1

2(e2e

T2 )⊗ (e1e

T1 )

+1

4

[(e1e

T2 )⊗ (e1e

T2 ) + (e2e

T1 )⊗ (e2e

T1 )]

+1

4(e1e

T2 )⊗ (e2e

T1 ) +

1

4(e2e

T1 )⊗ (e1e

T2 )

and the objective function

fa(M) := (M − e1eT1 ) : H : (M − e1e

T1 ) ∀M ∈ Rn×n.

The global minimizer of fa(·) is the rank-1 matrix M∗ := e1eT1 . It has been proved in Zhang et al.

(2019) that the function fa(·) satisfies the δ-RIP2,2 property with δ = 1/2. Moreover, we define

U :=1√2e2, V := U, M := UUT 6= M∗.

It has been proved in Zhang et al. (2019) that the first-order optimality condition is satisfied. To verifythe second-order necessary condition, we can calculate that

[∇2ha(U,U)](∆,∆) = 2〈∇fa(M),∆U∆TV 〉+ (U∆T

V + ∆UUT ) : H : (U∆T

V + ∆UUT )

= −3

2(∆U )1(∆V )1 +

5

8

[(∆U )2

1 + (∆V )21

]+

1

4(∆U )1(∆V )1

+1

2[(∆U )2 + (∆V )2]

2+

1

2

n∑i=3

[(∆U )2

i + (∆V )2i

]=

5

8[(∆U )1 − (∆V )1]

2+

1

2[(∆U )2 + (∆V )2]

2+

1

2

n∑i=3

[(∆U )2

i + (∆V )2i

],

which is non-negative for every ∆ ∈ Rn. Hence, we conclude that the point M is a spurioussecond-order critical point of problem (3). Moreover, since we choose V = U , the point M is aglobal minimizer of the regularizer ‖UTU − V TV ‖2F and thus M is also a spurious second-ordercritical point of problem (5).

D.4 Proof of Corollary 2

Proof of Corollary 2. We first consider the case when δ ≤ 1/3. We assume that there exists aspurious second-order critical point M . Then, by Theorem 4, we know that there exists a constantα ∈ (1− δ, (1 + δ)/2]. This means that

1− δ < 1 + δ

2,

which contradicts the assumption that δ ≤ 1/3.

Then, we consider the case when δ < 1/2. With no loss of generality, assume that M 6= M∗ andM∗ 6= 0; otherwise, the inequality in this theorem is trivially true. Define

m11 := ‖Σ‖2F , m12 := 〈Σ, ABT 〉, m22 := ‖ABT ‖2F + ‖ADT ‖2F + ‖CBT ‖2F + ‖CDT ‖2F .

By our construction in Theorem 4, we know that

m11 = ‖M‖2F , m12 = 〈M,M∗〉, m22 = ‖M∗‖2F .

Therefore, we only need to prove m12 ≥ C(δ) · √m11m22 for some constant C(δ) > 0. By theanalysis in Ha et al. (2020), we know that the second-order critical point M must have rank r andthus m11 6= 0. The remainder of the proof is split into two steps.

23

Page 24: General Low-rank Matrix Optimization: Geometric Analysis

Step I. First, we prove that

(m11 +m22 − 2m12)2

m11m22 −m212

≤ (1 + δ)2

α2,

(m11 −m12)2

m11m22 −m212

≤ δ2 − (1− α)2

α2. (35)

We first rule out the case when m11m22−m212 = 0. In this case, the equality condition of the Cauchy

inequality shows that there exists a constant t such that

M = tM∗.

Since M 6= 0, the constant t is not 0. Using the mean value theorem, for any Z ∈ Rn×m, there existsa constant c ∈ [0, 1] such that

〈∇fa(M), Z〉 = ∇2f [M∗ + c(M −M∗)](M −M∗, Z)

= ∇2f [M∗ + c(M −M∗)][(t− 1)M∗, Z].

The δ-RIP2r,2r property gives

〈∇fa(M), M〉 = ∇2f [M∗ + c(M −M∗)][(t− 1)M∗, tM∗] ≥ t(t− 1)(1− δ)‖M∗‖2F .

If t = 1, we conclude that M = M∗, which contradicts the assumption that M 6= M∗. Therefore, itholds that

〈M,∇fa(M)〉 6= 0.

This contradicts the first-order optimality condition, which states that 〈M,∇fa(M)〉 = 0. Hence,we have proved that inequality (35) is well defined. We consider the decomposition[

0 00 Λ

]= c1

[Σ 00 0

]+ c2

[AC

] [BD

]T+K,

⟨K,

[Σ 00 0

]⟩=

⟨K,

[AC

] [BD

]T⟩= 0.

Using the conditions in Theorem 4, it follows that⟨[0 00 Λ

],

[Σ 00 0

]⟩= 0,

⟨[0 00 Λ

],

[AC

] [BD

]T⟩= α(m11 − 2m12 +m22).

The pair of coefficients (c1, c2) can be uniquely solved as

c1 = −α · m11 +m22 − 2m12

m11m22 −m212

·m12, c2 = α · m11 +m22 − 2m12

m11m22 −m212

·m11.

Using the orthogonality of the decomposition, we have

‖Λ‖2F ≥

∥∥∥∥∥c1[Σ 00 0

]+ c2

[AC

] [BD

]T∥∥∥∥∥2

F

= c21m11 + 2c1c2m12 + c22m22

= α2 · m11(m11 +m22 − 2m12)2

m11m22 −m212

. (36)

Using the last two lines of condition (6), one can write

α2 · m11(m11 +m22 − 2m12)2

m11m22 −m212

≤ ‖Λ‖2F

≤ (2α− 1 + δ2)[tr(Σ2)− 2〈Σ, ABT 〉+ ‖ABT ‖2F + ‖ADT ‖2F + ‖CBT ‖2F + ‖CDT ‖2F

]= (2α− 1 + δ2)(m11 − 2m12 +m22).

Simplifying the above inequality, we arrive at the second inequality in (35). Now, the first inequalityin condition (6) implies that

‖Λ‖2F ≤ (1 + δ)2‖Σ‖2F = (1 + δ)2m11.

Substituting inequality (36) into the left-hand side, it follows that

α2 · m11(m11 +m22 − 2m12)2

m11m22 −m212

≤ (1 + δ)2m11,

which is equivalent to the first inequality in (35).

24

Page 25: General Low-rank Matrix Optimization: Geometric Analysis

Step II. Next, we prove the existence of C(δ). We denote

κ :=m12√m11m22

∈ (−1, 1).

and

C1 :=δ2 − (1− α)2

α2, C2 :=

(1 + δ)2

α2, t :=

√m11

m22.

Since M 6= 0, we have t > 0. The inequalities in (35) can be written as(t− κ)2 ≤ (1− κ2)C1, (t+ 1/t− 2κ)2 ≤ (1− κ2)C2. (37)

Using the assumption that δ < 1/2, we can write

δ2 <1

4< (1− α)2 +

1

2α2,

which leads to

C1 =δ2 − (1− α)2

α2<

1

2.

If κ+√

(1− κ2)C1 ≥ 1, then

|κ| ≥ 1− C1

1 + C1≥ 1

3> 0. (38)

If κ < 0, then it holds that

κ+√

(1− κ2)C1 ≤ −1

3+

√1

2< 1,

which contradicts the assumption. Therefore, we have κ ≥ 0 and inequality (38) gives κ ≥ 1/3.

Now, we assume that κ+√

(1− κ2)C1 ≤ 1. Then, the first inequality in (37) gives

0 < t ≤ κ+√

(1− κ2)C1 ≤ 1,

which further leads to

t+1

t− 2κ ≥ −κ+

√(1− κ2)C1 +

1

κ+√

(1− κ2)C1

.

Combining with the second inequality in (37), we obtain

−κ+√

(1− κ2)C1 +1

κ+√

(1− κ2)C1

≤√

(1− κ2)C2.

The above inequality can be simplified to√1− κ2(1 + C1 −

√C1C2) ≤ κ

√C2.

We notice that the inequality 1 + C1 −√C1C2 ≤ 0 is equivalent to inequality (34), which cannot

hold when δ < 1/2. Therefore, we have 1+C1−√C1C2 > 0 and κ > 0. Then, the above inequality

is equivalent to(1− κ2)(1 + C1 −

√C1C2)2 ≤ κ2 · C2.

Therefore, we have

κ2 ≥ (1 + C1 −√C1C2)2

(1 + C1 −√C1C2)2 + C2

= 1− 1

1 + η2,

where we define

η :=1 + C1 −

√C1C2√

C2

.

To prove the existence of C(δ) such that κ ≥ C(δ) > 0, we only need to show that η is lowerbounded by a positive constant. With δ fixed, η can be viewed as a continuous function of α. Sinceη = (1 − δ)/(1 + δ) > 0 when α = 1 − δ, the function/parameter η is defined for all α in thecompact set [1− δ, (1 + δ)/2]. Combining with the fact that 1 +C1 −

√C1C2 > 0, the function η is

positive on a compact set, and thus there exists a positive lower bound C(δ) > 0.

In summary, we can define the function

C(δ) := min

{1

3, C(δ)

}> 0

such that κ ≥ C(δ) for every spurious second-order critical point M .

25

Page 26: General Low-rank Matrix Optimization: Geometric Analysis

D.5 Counterexample for the General Rank Case with Linear Measurements

Example 4. Using the previous rank-1 example, we design a counterexample with linear measure-ment for the rank-r case. Let n ≥ 2r be an integer and ei ∈ Rn be the i-th standard basis of Rn. Wedefine the tensor

H :=3

2

n∑i,j=1

(eieTj )⊗ (eie

Tj ) +

r∑i=1

{− 1

2

[(e2i−1e

T2i−1)⊗ (e2i−1e

T2i−1) + (e2ie

T2i)⊗ (e2ie

T2i)]

+1

2

[(e2i−1e

T2i−1)⊗ (e2ie

T2i) + (e2ie

T2i)⊗ (e2i−1e

T2i−1)

]− 1

4

[(e2i−1e

T2i)⊗ (e2i−1e

T2i) + (e2ie

T2i−1)⊗ (e2ie

T2i−1)

]+

1

4

[(e2i−1e

T2i)⊗ (e2ie

T2i−1) + (e2ie

T2i−1)⊗ (e2i−1e

T2i)] }

and the rank-r global minimum

U∗ := [e1 e3 · · · e2r−1] , M∗ := U∗(U∗)T =

r∑i=1

e2i−1eT2i−1.

The objective function is defined as

fa(M) := (M −M∗) : H : (M −M∗) ∀M ∈ Rn×n.We can similarly prove that the function fa(·) satisfies the δ-RIP2r,2r property with δ = 1/2.Moreover, we define

U :=1√2

[e2 e4 · · · e2r] , M := U UT =1

2

r∑i=1

e2ieT2i 6= M∗.

The gradient of fa(·) at point M is

∇fa(M) = −3

4

r∑i=1

e2i−1eT2i−1 ∈ R2r×2r.

Since the column and row spaces of the gradient are orthogonal to those of M , the first-orderoptimality condition is satisfied. To verify the second-order necessary condition, we can similarlycalculate that

[∇2ha(U , U)](∆,∆)

=2〈∇fa(M),∆U∆TV 〉+ (U∆T

V + ∆V UT ) : H : (U∆T

V + ∆U UT )

=− 3

2

r∑i=1

r∑j=1

(∆U )2i−1,j

r∑j=1

(∆V )2i−1,j

+

r∑i=1

{5

8

[(∆U )2

2i−1,i + (∆V )22i−1,i

]+

1

4(∆U )2i−1,i(∆V )2i−1,i +

1

2[(∆U )2i,i + (∆V )2i,i]

2}

+∑

1≤i,j≤n,i 6=j

3

4[(∆U )2j,i + (∆V )2i,j ]

2+

∑1≤i,j≤n,i 6=j

3

4

[(∆U )2

2j−1,i + (∆V )22j−1,i

]=

r∑i=1

{5

8[(∆U )2i−1,i − (∆V )2i−1,i]

2+

1

2[(∆U )2i,i + (∆V )2i,i]

2}

+∑

1≤i,j≤n,i 6=j

3

4[(∆U )2j,i + (∆V )2i,j ]

2+

∑1≤i,j≤n,i 6=j

3

4[(∆U )2j−1,i − (∆V )2j−1,i]

2,

which is non-negative for every ∆ ∈ Rn×r. Hence, the point M is a spurious second-order criticalpoint of problem (3). Moreover, since we choose V = U , the point M is a global minimizer of theregularizer ‖UT U − V T U‖2F and thus M is also a spurious second-order critical point of problem(5).

26

Page 27: General Low-rank Matrix Optimization: Geometric Analysis

E Proofs for Section 4

E.1 Proof of Theorem 6

In this subsection, we use the following notations:

M := UV T , M∗ := U∗(V ∗)T , W :=

[UV

], W ∗ :=

[U∗

V ∗

], W :=

[U−V

], W ∗ :=

[U∗

−V ∗],

where M∗ := M∗a is the global optimum. We always assume that U∗ and V ∗ satisfy (U∗)TU∗ =(V ∗)TV ∗. When there is no ambiguity about W , we use W ∗ to denote the minimizer ofminX∈X∗ ‖W − X‖F , where X ∗ is the set of global minima of problem (5). We note that theset X ∗ is the trajectory of a global minimum (U∗, V ∗) under the orthogonal group:

X ∗ = {(U∗R, V ∗R) | R ∈ Rr×r, RTR = RRT = Ir}.

Therefore, the set X ∗ is a compact set and its minimum can be attained. With this choice, it holds that

dist(W,X ∗) = ‖W −W ∗‖F .

We first summarize some technical results in the following lemma.Lemma 1 (Tu et al. (2016); Zhu et al. (2018)). The following statements hold for every U ∈ Rn×r,V ∈ Rm×r and W ∈ R(n+m)×r:

• 4‖M −M∗‖2F ≥ ‖WWT −W ∗(W ∗)T ‖2F − ‖UTU − V TV ‖2F .

• ‖W ∗(W ∗)T ‖2F = 4‖M∗‖2F .

• If rank(W ∗) = r, then ‖WWT −W ∗(W ∗)T ‖2F ≥ 2(√

2− 1)σ2r(W ∗)‖W −W ∗‖2F .

• If rank(U∗) = r, then ‖UUT − U∗(U∗)T ‖2F ≥ 2(√

2− 1)σ2r(U∗)‖U − U∗‖2F .

The proof of Theorem 6 follows from the following sequence of lemmas. We first identify two caseswhen the gradient is large. The following lemma proves that an unbalanced solution cannot be afirst-order critical point.Lemma 2. Given a constant ε > 0, if

‖UTU − V TV ‖F ≥ ε,

then‖∇ρ(U, V )‖F ≥ µ(ε/r)3/2.

Proof. Using the relationship between the 2-norm and the Frobenius norm, we have

‖UTU − V TV ‖2 ≥ r−1‖UTU − V TV ‖F ≥ ε/r.

Let q ∈ Rr be an eigenvector of UTU − V TV such that

‖q‖2 = 1,∣∣qT (UTU − V TV )q

∣∣ = ‖UTU − V TV ‖2.

We consider the direction∆ := W qqT .

Then, we can calculate that

‖∆‖2F = tr(W qqT qqT WT

)= tr

(qT WT W q

)= qT (UTU + V TV )q.

In addition, we have

〈∇ha(U, V ),∆〉 =

⟨[∇fa(M)V

[∇fa(M)]TU

],

[UqqT

−V qqT]⟩

= tr[V T [∇fa(M)]TUqqT

]− tr

[UT∇fa(M)V qqT

]= qT

[V T [∇fa(M)]TU

]q − qT

[UT∇fa(M)V

]q = 0.

27

Page 28: General Low-rank Matrix Optimization: Geometric Analysis

and ∣∣∣⟨µ4∇g(U, V ),∆

⟩∣∣∣ = µ∣∣∣⟨WWTW,WqqT

⟩∣∣∣= µ

∣∣tr [(UTU − V TV )(UTU + V TV )qqT]∣∣

= µ∣∣qT (UTU − V TV )(UTU + V TV )q

∣∣= µ‖UTU − V TV ‖2 · qT (UTU + V TV )q

= µ‖UTU − V TV ‖2 ·√qT (UTU + V TV )q · ‖∆‖F .

Hence, Cauchy’s inequality implies that

‖∇ρ(U, V )‖F ≥|〈∇ρ(U, V ),∆〉|

‖∆‖F= µ‖UTU − V TV ‖2 ·

√qT (UTU + V TV )q.

Using the fact that

qT (UTU + V TV )q ≥∣∣qT (UTU − V TV )q

∣∣ = ‖UTU − V TV ‖2,

we obtain‖∇ρ(U, V )‖F ≥ µ‖UTU − V TV ‖3/22 ≥ µ(ε/r)3/2.

The next lemma proves that a solution with large norm cannot be a first-order critical point.Lemma 3. Given a constant ε > 0, if

1− δ3≤ µ < 1− δ, ‖WWT ‖3/2F ≥ max

{(1 + δ

1− µ− δ

)2

‖W ∗(W ∗)T ‖3/2F ,4√rλ

1− µ− δ

},

then‖∇ρ(U, V )‖F ≥ λ.

Proof. Choosing the direction ∆ := W , we can calculate that

〈∇ρ(U, V ),∆〉 = 2〈∇fa(UV T ), UV T 〉+ µ‖UTU − V TV ‖2F . (39)

Using the δ-RIP2r,2r property, we have

[∇2fa(N)](M,M) ≥ (1− δ)‖M‖2F , [∇2fa(N)](M∗,M) ≤ (1 + δ)‖M‖F ‖M∗‖F ,

where N ∈ Rn×m is every matrix with rank at most 2r. Then, the first term can be estimated as

〈∇fa(UV T ), UV T 〉 =

∫ 1

0

[∇2fa(M∗ + s(M −M∗)][M −M∗,M ] ds

≥ (1− δ)‖M‖2F − (1 + δ)‖M∗‖F ‖M‖F .

The second term is

µ‖UTU − V TV ‖2F = µ(‖UUT ‖2F + ‖V V T ‖2F

)− 2µ‖M‖2F .

Substituting into equation (39), it follows that

〈∇ρ(U, V ),∆〉 ≥ µ(‖UUT ‖2F + ‖V V T ‖2F

)+ 2(1− δ − µ)‖M‖2F − 2(1 + δ)‖M∗‖F ‖M‖F

≥ µ(‖UUT ‖2F + ‖V V T ‖2F

)+ 2(1− δ − µ)‖M‖2F − 2c‖M‖2F −

(1 + δ)2

2c‖M∗‖2F

≥ min {µ, 1− δ − µ− c} ‖WWT ‖2F −(1 + δ)2

2c‖M∗‖2F ,

where c ∈ (0, 1−δ−µ) is a constant to be designed later. Using equality that (U∗)TU∗ = (V ∗)TV ∗,Lemma 1 gives

‖W ∗(W ∗)T ‖2F = 4‖M∗‖2F .

28

Page 29: General Low-rank Matrix Optimization: Geometric Analysis

As a result,

〈∇ρ(U, V ),∆〉 ≥ min {µ, 1− δ − µ− c} ‖WWT ‖2F −(1 + δ)2

8c‖W ∗(W ∗)T ‖2F .

Now, choosing

c =1− δ − µ

2and noticing that µ ≥ (1− δ − µ)/2, it yields that

〈∇ρ(U, V ),∆〉 ≥ 1− δ − µ2

‖WWT ‖2F −(1 + δ)2

4(1− δ − µ)‖W ∗(W ∗)T ‖2F . (40)

On the other hand,‖∆‖F = ‖W‖F ≤

√r‖WWT ‖1/2F .

Combining with inequality (40) and using the assumption of this lemma, one can write

‖∇ρ(U, V )‖F ≥〈∇ρ(U, V ),∆〉‖∆‖F

≥ 1− δ − µ2√r‖WWT ‖3/2F − (1 + δ)2

4√r(1− δ − µ)

‖W ∗(W ∗)T ‖2F ‖WWT ‖−1/2F

≥ 1− δ − µ2√r‖WWT ‖3/2F − (1 + δ)2

4√r(1− δ − µ)

‖W ∗(W ∗)T ‖3/2F

≥ 1− δ − µ4√r‖WWT ‖3/2F ≥ λ.

Using the above two lemmas, we only need to focus on points such that

‖UTU − V TV ‖F = o(1), ‖WWT ‖F = O(1).

The following lemma proves that if (U, V ) is an approximate first-order critical point with a smallsingular value σr(W ), then the Hessian of the objective function at this point has a negative curvature.Lemma 4. Consider positive constants α,C, ε, λ such that

ε2 ≤ (√

2− 1)σ2r(W ∗) · α2, G > µ

(ε+

4H2

G2

)+

(1 + δ)H2

G2, (41)

where G := ‖∇fa(M)‖2 and H := λ+ µεC. If

‖UTU − V TV ‖2F ≤ ε2, ‖WWT ‖F ≤ C2, ‖W −W ∗‖F ≥ α, ‖∇ρ(U, V )‖F ≤ λ

and

σ2r(W ) ≤ 2

1 + δ

[G− µ

(ε+

4H2

G2

)− (1 + δ)H2

G2

]− 2τ (42)

for some positive constant τ , then it holds that

λmin(∇2ρ(U, V )) ≤ −(1 + δ)τ.

Proof. We choose a singular vector q of W such that

‖q‖2 = 1, ‖Wq‖2 = σr(W ).

Since ‖Wq‖2 =√‖Uq‖22 + ‖V q‖22, we have

‖Uq‖22 + ‖V q‖22 = σ2r(W ).

We choose singular vectors u and v such that

‖u‖2 = ‖v‖2 = 1, ‖∇fa(M)‖2 = uT∇fa(M)v.

29

Page 30: General Low-rank Matrix Optimization: Geometric Analysis

We define the direction as

∆U := −uqT , ∆V := vqT , ∆ :=

[∆U

∆V

], ∆ :=

[∆U

−∆V

].

For the Hessian of ha(·, ·), we can calculate that

〈∇fa(M),∆U∆TV 〉 = −‖∇fa(M)‖2 = −G (43)

and the δ-RIP2r,2r property gives

[∇2fa(M)](∆UVT + U∆T

V ,∆UVT + U∆T

V )

≤ (1 + δ)‖∆UVT + U∆T

V ‖2F = (1 + δ)‖ − u(V q)T + (Uq)vT ‖2F= (1 + δ)

(‖V q‖2F + ‖Uq‖2F

)− 2(1 + δ)[qT (UTu)] · [qT (V T v)]

≤ (1 + δ)σ2r(W ) + 2(1 + δ) · ‖UTu‖F ‖V T v‖F . (44)

Then, we consider the terms coming from the Hessian of the regularizer. First, we have

〈∆WT ,∆WT 〉 ≤ ‖UTU − V TV ‖F · ‖∆TU∆U −∆T

V ∆V ‖F≤ ε ·

[‖∆T

U∆U‖F + ‖∆TV ∆V ‖F

]= 2ε. (45)

Next, we can estimate that

〈W ∆T ,∆WT 〉+ 〈WWT ,∆∆T 〉 =1

2‖UT∆U + ∆T

UU − V T∆V −∆TV V ‖2F

≤ 4(‖UT∆U‖2F + ‖V T∆V ‖2F

)= 4

(‖(UTu)qT ‖2F + ‖(V T v)qT ‖2F

)= 4

(‖UTu‖2F + ‖V T v‖2F

). (46)

Using the assumption that ‖WWT ‖F ≤ C2 and ‖UTU − V TV ‖2F ≤ ε2, one can write

‖WWTW‖2F ≤ ‖UTU − V TV ‖2F · ‖UTU + V TV ‖F ≤ ε2‖WWT ‖F ≤ ε2C2

and ∥∥∥∥[ ∇fa(UV T )V∇fa(UV T )TU

]∥∥∥∥F

= ‖∇ρ(U, V )− µWWTW‖F ≤ λ+ µεC = H. (47)

The second relation implies that

‖∇fa(UV T )V ‖2 ≤ ‖∇fa(UV T )V ‖F ≤ H, ‖UT∇fa(UV T )‖2 ≤ ‖UT∇fa(UV T )‖F ≤ H.(48)

By the definition of u and v, it holds that

‖v‖2 = 1, ‖∇fa(M)‖2u = ∇fa(M)v.

Therefore,

‖UTu‖2F =‖UT∇fa(M)v‖2F‖∇fa(M)‖22

≤ ‖UT∇fa(M)‖2F ‖v‖22‖∇fa(M)‖22

≤ H2

G2.

Similarly,

‖V T v‖2F ≤H2

G2.

Substituting into (44) and (46) yields that

[∇2fa(M)](∆UVT + U∆T

V ,∆UVT + U∆T

V ) ≤ (1 + δ)σ2r(W ) + 2(1 + δ) · H

2

G2(49)

and

〈W ∆T ,∆WT 〉+ 〈WWT ,∆∆T 〉 ≤ 8 · H2

G2. (50)

30

Page 31: General Low-rank Matrix Optimization: Geometric Analysis

Combining (43), (45), (49) and (50), it follows that

[∇2ρ(U, V )](∆,∆) ≤ −2G+ (1 + δ)σ2r(W ) + 2µε+ [8µ+ 2(1 + δ)] · H

2

G2.

Since ‖∆‖2F = 2, the above relation implies

λmin(∇2ρ(U, V )) ≤ −G+1 + δ

2σ2r(W ) + µε+ (4µ+ 1 + δ) · H

2

G2≤ −(1 + δ)τ.

Remark 2. The positive constants ε and λ in the proof of Lemma 4 can be chosen to be arbitrarilysmall with α,C fixed. Hence, we may choose small enough ε and λ such that the assumptions givenin inequality (41) are satisfied. This lemma resolves the case when the minimal singular value σ2

r(W )is on the order of ‖∇fa(M)‖2/(2 + 2δ). In the next lemma, we will show that this is the only casewhen δ < 1/3.

The final step is to prove that condition (42) always holds provided that δ < 1/3 and ε, λ, τ = o(1).Lemma 5. Given positive constants α,C, ε, λ, if

‖UTU − V TV ‖2F ≤ ε2, max{‖WWT ‖F , ‖W ∗(W ∗)T ‖F } ≤ C2,

‖W −W ∗‖F ≥ α, ‖∇ρ(U, V )‖F ≤ λ, δ < 1/3,

then the inequality G ≥ cα holds for some constant c > 0 independent of α, ε, λ, C. Furthermore,there exist two positive constants

ε0(δ, µ, σr(M∗a ), ‖M∗a‖F , α, C), λ0(δ, µ, σr(M

∗a ), ‖M∗a‖F , α, C)

such that

σ2r(W ) ≤ 2

1 + δ

[G− µ

(2ε+

4H2

G2

)− (1 + δ)H2

G2

](51)

whenever0 <ε ≤ ε0(δ, µ, σr(M

∗a ), ‖M∗a‖F , α, C),

0 <λ ≤ λ0(δ, µ, σr(M∗a ), ‖M∗a‖F , α, C).

Here, G and H are defined in Lemma 4.

Proof. We first prove the existence of the constant c. Using Lemma 1, one can write4‖M −M∗‖2F ≥ ‖WWT −W ∗(W ∗)T ‖2F − ‖UTU − V TV ‖2F ≥ ‖WWT −W ∗(W ∗)T ‖2F − ε2.Using Lemma 1 and the assumption that ‖W −W ∗‖F ≥ α, we have

‖M −M∗‖2F ≥√

2− 1

2σ2r(W ∗)‖W −W ∗‖2F −

ε2

4≥√

2− 1

2σ2r(W ∗) · α2 − ε2

4. (52)

By the definition of ε, it follows that

‖M −M∗‖2F ≥√

2− 1

4σ2r(W ∗) · α2 > 0.

Thus, the δ-RIP2r,2r property gives

‖∇fa(M)‖F ≥〈∇fa(M),M −M∗〉‖M −M∗‖F

≥ (1− δ)‖M −M∗‖F ≥

√√2− 1

4· σr(W ∗)(1− δ) · α.

Hence, we have

G = ‖∇fa(M)‖2 ≥

√√2− 1

4r· σr(W ∗)(1− δ) · α = cα,

where we define

c :=

√√2− 1

4r· σr(W ∗)(1− δ).

Next, we prove inequality (51) by contradiction, i.e., we assume

σ2r(W ) >

2

1 + δ

[G− µ

(2ε+

4H2

G2

)− (1 + δ)H2

G2

]≥ 2cα

1 + δ+ poly(ε, λ). (53)

The remainder of the proof is divided into three steps.

31

Page 32: General Low-rank Matrix Optimization: Geometric Analysis

Step I. We first develop a lower bound for σr(M). We choose a vector p ∈ Rr such that

‖p‖F = 1, UTUp = σ2r(U) · p.

It can be shown that

‖(Wp)TW‖F = ‖pTUTU + pTV TV ‖F ≤ 2‖pTUTU‖F + ‖pT (V TV − UTU)‖F≤ 2σ2

r(U) + ‖pT ‖F ‖V TV − UTU‖F ≤ 2σ2r(U) + ε.

On the other hand, since W has rank r, it holds that∥∥(Wp)TW∥∥F≥ σ2

r(W ) · ‖p‖F = σ2r(W ).

Combining the above two estimates, we arrive at

2σ2r(U) ≥ σ2

r(W )− ε > 0,

where the last inequality is from the assumption that ε, λ are small and σr(W ) is lower bounded bya positive value in (53). Using the inequality that

√1− x ≥ 1 − x for every x ∈ [0, 1], the above

inequality implies that

σr(U) ≥ 1√2σr(W ) ·

√1− ε

σ2r(W )

≥ 1√2σr(W )− ε√

2σr(W ). (54)

Similarly, one can prove that

σr(V ) ≥ 1√2σr(W )− ε√

2σr(W ).

When ε is small enough, we know that σr(U), σr(V ) 6= 0 and both U, V have rank r. To lowerbound the singular value σr(M), we consider vectors x such that ‖x‖2 = 1 and lower boundxTV (UTU)V Tx. Since the range of V (UTU)V T is a subspace of the range of V and the range ofV has exactly dimension r, directions x that are in the orthogonal complement of the range of Vcorrespond to exactly m− r zero singular values. Hence, to estimate the r-th largest singular valueof M , we only need to consider directions that are in the range of V . Namely, we only considerdirections that have the form x = V y for some vector y. Then, we have

xTV (UTU)V Tx = yT (V TV )(UTU)(V TV )y

= yT (V TV )3y + yT (V TV )(UTU − V TV )(V TV )y.

First, we bound the second term by calculating that

‖V (V TV − UTU)V T ‖2 ≤ ‖V ‖22‖UTU − V TV ‖2 ≤ ‖V TV ‖F ‖UTU − V TV ‖F≤ ‖WTW‖F ‖UTU − V TV ‖F ≤ C2ε.

This implies thatyT (V TV )(UTU − V TV )(V TV )y ≥ −C2ε · ‖V y‖2F .

Next, we assume that y has the decomposition

y =

r∑i=1

civi,

where vi is an eigenvector of V TV associated with the eigenvalue σ2i (V ). Then, we can calculate

that

yT (V TV )3y =

r∑i=1

c2iσ6i (V ), ‖V y‖2F =

r∑i=1

c2iσ2i (V ) = 1.

Combining the above estimates leads to

xTV (UTU)V Tx ≥[∑r

i=1 c2iσ

6i (V )∑r

i=1 c2iσ

2i (V )

− C2ε

]· ‖V y‖2F

=

∑ri=1 c

2iσ

6i (V )∑r

i=1 c2iσ

2i (V )

− C2ε ≥ σ4r(V )− C2ε.

32

Page 33: General Low-rank Matrix Optimization: Geometric Analysis

This implies that

σ2r(M) ≥ σ4

r(V )− C2ε ≥[

1√2σr(W )− ε√

2σr(W )

]4

− C2ε

≥ 1

4σ4r(W )− σ2

r(W )ε− σ−2r (W )ε3 − C2ε

≥ 1

4σ4r(W )− σ−2

r (W )ε3 − 2C2ε

≥ 1

4σ4r(W )− 1 + δ

G· ε3 − 2C2ε

≥ 1

4σ4r(W )− 1 + δ

cα· ε3 − 2C2ε. (55)

where the second last inequality is due to (53) and the assumption that ε and λ are sufficiently small.

Step II. Next, we derive an upper bound for σr(M). We define

M := Pr[M − 1

1 + δ∇fa(M)

],

where Pr is the orthogonal projection onto the low-rank set via SVD. Since M 6= M∗ and δ < 1/3,we recall that inequality (13) gives

−φ(M) ≥ 1− 3δ

1− δ[fa(M)− fa(M∗)] ≥ 1− 3δ

2‖M −M∗‖2F

≥ 1− 3δ

2

[√2− 1

2σ2r(W ∗)α2 − ε2

4

]:= K,

where the second inequality follows from (52) and

−φ(M) = 〈∇fa(M),M − M〉 − 1 + δ

2‖M − M‖2F .

Hence,

〈∇fa(M),M − M〉 − 1 + δ

2‖M − M‖2F ≥ K. (56)

When we choose ε to be small enough, it holds that K > 0. For simplicity, we define

N := − 1

1 + δ∇fa(M).

Then, M = Pr(M +N) and the left-hand side of (56) is equal to

〈∇fa(M),M − M〉 − 1 + δ

2‖M − M‖2F

= (1 + δ)〈N,Pr(M +N)−M〉 − 1 + δ

2‖Pr(M +N)−M‖2F

=1 + δ

2

[‖N‖2F − ‖N +M − Pr(M +N)‖2F

]=

1 + δ

2

[‖N‖2F − ‖N +M‖2F + ‖Pr(M +N)‖2F

]. (57)

Similar to the proof of inequality (48), we can prove that

‖NV ‖F ≤ H :=H

1 + δ, ‖UTN‖F ≤ H.

Then, we have

− tr[NT (UV T )] ≤ ‖UTN‖F ‖V ‖F ≤ H · ‖W‖F ≤ H ·√√

r‖WWT ‖F ≤ 4√rC · H.

33

Page 34: General Low-rank Matrix Optimization: Geometric Analysis

Using the above relation, we obtain

‖N‖2F − ‖N +M‖2F = −2 tr[NT (UV T )]− ‖M‖2F ≤ 2 4√rC · H − ‖M‖2F .

Suppose that PU and PV are the orthogonal projections onto the column spaces of U and V ,respectively. We define

N1 := PUNPV , N2 := PUN(I − PV ), N3 := (I − PU )NPV , N4 := (I − PU )N(I − PV ).

Then, recalling the assumption (53) and inequality (54), it follows that

‖N1‖F = ‖PUNPV ‖F ≤ σ−1r (U)‖UTPUNPV ‖F ≤ σ−1

r (U)‖UTN‖F ≤√

2σr(W )

σ2r(W )− ε

· H

[√1 + δ

G+ poly(ε, λ)

]· H ≤

[√1 + δ

cα+ poly(ε, λ)

]· H := κH.

Similarly, we can prove that

‖N1 +N2‖F = ‖PUN‖F ≤ κH, ‖N1 +N3‖F = ‖NPV ‖F ≤ κH,which leads to

‖N2‖F ≤ 2κH, ‖N3‖F ≤ 2κH.

Using Weyl’s theorem, the following holds for every 1 ≤ i ≤ r:

|σi(M +N)− σi(M +N4)| ≤ ‖N1 +N2 +N3‖2 ≤ ‖N1 +N2 +N3‖F ≤ 3κH.

Therefore, we have

‖Pr(M +N)‖2F =

r∑i=1

σ2i (M +N)

≥r∑i=1

σ2i (M +N4)− r · 3κH · (‖M +N‖2 + ‖M +N4‖2)

≥r∑i=1

σ2i (M +N4)− 6rκH · (‖M‖2 + ‖N‖2)

≥r∑i=1

σ2i (M +N4)− 6rκH ·

(‖M‖F +

G

1 + δ

). (58)

Using the assumption (53) and the inequality (55), one can write

G

1 + δ≤ σ2

r(W )

2+ poly(

√ε, λ) ≤ σr(M) + poly(

√ε, λ) ≤ ‖M‖F + poly(

√ε, λ), (59)

where poly(√ε, λ) means a polynomial of

√ε and λ. Therefore, we attain the bound

‖M‖F + ‖N‖F ≤ 2‖M‖F + poly(√ε, λ) ≤ 2 · ‖WWT ‖F√

2+ poly(

√ε, λ)

≤√

2C2 + poly(√ε, λ). (60)

Substituting back into the previous estimate (58), it follows that

‖Pr(M+N)‖2F ≥r∑i=1

σ2i (M+N4)−6

√2rκHC2+poly(

√ε, λ) =

r∑i=1

σ2i (M+N4)+poly(

√ε, λ).

Now, since M and N4 have orthogonal column and row spaces, the maximal r singular values ofM+N4 are simply the maximal r singular values of the singular valuesM andN4, which we assumeto be

σi(M), i = 1, . . . , k and σi(N4), i = 1, . . . , r − k.Now, it follows from (57) that

2

1 + δ

[〈∇fa(M),M − M〉 − 1 + δ

2‖M − M‖2F

]

34

Page 35: General Low-rank Matrix Optimization: Geometric Analysis

= ‖N‖2F − ‖N +M‖2F + ‖Pr(M +N)‖2F

≤ −r∑i=1

σ2i (M) +

k∑i=1

σ2i (M) +

r−k∑i=1

σ2i (N4) + poly(

√ε, λ) + 2 4

√rC · H

= −r∑

i=k+1

σ2i (M) +

r−k∑i=1

σ2i (N4) + poly(

√ε, λ)

≤ −(r − k)σ2r(M) + (r − k)‖N4‖22 + poly(

√ε, λ)

≤ −(r − k)σ2r(M) + (r − k)‖N‖22 + poly(

√ε, λ).

If k = r, then the above inequality and inequality (56) imply that

poly(√ε, λ) ≥ K = O(α2),

which contradicts the assumption that ε and λ are small. Hence, it can be concluded that r − k ≥ 1.Combining with (56), we obtain the upper bound

σ2r(M) ≤ − 2

1 + δ· K

r − k+ ‖N‖22 +

1

r − k· poly(

√ε, λ)

= − 2

1 + δ· Kr

+ ‖N‖22 + poly(√ε, λ). (61)

Step III. In the last step, we combine the inequalities (55) and (61), which leads to

1

4σ4r(W )− 1 + δ

cα· ε3 − 2C2ε ≤ − 2

1 + δ· Kr

+1

(1 + δ)2G2 + poly(

√ε, λ).

This means that

σ4r(W ) +

8

1 + δ· Kr≤ 4

(1 + δ)2G2 + poly(

√ε, λ).

Since K > 0 has lower bounds that are independent of ε and λ, we can choose ε and λ to be smallenough such that

σ4r(W ) +

4

1 + δ· Kr≤ 4

(1 + δ)2G2.

However, recalling the assumption (53), we have

σ4r(W ) >

4

(1 + δ)2

[G− µ

(2ε+

4H2

G2

)− (1 + δ)H2

G2

]2

≥ 4

(1 + δ)2G2 − 16

(1 + δ)2G · µε+ poly(

√ε, λ)

≥ 4

(1 + δ)2G2 − 16

(1 + δ)2µε · 1√

2(1 + δ)C2 + poly(

√ε, λ)

=4

(1 + δ)2G2 + poly(

√ε, λ),

where in the third inequality we use inequalities (59)-(60) to conclude that

G ≤ (1 + δ)‖M‖F + poly(√ε, λ) ≤ 1√

2(1 + δ)C2 + poly(

√ε, λ).

The above two inequalities cannot hold simultaneously when λ and ε are small enough. Thiscontradiction means that the condition (51) holds by choosing

0 <ε ≤ ε0(δ, µ, σr(M∗a ), ‖M∗a‖F , α, C),

0 <λ ≤ λ0(δ, µ, σr(M∗a ), ‖M∗a‖F , α, C),

for some small enough positive constants

ε0(δ, µ, σr(M∗a ), ‖M∗a‖F , α, C), λ0(δ, µ, σr(M

∗a ), ‖M∗a‖F , α, C).

35

Page 36: General Low-rank Matrix Optimization: Geometric Analysis

The only thing left is to piecing everything together.

Proof of Theorem 6. We first choose

C :=

[(1 + δ

1− µ− δ

)2

‖W ∗(W ∗)T ‖3/2F

]1/3

.

Then, we select ε1 and λ1 as

ε1(δ, r, µ, σr(M∗a ), ‖M∗a‖F , α) := ε0(δ, r, µ, σr(M

∗a ), ‖M∗a‖F , α, C),

λ1(δ, r, µ, σr(M∗a ), ‖M∗a‖F , α) := min

{λ0(δ, r, µ, σr(M

∗a ), ‖M∗a‖F , α, C),

(1− µ− δ)C3

4√r

}.

Finally, we combine Lemmas 4-5 to get the bounds for the gradient and the Hessian.

E.2 Proof of Theorem 7

In this subsection, we use similar notations:

M := UUT , M∗ := U∗(U∗)T ,

whereM∗ := M∗s is the global optimum. We also assume thatU∗ is the minimizer of minX∈X∗ ‖U−X‖F when there is no ambiguity about U . In this case, the distance is given by

dist(U,X ∗) = ‖U − U∗‖F .

The proof of Theorem 7 is similar to that of Theorem 6. We first consider the case when ‖UUT ‖F islarge.Lemma 6. Given a constant ε > 0, if

‖UUT ‖2F ≥ max

{2(1 + δ)

1− δ‖U∗(U∗)T ‖2F ,

(2λ√r

1− δ

)4/3},

then‖∇hs(U)‖F ≥ λ.

Proof. Choosing the direction ∆ := U , we can calculate that

〈∇hs(U),∆〉 = 〈∇fs(UUT ), UUT 〉.

Using the δ-RIP2r,2r property, we have

〈∇fs(UUT ), UUT 〉 =

∫ 1

0

[∇2fs(M∗ + s(M −M∗)][M −M∗,M ]

≥ (1− δ)‖M‖2F − (1 + δ)‖M∗‖F ‖M‖F

≥ 1− δ2‖M‖2F .

Moreover,‖∆‖F = ‖U‖F ≤

√r‖UUT ‖1/2F .

This leads to

‖∇hs(U)‖F ≥〈∇hs(U),∆〉‖∆‖F

=〈∇fs(UUT ), UUT 〉

‖U‖F≥ 1− δ

2√r‖UUT ‖3/2F ≥ λ.

The next lemma is a counterpart of Lemma 4.

36

Page 37: General Low-rank Matrix Optimization: Geometric Analysis

Lemma 7. Consider positive constants α,C, λ such that

λ ≤ 2(√rC)−1(

√2− 1)σ2

r(U∗) · α2, G >(1 + δ)λ2

4G2,

where G := −λmin(∇fs(M)). If

‖UUT ‖F ≤ C2, ‖U − U∗‖F ≥ α, ‖∇hs(U)‖F ≤ λ,

then the inequality G ≥ cα2 holds for some constant c > 0 independent of α, λ,C. Moreover, ifthere exists some positive constant τ such that

σ2r(U) ≤ 1

1 + δ

[G− (1 + δ)λ2

4G2

]− τ, (62)

thenλmin(∇2hs(U)) ≤ −2(1 + δ)τ.

Proof. We choose a singular vector q of U such that

‖q‖2 = 1, ‖Uq‖2 = σr(U).

We first prove the existence of the constant c. The δ-RIP2r,2r property gives

〈∇fs(M),M∗ −M〉 ≤ −(1− δ)‖M −M∗‖2F .

Using the assumption of this lemma, we have

‖∇fs(M)U‖2 ≤ ‖∇fs(M)U‖F =1

2‖∇hs(U)‖F ≤

1

2λ, (63)

which leads to

〈∇fs(M),M〉 = 〈∇fs(M)U,U〉 ≤ ‖∇fs(M)U‖F ‖U‖F ≤1

2λ ·√rC.

Substituting into (63), it follows that

〈∇fs(M),M∗〉 ≤ −(1− δ)‖M −M∗‖2F +1

2λ ·√rC.

Using Lemma 1, we have

‖M −M∗‖2F ≥ 2(√

2− 1)σ2r(U∗)‖U − U∗‖2F ≥ 2(

√2− 1)σ2

r(U∗) · α2.

By the condition on λ, it follows that

〈∇fs(M),M∗〉 ≤ −(1− δ)‖M −M∗‖2F +1

2λ ·√rC ≤ −(

√2− 1)(1− δ)σ2

r(U∗) · α2. (64)

The above inequality also indicates that λmin(∇fs(M)) < 0. Using the relations that

∇fs(M) � λmin(∇fs(M)) · In, M∗ � 0,

we arrive at

〈∇fs(M),M∗〉 ≥ λmin(∇fs(M)) tr(M∗) ≥√r‖M∗‖F · λmin(∇fs(M)).

Combining the last inequality with (64), we obtain

λmin(∇fs(M)) ≤ −(√r‖M∗‖F )−1(

√2− 1)(1− δ)σ2

r(U∗) · α2 = −cα2

and thus G ≥ cα2, where

c := (√r‖M∗‖F )−1(

√2− 1)(1− δ)σ2

r(U∗)

Next, we prove the upper bound on the minimal eigenvalue. We choose an eigenvector u such that

‖u‖2 = 1, λmin(∇fs(M)) = uT∇fs(M)u.

37

Page 38: General Low-rank Matrix Optimization: Geometric Analysis

The direction is chosen to be∆ := uqT .

For the Hessian of hs(·, ·), we can calculate that

〈∇fs(M),∆∆T 〉 = λmin(∇fs(M)) = −G (65)

and the δ-RIP2r,2r property gives

[∇2fs(M)](∆UT+U∆T ,∆UT + U∆T )

≤ (1 + δ)‖∆UT + U∆T ‖2F = (1 + δ)‖u(Uq)T + (Uq)uT ‖2F= 2(1 + δ)‖Uq‖2F + 2(1 + δ)[qT (UTu)]2

≤ 2(1 + δ)σ2r(U) + 2(1 + δ) · ‖UTu‖2F . (66)

By letting the vector v be

‖v‖2 = 1, λmin(∇fs(M))u = ∇fs(M)v,

the inequality (63) implies that

‖UTu‖2F =‖UT∇fs(M)v‖2Fλ2min(∇fs(M))

=‖UT∇fs(M)v‖22λ2min(∇fs(M))

≤ ‖UT∇fs(M)‖22‖v‖22λ2min(∇fs(M))

≤ λ2

4G2.

Substituting into (66), we obtain

[∇2fs(M)](∆UT + U∆T ,∆UT + U∆T ) ≤ 2(1 + δ)σ2r(U) + (1 + δ) · λ

2

2G2. (67)

Combining (65) and (67), it follows that

[∇2hs(U)](∆,∆) ≤ −2G+ 2(1 + δ)σ2r(U) + (1 + δ) · λ

2

2G2.

Since ‖∆‖2F = 1, the above inequality implies

λmin(∇2hs(U)) ≤ −2G+ 2(1 + δ)σ2r(U) + (1 + δ) · λ

2

2G2≤ −(1 + δ)τ.

We finally give the counterpart of Lemma 5, which states that the condition (62) always holds whenδ < 1/3.Lemma 8. Given positive constants α,C, ε, λ, if

max{‖UUT ‖F , ‖U∗(U∗)T ‖F } ≤ C2, ‖U − U∗‖F ≥ α, ‖∇hs(U)‖F ≤ λ, δ < 1/3,

then there exists a positive constant λ0(δ,W ∗, α, C) such that

σ2r(U) ≤ 1

1 + δ

[G− (1 + δ)λ2

4G2− λ

](68)

whenever0 < λ ≤ λ0(δ, σr(M

∗s ), ‖M∗s ‖F , α, C).

Proof. We prove by contradiction, i.e., we assume

σ2r(U) >

1

1 + δ

[G− (1 + δ)λ2

4G2− λ

]≥ cα2

1 + δ+ poly(λ). (69)

To follow the proof of Lemma 5, we also divide the argument into three steps, although the first stepis superficial.

Step I. We first give a lower bound for λr(M). In the symmetric case, this step is straightforward,since we always have

λ2r(M) = σ4

r(U). (70)

38

Page 39: General Low-rank Matrix Optimization: Geometric Analysis

Step II. Next, we derive an upper bound for λr(M). We define

M := Pr[M − 1

1 + δ∇fs(M)

],

wherePr is the orthogonal projection onto the low-rank manifold (we do not drop negative eigenvaluesin this proof). Since M 6= M∗ and δ < 1/3, we recall that inequality (13) gives

−φ(M) ≥ 1− 3δ

1− δ[fs(M)− fs(M∗)] ≥

1− 3δ

2‖M −M∗‖2F

≥ (1− 3δ) · (√

2− 1)σ2r(W ∗)α2 := K > 0,

where the second inequality comes from Lemma 1 and

−φ(M) = 〈∇fs(M),M − M〉 − 1 + δ

2‖M − M‖2F .

Hence,

〈∇fs(M),M − M〉 − 1 + δ

2‖M − M‖2F ≥ K. (71)

For simplicity, we define

N := − 1

1 + δ∇fs(M).

Then, M = Pr(M +N) and the left-hand side of (71) is equal to

〈∇fs(M),M − M〉 − 1 + δ

2‖M − M‖2F

= (1 + δ)〈N,Pr(M +N)−M〉 − 1 + δ

2‖Pr(M +N)−M‖2F

=1 + δ

2

[‖N‖2F − ‖N +M − Pr(M +N)‖2F

]=

1 + δ

2

[‖N‖2F − ‖N +M‖2F + ‖Pr(M +N)‖2F

]. (72)

Similar to the proof of inequality (63), we can prove that

‖UTN‖F ≤ H :=λ

2(1 + δ).

Then, we have

− tr[NT (UUT )] ≤ ‖UTN‖F ‖U‖F ≤ H · ‖U‖F ≤ H ·√√

r‖UUT ‖F ≤ 4√rC · H.

Using the above relation, one can write

‖N‖2F − ‖N +M‖2F = −2 tr[NT (UUT )]− ‖M‖2F ≤ 2 4√rC · H − ‖M‖2F .

Suppose that PU is the orthogonal projections onto the column space of U . We define

N1 := PUNPU , N2 := PUN(I − PU ), N3 := (I − PU )NPU , N4 := (I − PU )N(I − PU ).

Then, it follows from (69) that

‖N1‖F = ‖PUNPU‖F ≤ σ−1r (U)‖UTPUNPU‖F ≤ σ−1

r (U)‖UTN‖F ≤ σ−1r (U) · H

[√1 + δ

G+ poly(λ)

]· H ≤

[√1 + δ

cα2+ poly(λ)

]· H := κH.

Similarly, we can prove that

‖N1 +N2‖F = ‖PUN‖F ≤ κH, ‖N1 +N3‖F = ‖NPV ‖F ≤ κH,

which leads to‖N2‖F ≤ 2κH, ‖N3‖F ≤ 2κH.

39

Page 40: General Low-rank Matrix Optimization: Geometric Analysis

Using Weyl’s theorem, the following holds for every 1 ≤ i ≤ r:

|λi(M +N)− λi(M +N4)| ≤ ‖N1 +N2 +N3‖2 ≤ ‖N1 +N2 +N3‖F ≤ 3κH.

Therefore, we have

‖Pr(M +N)‖2F =

r∑i=1

λ2i (M +N)

≥r∑i=1

λ2i (M +N4)− r · 3κH · (‖M +N‖2 + ‖M +N4‖2)

≥r∑i=1

λ2i (M +N4)− 6rκH · (‖M‖2 + ‖N‖2)

≥r∑i=1

λ2i (M +N4)− 6rκH ·

(‖M‖F +

G

1 + δ

). (73)

Similar to the asymmetric case, we can prove that

G

1 + δ≤ ‖M‖F + poly(λ).

holds under the assumption (69). Therefore, we obtain the bound

‖M‖F + ‖N‖F ≤ 2‖M‖F + poly(λ) ≤ 2C2 + poly(λ).

Substituting back into the previous estimate (73), it follows that

‖Pr(M +N)‖2F ≥r∑i=1

λ2i (M +N4) + poly(λ).

Now, sinceM andN4 have orthogonal column and row spaces, the maximal r eigenvalues ofM+N4

are simply the maximal r eigenvalues of the eigenvalues of M and N4, which we assume to be

λi(M), i = 1, . . . , k and λi(N4), i = 1, . . . , r − k.

Now, it follows from (72) that

2

1 + δ

[〈∇fs(M),M − M〉 − 1 + δ

2‖M − M‖2F

]= ‖N‖2F − ‖N +M‖2F + ‖Pr(M +N)‖2F

≤ −r∑i=1

λ2i (M) +

k∑i=1

λ2i (M) +

r−k∑i=1

λ2i (N4) + poly(λ) + 2 4

√rC · H

= −r∑

i=k+1

λ2i (M) +

r−k∑i=1

λ2i (N4) + poly(λ). (74)

Using the assumption (69) and the fact that λ is small, we know that λi(N4) > 0 for all i ∈ {1, . . . , k}.Therefore,

−r∑

i=k+1

λ2i (M) +

r−k∑i=1

λ2i (N4) ≤ −(r − k)λ2

r(M) + (r − k)λmax(N4)2.

Substituting into (74) gives rise to

2

1 + δ

[〈∇fs(M),M − M〉 − 1 + δ

2‖M − M‖2F

]≤ −(r − k)λ2

r(M) + (r − k)λmax(N4)2 + poly(λ)

≤ −(r − k)λ2r(M) + (r − k)λmax(N)2 + poly(λ).

40

Page 41: General Low-rank Matrix Optimization: Geometric Analysis

If k = r, then the above inequality and inequality (71) imply that

poly(λ) ≥ K = O(α2),

which contradicts the assumption that λ is small. Hence, we conclude that r − k ≥ 1. Combiningwith (71), we obtain the upper bound

λ2r(M) ≤ − 2

1 + δ· K

r − k+ λmax(N)2 +

1

r − k· poly(λ)

= − 2

1 + δ· Kr

+ λmax(N)2 + poly(λ). (75)

Step III. In the last step, we combine the relations (70) and (75), which leads to

σ4r(U) ≤ − 2

1 + δ· Kr

+1

(1 + δ)2G2 + poly(λ).

This means that

σ4r(U) +

2

1 + δ· Kr≤ 1

(1 + δ)2G2 + poly(λ).

Since K > 0 has lower bounds that are independent of λ, we can choose λ to be small enough suchthat

σ4r(U) +

1

1 + δ· Kr≤ 1

(1 + δ)2G2.

However, considering the assumption (69), we have

σ4r(U) ≥ 1

(1 + δ)2

[G− (1 + δ)λ2

4G2− λ

]2

=1

(1 + δ)2G2 − 2λ ·G+ poly(λ)

≥ 1

(1 + δ)2G2 − 2λ · (1 + δ)C2 + poly(λ) =

1

(1 + δ)2G2 + poly(λ),

where the second inequality is due to G ≤ (1 + δ)C2, which can be proved similar to the asym-metric case. The above two inequalities cannot hold simultaneously when λ is small enough. Thiscontradiction means that the condition (68) holds by choosing

0 < λ ≤ λ0(δ, σr(M∗s ), ‖M∗s ‖F , α, C),

for a small enough positive constant λ0(δ, σr(M∗s ), ‖M∗s ‖F , α, C).

Proof of Theorem 7. We first choose

C :=

[2(1 + δ)

1− δ‖U∗(U∗)T ‖2F

]1/4

.

Then, we select λ1 as

λ1(δ, r, σr(M∗s ), ‖M∗s ‖F , α) := min

{λ0(δ, r, σr(M

∗s ), ‖M∗s ‖F , α, C),

(1− δ)C3

2√r

}.

Finally, we combine Lemmas 6-8 to get the bounds for the gradient and the Hessian.

41