
Solving the Robust Matrix Completion Problem via a System of Nonlinear Equations

Yunfeng Cai and Ping Li
Cognitive Computing Lab, Baidu Research
No. 10 Xibeiwang East Road, Beijing 100085, China
10900 NE 8th St., Bellevue, WA 98004, USA
{caiyunfeng, liping11}@baidu.com

Abstract

We consider the problem of robust matrix completion, which aims to recover a low rank matrix L∗ and a sparse matrix S∗ from incomplete observations of their sum M = L∗ + S∗ ∈ R^{m×n}. Algorithmically, the robust matrix completion problem is transformed into a problem of solving a system of nonlinear equations, and the alternative direction method is then used to solve the nonlinear equations. In addition, the algorithm is highly parallelizable and suitable for large scale problems. Theoretically, we characterize sufficient conditions under which L∗ can be approximated by a low rank approximation of the observed M. Under proper assumptions, it is shown that the algorithm converges to the true solution linearly. Numerical simulations show that the simple method works as expected and is comparable with state-of-the-art methods.

1 Introduction

Robust matrix completion (RMC) (Chen et al., 2011; Tao and Yuan, 2011; Cherapanamjeri et al., 2017; Klopp et al., 2017; Zeng and So, 2018) aims to recover a low rank matrix L∗ and a sparse matrix S∗ from a sampling of M = L∗ + S∗. Mathematically, RMC can be formulated as the following optimization problem (Candès et al., 2011; Chandrasekaran et al., 2011):

RMC :   min_{L,S}  rank(L) + λ |supp(S)|,   s.t.  L_ij + S_ij = M_ij,  (i, j) ∈ Ω,


where λ is a tuning parameter and Ω is a subset of {1, . . . , m} × {1, . . . , n}. When S = 0, RMC becomes the matrix completion (MC) problem (Candès and Recht, 2009; Meka et al., 2009; Cai et al., 2010; Candès and Tao, 2010; Jain and Netrapalli, 2015; Liu and Li, 2016); when Ω = {1, . . . , m} × {1, . . . , n}, RMC becomes robust principal component analysis (RPCA) (Jolliffe, 2011). Thus, RMC can be taken as a combination/generalization of MC and RPCA (Candès and Plan, 2010; Jain and Netrapalli, 2015; Jain et al., 2013; Keshavan et al., 2010).

In many scientific and engineering problems, one needs to recover a low rank matrix from observed data, e.g., recommender systems (Funk, 2006; Candès and Plan, 2010; Hu and Li, 2017, 2018b,a), social network analysis (Huang et al., 2013), machine learning (Candès and Plan, 2010; Davenport and Romberg, 2016), image inpainting (Bertalmío et al., 2000), computer vision (Candès and Plan, 2010), bioinformatics (Kim et al., 2005), etc.

MC/RPCA/RMC has been studied extensively from an optimization point of view: many algorithms have been proposed, and exact recovery has been established under proper assumptions. Most well-known algorithms are based on convex optimization, in which the rank of a matrix is relaxed to its nuclear norm (the sum of all singular values), and the number of nonzero entries of a matrix is relaxed to its ℓ1-norm (the sum of absolute values of all entries), e.g., (Cai et al., 2010; Candès et al., 2011; Candès and Recht, 2009; Candès and Tao, 2010; Recht et al., 2010; Klopp et al., 2017). However, the computation of the nuclear norm of L requires the computation of its singular value decomposition (SVD), which is expensive and unsuitable for parallelization. As a result, algorithms based on the nuclear norm relaxation are often not practical for large matrices.

To deal with large problems, the low rank matrix can be represented as the product of a tall-skinny matrix


and a short-fat matrix, so that the low rank property is satisfied automatically. However, the optimization problem becomes nonconvex, which makes it difficult to solve. Since the problem is bi-convex, alternating minimization can be used to solve it more efficiently, e.g., (Jain et al., 2013). In (Yi et al., 2016), a gradient descent method is used to solve RMC, which is shown to be fast. In (Cherapanamjeri et al., 2017), a projected gradient method with hard thresholding is used to solve RMC, with nearly optimal observation and corruption levels. We refer the readers to (Zeng and So, 2018) and the references therein for more methods. Besides the optimization based methods, quite recently, in (Dutta et al., 2019), RPCA/RMC is solved via an alternating nonconvex projection method. This method does not require any objective function, convex relaxation or surrogate convex constraint.

Contribution. In this paper, we solve the RMC problem via solving a system of nonlinear equations (NLEQ). This method does not require any objective function, convex relaxation or surrogate convex constraint either. Let L∗ = XY^T, where X = [x_1, . . . , x_m]^T ∈ R^{m×r}, Y = [y_1, . . . , y_n]^T ∈ R^{n×r}. The RMC problem can be formulated as the following system of NLEQ:

x_i^T y_j = M_ij,  for (i, j) ∈ Ω \ supp(S∗),   (1)

where supp(S∗) is the support set of S∗. When the rank r and the support set of S∗ are both known, solving RMC amounts to solving a system of nonlinear equations. Any numerical method for nonlinear equations can be used to solve it (e.g., the steepest descent method, the Newton method, etc.), among which the simplest one is the alternative direction method (ADM): fixing X (or Y), the matrix Y (or X) can be updated via solving an (overdetermined) linear system of equations (usually in the least squares sense). Thus, solving RMC via solving (1) depends heavily on whether we can solve (1) without knowing r and supp(S∗). It is worth mentioning here that in (Meka et al., 2009), such a method was proposed to solve the MC problem, and it was stated there that "... its variants outperform most methods in practice. However, analyzing the performance of alternate minimization is a notoriously hard problem."

The contributions of this paper are threefold. First, for both the full and partial observation cases, we characterize sufficient conditions under which the low rank approximation of the observed M approximates L∗. Second, we propose to solve RMC via solving a system of NLEQ rather than via optimization, and we develop an ADM method that carefully handles the unknown r and supp(S∗). Third, we carefully analyze the convergence of the ADM and show that, under proper assumptions, the ADM converges to the true solution linearly, i.e., exact recovery can be achieved. So we give an answer to a problem which is even more difficult than the aforementioned "notoriously hard problem". In addition, the algorithm is highly parallelizable and naturally suitable for large scale problems. It is also worth mentioning that the results of this paper are applicable to the MC problem as well as the RPCA problem.

The rest of this paper is organized as follows. In Section 2, we develop the algorithm, followed by its convergence analysis in Section 3. Numerical experiments are presented in Section 4. Concluding remarks are given in Section 5.

Notation. We adopt the MATLAB-style convention to access the entries of vectors and matrices. The set of integers from i to j inclusive is i : j. For a matrix A, its submatrices A(k:ℓ,i:j), A(k:ℓ,:), A(:,i:j) consist of the intersections of rows k to ℓ and columns i to j, rows k to ℓ and all columns, and all rows and columns i to j, respectively. A(j,:) and A(:,k) denote the jth row and kth column of A, respectively. ‖A‖ stands for the spectral norm of A, ‖A‖_F denotes the Frobenius norm, ‖A‖_1 = Σ_{i,j} |a_ij|, ‖A‖_max = max_{i,j} |a_ij|, ‖A‖_{2,∞} = max_i ‖A(i,:)‖, A† stands for the Moore–Penrose inverse, and κ(A) = ‖A‖ ‖A†‖ denotes the condition number of A. Denote by σ_j(A) for 1 ≤ j ≤ min{m, n} the singular values of A; they are always arranged in non-increasing order: σ_1(A) ≥ σ_2(A) ≥ · · · ≥ σ_{min{m,n}}(A). R(A) stands for the range space of A, i.e., R(A) = {y ∈ R^m | y = Ax, x ∈ R^n}. The vector e_j stands for the jth column of the identity matrix I. Furthermore, for an index set Ω ⊂ {1, . . . , m} × {1, . . . , n}, |Ω| denotes the cardinality of Ω, and Π_Ω(A) = [I_{(i,j)∈Ω} a_ij] ∈ R^{m×n}, where I denotes the indicator function.
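For concreteness, the projection Π_Ω and the ‖·‖_{2,∞} norm used throughout admit a direct implementation; the following numpy sketch is illustrative only (the function names are ours, not from the paper):

```python
import numpy as np

def proj_omega(A, omega_mask):
    """Pi_Omega(A): keep the entries indexed by Omega (mask == True), zero out the rest."""
    return np.where(omega_mask, A, 0.0)

def norm_2_inf(A):
    """||A||_{2,inf}: the largest Euclidean norm among the rows of A."""
    return np.linalg.norm(A, axis=1).max()
```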

2 Algorithm

In this section, we first reformulate the RMC problem as a problem of solving a system of nonlinear equations (NLEQ), then show how to solve the NLEQ via an alternative direction method (ADM), and finally summarize the overall algorithm.

2.1 Problem Reformulation, Difficulty and Solution

Let M = XY^T, where X = [x_1, . . . , x_m]^T ∈ R^{m×r}, Y = [y_1, . . . , y_n]^T ∈ R^{n×r}. The MC problem can be formulated as

x_i^T y_j = M_ij,  for (i, j) ∈ Ω.   (2)

Now recall (1): the task of RMC becomes solving (2) with unknown r and a minimum number of violators. Here, by a "violator", denoted by (i′, j′), we mean that x_{i′}^T y_{j′} ≠ M_{i′j′}; it will also be referred to as an "outlier" hereafter.


The ADM is the simplest method to solve an NLEQ of the form (2): given an initial guess for X, we fix X, so that (2) becomes a linear system in Y (which is assumed to be overdetermined), and solve this linear system for Y; similarly, we then fix Y to solve for X; the iteration continues until convergence. As r and supp(S∗) are unknown, our task is to determine them during the ADM iteration. Next, we first show how to determine r, then supp(S∗).

Determine the rank adaptively. In the t-th iteration of ADM, let the current rank estimate be r_t, let Y_t be the current estimate for Y, and let Y_t have orthonormal columns, i.e., Y_t^T Y_t = I_{r_t}. By ADM, an estimate of X can be obtained; denote it by X_{t+1}. In order to update r_t, we need to compute the singular values of X_{t+1} Y_t^T. Since Y_t is orthonormal, the top r_t singular values of X_{t+1} Y_t^T are exactly the singular values of X_{t+1}. These singular values can be obtained as follows: first, compute the QR decomposition of X_{t+1}:

X_{t+1} = X̃_{t+1} R_{x,t+1},

where X̃_{t+1} ∈ R^{m×r_t} is orthonormal and R_{x,t+1} ∈ R^{r_t×r_t}; second, compute the SVD of R_{x,t+1}:

R_{x,t+1} = Q_x Σ Q_y^T,

where Q_x, Q_y ∈ R^{r_t×r_t} are orthogonal and Σ = diag(σ_1, . . . , σ_{r_t}) with σ_1 ≥ · · · ≥ σ_{r_t} ≥ 0. Then the singular values of X_{t+1} Y_t^T are σ_1, . . . , σ_{r_t}. Similarly, when X_t is the current estimate for X and X_t^T X_t = I_{r_t}, we can compute an estimate for Y and obtain its singular values.

Let L_t be the current estimate for L∗. When the rank is underestimated, i.e., r_t < r, the residual τ_t = ‖Π_Ω(L_t + S_t − M)‖_F will stagnate; in that case, we increase the estimated rank r_t. When the rank is overestimated, i.e., r_t > r, we expect to observe rank deficiency in the singular values σ_1, . . . , σ_{r_t}; in that case, we decrease the estimated rank r_t. For the RMC problem, we prefer an overestimated rank over an underestimated one for the following reason. The residual τ_t may stagnate for two reasons: either the estimated rank is smaller than the true rank, or |supp(S∗) \ supp(S_t)| is large. When the residual stagnates, it is therefore difficult to make a good choice between increasing the estimated rank and dropping some equations (which need to be carefully selected) in (2). Increasing the estimated rank when |supp(S∗) \ supp(S_t)| is large, or dropping equations in (2) when the rank is underestimated, can both lead to catastrophic consequences, e.g., the estimated rank exceeds a prescribed limit, or too many "correct" equations are dropped, which will likely result in underdetermined linear systems when updating X (or Y). With an overestimated rank, when the residual stagnates, we decrease the estimated rank via the singular values of L_t; if there is no rank deficiency in L_t, we drop some equations in (2).

When an overestimated rank decreases to the actual rank, the estimated rank is expected to remain unchanged in the follow-up iterations. Therefore, for efficiency, we do not need to check the singular values of L_t in every iteration.
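As an illustration of this rank update (a sketch under our own naming, not the authors' code), the singular values of X_{t+1} Y_t^T can be obtained from a thin QR of X_{t+1} followed by an SVD of the small triangular factor, and columns whose singular values fall below σ_1/κ are treated as rank deficient:

```python
import numpy as np

def estimate_rank(X, kappa):
    """Singular values of X @ Y.T when Y has orthonormal columns, via a thin QR
    of X plus an SVD of the small R factor.  Directions with sigma_j < sigma_1/kappa
    are dropped, mirroring the rank-decrease rule described in the text."""
    Q, R = np.linalg.qr(X)                           # X = Q R, Q is m x r_t orthonormal
    Qx, sigma, _ = np.linalg.svd(R)                  # R = Qx diag(sigma) Qy^T
    r_new = int(np.sum(kappa * sigma >= sigma[0]))   # keep sigma_j >= sigma_1 / kappa
    return r_new, Q @ Qx[:, :r_new], sigma[:r_new]
```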

Determine supp(S∗) via outlier detection. When a good approximation L of L∗ is obtained, S∗ = M − L∗ ≈ M − L. Thus, it is reasonable to detect (i, j) ∈ Ω ∩ supp(S∗) from the residuals {R_ij = M_ij − L_ij}_{(i,j)∈Ω}.

Outlier detection has been used for centuries to remove abnormal data, and various outlier detection techniques have been used (Ester et al., 1996; Hodge and Austin, 2004; Xu et al., 2010; Rahmani and Li, 2019; Slawski et al., 2019). In our implementation, we simply determine the outliers as follows: find the top-k values in each row and column of |R| (unavailable entries of |R| are set to zero), and take the entries in the intersection as outliers. Alternatively, one may simply find the top k′ values among all entries of |R|. Here k, k′ are two tunable parameters. In what follows, we denote

T_s(A) = [b_ij],   (3)

where s is the number of removed outliers, b_ij = A_{(i,j)} if A_{(i,j)} is an outlier, and b_ij = 0 otherwise. Of course, one can also try other outlier detection techniques.
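A minimal numpy sketch of this row/column top-k rule (illustrative only; it keeps an entry of R exactly when it is among the k largest magnitudes of both its row and its column, with unobserved entries already set to zero):

```python
import numpy as np

def ts_outliers(R, k):
    """T_s(R): retain entries that are simultaneously among the k largest (in
    absolute value) of their row and of their column; zero out everything else."""
    A = np.abs(R)
    m, n = A.shape
    kr, kc = min(k, n), min(k, m)
    row_mask = np.zeros_like(A, dtype=bool)
    col_mask = np.zeros_like(A, dtype=bool)
    rows = np.argpartition(A, -kr, axis=1)[:, -kr:]      # top-k column indices per row
    row_mask[np.arange(m)[:, None], rows] = True
    cols = np.argpartition(A, -kc, axis=0)[-kc:, :]      # top-k row indices per column
    col_mask[cols, np.arange(n)[None, :]] = True
    return np.where(row_mask & col_mask, R, 0.0)
```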

2.2 Algorithm details

Now we present Algorithm 1, which summarizes the ADM for RMC described in the previous subsection.

Some implementation details follow.

Initializing Y_0. According to Theorem 2 below, good initial guesses for X and Y can be obtained by computing the SVD of Π_{Ω_0}(M). An iterative procedure (e.g., a Krylov subspace method) is usually adopted to accomplish this task, in which the matrix-vector products Π_{Ω_0}(M)v and Π_{Ω_0}(M)^T v are called several times. A simpler way, which is more efficient and numerically proven to be reliable, is the following: compute W = Π_{Ω_0}(M)^T Π_{Ω_0}(M) N, compute an orthonormal basis for W, and set the columns of Y_1 as this basis. Here N ∈ R^{n×r_0} is a random matrix with entries drawn from the standard normal distribution. Such a procedure is essentially one iteration of the subspace method (a generalization of the power method for computing several dominant eigenvectors). Since an initial guess for either X or Y is sufficient for the ADM in Algorithm 1 to run, it is unnecessary to compute estimates for both X and Y.
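A numpy sketch of this one-step subspace initialization (our own helper; mask is assumed to mark Ω_0):

```python
import numpy as np

def init_Y(M_obs, mask, r0, rng=np.random.default_rng(0)):
    """One step of subspace iteration on Pi_Omega0(M): W = Pi(M)^T Pi(M) N with
    Gaussian N, followed by an orthonormal basis of W, as described above."""
    PM = np.where(mask, M_obs, 0.0)                 # Pi_Omega0(M)
    N = rng.standard_normal((M_obs.shape[1], r0))   # random n x r0 test matrix
    W = PM.T @ (PM @ N)                             # W = Pi(M)^T Pi(M) N
    Y0, _ = np.linalg.qr(W)                         # orthonormal basis for range(W)
    return Y0
```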


Algorithm 1 ADM for RMC via NLEQ
Input: The observed matrix Π_Ω(M), a sparsity level parameter s, an estimated rank r_0, an upper bound κ for the condition number of L∗, and a tolerance tol.
Output: X ∈ R^{m×r_t}, Y ∈ R^{n×r_t} and S ∈ R^{m×n} such that ‖Π_Ω(XY^T + S − M)‖_F ≤ tol, ‖S‖_0 ≤ s.
1: Set S_0 = T_s(M), X_0 = 0, Y_0 = 0, Σ_0 = 0, t = 1;
2: Compute [X_1, Σ_1, Y_1] = SVD_{r_0}((M − S_0)/p′), where p′ = (|Ω| − s)/(mn);
3: Compute R_t = Π_Ω(M − X_t Σ_t Y_t^T);
4: Set S_t = T_s(R_t), Ω_t = Ω \ supp(S_t);
5: Compute τ_t = ‖Π_Ω(M − X_t Σ_t Y_t^T − S_t)‖_F;
6: while τ_t > tol do
7:   Set t = t + 1;
8:   Solve Π_{Ω_{t−1}}(X_t Y_{t−1}^T) = Π_{Ω_{t−1}}(M) for X_t;
9:   Compute the QR decomposition X_t = X̃_t R_{x,t}, where X̃_t has orthonormal columns and R_{x,t} is upper triangular;
10:  Compute the SVD R_{x,t} = Q_x Σ Q_y^T, where Σ = diag(σ_1, . . . , σ_{r_{t−1}}) and Q_x, Q_y are orthogonal;
11:  Set r_t = r_{t−1} − |{j | κ σ_j < σ_1}|;
12:  Set X_t = [X̃_t Q_x](:,1:r_t);
13:  Solve Π_{Ω_{t−1}}(X_t Y_t^T) = Π_{Ω_{t−1}}(M) for Y_t;
14:  Compute the QR decomposition Y_t = Ỹ_t R_{y,t}, where Ỹ_t has orthonormal columns and R_{y,t} is upper triangular;
15:  Compute the SVD R_{y,t}^T = Q_x Σ Q_y^T, where Σ = diag(σ_1, . . . , σ_{r_t}) and Q_x, Q_y are orthogonal;
16:  Set r_t = r_t − |{j | κ σ_j < σ_1}|;
17:  Set X_t = [X_t Q_x](:,1:r_t), Y_t = [Ỹ_t Q_y](:,1:r_t), Σ_t = Σ(1:r_t,1:r_t);
18:  Compute R_t = Π_Ω(M − X_t Σ_t Y_t^T);
19:  Set S_t = T_s(R_t), Ω_t = Ω \ supp(S_t);
20:  Compute τ_t = ‖Π_Ω(M − X_t Σ_t Y_t^T − S_t)‖_F;
21: end while

Solving X_t and Y_t. On Lines 8 and 13, X_t and Y_t can each be solved either row by row or simultaneously. To obtain one row of X_t or Y_t, a small linear system needs to be solved. When the linear system is underdetermined, Algorithm 1 may break down. Therefore, Algorithm 1 requires that the number of observed entries in each row and column (after the removal of the corrupted entries) be larger than the rank. To be more precise, we need the small linear systems to be well conditioned. In general, it is difficult to determine how many observations per row/column are needed to ensure that the linear systems are well conditioned. Numerically, a random matrix A ∈ R^{s×r} (generated from a standard normal distribution) with s = O(r) > 2r is usually well conditioned. So, we may declare that O(r) > 2r observations in each row and column are sufficient.

In our implementation, the linear systems are solved in the least squares sense. One may also choose to minimize the ℓp-norm (p ≥ 0) of the residual as in (Zeng and So, 2018).
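To make the row-wise update concrete, here is a minimal numpy sketch (an illustration in the paper's notation, not the released code): each row x_i of X is obtained by a least-squares fit of the observed, non-outlier entries of row i of M against the corresponding rows of Y.

```python
import numpy as np

def solve_rows(M, keep_mask, Y):
    """Update X given Y: for each row i, solve Y[omega_i] @ x_i ~= M[i, omega_i]
    in the least-squares sense, where omega_i indexes the kept (observed,
    non-outlier) entries of row i.  Updating Y given X is the symmetric step,
    applied to M.T and keep_mask.T."""
    m, r = M.shape[0], Y.shape[1]
    X = np.zeros((m, r))
    for i in range(m):
        idx = np.flatnonzero(keep_mask[i])
        if idx.size >= r:                  # needs at least r kept entries in the row
            X[i], *_ = np.linalg.lstsq(Y[idx], M[i, idx], rcond=None)
    return X
```

Each row is solved independently of the others, so the loop parallelizes trivially across rows and columns, which is what makes the method suitable for large scale problems.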

Computational complexity. When the number of observations in each row and column is O(r), each linear system can be solved in O(r^3) FLOPS. So, in each iteration, the computational complexity of the linear system solves on Lines 8 and 13 is O((m+n)r^3). The computational complexity of the QR decompositions on Lines 9 and 14 is O((m+n)r^2), and that of the SVD is O(r^3). So the overall computational complexity of Algorithm 1 is dominated by the linear system solves. When the number of observations in a certain row/column is much larger than r, we may randomly choose O(r) observations from that row/column and solve a much smaller linear system of equations. Again, the overall computational complexity in each iteration is O((m+n)r^3).

Also, note that the linear systems on Lines 8 and 13 can be solved in parallel. Therefore, Algorithm 1 is suitable for large scale problems.

Remark 1. When Ω_t is fixed, Algorithm 1 essentially minimizes ‖Π_{Ω_t}(X Y^T − M)‖_F via ADM. If a gradient method is used to minimize ‖Π_{Ω_t}(X Y^T − M)‖_F, Algorithm 1 is similar to the GD method in (Yi et al., 2016), except for the regularization term ‖U_t^T U_t − V_t^T V_t‖_F in the loss function.

3 Convergence

This section analyzes the convergence of Algorithm 1. We first study the full observation case, which serves as a motivation for the partial observation case.

To present the results, we need to define the k canonical angles. Let 𝒳, 𝒴 be two k-dimensional subspaces of R^n. Let X, Y ∈ R^{n×k} be orthonormal basis matrices of 𝒳 and 𝒴, respectively, i.e.,

R(X) = 𝒳, X^T X = I_k,  and  R(Y) = 𝒴, Y^T Y = I_k.

Denote by ω_j for 1 ≤ j ≤ k the singular values of Y^T X in ascending order, i.e., ω_1 ≤ · · · ≤ ω_k. The k canonical angles θ_j(𝒳, 𝒴) between 𝒳 and 𝒴 are defined by

0 ≤ θ_j(𝒳, 𝒴) := arccos ω_j ≤ π/2,  for 1 ≤ j ≤ k.   (4)

They are in descending order, i.e., θ_1(𝒳, 𝒴) ≥ · · · ≥ θ_k(𝒳, 𝒴). Set

Θ(𝒳, 𝒴) = diag(θ_1(𝒳, 𝒴), . . . , θ_k(𝒳, 𝒴)).   (5)

In what follows, we sometimes place a vector or matrix in one or both arguments of θ_j(·, ·) and Θ(·, ·), with the meaning that the argument refers to the subspace spanned by the vector or by the columns of the matrix.
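As a quick aside (an illustrative sketch assuming orthonormal inputs), the distance ‖sin Θ(X, Y)‖ used in the results below can be computed from the singular values of Y^T X:

```python
import numpy as np

def sin_theta(X, Y):
    """||sin Theta(X, Y)|| for matrices with orthonormal columns: the cosines of
    the canonical angles are the singular values of Y^T X, and the spectral norm
    of sin Theta is the sine of the largest canonical angle."""
    cosines = np.linalg.svd(Y.T @ X, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)              # guard against roundoff
    return float(np.sqrt(1.0 - cosines.min() ** 2))
```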


3.1 Full observation case

Theorem 1. Let M = L∗ + S∗ ∈ R^{m×n} (m ≥ n), where L∗ is low rank, i.e., r = rank(L∗) ≪ n. Let the SVD of M be M = U Σ V^T, where U = [U_1 | U_2] = [u_1, . . . , u_r | u_{r+1}, . . . , u_m], V = [V_1 | V_2] = [v_1, . . . , v_r | v_{r+1}, . . . , v_n] are orthogonal matrices, Σ = [diag(Σ_1, Σ_2); 0], Σ_1 = diag(σ_1, . . . , σ_r), Σ_2 = diag(σ_{r+1}, . . . , σ_n), and σ_1 ≥ · · · ≥ σ_n. Let M_r = U_1 Σ_1 V_1^T be the best rank-r approximation of M. Let the economy-sized SVD of L∗ be L∗ = U_∗ Σ_∗ V_∗^T, where U_∗ ∈ R^{m×r} and V_∗ ∈ R^{n×r} both have orthonormal columns, Σ_∗ = diag(σ_{1∗}, . . . , σ_{r∗}) with σ_{1∗} ≥ · · · ≥ σ_{r∗} > 0. If

‖(I − U_∗ U_∗^T) S_∗ (I − V_∗ V_∗^T)‖ < σ_{r∗},   (6a)
max{‖S_∗ V_∗‖, ‖S_∗^T U_∗‖} < σ_r − σ_{r+1},   (6b)

then

max{θ_u, θ_v} ≤ η,
‖L∗ − M_r‖_max / ‖L∗‖ ≤ (‖U_∗‖_{2,∞} θ_v + ‖V_∗‖_{2,∞} θ_u) + (1 + 3 ‖U_∗‖_{2,∞} ‖V_∗‖_{2,∞}) θ_u θ_v,

where θ_u = ‖sin Θ(U_1, U_∗)‖, θ_v = ‖sin Θ(V_1, V_∗)‖, and

η = max{‖S_∗ V_∗‖, ‖S_∗^T U_∗‖} / (σ_r − σ_{r+1} − max{‖S_∗ V_∗‖, ‖S_∗^T U_∗‖}).

Theorem 1 tells us that when (6) holds and η is small, the principal angles between U_1 and U_∗, and between V_1 and V_∗, will be small, and the best rank-r approximation of M is a good approximation of L∗. Notice that (6) does not necessarily imply that ‖S∗‖ is small (compared with ‖L∗‖). In fact, we have the following example, in which ‖S∗‖ is comparable with ‖L∗‖ and η = 0.

Example 1. Let M = L∗ + S∗, L∗ = (1/n) 1_n 1_n^T,

S∗ = (ρ/4) ·
    [  2  −1   0  ···   0  −1
      −1   2  −1   0  ···   0
       0  −1   2  −1  ···   0
       ⋮        ⋱   ⋱   ⋱   ⋮
       0  ···   0  −1   2  −1
      −1   0  ···   0  −1   2 ],

where 1_n is an n-by-1 vector of ones, n is even, and ρ ∈ (−1, 1) is a real parameter. One can verify that ‖L∗‖ = 1, ‖S∗‖ = |ρ|, ‖S∗ 1_n‖ = 0, the first two singular values of M are σ_1 = 1 and σ_2 = |ρ|, and the economy-sized SVD of L∗ can be given by L∗ = U_∗ Σ_∗ V_∗^T, where U_∗ = V_∗ = (1/√n) 1_n, Σ_∗ = σ_{1∗} = 1. Then ‖(I − U_∗ U_∗^T) S_∗ (I − V_∗ V_∗^T)‖ = ‖S∗‖ = |ρ| < 1 = σ_{1∗}, and max{‖S_∗ V_∗‖, ‖S_∗^T U_∗‖} = 0 < 1 − |ρ| = σ_1 − σ_2. In other words, assumption (6) holds. Noticing that η = max{‖S_∗V_∗‖, ‖S_∗^T U_∗‖} / (σ_r − σ_{r+1} − max{‖S_∗V_∗‖, ‖S_∗^T U_∗‖}) = 0, by Theorem 1 we can conclude that ‖sin Θ(U_1, U_∗)‖ = ‖sin Θ(V_1, V_∗)‖ = 0 and M_1 = L∗, where U_1, V_1 are the top left and right singular vectors of M, respectively, and M_1 is the best rank-1 approximation of M.
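A quick numerical check of Example 1 (an illustrative sketch, not from the paper): build L∗ and S∗ as above and confirm that the best rank-1 approximation of M recovers L∗ exactly.

```python
import numpy as np

n, rho = 8, 0.6                                   # small even n, |rho| < 1
L = np.ones((n, n)) / n                           # L_* = (1/n) 1 1^T
C = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
C[0, -1] = C[-1, 0] = -1                          # circulant second-difference matrix
S = (rho / 4) * C                                 # S_*, with ||S_*|| = |rho|, S_* 1 = 0
M = L + S

U, s, Vt = np.linalg.svd(M)
M1 = s[0] * np.outer(U[:, 0], Vt[0])              # best rank-1 approximation of M
print(np.allclose(M1, L), s[0], s[1])             # True, 1.0, ~|rho|
```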

Now let us assume that all entries of M are observed and that S∗ is sufficiently small, so that it can be taken as a perturbation to L∗. Let Y_t be the current guess for Y, with orthonormal columns, and let us perform one iteration of ADM. For ease of illustration, also assume that r_{t+1} = r_t. Then the iteration reads: (S1) X_{t+1} = M Y_t; (S2) X_{t+1} = M Y_t G_x, where G_x is such that X_{t+1} is orthonormal; (S3) Y_{t+1} = M^T X_{t+1}; (S4) Y_{t+1} = M^T X_{t+1} G_y, where G_y is such that Y_{t+1} is orthonormal. So we get

Y_{t+1} = (M^T M) Y_t G_x G_y,   (7)

which is just one iteration of subspace iteration (a generalization of power iteration) for computing the dominant eigenspace of M^T M (e.g., (Demmel, 1997; Stewart, 2001; Van Loan and Golub, 2012)). In fact, (S2) and (S4) are iterations for the subspaces spanned by the dominant left and right singular vectors of M, respectively. Classic results tell us that the subspaces R(X_t) and R(Y_t) converge to the subspaces spanned by the left and right singular vectors of M corresponding to the dominant singular values. When the perturbation is small, R(X_t) and R(Y_t) are good approximations of R(L∗) and R(L∗^T), respectively. In particular, when S∗ = 0 and V_∗^T Y_t is nonsingular, we have ‖sin Θ(X_{t+1}, U_∗)‖ = 0 and ‖sin Θ(Y_{t+1}, V_∗)‖ = 0, i.e., one iteration of ADM gives the true solution.
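The full-observation iteration (7) is easy to simulate (a sketch, not the paper's code): with S∗ = 0 and a generic starting guess, a single ADM sweep already aligns R(Y_1) with the dominant right singular subspace of M.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, r = 50, 40, 3
L = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # M = L_*, S_* = 0
Y0, _ = np.linalg.qr(rng.standard_normal((n, r)))                # random orthonormal guess

X1, _ = np.linalg.qr(L @ Y0)          # (S1)-(S2): X_{t+1} = M Y_t, then orthonormalize
Y1, _ = np.linalg.qr(L.T @ X1)        # (S3)-(S4): Y_{t+1} = M^T X_{t+1}, orthonormalized

V_star = np.linalg.svd(L)[2][:r].T    # dominant right singular subspace of M
gap = 1 - np.linalg.svd(V_star.T @ Y1, compute_uv=False).min()   # ~0: subspaces coincide
print(gap)
```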

3.2 Partial observation case

For the partial observation case, we study the convergence of Algorithm 1 under the following assumptions:

(A1) For L∗, the column and row incoherence conditions with parameter µ hold, i.e.,

max_{1≤i≤m} ‖U_∗^T e_i‖^2 ≤ µr/m,   max_{1≤j≤n} ‖V_∗^T e_j‖^2 ≤ µr/n.

(A2) S∗ has at most a ϱ-fraction of nonzero entries per row and column, i.e.,

‖S_∗(i,:)‖_0 ≤ ϱn,  ‖S_∗(:,j)‖_0 ≤ ϱm,  for all i, j.

(A3) Each entry of M is observed independently with probability p.

Besides the notation in Algorithm 1, we also adopt the following notation:

θ_{x,t} = ‖sin Θ(X_t, U_∗)‖,   θ_{y,t} = ‖sin Θ(Y_t, V_∗)‖.

The following theorem tells us that the SVD of the partially observed matrix (after the removal of outliers) indeed gives a good approximation of U_∗ and V_∗. Furthermore, X_1, Y_1 satisfy an incoherence condition with parameter µ_1, and ‖L∗ − X_1 Σ_1 Y_1^T‖_max is bounded.

Theorem 2. Assume (A1), (A2), (A3) and m ≥ n. Let M = L∗ + S∗ ∈ R^{m×n} with rank(L∗) = r, and let S_0 be obtained as in Algorithm 1. Denote r′_s = ‖S_0 − S_∗‖_F^2 / ‖S_0 − S_∗‖^2 and γ = (2/(1 − ϱ)) √(2ϱ/(r′_s p)). If

(ξ + γ) µ r κ < 1/6,   (8)

then with probability ≥ 1 − 1/m^{10+log α}, it holds that

max{θ_{x,1}, θ_{y,1}} ≤ 3 (ξ + γ) µ r κ,   (9)

where ξ = 6 √(α/(p′ n)) and κ = σ_{1∗}/σ_{r∗}. Further assume µ ≪ n and that there exists a positive constant µ′_1 ≪ n such that

(ξ + γ) µ r κ ≤ (1/3) √(µ′_1 r / m);   (10)

then

‖X_1‖_{2,∞} ≤ √(µ_1 r / m),   ‖Y_1‖_{2,∞} ≤ √(µ_1 r / n),

‖L∗ − X_1 Σ_1 Y_1^T‖_max ≤ ‖L∗‖ ( √(µr/m) θ_{y,1} + √(µr/n) θ_{x,1} + θ_{x,1} θ_{y,1} ) + O(n^{−3/2}),

where µ_1 = 2(µ + µ′_1).

Remark 2. When there is no corruption, i.e., ϱ = 0, then γ = 0. Furthermore, when m = O(n) ≫ 1 and p′n ≫ r, we know that ξ = 6√(α/(p′n)) is small; the larger p′n is, the smaller ξ is. By Theorem 2, θ_{x,1} and θ_{y,1} will then be small. In other words, the SVD (1/p′) Π_{Ω_0}(M − S_0) = X_1 Σ_1 Y_1^T gives good approximations of both U_∗ and V_∗, by X_1 and Y_1, respectively.

The following theorem, which is motivated by (Drineas and Mahoney, 2018, Lemma 55), establishes a bridge between the full observation case and the partial observation case. It gives an upper bound on the distance between the least squares solutions of the full observation case and the partial observation case.

Theorem 3. Let m ≥ n, and denote

X_opt = argmin_X ‖X Y_t^T − (M − S_t)‖,
X̂_opt = argmin_X ‖Π_{Ω_t}(X Y_t^T − (M − S_t))‖.

Assume that Ω_t can be obtained by sampling each entry of M with probability p′ = p(1 − ϱ), that ‖Y_t‖_{2,∞} ≤ √(µ′r/n) for some µ′ > 0, and that

inf_{X ∈ R^{m×r}} ‖Π_{Ω_t}(X Y_t^T)‖ / ‖X‖ ≥ σ   (11)

for some constant σ > 0. Then w.p. ≥ 0.99, it holds that

‖X̂_opt − X_opt‖ ≤ ( (2/3) log(m + n) + 5 ) √(µ′ r) / (p′ σ^2) · ‖R‖_max,

where R = (M − S_t)(I − Y_t Y_t^T).

Remark 3. The requirement (11) is critical. The parameter σ reflects the condition number of the least squares problem on Line 8 of Algorithm 1. What is more, the larger p′ is, the larger σ is (in particular, if p′ = 1, then σ = 1), and the smaller the distance between X̂_opt and X_opt is, which agrees with our intuition. R is the residual for the full observation case, i.e., R = X_opt Y_t^T − (M − S_t). If this residual is small, the distance between X̂_opt and X_opt will be small, too.

Definition 1. Define µ′ := max{µ_u, µ_v}, where

µ_u := sup_{U ∈ R^{m×r}} { (m/r) ‖U‖_{2,∞}^2 | ‖sin Θ(U_∗, U)‖ ≤ θ_{x,1} },
µ_v := sup_{V ∈ R^{n×r}} { (n/r) ‖V‖_{2,∞}^2 | ‖sin Θ(V_∗, V)‖ ≤ θ_{y,1} }.

By the definition of µ′, we know that if θ_{x,t} ≤ θ_{x,1} and θ_{y,t} ≤ θ_{y,1} for all t, then X_t, Y_t satisfy the incoherence condition with parameter µ′. Recalling Theorem 2, under assumption (10), θ_{x,1} and θ_{y,1} are quite small (on the order of 1/√m); then we can show that µ′ ≤ µ_1, which implies that µ′ is not large.

The next theorem establishes the convergence rate of the ADM, which is the key to our proof of Theorem 5.

Theorem 4. Assume that Ω_t can be obtained by sampling each entry of M with probability p′, that ‖Y_t‖_{2,∞} ≤ √(µ′r/n) for some µ′ > 0, that (11) holds, and that

‖L∗ − X_t Σ_t Y_t^T‖_max ≤ c ‖L∗‖ θ_{y,t} √(µr/m),   (12)

for some constant c > 0. Denote

r_s = inf_t ‖S_t − S_∗‖_F^2 / ‖S_t − S_∗‖^2,   ζ = √(2 s µ r / (m r_s)),
C_LS = ( (2/3) log(m + n) + 5 ) √(µ′ r) / (p′ σ^2),
C = C_LS (1 + 2c √(2 p ϱ n)) √(µ r),
ε = c κ ζ,   φ = ( 8ε(κ + √2 ε) + √2 C κ/√m ) / ( 1 − 2ε − C κ/√m ).

Further assume θ_{y,t} ≤ 1/√2; then w.p. ≥ 0.99,

θ_{x,t+1} ≤ φ θ_{y,t}.


Remark 4. The assumption (12) is not as strong a requirement as it looks. By Theorem 2, (12) is natural for t = 1. For general t > 1, it can be shown that there exists a constant c > 0 such that (12) holds (see the supplementary material for details), as long as X_t, Y_t satisfy the incoherence condition. In general, the constant c is of order O(1).

Remark 5. The constant r_s is the infimum of the stable rank of S_t − S_∗. If we approximate S_t − S_∗ by a random matrix whose entries are i.i.d. drawn from a normal distribution N(0, σ^2), we numerically find that r_s = O(n). Therefore, ε ≈ cκ√(2ϱpµr) is a small number as long as ϱ is small and µ and κ are not large. When p′ is sufficiently large, C = O(1). Consequently, when m = O(n) is large, φ ≪ 1, which agrees with our conclusion in the full observation case.

Remark 6. Under similar assumptions as in Theorem 4, it can also be shown that θ_{y,t+1} ≤ φ θ_{x,t+1}. Then θ_{y,t+1} ≤ φ^2 θ_{y,t}. When φ < 1, {θ_{y,t}}_t is a monotonically decreasing sequence. Combining this with the definition of µ′, we know that Y_t satisfies the incoherence condition with parameter µ′. Similarly, X_t also satisfies the incoherence condition.

Theorem 5. Assume (A1), (A2), (A3) and m ≥ n. Assume that Ω_t can be obtained by sampling each entry of M with probability p′ = p(1 − ϱ), that r_t ≡ r, and that

inf_{X ∈ R^{m×r}} ‖Π_{Ω_t}(X Y_t^T)‖ / ‖X‖ ≥ σ,   inf_{Y ∈ R^{n×r}} ‖Π_{Ω_t}(X_t Y^T)‖ / ‖Y‖ ≥ σ

for some σ > 0. Let r_s, ζ, C, ε be the same as in Theorem 4. Then with high probability, it holds that

‖M − S_{t+1} − X_{t+1} Σ_{t+1} Y_{t+1}^T‖ ≤ ψ ‖M − S_t − X_t Σ_t Y_t^T‖,

where

ψ = 2√2 ( κ + 2ε √(µr/m) + Cκ/√m ) ( 8ε(κ + √2 ε) + √2 Cκ/√m ) / [ (1 − 4√2 ε √(µr/m)) (1 − 2ε − Cκ/√m) ].

Remark 7. If ψ < 1, then by Theorem 5, {‖M − S_t − X_t Σ_t Y_t^T‖}_t is a monotonically decreasing sequence, and in the limit, with high probability, it holds that

lim_{t→∞} ‖M − S_t − X_t Σ_t Y_t^T‖ = 0.

Recalling the way we determine S_t, we get (L∗ − X_t Σ_t Y_t^T)_{ij} = 0 for any (i, j) ∉ supp(S_t). Then we can show that

X_t Σ_t Y_t^T → L∗,   S_t → S∗,   as t → ∞,

i.e., exact recovery is achieved with high probability.

4 Experiments

We compare Algorithm 1 (NLEQ) with the gradient descent (GD) method (Yi et al., 2016) and the PG-RMC method (Cherapanamjeri et al., 2017). The codes of GD and PG-RMC are obtained from the lrslibrary (Sobral et al., 2016) on GitHub.1

4.1 Synthetic Data

We generate the data matrix M ∈ R^{d×d} as follows. The low rank matrix L∗ is given by L∗ = U_∗ V_∗^T, where the entries of U_∗, V_∗ ∈ R^{d×r} are drawn independently from the Gaussian distribution with mean zero and variance 1/d. Each entry of the sparse matrix S∗ is nonzero with probability ρ, and the nonzero entries of S∗ are drawn uniformly from [−r/(2d), r/(2d)]. Each entry of M = L∗ + S∗ is observed independently with probability p.
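This synthetic setup is easy to reproduce; a numpy sketch (with dimensions reduced for illustration):

```python
import numpy as np

def make_rmc_instance(d=1000, r=10, rho=0.1, p=0.1, rng=np.random.default_rng(0)):
    """Generate L_*, S_*, the observation mask, and Pi_Omega(M) as described above."""
    U = rng.standard_normal((d, r)) / np.sqrt(d)     # entries ~ N(0, 1/d)
    V = rng.standard_normal((d, r)) / np.sqrt(d)
    L = U @ V.T
    S = np.where(rng.random((d, d)) < rho,           # nonzero with probability rho
                 rng.uniform(-r / (2 * d), r / (2 * d), (d, d)), 0.0)
    mask = rng.random((d, d)) < p                    # each entry observed w.p. p
    return L, S, mask, np.where(mask, L + S, 0.0)    # last output is Pi_Omega(M)
```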

The results are presented in Figure 1. In Figure 1(a), we plot the relative residual ‖R_t‖_F / ‖Π_{Ω_t}(M)‖_F and the rank estimate r_t versus the iteration number. We can see that when the initial rank is larger than the true rank, as the iteration continues, the true rank can be revealed from the singular values of X_t Y_t^T, and the rank estimate then drops to the true rank; when the residual stagnates (represented by a big solid dot in the plot), outliers are removed and the residual decreases until convergence. In Figure 1(b), we plot the relative residual ‖R_t‖_F / ‖Π_{Ω_t}(M)‖_F versus total CPU time for different p. We can see that Algorithm 1 works in all three cases, and the larger p is, the fewer iterations are needed. In Figure 1(c), we plot the total CPU time and the angle max{‖sin Θ(X_t, U_∗)‖, ‖sin Θ(Y_t, V_∗)‖} versus the matrix size d for the different methods. We can see that the CPU time of all three methods grows linearly with respect to the matrix size and is comparable across methods. The angles of the three methods are all small, which confirms that all methods give correct results; the angle produced by NLEQ is the smallest. In Figure 1(d), we plot the relative residual ‖R_t‖_F / ‖Π_{Ω_t}(M)‖_F versus CPU time for all three methods. The convergence behaviors of the three methods are quite different: GD converges almost linearly; PG-RMC converges linearly with a low rate at the beginning stage, then almost linearly with a larger rate; NLEQ has a zig-zag convergence, which is due to the removal of outliers.

4.2 Foreground-background separation

The next task is foreground-background separation. By stacking up the vectorized video frames, we get a full data matrix. The static background forms a low rank matrix, while the foreground can be taken as

1https://github.com/andrewssobral/lrslibrary/tree/master/algorithms/mc

[Figure 1 (plots not reproduced): panels (a)-(d) show relative residual and estimated rank vs. iteration count; relative residual vs. iteration count for p = 0.02, 0.1, 0.5; total CPU time (secs) and common logarithm of the angle vs. dimension d for GD, PG-RMC, and NLEQ; and relative residual vs. CPU time for the three methods.]

Figure 1: Results on synthetic data. Top to bottom: (a)-(d). (a) d = 1e5, r = 10, p = 0.0015, ρ = 0.1. (b) d = 1e4, r = 10, ρ = 0.1, p = 0.02, 0.1, 0.5. (c) d = 1e3, 5e3, 1e4, 5e4, 1e5, r = 10, p = 0.15 r^2 log(m)/m, ρ = 0.1. (d) d = 1e5, r = 10, p = 0.002, ρ = 0.1.

the sparse component. We apply our method NLEQ, and also GD and PG-RMC, to two public benchmarks, Bootstrap and ShoppingMall.2 Each entry of the data matrix is observed independently w.p. p = 0.05. As presented in Figure 2, all three methods are able to separate the foreground from the background, and the backgrounds obtained by the three methods are similar.

[Figure 2 (images not reproduced): for the Bootstrap and ShoppingMall videos, each row shows an original frame and the backgrounds recovered by GD, PG-RMC, and NLEQ.]

Figure 2: Foreground-background separation.

5 Conclusion

In this paper, we study the RMC problem from an algebraic point of view: we transform the RMC problem into a problem of solving an overdetermined nonlinear system of equations (with outliers). This method does not require any objective function, convex relaxation or surrogate convex constraint. Algorithmically, we propose to solve the NLEQ via ADM, in which the true rank and the support set of the corruption are determined during the iteration. The algorithm is highly parallelizable and suitable for large scale problems. Theoretically, we characterize sufficient conditions under which L∗ can be approximated by the low rank approximation of M or (1/p)Π_Ω(M), i.e., conditions for M_r ≈ L∗, where M_r is the best rank-r approximation of the observed M. The convergence of the algorithm is guaranteed, and exact recovery is achieved under proper assumptions. Numerical simulations show that the algorithm is comparable with state-of-the-art methods in terms of efficiency and accuracy.

2http://vis-www.cs.umass.edu/~narayana/castanza/I2Rdataset/


References

Marcelo Bertalmío, Guillermo Sapiro, Vicent Caselles, and Coloma Ballester. Image inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH), pages 417–424, New Orleans, LA, 2000.

Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM J. Optim., 20(4):1956–1982, 2010.

Emmanuel J. Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.

Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6):717, 2009.

Emmanuel J. Candès and Terence Tao. The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Information Theory, 56(5):2053–2080, 2010.

Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? J. ACM, 58(3):11:1–11:37, 2011.

Venkat Chandrasekaran, Sujay Sanghavi, Pablo A. Parrilo, and Alan S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim., 21(2):572–596, 2011.

Yudong Chen, Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust matrix completion and corrupted columns. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 873–880, Bellevue, WA, 2011.

Yeshwanth Cherapanamjeri, Kartik Gupta, and Prateek Jain. Nearly optimal robust matrix completion. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 797–805, Sydney, Australia, 2017.

Mark A. Davenport and Justin K. Romberg. An overview of low-rank matrix recovery from incomplete observations. J. Sel. Topics Signal Processing, 10(4):608–622, 2016.

Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7(1):1–46, 1970.

James W. Demmel. Applied Numerical Linear Algebra. SIAM, Philadelphia, PA, 1997.

Petros Drineas and Michael W. Mahoney. Lectures on randomized numerical linear algebra. The Mathematics of Data, 25:1, 2018.

Aritra Dutta, Filip Hanzely, and Peter Richtárik. A nonconvex projection method for robust PCA. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI), pages 1468–1476, Honolulu, HI, 2019.

Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD), pages 226–231, Portland, OR, 1996.

Simon Funk. Netflix update: Try this at home, 2006.

Victoria Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.

Jun Hu and Ping Li. Decoupled collaborative ranking. In Proceedings of the 26th International Conference on World Wide Web (WWW), pages 1321–1329, Perth, Australia, 2017.

Jun Hu and Ping Li. Collaborative multi-objective ranking. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM), pages 1363–1372, Torino, Italy, 2018a.

Jun Hu and Ping Li. Collaborative filtering via additive ordinal regression. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM), pages 243–251, Marina Del Rey, CA, 2018b.

Jin Huang, Feiping Nie, Heng Huang, Yu Lei, and Chris H. Q. Ding. Social trust prediction using rank-k matrix recovery. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), pages 2647–2653, Beijing, China, 2013.

Prateek Jain and Praneeth Netrapalli. Fast exact matrix completion with finite samples. In Proceedings of The 28th Conference on Learning Theory (COLT), pages 1007–1034, Paris, France, 2015.

Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi. Low-rank matrix completion using alternating minimization. In Symposium on Theory of Computing Conference (STOC), pages 665–674, Palo Alto, CA, 2013.

Ian Jolliffe. Principal Component Analysis. Springer, 2011.

Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE Trans. Information Theory, 56(6):2980–2998, 2010.

Hyunsoo Kim, Gene H. Golub, and Haesun Park. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics, 21(2):187–198, 2005.

Olga Klopp, Karim Lounici, and Alexandre B. Tsybakov. Robust matrix completion. Probability Theory and Related Fields, 169(1-2):523–564, 2017.

Guangcan Liu and Ping Li. Low-rank matrix completion in the presence of high coherence. IEEE Trans. Signal Processing, 64(21):5623–5633, 2016.

Raghu Meka, Prateek Jain, and Inderjit S. Dhillon. Matrix completion from power-law distributed samples. In Advances in Neural Information Processing Systems (NIPS), pages 1258–1266, Vancouver, Canada, 2009.

Mostafa Rahmani and Ping Li. Outlier detection and robust PCA using a convex measure of innovation. In Advances in Neural Information Processing Systems (NeurIPS), pages 14200–14210, Vancouver, Canada, 2019.

Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev., 52(3):471–501, 2010.

Martin Slawski, Mostafa Rahmani, and Ping Li. A sparse representation-based approach to linear regression with partially shuffled labels. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), page 7, Tel Aviv, Israel, 2019.

Andrews Sobral, Thierry Bouwmans, and El-hadi Zahzah. LRSLibrary: Low-rank and sparse tools for background modeling and subtraction in videos. Robust Low-Rank and Sparse Matrix Decomposition: Applications in Image and Video Processing, 2016.

Gilbert W. Stewart. Matrix Algorithms Volume 2: Eigensystems. SIAM, 2001.

Gilbert W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Academic Press, Boston, 1990.

Min Tao and Xiaoming Yuan. Recovering low-rank and sparse components of matrices from incomplete and noisy observations. SIAM J. Optim., 21(1):57–81, 2011.

Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.

Charles F. Van Loan and Gene H. Golub. Matrix Computations. Johns Hopkins University Press, Baltimore, MD, 4th edition, 2012.

Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via outlier pursuit. In Advances in Neural Information Processing Systems (NIPS), pages 2496–2504, Vancouver, Canada, 2010.

Xinyang Yi, Dohyung Park, Yudong Chen, and Constantine Caramanis. Fast algorithms for robust PCA via gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 4152–4160, Barcelona, Spain, 2016.

Wen-Jun Zeng and Hing-Cheung So. Outlier-robust matrix completion via ℓp-minimization. IEEE Trans. Signal Processing, 66(5):1125–1140, 2018.


Supplementary Materials

6 Preliminary lemmas

The following lemma gives some fundamental results for sin Θ(U, V), which can be easily verified from the definition.

Lemma 1. Let [U, U_c] and [V, V_c] be two orthogonal matrices with U, V ∈ R^{n×k}. Then

‖sin Θ(U, V)‖_ui = ‖U_c^T V‖_ui = ‖U^T V_c‖_ui.

Here ‖·‖_ui denotes any unitarily invariant norm, including the spectral norm and the Frobenius norm. In particular, for the spectral norm it holds that ‖sin Θ(U, V)‖ = ‖U U^T − V V^T‖; for the Frobenius norm it holds that ‖sin Θ(U, V)‖_F = (1/√2) ‖U U^T − V V^T‖_F.

The following lemma is the well-known Weyl theorem, which gives a perturbation bound for the eigenvalues of a Hermitian matrix.

Lemma 2. (Stewart and Sun, 1990, p. 203) For two Hermitian matrices A, Ã ∈ C^{n×n}, let λ_1 ≤ · · · ≤ λ_n and λ̃_1 ≤ · · · ≤ λ̃_n be the eigenvalues of A and Ã, respectively. Then

|λ_j − λ̃_j| ≤ ‖A − Ã‖,  for 1 ≤ j ≤ n.

The following lemma, due to Davis and Kahan, is used to establish the perturbation bound for the invariant subspace of a Hermitian matrix.

Lemma 3. (Davis and Kahan, 1970, Theorem 5.1) Let H and M be two Hermitian matrices, and let S be a matrix of compatible size as determined by the Sylvester equation

H Y − Y M = S.

If either all eigenvalues of H are contained in a closed interval that contains no eigenvalue of M, or vice versa, then the Sylvester equation has a unique solution Y, and moreover

‖Y‖_ui ≤ (1/δ) ‖S‖_ui,

where δ = min |λ − ω| over all eigenvalues ω of M and all eigenvalues λ of H.

For a rectangular matrix A ∈ R^{m×n} (without loss of generality, assume m ≥ n), let the SVD of A be A = U Σ V^T, where U = [U_1 | U_2 | U_3] = [u_1, . . . , u_k | u_{k+1}, . . . , u_r | u_{r+1}, . . . , u_m] ∈ R^{m×m}, V = [V_1 | V_2 | V_3] = [v_1, . . . , v_k | v_{k+1}, . . . , v_r | v_{r+1}, . . . , v_n] ∈ R^{n×n} are orthogonal matrices, and

Σ = [ diag(Σ_1, Σ_2)   0_{r×(n−r)} ;  0_{(m−r)×r}   0_{(m−r)×(n−r)} ],

Σ_1 = diag(σ_1, . . . , σ_k), Σ_2 = diag(σ_{k+1}, . . . , σ_r) with σ_1 ≥ · · · ≥ σ_r > 0, k ≤ r = rank(A). Then the spectral decomposition of [0 A; A^T 0] can be given by

[ 0  A ; A^T  0 ] = X diag(Σ_1, Σ_2, −Σ_1, −Σ_2, 0_{n−r}, 0_{m−r}) X^T,   (13)

where

X = (1/√2) [ [U_1, U_2]  −[U_1, U_2]  0       √2 U_3 ;
             [V_1, V_2]   [V_1, V_2]  √2 V_3  0 ]

is an orthogonal matrix.

With the help of (13) and Lemmas 2 and 3, we are able to prove Lemma 4, which establishes an error bound for singular vectors.

Lemma 4. Given A ∈ R^{m×n} (m ≥ n), let the SVD of A be given as above. Let σ̃_j, ũ_j, ṽ_j be, respectively, the approximate singular values and left and right singular vectors of A, satisfying that Ũ = [ũ_1, . . . , ũ_k] ∈ R^{m×k} and Ṽ = [ṽ_1, . . . , ṽ_k] ∈ R^{n×k} are both orthonormal and Σ̃ = Ũ^T A Ṽ = diag(σ̃_1, . . . , σ̃_k) with σ̃_1 ≥ · · · ≥ σ̃_k > 0. Let

E = A Ṽ − Ũ Σ̃,   F = A^T Ũ − Ṽ Σ̃.   (14)

If

‖(I_m − Ũ Ũ^T) A (I_n − Ṽ Ṽ^T)‖ < σ̃_k,   max{‖E‖, ‖F‖} < σ_k − σ_{k+1},

then

max{Θ_u, Θ_v} ≤ η,
‖U_1 Σ_1 V_1^T − Ũ Σ̃ Ṽ^T‖_max / ‖A‖ ≤ (‖U_1‖_{2,∞} Θ_v + ‖V_1‖_{2,∞} Θ_u) + (1 + 3 ‖U_1‖_{2,∞} ‖V_1‖_{2,∞}) Θ_u Θ_v,

where Θ_u = ‖sin Θ(U_1, Ũ)‖, Θ_v = ‖sin Θ(V_1, Ṽ)‖, and η = max{‖E‖, ‖F‖} / (σ_k − σ_{k+1} − max{‖E‖, ‖F‖}).

Proof. Let

H = [ 0  A ; A^T  0 ],   X_1 = (1/√2) [ U_1  −U_1 ; V_1  V_1 ],
X̃_1 = (1/√2) [ Ũ  −Ũ ; Ṽ  Ṽ ],   X_2 = (1/√2) [ U_2  −U_2  0  √2 U_3 ; V_2  V_2  √2 V_3  0 ].

By calculation, we have

‖X_1 X_1^T − X̃_1 X̃_1^T‖_ui = ‖diag(U_1 U_1^T − Ũ Ũ^T, V_1 V_1^T − Ṽ Ṽ^T)‖_ui.   (15)

By simple calculations, we have

H X̃_1 − X̃_1 diag(Σ̃, −Σ̃) = (1/√2) [ A Ṽ − Ũ Σ̃   A Ṽ − Ũ Σ̃ ; A^T Ũ − Ṽ Σ̃   −A^T Ũ + Ṽ Σ̃ ] = (1/√2) [ E  E ; F  −F ] =: R,   (16a)

H X_2 − X_2 diag(Σ_2, −Σ_2, 0, 0) = 0,   (16b)

where (16a) uses (14) and (16b) uses the SVD of A. Then it follows from (16a) that

‖R‖ = ‖ diag(E, F) · (1/√2) [ I_k  I_k ; I_k  −I_k ] ‖ = ‖diag(E, F)‖ = max{‖E‖, ‖F‖}.   (17)

Pre-multiplying (16a) by X_2^T and using (16b), we have

X_2^T R = X_2^T H X̃_1 − X_2^T X̃_1 diag(Σ̃, −Σ̃) = diag(Σ_2, −Σ_2, 0, 0) X_2^T X̃_1 − X_2^T X̃_1 diag(Σ̃, −Σ̃).   (18)

To apply Lemma 3 to (18), we need to estimate the gap between the eigenvalues of diag(Σ̃, −Σ̃) and those of diag(Σ_2, −Σ_2, 0, 0). Using (16a) and Ũ^T A Ṽ = Σ̃, we have

(H − R X̃_1^T − X̃_1 R^T) X̃_1 = H X̃_1 − R = X̃_1 diag(Σ̃, −Σ̃),   (19)

which implies that ±σ̃_j are eigenvalues of H − R X̃_1^T − X̃_1 R^T, with corresponding eigenvectors (1/√2)[±ũ_j ; ṽ_j], for j = 1, . . . , k. Next, we claim that σ̃_1, . . . , σ̃_k are the k largest eigenvalues of H − R X̃_1^T − X̃_1 R^T. This is because

max_{X̃_1^T x = 0} x^T (H − R X̃_1^T − X̃_1 R^T) x / (x^T x)
  ≤ ‖(I − X̃_1 X̃_1^T)(H − R X̃_1^T − X̃_1 R^T)(I − X̃_1 X̃_1^T)‖
  = ‖(I − X̃_1 X̃_1^T) H (I − X̃_1 X̃_1^T)‖
  = ‖ diag(I_m − Ũ Ũ^T, I_n − Ṽ Ṽ^T) [ 0  A ; A^T  0 ] diag(I_m − Ũ Ũ^T, I_n − Ṽ Ṽ^T) ‖
  = ‖(I_m − Ũ Ũ^T) A (I_n − Ṽ Ṽ^T)‖ < σ̃_k.

Therefore, by Lemma 2, we have

|σ_j − σ̃_j| ≤ ‖R X̃_1^T + X̃_1 R^T‖,  for j = 1, . . . , k.   (20)

Together with (17), we get

|σ_j − σ̃_j| ≤ ‖R X̃_1^T + X̃_1 R^T‖ = max_j |λ_j([R, X̃_1][X̃_1^T ; R^T])| = max_j |λ_j([X̃_1^T ; R^T][R, X̃_1])|
           = max_j |λ_j([ 0  I_{2k} ; R^T R  0 ])| = ‖R‖ = max{‖E‖, ‖F‖}.   (21)

Here we use the property that for any two matrices A ∈ R^{m×n}, B ∈ R^{n×m}, the nonzero eigenvalues of AB and BA are the same.

Now, by the assumption that max{‖E‖, ‖F‖} < σ_k − σ_{k+1}, we have

σ̃_k − σ_{k+1} = (σ_k − σ_{k+1}) + (σ̃_k − σ_k) ≥ σ_k − σ_{k+1} − max{‖E‖, ‖F‖} > 0,   (22)

therefore the eigenvalues of diag(Σ_2, −Σ_2, 0, 0) lie in [−σ_{k+1}, σ_{k+1}], which contains no eigenvalue of diag(Σ̃, −Σ̃). We are thus able to apply Lemma 3 to (18), which yields

‖X_2^T X̃_1‖_ui ≤ ‖X_2^T R‖_ui / (σ_k − σ_{k+1} − max{‖E‖, ‖F‖}).   (23)

Using (15), Lemma 1, (22) and (23), we get

max{Θ_u, Θ_v} = ‖sin Θ(X_1, X̃_1)‖ = ‖X_2^T X̃_1‖ ≤ ‖X_2^T R‖ / (σ_k − σ_{k+1} − max{‖E‖, ‖F‖}) ≤ η.   (24)

Let

Ũ = U Γ_u = [U_1, U_2, U_3] [ Γ_{u1} ; Γ_{u2} ; Γ_{u3} ],   Ṽ = V Γ_v = [V_1, V_2, V_3] [ Γ_{v1} ; Γ_{v2} ; Γ_{v3} ],   (25)

where Γ_{u1} ∈ R^{k×k}, Γ_{u2} ∈ R^{(r−k)×k}, Γ_{u3} ∈ R^{(m−r)×k}, Γ_{v1} ∈ R^{k×k}, Γ_{v2} ∈ R^{(r−k)×k}, Γ_{v3} ∈ R^{(n−r)×k}, and [Γ_{u1}; Γ_{u2}; Γ_{u3}], [Γ_{v1}; Γ_{v2}; Γ_{v3}] are both orthonormal. By (24), we have

‖[Γ_{u2}; Γ_{u3}]‖ = Θ_u,  σ_min(Γ_{u1}) = √(1 − Θ_u^2),   ‖[Γ_{v2}; Γ_{v3}]‖ = Θ_v,  σ_min(Γ_{v1}) = √(1 − Θ_v^2).   (26)

Substituting (25) into Ũ^T A Ṽ = Σ̃ and using the SVD of A, we have

Σ̃ = [Γ_{u1}^T, Γ_{u2}^T, Γ_{u3}^T] diag(Σ_1, Σ_2, 0_{(m−r)×(n−r)}) [Γ_{v1}; Γ_{v2}; Γ_{v3}] = Γ_{u1}^T Σ_1 Γ_{v1} + Γ_{u2}^T Σ_2 Γ_{v2}.   (27)

Then it follows that

‖Σ_1 − Γ_{u1} Σ̃ Γ_{v1}^T‖ = ‖(Σ_1 − Γ_{u1} Γ_{u1}^T Σ_1) + (Γ_{u1} Γ_{u1}^T Σ_1 − Γ_{u1} Γ_{u1}^T Σ_1 Γ_{v1} Γ_{v1}^T) − Γ_{u1} Γ_{u2}^T Σ_2 Γ_{v2} Γ_{v1}^T‖
  ≤ ‖I − Γ_{u1} Γ_{u1}^T‖ ‖Σ_1‖ + ‖Γ_{u1} Γ_{u1}^T‖ ‖I − Γ_{v1} Γ_{v1}^T‖ ‖Σ_1‖ + ‖Γ_{u2}‖ ‖Γ_{v2}‖ ‖Σ_2‖
  ≤ (Θ_u^2 + Θ_v^2 + Θ_u Θ_v) ‖Σ_1‖.   (28)

Finally, using (26), (27), (28) and ‖Γ_{u1}‖ ≤ 1, ‖Γ_{v1}‖ ≤ 1, ‖Σ̃‖ ≤ ‖A‖, we have

‖U_1 Σ_1 V_1^T − Ũ Σ̃ Ṽ^T‖_max
 = max_{i,j} |e_i^T (U_1 Σ_1 V_1^T − Ũ Σ̃ Ṽ^T) e_j|
 = max_{i,j} |e_i^T (U_1 Σ_1 V_1^T − U Γ_u Σ̃ Γ_v^T V^T) e_j|
 ≤ max_{i,j} |e_i^T (U_1 Σ_1 V_1^T − U_1 Γ_{u1} Σ̃ Γ_{v1}^T V_1^T) e_j| + ‖[U_2, U_3][Γ_{u2}; Γ_{u3}] Σ̃ [Γ_{v2}; Γ_{v3}]^T [V_2, V_3]^T‖
   + max_{i,j} ( |e_i^T [U_2, U_3][Γ_{u2}; Γ_{u3}] Σ̃ Γ_{v1}^T V_1^T e_j| + |e_i^T U_1 Γ_{u1} Σ̃ [Γ_{v2}; Γ_{v3}]^T [V_2, V_3]^T e_j| )
 ≤ max_{i,j} ( 3 ‖e_i^T U_1‖ ‖e_j^T V_1‖ ‖A‖ Θ_u Θ_v + ‖A‖ Θ_u Θ_v + ‖e_j^T V_1‖ ‖A‖ Θ_u + ‖e_i^T U_1‖ ‖A‖ Θ_v )
 ≤ ‖A‖ ( (‖U_1‖_{2,∞} Θ_v + ‖V_1‖_{2,∞} Θ_u) + (1 + 3 ‖U_1‖_{2,∞} ‖V_1‖_{2,∞}) Θ_u Θ_v ),

completing the proof.

Lemma 5. (Tropp, 2015, Corollary 6.1.2) Let S_1, . . . , S_n be independent random matrices with common dimension d_1 × d_2, and assume that each matrix has uniformly bounded deviation from its mean:

‖S_k − E(S_k)‖ ≤ L,  for each k = 1, . . . , n.

Let Z = Σ_{k=1}^n S_k, and let v(Z) denote the matrix covariance statistic of the sum:

v(Z) = max{ ‖E[(Z − E(Z))(Z − E(Z))^H]‖, ‖E[(Z − E(Z))^H (Z − E(Z))]‖ }
     = max{ ‖E[Σ_{k=1}^n (S_k − E(S_k))(S_k − E(S_k))^H]‖, ‖E[Σ_{k=1}^n (S_k − E(S_k))^H (S_k − E(S_k))]‖ }.

Then for all t ≥ 0,

P{‖Z − E(Z)‖ ≥ t} ≤ (d_1 + d_2) · exp( (−t^2/2) / (v(Z) + Lt/3) ).

Lemma 6. For any linear homogeneous function F : R^k → R^{m×n}, assume that the linear system of equations F(x) = C either has a unique solution or has no solution at all. Then it holds that

argmin_x ‖F(x) − C‖ = argmin_x ‖F(x) − C‖_F.

Proof. For any A, B ∈ R^{m×n}, define ⟨A, B⟩ = trace(A^T B). It is easy to see that ⟨·, ·⟩ is an inner product over R^{m×n}. Denote the range space of F(·) by 𝓕, and its orthogonal complement by 𝓕^⊥. Write C = C_LS + C̃ such that C_LS ∈ 𝓕 and C̃ ∈ 𝓕^⊥. Then the solutions to min ‖F(x) − C‖ and min ‖F(x) − C‖_F are nothing but the solutions to F(x) = C_LS. Since C_LS ∈ 𝓕, F(x) = C_LS has at least one solution. By the assumption, this solution is unique. The proof is completed.

Lemma 7. Let L∗ ∈ R^{m×n} with m ≥ n, and let the SVD of L∗ be L∗ = U_∗ Σ_∗ V_∗^T, where U_∗ ∈ R^{m×r}, V_∗ ∈ R^{n×r} are orthonormal, Σ_∗ = diag(σ_{1∗}, . . . , σ_{r∗}) with σ_{1∗} ≥ · · · ≥ σ_{r∗} > 0. Let G ∈ R^{m×n} be a perturbation to L∗, and let X ∈ R^{m×r}, Y ∈ R^{n×r} have full column rank. Denote θ_x = ‖sin Θ(U_∗, X)‖, θ_y = ‖sin Θ(V_∗, Y)‖. Then

min_{X,Y} ‖L∗ − G − X Y^T‖ ≥ σ_{r∗} max{ √(1 − θ_x^2) θ_y, √(1 − θ_y^2) θ_x } √(1 − θ_x^2) √(1 − θ_y^2) − ‖G‖.

Proof. Let U_{∗,c}, V_{∗,c} be such that U = [U_∗, U_{∗,c}], V = [V_∗, V_{∗,c}] are orthogonal. Let X̄ = U_∗ C_x + U_{∗,c} S_x, Ȳ = V_∗ C_y + V_{∗,c} S_y, where the columns of X̄, Ȳ form orthonormal bases for R(X) and R(Y), respectively, C_x^T C_x + S_x^T S_x = I_r, C_y^T C_y + S_y^T S_y = I_r. By Lemma 1, we know that ‖S_x‖ = θ_x, ‖S_y‖ = θ_y.

Noticing that

min_{X,Y} ‖L∗ − X Y^T‖^2 = min_D ‖U^T L∗ V − U^T X̄ D Ȳ^T V‖^2
 = min_D ‖ [Σ_1  0 ; 0  0] − [C_x ; S_x] D [C_y^T, S_y^T] ‖^2
 = ‖ [Σ_1  0 ; 0  0] − [C_x ; S_x] [C_x, S_x]^T [Σ_1  0 ; 0  0] [C_y ; S_y] [C_y^T, S_y^T] ‖^2,

we have

min_{X,Y} ‖L∗ − X Y^T‖^2 ≥ max{ ‖C_x [C_x, S_x]^T [Σ_1 0; 0 0] [C_y; S_y] S_y^T‖^2, ‖S_x [C_x, S_x]^T [Σ_1 0; 0 0] [C_y; S_y] C_y^T‖^2 }
 ≥ max{ σ_min^2(C_x) ‖S_y‖^2, σ_min^2(C_y) ‖S_x‖^2 } · σ_min^2( [C_x, S_x]^T [Σ_1 0; 0 0] [C_y; S_y] )
 ≥ max{ (1 − θ_x^2) θ_y^2, (1 − θ_y^2) θ_x^2 } · ( σ_{r∗} √(1 − θ_x^2) √(1 − θ_y^2) )^2.

Combining this with the fact that ‖L∗ − G − X Y^T‖ ≥ ‖L∗ − X Y^T‖ − ‖G‖ for any X, Y, we get the conclusion.

Lemma 8. Let L∗, G be the same as in Lemma 7. Let X = (L∗ − G) Y, where Y ∈ R^{n×r} is orthonormal. Denote θ_x = ‖sin Θ(U_∗, X)‖, θ_y = ‖sin Θ(V_∗, Y)‖. If ‖G‖ < σ_{r∗} √(1 − θ_y^2), then

σ_r(X) ≥ σ_{r∗} √(1 − θ_y^2) − ‖G‖,   θ_x ≤ ‖G‖ / ( σ_{r∗} √(1 − θ_y^2) − ‖G‖ ).

Proof. By Lemma 2 and Lemma 1, we have

σ_r(X) = σ_r((L∗ − G) Y) ≥ σ_r(L∗ Y) − ‖G Y‖ ≥ σ_r(Σ_∗ V_∗^T Y) − ‖G‖ ≥ σ_{r∗} σ_min(V_∗^T Y) − ‖G‖
 = σ_{r∗} σ_min^{1/2}(Y^T V_∗ V_∗^T Y) − ‖G‖ ≥ σ_{r∗} σ_min^{1/2}(I_r − Y^T (I − V_∗ V_∗^T) Y) − ‖G‖
 = σ_{r∗} √(1 − ‖(I − V_∗ V_∗^T) Y‖^2) − ‖G‖ = σ_{r∗} √(1 − θ_y^2) − ‖G‖ > 0.   (29)

Therefore, X has full column rank. Denote G_x = (X^T X)^{−1/2}, X̄ = X G_x. Then, since X = (L∗ − G) Y, X̄ can be rewritten as X̄ = (L∗ − G) Y G_x. Using Lemma 1 and (29), we have

θ_x = ‖U_{∗,c}^T X̄‖ = ‖U_{∗,c}^T (L∗ − G) Y G_x‖ ≤ ‖G Y G_x‖ ≤ ‖G‖ ‖G_x‖ ≤ ‖G‖ / σ_r(X) ≤ ‖G‖ / ( σ_{r∗} √(1 − θ_y^2) − ‖G‖ ).

The proof is completed.

Lemma 9. Let U, X ∈ R^{m×r} both have orthonormal columns. It holds that ‖X‖_{2,∞} ≤ ‖U‖_{2,∞} + ‖sin Θ(U, X)‖.

Proof. Let U_c be such that [U, U_c] is an orthogonal matrix. We can write X = U C_x + U_c S_x, where C_x^T C_x + S_x^T S_x = I_r. By Lemma 1, we have ‖sin Θ(U, X)‖ = ‖U_c^T X‖ = ‖S_x‖. Then for any 1 ≤ i ≤ m, we have

‖e_i^T X‖ = ‖e_i^T U C_x + e_i^T U_c S_x‖ ≤ ‖e_i^T U‖ + ‖S_x‖,

and the conclusion follows.

Lemma 10. (Jain and Netrapalli, 2015, Lemmas 8, 10) Let A ∈ R^{m×n} with m ≥ n. Suppose Ω is obtained by sampling each entry of A with probability p ∈ [1/(4m), 0.5]. Then w.p. ≥ 1 − 1/m^{10+log α},

‖(1/p) Π_Ω(A) − A‖ ≤ 6 √(α m) / √p · ‖A‖_max.

7 Proof for Main Theorems

7.1 Proof of Theorem 1

Proof of Theorem 1. First, it holds that ‖(I − U_∗ U_∗^T) M (I − V_∗ V_∗^T)‖ = ‖(I − U_∗ U_∗^T) S_∗ (I − V_∗ V_∗^T)‖. Then, by assumption, we have ‖(I − U_∗ U_∗^T) M (I − V_∗ V_∗^T)‖ < σ_{r∗}.


Second, we have

‖E‖ = ‖M V_∗ − U_∗ Σ_∗‖ = ‖L_∗ V_∗ − U_∗ Σ_∗ + S_∗ V_∗‖ = ‖S_∗ V_∗‖,
‖F‖ = ‖M^T U_∗ − V_∗ Σ_∗‖ = ‖L_∗^T U_∗ − V_∗ Σ_∗ + S_∗^T U_∗‖ = ‖S_∗^T U_∗‖.

It follows that

max{‖E‖, ‖F‖} = max{‖S_∗ V_∗‖, ‖S_∗^T U_∗‖} < σ_r − σ_{r+1}.

Then applying Lemma 4 gives the conclusion.

7.2 Proof of Theorem 2

Throughout the rest of this section, we follow the notation of Algorithm 1. Besides that, we also adopt the following notation. Denote

r = rank(L∗),  κ_∗ = κ_2(L∗),  p′ = p(1 − ϱ),  Ω_t = Ω \ supp(S_t),  G_t = S_t − S_∗.   (30)

The SVD of L∗ is given by

L∗ = [U_∗, U_{∗,c}] diag(Σ_∗, 0) [V_∗, V_{∗,c}]^T,   (31)

where [U_∗, U_{∗,c}] and [V_∗, V_{∗,c}] are orthogonal matrices, U_∗ ∈ R^{m×r} and V_∗ ∈ R^{n×r}, Σ_∗ = diag(σ_{1∗}, . . . , σ_{r∗}) with σ_{1∗} ≥ · · · ≥ σ_{r∗} > 0. Further denote

θ_{x,t} = ‖sin Θ(U_∗, X_t)‖,  θ_{y,t} = ‖sin Θ(V_∗, Y_t)‖.   (32)

Lemma 11. ‖St − S∗‖max ≤ 2‖ΠΩ(XtΣtYTt − L∗)‖max for t = 0, 1, . . . .

Proof. Denote $\Phi_*=\operatorname{supp}(S_*)$ and $\Phi_t=\operatorname{supp}(S_t)$. It is obvious that $S_t-S_*$ is supported on $\Phi_t\cup\Phi_*$ and $\Phi_t\cup\Phi_*\subset\Omega$. Now we claim that
\[
\|\Pi_\Omega(S_t-S_*)\|_{\max}\le 2\|\Pi_\Omega(X_t\Sigma_tY_t^{\mathrm T}-L_*)\|_{\max}.
\]
To show the claim, it suffices to consider the following two cases.

Case (1): For any $(i,j)\in\Phi_t$, it holds that $(S_t)_{ij}=(L_*+S_*-X_t\Sigma_tY_t^{\mathrm T})_{ij}$. Then it follows that
\[
|(S_t-S_*)_{ij}|=|(L_*-X_t\Sigma_tY_t^{\mathrm T})_{ij}|\le\|\Pi_\Omega(X_t\Sigma_tY_t^{\mathrm T}-L_*)\|_{\max}.
\]

Case (2): For any $(i,j)\in\Phi_*\setminus\Phi_t$, it holds that $(S_t)_{ij}=0$. If $|(S_t-S_*)_{ij}|=|(S_*)_{ij}|>2\|\Pi_\Omega(X_t\Sigma_tY_t^{\mathrm T}-L_*)\|_{\max}$, then
\[
|(L_*+S_*-X_t\Sigma_tY_t^{\mathrm T})_{ij}|>\|\Pi_\Omega(L_*-X_t\Sigma_tY_t^{\mathrm T})\|_{\max}.
\]
Noticing that $S_*$ only changes $s$ entries of $\Pi_\Omega(L_*-X_t\Sigma_tY_t^{\mathrm T})$, we know that the $(i,j)$ entry of $|\Pi_\Omega(L_*+S_*-X_t\Sigma_tY_t^{\mathrm T})|$ is larger than the $(s+1)$st largest entry of $|\Pi_\Omega(L_*+S_*-X_t\Sigma_tY_t^{\mathrm T})|$. This contradicts $(i,j)\notin\Phi_t$.
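
The following sketch mimics the $S$-update described in this proof, keeping the $s$ largest entries of the observed residual (this update rule is assumed here, since Algorithm 1 is not restated), and checks the claimed max-norm bound on hypothetical data:

import numpy as np

# Sketch of the S-update described in the proof of Lemma 11 (keep the s largest entries of
# the observed residual) and a check of the resulting max-norm bound.  The exact update rule
# of Algorithm 1 is assumed, not quoted; all sizes here are hypothetical.
rng = np.random.default_rng(4)
m, n, r, s, p = 60, 50, 4, 40, 0.4

U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
L_star = U @ np.diag([8.0, 6.0, 5.0, 4.0]) @ V.T

mask = rng.random((m, n)) < p                       # observed set Omega
S_star = np.zeros((m, n))
obs = np.flatnonzero(mask)
spots = rng.choice(obs, size=s, replace=False)      # corruptions live inside Omega
S_star.flat[spots] = 10.0 * rng.standard_normal(s)
M = L_star + S_star

# Current low-rank iterate (here: a noisy copy of L*, standing in for X_t Sigma_t Y_t^T).
Lt = L_star + 0.05 * rng.standard_normal((m, n))

residual = mask * (M - Lt)                          # Pi_Omega(M - X_t Sigma_t Y_t^T)
St = np.zeros_like(residual)
keep = np.argsort(np.abs(residual), axis=None)[-s:] # indices of the s largest entries
St.flat[keep] = residual.flat[keep]

lhs = np.abs(St - S_star).max()
rhs = 2 * np.abs(mask * (Lt - L_star)).max()
print(lhs, rhs, lhs <= rhs)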

Lemma 12. Assume (A1). Denote $r_s'=\frac{\|S_0-S_*\|_F^2}{\|S_0-S_*\|^2}$. Let $S_0$ be obtained as in Algorithm 1. It holds that
\[
\|S_0-S_*\|\le 2\sqrt{\frac{2\varrho p}{r_s'}}\,\mu r\|L_*\|.
\]

Proof. First, for any $i$, $j$, we have $L_{ij}=e_i^{\mathrm T}U_*\Sigma_*V_*^{\mathrm T}e_j$. Using (A1), we have
\[
|L_{ij}|\le\|e_i^{\mathrm T}U_*\|\,\|\Sigma_*\|\,\|e_j^{\mathrm T}V_*\|\le\frac{\mu r}{\sqrt{mn}}\|L_*\|,
\]
and hence
\[
\|L_*\|_{\max}\le\frac{\mu r}{\sqrt{mn}}\|L_*\|. \tag{33}
\]


By Lemma 11, we have
\[
\|S_0-S_*\|_{\max}\le 2\|\Pi_\Omega(L_*)\|_{\max}\le\frac{2\mu r}{\sqrt{mn}}\|L_*\|. \tag{34}
\]
Therefore, using (33), (34) and (A2), we have
\[
\|S_0-S_*\|_F\le\sqrt{2s}\,\|S_0-S_*\|_{\max}\le 2\sqrt{2s}\,\|\Pi_\Omega(L_*)\|_{\max}\le 2\sqrt{2s}\,\|L_*\|_{\max}\le 2\sqrt{2\varrho p}\,\mu r\|L_*\|. \tag{35}
\]
By the definition of $r_s'$, it follows that
\[
\|S_0-S_*\|\le\frac{\|S_0-S_*\|_F}{\sqrt{r_s'}}\le 2\sqrt{\frac{2\varrho p}{r_s'}}\,\mu r\|L_*\|.
\]
The proof is completed.

Proof of Theorem 2. By (A3), Lemma 10 and (33), w.p. $\ge 1-1/m^{10+\log\alpha}$, it holds that
\[
\Big\|\frac{1}{p'}\Pi_{\Omega_0}(L_*)-L_*\Big\|\le\frac{6\sqrt{\alpha m}}{\sqrt{p'}}\|L_*\|_{\max}\le\xi\mu r\|L_*\|. \tag{36}
\]
Using Lemma 12 and (36), we have w.p. $\ge 1-1/m^{10+\log\alpha}$,
\[
\Big\|\frac{1}{p'}\Pi_{\Omega_0}(M-S_0)-L_*\Big\|\le\Big\|\frac{1}{p'}\Pi_{\Omega_0}(L_*)-L_*\Big\|+\frac{1}{p'}\|S_0-S_*\|\le(\xi+\gamma)\mu r\|L_*\|. \tag{37}
\]
Let the SVD of $X_1^{\mathrm T}L_*Y_1$ be $X_1^{\mathrm T}L_*Y_1=\hat P\hat\Sigma\hat Q^{\mathrm T}$, where $\hat P$, $\hat Q$ are orthogonal matrices and $\hat\Sigma=\operatorname{diag}(\hat\sigma_1,\dots,\hat\sigma_r)$. Denote $\hat U=X_1\hat P$, $\hat V=Y_1\hat Q$, and let
\[
E=L_*\hat V-\hat U\hat\Sigma,\qquad F=L_*^{\mathrm T}\hat U-\hat V\hat\Sigma. \tag{38}
\]
Then it follows that
\[
\|\Sigma_1-X_1^{\mathrm T}L_*Y_1\|=\Big\|X_1^{\mathrm T}\Big[\frac{1}{p'}\Pi_{\Omega_0}(M-S_0)-L_*\Big]Y_1\Big\|\le\Big\|\frac{1}{p'}\Pi_{\Omega_0}(M-S_0)-L_*\Big\|. \tag{39}
\]
Using (37) and (39), by calculations, we get
\begin{align*}
\|E\| &= \|L_*\hat V-\hat U\hat\Sigma\| = \|L_*Y_1-X_1\hat P\hat\Sigma\hat Q^{\mathrm T}\| = \|L_*Y_1-X_1X_1^{\mathrm T}L_*Y_1\|\\
&\le \|L_*Y_1-X_1\Sigma_1\|+\|X_1(\Sigma_1-X_1^{\mathrm T}L_*Y_1)\|\\
&= \Big\|L_*Y_1-\frac{1}{p'}\Pi_{\Omega_0}(M-S_0)Y_1\Big\|+\|\Sigma_1-X_1^{\mathrm T}L_*Y_1\|\\
&\le 2\Big\|\frac{1}{p'}\Pi_{\Omega_0}(M-S_0)-L_*\Big\|\le 2(\xi+\gamma)\mu r\|L_*\|,\quad\text{w.p. }\ge 1-1/m^{10+\log\alpha}. \tag{40}
\end{align*}
Similarly, we get
\[
\|F\|\le 2(\xi+\gamma)\mu r\|L_*\|,\quad\text{w.p. }\ge 1-1/m^{10+\log\alpha}. \tag{41}
\]
Next, we only need to show $\max\{\|E\|,\|F\|\}\le\sigma_{r*}$ and $\|(I_m-\hat U\hat U^{\mathrm T})L_*(I_n-\hat V\hat V^{\mathrm T})\|<\hat\sigma_r$. Once these two inequalities hold, we may apply Lemma 4.

For the first inequality, using (40), (41) and the assumption $(\xi+\gamma)\mu\kappa r<\frac{1}{6}$, we get
\[
\max\{\|E\|,\|F\|\}\le 2(\xi+\gamma)\mu r\|L_*\|<\sigma_{r*},\quad\text{w.p. }\ge 1-1/m^{10+\log\alpha}. \tag{42}
\]


For the second inequality, using (35) and (36), we have
\[
\Big\|\frac{1}{p'}\Pi_{\Omega}(M-S_0)-L_*\Big\|\le\Big\|\frac{1}{p'}\Pi_{\Omega}(L_*)-L_*\Big\|+\frac{1}{p'}\|S_*-S_0\|\le(\xi+\gamma)\mu r\|L_*\|. \tag{43}
\]
Then using Lemma 2, (37), (39) and (43), we have
\[
|\hat\sigma_r-\sigma_{r*}|\le|\hat\sigma_r-\gamma_{r,0}|+|\gamma_{r,0}-\sigma_{r*}|\le\|X_1^{\mathrm T}L_*Y_1-\Sigma_1\|+\Big\|\frac{1}{p'}\Pi_{\Omega_0}(M-S_0)-L_*\Big\|\le 2(\xi+\gamma)\mu r\|L_*\|.
\]
It follows that
\[
\hat\sigma_r\ge\sigma_{r*}-2(\xi+\gamma)\mu r\|L_*\|. \tag{44}
\]
Then using the assumption $(\xi+\gamma)\mu\kappa r<\frac{1}{6}$, (37) and (44), we have
\begin{align*}
\|(I_m-\hat U\hat U^{\mathrm T})L_*(I_n-\hat V\hat V^{\mathrm T})\| &= \Big\|(I_m-X_1X_1^{\mathrm T})\Big[L_*-\frac{1}{p'}\Pi_{\Omega_0}(M-S_0)\Big](I_n-Y_1Y_1^{\mathrm T})\Big\|\\
&\le\Big\|L_*-\frac{1}{p'}\Pi_{\Omega_0}(M-S_0)\Big\|\le(\xi+\gamma)\mu r\|L_*\|<\sigma_{r*}-2(\xi+\gamma)\mu r\|L_*\|\le\hat\sigma_r.
\end{align*}

Now using (40), (41), the assumption $(\xi+\gamma)\mu\kappa r<\frac{1}{6}$ and Lemma 4, we have
\begin{align*}
\max\{\theta_{x,1},\theta_{y,1}\} &= \max\{\|\sin\Theta(U_*,\hat U)\|,\|\sin\Theta(V_*,\hat V)\|\}\le\frac{2(\xi+\gamma)\mu r\kappa}{1-1/3}=3(\xi+\gamma)\mu r\kappa, \tag{45a}\\
\|L_*-\hat U\hat\Sigma\hat V^{\mathrm T}\|_{\max}/\|L_*\| &\le (\|U_*\|_{2,\infty}\theta_{y,1}+\|V_*\|_{2,\infty}\theta_{x,1})+(1+3\|U_*\|_{2,\infty}\|V_*\|_{2,\infty})\theta_{x,1}\theta_{y,1}. \tag{45b}
\end{align*}

Using the assumption $(\xi+\gamma)\mu\kappa r<\frac{1}{3}\sqrt{\frac{\mu_1'r}{m}}$, by (45a), we have $\max\{\theta_{x,1},\theta_{y,1}\}\le\sqrt{\frac{\mu_1'r}{m}}$. On the other hand, assumption (A1) implies that
\[
\|U_*\|_{2,\infty}\le\sqrt{\frac{\mu r}{m}},\qquad \|V_*\|_{2,\infty}\le\sqrt{\frac{\mu r}{n}}. \tag{46}
\]
Then it follows from Lemma 9 that
\begin{align*}
\|X_1\|_{2,\infty} &\le \|U_*\|_{2,\infty}+\|\sin\Theta(X_1,U_*)\|\le\sqrt{\frac{\mu r}{m}}+\sqrt{\frac{\mu_1'r}{m}}\le\sqrt{\frac{\mu_1 r}{m}}, \tag{47a}\\
\|Y_1\|_{2,\infty} &\le \|V_*\|_{2,\infty}+\|\sin\Theta(Y_1,V_*)\|\le\sqrt{\frac{\mu r}{n}}+\sqrt{\frac{\mu_1'r}{m}}\le\sqrt{\frac{\mu_1 r}{n}}. \tag{47b}
\end{align*}

Using the assumption $(\xi+\gamma)\mu\kappa r<\frac{1}{3}\sqrt{\frac{\mu_1'r}{m}}$, (37), (39), (45b), (46) and (47), by calculations, we have
\begin{align*}
\|L_*-X_1\Sigma_1Y_1^{\mathrm T}\|_{\max}/\|L_*\| &\le \|L_*-\hat U\hat\Sigma\hat V^{\mathrm T}\|_{\max}/\|L_*\|+\|\hat U\hat\Sigma\hat V^{\mathrm T}-X_1\Sigma_1Y_1^{\mathrm T}\|_{\max}/\|L_*\|\\
&= \|L_*-\hat U\hat\Sigma\hat V^{\mathrm T}\|_{\max}/\|L_*\|+\|X_1(X_1^{\mathrm T}L_*Y_1-\Sigma_1)Y_1^{\mathrm T}\|_{\max}/\|L_*\|\\
&\le \|L_*-\hat U\hat\Sigma\hat V^{\mathrm T}\|_{\max}/\|L_*\|+\|X_1\|_{2,\infty}\|X_1^{\mathrm T}L_*Y_1-\Sigma_1\|\,\|Y_1\|_{2,\infty}/\|L_*\|\\
&\le \|L_*-\hat U\hat\Sigma\hat V^{\mathrm T}\|_{\max}/\|L_*\|+\|X_1\|_{2,\infty}\|Y_1\|_{2,\infty}(\xi+\gamma)\mu r\kappa\\
&\le (\|U_*\|_{2,\infty}\theta_{y,1}+\|V_*\|_{2,\infty}\theta_{x,1})+(1+3\|U_*\|_{2,\infty}\|V_*\|_{2,\infty})\theta_{x,1}\theta_{y,1}+\|X_1\|_{2,\infty}\|Y_1\|_{2,\infty}\cdot\frac{1}{3}\sqrt{\frac{\mu_1'r}{m}}\\
&\le \Big(\sqrt{\frac{\mu r}{m}}\,\theta_{y,1}+\sqrt{\frac{\mu r}{n}}\,\theta_{x,1}+\theta_{x,1}\theta_{y,1}\Big)+O(n^{-3/2}),
\end{align*}
which completes the proof.


7.3 Proof of Theorem 3

Proof of Theorem 3. First, we give an upper bound for $\sup_{X\in\mathbb{R}^{m\times r}}\|\Pi_{\Omega_t}(R)\Pi_{\Omega_t}(XY_t^{\mathrm T})^{\mathrm T}\|/\|X\|$. Let $\{\delta_{ij}\}$ be an independent family of Bernoulli($p'$) random variables, let $X^{\mathrm T}=[x_1,\dots,x_m]\in\mathbb{R}^{r\times m}$ be an arbitrary nonzero matrix with $\|X\|=1$, and let $Y_t^{\mathrm T}=[y_1,\dots,y_n]$. Denote $E_{ij}=e_ie_j^{\mathrm T}$, $R=[r_{ij}]$,
\[
W_{il}=\sum_{j,k}\delta_{ij}r_{ij}E_{ij}\,\delta_{lk}x_k^{\mathrm T}y_l\,E_{kl}^{\mathrm T},\qquad W=\sum_{i,l}W_{il}.
\]
By calculations, we have
\begin{align*}
\mathbb{E}(W_{il}) &= p'^2\sum_{j,k}r_{ij}E_{ij}\,y_k^{\mathrm T}x_l\,E_{kl}=p'^2\sum_{j}r_{ij}E_{ij}\,y_j^{\mathrm T}x_l\,E_{jl}=p'^2R_{(i,:)}Y_tx_lE_{il}=0,\\
\|W_{il}\| &\le \sqrt{p'n}\,\max_j|r_{ij}x_j^{\mathrm T}y_l|\le\sqrt{\mu'rp'}\,\|R\|_{\max},\\
\Big\|\mathbb{E}\Big[\sum_{i,l}W_{il}W_{il}^{\mathrm T}\Big]\Big\| &= \Big\|\mathbb{E}\Big[\sum_{i,l}\Big(\sum_{j,k}\delta_{ij}r_{ij}E_{ij}\delta_{lk}x_k^{\mathrm T}y_lE_{kl}^{\mathrm T}\Big)\Big(\sum_{j',k'}\delta_{ij'}r_{ij'}E_{ij'}\delta_{lk'}x_{k'}^{\mathrm T}y_lE_{k'l}^{\mathrm T}\Big)^{\mathrm T}\Big]\Big\|=0,\\
\Big\|\mathbb{E}\Big[\sum_{i,l}W_{il}^{\mathrm T}W_{il}\Big]\Big\| &= 0.
\end{align*}
By Lemma 5, we have $\mathbb{P}\{\|W\|>t\}\le(m+n)\exp\Big(-\frac{3t/2}{\sqrt{\mu'rp'}\,\|R\|_{\max}}\Big)$. Let $t=\frac{2}{3}(\log(m+n)+5)\sqrt{\mu'rp'}\,\|R\|_{\max}$; then w.p. $\ge 0.99$, it holds that
\[
\|W\|\le\frac{2}{3}(\log(m+n)+5)\sqrt{\mu'rp'}\,\|R\|_{\max}. \tag{48}
\]

Second, it is easy to see that $\hat X_{\mathrm{opt}}=(M-S_t)Y_t$. By calculations, we have
\begin{align*}
&\min_X\|\Pi_{\Omega_t}(XY_t^{\mathrm T})-\Pi_{\Omega_t}(M-S_t)\|^2 \tag{49}\\
&\quad=\min_{\Delta X}\|\Pi_{\Omega_t}((\hat X_{\mathrm{opt}}+\Delta X)Y_t^{\mathrm T})-\Pi_{\Omega_t}((M-S_t)Y_tY_t^{\mathrm T}+(M-S_t)(I-Y_tY_t^{\mathrm T}))\|^2\\
&\quad=\min_{\Delta X}\|\Pi_{\Omega_t}(\Delta XY_t^{\mathrm T})-\Pi_{\Omega_t}(R)\|^2. \tag{50}
\end{align*}
Then we claim that (50) is minimized when $\Delta X=X_{\mathrm{opt}}-\hat X_{\mathrm{opt}}$. This is because (49) is minimized when $X=X_{\mathrm{opt}}$ and $X=\hat X_{\mathrm{opt}}+\Delta X$. Thus, we have
\[
\|X_{\mathrm{opt}}-\hat X_{\mathrm{opt}}\|=\|\Delta X\|\le\frac{\sup_{X\in\mathbb{R}^{m\times r}}\|\Pi_{\Omega_t}(R)\Pi_{\Omega_t}(XY_t^{\mathrm T})^{\mathrm T}\|}{\sigma^2}. \tag{51}
\]
Substituting (48) into (51), we get the conclusion.

7.4 Proof of Theorem 4

Lemma 13. Denote $r_s=\inf_t\frac{\|S_t-S_*\|_F^2}{\|S_t-S_*\|^2}$ and $\zeta=\sqrt{\frac{2s\mu r}{mr_s}}$. If $\|L_*-X_t\Sigma_tY_t^{\mathrm T}\|_{\max}\le c_t\|L_*\|\sqrt{\frac{\mu r}{m}}$ for some positive parameter $c_t$, then
\[
\|S_t-S_*\|\le 2c_t\|L_*\|\zeta,\qquad |\gamma_{j,t}-\sigma_{j*}|\le 2c_t\|L_*\|\zeta.
\]

Proof. Using Lemma 11, by simple calculations, we have
\[
\|S_t-S_*\|\le\frac{\|S_t-S_*\|_F}{\sqrt{r_s}}\le\sqrt{\frac{2s}{r_s}}\,\|S_t-S_*\|_{\max}\le 2\sqrt{\frac{2s}{r_s}}\,\|\Pi_\Omega(L_*-X_t\Sigma_tY_t^{\mathrm T})\|_{\max}\le 2c_t\|L_*\|\sqrt{\frac{2s\mu r}{mr_s}}=2c_t\|L_*\|\zeta.
\]
Then by Lemma 2, we know that
\[
|\gamma_{j,t}-\sigma_{j*}|\le\|(M-S_t)-L_*\|=\|S_t-S_*\|\le 2c_t\|L_*\|\zeta.
\]
The proof is completed.
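
The second bound rests on a Weyl-type perturbation step (Lemma 2). Treating $\gamma_{j,t}$ as the $j$-th singular value of $M-S_t$, as the display above suggests, this step can be checked numerically on hypothetical data:

import numpy as np

# Check of the Weyl-type step at the end of Lemma 13, treating gamma_{j,t} as the j-th
# singular value of M - S_t (as the display above suggests); data are hypothetical.
rng = np.random.default_rng(5)
m, n, r = 50, 40, 4
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
L_star = U @ np.diag([9.0, 7.0, 6.0, 5.0]) @ V.T

E = 0.2 * rng.standard_normal((m, n))              # stands in for (M - S_t) - L_* = S_* - S_t
gamma = np.linalg.svd(L_star + E, compute_uv=False)
sigma = np.linalg.svd(L_star, compute_uv=False)
print(np.all(np.abs(gamma - sigma) <= np.linalg.norm(E, 2)))   # Weyl's inequality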


Proof of Theorem 4. First, denote $\hat X_{t+1}=(M-S_t)Y_t$; then we know that $\hat X_{t+1}$ is the solution to $\min_X\|M-S_t-XY_t^{\mathrm T}\|$. Also note that $X_{t+1}$ on line 8 of Algorithm 1 is the solution to $\min_X\|\Pi_{\Omega_t}(M-S_t-XY_t^{\mathrm T})\|$. Then by Theorem 3, we have
\[
\|X_{t+1}-\hat X_{t+1}\|\le C_{\mathrm{LS}}\|(M-S_t)(I-Y_tY_t^{\mathrm T})\|_{\max},\quad\text{w.p. }\ge 0.99.
\]
It then follows from Lemma 1, Lemma 11 and Lemma 13 that
\begin{align*}
\|X_{t+1}-\hat X_{t+1}\| &\le C_{\mathrm{LS}}\big(\|L_*(I-Y_tY_t^{\mathrm T})\|_{\max}+\|(S_t-S_*)(I-Y_tY_t^{\mathrm T})\|_{\max}\big)\\
&\le C_{\mathrm{LS}}\Big(\|L_*\|\sqrt{\tfrac{\mu r}{m}}\,\theta_{y,t}+\|S_t-S_*\|_{2,\infty}\Big)\le C_{\mathrm{LS}}\Big(\|L_*\|\sqrt{\tfrac{\mu r}{m}}\,\theta_{y,t}+\sqrt{2p\varrho n}\,\|S_t-S_*\|_{\max}\Big)\\
&\le C_{\mathrm{LS}}\Big(\|L_*\|\sqrt{\tfrac{\mu r}{m}}\,\theta_{y,t}+2\sqrt{2p\varrho n}\,\|L_*-X_t\Sigma_tY_t^{\mathrm T}\|_{\max}\Big)\le\frac{C}{\sqrt m}\|L_*\|\theta_{y,t}. \tag{52}
\end{align*}

Second, using Lemma 13 and $4c\kappa\zeta<1$, we have
\[
\|S_t-S_*\|<2c\theta_{y,t}\|L_*\|\zeta\le\sqrt 2\,c\|L_*\|\zeta<\frac{\sigma_{r*}}{\sqrt 2}\le\sigma_{r*}\sqrt{1-\theta_{y,t}^2}. \tag{53}
\]
Then by Lemma 8, we know that
\[
\|\sin\Theta(\hat X_{t+1},U_*)\|\le\frac{\|S_t-S_*\|}{\sigma_{r*}\sqrt{1-\theta_{y,t}^2}-\|S_t-S_*\|}. \tag{54}
\]
Using (54), the assumption $\|L_*-X_t\Sigma_tY_t^{\mathrm T}\|_{\max}\le c\|L_*\|\theta_{y,t}\sqrt{\frac{\mu r}{m}}$, Lemma 13 and $\theta_{y,t}\le\frac{1}{\sqrt 2}$, we get
\[
\|\sin\Theta(\hat X_{t+1},U_*)\|\le\frac{2c\|L_*\|\zeta\theta_{y,t}}{\sigma_{r*}/\sqrt 2-2c\|L_*\|\zeta\theta_{y,t}}\le\frac{2\sqrt 2\,c\kappa\zeta\theta_{y,t}}{1-2c\kappa\zeta}<4\sqrt 2\,c\kappa\zeta\theta_{y,t}. \tag{55}
\]
Therefore, using Lemma 1, (52) and (55), we have
\begin{align*}
\theta_{x,t+1} &= \|U_{*,c}^{\mathrm T}\bar X_{t+1}\|=\|U_{*,c}^{\mathrm T}X_{t+1}R_{x,t+1}^{-1}\|\le\|U_{*,c}^{\mathrm T}\hat X_{t+1}R_{x,t+1}^{-1}\|+\|U_{*,c}^{\mathrm T}(X_{t+1}-\hat X_{t+1})R_{x,t+1}^{-1}\|\\
&\le\|R_{x,t+1}^{-1}\|\big(\|\sin\Theta(\hat X_{t+1},U_*)\|\,\|\hat X_{t+1}\|+\|X_{t+1}-\hat X_{t+1}\|\big)\\
&\le\frac{1}{\sigma_r(X_{t+1})}\Big(4\sqrt 2\,c\kappa\zeta\|\hat X_{t+1}\|+\frac{C}{\sqrt m}\|L_*\|\Big)\theta_{y,t}. \tag{56}
\end{align*}
Now using Lemma 2, (52), $\theta_{y,t}\le\frac{1}{\sqrt 2}$ and (53), we have
\begin{align*}
\|\hat X_{t+1}\| &= \|(M-S_t)Y_t\|\le\|L_*Y_t\|+\|S_t-S_*\|\le\|L_*\|+\sqrt 2\,c\|L_*\|\zeta,\\
\sigma_r(X_{t+1}) &\ge \sigma_r(\hat X_{t+1})-\frac{C}{\sqrt m}\|L_*\|\theta_{y,t}\ge\sigma_r((M-S_t)Y_t)-\frac{C}{\sqrt{2m}}\|L_*\|\ge\sigma_r(L_*Y_t)-\|S_t-S_*\|-\frac{C}{\sqrt{2m}}\|L_*\|\\
&\ge\sigma_{r*}\sqrt{1-\theta_{y,t}^2}-\sqrt 2\,c\|L_*\|\zeta-\frac{C}{\sqrt{2m}}\|L_*\|\ge\frac{\sigma_{r*}}{\sqrt 2}-\sqrt 2\,c\|L_*\|\zeta-\frac{C}{\sqrt{2m}}\|L_*\|.
\end{align*}
Substituting them into (56), we get the conclusion.

7.5 Proof of Theorem 5

Lemma 14. Adopt the notation and assumptions of Lemma 1. Then
\[
\|L_*-X_{t+1}R_{x,t+1}Y_t^{\mathrm T}\|_{\max}\le\Big(\big(1+C_{\mathrm{LS}}\sqrt{\tfrac{\mu'r}{n}}\big)\sqrt{\tfrac{\mu r}{m}}+\big(1+C_{\mathrm{LS}}\sqrt{2\varrho pn}\big)\sqrt{\tfrac{\mu'r}{n}}\,2c\zeta\Big)\|L_*\|\theta_{y,t}.
\]


Proof. Direct calculations give rise to
\begin{align*}
\|L_*-X_{t+1}Y_t^{\mathrm T}\|_{\max} &\le \|L_*-(M-S_t)Y_tY_t^{\mathrm T}\|_{\max}+\|(M-S_t)Y_tY_t^{\mathrm T}-X_{t+1}Y_t^{\mathrm T}\|_{\max}\\
&\le \|L_*-(M-S_t)Y_tY_t^{\mathrm T}\|_{\max}+\|(M-S_t)Y_t-X_{t+1}\|\sqrt{\tfrac{\mu'r}{n}} \tag{57a}\\
&\le \|L_*-(M-S_t)Y_tY_t^{\mathrm T}\|_{\max}+C_{\mathrm{LS}}\sqrt{\tfrac{\mu'r}{n}}\,\|(M-S_t)(I-Y_tY_t^{\mathrm T})\|_{\max} \tag{57b}\\
&\le \big(1+C_{\mathrm{LS}}\sqrt{\tfrac{\mu'r}{n}}\big)\|L_*(I-Y_tY_t^{\mathrm T})\|_{\max}+\big(1+C_{\mathrm{LS}}\sqrt{2\varrho pn}\big)\sqrt{\tfrac{\mu'r}{n}}\,\|S_t-S_*\|_{\max}\\
&\le \Big(\big(1+C_{\mathrm{LS}}\sqrt{\tfrac{\mu'r}{n}}\big)\sqrt{\tfrac{\mu r}{m}}+\big(1+C_{\mathrm{LS}}\sqrt{2\varrho pn}\big)\sqrt{\tfrac{\mu'r}{n}}\,2c\zeta\Big)\|L_*\|\theta_{y,t}, \tag{57c}
\end{align*}
where (57a) uses $\|Y_t\|_{2,\infty}\le\sqrt{\frac{\mu'r}{n}}$, (57b) uses Theorem 3, and (57c) uses the SVD of $L_*$, $\|U_*\|_{2,\infty}\le\sqrt{\frac{\mu r}{m}}$ and Lemma 1.

Proof of Theorem 5. First, by Lemma 7, we have
\[
\|M-S_t-X_t\Sigma_tY_t^{\mathrm T}\|\ge\sigma_{r*}\max\Big\{\sqrt{1-\theta_{x,t}^2}\,\theta_{y,t},\ \sqrt{1-\theta_{y,t}^2}\,\theta_{x,t}\Big\}\sqrt{1-\theta_{x,t}^2}\sqrt{1-\theta_{y,t}^2}-\|S_t-S_*\|.
\]
Then using (53), $\theta_{x,t}\le\frac{1}{\sqrt 2}$ and $\theta_{y,t}\le\frac{1}{\sqrt 2}$, we get
\[
\|M-S_t-X_t\Sigma_tY_t^{\mathrm T}\|\ge\frac{\sigma_{r*}}{2\sqrt 2}\theta_{y,t}-2c\|L_*\|\sqrt{\tfrac{\mu r}{m}}\,\theta_{y,t}. \tag{58}
\]

Second, by calculations, we have
\begin{align*}
\|M-S_t-X_{t+1}Y_{t+1}^{\mathrm T}\| &\le \|(I-X_{t+1}X_{t+1}^{\mathrm T})(M-S_t)\|+\|X_{t+1}X_{t+1}^{\mathrm T}(M-S_t)-X_{t+1}Y_{t+1}^{\mathrm T}\|\\
&\le \|(I-X_{t+1}X_{t+1}^{\mathrm T})L_*\|+\|S_t-S_*\|+\|X_{t+1}^{\mathrm T}(M-S_t)-Y_{t+1}^{\mathrm T}\|\\
&\le \|L_*\|\theta_{x,t+1}+2c\|L_*\|\zeta\sqrt{\tfrac{\mu r}{m}}\,\theta_{x,t+1}+\frac{C}{\sqrt m}\|L_*\|\theta_{x,t+1} \tag{59}\\
&\le \Big(1+2c\zeta\sqrt{\tfrac{\mu r}{m}}+\frac{C}{\sqrt m}\Big)\phi\|L_*\|\theta_{y,t}, \tag{60}
\end{align*}
where the first two terms of (59) use Lemma 1 and (53), respectively, and the last term can be obtained similarly to (52), with the help of Lemma 14.

It then follows that
\begin{align*}
\|M-S_{t+1}-X_{t+1}\Sigma_{t+1}Y_{t+1}^{\mathrm T}\| &\le \|M-S_t-X_{t+1}\Sigma_{t+1}Y_{t+1}^{\mathrm T}\| \tag{61a}\\
&\le \Big(1+2c\zeta\sqrt{\tfrac{\mu r}{m}}+\frac{C}{\sqrt m}\Big)\phi\|L_*\|\theta_{y,t} \tag{61b}\\
&\le \frac{\big(1+2c\zeta\sqrt{\frac{\mu r}{m}}+\frac{C}{\sqrt m}\big)\phi\|L_*\|}{\frac{\sigma_{r*}}{2\sqrt 2}-2c\zeta\|L_*\|\sqrt{\frac{\mu r}{m}}}\,\|M-S_t-X_t\Sigma_tY_t^{\mathrm T}\| \tag{61c}\\
&= \psi\|M-S_t-X_t\Sigma_tY_t^{\mathrm T}\|,
\end{align*}
where (61a) uses Lemma 6, (61b) uses (60), and (61c) uses (58). The proof is completed.