the low-rank basis problem for a matrix subspace
TRANSCRIPT
The low-rank basis problem for a matrix subspace
Tasuku Soma (Univ. Tokyo)
Joint work with: Yuji Nakatsukasa (Univ. Tokyo), André Uschmajew (Univ. Bonn)
1 / 29
1 The low-rank basis problem
2 Algorithm
3 Convergence Guarantee
4 Experiments
2 / 29
The low-rank basis problem
Low-rank basis problem: for a matrix subspace M ⊆ R^{m×n} spanned by M_1, . . . , M_d ∈ R^{m×n},

    minimize    rank(X_1) + · · · + rank(X_d)
    subject to  span{X_1, . . . , X_d} = M.

• Generalizes the sparse basis problem:

    minimize    ‖x_1‖_0 + · · · + ‖x_d‖_0
    subject to  span{x_1, . . . , x_d} = S ⊆ R^N.

• Matrix singular values play the role of vector nonzero elements
4 / 29
Scope
[Diagram: basis problems - low-rank basis, sparse basis (Coleman-Pothen 86); single-element problems - low-rank matrix, sparse vector (Qu-Sun-Wright 14)]

• The sparse vector problem is NP-hard [Coleman-Pothen 1986]
• Related studies: dictionary learning [Sun-Qu-Wright 14], sparse PCA [Spielman-Wang-Wright], [Demanet-Hand 14]
5 / 29
Applications
• memory-efficient representation of matrix subspace
• matrix compression beyond SVD
• dictionary learning
• string theory: rank-deficient matrix in rectangular subspace
• image separation
• accurate eigenvector computation
• maximum-rank completion (discrete mathematics)
• ...
6 / 29
1 The low-rank basis problem
2 Algorithm
3 Convergence Guarantee
4 Experiments
7 / 29
Abstract greedy algorithm
Algorithm 1: Greedy meta-algorithm for computing a low-rank basis
Input: Subspace M ⊆ R^{m×n} of dimension d.
Output: Basis B = {X*_1, . . . , X*_d} of M.
  Initialize B = ∅.
  for ℓ = 1, . . . , d do
    Find X*_ℓ ∈ M of lowest possible rank such that B ∪ {X*_ℓ} is linearly independent.
    B ← B ∪ {X*_ℓ}
• If each step is successful, this finds the required basis!
8 / 29
Greedy algorithm: lemma
Lemma
Let X*_1, . . . , X*_d be the output of the greedy algorithm.
For any ℓ ∈ {1, . . . , d} and linearly independent {X_1, . . . , X_ℓ} ⊆ M with rank(X_1) ≤ · · · ≤ rank(X_ℓ),

    rank(X_i) ≥ rank(X*_i)   for i = 1, . . . , ℓ.

Proof.
If rank(X_ℓ) < rank(X*_ℓ), then rank(X_i) < rank(X*_ℓ) for all i ≤ ℓ.
But at least one X_i must be linearly independent from X*_1, . . . , X*_{ℓ−1}, and this contradicts the choice of X*_ℓ. □

(Adaptation of a standard argument from matroid theory.)
9 / 29
Greedy algorithm: justification
Theorem
Let X*_1, . . . , X*_d be the (linearly independent) output of the greedy algorithm. Then {X_1, . . . , X_ℓ} as in the Lemma is of minimal rank iff

    rank(X_i) = rank(X*_i)   for i = 1, . . . , ℓ.

In particular, {X*_1, . . . , X*_ℓ} is of minimal rank.

• An analogous result for the sparse basis problem appears in [Coleman, Pothen 1986]
10 / 29
The single matrix problem
    minimize    rank(X)
    subject to  X ∈ M \ {0}.

• NP-hard, of course (since the sparse vector problem is)

Nuclear norm heuristic (‖A‖_* := Σ_i σ_i(A)):

    minimize    ‖X‖_*
    subject to  X ∈ M, ‖X‖_F = 1.

NOT a convex relaxation, due to the non-convex constraint.
11 / 29
Algorithm Outline (for the single matrix problem)
Phase I: rank estimate

    Y = S_τ(X),   X = P_M(Y) / ‖P_M(Y)‖_F

  (repeat until rank(Y) converges)

Phase II: alternating projection, with the estimated r = rank(Y)

    Y = T_r(X),   X = P_M(Y) / ‖P_M(Y)‖_F
12 / 29
Shrinkage operator
Shrinkage operator (soft thresholding) for X = U Σ V^T:

    S_τ(X) = U S_τ(Σ) V^T,   S_τ(Σ) = diag(σ_1 − τ, . . . , σ_rank(X) − τ)_+

Fixed-point iteration

    Y = S_τ(X),   X = P_M(Y) / ‖P_M(Y)‖_F

Interpretation [Cai, Candès, Shen 2010], [Qu, Sun, Wright @NIPS 2014]:
block coordinate descent (a.k.a. alternating direction) for

    minimize_{X,Y}   τ‖Y‖_* + (1/2)‖Y − X‖_F^2
    subject to       X ∈ M, ‖X‖_F = 1.

[Qu, Sun, Wright @NIPS 2014]: analogous method for the sparsest vector problem.
13 / 29
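To make the Phase I iteration concrete, here is a minimal NumPy sketch (not the authors' code): the subspace M is represented by a list of m×n matrices assumed orthonormal in the Frobenius inner product, and the helper names shrink, proj_M, phase1_rank_estimate and the value of τ are illustrative assumptions.

```python
import numpy as np

def shrink(X, tau):
    """Shrinkage operator S_tau: soft-threshold the singular values of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def proj_M(Y, basis):
    """Orthogonal projection of Y onto M = span(basis).
    `basis` is assumed orthonormal w.r.t. the Frobenius inner product."""
    return sum(np.sum(B * Y) * B for B in basis)

def phase1_rank_estimate(basis, tau=0.2, iters=200):
    """Phase I fixed-point iteration: Y = S_tau(X), X = P_M(Y)/||P_M(Y)||_F.
    Returns the final iterate X and the rank estimate r = rank(Y)."""
    m, n = basis[0].shape
    X = proj_M(np.random.randn(m, n), basis)
    X /= np.linalg.norm(X)
    r = None
    for _ in range(iters):
        Y = shrink(X, tau)
        P = proj_M(Y, basis)
        nrm = np.linalg.norm(P)
        if nrm < 1e-14:        # tau too aggressive; a restart would be needed
            break
        X = P / nrm
        r = np.linalg.matrix_rank(Y, tol=1e-8)
    return X, r
```

The talk iterates until rank(Y) converges; the fixed iteration count above is only a stand-in for that stopping rule.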
The use as a rank estimator

    Y = S_τ(X),   X = P_M(Y) / ‖P_M(Y)‖_F

• The fixed point of Y would be a matrix of low rank r, which is close to, but not in, M if r > 1.
• Otherwise, it would be a fixed point of

    Y = S_τ(Y) / ‖S_τ(Y)‖_F,

  which can hold only for rank-one matrices.
• The fixed point of X usually has full rank, with small but "too large" singular values σ_i ≪ 1.
  ⇒ Need further improvement, but accept r as the rank estimate.
14 / 29
Algorithm Outline (for the single matrix problem)
Phase I: rank estimate

    Y = S_τ(X),   X = P_M(Y) / ‖P_M(Y)‖_F

  (repeat until rank(Y) converges)

Phase II: alternating projection, with the estimated r = rank(Y)

    Y = T_r(X),   X = P_M(Y) / ‖P_M(Y)‖_F
15 / 29
Obtaining solution: truncation operator
Truncation operator (hard thresholding) for X = U Σ V^T:

    T_r(X) = U T_r(Σ) V^T,   T_r(Σ) = diag(σ_1, . . . , σ_r, 0, . . . , 0)

Fixed-point iteration

    Y = T_r(X),   X = P_M(Y) / ‖P_M(Y)‖_F

Interpretation: alternating projection method for finding

    X* ∈ {X ∈ M : ‖X‖_F = 1} ∩ {Y : rank(Y) ≤ r}.
16 / 29
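Analogously, a hedged sketch of Phase II with the truncation operator; proj_M and the orthonormal-basis representation of M are as in the Phase I sketch above, and the stopping test is my own choice.

```python
import numpy as np

def truncate(X, r):
    """Truncation operator T_r: keep only the r largest singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[r:] = 0.0
    return U @ np.diag(s) @ Vt

def phase2(X, basis, r, proj_M, iters=500, tol=1e-12):
    """Alternating projection: Y = T_r(X), X = P_M(Y)/||P_M(Y)||_F,
    stopped once X is numerically of rank r."""
    for _ in range(iters):
        Y = truncate(X, r)
        P = proj_M(Y, basis)
        X = P / np.linalg.norm(P)
        s = np.linalg.svd(X, compute_uv=False)
        if np.linalg.norm(s[r:]) < tol:   # X is (numerically) in the rank-r set
            break
    return X
```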
Greedy algorithm: pseudocode
Algorithm 2: Greedy algorithm for computing a low-rank basis
Input: Basis M_1, . . . , M_d ∈ R^{m×n} for M.
Output: Low-rank basis X_1, . . . , X_d of M.
  for ℓ = 1, . . . , d do
    Phase I on X_ℓ: obtain rank estimate r.
    Phase II on X_ℓ with rank r: obtain X_ℓ ∈ M of rank r.

• To force linear independence, restarting is sometimes necessary:
  X_ℓ is always initialized and restarted in span{X_1, . . . , X_{ℓ−1}}^⊥ ∩ M.
• The Phase I output X is the Phase II input.
17 / 29
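Combining the two phases gives a sketch of Algorithm 2. The Gram-Schmidt orthonormalization of vectorized matrices and the way the working basis is restricted to span{X_1, . . . , X_{ℓ−1}}^⊥ ∩ M are implementation choices of this sketch (it reuses phase1_rank_estimate, phase2 and proj_M from the sketches above), not the authors' code.

```python
import numpy as np

def orthonormalize(mats, tol=1e-12):
    """Gram-Schmidt on matrices w.r.t. the Frobenius inner product."""
    Q = []
    for M in mats:
        V = M - sum(np.sum(B * M) * B for B in Q)
        nrm = np.linalg.norm(V)
        if nrm > tol:
            Q.append(V / nrm)
    return Q

def greedy_lowrank_basis(M_list, tau=0.2):
    """Greedy low-rank basis (sketch): for each l, run Phase I then Phase II
    inside the part of M orthogonal to the elements found so far."""
    basis_M = orthonormalize(M_list)       # orthonormal basis of the subspace M
    found = []
    for _ in range(len(basis_M)):
        prev = orthonormalize(found)       # orthonormal basis of span(found), a subset of M
        work = orthonormalize([B - sum(np.sum(P * B) * P for P in prev) for B in basis_M])
        X, r = phase1_rank_estimate(work, tau=tau)   # Phase I: rank estimate
        X = phase2(X, work, r, proj_M)               # Phase II: alternating projection
        found.append(X)
    return found
```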
1 The low-rank basis problem
2 Algorithm
3 Convergence Guarantee
4 Experiments
18 / 29
Observed convergence (single initial guess)
m = 20, n = 10, d = 5, exact ranks: (1, 2, 3, 4, 5).
• ranks recovered in the "wrong" order (2, 1, 5, 3, 4)
19 / 29
Observed convergence (several initial guesses)
• ranks recovered in the correct order
20 / 29
Local convergence of Phase II
R_r := {X : rank(X) = r},   B := {X ∈ M : ‖X‖_F = 1}
T_{X*}(N): tangent space of the manifold N at X*

Theorem
Assume X* ∈ R_r ∩ B has rank r, is the input of Phase II, and T_{X*}R_r ∩ T_{X*}B = {0}. Then Phase II is locally linearly convergent:

    ‖X_new − X*‖_F / ‖X − X*‖_F ≲ cos θ

• Follows from a meta-theorem on alternating projections in nonlinear optimization [Lewis, Luke, Malick 2009]
• We provide a "direct" linear algebra proof
• The assumption holds if X* is an isolated rank-r matrix in M
21 / 29
Local convergence: intuition
[Figures: two sketches of the tangent spaces T_{X*}B and T_{X*}R_r at X*, one with cos θ ≈ 1/√2 and one with cos θ ≈ 0.9]

    ‖X_new − X*‖_F / ‖X − X*‖_F ≤ cos θ + O(‖X − X*‖_F^2)

θ ∈ (0, π/2]: subspace angle between T_{X*}B and T_{X*}R_r,

    cos θ = max_{X ∈ T_{X*}B, Y ∈ T_{X*}R_r}  |⟨X, Y⟩_F| / (‖X‖_F ‖Y‖_F).
22 / 29
1 The low-rank basis problem
2 Algorithm
3 Convergence Guarantee
4 Experiments
23 / 29
Results for synthetic data
exact ranks          | av. sum(ranks) | av. Phase I err (iter) | av. Phase II err (iter)
(1, 1, 1, 1, 1)      | 5.05           | 2.59e-14 (55.7)        | 7.03e-15 (0.4)
(2, 2, 2, 2, 2)      | 10.02          | 4.04e-03 (58.4)        | 1.04e-14 (9.11)
(1, 2, 3, 4, 5)      | 15.05          | 6.20e-03 (60.3)        | 1.38e-14 (15.8)
(5, 5, 5, 10, 10)    | 35.42          | 1.27e-02 (64.9)        | 9.37e-14 (50.1)
(5, 5, 10, 10, 15)   | 44.59          | 2.14e-02 (66.6)        | 3.96e-05 (107)

Table: m = n = 20, d = 5, random initial guess.

exact ranks          | av. sum(ranks) | av. Phase I err (iter) | av. Phase II err (iter)
(1, 1, 1, 1, 1)      | 5.00           | 6.77e-15 (709)         | 6.75e-15 (0.4)
(2, 2, 2, 2, 2)      | 10.00          | 4.04e-03 (393)         | 9.57e-15 (9.0)
(1, 2, 3, 4, 5)      | 15.00          | 5.82e-03 (390)         | 1.37e-14 (18.5)
(5, 5, 5, 10, 10)    | 35.00          | 1.23e-02 (550)         | 3.07e-14 (55.8)
(5, 5, 10, 10, 15)   | 44.20          | 2.06e-02 (829)         | 8.96e-06 (227)

Table: Five random initial guesses.
24 / 29
Image separation
[Figures: original, mixed, and computed images]
25 / 29
Link to tensor decomposition
Rank-one basis: M = span{a_1 b_1^T, . . . , a_d b_d^T}

If M_1, . . . , M_d is any basis, then

    M_k = Σ_{ℓ=1}^{d} c_{kℓ} a_ℓ b_ℓ^T   (k = 1, . . . , d)   ⟺   T = Σ_{ℓ=1}^{d} a_ℓ ◦ b_ℓ ◦ c_ℓ,

where T is the third-order tensor with slices M_k.  ⇒ rank(T) = d.

This suggests finding a rank-one basis using CP decomposition algorithms (ALS, generalized eigenvalue, ...),
but CP is not enough for the higher-rank case.
26 / 29
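A quick NumPy check of the equivalence above on synthetic data (my own verification, not from the talk): build slices M_k from a rank-one basis and confirm that the unfoldings of the stacked tensor T have rank d.

```python
import numpy as np

m, n, d = 8, 6, 4
rng = np.random.default_rng(0)
a = rng.standard_normal((m, d))
b = rng.standard_normal((n, d))
C = rng.standard_normal((d, d))              # coefficients c_{k,l}

# slices M_k = sum_l c_{k,l} a_l b_l^T, stacked into the third-order tensor T
T = np.stack([sum(C[k, l] * np.outer(a[:, l], b[:, l]) for l in range(d))
              for k in range(d)], axis=2)    # shape (m, n, d)

# rank(T) = d is consistent with every unfolding having rank d (here d <= m, n)
print(np.linalg.matrix_rank(T.reshape(m * n, d)))                      # mode-3 unfolding
print(np.linalg.matrix_rank(T.transpose(0, 2, 1).reshape(m * d, n)))   # mode-2 unfolding
```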
Comparison with CP for rank-1 basis
[Figure: runtime (s) and error vs. matrix size n, comparing Phase I, Phase II, and Tensorlab]
d = 10, varying m = n between 50 and 500.
27 / 29
Comparison with CP for rank-1 basis
[Figure: runtime (s) and error vs. dimension d, comparing Phase I, Phase II, and Tensorlab]
m = n = 10, varying d between 2 and 20.
28 / 29
Summary and outlook
Summary
• low-rank basis problem
• lots of applications
• NP-hard
• introduced practical greedy algorithm

Future work
• further probabilistic analysis of Phase I
• finding low-rank tensor basis
29 / 29
Application: matrix compression
• classical SVD compression: achieves ≈ r/n compression

    A ≈ U Σ V^T

• further, suppose m = st and reshape the ith column of U:

    (u_{i,1}, u_{i,2}, . . . , u_{i,s}, . . . , u_{i,st})^T   →
        [ u_{i,1}   u_{i,s+1}   · · ·   u_{i,s(t−1)+1} ]
        [    ⋮          ⋮         ⋱            ⋮       ]
        [ u_{i,s}   u_{i,2s}    · · ·   u_{i,st}       ]
    ≈ U_{u,i} Σ_{u,i} V_{u,i}^T

• Compression: ≈ r_1 r_2 / n^2, a "squared" reduction
30 / 29
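The reshaping step above can be sketched in a few lines of NumPy; the column-major reshape, the truncation tolerance, and the rank-one test vector are illustrative assumptions, not the talk's setup.

```python
import numpy as np

def compress_column(u, s, t, tol=1e-10):
    """Reshape a length-(s*t) vector into an s x t matrix (consecutive entries
    go down a column, as on the slide) and return its truncated SVD factors."""
    Umat = u.reshape(t, s).T                       # s x t matricization
    U2, sig, V2t = np.linalg.svd(Umat, full_matrices=False)
    r2 = int(np.sum(sig > tol * sig[0]))           # numerical rank of the reshape
    return U2[:, :r2], sig[:r2], V2t[:r2, :]

# example: a column that matricizes to a rank-1 matrix
s, t = 8, 8
x, y = np.random.randn(s), np.random.randn(t)
u = np.outer(x, y).T.reshape(-1)                   # vec of a rank-1 s x t matrix
u /= np.linalg.norm(u)
U2, sig, V2t = compress_column(u, s, t)
print("second-level rank:", len(sig))              # 1 => store s+t numbers, not s*t
```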
Finding the right "basis" for storage reduction

U = [U_1, . . . , U_r]; ideally, each column U_i = [u_{i,1}, u_{i,2}, . . . , u_{i,st}]^T has low-rank structure when matricized:

    (u_{i,1}, u_{i,2}, . . . , u_{i,s}, . . . , u_{i,st})^T   →
        [ u_{i,1}   u_{i,s+1}   · · ·   u_{i,s(t−1)+1} ]
        [    ⋮          ⋮         ⋱            ⋮       ]
        [ u_{i,s}   u_{i,2s}    · · ·   u_{i,st}       ]
    ≈ U_{u,i} V_{u,i}^T   for i = 1, . . . , r

• More realistically: there exists Q ∈ R^{r×r} such that UQ has this property
  ⇒ finding a low-rank basis for the matrix subspace spanned by mat(U_1), mat(U_2), . . . , mat(U_r)
31 / 29
Compressible matrices
• FFT matrix (entries e^{2πij/n}): each mat(column) is rank-1
• Any circulant matrix eigenspace
• Graph Laplacian eigenvectors:
  • rank 2: ladder, circular
  • rank 3: binary tree, cycle, path, wheel
  • rank 4: lollipop
  • rank 5: barbell
  • ...
⇒ explanation is an open problem; fast FFT-like algorithms?
32 / 29
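A small check of the first bullet (assuming the entries of the FFT matrix are e^{2πi jk/n}, with i the imaginary unit): every column, matricized column-by-column into an s×t matrix with n = st, has rank one.

```python
import numpy as np

n, s, t = 64, 8, 8                       # n = s * t
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
F = np.exp(2j * np.pi * j * k / n)       # DFT ("FFT") matrix

for col in range(n):
    M = F[:, col].reshape(t, s).T        # matricize the column (column-major, s x t)
    assert np.linalg.matrix_rank(M, tol=1e-8) == 1
print("every matricized DFT column has rank 1")
```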
Eigenvectors of multiple eigenvalues
    A [x_1, x_2, . . . , x_k] = λ [x_1, x_2, . . . , x_k]

• an eigenvector x for Ax = λx is not determined uniquely + non-differentiable
• numerical practice: be content with computing span(x_1, x_2)
• extreme example: I, any vector is an eigenvector!
• but perhaps e_1, e_2, . . . (the columns of I, each with a single 1) is a "good" set of eigenvectors
• why? low-rank, low-memory!
33 / 29
Computing eigenvectors of multiple eigenvalues
F: FFT matrix

    A = (1/n) F diag(1+ε, 1+ε, 1+ε, 1+ε, 1+ε, 6, . . . , n^2) F*,   ε = O(10^{−10})

• cluster of five eigenvalues ≈ 1
• "exact", low-rank eigenvectors: the first five columns of F.

                  v1        v2        v3        v4        v5        memory
    eig           4.2e-01   1.2e+00   1.4e+00   1.4e+00   1.5e+00   O(n^2)
    eig + Alg. 2  1.2e-12   1.2e-12   1.2e-12   1.2e-12   2.7e-14   O(n)

Table: Eigenvector accuracy, before and after finding low-rank basis.

• MATLAB's eig fails to find accurate eigenvectors.
• accurate eigenvectors obtained by finding low-rank basis.
34 / 29
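The point of this slide can be reproduced loosely in NumPy (this is not the talk's experiment: it uses numpy's Hermitian eigensolver instead of MATLAB's eig, and the well-separated eigenvalues 6, . . . , n^2 are filled in by an assumed linspace). The computed eigenvectors of the clustered eigenvalue are an essentially arbitrary mixture within the eigenspace, so they typically matricize to matrices of rank up to 5, whereas the exact eigenvectors (DFT columns) matricize to rank-1 matrices.

```python
import numpy as np

n, s, t = 64, 8, 8
j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
F = np.exp(2j * np.pi * j * k / n) / np.sqrt(n)      # unitary DFT matrix
eps = 1e-10
d = np.concatenate((np.full(5, 1 + eps), np.linspace(6, n**2, n - 5)))
A = F @ np.diag(d) @ F.conj().T                       # Hermitian, five eigenvalues ~ 1

w, V = np.linalg.eigh(A)
cluster = V[:, np.argsort(np.abs(w - 1))[:5]]         # computed basis of the cluster

rank_of = lambda v: np.linalg.matrix_rank(v.reshape(t, s).T, tol=1e-6)
print([rank_of(cluster[:, i]) for i in range(5)])     # typically > 1 (mixed basis)
print([rank_of(F[:, i]) for i in range(5)])           # exact eigenvectors: all 1
```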
Restart
Algorithm 3: Restart for linear independence
Input: Orthogonal projection Q_{ℓ−1} onto M ∩ span(X*_1, . . . , X*_{ℓ−1})^⊥, the current matrix X_ℓ, and a tolerance restarttol > 0.
Output: Possibly replaced X_ℓ.
  if ‖Q_{ℓ−1}(X_ℓ)‖_F < restarttol (e.g. 0.01) then
    Replace X_ℓ by a random element in Range(Q_{ℓ−1}); X_ℓ ← X_ℓ / ‖X_ℓ‖_F

• Linear dependence is monitored by the projected norm ‖Q_{ℓ−1}(X_ℓ)‖_F
35 / 29
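A minimal sketch of the restart test, using the same matrix-list representation of M as in the earlier sketches; the helper name and the random replacement draw are mine, while restarttol = 0.01 is the value suggested on the slide.

```python
import numpy as np

def maybe_restart(X_l, Q_basis, restarttol=0.01, rng=np.random.default_rng()):
    """Q_basis: orthonormal basis (list of matrices) of M ∩ span(X*_1,...,X*_{l-1})^perp.
    If X_l has (numerically) drifted into the span of the previously found basis
    elements, replace it by a random element of Range(Q_{l-1}) and renormalize."""
    proj = sum(np.sum(B * X_l) * B for B in Q_basis)   # Q_{l-1}(X_l)
    if np.linalg.norm(proj) < restarttol:
        X_l = sum(rng.standard_normal() * B for B in Q_basis)
        X_l /= np.linalg.norm(X_l)
    return X_l
```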
Linear algebra convergence proof

    T_{X*}R_r = { [U* U*^⊥] [A B; C 0] [V* V*^⊥]^T }.   (1)

    X − X* = E + O(‖X − X*‖_F^2)

with E ∈ T_{X*}B. Write E = [U* U*^⊥] [A B; C D] [V* V*^⊥]^T. By (1), ‖D‖_F^2 ≥ sin^2 θ · ‖E‖_F^2. There exist orthogonal F, G such that

    X = F^T [Σ* + A, 0; 0, D] G^T + O(‖E‖_F^2).

Then

    ‖T_r(X) − X*‖_F = ‖ [A + Σ*, 0; 0, 0] − [Σ*, −B; −C, C Σ*^{−1} B] ‖_F + O(‖E‖_F^2)
                    = ‖ [A, B; C, 0] ‖_F + O(‖E‖_F^2).

So

    ‖T_r(X) − X*‖_F / ‖X − X*‖_F = ( √(‖E‖_F^2 − ‖D‖_F^2) + O(‖X − X*‖_F^2) ) / ( ‖E‖_F + O(‖X − X*‖_F^2) )
                                  ≤ √(1 − sin^2 θ) + O(‖X − X*‖_F^2) = cos θ + O(‖X − X*‖_F^2).
36 / 29
Partial result for Phase I
Corollary
If r = 1, then already Phase I is locally linearly convergent.

Proof.
In a neighborhood of a rank-one matrix, shrinkage and truncation are the same up to normalization. □

• In general, Phase I "sieves out" the non-dominant components to reveal the rank
37 / 29
String theory problem
Given A_1, . . . , A_k ∈ R^{m×n}, find c_i ∈ R such that

    rank(c_1 A_1 + c_2 A_2 + · · · + c_k A_k) < n

    [A_1 A_2 . . . A_k] Q = 0,   Q = [q_1, . . . , q_{nk−m}]

• finding a null vector that is rank-one when matricized
• ⇒ the lowest-rank problem for the subspace spanned by mat(q_1), . . . , mat(q_{nk−m}) ∈ R^{n×k}; the string theory problem is solved when the lowest rank found is 1
• slow (or non-)convergence in practice
• NP-hard? probably... but unsure (since it is a special case)
38 / 29
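A sketch of the reduction on this slide: stack [A_1 . . . A_k], take a null-space basis Q from an SVD, and matricize its columns to get the n×k subspace in which a rank-one element is then sought (e.g. with the Phase I/II sketches above). The synthetic A_i are rigged so that a rank-deficient combination exists; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 12, 4, 5                               # nk > m, so the null space is nontrivial
A = [rng.standard_normal((m, n)) for _ in range(k)]
c, x = rng.standard_normal(k), rng.standard_normal(n)
# rig the data: make (sum_i c_i A_i) x = 0, i.e. rank(sum_i c_i A_i) < n
r = sum(ci * Ai for ci, Ai in zip(c, A)) @ x
A[0] -= np.outer(r, x) / (c[0] * (x @ x))

S = np.hstack(A)                                  # [A_1 A_2 ... A_k], size m x nk
_, sv, Vt = np.linalg.svd(S)
Q = Vt[m:].T                                      # null-space basis, size nk x (nk - m)
print("null-space residual:", np.linalg.norm(S @ Q))

# matricized null vectors span the subspace of R^{n x k} that contains the
# rank-one element x c^T corresponding to the singular combination above
mats = [Q[:, i].reshape(k, n).T for i in range(Q.shape[1])]
print(len(mats), "matrices of shape", mats[0].shape)
```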