TRANSCRIPT
Computing Nonnegative Matrix Factorizations
Nicolas Gillis
Joint work with François Glineur, Robert Luce, Stephen Vavasis, Arnaud Vandaele, Jérémy Cohen
Where is Mons?
Nonnegative Matrix Factorization (NMF)
Given a matrix M ∈ R^{p×n}_+ and a factorization rank r ≪ min(p, n), find U ∈ R^{p×r} and V ∈ R^{r×n} such that

min_{U≥0, V≥0} ||M − UV||²_F = ∑_{i,j} (M − UV)²_{ij}.    (NMF)

NMF is a linear dimensionality reduction technique for nonnegative data:

M(:, i) ≈ ∑_{k=1}^r U(:, k) V(k, i) for all i,   with M(:, i) ≥ 0, U(:, k) ≥ 0 and V(k, i) ≥ 0.

Why nonnegativity?
→ Interpretability: nonnegativity constraints lead to easily interpretable factors (and a sparse and part-based representation).
→ Many applications: image processing, text mining, hyperspectral unmixing, community detection, clustering, etc.
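To make the model concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the data matrix below is an arbitrary illustrative example, not data from the talk) that fits a rank-r NMF and reports the relative Frobenius error.

import numpy as np
from sklearn.decomposition import NMF  # minimizes ||M - UV||_F^2 with U, V >= 0

rng = np.random.default_rng(0)
M = rng.random((20, 3)) @ rng.random((3, 30))   # synthetic nonnegative data of rank 3

r = 3
model = NMF(n_components=r, init='nndsvda', max_iter=500, random_state=0)
U = model.fit_transform(M)   # U is 20-by-3, nonnegative
V = model.components_        # V is 3-by-30, nonnegative
print(np.linalg.norm(M - U @ V, 'fro') / np.linalg.norm(M, 'fro'))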
Example 1: Blind hyperspectral unmixing
Figure: Urban hyperspectral image, 162 spectral bands and 307-by-307 pixels.
Problem. Identify the materials and classify the pixels.
Linear mixing model
Example 1: Blind hyperspectral unmixing with NMF
Basis elements allow recovering the different endmembers: U ≥ 0;
Abundances of the endmembers in each pixel: V ≥ 0.
Urban hyperspectral image
Figure: Decomposition of the Urban dataset.
Example 2: topic recovery and document classification
Basis elements allow recovering the different topics;
Weights allow assigning each text to its corresponding topics.
Example 3: feature extraction and classification
The basis elements extract facial features such as eyes, nose and lips.
Outline
1 Computational complexity
2 Standard non-linear optimization schemes and acceleration
3 Exact NMF (M = UV ) and its geometric interpretation
4 NMF under the separability assumption
Computational Complexity of NMF
Complexity of NMF
min_{U ∈ R^{p×r}, V ∈ R^{r×n}} ||M − UV||²_F such that U ≥ 0, V ≥ 0.

For r = 1, the problem is polynomially solvable via the Eckart-Young and Perron-Frobenius theorems.

Checking whether there exists an exact factorization M = UV is NP-hard (Vavasis, 2009) when p, n and r are not fixed.

Using quantifier elimination (reformulation with a fixed number of variables):
Cohen and Rothblum [1991]: (mn)^{O(mr+nr)}, non-polynomial.
Arora et al. [2012]: (mn)^{O(2^r)}, polynomial for fixed r.
Moitra [2013]: (mn)^{O(r²)}, polynomial for fixed r.
→ not really useful in practice . . .

This does not imply that rank+ (the minimum r such that M = UV) can be computed in polynomial time (because there is no upper bound on rank+).
Complexity for other norms
min_{u ∈ R^p, v ∈ R^n} ||M − uv^T||_1 = ∑_{i,j} |M_{ij} − u_i v_j|.    (ℓ1 norm)

If M is binary, M ∈ {0, 1}^{p×n}, any optimal solution (u*, v*) can be assumed to be binary, that is, (u*, v*) ∈ {0, 1}^p × {0, 1}^n.

min_{u ∈ R^p, v ∈ R^n} ||M − uv^T||²_W = ∑_{i,j} W_{ij} (M − uv^T)²_{ij},    (weighted ℓ2 norm)

where W is a nonnegative weight matrix. This model can be used when
data is missing (W_{ij} = 0 for missing entries),
entries have different variances (W_{ij} = 1/σ²_{ij}).

Both rank-one problems are NP-hard:
G., Vavasis, On the Complexity of Robust PCA and ℓ1-Norm Low-Rank Matrix Approximation, Mathematics of Operations Research, 2018.
G., Glineur, Low-Rank Matrix Approximation with Weights or Missing Data is NP-hard, SIAM J. Matrix Anal. Appl., 2011.
NMF Algorithms and Acceleration
NMF Algorithms
Given a matrix M ∈ R^{m×n}_+ and a factorization rank r ∈ N:

min_{U ∈ R^{m×r}_+, V ∈ R^{r×n}_+} ||M − UV||²_F = ∑_{i,j} (M − UV)²_{ij}.    (NMF)

This is a difficult non-linear optimization problem with potentially many local minima.

Standard framework:
0. Initialize (U, V). Then, alternately update U and V:
1. Update V ≈ argmin_{X≥0} ||M − UX||²_F.    (NNLS)
2. Update U ≈ argmin_{Y≥0} ||M − YV||²_F.    (NNLS)

Most NMF algorithms come with no guarantees (except convergence to stationary points).

The solution is in general highly non-unique: identifiability issues.
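A minimal sketch of this alternating framework, assuming NumPy and SciPy: each NNLS subproblem is solved column by column with scipy.optimize.nnls (exactly rather than approximately, which is fine for small examples).

import numpy as np
from scipy.optimize import nnls  # nonnegative least squares: min_{x >= 0} ||Ax - b||_2

def nnls_columns(A, B):
    """Solve min_{X >= 0} ||B - A X||_F^2 column by column."""
    return np.column_stack([nnls(A, B[:, j])[0] for j in range(B.shape[1])])

def alternating_nnls(M, r, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    U, V = rng.random((M.shape[0], r)), rng.random((r, M.shape[1]))
    for _ in range(n_iter):
        V = nnls_columns(U, M)        # step 1: update V with U fixed
        U = nnls_columns(V.T, M.T).T  # step 2: update U with V fixed
    return U, V

rng = np.random.default_rng(1)
M = rng.random((30, 3)) @ rng.random((3, 40))
U, V = alternating_nnls(M, r=3)
print(np.linalg.norm(M - U @ V, 'fro') / np.linalg.norm(M, 'fro'))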
Block coordinate descent method
Use block-coordinate descent on the NNLS subproblems
→ closed-form solutions for the columns of U and the rows of V:

U*_{:k} = argmin_{U_{:k} ≥ 0} ||R_k − U_{:k} V_{k:}||²_F = max(0, R_k V_{k:}^T / ||V_{k:}||²_2) for all k,

where R_k := M − ∑_{j≠k} U_{:j} V_{j:}, and similarly for V.
This is the so-called HALS algorithm (a code sketch is given below).

It can be accelerated:
1 Gauss-Seidel coordinate descent (Hsieh, Dhillon, 2011).
2 Loop several times over the columns of U / rows of V to perform more iterations at a lower computational cost (Glineur, G., 2012).
3 Randomized shuffling (Chow, Wu, Yin, 2017).
4 Use an extrapolation step: W^{(k+1)} = W^{(k+1)} + β_k (W^{(k+1)} − W^{(k)}) (Ang, G., 2018).
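A minimal NumPy sketch of one HALS sweep over the columns of U (and, by symmetry, over the rows of V), following the closed-form update above; it is written for clarity rather than speed (an efficient implementation precomputes MV^T and VV^T).

import numpy as np

def hals_update_U(M, U, V):
    """One HALS sweep over the columns of U using the closed-form nonnegative update."""
    for k in range(U.shape[1]):
        # Residual without component k: R_k = M - sum_{j != k} U[:, j] V[j, :]
        Rk = M - U @ V + np.outer(U[:, k], V[k, :])
        denom = V[k, :] @ V[k, :]
        if denom > 0:
            U[:, k] = np.maximum(0, Rk @ V[k, :] / denom)
    return U

def hals_update_V(M, U, V):
    """Symmetric sweep over the rows of V (M^T is approximately V^T U^T)."""
    return hals_update_U(M.T, V.T.copy(), U.T).T

rng = np.random.default_rng(0)
M = rng.random((30, 3)) @ rng.random((3, 40))
U, V = rng.random((30, 3)), rng.random((3, 40))
for _ in range(200):
    U = hals_update_U(M, U, V)
    V = hals_update_V(M, U, V)
print(np.linalg.norm(M - U @ V, 'fro') / np.linalg.norm(M, 'fro'))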
Illustration on the CBCL face image data set
Exact NMF: Geometry and Extended Formulations
Geometric interpretation of exact NMF
Given M = UV, one can scale M and U so that they become column stochastic, implying that V is column stochastic:

M = UV ⇐⇒ M' = M D_M = (U D_U)(D_U^{-1} V D_M) = U'V'.

The columns of M are convex combinations of the columns of U:

M_{:j} = ∑_{i=1}^k U_{:i} V_{ij}  with  ∑_{i=1}^k V_{ij} = 1 for all j,  V_{ij} ≥ 0 for all i, j.

In other terms,

conv(M) ⊆ conv(U) ⊆ S_n,

where conv(X) is the convex hull of the columns of X, and S_n = {x ∈ R^n | x ≥ 0, ∑_{i=1}^n x_i = 1} is the unit simplex.

Exact NMF ≡ find r points whose convex hull is nested between two given polytopes.
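A small NumPy sketch of this column-stochastic rescaling (assuming M = UV has no zero columns): divide the columns of M and U by their sums and rescale V accordingly; V' is then nonnegative with columns summing to one.

import numpy as np

def column_stochastic_scaling(U, V):
    """Given M = U V with U, V >= 0 and no zero columns, return (M', U', V') with
    M' = M D_M and U' = U D_U column stochastic, and V' = D_U^{-1} V D_M."""
    M = U @ V
    d_U = U.sum(axis=0)   # column sums of U (D_U scales by their inverses)
    d_M = M.sum(axis=0)   # column sums of M (D_M scales by their inverses)
    Up, Mp = U / d_U, M / d_M
    Vp = (d_U[:, None] * V) / d_M[None, :]
    assert np.allclose(Up @ Vp, Mp) and np.allclose(Vp.sum(axis=0), 1.0)
    return Mp, Up, Vp

rng = np.random.default_rng(0)
Mp, Up, Vp = column_stochastic_scaling(rng.random((5, 3)), rng.random((3, 8)))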
Geometric interpretation of NMF
Example: Two nested hexagons (rank(Ma) = 3)
Ma = (1/a) ×
[ 1      a      2a−1   2a−1   a      1
  1      1      a      2a−1   2a−1   a
  a      1      1      a      2a−1   2a−1
  2a−1   a      1      1      a      2a−1
  2a−1   2a−1   a      1      1      a
  a      2a−1   2a−1   a      1      1 ],   a > 1.
Case 1: a = 2, rank+(Ma) = 3, col(M) = col(U).
Figure: ∆p ∩ col(M2), conv(M2) and conv(U).
Case 2: a = 3, rank+(Ma) = 4, col(M) = col(U).
Figure: ∆p ∩ col(M3), conv(M3) and conv(U).
Case 3: a → +∞, rank+(Ma) = 5, col(M) ≠ col(U).
An amazing result: NMF and extended formulations
Let P be a polytope

P = {x ∈ R^k | b_i − A(i, :) x ≥ 0 for 1 ≤ i ≤ m},

and let the v_j's (1 ≤ j ≤ n) be its vertices.

We define the m-by-n slack matrix SP of P as follows:

SP(i, j) = b_i − A(i, :) v_j ≥ 0,   1 ≤ i ≤ m, 1 ≤ j ≤ n.

The hexagon:

SP =
[ 0 1 2 2 1 0
  0 0 1 2 2 1
  1 0 0 1 2 2
  2 1 0 0 1 2
  2 2 1 0 0 1
  1 2 2 1 0 0 ]
An amazing result: NMF and extended formulations
An extended formulation of P is a higher-dimensional polyhedron Q ⊆ R^{k+p} that (linearly) projects onto P. The minimum number of facets of such a polyhedron is called the extension complexity xp(P) of P.

Theorem (Yannakakis, 1991). rank+(SP) = xp(P).

Proof (one direction). Given P = {x ∈ R^k | b − Ax ≥ 0}, any exact NMF SP = UV, U ≥ 0, V ≥ 0, provides an explicit extended formulation (with some redundant equalities) of P:

P = {x | b − Ax ≥ 0} = {x | b − Ax = Uy for some y ≥ 0}.

Remark. The slack matrix SP of P satisfies

conv(SP) = Sm ∩ col(SP).

To get a small factorization, we need to go to a higher-dimensional space: rank(U) > rank(M).
The Hexagon
SP =
[ 0 1 2 2 1 0
  0 0 1 2 2 1
  1 0 0 1 2 2
  2 1 0 0 1 2
  2 2 1 0 0 1
  1 2 2 1 0 0 ]
=
[ 1 0 0 1/2 0
  0 1 0 1   0
  0 0 1 1/2 0
  0 0 1 0   1/2
  0 1 0 0   1
  1 0 0 0   1/2 ]
×
[ 0 1 2 1 0 0
  0 0 1 0 0 1
  1 0 0 0 1 2
  0 0 0 2 2 0
  2 2 0 0 0 0 ],
with
rank(SP) = 3 ≤ rank+(SP) = 5 ≤ min(m, n) = 6.
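A quick NumPy check of this exact rank-5 factorization (the matrices are exactly those above): it verifies SP = UV, that both factors are nonnegative, and that the usual rank of SP is only 3.

import numpy as np

SP = np.array([[0, 1, 2, 2, 1, 0],
               [0, 0, 1, 2, 2, 1],
               [1, 0, 0, 1, 2, 2],
               [2, 1, 0, 0, 1, 2],
               [2, 2, 1, 0, 0, 1],
               [1, 2, 2, 1, 0, 0]], dtype=float)

U = np.array([[1, 0, 0, 0.5, 0],
              [0, 1, 0, 1,   0],
              [0, 0, 1, 0.5, 0],
              [0, 0, 1, 0,   0.5],
              [0, 1, 0, 0,   1],
              [1, 0, 0, 0,   0.5]])

V = np.array([[0, 1, 2, 1, 0, 0],
              [0, 0, 1, 0, 0, 1],
              [1, 0, 0, 0, 1, 2],
              [0, 0, 0, 2, 2, 0],
              [2, 2, 0, 0, 0, 0]], dtype=float)

assert np.allclose(SP, U @ V)             # exact NMF with r = 5
assert (U >= 0).all() and (V >= 0).all()  # nonnegative factors
print(np.linalg.matrix_rank(SP))          # 3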
Some implications
Problem: limits of LP for solving combinatorial problems: given a polytope, what is the most compact way to represent it?
Its extension complexity = the nonnegative rank of its slack matrix.
Key tool: lower bound techniques for the nonnegative rank.

Ex. The matching problem cannot be solved via a polynomial-size LP.
Rothvoss (2014). The matching polytope has exponential extension complexity, STOC.

This can be generalized to:
approximations (no poly-size LP can approximate these problems up to some precision);
Braun, Fiorini, Pokutta & Steurer (2012). Approximation limits of linear programs (beyond hierarchies), FOCS.
any convex cone, in particular PSD (the so-called PSD-rank).
See the survey: Fawzi, Gouveia, Parrilo, Robinson & Thomas, Positive semidefinite rank, Mathematical Programming, 2015.
Exact NMF computation and regular n-gons
Can we use numerical solvers to get insight into these problems? Yes!

We have developed a library to compute exact NMFs of small matrices using meta-heuristics.
[V14] Vandaele, G., Glineur & D. Tuyttens, Heuristics for Exact NMF (2014).
Extension complexity of the octagon?
rank(SP) = 3 ≤ rank+(SP) = 6 ≤ min(m, n) = 8.
Exact NMF computation and regular n-gons
We observed a special structure in the solutions for regular n-gons, leading to the best known upper bound and closing the gap for some n-gons:

rank+(Sn) ≤ 2⌈log2(n)⌉ − 1   for 2^{k−1} < n ≤ 2^{k−1} + 2^{k−2},
rank+(Sn) ≤ 2⌈log2(n)⌉       for 2^{k−1} + 2^{k−2} < n ≤ 2^k.

[V15] Vandaele, G. & Glineur, On the Linear Extension Complexity of Regular n-gons (2015).

Implication: conic quadratic programming is 'polynomially reducible' to linear programming.
[BTN01] Ben-Tal and Nemirovski (2001). On polyhedral approximations of the second-order cone. Mathematics of Operations Research, 26(2), 193-205.
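A tiny helper (a sketch using Python's math module) that evaluates this upper bound for a given n; note that k = ⌈log2(n)⌉ when 2^{k−1} < n ≤ 2^k.

import math

def ngon_rank_plus_upper_bound(n: int) -> int:
    """Upper bound on rank+(Sn) for the regular n-gon (n >= 3), as stated above."""
    k = math.ceil(math.log2(n))              # 2^(k-1) < n <= 2^k
    if n <= 2 ** (k - 1) + 2 ** (k - 2):     # first regime
        return 2 * k - 1
    return 2 * k                             # second regime

print(ngon_rank_plus_upper_bound(6))   # hexagon: 5
print(ngon_rank_plus_upper_bound(8))   # octagon: 6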
NMF under the separability assumption
Separability Assumption
Separability of M: there exists an index set K with |K| = r and V ≥ 0 such that

M = M(:, K) V,   where U = M(:, K) ≥ 0.

[AGKM12] Arora, Ge, Kannan, Moitra, Computing a Nonnegative Matrix Factorization – Provably, STOC 2012.
Applications
In hyperspectral imaging, this is the pure-pixel assumption: for each material, there is a 'pure' pixel containing only that material.
[M+14] Ma et al., A Signal Processing Perspective on Hyperspectral Unmixing: Insights from Remote Sensing, IEEE Signal Processing Magazine 31(1):67-81, 2014.

In document classification: for each topic, there is a 'pure' word used only by that topic (an 'anchor' word).
[A+13] Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees, ICML 2013.

Time-resolved Raman spectra analysis: each substance has a peak in its spectrum while the other spectra are (close to) zero.
[L+16] Luce et al., Using Separable Nonnegative Matrix Factorization for the Analysis of Time-Resolved Raman Spectra, Appl Spectrosc. 2016.

Others: video summarization, foreground-background separation.
[ESV12] Elhamifar, Sapiro, Vidal, See all by looking at a few: Sparse modeling for finding representative objects, CVPR 2012.
[KSK13] Kumar, Sindhwani, Near-separable Non-negative Matrix Factorization with ℓ1- and Bregman Loss Functions, SIAM Data Mining 2015.
Geometric Interpretation
The columns of U are the vertices of the convex hull of the columns of M:

M(:, j) = ∑_{k=1}^r U(:, k) V(k, j) for all j, where ∑_{k=1}^r V(k, j) = 1, V ≥ 0.
Geometric Interpretation with Noise
The columns of U are the vertices of the convex hull of the columns of M:

M(:, j) ≈ ∑_{k=1}^r U(:, k) V(k, j) for all j, where ∑_{k=1}^r V(k, j) = 1, V ≥ 0.

Goal: theoretical analysis of the robustness to noise of separable NMF algorithms.
Key Parameters: Noise and Conditioning
We assume
M = U [I_r, V'] Π + N,
where V' ≥ 0, Π is a permutation and N is the noise.

We will assume that the noise is bounded (but otherwise arbitrary):
||N(:, j)||_2 ≤ ε for all j,

and some dependence on the conditioning κ(U) = σ_max(U)/σ_min(U) is unavoidable.
Successive Projection Algorithm (SPA)
0: Initially K = ∅.
For i = 1 : r
  1: Find j* = argmax_j ||M(:, j)||.
  2: K = K ∪ {j*}.
  3: M ← (I − uu^T) M where u = M(:, j*) / ||M(:, j*)||_2.
end
∼ modified Gram-Schmidt with column pivoting.

Theorem. If ε ≤ O(σ_min(U) / (√r κ²(U))), SPA satisfies

||U − M(:, K)|| = max_{1≤k≤r} ||U(:, k) − M(:, K(k))|| ≤ O(ε κ²(U)).

Advantages. Extremely fast, no parameter.
Drawbacks. Requires U to be full rank; the bound is weak.

[GV14] G., Vavasis, Fast and Robust Recursive Algorithms for Separable Nonnegative Matrix Factorization, IEEE Trans. Patt. Anal. Mach. Intell. 36 (4), pp. 698-714, 2014.
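A minimal NumPy sketch of SPA as written above (pick the column of largest norm, then project onto the orthogonal complement of the selected direction); this is a plain illustrative implementation, not the authors' released code, and the separable test matrix at the end is an arbitrary example.

import numpy as np

def spa(M, r):
    """Successive Projection Algorithm: return the index set K of extracted columns."""
    R = np.array(M, dtype=float)
    K = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))   # column with the largest norm
        K.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(u, u @ R)                      # project out the direction u
    return K

rng = np.random.default_rng(0)
U = rng.random((50, 4))
Vp = rng.dirichlet(np.ones(4), size=60).T               # convex combinations of the columns of U
M = np.hstack([U, U @ Vp])                              # separable: M = U [I_r, V']
print(sorted(spa(M, 4)))                                # recovers [0, 1, 2, 3]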
Pre-conditioning for More Robust SPA
Observation. Pre-multiplying M preserves separability:

P M = P (U [I_r, V'] Π + N) = (PU) [I_r, V'] Π + PN.

Ideally, P = U^{-1} so that κ(PU) = 1 (assuming m = r).

Solving for the minimum-volume ellipsoid centered at the origin and containing all the columns of M (which is SDP representable),

min_{A ∈ S^r_+} log det(A)^{-1}   such that   m_j^T A m_j ≤ 1 for all j,

allows us to approximate U^{-1}: in fact, A* ≈ (UU^T)^{-1}.

Theorem. If ε ≤ O(σ_min(U) / (r√r)), preconditioned SPA satisfies

||U − M(:, K)|| ≤ O(ε κ(U)).

[GV15] G., Vavasis, SDP-based Preconditioning for More Robust Near-Separable NMF, SIAM J. on Optimization, 2015.
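A sketch of the minimum-volume ellipsoid computation with CVXPY (assuming CVXPY and an SDP-capable solver are installed, and that the data has already been reduced to m = r dimensions as on the slide).

import cvxpy as cp
import numpy as np

def min_volume_ellipsoid(M):
    """Solve min_{A PSD} -log det(A) such that m_j^T A m_j <= 1 for every column m_j of M."""
    r, n = M.shape
    A = cp.Variable((r, r), PSD=True)
    constraints = [cp.quad_form(M[:, j], A) <= 1 for j in range(n)]
    cp.Problem(cp.Minimize(-cp.log_det(A)), constraints).solve()
    return A.value   # A* approximates (U U^T)^{-1}

# Usage sketch: form a matrix square root P of A* and run SPA on P @ M.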
Geometric Interpretation
Figure: Geometric Interpretation of the SDP-based Preconditioning.
See also Mizutani, Ellipsoidal Rounding for Nonnegative Matrix Factorization Under Noisy Separability, JMLR, 2014.
Synthetic data sets
Each entry of U ∈ R^{40×20}_+ is uniform in [0, 1]; each column is normalized.
The other columns of M are the middle points of the columns of U (hence there are C(20, 2) = 190 of them).
The noise moves the middle points toward the outside of the convex hull of the columns of U.
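A NumPy sketch of this synthetic setup (the exact noise construction used in the experiments may differ; here the midpoints are simply pushed away from the centroid of the columns of U by a factor set by the noise level).

import numpy as np
from itertools import combinations

def synthetic_separable(noise_level=0.0, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((40, 20))
    U /= U.sum(axis=0)                                   # normalize each column
    mids = np.column_stack([(U[:, i] + U[:, j]) / 2
                            for i, j in combinations(range(20), 2)])  # 190 midpoints
    centroid = U.mean(axis=1, keepdims=True)
    mids += noise_level * (mids - centroid)              # push midpoints outward (illustrative)
    return np.hstack([U, mids])                          # the first 20 columns are the pure ones

M = synthetic_separable(noise_level=0.1)
print(M.shape)   # (40, 210)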
Results for the synthetic data sets
Figure: Average of the fraction of columns correctly extracted depending on thenoise level (for each noise level, 25 matrices are generated).
Combinatorial formulation for separable NMF
We want to find the index set K with |K| = r such that

M = M(:, K) V.

This is equivalent to finding X ∈ R^{n×n} with r non-zero rows such that

M = M X.

A combinatorial formulation:

min_X ||X||_{row,0} such that M = MX (or ||M − MX|| ≤ ε in the noisy case).

How to make X row-sparse?
A Linear Optimization Model
min_{X ∈ R^{n×n}_+} trace(X) = ||diag(X)||_1
such that ||M − MX|| ≤ ε, X_{ij} ≤ X_{ii} ≤ 1 for all i, j.

Robustness: noise ≤ O(κ^{-1}) ⇒ error ≤ O(r ε κ) [GL14].

This model is an improvement over [B+12]: it is more robust and detects the factorization rank r automatically.
It is equivalent [GL16] to using ||X||_{1,∞} = ∑_{i=1}^d ||X(i, :)||_∞ as a convex surrogate for ||X||_{row,0} [E+12].

[GL14] G., Luce, Robust Near-Separable NMF Using Linear Optimization, JMLR 2014.
[B+12] Bittorf, Recht, Re, Tropp, Factoring nonnegative matrices with LPs, NIPS 2012.
[E+12] Esser et al., A convex model for NMF and dimensionality reduction on physical space, IEEE Trans. Image Processing, 2012.
[GL16] G. and Luce, A Fast Gradient Method for Nonnegative Sparse Regression with Self Dictionary, IEEE Trans. Image Processing, 2018.
Practical Model and Algorithm
min_{X ∈ Ω} ||M − MX||²_F + μ tr(X),
Ω = {X ∈ R^{n×n} | X_{ii} ≤ 1, w_i X_{ij} ≤ w_j X_{ii} for all i, j}.

We use a fast gradient method (an optimal first-order method):

1 Choose an initial point X^{(0)}, set Y = X^{(0)}, α_1 ∈ (0, 1).
2 For k = 1, 2, . . .
  1 X^{(k)} = P_Ω(Y − (1/L) ∇f(Y)).
  2 Y = X^{(k)} + β_k (X^{(k)} − X^{(k−1)}),
where β_k = α_k(1 − α_k) / (α²_k + α_{k+1}), with α_{k+1} ≥ 0 such that α²_{k+1} = (1 − α_{k+1}) α²_k.

The projection onto Ω can be done efficiently in O(n² log(n)) operations.
The total computational cost is O(pn²) operations.

[GL16] G. and Luce, A Fast Gradient Method for Nonnegative Sparse Regression with Self Dictionary, IEEE Trans. Image Processing, 2018.
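A simplified NumPy sketch of this fast gradient scheme for f(X) = ||M − MX||²_F + μ tr(X); for brevity the projection onto Ω is replaced by a crude clipping of X to [0, 1], whereas the paper projects onto the full set Ω.

import numpy as np

def fgm_self_dictionary(M, mu=1.0, n_iter=300):
    """Fast (Nesterov-type) gradient method for min_X ||M - MX||_F^2 + mu*tr(X),
    with a simplified projection (0 <= X <= 1) instead of the full set Omega."""
    n = M.shape[1]
    L = 2 * np.linalg.norm(M.T @ M, 2)       # Lipschitz constant of the smooth part
    X = np.zeros((n, n)); Y = X.copy(); alpha = 0.5
    for _ in range(n_iter):
        grad = 2 * M.T @ (M @ Y - M) + mu * np.eye(n)
        X_new = np.clip(Y - grad / L, 0.0, 1.0)                  # simplified projection
        alpha_new = (np.sqrt(alpha**4 + 4 * alpha**2) - alpha**2) / 2
        beta = alpha * (1 - alpha) / (alpha**2 + alpha_new)
        Y = X_new + beta * (X_new - X)                           # extrapolation step
        X, alpha = X_new, alpha_new
    return X

# The columns with the largest diagonal entries of X indicate the selected set K.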
Hyperspectral unmixing
              r = 6                      r = 8
              Time (s.)  Rel. err. (%)   Time (s.)  Rel. err. (%)
VCA           1.02       18.05           1.05       22.68
VCA-500       0.03       7.19            0.09       7.25
SPA           0.26       9.58            0.32       9.45
SPA-500       <0.01      10.05           <0.01      8.86
SNPA          13.60      9.63            23.02      5.64
SNPA-500      0.15       10.05           0.25       8.86
XRAY          28.17      7.50            95.34      6.82
XRAY-500      0.15       8.07            0.28       7.36
H2NMF         12.20      5.81            14.92      5.47
H2NMF-500     0.27       5.87            0.37       5.68
FGNSR-500     40.11      5.07            39.49      4.08
Table: Numerical results for the Urban HSI (the lowest relative errors are obtained by FGNSR-500).
Figure: Abundance maps extracted by FGNSR-500.
Minimum-volume NMF: Relaxing separability
Start from the separable NMF formulation:

min_{K, V≥0} ||M − M(:, K) V||²_F such that |K| = r.

Relaxing separability, minimum-volume NMF solves

min_{U≥0, V≥0} vol(U) such that ||M − UV||²_F ≤ ε,

where vol(U) ∼ det(U^T U) and V(:, j) ∈ ∆_r for all j.

Open problems: efficient algorithms for min-vol NMF, robustness to noise.

Fu, Huang, Sidiropoulos, Ma, Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications, arXiv:1803.01257, 2018.
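For illustration only, a common smooth surrogate in this family of methods is log det(U^T U + δI) with a small δ > 0; the penalized objective below is a sketch (the weight λ, the value of δ and the penalty form are illustrative assumptions, not necessarily the formulation of the cited works).

import numpy as np

def minvol_objective(M, U, V, lam=0.1, delta=1e-3):
    """Penalized min-vol objective: ||M - UV||_F^2 + lam * log det(U^T U + delta*I)."""
    fit = np.linalg.norm(M - U @ V, 'fro') ** 2
    _, logdet = np.linalg.slogdet(U.T @ U + delta * np.eye(U.shape[1]))
    return fit + lam * logdet

rng = np.random.default_rng(0)
M = rng.random((30, 3)) @ rng.random((3, 40))
U, V = rng.random((30, 3)), rng.random((3, 40))
print(minvol_objective(M, U, V))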
Identifiability with sparsity
Decompose a low-rank matrix with known sparsity of the coefficients:

M = UV,  rank(M) = rank(U) = r,  ‖V(:, j)‖_0 ≤ k = r − s < r for all j.

Many existing theoretical results (see, e.g., [Gribonval 16]) and algorithms (dictionary learning). But:
% Not many results specific to the low-rank case.
% Only two deterministic identifiability results [Elad 06, Georgiev 05].
% Not much in the NMF case except ℓ1 regularization.
Identifiability with sparsity: example
Example: p = 3, r = 3, s=sparsity=1, n = 9.
Figure: data points, a first decomposition, and a second decomposition.
Identifiability results
Theorem
Let M = UV where rank(U) = rank(M) = r and each column of V has at least s zeros. The factorization (U, V) is essentially unique if, on each hyperplane spanned by all but one column of U, there are ⌊r(r−2)/s⌋ + 1 data points with spark r.

! For s = 1, this requires r³ − 2r² + r data points, and it is tight up to the constant r (counter-examples for any n = r³ − 2r²).
! For s = r − 1, this requires r data points, and it is tight (one on each intersection of r − 1 hyperplanes).
! It is tight up to constant factors for any s = βr, for any fixed constant β.
! Nonnegativity is not taken into account in the analysis; it helps both in theory and in practice: further work.

[CG18] Cohen, G., Identifiability of Low-Rank Sparse Component Analysis, arXiv:1808.08765.
Geometric intuition
Example: p = 3, r = 3, sparsity=1, n = 4 + 3 + 2 = 9.
Figure: data points, unique decomposition.
Sparsity in action
Spectral unmixing, R = 6, s = 4
! Sparsity is another way to obtain identifiability for matrix decompositions.
% Hard combinatorial problems to solve . . .
Take-home messages
1 NMF is a useful and widely used linear model in data analysis and machine learning.
2 NMF is difficult (NP-hard) and ill-posed (non-uniqueness).
3 NMF is closely related to the nested polytopes problem and extended formulations.
4 NMF with a (self-)dictionary is tractable and well-posed (separable NMF).
5 To obtain identifiable NMF models, minimum volume or sparsity can be used but, as opposed to separability, this does not lead to tractable models. This is an important direction of research (robustness to noise, tractability).
Thank you for your attention!
Code and papers available from https://sites.google.com/site/nicolasgillis