TRANSCRIPT
Computing Nonnegative Matrix Factorizations
Nicolas Gillis
Joint work with François Glineur, Robert Luce, Stephen Vavasis, Arnaud Vandaele, Jérémy Cohen
Where is Mons?
Nonnegative Matrix Factorization (NMF)
Given a matrix M ∈ R^{p×n}_+ and a factorization rank r ≪ min(p, n), find U ∈ R^{p×r} and V ∈ R^{r×n} such that

min_{U≥0, V≥0} ||M − UV||²_F = ∑_{i,j} (M − UV)²_{ij}.    (NMF)

NMF is a linear dimensionality reduction technique for nonnegative data:

M(:, i) ≈ ∑_{k=1}^r U(:, k) V(k, i) for all i,   with M(:, i) ≥ 0, U(:, k) ≥ 0 and V(k, i) ≥ 0.

Why nonnegativity?
→ Interpretability: nonnegativity constraints lead to easily interpretable factors (and a sparse and part-based representation).
→ Many applications: image processing, text mining, hyperspectral unmixing, community detection, clustering, etc.
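To make the model concrete, here is a minimal sketch (assuming NumPy and scikit-learn are available; the data matrix below is an arbitrary illustrative example, not data from the talk) that fits a rank-r NMF and reports the relative Frobenius error.

import numpy as np
from sklearn.decomposition import NMF  # minimizes ||M - UV||_F^2 with U, V >= 0

rng = np.random.default_rng(0)
M = rng.random((20, 3)) @ rng.random((3, 30))   # synthetic nonnegative data of rank 3

r = 3
model = NMF(n_components=r, init='nndsvda', max_iter=500, random_state=0)
U = model.fit_transform(M)   # U is 20-by-3, nonnegative
V = model.components_        # V is 3-by-30, nonnegative
print(np.linalg.norm(M - U @ V, 'fro') / np.linalg.norm(M, 'fro'))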
Example 1: Blind hyperspectral unmixing
Figure: Urban hyperspectral image, 162 spectral bands and 307-by-307 pixels.
Problem. Identify the materials and classify the pixels.
Linear mixing model
Example 1: Blind hyperspectral unmixing with NMF
Basis elements allow recovering the different endmembers: U ≥ 0;
Abundances of the endmembers in each pixel: V ≥ 0.
Urban hyperspectral image
Figure: Decomposition of the Urban dataset.
Example 2: topic recovery and document classification
Basis elements allow recovering the different topics;
Weights allow assigning each text to its corresponding topics.
Example 3: feature extraction and classification
The basis elements extract facial features such as eyes, nose and lips.
Outline
1 Computational complexity
2 Standard non-linear optimization schemes and acceleration
3 Exact NMF (M = UV ) and its geometric interpretation
4 NMF under the separability assumption
Computational Complexity of NMF
Complexity of NMF
min_{U ∈ R^{p×r}, V ∈ R^{r×n}} ||M − UV||²_F such that U ≥ 0, V ≥ 0.

For r = 1, the problem is polynomially solvable via the Eckart-Young and Perron-Frobenius theorems.

Checking whether there exists an exact factorization M = UV is NP-hard (Vavasis, 2009) when p, n and r are not fixed.

Using quantifier elimination (reformulation with a fixed number of variables):
Cohen and Rothblum [1991]: (mn)^{O(mr+nr)}, non-polynomial.
Arora et al. [2012]: (mn)^{O(2^r)}, polynomial for fixed r.
Moitra [2013]: (mn)^{O(r²)}, polynomial for fixed r.
→ not really useful in practice . . .

This does not imply that rank+ (the minimum r such that M = UV) can be computed in polynomial time (because there is no upper bound on rank+).
Complexity for other norms
min_{u ∈ R^p, v ∈ R^n} ||M − uv^T||_1 = ∑_{i,j} |M_{ij} − u_i v_j|.    (ℓ1 norm)

If M is binary, M ∈ {0, 1}^{p×n}, any optimal solution (u*, v*) can be assumed to be binary, that is, (u*, v*) ∈ {0, 1}^p × {0, 1}^n.

min_{u ∈ R^p, v ∈ R^n} ||M − uv^T||²_W = ∑_{i,j} W_{ij} (M − uv^T)²_{ij},    (weighted ℓ2 norm)

where W is a nonnegative weight matrix. This model can be used when
data is missing (W_{ij} = 0 for missing entries),
entries have different variances (W_{ij} = 1/σ²_{ij}).

Both rank-one problems are NP-hard:
G., Vavasis, On the Complexity of Robust PCA and ℓ1-Norm Low-Rank Matrix Approximation, Mathematics of Operations Research, 2018.
G., Glineur, Low-Rank Matrix Approximation with Weights or Missing Data is NP-hard, SIAM J. Matrix Anal. Appl., 2011.
NMF Algorithms and Acceleration
NMF Algorithms
Given a matrix M ∈ R^{m×n}_+ and a factorization rank r ∈ N:

min_{U ∈ R^{m×r}_+, V ∈ R^{r×n}_+} ||M − UV||²_F = ∑_{i,j} (M − UV)²_{ij}.    (NMF)

This is a difficult non-linear optimization problem with potentially many local minima.

Standard framework:
0. Initialize (U, V). Then, alternately update U and V:
1. Update V ≈ argmin_{X≥0} ||M − UX||²_F.    (NNLS)
2. Update U ≈ argmin_{Y≥0} ||M − YV||²_F.    (NNLS)

Most NMF algorithms come with no guarantees (except convergence to stationary points).

The solution is in general highly non-unique: identifiability issues.
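A minimal sketch of this alternating framework, assuming NumPy and SciPy: each NNLS subproblem is solved column by column with scipy.optimize.nnls (exactly rather than approximately, which is fine for small examples).

import numpy as np
from scipy.optimize import nnls  # nonnegative least squares: min_{x >= 0} ||Ax - b||_2

def nnls_columns(A, B):
    """Solve min_{X >= 0} ||B - A X||_F^2 column by column."""
    return np.column_stack([nnls(A, B[:, j])[0] for j in range(B.shape[1])])

def alternating_nnls(M, r, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    U, V = rng.random((M.shape[0], r)), rng.random((r, M.shape[1]))
    for _ in range(n_iter):
        V = nnls_columns(U, M)        # step 1: update V with U fixed
        U = nnls_columns(V.T, M.T).T  # step 2: update U with V fixed
    return U, V

rng = np.random.default_rng(1)
M = rng.random((30, 3)) @ rng.random((3, 40))
U, V = alternating_nnls(M, r=3)
print(np.linalg.norm(M - U @ V, 'fro') / np.linalg.norm(M, 'fro'))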
Block coordinate descent method
Use block-coordinate descent on the NNLS subproblems
→ closed-form solutions for the columns of U and the rows of V:

U*_{:k} = argmin_{U_{:k} ≥ 0} ||R_k − U_{:k} V_{k:}||²_F = max(0, R_k V_{k:}^T / ||V_{k:}||²_2) for all k,

where R_k := M − ∑_{j≠k} U_{:j} V_{j:}, and similarly for V.
This is the so-called HALS algorithm (a code sketch is given below).

It can be accelerated:
1 Gauss-Seidel coordinate descent (Hsieh, Dhillon, 2011).
2 Loop several times over the columns of U / rows of V to perform more iterations at a lower computational cost (Glineur, G., 2012).
3 Randomized shuffling (Chow, Wu, Yin, 2017).
4 Use an extrapolation step: W^{(k+1)} = W^{(k+1)} + β_k (W^{(k+1)} − W^{(k)}) (Ang, G., 2018).
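A minimal NumPy sketch of one HALS sweep over the columns of U (and, by symmetry, over the rows of V), following the closed-form update above; it is written for clarity rather than speed (an efficient implementation precomputes MV^T and VV^T).

import numpy as np

def hals_update_U(M, U, V):
    """One HALS sweep over the columns of U using the closed-form nonnegative update."""
    for k in range(U.shape[1]):
        # Residual without component k: R_k = M - sum_{j != k} U[:, j] V[j, :]
        Rk = M - U @ V + np.outer(U[:, k], V[k, :])
        denom = V[k, :] @ V[k, :]
        if denom > 0:
            U[:, k] = np.maximum(0, Rk @ V[k, :] / denom)
    return U

def hals_update_V(M, U, V):
    """Symmetric sweep over the rows of V (M^T is approximately V^T U^T)."""
    return hals_update_U(M.T, V.T.copy(), U.T).T

rng = np.random.default_rng(0)
M = rng.random((30, 3)) @ rng.random((3, 40))
U, V = rng.random((30, 3)), rng.random((3, 40))
for _ in range(200):
    U = hals_update_U(M, U, V)
    V = hals_update_V(M, U, V)
print(np.linalg.norm(M - U @ V, 'fro') / np.linalg.norm(M, 'fro'))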
Illustration on the CBCL face image data set
Exact NMF: Geometry and Extended Formulations
Geometric interpretation of exact NMF
Given M = UV, one can scale M and U so that they become column stochastic, implying that V is column stochastic:

M = UV ⇐⇒ M' = M D_M = (U D_U)(D_U^{-1} V D_M) = U'V'.

The columns of M are convex combinations of the columns of U:

M_{:j} = ∑_{i=1}^k U_{:i} V_{ij}  with  ∑_{i=1}^k V_{ij} = 1 for all j,  V_{ij} ≥ 0 for all i, j.

In other terms,

conv(M) ⊆ conv(U) ⊆ S_n,

where conv(X) is the convex hull of the columns of X, and S_n = {x ∈ R^n | x ≥ 0, ∑_{i=1}^n x_i = 1} is the unit simplex.

Exact NMF ≡ find r points whose convex hull is nested between two given polytopes.
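A small NumPy sketch of this column-stochastic rescaling (assuming M = UV has no zero columns): divide the columns of M and U by their sums and rescale V accordingly; V' is then nonnegative with columns summing to one.

import numpy as np

def column_stochastic_scaling(U, V):
    """Given M = U V with U, V >= 0 and no zero columns, return (M', U', V') with
    M' = M D_M and U' = U D_U column stochastic, and V' = D_U^{-1} V D_M."""
    M = U @ V
    d_U = U.sum(axis=0)   # column sums of U (D_U scales by their inverses)
    d_M = M.sum(axis=0)   # column sums of M (D_M scales by their inverses)
    Up, Mp = U / d_U, M / d_M
    Vp = (d_U[:, None] * V) / d_M[None, :]
    assert np.allclose(Up @ Vp, Mp) and np.allclose(Vp.sum(axis=0), 1.0)
    return Mp, Up, Vp

rng = np.random.default_rng(0)
Mp, Up, Vp = column_stochastic_scaling(rng.random((5, 3)), rng.random((3, 8)))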
Geometric interpretation of NMF
Example: Two nested hexagons (rank(Ma) = 3)
Ma = (1/a) ×
[ 1      a      2a−1   2a−1   a      1
  1      1      a      2a−1   2a−1   a
  a      1      1      a      2a−1   2a−1
  2a−1   a      1      1      a      2a−1
  2a−1   2a−1   a      1      1      a
  a      2a−1   2a−1   a      1      1 ],   a > 1.
Case 1: a = 2, rank+(Ma) = 3, col(M) = col(U).
Figure: ∆p ∩ col(M2), conv(M2) and conv(U).
Case 2: a = 3, rank+(Ma) = 4, col(M) = col(U).
Figure: ∆p ∩ col(M3), conv(M3) and conv(U).
Case 3: a → +∞, rank+(Ma) = 5, col(M) ≠ col(U).
An amazing result: NMF and extended formulations
Let P be a polytope

P = {x ∈ R^k | b_i − A(i, :) x ≥ 0 for 1 ≤ i ≤ m},

and let the v_j's (1 ≤ j ≤ n) be its vertices.

We define the m-by-n slack matrix SP of P as follows:

SP(i, j) = b_i − A(i, :) v_j ≥ 0,   1 ≤ i ≤ m, 1 ≤ j ≤ n.

The hexagon:

SP =
[ 0 1 2 2 1 0
  0 0 1 2 2 1
  1 0 0 1 2 2
  2 1 0 0 1 2
  2 2 1 0 0 1
  1 2 2 1 0 0 ]
An amazing result: NMF and extended formulations
An extended formulation of P is a higher-dimensional polyhedron Q ⊆ R^{k+p} that (linearly) projects onto P. The minimum number of facets of such a polyhedron is called the extension complexity xp(P) of P.

Theorem (Yannakakis, 1991). rank+(SP) = xp(P).

Proof (one direction). Given P = {x ∈ R^k | b − Ax ≥ 0}, any exact NMF SP = UV, U ≥ 0, V ≥ 0, provides an explicit extended formulation (with some redundant equalities) of P:

P = {x | b − Ax ≥ 0} = {x | b − Ax = Uy for some y ≥ 0}.

Remark. The slack matrix SP of P satisfies

conv(SP) = Sm ∩ col(SP).

To get a small factorization, we need to go to a higher-dimensional space: rank(U) > rank(M).
The Hexagon
SP =
[ 0 1 2 2 1 0
  0 0 1 2 2 1
  1 0 0 1 2 2
  2 1 0 0 1 2
  2 2 1 0 0 1
  1 2 2 1 0 0 ]
=
[ 1 0 0 1/2 0
  0 1 0 1   0
  0 0 1 1/2 0
  0 0 1 0   1/2
  0 1 0 0   1
  1 0 0 0   1/2 ]
×
[ 0 1 2 1 0 0
  0 0 1 0 0 1
  1 0 0 0 1 2
  0 0 0 2 2 0
  2 2 0 0 0 0 ],
with
rank(SP) = 3 ≤ rank+(SP) = 5 ≤ min(m, n) = 6.
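A quick NumPy check of this exact rank-5 factorization (the matrices are exactly those above): it verifies SP = UV, that both factors are nonnegative, and that the usual rank of SP is only 3.

import numpy as np

SP = np.array([[0, 1, 2, 2, 1, 0],
               [0, 0, 1, 2, 2, 1],
               [1, 0, 0, 1, 2, 2],
               [2, 1, 0, 0, 1, 2],
               [2, 2, 1, 0, 0, 1],
               [1, 2, 2, 1, 0, 0]], dtype=float)

U = np.array([[1, 0, 0, 0.5, 0],
              [0, 1, 0, 1,   0],
              [0, 0, 1, 0.5, 0],
              [0, 0, 1, 0,   0.5],
              [0, 1, 0, 0,   1],
              [1, 0, 0, 0,   0.5]])

V = np.array([[0, 1, 2, 1, 0, 0],
              [0, 0, 1, 0, 0, 1],
              [1, 0, 0, 0, 1, 2],
              [0, 0, 0, 2, 2, 0],
              [2, 2, 0, 0, 0, 0]], dtype=float)

assert np.allclose(SP, U @ V)             # exact NMF with r = 5
assert (U >= 0).all() and (V >= 0).all()  # nonnegative factors
print(np.linalg.matrix_rank(SP))          # 3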
Some implications
Problem: limits of LP for solving combinatorial problems: given a polytope, what is the most compact way to represent it?
Its extension complexity = the nonnegative rank of its slack matrix.
Key tool: lower bound techniques for the nonnegative rank.

Ex. The matching problem cannot be solved via a polynomial-size LP.
Rothvoss (2014). The matching polytope has exponential extension complexity, STOC.

This can be generalized to:
approximations (no poly-size LP can approximate these problems up to some precision);
Braun, Fiorini, Pokutta & Steurer (2012). Approximation limits of linear programs (beyond hierarchies), FOCS.
any convex cone, in particular PSD (the so-called PSD-rank).
See the survey: Fawzi, Gouveia, Parrilo, Robinson & Thomas, Positive semidefinite rank, Mathematical Programming, 2015.
Exact NMF computation and regular n-gons
Can we use numerical solvers to get insight into these problems? Yes!

We have developed a library to compute exact NMFs of small matrices using meta-heuristics.
[V14] Vandaele, G., Glineur & D. Tuyttens, Heuristics for Exact NMF (2014).
Extension complexity of the octagon?
rank(SP) = 3 ≤ rank+(SP) = 6 ≤ min(m, n) = 8.
Exact NMF computation and regular n-gons
We observed a special structure in the solutions for regular n-gons, leading to the best known upper bound and closing the gap for some n-gons:

rank+(Sn) ≤ 2⌈log2(n)⌉ − 1   for 2^{k−1} < n ≤ 2^{k−1} + 2^{k−2},
rank+(Sn) ≤ 2⌈log2(n)⌉       for 2^{k−1} + 2^{k−2} < n ≤ 2^k.

[V15] Vandaele, G. & Glineur, On the Linear Extension Complexity of Regular n-gons (2015).

Implication: conic quadratic programming is 'polynomially reducible' to linear programming.
[BTN01] Ben-Tal and Nemirovski (2001). On polyhedral approximations of the second-order cone. Mathematics of Operations Research, 26(2), 193-205.
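A tiny helper (a sketch using Python's math module) that evaluates this upper bound for a given n; note that k = ⌈log2(n)⌉ when 2^{k−1} < n ≤ 2^k.

import math

def ngon_rank_plus_upper_bound(n: int) -> int:
    """Upper bound on rank+(Sn) for the regular n-gon (n >= 3), as stated above."""
    k = math.ceil(math.log2(n))              # 2^(k-1) < n <= 2^k
    if n <= 2 ** (k - 1) + 2 ** (k - 2):     # first regime
        return 2 * k - 1
    return 2 * k                             # second regime

print(ngon_rank_plus_upper_bound(6))   # hexagon: 5
print(ngon_rank_plus_upper_bound(8))   # octagon: 6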
NMF under the separability assumption
Separability Assumption
Separability of M: there exists an index set K with |K| = r and V ≥ 0 such that

M = M(:, K) V,   where U = M(:, K) ≥ 0.

[AGKM12] Arora, Ge, Kannan, Moitra, Computing a Nonnegative Matrix Factorization – Provably, STOC 2012.
Applications
In hyperspectral imaging, this is the pure-pixel assumption: for each material, there is a 'pure' pixel containing only that material.
[M+14] Ma et al., A Signal Processing Perspective on Hyperspectral Unmixing: Insights from Remote Sensing, IEEE Signal Processing Magazine 31(1):67-81, 2014.

In document classification: for each topic, there is a 'pure' word used only by that topic (an 'anchor' word).
[A+13] Arora et al., A Practical Algorithm for Topic Modeling with Provable Guarantees, ICML 2013.

Time-resolved Raman spectra analysis: each substance has a peak in its spectrum while the other spectra are (close to) zero.
[L+16] Luce et al., Using Separable Nonnegative Matrix Factorization for the Analysis of Time-Resolved Raman Spectra, Appl Spectrosc. 2016.

Others: video summarization, foreground-background separation.
[ESV12] Elhamifar, Sapiro, Vidal, See all by looking at a few: Sparse modeling for finding representative objects, CVPR 2012.
[KSK13] Kumar, Sindhwani, Near-separable Non-negative Matrix Factorization with ℓ1- and Bregman Loss Functions, SIAM Data Mining 2015.
Geometric Interpretation
The columns of U are the vertices of the convex hull of the columns of M:

M(:, j) = ∑_{k=1}^r U(:, k) V(k, j) for all j, where ∑_{k=1}^r V(k, j) = 1, V ≥ 0.
Geometric Interpretation with Noise
The columns of U are the vertices of the convex hull of the columns of M:

M(:, j) ≈ ∑_{k=1}^r U(:, k) V(k, j) for all j, where ∑_{k=1}^r V(k, j) = 1, V ≥ 0.

Goal: theoretical analysis of the robustness to noise of separable NMF algorithms.
Key Parameters: Noise and Conditioning
We assume
M = U [I_r, V'] Π + N,
where V' ≥ 0, Π is a permutation and N is the noise.

We will assume that the noise is bounded (but otherwise arbitrary):
||N(:, j)||_2 ≤ ε for all j,

and some dependence on the conditioning κ(U) = σ_max(U)/σ_min(U) is unavoidable.
Successive Projection Algorithm (SPA)
0: Initially K = ∅.
For i = 1 : r
  1: Find j* = argmax_j ||M(:, j)||.
  2: K = K ∪ {j*}.
  3: M ← (I − uu^T) M where u = M(:, j*) / ||M(:, j*)||_2.
end
∼ modified Gram-Schmidt with column pivoting.

Theorem. If ε ≤ O(σ_min(U) / (√r κ²(U))), SPA satisfies

||U − M(:, K)|| = max_{1≤k≤r} ||U(:, k) − M(:, K(k))|| ≤ O(ε κ²(U)).

Advantages. Extremely fast, no parameter.
Drawbacks. Requires U to be full rank; the bound is weak.

[GV14] G., Vavasis, Fast and Robust Recursive Algorithms for Separable Nonnegative Matrix Factorization, IEEE Trans. Patt. Anal. Mach. Intell. 36 (4), pp. 698-714, 2014.
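A minimal NumPy sketch of SPA as written above (pick the column of largest norm, then project onto the orthogonal complement of the selected direction); this is a plain illustrative implementation, not the authors' released code, and the separable test matrix at the end is an arbitrary example.

import numpy as np

def spa(M, r):
    """Successive Projection Algorithm: return the index set K of extracted columns."""
    R = np.array(M, dtype=float)
    K = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))   # column with the largest norm
        K.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(u, u @ R)                      # project out the direction u
    return K

rng = np.random.default_rng(0)
U = rng.random((50, 4))
Vp = rng.dirichlet(np.ones(4), size=60).T               # convex combinations of the columns of U
M = np.hstack([U, U @ Vp])                              # separable: M = U [I_r, V']
print(sorted(spa(M, 4)))                                # recovers [0, 1, 2, 3]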
Pre-conditioning for More Robust SPA
Observation. Pre-multiplying M preserves separability:

P M = P (U [I_r, V'] Π + N) = (PU) [I_r, V'] Π + PN.

Ideally, P = U^{-1} so that κ(PU) = 1 (assuming m = r).

Solving for the minimum-volume ellipsoid centered at the origin and containing all the columns of M (which is SDP representable),

min_{A ∈ S^r_+} log det(A)^{-1}   such that   m_j^T A m_j ≤ 1 for all j,

allows us to approximate U^{-1}: in fact, A* ≈ (UU^T)^{-1}.

Theorem. If ε ≤ O(σ_min(U) / (r√r)), preconditioned SPA satisfies

||U − M(:, K)|| ≤ O(ε κ(U)).

[GV15] G., Vavasis, SDP-based Preconditioning for More Robust Near-Separable NMF, SIAM J. on Optimization, 2015.
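A sketch of the minimum-volume ellipsoid computation with CVXPY (assuming CVXPY and an SDP-capable solver are installed, and that the data has already been reduced to m = r dimensions as on the slide).

import cvxpy as cp
import numpy as np

def min_volume_ellipsoid(M):
    """Solve min_{A PSD} -log det(A) such that m_j^T A m_j <= 1 for every column m_j of M."""
    r, n = M.shape
    A = cp.Variable((r, r), PSD=True)
    constraints = [cp.quad_form(M[:, j], A) <= 1 for j in range(n)]
    cp.Problem(cp.Minimize(-cp.log_det(A)), constraints).solve()
    return A.value   # A* approximates (U U^T)^{-1}

# Usage sketch: form a matrix square root P of A* and run SPA on P @ M.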
Geometric Interpretation
Figure: Geometric Interpretation of the SDP-based Preconditioning.
See also Mizutani, Ellipsoidal Rounding for Nonnegative Matrix Factorization Under Noisy Separability, JMLR, 2014.
Synthetic data sets
Each entry of U ∈ R^{40×20}_+ is uniform in [0, 1]; each column is normalized.
The other columns of M are the middle points of the columns of U (hence there are C(20, 2) = 190 of them).
The noise moves the middle points toward the outside of the convex hull of the columns of U.
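A NumPy sketch of this synthetic setup (the exact noise construction used in the experiments may differ; here the midpoints are simply pushed away from the centroid of the columns of U by a factor set by the noise level).

import numpy as np
from itertools import combinations

def synthetic_separable(noise_level=0.0, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((40, 20))
    U /= U.sum(axis=0)                                   # normalize each column
    mids = np.column_stack([(U[:, i] + U[:, j]) / 2
                            for i, j in combinations(range(20), 2)])  # 190 midpoints
    centroid = U.mean(axis=1, keepdims=True)
    mids += noise_level * (mids - centroid)              # push midpoints outward (illustrative)
    return np.hstack([U, mids])                          # the first 20 columns are the pure ones

M = synthetic_separable(noise_level=0.1)
print(M.shape)   # (40, 210)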
Results for the synthetic data sets
Figure: Average of the fraction of columns correctly extracted depending on thenoise level (for each noise level, 25 matrices are generated).
Combinatorial formulation for separable NMF
We want to find the index set K with |K| = r such that

M = M(:, K) V.

This is equivalent to finding X ∈ R^{n×n} with r non-zero rows such that

M = M X.

A combinatorial formulation:

min_X ||X||_{row,0} such that M = MX (or ||M − MX|| ≤ ε in the noisy case).

How to make X row-sparse?
A Linear Optimization Model
min_{X ∈ R^{n×n}_+} trace(X) = ||diag(X)||_1
such that ||M − MX|| ≤ ε, X_{ij} ≤ X_{ii} ≤ 1 for all i, j.

Robustness: noise ≤ O(κ^{-1}) ⇒ error ≤ O(r ε κ) [GL14].

This model is an improvement over [B+12]: it is more robust and detects the factorization rank r automatically.
It is equivalent [GL16] to using ||X||_{1,∞} = ∑_{i=1}^d ||X(i, :)||_∞ as a convex surrogate for ||X||_{row,0} [E+12].

[GL14] G., Luce, Robust Near-Separable NMF Using Linear Optimization, JMLR 2014.
[B+12] Bittorf, Recht, Re, Tropp, Factoring nonnegative matrices with LPs, NIPS 2012.
[E+12] Esser et al., A convex model for NMF and dimensionality reduction on physical space, IEEE Trans. Image Processing, 2012.
[GL16] G. and Luce, A Fast Gradient Method for Nonnegative Sparse Regression with Self Dictionary, IEEE Trans. Image Processing, 2018.
Practical Model and Algorithm
min_{X ∈ Ω} ||M − MX||²_F + μ tr(X),
Ω = {X ∈ R^{n×n} | X_{ii} ≤ 1, w_i X_{ij} ≤ w_j X_{ii} for all i, j}.

We use a fast gradient method (an optimal first-order method):

1 Choose an initial point X^{(0)}, set Y = X^{(0)}, α_1 ∈ (0, 1).
2 For k = 1, 2, . . .
  1 X^{(k)} = P_Ω(Y − (1/L) ∇f(Y)).
  2 Y = X^{(k)} + β_k (X^{(k)} − X^{(k−1)}),
where β_k = α_k(1 − α_k) / (α²_k + α_{k+1}), with α_{k+1} ≥ 0 such that α²_{k+1} = (1 − α_{k+1}) α²_k.

The projection onto Ω can be done efficiently in O(n² log(n)) operations.
The total computational cost is O(pn²) operations.

[GL16] G. and Luce, A Fast Gradient Method for Nonnegative Sparse Regression with Self Dictionary, IEEE Trans. Image Processing, 2018.
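A simplified NumPy sketch of this fast gradient scheme for f(X) = ||M − MX||²_F + μ tr(X); for brevity the projection onto Ω is replaced by a crude clipping of X to [0, 1], whereas the paper projects onto the full set Ω.

import numpy as np

def fgm_self_dictionary(M, mu=1.0, n_iter=300):
    """Fast (Nesterov-type) gradient method for min_X ||M - MX||_F^2 + mu*tr(X),
    with a simplified projection (0 <= X <= 1) instead of the full set Omega."""
    n = M.shape[1]
    L = 2 * np.linalg.norm(M.T @ M, 2)       # Lipschitz constant of the smooth part
    X = np.zeros((n, n)); Y = X.copy(); alpha = 0.5
    for _ in range(n_iter):
        grad = 2 * M.T @ (M @ Y - M) + mu * np.eye(n)
        X_new = np.clip(Y - grad / L, 0.0, 1.0)                  # simplified projection
        alpha_new = (np.sqrt(alpha**4 + 4 * alpha**2) - alpha**2) / 2
        beta = alpha * (1 - alpha) / (alpha**2 + alpha_new)
        Y = X_new + beta * (X_new - X)                           # extrapolation step
        X, alpha = X_new, alpha_new
    return X

# The columns with the largest diagonal entries of X indicate the selected set K.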
Hyperspectral unmixing
              r = 6                      r = 8
              Time (s.)  Rel. err. (%)   Time (s.)  Rel. err. (%)
VCA           1.02       18.05           1.05       22.68
VCA-500       0.03       7.19            0.09       7.25
SPA           0.26       9.58            0.32       9.45
SPA-500       <0.01      10.05           <0.01      8.86
SNPA          13.60      9.63            23.02      5.64
SNPA-500      0.15       10.05           0.25       8.86
XRAY          28.17      7.50            95.34      6.82
XRAY-500      0.15       8.07            0.28       7.36
H2NMF         12.20      5.81            14.92      5.47
H2NMF-500     0.27       5.87            0.37       5.68
FGNSR-500     40.11      5.07            39.49      4.08
Table: Numerical results for the Urban HSI (the lowest relative errors are obtained by FGNSR-500).
Figure: Abundance maps extracted by FGNSR-500.
Minimum-volume NMF: Relaxing separability
Start from the separable NMF formulation:

min_{K, V≥0} ||M − M(:, K) V||²_F such that |K| = r.

Relaxing separability, minimum-volume NMF solves

min_{U≥0, V≥0} vol(U) such that ||M − UV||²_F ≤ ε,

where vol(U) ∼ det(U^T U) and V(:, j) ∈ ∆_r for all j.

Open problems: efficient algorithms for min-vol NMF, robustness to noise.

Fu, Huang, Sidiropoulos, Ma, Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications, arXiv:1803.01257, 2018.
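For illustration only, a common smooth surrogate in this family of methods is log det(U^T U + δI) with a small δ > 0; the penalized objective below is a sketch (the weight λ, the value of δ and the penalty form are illustrative assumptions, not necessarily the formulation of the cited works).

import numpy as np

def minvol_objective(M, U, V, lam=0.1, delta=1e-3):
    """Penalized min-vol objective: ||M - UV||_F^2 + lam * log det(U^T U + delta*I)."""
    fit = np.linalg.norm(M - U @ V, 'fro') ** 2
    _, logdet = np.linalg.slogdet(U.T @ U + delta * np.eye(U.shape[1]))
    return fit + lam * logdet

rng = np.random.default_rng(0)
M = rng.random((30, 3)) @ rng.random((3, 40))
U, V = rng.random((30, 3)), rng.random((3, 40))
print(minvol_objective(M, U, V))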
Identifiability with sparsity
Decompose a low-rank matrix with known sparsity of the coefficients:

M = UV,  rank(M) = rank(U) = r,  ‖V(:, j)‖_0 ≤ k = r − s < r for all j.

Many existing theoretical results (see, e.g., [Gribonval 16]) and algorithms (dictionary learning). But:
% Not many results specific to the low-rank case.
% Only two deterministic identifiability results [Elad 06, Georgiev 05].
% Not much in the NMF case except ℓ1 regularization.
Identifiability with sparsity: example
Example: p = 3, r = 3, s=sparsity=1, n = 9.
Figure: data points, a first decomposition, and a second decomposition.
Identifiability results
Theorem
Let M = UV where rank(U) = rank(M) = r and each column of V has at least s zeros. The factorization (U, V) is essentially unique if, on each hyperplane spanned by all but one column of U, there are ⌊r(r−2)/s⌋ + 1 data points with spark r.

! For s = 1, this requires r³ − 2r² + r data points, and it is tight up to the constant r (counter-examples for any n = r³ − 2r²).
! For s = r − 1, this requires r data points, and it is tight (one on each intersection of r − 1 hyperplanes).
! It is tight up to constant factors for any s = βr, for any fixed constant β.
! Nonnegativity is not taken into account in the analysis; it helps both in theory and in practice: further work.

[CG18] Cohen, G., Identifiability of Low-Rank Sparse Component Analysis, arXiv:1808.08765.
Geometric intuition
Example: p = 3, r = 3, sparsity=1, n = 4 + 3 + 2 = 9.
Figure: data points, unique decomposition.
Sparsity in action
Spectral unmixing, R = 6, s = 4
! Sparsity is another way to obtain identifiability for matrix decompositions.
% Hard combinatorial problems to solve . . .
Take-home messages
1 NMF is a useful and widely used linear model in data analysis and machine learning.
2 NMF is difficult (NP-hard) and ill-posed (non-uniqueness).
3 NMF is closely related to the nested polytopes problem and extended formulations.
4 NMF with a (self-)dictionary is tractable and well-posed (separable NMF).
5 To obtain identifiable NMF models, minimum volume or sparsity can be used but, as opposed to separability, this does not lead to tractable models. This is an important direction of research (robustness to noise, tractability).
Thank you for your attention!
Code and papers available from https://sites.google.com/site/nicolasgillis