nonnegative matrix factorization

Nonnegative Matrix Factorization-

Complexity, Algorithms and Applications

Nicolas Gillis†

Advisor: Francois GlineurUniversite catholique de Louvain

†Universite de MonsDepartment of Mathematics and Operational Research

Householder XIX Nonnegative Matrix Factorization: Complexity, Algorithms and Applications 1

Belgium

Nonnegative Matrix Factorization (NMF)Given a matrix M ∈ Rp×n

+ and a factorization rank r � min(p, n), find

U ∈ Rp×rand V ∈ Rr×n such that

minU≥0,V≥0

||M − UV ||2F =∑i,j

(M − UV )2ij . (NMF)

NMF is a linear dimensionality reduction technique for nonnegative data :

M(:, j)︸︷︷︸≥0

≈r∑

U(:, k)︸︷︷︸≥0

V (k, j)︸︷︷︸≥0

for all j.

Why nonnegativity?

→ Interpretability: Nonnegativity constraints lead to easily interpretablefactors (and a sparse and part-based representation).→ Many applications. image processing, text mining, recommendersystems, community detection, clustering, etc.

minU≥0,V≥0

||M − UV ||2F =∑i,j

M(:, j)︸︷︷︸≥0

≈r∑

U(:, k)︸︷︷︸≥0

V (k, j)︸︷︷︸≥0

for all j.

Why nonnegativity?

minU≥0,V≥0

||M − UV ||2F =∑i,j

M(:, j)︸︷︷︸≥0

≈r∑

U(:, k)︸︷︷︸≥0

V (k, j)︸︷︷︸≥0

for all j.

Why nonnegativity?

Example: blind hyperspectral unmixing

Figure : Urban hyperspectral image with 162 spectral bands and 307-by-307pixels.

� Basis elements allow to recover the different materials;

� Weights allow to know which pixel contains which material.

Figure : Decomposition of the Urban dataset.

Outline

1. Algorithms and Applications

I General framework for NMF algorithms

I Solving NMF sequentially with underapproximations

2. Complexity and Bounds

I Nonnegative rank

I rank-one subproblems, weights and missing data

How can we ‘solve’ NMF problems?Given a matrix M ∈ Rm×n

+ and a factorization rank r ∈ N:

minU∈Rm×r

+ ,V ∈Rr×n+

||M − UV ||2F =∑i,j

0. Initialize (U , V ). Then, alternatively update U and V :1. Update V ≈ argminX≥0 ||M − UX||2F . (NNLS)2. Update U ≈ argminY≥0 ||M − Y V ||2F . (NNLS)

HALS algorithm: Use block-coordinate descent on NNLS subproblems−→ closed-form solutions for the columns of U and rows of V :

U∗:k = argminU:k≥0 ||Rk − U:kVk:||2F = max

RkVTk:

||Vk:||22

)∀k,

where Rk.= M −

∑j 6=k U:jVj:, and similarly for V .

HALS can be accelerated ordering updates efficiently.

G., Glineur, Accelerated Multiplicative Updates and Hierarchical ALS Algorithms forNonnegative Matrix Factorization, Neural Computation 2012.

minU∈Rm×r

+ ,V ∈Rr×n+

||M − UV ||2F =∑i,j

RkVTk:

||Vk:||22

)∀k,

where Rk.= M −

minU∈Rm×r

+ ,V ∈Rr×n+

||M − UV ||2F =∑i,j

RkVTk:

||Vk:||22

)∀k,

where Rk.= M −

Drawbacks of standard NMF Algorithms

1. The optimal solution is, in most cases, non-unique and the problem isill-posed. Many variants of NMF impose additional constraints (e.g.,sparsity, smoothness, spatial information, etc.).

2. In practice, it is difficult to choose the factorization rank (in general,trial and error approach or estimation using the SVD).

A possible way to overcome these drawbacks is to use underapproximationconstraints to solve NMF sequentially.

Nonnegative Matrix Underapproximation (NMU)It is possible to solve NMF sequentially, solving at each step

minu≥0,v≥0

||M − uvT ||2F such that uvT ≤M ⇐⇒ M − uvT ≥ 0.

NMU is yet another linear dimensionality reduction technique.However,

� As PCA/SVD, it is sequential and is well-posed.

� As NMF, it leads to a separation by parts. Moreover theadditional underapproximation constraints enhance this property.

� In the presence of pure-pixels, the NMU recursion is able todetect materials individually.

G., Glineur, Using Underapproximations for Sparse Nonnegative Matrix Factorization, PatternRecognition, 2010.G., Plemmons, Dimensionality Reduction, Classification, and Spectral Mixture Analysis usingNonnegative Underapproximation, Optical Engineering, 2011.

minu≥0,v≥0

NMU on the Urban dataset

Identifying Lighting Orientations with NMUA static scene is illuminated from many directions.

NMU is able to detect the different lighting orientations.

NMU has also been successfully used in text mining for anomaly detection,and in image processing for segmentation of medical images.

Outline

1. Algorithms and Applications

I General framework for NMF algorithms

I Solving NMF sequentially with underapproximations

2. Complexity and Bounds

I Nonnegative rank

I Rank-one subproblems, weights and missing data

What is the nonnegative rank?

The nonnegative rank of a nonnegative matrix M ∈ Rm×n+ is the minimum

number r of nonnegative rank-one factors needed to reconstruct M :

r∑i=1

uivTi , ui ∈ Rm

+ , vi ∈ Rn+,

that is, the minimum r such that an exact NMF exists:

M = UV =

r∑i=1

U:iVi:, U ≥ 0, V ≥ 0.

The nonnegative rank of M is denoted rank+(M). Clearly,

rank(M) ≤ rank+(M) ≤ min(m,n).

What is the nonnegative rank?

The nonnegative rank of a nonnegative matrix M ∈ Rm×n+ is the minimum

number r of nonnegative rank-one factors needed to reconstruct M :

r∑i=1

uivTi , ui ∈ Rm

+ , vi ∈ Rn+,

that is, the minimum r such that an exact NMF exists:

M = UV =

r∑i=1

U:iVi:, U ≥ 0, V ≥ 0.

The nonnegative rank of M is denoted rank+(M). Clearly,

rank(M) ≤ rank+(M) ≤ min(m,n).

An amazing resultLet P be a polytope

P = {x ∈ Rk | bi −A(i, :)x ≥ 0 for 1 ≤ i ≤ m},

and let vj ’s (1 ≤ j ≤ n) be its vertices.

We define the m-by-n slack matrix SP of P as follows:

SP(i, j) = bi −A(i, :)vj≥ 0 1 ≤ i ≤ m, 1 ≤ j ≤ n.

An extended formulation of P is higher dimensional polyhedron Q ⊆ Rk+p

that (linearly) projects onto P . The minimum number of facets of such apolytope is called the extension complexity xp(P) of P.

Theorem (Yannakakis, 1991).

rank+(SP) = xp(P).

Remark. Other closely related problems in communication complexity,probability, graph theory.

P = {x ∈ Rk | bi −A(i, :)x ≥ 0 for 1 ≤ i ≤ m},

SP(i, j) = bi −A(i, :)vj≥ 0 1 ≤ i ≤ m, 1 ≤ j ≤ n.

rank+(SP) = xp(P).

P = {x ∈ Rk | bi −A(i, :)x ≥ 0 for 1 ≤ i ≤ m},

SP(i, j) = bi −A(i, :)vj≥ 0 1 ≤ i ≤ m, 1 ≤ j ≤ n.

rank+(SP) = xp(P).

P = {x ∈ Rk | bi −A(i, :)x ≥ 0 for 1 ≤ i ≤ m},

SP(i, j) = bi −A(i, :)vj≥ 0 1 ≤ i ≤ m, 1 ≤ j ≤ n.

rank+(SP) = xp(P).

P = {x ∈ Rk | bi −A(i, :)x ≥ 0 for 1 ≤ i ≤ m},

SP(i, j) = bi −A(i, :)vj≥ 0 1 ≤ i ≤ m, 1 ≤ j ≤ n.

rank+(SP) = xp(P).

The Hexagon

0 1 2 2 1 00 0 1 2 2 11 0 0 1 2 22 1 0 0 1 22 2 1 0 0 11 2 2 1 0 0

The Hexagon

0 1 2 2 1 00 0 1 2 2 11 0 0 1 2 22 1 0 0 1 22 2 1 0 0 11 2 2 1 0 0

1 0 0 1/2 00 1 0 1 00 0 1 1/2 00 0 1 0 1/20 1 0 0 11 0 0 0 1/2

0 1 2 1 0 00 0 1 0 0 11 0 0 0 1 20 0 0 2 2 02 2 0 0 0 0

rank(SP) = 3 ≤ rank+(SP) = 5 ≤ min(m,n) = 6.

Lower Bounds for the Nonnegative RankIt is difficult to compute the nonnegative rank (NP-hard). However, it ispossible to compute lower bounds efficiently. For example,

n = # vertices(P) ≤ 2r+ .

Goemans, Smallest compact formulation for the permutahedron, ISMP 2009.Beasley and Laffey, Real rank versus nonnegative rank, LAA 2009.G., Glineur, On the geometric interpretation of the nonnegative rank, LAA 2012.

What is the complexity of NMF?

Given a matrix M ∈ Rm×n+ and a factorization rank r ∈ N:

minU∈Rm×r

+ ,V ∈Rr×n+

||M − UV ||2F =∑i,j

� For r = 1: ‘easy’ [Perron-Frobenius and Eckart-Young theorems].

� rank(M) = 2: rank+(M) = 2, ‘easy’.

� r part of the input: NP-hard (Vavasis, 2009).

� rank+(M) = r not part of the input: polynomial –O((mn)r

(Arora et al. 2012, Moitra 2013).

� rank(M) = k not part of the input: open problem.

minU∈Rm×r

+ ,V ∈Rr×n+

||M − UV ||2F =∑i,j

� rank(M) = 2: rank+(M) = 2, ‘easy’.

minU∈Rm×r

+ ,V ∈Rr×n+

||M − UV ||2F =∑i,j

� rank(M) = 2: rank+(M) = 2, ‘easy’.

minU∈Rm×r

+ ,V ∈Rr×n+

||M − UV ||2F =∑i,j

� rank(M) = 2: rank+(M) = 2, ‘easy’.

What about rank-one problems in NMF?

Solving NMF requires to finding r nonnegative rank-one factors U:kVk:,each having to satisfy the following equality as well as possible

U:kVk: ≈M −∑j 6=k

U:jVj:.= Rk � 0 ∀k.

These subproblems have the following form

minu∈Rm,v∈Rn

||R− uvT ||2F such that u ≥ 0, v ≥ 0. (R1NF)

called rank-one nonnegative factorization.

Is this problem difficult?If R ≥ 0 : No.If R � 0 : Surprisingly, yes.

G., Glineur, A Continuous Characterization of the Maximum-Edge Biclique Problem, JOGO ’12.

U:jVj:.= Rk � 0 ∀k.

minu∈Rm,v∈Rn

U:jVj:.= Rk � 0 ∀k.

minu∈Rm,v∈Rn

U:jVj:.= Rk � 0 ∀k.

minu∈Rm,v∈Rn

What about weights and missing data?

In some cases, some entries are missing/unknown.

For example, we would like to predict how much someone is going to like amovie based on its movie preferences (e.g., 1 to 5 stars) :

Movies

2 3 2 ? ?? 1 ? 3 21 ? 4 1 ?5 4 ? 3 2? 1 2 ? 41 ? 3 4 3

Huge potential in electronic commerce sites (movies, books, music, . . . ).Good recommendations will increase the propensity of a purchase.

What about weights and missing data?

In some cases, some entries are missing/unknown.

For example, we would like to predict how much someone is going to like amovie based on its movie preferences (e.g., 1 to 5 stars) :

Movies

2 3 2 ? ?? 1 ? 3 21 ? 4 1 ?5 4 ? 3 2? 1 2 ? 41 ? 3 4 3

Huge potential in electronic commerce sites (movies, books, music, . . . ).Good recommendations will increase the propensity of a purchase.

Low-rank model for recommendation systems

The behavior of users is modeled using linear combination of ’feature’users (related to age, sex, culture, etc.)

M(:, j)︸︷︷︸user j

≈r∑

U(:, k)︸︷︷︸feature user k

V (k, j)︸︷︷︸weights

Or equivalently, movies ratings are modeled as linear combinations of’feature’ movies (related to the genres - child oriented, serious vs. escapist,thriller, romantic, actors, etc.).

M(i, :)︸︷︷︸movie i

≈r∑

U(i, k)︸︷︷︸weights

V (k, :)︸︷︷︸genre k

Low-rank model for recommendation systems

The behavior of users is modeled using linear combination of ’feature’users (related to age, sex, culture, etc.)

M(:, j)︸︷︷︸user j

≈r∑

U(:, k)︸︷︷︸feature user k

V (k, j)︸︷︷︸weights

Or equivalently, movies ratings are modeled as linear combinations of’feature’ movies (related to the genres - child oriented, serious vs. escapist,thriller, romantic, actors, etc.).

M(i, :)︸︷︷︸movie i

≈r∑

U(i, k)︸︷︷︸weights

V (k, :)︸︷︷︸genre k

Example

2 3 2 ? ?? 1 ? 3 21 ? 4 1 ?5 4 ? 3 2? 1 2 ? 41 ? 3 4 3

0.5 0.6 −0.10.8 −0.2 −0.30.8 −0.7 0.6−2 2.3 1.8−0.2 0.3 0.91 −0.2 −0.2

1.7 2.1 3.7 5 4.1

2.2 3.2 0.8 5 0.52 0.6 2.6 0.9 5

2 2.9 2.1 5.4 1.90.3 0.9 2 2.7 1.71 −0.2 4 1 5.95.3 4.2 −0.9 3.1 22.1 1.1 1.8 1.3 2.80.9 1.3 3 3.8 3

Example

2 3 2 ? ?? 1 ? 3 21 ? 4 1 ?5 4 ? 3 2? 1 2 ? 41 ? 3 4 3

0.5 0.6 −0.10.8 −0.2 −0.30.8 −0.7 0.6−2 2.3 1.8−0.2 0.3 0.91 −0.2 −0.2

1.7 2.1 3.7 5 4.1

2.2 3.2 0.8 5 0.52 0.6 2.6 0.9 5

2 2.9 2.1 5.4 1.90.3 0.9 2 2.7 1.71 −0.2 4 1 5.95.3 4.2 −0.9 3.1 22.1 1.1 1.8 1.3 2.80.9 1.3 3 3.8 3

Example

2 3 2 ? ?? 1 ? 3 21 ? 4 1 ?5 4 ? 3 2? 1 2 ? 41 ? 3 4 3

0.5 0.6 −0.10.8 −0.2 −0.30.8 −0.7 0.6−2 2.3 1.8−0.2 0.3 0.91 −0.2 −0.2

1.7 2.1 3.7 5 4.1

2.2 3.2 0.8 5 0.52 0.6 2.6 0.9 5

2 2.9 2.1 5.4 1.90.3 0.9 2 2.7 1.71 −0.2 4 1 5.95.3 4.2 −0.9 3.1 22.1 1.1 1.8 1.3 2.80.9 1.3 3 3.8 3

ComplexityWeighted low-rank approximation is the following problem

infU ∈ Rm×r

V ∈ Rr×n

||M − UV T ||2W =∑ij

Wij(M − UV T )2ij , (WLRA)

where W ≥ 0 is the weighting matrix. A zero in W represents amissing/unknown entry in M .For r = 1 and M ≥ 0, this is equivalent to Weighted NMF.

TheoremIt is NP-hard to find an approximate solution of WLRA, for any r ≥ 1.

Other applications: computer vision, microarray data analysis, 2-Ddigital filter, etc.

G., Glineur, Low-Rank Matrix Approximation with Weights or Missing Data is NP-hard,SIMAX ’11.

infU ∈ Rm×r

V ∈ Rr×n

||M − UV T ||2W =∑ij

infU ∈ Rm×r

V ∈ Rr×n

||M − UV T ||2W =∑ij

Maximum-edge biclique problemThe reductions for NMU, R1NF and WLRA are from themaximum-edge biclique problem: Given two sets of objects interactingtogether (a bipartite graph), find highly connected subgroups.

This is the maximum-edge complete bipartite subgraph.

Applications in clustering : text mining, web community discovery . . .

ExampleLet M be the biadjacency matrix of the bipartite graph representing theinteractions:

1 0 10 1 11 0 1

1 00 11 0

( 1 0 10 1 1

1 0 10 0 01 0 1

0 0 00 1 10 0 0

Each rank-one factor corresponds to a community.

[W11] Wang et al., Community discovery using nonnegative matrix factorization, Data Min.Knowl. Disc. 22:493-521, 2011.[PRE11] Psorakis, Roberts, and Ebden, Overlapping community detection using Bayesiannon-negative matrix factorization, Physical Review E 83, 066114 (2011).[YL13] Yang and Leskovec, Overlapping Community Detection at Scale: A Nonnegative MatrixFactorization Approach, Proc. of the sixth ACM int. conf. on web search and data mining, 2013.

Conclusion

Low-rank matrix approximation can be used for linear dimensionalityreduction. It is a powerful tool for data analysis.

Nonnegativity renders the problem difficult.

However, it enhances significantly its applicability in many areas (byimproving interpretability), e.g., image processing, text mining,hyperspectral unmixing, clustering, community detection, andrecommender systems.

Still many open questions for NMF and related problems, e.g., subclass ofmatrices for which NMF can be computed efficiently (separable NMF),bounding and computing nonnegative ranks, non-uniquenesscharacterization, complexity when using other norms, etc.

Conclusion

Thank you for your attention!

Code and papers available onhttps://sites.google.com/site/nicolasgillis/

Recent Survey: ‘The Why and How of Nonnegative Matrix Factorization’,arXiv:1401.5226.

nonnegative matrix factorization - complexity, algorithms ... · nonnegative matrix factorization,...

Documents

nonnegative matrix factorization with local similarity

advances in nonnegative matrix and tensor factorization

nonnegative matrix factorization for clustering

computing a nonnegative matrix factorization...

nonnegative matrix factorization for real...

symmetric nonnegative matrix factorization for graph...

nonnegative matrix factorization for real...

bayesian nonnegative matrix factorization with volume...

nonnegative matrix factorization for...

itakura-saito nonnegative matrix factorization and friends...

nonnegative matrix factorization with the itakura-saito

lecture 3: nonnegative matrix factorization: algorithms and...

nonnegative matrix factorization: algorithms and...

incorporating prior information in nonnegative matrix...

hybrid projective nonnegative matrix factorization with...

robust collaborative nonnegative matrix factorization for...

efficient initialization for nonnegative matrix...

new algorithms for nonnegative matrix factorization and...

exploring nonnegative matrix factorization - stanford...