Dictionary Learning for Massive Matrix Factorization Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion Inria/CEA Parietal, Inria Thoth June 20, 2016


Post on 17-Jun-2020


Page 1:  · Dictionary Learning for Massive Matrix Factorization Author: Arthur Mensch, Julien Mairal,Gaël Varoquaux, Bertrand Thirion Created Date: 6/20/2016 2:33:40 PM

Dictionary Learning for Massive Matrix Factorization

Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion

Inria/CEA Parietal, Inria Thoth

June 20, 2016


Matrix factorization

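[Figure: $X$ ($p \times n$) = $D$ ($p \times k$) times $A$ ($k \times n$)]

$X \in \mathbb{R}^{p \times n} = DA \in \mathbb{R}^{p \times k} \times \mathbb{R}^{k \times n}$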

Flexible tool for unsupervised data analysis

The dataset has lower underlying complexity than its apparent size

How can it be scaled to very large datasets? (Brain imaging, 2 TB)
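As a small illustration of "lower underlying complexity than apparent size", the sketch below builds a matrix $X = DA$ and compares the number of entries of $X$ with the number of entries of its two factors. The sizes and the random factors are assumptions chosen for illustration only; they are not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 50, 200, 5  # illustrative sizes (assumed), far smaller than in the talk

D = rng.standard_normal((p, k))  # dictionary: p features x k components
A = rng.standard_normal((k, n))  # codes: k components x n samples
X = D @ A                        # data matrix, p x n, of rank at most k

rank = np.linalg.matrix_rank(X)
entries_full = p * n             # numbers needed to store X directly
entries_factors = k * (p + n)    # numbers needed to store D and A
print(rank, entries_full, entries_factors)
```

Storing the factors takes $k(p + n)$ numbers instead of $pn$, which is the sense in which the factorization compresses the data.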

Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 19


Matrix factorization

[Figure: $X$ ($p \times n$) = $D$ ($p \times k$) times $A$ ($k \times n$)]

Low-rank factorization: $k < p$

[Figure: $X$ ($p \times n$) = $D$ ($p \times k$) times $A$ ($k \times n$)]

... with optional sparse factors

→ interpretable data (fMRI, genetics, topic modeling)

[Figure: $X$ ($p \times n$) = $D$ ($p \times k$) times $A$ ($k \times n$)]

Overcomplete dictionary learning: $k \gg p$, sparse $A$ [Olshausen and Field, 1997]


Formalism and methods

Non-convex formulation

$\min_{D \in \mathcal{C},\, A \in \mathbb{R}^{k \times n}} \|X - DA\|_2^2 + \lambda \Omega(A)$

Constraints on $D$

Penalty on $A$ ($\ell_1$, $\ell_2$)

Naive resolution

Alternating minimization: uses the full $X$ at each iteration

Very slow: a single iteration costs $O(pn)$
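A minimal sketch of the naive alternating scheme, under assumptions that are not fixed by the slide: the penalty is $\Omega(A) = \|A\|_2^2$, the constraint set $\mathcal{C}$ is the unit $\ell_2$ ball for each column of $D$, and the function names are hypothetical. Both steps touch the whole matrix $X$, which is where the $O(pn)$ cost per iteration comes from.

```python
import numpy as np

def objective(X, D, A, lam):
    # ||X - DA||^2 + lam * ||A||^2 (squared Frobenius norms)
    return np.sum((X - D @ A) ** 2) + lam * np.sum(A ** 2)

def alternating_minimization(X, k, lam=0.1, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # columns inside the unit l2 ball
    history = []
    for _ in range(n_iter):
        # A-step: exact ridge regression on the full X
        A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)
        # D-step: one pass of block coordinate descent over the columns of D
        G = A @ A.T  # k x k
        B = X @ A.T  # p x k, reads the full X again
        for j in range(k):
            if G[j, j] > 1e-12:
                D[:, j] -= (D @ G[:, j] - B[:, j]) / G[j, j]
                D[:, j] /= max(np.linalg.norm(D[:, j]), 1.0)  # project on the ball
        history.append(objective(X, D, A, lam))
    return D, A, history

# Usage on synthetic low-rank data (illustrative sizes)
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 100))
D_hat, A_hat, history = alternating_minimization(X_demo, k=4)
print(history[0], history[-1])
```

With these choices each step is an exact minimization over its block, so the recorded objective should not increase from one iteration to the next.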


Online matrix factorization

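Stream $(x_t)$, update $D$ at each $t$ [Mairal et al., 2010]

Single iteration in $O(p)$, a few epochs

[Figure: streaming: each column $x_t$ of $X$ ($p \times n$) is factorized as $D \alpha_t$, with $D$ ($p \times k$)]

Large $n$, regular $p$, e.g. image patches:

$p = 256$, $n \approx 10^6$, 1 GB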

Both (sparse) low-rank factorization / sparse coding


Scaling-up for massive matrices

Functional MRI (HCP dataset)

Brain "movies": space × time

Extract $k$ sparse networks

$p = 2 \cdot 10^5$, $n = 2 \cdot 10^6$, 2 TB

Much larger than vision problems

Unusual setting: data is large in both directions

Also useful in collaborative filtering

[Figure: $X$ (voxels × time) = $D$ ($k$ spatial maps) times $A$ ($k$ × time)]


Scaling-up for massive matrices

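Out-of-the-box online algorithm?

[Figure: streaming: each column $x_t$ of $X$ ($p \times n$) is factorized as $D \alpha_t$, with $D$ ($p \times k$)]

Limited time budget? Need to accommodate large $p$

[Figure: result after 235 h run time (1 full epoch) vs. after 10 h run time (1/24 epoch)]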


Scaling-up in both directions

[Figure: batch → online: $X$ ($p \times n$) is streamed sample by sample as $x_t$, to handle large $n$]

[Figure: online → double online: each streamed $x_t$ is subsampled to $M_t x_t$, to handle large $p$]

Online learning + partial random access to samples


Scaling-up in both directions

[Figure: online → double online: each streamed $x_t$ is subsampled to $M_t x_t$, to handle large $p$]

Low-distortion lemma [Johnson and Lindenstrauss, 1984]

Randomized linear algebra [Halko et al., 2009]

Sketching for data reduction [Pilanci and Wainwright, 2014]


Algorithm design

Online dictionary learning [Mairal et al., 2010]

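1. Compute code – $O(p)$

$\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$

2. Update surrogate – $O(p)$

$g_t(D) = \frac{1}{t} \sum_{i=1}^{t} \|x_i - D\alpha_i\|_2^2$

3. Minimize surrogate – $O(p)$

$D_t = \operatorname{argmin}_{D \in \mathcal{C}} g_t(D) = \operatorname{argmin}_{D \in \mathcal{C}} \operatorname{Tr}(D^\top D A_t - 2 D^\top B_t)$

Access to $x_t$ → $O(p)$ algorithm (complexity depends on $p$)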
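The three steps can be sketched as follows. This is a simplified illustration in the spirit of [Mairal et al., 2010], not the authors' implementation: it assumes $\Omega = \|\cdot\|_2^2$ (so the code has a closed form), unit $\ell_2$ balls for the columns of $D$, and uses $A_t = \frac{1}{t}\sum_i \alpha_i \alpha_i^\top$ and $B_t = \frac{1}{t}\sum_i x_i \alpha_i^\top$ as the surrogate statistics. All function names are hypothetical.

```python
import numpy as np

def online_dictionary_learning(X, k, lam=0.1, n_epochs=2, seed=0):
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)
    A = np.zeros((k, k))  # running average of alpha alpha^T
    B = np.zeros((p, k))  # running average of x alpha^T
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            t += 1
            x = X[:, i]  # one streamed sample: O(p) access
            # 1. Compute code (ridge regression against the current dictionary)
            alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
            # 2. Update surrogate statistics
            A += (np.outer(alpha, alpha) - A) / t
            B += (np.outer(x, alpha) - B) / t
            # 3. Minimize surrogate: one pass of projected block coordinate descent
            for j in range(k):
                if A[j, j] > 1e-12:
                    D[:, j] -= (D @ A[:, j] - B[:, j]) / A[j, j]
                    D[:, j] /= max(np.linalg.norm(D[:, j]), 1.0)
    return D

def subspace_residual(X, D):
    # Relative error of the best least-squares reconstruction of X from D
    codes = np.linalg.lstsq(D, X, rcond=None)[0]
    return np.linalg.norm(X - D @ codes) / np.linalg.norm(X)

# Usage on synthetic low-rank data (illustrative sizes)
rng = np.random.default_rng(1)
p, n, k = 30, 500, 5
X_demo = rng.standard_normal((p, k)) @ rng.standard_normal((k, n))
X_demo += 0.01 * rng.standard_normal((p, n))
D_learned = online_dictionary_learning(X_demo, k=k)
D_random = rng.standard_normal((p, k))
print(subspace_residual(X_demo, D_learned), subspace_residual(X_demo, D_random))
```

Every step reads or writes all $p$ rows of $x_t$, $B_t$ and $D$, which is the $O(p)$ dependency the following slides aim to remove.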


Introducing subsampling

Iteration cost in $O(p)$: can we reduce it?

$x_t \to M_t x_t$, $\quad p \to \operatorname{rk} M_t = s$

Use only $M_t x_t$ in the algorithm computation: complexity in $O(s)$

[Figure: streaming + subsampling: each streamed sample is reduced to $M_t x_t$]

Our contribution

Adapt the 3 parts of the algorithm to obtain $O(s)$ complexity

1. Code computation

2. Surrogate update

3. Surrogate minimization

[Szabo et al., 2011]: dictionary learning with missing values – $O(p)$


1. Code computation

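Linear regression with random sampling

$\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|M_t(x_t - D_{t-1}\alpha)\|_2^2 + \lambda \frac{\operatorname{rk} M_t}{p} \Omega(\alpha)$

approximate solution of

$\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$

valid in high dimension, with incoherent features:

$D^\top M_t D \approx \frac{s}{p} D^\top D, \qquad D^\top M_t x_t \approx \frac{s}{p} D^\top x_t$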
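The approximation can be checked numerically. The sketch below assumes $\Omega = \|\cdot\|_2^2$ (ridge regression), a dense random dictionary as a stand-in for "incoherent features", and $M_t$ selecting $s$ rows uniformly at random; the sizes and the helper `ridge_code` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, s, lam = 20000, 10, 2000, 1.0  # illustrative sizes (assumed)

D = rng.standard_normal((p, k)) / np.sqrt(p)  # dense, incoherent atoms
x = D @ rng.standard_normal(k) + 0.01 * rng.standard_normal(p)

rows = rng.choice(p, size=s, replace=False)   # support of M_t, rk M_t = s

def ridge_code(D, x, lam):
    """argmin_alpha ||x - D alpha||^2 + lam * ||alpha||^2."""
    k = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)

alpha_full = ridge_code(D, x, lam)                     # reads all p rows
alpha_sub = ridge_code(D[rows], x[rows], lam * s / p)  # reads only s rows

# D^T M_t D, rescaled by p/s, compared with D^T D
gram_full = D.T @ D
gram_sub = (p / s) * (D[rows].T @ D[rows])
gram_rel_error = np.linalg.norm(gram_sub - gram_full) / np.linalg.norm(gram_full)
code_rel_error = np.linalg.norm(alpha_sub - alpha_full) / np.linalg.norm(alpha_full)
print(gram_rel_error, code_rel_error)
```

Scaling the penalty by $\frac{\operatorname{rk} M_t}{p}$ keeps the balance between data fit and regularization unchanged when only $s$ of the $p$ rows contribute to the data-fit term.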


2. Surrogate update

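Original algorithm: $A_t$ and $B_t$ used in dictionary update

$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top$ (same as in the online algorithm)

$B_t = \left(1 - \frac{1}{t}\right) B_{t-1} + \frac{1}{t} x_t \alpha_t^\top = \frac{1}{t} \sum_{i=1}^{t} x_i \alpha_i^\top$ (forbidden)

Partial update of $B_t$ at each iteration

$B_t = \left(\sum_{i=1}^{t} M_i\right)^{-1} \sum_{i=1}^{t} M_i x_i \alpha_i^\top$

Only $M_t B$ is updated

Behaves like $\mathbb{E}_x[x \alpha^\top]$ for large $t$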
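Since each $M_i$ is a diagonal 0/1 mask, the partial update amounts to keeping one counter per feature and a running mean per row of $B$. The sketch below is one way to write it; the function name is hypothetical, and the random codes in the check are only there to exercise the recursion (in the algorithm $\alpha_t$ comes from the code computation step).

```python
import numpy as np

def partial_update_B(B, counts, rows, x_rows, alpha):
    """Update only the rows of B selected by M_t (in place).

    rows must contain distinct indices; counts[r] is the number of times
    feature r has been observed so far.
    """
    counts[rows] += 1
    B[rows] += (np.outer(x_rows, alpha) - B[rows]) / counts[rows][:, None]
    return B, counts

# Check against the closed form B_t = (sum_i M_i)^{-1} sum_i M_i x_i alpha_i^T
rng = np.random.default_rng(0)
p, k, s, T = 40, 3, 10, 200
B = np.zeros((p, k))
counts = np.zeros(p, dtype=int)
numerator = np.zeros((p, k))
for t in range(T):
    x = rng.standard_normal(p)
    alpha = rng.standard_normal(k)
    rows = rng.choice(p, size=s, replace=False)
    partial_update_B(B, counts, rows, x[rows], alpha)  # O(s k) work
    numerator[rows] += np.outer(x[rows], alpha)
B_closed_form = numerator / np.maximum(counts, 1)[:, None]
print(np.allclose(B, B_closed_form))
```

Each call touches $s$ rows of $B$ instead of all $p$, which is what brings the surrogate update down to $O(s)$.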


3. Surrogate minimization

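Original algorithm: block coordinate descent with projection on $\mathcal{C}$

$\min_{D \in \mathcal{C}} g_t(D)$: $\quad D_j \leftarrow p^\perp_{\mathcal{C}_j}\left(D_j - \frac{1}{(A_t)_{j,j}} \big(D (A_t)_j - (B_t)_j\big)\right)$

Forbidden: update of the full $D$ at iteration $t$

Cautious update

Leave the dictionary unchanged for unseen features $(I - M_t)$

$\min_{D \in \mathcal{C},\ (I - M_t) D = (I - M_t) D_{t-1}} g_t(D)$

$O(s)$ update in block coordinate descent

$D_j \leftarrow p^\perp_{\mathcal{C}^r_j}\left(D_j - \frac{1}{(A_t)_{j,j}} M_t \big(D (A_t)_j - (B_t)_j\big)\right)$

$\ell_1$ ball: $\mathcal{C}^r_j = \{D \in \mathcal{C},\ \|M_t D\|_1 \le \|M_t D_{t-1}\|_1\}$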
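A possible reading of the $O(s)$ update, written as a sketch: only the rows selected by $M_t$ are changed, and for each column the updated entries are projected onto an $\ell_1$ ball whose radius is the $\ell_1$ norm these entries had before the update. The projection uses the standard sort-based algorithm; both function names are hypothetical and the random $A$, $B$ in the usage are placeholders.

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {u : ||u||_1 <= radius}."""
    if radius <= 0:
        return np.zeros_like(v)
    a = np.abs(v)
    if a.sum() <= radius:
        return v.copy()
    u = np.sort(a)[::-1]                 # magnitudes, decreasing
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)
    return np.sign(v) * np.maximum(a - theta, 0.0)  # soft-thresholding

def masked_dictionary_update(D, A, B, rows):
    """One pass of block coordinate descent restricted to the rows in M_t."""
    D = D.copy()
    for j in range(D.shape[1]):
        if A[j, j] <= 1e-12:
            continue
        radius = np.abs(D[rows, j]).sum()            # ||M_t D_j||_1 before update
        step = (D[rows] @ A[:, j] - B[rows, j]) / A[j, j]
        D[rows, j] = project_l1_ball(D[rows, j] - step, radius)
    return D

# Usage with placeholder statistics (illustrative sizes)
rng = np.random.default_rng(0)
p, k, s = 50, 4, 10
D0 = rng.standard_normal((p, k))
codes = rng.standard_normal((k, 100))
A = codes @ codes.T / 100
B = rng.standard_normal((p, k))
rows = rng.choice(p, size=s, replace=False)
D1 = masked_dictionary_update(D0, A, B, rows)
unseen = np.setdiff1d(np.arange(p), rows)
print(np.array_equal(D1[unseen], D0[unseen]))
```

Because the restricted $\ell_1$ norm of each column cannot grow and the unseen rows are untouched, a column that starts inside the $\ell_1$ ball stays inside it without ever reading its $p - s$ other entries.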


Resting-state fMRI

HCP dataset

One brain image per second

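200 sessions, $n = 2 \cdot 10^6$, $p = 2 \cdot 10^5$

[Figure: $X$ (voxels × time) = $D$ ($k$ spatial maps) times $A$ ($k$ × time)]

Sparse decomposition: $k = 70$, $\mathcal{C} = \mathcal{B}_1^k$, $\Omega = \|\cdot\|_2^2$

Validation

Increase reduction factor $\frac{p}{s}$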

Objective function on test set vs CPU time


Resting-state fMRI

[Figure: maps obtained in three settings]

Online dictionary learning: 235 h run time, 1 full epoch

Online dictionary learning: 10 h run time, 1/24 epoch

Proposed method: 10 h run time, 1/2 epoch, reduction r = 12

Qualitatively, usable maps are obtained 10× faster
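The epoch fractions on this slide follow from the run times by simple arithmetic. The sketch below assumes an idealized model in which the time per sample scales exactly as $1/r$ under reduction; the variable names are illustrative.

```python
full_epoch_hours = 235.0  # run time of one full epoch, online dictionary learning
budget_hours = 10.0       # time budget
r = 12                    # reduction factor p / s

# Fraction of an epoch covered within the budget, without reduction
fraction_online = budget_hours / full_epoch_hours

# Same budget, assuming each sample costs r times less to process
fraction_proposed = r * budget_hours / full_epoch_hours

print(1 / fraction_online)  # about 23.5, i.e. roughly 1/24 of an epoch
print(fraction_proposed)    # about 0.51, i.e. roughly 1/2 epoch
```

In the proposed method a sample counts as seen even though only $s = p / r$ of its features are loaded.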


Resting-state fMRI

[Figure: objective value on test set (×10^8, from 2.20 to 2.40) vs. CPU time (0.1 h to 100 h), for the original online algorithm (no reduction) and for reduction factors r = 4, 8, 12]

Speed-up close to reduction factor $\frac{p}{s}$


Resting-state fMRI

[Figure: objective value on test set (×10^8, from 2.16 to 2.24) vs. number of seen records (100 to 4000, with one epoch marked), $\lambda = 10^{-4}$, for no reduction (original alg.) and r = 4, 8, 12]

Convergence speed / number of seen records

Information is acquired faster


Collaborative filtering

$M_t x_t$: movie ratings from user $t$

vs. coordinate descent for MMMF loss (no hyperparameters)

[Figure: test RMSE (0.93 to 0.99) vs. CPU time (100 s to 1000 s) on Netflix (140M), for coordinate descent, proposed (full projection) and proposed (partial projection)]

Dataset      Test RMSE (CD)   Test RMSE (MODL)   Speed-up
ML 1M        0.872            0.866              ×0.75
ML 10M       0.802            0.799              ×3.7
NF (140M)    0.938            0.934              ×6.8

Outperforms coordinate descent beyond 10M ratings

Same prediction performance

Speed-up 6.8× on Netflix


Conclusion

Take-home message

Loading stochastic subsets of sample streams can drastically accelerate online matrix factorization

[Figure: streaming + subsampling: each streamed sample is reduced to $M_t x_t$]

Reduce CPU (+IO) load at each iteration

cf. gradient descent vs. SGD

An order of magnitude speed-up on two different problems

Python package http://github.com/arthurmensch/modl

Heuristic at contribution time

A follow-up algorithm has convergence guarantees

Questions? (Poster #41 this afternoon)


Bibliography I

[Halko et al., 2009] Halko, N., Martinsson, P.-G., and Tropp, J. A. (2009). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. arXiv:0909.4061 [math].

[Johnson and Lindenstrauss, 1984] Johnson, W. B. and Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206.

[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research, 11:19–60.

[Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325.


Bibliography II

[Pilanci and Wainwright, 2014] Pilanci, M. and Wainwright, M. J. (2014). Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. arXiv:1411.0347 [cs, math, stat].

[Szabo et al., 2011] Szabo, Z., Poczos, B., and Lorincz, A. (2011). Online group-structured dictionary learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2865–2872. IEEE.

[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012). Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In Proceedings of the International Conference on Data Mining, pages 765–774. IEEE.


Appendix


Collaborative filtering

Streaming incomplete data

$M_t$ is imposed by user $t$

Data stream: $M_t x_t$, movies rated by user $t$

Proposed by [Szabo et al., 2011], with $O(p)$ complexity

Validation: test RMSE (rating prediction) vs. CPU time

Baseline: coordinate descent solver [Yu et al., 2012] solving a related loss

$\sum_{t=1}^{n} \left( \|M_t(x_t - D\alpha_t)\|_2^2 + \lambda \|\alpha_t\|_2^2 \right) + \lambda \|D\|_2^2$

Fastest solver available apart from SGD – no hyperparameters

Our method is not sensitive to hyperparameters


Algorithm

Our algorithm

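1. Code computation

$\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|M_t(x_t - D_{t-1}\alpha)\|_2^2 + \lambda \frac{\operatorname{rk} M_t}{p} \Omega(\alpha)$

2. Surrogate aggregation

$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top$

$B_t = B_{t-1} + \left(\sum_{i=1}^{t} M_i\right)^{-1} \left(M_t x_t \alpha_t^\top - M_t B_{t-1}\right)$

3. Surrogate minimization

$M_t D_j \leftarrow p^\perp_{\mathcal{C}^r_j}\left(M_t D_j - \frac{1}{(A_t)_{j,j}} M_t \big(D (A_t)_j - (B_t)_j\big)\right)$

Original online MF

1. Code computation

$\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$

2. Surrogate aggregation

$A_t = \frac{1}{t} \sum_{i=1}^{t} \alpha_i \alpha_i^\top$

$B_t = B_{t-1} + \frac{1}{t} \left(x_t \alpha_t^\top - B_{t-1}\right)$

3. Surrogate minimization

$D_j \leftarrow p^\perp_{\mathcal{C}_j}\left(D_j - \frac{1}{(A_t)_{j,j}} \big(D (A_t)_j - (B_t)_j\big)\right)$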
