
Dictionary Learning for Massive Matrix Factorization

Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion

Inria/CEA Parietal, Inria Thoth

June 20, 2016

Matrix factorization

[Diagram: X (p × n) = D (p × k) × A (k × n)]

X = DA, with X ∈ R^{p×n}, D ∈ R^{p×k}, A ∈ R^{k×n}

Flexible tool for unsupervised data analysis

The dataset has lower underlying complexity than its apparent size

How to scale it to very large datasets? (brain imaging, 2 TB)
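As a concrete sketch of the factorization above, the toy example below (ours, not from the talk) builds a matrix X whose underlying complexity is much lower than its size and recovers a rank-k approximation X ≈ DA; here a truncated SVD stands in for the dictionary-learning estimator, purely for illustration.

```python
import numpy as np

# Synthetic X (p x n) with true rank k plus small noise.
rng = np.random.default_rng(0)
p, n, k = 100, 500, 5
X = rng.standard_normal((p, k)) @ rng.standard_normal((k, n))
X += 0.01 * rng.standard_normal((p, n))

# Rank-k factorization X ~ D @ A via truncated SVD (illustrative stand-in
# for dictionary learning).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
D = U[:, :k]              # dictionary, p x k
A = s[:k, None] * Vt[:k]  # codes, k x n

# Relative error is small: k components capture almost all the energy.
rel_err = np.linalg.norm(X - D @ A) / np.linalg.norm(X)
```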

Arthur Mensch Dictionary Learning for Massive Matrix Factorization 1 / 19

Matrix factorization

[Diagram: X (p × n) = D (p × k) × A (k × n)]

Low-rank factorization: k < p

[Diagram: X (p × n) = D (p × k) × A (k × n)]

...with optional sparse factors

→ interpretable data (fMRI, genetics, topic modeling)

[Diagram: X (p × n) = D (p × k) × A (k × n)]

Overcomplete dictionary learning: k > p, sparse A [Olshausen and Field, 1997]


Formalism and methods

Non-convex formulation

min_{D ∈ C, A ∈ R^{k×n}} ‖X − DA‖₂² + λΩ(A)

Constraints on D

Penalty on A (ℓ₁, ℓ₂)

Naive resolution

Alternating minimization: uses the full X at each iteration

Very slow: a single iteration costs O(pn)
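The naive scheme can be sketched in a few lines; this toy version (ours) uses a squared-ℓ2 penalty Ω so that both half-steps are closed-form, and projects dictionary columns on the unit ball as a simple stand-in for the constraint set C. Note that every iteration touches the entire matrix X, which is the O(pn) cost flagged above.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k, lam = 50, 200, 5, 0.1
X = rng.standard_normal((p, k)) @ rng.standard_normal((k, n))

D = rng.standard_normal((p, k))
D /= np.linalg.norm(D, axis=0)
for _ in range(20):
    # Code step (ridge regression, closed form): uses all n columns of X.
    A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)
    # Dictionary step: least squares in D, then projection of each column
    # on the unit l2 ball (our simple stand-in for the constraint set C).
    D = np.linalg.lstsq(A.T, X.T, rcond=None)[0].T
    D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)

# Final codes and relative reconstruction error.
A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)
rel_err = np.linalg.norm(X - D @ A) / np.linalg.norm(X)
```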


Online matrix factorization

Stream (xt), update D at each t [Mairal et al., 2010]

Single iteration in O(p), a few epochs

[Diagram: stream columns x_t of X; dictionary D (p × k), codes α_t]

Large n, moderate p, e.g. image patches: p = 256, n ≈ 10⁶, 1 GB

Both (sparse) low-rank factorization / sparse coding


Scaling-up for massive matrices

Functional MRI (HCP dataset)

Brain “movies”: space × time

Extract k sparse networks

p = 2 · 10⁵, n = 2 · 10⁶, 2 TB

Way larger than vision problems

Unusual setting: data is large in both directions

Also useful in collaborative filtering

[Diagram: X (voxels × time) = D (k spatial maps) × A (time courses)]


Scaling-up for massive matrices

Out-of-the-box online algorithm?

[Diagram: stream columns x_t of X; dictionary D, codes α_t]

Limited time budget? Need to accommodate large p

235 h run time → 1 full epoch

10 h run time → 1/24 epoch


Scaling-up in both directions

[Diagram: batch → online — stream columns x_t of X: handles large n]

[Diagram: online → double online — stream subsampled columns M_t x_t: handles large p]

Online learning + partial random access to samples


Scaling-up in both directions

[Diagram: online → double online — streaming + subsampling M_t x_t]

Low-distortion lemma [Johnson and Lindenstrauss, 1984]

Randomized linear algebra [Halko et al., 2009]

Sketching for data reduction [Pilanci and Wainwright, 2014]


Algorithm design

Online dictionary learning [Mairal et al., 2010]

1. Compute code – O(p)

α_t = argmin_{α ∈ R^k} ‖x_t − D_{t−1}α‖₂² + λΩ(α)

2. Update surrogate – O(p)

g_t(D) = (1/t) Σ_{i=1}^t ‖x_i − Dα_i‖₂²

3. Minimize surrogate – O(p)

D_t = argmin_{D ∈ C} g_t(D) = argmin_{D ∈ C} Tr(D^⊤D A_t − D^⊤B_t)

Access to x_t → O(p) algorithm (complexity depends on p)
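The three steps above can be sketched on a toy stream as follows (our sketch, not the reference implementation). We take a squared-ℓ2 penalty Ω so that step 1 is a closed-form ridge solve (the original algorithm typically uses ℓ1), and the constraint C is the unit ℓ2 ball per column; A_t and B_t are kept as unnormalized running sums, which leaves the block coordinate descent update unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k, lam = 40, 1000, 5, 0.1
X = rng.standard_normal((p, k)) @ rng.standard_normal((k, n))

D = rng.standard_normal((p, k))
D /= np.linalg.norm(D, axis=0)
D0 = D.copy()                # keep the random initialization for comparison
A_stat = np.zeros((k, k))    # running sum of alpha_i alpha_i^T (A_t up to 1/t)
B_stat = np.zeros((p, k))    # running sum of x_i alpha_i^T     (B_t up to 1/t)
for t in range(n):
    x = X[:, t]
    # 1. Code computation -- O(p): ridge regression on the current dictionary.
    alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
    # 2. Surrogate update -- O(p).
    A_stat += np.outer(alpha, alpha)
    B_stat += np.outer(x, alpha)
    # 3. Surrogate minimization -- O(p): one block coordinate descent pass
    # over dictionary columns, each projected on the unit l2 ball.
    for j in range(k):
        D[:, j] += (B_stat[:, j] - D @ A_stat[:, j]) / A_stat[j, j]
        D[:, j] /= max(np.linalg.norm(D[:, j]), 1.0)

def fit_error(Dd):
    """Relative reconstruction error of X with ridge codes for Dd."""
    A = np.linalg.solve(Dd.T @ Dd + lam * np.eye(k), Dd.T @ X)
    return np.linalg.norm(X - Dd @ A) / np.linalg.norm(X)
```

After one pass over the stream, the learned dictionary reconstructs X far better than the random initialization, without ever forming the full batch problem.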


Introducing subsampling

Iteration cost in O(p): can we reduce it?

x_t → M_t x_t,   p → rk M_t = s

Use only M_t x_t in the algorithm's computations: complexity in O(s)

[Diagram: streaming + subsampling M_t x_t]

Our contribution

Adapt the 3 parts of the algorithm to obtain O(s) complexity:

1. Code computation

2. Surrogate update

3. Surrogate minimization

[Szabo et al., 2011]: dictionary learning with missing values – O(p)


1. Code computation

Linear regression with random sampling

α_t = argmin_{α ∈ R^k} ‖M_t(x_t − D_{t−1}α)‖₂² + λ (rk M_t / p) Ω(α)

approximate solution of

α_t = argmin_{α ∈ R^k} ‖x_t − D_{t−1}α‖₂² + λΩ(α)

Validity in high dimension, with incoherent features:

D^⊤M_tD ≈ (s/p) D^⊤D,   D^⊤M_tx_t ≈ (s/p) D^⊤x_t
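The approximations above are easy to check numerically. In this toy experiment (ours), a random mask keeping s of p coordinates gives a masked Gram matrix close to (s/p) D^⊤D, so the masked ridge code with the rescaled penalty (Ω = squared ℓ2 here, so both problems are closed-form solves) lands near the full-data code.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, s, lam = 10000, 5, 1000, 0.1
D = rng.standard_normal((p, k)) / np.sqrt(p)   # unit-norm, incoherent columns
alpha_true = rng.standard_normal(k)
x = D @ alpha_true + 0.01 * rng.standard_normal(p)  # signal + small noise

idx = rng.choice(p, size=s, replace=False)
m = np.zeros(p)
m[idx] = 1.0                                    # diagonal of the mask M_t

# Gram-matrix approximation: D^T M_t D vs (s/p) D^T D.
G_sub = D.T @ (m[:, None] * D)
G_scaled = (s / p) * (D.T @ D)
rel_gram = np.linalg.norm(G_sub - G_scaled) / np.linalg.norm(G_scaled)

# Masked code with the penalty rescaled by rk(M_t)/p = s/p, as on the slide,
# vs the full-data code.
alpha_sub = np.linalg.solve(G_sub + lam * (s / p) * np.eye(k), D.T @ (m * x))
alpha_full = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
rel_alpha = np.linalg.norm(alpha_sub - alpha_full) / np.linalg.norm(alpha_full)
```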


2. Surrogate update

Original algorithm: A_t and B_t are used in the dictionary update

A_t = (1/t) Σ_{i=1}^t α_iα_i^⊤ — same as in the online algorithm

B_t = (1 − 1/t) B_{t−1} + (1/t) x_tα_t^⊤ = (1/t) Σ_{i=1}^t x_iα_i^⊤ — forbidden: requires the full x_t

Partial update of B_t at each iteration:

B_t = (Σ_{i=1}^t M_i)^{−1} Σ_{i=1}^t M_ix_iα_i^⊤

Only M_tB_t is updated

Behaves like E_x[xα^⊤] for large t
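The masked running average can be implemented row-wise: coordinate j of B is normalized by the number of times it was observed, so step t only touches the rows selected by M_t. A toy simulation (ours, with a noise-free stream x_t = Dα_t and Gaussian codes) shows B converging to E[xα^⊤] = D:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, T, s = 200, 4, 2000, 50

D_true = rng.standard_normal((p, k))
counts = np.zeros(p)          # per-row observation counts: sum_i diag(M_i)
B = np.zeros((p, k))
for t in range(T):
    alpha = rng.standard_normal(k)       # code, alpha ~ N(0, I)
    x = D_true @ alpha                   # noise-free toy sample
    idx = rng.choice(p, size=s, replace=False)   # rows observed by M_t
    counts[idx] += 1
    # Running mean restricted to the observed rows -- O(s k) per step.
    B[idx] += (x[idx, None] * alpha[None, :] - B[idx]) / counts[idx, None]

# For large t, B behaves like E[x alpha^T] = D_true E[alpha alpha^T] = D_true.
rel = np.linalg.norm(B - D_true) / np.linalg.norm(D_true)
```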


3. Surrogate minimization

Original algorithm: block coordinate descent with projection on C

min_{D ∈ C} g_t(D):   D_j ← proj_{C_j}(D_j − (1/(A_t)_{j,j}) (D(A_t)_j − (B_t)_j))

Updating the full D at iteration t is forbidden (costs O(p))

Cautious update

Leave the dictionary unchanged for unseen features (I − M_t):

min_{D ∈ C, (I−M_t)D = (I−M_t)D_{t−1}} g_t(D)

O(s) update in block coordinate descent:

D_j ← proj_{C_j^r}(D_j − (1/(A_t)_{j,j}) M_t(D(A_t)_j − (B_t)_j))

ℓ₁ ball: C_j^r = {D ∈ C : ‖M_tD‖₁ ≤ ‖M_tD_{t−1}‖₁}
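A minimal sketch (ours) of this cautious step: one block coordinate descent pass that only touches the s rows selected by M_t and leaves unseen features exactly unchanged. The norm cap below is a simplified ℓ2 stand-in for the ℓ1-ball constraint C_j^r of the slide, and the statistics A, B are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, s = 300, 5, 30
D = rng.standard_normal((p, k))
D /= np.linalg.norm(D, axis=0)
A = rng.standard_normal((k, k))
A = A @ A.T + k * np.eye(k)       # synthetic surrogate statistics A_t (SPD)
B = rng.standard_normal((p, k))   # synthetic surrogate statistics B_t

idx = rng.choice(p, size=s, replace=False)     # rows seen by M_t
unseen = np.setdiff1d(np.arange(p), idx)
D_old = D.copy()
for j in range(k):
    # Gradient step restricted to the observed rows: O(s) per column.
    D[idx, j] += (B[idx, j] - D[idx] @ A[:, j]) / A[j, j]
    # Constraint enforced on M_t D only: do not let the norm of the
    # touched rows grow (simplified l2 stand-in for C_j^r).
    new, old = np.linalg.norm(D[idx, j]), np.linalg.norm(D_old[idx, j])
    if new > old:
        D[idx, j] *= old / new
```

Rows outside `idx` are never read or written, which is where the O(s) iteration cost comes from.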


Resting-state fMRI

HCP dataset

One brain image per second

200 sessions, n = 2 · 10⁶, p = 2 · 10⁵

[Diagram: X (voxels × time) = D (k spatial maps) × A (time courses)]

Sparse decomposition: k = 70, C = B₁^k (ℓ₁ ball), Ω = ‖·‖₂²

Validation

Increase the reduction factor p/s

Objective function on a test set vs CPU time


Resting-state fMRI

Online dictionary learning:

235 h run time → 1 full epoch

10 h run time → 1/24 epoch

Proposed method:

10 h run time → 1/2 epoch, reduction r = 12

Qualitatively, usable maps are obtained 10× faster


Resting-state fMRI

[Plot: objective value on the test set (×10⁸) vs CPU time (0.1 h to 100 h), original online algorithm vs reduction factors r = 4, 8, 12]

Speed-up close to the reduction factor p/s


Resting-state fMRI

[Plot: objective value on the test set (×10⁸) vs number of seen records, λ = 10⁻⁴; no reduction (original alg.) vs r = 4, 8, 12]

Convergence speed / number of seen records

Information is acquired faster


Collaborative filtering

M_t x_t: movie ratings from user t

vs. coordinate descent for the MMMF loss (no hyperparameters)

[Plot: test RMSE vs CPU time on Netflix (140M ratings): coordinate descent vs proposed method (full and partial projection)]

Dataset      Test RMSE (CD)   Test RMSE (MODL)   Speed-up
ML 1M        0.872            0.866              ×0.75
ML 10M       0.802            0.799              ×3.7
NF (140M)    0.938            0.934              ×6.8

Outperforms coordinate descent beyond 10M ratings

Same prediction performance

Speed-up 6.8× on Netflix


Conclusion

Take-home message

Loading stochastic subsets of sample streams can drastically accelerate online matrix factorization

[Diagram: streaming + subsampling M_t x_t]

Reduce CPU (+ I/O) load at each iteration

cf. gradient descent vs SGD

An order of magnitude speed-up on two different problems

Python package http://github.com/arthurmensch/modl

Heuristic at contribution time

A follow-up algorithm has convergence guarantees

Questions? (Poster #41 this afternoon)


Bibliography I

[Halko et al., 2009] Halko, N., Martinsson, P.-G., and Tropp, J. A. (2009).

Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.

arXiv:0909.4061 [math].

[Johnson and Lindenstrauss, 1984] Johnson, W. B. and Lindenstrauss, J. (1984).

Extensions of Lipschitz mappings into a Hilbert space.

Contemporary mathematics, 26(189-206):1.

[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010).

Online learning for matrix factorization and sparse coding.

The Journal of Machine Learning Research, 11:19–60.

[Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997).

Sparse coding with an overcomplete basis set: A strategy employed by V1?

Vision Research, 37(23):3311–3325.


Bibliography II

[Pilanci and Wainwright, 2014] Pilanci, M. and Wainwright, M. J. (2014).

Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares.

arXiv:1411.0347 [cs, math, stat].

[Szabo et al., 2011] Szabo, Z., Poczos, B., and Lorincz, A. (2011).

Online group-structured dictionary learning.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2865–2872. IEEE.

[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012).

Scalable coordinate descent approaches to parallel matrix factorization for recommender systems.

In Proceedings of the International Conference on Data Mining, pages 765–774. IEEE.


Appendix

Collaborative filtering

Streaming incomplete data

M_t is imposed by user t

Data stream: M_t x_t, the movies rated by user t

Proposed by [Szabo et al., 2011], with O(p) complexity

Validation: test RMSE (rating prediction) vs CPU time

Baseline: coordinate descent solver [Yu et al., 2012] solving the related loss

Σ_{t=1}^n (‖M_t(x_t − Dα_t)‖₂² + λ‖α_t‖₂²) + λ‖D‖₂²

Fastest solver available apart from SGD – no hyperparameters

Our method is not sensitive to hyperparameters


Algorithm

Our algorithm

Our algorithm

1. Code computation:

α_t = argmin_{α ∈ R^k} ‖M_t(x_t − D_{t−1}α)‖₂² + λ (rk M_t / p) Ω(α)

2. Surrogate aggregation:

A_t = (1/t) Σ_{i=1}^t α_iα_i^⊤

B_t = B_{t−1} + (Σ_{i=1}^t M_i)^{−1} (M_tx_tα_t^⊤ − M_tB_{t−1})

3. Surrogate minimization:

M_tD_j ← proj_{C_j}(M_tD_j − (1/(A_t)_{j,j}) M_t(D(A_t)_j − (B_t)_j))

Original online MF

1. Code computation:

α_t = argmin_{α ∈ R^k} ‖x_t − D_{t−1}α‖₂² + λΩ(α)

2. Surrogate aggregation:

A_t = (1/t) Σ_{i=1}^t α_iα_i^⊤

B_t = B_{t−1} + (1/t)(x_tα_t^⊤ − B_{t−1})

3. Surrogate minimization:

D_j ← proj_{C_j}(D_j − (1/(A_t)_{j,j}) (D(A_t)_j − (B_t)_j))

