
Dictionary Learning for Massive Matrix Factorization

Arthur Mensch, Julien Mairal, Gaël Varoquaux, Bertrand Thirion

Inria/CEA Parietal, Inria Thoth

June 20, 2016

Matrix factorization

$X = DA$, with $X \in \mathbb{R}^{p \times n}$, $D \in \mathbb{R}^{p \times k}$, $A \in \mathbb{R}^{k \times n}$

Flexible tool for unsupervised data analysis

Dataset has lower underlying complexity than appearing size

How to scale it to very large datasets? (Brain imaging, 2 TB)


Matrix factorization

Low-rank factorization: $k < p$

... with optional sparse factors

→ interpretable data (fMRI, genetics, topic modeling)

Overcomplete dictionary learning: $k \gg p$, sparse $A$ [Olshausen and Field, 1997]


Formalism and methods

Non-convex formulation

$\min_{D \in \mathcal{C},\, A \in \mathbb{R}^{k \times n}} \|X - DA\|_2^2 + \lambda \Omega(A)$

Constraints on $D$

Penalty on $A$ ($\ell_1$, $\ell_2$)

Naive resolution

Alternating minimization: uses the full $X$ at each iteration

Very slow: a single iteration costs $O(pn)$
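The naive scheme can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a ridge penalty $\Omega(A) = \|A\|_F^2$ (so the code step has a closed form) and takes $\mathcal{C}$ to be unit $\ell_2$ balls on the columns of $D$, with the D-step simplified to an unconstrained least-squares solve followed by a rescaling into the ball. The function names are hypothetical. Both steps read all of $X$, which is where the $O(pn)$ per-iteration cost comes from.

```python
import numpy as np


def alternating_minimization(X, k, lam=0.1, n_iter=20, seed=0):
    """Naive alternating minimization of ||X - DA||^2 + lam * ||A||^2.

    Assumptions (not from the slides): ridge penalty on A, and columns of D
    kept inside the unit l2 ball by rescaling after a least-squares step.
    """
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        # A-step: ridge regression of every column of X on D (touches all of X).
        A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)
        # D-step: least squares in D for fixed A (touches all of X again).
        D = np.linalg.solve(A @ A.T + 1e-10 * np.eye(k), A @ X.T).T
        # Rescale columns whose norm exceeds 1 (simplified constraint handling).
        D /= np.maximum(1.0, np.linalg.norm(D, axis=0))
    # Final code for the returned dictionary.
    A = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ X)
    return D, A


def objective(X, D, A, lam):
    """Value of ||X - DA||_F^2 + lam * ||A||_F^2."""
    return np.linalg.norm(X - D @ A) ** 2 + lam * np.linalg.norm(A) ** 2
```

Every iteration multiplies against the full $p \times n$ matrix twice, which is what becomes impractical when $X$ no longer fits in memory.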


Online matrix factorization

Stream (xt), update D at each t [Mairal et al., 2010]

Single iteration in O(p), a few epochs

[Figure: columns $x_t$ of $X$ ($p \times n$) are streamed, with $x_t = D\alpha_t$]

Large $n$, regular $p$, e.g. image patches:

$p = 256$, $n \approx 10^6$, 1 GB

Handles both (sparse) low-rank factorization and sparse coding


Scaling-up for massive matrices

Functional MRI (HCP dataset)

Brain “movies” : space × time

Extract k sparse networks

$p = 2 \cdot 10^5$, $n = 2 \cdot 10^6$, 2 TB

Way larger than vision problems

Unusual setting: data is large in both directions

Also useful in collaborative filtering

[Figure: $X$ (voxels × time) = $D$ ($k$ spatial maps) × $A$ (time)]
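As a rough check of the sizes quoted in these slides, assume dense storage in 4-byte single-precision floats (the storage format is an assumption, it is not stated in the talk):

```latex
\begin{align*}
\text{image patches:}\quad & p \cdot n \cdot 4\,\text{B} = 256 \times 10^{6} \times 4\,\text{B} \approx 1.0 \times 10^{9}\,\text{B} \approx 1\,\text{GB} \\
\text{HCP fMRI:}\quad & p \cdot n \cdot 4\,\text{B} = (2 \cdot 10^{5}) \times (2 \cdot 10^{6}) \times 4\,\text{B} = 1.6 \times 10^{12}\,\text{B} \approx 1.6\,\text{TB}
\end{align*}
```

Under that assumption both estimates are of the same order as the quoted 1 GB and 2 TB, and the fMRI matrix is roughly three orders of magnitude larger than the vision example.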


Scaling-up for massive matrices

Out-of-the-box online algorithm?

Limited time budget? Need to accommodate large $p$

235 h run time: 1 full epoch

10 h run time: 1/24 epoch


Scaling-up in both directions

Batch → online: stream the columns $x_t$ of $X$ to handle large $n$

Online → double online: stream subsampled columns $M_t x_t$ to handle large $p$

Online learning + partial random access to samples


Scaling-up in both directions

Low-distortion lemma [Johnson and Lindenstrauss, 1984]

Randomized linear algebra [Halko et al., 2009]

Sketching for data reduction [Pilanci and Wainwright, 2014]


Algorithm design

Online dictionary learning [Mairal et al., 2010]

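1. Compute code – $O(p)$

$\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$

2. Update surrogate – $O(p)$

$g_t(D) = \frac{1}{t} \sum_{i=1}^{t} \|x_i - D\alpha_i\|_2^2$

3. Minimize surrogate – $O(p)$

$D_t = \operatorname{argmin}_{D \in \mathcal{C}} g_t(D) = \operatorname{argmin}_{D \in \mathcal{C}} \operatorname{Tr}(D^\top D A_t - 2 D^\top B_t)$, with $A_t = \frac{1}{t}\sum_{i=1}^{t} \alpha_i\alpha_i^\top$ and $B_t = \frac{1}{t}\sum_{i=1}^{t} x_i\alpha_i^\top$

Each step accesses the full $x_t$, so the per-iteration complexity scales with $p$.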
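The three steps can be sketched in a few lines of NumPy. This is an illustrative sketch in the spirit of [Mairal et al., 2010], not their implementation: it assumes a ridge penalty $\Omega(\alpha) = \|\alpha\|_2^2$ so the code has a closed form, unit $\ell_2$ balls for $\mathcal{C}$, and a single pass of block coordinate descent per sample. The function name is hypothetical.

```python
import numpy as np


def online_dictionary_learning(X, k, lam=0.1, seed=0):
    """One epoch of online matrix factorization over the columns of X.

    Assumptions (not from the slides): ridge penalty on the code and
    unit l2-ball constraint on each dictionary column.
    """
    rng = np.random.default_rng(seed)
    p, n = X.shape
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, k))  # running average of alpha alpha^T
    B = np.zeros((p, k))  # running average of x alpha^T
    for t, idx in enumerate(rng.permutation(n), start=1):
        x = X[:, idx]
        # 1. Code computation: reads all p entries of x.
        alpha = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ x)
        # 2. Surrogate update: running means, B costs O(p k).
        A += (np.outer(alpha, alpha) - A) / t
        B += (np.outer(x, alpha) - B) / t
        # 3. Surrogate minimization: block coordinate descent on columns of D.
        for j in range(k):
            if A[j, j] < 1e-12:
                continue  # column never used so far, leave it unchanged
            D[:, j] -= (D @ A[:, j] - B[:, j]) / A[j, j]
            D[:, j] /= max(1.0, np.linalg.norm(D[:, j]))  # project on l2 ball
    return D
```

All three steps manipulate vectors or matrices with $p$ rows, which is the dependency in $p$ that subsampling targets.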


Introducing subsampling

Iteration cost in O(p): can we reduce it?

$x_t \to M_t x_t$, $p \to \operatorname{rk} M_t = s$

Use only $M_t x_t$ in the algorithm's computations: complexity in $O(s)$

Our contribution

Adapt the 3 parts of the algorithm to obtain $O(s)$ complexity

1. Code computation

2. Surrogate update

3. Surrogate minimization

[Szabo et al., 2011]: dictionary learning with missing values – $O(p)$


1. Code computation

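Linear regression with random sampling

$\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|M_t(x_t - D_{t-1}\alpha)\|_2^2 + \lambda \frac{\operatorname{rk} M_t}{p} \Omega(\alpha)$

is an approximate solution of

$\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$

Valid in high dimension, with incoherent features:

$D^\top M_t D \approx \frac{s}{p} D^\top D$, $D^\top M_t x_t \approx \frac{s}{p} D^\top x_t$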
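A small numerical experiment can illustrate why the subsampled code is close to the full one. The sketch below rests on assumptions that are not in the slides: a random Gaussian dictionary (hence incoherent), a ridge penalty so both problems have closed-form solutions, and $M_t$ modeled as a uniform random selection of $s$ rows. The names `ridge_code`, `gram_rel` and `rel_err` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, s, lam = 20000, 5, 2000, 0.1

# Random dictionary with roughly orthonormal columns, and one noisy sample.
D = rng.standard_normal((p, k)) / np.sqrt(p)
x = D @ rng.standard_normal(k) + 0.01 * rng.standard_normal(p)

# Support of M_t: s rows chosen uniformly without replacement.
rows = rng.choice(p, size=s, replace=False)


def ridge_code(D, x, lam):
    """Closed-form minimizer of ||x - D a||^2 + lam * ||a||^2."""
    return np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ x)


alpha_full = ridge_code(D, x, lam)                    # O(p) work
alpha_sub = ridge_code(D[rows], x[rows], lam * s / p)  # O(s) work, rescaled penalty

# Relative gap between D^T M_t D and (s/p) D^T D.
target = (s / p) * (D.T @ D)
gram_rel = np.linalg.norm(D[rows].T @ D[rows] - target) / np.linalg.norm(target)

# Relative gap between the subsampled and the full code.
rel_err = np.linalg.norm(alpha_sub - alpha_full) / np.linalg.norm(alpha_full)
print(f"Gram gap: {gram_rel:.3f}, code gap: {rel_err:.3f}")
```

The penalty is rescaled by $s/p$ so that both sides of the normal equations shrink by the same factor, which is why the two solutions nearly coincide when the approximations hold.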


2. Surrogate update

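Original algorithm: $A_t$ and $B_t$ are used in the dictionary update

$A_t = \frac{1}{t}\sum_{i=1}^{t} \alpha_i\alpha_i^\top$: same as in the online algorithm

$B_t = (1 - \frac{1}{t}) B_{t-1} + \frac{1}{t} x_t\alpha_t^\top = \frac{1}{t}\sum_{i=1}^{t} x_i\alpha_i^\top$: forbidden, since it requires the full $x_t$

Partial update of $B_t$ at each iteration

$B_t = \left(\sum_{i=1}^{t} M_i\right)^{-1} \sum_{i=1}^{t} M_i x_i\alpha_i^\top$

Only $M_t B$ is updated

Behaves like $\mathbb{E}_x[x\alpha^\top]$ for large $t$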
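The partial update amounts to keeping one counter per feature and maintaining a per-row running mean. A minimal sketch follows, assuming $M_t$ is a row-selection mask given as an array of distinct row indices; the class name and its attributes are hypothetical.

```python
import numpy as np


class PartialSurrogate:
    """Running statistics A_t and B_t, with B updated only on observed rows."""

    def __init__(self, p, k):
        self.A = np.zeros((k, k))
        self.B = np.zeros((p, k))
        self.counts = np.zeros(p)  # diagonal of sum_i M_i
        self.t = 0

    def update(self, rows, x_rows, alpha):
        """rows: distinct indices selected by M_t; x_rows: the entries x_t[rows]."""
        self.t += 1
        # A_t is the plain running mean of alpha alpha^T (k x k, independent of p).
        self.A += (np.outer(alpha, alpha) - self.A) / self.t
        # B_t: per-row running mean, each row normalized by how often it was seen.
        self.counts[rows] += 1
        increment = np.outer(x_rows, alpha) - self.B[rows]
        self.B[rows] += increment / self.counts[rows][:, None]
```

Only `len(rows) = s` rows of `B` are read and written per call, so the update costs $O(sk)$ instead of $O(pk)$.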


3. Surrogate minimization

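Original algorithm: block coordinate descent with projection on $\mathcal{C}$

$\min_{D \in \mathcal{C}} g_t(D)$: $D_j \leftarrow p^\perp_{\mathcal{C}_j}\left(D_j - \frac{1}{(A_t)_{j,j}} (D (A_t)_j - (B_t)_j)\right)$

Updating the full $D$ at iteration $t$ is forbidden

Cautious update

Leave the dictionary unchanged for unseen features $(I - M_t)$

$\min_{D \in \mathcal{C},\ (I - M_t) D = (I - M_t) D_{t-1}} g_t(D)$

$O(s)$ update in block coordinate descent

$D_j \leftarrow p^\perp_{\mathcal{C}_j^r}\left(D_j - \frac{1}{(A_t)_{j,j}} M_t (D (A_t)_j - (B_t)_j)\right)$

$\ell_1$ ball: $\mathcal{C}_j^r = \{D \in \mathcal{C},\ \|M_t D\|_1 \le \|M_t D_{t-1}\|_1\}$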
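A possible sketch of the cautious update is given below. It assumes $M_t$ is a row-selection mask passed as an array of distinct indices, and it uses the standard sort-based Euclidean projection onto an $\ell_1$ ball; both function names are hypothetical and this is not the authors' implementation. For each column, the radius of the restricted ball is the $\ell_1$ norm of the observed rows before the step, so the update never increases that norm and never touches unseen rows.

```python
import numpy as np


def project_l1_ball(v, radius):
    """Euclidean projection of v onto {w : ||w||_1 <= radius} (sort-based)."""
    if radius <= 0:
        return np.zeros_like(v)
    u = np.abs(v)
    if u.sum() <= radius:
        return v.copy()
    w = np.sort(u)[::-1]
    css = np.cumsum(w)
    ks = np.arange(1, len(w) + 1)
    # Largest index for which the soft-threshold keeps the entry positive.
    rho = np.nonzero(w * ks > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1)
    return np.sign(v) * np.maximum(u - theta, 0.0)


def cautious_dictionary_update(D, A, B, rows):
    """One pass of block coordinate descent restricted to the rows seen by M_t."""
    D = D.copy()
    for j in range(D.shape[1]):
        if A[j, j] < 1e-12:
            continue  # column not used by any code so far
        radius = np.abs(D[rows, j]).sum()  # ||M_t D_{t-1, j}||_1
        step = (D[rows] @ A[:, j] - B[rows, j]) / A[j, j]
        D[rows, j] = project_l1_ball(D[rows, j] - step, radius)
    return D
```

Each column update reads an $s \times k$ slice of `D`, so a full pass costs $O(sk^2)$ regardless of $p$.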


Resting-state fMRI

HCP dataset

One brain image per second

200 sessions, $n = 2 \cdot 10^6$, $p = 2 \cdot 10^5$

[Figure: $X$ (voxels × time) = $D$ ($k$ spatial maps) × $A$ (time)]

Sparse decomposition: $k = 70$, $\mathcal{C} = \mathcal{B}_1^k$, $\Omega = \|\cdot\|_2^2$

Validation

Increase reduction factor $p/s$

Objective function on test set vs CPU time


Resting-state fMRI

Online dictionary learning

235 h run time: 1 full epoch

10 h run time: 1/24 epoch

Proposed method

10 h run time: 1/2 epoch, reduction $r = 12$

Qualitatively, usable maps are obtained 10× faster


Resting-state fMRI

[Figure: objective value on test set (×10^8) vs. CPU time (0.1 h to 100 h), comparing the original online algorithm (no reduction) with reduction factors r = 4, 8, 12]

Speed-up close to the reduction factor $p/s$


Resting-state fMRI

[Figure: objective value on test set (×10^8) vs. number of records seen, $\lambda = 10^{-4}$, comparing no reduction (original algorithm) with r = 4, 8, 12]

Convergence speed / number of seen records

Information is acquired faster


Collaborative filtering

$M_t x_t$: movie ratings from user $t$

vs. coordinate descent for MMMF loss (no hyperparameters)

[Figure: test RMSE vs. CPU time (100 s to 1000 s) on Netflix (140M), comparing coordinate descent, proposed (full projection), and proposed (partial projection)]

Dataset      Test RMSE (CD)   Test RMSE (MODL)   Speed-up
ML 1M        0.872            0.866              ×0.75
ML 10M       0.802            0.799              ×3.7
NF (140M)    0.938            0.934              ×6.8

Outperforms coordinate descent beyond 10M ratings

Same prediction performance

Speed-up 6.8× on Netflix


Conclusion

Take-home message

Loading stochastic subsets of sample streams can drastically accelerate online matrix factorization

Reduce CPU (+IO) load at each iteration

cf. gradient descent vs. SGD

An order of magnitude speed-up on two different problems

Python package http://github.com/arthurmensch/modl

Heuristic at contribution time

A follow-up algorithm has convergence guarantees

Questions? (Poster #41 this afternoon)



Bibliography I

[Halko et al., 2009] Halko, N., Martinsson, P.-G., and Tropp, J. A. (2009).

Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.

arXiv:0909.4061 [math].

[Johnson and Lindenstrauss, 1984] Johnson, W. B. and Lindenstrauss, J. (1984).

Extensions of Lipschitz mappings into a Hilbert space.

Contemporary Mathematics, 26:189–206.

[Mairal et al., 2010] Mairal, J., Bach, F., Ponce, J., and Sapiro, G. (2010).

Online learning for matrix factorization and sparse coding.

The Journal of Machine Learning Research, 11:19–60.

[Olshausen and Field, 1997] Olshausen, B. A. and Field, D. J. (1997).

Sparse coding with an overcomplete basis set: A strategy employed by V1?

Vision Research, 37(23):3311–3325.


Bibliography II

[Pilanci and Wainwright, 2014] Pilanci, M. and Wainwright, M. J. (2014).

Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares.

arXiv:1411.0347 [cs, math, stat].

[Szabo et al., 2011] Szabo, Z., Poczos, B., and Lorincz, A. (2011).

Online group-structured dictionary learning.

In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2865–2872. IEEE.

[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2012).

Scalable coordinate descent approaches to parallel matrix factorization for recommender systems.

In Proceedings of the International Conference on Data Mining, pages 765–774. IEEE.


Appendix

Collaborative filtering

Streaming incomplete data

$M_t$ is imposed by user $t$

Data stream: $M_t x_t$, movies rated by user $t$

Proposed by [Szabo et al., 2011], with $O(p)$ complexity

Validation: test RMSE (rating prediction) vs. CPU time

Baseline: coordinate descent solver [Yu et al., 2012] solving a related loss

$\sum_{t=1}^{n} \left(\|M_t(x_t - D\alpha_t)\|_2^2 + \lambda\|\alpha_t\|_2^2\right) + \lambda\|D\|_2^2$

Fastest solver available apart from SGD – no hyperparameters

Our method is not sensitive to hyperparameters
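For concreteness, the baseline loss on incomplete data can be evaluated as sketched below. The data layout is an assumption made for illustration: each user is a pair `(rows, values)` giving the indices of the rated movies and the ratings, and `codes` stores one column $\alpha_t$ per user. The function name is hypothetical.

```python
import numpy as np


def masked_mf_loss(ratings, D, codes, lam):
    """sum_t (||M_t (x_t - D a_t)||^2 + lam ||a_t||^2) + lam ||D||^2.

    ratings: list of (rows, values) pairs, one per user (assumed layout).
    D: (p, k) dictionary; codes: (k, n) matrix with one column per user.
    """
    total = 0.0
    for (rows, values), alpha in zip(ratings, codes.T):
        # Residual restricted to the movies this user actually rated.
        residual = np.asarray(values) - D[rows] @ alpha
        total += residual @ residual + lam * (alpha @ alpha)
    return total + lam * np.sum(D ** 2)
```

Only the observed entries enter the data-fit term, so the cost of evaluating the loss scales with the number of ratings rather than with $p \times n$.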


Algorithm

Our algorithm

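1. Code computation

$\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|M_t(x_t - D_{t-1}\alpha)\|_2^2 + \lambda \frac{\operatorname{rk} M_t}{p} \Omega(\alpha)$

2. Surrogate aggregation

$A_t = \frac{1}{t}\sum_{i=1}^{t} \alpha_i\alpha_i^\top$

$B_t = B_{t-1} + \left(\sum_{i=1}^{t} M_i\right)^{-1} (M_t x_t\alpha_t^\top - M_t B_{t-1})$

3. Surrogate minimization

$M_t D_j \leftarrow p^\perp_{\mathcal{C}_j^r}\left(M_t D_j - \frac{1}{(A_t)_{j,j}} M_t (D (A_t)_j - (B_t)_j)\right)$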

Original online MF

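1. Code computation

$\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda \Omega(\alpha)$

2. Surrogate aggregation

$A_t = \frac{1}{t}\sum_{i=1}^{t} \alpha_i\alpha_i^\top$

$B_t = B_{t-1} + \frac{1}{t}(x_t\alpha_t^\top - B_{t-1})$

3. Surrogate minimization

$D_j \leftarrow p^\perp_{\mathcal{C}_j}\left(D_j - \frac{1}{(A_t)_{j,j}} (D (A_t)_j - (B_t)_j)\right)$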

