TRANSCRIPT
NOMAD: Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix factorization
Scaling by Exploiting Structure
S.V.N. (vishy) Vishwanathan
Purdue University and Amazon
[email protected], amazon.com
July 1st, 2013
S.V. N. Vishwanathan (Purdue and Amazon) Decentralized Optimization for ML 1 / 25
Regularized risk minimization
Machine Learning
We want to build a model which predicts well on data.
A model's performance is quantified by a loss function: a sophisticated discrepancy score.
Our model must generalize to unseen data.
We avoid over-fitting by penalizing complex models (regularization).
More Formally
Training data: x_1, . . . , x_m
Labels: y_1, . . . , y_m
Learn a vector: w

\[ \min_w \; J(w) := \underbrace{\lambda \sum_{j=1}^{d} \phi_j(w_j)}_{\text{Regularizer}} \; + \; \underbrace{\frac{1}{m} \sum_{i=1}^{m} l(\langle w, x_i \rangle, y_i)}_{\text{Risk } R_{\mathrm{emp}}} \]
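As a concrete instance of the objective above, the following sketch evaluates J(w) with an L2 regularizer phi_j(w_j) = w_j^2/2 and a squared loss; the function names and the choice of loss are illustrative assumptions, not part of the talk.

```python
import numpy as np

def squared_loss(yhat, y):
    """One concrete choice of l(<w, x_i>, y_i): squared error."""
    return 0.5 * (yhat - y) ** 2

def regularized_risk(w, X, y, lam, loss=squared_loss):
    """J(w) = lam * sum_j phi_j(w_j) + (1/m) * sum_i l(<w, x_i>, y_i),
    with phi_j(w_j) = w_j**2 / 2 (L2 regularization)."""
    regularizer = lam * 0.5 * np.sum(w ** 2)
    empirical_risk = np.mean(loss(X @ w, y))
    return regularizer + empirical_risk
```

With lam = 0 this is plain empirical risk; a larger lam trades goodness of fit for a simpler model.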
NOMAD for Matrix Completion
Outline
1 NOMAD for Matrix Completion
2 NOMAD for Regularized Risk Minimization
Collaborative filtering

[Figure: a 6-user × 10-movie rating matrix (U1–U6, M1–M10); a few entries are observed thumbs-up/thumbs-down ratings, most are missing]
Matrix completion

[Figure: A ≈ W H⊤; the partially observed matrix A is approximated by the product of low-rank factors W and H]
Matrix completion

\[ \min_{W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{n \times k}} f(W, H), \qquad f(W, H) = \frac{1}{2} \sum_{(i,j) \in \Omega} \Big( \underbrace{\big( A_{ij} - w_i^\top h_j \big)^2}_{\text{loss}} + \underbrace{\lambda \big( \|w_i\|^2 + \|h_j\|^2 \big)}_{\text{Regularizer}} \Big). \]
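The objective can be evaluated directly over the observed index set Ω; a small illustrative sketch (the function and variable names are mine, not from the talk):

```python
import numpy as np

def mc_objective(A, W, H, omega, lam):
    """f(W, H) = 1/2 * sum over observed (i, j) in omega of
    (A_ij - <w_i, h_j>)**2 + lam * (||w_i||**2 + ||h_j||**2)."""
    total = 0.0
    for i, j in omega:
        residual = A[i, j] - W[i] @ H[j]  # A_ij - w_i^T h_j
        total += residual ** 2 + lam * (W[i] @ W[i] + H[j] @ H[j])
    return 0.5 * total
```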
Stochastic approximation

Sample a single observed entry (i, j) ∈ Ω:

\[ f(W, H) \approx f_{ij}(W, H) = \frac{1}{2} \big( A_{ij} - w_i^\top h_j \big)^2 + \lambda \big( \|w_i\|^2 + \|h_j\|^2 \big) \]

\[ \nabla_{w_{i'}} f_{ij}(W, H) = \begin{cases} -\big( A_{ij} - w_i^\top h_j \big)\, h_j + \lambda w_i & \text{for } i' = i \\ 0 & \text{otherwise} \end{cases} \]

\[ \nabla_{h_{j'}} f_{ij}(W, H) = \begin{cases} -\big( A_{ij} - w_i^\top h_j \big)\, w_i + \lambda h_j & \text{for } j' = j \\ 0 & \text{otherwise} \end{cases} \]
Stochastic updates

\[ w_i \leftarrow w_i - \eta \big( -( A_{ij} - w_i^\top h_j )\, h_j + \lambda w_i \big) \]

\[ h_j \leftarrow h_j - \eta \big( -( A_{ij} - w_i^\top h_j )\, w_i + \lambda h_j \big) \]
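In code, one stochastic update on a sampled observed entry (i, j) looks like the following sketch; the names are illustrative, and the residual carries a minus sign in the gradient because the loss is (1/2)(A_ij - w_i^T h_j)^2.

```python
import numpy as np

def sgd_step(A, W, H, i, j, eta, lam):
    """One stochastic gradient step on observed entry (i, j):
    move w_i and h_j against the gradient of f_ij."""
    residual = A[i, j] - W[i] @ H[j]
    grad_w = -residual * H[j] + lam * W[i]  # gradient w.r.t. w_i
    grad_h = -residual * W[i] + lam * H[j]  # gradient w.r.t. h_j (uses old w_i)
    W[i] -= eta * grad_w
    H[j] -= eta * grad_h
```

Repeating this over randomly sampled entries of Ω drives the factorization toward the observed values.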
Decoupling the updates [Gemulla et al., KDD 2011]

[Figure: the matrix is partitioned into blocks; workers process mutually disjoint row and column blocks in parallel, since updates in different blocks share no w_i or h_j]
Then: Synchronize and Communicate, and move on to the next set of disjoint blocks.
Some Observations (also see Gemulla et al., ICDM 2012)

The good:
Updates are decoupled and easy to parallelize
Easy to implement using map-reduce

The bad:
Communication and computation are interleaved:
when the network is active the CPU is idle, and when the CPU is active the network is idle

Question: Can we keep the CPU and network simultaneously busy?
Our answer

Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix factorization (NOMAD)
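The NOMAD idea can be sketched in a single process: rows of A (with their w_i) are statically assigned to workers, while each column h_j circulates through per-worker queues; only the current owner of a column touches it, so no locks are needed. Everything below (the names, the round-robin routing, the simulation itself) is an illustrative reconstruction, not the actual multi-machine implementation.

```python
import numpy as np
from collections import deque

def nomad_simulated(A, k, workers, epochs, eta, lam, seed=0):
    """Toy single-process NOMAD simulation: workers own disjoint row
    blocks; columns of H hop between worker queues and are updated
    only by whichever worker currently holds them."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = 0.1 * rng.standard_normal((m, k))
    H = 0.1 * rng.standard_normal((n, k))
    rows = [list(range(p, m, workers)) for p in range(workers)]
    queues = [deque() for _ in range(workers)]
    for j in range(n):                       # scatter columns across workers
        queues[j % workers].append(j)
    for _ in range(epochs):
        for p in range(workers):
            while queues[p]:
                j = queues[p].popleft()
                for i in rows[p]:            # update held column vs. owned rows
                    residual = A[i, j] - W[i] @ H[j]
                    W[i] += eta * (residual * H[j] - lam * W[i])
                    H[j] += eta * (residual * W[i] - lam * H[j])
                queues[(p + 1) % workers].append(j)  # pass the column on
    return W, H
```

In the real system the queues live on different machines and the hand-off is an asynchronous network send, which is what keeps CPU and network busy at the same time.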
Illustration of NOMAD communication

[Figure: each worker owns a fixed block of rows of A (and the corresponding w_i); the columns h_j travel between workers, and each worker updates whichever columns it currently holds against its own rows, asynchronously and without locks]
Eventually . . .

[Figure: every column has visited every worker, so all observed entries of the matrix have contributed updates]
Experiments: Netflix

Size: 2649429 × 17770, nnz = 99 million

[Plot: speedup vs. number of machines (up to 30) on Netflix]
Experiments: Netflix

[Plots: test RMSE vs. elapsed secs × num processors, comparing 1 processor against 2, 4, 8, 15, and 30 processors]
Experiments: Yahoo! music

Size: 1000990 × 624961, nnz = 252 million

[Plot: speedup vs. number of machines (up to 30) on Yahoo! music]
Experiments: Yahoo! music

[Plots: test RMSE vs. elapsed secs × num processors, comparing 1 processor against 2, 4, 8, 15, and 30 processors]
Experiments: Synthetic Data

Size: 5 000 000 × 200 000, nnz = 270 million

[Plot: speedup vs. number of machines (up to 30) on synthetic data]
Experiments: Synthetic data

[Plots: test RMSE vs. elapsed secs × num processors, comparing 1 processor against 2, 4, 8, and 15 processors]
NOMAD for Regularized Risk Minimization
Outline
1 NOMAD for Matrix Completion
2 NOMAD for Regularized Risk Minimization
Problem Formulation

\[ \min_{w} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} l(\langle w, x_i \rangle, y_i) \]
Introduce one auxiliary variable per example:

\[ \min_{w, u} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} l(u_i) \qquad \text{subject to} \quad u_i = \langle w, x_i \rangle, \; i = 1, \dots, m \]
Form the Lagrangian with multipliers α_i:

\[ \min_{w, u} \max_{\alpha} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} l(u_i) + \frac{1}{m} \sum_{i=1}^{m} \alpha_i \big( u_i - \langle w, x_i \rangle \big) \]
Swap the order of the max and the min:

\[ \max_{\alpha} \min_{w, u} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} l(u_i) + \frac{1}{m} \sum_{i=1}^{m} \alpha_i \big( u_i - \langle w, x_i \rangle \big) \]
The minimization over u decouples from w:

\[ \max_{\alpha} \min_{w} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) - \frac{1}{m} \sum_{i=1}^{m} \alpha_i \langle w, x_i \rangle + \frac{1}{m} \min_{u} \sum_{i=1}^{m} \big( l(u_i) + \alpha_i u_i \big) \]
Since min_u ( l(u) + α u ) = −l^*(−α), where l^* is the Fenchel conjugate of l:

\[ \max_{\alpha} \min_{w} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) - \Big\langle w, \frac{1}{m} \sum_{i=1}^{m} \alpha_i x_i \Big\rangle - \frac{1}{m} \sum_{i=1}^{m} l^*(-\alpha_i) \]

Distributing the terms over the nonzero coordinates of the data (Ω_i denotes the nonzero features of example i, Ω_j the examples whose feature j is nonzero) gives

\[ \max_{\alpha} \min_{w} \; \sum_{i=1}^{m} \sum_{j \in \Omega_i} \Big( \frac{\lambda}{|\Omega_j|}\, \phi_j(w_j) - \frac{1}{m}\, \alpha_i w_j x_{ij} - \frac{1}{m |\Omega_i|}\, l^*(-\alpha_i) \Big) \]

Stochastic Gradients

Sampling a nonzero coordinate (i, j) uniformly from Ω yields the stochastic gradients

\[ \partial_{w_j} J(w, \alpha) = |\Omega| \Big( \frac{\lambda}{|\Omega_j|}\, \partial \phi_j(w_j) - \frac{1}{m}\, \alpha_i x_{ij} \Big) \]

\[ \partial_{\alpha_i} J(w, \alpha) = |\Omega| \Big( -\frac{1}{m}\, w_j x_{ij} - \frac{1}{m |\Omega_i|}\, \partial_{\alpha_i} l^*(-\alpha_i) \Big). \]
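To make the saddle-point updates concrete, here is a sketch for the squared loss l(u) = (1/2)(u − y)^2 and φ_j(w_j) = (1/2)w_j^2, for which ∂_α l^*(−α) = α − y. These loss choices, the names, and the count bookkeeping (|Ω|, |Ω_i|, |Ω_j| passed in explicitly) are my assumptions for illustration.

```python
import numpy as np

def primal_dual_step(w, alpha, X, y, i, j, eta, lam, n_omega, omega_i, omega_j, m):
    """One doubly-stochastic saddle-point update on nonzero (i, j):
    descend in the primal coordinate w_j, ascend in the dual alpha_i.
    Assumes l(u) = (u - y)**2 / 2, so d/dalpha l*(-alpha) = alpha - y,
    and phi_j(w_j) = w_j**2 / 2, so dphi_j(w_j) = w_j."""
    g_w = n_omega * (lam / omega_j * w[j] - alpha[i] * X[i, j] / m)
    g_a = n_omega * (-w[j] * X[i, j] / m - (alpha[i] - y[i]) / (m * omega_i))
    w[j] -= eta * g_w        # primal descent
    alpha[i] += eta * g_a    # dual ascent
```

On a one-example, one-feature problem this converges to the ridge solution w = xy / (λ + x^2).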
Decoupling the updates

[Figure: the example × feature matrix of nonzero entries is partitioned into blocks, exactly as in matrix completion; a stochastic gradient at (i, j) touches only w_j and α_i, so the same block scheme decouples the updates]
It Converges!

[Plot: duality gap (log scale) vs. number of iterations on the KDDA test dataset; updating 10 percent of the coordinates per iteration is compared against updating all coordinates]
Joint work with

[Diagram: a reading thread streams the dataset from disk with sequential access and loads a cached working set into RAM, while a training thread reads the cache with random access and updates the weight vector in RAM]