TRANSCRIPT
NOMAD: Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix factorization
Scaling by Exploiting Structure
S.V.N. (vishy) Vishwanathan
Purdue University and Amazon
[email protected], amazon.com
July 1st, 2013
S.V. N. Vishwanathan (Purdue and Amazon) Decentralized Optimization for ML 1 / 25
Regularized risk minimization
Machine Learning
We want to build a model which predicts well on data.
A model's performance is quantified by a loss function: a sophisticated discrepancy score.
Our model must generalize to unseen data.
We avoid over-fitting by penalizing complex models (regularization).
More Formally
Training data: x_1, . . . , x_m
Labels: y_1, . . . , y_m
Learn a vector: w

\[ \min_w \; J(w) := \underbrace{\lambda \sum_{j=1}^{d} \phi_j(w_j)}_{\text{Regularizer}} \; + \; \underbrace{\frac{1}{m} \sum_{i=1}^{m} l(\langle w, x_i \rangle, y_i)}_{\text{Risk } R_{\mathrm{emp}}} \]
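As a concrete instance of the objective above, the following sketch evaluates J(w) with an L2 regularizer phi_j(w_j) = w_j^2/2 and a squared loss; the function names and the choice of loss are illustrative assumptions, not part of the talk.

```python
import numpy as np

def squared_loss(yhat, y):
    """One concrete choice of l(<w, x_i>, y_i): squared error."""
    return 0.5 * (yhat - y) ** 2

def regularized_risk(w, X, y, lam, loss=squared_loss):
    """J(w) = lam * sum_j phi_j(w_j) + (1/m) * sum_i l(<w, x_i>, y_i),
    with phi_j(w_j) = w_j**2 / 2 (L2 regularization)."""
    regularizer = lam * 0.5 * np.sum(w ** 2)
    empirical_risk = np.mean(loss(X @ w, y))
    return regularizer + empirical_risk
```

With lam = 0 this is plain empirical risk; a larger lam trades goodness of fit for a simpler model.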
NOMAD for Matrix Completion
Outline
1 NOMAD for Matrix Completion
2 NOMAD for Regularized Risk Minimization
Collaborative filtering

[Figure: a 6-user × 10-movie rating matrix (U1–U6, M1–M10); a few entries are observed thumbs-up/thumbs-down ratings, most are missing]
Matrix completion

[Figure: A ≈ W H⊤; the partially observed matrix A is approximated by the product of low-rank factors W and H]
Matrix completion

\[ \min_{W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{n \times k}} f(W, H), \qquad f(W, H) = \frac{1}{2} \sum_{(i,j) \in \Omega} \Big( \underbrace{\big( A_{ij} - w_i^\top h_j \big)^2}_{\text{loss}} + \underbrace{\lambda \big( \|w_i\|^2 + \|h_j\|^2 \big)}_{\text{Regularizer}} \Big). \]
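The objective can be evaluated directly over the observed index set Ω; a small illustrative sketch (the function and variable names are mine, not from the talk):

```python
import numpy as np

def mc_objective(A, W, H, omega, lam):
    """f(W, H) = 1/2 * sum over observed (i, j) in omega of
    (A_ij - <w_i, h_j>)**2 + lam * (||w_i||**2 + ||h_j||**2)."""
    total = 0.0
    for i, j in omega:
        residual = A[i, j] - W[i] @ H[j]  # A_ij - w_i^T h_j
        total += residual ** 2 + lam * (W[i] @ W[i] + H[j] @ H[j])
    return 0.5 * total
```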
Stochastic approximation

Sample a single observed entry (i, j) ∈ Ω:

\[ f(W, H) \approx f_{ij}(W, H) = \frac{1}{2} \big( A_{ij} - w_i^\top h_j \big)^2 + \lambda \big( \|w_i\|^2 + \|h_j\|^2 \big) \]

\[ \nabla_{w_{i'}} f_{ij}(W, H) = \begin{cases} -\big( A_{ij} - w_i^\top h_j \big)\, h_j + \lambda w_i & \text{for } i' = i \\ 0 & \text{otherwise} \end{cases} \]

\[ \nabla_{h_{j'}} f_{ij}(W, H) = \begin{cases} -\big( A_{ij} - w_i^\top h_j \big)\, w_i + \lambda h_j & \text{for } j' = j \\ 0 & \text{otherwise} \end{cases} \]
Stochastic updates

\[ w_i \leftarrow w_i - \eta \big( -( A_{ij} - w_i^\top h_j )\, h_j + \lambda w_i \big) \]

\[ h_j \leftarrow h_j - \eta \big( -( A_{ij} - w_i^\top h_j )\, w_i + \lambda h_j \big) \]
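In code, one stochastic update on a sampled observed entry (i, j) looks like the following sketch; the names are illustrative, and the residual carries a minus sign in the gradient because the loss is (1/2)(A_ij - w_i^T h_j)^2.

```python
import numpy as np

def sgd_step(A, W, H, i, j, eta, lam):
    """One stochastic gradient step on observed entry (i, j):
    move w_i and h_j against the gradient of f_ij."""
    residual = A[i, j] - W[i] @ H[j]
    grad_w = -residual * H[j] + lam * W[i]  # gradient w.r.t. w_i
    grad_h = -residual * W[i] + lam * H[j]  # gradient w.r.t. h_j (uses old w_i)
    W[i] -= eta * grad_w
    H[j] -= eta * grad_h
```

Repeating this over randomly sampled entries of Ω drives the factorization toward the observed values.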
Decoupling the updates [Gemulla et al., KDD 2011]

[Figure: the matrix is partitioned into blocks; workers process mutually disjoint row and column blocks in parallel, since updates in different blocks share no w_i or h_j]
Then: Synchronize and Communicate, and move on to the next set of disjoint blocks.
Some Observations (also see Gemulla et al., ICDM 2012)

The good:
Updates are decoupled and easy to parallelize
Easy to implement using map-reduce

The bad:
Communication and computation are interleaved:
when the network is active the CPU is idle, and when the CPU is active the network is idle

Question: Can we keep the CPU and network simultaneously busy?
Our answer

Non-locking, stOchastic Multi-machine algorithm for Asynchronous and Decentralized matrix factorization (NOMAD)
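The NOMAD idea can be sketched in a single process: rows of A (with their w_i) are statically assigned to workers, while each column h_j circulates through per-worker queues; only the current owner of a column touches it, so no locks are needed. Everything below (the names, the round-robin routing, the simulation itself) is an illustrative reconstruction, not the actual multi-machine implementation.

```python
import numpy as np
from collections import deque

def nomad_simulated(A, k, workers, epochs, eta, lam, seed=0):
    """Toy single-process NOMAD simulation: workers own disjoint row
    blocks; columns of H hop between worker queues and are updated
    only by whichever worker currently holds them."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = 0.1 * rng.standard_normal((m, k))
    H = 0.1 * rng.standard_normal((n, k))
    rows = [list(range(p, m, workers)) for p in range(workers)]
    queues = [deque() for _ in range(workers)]
    for j in range(n):                       # scatter columns across workers
        queues[j % workers].append(j)
    for _ in range(epochs):
        for p in range(workers):
            while queues[p]:
                j = queues[p].popleft()
                for i in rows[p]:            # update held column vs. owned rows
                    residual = A[i, j] - W[i] @ H[j]
                    W[i] += eta * (residual * H[j] - lam * W[i])
                    H[j] += eta * (residual * W[i] - lam * H[j])
                queues[(p + 1) % workers].append(j)  # pass the column on
    return W, H
```

In the real system the queues live on different machines and the hand-off is an asynchronous network send, which is what keeps CPU and network busy at the same time.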
Illustration of NOMAD communication

[Figure: each worker owns a fixed block of rows of A (and the corresponding w_i); the columns h_j travel between workers, and each worker updates whichever columns it currently holds against its own rows, asynchronously and without locks]
Eventually . . .

[Figure: every column has visited every worker, so all observed entries of the matrix have contributed updates]
Experiments: Netflix

Size: 2649429 × 17770, nnz = 99 million

[Plot: speedup vs. number of machines (up to 30) on Netflix]
Experiments: Netflix

[Plots: test RMSE vs. elapsed secs × num processors, comparing 1 processor against 2, 4, 8, 15, and 30 processors]
Experiments: Yahoo! music

Size: 1000990 × 624961, nnz = 252 million

[Plot: speedup vs. number of machines (up to 30) on Yahoo! music]
Experiments: Yahoo! music

[Plots: test RMSE vs. elapsed secs × num processors, comparing 1 processor against 2, 4, 8, 15, and 30 processors]
Experiments: Synthetic Data

Size: 5 000 000 × 200 000, nnz = 270 million

[Plot: speedup vs. number of machines (up to 30) on synthetic data]
Experiments: Synthetic data

[Plots: test RMSE vs. elapsed secs × num processors, comparing 1 processor against 2, 4, 8, and 15 processors]
NOMAD for Regularized Risk Minimization
Outline
1 NOMAD for Matrix Completion
2 NOMAD for Regularized Risk Minimization
Problem Formulation

\[ \min_{w} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} l(\langle w, x_i \rangle, y_i) \]
Introduce one auxiliary variable per example:

\[ \min_{w, u} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} l(u_i) \qquad \text{subject to} \quad u_i = \langle w, x_i \rangle, \; i = 1, \dots, m \]
Form the Lagrangian with multipliers α_i:

\[ \min_{w, u} \max_{\alpha} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} l(u_i) + \frac{1}{m} \sum_{i=1}^{m} \alpha_i \big( u_i - \langle w, x_i \rangle \big) \]
Swap the order of the max and the min:

\[ \max_{\alpha} \min_{w, u} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) + \frac{1}{m} \sum_{i=1}^{m} l(u_i) + \frac{1}{m} \sum_{i=1}^{m} \alpha_i \big( u_i - \langle w, x_i \rangle \big) \]
The minimization over u decouples from w:

\[ \max_{\alpha} \min_{w} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) - \frac{1}{m} \sum_{i=1}^{m} \alpha_i \langle w, x_i \rangle + \frac{1}{m} \min_{u} \sum_{i=1}^{m} \big( l(u_i) + \alpha_i u_i \big) \]
Since min_u ( l(u) + α u ) = −l^*(−α), where l^* is the Fenchel conjugate of l:

\[ \max_{\alpha} \min_{w} \; \lambda \sum_{j=1}^{d} \phi_j(w_j) - \Big\langle w, \frac{1}{m} \sum_{i=1}^{m} \alpha_i x_i \Big\rangle - \frac{1}{m} \sum_{i=1}^{m} l^*(-\alpha_i) \]

Distributing the terms over the nonzero coordinates of the data (Ω_i denotes the nonzero features of example i, Ω_j the examples whose feature j is nonzero) gives

\[ \max_{\alpha} \min_{w} \; \sum_{i=1}^{m} \sum_{j \in \Omega_i} \Big( \frac{\lambda}{|\Omega_j|}\, \phi_j(w_j) - \frac{1}{m}\, \alpha_i w_j x_{ij} - \frac{1}{m |\Omega_i|}\, l^*(-\alpha_i) \Big) \]

Stochastic Gradients

Sampling a nonzero coordinate (i, j) uniformly from Ω yields the stochastic gradients

\[ \partial_{w_j} J(w, \alpha) = |\Omega| \Big( \frac{\lambda}{|\Omega_j|}\, \partial \phi_j(w_j) - \frac{1}{m}\, \alpha_i x_{ij} \Big) \]

\[ \partial_{\alpha_i} J(w, \alpha) = |\Omega| \Big( -\frac{1}{m}\, w_j x_{ij} - \frac{1}{m |\Omega_i|}\, \partial_{\alpha_i} l^*(-\alpha_i) \Big). \]
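To make the saddle-point updates concrete, here is a sketch for the squared loss l(u) = (1/2)(u − y)^2 and φ_j(w_j) = (1/2)w_j^2, for which ∂_α l^*(−α) = α − y. These loss choices, the names, and the count bookkeeping (|Ω|, |Ω_i|, |Ω_j| passed in explicitly) are my assumptions for illustration.

```python
import numpy as np

def primal_dual_step(w, alpha, X, y, i, j, eta, lam, n_omega, omega_i, omega_j, m):
    """One doubly-stochastic saddle-point update on nonzero (i, j):
    descend in the primal coordinate w_j, ascend in the dual alpha_i.
    Assumes l(u) = (u - y)**2 / 2, so d/dalpha l*(-alpha) = alpha - y,
    and phi_j(w_j) = w_j**2 / 2, so dphi_j(w_j) = w_j."""
    g_w = n_omega * (lam / omega_j * w[j] - alpha[i] * X[i, j] / m)
    g_a = n_omega * (-w[j] * X[i, j] / m - (alpha[i] - y[i]) / (m * omega_i))
    w[j] -= eta * g_w        # primal descent
    alpha[i] += eta * g_a    # dual ascent
```

On a one-example, one-feature problem this converges to the ridge solution w = xy / (λ + x^2).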
Decoupling the updates

[Figure: the example × feature matrix of nonzero entries is partitioned into blocks, exactly as in matrix completion; a stochastic gradient at (i, j) touches only w_j and α_i, so the same block scheme decouples the updates]
It Converges!

[Plot: duality gap (log scale) vs. number of iterations on the KDDA test dataset; updating 10 percent of the coordinates per iteration is compared against updating all coordinates]
Joint work with

[Diagram: a reading thread streams the dataset from disk with sequential access and loads a cached working set into RAM, while a training thread reads the cache with random access and updates the weight vector in RAM]