Scalable Task-Parallel SGD on Matrix Factorization in Multicore Architectures
15/05/29 ParLearning15 1
Scalable Task-Parallel SGD on Matrix Factorization in Multicore Architectures
Yusuke Nishioka, Kenjiro Taura
Graduate School of Information Science and Technology, The University of Tokyo
Agenda
● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work
Introduction
Recommendation is an important technique, especially in e-commerce services such as Amazon and Netflix. Service providers use recommendation to help users find items they are likely to prefer.
[Figure: example recommendations on https://amazon.com]
Recommendation
There are two approaches to recommendation: content filtering and collaborative filtering.
● Content filtering is based on information about users and items.
● Collaborative filtering is based on correlations between users and items.
Collaborative filtering has become popular because it makes accurate predictions with less data.
Collaborative Filtering
[Table: ratings from 1 (bad) to 5 (great) given by users A-D to four movies: Saving Private Ryan, Pirates of the Caribbean, Beauty and the Beast, Letters From Iwo Jima]
Each entry in the table denotes a rating: 5 means great, 1 means terrible.
Both users A and C seem to like war movies, so why not recommend other war movies to them (predicted rating 4-5)?
Both users B and D seem to like Disney movies, so why not recommend other Disney movies to them (predicted rating 4-5)?
Contribution
Our contributions are:
● to propose an alternative approach to parallel matrix factorization
● to analyze the scalability problem of the past work
● to achieve better scalability than other methods
Matrix Factorization
One collaborative filtering algorithm is matrix factorization (MF), which:
● infers user-item relations from the ratings which users have given to items
● decomposes the rating matrix into two low-rank latent matrices
What's matrix factorization?
[Figure: the rating table, drawn as a rating matrix R with m rows (users) and n columns (items)]
What's matrix factorization?
The rating matrix R is modeled as the matrix product of P^T and Q:

    R ≈ P^T Q,    R ∈ R^(m×n), P ∈ R^(k×m), Q ∈ R^(k×n)

where m is the number of users, n the number of items, and k the dimension of the latent matrices.
What's matrix factorization?
Each entry r_(u,v) in the rating matrix R is modeled as the inner product of the corresponding row p_u^T of P^T and column q_v of Q:

    r̂_(u,v) ≈ p_u^T q_v
Optimization problem of MF
The objective is to reduce the error between the predicted and actual values of the existing ratings. The objective function is:

    min_(P,Q) Σ_((u,v)∈R) ( r_(u,v) − p_u^T q_v )^2 + λ_P ||P||_F^2 + λ_Q ||Q||_F^2

where r_(u,v) is an existing rating, p_u^T q_v the predicted rating, λ_P and λ_Q the regularization coefficients, and ||·||_F the Frobenius norm.
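For concreteness, the objective can be evaluated directly. This is a small pure-Python sketch: P, Q, the ratings, and the λ values below are made-up illustrative numbers, not anything from the paper.

```python
# Evaluate the regularized MF objective above for hand-made inputs.
def objective(ratings, P, Q, lam_p, lam_q):
    # ratings: list of (u, v, r); P[u], Q[v]: latent vectors of length k
    loss = sum((r - sum(a * b for a, b in zip(P[u], Q[v]))) ** 2
               for u, v, r in ratings)
    frob_p = sum(x * x for col in P for x in col)  # ||P||_F^2
    frob_q = sum(x * x for col in Q for x in col)  # ||Q||_F^2
    return loss + lam_p * frob_p + lam_q * frob_q

P = [[1.0, 0.0], [0.0, 1.0]]          # latent vectors for users 0, 1 (k = 2)
Q = [[2.0, 0.0], [0.0, 3.0]]          # latent vectors for items 0, 1
ratings = [(0, 0, 2.0), (1, 1, 2.0)]  # (user, item, rating)
print(objective(ratings, P, Q, lam_p=0.5, lam_q=0.5))  # 8.5
```

Here the squared error is 0 + 1 and the Frobenius terms contribute 0.5·2 + 0.5·13, giving 8.5.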
Gradient Descent Method
One approach to this optimization problem is the gradient descent (GD) method: modify P and Q in the direction opposite to the gradients of the objective function.

    p_u ⇐ p_u + γ ( Σ_((u,v)∈R) ( r_(u,v) − p_u^T q_v ) q_v − λ_P p_u )
    q_v ⇐ q_v + γ ( Σ_((u,v)∈R) ( r_(u,v) − p_u^T q_v ) p_u − λ_Q q_v )

GD needs to calculate the gradients in a batch → slow convergence.
Stochastic Gradient Descent Method
The stochastic gradient descent (SGD) method updates the corresponding p_u and q_v in an online manner: faster convergence with low memory consumption. For each rating r_(u,v), p_u and q_v are updated as follows:

    p_u ⇐ p_u + γ ( ( r_(u,v) − p_u^T q_v ) q_v − λ_P p_u )
    q_v ⇐ q_v + γ ( ( r_(u,v) − p_u^T q_v ) p_u − λ_Q q_v )

SGD is inherently sequential → difficult to parallelize over multi-core CPUs.
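These update rules can be sketched in a few lines of pure Python. This is a minimal single-threaded illustration; the matrix sizes, γ, λ, and the ratings are arbitrary made-up values.

```python
import random

# One SGD pass over the ratings, following the update rules above.
# gamma (learning rate) and lam (regularization) are illustrative.
def sgd_epoch(ratings, P, Q, gamma=0.01, lam=0.05):
    for u, v, r in ratings:
        err = r - sum(a * b for a, b in zip(P[u], Q[v]))
        for f in range(len(P[u])):
            pu, qv = P[u][f], Q[v][f]
            P[u][f] += gamma * (err * qv - lam * pu)
            Q[v][f] += gamma * (err * pu - lam * qv)

def squared_error(ratings, P, Q):
    return sum((r - sum(a * b for a, b in zip(P[u], Q[v]))) ** 2
               for u, v, r in ratings)

random.seed(0)
k = 4
P = [[random.uniform(0, 1) for _ in range(k)] for _ in range(3)]  # 3 users
Q = [[random.uniform(0, 1) for _ in range(k)] for _ in range(3)]  # 3 items
ratings = [(0, 0, 5.0), (0, 2, 4.0), (1, 1, 1.0), (2, 0, 4.0)]

before = squared_error(ratings, P, Q)
for _ in range(200):
    sgd_epoch(ratings, P, Q)
print(squared_error(ratings, P, Q) < before)  # True: training error decreases
```

Note that each update touches only one p_u and one q_v, which is exactly why concurrent updates to the same row or column conflict when SGD is parallelized.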
GD vs SGD

              GD                        SGD
strong point  easy to parallelize       fast convergence
weak point    slow convergence          difficult to parallelize
Related work
We review some parallel SGD algorithms for solving matrix factorization: HogWild!, DSGD, and FPSGD.
HogWild! [Niu et al. 2011]
HogWild! is a lock-free approach to parallelizing MF: workers process ratings in R independently. Convergence is guaranteed in spite of the possibility that updates may be overwritten by other workers.
[Figure: three workers updating R concurrently; some updates are overwritten by other workers]
Problems of HogWild!
Ratings are selected at random from the whole of R → cannot take advantage of hardware prefetch.
[Figure: random rating accesses scattered over P^T, Q, and R]
DSGD [Gemulla et al. 2011]
DSGD divides R into t × t blocks and assigns them to t workers. All workers go on to the next block simultaneously (synchronous parallel).
Problems of DSGD
● Ratings are selected at random within a block → cannot take advantage of hardware prefetch.
● Synchronous parallel → performance depends on the slowest worker.
[Figure: four workers processing blocks between synchronization barriers]
FPSGD [Zhuang et al. 2013]
FPSGD is a state-of-the-art parallel algorithm for MF. It aims to overcome the bottlenecks of HogWild! and DSGD by introducing the following two techniques:
● conflict-free scheduling
● partial random method
Conflict-free scheduling
Take a look at a specific example: 6 × 6 blocks and 4 workers, T0-T3. First, the scheduler assigns to each worker a block which shares neither a row nor a column with the blocks of the other workers.
[Figure: a 6 × 6 block grid with workers T0-T3 placed on mutually non-conflicting blocks]
Conflict-free scheduling
The rows and columns being processed by other workers are blocked, so that the same row or column is never updated concurrently. When worker T0 finishes processing its block, the scheduler assigns it one of the blocks marked with a circle.
[Figure: the same grid; the free candidate blocks for T0 are marked with circles]
Conflict-free scheduling
Worker T0 moves from block (0,0) to block (4,5).
[Figure: the grid after T0 has moved to block (4,5)]
Conflict-free scheduling
In conflict-free scheduling, the rating matrix must be divided into at least (t+1) × (t+1) blocks, where t is the number of workers, so that the scheduler can always assign a "free" block. This scheduling lets workers process blocks asynchronously.
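A minimal sketch of such a scheduler, assuming one global lock guarding the shared row/column state (the get_job/put_job names follow the FPSGD description later in the deck; the rest is an illustrative assumption, not the authors' code):

```python
import threading

# Conflict-free block scheduling over a (t+1) x (t+1) grid for t workers.
class ConflictFreeScheduler:
    def __init__(self, grid_size):
        self.grid = grid_size
        self.lock = threading.Lock()          # the single scheduler lock
        self.busy_rows, self.busy_cols = set(), set()

    def get_job(self):
        # Return a block whose row and column no other worker is using.
        with self.lock:
            for i in range(self.grid):
                for j in range(self.grid):
                    if i not in self.busy_rows and j not in self.busy_cols:
                        self.busy_rows.add(i)
                        self.busy_cols.add(j)
                        return i, j
            return None  # no free block right now

    def put_job(self, i, j):
        # Release the row and column of a processed block.
        with self.lock:
            self.busy_rows.discard(i)
            self.busy_cols.discard(j)

s = ConflictFreeScheduler(5)          # (t+1) x (t+1) grid for t = 4 workers
jobs = [s.get_job() for _ in range(4)]
rows = [i for i, j in jobs]
cols = [j for i, j in jobs]
print(len(set(rows)) == 4 and len(set(cols)) == 4)  # True: no conflicts
```

With a 5 × 5 grid and 4 outstanding jobs, every assigned block occupies a distinct row and a distinct column, so no two workers ever touch the same p_u or q_v.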
Partial random method
● Select the ratings within a block in order.
● Select blocks at random.
Partial random method
Ordered rating selection means that at least either the rows or the columns are loaded onto the cache in order → workers can utilize hardware prefetch.
[Figure: the ratings in a block visited in order]
Summary of related work

                          cache utilization
            parallel      block selection    rating selection
HogWild!    async         -                  random
DSGD        synchronous   ordered            random
FPSGD       async         random             ordered
Scalability of FPSGD
FPSGD has a scalability problem: it cannot scale to higher core counts. The main reasons for the limited scalability are:
● the locking problem
● poor data locality across blocks
Locking problem
In the get_job function, the scheduler gets a free block and marks its row and column as "being processed". In the put_job function, the worker returns a processed block and the scheduler unmarks its row and column as "free". Both functions acquire the same single lock → poor scalability.
[Figure: worker T0 obtains a block; its row and column are marked as "being processed"]
Poor data locality across blocks
Workers select blocks at random → little opportunity to reuse P and Q across blocks.
Poor data locality across blocks
The memory traffic for processing all blocks once is:

    T_fpsgd = ( n_r / b^2 + k·n_u / b + k·n_i / b ) × b^2 × d
            = d ( n_r + b·k ( n_u + n_i ) )

where n_r is the number of ratings, n_u the number of users, n_i the number of items, b × b the number of blocks, k the latent dimension, and d the word size.
Proposal
We propose dcMF, a scalable divide-and-conquer method for MF using a task-parallel model.
Task parallel model
In a task-parallel model, parallelism is expressed via two operations:
● create_task: create a task
● sync_tasks: wait for the calling task's child tasks to finish
Example using task parallelism (fibonacci)
Regard each function call as a task; tasks are automatically distributed among workers.
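The original slide showed pseudocode that did not survive transcription; here is a sketch of the fibonacci example in Python, with each create_task modeled as an OS thread (a real task-parallel runtime such as MassiveThreads uses far lighter-weight tasks):

```python
import threading

def fib(n):
    # Each recursive call is a task: two child tasks are created
    # (create_task) and then waited on (sync_tasks).
    if n < 2:
        return n
    result = {}
    def child(key, m):
        result[key] = fib(m)
    t1 = threading.Thread(target=child, args=("a", n - 1))  # create_task
    t2 = threading.Thread(target=child, args=("b", n - 2))  # create_task
    t1.start()
    t2.start()
    t1.join()  # sync_tasks: wait for the child tasks to finish
    t2.join()
    return result["a"] + result["b"]

print(fib(10))  # 55
```

The point of the model is that the programmer only expresses the task tree; the runtime decides which worker runs which task.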
Overview of dcMF
In each recursion, dcMF:
● divides the block into 2 × 2 sub-blocks
● processes the two sub-blocks on one diagonal
● then processes the two sub-blocks on the other diagonal
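The recursion above can be sketched as follows. This is a pure-Python illustration: the leaf size and the update callback are made-up stand-ins for the real SGD kernel, and create_task/sync_tasks are modeled with one thread per parallel pair.

```python
import threading

LEAF = 2  # recurse until a block side is this small (illustrative)

def par(f, g):
    # Run f and g in parallel: create_task / sync_tasks, via a thread.
    t = threading.Thread(target=f)
    t.start()
    g()
    t.join()

def dcmf(r0, r1, c0, c1, update):
    # Recursively split the block [r0,r1) x [c0,c1) into 2x2 sub-blocks.
    if r1 - r0 <= LEAF or c1 - c0 <= LEAF:
        update(r0, r1, c0, c1)  # run SGD on the ratings in this block
        return
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    # First diagonal (top-left, bottom-right): disjoint rows and columns,
    # so the two sub-blocks can safely run in parallel.
    par(lambda: dcmf(r0, rm, c0, cm, update),
        lambda: dcmf(rm, r1, cm, c1, update))
    # Second diagonal (top-right, bottom-left).
    par(lambda: dcmf(r0, rm, cm, c1, update),
        lambda: dcmf(rm, r1, c0, cm, update))

visited = []
dcmf(0, 8, 0, 8, lambda *b: visited.append(b))
print(len(visited))  # 16 leaf blocks for an 8x8 region with LEAF = 2
```

Because the two blocks on a diagonal share no rows or columns, no locking on p_u or q_v is needed; the task structure itself encodes the conflict-freedom.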
Execution flow
The tasks created along the recursion form a tree structure. Tasks located in different rows and columns can be processed in parallel.
Overview of dcMF
The runtime system automatically distributes tasks to workers.
[Figure: workers 1 and 2 each run a task while the remaining tasks wait in queues]
Overview of dcMF
● Divide the block into two-by-two sub-blocks.
● Tasks on one diagonal are created.
● Tasks on the other diagonal are created.
● When a block gets small enough, p_u and q_v are updated.
[Figure: the recursion illustrated step by step on the rating matrix]
Advantages of the task parallel model
There are two advantages in dcMF:
● it gets rid of locking time: task parallel systems can handle task migration without a centralized data structure
● it reduces cache miss counts
Reducing cache miss counts
Task parallel systems split the task tree near its root and assign the subtrees to workers.
[Figure: the rating matrix R divided among workers 1-4]
Reducing cache miss counts
If there is no load balancing, each worker processes t blocks of size (1/t) × (1/t).
Reducing cache miss counts
Consider the blocks near a leaf of the task tree: all of these blocks are processed by the same worker.
[Figure: the worker loads the rows and columns of its first diagonal blocks onto the cache]
Reducing cache miss counts
After processing the blocks on one diagonal, the corresponding p_u and q_v are already on the cache.
[Figure: the rows and columns needed by the remaining blocks are already cached]
Reducing cache miss counts
Processing the remaining blocks requires no access to main memory.
[Figure: all rows and columns of the remaining blocks are already on the cache]
Advantages of the task parallel model
The memory traffic for processing all blocks once is:

    T_dcmf = d ( n_r + t·k ( n_u + n_i ) )    ( cf. T_fpsgd = d ( n_r + b·k ( n_u + n_i ) ) )

The total traffic of dcMF is less than that of FPSGD when b > t:

    T_fpsgd − T_dcmf = d ( n_r + b·k ( n_u + n_i ) ) − d ( n_r + t·k ( n_u + n_i ) )
                     = d·k ( b − t )( n_u + n_i ) > 0

where n_r is the number of ratings, n_u the number of users, n_i the number of items, t the number of threads, b × b the number of blocks, and d the word size.
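As a numeric sanity check on the two traffic formulas, a small comparison in Python (the dataset sizes below are made-up illustrative numbers, not measurements from the paper):

```python
# Traffic formulas from the analysis above; all sizes are illustrative.
def t_fpsgd(n_r, n_u, n_i, k, b, d):
    return d * (n_r + b * k * (n_u + n_i))

def t_dcmf(n_r, n_u, n_i, k, t, d):
    return d * (n_r + t * k * (n_u + n_i))

n_r, n_u, n_i = 10_000_000, 70_000, 10_000  # made-up dataset sizes
k, d = 40, 8                                # latent dim, word size (bytes)
t = 16                                      # threads
b = t + 1                                   # FPSGD needs >= (t+1) blocks per side

diff = t_fpsgd(n_r, n_u, n_i, k, b, d) - t_dcmf(n_r, n_u, n_i, k, t, d)
print(diff == d * k * (b - t) * (n_u + n_i))  # True: matches the derivation
```

Since FPSGD requires b >= t + 1 while dcMF's effective block count tracks the thread count t, the difference is always positive.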
Comparison with related work

                          cache utilization
            parallel      block selection      rating selection
HogWild!    async         -                    random
DSGD        synchronous   ordered              random
FPSGD       async         random               ordered
dcMF        async         partially ordered    ordered
Evaluation environment
CPU: four AMD Opteron 8354 (2.50 GHz) sockets
● each socket has 2 NUMA nodes; each NUMA node has 4 modules; each module has 2 cores → 64 cores in total
● L3 cache (6 MB) is shared by 4 modules; L2 cache (2 MB) is shared by 2 cores; L1 cache (16 KB) is per core
Evaluation environment
Implementation: we used the same code base as FPSGD. Task parallel library: MassiveThreads.
Evaluation datasets
We use three datasets: MovieLens10M, Netflix, and Yahoo!Music. We use the same parameter values as FPSGD for a fair comparison.
Experiments
We conducted three comparative experiments:
● scalability
● convergence speed: we use RMSE (root mean square error) as the convergence metric:

    RMSE = sqrt( (1/n_r) Σ_((u,v)∈R) ( r_(u,v) − r̂_(u,v) )^2 )

● L2 fill/writeback counts: the PAPI native event L2_CACHE_FILL_WRITEBACK is used
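The RMSE metric is straightforward to compute; a small sketch (the ratings and the constant predictor below are made-up for illustration):

```python
import math

# RMSE over the existing ratings, as defined above. `ratings` maps
# (user, item) -> observed rating; `predict` is any model, here a stub
# that always predicts 3.5, purely for illustration.
def rmse(ratings, predict):
    se = sum((r - predict(u, v)) ** 2 for (u, v), r in ratings.items())
    return math.sqrt(se / len(ratings))

ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 1): 4.0, (1, 2): 2.0}
print(rmse(ratings, lambda u, v: 3.5))
```

In the evaluation, `predict` would be the learned model r̂_(u,v) = p_u^T q_v on a held-out rating set.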
FPSGD++
We implemented FPSGD++ to overcome the scalability problem of FPSGD: it sets multiple locks on rows and columns instead of the single lock of FPSGD.
Pseudo code of FPSGD++
Workers can enter the same scope of the procedures concurrently; getting blocks can be done without a single giant lock.
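A sketch of this multiple-lock idea in Python (the per-row/per-column lock layout and the probing order are assumptions based on the slide's description, not the authors' actual code):

```python
import threading

# FPSGD++-style block acquisition: each block row and block column has
# its own lock, so workers touching different rows/columns never contend.
class FPSGDppScheduler:
    def __init__(self, grid_size):
        self.grid = grid_size
        self.row_locks = [threading.Lock() for _ in range(grid_size)]
        self.col_locks = [threading.Lock() for _ in range(grid_size)]

    def get_job(self):
        # Probe rows and columns with non-blocking acquires; only the
        # locks actually touched are ever contended.
        for i in range(self.grid):
            if not self.row_locks[i].acquire(blocking=False):
                continue  # row busy: another worker holds it
            for j in range(self.grid):
                if self.col_locks[j].acquire(blocking=False):
                    return i, j
            self.row_locks[i].release()  # no free column for this row
        return None

    def put_job(self, i, j):
        self.row_locks[i].release()
        self.col_locks[j].release()

s = FPSGDppScheduler(5)
a = s.get_job()   # (0, 0)
b = s.get_job()   # (1, 1): row 0 and column 0 are still held
print(a, b)
```

Compared with the single-lock scheduler, two workers acquiring blocks in disjoint rows and columns here proceed without serializing on any shared lock.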
Scalability comparison
[Plots: speedup vs. core count for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]
● dcMF scales up to 64 cores.
● FPSGD does not scale at high core counts, especially with MovieLens10M and Netflix.
● FPSGD++ scales better than FPSGD.
Scalability comparison
[Plots: lock waiting time vs. core count for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]
The lock waiting time of FPSGD grows at high core counts; this long locking time is the main cause of its poor scalability.
Convergence speed comparison
dcMF converges to a given RMSE faster than FPSGD.
L2 fill/writeback counts
[Plots: measured counts for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]
The measured counts of dcMF with MovieLens10M are lower than those of FPSGD, but none of the measured counts fit the estimates.
Conclusion
We proposed dcMF, which parallelizes matrix factorization using a task-parallel model. We showed that dcMF surpasses FPSGD in terms of:
● scalability
● convergence speed
Our implementation of dcMF is available at https://github.com/xxthermidorxx/cpmf
Future Work
● Find out why there is no significant difference in L2_CACHE_FILL_WRITEBACK for dcMF with Netflix.
● Investigate the effect of the number of blocks: theoretically, smaller blocks may contribute to the performance of dcMF.
● Find a way to apply dcMF in a distributed environment.
Thank you for listening!