Scalable Task-Parallel SGD on Matrix Factorization in Multicore Architectures

Yusuke Nishioka, Kenjiro Taura
Graduate School of Information Science and Technology, The University of Tokyo
ParLearning15, 15/05/29


Page 1: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures


Page 2: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 3: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 4: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Introduction

● Recommendation is an important technique, especially in e-commerce services
  ● e.g. Amazon, Netflix, …
● Service providers use recommendation to help users find items they may like

(https://amazon.com)

Page 5: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Recommendation

● There are two approaches to recommendation: content filtering and collaborative filtering
  ● content filtering: based on information about users and items
  ● collaborative filtering: based on correlations between users and items
● Collaborative filtering has been growing in popularity: it gives accurate predictions with less data

Page 6: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Collaborative Filtering

[Table: ratings from 1 (bad) to 5 (great) given by users A–D to four movies: Saving Private Ryan, Pirates of the Caribbean, Beauty and the Beast, Letters From Iwo Jima]

● Each entry in the table denotes a rating: 5 means great, 1 means terrible

Page 7: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Collaborative Filtering

[Table: the same rating table]

● Both user A and user C seem to like war movies

Page 8: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Collaborative Filtering

[Table: the same rating table, with the missing war-movie ratings of users A and C marked as predictions (4–5?)]

● Both user A and user C seem to like war movies
● Why not recommend other war movies to them? (predicted ratings: 4–5?)

Page 9: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Collaborative Filtering

[Table: the same rating table]

● Both user B and user D seem to like Disney movies

Page 10: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Collaborative Filtering

[Table: the same rating table, with the missing Disney-movie ratings of users B and D marked as predictions (4–5?)]

● Both user B and user D seem to like Disney movies
● Why not recommend other Disney movies to them? (predicted ratings: 4–5?)

Page 11: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Contribution

Our contributions are to:
● propose an alternative approach for parallel matrix factorization
● analyze the scalability problem of past work
● achieve better scalability than other methods

Page 12: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 13: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Matrix factorization

● Matrix factorization (MF) is one of the collaborative filtering algorithms
● It infers user-item relations from the ratings which users have given to items
● It decomposes the rating matrix into two low-rank latent matrices

Page 14: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

What's matrix factorization?

[Figure: the rating table of users A–D and items A–C is viewed as an m × n matrix R (m users, n items)]

Page 15: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

What's matrix factorization?

● The rating matrix R is modeled as the matrix product of P^T and Q:

$$ R \simeq P^\top Q, \qquad R \in \mathbb{R}^{m \times n},\; P \in \mathbb{R}^{k \times m},\; Q \in \mathbb{R}^{k \times n} $$

where m is the number of users, n the number of items, and k the dimension of the latent matrices.

Page 16: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

What's matrix factorization?

● Each entry r_{u,v} of the rating matrix R is modeled as the inner product of the corresponding row of P^T and column of Q:

$$ \hat{r}_{u,v} \simeq p_u^\top q_v $$

[Figure: entry r_{u,v} of R corresponds to row p_u^T of P^T and column q_v of Q]
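As a concrete illustration of this model, here is a minimal C++ sketch (our own, not the authors' code) that predicts one rating as the inner product of the two k-dimensional latent vectors:

```cpp
#include <vector>

// Predicted rating \hat{r}_{u,v} = p_u^T q_v for one user-item pair.
// p_u and q_v are the k-dimensional latent vectors of user u and item v.
double predict(const std::vector<double>& p_u,
               const std::vector<double>& q_v) {
    double r_hat = 0.0;
    for (std::size_t i = 0; i < p_u.size(); ++i)
        r_hat += p_u[i] * q_v[i];
    return r_hat;
}
```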

Page 17: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Optimization problem of MF

● The objective is to reduce the error between the predicted and actual values of the existing ratings
● The objective function is as follows:

$$ \min_{P,Q} \sum_{(u,v) \in R} \left( r_{u,v} - p_u^\top q_v \right)^2 + \lambda_P \|P\|_F^2 + \lambda_Q \|Q\|_F^2 $$

where r_{u,v} is an existing rating, p_u^T q_v the predicted rating, λ_P and λ_Q the regularization coefficients, and ‖·‖_F the Frobenius norm.
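To connect this objective to the update rules on the next two slides: differentiating the error term for a single rating with respect to p_u gives

$$ \frac{\partial}{\partial p_u} \left( r_{u,v} - p_u^\top q_v \right)^2 = -2 \left( r_{u,v} - p_u^\top q_v \right) q_v $$

and symmetrically for q_v; moving opposite to this gradient (plus the regularization term) yields the GD and SGD updates below, with the constant factor 2 absorbed into the learning rate γ.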

Page 18: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Gradient Descent Method

● One approach to this optimization problem is the gradient descent (GD) method
● Modify P and Q in the direction opposite to the gradients of the objective function:

$$ p_u \Leftarrow p_u + \gamma \left( \sum_{(u,v) \in R} (r_{u,v} - p_u^\top q_v)\, q_v - \lambda_P \sum_{u=1}^{m} p_u \right) $$

$$ q_v \Leftarrow q_v + \gamma \left( \sum_{(u,v) \in R} (r_{u,v} - p_u^\top q_v)\, p_u - \lambda_Q \sum_{v=1}^{n} q_v \right) $$

● GD needs to calculate the gradients in a batch → slow convergence

Page 19: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Stochastic Gradient Descent Method

● The stochastic gradient descent (SGD) method updates the corresponding p_u and q_v in an online manner
  ● faster convergence, with low memory consumption
● p_u and q_v are updated as follows:

$$ p_u \Leftarrow p_u + \gamma \left( (r_{u,v} - p_u^\top q_v)\, q_v - \lambda_P p_u \right) $$

$$ q_v \Leftarrow q_v + \gamma \left( (r_{u,v} - p_u^\top q_v)\, p_u - \lambda_Q q_v \right) $$

● SGD is inherently sequential → difficult to parallelize over multi-core CPUs
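As an illustration, a minimal C++ sketch of one SGD update on a single rating (our own names and structure, not the authors' code):

```cpp
#include <vector>

// One SGD update for a single observed rating r_uv.
// gamma is the learning rate; lambda_p and lambda_q the regularization
// coefficients. Both latent vectors are updated from the same error term.
void sgd_step(std::vector<double>& p_u, std::vector<double>& q_v,
              double r_uv, double gamma, double lambda_p, double lambda_q) {
    // err = r_uv - p_u^T q_v
    double err = r_uv;
    for (std::size_t i = 0; i < p_u.size(); ++i)
        err -= p_u[i] * q_v[i];
    // Apply both update rules simultaneously, using the old values.
    for (std::size_t i = 0; i < p_u.size(); ++i) {
        double pu = p_u[i], qv = q_v[i];
        p_u[i] += gamma * (err * qv - lambda_p * pu);
        q_v[i] += gamma * (err * pu - lambda_q * qv);
    }
}
```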

Page 20: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

GD vs SGD

             | GD                             | SGD
strong point | comparably easy to parallelize | fast convergence
weak point   | slow convergence               | difficult to parallelize

[The slide also shows pseudocode for both methods]

Page 21: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 22: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Related work

We review some parallel SGD algorithms for solving matrix factorization:
● HogWild!
● DSGD
● FPSGD

Page 23: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

HogWild! [Niu et al. 2011]

● HogWild! is a lock-free approach to parallelizing MF
● Workers process ratings in R independently
● Convergence is guaranteed despite the possibility that updates may be overwritten by other workers

[Figure: workers 1–3 update R concurrently; some updates are overwritten by other workers]

Page 24: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Problems of HogWild!

● Random rating selection over the whole of R
  ● cannot take advantage of hardware prefetch

[Figure: randomly scattered accesses to R, P^T, and Q]

Page 25: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

DSGD [Gemulla et al. 2011]

● DSGD divides R into t × t blocks and assigns them to t workers
● Workers move on to the next block simultaneously → synchronous parallel

Page 26: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Problems of DSGD

● Random rating selection within a block
  ● cannot take advantage of hardware prefetch
● Synchronous parallelism: performance depends on the slowest worker

[Figure: workers 1–4 process blocks between synchronization barriers]

Page 27: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

FPSGD [Zhuang et al. 2013]

● FPSGD is a state-of-the-art parallel algorithm for MF
● It intends to overcome the bottlenecks of HogWild! and DSGD by introducing the following two techniques:
  ● conflict-free scheduling
  ● partial random method

Page 28: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Conflict-free scheduling

● Take a look at a specific example: 6 × 6 blocks, 4 workers T0–T3
● First, the scheduler assigns each worker a block which shares neither a row nor a column with the blocks of the other workers

[Figure: R divided into a 6 × 6 grid of blocks (rows and columns numbered 0–5); T0–T3 each occupy a block on the diagonal]

Page 29: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Conflict-free scheduling

● The rows and columns being processed by other workers are blocked, so that no two workers update the same row or column
● When worker T0 finishes processing its block, the scheduler assigns it one of the free blocks (marked with ○)

[Figure: the same 6 × 6 grid; the blocks T0 may take next are marked with ○]

Page 30: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Conflict-free scheduling

● Worker T0 moves from block (0,0) to block (4,5)

[Figure: the same 6 × 6 grid after T0 moves to block (4,5)]

Page 31: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Conflict-free scheduling

● In conflict-free scheduling, the rating matrix must be divided into at least (t+1) × (t+1) blocks, where t is the number of workers, so that the scheduler can always assign a "free" block
● Thanks to this scheduling, workers can process blocks asynchronously

Page 32: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Partial random method

● Select blocks randomly
● Select the ratings within a block in order

Page 33: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Partial random method

● Ordered rating selection means loading at least either the rows or the columns into the cache in order
  ● workers can utilize hardware prefetch

[Figure: the ratings in a block are visited in order]

Page 34: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Summary of related work

method   | parallelism | block selection | rating selection
HogWild! | async       | –               | random
DSGD     | synchronous | ordered         | random
FPSGD    | async       | random          | ordered

(block selection and rating selection determine cache utilization)

Page 35: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 36: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Scalability of FPSGD

● FPSGD has a scalability problem
  ● it cannot scale to higher core counts
● The main reasons for the limited scalability are:
  ● the locking problem
  ● poor data locality across blocks

Page 37: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Locking problem

● In the get_job function, the scheduler picks a free block and marks its row and column as "being processed"
● In the put_job function, the worker returns the processed block and marks its row and column as "free" again
● Both functions acquire the same single lock → poor scalability (a sketch of this pattern follows below)

[Figure: worker T0 takes a block; its row and column are marked "being processed"]
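A hypothetical C++ sketch of this single-lock pattern (our construction to illustrate the bottleneck, not FPSGD's actual code):

```cpp
#include <mutex>
#include <vector>

// Every worker funnels through the same mutex both when taking a block
// (get_job) and when returning it (put_job), so at high core counts the
// scheduler itself becomes the bottleneck.
struct SingleLockScheduler {
    std::mutex m;                         // the single lock
    std::vector<bool> row_busy, col_busy; // rows/columns "being processed"

    explicit SingleLockScheduler(int nblocks)
        : row_busy(nblocks, false), col_busy(nblocks, false) {}

    // Find a block whose row and column are both free, and mark them busy.
    bool get_job(int& row, int& col) {
        std::lock_guard<std::mutex> g(m);
        for (std::size_t r = 0; r < row_busy.size(); ++r)
            for (std::size_t c = 0; c < col_busy.size(); ++c)
                if (!row_busy[r] && !col_busy[c]) {
                    row_busy[r] = col_busy[c] = true;
                    row = static_cast<int>(r);
                    col = static_cast<int>(c);
                    return true;
                }
        return false;   // no conflict-free block available right now
    }

    // Return a processed block and mark its row and column free again.
    void put_job(int row, int col) {
        std::lock_guard<std::mutex> g(m);
        row_busy[row] = col_busy[col] = false;
    }
};
```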

Page 38: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Poor data locality across blocks

● Workers select blocks randomly → little opportunity to reuse P and Q across blocks

Page 39: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Poor data locality across blocks

● The memory traffic for processing all blocks once is:

$$ T_{fpsgd} = \left( \frac{n_r}{b^2} + k \frac{n_u}{b} + k \frac{n_i}{b} \right) \times b^2 \times d = d \left( n_r + b k (n_u + n_i) \right) $$

where n_r is the number of ratings, n_u the number of users, n_i the number of items, b × b the number of blocks, and d the word size.

Page 40: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 41: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Proposal

We propose dcMF, a scalable divide-and-conquer method for MF using the task parallel model

Page 42: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Task parallel model

● In the task parallel model, parallelism is expressed via two operations:
  ● create_task: creates a task
  ● sync_tasks: waits for the calling task's child tasks to finish

Page 43: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Example using task parallel (fibonacci)

● Regard each function call as a task
● Tasks are automatically distributed among workers (see the sketch below)
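The slide's code listing is not preserved in this transcript; the following minimal C++ sketch captures the same idea, using std::async as a stand-in for create_task and future::get for sync_tasks (a task parallel runtime such as MassiveThreads provides analogous primitives):

```cpp
#include <future>
#include <iostream>

// Task-parallel fibonacci: each recursive call can become a task.
long fib(int n) {
    if (n < 2) return n;
    if (n < 16) return fib(n - 1) + fib(n - 2);  // sequential cutoff
    // create_task: spawn fib(n-1) as a child task
    std::future<long> child = std::async(std::launch::async, fib, n - 1);
    long y = fib(n - 2);        // the parent keeps working on fib(n-2)
    return child.get() + y;     // sync_tasks: wait for the child to finish
}

int main() {
    std::cout << fib(25) << std::endl;  // prints 75025
    return 0;
}
```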

Page 44: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● In each recursion:
  ● divide the block into 2-by-2 sub-blocks
  ● process the two sub-blocks on one diagonal
  ● then process the two sub-blocks on the other diagonal

Page 45: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Execution flow

● The tasks created along the recursion form a tree structure
● Tasks located in different columns and rows can be processed in parallel

Page 46: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● The runtime system automatically distributes tasks to workers

[Figure: a task tree over R; running tasks are assigned to Worker 1 and Worker 2 while the remaining tasks wait]

Page 47: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● Divide the block into 2-by-2 sub-blocks

Page 48: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● Divide the block into 2-by-2 sub-blocks
● Tasks on one diagonal are created

Page 49: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● Divide the block into 2-by-2 sub-blocks
● Tasks on one diagonal are created
● Tasks on the other diagonal are created

Page 50: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● Divide the block into 2-by-2 sub-blocks
● Tasks on one diagonal are created
● Tasks on the other diagonal are created
● When the block gets small enough, p_u and q_v are updated (see the recursive sketch below)
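A minimal C++ sketch of this recursion (our construction, using std::async as the task primitive and a hypothetical sgd_on_block leaf routine; the authors' dcMF is built on MassiveThreads and may differ in detail):

```cpp
#include <future>

// A rectangular region of the rating matrix R (half-open index ranges).
struct Block { int row_lo, row_hi, col_lo, col_hi; };

// Hypothetical leaf routine: run sequential SGD over the ratings in b,
// updating the corresponding p_u and q_v (body omitted in this sketch).
void sgd_on_block(Block b) { (void)b; }

// Recursive divide-and-conquer: blocks on the same diagonal touch disjoint
// rows and columns, so they can be processed in parallel without conflicts.
void dcmf(Block b, int leaf_size) {
    if (b.row_hi - b.row_lo <= leaf_size || b.col_hi - b.col_lo <= leaf_size) {
        sgd_on_block(b);          // small enough: update p_u and q_v
        return;
    }
    int rm = (b.row_lo + b.row_hi) / 2;
    int cm = (b.col_lo + b.col_hi) / 2;
    Block b00{b.row_lo, rm, b.col_lo, cm}, b11{rm, b.row_hi, cm, b.col_hi};
    Block b01{b.row_lo, rm, cm, b.col_hi}, b10{rm, b.row_hi, b.col_lo, cm};
    // First diagonal: b00 and b11 in parallel (create_task / sync_tasks).
    auto t1 = std::async(std::launch::async, dcmf, b00, leaf_size);
    dcmf(b11, leaf_size);
    t1.get();
    // Second diagonal: b01 and b10 in parallel.
    auto t2 = std::async(std::launch::async, dcmf, b01, leaf_size);
    dcmf(b10, leaf_size);
    t2.get();
}
```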

Page 51: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Advantages of the task parallel model

● There are two advantages to dcMF:
  ● it gets rid of locking time
    ● task parallel systems can handle task migration without a centralized data structure
  ● it reduces cache miss counts

Page 52: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Reducing cache miss counts

● Task parallel systems split the task tree near its root and assign the subtrees to workers

[Figure: the task tree over R is split into four subtrees, one per worker]

Page 53: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Reducing cache miss counts

● If there is no load balancing, each worker processes t blocks, each of size (1/t) × (1/t)

[Figure: R divided among workers 1–4]

Page 54: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Reducing cache miss counts

● Consider the blocks around a leaf: all of them are processed by the same worker

[Figure: the worker loads the corresponding parts of P and Q into the cache]

Page 55: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Reducing cache miss counts

● After the worker processes the blocks on one diagonal, the corresponding p_u and q_v are already in the cache

[Figure: the rows and columns loaded for the finished diagonal blocks remain in the cache]

Page 56: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Reducing cache miss counts

● Processing the remaining blocks requires no access to main memory

[Figure: everything the remaining blocks need is already in the cache]

Page 57: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Advantages of the task parallel model

● The memory traffic for processing all blocks once is:

$$ T_{dcmf} = d \left( n_r + t k (n_u + n_i) \right) \qquad \left( T_{fpsgd} = d \left( n_r + b k (n_u + n_i) \right) \right) $$

where n_r is the number of ratings, n_u the number of users, n_i the number of items, t the number of threads, b × b the number of blocks, and d the word size.

● The total traffic of dcMF is less than that of FPSGD whenever b > t:

$$ T_{fpsgd} - T_{dcmf} = d \left( n_r + b k (n_u + n_i) \right) - d \left( n_r + t k (n_u + n_i) \right) = d k (b - t)(n_u + n_i) > 0 $$
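For a rough sense of scale, plugging in hypothetical values (ours, not from the paper) of d = 4 bytes, k = 100, n_u + n_i = 1.1 × 10^6, b = 128, and t = 64:

$$ d k (b - t)(n_u + n_i) = 4 \cdot 100 \cdot 64 \cdot 1.1 \times 10^6 \approx 2.8 \times 10^{10} \text{ bytes} $$

i.e. roughly 28 GB of extra traffic per full pass over the blocks for FPSGD under these assumptions.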

Page 58: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Comparison with related work

method   | parallelism | block selection   | rating selection
HogWild! | async       | –                 | random
DSGD     | synchronous | ordered           | random
FPSGD    | async       | random            | ordered
dcMF     | async       | partially ordered | ordered

Page 59: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 60: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Evaluation environment

● CPU: four AMD Opteron 8354 (2.50 GHz) sockets
  ● each socket has 2 NUMA nodes
  ● each NUMA node has 4 modules
  ● each module has 2 cores
  ● total core count: 64
● L3 cache (6 MB) is shared by 4 modules
● L2 cache (2 MB) is shared by 2 cores
● L1 cache (16 KB) is private to each core

Page 61: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Evaluation environment

● Implementation: we used the same code base as FPSGD
● Task parallel library: MassiveThreads

Page 62: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Evaluation dataset

● We use three datasets: MovieLens10M, Netflix, and Yahoo!Music
● We use the same parameter values as FPSGD for a fair comparison

Page 63: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Experiments

We conducted three comparative experiments:
● scalability
● convergence speed
  ● we use RMSE (root mean square error) as the convergence metric:

$$ \sqrt{ \frac{1}{n_r} \sum_{(u,v) \in R} \left( r_{u,v} - \hat{r}_{u,v} \right)^2 } $$

● L2 fill/writeback counts
  ● the PAPI native event L2_CACHE_FILL_WRITEBACK is used
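For reference, a minimal C++ sketch of this metric (our own helper, not the evaluation code):

```cpp
#include <cmath>
#include <vector>

// RMSE over the n_r observed ratings, given the actual values r and the
// predictions r_hat (same length).
double rmse(const std::vector<double>& r, const std::vector<double>& r_hat) {
    double sum = 0.0;
    for (std::size_t i = 0; i < r.size(); ++i) {
        double e = r[i] - r_hat[i];
        sum += e * e;
    }
    return std::sqrt(sum / r.size());
}
```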

Page 64: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

FPSGD++

● We implemented FPSGD++ to overcome the scalability problem of FPSGD
● It sets multiple locks, on both rows and columns, instead of FPSGD's single lock

Page 65: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Pseudo code of FPSGD++

● Workers can enter the same scope of the procedures concurrently
● Getting blocks can be done without a single giant lock (see the sketch below)

Page 66: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Scalability comparison

● dcMF scales up to 64 cores
● FPSGD does not scale at high core counts, especially with MovieLens10M and Netflix
● FPSGD++ scales better than FPSGD

[Figure: speedup curves for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]

Page 67: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Scalability comparison

● The lock waiting time of FPSGD gets longer at high core counts
● This long locking time is the main cause of its poor scalability

[Figure: lock waiting time for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]

Page 68: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Convergence speed comparison

● dcMF converges to a given RMSE faster than FPSGD

Page 69: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

L2 fill/writeback counts

● The measured counts of dcMF with MovieLens10M are lower than those of FPSGD
● However, none of the measured counts fit the estimates

[Figure: L2 fill/writeback counts for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]

Page 70: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 71: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Conclusion

● We proposed dcMF, which parallelizes matrix factorization using the task parallel model
● We showed that dcMF surpasses FPSGD in terms of:
  ● scalability
  ● convergence speed
● Our implementation of dcMF is available at https://github.com/xxthermidorxx/cpmf

Page 72: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Future Work

● Find out why there is no significant difference in L2_CACHE_FILL_WRITEBACK for dcMF with Netflix
● Investigate the effect of the number of blocks
  ● theoretically, smaller blocks may improve the performance of dcMF
● Find a way to apply dcMF in a distributed environment

Page 73: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures


Thank you for listening!