Scalable Task-Parallel SGD on Matrix Factorization in Multicore Architectures

Yusuke Nishioka, Kenjiro Taura
Graduate School of Information Science and Technology, The University of Tokyo
ParLearning15, 15/05/29


Page 1: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures


Page 2: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 3: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 4: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Introduction

● Recommendation is an important technique, especially in e-commerce services
  ● e.g. Amazon, Netflix, …
● Service providers use recommendation to help users find items they may like

(https://amazon.com)

Page 5: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Recommendation

● There are two approaches to recommendation: content filtering and collaborative filtering
  ● content filtering: based on information about users and items
  ● collaborative filtering: based on correlations between users and items
● Collaborative filtering has been growing in popularity: it gives accurate predictions with less data

Page 6: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Collaborative Filtering

[Table: ratings from 1 (bad) to 5 (great) given by users A–D to four movies: Saving Private Ryan, Pirates of the Caribbean, Beauty and the Beast, Letters From Iwo Jima]

● Each entry in the table denotes a rating: 5 means great, 1 means terrible

Page 7: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Collaborative Filtering

[Table: the same rating table]

● Both user A and user C seem to like war movies

Page 8: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Collaborative Filtering

[Table: the same rating table, with the missing war-movie ratings of users A and C marked as predictions (4–5?)]

● Both user A and user C seem to like war movies
● Why not recommend other war movies to them? (predicted ratings: 4–5?)

Page 9: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Collaborative Filtering

[Table: the same rating table]

● Both user B and user D seem to like Disney movies

Page 10: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Collaborative Filtering

[Table: the same rating table, with the missing Disney-movie ratings of users B and D marked as predictions (4–5?)]

● Both user B and user D seem to like Disney movies
● Why not recommend other Disney movies to them? (predicted ratings: 4–5?)

Page 11: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Contribution

Our contributions are to:
● propose an alternative approach for parallel matrix factorization
● analyze the scalability problem of past work
● achieve better scalability than other methods

Page 12: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 13: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Matrix factorization

● Matrix factorization (MF) is one of the collaborative filtering algorithms
● It infers user-item relations from the ratings which users have given to items
● It decomposes the rating matrix into two low-rank latent matrices

Page 14: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

What's matrix factorization?

[Figure: the rating table of users A–D and items A–C is viewed as an m × n matrix R (m users, n items)]

Page 15: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

What's matrix factorization?

● The rating matrix R is modeled as the matrix product of P^T and Q:

$$ R \simeq P^\top Q, \qquad R \in \mathbb{R}^{m \times n},\; P \in \mathbb{R}^{k \times m},\; Q \in \mathbb{R}^{k \times n} $$

where m is the number of users, n the number of items, and k the dimension of the latent matrices.

Page 16: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

What's matrix factorization?

● Each entry r_{u,v} of the rating matrix R is modeled as the inner product of the corresponding row of P^T and column of Q:

$$ \hat{r}_{u,v} \simeq p_u^\top q_v $$

[Figure: entry r_{u,v} of R corresponds to row p_u^T of P^T and column q_v of Q]
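As a concrete illustration of this model, here is a minimal C++ sketch (our own, not the authors' code) that predicts one rating as the inner product of the two k-dimensional latent vectors:

```cpp
#include <vector>

// Predicted rating \hat{r}_{u,v} = p_u^T q_v for one user-item pair.
// p_u and q_v are the k-dimensional latent vectors of user u and item v.
double predict(const std::vector<double>& p_u,
               const std::vector<double>& q_v) {
    double r_hat = 0.0;
    for (std::size_t i = 0; i < p_u.size(); ++i)
        r_hat += p_u[i] * q_v[i];
    return r_hat;
}
```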

Page 17: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Optimization problem of MF

● The objective is to reduce the error between the predicted and actual values of the existing ratings
● The objective function is as follows:

$$ \min_{P,Q} \sum_{(u,v) \in R} \left( r_{u,v} - p_u^\top q_v \right)^2 + \lambda_P \|P\|_F^2 + \lambda_Q \|Q\|_F^2 $$

where r_{u,v} is an existing rating, p_u^T q_v the predicted rating, λ_P and λ_Q the regularization coefficients, and ‖·‖_F the Frobenius norm.
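To connect this objective to the update rules on the next two slides: differentiating the error term for a single rating with respect to p_u gives

$$ \frac{\partial}{\partial p_u} \left( r_{u,v} - p_u^\top q_v \right)^2 = -2 \left( r_{u,v} - p_u^\top q_v \right) q_v $$

and symmetrically for q_v; moving opposite to this gradient (plus the regularization term) yields the GD and SGD updates below, with the constant factor 2 absorbed into the learning rate γ.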

Page 18: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Gradient Descent Method

● One approach to this optimization problem is the gradient descent (GD) method
● Modify P and Q in the direction opposite to the gradients of the objective function:

$$ p_u \Leftarrow p_u + \gamma \left( \sum_{(u,v) \in R} (r_{u,v} - p_u^\top q_v)\, q_v - \lambda_P \sum_{u=1}^{m} p_u \right) $$

$$ q_v \Leftarrow q_v + \gamma \left( \sum_{(u,v) \in R} (r_{u,v} - p_u^\top q_v)\, p_u - \lambda_Q \sum_{v=1}^{n} q_v \right) $$

● GD needs to calculate the gradients in a batch → slow convergence

Page 19: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Stochastic Gradient Descent Method

● The stochastic gradient descent (SGD) method updates the corresponding p_u and q_v in an online manner
  ● faster convergence, with low memory consumption
● p_u and q_v are updated as follows:

$$ p_u \Leftarrow p_u + \gamma \left( (r_{u,v} - p_u^\top q_v)\, q_v - \lambda_P p_u \right) $$

$$ q_v \Leftarrow q_v + \gamma \left( (r_{u,v} - p_u^\top q_v)\, p_u - \lambda_Q q_v \right) $$

● SGD is inherently sequential → difficult to parallelize over multi-core CPUs
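As an illustration, a minimal C++ sketch of one SGD update on a single rating (our own names and structure, not the authors' code):

```cpp
#include <vector>

// One SGD update for a single observed rating r_uv.
// gamma is the learning rate; lambda_p and lambda_q the regularization
// coefficients. Both latent vectors are updated from the same error term.
void sgd_step(std::vector<double>& p_u, std::vector<double>& q_v,
              double r_uv, double gamma, double lambda_p, double lambda_q) {
    // err = r_uv - p_u^T q_v
    double err = r_uv;
    for (std::size_t i = 0; i < p_u.size(); ++i)
        err -= p_u[i] * q_v[i];
    // Apply both update rules simultaneously, using the old values.
    for (std::size_t i = 0; i < p_u.size(); ++i) {
        double pu = p_u[i], qv = q_v[i];
        p_u[i] += gamma * (err * qv - lambda_p * pu);
        q_v[i] += gamma * (err * pu - lambda_q * qv);
    }
}
```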

Page 20: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

GD vs SGD

             | GD                             | SGD
strong point | comparably easy to parallelize | fast convergence
weak point   | slow convergence               | difficult to parallelize

[The slide also shows pseudocode for both methods]

Page 21: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 22: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Related work

We review some parallel SGD algorithms for solving matrix factorization:
● HogWild!
● DSGD
● FPSGD

Page 23: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

HogWild! [Niu et al. 2011]

● HogWild! is a lock-free approach to parallelizing MF
● Workers process ratings in R independently
● Convergence is guaranteed despite the possibility that updates may be overwritten by other workers

[Figure: workers 1–3 update R concurrently; some updates are overwritten by other workers]

Page 24: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Problems of HogWild!

● Random rating selection over the whole of R
  ● cannot take advantage of hardware prefetch

[Figure: randomly scattered accesses to R, P^T, and Q]

Page 25: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

DSGD [Gemulla et al. 2011]

● DSGD divides R into t × t blocks and assigns them to t workers
● Workers move on to the next block simultaneously → synchronous parallel

Page 26: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Problems of DSGD

● Random rating selection within a block
  ● cannot take advantage of hardware prefetch
● Synchronous parallelism: performance depends on the slowest worker

[Figure: workers 1–4 process blocks between synchronization barriers]

Page 27: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

FPSGD [Zhuang et al. 2013]

● FPSGD is a state-of-the-art parallel algorithm for MF
● It intends to overcome the bottlenecks of HogWild! and DSGD by introducing the following two techniques:
  ● conflict-free scheduling
  ● partial random method

Page 28: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Conflict-free scheduling

● Take a look at a specific example: 6 × 6 blocks, 4 workers T0–T3
● First, the scheduler assigns each worker a block which shares neither a row nor a column with the blocks of the other workers

[Figure: R divided into a 6 × 6 grid of blocks (rows and columns numbered 0–5); T0–T3 each occupy a block on the diagonal]

Page 29: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Conflict-free scheduling

● The rows and columns being processed by other workers are blocked, so that no two workers update the same row or column
● When worker T0 finishes processing its block, the scheduler assigns it one of the free blocks (marked with ○)

[Figure: the same 6 × 6 grid; the blocks T0 may take next are marked with ○]

Page 30: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Conflict-free scheduling

● Worker T0 moves from block (0,0) to block (4,5)

[Figure: the same 6 × 6 grid after T0 moves to block (4,5)]

Page 31: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Conflict-free scheduling

● In conflict-free scheduling, the rating matrix must be divided into at least (t+1) × (t+1) blocks, where t is the number of workers, so that the scheduler can always assign a "free" block
● Thanks to this scheduling, workers can process blocks asynchronously

Page 32: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Partial random method

● Select blocks randomly
● Select the ratings within a block in order

Page 33: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Partial random method

● Ordered rating selection means loading at least either the rows or the columns into the cache in order
  ● workers can utilize hardware prefetch

[Figure: the ratings in a block are visited in order]

Page 34: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Summary of related work

method   | parallelism | block selection | rating selection
HogWild! | async       | –               | random
DSGD     | synchronous | ordered         | random
FPSGD    | async       | random          | ordered

(block selection and rating selection determine cache utilization)

Page 35: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 36: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Scalability of FPSGD

● FPSGD has a scalability problem
  ● it cannot scale to higher core counts
● The main reasons for the limited scalability are:
  ● the locking problem
  ● poor data locality across blocks

Page 37: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Locking problem

● In the get_job function, the scheduler picks a free block and marks its row and column as "being processed"
● In the put_job function, the worker returns the processed block and marks its row and column as "free" again
● Both functions acquire the same single lock → poor scalability (a sketch of this pattern follows below)

[Figure: worker T0 takes a block; its row and column are marked "being processed"]
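A hypothetical C++ sketch of this single-lock pattern (our construction to illustrate the bottleneck, not FPSGD's actual code):

```cpp
#include <mutex>
#include <vector>

// Every worker funnels through the same mutex both when taking a block
// (get_job) and when returning it (put_job), so at high core counts the
// scheduler itself becomes the bottleneck.
struct SingleLockScheduler {
    std::mutex m;                         // the single lock
    std::vector<bool> row_busy, col_busy; // rows/columns "being processed"

    explicit SingleLockScheduler(int nblocks)
        : row_busy(nblocks, false), col_busy(nblocks, false) {}

    // Find a block whose row and column are both free, and mark them busy.
    bool get_job(int& row, int& col) {
        std::lock_guard<std::mutex> g(m);
        for (std::size_t r = 0; r < row_busy.size(); ++r)
            for (std::size_t c = 0; c < col_busy.size(); ++c)
                if (!row_busy[r] && !col_busy[c]) {
                    row_busy[r] = col_busy[c] = true;
                    row = static_cast<int>(r);
                    col = static_cast<int>(c);
                    return true;
                }
        return false;   // no conflict-free block available right now
    }

    // Return a processed block and mark its row and column free again.
    void put_job(int row, int col) {
        std::lock_guard<std::mutex> g(m);
        row_busy[row] = col_busy[col] = false;
    }
};
```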

Page 38: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Poor data locality across blocks

● Workers select blocks randomly → little opportunity to reuse P and Q across blocks

Page 39: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Poor data locality across blocks

● The memory traffic for processing all blocks once is:

$$ T_{fpsgd} = \left( \frac{n_r}{b^2} + k \frac{n_u}{b} + k \frac{n_i}{b} \right) \times b^2 \times d = d \left( n_r + b k (n_u + n_i) \right) $$

where n_r is the number of ratings, n_u the number of users, n_i the number of items, b × b the number of blocks, and d the word size.

Page 40: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 41: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Proposal

We propose dcMF, a scalable divide-and-conquer method for MF using the task parallel model

Page 42: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Task parallel model

● In the task parallel model, parallelism is expressed via two operations:
  ● create_task: creates a task
  ● sync_tasks: waits for the calling task's child tasks to finish

Page 43: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Example using task parallel (fibonacci)

● Regard each function call as a task
● Tasks are automatically distributed among workers (see the sketch below)
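The slide's code listing is not preserved in this transcript; the following minimal C++ sketch captures the same idea, using std::async as a stand-in for create_task and future::get for sync_tasks (a task parallel runtime such as MassiveThreads provides analogous primitives):

```cpp
#include <future>
#include <iostream>

// Task-parallel fibonacci: each recursive call can become a task.
long fib(int n) {
    if (n < 2) return n;
    if (n < 16) return fib(n - 1) + fib(n - 2);  // sequential cutoff
    // create_task: spawn fib(n-1) as a child task
    std::future<long> child = std::async(std::launch::async, fib, n - 1);
    long y = fib(n - 2);        // the parent keeps working on fib(n-2)
    return child.get() + y;     // sync_tasks: wait for the child to finish
}

int main() {
    std::cout << fib(25) << std::endl;  // prints 75025
    return 0;
}
```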

Page 44: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● In each recursion:
  ● divide the block into 2-by-2 sub-blocks
  ● process the two sub-blocks on one diagonal
  ● then process the two sub-blocks on the other diagonal

Page 45: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Execution flow

● The tasks created along the recursion form a tree structure
● Tasks located in different columns and rows can be processed in parallel

Page 46: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● The runtime system automatically distributes tasks to workers

[Figure: a task tree over R; running tasks are assigned to Worker 1 and Worker 2 while the remaining tasks wait]

Page 47: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● Divide the block into 2-by-2 sub-blocks

Page 48: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● Divide the block into 2-by-2 sub-blocks
● Tasks on one diagonal are created

Page 49: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● Divide the block into 2-by-2 sub-blocks
● Tasks on one diagonal are created
● Tasks on the other diagonal are created

Page 50: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Overview of dcMF

● Divide the block into 2-by-2 sub-blocks
● Tasks on one diagonal are created
● Tasks on the other diagonal are created
● When the block gets small enough, p_u and q_v are updated (see the recursive sketch below)
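A minimal C++ sketch of this recursion (our construction, using std::async as the task primitive and a hypothetical sgd_on_block leaf routine; the authors' dcMF is built on MassiveThreads and may differ in detail):

```cpp
#include <future>

// A rectangular region of the rating matrix R (half-open index ranges).
struct Block { int row_lo, row_hi, col_lo, col_hi; };

// Hypothetical leaf routine: run sequential SGD over the ratings in b,
// updating the corresponding p_u and q_v (body omitted in this sketch).
void sgd_on_block(Block b) { (void)b; }

// Recursive divide-and-conquer: blocks on the same diagonal touch disjoint
// rows and columns, so they can be processed in parallel without conflicts.
void dcmf(Block b, int leaf_size) {
    if (b.row_hi - b.row_lo <= leaf_size || b.col_hi - b.col_lo <= leaf_size) {
        sgd_on_block(b);          // small enough: update p_u and q_v
        return;
    }
    int rm = (b.row_lo + b.row_hi) / 2;
    int cm = (b.col_lo + b.col_hi) / 2;
    Block b00{b.row_lo, rm, b.col_lo, cm}, b11{rm, b.row_hi, cm, b.col_hi};
    Block b01{b.row_lo, rm, cm, b.col_hi}, b10{rm, b.row_hi, b.col_lo, cm};
    // First diagonal: b00 and b11 in parallel (create_task / sync_tasks).
    auto t1 = std::async(std::launch::async, dcmf, b00, leaf_size);
    dcmf(b11, leaf_size);
    t1.get();
    // Second diagonal: b01 and b10 in parallel.
    auto t2 = std::async(std::launch::async, dcmf, b01, leaf_size);
    dcmf(b10, leaf_size);
    t2.get();
}
```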

Page 51: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Advantages of the task parallel model

● There are two advantages to dcMF:
  ● it gets rid of locking time
    ● task parallel systems can handle task migration without a centralized data structure
  ● it reduces cache miss counts

Page 52: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Reducing cache miss counts

● Task parallel systems split the task tree near its root and assign the subtrees to workers

[Figure: the task tree over R is split into four subtrees, one per worker]

Page 53: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Reducing cache miss counts

● If there is no load balancing, each worker processes t blocks, each of size (1/t) × (1/t)

[Figure: R divided among workers 1–4]

Page 54: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Reducing cache miss counts

● Consider the blocks around a leaf: all of them are processed by the same worker

[Figure: the worker loads the corresponding parts of P and Q into the cache]

Page 55: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Reducing cache miss counts

● After the worker processes the blocks on one diagonal, the corresponding p_u and q_v are already in the cache

[Figure: the rows and columns loaded for the finished diagonal blocks remain in the cache]

Page 56: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Reducing cache miss counts

● Processing the remaining blocks requires no access to main memory

[Figure: everything the remaining blocks need is already in the cache]

Page 57: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Advantages of the task parallel model

● The memory traffic for processing all blocks once is:

$$ T_{dcmf} = d \left( n_r + t k (n_u + n_i) \right) \qquad \left( T_{fpsgd} = d \left( n_r + b k (n_u + n_i) \right) \right) $$

where n_r is the number of ratings, n_u the number of users, n_i the number of items, t the number of threads, b × b the number of blocks, and d the word size.

● The total traffic of dcMF is less than that of FPSGD whenever b > t:

$$ T_{fpsgd} - T_{dcmf} = d \left( n_r + b k (n_u + n_i) \right) - d \left( n_r + t k (n_u + n_i) \right) = d k (b - t)(n_u + n_i) > 0 $$
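For a rough sense of scale, plugging in hypothetical values (ours, not from the paper) of d = 4 bytes, k = 100, n_u + n_i = 1.1 × 10^6, b = 128, and t = 64:

$$ d k (b - t)(n_u + n_i) = 4 \cdot 100 \cdot 64 \cdot 1.1 \times 10^6 \approx 2.8 \times 10^{10} \text{ bytes} $$

i.e. roughly 28 GB of extra traffic per full pass over the blocks for FPSGD under these assumptions.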

Page 58: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Comparison with related work

method   | parallelism | block selection   | rating selection
HogWild! | async       | –                 | random
DSGD     | synchronous | ordered           | random
FPSGD    | async       | random            | ordered
dcMF     | async       | partially ordered | ordered

Page 59: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 60: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Evaluation environment

● CPU: four AMD Opteron 8354 (2.50 GHz) sockets
  ● each socket has 2 NUMA nodes
  ● each NUMA node has 4 modules
  ● each module has 2 cores
  ● total core count: 64
● L3 cache (6 MB) is shared by 4 modules
● L2 cache (2 MB) is shared by 2 cores
● L1 cache (16 KB) is private to each core

Page 61: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Evaluation environment

● Implementation: we used the same code base as FPSGD
● Task parallel library: MassiveThreads

Page 62: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Evaluation dataset

● We use three datasets: MovieLens10M, Netflix, and Yahoo!Music
● We use the same parameter values as FPSGD for a fair comparison

Page 63: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Experiments

We conducted three comparative experiments:
● scalability
● convergence speed
  ● we use RMSE (root mean square error) as the convergence metric:

$$ \sqrt{ \frac{1}{n_r} \sum_{(u,v) \in R} \left( r_{u,v} - \hat{r}_{u,v} \right)^2 } $$

● L2 fill/writeback counts
  ● the PAPI native event L2_CACHE_FILL_WRITEBACK is used
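For reference, a minimal C++ sketch of this metric (our own helper, not the evaluation code):

```cpp
#include <cmath>
#include <vector>

// RMSE over the n_r observed ratings, given the actual values r and the
// predictions r_hat (same length).
double rmse(const std::vector<double>& r, const std::vector<double>& r_hat) {
    double sum = 0.0;
    for (std::size_t i = 0; i < r.size(); ++i) {
        double e = r[i] - r_hat[i];
        sum += e * e;
    }
    return std::sqrt(sum / r.size());
}
```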

Page 64: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

FPSGD++

● We implemented FPSGD++ to overcome the scalability problem of FPSGD
● It sets multiple locks, on both rows and columns, instead of FPSGD's single lock

Page 65: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Pseudo code of FPSGD++

● Workers can enter the same scope of the procedures concurrently
● Getting blocks can be done without a single giant lock (see the sketch below)

Page 66: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Scalability comparison

● dcMF scales up to 64 cores
● FPSGD does not scale at high core counts, especially with MovieLens10M and Netflix
● FPSGD++ scales better than FPSGD

[Figure: speedup curves for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]

Page 67: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Scalability comparison

● The lock waiting time of FPSGD gets longer at high core counts
● This long locking time is the main cause of its poor scalability

[Figure: lock waiting time for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]

Page 68: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Convergence speed comparison

● dcMF converges to a given RMSE faster than FPSGD

Page 69: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

L2 fill/writeback counts

● The measured counts of dcMF with MovieLens10M are lower than those of FPSGD
● However, none of the measured counts fit the estimates

[Figure: L2 fill/writeback counts for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]

Page 70: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Agenda

● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work

Page 71: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Conclusion

● We proposed dcMF, which parallelizes matrix factorization using the task parallel model
● We showed that dcMF surpasses FPSGD in terms of:
  ● scalability
  ● convergence speed
● Our implementation of dcMF is available at https://github.com/xxthermidorxx/cpmf

Page 72: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures

Future Work

● Find out why there is no significant difference in L2_CACHE_FILL_WRITEBACK for dcMF with Netflix
● Investigate the effect of the number of blocks
  ● theoretically, smaller blocks may improve the performance of dcMF
● Find a way to apply dcMF in a distributed environment

Page 73: Scalable Task Parallel SGD on Matrix Factorization in Multicore Architectures


Thank you for listening!