Scalable Task-Parallel SGD on Matrix Factorization in Multicore Architectures
15/05/29 ParLearning15 1
Scalable Task-Parallel SGD on Matrix Factorization in Multicore Architectures
Yusuke Nishioka, Kenjiro Taura
Graduate School of Information Science and Technology, The University of Tokyo
Agenda
● Introduction
● Matrix Factorization
● Related Work
● Scalability problem of FPSGD
● Proposal
● Evaluation
● Conclusion & Future Work
Introduction
Recommendation is an important technique, especially in e-commerce services such as Amazon and Netflix. Service providers use recommendation to help users find items they are likely to prefer.
[Figure: example recommendations on https://amazon.com]
Recommendation
There are two approaches to recommendation: content filtering and collaborative filtering.
● Content filtering is based on information about users and items.
● Collaborative filtering is based on correlations between users and items.
Collaborative filtering has become popular because it makes accurate predictions with less data.
Collaborative Filtering
[Table: ratings from 1 (bad) to 5 (great) given by users A-D to four movies: Saving Private Ryan, Pirates of the Caribbean, Beauty and the Beast, Letters From Iwo Jima]
Each entry in the table denotes a rating: 5 means great, 1 means terrible.
Both users A and C seem to like war movies, so why not recommend other war movies to them (predicted rating 4-5)?
Both users B and D seem to like Disney movies, so why not recommend other Disney movies to them (predicted rating 4-5)?
Contribution
Our contributions are:
● to propose an alternative approach to parallel matrix factorization
● to analyze the scalability problem of the past work
● to achieve better scalability than other methods
Matrix Factorization
One collaborative filtering algorithm is matrix factorization (MF), which:
● infers user-item relations from the ratings which users have given to items
● decomposes the rating matrix into two low-rank latent matrices
What's matrix factorization?
[Figure: the rating table, drawn as a rating matrix R with m rows (users) and n columns (items)]
What's matrix factorization?
The rating matrix R is modeled as the matrix product of P^T and Q:

    R ≈ P^T Q,    R ∈ R^(m×n), P ∈ R^(k×m), Q ∈ R^(k×n)

where m is the number of users, n the number of items, and k the dimension of the latent matrices.
What's matrix factorization?
Each entry r_(u,v) in the rating matrix R is modeled as the inner product of the corresponding row p_u^T of P^T and column q_v of Q:

    r̂_(u,v) ≈ p_u^T q_v
Optimization problem of MF
The objective is to reduce the error between the predicted and actual values of the existing ratings. The objective function is:

    min_(P,Q) Σ_((u,v)∈R) ( r_(u,v) − p_u^T q_v )^2 + λ_P ||P||_F^2 + λ_Q ||Q||_F^2

where r_(u,v) is an existing rating, p_u^T q_v the predicted rating, λ_P and λ_Q the regularization coefficients, and ||·||_F the Frobenius norm.
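For concreteness, the objective can be evaluated directly. This is a small pure-Python sketch: P, Q, the ratings, and the λ values below are made-up illustrative numbers, not anything from the paper.

```python
# Evaluate the regularized MF objective above for hand-made inputs.
def objective(ratings, P, Q, lam_p, lam_q):
    # ratings: list of (u, v, r); P[u], Q[v]: latent vectors of length k
    loss = sum((r - sum(a * b for a, b in zip(P[u], Q[v]))) ** 2
               for u, v, r in ratings)
    frob_p = sum(x * x for col in P for x in col)  # ||P||_F^2
    frob_q = sum(x * x for col in Q for x in col)  # ||Q||_F^2
    return loss + lam_p * frob_p + lam_q * frob_q

P = [[1.0, 0.0], [0.0, 1.0]]          # latent vectors for users 0, 1 (k = 2)
Q = [[2.0, 0.0], [0.0, 3.0]]          # latent vectors for items 0, 1
ratings = [(0, 0, 2.0), (1, 1, 2.0)]  # (user, item, rating)
print(objective(ratings, P, Q, lam_p=0.5, lam_q=0.5))  # 8.5
```

Here the squared error is 0 + 1 and the Frobenius terms contribute 0.5·2 + 0.5·13, giving 8.5.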
Gradient Descent Method
One approach to this optimization problem is the gradient descent (GD) method: modify P and Q in the direction opposite to the gradients of the objective function.

    p_u ⇐ p_u + γ ( Σ_((u,v)∈R) ( r_(u,v) − p_u^T q_v ) q_v − λ_P p_u )
    q_v ⇐ q_v + γ ( Σ_((u,v)∈R) ( r_(u,v) − p_u^T q_v ) p_u − λ_Q q_v )

GD needs to calculate the gradients in a batch → slow convergence.
Stochastic Gradient Descent Method
The stochastic gradient descent (SGD) method updates the corresponding p_u and q_v in an online manner: faster convergence with low memory consumption. For each rating r_(u,v), p_u and q_v are updated as follows:

    p_u ⇐ p_u + γ ( ( r_(u,v) − p_u^T q_v ) q_v − λ_P p_u )
    q_v ⇐ q_v + γ ( ( r_(u,v) − p_u^T q_v ) p_u − λ_Q q_v )

SGD is inherently sequential → difficult to parallelize over multi-core CPUs.
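These update rules can be sketched in a few lines of pure Python. This is a minimal single-threaded illustration; the matrix sizes, γ, λ, and the ratings are arbitrary made-up values.

```python
import random

# One SGD pass over the ratings, following the update rules above.
# gamma (learning rate) and lam (regularization) are illustrative.
def sgd_epoch(ratings, P, Q, gamma=0.01, lam=0.05):
    for u, v, r in ratings:
        err = r - sum(a * b for a, b in zip(P[u], Q[v]))
        for f in range(len(P[u])):
            pu, qv = P[u][f], Q[v][f]
            P[u][f] += gamma * (err * qv - lam * pu)
            Q[v][f] += gamma * (err * pu - lam * qv)

def squared_error(ratings, P, Q):
    return sum((r - sum(a * b for a, b in zip(P[u], Q[v]))) ** 2
               for u, v, r in ratings)

random.seed(0)
k = 4
P = [[random.uniform(0, 1) for _ in range(k)] for _ in range(3)]  # 3 users
Q = [[random.uniform(0, 1) for _ in range(k)] for _ in range(3)]  # 3 items
ratings = [(0, 0, 5.0), (0, 2, 4.0), (1, 1, 1.0), (2, 0, 4.0)]

before = squared_error(ratings, P, Q)
for _ in range(200):
    sgd_epoch(ratings, P, Q)
print(squared_error(ratings, P, Q) < before)  # True: training error decreases
```

Note that each update touches only one p_u and one q_v, which is exactly why concurrent updates to the same row or column conflict when SGD is parallelized.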
GD vs SGD

              GD                        SGD
strong point  easy to parallelize       fast convergence
weak point    slow convergence          difficult to parallelize
Related work
We review some parallel SGD algorithms for solving matrix factorization: HogWild!, DSGD, and FPSGD.
HogWild! [Niu et al. 2011]
HogWild! is a lock-free approach to parallelizing MF: workers process ratings in R independently. Convergence is guaranteed in spite of the possibility that updates may be overwritten by other workers.
[Figure: three workers updating R concurrently; some updates are overwritten by other workers]
Problems of HogWild!
Ratings are selected at random from the whole of R → cannot take advantage of hardware prefetch.
[Figure: random rating accesses scattered over P^T, Q, and R]
DSGD [Gemulla et al. 2011]
DSGD divides R into t × t blocks and assigns them to t workers. All workers go on to the next block simultaneously (synchronous parallel).
Problems of DSGD
● Ratings are selected at random within a block → cannot take advantage of hardware prefetch.
● Synchronous parallel → performance depends on the slowest worker.
[Figure: four workers processing blocks between synchronization barriers]
FPSGD [Zhuang et al. 2013]
FPSGD is a state-of-the-art parallel algorithm for MF. It aims to overcome the bottlenecks of HogWild! and DSGD by introducing the following two techniques:
● conflict-free scheduling
● partial random method
Conflict-free scheduling
Take a look at a specific example: 6 × 6 blocks and 4 workers, T0-T3. First, the scheduler assigns to each worker a block which shares neither a row nor a column with the blocks of the other workers.
[Figure: a 6 × 6 block grid with workers T0-T3 placed on mutually non-conflicting blocks]
Conflict-free scheduling
The rows and columns being processed by other workers are blocked, so that the same row or column is never updated concurrently. When worker T0 finishes processing its block, the scheduler assigns it one of the blocks marked with a circle.
[Figure: the same grid; the free candidate blocks for T0 are marked with circles]
Conflict-free scheduling
Worker T0 moves from block (0,0) to block (4,5).
[Figure: the grid after T0 has moved to block (4,5)]
Conflict-free scheduling
In conflict-free scheduling, the rating matrix must be divided into at least (t+1) × (t+1) blocks, where t is the number of workers, so that the scheduler can always assign a "free" block. This scheduling lets workers process blocks asynchronously.
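A minimal sketch of such a scheduler, assuming one global lock guarding the shared row/column state (the get_job/put_job names follow the FPSGD description later in the deck; the rest is an illustrative assumption, not the authors' code):

```python
import threading

# Conflict-free block scheduling over a (t+1) x (t+1) grid for t workers.
class ConflictFreeScheduler:
    def __init__(self, grid_size):
        self.grid = grid_size
        self.lock = threading.Lock()          # the single scheduler lock
        self.busy_rows, self.busy_cols = set(), set()

    def get_job(self):
        # Return a block whose row and column no other worker is using.
        with self.lock:
            for i in range(self.grid):
                for j in range(self.grid):
                    if i not in self.busy_rows and j not in self.busy_cols:
                        self.busy_rows.add(i)
                        self.busy_cols.add(j)
                        return i, j
            return None  # no free block right now

    def put_job(self, i, j):
        # Release the row and column of a processed block.
        with self.lock:
            self.busy_rows.discard(i)
            self.busy_cols.discard(j)

s = ConflictFreeScheduler(5)          # (t+1) x (t+1) grid for t = 4 workers
jobs = [s.get_job() for _ in range(4)]
rows = [i for i, j in jobs]
cols = [j for i, j in jobs]
print(len(set(rows)) == 4 and len(set(cols)) == 4)  # True: no conflicts
```

With a 5 × 5 grid and 4 outstanding jobs, every assigned block occupies a distinct row and a distinct column, so no two workers ever touch the same p_u or q_v.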
Partial random method
● Select the ratings within a block in order.
● Select blocks at random.
Partial random method
Ordered rating selection means that at least either the rows or the columns are loaded onto the cache in order → workers can utilize hardware prefetch.
[Figure: the ratings in a block visited in order]
Summary of related work

                          cache utilization
            parallel      block selection    rating selection
HogWild!    async         -                  random
DSGD        synchronous   ordered            random
FPSGD       async         random             ordered
Scalability of FPSGD
FPSGD has a scalability problem: it cannot scale to higher core counts. The main reasons for the limited scalability are:
● the locking problem
● poor data locality across blocks
Locking problem
In the get_job function, the scheduler gets a free block and marks its row and column as "being processed". In the put_job function, the worker returns a processed block and the scheduler unmarks its row and column as "free". Both functions acquire the same single lock → poor scalability.
[Figure: worker T0 obtains a block; its row and column are marked as "being processed"]
Poor data locality across blocks
Workers select blocks at random → little opportunity to reuse P and Q across blocks.
Poor data locality across blocks
The memory traffic for processing all blocks once is:

    T_fpsgd = ( n_r / b^2 + k·n_u / b + k·n_i / b ) × b^2 × d
            = d ( n_r + b·k ( n_u + n_i ) )

where n_r is the number of ratings, n_u the number of users, n_i the number of items, b × b the number of blocks, k the latent dimension, and d the word size.
Proposal
We propose dcMF, a scalable divide-and-conquer method for MF using a task-parallel model.
Task parallel model
In a task-parallel model, parallelism is expressed via two operations:
● create_task: create a task
● sync_tasks: wait for the calling task's child tasks to finish
Example using task parallelism (fibonacci)
Regard each function call as a task; tasks are automatically distributed among workers.
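The original slide showed pseudocode that did not survive transcription; here is a sketch of the fibonacci example in Python, with each create_task modeled as an OS thread (a real task-parallel runtime such as MassiveThreads uses far lighter-weight tasks):

```python
import threading

def fib(n):
    # Each recursive call is a task: two child tasks are created
    # (create_task) and then waited on (sync_tasks).
    if n < 2:
        return n
    result = {}
    def child(key, m):
        result[key] = fib(m)
    t1 = threading.Thread(target=child, args=("a", n - 1))  # create_task
    t2 = threading.Thread(target=child, args=("b", n - 2))  # create_task
    t1.start()
    t2.start()
    t1.join()  # sync_tasks: wait for the child tasks to finish
    t2.join()
    return result["a"] + result["b"]

print(fib(10))  # 55
```

The point of the model is that the programmer only expresses the task tree; the runtime decides which worker runs which task.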
Overview of dcMF
In each recursion, dcMF:
● divides the block into 2 × 2 sub-blocks
● processes the two sub-blocks on one diagonal
● then processes the two sub-blocks on the other diagonal
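The recursion above can be sketched as follows. This is a pure-Python illustration: the leaf size and the update callback are made-up stand-ins for the real SGD kernel, and create_task/sync_tasks are modeled with one thread per parallel pair.

```python
import threading

LEAF = 2  # recurse until a block side is this small (illustrative)

def par(f, g):
    # Run f and g in parallel: create_task / sync_tasks, via a thread.
    t = threading.Thread(target=f)
    t.start()
    g()
    t.join()

def dcmf(r0, r1, c0, c1, update):
    # Recursively split the block [r0,r1) x [c0,c1) into 2x2 sub-blocks.
    if r1 - r0 <= LEAF or c1 - c0 <= LEAF:
        update(r0, r1, c0, c1)  # run SGD on the ratings in this block
        return
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    # First diagonal (top-left, bottom-right): disjoint rows and columns,
    # so the two sub-blocks can safely run in parallel.
    par(lambda: dcmf(r0, rm, c0, cm, update),
        lambda: dcmf(rm, r1, cm, c1, update))
    # Second diagonal (top-right, bottom-left).
    par(lambda: dcmf(r0, rm, cm, c1, update),
        lambda: dcmf(rm, r1, c0, cm, update))

visited = []
dcmf(0, 8, 0, 8, lambda *b: visited.append(b))
print(len(visited))  # 16 leaf blocks for an 8x8 region with LEAF = 2
```

Because the two blocks on a diagonal share no rows or columns, no locking on p_u or q_v is needed; the task structure itself encodes the conflict-freedom.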
Execution flow
The tasks created along the recursion form a tree structure. Tasks located in different rows and columns can be processed in parallel.
Overview of dcMF
The runtime system automatically distributes tasks to workers.
[Figure: workers 1 and 2 each run a task while the remaining tasks wait in queues]
Overview of dcMF
● Divide the block into two-by-two sub-blocks.
● Tasks on one diagonal are created.
● Tasks on the other diagonal are created.
● When a block gets small enough, p_u and q_v are updated.
[Figure: the recursion illustrated step by step on the rating matrix]
Advantages of the task parallel model
There are two advantages in dcMF:
● it gets rid of locking time: task parallel systems can handle task migration without a centralized data structure
● it reduces cache miss counts
Reducing cache miss counts
Task parallel systems split the task tree near its root and assign the subtrees to workers.
[Figure: the rating matrix R divided among workers 1-4]
Reducing cache miss counts
If there is no load balancing, each worker processes t blocks of size (1/t) × (1/t).
Reducing cache miss counts
Consider the blocks near a leaf of the task tree: all of these blocks are processed by the same worker.
[Figure: the worker loads the rows and columns of its first diagonal blocks onto the cache]
Reducing cache miss counts
After processing the blocks on one diagonal, the corresponding p_u and q_v are already on the cache.
[Figure: the rows and columns needed by the remaining blocks are already cached]
Reducing cache miss counts
Processing the remaining blocks requires no access to main memory.
[Figure: all rows and columns of the remaining blocks are already on the cache]
Advantages of the task parallel model
The memory traffic for processing all blocks once is:

    T_dcmf = d ( n_r + t·k ( n_u + n_i ) )    ( cf. T_fpsgd = d ( n_r + b·k ( n_u + n_i ) ) )

The total traffic of dcMF is less than that of FPSGD when b > t:

    T_fpsgd − T_dcmf = d ( n_r + b·k ( n_u + n_i ) ) − d ( n_r + t·k ( n_u + n_i ) )
                     = d·k ( b − t )( n_u + n_i ) > 0

where n_r is the number of ratings, n_u the number of users, n_i the number of items, t the number of threads, b × b the number of blocks, and d the word size.
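As a numeric sanity check on the two traffic formulas, a small comparison in Python (the dataset sizes below are made-up illustrative numbers, not measurements from the paper):

```python
# Traffic formulas from the analysis above; all sizes are illustrative.
def t_fpsgd(n_r, n_u, n_i, k, b, d):
    return d * (n_r + b * k * (n_u + n_i))

def t_dcmf(n_r, n_u, n_i, k, t, d):
    return d * (n_r + t * k * (n_u + n_i))

n_r, n_u, n_i = 10_000_000, 70_000, 10_000  # made-up dataset sizes
k, d = 40, 8                                # latent dim, word size (bytes)
t = 16                                      # threads
b = t + 1                                   # FPSGD needs >= (t+1) blocks per side

diff = t_fpsgd(n_r, n_u, n_i, k, b, d) - t_dcmf(n_r, n_u, n_i, k, t, d)
print(diff == d * k * (b - t) * (n_u + n_i))  # True: matches the derivation
```

Since FPSGD requires b >= t + 1 while dcMF's effective block count tracks the thread count t, the difference is always positive.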
Comparison with related work

                          cache utilization
            parallel      block selection      rating selection
HogWild!    async         -                    random
DSGD        synchronous   ordered              random
FPSGD       async         random               ordered
dcMF        async         partially ordered    ordered
Evaluation environment
CPU: four AMD Opteron 8354 (2.50 GHz) sockets
● each socket has 2 NUMA nodes; each NUMA node has 4 modules; each module has 2 cores → 64 cores in total
● L3 cache (6 MB) is shared by 4 modules; L2 cache (2 MB) is shared by 2 cores; L1 cache (16 KB) is per core
Evaluation environment
Implementation: we used the same code base as FPSGD. Task parallel library: MassiveThreads.
Evaluation datasets
We use three datasets: MovieLens10M, Netflix, and Yahoo!Music. We use the same parameter values as FPSGD for a fair comparison.
Experiments
We conducted three comparative experiments:
● scalability
● convergence speed: we use RMSE (root mean square error) as the convergence metric:

    RMSE = sqrt( (1/n_r) Σ_((u,v)∈R) ( r_(u,v) − r̂_(u,v) )^2 )

● L2 fill/writeback counts: the PAPI native event L2_CACHE_FILL_WRITEBACK is used
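The RMSE metric is straightforward to compute; a small sketch (the ratings and the constant predictor below are made-up for illustration):

```python
import math

# RMSE over the existing ratings, as defined above. `ratings` maps
# (user, item) -> observed rating; `predict` is any model, here a stub
# that always predicts 3.5, purely for illustration.
def rmse(ratings, predict):
    se = sum((r - predict(u, v)) ** 2 for (u, v), r in ratings.items())
    return math.sqrt(se / len(ratings))

ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 1): 4.0, (1, 2): 2.0}
print(rmse(ratings, lambda u, v: 3.5))
```

In the evaluation, `predict` would be the learned model r̂_(u,v) = p_u^T q_v on a held-out rating set.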
FPSGD++
We implemented FPSGD++ to overcome the scalability problem of FPSGD: it sets multiple locks on rows and columns instead of the single lock of FPSGD.
Pseudo code of FPSGD++
Workers can enter the same scope of the procedures concurrently; getting blocks can be done without a single giant lock.
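A sketch of this multiple-lock idea in Python (the per-row/per-column lock layout and the probing order are assumptions based on the slide's description, not the authors' actual code):

```python
import threading

# FPSGD++-style block acquisition: each block row and block column has
# its own lock, so workers touching different rows/columns never contend.
class FPSGDppScheduler:
    def __init__(self, grid_size):
        self.grid = grid_size
        self.row_locks = [threading.Lock() for _ in range(grid_size)]
        self.col_locks = [threading.Lock() for _ in range(grid_size)]

    def get_job(self):
        # Probe rows and columns with non-blocking acquires; only the
        # locks actually touched are ever contended.
        for i in range(self.grid):
            if not self.row_locks[i].acquire(blocking=False):
                continue  # row busy: another worker holds it
            for j in range(self.grid):
                if self.col_locks[j].acquire(blocking=False):
                    return i, j
            self.row_locks[i].release()  # no free column for this row
        return None

    def put_job(self, i, j):
        self.row_locks[i].release()
        self.col_locks[j].release()

s = FPSGDppScheduler(5)
a = s.get_job()   # (0, 0)
b = s.get_job()   # (1, 1): row 0 and column 0 are still held
print(a, b)
```

Compared with the single-lock scheduler, two workers acquiring blocks in disjoint rows and columns here proceed without serializing on any shared lock.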
Scalability comparison
[Plots: speedup vs. core count for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]
● dcMF scales up to 64 cores.
● FPSGD does not scale at high core counts, especially with MovieLens10M and Netflix.
● FPSGD++ scales better than FPSGD.
Scalability comparison
[Plots: lock waiting time vs. core count for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]
The lock waiting time of FPSGD grows at high core counts; this long locking time is the main cause of its poor scalability.
Convergence speed comparison
dcMF converges to a given RMSE faster than FPSGD.
L2 fill/writeback counts
[Plots: measured counts for (a) MovieLens10M, (b) Netflix, (c) Yahoo!Music]
The measured counts of dcMF with MovieLens10M are lower than those of FPSGD, but none of the measured counts fit the estimates.
Conclusion
We proposed dcMF, which parallelizes matrix factorization using a task-parallel model. We showed that dcMF surpasses FPSGD in terms of:
● scalability
● convergence speed
Our implementation of dcMF is available at https://github.com/xxthermidorxx/cpmf
Future Work
● Find out why there is no significant difference in L2_CACHE_FILL_WRITEBACK for dcMF with Netflix.
● Investigate the effect of the number of blocks: theoretically, smaller blocks may contribute to the performance of dcMF.
● Find a way to apply dcMF in a distributed environment.
Thank you for listening!