CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs


Page 1:

Wei Tan, IBM T. J. Watson Research Center

[email protected]

CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Page 2:

Agenda

Why accelerate recommendation (matrix factorization) using GPUs?

What are the challenges?

How does cuMF tackle the challenges?

So what is the result?


Page 3:

Why do we need a fast and scalable recommender system?


Recommendation is pervasive

Drives 80% of Netflix's watch hours [1]

Digital ads market in the US: $37.3 billion [2]

Need: fast, scalable, and economical

Page 4:

ALS (alternating least squares) for collaborative filtering

Input: users' ratings on some items

Output: all missing ratings

How: factorize the rating matrix R into X and Θᵀ so that R ≈ XΘᵀ, and minimize the loss on observed ratings:

min_{X,Θ} Σ_{(u,v)∈Ω} (r_{uv} − x_uᵀ θ_v)² + λ (Σ_u ‖x_u‖² + Σ_v ‖θ_v‖²)

ALS: iteratively solve for X with Θ fixed, then for Θ with X fixed
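To make the alternation concrete, here is a minimal CPU reference sketch of one half-step (update X with Θ fixed), with illustrative names and a plain dense solve; it is not cuMF's code, which runs the same math on the GPU.

```cuda
#include <vector>
#include <utility>
#include <algorithm>

// One ALS half-step (CPU reference sketch): with Theta fixed, each x_u
// solves the normal equations
//   (sum_{v in Omega_u} theta_v theta_v^T + lambda*I) x_u = sum_v r_uv theta_v
void als_update_users(
    const std::vector<std::vector<std::pair<int, float>>>& R, // per user: (v, r_uv)
    const std::vector<float>& Theta,  // f x n, theta_v stored at Theta[v*f]
    std::vector<float>& X,            // f x m, x_u stored at X[u*f]
    int f, float lambda) {
  std::vector<float> A(f * f), b(f);
  for (std::size_t u = 0; u < R.size(); ++u) {
    std::fill(A.begin(), A.end(), 0.0f);
    std::fill(b.begin(), b.end(), 0.0f);
    for (int i = 0; i < f; ++i) A[i * f + i] = lambda;   // regularizer lambda*I
    for (const auto& p : R[u]) {                         // accumulate normal equations
      const float* th = &Theta[p.first * f];
      for (int i = 0; i < f; ++i) {
        b[i] += p.second * th[i];
        for (int j = 0; j < f; ++j) A[i * f + j] += th[i] * th[j];
      }
    }
    // A is symmetric positive definite, so elimination without pivoting is safe.
    for (int k = 0; k < f; ++k)
      for (int i = k + 1; i < f; ++i) {
        float m_ik = A[i * f + k] / A[k * f + k];
        for (int j = k; j < f; ++j) A[i * f + j] -= m_ik * A[k * f + j];
        b[i] -= m_ik * b[k];
      }
    for (int i = f - 1; i >= 0; --i) {                   // back substitution
      float s = b[i];
      for (int j = i + 1; j < f; ++j) s -= A[i * f + j] * X[u * f + j];
      X[u * f + i] = s / A[i * f + i];
    }
  }
}
```

The Θ half-step is symmetric: swap the roles of users and items and iterate until the RMSE converges.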

Page 5:

Matrix factorization/ALS is versatile

The same factorization R ≈ XΘᵀ covers several tasks:

Recommendation: user × item rating matrix

Word embedding: word × word co-occurrence matrix

Topic model: document × word matrix

Matrix factorization is a very important algorithm to accelerate

Page 6:

Challenges of fast and scalable ALS

• ALS needs to solve, for each user u, the normal equations

(Σ_{v∈Ω_u} θ_v θ_vᵀ + λI) x_u = Σ_{v∈Ω_u} r_{uv} θ_v    (Ω_u = items rated by u)

using spmm (cuBLAS) and LU or Cholesky decomposition (cuBLAS); a batched-solve sketch follows below

• Challenge 1: accessing and aggregating many θ_v's is irregular (R is sparse) and memory intensive

• Challenge 2: a single GPU cannot handle big m, n, and nnz
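As a rough sketch of the "LU decomposition: cuBLAS" piece (not cuMF's actual code): once the f×f matrices A_u and right-hand sides b_u sit on the GPU, one batched LU call factors and solves all m small systems at once. dA and dB are device arrays of device pointers; all names are illustrative.

```cuda
#include <cublas_v2.h>

// Batched solve of m dense f-by-f systems A_u * x_u = b_u on the GPU.
// dA, dB: device arrays of m device pointers to the A_u and b_u buffers.
// d_pivots: m*f ints, d_info: m ints (both in device memory).
void solve_all_users(cublasHandle_t handle, float** dA, float** dB,
                     int f, int m, int* d_pivots, int* d_info) {
  // LU-factor all m matrices in a single call.
  cublasSgetrfBatched(handle, f, dA, f, d_pivots, d_info, m);
  // Back-substitute every right-hand side; note this info flag is host-side.
  int h_info = 0;
  cublasSgetrsBatched(handle, CUBLAS_OP_N, f, /*nrhs=*/1,
                      (const float* const*)dA, f, d_pivots, dB, f, &h_info, m);
}
```

Since each A_u is symmetric positive definite, a batched Cholesky factorization would work equally well, as the slide notes.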

Page 7:

Challenge 1: memory access

NVIDIA K40: memory bandwidth 288 GB/s, compute 4.3 Tflops/s

Higher flops requires higher operational intensity (more flops per byte), which requires caching!

[Roofline figure: x-axis operational intensity (flops/byte), y-axis Gflops/s. The ridge point sits at 4.3 Tflops ÷ 288 GB/s ≈ 15 flops/byte: kernels below it are memory-bound and leave the GPU under-utilized; above it the GPU can be fully utilized]

Page 8:

Address challenge 1: memory-optimized ALS

• To form the Hermitian matrix A_u = Σ_{v∈Ω_u} θ_v θ_vᵀ efficiently (see the kernel sketch below):

1. Reuse each θ_v for many users

2. Load a portion of Θ into shared memory

3. Tile and aggregate in registers
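A minimal kernel sketch of points 2 and 3, with assumed names and layout; this is not cuMF's actual kernel, which tiles registers more aggressively and also uses the texture cache. One thread block forms A_u for one user, staging each θ_v through shared memory and accumulating one matrix entry per thread in a register.

```cuda
// Launch as form_hermitian<<<m, f*f, f*sizeof(float)>>>(...); requires f*f <= 1024.
__global__ void form_hermitian(const int* csr_ptr, const int* csr_col,
                               const float* theta,  // f x n, theta_v at theta[v*f]
                               float* A,            // m matrices of f x f
                               int f, float lambda) {
  extern __shared__ float sh_theta[];               // one theta_v, f floats
  int u = blockIdx.x;
  int tid = threadIdx.x;                            // one thread per entry of A_u
  int row = tid % f, col = tid / f;
  float acc = (row == col) ? lambda : 0.0f;         // lambda*I on the diagonal
  for (int idx = csr_ptr[u]; idx < csr_ptr[u + 1]; ++idx) {
    int v = csr_col[idx];                           // next item rated by user u
    if (tid < f) sh_theta[tid] = theta[v * f + tid];// stage theta_v in smem
    __syncthreads();
    acc += sh_theta[row] * sh_theta[col];           // rank-1 update, in register
    __syncthreads();
  }
  A[u * f * f + tid] = acc;                         // write A_u once at the end
}
```

Point 1 (reusing each θ_v across many users) comes from batching users that rate the same items, which this single-user sketch omits.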

Page 9:

Address challenge 1: memory-optimized ALS

[Figure: the f × f aggregation of θ_v θ_vᵀ is tiled so that each thread accumulates a small register block]

Page 10:

Address challenge 2: scale-up ALS on multiple GPUs

Model parallel: each GPU solves a portion of the model

[Figure: the model is partitioned across GPUs]

Page 11:

Address challenge 2: scale-up ALS on multiple GPUs

Data parallel: each GPU solves with a portion of the training data

[Figure: R is partitioned across GPUs (data parallel); within each partition the model is also split (model parallel). A host-side sketch of the split follows below]
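A hypothetical host-side sketch of the split (illustrative names, kernel launches elided): each GPU owns a contiguous slice of users, i.e. a slice of X and of R's rows, and holds a full replica of Θ that the data-parallel phase later reduces across GPUs.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Partition m users across g GPUs and replicate Theta (f x n) on each.
void distribute_als(int g, int m, int n, int f, const float* h_theta,
                    float* d_x[], float* d_theta[]) {
  int per_gpu = (m + g - 1) / g;                     // users per GPU (last may be short)
  for (int d = 0; d < g; ++d) {
    cudaSetDevice(d);
    int users = std::min(per_gpu, m - d * per_gpu);  // this GPU's slice of X
    cudaMalloc(&d_x[d], (size_t)users * f * sizeof(float));
    cudaMalloc(&d_theta[d], (size_t)n * f * sizeof(float));
    cudaMemcpy(d_theta[d], h_theta, (size_t)n * f * sizeof(float),
               cudaMemcpyHostToDevice);              // full Theta replica
    // ... launch the per-user solver on this slice against the local replica ...
  }
}
```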

Page 12:

Address challenge 2: parallel reduction

Data parallel needs cross-GPU reduction

[Figure: one-phase parallel reduction vs. two-phase parallel reduction, which reduces intra-socket first and inter-socket second]
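A hedged sketch of the two-phase variant on 4 GPUs (0 and 1 on one socket, 2 and 3 on the other), assuming peer access has been enabled; the point is that only one copy crosses the socket boundary. cudaMemcpyPeer serializes against pending work on both devices, which keeps the phases ordered.

```cuda
#include <cuda_runtime.h>

__global__ void add_inplace(float* dst, const float* src, size_t n) {
  size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) dst[i] += src[i];
}

// Reduce buf[0..3] (one buffer per GPU, n floats each) into buf[0].
// scratch[d] is a preallocated n-float staging buffer on GPU d.
void two_phase_reduce(float* buf[4], float* scratch[4], size_t n) {
  int grid = (int)((n + 255) / 256);
  // Phase 1: intra-socket. GPU1 -> GPU0 and GPU3 -> GPU2, then local adds.
  cudaMemcpyPeer(scratch[0], 0, buf[1], 1, n * sizeof(float));
  cudaSetDevice(0);
  add_inplace<<<grid, 256>>>(buf[0], scratch[0], n);
  cudaMemcpyPeer(scratch[2], 2, buf[3], 3, n * sizeof(float));
  cudaSetDevice(2);
  add_inplace<<<grid, 256>>>(buf[2], scratch[2], n);
  // Phase 2: a single inter-socket copy, GPU2 -> GPU0, then the final add.
  cudaMemcpyPeer(scratch[0], 0, buf[2], 2, n * sizeof(float));
  cudaSetDevice(0);
  add_inplace<<<grid, 256>>>(buf[0], scratch[0], n);
  cudaDeviceSynchronize();                           // result now complete on GPU0
}
```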

Page 13:

Recap: cuMF tackled two challenges

• ALS needs to solve, for each user u, the normal equations

(Σ_{v∈Ω_u} θ_v θ_vᵀ + λI) x_u = Σ_{v∈Ω_u} r_{uv} θ_v

using spmm (cuBLAS) and LU or Cholesky decomposition (cuBLAS)

• Challenge 1: accessing and aggregating many θ_v's is irregular (R is sparse) and memory intensive

• Challenge 2: a single GPU cannot handle big m, n, and nnz

Page 14:

Connect cuMF to Spark MLlib

Spark applications relying on mllib/ALS need no change

Modified mllib/ALS detects the GPU and offloads computation

Leverage the best of Spark (scale-out) and GPUs (scale-up)

[Stack: ALS apps on top of mllib/ALS, which calls into cuMF through JNI]
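A hypothetical sketch of what the JNI seam might look like; the class path, method name, and array-based signature are all illustrative, not cuMF's actual binding. The modified mllib/ALS hands a rating partition down to native code and reads the solved factors back.

```cuda
#include <jni.h>

// Illustrative JNI entry point: receives ratings from the JVM, runs the
// GPU solver (elided), and writes the user factors back into jx.
extern "C" JNIEXPORT void JNICALL
Java_com_ibm_cumf_CuMFSolver_factorize(JNIEnv* env, jclass,
                                       jfloatArray jratings, jfloatArray jx,
                                       jint f, jfloat lambda) {
  jfloat* ratings = env->GetFloatArrayElements(jratings, nullptr);
  jfloat* x = env->GetFloatArrayElements(jx, nullptr);
  // ... copy ratings to the GPU, run ALS with rank f and regularizer lambda,
  //     then copy the solved factors into x ...
  env->ReleaseFloatArrayElements(jratings, ratings, JNI_ABORT); // read-only input
  env->ReleaseFloatArrayElements(jx, x, 0);                     // commit results
}
```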

Page 15:

Connect cuMF to Spark MLlib

[Figure: two deployments, each on 1 Power 8 node + 2 K40s. RDDs on the CPU feed CUDA kernels on GPU1 and GPU2, with a shuffle stage between iterations]

RDD on CPU: to distribute rating data and shuffle parameters

Solver on GPU: to form and solve the normal equations

Able to run on multiple nodes, with multiple GPUs per node

Page 16:

Implementation

In C (circa 10K LOC)

CUDA 7.0/7.5, GCC OpenMP v3.0

Baselines:

– LIBMF: SGD on 1 node [RecSys 14]

– NOMAD: SGD on more than 1 node [VLDB 14]

– SparkALS: ALS on Spark

– Factorbird: SGD + parameter server for MF

– Facebook: enhanced Giraph

Page 17:

CuMF performance

X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on the test set

• With 1 GPU vs. 30 cores, cuMF is slightly faster than LIBMF and NOMAD

• cuMF scales well on 1, 2, and 4 GPUs

Page 18:

Effectiveness of memory optimization

X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on the test set

• Aggressively using registers → 2x speedup

• Using texture → 25%-35% faster

Page 19:

CuMF performance and cost

• cuMF @ 4 GPUs ≈ NOMAD @ 64 HPC nodes ≈ 10x NOMAD @ 32 AWS nodes

• cuMF @ 4 GPUs ≈ 10x SparkALS @ 50 nodes, at ≈ 1% of its cost

Page 20:

CuMF accelerated Spark on Power 8

• cuMF @ 2 K40s achieves a 6+x speedup in training (1.3K sec → 193 sec)

*GUI designed by Amir Sanjar

Page 21:

Conclusion

Why accelerate recommendation (matrix factorization) using GPUs?

– Need to be fast, scalable, and economical

What are the challenges?

– Memory access, scale to multiple GPUs

How does cuMF tackle the challenges?

– Optimize memory access, parallelism and communication

So what is the result?

– Up to 10x as fast, 100x as cost-efficient

– Use cuMF standalone or with Spark

– GPU can tackle ML problems beyond deep learning!


Page 22:

Wei Tan, IBM T. J. Watson Research Center

[email protected]

http://ibm.biz/wei_tan

Thank you. Questions?

Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016. http://arxiv.org/abs/1603.03820

Source code available soon.