TRANSCRIPT
4/12/13
Distributed Machine Learning and Graph Processing with Sparse Matrices
Shivaram Venkataraman*, Erik Bodzsar#
Indrajit Roy+, Alvin AuYoung+, Rob Schreiber+ *UC Berkeley, #U Chicago, +HP Labs
Big Data, Complex Algorithms
PageRank (Dominant eigenvector)
Recommendations (Matrix factorization)
Anomaly detection (Top-K eigenvalues)
User Importance (Vertex Centrality)
Machine learning + Graph algorithms
Iterative Linear Algebra Operations
PageRank
[Figure: example web graph with pages P, Q, R, S; its 0/1 adjacency matrix; and the resulting PageRank values 0.035, 0.006, 0.008, 0.032]
PageRank Using Matrices
Power method → dominant eigenvector
Iterate: p ← M * p
M = web graph matrix, p = PageRank vector
R
Array-oriented programming environment. Millions of users, thousands of free packages. Popular among the statistics and bio-informatics communities.
PageRank Using Matrices
Power method → dominant eigenvector
M = web graph matrix, p = PageRank vector
Simplified algorithm: repeat { p = M*p }
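As an illustration (not the talk's R/Presto code), this power iteration can be sketched with SciPy sparse matrices; the 4-page link graph below is a hypothetical toy example:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 4-page link graph; links[i, j] = 1 means page j links to page i.
links = np.array([[0, 0, 1, 1],
                  [1, 0, 0, 0],
                  [1, 1, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
M = csr_matrix(links / links.sum(axis=0))  # column-stochastic web graph matrix

p = np.full(4, 0.25)   # start from the uniform vector
for _ in range(100):   # repeat { p = M*p }
    p = M @ p          # p converges to the dominant eigenvector of M
```

Because M is column-stochastic, the entries of p keep summing to 1, and the fixed point p = M*p is the PageRank vector of this toy graph.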
PageRank Using Matrices
[Figure: the web graph matrix M and the PageRank vector p, partitioned row-wise into blocks P1, P2, …, PN/s for parallel computation]
Large-Scale Processing Frameworks
• Data-parallel frameworks – MapReduce/Dryad (2004): process each record in parallel. Use case: computing sufficient statistics.
• Graph-centric frameworks – Pregel/GraphLab (2010): process each vertex in parallel. Use case: graphical models.
• Array-based frameworks – MadLINQ (2012): process blocks of an array in parallel. Challenge: sparse matrices.
Challenge 1 – Communication
R is single-threaded, so R processes share data through pipes or the network: time-inefficient (sending copies) and space-inefficient (extra copies).
[Figure: R processes on Server 1 and Server 2, each holding its own local copy of the data, exchanging further copies over pipes and the network]
Sparse matrices → communication overhead
Challenge 2 – Sparse Matrices
[Figure: normalized block density (log scale, 1–10,000×) by block ID for the LiveJournal, Netflix, and ClueWeb-1B graphs: block densities are highly skewed]
Presto
A framework for large-scale iterative linear algebra that extends R for scalability.
Outline • Motivation • Programming model • Design • Applications and Results
Programming model primitives: darray creates a distributed array; foreach runs a function f(x) in parallel over array partitions.
PageRank Using Presto
M <- darray(dim=c(N,N), blocks=c(s,N))   # create distributed arrays
P <- darray(dim=c(N,1), blocks=c(s,1))
while(..) {
  foreach(i, 1:len,
    calculate(p=splits(P,i), m=splits(M,i),
              x=splits(P_old), z=splits(Z,i)) {
      p <- (m %*% x) + z
    })
  P_old <- P
}
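The row-blocked update p = M*x + z in the foreach body can be sketched in plain NumPy (names and random test data are illustrative, not Presto's implementation):

```python
import numpy as np

# Each task i computes only its row block: p_i = M_i @ x + z_i.
rng = np.random.default_rng(0)
N, s = 8, 2                      # N rows, split into blocks of s rows
M = rng.random((N, N))           # stand-in for the web graph matrix
x = np.full(N, 1.0 / N)          # previous PageRank vector (P_old)
z = np.full(N, 0.15 / N)         # additive term (Z)

p = np.empty(N)
for i in range(0, N, s):         # one "split" per loop iteration
    p[i:i+s] = M[i:i+s] @ x + z[i:i+s]
```

Concatenating the per-block results reproduces the full product, which is why the blocks can be computed by independent tasks in a cluster.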
[Figure: the N×N matrix M and the N×1 vectors P, P_old, Z, each partitioned into s-row blocks P1, P2, …, PN/s]
PageRank Using Presto
M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(..) {
  foreach(i, 1:len,                          # execute function in a cluster
    calculate(p=splits(P,i), m=splits(M,i),  # splits() passes array partitions
              x=splits(P_old), z=splits(Z,i)) {
      p <- (m %*% x) + z
    })
  P_old <- P
}
Breadth-first Search Using Matrices
G = adjacency matrix, X = BFS vector

    A B C D E
  A 1 1 1 0 0
  B 0 1 0 1 0
  C 0 1 1 0 0
  D 0 0 0 1 1
  E 0 0 0 0 1

Start with X = (1,0,0,0,0), i.e. only A visited; each multiply grows the visited set: (*,*,*,0,0) → (*,*,*,*,0) → (*,*,*,*,*).

Simplified algorithm: repeat { X = G*X }

Matrix operations: easy to express, efficient to implement
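As a sketch of the idea (in SciPy rather than R), BFS by repeated sparse matrix-vector products; multiplying by the transpose of G follows out-edges of the visited set, which matches the slide's G*X when X is read as a row vector:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Adjacency matrix from the slide: G[i, j] = 1 if there is an edge i -> j
# (the self-loops keep already-visited vertices visited).
G = csr_matrix(np.array([[1, 1, 1, 0, 0],   # A -> A, B, C
                         [0, 1, 0, 1, 0],   # B -> B, D
                         [0, 1, 1, 0, 0],   # C -> B, C
                         [0, 0, 0, 1, 1],   # D -> D, E
                         [0, 0, 0, 0, 1]],  # E -> E
                        dtype=np.int8))

x = np.array([1, 0, 0, 0, 0], dtype=np.int8)  # start BFS at vertex A
for _ in range(4):                            # repeat { X = G*X }
    x = np.minimum(G.T @ x, 1).astype(np.int8)
# x is now all ones: every vertex is reachable from A
```

The minimum-with-1 step keeps the vector boolean; in exact semiring terms this is a matrix-vector product over (OR, AND).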
Outline • Motivation • Programming model • Design • Applications and Results
Presto Architecture
[Figure: a master coordinating workers; each worker runs an executor pool of R instances sharing the node's DRAM]
Dynamic Partitioning of Matrices
Profile the execution, then repartition accordingly.
Size Invariants
invariant(Mat, vec) — declares that Mat and vec must stay compatibly partitioned when arrays are repartitioned.
Outline
• Motivation • Programming model • Design • Applications and Results
demo
lj_matrix  <- darray(dim=c(n,n), blocks=c(n,n))
in_vector  <- darray(dim=c(n,1), blocks=c(s,1), data=1/n)
out_vector <- darray(dim=c(n,1), blocks=c(s,1))
foreach(i, 1:length(splits(lj_matrix)),
  function(g = splits(lj_matrix, i),
           x = splits(in_vector),      # renamed from i to avoid shadowing the index
           o = splits(out_vector, i)) {
    o <- g %*% x
    update(o)
  })
Examples

dotprod <- function(a, b) {
  tmp <- darray(dim=a@dim/a@blocks, c(1,1))
  foreach(i, 1:length(splits(a)),
    mult <- function(tmp = splits(tmp,i),
                     a = splits(a,i), b = splits(b,i)) {
      tmp <- sum(a * b)        # per-partition partial sum
      update(tmp)
    }, progress=FALSE)
  return(sum(getpartition(tmp)))
}
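A plain-Python analogue of dotprod (with NumPy's array_split as an illustrative stand-in for darray splits): each partition computes a partial sum, and the partials are combined at the end.

```python
import numpy as np

def blocked_dotprod(a, b, nblocks=4):
    # One partial sum per partition pair, mirroring sum(a * b) per split.
    partials = [float(np.dot(ai, bi))
                for ai, bi in zip(np.array_split(a, nblocks),
                                  np.array_split(b, nblocks))]
    return sum(partials)   # combine the per-partition results

print(blocked_dotprod(np.arange(8.0), np.ones(8)))  # 0+1+...+7 = 28.0
```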
Examples

reduce <- function(f, d) {
  i <- 1
  n <- length(splits(d))
  repeat {
    step <- 2*i
    reducers <- floor((n-i-1)/step) + 1
    foreach(j, 1:reducers,
      reduce.pair <- function(s1 = splits(d, (j-1)*step+1),
                              s2 = splits(d, (j-1)*step+1+i),
                              fun = f) {
        s1 <- fun(s1, s2)     # combine a pair of partitions
        update(s1)
      }, progress=FALSE)
    i <- i*2
    if (1+i > n) { break }
  }
}
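The same pairwise tree reduction in plain Python over a list of partitions (an illustrative stand-in for darray splits): each round combines elements at stride 2*i, finishing in O(log n) rounds.

```python
def tree_reduce(f, parts):
    """Pairwise tree reduction: round with stride i combines parts[j]
    with parts[j+i], mirroring the darray reduce sketch."""
    parts = list(parts)
    i, n = 1, len(parts)
    while i < n:
        step = 2 * i
        for j in range(0, n - i, step):
            parts[j] = f(parts[j], parts[j + i])
        i *= 2
    return parts[0]

print(tree_reduce(lambda a, b: a + b, [1, 2, 3, 4, 5]))  # → 15
```

In Presto each pair combination runs as a separate cluster task; here the rounds run sequentially, but the dependency structure is the same.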
Applications Implemented in Presto
Application              Algorithm                  Presto LOC
PageRank                 Eigenvector calculation    41
Triangle counting        Top-K eigenvalues          121
Netflix recommendation   Matrix factorization       130
Centrality measure       Graph algorithm            132
k-path connectivity      Graph algorithm            30
k-means                  Clustering                 71
Sequence alignment       Smith-Waterman             64

All in fewer than 140 lines of code.
Repartitioning Progress
[Figure: split size (GB, 0–30) by iteration count (1–15) as repartitioning proceeds]
Repartitioning Benefits
[Figure: per-worker transfer and compute time (0–160 s), without repartitioning vs. with repartitioning]
Presto
• Versioning distributed arrays
• Co-partitioning matrices
• Locality-based scheduling
• Caching partitions
Conclusion
Linear algebra is a powerful abstraction that easily expresses machine learning and graph algorithms. Challenges: sparse matrices and data sharing. Presto is a prototype that extends R to address them.
Blockus
• Expressive distributed computing systems are in-memory
• Being in-memory is problematic for (very) big data – expensive – fault-tolerance problems
• Goals: scale Presto vertically; eliminate the memory limitation
Vertical scaling
• Use SSDs – low latency – fast small I/O – parallel I/O
• SSDs are still significantly slower than memory
• Need to do better than OS swap!
[Figure: read speed (MB/s, 0–600) by block size (16 KB–16 MB) with 4 threads, SSD vs. HDD]
Opportunities from Vertical Scaling
• Enable big data analytics on small systems – a laptop! – a small cluster
• Energy savings for extreme-scale systems
• Reduced cost, increased fault tolerance
Related work: OS paging/buffer cache
• General purpose – (almost) no application knowledge
• LRU caching, (conservative) read-ahead
• Reactive (do I/O on page fault)
Related work: SSDAlloc
• C library: replace malloc with malloc_object
• Objects are stored on SSD; memory is used as a cache
• Advantage over OS paging: object-level caching
• Useful for web servers, etc.
Related work: GraphChi
• Vertex-level programming – iterative; update vertex neighborhoods in each iteration, for each vertex
• GraphChi: I/O-optimized execution engine – makes sure I/O is sequential (essential for HDDs, works for SSDs too)
Blockus Idea
• Use some form of application knowledge to optimize I/O
• Know future computation → prefetching – from programmer hints, static analysis, history, etc.
• Know about parallelism → reorder computation to decrease I/O
• Block usage history → better caching (e.g. always keep popular parts of a graph in memory)
• Deeper application knowledge → reorganize data, change computation, etc.
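One way to act on block usage history is to pin frequently used blocks in memory and evict the rest LRU-style. The sketch below is an assumed policy for illustration, not Blockus's actual design; the class and threshold names are hypothetical.

```python
from collections import Counter, OrderedDict

class BlockCache:
    """Frequency-aware block cache (illustrative): blocks accessed at least
    pin_threshold times are treated as hot and evicted only as a last resort."""
    def __init__(self, capacity, pin_threshold=3):
        self.capacity = capacity
        self.pin_threshold = pin_threshold
        self.freq = Counter()                 # block usage history
        self.cache = OrderedDict()            # block_id -> data, in LRU order

    def get(self, block_id, load):
        self.freq[block_id] += 1
        if block_id in self.cache:            # hit: refresh LRU position
            self.cache.move_to_end(block_id)
            return self.cache[block_id]
        data = load(block_id)                 # miss: I/O from SSD/HDD
        self.cache[block_id] = data
        while len(self.cache) > self.capacity:
            cold = [b for b in self.cache if self.freq[b] < self.pin_threshold]
            victim = cold[0] if cold else next(iter(self.cache))
            del self.cache[victim]            # evict coldest; else fall back to LRU
        return data
```

A pure LRU cache would evict the popular block whenever enough other blocks pass through; keeping the usage counter lets hot graph regions stay resident.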
Blockus Architecture
[Figure: a master running the scheduler; each worker has an executor pool of R instances, DRAM, SSDs, HDDs, and an I/O engine]
• Worker I/O engine: executes all I/O operations
• Scheduler: performs I/O and task scheduling
Scheduler Challenges
• Presto scheduler
  – Assumes everything fits in DRAM
  – Schedules each task on the worker that has the most bytes of its input arrays
  – Transfers non-local input data greedily (no network scheduling)
• Blockus: better scheduling policies
  – Load balancing (in memory)
  – Intelligent prefetching
  – Intelligent computation prioritization
  – Adaptive caching
Blockus task scheduling policies
• Reorder computation based on memory contents to minimize I/O
• Different reordering policies
Project ideas
• Fault tolerance by keeping track of lineage
• Smart communication (e.g. pipelines, broadcasting)
• Static analysis of programs to better understand dependencies
  – Better garbage collection
  – Better asynchronicity/reordering
• Recursive parallelism
• Matrix reordering
  – To improve caching for out-of-core execution
  – Load balancing for a distributed system
• Distributed out-of-core computation
• Heterogeneous storage (different SSDs, HDDs, etc.)