TRANSCRIPT

Page 1: Big Data, Complex Algorithmspeople.cs.uchicago.edu/~aachien/lssg/people/andrew-chien/chien... · 4/12/13 1 Distributed Machine Learning and Graph Processing with Sparse Matrices Shivaram

4/12/13

Distributed Machine Learning and Graph Processing with Sparse Matrices

Shivaram Venkataraman*, Erik Bodzsar#, Indrajit Roy+, Alvin AuYoung+, Rob Schreiber+

*UC Berkeley, #U Chicago, +HP Labs

Big Data, Complex Algorithms

PageRank (Dominant eigenvector)

Recommendations (Matrix factorization)

Anomaly detection (Top-K eigenvalues)

User Importance (Vertex Centrality)

Machine learning + Graph algorithms

Iterative Linear Algebra Operations



PageRank

[Figure: a four-page web graph (P, Q, R, S), its 0/1 adjacency matrix, and the resulting PageRank values (0.035, 0.006, 0.008, 0.032).]

PageRank Using Matrices

Power method: compute the dominant eigenvector

Iterate: p ← M p

M = web graph matrix, p = PageRank vector


R

Array-oriented programming environment. Millions of users, thousands of free packages. Popular among statisticians and the bioinformatics community.

PageRank Using Matrices

Power method: compute the dominant eigenvector

p ← M p, where M = web graph matrix, p = PageRank vector

Simplified algorithm:

repeat { p = M*p }
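The simplified loop above can be sketched in plain Python (an illustration, not Presto code). The 4-page, column-stochastic link matrix M below is invented for the example; column j spreads page j's rank over its out-links.

```python
# Power-method sketch of `repeat { p = M*p }` on a tiny invented web graph.

def mat_vec(M, p):
    """Dense matrix-vector product."""
    return [sum(M[i][j] * p[j] for j in range(len(p))) for i in range(len(M))]

def power_method(M, iters=50):
    n = len(M)
    p = [1.0 / n] * n                      # start from the uniform vector
    for _ in range(iters):
        p = mat_vec(M, p)                  # p = M*p
        norm = sum(abs(v) for v in p)
        p = [v / norm for v in p]          # keep p a probability vector
    return p

# Column-stochastic link matrix (made up): column j spreads page j's
# rank evenly over its out-links.
M = [[0.0, 0.0, 1.0, 0.5],
     [0.5, 0.0, 0.0, 0.0],
     [0.5, 0.5, 0.0, 0.5],
     [0.0, 0.5, 0.0, 0.0]]

ranks = power_method(M)                    # converges to the dominant eigenvector
```

After enough iterations p stops changing (up to normalization): it is the dominant eigenvector of M, i.e. the PageRank vector.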


PageRank Using Matrices

[Figure: the web graph matrix M (N×N) and the PageRank vector p (N×1), each partitioned row-wise into blocks P1 … PN/s for distributed execution.]

Power method: compute the dominant eigenvector. M = web graph matrix, p = PageRank vector.

Large-Scale Processing Frameworks

Data-parallel frameworks – MapReduce/Dryad (2004)
–  Process each record in parallel
–  Use case: computing sufficient statistics

Graph-centric frameworks – Pregel/GraphLab (2010)
–  Process each vertex in parallel
–  Use case: graphical models

Array-based frameworks – MadLINQ (2012)
–  Process blocks of an array in parallel
–  Challenges with sparse matrices


Challenge 1 – Communication

R is single-threaded, so parallel R processes must share data through pipes or the network. This is time-inefficient (sending copies) and space-inefficient (extra copies).

[Figure: on Server 1, one R process holds the data while the others hold local copies; further copies cross the network to R processes on Server 2.]

Challenge 2 – Sparse Matrices

Sparse matrices → communication overhead


Challenge 2 – Sparse Matrices

[Figure: normalized block density (log scale, 1–10000) vs. block ID (1–91) for the LiveJournal, Netflix, and ClueWeb-1B datasets; density varies across blocks by orders of magnitude.]

Presto

A framework for large-scale iterative linear algebra that extends R for scalability.


Outline •  Motivation •  Programming model •  Design •  Applications and Results


darray


foreach

f(x)

PageRank Using Presto

M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(..) {
  foreach(i, 1:len,
    calculate(p=splits(P,i), m=splits(M,i),
              x=splits(P_old), z=splits(Z,i)) {
      p <- (m*x)+z
    })
  P_old <- P
}

Create Distributed Array

[Figure: M is an N×N matrix split into s×N row blocks; P, P_old, and Z are N×1 vectors split into blocks P1 … PN/s.]


PageRank Using Presto

M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(..) {
  foreach(i, 1:len,
    calculate(p=splits(P,i), m=splits(M,i),
              x=splits(P_old), z=splits(Z,i)) {
      p <- (m*x)+z
    })
  P_old <- P
}

Execute the function in a cluster; pass array partitions.

[Figure: the same partitioned arrays M, P, P_old, and Z, with each foreach task operating on one set of blocks.]

Breadth-first Search Using Matrices

G = adjacency matrix, X = BFS vector

[Figure: a five-vertex graph A–E and its adjacency matrix

       A B C D E
    A  1 1 1 0 0
    B  0 1 0 1 0
    C  0 1 1 0 0
    D  0 0 0 1 1
    E  0 0 0 0 1

Starting from X = (1 0 0 0 0) at vertex A, each multiplication marks newly reached vertices: (* * * 0 0), then (* * * * 0), then (* * * * *).]

Simplified algorithm: repeat { X = G*X }

Matrix operations are easy to express and efficient to implement.
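The same repeat { X = G*X } idea can be sketched in plain Python over the Boolean semiring (OR in place of +, AND in place of *), using the five-vertex example graph; this is an illustration, not Presto code.

```python
# BFS as repeated Boolean-semiring products, mirroring `repeat { X = G*X }`.

G = [  # adjacency matrix from the slide; 1s on the diagonal keep
       # already-reached vertices marked.
    [1, 1, 1, 0, 0],  # A -> A, B, C
    [0, 1, 0, 1, 0],  # B -> B, D
    [0, 1, 1, 0, 0],  # C -> B, C
    [0, 0, 0, 1, 1],  # D -> D, E
    [0, 0, 0, 0, 1],  # E -> E
]

def bfs_step(x, G):
    """One Boolean step: x[j] becomes 1 if any reached vertex i links to j."""
    n = len(x)
    return [int(any(x[i] and G[i][j] for i in range(n))) for j in range(n)]

def bfs(G, start):
    x = [0] * len(G)
    x[start] = 1
    while True:
        nxt = bfs_step(x, G)
        if nxt == x:          # fixed point: no new vertices reached
            return x
        x = nxt

reached = bfs(G, start=0)     # start the search from vertex A
```

Each step corresponds to one frontier expansion, and the fixed point is the set of all vertices reachable from the start.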


Outline •  Motivation •  Programming model •  Design •  Applications and Results


Presto Architecture

[Figure: a master node coordinates multiple workers; each worker hosts an executor pool of R instances sharing DRAM.]


Dynamic Partitioning of Matrices

Profile execution

Partition
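The profile-then-partition step might look like this hypothetical helper, which measures per-partition execution times and splits the slowest block when it dominates; the threshold and the 2-way split policy are invented for illustration.

```python
# Hypothetical sketch of dynamic repartitioning: split the slowest
# partition in half when its time dominates the average.

def repartition(partitions, times, imbalance=2.0):
    """partitions: list of (start_row, end_row); times: seconds per partition.
    Returns a new partition list, splitting the slowest block if needed."""
    avg = sum(times) / len(times)
    slowest = max(range(len(times)), key=lambda i: times[i])
    if times[slowest] <= imbalance * avg:
        return partitions                  # balanced enough, leave as-is
    start, end = partitions[slowest]
    mid = (start + end) // 2
    return (partitions[:slowest]
            + [(start, mid), (mid, end)]   # replace one block with two halves
            + partitions[slowest + 1:])

parts = [(0, 100), (100, 200), (200, 300)]
new_parts = repartition(parts, times=[1.0, 9.0, 1.2])
```

Repeating this after every profiled iteration gradually evens out work across dense and sparse regions of the matrix.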


Size Invariants

invariant(Mat, vec)
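As an illustration of what a size invariant such as invariant(Mat, vec) buys: when the matrix's row partitioning changes, the vector it multiplies must be re-split the same way so block-wise products stay well defined. The Invariant class below is hypothetical, not Presto's API.

```python
# Hypothetical sketch of a size invariant tying a matrix's row
# partitioning to a vector's partitioning.

class Invariant:
    def __init__(self, mat_row_blocks, vec_blocks):
        self.mat_row_blocks = mat_row_blocks   # row boundaries of the matrix
        self.vec_blocks = vec_blocks           # block boundaries of the vector

    def check(self):
        """Block-wise m %*% x is well defined only if boundaries agree."""
        return self.mat_row_blocks == self.vec_blocks

    def repartition(self, new_blocks):
        """Repartition both sides together so the invariant survives."""
        self.mat_row_blocks = list(new_blocks)
        self.vec_blocks = list(new_blocks)

inv = Invariant([0, 100, 200], [0, 100, 200])
inv.repartition([0, 50, 100, 200])             # split the first block of both
```

Declaring the invariant once lets the runtime repartition matrices freely without breaking the algorithm.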


Outline

•  Motivation •  Programming model •  Design •  Applications and Results


demo


lj_matrix  <- darray(dim=c(n,n), blocks=c(s,n))
in_vector  <- darray(dim=c(n,1), blocks=c(s,1), data=1/n)
out_vector <- darray(dim=c(n,1), blocks=c(s,1))

foreach(i, 1:length(splits(lj_matrix)),
        function(g = splits(lj_matrix, i),
                 x = splits(in_vector),
                 o = splits(out_vector, i)) {
          o <- g %*% x
          update(o)
        })


dotprod <- function(a, b) {
  tmp <- darray(dim=a@dim/a@blocks, blocks=c(1,1))
  foreach(i, 1:length(splits(a)),
          mult <- function(tmp = splits(tmp, i),
                           a = splits(a, i),
                           b = splits(b, i)) {
            tmp <- sum(a * b)
            update(tmp)
          }, progress=FALSE)
  return(sum(getpartition(tmp)))
}

Examples

reduce <- function(f, d) {
  i <- 1
  n <- length(splits(d))
  repeat {
    step <- 2*i
    reducers <- floor((n-i-1)/step)+1
    foreach(j, 1:reducers,
            reduce.pair <- function(s1 = splits(d, (j-1)*step+1),
                                    s2 = splits(d, (j-1)*step+1+i),
                                    fun = f) {
              s1 <- fun(s1, s2)
              update(s1)
            }, progress=FALSE)
    i <- i*2
    if (1+i > n) {
      break
    }
  }
}

Examples


Applications Implemented in Presto

Application               Algorithm                 Presto LOC
PageRank                  Eigenvector calculation       41
Triangle counting         Top-K eigenvalues            121
Netflix recommendation    Matrix factorization         130
Centrality measure        Graph algorithm              132
k-path connectivity       Graph algorithm               30
k-means                   Clustering                    71
Sequence alignment        Smith-Waterman                64

Fewer than 140 lines of code

Repartitioning Progress

[Figure: split size (GB, 0–30) vs. iteration count (1–15), showing split sizes changing as repartitioning proceeds.]


Repartitioning benefits

[Figure: two per-worker timelines (0–160) of transfer and compute, one without repartitioning and one with repartitioning.]

Presto

Versioning distributed arrays
Co-partitioning matrices
Locality-based scheduling
Caching partitions


Conclusion

Linear algebra is a powerful abstraction: machine learning and graph algorithms are easy to express. Challenges: sparse matrices and data sharing. Presto is a prototype that extends R.

Blockus

•  Expressive distributed computing systems are in-memory

•  Being in-memory is problematic for (very) big data –  Expensive –  Fault tolerance problems

•  Scale Presto vertically •  Eliminate memory limitation



Vertical scaling

•  Use SSDs –  Low latency –  Fast small I/O –  Parallel I/O

•  SSDs still significantly slower than memory •  Need to do better than OS swap!


[Figure: read speed with 4 threads (MB/s, 0–600) vs. block size (16k–16384k), comparing SSD and HDD.]

Opportunities from Vertical Scaling

•  Enable big data analytics on small systems –  Laptop! –  Small cluster

•  Energy savings for extreme scale systems

•  Reduced cost, increased fault tolerance



Related work: OS paging/buffer cache

•  General purpose •  (almost) no application knowledge •  LRU caching, (conservative) read-ahead •  Reactive (do I/O on pagefault)


Related work: SSDAlloc

•  C library, replace malloc with malloc_object •  Objects are stored on SSD •  Memory is used as a cache

•  Advantage over OS paging: object-level caching •  Useful for web servers, etc.



Related work: GraphChi

•  Vertex-level programming
•  Iterative: update vertex neighborhoods in each iteration, for each vertex
•  GraphChi: I/O-optimized execution engine
•  Makes sure I/O is sequential (essential for HDDs, works for SSDs too)

Blockus Idea

•  Use some form of application knowledge to optimize I/O
•  Know future computation → prefetching
   –  From programmer hints, static analysis, history, etc.
•  Know about parallelism → reorder computation to decrease I/O
•  Block usage history → better caching (e.g. always keep popular parts of a graph in memory)
•  Deeper application knowledge → reorganize data, change computation, etc.
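The block-usage-history bullet might look like this hypothetical least-frequently-used cache, which tends to keep popular graph blocks resident in memory; the class, names, and eviction policy are invented for illustration.

```python
# Hypothetical sketch of usage-history-driven caching: evict the block
# with the smallest access count, so hot blocks stay in memory.

class FrequencyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}      # block_id -> data currently in memory
        self.hits = {}        # block_id -> access count (usage history)

    def get(self, block_id, load):
        """Return the block, loading (and possibly evicting) on a miss."""
        self.hits[block_id] = self.hits.get(block_id, 0) + 1
        if block_id not in self.blocks:
            if len(self.blocks) >= self.capacity:
                # Evict the cached block with the smallest usage count.
                victim = min(self.blocks, key=lambda b: self.hits[b])
                del self.blocks[victim]
            self.blocks[block_id] = load(block_id)
        return self.blocks[block_id]

cache = FrequencyCache(capacity=2)
for b in ["hub", "hub", "leaf1", "hub", "leaf2"]:
    cache.get(b, load=lambda bid: f"data:{bid}")
```

With this access pattern the frequently used "hub" block survives eviction while the rarely used "leaf1" block does not.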


Blockus architecture

[Figure: a master with a scheduler coordinates workers; each worker runs an executor pool of R instances over DRAM, with an I/O engine managing SSDs and HDDs.]

•  Worker I/O engine: executes all I/O operations
•  Scheduler: performs I/O and task scheduling

Scheduler challenges

•  Presto scheduler
   •  Assumes everything fits in DRAM
   •  Schedules each task on the worker that holds the most bytes of its input arrays
   •  Transfers non-local input data greedily (no network scheduling)

•  Blockus: better scheduling policies
   •  Load balancing (in memory)
   •  Intelligent prefetching
   •  Intelligent computation prioritization
   •  Adaptive caching


Blockus task scheduling policies

•  Reorder computation based on memory contents to minimize I/O

•  Different reordering policies
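One simple reordering policy can be sketched as: run the tasks whose input blocks are already in memory first, so I/O for the rest can be deferred and batched. The helper below is a hypothetical illustration, not Blockus code.

```python
# Hypothetical sketch of memory-aware task reordering: sort tasks by
# how many of their input blocks would have to be loaded from storage.

def reorder(tasks, resident):
    """tasks: list of (name, input_blocks); resident: set of in-memory blocks.
    Tasks needing the fewest block loads run first."""
    def io_cost(task):
        _, blocks = task
        return sum(1 for b in blocks if b not in resident)
    return sorted(tasks, key=io_cost)      # stable sort preserves ties

tasks = [("t1", {"b1", "b9"}),   # needs one load (b9)
         ("t2", {"b7", "b8"}),   # needs two loads
         ("t3", {"b1", "b2"})]   # fully resident, zero loads
order = reorder(tasks, resident={"b1", "b2"})
```

More elaborate policies could also account for which blocks a task's output makes resident for its successors.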

Project ideas

•  Fault tolerance by keeping track of lineage
•  Smart communication (e.g. pipelines, broadcasting)
•  Static analysis of programs to better understand dependencies
   –  Better garbage collection
   –  Better asynchronicity/reordering
•  Recursive parallelism
•  Matrix reordering
   –  To improve caching for out-of-core execution
   –  Load balancing for distributed systems
•  Distributed out-of-core computation
•  Heterogeneous storage (different SSDs, HDDs, etc.)