TRANSCRIPT
4/12/13
Distributed Machine Learning and Graph Processing with Sparse Matrices
Shivaram Venkataraman*, Erik Bodzsar#
Indrajit Roy+, Alvin AuYoung+, Rob Schreiber+ *UC Berkeley, #U Chicago, +HP Labs
Big Data, Complex Algorithms
PageRank (Dominant eigenvector)
Recommendations (Matrix factorization)
Anomaly detection (Top-K eigenvalues)
User Importance (Vertex Centrality)
Machine learning + Graph algorithms
Iterative Linear Algebra Operations
PageRank
[Figure: example web graph with pages P, Q, R, S; its 0/1 adjacency matrix; and the resulting PageRank values 0.035, 0.006, 0.008, 0.032]
PageRank Using Matrices
Power method → dominant eigenvector
Iterate: p ← M * p
M = web graph matrix, p = PageRank vector
R
Array-oriented programming environment. Millions of users, thousands of free packages. Popular among the statistics and bio-informatics communities.
PageRank Using Matrices
Power method → dominant eigenvector
M = web graph matrix, p = PageRank vector
Simplified algorithm: repeat { p = M*p }
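As an illustration (not the talk's R/Presto code), this power iteration can be sketched with SciPy sparse matrices; the 4-page link graph below is a hypothetical toy example:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 4-page link graph; links[i, j] = 1 means page j links to page i.
links = np.array([[0, 0, 1, 1],
                  [1, 0, 0, 0],
                  [1, 1, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
M = csr_matrix(links / links.sum(axis=0))  # column-stochastic web graph matrix

p = np.full(4, 0.25)   # start from the uniform vector
for _ in range(100):   # repeat { p = M*p }
    p = M @ p          # p converges to the dominant eigenvector of M
```

Because M is column-stochastic, the entries of p keep summing to 1, and the fixed point p = M*p is the PageRank vector of this toy graph.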
PageRank Using Matrices
[Figure: the web graph matrix M and the PageRank vector p, partitioned row-wise into blocks P1, P2, …, PN/s for parallel computation]
Large-Scale Processing Frameworks
• Data-parallel frameworks – MapReduce/Dryad (2004): process each record in parallel. Use case: computing sufficient statistics.
• Graph-centric frameworks – Pregel/GraphLab (2010): process each vertex in parallel. Use case: graphical models.
• Array-based frameworks – MadLINQ (2012): process blocks of an array in parallel. Challenge: sparse matrices.
Challenge 1 – Communication
R is single-threaded, so R processes share data through pipes or the network: time-inefficient (sending copies) and space-inefficient (extra copies).
[Figure: R processes on Server 1 and Server 2, each holding its own local copy of the data, exchanging further copies over pipes and the network]
Sparse matrices → communication overhead
Challenge 2 – Sparse Matrices
[Figure: normalized block density (log scale, 1–10,000×) by block ID for the LiveJournal, Netflix, and ClueWeb-1B graphs: block densities are highly skewed]
Presto
A framework for large-scale iterative linear algebra that extends R for scalability.
Outline • Motivation • Programming model • Design • Applications and Results
Programming model primitives: darray creates a distributed array; foreach runs a function f(x) in parallel over array partitions.
PageRank Using Presto
M <- darray(dim=c(N,N), blocks=c(s,N))   # create distributed arrays
P <- darray(dim=c(N,1), blocks=c(s,1))
while(..) {
  foreach(i, 1:len,
    calculate(p=splits(P,i), m=splits(M,i),
              x=splits(P_old), z=splits(Z,i)) {
      p <- (m %*% x) + z
    })
  P_old <- P
}
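The row-blocked update p = M*x + z in the foreach body can be sketched in plain NumPy (names and random test data are illustrative, not Presto's implementation):

```python
import numpy as np

# Each task i computes only its row block: p_i = M_i @ x + z_i.
rng = np.random.default_rng(0)
N, s = 8, 2                      # N rows, split into blocks of s rows
M = rng.random((N, N))           # stand-in for the web graph matrix
x = np.full(N, 1.0 / N)          # previous PageRank vector (P_old)
z = np.full(N, 0.15 / N)         # additive term (Z)

p = np.empty(N)
for i in range(0, N, s):         # one "split" per loop iteration
    p[i:i+s] = M[i:i+s] @ x + z[i:i+s]
```

Concatenating the per-block results reproduces the full product, which is why the blocks can be computed by independent tasks in a cluster.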
[Figure: the N×N matrix M and the N×1 vectors P, P_old, Z, each partitioned into s-row blocks P1, P2, …, PN/s]
PageRank Using Presto
M <- darray(dim=c(N,N), blocks=c(s,N))
P <- darray(dim=c(N,1), blocks=c(s,1))
while(..) {
  foreach(i, 1:len,                          # execute function in a cluster
    calculate(p=splits(P,i), m=splits(M,i),  # splits() passes array partitions
              x=splits(P_old), z=splits(Z,i)) {
      p <- (m %*% x) + z
    })
  P_old <- P
}
Breadth-first Search Using Matrices
G = adjacency matrix, X = BFS vector

    A B C D E
  A 1 1 1 0 0
  B 0 1 0 1 0
  C 0 1 1 0 0
  D 0 0 0 1 1
  E 0 0 0 0 1

Start with X = (1,0,0,0,0), i.e. only A visited; each multiply grows the visited set: (*,*,*,0,0) → (*,*,*,*,0) → (*,*,*,*,*).

Simplified algorithm: repeat { X = G*X }

Matrix operations: easy to express, efficient to implement
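As a sketch of the idea (in SciPy rather than R), BFS by repeated sparse matrix-vector products; multiplying by the transpose of G follows out-edges of the visited set, which matches the slide's G*X when X is read as a row vector:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Adjacency matrix from the slide: G[i, j] = 1 if there is an edge i -> j
# (the self-loops keep already-visited vertices visited).
G = csr_matrix(np.array([[1, 1, 1, 0, 0],   # A -> A, B, C
                         [0, 1, 0, 1, 0],   # B -> B, D
                         [0, 1, 1, 0, 0],   # C -> B, C
                         [0, 0, 0, 1, 1],   # D -> D, E
                         [0, 0, 0, 0, 1]],  # E -> E
                        dtype=np.int8))

x = np.array([1, 0, 0, 0, 0], dtype=np.int8)  # start BFS at vertex A
for _ in range(4):                            # repeat { X = G*X }
    x = np.minimum(G.T @ x, 1).astype(np.int8)
# x is now all ones: every vertex is reachable from A
```

The minimum-with-1 step keeps the vector boolean; in exact semiring terms this is a matrix-vector product over (OR, AND).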
Outline • Motivation • Programming model • Design • Applications and Results
Presto Architecture
[Figure: a master coordinating workers; each worker runs an executor pool of R instances sharing the node's DRAM]
Dynamic Partitioning of Matrices
Profile the execution, then repartition accordingly.
Size Invariants
invariant(Mat, vec) — declares that Mat and vec must stay compatibly partitioned when arrays are repartitioned.
Outline
• Motivation • Programming model • Design • Applications and Results
demo
lj_matrix  <- darray(dim=c(n,n), blocks=c(n,n))
in_vector  <- darray(dim=c(n,1), blocks=c(s,1), data=1/n)
out_vector <- darray(dim=c(n,1), blocks=c(s,1))
foreach(i, 1:length(splits(lj_matrix)),
  function(g = splits(lj_matrix, i),
           x = splits(in_vector),      # renamed from i to avoid shadowing the index
           o = splits(out_vector, i)) {
    o <- g %*% x
    update(o)
  })
Examples

dotprod <- function(a, b) {
  tmp <- darray(dim=a@dim/a@blocks, c(1,1))
  foreach(i, 1:length(splits(a)),
    mult <- function(tmp = splits(tmp,i),
                     a = splits(a,i), b = splits(b,i)) {
      tmp <- sum(a * b)        # per-partition partial sum
      update(tmp)
    }, progress=FALSE)
  return(sum(getpartition(tmp)))
}
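A plain-Python analogue of dotprod (with NumPy's array_split as an illustrative stand-in for darray splits): each partition computes a partial sum, and the partials are combined at the end.

```python
import numpy as np

def blocked_dotprod(a, b, nblocks=4):
    # One partial sum per partition pair, mirroring sum(a * b) per split.
    partials = [float(np.dot(ai, bi))
                for ai, bi in zip(np.array_split(a, nblocks),
                                  np.array_split(b, nblocks))]
    return sum(partials)   # combine the per-partition results

print(blocked_dotprod(np.arange(8.0), np.ones(8)))  # 0+1+...+7 = 28.0
```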
Examples

reduce <- function(f, d) {
  i <- 1
  n <- length(splits(d))
  repeat {
    step <- 2*i
    reducers <- floor((n-i-1)/step) + 1
    foreach(j, 1:reducers,
      reduce.pair <- function(s1 = splits(d, (j-1)*step+1),
                              s2 = splits(d, (j-1)*step+1+i),
                              fun = f) {
        s1 <- fun(s1, s2)     # combine a pair of partitions
        update(s1)
      }, progress=FALSE)
    i <- i*2
    if (1+i > n) { break }
  }
}
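The same pairwise tree reduction in plain Python over a list of partitions (an illustrative stand-in for darray splits): each round combines elements at stride 2*i, finishing in O(log n) rounds.

```python
def tree_reduce(f, parts):
    """Pairwise tree reduction: round with stride i combines parts[j]
    with parts[j+i], mirroring the darray reduce sketch."""
    parts = list(parts)
    i, n = 1, len(parts)
    while i < n:
        step = 2 * i
        for j in range(0, n - i, step):
            parts[j] = f(parts[j], parts[j + i])
        i *= 2
    return parts[0]

print(tree_reduce(lambda a, b: a + b, [1, 2, 3, 4, 5]))  # → 15
```

In Presto each pair combination runs as a separate cluster task; here the rounds run sequentially, but the dependency structure is the same.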
Applications Implemented in Presto
Application              Algorithm                  Presto LOC
PageRank                 Eigenvector calculation    41
Triangle counting        Top-K eigenvalues          121
Netflix recommendation   Matrix factorization       130
Centrality measure       Graph algorithm            132
k-path connectivity      Graph algorithm            30
k-means                  Clustering                 71
Sequence alignment       Smith-Waterman             64

All in fewer than 140 lines of code.
Repartitioning Progress
[Figure: split size (GB, 0–30) by iteration count (1–15) as repartitioning proceeds]
Repartitioning Benefits
[Figure: per-worker transfer and compute time (0–160 s), without repartitioning vs. with repartitioning]
Presto
• Versioning distributed arrays
• Co-partitioning matrices
• Locality-based scheduling
• Caching partitions
Conclusion
Linear algebra is a powerful abstraction that easily expresses machine learning and graph algorithms. Challenges: sparse matrices and data sharing. Presto is a prototype that extends R to address them.
Blockus
• Expressive distributed computing systems are in-memory
• Being in-memory is problematic for (very) big data – expensive – fault-tolerance problems
• Goals: scale Presto vertically; eliminate the memory limitation
Vertical scaling
• Use SSDs – low latency – fast small I/O – parallel I/O
• SSDs are still significantly slower than memory
• Need to do better than OS swap!
[Figure: read speed (MB/s, 0–600) by block size (16 KB–16 MB) with 4 threads, SSD vs. HDD]
Opportunities from Vertical Scaling
• Enable big data analytics on small systems – a laptop! – a small cluster
• Energy savings for extreme-scale systems
• Reduced cost, increased fault tolerance
Related work: OS paging/buffer cache
• General purpose – (almost) no application knowledge
• LRU caching, (conservative) read-ahead
• Reactive (do I/O on page fault)
Related work: SSDAlloc
• C library: replace malloc with malloc_object
• Objects are stored on SSD; memory is used as a cache
• Advantage over OS paging: object-level caching
• Useful for web servers, etc.
Related work: GraphChi
• Vertex-level programming – iterative; update vertex neighborhoods in each iteration, for each vertex
• GraphChi: I/O-optimized execution engine – makes sure I/O is sequential (essential for HDDs, works for SSDs too)
Blockus Idea
• Use some form of application knowledge to optimize I/O
• Know future computation → prefetching – from programmer hints, static analysis, history, etc.
• Know about parallelism → reorder computation to decrease I/O
• Block usage history → better caching (e.g. always keep popular parts of a graph in memory)
• Deeper application knowledge → reorganize data, change computation, etc.
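One way to act on block usage history is to pin frequently used blocks in memory and evict the rest LRU-style. The sketch below is an assumed policy for illustration, not Blockus's actual design; the class and threshold names are hypothetical.

```python
from collections import Counter, OrderedDict

class BlockCache:
    """Frequency-aware block cache (illustrative): blocks accessed at least
    pin_threshold times are treated as hot and evicted only as a last resort."""
    def __init__(self, capacity, pin_threshold=3):
        self.capacity = capacity
        self.pin_threshold = pin_threshold
        self.freq = Counter()                 # block usage history
        self.cache = OrderedDict()            # block_id -> data, in LRU order

    def get(self, block_id, load):
        self.freq[block_id] += 1
        if block_id in self.cache:            # hit: refresh LRU position
            self.cache.move_to_end(block_id)
            return self.cache[block_id]
        data = load(block_id)                 # miss: I/O from SSD/HDD
        self.cache[block_id] = data
        while len(self.cache) > self.capacity:
            cold = [b for b in self.cache if self.freq[b] < self.pin_threshold]
            victim = cold[0] if cold else next(iter(self.cache))
            del self.cache[victim]            # evict coldest; else fall back to LRU
        return data
```

A pure LRU cache would evict the popular block whenever enough other blocks pass through; keeping the usage counter lets hot graph regions stay resident.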
Blockus Architecture
[Figure: a master running the scheduler; each worker has an executor pool of R instances, DRAM, SSDs, HDDs, and an I/O engine]
• Worker I/O engine: executes all I/O operations
• Scheduler: performs I/O and task scheduling
Scheduler Challenges
• Presto scheduler
  – Assumes everything fits in DRAM
  – Schedules each task on the worker that has the most bytes of its input arrays
  – Transfers non-local input data greedily (no network scheduling)
• Blockus: better scheduling policies
  – Load balancing (in memory)
  – Intelligent prefetching
  – Intelligent computation prioritization
  – Adaptive caching
Blockus task scheduling policies
• Reorder computation based on memory contents to minimize I/O
• Different reordering policies
Project ideas
• Fault tolerance by keeping track of lineage
• Smart communication (e.g. pipelines, broadcasting)
• Static analysis of programs to better understand dependencies
  – Better garbage collection
  – Better asynchronicity/reordering
• Recursive parallelism
• Matrix reordering
  – To improve caching for out-of-core execution
  – Load balancing for a distributed system
• Distributed out-of-core computation
• Heterogeneous storage (different SSDs, HDDs, etc.)