Large-Scale Graph-Structured Machine Learning


Page 1: Large-Scale Graph-Structured Machine Learning -

Joseph Gonzalez
Postdoc, UC Berkeley AMPLab; PhD @ CMU

A System for Distributed Graph-Parallel Machine Learning

The Team: Yucheng Low, Aapo Kyrola, Danny Bickson, Alex Smola, Haijie Gu, Carlos Guestrin, Guy Blelloch

Page 2: Large-Scale Graph-Structured Machine Learning -

About me …

2

Machine Learning + Graphical Models
Big, Scalable Learning

Page 3: Large-Scale Graph-Structured Machine Learning -

Graphs are Essential to Data-Mining and Machine Learning

• Identify influential people and information
• Find communities
• Target ads and products
• Model complex data dependencies

3

Page 4: Large-Scale Graph-Structured Machine Learning -

Example: Estimate Political Bias

[Figure: a social network in which a few users are labeled Liberal or Conservative based on their posts; the labels of the remaining users are unknown ("?") and must be inferred from the graph structure.]

4

Conditional Random Field, solved with Loopy Belief Propagation

Page 5: Large-Scale Graph-Structured Machine Learning -

Collaborative Filtering: Exploiting Dependencies

City of God

Wild Strawberries

The Celebration

La Dolce Vita

Women on the Verge of a Nervous Breakdown

What do I recommend???

Page 6: Large-Scale Graph-Structured Machine Learning -

Matrix Factorization: Alternating Least Squares (ALS)

[Figure: the Netflix Users x Movies rating matrix is approximated as the product of user factors (U) and movie factors (M); the observed ratings r_ij form a bipartite graph connecting user factors f(i) to movie factors f(j).]

Iterate: solve for the factor of user i, then for the factor of movie j.
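As an illustration of the alternating updates (not code from the talk), here is a minimal rank-1 ALS sketch in C++; the toy ratings, the regularization parameter lambda, and the iteration count are assumptions made for the example.

#include <cstdio>
#include <vector>

// Minimal rank-1 ALS sketch: each user i and movie j has a single scalar
// factor, and we alternate closed-form least-squares updates while holding
// the other side fixed.
struct Rating { int user, movie; double value; };

int main() {
    const int num_users = 2, num_movies = 3;
    const double lambda = 0.1;                      // regularization (assumed)
    std::vector<Rating> ratings = {{0,0,4.0},{0,1,3.0},{1,1,5.0},{1,2,2.0}};

    std::vector<double> u(num_users, 1.0), m(num_movies, 1.0);

    for (int iter = 0; iter < 20; ++iter) {
        // Update user factors with movie factors fixed:
        //   u[i] = sum_j r_ij * m[j] / (lambda + sum_j m[j]^2)
        for (int i = 0; i < num_users; ++i) {
            double num = 0, den = lambda;
            for (const Rating& r : ratings)
                if (r.user == i) { num += r.value * m[r.movie]; den += m[r.movie] * m[r.movie]; }
            u[i] = num / den;
        }
        // Update movie factors with user factors fixed (the symmetric step).
        for (int j = 0; j < num_movies; ++j) {
            double num = 0, den = lambda;
            for (const Rating& r : ratings)
                if (r.movie == j) { num += r.value * u[r.user]; den += u[r.user] * u[r.user]; }
            m[j] = num / den;
        }
    }
    for (int i = 0; i < num_users; ++i)
        for (int j = 0; j < num_movies; ++j)
            std::printf("predicted rating(user %d, movie %d) = %.2f\n", i, j, u[i] * m[j]);
    return 0;
}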

Page 7: Large-Scale Graph-Structured Machine Learning -

PageRank

• Everyone starts with equal rank
• Update ranks in parallel
• Iterate until convergence

Rank of user i = weighted sum of neighbors' ranks:  R[i] = 0.15 + sum_j w_ji * R[j]

7
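A minimal serial sketch of the iteration described above (everyone starts with equal rank, ranks are recomputed until they stop changing); the toy graph and the uniform edge weights w_ji = 1/out-degree(j) are assumptions for the example.

#include <cmath>
#include <cstdio>
#include <vector>

// Minimal PageRank sketch: R[i] = 0.15 + sum_j w_ji * R[j], with
// w_ji = 1 / out_degree(j). The graph and tolerance are illustrative.
int main() {
    // out_edges[j] lists the vertices j links to (a tiny 4-vertex example).
    std::vector<std::vector<int>> out_edges = {{1, 2}, {2}, {0}, {0, 2}};
    const int n = static_cast<int>(out_edges.size());
    std::vector<double> R(n, 1.0);                  // everyone starts with equal rank

    for (int iter = 0; iter < 100; ++iter) {
        std::vector<double> next(n, 0.15);
        for (int j = 0; j < n; ++j)                 // each j spreads R[j] to its out-neighbors
            for (int i : out_edges[j])
                next[i] += R[j] / out_edges[j].size();
        double change = 0;
        for (int i = 0; i < n; ++i) change += std::fabs(next[i] - R[i]);
        R = next;
        if (change < 1e-6) break;                   // iterate until convergence
    }
    for (int i = 0; i < n; ++i) std::printf("R[%d] = %.4f\n", i, R[i]);
    return 0;
}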

Page 8: Large-Scale Graph-Structured Machine Learning -

How should we program graph-parallel algorithms?

Low-level tools like MPI and Pthreads?

- Me, during my first years of grad school

8

Page 9: Large-Scale Graph-Structured Machine Learning -

Threads, Locks, and MPI

• ML experts repeatedly solve the same parallel design challenges:
  – Implement and debug complex parallel system
  – Tune for a single parallel platform
  – Six months later the conference paper contains: “We implemented ______ in parallel.”
• The resulting code:
  – is difficult to maintain and extend
  – couples learning model and implementation

9

Page 10: Large-Scale Graph-Structured Machine Learning -

How should we program graph-parallel algorithms?

10

High-level Abstractions!

- Me, now

Page 11: Large-Scale Graph-Structured Machine Learning -

The Graph-Parallel Abstraction

• A user-defined Vertex-Program runs on each vertex
• Graph constrains interaction along edges

– Using messages (e.g. Pregel [PODC’09, SIGMOD’10])

– Through shared state (e.g., GraphLab [UAI’10, VLDB’12])

• Parallelism: run multiple vertex programs simultaneously

11

“Think like a Vertex.” - Malewicz et al. [SIGMOD’10]

Page 12: Large-Scale Graph-Structured Machine Learning -

Graph-Parallel Abstractions

• Messaging (e.g., Pregel): synchronous execution
• Shared state (e.g., GraphLab): dynamic, asynchronous execution (better for Machine Learning)

12

Page 13: Large-Scale Graph-Structured Machine Learning -

The GraphLab Vertex Program

Vertex Programs directly access adjacent vertices and edges.

GraphLab_PageRank(i)
  // Compute sum over neighbors
  total = 0
  foreach (j in neighbors(i)):
    total = total + R[j] * w_ji

  // Update the PageRank
  R[i] = 0.15 + total

  // Trigger neighbors to run again
  priority = |R[i] – oldR[i]|
  if R[i] not converged then
    signal neighborsOf(i) with priority

13

[Figure: vertex 1 sums contributions such as R[4] * w41 from its neighbors 2, 3, and 4.]

Page 14: Large-Scale Graph-Structured Machine Learning -

Benefit of Dynamic PageRank

[Figure: histogram of the number of vertices (log scale) versus the number of update operations (0–70) under dynamic scheduling; most vertices need very few updates.]

51% of the vertices were updated only once!

14

Page 15: Large-Scale Graph-Structured Machine Learning -

GraphLab Asynchronous Execution

The scheduler determines the order in which vertices are executed.

[Figure: a scheduler holds a queue of signaled vertices (a, b, c, …, k) and dispatches them to CPU 1 and CPU 2.]

The scheduler can prioritize vertices.
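To make "the scheduler can prioritize vertices" concrete, here is a minimal sketch (not GraphLab's actual scheduler) of a priority queue in which signal() enqueues a vertex with a priority and the engine always runs the highest-priority vertex next; the vertex ids and priorities are illustrative.

#include <cstdio>
#include <queue>
#include <utility>

// Minimal priority-scheduler sketch: vertices are signaled with a priority
// (e.g. the residual |R[i] - oldR[i]|) and are popped highest-priority first.
class PriorityScheduler {
    std::priority_queue<std::pair<double, int>> queue_;  // max-heap of (priority, vertex id)
public:
    void signal(int vertex, double priority) { queue_.push({priority, vertex}); }
    bool empty() const { return queue_.empty(); }
    int pop() { int v = queue_.top().second; queue_.pop(); return v; }
};

int main() {
    PriorityScheduler sched;
    sched.signal(3, 0.01);
    sched.signal(7, 0.50);   // largest residual: runs first
    sched.signal(1, 0.10);
    while (!sched.empty()) {
        int v = sched.pop();
        std::printf("running vertex program on vertex %d\n", v);
        // a real engine would run the vertex program here and possibly
        // re-signal neighbors whose residual is still above the tolerance
    }
    return 0;
}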

Page 16: Large-Scale Graph-Structured Machine Learning -

Asynchronous Belief Propagation

Synthetic Noisy Image

Cumulative Vertex Updates

Many Updates … Few Updates

Algorithm identifies and focuses on hidden sequential structure

Graphical Model

Challenge = Boundaries

Page 17: Large-Scale Graph-Structured Machine Learning -

GraphLab Ensures a Serializable Execution

• Enables: Gauss-Seidel iterations, Gibbs Sampling, Graph Coloring, …

Page 18: Large-Scale Graph-Structured Machine Learning -

Never Ending Learner Project (CoEM)

• Language modeling: named entity recognition

18

Hadoop (BSP): 95 cores, 7.5 hrs
GraphLab: 16 cores, 30 min (15x faster, 6x fewer CPUs)
Distributed GraphLab: 32 EC2 machines, 80 secs (0.3% of the Hadoop time)

Page 19: Large-Scale Graph-Structured Machine Learning -

GraphLab provided a powerful new abstraction

But…

Thus far…

We couldn’t scale up to the AltaVista web graph from 2002: 1.4B vertices, 6.6B edges

Page 20: Large-Scale Graph-Structured Machine Learning -

20

Natural Graphs: graphs derived from natural phenomena

Page 21: Large-Scale Graph-Structured Machine Learning -

Properties of Natural Graphs

21

Power-Law Degree Distribution

Regular Mesh Natural Graph

Page 22: Large-Scale Graph-Structured Machine Learning -

Power-Law Degree Distribution

[Figure: log–log plot of the number of vertices versus degree for the AltaVista WebGraph (1.4B vertices, 6.6B edges).]

The top 1% of vertices (the high-degree vertices) are adjacent to 50% of the edges, while more than 10^8 vertices have only one neighbor.

22
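A small sketch of how a "top 1% of vertices touch half the edges" statistic can be checked from a degree array; the synthetic heavy-tailed degrees are an assumption for the example, and the statistic counts edge endpoints, which slightly overstates exact edge coverage.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <functional>
#include <random>
#include <vector>

// Sketch: draw synthetic power-law-ish degrees, then measure what fraction
// of all edge endpoints belongs to the top 1% highest-degree vertices.
int main() {
    const int n = 1000000;
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::vector<double> degree(n);
    for (int i = 0; i < n; ++i)
        degree[i] = std::pow(unif(rng), -1.0 / 1.2);   // heavy-tailed sample (roughly alpha = 2.2)

    std::sort(degree.begin(), degree.end(), std::greater<double>());
    double total = 0, top = 0;
    for (int i = 0; i < n; ++i) { total += degree[i]; if (i < n / 100) top += degree[i]; }
    std::printf("top 1%% of vertices cover %.1f%% of edge endpoints\n", 100.0 * top / total);
    return 0;
}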

Page 23: Large-Scale Graph-Structured Machine Learning -

Power-Law Degree Distribution

23

“Star Like” Motif

[Figure: President Obama's account surrounded by its many followers.]

Page 24: Large-Scale Graph-Structured Machine Learning -

Challenges of High-Degree Vertices

• Sequentially process edges
• Touches a large fraction of the graph (GraphLab)
• Sends many messages (Pregel)
• Edge meta-data too large for a single machine
• Asynchronous execution requires heavy locking (GraphLab)
• Synchronous execution prone to stragglers (Pregel)

24

Page 25: Large-Scale Graph-Structured Machine Learning -

Graph Partitioning

• Graph-parallel abstractions rely on partitioning:
  – Minimize communication
  – Balance computation and storage

25

Machine 1 Machine 2

Communication cost: O(# of cut edges)
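A tiny sketch making the "communication cost is O(# cut edges)" point concrete: given an assignment of vertices to machines, the cut edges are simply the edges whose endpoints live on different machines. The toy graph and the two-machine assignment are illustrative.

#include <cstdio>
#include <utility>
#include <vector>

// Count the edges cut by a vertex partitioning: every cut edge forces
// communication between the two machines that own its endpoints.
int main() {
    // machine_of[v] = machine owning vertex v (toy 2-machine partition)
    std::vector<int> machine_of = {0, 0, 1, 1, 1};
    std::vector<std::pair<int, int>> edges = {{0,1},{1,2},{2,3},{3,4},{0,4}};

    int cut = 0;
    for (auto [u, v] : edges)
        if (machine_of[u] != machine_of[v]) ++cut;

    std::printf("%d of %zu edges are cut\n", cut, edges.size());
    return 0;
}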

Page 26: Large-Scale Graph-Structured Machine Learning -

Power-Law Graphs are Difficult to Partition

• Power-Law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04]

• Traditional graph-partitioning algorithms perform poorly on Power-Law Graphs. [Abou-Rjeili et al. 06]

26

CPU 1 CPU 2

Page 27: Large-Scale Graph-Structured Machine Learning -

Machine 1 Machine 2

Random Partitioning

• GraphLab resorts to random (hashed) partitioning on natural graphs

10 machines → 90% of edges cut; 100 machines → 99% of edges cut! (With a random assignment to p machines, an edge's endpoints land on the same machine with probability only 1/p, so roughly a (1 − 1/p) fraction of the edges are cut.)

27

Page 28: Large-Scale Graph-Structured Machine Learning -

Machine 1 Machine 2

• Split high-degree vertices
• New abstraction → Equivalence on split vertices

28

Program for this (a single vertex) … Run on this (the vertex split across machines)

Page 29: Large-Scale Graph-Structured Machine Learning -

A Common Pattern for Vertex-Programs:
Gather information about the neighborhood → Update the vertex → Signal neighbors & modify edge data

GraphLab_PageRank(i)
  // Compute sum over neighbors (Gather)
  total = 0
  foreach (j in neighbors(i)):
    total = total + R[j] * w_ji

  // Update the PageRank (Apply)
  R[i] = 0.15 + total

  // Trigger neighbors to run again (Scatter)
  priority = |R[i] – oldR[i]|
  if R[i] not converged then
    signal neighbors(i) with priority

29

Page 30: Large-Scale Graph-Structured Machine Learning -

Formal GraphLab2 Semantics

• Gather(SrcV, Edge, DstV) → A
  – Collect information from neighbors
• Sum(A, A) → A
  – Commutative, associative sum
• Apply(V, A) → V
  – Update the vertex
• Scatter(SrcV, Edge, DstV) → (Edge, signal)
  – Update edges and signal neighbors

30
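A minimal sketch of the four GAS functions as plain C++ (a simplified illustration, not the real GraphLab2 API; the scatter step is reduced to a yes/no signal): a serial step gathers over a vertex's in-edges, sums the partial results, applies the update, then decides whether to signal neighbors. The toy edge list and weights are assumptions.

#include <cmath>
#include <cstdio>
#include <vector>

// Sketch of the GAS signatures above: Gather -> A, Sum -> A, Apply -> V,
// Scatter -> should neighbors be signaled? Vertex data is a single rank.
struct Edge { int src, dst; double weight; };

double gather(double src_rank, const Edge& e, double /*dst_rank*/) {
    return e.weight * src_rank;                       // contribution along one in-edge
}
double sum(double a, double b) { return a + b; }      // commutative, associative
double apply(double /*old_rank*/, double acc) { return 0.15 + acc; }
bool scatter(double new_rank, double old_rank) {      // should the neighbor be signaled?
    return std::fabs(new_rank - old_rank) > 1e-3;
}

int main() {
    std::vector<double> rank = {1.0, 1.0, 1.0};
    std::vector<Edge> in_edges_of_0 = {{1, 0, 0.5}, {2, 0, 0.5}};

    // One GAS step for vertex 0.
    double acc = 0;
    for (const Edge& e : in_edges_of_0)
        acc = sum(acc, gather(rank[e.src], e, rank[0]));
    double new_rank = apply(rank[0], acc);
    bool signal_neighbors = scatter(new_rank, rank[0]);
    rank[0] = new_rank;

    std::printf("rank[0] = %.3f, signal neighbors = %s\n",
                rank[0], signal_neighbors ? "yes" : "no");
    return 0;
}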

Page 31: Large-Scale Graph-Structured Machine Learning -

GraphLab2_PageRank(i)

Gather(j → i):  return w_ji * R[j]
Sum(a, b):      return a + b
Apply(i, Σ):    R[i] = 0.15 + Σ
Scatter(i → j): if R[i] changed then trigger j to be recomputed

PageRank in GraphLab2

31

Page 32: Large-Scale Graph-Structured Machine Learning -

GAS Decomposition

[Figure: a high-degree vertex Y is split across Machines 1–4. Gather: each machine computes a partial sum Σ1…Σ4 over its local edges, and the partials are added at Y's master. Apply: the master computes the new value Y'. Scatter: Y' is copied to the mirrors, and scatter runs on each machine's local edges.]

32

Page 33: Large-Scale Graph-Structured Machine Learning -

Minimizing Communication in PowerGraph

A vertex-cut minimizes the number of machines each vertex spans.

Percolation theory suggests that power-law graphs have good vertex cuts. [Albert et al. 2000]

Communication is linear in the number of machines each vertex spans.

33

New Theorem:For any edge-cut we can directly construct a vertex-cut which requires strictly less communication and storage.

Page 34: Large-Scale Graph-Structured Machine Learning -

Constructing Vertex-Cuts

• Evenly assign edges to machines
  – Minimize the machines spanned by each vertex
• Assign each edge as it is loaded
  – Touch each edge only once
• Propose two distributed approaches:
  – Random Vertex Cut
  – Greedy Vertex Cut

34

Page 35: Large-Scale Graph-Structured Machine Learning -

Random Vertex-Cut

• Randomly assign edges to machines

[Figure: the edges of vertices Y and Z are hashed across Machines 1–3; Y spans 3 machines and Z spans 2 machines. The result is a balanced vertex-cut, and no edge is cut.]

35
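A small sketch (an illustration under assumed inputs, not GraphLab's code) of a random vertex-cut: each edge is hashed to a machine, and the replication factor is the average number of machines a vertex spans.

#include <cstdio>
#include <functional>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Random vertex-cut sketch: hash each edge to a machine, then measure the
// replication factor = average number of machines each vertex spans.
int main() {
    const int num_machines = 4;
    std::vector<std::pair<int, int>> edges = {
        {0,1},{0,2},{0,3},{0,4},{0,5},   // vertex 0 is "high degree"
        {1,2},{2,3},{3,4},{4,5}};

    std::unordered_map<int, std::set<int>> spans;    // vertex -> machines holding a copy of it
    for (auto [u, v] : edges) {
        size_t h = std::hash<long long>()((static_cast<long long>(u) << 32) | v);
        int machine = static_cast<int>(h % num_machines);
        spans[u].insert(machine);                    // both endpoints get a copy on that machine
        spans[v].insert(machine);
    }

    double total = 0;
    for (const auto& [vertex, machines] : spans) total += machines.size();
    std::printf("replication factor = %.2f machines per vertex\n", total / spans.size());
    return 0;
}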

Page 36: Large-Scale Graph-Structured Machine Learning -

Random Vertex-Cuts vs. Edge-Cuts

• Expected improvement from vertex-cuts:

[Figure: expected reduction in communication and storage (log scale, up to 100x) versus the number of machines (0–150) when using random vertex-cuts instead of edge-cuts.]

Order of Magnitude Improvement

36

Page 37: Large-Scale Graph-Structured Machine Learning -

Streaming Greedy Vertex-Cuts

• Place edges on machines which already have the vertices in that edge.

[Figure: Machine 1 holds edge A–B and Machine 2 holds edge C–B; the new edges D–A and E–B are placed on machines that already hold one of their endpoints.]

37
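A minimal sketch of the greedy placement rule (a simplification of the streaming heuristic described above, not the exact PowerGraph rule): prefer a machine that already holds both endpoints, then one that holds either, otherwise the least-loaded machine. The vertex ids and the two-machine setup are assumptions.

#include <cstdio>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Streaming greedy vertex-cut sketch: place each edge on a machine that
// already holds its endpoints when possible, breaking ties by load.
int main() {
    const int num_machines = 2;
    // Edges arrive as a stream: A-B, C-B, D-A, E-B (A=0, B=1, C=2, D=3, E=4).
    std::vector<std::pair<int, int>> edges = {{0,1},{2,1},{3,0},{4,1}};

    std::unordered_map<int, std::set<int>> placed;   // vertex -> machines holding it
    std::vector<int> load(num_machines, 0);

    for (auto [u, v] : edges) {
        int best = -1, best_score = -1;
        for (int m = 0; m < num_machines; ++m) {
            int score = (placed[u].count(m) ? 1 : 0) + (placed[v].count(m) ? 1 : 0);
            // prefer machines already holding the endpoints, then the lighter load
            if (score > best_score || (score == best_score && load[m] < load[best])) {
                best = m; best_score = score;
            }
        }
        placed[u].insert(best); placed[v].insert(best); ++load[best];
        std::printf("edge (%d,%d) -> machine %d\n", u, v, best);
    }
    return 0;
}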

Page 38: Large-Scale Graph-Structured Machine Learning -

Greedy Vertex-Cuts Improve Performance

[Figure: runtime relative to random partitioning for PageRank, Collaborative Filtering, and Shortest Path, comparing random and greedy vertex-cuts.]

Greedy partitioning improves computation performance.

38

Page 39: Large-Scale Graph-Structured Machine Learning -

System Design

• Implemented as C++ API
• Uses HDFS for graph input and output
• Fault-tolerance is achieved by check-pointing
  – Snapshot time < 5 seconds for the Twitter network

39

EC2 HPC Nodes

MPI/TCP-IP PThreads HDFS

PowerGraph (GraphLab2) System

Page 40: Large-Scale Graph-Structured Machine Learning -

Implemented Many Algorithms

• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – SVD
  – Non-negative MF
• Statistical Inference
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling
• Graph Analytics
  – PageRank
  – Triangle Counting
  – Shortest Path
  – Graph Coloring
  – K-core Decomposition
• Computer Vision
  – Image stitching
• Language Modeling
  – LDA

40

Page 41: Large-Scale Graph-Structured Machine Learning -

PageRank on the Twitter Follower Graph

[Figure: total network traffic (GB) and runtime (seconds) of PageRank for GraphLab, Pregel (Piccolo), and PowerGraph.]

Communication and runtime on a Natural Graph with 40M Users, 1.4 Billion Links
PowerGraph reduces communication and runs faster.
32 Nodes x 8 Cores (EC2 HPC cc1.4x)

41

Page 42: Large-Scale Graph-Structured Machine Learning -

PageRank on the Twitter Follower Graph
Natural Graph with 40M Users, 1.4 Billion Links

Hadoop results from [Kang et al. '11]; Twister (in-memory MapReduce) from [Ekanayake et al. '10]

42

[Figure: runtime per iteration for Hadoop, Twister, Piccolo, GraphLab, and PowerGraph.]

Order-of-magnitude improvement by exploiting the properties of natural graphs.

Page 43: Large-Scale Graph-Structured Machine Learning -

GraphLab2 is Scalable

Yahoo AltaVista Web Graph (2002): one of the largest publicly available web graphs
1.4 Billion Webpages, 6.6 Billion Links

64 HPC Nodes (1024 Cores, 2048 HT)
7 Seconds per Iteration: 1B links processed per second
30 lines of user code

43

Page 44: Large-Scale Graph-Structured Machine Learning -

Topic Modeling

• English language Wikipedia

– 2.6M Documents, 8.3M Words, 500M Tokens

– Computationally intensive algorithm

44

[Figure: throughput in million tokens per second: Smola et al. on 100 Yahoo! machines, specifically engineered for this task, versus PowerGraph on 64 cc2.8xlarge EC2 nodes, implemented in 200 lines of code and 4 human hours.]

Page 45: Large-Scale Graph-Structured Machine Learning -

Triangle Counting

• For each vertex in the graph, count the number of triangles containing it
• Measures both the "popularity" of the vertex and the "cohesiveness" of the vertex's community:
  – More triangles → stronger community
  – Fewer triangles → weaker community
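A minimal serial sketch of the per-vertex count (common-neighbor intersection per edge, crediting all three corners); the toy graph is an assumption, and this is not the distributed algorithm from the talk.

#include <algorithm>
#include <cstdio>
#include <iterator>
#include <utility>
#include <vector>

// Triangle-counting sketch: for each edge (u,v), the triangles through that
// edge are the common neighbors of u and v; each triangle found is credited
// to u, v, and the common neighbor w.
int main() {
    std::vector<std::pair<int,int>> edges = {{0,1},{0,2},{1,2},{1,3},{2,3}};
    const int n = 4;

    std::vector<std::vector<int>> adj(n);
    for (auto [u, v] : edges) { adj[u].push_back(v); adj[v].push_back(u); }
    for (auto& nbrs : adj) std::sort(nbrs.begin(), nbrs.end());

    std::vector<int> triangles(n, 0);
    for (auto [u, v] : edges) {
        std::vector<int> common;
        std::set_intersection(adj[u].begin(), adj[u].end(),
                              adj[v].begin(), adj[v].end(),
                              std::back_inserter(common));
        for (int w : common) { ++triangles[u]; ++triangles[v]; ++triangles[w]; }
    }
    // Each triangle was discovered once per edge (3 times), so every vertex's
    // counter was incremented 3 times per triangle it belongs to.
    for (int i = 0; i < n; ++i)
        std::printf("vertex %d is in %d triangle(s)\n", i, triangles[i] / 3);
    return 0;
}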

Page 46: Large-Scale Graph-Structured Machine Learning -

Triangle Counting on the Twitter Graph
Identify individuals with strong communities.

Counted: 34.8 Billion Triangles

Hadoop [WWW'11]: 1536 machines, 423 minutes
PowerGraph: 64 machines, 1.5 minutes (282x faster)

Why? The wrong abstraction broadcasts O(degree^2) messages per vertex.

S. Suri and S. Vassilvitskii, "Counting triangles and the curse of the last reducer," WWW'11

46

Page 47: Large-Scale Graph-Structured Machine Learning -

EC2 HPC Nodes

MPI/TCP-IP PThreads HDFS

GraphLab2 System

Machine Learning and Data-Mining Toolkits: Graph Analytics, Graphical Models, Computer Vision, Clustering, Topic Modeling, Collaborative Filtering

Apache 2 License

http://graphlab.org

Page 48: Large-Scale Graph-Structured Machine Learning -

GraphChi: Going small with GraphLab

Solve huge problems on small or embedded devices?

Key: Exploit non-volatile memory (starting with SSDs and HDs)

Page 49: Large-Scale Graph-Structured Machine Learning -

GraphChi – disk-based GraphLab

Novel Parallel Sliding Windows algorithm

• Single machine
  – Parallel, asynchronous execution
• Solves big problems
  – That are normally solved in the cloud
• Efficiently exploits disks
  – Optimized for stream access
  – Efficient on both SSDs and hard drives

Page 50: Large-Scale Graph-Structured Machine Learning -

Triangle Counting in the Twitter Graph
40M Users, 1.2B Edges; Total: 34.8 Billion Triangles

Hadoop [Suri & Vassilvitskii '11]: 1536 machines, 423 minutes
PowerGraph: 64 machines, 1024 cores, 1.5 minutes
GraphChi: 59 minutes on 1 Mac Mini!

Page 51: Large-Scale Graph-Structured Machine Learning -

Apache 2 License

http://graphlab.org
Documentation… Code… Tutorials… (more on the way)

Page 52: Large-Scale Graph-Structured Machine Learning -

Active Work

• Cross-language support (Python/Java)
• Support for incremental graph computation
• Integration with graph databases
• Declarative representations of the GAS decomposition:
  – my.pr := nbrs.in.map(x => x.pr).reduce( (a,b) => a + b )

52

Joseph E. Gonzalez
Postdoc, UC Berkeley
[email protected]@cs.cmu.edu
http://graphlab.org

Page 53: Large-Scale Graph-Structured Machine Learning -

Why not use Map-Reduce for Graph-Parallel algorithms?

Page 54: Large-Scale Graph-Structured Machine Learning -

Data Dependencies are Difficult

• Difficult to express dependent data in MapReduce
  – Substantial data transformations
  – User-managed graph structure
  – Costly data replication

[Figure: MapReduce operates on independent data records.]

Page 55: Large-Scale Graph-Structured Machine Learning -

Iterative Computation is Difficult

• The system is not optimized for iteration:

[Figure: each iteration of a MapReduce job reloads the data, shuffles it among the CPUs, and writes it back out, so every iteration pays a startup penalty and a disk penalty.]

Page 56: Large-Scale Graph-Structured Machine Learning -

The Pregel Abstraction

Vertex-Programs interact by sending messages.

Pregel_PageRank(i, messages):
  // Receive all the messages
  total = 0
  foreach (msg in messages):
    total = total + msg

  // Update the rank of this vertex
  R[i] = total

  // Send new messages to neighbors
  foreach (j in out_neighbors[i]):
    Send msg(R[i]) to vertex j

Malewicz et al. [PODC’09, SIGMOD’10]

56

Page 57: Large-Scale Graph-Structured Machine Learning -

Pregel Synchronous Execution

[Figure: execution alternates between a Compute phase and a Communicate phase, separated by a global Barrier.]

Page 58: Large-Scale Graph-Structured Machine Learning -

Communication Overhead for High-Degree Vertices

Fan-In vs. Fan-Out

58

Page 59: Large-Scale Graph-Structured Machine Learning -

Pregel Message Combiners on Fan-In

[Figure: vertices A, B, and C on Machine 1 all send messages to D on Machine 2; a combiner sums them on Machine 1 so only one message crosses the network.]

• User-defined commutative, associative (+) message operation

59
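A minimal sketch of a fan-in combiner (an illustration, not Pregel's API): messages destined for the same remote vertex are merged with the user-defined commutative, associative sum before leaving the machine, so each destination receives one message per source machine. The vertex ids and message values are assumptions.

#include <cstdio>
#include <unordered_map>
#include <vector>

// Combiner sketch: outgoing messages are accumulated per destination vertex
// with a user-defined (+), so only one combined message per destination
// crosses the network.
struct Message { int dst; double value; };

int main() {
    // Messages produced on Machine 1, mostly destined for vertex D (= id 3).
    std::vector<Message> outgoing = {{3, 0.2}, {3, 0.5}, {3, 0.1}, {7, 1.0}};

    std::unordered_map<int, double> combined;        // dst -> summed message
    for (const Message& m : outgoing)
        combined[m.dst] += m.value;                  // user-defined sum (+)

    for (const auto& [dst, value] : combined)
        std::printf("send one message to vertex %d with value %.2f\n", dst, value);
    return 0;
}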

Page 60: Large-Scale Graph-Structured Machine Learning -

Pregel Struggles with Fan-Out

[Figure: vertex A on Machine 1 has out-edges to B, C, and D on Machine 2; Pregel sends a separate copy of the same message across the network for each of them.]

• Broadcast sends many copies of the same message to the same machine!

60

Page 61: Large-Scale Graph-Structured Machine Learning -

Fan-In and Fan-Out Performance

• PageRank on synthetic Power-Law Graphs
  – Piccolo was used to simulate Pregel with combiners

[Figure: total communication (GB) versus the power-law constant α (1.8–2.2); smaller α means more high-degree vertices.]

61

Page 62: Large-Scale Graph-Structured Machine Learning -

GraphLab Ghosting

• Changes to master are synced to ghosts

[Figure: Machine 1 holds vertices A, B, C and a ghost of D; Machine 2 holds D and ghosts of A, B, C.]

62

Page 63: Large-Scale Graph-Structured Machine Learning -

GraphLab Ghosting

• Changes to the neighbors of high-degree vertices create substantial network traffic

[Figure: the same ghosting layout; synchronizing the ghosts of a high-degree vertex's many neighbors requires heavy communication between Machine 1 and Machine 2.]

63

Page 64: Large-Scale Graph-Structured Machine Learning -

Fan-In and Fan-Out Performance

• PageRank on synthetic Power-Law Graphs
• GraphLab is undirected

[Figure: total communication (GB) versus the power-law constant α (1.8–2.2); smaller α means more high-degree vertices.]

64

Page 65: Large-Scale Graph-Structured Machine Learning -

Comparison with GraphLab & Pregel

• PageRank on Synthetic Power-Law Graphs:

[Figure: total network traffic (GB) and runtime (seconds) versus the power-law constant α for Pregel (Piccolo) and GraphLab; more high-degree vertices toward smaller α.]

65

GraphLab2 is robust to high-degree vertices.

Page 66: Large-Scale Graph-Structured Machine Learning -

GraphLab on Spark

66

#include <graphlab.hpp>

struct vertex_data : public graphlab::IS_POD_TYPE {
  float rank;
  vertex_data() : rank(1) { }
};

typedef graphlab::empty edge_data;
typedef graphlab::distributed_graph<vertex_data, edge_data> graph_type;

class pagerank :
  public graphlab::ivertex_program<graph_type, float>,
  public graphlab::IS_POD_TYPE {
  float last_change;
public:
  float gather(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    return edge.source().data().rank / edge.source().num_out_edges();
  }

  void apply(icontext_type& context, vertex_type& vertex,
             const gather_type& total) {
    const double newval = 0.85 * total + 0.15;
    last_change = std::fabs(newval - vertex.data().rank);
    vertex.data().rank = newval;
  }

  void scatter(icontext_type& context, const vertex_type& vertex,
               edge_type& edge) const {
    if (last_change > TOLERANCE) context.signal(edge.target());
  }
};

struct pagerank_writer {
  std::string save_vertex(graph_type::vertex_type v) {
    std::stringstream strm;
    strm << v.id() << "\t" << v.data().rank << "\n";
    return strm.str();
  }
  std::string save_edge(graph_type::edge_type e) { return ""; }
};

int main(int argc, char** argv) {
  graphlab::mpi_tools::init(argc, argv);
  graphlab::distributed_control dc;

  graphlab::command_line_options clopts("PageRank algorithm.");
  graph_type graph(dc, clopts);
  graph.load_format("biggraph.tsv", "tsv");

  graphlab::omni_engine<pagerank> engine(dc, graph, clopts);
  engine.signal_all();
  engine.start();

  graph.save(saveprefix, pagerank_writer(), false, true, false);

  graphlab::mpi_tools::finalize();
  return EXIT_SUCCESS;
}

import spark.graphlab._

val sc = spark.SparkContext(master, "pagerank")

val graph = Graph.textFile("bigGraph.tsv")
val vertices = graph.outDegree().mapValues((_, 1.0, 1.0))

val pr = Graph(vertices, graph.edges).iterate(
  (meId, e) => e.source.data._2 / e.source.data._1,
  (a: Double, b: Double) => a + b,
  (v, accum) => (v.data._1, 0.15 + 0.85 * accum, v.data._2),
  (meId, e) => abs(e.source.data._2 - e.source.data._1) > 0.01)

pr.vertices.saveAsTextFile("results")

Interactive!