Graph-Based Parallel Computing (wcohen/10-605/pregel-etc-2.pdf), william...
TRANSCRIPT
Announcements
• Next Tuesday 12/8:
  – Presentations for 10-805 projects.
  – 15 minutes per project.
  – Final written reports due Tues 12/15.
• For the exam:
  – Spectral clustering will not be covered.
  – It's ok to bring in two pages of notes.
  – We'll give out a solution sheet for HW7 on Wednesday at noon, but you get no credit on HW7 questions 5-7 if you turn in answers after that point.
Outline
• Motivation/where it fits
• Sample systems (c. 2010)
  – Pregel: and some sample programs
    • Bulk synchronous processing
  – Signal/Collect and GraphLab
    • Asynchronous processing
• GraphLab descendants
  – PowerGraph: partitioning
  – GraphChi: graphs w/o parallelism
  – GraphX: graphs over Spark
Many Graph-Parallel Algorithms
• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization
• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling
• Semi-supervised ML
  – Graph SSL
  – CoEM
• Community Detection
  – Triangle-Counting
  – K-core Decomposition
  – K-Truss
• Graph Analytics
  – PageRank
  – Personalized PageRank
  – Shortest Path
  – Graph Coloring
• Classification
  – Neural Networks
Signal/collect model
• Signals are made available in a list and a map.
• The next state for a vertex is the output of the collect() operation.
• (We'll relax "num_iterations" soon.)
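The model above can be sketched in a few lines. The class and driver below use made-up names (`Vertex`, `run`), not the real Signal/Collect API, and keep the synchronous `num_iterations` loop that the slide says will be relaxed:

```python
# Minimal sketch of the signal/collect model (hypothetical names,
# not the actual Signal/Collect API). Each vertex emits a "signal"
# along its out-edges; a vertex's next state is the output of
# collect() over the signals it received.

class Vertex:
    def __init__(self, vid, state):
        self.id, self.state = vid, state

    def signal(self):
        # value sent along every out-edge
        return self.state

    def collect(self, signals):
        # next state computed from incoming signals
        return sum(signals)

def run(vertices, edges, num_iterations):
    # edges: list of (src_id, dst_id); synchronous (BSP-style) loop
    for _ in range(num_iterations):
        inbox = {vid: [] for vid in vertices}
        for src, dst in edges:
            inbox[dst].append(vertices[src].signal())
        for v in vertices.values():
            v.state = v.collect(inbox[v.id])

vs = {i: Vertex(i, 1.0) for i in range(3)}
run(vs, [(0, 1), (1, 2), (0, 2)], num_iterations=1)
```

With `collect` summing its inputs this is one step of an (unnormalized) PageRank-style update; swapping in other `signal`/`collect` bodies gives the other vertex programs in this lecture.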
CoEM (Rosie Jones, 2005)
[Figure: speedup vs. number of CPUs (0-16) for small and large problem sizes, with the optimal (linear) speedup line shown; higher is better.]
• Hadoop: 95 cores, 7.5 hrs
• GraphLab: 16 cores, 30 min
• 15x faster! 6x fewer CPUs!
Outline
• Motivation/where it fits
• Sample systems (c. 2010)
  – Pregel: and some sample programs
    • Bulk synchronous processing
  – Signal/Collect and GraphLab
    • Asynchronous processing
• GraphLab descendants
  – PowerGraph: partitioning
  – GraphChi: graphs w/o parallelism
  – GraphX: graphs over Spark
GraphLab's descendants
• PowerGraph
• GraphChi
• GraphX
On a multicore architecture: shared memory for workers.
On a cluster architecture (like Pregel): different memory spaces.
What are the challenges in moving away from shared memory?
Natural Graphs → Power Law
[Figure: log-log plot of vertex count vs. degree for the AltaVista web graph (1.4B vertices, 6.7B edges), showing a power-law degree distribution.]
The top 1% of vertices is adjacent to 53% of the edges!
(GraphLab group/Aapo)
Problem: High-Degree Vertices Limit Parallelism
• Touches a large fraction of the graph (GraphLab 1)
• Produces many messages (Pregel, Signal/Collect)
• Edge information too large for a single machine
• Asynchronous consistency requires heavy locking (GraphLab 1)
• Synchronous consistency is prone to stragglers (Pregel)
(GraphLab group/Aapo)
PowerGraph
• Problem: GraphLab's localities can be large
  – "all neighbors of a node" can be large for hubs, high-indegree nodes
• Approach:
  – new graph partitioning algorithm
    • can replicate data
  – gather-apply-scatter API: finer-grained parallelism
    • gather ~ combiner
    • apply ~ vertex UDF (for all replicas)
    • scatter ~ messages from vertex to edges
PageRank in PowerGraph
PageRank program (for vertex i):
  Gather(j → i): return w_ji * R[j]
  sum(a, b): return a + b
  Apply(i, Σ): R[i] = β + (1 – β) * Σ
  Scatter(i → j): if (R[i] changes) then activate(j)
This implements the update

  R[i] = β + (1 – β) * Σ_{(j,i)∈E} w_ji * R[j]

• gather/sum is like a collect or a group-by ... reduce (with a combiner)
• scatter is like a signal
(GraphLab group/Aapo)
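A minimal sketch of this gather-apply-scatter program, using the slide's function names but a made-up single-machine driver and toy graph representation:

```python
# Sketch of the gather-apply-scatter decomposition of
# R[i] = beta + (1 - beta) * sum_{(j,i) in E} w[j,i] * R[j].
# Function names mirror the slide; the driver, the dict-based graph
# layout, and the beta value are illustrative, not PowerGraph's API.
beta = 0.15

def gather(R, w, j, i):
    # Gather(j -> i): contribution of in-neighbor j to vertex i
    return w[(j, i)] * R[j]

def gas_sum(a, b):
    # sum(a, b): commutative/associative combiner
    return a + b

def pagerank_step(R, in_nbrs, out_nbrs, w):
    new_R = dict(R)
    active = set()
    for i in R:
        # gather + sum phase over in-edges (j, i)
        sigma = 0.0
        for j in in_nbrs[i]:
            sigma = gas_sum(sigma, gather(R, w, j, i))
        # Apply(i, sigma)
        new_R[i] = beta + (1 - beta) * sigma
        # Scatter(i -> j): if R[i] changes, activate out-neighbors
        if new_R[i] != R[i]:
            active.update(out_nbrs[i])
    return new_R, active
```

On a real PowerGraph cluster the gather phase runs in parallel on each replica of a high-degree vertex and only the partial sums are communicated, which is exactly why the combiner must be commutative and associative.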
Distributed Execution of a PowerGraph Vertex-Program
[Figure: a vertex Y replicated across Machines 1-4. In the gather phase each machine computes a partial sum Σ1..Σ4; the partial sums are combined into Σ; apply produces the new value Y'; and Y' is shipped back to the replicas for the scatter phase.]
(GraphLab group/Aapo)
Minimizing Communication in PowerGraph
• A vertex-cut minimizes the number of machines each vertex spans.
• Percolation theory suggests that power-law graphs have good vertex cuts. [Albert et al., 2000]
• Communication is linear in the number of machines each vertex spans.
(GraphLab group/Aapo)
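The cost measure on this slide, the average number of machines each vertex spans (the replication factor), is easy to compute for any edge-to-machine assignment. The hash-based assignment in this sketch is just an illustration, not PowerGraph's random/oblivious/greedy partitioners:

```python
# Replication factor of a vertex-cut: the average number of machines
# each vertex spans, given an assignment of *edges* to machines.
# The modular-hash assignment used below is only an illustration.

def replication_factor(edges, num_machines, assign):
    spans = {}  # vertex -> set of machines it is replicated on
    for e in edges:
        m = assign(e, num_machines)
        for v in e:
            spans.setdefault(v, set()).add(m)
    return sum(len(s) for s in spans.values()) / len(spans)

edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
# put edge (u, v) on machine (u + v) mod num_machines
rf = replication_factor(edges, 2, lambda e, n: (e[0] + e[1]) % n)
```

Since each replicated vertex must receive the applied value and send its partial sums, communication per superstep is proportional to this number, which is what the partitioning experiments on the next slide measure.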
Partitioning Performance
Twitter graph: 41M vertices, 1.4B edges
[Figure: two plots over 8-64 machines comparing Random, Oblivious, and Greedy edge placement. Left: cost, as the average number of machines spanned per vertex (roughly 2-18; lower is better). Right: partition construction time in seconds (0-1000).]
Oblivious balances partition quality and partitioning time.
(GraphLab group/Aapo)
Partitioning matters...
[Figure: reduction in runtime (0-1) for PageRank, Collaborative Filtering, and Shortest Path under Random, Oblivious, and Greedy partitioning.]
(GraphLab group/Aapo)
Outline
• Motivation/where it fits
• Sample systems (c. 2010)
  – Pregel: and some sample programs
    • Bulk synchronous processing
  – Signal/Collect and GraphLab
    • Asynchronous processing
• GraphLab descendants
  – PowerGraph: partitioning
  – GraphChi: graphs w/o parallelism
  – GraphX: graphs over Spark
GraphLab con't
• PowerGraph
• GraphChi
  – Goal: use the graph abstraction on-disk, not in-memory, on a conventional workstation
[Figure: the GraphLab stack. Applications (graph analytics, graphical models, computer vision, clustering, topic modeling, collaborative filtering) sit on a general-purpose API, which runs on MPI/TCP-IP, PThreads, and Hadoop/HDFS over Linux cluster services (Amazon AWS).]
GraphLab con't
• GraphChi
  – Key insight:
    • some algorithms on graphs are streamable (e.g., PageRank-Nibble)
    • in general we can't easily stream the graph, because neighbors will be scattered
    • but maybe we can limit the degree to which they're scattered ... enough to make streaming possible?
  – "almost-streaming": keep P cursors in a file instead of one
PSW: Shards and Intervals
• Vertices are numbered from 1 to n
  – P intervals, each associated with a shard on disk
  – sub-graph = interval of vertices
[Figure: the vertex range 1..n split at cut points v1, v2, ... into interval(1), interval(2), ..., interval(P), with shard(1), shard(2), ..., shard(P) on disk.]
PSW: Layout (1. Load / 2. Compute / 3. Write)
• Shard: the in-edges for an interval of vertices, sorted by source-id.
• Shards are small enough to fit in memory; balance the sizes of the shards.
[Figure: four intervals (vertices 1..100, 101..700, 701..1000, 1001..10000) mapped to Shard 1..Shard 4; Shard 1 holds the in-edges for vertices 1..100, sorted by source_id.]
PSW: Loading Sub-graph (1. Load / 2. Compute / 3. Write)
• Load the subgraph for vertices 1..100: load all of their in-edges (Shard 1) into memory.
• What about out-edges? They are arranged in sequence in the other shards.
[Figure: Shard 1 (in-edges for vertices 1..100, sorted by source_id) loaded whole; Shards 2-4 each contain a contiguous block of edges whose sources are in 1..100.]
PSW: Loading Sub-graph (1. Load / 2. Compute / 3. Write)
• Load the subgraph for vertices 101..700: load all of their in-edges (Shard 2) into memory.
• The corresponding out-edge blocks from the other shards are also brought into memory.
[Figure: Shard 2 loaded whole; sliding windows of out-edge blocks from Shards 1, 3, and 4 in memory.]
PSW Load-Phase
• Only P large reads for each interval.
• P² reads in one full pass.
PSW: Execute Updates (1. Load / 2. Compute / 3. Write)
• The update-function is executed on the interval's vertices.
• Edges have pointers to the loaded data blocks.
  – Changes take effect immediately → asynchronous.
[Figure: vertices holding pointers (&Data) into in-memory edge-data blocks X and Y.]
PSW: Commit to Disk (1. Load / 2. Compute / 3. Write)
• In the write phase, the blocks are written back to disk.
  – The next load-phase sees the preceding writes → asynchronous.
[Figure: the same &Data blocks X and Y written back to disk.]
In total: P² reads and writes per full pass over the graph → performs well on both SSD and hard drive.
To make this work: the size of a vertex's state can't change when it's updated (at least, as stored on disk).
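The load phase above can be sketched as follows. The in-memory lists stand in for GraphChi's on-disk shard files, and `build_shards`/`load_subgraph` are illustrative names; the point is that sorting each shard by source id makes every interval's out-edges a contiguous block in every other shard:

```python
# Sketch of one PSW load phase. Each shard holds the in-edges of one
# vertex interval, sorted by source id; because of that sort order,
# the out-edges of any interval form a contiguous block (a "sliding
# window") inside every other shard, so each interval needs only P
# sequential reads. Lists stand in for GraphChi's on-disk files.
import bisect

def build_shards(edges, intervals):
    # intervals: list of (lo, hi) vertex ranges, one per shard
    shards = []
    for lo, hi in intervals:
        in_edges = sorted((s, d) for s, d in edges if lo <= d <= hi)
        shards.append(in_edges)  # sorted by source id
    return shards

def load_subgraph(shards, intervals, p):
    lo, hi = intervals[p]
    sub = list(shards[p])  # 1 big read: all in-edges of interval p
    for q, shard in enumerate(shards):
        if q == p:
            continue
        # 1 sequential read per other shard: the contiguous block of
        # edges whose source lies in [lo, hi] (interval p's out-edges)
        a = bisect.bisect_left(shard, (lo, -1))
        b = bisect.bisect_right(shard, (hi, float("inf")))
        sub.extend(shard[a:b])
    return sub  # P reads per interval, P^2 reads per full pass

edges = [(1, 2), (1, 3), (2, 4), (3, 1), (4, 2)]
intervals = [(1, 2), (3, 4)]
shards = build_shards(edges, intervals)
sub = load_subgraph(shards, intervals, 0)
```

After the compute phase, writing each loaded block back to its place in the shard files is the mirror image of this loop, giving the P² reads and writes per pass quoted above.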
Experiment Setting
• Mac Mini (Apple Inc.)
  – 8 GB RAM
  – 256 GB SSD, 1 TB hard drive
  – Intel Core i5, 2.5 GHz
• Experiment graphs:

Graph        | Vertices | Edges | P (shards) | Preprocessing
live-journal | 4.8M     | 69M   | 3          | 0.5 min
netflix      | 0.5M     | 99M   | 20         | 1 min
twitter-2010 | 42M      | 1.5B  | 20         | 2 min
uk-2007-05   | 106M     | 3.7B  | 40         | 31 min
uk-union     | 133M     | 5.4B  | 50         | 33 min
yahoo-web    | 1.4B     | 6.6B  | 50         | 37 min
Comparison to Existing Systems
Notes: comparison results do not include the time to transfer the data to the cluster, preprocessing, or the time to load the graph from disk. GraphChi computes asynchronously, while all but GraphLab compute synchronously.
[Figure: four runtime bar charts, each pitting GraphChi on a Mac Mini against a cluster system:
• PageRank on twitter-2010 (1.5B edges): Spark (50 machines) vs. GraphChi (Mac Mini), minutes scale.
• WebGraph Belief Propagation (Kang et al.) on yahoo-web (6.7B edges): Pegasus/Hadoop (100 machines) vs. GraphChi (Mac Mini).
• Matrix Factorization (Alt. Least Sqr.) on Netflix (99M edges): GraphLab v1 (8 cores) vs. GraphChi (Mac Mini).
• Triangle Counting on twitter-2010 (1.5B edges): Hadoop (1636 machines) vs. GraphChi (Mac Mini).
See the paper for more comparisons.]
On a Mac Mini:
✓ GraphChi can solve problems as big as existing large-scale systems can.
✓ Comparable performance.
Outline
• Motivation/where it fits
• Sample systems (c. 2010)
  – Pregel: and some sample programs
    • Bulk synchronous processing
  – Signal/Collect and GraphLab
    • Asynchronous processing
• GraphLab "descendants"
  – PowerGraph: partitioning
  – GraphChi: graphs w/o parallelism
  – GraphX: graphs over Spark (Gonzalez)
GraphLab's descendants
• PowerGraph
• GraphChi
• GraphX
  – implementation of GraphLab's API on top of Spark
  – Motivations:
    • avoid transfers between subsystems
    • leverage a larger community for common infrastructure
  – What's different:
    • Graphs are now immutable, and operations transform one graph into another (RDD → RDG, resilient distributed graph)
The GraphX Stack (Lines of Code)
• Spark
• GraphX (3575)
• Pregel (28) + GraphLab (50)
• Algorithms on top: PageRank (5), Connected Comp. (10), Shortest Path (10), ALS (40), LDA (120), K-core (51), Triangle Count (45), SVD (40)
Idea: Graph as Tables
A property graph becomes two tables:

Vertex Property Table
Id       | Property (V)
rxin     | (Stu., Berk.)
jegonzal | (PstDoc, Berk.)
franklin | (Prof., Berk.)
istoica  | (Prof., Berk.)

Edge Property Table
SrcId    | DstId    | Property (E)
rxin     | jegonzal | Friend
franklin | rxin     | Advisor
istoica  | franklin | Coworker
franklin | jegonzal | PI

Under the hood things can be split even more finely: e.g., a vertex map table + a vertex data table. Operators maximize structure sharing and minimize communication. (Not shown: partition ids, carefully assigned....)
Like signal/collect ("vertex state" and "signal"):
• Join the vertex and edge tables
• Do a map with mapFunc on the edges
• Reduce by destination vertex using reduceFunc
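The three steps can be sketched over plain Python tables; `map_reduce_triplets` is an illustrative name in the spirit of GraphX's triplets operator, not its actual API:

```python
# Signal/collect over tables, in the spirit of GraphX's triplets
# operator (illustrative names, not the real GraphX API).

def map_reduce_triplets(vertex_table, edge_table, map_func, reduce_func):
    # 1. join the vertex table with the edge table
    triplets = [(src, dst, vertex_table[src], attr)
                for (src, dst, attr) in edge_table]
    # 2. map over the edges ("signal")
    msgs = [(dst, map_func(src_val, attr))
            for (src, dst, src_val, attr) in triplets]
    # 3. reduce by destination vertex ("collect")
    out = {}
    for dst, m in msgs:
        out[dst] = reduce_func(out[dst], m) if dst in out else m
    return out

# toy weighted sum: each vertex signals its value along its out-edges
verts = {"rxin": 1.0, "jegonzal": 2.0, "franklin": 3.0}
edges = [("rxin", "jegonzal", 1.0), ("franklin", "jegonzal", 0.5)]
sums = map_reduce_triplets(verts, edges,
                           lambda v, w: v * w, lambda a, b: a + b)
```

Expressed this way, the whole vertex-program step is a relational join, a map, and a reduce-by-key, which is exactly what lets GraphX reuse Spark's execution machinery.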
Distributed Execution of a PowerGraph Vertex-Program
[Figure, repeated from earlier: gather partial sums Σ1..Σ4 across Machines 1-4, combine into Σ, apply to get Y', scatter Y' to the replicas.]
(GraphLab group/Aapo)
Performance Comparisons
Live-Journal: 69 Million Edges
Runtime (in seconds, PageRank for 10 iterations):
• GraphLab: 22
• GraphX: 68
• Giraph: 207
• Naïve Spark: 354
• Mahout/Hadoop: 1340
GraphX is roughly 3x slower than GraphLab, but: integrated with Spark, open-source, resilient.
Summary
• Large immutable data structures on (distributed) disk, processing by sweeping through them and creating new data structures:
  – stream-and-sort, Hadoop, PIG, Hive, ...
• Large immutable data structures in distributed memory:
  – Spark: distributed tables
• Large mutable data structures in distributed memory:
  – parameter server: structure is a hash table
  – Pregel, GraphLab, GraphChi, GraphX: structure is a graph
Summary
• APIs for the various systems vary in detail but have a similar flavor
  – Typical algorithms iteratively update vertex state
  – Changes in state are communicated with messages, which need to be aggregated from neighbors
• Biggest wins are
  – on problems where the graph is fixed in each iteration, but vertex data changes
  – on graphs small enough to fit in (distributed) memory
Some things to take away
• Platforms for iterative operations on graphs
  – GraphX: if you want to integrate with Spark
  – GraphChi: if you don't have a cluster
  – GraphLab/Dato: if you don't need free software and performance is crucial
  – Pregel: if you work at Google
  – Giraph, Signal/Collect, ... ??
• Important differences
  – Intended architecture: shared memory and threads, distributed cluster memory, or graph on disk
  – How graphs are partitioned for clusters
  – Whether processing is synchronous or asynchronous