TRANSCRIPT
Extreme Scale Breadth-First Search on Supercomputers
Koji Ueno (Tokyo Institute of Technology / RIKEN), Toyotaro Suzumura (IBM T.J. Watson Research Center), Naoya Maruyama (RIKEN), Katsuki Fujisawa (Kyushu University), Satoshi Matsuoka (Tokyo Institute of Technology / AIST)
Large-Scale Graph Mining is Everywhere
} Symbolic networks: the human brain (100 billion neurons)
} Protein interactions [genomebiology.com]
} Social networks [Moody '01] (Facebook: 1 billion users)
} Cyber security (15 billion log entries / day for a large enterprise)
} WWW [lumeta.com] (1 trillion unique URLs)
[Figure: example domains — cyber security, medical informatics, data enrichment, social networks, symbolic networks]
Breadth-First Search on Large Distributed-Memory Machines
} Breadth-first search (BFS):
  } The most fundamental graph algorithm.
  } A kernel of the Graph500 benchmark.
} Large-scale supercomputers:
  } Consist of thousands of distributed-memory nodes.
  } How to compute graph algorithms efficiently on these machines is a challenging problem.
} K computer: 83,000 nodes; TSUBAME 2.5: 1,400 nodes
Graph500 Benchmark [http://www.graph500.org/]
} One of our major targets is the Graph500 benchmark.
} A benchmark for big-data (data-intensive) applications.
} BFS is the main kernel for the ranking.
} The K computer is #1 using our result.
[Figure: latest Graph500 ranking]
Breadth-First Search (BFS)
} Input: a graph and a root vertex. Output: the BFS tree.
[Figure: BFS starting from the root, expanding level 1, level 2, level 3]
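As a concrete reference for this input/output contract, here is a minimal level-synchronous BFS sketch that returns the parent array (the BFS tree); the function name and adjacency-list representation are illustrative, not the paper's actual data layout.

```python
from collections import deque

def bfs_tree(adj, root):
    """Level-synchronous BFS returning the BFS tree as a parent array.

    adj: adjacency list, adj[v] = neighbors of vertex v.
    Unreached vertices keep parent -1; the root is its own parent.
    """
    n = len(adj)
    parent = [-1] * n
    parent[root] = root
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if parent[w] == -1:  # first visit: record the tree edge
                parent[w] = v
                frontier.append(w)
    return parent
```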
Direction Optimization [Beamer '11-'12]
} Direction optimization is a fast BFS algorithm that switches direction (top-down or bottom-up) at each search level.
} Direction optimization is effective for small-diameter graphs.
  } Scale-free networks and small-world networks are small-diameter graphs.
  } The target graph of the Graph500 benchmark is also a small-diameter graph, so direction optimization is effective there.
[Figure: Top-down — the frontier at level k expands to its neighbors at level k+1. Bottom-up — candidate neighbors at level k+1 probe the frontier at level k.]
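The per-level switch can be sketched as follows. The cost model here (comparing edges out of the frontier against edges out of unvisited vertices, scaled by an illustrative `alpha`) is a simplified stand-in for Beamer's actual alpha/beta heuristic, not the exact rule.

```python
def bfs_direction_optimizing(adj, root, alpha=2.0):
    """Direction-optimizing BFS sketch (simplified switch heuristic)."""
    n = len(adj)
    parent = [-1] * n
    parent[root] = root
    frontier = {root}
    while frontier:
        # Rough per-level costs: edges leaving the frontier vs. edges
        # that the unvisited vertices would scan in bottom-up mode.
        frontier_edges = sum(len(adj[v]) for v in frontier)
        unvisited = [v for v in range(n) if parent[v] == -1]
        unvisited_edges = sum(len(adj[v]) for v in unvisited)
        next_frontier = set()
        if frontier_edges * alpha < unvisited_edges:
            # Top-down: push from the frontier to unvisited neighbors.
            for v in frontier:
                for w in adj[v]:
                    if parent[w] == -1:
                        parent[w] = v
                        next_frontier.add(w)
        else:
            # Bottom-up: each unvisited vertex probes its neighbors
            # for a parent in the frontier and stops at the first hit.
            for v in unvisited:
                for w in adj[v]:
                    if w in frontier:
                        parent[v] = w
                        next_frontier.add(v)
                        break
        frontier = next_frontier
    return parent
```

The bottom-up direction wins on the middle levels of small-diameter graphs, where the frontier covers most edges and top-down would scan nearly the whole edge set.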
2D Partitioning BFS
} Partition the adjacency matrix of the graph two-dimensionally.
} Each partitioned region is assigned to one node.
} Nodes are virtually arranged on a 2D mesh.
} Advantages of 2D partitioning over 1D partitioning:
  } The partitioned matrix region is nearly square, so the rows and columns of the region are small enough that the related data can be held locally. In 1D partitioning, by contrast, we cannot hold all the data related to the rows and columns of a partitioned region locally: the data for the rows or columns is distributed among nodes, which requires additional communication.
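A minimal sketch of how one edge of the adjacency matrix maps to a node under 2D partitioning, assuming a simple block layout on a `pr` x `pc` mesh; the actual layout used on the K computer may differ.

```python
def owner_2d(src, dst, n, pr, pc):
    """Node (mesh row, mesh column) owning edge (src, dst) of an n x n
    adjacency matrix under block-style 2D partitioning (a sketch)."""
    rows_per = (n + pr - 1) // pr  # ceil(n / pr) matrix rows per mesh row
    cols_per = (n + pc - 1) // pc  # ceil(n / pc) matrix cols per mesh col
    return (src // rows_per, dst // cols_per)
```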
Related Work
} Distributed BFS with top-down only:
  } 2D partitioning BFS on BlueGene/L [Yoo '05]: proposed distributed-memory BFS on large distributed-memory machines.
  } Comparison of 1D partitioning and 2D partitioning [Buluc '11].
  } Distributed-memory BFS on commodity machines (Intel CPUs and an InfiniBand network) [Satish '12].
} Distributed BFS with direction optimization:
  } 2D partitioning BFS with direction optimization [Beamer '13]: our proposed BFS is based on their BFS.
  } 1D partitioning BFS with direction optimization and load balancing [Checconi '14]: very scalable; they achieved 23,751 GTEPS on 98,304 nodes of BlueGene/Q, and proposed a novel sparse-matrix representation, "coarse index + skip list". However, our bitmap-based sparse-matrix representation is more efficient.
Problem of Graph Data Structure
} When we partition a graph for a large supercomputer, each partitioned matrix is a hyper-sparse matrix.
} How do we represent this hyper-sparse matrix?
[Figure: partitioning a graph into 65,536 (256 x 256) partitions yields hyper-sparse matrices]
Existing Approaches for Sparse Matrices
} Traditional approach:
  } Compressed Sparse Row (CSR).
  } CSR is NOT memory-efficient for hyper-sparse matrices.
} Example (partitioned graph adjacency matrix: 8 vertices, 4 edges):

  Edge list:
    SRC: 0 0 6 7
    DST: 4 5 3 1

  CSR (memory is wasted on the many empty rows):
    Row Offset: 0 2 2 2 2 2 2 3 4
    DST: 4 5 3 1

} For hyper-sparse matrices:
  } DCSR (DCSC)
  } Coarse index + skip list
  } These approaches are NOT compute-efficient, as we demonstrate in the performance evaluation.
Bitmap-based Sparse Matrix Representation
} Example (partitioned graph adjacency matrix: 8 vertices, 4 edges; in this example, one word is 4 bits):

  Edge list:
    SRC: 0 0 6 7
    DST: 4 5 3 1

  Bitmap-based sparse matrix (the bitmap only consumes 8 bits):
    Offset: 0 1 3
    Bitmap: 1 0 0 0 0 0 1 1
    Row Offset: 0 2 3 4
    DST: 4 5 3 1

} Structure:
  } Row Offset: skips vertices that have no edges (same as DCSC).
  } Bitmap: one bit per vertex, representing whether the vertex has at least one edge (set bit) or not (unset bit).
  } Offset: a supplemental array for faster computation; it holds the cumulative number of set bits from the beginning of the bitmap up to each word boundary.
} How do we compute the row-offset index of a given vertex v?
  } Row offset index = Offset[w] + popcount(Bitmap[w] & mask)
  } where w = v / 64 and mask = (1 << (v % 64)) - 1.
  } There is no loop, so this is an O(1) operation, the same as CSR.
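The lookup formula above can be written directly; here is a sketch using the slide's own 8-vertex example with 4-bit words (the bit ordering — bit i of a word representing vertex i within that word — is an assumption consistent with the example).

```python
def row_offset_index(offset, bitmap, v, word_bits=64):
    """O(1) lookup of vertex v's position in the compacted Row Offset
    array: index = Offset[w] + popcount(Bitmap[w] & mask)."""
    w = v // word_bits
    mask = (1 << (v % word_bits)) - 1      # bits below v within word w
    return offset[w] + bin(bitmap[w] & mask).count("1")

# Slide example, 4-bit words: vertices 0, 6, 7 have edges.
offset = [0, 1, 3]                 # cumulative set bits per word boundary
bitmap = [0b0001, 0b1100]          # word 0 = vertices 0-3, word 1 = 4-7
```

For vertex 6 this yields index 1, so its edges live at `DST[RowOffset[1]:RowOffset[2]]`, matching the edge (6, 3) in the edge list.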
Vertex Reordering
} Problem:
  } BFS requires heavy random memory accesses, which are costly.
  } In our BFS, a vertex state (visited, etc.) is represented as a bitmap indexed by vertex ID.
  } Random memory accesses to this bitmap data are frequently required.
} Renumbering vertex IDs in order of vertex degree increases memory-access locality.
[Figure: after reordering, accesses to the bitmap data are localized]
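The reordering step can be sketched as follows: sort vertices by descending degree and relabel, keeping a table back to the original IDs (function names and the tie-breaking rule are illustrative).

```python
def reorder_by_degree(adj):
    """Renumber vertices in descending degree order (ties broken by
    original ID). Returns (new_adj, new_to_orig), where new_to_orig
    is the preserved original-ID table for output."""
    n = len(adj)
    order = sorted(range(n), key=lambda v: (-len(adj[v]), v))
    orig_to_new = [0] * n
    for new_id, v in enumerate(order):
        orig_to_new[v] = new_id
    # Rewrite every edge endpoint into the new numbering.
    new_adj = [[orig_to_new[w] for w in adj[v]] for v in order]
    return new_adj, order
```

High-degree vertices now occupy the low end of the ID space, so the hot bits of the visited bitmap are packed together.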
How to Output in Original IDs?
} Naive method:
  } Search with reordered IDs and build the BFS tree in reordered IDs, then convert it to original IDs using the ID table and all-to-all communication.
  } Since the number of vertices is too large to hold on a single node, the ID table is distributed among all nodes; referencing it requires all-to-all communication.
} Problem: all-to-all communication is a heavy operation.
[Figure: search in reordered IDs, then translate the BFS tree to original IDs via the distributed ID table using all-to-all communication]
Our Proposal
} We keep both the reordered vertex IDs and the original vertex IDs.
} Search in reordered IDs; output in original IDs.
} Reordered IDs are NOT present in the BFS tree.
} Almost no overhead, except for the additional memory to hold the original IDs.
[Figure: the sparse-matrix structure stores both numberings —
  Offset: 0 1 3
  Bitmap: 1 0 0 0 0 0 1 1
  SRC (orig): 2 0 1
  Row Offset: 0 2 3 4
  DST: 2 3 0 1
  DST (orig): 4 5 3 1]
Algorithm Detail
1. The vertices of the graph are partitioned and assigned to nodes. Each vertex has its owner node.
2. Each node sorts its assigned vertices by degree and re-labels them with new ID numbers.
   } There is no exchange or migration of vertices among nodes, so vertex-to-node assignments do not change.
3. We preserve the original vertex IDs so the BFS tree can be output in original IDs.
Top-Down Load Balancing
} Load imbalance in the top-down phase:
  } The length of the edge list varies per vertex. These differences cause load imbalance among the computing threads within a node.
} Our proposal: two-phase hybrid partitioning.
  } Phase 1: vertical partitioning, but skip long edge lists.
  } Phase 2: process the long edge lists with horizontal partitioning.
[Figure: naive vertical partitioning assigns whole edge lists to threads T0-T3; hybrid partitioning additionally splits the long edge lists across all threads]
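The two phases above can be sketched as follows; the round-robin assignment in phase 1 and the `long_threshold` cutoff are illustrative choices, not the paper's exact scheme.

```python
def hybrid_partition(adj, num_threads, long_threshold):
    """Two-phase hybrid partitioning sketch.

    Phase 1: assign whole (short) edge lists to threads round-robin,
    skipping vertices whose lists exceed long_threshold.
    Phase 2: split each long edge list evenly across all threads.

    Returns per-thread lists of (vertex, start, end) edge-list slices.
    """
    work = [[] for _ in range(num_threads)]
    long_vertices = []
    t = 0
    for v, nbrs in enumerate(adj):
        if len(nbrs) > long_threshold:
            long_vertices.append(v)     # defer to phase 2
        else:
            work[t].append((v, 0, len(nbrs)))
            t = (t + 1) % num_threads
    for v in long_vertices:
        m = len(adj[v])
        chunk = (m + num_threads - 1) // num_threads
        for t in range(num_threads):
            lo, hi = t * chunk, min((t + 1) * chunk, m)
            if lo < hi:
                work[t].append((v, lo, hi))
    return work
```

Splitting only the long lists keeps phase 1 cheap while preventing one heavy hub vertex from serializing a single thread.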
Performance Evaluation
} We evaluated the performance of the 3 proposed methods:
  } Bitmap-based sparse matrix
  } Vertex-ID reordering
  } Top-down load balancing
} Using up to 61,440 nodes of the K computer.
} Weak scaling: 2^33 vertices per 960 nodes; the number of edges is 16 x the number of vertices.
} Graphs are generated by the R-MAT generator.
  } Parameters: A = 0.57, B = 0.19, C = 0.19, D = 0.05 (same as the Graph500 benchmark).
} Performance is the median over 300 BFS runs (each BFS starts from a unique root vertex).
} The performance unit is TEPS: traversed edges per second. GTEPS = giga (10^9) TEPS.
Bitmap-based Sparse Matrix Representation
} We compared the bitmap-based sparse-matrix representation with DCSC and with coarse index + skip list.
} Since DCSC and coarse index + skip list are not compute-efficient, our proposal is 1.6 times faster than both.
[Figure: GTEPS vs. number of nodes (up to 64,000) for the bitmap-based representation, DCSC, and coarse index + skip list; the bitmap-based representation is 1.6x faster]
Vertex Reordering
} Our proposal: search with reordered IDs and output with original IDs.
} Two-step: the naive method with all-to-all communication.
} No-reordering: search and output entirely in original IDs.
} Vertex-reduction: renumber the vertex IDs to skip zero-degree vertices.
  } The generated graph has many isolated vertices, i.e. vertices with no edges.
} Our proposal gives a 1.5x speedup. Naive reordering (two-step) is slower than no-reordering due to the all-to-all communication.
[Figure: GTEPS vs. number of nodes (up to 64,000) for 1. our proposal, 2. two-step, 3. no-reordering, 4. vertex-reduction]
Top-Down Load Balancing
} Hybrid partitioning is the most efficient approach.
} The performance of horizontal partitioning matches the hybrid approach in some results.
[Figure: GTEPS vs. number of nodes (up to 64,000) for hybrid (our proposal), horizontal (edge-range), and vertical (vertex-range) partitioning]
Overall Performance
} Applying all 3 optimizations, we achieved a 2.85x speedup on 61,440 nodes.
} We achieved 38,621 GTEPS on 82,944 nodes of the K computer.
[Figure: GTEPS vs. number of nodes (up to 64,000) for the naive baseline, + bitmap-based representation, + vertex reordering, + load balancing; 2.85x cumulative speedup]
Conclusion
} We proposed an efficient breadth-first search for large distributed-memory machines.
} We presented 3 methods to speed up distributed BFS:
  } Bitmap-based sparse-matrix representation
  } Vertex-ID reordering without search overhead
  } Top-down load balancing
} We achieved 38,621 GTEPS on the K computer, which has been ranked #1 on Graph500 since July 2015.