TRANSCRIPT
Extreme Scale Breadth-First Search on Supercomputers
Koji Ueno (Tokyo Institute of Technology / RIKEN), Toyotaro Suzumura (IBM T.J. Watson Research Center), Naoya Maruyama (RIKEN), Katsuki Fujisawa (Kyushu University), Satoshi Matsuoka (Tokyo Institute of Technology / AIST)
Large-Scale Graph Mining is Everywhere
} Symbolic networks: the human brain (100 billion neurons)
} Protein interactions [genomebiology.com]
} Social networks [Moody '01] (Facebook: 1 billion users)
} Cyber security (15 billion log entries / day for a large enterprise)
} WWW [lumeta.com] (1 trillion unique URLs)
[Figure: example domains — cyber security, medical informatics, data enrichment, social networks, symbolic networks]
Breadth-First Search on Large Distributed-Memory Machines
} Breadth-first search (BFS):
  } The most fundamental graph algorithm.
  } A kernel of the Graph500 benchmark.
} Large-scale supercomputers:
  } Consist of thousands of distributed-memory nodes.
  } How to compute graph algorithms efficiently on these machines is a challenging problem.
} K computer: 83,000 nodes; TSUBAME 2.5: 1,400 nodes
Graph500 Benchmark [http://www.graph500.org/]
} One of our major targets is the Graph500 benchmark.
} A benchmark for big-data (data-intensive) applications.
} BFS is the main kernel for the ranking.
} The K computer is #1 using our result.
[Figure: latest Graph500 ranking]
Breadth-First Search (BFS)
} Input: a graph and a root vertex. Output: the BFS tree.
[Figure: BFS starting from the root, expanding level 1, level 2, level 3]
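As a concrete reference for this input/output contract, here is a minimal level-synchronous BFS sketch that returns the parent array (the BFS tree); the function name and adjacency-list representation are illustrative, not the paper's actual data layout.

```python
from collections import deque

def bfs_tree(adj, root):
    """Level-synchronous BFS returning the BFS tree as a parent array.

    adj: adjacency list, adj[v] = neighbors of vertex v.
    Unreached vertices keep parent -1; the root is its own parent.
    """
    n = len(adj)
    parent = [-1] * n
    parent[root] = root
    frontier = deque([root])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if parent[w] == -1:  # first visit: record the tree edge
                parent[w] = v
                frontier.append(w)
    return parent
```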
Direction Optimization [Beamer '11-'12]
} Direction optimization is a fast BFS algorithm that switches direction (top-down or bottom-up) at each search level.
} Direction optimization is effective for small-diameter graphs.
  } Scale-free networks and small-world networks are small-diameter graphs.
  } The target graph of the Graph500 benchmark is also a small-diameter graph, so direction optimization is effective there.
[Figure: Top-down — the frontier at level k expands to its neighbors at level k+1. Bottom-up — candidate neighbors at level k+1 probe the frontier at level k.]
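The per-level switch can be sketched as follows. The cost model here (comparing edges out of the frontier against edges out of unvisited vertices, scaled by an illustrative `alpha`) is a simplified stand-in for Beamer's actual alpha/beta heuristic, not the exact rule.

```python
def bfs_direction_optimizing(adj, root, alpha=2.0):
    """Direction-optimizing BFS sketch (simplified switch heuristic)."""
    n = len(adj)
    parent = [-1] * n
    parent[root] = root
    frontier = {root}
    while frontier:
        # Rough per-level costs: edges leaving the frontier vs. edges
        # that the unvisited vertices would scan in bottom-up mode.
        frontier_edges = sum(len(adj[v]) for v in frontier)
        unvisited = [v for v in range(n) if parent[v] == -1]
        unvisited_edges = sum(len(adj[v]) for v in unvisited)
        next_frontier = set()
        if frontier_edges * alpha < unvisited_edges:
            # Top-down: push from the frontier to unvisited neighbors.
            for v in frontier:
                for w in adj[v]:
                    if parent[w] == -1:
                        parent[w] = v
                        next_frontier.add(w)
        else:
            # Bottom-up: each unvisited vertex probes its neighbors
            # for a parent in the frontier and stops at the first hit.
            for v in unvisited:
                for w in adj[v]:
                    if w in frontier:
                        parent[v] = w
                        next_frontier.add(v)
                        break
        frontier = next_frontier
    return parent
```

The bottom-up direction wins on the middle levels of small-diameter graphs, where the frontier covers most edges and top-down would scan nearly the whole edge set.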
2D Partitioning BFS
} Partition the adjacency matrix of the graph two-dimensionally.
} Each partitioned region is assigned to one node.
} Nodes are virtually arranged on a 2D mesh.
} Advantages of 2D partitioning over 1D partitioning:
  } The partitioned matrix region is nearly square, so the rows and columns of the region are small enough that the related data can be held locally. In 1D partitioning, by contrast, we cannot hold all the data related to the rows and columns of a partitioned region locally: the data for the rows or columns is distributed among nodes, which requires additional communication.
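A minimal sketch of how one edge of the adjacency matrix maps to a node under 2D partitioning, assuming a simple block layout on a `pr` x `pc` mesh; the actual layout used on the K computer may differ.

```python
def owner_2d(src, dst, n, pr, pc):
    """Node (mesh row, mesh column) owning edge (src, dst) of an n x n
    adjacency matrix under block-style 2D partitioning (a sketch)."""
    rows_per = (n + pr - 1) // pr  # ceil(n / pr) matrix rows per mesh row
    cols_per = (n + pc - 1) // pc  # ceil(n / pc) matrix cols per mesh col
    return (src // rows_per, dst // cols_per)
```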
Related Work
} Distributed BFS with top-down only:
  } 2D partitioning BFS on BlueGene/L [Yoo '05]: proposed distributed-memory BFS on large distributed-memory machines.
  } Comparison of 1D partitioning and 2D partitioning [Buluc '11].
  } Distributed-memory BFS on commodity machines (Intel CPUs and an InfiniBand network) [Satish '12].
} Distributed BFS with direction optimization:
  } 2D partitioning BFS with direction optimization [Beamer '13]: our proposed BFS is based on their BFS.
  } 1D partitioning BFS with direction optimization and load balancing [Checconi '14]: very scalable; they achieved 23,751 GTEPS on 98,304 nodes of BlueGene/Q, and proposed a novel sparse-matrix representation, "coarse index + skip list". However, our bitmap-based sparse-matrix representation is more efficient.
Problem of Graph Data Structure
} When we partition a graph for a large supercomputer, each partitioned matrix is a hyper-sparse matrix.
} How do we represent this hyper-sparse matrix?
[Figure: partitioning a graph into 65,536 (256 x 256) partitions yields hyper-sparse matrices]
Existing Approaches for Sparse Matrices
} Traditional approach:
  } Compressed Sparse Row (CSR).
  } CSR is NOT memory-efficient for hyper-sparse matrices.
} Example (partitioned graph adjacency matrix: 8 vertices, 4 edges):

  Edge list:
    SRC: 0 0 6 7
    DST: 4 5 3 1

  CSR (memory is wasted on the many empty rows):
    Row Offset: 0 2 2 2 2 2 2 3 4
    DST: 4 5 3 1

} For hyper-sparse matrices:
  } DCSR (DCSC)
  } Coarse index + skip list
  } These approaches are NOT compute-efficient, as we demonstrate in the performance evaluation.
Bitmap-based Sparse Matrix Representation
} Example (partitioned graph adjacency matrix: 8 vertices, 4 edges; in this example, one word is 4 bits):

  Edge list:
    SRC: 0 0 6 7
    DST: 4 5 3 1

  Bitmap-based sparse matrix (the bitmap only consumes 8 bits):
    Offset: 0 1 3
    Bitmap: 1 0 0 0 0 0 1 1
    Row Offset: 0 2 3 4
    DST: 4 5 3 1

} Structure:
  } Row Offset: skips vertices that have no edges (same as DCSC).
  } Bitmap: one bit per vertex, representing whether the vertex has at least one edge (set bit) or not (unset bit).
  } Offset: a supplemental array for faster computation; it holds the cumulative number of set bits from the beginning of the bitmap up to each word boundary.
} How do we compute the row-offset index of a given vertex v?
  } Row offset index = Offset[w] + popcount(Bitmap[w] & mask)
  } where w = v / 64 and mask = (1 << (v % 64)) - 1.
  } There is no loop, so this is an O(1) operation, the same as CSR.
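The lookup formula above can be written directly; here is a sketch using the slide's own 8-vertex example with 4-bit words (the bit ordering — bit i of a word representing vertex i within that word — is an assumption consistent with the example).

```python
def row_offset_index(offset, bitmap, v, word_bits=64):
    """O(1) lookup of vertex v's position in the compacted Row Offset
    array: index = Offset[w] + popcount(Bitmap[w] & mask)."""
    w = v // word_bits
    mask = (1 << (v % word_bits)) - 1      # bits below v within word w
    return offset[w] + bin(bitmap[w] & mask).count("1")

# Slide example, 4-bit words: vertices 0, 6, 7 have edges.
offset = [0, 1, 3]                 # cumulative set bits per word boundary
bitmap = [0b0001, 0b1100]          # word 0 = vertices 0-3, word 1 = 4-7
```

For vertex 6 this yields index 1, so its edges live at `DST[RowOffset[1]:RowOffset[2]]`, matching the edge (6, 3) in the edge list.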
Vertex Reordering
} Problem:
  } BFS requires heavy random memory accesses, which are costly.
  } In our BFS, a vertex state (visited, etc.) is represented as a bitmap indexed by vertex ID.
  } Random memory accesses to this bitmap data are frequently required.
} Renumbering vertex IDs in order of vertex degree increases memory-access locality.
[Figure: after reordering, accesses to the bitmap data are localized]
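The reordering step can be sketched as follows: sort vertices by descending degree and relabel, keeping a table back to the original IDs (function names and the tie-breaking rule are illustrative).

```python
def reorder_by_degree(adj):
    """Renumber vertices in descending degree order (ties broken by
    original ID). Returns (new_adj, new_to_orig), where new_to_orig
    is the preserved original-ID table for output."""
    n = len(adj)
    order = sorted(range(n), key=lambda v: (-len(adj[v]), v))
    orig_to_new = [0] * n
    for new_id, v in enumerate(order):
        orig_to_new[v] = new_id
    # Rewrite every edge endpoint into the new numbering.
    new_adj = [[orig_to_new[w] for w in adj[v]] for v in order]
    return new_adj, order
```

High-degree vertices now occupy the low end of the ID space, so the hot bits of the visited bitmap are packed together.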
How to Output in Original IDs?
} Naive method:
  } Search with reordered IDs and build the BFS tree in reordered IDs, then convert it to original IDs using the ID table and all-to-all communication.
  } Since the number of vertices is too large to hold on a single node, the ID table is distributed among all nodes; referencing it requires all-to-all communication.
} Problem: all-to-all communication is a heavy operation.
[Figure: search in reordered IDs, then translate the BFS tree to original IDs via the distributed ID table using all-to-all communication]
Our Proposal
} We keep both the reordered vertex IDs and the original vertex IDs.
} Search in reordered IDs; output in original IDs.
} Reordered IDs are NOT present in the BFS tree.
} Almost no overhead, except for the additional memory to hold the original IDs.
[Figure: the sparse-matrix structure stores both numberings —
  Offset: 0 1 3
  Bitmap: 1 0 0 0 0 0 1 1
  SRC (orig): 2 0 1
  Row Offset: 0 2 3 4
  DST: 2 3 0 1
  DST (orig): 4 5 3 1]
Algorithm Detail
1. The vertices of the graph are partitioned and assigned to nodes. Each vertex has its owner node.
2. Each node sorts its assigned vertices by degree and re-labels them with new ID numbers.
   } There is no exchange or migration of vertices among nodes, so vertex-to-node assignments do not change.
3. We preserve the original vertex IDs so the BFS tree can be output in original IDs.
Top-Down Load Balancing
} Load imbalance in the top-down phase:
  } The length of the edge list varies per vertex. These differences cause load imbalance among the computing threads within a node.
} Our proposal: two-phase hybrid partitioning.
  } Phase 1: vertical partitioning, but skip long edge lists.
  } Phase 2: process the long edge lists with horizontal partitioning.
[Figure: naive vertical partitioning assigns whole edge lists to threads T0-T3; hybrid partitioning additionally splits the long edge lists across all threads]
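The two phases above can be sketched as follows; the round-robin assignment in phase 1 and the `long_threshold` cutoff are illustrative choices, not the paper's exact scheme.

```python
def hybrid_partition(adj, num_threads, long_threshold):
    """Two-phase hybrid partitioning sketch.

    Phase 1: assign whole (short) edge lists to threads round-robin,
    skipping vertices whose lists exceed long_threshold.
    Phase 2: split each long edge list evenly across all threads.

    Returns per-thread lists of (vertex, start, end) edge-list slices.
    """
    work = [[] for _ in range(num_threads)]
    long_vertices = []
    t = 0
    for v, nbrs in enumerate(adj):
        if len(nbrs) > long_threshold:
            long_vertices.append(v)     # defer to phase 2
        else:
            work[t].append((v, 0, len(nbrs)))
            t = (t + 1) % num_threads
    for v in long_vertices:
        m = len(adj[v])
        chunk = (m + num_threads - 1) // num_threads
        for t in range(num_threads):
            lo, hi = t * chunk, min((t + 1) * chunk, m)
            if lo < hi:
                work[t].append((v, lo, hi))
    return work
```

Splitting only the long lists keeps phase 1 cheap while preventing one heavy hub vertex from serializing a single thread.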
Performance Evaluation
} We evaluated the performance of the 3 proposed methods:
  } Bitmap-based sparse matrix
  } Vertex-ID reordering
  } Top-down load balancing
} Using up to 61,440 nodes of the K computer.
} Weak scaling: 2^33 vertices per 960 nodes; the number of edges is 16 x the number of vertices.
} Graphs are generated by the R-MAT generator.
  } Parameters: A = 0.57, B = 0.19, C = 0.19, D = 0.05 (same as the Graph500 benchmark).
} Performance is the median over 300 BFS runs (each BFS starts from a unique root vertex).
} The performance unit is TEPS: traversed edges per second. GTEPS = giga (10^9) TEPS.
Bitmap-based Sparse Matrix Representation
} We compared the bitmap-based sparse-matrix representation with DCSC and with coarse index + skip list.
} Since DCSC and coarse index + skip list are not compute-efficient, our proposal is 1.6 times faster than both.
[Figure: GTEPS vs. number of nodes (up to 64,000) for the bitmap-based representation, DCSC, and coarse index + skip list; the bitmap-based representation is 1.6x faster]
Vertex Reordering
} Our proposal: search with reordered IDs and output with original IDs.
} Two-step: the naive method with all-to-all communication.
} No-reordering: search and output entirely in original IDs.
} Vertex-reduction: renumber the vertex IDs to skip zero-degree vertices.
  } The generated graph has many isolated vertices, i.e. vertices with no edges.
} Our proposal gives a 1.5x speedup. Naive reordering (two-step) is slower than no-reordering due to the all-to-all communication.
[Figure: GTEPS vs. number of nodes (up to 64,000) for 1. our proposal, 2. two-step, 3. no-reordering, 4. vertex-reduction]
Top-Down Load Balancing
} Hybrid partitioning is the most efficient approach.
} The performance of horizontal partitioning matches the hybrid approach in some results.
[Figure: GTEPS vs. number of nodes (up to 64,000) for hybrid (our proposal), horizontal (edge-range), and vertical (vertex-range) partitioning]
Overall Performance
} Applying all 3 optimizations, we achieved a 2.85x speedup on 61,440 nodes.
} We achieved 38,621 GTEPS on 82,944 nodes of the K computer.
[Figure: GTEPS vs. number of nodes (up to 64,000) for the naive baseline, + bitmap-based representation, + vertex reordering, + load balancing; 2.85x cumulative speedup]
Conclusion
} We proposed an efficient breadth-first search for large distributed-memory machines.
} We presented 3 methods to speed up distributed BFS:
  } Bitmap-based sparse-matrix representation
  } Vertex-ID reordering without search overhead
  } Top-down load balancing
} We achieved 38,621 GTEPS on the K computer, which has been ranked #1 on Graph500 since July 2015.