Parallel BFS on Distributed Memory Systems
Aydin Buluc and Kamesh Madduri
Sapta, DC reading group, September 29, 2016


Page 1: Parallel BFS using 2 stacks

Parallel BFS on Distributed Memory Systems
Aydin Buluc and Kamesh Madduri

Sapta

DC reading group

September 29, 2016

Page 2

Outline

Introduction

Shared Memory BFS

Model

Contributions

Serial BFS overview

Another paper: Parallel BFS using 2 queues

This paper: Hybrid Parallel BFS using 2 stacks

Experimental Results

Conclusion

Page 3

Introduction

BFS is important.

- BFS usually forms a building block of more complex graph algorithms.

- Now that we have big graphs, parallelizing it is very important.

- Shared Memory BFS involves: (1) communication between processors and (2) distribution of the graph (vertices) among processors.

Page 4

Model

- Graph G(V, E), with |V| = n and |E| = m; also m is O(n), i.e., sparse graphs.

- Edge weights = 1.

Page 5

Contributions

- Traditional representation: 1-dimensional BFS (1D adjacency arrays).

- Sparse matrix representation: 2D partitioning of the graph (not discussed).

Page 6

Serial BFS overview

- Sequential BFS uses a queue data structure.

- BFS requirement: all vertices at a distance k from the source should be "visited" before vertices at distance k + 1.

- Explanation?

- Level Synchronous BFS is a key concept in correct shared memory BFS.
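As context for the level-order requirement above, here is a minimal textbook sketch of serial queue-based BFS (a generic version, not the authors' exact pseudocode). The FIFO queue is what guarantees that all level-k vertices are dequeued before any level-(k+1) vertex:

```python
from collections import deque

def serial_bfs(adj, source):
    """Textbook BFS: adj maps each vertex to its neighbor list.
    Returns dist[v] = number of edges on a shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()      # FIFO: all level-k vertices leave before level k+1
        for v in adj[u]:
            if v not in dist:    # first visit fixes the (minimal) level
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```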

Page 7

Modified BFS: Use 2 stacks

Can be parallelized as is: perform lines 6-7 in parallel; lines 8-10 are atomic.
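The transcript does not reproduce the numbered algorithm referred to here, but the two-stack idea can be sketched as follows (an illustrative serial sketch, not the paper's pseudocode); comments mark the steps that would become the parallel scan and the atomic check-and-set:

```python
def bfs_two_stacks(adj, source):
    """Level-synchronous BFS using two stacks instead of one queue.
    CS holds the current frontier; NS collects the next frontier."""
    dist = [-1] * len(adj)
    dist[source] = 0
    CS, level = [source], 0
    while CS:
        NS = []                    # fresh next-stack each level
        for u in CS:               # this frontier scan is the parallelizable part
            for v in adj[u]:
                if dist[v] == -1:  # in parallel, this check-and-push must be atomic
                    dist[v] = level + 1
                    NS.append(v)
        CS, level = NS, level + 1  # swap stacks and advance the level
    return dist
```

Level order still holds because the stacks are swapped only once per level, so pop order within a level does not matter.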

Page 8

Related Work: Level Synchronous Parallel BFS using 2 queues, by Agarwal et al., SC'10 [1]

Page 9

Hybrid 1D Parallel BFS Algorithm

One of the main areas for optimization in this basic parallel algorithm is load balancing: ensuring that parallelization of the edge visit steps is load-balanced.

- 1D partitioning: if there are p processors in the system, give ownership of n/p vertices to each processor.

- Random shuffling of the vertex identifiers prior to partitioning, so all processors get roughly the same number of vertices (n/p) and edges (m/p).

- Use of local stacks NSi for pushes, followed by a global union. (Overhead < 3% of execution time.)
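The three ideas above can be combined into a small sequential simulation. The relabeling and owner function here are illustrative assumptions, not the paper's exact scheme; the point is the per-processor local next-stacks NS_i and their union at the end of each level:

```python
import random

def bfs_1d_partitioned(adj, source, p=4, seed=0):
    """Sketch of 1D-partitioned BFS, simulated sequentially.
    Vertex ids are randomly relabeled so each of the p 'processors'
    owns ~n/p vertices; each level, discovered vertices are pushed
    onto the owner's local next-stack, then the stacks are unioned."""
    n = len(adj)
    label = list(range(n))
    random.Random(seed).shuffle(label)  # random relabeling before partitioning

    def owner(v):
        return label[v] % p             # hypothetical owner function

    dist = [-1] * n
    dist[source] = 0
    frontier, level = [source], 0
    while frontier:
        NS = [[] for _ in range(p)]     # one local next-stack per processor
        for u in frontier:
            for v in adj[u]:
                if dist[v] == -1:
                    dist[v] = level + 1
                    NS[owner(v)].append(v)          # push to the owner's local stack
        frontier = [v for s in NS for v in s]       # global union of local stacks
        level += 1
    return dist
```

In the real distributed algorithm the union step is the communication phase; here it is just list concatenation.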

Page 10

1D BFS

Page 11

1D BFS contd.

Page 12

1D BFS errors

- The value of level is not incremented.

- The next-stack NSi data structure should be emptied before traversing the next level.

Page 13

Experiments

- 1D Flat MPI: one process per core.

- 1D Hybrid: one or more MPI processes within a node.

- Synthetic graphs based on the R-MAT random graph model (default m:n = 16), and a web crawl of the UK domain (133 million vertices and 5.5 billion edges).

- Systems: Hopper (6392-node Cray XE6) and Franklin (9660-node Cray XT4).

Page 14

Experimental Results

- Strong scaling on Franklin

- Higher is better

- GTEPS: Giga Traversed Edges per Second
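For reference, GTEPS is just the number of traversed edges divided by the BFS runtime, in billions. A trivial helper (written here for illustration, not taken from the paper):

```python
def gteps(edges_traversed, seconds):
    """Giga Traversed Edges Per Second: the throughput metric on these plots."""
    return edges_traversed / seconds / 1e9

# e.g. traversing 5.5 billion edges in 10 s gives 0.55 GTEPS
```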

Page 15

Experimental Results

- Lower is better

- Strong scaling on Franklin

Page 16

Experimental Results

- Weak scaling on Franklin

- Lower is better

Page 17

Experiments

- Flat 1D algorithms are about 1.5-1.8 times faster than the 2D algorithms.

- The 1D hybrid algorithm is slower than the flat 1D algorithm at smaller concurrencies, but starts to perform significantly faster at larger concurrencies.

Page 18

Conclusion

- Conjecture: level synchronous BFS can be implemented without any error with relaxed queues.

- Question: can the error be bounded if we don't have a level synchronous algorithm?

Page 19

[1] V. Agarwal, F. Petrini, D. Pasetto, and D.A. Bader. Scalable graph exploration on multicore processors. In Proc. ACM/IEEE Conference on Supercomputing (SC'10), November 2010.

[2] A. Buluc and K. Madduri. Parallel breadth-first search on distributed memory systems. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 65:1-65:12, New York, NY, USA, 2011. ACM.

[3] C.E. Leiserson and T.B. Schardl. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In Proc. 22nd ACM Symp. on Parallelism in Algorithms and Architectures (SPAA '10), pages 303-314, June 2010.

Page 20

Thank You :)