Parallel BFS using 2 stacks

Post on 19-Jan-2017


Parallel BFS on Distributed Memory Systems

Aydın Buluç and Kamesh Madduri

Sapta

DC reading group

September 29, 2016

Outline

Introduction

Shared Memory BFS

Model

Contributions

Serial BFS overview

Another paper: Parallel BFS using 2 queues

This paper: Hybrid Parallel BFS using 2 stacks

Experimental Results

Conclusion

Introduction

BFS is important.

- BFS usually forms a building block of more complex graph algorithms.

- Now that we have BIG graphs, parallelizing it is very important.

- Shared Memory BFS involves: (1) communication between processors and (2) distribution of the graph (vertices) among processors.

Model

- Graph G(V, E) with |V| = n and |E| = m; also m is O(n), i.e. sparse graphs.

- Edge weights = 1.

Contributions

- Traditional representation: 1-dimensional BFS (1D adjacency arrays).

- Sparse matrix representation: 2D partitioning of the graph (not discussed).
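As a rough illustration of what "adjacency arrays" can look like, here is a minimal sketch in Python. The function names `build_adjacency_arrays` and `adjacent` and the small example graph are hypothetical and not taken from the paper. The sketch stores an undirected graph as one offsets array plus one neighbors array.

```python
def build_adjacency_arrays(n, edges):
    """Build offsets/neighbors arrays for an undirected graph on n vertices."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # offsets[v] .. offsets[v + 1] delimits the neighbors of vertex v.
    offsets = [0] * (n + 1)
    for v in range(n):
        offsets[v + 1] = offsets[v] + degree[v]
    neighbors = [0] * offsets[n]
    cursor = offsets[:n]  # copy: next free slot for each vertex
    for u, v in edges:
        neighbors[cursor[u]] = v
        cursor[u] += 1
        neighbors[cursor[v]] = u
        cursor[v] += 1
    return offsets, neighbors


def adjacent(offsets, neighbors, v):
    """Return the neighbor list of vertex v as a slice of the flat array."""
    return neighbors[offsets[v]:offsets[v + 1]]


# Hypothetical 4-vertex example graph.
offsets, neighbors = build_adjacency_arrays(4, [(0, 1), (0, 2), (1, 2), (2, 3)])
print(adjacent(offsets, neighbors, 2))
```

In a 1D distribution each processor would presumably hold only the slice of these arrays belonging to the vertices it owns.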

Serial BFS overview

- Sequential BFS uses a queue data structure.

- BFS requirement:

  - All vertices at distance k from the source should be "visited" before vertices at distance k + 1.

- Explanation?

- Level-synchronous BFS is a key concept in correct shared-memory BFS.
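The queue-based serial BFS and its distance-ordering requirement can be sketched as follows. This is a minimal textbook version, not code from the slides. The name `bfs_levels` and the example graph are illustrative assumptions.

```python
from collections import deque


def bfs_levels(adj, source):
    """Return a dict mapping each reachable vertex to its distance from source."""
    level = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()  # FIFO order: distance-k vertices leave before k + 1
        for v in adj[u]:
            if v not in level:  # first time v is seen
                level[v] = level[u] + 1
                queue.append(v)
    return level


# Hypothetical example graph as an adjacency dict.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
levels = bfs_levels(adj, 0)
print(levels)
```

Because the queue is first-in first-out, every vertex at distance k is dequeued before any vertex at distance k + 1, which is the requirement stated above.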

Modified BFS: Use 2 stacks

Can be parallelized as is: perform lines 6-7 in parallel; lines 8-10 are atomic.

Related Work: Level-Synchronous Parallel BFS using 2 queues by Agarwal et al., SC'10 [1]
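The pseudocode that the line numbers above refer to is not part of this transcript. The following is a serial sketch of a level-synchronous BFS with a current stack and a next stack, written under that caveat. The names `bfs_two_stacks`, `current` and `nxt` are assumptions, and the sketch's own lines do not correspond to the slide's numbering.

```python
def bfs_two_stacks(adj, source):
    """Level-synchronous BFS using a current stack and a next stack."""
    level = {v: -1 for v in adj}  # -1 marks "not yet visited"
    level[source] = 0
    current, nxt = [source], []
    depth = 0
    while current:
        # Drain the current level; the order inside one level does not matter.
        while current:
            u = current.pop()
            for v in adj[u]:
                # In a parallel version this check-and-set plus push would
                # need to be atomic (an assumption about the slide's intent).
                if level[v] == -1:
                    level[v] = depth + 1
                    nxt.append(v)
        current, nxt = nxt, []  # swap stacks and start the next level empty
        depth += 1
    return level


# Hypothetical example graph as an adjacency dict.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
result = bfs_two_stacks(adj, 0)
print(result)
```

A stack is enough here because correctness depends only on finishing level k before starting level k + 1, not on the order in which vertices within a level are processed.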

Hybrid 1D Parallel BFS Algorithm

One of the main areas for optimizing this basic parallel algorithm is load balancing: ensuring that the work of the edge-visit steps is spread evenly across processors.

- 1D partitioning: If there are p processors in the system, give ownership of n/p vertices to each processor.

- Random shuffling of the vertex identifiers prior to partitioning, so that all processors get roughly the same number of vertices (n/p) and edges (m/p).

- Use of local stacks NSi for pushes, followed by a global union (overhead < 3% of execution time).
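The shuffling and 1D ownership steps listed above can be sketched as follows. This is an illustration under assumptions, not the authors' code. The name `shuffle_and_partition`, the block-based owner rule and the fixed seed are all hypothetical choices.

```python
import random


def shuffle_and_partition(n, p, seed=0):
    """Randomly relabel n vertices, then assign contiguous blocks to p processors."""
    rng = random.Random(seed)
    new_id = list(range(n))
    rng.shuffle(new_id)  # new_id[old] is the relabeled identifier of vertex old
    block = -(-n // p)   # ceil(n / p) vertices per processor

    def owner(v):
        """Processor that owns the relabeled vertex id v."""
        return v // block

    return new_id, owner


new_id, owner = shuffle_and_partition(12, 4)
counts = [0] * 4
for v in new_id:
    counts[owner(v)] += 1
print(counts)
```

The relabeling does not change how many vertices each processor owns; its purpose is to spread high-degree vertices across processors so that edge counts come out near m/p as well.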

1D BFS

1D BFS contd..

1D BFS errors

- The value of level is not incremented.

- The next stack NSi data structure should be emptied before traversing the next level.
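To make the two corrections concrete, here is a single-process simulation of a 1D level-synchronous BFS. It is a sketch under assumptions: no MPI is used, the all-to-all exchange is imitated with nested lists, and the names `bfs_1d_simulated`, `outbox` and `next_stack` are hypothetical rather than taken from the paper's pseudocode.

```python
def bfs_1d_simulated(adj, source, p):
    """Simulate 1D-partitioned BFS on p 'processors'; adj is a list of lists."""
    n = len(adj)
    block = -(-n // p)  # ceil(n / p) vertices per processor

    def owner(v):
        return v // block

    level = [-1] * n
    level[source] = 0
    frontier = [[] for _ in range(p)]  # frontier vertices owned by each processor
    frontier[owner(source)].append(source)
    depth = 0
    while any(frontier):
        # Each processor scans its frontier and buckets targets by owner.
        outbox = [[[] for _ in range(p)] for _ in range(p)]
        for i in range(p):
            for u in frontier[i]:
                for v in adj[u]:
                    outbox[i][owner(v)].append(v)
        # Correction 2: every NSi starts empty at each level.
        next_stack = [[] for _ in range(p)]
        # Simulated all-to-all: processor j marks the vertices it owns.
        for j in range(p):
            for i in range(p):
                for v in outbox[i][j]:
                    if level[v] == -1:
                        level[v] = depth + 1
                        next_stack[j].append(v)
        frontier = next_stack
        depth += 1  # Correction 1: advance the level counter once per level
    return level


# Hypothetical example graph as a list of adjacency lists.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
dist = bfs_1d_simulated(adj, 0, 2)
print(dist)
```

Without the `depth += 1` step every discovered vertex would be labeled with level 1, and without resetting the next stacks, vertices from earlier levels would be traversed again.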

Experiments

- 1D Flat MPI: one process per core.

- 1D Hybrid: one or more MPI processes within a node.

- Synthetic graphs based on the R-MAT random graph model (default m : n = 16), and a web crawl of the UK domain (133 million vertices and 5.5 billion edges).

- Systems: Hopper (6392-node Cray XE6) and Franklin (9660-node Cray XT4).

Experimental Results

- Strong scaling on Franklin

- Higher is better

- GTEPS: Giga Traversed Edges per Second
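For readers unfamiliar with the metric, GTEPS is the number of edges traversed divided by the running time, in units of 10^9. A tiny sketch follows. The function name `gteps` and the input numbers are made up for illustration and are not measurements from the paper.

```python
def gteps(edges_traversed, seconds):
    """Giga traversed edges per second."""
    return edges_traversed / seconds / 1e9


# Hypothetical run: 16 billion edge traversals in 4 seconds.
rate = gteps(16e9, 4.0)
print(rate)
```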

Experimental Results

- Strong scaling on Franklin

- Lower is better

Experimental Results

- Weak scaling on Franklin

- Lower is better

Experiments

- Flat 1D algorithms are about 1.5–1.8 times faster than the 2D algorithms.

- The 1D hybrid algorithm is slower than the flat 1D algorithm at smaller concurrencies, but starts to perform significantly faster at larger concurrencies.

Conclusion

- Conjecture: Level-synchronous BFS can be implemented without any error with relaxed queues.

- Question: Can the error be bounded if we don't have a level-synchronous algorithm?

[1] V. Agarwal, F. Petrini, D. Pasetto, and D.A. Bader. Scalable graph exploration on multicore processors. In Proc. ACM/IEEE Conference on Supercomputing (SC10), November 2010.

[2] A. Buluç and K. Madduri. Parallel breadth-first search on distributed memory systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 65:1–65:12, New York, NY, USA, 2011. ACM.

[3] C.E. Leiserson and T.B. Schardl. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In Proc. 22nd ACM Symp. on Parallelism in Algorithms and Architectures (SPAA '10), pages 303–314, June 2010.

Thank You :)
