Parallel BFS on Distributed Memory Systems
Aydin Buluc and Kamesh Madduri
Sapta, DC reading group, September 29, 2016


Page 1: Parallel BFS using 2 stacks

Parallel BFS on Distributed Memory Systems
Aydin Buluc and Kamesh Madduri

Sapta

DC reading group

September 29, 2016

Page 2

Outline

Introduction

Shared Memory BFS

Model

Contributions

Serial BFS overview

Another paper: Parallel BFS using 2 queues

This paper: Hybrid Parallel BFS using 2 stacks

Experimental Results

Conclusion

Page 3

Introduction

BFS is important.

- BFS usually forms a building block of more complex graph algorithms.

- Now that we have big graphs, parallelizing it is very important.

- Shared Memory BFS involves: (1) communication between processors and (2) distribution of the graph (vertices) among processors.

Page 4

Model

- Graph G(V, E), with |V| = n and |E| = m; also m is O(n), i.e., sparse graphs.

- Edge weights = 1.

Page 5

Contributions

- Traditional representation: 1-dimensional BFS (1D adjacency arrays).

- Sparse matrix representation: 2D partitioning of the graph (not discussed).

Page 6

Serial BFS overview

- Sequential BFS uses a queue data structure.

- BFS requirement: all vertices at a distance k from the source should be "visited" before vertices at distance k + 1.

- Explanation?

- Level Synchronous BFS is a key concept in correct shared memory BFS.
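As context for the level-order requirement above, here is a minimal textbook sketch of serial queue-based BFS (a generic version, not the authors' exact pseudocode). The FIFO queue is what guarantees that all level-k vertices are dequeued before any level-(k+1) vertex:

```python
from collections import deque

def serial_bfs(adj, source):
    """Textbook BFS: adj maps each vertex to its neighbor list.
    Returns dist[v] = number of edges on a shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()      # FIFO: all level-k vertices leave before level k+1
        for v in adj[u]:
            if v not in dist:    # first visit fixes the (minimal) level
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```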

Page 7

Modified BFS: Use 2 stacks

Can be parallelized as is: perform lines 6-7 in parallel; lines 8-10 are atomic.
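The transcript does not reproduce the numbered algorithm referred to here, but the two-stack idea can be sketched as follows (an illustrative serial sketch, not the paper's pseudocode); comments mark the steps that would become the parallel scan and the atomic check-and-set:

```python
def bfs_two_stacks(adj, source):
    """Level-synchronous BFS using two stacks instead of one queue.
    CS holds the current frontier; NS collects the next frontier."""
    dist = [-1] * len(adj)
    dist[source] = 0
    CS, level = [source], 0
    while CS:
        NS = []                    # fresh next-stack each level
        for u in CS:               # this frontier scan is the parallelizable part
            for v in adj[u]:
                if dist[v] == -1:  # in parallel, this check-and-push must be atomic
                    dist[v] = level + 1
                    NS.append(v)
        CS, level = NS, level + 1  # swap stacks and advance the level
    return dist
```

Level order still holds because the stacks are swapped only once per level, so pop order within a level does not matter.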

Page 8

Related Work: Level Synchronous Parallel BFS using 2 queues, by Agarwal et al., SC'10 [1]

Page 9

Hybrid 1D Parallel BFS Algorithm

One of the main areas for optimization in this basic parallel algorithm is load balancing: ensuring that parallelization of the edge visit steps is load-balanced.

- 1D partitioning: if there are p processors in the system, give ownership of n/p vertices to each processor.

- Random shuffling of the vertex identifiers prior to partitioning, so all processors get roughly the same number of vertices (n/p) and edges (m/p).

- Use of local stacks NSi for pushes, followed by a global union. (Overhead < 3% of execution time.)
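The three ideas above can be combined into a small sequential simulation. The relabeling and owner function here are illustrative assumptions, not the paper's exact scheme; the point is the per-processor local next-stacks NS_i and their union at the end of each level:

```python
import random

def bfs_1d_partitioned(adj, source, p=4, seed=0):
    """Sketch of 1D-partitioned BFS, simulated sequentially.
    Vertex ids are randomly relabeled so each of the p 'processors'
    owns ~n/p vertices; each level, discovered vertices are pushed
    onto the owner's local next-stack, then the stacks are unioned."""
    n = len(adj)
    label = list(range(n))
    random.Random(seed).shuffle(label)  # random relabeling before partitioning

    def owner(v):
        return label[v] % p             # hypothetical owner function

    dist = [-1] * n
    dist[source] = 0
    frontier, level = [source], 0
    while frontier:
        NS = [[] for _ in range(p)]     # one local next-stack per processor
        for u in frontier:
            for v in adj[u]:
                if dist[v] == -1:
                    dist[v] = level + 1
                    NS[owner(v)].append(v)          # push to the owner's local stack
        frontier = [v for s in NS for v in s]       # global union of local stacks
        level += 1
    return dist
```

In the real distributed algorithm the union step is the communication phase; here it is just list concatenation.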

Page 10

1D BFS

Page 11

1D BFS contd.

Page 12

1D BFS errors

- The value of level is not incremented.

- The next-stack NSi data structure should be emptied before traversing the next level.

Page 13

Experiments

- 1D Flat MPI: one process per core.

- 1D Hybrid: one or more MPI processes within a node.

- Synthetic graphs based on the R-MAT random graph model (default m:n = 16), and a web crawl of the UK domain (133 million vertices and 5.5 billion edges).

- Systems: Hopper (6392-node Cray XE6) and Franklin (9660-node Cray XT4).

Page 14

Experimental Results

- Strong scaling on Franklin

- Higher is better

- GTEPS: Giga Traversed Edges per Second
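For reference, GTEPS is just the number of traversed edges divided by the BFS runtime, in billions. A trivial helper (written here for illustration, not taken from the paper):

```python
def gteps(edges_traversed, seconds):
    """Giga Traversed Edges Per Second: the throughput metric on these plots."""
    return edges_traversed / seconds / 1e9

# e.g. traversing 5.5 billion edges in 10 s gives 0.55 GTEPS
```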

Page 15

Experimental Results

- Lower is better

- Strong scaling on Franklin

Page 16

Experimental Results

- Weak scaling on Franklin

- Lower is better

Page 17

Experiments

- Flat 1D algorithms are about 1.5-1.8 times faster than the 2D algorithms.

- The 1D hybrid algorithm is slower than the flat 1D algorithm at smaller concurrencies, but starts to perform significantly faster at larger concurrencies.

Page 18

Conclusion

- Conjecture: level synchronous BFS can be implemented without any error with relaxed queues.

- Question: can the error be bounded if we don't have a level synchronous algorithm?

Page 19

[1] V. Agarwal, F. Petrini, D. Pasetto, and D.A. Bader. Scalable graph exploration on multicore processors. In Proc. ACM/IEEE Conference on Supercomputing (SC'10), November 2010.

[2] A. Buluc and K. Madduri. Parallel breadth-first search on distributed memory systems. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 65:1-65:12, New York, NY, USA, 2011. ACM.

[3] C.E. Leiserson and T.B. Schardl. A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In Proc. 22nd ACM Symp. on Parallelism in Algorithms and Architectures (SPAA '10), pages 303-314, June 2010.

Page 20

Thank You :)