NUMA optimized Parallel Breadth first Search on Multicore Single node System Mohammad Opada Al-Bosh Mohammad Tahsin Al-Shalabi Ruba Break Mariam Al-kassar Nagham Ballan


Post on 06-Aug-2015


Page 1: NUMA optimized Parallel Breadth first Search on Multicore Single node System

NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Mohammad Opada Al-Bosh, Mohammad Tahsin Al-Shalabi, Ruba Break, Mariam Al-kassar, Nagham Ballan

Page 2:

Outline

Background

Breadth-first Search (BFS)

NUMA-optimized parallel BFS

Numerical Results

Conclusion

Page 3:

Background

Large-scale graphs in various fields

- US road network: 58 million edges
- Twitter follow relationships: 1.47 billion edges
- Neuronal network: 100 trillion edges

Page 4:

Background

Fast and scalable graph processing using high-performance computing (HPC)

Page 5:

Importance of graph processing

Application fields

- Transportation
- Social networks
- Cyber-security
- Bioinformatics

- Step 3:

• concurrent search (breadth first search)

• optimization (single source shortest path)

• edge-oriented (maximal independent set)

Page 6:

Breadth-first search

BFS is an important and fundamental graph-processing kernel

- Stand-alone: obtains the distance (number of hops) of each vertex from the source
- Building block of many algorithms (betweenness centrality (BC), maximum flow, maximal independent set)

Obstacles to fast and scalable BFS computation

- Low arithmetic intensity
- Irregular memory accesses

Page 7:

Graph500 Benchmark

Measures computer performance by the TEPS ratio in graph processing such as BFS (breadth-first search)

TEPS ratio = number of traversed edges per second
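As a small worked example of the definition above, the following Python sketch computes a TEPS value from an edge count and a wall-clock time. The function name `teps` and the sample numbers are illustrative assumptions, not figures from the benchmark.

```python
def teps(traversed_edges: int, seconds: float) -> float:
    """Traversed edges per second for one BFS run."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return traversed_edges / seconds


# Hypothetical run: 2 billion edges traversed in 0.5 s.
rate = teps(2_000_000_000, 0.5)
print(f"{rate / 1e9:.2f} GTEPS")  # 4.00 GTEPS
```

GTEPS, as used later in the slides, is simply this value divided by 10^9.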

Page 8:

Contribution

Efficient hybrid BFS algorithm [Beamer2011,2012]

- Reduces unnecessary edge traversals

Our proposal

- NUMA-optimized hybrid algorithm
- Improves locality of memory access
  - Library for handling NUMA carefully
  - Column-wise graph partitioning

Page 9:

Example

4-way Intel Xeon E5 (64 CPU cores)

• Scalable: scales well up to 64 threads.

• Fast: 11.15 GTEPS, a 2.2x speedup over the original hybrid algorithm

Page 10:

Outline

Background

Breadth-first Search (BFS)

NUMA-optimized parallel BFS

Numerical Results

Conclusion

Page 11:

Breadth-first search (BFS)

Obtains the level of each vertex from the source vertex

Level = number of hops from the source

Input: graph G and source vertex

Output: BFS tree rooted at the source
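The input and output described above can be illustrated with a minimal sequential sketch. The adjacency-list representation, the function name `bfs`, and the 5-vertex sample graph are assumptions made for illustration; the slides do not prescribe a data layout.

```python
from collections import deque


def bfs(adj, source):
    """Return (level, parent) lists; -1 marks unreachable vertices."""
    n = len(adj)
    level = [-1] * n
    parent = [-1] * n
    level[source] = 0
    parent[source] = source  # the root is recorded as its own parent
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if level[w] == -1:          # first time w is reached
                level[w] = level[v] + 1
                parent[w] = v           # tree edge v -> w
                queue.append(w)
    return level, parent


# Hypothetical undirected graph given as adjacency lists.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
level, parent = bfs(adj, 0)
print(level)   # [0, 1, 1, 2, 3]
print(parent)  # [0, 0, 0, 1, 3]
```

The `parent` list encodes the output tree, and `level` gives the hop count of each vertex.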

Page 12:

Hybrid BFS for low-diameter graphs

Efficient for low-diameter graphs

- Graphs with the scale-free and/or small-world property, such as social networks

Used at the higher ranks of the Graph500 benchmark

Hybrid algorithm

- Combines the top-down and bottom-up algorithms
- Reduces unnecessary edge traversals

Page 13:

Hybrid algorithm

- Top-down algorithm: efficient for a small frontier
- Bottom-up algorithm: efficient for a large frontier
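The slides do not state the rule used to switch between the two directions. A minimal sketch, assuming a simple frontier-size threshold, could look like the following; the name `choose_direction` and the 5% default are hypothetical and chosen only for illustration.

```python
def choose_direction(frontier_size: int, num_vertices: int,
                     threshold: float = 0.05) -> str:
    """Pick a BFS direction from the frontier's share of all vertices.

    threshold is a hypothetical tuning knob, not a value from the slides.
    """
    if frontier_size > threshold * num_vertices:
        return "bottom-up"   # large frontier
    return "top-down"        # small frontier


for size in (10, 200, 5000):
    print(size, choose_direction(size, num_vertices=10_000))
# 10 top-down
# 200 top-down
# 5000 bottom-up
```

Published hybrid implementations typically use more refined criteria (for example, edge counts rather than vertex counts), so this should be read as a placeholder for whatever rule is actually used.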

Page 14:

Top-down algorithm

Explores the outgoing edges of the frontier queue QF

Appends unvisited vertices to the neighbor queue QN

Efficient for a small frontier

• Performs unnecessary edge traversals for a large frontier
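One level of the top-down search described above can be sketched as follows. The function name `top_down_step`, the use of a `parent` list with -1 for unvisited vertices, and the sample graph are assumptions for illustration.

```python
def top_down_step(adj, frontier, parent):
    """One BFS level: scan the outgoing edges of QF and return QN."""
    neighbors = []
    for v in frontier:
        for w in adj[v]:
            if parent[w] == -1:      # w has not been visited yet
                parent[w] = v
                neighbors.append(w)
    return neighbors


# Hypothetical undirected graph; vertex 0 is the source.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
parent = [-1] * len(adj)
parent[0] = 0
qn = top_down_step(adj, [0], parent)
print(qn)  # [1, 2]
```

Every edge leaving the frontier is inspected, including edges to vertices that are already visited, which is the source of the wasted work when the frontier is large.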

Page 15:

Bottom-up algorithm

Explores the frontier queue QF from the unvisited vertices

Appends unvisited vertices adjacent to the frontier to the neighbor queue QN

Efficient for a large frontier

• Performs unnecessary edge traversals for a small frontier
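One level of the bottom-up search described above can be sketched as follows. The function name `bottom_up_step` and the sample data are assumptions; the graph is taken to be undirected, so the same adjacency lists serve as incoming edges.

```python
def bottom_up_step(adj, frontier, parent):
    """One BFS level: each unvisited vertex looks for a parent in QF."""
    in_frontier = set(frontier)
    neighbors = []
    for w in range(len(adj)):
        if parent[w] != -1:          # already visited
            continue
        for v in adj[w]:
            if v in in_frontier:
                parent[w] = v
                neighbors.append(w)
                break                # remaining edges of w are skipped
    return neighbors


# Hypothetical state after the first level from source 0.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
parent = [0, 0, 0, -1, -1]
qn = bottom_up_step(adj, [1, 2], parent)
print(qn)  # [3]
```

The early `break` is what saves edge traversals when the frontier is large; with a small frontier, most unvisited vertices scan all their edges without finding a parent.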

Page 16:

Outline

Background

Breadth-first Search (BFS)

NUMA-optimized parallel BFS

Numerical Results

Conclusion

Page 17:

How to speed up the hybrid algorithm?

NUMA architecture

- Non-uniform memory access
- Each CPU socket has its own local RAM
- Local RAM is fast; non-local RAM is slow

4-socket Intel Xeon E5 system

Page 18:

Frequent non-local memory accesses on NUMA architecture
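A rough estimate shows why non-local accesses are frequent without NUMA-aware placement. Assuming data is spread evenly over the sockets and accessed uniformly at random (an assumption made only for this estimate), a thread finds just one share in its own local RAM. The function name `remote_fraction` is hypothetical.

```python
def remote_fraction(num_sockets: int) -> float:
    """Share of accesses that hit non-local RAM under uniform random access."""
    if num_sockets < 1:
        raise ValueError("num_sockets must be at least 1")
    return (num_sockets - 1) / num_sockets


print(remote_fraction(4))  # 0.75
```

On a 4-socket system this simple model puts three out of four accesses in non-local RAM.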

Page 19:

Difficulty of considering NUMA architecture

1. How to distribute the graph and data to each local RAM?

2. How to bind each partial graph and its data to a NUMA unit?

Page 20:

ULIBC: Ubiquity Library for Intelligently Binding Cores

1. NUMACTL (command-line tool, library for C/C++)
2. Intel compiler Thread Affinity Interface (API)
3. ULIBC (our library, library for C/C++)
   - Processor ID: index of the logical processor core
   - Package ID: index of the CPU socket
   - Core ID: index of the physical core within each CPU socket
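To make the relationship between the three indices concrete, here is a minimal sketch that derives a package ID and core ID from a processor ID. It is not the ULIBC API: the function `locate`, its parameters, and the contiguous socket-by-socket numbering are all assumptions, and real systems often number logical processors differently.

```python
def locate(processor_id: int, cores_per_package: int,
           threads_per_core: int = 1):
    """Map a logical processor index to (package_id, core_id).

    Assumes logical processors are numbered contiguously, package by
    package and core by core (a hypothetical layout).
    """
    logical_per_package = cores_per_package * threads_per_core
    package_id = processor_id // logical_per_package
    core_id = (processor_id % logical_per_package) // threads_per_core
    return package_id, core_id


# Hypothetical layout: 8 physical cores x 2 hardware threads per socket.
print(locate(37, cores_per_package=8, threads_per_core=2))  # (2, 2)
```

A library that queries the operating system for the actual topology removes the need for such numbering assumptions.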

Page 21:

NUMA-optimized column-wise graph partitioning

- Divides G = (V, A) into partial graphs Gk = (Vk, Ak) and binds Gk to local RAM k
- Ak is the set of adjacency lists that holds the incoming edges to Vk
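The partitioning described above can be sketched as follows, assuming the vertices are split into contiguous, equally sized ranges (the slides do not say how Vk is chosen). The function name `partition_columns` is hypothetical, and the binding of each part to a local RAM is not modelled, since plain Python has no control over memory placement.

```python
def partition_columns(adj, num_parts):
    """Split vertices into ranges Vk and collect incoming edges Ak."""
    n = len(adj)
    size = (n + num_parts - 1) // num_parts      # ceil(n / num_parts)
    ranges = [range(k * size, min((k + 1) * size, n))
              for k in range(num_parts)]
    # incoming[k][w] lists every v with an edge v -> w, for w in Vk.
    incoming = [{w: [] for w in vk} for vk in ranges]
    for v in range(n):
        for w in adj[v]:
            incoming[w // size][w].append(v)
    return ranges, incoming


# Hypothetical undirected graph split into two parts.
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
ranges, incoming = partition_columns(adj, 2)
print([list(r) for r in ranges])  # [[0, 1, 2], [3, 4]]
print(incoming[1])                # {3: [1, 2, 4], 4: [3]}
```

Grouping edges by destination means that all updates to a vertex in Vk are driven by data stored in part k, which corresponds to a block of columns of the adjacency matrix.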

Page 22:

NUMA-optimized top-down

Explores the outgoing edges of the frontier queue QF.

Appends unvisited vertices to the neighbor queue QN.

Efficient for a small frontier

• Performs unnecessary edge traversals for a large frontier
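The details of the NUMA-optimized top-down step are not in the transcript. One plausible reading, sketched below as an assumption, is that each NUMA unit k scans the whole frontier but only follows edges that lead into its own vertex range Vk, so that every write lands in data owned by that unit. The function name `numa_top_down_step` and the contiguous ranges are hypothetical.

```python
def numa_top_down_step(adj, frontier, parent, num_parts):
    """Top-down level where part k only updates vertices in Vk."""
    n = len(adj)
    size = (n + num_parts - 1) // num_parts     # contiguous ranges (assumed)
    local_queues = [[] for _ in range(num_parts)]
    for k in range(num_parts):                  # conceptually one NUMA unit each
        for v in frontier:
            for w in adj[v]:
                # Edges into other parts are left to their owners. A real
                # implementation would pre-partition the edges instead of
                # filtering them on the fly as done here.
                if w // size != k:
                    continue
                if parent[w] == -1:
                    parent[w] = v
                    local_queues[k].append(w)
    return local_queues


adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
parent = [0, 0, 0, -1, -1]
print(numa_top_down_step(adj, [1, 2], parent, 2))  # [[], [3]]
```

The loop over `k` runs sequentially here; in a parallel setting each iteration would be executed by the threads bound to unit k.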

Page 23:

Details of NUMA-optimized Top-down

Page 24:

NUMA-optimized bottom-up

Explores the frontier queue QF from the unvisited vertices.

Appends unvisited vertices adjacent to the frontier to the neighbor queue QN.

Efficient for a large frontier.

• Performs unnecessary edge traversals for a small frontier
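The details of the NUMA-optimized bottom-up step are likewise not in the transcript. A minimal sketch, assuming each NUMA unit k scans only its own unvisited vertices in Vk and their incoming adjacency lists Ak, is shown below. The function name `numa_bottom_up_step` and the hand-written partition are hypothetical.

```python
def numa_bottom_up_step(incoming, frontier, parent):
    """Bottom-up level where part k only scans vertices in Vk via Ak."""
    in_frontier = set(frontier)
    local_queues = []
    for ak in incoming:                  # Ak: incoming edges of vertices in Vk
        qn_k = []
        for w, sources in ak.items():
            if parent[w] != -1:          # already visited
                continue
            for v in sources:
                if v in in_frontier:
                    parent[w] = v
                    qn_k.append(w)
                    break                # stop at the first parent found
        local_queues.append(qn_k)
    return local_queues


# Hypothetical partition of a 5-vertex graph: V0 = {0, 1, 2}, V1 = {3, 4}.
incoming = [
    {0: [1, 2], 1: [0, 3], 2: [0, 3]},
    {3: [1, 2, 4], 4: [3]},
]
parent = [0, 0, 0, -1, -1]
print(numa_bottom_up_step(incoming, [1, 2], parent))  # [[], [3]]
```

Each part reads and writes only its own vertices and adjacency lists; the frontier membership test is the one piece of shared state in this sketch.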

Page 25:

Details of NUMA-optimized Bottom-up

Page 26:

Outline

Background

Breadth-first Search (BFS)

NUMA-optimized parallel BFS

Numerical Results

Conclusion

Page 27:

Machine specification

4-way Intel Xeon E5

- CentOS 6.4 (kernel 2.6.32)
- GCC 4.4.7
- 64 logical CPU cores
- 4 NUMA units x 16 logical cores

4-way AMD Opteron 6174

- Fedora 19 (kernel 3.11.2)
- GCC 4.8.1
- 48 CPU cores
- 8 NUMA units x 6 cores

Page 28:

TEPS ratio varied with problem size

The NUMA-optimized algorithm achieves a 2.2x speedup over the original hybrid algorithm

The NUMA-optimized algorithm achieves 11.15 GTEPS for a Kronecker graph (SCALE 26).
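For a sense of scale, the sketch below computes the size of a SCALE 26 problem. It relies on the Graph500 conventions that a graph has 2^SCALE vertices and an edge factor of 16; these come from the benchmark specification rather than from the slides, and the function name `graph500_size` is hypothetical. The time estimate assumes every generated edge counts as traversed, so it is only a rough upper bound.

```python
def graph500_size(scale: int, edge_factor: int = 16):
    """Vertex and edge counts of a Graph500 Kronecker graph."""
    num_vertices = 2 ** scale
    num_edges = edge_factor * num_vertices
    return num_vertices, num_edges


n, m = graph500_size(26)
print(n)  # 67108864
print(m)  # 1073741824

# Rough time per BFS at the reported 11.15 GTEPS.
seconds = m / 11.15e9
print(f"about {seconds * 1e3:.0f} ms per BFS")
```

At this rate a single traversal of roughly one billion edges takes on the order of a tenth of a second.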

Page 29:

Strong scaling on Intel/AMD systems

Scales well until the number of threads reaches the number of cores

Page 30:

Twitter network

Page 31:

Graph500 benchmark

Fastest single-node entry on the 4th list (June 2012)

Fastest CPU-based single-node entry on the 6th list (June 2013)

Page 32:

1st Green Graph500 list (June 2013)

Measures power efficiency using the TEPS/W ratio

Results on various systems such as Android, Linux, and Mac.

NUMA

Small Data category

Page 33:

Outline

Background

Breadth-first Search (BFS)

NUMA-optimized parallel BFS

Numerical Results

Conclusion

Page 34:

Conclusion

NUMA-optimized hybrid BFS algorithm

- Reduces unnecessary edge traversals and remote RAM accesses by carefully considering NUMA

Numerical results on 4-way Intel Xeon

- Scales well up to 64 threads (scalable)
- Achieves 11.15 GTEPS (fast)
- 2.2x speedup compared with the original hybrid algorithm

Graph500 and Green Graph500

- Fastest single-node entry in June 2012
- Most power-efficient in June 2013

Page 35:

Future work

- Further optimization of the NUMA-optimized BFS
- Distributed-memory parallel computation

Page 36:

References

Parallel Breadth-First Search on Distributed Memory Systems

[Aydın Buluç, Kamesh Madduri; Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA]

A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L

[Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, Ümit Çatalyürek]

Distributed Memory Breadth-First Search Revisited: Enabling Bottom-Up Search

[Scott Beamer, Aydın Buluç, Krste Asanović, David A. Patterson]

Distributed Breadth First Search

[CSE 6220, Spring 2013, Georgia Institute of Technology, April 18; guest lecture by Anita Zakrzewska]

Evaluation and Optimization of Breadth-First Search on NUMA Cluster

[Zehan Cui, Licheng Chen, Mingyu Chen, Yungang Bao, Yongbing Huang, Huiwei Lv; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate School of Chinese Academy of Sciences]

Page 37:

References

Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory

[Roger Pearce, Maya Gokhale, Nancy M. Amato; Parasol Laboratory, Dept. of Computer Science and Engineering, Texas A&M University, College Station, TX; Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA]

Reducing Communication in Parallel Breadth-First Search on Distributed Memory Systems

[Huiwei Lu, Guangming Tan, Mingyu Chen, Ninghui Sun; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Argonne National Laboratory]

Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems

[Rudolf Berrendorf, Matthias Makulla; Computer Science Department, Bonn-Rhein-Sieg University, Sankt Augustin, Germany]