TRANSCRIPT
NUMA-aware thread-parallel breadth-first search for Graph500 and Green
Graph500 Benchmarks on SGI UV 2000 Yuichiro Yasui & Katsuki Fujisawa
Kyushu University
ISM High Performance Computing Conference 11:00 − 11:50, Oct 9-10, 2015
Outline
• Introduction
• NUMA-aware threading
  – NUMA architecture and NUMA-based systems
  – Our library "ULIBC" for NUMA-aware threading
• Efficient BFS algorithm for the Graph500 benchmark
  – NUMA-based distributed graph representation [BD13]
  – Efficient algorithm considering the vertex degree [ISC14, HPCS15]
• Conclusion
Graph processing for large-scale networks
• Large-scale graphs appear in various fields:
  – US road network: 24 million vertices & 58 million edges
  – Twitter follower network: 61.6 million vertices & 1.47 billion edges
  – Cyber-security: 15 billion log entries per day
  – Neuronal network (Human Brain Project): 89 billion vertices & 100 trillion edges
• Fast and scalable graph processing by using HPC
• Application fields: transportation, social networks, cyber-security, bioinformatics
Graph analysis and the important kernel BFS
• Graph analysis helps us understand the relationships in real-world networks
  (graph → graph processing → understanding → application fields)
• Graph500 flow: input parameters (SCALE, edgefactor) → Step 1: graph generation →
  Step 2: graph construction → Step 3: 64 iterations of BFS and validation →
  results (BFS time, traversed edges, TEPS ratio)
• Many analysis kernels are built on top of BFS: breadth-first search,
  single-source shortest path, maximum flow, maximal independent set,
  centrality metrics, clustering, and graph mining

Breadth-first search (BFS)
• One of the most important and fundamental algorithms for traversing graph structures
• Many algorithms and applications are based on BFS (e.g., maximum flow and centrality)
• A linear-time algorithm, but one that requires many wide memory accesses without reuse
• Inputs: a graph and a source vertex
• Outputs: a predecessor tree and distances (BFS levels Lv. 1, Lv. 2, Lv. 3, ... from the source)
Betweenness centrality (BC)
• Computes an importance score for each vertex and edge utilizing all-to-all
  shortest paths (breadth-first searches), without vertex coordinates:

    C_B(v) = Σ_{s ≠ v ≠ t ∈ V} σ_st(v) / σ_st

  where σ_st is the number of shortest (s, t)-paths and σ_st(v) is the number of
  shortest (s, t)-paths passing through vertex v
• BC requires |V| BFS runs, because one BFS obtains only the one-to-all shortest paths
  (one BFS tree per source)
• Example: Osaka road network (13,076 vertices and 40,528 edges); the high-importance
  vertices concentrate on highways and bridges around Osaka station
• Our software "NETAL" can solve BC for the Osaka road network within one second
  Y. Yasui, K. Fujisawa, K. Goto, N. Kamiyama, and M. Takamatsu: NETAL: High-performance
  Implementation of Network Analysis Library Considering Computer Memory Hierarchy,
  JORSJ, Vol. 54-4, 2011.
Single-node NUMA system
• Single-node or multi-node system?
  – Single-node (single OS): currently the major CPU architecture,
    e.g. UV 2000 @ ISM (256 CPUs), Intel Xeon server (4 CPUs), laptop PC (1 CPU), smartphone (1 CPU)
  – Multi-node: many configurations
• Uniform memory access, or not?
  – UMA (uniform memory access): every CPU accesses every RAM at the same cost
  – NUMA (non-uniform memory access): each NUMA node pairs a CPU with its local RAM;
    local accesses are fast, non-local (remote) accesses are slow
Threading or process-parallel
• Which parallel programming model should we choose?
  – OpenMP (Pthreads): single process, shared memory, implicit memory accesses,
    which reduce the programming cost
  – MPI-OpenMP hybrid: multi-process, distributed memory, explicit memory accesses
    between processes (MPI_Send() & MPI_Recv()) for good locality
NUMA system
• NUMA = non-uniform memory access
• Example: a 4-way Intel Xeon E5-4640 (Sandy Bridge-EP) system
  – 4 CPU sockets (NUMA nodes), each an 8-core Xeon E5-4640 with a shared L3 cache,
    per-core L2 caches, and local RAM
  – 8 physical cores per socket and 2 threads per core: 4 x 8 x 2 = 64 threads
• Local memory accesses are fast; remote (non-local) memory accesses are slow
Max. memory bandwidth between NUMA nodes
• Local memory is about 2.6x faster than remote memory
• (Figure: STREAM Triad bandwidth in GB/s vs. number of elements log2 n, measured from
  NUMA 0 to NUMA 0, 1, 2, and 3) Local access reaches about 13 GB/s, remote access
  about 5 GB/s, i.e. approx. a 2.6x difference
STREAM Triad kernel (threads vs. data placement; vector size a[N], b[N], c[N]):

double a[N], b[N], c[N];

void STREAM_Triad(double scalar) {
  long j;
#pragma omp parallel for
  for (j = 0; j < N; ++j)
    a[j] = b[j] + scalar * c[j];
}
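One practical detail behind these bandwidth numbers is page placement. The sketch below is illustrative only (the function name and initial values are mine, not from the talk): under Linux's default first-touch policy, the thread that first writes a page determines which NUMA node backs it, so initializing the arrays with the same OpenMP schedule as the triad loop keeps each thread's slice of a, b, and c in its local memory.

/* Illustrative first-touch initialization: touching a[j], b[j], c[j] with the
 * same static schedule as the later triad loop places each thread's pages on
 * that thread's local NUMA node (assuming the threads stay on their cores). */
void init_first_touch(double *a, double *b, double *c, long n) {
  long j;
#pragma omp parallel for schedule(static)
  for (j = 0; j < n; ++j) {
    a[j] = 0.0;
    b[j] = 1.0;
    c[j] = 2.0;
  }
}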
Problem and motivation
• Goal: to develop efficient graph algorithms on a NUMA system
• Default threading: threads may move to other cores and end up accessing remote
  memory, wherever their data happens to reside
• NUMA-aware threading: pin the threads and the memory so that each thread accesses
  only its local memory
Example /proc/cpuinfo entry (the "processor" field is the processor ID, "physical id"
is the NUMA node ID, and "core id" is the core ID within the NUMA node):

processor       : 23
model name      : Intel(R) Xeon(R) CPU E5-4640 0 @ 2.40GHz
stepping        : 7
cpu MHz         : 1200.000
cache size      : 20480 KB
physical id     : 2
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 78
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
Programming cost of NUMA-aware threading
• It is not easy to apply NUMA-aware threading by hand:
  – Linux provides the processor topology only as machine files
    (file size of /proc/cpuinfo: 8.0 KB on a desktop PC, 2.4 MB on UV 2000)
  – The pinning function sched_setaffinity() uses the processor ID:

#define _GNU_SOURCE
#include <sched.h>

int bind_thread(int procid) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(procid, &set);   /* procid is the processor ID */
  return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
}
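Pinning the memory side is a separate step from sched_setaffinity(). A hedged sketch using libnuma (illustrative, not part of the talk's ULIBC code; link with -lnuma):

#include <numa.h>      /* libnuma */
#include <stdio.h>
#include <stdlib.h>

/* Allocate nbytes backed by pages on a specific NUMA node.
 * Free the result with numa_free(ptr, nbytes). */
void *alloc_on_node(size_t nbytes, int node) {
  if (numa_available() < 0) {          /* no NUMA support in kernel/library */
    fprintf(stderr, "libnuma not available, falling back to malloc\n");
    return malloc(nbytes);
  }
  return numa_alloc_onnode(nbytes, node);
}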
NUMA-aware computation with ULIBC
• ULIBC is a callable library for CPU and memory affinity settings
• It detects the processor topology of a system at run time
• Pinning threads and memory lets every thread access local memory

#include <ulibc.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
  ULIBC_init();                        /* init. and thread pinning */
#pragma omp parallel
  {
    const struct numainfo_t ni = ULIBC_get_current_numainfo();
    printf("[%02d] Node: %d of %d, Core: %d of %d\n",
           ni.id, ni.node, ULIBC_get_online_nodes(),
           ni.core, ULIBC_get_online_cores(ni.node));
  }
  return 0;
}

struct numainfo_t {
  int id;    /* Thread ID */
  int proc;  /* Processor ID */
  int node;  /* NUMA node ID */
  int core;  /* Core ID in the NUMA node */
};

Sample output (thread ID, NUMA node ID, core ID):
[04] Node: 0 of 4, Core: 1 of 16
[55] Node: 3 of 4, Core: 13 of 16
[16] Node: 0 of 4, Core: 4 of 16
[37] Node: 1 of 4, Core: 9 of 16
[30] Node: 2 of 4, Core: 7 of 16
. . .

https://bitbucket.org/yuichiro_yasui/ulibc
CPU affinity construction with ULIBC
1. Detect the entire topology
   Cores: CPU 0: P0, P4, P8, P12 / CPU 1: P1, P5, P9, P13 /
          CPU 2: P2, P6, P10, P14 / CPU 3: P3, P7, P11, P15
2. Detect the online (available) topology
   e.g. restricted by a job manager (PBS) or by numactl --cpunodebind=1,2 while other
   processes use the remaining CPUs, leaving CPU 1 (P1, P5, P9, P13) and
   CPU 2 (P2, P6, P10, P14)
3. Construct the ULIBC affinity (2 types; a combined usage sketch follows below)
   – ULIBC_set_affinity_policy(7, COMPACT_MAPPING, THREAD_TO_CORE)
     Threads: NUMA 0: 0(P1), 1(P5), 2(P9), 3(P13); NUMA 1: 4(P2), 5(P6), 6(P10)
   – ULIBC_set_affinity_policy(7, SCATTER_MAPPING, THREAD_TO_CORE)
     Threads: NUMA 0: 0(P1), 2(P5), 4(P9), 6(P13); NUMA 1: 1(P2), 3(P6), 5(P10)
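A combined usage sketch of the ULIBC calls shown on this slide and the previous one; the argument list of ULIBC_set_affinity_policy and the constant names are copied from the slide as-is and may differ in other ULIBC versions:

#include <ulibc.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
  ULIBC_init();
  /* 7 threads, scattered round-robin over the online NUMA nodes, one per core */
  ULIBC_set_affinity_policy(7, SCATTER_MAPPING, THREAD_TO_CORE);
#pragma omp parallel num_threads(7)
  {
    const struct numainfo_t ni = ULIBC_get_current_numainfo();
    printf("thread %02d -> NUMA node %d, core %d\n", ni.id, ni.node, ni.core);
  }
  return 0;
}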
Graph500 benchmark (www.graph500.org)
• Measures the performance of irregular memory accesses
• Input parameters for the problem size: SCALE and edgefactor (= 16)
• Flow: 1. graph generation → 2. graph construction → 3. BFS x 64 (each BFS followed by
  an untimed validation) → results: BFS time, traversed edges, TEPS; the score is the
  median TEPS ratio over the 64 iterations
• The generator builds a synthetic scale-free Kronecker graph with 2^SCALE vertices and
  2^SCALE x edgefactor edges by applying the recursive Kronecker product SCALE times
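As a worked example of the problem size and the TEPS score for SCALE 26 (the 0.5-second BFS time below is a made-up placeholder, not a measured value):

#include <stdio.h>
#include <stdint.h>

int main(void) {
  int scale = 26, edgefactor = 16;              /* default Graph500 edgefactor */
  uint64_t n = 1ULL << scale;                   /* 2^26 = 67,108,864 vertices  */
  uint64_t m = (uint64_t)edgefactor << scale;   /* 16 * 2^26 = 1,073,741,824 edges */
  double bfs_seconds = 0.5;                     /* hypothetical BFS time */
  printf("n = %llu, m = %llu, TEPS = %.3g\n",
         (unsigned long long)n, (unsigned long long)m, m / bfs_seconds);
  return 0;
}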
Green Graph500 benchmark (http://green.graph500.org)
• Measures power efficiency using the TEPS/W score
• Same flow as the Graph500 benchmark (generation, construction, 64 BFS iterations,
  median TEPS), plus a power measurement (Watt) during the BFS phase:
  TEPS/W = median TEPS / average power
• We have results on various systems such as the SGI UV series, Xeon servers, and
  Android devices
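As a rough worked example using figures reported later in this talk (31.33 GTEPS at 62.93 MTEPS/W on the 5th Green Graph500 list), the implied average power during the measured BFS phase is about 31.33x10^9 / 62.93x10^6 ≈ 498 W.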
Level-synchronized parallel BFS (top-down)
• Starts from the source vertex and executes two phases (traversal and swap, shown
  below) at each level

3) BFS iterations (timed): This step iterates the timed BFS-phase and the untimed
verify-phase 64 times. The BFS-phase executes the BFS for each source, and the
verify-phase confirms the output of the BFS.

This benchmark is based on the TEPS ratio, which is computed for a given graph and the
BFS output. Submissions to this benchmark must report five TEPS ratios: the minimum,
first quartile, median, third quartile, and maximum.
III. PARALLEL BFS ALGORITHM
A. Level-synchronized Parallel BFS

We assume that the input of a BFS is a graph G = (V, E) consisting of a set of vertices
V and a set of edges E. The connections of G are contained as pairs (v, w), where
v, w ∈ V. The set of edges E corresponds to a set of adjacency lists A, where an
adjacency list A(v) contains the outgoing edges (v, w) ∈ E for each vertex v ∈ V. A BFS
explores the edges spanning all other vertices v ∈ V \ {s} from the source vertex
s ∈ V in a given graph G and outputs the predecessor map π, which maps each vertex v to
its parent. When the predecessor map π(v) points to only one parent for each vertex v,
it represents a tree with root vertex s. However, some applications, such as
betweenness centrality [1], require all of the parents of each vertex (those one hop
closer to the source), so the output predecessor map is then represented as a directed
acyclic graph (DAG). In this paper, we focus on the Graph500 benchmark and assume that
the BFS output is a predecessor map represented by a tree.

Algorithm 1 is a fundamental parallel algorithm for BFS. It requires synchronization at
each level, i.e., at each number of hops away from the source; we call this the
level-synchronized parallel BFS [6]. Each traversal explores all outgoing edges of the
current frontier, which is the set of vertices discovered at this level, and finds
their neighbors, which is the set of unvisited vertices at the next level. We can
describe this algorithm using a frontier queue QF and a neighbor queue QN, because
unvisited vertices w are appended to the neighbor queue QN for each frontier vertex
v ∈ QF in parallel, with exclusive control at each level (Algorithm 1, lines 7-12), as
follows:

  QN ← { w ∈ A(v) | w ∉ visited, v ∈ QF }.    (1)
Algorithm 1: Level-synchronized Parallel BFS
Input    : G = (V, A) : unweighted directed graph; s : source vertex.
Variables: QF : frontier queue; QN : neighbor queue; visited : vertices already visited.
Output   : π(v) : predecessor map of the BFS tree.
 1  π(v) ← −1, ∀v ∈ V
 2  π(s) ← s
 3  visited ← {s}
 4  QF ← {s}
 5  QN ← ∅
 6  while QF ≠ ∅ do
 7    for v ∈ QF in parallel do
 8      for w ∈ A(v) do
 9        if w ∉ visited atomic then
10          π(w) ← v
11          visited ← visited ∪ {w}
12          QN ← QN ∪ {w}
13    QF ← QN
14    QN ← ∅

B. Hybrid BFS Algorithm of Beamer et al.

The main runtime bottleneck of the level-synchronized parallel BFS (Algorithm 1) is the
exploration of all outgoing edges of the current frontier (lines 7-12). Beamer et al.
[8], [9] proposed a hybrid BFS algorithm (Algorithm 2) that reduces the number of edges
explored. This algorithm combines two different traversal kernels: top-down and
bottom-up. Like the level-synchronized parallel BFS, top-down kernels traverse the
neighbors of the frontier. Conversely, bottom-up kernels find the frontier from the
vertices in the candidate neighbors. In other words, a top-down method finds the
children from the parent, whereas a bottom-up method finds the parent from the
children. For a large frontier, bottom-up approaches reduce the number of edges
explored, because this traversal kernel terminates once a single parent is found
(Algorithm 2, lines 16-21).

Table III lists the number of edges explored at each level using the top-down,
bottom-up, and combined hybrid (oracle) approaches. For the top-down kernel, the
frontier size mF at low and high levels is much smaller than at mid levels. In the case
of the bottom-up method, the frontier size mB starts out equal to the number of edges
m = |E| in the given graph G = (V, E) and decreases as the level increases. Bottom-up
kernels treat all unvisited vertices as candidate neighbors QN, because it is difficult
to determine their exact number prior to traversal, as shown in line 15 of Algorithm 2.
This lazy estimation of candidate neighbors increases the number of edges traversed for
a small frontier. Hence, considering the size of the frontier and the number of
neighbors, we combine the top-down and bottom-up approaches in a hybrid (oracle). This
loop generally executes only once, so the algorithm is suitable for a small-world graph
that has a large frontier, such as a Kronecker graph or an R-MAT graph. Table III shows
that the total number of edges traversed by the hybrid (oracle) is only 3% of that of
the top-down kernel.

We now explain how to determine a traversal policy (Table II(a)) for the top-down and
bottom-up kernels.
(Figure) The two phases executed at each level:
• Traversal finds the neighbors QN (the unvisited candidate vertices at level k+1) from
  the current frontier QF (the visited vertices at level k)
• Swap exchanges the frontier QF and the neighbors QN for the next level
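These two phases are exactly the structure of Algorithm 1 and map directly onto OpenMP. The following is a minimal sketch, not the authors' tuned code; it assumes a CSR graph (xadj, adj) and uses a compare-and-swap on the predecessor array as the atomic "visited" test:

#include <stdlib.h>

/* Level-synchronized top-down BFS (sketch of Algorithm 1).
 * xadj[v] .. xadj[v+1] indexes the adjacency array adj[].  Returns the number
 * of levels; pred[] is the predecessor map (-1 = unvisited). */
long bfs_top_down(long n, const long *xadj, const long *adj,
                  long src, long *pred) {
  long *cq = malloc(n * sizeof(long));   /* frontier queue QF */
  long *nq = malloc(n * sizeof(long));   /* neighbor queue QN */
  long cq_len = 0, nq_len = 0, level = 0;
  for (long v = 0; v < n; ++v) pred[v] = -1;
  pred[src] = src;
  cq[cq_len++] = src;
  while (cq_len > 0) {
    nq_len = 0;
#pragma omp parallel for schedule(dynamic, 64)
    for (long i = 0; i < cq_len; ++i) {
      long v = cq[i];
      for (long e = xadj[v]; e < xadj[v + 1]; ++e) {
        long w = adj[e];
        /* claim w exactly once: -1 -> v (the atomic visited test) */
        if (pred[w] == -1 &&
            __sync_bool_compare_and_swap(&pred[w], -1, v)) {
          long pos;
#pragma omp atomic capture
          pos = nq_len++;
          nq[pos] = w;
        }
      }
    }
    long *tmp = cq; cq = nq; nq = tmp;   /* swap QF and QN for the next level */
    cq_len = nq_len;
    ++level;
  }
  free(cq); free(nq);
  return level;
}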
Observation of data accesses in the forward (top-down) and backward (bottom-up) kernels
• Data writes in the forward (top-down) kernel, v → w:
    Input: directed graph G = (V, AF), queue QF
    Data : queue QN, visited, tree π(v)
    QN ← ∅
    for v ∈ QF in parallel do
      for w ∈ AF(v) do
        if w ∉ visited atomic then
          π(w) ← v;  visited ← visited ∪ {w};  QN ← QN ∪ {w}
    QF ← QN
• Data writes in the backward (bottom-up) kernel, w → v:
    Input: directed graph G = (V, AB), queue QF
    Data : queue QN, visited, tree π(v)
    QN ← ∅
    for w ∈ V \ visited in parallel do
      for v ∈ AB(w) do
        if v ∈ QF then
          π(w) ← v;  visited ← visited ∪ {w};  QN ← QN ∪ {w};  break
    QF ← QN
• Both kernels write only to the variables associated with w, namely π(w) and visited
  (v is only read as a vertex index)
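The bottom-up kernel above is also easy to express with OpenMP once the frontier is kept as a bitmap. A minimal sketch (the CSR of incoming edges xadj_in/adj_in and the bitmap layout are my assumptions, not the talk's data structures):

#include <stdint.h>

/* One bottom-up level: every unvisited vertex w scans its incoming edges and
 * adopts the first parent it finds in the current frontier, then breaks. */
long bottom_up_step(long n, const long *xadj_in, const long *adj_in,
                    const uint64_t *frontier_bits,   /* QF as a bitmap */
                    uint64_t *next_bits,             /* QN as a bitmap */
                    long *pred) {
  long found = 0;
#pragma omp parallel for schedule(dynamic, 64) reduction(+:found)
  for (long w = 0; w < n; ++w) {
    if (pred[w] != -1) continue;                     /* already visited */
    for (long e = xadj_in[w]; e < xadj_in[w + 1]; ++e) {
      long v = adj_in[e];
      if (frontier_bits[v >> 6] & (1ULL << (v & 63))) {
        pred[w] = v;                   /* only this thread touches vertex w */
#pragma omp atomic
        next_bits[w >> 6] |= 1ULL << (w & 63);
        ++found;
        break;
      }
    }
  }
  return found;
}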
Direction-optimizing BFS
• Top-down direction: efficient for a small frontier; uses outgoing edges
• Bottom-up direction: efficient for a large frontier; uses incoming edges
• The top-down kernel follows outgoing edges from the current frontier to its unvisited
  neighbors; the bottom-up kernel scans the candidates of neighbors along incoming
  edges and skips unnecessary edge traversals
• The direction, top-down or bottom-up, is chosen at each level [Beamer @ SC12]
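The talk does not spell out the switching test here, so the following is a hedged sketch of the heuristic published by Beamer et al.: switch to bottom-up when the frontier's outgoing-edge count exceeds the unexplored edges divided by alpha, and switch back to top-down when the frontier shrinks below n / beta (alpha = 14 and beta = 24 are the values suggested in their paper, not this implementation's tuned settings):

/* Returns nonzero if the next level should be traversed top-down. */
int is_top_down_direction(long frontier_edges, long unexplored_edges,
                          long frontier_vertices, long n_vertices,
                          int currently_top_down) {
  const long alpha = 14, beta = 24;    /* Beamer et al.'s suggested constants */
  if (currently_top_down)
    return frontier_edges < unexplored_edges / alpha;   /* stay top-down? */
  else
    return frontier_vertices < n_vertices / beta;       /* switch back? */
}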
Number of traversed edges on a Kronecker graph with SCALE 26 (|V| = 2^26, |E| = 2^30):
the hybrid BFS reduces unnecessary edge traversals.

  Level | Top-down      | Bottom-up     | Hybrid
  ------+---------------+---------------+-----------
    0   |             2 | 2,103,840,895 |          2
    1   |        66,206 | 1,766,587,029 |     66,206
    2   |   346,918,235 |    52,677,691 | 52,677,691
    3   | 1,727,195,615 |    12,820,854 | 12,820,854
    4   |    29,557,400 |       103,184 |    103,184
    5   |        82,357 |        21,467 |     21,467
    6   |           221 |        21,240 |        227
  ------+---------------+---------------+-----------
  Total | 2,103,820,036 | 3,936,072,360 | 65,689,631
  Ratio |       100.00% |       187.09% |      3.12%

At level 0 the bottom-up cost is roughly |E|. The hybrid chooses top-down for the small
frontier (low and high distances from the source) and bottom-up for the large frontier
(middle distances) [Beamer @ SC12].
(Figure) Progress of our BFS performance on Kronecker graphs (score and speedup over
the reference code):
  Reference (2011):                                    87 MTEPS   (x1)
  NUMA-aware [SC10]:                                  800 MTEPS   (x9)
  Dir. Opt. [SC12]:                                     5 GTEPS   (x58)
  NUMA-Opt. [BigData13]:                               11 GTEPS   (x125)
  NUMA-Opt. + Deg.-aware [ISC14]:                      29 GTEPS   (x334)
  NUMA-Opt. + Deg.-aware + Vtx. Sort [G500, ISC14]:    42 GTEPS   (x489)
NUMA-Opt. + Dir. Opt. BFS [BD13] (our previous result, 2013)
• Manages memory accesses on a NUMA system carefully
• The adjacency matrix is partitioned column-wise into parts 0-3, and each partial
  adjacency matrix is bound to its NUMA node using our library
• Both the top-down and bottom-up kernels then run NUMA-aware: each CPU traverses the
  partition held in its local RAM
Degree-aware + NUMA-Opt. + Dir. Opt. BFS [ISC14] (our previous result, 2014)
• Manages memory accesses on the NUMA system as in [BD13]; each NUMA node contains a
  CPU socket and its local memory
• Two degree-aware preprocessing steps (a sketch of both follows below):
  1. Deleting isolated vertices
  2. Sorting each adjacency list A(v) by degree to shorten the bottom-up loop
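Both preprocessing steps are simple to express in C. The sketch below is illustrative only (the CSR arrays xadj/adj and the degree[] array are assumed names, and sorting neighbors in descending degree order is my reading of "sorting by degree"; the point is just to probe likely parents first):

#include <stdlib.h>

/* 1. Delete isolated vertices: map old vertex IDs to a compact range,
 *    assigning -1 to degree-0 vertices so later phases can skip them. */
long compact_nonisolated(long n, const long *degree, long *new_id) {
  long next = 0;
  for (long v = 0; v < n; ++v)
    new_id[v] = (degree[v] > 0) ? next++ : -1;
  return next;                        /* number of surviving vertices */
}

/* 2. Sort each adjacency list by neighbor degree (descending), so the
 *    bottom-up kernel is likely to hit a frontier vertex early and break. */
static const long *g_degree;          /* shared with the qsort comparator */

static int by_degree_desc(const void *a, const void *b) {
  long da = g_degree[*(const long *)a];
  long db = g_degree[*(const long *)b];
  return (da < db) - (da > db);
}

void sort_adjacency_lists(long n, const long *xadj, long *adj,
                          const long *degree) {
  g_degree = degree;
  for (long v = 0; v < n; ++v)
    qsort(adj + xadj[v], xadj[v + 1] - xadj[v], sizeof(long), by_degree_desc);
}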
• Vertex sorting further improves locality: our latest version is 489x faster than the
  reference code
[HPCS15] Degree-aware + NUMA-Opt. + Dir. Opt. BFS
• Uses two graph representations: a forward graph GF for the top-down kernel and a
  backward graph GB for the bottom-up kernel
• (Figure) Each NUMA node k holds its partial adjacency lists (A0, A1, A2, A3) in local
  RAM; in the top-down kernel the input frontier spans all of V while the visited and
  neighbor data cover only the local part Vk, and in the bottom-up kernel the frontier,
  visited, and neighbor data for Vk all stay in local RAM
Algorithm 2: Direction-optimizing BFS
Input : Directed graph G = (V, AF, AB), source vertex s, frontier queue CQ, neighbor
        queue NQ, and visited vertices VS.
Output: Predecessor map π(v).
 1  π(v) ← ⊥, ∀v ∈ V \ {s}
 2  π(s) ← s
 3  VS ← {s}
 4  CQ ← {s}
 5  NQ ← ∅
 6  while CQ ≠ ∅ do
 7    if is_top_down_direction(CQ, NQ, VS) then
 8      NQ ← Top-down(G, CQ, VS, π)
 9    else
10      NQ ← Bottom-up(G, CQ, VS, π)
11    swap(CQ, NQ)

12  Procedure Top-down(G, CQ, VS, π)
13    NQ ← ∅
14    for v ∈ CQ in parallel do
15      for w ∈ AF(v) do
16        if w ∉ VS atomic then
17          π(w) ← v
18          VS ← VS ∪ {w}
19          NQ ← NQ ∪ {w}
20    return NQ

21  Procedure Bottom-up(G, CQ, VS, π)
22    NQ ← ∅
23    for w ∈ V \ VS in parallel do
24      for v ∈ AB(w) do
25        if v ∈ CQ then
26          π(w) ← v
27          VS ← VS ∪ {w}
28          NQ ← NQ ∪ {w}
29          break
30    return NQ

Each NUMA node has local memory, and the nodes connect to one another via an
interconnect such as the Intel QPI, AMD HyperTransport, or SGI NUMAlink 6. On such
systems, processor cores can access their local memory faster than remote (non-local)
memory, i.e., memory local to another processor or memory shared between processors.
To some degree, the performance of BFS depends on the speed of memory access, because
the complexity of the memory accesses is greater than that of the computation.
Therefore, in this paper, we propose a general management approach for processor and
memory affinities on a NUMA system. We could not find a library for obtaining the
position of each running thread, such as the CPU socket index, the physical core
index, or the thread index within the physical core. Thus, we have developed a general
management library for processor and memory affinities, the ubiquity library for
intelligently binding cores (ULIBC). The latest version of this library is based on
the HWLOC (portable hardware locality) library and supports many operating systems
(although we have only confirmed Linux, Solaris, and AIX). ULIBC can be obtained from
https://bitbucket.org/ulibc.

The HWLOC library manages computing resources using resource indices. In particular,
each processor is managed by a processor index across the entire system, a socket
index for each CPU socket (NUMA node), and a core index for each logical core.
However, the HWLOC library does not provide a conversion table between the running
thread and each processor's indices. ULIBC provides an "MPI rank"-like index, starting
at zero, for each CPU socket, for each physical core in each CPU socket, and for each
thread in each physical core, available to the corresponding process. We have already
applied ULIBC to graph algorithms for the shortest path problem [34], BFS [18], [19],
[35], [36], and mathematical optimization problems [15].

D. NUMA-optimized BFS

Our NUMA-optimized algorithm, which improves access to local memory, is based on
Beamer et al.'s direction-optimizing algorithm. It requires the given graph
representation and the working variables of a BFS to be divided over the local
memories before the traversal. In our algorithm, all accesses to remote memory are
avoided in the traversal phase using the following column-wise partitioning:

  V = { V0 | V1 | · · · | Vℓ−1 },                             (4)
  A = { A0 | A1 | · · · | Aℓ−1 },                             (5)

and each set of partial vertices Vk on the k-th NUMA node is defined by

  Vk = { vj ∈ V | j ∈ [ k·n/ℓ, (k+1)·n/ℓ ) },                  (6)

where n is the number of vertices and the divisor ℓ is the number of NUMA nodes (CPU
sockets). In addition, to avoid accessing remote memory, we define partial adjacency
lists AFk and ABk for the Top-down and Bottom-up policies as follows:

  AFk(v) = { w | w ∈ Vk ∩ A(v) },  v ∈ V,                      (7)
  ABk(w) = { v | v ∈ A(w) },       w ∈ Vk.                     (8)

Furthermore, the working spaces NQk, VSk, and πk for the partial vertices Vk are
allocated in the local memory of the k-th NUMA node, with the memory pinned. Note that
the range of each current queue CQk is all vertices V of the given graph, and these
are also allocated in the local memory of the k-th NUMA node. Algorithm 3 describes
the NUMA-optimized Top-down and Bottom-up procedures. In both traversal directions,
each local thread Tk binds to the processor cores of the k-th NUMA node and only
traverses the local neighbors NQk from the local frontier CQk in local memory. The
computational complexities are O(m) (Top-down) and O(m·diamG) (Bottom-up), where m is
the number of edges and diamG is the diameter of the graph.
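The interval in Eq. (6) is easy to get wrong with integer rounding. A small sketch (the names are illustrative, not ULIBC identifiers) that keeps the owner computation consistent with the partition boundaries:

/* Column-wise 1-D partition of n vertices over l NUMA nodes, as in Eq. (6):
 * node k owns the index range [part_begin, part_end). */
static inline long part_begin(long n, long l, long k) { return (k * n) / l; }
static inline long part_end(long n, long l, long k)   { return ((k + 1) * n) / l; }

/* Which node owns vertex j?  Start from a proportional guess, then nudge it so
 * the answer agrees with the integer-rounded boundaries above. */
static inline long owner_of(long j, long n, long l) {
  long k = (j * l) / n;
  while (part_begin(n, l, k) > j)  --k;
  while (part_end(n, l, k) <= j)   ++k;
  return k;
}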
NUMA-based 1-D partitioned graph representation
• Each NUMA node k holds the partial vertex set Vk and its partial adjacency lists; the
  top-down and bottom-up sub-graphs represent the same area of the adjacency matrix,
  but they are not the same data structures
Vertex sorting for BFS
• The number of traversals of a vertex equals its out-degree, and the locality of the
  vertex accesses depends on the vertex index
• Vertex sorting relabels the vertices by degree so that high-degree vertices receive
  small indices; the many accesses then concentrate on small-index vertices, improving
  the cache hit ratio
• (Figure: access frequency vs. degree distribution, original vs. sorted indices) After
  vertex sorting, the access frequency follows the degree distribution
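Once a degree-descending order perm[new] = old has been computed (for example by sorting the vertex IDs by degree), the relabeling itself is only an index translation. A minimal sketch with illustrative names:

/* Relabel vertices so that index 0 is the highest-degree vertex.  perm[new] =
 * old must already be ordered by descending degree; rank[] receives its
 * inverse.  After relabeling, the frequently accessed (high-degree) vertices
 * share small indices, which is what improves the cache hit ratio. */
void apply_vertex_order(long n, const long *perm, long *rank,
                        long nnz, long *edge_src, long *edge_dst) {
  for (long newv = 0; newv < n; ++newv)
    rank[perm[newv]] = newv;                 /* inverse permutation */
  for (long e = 0; e < nnz; ++e) {           /* rewrite the edge list */
    edge_src[e] = rank[edge_src[e]];
    edge_dst[e] = rank[edge_dst[e]];
  }
}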
Two strategies for implementation
1. Highest TEPS, for the Graph500
   – The Graph500 list ranks entries by TEPS score only
2. Largest SCALE (problem size), for the Green Graph500
   – The Green Graph500 list is separated into two categories by problem size
     (the boundary is the median size of all entries); the Big Data category collects
     entries of SCALE 30 and over
• On the 4-way NUMA system, the highest-TEPS model obtains 42 GTEPS for SCALE 27, and
  the largest-SCALE model can solve up to SCALE 30 (#1 on the 5th Green Graph500)
Comparison of the two implementations: dual-direction graphs or a single graph
• Highest-TEPS mode: keeps a forward graph GF for the top-down kernel and a backward
  graph GB for the bottom-up kernel; each NUMA node k holds its partial adjacency lists
  and keeps the visited and neighbor data for Vk in local RAM
• Largest-SCALE mode: keeps only the transposed (backward) graph GB and also uses it
  for the top-down direction; the bottom-up kernel is the same as in the highest-TEPS
  mode, but the top-down kernel differs, reading the visited data over all of V and
  writing the neighbors over all of V using both local and remote RAM
Results on a 4-way NUMA system (Xeon)
• Machine: 4-way Intel Xeon E5-4640 (64 threads), Sandy Bridge-EP, 512 GB RAM, GCC 4.4.7
• (Figure: GTEPS vs. SCALE 20-30 for DG-V, DG-S, and SG) The highest-TEPS model handles
  up to SCALE 29; the largest-SCALE model can solve up to SCALE 30 and is the #1 entry
  on the 5th Green Graph500
• (Figure: strong scaling for SCALE 27; speedup vs. number of threads,
  #NUMA nodes x #cores x #threads from 1 to 64) With 64 threads, these models run more
  than 20x faster than the sequential execution; comparing 32 and 64 threads,
  hyper-threading (HT) produces a speedup of more than 20%
SGI UV 2000 system
• SGI UV 2000
  – A shared-memory supercomputer based on a cc-NUMA architecture, running a single
    Linux OS
  – Users handle the large memory space through thread parallelization,
    e.g. OpenMP or Pthreads (MPI can also be used)
  – A full-spec UV 2000 (4 racks) has 2,560 cores and 64 TB of memory
• ISM (The Institute of Statistical Mathematics, Japan's national research institute
  for statistical science), SGI, and we collaborate on the Graph500 benchmarks; ISM has
  two full-spec UV 2000 systems (8 racks in total)
System configuration of UV 2000
• The UV 2000 has hierarchical hardware topologies: sockets, nodes, cubes, inner-racks,
  and inter-racks, connected by NUMAlink 6 (6.7 GB/s)
  – Node = 2 sockets = 2 NUMA nodes (20 cores, 512 GB)
  – Cube = 8 nodes = 16 NUMA nodes (160 cores, 4 TB)
  – Rack = 32 nodes = 64 NUMA nodes (640 cores, 16 TB)
  – The levels of the hierarchy above the NUMA node cannot be detected at run time
• We used NUMA-based flat parallelization: each NUMA node contains one Xeon CPU
  E5-2470 v2 and 256 GB of RAM
Results on UV 2000 (weak scaling; GTEPS vs. SCALE with the number of sockets in
parentheses, from SCALE 26 (1 socket) to SCALE 33 (128 sockets))
• Compared implementations: DG-V (SCALE 26 or SCALE 25 per NUMA node), DG-S (SCALE 26
  per NUMA node), and SG (SCALE 26 per NUMA node)
• Highest-TEPS model: DG-V is the fastest from 1 to 32 sockets; at 64 sockets DG-S
  becomes faster than DG-V, since both implementations use all-to-all communication and
  suffer performance degradation at that size
• Large-problem model: SG is the fastest and the most scalable; with two UV 2000 racks
  (128 CPU sockets, 1,280 threads) it reaches 174 GTEPS for SCALE 33 (8.59 billion
  vertices and 137.44 billion edges), the fastest single-node entry, ranked 9th on the
  SC14 and 10th on the ISC15 Graph500 lists
Breakdown on the 2-rack UV 2000
• Computation (57%) > communication (43%) ⇒ scalable
• (Figure: breakdown of SCALE 33 on UV 2000 with 128 CPUs; CPU time in ms for the
  initialization and for levels 0-7, split into traversal and remote-memory
  communication) Levels 0-1 run top-down and levels 2-7 run bottom-up; most of the CPU
  time is spent at the middle levels
Our achievements on the Graph500 benchmarks
• 4-way Intel Xeon server
  – The DG-V (highest-TEPS) model achieves the fastest single-server entries
  – The SG (largest-SCALE) model won (#1) the 3rd, 4th, and 5th Green Graph500 lists:
    3rd list: 59.12 MTEPS/W, 28.48 GTEPS
    4th list: 61.48 MTEPS/W, 28.61 GTEPS
    5th list: 62.93 MTEPS/W, 31.33 GTEPS
• UV 2000
  – The DG-S (middle) model achieves 131 GTEPS with 640 threads and is the most
    power-efficient among commercial supercomputers (#7 on the 3rd and #9 on the 4th
    Green Graph500 lists)
  – The SG (largest-SCALE) model achieves 174 GTEPS for SCALE 33 with 1,280 threads,
    the fastest single-node entry (ISC15, last week)
BFS performance comparison
• (Figure: GTEPS vs. SCALE 26-41, log scale, for UV 2000 with one CPU per SCALE 26,
  IBM BG/Q with one CPU per SCALE 24, and HP DragonHawk with 480 threads)
• IBM BG/Q (8,192 cores): SCALE 33, 172 GTEPS ⇒ 21.0 MTEPS/core
• SGI UV 2000 (1,280 cores): SCALE 33, 175 GTEPS ⇒ 136.5 MTEPS/core (our NUMA-aware result)
• HP Superdome X / DragonHawk (240 cores): SCALE 33, 127 GTEPS ⇒ 530.3 MTEPS/core
  (our NUMA-aware result)
Conclusion
1. An efficient graph algorithm that considers the processor topology on a single-node
   NUMA system
2. NUMA-aware programming utilizing our library ULIBC (pinning the threads and the
   memory so that each thread accesses local memory)
3. Our implementation works well on many computers:
   – Scales up to 1,280 threads on the UV 2000 at ISM
   – The UV 2000 achieves fast single-node entries, 9th and 10th on the Graph500
   – The Xeon server won the most energy-efficient entries on the 3rd, 4th, and 5th
     Green Graph500 lists

Our library "ULIBC" is available at Bitbucket: https://bitbucket.org/yuichiro_yasui/ulibc
References
• [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel Breadth-first
  Search on Multicore Single-node System, IEEE BigData 2013.
• [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient Breadth-first
  Search on a Single NUMA System, IEEE ISC'14, 2014.
• [HPCS15] Y. Yasui and K. Fujisawa: Fast and Scalable NUMA-based Thread Parallel
  Breadth-first Search, HPCS 2015, ACM, IEEE, IFIP, 2015.
• [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno, Y. Yasui, K. Iwabuchi,
  and T. Endo: Advanced Computing & Optimization Infrastructure for Extremely
  Large-Scale Graphs on Post Peta-Scale Supercomputers, Proceedings of the Optimization
  in the Real World -- Toward Solving Real-World Optimization Problems --, Springer, 2015.
([BD13], [ISC14], and [HPCS15] cover this talk; [GraphCREST2015] covers other results
of our Graph500 team.)