
Page 1: NUMA-aware Scalable Graph Traversal on SGI UV Systems

NUMA-aware scalable graph traversal on SGI UV systems

*Yuichiro Yasui, Katsuki Fujisawa (Kyushu University)

Eng Lim Goh, John Baron, Atsushi Sugiura (SGI Corp.)

Takashi Uchiyama (SGI Japan, Ltd.)

HPGP’16 @ Kyoto, May 31, 2016

Page 2: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Outline

• Introduction

– Graph analysis for large-scale networks

– Graph500 benchmark and Breadth-first search

– NUMA-aware computation

• Our proposal: Pruning of remote edge traversals

• Numerical results on SGI UV systems

Page 3: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Outline

• Introduction

– Graph analysis for large-scale networks

– Graph500 benchmark and Breadth-first search

– NUMA-aware computation

• Our proposal: Pruning of remote edge traversals

• Numerical results on SGI UV systems

Page 4: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Our motivations

• NUMA / cc-NUMA architecture, and the graph algorithm BFS

• Efficient NUMA-aware BFS algorithm

– improves locality of reference in memory accesses

– exploits multithreading on many-socket systems (SGI UV 2000, UV 300)

[Figure: the Graph500 benchmark flow (input parameters SCALE and edgefactor; graph generation → graph construction → 64 BFS iterations with validation; results: BFS time, traversed edges, TEPS) next to a many-socket system in which each NUMA node (CPU + RAM) is assigned a partial CSR graph, with fast local accesses and slow remote accesses; a graph structure represents many relationships]

• Target: a Kronecker graph with SCALE 34 (17 billion vertices and 275 billion edges) on SGI UV 300 (32 sockets of 18-core Xeon and 16 TB RAM)

Page 5: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Our contributions (previous work and this paper)

• Efficient graph data structure built from the input graph

– Vertex sorting [HPCS15]: reorders vertex indices by outdegree

– Adjacency list sorting [ISC14]: sorts each CSR adjacency list by outdegree

– NUMA-aware graph [BD13]: partial CSR graphs A0–A3, each bound to a NUMA node

• Efficient BFS based on Beamer’s direction-optimizing BFS (SC12)

– Top-down direction: Agarwal’s top-down (SC10) with socket-queues separating local and remote traffic; this paper adds pruning of remote edges

– Bottom-up direction: NUMA-aware bottom-up [BD13] (input: CQ; data: VS_k; output: NQ_k; local accesses only)

• New result: the reduction of remote edges yields 219 GTEPS on UV 300 with 32 sockets, an updated highest single-node score

Page 6: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Graph processing for large-scale networks

• Large-scale networks arise in a wide range of application areas:

– US road network: 24 million vertices and 58 million edges

– Twitter follow-ship network (social network): 61.6 million vertices and 1.47 billion edges

– Cyber-security: 15 billion log entries per day

– Neuronal network (Human Brain Project): 89 billion vertices and 100 trillion edges

• Fast and scalable graph processing with HPC

– categorized as a data-intensive application

Page 7: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Graph analysis and the important kernel BFS

• Used to understand relationships in real-world networks

• Application fields: transportation, social networks, cyber-security, bioinformatics

• Workflow: Step 1, collect relationships; Step 2, construct a graph; Step 3, run graph processing and understand the results

• Typical kernels: breadth-first search, single-source shortest path, maximum flow, maximal independent set, centrality metrics, clustering, graph mining

Page 8: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Breadth-first search (BFS)

• One of the most important and fundamental algorithms for traversing graph structures

• Many algorithms and applications are based on BFS (e.g., maximum flow and centrality)

• The well-known algorithm takes O(n + m) time for a digraph G with n vertices and m edges

• Inputs: digraph G = (V, E) and a source vertex

• Outputs: BFS tree and distances (levels 1, 2, 3, … from the source)

Page 9: NUMA-aware Scalable Graph Traversal on SGI UV Systems

BFS on the Twitter follow-ship network (Twitter2009)

• Follow-ship network

– #Users (#vertices): 41,652,230

– Follow-ships (#edges): 2,405,026,092

• BFS result from user 21,804,357 (ours: 60 milliseconds per BFS):

Lv.        #users   ratio (%)   percentile (%)
0               1        0.00             0.00
1               7        0.00             0.00
2           6,188        0.01             0.01
3         510,515        1.23             1.24
4      29,526,508       70.89            72.13
5      11,314,238       27.16            99.29
6         282,456        0.68            99.97
7          11,536        0.03           100.00
8             673        0.00           100.00
9              68        0.00           100.00
10             19        0.00           100.00
11             10        0.00           100.00
12              5        0.00           100.00
13              2        0.00           100.00
14              2        0.00           100.00
15              2        0.00           100.00
Total  41,652,230      100.00                -

(Percentiles exclude unconnected users.)

• Six degrees of separation: “everyone and everything is six or fewer steps away”

Page 10: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Graph500 and Green Graph500

• New benchmarks based on graph processing (breadth-first search)

• They measure the performance and energy efficiency of irregular memory accesses

– Graph500 benchmark: TEPS score (number of traversed edges per second), measuring the performance of irregular memory accesses

– Green Graph500 benchmark: TEPS-per-Watt score, measuring power-efficient performance (power consumption in watts)

• Benchmark flow (the arithmetic is spelled out below):

1. Generation: from the input parameters SCALE and edgefactor (= 16), generate a Kronecker graph with 2^SCALE vertices and 2^SCALE × edgefactor edges by applying the recursive Kronecker product SCALE times (G1, G2, G3, G4, …)

2. Construction: build the graph data structure

3. BFS × 64: run 64 BFS iterations, each followed by validation; the reported score is the median of the 64 TEPS values
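In LaTeX notation, the problem size and the score follow directly from these parameters (the traversed-edge count and BFS time are measured per run; the reported score is the median over the 64 runs):

\[
  n = 2^{\mathrm{SCALE}}, \qquad
  m = \mathrm{edgefactor} \cdot 2^{\mathrm{SCALE}}, \qquad
  \mathrm{TEPS} = \frac{m_{\mathrm{traversed}}}{t_{\mathrm{BFS}}}
\]

For example, SCALE 34 with edgefactor 16 gives n = 2^34 ≈ 17.2 billion vertices and m ≈ 275 billion edges, the problem size run on UV 300.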

Page 11: NUMA-aware Scalable Graph Traversal on SGI UV Systems

NUMA (non-uniform memory access) system

• Example: a 4-socket Xeon NUMA system with 4 CPU sockets, 8 physical cores per socket, and 2 threads per core

• Threads and data are placed on NUMA nodes (thread placement × memory placement); the distances differ: local accesses are fast, non-local accesses are slow

• Measured bandwidth in GB/s between NUMA nodes (rows: target NUMA node; columns: source NUMA node; diagonal elements = local access):

        0      1      2      3
0    24.2    3.4    3.0    3.4
1     3.3   23.9    3.5    3.0
2     3.0    3.4   24.3    3.4
3     3.5    3.0    3.4   24.2

• Local access: about 24 GB/s; remote access: about 3 GB/s

Page 12: NUMA-aware Scalable Graph Traversal on SGI UV Systems

SGI UV 2000

• UV 2000

– Single OS: SUSE Linux 11 (x86_64)

– Hypercube interconnect (NUMAlink6, 6.7 GB/s × 4)

– Up to 2,560 cores and 64 TB RAM (= 128 UV 2000 chassis × 2 sockets × 10 cores)

– ISM has two full-specification UV 2000 systems; Kyushu U. also operates one

• Hierarchical network topology

– Sockets, chassis (= 2 sockets), cubes (= 8 chassis), inner racks, and outer racks (rack = 32 nodes)

– Caveat: the OS cannot detect this interconnect hierarchy as NUMA nodes

Page 13: NUMA-aware Scalable Graph Traversal on SGI UV Systems

SGI UV 300

• UV 300

– Single OS: SUSE Linux 11 (x86_64)

– All-to-all interconnect between chassis

– Up to 1,152 logical cores and 16 TB RAM (= 8 UV 300 chassis × 4 sockets × 18 cores × 2 SMT)

• UV 300 chassis

– 4 sockets of 18-core Intel Xeon E7-8867 (Haswell), HT enabled (2-way SMT)

– 2 TB RAM (512 GB per NUMA node)

• The Kyushu U. rack holds 8 chassis connected all-to-all

Page 14: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Memory bandwidths on UV 2000 and UV 300

• Bandwidth in GB/s between NUMA nodes, measured with STREAM TRIAD

• Local access is clearly faster than remote access

– UV 2000 (64 sockets): local 33 GB/s; remote 3–7 GB/s. Each chassis has 2 sockets, and chassis are connected to each other in a hypercube topology.

– UV 300 (32 sockets): local 56 GB/s; remote 12–14 GB/s within a chassis and 6 GB/s between chassis. Each chassis has 4 sockets, and chassis are connected to each other in an all-to-all topology.

[Figure: bandwidth heat maps over memory placement × thread placement for both systems]

Page 15: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Programming cost of NUMA-aware code

• Thread and memory binding

– reduces remote accesses

– avoids thread migration

• Linux provides low-level interfaces (see the sched_setaffinity() example and the mbind() sketch below)

– sched_{set,get}affinity(): binds a thread to a processor set (specified by processor ids)

– mbind(): binds pages to a NUMA node set (specified by NUMA node ids)

– Linux exposes processor ids and NUMA node ids as system files: /proc/cpuinfo and /sys/devices/system/{node,cpu}/

• ULIBC reduces this programming cost

– provides several APIs for NUMA-aware programming

– available at https://bitbucket.org/yuichiro_yasui/ulibc

#define _GNU_SOURCE
#include <sched.h>

/* Bind the calling thread to the processor with the given id. */
int bind_thread(int procid) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(procid, &set);
  return sched_setaffinity((pid_t)0, sizeof(cpu_set_t), &set);
}
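The example above covers only the thread-binding half. As a companion, here is a minimal sketch of the page-binding half using mbind(); the helper alloc_on_node() is our illustration, not ULIBC code, and assumes a Linux system with the libnuma headers (link with -lnuma):

#define _GNU_SOURCE
#include <numaif.h>      /* mbind, MPOL_BIND */
#include <sys/mman.h>    /* mmap, munmap */
#include <stddef.h>      /* size_t */

/* Allocate anonymous memory and bind its pages to one NUMA node
   (before first touch, so the pages are placed where requested). */
static void *alloc_on_node(size_t bytes, int node) {
  void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return NULL;
  unsigned long nodemask = 1UL << node;   /* one bit per NUMA node id */
  if (mbind(p, bytes, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0) {
    munmap(p, bytes);
    return NULL;
  }
  return p;
}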

Page 16: NUMA-aware Scalable Graph Traversal on SGI UV Systems

CPU affinity construction using ULIBC

1. Detect the entire topology, e.g., sockets 0–3 with RAM 0–3.

2. Detect the online topology, i.e., the processors and memories actually available to the process.

e.g.) numactl --cpunodebind=1,2 --membind=1,2

3. Construct two types of affinity over the online processors:

– Compact-type affinity assigns threads to positions close to each other (e.g., threads 0–3 packed onto NUMA node 0 near its local RAM).

e.g.) export ULIBC_AFFINITY=compact:fine; export OMP_NUM_THREADS=7

– Scatter-type affinity distributes the threads as evenly as possible across online processors (e.g., threads 0 and 2 on NUMA node 0, threads 1 and 3 on NUMA node 1).

e.g.) export ULIBC_AFFINITY=scatter:fine; export OMP_NUM_THREADS=7

Page 17: NUMA-aware Scalable Graph Traversal on SGI UV Systems

NUMA-aware computation with ULIBC

• ULIBC is a callable library for NUMA-aware computation

• Detects the processor topology at run time

• Constructs thread and memory affinity settings

• ULIBC is available at https://bitbucket.org/yuichiro_yasui/ulibc

#include <stdio.h>
#include <omp.h>
#include <ulibc.h>    /* include the ULIBC header file */

int main(void) {
  ULIBC_init();                                   /* initialize */
  _Pragma("omp parallel")
  {
    const int tid = ULIBC_get_thread_num();       /* get thread id */
    ULIBC_bind_thread();                          /* bind the current thread */
    const struct numainfo_t loc =
        ULIBC_get_numainfo(tid);                  /* get NUMA placement */
    printf("Thread: %2d, NUMA-node: %d, NUMA-core: %d\n",
           loc.id, loc.node, loc.core);
    /* do something */
  }
  return 0;
}

Execution log on a 4-socket system (thread id, NUMA node id, core id):

Thread:  4, NUMA-node: 0, NUMA-core: 1
Thread: 55, NUMA-node: 3, NUMA-core: 13
Thread: 16, NUMA-node: 0, NUMA-core: 4
Thread: 37, NUMA-node: 1, NUMA-core: 9
Thread: 30, NUMA-node: 2, NUMA-core: 7
. . .

Page 18: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Level-synchronized parallel BFS (top-down)

• Starts from the source vertex (level 0) and executes the following two phases at each level (a minimal code sketch follows the figure):

– Traversal phase: finds unvisited vertices adjacent to the current queue CQ and appends them to the next queue NQ

– Swap phase: exchanges CQ and NQ for the next level, with a synchronization between levels

[Figure: levels 0–3 expanding from the source; at level k, CQ holds the level-k frontier and NQ collects the newly visited level-(k+1) vertices]
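To make the two phases concrete, here is a minimal serial sketch of a level-synchronized top-down BFS on a CSR graph; the array names xadj/adjncy and the dist encoding are our illustrative assumptions, and the slides run the traversal phase in parallel:

#include <stdlib.h>

/* Level-synchronized top-down BFS; on return, dist[v] is the level of v,
   or -1 if v is unreachable from the source. */
void bfs_top_down(int n, const int *xadj, const int *adjncy,
                  int source, int *dist) {
  int *cq = malloc(n * sizeof(int));   /* current queue (frontier) */
  int *nq = malloc(n * sizeof(int));   /* next queue */
  int cq_len = 1, nq_len, level = 0;
  for (int v = 0; v < n; v++) dist[v] = -1;
  dist[source] = 0;
  cq[0] = source;
  while (cq_len > 0) {
    nq_len = 0;
    /* Traversal phase: find unvisited vertices adjacent to CQ, append to NQ. */
    for (int i = 0; i < cq_len; i++) {
      int v = cq[i];
      for (int e = xadj[v]; e < xadj[v + 1]; e++) {
        int w = adjncy[e];
        if (dist[w] < 0) {
          dist[w] = level + 1;
          nq[nq_len++] = w;
        }
      }
    }
    /* Swap phase: NQ becomes the CQ of the next level. */
    int *tmp = cq; cq = nq; nq = tmp;
    cq_len = nq_len;
    level++;
  }
  free(cq);
  free(nq);
}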

Page 19: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Direction-optimizing BFS [Beamer, SC12]

• Two directions: forward search (top-down) and backward search (bottom-up)

– The top-down direction expands the level-k frontier along outgoing edges to find level-(k+1) neighbors

– The bottom-up direction scans the candidates of neighbors and follows incoming edges back to the frontier

• Traversed edges per level for a Kronecker graph (the hybrid picks the cheaper direction at each level: top-down while the frontier is small, bottom-up once it becomes large; in this example top-down at levels 0–1, bottom-up at levels 2–5, and top-down again at level 6; a sketch of one bottom-up level follows the table):

Level   Top-down        Bottom-up       Hybrid
0                   2   2,103,840,895             2
1              66,206   1,766,587,029        66,206
2         346,918,235      52,677,691    52,677,691
3       1,727,195,615      12,820,854    12,820,854
4          29,557,400         103,184       103,184
5              82,357          21,467        21,467
6                 221          21,240           227
Total   2,103,820,036   3,936,072,360    65,689,631
Ratio         100.00%         187.09%         3.12%
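A minimal sketch of one bottom-up level over the same hypothetical CSR arrays: each unvisited vertex scans its incoming edges and stops at the first frontier parent it finds (for the undirected Kronecker graphs used here, out- and in-edges coincide):

/* One bottom-up BFS level. in_cq[v] marks level-`level` frontier vertices;
   in_nq[w] is set for vertices discovered for the next level.
   Returns the number of newly discovered vertices. */
long bfs_bottom_up_level(int n, const int *xadj, const int *adjncy,
                         const unsigned char *in_cq, unsigned char *in_nq,
                         int *dist, int *parent, int level) {
  long found = 0;
  for (int w = 0; w < n; w++) {
    if (dist[w] >= 0) continue;          /* already visited */
    for (int e = xadj[w]; e < xadj[w + 1]; e++) {
      int v = adjncy[e];
      if (in_cq[v]) {                    /* frontier parent found */
        dist[w] = level + 1;
        parent[w] = v;
        in_nq[w] = 1;
        found++;
        break;                           /* skip the rest of A(w) */
      }
    }
  }
  return found;
}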

Page 20: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Outline

• Introduction

– Graph analysis for large-scale networks

– Graph500 benchmark and Breadth-first search

– NUMA-aware computation

• Our proposal: Pruning of remote edge traversals

• Numerical results on SGI UV systems

Page 21: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Our contributions (previous work and this paper)

• Efficient graph data structure built from the input graph

– Vertex sorting [HPCS15] and adjacency list sorting [ISC14]: CSR graph sorted by outdegree

– NUMA-aware graph [BD13]: partial CSR graphs A0–A3, each bound to a NUMA node

• Efficient BFS based on Beamer’s direction-optimizing BFS (SC12)

– Top-down direction: Agarwal’s top-down (SC10) with socket-queues; this paper adds pruning of remote edges

– Bottom-up direction: NUMA-aware bottom-up [BD13] (input: CQ; data: VS_k; output: NQ_k; local accesses only)

• New results from the reduction of remote edges:

– UV 2000 with 64 sockets: 131 GTEPS → 152 GTEPS (new)

– UV 300 with 32 sockets: 219 GTEPS (new), about 16% faster than the previous highest single-node entry

Page 22: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Ours: NUMA-aware 1-D partitioned graph [BD13]

• Divides the adjacency matrix into subgraphs A0–A3 (1-D partitioning) and assigns one to each NUMA node (CPU + local RAM); a sketch of the ownership rule follows

• Each subgraph is represented as a CSR graph; the top-down direction uses the inverse of G (G is undirected)

• In the bottom-up direction (the bottleneck component), each NUMA node k computes its partial NQ_k using a locally copied frontier CQ and the locally assigned visited set VS_k, so all accesses hit local RAM

• The top-down direction is a modified version of Agarwal’s NUMA-aware BFS, in which the per-node frontier CQ_k incurs both local and remote accesses
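A minimal sketch of the 1-D ownership rule; the function names and the contiguous-range layout are our assumptions for illustration, not necessarily the paper’s exact partitioning:

/* Map vertex v to the NUMA node that owns it, assuming n vertices are
   split into num_nodes contiguous, nearly equal ranges. */
static inline int owner(long v, long n, int num_nodes) {
  return (int)(v * num_nodes / n);
}

/* First vertex of the range owned by NUMA node k (the range is
   [part_begin(k), part_begin(k + 1))). */
static inline long part_begin(int k, long n, int num_nodes) {
  return (long)k * n / num_nodes;
}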

Page 23: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Ours: Adjacency list sorting [ISC14]

• Reduces unnecessary edge traversals in the bottom-up direction: the scan of an adjacency list A(v) breaks as soon as it finds a frontier vertex, so only a prefix of length τ (the loop count) is traversed and the remaining adjacent vertices are skipped

• We therefore sort each adjacency list by the outdegree of the adjacent vertices, from high to low: high-outdegree vertices are the most likely to be in the frontier, which shortens τ (a sketch follows)
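A minimal sketch of this preprocessing step; the CSR arrays and the file-scope outdegree table used by the comparator are our illustrative assumptions:

#include <stdlib.h>

static const int *g_outdeg;   /* outdegree table visible to the comparator */

static int cmp_outdeg_desc(const void *a, const void *b) {
  int da = g_outdeg[*(const int *)a];
  int db = g_outdeg[*(const int *)b];
  return (da < db) - (da > db);          /* high outdegree first */
}

/* Sort each CSR adjacency list in descending order of neighbor outdegree,
   so the bottom-up scan tends to hit a frontier vertex after a few steps. */
void sort_adjacency_lists(int n, const int *xadj, int *adjncy,
                          const int *outdeg) {
  g_outdeg = outdeg;
  for (int v = 0; v < n; v++)
    qsort(&adjncy[xadj[v]], (size_t)(xadj[v + 1] - xadj[v]),
          sizeof(int), cmp_outdeg_desc);
}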

Page 24: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Ours: Vertex sorting [HPCS15]

• The number of times a vertex is traversed equals its outdegree, so access frequency and outdegree are correlated, and the degree distribution is highly skewed

• Our vertex sorting reorders the original vertex indices by outdegree: the highest-outdegree vertex receives the smallest index, so most accesses concentrate on small-index vertices and locality improves (a sketch follows)
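A minimal sketch of computing the relabeling; the helper and array names are our illustrative assumptions:

#include <stdlib.h>

static const int *g_deg;      /* outdegree table for the comparator */

static int cmp_deg_desc(const void *a, const void *b) {
  int da = g_deg[*(const int *)a], db = g_deg[*(const int *)b];
  return (da < db) - (da > db);
}

/* Build perm with perm[old_id] = new_id, where new ids are assigned in
   descending outdegree order: the highest-outdegree vertex becomes
   vertex 0, concentrating frequent accesses on small indices. */
void vertex_sort_permutation(int n, const int *outdeg, int *perm) {
  int *order = malloc(n * sizeof(int));  /* order[rank] = old id */
  for (int v = 0; v < n; v++) order[v] = v;
  g_deg = outdeg;
  qsort(order, n, sizeof(int), cmp_deg_desc);
  for (int rank = 0; rank < n; rank++)
    perm[order[rank]] = rank;            /* invert: old id -> new id */
  free(order);
}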

Page 25: NUMA-aware Scalable Graph Traversal on SGI UV Systems

NUMA-aware top-down BFS

• The original version was proposed by Agarwal [Agarwal-SC10]

• Reduces random remote accesses by using socket-queues; e.g., focused on NUMA node 2 of a 4-node system, Local : Remote = 1 : ℓ on ℓ sockets

• Phase 1 (CQ → NQ or socket-queue): each node scans its CQ; an unvisited local vertex is appended to NQ (checked against VS), while a remote edge is appended to the owner’s socket-queue; then synchronize

• Phase 2 (socket-queue → NQ): each node drains its own socket-queue, appending still-unvisited vertices to NQ; then synchronize and swap CQ and NQ for the next level (a sketch of phase 2 follows)
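A minimal sketch of phase 2 on one NUMA node; the edge type, the bitmap layout of the visited set VS, and the GCC atomic builtin are our illustrative assumptions:

typedef struct { int v, w; } edge_t;

/* Drain this node's socket-queue: claim each still-unvisited w atomically,
   record its parent, and append it to the local next queue NQ. */
void drain_socket_queue(const edge_t *sq, long sq_len,
                        unsigned long *visited,  /* shared bitmap VS */
                        int *parent, int *nq, long *nq_len) {
  for (long i = 0; i < sq_len; i++) {
    int v = sq[i].v, w = sq[i].w;
    unsigned long mask = 1UL << (w & 63);
    unsigned long old = __sync_fetch_and_or(&visited[w >> 6], mask);
    if (!(old & mask)) {                 /* w was unvisited: claim it */
      parent[w] = v;
      nq[(*nq_len)++] = w;
    }
  }
}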

Page 26: NUMA-aware Scalable Graph Traversal on SGI UV Systems

NUMA-aware top-down with pruning of remote edges

• This paper prunes remote edge traversals to reduce remote accesses; e.g., focused on remote edge traversals at NUMA node 2:

– w/o pruning (the original, from Agarwal’s SC10 paper): each NUMA node appends every remote edge (v, w) to the corresponding socket-queue

– with pruning (this paper): each NUMA node appends a remote edge (v, w) to the corresponding socket-queue only if the bitmap F does not yet contain w, and then adds w to F, so each vertex is sent to a remote socket-queue at most once (a sketch of the pruning test follows)

• F reuses the CQ bitmap of the bottom-up direction (in the top-down direction, CQ itself is a vector queue), and F is not re-initialized while the search direction does not change
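A minimal sketch of the pruning test, reusing the same atomic test-and-set idea as the phase-2 sketch above, now applied to the bitmap F (a hypothetical helper, not the paper’s exact code):

/* Atomic test-and-set on the pruning bitmap F: returns nonzero if w was
   already recorded, in which case the remote edge (v, w) is pruned. */
static inline int F_test_and_set(unsigned long *F, long w) {
  unsigned long mask = 1UL << (w & 63);
  return (__sync_fetch_and_or(&F[w >> 6], mask) & mask) != 0;
}

/* Remote branch of the top-down scan (cf. Algorithm 3 on the next slide):
   if (!F_test_and_set(F, w)) push (v, w) onto SQ[owner(w)];              */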

Page 27: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Effects of pruning & updated TEPS score

• Pruning eliminates many remote edge traversals

[Paper excerpt, Figures 3 and 4: memory bandwidth (GB/s) between arbitrary two NUMA nodes on (a) UV 2000 with 64 CPU sockets and (b) UV 300 with 32 CPU sockets, and bandwidth versus the number of NUMA nodes for UV 300 (Haswell; HT, THP, local and remote), UV 2000 (Ivy Bridge), and SB4 (Sandy Bridge-EP; HT, THP); the same data as on Page 14.]

5. Numerical results (paper excerpt)

5.1 SGI UV 2000

Figure 5: Ratio of traversed edges on a NUMA node in the top-down algorithm with remote edge traversal pruning, for a Kronecker graph with SCALE 29 on a four-socket server (SB4). At each level (0–7), the edges on each of the four NUMA nodes are classified as local, pruned-remote, or remote; the total numbers of traversed edges per level are 4, 9.04 K, 221 M, 15.3 B, 1.50 B, 4.55 M, 11.1 K, and 29.

Algorithm 3: Top-down with pruning of remote traversals

Procedure NUMA-aware-Top-down(G, CQ, VS, π)
  fork  /* the i-th thread runs on the j-th core of the k-th CPU */
    (i, j, k) ← ULIBC_get_current_numainfo()
    NQ_k ← ∅
    for v ∈ CQ_k in parallel do
      for w ∈ A_k^B(v) do
        if owner(w) = k then
          if w ∉ VS (atomic) then
            π(w) ← v;  VS_k ← VS_k ∪ {w};  NQ_k ← NQ_k ∪ {w}
        else
          if w ∉ F_k (atomic) then
            F_k ← F_k ∪ {w}
            SQ_owner(w) ← SQ_owner(w) ∪ {(v, w)}
    synchronize
    for (v, w) ∈ SQ_k in parallel do
      if w ∉ VS (atomic) then
        π(w) ← v;  VS_k ← VS_k ∪ {w};  NQ_k ← NQ_k ∪ {w}
  join
  return NQ_k

Fig. 6 shows the weak-scaling performance of our previous [22] and current implementations, which collect TEPS scores with a fixed problem size of SCALE 26 and SCALE 27 per CPU socket, respectively. The previous implementation scaled up to 1,280 threads, achieving 131 GTEPS for SCALE 32 with 640 threads and 175 GTEPS for SCALE 33 with 1,280 threads. In contrast, the current implementation achieves 152 GTEPS for SCALE 33 with 640 threads: scalability is improved as a result of pruning remote edge traversals in the top-down direction. However, we only have results for a maximum of 640 threads. The performance gap between the previous (131 GTEPS) and current (152 GTEPS) implementations is 15.8% (= 152/131) on one rack of UV 2000 with 640 threads.

(The level regions of Figure 5 are annotated Top-down → Bottom-up → Top-down.)

Figure 6: Weak scaling on UV 2000. HPCS15-SG (SCALE 26 per NUMA node): 7.7, 15.3, 24.2, 42.1, 59.4, 94.8, 131.4, and 174.7 GTEPS for 1–128 sockets; this paper (SCALE 27 per NUMA node): 8.3, 14.2, 25.1, 38.6, 61.5, 91.8, and 152.2 GTEPS for 1–64 sockets.


(Slide annotations on the excerpt: Figure 5 uses SCALE 29 on the 4-socket Xeon server; Figure 6 uses UV 2000 with only 64 of its sockets.)

• However, pruning may not be effective on a few sockets, because the algorithm switches to the bottom-up direction at the middle levels

– Previous direction sequence: Top-down → Bottom-up

– This paper: Top-down → Bottom-up → Top-down

• On many sockets, the TEPS score is updated: on UV 2000 with 64 sockets, 131 GTEPS without pruning vs. 152 GTEPS with pruning

Page 28: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Outline

• Introduction

– Graph analysis for large-scale networks

– Graph500 benchmark and Breadth-first search

– NUMA-aware computation

• Our proposal: Pruning of remote edge traversals

• Numerical results on SGI UV systems

Page 29: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Weak scaling performance

• UV 300 clearly outperforms UV 2000.

[Figure: GTEPS versus number of sockets (1–64), weak scaling]

– UV 2000 (SCALE 27 per socket): 8.3, 14.2, 25.1, 38.6, 61.5, 91.8, 152.2 GTEPS for 1–64 sockets

– UV 300 (HT, Remote mode, SCALE 27 per socket): 16.3, 29.2, 53.5, 83.9, 129.4, 171.0 GTEPS for 1–32 sockets

– UV 300 (HT, Remote mode, SCALE 29 per socket): 91.4, 147.6, 203.7 GTEPS for 8–32 sockets

– UV 300 (HT and THP, Remote mode, SCALE 29 per socket): 15.9, 28.0, 57.9, 93.5, 151.4, 209.3 GTEPS for 1–32 sockets

– UV 300 (HT and THP, Local mode, SCALE 29 per socket): 18.7, 32.5, 64.7, 100.3, 161.5, 219.4 GTEPS for 1–32 sockets

The UV 300 configurations at 32 sockets are compared on the next slide.

Page 30: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Breakdown of system configuration on UV 300

• UV 300 is about 2x faster than UV 2000 with the same number of sockets (32) and #threads per socket = #logical cores

• The best performance on UV 300 is obtained with

– a larger problem size

– THP (transparent hugepage) enabled

– the memory reference mode set to local

– HT (Hyperthreading) enabled

System    #sockets  SCALE  HT   THP  Mem-ref mode  GTEPS
UV 2000         32     32   −     −             −     92
UV 300          32     32   ✓     −        Remote    171
UV 300          32     34   ✓     −        Remote    204
UV 300          32     34   ✓     ✓        Remote    209
UV 300          32     34  *1     ✓         Local    188
UV 300          32     34   ✓     ✓         Local    219

*1: uses the same number of threads as physical cores, i.e., emulates “Hyperthreading disabled”.

Step-by-step gains on UV 300 (32 sockets):

– +19.3% by using a larger memory space (SCALE 32 → 34: 171 → 204 GTEPS)

– +2.5% by enabling THP (204 → 209 GTEPS)

– +4.8% by the local memory reference mode (209 → 219 GTEPS)

– +16.5% by enabling HT (188 → 219 GTEPS)

– overall, a 28.3% performance gap between the first and the best UV 300 configuration (171.0 → 219.4 GTEPS)

[Figure: the weak-scaling plot from the previous slide, annotated with the gains above.]

Page 31: NUMA-aware Scalable Graph Traversal on SGI UV Systems

New results and the Nov. 2015 Graph500 list

• Updated fastest single-node entry (ours): SCALE 34, 219 GTEPS on SGI UV 300 (1 node / 576 cores), with HT enabled, THP enabled, and local-reference mode

• Our entries on the Nov. 2015 list, previously the fastest single-node results:

– SGI UV 2000 (1,280 cores): SCALE 33, 174.7 GTEPS

– SGI UV 2000 (640 cores): SCALE 33, 149.8 GTEPS

Page 32: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Bandwidth and TEPS

• Bandwidth and TEPS of our implementations on three systems

– GB/s: STREAM TRIAD with 10 M elements per socket

– TEPS: SCALE 27 (n = 134 M, m = 2.15 B) per socket


5.2 SGI UV 300 (paper excerpt)

In this study, we obtained new results on SGI UV 300, which has 32 CPU sockets and 16 TB of memory. Fig. 7 depicts TEPS versus the number of CPU sockets (NUMA nodes), and Table 8 shows the TEPS obtained for 32 CPU sockets. We discuss the results with the following parameters:

• Hyperthreading (HT): {enabled, disabled}
• Transparent hugepage (THP): {enabled, disabled}
• Priority mode for memory access: {local, remote}

For example, “(HT, THP, local)” means that Hyperthreading and transparent hugepage are enabled and the priority mode is set to local memory; in Table 8 a check mark (✓) indicates that a parameter is enabled. First, UV 300 is faster than UV 2000 for large problem sizes. Second, Hyperthreading and transparent hugepage improved the performance by 16.49% (= 219/188) and 4.78% (= 219/209), respectively. Third, our implementation applies several techniques that improve the locality of memory access, making it well suited to the local priority mode. Ultimately, the best performance obtained was 219 GTEPS for SCALE 34 with the configuration (HT, THP, Local).

Figure 7: Weak scaling on UV 300 (GTEPS versus number of CPU sockets). UV 300 (HT, THP, Local; SCALE 29 per socket): 18.7, 32.5, 64.7, 100.3, 161.5, 219.4 GTEPS for 1–32 sockets; UV 2000 (SCALE 27 per socket): 8.3, 14.2, 25.1, 38.6, 61.5, 91.8, 152.2 GTEPS for 1–64 sockets; additional series for UV 300 (HT, THP, Remote) and (HT, Remote) at SCALE 29 and SCALE 27 per socket.

Table 8: GTEPS on UV 300 with 32 CPU sockets

System    SCALE  HT   THP  Mode    GTEPS
UV 2000      32   −     −  −          92
UV 300       32   ✓     −  Remote    171
UV 300       34   ✓     −  Remote    204
UV 300       34   ✓     ✓  Remote    209
UV 300       34  *1     ✓  Local     188
UV 300       34   ✓     ✓  Local     219

*1: uses the same number of threads as physical cores.

5.3 STREAM and Graph500 benchmarks (paper excerpt)

Finally, the correlation between memory bandwidth (bytes per second) in the STREAM benchmark and graph traversal performance (TEPS) is depicted in Fig. 8. Each line plots pairs of {memory bandwidth (GB/s) of the STREAM TRIAD operation with 10^7 elements per CPU socket, Graph500 score (GTEPS) at SCALE 27 per CPU socket} for 1, 2, 4, 8, 16, and 32 CPU sockets. We obtained the memory bandwidth score with a modified implementation using ULIBC, in which each thread computes its partial TRIAD operation on vectors in local memory only, as described in subsection 3.2. The figure shows the correlation between memory bandwidth and graph traversal performance: the optimized Graph500 implementation and our previous implementation scale like the memory bandwidth, whereas the Graph500 reference code does not scale and cannot exploit the NUMA system efficiently.

Figure 8: TEPS versus memory bandwidth (GB/s). (a) GTEPS for UV 300 (Haswell; HT, THP, Local), UV 2000 (Ivy Bridge), and SB4 (Sandy Bridge-EP; HT, THP), each for this paper and for BD13; (b) MTEPS for the Graph500 reference code on SB4 (HT, THP).

• The plotted series cover our previous implementation [BD13] and this paper on three Xeon systems:

– UV 300: 32-socket Haswell

– UV 2000: 64-socket Ivy Bridge

– SB4: 4-socket Sandy Bridge-EP

• Bandwidth and GTEPS are correlated on all three Xeon systems

Page 33: NUMA-aware Scalable Graph Traversal on SGI UV Systems

Conclusion

Motivations

• NUMA / cc-NUMA architecture, and the graph algorithm BFS

• Efficient NUMA-aware BFS algorithm

– NUMA-awareness improves the locality of memory accesses (fast local vs. slow remote accesses on a many-socket system that represents many relationships by a graph structure)

– exploits multithreading on many-socket systems (SGI UV 2000, UV 300)

Contributions

• NUMA-aware scalable BFS algorithm, with pruning of edge traversals to reduce remote edges

– scales to more than a thousand threads on SGI UV 2000 and SGI UV 300

– updated the highest single-node score to 219 GTEPS on SGI UV 300 with 32 sockets

• “ULIBC”: a callable library for NUMA-aware computation

– available at https://bitbucket.org/yuichiro_yasui/ulibc

Page 34: NUMA-aware Scalable Graph Traversal on SGI UV Systems

References

NUMA-aware BFS algorithm:

• [BD13] Y. Yasui, K. Fujisawa, and K. Goto: NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System, IEEE BigData 2013.

• [ISC14] Y. Yasui, K. Fujisawa, and Y. Sato: Fast and Energy-efficient Breadth-first Search on a Single NUMA System, ISC 2014.

• [HPCS15] Y. Yasui and K. Fujisawa: Fast and Scalable NUMA-based Thread Parallel Breadth-first Search, HPCS 2015, ACM/IEEE/IFIP, 2015.

Other results of our Graph500 team:

• [GraphCREST2015] K. Fujisawa, T. Suzumura, H. Sato, K. Ueno, Y. Yasui, K. Iwabuchi, and T. Endo: Advanced Computing & Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers, in Proceedings of Optimization in the Real World: Toward Solving Real-World Optimization Problems, Springer, 2015.