Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU
Thorsten Blaß and Michael Philippsen
Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU • Blaß and Philippsen • Slide 2
Motivation (I)
Your task: write a static graph algorithm on the GPU.
Preparation: check literature
□ several data structures to represent graphs.
□ all papers pick the CRS data structure and focus on optimizing algorithmic aspects.
You wonder:
□ Is data structure irrelevant for performance?
□ Is it right to follow the crowd?
This paper: Graph representation matters!
Motivation (II)
§ Adequate data structure speeds up graph algorithms.
[Figure: normalized file-to-file runtime w.r.t. best selection (y-axis, 1.0 to 1.6) for the benchmarks BFS, MST, and SSSP. Higher = slower.]
□ With fastest data structure: our improvements, i.e., the fastest runs when choosing the adequate data structure.
□ LonestarGPU codes with CRS data structure: the available LonestarGPU benchmark implementations.
Study Setup
§ What is the adequate data structure?
§ A total of 754,000 measurements:
□ 10 state-of-the-art static graph algorithms (do not modify the graph).
□ 19 input graphs.
□ 4 architecturally different Nvidia GPUs.
□ 10 graph data structures.
□ 3 widely used graph exchange formats.
Study Setup – Benchmark Graph Algorithms
§ 10 state-of-the-art implementations from recent research and benchmark suites with different characteristics.
Study Setup – Input Graphs
§ 19 input graphs with different characteristics.
Study Setup – GPUs
§ 4 architecturally different Nvidia GPUs
□ Titan XP, 12.2 GB, Pascal
□ GeForce GTX 980, 4.0 GB, Maxwell
□ GeForce GTX 680, 2.0 GB, Kepler
□ GeForce GTX 580, 1.0 GB, Fermi
§ Only Titan XP can run all measurement configurations.
§ All measurements scale w.r.t. nominal GPU performance.
→ Suffices to discuss the Titan XP measurements.
Decision Tree (file-to-file)
Edge-Centric?
  Y: Best data structure: COO/EL. Best exchg. format: GR/MTX [symm].
  N: Touch Count < 5?
    Y: Best data structure: C?S. Best exchg. format: RB [symm].
    N: Excess Rate < 20%?
      Y: Best data structure: ELL, HYBavg. Best exchg. format: RB [symm].
      N: Best data structure: ELL, C?S. Best exchg. format: RB [symm].
Decision Tree – First Layer (I)
[Figure: the decision tree (file-to-file) shown before; its first decision is Edge-Centric.]
Decision Tree – First Layer (II)
§ First decision depends on whether the algorithm is edge-centric or node-centric.
□ Splits data structures into two groups.
§ For edge-centric algorithms COO and EL perform best.
□ The other formats are 25% slower in our measurements.
§ COOrdinate format
□ Stores source, target, and weight of an edge in different arrays.
§ Edge List format
□ Stores an edge as a struct in an array.
[Figure: example graph with 7 nodes (0 to 6) and 7 weighted edges.]
COO:
src = 0 1 3 3 3 4 5
trgt = 1 3 0 2 5 3 6
weights = 5 3 2 6 2 1 4
EL:
edges = (0, 1, 5), (1, 3, 3), (3, 0, 2), ...
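The two layouts can be sketched in Python with the example graph from this slide. This is a minimal illustration only: the names `Edge`, `coo_to_el`, and `el_to_coo` are hypothetical and not taken from the talk, and on a GPU the arrays would live in device memory.

```python
from typing import List, NamedTuple, Tuple

# COO: one array per edge attribute.
src = [0, 1, 3, 3, 3, 4, 5]
trgt = [1, 3, 0, 2, 5, 3, 6]
weights = [5, 3, 2, 6, 2, 1, 4]


class Edge(NamedTuple):
    """EL: one struct-like record per edge (hypothetical name)."""
    src: int
    trgt: int
    weight: int


def coo_to_el(src: List[int], trgt: List[int], weights: List[int]) -> List[Edge]:
    """Zip the three COO arrays into one array of edge records."""
    return [Edge(s, t, w) for s, t, w in zip(src, trgt, weights)]


def el_to_coo(edges: List[Edge]) -> Tuple[List[int], List[int], List[int]]:
    """Split an edge list back into the three COO arrays."""
    return ([e.src for e in edges],
            [e.trgt for e in edges],
            [e.weight for e in edges])


edges = coo_to_el(src, trgt, weights)
```

Both layouts hold exactly the same information; they differ only in whether the attributes of one edge are adjacent in memory (EL) or spread over three arrays (COO).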
Decision Tree – First Layer (III)
§ COO and EL show the same runtime behavior.
□ Either COO or EL can be picked for edge-centric algorithms.
□ Runtime can be minimized with MTX/GR exchange format.
[Figure: normalized file-to-file execution time w.r.t. the average of the fastest combination of input graph and exchange format per data structure and benchmark (y-axis, 0.95 to 1.35). Benchmarks: APSP, BFS, MIS, MST, PR-n, PR-e, SpMV, SSSP, WCC-n, WCC-e. Legend: COO, ELL; RBsymm; RB; MTX, MTXsymm, GR, GRsymm. Higher = slower.]
Decision Tree – First Layer (IV)
§ The MTX and GR input file formats are dumps of the COO and EL formats.
§ No costly conversion from file to representation in CPU memory.
§ The RB format is 20% slower in our measurements.
MTX file:
<MTX header>
7 7 7
0 1 5
1 3 3
3 0 2
3 2 6
3 5 2
4 3 1
5 6 4
GR file:
<GR header>
p sp 7 7 7
a 0 1 5
a 1 3 3
a 3 0 2
a 3 2 6
a 3 5 2
a 4 3 1
a 5 6 4
In both files, the first line after the header gives the size of the array(s); each following line encodes one edge.
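Because the files are plain dumps, reading them into COO arrays needs no restructuring. The sketch below parses the simplified listings shown on this slide. It is an assumption-laden illustration, not a parser for the complete MTX or GR specifications (real files carry full headers and comments, which this hypothetical `parse_edge_dump` ignores).

```python
def parse_edge_dump(text):
    """Parse the simplified MTX/GR listing from the slide into COO arrays.

    Assumptions: lines starting with '<' are header placeholders, the first
    remaining line holds the array sizes, and every following line encodes
    one edge as 'src trgt weight' (GR lines carry a leading 'a').
    """
    sizes = None
    src, trgt, weights = [], [], []
    for line in text.strip().splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("<"):
            continue  # skip blank lines and the header placeholder
        if sizes is None:
            # Size line: '7 7 7' (MTX) or 'p sp 7 7 7' (GR).
            sizes = [int(t) for t in tokens if t.isdigit()]
            continue
        if tokens[0] == "a":
            tokens = tokens[1:]  # drop the GR edge marker
        s, t, w = (int(x) for x in tokens)
        src.append(s)
        trgt.append(t)
        weights.append(w)
    return sizes, src, trgt, weights


mtx_text = """<MTX header>
7 7 7
0 1 5
1 3 3
3 0 2
3 2 6
3 5 2
4 3 1
5 6 4"""

sizes, src, trgt, weights = parse_edge_dump(mtx_text)
```

Each file line maps to one array position, which is why no costly conversion is needed for COO and EL.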
Decision Tree – Second Layer (I)
§ Node-centric algorithms need a deeper analysis.
§ Compressed Row Storage format
□ Most memory efficient format.
□ Indirect addressing step necessary.
§ Compressed Column Storage format
□ Same as CRS but stores the predecessors of a node.
[Figure: example graph with 7 nodes (0 to 6) and 7 weighted edges.]
CRS:
src (offsets) = 0 1 2 2 5 6 7 7
succ = 1 3 0 2 5 3 6
weights = 5 3 2 6 2 1 4
CCS:
src (offsets) = 0 1 2 3 5 5 6 7
pred = 3 0 3 1 4 3 5
weights = 2 5 6 3 1 2 4
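The compressed formats can be derived from the COO arrays of the example graph with a counting pass and a prefix sum. The following Python sketch is one possible construction, not necessarily the one used in the study; `coo_to_crs`, `coo_to_ccs`, and `successors` are hypothetical names.

```python
def coo_to_crs(num_nodes, src, trgt, weights):
    """Build CRS (offsets, successors, weights) from COO arrays."""
    # Count the outgoing edges per node, shifted by one position.
    offsets = [0] * (num_nodes + 1)
    for s in src:
        offsets[s + 1] += 1
    # A prefix sum turns the counts into start offsets.
    for v in range(num_nodes):
        offsets[v + 1] += offsets[v]
    succ = [0] * len(src)
    w_out = [0] * len(src)
    cursor = list(offsets[:-1])  # next free slot per node
    for s, t, w in zip(src, trgt, weights):
        succ[cursor[s]] = t
        w_out[cursor[s]] = w
        cursor[s] += 1
    return offsets, succ, w_out


def coo_to_ccs(num_nodes, src, trgt, weights):
    """CCS is CRS of the reversed graph: it stores predecessors."""
    return coo_to_crs(num_nodes, trgt, src, weights)


def successors(offsets, succ, node):
    """The indirect addressing step: look up the offsets first, then slice."""
    return succ[offsets[node]:offsets[node + 1]]


src = [0, 1, 3, 3, 3, 4, 5]
trgt = [1, 3, 0, 2, 5, 3, 6]
weights = [5, 3, 2, 6, 2, 1, 4]
crs = coo_to_crs(7, src, trgt, weights)
ccs = coo_to_ccs(7, src, trgt, weights)
```

The offsets array has one entry more than there are nodes, so the neighbors of the last node can be sliced without a special case.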
Decision Tree – Second Layer (II)
§ ELL format
□ Fixed row length reserved for successors.
□ Well suited for vector processors.
□ Memory efficiency depends on degree distribution of a graph.
§ HYBrid format
□ Better memory efficiency than ELL.
□ Heuristics (avg, dstr) find a smaller row length.
□ Additional successors are stored in EL.
[Figure: example graph with 7 nodes (0 to 6) and 7 weighted edges.]
ELL (one row per node, * = unused slot):
node | succ  | weights
0    | 1 * * | 5 * *
1    | 3 * * | 3 * *
2    | * * * | * * *
3    | 0 2 5 | 2 6 2
4    | 3 * * | 1 * *
5    | 6 * * | 4 * *
6    | * * * | * * *
HYB:
ell (successors) = 1 3 * 0 3 6 *
ell (weights) = 5 3 * 2 1 4 *
el = (3, 2, 6), (3, 5, 2)
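The relation between ELL and HYB can be sketched in Python: HYB is ELL with a shorter row length plus an edge list for the overflow. The slides do not spell out how the avg and dstr heuristics pick the row length, so the sketch takes it as a parameter; `to_hyb`, `to_ell`, and `PAD` are hypothetical names.

```python
PAD = None  # marks an unused slot ('*' on the slide)


def adjacency(num_nodes, src, trgt, weights):
    """Group (successor, weight) pairs by source node."""
    rows = [[] for _ in range(num_nodes)]
    for s, t, w in zip(src, trgt, weights):
        rows[s].append((t, w))
    return rows


def to_hyb(rows, width):
    """First `width` successors of each node go to ELL, the rest to EL."""
    ell_succ, ell_weights, el = [], [], []
    for node, row in enumerate(rows):
        head, tail = row[:width], row[width:]
        pad = [PAD] * (width - len(head))
        ell_succ.append([t for t, _ in head] + pad)
        ell_weights.append([w for _, w in head] + pad)
        el.extend((node, t, w) for t, w in tail)
    return ell_succ, ell_weights, el


def to_ell(rows):
    """Plain ELL: the row length is the maximum out-degree, so EL stays empty."""
    width = max((len(r) for r in rows), default=0)
    ell_succ, ell_weights, _ = to_hyb(rows, width)
    return ell_succ, ell_weights


rows = adjacency(7, [0, 1, 3, 3, 3, 4, 5], [1, 3, 0, 2, 5, 3, 6],
                 [5, 3, 2, 6, 2, 1, 4])
ell_succ, ell_weights = to_ell(rows)
hyb_succ, hyb_weights, hyb_el = to_hyb(rows, width=1)
```

With a row length of 1, the example graph needs 7 ELL slots plus 2 EL entries instead of the 21 slots of plain ELL, which is the memory saving the HYBrid format aims for.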
Decision Tree – Second Layer (III)
§ Data structures cause costs for construction, shipping, ...
[Figure: normalized file-to-file execution time w.r.t. best data structure (y-axis, 1.0 to 2.0) over touch count (x-axis, 0 to 30) for C?S (RB), ELL (RB), and HYB_avg (RB). Higher = slower.]
§ Touch count splits the measurements into two groups.
□ For a touch count below 5, CRS or CCS perform best, followed by ELL and HYBavg.
□ The other group will be discussed later.
Decision Tree – Second Layer (IV)
§ The Rutherford Boeing (RB) format is a dump of the CRS format.
§ Other file formats are 40% slower in our measurements.
RB file:
8 2 2 2
iga 7 7 7
11I4 11I3
0 1 2 2
5 6 7
1 3 0 2
5 3 6
5 3 2 6
2 1 4
The first lines hold information about the structure of the encoded graph and the file, followed by the src (offsets), succ/pred, and weights arrays.
Decision Tree – Third Layer (I)
[Figure: normalized file-to-file execution time w.r.t. best data structure (y-axis, 1.0 to 2.0) over touch count (x-axis, 0 to 30) for C?S (RB), ELL (RB), and HYB_avg (RB). Higher = slower.]
No best single data structure.
Inconclusive! Performance also depends on input graphs.
Decision Tree – Third Layer (II)
[Figure: per input graph, the normalized file-to-file execution time w.r.t. best data structure (1.0 to 2.6) and the percentage of nodes with higher degree than the average (excess, 0 to 60). Series: ELL (RB), avg. ELL (RB), ELL (RB_sym), C?S (RB), avg. C?S (RB), C?S (RB_sym), HYB_avg (RB), avg. HYB_avg (RB), HYB_avg (RB_sym), excess. Input graphs: road_BAY, road_CAL, road_COL, road_CTR, road_FLA, road_USA, msdoor, pwtk, rgg_n_2_24, amazon-08, coAuthors, dblp-10, hollywood-09, indochina-04, it-04, kron_g500, ljournal-08, twitter-retweed, soc-orkut. Shaded = ELL too large.]
Graphs are sorted in ascending order by their excess rate.
If ELL fits into memory it is always the fastest choice.
Otherwise: Excess Rate < 20%?
  Y: HYBavg faster
  N: C?S faster
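The excess rate is the percentage of nodes with a higher degree than the average. A minimal Python sketch of that computation follows. Whether the study counts out-degree, in-degree, or both is not stated on the slides, so the hypothetical `excess_rate` function simply takes a list of degrees and treats "higher" as strictly greater.

```python
def excess_rate(degrees):
    """Percentage of nodes whose degree is strictly above the average degree."""
    if not degrees:
        return 0.0
    average = sum(degrees) / len(degrees)
    above = sum(1 for d in degrees if d > average)
    return 100.0 * above / len(degrees)


# Out-degrees of the 7-node example graph used on the earlier slides.
example_degrees = [1, 1, 0, 3, 1, 1, 0]
rate = excess_rate(example_degrees)
```

In the example graph only node 3 has more successors than the average of one, so one of seven nodes exceeds the average and the graph falls below the 20% threshold.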
Decision Tree (file-to-file)
[Figure: the complete decision tree (file-to-file), annotated with the two definitions below.]
§ Touch count: how often the algorithm touches every node (on average).
§ Excess rate: percentage of nodes that have an above-average degree.
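The file-to-file decision tree can be written as a small Python function. This is a sketch of the tree as presented in the talk, with the thresholds of 5 and 20% taken from the slides; the function name, the parameter names, and the explicit `ell_fits_in_memory` flag are assumptions made for illustration.

```python
def select_representation(edge_centric, touch_count, excess_rate,
                          ell_fits_in_memory):
    """Return (best data structure, best exchange format), file-to-file.

    excess_rate is given in percent. 'C?S' stands for CRS or CCS.
    """
    if edge_centric:
        return "COO/EL", "GR/MTX[symm]"
    if touch_count < 5:
        return "C?S", "RB[symm]"
    if ell_fits_in_memory:
        # ELL is the fastest choice whenever it fits into memory.
        return "ELL", "RB[symm]"
    if excess_rate < 20.0:
        return "HYBavg", "RB[symm]"
    return "C?S", "RB[symm]"


choice = select_representation(edge_centric=False, touch_count=12,
                               excess_rate=35.0, ell_fits_in_memory=False)
```

The thresholds are approximate, so measurements close to a touch count of 5 or an excess rate of 20% may favor either branch.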
Decision Tree – Kernel only
Edge-Centric?
  Y: Best data structure: COO/EL
  N: Best data structure: ELL, HYBdstr
All other entries of the tree: n.a.
Kernel-only results
§ Without I/O, data structure construction, and CPU ↔ GPU shipping.
[Figure: per input graph, the normalized kernel runtime w.r.t. best data structure (1 to 2.8) and the percentage of nodes with higher degree than the average (excess, 0 to 60). Series: ELL, avg. ELL, CRS, avg. CRS, HYB_avg, avg. HYB_avg, HYB_dstr, avg. HYB_dstr, excess. Input graphs: road_BAY, road_CAL, road_COL, road_CTR, road_FLA, road_USA, msdoor, pwtk, rgg_n_2_24, amazon-08, coAuthors, dblp-10, hollywood-09, indochina-04, it-04, kron_g500, ljournal-08, twitter-retweed, soc-orkut. Shaded = ELL too large.]
ELL remains the fastest graph structure if it fits into memory.
Otherwise HYBdstr is the best choice.
Conclusion
§ There is no single best graph data structure for GPUs.
§ Three decision layers:
□ Edge-centric vs. node-centric
□ Touch count threshold around 5
□ Excess rate threshold around 20%
§ Our Decision Tree can help developers to pick an adequate data structure for their use case.
§ Choosing the adequate data structure can speed up graph algorithms by up to 45%.
Thank you for your attention! Any questions?