![Page 1: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/1.jpg)
Design Considerations for a GraphBLASCompliant Graph Library on Clusters of GPUs
July 11, 2016
Carl Yang, University of California, DavisAydın Buluç, Lawrence Berkeley National LaboratoryJohn D. Owens, University of California, Davis
![Page 2: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/2.jpg)
GraphBLASInstead of writing a graph algorithm from scratch, users can express algorithm using a small number of GraphBLAS kernels
Hardware-agnostic:Users do not have to relearn many different APIs in order to run graph algorithms on different hardware
Efficiency:Kernels are implemented with performance guarantees
![Page 3: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/3.jpg)
Why use GPUs for graphs?Cost-efficientK20X costs ~$2000
Improving faster that traditional CPUsP100 (2016): ~8 TFLOPS, ~1TB/s, ~3800 ALUs, 32GB mem.CPU-to-GPU bandwidth: ~80GB/s-200GB/s (NVLink)Xeon Phi 7250 (2016): ~6 TFLOPS, ~400GB/s, ~288 threads, 16GB mem.CPU-to-coprocessor bandwidth: ~100GB/s (Omni-Path)
![Page 4: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/4.jpg)
Challenges with using GPUs for graph analysis
Hard to programSmall memoryCPU RAM available ~100x larger
Solved by GraphBLASAPISolved by using multi-GPU
![Page 5: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/5.jpg)
GraphBLAS kernel (mXv)Goal: Implement mXv on multiple GPUs following GraphBLASstandardSame as SpMVSame as vXm, which computes yT = xTA
Input: Given adjacency matrix A, vector xWant: Compute y = ATxCurrent GraphBLAS API:
GrB_info GrB_mXv( GrB_Vector *y, const GrB_Semiring s, const GrB_Matrix A, const GrB_vector x )
Def: Think of a semiring as an object with a “x” operator, “+” operator, and identity for each operator.
![Page 6: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/6.jpg)
Applications of SpMVSSSPPageRankMISAnd many more!
![Page 7: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/7.jpg)
0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0
AT x y BFS
x = ?
Each iteration, 1) compute y = ATx2) update BFS result3) use BFS result to filter y4) stop if BFS result is unchanged
BFS Using SpMV (Iteration 1)
10000
0-1-1-1-1
0
1
2
3
4
![Page 8: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/8.jpg)
0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0
AT x y BFS
x =
Each iteration, 1) compute y = ATx2) update BFS result3) use BFS result to filter y4) stop if BFS result is unchanged
BFS Using SpMV (Iteration 1)
10000
01000
0-1-1-1-1
0
1
2
3
4
![Page 9: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/9.jpg)
0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0
AT x y BFS
x =
Each iteration, 1) compute y = ATx new: -22) update BFS result3) use BFS result to filter y4) stop if BFS result is unchanged
BFS Using SpMV (Iteration 1)
10000
01000
01-1-1-1
0
1
2
3
4
![Page 10: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/10.jpg)
0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0
AT x y BFS
x =
Each iteration, 1) compute y = ATx2) update BFS result old: -23) use BFS result to filter y4) stop if BFS result is unchanged
BFS Using SpMV (Iteration 2)
01000
00110
01-1-1-1
0
1
2
3
4
![Page 11: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/11.jpg)
0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0
AT x y BFS
x =
Each iteration, 1) compute y = ATx new: 42) update BFS result old: -23) use BFS result to filter y4) stop if BFS result is unchanged
BFS Using SpMV (Iteration 2)
01000
00110
0122-1
0
1
2
3
4
![Page 12: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/12.jpg)
0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0
AT x y BFS
x =
Each iteration, 1) compute y = ATx2) update BFS result old: 43) use BFS result to filter y4) stop if BFS result is unchanged
BFS Using SpMV (Iteration 3)
00110
00012
0122-1
0
1
2
3
4
![Page 13: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/13.jpg)
0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0
AT x y BFS
x =
Each iteration, 1) compute y = ATx new: 82) update BFS result old: 43) use BFS result to filter y4) stop if BFS result is unchanged
BFS Using SpMV (Iteration 3)
00110
00012
01223
0
1
2
3
4
![Page 14: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/14.jpg)
0 0 0 0 01 0 0 0 00 1 0 0 00 1 1 0 00 0 1 1 0
AT x y BFS
x =
Each iteration, 1) compute y = ATx new: 82) update BFS result old: 83) use BFS result to filter y4) stop if BFS result is unchanged
BFS Using SpMV (Iteration 4)
00001
00000
01223
0
1
2
3
4
![Page 15: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/15.jpg)
Design Considerations2.How to solve multiway-merge3.Partitioning scheme
![Page 16: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/16.jpg)
AT x y
x =
• 1) Elementwise multiply• 2) Reduce• Work: O(nnz)• Independent of vector sparsity
0 0 1 0 04 0 0 0 00 0 1 2 11 4 3 0 00 5 2 0 0
1. Row-based SpMV
00214
20864
![Page 17: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/17.jpg)
Why not Row-based?cuSPARSE, Modern GPU, CUB all provide row-based SpMV GPU implementationsWhy implement our own?
![Page 18: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/18.jpg)
x = A Motivating Example
0 0 1 0 0 1 0 0 2 24 0 0 0 0 1 3 0 0 20 0 1 2 1 0 0 0 7 91 4 3 0 0 2 4 3 3 90 5 2 0 0 2 0 0 0 01 2 0 0 0 0 2 2 0 03 3 2 0 0 2 0 0 0 00 0 2 0 0 0 0 2 3 00 0 0 0 0 0 3 0 0 03 0 2 2 9 0 0 0 0 0
0000002000
0608040060
• Take scalar from vector x, and multiply it by the corresponding column from A
AT x y
![Page 19: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/19.jpg)
AT x = y3 + y4 + y5 = y
x = + + =
• 1) Gather and elementwise multiply• 2) Multiway-merge (many elementwise adds)• Work: 𝑂 ∑ 𝑑)*[)]-. + kn• Suitable for sparse vectors, due to dependence on vector
sparsity
0 0 1 0 04 0 0 0 00 0 1 2 11 4 3 0 00 5 2 0 0
Column-based SpMV
00214
20864
20264
00200
00400
![Page 20: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/20.jpg)
For GraphBLAS folksboth a row-based SpMV as well as a column-based SpMV and automatically decide which one to use?This is the idea behind direction-optimizing BFS
![Page 21: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/21.jpg)
Design Considerations1.Column-based SpMV2.How to solve multiway-merge3.Partitioning scheme
![Page 22: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/22.jpg)
2. Intuition of multiway-mergeleading to duplicates
Input: sorted segments, with possible duplicatesE.g. [4 5 6 4 5 6 3 4 5]
Want: uniqueE.g. [4 5 6 3]
0 1 2
3 4 5 6
![Page 23: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/23.jpg)
Block Diagram (Single GPU)
(+): Addition operator of semiring(x): Multiplication operator of semiring
Gather columns from AT
(x) Multiway-merge
Input: Sparse Vector xSparse Matrix AT
(+) ewiseMult
Output: Sparse Vector y
![Page 24: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/24.jpg)
(+): Addition operator of semiring(x): Multiplication operator of semiring
(x) Multiway-merge
Multiway-merge
Gather columns from AT
(x) Multiway-merge
Input: Sparse Vector xSparse Matrix AT
(+) ewiseMult
Radix Sort
(x) Segmented ReduceOutput: Sparse Vector y
![Page 25: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/25.jpg)
Radix sort works best
![Page 26: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/26.jpg)
Importance of multiway-merge
Multiway-merge takes ~85.6% of application runtimeGather and ewiseMult take 11ms
![Page 27: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/27.jpg)
For GraphBLAS folksautomatically decide which one to use?This is the idea behind direction-optimizing BFS
Should sparse vector be allowed to contain duplicate entries?This is the idea behind Gunrock’s idempotent discovery that lets them get away without solving multiway-merge completelySome duplicates are allowed
![Page 28: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/28.jpg)
Design Considerations1.Column-based SpMV2.How to solve multiway-merge3.Partitioning scheme
![Page 29: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/29.jpg)
A1 A2 A3 … Ap y(1)+y(2)+y(3)+…+y(p) y x1x2
x x3 = + + +...+ = ...xp
Each iteration:1) GPU i computes y(i) = Aixi2) Computes histogram of send counts3) Send to other GPUs using MPI_Alltoall with send counts4) Send to other GPUs using MPI_Alltoallv5) Each GPU multiway-merges piece they are responsible for
3. Partitioning scheme
![Page 30: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/30.jpg)
3. Partitioning scheme
c
A1 A2 A3 … Ap y(1)+y(2)+y(3)+…+y(p) y x1 y1x2
x x3 = + + +...+ = ...xp
GPU1Each iteration:1) GPU i computes y(i) = Aixi2) Computes histogram of send counts3) Send to other GPUs using MPI_Alltoall with send counts4) Send to other GPUs using MPI_Alltoallv5) Each GPU multiway-merges piece they are responsible for
![Page 31: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/31.jpg)
A1 A2 A3 … Ap y(1)+y(2)+y(3)+…+y(p) y x1 x2 y2
x x3 = + + +...+ = ...xp
GPU2Each iteration:1) GPU i computes y(i) = Aixi2) Computes histogram of send counts3) Send to other GPUs using MPI_Alltoall with send counts4) Send to other GPUs using MPI_Alltoallv5) Each GPU multiway-merges piece they are responsible for
3. Partitioning scheme
c
![Page 32: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/32.jpg)
3. Partitioning scheme
A1 A2 A3 … Ap y(1)+y(2)+y(3)+…+y(p) y x1 y1x2
x x3 = + + +...+ = ...xp
Each iteration:1) GPU i computes y(i) = Aixi2) Computes histogram of send counts3) Send to other GPUs using MPI_Alltoall with send counts4) Send to other GPUs using MPI_Alltoallv5) Each GPU multiway-merges piece they are responsible for
![Page 33: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/33.jpg)
3. Partitioning scheme
c
A1 A2 A3 … Ap y(1)+y(2)+y(3)+…+y(p) y x1 x2 y2
x x3 = + + +...+ = ...xp
GPU2Each iteration:1) GPU i computes y(i) = Aixi2) Computes histogram of send counts3) Send to other GPUs using MPI_Alltoall with send counts4) Send to other GPUs using MPI_Alltoallv5) Each GPU multiway-merges piece they are responsible for
![Page 34: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/34.jpg)
Block Matrix (Multi-GPU)On GPU 1, 2, …, p:
mXv (Single GPU)
Gather columns from Ai
Multiway-merge
Input: Sparse Vector xiSparse Matrix Ai
ewiseMult
Output: Sparse Vector yi
mXv (Single GPU)
MPI_Alltoallv
MPI_Alltoall
Multiway-merge
Histogram
c
c
Why 2 multiway-merges?
![Page 35: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/35.jpg)
Experimental SetuplCPUl64x 16-core AMD Opteron 6274 CPUl32 GB of main memory
lGPUl64x NVIDIA K20X GPUs l14 SMs, 192 ALUs per SM, 732 MHz, 6 GB on-board memory
lInterconnectlGemini interconnect between nodesl10.4 GB/s (HyperTransport 3.0)
![Page 36: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/36.jpg)
Strong Scaling
0.5
1
2
4
8
16
1 2 4 8 16 32 64
Billions of Edges Traversed (GTEPS)
Number of Processors
Ideal ScalingActual Scaling
![Page 37: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/37.jpg)
Weak Scaling (Vertex)
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
1 2 4 8 16 32
Billions of Edges Traversed
per Processor (GTEPS)
Number of Processors
Ideal ScalingActual Scaling
![Page 38: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/38.jpg)
For 4 GPUs, everything looks fine
Majority of runtime is compute• Two biggest iterations brought down from 121ms to 40ms
![Page 39: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/39.jpg)
Compute time > Communication
MPI calls comprise a sliver of runtime
![Page 40: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/40.jpg)
At 16 GPUs, picture becomes bleak
MPI calls take up ~50% runtime now
![Page 41: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/41.jpg)
Not overlapping compute with communication
GPUs sit idle to wait for other GPUs to finish compute
![Page 42: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/42.jpg)
Communication vs. Compute
0
1
2
3
4
5
6
7
1 2 4 8 16
Relative Cost per Processor
Number of Processors
CommunicationCompute
How to bridge this gap?
![Page 43: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/43.jpg)
Work-in-progress1.Overlap compute with communication
Use multiple cudaStreams to make communication asynchronous
2.Try other partitioning schemesVariants of 1D column-based partitioning2D partitioning
3.Partition graph betterReverse Cuthill–McKee
![Page 44: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/44.jpg)
ConclusionRow-based vs. Column-based implementation for mXvBoth have their uses
Should sparse vector tolerate duplicate entries if user does not care about intermediate sparse vectors?
![Page 45: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/45.jpg)
Acknowledgements
supportThis work is funded by:DARPA XDATA program US Army award W911QX-12-C-0059
DARPA STTR award D15PC00010Department of Energy, Office of Science, ASCR Contract No. DE-AC02-05CH11231
![Page 46: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/46.jpg)
Questions?Aydın Buluç, Lawrence Berkeley National [email protected]
John D. Owens, University of Davis, [email protected]
![Page 47: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/47.jpg)
Backup Slides
![Page 48: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/48.jpg)
Applications of Interest
• Bioinformatics• Social network analysis• Computer vision• Fraud detection• Urban planning
![Page 49: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/49.jpg)
Challenges with Graph AnalysisToo big to visualize
Software complexityStructural properties varySmall-world: Difficult to partitionImpacts scalabilityMesh:Limits scalability of synchronous graph algorithms
![Page 50: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/50.jpg)
AT x y
x =
• GPU threads compute each row simultaneously• Work: O(nnz)• Suitable for multiplication by dense vector, because work is
independent of vector sparsity• Direction-optimizing or bottom-up BFS is a special case of this
0 0 1 0 04 0 0 0 00 0 1 2 11 4 3 0 00 5 2 0 0
1. Row-based mXv (SpMV)
00214
20864
![Page 51: Design’Considerationsfor’a’ GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design’Considerationsfor’a’ GraphBLAS Compliant’Graph’Libraryon’Clustersof’GPUs](https://reader034.vdocuments.net/reader034/viewer/2022050323/5f7c48337c5cb3783b3c2f43/html5/thumbnails/51.jpg)
Dominance of Big Iterations in BFS of Scale-free Graphs
Two biggest iterations take 91% of runtime• Other six iterations take 6ms combined