Transcript
Page 1: Design Considerations for a GraphBLAS …ctcyang/pub/siam-slides2016.pdf · 2016-10-04 · Design Considerations for a GraphBLAS Compliant Graph Library on Clusters of GPUs

Design Considerations for a GraphBLAS Compliant Graph Library on Clusters of GPUs

July 11, 2016

Carl Yang, University of California, Davis
Aydın Buluç, Lawrence Berkeley National Laboratory
John D. Owens, University of California, Davis

Page 2:

GraphBLAS
Instead of writing a graph algorithm from scratch, users can express an algorithm using a small number of GraphBLAS kernels.

Hardware-agnostic: Users do not have to relearn many different APIs in order to run graph algorithms on different hardware.

Efficiency: Kernels are implemented with performance guarantees.

Page 3:

Why use GPUs for graphs?
Cost-efficient
K20X costs ~$2000

Improving faster than traditional CPUs
P100 (2016): ~8 TFLOPS, ~1 TB/s, ~3800 ALUs, 32 GB mem.
CPU-to-GPU bandwidth: ~80 GB/s to 200 GB/s (NVLink)
Xeon Phi 7250 (2016): ~6 TFLOPS, ~400 GB/s, ~288 threads, 16 GB mem.
CPU-to-coprocessor bandwidth: ~100 GB/s (Omni-Path)

Page 4:

Challenges with using GPUs for graph analysis

Hard to program: solved by the GraphBLAS API
Small memory (CPU RAM available is ~100x larger): solved by using multi-GPU

Page 5:

GraphBLAS kernel (mXv)
Goal: Implement mXv on multiple GPUs following the GraphBLAS standard
Same as SpMV
Same as vXm, which computes y^T = x^T A

Input: Given adjacency matrix A, vector x
Want: Compute y = A^T x
Current GraphBLAS API:

GrB_info GrB_mXv( GrB_Vector *y, const GrB_Semiring s, const GrB_Matrix A, const GrB_Vector x )

Def: Think of a semiring as an object with a "x" operator, a "+" operator, and an identity for each operator.
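To make the semiring definition concrete, here is a minimal Python sketch. The `Semiring` class and the `mxv` helper are hypothetical names chosen for illustration and are not the GraphBLAS C API; the function is a dense reference loop that multiplies whatever matrix it is given, so pass A^T to match the slide.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Semiring:
    """Hypothetical container: a "+" operator, a "x" operator, and their identities."""
    add: Callable[[float, float], float]
    mul: Callable[[float, float], float]
    zero: float  # identity of add
    one: float   # identity of mul (kept for completeness, unused below)


def mxv(s: Semiring, M: List[List[float]], x: List[float]) -> List[float]:
    """Dense reference version of y = M x over the semiring s."""
    y = []
    for row in M:
        acc = s.zero
        for m, xj in zip(row, x):
            acc = s.add(acc, s.mul(m, xj))
        y.append(acc)
    return y


# Ordinary arithmetic: the usual SpMV.
plus_times = Semiring(add=lambda a, b: a + b, mul=lambda a, b: a * b, zero=0, one=1)
# (min, +): the semiring commonly used for shortest-path style updates.
min_plus = Semiring(add=min, mul=lambda a, b: a + b, zero=float("inf"), one=0)

M = [[1, 2], [3, 4]]
x = [1, 1]
y_arith = mxv(plus_times, M, x)
y_minplus = mxv(min_plus, M, x)
```

Swapping the semiring changes the algorithm without changing the kernel, which is the point of passing `s` as an argument.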

Page 6:

Applications of SpMV
SSSP
PageRank
MIS
And many more!

Page 7:

BFS Using SpMV (Iteration 1)

Figure: graph with vertices 0, 1, 2, 3, 4

A^T =
0 0 0 0 0
1 0 0 0 0
0 1 0 0 0
0 1 1 0 0
0 0 1 1 0

x = [1 0 0 0 0]
y = ?
BFS = [0 -1 -1 -1 -1]

Each iteration:
1) compute y = A^T x
2) update BFS result
3) use BFS result to filter y
4) stop if BFS result is unchanged

Page 8:

BFS Using SpMV (Iteration 1)

Figure: graph with vertices 0, 1, 2, 3, 4

A^T =
0 0 0 0 0
1 0 0 0 0
0 1 0 0 0
0 1 1 0 0
0 0 1 1 0

x = [1 0 0 0 0]
y = [0 1 0 0 0]
BFS = [0 -1 -1 -1 -1]

Each iteration:
1) compute y = A^T x
2) update BFS result
3) use BFS result to filter y
4) stop if BFS result is unchanged

Page 9:

BFS Using SpMV (Iteration 1)

Figure: graph with vertices 0, 1, 2, 3, 4

A^T =
0 0 0 0 0
1 0 0 0 0
0 1 0 0 0
0 1 1 0 0
0 0 1 1 0

x = [1 0 0 0 0]
y = [0 1 0 0 0]
BFS = [0 1 -1 -1 -1]

Each iteration:
1) compute y = A^T x
2) update BFS result
3) use BFS result to filter y
4) stop if BFS result is unchanged

new: -2

Page 10:

BFS Using SpMV (Iteration 2)

Figure: graph with vertices 0, 1, 2, 3, 4

A^T =
0 0 0 0 0
1 0 0 0 0
0 1 0 0 0
0 1 1 0 0
0 0 1 1 0

x = [0 1 0 0 0]
y = [0 0 1 1 0]
BFS = [0 1 -1 -1 -1]

Each iteration:
1) compute y = A^T x
2) update BFS result
3) use BFS result to filter y
4) stop if BFS result is unchanged

old: -2

Page 11:

BFS Using SpMV (Iteration 2)

Figure: graph with vertices 0, 1, 2, 3, 4

A^T =
0 0 0 0 0
1 0 0 0 0
0 1 0 0 0
0 1 1 0 0
0 0 1 1 0

x = [0 1 0 0 0]
y = [0 0 1 1 0]
BFS = [0 1 2 2 -1]

Each iteration:
1) compute y = A^T x
2) update BFS result
3) use BFS result to filter y
4) stop if BFS result is unchanged

new: 4
old: -2

Page 12:

BFS Using SpMV (Iteration 3)

Figure: graph with vertices 0, 1, 2, 3, 4

A^T =
0 0 0 0 0
1 0 0 0 0
0 1 0 0 0
0 1 1 0 0
0 0 1 1 0

x = [0 0 1 1 0]
y = [0 0 0 1 2]
BFS = [0 1 2 2 -1]

Each iteration:
1) compute y = A^T x
2) update BFS result
3) use BFS result to filter y
4) stop if BFS result is unchanged

old: 4

Page 13:

BFS Using SpMV (Iteration 3)

Figure: graph with vertices 0, 1, 2, 3, 4

A^T =
0 0 0 0 0
1 0 0 0 0
0 1 0 0 0
0 1 1 0 0
0 0 1 1 0

x = [0 0 1 1 0]
y = [0 0 0 1 2]
BFS = [0 1 2 2 3]

Each iteration:
1) compute y = A^T x
2) update BFS result
3) use BFS result to filter y
4) stop if BFS result is unchanged

new: 8
old: 4

Page 14:

BFS Using SpMV (Iteration 4)

Figure: graph with vertices 0, 1, 2, 3, 4

A^T =
0 0 0 0 0
1 0 0 0 0
0 1 0 0 0
0 1 1 0 0
0 0 1 1 0

x = [0 0 0 0 1]
y = [0 0 0 0 0]
BFS = [0 1 2 2 3]

Each iteration:
1) compute y = A^T x
2) update BFS result
3) use BFS result to filter y
4) stop if BFS result is unchanged

new: 8
old: 8
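The four-step loop walked through on the preceding slides can be sketched in a few lines of Python. This is an illustrative dense version, not the authors' GPU code; the names `spmv` and `bfs_spmv` are hypothetical. The "new" and "old" values on the slides are consistent with the sum of the entries of the BFS vector, so the sketch uses that sum as the "unchanged" test; that reading is an inference from the numbers, not something the slides state.

```python
AT = [
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0],
]


def spmv(M, x):
    """Dense reference y = M x with ordinary arithmetic."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def bfs_spmv(AT, source):
    n = len(AT)
    depth = [-1] * n            # the "BFS" vector on the slides; -1 means unvisited
    depth[source] = 0
    x = [0] * n
    x[source] = 1               # current frontier
    level = 0
    while True:
        level += 1
        old = sum(depth)        # assumed checksum of the BFS vector before the update
        y = spmv(AT, x)         # 1) compute y = A^T x
        for v in range(n):      # 2) update BFS result
            if y[v] != 0 and depth[v] == -1:
                depth[v] = level
        # 3) filter y: keep only vertices first reached in this iteration
        x = [1 if (y[v] != 0 and depth[v] == level) else 0 for v in range(n)]
        if sum(depth) == old:   # 4) stop if BFS result is unchanged
            return depth


depths = bfs_spmv(AT, source=0)
```

Because every update replaces a -1 with a positive level, the sum strictly increases whenever the BFS vector changes, so comparing sums is enough to detect termination in this sketch.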

Page 15:

Design Considerations
1. Column-based SpMV
2. How to solve multiway-merge
3. Partitioning scheme

Page 16:

1. Row-based SpMV

A^T =
0 0 1 0 0
4 0 0 0 0
0 0 1 2 1
1 4 3 0 0
0 5 2 0 0

x = [0 0 2 1 4]
y = [2 0 8 6 4]

• 1) Elementwise multiply
• 2) Reduce
• Work: O(nnz)
• Independent of vector sparsity
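A row-based SpMV over a compressed sparse row (CSR) layout could look like the sketch below. The helper names `dense_to_csr` and `spmv_row` are hypothetical, and the loop is sequential; on a GPU each row (or group of rows) would be handled by separate threads.

```python
def dense_to_csr(M):
    """Convert a dense row-major matrix to CSR arrays (row_ptr, col_idx, vals)."""
    row_ptr, col_idx, vals = [0], [], []
    for row in M:
        for j, v in enumerate(row):
            if v != 0:
                col_idx.append(j)
                vals.append(v)
        row_ptr.append(len(vals))
    return row_ptr, col_idx, vals


def spmv_row(row_ptr, col_idx, vals, x):
    """Row-based SpMV: for each row, elementwise multiply, then reduce."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]   # touches every stored nonzero once
        y.append(acc)
    return y


AT = [
    [0, 0, 1, 0, 0],
    [4, 0, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [1, 4, 3, 0, 0],
    [0, 5, 2, 0, 0],
]
x = [0, 0, 2, 1, 4]
row_ptr, col_idx, vals = dense_to_csr(AT)
y = spmv_row(row_ptr, col_idx, vals, x)
```

Every stored nonzero is visited regardless of how many entries of `x` are zero, which is why the work is O(nnz) and independent of vector sparsity.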

Page 17:

Why not Row-based?
cuSPARSE, Modern GPU, CUB all provide row-based SpMV GPU implementations
Why implement our own?

Page 18:

A Motivating Example

A^T =
0 0 1 0 0 1 0 0 2 2
4 0 0 0 0 1 3 0 0 2
0 0 1 2 1 0 0 0 7 9
1 4 3 0 0 2 4 3 3 9
0 5 2 0 0 2 0 0 0 0
1 2 0 0 0 0 2 2 0 0
3 3 2 0 0 2 0 0 0 0
0 0 2 0 0 0 0 2 3 0
0 0 0 0 0 0 3 0 0 0
3 0 2 2 9 0 0 0 0 0

x = [0 0 0 0 0 0 2 0 0 0]
y = [0 6 0 8 0 4 0 0 6 0]

• Take scalar from vector x, and multiply it by the corresponding column from A
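The motivating example can be checked with a short Python sketch that counts how many stored nonzeros each approach would touch. The variable names are illustrative; the matrix and vector are the ones shown on the slide.

```python
M = [
    [0, 0, 1, 0, 0, 1, 0, 0, 2, 2],
    [4, 0, 0, 0, 0, 1, 3, 0, 0, 2],
    [0, 0, 1, 2, 1, 0, 0, 0, 7, 9],
    [1, 4, 3, 0, 0, 2, 4, 3, 3, 9],
    [0, 5, 2, 0, 0, 2, 0, 0, 0, 0],
    [1, 2, 0, 0, 0, 0, 2, 2, 0, 0],
    [3, 3, 2, 0, 0, 2, 0, 0, 0, 0],
    [0, 0, 2, 0, 0, 0, 0, 2, 3, 0],
    [0, 0, 0, 0, 0, 0, 3, 0, 0, 0],
    [3, 0, 2, 2, 9, 0, 0, 0, 0, 0],
]
x = [0, 0, 0, 0, 0, 0, 2, 0, 0, 0]

# Row-based: every stored nonzero of the matrix is multiplied, even by zeros of x.
nnz_total = sum(1 for row in M for v in row if v != 0)

# Column-based: only the columns selected by nonzeros of x are touched.
col_work = sum(
    1
    for j, xj in enumerate(x) if xj != 0
    for row in M if row[j] != 0
)

# Reference result: scale the selected column by the matching scalar of x.
y = [sum(v * xj for v, xj in zip(row, x)) for row in M]
```

With a single nonzero in `x`, the column-based count is the number of nonzeros in one column, while the row-based count is the number of nonzeros in the whole matrix.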

Page 19:

Column-based SpMV

A^T =
0 0 1 0 0
4 0 0 0 0
0 0 1 2 1
1 4 3 0 0
0 5 2 0 0

x = [0 0 2 1 4]

A^T x = y3 + y4 + y5 = y

y3 = [2 0 2 6 4]
y4 = [0 0 2 0 0]
y5 = [0 0 4 0 0]
y  = [2 0 8 6 4]

• 1) Gather and elementwise multiply
• 2) Multiway-merge (many elementwise adds)
• Work: $O\left(\sum_{i:\,x[i] \neq 0} d_i + kn\right)$
• Suitable for sparse vectors, due to dependence on vector sparsity
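The two column-based steps can be sketched as follows, using a compressed sparse column (CSC) layout. The names `dense_to_csc` and `spmspv_col` are hypothetical, and the merge here is a simple dictionary accumulation standing in for the multiway-merge that the later slides discuss.

```python
def dense_to_csc(M):
    """Convert a dense row-major matrix to CSC arrays (col_ptr, row_idx, vals)."""
    col_ptr, row_idx, vals = [0], [], []
    for j in range(len(M[0])):
        for i, row in enumerate(M):
            if row[j] != 0:
                row_idx.append(i)
                vals.append(row[j])
        col_ptr.append(len(vals))
    return col_ptr, row_idx, vals


def spmspv_col(col_ptr, row_idx, vals, x_idx, x_val):
    """Column-based product of a sparse matrix with a sparse vector."""
    partials = []
    for j, xj in zip(x_idx, x_val):
        # 1) gather column j and elementwise multiply it by the scalar x[j]
        partials.append(
            [(row_idx[k], vals[k] * xj) for k in range(col_ptr[j], col_ptr[j + 1])]
        )
    # 2) merge the scaled columns, adding values that share a row index
    acc = {}
    for part in partials:
        for i, v in part:
            acc[i] = acc.get(i, 0) + v
    return partials, sorted(acc.items())


AT = [
    [0, 0, 1, 0, 0],
    [4, 0, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [1, 4, 3, 0, 0],
    [0, 5, 2, 0, 0],
]
col_ptr, row_idx, vals = dense_to_csc(AT)
# Sparse x = [0 0 2 1 4] stored as (index, value) pairs.
partials, y_sparse = spmspv_col(col_ptr, row_idx, vals, x_idx=[2, 3, 4], x_val=[2, 1, 4])
```

Only the columns selected by nonzeros of `x` are read, so the amount of work follows the sparsity of the vector rather than the size of the matrix.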

Page 20:

For GraphBLAS folks
… both a row-based SpMV as well as a column-based SpMV and automatically decide which one to use?
This is the idea behind direction-optimizing BFS
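One way such an automatic choice might be made is to compare the estimated work of the two kernels. The sketch below is a simplified cost comparison, not the heuristic used by any particular direction-optimizing BFS implementation; `choose_kernel` and the tuning factor `alpha` are hypothetical.

```python
def choose_kernel(col_degrees, x_idx, nnz_total, alpha=1.0):
    """Pick "column" when the gathered columns hold fewer nonzeros than the whole matrix.

    col_degrees[j] is the number of nonzeros in column j, x_idx lists the
    nonzero positions of x, and alpha is an assumed penalty for merge overhead.
    """
    col_work = sum(col_degrees[j] for j in x_idx)
    return "column" if alpha * col_work < nnz_total else "row"


# Column nonzero counts of the 5x5 example matrix used on the SpMV slides.
col_degrees = [2, 2, 4, 1, 1]
sparse_choice = choose_kernel(col_degrees, x_idx=[2, 3, 4], nnz_total=10)
dense_choice = choose_kernel(col_degrees, x_idx=[0, 1, 2, 3, 4], nnz_total=10)
penalized_choice = choose_kernel(col_degrees, x_idx=[2, 3, 4], nnz_total=10, alpha=2.0)
```

In practice the crossover point would have to be tuned on real hardware, since the column-based path also pays for the merge.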

Page 21:

Design Considerations
1. Column-based SpMV
2. How to solve multiway-merge
3. Partitioning scheme

Page 22:

2. Intuition of multiway-merge
leading to duplicates

Input: sorted segments, with possible duplicates
E.g. [4 5 6 4 5 6 3 4 5]

Want: unique
E.g. [4 5 6 3]

Figure: graph with vertices 0, 1, 2, 3, 4, 5, 6
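As a small illustration of the input and output above, the sketch below merges sorted segments and drops duplicates using Python's standard `heapq.merge`. Splitting the flat example into three segments of three keys is an assumption, and the result comes out in sorted order rather than in the order printed on the slide.

```python
import heapq


def merge_unique(segments):
    """Multiway-merge already-sorted segments and keep one copy of each key."""
    out = []
    for key in heapq.merge(*segments):
        if not out or out[-1] != key:
            out.append(key)
    return out


# Assumed segmentation of [4 5 6 4 5 6 3 4 5] into three sorted runs.
segments = [[4, 5, 6], [4, 5, 6], [3, 4, 5]]
unique = merge_unique(segments)
```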

Page 23:

Block Diagram (Single GPU)

Input: Sparse Vector x, Sparse Matrix A^T
-> Gather columns from A^T
-> (x) ewiseMult
-> (+) Multiway-merge
-> Output: Sparse Vector y

(+): Addition operator of semiring
(x): Multiplication operator of semiring

Page 24:

Input: Sparse Vector x, Sparse Matrix A^T
-> Gather columns from A^T
-> (x) ewiseMult
-> (+) Multiway-merge, implemented as: Radix Sort -> (+) Segmented Reduce
-> Output: Sparse Vector y

(+): Addition operator of semiring
(x): Multiplication operator of semiring
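The sort-then-reduce form of multiway-merge can be sketched as below. Python's built-in stable sort stands in for the GPU radix sort, and `sort_then_segmented_reduce` is a hypothetical name; the `add` argument plays the role of the semiring's (+) operator.

```python
def sort_then_segmented_reduce(keys, vals, add=lambda a, b: a + b):
    """Merge (key, value) pairs: sort by key, then reduce each run of equal keys."""
    # Stable sort of positions by key (stand-in for a radix sort on the GPU).
    order = sorted(range(len(keys)), key=lambda k: keys[k])
    out_keys, out_vals = [], []
    for k in order:
        if out_keys and out_keys[-1] == keys[k]:
            out_vals[-1] = add(out_vals[-1], vals[k])   # same segment: reduce
        else:
            out_keys.append(keys[k])                    # new segment starts
            out_vals.append(vals[k])
    return out_keys, out_vals


# Concatenated scaled columns from the 5x5 column-based example:
# row indices as keys, products as values.
keys = [0, 2, 3, 4, 2, 2]
vals = [2, 2, 6, 4, 2, 4]
merged_keys, merged_vals = sort_then_segmented_reduce(keys, vals)
```

The output is a sparse vector with unique, sorted indices, which is the form a subsequent mXv call would take as input.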

Page 25:

Radix sort works best

Page 26:

Importance of multiway-merge

Multiway-merge takes ~85.6% of application runtime
Gather and ewiseMult take 11ms

Page 27:

For GraphBLAS folks
… automatically decide which one to use?
This is the idea behind direction-optimizing BFS

Should sparse vector be allowed to contain duplicate entries?
This is the idea behind Gunrock's idempotent discovery that lets them get away without solving multiway-merge completely
Some duplicates are allowed
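The reason BFS can tolerate duplicates is that writing the same depth to a vertex twice is harmless. The sketch below illustrates that general idea only; it is not Gunrock's implementation, and the function name and the small adjacency list are assumptions made for the example.

```python
def advance_with_duplicates(neighbors, frontier, depth, level):
    """Expand a frontier without deduplicating the output."""
    nxt = []
    for u in frontier:
        for v in neighbors[u]:
            if depth[v] == -1 or depth[v] == level:
                depth[v] = level   # idempotent: repeating this write changes nothing
                nxt.append(v)      # duplicates are allowed in the next frontier
    return nxt


# Hypothetical graph: vertices 0, 1, 2 are at depth 0 and point to vertices 3..6.
neighbors = {0: [4, 5, 6], 1: [4, 5, 6], 2: [3, 4, 5], 3: [], 4: [], 5: [], 6: []}
depth = [0, 0, 0, -1, -1, -1, -1]
next_frontier = advance_with_duplicates(neighbors, [0, 1, 2], depth, level=1)
```

The next frontier contains repeated vertices, yet the depth array is the same as it would be after a full merge; the cost is extra work in the following iteration.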

Page 28:

Design Considerations
1. Column-based SpMV
2. How to solve multiway-merge
3. Partitioning scheme

Page 29:

3. Partitioning scheme

$\begin{bmatrix} A_1 & A_2 & A_3 & \cdots & A_p \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_p \end{bmatrix} = y^{(1)} + y^{(2)} + y^{(3)} + \cdots + y^{(p)} = y$

Each iteration:
1) GPU i computes $y^{(i)} = A_i x_i$
2) Computes histogram of send counts
3) Exchange send counts with other GPUs using MPI_Alltoall
4) Send data to other GPUs using MPI_Alltoallv
5) Each GPU multiway-merges the piece it is responsible for

Page 30:

3. Partitioning scheme

Same equation and per-iteration steps as Page 29. Highlighted: GPU 1 and output piece y_1.

Page 31:

3. Partitioning scheme

Same equation and per-iteration steps as Page 29. Highlighted: GPU 2 and output piece y_2.

Page 32:

3. Partitioning scheme

Same equation and per-iteration steps as Page 29. Highlighted: output piece y_1.

Page 33:

3. Partitioning scheme

Same equation and per-iteration steps as Page 29. Highlighted: GPU 2 and output piece y_2.
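The five per-iteration steps can be simulated on one machine to see how the pieces fit together. The sketch below uses plain Python lists in place of GPUs and MPI; `partitioned_spmv` is a hypothetical name, the matrix is assumed square, and the equal-width block boundaries are an assumption about how columns and output rows are assigned.

```python
def partitioned_spmv(M, x, p):
    """Simulate 1D column partitioning of y = M x across p workers."""
    n = len(x)
    bounds = [n * g // p for g in range(p + 1)]   # block boundaries (columns and rows)

    def owner(i):
        return next(g for g in range(p) if bounds[g] <= i < bounds[g + 1])

    # 1) worker g computes its partial result y(g) = A_g x_g as sparse (row, value) pairs;
    #    the local accumulation is the first merge
    partial = []
    for g in range(p):
        acc = {}
        for j in range(bounds[g], bounds[g + 1]):
            if x[j] != 0:
                for i in range(len(M)):
                    if M[i][j] != 0:
                        acc[i] = acc.get(i, 0) + M[i][j] * x[j]
        partial.append(sorted(acc.items()))

    # 2) histogram of send counts: counts[g][d] = entries worker g sends to worker d
    counts = [[0] * p for _ in range(p)]
    for g in range(p):
        for i, _ in partial[g]:
            counts[g][owner(i)] += 1

    # 3) + 4) exchange counts, then data (stand-in for MPI_Alltoall and MPI_Alltoallv)
    inbox = [[] for _ in range(p)]
    for g in range(p):
        for i, v in partial[g]:
            inbox[owner(i)].append((i, v))

    # 5) each worker merges the piece of y it owns; this is the second merge
    y = [0] * len(M)
    for d in range(p):
        for i, v in inbox[d]:
            y[i] += v
    return y, counts


AT = [
    [0, 0, 1, 0, 0],
    [4, 0, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [1, 4, 3, 0, 0],
    [0, 5, 2, 0, 0],
]
y, counts = partitioned_spmv(AT, [0, 0, 2, 1, 4], p=2)
```

The send counts are needed first because the variable-length exchange requires each receiver to know how many entries to expect from every sender.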

Page 34:

Block Matrix (Multi-GPU)
On GPU 1, 2, …, p:

mXv (Single GPU) -> Histogram -> MPI_Alltoall -> MPI_Alltoallv -> Multiway-merge

mXv (Single GPU):
Input: Sparse Vector x_i, Sparse Matrix A_i -> Gather columns from A_i -> ewiseMult -> Multiway-merge -> Output: Sparse Vector y_i

Why 2 multiway-merges?

Page 35:

Experimental Setup
• CPU
  • 64x 16-core AMD Opteron 6274 CPU
  • 32 GB of main memory
• GPU
  • 64x NVIDIA K20X GPUs
  • 14 SMs, 192 ALUs per SM, 732 MHz, 6 GB on-board memory
• Interconnect
  • Gemini interconnect between nodes
  • 10.4 GB/s (HyperTransport 3.0)

Page 36:

Strong Scaling

[Figure: Billions of Edges Traversed (GTEPS), axis range 0.5 to 16, versus Number of Processors, 1 to 64. Series: Ideal Scaling, Actual Scaling.]

Page 37:

Weak Scaling (Vertex)

[Figure: Billions of Edges Traversed per Processor (GTEPS), axis range 0 to 0.5, versus Number of Processors, 1 to 32. Series: Ideal Scaling, Actual Scaling.]

Page 38:

For 4 GPUs, everything looks fine

Majority of runtime is compute
• Two biggest iterations brought down from 121ms to 40ms

Page 39:

Compute time > Communication

MPI calls comprise a sliver of runtime

Page 40:

At 16 GPUs, picture becomes bleak

MPI calls take up ~50% of runtime now

Page 41:

Not overlapping compute with communication

GPUs sit idle to wait for other GPUs to finish compute

Page 42:

Communication vs. Compute

[Figure: Relative Cost per Processor, axis range 0 to 7, versus Number of Processors, 1 to 16. Series: Communication, Compute.]

How to bridge this gap?

Page 43:

Work-in-progress
1. Overlap compute with communication
   Use multiple cudaStreams to make communication asynchronous
2. Try other partitioning schemes
   Variants of 1D column-based partitioning
   2D partitioning
3. Partition graph better
   Reverse Cuthill–McKee

Page 44:

Conclusion
Row-based vs. Column-based implementation for mXv
Both have their uses

Should sparse vector tolerate duplicate entries if user does not care about intermediate sparse vectors?

Page 45:

Acknowledgements

This work is funded by:
DARPA XDATA program US Army award W911QX-12-C-0059
DARPA STTR award D15PC00010
Department of Energy, Office of Science, ASCR Contract No. DE-AC02-05CH11231

Page 46:

Questions?
Aydın Buluç, Lawrence Berkeley National Laboratory, [email protected]

John D. Owens, University of California, Davis, [email protected]

Page 47:

Backup Slides

Page 48:

Applications of Interest

• Bioinformatics
• Social network analysis
• Computer vision
• Fraud detection
• Urban planning

Page 49:

Challenges with Graph Analysis
Too big to visualize

Software complexity
Structural properties vary
Small-world: Difficult to partition
Impacts scalability
Mesh:
Limits scalability of synchronous graph algorithms

Page 50:

1. Row-based mXv (SpMV)

A^T =
0 0 1 0 0
4 0 0 0 0
0 0 1 2 1
1 4 3 0 0
0 5 2 0 0

x = [0 0 2 1 4]
y = [2 0 8 6 4]

• GPU threads compute each row simultaneously
• Work: O(nnz)
• Suitable for multiplication by dense vector, because work is independent of vector sparsity
• Direction-optimizing or bottom-up BFS is a special case of this

Page 51:

Dominance of Big Iterations in BFS of Scale-free Graphs

Two biggest iterations take 91% of runtime
• Other six iterations take 6ms combined

