Faster parallel GraphBLAS kernels and new graph algorithms in matrix algebra. Aydın Buluç, Computational Research Division, Berkeley Lab (LBNL); EECS, University of California, Berkeley. October 14, 2016.


Page 1:

Faster parallel GraphBLAS kernels and new graph algorithms in matrix algebra

Aydın Buluç
Computational Research Division, Berkeley Lab (LBNL)
EECS, University of California, Berkeley
October 14, 2016

Page 2:

Large Graphs in Scientific Discoveries

[Figure: a sparse matrix A and its permutation PA]

Matching in bipartite graphs: permuting to heavy diagonal or block triangular form.

Graph partitioning: dynamic load balancing in parallel simulations. Picture (left) credit: Sanders and Schulz.

Problem size: as big as the sparse linear system to be solved or the simulation to be performed.

Page 3:

Large Graphs in Scientific Discoveries (continued)

The case for distributed memory.

Page 4:

Large Graphs in Scientific Discoveries

Whole genome assembly; graph-theoretical analysis of brain connectivity.

• Potentially millions of neurons and billions of edges with developing technologies.
• 26 billion (8B of which are non-erroneous) unique k-mers (vertices) in the hexaploid wheat genome W7984 for k=51.
• Figure labels: vertices are k-mers; vertices are reads.

Schatz et al. (2010) Perspective: Assembly of Large Genomes w/2nd-Gen Seq. Genome Res. (figure reference)

Page 5:

Large Graphs in Scientific Discoveries (continued)

The case for distributed memory.

Page 6:

Outline

•  GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra

•  SpGEMM: Computing the sparse matrix-matrix multiplication in parallel

•  Triangle counting/enumeration in matrix algebra

•  Bipartite graph matching in parallel

•  Other contributions and future work

Page 7:

The case for sparse matrices

Many irregular applications contain coarse-grained parallelism that can be exploited by abstractions at the proper level.

Traditional graph computations            | Graphs in the language of linear algebra
Data driven, unpredictable communication  | Fixed communication patterns
Irregular and unstructured, poor locality of reference | Operations on matrix blocks exploit memory hierarchy
Fine grained data accesses, dominated by latency       | Coarse grained parallelism, bandwidth limited

Page 8:

Linear-algebraic primitives for graphs

• Sparse matrix × sparse matrix
• Sparse matrix × sparse vector
• Element-wise operations
• Sparse matrix indexing

Is think-like-a-vertex really more productive? "Our mission is to build up a linear algebra sense to the extent that vector-level thinking becomes as natural as scalar-level thinking." - Charles Van Loan

Page 9:

Examples of semirings in graph algorithms

Real field (R, +, ×): classical numerical linear algebra
Boolean algebra ({0,1}, |, &): graph traversal
Tropical semiring (R ∪ {∞}, min, +): shortest paths
(S, select, select): select subgraph, or contract nodes to form quotient graph
(edge/vertex attributes, vertex data aggregation, edge data processing): schema for user-specified computation at vertices and edges
(R, max, +): graph matching & network alignment
(R, min, times): maximal independent set

• Shortened semiring notation: (Set, Add, Multiply). Both identities omitted.
• Add: traverses edges; Multiply: combines edges/paths at a vertex.
• Neither add nor multiply needs to have an inverse.
• Both add and multiply are associative, and multiply distributes over add.
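The semiring abstraction above can be sketched in a few lines. This is a minimal illustrative sketch, not the GraphBLAS or CombBLAS API: `semiring_spmv` and the dict-of-dicts matrix layout are names invented here, and the example runs one relaxation step of single-source shortest paths over the tropical (min, +) semiring.

```python
# y = A.x over a user-chosen semiring (add, mul, add-identity).
# A is a dict-of-dicts: A[i][j] holds the weight of edge j -> i.
def semiring_spmv(A, x, add, mul, identity):
    """y[i] = add-reduction over j of mul(A[i][j], x[j]), skipping zeros."""
    y = {}
    for i, row in A.items():
        acc = identity
        for j, a_ij in row.items():
            if j in x:                       # x is sparse: missing = zero
                acc = add(acc, mul(a_ij, x[j]))
        if acc != identity:                  # keep the result sparse too
            y[i] = acc
    return y

# Tropical semiring (min, +): one shortest-path relaxation from vertex 1.
INF = float("inf")
A = {2: {1: 3.0}, 3: {2: 1.0}}               # edges 1 -> 2 (w=3), 2 -> 3 (w=1)
dist = {1: 0.0}                              # sparse distance vector
step = semiring_spmv(A, dist, min, lambda a, b: a + b, INF)
```

Swapping in `(or, and, False)` gives the Boolean traversal semiring of the table, with no change to the kernel itself.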

Page 10:

Breadth-first search in the language of matrices

[Figure: a 7-vertex directed graph and its adjacency matrix transpose AT (rows: to, columns: from)]

Page 11:

[Figure: AT times the sparse frontier vector x (nonzero at the source) yields the next frontier; discovered vertices record their parents]

Particular semiring operations: Multiply: select; Add: minimum.
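The (select, min) semiring BFS of these slides can be sketched directly. A minimal illustrative sketch, not the CombBLAS implementation: `bfs_matrix` and the adjacency layout are names invented here; each loop iteration plays the role of one AT·x product, with "multiply = select" picking frontier entries and "add = minimum" breaking parent ties.

```python
def bfs_matrix(adj_t, source):
    """BFS as repeated sparse 'matrix-vector products'.

    adj_t[v] lists the vertices u with an edge u -> v (the rows of A^T).
    Returns a dict mapping each reached vertex to its BFS parent.
    """
    parents = {source: source}
    frontier = {source}                      # sparse frontier vector
    while frontier:
        nxt = {}
        for v, nbrs in adj_t.items():
            if v in parents:                 # already discovered
                continue
            cands = [u for u in nbrs if u in frontier]   # multiply = select
            if cands:
                nxt[v] = min(cands)                      # add = minimum
        parents.update(nxt)
        frontier = set(nxt)
    return parents
```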

Page 12:

[Figure: second BFS iteration; AT x computes the next frontier]

Select the vertex with the minimum label as the parent.

Page 13:

[Figure: third BFS iteration; the next frontier is computed and parents are updated]

Page 14:

[Figure: final BFS iteration on the example graph]

Page 15:

Graph Algorithms on GraphBLAS

GraphBLAS primitives in increasing arithmetic intensity:
• Sparse - Dense Matrix Product (SpDM3)
• Sparse - Sparse Matrix Product (SpGEMM)
• Sparse Matrix Times Multiple Dense Vectors (SpMM)
• Sparse Matrix-Dense Vector (SpMV)
• Sparse Matrix-Sparse Vector (SpMSpV)

Graph algorithms built on them:
• Shortest paths (all-pairs, single-source, temporal)
• Graph clustering (Markov cluster, peer pressure, spectral, local)
• Centrality (PageRank, betweenness, closeness)
• Miscellaneous: connectivity, traversal (BFS), independent sets (MIS), graph matching

Page 16:

Some GraphBLAS basic functions

Function (CombBLAS equiv) | Parameters                                   | Returns                         | Matlab notation
MxM (SpGEMM)              | sparse matrices A and B; optional unary functs | sparse matrix                 | C = A * B
MxV (SpM{Sp}V)            | sparse matrix A; sparse/dense vector x       | sparse/dense vector             | y = A * x
ewisemult, add, … (SpEWiseX) | sparse matrices or vectors; binary funct, optional unaries | in place or sparse matrix/vector | C = A .* B; C = A + B
reduce (Reduce)           | sparse matrix A and funct                    | dense vector                    | y = sum(A, op)
extract (SpRef)           | sparse matrix A; index vectors p and q       | sparse matrix                   | B = A(p, q)
assign (SpAsgn)           | sparse matrices A and B; index vectors p and q | none                          | A(p, q) = B
buildMatrix (Sparse)      | list of edges/triples (i, j, v)              | sparse matrix                   | A = sparse(i, j, v, m, n)
extractTuples (Find)      | sparse matrix A                              | edge list                       | [i, j, v] = find(A)

Page 17:

Performance of Linear Algebraic Graph Algorithms

Combinatorial BLAS was the fastest among all tested graph processing frameworks on 3 out of 4 benchmarks in an independent study by Intel. The linear algebra abstraction enables high performance, within 4X of native performance for PageRank and collaborative filtering.

Satish, Nadathur, et al. "Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets", in SIGMOD'14.

Page 18:

The GraphBLAS effort

• The GraphBLAS Forum: http://graphblas.org
• IEEE Workshop on Graph Algorithms Building Blocks (at IPDPS): http://www.graphanalysis.org/workshop2017.html

Abstract-- It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.

Page 19:

Outline

•  GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra

•  SpGEMM: Computing the sparse matrix-matrix multiplication in parallel

•  Triangle counting/enumeration in matrix algebra

•  Bipartite graph matching in parallel

•  Other contributions and future work

Page 20:

Multiple-source breadth-first search

• Sparse array representation => space efficient
• Sparse matrix-matrix multiplication => work efficient
• Three possible levels of parallelism: searches, vertices, edges
• Highly-parallel implementation for betweenness centrality*

*: a measure of influence in graphs, based on shortest paths

[Figure: the example graph, its adjacency matrix transpose AT, and a sparse multi-source frontier matrix B]

Page 21:

Multiple-source breadth-first search (continued)

[Figure: AT × B, one step of multiple-source BFS as a sparse matrix-matrix product]

Page 22:

Sparse Matrix-Matrix Multiplication

Why sparse matrix-matrix multiplication? Used for algebraic multigrid, graph clustering, betweenness centrality, graph contraction, subgraph extraction, cycle detection, quantum chemistry, high-dimensional similarity search, …

How do dense and sparse GEMM compare?
• Dense: lower bounds match algorithms; allows extensive data reuse.
• Sparse: significant gap between bounds and algorithms; inherently poor reuse?

What do we obtain here? An improved (higher) lower bound and new optimal algorithms, but only for random matrices: Erdős-Rényi(n,d) graphs, aka G(n, p = d/n).

Page 23:

2D Algorithm: Sparse SUMMA

[Figure: rectangular example A (100K × 5K) × B (5K × 25K) = C, with per-processor block Cij]

2D algorithm: Sparse SUMMA (based on dense SUMMA). General implementation that handles rectangular matrices.

Cij += HyperSparseGEMM(Arecv, Brecv)
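The per-stage update above can be simulated serially. A toy sketch of the SUMMA schedule, not the CombBLAS implementation: processors sit on a √p × √p grid, and in stage k each owner of Cij receives block A_ik and block B_kj and accumulates their product. `spgemm_block`, `sparse_summa`, and the dict-based block layout are names invented here for illustration.

```python
def spgemm_block(A, B):
    """Sparse product of two blocks stored as {(i, j): value} dicts."""
    rows_B = {}
    for (k, j), b in B.items():              # index B by row for fast lookup
        rows_B.setdefault(k, []).append((j, b))
    C = {}
    for (i, k), a in A.items():
        for j, b in rows_B.get(k, ()):
            C[(i, j)] = C.get((i, j), 0) + a * b
    return C

def sparse_summa(Ablocks, Bblocks, grid):
    """Serial simulation: Ablocks[(i,k)], Bblocks[(k,j)] are sparse blocks."""
    C = {(i, j): {} for i in range(grid) for j in range(grid)}
    for k in range(grid):                    # one broadcast stage per k
        for i in range(grid):
            for j in range(grid):
                # Cij += HyperSparseGEMM(Arecv, Brecv), as on the slide
                for key, v in spgemm_block(Ablocks[(i, k)],
                                           Bblocks[(k, j)]).items():
                    C[(i, j)][key] = C[(i, j)].get(key, 0) + v
    return C
```

In the real algorithm the two inner loops run concurrently and the k-th block column of A (block row of B) is broadcast along process rows (columns); only the schedule is captured here.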

Page 24:

2D Sparse SUMMA on square inputs

[Figure: strong scaling, seconds vs. number of cores (4 to 8100), for R-MAT inputs of Scale 21-24]

Almost linear scaling until bandwidth costs start to dominate; scaling proportional to √p afterwards.

R-MAT, edgefactor: 8, a=0.6, b=c=d=0.4/3. NERSC/Franklin Cray XT4.

Aydın Buluç and John R. Gilbert. Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments. SIAM Journal of Scientific Computing (SISC), 2012.

Page 25:

The computation cube and sparsity-independent algorithms

Matrix multiplication: ∀(i,j) ∈ n × n, C(i,j) = Σ_k A(i,k) B(k,j)

The computation (discrete) cube:
• A face for each (input/output) matrix
• A grid point for each multiplication

[Figure: 1D, 2D, and 3D partitionings of the computation cube]

How about sparse algorithms?

Sparsity-independent algorithms: assigning grid points to processors is independent of the sparsity structure.
- In particular: if Cij is nonzero, who holds it?
- All standard algorithms are sparsity independent.

Assumptions:
- Sparsity-independent algorithms
- Input (and output) are sparse
- The algorithm is load balanced

Page 26:

Algorithms attaining lower bounds

Previous sparse classical lower bound [Ballard et al. SIMAX'11], with expected #FLOPs = d²n for Erdős-Rényi(n,d):

W = Ω(#FLOPs / (P √M)) = Ω(d²n / (P √M))

✗ No algorithm attains this bound!

New lower bound for Erdős-Rényi(n,d) [this work], expected, under some technical assumptions:

W = Ω(min{ dn/√P, d²n/P })

No previous algorithm attains these bounds.

✓ Two new algorithms achieving the bounds (up to a logarithmic factor):
i. Recursive 3D, based on [Ballard et al. SPAA'12]
ii. Iterative 3D, based on [Solomonik & Demmel EuroPar'11]

Ballard, B., Demmel, Grigori, Lipshitz, Schwartz, and Toledo. Communication optimal parallel multiplication of sparse random matrices. In SPAA 2013.

Page 27:

3D parallel SpGEMM in a nutshell

[Figure: A, B, and C are split into p/c layers (A::1, A::2, A::3, …); AlltoAll exchanges within layers produce C^intermediate, and a second AlltoAll reduces the layers into C^final]

C^int_ijk = Σ_{l=1}^{p/c} A_ilk B_ljk

Ariful Azad, Grey Ballard, B., James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing (SISC), to appear.

Page 28:

3D SpGEMM performance

[Figure: time (sec) vs. number of cores (64-16384) for 2D (t=1,3,6) and 3D (c=4,8,16; t=1,3,6) variants, nlpkkt160 × nlpkkt160 on Edison; performance improves as 3D layer and thread counts increase]

Strong scaling of different variants of 2D and 3D algorithms when squaring the nlpkkt160 matrix on Edison.

2D (non-threaded) is the previous state-of-the-art; 3D (threaded), first presented here, beats it by 8X at large concurrencies.

Page 29:

3D SpGEMM performance

• Matrix squaring (A*A), proxy for Markov clustering (MCL)

it-2004: 2004 web crawl of the .it domain; 41 million vertices, 1.1 billion edges.

[Figure, left: it-2004 (AA), time vs. cores (1024-16384), 2D (t=1,6) vs. 3D (c=8, t=1,6; c=16, t=6). Figure, right: SSCA (R-MAT) matrices of Scale 24-27 on Titan with c=16, t=8, 512-65536 cores]

Page 30:

Outline

•  GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra

•  SpGEMM: Computing the sparse matrix-matrix multiplication in parallel

•  Triangle counting/enumeration in matrix algebra

•  Bipartite graph matching in parallel

•  Other contributions and future work

Page 31:

Counting triangles (clustering coefficient)

[Figure: a 6-vertex example graph A containing 4 triangles]

Clustering coefficient:
• Pr(wedge i-j-k makes a triangle with edge i-k)
• 3 × #triangles / #wedges
• 3 × 4 / 19 = 0.63 in the example
• may want to compute for each vertex j

Cohen's algorithm to count triangles:
- Count triangles by lowest-degree vertex.
- Enumerate "low-hinged" wedges.
- Keep wedges that close.

[Figure: wedges with endpoints labeled hi/lo by degree; the hinge is the low vertex]

Page 32:

Counting triangles (clustering coefficient)

A = L + U    (hi->lo + lo->hi)
L × U = B    (wedge, low hinge)
A ∧ B = C    (closed wedge)
sum(C)/2 = 4 triangles

[Figure: the example graph, its split into lower (L) and upper (U) triangular parts, and the resulting B and C matrices]
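The four-line recipe above can be sketched end to end. A minimal dense illustration, not the parallel CombBLAS kernel: `count_triangles` is a name invented here, and ordering vertices by index stands in for the degree ordering used on the slide.

```python
def count_triangles(A):
    """Triangle count of a symmetric 0/1 adjacency matrix (list of lists)."""
    n = len(A)
    # A = L + U: split into strictly lower and strictly upper parts
    L = [[A[i][j] if i > j else 0 for j in range(n)] for i in range(n)]
    U = [[A[i][j] if i < j else 0 for j in range(n)] for i in range(n)]
    # B = L * U: B[i][j] counts wedges i-k-j whose hinge k is "low"
    B = [[sum(L[i][k] * U[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    # C = A ∧ B: keep only wedges closed by edge (i, j); each triangle
    # is counted twice, once per direction of its closing edge
    C = [[B[i][j] if A[i][j] else 0 for j in range(n)] for i in range(n)]
    return sum(map(sum, C)) // 2
```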

Page 33:

Masked sparse matrix-matrix multiplication (SpGEMM)

Special SpGEMM: the nonzero structure of the product is contained in the original adjacency matrix A.

Avoid communicating nonzeros that fall outside this structure via C = MaskedSpGEMM(L, U, A), which is mathematically equivalent to performing B = L U followed by C = A ∧ B.

[Figure: L × U with output entries restricted to the pattern of A]

Ariful Azad, B., and John R. Gilbert. "Parallel triangle counting and enumeration using matrix algebra". Graph Algorithm Building Blocks (GABB), IPDPSW, 2015.
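The key saving of the masked product is computing only the entries the mask allows, never materializing B = L·U in full. A minimal serial sketch under that idea, not the distributed MaskedSpGEMM of the paper: `masked_spgemm` and the dict-of-dicts layout are illustrative names.

```python
def masked_spgemm(L, U, mask):
    """Entries of L*U restricted to positions in `mask`.

    L, U: sparse matrices as dict-of-dict rows (L[i][k] = value).
    mask: set of (i, j) positions where output may be nonzero.
    """
    C = {}
    for i, j in mask:                        # only positions the mask allows
        acc = 0
        for k, lv in L.get(i, {}).items():   # dot product of row i of L
            acc += lv * U.get(k, {}).get(j, 0)   # with column j of U
        if acc:
            C[(i, j)] = acc
    return C
```

On the triangle example, masking by the edge set of A yields exactly the closed-wedge counts, so summing the result and halving recovers the triangle count.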

Page 34:

Distributed memory performance of triangle counting

[Figure: speedup vs. number of cores (1-256) for (a) Triangle-basic and (b) Triangle-masked-bloom, on coPapersDBLP, mouse-gene, cage15, and europe_osm]

• Improved 1D algorithm (SPAA'13 by Ballard, Buluç, et al.), which had not been implemented and evaluated before.
• Scales reasonably well until 512 cores of NERSC/Edison. Further scaling requires 2D/3D algorithms (ongoing work).
• Bloom filters are used in (b) to signal the presence/absence of zeros in the output to avoid communication.

Page 35:

Outline

•  GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra

•  SpGEMM: Computing the sparse matrix-matrix multiplication in parallel

•  Triangle counting/enumeration in matrix algebra

•  Bipartite graph matching in parallel

•  Other contributions and future work

Page 36:

Graph vs. Bipartite Graph

• A graph consists of vertices and edges: G(V, E).
• A bipartite graph: vertices are grouped into two sets, with no edge between vertices in the same set.
• Notation: n = #vertices, m = #edges.

[Figure: a graph on {x1, x2, y1, y2}, and a bipartite graph on {x1, x2, x3} ∪ {y1, y2, y3}]

Page 37:

A Matching in a Graph

Matching: a subset M of edges with no common end vertices.
|M| = cardinality of the matching M.

[Figure: on the same bipartite graph, a matching of maximal cardinality and a maximum-cardinality matching; matched/unmatched vertices and edges highlighted]

Page 38:

Maximum-cardinality matching

• Augmenting path: a path that alternates between matched and unmatched edges, with unmatched end points.
• Algorithm: search for augmenting paths and flip the edges along each path to increase the matching.

[Figure: a matching before and after augmenting along an alternating path]
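The search-and-flip idea above can be sketched in its classic single-source (Hungarian-style) form, which the next slides refine into multi-source search and tree grafting. A minimal illustrative sketch, not the parallel algorithm of the talk: `max_bipartite_matching` and the vertex names are invented here.

```python
def max_bipartite_matching(adj, xs):
    """Maximum matching via augmenting-path search.

    adj[x]: iterable of y-side neighbors of x; xs: the x-side vertices.
    Returns a dict match[y] = x describing the matching.
    """
    match = {}

    def augment(x, visited):
        """DFS for an alternating path from x to a free y-vertex."""
        for y in adj.get(x, ()):
            if y in visited:
                continue
            visited.add(y)
            # y is free, or its current partner can be re-matched elsewhere
            if y not in match or augment(match[y], visited):
                match[y] = x          # flip the edges along the path
                return True
        return False

    for x in xs:                      # one search per unmatched x-vertex
        augment(x, set())
    return match
```

Each failed search here discards its whole tree, exactly the inefficiency that the multi-source and tree-grafting variants on the following slides attack.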

Page 39:

Single-source (SS) search for augmenting paths

1. Initial matching.
2. Search for an augmenting path from x3; stop when an unmatched vertex is found. Discard the whole tree when no augmenting path is found.
3. Increase the matching by flipping the edges in the augmenting path.

Repeat the process for the other unmatched vertices.

[Figure: the alternating search tree rooted at x3 and the augmented matching]

Page 40:

Multi-source (MS) search for augmenting paths

1. Initial matching.
2. Search for vertex-disjoint augmenting paths from x3 and x1, growing a search forest; grow each tree until an unmatched vertex is found in it. A tree cannot be discarded when no augmenting path is found.
3. Increase the matching by flipping the edges in the augmenting paths.

Repeat the process until no augmenting path is found.

[Figure: the alternating search forest and the augmented matching]

Page 41:

Tree Grafting

[Figure: (a) a maximal matching in a bipartite graph on x1-x6, y1-y6; (b) the alternating BFS forest with active and renewable trees, and augmentation within the forest; (c) tree grafting, reattaching renewable subtrees to active trees; (d) BFS continues over the remaining unvisited vertices]

Page 42:

Comparison with state-of-the-art

[Figure: relative performance of Push-Relabel, Pothen-Fan, and MS-BFS-Graft on (a) one core and (b) 40 cores, over kkt_power, hugetrace, delaunay, rgg_n24_s0, coPapersDBLP, amazon0312, cit-Patents, RMAT, road_usa, wb-edu, web-Google, and wikipedia]

Intel Westmere-EX. Matrix groups: scientific computing, scale-free, networks.

Page 43:

Sources of performance

[Figure: relative performance of MS-BFS, MS-BFS (dir-opt), and MS-BFS-Graft on scientific, scale-free, and network graphs, on 40 cores of an Intel Westmere-EX]

1. Direction-optimized BFS (1.6x)
2. Tree grafting (3x)

Highest performance improvement on graphs with a low matching number.

Ariful Azad, B., Alex Pothen. A parallel tree grafting algorithm for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS, 2015.
Ariful Azad, B., Alex Pothen. Computing maximum cardinality matchings in parallel on bipartite graphs via tree-grafting. IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016.

Page 44:

Matching in distributed memory

[Figure, left: maximum-matching time (sec) vs. number of cores (16-2048) on ljournal, cage15, road_usa, nlpkkt200, hugetrace, delaunay_n24, and HV15R. Figures, right: maximum vs. maximal matching time on nlpkkt200, road_usa, GL7d19, and wikipedia under Dynamic Mindegree, Karp-Sipser, Greedy, and No Init initializations]

Ariful Azad and Aydin Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS, 2016.

Page 45:

Outline

•  GraphBLAS: standard building blocks for graph algorithms in the language of linear algebra

•  SpGEMM: Computing the sparse matrix-matrix multiplication in parallel

•  Triangle counting/enumeration in matrix algebra

•  Bipartite graph matching in parallel

•  Other contributions and future work

Page 46:

HipMer: An Extreme-Scale De Novo Genome Assembler

Meraculous assembly pipeline: reads -> k-mers -> contigs -> scaffolding using scalable alignment.

• New k-mer analysis filters errors using a probabilistic Bloom filter.
• Graph algorithm (connected components) scales to 15K cores on NERSC's Edison.
• New fast & parallel I/O.

The Meraculous assembler is used in production at the Joint Genome Institute:
• Wheat assembly is a "grand challenge".
• The hardest part is contig generation (a large in-memory hash table that represents the graph).
• HipMer is an efficient parallelization of Meraculous that leverages the capabilities of Unified Parallel C (UPC).

Performance improvement from days to minutes.

[Figure: strong scaling, seconds vs. number of cores (1920-23040), for overall time (with cached and uncached I/O), k-mer analysis, contig generation, scaffolding, and I/O, against ideal scaling]

Page 47:

Future Work: Machine Learning

GraphBLAS functions (in increasing arithmetic intensity):
• Sparse - Dense Matrix Product (SpDM3)
• Sparse - Sparse Matrix Product (SpGEMM)
• Sparse Matrix Times Multiple Dense Vectors (SpMM)
• Sparse Matrix-Dense Vector (SpMV)
• Sparse Matrix-Sparse Vector (SpMSpV)

Higher-level machine learning tasks:
• Graphical model structure learning (e.g., CONCORD)
• Clustering (e.g., MCL, spectral clustering)
• Logistic regression, support vector machines
• Dimensionality reduction (e.g., NMF, CX/CUR, PCA)

Page 48:

Future Work: Metagenomics

• Microbiomes: dynamic consortia of hundreds or thousands of microbial species and strains of varying abundance and diversity.
• Metagenomics: the application of high-throughput genome sequencing technologies to DNA extracted from microbiomes.

Challenges:
- Metagenome assembly is computationally harder than single-genome assembly.
- Protein clustering at this scale has never been done before.

Figure credit: Natalia Ivanova (JGI)

Page 49:

Acknowledgments

Ariful Azad, David Bader, Grey Ballard, Scott Beamer, Jarrod Chapman, James Demmel, Rob Egan, Evangelos Georganas, John Gilbert, Laura Grigori, Steve Hofmeyr, Costin Iancu, Jeremy Kepner, Penporn Koanantakool, Benjamin Lipshitz, Tim Mattson, Scott McMillan, Henning Meyerhenke, Jose Moreira, Sang Oh, Lenny Oliker, John Owens, Alex Pothen, Dan Rokhsar, Oded Schwartz, Sivan Toledo, Sam Williams, Carl Yang, Kathy Yelick.

Work  is  funded  by