PL-4089, Accelerating and Evaluating OpenCL Graph Applications, by Shuai Che, Brad Beckmann, Steve Reinhardt and Kevin Skadron


DESCRIPTION

PL-4089, Accelerating and Evaluating OpenCL Graph Applications, by Shuai Che, Bradford Beckmann, Steve Reinhardt and Kevin Skadron at the AMD Developer Summit (APU13), November 11-13, 2013.

TRANSCRIPT

Page 1

ACCELERATING AND EVALUATING OPENCL GRAPH APPLICATIONS

SHUAI CHE, BRAD BECKMANN, STEVE REINHARDT AND KEVIN SKADRON

Page 2

Accelerating and Evaluating OpenCL Graph Applications | November 20, 2013 | CONFIDENTIAL

AGENDA

Background and Graph Applications

Pannotia OpenCL™ Graph Applications

Performance Evaluation and Discussion

Page 3

GRAPH APPLICATIONS

• Intelligence
  ‒ Business analytics, security and scientific discovery
• Social networks
  ‒ Facebook, Twitter, LinkedIn, Weibo, etc.
• Life science and healthcare
  ‒ Disease and drug research, life system research
• Infrastructure
  ‒ Transportation, power grid, energy and water supply
• Scientific and engineering simulations

Page 4

GRAPH APPLICATIONS

• Low arithmetic intensity and low data reuse
• Not floating-point intensive
• Branch divergence
  ‒ Only some of the threads in a wavefront are active
• Memory divergence
  ‒ Data distributed in different regions of memory
  ‒ A challenge to optimize data layouts and memory accesses
• Load imbalance
  ‒ Uneven work distribution across different threads
  ‒ Short-running threads wait for long-running threads
• Parallelism
  ‒ Changing degree of parallelism across iterations
  ‒ Underutilization of compute units for certain phases
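The load-imbalance point can be illustrated with a small model. The sketch below assumes a vertex-per-thread mapping in which each lane loops over its vertex's neighbors, and a wavefront of 64 lanes; the function name `wavefront_efficiency` and this cost model are illustrative assumptions, not something taken from the slides.

```python
def wavefront_efficiency(degrees, wavefront_size=64):
    """Fraction of issued lane-iterations that do useful work.

    Assumed model: lane i of a wavefront handles one vertex and iterates
    over its neighbors, so the whole wavefront runs for as many iterations
    as its highest-degree vertex. Lanes beyond the end of the vertex list
    count as idle.
    """
    useful = 0
    issued = 0
    for start in range(0, len(degrees), wavefront_size):
        group = degrees[start:start + wavefront_size]
        useful += sum(group)                    # neighbor visits actually needed
        issued += max(group) * wavefront_size   # lane-iterations the hardware spends
    return useful / issued if issued else 1.0


# Uniform degrees keep every lane busy; one hub vertex stalls the other 63 lanes.
uniform = wavefront_efficiency([4] * 64)
skewed = wavefront_efficiency([1] * 63 + [64])
```

Under this model a single high-degree vertex drags the efficiency of its wavefront down sharply, which is the "short-running threads wait for long-running threads" effect.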

Page 5

PANNOTIA

• A graph application suite for GPGPU
• Eight diverse graph algorithms, e.g., shortest path, graph partitioning, web analysis and social network
• Implemented in C + OpenCL™
• Some are OpenCL implementations based on algorithms of prior work
• Initial implementation is for a single GPU node
• Further algorithmic and hardware-specific optimizations are in progress
• Details of Pannotia can be found in our paper published in the 2013 IEEE International Symposium on Workload Characterization

Page 6

PANNOTIA

Applications                    Domains
Single-Source Shortest Path     Shortest Path
Connected Component Labeling    Graph Clustering
Graph Coloring                  Graph Partitioning
Floyd-Warshall                  Shortest Path
Maximal Independent Set         Graph Partitioning
Betweenness Centrality          Social Network
Friend Recommendation           Social Network
PageRank                        Web Analysis

Page 7

GRAPH INPUT AND DATA STRUCTURE

• Real-world graphs
  ‒ The University of Florida Sparse Matrix Collection
  ‒ The 9th DIMACS Implementation Challenge
  ‒ The 10th DIMACS Implementation Challenge
• Synthetic graphs
  ‒ Random-graph generator from Georgia Tech
• Graph input formats
  ‒ Coordinate Format
  ‒ METIS
  ‒ Matrix Market
• Data structure representation
  ‒ CSR, COO, ELL …
  ‒ 2D adjacency matrix
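As an illustration of two of the representations listed above, the following sketch converts a COO edge list into CSR arrays. The function name `coo_to_csr` and the `(src, dst, weight)` tuple layout are assumptions made for this example; they are not taken from the Pannotia sources.

```python
def coo_to_csr(num_vertices, edges):
    """Convert a COO edge list [(src, dst, weight), ...] into CSR arrays.

    Returns (row_ptr, col_idx, values): the out-edges of vertex v occupy
    positions row_ptr[v] .. row_ptr[v + 1] - 1 of col_idx and values.
    """
    # Count the out-degree of every vertex, shifted by one slot.
    row_ptr = [0] * (num_vertices + 1)
    for src, _dst, _w in edges:
        row_ptr[src + 1] += 1
    # Prefix sum turns the counts into starting offsets.
    for v in range(num_vertices):
        row_ptr[v + 1] += row_ptr[v]

    col_idx = [0] * len(edges)
    values = [0] * len(edges)
    next_slot = row_ptr[:-1]  # slicing copies, so row_ptr stays intact
    for src, dst, w in edges:
        slot = next_slot[src]
        col_idx[slot] = dst
        values[slot] = w
        next_slot[src] += 1
    return row_ptr, col_idx, values


row_ptr, col_idx, values = coo_to_csr(3, [(0, 1, 5), (2, 0, 7), (0, 2, 3)])
```

CSR keeps each vertex's neighbors contiguous, which tends to suit a vertex-per-thread kernel, while COO suits an edge-per-thread kernel.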

Page 8

SINGLE SOURCE SHORTEST PATH

• Finds the shortest path from the source node to every other node in the graph

[Figure: example weighted graph with vertices 0 to 6 and source vertex 0]

Vid   Dist
0     0
1     3
2     1
3     8
4     16
5     19
6     16
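A minimal sketch of single-source shortest path follows, written as repeated edge relaxation (Bellman-Ford style) because that form maps naturally onto data-parallel hardware. It is not necessarily the algorithm Pannotia's kernel uses, and the small graph in the usage lines is made up for illustration rather than copied from the figure.

```python
import math


def sssp(num_vertices, edges, source):
    """Shortest distances from source over directed edges (u, v, weight).

    Assumes non-negative weights. Every pass relaxes all edges; the loop
    stops as soon as a pass changes nothing.
    """
    dist = [math.inf] * num_vertices
    dist[source] = 0
    for _ in range(num_vertices - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    return dist


# Hypothetical 4-vertex graph: the best route to vertex 1 goes through vertex 2.
example = sssp(4, [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5)], source=0)
```

On a GPU the inner loop over edges is the part that would run in parallel, one work-item per edge or per vertex.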

Page 9

CONNECTED COMPONENT LABELING

• Detects connected regions in graphs and images
• A connected component is the set of nodes in a graph that point to the same root

[Figure: example with nodes p, q, r and s]
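One simple way to realize the "same root" idea is minimum-label propagation, sketched below. Taking the smallest vertex id in a component as its root is an assumption made for this example, and the code is a sequential illustration rather than the OpenCL implementation.

```python
def connected_components(num_vertices, edges):
    """Label every vertex with the smallest vertex id in its component.

    edges is a list of undirected pairs (u, v). Labels are propagated
    across edges until a full pass makes no change.
    """
    label = list(range(num_vertices))
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(label[u], label[v])
            if label[u] != low or label[v] != low:
                label[u] = label[v] = low
                changed = True
    return label


# Vertices 0, 1, 2 form one component and 3, 4 another.
labels = connected_components(5, [(1, 2), (0, 1), (3, 4)])
```

Two vertices belong to the same component exactly when they end up with the same label.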

Page 10

GRAPH COLORING

• Assigns colors (integers) to vertices so that no two adjacent vertices share the same color
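The sketch below shows one round-based way to color a graph: in each round, every uncolored vertex that outranks all of its uncolored neighbors takes the current color. The random priorities and the name `color_graph` are assumptions for this example; the slides do not say which coloring algorithm Pannotia implements. The adjacency lists are assumed to be symmetric and free of self-loops.

```python
import random


def color_graph(adj, seed=0):
    """Color an undirected graph given as {vertex: [neighbors]}.

    Vertices colored in the same round form an independent set, so
    adjacent vertices never share a color.
    """
    rng = random.Random(seed)
    # Random priority per vertex; the vertex id breaks (unlikely) ties.
    priority = {v: (rng.random(), v) for v in adj}
    color = {}
    current = 0
    while len(color) < len(adj):
        winners = [
            v for v in adj
            if v not in color
            and all(u in color or priority[v] > priority[u] for u in adj[v])
        ]
        for v in winners:
            color[v] = current
        current += 1
    return color


# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
colors = color_graph(adj)
```

Each round is independent per vertex, and the number of active vertices shrinks from round to round.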

Page 11

FLOYD-WARSHALL

• Solves the all-pairs shortest path (APSP) problem: finding the shortest path from every possible source to every possible destination
• A dynamic programming approach
  ‒ shortestPath(i, j, k) returns the shortest path from i to j using only intermediate vertices from {1, 2, ..., k}
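The dynamic-programming recurrence can be sketched as the standard triple loop below. It is a sequential Python illustration, with a dense weight matrix and `math.inf` for missing edges as assumed input conventions.

```python
import math


def floyd_warshall(weights):
    """All-pairs shortest distances for a dense n x n weight matrix.

    weights[i][j] is the edge weight from i to j, math.inf if there is
    no edge, and 0 on the diagonal. After iteration k, dist[i][j] is the
    shortest distance using only intermediate vertices 0..k.
    """
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


INF = math.inf
apsp = floyd_warshall([[0, 3, INF], [INF, 0, 1], [2, INF, 0]])
```

For a fixed k the updates over i and j are independent, so the two inner loops are the natural candidates for a 2D GPU kernel launched once per k.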

Page 12

MAXIMAL INDEPENDENT SET

• Independent set: no two vertices are neighbors
• Maximal independent set: impossible to add another vertex while keeping the set independent

[Figure: example graph with vertices 0 to 7]

S = {0, 4, 6} is a maximal independent set
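The two definitions translate directly into a checker, shown below with a simple greedy construction. The graph in the usage lines is a hypothetical four-vertex path, not the graph from the figure, and the function names are chosen for this example only.

```python
def is_independent(adj, s):
    """True if no two vertices of s are neighbors."""
    return all(u not in s for v in s for u in adj[v])


def is_maximal_independent(adj, s):
    """True if s is independent and no further vertex can be added."""
    if not is_independent(adj, s):
        return False
    # Maximal: every vertex outside s has at least one neighbor inside s.
    return all(any(u in s for u in adj[v]) for v in adj if v not in s)


def greedy_mis(adj):
    """Sequential greedy construction: take a vertex, block its neighbors."""
    chosen, blocked = set(), set()
    for v in sorted(adj):
        if v not in blocked:
            chosen.add(v)
            blocked.add(v)
            blocked.update(adj[v])
    return chosen


# Hypothetical path graph 0 - 1 - 2 - 3.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
mis = greedy_mis(path)
```

Note that "maximal" is weaker than "maximum": both {0, 2} and {0, 3} are maximal on this path, and neither has to be the largest possible set in general.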

Page 13

BETWEENNESS CENTRALITY

• Centrality determines the relative importance of a vertex within the graph (e.g. degree, betweenness, closeness …)
• Betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes.

$$BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

$\sigma_{st}$: no. of shortest paths between nodes s and t

$\sigma_{st}(v)$: no. of shortest paths between nodes s and t passing through v
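For unweighted graphs the sum can be evaluated with Brandes' algorithm: one breadth-first search per source, followed by a backward accumulation pass. The sketch below is a sequential illustration under that assumption; it sums over ordered pairs (s, t), so on an undirected graph each unordered pair is counted twice.

```python
from collections import deque


def betweenness_centrality(adj):
    """Betweenness centrality for an unweighted graph {vertex: [neighbors]}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        sigma = {v: 0 for v in adj}    # number of shortest s-v paths
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}   # predecessors on shortest paths
        order = []                     # vertices in BFS visiting order
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest vertices back to s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


# Path 0 - 1 - 2: only vertex 1 lies between two other vertices.
scores = betweenness_centrality({0: [1], 1: [0, 2], 2: [1]})
```

Halve the scores to count each unordered pair once on an undirected graph.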

Page 14

FRIEND RECOMMENDATION

• Recommend friend connections: a common feature in social websites
• A simple MapReduce-like algorithm

“Andy” = [“Brad”, “Derek”, “Shuai”, …]

Andy → <“Brad”, “Derek”, “Andy”>
       <“Brad”, “Shuai”, “Andy”>
       <“Derek”, “Brad”, “Andy”>
       <“Derek”, “Shuai”, “Andy”>
       <“Shuai”, “Derek”, “Andy”>
       <“Shuai”, “Brad”, “Andy”>    (Andy recommends Brad to Shuai)
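The map step above can be sketched as follows: every person emits one triple per ordered pair of their friends. The reduce step shown here, which ranks candidates by the number of mutual friends and drops people who are already friends, is an assumption for illustration; the slide only shows the emitted triples.

```python
from collections import defaultdict
from itertools import permutations


def map_friends(person, friends):
    """Emit (recipient, candidate, via) for every ordered pair of friends."""
    return [(a, b, person) for a, b in permutations(friends, 2)]


def recommend(friend_lists):
    """Group triples by recipient and rank candidates by mutual-friend count."""
    mutual = defaultdict(lambda: defaultdict(set))
    for person, friends in friend_lists.items():
        for recipient, candidate, via in map_friends(person, friends):
            # Skip candidates the recipient already lists as a friend.
            if candidate not in friend_lists.get(recipient, []):
                mutual[recipient][candidate].add(via)
    return {
        recipient: sorted(cands, key=lambda c: (-len(cands[c]), c))
        for recipient, cands in mutual.items()
    }


triples = map_friends("Andy", ["Brad", "Derek", "Shuai"])
recs = recommend({"Andy": ["Brad", "Derek", "Shuai"]})
```

With only Andy's list as input, Shuai is recommended both Brad and Derek, each through the single mutual friend Andy.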

Page 15

PAGERANK

Page 16

PERFORMANCE BENEFITS

• Speedups are up to 11x (an AMD “Tahiti” discrete GPU vs. 4 CPU cores on an A8)
• PCI-E overhead is included
• Performance benefits depend on graph input datasets

[Figure: speedup chart; y-axis "Parallel Speedup", 0 to 15]

Page 17

EXECUTION TIME BREAKDOWN (D-GPU)

• GPU execution accounts for 8% to 99% of total execution time
• Some further GPU offload can be done (e.g. FRD and MIS)

Page 18

PARALLELISM (ACTIVE VERTICES OVER TIME)

[Figure: Single-Source Shortest Path (Road Network - NY); y-axis 0 to 120000, x-axis Time]

[Figure: Graph Coloring (G3 Circuit); y-axis 0 to 400000, x-axis Time]
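A trace like the ones plotted here can be produced by recording how many vertices are active in each iteration. The sketch below uses a breadth-first traversal as a stand-in for the applications' worklists; that substitution, the function name and the tiny graph are all assumptions for illustration.

```python
def active_vertices_per_iteration(adj, source):
    """Return the frontier size at every level of a breadth-first traversal."""
    visited = {source}
    frontier = [source]
    trace = []
    while frontier:
        trace.append(len(frontier))
        next_frontier = []
        for v in frontier:
            for u in adj[v]:
                if u not in visited:
                    visited.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return trace


# Hypothetical graph: a hub with three neighbors, one of which leads to a tail.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
trace = active_vertices_per_iteration(adj, 0)
```

Iterations with only a handful of active vertices are the phases in which most compute units would sit idle.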

Page 19

LOAD IMBALANCE (DEGREE DISTRIBUTION)

[Figure: Single-Source Shortest Path (Road Network); y-axis 0% to 100%, x-axis Time, series 1 to 8]

[Figure: Graph Coloring (G3 Circuit); y-axis 0% to 100%, x-axis Time, series 1 to 5]

Page 20

HIERARCHICAL CLUSTERING

• Different program-input pairs may have vastly different characteristics!

[Figure: clustering of program-input pairs; axis from 0.0 to 4.6. Labels: CLR-G3-circuit, CLR-ecology, DJK-US-NW, DJK-US-CA, BC-2k, BC-1k, CCL-lena, CCL-deposit, FW-512-64k, FW-256-16k, MIS-US-NW, MIS-shell, CLR-shell, MIS-ecology, PRK-flicker, FRD-coAuthor, PRK-2k]

Page 21

L2 HIT RATE OVER TIME (SSSP)

• The cache hit rate first improves, then degrades, improves again and finally degrades, with some fluctuations in the middle

[Figure: y-axis "Hit Rate", 0 to 60; x-axis Time]

Page 22

ARCHITECTURAL IMPLICATIONS (SCALAR UNIT)

[Figure: graph traversal over time, with phases A and B mapped to scalar and SIMD units]

Page 23

ARCHITECTURAL IMPLICATIONS

• Possibly include narrower SIMD units or heterogeneous SIMD units
• Resource management and scheduling
  ‒ Switch the task between the CPU and the GPU based on parallelism
  ‒ Use only “enough” SIMD engines and save power

[Figure: Scalar, Narrow SIMD and Wide SIMD units; y-axis 0 to 120000, x-axis Time, with regions A and B labeled CPU and GPU]

Page 24

CONCLUSION AND FUTURE WORK

• Graph applications are an emerging workload domain
• Pannotia is a first-step attempt to evaluate diverse graph building blocks on GPUs

Next-Step Goals:
• Add more applications (e.g. matching, spanning tree, flow)
• Optimize Pannotia applications
• Extend to multiple GPU nodes and across CPU and GPU

Page 25

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a registered trademark of Apple Inc. Other names are for informational purposes only and may be trademarks of their respective owners.