


Kokkos implementation of Albany: a performance-portable finite element application

SAND2014-18787C

Irina Demeshko, H. Carter Edwards, Michael A. Heroux, Eric T. Phipps, Andrew G. Salinger, Roger P. Pawlowski
Sandia National Laboratories, New Mexico, 87185

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Introduction

Problem: Modern HPC applications, and Finite Element Assembly codes in particular, need to run on many different platforms, so performance portability has become a critical issue: parallel code must execute correctly and efficiently despite variation in architecture, operating system, and software libraries.

Approach: the Kokkos programming model from Trilinos, a C++ library that provides performance portability across diverse devices with different memory models.

Funding: ASC Exascale Co-design Project

Kokkos programming model

Kokkos [1] is a C++ library from Trilinos [3] that provides scientific and engineering codes with an intuitive, manycore, performance-portable programming model. It:
•  Provides portability across manycore devices (multicore CPU, NVIDIA GPU, Intel Xeon Phi; potentially AMD Fusion)
•  Abstracts the data layout of non-trivial data structures
•  Takes a library approach:
   •  Maximize the amount of user (application/library) code that can be compiled without modification and run on these architectures
   •  Minimize the amount of architecture-specific knowledge a user is required to have
   •  Performant: portable user code performs as well as architecture-specific code

Main abstractions (see the sketch below):
•  Kokkos executes computational kernels in a fine-grain data-parallel fashion within an Execution space.
•  Computational kernels operate on multidimensional arrays (Kokkos::View) residing in Memory spaces.
•  Kokkos gives these multidimensional arrays a polymorphic data layout, similar to the Boost.MultiArray flexible storage ordering.
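To make the two abstractions concrete, here is a minimal sketch (not from the poster) of a View allocated in an execution space's memory space and a kernel dispatched into that space. It uses the lambda interface of current Kokkos rather than the poster-era device_type typedefs; the array size and kernel are illustrative assumptions.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 1000;

    // Views are multidimensional arrays residing in a Memory space;
    // with no explicit space given they live in the default execution
    // space's memory (e.g. device memory for a CUDA build).
    Kokkos::View<double*> x("x", N);
    Kokkos::View<double*> y("y", N);

    // A fine-grain data-parallel kernel dispatched into the default
    // Execution space; the lambda body runs once per index i.
    Kokkos::parallel_for("axpy", N, KOKKOS_LAMBDA(const int i) {
      y(i) += 2.0 * x(i);
    });

    Kokkos::fence();  // wait for the asynchronous kernel to finish
  }
  Kokkos::finalize();
}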

Ice Sheet Simulation Code (FELIX)

We are developing a performance-portable implementation of an ice sheet simulation code*, built with the Albany [2] application development environment.

Project objectives:
•  Develop a robust, scalable, unstructured-grid finite element code for land-ice dynamics.
•  Provide sea-level rise predictions.
•  Run on new-architecture machines (hybrid systems).

Fully Implicit Solution Algorithm:
•  Finite Element Assembly: ~50% of CPU time
•  Linear Solve (ILU + CG): ~50% of CPU time
•  Problem size up to 1.12B unknowns on 16384 cores on Hopper

[Figure: Computed surface velocity [m/yr] for the Greenland Ice Sheet]

*Funding source: PISCEES SciDAC Application (BER & ASCR)
FELIX authors: A. Salinger, I. Kalashnikova, M. Perego, R. Tuminaro [SNL], S. Price [LANL]

Implementation algorithm:

1) Replace array allocations with Kokkos::Views (in Host space):

   Kokkos::View<ScalarType****, Device> jacobian("Jacobian", worksetNumCells, numQPs, numDims, numDims);

2) Replace array accesses with Kokkos::View accesses, copying between Host and Device explicitly:

   B[k][j] = jacobian(i0, i1, k, j);
   Kokkos::deep_copy(jacobian, host_jacobian);

3) Replace functions with functors and run them in parallel on the Host.

4) Call Kokkos::parallel_for over the number of elements in the workset.

5) Set Device to Cuda, OpenMP, or Threads and run on the specified device:

   typedef Kokkos::Cuda    Device;  // or
   typedef Kokkos::OpenMP  Device;  // or
   typedef Kokkos::Threads Device;

[Figure: Finite Element Assembly flow. For each workset: copy the solution vector to the Device; run Kokkos::parallel_for over the elements in the workset; copy the residual vector back to the Host.]
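The workset flow in the figure above can be sketched as follows. This is a hedged illustration, not Albany's actual code: the names and sizes (numWorksets, numLocalDofs, the dummy per-cell kernel) are assumptions, and the host mirror views stand in for Albany's solution and residual vectors.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Illustrative sizes; Albany's real worksets come from the mesh.
    const int numWorksets     = 4;
    const int worksetNumCells = 1000;
    const int numLocalDofs    = 8000;

    Kokkos::View<double*> solution("solution", numLocalDofs);
    Kokkos::View<double*> residual("residual", numLocalDofs);

    // Host mirrors share the device views' layout, so deep_copy is a
    // direct memory transfer (and a no-op on purely host builds).
    auto h_solution = Kokkos::create_mirror_view(solution);
    auto h_residual = Kokkos::create_mirror_view(residual);

    for (int ws = 0; ws < numWorksets; ++ws) {
      // Copy the solution vector to the Device.
      Kokkos::deep_copy(solution, h_solution);

      // Kokkos::parallel_for over the elements in the workset
      // (a dummy kernel standing in for the FE evaluation).
      Kokkos::parallel_for("workset", worksetNumCells, KOKKOS_LAMBDA(const int cell) {
        const int dof = cell % numLocalDofs;
        Kokkos::atomic_add(&residual(dof), 0.5 * solution(dof));
      });

      // Copy the assembled residual vector back to the Host.
      Kokkos::deep_copy(h_residual, residual);
    }
  }
  Kokkos::finalize();
}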

Example: Compute Jacobian

Initial code:

for (int cell = 0; cell < worksetNumCells; cell++) {
  for (int qp = 0; qp < numQPs; qp++) {
    for (int row = 0; row < numDims; row++) {
      for (int col = 0; col < numDims; col++) {
        for (int node = 0; node < numNodes; node++) {
          jacobian(cell, qp, row, col) += coordVec(cell, node, row) * basisGrads(node, qp, col);
        } // node
      } // col
    } // row
  } // qp
} // cell

Kokkos implementation:

template <typename ScalarType, class DeviceType, int numQPs_, int numDims_, int numNodes_>
class compute_jacobian {
  Array3       basisGrads_;
  Array4       jacobian_;
  Array3_const coordVec_;

public:
  typedef DeviceType device_type;

  compute_jacobian(Array3 &basisGrads, Array4 &jacobian, Array3 &coordVec)
    : basisGrads_(basisGrads), jacobian_(jacobian), coordVec_(coordVec) {}

  // Called once per cell by Kokkos::parallel_for.
  KOKKOS_INLINE_FUNCTION
  void operator() (const std::size_t cell) const {
    for (int qp = 0; qp < numQPs_; qp++) {
      for (int row = 0; row < numDims_; row++) {
        for (int col = 0; col < numDims_; col++) {
          for (int node = 0; node < numNodes_; node++) {
            jacobian_(cell, qp, row, col) += coordVec_(cell, node, row) * basisGrads_(node, qp, col);
          } // node
        } // col
      } // row
    } // qp
  }
};

Kokkos::parallel_for(worksetNumCells,
    compute_jacobian<ScalarT, Device, numQPs, numDims, numNodes>(basisGrads, jacobian, coordVec));
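For comparison, the same kernel can be written with a lambda instead of a hand-written functor class in current Kokkos. This is a hedged sketch, assuming the Views and the extents numQPs, numDims, and numNodes are in scope as runtime values:

// Lambda form of the Jacobian kernel (sketch). The Views are captured
// by value, which is what KOKKOS_LAMBDA requires for device execution.
Kokkos::parallel_for("compute_jacobian", worksetNumCells,
  KOKKOS_LAMBDA(const int cell) {
    for (int qp = 0; qp < numQPs; qp++)
      for (int row = 0; row < numDims; row++)
        for (int col = 0; col < numDims; col++)
          for (int node = 0; node < numNodes; node++)
            jacobian(cell, qp, row, col) += coordVec(cell, node, row) * basisGrads(node, qp, col);
  });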

Performance Results*

*Notes:
•  "Fused" means that all the Kokkos kernels are fused together, whereas in the "Separated" version there are several independent kernels.
•  The "NVIDIA K20 GPU" and "Initial Implementation" results were obtained on Shannon; the "Intel MIC" results were obtained on Compton.


Evaluation environment:
•  Compton (Intel MIC cluster), 42 nodes, each with: two 8-core Sandy Bridge Xeon E5-2670 @ 2.6 GHz (HT enabled); 24 GB (3 x 8 GB) memory; two pre-production KNC (Intel MIC) coprocessors (57 cores each).
•  Shannon (NVIDIA GPU cluster), 32 nodes, each with: two 8-core Sandy Bridge Xeon E5-2670 @ 2.6 GHz (HT disabled); 128 GB DDR3 memory; two NVIDIA K20x GPUs.


[Figure: Graph of Finite Element Assembly kernels, Albany MiniDriver test code. Time per element (s, log scale, 1e-7 to 1e-3) vs. number of elements (10 to 10000) for NVIDIA K20, Intel Xeon Phi, and Intel Xeon.]

[Figure: Albany FELIX code. Performance evaluation results for the Kokkos implementation of the Albany/FELIX code: weak-scalability comparison of the Serial, CUDA, and Intel Xeon Phi implementations. a)-b) Performance results for Residual computations; c) performance results for Jacobian computations; d) total execution time for Finite Element Assembly in the FELIX code. Each panel plots time (s) vs. number of elements per workset (500 to 10500).]

References:
1) Edwards, H. Carter, Christian R. Trott, and Daniel Sunderland. Kokkos: Enabling manycore performance portability through polymorphic memory access patterns. Journal of Parallel and Distributed Computing, 2014.
2) Salinger, Andrew G. Component-based Scientific Application Development. December 1, 2012.
3) Willenbring, James M., and Michael Allen Heroux. Trilinos Users Guide. August 1, 2003.