Performance Modeling of a Dependency-Driven Mini-App for Climate

Bryce Lelbach (blelbach@cct.lsu.edu), Hartmut Kaiser (Louisiana State University), Hans Johansen (LBNL, CRD, [email protected])

Motivation

Complex science applications often have intricate structures that weave together numerical and computational software with multi-scale physics components.

As a motivating example, LBNL is developing an adaptive atmospheric dynamical core (“dycore”) in the Chombo framework (http://chombo.lbl.gov) for high-accuracy global climate simulations.

Chombo is a high-order, massively scalable finite-volume framework for robust and accurate solution of partial differential equations in arbitrary geometry. Chombo also provides block-structured adaptive mesh refinement (AMR) to allow feature isolation and tracking; for climate applications, this could include weather prediction.

Fig 2: Tropical cyclones Lee and Katia, 6 September 2011, with tracking trajectories.

Fig 3: Lat-lon plot (2D projection) of a shallow water equation solution with dynamic AMR. Grid lines show block-structured boxes, 3 refinement levels.

Research Challenge

Adaptive mesh refinement has been successfully employed on a suite of 2D and 3D climate-related test problems to verify accuracy and robustness of the algorithm. However, the complexity of the algorithm makes it difficult to create an idealized model for parallel scaling. Instead, we express the model as dependencies in a representative mini-app.

HPX: Imperative vs. Functional

Fig 5: Difference stencils include “ghost cell” regions on neighboring boxes. HPX decomposes dependencies into swarms of tasks (dependencies, computation, and asynchronous communication).

HPX is a parallel programming framework designed to facilitate asynchronous (functional) programming on many-core and traditional architectures. HPX is an alternative to MPI+OpenMP (imperative), and provides an implementation of the concurrency facilities of the ISO C++11 standard.

HPX uses dependencies (“futures”) for expressing distributed communication between and within nodes, using millions of lightweight tasks per hardware core. On most architectures, HPX can schedule tasks with sub-microsecond overheads, enabling fine-grained parallelism.

HPX’s programming model (a minimal code sketch follows this list):
• “Future”: a proxy for a dependency that is needed by another task.
• Composable: futures can express control flows and work queues through their dependencies.
• Dynamic Load Balancing: work is scheduled as dependencies complete, with a work-stealing thread scheduler.
• Unified APIs: remote communication and local in-memory work use the same object-oriented API for dependencies.
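As a minimal sketch of the future/continuation pattern above (assuming a standard HPX installation; produce and consume are made-up placeholder tasks, not HPX or mini-app functions):

    #include <hpx/hpx_main.hpp>
    #include <hpx/hpx.hpp>
    #include <iostream>

    int produce()          { return 42; }        // stand-in for any local or remote task
    int consume(int value) { return value + 1; } // may run only after produce() completes

    int main()
    {
        // async returns immediately with a future: a proxy for produce()'s result.
        hpx::future<int> f = hpx::async(produce);

        // then() attaches a continuation; the work-stealing scheduler runs it
        // (possibly on another worker thread) once the dependency is ready.
        hpx::future<int> g = f.then(
            [](hpx::future<int> r) { return consume(r.get()); });

        std::cout << g.get() << "\n";   // prints 43
        return 0;
    }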

Climate Mini-App Description

The Climate mini-app is a parameter-driven application that serves as a proxy for computation and communication patterns in the AMR Climate code (a code sketch of the per-box dependencies follows Fig 1 below):
• Uses explicit and vertical implicit block-structured discretizations
• Explicit operator is compute- or memory-bound depending on its ghost cells and stencil footprint
• Vertical implicit operator is compute-bound by 1D non-linear solves, with no ghost cells
• ARK4 time integration expresses complex implicit-explicit coupling dependencies
• AMR time refinement creates another layer of dependencies

Fig 1: Global grid with block-structured AMR.
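The per-box dependency pattern described above could be expressed with HPX futures roughly as follows (a sketch only, assuming a standard HPX installation; Box, receive_ghosts, apply_explicit, and apply_implicit are illustrative placeholders, not names from the mini-app):

    #include <hpx/hpx_main.hpp>
    #include <hpx/hpx.hpp>
    #include <vector>

    struct Box { std::vector<double> data; };

    Box receive_ghosts(int neighbor)                   { return Box{}; }  // asynchronous communication
    Box apply_explicit(Box b, std::vector<Box> const&) { return b; }      // horizontal stencil
    Box apply_implicit(Box b)                          { return b; }      // 1D vertical solves

    int main()
    {
        Box box;

        // Ghost exchanges with (here) two neighboring boxes, launched asynchronously.
        std::vector<hpx::future<Box>> ghosts;
        ghosts.push_back(hpx::async(receive_ghosts, 0));
        ghosts.push_back(hpx::async(receive_ghosts, 1));

        // The explicit horizontal operator runs only once every ghost region has arrived.
        hpx::future<Box> explicit_part =
            hpx::when_all(std::move(ghosts)).then(
                [box](hpx::future<std::vector<hpx::future<Box>>> ready) {
                    std::vector<Box> halos;
                    for (auto& f : ready.get()) halos.push_back(f.get());
                    return apply_explicit(box, halos);
                });

        // The vertical implicit solve needs no ghost cells; it simply follows
        // the explicit contribution for this box.
        hpx::future<Box> updated =
            explicit_part.then([](hpx::future<Box> e) { return apply_implicit(e.get()); });

        updated.get();
        return 0;
    }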

Speedup on many-core Xeon Phi

[Fig 6 plot: Walltime [s] and # of Boxes vs. Max Box Size; Intel Xeon Phi SE10P, 240 cores.]

Fig 6: This graph shows the tradeoff between more, smaller boxes (less work per box, more communication) and efficiency (larger boxes, more work per communication).
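As a rough illustration of why smaller boxes cost proportionally more (the cubic boxes and ghost width g = 4 below are assumptions for illustration, not parameters taken from the mini-app): a box with B^3 interior cells and g ghost cells on each face touches (B + 2g)^3 cells, of which only B^3 are interior work, so

    ghost fraction = 1 - B^3 / (B + 2g)^3 ≈ 70% for B = 16, 49% for B = 32, 30% for B = 64.

Smaller boxes therefore spend a much larger share of their footprint on filling, copying, and exchanging ghost cells.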

Results for Intel Ivybridge

[Fig 8 plot: Breakdown of Walltime (% of walltime), Intel Ivybridge E5-2670 v2, 10 cores. Categories: Vertical Implicit Solve, Horizontal Explicit Operator, Horizontal Refluxing, ARK4 Overheads, Ghost Zone Exchanges, Array Copies, Parallel Overheads. Series: 57.6 Boxes/Core (Excessive Synchronization), 7.2 Boxes/Core, 0.9 Boxes/Core (Insufficient Parallelism).]

Fig 8: Even a single parameter, such as the # of boxes per core, can affect scaling dramatically. This shows the balance between parallel overheads (~5% indicates good parallelism for the middle case) and useful work (numerical calculations).
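A quick worked reading of those per-core figures (the total box counts simply multiply by the 10 cores):

    0.9 boxes/core × 10 cores =   9 boxes → fewer boxes than cores, so some cores sit idle (insufficient parallelism)
    7.2 boxes/core × 10 cores =  72 boxes → several boxes in flight per core, with parallel overheads near the ~5% level above
    57.6 boxes/core × 10 cores = 576 boxes → very little work per box, so synchronization overheads grow (the excessive-synchronization case)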

Take-away: Unlike traditional processors, Xeon Phi is an in-order execution, many-core architecture. Each Xeon Phi core has 4 hardware threads (240 in total), and every cycle the processor context-switches to a new thread (barrel threading). Each Phi core is slower but enables more hardware thread-parallelism, if an application can take advantage of it. The ring-bus topology and a distributed hash table manage global caches, so core-cache locality is complex, which creates programming complexity. But higher memory bandwidth and SIMD vector instructions improve performance (e.g. Intel MKL for LAPACK).

Future Research Directions

Future research topics will focus on predictive application performance on Xeon Phi:
• Include space-time AMR dependencies, and higher-order FLOP-intensive stencils
• Model different domain decomposition strategies (vertical, polyhedral, etc.)
• Evolve to representative kernels for the full Climate app performance behavior, including calls to complex “physics” routines
• Compare flat MPI, MPI+OpenMP, and the HPX version on Xeon Phi
• Use HPX’s built-in performance monitoring to optimize performance for Xeon Phi

Scientific applications include:
• Greater resolution of dynamic features (squall lines, atmospheric rivers)
• Regional weather phenomena (tropical cyclones, coastal climate)
• Grid refinement studies for new physics (cloud parameterizations)
• Evaluation of time discretization errors from 1D operator splitting (“column physics”)

[Fig 4 diagram: ARK4 stages, from RK stage 1 (= time n) through stages 2-6 to time n+1, with explicit and implicit components at each stage.]

Fig 4: Space and time refinement with the higher-order ARK4 ImEx integrator creates complex communication and computation dependencies.

[Diagram labels from Figs 4-5: per-box task flow (exchange ghost cells → calculate fluxes → exchange fluxes → calculate operator) over stencil points x_{j-2} … x_{j+2} of Box 1 and Box 2 at time steps t_i, t_{i+1}, t_{i+2}; AMR time refinement across Level 0 (domain), Level 1, and Level 2 between time n and time n+1, with fine-to-coarse (f→c) and coarse-to-fine (c→f) coupling.]
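To make the stage coupling in Figs 4-5 concrete, here is a structural sketch of chaining ARK4 stage dependencies with futures, assuming a standard HPX installation; State, exchange_ghosts, explicit_op, and implicit_solve are made-up placeholders, and a real ARK4 stage depends on all earlier stage contributions, not just the immediately preceding one shown here:

    #include <hpx/hpx_main.hpp>
    #include <hpx/hpx.hpp>
    #include <iostream>

    struct State { double u = 0.0; };                        // placeholder per-box state

    State exchange_ghosts(State s) { return s; }             // asynchronous communication stand-in
    State explicit_op(State s)     { s.u += 1.0; return s; } // horizontal stencil stand-in
    State implicit_solve(State s)  { return s; }             // 1D vertical solve stand-in

    int main()
    {
        // Stage 1 is simply the state at time n.
        hpx::future<State> stage = hpx::make_ready_future(State{});

        // Stages 2..6: each explicit evaluation waits on a ghost exchange of the
        // previous stage, and the implicit solve waits on that explicit result.
        for (int s = 2; s <= 6; ++s)
        {
            stage = stage
                .then([](hpx::future<State> prev) {
                    return explicit_op(exchange_ghosts(prev.get()));
                })
                .then([](hpx::future<State> expl) {
                    return implicit_solve(expl.get());
                });
        }

        std::cout << "reached time n+1, u = " << stage.get().u << "\n";
        return 0;
    }

In the full mini-app such chains exist per box, per stage, and per AMR level, which is where the “swarms of tasks” in Fig 5 come from.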

0"

60"

120"

180"

240"

0" 60" 120" 180" 240"

Speedu

p&

HW&Threads&

Speedup&vs&HW&Threads&Intel&Xeon&Phi&SE10P,&240&max&

Measured"

Theore2cal"

Fig 7: The HPX task scheduler has very low, stable overheads. Using asynchronous data communication and Intel’s MKL, we achieve 72% of theoretical speedup on Xeon Phi’s 240 hardware threads.

For More Information

Contact Bryce or Hans via email, or see the poster PDF at http://stellar.cct.lsu.edu/pubs/LelbachSummerPoster-7Aug2014.pdf for more.

With support from the Intel Parallel Computing Center @ NERSC, a DOE Science Facility.