CELLO
Enzo-P / Cello: Scalable Adaptive Mesh Refinement for Astrophysics and Cosmology
James Bordner¹, Michael L. Norman¹, Brian O'Shea²
¹ University of California, San Diego, San Diego Supercomputer Center
² Michigan State University, Department of Physics and Astronomy
Extreme Scaling Workshop 2012, Blue Waters / XSEDE
Extreme Scaling Workshop Enzo-P / Cello 16 July 2012 1 / 21
Enzo overview
Parallel astrophysics and cosmology
- implemented in C++ / Fortran
- approximately 150K SLOC
- parallelized using MPI / OpenMP

Vast range of scales
- astrophysical fluid dynamics
- hydrodynamic cosmology

Adaptive mesh refinement (AMR)

Growing development community
[ Norman et al ]
Enzo’s physics and algorithms
Eulerian hydrodynamics
- piecewise parabolic method (PPM)

Lagrangian dark matter
- particle-mesh method (PM)

Self-gravity
- FFTs on root-level grid
- multigrid on refinement patches

Local physics
- heating, cooling, chemistry, etc.

MHD, RHD (ray-tracing or implicit FLD)
[ John Wise ]
Enzo’s pursuit of scalability
Enzo born in the early 1990s
“Extreme” meant 100 processors
Continual scalability improvements
- MPI/OpenMP parallelism
- "neighbor-finding" algorithm
- I/O optimizations

Further improvement getting harder
- increasing scalability requirements
- easy improvements already made

Motivates concurrent rewriting
- Enzo-P: "petascale" Enzo fork
- Cello: AMR framework
[ Sam Skillman, Matt Turk ]
Enzo’s AMR data structure
[ figure: AMR hierarchy with root grid, root patches, and refinement patches ]
Patch-based SAMR
Each patch is a C++ object
Patches assigned to processes
N_P root patches, grid size ≈ 64³ or 128³
Refinement patches generally smaller
Refinement patches initially local to parent
Load balancing relocates refinement patches
Patch data (grids, particles) are distributed
AMR hierarchy structure is replicated
[ Tom Abel, John Wise, Ralf Kaehler ]
Enzo’s timestepping
Adaptive timestepping by level
parallel within a level
+ less computation, by O(N_L)
− reduced parallel efficiency
[ figure: space-time (x, t) diagram of level-adaptive timesteps dt0, dt0/2, ... over levels L = 0 to L = 3, with block updates numbered in execution order ]
EvolveLevel(L, dt_{L-1}):
  1 refresh level-L ghost zones
  2 compute timestep dt_L
  3 advance level L by dt_L
  4 EvolveLevel(L+1, dt_L)
  5 correct fluxes
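The recursive control flow above can be sketched in C++ (a minimal illustration, not Enzo source: grid data, ghost refresh, and flux correction are stubbed out, and the fixed 2:1 timestep ratio is an assumption):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch of level-adaptive timestepping in the style of EvolveLevel():
// each level subcycles with its own (smaller) timestep until it has
// advanced by its parent's timestep, recursing into finer levels.
struct Hierarchy {
    int max_level;                 // deepest refinement level present
    std::vector<int> steps_taken;  // per-level step counts, for inspection

    explicit Hierarchy(int lmax) : max_level(lmax), steps_taken(lmax + 1, 0) {}

    // Stub: assume a refined level is stable only at half the parent's dt.
    double ComputeTimestep(int level, double dt_parent) {
        return 0.5 * dt_parent;
    }

    void EvolveLevel(int level, double dt_parent) {
        if (level > max_level) return;
        double t = 0.0;
        while (t < dt_parent) {                  // subcycle within parent step
            // 1. refresh level-L ghost zones (stub)
            double dt = std::min(ComputeTimestep(level, dt_parent),
                                 dt_parent - t); // 2. don't overshoot parent time
            ++steps_taken[level];                // 3. advance level L by dt (stub)
            EvolveLevel(level + 1, dt);          // 4. recurse into finer level
            // 5. correct coarse fluxes from fine-level data (stub)
            t += dt;
        }
    }
};
```

With three levels and a 2:1 ratio, level L takes 2^(L+1) substeps per root step, matching the geometric growth in work the slide's O(N_L) note refers to.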
Enzo’s scaling issues
Memory usage
- AMR structure is non-scalable
- ghost zone layer three zones deep
- memory fragmentation

Data locality
- disrupted by load balancing

Parallel task definition
- widely varying patch sizes
- granularity determined by AMR

Parallel task scheduling
- parallel within a level
- synchronization between levels
[ Elizabeth Tasker ]
Talk outline
1 Enzo
  1 overview
  2 design
  3 scaling issues
2 Talk outline
3 Enzo-P / Cello
  1 overview
  2 design
  3 scaling solutions
  4 implementation using Charm++
  5 recursively generated parallel data structures
Enzo-P / Cello overview
Enzo-P intended to be “petascale Enzo”
Cello is a scalable AMR framework
Parallelism using Charm++
MPI as a backup
25K SLOC
Work in progress
- PPM HD / PPML MHD on distributed Cartesian grids
- prototyping Charm++ AMR implementations
Cello’s AMR data structure
[ figure: Forest, Tree, Patch, and Block decomposition ]
Tree-based SAMR
Patches define the unit of refinement
- cubical, varying size

Blocks define parallel tasks
- flexible size and shape

One Block per Charm++ chare

Tree structure optionally distributed
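One way to picture this Forest / Tree / Block decomposition is the sketch below (hypothetical field layout, not Cello's actual classes: the names follow the slide, everything else is illustrative):

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical sketch of tree-based SAMR: a Forest holds root-level
// Trees; each tree node is either refined (8 children) or a leaf that
// owns a Block, the fixed-size unit of parallel work.
struct Block {
    std::array<int, 3> size{};       // fixed grid dimensions, e.g. {16,16,16}
};

struct Node {
    std::vector<Node> children;      // empty (leaf) or 8 (refined) children
    std::unique_ptr<Block> block;    // data lives only at leaf nodes

    bool is_leaf() const { return children.empty(); }

    // Refine: replace this leaf's data with 8 finer child leaves.
    void refine() {
        children.resize(8);
        for (Node& c : children) c.block = std::make_unique<Block>(*block);
        block.reset();               // interior nodes carry no block data
    }

    int count_leaves() const {       // = number of parallel tasks in this tree
        if (is_leaf()) return 1;
        int n = 0;
        for (const Node& c : children) n += c.count_leaves();
        return n;
    }
};

struct Forest {                      // forest of root-level octrees
    std::vector<Node> roots;
};
```

In Cello each leaf Block would be a Charm++ chare rather than a local object, so the runtime can migrate it; the tree here only records the refinement topology.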
Cello’s timestepping
Optional adaptive timestepping
- by Patch or Block
- local synchronization
- parallelism between levels
[ figure: space-time (x, t) diagram of block-adaptive timesteps dt0, dt1, dt2 over levels L = 0 to L = 3, with block updates numbered in execution order ]
Quantized timesteps
+ avoids "sliver timesteps" dt ≈ 0
+ reduced numerical roundoff
− but dt's smaller than optimal
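One plausible quantization rule (an assumption; the slides do not give Cello's exact formula) rounds each block's stable timestep down to the nearest power-of-two fraction of the root timestep, so steps of neighboring blocks always nest evenly and no block is ever left with a sliver of time to catch up:

```cpp
#include <cassert>
#include <cmath>

// Quantize a stable timestep to the largest power-of-two fraction of the
// root timestep dt0 that does not exceed it: dt0, dt0/2, dt0/4, ...
// Illustrative sketch of one possible rule, not Cello's actual code.
double quantize_timestep(double dt0, double dt_stable) {
    if (dt_stable >= dt0) return dt0;   // never exceed the root timestep
    int k = static_cast<int>(std::ceil(std::log2(dt0 / dt_stable)));
    return dt0 / std::pow(2.0, k);      // largest dt0 / 2^k <= dt_stable
}
```

The cost noted on the slide is visible here: a block whose stable step is 0.3·dt0 is forced down to 0.25·dt0, slightly smaller than optimal.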
Cello's improvements to scaling: Reducing replicated AMR structure
Fewer replicated patches
- "patch-merging" technique truncates full subtrees
- ≈ 2 to 3× reduction
[ figure: original vs patch-merged tree ]

Smaller replicated patches
- Enzo: size of a grid object = 1544 bytes
- Cello: size of a Node object = 16 bytes
- ≈ 100× reduction!
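For scale, 16 bytes is just enough for a bare tree node. One hypothetical layout (the slides quote the size but not the fields, so this is illustrative only) is a child index plus ownership bookkeeping:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16-byte replicated AMR tree node.  A replicated node
// needs little more than where its children are and which process owns
// the corresponding block data; field names are invented for illustration.
struct Node {
    int64_t first_child;  // 8 bytes: index of first child Node (-1 = leaf)
    int32_t owner;        // 4 bytes: rank of process owning this node's data
    int32_t flags;        // 4 bytes: refinement level, leaf bit, etc.
};

static_assert(sizeof(Node) == 16,
              "fits the 16-byte per-node footprint quoted above");
```

Replicating 16 bytes per node instead of a full 1544-byte grid object is where the quoted ≈ 100× reduction in replicated-structure memory comes from.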
Cello's improvements to scaling: Distributed AMR structure
Forest of Trees
- each assigned a process range
- simple indexing
- load balancing issues

Space-filling curve
- improved load balancing
- requires global scan
- greater surface area
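A space-filling curve linearizes the blocks so that contiguous segments of the curve, and hence of the work, can be handed to each process while spatially nearby blocks stay nearby in the ordering. A minimal sketch using a Morton (Z-order) curve in 2D (the slides do not say which curve Cello uses):

```cpp
#include <cassert>
#include <cstdint>

// Morton (Z-order) index of a 2D block: interleave the bits of its
// (x, y) grid coordinates.  Sorting blocks by this key gives a
// space-filling-curve ordering; cutting the sorted list into P
// equal-work pieces is the load-balancing step.
uint64_t morton2d(uint32_t x, uint32_t y) {
    uint64_t key = 0;
    for (int bit = 0; bit < 32; ++bit) {
        key |= (uint64_t)((x >> bit) & 1u) << (2 * bit);      // even bits: x
        key |= (uint64_t)((y >> bit) & 1u) << (2 * bit + 1);  // odd bits:  y
    }
    return key;
}
```

The "requires global scan" caveat on the slide corresponds to the prefix sum over per-block work needed to find the P-way cut points of the sorted curve.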
Cello's improvements to scaling: Dynamically load balancing AMR block data
Dynamic Load Balancing
Use Charm++ load balancing
Space-filling curves
- equally distribute load
- maintain data locality
- no parent-child communication

Use measured performance
- computation
- memory usage
Cello's improvements to scaling: Parallel tasks
Task definition
Flexible choice of size / shape
Reduced size variability
+ constant grid Block sizes
− variable subcycling, # particles
[ figure: Enzo vs Cello task sizes ]
Task scheduling
Charm++: asynchronous, data-driven
Blocks advance when ghosts refreshed
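The data-driven rule above can be sketched as a dependency counter (an illustration of the idea, not Cello code): each Block counts incoming ghost-zone refresh messages and advances the moment the count reaches its number of neighbors, with no global barrier.

```cpp
#include <cassert>

// Sketch of dependency-driven task scheduling: a Block advances one
// timestep as soon as all of its neighbors' ghost-zone refreshes have
// arrived.  In Charm++ each receive_ghosts() call would be an entry
// method triggered asynchronously by a neighbor's message.
struct Block {
    int needed;        // neighbor ghost refreshes required per step
    int received = 0;  // refreshes received so far this step
    int step = 0;      // number of timesteps this block has completed

    explicit Block(int n) : needed(n) {}

    void receive_ghosts() {          // called once per neighbor message
        if (++received == needed) {  // all dependencies satisfied:
            ++step;                  // ...advance this block one timestep
            received = 0;            // and begin waiting for the next step
        }
    }
};
```

Because each Block keeps its own counter, fast regions of the mesh never wait at a level-wide synchronization point for slow ones.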
Cello's implementation using Charm++: Charm++ program structure
[ figure: a Charm++ program: chares Main, ChareA, ChareB, ChareC invoking entry methods pV() through pZ() ]
Charm++ program
- Charm++ objects are chares
- chares invoke entry methods
- chares communicate via messages

Charm++ runtime system
- maps chares to processors
- schedules entry methods
- migrates chares to load balance

Additional scalability features
- fault tolerance: checkpoint/restart
- dynamic load balancing
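This structure can be made concrete with a minimal Charm++ sketch (illustrative only, not Cello code: the module, chare, and method names are invented, and building it requires the Charm++ toolchain, whose charmc compiler turns the .ci interface file into the generated hello.decl.h / hello.def.h headers):

```cpp
// ---- hello.ci (Charm++ interface file; compiled by charmc) ------------
// mainmodule hello {
//   readonly CProxy_Main mainProxy;
//   mainchare Main {
//     entry Main(CkArgMsg*);
//     entry [reductiontarget] void done();
//   };
//   array [1D] Block {
//     entry Block();
//     entry void compute();
//   };
// };

// ---- hello.C (entry-method implementations) ---------------------------
#include "hello.decl.h"

CProxy_Main mainProxy;                 // readonly: visible on all processors

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    mainProxy = thisProxy;
    CProxy_Block blocks = CProxy_Block::ckNew(8);  // runtime places the 8 chares
    blocks.compute();                  // asynchronous broadcast; returns at once
  }
  void done() { CkExit(); }            // runs after every Block has contributed
};

class Block : public CBase_Block {
public:
  Block() {}
  Block(CkMigrateMessage*) {}          // required so the chare can migrate
  void compute() {                     // entry method, scheduled by the runtime
    CkPrintf("Block %d on PE %d\n", thisIndex, CkMyPe());
    contribute(CkCallback(CkReductionTarget(Main, done), mainProxy));
  }
};

#include "hello.def.h"
```

The pattern Cello relies on is all here: work objects (the Block array) outnumber processors, entry-method invocations are asynchronous messages, and the runtime decides placement and migration.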
Cello's implementation using Charm++: Charm++ collections of chares
Chare Arrays
- distributed array of chares
- migratable elements
- flexible indexing

Chare Groups
- one chare per processor (non-migratable)

Chare Nodegroups
- one chare per node (non-migratable)
Cello's implementation using Charm++: Three implementation strategies
1. Single chare array
- efficient: single access
- restricted Tree depth
[ figure: hierarchy (H) mapped onto one chare array of Tree (T), Patch (P), and Block (B) chares ]

2. Composite chare arrays
- "tree" of chare arrays
- less restricted Tree depth

3. Singleton chares
- no depth restrictions
- possible performance issues
- how to generate...?
Recursively generated parallel data structures: Generating a "software network"
1 Start with a single Seed
- conceptually the complete parallel data structure
- with a processor range

2 grow() spawns remote Seeds
- conceptually a partitioned parallel data structure
- with processor subranges
- Seeds interlinked

3 Recurse to individual elements
- final Seeds: the complete parallel data structure
- scaffolding: previous Seeds

[ figure legend: data structure, topology, root seed, scaffolding ]
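The recursion in steps 1 to 3 can be sketched as follows (a local, single-process illustration with invented names; in Cello the spawned Seeds would live on remote processors and retain parent / neighbor / child links):

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Sketch of recursive "software network" generation: a Seed covers a
// processor range; grow() splits the range and spawns child Seeds until
// each final Seed covers a single processor.  Everything is local here,
// to show the recursion; remote spawning is what Cello would add.
struct Seed {
    int lo, hi;                        // processor range [lo, hi)
    Seed* parent = nullptr;            // link used for reductions
    std::vector<std::unique_ptr<Seed>> children;  // links used for distributions

    Seed(int l, int h, Seed* p = nullptr) : lo(l), hi(h), parent(p) {}

    void grow() {
        if (hi - lo <= 1) return;      // final Seed: a single processor
        int mid = lo + (hi - lo) / 2;  // split the processor range in two
        children.push_back(std::make_unique<Seed>(lo, mid, this));
        children.push_back(std::make_unique<Seed>(mid, hi, this));
        for (auto& c : children) c->grow();
    }

    int count_final() const {          // number of single-processor Seeds
        if (children.empty()) return 1;
        int n = 0;
        for (const auto& c : children) n += c->count_final();
        return n;
    }
};
```

Each level of the recursion halves the range, which is where the ≈ O(log N) generation time quoted on the next slide comes from; the interior Seeds left behind are the "scaffolding".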
Recursively generated parallel data structures: Using the software network
Three types of Seed links
- parent: reductions
- neighbor: collaborations
- child: distributions

[ figure: a Seed with its parent Seed (reduce link), neighbor Seeds (neighbor links), and child Seeds (broadcast links) ]

Scalable
- links per node: O(1)
- generation: ≈ O(log N)

Load balancing
- space-filling curves
- hierarchical

Usable in Cello
- grids / octrees "easy"
- SeedGrid, SeedTree, etc.

Usable with MPI
- send encoded seed creation
- link: process rank + pointer
Summary
                 Enzo              Enzo-P / Cello
Parallelization  MPI               Charm++
SAMR             patch-based       tree-based
AMR structure    replicated        optionally distributed
Remote patches   1544 bytes        16 bytes
Timestepping     level-adaptive    block-adaptive
Block sizes      ×1000 variation   constant
Task scheduling  level-parallel    dependency-driven
Load balancing   patch migration   space-filling curves
Data locality    LB conflict       no LB conflict
http://cello-project.org
NSF PHY-1104819, AST-0808184