dubey_siamcse15
TRANSCRIPT
![Page 1: Dubey_SIAMCSE15](https://reader035.vdocuments.net/reader035/viewer/2022071815/55a84f641a28abbf318b4580/html5/thumbnails/1.jpg)
Hierarchical Resilience for Structured AMRAnshu Dubey, H. Fujita, Z. Rubenstein, B. Van Straalen and A. Chien
SIAM-CSE 2015
![Page 2: Dubey_SIAMCSE15](https://reader035.vdocuments.net/reader035/viewer/2022071815/55a84f641a28abbf318b4580/html5/thumbnails/2.jpg)
1n+
2
t10 2
levellevellevel
t
t
sync sync
sync
n+1
n
t
refinementlevel
Block Structured Local Refinement Refined regions are organized into logically-rectangular patches Refined grids are dynamically created and destroyed Refinement is performed in time as well as in space The depth of hierarchy depends upon the range of scales Mostly solve hyperbolics, some elliptic and some PIC
The Basics of AMR
Resiliency Through Rollback
This talk is mostly from the perspective of hyperbolics
![Page 3: Dubey_SIAMCSE15](https://reader035.vdocuments.net/reader035/viewer/2022071815/55a84f641a28abbf318b4580/html5/thumbnails/3.jpg)
Rollback and Recovery
The first step is to look at the whole application
Traditionally - checkpoint the entire state and reconstruct for restart Application determined state for transparent reconstruction Less data to save than system based checkpoint There have always been other reasons for checkpointing
Batch queue limits Steering the simulation Experimenting with parameter tweaking
Our first step was to utilize this capability to continue execution in face of resource degradation Restart on a fewer number of MPI ranks We partnered with GVR which saves and versions the state as
dictated by the application Incorporate restart without an abort
![Page 4: Dubey_SIAMCSE15](https://reader035.vdocuments.net/reader035/viewer/2022071815/55a84f641a28abbf318b4580/html5/thumbnails/4.jpg)
GVR – Global View Resilience
Two key ideas Multi-version, multi-stream distributed arrays: preserves application critical data,
enables flexible recovery from complex errors with fine-grain, localized manner Open resilience: cross-layer partnership for error handling, with unified error
handling interface
Our next step was to see if we can exploit the natural hierarchy of granularity in AMR for differentiated localized recovery and quantify the costs
![Page 5: Dubey_SIAMCSE15](https://reader035.vdocuments.net/reader035/viewer/2022071815/55a84f641a28abbf318b4580/html5/thumbnails/5.jpg)
Exploiting Granularities and Encapsulations
A Level provides coarsest granularity in the frameworkConvention – level below is coarser, level above is finer
A level knows All boxes at the level and their layout The mapping of boxes to processors Exchanges at the same level Fine coarse boundaries with the level below Local time step and the time step of the level below
Each level defines data on unions of rectangles Manages communications within itself During regridding carries out redistribution of boxes within
the level Also the reconstructs data into newly formed boxes from
the level below A level can do its own restart
![Page 6: Dubey_SIAMCSE15](https://reader035.vdocuments.net/reader035/viewer/2022071815/55a84f641a28abbf318b4580/html5/thumbnails/6.jpg)
GDS_level_0 GDS_level_2GDS_level_1
Mapping a level to an Array
![Page 7: Dubey_SIAMCSE15](https://reader035.vdocuments.net/reader035/viewer/2022071815/55a84f641a28abbf318b4580/html5/thumbnails/7.jpg)
Level n
Level n+1
Level n+2
P0
P1
P2
P3
P4
P5
P6
If we know that a level has no boxes on a processor, that level need not initiate a recovery
![Page 8: Dubey_SIAMCSE15](https://reader035.vdocuments.net/reader035/viewer/2022071815/55a84f641a28abbf318b4580/html5/thumbnails/8.jpg)
Level Based Recovery
Once an array has been constructed for a level, version increment is sufficient to save the state the next time
One wrinkle – versioning applies only as long as there is no regridding
For regridding current array has to be destroyed and new one constructedThis is one of the todo’s for GVR – allow
expansion and contraction during versioning In general versioning is cheaper than constructing a
new arrayWe save the Box layout, the data for each box, and in
the metadata part a pointer to where the data of each box lies
![Page 9: Dubey_SIAMCSE15](https://reader035.vdocuments.net/reader035/viewer/2022071815/55a84f641a28abbf318b4580/html5/thumbnails/9.jpg)
Exploiting Granularities and Encapsulations
The next layer of granularity is box within a level Box with its surrounding halo of ghost cells is a completely defined
computational region Iterators go through a list of local boxes and apply the same
operator to them All boxes of one level have same dx and same dt All boxes mapped to a processor may not be at the same level
Different boxes on the same processor have different dt and the operator maybe applied to them a different number of times.
The way we construct arrays gives us box level granularity for free
![Page 10: Dubey_SIAMCSE15](https://reader035.vdocuments.net/reader035/viewer/2022071815/55a84f641a28abbf318b4580/html5/thumbnails/10.jpg)
Failure Scenarios and Recovery Modes
We already discussed the node failure
Data corruption, recovery at one level
Read in the metadata to get the pointer to the data of the corrupted patch
Read in the data for corrupted patch Roll back to the beginning of the level’s time step and re-compute
Meta-data corruption: read in the meta-data
Roll-back may not be needed
Data corruption – recovery at multiple levels
Cascading effect- recovery may be needed in all the finer levels (may not always be needed though) All finer levels roll-back to the starting timestep of the affected level May get costlier as affected level gets coarser
Still cheaper than global recovery because of no regridding