dubey_siamcse15

Hierarchical Resilience for Structured AMRAnshu Dubey, H. Fujita, Z. Rubenstein, B. Van Straalen and A. Chien

SIAM-CSE 2015

1n+

2

t10 2

levellevellevel

t

t

sync sync

sync

n+1

n

t

refinementlevel

Block Structured Local Refinement Refined regions are organized into logically-rectangular patches Refined grids are dynamically created and destroyed Refinement is performed in time as well as in space The depth of hierarchy depends upon the range of scales Mostly solve hyperbolics, some elliptic and some PIC

The Basics of AMR

Resiliency Through Rollback

This talk is mostly from the perspective of hyperbolics

Rollback and Recovery

The first step is to look at the whole application

Traditionally - checkpoint the entire state and reconstruct for restart Application determined state for transparent reconstruction Less data to save than system based checkpoint There have always been other reasons for checkpointing

Batch queue limits Steering the simulation Experimenting with parameter tweaking

Our first step was to utilize this capability to continue execution in face of resource degradation Restart on a fewer number of MPI ranks We partnered with GVR which saves and versions the state as

dictated by the application Incorporate restart without an abort

GVR – Global View Resilience

Two key ideas Multi-version, multi-stream distributed arrays: preserves application critical data,

enables flexible recovery from complex errors with fine-grain, localized manner Open resilience: cross-layer partnership for error handling, with unified error

handling interface

Our next step was to see if we can exploit the natural hierarchy of granularity in AMR for differentiated localized recovery and quantify the costs

Exploiting Granularities and Encapsulations

A Level provides coarsest granularity in the frameworkConvention – level below is coarser, level above is finer

A level knows All boxes at the level and their layout The mapping of boxes to processors Exchanges at the same level Fine coarse boundaries with the level below Local time step and the time step of the level below

Each level defines data on unions of rectangles Manages communications within itself During regridding carries out redistribution of boxes within

the level Also the reconstructs data into newly formed boxes from

the level below A level can do its own restart

GDS_level_0 GDS_level_2GDS_level_1

Mapping a level to an Array

Level n

Level n+1

Level n+2

P0

P1

P2

P3

P4

P5

P6

If we know that a level has no boxes on a processor, that level need not initiate a recovery

Level Based Recovery

Once an array has been constructed for a level, version increment is sufficient to save the state the next time

One wrinkle – versioning applies only as long as there is no regridding

For regridding current array has to be destroyed and new one constructedThis is one of the todo’s for GVR – allow

expansion and contraction during versioning In general versioning is cheaper than constructing a

new arrayWe save the Box layout, the data for each box, and in

the metadata part a pointer to where the data of each box lies

Exploiting Granularities and Encapsulations

The next layer of granularity is box within a level Box with its surrounding halo of ghost cells is a completely defined

computational region Iterators go through a list of local boxes and apply the same

operator to them All boxes of one level have same dx and same dt All boxes mapped to a processor may not be at the same level

Different boxes on the same processor have different dt and the operator maybe applied to them a different number of times.

The way we construct arrays gives us box level granularity for free

Failure Scenarios and Recovery Modes

We already discussed the node failure

Data corruption, recovery at one level

Read in the metadata to get the pointer to the data of the corrupted patch

Read in the data for corrupted patch Roll back to the beginning of the level’s time step and re-compute

Meta-data corruption: read in the meta-data

Roll-back may not be needed

Data corruption – recovery at multiple levels

Cascading effect- recovery may be needed in all the finer levels (may not always be needed though) All finer levels roll-back to the starting timestep of the affected level May get costlier as affected level gets coarser

Still cheaper than global recovery because of no regridding

dubey_siamcse15

Science