Fault Tolerant Extensions to Charm++ and AMPI, presented by Sayantan Chakravorty (with Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi)


TRANSCRIPT

Fault Tolerant Extensions to Charm++ and AMPI
Presented by Sayantan Chakravorty
Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi

Parallel Programming Laboratory, University of Illinois at Urbana-Champaign

Outline
Motivation
Background
Solutions:
  Coordinated checkpointing
  In-memory double checkpoint
  Sender-based message logging
  Processor evacuation in response to fault prediction (new work)


Motivation
As machines grow in size, MTBF decreases
Applications have to tolerate faults
Applications need fast, low-cost, and scalable fault tolerance support
Modern hardware is making fault prediction possible: temperature sensors, PAPI-4, SMART
Paper on detection tomorrow


Background
Checkpoint-based methods
  Coordinated: blocking [Tamir84], non-blocking [Chandy85]
    CoCheck, Starfish, CLIP: fault-tolerant MPI
  Uncoordinated: suffers from rollback propagation
  Communication-induced [Briatico84]: doesn't scale well
Log-based methods
  Pessimistic: MPICH-V1 and V2, SBML [Johnson87]
  Optimistic [Strom85]: unbounded rollback, complicated recovery
  Causal logging [Elnozahy93]: complicated causality tracking and recovery; Manetho, MPICH-V3


Multiple Solutions in Charm++
Reactive: react to a fault
  Disk-based checkpoint/restart
  In-memory double checkpoint/restart
  Sender-based message logging
Proactive: react to a fault prediction
  Evacuate processors that are warned


Checkpoint/Restart Mechanism
Blocking coordinated checkpoint
The state of the chares is checkpointed to disk
Collective call: MPI_Checkpoint(DIRNAME) (see the usage sketch below)
The entire job is restarted
Virtualization allows restarting on a different number of PEs
Runtime option: ./charmrun pgm +p4 +vp16 +restart DIRNAME
Simple but effective for common cases
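The collective checkpoint call from this slide fits into an ordinary AMPI time-stepping loop. Below is a minimal sketch: MPI_Checkpoint(DIRNAME) and the +restart runtime option come from the slide, but the exact C signature of the call and the 20-step checkpoint interval are assumptions for illustration, not the definitive API.

    /* Hedged sketch: periodic disk checkpointing from an AMPI program.
     * MPI_Checkpoint and the +restart option are taken from the slide;
     * the const char* argument and the interval shown are assumed. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        for (int step = 0; step < 100; step++) {
            /* ... one iteration of the application's main loop ... */

            /* Blocking coordinated checkpoint: every virtual processor
             * enters the collective and chare state is written to disk. */
            if (step % 20 == 0)
                MPI_Checkpoint("DIRNAME");
        }

        MPI_Finalize();
        return 0;
    }

After a crash the whole job is restarted from the checkpoint directory, possibly on a different number of physical processors thanks to virtualization, e.g. ./charmrun pgm +p4 +vp16 +restart DIRNAME.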


Drawbacks
Disk-based coordinated checkpointing is slow
The job needs to be restarted
Requires user intervention
Impractical in the case of frequent faults


In-memory Double Checkpoint
In-memory checkpoint: faster than disk
Coordinated checkpoint: simple; the user can decide what makes up useful state (see the sketch below)
Double checkpointing: each object maintains 2 checkpoints, on:
  the local physical processor
  a remote buddy processor
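In Charm++/AMPI the "useful state" the user chooses to save is expressed through the object's pup (pack/unpack) routine, which the double-checkpoint mechanism uses to serialize each object, once into local memory and once into the buddy processor's memory. A minimal sketch, assuming a hypothetical chare class JacobiChunk; the .ci interface file and the rest of the program are omitted.

    // Hedged sketch: the pup() routine defines what state goes into the
    // two in-memory checkpoints.  JacobiChunk and its members are
    // hypothetical; PUP::er is the standard Charm++ serialization interface.
    class JacobiChunk : public CBase_JacobiChunk {
        int     n;            // local block dimension
        double *temperature;  // n*n block of the Jacobi grid
    public:
        JacobiChunk() : n(0), temperature(NULL) {}
        JacobiChunk(CkMigrateMessage *m) {}   // needed when objects are restored

        void pup(PUP::er &p) {
            CBase_JacobiChunk::pup(p);        // base-class bookkeeping
            p | n;
            if (p.isUnpacking())              // allocate before reading back
                temperature = new double[n * n];
            PUParray(p, temperature, n * n);
        }
    };

Only the data pup'ed here is copied to the local and buddy checkpoints, so transient buffers that can be recomputed need not be saved.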


Restart
A dummy process is created:
  Need not have application data or a checkpoint
  Necessary for the runtime
  Starts recovery on all other PEs
Other processors:
  Remove all chares
  Restore checkpoints lost on the crashed PE
  Restore chares from local checkpoints
Load balance after restart


Overhead Evaluation
Jacobi (200 MB data size) on up to 128 processors, Myrinet; 8 checkpoints in 100 steps
Total execution time (s) by number of processors:

Processors | Normal Charm++/AMPI | FT-Charm++ w/o checkpointing | FT-Charm++ with checkpointing
         4 |          160.116529 |                   160.892814 |                    175.228279
         8 |           87.687250 |                    90.547895 |                     96.139557
        16 |           52.757971 |                    53.956612 |                     56.801147
        32 |           35.234483 |                    36.026838 |                     37.400720
        64 |           26.149297 |                    26.760844 |                     27.710856
       128 |           22.098567 |                    22.070103 |                     22.975914


Recovery Performance
LeanMD application
10 crashes
128 processors
Checkpoint every 10 time steps


Drawbacks
High memory overhead
Checkpoint/rollback doesn't scale:
  All nodes are rolled back just because one crashed
  Even nodes independent of the crashed node are restarted
Restart cost is similar to the checkpoint period
Blocking coordinated checkpoint requires user intervention


Sender-based Message Logging
Message logging: store message logs on the sender
Asynchronous checkpoints:
  Each processor has a buddy processor
  Stores its checkpoint in the buddy's memory
Restart: a processor from an extra pool
  Recreate only the objects on the crashed processor
  Play back the logged messages
  Restores state to that after the last processed message
Processor virtualization can speed it up


Message Logging
The state of an object is determined by:
  the messages processed
  the sequence of processed messages
Protocol (illustrated in the sketch below):
  Sender logs the message and requests a ticket number (TN) from the receiver
  Receiver sends back the TN
  Sender stores the TN with the log and sends the message
  Receiver processes messages in order of TN
Processor virtualization complicates message logging:
  Messages to an object on the same processor need to be logged remotely
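The four protocol steps above can be illustrated with a small, self-contained C++ sketch. The Sender and Receiver classes and the ticket bookkeeping below are hypothetical stand-ins for the runtime's actual data structures; the point is only to show how TN ordering makes replay deterministic.

    // Hedged sketch of the ticket-number (TN) protocol described above.
    // All names are hypothetical; this is not the Charm++ implementation.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct Message { std::string payload; uint64_t tn; };

    class Receiver {
        uint64_t nextTicket = 0;                 // next TN to hand out
        uint64_t nextToRun  = 0;                 // next TN allowed to execute
        std::map<uint64_t, Message> pending;
    public:
        uint64_t grantTN() { return nextTicket++; }          // "receiver sends back TN"
        void deliver(const Message &m) {
            pending[m.tn] = m;
            while (pending.count(nextToRun)) {               // process in TN order
                std::cout << "processing " << pending[nextToRun].payload << "\n";
                pending.erase(nextToRun++);
            }
        }
    };

    class Sender {
        std::vector<Message> log;                // log kept on the sender's side
    public:
        void send(Receiver &r, const std::string &payload) {
            Message m{payload, r.grantTN()};     // log the message, request a TN
            log.push_back(m);                    // store the TN with the log
            r.deliver(m);                        // then send
        }
        // Recovery: a freshly created receiver replays the log in TN order,
        // restoring its state up to the last processed message.
        void replay(Receiver &fresh) const { for (const Message &m : log) fresh.deliver(m); }
    };

    int main() {
        Receiver r; Sender s;
        s.send(r, "A"); s.send(r, "B");
        Receiver restarted;                      // stand-in for the restarted PE
        s.replay(restarted);
    }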


Parallel Restart
Message logging allows fault-free processors to continue with their execution
However, sooner or later some processors start waiting for the crashed processor
Virtualization allows us to move work from the restarted processor to the waiting processors (see the placement sketch below)
Chares are restarted in parallel
Restart cost can be reduced
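As a rough illustration of the idea (the status slide notes parallel restart is not yet implemented), the recovered objects could be spread round-robin over processors that are already blocked waiting on the crashed one. All names and the policy below are assumptions, not the runtime's algorithm.

    // Hedged sketch: planning where restored chares go during a parallel
    // restart.  ObjectCkpt, planParallelRestart, and the round-robin policy
    // are illustrative assumptions only.
    #include <cstddef>
    #include <utility>
    #include <vector>

    struct ObjectCkpt { int objectId; /* ... pup'ed object state ... */ };

    // Map each object that lived on the crashed PE to one of the processors
    // currently waiting for it, instead of piling them all onto one new PE.
    std::vector<std::pair<int, int>> planParallelRestart(
            const std::vector<ObjectCkpt> &lost,
            const std::vector<int>        &waitingPes) {
        std::vector<std::pair<int, int>> placement;   // (objectId, destination PE)
        for (std::size_t i = 0; i < lost.size(); ++i)
            placement.push_back({lost[i].objectId,
                                 waitingPes[i % waitingPes.size()]});
        return placement;
    }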


Present Status
Most of Charm++ has been ported
Support for migration has not yet been implemented in the fault-tolerant protocol
AMPI has been ported
Parallel restart is not yet implemented


Recovery Performance
Execution time with faults:

Number of faults | Execution time (s)
               0 |                506
               1 |                551
               2 |                606
               4 |                640
               5 |                671
               6 |                680
               7 |                725


Pros and Cons
Low overhead for jobs with low communication
Currently high overhead for jobs with high communication
Should be tested with a high virtualization ratio to reduce the message-logging overhead


Processor Evacuation
Modern hardware can be used to predict faults
Runtime system response:
  Low response time
  No new processors should be required
  Efficiency loss should be proportional to the loss in computational power


Solution
Migrate Charm++ objects off the processor (see the migration sketch below)
  Requires remapping of the objects' home PEs
  Point-to-point message delivery continues to work efficiently
Collective operations cope with the loss of processors:
  Rewire the reduction tree around a warned processor
  Can deal with multiple simultaneous warnings
Load balance after an evacuation
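Migration itself uses the standard Charm++ mechanism for movable objects. A minimal sketch, assuming a hypothetical chare array element Worker and a hypothetical onEvacuationWarning() entry method; the real evacuation path also remaps the objects' home PEs, which is not shown here.

    // Hedged sketch: evacuating one array element off a warned processor.
    // migrateMe(), CkMyPe(), and CkNumPes() are standard Charm++ calls;
    // Worker, onEvacuationWarning(), and the destination policy are
    // assumptions for illustration.  The .ci interface file is omitted.
    class Worker : public CBase_Worker {
    public:
        Worker() {}
        Worker(CkMigrateMessage *m) {}           // required for migratable objects

        void onEvacuationWarning() {
            // Naive destination choice; a real policy would avoid other
            // warned processors and consider load balance.
            int dest = (CkMyPe() + 1) % CkNumPes();
            migrateMe(dest);                     // state is pup'ed and moved
        }

        void pup(PUP::er &p) {
            CBase_Worker::pup(p);
            // ... pup the application state that must survive the move ...
        }
    };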


Rearrange the Reduction Tree
Do not rewire the tree while a reduction is going on:
  Stop reductions
  Rewire the tree
  Continue reductions
Affects only the parent and children of a node (see the sketch below)
Unbalances the tree: could be solved by recreating the tree
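The rewiring step only touches the warned node's parent and children. A minimal sketch over an array-based tree, with the representation and function name assumed; it presumes reductions have already been paused and the warned processor is not the root.

    // Hedged sketch: re-parent the warned node's children to its parent,
    // then detach the warned node.  The parent[] representation and the
    // function name are assumptions for illustration.
    #include <vector>

    // parent[p] is the reduction-tree parent of processor p; the root has -1.
    // Assumes 'warned' is not the root and reductions are currently stopped.
    void rewireAround(std::vector<int> &parent, int warned) {
        int grandparent = parent[warned];
        for (int p = 0; p < static_cast<int>(parent.size()); ++p)
            if (parent[p] == warned)
                parent[p] = grandparent;   // children adopt the grandparent
        parent[warned] = warned;           // warned node leaves the tree
    }

As the slide notes, repeated rewiring can unbalance the tree, which can be fixed by recreating it once the affected reductions have drained.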


Response Time
Evacuation time for a Sweep3d execution on the 150^3 case
Total of ~500 MB of data
Pessimistic estimate of the evacuation time


Performance after Evacuation
Iteration time of Sweep3d on 32 processors for the 150^3 problem with 1 warning


Processor Utilization after Evacuation
Iteration time of Sweep3d on 32 processors for the 150^3 problem, with both processors on node 3 (processors 4 and 5) being warned simultaneously


Conclusions
Available in Charm++ and AMPI:
  Checkpoint/restart
  In-memory checkpoint/restart
  Proactive fault tolerance
Under development:
  Sender-based message logging: deal with migration, deletion
  Parallel restart
The abstraction layers in Charm++/AMPI make them well suited for implementing fault tolerance protocols