Fault Tolerant Extensions to Charm++ and AMPI
presented by Sayantan Chakravorty
Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
Outline
- Motivation
- Background
- Solutions
  - Co-ordinated Checkpointing
  - In-memory Double Checkpoint
  - Sender-based Message Logging
  - Processor Evacuation in response to fault prediction: new work
Motivation
- As machines grow in size, MTBF decreases
  - Applications have to tolerate faults
- Applications need fast, low-cost, and scalable fault tolerance support
- Modern hardware is making fault prediction possible
  - Temperature sensors, PAPI-4, SMART
  - Paper on detection tomorrow
Background
- Checkpoint-based methods
  - Coordinated: Blocking [Tamir84], Non-blocking [Chandy85]
    - Co-check, Starfish, Clip fault-tolerant MPI
  - Uncoordinated: suffers from rollback propagation
  - Communication-induced [Briatico84]: doesn't scale well
- Log-based methods
  - Pessimistic: MPICH-V1 and V2, SBML [Johnson87]
  - Optimistic [Strom85]: unbounded rollback, complicated recovery
  - Causal logging [Elnozahy93]: complicated causality tracking and recovery; Manetho, MPICH-V3
Multiple Solutions in Charm++
- Reactive: react to a fault
  - Disk-based Checkpoint/Restart
  - In-memory Double Checkpointing/Restart
  - Sender-based Message Logging
- Proactive: react to a fault prediction
  - Evacuate processors that are warned
Checkpoint/Restart Mechanism
- Blocking co-ordinated checkpoint
  - State of chares is checkpointed to disk
  - Collective call MPI_Checkpoint(DIRNAME)
- The entire job is restarted
  - Virtualization allows restarting on a different number of PEs
  - Runtime option: > ./charmrun pgm +p4 +vp16 +restart DIRNAME
- Simple but effective for common cases
Drawbacks
- Disk-based coordinated checkpointing is slow
- The job needs to be restarted
- Requires user intervention
- Impractical in the case of frequent faults
In-memory Double Checkpoint
- In-memory checkpoint
  - Faster than disk
- Co-ordinated checkpoint
  - Simple
  - User can decide what makes up useful state
- Double checkpointing
  - Each object maintains 2 checkpoints, on:
    - the local physical processor
    - a remote buddy processor
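The buddy placement above can be sketched as follows. This is a minimal illustration, assuming a simple ring-neighbour buddy mapping; the function names and the mapping itself are hypothetical, not the actual Charm++ code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical buddy mapping for the double checkpoint: each PE's buddy is
// simply the next PE in a ring. The real Charm++ mapping may differ.
int buddyOf(int pe, int numPes) {
    return (pe + 1) % numPes;  // ring neighbour as buddy
}

// The two PEs holding the checkpoints of an object living on `pe`:
// its own (local) PE and the remote buddy PE.
std::vector<int> checkpointHolders(int pe, int numPes) {
    return { pe, buddyOf(pe, numPes) };
}
```

With this scheme a single PE crash never destroys both copies, since an object's two checkpoints always sit on distinct PEs.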
Restart
- A dummy process is created:
  - Need not have application data or checkpoint
  - Necessary for the runtime
  - Starts recovery on all other PEs
- Other processors:
  - Remove all chares
  - Restore checkpoints lost on the crashed PE
  - Restore chares from local checkpoints
- Load balance after restart
Overhead Evaluation
- Jacobi (200 MB data size) on up to 128 processors
- 8 checkpoints in 100 steps
Total execution time (s) on Myrinet:

Number of processors | Normal Charm++/AMPI | FT-Charm++ w/o checkpointing | FT-Charm++ with checkpointing
4   | 160.1 | 160.9 | 175.2
8   | 87.7  | 90.5  | 96.1
16  | 52.8  | 54.0  | 56.8
32  | 35.2  | 36.0  | 37.4
64  | 26.1  | 26.8  | 27.7
128 | 22.1  | 22.1  | 23.0
Recovery Performance
- LeanMD application
- 10 crashes
- 128 processors
- Checkpoint every 10 time steps
Drawbacks
- High memory overhead
- Checkpoint/rollback doesn't scale
  - All nodes are rolled back just because 1 crashed
  - Even nodes independent of the crashed node are restarted
- Restart cost is similar to the checkpoint period
- Blocking co-ordinated checkpoint requires user intervention
Sender-based Message Logging
- Message logging
  - Store message logs on the sender
- Asynchronous checkpoints
  - Each processor has a buddy processor
  - Stores its checkpoint in the buddy's memory
- Restart: processor from an extra pool
  - Recreate only the objects on the crashed processor
  - Play back logged messages
  - Restores state to that after the last processed message
  - Processor virtualization can speed it up
Message Logging
- The state of an object is determined by:
  - the messages processed
  - the sequence of processed messages
- Protocol
  - Sender logs the message and requests a TN (ticket number) from the receiver
  - Receiver sends back the TN
  - Sender stores the TN with the log and sends the message
  - Receiver processes messages in order of TN
- Processor virtualization complicates message logging
  - Messages to an object on the same processor need to be logged remotely
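The TN handshake above can be sketched as a small simulation. The types and method names (`Receiver`, `SenderLog`, `requestTN`) are illustrative assumptions, not the Charm++ API; the point is only that replaying the sender's log in TN order reproduces the receiver's deterministic message sequence.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative receiver: hands out monotonically increasing ticket numbers,
// which fix the order in which it will process messages.
struct Receiver {
    int nextTN = 0;
    int requestTN() { return nextTN++; }
};

// Illustrative sender: keeps <TN, message> so it can replay the exact
// processing order after the receiver crashes.
struct SenderLog {
    std::map<int, std::string> log;  // ordered by TN

    void send(Receiver& r, const std::string& msg) {
        int tn = r.requestTN();  // steps 1-2: request TN, receiver replies
        log[tn] = msg;           // step 3: store TN with the logged message
    }

    // Recovery: messages come back sorted by TN, restoring the sequence
    // of processed messages that determined the object's state.
    std::vector<std::string> replay() const {
        std::vector<std::string> out;
        for (const auto& [tn, m] : log) out.push_back(m);
        return out;
    }
};
```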
Parallel Restart
- Message logging allows fault-free processors to continue with their execution
- However, sooner or later some processors start waiting for the crashed processor
- Virtualization allows us to move work from the restarted processor to the waiting processors
  - Chares are restarted in parallel
  - Restart cost can be reduced
Present Status
- Most of Charm++ has been ported
  - Support for migration has not yet been implemented in the fault-tolerant protocol
- AMPI ported
- Parallel restart not yet implemented
Recovery Performance
Execution time with faults:

Number of faults | Execution time (s)
0 | 506
1 | 551
2 | 606
4 | 640
5 | 671
6 | 680
7 | 725
Pros and Cons
- Low overhead for jobs with low communication
- Currently high overhead for jobs with high communication
- Should be tested with a high virtualization ratio to reduce the message logging overhead
Processor Evacuation
- Modern hardware can be used to predict faults
- Runtime system response:
  - Low response time
  - No new processors should be required
  - Efficiency loss should be proportional to the loss in computational power
Solution
- Migrate Charm++ objects off the warned processor
  - Requires remapping of the objects' home PEs
- Point-to-point message delivery continues to work efficiently
- Collective operations cope with the loss of processors
  - Rewire the reduction tree around a warned processor
  - Can deal with multiple simultaneous warnings
- Load balance after an evacuation
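The home-PE remapping can be sketched as below. This is an assumed scheme for illustration only (hash to a PE, then probe past any warned PE); the actual Charm++ location-management code is not shown in the talk.

```cpp
#include <cassert>
#include <set>

// Hypothetical home-PE mapping that avoids warned processors: hash the
// object id to a PE, then linearly probe past any PE under a fault warning.
int homePE(int objId, int numPes, const std::set<int>& warned) {
    int pe = objId % numPes;  // default home by hashing
    while (warned.count(pe))
        pe = (pe + 1) % numPes;  // skip evacuated/warned PEs
    return pe;
}
```

Because every processor can evaluate the same function given the warned set, point-to-point delivery keeps finding objects without a central directory update.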
Rearrange the Reduction Tree
- Do not rewire the tree while a reduction is going on:
  - Stop reductions
  - Rewire the tree
  - Continue reductions
- Affects only the parent and children of a node
- Unbalances the tree: could be solved by recreating the tree
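The local nature of the rewiring can be sketched on a binary reduction tree: the warned node's children are re-parented to its parent, so only that node's immediate neighbourhood changes. This is hypothetical illustration code, not Charm++'s implementation.

```cpp
#include <cassert>
#include <map>

// Build a binary reduction tree over n PEs as a child -> parent map,
// with PE 0 as the root (parent of i is (i-1)/2).
std::map<int, int> buildTree(int n) {
    std::map<int, int> parent;
    for (int i = 1; i < n; ++i) parent[i] = (i - 1) / 2;
    return parent;
}

// Rewire around a warned PE: its children now report to its parent,
// and the warned PE leaves the tree. Only the warned node's parent and
// children are touched, matching the slide's claim.
void rewireAround(std::map<int, int>& parent, int warned) {
    int grand = parent.count(warned) ? parent[warned] : -1;
    for (auto& [child, p] : parent)
        if (p == warned) p = grand;  // re-parent the warned node's children
    parent.erase(warned);
}
```

Repeated rewirings can leave the tree unbalanced, which is why the slide suggests recreating the tree as a fix.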
Response Time
- Evacuation time for a Sweep3d execution on the 150^3 case
- Total ~500 MB of data
- Pessimistic estimate of evacuation time
Performance after Evacuation
- Iteration time of Sweep3d on 32 processors for the 150^3 problem with 1 warning
Processor Utilization after Evacuation
- Iteration time of Sweep3d on 32 processors for the 150^3 problem with both processors on node 3 (processors 4 and 5) being warned simultaneously
Conclusions
- Available in Charm++ and AMPI:
  - Checkpoint/Restart
  - In-memory Checkpoint/Restart
  - Proactive fault tolerance
- Under development:
  - Sender-based message logging
    - Deal with migration, deletion
  - Parallel restart
- The abstraction layers in Charm++/AMPI make them suitable for implementing fault tolerance protocols