Fault Tolerant Extensions to Charm++ and AMPI, presented by Sayantan Chakravorty (with Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi)


TRANSCRIPT

Fault Tolerant Extensions to Charm++ and AMPI
Presented by Sayantan Chakravorty
Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi

Parallel Programming Laboratory, University of Illinois at Urbana-Champaign

Outline
Motivation
Background
Solutions:
  Coordinated checkpointing
  In-memory double checkpoint
  Sender-based message logging
  Processor evacuation in response to fault prediction (new work)


Motivation
As machines grow in size, MTBF decreases
Applications have to tolerate faults
Applications need fast, low-cost, and scalable fault tolerance support
Modern hardware is making fault prediction possible: temperature sensors, PAPI-4, SMART
Paper on detection tomorrow


Background
Checkpoint-based methods
  Coordinated: blocking [Tamir84], non-blocking [Chandy85]
    CoCheck, Starfish, CLIP: fault-tolerant MPI
  Uncoordinated: suffers from rollback propagation
  Communication-induced [Briatico84]: doesn't scale well
Log-based methods
  Pessimistic: MPICH-V1 and V2, SBML [Johnson87]
  Optimistic [Strom85]: unbounded rollback, complicated recovery
  Causal logging [Elnozahy93]: complicated causality tracking and recovery; Manetho, MPICH-V3


Multiple Solutions in Charm++
Reactive: react to a fault
  Disk-based checkpoint/restart
  In-memory double checkpoint/restart
  Sender-based message logging
Proactive: react to a fault prediction
  Evacuate processors that are warned


Checkpoint/Restart Mechanism
Blocking coordinated checkpoint
The state of the chares is checkpointed to disk
Collective call: MPI_Checkpoint(DIRNAME) (see the usage sketch below)
The entire job is restarted
Virtualization allows restarting on a different number of PEs
Runtime option: ./charmrun pgm +p4 +vp16 +restart DIRNAME
Simple but effective for common cases
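The collective checkpoint call from this slide fits into an ordinary AMPI time-stepping loop. Below is a minimal sketch: MPI_Checkpoint(DIRNAME) and the +restart runtime option come from the slide, but the exact C signature of the call and the 20-step checkpoint interval are assumptions for illustration, not the definitive API.

    /* Hedged sketch: periodic disk checkpointing from an AMPI program.
     * MPI_Checkpoint and the +restart option are taken from the slide;
     * the const char* argument and the interval shown are assumed. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        for (int step = 0; step < 100; step++) {
            /* ... one iteration of the application's main loop ... */

            /* Blocking coordinated checkpoint: every virtual processor
             * enters the collective and chare state is written to disk. */
            if (step % 20 == 0)
                MPI_Checkpoint("DIRNAME");
        }

        MPI_Finalize();
        return 0;
    }

After a crash the whole job is restarted from the checkpoint directory, possibly on a different number of physical processors thanks to virtualization, e.g. ./charmrun pgm +p4 +vp16 +restart DIRNAME.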


Drawbacks
Disk-based coordinated checkpointing is slow
The job needs to be restarted
Requires user intervention
Impractical in the case of frequent faults


In-memory Double Checkpoint
In-memory checkpoint: faster than disk
Coordinated checkpoint: simple; the user can decide what makes up useful state (see the sketch below)
Double checkpointing: each object maintains 2 checkpoints, on:
  the local physical processor
  a remote buddy processor
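In Charm++/AMPI the "useful state" the user chooses to save is expressed through the object's pup (pack/unpack) routine, which the double-checkpoint mechanism uses to serialize each object, once into local memory and once into the buddy processor's memory. A minimal sketch, assuming a hypothetical chare class JacobiChunk; the .ci interface file and the rest of the program are omitted.

    // Hedged sketch: the pup() routine defines what state goes into the
    // two in-memory checkpoints.  JacobiChunk and its members are
    // hypothetical; PUP::er is the standard Charm++ serialization interface.
    class JacobiChunk : public CBase_JacobiChunk {
        int     n;            // local block dimension
        double *temperature;  // n*n block of the Jacobi grid
    public:
        JacobiChunk() : n(0), temperature(NULL) {}
        JacobiChunk(CkMigrateMessage *m) {}   // needed when objects are restored

        void pup(PUP::er &p) {
            CBase_JacobiChunk::pup(p);        // base-class bookkeeping
            p | n;
            if (p.isUnpacking())              // allocate before reading back
                temperature = new double[n * n];
            PUParray(p, temperature, n * n);
        }
    };

Only the data pup'ed here is copied to the local and buddy checkpoints, so transient buffers that can be recomputed need not be saved.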


Restart
A dummy process is created:
  Need not have application data or a checkpoint
  Necessary for the runtime
  Starts recovery on all other PEs
Other processors:
  Remove all chares
  Restore checkpoints lost on the crashed PE
  Restore chares from local checkpoints
Load balance after restart


Overhead Evaluation
Jacobi (200 MB data size) on up to 128 processors, Myrinet; 8 checkpoints in 100 steps
Total execution time (s) by number of processors:

Processors | Normal Charm++/AMPI | FT-Charm++ w/o checkpointing | FT-Charm++ with checkpointing
         4 |          160.116529 |                   160.892814 |                    175.228279
         8 |           87.687250 |                    90.547895 |                     96.139557
        16 |           52.757971 |                    53.956612 |                     56.801147
        32 |           35.234483 |                    36.026838 |                     37.400720
        64 |           26.149297 |                    26.760844 |                     27.710856
       128 |           22.098567 |                    22.070103 |                     22.975914


Recovery Performance
LeanMD application
10 crashes
128 processors
Checkpoint every 10 time steps


Drawbacks
High memory overhead
Checkpoint/rollback doesn't scale:
  All nodes are rolled back just because one crashed
  Even nodes independent of the crashed node are restarted
Restart cost is similar to the checkpoint period
Blocking coordinated checkpoint requires user intervention


Sender-based Message Logging
Message logging: store message logs on the sender
Asynchronous checkpoints:
  Each processor has a buddy processor
  Stores its checkpoint in the buddy's memory
Restart: a processor from an extra pool
  Recreate only the objects on the crashed processor
  Play back the logged messages
  Restores state to that after the last processed message
Processor virtualization can speed it up


Message Logging
The state of an object is determined by:
  the messages processed
  the sequence of processed messages
Protocol (illustrated in the sketch below):
  Sender logs the message and requests a ticket number (TN) from the receiver
  Receiver sends back the TN
  Sender stores the TN with the log and sends the message
  Receiver processes messages in order of TN
Processor virtualization complicates message logging:
  Messages to an object on the same processor need to be logged remotely
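The four protocol steps above can be illustrated with a small, self-contained C++ sketch. The Sender and Receiver classes and the ticket bookkeeping below are hypothetical stand-ins for the runtime's actual data structures; the point is only to show how TN ordering makes replay deterministic.

    // Hedged sketch of the ticket-number (TN) protocol described above.
    // All names are hypothetical; this is not the Charm++ implementation.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct Message { std::string payload; uint64_t tn; };

    class Receiver {
        uint64_t nextTicket = 0;                 // next TN to hand out
        uint64_t nextToRun  = 0;                 // next TN allowed to execute
        std::map<uint64_t, Message> pending;
    public:
        uint64_t grantTN() { return nextTicket++; }          // "receiver sends back TN"
        void deliver(const Message &m) {
            pending[m.tn] = m;
            while (pending.count(nextToRun)) {               // process in TN order
                std::cout << "processing " << pending[nextToRun].payload << "\n";
                pending.erase(nextToRun++);
            }
        }
    };

    class Sender {
        std::vector<Message> log;                // log kept on the sender's side
    public:
        void send(Receiver &r, const std::string &payload) {
            Message m{payload, r.grantTN()};     // log the message, request a TN
            log.push_back(m);                    // store the TN with the log
            r.deliver(m);                        // then send
        }
        // Recovery: a freshly created receiver replays the log in TN order,
        // restoring its state up to the last processed message.
        void replay(Receiver &fresh) const { for (const Message &m : log) fresh.deliver(m); }
    };

    int main() {
        Receiver r; Sender s;
        s.send(r, "A"); s.send(r, "B");
        Receiver restarted;                      // stand-in for the restarted PE
        s.replay(restarted);
    }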


Parallel Restart
Message logging allows fault-free processors to continue with their execution
However, sooner or later some processors start waiting for the crashed processor
Virtualization allows us to move work from the restarted processor to the waiting processors (see the placement sketch below)
Chares are restarted in parallel
Restart cost can be reduced
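As a rough illustration of the idea (the status slide notes parallel restart is not yet implemented), the recovered objects could be spread round-robin over processors that are already blocked waiting on the crashed one. All names and the policy below are assumptions, not the runtime's algorithm.

    // Hedged sketch: planning where restored chares go during a parallel
    // restart.  ObjectCkpt, planParallelRestart, and the round-robin policy
    // are illustrative assumptions only.
    #include <cstddef>
    #include <utility>
    #include <vector>

    struct ObjectCkpt { int objectId; /* ... pup'ed object state ... */ };

    // Map each object that lived on the crashed PE to one of the processors
    // currently waiting for it, instead of piling them all onto one new PE.
    std::vector<std::pair<int, int>> planParallelRestart(
            const std::vector<ObjectCkpt> &lost,
            const std::vector<int>        &waitingPes) {
        std::vector<std::pair<int, int>> placement;   // (objectId, destination PE)
        for (std::size_t i = 0; i < lost.size(); ++i)
            placement.push_back({lost[i].objectId,
                                 waitingPes[i % waitingPes.size()]});
        return placement;
    }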


Present Status
Most of Charm++ has been ported
Support for migration has not yet been implemented in the fault-tolerant protocol
AMPI has been ported
Parallel restart is not yet implemented


Recovery Performance
Execution time with faults:

Number of faults | Execution time (s)
               0 |                506
               1 |                551
               2 |                606
               4 |                640
               5 |                671
               6 |                680
               7 |                725


Pros and Cons
Low overhead for jobs with low communication
Currently high overhead for jobs with high communication
Should be tested with a high virtualization ratio to reduce the message-logging overhead


Processor Evacuation
Modern hardware can be used to predict faults
Runtime system response:
  Low response time
  No new processors should be required
  Efficiency loss should be proportional to the loss in computational power


Solution
Migrate Charm++ objects off the processor (see the migration sketch below)
  Requires remapping of the objects' home PEs
  Point-to-point message delivery continues to work efficiently
Collective operations cope with the loss of processors:
  Rewire the reduction tree around a warned processor
  Can deal with multiple simultaneous warnings
Load balance after an evacuation
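Migration itself uses the standard Charm++ mechanism for movable objects. A minimal sketch, assuming a hypothetical chare array element Worker and a hypothetical onEvacuationWarning() entry method; the real evacuation path also remaps the objects' home PEs, which is not shown here.

    // Hedged sketch: evacuating one array element off a warned processor.
    // migrateMe(), CkMyPe(), and CkNumPes() are standard Charm++ calls;
    // Worker, onEvacuationWarning(), and the destination policy are
    // assumptions for illustration.  The .ci interface file is omitted.
    class Worker : public CBase_Worker {
    public:
        Worker() {}
        Worker(CkMigrateMessage *m) {}           // required for migratable objects

        void onEvacuationWarning() {
            // Naive destination choice; a real policy would avoid other
            // warned processors and consider load balance.
            int dest = (CkMyPe() + 1) % CkNumPes();
            migrateMe(dest);                     // state is pup'ed and moved
        }

        void pup(PUP::er &p) {
            CBase_Worker::pup(p);
            // ... pup the application state that must survive the move ...
        }
    };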


Rearrange the Reduction Tree
Do not rewire the tree while a reduction is going on:
  Stop reductions
  Rewire the tree
  Continue reductions
Affects only the parent and children of a node (see the sketch below)
Unbalances the tree: could be solved by recreating the tree
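The rewiring step only touches the warned node's parent and children. A minimal sketch over an array-based tree, with the representation and function name assumed; it presumes reductions have already been paused and the warned processor is not the root.

    // Hedged sketch: re-parent the warned node's children to its parent,
    // then detach the warned node.  The parent[] representation and the
    // function name are assumptions for illustration.
    #include <vector>

    // parent[p] is the reduction-tree parent of processor p; the root has -1.
    // Assumes 'warned' is not the root and reductions are currently stopped.
    void rewireAround(std::vector<int> &parent, int warned) {
        int grandparent = parent[warned];
        for (int p = 0; p < static_cast<int>(parent.size()); ++p)
            if (parent[p] == warned)
                parent[p] = grandparent;   // children adopt the grandparent
        parent[warned] = warned;           // warned node leaves the tree
    }

As the slide notes, repeated rewiring can unbalance the tree, which can be fixed by recreating it once the affected reductions have drained.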


Response Time
Evacuation time for a Sweep3d execution on the 150^3 case
Total of ~500 MB of data
Pessimistic estimate of the evacuation time


Performance after Evacuation
Iteration time of Sweep3d on 32 processors for the 150^3 problem with 1 warning


Processor Utilization after Evacuation
Iteration time of Sweep3d on 32 processors for the 150^3 problem, with both processors on node 3 (processors 4 and 5) being warned simultaneously


Conclusions
Available in Charm++ and AMPI:
  Checkpoint/restart
  In-memory checkpoint/restart
  Proactive fault tolerance
Under development:
  Sender-based message logging: deal with migration, deletion
  Parallel restart
The abstraction layers in Charm++/AMPI make them well suited for implementing fault tolerance protocols