University of Westminster – www.cpc.wmin.ac.uk Checkpointing Mechanism for the Grid Environment K Sajadah, G Terstyanszky, S Winter, P. Kacsuk University of Westminster


Post on 17-Jan-2018



Checkpointing of Parallel Applications in a Grid Environment

The Grid Environment

Nature of the Grid environment:
– Generic, heterogeneous, and dynamic, with many unreliable resources that expose it to failures.

Solution:
– Fault-tolerant mechanisms should ensure the successful execution of applications.


Fault Tolerant Solutions

Retrying
– When a job fails, it is re-executed a certain number of times.
– The expected job completion time can become very long.

Replication
– Replicas of a job are executed on different Grid resources simultaneously.
– It requires extra processing power.

Checkpointing
– It stores a snapshot of the application state and uses it to restart the execution in case of failure.
– It is very efficient in environments where the failure rate is high.


Checkpointing

Transparent Checkpointing
– The mechanism orchestrates the checkpointing process.
– Message synchronisation is performed.
– The checkpointing and recovery process is transparent to the programmer.

Non-Transparent Checkpointing
– The mechanism provides support for checkpointing through run-time libraries.
– The programmer can specify the data that should be included in the checkpoint file.
– The approach is not transparent to the programmer.
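To make the non-transparent style concrete, the following is a minimal sketch in which the programmer decides which data goes into the checkpoint file and where the application restarts. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the use of `pickle`, and the `fail_at` failure injection are illustrative assumptions, not part of the mechanism described in the slides.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the programmer-selected state to `path` via a temporary file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        pickle.dump(state, handle)
    # Rename over the old checkpoint so a crash never leaves a half-written file.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run(path, total_steps, fail_at=None):
    """Sum 0..total_steps-1, saving a checkpoint after every step.

    `fail_at` is a hypothetical hook that injects a failure for demonstration.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated resource failure")
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a failure resumes from the last saved step instead of starting from zero, which is the restart behaviour the slides attribute to checkpointing.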


Challenges in Checkpointing

When to take the checkpoint

How to synchronise (or how to minimise inter-process communication)

What kind of information to store at the checkpoint

Where to store the checkpoint's information

How to restore the execution after a fault


Checkpointing (2)

Performance constraints in existing solutions:
– Overheads due to synchronisation of messages.
– Checkpoint intervals are either user-defined with no regular pattern or are periodic.

Proposed solution:
– Take checkpoints at the best possible pre-defined intervals.
– Minimise (or optimise) inter-process communication as much as possible.


Checkpointing (3)

Inter-process communication can cause inconsistent checkpoints due to lost messages or orphan messages.
– To achieve a globally consistent checkpoint, synchronization should be performed.

Synchronization introduces extra communication among processes.
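The two kinds of inconsistency can be illustrated with a small sketch. It assumes the usual definitions: a lost (in-transit) message is sent before the sender's checkpoint but received after the receiver's, and an orphan message is received before the receiver's checkpoint although its send happens after the sender's. The `Message` type, the `classify` function, and the use of a single global clock are hypothetical simplifications for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    sent_at: float
    received_at: float


def classify(messages, checkpoint_at):
    """Split messages into lost and orphan lists for a set of local checkpoints.

    `checkpoint_at` maps a process name to the time of its local checkpoint.
    """
    lost, orphan = [], []
    for msg in messages:
        send_saved = msg.sent_at <= checkpoint_at[msg.sender]
        recv_saved = msg.received_at <= checkpoint_at[msg.receiver]
        if send_saved and not recv_saved:
            lost.append(msg)    # send is in the checkpoint, receive is not
        elif recv_saved and not send_saved:
            orphan.append(msg)  # receive is in the checkpoint, send is not
    return lost, orphan
```

A set of checkpoints is consistent in this sketch when both lists are empty, for example when every process checkpoints at a moment where no message is in flight.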


Approaches Used

Combination of:
– First Order Approximation.
– Natural Synchronisation Points.

First Order Approximation
– Calculates the optimal checkpointing intervals.
– Based on the Poisson process.
• Failures occur at random with failure rate λ.


First Order Approximation

The optimal checkpoint interval Tc is:
– $T_c = \sqrt{2 T_s T_f}$, where:
• Ts is the time required to save information at a checkpoint.
• Tf is the mean time between failures, and $T_f = T_h / k$.

The following data are needed:
– The number of hours the program will run on the machines (Th).
– The known failure rate during that time (k), expressed as the number of failures over Th.
– The time required to save information at a checkpoint (Ts).
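As a worked illustration, the sketch below evaluates the first-order approximation in its usual square-root form, Tc = sqrt(2 · Ts · Tf) with Tf = Th / k. The function names and the sample numbers are hypothetical; only the symbols come from the slides.

```python
import math


def mean_time_between_failures(run_hours, failures):
    """Tf = Th / k: run time divided by the number of failures in that time."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return run_hours / failures


def optimal_checkpoint_interval(save_time, mtbf):
    """Tc = sqrt(2 * Ts * Tf); both arguments must use the same time unit."""
    return math.sqrt(2.0 * save_time * mtbf)


# Hypothetical inputs: a 100-hour run with 4 failures and 0.02 hours per save.
tf = mean_time_between_failures(100.0, 4)   # 25 hours between failures
tc = optimal_checkpoint_interval(0.02, tf)  # about 1 hour between checkpoints
```

Because the interval grows with the square root of both inputs, halving the save time or the mean time between failures shortens the interval by a factor of about 1.41 rather than 2.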


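First Order Approximation (2)

[Figure: timeline starting at t = 0 showing the checkpoint intervals (Tc), the checkpoint save times (Ts), the point of failure, the restarting point, and the rerun time (tr).]

Tc = Checkpoint interval
Ts = Time to save a checkpoint
tr = Rerun time of a failed application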


First Order Approximation (3)

Using the PROVE toolset, we can measure both the execution time and the checkpointing time of an application.

Nagios can be used to determine the failure rate of Grid resources.


Natural Synchronisation Points

Examples of natural synchronization points:
– Barriers.
– Top or bottom of a main loop.
– Collective operations (broadcast, gather, scatter, etc.)

No interprocess communication takes place at these points.
– Therefore, there is no need to be concerned with the state of the communication channels or possible in-transit messages.
– This eliminates the overhead of the synchronization process otherwise involved in checkpointing.
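The idea can be sketched with threads standing in for parallel processes. In this hypothetical example a `threading.Barrier` at the bottom of the main loop plays the role of the natural synchronisation point, and the barrier's `action` callback takes the snapshot while every worker is waiting, so no update is in progress when the state is copied. The names `run_workers` and `take_snapshot` are assumptions made for illustration.

```python
import threading


def run_workers(num_workers, iterations):
    """Run workers that meet at a barrier each iteration and snapshot there."""
    snapshots = []              # one tuple of per-worker totals per iteration
    totals = [0] * num_workers  # each worker writes only its own slot

    def take_snapshot():
        # Called by exactly one thread once all workers reached the barrier,
        # before any of them is released into the next iteration.
        snapshots.append(tuple(totals))

    barrier = threading.Barrier(num_workers, action=take_snapshot)

    def worker(rank):
        for step in range(iterations):
            totals[rank] += rank + step  # local computation phase
            barrier.wait()               # natural synchronisation point

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return snapshots
```

No extra coordination protocol is needed here: the application already synchronises at the barrier, and the checkpoint simply reuses that moment.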


Natural Synchronisation Points (2)

[Figure: two panels, each showing processes P1, P2 and P3. Panel titles: "Application execution with processes interacting" and "Coordinated checkpoint - waiting for in-transit messages".]


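Natural Synchronisation Points (4)

[Figure: two panels, each showing processes P1, P2 and P3. Panel titles: "Coordinated checkpoint - logging in-transit messages" and "Checkpointing at natural synchronisation points", the latter with checkpoints Ckpt1 and Ckpt2 taken at N.S.P 1 and N.S.P 2.]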


New Checkpointing Approach

Using First Order Approximation only:
– Involves synchronisation of messages and capturing in-transit messages.

Checkpointing at natural synchronisation points only:
– May not be very effective because there is no pattern in their occurrence.


New Checkpointing Approach (2)

Use a combination of both the Natural Synchronisation Points and the First Order Approximation.

Take checkpoints at the natural synchronization points which are closest to the optimal checkpoint intervals.
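A minimal offline sketch of this selection rule follows. It assumes the times of all natural synchronisation points are known in advance and that each optimal point is measured from the previously chosen checkpoint; both are simplifying assumptions, and the name `select_checkpoints` is hypothetical.

```python
def select_checkpoints(sync_times, interval, end_time):
    """Pick, for each optimal point, the closest later synchronisation point.

    sync_times: times of the natural synchronisation points.
    interval:   optimal checkpoint interval (Tc), same unit as sync_times.
    end_time:   no optimal point beyond this time is considered.
    """
    chosen = []
    last = 0.0
    ordered = sorted(sync_times)
    while last + interval <= end_time:
        target = last + interval  # next optimal point
        candidates = [t for t in ordered if t > last]
        if not candidates:
            break
        best = min(candidates, key=lambda t: abs(t - target))
        chosen.append(best)
        last = best               # measure the next interval from here
    return chosen
```

With synchronisation points at 3, 7, 10, 15, 18 and 26 minutes and an 8-minute interval, the sketch picks 7, 15 and 26: each choice is the point nearest to 8 minutes after the previous checkpoint.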


Choosing Checkpoint Intervals

[Figure: "Choosing appropriate checkpointing intervals". Timeline marking the First Order Approximation points (Op1 to Op6), the natural synchronisation points (Ns1 to Ns10), and the critical region { } around each optimal point.]


Choosing Checkpoint Intervals (2)

The decision to select a checkpoint is based on:
– Optimal checkpoint interval,
– Natural synchronisation points and
– Critical Region.

The checkpointing process is triggered by signals sent to the coordinated process whenever synchronization points are encountered.


The Checkpointing Process

When the coordinated process receives a signal, it checks whether the signal falls within the critical region.
– If so, a checkpoint is taken and the clock is reset.
– If not, no checkpointing is performed.

If no natural synchronization points are met within the critical region, a checkpoint has to be forced at the end of the critical region.
– In such cases, the checkpointing mechanism will perform synchronization to ensure there are no lost or orphan messages.
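The decision rule can be sketched as follows. The sketch assumes that the critical region is a symmetric window of `half_width` around the optimal point, measured from the last checkpoint, and that the first signal inside the window triggers the checkpoint; the slides do not state either detail, and the name `plan_checkpoints` is hypothetical.

```python
def plan_checkpoints(sync_times, interval, half_width, end_time):
    """Return (time, kind) pairs, where kind is "natural" or "forced".

    sync_times: times at which synchronisation-point signals arrive.
    interval:   optimal checkpoint interval (Tc).
    half_width: assumed half-width of the critical region around Tc.
    end_time:   planning stops once a full region no longer fits.
    """
    events = []
    clock_start = 0.0  # time of the last checkpoint (clock reset point)
    signals = sorted(sync_times)
    i = 0
    while clock_start + interval + half_width <= end_time:
        low = clock_start + interval - half_width
        high = clock_start + interval + half_width
        # Signals before the critical region trigger no checkpoint.
        while i < len(signals) and signals[i] < low:
            i += 1
        if i < len(signals) and signals[i] <= high:
            # Signal inside the region: checkpoint and reset the clock.
            events.append((signals[i], "natural"))
            clock_start = signals[i]
            i += 1
        else:
            # No signal in the region: force a checkpoint at its end.
            # A real mechanism would synchronise messages at this point.
            events.append((high, "forced"))
            clock_start = high
    return events
```

For signals at 7, 12, 15, 30 and 33 minutes, an 8-minute interval and a 2-minute half-width, the plan is natural checkpoints at 7 and 15, a forced checkpoint at 25 because no signal arrives between 21 and 25, and a natural checkpoint at 33.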


The Testbed

The MadCity traffic simulation tool was used.
– It simulates traffic on a road network and shows how individual vehicles behave on roads and at junctions.

The MadCity traffic simulator can be parallelised using PGRADE.


The Testbed (2)

[Figure: "Proposed checkpointing solution". Timeline marking the First Order Approximation points (Op1 to Op6), the natural synchronisation points (Ns1 to Ns9), the forced synchronisation point (Fs1), the critical region { } (labelled 4 min), and the saved checkpoints.]


The Testbed (3)

Through the First Order Approximation, the calculated optimal checkpoint interval was 8 minutes.

A critical region with a range of 2 minutes from the optimal checkpoint interval was defined.

Checkpoints were taken at: Ns1, Ns2, Ns5, Fs1, Ns6, Ns9.

Overall average time between checkpoints: 8.2 minutes.


Conclusion

The proposed checkpointing mechanism provides a more efficient way to save checkpoint images:
– It minimises the need to synchronise messages.
– It keeps the average checkpointing interval close to the optimal checkpointing interval defined by the First Order Approximation.


Future Work

Integrate the checkpointing solution in PGRADE to provide an efficient fault-tolerant solution for applications executed as Grid workflows.

Provide an efficient and reliable storage mechanism.


Questions