Process Introspection: A Checkpoint Mechanism for High Performance Heterogeneous Distributed Systems. University of Virginia. Author: Adam J. Ferrari.

Post on 12-Jan-2016

Process Introspection: A Checkpoint Mechanism for High Performance Heterogeneous Distributed Systems.

University of Virginia.

Author: Adam J. Ferrari.

Some Basic Terminology.

What is a Process?

A process is a program in execution, managed by the operating system.

What does Introspection mean?

Introspection means understanding one’s inner self (Merriam-Webster Online).

Goals of the Process Introspection Project.

To construct a checkpoint/restart mechanism for a heterogeneous environment.

This mechanism should be:

1. Efficient,

2. Flexible,

3. Most importantly, platform independent.

Heterogeneous Environment.

Heterogeneous environments became popular mainly because of their better price/performance ratio.

Some characteristics:

1. A collection of workstations running different operating systems on varied architectures, connected by a network.

2. Generally used for compute-intensive applications, where idle or lightly loaded workstations take part in completing a task, making efficient use of idle time.

3. User-dedicated machines.

Ex: Our Own Department.

Efficient Utilization.

To take advantage of these heterogeneous workstations, processes should be provided with the following facilities:

1. Process Migration.

2. Load Balancing.

3. Fault Tolerance.

Checkpoint/Restart Mechanism.

Two main phases:

1. Save the current running state of a process.

2. Reconstruct the original running process from the saved image and resume execution from exactly the point of interruption.
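The two phases can be sketched in C for a trivial computation. This is a minimal illustration only: the names `LoopState`, `run_steps`, `save_state` and `restore_state` are hypothetical, and the image is a raw memory copy, so it is only valid on the same platform (the heterogeneous case is the subject of the rest of the talk).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical process state: a running sum over 0..limit-1. */
typedef struct {
    long next;  /* next value to add: the "interrupted point" */
    long sum;   /* partial result so far */
} LoopState;

/* Advance the computation by at most max_steps iterations. */
void run_steps(LoopState *s, long limit, long max_steps)
{
    while (s->next < limit && max_steps-- > 0) {
        s->sum += s->next;
        s->next++;
    }
}

/* Phase 1: save the running state into a checkpoint image. */
size_t save_state(const LoopState *s, unsigned char *image, size_t cap)
{
    assert(cap >= sizeof *s);
    memcpy(image, s, sizeof *s);
    return sizeof *s;
}

/* Phase 2: rebuild the state from the image so execution can resume. */
LoopState restore_state(const unsigned char *image, size_t len)
{
    LoopState s;
    assert(len == sizeof s);
    memcpy(&s, image, sizeof s);
    return s;
}
```

Calling `run_steps` on the restored state continues from the saved `next` value, so the final sum matches an uninterrupted run.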

Advantages of using the Checkpoint/Restart Mechanism.

Process Migration.

1. Distributed Load Balancing.

2. Efficient Resource Utilization.

Crash Recovery and Rollback Transaction.

Useful in System Administration.

Lowering the Programming Burden.

Running complex simulation or complex modeling.

Implementation Challenges/Complexity.

Because the computing environment is heterogeneous, the checkpoint/restart mechanism must be platform independent. It has to:

1. Capture the state of a running process.

2. Reinstantiate it on a completely different architecture or OS platform, which may have a different instruction set, data format, and address space layout.
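As one concrete instance of the data format problem, byte order differs between architectures. A common remedy, sketched here under the assumption that the checkpoint stores integers in a fixed big-endian layout (the slides do not say which format is actually used), is to encode and decode byte by byte. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write v into out[0..3], most significant byte first,
   regardless of the host's native byte order. */
void encode_u32(uint32_t v, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)((v >> 24) & 0xFFu);
    out[1] = (unsigned char)((v >> 16) & 0xFFu);
    out[2] = (unsigned char)((v >> 8) & 0xFFu);
    out[3] = (unsigned char)(v & 0xFFu);
}

/* Rebuild the value from the fixed layout on any host. */
uint32_t decode_u32(const unsigned char in[4])
{
    assert(in != NULL);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

Because the code uses shifts rather than memory copies, the same source produces the same checkpoint bytes on little-endian and big-endian machines.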

Existing Implementations.

V migration mechanism. Compiler support is used to generate meta-information about a process, describing the locations and types of data items to be modified at migration time to mask data representation differences.

Disadvantages:

1. Requires kernel support. Other examples: MOSIX, Sprite.

2. Requires data to be stored at the same address in all migrated versions.

Theimer and Hayes. Construct an intermediate source code representation of a running process at the point of migration, and recompile this source at the migration target.

Has never been implemented.

Process Introspection Design.

Process + Introspection: the ability of a process to examine and describe its own internal state in a logical, platform-independent format.

Extends the technique of hand-coding checkpoint/restart into an automated approach.

Components Involved.

The Process Introspection Design Pattern.

Process Introspection Library (PIL).

Automatic Process Introspection Compiler (APrIL).

Standard Checkpoint Interface.

Central Checkpoint Coordinator.

Process Introspection Design Pattern.

A design template for writing checkpointable code.

Based on a Process Model.

Adding functionality to the modules.

Ability to save/restore threads of control.

1. Poll points (checks for checkpoint requests) are inserted so that call stacks can be saved.

- Poll point placement is a key performance trade-off.

2. Serving a checkpoint request: each subroutine saves its data and logical point of execution, then returns to its calling subroutine.

3. Restarting a process from a checkpoint: restore the variables from the checkpoint and use control flow to reach the correct point of execution recorded in the checkpoint, starting from the initial subroutine that was active at the checkpoint. Then call the next subroutine from the checkpointed stack.
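The save and restart steps above can be sketched with two nested subroutines and a single stack of saved values. This is a simplified illustration, not the PIL API: every name here (`mode`, `push`, `pop`, `poll`, `request_checkpoint_after`, `begin_restart`) is hypothetical, the countdown stands in for an asynchronous checkpoint request, and each routine has only one poll point, so the logical program counter is saved but not needed to choose a branch.

```c
#include <assert.h>

enum { MODE_RUN, MODE_CHECKPOINT, MODE_RESTORE };

/* Hypothetical runtime state (illustrative only). */
static int mode = MODE_RUN;
static int polls_left = -1;  /* countdown until a request is seen */
static long saved[16];       /* checkpoint image: logical PCs and locals */
static int top = 0;

static void push(long v) { assert(top < 16); saved[top++] = v; }
static long pop(void) { assert(top > 0); return saved[--top]; }

/* Returns 1 exactly once, at the poll that sees the request. */
static int poll(void) { return polls_left > 0 && --polls_left == 0; }

void request_checkpoint_after(int polls) { polls_left = polls; }
void begin_restart(void) { mode = MODE_RESTORE; }

/* Innermost subroutine: counts up to n, polling once per iteration. */
long inner(long n)
{
    long i = 0;
    if (mode == MODE_RESTORE) {  /* prologue: rebuild this frame */
        i = pop();
        (void)pop();             /* logical PC (only one poll point) */
        mode = MODE_RUN;         /* innermost frame restored: resume */
    }
    while (i < n) {
        i++;
        if (poll()) {            /* poll point 1 */
            mode = MODE_CHECKPOINT;
            push(1);             /* logical point of execution */
            push(i);             /* local data */
            return 0;            /* unwind to the caller */
        }
    }
    return i;
}

long outer(long n)
{
    long base = 100, r;
    if (mode == MODE_RESTORE) {
        base = pop();
        (void)pop();             /* logical PC: the call to inner */
    }
    r = inner(n);                /* next subroutine on the saved stack */
    if (mode == MODE_CHECKPOINT) {
        push(1);
        push(base);
        return 0;                /* keep unwinding */
    }
    return base + r;
}
```

During a checkpoint the frames are saved innermost first as the calls unwind; on restart they are popped outermost first as the calls are re-entered, which is why one stack is enough.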

Adding functionality to the process contd ...

Ability to save/restore memory blocks.

- Must handle the different data representations and address space layouts found on different platforms.

For pointers:

1. A raw memory address cannot be saved.

2. A logical description must be saved instead. High-level descriptors are needed.
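One possible form of such a logical description is a (block, element index) pair resolved against a table of registered memory blocks. The sketch below assumes that form; `Block`, `PtrDesc`, `describe_pointer` and `resolve_pointer` are hypothetical names, and the slides do not specify how PIL actually represents pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One registered memory block (hypothetical table entry). */
typedef struct {
    double *base;   /* where the block lives in this address space */
    size_t count;   /* number of elements in the block */
} Block;

/* Logical pointer: which block, and which element inside it. */
typedef struct {
    int block_id;   /* index into the block table, or -1 if unknown */
    size_t index;   /* element index, independent of sizeof(double) */
} PtrDesc;

/* Checkpoint side: turn a raw address into a logical description. */
PtrDesc describe_pointer(const Block *table, int nblocks, const double *p)
{
    PtrDesc d = { -1, 0 };
    uintptr_t addr = (uintptr_t)p;
    for (int b = 0; b < nblocks; b++) {
        uintptr_t lo = (uintptr_t)table[b].base;
        uintptr_t hi = lo + table[b].count * sizeof(double);
        if (addr >= lo && addr < hi) {
            d.block_id = b;
            d.index = (size_t)(addr - lo) / sizeof(double);
            break;
        }
    }
    return d;
}

/* Restart side: map the description onto the new address space. */
double *resolve_pointer(const Block *table, int nblocks, PtrDesc d)
{
    assert(d.block_id >= 0 && d.block_id < nblocks);
    assert(d.index < table[d.block_id].count);
    return table[d.block_id].base + d.index;
}
```

Storing an element index rather than a byte offset keeps the description valid even if the element size differs on the restart platform.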

[Figure: APrIL compiler-transformed code and hand-coded checkpointable modules both use the Process Introspection Library (PIL) to produce a checkpoint.]

Process Introspection Library (PIL).

A consistent API for manipulating the elements of a process.

Automates and integrates:

Thread management.

Logical Program Counter Stack.

Data format conversion.

Checkpoint/restart of statically allocated data.

Checkpoint/restart of dynamically allocated data.

Pointer analysis/description.

APrIL: Automatic Process Introspection Compiler.

A source code translator.

Inserts code to keep the PIL tables updated at run time.

Places poll points in the module code, so that the thread executing the module periodically polls for checkpoint requests.

During restart, the process must restore all threads of execution.

Example - Function Prologues

    void example(double *A)
    {
        int i;
        double temp[100];
        PIL_RegisterStackPointer(temp, PIL_Double, 100);
        if (PIL_CheckpointStatus & PIL_StatusRestoreNow) {
            /* Restarting: restore this frame, then jump to the saved poll point. */
            int PIL_restore_point = PIL_PopLPCValue();
            A = PIL_RestoreStackPointer();
            i = PIL_RestoreStackInt();
            PIL_RestoreStackDoubles(temp, 100);
            switch (PIL_restore_point) {
            case 1: PIL_DoneRestart(); goto _PIL_PollPt_1;
            case 2: goto _PIL_PollPt_2;
            case 3: PIL_DoneRestart(); goto _PIL_PollPt_3;
            }
        }

    }

Example - Poll Points

    _PIL_PollPt_2:
        i = function(A, X, 100);
    _PIL_PollPt_3:
        if (PIL_CheckpointStatus & PIL_StatusCheckpointNow) {
            if (PIL_CheckpointStatus & PIL_StatusCheckpointInProgress)
                PIL_PushLPCValue(2);
            else {
                PIL_PushLPCValue(3);
                PIL_CheckpointStatus |= PIL_StatusCheckpointInProgress;
            }
            goto _PIL_save_frame_;
        }
        .

    _PIL_save_frame_:
        PIL_SaveStackPointer(A);
        PIL_SaveStackInt(i);
        PIL_SaveStackDoubles(X, 100);
        return;

APrIL: Automatic Process Introspection Compiler.

[Figure: high-level language source is translated by APrIL into transformed code, which the back-end compilers build, together with PIL, into Binary 1, Binary 2, ... Binary N.]

Checkpoint Coordination and Module Interfaces.

Helps modules interoperate to produce a checkpoint or restart a process.

Standard Checkpoint Interface (SCI) events:

Process startup: registers any global or data type definitions.

Checkpoint start/end: information of the module.

Restart: restores the state from the checkpoint.

Judging an implementation.

Little or no programmer effort.

Convenient programmer interface.

Low checkpoint request service latency.

Low runtime overhead.

Control over the number of checkpoints.

Should mix with the environment.

Example Overhead Measurements

Run times in seconds, latencies in milliseconds.

N x N Matrix Multiply, RS/6000, xlc

N               32      64     128     256     512
Normal        0.03    0.26    2.66   21.16  288.96
Trans         0.03    0.27    2.66   21.17  288.99
Opt           0.01    0.06    1.16    9.14  198.01
Trans Opt     0.01    0.07    1.16    9.18  199.95
Latency       0.04    0.08    0.09     0.2     0.6

Quicksort, 2^N Keys, Ultrasparc, gcc

N               17      18      19      20      21
Normal        1.95    3.28    6.86   13.91   28.25
Trans         1.96    3.68    7.54   15.31   31.25
Opt           0.83     1.2    2.48    4.92    9.85
Trans Opt        1    1.46    2.99    5.94   12.22
Latency       0.02   0.022   0.025   0.026    0.03
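To read these measurements as relative overhead, the transformed run time can be compared with the corresponding untransformed one. The helper below is a small sketch of that arithmetic; the function name is hypothetical and the slides themselves do not state percentages.

```c
#include <assert.h>

/* Relative slowdown of the transformed build,
   as a percentage of the baseline run time. */
double overhead_percent(double baseline_s, double transformed_s)
{
    assert(baseline_s > 0.0);
    return 100.0 * (transformed_s - baseline_s) / baseline_s;
}
```

For example, the unoptimized quicksort at N = 21 gives `overhead_percent(28.25, 31.25)`, roughly 10.6%, while the optimized one, `overhead_percent(9.85, 12.22)`, is roughly 24.1%. For the 512 x 512 matrix multiply the same calculation stays below 1% in both cases.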

Project Status.

Prototype PIL implemented.

Tested across multiple platforms:

Solaris, IRIX, AIX, OSF1, Linux, Win95/NT.

Example applications demonstrated

E.g. matrix multiply, SOR, sort.

Hand coded to use PIL.

Checkpointed/restarted across the above platforms.

APrIL under design and construction.

References.

The Process Introspection Project. http://www.cs.virginia.edu/~ajf2j/introspect/

J. S. Plank, M. Beck, G. Kingsley, and K. Li. Transparent Checkpointing under Unix.

Hua Zhong and Jason Nieh. CRAK: Linux Checkpoint/Restart As a Kernel Module. (Linux taken as an example to explain the design concepts.)

Thank you. Questions?