Computer and Automation Research Institute, Hungarian Academy of Sciences

Automatic checkpoint of CONDOR-PVM applications by P-GRADE

Jozsef Kovacs, Peter Kacsuk

Laboratory of Parallel and Distributed Systems

MTA SZTAKI, Budapest, Hungary

{smith, kacsuk}@sztaki.hu

http://www.lpds.sztaki.hu

14-16th April 2004 Paradyn/Condor week, Madison, USA Automatic checkpoint of Condor-PVM applications by P-GRADE 2

Background: The Hungarian ClusterGrid

• 2002: Hungarian Ministry of Education and NIIF procurement project to equip universities, high schools and public libraries with PC labs.

• More than 2000 PCs, considered an enormous computational resource, were spread over the country.

• Grid Technical Board: the goal was to build a minimal but functional grid system.

• Dual-boot PC labs are connected throughout the country. Day-time operation: Windows desktop use. Night-time operation: grid mode use. A "grid backbone" infrastructure is operational 24 hours a day.

• Around 800 PCs with 400 Gflops performance are interconnected via a private networking solution (MPLS VPN) over the academic network.

• 1st generation ClusterGrid: a single large Condor pool.

• 2nd generation ClusterGrid: a Condor-based, transaction-based grid connected by web services.

Hungarian ClusterGrid Infrastructure

• Condor pools are connected by a global Grid Resource Broker, which uses dynamic UID/GID mapping for user jobs and a "one job – one directory structure" job format.

• Scalable, easy to manage system.

• In production since July 2003, with more than 30000 real user jobs executed.

• Applications range from fundamental research (mathematics, physics) to applied research (biology, chemistry).

– investigation of C60 molecule in electromagnetic fields

– simulation of protein molecules

– fractal calculation

– investigation of imbalanced phase transitions

– etc.

• Two classes of applications are currently supported: parameter scanning, and master-worker jobs parallelized by PVM.

• For more info, see http://www.clustergrid.iif.hu.

Motivation

Checkpointing and migration support is necessary:

• to enable load balancing

• to support fault tolerance

• to support the day-night working mode of the Hungarian ClusterGrid

• etc.

Automatic checkpointing for sequential jobs in the standard universe is provided by Condor.

Fault-tolerant execution of Master-Worker style parallel jobs is supported, but without automatic checkpointing.

With the P-GRADE environment, Condor is able to perform automatic checkpointing for PVM jobs, which enables load balancing and makes long-running worker processes fault-tolerant.

P-GRADE environment

Parallel Grid Run-time and Application Development Environment

Using P-GRADE job mode for the whole range of parallel/distributed systems

[Diagram labels: P-GRADE; PVM, MPI, Workflow; Supercomputers, Clusters, Grids; Condor, GT2, OGSA]

P-GRADE and Condor

Current prototype for migration framework

• First prototype is currently based on

– P-GRADE

– Condor

– PVM

• Requirements

– No manual code preparation is required

– No user interaction during execution

– No PVM modification

– No extra requirements from schedulers

– Just build your application using P-GRADE

Structure of P-GRADE application

[Diagram: a built-in server and clients A, B, C and D connected by message passing; Terminal and Files are reached through the server]

Server

• process spawn/terminate

• identification/topology

• access to terminal/files

Clients

• identification of neighbors by the server

• access to files/terminal through the server

• primitives for communication

Checkpointing a single process

1. Initiate a checkpoint

2. Synchronize in-transit messages and disconnect from message passing (MP)

3. Collect address-space information

4. Send checkpoint

5. Store checkpoint onto server

6. Reconnect to MP

[Diagram: user process (with ckpt lib and MP handling), Checkpoint Server and Storage; arrows numbered 1 to 6 mark the six steps]

Vic Zandy’s single process checkpointer:

www.cs.wisc.edu/~zandy/ckpt

© University of Wisconsin, Madison

(former member of the Paradyn group)
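
The six steps for a single process can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: `CheckpointServer`, `UserProcess` and their methods are hypothetical names, the "address space" is a plain dictionary serialized with `pickle`, and nothing here reflects the actual ckpt library or P-GRADE interfaces.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class CheckpointServer:
    """Hypothetical stand-in for the checkpoint server and its storage."""
    storage: dict = field(default_factory=dict)

    def store(self, process_id: str, image: bytes) -> None:
        # Step 5: store the checkpoint on the server side.
        self.storage[process_id] = image


@dataclass
class UserProcess:
    """Toy process whose 'address space' is just a dictionary."""
    process_id: str
    state: dict
    connected: bool = True
    in_transit: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def checkpoint(self, server: CheckpointServer) -> None:
        # Step 1: a checkpoint is initiated.
        self.log.append("1 initiate")
        # Step 2: deliver messages still in transit, then leave the MP layer.
        self.state.setdefault("inbox", []).extend(self.in_transit)
        self.in_transit.clear()
        self.connected = False
        self.log.append("2 sync+disconnect")
        # Step 3: collect the address-space information (here: pickle the state).
        image = pickle.dumps(self.state)
        self.log.append("3 collect")
        # Steps 4 and 5: send the image; the server stores it.
        server.store(self.process_id, image)
        self.log.append("4-5 send+store")
        # Step 6: reconnect to message passing and continue.
        self.connected = True
        self.log.append("6 reconnect")

    @classmethod
    def restore(cls, process_id: str, server: CheckpointServer) -> "UserProcess":
        # Rebuild a process from the image held by the checkpoint server.
        return cls(process_id, pickle.loads(server.storage[process_id]))
```

Calling `checkpoint()` and then `UserProcess.restore()` with the same server yields a process whose state includes any messages that were in transit when the checkpoint started, which is the reason for synchronizing before disconnecting.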

Modified structure to checkpoint processes

[Diagram: server/coordination module and clients A, B, C and D, each linked with a ckpt lib and connected through the message passing library; a Checkpoint Server with Storage; Terminal and Files. Client C is shown in detail as ckpt lib, user code, comm lib and mp lib]

Migration among friendly Condor pools

Step 1: Starting the application

Step 2: Condor is vacating a node

Step 3: Checkpointing processes

Step 4: Process resumed on friendly Condor pool

Legend for the figures: S: Server, CS: Checkpoint Server, P: PVM daemon, A, B, C: User processes
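
The vacate, checkpoint and resume sequence can be sketched as a toy model. Everything below is an assumption made for illustration: `Pool`, `migrate_on_vacate` and the string "images" are hypothetical and do not call Condor or PVM.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical model of a Condor pool: node name -> process name (or None)."""
    name: str
    nodes: dict = field(default_factory=dict)

    def free_node(self):
        # Return the first node with no process on it, or None.
        return next((n for n, p in self.nodes.items() if p is None), None)


def migrate_on_vacate(home: Pool, friendly: Pool, node: str, checkpoints: dict) -> str:
    """Move the process on `node` to a free node of `friendly`; return that node."""
    process = home.nodes[node]
    # Step 3: checkpoint the process before the node is given back.
    checkpoints[process] = f"image-of-{process}"
    target = friendly.free_node()
    if target is None:
        raise RuntimeError("no free node in the friendly pool")
    # Step 2 completes: the vacated node no longer runs the process.
    home.nodes[node] = None
    # Step 4: resume from the stored checkpoint on the friendly pool.
    friendly.nodes[target] = process
    return target
```

For example, with `home = Pool("pool1", {"n1": "A", "n2": "B"})` and `friendly = Pool("pool2", {"m1": None})`, vacating `n2` moves process `B` to `m1` while `A` keeps running on `n1`.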

Live demonstrations

• The prototype has been demonstrated at various conferences and workshops:

– EuroPar'03, Klagenfurt, Austria

– Hungarian Grid Day, Budapest, Hungary

– SuperComputing 2003, Phoenix, USA

– Cluster 2003, Hong Kong, China

Possible scenario on checkpointing and migration of P-GRADE programs between clusters

[Diagram: P-GRADE GUI and three clusters: Budapest - SZTAKI, Budapest - BUTE, London - UoW]

1. P-GRADE program submitted to Budapest as a Condor job

2. P-GRADE program runs at SZTAKI cluster

3. P-GRADE program migrates to London as a Condor job

4. P-GRADE program runs at UoW cluster

SZTAKI & BUTE clusters overloaded: checkpointing

Integrated checkpoint and monitor

• The checkpoint system cooperates with the GRM-Mercury-PROVE monitoring and visualisation system:

– logs the user process out of the monitoring layer before termination

– logs the user process into the monitoring layer after resumption

– the user can trace the machines to which a process has migrated
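
The logout-before-termination and login-after-resumption pattern can be illustrated as follows. This is a minimal sketch with hypothetical names (`Monitor`, `migrate_with_monitoring`); it does not use the GRM, Mercury or PROVE interfaces.

```python
class Monitor:
    """Hypothetical monitoring layer that records where each process runs."""

    def __init__(self):
        self.active = {}    # process -> machine it is currently logged in from
        self.history = {}   # process -> list of machines visited, in order

    def login(self, process, machine):
        self.active[process] = machine
        self.history.setdefault(process, []).append(machine)

    def logout(self, process):
        self.active.pop(process, None)


def migrate_with_monitoring(monitor, process, new_machine):
    # Log the process out before it is terminated on the old machine.
    monitor.logout(process)
    # Checkpoint, transfer and resume would happen here (omitted).
    # Log the process in again once it has resumed on the new machine.
    monitor.login(process, new_machine)
```

Keeping a per-process history in the monitor is what lets a user trace the sequence of machines a process has run on.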

Migration among non-friendly Condor pools (under development)

1. Detection of low resources on cluster

2. Removal of application from the queue

3. Transfer binaries, checkpoint files, work files

4. Submit application to the queue

5. Auto self-recovery of P-GRADE application

[Diagram: P-GRADE environment, GRID Application Manager, CONDOR pool A, CONDOR pool B]

It requires consultation with CONDOR developers…
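
The five steps can be read as a fixed sequence driven by an application manager. The sketch below only models that ordering. `Cluster`, `migrate`, the slot counts and the trace strings are all hypothetical, and since this mechanism is described as under development, the code makes no claim about how it is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical model of a Condor pool as seen by an application manager."""
    name: str
    free_slots: int
    queue: list = field(default_factory=list)
    files: dict = field(default_factory=dict)


def migrate(app: str, source: Cluster, target: Cluster, needed_slots: int) -> list:
    """Run the five steps in order and return a trace of what happened."""
    trace = []
    # 1. Detect low resources on the source cluster.
    if source.free_slots >= needed_slots:
        return trace  # enough resources, nothing to do
    trace.append("low resources detected")
    # 2. Remove the application from the source queue.
    source.queue.remove(app)
    trace.append("removed from queue")
    # 3. Transfer binaries, checkpoint files and work files.
    target.files[app] = source.files.pop(app)
    trace.append("files transferred")
    # 4. Submit the application to the target queue.
    target.queue.append(app)
    trace.append("submitted")
    # 5. The application recovers itself from its checkpoint files.
    trace.append("self-recovery")
    return trace
```

If the source cluster still has enough free slots, the function returns an empty trace and leaves both queues untouched.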

Summary of advantages/disadvantages

• Advantages

– no modification of the grid execution environment is required, since all checkpointing/migration capability is built inside the application

– supports the day-night working mode in the Hungarian ClusterGrid environment

– adaptivity and automation come from Condor

– Condor-PVM applications, with topology of any kind, can now be dynamically migrated like sequential jobs (Note: Condor does not checkpoint PVM applications; only fault-tolerant execution is supported for Master-Worker type applications)

– migrating jobs can be monitored online and visualised

• Limitations

– currently only P-GRADE generated PVM jobs are supported

Conclusion

• A parallel program checkpointing mechanism that can be applied to generic PVM programs.

• A checkpointing mechanism that can be connected to Condor in order to realize migration of PVM jobs among Condor pools.

• By integrating the P-GRADE migration framework with the Mercury Grid monitor, the performance of PVM applications can be monitored and visualized even during their migration.

• Condor-PVM, through our checkpointing algorithm, is enhanced to checkpoint PVM applications as is done for sequential jobs.

Thank you for your attention!

Jozsef Kovacs <[email protected]>

Information about P-GRADE:

[email protected]

http://www.lpds.sztaki.hu/pgrade

Next release is coming at the end of April…

Information about Hungarian ClusterGrid:

http://www.clustergrid.iif.hu