Computer and Automation Research Institute, Hungarian Academy of Sciences
Automatic checkpoint of CONDOR-PVM applications by P-GRADE
Jozsef Kovacs, Peter Kacsuk
Laboratory of Parallel and Distributed Systems
MTA SZTAKI, Budapest, Hungary
{smith, kacsuk}@sztaki.hu
http://www.lpds.sztaki.hu
14-16th April 2004 Paradyn/Condor week, Madison, USA Automatic checkpoint of Condor-PVM applications by P-GRADE 2
Background: The Hungarian ClusterGrid
• 2002: Hungarian Ministry of Education, NIIF – procurement project to equip universities, high schools, and public libraries with PC labs.
• More than 2000 PCs, considered an enormous computational resource, were spread over the country.
• Grid Technical Board – the goal was to build a minimal but functional grid system.
• Dual-boot PC labs are connected throughout the country. Daytime operation – Windows desktop use; night-time operation – grid mode use. 24-hour operational “grid backbone” infrastructure.
• Around 800 PCs are interconnected, delivering 400 Gflops, via a private networking solution (MPLS VPN) over the academic network.
• 1st generation ClusterGrid – a single large Condor pool
• 2nd generation ClusterGrid – a Condor-based grid, connected by web services and transaction-based.
Hungarian ClusterGrid Infrastructure
• Condor pools are connected by a global Grid Resource Broker, which uses dynamic UID/GID mapping for user jobs and a “one job – one directory structure” job format.
• Scalable, easy-to-manage system.
• In production since July 2003, with more than 30000 real user jobs executed.
• Applications range from fundamental research (mathematics, physics) to applied research (biology, chemistry).
– investigation of the C60 molecule in electromagnetic fields
– simulation of protein molecules
– fractal calculation
– investigation of imbalanced phase transitions
– etc.
• Two classes of applications are currently supported: parameter scanning, and master-worker jobs parallelized by PVM.
• For more info, see http://www.clustergrid.iif.hu.
Hungarian ClusterGrid Infrastructure (figure)
Motivation
Checkpointing and migration support is necessary
• To enable load balancing
• To support fault tolerance
• To support the day-night working mode of the Hungarian ClusterGrid
• etc.
Condor provides automatic checkpointing for sequential jobs in the standard universe.
Fault-tolerant execution of Master-Worker style parallel jobs is supported, but without automatic checkpointing.
With the P-GRADE environment, Condor can checkpoint PVM jobs automatically, which enables load balancing and makes long-running worker processes fault-tolerant.
P-GRADE environment
Parallel Grid Run-time and Application Development Environment
Using P-GRADE job mode for the whole range of parallel/distributed systems
[Figure: P-GRADE on top of PVM, MPI, and Workflow, targeting supercomputers, clusters, and grids (Condor Grid, GT2 Grid, OGSA)]
P-GRADE and Condor
Current prototype for migration framework
• First prototype is currently based on
– P-GRADE
– Condor
– PVM
• Requirements
– No manual code preparation is required
– No user interaction during execution
– No PVM modification
– No extra requirements from schedulers
– Just build your application using P-GRADE
Structure of P-GRADE application
[Figure: a built-in server connected to clients A, B, C, and D by message passing; the server also connects to the terminal and files]
Server
• process spawn/terminate
• identification/topology
• access to terminal/files
Clients
• identification of neighbors by the server
• access to files/terminal through the server
• primitives for communication
Checkpointing a single process
1. Initiate a checkpoint
2. Synchronize transit messages and disconnect MP
3. Collect address-space information
4. Send checkpoint
5. Store checkpoint onto server
6. Reconnect to MP
[Figure: user process (with ckpt lib and MP handling), checkpoint server, and storage; the arrows numbered 1 to 6 correspond to the steps above]
Vic Zandy’s single process checkpointer:
www.cs.wisc.edu/~zandy/ckpt
© University of Wisconsin, Madison
(former member of the Paradyn group)
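The six steps of the single-process checkpoint can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class names (MessagePassingLayer, CheckpointServer, UserProcess) are hypothetical, and pickle stands in for the address-space image that the real checkpoint library collects. It does not reproduce the actual ckpt library or PVM calls.

```python
import pickle


class MessagePassingLayer:
    """Hypothetical stand-in for the message-passing (MP) connection."""

    def __init__(self):
        self.connected = True
        self.in_transit = []  # messages sent to us but not yet delivered

    def drain(self):
        """Deliver every in-transit message and return them in order."""
        delivered, self.in_transit = self.in_transit, []
        return delivered


class CheckpointServer:
    """Hypothetical stand-in for the checkpoint server and its storage."""

    def __init__(self):
        self.storage = {}

    def store(self, name, image):
        self.storage[name] = image


class UserProcess:
    def __init__(self, name, mp, server):
        self.name = name
        self.mp = mp
        self.server = server
        self.state = {"inbox": [], "iteration": 0}

    def checkpoint(self):
        # Step 1: a checkpoint is initiated (here: by calling this method).
        # Step 2: synchronize in-transit messages, then disconnect from MP.
        self.state["inbox"].extend(self.mp.drain())
        self.mp.connected = False
        # Step 3: collect the process state (pickle replaces a memory image).
        image = pickle.dumps(self.state)
        # Steps 4 and 5: send the image; the server stores it.
        self.server.store(self.name, image)
        # Step 6: reconnect to MP and continue.
        self.mp.connected = True
        return len(image)


mp = MessagePassingLayer()
server = CheckpointServer()
proc = UserProcess("A", mp, server)
proc.state["iteration"] = 7
mp.in_transit.append("msg-from-B")
proc.checkpoint()
restored = pickle.loads(server.storage["A"])
print(restored)
```

Draining the in-transit messages before taking the image is the point of step 2: a message delivered after the image was taken would be lost on resumption.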
Modified structure to checkpoint processes
[Figure: a server/coordination module and clients A, B, C, and D, linked with the ckpt lib and connected through the message passing library; the server connects to the terminal and files; a checkpoint server with storage holds the checkpoints. Each client consists of user code, comm lib, ckpt lib, and mp lib.]
Migration among friendly Condor pools
Legend: S: Server, CS: Checkpoint Server, P: PVM daemon, A, B, C: User processes
Step 1: Starting the application
Step 2: Condor is vacating a node
Step 3: Checkpointing processes
Step 4: Process resumed on friendly Condor pool
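The four migration steps can be sketched as a toy simulation. Everything here is an assumption made for illustration: the Node class, the migrate_on_vacate function, and the node and pool names are hypothetical, and plain dictionaries stand in for process state and for the checkpoint server. Condor and PVM are not called.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    pool: str
    processes: dict = field(default_factory=dict)  # process name -> state


def migrate_on_vacate(vacated, target, checkpoint_store):
    """Checkpoint every process on `vacated`, then resume it on `target`."""
    moved = []
    for name in list(vacated.processes):
        # Step 3: checkpoint the process to the checkpoint server.
        checkpoint_store[name] = dict(vacated.processes.pop(name))
        # Step 4: resume from the stored checkpoint on the friendly pool.
        target.processes[name] = dict(checkpoint_store[name])
        moved.append(name)
    return moved


# Step 1: the application starts with processes A, B, C on pool-1 nodes.
n1 = Node("n1", "pool-1", {"A": {"step": 3}})
n2 = Node("n2", "pool-1", {"B": {"step": 5}, "C": {"step": 4}})
n3 = Node("n3", "pool-2")
store = {}

# Step 2: Condor vacates n2, so its processes must move.
moved = migrate_on_vacate(n2, n3, store)
print(moved, n3.processes)
```

Only the processes on the vacated node are touched; process A keeps running on its original node.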
Live demonstrations
• The prototype has been demonstrated at various conferences and workshops
– EuroPar’03, Klagenfurt, Austria
– Hungarian Grid Day, Budapest, Hungary
– SuperComputing 2003, Phoenix, USA
– Cluster 2003, Hong Kong, China
Possible scenario on checkpointing and migration of P-GRADE programs between clusters
[Figure: P-GRADE GUI and three clusters: London (UoW), Budapest (SZTAKI), Budapest (BUTE)]
1. P-GRADE program submitted to Budapest as a Condor job
2. P-GRADE program runs at SZTAKI cluster
(SZTAKI & BUTE clusters overloaded: checkpointing)
3. P-GRADE program migrates to London as a Condor job
4. P-GRADE program runs at UoW cluster
Integrated checkpoint and monitor
• The checkpoint system cooperates with the GRM-Mercury-PROVE monitoring and visualisation system
– logs the user process out of the monitoring layer before termination
– logs the user process into the monitoring layer after resumption
– the user can trace the machines to which a process has migrated
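The logout/login cooperation can be sketched as follows. The Monitor class, its methods, and the host names are hypothetical and are not the GRM or Mercury API. The sketch only shows the ordering: log out before termination, log in after resumption, and keep a trace of hosts.

```python
class Monitor:
    """Hypothetical stand-in for the monitoring layer."""

    def __init__(self):
        self.active = {}  # process name -> host currently monitored
        self.trace = {}   # process name -> list of hosts visited

    def login(self, process, host):
        self.active[process] = host
        self.trace.setdefault(process, []).append(host)

    def logout(self, process):
        self.active.pop(process, None)


def migrate_with_monitoring(monitor, process, new_host):
    monitor.logout(process)           # before termination on the old host
    # ... checkpoint, terminate, and resume would happen here ...
    monitor.login(process, new_host)  # after resumption on the new host


monitor = Monitor()
monitor.login("A", "sztaki-node-1")
migrate_with_monitoring(monitor, "A", "uow-node-4")
print(monitor.trace["A"])
```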
Migration among non-friendly Condor pools (under development)
[Figure: the GRID Application Manager in the P-GRADE environment, CONDOR pool A, and CONDOR pool B]
1. Detection of low resources on cluster
2. Removal of application from the queue
3. Transfer binaries, checkpoint files, work files
4. Submit application to the queue
5. Auto self-recovery of P-GRADE application
It requires consultation with CONDOR developers…
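Since this part is under development, the following is only a speculative sketch of how an application manager might sequence the five steps. The function migrate_between_pools, the dictionary layout of pools and applications, and the min_free_nodes threshold are all assumptions made for illustration. No Condor commands are invoked.

```python
def migrate_between_pools(app, pool_a, pool_b, min_free_nodes=1):
    """Run the five steps for one application; return the steps performed."""
    log = []
    # 1. Detect low resources on the current cluster.
    if pool_a["free_nodes"] >= min_free_nodes:
        return log
    log.append("detected low resources")
    # 2. Remove the application from the queue of pool A.
    pool_a["queue"].remove(app["name"])
    log.append("removed from queue")
    # 3. Transfer binaries, checkpoint files, and work files.
    pool_b["files"].extend(
        app["binaries"] + app["checkpoints"] + app["work_files"]
    )
    log.append("transferred files")
    # 4. Submit the application to the queue of pool B.
    pool_b["queue"].append(app["name"])
    log.append("submitted to queue")
    # 5. The application recovers itself from its checkpoint files.
    app["recovered"] = True
    log.append("self-recovery")
    return log


pool_a = {"free_nodes": 0, "queue": ["job1"], "files": []}
pool_b = {"free_nodes": 8, "queue": [], "files": []}
app = {
    "name": "job1",
    "binaries": ["job1.bin"],
    "checkpoints": ["A.ckpt"],
    "work_files": ["input.dat"],
    "recovered": False,
}
steps = migrate_between_pools(app, pool_a, pool_b)
print(steps)
```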
Summary of advantages/disadvantages
• Advantages
– no modification of the grid execution environment is required, since all checkpointing/migration capability is built into the application
– supports the day-night working mode in the Hungarian ClusterGrid environment
– adaptivity and automation come from Condor
– Condor-PVM applications, with topology of any kind, can now be dynamically migrated like sequential jobs (Note: Condor does not checkpoint PVM applications; only fault-tolerant execution is supported for Master-Worker type applications)
– migrating jobs can be monitored online and visualised
• Limitations
– currently only P-GRADE-generated PVM jobs are supported
Conclusion
• A parallel program checkpointing mechanism that can be applied to generic PVM programs.
• A checkpointing mechanism that can be connected to Condor in order to realize migration of PVM jobs among Condor pools.
• By integrating P-GRADE migration framework and the Mercury Grid monitor, PVM applications can be performance monitored and visualized even during their migration.
• Condor-PVM, through our checkpointing algorithm, is enhanced to checkpoint PVM applications as is done for sequential jobs.
Thank you for your attention!
Jozsef Kovacs <[email protected]>
Information about P-GRADE:
http://www.lpds.sztaki.hu/pgrade
Next release is coming at the end of April…
Information about Hungarian ClusterGrid:
http://www.clustergrid.iif.hu