![Page 1: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/1.jpg)
Grid Checkpointing Architecture
Radosław Januszewski
CoreGrid Summer School 2007
![Page 2: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/2.jpg)
European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies
motivation
- Grids are complex and therefore error-prone.
- The distributed nature of the Grid makes it hard to schedule system maintenance.
- Each uncoordinated power-down or failure results in the loss of the applications running at that time.
- Loss of computation time means additional cost!
![Page 3: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/3.jpg)
goal
To enhance the reliability, fault-tolerance and robustness of the Grid computing environment.
![Page 4: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/4.jpg)
the solution
Grid Checkpoint Architecture (GCA): a proposal for the placement, functionality, and interaction schemes of checkpointing services in the Grid environment
![Page 5: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/5.jpg)
grid - model
[Diagram: grid model. Levels: User Tier, GRID Tier, Cluster Tier, Computing Nodes. Components: User Interface, Grid Broker, Local Resource Manager, three computing nodes each running an Operating System, Globally Accessible Storage (Data Management).]
![Page 6: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/6.jpg)
GCA in the Grid
[Diagram: the GCA placed in the grid model. Levels: User Tier, GRID Tier, Cluster Tier, Computing Nodes. Components: User Interface, Grid Broker, Grid Checkpoint Service, Checkpoint Translation Service (CTS), Local Resource Manager, three computing nodes each with an Operating System and a Core Service, Globally Accessible Storage (Data Management).]
![Page 7: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/7.jpg)
Proof of concept – the goals
• check whether the GCA survives contact with reality
• prepare the PoC on the basis of a real-life installation
• the Grid with the GCA should provide additional value compared with the "traditional" approach
![Page 8: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/8.jpg)
GCA proof of concept installation
[Diagram: proof-of-concept installation. Levels: User Tier, GRID Tier, Cluster Tier, Computing Nodes. Components: Command Line Client, GridSphere interface, Migrating Desktop Client, GRMS, WS GRAM, PBS JobManager, Torque/PBS Pro, Checkpoint script, three Linux nodes each with SGIckpt, NFS shared space.]
![Page 9: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/9.jpg)
involved elements
• GUI: command line, GridSphere, Migrating Desktop
• Broker: GRMS
• Local Resource Manager: Globus + TORQUE
• Core service: SGIckpt
![Page 10: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/10.jpg)
Bottom-up approach
How to make the checkpointer work with the local resource manager?
[Diagram: the proof-of-concept installation, as on the "GCA proof of concept installation" slide.]
![Page 11: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/11.jpg)
pbs/torque special features
action checkpoint
action restart
action checkpoint_abort
![Page 12: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/12.jpg)
config
$action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path
$action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid
$restart_transmogrify true
$action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path
A detailed description is available at http://checkpointing.psnc.pl
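As a rough illustration of what a script wired to `$action checkpoint` receives, the sketch below parses the five substituted arguments in the order used in the config (`%globid %jobid %sid %taskid %path`) and derives a location for the checkpoint image. It is written in Python for readability, whereas the scripts named in the config are shell scripts. The `CheckpointRequest` type, the file naming scheme, the assumption that `%sid` is numeric, and the `run_checkpointer` callback are all hypothetical; they are not the actual interface of SGIckpt or TORQUE.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Sequence


@dataclass(frozen=True)
class CheckpointRequest:
    """Arguments substituted into the action line, in config order."""
    globid: str
    jobid: str
    sid: int            # session id, assumed here to be numeric
    taskid: str
    path: PurePosixPath  # directory for checkpoint images


def parse_action_args(argv: Sequence[str]) -> CheckpointRequest:
    """Parse `%globid %jobid %sid %taskid %path` as passed to the script."""
    if len(argv) != 5:
        raise ValueError(f"expected 5 arguments, got {len(argv)}")
    globid, jobid, sid, taskid, path = argv
    return CheckpointRequest(globid, jobid, int(sid), taskid, PurePosixPath(path))


def image_path(req: CheckpointRequest) -> PurePosixPath:
    """Hypothetical naming scheme: one image file per job and task."""
    return req.path / f"{req.jobid}.{req.taskid}.ckpt"


def handle_checkpoint(
    argv: Sequence[str],
    run_checkpointer: Callable[[int, PurePosixPath], bool],
) -> int:
    """Return a shell-style exit status: 0 on success, 1 on failure.

    `run_checkpointer` stands in for the real checkpointer invocation,
    which is not shown on the slides.
    """
    req = parse_action_args(argv)
    ok = run_checkpointer(req.sid, image_path(req))
    return 0 if ok else 1
```

Passing the checkpointer as a callback keeps the argument handling separate from the checkpointer call itself, so the same parsing could serve the `checkpoint` and `checkpoint_abort` actions, which take the same argument list.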
![Page 13: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/13.jpg)
Broker – local RM connectivity
[Diagram: the proof-of-concept installation, as on the "GCA proof of concept installation" slide.]
![Page 14: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/14.jpg)
problem
The checkpointer: a service or resource?
![Page 15: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/15.jpg)
job description with checkpointing
<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrixi">
        <url>gsiftp://xxx.xxx.xxx.xxxl//home/user/povray</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
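To make the checkpoint-related fields of the job description concrete, here is a minimal Python sketch that reads `<checkpointable>` and `<period>` for each task. It uses only the element names visible in the description above; the `CheckpointSettings` type and the trimmed `EXAMPLE` document are assumptions for illustration, and the unit of `<period>` is not stated on the slide.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CheckpointSettings:
    task_id: str
    checkpointable: bool
    period: Optional[int]  # unit is not given on the slide


def read_checkpoint_settings(job_xml: str) -> List[CheckpointSettings]:
    """Collect the checkpoint-related fields from every <task> element."""
    root = ET.fromstring(job_xml)
    settings = []
    for task in root.iter("task"):
        other = task.find("other")
        flag = "false"
        period_text = None
        if other is not None:
            flag = other.findtext("checkpointable", default="false")
            period_text = other.findtext("period")
        settings.append(
            CheckpointSettings(
                task_id=task.get("taskid", ""),
                checkpointable=flag.strip().lower() == "true",
                period=int(period_text) if period_text else None,
            )
        )
    return settings


# Trimmed copy of the job description, keeping only the fields read above.
EXAMPLE = """
<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource><localrmname>pbs</localrmname></resource>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
"""

if __name__ == "__main__":
    for item in read_checkpoint_settings(EXAMPLE):
        print(item)
```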
![Page 16: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/16.jpg)
the end-user point of view
![Page 17: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/17.jpg)
manual scenario
[Diagram: the proof-of-concept installation, as on the "GCA proof of concept installation" slide.]
![Page 18: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/18.jpg)
manual scenario - restart
[Diagram: the proof-of-concept installation, as on the "GCA proof of concept installation" slide, with an Application marked "Failure!".]
![Page 19: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/19.jpg)
<grmsJob appid="matrix_demo_resume">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <hostname>node-03.checkpointing.psnc.pl</hostname>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrix_long">
        <url>gsiftp://xxx.xxx.xxx.xxx//home/xxxxxx/test_apps/matrix_long</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <recovery>true</recovery>
      <ckpt_id>1179315947518_matrix_demo_submit_0459</ckpt_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
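The resume description differs from a submit description mainly in its `<other>` section, which gains `<recovery>` and `<ckpt_id>`. The Python sketch below shows one way such a description could be derived from a submit description. The function name `make_resume_description` is hypothetical, and the sketch handles only the `appid` attribute and the two recovery fields; whether GRMS needs further changes (such as the `<hostname>` element seen above) is not stated on the slides.

```python
import xml.etree.ElementTree as ET


def make_resume_description(submit_xml: str, ckpt_id: str, resume_appid: str) -> str:
    """Return a copy of the job description marked for recovery from ckpt_id."""
    root = ET.fromstring(submit_xml)
    root.set("appid", resume_appid)
    for task in root.iter("task"):
        other = task.find("other")
        if other is None:
            other = ET.SubElement(task, "other")
        # Drop stale recovery fields so the result never carries duplicates.
        for tag in ("recovery", "ckpt_id"):
            for old in other.findall(tag):
                other.remove(old)
        # Place the new fields right after <grms_id> when it is present,
        # mirroring the element order used in the resume description.
        children = list(other)
        pos = next(
            (i + 1 for i, child in enumerate(children) if child.tag == "grms_id"),
            0,
        )
        recovery = ET.Element("recovery")
        recovery.text = "true"
        ckpt = ET.Element("ckpt_id")
        ckpt.text = ckpt_id
        other.insert(pos, recovery)
        other.insert(pos + 1, ckpt)
    return ET.tostring(root, encoding="unicode")
```

In the manual scenario the user would supply the checkpoint identifier by hand; the function only automates the editing of the description, not the choice of checkpoint.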
![Page 20: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/20.jpg)
failure – end-user view
![Page 21: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/21.jpg)
problem
This semi-automatic solution is not optimal.
How can automatic job failure handling be introduced without adding new functionality to the Broker?
Use workflows!
![Page 22: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/22.jpg)
the workflow
- submit job description
- job finished successfully? yes: send results to user; no: submit "restart scenario" job
- "restart scenario" job finished successfully? yes: send results to user; no: return error description
Problem: with this broker we are not able to model loops
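The workflow can be sketched in a few lines of Python. Because the broker cannot model loops, the sketch unrolls the logic into a single restart attempt, as in the flowchart: one submission, one "restart scenario" job on failure, then an error. The `JobResult` type and the two callables are assumptions standing in for broker submissions; they are not part of the GRMS interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class JobResult:
    succeeded: bool
    output: Optional[str] = None
    error: Optional[str] = None


def run_workflow(
    submit_job: Callable[[], JobResult],
    submit_restart: Callable[[], JobResult],
) -> JobResult:
    """Unrolled workflow: one job submission followed by at most one restart."""
    first = submit_job()
    if first.succeeded:
        return first  # send results to user
    second = submit_restart()  # "restart scenario" job
    if second.succeeded:
        return second  # send results to user
    # No loop is available, so a second failure ends the workflow.
    return JobResult(False, error=second.error or "restart failed")
```

Allowing more restart attempts in this style would mean adding one more explicit branch per attempt, which is the limitation the slide points out.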
![Page 23: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/23.jpg)
automatic scenario
[Diagram: the proof-of-concept installation, as on the "GCA proof of concept installation" slide, with an Application marked "Failure!".]
![Page 24: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/24.jpg)
end-user point of view
![Page 25: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/25.jpg)
the benefits
user: more robust and fault-tolerant Grid environment
sysadmin: much easier system management thanks to the automatic checkpoint and recovery mechanism
![Page 26: Grid Checkpoining Architecture Radosław Januszewski CoreGrid Summer School 2007](https://reader036.vdocuments.net/reader036/viewer/2022070305/5515228c550346a87d8b5263/html5/thumbnails/26.jpg)
Thank you!