PBSpro Advanced Information Systems & Technology Advanced Campus Services Prepared by Chao “Bill” Xie, PhD student Computer Science Fall 2005



Fall 2005 Using PBSpro Advanced, IS&T Advanced Campus Services


Syllabus
  Environment Variables
  Checkpointing


Environment variables
  Variables in a job’s environment are
    Taken from the user’s environment
    Created by PBS
    Created by users
  Names of variables created by PBS start with “PBS_”
  Names that start with “PBS_O_” indicate that the variable comes from the job’s originating environment
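The prefix convention can be checked mechanically. The sketch below is illustrative only: the function name classify_pbs_var is hypothetical, and the loop simply prints whatever PBS_ variables happen to be set in the current shell (none, outside a PBS job).

```shell
#!/bin/sh
# Classify a variable name by the prefix convention described above.
# classify_pbs_var is a hypothetical helper, not part of PBS.
classify_pbs_var() {
    case "$1" in
        PBS_O_*) echo "originating" ;;  # copied from the submission environment
        PBS_*)   echo "pbs" ;;          # created by PBS for the job
        *)       echo "other" ;;        # user-defined or inherited
    esac
}

# List the PBS-related variables visible to the current shell, if any.
env | grep '^PBS_' | while IFS='=' read -r name value; do
    printf '%s\t%s\n' "$name" "$(classify_pbs_var "$name")"
done
```

Inside a running job the loop would be expected to show PBS_O_HOME, PBS_O_WORKDIR and similar names as "originating", and names such as PBS_JOBID as "pbs".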


Important variables
  PBS_O_HOME
    Value of HOME from the submission environment
  PBS_O_HOST
    Host name on which the qsub command was executed
  PBS_O_PATH
    Value of PATH from the submission environment
  PBS_O_QUEUE
    Original queue name to which the job was submitted
  PBS_O_SHELL
    Value of SHELL from the submission environment
  PBS_O_SYSTEM
    Operating system name where qsub was executed
  PBS_O_WORKDIR
    Absolute path of the directory where qsub was executed


Important variables (cont. 1)
  PBS_DEFAULT
    Name of the default PBS server
  PBS_ENVIRONMENT
    Indicates the job type: PBS_BATCH or PBS_INTERACTIVE
  PBS_JOBID
    Job identifier assigned to the job or job array
  PBS_JOBNAME
    Job name supplied by the user
  PBS_MOMPORT
    Port number on which this job’s MOMs will communicate


Important variables (cont. 2)
  PBS_NODEFILE
    Name of the file containing the list of nodes assigned to the job
  PBS_NODENUM
    Logical node number of this node allocated to the job
  PBS_QUEUE
    Name of the queue from which the job is executed
  PBS_TASKNUM
    Task (process) number for the job on this node
  TMPDIR
    Job-specific temporary directory for this job
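A job script typically reads several of these variables at start-up. The following is a minimal sketch, not a site-approved template: the job name env_demo and the function job_summary are hypothetical, and the fallback values after ":-" exist only so the script can also be run outside PBS.

```shell
#!/bin/sh
#PBS -N env_demo
# Hypothetical job script. The fallbacks after ":-" are assumptions added
# so that the script also runs outside PBS for testing.

job_summary() {
    workdir="${PBS_O_WORKDIR:-$PWD}"   # directory where qsub was executed
    jobid="${PBS_JOBID:-no-job-id}"    # identifier assigned by PBS
    scratch="${TMPDIR:-/tmp}"          # job-specific temporary directory
    echo "job $jobid submitted from $workdir, scratch in $scratch"
    # PBS_NODEFILE names a file that lists the nodes assigned to the job.
    if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
        echo "nodes: $(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')"
    fi
}

# Change to the submission directory before doing any work.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1
job_summary
```

Counting unique lines of the node file is one simple way to obtain the number of distinct nodes; a node may appear more than once when several processors on it are assigned to the job.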


Checkpointing
  Two methods of checkpoint / restart:
    OS-specific method
      SGI IRIX and Cray UNICOS
    Generic site-specific method
  Specify the checkpointing directory
    “-C path” command line option to pbs_mom
    PBS_CHECKPOINT_PATH environment variable
    “$checkpoint_path path” option in MOM’s config file
    default value
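The four sources for the checkpointing directory can be pictured as a fallback chain. The sketch below assumes the list is in precedence order, highest first; the slide does not say so explicitly, so this should be confirmed against the PBS documentation. The function name resolve_checkpoint_dir and the example paths are hypothetical.

```shell
#!/bin/sh
# Return the first non-empty candidate, in the order the sources are listed.
# Assumption: that order is a precedence order (highest first).
# Arguments: 1 = value given to "-C"
#            2 = value of PBS_CHECKPOINT_PATH
#            3 = value of "$checkpoint_path" in MOM's config file
#            4 = default value
resolve_checkpoint_dir() {
    for candidate in "$1" "$2" "$3" "$4"; do
        if [ -n "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Hypothetical paths, for illustration only.
resolve_checkpoint_dir "" "${PBS_CHECKPOINT_PATH:-}" "/cfg/ckpt" "/default/ckpt"
```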


Checkpointing (cont)
  Manually checkpointing a job
    Use the qhold command
  Checkpointing jobs during PBS shutdown
    Append the -t immediate option to the qterm statement in the PBS start/stop script
  Suspending/checkpointing multi-node jobs
    Save the complete session state in a file
    An open socket will cause the operation to fail
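The commands named above can be wrapped for a dry run. This is a sketch under stated assumptions: the wrapper names and the job identifier 1234.server are hypothetical, and by default the commands are only printed, not executed. The use of qrls to release the hold afterwards is an addition not mentioned on the slide.

```shell
#!/bin/sh
# Dry-run wrapper: commands are printed unless RUN is set to the empty
# string, e.g. RUN="" on a host where the PBS client commands are installed.
RUN="${RUN-echo}"

checkpoint_job()  { $RUN qhold "$1"; }           # manual checkpoint via hold
release_job()     { $RUN qrls "$1"; }            # release the hold later
shutdown_server() { $RUN qterm -t immediate; }   # checkpoint jobs at shutdown

# "1234.server" is a hypothetical job identifier.
checkpoint_job 1234.server
release_job 1234.server
```

Keeping the real commands behind a dry-run switch makes it possible to review what would be sent to the server before touching running jobs.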


Site-specific method
  Modify file mom_priv/config
  “periodic” job checkpoint action (during job execution)
    $action checkpoint TIME_OUT SCRIPT_PATH ARGS [...]
  Checkpoint just before the job is to be terminated
    $action checkpoint_abort TIME_OUT SCRIPT_PATH ARGS [...]
  Job restart action
    $action restart TIME_OUT SCRIPT_PATH ARGS [...]


Site-specific method (cont)
  $restart_background (true|false)
    A boolean flag that modifies how MOM performs a restart
    “false” (the default): MOM runs the restart operation and waits for the result
    “true”: restart operations are done by a child of MOM, which returns only when the restarts for all the local tasks of a job are done, while the parent (main) MOM continues processing without being blocked
  $restart_transmogrify (true|false)
    A boolean flag that controls how MOM launches the restart script/program
    “false” (the default): MOM runs the restart script and blocks until the restart operation is complete
    “true”: MOM runs the restart script/program in such a way that the script “becomes” the task it is restarting


Specify checkpoint in job
  “-c interval” option defines the checkpoint interval (in minutes)
  The interval argument is specified as:
    n          No checkpointing is to be performed.
    s          Checkpointing is to be performed only when the server executing the job is shut down.
    c          Checkpointing is to be performed at the default minimum time for the server executing the job.
    c=minutes  Checkpointing is to be performed at an interval of minutes.
    u          Checkpointing is unspecified, thus resulting in the same behavior as “s”.
  If “-c” is not specified, the checkpoint attribute is set to the value “u”.
  qsub -c c=10 myjob
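The accepted forms of the interval argument can be summarised as a small validator. This is an illustrative sketch only: describe_checkpoint_arg is a hypothetical helper that mirrors the list above and does not call qsub or reproduce its actual parsing.

```shell
#!/bin/sh
# Describe a "-c" argument according to the list above.
# An empty argument stands for "-c not specified", which maps to "u".
describe_checkpoint_arg() {
    case "$1" in
        n)    echo "no checkpointing" ;;
        s)    echo "checkpoint only at server shutdown" ;;
        c)    echo "checkpoint at the server's default minimum interval" ;;
        c=*)  minutes="${1#c=}"
              case "$minutes" in
                  ''|*[!0-9]*) echo "invalid interval: $1"; return 1 ;;
                  *)           echo "checkpoint every $minutes minutes" ;;
              esac ;;
        u|'') echo "unspecified (behaves like s)" ;;
        *)    echo "invalid interval: $1"; return 1 ;;
    esac
}

describe_checkpoint_arg c=10   # prints "checkpoint every 10 minutes"
```

Checking the argument before submission catches malformed values such as c=abc early, instead of waiting for the submission to be rejected.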




Thank you!

Contacts:
  Bill Xie [email protected]
  Victor Bolet [email protected]