

Platform LSF® User’s Guide

Version 4.2April 2002

Order Number: AA-RSAPB-TE

Platform Computing


Copyright © 1994-2002 Platform Computing Corporation

All rights reserved.

We’d Like to Hear from You

You can help us make this manual better by telling us what you think of the content, organization, and usefulness of the information. If you find an error, or just want to make a suggestion for improving this manual, please address your comments to [email protected].

Your comments should pertain only to the LSF documentation. For product support, contact [email protected].

Although the information in this document has been carefully reviewed, Platform Computing Corporation (“Platform”) does not warrant it to be free of errors or omissions. Platform reserves the right to make corrections, updates, revisions or changes to the information in this document.

UNLESS OTHERWISE EXPRESSLY STATED BY PLATFORM, THE PROGRAM DESCRIBED IN THIS DOCUMENT IS PROVIDED “AS IS” AND WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL PLATFORM COMPUTING BE LIABLE TO ANYONE FOR SPECIAL, COLLATERAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING WITHOUT LIMITATION ANY LOST PROFITS, DATA, OR SAVINGS, ARISING OUT OF THE USE OF OR INABILITY TO USE THIS PROGRAM.

Trademarks ® LSF is a registered trademark of Platform Computing Corporation in the United States and in other jurisdictions.

™ THE BOTTOM LINE IN DISTRIBUTED COMPUTING, PLATFORM COMPUTING, and the PLATFORM and LSF logos are trademarks of Platform Computing Corporation in the United States and in other jurisdictions.

UNIX is a registered trademark of The Open Group in the United States and in other jurisdictions.

Other products or services mentioned in this document are identified by the trademarks or service marks of their respective owners.


Contents

Welcome
    About This Guide
    Learning About Platform LSF
    Technical Support

1  LSF Concepts
    Basic LSF Concepts
    Resource Concepts

2  How LSF Works
    Job Submission
    Job Dispatch
    Host Selection
    Job Execution Environment
    Fault Tolerance
    Job States

3  Working with Jobs
    Submitting Jobs (bsub)
    Modifying a Submitted Job (bmod)
    Modifying Jobs in PEND State
    Modifying Running Jobs
    Controlling Jobs
    Killing Jobs (bkill)
    Suspending and Resuming Jobs (bstop and bresume)
    Changing Job Order Within Queues (bbot and btop)
    Using LSF with Non-Shared File Space
    Reserving Resources for Jobs
    Submitting a Job to Specific Hosts
    Submitting a Job and Indicating Host Preference
    Submitting a Job with Start or Termination Times

4  Viewing Information About Jobs
    Viewing Job Status (bjobs)
    Viewing Job Pend and Suspend Reasons (bjobs -p)
    Viewing Job Parameters (bjobs -l)
    Viewing Job Resource Usage (bjobs -l)
    Viewing Job History (bhist)
    Viewing Job Output (bpeek)

Index


Welcome

Contents

◆ “About This Guide”
◆ “Learning About Platform LSF”
◆ “Technical Support”


About This Guide

Purpose of this guide

This guide introduces the basic concepts of the Platform LSF® software (“LSF”) and describes how to use your LSF installation to run and monitor jobs.

Who should use this guide

This guide is intended for LSF users and administrators who want to understand the fundamentals of Platform LSF operation and use. This guide assumes that you have access to one or more products from the Platform LSF Suite at your site.

How this guide is organized

Chapter 1 “LSF Concepts” introduces the basic terms and concepts of LSF.

Chapter 2 “How LSF Works” discusses the job life cycle, including job submission and dispatch, host selection, and job states.

Chapter 3 “Working with Jobs” discusses the commands for performing operations on jobs such as submitting jobs and specifying queues, hosts, job names, projects and user groups.

It also provides instructions for modifying jobs that have already been submitted, suspending and resuming jobs, moving jobs from one queue to another, sending signals to jobs, and forcing job execution.

You must have a working LSF cluster installed to use the commands described in this chapter.

Chapter 4 “Viewing Information About Jobs” discusses the commands for displaying information about clusters, resources, users, queues, jobs, and hosts.

You must have a working LSF cluster installed to use the commands described in this chapter.


Typographical conventions

Typeface         Meaning                                             Example
Courier          The names of on-screen computer output,             The lsid command
                 commands, files, and directories
Bold Courier     What you type, exactly as shown                     Type cd /bin
Italics          Book titles, new words or terms, or words to be     The queue specified by
                 emphasized; command-line placeholders, which you    queue_name
                 replace with a real name or value
Bold Sans Serif  Names of GUI elements that you manipulate           Click OK

Command notation

Notation             Meaning                                           Example
Quotes " or '        Must be entered exactly as shown                  "job_ID[index_list]"
Commas ,             Must be entered exactly as shown                  -C time0,time1
Ellipsis ...         The argument before the ellipsis can be           job_ID ...
                     repeated. Do not enter the ellipsis.
lowercase italics    The argument must be replaced with a real         job_ID
                     value you provide.
OR bar |             You must enter one of the items separated by      [-h | -V]
                     the bar. You cannot enter more than one item.
                     Do not enter the bar.
Parentheses ( )      Must be entered exactly as shown                  -X "exception_cond([params])::action"
Square brackets [ ]  The argument within the brackets is optional.     lsid [-h]
                     Do not enter the brackets.

Shell prompts

◆ C shell: %
◆ Bourne shell and Korn shell: $
◆ root account: #

Unless otherwise noted, the C shell prompt is used in all command examples:

% cd /bin


Learning About Platform LSF

Finding LSF information

Information about LSF is available from the following sources:

◆ World Wide Web and FTP
◆ README files and Release Notes
◆ LSF manuals
◆ Online documentation
◆ LSF Technical support

World Wide Web and FTP

The latest information about all supported releases of LSF is available on the Platform Web site at http://www.platform.com. Look in the Online Support area for current README files, Release Notes, Upgrade Notices, Frequently Asked Questions (FAQs), Troubleshooting, and other helpful information.

The Platform FTP site (ftp.platform.com) also provides current README files and Release Notes for all supported releases of LSF.

If you have problems accessing the Platform Web site or the Platform FTP site, send email to [email protected].

README files and release notes

The LSF distribution files contain the current README files and Release Notes. Before installing LSF, be sure to read the files named readme.html and release_notes.html. They contain important installation and configuration information that is not in the printed documentation. You can also view these files from the Download area of the Platform Online Support Web page.

Online documentation

The following information is available online:

◆ LSF manuals in HTML and PDF format, available on the LSF product CD, and the Platform Web site at www.platform.com/lsf_docs

◆ Man pages (accessed with the man command) for all LSF commands and configuration files


Technical Support

Contact Platform Computing or your LSF vendor for technical support.

Email: [email protected]

World Wide Web: http://www.platform.com

Phone:
◆ North America: +1 905 948 4297
◆ Europe: +44 1256 370 530
◆ Asia: +86 10 6238 1125

Toll-free phone: 1-877-444-4LSF (+1 877 444 4573)

Mail:
Platform Support
Platform Computing Corporation
3760 14th Avenue
Markham, Ontario
Canada L3R 3T7

When contacting Platform Computing, please include the full name of your company.

We’d like to hear from you

If you find an error in any Platform documentation, or you have a suggestion for improving it, please let us know:

Email: [email protected]

Mail:
Platform Information Development
Platform Computing Corporation
3760 14th Avenue
Markham, Ontario
Canada L3R 3T7

Be sure to tell us:

◆ The title of the manual you are commenting on
◆ The version of the product you are using
◆ The format of the manual (HTML or PDF)


Where to go next

Learn about basic LSF concepts, described in Chapter 1, “LSF Concepts”.


C H A P T E R

1  LSF Concepts

Contents

◆ “Basic LSF Concepts”
◆ “Resource Concepts”


Basic LSF Concepts

General

LSF When documenting the default behavior of LSF, we assume that LSF Standard Edition is installed. If information applies to LSF Base only, or to LSF Batch in combination with another LSF Suite product, this will be explicitly mentioned.

Users

LSF user A user account that has permission to submit jobs to the LSF cluster.

LSF administrator In general, you must be an LSF administrator to perform operations that will affect other LSF users. Each cluster has one primary LSF administrator, specified during LSF installation. You can also configure additional administrators at the cluster level and at the queue level.

primary LSF administrator

An LSF administrator user account that must be specified during LSF installation. The first cluster administrator listed in lsf.cluster.cluster_name.

cluster administrator

An LSF administrator user account that has permission to change the LSF configuration and perform other administrative and maintenance functions throughout the cluster.

For example, a cluster administrator can create an LSF host group, submit a job to any queue, or terminate another user’s job.

queue administrator

An LSF administrator user account that has administrative permissions limited to a specified queue. For example, an LSF queue administrator can perform administrative operations on the specified queue, or on jobs running in the specified queue, but cannot change LSF configuration or operate on LSF daemons.


Jobs

job A job is a command submitted to LSF for execution. This allows LSF to schedule, control, and track the job according to configured policies.

In LSF, a job is a command submitted to LSF Batch. Each command can be a single process, or it can be a group of cooperating processes. LSF creates a new process group for each command it runs, and the job control mechanisms act on all processes in the process group.

A job is a collection of processes run on a machine to perform an action such as solve a system of equations or perform a database transaction. A typical job is run in the background on a machine running LSF.

job ID A unique number that identifies each job, assigned to the job by LSF at submission time. When you use bsub to submit the job, LSF returns the job ID.

job name An alphanumeric string assigned to the job by the user at submission time. Unlike the job ID, the job name is not necessarily unique.

task or interactive task

A task is an interactive command that is not submitted to a queue and is run without using the batch processing features of LSF. Each command can be a single process, or it can be a group of cooperating processes. LSF creates a new process group for each command it runs, and the job control mechanisms act on all processes in the process group.

An interactive task is a command submitted to LSF using lslogin, lsrun, or lsgrun instead of bsub. LSF chooses the best available host to run the application on. The application does not necessarily require any user interaction, but the task is described as interactive because it is dispatched immediately, instead of being submitted to a queue and scheduled by LSF.

interactive batch job

An interactive batch job is a command that is submitted to LSF using bsub, with an option that allows you to interact with the application and enter required input from your own terminal.


LSF supports interactive batch jobs which differ from traditional batch jobs in that stdin, stdout and stderr are sent to the command shell that submitted the batch job. Interactive batch jobs allow users to run applications that require terminal input and output through LSF Batch, taking advantage of LSF scheduling policies and fault tolerance.

To reduce load on the network, it is normally most efficient to run intensively interactive applications, such as word-processing programs, on your own machine, outside of LSF. However, using LSF allows you to use the best available host, to obey all policies set by the LSF administrator, and to track resource usage and job history.
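A typical interactive batch submission uses the bsub -I option. The following is a hedged sketch: the job ID, queue name, host name, and dispatch messages shown are illustrative, and the exact messages may vary by LSF version:

```
% bsub -I myjob
Job <1234> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on hostB>>
```

While the job runs, its stdin, stdout, and stderr are connected to the submitting terminal, but the job still goes through normal LSF scheduling.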

job report LSF prepares a job report when a job is done. This contains job information such as:

◆ CPU use

◆ Memory use

◆ Name of the account that submitted the job

This information is normally emailed to the user along with the job output and any errors, but it can also be directed to a file.

job output A job is a program that can produce output. This is normally emailed to the user along with the job report and any errors, but it can also be directed to a file.

job errors A job is a program that can produce errors. These are normally emailed to the user along with the job report and job output, but they can also be directed to a file.
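For example, the bsub -o and -e options direct job output and errors to files instead of email (a sketch: the file names and job ID are illustrative):

```
% bsub -o myjob.out -e myjob.err myjob
Job <1235> is submitted to default queue <normal>.
```

With -o, the job report is normally appended to the output file when the job completes, instead of being sent by email.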

consider for dispatch or attempt to place a job

A job is considered when LSF evaluates its requirements and determines if a suitable host is available. A job in a queue is considered once per dispatch turn. Jobs are considered in order, from the first in the list to the last.

place a job When LSF successfully matches up a job with a suitable host, this is called placing the job. If a job cannot be placed, it remains in the queue. If a job can be placed, it is dispatched immediately.


dispatch a job When LSF sends the job to the execution host, this is called dispatching the job. As soon as LSF places a job, LSF dispatches the job. Placing the job determines which host to dispatch the job to. Jobs are only placed on available hosts, so the host is expected to start the job as soon as it is dispatched.

job states LSF jobs have the following states:

◆ PEND—Waiting in a queue for scheduling and dispatch

◆ RUN—Dispatched to a host and running

◆ DONE—Finished normally with zero exit value

◆ EXITED—Finished with non-zero exit value

◆ PSUSP—Suspended while pending

◆ SSUSP—Suspended by LSF

◆ USUSP—Suspended by user

◆ POST_DONE—Post-processing completed without errors

◆ POST_ERR—Post-processing completed with errors
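The suspended states can be observed by stopping and resuming a job (a sketch: the job ID, host names, and output layout are illustrative):

```
% bsub sleep 3600
Job <1236> is submitted to default queue <normal>.
% bstop 1236
Job <1236> is being stopped
% bjobs 1236
JOBID USER  STAT  QUEUE  FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
1236  user1 USUSP normal hostA     hostB     sleep 3600 Nov 22 14:15
% bresume 1236
Job <1236> is being resumed
```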

Hosts

hosts, machines, and computers

This document uses the terms host, machine, and computer to refer to a single computer, which may have more than one processor. An informal definition is as follows: if it runs a single copy of the operating system and has a unique Internet (IP) address, it is one computer. More formally, LSF treats each process queue as a separate machine. A multiprocessor computer with a single process queue is considered a single machine, while a box full of processors that each have their own process queue is treated as a group of separate machines.

host types Each host is assigned a host type in lsf.cluster.cluster_name. Host types are defined in lsf.shared. All hosts that are binary-compatible should be the same host type, even if they have different models of processor.


local and remote hosts

When a remote command runs, two hosts are involved. The host where the remote execution is initiated is the local host. The host where the command is executed is the remote host. For example, in this sequence:

hostA% lsrun -v uname
<<Execute uname on remote host hostD>>
HP-UX

Here, the local host is hostA, and the remote host is hostD. It is possible for the local and remote hosts to be the same, if the best host available to run the job is the local host.

master host or LSF master

When LSF runs a job, even if the submission host and execution host are the same, the job information is sent to the master host, which is the host where the master LIM and mbatchd are running. The master host does all the job scheduling.

The master host is displayed by the lsid command:

% lsid
LSF 4.2, Oct 2001
Copyright 1992-2001 Platform Computing Corporation
My cluster name is test_cluster
My master name is hostA

submission and execution hosts

When LSF runs a job, the host from which the job is submitted is the submission host. The job is run on the execution host. It is possible for more than one of these to be the same host, if the best host available to run the job is the local host.

The following example shows the submission and execution hosts for a job:

hostD% bsub myjob
Job <1502> is submitted to default queue <normal>.
hostD% bjobs 1502
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1502  user2 RUN  normal hostD     hostB     myjob    Nov 22 14:03

The job is submitted on hostD, and executed on hostB.

server host or LSF server

A machine running sbatchd (Slave Batch Daemon) that executes batch requests and applies local policies.


client host or LSF client

A machine used to submit jobs to a specified LSF server. Client hosts are not licensed to run jobs themselves. Usually, the less powerful machines in a cluster are set up as LSF clients.

eligible hosts and users

Each queue can have a list of users and user groups who are allowed to submit jobs to the queue, and a list of hosts and host groups that restricts where jobs from the queue can be dispatched.

Queues

queue Queues represent a set of pending jobs, lined up in a defined order and waiting for their opportunity to use resources. Queues implement different job scheduling and control policies. All jobs submitted to the same queue share the same scheduling and control policy. Queues do not correspond to individual hosts; each queue can use all server hosts in the cluster, or a configured subset of the server hosts.

A queue is a network-wide holding place for jobs. Jobs enter the queue via the bsub command. LSF can be configured to have one or more default queues. Jobs that are not submitted to a specific queue will be assigned to the first default queue that accepts them. Queues have the following attributes associated with them:

◆ Priority, where a larger integer is a higher priority

◆ Name, which uniquely identifies the queue

◆ UNIX nice(1) value, which sets the UNIX scheduler priority

◆ Queue limits that restrict the following: hosts, number of jobs, users, groups, processors, etc.

◆ Standard UNIX limits: memory, swap, process, CPU, etc.

◆ Scheduling policies: fairshare, preemptive, FCFS, exclusive

◆ Administrators

◆ Run conditions

◆ Load-sharing threshold conditions, which apply load sharing to the queue.


Example queue:

Begin Queue
QUEUE_NAME  = normal
PRIORITY    = 30
NICE        = 20
STACKLIMIT  = 2048
DESCRIPTION = For normal low priority jobs, running only if hosts are lightly loaded.
QJOB_LIMIT  = 60        # job limit of the queue
PJOB_LIMIT  = 2         # job limit per processor
ut          = 0.2
io          = 50/240
#CPULIMIT   = 180/hostA # 3 hours of hostA
USERS       = all
HOSTS       = all
End Queue

queue priority Defines the order in which queues are searched to determine which job will be processed. Queues are assigned a priority by the LSF administrator, where a higher number has a higher priority. Queues are serviced by LSF in order of priority from the highest to the lowest.

Clusters

cluster Load sharing in LSF is based on clusters. A cluster is a collection of hosts running LSF. Hosts are configured centrally and managed from any machine in the LSF cluster.

A cluster can contain a mixture of host types. By putting all hosts types into a single cluster, you can have easy access to the resources available on all host types.

Clusters are normally set up based on administrative boundaries. LSF clusters work best when each user has an account on all hosts in the cluster, and user files are shared among the hosts so that they can be accessed from any host. This way LSF can send a job to any host. You need not worry about whether the job will be able to access the correct files.

LSF can also run batch jobs when files are not shared among the hosts. LSF includes facilities to copy files to and from the host where the batch job is run, so your data will always be in the right place.


LSF can also run batch jobs when user accounts are not shared by all hosts in a cluster. Accounts can be mapped across machines.

A cluster is a group of hosts that provide shared computing resources. Hosts can be grouped into clusters in a number of ways. A cluster could contain:

◆ All the hosts in a single administrative group

◆ All the hosts on one file server or sub-network

◆ Hosts that perform similar functions

If you have hosts of more than one type, it is often convenient to group them together in the same cluster. LSF allows you to use these hosts transparently, so applications that run on only one host type are available to the entire cluster.

first-come, first-served (FCFS)

The default type of scheduling in LSF. Jobs are considered for dispatch based on their order in the queue.

job slot A job slot is a bucket into which a single unit of work is assigned in the LSF system. Hosts are configured to have a number of job slots available and queues dispatch jobs to fill job slots.

projects Projects are named groups of jobs. You assign a job to a project with the bsub -P option when you submit the job.

LSF will accept any arbitrary string for a project name. If you want to ensure that only valid projects are entered and the user is eligible to charge to that project, you can write an esub to do so.

Projects are used for chargeback accounting.

If a job is submitted without a project, LSF automatically assigns the job to the default project. The default project is the project you define by setting the environment variable LSB_DEFAULTPROJECT for a user. If you do not set LSB_DEFAULTPROJECT, the default project is the project specified by the LSF administrator in lsb.params with the DEFAULT_PROJECT parameter. If DEFAULT_PROJECT is not defined, then LSF uses default as the default project name.


You can use the -P project_name option with the following commands:

◆ bsub
◆ bhist
◆ bjobs
◆ bmod
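For example, to submit a job to a project and then list jobs by project (a sketch: the project name and job details are illustrative):

```
% bsub -P proj_sim myjob
Job <1501> is submitted to default queue <normal>.
% bjobs -P proj_sim
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
1501  user2 RUN  normal hostD     hostB     myjob    Nov 22 14:03
```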

user and host groups

The LSF administrator can configure user and host groups. The group names act as convenient aliases wherever lists of user or host names can be specified on the command line or in configuration files. The administrator can also limit the total number of running jobs belonging to a user or a group of users. User groups can also be defined to reflect the hierarchical share distribution, for hierarchical fairshare.

LSF daemons

LIM LIM (Load Information Manager) runs on each LSF server, monitors its host’s load, and forwards load information to the master LIM. LIM collects 11 built-in load indices.

LIM provides simple placement advice for interactive tasks. This information is used by some of the lstools(1) applications to determine which host to run on.

Master LIM The master LIM is elected to store load data collected by LIMs running on hosts in the LSF cluster.

On one host in the cluster, the LIM acts as the master. The master LIM runs on the master host and forwards load information to mbatchd. The master LIM collects information for all hosts and provides that information to the applications. The master LIM is chosen among all the LIMs running in the cluster. If the master LIM becomes unavailable, a LIM on another host will automatically take over the role of master.

ELIM ELIMs (External LIMs) are site-definable executables that collect up to 256 different resources. They are typically named elim.


RES RES (Remote Execution Server) runs on each LSF server. It accepts remote execution requests and provides fast, transparent, and secure remote execution of interactive jobs.

RES executes jobs and tasks in the background as the job owner. RES is similar to rshd (Remote Shell Daemon).

sbatchd sbatchd (Slave Batch Daemon) runs on each LSF server, receives job requests from mbatchd, and starts the jobs using RES. sbatchd is responsible for enforcing local LSF policies and maintaining the state of jobs on the machine.

mbatchd mbatchd is the Master Batch Daemon that receives job requests from LSF clients and servers and applies scheduling policies to dispatch the jobs to LSF servers in the cluster. The Master Batch Daemon is responsible for the overall state of all jobs in the batch system. It keeps a file of all transactions performed on jobs throughout their lifecycle. mbatchd manages queues and schedules jobs on all hosts in the LSF cluster.

Each cluster has one Master Batch Daemon, which runs on the master host.

PIM PIM (Process Information Manager) runs on each LSF server and is responsible for monitoring every process created for all jobs running on the server. PIM periodically walks the process tree and accumulates memory and CPU use data, which is reported to sbatchd. PIM provides run-time resource use for all LSF jobs.


Miscellaneous

lsrcp lsrcp is an LSF-enabled rcp (remote copy program) that transfers files between hosts in an LSF cluster. lsrcp uses RES on an LSF host to transfer files. If LSF is not installed on a host or if RES is not running then lsrcp uses rcp to copy the file.

external group membership definition

User group or host group definitions can be maintained outside of LSF and imported into the LSF configuration at initialization time. An executable egroup in the LSF_SERVERDIR directory is invoked to obtain the list of members for a given group. The group members, separated by spaces, should be written to the standard output stream of egroup. In the LSF configuration file, the special character ‘!’ should be specified for the group member to indicate that egroup should be invoked.
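The egroup mechanism can be sketched as a small shell script. This is a hypothetical example: the group names and members are invented, and the exact arguments LSF passes to egroup may vary by version, so check your LSF configuration reference before relying on it:

```shell
#!/bin/sh
# Hypothetical egroup executable for LSF_SERVERDIR.
# LSF invokes it to expand a group whose member list is "!" in the
# configuration; the members must be written to standard output,
# separated by spaces.

expand_group() {
    # In this sketch the membership is hard-coded; a real egroup would
    # typically query a site database such as NIS or LDAP.
    case "$1" in
        eng_users) echo "userA userB userC" ;;
        eng_hosts) echo "hostA hostB" ;;
        *)         echo "" ;;
    esac
}

# Example: print the members of the (hypothetical) group named in $1,
# defaulting to eng_users.
expand_group "${1:-eng_users}"
```

Running the script prints the space-separated member list, which is exactly what LSF reads from egroup’s standard output.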


Resource Concepts

About LSF resources

LSF provides a powerful means for you to describe your heterogeneous cluster in terms of resources. One of the most important decisions LSF makes when scheduling a job is to map a job’s resource requirements to resources available on individual hosts.

A computer may be thought of as a collection of resources used to execute programs. Different applications often require different resources. For example, a number-crunching program may take a lot of CPU power, but a large spreadsheet may need a lot of memory to run well. Some applications may run only on machines of a specific type, and not on others. To run applications as efficiently as possible, LSF needs to take these factors into account.

In LSF, resources are handled by naming them and tracking information relevant to them. LSF does its scheduling according to an application’s resource requirements and the resources available on individual hosts. LSF classifies resources in different ways.

There are several types of resources. Load indices measure dynamic resource availability, such as a host’s CPU load or available swap space. Static resources represent unchanging information, such as the number of CPUs a host has, the host type, and the maximum available swap space.

Resources can also be described in terms of where they are located. For example, a shared resource is a resource that is associated with the entire cluster or a subset of hosts within the cluster.
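Resource requirements built on these classifications are typically expressed at submission time with the bsub -R option. For example (a sketch: the thresholds are illustrative, and swp, mem, and r1m are built-in load indices described later in this chapter):

```
% bsub -R "select[swp>100 && mem>50] order[r1m]" myjob
Job <1502> is submitted to default queue <normal>.
```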

Viewing available resources

Use lsinfo to list the resources available in your cluster. The lsinfo command lists all the resource names and their descriptions.


How resources are classified

By values

◆ Numerical resources: resources that take numerical values, such as all the load indices, the number of processors on a host, or the host CPU factor
◆ String resources: resources that take string values, such as host type, host model, and host status
◆ Boolean resources: resources that denote the availability of specific features

By the way values change

◆ Dynamic resources: resources that change their values dynamically: host status and all the load indices
◆ Static resources: resources that do not change their values: all resources except load indices and host status

By definitions

◆ Custom resources: resources defined by user sites: external load indices and resources defined in the lsf.shared file (shared resources)
◆ Built-in resources: resources that are always defined in LSF, such as load indices, number of CPUs, or total swap space

By availability

◆ General resources: resources that are available on all hosts, such as all the load indices, number of processors on a host, total swap space, or host status
◆ Special resources: resources that are only associated with some hosts, such as some shared resources or most Boolean resources

By scope

◆ Host-based resources: resources that are not shared among hosts, but are tied to individual hosts, such as swap space, CPU, or memory. An application must run on a particular host to access the resources. Using up memory on one host does not affect the available memory on another host.
◆ Shared resources: resources that are not associated with individual hosts in the same way, but are owned by the entire cluster, or a subset of hosts within the cluster, such as floating licenses or shared file systems. An application can access such a resource from any host which is configured to share it, but doing so affects its value as seen by other hosts.


Chapter 1: LSF Concepts

Load indices

Load indices measure the availability of dynamic, non-shared resources on hosts in the cluster. Load indices are numeric in value.

Load indices built into the LIM are updated at fixed time intervals.

If you have configured external load indices, they are updated when new values are received from the external load collection program, ELIM. External load indices and ELIM are defined and configured by the LSF administrator.

Load indices collected by LIM:

Index   Measures                   Units                   Direction    Averaged over   Update interval
status  host status                string                  -            -               15 seconds
r15s    run queue length           processes               increasing   15 seconds      15 seconds
r1m     run queue length           processes               increasing   1 minute        15 seconds
r15m    run queue length           processes               increasing   15 minutes      15 seconds
ut      CPU utilization            per cent                increasing   1 minute        15 seconds
pg      paging activity            pages in + pages out    increasing   1 minute        15 seconds
                                   per second
ls      logins                     users                   increasing   N/A             30 seconds
it      idle time                  minutes                 decreasing   N/A             30 seconds
swp     available swap space       megabytes               decreasing   N/A             15 seconds
mem     available memory           megabytes               decreasing   N/A             15 seconds
tmp     available space in         megabytes               decreasing   N/A             120 seconds
        temporary file system
        (/tmp on UNIX, C:\temp
        on Windows NT, for example)
io      disk I/O (shown by         KB per second           increasing   1 minute        15 seconds
        lsload -l)
name    external load index configured by the LSF administrator (site-defined)


Viewing information about load indices

The lsinfo command lists the external load indices, and the lsload -l command displays the values of all load indices. External load indices are configured by your LSF administrator.

Here is an example of the output from lsload:

% lsload
HOST_NAME  status  r15s  r1m  r15m   ut   pg  ls   it  tmp   swp   mem
hostN      ok       0.0  0.0   0.1   1%  0.0   1  224  43M   67M    3M
hostK      -ok      0.0  0.0   0.0   3%  0.0   3    0  38M   40M    7M
hostG      busy    *6.2  6.9   9.5  85%  1.1  30    0   5M  400M  385M
hostF      busy     0.1  0.1   0.3   7%  *17   6    0   9M   23M   28M
hostV      unavail

Static resources

Static resources represent host information that does not change over time, such as the maximum RAM available to user processes or the number of processors in a machine. Most static resources are determined by the LIM at start-up time.

Static resources can be used to select appropriate hosts for particular jobs based on binary architecture, relative CPU speed, and system configuration.

The resources ncpus, maxmem, maxswp, and maxtmp are not static on UNIX hosts that support dynamic hardware reconfiguration.

Static resources reported by LIM:

Index    Measures                          Units             Determined by
type     host type                         string            configuration
model    host model                        string            configuration
hname    host name                         string            configuration
cpuf     CPU factor                        relative          configuration
server   host can run remote jobs          Boolean           configuration
rexpri   execution priority (UNIX only)    nice(2) argument  configuration
ncpus    number of processors              processors        LIM
ndisks   number of local disks             disks             LIM
maxmem   maximum RAM available to users    megabytes         LIM
maxswp   maximum available swap space      megabytes         LIM
maxtmp   maximum available space in        megabytes         LIM
         temporary file system


Load thresholds

Load thresholds can be configured by your LSF administrator to control how jobs are scheduled in queues. There are two types of load thresholds: loadSched and loadStop. Each load threshold specifies a load index value. A loadSched threshold is the scheduling threshold that determines the load condition for dispatching pending jobs: if a host's load is beyond any defined loadSched value, no job is started on that host. This threshold is also used as the condition for resuming suspended jobs. A loadStop threshold is the suspending condition that determines when running jobs should be suspended.

Thresholds can be configured for each queue, for each host, or a combination of both. To schedule a job on a host, the load levels on that host must satisfy both the thresholds configured for that host and the thresholds for the queue from which the job is being dispatched.

The value of a load index may either increase or decrease with load, depending on the meaning of the specific load index. Therefore, when comparing the host load conditions with the threshold values, you need to use either greater than (>) or less than (<), depending on the load index.
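As an illustration of this direction-aware comparison, the following sketch (not LSF code; the index groupings mirror the load index table earlier in this chapter) checks a load value against a threshold:

```python
# For increasing indices (r1m, ut, pg, ...) the load must stay BELOW the
# threshold; for decreasing indices (it, swp, mem, tmp) it must stay ABOVE it.
INCREASING = {"r15s", "r1m", "r15m", "ut", "pg", "io", "ls"}
DECREASING = {"it", "swp", "mem", "tmp"}

def within_threshold(index, load, threshold):
    """Return True if this load value satisfies the threshold."""
    if threshold is None:          # '-' in bhosts/bqueues output: undefined
        return True
    if index in INCREASING:
        return load < threshold
    return load > threshold        # decreasing index

# r1m is increasing, swp is decreasing:
assert within_threshold("r1m", 0.4, 1.0)      # light CPU load: OK
assert not within_threshold("r1m", 2.5, 1.0)  # run queue too long
assert within_threshold("swp", 50, 20)        # plenty of swap left
assert not within_threshold("swp", 10, 20)    # too little swap
```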

Viewing host-level and queue-level suspending conditions

The suspending conditions are displayed by the bhosts -l and bqueues -l commands.

Viewing job-level suspending conditions

The thresholds that apply to a particular job are the more restrictive of the host and queue thresholds, and are displayed by the bjobs -l command.

Viewing resume thresholds

The bjobs -l command displays the scheduling thresholds that control when a job is resumed.


Resource usage

Jobs submitted through the LSF system will have the resources they use monitored while they are running. This information is used to enforce resource limits and load thresholds as well as fairshare scheduling.

LSF collects information such as:

◆ Total CPU time consumed by all processes in the job

◆ Total resident memory usage in KB of all currently running processes in a job

◆ Total virtual memory usage in KB of all currently running processes in a job

◆ Currently active process group ID in a job

◆ Currently active processes in a job

On UNIX, job-level resource usage is collected through a special process called PIM (Process Information Manager). PIM is managed internally by LSF.

Viewing job resource usage

The -l option of the bjobs command displays the current resource usage of the job. The usage information is sampled by PIM every 30 seconds, collected by SBD at a maximum frequency of every SBD_SLEEP_TIME (configured in the lsb.params file), and sent to the MBD. The update is sent only if the value of the CPU time, resident memory usage, or virtual memory usage has changed by more than 10 percent since the previous update, or if a new process or process group has been created.
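The sampling-and-update rule described above can be sketched as follows (a simplified illustration; the field names and exact comparison are assumptions, not LSF source):

```python
def should_send_update(prev, curr, new_process=False):
    """Decide whether SBD forwards a new resource-usage sample to MBD.

    prev/curr are dicts with 'cpu_time', 'rss_kb', 'vsz_kb'. An update is
    sent if any tracked value changed by more than 10% since the previous
    update, or a new process (group) appeared.
    """
    if new_process:
        return True
    for key in ("cpu_time", "rss_kb", "vsz_kb"):
        old, new = prev[key], curr[key]
        if old == 0:
            if new != 0:
                return True
        elif abs(new - old) / old > 0.10:
            return True
    return False

prev = {"cpu_time": 100.0, "rss_kb": 2000, "vsz_kb": 8000}
# Small drift (<= 10% everywhere): no update sent.
assert not should_send_update(prev, {"cpu_time": 105.0, "rss_kb": 2100, "vsz_kb": 8200})
# CPU time jumped 20%: update sent.
assert should_send_update(prev, {"cpu_time": 120.0, "rss_kb": 2000, "vsz_kb": 8000})
# New process appeared: update sent regardless of values.
assert should_send_update(prev, prev, new_process=True)
```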

Viewing load on a host

Use bhosts -l to check the load levels on the host, and adjust the suspending conditions of the host or queue if necessary. The bhosts -l command gives the most recent load values used for the scheduling of jobs. A dash (-) in the output indicates that the particular threshold is not defined.

Example:

% bhosts -l hostB
HOST: hostB
STATUS    CPUF  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV  DISPATCH_WINDOWS
ok       20.00     2    2      0    0      0      0    0  -


CURRENT LOAD USED FOR SCHEDULING:
           r15s  r1m  r15m   ut   pg   io  ls  it  tmp   swp   mem
Total       0.3  0.8   0.9  61%  3.8   72  26   0   6M  253M  297M
Reserved    0.0  0.0   0.0   0%  0.0    0   0   0   0M    0M    0M

LOAD THRESHOLD USED FOR SCHEDULING:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
loadSched     -    -     -   -   -   -   -   -    -    -    -
loadStop      -    -     -   -   -   -   -   -    -    -    -

Resource consumption limits

Resource consumption limits are constraints you or your LSF administrator can specify to limit the use of resources while a job is running. Jobs that consume more than the specified amount of a resource are signalled or have their priority lowered.

Resource limits can be specified either at the queue level by your LSF administrator (lsb.queues) or at the job level when you submit a job.

Hard and soft limits

Resource limits specified at the queue level are hard limits, while those specified with job submission are soft limits. See the setrlimit(2) man page for the concepts of hard and soft limits.

CPU time limit -c cpu_limit[/host_name | /host_model]

Sets the soft CPU time limit to cpu_limit for this batch job. The default is no limit. This option is useful for preventing runaway jobs and for limiting the resources a job can consume. A SIGXCPU signal is sent to all processes belonging to the job when it has accumulated the specified amount of CPU time. If the job has no signal handler for SIGXCPU, this causes it to be killed. LSF keeps track of the CPU time used by all processes of the job.

You can define whether the CPU limit is a per-process limit enforced by the OS or a per-job limit enforced by LSF with LSB_JOB_CPULIMIT in lsf.conf.

cpu_limit is in the form [hour:]minute, where minute can be greater than 59. For example, 3.5 hours can be specified either as 3:30 or as 210. The CPU limit is scaled by the host CPU factors of the submitting and execution hosts. This is done so that the job does approximately the same amount of processing for a given CPU limit, even if it is sent to a host with a faster or slower CPU.

For example, if a job is submitted from a host with a CPU factor of 2 and executed on a host with a CPU factor of 3, the CPU time limit is multiplied by 2/3 because the execution host can do the same amount of work as the submission host in 2/3 of the time.

If the optional host name or host model is not given, the CPU limit is scaled based on the DEFAULT_HOST_SPEC shown by the bparams -l command. (If DEFAULT_HOST_SPEC is not defined, the fastest batch host in the cluster is used as the default.) If a host name or host model is given, its CPU scaling factor is used to adjust the actual CPU time limit on the execution host.

The following example specifies that myjob can run for 10 minutes on a DEC3000 host, or the corresponding time on any other host:

% bsub -c 10/DEC3000 myjob
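The limit format and CPU-factor scaling can be sketched as follows (an illustration of the arithmetic only; the function names are hypothetical):

```python
def parse_cpu_limit(limit):
    """Parse '[hour:]minute' into minutes; minute may exceed 59."""
    if ":" in limit:
        hours, minutes = limit.split(":")
        return int(hours) * 60 + int(minutes)
    return int(limit)

def scaled_cpu_limit(limit_minutes, submit_factor, exec_factor):
    """Scale the limit by the submission vs. execution host CPU factors.

    A faster execution host (larger factor) gets a smaller limit so the
    job does approximately the same amount of processing.
    """
    return limit_minutes * submit_factor / exec_factor

assert parse_cpu_limit("3:30") == 210   # 3.5 hours, either spelling
assert parse_cpu_limit("210") == 210
# The example above: submitted from a factor-2 host, run on a factor-3 host,
# so the limit is multiplied by 2/3.
assert scaled_cpu_limit(210, 2, 3) == 140
```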

Run time limit -W [hours:]minutes[/host_name | /host_model]

Sets the wall-clock run time limit of this batch job. The default is no limit. If the accumulated time the job has spent in the RUN state exceeds this limit, the job is sent a SIGUSR2 signal. If the job does not terminate within 10 minutes after being sent this signal, it is killed.

File size limit -F file_limit

Sets a per-process (soft) file size limit for each process that belongs to this batch job. If a process of this job attempts to write to a file such that the file size would increase beyond file_limit KB, the kernel sends that process a SIGXFSZ signal. This condition normally terminates the process, but may be caught. The default is no soft limit.

Data segment size limit

-D data_limit

Sets a per-process (soft) data segment size limit for each process that belongs to this batch job. An sbrk() or malloc() call to extend the data segment beyond data_limit KB returns an error. The default is no soft limit.


Stack segment size limit

-S stack_limit

Sets a per-process (soft) stack segment size limit for each process that belongs to this batch job. An sbrk() call to extend the stack segment beyond stack_limit KB causes the process to be terminated. The default is no soft limit.

Core file size limit -C core_limit

Sets a per-process (soft) core file size limit for each process that belongs to this batch job. On some systems, no core file is produced if the image for the process is larger than core_limit KB. On other systems only the first core_limit KB of the image are dumped. The default is no soft limit.

Memory limit -M mem_limit

Specify the memory limit, in KB.

If LSB_MEMLIMIT_ENFORCE or LSB_JOB_MEMLIMIT in lsf.conf is set to y, LSF kills the job when it exceeds the memory limit. Otherwise, LSF passes the memory limit to the operating system. Some operating systems apply the memory limit to each process, and some do not enforce the memory limit at all.

The following command submits myjob with a memory limit of 5000 KB:

% bsub -M 5000 myjob

Resource requirements (bsub -R)

A resource requirement is an expression that contains resource names and operators.

Most LSF commands accept a -R res_req argument to specify resource requirements. The exact behavior depends on the command. For example, specifying a resource requirement for the lsload command displays the load levels for all hosts that have the requested resources. Specifying resource requirements for the lsrun command causes LSF to select the best host out of the set of hosts that have the requested resources.


Resource requirements restrict which hosts the job can run on. Each job has its own resource requirements. Hosts that match the resource requirements are the candidate hosts. When LSF schedules a job, it uses the load index values of all the candidate hosts. The load values for each host are compared to the scheduling conditions. Jobs are only dispatched to a host if all load values are within the scheduling thresholds.

By default, if a job has no resource requirements, LSF places it on a host of the same type as the submission host. However, if a job has resource requirements but the host type has not been specified, LSF places the job on any host that meets the specified resource requirements.

To override the LSF defaults, specify resource requirements explicitly. Resource requirements can be set for queues, for individual applications, or for individual jobs.

For the best job placement and performance, resource requirements can be specified for each application. This way, you do not have to specify resource requirements every time you submit a job. The LSF administrator may have already configured the resource requirements for your jobs, or you can put your executable name together with its resource requirements into your personal remote task list.

The bsub command automatically uses the resource requirements of the job from the remote task lists.

Each job can specify resource requirements. Job-level resource requirements override any resource requirements specified in the remote task list.

In some cases, the queue specification sets an upper or lower bound on a resource. If you attempt to exceed that bound, your job will be rejected.


Specifying resource requirements

To specify resource requirements for your job, use bsub -R and specify the resource requirement string.

A resource requirement string describes the resources a job needs. LSF uses resource requirements to select hosts for remote execution and job execution.

A resource requirement string is divided into four sections:

◆ A selection section. The selection section specifies the criteria for selecting hosts from the system.

◆ An ordering section. The ordering section indicates how the hosts that meet the selection criteria should be sorted.

◆ A resource usage section. The resource usage section specifies the expected resource consumption of the task.

◆ A job spanning section. The job spanning section indicates if a parallel batch job should span across multiple hosts.

Depending on the command, one or more of these sections may apply. For example, the lshosts command only selects hosts, but does not order them. The lsload command selects and orders hosts, while lsplace uses the information in select, order, and rusage sections to select an appropriate host for a task. The lsloadadj command uses the rusage section to determine how the load information should be adjusted on a host, while bsub uses all four sections.

Syntax select[selection_string] order[order_string] rusage[usage_string] span[span_string]

The square brackets must be typed as shown.

The section names are select, order, rusage, and span. Sections that do not apply for a command are ignored.

If no section name is given, then the entire string is treated as a selection string. The select keyword may be omitted if the selection string is the first string in the resource requirement.

Each section has a different syntax.
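As a rough illustration, the following sketch splits a resource requirement string into its named sections (a simplified model: it does not handle an implicit selection string combined with other sections, and the syntax inside each section is not parsed):

```python
import re

SECTIONS = ("select", "order", "rusage", "span")

def parse_res_req(res_req):
    """Split a resource requirement string into its four sections.

    If no section keyword appears, the entire string is treated as a
    selection string (the select keyword may be omitted).
    """
    found = {name: m.group(1)
             for name in SECTIONS
             for m in re.finditer(name + r"\[([^\]]*)\]", res_req)}
    if not found:
        found["select"] = res_req.strip()
    return found

r = parse_res_req("select[type==any && swp>35] order[ut] rusage[mem=20] span[hosts=1]")
assert r["select"] == "type==any && swp>35"
assert r["order"] == "ut"
assert r["rusage"] == "mem=20"
assert r["span"] == "hosts=1"

# A bare selection string, with no section keywords at all:
assert parse_res_req("swp > 15 && hpux")["select"] == "swp > 15 && hpux"
```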


Example:

% bsub -R "swp > 15 && hpux order[cpu]" myjob

This runs the job myjob on an HP-UX host that is lightly loaded (hosts are ordered by CPU load) and has at least 15 megabytes of swap space available.


Chapter 2: How LSF Works

LSF can be configured in many different ways that affect the scheduling of jobs. By default, this is how LSF handles a new job:

1 Receive the job. Create a job file. Return the job ID to the user.

2 During the next dispatch turn, consider the job for dispatch.

3 Place the job on the best available host.

4 Set the environment on the host.

5 Start the job.

Contents

◆ "Job Submission"

◆ "Job Dispatch"

◆ "Host Selection"

◆ "Job Execution Environment"

◆ "Fault Tolerance"

◆ "Job States"


Job Submission


The life cycle of a job starts when you submit the job to LSF. On the command line, bsub is used to submit jobs, and you can specify many options to bsub to modify the default behavior.

Queues

The job must be submitted to a queue.

Typically, a cluster has multiple queues. When you submit a job to LSF, you can specify which queue the job will enter. If you submit a job without specifying a queue name, LSF considers the requirements of the job and automatically chooses a suitable queue from a list of candidate default queues. If you did not define any candidate default queues, LSF creates a new queue using all the default settings and submits the job to that queue.

Viewing default queues

Use bparams to display default queues:

% bparams
Default Queues:  normal
...

The user can override this list by defining the environment variable LSB_DEFAULTQUEUE.

How automatic queue selection works

The criteria LSF uses for selecting a suitable queue are as follows:

◆ User access restriction. Queues that do not allow this user to submit jobs are not considered.

◆ Host restriction. If the job explicitly specifies a list of hosts on which the job can be run, then the selected queue must be configured to send jobs to all hosts in the list.

◆ Queue status. Closed queues are not considered.

◆ Exclusive execution restriction. If the job requires exclusive execution, then queues that are not configured to accept exclusive jobs are not considered.

◆ Job’s requested resources. These must be within the resource limits of the selected queue.


If multiple queues satisfy the above requirements, then the first queue listed in the candidate queues (as defined by the DEFAULT_QUEUE parameter in lsb.params or the LSB_DEFAULTQUEUE environment variable) that satisfies the requirements is selected.
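The selection procedure can be sketched as follows (a simplified model; the queue and job field names are hypothetical, and real LSF applies the full set of criteria above):

```python
def choose_queue(job, candidate_queues):
    """Pick the first candidate default queue that can accept the job.

    Each check mirrors one criterion from the list above: queue status,
    user access, host restriction, exclusive execution, resource limits.
    """
    for q in candidate_queues:
        if q["status"] != "Open":                 # closed queues skipped
            continue
        if job["user"] not in q["users"]:         # user access restriction
            continue
        if job["hosts"] and not set(job["hosts"]) <= set(q["hosts"]):
            continue                              # queue must cover all hosts
        if job["exclusive"] and not q["accepts_exclusive"]:
            continue
        if job["cpu_limit"] > q["max_cpu_limit"]: # requested resources
            continue                              # must fit queue limits
        return q["name"]
    return None

queues = [
    {"name": "short", "status": "Open", "users": {"ann"}, "hosts": {"hostA"},
     "accepts_exclusive": False, "max_cpu_limit": 60},
    {"name": "normal", "status": "Open", "users": {"ann", "bob"},
     "hosts": {"hostA", "hostB"}, "accepts_exclusive": True,
     "max_cpu_limit": 600},
]
# bob cannot use "short", so the first queue that satisfies everything wins:
job = {"user": "bob", "hosts": ["hostB"], "exclusive": False, "cpu_limit": 120}
assert choose_queue(job, queues) == "normal"
```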

Job files

When a batch job is submitted to a queue, LSF Batch holds it in a job file until conditions are right for it to be executed. Then the job file is used to execute the job.

UNIX: The job file is a Bourne shell script run by the batch daemon at execution time.

Windows: The job file is a batch file processed by the batch daemon at execution time.


Job Dispatch


Submitted jobs sit in queues until they are dispatched. Many factors control when and where a job runs:

◆ Active time window of the queue or hosts

◆ Resource requirements of the job

◆ Availability of eligible hosts

◆ Various job slot limits

◆ Job dependency conditions

◆ Fairshare constraints

◆ Load conditions

Scheduling policies

First-Come, First-Served (FCFS) scheduling

By default, jobs in a queue are dispatched in first-come, first-served (FCFS) order. This means that jobs are dispatched according to their order in the queue. Since jobs are ordered according to job priority, this does not necessarily mean that jobs will be dispatched in the order of submission. The order of jobs in the queue can also be modified by the user or administrator.

Fairshare scheduling and other policies

If a fairshare scheduling policy has been specified for the queue or if host partitions have been configured, jobs are dispatched in accordance with these policies instead. To solve diverse problems, LSF allows multiple scheduling policies in the same cluster. LSF has several queue scheduling policies such as exclusive, preemptive, fairshare, and hierarchical fairshare.

Viewing job order in queue (bjobs)

Use bjobs to see the order in which jobs in a queue will actually be dispatched for the FCFS policy.


Dispatch turn

Jobs are dispatched at regular intervals (60 seconds by default, configured by MBD_SLEEP_TIME in lsb.params). In each dispatch turn, LSF tries to start as many jobs as possible.

To prevent overloading any host, LSF waits for a configured number of dispatching intervals before sending another job to the same host. The waiting time is configured by the JOB_ACCEPT_INTERVAL parameter in lsb.params or lsb.queues; the default is one dispatch interval. If JOB_ACCEPT_INTERVAL is set to zero, more than one job can be started on a host in the same dispatch turn.

Dispatch order

Jobs are not necessarily dispatched in order of submission.

Each queue has a priority number. LSF Batch tries to start jobs from the highest priority queue first. The queue priority is set by an LSF Administrator when the queue is defined.

By default, LSF considers jobs for dispatch in the following order:

◆ For each queue, from highest to lowest priority

◆ For each job in the queue, according to FCFS order

◆ If any host is eligible to run this job, start the job on the best eligible host, and mark that host ineligible to run any other job until JOB_ACCEPT_INTERVAL dispatch turns have passed

Jobs can be dispatched out of turn if pre-execution conditions are not met, specific hosts or resources are busy or unavailable, or a user has reached the user job slot limit.
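The default consideration order can be sketched as follows (a minimal model that ignores host eligibility and JOB_ACCEPT_INTERVAL; the data layout is an assumption):

```python
def dispatch_order(queues):
    """Yield jobs in default consideration order.

    queues: list of (priority, [job, ...]) where each job list is already
    in FCFS order. Higher-priority queues are considered first.
    """
    for _, jobs in sorted(queues, key=lambda q: -q[0]):
        for job in jobs:
            yield job

queues = [
    (30, ["night_1", "night_2"]),    # low-priority queue
    (70, ["urgent_1"]),              # high-priority queue
    (50, ["normal_1", "normal_2"]),
]
assert list(dispatch_order(queues)) == [
    "urgent_1", "normal_1", "normal_2", "night_1", "night_2"]
```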

Changing job order in queue (btop and bbot)

Use the btop and bbot commands to change the job order in the queue.

See “Changing Job Order Within Queues (bbot and btop)” on page 62 for more information.


Host Selection


Each time LSF attempts to dispatch a job, it checks to see which hosts are eligible to run the job. A number of conditions determine whether a host is eligible:

◆ Host dispatch windows

◆ Resource requirements of the job

◆ Resource requirements of the queue

◆ Host list of the queue

◆ Host load levels

◆ Job slot limits of the host.

A host is only eligible to run a job if all the conditions are met. If a job is queued and there is an eligible host for that job, the job is placed on that host. If more than one host is eligible, the job is started on the best host based on both the job and the queue resource requirements.

Host load levels

A host is available if the values of the load indices (such as r1m, pg, mem) of the host are within the configured scheduling thresholds. There are two sets of scheduling thresholds: host and queue. If any load index on the host exceeds the corresponding host threshold or queue threshold, the host is not eligible to run any job.

Viewing host load levels

The bhosts -l command displays the host thresholds. The bqueues -l command displays the queue thresholds.

Eligible hosts

When LSF is trying to place a job, it obtains current load information for all hosts.

The load levels on each host are compared to the scheduling thresholds configured for that host in the Host section of lsb.hosts, as well as the per-queue scheduling thresholds configured in lsb.queues.

If any load index exceeds either its per-queue or its per-host scheduling threshold, no new job is started on that host.


Viewing eligible hosts

The bjobs -lp command displays the names of hosts that cannot accept a job at the moment together with the reasons the job cannot be accepted.

Resource requirements

Resource requirements at the queue level can also be used to specify scheduling conditions (for example, r1m<0.4 && pg<3).

A higher priority or earlier batch job is only bypassed if no hosts are available that meet the requirements of that job.

If a host is available but is not eligible to run a particular job, LSF Batch looks for a later job to start on that host. LSF Batch starts the first job found for which that host is eligible.


Job Execution Environment


Understanding job execution environment

When LSF runs your jobs, it tries to make execution as transparent to the user as possible. By default, the execution environment is kept as close to the submission environment as possible: LSF copies the environment from the submission host to the execution host. The execution environment includes the following:

◆ Environment variables needed by the job

◆ Working directory where the job begins running

◆ Other system-dependent environment settings; for example:

❖ On UNIX, resource limits and umask

❖ On Windows, desktop and Windows root directory

Since a network can be heterogeneous, it is often impossible or undesirable to reproduce the submission host's environment exactly on the execution host. For example, if the home directory is not shared between the submission and execution hosts, LSF runs the job in /tmp on the execution host. If the DISPLAY environment variable is something like Unix:0.0 or :0.0, it must be processed before it can be used on the execution host. LSF handles these cases automatically.

Users can change the default behavior by using a job starter, or by using bsub -L to change the default execution environment.

For resource control, LSF also changes parts of the job's execution environment, including nice values and resource limits; other environment changes can be made by configuring a job starter.

How LSF sets the job execution environment

By default, LSF transfers environment variables from the submission to the execution host. However, some environment variables do not make sense when transferred. When submitting a job from a Windows to a UNIX machine, the -L option of bsub can be used to reinitialize the environment variables. If submitting a job from a UNIX machine to a Windows machine, you can set the environment variables explicitly in your job script.


PATH environment variable on UNIX and Windows

LSF automatically resets the PATH on the execution host if the submission host is of a different type. If the submission host is Windows and the execution host is UNIX, the PATH variable is set to /bin:/usr/bin:/sbin:/usr/sbin and LSF_BINDIR (if defined in lsf.conf) is appended to it. If the submission host is UNIX and the execution host is Windows, the PATH variable is set to the system PATH variable with LSF_BINDIR appended to it. LSF looks for the presence of the WINDIR variable in the job’s environment to determine whether the job was submitted from a Windows or UNIX host. If WINDIR is present, it is assumed that the submission host was Windows; otherwise, the submission host is assumed to be a UNIX machine.
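The PATH reset logic can be sketched as follows (an illustration of the rules above; the function name and arguments are hypothetical):

```python
def reset_path(job_env, exec_is_unix, lsf_bindir=None, system_path=""):
    """Sketch of the cross-platform PATH reset described above.

    The presence of WINDIR in the job environment implies a Windows
    submission host; its absence implies UNIX.
    """
    from_windows = "WINDIR" in job_env
    if from_windows and exec_is_unix:
        # Windows -> UNIX: fixed system PATH, plus LSF_BINDIR if defined.
        path = "/bin:/usr/bin:/sbin:/usr/sbin"
        if lsf_bindir:
            path += ":" + lsf_bindir
        return path
    if not from_windows and not exec_is_unix:
        # UNIX -> Windows: system PATH with LSF_BINDIR appended.
        path = system_path
        if lsf_bindir:
            path += ";" + lsf_bindir
        return path
    # Same platform: the submitted PATH is kept.
    return job_env.get("PATH", "")

assert reset_path({"WINDIR": r"C:\WINNT"}, exec_is_unix=True,
                  lsf_bindir="/usr/lsf/bin") == \
    "/bin:/usr/bin:/sbin:/usr/sbin:/usr/lsf/bin"
assert reset_path({"PATH": "/home/ann/bin"}, exec_is_unix=True) == "/home/ann/bin"
```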


Fault Tolerance


LSF is designed to continue operating even if some of the hosts in the cluster are unavailable. One host in the cluster acts as the master, but if the master host becomes unavailable another host takes over. LSF services are available as long as there is one available host in the cluster.

LSF can tolerate the failure of any host or group of hosts in the cluster. When a host crashes, all jobs running on that host are lost. No other pending or running jobs are affected. Important jobs can be submitted to LSF with an option to automatically restart if the job is lost because of a host failure.

Dynamic master host

The LSF master host is chosen dynamically. If the current master host becomes unavailable, another host takes over automatically. The master host selection is based on the order in which hosts are listed in the lsf.cluster.cluster_name file. If the first host in the file is available, that host acts as the master. If the first host is unavailable, the second host takes over, and so on. LSF might be unavailable for a few minutes while hosts are waiting to be contacted by the new master.

Running jobs are managed by SBD on each batch server host. When the new MBD starts up it polls the SBDs on each host and finds the current status of its jobs. If SBD fails but the host is still running, jobs running on the host are not lost. When SBD is restarted it regains control of all jobs running on the host.

Network failure

If the cluster is partitioned by a network failure, a master LIM takes over on each side of the partition. Interactive load-sharing remains available, as long as each host still has access to the LSF executables.


Event log file (lsb.events)

Fault tolerance in LSF depends on the event log file, lsb.events, which is kept on the primary file server. Every event in the system is logged in this file, including all job submissions and job and host state changes. If the master host becomes unavailable, a new master is chosen by the LIMs. SBD on the new master starts a new MBD. The new MBD reads the lsb.events file to recover the state of the system.

For sites not wanting to rely solely on a central file server for recovery information, LSF can be configured to maintain a duplicate event log by keeping a replica of lsb.events. The replica is stored on the file server, and used if the primary copy is unavailable. When using LSF’s duplicate event log function, the primary event log is stored on the first master host, and re-synchronized with the replicated copy when the host recovers.

Partitioned network

If the network is partitioned, only one of the partitions can access lsb.events, so batch services are only available on one side of the partition. A lock file is used to guarantee that only one MBD is running in the cluster.

Host failure

If an LSF server host fails, jobs running on that host are lost. No other jobs are affected. Jobs can be submitted so that they are automatically rerun from the beginning or restarted from a checkpoint on another host if they are lost because of a host failure.

If all of the hosts in a cluster go down, all running jobs are lost. When a host comes back up and takes over as master, it reads the lsb.events file to get the state of all batch jobs. Jobs that were running when the systems went down are assumed to have exited, and email is sent to the submitting user. Pending jobs remain in their queues, and are scheduled as hosts become available.


Job States


A job goes through a series of state transitions until it eventually completes its task, fails, or is terminated. The possible states of a job during its life cycle are shown in the diagram below.

Most jobs enter only three states:

PEND Waiting in a queue for scheduling and dispatch

RUN Dispatched to a host and running

DONE Finished normally with a zero exit value

Pending jobs

A job remains pending until all conditions for its execution are met. Some of the conditions are:

◆ Start time specified by the user when the job is submitted

◆ Load conditions on qualified hosts

◆ Dispatch windows during which the queue can dispatch and qualified hosts can accept jobs

◆ Run windows during which jobs from the queue can run

[State transition diagram: bsub places a job in PEND; when a suitable host is found the job moves to RUN; normal completion leads to DONE, while bkill or an abnormal exit leads to EXIT; bstop and bresume move a job between PEND and PSUSP, or between RUN and USUSP; when the host is overloaded LSF moves a job from RUN to SSUSP, and back to RUN when the host is OK; migration returns a job to PEND.]


◆ Limits on the number of job slots configured for a queue, a host, or a user

◆ Relative priority to other users and jobs

◆ Availability of the specified resources

◆ Job dependency and pre-execution conditions

Suspended jobs

A job can be suspended at any time, either by its owner, by the LSF administrator, by the root user (superuser), or by LSF itself. There are three different states for suspended jobs:

PSUSP Suspended by its owner or the LSF administrator while in PEND state

USUSP Suspended by its owner or the LSF administrator after being dispatched

SSUSP Suspended by LSF after being dispatched

After a job has been dispatched and started on a host, it can be suspended by LSF. When a job is running, LSF periodically checks the load level on the execution host. If any load index is beyond either its per-host or its per-queue suspending conditions, the lowest priority batch job on that host is suspended.

If the load on the execution host or hosts becomes too high, batch jobs could be interfering with each other or with interactive jobs. In either case, some jobs should be suspended to maximize host performance or to guarantee interactive response time.

LSF suspends jobs according to the priority of the job’s queue. When a host is busy, LSF suspends lower priority jobs first unless the scheduling policy associated with the job dictates otherwise. Jobs are also suspended by the system if the job queue has a run window and the current time goes outside the run window.

A system-suspended job can later be resumed by LSF if the load on the execution host falls low enough, or when the queue’s closed run window opens again.


Viewing suspension reasons

The bjobs -s command displays the reason why a job was suspended.

WAIT state (chunk jobs)

If you have configured chunk job queues, members of a chunk job that are waiting to run are displayed as WAIT by bjobs. Any jobs in WAIT status are included in the count of pending jobs by bqueues and busers, even though the entire chunk job has been dispatched and occupies a job slot. The bhosts command shows the single job slot occupied by the entire chunk job in the number of jobs shown in the NJOBS column.

You can switch (bswitch) or migrate (bmig) a chunk job member in WAIT state to another queue.

Viewing wait status and wait reason

The bhist -l command shows jobs in WAIT status as Waiting ...

The bjobs -l command does not display a WAIT reason in the list of pending jobs.

See the Platform LSF Administrator’s Guide for more information about chunk jobs.

Exited jobs

A job might terminate abnormally for various reasons. Job termination can happen from any state. An abnormally terminated job goes into EXIT state. The situations where a job terminates abnormally include:

◆ The job is cancelled by its owner or the LSF administrator while pending, or after being dispatched to a host.

◆ The job cannot be dispatched before it reaches its termination deadline, and is therefore aborted by LSF.

◆ The job fails to start successfully. For example, the wrong executable is specified by the user when the job is submitted.

◆ The job exits with a non-zero exit status.


Chapter 3: Working with Jobs

Contents ◆ “Submitting Jobs (bsub)” on page 50

◆ “Modifying a Submitted Job (bmod)” on page 54

❖ “Modifying Jobs in PEND State” on page 55

❖ “Modifying Running Jobs” on page 56

◆ “Controlling Jobs” on page 58

❖ “Killing Jobs (bkill)” on page 59

❖ “Suspending and Resuming Jobs (bstop and bresume)” on page 60

❖ “Changing Job Order Within Queues (bbot and btop)” on page 62

◆ “Using LSF with Non-Shared File Space” on page 64

◆ “Reserving Resources for Jobs” on page 66

◆ “Submitting a Job to Specific Hosts” on page 68

◆ “Submitting a Job and Indicating Host Preference” on page 69

◆ “Submitting a Job with Start or Termination Times” on page 71


Submitting Jobs (bsub)

In this section ◆ “Submitting a job to a queue (bsub -q)” on page 50

◆ “Submitting a job associated to a project (bsub -P)” on page 52

◆ “Submitting a job associated to a user group” on page 52

◆ “Submitting a job with a job name” on page 53

bsub command

You submit a job with the bsub command. If you do not specify any options, the job is submitted to the default queue configured by the LSF administrator (usually queue normal).

For example, if you submit the job my_job without specifying a queue, the job goes to the default queue.

% bsub my_job
Job <1234> is submitted to default queue <normal>

In the above example, 1234 is the job ID assigned to this job, and normal is the name of the default job queue.

Your job remains pending until all conditions for its execution are met. Each queue has execution conditions that apply to all jobs in the queue, and you can specify additional conditions when you submit the job.

You can also specify an execution host or a range of hosts, a queue, and start and termination times, as well as a wide range of other job options. See the bsub command in the Platform LSF Reference Guide for more details on bsub options.

Submitting a job to a queue (bsub -q)

The default queue is normally suitable to run most jobs, but the default queue may assign your jobs a very low priority, or restrict execution conditions to minimize interference with other jobs. If automatic queue selection is not satisfactory, choose the most suitable queue for each job.


The factors affecting which queue to choose are:

◆ User access restrictions

◆ Size of the job

◆ Resource limits of the queue

◆ Scheduling priority of the queue

◆ Active time windows of the queue

◆ Hosts used by the queue

◆ Scheduling load conditions

◆ The queue description displayed by the bqueues -l command

Submitting a job to a specific queue

Job queues represent different job scheduling and control policies. All jobs submitted to the same queue share the same scheduling and control policy. Each job queue can use a configured subset of server hosts in the cluster; the default is to use all server hosts.

System administrators can configure job queues to control resource access by different users and types of application. Users select the job queue that best fits each job.

Viewing available queues

To see available queues, use the bqueues command.

Use bqueues -u user_name to specify a user or user group so that bqueues displays only the queues that accept jobs from these users.

The bqueues -m host_name option allows users to specify a host name or host group name so that bqueues displays only the queues that use these hosts to run jobs.

You can submit jobs to a queue as long as its STATUS is Open. However, jobs are not dispatched unless the queue is Active.

Submitting a job

The following examples are based on the queues defined in the default configuration. Your LSF administrator may have configured different queues.

To run a job during off hours because the job generates very high load to both the file server and the network, you can submit it to the night queue:

% bsub -q night

If you have an urgent job to run, you may want to submit it to the priority queue:

% bsub -q priority


If you want to use hosts owned by others and you do not want to bother the owners, you may want to run your low priority jobs on the idle queue so that as soon as the owner comes back, your jobs get suspended.

If you are running small jobs and do not want to wait too long to get the results, you can submit jobs to the short queue to be dispatched with higher priority. Make sure your jobs are short enough that they are not killed for exceeding the CPU time limit of the queue (check the resource limits of the queue, if any).

If your job requires a specific execution environment, you may need to submit it to a queue that has a particular job starter defined. Because only your system administrator can specify a queue-level job starter as part of the queue definition, check with your administrator for more information.

See the Platform LSF Administrator’s Guide for information on queue-level job starters.

Submitting a job associated to a project (bsub -P)

Use the bsub -P project_name option to associate a project name with a job.

On systems running IRIX 6, before the submitted job begins execution, a new array session is created and the project ID corresponding to the project name is assigned to the session.

Submitting a job associated to a user group

You can use the bsub -G user_group option to submit a job and associate it with a specified user group. This option is only useful with fairshare scheduling.

For more details on fairshare scheduling, see the Platform LSF Administrator’s Guide.

You can specify any user group to which you belong as long as it does not contain any subgroups. You must be a direct member of the specified user group.

User groups in non-leaf nodes cannot be specified, because that would cause ambiguity in determining the correct shares given to a user.


For example, to submit the job myjob associated to user group special:

% bsub -G special myjob

Submitting a job with a job name

Use bsub -J job_name to submit a job and assign a job name to it.

You can later use the job name to identify the job. The job name need not be unique.

For example, to submit a job and assign the name my_job:

% bsub -J my_job

You can also assign a job name to a job array. See the Platform LSF Administrator’s Guide for more information about job arrays.


Modifying a Submitted Job (bmod)

In this section ◆ “Modifying Jobs in PEND State” on page 55

◆ “Modifying Running Jobs” on page 56

◆ “Controlling Jobs” on page 58


Modifying Jobs in PEND State

For submitted jobs in PEND state, the job owner or the LSF administrator can use the bmod command to modify command-line parameters. You can also modify entire job arrays or individual elements of a job array.

See the bmod command in the Platform LSF Reference Guide for more details.

Replacing the job command-line

To replace the job command line, use the bmod -Z "new_command" option. The following example replaces the command line for job 101 with "myjob file":

% bmod -Z "myjob file" 101

Changing a job parameter

To change a specific job parameter, use bmod with the bsub option used to specify the parameter. The specified options replace the submitted options. The following example changes the start time of job 101 to 2:00 a.m.:

% bmod -b 2:00 101

Resetting to default submitted value

To reset an option to its default submitted value (undo a bmod), append the n character to the option name, and do not include an option value. The following example resets the start time for job 101 back to its original value:

% bmod -bn 101

Resource reservation can be modified after a job has been started to ensure proper reservation and optimal resource utilization.


Modifying Running Jobs

Modifying resource reservation

A job is usually submitted with a resource reservation for the maximum amount required. Use bmod -R to modify the resource reservation for a running job. This command is usually used to decrease the reservation, allowing other jobs access to the resource.

The following example sets the resource reservation for job 101 to 25MB of memory and 50MB of swap space:

% bmod -R "rusage[mem=25:swp=50]" 101

By default, you can modify resource reservation for running jobs. Set LSB_MOD_ALL_JOBS in lsf.conf to modify additional job options.

See “Reserving Resources for Jobs” on page 66 for more details.

Modifying other job options

If LSB_MOD_ALL_JOBS is specified in lsf.conf, the job owner or the LSF administrator can use the bmod command to modify the following job options for running jobs:

◆ CPU limit (-c [hour:]minute[/host_name | /host_model] | -cn)

◆ Memory limit (-M mem_limit | -Mn)

◆ Run limit (-W run_limit[/host_name | /host_model] | -Wn)

◆ Standard output file name (-o output_file | -on)

◆ Standard error file name (-e error_file | -en)

◆ Rerunnable jobs (-r | -rn)

In addition to resource reservation, these are the only bmod options that are valid for running jobs. You cannot make any other modifications after a job has been dispatched.

An error message is issued and the modification fails if these options are used on running jobs in combination with other bmod options.

Platform LSF User’s Guide

Page 57: LSF User Guide

Chapter 3Working with Jobs

Modifying resource limits for running jobs

The new resource limits cannot exceed the resource limits defined in the queue.

To modify the CPU limit of running jobs, LSB_JOB_CPULIMIT=Y must be defined in lsf.conf.

To modify the memory limit of running jobs, LSB_JOB_MEMLIMIT=Y must be defined in lsf.conf.

Limitations

Modifying remote running jobs in a MultiCluster environment is not supported.

To modify the name of the job error file for a running job, you must have used bsub -e or bmod -e to specify an error file before the job started running.

For more information

See the Platform LSF Administrator’s Guide for more information about job output files, using job-level resource limits, and submitting rerunnable jobs.


Controlling Jobs

LSF needs to control jobs dispatched to a host to enforce scheduling policies, or in response to user requests. The principal actions the system performs on a job include suspend, resume, and terminate.

The actions are carried out by sending the signal SIGSTOP for suspending a job, SIGCONT for resuming a job, and SIGKILL for terminating a job.

On Windows, equivalent functions have been implemented to perform the same tasks.
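On UNIX these are the standard job-control signals, and their effect can be observed on any ordinary process. The sketch below uses a sleep process as a stand-in batch job; it illustrates the signals themselves, not LSF commands.

```shell
# Suspend, resume, and terminate a stand-in "job" with the same
# signals LSF uses: SIGSTOP, SIGCONT, SIGKILL.
sleep 60 &
pid=$!
kill -STOP "$pid"                     # suspend (as for bstop)
sleep 1
state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "after SIGSTOP, process state: $state"   # state begins with T (stopped)
kill -CONT "$pid"                     # resume (as for bresume)
kill -KILL "$pid"                     # terminate (as for bkill)
wait "$pid" 2>/dev/null || true
echo "process terminated"
```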

In this section ◆ “Killing Jobs (bkill)” on page 59

◆ “Suspending and Resuming Jobs (bstop and bresume)” on page 60

◆ “Changing Job Order Within Queues (bbot and btop)” on page 62


Killing Jobs (bkill)

The bkill command cancels pending batch jobs and sends signals to running jobs. By default, on UNIX, bkill sends the SIGKILL signal to running jobs.

Before SIGKILL is sent, SIGINT and SIGTERM are sent to give the job a chance to catch the signals and clean up. The signals are forwarded from mbatchd to sbatchd. sbatchd waits for the job to exit before reporting the status. Because of these delays, for a short period of time after the bkill command has been issued, bjobs may still report that the job is running.

On Windows, job control messages replace the SIGINT and SIGTERM signals, and termination is implemented by the TerminateProcess() system call.
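The grace period matters because a job can trap the polite signals and clean up before SIGKILL arrives. The sketch below uses an ordinary background shell as a stand-in job; it illustrates the signal sequence, not bkill itself.

```shell
# A stand-in job that traps SIGINT/SIGTERM so it can clean up.
# (bkill sends SIGINT, then SIGTERM, then SIGKILL to the real job.)
bash -c 'sleep 60 & child=$!
         trap "kill $child; echo cleaning up; exit 3" INT TERM
         wait $child' &
pid=$!
sleep 1
kill -TERM "$pid"      # a polite signal; SIGKILL would follow later
status=0
wait "$pid" || status=$?
echo "stand-in job exited with status $status after cleanup"
```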

Killing a job

For example, to kill job 3421:

% bkill 3421
Job <3421> is being terminated

Forcing removal of a job from LSF

If a job cannot be killed in the operating system, use bkill -r to force the removal of the job from LSF.

The bkill -r command removes a job from the system without waiting for the job to terminate in the operating system. This sends the same series of signals as bkill without -r, except that the job is removed from the system immediately, the job is marked as EXIT, and job resources that LSF monitors are released as soon as LSF receives the first signal.


Suspending and Resuming Jobs (bstop and bresume)

The bstop and bresume commands allow you to suspend or resume a job.

A job can also be suspended by its owner or the LSF administrator with the bstop command. These jobs are considered user-suspended and are displayed by bjobs as USUSP.

When the user restarts the job with the bresume command, the job is not started immediately to prevent overloading. Instead, the job is changed from USUSP to SSUSP (suspended by the system). The SSUSP job is resumed when host load levels are within the scheduling thresholds for that job, similarly to jobs suspended due to high load.

If a user suspends a high priority job from a non-preemptive queue, the load may become low enough for LSF to start a lower priority job in its place. The load created by the low priority job can prevent the high priority job from resuming.

This can be avoided by configuring preemptive queues. See the Platform LSF Administrator’s Guide for information about configuring queues.

Suspending a job

bstop command

To suspend a job, use the bstop command. Suspending a job causes it to go into USUSP state if the job has already started, or into PSUSP state if it is pending.

UNIX

bstop sends the following signals to the job:

◆ SIGTSTP for parallel or interactive jobs

SIGTSTP is caught by the master process and passed to all the slave processes running on other hosts.

◆ SIGSTOP for sequential jobs

SIGSTOP cannot be caught by user programs. The SIGSTOP signal can be configured with the LSB_SIGSTOP parameter in lsf.conf.


Windows

bstop causes the job to be suspended.

Example

To suspend job 3421, enter:

% bstop 3421
Job <3421> is being stopped

Resuming a job

bresume command

To resume a job, use the bresume command.

Resuming a user-suspended job does not put your job into RUN state immediately. If your job was running before the suspension, bresume first puts your job into SSUSP state and then waits for sbatchd to schedule it according to the load conditions.

For example, to resume job 3421, enter:

% bresume 3421
Job <3421> is being resumed


Changing Job Order Within Queues (bbot and btop)

By default, LSF dispatches jobs in a queue in the order of arrival (that is, first-come-first-served), subject to availability of suitable server hosts.

The btop and bbot commands change the position of pending jobs, or of pending job array elements, to affect the order in which jobs are considered for dispatch. Users can only change the relative position of their own jobs, and LSF administrators can change the position of any users’ jobs.

bbot

Moves jobs relative to your last job in the queue.

If invoked by a regular user, bbot moves the selected job after the last job with the same priority submitted by the user to the queue.

If invoked by the LSF administrator, bbot moves the selected job after the last job with the same priority submitted to the queue.

Pending jobs are displayed by bjobs in the order in which they will be considered for dispatch.

btop

Moves jobs relative to your first job in the queue.

If invoked by a regular user, btop moves the selected job before the first job with the same priority submitted by the user to the queue.

If invoked by the LSF administrator, btop moves the selected job before the first job with the same priority submitted to the queue.

Pending jobs are displayed by bjobs in the order in which they will be considered for dispatch.


Change job order within queues

In the following example, job 5311 is moved to the top of the queue. Since job 5308 is already running, job 5311 is placed in the queue after job 5308.

Note that user1’s job is still in the same position in the queue. user2 cannot use btop to get extra jobs at the top of the queue; when one of their jobs moves up the queue, the rest of their jobs move down.

% bjobs -u all
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5308  user2 RUN  normal hostA     hostD     /s500    Oct 23 10:16
5309  user2 PEND night  hostA               /s200    Oct 23 11:04
5310  user1 PEND night  hostB               /myjob   Oct 23 13:45
5311  user2 PEND night  hostA               /s700    Oct 23 18:17

% btop 5311
Job <5311> has been moved to position 1 from top.

% bjobs -u all
JOBID USER  STAT QUEUE  FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
5308  user2 RUN  normal hostA     hostD     /s500    Oct 23 10:16
5311  user2 PEND night  hostA               /s700    Oct 23 18:17
5310  user1 PEND night  hostB               /myjob   Oct 23 13:45
5309  user2 PEND night  hostA               /s200    Oct 23 11:04


Using LSF with Non-Shared File Space

LSF is usually used in networks with shared file space. When shared file space is not available, use the bsub -f command to have LSF copy needed files to the execution host before running the job, and copy result files back to the submission host after the job completes.

LSF attempts to run the job in the directory where the bsub command was invoked. If the execution directory is under the user’s home directory, sbatchd looks for the path relative to the user’s home directory. This handles some common configurations, such as cross-mounting user home directories with the /net automount option.

If the directory is not available on the execution host, the job is run in /tmp. Any files created by the batch job, including the standard output and error files created by the -o and -e options to bsub, are left on the execution host.

LSF provides support for moving user data from the submission host to the execution host before executing a batch job, and from the execution host back to the submitting host after the job completes. The file operations are specified with the -f option to bsub.

LSF uses the lsrcp command to transfer files. lsrcp contacts RES on the remote host to perform file transfer. If RES is not available, the UNIX rcp command is used.

See the Platform LSF Administrator’s Guide for more information about file transfer in LSF.

bsub -f

The -f "[local_file operator [remote_file]]" option to the bsub command copies a file between the submission host and the execution host. To specify multiple files, repeat the -f option.

local_file File name on the submission host

remote_file File name on the execution host


The files local_file and remote_file can be absolute or relative file path names. You must specify at least one file name. When the file remote_file is not specified, it is assumed to be the same as local_file. Including local_file without the operator results in a syntax error.

operator Operation to perform on the file. The operator must be surrounded by white space.

Valid values for operator are:

> local_file on the submission host is copied to remote_file on the execution host before job execution. remote_file is overwritten if it exists.

< remote_file on the execution host is copied to local_file on the submission host after the job completes. local_file is overwritten if it exists.

<< remote_file is appended to local_file after the job completes. local_file is created if it does not exist.

><, <> Equivalent to performing the > and then the < operation. The file local_file is copied to remote_file before the job executes, and remote_file is copied back, overwriting local_file, after the job completes. <> is the same as ><

If the submission and execution hosts have different directory structures, you must ensure that the directory where remote_file and local_file will be placed exists. LSF tries to change the directory to the same path name as the directory where the bsub command was run. If this directory does not exist, the job is run in your home directory on the execution host.

You should specify remote_file as a file name with no path when running in non-shared file systems; this places the file in the job’s current working directory on the execution host. This way the job will work correctly even if the directory where the bsub command is run does not exist on the execution host. Be careful not to overwrite an existing file in your home directory.
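Putting the operators together, a job with one input and one output file might be submitted as below. The file and job names are hypothetical, and because bsub needs a live LSF cluster, this sketch only prints the command rather than executing it.

```shell
# data.in  ( > ) is copied to the execution host before the job runs;
# data.out ( < ) is copied back to the submission host afterwards.
# The names carry no path, so the files land in the job's current
# working directory on the execution host.
cmd='bsub -f "data.in > data.in" -f "data.out < data.out" myjob'
echo "$cmd"
```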


Reserving Resources for Jobs

About resource reservation

When a job is dispatched, the system assumes that the resources that the job consumes will be reflected in the load information. However, many jobs do not consume the resources they require when they first start. Instead, they will typically use the resources over a period of time.

For example, a job requiring 100 megabytes of swap is dispatched to a host having 150 megabytes of available swap. The job starts off initially allocating 5 megabytes and gradually increases the amount consumed to 100 megabytes over a period of 30 minutes. During this period, another job requiring more than 50 megabytes of swap should not be started on the same host to avoid over-committing the resource.

You can reserve resources to prevent overcommitment by LSF. Resource reservation requirements can be specified as part of the resource requirements when submitting a job, or can be configured into the queue level resource requirements.

Using the rusage string

To specify resource reservation at the job level, use bsub -R and include the resource usage section in the resource requirement (rusage) string.

For example:

% bsub -R "rusage[tmp=30:duration=30:decay=1]" myjob

will reserve 30 MB of temp space for the job. As the job runs, the amount reserved will decrease at approximately 1 MB/minute such that the reserved amount is 0 after 30 minutes.
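The decay schedule can be checked directly; this plain shell sketch reproduces the reservation implied by tmp=30:duration=30:decay=1 (roughly 1 MB released per minute).

```shell
# reserved(t) ~ 30 MB - 1 MB/min * t, floored at 0 after 30 minutes.
for t in 0 10 20 30; do
  r=$(( 30 - t ))
  [ "$r" -lt 0 ] && r=0
  echo "after ${t} min: ${r} MB of tmp still reserved"
done
```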

How resource reservation works

When deciding whether to schedule a job on a host, LSF considers the reserved resources of jobs that have previously started on that host. For each load index, the amount reserved by all jobs on that host is summed up and subtracted (or added, if the index is increasing) from the current value of the resource as reported by the LIM, to get the amount available for scheduling new jobs:

available amount = current value - reserved amount for all jobs
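Using the swap scenario from earlier in this section (a host with 150 MB of swap free and a job that has reserved 100 MB), the arithmetic works out as in this small shell sketch:

```shell
# available = current value reported by the LIM - total reserved
current=150      # MB of swap currently free on the host
reserved=100     # MB reserved by jobs already started there
available=$(( current - reserved ))
echo "swap available for scheduling new jobs: ${available} MB"
# A pending job reserving more than 50 MB of swap is not sent here.
```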

Viewing host-level resource information

The amount of resources reserved on each host can be viewed through the -l option of the bhosts command. Use bhosts -s to view information about shared resources.

Viewing queue-level resource information

To see the resource usage configured at the queue level, use bqueues -l.


Submitting a Job to Specific Hosts

To indicate that a job must run on one of the specified hosts, use the bsub -m "hostA hostB ..." option.

By specifying a single host, you can force your job to wait until that host is available and then run on that host.

For example:

% bsub -q idle -m "hostA hostD hostB" myjob

This command submits myjob to the idle queue and tells LSF to choose one host from hostA, hostD and hostB to run the job. All other batch scheduling conditions still apply, so the selected host must be eligible to run the job.

Resources and bsub -m

If you have applications that need specific resources, it is more flexible to create a new Boolean resource and configure that resource for the appropriate hosts in the cluster.

This must be done by the LSF administrator. If you specify a host list using the -m option of bsub, you must change the host list every time you add a new host that supports the desired resources. By using a Boolean resource, the LSF administrator can add, move or remove resources without forcing users to learn about changes to resource configuration.


Submitting a Job and Indicating Host Preference

When several hosts can satisfy the resource requirements of a job, the hosts are ordered by load. However, in certain situations it may be desirable to override this behavior to give preference to specific hosts, even if they are more heavily loaded.

For example, you may have licensed software which runs on different groups of hosts, but you prefer it to run on a particular host group because the jobs will finish faster, thereby freeing the software license to be used by other jobs.

Another situation arises in clusters consisting of dedicated batch servers and desktop machines which can also run jobs when no user is logged in. You may prefer to run on the batch servers and only use the desktop machines if no server is available.

To see a list of available hosts, use the bhosts command.

In this section ◆ “Submitting a job with host preference” on page 69

◆ “Submitting a job with different levels of host preference” on page 70

◆ “Submitting a job with resource requirements” on page 70

Submitting a job with host preference

bsub -m

The bsub -m option allows you to indicate preference by using + with an optional preference level after the host name. The keyword others can be used to refer to all the hosts that are not explicitly listed. You must specify others with at least one host name or host group name.

For example:

% bsub -R "solaris && mem>10" -m "hostD+ others" myjob

In this example, LSF selects all solaris hosts that have more than 10 megabytes of memory available. If hostD meets these criteria, it is picked over any other host that otherwise meets the same criteria. If hostD does not meet the criteria, the least loaded host among the others is selected. All the other hosts are considered as a group and are ordered by load.


Queues and host preference

A queue can also define host preferences for jobs. Host preferences specified by bsub -m override the queue specification.

In the queue definition in lsb.queues, use the HOSTS parameter to list the hosts or host groups to which the queue can dispatch jobs.

Use the not operator (~) to exclude hosts or host groups from the list of hosts to which the queue can dispatch jobs. This is useful if you have a large cluster, but only want to exclude a few hosts from the queue definition.

Refer to the Platform LSF Reference Guide for information about the lsb.queues file.

Submitting a job with different levels of host preference

You can indicate different levels of preference by specifying a number after the plus sign (+). The larger the number, the higher the preference for that host or host group. You can also specify the + with the keyword others.

For example:

% bsub -m "groupA+2 groupB+1 groupC" myjob

In this example, LSF gives first preference to hosts in groupA, second preference to hosts in groupB and last preference to those in groupC. Ordering within a group is still determined by load.

You can use the bmgroup command to display configured host groups.

Submitting a job with resource requirements

When you submit a job, you can also exclude a host by specifying a resource requirement using hname resource:

% bsub -R "hname!=hostb && type==sgi6" myjob


Submitting a Job with Start or Termination Times

By default, LSF dispatches jobs as soon as possible, and then allows them to finish, although resource limits might terminate a job before it finishes.

You can specify a time of day at which to start or terminate a job.

Submitting a job with a start time

If you do not want to start your job immediately when you submit it, use bsub -b to specify a start time. LSF will not dispatch the job before this time. For example:

% bsub -b 5:00 myjob

This example submits a job that remains pending until after the local time on the master host reaches 5 a.m.

Submitting a job with a termination time

Use bsub -t to submit a job and specify a time after which the job should be terminated. For example:

% bsub -b 11:12:5:40 -t 11:12:20:30 myjob

The job myjob is submitted to the default queue. It is not dispatched before 5:40 a.m. on November 12, and if it is still running at 8:30 p.m. on November 12, it is killed.
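Both options accept times of the form [[month:]day:]hour:minute, so the month and day fields can be omitted. As a sketch that constrains a job within the current day only (myjob is a placeholder command):

```
% bsub -b 20:00 -t 23:00 myjob
```

This job is not dispatched before 8:00 p.m. and is killed if it is still running at 11:00 p.m., based on the local time on the master host.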


Chapter 4
Viewing Information About Jobs

Use the bjobs and bhist commands to view information about jobs:

◆ bjobs reports the status of jobs; its various options let you display specific information.

◆ bhist reports the history of one or more jobs in the system.

You can also find jobs on specific queues or hosts, find jobs submitted by specific projects, and check the status of specific jobs using their job IDs or names.

Contents

◆ “Viewing Job Status (bjobs)”

◆ “Viewing Job Pend and Suspend Reasons (bjobs -p)”

◆ “Viewing Job Parameters (bjobs -l)”

◆ “Viewing Job Resource Usage (bjobs -l)”

◆ “Viewing Job History (bhist)”

◆ “Viewing Job Output (bpeek)”


Viewing Job Status (bjobs)

The bjobs command has options to display the status of jobs in the LSF system. For more details on these or other bjobs options, see the bjobs command in the Platform LSF Reference Guide.

Unfinished current jobs

The bjobs command reports the status of LSF jobs. When no options are specified, bjobs displays information about jobs in the PEND, RUN, USUSP, PSUSP, and SSUSP states for the current user.

For example:

% bjobs
JOBID USER  STAT  QUEUE    FROM_HOST EXEC_HOST JOB_NAME   SUBMIT_TIME
3926  user1 RUN   priority hostf     hostc     verilog    Oct 22 13:51
605   user1 SSUSP idle     hostq     hostc     Test4      Oct 17 18:07
1480  user1 PEND  priority hostd               generator  Oct 19 18:13
7678  user1 PEND  priority hostd               verilog    Oct 28 13:08
7679  user1 PEND  priority hosta               coreHunter Oct 28 13:12
7680  user1 PEND  priority hostb               myjob      Oct 28 13:17

All jobs

bjobs -a displays the same information as bjobs, plus information about recently finished jobs. Jobs in all states are shown (PEND, RUN, USUSP, PSUSP, SSUSP, DONE, and EXIT).

All your jobs that are still in the system and jobs that have recently finished are displayed.

Running jobs

bjobs -r displays information only for running jobs (RUN state).


Viewing Job Pend and Suspend Reasons (bjobs -p)

When you submit a job, it may be held in the queue before it starts running, and it may be suspended while running. You can find out why jobs are pending or suspended with the bjobs -p option.

You can combine bjobs options to tailor the output. For more details on these or other bjobs options, see the bjobs command in the Platform LSF Reference Guide.

In this section

◆ “bjobs -p command”

◆ “Viewing pending and suspend reasons with host names”

◆ “Viewing suspend reasons only”

bjobs -p command

The -p option of bjobs displays pending jobs together with the reasons each job is pending. Because a job can be pending or suspended for more than one reason, all of its reasons are displayed.

For example:

% bjobs -p
JOBID USER  STAT QUEUE    FROM_HOST JOB_NAME SUBMIT_TIME
7678  user1 PEND priority hostD     verilog  Oct 28 13:08
 Queue's resource requirements not satisfied: 3 hosts;
 Unable to reach slave lsbatch server: 1 host;
 Not enough job slots: 1 host;

The pending reasons also mention the number of hosts for each condition.

You can view reasons why a job is pending or in suspension for all users by combining the -p and -u all options.
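For example, to list the pending and suspension reasons for every user's jobs in one command:

```
% bjobs -p -u all
```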


Viewing pending and suspend reasons with host names

To get specific host names along with pending reasons, use the -p and -l options with the bjobs command.

For example:

% bjobs -lp
Job Id <7678>, User <user1>, Project <default>, Status <PEND>, Queue <priority>, Command <verilog>
Mon Oct 28 13:08:11: Submitted from host <hostD>, CWD <$HOME>, Requested Resources <type==any && swp>35>;

PENDING REASONS:
Queue's resource requirements not satisfied: hostb, hostk, hostv;
Unable to reach slave lsbatch server: hostH;
Not enough job slots: hostF;

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg   io  ls  it  tmp  swp  mem
 loadSched  -    0.7  1.0   -   4.0  -   -   -   -    -    -
 loadStop   -    1.5  2.5   -   8.0  -   -   -   -    -    -

Viewing suspend reasons only

The -s option of bjobs displays reasons for suspended jobs only. For example:

% bjobs -s
JOBID USER  STAT  QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
605   user1 SSUSP idle  hosta     hostc     Test4    Oct 17 18:07
 The host load exceeded the following threshold(s):
 Paging rate: pg;
 Idle time: it;


Viewing Job Parameters (bjobs -l)

The -l option of bjobs displays detailed information about job status and parameters, such as the job's current working directory, parameters specified when the job was submitted, and the time when the job started running. For more details on bjobs options, see the bjobs command in the Platform LSF Reference Guide.

bjobs -l with a job ID displays all the information about a job, including:

◆ Submission parameters

◆ Execution environment

◆ Resource usage

For example:

% bjobs -l 7678
Job Id <7678>, User <user1>, Project <default>, Status <PEND>, Queue <priority>, Command <verilog>
Mon Oct 28 13:08:11: Submitted from host <hostD>, CWD <$HOME>, Requested Resources <type==any && swp>35>;
PENDING REASONS:
Queue's resource requirements not satisfied: 3 hosts;
Unable to reach slave lsbatch server: 1 host;
Not enough job slots: 1 host;

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg   io  ls  it  tmp  swp  mem
 loadSched  -    0.7  1.0   -   4.0  -   -   -   -    -    -
 loadStop   -    1.5  2.5   -   8.0  -   -   -   -    -    -


Viewing Job Resource Usage (bjobs -l)

LSF monitors the resources jobs consume while they are running. The -l option of the bjobs command displays the current resource usage of the job.

For more details on bjobs options, see the bjobs command in the Platform LSF Reference Guide.

Job-level information

Job-level information includes:

◆ Total CPU time consumed by all processes of a job

◆ Total resident memory usage in KB of all currently running processes of a job

◆ Total virtual memory usage in KB of all currently running processes of a job

◆ Currently active process group ID of a job

◆ Currently active processes of a job

Update interval

The job-level resource usage information is updated at a maximum frequency of every SBD_SLEEP_TIME seconds. See the Platform LSF Reference Guide for the value of SBD_SLEEP_TIME.

The update is done only if the value for the CPU time, resident memory usage, or virtual memory usage has changed by more than 10 percent from the previous update or if a new process or process group has been created.
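The 10-percent rule can be illustrated with a small arithmetic check (a simplified model for illustration only, not LSF source code; the sample values and variable names are made up):

```shell
# Previous and current samples of one metric, e.g. CPU time in
# seconds (both values are made up for illustration):
old=100
new=120

# Integer form of "changed by more than 10 percent":
# |new - old| / old > 0.10  is equivalent to  |new - old| * 10 > old
if [ $(( (new - old) * 10 )) -gt "$old" ] || [ $(( (old - new) * 10 )) -gt "$old" ]; then
  decision="report"   # 20% change: usage would be reported
else
  decision="skip"     # change within 10%: no update
fi
echo "$decision"
```

A 20 percent jump in CPU time triggers a report, while a 5 percent drift between samples would not.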


View job resource usage

To view resource usage for a specific job, specify bjobs -l with the job ID:

% bjobs -l 1531
Job Id <1531>, User <user1>, Project <default>, Status <RUN>, Queue <priority>, Command <example 200>
Fri Dec 27 13:04:14: Submitted from host <hostA>, CWD <$HOME>, Specified Hosts <hostD>;
Fri Dec 27 13:04:19: Started on <hostD>, Execution Home </home/user1>, Execution CWD </home/user1>;
Fri Dec 27 13:05:00: Resource usage collected.
                     The CPU time used is 2 seconds.
                     MEM: 147 Kbytes; SWAP: 201 Kbytes
                     PGID: 8920; PIDs: 8920 8921 8922

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
 loadSched  -    -    -     -   -   -   -   -   -    -    -
 loadStop   -    -    -     -   -   -   -   -   -    -    -


Viewing Job History (bhist)

Sometimes you want to know what has happened to your job since it was submitted. The bhist command displays a summary of the pending, suspended, and running time of jobs for the user who invoked the command. Use bhist -u all to display a summary for all users in the cluster.

For more details on bhist options, see the bhist command in the Platform LSF Reference Guide.

In this section

◆ “Viewing detailed job history”

◆ “Viewing history of jobs not listed in active event log”

◆ “Viewing chronological history of jobs”

Viewing detailed job history

The -l option of bhist displays the time information and a complete history of scheduling events for each job.

% bhist -l 1531
JobId <1531>, User <user1>, Project <default>, Command <example200>
Fri Dec 27 13:04:14: Submitted from host <hostA> to Queue <priority>, CWD <$HOME>, Specified Hosts <hostD>;
Fri Dec 27 13:04:19: Dispatched to <hostD>;
Fri Dec 27 13:04:19: Starting (Pid 8920);
Fri Dec 27 13:04:20: Running with execution home </home/user1>, Execution CWD </home/user1>, Execution Pid <8920>;
Fri Dec 27 13:05:49: Suspended by the user or administrator;
Fri Dec 27 13:05:56: Suspended: Waiting for re-scheduling after being resumed by user;
Fri Dec 27 13:05:57: Running;
Fri Dec 27 13:07:52: Done successfully. The CPU time used is 28.3 seconds.

Summary of time in seconds spent in various states by Sat Dec 27 13:07:52 1997
PEND  PSUSP  RUN  USUSP  SSUSP  UNKWN  TOTAL
5     0      205  7      1      0      218


Viewing history of jobs not listed in active event log

LSF periodically backs up and prunes the job history log. By default, bhist only displays job history from the current event log file. You can use bhist -n num_logfiles to display the history for jobs that completed some time ago and are no longer listed in the active event log.

bhist -n num_logfiles

The -n num_logfiles option tells the bhist command to search through the specified number of log files instead of only searching the current log file.

Log files are searched in reverse time order. For example, the command bhist -n 3 searches the current event log file and then the two most recent backup files.

For example:

bhist -n 1   searches the current event log file, lsb.events
bhist -n 2   searches lsb.events and lsb.events.1
bhist -n 3   searches lsb.events, lsb.events.1, and lsb.events.2
bhist -n 0   searches all event log files in LSB_SHAREDIR


Viewing chronological history of jobs

By default, the bhist command displays information from the job event history file, lsb.events, on a per job basis.

bhist -t

The -t option of bhist can be used to display events chronologically instead of grouping all events for each job.

bhist -T

The -T option allows you to select only those events within a given time range.

For example, the following command displays all events that occurred between 14:00 and 14:30 on a given day:

% bhist -t -T 14:00,14:30
Wed Oct 22 14:01:25: Job <1574> done successfully;
Wed Oct 22 14:03:09: Job <1575> submitted from host to Queue, CWD, User, Project, Command, Requested Resources;
Wed Oct 22 14:03:18: Job <1575> dispatched to;
Wed Oct 22 14:03:18: Job <1575> starting (Pid 210);
Wed Oct 22 14:03:18: Job <1575> running with execution home, Execution CWD, Execution Pid <210>;
Wed Oct 22 14:05:06: Job <1577> submitted from host to Queue, CWD, User, Project, Command, Requested Resources;
Wed Oct 22 14:05:11: Job <1577> dispatched to;
Wed Oct 22 14:05:11: Job <1577> starting (Pid 429);
Wed Oct 22 14:05:12: Job <1577> running with execution home, Execution CWD, Execution Pid <429>;
Wed Oct 22 14:08:26: Job <1578> submitted from host to Queue, CWD, User, Project, Command;
Wed Oct 22 14:10:55: Job <1577> done successfully;
Wed Oct 22 14:16:55: Job <1578> exited;
Wed Oct 22 14:17:04: Job <1575> done successfully;


Viewing Job Output (bpeek)

The output from a job is normally not available until the job is finished. However, LSF provides the bpeek command for you to look at the output the job has produced so far.

By default, bpeek shows the output from the most recently submitted job. You can also select the job by queue or execution host, or specify the job ID or job name on the command line.

For more details on bpeek options, see the bpeek command in the Platform LSF Reference Guide.
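As a sketch of the selection options (the queue name, host name, and job name below are placeholders):

```
% bpeek -q priority    # output of your most recent job in the priority queue
% bpeek -m hostA       # output of your most recent job running on hostA
% bpeek -J myjob       # output of the most recent job named myjob
```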

Viewing output of a running job

Only the job owner can use bpeek to see job output. The bpeek command will not work on a job running under a different user account.

To save time, you can use this command to check if your job is behaving as you expected and kill the job if it is running away or producing unusable results.

For example:

% bpeek 1234
<< output from stdout >>
Starting phase 1
Phase 1 done
Calculating new parameters...
