Cluster Resources Training



DESCRIPTION

Cluster Resources Training. Outline: 1. Moab Overview, 2. Deployment, 3. Diagnostics and Troubleshooting, 4. Integration, 5. Scheduling Behaviour, 6. Resource Access, 7. Grid Computing, 8. Accounting, 9. Transitioning from LCRM, 10. End Users.

TRANSCRIPT

Page 1: Cluster Resources Training


Cluster Resources Training

Page 2: Cluster Resources Training


Outline

1. Moab Overview
2. Deployment
3. Diagnostics and Troubleshooting
4. Integration
5. Scheduling Behaviour
6. Resource Access
7. Grid Computing
8. Accounting
9. Transitioning from LCRM
10. End Users

Page 3: Cluster Resources Training


1. Moab Introduction

• Overview of the Modern Cluster
• Cluster Evolution
• Cluster Productivity Losses
• Moab Workload Manager Architecture
• What Moab Does
• What Moab Does Not Do

Page 4: Cluster Resources Training


Cluster Stack / Framework:

(stack diagram, bottom to top: Hardware (Cluster or SMP) → Operating System → Resource Manager → Message Passing → Serial and Parallel Applications → Cluster Workload Manager (Scheduler, Policy Manager, Integration Platform) → Grid Workload Manager (Scheduler, Policy Manager, Integration Platform); Admins and Users reach the stack through Portal, CLI, GUI, and Application interfaces, with Security spanning all layers)

Page 5: Cluster Resources Training


Resource Manager (RM)

• While other systems may have stricter interpretations of a resource manager and its responsibilities, Moab's multi-resource-manager support allows a much more liberal interpretation.
  – In essence, any object which can provide environmental information and environmental control can be utilized as a resource manager.

• Moab is able to aggregate information from multiple unrelated sources into a larger, more complete world view of the cluster, which includes all the information and control found within a standard resource manager such as TORQUE, including:
  – Node management services
  – Job management services
  – Queue management services

Page 6: Cluster Resources Training


The Evolved Cluster

(diagram: Moab at the center, coordinating a Resource Manager, Job Queue, License Manager, Identity Manager, and Allocation Manager; compute nodes connected via Myrinet; Admin and User interfaces; and a peer Moab/Resource Manager at a remote site)

Page 7: Cluster Resources Training


Moab Architecture

Page 8: Cluster Resources Training


What Moab Does

• Optimizes Resource Utilization with Intelligent Scheduling and Advanced Reservations

• Unifies Cluster Management across Varied Resources and Services

• Dynamically Adjusts Workload to Enforce Policies and Service Level Agreements

• Automates Diagnosis and Failure Response

Page 9: Cluster Resources Training


What Moab Does Not Do

• Does not do resource management (usually)

• Does not install the system (usually)

• Not a storage manager

• Not a license manager

• Does not do message passing

Page 10: Cluster Resources Training


2. Deployment

• Installation

• Configuration

• Testing

Page 11: Cluster Resources Training


Moab Workload Manager Installation

> tar -xzvf moab-4.5.0p0.linux.tar.gz

> cd moab-4.5.0

> ./configure

> make

• When you are ready to use Moab in production, you may install it into the install directory you have configured using make install.

• Workload Manager must be running before Cluster Manager and Access Portal will work.

• You only install Moab Workload Manager on the head node.

• You can choose to install client commands on a remote system as well.
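When moving to production, the same build can be pointed at its final home; a minimal sketch, assuming the standard autoconf-style --prefix option (the /opt/moab path is an illustrative choice, not prescribed by the slides):

> ./configure --prefix=/opt/moab   # illustrative install prefix
> make
> make install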

Page 12: Cluster Resources Training


File Locations

• $(MOABHOMEDIR)
  – moab.cfg (general config file containing information required by both the Moab server and user interface clients)
  – moab-private.cfg (config file containing private information required by the Moab server only)
  – .moab.ck (Moab checkpoint file)
  – .moab.pid (Moab 'lock' file to prevent multiple instances)
  – log (directory for Moab log files - REQUIRED BY DEFAULT)
      • moab.log (Moab log file)
      • moab.log.1 (previous 'rolled' Moab log file)
  – stats (directory for Moab statistics files - REQUIRED BY DEFAULT)
      • Moab stats files (in format 'stats.<YYYY>_<MM>_<DD>')
      • Moab fairshare data files (in format 'FS.<EPOCHTIME>')
  – tools (directory for local tools called by Moab - OPTIONAL BY DEFAULT)
  – traces (directory for Moab simulation trace files - REQUIRED FOR SIMULATIONS)
      • resource.trace1 (sample resource trace file)
      • workload.trace1 (sample workload trace file)

Page 13: Cluster Resources Training


  – spool (directory for temporary Moab files - REQUIRED FOR ADVANCED FEATURES)
  – contrib (directory containing contributed code in the areas of GUIs, algorithms, policies, etc.)

• $(MOABINSTDIR)
  – bin (directory for installed Moab executables)
      • moab (Moab scheduler executable)
      • mclient (Moab user interface client executable)

• /etc/moab.cfg (optional file used to override default $(MOABHOMEDIR) settings; it should contain the string 'MOABHOMEDIR $(DIRECTORY)' to override the built-in $(MOABHOMEDIR) setting)

Page 14: Cluster Resources Training


Initial Configuration – moab.cfg

• moab.cfg contains the parameters and settings for Moab Workload Manager. This is where you will set most of the policy settings.

Example of what moab.cfg will look like after installation:

##moab.cfg

SCHEDCFG[Moab] SERVER=test.icluster.org:4255

ADMINCFG[1] USERS=root

RMCFG[base] TYPE=PBS

Page 15: Cluster Resources Training


Supported Platforms/Environments

• Resource Managers
  – TORQUE, OpenPBS, PBSPro, LSF, LoadLeveler, SLURM, BProc, clubMASK, S3, WIKI

• Operating Systems
  – RedHat, SUSE, Fedora, Debian, FreeBSD (+ all known variants of Linux), AIX, IRIX, HP-UX, OS X, OSF/Tru-64, SunOS, Solaris (+ all known variants of UNIX)

• Hardware
  – Intel x86, Intel IA-32, Intel IA-64, AMD x86, AMD Opteron, SGI Altix, HP, IBM SP, IBM x-Series, IBM p-Series, IBM i-Series, Mac G4 and G5

Page 16: Cluster Resources Training


Basic Parameters

• SCHEDCFG
  – Specifies how the Moab server will execute and communicate with client requests.
  – Example: SCHEDCFG[orion] SERVER=cw.psu.edu

• ADMINCFG
  – Moab provides role-based security enabled by way of multiple levels of admin access.
  – Example: the following may be used to enable users greg and thomas as level 1 admins:
      ADMINCFG[1] USERS=greg,thomas
    NOTE: Moab may only be launched by the primary admin user id.

• RMCFG
  – In order for Moab to properly interact with a resource manager, the interface to this resource manager must be defined.
  – Example: to interface to a TORQUE resource manager, the following may be used:
      RMCFG[torque1] TYPE=pbs

Page 17: Cluster Resources Training


Scheduling Modes

• Simulation Mode
  – Allows a test drive of the scheduler. You can evaluate how various policies can improve the current performance of a stable production system.

• Test Mode
  – Allows evaluation of new Moab releases, configurations, and policies in a risk-free manner. In test mode, Moab behaves identically to live or normal mode except for the ability to start, cancel, or modify jobs.

• Normal Mode
  – Live (after installation, automatically set this way)

• Interactive Mode
  – Like test mode, but instead of disabling all resource and job control functions, Moab sends the desired change request to the screen and asks for permission to complete it.

• Configure modes in moab.cfg, as in the sketch below.
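A hedged sketch selecting test mode via the MODE attribute of SCHEDCFG (the same attribute used in the side-by-side example on page 19; the server name and port reuse the installation example):

# moab.cfg -- run this instance in test mode
SCHEDCFG[Moab] MODE=TEST SERVER=test.icluster.org:4255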

Page 18: Cluster Resources Training


Testing New Policies

• Verifying Correct Specification of New Policies
  – If manually editing the moab.cfg file, use the mdiag -C command
  – Moab Cluster Manager automatically verifies proper policy specification

• Verifying Correct Behavior of New Policies
  – Put Moab in INTERACTIVE mode to confirm each change before it is made

• Determining Long Term Impact of New Policies
  – Put Moab in SIMULATION mode

Page 19: Cluster Resources Training


• Moab 'Side-by-Side'
  – Allows a production cluster or other resource to be logically partitioned along resource and workload boundaries, with different instances of Moab scheduling different partitions.

• Use parameters: IGNORENODES, IGNORECLASSES, IGNOREUSERS

##moab.cfg for production partition

SCHEDCFG[prod] MODE=NORMAL SERVER=orion.cxz.com:42020
RMCFG[TORQUE] TYPE=PBS
IGNORENODES node61,node62,node63,node64
IGNOREUSERS gridtest1,gridtest2

##moab.cfg for test partition

SCHEDCFG[test] MODE=TEST SERVER=orion.cxz.com:42021
RMCFG[TORQUE] TYPE=PBS
IGNORENODES !node61,node62,node63,node64
IGNOREUSERS !gridtest1,gridtest2

Page 20: Cluster Resources Training


Simulation

• What is the impact of additional hardware on cluster utilization?
• What delays to key projects can be expected with the addition of new users?
• How will new prioritization weights alter cycle distribution among existing workload?
• What total loss of compute resources will result from introducing a maintenance downtime?
• Are the benefits of cycle stealing from non-dedicated desktop systems worth the effort?
• How much will anticipated grid workload delay the average wait time of local jobs?

Page 21: Cluster Resources Training


Scheduling Iterations

• Update State Information
• Refresh Reservations
• Schedule Reserved Jobs
• Schedule Priority Jobs
• Backfill Jobs
• Update Statistics
• Handle User Requests
• Perform Next Scheduling Cycle

Page 22: Cluster Resources Training


Job Flow

• Determine Basic Job Feasibility

• Prioritize Jobs

• Enforce Configured Throttling Policies

• Determine Resource Availability

• Allocate Resources to Job

• Launch Job

Page 23: Cluster Resources Training


Commands Overview

Command     Description
checkjob    provide detailed status report for specified job
checknode   provide detailed status report for specified node
mcredctl    controls various aspects of the credential objects within Moab
mdiag       provide diagnostic reports for resources, workload, and scheduling
mjobctl     control and modify jobs
mnodectl    control and modify nodes
mrmctl      query and control resource managers
mrsvctl     control and modify reservations
mschedctl   modify scheduler state and behavior
mshow       display various diagnostic messages about the system and job queues
msub        scheduler job submission
resetstats  reset scheduler statistics
showbf      show current resource availability
showq       show queued jobs
showres     show existing reservations
showstart   show estimate of when job can/will start
showstate   show current state of resources
showstats   show usage statistics

Page 24: Cluster Resources Training


End User Commands

Command    Description
canceljob  cancel existing job
checkjob   display job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization
showbf     show resource availability for jobs with specific resource requirements
showq      display detailed prioritized list of active and idle jobs
showstart  show estimated start time of idle jobs
showstats  show detailed usage statistics for users, groups, and accounts to which the end user has access

Page 25: Cluster Resources Training


Scheduling Objects

• Moab functions by manipulating five primary, elementary objects:
  – Jobs
  – Nodes
  – Reservations
  – QoS structures
  – Policies

Page 26: Cluster Resources Training


Jobs

• Job information is provided to the Moab scheduler from a resource manager
  – (such as LoadLeveler, PBS, Wiki, or LSF)

• Job attributes include:
  – Ownership of the job
  – Job state
  – Amount and type of resources required by the job
  – Wallclock limit

• A job consists of one or more requirements, each of which requests a number of resources of a given type.

Page 27: Cluster Resources Training


Nodes

• Within Moab, a node is a collection of resources with a particular set of associated attributes.

• A node is defined as one or more CPUs, together with associated memory, and possibly other compute resources such as local disk, swap, network adapters, software licenses, etc.

Page 28: Cluster Resources Training


Advance Reservations

• An object which dedicates a block of specific resources for a particular use.

• Each reservation consists of a list of resources, an access control list, and a time range for which this access control list will be enforced.

• The reservation prevents the listed resources from being used in a way not described by the access control list during the time range specified.

Page 29: Cluster Resources Training


Resource Managers

• Moab can be configured to manage more than one resource manager simultaneously, even resource managers of different types.

• Moab aggregates information from the RMs to fully manage workload, resources, and cluster policies.

Page 30: Cluster Resources Training


3. Troubleshooting and Diagnostics

• Object Messages
• Diagnostic Commands
• Admin Notification
• Logging
• Tracking System Failures
• Checkpointing
• Debuggers

http://www.clusterresources.com/products/mwm/moabdocs/14.0troubleshootingandsysmaintenance.shtml

Page 31: Cluster Resources Training


Object Messages

• Messages can hold information regarding failures and key events

• Messages possess event time, owner, expiration time, and event count information

• Resource managers and peer services can attach messages to objects

• Admins can attach messages

• Multiple messages per object are supported

• Messages are persistent

http://www.clusterresources.com/products/mwm/moabdocs/commands/mschedctl.shtml

http://www.clusterresources.com/products/mwm/moabdocs/14.3messagebuffer.shtml

Page 32: Cluster Resources Training


Diagnostics

• Moab's diagnostic commands present detailed state information:
  – Diagnose scheduling problems
  – Summarize performance
  – Evaluate current operation, reporting on any unexpected or potentially erroneous conditions
  – Where possible, correct detected problems if desired

Page 33: Cluster Resources Training


mdiag

• Displays object state/health

• Displays object configuration
  – Attributes, resources, policies

• Displays object history and performance

• Displays object failures and messages

http://www.clusterresources.com/products/mwm/moabdocs/commands/mdiag.shtml

Page 34: Cluster Resources Training


mdiag usage

• Most common diagnostics
  – Scheduler (mdiag -S)
  – Jobs (mdiag -j)
  – Nodes (mdiag -n)
  – Resource manager (mdiag -R)
  – Blocked jobs (mdiag -b)
  – Configuration (mdiag -C)

• Other diagnostics
  – Fairshare, Priority
  – Users, Accounts, Classes
  – Reservations, QoS, etc.

http://www.clusterresources.com/products/mwm/moabdocs/commands/mdiag.shtml

Page 35: Cluster Resources Training


mdiag details

• Performs numerous internal health and consistency checks
  – Race conditions, object configuration inconsistencies, possible external failures

• Not just for failures

• Provides status, config, and current performance

• Enables Moab as an information service (via --flags=xml)

Page 36: Cluster Resources Training


Job Troubleshooting

To determine why a particular job will not start, several commands can be helpful:

• checkjob -v
  – checkjob will evaluate the ability of a job to start immediately. Tests include resource access, node state, and job constraints (i.e., startdate, taskspernode, QOS, etc.). Additionally, command line flags may be specified to provide further information:
      -l <POLICYLEVEL>       // evaluate impact of throttling policies on job feasibility
      -n <NODENAME>          // evaluate resource access on specific node
      -r <RESERVATION_LIST>  // evaluate access to specified reservations

• checknode
  – Display detailed status of node

• mdiag -b
  – Display various reasons job is considered 'blocked' or 'non-queued'

• mdiag -j
  – Display high level summary of job attributes and perform sanity check on job attributes/state

• showbf -v
  – Determine general resource availability subject to specified constraints

Page 37: Cluster Resources Training


Other Diagnostics

• checkjob and checknode commands
  – Why a job cannot start
  – Which nodes are available
  – Information regarding the recent events impacting the current job
  – Node state

Page 38: Cluster Resources Training


Issues with Client Commands

• Utilize built-in Moab logging:
    > showq --loglevel=9

• Or check the Moab log files

Page 39: Cluster Resources Training


Logging Facilities

• Moab Log
  – Reports detailed scheduler actions, configuration, events, failures, etc.

• Event Log
  – Reports scheduler, job, node, and reservation events and failures

• Syslog
  – USESYSLOG

• http://www.clusterresources.com/products/mwm/moabdocs/a.fparameters.shtml#eventrecordlist

• http://www.clusterresources.com/products/mwm/moabdocs/14.2logging.shtml

• http://www.clusterresources.com/products/mwm/moabdocs/a.fparameters.shtml#usesyslog

# stats/events.Wed_Aug_24_2005
1124979598 rm base RMUP initialized
1124979598 sched Moab SCHEDSTART -
1124982013 node node017 GEVENT CPU2 Down
1124989457 node node135 GEVENT /var/tmp Full
1124996230 node node139 GEVENT /home Full
1125013524 node node407 GEVENT Transient Power Supply Failure

Page 40: Cluster Resources Training


Logging Basics

• LOGDIR - Indicates directory for log files

• LOGFILE - Indicates path name of log file

• LOGFILEMAXSIZE - Indicates maximum size of log file before rolling

• LOGFILEROLLDEPTH - Indicates maximum number of log files to maintain

• LOGLEVEL - Indicates verbosity of logging
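A hedged moab.cfg sketch combining the five parameters above (all values are illustrative, not defaults):

# moab.cfg -- logging sketch
LOGDIR           log
LOGFILE          moab.log
LOGFILEMAXSIZE   10000000
LOGFILEROLLDEPTH 5
LOGLEVEL         3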

Page 41: Cluster Resources Training


Function Level Information

• In source and debug releases, each subroutine is logged, along with all printable parameters.

##moab.log

MPolicyCheck(orion.322,2,Reason)

Page 42: Cluster Resources Training


Status Information

• Information about internal status is logged at all LOGLEVELs. Critical internal status is indicated at low LOGLEVELs, while less critical and more verbose status information is logged at higher LOGLEVELs.

##moab.log

INFO: job orion.4228 rejected (max user jobs)
INFO: job fr4n01.923.0 rejected (maxjobperuser policy failure)

Page 43: Cluster Resources Training


Scheduler Warnings

• Warnings are logged when the scheduler detects an unexpected value or receives an unexpected result from a system call or subroutine.

##moab.log

WARNING: cannot open fairshare data file '/opt/moab/stats/FS.87000'

Page 44: Cluster Resources Training


Scheduler Alerts

• Alerts are logged when the scheduler detects events of an unexpected nature which may indicate problems in other systems or in objects.

##moab.log

ALERT: job orion.72 cannot run. deferring job for 360 Seconds

Page 45: Cluster Resources Training


Scheduler Errors

• Errors are logged when the scheduler detects problems which impact its ability to properly schedule the cluster.

##moab.log

ERROR: cannot connect to Loadleveler API

Page 46: Cluster Resources Training


Searching Moab Logs

• While major failures will be reported via the mdiag -S command, these failures can also be uncovered by searching the logs with grep, as in the following:

> grep -E "WARNING|ALERT|ERROR" moab.log

Page 47: Cluster Resources Training


Event Logs

• Major events are reported to both the Moab log file and the Moab event log. By default, the event log is maintained in the statistics directory and rolls on a daily basis, using the naming convention:
  – events.WWW_MMM_DD_YYYY (e.g. events.Fri_Aug_19_2005)

##event log format
<EPOCHTIME> <OBJECT> <OBJECTID> <EVENT> <DETAILS>
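Reading the earlier sample against this format, the line '1124982013 node node017 GEVENT CPU2 Down' parses as <EPOCHTIME>=1124982013, <OBJECT>=node, <OBJECTID>=node017, <EVENT>=GEVENT, and <DETAILS>=CPU2 Down.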

Page 48: Cluster Resources Training


Enabling Syslog

• In addition to the log file, the Moab Scheduler can report events it determines to be critical to the UNIX syslog facility via the daemon facility using priorities ranging from INFO to ERROR. 

• The verbosity of this logging is not affected by the LOGLEVEL parameter.  In addition to errors and critical events, user commands that affect the state of the jobs, nodes, or the scheduler may also be logged to syslog.  

• Moab syslog messages are reported using the INFO, NOTICE, and ERR syslog priorities.
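A minimal sketch of enabling this with the USESYSLOG parameter listed under Logging Facilities (the boolean form is shown; the parameter reference linked earlier documents the full syntax):

# moab.cfg -- report critical events to syslog
USESYSLOG TRUE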

Page 49: Cluster Resources Training


Tracking System Failures

• The scheduler has a number of dependencies which may cause failures if not satisfied.

• Disk Space
  – The scheduler utilizes a number of files. If the file system is full or otherwise inaccessible, the following behaviors might be noted:

File               Failure
moab.pid           scheduler cannot perform single instance check
moab.ck*           scheduler cannot store persistent record of reservations, jobs, policies, summary statistics, etc.
moab.cfg/moab.dat  scheduler cannot load local configuration
log/*              scheduler cannot log activities
stats/*            scheduler cannot write job records

Page 50: Cluster Resources Training


Checkpointing

• Moab checkpoints its internal state. The checkpoint file records statistics and attributes for jobs, nodes, reservations, users, groups, classes, and almost every other scheduling object.

• CHECKPOINTEXPIRATIONTIME - Indicates how long unmodified data should be kept after the associated object has disappeared (e.g., job priority for a job no longer detected)
  – FORMAT:  [[[DD:]HH:]MM:]SS
  – EXAMPLE: CHECKPOINTEXPIRATIONTIME 1:00:00:00

• CHECKPOINTFILE - Indicates path name of checkpoint file
  – FORMAT:  <STRING>
  – EXAMPLE: CHECKPOINTFILE /var/adm/moab/moab.ck

• CHECKPOINTINTERVAL - Indicates interval between subsequent checkpoints
  – FORMAT:  [[[DD:]HH:]MM:]SS
  – EXAMPLE: CHECKPOINTINTERVAL 00:15:00

Collected into moab.cfg, the examples above read:

# moab.cfg
CHECKPOINTEXPIRATIONTIME 1:00:00:00
CHECKPOINTFILE           /var/adm/moab/moab.ck
CHECKPOINTINTERVAL       00:15:00

Page 51: Cluster Resources Training


4. Integration

• High Availability

• License Managers

• Identity Managers

• Allocation Managers

• Site Specific Integration (Native RM)

Page 52: Cluster Resources Training


High Availability

• High Availability allows Moab to run on two different machines, a primary and a secondary server.

• While both are running, the secondary (fallback) server continually updates its internal statistics, reservations, and other information to stay synchronized with the primary server.

• Should the primary server stop running, the secondary picks up all responsibilities of the primary server and begins to schedule jobs and track internal data.

• When the primary server comes back online, the secondary hands over its data and resumes functioning as the secondary server.

http://clusterresources.com/moabdocs/22.2ha.shtml

Page 53: Cluster Resources Training


High Availability Example

# moab.cfg on master server
# (duplicate the master's moab.cfg, or share the same file using a shared file system)
SCHEDCFG[colony] SERVER=head1 FBSERVER=head2

# moab-private.cfg on head1 server
CLIENTCFG[colony] KEY=1dfv-fewv443v HOST=head2 AUTH=admin1

# moab-private.cfg on head2 server
CLIENTCFG[colony] KEY=1dfv-fewv443v HOST=head1 AUTH=admin1

http://clusterresources.com/moabdocs/22.2ha.shtml

Page 54: Cluster Resources Training


Enabling High Availability Features

• Moab runs on two machines, a primary and a secondary server
  – The secondary (fallback) server continually updates its internal statistics, reservations, and other information to stay synchronized with the primary server, and takes over scheduling should the primary server fail

Page 55: Cluster Resources Training


Configuring High Availability

moab.cfg

SCHEDCFG[mycluster] SERVER=primaryhostname:3000
SCHEDCFG[mycluster] FBSERVER=secondaryhostname

• Both SERVER and FBSERVER take the format <HOST>[:<PORT>]. It is also necessary to ensure a few configuration settings for correct operation:

• Each server must specify a shared key using the CLIENTCFG parameter in the moab-private.cfg file.

• Each server must be properly configured as an administrator inside the resource manager, using the CLIENTCFG AUTH attribute.

• Each server must be able to properly communicate with the resource manager.

(See the TORQUE/PBS integration guide for a specific example.)

Page 56: Cluster Resources Training


Confirming Configuration

• Run mdiag -R to confirm the fallback Moab is able to communicate with the primary Moab

node40:~/# mdiag -R
RM[rmnode30]      Type: PBS  State: Active
  ResourceType:   COMPUTE
  Version:        '1.2.0p6-snap.1122589577'
  Nodes Reported: 4
  Flags:          executionServer,noTaskOrdering,typeIsExplicit
  Partition:      rmnode30
  Event Management: EPORT=15004
  NOTE: SSS protocol enabled
  Submit Command: /usr/local/bin/qsub
  DefaultClass:   batch
  RM Performance: AvgTime=0.01s MaxTime=1.03s (218 samples)

RM[internal]      Type: SSS  State: Active
  Version:        'SSS2.0'
  Flags:          executionServer,localQueue,typeIsExplicit
  RM Performance: AvgTime=0.00s MaxTime=0.00s (125 samples)

NOTE: use 'mrmctl -f -r ' to clear stats/failures

Page 57: Cluster Resources Training


Confirmation cont.

• Run mdiag -n to confirm the fallback Moab is able to communicate with the primary resource manager.

compute node summary
Name    State  Procs  Memory   Opsys

node31  Idle   1:1    27:27    Linux-2.6
node32  Idle   1:1    27:27    Linux-2.6
node33  Idle   1:1    27:27    Linux-2.6
node34  Idle   1:1    27:27    Linux-2.6
-----   ---    4:4    108:108  -----

Total Nodes: 4 (Active: 0 Idle: 4 Down: 0)

Page 58: Cluster Resources Training


License Management

• Moab supports both node-locked and floating license models, and even allows mixing the two models simultaneously

• Methods for determining license availability
  – Local Consumable Resources
  – Resource Manager Based Consumable Resources
  – Interfacing to an External License Manager

• Requesting Licenses within Jobs

# qsub
> qsub -l nodes=2,software=blast cmdscript.txt

Page 59: Cluster Resources Training


Identity Managers

An identity manager is configured with the IDCFG parameter and allows Moab to exchange information with an external identity management service. As with Moab's resource manager interfaces, this service can be a full commercial package designed for this purpose, or something far simpler by which Moab obtains the needed information from a web service, text file, or database.

# moab.cfg
IDCFG[alloc] SERVER=exec://$TOOLSDIR/idquery.pl

# idquery.pl output
group:financial   fstarget=16.3 alist=project2
group:marketing   fstarget=2.5
group:engineering fstarget=36.7
group:dm          fstarget=42.5

Page 60: Cluster Resources Training


Allocation Management

• Allocation Management Overview
• Gold Capabilities/Features
• Allocation Manager Example

(chart: six allocations, Alloc1 through Alloc6, of 10,000 node-hours each, spanning Jan-99 through Jul-99)

Page 61: Cluster Resources Training


Allocation Management

Gold is an open source allocation system that controls project usage on High Performance Computers.

• What does it do?
  – Pre-allocates resources to projects and users
  – Controls project usage by billing for resource utilization
  – Fine-grained control over who uses what, where, and when
  – Tracks resource utilization
  – Allows for insightful capacity planning
  – Facilitates resource sharing between organizations (Grids)

• Who needs it?
  – Sites with many projects
  – Multi-cluster organizations
  – Multi-organization grids
  – QoS/SLA-based credit management
  – Any credit/economic style environment

http://clusterresources.com/moabdocs/6.4allocationmanagement.shtml

Page 62: Cluster Resources Training


Gold Features

• Enforces long-term usage limits
• Uses reservations to enforce allocations
• Online bank (dynamic charging)
• Journaled account history
• Promotes resource sharing between organizations (Grids)
• Facilitates capacity planning

(chart: projected allocation usage plotted against 100% capacity, from 3 quarters past to 4 quarters ahead)

http://www.emsl.pnl.gov/docs/mscf/gold

Page 63: Cluster Resources Training


Allocation Manager Example

• Tightly integrated with Moab

• Can run in monitor-only mode or can enforce allocations

# moab.cfg
AMCFG[bank] SERVER=gold://gold-server.some.org JOBFAILUREACTION=HOLD
AMCFG[bank] TIMEOUT=15

# moab-private.cfg
CLIENTCFG[AM:bank] KEY=mysecr3t AUTHTYPE=HMAC

http://clusterresources.com/moabdocs/6.4allocationmanagement.shtml

Page 64: Cluster Resources Training


Other Allocation Management Options

• Qbank

• Moab’s native interface

Page 65: Cluster Resources Training


Site Specific Integration with the Native Resource Manager Interface

Everything you've ever wanted to do with Moab: an interface that allows sites to replace or augment their existing resource managers with information from sources such as:

• Arbitrary scripts
• Ganglia
• FlexLM
• MySQL

http://clusterresources.com/moabdocs/13.5nativerm.shtml

http://clusterresources.com/moabdocs/13.7licensemanagement.shtml

(diagram: Moab exchanges node/job information and node/job modify requests with a native resource manager, while PBS continues to provide node availability and job execution)

Page 66: Cluster Resources Training


Native Resource Manager Example

# moab.cfg

# interface w/TORQUE
RMCFG[torque] TYPE=PBS

# interface w/FlexLM
RMCFG[flexLM] TYPE=NATIVE RTYPE=license
RMCFG[flexLM] CLUSTERQUERYURL=exec:///$HOME/tools/license.mon.flexlm.pl

# integrate local node health check script data
RMCFG[local] TYPE=NATIVE
RMCFG[local] CLUSTERQUERYURL=file:///opt/moab/localtools/healthcheck.dat

Page 67: Cluster Resources Training


Utilizing Multiple Resource Managers

• Migrate jobs between resource managers

• Aggregate information into a cohesive node view

#moab.cfg
RESOURCELIST node01,node02...
RMCFG[base] TYPE=PBS
RMCFG[network] TYPE=NATIVE:AGFULL
RMCFG[network] CLUSTERQUERYURL=/tmp/network.sh
RMCFG[fs] TYPE=NATIVE:AGFULL
RMCFG[fs] CLUSTERQUERYURL=/tmp/fs.sh

#sample network script
_RX=`/sbin/ifconfig eth0 | grep "RX by" | cut -d: -f2 | cut -d' ' -f1`; \
_TX=`/sbin/ifconfig eth0 | grep "TX by" | cut -d: -f3 | cut -d' ' -f1`; \
echo `hostname` NETUSAGE=`echo "$_RX + $_TX" | bc`;

Page 68: Cluster Resources Training


5. Scheduling Behaviour

• Job Priority

• Fairshare

• Usage Limits

• Optimizing the Scheduler

Page 69: Cluster Resources Training


Credentials

• Certain job attributes (such as user, group, account, class, and QoS) describe entities the job belongs to and can be used to associate policies with jobs.

• Every job has credentials:
  – Users (the only mandatory credential)
  – Groups (standard Unix group or arbitrary collection of users)
  – Accounts (associated with projects and billing)
  – Class (associated with RM queues)
  – Quality of Service (QoS) (policy overrides, resource access, service targets, charge rates)

• All credentials can have usage limits, fairshare targets, priorities, usage history, and credential access lists / defaults

http://clusterresources.com/moabdocs/3.5credoverview.shtml

Page 70: Cluster Resources Training


Credential Membership

• Membership Examples

# moab.cfg
# user steve can access accounts a14, a7, a2, a6, and a1. If no account is explicitly
# requested, his job will be assigned to account a2
USERCFG[steve] ADEF=a2 ALIST=a14,a7,a2,a6,a1

# moab.cfg
# account omega3 can only be accessed by users johnh, stevek, jenp
ACCOUNTCFG[omega3] MEMBERULIST=johnh,stevek,jenp

# moab.cfg
# controlling QoS access on a per-group basis
GROUPCFG[staff] QLIST=standard,special QDEF=standard

Page 71: Cluster Resources Training


Fairness

• Definition:
  – giving all users equal access to compute resources
  – incorporating historical resource usage, political issues, and job value

• Moab provides a comprehensive and flexible set of tools to address the many and varied fairness management needs.

http://clusterresources.com/moabdocs/6.0managingfairness.shtml

Page 72: Cluster Resources Training


Performance Metrics

• Metrics of Responsiveness
  – Queue Time
      • How long a job has been waiting
  – X Factor
      • Duration-weighted time responsiveness factor
      • Strongest single factor of perceived fairness

• Metrics of Utilization
  – Throughput
      • Jobs per unit time
  – Utilization
      • Percentage of cluster in use

(diagram: XFactor = Y / X, where X is the job's execution time measured from StartTime and Y is the total elapsed time from Submit Time to completion)

http://clusterresources.com/moabdocs/5.1.2priorityfactors.shtml

Page 73: Cluster Resources Training


General Fairness Strategies

• Maximize Scheduler Options -- Do Not Overspecify
• Keep It Simple -- Do Not Address Hypothetical Issues
• Seek To Adjust User Behaviour, Not Limit User Options
• Allow Users to Specify Required Service Level
• Monitor Cluster Performance Regularly
• Tune Policies As Needed

Page 74: Cluster Resources Training


Priority

• 2-tier prioritization structure
• Independent component and subcomponent weights/caps
• Components include service, target, fairshare, resource, usage, job attribute, and credential
• Negative priority jobs may be blocked
• Tuning facility available with mdiag -p

http://clusterresources.com/moabdocs/5.1jobprioritization.shtml

Page 75: Cluster Resources Training


Job Prioritization – Component Overview

• Service
  – Level of service delivered or anticipated
  – Includes queue time, xfactor, bypass, policy violation

• Target
  – Desired service level
  – Provides exponential factor growth
  – Includes target queue time, target xfactor

• Credential
  – Based on credential priorities
  – Includes user, group, account, QoS, and class

http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#cred

http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#service

http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#target

Page 76: Cluster Resources Training


Job Prioritization – Component Overview

• Fairshare
  – Includes user, group, account, QoS, and class fairshare
  – Based on historical resource consumption
  – Includes current usage metrics of jobs per user, procs per user, and processor-seconds per user
  – May allow prioritization with 'cap' fairshare target

• Resource
  – Based on requested resources
  – Includes nodes, processors, memory, swap, disk, and proc-equivalents
  – Includes duration based metrics of walltime and proc-seconds

http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#fairshare

http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#resource

Page 77: Cluster Resources Training


Job Prioritization – Component Overview

• Job Attribute
  – Allows prioritization based on current job state
  – Allows prioritization based on job attributes (i.e., preemptible)
  – Useful in preemption based scheduling

• Usage
  – Based on utilized resources
  – Includes resources utilized, resources remaining, percent remaining
  – Useful in preemption based scheduling

http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#usage

http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#attr

Page 78: Cluster Resources Training


mdiag -p

Page 79: Cluster Resources Training


Sample Priority Usage

• A site wants to do the following:
  – favor jobs in the low, medium, and high QOSs so they will run in QOS order
  – balance job expansion factor
  – use job queue time to prevent jobs from starving

• The sample moab.cfg is listed below:

# moab.cfg
QOSWEIGHT 1
QUEUETIMEWEIGHT 10

QOSCFG[low]    PRIORITY=1000
QOSCFG[medium] PRIORITY=10000
QOSCFG[high]   PRIORITY=100000
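A worked comparison under these weights, assuming the common convention that the queue time subcomponent is credited in minutes:

# priority sketch for two jobs that have each waited 60 minutes
high-QOS job:   1*100000 + 10*60 = 100600
medium-QOS job: 1*10000  + 10*60 = 10600

QOS ordering therefore dominates unless queue times differ by many thousands of minutes, which is exactly the anti-starvation safety valve the site wanted.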

Page 80: Cluster Resources Training


Credential Priority Example

# moab.cfg

# Service Priority Factors

SERVWEIGHT 1

XFACTORWEIGHT 10

QUEUETIMEWEIGHT 1000

# Credential Priority Factors

CREDWEIGHT 1

USERWEIGHT 1

CLASSWEIGHT 2

USERCFG[john] PRIORITY=200

CLASSCFG[batch] PRIORITY=15

CLASSCFG[debug] PRIORITY=100 XFWEIGHT=100

ACCOUNTCFG[bottomfeeder] PRIORITY=-5000 QTWEIGHT=1 XFWEIGHT=0
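To see the two tiers interact, consider a job from user john in class batch, assuming the standard two-tier structure in which each component weight multiplies the sum of its weighted subcomponents:

# credential component sketch
cred-priority = CREDWEIGHT * (USERWEIGHT*200 + CLASSWEIGHT*15)
              = 1 * (1*200 + 2*15) = 230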

Page 81: Cluster Resources Training


Priority Caps

• It is also possible to limit the priority contribution of a particular priority factor

#moab.cfg
XFACTORCAP   1000
QUEUETIMECAP 1000
QOSCAP       10000

Page 82: Cluster Resources Training


Manual Job Priority Adjustment

Sometimes you need to...

• Run an admin test job as soon as possible
• Pacify a disserviced user

Use the setspri command:

• setspri [-r] <priority> <jobid>

Example: setspri 1 PBS.1234.0

Page 83: Cluster Resources Training


Usage Limits/Throttling

• Usage Limits

• Override Limits

• Idle Job Limits

• System Job Limits

• Hard and Soft Limits

http://clusterresources.com/moabdocs/6.2throttlingpolicies.shtml

Page 84: Cluster Resources Training


Usage Limit Example

# moab.cfg
USERCFG[steve]    MAXJOB=2 MAXNODE=30
GROUPCFG[staff]   MAXJOB=5
CLASSCFG[DEFAULT] MAXNODE=16
CLASSCFG[batch]   MAXNODE=32

# moab.cfg
# allow class batch to run up to 3 simultaneous jobs and
# allow any user to use up to 8 total nodes within class batch
CLASSCFG[batch] MAXJOB=3 MAXNODE[USER]=8

# allow users steve and bob to use up to 3 and 4 total processors respectively within class fast
CLASSCFG[fast] MAXPROC[USER:steve]=3 MAXPROC[USER:bob]=4

Page 85: Cluster Resources Training


Override Limits

• Supersede the limits of other credentials, effectively causing all other limits of the same type (i.e., MAXJOB) to be ignored.

• Precede the limit specification with the capital letter 'O'.

# moab.cfg
USERCFG[steve]    MAXJOB=2 MAXNODE=30
GROUPCFG[staff]   MAXJOB=5
CLASSCFG[DEFAULT] MAXNODE=16
CLASSCFG[batch]   MAXNODE=32
QOSCFG[hiprio]    OMAXJOB=3 OMAXNODE=64

Page 86: Cluster Resources Training


Idle Job Limits

• Limit the jobs that are currently eligible for scheduling.
• Jobs that do not qualify as eligible do not accumulate priority in the queue.
• Often used to prevent queue stuffing.

# moab.cfg
USERCFG[steve]    MAXIJOB=2 MAXINODE=30
GROUPCFG[staff]   MAXIJOB=5
CLASSCFG[DEFAULT] MAXINODE=16
CLASSCFG[batch]   MAXINODE=32
QOSCFG[hiprio]    MAXIJOB=3 MAXINODE=64

Page 87: Cluster Resources Training


System Job Limits

Limit              Parameter                  Description
duration           SYSTEMMAXJOBWALLTIME       limits the maximum requested wallclock time per job
processors         SYSTEMMAXPROCPERJOB        limits the maximum requested processors per job
processor-seconds  SYSTEMMAXPROCSECONDPERJOB  limits the maximum requested processor-seconds per job

Page 88: Cluster Resources Training


Hard and Soft Limits

• Balance both fairness and utilization

#moab.cfg
USERCFG[steve]    MAXJOB=2,4 MAXNODE=15,30
GROUPCFG[staff]   MAXJOB=2,5
CLASSCFG[DEFAULT] MAXNODE=16,32
CLASSCFG[batch]   MAXNODE=12,32
QOSCFG[hiprio]    MAXJOB=3,5 MAXNODE=32,64
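Each pair reads soft,hard: taking the first line, user steve would ordinarily be held to 2 jobs but may run up to 4 when resources would otherwise sit idle (assuming the usual Moab soft/hard limit semantics).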

Page 89: Cluster Resources Training


Fairshare

• Fairshare scheduling helps steer a system toward usage targets by adjusting job priorities based on short term historical usage. 

• Moab's fairshare can target usage percentages, ceilings, floors or caps for users, groups, accounts, classes, and QOS levels.

http://clusterresources.com/moabdocs/6.3fairshare.shtml

Page 90: Cluster Resources Training


Fairshare Example

# moab.cfg
FSINTERVAL 12:00:00
FSDEPTH    4
FSDECAY    0.5
FSPOLICY   DEDICATEDPS

# all users should have a fs target of 10%
USERCFG[DEFAULT] FSTARGET=10.0

# user john gets extra cycles
USERCFG[john] FSTARGET=20.0

# reduce staff priority if group usage exceeds 15%
GROUPCFG[staff] FSTARGET=15.0-

# give group orion additional priority if usage drops below 25.7%
GROUPCFG[orion] FSTARGET=25.7+

FSUSERWEIGHT  10
FSGROUPWEIGHT 100

http://clusterresources.com/moabdocs/6.3fairshare.shtml

(diagram: effective fairshare usage = sum of per-interval consumption metrics multiplied by decay-factor weightings, which shrink for intervals further in the past; depth = number of intervals, each of length FSINTERVAL)
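With the example settings above (FSDEPTH 4, FSDECAY 0.5), the effective usage works out to:

# effective fairshare usage sketch (U0 = current interval, U1..U3 = prior intervals)
effective_usage = U0 + 0.5*U1 + 0.25*U2 + 0.125*U3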

Page 91: Cluster Resources Training


Fairshare stats

• Provide credential-based usage distributions over time
• mdiag -f
• Maintained for all credentials
• Stored in stats/FS.${epochtime}
• Show detailed time-distribution usage by fairshare metric

Page 92: Cluster Resources Training


Optimization

Optimization is maximizing performance while fully addressing all mission objectives. True optimization includes aspects of policy selection, increased availability, user training, and other factors.

• Identifying Policy Bottlenecks
• Identifying Resource Fragmentation
• Preemption
• Malleable/Dynamic Jobs
• Backfill

Page 93: Cluster Resources Training


Productivity Losses

• Environmental Losses
  – File system failures, network failures, hardware failures

• Political Losses
  – Underutilization of resources due to overly strict political access constraints

• Intra-job Inefficiencies
  – Poorly designed jobs, poorly functioning jobs, heterogeneous resources allocated

• Partitioning Losses
  – Underutilization of resources due to physical access limits

• Hardware Failures
  – Job loss and delay due to node, network, and other infrastructure failures

• Scheduling Inefficiencies
  – Managing complex site policies, "keeping everybody happy", scheduling jobs where they will finish faster rather than where they will start sooner

• Middleware Failures / Overhead
  – Licensing, network applications, resource managers

Moab based systems consistently achieve 90-99% utilization and objective-based resource delivery guarantees.

Page 94: Cluster Resources Training


Identifying Policy Bottlenecks

• Most optimization is enabled by default

• Sources of Bottlenecks
  – Usage limits, fairshare caps
  – Eval steps
      • Verify priority (are the most important jobs getting access to resources first?)

http://clusterresources.com/moabdocs/commands/mdiag-priority.shtml

Page 95: Cluster Resources Training


Identifying Policy Bottlenecks (contd)

• Sources of Bottlenecks (contd)
  – Eval steps (contd)
      • Check job blockage
      • Adjust limits, caps, priority as needed
      • If needed, use simulation to determine performance impact of changes

http://clusterresources.com/moabdocs/commands/mdiag-queues.shtml

Page 96: Cluster Resources Training


Identifying Resource Fragmentation

• Fragmentation based on queues, reservations, partitions, os's, architectures, etc.

• Recommend changes, use node sets, soften reservations, time-based reservations, etc.

• User training to eliminate user specified fragmentation

Page 97: Cluster Resources Training


Preemption

• Conflict between high utilization for the cluster and guarantees for important jobs

• Preemption allows the scheduler to 'retract' some scheduling decisions to address newly submitted workload

• QoS-based preemption allows the scheduler to enable preemption only if targets cannot be satisfied in other ways

http://www.clusterresources.com/products/mwm/docs/8.4preemption.shtml

Page 98: Cluster Resources Training


Malleable/Dynamic Jobs

• Moab adjusts jobs to utilize available resources and fill holes

• Moab adjusts both job size and job duration

• Only supported with resource managers which support dynamic job modification (i.e. TORQUE) or with msub

http://www.clusterresources.com/products/mwm/docs/22.4dynamicjobs.shtml

Page 99: Cluster Resources Training


Backfill

• Allows the scheduler to make better use of available resources by running jobs out of order

• Prioritizes the jobs in the queue according to a number of factors and then orders the jobs into a highest priority first (or priority FIFO) sorted list

Page 100: Cluster Resources Training


6. Resource Access

• Admin Reservations

• Standing Reservations

• Nodesets

• Node Access Policies

• Partitions

Page 101: Cluster Resources Training


Advance Reservations

An advance reservation is the mechanism by which Moab guarantees the availability of a set of resources at a particular time.

All reservations require three items:

• Resources
  – Under Moab, the resources specified for a reservation are described by way of a task description.
  – Task: an atomic, or indivisible, collection of resources (processors, memory, swap, local disk, etc.)

• Timeframe
  – a start time and an end time

• Access control list
  – which jobs can use a reservation (users, groups, accounts, classes, QOS, and job duration)

Page 102: Cluster Resources Training


Reservation Management Commands

Command     Description
mdiag -r    display summarized reservation information and any unexpected state
mrsvctl     reservation control
mrsvctl -r  remove reservations
mrsvctl -c  create an administrative reservation
showres     display information regarding location and state of reservations

Page 103: Cluster Resources Training


Reservation Mapping

Job X, which meets access criteria for both reservation A and B, allocates a portion of its resources from each reservation and the remainder from resources outside of both reservations.

Page 104: Cluster Resources Training


Advance Reservations

• Job Reservations

• Admin Reservations
  – generally created to address non-periodic, 'one time' issues

• Standing Reservations
  – provide a mechanism by which a site can dedicate a particular block of resources for a special use on a regular daily or weekly basis

• Personal User Reservations
  – created by the end user

http://clusterresources.com/moabdocs/7.1.5managingreservations.shtml

http://clusterresources.com/moabdocs/7.1.6userreservations.shtml

Page 105: Cluster Resources Training


Job Reservation Policies

• Reasons to increase RESERVATIONDEPTH
  – the estimated job start time information provided by the showstart command is heavily used and its accuracy needs to be increased
  – priority dilution is preventing certain key mission objectives from being fulfilled
  – users are more interested in knowing when their job will run than in having it run sooner

• Reasons to decrease RESERVATIONDEPTH
  – scheduling efficiency and job throughput need to be increased

##moab.cfg
RESERVATIONDEPTH[bigmem]   4
RESERVATIONQOSLIST[bigmem] special,fast,joshua

Page 106: Cluster Resources Training


Administrative Reservations

• Created using:
  – the mrsvctl -c (or setres) command

• Persistent until they expire or are removed using:
  – the mrsvctl -r (or releaseres) command

Page 107: Cluster Resources Training


Administrative Reservations

• Examples
  – Created using 'mrsvctl'

#reserve nodes node01 and node02 for administrative updates
> mrsvctl -c -h node01,node02 -d 1:00:00 -s +1:00:00

#reserve 6 tasks for project acme (only jobs in account acme can run in reservation)
> mrsvctl -c -t 6 -a account==acme

Page 108: Cluster Resources Training


Annotating Admin Reservations

• Label and annotate reservations using comments, allowing other admins, local users, portals, and other services to obtain more detailed information on the reservations.
  – Use the '-n' and '-D' options of the mrsvctl command

mrsvctl -c example:
> mrsvctl -c -D 'testing infiniband performance' -n nettest -h 'r:agt[15-245]'

Page 109: Cluster Resources Training


Using Reservation Profiles

• Reservation profiles can be set up and utilized to prevent repetition of standard reservation attributes.
  – Specify reservation names, descriptions, ACLs, durations, hostlists, triggers, flags, and other aspects which are commonly used.

moab.cfg:
RSVPROFILE[mtn1] TRIGGER=AType=exec,Action="/tmp/trigger1.sh",EType=start
RSVPROFILE[mtn1] USERLIST=steve,marym
RSVPROFILE[mtn1] HOSTEXP="r:50-250"

mrsvctl -c:
> mrsvctl -c -P mtn1 -s 12:00:00_10/03 -d 2:00:00

Page 110: Cluster Resources Training


System Reservations

• It is easy to reserve the entire cluster, or only sections of it:

> mrsvctl -c -t ALL
> mrsvctl -c -t ALL -s +1:00:00 -g staff
> mrsvctl -c -h node0[0-9][0-9] -d 24:00:00

• Can easily be scripted to roll out updates across the entire cluster at specific times, ensuring that no workload will be interrupted:

> mrsvctl -c -h node[0-9][0-9][0-9] -T Action="/tmp/update.pl \
  $HOSTLIST",atype=exec,etype=start -s 23:50:00_6/15 -d 15:00

Page 111: Cluster Resources Training


Optimizing Maintenance Reservations

• Configure the reservation to reduce its effective reservation shadow by allowing overlap with checkpointable/preemptible jobs up until the time the reservation becomes active. Triggers then:
  – Modify the reservation to disable preemption access
  – Preempt jobs which may overlap the reservation
  – Cancel any jobs which failed to properly checkpoint and exit

##moab.cfg
RSVPROFILE[adm1] JOBATTRLIST=PREEMPTIBLE
RSVPROFILE[adm1] DESCRIPTION="regular system maintenance"
RSVPROFILE[adm1] TRIGGER=EType=start,Offset=-300,AType=modify,Action="acl-=jattr=PREEMPTIBLE"
RSVPROFILE[adm1] TRIGGER=EType=start,Offset=-240,AType=jobpreempt,Action="checkpoint"
RSVPROFILE[adm1] TRIGGER=EType=start,Offset=-60,AType=jobpreempt,Action="cancel"

> mrsvctl -c -P adm1 -s 12:00:00_10/03 -d 8:00:00 -h ALL

Page 112: Cluster Resources Training


Standing Reservations

http://clusterresources.com/moabdocs/7.1.5managingreservations.shtml

• Standing Reservations provide a mechanism by which a site can dedicate a particular block of resources for a special use on a regular daily or weekly basis. For example, nodes 1-4 could be dedicated to running jobs only from users in the accounting group every Friday from 4 to 10 PM.

# moab.cfg
SRCFG[fast] PERIOD=DAY STARTTIME=16:00:00 ENDTIME=22:00:00
SRCFG[fast] HOSTLIST=node0[1-4]$
SRCFG[fast] GROUPLIST=accounting

Page 113: Cluster Resources Training


Standing Reservation Example

The following reservation (known as a shortpool) only allows jobs to enter it that are guaranteed to complete within an hour. It floats around, incorporating nodes with jobs which will free up within the MAXTIME timeframe. This ensures there will be resources available for quick turnaround work.

# moab.cfg
SRCFG[shortpool] OWNER=ACCOUNT:jupiter
SRCFG[shortpool] FLAGS=SPACEFLEX
SRCFG[shortpool] MAXTIME=1:00:00
SRCFG[shortpool] TASKCOUNT=16
SRCFG[shortpool] STARTTIME=9:00:00
SRCFG[shortpool] ENDTIME=17:00:00
SRCFG[shortpool] DAYS=Mon,Tue,Wed,Thu,Fri

http://clusterresources.com/moabdocs/7.1.5managingreservations.shtml

Page 114: Cluster Resources Training


Rollback Reservations

• ROLLBACKOFFSET specifies the minimum time in the future at which the reservation may start. This offset is rolling, meaning the start time of the reservation will continuously roll back into the future so as to maintain this offset.

• Rollback offsets are a good way of providing guaranteed resource access to users under the condition that they must commit their resources in the future or lose dedicated access.

http://clusterresources.com/moabdocs/7.1.5managingreservations.shtml

http://clusterresources.com/moabdocs/7.1.5managingreservations.shtml#ROLLBACKOFFSET

(diagram: a rolling reservation maintaining a fixed time offset ahead of now, with jobs filling nodes up to the reservation boundary)

Page 115: Cluster Resources Training


Rollback Reservations Example

• The standing reservation will guarantee access to up to 32 processors within 24 hours to jobs from the ajax account

# moab.cfg
SRCFG[ajax_rsv] ROLLBACKOFFSET=24:00:00 TASKCOUNT=32
SRCFG[ajax_rsv] PERIOD=INFINITY ACCOUNTLIST=ajax

Page 116: Cluster Resources Training


End User Reservations

• Enabling Personal Reservation Management
  – enabled on a per-QOS basis by setting the ENABLEUSERRSV flag, as in the example below
  – a non-admin user wishing to create a reservation must ALWAYS specify an accountable QOS with the mrsvctl -S flag

##moab.cfg
QOSCFG[titan] QFLAGS=ENABLEUSERRSV  # allow 'titan' QOS jobs to create user reservations
USERCFG[DEFAULT] QDEF=titan         # allow all users to access 'titan' QOS

> mrsvctl -c -S AQOS=titan -h node01 -d 1:00:00 -s 1:30:00
NOTE: reservation test.126 created

Page 117: Cluster Resources Training


Example: Allow all Users in Engineering Group to Create Personal Reservations

#moab.cfg
QOSCFG[rsv] QFLAGS=ENABLEUSERRSV  # allow 'rsv' QOS jobs to create user reservations
GROUPCFG[eng] QDEF=rsv            # allow all users in group eng to access 'rsv' QOS

Example: Allow Specific Users to Create Personal Reservations

#moab.cfg
# special qos has higher job priority and ability to create user reservations
QOSCFG[special] QFLAGS=ENABLEUSERRSV
QOSCFG[special] PRIORITY=1000

# allow betty and steve to use the special qos
USERCFG[betty] QDEF=special
USERCFG[steve] QLIST=fast,special,basic QDEF=special

Page 118: Cluster Resources Training


Reservation Limits

Limit              Description
RMAXDURATION       limits the duration (in seconds) of any single personal reservation
RMAXPROC           limits the size (in processors) of any single personal reservation
RMAXPS             limits the size (in processor-seconds) of any single personal reservation
RMAXCOUNT          limits the total number of personal reservations a credential may have active at any given moment
RMAXTOTALDURATION  limits the total duration of personal reservations a credential may have active at any given moment
RMAXTOTALPROC      limits the total number of processors a credential may have reserved at any given moment
RMAXTOTALPS        limits the total number of processor-seconds a credential may have reserved at any given moment

Page 119: Cluster Resources Training


Node Sets

• Allow a job to request a set of common resources without specifying exactly what resources are required.

• Node set policy can be specified globally or on a per-job basis and can be based on node processor speed, memory, network interfaces, or locally defined node attributes.

• These policies may also be used to guide jobs to one or more types of nodes on which a particular job performs best.

# moab.cfg
NODESETATTRIBUTE  FEATURE
NODESETLIST       switchA,switchB,switchC,switchD
NODESETPOLICY     ONEOF
NODESETISOPTIONAL FALSE

CLASSCFG[amd] DEFAULT.NODESET=ONEOF:FEATURE:ATHLON,OPTERON

http://clusterresources.com/moabdocs/8.3nodesetoverview.shtml


Node Access Policy (SMP Issue)

• Shared vs Dedicated Node Access
– SHARED – Moab will allow tasks of other jobs to use the resources
– SINGLEJOB – one job only; multiple tasks possible
– SINGLETASK – one task only
– SINGLEUSER – allows multiple jobs from the same user
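As a sketch, the cluster-wide default is set with the NODEACCESSPOLICY parameter; a per-job request is shown as well, treating the exact per-job extension syntax and script name as illustrative:

# moab.cfg
NODEACCESSPOLICY SINGLEJOB   # by default, dedicate each allocated node to a single job

> msub -l naccesspolicy=singleuser job.cmd   # per-job request (illustrative syntax)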


Node Allocation

• Allows a site to specify how available resources should be allocated to each job (NODEALLOCATIONPOLICY)
• Heterogeneous resources (resources which vary from node to node in terms of quantity or quality)
• Shared nodes (nodes may be utilized by more than one job)
• Reservations or service guarantees
• Non-flat networks (networks in which a perceptible performance degradation may exist depending on workload placement)


Resource Based Algorithms

• CPULOAD
• FIRSTAVAILABLE
• LASTAVAILABLE
• PRIORITY
• MINRESOURCE
• CONTIGUOUS
• MAXBALANCE
• FASTEST
• LOCAL
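A minimal sketch selecting one of the algorithms above as the cluster-wide allocation policy:

# moab.cfg
NODEALLOCATIONPOLICY MINRESOURCE   # allocate the nodes with the fewest resources that still satisfy the job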


Other Algorithms

• Time Based Algorithms
– Large backlog
– Large number of system or standing reservations
– Heavy use of backfill
• Locally Defined Algorithms
• Specifying Per-Job Resource Preferences


Resource Provisioning

• Selects a resource to modify when no available resources meet the needs of the current requests
• Configure an interface to a provisioning manager
– SystemImager
– Xen


Partitions

• Divide resources along resource and political boundaries
• Avoid partitions when possible (use node sets or reservations instead)
• Cannot span resource managers
• Jobs can span partitions with the COALLOC flag
• Partition access can be managed on a credential basis

http://clusterresources.com/moabdocs/7.2partitions.shtml


Defining Partitions

• NODECFG

# moab.cfg
NODECFG[node001] PARTITION=astronomy
NODECFG[node002] PARTITION=astronomy
...
NODECFG[node049] PARTITION=math
RMCFG[base] TYPE=PBS


Managing Partition Access

• Use the *CFG parameters with the PLIST and PDEF keywords
– e.g. USERCFG, ACCOUNTCFG

# moab.cfg
SYSCFG[base]     PLIST=
USERCFG[DEFAULT] PLIST=general
USERCFG[steve]   PLIST=general:test PDEF=test
GROUPCFG[staff]  PLIST=general:test PDEF=general
GROUPCFG[mgmt]   PLIST=general:test PDEF=general


Partitions cont.

• Requesting Partitions
– Add -l partition=test to the qsub command line (for TORQUE)
– Select a partition in Moab Cluster Manager or Moab Access Portal

• Special jobs may be allowed to span the resources of multiple partitions by associating the job with a QOS which has the 'COALLOC' flag set, as in the sketch below.
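A minimal sketch of both mechanisms; the QOS name and job script are illustrative:

# moab.cfg (hypothetical QOS whose jobs may span partitions)
QOSCFG[span] QFLAGS=COALLOC

> qsub -l partition=test job.cmd   # request the 'test' partition (TORQUE)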


7. Peer to Peer (Grids)

• Cluster Stack / Framework

• Moab P2P Grid

• Peer Configuration

• Resource Control Overview

• Data Management

• Security

http://clusterresources.com/moabdocs/17.0peertopeer.shtml


Cluster Stack / Framework:

[Figure: the cluster stack — hardware (cluster or SMP), operating system, resource manager, cluster workload manager (scheduler, policy manager, integration platform), and grid workload manager (scheduler, policy manager, integration platform), supporting serial and parallel applications with message passing; admins and users access the stack through portal, CLI, GUI, and application interfaces, with security spanning all layers]


Grid Types

A "Local Area Grid" (LAG) uses one instance of Moab within an environment that shares a user and data space across multiple clusters, which may or may not have multiple hardware types, operating systems, and compute resource managers (e.g. LoadLeveler, TORQUE, LSF, PBS Pro, etc.).

A "Wide Area Grid" (WAG) uses multiple Moab instances working together within an environment that can have multiple user and data spaces across multiple clusters, which may or may not have multiple hardware types, operating systems, and compute resource managers (e.g. LoadLeveler, TORQUE, LSF, PBS Pro, etc.). Wide area grid management rules can be centralized, locally controlled, or mixed.

[Figure: a LAG — one Moab managing ClusterA, ClusterB, and ClusterC within a shared user space and shared data space; a WAG — a master Moab coordinating per-cluster Moab instances across multiple user and data spaces]

Grid Management Scenarios

[Figure: three WAG management scenarios — centralized management (a grid head node Moab holds all grid rules for ClusterA/B/C), centralized & local management (the grid head node holds shared grid rules while each cluster's Moab applies local grid rules), and local "peer to peer" management (each cluster's Moab holds only its own local grid rules)]


Grid Benefits

• Scalability

• Resource Access

• Load-Balancing

• Single System Image (SSI)

• High Availability


Drawbacks of Layered Approach

• Stability
– Additional failure layer
– Centralized grid management (single point of failure)
• Optimization
– Limited local information and control
• Admin Experience
– Additional tool to learn/configure
– Policy duplication and conflicts
– Additional tool to manage/troubleshoot
• User Experience
– Additional submission language/environment
– Additional tool to track and manage workload

http://clusterresources.com/moabdocs/17.12p2pgrid.shtml


Moab P2P Approach

• Little to no user training
• Little to no admin training
• Single policy set
• Transparent grid

http://clusterresources.com/moabdocs/17.0peertopeer.shtml


Integrated Moab P2P/Grid Capabilities

• Distributed Resource Management

• Distributed Job Management

• Grid Information Management
– Resource and Job Views

• Credential Management and Mapping

• Distributed Accounting

• Data Management


Grid Relationship Combinations

[Figure: a hosting site's Moab linked to (1) a LAG of ClusterA/B/C with shared user and data space, (2) a centrally managed WAG of ClusterD/E behind a grid head node with shared grid rules, and (3) a peer-to-peer WAG of ClusterF/G/H, each cluster applying its own local grid rules across multiple user and data spaces]

Moab is able to facilitate virtually any grid relationship:
1. Join local area grids into wide area grids
2. Join wide area grids to other wide area grids (whether they are managed centrally, locally – "peer to peer" – or mixed)
3. Resource sharing can run in one direction, for use with hosting centers or to bill out resources to other sites
4. Have multiple levels of grid relationships (e.g. conglomerates within conglomerates within conglomerates)


Basic P2P Example

# moab.cfg for Cluster A
SCHEDCFG[ClusterA]
RMCFG[ClusterB] TYPE=MOAB SERVER=node03:41000
RMCFG[ClusterB.INBOUND] FLAGS=CLIENT CLIENT=ClusterB

# moab.cfg for Cluster B
SCHEDCFG[ClusterB]
RMCFG[ClusterA] TYPE=MOAB SERVER=node01:41000
RMCFG[ClusterA.INBOUND] FLAGS=CLIENT CLIENT=ClusterA

# moab-private.cfg for Cluster A
CLIENTCFG[RM:ClusterB] KEY=fet$wl02 AUTH=admin1

# moab-private.cfg for Cluster B
CLIENTCFG[RM:ClusterA] KEY=fet$wl02 AUTH=admin1


Peer Configuration

• Resource Reporting
• Credential Config
• Data Config
• Usage Limits
• Bi-Directional Job Flow

# moab.cfg (server 1)
SCHEDCFG[server1] SERVER=server1.omc.com:42005 MODE=NORMAL
RMCFG[server2-out] TYPE=MOAB SERVER=server2.omc.com:42005 CLIENT=server2
RMCFG[server2-in] FLAGS=client CLIENT=server2

## moab-private.cfg (server 1)
CLIENTCFG[server2] KEY=443db-writ4


Jobs

• Submitting Jobs to the Grid
– msub – accepts the resource manager's submission language, so existing job scripts translate directly to msub

• Viewing Node and Job Information
– Each destination Moab server reports all compute nodes it finds back to the source Moab server
– Nodes show up as local nodes, each within a partition associated with the resource manager reporting them
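A brief usage sketch, with illustrative resource amounts and script name:

> msub -l nodes=4,walltime=2:00:00 batch.cmd   # submit through Moab to the grid
> showq                                        # remote nodes and jobs appear within per-RM partitions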


Resource Control Overview

• Full resource information
– nodes appear with complete remote hostnames and full attribute information
• Remapped resource information
– nodes appear with remapped local hostnames and full attribute information
• Grid mode
– information regarding nodes reported from a remote peer is aggregated and transformed into one or more SMP-like large pseudo-nodes


Controlling Resource Information

• Direct
– nodes are reported to remote clusters exactly as they appear in the local cluster
• Mapped
– nodes are reported as individual nodes, but node names are mapped to a unique name when imported into the remote cluster
• Grid
– node information is aggregated into a single large SMP-like pseudo-node before it is reported to the remote cluster


Grid Sandbox

• Constrains external resource access and limits which resources are reported to other peers

## moab.cfg
SRCFG[sandbox1] PERIOD=INFINITY HOSTLIST=node01,node02,node03
SRCFG[sandbox1] CLUSTERLIST=ALL FLAGS=ALLOWGRID


Access Controls

• Granting Access to Local Jobs
• Peer Access Control

## moab.cfg
SRCFG[sandbox2] PERIOD=INFINITY HOSTLIST=node04,node05,node06
SRCFG[sandbox2] FLAGS=ALLOWGRID QOSLIST=high GROUPLIST=engineer

## moab.cfg (Cluster 1)
SRCFG[sandbox1] PERIOD=INFINITY HOSTLIST=node01,node02,node03,node04,node05
SRCFG[sandbox1] FLAGS=ALLOWGRID CLUSTERLIST=ClusterB
SRCFG[sandbox2] PERIOD=INFINITY HOSTLIST=node6 FLAGS=ALLOWGRID
SRCFG[sandbox2] CLUSTERLIST=ClusterB,ClusterC,ClusterD USERLIST=ALL


Controlling Peer Workload Information

• Local workload exporting
– Helps simplify administration of multiple clusters by centralizing job monitoring and management at one peer, without forcing each peer into SLAVE mode

## moab.cfg (ClusterB - Destination Peer)
RMCFG[ClusterA] FLAGS=CLIENT,LOCALWORKLOADEXPORT # source peer


Data Management Configuration

• Global file systems
• Replicated data servers
• Need-based direct input
• Output data migration

## moab.cfg (NFS data server)
RMCFG[storage] TYPE=native SERVER=omc.omc13.com:42004 RTYPE=STORAGE
RMCFG[storage] SYSTEMMODIFYURL=exec://$HOME/tools/storage.ctl.nfs.pl
RMCFG[storage] SYSTEMQUERYURL=exec://$HOME/tools/storage.query.nfs.pl

## moab.cfg (SCP data server)
RMCFG[storage] TYPE=native SERVER=omc.omc13.com:42004 RTYPE=STORAGE
RMCFG[storage] SYSTEMMODIFYURL=exec://$HOME/tools/storage.ctl.scp.pl
RMCFG[storage] SYSTEMQUERYURL=exec://$HOME/tools/storage.query.scp.pl


Security

• Secret key based security is enabled via the moab-private.cfg file

• Globus Credential Based Server Authentication (4.2.4)


Credential Management

• Peer Credential Mapping
• Source and Destination Side Credential Mapping

## moab.cfg
SCHEDCFG[master1] MODE=normal
RMCFG[slave1] OMAP=file:///opt/moab/omap.dat

## /opt/moab/omap.dat (source object map file)
user:joe,jsmith
user:steve,sjohnson
group:test,staff
class:batch,serial
user:*,grid


• Preventing User Space Collisions

## moab.cfg
SCHEDCFG[master1] MODE=normal
RMCFG[slave1] OMAP=file:///opt/moab/omap.dat FLAGS=client

## /opt/moab/omap.dat (source object map file)
user:*,c1_*
group:*,*_grid
account:*,temp_*

• Interfacing with Globus GRAM

## moab.cfg
SCHEDCFG[c1] SERVER=head.c1.hpc.org
RMCFG[c2] SERVER=head.c2.hpc.org TYPE=moab JOBSTAGEMETHOD=globus


• Limiting Access To Peers

## moab.cfg
SCHEDCFG SERVER=c1.hpc.org
# only allow staff or members of the research and demo accounts to use
# remote resources on c2
RMCFG[c2] SERVER=head.c2.hpc.org TYPE=moab
RMCFG[c2] AUTHGLIST=staff AUTHALIST=research,demo

• Limiting Access From Peers

## moab.cfg
SCHEDCFG SERVER=c1.hpc.org FLAGS=client
# only allow jobs from remote cluster c1 with group credential staff or
# account research or demo to use local resources
RMCFG[c2] SERVER=head.c2.hpc.org TYPE=moab
RMCFG[c2] AUTHGLIST=staff AUTHALIST=research,demo


P2P Resource Affinity

• Certain compute architectures are able to execute certain compute jobs more effectively than others
• From a given location, staging jobs to various clusters may require more expensive allocations, more data and network resources, and more use of system services
• Certain compute resources are owned by external organizations and should be utilized sparingly
• Moab allows the use of peer resource affinity to guide jobs to the clusters which are the best fit according to a number of criteria


Management and Troubleshooting

• Peer Management Overview
– Use 'mdiag -R' to view interface health and performance/usage statistics
– Use 'mrmctl' to enable/disable peer interfaces
– Use 'mrmctl -m' to dynamically modify/configure peer interfaces

• Peer Diagnostics Overview
– Use 'mdiag -R' to diagnose general RM interfaces
– Use 'mdiag -S' to diagnose general scheduler health
– Use 'mdiag -R <RMID> --flags=submit-check' to diagnose peer-to-peer job migration
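A usage sketch of the diagnostic flow, assuming a peer interface named ClusterB:

> mdiag -R                                 # interface health and performance/usage statistics
> mdiag -S                                 # general scheduler health
> mdiag -R ClusterB --flags=submit-check   # verify jobs can migrate to the ClusterB peer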


Sovereignty: Local vs. Centralized Management Policies

• Each local admin can manage their own cluster. Users can submit to:
– The local cluster
– Specified cluster(s) in the grid
– The grid generically

• The local admin can apply policies to manage:
1. Local user access to local cluster resources
2. Local user access to grid resources
3. Outside grid user access to local cluster resources (general or specific policies)

• The grid administration body can apply policies to manage:
1. General grid policies (sharing, priority, limits, etc.)

[Figure: local Cluster A resources with a portion allocated to the grid; callouts show where each policy class (1–3 local, 1 grid-wide) applies for local users and outside grid users]


Data Staging

• Data Staging

• Data Staging Models

• Interface Scripts for a Storage Resource Manager


Data Staging

• Manages intra-cluster and inter-cluster job data staging requirements so as to minimize resource inefficiencies and maximize system utilization
• Prevents the loss of compute resources due to data blocking and can significantly improve cluster performance


Data Management: Increasing Efficiency

Data staging levels of efficiency and control:

0. No Data Staging.
1. Non-Verified Data Staging is the traditional use of data staging, where CPU requests and data staging requests are not coordinated, leaving the CPU request to block on the compute node when the data is not available to process.
2. Verified Data Staging adds the intelligence to have the workload manager verify that the data has arrived at the needed location prior to launching the job, in order to avoid workload blocking.
3. Prioritized Data Staging uses the capabilities of Verified Data Staging, but adds the ability to intercept data staging requests and submit them in an order of priority that matches that of the corresponding jobs.
4. Fully Scheduled Data Staging uses all of the capabilities of Prioritized Data Staging, but adds the ability to estimate staging periods, thus allowing workload to be scheduled more intelligently around data staging conditions. Unlike the others, which apply only to external storage, this capability can be applied to both external and internal storage scenarios.

[Figure: ladder of data staging levels, from traditional data staging (levels 0–1) to optimized data staging (levels 2–4)]


Optimized Data Staging

• Automatically pre-stages input data and stages back output data with event policies
• Coordinates data stage time with compute resource allocation
• Uses GASS, GridFTP, and scp for data management
• Reserves network resources to guarantee data staging and inter-process communication

[Figure: in the traditional, inefficient method the CPU reservation spans prestage, processing, and stage-back, so compute resources are wasted ("blocked") during data staging; with optimized data staging the CPU reservation covers only processing, leaving compute resources available to other workload during staging]


Efficiencies from Optimized Data Staging

[Figure: over the same timeframe and processor start times, the traditional, inefficient method completes 4 jobs, while intelligent event-based data staging completes 7.5 jobs, making efficient use of both CPU and network]


Data Staging Models

Attribute        Description
TYPE             must be NATIVE in all cases
RESOURCETYPE     must be set to STORAGE in all cases
SYSTEMQUERYURL   specifies the method of determining file attributes such as size, ownership, etc.
CLUSTERQUERYURL  specifies the method of determining current and configured storage manager resources such as available disk space, etc.
SYSTEMMODIFYURL  specifies the method of initiating file creation, file deletion, and data migration

• Verified Data Staging
• Prioritized Data Staging
• Fully-Scheduled Data Staging
• Data Staging to Allocated Nodes


Verified Data Staging

Verified data staging (start the job only after the file is verified to be in the right location) prevents job blocking caused by jobs whose data has not finished staging, in environments where all data staging is controlled via external data managers and no methods exist to control what is staged or in what order:

1. The user submits jobs via a portal or job-script-like mechanism. Data staging needs are communicated to a data manager mechanism (HSM manager, staging tool, script, command, etc.). Job consideration requests are sent to Moab in order to decide how and when to run.
2. Moab periodically queries the storage system (SAN, NAS, storage nodes) to see if the file is "there yet".
3. The data manager moves the data to the desired location when it is able.
4. Moab verifies that the file is "there", then releases the job for submission as long as it satisfies established policies.

Benefits: prevents non-staged jobs from blocking usage of nodes.
Drawbacks: no job-centric prioritization takes place in the order in which data gets staged.


Prioritized Data Staging

In prioritized data staging (priority ordering of data staging), Moab intercepts data staging requests and submits them through a data manager in priority order:

1. The user submits jobs via a portal or job-script-like mechanism. Data staging needs and job consideration requests are sent to Moab in order to decide how and when to run, and to decide the priority order for submitting data staging requests.
2. Moab evaluates priority, reservations, and other factors, then submits data staging requests to a data manager mechanism (HSM manager, staging tool, script, command, etc.) in the best order to match established policies.
3. Moab periodically queries the storage system (SAN, NAS, storage nodes) to see if the file is "there yet".
4. The data manager moves the data to the desired location when it is able.
5. Moab verifies that the file is "there", then releases the job for submission as long as it satisfies established policies.

Benefits: prevents non-staged jobs from blocking usage of nodes; provides soft prioritization of data staging requests.
Drawbacks: prioritization is only softly provided; insufficient information for informed CPU reservations to take place.


Fully Scheduled Data Staging: External Storage

In fully scheduled data staging (priority ordering of data staging plus data-staging-centric scheduling), Moab intercepts data staging requests to manage staging order, and reserves CPUs and other resources based on estimates of data staging periods:

1. The user submits jobs via a portal or job-script-like mechanism. Data staging needs and job consideration requests are sent to Moab in order to decide how and when to run, and to decide the priority order for submitting data staging requests.
2. Moab evaluates data sizes and network speeds to estimate data staging duration, then uses this estimate to manage the submission of data staging requests and the reservation of CPUs and other resources.
3. Moab evaluates priority, reservations, and other factors, then submits data staging requests to a data manager mechanism (HSM manager, staging tool, script, command, etc.) in the best order to match established policies.
4. Moab periodically queries the storage system (SAN, NAS, storage nodes) to see if the file is "there yet".
5. The data manager moves the data to the desired location when it is able.
6. Moab verifies that the file is "there", then releases the job for submission as long as it satisfies established policies.

Benefits: prevents non-staged jobs from blocking usage of nodes; provides soft prioritization of data staging requests; intelligently schedules resources based on data staging information.
Drawbacks: prioritization is only softly provided.


Fully Scheduled Data Staging: Local Storage

The flow is the same as fully scheduled data staging with external storage, except that the storage is on the local compute nodes: Moab still intercepts data staging requests, estimates staging durations in order to reserve CPUs and other resources, submits staging requests to the data manager in priority order, and releases each job once its data is verified to be in place.

Benefits: prevents non-staged jobs from blocking usage of nodes; provides soft prioritization of data staging requests; intelligently reserves resources based on data staging information.
Drawbacks: prioritization is only softly provided.


Data Staging Diagnostics

• checkjob
– Stage type – input or output
– File name – reports destination file only
– Status – pending, active, or complete
– File size – size of file to transfer
– Data transferred – for active transfers, reports the number of bytes already transferred

• checknode
– Active and max storage manager data staging operations
– Dedicated and max storage manager disk usage
– File name – reports destination file only
– Status – pending, active, or complete
– File size – size of file to transfer
– Data transferred – for active transfers, reports the number of bytes already transferred


Interface Scripts for a Storage Resource Manager

• Moab's data staging capabilities can utilize up to three different native resource manager interfaces:
– Cluster Query Interface
– System Query Interface
– System Modify Interface


Prioritized Data Staging Example

# moab.cfg
RMCFG[data] TYPE=NATIVE RESOURCETYPE=STORAGE
RMCFG[data] SYSTEMQUERYURL=exec:///opt/moab/tools/dstage.systemquery.pl
RMCFG[data] CLUSTERQUERYURL=exec:///opt/moab/tools/dstage.clusterquery.pl
RMCFG[data] SYSTEMMODIFYURL=exec:///opt/moab/tools/dstage.systemmodify.pl


Information Services

• Monitoring performance statistics of multiple independent clusters

• Detecting and diagnosing failures from geographically distributed clusters

• Tracking cluster, storage, network, service, and application resources

• Generating load-balancing and resource state information for users and middleware services


8. Accounting and Statistics

• Job and System Statistics

• Event Log

• Fairshare Stats

• Client Statistic Reports

• Realtime and Historical Charts with Moab Cluster Manager

• Native Resource Manager
– GMetrics
– GEvents


Accounting Overview

• Job and Reservation Accounting

• Resource Accounting

• Credential Accounting

#moab.cfg

USERCFG[DEFAULT] ENABLEPROFILING=TRUE


Job and System Statistics

• Determining cumulative cluster performance over a fixed timeframe
• Graphing changes in cluster utilization and responsiveness over time
• Identifying which compute resources are most heavily used
• Charting resource usage distribution among users, groups, projects, and classes
• Determining allocated resources, responsiveness, and/or failure conditions for jobs completed in the past
• Providing real-time statistics updates to external accounting systems
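For example, per-credential usage distributions can be pulled from the command line (flags as documented for showstats):

> showstats -u   # usage broken down per user
> showstats -g   # usage broken down per group
> showstats -a   # usage broken down per account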


Event Log

• Reports state and utilization trace records for events
– Scheduler start, stop, and failure
– Job create, start, end, cancel, migrate, and failure
– Reservation create, start, stop, and failure
– Configurable with RECORDEVENTLIST
– Can be exported to external systems

http://clusterresources.com/moabdocs/a.fparameters.shtml#recordeventlist

http://clusterresources.com/moabdocs/14.2logging.shtml#logevent
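A minimal sketch; treat the event-name tokens as illustrative and check the RECORDEVENTLIST documentation above for the exact list:

# moab.cfg (illustrative event names)
RECORDEVENTLIST JOBSTART,JOBCANCEL,RSVSTART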


Fairshare Stats

• Provide credential-based usage distributions over time
• View with mdiag -f
• Maintained for all credentials
• Stored in stats/FS.${epochtime}
• Show detailed time-distribution usage by fairshare metric


Client Statistic Reports

• In-Memory reports available for nodes and credentials

• Node categorization allows fine-grained localized usage tracking


Realtime and Historical Charts with Moab Cluster Manager

• Reports on nodes and all credentials

• Allows arbitrary querying of historical timeframes with arbitrary correlations


Service Monitoring and Management


Real-Time Performance & Accounting Analysis


10. End Users

• Moab Access Portal

• Moab Cluster Manager

• End User Commands

• End User Empowerment


Moab Access Portal™

• Submit jobs from a web browser
• View and modify only your own workload
• Assists end users in self-managing their behaviours

http://clusterresources.com/map


Moab Cluster Manager™

• Administer Resources and Workload Policies Through an Easy-to-Use Graphical User Interface

• Monitor, Diagnose and Report Resource Allocation and Usage

http://clusterresources.com/mcm


End User Commands

Command     Description
canceljob   cancel an existing job
checkjob    display job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization
releaseres  release a user reservation
setres      create a user reservation
showbf      show resource availability for jobs with specific resource requirements
showq       display a detailed, prioritized list of active and idle jobs
showstart   show the estimated start time of idle jobs
showstats   show detailed usage statistics for the users, groups, and accounts the end user has access to
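A typical end-user session might look like the following sketch; the job ID and duration are illustrative:

> showq                # what is running and what is queued?
> checkjob 1042        # inspect a specific job (hypothetical ID)
> showstart 1042       # when is it estimated to start?
> showbf -d 1:00:00    # what resources are free right now for a 1-hour job?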


Assist Users in Better Utilizing Resources

• General info

• Job eval

• Completed job failure post-mortem

• Job start time estimates

• Job control

• Reservation control


Assist Users in Better Utilizing Resources (contd)

• How Do You Evaluate a Request?
– showstart (earliest start, completion time, etc.)
– showstats -f (general service-level statistics)
– showstats -u (user statistics)
– showbf (immediately available resources)