TRANSCRIPT
Scyld ClusterWare System Administration
Orientation Agenda – Part 1
Scyld ClusterWare foundations
» Booting process
• Startup scripts
• File systems
• Name services
» Cluster Configuration
Cluster Components
» Networking infrastructure
» NFS File servers
» IPMI Configuration
Break
Orientation Agenda – Part 2
Parallel jobs
» MPI configuration
» Infiniband interconnect
Queuing
» Initial setup
» Tuning
» Policy case studies
Other software and tools
Troubleshooting
Questions and Answers
Orientation Agenda – Part 1
Scyld ClusterWare foundations
» Booting process
• Startup scripts
• File systems
• Name services
» Cluster Configuration
Cluster Components
» Networking infrastructure
» NFS File servers
Break
Cluster Virtualization Architecture Realized
Minimal in-memory OS with single daemon rapidly deployed in seconds – no disk required
» Less than 20 seconds
Virtual, unified process space enables intuitive single sign-on, job submission
» Effortless job migration to nodes
Monitor & manage efficiently from the Master
» Single System Install
» Single Process Space
» Shared cache of the cluster state
» Single point of provisioning
» Better performance due to lightweight nodes
» No version skew, which inherently makes the system more reliable
[Diagram: master node linked to compute nodes (optional disks) over the interconnection network; the master also connects to the Internet or an internal network]
Manage & use a cluster like a single SMP machine
Elements of Cluster Systems
Some important elements of a cluster system
» Booting and Provisioning
» Process creation, monitoring and control
» Update and consistency model
» Name services
» File Systems
» Physical management
» Workload virtualization
[Diagram: master node, interconnection network, Internet or internal network, optional disks]
Booting and Provisioning
Integrated, automatic network boot
Basic hardware reporting and diagnostics in the Pre-OS stage
Only CPU, memory and NIC needed
Kernel and minimal environment from master
Just enough to say “what do I do now?”
Remaining configuration driven by master
Logs are stored in:
» /var/log/messages
» /var/log/beowulf/node.*
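For day-to-day diagnosis these logs can be inspected directly from the head node; node 0 below is only an example:
  tail -f /var/log/beowulf/node.0        # follow a node's boot log while it comes up
  grep -i error /var/log/beowulf/node.*  # scan every node log after a cluster-wide boot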
DHCP and TFTP services
Started from /etc/rc.d/init.d/beowulf
» Locates vmlinuz in /boot
» Configures syslog and other parameters on the head node
» Loads kernel modules
» Sets up libraries
» Creates the ramdisk image for compute nodes
» Starts the DHCP/TFTP server (beoserv)
» Configures NAT for IP forwarding if needed
» Starts the kickback name service daemon (4.2.0+)
» Tunes the network stack
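Because all of this is driven by a standard init script, the service is managed like any other; a minimal sketch on the head node:
  service beowulf status
  service beowulf restart    # reruns the steps listed above
  chkconfig beowulf on       # start the service automatically at boot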
Compute Node Boot Process
Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
Subnet configuration
Default used to be a class C network
» netmask 255.255.255.0
» Limited to 155 compute nodes (100 + $NODE < 255)
» Last octet denotes special devices
• x.x.x.10 switches
• x.x.x.30 storage
» Infiniband is a separate network
• x.x.1.$(( 100 + $NODE ))
» Needed eth0:1 to reach IPMI network
• x.x.2.$(( 100 + $NODE ))
• /etc/sysconfig/network-scripts/ifcfg-eth0:1
• ifconfig eth0:1 10.54.2.1 netmask 255.255.255.0
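A minimal sketch of the alias file referenced above, matching the ifconfig example:
  # /etc/sysconfig/network-scripts/ifcfg-eth0:1
  DEVICE=eth0:1
  IPADDR=10.54.2.1
  NETMASK=255.255.255.0
  ONBOOT=yes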
Subnet configuration
New standard is a class B network
» netmask 255.255.0.0
» Limited to 100 * 256 compute nodes
• 10.54.50.x – 10.54.149.x
» Third octet denotes special devices
• x.x.10.x switches
• x.x.30.x storage
» Infiniband is a separate network
• x.$(( x+1)).x.x
» IPMI is on the same network (eth0:1 not needed)
• x.x.150.$NODE
Compute Node Boot Process
Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
Setup_fs
Script is in /usr/lib/beoboot/bin/setup_fs
Configuration file: /etc/beowulf/fstab
» # Select which FSTAB to use.
  if [ -r /etc/beowulf/fstab.$NODE ] ; then
      FSTAB=/etc/beowulf/fstab.$NODE
  else
      FSTAB=/etc/beowulf/fstab
  fi
  echo "setup_fs: Configuring node filesystems using $FSTAB..."
$MASTER is determined and populated
“nonfatal” option allows compute nodes to finish boot process and log errors in /var/log/beowulf/node.*
NFS mounts of external servers need to be done via IP address because name services have not been configured yet
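A sketch of typical /etc/beowulf/fstab entries; the storage server address and paths are examples, the nonfatal option is the one described above, and the external server is referenced by IP:
  $MASTER:/home     /home   nfs   defaults,nonfatal   0 0
  10.54.30.1:/data  /data   nfs   defaults,nonfatal   0 0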
beofdisk
Beofdisk configures partition tables on compute nodes
» To configure the first drive:
• bpsh 0 fdisk /dev/sda
– Typical interactive usage
» Query the partition table:
• beofdisk -q --node 0
» Write partition tables to other nodes:
• for i in $(seq 1 10); do beofdisk -w --node $i ; done
» Create devices initially
• Use the head node's /dev/sd* as reference:
– [root@scyld beowulf]# ls -l /dev/sda*
  brw-rw---- 1 root disk 8, 0 May 20 08:18 /dev/sda
  brw-rw---- 1 root disk 8, 1 May 20 08:18 /dev/sda1
  brw-rw---- 1 root disk 8, 2 May 20 08:18 /dev/sda2
  brw-rw---- 1 root disk 8, 3 May 20 08:18 /dev/sda3
  [root@scyld beowulf]# bpsh 0 mknod /dev/sda1 b 8 1
Create local filesystems
After partitions have been created, mkfs
» bpsh -an mkswap /dev/sda1
» bpsh -an mkfs.ext2 /dev/sda2
• ext2 is a non-journaled filesystem, faster than ext3 for a scratch file system
• If corruption occurs, simply mkfs again
Copy the int18 bootblock if needed:
» bpcp /usr/lib/beoboot/bin/int18_bootblock $NODE:/dev/sda
/etc/beowulf/config options for file system creation:
» # The compute node file system creation and consistency checking policies.
  fsck full
  mkfs never
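Once the partitions and filesystems exist, matching entries can be added to the compute-node fstab so setup_fs mounts them at boot; /scratch is only an example mount point:
  /dev/sda1   swap       swap   defaults,nonfatal   0 0
  /dev/sda2   /scratch   ext2   defaults,nonfatal   0 0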
Compute Node Boot Process
Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
Name services
/usr/lib/beoboot/bin/node_up populates /etc/hosts and /etc/nsswitch.conf on compute nodes
beo name service determines values from /etc/beowulf/config file
bproc name service determines values from current environment
‘getent’ can be used to query entries
» getent netgroup cluster
» getent hosts 10.54.0.1
» getent hosts n3
If system-config-authentication is run, ensure that proper entries still exist in /etc/nsswitch.conf (head node)
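A quick sanity check (a sketch) from the head node that the beo and bproc services are still wired in and resolving:
  grep -E 'beo|bproc' /etc/nsswitch.conf
  getent hosts master
  getent passwd root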
BeoNSS Hostnames
Opportunity: We control IP address assignment
» Assign node IP addresses in node order
» Changes name lookup to addition
» Master: 10.54.0.1
  GigE Switch: 10.54.10.0
  IB Switch: 10.54.11.0
  NFS/Storage: 10.54.30.0
  Nodes: 10.54.50.$node
Name format
» Cluster hostnames have the base form n<N>
» Options for admin-defined names and networks
Special names for "self" and "master"
» Current machine is ".-2" or "self".
» Master is known as ".-1", “master”, “master0”
[Diagram: master node (.-1 / master) and compute nodes n0 n1 n2 n3 n4 n5 on the interconnection network]
Changes
Prior to 4.2.0
» Hostnames default to .<NODE> form
» /etc/hosts had to be populated with alternative names and IP addresses
» May break @cluster netgroup and hence NFS exports
» /etc/passwd and /etc/group needed on compute nodes for Torque
4.2.0+
» Hostnames default to n<NODE> form
» Configuration is driven by /etc/beowulf/config and beoNSS
» Username and groups can be provided by kickback daemon for Torque
Compute Node Boot Process
Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
ClusterWare Filecache functionality
Provided by filecache kernel module
Configured by /etc/beowulf/config libraries directives
Dynamically controlled by ‘bplib’
Capabilities exist in all ClusterWare 4 versions
» 4.2.0 adds a prestage keyword in /etc/beowulf/config
» Prior versions needed additional scripts in /etc/beowulf/init.d
For libraries listed in /etc/beowulf/config, files can be prestaged by running ‘md5sum’ on each file
» # Prestage selected libraries. The keyword is generic, but the current
  # implementation only knows how to "prestage" a file that is open'able on
  # the compute node: through the libcache, across NFS, or already exists
  # locally (which isn't really a "prestaging", since it's already there).
  prestage_libs=`beoconfig prestage`
  for libname in $prestage_libs ; do
      # failure isn't always fatal, so don't use run_cmd
      echo "node_up: Prestage file:" $libname
      bpsh $NODE md5sum $libname > /dev/null
  done
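A sketch of what a prestage entry might look like; the library path is only an example and the exact syntax should be checked against ‘man beowulf-config’:
  # /etc/beowulf/config (excerpt)
  prestage /usr/lib64/libmpich.so
  # picked up the next time node_up runs, i.e. on each node's next boot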
Compute Node Boot Process
Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
Compute nodes init.d scripts
Located in /etc/beowulf/init.d
Scripts start on the head node and need explicit bpsh and beomodprobe to operate on compute nodes
$NODE has been prepopulated by /usr/lib/beoboot/bin/node_up
Order is based on file name
» Numbered files can be used to control order
beochkconfig is used to set the +x (execute) bit on these files
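A minimal sketch of such a script; the file name and the sysctl setting are illustrative only:
  #!/bin/sh
  # /etc/beowulf/init.d/50_sysctl -- the numeric prefix controls ordering
  # Runs on the head node for each booting node; $NODE is already set.
  bpsh $NODE sysctl -w vm.overcommit_memory=1
  exit 0
Remember to make the script executable (via beochkconfig) so it is not skipped.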
Cluster Configuration
/etc/beowulf/config is the central location for cluster configuration
Features are documented in ‘man beowulf-config’
Compute node order is determined by ‘node’ parameters
Changes can be activated by doing a ‘service beowulf reload’
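For example, after editing node entries (a sketch; the exact directive syntax is in the man page):
  grep '^node' /etc/beowulf/config   # review compute node ordering
  service beowulf reload             # activate the changes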
Orientation Agenda – Part 1
Scyld ClusterWare foundations
» Booting process
• Startup scripts
• File systems
• Name services
» Cluster Configuration
Cluster Components
» Networking infrastructure
» NFS File servers
» IPMI configuration
Break
Elements of Cluster Systems
Some important elements of a cluster system
» Booting and Provisioning
» Process creation, monitoring and control
» Update and consistency model
» Name services
» File Systems
» Physical management
» Workload virtualization
[Diagram: master node, interconnection network, Internet or internal network, optional disks]
Compute Node Boot Process
Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
Remote Filesystems
Remote - Share a single disk among all nodes
» Every node sees same filesystem
» Synchronization mechanisms manage changes
» Locking has either high overhead or causes serial blocking
» "Traditional" UNIX approach
» Relatively low performance
» Doesn't scale well; server becomes bottleneck in large systems
» Simplest solution for small clusters, reading/writing small files
[Diagram: master node, interconnection network, Internet or internal network, optional disks]
NFS Server Configuration
Head node NFS services
» Configuration in /etc/exports (an example exports file follows this list)
» Provides system files (/bin, /usr/bin)
» Increase number of NFS daemons
• echo "RPCNFSDCOUNT=64" > /etc/sysconfig/nfs ; service nfs restart
Dedicated NFS server
» SLES10 was recommended; RHEL5 now includes some xfs support
• xfs has better performance
• OS has better IO performance than RHEL4
» Network trunking can be used to increase bandwidth (with caveats)
» Hardware RAID
• Adaptec RAID card
– CTRL-A at boot
– arcconf utility from http://www.adaptec.com/en-US/support/raid/
» External storage (Xyratex or nStor)
• SAS-attached
• Fibre channel attached
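As referenced above, a sketch of head-node /etc/exports entries; the paths are examples and @cluster is the netgroup provided by beoNSS:
  /home   @cluster(rw,async,no_root_squash)
  /opt    @cluster(ro,async)
Run exportfs -ra after editing to re-export.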
Network trunking
Use multiple physical links as a single pipe for data
» Configuration must be done on host and switch
SLES 10 configuration
» Create a configuration file /etc/sysconfig/network/ifcfg-bond0 for the bond0 interface
» BOOTPROTO=static
  DEVICE=bond0
  IPADDR=10.54.30.0
  NETMASK=255.255.0.0
  STARTMODE=onboot
  MTU=''
  BONDING_MASTER=yes
  BONDING_SLAVE_0=eth0
  BONDING_SLAVE_1=eth1
  BONDING_MODULE_OPTS='mode=0 miimon=500'
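Once the interface is up, the kernel bonding driver's status file is a quick host-side check that both slaves joined the trunk:
  ifup bond0
  cat /proc/net/bonding/bond0   # shows the mode, MII status, and slave interfaces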
Network trunking
HP switch configuration
» Create trunk group via serial or telnet interface
Netgear (admin:password)
» Create trunk group via http interface
Cisco
» Create etherchannel configuration
External Storage
Xyratex arrays have a configuration interface
» Text based via serial port
» Newer devices (nStor 5210, Xyratex F/E 5402/5412/5404) have embedded StorView
• http://storage0:9292
– admin:password
» RAID arrays, logical drives are configured and monitored
• LUNs are numbered and presented on each port. Highest LUN is the controller itself
• Multipath or failover needs to be configured
Need for QLogic Failover
Collapse LUN presentation in OS to a single instance per LUN
Minimize the potential for user error while maintaining failover and static load balancing
Physical Management
ipmitool
» Intelligent Platform Management Interface (IPMI) is integrated into the baseboard management controller (BMC)
» Serial-over-LAN (SOL) can be implemented
» Allows access to hardware such as sensor data or power states
» E.g. ipmitool -H n$NODE-ipmi -U admin -P admin power {status,on,off}
bpctl
» Controls the operational state and ownership of compute nodes
» Examples might be to reboot or power off a node
• Reboot: bpctl -S all -R
• Power off: bpctl -S all -P
» Limit user and group access to run on a particular node or set of nodes
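A couple of further ipmitool examples in the same style; the hostname follows the n$NODE-ipmi convention and the credentials are the defaults shown above:
  ipmitool -H n0-ipmi -U admin -P admin sensor list    # read sensor data
  ipmitool -H n0-ipmi -U admin -P admin sol activate   # open a serial-over-LAN console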
IPMI Configuration
Full spec is available here:
» http://www.intel.com/design/servers/ipmi/pdf/IPMIv2_0_rev1_0_E3_markup.pdf
Penguin Specific configuration
» Recent products all have IPMI implementations. Some are in-band (share physical media with eth0), some are out-of-band (separate port and cable from eth0)
• Altus 1300, 600, 650 – In-band, lan channel 6
• Altus 1600, 2600, 1650, 2650; Relion 1600, 2600, 1650, 2650, 2612 – Out-of-band, lan channel 2
• Relion 1670 – In-band, lan channel 1
• Altus x700/x800, Relion x700 – Out-of-band OR in-band, lan channel 1
Some ipmitool versions have a bug and need the following command to commit a write
» bpsh $NODE ipmitool raw 12 1 $CHANNEL 0 0
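A sketch of setting a node's BMC LAN parameters through bpsh; channel 2 matches the out-of-band products above, and the address follows the class B scheme (x.x.150.$NODE, node 0 shown):
  bpsh $NODE ipmitool lan set 2 ipsrc static
  bpsh $NODE ipmitool lan set 2 ipaddr 10.54.150.0
  bpsh $NODE ipmitool lan set 2 netmask 255.255.0.0
  bpsh $NODE ipmitool raw 12 1 2 0 0   # commit workaround for affected ipmitool versions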
Orientation Agenda – Part 2
Parallel jobs
» MPI configuration
» Infiniband interconnect
Queueing
» Initial setup
» Tuning
» Policy case studies
Other software and tools
Questions and Answers
Explicitly Parallel Programs
Different paradigms exist for parallelizing programs
» Shared memory
» OpenMP
» Sockets
» PVM
» Linda
» MPI
Most distributed parallel programs are now written using MPI
» Different options for MPI stacks: MPICH, OpenMPI, HP, and Intel
» ClusterWare comes integrated with customized versions of MPICH and OpenMPI
Compiling MPICH programs
mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and link in the correct MPI libraries from /usr/lib64/MPICH
» GNU, PGI, and Intel compilers are supported
Effectively set libraries and includes for compile and linking
» prefix="/usr"
  part1="-I${prefix}/include"
  part2=""
  part3="-lmpi -lbproc"
  …
  part1="-L${prefix}/${lib}/MPICH/p4/gnu $part1"
  …
  $cc $part1 $part2 $part3
Running MPICH programs
mpirun is used to launch MPICH programs
If Infiniband is installed, the interconnect fabric can be chosen using the machine flag:
» -machine p4
» -machine vapi
» Done by changing LD_LIBRARY_PATH at runtime
• export LD_LIBRARY_PATH=${libdir}/MPICH/${MACHINE}/${compiler}:${LD_LIBRARY_PATH}
» Hooks for using mpiexec for the queue system
• elif [ -n "${PBS_JOBID}" ]; then
      for var in NP NO_LOCAL ALL_LOCAL BEOWULF_JOB_MAP
      do
          unset $var
      done
      for hostname in `cat $PBS_NODEFILE`
      do
          NODENUMBER=`getent hosts ${hostname} | awk '{print $3}' | tr -d '.'`
          BEOWULF_JOB_MAP="${BEOWULF_JOB_MAP}:${NODENUMBER}"
      done
      # Clean a leading : from the map
      export BEOWULF_JOB_MAP=`echo ${BEOWULF_JOB_MAP} | sed 's/^://g'`
      # The -n 1 argument is important here
      exec mpiexec -n 1 ${progname} "$@"
Environment Variable Options
Additional environment variable control:
» NP — The number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes.
» ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.
» ALL_NODES—Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU.
» ALL_LOCAL — Run every process on the master node; used for debugging purposes.
» NO_LOCAL — Don’t run any processes on the master node.
» EXCLUDE — A colon-delimited list of nodes to be avoided during node assignment.
» BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node listed will be the first process (MPI Rank 0) and so on.
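For example, combining several of these variables on the command line:
  NP=8 NO_LOCAL=1 ./a.out           # 8 processes, none on the master node
  EXCLUDE=2:3 NP=4 ./a.out          # 4 processes, avoiding nodes 2 and 3
  BEOWULF_JOB_MAP=0:0:1:1 ./a.out   # ranks 0-3 pinned to nodes 0, 0, 1, 1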
Running MPICH programs
Prior to ClusterWare 4.1.4, mpich jobs were spawned outside of the queue system
» BEOWULF_JOB_MAP had to be set based on machines listed in $PBS_NODEFILE
• number_of_nodes=`cat $PBS_NODEFILE | wc -l`
  hostlist=`cat $PBS_NODEFILE | head -n 1`
  for i in $(seq 2 $number_of_nodes) ; do
      hostlist=${hostlist}:`cat $PBS_NODEFILE | head -n $i | tail -n 1`
  done
  BEOWULF_JOB_MAP=`echo $hostlist | sed 's/\.//g' | sed 's/n//g'`
  export BEOWULF_JOB_MAP
Starting with ClusterWare 4.1.4, mpiexec was included with the distribution. mpiexec is an alternative spawning mechanism that starts processes as part of the queue system
Other MPI implementations have alternatives. HP-MPI and Intel MPI use rsh and run outside of the queue system. OpenMPI uses libtm to properly start processes
MPI Primer
Only a brief introduction is provided here for MPI. Many other in-depth tutorials are available on the web and in published sources.
» http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
» http://www.llnl.gov/computing/tutorials/mpi/
Paradigms for writing parallel programs depend upon the application
» SIMD (single-instruction multiple-data)
» MIMD (multiple-instruction multiple-data)
» MISD (multiple-instruction single-data)
SIMD will be presented here as it is a commonly used template
» A single application source is compiled to perform operations on different sets of data
» The data is read by the different threads or passed between threads via messages (hence MPI = message passing interface)
• Contrast this with shared memory or OpenMP, where data is accessed locally through memory
• Optimizations in the MPI implementation can perform localhost optimization; however, the program is still written using a message passing construct
MPI specification has many functions; however most MPI programs can be written with only a small subset
Infiniband Primer
Infiniband provides a low-latency, high-bandwidth interconnect for message passing, minimizing I/O overhead for tightly coupled parallel applications
Infiniband requires hardware, kernel drivers, O/S support, user land drivers, and application support
Prior to 4.2.0, software stack was provided by SilverStorm
Starting with 4.2.0, ClusterWare migrated to using the OpenFabrics (ofed, openIB) stack
Infiniband Subnet Manager
Every Infiniband network requires a Subnet Manager to discover and manage the topology
» Our clusters typically ship with a Managed QLogic Infiniband switch with an embedded subnet manager (10.54.0.20; admin:adminpass)
» Subnet Manager is configured to start at switch boot
» Alternatively, a software Subnet Manager (e.g. openSM) can be run on a host connected to the Infiniband fabric.
» Typically the embedded subnet manager is more robust and provides a better experience
Communication Layers
Verbs API (VAPI) provides a hardware-specific interface to the transport media
» Any program compiled with VAPI can only run on the same hardware profile and drivers
» Makes portability difficult
Direct Access Programming Language (DAPL) provides a more consistent interface
» DAPL layers can communicate with IB, Myrinet, and 10GigE hardware
» Better portability for MPI libraries
TCP/IP interface
» Another upper layer protocol provides IP-over-IB (IPoIB), where the IB interface is assigned an IP address and most standard TCP/IP applications work
MPI Implementation Comparison
MPICH is provided by Argonne National Labs
» Runs only over Ethernet
Ohio State University has ported MPICH to use the Verbs API => MVAPICH
» Similar to MPICH but uses Infiniband
LAM-MPI was another implementation which provided a more modular format
OpenMPI is the successor to LAM-MPI and has many options
» Can use different physical interfaces and spawning mechanisms
» http://www.openmpi.org
HP-MPI, Intel-MPI
» Licensed MPICH2 code and added functionality
» Can use a variety of physical interconnects
OpenMPI Configuration
./configure --prefix=/opt/openmpi --with-udapl=/usr --with-tm=/usr --with-openib=/usr --without-bproc --without-lsf_bproc --without-grid --without-slurm --without-gridengine --without-portals --without-gm --without-loadleveler --without-xgrid --without-mx --enable-mpirun-prefix-by-default --enable-static
make all
make install
Create scripts in /etc/profile.d to set default environment variables for all users
mpirun -v -mca pls_rsh_agent rsh -mca btl openib,sm,self -machinefile machinefile ./IMB-MPI1
Queuing
How are resources allocated among multiple users and/or groups?
» Statically by using bpctl user and group permissions
» ClusterWare supports a variety of queuing packages
• TaskMaster (advanced Moab policy-based scheduler integrated with ClusterWare)
• Torque
• SGE
Interacting with TaskMaster
Because TaskMaster uses the MOAB scheduler with Torque pbs_server and pbs_mom components, all of the Torque commands are still valid
» qsub will submit a job to Torque, MOAB then polls pbs_server to detect new jobs
» msub will submit a job to Moab which then pushes the job to pbs_server
Other TaskMaster commands
» qstat -> showq
» qdel, qhold, qrls -> mjobctl
» pbsnodes -> showstate
» qmgr -> mschedctl, mdiag
» Configuration in /opt/moab/moab.cfg
Torque Initial Setup
‘/usr/bin/torque.setup root’ can be used to start with a clean slate
» This will delete any current configuration that you have
» qmgr -c 'set server keep_completed=300'
  qmgr -c 'set server query_other_jobs=true'
  qmgr -c 'set server operators += [email protected]'
  qmgr -c 'set server managers += [email protected]'
/var/spool/torque/server_priv/nodes stores node information
» n0 np=8 prop1 prop2
» qterm -t quick
  edit /var/spool/torque/server_priv/nodes
  service pbs_server start
/var/spool/torque/sched_priv/sched_config configures default FIFO scheduler
/var/spool/torque/mom_priv/config configures the pbs_mom daemons
» Copied out during /etc/beowulf/init.d/torque
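A sketch of a minimal mom_priv/config; ‘master’ as the pbs_server host and the $usecp mapping are assumptions to adapt locally:
  # /var/spool/torque/mom_priv/config (sketch)
  $pbsserver  master           # head node running pbs_server
  $logevent   255              # log all mom events
  $usecp      *:/home /home    # stage files via the shared /home instead of rcp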
TaskMaster Initial Setup
Edit configuration in /opt/moab/moab.cfg
» SCHEDCFG[Scyld] MODE=NORMAL SERVER=scyld.localdomain:42559
• Ensure hostname is consistent with ‘hostname’
» ADMINCFG[1] USERS=root
• Add additional users who can be queue managers
» RMCFG[base] TYPE=PBS
• TYPE=PBS integrates with a traditional Torque configuration
Tuning
Default walltime can be set in Torque using:
» qmgr -c 'set queue batch resources_default.walltime=16:00:00'
If many small jobs need to be submitted, uncomment the following in /opt/moab/moab.cfg
» JOBAGGREGATIONTIME 10
To exactly match node and processor requests, add the following to /opt/moab/moab.cfg
» JOBNODEMATCHPOLICY EXACTNODE
Changes in /opt/moab/moab.cfg can be activated by doing a ‘service moab restart’
Case Studies
Case Study #1
» Multiple queues for interactive, high priority, and standard jobs
Case Study #2
» Different types of hardware configuration
» Setup with FairShare
• http://www.clusterresources.com/products/mwm/moabdocs/5.1.1priorityoverview.shtml
• http://www.clusterresources.com/products/mwm/moabdocs/5.1.2priorityfactors.shtml
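A sketch of the FairShare side of Case Study #2; the weights, depth, and decay below are illustrative only, and the linked Moab documents cover the details:
  # /opt/moab/moab.cfg (FairShare excerpt)
  FSPOLICY        DEDICATEDPS
  FSDEPTH         7
  FSINTERVAL      24:00:00
  FSDECAY         0.80
  FSWEIGHT        1
  FSUSERWEIGHT    10
  FSGROUPWEIGHT   100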
Troubleshooting
Log files
» /var/log/messages
» /var/log/beowulf/node*
» /var/spool/torque/server_logs
» /var/spool/torque/mom_logs
» qstat -f
» tracejob
» /opt/moab/log
» mdiag
» strace -p
» gdb
Hardware Maintenance
pbsnodes -o n0: mark node offline and allow jobs to drain
bpctl -S 0 -s unavailable: prevent user interactive commands from running on the node
Wait until node is idle
bpctl -S 0 -P: power off node
Perform maintenance
Power on node
pbsnodes -c n0: clear the offline flag and return the node to service
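The whole sequence for a single node, as a sketch (node 0, with power restored over IPMI as shown earlier):
  pbsnodes -o n0                # mark offline, let batch jobs drain
  bpctl -S 0 -s unavailable     # block new interactive work
  # ...wait until the node is idle...
  bpctl -S 0 -P                 # power off
  # perform maintenance, then:
  ipmitool -H n0-ipmi -U admin -P admin power on
  pbsnodes -c n0                # clear the offline flag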
Questions??