Sun Grid Engine
Grids
Grids are collections of resources made available to customers.
Compute grids make cycles available to customers from an access point; kind of like plugging into an electrical grid
• Cluster grids: resources in one room
• Campus grids: multiple clusters on one campus
• Global grids: cross administrative domains
Grids
Potentially (ideally?) you could completely outsource your HPC needs by buying time on a commercial grid. Running a big data center is tricky and takes expensive people. If you are, say, a small computer animation group working on an animated short, it might not make sense to set up a data center for six months of work.
OTOH, if you’re Pixar or Lucas this is a core competency
Sun Grid Engine
SGE is a piece of software that matches jobs to compute resources
BTW, SGE runs on OS X. This would be another fine project for someone to investigate
SGE
As we’ve seen, Sun Grid Engine can accept a batch job and give it to a compute node.
SGE (base level) is open source; see http://gridengine.sunsource.net/
There are some other issues:
• Multiple queues
• Giving jobs only to nodes with the necessary resources
• Queue manipulation
SGE
Users submit jobs; they’re kept by SGE in a holding area until resources become available, then sent to an execution device. The results are reported back.
Types of hosts: master, execution, administration, and submit
The master host runs the master daemon and the scheduling daemon
Execution hosts are where jobs run; admin hosts can manipulate the queues; submit hosts are where users submit jobs
There are a lot of knobs to twiddle on SGE
SGE
Imagine a bank that has five customers walk in. Four just want to deposit a check, and the fifth wants to set up a home loan.
If the home loan guy happens to be first, and there is only one queue, the four with short transactions wait for a long time.
What’s more, the home loan guy must have manager approval at some point in the process
So: set up two queues, one for long transactions, one an express lane. The home loan queue specifies that the manager must be available.
This reduces the median time spent in queue for the short transaction customers, and reduces the variance of the waiting time
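The bank example can be checked with a little arithmetic. This is only a sketch under made-up assumptions: a deposit takes 2 minutes, the home loan takes 60, and each queue has exactly one teller. None of these numbers come from SGE.

```python
from statistics import median, pvariance

LOAN, DEPOSIT = 60, 2  # assumed service times in minutes

def waits(service_times):
    """Time each customer waits in one FIFO line served by one teller."""
    waited, clock = [], 0
    for t in service_times:
        waited.append(clock)  # wait before this customer's service starts
        clock += t
    return waited

def deposit_waits_single_queue(loan_position):
    """Waits of the four deposit customers when everyone shares one line."""
    order = [DEPOSIT] * 4
    order.insert(loan_position, LOAN)
    w = waits(order)
    return [w[i] for i in range(5) if i != loan_position]

# One queue, home loan customer first in line.
loan_first = deposit_waits_single_queue(0)

# Two queues: the loan has its own line, the deposits share an express lane.
express = waits([DEPOSIT] * 4)

# One queue, pooled over every position the loan customer could arrive in.
pooled = [w for k in range(5) for w in deposit_waits_single_queue(k)]

print(median(loan_first), median(express))      # 63.0 3.0
print(pvariance(pooled) > pvariance(express))   # True
```

With one queue the short customers' wait depends on where the loan customer happens to stand; with an express lane it does not, which is where the lower variance comes from.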
SGE Queues
There may be more than one queue; jobs are associated with queues
qconf -sql shows the list of defined queues
Why multiple queues? Some types of jobs may be very long or require specific resources, so users may submit jobs to queues optimized for those types of jobs
[Diagram: SGE Master, queues Q1 and Q2, SGE Scheduler, three execution hosts]
Scheduler
The scheduler (which assigns jobs to execute hosts) looks at several factors:
• Load parameters: how busy the execute hosts are, by some measure
• Consumable resources: memory, disk space, licenses, etc. SGE keeps track of these and dispatches a job only if resources are available
• Attributes, such as 64-bit, G5, etc. These aren’t consumed; they simply describe a state of the host or queue
The scheduler may look at all these factors before assigning a job from the holding pool to an execution host
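A toy version of that decision might look like the sketch below. It is not SGE's actual scheduling algorithm: the Host and Job classes, the "least loaded eligible host wins" rule, the mem_gb consumable, and the host compute-0-1 are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    load: float                                # load parameter, e.g. a load average
    free: dict = field(default_factory=dict)   # consumables still available
    attrs: dict = field(default_factory=dict)  # fixed attributes, not consumed

@dataclass
class Job:
    name: str
    consumes: dict = field(default_factory=dict)  # consumables the job uses up
    needs: dict = field(default_factory=dict)     # attributes the host must have

def eligible(host, job):
    """A host qualifies if it has every attribute and enough of every consumable."""
    has_attrs = all(host.attrs.get(k) == v for k, v in job.needs.items())
    has_room = all(host.free.get(k, 0) >= v for k, v in job.consumes.items())
    return has_attrs and has_room

def dispatch(job, hosts):
    """Pick the least loaded eligible host and debit its consumables."""
    candidates = [h for h in hosts if eligible(h, job)]
    if not candidates:
        return None  # the job stays in the holding area
    best = min(candidates, key=lambda h: h.load)
    for k, v in job.consumes.items():
        best.free[k] -= v
    return best.name

hosts = [
    Host("compute-0-0", load=0.9, free={"mem_gb": 8}, attrs={"arch": "glinux"}),
    Host("compute-0-1", load=0.2, free={"mem_gb": 2}, attrs={"arch": "glinux"}),
]
job = Job("render", consumes={"mem_gb": 4}, needs={"arch": "glinux"})
placements = [dispatch(job, hosts) for _ in range(3)]
print(placements)
```

The less loaded host is skipped because it lacks the memory, and the third submission finds no host with enough memory left, so it waits.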
Consumable Resources
There are some finite resources in the cluster: CPU time, disk space, licenses, bandwidth
Available capacity for these is defined by the administrator; the scheduler examines available consumables when deciding what to run
Requestable Attributes
On job submission you can request attributes or characteristics: at least X amount of memory, a license for software package Y, a 64 bit host, etc.
In a production environment licenses can be a big deal. Circuit design software may cost thousands per node, so not every node on the cluster may have a license.
The attributes can be related to the hosts or the queues
Attributes that are “requestable” can be mentioned in the qsub command, so jobs may require that attribute to run
SGE
You don’t need to submit a job to a specific queue; instead you can simply ask for certain resources, and SGE will pick a queue based on the requirement profile
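As a rough illustration, a resource request is just a comma-separated list of attribute=value pairs passed to qsub -l. The helper below only builds the command line and does not run it. The script name render.sh is hypothetical; arch and num_proc are the requestable attributes from the complex listing in these slides.

```python
import shlex

def qsub_command(script, requests):
    """Build a qsub argument list that asks for resources instead of naming a queue."""
    resource_list = ",".join(f"{name}={value}" for name, value in requests.items())
    argv = ["qsub"]
    if resource_list:
        argv += ["-l", resource_list]
    argv.append(script)
    return argv

cmd = qsub_command("render.sh", {"arch": "glinux", "num_proc": 2})
print(shlex.join(cmd))  # qsub -l arch=glinux,num_proc=2 render.sh
```

Building the command as a list rather than a string keeps it safe to hand to subprocess.run later without shell quoting problems.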
Environment Variables
When a job runs on a host some environment variables are set:
• ARC
• SGE_ROOT
• SGE_STDOUT_PATH
• HOME
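A job script can read these like any other environment variables. A minimal sketch, assuming the job is a Python script; outside of SGE the SGE-specific variables will simply show as not set.

```python
import os

# Variables named on the slide above.
SGE_VARS = ("ARC", "SGE_ROOT", "SGE_STDOUT_PATH", "HOME")

def sge_environment(env=os.environ):
    """Return the listed variables, marking any that are missing."""
    return {name: env.get(name, "<not set>") for name in SGE_VARS}

for name, value in sge_environment().items():
    print(f"{name}={value}")
```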
Dependencies
Suppose you divide up a task into several subtasks. This can require sequencing: some subtasks may need to be finished before other subtasks can run. You can specify a list of jobs that must finish before this job runs.
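One way to plan such a sequence is sketched below. The subtask names and script names are hypothetical. The -N (job name) and -hold_jid (wait for the listed jobs) options are not mentioned in these slides; they are qsub options as I understand them, so check the qsub man page on your installation. The code only prints the commands; it does not submit anything.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks: each key lists the jobs that must finish first.
DEPENDS_ON = {
    "split": [],
    "render_a": ["split"],
    "render_b": ["split"],
    "merge": ["render_a", "render_b"],
}

def submission_plan(depends_on):
    """Return qsub argument lists in an order that respects the dependencies."""
    plan = []
    for job in TopologicalSorter(depends_on).static_order():
        argv = ["qsub", "-N", job]
        if depends_on[job]:
            # Hold this job until the named jobs have finished.
            argv += ["-hold_jid", ",".join(depends_on[job])]
        argv.append(f"{job}.sh")
        plan.append(argv)
    return plan

for argv in submission_plan(DEPENDS_ON):
    print(" ".join(argv))
```

The two render jobs have no ordering between them, so either may be printed first.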
Listing Attributes
qconf -scl lists “complexes” of attributes. Typically this includes a complex for the queues, and one for the hosts
qconf -sc host|queue lists the attributes for a complex:
#name      shortcut  type    value  relop  requestable  consumable  default
#--------------------------------------------------------------------------
arch       a         STRING  none   ==     YES          NO          none
num_proc   p         INT     1      ==     YES          NO          0
load_avg   la        DOUBLE  99.99  >=     NO           NO          0
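Because the listing is whitespace-separated columns, it is easy to post-process. The sketch below parses the three rows shown on the slide and picks out the requestable attributes; it assumes no field contains spaces, which holds for this sample but is not guaranteed for every complex.

```python
COMPLEX_TEXT = """\
#name      shortcut  type    value  relop  requestable  consumable  default
#--------------------------------------------------------------------------
arch       a         STRING  none   ==     YES          NO          none
num_proc   p         INT     1      ==     YES          NO          0
load_avg   la        DOUBLE  99.99  >=     NO           NO          0
"""

FIELDS = ["name", "shortcut", "type", "value", "relop",
          "requestable", "consumable", "default"]

def parse_complex(text):
    """Turn the column listing into one dict per attribute, skipping # lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rows.append(dict(zip(FIELDS, line.split())))
    return rows

rows = parse_complex(COMPLEX_TEXT)
requestable = [r["name"] for r in rows if r["requestable"] == "YES"]
print(requestable)  # ['arch', 'num_proc']
```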
Modifying Attributes
qconf -mc [complex name] opens up an editor that allows you to modify the complex settings
Attributes
Note that some attributes are “requestable”. This means that you can specify that your job requires that attribute from the qsub command line.
qsub -l arch="glinux" says the job requires a "glinux" host to run
qconf -se compute-0-0 shows the resources for a host
Priorities
By default jobs are handled in a FIFO manner. As they come in they are assigned to a compatible queue for processing by the scheduler.
qsub -p assigns a priority to the job, which can override the FIFO behavior.
Use qstat to find jobs in the holding area and qdel to delete them
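The interplay of FIFO order and priorities can be pictured as a priority queue that falls back to arrival order on ties. The HoldingArea class below is a made-up model for illustration, not how SGE stores pending jobs; it assumes a larger priority number means "run sooner".

```python
import heapq
import itertools

class HoldingArea:
    """Pending jobs: highest priority first, FIFO among equal priorities."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that preserves FIFO

    def submit(self, name, priority=0):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), name))

    def next_job(self):
        """Remove and return the next job to run, or None if nothing is pending."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

area = HoldingArea()
area.submit("a")
area.submit("b")
area.submit("c", priority=5)   # arrives last but jumps the line
run_order = [area.next_job() for _ in range(3)]
print(run_order)  # ['c', 'a', 'b']
```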
Checkpointing
Sometimes on very long jobs it is worthwhile to be able to stop the job and restart it later.
What are the issues involved here?
Why use it?
Starter, suspend, resume, terminate methods
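One of the issues is that the job itself usually has to cooperate by saving and reloading its state. The sketch below shows that idea at the application level only; it says nothing about how SGE's own checkpointing methods are configured. The file name job.ckpt, the JSON state format, and the stop_after parameter (which simulates the job being stopped) are all assumptions for the example.

```python
import json
import os
import tempfile

def run(total_steps, checkpoint_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress after every step.

    Returns the total when finished, or None if stopped early.
    """
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)  # resume from the last saved step
    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            return None  # simulated stop; progress is already on disk
        state["total"] += state["step"]
        state["step"] += 1
        steps_this_run += 1
        # Write to a temp file, then rename, so a crash never leaves half a file.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, checkpoint_path)
    return state["total"]

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "job.ckpt")
    first = run(10, path, stop_after=4)   # stopped after four steps
    second = run(10, path)                # restarted, picks up at step 4
    print(first, second)  # None 45
```

Saving after every step is wasteful for real jobs; how often to checkpoint is a trade-off between lost work and I/O cost.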
Hard & Soft Requirements
A hard requirement must be present before the job is scheduled