
Page 1: Cluster computing, I Batch-queueing systems

GC3: Grid Computing Competence Center

Cluster computing, I
Batch-queueing systems

Riccardo Murri, Sergio Maffioletti
Grid Computing Competence Center,
Organisch-Chemisches Institut,
University of Zurich

Oct. 23, 2012

Page 2: Cluster computing, I Batch-queueing systems

Today’s topic

$\overbrace{\text{Batch job processing}}^{\text{purpose}}\;\underbrace{\text{clusters}}_{\text{HW architecture}}$


Page 3: Cluster computing, I Batch-queueing systems

What is a cluster? I

[Diagram: users on the internet log in with ssh to the cluster front-end (frontend.node.uzh.ch), which is connected through the local network fabric to the compute nodes compute-0-0.local, compute-0-1.local, ..., compute-0-27.local.]

A cluster is a group of computers with a direct network interconnect, centralized management, and distributed execution facilities.


Page 4: Cluster computing, I Batch-queueing systems

What is a cluster? II

Centralized:
– Authorization and authentication
– Shared filesystem
– Application execution and management

Distributed:
– Execution of jobs
– Multiple units of the same parallel job may reside on separate resources


Page 5: Cluster computing, I Batch-queueing systems

What is an HPC cluster?

A cluster is a group of computers with a direct network interconnect, centralized management, and distributed execution facilities.

An HPC cluster is a cluster with a fast local network interconnect, specialized for the execution of parallel distributed-memory programs.

A supercomputer is (currently) a very large HPC cluster with a very fast local network interconnect.


Page 6: Cluster computing, I Batch-queueing systems

What’s batch job processing?

Asynchronous execution of shell commands.

Wikipedia: Asynchronous actions are actions executed in a non-blocking scheme, allowing the main program flow to continue processing.
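In its simplest form, asynchronous execution is available in any POSIX shell: start the command in the background and the prompt returns immediately. A minimal sketch (the program name is made up):

    nohup ./long_computation > output.log 2>&1 &   # run detached from the terminal
    echo "started background job with PID $!"      # $! holds the PID of the last background job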


Page 7: Cluster computing, I Batch-queueing systems

Lifecycle of a batch job

1. A command to run is submitted to the batch processing system

2. The batch job scheduler selects appropriate resources to run the job

3. The resource manager executes the job

4. Users monitor the job execution state


Page 8: Cluster computing, I Batch-queueing systems

Functional components of a batch job system

Resource Manager
Monitors the compute infrastructure, launches and supervises jobs, and cleans up after termination.

Job manager / scheduler
Allocates resources and time slots (scheduling).

Workload Manager
Policy and orchestration at the “job collection” level: fair share, workflow orchestration, QoS, SLA, etc.

Reference: O. Richard, “Batch Scheduler and File Management”, the Third Workshop of the INRIA-Illinois Joint Laboratory on Petascale Computing, June 21-24, 2010, Bordeaux, France.

Page 9: Cluster computing, I Batch-queueing systems

Architecture of a batch job system

[Diagram: architecture of a batch job system. A client on the front-end submits a job (1) to the server process on the master node, which combines the scheduler and the resource manager. The scheduler allocates resources (2); the resource manager starts the job on a compute node (3: job launch & execution) and monitors its execution (4). Monitor processes on each compute node (compute-0-0.local, compute-0-1.local, ..., compute-0-27.local) report machine status back to the server.]


Page 10: Cluster computing, I Batch-queueing systems

Grid Engine

Sun Grid Engine (GE) is a batch-queueing system produced by Sun Microsystems; it was made open-source in 2001.

After the acquisition by Oracle, the product forked:

– Open Grid Scheduler (OGS) and Son of Grid Engine (SGE), independent open-source versions.

– Oracle Grid Engine, commercial and focused on enterprise technical computing.

– Univa Grid Engine, a commercial-only version developed by the core SGE engineering team from Sun.

Used on the UZH main HPC cluster “Schroedinger”.


Page 11: Cluster computing, I Batch-queueing systems

GE architecture, I

sge_qmaster

– Runs on the master node

– Accepts client requests (job submission, job/host state inspection)

– Schedules jobs on compute nodes (formerly a separate sge_schedd process)

Client programs: qhost, qsub, qstat

– Run by the user on a submit node

– Clients for sge_qmaster

– The master daemon has a list of authorized submit nodes
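The client programs map directly onto the steps of the job lifecycle; a few typical invocations (the script name and job ID are made up):

    qhost              # list execution hosts with their load and memory status
    qsub test.sh       # submit a job script to sge_qmaster
    qstat              # list your pending and running jobs
    qstat -j 534       # show detailed state of one job, by job ID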


Page 12: Cluster computing, I Batch-queueing systems

GE architecture, II

sge_execd

– Runs on every compute node

– Accepts job start requests from sge_qmaster

– Monitors node status (load average, free memory, etc.) and reports back to sge_qmaster

sge_shepherd

– Spawned by sge_execd when starting a job

– Monitors the execution of a single job


Page 13: Cluster computing, I Batch-queueing systems

GE architecture, III

[Diagram: the generic architecture mapped onto Grid Engine. qsub and qstat are the clients on the front-end; ge_master on the master node combines scheduler and resource manager (1. submit job, 2. allocate resources, 3. start job, 4. monitor execution); a ge_execd on each compute node (compute-0-0.local, compute-0-1.local, ..., compute-0-27.local) performs machine status monitoring and spawns a ge_shepherd for each running job.]


Page 14: Cluster computing, I Batch-queueing systems

Lifecycle of a Job: user perspective

1. Prepare a job script (normally a shell script)

2. Define resource requirements

3. Submit the job and record its job ID

4. Monitor the status of the job (using the job ID)

5. When done, inspect the results

6. Otherwise, check the logs


Page 15: Cluster computing, I Batch-queueing systems

Prepare job script

    #!/bin/bash

    MZXMLSEARCH="./MzXML2Search"

    ${MZXMLSEARCH} -dta ${MZXML_NAME}.mzXML
    if [ $? -ne 0 ]; then
        echo "[FATAL] ${MZXMLSEARCH} failed"
        exit 1
    fi
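The script reads MZXML_NAME from the environment; one way to set it at submission time (the input name sample01 is hypothetical) is qsub's -v option, which exports a variable into the job's environment:

    qsub -v MZXML_NAME=sample01 test.sh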


Page 16: Cluster computing, I Batch-queueing systems

Submit job and monitor using jobID

    # qsub test.sh
    534.localhost

    # qstat 534
    Job id          Name     S Queue
    --------------- -------- - -------
    534.localhost   test.sh  R default
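Once the job has finished and left the queue, qstat no longer shows it; Grid Engine's accounting tool can be queried instead:

    qacct -j 534       # reports e.g. exit status, CPU time, and memory usage of a finished job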


Page 17: Cluster computing, I Batch-queueing systems

Lifecycle of a Job: system perspective

1. The job is submitted from a DRM client
2. The Resource Manager stores the job in a queue
   – The queue is selected by inspecting DRM policies and the job's description
3. The Scheduler starts a scheduling cycle:
   – Collects resource information from exec hosts
   – Inspects jobs in queues
   – Applies scheduling policies to sort jobs in queues
   – Sends run requests to the Resource Manager
4. The Resource Manager sends the job to an exec host to run
5. The exec host receives the payload and runs it
   – The job is executed using the user's credentials
   – It periodically reports resource utilization to the Resource Manager
   – When the job finishes, it reports to the Resource Manager
6. The Resource Manager updates the job's state

Page 18: Cluster computing, I Batch-queueing systems

Job lifecycle


Page 19: Cluster computing, I Batch-queueing systems

Implementation issues

I/O
How to provide input data to the job and collect output data from it?

Scheduling
When should the job start?

Resource allocation
On what computer(s) should it run? How to cope with heterogeneous resource pools?

Job monitoring and accounting
What usage records should be collected and stored?


Page 20: Cluster computing, I Batch-queueing systems

I/O management in HPC clusters

Two main ways:

1. Shared file system

2. Data staging

Reference: O. Richard, “Batch Scheduler and File Management”, the Third Workshop of the INRIA-Illinois Joint Laboratory on Petascale Computing, June 21-24, 2010, Bordeaux, France.


Page 21: Cluster computing, I Batch-queueing systems

Shared file systems

Used on most cluster systems.

A parallel filesystem (e.g., Lustre, GPFS, PVFS, NFSv4.1, ...) is used for performance and scalability.

Often there are separate filesystems based on features:

– a filesystem for persistent / longer-term data (e.g., /home)

– another one for ephemeral I/O (deleted after the job has finished running)

– the responsibility is on the user to move data into the appropriate filesystem

Easy to use: no difference from the local I/O model.
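A job script typically copies its inputs from the persistent filesystem to the fast ephemeral one, works there, and copies the results back. A minimal sketch, assuming Grid Engine's per-job $TMPDIR points at the scratch filesystem and reusing the file names from the earlier example (the input/result paths are hypothetical):

    #!/bin/bash
    cp "$HOME/input/sample01.mzXML" "$TMPDIR/"    # persistent -> ephemeral
    cd "$TMPDIR"
    "$HOME/bin/MzXML2Search" -dta sample01.mzXML  # do the heavy I/O on the fast filesystem
    cp sample01.dta "$HOME/results/"              # save results back before the job ends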


Page 22: Cluster computing, I Batch-queueing systems

Data staging

Job data requirements are identified and provided by the user in the submitted script.

Stage-in
Input files are transferred to the local disk of compute nodes before the job starts.

Stage-out
Output files are transferred from the nodes to mass storage after execution.

Nowadays rarely used on clusters; mainly used in Grid contexts.
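Where the batch system offers no staging support, the same effect can be hand-rolled in the job script; a sketch (the storage host and path names are hypothetical):

    #!/bin/bash
    scp storage.example.org:/data/in/sample01.mzXML .    # stage-in
    "$HOME/bin/MzXML2Search" -dta sample01.mzXML
    scp sample01.dta storage.example.org:/data/out/      # stage-out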


Page 23: Cluster computing, I Batch-queueing systems

Scheduling

Long-term scheduling:
– jobs may last hours, days, even months!

HPC job scheduling is usually non-preemptive:
– compute resources are fully utilized; there is little room for sharing.

Common scheduling algorithms are usually variations of FCFS or priority-based scheduling.


Page 24: Cluster computing, I Batch-queueing systems

Scheduling: terminology

Turnaround time
The total time elapsed from the moment a job is submitted to the moment it terminates.

Wait time
The time elapsed from submission until a job actually starts running.

Wall-time
The time elapsed from job start to end. (Abbreviation of wall-clock time.)

CPU time
The total time spent by CPU(s) executing a job program's code.
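These quantities are related; for a job $j$ running on $p$ cores:

$t_{\text{turnaround}}(j) = t_{\text{wait}}(j) + t_{\text{wall}}(j), \qquad t_{\text{CPU}}(j) \le p \cdot t_{\text{wall}}(j)$

For example, a job submitted at 10:00 that starts at 12:00 and ends at 15:00 has a wait time of 2 h, a wall-time of 3 h, and a turnaround time of 5 h.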


Page 25: Cluster computing, I Batch-queueing systems

Scheduling: FCFS, I

First come, first served.

Job requests are kept in a queue.

New job requests (submissions) are appended to the back of the queue.

Each time a suitable execution slot is freed, the job at the front of the queue is run.
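The dispatch loop itself is trivial; a toy sketch in shell (one execution slot; jobs are scripts dropped into a hypothetical ./queue directory, oldest first):

    #!/bin/bash
    QUEUE_DIR=./queue
    while true; do
        # the oldest file is the front of the FCFS queue
        job=$(ls -1tr "$QUEUE_DIR" 2>/dev/null | head -n 1)
        if [ -z "$job" ]; then sleep 5; continue; fi
        bash "$QUEUE_DIR/$job"       # run the job to completion (non-preemptive)
        rm -f "$QUEUE_DIR/$job"      # then remove it from the queue
    done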


Page 26: Cluster computing, I Batch-queueing systems

Scheduling: FCFS, II

Issues with bare FCFS:

1. The average waiting time might be long:
   – e.g., a user submits a large number of very long jobs; other users have to wait a long time before their shorter jobs run.
   – Solutions: separate queues, backfill, priority-based scheduling.

2. When there are parallel jobs spanning multiple execution units, the scheduler has to keep some nodes idle to allocate enough resources.
   – Solution: backfill.


Page 27: Cluster computing, I Batch-queueing systems

Scheduling: separate queues

Create separate job queues.
– The submission queue may be explicitly chosen by the user, or selected by the scheduler based on job characteristics.

Each queue is associated with a different set of execution nodes.

Each queue has different run features,
– e.g., a different maximum run time.
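With Grid Engine, for example, a queue can be requested at submission time (the queue names here are hypothetical, since they are configured by the site administrator):

    qsub -q short.q quick_test.sh     # queue with a short run-time limit
    qsub -q long.q  week_long_sim.sh  # queue allowing long-running jobs
    qconf -sql                        # list the queues configured on this cluster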


Page 28: Cluster computing, I Batch-queueing systems

Scheduling: backfill

Jobs jump ahead in the queue and are executed on “reserved” nodes if they will be finished by the time the job holding the reservation is scheduled to start.

Requires job durations to be known in advance!
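Backfill can only consider jobs whose duration has been declared, e.g. via a run-time resource request at submission (Grid Engine syntax; the value is in seconds):

    qsub -l s_rt=3600 short_job.sh    # "this job needs at most one hour"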

[Figure: backfill illustration; image source: http://people.ee.ethz.ch/~ballisti/computer_topics/lsf/admin/04-tunin.htm]

Page 29: Cluster computing, I Batch-queueing systems

Scheduling: SJF, I

Shortest job first

The job queue is sorted according to duration: the shortest jobs are moved to the front.

Requires job duration to be known in advance!


Page 30: Cluster computing, I Batch-queueing systems

Scheduling: SJF, II

If all jobs are known in advance, SJF can be proved to deliver the optimal average wait time.

Otherwise, it may delay long jobs indefinitely:

– At 10 am, job X with an expected runtime of 4 hours is submitted; it has to wait 2 hours in the queue.

– At 11 am, 10 jobs of 2 hours runtime are submitted; they jump ahead in the queue and delay job X by 20 hours.

– At 12 noon, 5 more jobs of 1 hour runtime are submitted; they delay job X by another 5 hours.

Solution: add a “deadline” factor: take into account the time a job has already spent waiting in the queue.
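One common form of such an aging correction (the weight $\alpha$ is a tunable parameter, not from the slides) is to rank job $j$ by

$\text{rank}(j) = d_j - \alpha \cdot w_j$

where $d_j$ is the declared duration and $w_j$ the time already spent waiting; jobs with the smallest rank run first, so a job's effective priority improves the longer it waits.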


Page 31: Cluster computing, I Batch-queueing systems

Priority-based scheduling

Sort job queues according to some priority function.

The “priority” function is usually a weighted sum of various contributions, e.g.:

– Requested run time

– Number of processors

– Wait time in queue

– Recent usage by the same user/group/department (fair share)

– Administrator-set QoS
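Schematically (illustrative only; the weights $w_\ast$ and the exact set of components are scheduler- and site-specific):

$\text{priority}(j) = w_{\text{rt}}\, t_{\text{req}}(j) + w_{\text{proc}}\, n_{\text{proc}}(j) + w_{\text{wait}}\, t_{\text{wait}}(j) + w_{\text{fs}}\, \text{fairshare}(u_j) + w_{\text{qos}}\, \text{QoS}(j)$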

Reference: http://www.adaptivecomputing.com/resources/docs/maui/5.1jobprioritization.php


Page 32: Cluster computing, I Batch-queueing systems

Fair-share scheduling

Fair-share prioritization assigns higher priorities to users/groups/etc. that have not used all of their resource quota (usually expressed in CPU time).

Important parameters in defining a fair-share policy:

– window length: how much historical information is kept and used for calculating resource usage

– interval: how often resource utilization is computed

– decay: weights applied to resource usage in the past (e.g., 2 hours of CPU time one week ago might weigh less than 2 hours of CPU time today)
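Putting the three parameters together, the effective usage of user $u$ over a window of $W$ intervals with decay factor $\delta \in (0, 1]$ can be written as (an illustrative formula, not any specific scheduler's exact one):

$U_{\text{eff}}(u) = \sum_{k=0}^{W-1} \delta^{k}\, U_k(u)$

where $U_k(u)$ is the usage recorded $k$ intervals ago; with $\delta = 0.5$, usage from one interval back counts half as much as current usage.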

Reference: http://www.adaptivecomputing.com/resources/docs/maui/6.3fairshare.php


Page 33: Cluster computing, I Batch-queueing systems

Resource allocation, I

Resource allocation is the act of selecting execution units out of the available pool for running a job.

Over time, clusters tend to grow inhomogeneously: new nodes are added that are different from the older ones.

Jobs differ in their computational and hardware requirements, e.g.:

– short jobs vs. long-running jobs

– large memory, hence fewer jobs fit in a single multi-core node

– I/O bound, hence a fast filesystem is needed


Page 34: Cluster computing, I Batch-queueing systems

Resource allocation, II

General resource allocation algorithm (match-making):

1. The user specifies resource requirements during job submission.

2. Filtering: the scheduler filters resources based on the evaluation of a boolean formula
   – usually, a logical AND of the resource requirements.

3. Ranking: matching resources are sorted, and the first-ranking one gets the job.

Normally the filtering and ranking functions are fixed, or can only be modified by the cluster admin.

A notable exception is the Condor batch system, which allows users to specify arbitrary filtering and ranking functions.


Page 35: Cluster computing, I Batch-queueing systems

Example: resource requirements in SGE

Grid Engine allows specifying resource requirementswithin a job script.

#$/bin/bash

#$ -q all.q # queue name#$ -l s_vmem=300M # memory#$ -l s_rt=60 # walltime#$ -l gpu=1 # require 1 GPGPU#$ -pe mpich 32 # CPU cores

MZXMLSEARCH="./MzXML2Search"...

(Note that you write s rt=60 but the systemunderstands s rt >= 60 for the purpose of filtering.)
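The same requirements need not live in the script; they can equally be passed on the qsub command line, where they take precedence over the embedded #$ directives:

    qsub -q all.q -l s_vmem=300M -l s_rt=60 -pe mpich 32 test.sh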


Page 36: Cluster computing, I Batch-queueing systems

Condor

[Diagram: a Condor pool built from three clusters. Each cluster (a batch system server plus compute nodes 1 ... N on a local 1 Gb/s ethernet network) runs a condor_resource; on the user side, condor_submit talks to a condor_agent; a central condor_master matches agents with resources.]


Page 37: Cluster computing, I Batch-queueing systems

Condor overview

Agents (client-side software) and Resources (cluster-side software) advertise their requests and capabilities to the Condor Master.

The Master performs match-making between the Agents' requests and the Resources' offerings.

An Agent sends its computational job directly to the matching Resource.

Reference: Thain, D., Tannenbaum, T. and Livny, M. (2005), “Distributed computing in practice: the Condor experience”, Concurrency and Computation: Practice and Experience, 17:323–356.


Page 38: Cluster computing, I Batch-queueing systems

What is matchmaking?


Page 39: Cluster computing, I Batch-queueing systems

Matchmaking, I

Same idea in Condor, except the schema is not fixed.

Agents and Resources report their requests and offers using the “ClassAd” format (an enriched key=value format).

There is no prescribed schema, hence a Resource is free to advertise any “interesting feature” it has, and to represent it in any way that fits the key=value model.


Page 40: Cluster computing, I Batch-queueing systems

Matchmaking, II

1. Agents specify a Requirements constraint: a boolean expression that can use any value from the Agent's own ClassAd (self) or the Resource's (other).

2a. Resources whose offered ClassAd does not satisfy the Requirements constraint are discarded.

2b. Conversely, if the Agent's ClassAd does not satisfy a Resource's Requirements, that Resource is discarded.

3. Surviving Resources are sorted according to the value of the Rank expression in the Agent's ClassAd, and their list is returned to the Agent.


Page 41: Cluster computing, I Batch-queueing systems

Example: Job ClassAd

Select 64-bit Linux hosts, and sort them preferring hosts with larger memory and CPU speed.

    Requirements = Arch == "x86_64" && OpSys == "LINUX"
    Rank = TARGET.Memory + TARGET.Mips

Reference: http://research.cs.wisc.edu/condor/manual/v6.4/4_1Condor_s_ClassAd.html


Page 42: Cluster computing, I Batch-queueing systems

Example: Resource ClassAd

A complex access policy, giving priority to users from the owner's research group, then other “friend” users, and then the rest...

    Friend        = Owner == "tannenba"
    ResearchGroup = (Owner == "jbasney" || Owner == "raman")
    Trusted       = Owner != "rival"
    Requirements  = Trusted && ( ResearchGroup
                                 || LoadAvg < 0.3 && KeyboardIdle > 15*60 )
    Rank          = Friend + ResearchGroup*10

Resource ClassAds specify an access/usage policy for the resource.


Page 43: Cluster computing, I Batch-queueing systems

Resource allocation, III

Problem: how do you submit a job that requires 200 GB of local scratch space? Or 16 cores in a single node?


Page 44: Cluster computing, I Batch-queueing systems

Resource allocation, IV

The names and types of resource requirementsvary from cluster to cluster

– Defaults change with batch system softwarerelease

– Custom requirements depend on local systemadministrator

Job management software must adapted to thelocal cluster

– When you get access to a new cluster, you mustrewrite a large portion of your submission scripts.

– Applies to Condor as well: since ClassAds arefree-form, defining what attributes can be usedand relied upon is an organizational problem.
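In practice this is often papered over with a thin wrapper that picks the site-specific flags; a minimal sketch (all hostname patterns and resource flags below are hypothetical):

    #!/bin/bash
    case "$(hostname -f)" in
        *.schroedinger.uzh.ch)
            SUBMIT="qsub -l s_vmem=4G -pe mpich 8" ;;      # GE-style requests
        *.other-cluster.example.org)
            SUBMIT="qsub -l mem=4gb -l nodes=1:ppn=8" ;;   # PBS-style requests
        *)
            echo "unknown cluster" >&2; exit 1 ;;
    esac
    $SUBMIT job.sh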


Page 45: Cluster computing, I Batch-queueing systems

All these job management systems are based on a push model (you send the job to an execution cluster).

Is there, conversely, a pull model?
