resource management system : condor · computer science, university of warwick 1 resource...

43
1 Computer Science, University of Warwick Computer Science, University of Warwick Resource Management System : Condor Resource Management System : Condor Condor was designed to make use of the spare capabilities in a network of pc/workstations. Dual modes Non-dedicated mode (cycle stealing). dedicated mode that can manage clusters Condor uses resource ClassAd to advertise resources Jobs’ requirements are specified in job submission script Condor matchmaker matches the job requirements with resource requirements Condor Software is available for free on the web. Hundreds of Condor installations used worldwide in both academia and industry.

Upload: others

Post on 20-Mar-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

1Computer Science, University of WarwickComputer Science, University of Warwick

Resource Management System : CondorResource Management System : Condor

Condor was designed to make use of the spare capabilities in a network of pc/workstations.

Dual modes • Non-dedicated mode (cycle stealing).

• dedicated mode that can manage clusters

Condor uses resource ClassAd to advertise resources

Jobs’ requirements are specified in job submission script

Condor matchmaker matches the job requirements with resource requirements

Condor Software is available for free on the web.

Hundreds of Condor installations used worldwide in both academia and industry.

Page 2: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

2Computer Science, University of WarwickComputer Science, University of Warwick

Resource Management System : CondorResource Management System : Condor

Universes:

Condor have the notion of an “execution universe”.

Each universe is an environment that supports a particular set of functionalities.

Condor has a number of these:• Standard - more advanced environment that supports checkpointing

and process migration; use condor_compile to relink the executable

• Vanilla - very basic execution functionality designed to support plain serial applications.

• parallel - designed to run parallel jobs (e.g. MPI) on dedicated clusters

Page 3: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

3Computer Science, University of WarwickComputer Science, University of Warwick

Resource Management System : CondorResource Management System : Condor

ClassAds:ClassAds allow Condor to match job resource requests with the most appropriate resource for running the job.

Users (jobs) have constraints - Job ClassAds:

• “must run on Linux… have 1 GB of memory available… 50 GB of disk space”

Owners (machines) wish to advertise their resources but also have constraints - Machine ClassAds:

• “this cluster consists of 32 Nodes… 2GHz Opterons… reject jobs from Ligang”

Condor matchmaker reviews the Job ClassAds and the Machine ClassAds, then identifies a match that satisfies the requirements of both.

Page 4: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

4Computer Science, University of WarwickComputer Science, University of Warwick

Resource Management System : CondorResource Management System : Condor

Submitting (from Condor manual):

Write a condor job submission script to inform the system about the job:

• includes the executable, universe, input, output and error files to use, command-line arguments, requirements.

Can describe many jobs at once, each with different parameters.

# Simple condor_submit input fileUniverse = vanillaExecutable = my_jobQueue

Page 5: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

5Computer Science, University of WarwickComputer Science, University of Warwick

Resource Management System : CondorResource Management System : Condor

Condor Submission Process:

After you have submitted the job submission script to Condor, itgenerates a ClassAd that describes your job(s).

It is then up to Condor to identify the more appropriate match between the generated job ClassAd and the Machine ClassAds.

Page 6: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

6Computer Science, University of WarwickComputer Science, University of Warwick

Resource Management System : CondorResource Management System : Condor

# Example condor_submit input file

Universe = vanillaExecutable = /home/dps/condor/dan-job.condorInput = myjob.stdinOutput = myjob.stdoutError = myjob.stderrArguments = -X 10 -Y 20 -Z 30InitialDir = /home/condor/exp-1.0Requirements = Memory >= 1024 && Disk > 50000Rank = (KFLOPS*10000) + MemoryQueue

Page 7: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

7Computer Science, University of WarwickComputer Science, University of Warwick

Condor daemonsCondor daemons

condor_masterIt is responsible for keeping the rest of the Condor daemons running on each machine in a pool

The master spawns the other daemon

runs on every machine in your Condor pool

condor_startdAny machine that wants to execute jobs needs to have this daemon running

It advertises a machine ClassAd

responsible for enforcing the policy under which the jobs will be started, suspended, resumed, vacated, or killed

condor_starterSpawned by condor_startd

sets up the execution environment , create a process to run the user job, and monitors the job running

Upon job completion, sends back status information to the submitting machine, and exits

Page 8: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

8Computer Science, University of WarwickComputer Science, University of Warwick

condor_scheddAny machine that allows users to submit jobs needs to have the daemon running

Users submit jobs to the condor_schedd, where they are stored in the job queue.

condor_submit, condor_q, or condor_rm connect to the condor_schedd to view and manipulate the job queue

condor_shadowCreated by condor_schedd

runs on the machine where a job was submitted

Any system call performed on the remote execute machine is sent over the network to this daemon, and the shadow performs the system call (such as file I/O) on the submit machine and the result is sent back over the network to the remote job

Page 9: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

9Computer Science, University of WarwickComputer Science, University of Warwick

condor_collector

collecting all the information about the status of a Condor pool

All other daemons periodically updates information to the collector

These information contain all the information about the state of the daemons, the resources they represent, or resource requests of the submitted jobs

condor_status connects to this daemon for information

condor_negotiator

responsible for all the matchmaking within the Condor system

responsible for enforcing user priorities in the system

Page 10: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

10Computer Science, University of WarwickComputer Science, University of Warwick

Page 11: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

11Computer Science, University of WarwickComputer Science, University of Warwick

Page 12: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

12Computer Science, University of WarwickComputer Science, University of Warwick

Resource Management System : CondorResource Management System : Condor

Most jobs are not independent:Dependencies exists between jobs.

Second stage cannot start until first stage has completed.

Condor uses DAGMan - Directed Acyclic Graph Manager

DAGMan allows you to specify dependencies between your Condor jobs, so it can manage them automatically for you.

DAGs are the data structures used by DAGMan to represent these dependencies.

Each job is a “node” in the DAG.

Each node can have any number of “parent” or “children” nodes – as long as there are no loops.

(example from Condor tutorial).

Page 13: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

13Computer Science, University of WarwickComputer Science, University of Warwick

Resource Management System : CondorResource Management System : Condor

A DAG is defined by another text file, listing each of nodes and their dependencies:# diamond.dag

Job A a.sub

Job B b.sub

Job C c.sub

Job D d.sub

Parent A Child B C

Parent B C Child D

Each node will run the Condor job specified by an accompanying Condor submit file.

Job A

Job B Job C

Job D

Page 14: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

High Performance ComputingHigh Performance Computing Course Notes 2007Course Notes 2007--20082008

Cluster TechnologiesCluster Technologies

Page 15: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

15Computer Science, University of WarwickComputer Science, University of Warwick

21 00

2 1 00 2 1 00 2 1 002 1 00

2 1 00 2 1 00 2 1 002 1 00

Desktop(Single Processor?)

SMPs orSuperCom

puters

LocalCluster

GlobalCluster/Grid

PERFORMANCE

Computing Platforms EvolutionComputing Platforms EvolutionBreaking Administrative BarriersBreaking Administrative Barriers

Inter PlanetCluster/Grid ??

IndividualGroupDepartmentCampusSta teNationalGlobeInte r Plane tUniverse

Administrative Barriers

EnterpriseCluster/Grid

?

Page 16: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

16Computer Science, University of WarwickComputer Science, University of Warwick

Design Space of Competing Computer Design Space of Competing Computer ArchitectureArchitecture

Page 17: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

Original Food Chain Picture

Page 18: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

18Computer Science, University of WarwickComputer Science, University of Warwick

1984 Computer Food Chain

Mainframe

Vector Supercomputer

Mini ComputerWorkstation

PC

Page 19: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

19Computer Science, University of WarwickComputer Science, University of Warwick

Mainframe

Vector Supercomputer MPP

WorkstationPC

1994 Computer Food Chain

Mini Computer(hitting wall soon)

(future is bleak)

Obsolete

Page 20: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

20Computer Science, University of WarwickComputer Science, University of Warwick

Computer Food Chain (Now and Future)

Page 21: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

21Computer Science, University of WarwickComputer Science, University of Warwick

No. of Processors

C.P

.I.

1 2 . . . .

Computational Power ImprovementComputational Power Improvement

Multiprocessor

Uniprocessor

Page 22: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

22Computer Science, University of WarwickComputer Science, University of Warwick

Age

Gro

wth

5 10 15 20 25 30 35 40 45 . . . .

Human Physical Growth AnalogyHuman Physical Growth Analogy:: Computational Power ImprovementComputational Power Improvement

Vertical Horizontal

Page 23: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

23Computer Science, University of WarwickComputer Science, University of Warwick

Why Clusters now?Why Clusters now?

Clustering gained new wave of interests when 3 technologies converged:

1. Very high performance Microprocessors• workstation performance = yesterday supercomputers

2. High speed communication

3. Standard tools for parallel/ distributed programming

Page 24: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

24Computer Science, University of WarwickComputer Science, University of Warwick

Cluster Components...1Cluster Components...1 NodesNodes

Multiple High Performance Components:PCsWorkstationsSMPs (CLUMPS)Distributed HPC Systems

They can be based on different architectures and running difference OS

But usually, the nodes in a cluster are homogenous - they have same architecture and performance and are installed with the same os

Page 25: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

25Computer Science, University of WarwickComputer Science, University of Warwick

Cluster ComponentsCluster Components……22 OSOS

State of the art OS:Linux (Beowulf)

Microsoft NT (Illinois HPVM)

SUN Solaris (Berkeley NOW)

IBM AIX (IBM SP2)

HP UX (Illinois - PANDA)

Mach (Microkernel based OS) (CMU)

Page 26: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

26Computer Science, University of WarwickComputer Science, University of Warwick

Cluster ComponentsCluster Components……33 High Performance NetworksHigh Performance Networks

Ethernet (10Mbps), Fast Ethernet (100Mbps), Gigabit Ethernet (1Gbps)SCIMyrinetInfinibandFDDI

Page 27: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

27Computer Science, University of WarwickComputer Science, University of Warwick

Cluster ComponentsCluster Components……44 Network InterfacesNetwork Interfaces

Network Interface CardMyrinet NIC

Ethernet NIC

Page 28: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

28Computer Science, University of WarwickComputer Science, University of Warwick

Cluster ComponentsCluster Components……5 Communication 5 Communication SoftwareSoftware

Traditional OS supported facilities (heavy weight due to protocol processing)..

Sockets (TCP/IP), Pipes, etc.

Light weight protocols (User Level)Active Messages (Berkeley)

Fast Messages (Illinois)

U-net (Cornell)

XTP (Virginia)

System systems can be built on top of the above protocols

Page 29: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

29Computer Science, University of WarwickComputer Science, University of Warwick

Cluster ComponentsCluster Components……66 Cluster MiddlewareCluster Middleware

Resides Between OS and Applications and support:

Single System Image (SSI)

System Availability (SA)

SSI makes collection appear as single machine (globalised view of system resources). ssh cluster.myinstitute.edu

SA - Check pointing and process migration..

Page 30: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

30Computer Science, University of WarwickComputer Science, University of Warwick

Cluster ComponentsCluster Components……77 Programming environmentsProgramming environments

Threads (PCs, SMPs, NOW..) POSIX Threads

Java Threads

MPILinux, NT, on many Supercomputers

PVM

DSMs

Page 31: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

31Computer Science, University of WarwickComputer Science, University of Warwick

Cluster ComponentsCluster Components……88 Development Tools Development Tools

CompilersC/C++/Java/ ;

Parallel programming with C++ (MIT Press book)

Debuggers

Performance Analysis Tools

Visualization Tools

Page 32: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

32Computer Science, University of WarwickComputer Science, University of Warwick

Cluster ComponentsCluster Components……99 ApplicationsApplications

Sequential

Parallel / Distributed

Grand Challenging applications• Weather Forecasting

• Computational Fluid Dynamics

• Molecular Biology Modeling

• Engineering Analysis (CAD/CAM)

• ……………….

Web servers,data-mining

Page 33: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

33Computer Science, University of WarwickComputer Science, University of Warwick

Cluster Architectures Cluster Architectures

Cluster system architecture

Page 34: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

34Computer Science, University of WarwickComputer Science, University of Warwick

Clusters Classification..1Clusters Classification..1

Based on Focus (in Market)High Performance (HP) Clusters

• Grand Challenging Applications

High Availability (HA) Clusters• Mission Critical applications

Page 35: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

35Computer Science, University of WarwickComputer Science, University of Warwick

Clusters Classification..2Clusters Classification..2

Based on Workstation/PC OwnershipDedicated Clusters

Non-dedicated clusters

Page 36: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

36Computer Science, University of WarwickComputer Science, University of Warwick

Clusters Classification..3Clusters Classification..3

Based on Node Architecture..Clusters of PCs (CoPs)

Clusters of Workstations (COWs)

Clusters of SMPs (CLUMPs)

Page 37: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

37Computer Science, University of WarwickComputer Science, University of Warwick

Clusters Classification..4Clusters Classification..4

Based on Node OS Type..Linux Clusters (Beowulf)

Solaris Clusters (Berkeley NOW)

NT Clusters (HPVM)

AIX Clusters (IBM SP2)

SCO/Compaq Clusters (Unixware)

HP clusters, ………………..

Page 38: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

38Computer Science, University of WarwickComputer Science, University of Warwick

Clusters Classification..5Clusters Classification..5

Based on node components architecture & configuration (Processor Arch, Node Type: PC/Workstation.. & OS: Linux/NT..):

Homogeneous Clusters• All nodes will have similar configuration

Heterogeneous Clusters• Nodes based on different processors and running

different OSes.

Page 39: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

39Computer Science, University of WarwickComputer Science, University of Warwick

Clusters Classification..6Clusters Classification..6 Levels of ClusteringLevels of Clustering

Group Clusters (#nodes: 2-99) (a set of dedicated/non-dedicated computers - mainly connected by SAN like Myrinet)

Departmental Clusters (#nodes: 99-999)Organizational Clusters (#nodes: many 100s)Internet-wide Clusters=Global Clusters:(#nodes: 1000s to many millions)

Page 40: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

40Computer Science, University of WarwickComputer Science, University of Warwick

What is Single System Image (SSI) ?What is Single System Image (SSI) ?

A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.

SSI makes the cluster appear like a single machine to the user, to applications, and to the network.

A cluster without a SSI is not a cluster

Page 41: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

41Computer Science, University of WarwickComputer Science, University of Warwick

A typical Cluster Computing EnvironmentA typical Cluster Computing Environment

PVM / MPI/ SSH

Application

Hardware/OS

???

Page 42: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

42Computer Science, University of WarwickComputer Science, University of Warwick

The missing link is provide by cluster The missing link is provide by cluster middleware/middleware/underwareunderware

PVM / MPI/ SSH

Application

Hardware/OS

Middleware

Single Entry Pointssh cluster.my_institute.edussh node1.cluster. institute.edu

Page 43: Resource Management System : Condor · Computer Science, University of Warwick 1 Resource Management System : Condor Condor was designed to make use of the spare capabilities in a

43Computer Science, University of WarwickComputer Science, University of Warwick

Benefits of Single System ImageBenefits of Single System Image

Usage of system resources transparentlyTransparent process migration and load balancing across nodes.Improved reliability and higher availabilityImproved system response time and performanceSimplified system managementReduction in the risk of operator errorsUser need not be aware of the underlying system architecture to use these machines effectively