First steps implementing a High Throughput workload management system
Massimo Sgaravatto, INFN Padova
[email protected]


Page 1: (title slide)

Page 2:

High throughput workload management system architecture (simplified design)

[Diagram: a Master submits jobs (using Class-Ads); resource discovery relies on the GIS, which holds information on the characteristics and status of local resources. Jobs pass through Condor-G (condor_submit, Globus Universe) and globusrun to the GRAM interfaces of Site 1 (Condor), Site 2 (LSF), and Site 3 (PBS).]

Page 3:

Overview

- PC farms in different sites, managed by possibly different local resource management systems
- GRAM as uniform interface to these different local resource management systems
- Condor-G able to provide robustness and reliability
- Master smart enough to decide in which Globus resources the jobs must be submitted
- The Master uses the information on characteristics and status of resources published in the GIS

Page 4:

First step: evaluation of GRAM

[Diagram: jobs are submitted (using Globus tools) to the GRAM interfaces of Site 1 (Condor), Site 2 (LSF), and Site 3 (PBS); the GIS publishes information on the characteristics and status of local resources.]

Page 5:

Evaluation of GRAM Service

- Job submission tests using Globus tools (globusrun, globus-job-run, globus-job-submit)
- GRAM as uniform interface to different underlying resource management systems
- "Cooperation" between GRAM and GIS
- Evaluation of RSL as a uniform language to specify resources
- Tests performed with Globus 1.1.2 and 1.1.3 on Linux machines

Page 6:

GRAM & fork system call

[Diagram: a Globus client submits to a Globus server, which runs the job locally via the fork system call.]

Page 7:

GRAM & Condor

[Diagram: a Globus client submits to a Globus server on the Condor front-end machine, which forwards jobs to the Condor pool.]

Page 8:

GRAM & Condor

Tests considering:

- Standard Condor jobs (relinked with the Condor library)
  - INFN WAN Condor pool configured as a Globus resource
  - ~200 machines spread across different sites
  - Heterogeneous environment
  - No single file system or UID domain
- Vanilla jobs ("normal" jobs)
  - PC farm configured as a Globus resource
  - Single file system and UID domain

Page 9:

GRAM & LSF

[Diagram: a Globus client submits to a Globus server on the LSF front-end machine, which forwards jobs to the LSF cluster.]

Page 10:

GRAM & PBS (by F. Giacomini, INFN-CNAF)

[Diagram: a Globus client submits to a Globus server running PBS on a Linux server with 4 processors.]

Page 11:

Results

- Some bugs found and fixed:
  - Standard output and error for vanilla Condor jobs
  - globus-job-status
  - …
- Some bugs seem solvable without major re-design and/or re-implementation:
  - For LSF the RSL parameter (count=x) is translated into: bsub -n x …
    - This just allocates x processors and dispatches the job to the first one (intended for parallel applications)
    - It should instead be: bsub … repeated x times
- Two major problems: scalability and fault tolerance
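The (count=x) translation issue above can be sketched as follows. This is an illustration only, not Globus code: the two functions (hypothetical names) merely build the bsub command lines that each translation strategy would issue.

```python
# Sketch of the LSF (count=x) translation problem described above.
# Neither function runs anything; they just build command strings.

def submit_cmds_current(executable: str, count: int) -> list[str]:
    """Current GRAM behaviour: one bsub reserving 'count' slots.
    LSF allocates the slots but dispatches the job only to the first
    one -- appropriate for parallel jobs, not 'count' independent jobs."""
    return [f"bsub -n {count} {executable}"]

def submit_cmds_expected(executable: str, count: int) -> list[str]:
    """Expected behaviour for high-throughput use:
    'count' independent single-slot submissions."""
    return [f"bsub {executable}" for _ in range(count)]

print(submit_cmds_current("/diskCms/startcmsim.sh", 3))
print(submit_cmds_expected("/diskCms/startcmsim.sh", 3))
```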

Page 12:

Globus GRAM Architecture

[Diagram: a client on pc1 runs globusrun against the Globus front-end machine pc2, where a jobmanager hands the job to LSF/Condor/PBS/…]

pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz -f file.rsl

file.rsl:
&(executable=/diskCms/startcmsim.sh)(stdin=/diskCms/PythiaOut/filename)(stdout=/diskCms/Cmsim/filename)(count=1)
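An RSL string of the "&(attr=value)(attr=value)…" form shown in file.rsl can be assembled mechanically. A minimal sketch (the helper name is ours; full RSL quoting and nesting rules are omitted):

```python
# Minimal sketch: build a Globus RSL relation string of the
# "&(attr=value)(attr=value)" form used in file.rsl above.

def make_rsl(**attrs) -> str:
    """Join attribute/value pairs into a conjunctive RSL string."""
    return "&" + "".join(f"({k}={v})" for k, v in attrs.items())

rsl = make_rsl(
    executable="/diskCms/startcmsim.sh",
    stdin="/diskCms/PythiaOut/filename",
    stdout="/diskCms/Cmsim/filename",
    count=1,
)
print(rsl)
```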

Page 13:

Scalability

- One jobmanager for each globusrun
- If I want to submit 1000 jobs ???
  - 1000 globusrun → 1000 jobmanagers running on the front-end machine !!!

%globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl

file.rsl:
&(executable=/diskCms/startcmsim.sh)(stdin=/diskCms/PythiaOut/filename)(stdout=/diskCms/CmsimOut/filename)(count=1000)

- It is not possible to specify in the RSL file 1000 different input files and 1000 different output files (in Condor this is done with $(Process))
- Problems with job monitoring (globus-job-status)
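The Condor $(Process) macro mentioned above is what RSL lacks here: one template expands into per-job file names. A minimal sketch of that expansion (the function and file names are illustrative, not Condor internals):

```python
# Sketch of Condor-style $(Process) macro expansion: a single
# template yields distinct per-job input/output names, which a
# single RSL (count=1000) request cannot express.

def expand(template: str, process: int) -> str:
    """Substitute the job index for the $(Process) macro."""
    return template.replace("$(Process)", str(process))

jobs = [
    (expand("/diskCms/PythiaOut/in.$(Process)", p),
     expand("/diskCms/CmsimOut/out.$(Process)", p))
    for p in range(3)
]
for stdin, stdout in jobs:
    print(stdin, stdout)
```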

Page 14:

Fault tolerance

- The jobmanager is not persistent
- If the jobmanager can't be contacted, Globus assumes that the job(s) have been completed

Example of the problem:
- Submission of n jobs on a cluster managed by a local resource management system
- Reboot of the front-end machine
- The jobmanager(s) don't restart
- Orphan jobs: Globus assumes that the jobs have been successfully completed
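A re-designed jobmanager would need some form of persistent job state to avoid the orphan-job scenario above. A minimal sketch of the idea, under our own assumptions (a simple on-disk JSON table; this is not Globus code):

```python
# Sketch of jobmanager persistence (not Globus code): record each
# submitted job's local id on disk, so that after a front-end
# reboot a restarted jobmanager can re-attach to still-running
# jobs instead of assuming they completed.

import json
import os
import tempfile

class PersistentJobTable:
    def __init__(self, path: str):
        self.path = path
        self.jobs: dict[str, str] = {}
        if os.path.exists(path):
            with open(path) as f:
                self.jobs = json.load(f)   # recover state after restart

    def record(self, globus_id: str, local_id: str) -> None:
        self.jobs[globus_id] = local_id
        with open(self.path, "w") as f:
            json.dump(self.jobs, f)        # survive a crash or reboot

# usage: record before the reboot, recover afterwards
path = os.path.join(tempfile.mkdtemp(), "jobs.json")
t1 = PersistentJobTable(path)
t1.record("globus-job-0001", "lsf:789")
t2 = PersistentJobTable(path)              # simulated restart
print(t2.jobs)
```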

Page 15:

GRAM & GIS

- How do the local GRAMs provide the GIS with the characteristics and status of local resources?
- The Master will need this (and other) information
- Tests performed considering: Condor pool, LSF cluster

Page 16:

GRAM & Condor & GIS

Page 17:

GRAM & LSF & GIS

Page 18:

Jobs & GIS

Info on Globus jobs published in the GIS:
- User: subject of certificate, local user name
- RSL string
- Globus job id
- LSF/Condor/… job id
- Status: Run/Pending/…
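For concreteness, one such job entry could carry the attributes listed above. All values below are invented for illustration; the attribute names are ours, not the GIS schema:

```python
# Hypothetical example of one Globus job entry as published in the
# GIS, using the attributes listed above (names and values are
# illustrative only).

job_entry = {
    "user_subject": "/O=Grid/O=INFN/CN=Some User",   # certificate subject
    "local_user": "someuser",                         # local user name
    "rsl": "&(executable=/diskCms/startcmsim.sh)(count=1)",
    "globus_job_id": "globus-job-0001",
    "local_job_id": "lsf:4321",                       # LSF/Condor/... id
    "status": "Run",                                  # Run/Pending/...
}
print(job_entry["status"])
```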

Page 19:

GRAM & GIS

- The information on characteristics and status of local resources and on jobs is not enough
- As local resources we must consider farms, not single workstations
- Other information (e.g. total and available CPU power) is needed
- Fortunately the default schema can be extended with other info provided by specific agents
- We need to identify which other info is necessary
  - This will become much clearer during Master design

Page 20:

RSL

- We need a uniform language to specify resources across different resource management systems
- The RSL syntax model seems suitable to define even complicated resource specification expressions
- The common set of RSL attributes is often not sufficient
- Attributes not belonging to the common set are ignored

Page 21:

RSL

- More flexibility is required
- Resource administrators should be allowed to define new attributes, and users should be allowed to use them in resource specification expressions (Condor Class-Ads model)
- Using the same language to describe the offered resources and the requested resources (Condor Class-Ads model) seems a better approach
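The Class-Ads model favoured here can be sketched in a few lines: both sides are attribute sets, each carrying a Requirements expression evaluated against the other side. This is our own minimal illustration of the idea, not the actual ClassAd language:

```python
# Minimal sketch of the Condor Class-Ads idea: a resource ad and a
# request ad use the same representation, and a match requires each
# side's Requirements to hold against the other side's attributes.

def matches(request: dict, resource: dict) -> bool:
    """Symmetric match: both Requirements predicates must hold."""
    return (request["Requirements"](resource)
            and resource["Requirements"](request))

resource = {
    "Arch": "INTEL", "OpSys": "LINUX", "FreeCpus": 12,
    # the resource constrains who may use it
    "Requirements": lambda ad: ad.get("Owner") != "untrusted",
}
request = {
    "Owner": "someuser", "Cmd": "/diskCms/startcmsim.sh",
    # the request constrains which resources are acceptable
    "Requirements": lambda ad: (ad.get("OpSys") == "LINUX"
                                and ad.get("FreeCpus", 0) >= 10),
}
print(matches(request, resource))
```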

Page 22:

Second step: Condor-G

[Diagram: jobs are submitted to Condor-G using condor_submit and the Globus Universe, and managed with condor_q, condor_rm, …; Condor-G uses globusrun to reach the GRAM interfaces of Site 1 (Condor), Site 2 (LSF), and Site 3 (PBS).]

Page 23:

Condor-G ?

- Condor Schedd + Grid Manager

Why Condor-G?
- Uses the Condor architecture and mechanisms, able to provide robustness and reliability
  - The user can submit his 10,000 jobs and be sure that they will be completed (even if there are problems in the submitting machine, in the executing machines, in the network, …) without human intervention
- Uses the Condor interface and tools to "manage" the jobs
  - "Robust" tools with all the required capabilities (monitoring, logging, …)

Page 24:

Condor-G (Globus Universe)

- Condor-G tested considering as Globus resource:
  - Workstation using the fork system call
  - LSF cluster
  - Condor pool
- Submission (condor_submit), monitoring (condor_q), and removal (condor_rm) seem to work fine, but…

Page 25:

Condor-G

- The Globus Universe architecture is only a prototype
  - Documentation not available
- Scalability problems on the submitting side
  - One shadow for each globusrun
- Very difficult to diagnose errors
  - Some errors in the log files
- Some improvements foreseen in the near future (scalability, …)
- The problems of scalability and fault tolerance in the Globus resources are not solved
  - Fault tolerance only on the submitting side

Page 26:

Condor-G Architecture

[Diagram: Condor-G (Globus client) on pc1 runs condor_submit and globusrun, polling job status (globus_job_status) against the jobmanager on the Globus front-end machine pc2, which hands the jobs to LSF/Condor/PBS/…]

pc1% condor_submit file.cnd

file.cnd:
Universe = globus
Executable = /diskCms/startcmsim.sh
GlobusRSL = (stdin=/diskCms/PythiaOut/filename)(stdout=/diskCms/CmsimOut/filename)
log = /diskCms/log.$(Process)
GlobusScheduler = pc2.pd.infn.it/jobmanager-xyz
queue 1

Page 27:

Condor GlideIn

- Submission of Condor jobs on Globus resources
- Condor daemons (master, startd) run on the Globus resources
  - These resources temporarily become part of the Condor pool
- Condor-G is used to run the Condor daemons
  - The local resource management systems (LSF, PBS, …) of the Globus resources are used only to run the Condor daemons
- For a cluster it is necessary to install Globus only on one front-end machine, while the Condor daemons run on each workstation of the cluster

Page 28:

GlideIn

[Diagram: from pc1 (Personal Condor), condor_glidein is run against pc2 and pc3, the Globus front-ends of clusters managed by LSF/Condor/…]

pc1% condor_glidein pc2.pd.infn.it …
pc1% condor_glidein pc3.pd.infn.it …

Page 29:

Condor GlideIn

- Usage of all Condor mechanisms and capabilities
  - Robustness and fault tolerance
- The only "ready-to-use" solution if we want to use Globus tools
- Also provides Master functionality (the Condor matchmaking system)
- A viable solution if the goal is just to find idle CPUs
- The architecture must be integrated/modified if we have to take into account other parameters (e.g. location of input files)

Page 30:

Condor GlideIn

GlideIn tested (considering standard and vanilla jobs) with:
- Workstation using the fork system call as job manager
  - Seems to work
- Condor pool
  - Seems to work
  - Condor flocking is a better solution if authentication is not required
- LSF cluster
  - Problems with glidein of multiple nodes with a single condor_glidein command (because of the problem related to the LSF (count=x) parameter)
  - Multiple condor_glidein commands → scalability problems (number of jobmanagers)
  - Modification of the Globus scripts for LSF

Page 31:

Conclusions (problems)

- Major problems related to scalability and fault tolerance with Globus GRAM
  - The Globus team is going to re-implement the GRAM service: when??? how???
- The local GRAMs do not provide the GIS with enough information
  - The default schema must be extended
  - We must identify which information is necessary
- RSL is not flexible enough
  - Condor Class-Ads seem a much better solution to specify resources
- Condor-G can provide robustness only on the submitting side

Page 32:

Future activities

- Complete on-going GRAM evaluations (e.g. PBS)
- Bug fixes
  - Modification of the Globus LSF scripts
  - …
- Tests with real applications
- Solve the scalability and robustness problems
  - Not so simple and straightforward!
  - Possible collaboration between WP1, the Globus team, and the Condor team

Page 33:

Other info

http://www.pd.infn.it/~sgaravat/INFN-GRID
http://www.pd.infn.it/~sgaravat/INFN-GRID/Globus