Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

Post on 22-Dec-2015


Page 1: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

Evaluation of the Globus GRAM Service

Massimo Sgaravatto
INFN Padova

Page 2: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

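Evaluation of GRAM Service

[Figure: three sites (Site1, Site2, Site3), each running a GRAM on top of a local resource management system (Condor, LSF, PBS). Jobs are submitted using Globus tools. The GIS provides information on characteristics and status of local resources.]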

Page 3: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

Evaluation of GRAM Service

- Job submission tests using Globus tools (globusrun, globus-job-run, globus-job-submit)
- GRAM as uniform interface to different underlying resource management systems
- "Cooperation" between GRAM and GIS
- Evaluation of RSL as uniform language to specify resources
- Tests performed with Globus 1.1.2 and 1.1.3 and Linux machines

Page 4: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

GRAM & fork system call

[Figure: a client running Globus submits to a server running Globus, which starts the job with fork.]

Page 5: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

GRAM & Condor

[Figure: a client running Globus submits to a server (the Condor front-end machine) running Globus and Condor, which dispatches jobs to the Condor pool.]

Page 6: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

GRAM & Condor

Tests considering:

- Standard Condor jobs (relinked with Condor library)
  - INFN WAN Condor pool configured as Globus resource
  - ~200 machines spread across different sites
  - Heterogeneous environment
  - No single file system and UID domain
- Vanilla jobs ("normal" jobs)
  - PC farm configured as Globus resource
  - Single file system and UID domain

Page 7: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

GRAM & LSF

[Figure: a client running Globus submits to a server (the LSF front-end machine) running Globus and LSF, which dispatches jobs to the LSF cluster.]

Page 8: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

Results

- Some bugs found and fixed (fixes included in INFNGRID 1.1 distribution)
  - Standard output and error for vanilla Condor jobs
  - globus-job-status
  - …
- Some bugs can be solved without major re-design and/or re-implementation:
  - For LSF the RSL parameter (count=x) is translated into: bsub -n x …
    - Just allocates x processors, and dispatches the job to the first one
    - Used for parallel applications
  - Should be: bsub … x times
  - Maybe we don't need to solve this problem (see later…)
  - …
- Two major problems:
  - Scalability
  - Fault tolerance
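The difference between the two LSF translations of (count=x) can be sketched as below. This is only an illustration that builds command lines without running them; the helper names and the `./job.sh` command are hypothetical, and the real Globus LSF scripts are not reproduced here.

```python
import shlex


def bsub_parallel(count, command):
    """Translation reported on the slide: one bsub call reserving `count` processors."""
    return [["bsub", "-n", str(count)] + shlex.split(command)]


def bsub_independent(count, command):
    """Translation the slide asks for: `count` separate bsub calls, one per job."""
    return [["bsub"] + shlex.split(command) for _ in range(count)]


if __name__ == "__main__":
    # "./job.sh" is a placeholder command, not taken from the slides.
    for argv in bsub_parallel(3, "./job.sh"):
        print(" ".join(argv))
    for argv in bsub_independent(3, "./job.sh"):
        print(" ".join(argv))
```

The first form yields a single LSF job, which suits a parallel application; the second yields x independent LSF jobs.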

Page 9: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

Globus GRAM Architecture

[Figure: the client (pc1) contacts the Globus front-end machine (pc2), where a jobmanager is started; the jobmanager passes the job to LSF/Condor/PBS/…]

pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz \
       -f file.rsl

file.rsl:
&(executable=/diskCms/startcmsim.sh)(stdin=/diskCms/PythiaOut/filename)(stdout=/diskCms/Cmsim/filename)(count=1)

Page 10: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

Scalability

- One jobmanager for each globusrun
- If I want to submit 1000 jobs ???
  - 1000 globusrun
  - 1000 jobmanagers running in the front-end machine !!!
- Alternative: a single submission with (count=1000)

  % globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl

  file.rsl:
  &(executable=/diskCms/startcmsim.sh)(stdin=/diskCms/PythiaOut/filename)(stdout=/diskCms/CmsimOut/filename)(count=1000)

  - It is not possible to specify in the RSL file 1000 different input files and 1000 different output files …
    - $(Process) in Condor
  - Problems with job monitoring (globus-job-status)
- Therefore (count=x) with x>1 not very useful !
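Since one RSL file cannot name a different input and output file per instance, the only option left is one RSL file and one globusrun per job. A minimal sketch of that workaround follows; it only builds the RSL strings and command lines and submits nothing. The `file<i>` and `job<i>.rsl` naming scheme is a hypothetical stand-in for the elided "filename" on the slide.

```python
def make_rsl(index,
             executable="/diskCms/startcmsim.sh",
             in_dir="/diskCms/PythiaOut",
             out_dir="/diskCms/CmsimOut"):
    """Build the RSL string for job number `index`, with its own stdin/stdout."""
    name = f"file{index}"  # hypothetical per-job file name
    return (f"&(executable={executable})"
            f"(stdin={in_dir}/{name})"
            f"(stdout={out_dir}/{name})"
            "(count=1)")


def make_submissions(n, contact="pc2.infn.it/jobmanager-xyz"):
    """Return one (rsl_file, rsl_text, argv) triple per job."""
    jobs = []
    for i in range(n):
        rsl_file = f"job{i}.rsl"  # hypothetical RSL file name
        argv = ["globusrun", "-b", "-r", contact, "-f", rsl_file]
        jobs.append((rsl_file, make_rsl(i), argv))
    return jobs


if __name__ == "__main__":
    for rsl_file, rsl_text, argv in make_submissions(3):
        print(rsl_file, "->", rsl_text)
        print("   ", " ".join(argv))
```

Each generated command still starts its own jobmanager on the front-end machine, so this sketch works around the per-job file names but not the scalability problem.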

Page 11: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

Fault tolerance

- The jobmanager is not persistent
- If the jobmanager can't be contacted, Globus assumes that the job(s) have been completed
- Example of problem
  - Submission of n jobs on a cluster managed by a local resource management system
  - Reboot of the front-end machine
  - The jobmanager(s) do not restart
    - Orphan jobs
    - Globus assumes that the jobs have been successfully completed
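The failure mode above comes down to how a lost contact with the jobmanager is interpreted. The sketch below contrasts the behaviour described on the slide with a more conservative reading; it is an illustration only, and the state names and function names are assumptions, not Globus code.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    DONE = "done"
    UNKNOWN = "unknown"


def observed_behaviour(jobmanager_reachable, last_state):
    """Behaviour described on the slide: no contact is read as completion."""
    if not jobmanager_reachable:
        return JobState.DONE
    return last_state


def conservative_behaviour(jobmanager_reachable, last_state):
    """Hypothetical alternative: no contact means the state is unknown."""
    if not jobmanager_reachable:
        return JobState.UNKNOWN
    return last_state


if __name__ == "__main__":
    # Front-end machine rebooted while the job was still running.
    print(observed_behaviour(False, JobState.ACTIVE))
    print(conservative_behaviour(False, JobState.ACTIVE))
```

With the conservative reading, a client could go on to ask the local resource management system about the job instead of treating it as finished.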

Page 12: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

GRAM & GIS

- How do the local GRAMs provide the GIS with characteristics and status of local resources?
- Tests performed considering:
  - Condor pool
  - LSF cluster

Page 13: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

GRAM & Condor & GIS

Page 14: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

GRAM & LSF & GIS

Must be fixed

Page 15: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

Jobs & GIS

Info on Globus jobs published in the GIS:

- User
  - Subject of certificate
  - Local user name
- RSL string
- Globus job id
- LSF/Condor/… job id
- Status: Run/Pending/…
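The per-job information listed above can be pictured as a simple record. The sketch below is only a way to visualize the fields; the class name, field names and example values are hypothetical and do not reproduce the actual GIS schema.

```python
from dataclasses import asdict, dataclass


@dataclass
class GlobusJobInfo:
    certificate_subject: str  # user: subject of the certificate
    local_user: str           # user: local user name
    rsl: str                  # RSL string used at submission
    globus_job_id: str        # Globus job id
    local_job_id: str         # LSF/Condor/... job id
    status: str               # Run/Pending/...


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    job = GlobusJobInfo(
        certificate_subject="<certificate subject>",
        local_user="<local user>",
        rsl="&(executable=/diskCms/startcmsim.sh)(count=1)",
        globus_job_id="<globus job id>",
        local_job_id="<local job id>",
        status="Pending",
    )
    print(asdict(job))
```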

Page 16: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

GRAM & GIS

- The information on characteristics and status of local resources and on jobs is not enough
  - As local resources we must consider farms and not the single workstations
  - Other information (e.g. total and available CPU power) needed
- Fortunately the default schema can be integrated with other info provided by specific agents
- The needed information must be identified first
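As an illustration of farm-level information, the sketch below shows what a specific agent might compute before publishing to the GIS: total and available CPU power aggregated over the nodes of a farm. The `Node` type, the unit of CPU power and the output keys are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_power: float  # arbitrary benchmark units (assumption)
    busy: bool


def summarize_farm(nodes):
    """Aggregate per-node data into farm-level totals."""
    total = sum(n.cpu_power for n in nodes)
    available = sum(n.cpu_power for n in nodes if not n.busy)
    return {
        "nodes": len(nodes),
        "total_cpu_power": total,
        "available_cpu_power": available,
    }


if __name__ == "__main__":
    farm = [Node("pc1", 10.0, True), Node("pc2", 20.0, False), Node("pc3", 30.0, False)]
    print(summarize_farm(farm))
```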

Page 17: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

RSL

- We need a uniform language to specify resources, between different resource management systems
- The RSL syntax model seems suitable to define even complicated resource specification expressions
- The common set of RSL attributes is often not sufficient
  - The attributes not belonging to the common set are ignored
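The effect of ignoring attributes outside the common set can be sketched as follows. This is a toy parser for flat conjunctions such as `&(a=1)(b=2)` only, not the real RSL grammar; the `COMMON_ATTRIBUTES` set is an illustrative subset taken from the examples in these slides, and `os` is a hypothetical site-specific attribute.

```python
import re

# Illustrative subset only; the real common attribute set is larger.
COMMON_ATTRIBUTES = {"executable", "stdin", "stdout", "count"}


def parse_relations(rsl):
    """Parse a flat conjunction like &(a=1)(b=2) into (name, value) pairs."""
    return re.findall(r"\(\s*([^=()\s]+)\s*=\s*([^()]*?)\s*\)", rsl)


def split_known_unknown(rsl):
    """Separate attributes in the common set from those that would be ignored."""
    kept, ignored = {}, {}
    for name, value in parse_relations(rsl):
        target = kept if name in COMMON_ATTRIBUTES else ignored
        target[name] = value
    return kept, ignored


if __name__ == "__main__":
    rsl = "&(executable=/diskCms/startcmsim.sh)(count=1)(os=linux)"
    kept, ignored = split_known_unknown(rsl)
    print("kept:", kept)
    print("ignored:", ignored)
```

The user's `os` requirement is dropped without any error, so the job may land on a resource that does not satisfy it.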

Page 18: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

RSL

- More flexibility is required
  - Resource administrators should be allowed to define new attributes and users should be allowed to use them in resource specification expressions (Condor Class-Ads model)
- Same language to describe the offered resources and the requested resources (Condor Class-Ads model) seems a better approach
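The idea of one language for both sides can be illustrated with a small symmetric matching sketch. It uses plain Python dictionaries and functions as a stand-in and is not ClassAd syntax; every attribute name and value here (including the administrator-defined `farm` attribute) is hypothetical.

```python
def matches(request, offer):
    """Symmetric match: each side's requirements are checked against the other's attributes."""
    return (request["requirements"](offer["attributes"])
            and offer["requirements"](request["attributes"]))


# Offered resource, described with attributes plus its own requirements.
offer = {
    "attributes": {"opsys": "LINUX", "memory_mb": 256, "farm": "pd"},
    "requirements": lambda job: job["owner"] != "blocked_user",
}

# Requested resource, described in exactly the same form.
request = {
    "attributes": {"owner": "cmsprod", "executable": "/diskCms/startcmsim.sh"},
    "requirements": lambda m: m["opsys"] == "LINUX" and m["memory_mb"] >= 128,
}

if __name__ == "__main__":
    print(matches(request, offer))
```

Because both descriptions share one form, a new attribute such as `farm` can be used in a requirement as soon as an administrator publishes it, with no change to the language itself.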

Page 19: Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova

Next steps

- Bug fixes
  - Modification of Globus LSF scripts for GIS
  - Problem (count=x) with LSF ???
- Tests with real applications and real environments (CMS fall production)
- Define a small set of attributes of a Condor pool, LSF cluster, PBS cluster that should be reported to the GIS, and try to implement it
  - Let's start with information provided by the underlying resource management system
- Tests with GRAM API
- Tests with other resource management systems are not necessary
- Scalability and robustness problems
  - Not so simple and straightforward !!!
  - Up to the Workload management WP, possible collaboration with Globus team and Condor team