First steps implementing a High Throughput workload
management system
Massimo Sgaravatto, INFN Padova
High throughput workload management system architecture
(simplified design)
[Diagram: jobs are submitted to the Master using Class-Ads; the Master performs resource discovery against the GIS, which holds information on the characteristics and status of local resources, and dispatches the jobs through Condor-G (condor_submit, Globus Universe) and globusrun to GRAM gateways in front of Condor at Site1, LSF at Site2, and PBS at Site3.]
Overview
- PC farms in different sites managed by possibly different local resource management systems
- GRAM as uniform interface to these different local resource management systems
- Condor-G able to provide robustness and reliability
- Master smart enough to decide to which Globus resources the jobs must be submitted
- The Master uses the information on characteristics and status of resources published in the GIS
First step: evaluation of GRAM
[Diagram: jobs are submitted (using Globus tools) to GRAM gateways in front of Condor at Site1, LSF at Site2, and PBS at Site3; the GIS holds information on the characteristics and status of local resources.]
Evaluation of GRAM Service
- Job submission tests using Globus tools (globusrun, globus-job-run, globus-job-submit); see the example below
- GRAM as uniform interface to different underlying resource management systems
- "Cooperation" between GRAM and GIS
- Evaluation of RSL as a uniform language to specify resources
- Tests performed with Globus 1.1.2 and 1.1.3 on Linux machines
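As an illustration, minimal examples of the kind of submissions these tools allow (the hostname and jobmanager names are placeholders, following the pc2.pd.infn.it convention of the later slides, not the actual test setup):

  % globus-job-run pc2.pd.infn.it/jobmanager-condor /bin/hostname      (interactive: output comes back to the terminal)
  % globus-job-submit pc2.pd.infn.it/jobmanager-lsf /bin/hostname      (batch: returns a job contact for later status queries)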
GRAM & fork system call
[Diagram: a Globus client submits to a Globus server that runs the job via the fork system call.]
GRAM & Condor
[Diagram: a Globus client submits to the Globus server on the Condor front-end machine, which hands jobs to the Condor pool.]
GRAM & Condor
Tests considering:
- Standard Condor jobs (relinked with the Condor library)
  - INFN WAN Condor pool configured as a Globus resource: ~200 machines spread across different sites, heterogeneous environment, no single file system and UID domain
- Vanilla jobs ("normal" jobs)
  - PC farm configured as a Globus resource: single file system and UID domain
GRAM & LSF
[Diagram: a Globus client submits to the Globus server on the LSF front-end machine, which hands jobs to the LSF cluster.]
GRAM & PBS (by F. Giacomini, INFN Cnaf)
[Diagram: a Globus client submits to the Globus server on the PBS server, a Linux machine with 4 processors.]
Results
- Some bugs found and fixed: standard output and error for vanilla Condor jobs, globus-job-status, …
- Some bugs seem solvable without major re-design and/or re-implementation:
  - For LSF the RSL parameter (count=x) is translated into "bsub -n x …", which just allocates x processors and dispatches the job to the first one (meant for parallel applications); it should instead be "bsub …" repeated x times (illustrated after this list)
- Two major problems: scalability and fault tolerance
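To make the (count=x) mis-translation concrete, a sketch for (count=4), reusing the executable from the later RSL examples (the exact options generated by the Globus LSF script may differ):

  bsub -n 4 /diskCms/startcmsim.sh     <- what is generated: one LSF job that reserves 4 slots and runs on the first one
  bsub /diskCms/startcmsim.sh          <- what high-throughput work needs: repeated 4 times, one independent LSF job per submission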
Globus GRAM Architecture
[Diagram: a client contacts the Globus front-end machine, where a jobmanager is started for the job and hands it to LSF/Condor/PBS/…]
pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz -f file.rsl

file.rsl:
&(executable=/diskCms/startcmsim.sh)
 (stdin=/diskCms/PythiaOut/filename)
 (stdout=/diskCms/Cmsim/filename)
 (count=1)

(pc1: client; pc2: Globus resource)
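In batch mode (-b) globusrun returns a job contact that can later be passed to globus-job-status for monitoring; a hedged sketch (the contact URL shown is purely illustrative):

  pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz -f file.rsl
  https://pc2.pd.infn.it:39234/1234/987654321/       (illustrative job contact)
  pc1% globus-job-status https://pc2.pd.infn.it:39234/1234/987654321/
  ACTIVE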
Scalability
- One jobmanager for each globusrun
- If I want to submit 1000 jobs: 1000 globusrun commands and 1000 jobmanagers running on the front-end machine!
% globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl

file.rsl:
&(executable=/diskCms/startcmsim.sh)
 (stdin=/diskCms/PythiaOut/filename)
 (stdout=/diskCms/CmsimOut/filename)
 (count=1000)
- It is not possible to specify in the RSL file 1000 different input files and 1000 different output files (compare $(Process) in Condor, sketched below)
- Problems with job monitoring (globus-job-status)
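For contrast, a minimal sketch of the Condor mechanism mentioned above: the $(Process) macro takes the values 0…999, so each of the 1000 jobs gets its own input and output file (the file names are illustrative):

  Executable = /diskCms/startcmsim.sh
  Input      = /diskCms/PythiaOut/in.$(Process)
  Output     = /diskCms/CmsimOut/out.$(Process)
  Queue 1000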
Fault tolerance
- The jobmanager is not persistent
- If the jobmanager can't be contacted, Globus assumes that the job(s) have been completed
Example of problem
- Submission of n jobs on a cluster managed by a local resource management system
- Reboot of the front-end machine: the jobmanager(s) doesn't restart
- Orphan jobs: Globus assumes that the jobs have been successfully completed
GRAM & GIS
- How do the local GRAMs provide the GIS with the characteristics and status of local resources?
- The Master will need this (and other) information
- Tests performed considering: Condor pool, LSF cluster
GRAM & Condor & GIS
GRAM & LSF & GIS
Jobs & GIS
Info on Globus jobs published in the GIS (query example below):
- User: subject of certificate, local user name
- RSL string
- Globus job id
- LSF/Condor/… job id
- Status: Run/Pending/…
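Since the Globus GIS (MDS) is LDAP-based, these job entries can in principle be retrieved with an ordinary LDAP query; a minimal sketch, assuming the default MDS port 2135 and an illustrative object class name (the real schema names may differ):

  % ldapsearch -h pc2.pd.infn.it -p 2135 -b "o=Grid" "(objectclass=GlobusJob)"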
GRAM & GIS
- The information on characteristics and status of local resources and on jobs is not enough
- As local resources we must consider farms, not single workstations
- Other information (e.g. total and available CPU power) is needed
- Fortunately the default schema can be integrated with other info provided by specific agents
- We need to identify which other info is necessary; this will become much clearer during the Master design
RSL
- We need a uniform language to specify resources across different resource management systems
- The RSL syntax model seems suitable to define even complicated resource specification expressions
- The common set of RSL attributes is often not sufficient
- Attributes not belonging to the common set are ignored
RSL
- More flexibility is required: resource administrators should be allowed to define new attributes, and users should be allowed to use them in resource specification expressions (the Condor Class-Ads model)
- Using the same language to describe the offered resources and the requested resources (the Condor Class-Ads model) seems a better approach (sketched below)
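A minimal sketch of the Class-Ads symmetry referred to above, with illustrative attribute values and a hypothetical site-defined attribute FarmLocation (not taken from the tests described here):

  # Machine Class-Ad: the offered resource advertises its attributes
  OpSys        = "LINUX"
  Memory       = 256
  FarmLocation = "Padova"   # site-defined attribute (hypothetical)

  # Job Class-Ad: the request is written in the same language
  Requirements = (OpSys == "LINUX") && (Memory >= 128) && (FarmLocation == "Padova")

Matchmaking then pairs jobs and resources whose expressions mutually satisfy each other; with RSL, an attribute like FarmLocation would simply be ignored because it lies outside the common set.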
Second step: Condor-G
[Diagram: jobs are submitted and managed through Condor-G (condor_submit with the Globus Universe, condor_q, condor_rm, …), which drives globusrun towards GRAM gateways in front of Condor at Site1, LSF at Site2, and PBS at Site3.]
Condor-G?
Condor-G = Condor Schedd + Grid Manager
Why Condor-G?
- Usage of the Condor architecture and mechanisms, able to provide robustness and reliability
  - The user can submit his 10,000 jobs and be sure that they will be completed (even if there are problems in the submitting machine, in the executing machines, in the network, …) without human intervention
- Usage of the Condor interface and tools to "manage" the jobs
  - "Robust" tools with all the required capabilities (monitoring, logging, …)
Condor-G (Globus Universe)
Condor-G tested considering as Globus resource:
- Workstation using the fork system call
- LSF cluster
- Condor pool
Submission (condor_submit), monitoring (condor_q), and removal (condor_rm) seem to work fine, but…
Condor-G
- The Globus Universe architecture is only a prototype; documentation not available
- Scalability on the submitting side: one shadow for each globusrun
- Very difficult to understand errors; some errors in the log files
- Some improvements foreseen in the near future
Scalability, …
- The problems of scalability and fault tolerance in the Globus resources are not solved
- Fault tolerance is available only on the submitting side
Condor-G Architecture
Condor G(Globus Client)
LSF/ Condor/ PBS/ …
Globus front-end machine
Jobmanager
Job
condor_submit globusrun
polling (globus_job_status)
Jobs
pc1% condor_submit file.cnd

file.cnd:
Universe = globus
Executable = /diskCms/startcmsim.sh
GlobusRSL = (stdin=/diskCms/PythiaOut/filename)(stdout=/diskCms/CmsimOut/filename)
log = /diskCms/log.$(Process)
GlobusScheduler = pc2.pd.infn.it/jobmanager-xyz
queue 1

(pc1: submitting machine; pc2: Globus resource)
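Once submitted, the jobs are managed with the standard Condor tools, for example (the cluster id 42 is illustrative):

  pc1% condor_q        (monitor the Globus Universe jobs)
  pc1% condor_rm 42    (remove job cluster 42)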
Condor GlideIn
- Submission of Condor jobs on Globus resources
- Condor daemons (master, startd) run on the Globus resources
- These resources temporarily become part of the Condor pool
- Usage of Condor-G to run the Condor daemons (see the sketch after this list)
- The local resource management systems (LSF, PBS, …) of the Globus resources are used only to run the Condor daemons
- For a cluster it is necessary to install Globus only on one front-end machine, while the Condor daemons run on each workstation of the cluster
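A hedged sketch of the mechanism (condor_glidein automates all of this; the submit description below is illustrative, not the actual wrapper generated by the tool): the Condor daemons are themselves sent to the resource as a Globus Universe job via Condor-G:

  Universe        = globus
  # hypothetical wrapper script that unpacks and starts condor_master/condor_startd
  Executable      = glidein_startup.sh
  GlobusScheduler = pc3.pd.infn.it/jobmanager-lsf
  queue

Once the daemons are running on the cluster's nodes, those nodes join the pool and can accept ordinary Condor jobs.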
GlideIn
[Diagram: from a Personal Condor on pc1, condor_glidein adds the Globus resource pc2 and the cluster pc3 (managed by LSF/Condor/…) to the pool.]
pc1% condor_glidein pc2.pd.infn.it …
pc1% condor_glidein pc3.pd.infn.it …
Condor GlideIn
- Usage of all Condor mechanisms and capabilities: robustness and fault tolerance
- The only "ready-to-use" solution if we want to use Globus tools
- Also provides Master functionality (the Condor matchmaking system)
- A viable solution if the goal is just to find idle CPUs; the architecture must be integrated/modified if we have to take other parameters into account (e.g. location of input files)
Condor GlideIn
GlideIn tested (considering standard and vanilla jobs) with:
- Workstation using the fork system call as job manager: seems to work
- Condor pool: seems to work
  - Condor flocking is a better solution if authentication is not required
- LSF cluster
  - Problems with gliding in multiple nodes with a single condor_glidein command (because of the problem with the LSF (count=x) parameter)
  - Multiple condor_glidein commands lead to problems of scalability (number of jobmanagers)
  - Requires modification of the Globus scripts for LSF
Conclusions (problems)
- Major problems related to scalability and fault tolerance with Globus GRAM
  - The Globus team is going to re-implement the Globus GRAM service. When? How?
- The local GRAMs do not provide the GIS with enough information
  - The default schema must be integrated; we must identify which information is necessary
- RSL is not flexible enough
  - Condor Class-Ads seem a much better solution to specify resources
- Condor-G can provide robustness only on the submitting side
Future activities
- Complete on-going GRAM evaluations (e.g. PBS)
- Bug fixes: modification of the Globus LSF scripts, …
- Tests with real applications
- Solve the scalability and robustness problems: not so simple and straightforward!
- Possible collaboration between WP1, the Globus team, and the Condor team
Other info
http://www.pd.infn.it/~sgaravat/INFN-GRID
http://www.pd.infn.it/~sgaravat/INFN-GRID/Globus