
Page 1: The SAM-Grid Fabric Services Gabriele Garzoglio (for the SAM-Grid team) Computing Division Fermilab

The SAM-Grid Fabric Services

Gabriele Garzoglio (for the SAM-Grid team)
Computing Division
Fermilab

Page 2:

Gabriele Garzoglio, ACAT 2003

Overview

Introduction
The grid-level services: an overview
  • Job Management
The fabric-level services
  • Local batch system adaptation
  • Dynamic product retrieval
  • Local sandbox management
  • Job complex-status logging

Page 3:

Introduction

SAM is a Data Handling System for HEP: the project was started in 1997 by DZero
The SAM-Grid project started in 2001-2002 to handle DZero’s expanded needs for globally distributed computing
CDF joined SAM-Grid at the end of 2002

JIM complements the data handling system (SAM) with Job and Info Management: SAM-Grid = JIM + SAM
JIM is funded by PPDG and GridPP
Participated in SC02 and SC03

Page 5:

[Figure: diagram of the grid-level services. Recoverable labels: JOB, User Interface, Submission Client, Job Management, Broker, Match Making Service, Queuing System, Information Collector, Execution Site #1 through Execution Site #n, Computing Element, Storage Element, Grid Sensors, Data Handling System.]

Page 7:

Running jobs on Grid resources: the trend

Grid resources are not dedicated to a single experiment
Translation:
  • no daemons running on the worker nodes of a batch system
  • no experiment-specific software installed

Page 8:

Running jobs on Grid resources: today

The situation is transitioning:
  • Generally, experiments can install specific services on a node close to the cluster.
  • Worker nodes typically access the software via a shared file system: not scalable!
  • Local resource configurations are still too diverse to plug easily into the Grid.

Today, most of our efforts are directed to coping with (the lack of) standard local fabric services

Page 10:

Motivation

Problem: “standard” grid batch system adapters (Globus job managers) are too restrictive to fit all the local configurations
Examples:
  • the terms of the agreement for using the batch system can be expressed with special directives to the batch system
  • system administrators end up writing wrappers around the standard batch system commands

Page 11:

SAM Batch System Adapter

We factor out the local batch system configuration using an intermediate layer that abstracts the basic interactions with the batch system

  • submit command
  • lookup command
  • remove command

For each of the commands above, the administrator can specify how to parse the output to fish out the relevant information, e.g. the local job id when submitting
We have written JIM Globus job managers that use this layer
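
As an illustration of the intermediate layer described above, here is a minimal Python sketch, not the actual SAM-Grid code. It assumes each command is configured as a command-line template plus a regular expression that extracts the relevant fields from the output. The `BatchCommand` class, the `ADAPTER` table, and the PBS-style commands and patterns are all hypothetical stand-ins for what a site administrator would supply.

```python
import re
import shlex
import subprocess
from dataclasses import dataclass


@dataclass
class BatchCommand:
    """One site-configurable interaction with the local batch system."""
    template: str        # command line with {placeholders}, set by the site admin
    output_pattern: str  # regex with named groups to extract from the output

    def build(self, **params: str) -> list[str]:
        # Fill in the template and split it into an argument vector.
        return shlex.split(self.template.format(**params))

    def parse(self, output: str) -> dict[str, str]:
        # Fish the relevant fields out of the command output.
        match = re.search(self.output_pattern, output)
        if match is None:
            raise ValueError(f"unexpected batch system output: {output!r}")
        return match.groupdict()

    def run(self, **params: str) -> dict[str, str]:
        # Execute the site's command and parse what it prints.
        done = subprocess.run(self.build(**params), capture_output=True,
                              text=True, check=True)
        return self.parse(done.stdout)


# Hypothetical configuration for a PBS-like site; the commands and patterns
# are illustrative and would be replaced by each administrator.
ADAPTER = {
    "submit": BatchCommand("qsub -q {queue} {script}", r"(?P<job_id>\d+)\."),
    "lookup": BatchCommand("qstat {job_id}", r"\s(?P<status>[QRECH])\s"),
    "remove": BatchCommand("qdel {job_id}", r"(?P<rest>.*)"),
}
```

With such a table, a job manager would only call `ADAPTER["submit"].run(queue=..., script=...)` and read the `job_id` field, so site-specific wrappers and directives stay in the configuration rather than in the job manager.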

Page 13:

Motivation

Portability of the software for DZero and CDF is still not a completely solved problem.
Most of the CDF and DZero applications still rely on the offline software being preinstalled at the site.
Administrators need to install and maintain the software at each site.
A job submitted to the grid must be able to execute at a site where its dependencies are installed.

Page 14:

Old solution: software advertisement

Administrators install the software at each site
The JIM advertisement framework senses the new product and advertises it to the broker as one of the characteristics of the site
Drawbacks:
  • the administrators still need to install the software
  • increased complexity of the advertisement framework: it needs to know how to detect the list of installed products
  • increased complexity of the broker: it needs to enforce the matching to the eligible sites
  • jobs running on old software versions may not find an eligible site
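
The matching step that the broker has to enforce can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the JIM broker itself: the `ADVERTISED` table, the site names, and the product names and versions are all hypothetical.

```python
# Hypothetical site advertisements: product name -> installed versions.
ADVERTISED = {
    "site_a": {"reco": {"v2", "v3"}},
    "site_b": {"reco": {"v3"}, "sim": {"v1"}},
}


def eligible_sites(requirements: dict[str, str],
                   advertised: dict[str, dict[str, set[str]]]) -> list[str]:
    """Return the sites that advertise every required product version."""
    return sorted(
        site for site, products in advertised.items()
        if all(version in products.get(name, set())
               for name, version in requirements.items())
    )
```

In this sketch a job that requires `reco` at version `v1` matches no site, which is the last drawback listed above: a job tied to an old software version may not find an eligible site.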

Page 15:

New solution: dynamic software retrieval

Product developers store the software into SAM with appropriate metadata
Before running a job at a site, the infrastructure asks SAM for the delivery of the products the job depends on
The products live in the SAM cache and are automatically managed
Drawbacks:
  • increased complexity of local job submission
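
The retrieval step before job execution might look roughly like the Python sketch below. It assumes a local cache directory and a `fetch` callable that stands in for the delivery request to the data handling system. Both are placeholders for illustration, not real SAM interfaces, and the sketch does not model how the cache is managed.

```python
from pathlib import Path
from typing import Callable, Iterable


def ensure_products(dependencies: Iterable[str], cache_dir: Path,
                    fetch: Callable[[str, Path], None]) -> list[Path]:
    """Make every dependency available in the local cache before the job runs.

    `fetch(name, destination)` stands in for the request to the data handling
    system; it is a placeholder, not a real SAM call.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in dependencies:
        destination = cache_dir / name
        if not destination.exists():   # products already cached are reused
            fetch(name, destination)
        paths.append(destination)
    return paths
```

The extra step before submission is where the added complexity of local job submission shows up: the job can start only after `ensure_products` has returned.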

Page 17:

Nomenclature

Input sandbox:
from the client (user sandbox):
  • the executable
  • configuration files
  • special dependencies (libraries, products, …)
from the local site:
  • the product dependencies

Output sandbox:
  • stdout, stderr
  • log files
  • small custom output (e.g. histograms)

Page 18:

Requirements

We want an infrastructure that:
  • locally stores the user sandbox (from the Grid) at the site
  • transports and installs the input sandbox to the worker node
  • packages the output and hands it over to the Grid

Page 19:

Limitations to overcome

the file transport mechanism of a batch system is site specific and needs to be factored out
shared file systems have scalability limits: we want to rely on them as little as possible
the worker nodes may have connectivity restrictions (firewalls)

Page 20:

The sandbox management 1

It creates a sandbox area (reorganizing the native Globus GASS cache)
It starts up a GridFTP server for the communications between worker nodes and head node (no shared FS)
It requests the delivery of the product dependencies
It creates a self-extracting archive that contains the GridFTP client and a bootstrapping script; when running, this transfers and installs the product dependencies, then passes control to the application
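
The order of operations of the bootstrapping script on the worker node can be sketched in Python as below. This is only an outline under assumptions. The `transfer` callable stands in for the GridFTP copy from the head node, "installing" a product is reduced to unpacking a tar archive, and the function and directory names are hypothetical.

```python
import subprocess
import tarfile
from pathlib import Path
from typing import Callable, Sequence


def bootstrap(products: Sequence[str], work_dir: Path,
              transfer: Callable[[str, Path], Path],
              command: Sequence[str]) -> int:
    """Fetch and unpack each product archive, then run the application."""
    install_dir = work_dir / "products"
    install_dir.mkdir(parents=True, exist_ok=True)
    for name in products:
        # Placeholder for the GridFTP copy; returns the local archive path.
        archive = transfer(name, work_dir)
        with tarfile.open(archive) as tar:
            tar.extractall(install_dir)   # "install" means unpack in this sketch
    # Pass control to the user application inside the sandbox.
    return subprocess.run(list(command), cwd=work_dir).returncode
```

Because the transfer is injected as a callable, the same sequence works whatever file transport the site uses, which is the point of factoring that mechanism out.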

Page 21:

The sandbox management 2

It submits parallel instances of the self-extracting archive to the batch system
The job relies on SAM for large input/output file transfers
When the job finishes, stdout/stderr and custom output are packaged at the head node to be transported back to the submission site via grid mechanisms
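
The final packaging step can be illustrated with a short Python sketch. The file name patterns and the `package_output` function are assumptions made for this example. The slides do not specify the archive format or how custom output is identified.

```python
import tarfile
from pathlib import Path


def package_output(job_dir: Path, archive: Path,
                   patterns: tuple[str, ...] = ("stdout", "stderr", "*.log")) -> list[str]:
    """Bundle the small output files of a finished job into one tar.gz archive."""
    # Collect matching regular files directly under the job directory.
    members = sorted({p for pattern in patterns for p in job_dir.glob(pattern)
                      if p.is_file()})
    with tarfile.open(archive, "w:gz") as tar:
        for path in members:
            # Store each file under its bare name, not its absolute path.
            tar.add(path, arcname=path.name)
    return [p.name for p in members]
```

Large data files are deliberately left out of the archive, since the job relies on SAM for those transfers.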

Page 22:

Open problems

Not all batch systems allow the selection of a node with sufficient scratch space to install the needed software
We would greatly simplify this infrastructure if there were a “standard” local storage service at all the sites (e.g. DiskFarm)

Page 24:

Motivation

Distributed logging of job status/history
Web monitoring
Statistics on historical data
Grid scheduling based upon job status/history at a certain site

Page 25:

The XML DB Status Logger

The status of the job is reported to an XML database deployed at each execution site
The information comes from the local batch system (simple job status e.g. “idle”, “running”, …) AND from the application (complex status e.g. “Processing executable X in the chain”)
The XML database gives flexible remote access via standard mechanisms, such as XPath
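
To show what XPath access to such status records could look like, here is a small Python sketch using the standard library's `xml.etree.ElementTree`, which supports a subset of XPath. The document layout, with its `jobs`, `job`, `batch_status` and `app_status` names, is invented for the example and is not the actual SAM-Grid schema. A real deployment would query the XML database remotely rather than parse a string.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical status document; element and attribute names are illustrative.
STATUS_XML = """
<jobs site="example_site">
  <job id="101">
    <batch_status>running</batch_status>
    <app_status>Processing executable 2 of 3 in the chain</app_status>
  </job>
  <job id="102">
    <batch_status>idle</batch_status>
  </job>
</jobs>
"""

root = ET.fromstring(STATUS_XML)


def jobs_with_status(tree: ET.Element, status: str) -> list[str]:
    """Return the ids of jobs whose simple (batch) status equals `status`."""
    # [tag='text'] selects elements that have a child with that exact text.
    return [job.get("id") for job in tree.findall(f"./job[batch_status='{status}']")]


def app_status(tree: ET.Element, job_id: str) -> Optional[str]:
    """Return the complex (application-level) status of a job, if reported."""
    return tree.findtext(f"./job[@id='{job_id}']/app_status")
```

The same two queries cover both kinds of information mentioned above: the simple status from the batch system and the complex status reported by the application.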

Page 26:

Conclusions

The SAM-Grid offers an extensible working framework for Grid-level Job/Data/Info Management
The SAM-Grid adopts Fabric-level configurable solutions for batch system adaptation, product delivery, sandboxing and job complex-status logging
The community needs to come up with standard fabric-level services to make any Grid usable

Page 27:

More info at…

http://www-d0.fnal.gov/computing/grid/

http://samgrid.fnal.gov:8080/

Morag Burgon-Lyon’s Talk on SAM-Grid for CDF!