What is SAM-Grid? Job Handling Data Handling Monitoring and Information

Upload: phyllis-bryan

Post on 05-Jan-2016

Page 1: What is SAM-Grid? Job Handling Data Handling Monitoring and Information

What is SAM-Grid?

Job Handling

Data Handling

Monitoring and Information

Page 2

Problems To Solve

How can a large, geographically distributed, dynamic physics collaboration work together?

How can this collaboration make use of available distributed computing resources?

How can it handle the huge amount of data (PBs) generated by the experiment?

Page 3

Answers – The GRID & SAM-Grid

GRID: A network of middleware services that tie together distributed resources (Fabric – processors, storage).

SAM-Grid: Integrates the standard middleware to achieve a complete Job, Data, and Information management infrastructure, thereby enabling fully distributed computing.

Page 4

SAM-Grid Architecture

Page 5

Job Management

Grid-level (global) job scheduling (selecting a cluster to run on) is distinguished from local scheduling (distributing the job within the cluster).

We distinguish structured jobs from unstructured jobs. Structured jobs have their details known to the Grid middleware. Unstructured jobs are mapped as a whole onto a cluster.

The scheduler is interfaced with the data handling system. For data-intensive jobs, sites are ranked by the amount of data cached at the site.
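The cache-based ranking can be illustrated with a minimal sketch. The `Site` class, the `rank_sites` function, and the file names and sizes are all hypothetical; the slides do not say how the scheduler actually computes its ranking, so this only shows the idea of preferring the site that already holds the most input data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    cached_files: frozenset  # names of files already in the site's cache (hypothetical)


def rank_sites(sites, job_files, file_sizes):
    """Order sites by how many bytes of the job's input they already cache."""
    wanted = set(job_files)

    def cached_bytes(site):
        return sum(file_sizes[f] for f in wanted & site.cached_files)

    # Largest cached volume first; ties fall back to the site name for stable output.
    return sorted(sites, key=lambda s: (-cached_bytes(s), s.name))


sizes = {"raw_001": 400, "raw_002": 600, "raw_003": 500}
sites = [
    Site("site_a", frozenset({"raw_001"})),
    Site("site_b", frozenset({"raw_002", "raw_003"})),
    Site("site_c", frozenset()),
]
ranking = rank_sites(sites, ["raw_001", "raw_002", "raw_003"], sizes)
print([s.name for s in ranking])
```

Here `site_b` ranks first because it caches the largest share of the requested input, so the least data would have to move if the job ran there.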

Page 6

Job Handling

[Figure: job handling architecture. Labeled components: Job, User Interface, Submission Service, Resource Selection, Match Making Service, Information Collector, Execution Site #1 through Execution Site #n, Computing Element, Grid Sensors, Generic Service, external algorithm, Grid/Fabric Interface.]

Page 7

Data Handling - SAM

SAM is a distributed data movement and management service

SAM stations are resources pooled together to enable data management

Data replication is achieved by the use of disk caches during file routing.

SAM is a fully functional meta-data catalog.

A station can access a remote resource via the services offered by other connected stations

[Figure: SAM stations and caches. Labels: MSS 1, MSS 2, Local Station 1 / Cache 1, Local Station 1 / Cache 2, Local Station 2 / Cache 1, Remote Station / Cache 1, Remote Station / Cache 2. Legend: MSS – Mass Storage System; arrows show control flow and data flow.]
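The two ideas on this slide, access to remote resources through connected stations and replication through caches along the route, can be sketched with a toy model. The `Station` class and its `fetch` method are hypothetical names chosen for illustration and are not the SAM API; real SAM also involves a meta-data catalog and cache management policies that are not modelled here.

```python
class Station:
    """Toy station: a disk cache plus links to other stations."""

    def __init__(self, name, mass_storage=None):
        self.name = name
        self.cache = {}                          # file name -> contents
        self.mass_storage = mass_storage or {}   # files this station can read directly
        self.peers = []                          # connected stations

    def fetch(self, file_name, _visited=None):
        """Return the file, leaving a replica in every cache it is routed through."""
        visited = _visited if _visited is not None else set()
        visited.add(self.name)
        if file_name in self.cache:
            return self.cache[file_name]
        data = self.mass_storage.get(file_name)
        if data is None:
            # Ask connected stations, skipping any already on the route.
            for peer in self.peers:
                if peer.name in visited:
                    continue
                data = peer.fetch(file_name, visited)
                if data is not None:
                    break
        if data is not None:
            self.cache[file_name] = data  # replica created as a side effect of routing
        return data


remote = Station("remote", mass_storage={"run42.dat": b"events"})
local = Station("local")
local.peers.append(remote)
remote.peers.append(local)

print(local.fetch("run42.dat"))
print(sorted(local.cache), sorted(remote.cache))
```

After the first request, both stations hold a cached replica, so a repeated request at the local station is served without touching the remote mass storage.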

Page 8

Data Handling

[Figure: data handling components. Labels: Database Server(s) (Central Database), Station 1 Servers, Station 2 Servers, Station 3 Servers, Station n Servers, Mass Storage System(s), Name Server, Global Resource Manager(s), Log server, services. Scope labels: Shared Globally, Local To Site, Shared Locally. Arrows indicate control and data flow.]

Page 9

Monitoring and Information

This includes:
  configuration framework
  resource description for job brokering
  infrastructure for monitoring

Main features:
  Sites (resources), services and jobs monitoring
  Distributed knowledge about jobs etc.
  Incremental knowledge building
  Grid Monitoring Architecture for current state inquiries, Logging for recent history studies
  All Web based

Page 10

Monitoring and Information

[Figure: monitoring and information architecture. Labels: Web Browser, Web Server 1 through Web Server N, Site 1 Information System through Site N Information System, and several boxes labeled IP at each site.]

Page 11

Challenges with Grid/Fabric Interface

The Globus toolkit Grid/Fabric interfaces are not sufficiently…

…flexible: they expect a “standard” batch system configuration.

…scalable: a process per grid job is started up at the gateway machine. We want/need aggregation.

…comprehensive: they interface to the batch system only. How about data handling, local monitoring, databases, etc.?

…robust: if the batch system forgets about the jobs, they cannot react.

Page 12

Flexibility

Addressing the peculiarity of the configuration of each batch system requires modification to the Globus toolkit job-manager

We address the problem by writing job-managers that use a level of abstraction on top of the batch systems.

Each batch system adapter can be locally configured to conform to the local batch system interface
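A minimal sketch of such an abstraction layer is shown below. The class names, the `SITE_CONFIG` table, and the queue name are hypothetical; the slides do not describe the real SAM-Grid adapter interface. The point is only that the job-manager side talks to one interface, while the per-site details live in locally editable configuration.

```python
from abc import ABC, abstractmethod


class BatchAdapter(ABC):
    """Interface the job-manager codes against (hypothetical names)."""

    @abstractmethod
    def submit_argv(self, script):
        """Return the command line that would submit `script`."""

    @abstractmethod
    def status_argv(self, job_id):
        """Return the command line that would query `job_id`."""


class TemplateAdapter(BatchAdapter):
    """Adapter driven entirely by site-local command templates."""

    def __init__(self, submit_template, status_template):
        self.submit_template = submit_template
        self.status_template = status_template

    def submit_argv(self, script):
        return [part.format(script=script) for part in self.submit_template]

    def status_argv(self, job_id):
        return [part.format(job_id=job_id) for part in self.status_template]


# Example site configuration; the queue name and flags are illustrative only
# and would be edited by each site to match its own batch system setup.
SITE_CONFIG = {
    "pbs": TemplateAdapter(["qsub", "-q", "grid", "{script}"], ["qstat", "{job_id}"]),
    "condor": TemplateAdapter(["condor_submit", "{script}"], ["condor_q", "{job_id}"]),
}


def plan_submission(batch_system, script):
    """Job-manager side: no batch-system-specific logic appears here."""
    return SITE_CONFIG[batch_system].submit_argv(script)


print(plan_submission("pbs", "job.sh"))
print(plan_submission("condor", "job.sub"))
```

The sketch only builds command lines and does not execute them, so it can be tried on a machine without any batch system installed.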

Page 13

Scalability

The Globus gatekeeper starts up a process at the gateway node for every job entering the site

This limits the number of grid jobs at a site to around 300, for the typical commodity computer

We split single grid jobs into multiple batch processes in the SAM-Grid job-managers. Not only does this increase scalability, but it also increases the manageability of the job
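The splitting step might look roughly like the sketch below. Partitioning by input file is an assumption made for illustration; the slides do not say what criterion the SAM-Grid job-managers use, and `split_grid_job` is a hypothetical name. What the sketch shows is one grid job, and therefore one gateway process, fanning out into several batch-level work units.

```python
def split_grid_job(job_id, input_files, files_per_batch_job):
    """Partition one grid job's inputs into several batch-level work units."""
    if files_per_batch_job < 1:
        raise ValueError("files_per_batch_job must be at least 1")
    return [
        {
            "grid_job": job_id,
            "part": index,
            "files": input_files[start:start + files_per_batch_job],
        }
        for index, start in enumerate(range(0, len(input_files), files_per_batch_job))
    ]


files = [f"file_{i:03d}" for i in range(10)]
parts = split_grid_job("grid-1", files, 4)
for part in parts:
    print(part["part"], len(part["files"]))
```

Because every part carries the identifier of its parent grid job, the parts can still be monitored or cancelled together as one unit.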

Page 14

Comprehensiveness

The standard job-managers interface only to the local batch system

We notify other fabric services when a job enters a site:
  Data handling: for data pre-staging
  Monitoring: to monitor a non-running job
  Database: to aggregate queries
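One way to picture this notification step is a small publish/subscribe hook in the job-manager. Everything in the sketch is hypothetical (the `JobEntryNotifier` class, the handler names, and the job dictionary layout); it is not the SAM-Grid implementation, only an illustration of fanning one "job entered the site" event out to several fabric services.

```python
class JobEntryNotifier:
    """Fans a 'job entered the site' event out to registered fabric services."""

    def __init__(self):
        self._handlers = []

    def register(self, service_name, handler):
        self._handlers.append((service_name, handler))

    def job_entered(self, job):
        results = {}
        for name, handler in self._handlers:
            try:
                results[name] = handler(job)
            except Exception as error:
                # One failing service should not block notification of the others.
                results[name] = f"failed: {error}"
        return results


notifier = JobEntryNotifier()
notifier.register("data_handling", lambda job: f"pre-staging {len(job['files'])} files")
notifier.register("monitoring", lambda job: f"tracking {job['id']} as queued")
notifier.register("database", lambda job: f"queued queries for {job['id']}")

report = notifier.job_entered({"id": "grid-1", "files": ["a", "b"]})
print(report)
```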

Page 15

Robustness

The standard job-managers cannot react to temporary failures of the local batch systems

In our experience, PBS, Condor and BQS have failed to report the status of a job

We write wrappers around the batch systems. These wrappers implement extra robustness. We call them “idealizers”
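A wrapper of this kind could be sketched as follows. The slides do not say what extra robustness the real idealizers implement, so the behaviour here (retry the status query, then fall back to the last state that was successfully reported) is an assumption, and the `Idealizer` class and `query_status` callable are hypothetical names.

```python
import time


class StatusUnavailable(Exception):
    """Raised when the batch system never answers and no earlier state is known."""


class Idealizer:
    """Wraps a batch-system status query with retries and a last-known-state fallback."""

    def __init__(self, query_status, retries=3, delay_seconds=0.0, sleep=time.sleep):
        self.query_status = query_status  # callable: job_id -> state, or None on failure
        self.retries = retries
        self.delay_seconds = delay_seconds
        self.sleep = sleep
        self.last_known = {}

    def status(self, job_id):
        for attempt in range(self.retries):
            state = self.query_status(job_id)
            if state is not None:
                self.last_known[job_id] = state
                return state
            if attempt + 1 < self.retries:
                self.sleep(self.delay_seconds)
        # The batch system "forgot" the job for now: report what we last saw.
        if job_id in self.last_known:
            return self.last_known[job_id]
        raise StatusUnavailable(job_id)


# Simulated batch system that fails to report twice before answering.
answers = iter([None, None, "running"])
idealizer = Idealizer(lambda job_id: next(answers), retries=3)
print(idealizer.status("job-7"))
```

With the fallback in place, a temporary reporting failure looks to the caller like an unchanged job state rather than a lost job.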