What is SAM-Grid?
Job Handling, Data Handling, Monitoring and Information
Problems To Solve
How can a large, geographically distributed, dynamic, physics collaboration work together?
How can this collaboration make use of available distributed computing resources?
How can it handle the huge amount of data (PBs) generated by the experiment?
Answers: The GRID & SAM-Grid
GRID: A network of middleware services that tie together distributed resources (Fabric: processors, storage).
SAM-Grid: Integrate the standard middleware to achieve a complete Job, Data, and Information management infrastructure, thereby enabling fully distributed computing.
SAM-Grid Architecture
Job Management
Grid-level (global) job scheduling (selection of a cluster to run on) is distinguished from local scheduling (distribution of the job within the cluster).
We distinguish structured jobs from unstructured jobs. Structured jobs have their details known to the Grid middleware. Unstructured jobs are mapped as a whole onto a cluster.
The scheduler is interfaced with the data handling system. For data-intensive jobs, sites are ranked by the amount of data cached at the site.
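The ranking rule for data-intensive jobs can be sketched as follows. This is a minimal illustration, not SAM-Grid code: the `Site` class, the `cached_files` mapping, and the function names are all assumptions made for the example, and the real scheduler may weigh other factors as well.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical view of an execution site as seen by the scheduler."""
    name: str
    cached_files: dict[str, int]  # file name -> size in bytes held in the site cache


def cached_bytes(site: Site, dataset: list[str]) -> int:
    """Total size of the job's input files that are already cached at the site."""
    return sum(site.cached_files.get(f, 0) for f in dataset)


def rank_sites(sites: list[Site], dataset: list[str]) -> list[Site]:
    """Order sites so that the one caching the most input data comes first."""
    return sorted(sites, key=lambda s: cached_bytes(s, dataset), reverse=True)


sites = [
    Site("site_a", {"f1": 100, "f2": 200}),
    Site("site_b", {"f2": 200, "f3": 300, "f4": 50}),
    Site("site_c", {}),
]
ranking = rank_sites(sites, ["f2", "f3"])
print([s.name for s in ranking])
```

Here the job needs `f2` and `f3`, so a site holding both is preferred over a site holding only one of them, and a site with an empty cache comes last.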
Job Handling
[Diagram: job flow through the job handling components. Labels shown: JOB, User Interface, Submission Service, Resource Selection, Match Making Service, Information Collector, Generic Service, external algorithm, and, for Execution Site #1 through Execution Site #n, Computing Element, Grid Sensors, and Grid/Fabric Interface.]
Data Handling - SAM
SAM is a distributed data movement and management service.
SAM stations are resources pooled together to enable data management.
Data replication is achieved by the use of disk caches during file routing.
SAM is a fully functional meta-data catalog.
A station can access a remote resource via the services offered by other connected stations.
[Diagram: stations and storage connected by control flow and data flow arrows. Labels shown: MSS1, MSS2, Local Station 1 Cache 1, Local Station 1 Cache 2, Local Station 2 Cache 1, Remote Station Cache 1, Remote Station Cache 2. MSS = Mass Storage System.]
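The two ideas above (replication through disk caches during routing, and access to a remote resource via connected stations) can be sketched as a toy model. Everything here is an assumption for illustration: the `Station` and `MassStorage` classes, the lookup order, and the in-memory dictionaries do not reflect SAM's actual interfaces.

```python
class MassStorage:
    """Hypothetical stand-in for a Mass Storage System (MSS)."""

    def __init__(self, files):
        self.files = dict(files)

    def read(self, name):
        return self.files.get(name)


class Station:
    """Toy station: a disk cache, optional direct MSS access, and connected peers."""

    def __init__(self, name, mss=None):
        self.name = name
        self.cache = {}   # disk cache: file name -> content
        self.mss = mss    # mass storage this station can reach directly, if any
        self.peers = []   # other stations this one is connected to

    def get(self, name, _seen=None):
        seen = _seen if _seen is not None else set()
        seen.add(self.name)
        if name in self.cache:
            return self.cache[name]
        data = self.mss.read(name) if self.mss else None
        if data is None:
            # Ask connected stations, skipping ones already visited on this route.
            for peer in self.peers:
                if peer.name in seen:
                    continue
                data = peer.get(name, seen)
                if data is not None:
                    break
        if data is not None:
            self.cache[name] = data  # routing leaves a replica in this cache
        return data


mss1 = MassStorage({"raw_001": b"events"})
local = Station("local-1", mss=mss1)
remote = Station("remote-1")
remote.peers.append(local)

payload = remote.get("raw_001")
print(payload, sorted(local.cache), sorted(remote.cache))
```

After the request, both the station that served the file and the station that asked for it hold a cached copy, which is the replication-by-routing effect described above.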
Data Handling
[Diagram: SAM services grouped by scope (Shared Globally, Local To Site, Shared Locally). Labels shown: Database Server(s) (Central Database), Name Server, Global Resource Manager(s), Log server, Station 1 Servers, Station 2 Servers, Station 3 Servers, Station n Servers, Mass Storage System(s). Arrows indicate control and data flow.]
Monitoring and Information
This includes:
  configuration framework
  resource description for job brokering
  infrastructure for monitoring
Main features:
  Sites (resources), services and jobs monitoring
  Distributed knowledge about jobs etc.
  Incremental knowledge building
  Grid Monitoring Architecture for current state inquiries, Logging for recent history studies
  All Web based
Monitoring and Information
[Diagram: Web-based monitoring. Labels shown: Web Browser, Web Server 1 through Web Server N, Site 1 Information System through Site N Information System, and several components labelled IP.]
Challenges with Grid/Fabric Interface
The Globus toolkit Grid/Fabric interfaces are not sufficiently…
  …flexible: they expect a “standard” batch system configuration.
  …scalable: a process per grid job is started up at the gateway machine. We want/need aggregation.
  …comprehensive: they interface to the batch system only. How about data handling, local monitoring, databases, etc.?
  …robust: if the batch system forgets about the jobs, they cannot react.
Flexibility
Addressing the peculiarities of each batch system's configuration requires modification to the Globus toolkit job-manager.
We address the problem by writing job-managers that use a level of abstraction on top of the batch systems.
Each batch system adapter can be locally configured to conform to the local batch system interface.
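One way to picture a locally configurable batch system adapter is shown below. This is a sketch under stated assumptions, not the SAM-Grid implementation: the `BatchAdapter` interface, the `TemplateAdapter` class, and the `site_submit` / `site_status` command names are all hypothetical placeholders for whatever the local batch system actually provides.

```python
import shlex
from abc import ABC, abstractmethod


class BatchAdapter(ABC):
    """Uniform interface that a job-manager could use for any batch system."""

    @abstractmethod
    def submit_command(self, script: str) -> list[str]: ...

    @abstractmethod
    def status_command(self, job_id: str) -> list[str]: ...


class TemplateAdapter(BatchAdapter):
    """Adapter whose command lines come from site-local configuration."""

    def __init__(self, submit_template: str, status_template: str):
        self.submit_template = submit_template
        self.status_template = status_template

    def submit_command(self, script: str) -> list[str]:
        # Quote the argument so the filled-in template splits back into clean argv.
        return shlex.split(self.submit_template.format(script=shlex.quote(script)))

    def status_command(self, job_id: str) -> list[str]:
        return shlex.split(self.status_template.format(job_id=shlex.quote(job_id)))


# Hypothetical site configuration; these command names are placeholders,
# not real batch system commands.
SITE_CONFIG = {
    "submit_template": "site_submit --queue short {script}",
    "status_template": "site_status {job_id}",
}

adapter = TemplateAdapter(**SITE_CONFIG)
print(adapter.submit_command("job.sh"))
print(adapter.status_command("42"))
```

The job-manager only ever calls `submit_command` and `status_command`, so a site with an unusual batch configuration changes its templates rather than the job-manager itself.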
Scalability
The Globus gatekeeper starts up a process at the gateway node for every job entering the site.
This limits the number of grid jobs at a site to around 300 for a typical commodity computer.
We split single grid jobs into multiple batch processes in the SAM-Grid job-managers. This increases both the scalability and the manageability of the job.
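The splitting idea can be illustrated with a small sketch. The slides do not say what the unit of splitting is, so this example assumes it is a list of input files; the function names and the status vocabulary are likewise assumptions.

```python
def split_job(input_files, files_per_process):
    """Partition one grid job's inputs into chunks, one chunk per batch process."""
    if files_per_process < 1:
        raise ValueError("files_per_process must be at least 1")
    return [
        input_files[i:i + files_per_process]
        for i in range(0, len(input_files), files_per_process)
    ]


def aggregate_status(states):
    """Collapse the states of the batch processes into one grid-job state."""
    if any(s == "failed" for s in states):
        return "failed"
    if states and all(s == "done" for s in states):
        return "done"
    if any(s in ("running", "done") for s in states):
        return "running"
    return "pending"


files = [f"f{i}" for i in range(10)]
chunks = split_job(files, 4)
print([len(c) for c in chunks])
print(aggregate_status(["done", "running", "pending"]))
```

The gateway then tracks a single aggregated grid job while the batch system runs several processes, which is the aggregation the slide on challenges asks for.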
Comprehensiveness
The standard job-managers interface only to the local batch system.
We notify other fabric services when a job enters a site:
  Data handling: for data pre-staging
  Monitoring: to monitor a non-running job
  Database: to aggregate queries
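The notification step can be sketched as a simple publish/subscribe hook. This is an illustrative assumption about the shape of the mechanism: the `JobEntryNotifier` class, the listener functions, and the job dictionary fields are hypothetical, and the listeners only record what a real service would act on.

```python
class JobEntryNotifier:
    """Calls every registered fabric service when a job enters the site."""

    def __init__(self):
        self._listeners = []

    def register(self, listener):
        self._listeners.append(listener)

    def job_entered(self, job):
        for listener in self._listeners:
            listener(job)


events = []  # stands in for the side effects of the real services


def data_handling(job):
    # A real service would start pre-staging these files.
    events.append(("prestage", job["id"], tuple(job["inputs"])))


def monitoring(job):
    # A real service would start tracking the job before it runs.
    events.append(("track", job["id"], "queued"))


def database(job):
    # A real service would prepare aggregated queries for the job.
    events.append(("aggregate_queries", job["id"]))


notifier = JobEntryNotifier()
for service in (data_handling, monitoring, database):
    notifier.register(service)

notifier.job_entered({"id": "job-1", "inputs": ["f1", "f2"]})
print(events)
```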
Robustness
The standard job-managers cannot react to temporary failures of the local batch systems.
In our experience, PBS, Condor and BQS have failed to report the status of a job.
We write wrappers around the batch systems that implement extra robustness. We call them “idealizers”.
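The slides do not describe what an idealizer does internally. One plausible strategy, shown here purely as an assumption, is to retry the status query and fall back to the last known state when the batch system fails to report. The `Idealizer` class and the convention that a failed query returns `None` are both hypothetical.

```python
import time


class Idealizer:
    """Wraps a batch system status query with retries and a last-known-state fallback."""

    def __init__(self, query, retries=3, delay=0.0):
        self.query = query        # callable: job_id -> state, or None if not reported
        self.retries = retries
        self.delay = delay        # seconds to wait between attempts
        self._last_known = {}

    def status(self, job_id):
        for _ in range(self.retries):
            state = self.query(job_id)
            if state is not None:
                self._last_known[job_id] = state
                return state
            time.sleep(self.delay)
        # The batch system never answered: report what was last seen, if anything.
        return self._last_known.get(job_id, "unknown")


# Simulated batch system that fails twice, answers once, then goes silent.
responses = iter([None, None, "running", None, None, None])


def flaky_query(job_id):
    return next(responses)


idealizer = Idealizer(flaky_query, retries=3)
first = idealizer.status("42")
second = idealizer.status("42")
print(first, second)
```

With this wrapper the job-manager sees a consistent answer across a temporary batch system outage instead of concluding that the job has disappeared.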