THE INFN GRID PROJECT
Scope: Study and develop a general INFN computing infrastructure, based on GRID technologies, to be validated (as first use case) implementing distributed Regional Center prototypes for LHC expts: ATLAS, CMS, ALICE and, later on, also for other INFN expts (Virgo, Gran Sasso ….)
Project Status:
Outline of proposal submitted to INFN management 13-1-2000
3 year duration
Next meeting with INFN management 18th of February
Feedback documents from LHC expts (sites, FTEs…) by end of February
Final proposal to INFN by end of March
INFN & “Grid Related Projects”
Globus tests
“Condor on WAN” as general purpose computing resource
“GRID” working group to analyze viable and useful solutions (LHC computing, Virgo…)
Global architecture that allows strategies for the discovery, allocation, reservation and management of resource collections
MONARC project related activities
Evaluation of the Globus Toolkit
5 sites Testbed (Bologna, CNAF, LNL, Padova, Roma1)
Use case: HLT CMS studies, MC Prod., complete HLT chain
Services to test/implement (a workflow sketch follows the list):
Resource management: fork(); interface to different local resource managers (Condor, LSF)
Resources chosen by hand; Smart Broker to implement a global resource manager
Data Mover (GASS, GSIFTP…) to stage executable and input files and to retrieve output files
Bookkeeping (is this worth a general tool?)
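As an illustration only, here is a minimal sketch of that stage-in / run / stage-out cycle. The stage_in, submit and stage_out helpers are hypothetical stand-ins for GASS/GSIFTP transfers and the local resource manager interface; the host name is invented and nothing below is taken from the project's actual tools.

```python
# Hypothetical sketch of the stage-in / run / stage-out cycle described above.
# scp/ssh stand in for GASS/GSIFTP transfers and the local resource manager interface.
import subprocess

def stage_in(host: str, files: list[str]) -> None:
    """Copy executable and input files to the execution host."""
    for f in files:
        subprocess.run(["scp", f, f"{host}:run/"], check=True)

def submit(host: str, executable: str, args: list[str]) -> None:
    """Hand the job to the local resource manager (Condor, LSF, or plain fork())."""
    subprocess.run(["ssh", host, f"cd run && ./{executable} " + " ".join(args)], check=True)

def stage_out(host: str, outputs: list[str]) -> None:
    """Retrieve output files produced by the job."""
    for f in outputs:
        subprocess.run(["scp", f"{host}:run/{f}", "."], check=True)

if __name__ == "__main__":
    site = "pccms1.bo.infn.it"  # assumed host name, chosen "by hand" as in the slide
    stage_in(site, ["cmsim", "hlt_input.dat"])
    submit(site, "cmsim", ["hlt_input.dat"])
    stage_out(site, ["hlt_output.dat"])
```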
Use Case: CMS HLT studies
Status
Globus installed on 5 Linux PCs in 3 sites
Globus Security Infrastructure works!!
MDS: initial problems accessing data (long response times and time-outs)
GRAM, GASS, Gloperf: work in progress
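MDS in the Globus Toolkit is an LDAP-based information service, so the response-time problems can be probed with a plain LDAP query along the following lines. The host name and base DN are assumptions (2135 was the usual MDS port); this is not a record of the testbed configuration.

```python
# Hedged example: query a Globus MDS (LDAP) server and time the response.
import time
import ldap  # python-ldap package

MDS_URI = "ldap://grid-mds.cnaf.infn.it:2135"  # assumed host; 2135 was the usual MDS port
BASE_DN = "o=Grid"                             # assumed base DN, depends on the MDS setup

conn = ldap.initialize(MDS_URI)
conn.set_option(ldap.OPT_NETWORK_TIMEOUT, 10)  # avoid hanging on the time-outs noted above

start = time.time()
entries = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, "(objectClass=*)")
print(f"{len(entries)} entries in {time.time() - start:.1f}s")
```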
Condor on WAN
Objectives
Large INFN project of the Computing Commission involving ~20 sites
INFN collaboration with Condor Team UWISC
1st goal: Condor “tuning” on WAN
Verify Condor reliability and robustness in a Wide Area Network environment
Verify suitability to INFN computing needs
Network I/O impact and measurements
2nd goal: Network as a Condor resource
Dynamic checkpointing and checkpoint domain configuration
Pool partitioned into checkpoint domains (a dedicated ckpt server for each domain)
Definition of a checkpoint domain according to:
Presence of a sufficiently large CPU capacity
Presence of a set of machines with efficient network connectivity
Sub-pools
Checkpointing: next step
Distributed dynamic checkpointing (see the sketch below)
Pool machines select the “best” checkpoint server (from a network point of view)
Association between execution machine and checkpoint server dynamically decided
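A minimal sketch of this idea, under the assumption that “best from a network point of view” means lowest round-trip time: each execution machine probes the candidate ckpt servers and binds to the closest one. Server names and the TCP-connect probe are illustrative, not the pool's actual mechanism.

```python
# Illustrative sketch: pick the checkpoint server with the lowest measured RTT.
# Server names are invented; a real pool would take them from its configuration.
import socket
import time

CKPT_SERVERS = ["ckpt.cnaf.infn.it", "ckpt.bo.infn.it", "ckpt.pd.infn.it"]  # assumed names

def rtt(host: str, port: int = 80, timeout: float = 2.0) -> float:
    """Rough network-distance estimate: time to open a TCP connection."""
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.time() - start
    except OSError:
        return float("inf")  # unreachable servers are never chosen

def best_ckpt_server(servers: list[str]) -> str:
    """Dynamically associate this execution machine with the 'closest' ckpt server."""
    return min(servers, key=rtt)

print("checkpointing to", best_ckpt_server(CKPT_SERVERS))
```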
Implementation
Characteristics of the INFN Condor pool:
Single pool: to optimize CPU usage of all INFN hosts
Sub-pools: to define policies/priorities on resource usage
Checkpoint domains: to guarantee the performance and efficiency of the system and to reduce network traffic for checkpointing activity
GARR-B Topology
155 Mbps ATM based network with access points (PoP) and main transport nodes
(Figure: map of the GARR-B network across the INFN sites, with 155 Mbps and T3 links towards the USA.)
INFN Condor Pool on WAN: checkpoint domains
(Figure: map of the checkpoint domains across the INFN sites with the number of hosts per domain, the Central Manager and default CKPT domain at CNAF, and USA resources reached via ESnet; overall scale of 500-1000 machines with 6 to 25 ckpt servers.)
Management
Central management ([email protected])
Local management ([email protected])
Steering committee
Software maintenance contract with the Condor support team of the University of Wisconsin-Madison
INFN-GRID project requirements
Networked Workload Management (a toy example follows the list):
- Optimal co-allocation of data, CPU and network for a specific “grid/network-aware” job
- Distributed scheduling (data and/or code migration)
- Unscheduled/scheduled job submission
- Management of heterogeneous computing systems
- Uniform interface to various local resource managers and schedulers
- Priorities, policies on resource (CPU, Data, Network) usage
- Bookkeeping and ‘web’ user interface
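To make the co-allocation and priority requirements concrete, a toy ranking function follows; the sites, weights and per-experiment priority table are invented for illustration and do not come from the project.

```python
# Toy broker ranking for "grid/network-aware" co-allocation of data, CPU and network.
# Sites, weights and the priority table are invented for illustration.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    has_input_data: bool   # data locality
    free_cpus: int         # CPU availability
    bandwidth_mbps: float  # network capacity towards the submitting site

PRIORITY = {"cms-hlt": 2.0, "virgo": 1.0}  # assumed per-experiment policy weights

def score(site: Site, experiment: str) -> float:
    data_term = 100.0 if site.has_input_data else 0.0  # avoid moving data if possible
    return PRIORITY[experiment] * (data_term + site.free_cpus + 0.1 * site.bandwidth_mbps)

sites = [
    Site("cnaf", True, 20, 155.0),
    Site("lnl", False, 80, 34.0),
    Site("roma1", True, 5, 155.0),
]
best = max(sites, key=lambda s: score(s, "cms-hlt"))
print("submit to", best.name)
```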
Networked Data Management (see the sketch below):
- Universal name space: transparent, location independent
- Data replication and caching
- Data mover (scheduled/interactive at OBJ/file/DB granularity)
- Loose synchronization between replicas
- Application metadata, interfaced with DBMS, i.e. Objectivity, …
- Network services definition for a given application
- End systems network protocol tuning
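A sketch of the universal-name-space requirement, assuming a hypothetical replica catalogue: a logical file name resolves to the “nearest” physical replica, and the result is cached. The catalogue contents and the cost function are assumptions.

```python
# Sketch of location-independent naming: logical file name -> nearest physical replica.
# The replica catalogue and site "distances" are invented for illustration.
REPLICA_CATALOG = {
    "lfn:/cms/hlt/run42/events.db": [
        "gsiftp://data.cern.ch/cms/hlt/run42/events.db",
        "gsiftp://data.cnaf.infn.it/cms/hlt/run42/events.db",
    ],
}
SITE_COST = {"cern.ch": 10, "cnaf.infn.it": 1}  # assumed network cost from the client

_cache: dict[str, str] = {}  # loose caching of already-resolved names

def resolve(lfn: str) -> str:
    """Return the physical replica with the lowest network cost for this client."""
    if lfn in _cache:
        return _cache[lfn]
    replicas = REPLICA_CATALOG[lfn]
    best = min(
        replicas,
        key=lambda url: min((c for site, c in SITE_COST.items() if site in url),
                            default=float("inf")),
    )
    _cache[lfn] = best
    return best

print(resolve("lfn:/cms/hlt/run42/events.db"))
```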
Project req. (cont.)
Application Monitoring/Management (an instrumentation sketch follows):
- Performance: “instrumented systems” with timing information and analysis tools
- Run-time analysis of collected application events
- Bottleneck analysis
- Dynamic monitoring of GRID resources to optimize resource allocation
- Failure management
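As one way to read “instrumented systems with timing information”, the sketch below records per-step timing events at run time so a simple bottleneck analysis can flag the slowest step; it is not the project's monitoring toolset.

```python
# Sketch of run-time instrumentation: record timing events, then flag bottlenecks.
import time
from collections import defaultdict

events: dict[str, list[float]] = defaultdict(list)  # collected application events

def instrumented(name: str):
    """Decorator adding timing information to an application step."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                events[name].append(time.time() - start)
        return inner
    return wrap

@instrumented("stage_in")
def stage_in():
    time.sleep(0.2)  # stand-in for a real transfer

@instrumented("reconstruct")
def reconstruct():
    time.sleep(0.05)  # stand-in for real processing

stage_in()
reconstruct()
bottleneck = max(events, key=lambda k: sum(events[k]))
print("slowest step:", bottleneck, events[bottleneck])
```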
Project req. (cont.)
Computing Fabric and general utilities for a global managed Grid (a monitoring sketch follows):
- Configuration management of computing facilities
- Automatic software installation and maintenance
- System, service, network monitoring and global alarm notification, automatic recovery from failures
- Resource use accounting
- Security of GRID resources and infrastructure usage
- Information service
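A minimal sketch of the monitoring, alarm-notification and accounting requirements: poll a few services, raise an alarm on failure, and append a resource-usage record. Endpoints, the alarm channel and the accounting format are assumptions.

```python
# Sketch of service monitoring with alarm notification and simple resource accounting.
# Host names, the alarm channel and the accounting format are invented for illustration.
import socket
import time

SERVICES = [("gram.cnaf.infn.it", 2119), ("gsiftp.cnaf.infn.it", 2811)]  # assumed endpoints

def is_up(host: str, port: int) -> bool:
    """Basic reachability check for a network service."""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False

def alarm(message: str) -> None:
    print("ALARM:", message)  # stand-in for mail/pager/global alarm notification

def account(user: str, cpu_seconds: float) -> None:
    with open("accounting.log", "a") as log:  # resource-use accounting record
        log.write(f"{time.time():.0f} {user} {cpu_seconds:.1f}\n")

for host, port in SERVICES:
    if not is_up(host, port):
        alarm(f"{host}:{port} is unreachable")
account("cms001", 3600.0)
```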
Project req. (cont.)
Logical layout of the multi-tier client-server model
(Figure: CERN Tier 0 serving a Tier 1 data server, Tier 2/3 data servers below it, and clients/desktops at the bottom.)
Grid Tools