THE INFN GRID PROJECT
Scope: Study and develop a general INFN computing infrastructure, based on GRID technologies, to be validated (as first use case) implementing distributed Regional Center prototypes for LHC expts: ATLAS, CMS, ALICE and, later on, also for other INFN expts (Virgo, Gran Sasso ….)
Project Status:
Outline of proposal submitted to INFN management 13-1-2000
3 year duration
Next meeting with INFN management 18th of February
Feedback documents from LHC expts (sites, FTEs…) by end of February
Final proposal to INFN by end of March
INFN & “Grid Related Projects”
Globus tests
“Condor on WAN” as general purpose computing resource
“GRID” working group to analyze viable and useful solutions (LHC computing, Virgo…)
Global architecture that allows strategies for the discovery, allocation, reservation and management of resource collections
MONARC project related activities
Evaluation of the Globus Toolkit
5 sites Testbed (Bologna, CNAF, LNL, Padova, Roma1)
Use case: HLT CMS studies, MC Prod., complete HLT chain
Services to test/implement (a workflow sketch follows the list):
Resource management: fork(); interface to different local resource managers (Condor, LSF)
Resources chosen by hand; Smart Broker to implement a global resource manager
Data Mover (GASS, GSIFTP…) to stage executable and input files and to retrieve output files
Bookkeeping (is this worth a general tool?)
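As an illustration only, here is a minimal sketch of that stage-in / run / stage-out cycle. The stage_in, submit and stage_out helpers are hypothetical stand-ins for GASS/GSIFTP transfers and the local resource manager interface; the host name is invented and nothing below is taken from the project's actual tools.

```python
# Hypothetical sketch of the stage-in / run / stage-out cycle described above.
# scp/ssh stand in for GASS/GSIFTP transfers and the local resource manager interface.
import subprocess

def stage_in(host: str, files: list[str]) -> None:
    """Copy executable and input files to the execution host."""
    for f in files:
        subprocess.run(["scp", f, f"{host}:run/"], check=True)

def submit(host: str, executable: str, args: list[str]) -> None:
    """Hand the job to the local resource manager (Condor, LSF, or plain fork())."""
    subprocess.run(["ssh", host, f"cd run && ./{executable} " + " ".join(args)], check=True)

def stage_out(host: str, outputs: list[str]) -> None:
    """Retrieve output files produced by the job."""
    for f in outputs:
        subprocess.run(["scp", f"{host}:run/{f}", "."], check=True)

if __name__ == "__main__":
    site = "pccms1.bo.infn.it"  # assumed host name, chosen "by hand" as in the slide
    stage_in(site, ["cmsim", "hlt_input.dat"])
    submit(site, "cmsim", ["hlt_input.dat"])
    stage_out(site, ["hlt_output.dat"])
```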
Use Case: CMS HLT studies
Status
Globus installed on 5 Linux PCs in 3 sites
Globus Security Infrastructure works!!
MDS: initial problems accessing data (long response times and time-outs)
GRAM, GASS, Gloperf: work in progress
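MDS in the Globus Toolkit is an LDAP-based information service, so the response-time problems can be probed with a plain LDAP query along the following lines. The host name and base DN are assumptions (2135 was the usual MDS port); this is not a record of the testbed configuration.

```python
# Hedged example: query a Globus MDS (LDAP) server and time the response.
import time
import ldap  # python-ldap package

MDS_URI = "ldap://grid-mds.cnaf.infn.it:2135"  # assumed host; 2135 was the usual MDS port
BASE_DN = "o=Grid"                             # assumed base DN, depends on the MDS setup

conn = ldap.initialize(MDS_URI)
conn.set_option(ldap.OPT_NETWORK_TIMEOUT, 10)  # avoid hanging on the time-outs noted above

start = time.time()
entries = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE, "(objectClass=*)")
print(f"{len(entries)} entries in {time.time() - start:.1f}s")
```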
Condor on WAN
Objectives
Large INFN project of the Computing Commission involving ~20 sites
INFN collaboration with Condor Team UWISC
1st goal: Condor “tuning” on WAN
Verify Condor reliability and robustness in a Wide Area Network environment
Verify suitability to INFN computing needs
Network I/O impact and measurements
2nd goal: Network as a Condor resource
Dynamic checkpointing and checkpoint domain configuration
Pool partitioned into checkpoint domains (a dedicated ckpt server for each domain)
Definition of a checkpoint domain according to:
Presence of a sufficiently large CPU capacity
Presence of a set of machines with efficient network connectivity
Sub-pools
Checkpointing: next step
Distributed dynamic checkpointing (see the sketch below)
Pool machines select the “best” checkpoint server (from a network point of view)
Association between execution machine and checkpoint server dynamically decided
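A minimal sketch of this idea, under the assumption that “best from a network point of view” means lowest round-trip time: each execution machine probes the candidate ckpt servers and binds to the closest one. Server names and the TCP-connect probe are illustrative, not the pool's actual mechanism.

```python
# Illustrative sketch: pick the checkpoint server with the lowest measured RTT.
# Server names are invented; a real pool would take them from its configuration.
import socket
import time

CKPT_SERVERS = ["ckpt.cnaf.infn.it", "ckpt.bo.infn.it", "ckpt.pd.infn.it"]  # assumed names

def rtt(host: str, port: int = 80, timeout: float = 2.0) -> float:
    """Rough network-distance estimate: time to open a TCP connection."""
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.time() - start
    except OSError:
        return float("inf")  # unreachable servers are never chosen

def best_ckpt_server(servers: list[str]) -> str:
    """Dynamically associate this execution machine with the 'closest' ckpt server."""
    return min(servers, key=rtt)

print("checkpointing to", best_ckpt_server(CKPT_SERVERS))
```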
Implementation
Characteristics of the INFN Condor pool:
Single pool: to optimize CPU usage of all INFN hosts
Sub-pools: to define policies/priorities on resource usage
Checkpoint domains: to guarantee the performance and efficiency of the system and to reduce network traffic for checkpointing activity
GARR-B Topology
155 Mbps ATM based network with access points (PoP) and main transport nodes
(Figure: map of the GARR-B network across the INFN sites, with 155 Mbps and T3 links towards the USA.)
INFN Condor Pool on WAN: checkpoint domains
(Figure: map of the checkpoint domains across the INFN sites with the number of hosts per domain, the Central Manager and default CKPT domain at CNAF, and USA resources reached via ESnet; overall scale of 500-1000 machines with 6 to 25 ckpt servers.)
Management
Central management ([email protected])
Local management ([email protected])
Steering committee
Software maintenance contract with the Condor support team of the University of Wisconsin-Madison
INFN-GRID project requirements
Networked Workload Management (a toy example follows the list):
- Optimal co-allocation of data, CPU and network for a specific “grid/network-aware” job
- Distributed scheduling (data and/or code migration)
- Unscheduled/scheduled job submission
- Management of heterogeneous computing systems
- Uniform interface to various local resource managers and schedulers
- Priorities, policies on resource (CPU, Data, Network) usage
- Bookkeeping and ‘web’ user interface
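To make the co-allocation and priority requirements concrete, a toy ranking function follows; the sites, weights and per-experiment priority table are invented for illustration and do not come from the project.

```python
# Toy broker ranking for "grid/network-aware" co-allocation of data, CPU and network.
# Sites, weights and the priority table are invented for illustration.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    has_input_data: bool   # data locality
    free_cpus: int         # CPU availability
    bandwidth_mbps: float  # network capacity towards the submitting site

PRIORITY = {"cms-hlt": 2.0, "virgo": 1.0}  # assumed per-experiment policy weights

def score(site: Site, experiment: str) -> float:
    data_term = 100.0 if site.has_input_data else 0.0  # avoid moving data if possible
    return PRIORITY[experiment] * (data_term + site.free_cpus + 0.1 * site.bandwidth_mbps)

sites = [
    Site("cnaf", True, 20, 155.0),
    Site("lnl", False, 80, 34.0),
    Site("roma1", True, 5, 155.0),
]
best = max(sites, key=lambda s: score(s, "cms-hlt"))
print("submit to", best.name)
```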
Networked Data Management (see the sketch below):
- Universal name space: transparent, location independent
- Data replication and caching
- Data mover (scheduled/interactive at OBJ/file/DB granularity)
- Loose synchronization between replicas
- Application metadata, interfaced with DBMS, i.e. Objectivity, …
- Network services definition for a given application
- End systems network protocol tuning
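A sketch of the universal-name-space requirement, assuming a hypothetical replica catalogue: a logical file name resolves to the “nearest” physical replica, and the result is cached. The catalogue contents and the cost function are assumptions.

```python
# Sketch of location-independent naming: logical file name -> nearest physical replica.
# The replica catalogue and site "distances" are invented for illustration.
REPLICA_CATALOG = {
    "lfn:/cms/hlt/run42/events.db": [
        "gsiftp://data.cern.ch/cms/hlt/run42/events.db",
        "gsiftp://data.cnaf.infn.it/cms/hlt/run42/events.db",
    ],
}
SITE_COST = {"cern.ch": 10, "cnaf.infn.it": 1}  # assumed network cost from the client

_cache: dict[str, str] = {}  # loose caching of already-resolved names

def resolve(lfn: str) -> str:
    """Return the physical replica with the lowest network cost for this client."""
    if lfn in _cache:
        return _cache[lfn]
    replicas = REPLICA_CATALOG[lfn]
    best = min(
        replicas,
        key=lambda url: min((c for site, c in SITE_COST.items() if site in url),
                            default=float("inf")),
    )
    _cache[lfn] = best
    return best

print(resolve("lfn:/cms/hlt/run42/events.db"))
```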
Project req. (cont.)
Application Monitoring/Management (an instrumentation sketch follows):
- Performance: “instrumented systems” with timing information and analysis tools
- Run-time analysis of collected application events
- Bottleneck analysis
- Dynamic monitoring of GRID resources to optimize resource allocation
- Failure management
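As one way to read “instrumented systems with timing information”, the sketch below records per-step timing events at run time so a simple bottleneck analysis can flag the slowest step; it is not the project's monitoring toolset.

```python
# Sketch of run-time instrumentation: record timing events, then flag bottlenecks.
import time
from collections import defaultdict

events: dict[str, list[float]] = defaultdict(list)  # collected application events

def instrumented(name: str):
    """Decorator adding timing information to an application step."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                events[name].append(time.time() - start)
        return inner
    return wrap

@instrumented("stage_in")
def stage_in():
    time.sleep(0.2)  # stand-in for a real transfer

@instrumented("reconstruct")
def reconstruct():
    time.sleep(0.05)  # stand-in for real processing

stage_in()
reconstruct()
bottleneck = max(events, key=lambda k: sum(events[k]))
print("slowest step:", bottleneck, events[bottleneck])
```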
Project req. (cont.)
Computing Fabric and general utilities for a global managed Grid (a monitoring sketch follows):
- Configuration management of computing facilities
- Automatic software installation and maintenance
- System, service, network monitoring and global alarm notification, automatic recovery from failures
- Resource use accounting
- Security of GRID resources and infrastructure usage
- Information service
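A minimal sketch of the monitoring, alarm-notification and accounting requirements: poll a few services, raise an alarm on failure, and append a resource-usage record. Endpoints, the alarm channel and the accounting format are assumptions.

```python
# Sketch of service monitoring with alarm notification and simple resource accounting.
# Host names, the alarm channel and the accounting format are invented for illustration.
import socket
import time

SERVICES = [("gram.cnaf.infn.it", 2119), ("gsiftp.cnaf.infn.it", 2811)]  # assumed endpoints

def is_up(host: str, port: int) -> bool:
    """Basic reachability check for a network service."""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False

def alarm(message: str) -> None:
    print("ALARM:", message)  # stand-in for mail/pager/global alarm notification

def account(user: str, cpu_seconds: float) -> None:
    with open("accounting.log", "a") as log:  # resource-use accounting record
        log.write(f"{time.time():.0f} {user} {cpu_seconds:.1f}\n")

for host, port in SERVICES:
    if not is_up(host, port):
        alarm(f"{host}:{port} is unreachable")
account("cms001", 3600.0)
```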
Project req. (cont.)
Logical layout of the multi-tier client-server model
(Figure: CERN Tier 0 serving a Tier 1 data server, Tier 2/3 data servers below it, and clients/desktops at the bottom.)
Grid Tools