The LCG Project
DESCRIPTION
The LCG Project. Ian Bird, IT Department, CERN. ISGC 2004, Taipei, 27-28 July 2004. The Large Hadron Collider Project: 4 detectors (ALICE, ATLAS, CMS, LHCb). Requirements for world-wide data analysis: storage – raw recording rate 0.1-1 GBytes/sec, accumulating at 15 PetaBytes/year.
The LCG Project
Ian Bird, IT Department, CERN
ISGC 2004, Taipei, 27-28 July 2004
The Large Hadron Collider Project
4 detectors: ALICE, ATLAS, CMS, LHCb
Requirements for world-wide data analysis
Storage – Raw recording rate 0.1 – 1 GBytes/sec
Accumulating at 15 PetaBytes/year
40 PetaBytes of disk
Processing – 100,000 of today’s fastest PCs
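As a quick sanity check on these numbers, a back-of-envelope sketch; reconciling the instantaneous rate with the yearly total via a partial duty cycle is an assumption, not stated on the slide:

```python
# Back-of-envelope check of the LHC data rates quoted above.
# Assumption (not from the slides): an effective data-taking duty
# cycle is what reconciles a 0.1-1 GB/s raw rate with 15 PB/year.
GB = 1e9                 # bytes
PB = 1e15                # bytes
YEAR = 365 * 24 * 3600   # seconds

raw_rate_low, raw_rate_high = 0.1 * GB, 1.0 * GB   # bytes/sec
accumulated = 15 * PB                              # bytes/year

# Continuous recording at the quoted rates would give:
low = raw_rate_low * YEAR / PB     # ~3.2 PB/year
high = raw_rate_high * YEAR / PB   # ~31.5 PB/year

print(f"continuous recording: {low:.1f} - {high:.1f} PB/year")
print(f"quoted accumulation: {accumulated / PB:.0f} PB/year")
# 15 PB/year sits inside that range, consistent with a partial duty cycle.
```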
Large distributed community
CMS, ATLAS, LHCb: ~5000 physicists around the world
"Offline" software effort: 1000 person-years per experiment
Software life span: 20 years
LCG – Goals
The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments.
Two phases:
Phase 1: 2002-2005 – build a service prototype based on existing grid middleware; gain experience in running a production grid service; produce the Technical Design Report (TDR) for the final system
Phase 2: 2006-2008 – build and commission the initial LHC computing environment
LCG is not a development project – it relies on other grid projects for grid middleware development and support.
Introduction – the LCG Project
The LHC Computing Grid (LCG) is a grid deployment project:
- Prototype computing environment for LHC
- Focus on building a production-quality service
- Learn how to maintain and operate a global-scale production grid
- Gain experience in close collaboration between regional (resource) centres
- Understand how to integrate fully with existing computing services
- Build on the results of earlier research projects; learn how to move from test-beds to production services
- Address policy-like issues needing agreement between collaborating sites
LHC Computing Grid Project – a Collaboration
Building and operating the LHC Grid involves a collaboration of:
- The physicists and computing specialists from the LHC experiments (researchers)
- The projects in the US and Europe that have been developing Grid middleware (software engineers)
- The regional and national computing centres that provide resources for LHC, and usually also for other physics experiments and other sciences (service providers)
The Project Organisation
Applications Area: development environment, joint projects, data management, distributed analysis
Middleware Area: provision of a base set of grid middleware – acquisition, development, integration, testing, support
CERN Fabric Area: large cluster management, data recording, cluster technology, networking, computing service at CERN
Grid Deployment Area: establishing and managing the Grid Service – middleware certification, security, operations, registration, authorisation, accounting
ARDA: prototyping and testing grid middleware for experiment analysis
Applications Area Projects
- Software Process and Infrastructure (SPI) (A. Aimar): librarian, QA, testing, developer tools, documentation, training, …
- Persistency Framework & Database Applications (POOL) (D. Duellmann): relational persistent data store, conditions database, collections
- Core Tools and Services (SEAL) (P. Mato): foundation and utility libraries, basic framework services, object dictionary and whiteboard, maths libraries
- Physicist Interface (PI) (V. Innocente): interfaces and tools by which physicists directly use the software; interactive analysis, visualization
- Simulation (T. Wenaus): generic framework, Geant4, FLUKA integration, physics validation, generator services
- ROOT (R. Brun): ROOT I/O event store; analysis package
POOL – Object Persistency
- Bulk event data storage: an object store based on ROOT I/O, with full support for persistent references automatically resolved to objects anywhere on the grid; recently extended to support updateable metadata as well (with some limitations)
- File cataloguing: three implementations, using grid middleware (the EDG version of RLS), a relational DB (MySQL), and local files (XML)
- Event metadata: event collections with queryable metadata (physics tags etc.)
- Transient data cache: optional component by which POOL can manage transient instances of persistent objects
- POOL project scope now extended to include the Conditions Database
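To illustrate the pluggable-catalogue idea behind those three implementations, a minimal sketch; the interface and class names are invented for illustration and are not POOL's actual API:

```python
# Hypothetical sketch of a pluggable file catalogue, illustrating the
# three-backend idea (local XML file, relational DB, grid replica catalogue).
# This is NOT the real POOL API; all names here are invented.
from abc import ABC, abstractmethod

class FileCatalog(ABC):
    """Maps a logical file name (LFN) to physical file names (PFNs)."""
    @abstractmethod
    def register(self, lfn: str, pfn: str) -> None: ...
    @abstractmethod
    def lookup(self, lfn: str) -> list[str]: ...

class XMLCatalog(FileCatalog):
    """Stands in for a local, single-user XML-file catalogue
    (entries kept in memory here to keep the sketch short)."""
    def __init__(self, path: str):
        self.path, self._entries = path, {}
    def register(self, lfn, pfn):
        self._entries.setdefault(lfn, []).append(pfn)
    def lookup(self, lfn):
        return self._entries.get(lfn, [])

# A MySQLCatalog or RLSCatalog would implement the same interface, so
# client code can switch backends (laptop -> site DB -> grid) without
# changing how it resolves persistent references.
def resolve(catalog: FileCatalog, lfn: str) -> str:
    replicas = catalog.lookup(lfn)
    if not replicas:
        raise LookupError(f"no replica registered for {lfn}")
    return replicas[0]  # a real system would pick the "closest" replica
```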
POOL Component Breakdown
POOL API, built from three component groups:
- Storage Service: ROOT I/O storage service; RDBMS storage service
- File Catalog: XML catalog; RDBMS catalog; EDG Replica Location Service
- Collections: explicit collections; implicit collections
Simulation Project Organisation
Simulation Project Leader, with subprojects:
- Framework: generic interface to multiple simulation engines (Geant4, FLUKA), building on existing ALICE work (VMC)
- Geant4: part of the GEANT4 collaboration; developments aligned with and responding to needs from the LHC experiments
- FLUKA integration: FLUKA team participating in framework integration and physics validation
- Physics Validation: assess the adequacy of the simulation and physics environment for LHC; provide the focus for the LHC requirements
- Generator Services: generator librarian, common event files, validation/test suite, development when needed (HepMC, etc.)
Fabrics
Getting the data from the detector to the grid requires sustained data collection and distribution – keeping up with the accelerator.
To achieve the required levels of performance, reliability and resilience – at minimal cost (people, equipment) – we also have to work on the scalability and performance of some of the basic computing technologies: cluster management, mass storage management, high-performance networking.
Tens of thousands of disks
Thousands of processors
Hundreds of tape drives
Continuous evolution
Sustained throughput
Resilient to problems
Fabric Automation at CERN
[Architecture diagram: each node runs a configuration cache and software cache with agents (SPMA, NCM, MSA), fed from a central configuration database (CDB, Oracle-based) and a software repository (SWRep); state management (SMS) and hardware management (HMS) systems; LEMON monitoring]
Covers configuration, installation, fault & hardware management, and monitoring. Includes technology developed by DataGrid.
WAN connectivity
5.44 Gbps – 1.1 TB transferred in 30 minutes
6.63 Gbps – 25 June 2004
We now have to get from an R&D project (DATATAG) to a sustained, reliable service – Asia, Europe, US
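A quick unit check on the first of those figures; whether "TB" here is decimal or binary is not stated on the slide, so the sketch tries both:

```python
# Does "1.1 TB in 30 minutes" match the quoted 5.44 Gbps?
# Depends on whether TB means 10**12 or 2**40 bytes.
seconds = 30 * 60

for label, nbytes in (("decimal TB (10^12 B)", 1.1e12),
                      ("binary TiB (2^40 B)", 1.1 * 2**40)):
    gbps = nbytes * 8 / seconds / 1e9
    print(f"{label}: {gbps:.2f} Gbps")
# decimal TB (10^12 B): 4.89 Gbps
# binary TiB (2^40 B):  5.37 Gbps   <- close to the quoted 5.44 Gbps
```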
ALICE DC – MSS Bandwidth
[Chart: mass-storage-system bandwidth by year, 1998-2006 – initial goals vs. achieved, plus the LCG tape bandwidth target]
Sites in LCG-2/EGEE-0: 14 July 2004
Austria: U-Innsbruck
Canada: Triumf, Alberta, Carleton, Montreal, Toronto
Czech Republic: Prague-FZU, Prague-CESNET
France: CC-IN2P3, Clermont-Ferrand
Germany: FZK, Aachen, DESY, GSI, Karlsruhe-U, Wuppertal
Greece: HellasGrid
Hungary: Budapest
India: TIFR
Israel: Tel-Aviv, Weizmann
Italy: CNAF, Frascati, Legnaro, Milano, Napoli, Roma, Torino
Japan: Tokyo
Netherlands: NIKHEF, SARA
Pakistan: NCP
Poland: Krakow
Portugal: LIP
Russia: SINP-Moscow, ITEP, JINR-Dubna
Spain: PIC, UAM, USC, UB-Barcelona, IFCA, CIEMAT, IFIC
Switzerland: CERN, CSCS
Taiwan: ASCC, IPAS, NCU
UK: RAL, Birmingham, Cavendish, Glasgow, Imperial, Lancaster, Manchester, QMUL, RAL-PP, Sheffield, UCL, UCL-CCC
US: BNL, FNAL
HP: Puerto-Rico
22 countries, 64 sites (50 Europe, 2 US, 5 Canada, 6 Asia, 1 HP)
Coming: New Zealand, China, other HP (Brazil, Singapore)
> 6000 CPUs
LHC Computing Model (simplified!!)
Tier-0 – the accelerator centre ("online" to the data acquisition process; high availability; long-term commitment): filter raw data; reconstruct summary data (ESD); record raw data and ESD; distribute raw and ESD to the Tier-1s
Tier-1 – managed mass storage: permanent storage of raw, ESD, calibration data, meta-data, analysis data and databases; grid-enabled data service; data-heavy analysis; re-processing raw → ESD; national and regional support
Tier-2 – well-managed, grid-enabled disk storage: simulation; end-user analysis, batch and interactive; high-performance parallel analysis (PROOF)
Below the Tier-1s and Tier-2s: small centres, desktops and portables
[Map legend: LCG Tier 0/1 sites – RAL, IN2P3, BNL, FZK, CNAF, PIC, ICEPP, FNAL; other sites shown include TRIUMF, Taipei, USC, NIKHEF, Krakow, CIEMAT, Rome, CSCS, Legnaro, UB, IFCA, IC, MSU, Prague, Budapest, Cambridge, IFIC]
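To make the Tier-0 → Tier-1 flow concrete, a minimal sketch; the types and function are invented for illustration and are not LCG software:

```python
# Invented-name sketch of the tiered responsibilities described above.
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    tier: int
    holds: set[str] = field(default_factory=set)

def record_event_data(tier0: Site, tier1s: list[Site]) -> None:
    """Tier-0 records raw + ESD, then distributes both to each Tier-1."""
    tier0.holds |= {"raw", "ESD"}    # recorded at the accelerator centre
    for t1 in tier1s:
        t1.holds |= {"raw", "ESD"}   # permanent, grid-enabled storage

cern = Site("CERN", tier=0)
t1s = [Site(n, tier=1) for n in ("RAL", "BNL", "FZK")]
record_event_data(cern, t1s)
print({s.name: sorted(s.holds) for s in [cern, *t1s]})
# Tier-2 sites would pull ESD/analysis data from their Tier-1 and run
# simulation plus end-user analysis on top of this.
```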
The LCG Service
LCG-1 service started on 15 September 2003, with 12 sites.
LCG-2, with upgraded middleware, began to be deployed at the beginning of 2004; currently 64 sites with > 6000 CPUs.
During 2003 significant effort was expended to integrate VDT, EDG, and other tools, and to debug, patch, test, and certify the middleware.
LCG-2 is currently in use for the LHC experiments' data challenges.
LCG-2 forms the basis for the EGEE production service.
LCG Certification
Significant investment in the certification and testing process and team: skilled people capable of system-level debugging, tightly coupled to the VDT, Globus, and EDG teams. It needs significant hardware resources. This was essential in achieving a robust service.
Making production-quality software is expensive, time-consuming, and not glamorous! … and it takes very skilled people with a lot of experience.
Experiences in deployment
LCG covers many sites (>60) now, both large and small. Large sites, with existing infrastructures, need to add on grid interfaces etc. Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc.). Satisfying both simultaneously is hard – it requires very flexible packaging, installation, and configuration tools and procedures, and a lot of effort had to be invested in this area.
There are many problems, but in the end we are quite successful: the system is stable and reliable, is used in production, and is now reasonably easy to install (60 sites). We now have a basis on which to incrementally build essential functionality.
This infrastructure forms the basis of the initial EGEE production service.
The LCG Deployment Board
Grid Deployment Board (GDB): set up to address policy issues requiring agreement and negotiation between resource centres.
Members: country representatives, applications, and project managers.
Sets up working groups – short-term or ongoing – bringing in technical experts to focus on specific issues; the GDB approves recommendations from the working groups.
Groups:
- Several that outlined initial project directions (operations, security, resources, support)
- Security – a standing group covering many policy issues
- Grid Operations Centre task force
- User Support group
- Storage management and other focused issues
- Service challenges
Operations services for LCG
Operational support: hierarchical model
- CERN acts as 1st-level support for the Tier-1 centres
- Tier-1 centres provide 1st-level support for associated Tier-2s ("Tier-1 sites" = "primary sites")
- Grid Operations Centres (GOC) provide operational monitoring, troubleshooting, coordination of incident response, etc.
- RAL (UK) led a sub-project to prototype a GOC; a 2nd GOC in Taipei is now in operation – together providing 16-hour coverage; a 3rd centre in Canada/US is expected to help achieve 24-hour coverage
User support: central model
- FZK provides the user support portal – a problem tracking system, web-based and available to all LCG participants
- Experiments provide triage of problems; a CERN team provides in-depth support and support for integration of experiment software with grid middleware
GGUS – Concept
Target: 24×7 support via time difference and 3 support teams
Currently: GGUS FZK, GGUS ASCC
Desired: GGUS USA
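To make the follow-the-sun arithmetic concrete, a small sketch; the shift hours are illustrative assumptions expressed in UTC, not the teams' actual rosters:

```python
# Sketch of follow-the-sun coverage: how much of the day do two or
# three 8-hour support shifts cover? Shift hours are assumptions.
def coverage(shifts):
    """Union of (start_hour, length) shifts over a 24h clock, in hours."""
    covered = set()
    for start, length in shifts:
        covered |= {(start + h) % 24 for h in range(length)}
    return len(covered)

ascc = (0, 8)    # e.g. Taipei 08:00-16:00 local = 00:00-08:00 UTC
fzk  = (8, 8)    # e.g. Karlsruhe 10:00-18:00 local = 08:00-16:00 UTC
usa  = (16, 8)   # e.g. US central 11:00-19:00 local = 16:00-24:00 UTC

print(coverage([ascc, fzk]))        # 16 -> the current 16-hour coverage
print(coverage([ascc, fzk, usa]))   # 24 -> the 24x7 target
```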
Support Teams within LCG
Global Grid User Support (GGUS): single point of contact, coordination of user support. It routes problems to:
- CERN Deployment Support (CDS) – middleware problems
- Grid Operations Center (GOC) – operations problems
- Resource Centers (RC) – hardware problems
- Experiment Specific User Support (ESUS) – software problems
Users: the 4 LHC experiments (ALICE, ATLAS, CMS, LHCb), 4 non-LHC experiments (BaBar, CDF, Compass, D0), and other communities (VOs)
Security
LCG Security Group:
- LCG usage rules – proposed as general Grid usage guidelines
- Registration procedures and VO management: agreement to collect only a minimal amount of personal data; registration has limited validity
- Initial audit requirements are defined
- Initial incident response procedures: site security contacts etc. are defined
- Set of trusted CAs (including the Fermilab online KCA)
- Security policy
This group is now a Joint Security Group covering several grid projects/infrastructures.
LCG Security environment
The players, interacting through the Grid:
- Users: personal data, roles, usage patterns, …
- VOs: experiment data, access patterns, membership, …
- Sites: resources, availability, accountability, …
The Risks
Top risks from the Security Risk Analysis (http://proj-lcg-security.web.cern.ch/proj-lcg-security/RiskAnalysis/risk.html):
- Launch attacks on other sites – large distributed farms of machines
- Illegal or inappropriate distribution or sharing of data – massive distributed storage capacity
- Disruption by exploit of security holes – complex, heterogeneous and dynamic environment
- Damage caused by viruses, worms etc. – highly connected and novel infrastructure
Policy – the LCG Security Group
Documents (http://cern.ch/proj-lcg-security/documents.html):
Security & Availability Policy (joint); Usage Rules; Certification Authorities; Audit Requirements; GOC Guides; Incident Response; User Registration; Application Development & Network Admin Guide
Authentication Infrastructure
Users and services own long-lived (1-year) credentials: digital certificates (X.509 PKI).
European Grid Policy Management Authority: "… is a body to establish requirements and best practices for grid identity providers to enable a common trust domain applicable to authentication of end-entities in inter-organisational access to distributed resources. …" (www.eugridpma.org – covers EU + USA + Asia)
Jobs are submitted with grid proxy certificates: short-lived (<24 hr) credentials which "travel" with the job. Delegation allows a service to act on behalf of the user; a proxy renewal service supports long-running and queued jobs.
Some issues: do trust mechanisms scale up? "On-line" certification authorities & certificate stores (Kerberized CA, virtual smartcard); limited delegation.
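As an illustration of working with such credentials, a sketch that checks the remaining lifetime of a proxy file with the Python `cryptography` library; the proxy path below is the conventional Globus-style location for a uid-1000 user and is an assumption:

```python
# Sketch: inspect the remaining lifetime of a (proxy) certificate.
# Requires the 'cryptography' package; loads the first PEM certificate
# in the file, which for a proxy file is the proxy certificate itself.
from datetime import datetime
from pathlib import Path
from cryptography import x509

proxy_path = Path("/tmp/x509up_u1000")   # assumption: uid-1000 user proxy

cert = x509.load_pem_x509_certificate(proxy_path.read_bytes())
# not_valid_after is naive UTC; newer releases also offer not_valid_after_utc.
remaining = cert.not_valid_after - datetime.utcnow()

if remaining.total_seconds() <= 0:
    print("proxy expired - a long-running job would need proxy renewal")
else:
    print(f"proxy valid for another {remaining}")
```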
User Registration (2003-4)
Registration flow (via lcg-registrar.cern.ch):
1. User, holding a CA-issued certificate: "I agree to the Usage Rules, please register me, my VO is XYZ"
2. Confirmation by email
3. User details are passed on
4. The XYZ VO Manager registers the user
5. Notification; 6. user details propagate to site and resource authorization
The user, with certificate and accepted Usage Rules, can then submit jobs to the Grid (sites check against the trusted CA certificates).
User Registration (? 2004 - )
Some issues: static user mappings will not scale up; multiple VO membership; complex authorization & policy handling; the VO manager needs to validate user data – how?
Solutions: a VO Management Service with attribute proxy certificates
- Groups and roles – not just static user mapping
- Attributes bound to the proxy certificate, signed by the VO service
- Credential mapping and authorization: flexible policy intersection and mapping tools
- Integrate with organizational databases, but… what about exceptions (the 2-week summer student)? What about other VO models: lightweight, deployment, testing?
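A toy sketch of the difference between static mapping and attribute-based mapping, as described above; the policy table, accounts and attribute format are invented for illustration, not the real VOMS/mapping interfaces:

```python
# Static mapping (grid-mapfile style): one DN -> one local account.
# This is what "will not scale up" refers to.
grid_mapfile = {
    "/C=CH/O=CERN/CN=Some User": "atlas001",   # invented DN and account
}

# Attribute-based mapping: decide from the VO/group/role carried in the
# user's proxy (signed by the VO service), not from the individual DN.
from typing import Optional

policy = {
    ("atlas", "production"): "atlasprd",   # shared production account
    ("atlas", None):         "atlaspool",  # pool account, plain members
    ("cms",   None):         "cmspool",
}

def map_user(vo: str, role: Optional[str]) -> str:
    """Return the local account for a proxy carrying (vo, role)."""
    return policy.get((vo, role)) or policy[(vo, None)]

print(map_user("atlas", "production"))  # atlasprd
print(map_user("atlas", None))          # atlaspool
```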
Audit & Incident Response
Audit requirements: mandate retention of logs by sites.
Incident response: security contact data is gathered when a site registers; communication channels are established (mail lists maintained by the Deployment Team) – a list of CSIRT lists as the channel for reporting, security contacts at sites as the channel for discussion & resolution, plus an escalation path.
2004 Security Service Challenges: check that the data is there and complete, and that the communication channels are open.
Security Collaboration
Projects share resources and have close links, hence the need for inter-grid global security collaboration: commonly accepted usage rules, common authentication and authorization requirements, and common incident response channels.
LCG – EGEE – OSG: the LCG Security Group is now the Joint Security Group (JSG) for LCG, EGEE & OSG; it provides requirements for middleware development; some members from OSG are already in the JSG.
What is EGEE? (I)
EGEE (Enabling Grids for e-Science in Europe) is a seamless Grid infrastructure for the support of scientific research, which:
- Integrates current national, regional and thematic Grid efforts
- Provides researchers in academia and industry with round-the-clock access to major computing resources, independent of geographic location
[Diagram: applications, on top of the Grid infrastructure, on top of the Geant network]
What is EGEE? (II)
70 institutions in 28 countries, federated in regional Grids
32 M Euros EU funding (2004-5), O(100 M) total budget
Aiming for a combined capacity of over 8000 CPUs (the largest international Grid infrastructure ever assembled)
~ 300 persons
EGEE Activities
Emphasis on operating a production grid and supporting the end-users:
- 48% service activities: grid operations, support and management, network resource provision
- 24% middleware re-engineering: quality assurance, security, network services development
- 28% networking: management, dissemination and outreach, user training and education, application identification and support, policy and international cooperation
EGEE infrastructure
Access to networking services provided by GEANT and the NRENs.
Production service: in place (based on LCG-2) for production applications; runs only proven, stable, debugged middleware and services; will continue adding new sites in the EGEE federations.
Pre-production service: for middleware re-engineering.
Certification and training/demo testbeds.
EGEE Middleware Activity
Activity concentrated in a few major centres
Middleware selection based on the requirements of applications and operations
Harden and re-engineer existing middleware functionality, leveraging the experience of partners
Provide robust, supportable components
Track standards evolution (WS-RF)
Middleware Implementation
From day 1 (1 April 2004): a production grid service based on the LCG infrastructure, running the LCG-2 grid middleware.
In parallel, develop a "next generation" grid facility: produce a new set of grid services according to evolving standards (web services); run a pre-production service providing early access for evaluation purposes; this will replace LCG-2 on the production facility in 2005.
[Diagram: lineage from EDG, VDT, … into LCG-1/LCG-2 (Globus 2 based) and from AliEn, … into gLite (web services based), leading on to EGEE-1 and EGEE-2]
EGEE Middleware: gLite
- Starts with components from AliEn, EDG, VDT etc.
- Aims at addressing advanced requirements from applications
- Prototyping with short development cycles for fast user feedback
- An initial web-services based prototype is being tested internally with representatives from the application groups
EGEE Middleware Implementation
- LCG-2 (= EGEE-0): the current base for production services, evolved with certified new or improved services from the pre-production service; the 2004 product, leading to LCG-3 (= EGEE-x?) as the 2005 product
- Pre-production service: early application access for new developments; certification of selected components from gLite; starts with LCG-2, migrating to the new middleware in 2005 (the prototyping line)
- Organising a smooth/gradual transition from LCG-2 to gLite for production operations
Distributed Physics Analysis – the ARDA Project
[Diagram: the ARDA project links the four experiment distributed-analysis pilots (ALICE, ATLAS, CMS, LHCb) to the EGEE Middleware Activity through collaboration, coordination and integration, and through specifications, priorities and planning]
- ARDA – distributed physics analysis: batch to interactive, end-user emphasis
- 4 pilots by the LHC experiments (core of the HEP activity in EGEE NA4)
- Rapid prototyping pilot service
- Provides focus for the first products of the EGEE middleware
- Kept realistic by what the EGEE middleware can deliver
LCG → EGEE in Europe
User support: becomes hierarchical, through the Regional Operations Centres (ROC)
- Act as front-line support for user and operations issues
- Provide local knowledge and adaptations
Coordination: at CERN (Operations Management Centre) and CIC for HEP
Operational support: the LCG GOC is the model for the EGEE CICs
- CICs replace the European GOC at RAL
- Also run essential infrastructure services
- Provide support for other (non-LHC) applications
- Provide 2nd-level support to ROCs
Interoperability – convergence?
- Information systems: all MDS-based; bring the schemas together
- Storage management: common ideas – SRM
- File catalogs: not yet clear
- Security: Joint Security Group
- Policy: VO management, …
Can we converge and agree on common interfaces? protocols? implementations? middleware?
Linking HEPGrid to LCG (M.C. Vetterli, SFU/TRIUMF)
[Diagram: Grid-Canada resources (GCRes.1 … GCRes.n) behind a Grid-Can negotiator/scheduler, and a UBC/TRIUMF working group (WG) with its own negotiator/scheduler, both feeding TRIUMF (cpu & storage, RB/scheduler), which connects to the LCG BDII/RB/scheduler]
Publishing resources (class ads / MDS):
1) Each GC resource publishes a class ad to the GC collector
2) The GC CE aggregates this info and publishes it to TRIUMF as a single resource
3) The same is done for WG
4) TRIUMF aggregates GC & WG and publishes this to LCG as one resource
5) TRIUMF also publishes its own resources separately
Job flow:
1) The LCG RB decides where to send the job (GC/WG or the TRIUMF farm)
2) The job class ad goes to the TRIUMF farm, or TRIUMF decides to send the job to GC or WG
3) The CondorG job manager at TRIUMF builds a submission script for the TRIUMF Grid
4) The TRIUMF negotiator matches the job to GC or WG
5) The job is submitted to the proper resource
6) The process is repeated on GC if necessary
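The matchmaking mechanism these steps rely on can be sketched as follows; plain Python dicts and a predicate stand in for real Condor ClassAds, and the resource names merely echo the diagram:

```python
# Sketch of Condor-style matchmaking: resources publish "class ads";
# a negotiator matches job requirements against them. Invented data.
machine_ads = [
    {"Name": "GCRes.1",     "Pool": "GC",     "Arch": "x86", "Memory": 2048},
    {"Name": "WG-node",     "Pool": "WG",     "Arch": "x86", "Memory": 1024},
    {"Name": "triumf-farm", "Pool": "TRIUMF", "Arch": "x86", "Memory": 4096},
]

job_ad = {
    "Owner": "atlas-user",
    # Requirements: a predicate over the machine ad, as in ClassAds.
    "Requirements": lambda m: m["Arch"] == "x86" and m["Memory"] >= 2048,
}

def negotiate(job, machines):
    """Return the machines whose ads satisfy the job's requirements."""
    return [m for m in machines if job["Requirements"](m)]

matches = negotiate(job_ad, machine_ads)
print([m["Name"] for m in matches])   # ['GCRes.1', 'triumf-farm']
# The negotiator would then pick one (e.g. by rank) and hand the job
# to the corresponding scheduler, as in steps 4-5 above.
```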
What next? Service challenges
Proposed to be used in addition to ongoing data challenges and production use. The goal is to ensure that baseline services can be demonstrated, to demonstrate the resolution of the problems mentioned above, and to demonstrate that operational and emergency procedures are in place.
4 areas proposed:
- Reliable data transfer: demonstrate this fundamental service for Tier-0 → Tier-1 by end 2004
- Job flooding/exerciser: understand the limitations and baseline performance of the system
- Incident response: ensure the procedures are in place and work – before real life tests them
- Interoperability: how can we bring together the different grid infrastructures?
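As an illustration of what "reliable data transfer" involves at its simplest, a sketch of retry-with-verification; `transfer` is a placeholder for a real transfer tool (e.g. a GridFTP client call), not an LCG service API, and the checksum step assumes both paths are locally readable:

```python
# Sketch: wrap an unreliable transfer in retries with backoff and
# verify integrity afterwards. transfer() is a caller-supplied
# placeholder, not a real LCG tool.
import hashlib
import time

def checksum(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def reliable_transfer(transfer, src: str, dst: str, retries: int = 5) -> None:
    """Retry until the destination checksum matches the source."""
    for attempt in range(1, retries + 1):
        try:
            transfer(src, dst)                  # e.g. a GridFTP call
            if checksum(src) == checksum(dst):  # verify integrity
                return
            raise IOError("checksum mismatch")
        except Exception as err:
            print(f"attempt {attempt} failed: {err}")
            time.sleep(2 ** attempt)            # exponential backoff
    raise IOError(f"could not reliably copy {src} -> {dst}")
```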
Conclusions
LCG has successfully deployed a service to 60+ sites; significant effort went into achieving reasonable stability and reliability.
There is still a long way to go to ensure adequate functionality is in place for LHC startup: the focus is on building up base-level services such as reliable data movement, with continuing data and service challenges, and analysis.
The LCG service will evolve rapidly, to respond to problems found in the current system and to integrate/migrate to new middleware.
Effort is also focussed on infrastructure and support services: operations support, user support, security etc.
Bring together the various grid infrastructures for LCG users.