The LCG Project
DESCRIPTION
The LCG Project. Ian Bird, IT Department, CERN. ISGC 2004, Taipei, 27-28 July 2004. The Large Hadron Collider Project: 4 detectors (ALICE, ATLAS, CMS, LHCb). Requirements for world-wide data analysis: storage – raw recording rate 0.1-1 GBytes/sec, accumulating at 15 PetaBytes/year.
The LCG Project
Ian Bird, IT Department, CERN
ISGC 2004, Taipei, 27-28 July 2004
The Large Hadron Collider Project
4 detectors: ALICE, ATLAS, CMS, LHCb
Requirements for world-wide data analysis
Storage – Raw recording rate 0.1 – 1 GBytes/sec
Accumulating at 15 PetaBytes/year
40 PetaBytes of disk
Processing – 100,000 of today’s fastest PCs
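As a quick sanity check on these numbers, a back-of-envelope sketch; reconciling the instantaneous rate with the yearly total via a partial duty cycle is an assumption, not stated on the slide:

```python
# Back-of-envelope check of the LHC data rates quoted above.
# Assumption (not from the slides): an effective data-taking duty
# cycle is what reconciles a 0.1-1 GB/s raw rate with 15 PB/year.
GB = 1e9                 # bytes
PB = 1e15                # bytes
YEAR = 365 * 24 * 3600   # seconds

raw_rate_low, raw_rate_high = 0.1 * GB, 1.0 * GB   # bytes/sec
accumulated = 15 * PB                              # bytes/year

# Continuous recording at the quoted rates would give:
low = raw_rate_low * YEAR / PB     # ~3.2 PB/year
high = raw_rate_high * YEAR / PB   # ~31.5 PB/year

print(f"continuous recording: {low:.1f} - {high:.1f} PB/year")
print(f"quoted accumulation: {accumulated / PB:.0f} PB/year")
# 15 PB/year sits inside that range, consistent with a partial duty cycle.
```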
Large distributed community
CMS, ATLAS, LHCb: ~5000 physicists around the world
"Offline" software effort: 1000 person-years per experiment
Software life span: 20 years
LCG – Goals
The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments.
Two phases:
Phase 1: 2002-2005 – build a service prototype based on existing grid middleware; gain experience in running a production grid service; produce the Technical Design Report (TDR) for the final system
Phase 2: 2006-2008 – build and commission the initial LHC computing environment
LCG is not a development project – it relies on other grid projects for grid middleware development and support.
Introduction – the LCG Project
The LHC Computing Grid (LCG) is a grid deployment project:
- Prototype computing environment for LHC
- Focus on building a production-quality service
- Learn how to maintain and operate a global-scale production grid
- Gain experience in close collaboration between regional (resource) centres
- Understand how to integrate fully with existing computing services
- Build on the results of earlier research projects; learn how to move from test-beds to production services
- Address policy-like issues needing agreement between collaborating sites
LHC Computing Grid Project – a Collaboration
Building and operating the LHC Grid involves a collaboration of:
- The physicists and computing specialists from the LHC experiments (researchers)
- The projects in the US and Europe that have been developing Grid middleware (software engineers)
- The regional and national computing centres that provide resources for LHC, and usually also for other physics experiments and other sciences (service providers)
The Project Organisation
Applications Area: development environment, joint projects, data management, distributed analysis
Middleware Area: provision of a base set of grid middleware – acquisition, development, integration, testing, support
CERN Fabric Area: large cluster management, data recording, cluster technology, networking, computing service at CERN
Grid Deployment Area: establishing and managing the Grid Service – middleware certification, security, operations, registration, authorisation, accounting
ARDA: prototyping and testing grid middleware for experiment analysis
Applications Area Projects
- Software Process and Infrastructure (SPI) (A. Aimar): librarian, QA, testing, developer tools, documentation, training, …
- Persistency Framework & Database Applications (POOL) (D. Duellmann): relational persistent data store, conditions database, collections
- Core Tools and Services (SEAL) (P. Mato): foundation and utility libraries, basic framework services, object dictionary and whiteboard, maths libraries
- Physicist Interface (PI) (V. Innocente): interfaces and tools by which physicists directly use the software; interactive analysis, visualization
- Simulation (T. Wenaus): generic framework, Geant4, FLUKA integration, physics validation, generator services
- ROOT (R. Brun): ROOT I/O event store; analysis package
POOL – Object Persistency
- Bulk event data storage: an object store based on ROOT I/O, with full support for persistent references automatically resolved to objects anywhere on the grid; recently extended to support updateable metadata as well (with some limitations)
- File cataloguing: three implementations, using grid middleware (the EDG version of RLS), a relational DB (MySQL), and local files (XML)
- Event metadata: event collections with queryable metadata (physics tags etc.)
- Transient data cache: optional component by which POOL can manage transient instances of persistent objects
- POOL project scope now extended to include the Conditions Database
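To illustrate the pluggable-catalogue idea behind those three implementations, a minimal sketch; the interface and class names are invented for illustration and are not POOL's actual API:

```python
# Hypothetical sketch of a pluggable file catalogue, illustrating the
# three-backend idea (local XML file, relational DB, grid replica catalogue).
# This is NOT the real POOL API; all names here are invented.
from abc import ABC, abstractmethod

class FileCatalog(ABC):
    """Maps a logical file name (LFN) to physical file names (PFNs)."""
    @abstractmethod
    def register(self, lfn: str, pfn: str) -> None: ...
    @abstractmethod
    def lookup(self, lfn: str) -> list[str]: ...

class XMLCatalog(FileCatalog):
    """Stands in for a local, single-user XML-file catalogue
    (entries kept in memory here to keep the sketch short)."""
    def __init__(self, path: str):
        self.path, self._entries = path, {}
    def register(self, lfn, pfn):
        self._entries.setdefault(lfn, []).append(pfn)
    def lookup(self, lfn):
        return self._entries.get(lfn, [])

# A MySQLCatalog or RLSCatalog would implement the same interface, so
# client code can switch backends (laptop -> site DB -> grid) without
# changing how it resolves persistent references.
def resolve(catalog: FileCatalog, lfn: str) -> str:
    replicas = catalog.lookup(lfn)
    if not replicas:
        raise LookupError(f"no replica registered for {lfn}")
    return replicas[0]  # a real system would pick the "closest" replica
```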
POOL Component Breakdown
POOL API, built from three component groups:
- Storage Service: ROOT I/O storage service; RDBMS storage service
- File Catalog: XML catalog; RDBMS catalog; EDG Replica Location Service
- Collections: explicit collections; implicit collections
Simulation Project Organisation
Simulation Project Leader, with subprojects:
- Framework: generic interface to multiple simulation engines (Geant4, FLUKA), building on existing ALICE work (VMC)
- Geant4: part of the GEANT4 collaboration; developments aligned with and responding to needs from the LHC experiments
- FLUKA integration: FLUKA team participating in framework integration and physics validation
- Physics Validation: assess the adequacy of the simulation and physics environment for LHC; provide the focus for the LHC requirements
- Generator Services: generator librarian, common event files, validation/test suite, development when needed (HepMC, etc.)
Fabrics
Getting the data from the detector to the grid requires sustained data collection and distribution – keeping up with the accelerator.
To achieve the required levels of performance, reliability and resilience – at minimal cost (people, equipment) – we also have to work on the scalability and performance of some of the basic computing technologies: cluster management, mass storage management, high-performance networking.
Tens of thousands of disks
Thousands of processors
Hundreds of tape drives
Continuous evolution
Sustained throughput
Resilient to problems
Fabric Automation at CERN
[Architecture diagram: each node runs a configuration cache and software cache with agents (SPMA, NCM, MSA), fed from a central configuration database (CDB, Oracle-based) and a software repository (SWRep); state management (SMS) and hardware management (HMS) systems; LEMON monitoring]
Covers configuration, installation, fault & hardware management, and monitoring. Includes technology developed by DataGrid.
WAN connectivity
5.44 Gbps – 1.1 TB transferred in 30 minutes
6.63 Gbps – 25 June 2004
We now have to get from an R&D project (DATATAG) to a sustained, reliable service – Asia, Europe, US
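A quick unit check on the first of those figures; whether "TB" here is decimal or binary is not stated on the slide, so the sketch tries both:

```python
# Does "1.1 TB in 30 minutes" match the quoted 5.44 Gbps?
# Depends on whether TB means 10**12 or 2**40 bytes.
seconds = 30 * 60

for label, nbytes in (("decimal TB (10^12 B)", 1.1e12),
                      ("binary TiB (2^40 B)", 1.1 * 2**40)):
    gbps = nbytes * 8 / seconds / 1e9
    print(f"{label}: {gbps:.2f} Gbps")
# decimal TB (10^12 B): 4.89 Gbps
# binary TiB (2^40 B):  5.37 Gbps   <- close to the quoted 5.44 Gbps
```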
ALICE DC – MSS Bandwidth
[Chart: mass-storage-system bandwidth by year, 1998-2006 – initial goals vs. achieved, plus the LCG tape bandwidth target]
Sites in LCG-2/EGEE-0: 14 July 2004
Austria: U-Innsbruck
Canada: Triumf, Alberta, Carleton, Montreal, Toronto
Czech Republic: Prague-FZU, Prague-CESNET
France: CC-IN2P3, Clermont-Ferrand
Germany: FZK, Aachen, DESY, GSI, Karlsruhe-U, Wuppertal
Greece: HellasGrid
Hungary: Budapest
India: TIFR
Israel: Tel-Aviv, Weizmann
Italy: CNAF, Frascati, Legnaro, Milano, Napoli, Roma, Torino
Japan: Tokyo
Netherlands: NIKHEF, SARA
Pakistan: NCP
Poland: Krakow
Portugal: LIP
Russia: SINP-Moscow, ITEP, JINR-Dubna
Spain: PIC, UAM, USC, UB-Barcelona, IFCA, CIEMAT, IFIC
Switzerland: CERN, CSCS
Taiwan: ASCC, IPAS, NCU
UK: RAL, Birmingham, Cavendish, Glasgow, Imperial, Lancaster, Manchester, QMUL, RAL-PP, Sheffield, UCL, UCL-CCC
US: BNL, FNAL
HP: Puerto-Rico
22 countries, 64 sites (50 Europe, 2 US, 5 Canada, 6 Asia, 1 HP)
Coming: New Zealand, China, other HP (Brazil, Singapore)
> 6000 CPUs
LHC Computing Model (simplified!!)
Tier-0 – the accelerator centre ("online" to the data acquisition process; high availability; long-term commitment): filter raw data; reconstruct summary data (ESD); record raw data and ESD; distribute raw and ESD to the Tier-1s
Tier-1 – managed mass storage: permanent storage of raw, ESD, calibration data, meta-data, analysis data and databases; grid-enabled data service; data-heavy analysis; re-processing raw → ESD; national and regional support
Tier-2 – well-managed, grid-enabled disk storage: simulation; end-user analysis, batch and interactive; high-performance parallel analysis (PROOF)
Below the Tier-1s and Tier-2s: small centres, desktops and portables
[Map legend: LCG Tier 0/1 sites – RAL, IN2P3, BNL, FZK, CNAF, PIC, ICEPP, FNAL; other sites shown include TRIUMF, Taipei, USC, NIKHEF, Krakow, CIEMAT, Rome, CSCS, Legnaro, UB, IFCA, IC, MSU, Prague, Budapest, Cambridge, IFIC]
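To make the Tier-0 → Tier-1 flow concrete, a minimal sketch; the types and function are invented for illustration and are not LCG software:

```python
# Invented-name sketch of the tiered responsibilities described above.
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    tier: int
    holds: set[str] = field(default_factory=set)

def record_event_data(tier0: Site, tier1s: list[Site]) -> None:
    """Tier-0 records raw + ESD, then distributes both to each Tier-1."""
    tier0.holds |= {"raw", "ESD"}    # recorded at the accelerator centre
    for t1 in tier1s:
        t1.holds |= {"raw", "ESD"}   # permanent, grid-enabled storage

cern = Site("CERN", tier=0)
t1s = [Site(n, tier=1) for n in ("RAL", "BNL", "FZK")]
record_event_data(cern, t1s)
print({s.name: sorted(s.holds) for s in [cern, *t1s]})
# Tier-2 sites would pull ESD/analysis data from their Tier-1 and run
# simulation plus end-user analysis on top of this.
```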
The LCG Service
LCG-1 service started on 15 September 2003, with 12 sites.
LCG-2, with upgraded middleware, began to be deployed at the beginning of 2004; currently 64 sites with > 6000 CPUs.
During 2003 significant effort was expended to integrate VDT, EDG, and other tools, and to debug, patch, test, and certify the middleware.
LCG-2 is currently in use for the LHC experiments' data challenges.
LCG-2 forms the basis for the EGEE production service.
LCG Certification
Significant investment in the certification and testing process and team: skilled people capable of system-level debugging, tightly coupled to the VDT, Globus, and EDG teams. It needs significant hardware resources. This was essential in achieving a robust service.
Making production-quality software is expensive, time-consuming, and not glamorous! … and it takes very skilled people with a lot of experience.
Experiences in deployment
LCG covers many sites (>60) now, both large and small. Large sites, with existing infrastructures, need to add on grid interfaces etc. Small sites want a completely packaged, push-button, out-of-the-box installation (including batch system, etc.). Satisfying both simultaneously is hard – it requires very flexible packaging, installation, and configuration tools and procedures, and a lot of effort had to be invested in this area.
There are many problems, but in the end we are quite successful: the system is stable and reliable, is used in production, and is now reasonably easy to install (60 sites). We now have a basis on which to incrementally build essential functionality.
This infrastructure forms the basis of the initial EGEE production service.
The LCG Deployment Board
Grid Deployment Board (GDB): set up to address policy issues requiring agreement and negotiation between resource centres.
Members: country representatives, applications, and project managers.
Sets up working groups – short-term or ongoing – bringing in technical experts to focus on specific issues; the GDB approves recommendations from the working groups.
Groups:
- Several that outlined initial project directions (operations, security, resources, support)
- Security – a standing group covering many policy issues
- Grid Operations Centre task force
- User Support group
- Storage management and other focused issues
- Service challenges
Operations services for LCG
Operational support: hierarchical model
- CERN acts as 1st-level support for the Tier-1 centres
- Tier-1 centres provide 1st-level support for associated Tier-2s ("Tier-1 sites" = "primary sites")
- Grid Operations Centres (GOC) provide operational monitoring, troubleshooting, coordination of incident response, etc.
- RAL (UK) led a sub-project to prototype a GOC; a 2nd GOC in Taipei is now in operation – together providing 16-hour coverage; a 3rd centre in Canada/US is expected to help achieve 24-hour coverage
User support: central model
- FZK provides the user support portal – a problem tracking system, web-based and available to all LCG participants
- Experiments provide triage of problems; a CERN team provides in-depth support and support for integration of experiment software with grid middleware
GGUS – Concept
Target: 24×7 support via time difference and 3 support teams
Currently: GGUS FZK, GGUS ASCC
Desired: GGUS USA
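To make the follow-the-sun arithmetic concrete, a small sketch; the shift hours are illustrative assumptions expressed in UTC, not the teams' actual rosters:

```python
# Sketch of follow-the-sun coverage: how much of the day do two or
# three 8-hour support shifts cover? Shift hours are assumptions.
def coverage(shifts):
    """Union of (start_hour, length) shifts over a 24h clock, in hours."""
    covered = set()
    for start, length in shifts:
        covered |= {(start + h) % 24 for h in range(length)}
    return len(covered)

ascc = (0, 8)    # e.g. Taipei 08:00-16:00 local = 00:00-08:00 UTC
fzk  = (8, 8)    # e.g. Karlsruhe 10:00-18:00 local = 08:00-16:00 UTC
usa  = (16, 8)   # e.g. US central 11:00-19:00 local = 16:00-24:00 UTC

print(coverage([ascc, fzk]))        # 16 -> the current 16-hour coverage
print(coverage([ascc, fzk, usa]))   # 24 -> the 24x7 target
```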
Support Teams within LCG
Global Grid User Support (GGUS): single point of contact, coordination of user support. It routes problems to:
- CERN Deployment Support (CDS) – middleware problems
- Grid Operations Center (GOC) – operations problems
- Resource Centers (RC) – hardware problems
- Experiment Specific User Support (ESUS) – software problems
Users: the 4 LHC experiments (ALICE, ATLAS, CMS, LHCb), 4 non-LHC experiments (BaBar, CDF, Compass, D0), and other communities (VOs)
Security
LCG Security Group:
- LCG usage rules – proposed as general Grid usage guidelines
- Registration procedures and VO management: agreement to collect only a minimal amount of personal data; registration has limited validity
- Initial audit requirements are defined
- Initial incident response procedures: site security contacts etc. are defined
- Set of trusted CAs (including the Fermilab online KCA)
- Security policy
This group is now a Joint Security Group covering several grid projects/infrastructures.
LCG Security environment
The players, interacting through the Grid:
- Users: personal data, roles, usage patterns, …
- VOs: experiment data, access patterns, membership, …
- Sites: resources, availability, accountability, …
The Risks
Top risks from the Security Risk Analysis (http://proj-lcg-security.web.cern.ch/proj-lcg-security/RiskAnalysis/risk.html):
- Launch attacks on other sites – large distributed farms of machines
- Illegal or inappropriate distribution or sharing of data – massive distributed storage capacity
- Disruption by exploit of security holes – complex, heterogeneous and dynamic environment
- Damage caused by viruses, worms etc. – highly connected and novel infrastructure
Policy – the LCG Security Group
Documents (http://cern.ch/proj-lcg-security/documents.html):
Security & Availability Policy (joint); Usage Rules; Certification Authorities; Audit Requirements; GOC Guides; Incident Response; User Registration; Application Development & Network Admin Guide
Authentication Infrastructure
Users and services own long-lived (1-year) credentials: digital certificates (X.509 PKI).
European Grid Policy Management Authority: "… is a body to establish requirements and best practices for grid identity providers to enable a common trust domain applicable to authentication of end-entities in inter-organisational access to distributed resources. …" (www.eugridpma.org – covers EU + USA + Asia)
Jobs are submitted with grid proxy certificates: short-lived (<24 hr) credentials which "travel" with the job. Delegation allows a service to act on behalf of the user; a proxy renewal service supports long-running and queued jobs.
Some issues: do trust mechanisms scale up? "On-line" certification authorities & certificate stores (Kerberized CA, virtual smartcard); limited delegation.
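As an illustration of working with such credentials, a sketch that checks the remaining lifetime of a proxy file with the Python `cryptography` library; the proxy path below is the conventional Globus-style location for a uid-1000 user and is an assumption:

```python
# Sketch: inspect the remaining lifetime of a (proxy) certificate.
# Requires the 'cryptography' package; loads the first PEM certificate
# in the file, which for a proxy file is the proxy certificate itself.
from datetime import datetime
from pathlib import Path
from cryptography import x509

proxy_path = Path("/tmp/x509up_u1000")   # assumption: uid-1000 user proxy

cert = x509.load_pem_x509_certificate(proxy_path.read_bytes())
# not_valid_after is naive UTC; newer releases also offer not_valid_after_utc.
remaining = cert.not_valid_after - datetime.utcnow()

if remaining.total_seconds() <= 0:
    print("proxy expired - a long-running job would need proxy renewal")
else:
    print(f"proxy valid for another {remaining}")
```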
User Registration (2003-4)
Registration flow (via lcg-registrar.cern.ch):
1. User, holding a CA-issued certificate: "I agree to the Usage Rules, please register me, my VO is XYZ"
2. Confirmation by email
3. User details are passed on
4. The XYZ VO Manager registers the user
5. Notification; 6. user details propagate to site and resource authorization
The user, with certificate and accepted Usage Rules, can then submit jobs to the Grid (sites check against the trusted CA certificates).
User Registration (? 2004 - )
Some issues: static user mappings will not scale up; multiple VO membership; complex authorization & policy handling; the VO manager needs to validate user data – how?
Solutions: a VO Management Service with attribute proxy certificates
- Groups and roles – not just static user mapping
- Attributes bound to the proxy certificate, signed by the VO service
- Credential mapping and authorization: flexible policy intersection and mapping tools
- Integrate with organizational databases, but… what about exceptions (the 2-week summer student)? What about other VO models: lightweight, deployment, testing?
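A toy sketch of the difference between static mapping and attribute-based mapping, as described above; the policy table, accounts and attribute format are invented for illustration, not the real VOMS/mapping interfaces:

```python
# Static mapping (grid-mapfile style): one DN -> one local account.
# This is what "will not scale up" refers to.
grid_mapfile = {
    "/C=CH/O=CERN/CN=Some User": "atlas001",   # invented DN and account
}

# Attribute-based mapping: decide from the VO/group/role carried in the
# user's proxy (signed by the VO service), not from the individual DN.
from typing import Optional

policy = {
    ("atlas", "production"): "atlasprd",   # shared production account
    ("atlas", None):         "atlaspool",  # pool account, plain members
    ("cms",   None):         "cmspool",
}

def map_user(vo: str, role: Optional[str]) -> str:
    """Return the local account for a proxy carrying (vo, role)."""
    return policy.get((vo, role)) or policy[(vo, None)]

print(map_user("atlas", "production"))  # atlasprd
print(map_user("atlas", None))          # atlaspool
```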
Audit & Incident Response
Audit requirements: mandate retention of logs by sites.
Incident response: security contact data is gathered when a site registers; communication channels are established (mail lists maintained by the Deployment Team) – a list of CSIRT lists as the channel for reporting, security contacts at sites as the channel for discussion & resolution, plus an escalation path.
2004 Security Service Challenges: check that the data is there and complete, and that the communication channels are open.
Security Collaboration
Projects share resources and have close links, hence the need for inter-grid global security collaboration: commonly accepted usage rules, common authentication and authorization requirements, and common incident response channels.
LCG – EGEE – OSG: the LCG Security Group is now the Joint Security Group (JSG) for LCG, EGEE & OSG; it provides requirements for middleware development; some members from OSG are already in the JSG.
What is EGEE? (I)
EGEE (Enabling Grids for e-Science in Europe) is a seamless Grid infrastructure for the support of scientific research, which:
- Integrates current national, regional and thematic Grid efforts
- Provides researchers in academia and industry with round-the-clock access to major computing resources, independent of geographic location
[Diagram: applications, on top of the Grid infrastructure, on top of the Geant network]
What is EGEE? (II)
70 institutions in 28 countries, federated in regional Grids
32 M Euros EU funding (2004-5), O(100 M) total budget
Aiming for a combined capacity of over 8000 CPUs (the largest international Grid infrastructure ever assembled)
~ 300 persons
EGEE Activities
Emphasis on operating a production grid and supporting the end-users:
- 48% service activities: grid operations, support and management, network resource provision
- 24% middleware re-engineering: quality assurance, security, network services development
- 28% networking: management, dissemination and outreach, user training and education, application identification and support, policy and international cooperation
EGEE infrastructure
Access to networking services provided by GEANT and the NRENs.
Production service: in place (based on LCG-2) for production applications; runs only proven, stable, debugged middleware and services; will continue adding new sites in the EGEE federations.
Pre-production service: for middleware re-engineering.
Certification and training/demo testbeds.
EGEE Middleware Activity
Activity concentrated in a few major centres
Middleware selection based on the requirements of applications and operations
Harden and re-engineer existing middleware functionality, leveraging the experience of partners
Provide robust, supportable components
Track standards evolution (WS-RF)
Middleware Implementation
From day 1 (1 April 2004): a production grid service based on the LCG infrastructure, running the LCG-2 grid middleware.
In parallel, develop a "next generation" grid facility: produce a new set of grid services according to evolving standards (web services); run a pre-production service providing early access for evaluation purposes; this will replace LCG-2 on the production facility in 2005.
[Diagram: lineage from EDG, VDT, … into LCG-1/LCG-2 (Globus 2 based) and from AliEn, … into gLite (web services based), leading on to EGEE-1 and EGEE-2]
EGEE Middleware: gLite
- Starts with components from AliEn, EDG, VDT etc.
- Aims at addressing advanced requirements from applications
- Prototyping with short development cycles for fast user feedback
- An initial web-services based prototype is being tested internally with representatives from the application groups
EGEE Middleware Implementation
- LCG-2 (= EGEE-0): the current base for production services, evolved with certified new or improved services from the pre-production service; the 2004 product, leading to LCG-3 (= EGEE-x?) as the 2005 product
- Pre-production service: early application access for new developments; certification of selected components from gLite; starts with LCG-2, migrating to the new middleware in 2005 (the prototyping line)
- Organising a smooth/gradual transition from LCG-2 to gLite for production operations
Distributed Physics Analysis – the ARDA Project
[Diagram: the ARDA project links the four experiment distributed-analysis pilots (ALICE, ATLAS, CMS, LHCb) to the EGEE Middleware Activity through collaboration, coordination and integration, and through specifications, priorities and planning]
- ARDA – distributed physics analysis: batch to interactive, end-user emphasis
- 4 pilots by the LHC experiments (core of the HEP activity in EGEE NA4)
- Rapid prototyping pilot service
- Provides focus for the first products of the EGEE middleware
- Kept realistic by what the EGEE middleware can deliver
LCG → EGEE in Europe
User support: becomes hierarchical, through the Regional Operations Centres (ROC)
- Act as front-line support for user and operations issues
- Provide local knowledge and adaptations
Coordination: at CERN (Operations Management Centre) and CIC for HEP
Operational support: the LCG GOC is the model for the EGEE CICs
- CICs replace the European GOC at RAL
- Also run essential infrastructure services
- Provide support for other (non-LHC) applications
- Provide 2nd-level support to ROCs
Interoperability – convergence?
- Information systems: all MDS-based; bring the schemas together
- Storage management: common ideas – SRM
- File catalogs: not yet clear
- Security: Joint Security Group
- Policy: VO management, …
Can we converge and agree on common interfaces? protocols? implementations? middleware?
Linking HEPGrid to LCG (M.C. Vetterli, SFU/TRIUMF)
[Diagram: Grid-Canada resources (GCRes.1 … GCRes.n) behind a Grid-Can negotiator/scheduler, and a UBC/TRIUMF working group (WG) with its own negotiator/scheduler, both feeding TRIUMF (cpu & storage, RB/scheduler), which connects to the LCG BDII/RB/scheduler]
Publishing resources (class ads / MDS):
1) Each GC resource publishes a class ad to the GC collector
2) The GC CE aggregates this info and publishes it to TRIUMF as a single resource
3) The same is done for WG
4) TRIUMF aggregates GC & WG and publishes this to LCG as one resource
5) TRIUMF also publishes its own resources separately
Job flow:
1) The LCG RB decides where to send the job (GC/WG or the TRIUMF farm)
2) The job class ad goes to the TRIUMF farm, or TRIUMF decides to send the job to GC or WG
3) The CondorG job manager at TRIUMF builds a submission script for the TRIUMF Grid
4) The TRIUMF negotiator matches the job to GC or WG
5) The job is submitted to the proper resource
6) The process is repeated on GC if necessary
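The matchmaking mechanism these steps rely on can be sketched as follows; plain Python dicts and a predicate stand in for real Condor ClassAds, and the resource names merely echo the diagram:

```python
# Sketch of Condor-style matchmaking: resources publish "class ads";
# a negotiator matches job requirements against them. Invented data.
machine_ads = [
    {"Name": "GCRes.1",     "Pool": "GC",     "Arch": "x86", "Memory": 2048},
    {"Name": "WG-node",     "Pool": "WG",     "Arch": "x86", "Memory": 1024},
    {"Name": "triumf-farm", "Pool": "TRIUMF", "Arch": "x86", "Memory": 4096},
]

job_ad = {
    "Owner": "atlas-user",
    # Requirements: a predicate over the machine ad, as in ClassAds.
    "Requirements": lambda m: m["Arch"] == "x86" and m["Memory"] >= 2048,
}

def negotiate(job, machines):
    """Return the machines whose ads satisfy the job's requirements."""
    return [m for m in machines if job["Requirements"](m)]

matches = negotiate(job_ad, machine_ads)
print([m["Name"] for m in matches])   # ['GCRes.1', 'triumf-farm']
# The negotiator would then pick one (e.g. by rank) and hand the job
# to the corresponding scheduler, as in steps 4-5 above.
```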
What next? Service challenges
Proposed to be used in addition to ongoing data challenges and production use. The goal is to ensure that baseline services can be demonstrated, to demonstrate the resolution of the problems mentioned above, and to demonstrate that operational and emergency procedures are in place.
4 areas proposed:
- Reliable data transfer: demonstrate this fundamental service for Tier-0 → Tier-1 by end 2004
- Job flooding/exerciser: understand the limitations and baseline performance of the system
- Incident response: ensure the procedures are in place and work – before real life tests them
- Interoperability: how can we bring together the different grid infrastructures?
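As an illustration of what "reliable data transfer" involves at its simplest, a sketch of retry-with-verification; `transfer` is a placeholder for a real transfer tool (e.g. a GridFTP client call), not an LCG service API, and the checksum step assumes both paths are locally readable:

```python
# Sketch: wrap an unreliable transfer in retries with backoff and
# verify integrity afterwards. transfer() is a caller-supplied
# placeholder, not a real LCG tool.
import hashlib
import time

def checksum(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def reliable_transfer(transfer, src: str, dst: str, retries: int = 5) -> None:
    """Retry until the destination checksum matches the source."""
    for attempt in range(1, retries + 1):
        try:
            transfer(src, dst)                  # e.g. a GridFTP call
            if checksum(src) == checksum(dst):  # verify integrity
                return
            raise IOError("checksum mismatch")
        except Exception as err:
            print(f"attempt {attempt} failed: {err}")
            time.sleep(2 ** attempt)            # exponential backoff
    raise IOError(f"could not reliably copy {src} -> {dst}")
```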
Conclusions
LCG has successfully deployed a service to 60+ sites; significant effort went into achieving reasonable stability and reliability.
There is still a long way to go to ensure adequate functionality is in place for LHC startup: the focus is on building up base-level services such as reliable data movement, with continuing data and service challenges, and analysis.
The LCG service will evolve rapidly, to respond to problems found in the current system and to integrate/migrate to new middleware.
Effort is also focussed on infrastructure and support services: operations support, user support, security etc.
Bring together the various grid infrastructures for LCG users.