Operating the LCG and EGEE Production Grid for HEP
[email protected] CHEP’04 – 28 September 2004 - 1
Operating the LCG and EGEE Production Grid for HEP
Ian Bird, IT Department, CERN
LCG Deployment Area Manager & EGEE Operations Manager
CHEP'04, 28th September 2004
EGEE is a project funded by the European Union under contract IST-2003-508833
LCG Operations in 2004

Goal: deploy and operate a prototype LHC computing environment.

Scope:
- Integrate a set of middleware and coordinate and support its deployment to the regional centres
- Provide operational services to enable running as a production-quality service
- Assist the experiments in integrating their software and deploying it in LCG; provide direct user support

Deployment goals for LCG-2:
- Production service for the Data Challenges in 2004
- Experience in close collaboration between the Regional Centres
- Learn how to maintain and operate a global grid
- Focus on building a production-quality service
- Understand how LCG can be integrated into the sites' physics computing services
- Set up the EGEE project and migrate the existing structure towards the EGEE structure
- By design, the LCG and EGEE services and operations teams are the same
LCG – from certification to production

Some history:
- March 2003, LCG-0: existing middleware, waiting for the EDG-2 release
- September 2003, LCG-1: three months late, so reduced functionality; an extensive certification process improved stability (RB, information system); integrated 32 sites with ~300 CPUs; first use for production
- December 2003, LCG-2: full set of functionality for the Data Challenges, first MSS integration; deployed in January to 8 core sites; DCs started in February, i.e. testing in production; large sites integrated their resources (MSS and farms) into LCG; introduced a pre-production service for the experiments; alternative packaging (tool-based and generic installation guides)
- May 2004 onwards: monthly incremental releases; not all releases are distributed to external sites; improved services, functionality, stability and packaging step by step; timely response to experiences from the data challenges

The formal certification process has been invaluable, but the process of stabilising existing middleware and putting it into production is expensive: testbeds, people, time. We now have monthly incremental middleware releases (not all are deployed), and are expanding with a pre-production service.
LCG-2 Software

LCG-2 core packages:
- VDT (Globus 2, Condor)
- EDG WP1 (Resource Broker, job submission tools)
- EDG WP2 (replica management tools) plus LCG tools: one central RMC and LRC for each VO, located at CERN, with an Oracle backend
- Several pieces from other WPs (configuration objects, information providers, packaging, ...)
- GLUE 1.1 information schema, plus a few essential LCG extensions
- MDS-based information system with significant LCG enhancements (replacements, simplifications; see poster)
- Mechanism for application (experiment) software distribution

Almost all components have gone through some re-engineering: robustness, scalability, efficiency, adaptation to local fabrics. The services are now quite stable, and performance and scalability have been significantly improved (within the limits of the current architecture). Data management is still far from perfect.
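The MDS-based information system publishes resource state as GLUE-schema attributes, typically retrieved as LDIF text (e.g. with ldapsearch against a site GIIS). As a rough illustration, a minimal sketch of turning one such record into a dictionary; the sample CE record, its attribute values, and the helper name are illustrative, not taken from any real site:

```python
# Hypothetical sketch: parsing GLUE-schema attributes from LDIF text,
# as returned by an MDS/BDII query. The attribute names follow the
# GLUE 1.1 CE schema; the sample record itself is made up.

SAMPLE_LDIF = """\
dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-pbs-short,mds-vo-name=local,o=grid
GlueCEUniqueID: ce.example.org:2119/jobmanager-pbs-short
GlueCEStateFreeCPUs: 12
GlueCEStateWaitingJobs: 3
GlueCEStateStatus: Production
"""

def parse_ldif_record(text):
    """Turn one LDIF record into a dict of attribute -> value (strings)."""
    record = {}
    for line in text.splitlines():
        if ":" in line:
            # Split on the first colon only: values may contain colons
            # themselves (e.g. the host:port in GlueCEUniqueID).
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
    return record

ce = parse_ldif_record(SAMPLE_LDIF)
print(ce["GlueCEStateFreeCPUs"])   # prints "12"
```

A broker or monitoring tool would then rank or alarm on attributes such as GlueCEStateFreeCPUs; real multi-record LDIF output would additionally need splitting on blank lines.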
LCG-2/EGEE-0 Status, 24-09-2004

Total: 78 sites, ~9000 CPUs, 6.5 PByte

[Map of participating sites; Cyprus highlighted]
Experiences in deployment

LCG now covers many sites (>70), both large and small:
- Large sites have existing infrastructures and need to add grid interfaces etc. on top
- Small sites want a completely packaged, push-button, out-of-the-box installation (including the batch system, etc.)
- Satisfying both simultaneously is hard; it requires very flexible packaging, installation and configuration tools and procedures, and a lot of effort had to be invested in this area

There are many problems, but in the end we have been quite successful: the system is reasonably stable, is used in production, and is now reasonably easy to install at ~80 sites. We now have a basis on which to incrementally build essential functionality, and from which to measure improvements. This infrastructure now also forms the EGEE production service.
Operations services for LCG – 2004

Deployment and operational support follows a hierarchical model:
- CERN acts as 1st-level support for the Tier 1 centres
- Tier 1 centres provide 1st-level support for their associated Tier 2s
- "Tier 1 sites" are also referred to as "primary sites"
- Grid Operations Centres (GOCs) provide operational monitoring, troubleshooting, coordination of incident response, etc.
- RAL (UK) led a sub-project to prototype a GOC
- Operations support comes from the CERN team, the GOC, and Taipei, with many individual contributions on the mailing list

User support follows a central model:
- FZK provides the user support portal: a web-based problem tracking system available to all LCG participants
- The experiments provide triage of problems
- The CERN team provides in-depth support, and support for integrating experiment software with the grid middleware
Experiences during the data challenges
Data Challenges

The Data Challenges are a large-scale production effort of the LHC experiments, intended to:
- test and validate the computing models
- produce needed simulated data
- test the experiments' production frameworks and software
- test the provided grid middleware
- test the services provided by LCG-2

All experiments used LCG-2 for all or part of their productions.
Data Challenges – ALICE

Phase I: 120k Pb+Pb events produced in 56k jobs; 1.3 million files (26 TByte) in Castor at CERN; total CPU 285 MSI2k hours (a 2.8 GHz PC working 35 years); ~25% produced on LCG-2.

Phase II (underway): 1 million jobs; 10 TB produced, 200 TB transferred; 500 MSI2k hours of CPU; ~15% on LCG-2.
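As a sanity check on the Phase I figure, a back-of-envelope conversion of 285 MSI2k hours into single-PC years; the ~1 kSI2k rating assumed for a 2.8 GHz PC is our assumption (roughly what a 2.8 GHz Xeon of that era benchmarked at), not a number from the slide:

```python
# Back-of-envelope check: how many years would one PC need to deliver
# 285 MSI2k hours? The per-PC SI2k rating is an assumed round number.

total_msi2k_hours = 285          # from the slide (ALICE Phase I)
pc_rating_si2k = 1000            # assumed rating of one 2.8 GHz PC, ~1 kSI2k
hours_per_year = 24 * 365.25

hours_on_one_pc = total_msi2k_hours * 1e6 / pc_rating_si2k
years = hours_on_one_pc / hours_per_year
print(round(years, 1))           # ~32.5 years, consistent with the quoted "35 years"
```

The small gap to the quoted 35 years disappears if the PC is rated slightly below 1 kSI2k, so the slide's figure is plausible.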
Data Challenges – ATLAS

Phase I: 7.7 million events fully simulated (Geant 4) in 95,000 jobs; 22 TByte; total CPU 972 MSI2k hours; >40% produced on LCG-2 (DC2 used LCG-2, Grid3 and NorduGrid).

[Pie chart, ATLAS DC2 CPU usage by grid: LCG 41%, NorduGrid 30%, Grid3 29%]
[Pie chart, ATLAS DC2 CPU usage on LCG by site (September): ~30 sites including at.uibk, ca.triumf, ch.cern, cz.golias, de.fzk, es.ifae, fr.in2p3, it.infn.cnaf, jp.icepp, nl.nikhef, pl.zeus, ru.msu, tw.sinica and several UK sites up to uk.rl]
Data Challenges – CMS

- ~30 M events produced; 25 Hz reached (though only once for a full day)
- Exercised RLS, Castor, control systems, Tier 1 storage, ...
- Not a CPU challenge, but a full-chain demonstration
- Pre-challenge production in 2003/04: 70 M Monte Carlo events produced (30 M with Geant-4), in both classic and grid (CMS/LCG-0, LCG-1, Grid3) productions
Data Challenges – LHCb

Phase I: 186 M events, 61 TByte; total CPU 424 CPU years (43 LCG-2 and 20 DIRAC sites); up to 5600 concurrent running jobs in LCG-2. This is 5-6 times what was possible at CERN alone.

[Production-rate plot, annotated: DIRAC alone; LCG in action, 1.8 × 10^6 events/day; LCG paused; 3-5 × 10^6 events/day after LCG restarted]
Data challenges – summary

- Probably the first time such a set of large-scale grid productions has been done
- Significant effort invested on all sides; very fruitful collaborations
- Unfortunately, the DCs were the first time the LCG-2 system had been used; adaptations were essential, of experiment software to middleware and vice versa, as limitations and capabilities were exposed
- Many problems were recognised and addressed during the challenges
- A systematic confrontation of the functional problems with experiment requirements has recently been made (GAG)
- The middleware is actually quite stable now, but job efficiency is not high, for many reasons (see below)
- We have started to see some basic underlying issues: of implementation (lack of error handling, scalability, etc.), of underlying models (workload management), and perhaps also of fabric services (batch systems?)
- But the single largest issue is the lack of stable operations
Problems during the data challenges

Common functional issues seen by all experiments:
- Sites suffering from configuration and operational problems, with inadequate resources (hardware, human, ...) at some sites; this is now the main source of failures
- Load balancing between different sites is problematic: jobs can be "attracted" to sites that do not have adequate resources, because modern batch systems are too complex and dynamic to summarize their behaviour in a few values in the information system
- Identifying problems in LCG-2 is difficult: a distributed environment, access to many log files is needed, and the monitoring tools are still immature
- Handling thousands of jobs is time-consuming and tedious; support for bulk operations is not adequate
- Performance and scalability of services: storage (access and number of files), job submission, the information system, file catalogues
- Services also suffered from hardware problems
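The load-balancing point can be made concrete with a toy model (purely illustrative, not Resource Broker code; the site names and numbers are invented): if brokers rank sites on a stale FreeCPUs snapshot from the information system, every job submitted between publication cycles lands on the same no-longer-free site.

```python
# Toy illustration of ranking on a few stale Information System values.
# Site B published 50 free CPUs at the last IS update, but has since
# filled up; the snapshot is not refreshed between submissions.

is_snapshot = {"siteA": {"FreeCPUs": 2}, "siteB": {"FreeCPUs": 50}}

def rank_site(snapshot):
    """Pick the site advertising the most free CPUs (a typical rank expression)."""
    return max(snapshot, key=lambda s: snapshot[s]["FreeCPUs"])

queued = {"siteA": 0, "siteB": 0}
for _ in range(100):                 # 100 jobs submitted between IS updates
    queued[rank_site(is_snapshot)] += 1

print(queued)                        # all 100 jobs pile onto the stale "free" site B
```

A real batch system's willingness to run a job depends on queue limits, fair-share state, and per-VO policies, none of which fit into a couple of published integers; that is exactly the "few values in the IS" problem above.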
Configuration and stability problems

This is the largest source of problems. Many are "well-known" fabric problems:
- Batch systems that cause "black holes"
- NFS problems
- Clock skew at a site
- Software not installed or configured correctly
- Lack of configuration management, so fixed problems reappear
- Firewall issues, often with less-than-optimal coordination between grid administrators and firewall maintainers

Others are due to lack of experience: many grid sites have not run such services before and do not have the procedures, tools or diagnostics. This is not limited to small sites.

There is also a lack of support: maintaining stable operation is still labour-intensive and requires adequate operations staff trained in grid management. Response is slow; problems are reported daily but may last for weeks, and there are no vacations, while the experiments expect 24x365 stable operation. The grid successfully "integrates" these problems from 80 sites.

Building a stable operation is the highest priority. This is what EGEE is funded to do.
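The "black hole" failure mode can be sketched with a small simulation (the job times and the emptiest-queue policy are illustrative assumptions, not measured behaviour): a misconfigured site whose jobs crash within seconds drains its queue far faster than a healthy site, so naive scheduling funnels almost all work into it.

```python
# Toy "black hole" simulation: the broken site "completes" (crashes)
# jobs in 10 s, the healthy site takes an hour, and the scheduler
# always feeds whichever site frees up first.

sites = {
    "healthy":   {"job_time": 3600, "busy_until": 0, "done": 0, "failed": 0},
    "blackhole": {"job_time": 10,   "busy_until": 0, "done": 0, "failed": 0},
}

now = 0
for _ in range(200):                 # 200 jobs to schedule
    # Feed the site that frees up first (mimics shortest-queue scheduling).
    name = min(sites, key=lambda s: sites[s]["busy_until"])
    site = sites[name]
    now = max(now, site["busy_until"])
    site["busy_until"] = now + site["job_time"]
    if name == "blackhole":
        site["failed"] += 1          # job crashes instantly, e.g. missing software
    else:
        site["done"] += 1

print(sites["blackhole"]["failed"], sites["healthy"]["done"])   # prints: 199 1
```

In this sketch 199 of 200 jobs are swallowed by the black hole, which is why a single misconfigured batch system can dominate a grid-wide failure rate.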
EGEE and evolving the operations model
EGEE

[Layered diagram: applications, on the grid infrastructure, on the GEANT network]

Goal: create a Europe-wide production-quality grid infrastructure on top of the present regional grid programmes. Despite its name, the project has a worldwide scope, and it is a multi-science project.

Scale: 70 leading institutes in 27 countries; ~300 FTEs; aim of 20,000 CPUs; initially a 2-year project.

Activities: 48% service activities (operation, support); 24% middleware re-engineering; 28% management, training, dissemination and international cooperation.

EGEE builds on LCG to establish a grid operations service, with a single team for deployment and operations, and on the experience gained from running services for the LHC experiments. The HEP experiments are the pilot application for EGEE, together with bio-medical applications.
LCG and EGEE Operations

- EGEE is funded to operate and support a research grid infrastructure in Europe
- The core infrastructure of the LCG and EGEE grids is now operated as a single service, growing out of the LCG service: LCG includes the US and Asia-Pacific, EGEE includes other sciences, and a substantial part of the infrastructure is common to both
- The LCG Deployment Manager is the EGEE Operations Manager
- The CERN team (Operations Management Centre) provides coordination, management and 2nd-level support
- Support activities are expanded with the provision of Core Infrastructure Centres (4 CICs) and Regional Operations Centres (9 ROCs); the ROCs are coordinated by Italy, outside of CERN (which has no ROC)
Operations: LCG and EGEE in Europe

User support becomes hierarchical, through the Regional Operations Centres (ROCs), which act as front-line support for user and operations issues and provide local knowledge and adaptations.

Coordination: at CERN (Operations Management Centre), and at a CIC for HEP-LHC.

Operational support: the LCG GOC is the model for the EGEE CICs. The CICs replace the European GOC at RAL; they also run essential infrastructure services, provide support for other (non-LHC) applications, and provide 2nd-level support to the ROCs.
The Regional Operations Centres

The ROC organisation is the focus of EGEE operations activities:
- Coordinate and support deployment
- Coordinate and support operations
- Coordinate Resource Centre management: negotiate application access to resources within the region, coordinate planning and reporting within the region, and negotiate and monitor SLAs within the region

Teams:
- Deployment team
- 24-hour support team (answers user and Resource Centre problems)
- Operations training at the Resource Centres
- Organising tutorials for users

The ROC is the first point of contact for all: new sites joining the grid (and support for them), and new users (and user support).
Core Infrastructure Centres

The CICs are "Grid Operations Centres" behaving as a single organisation. They:
- Operate infrastructure services, e.g. VO services (VO servers, the VO registration service), RBs, UIs, information services, RLS and other database services, and ensure recovery procedures and fail-over between CICs
- Act as Grid Operations Centres: monitoring, proactive troubleshooting, performance monitoring, and controlling sites' participation in the production service, using the work done at RAL for the LCG GOC as a starting point
- Support the ROCs on operational problems
- Handle operational configuration management and change control
- Provide accounting and resource usage/availability monitoring
- Take responsibility for operational "control" (to be defined), which rotates through the 4 CICs
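The fail-over idea between CICs can be sketched as follows (a minimal sketch under stated assumptions: the CIC names, their order, and the boolean health model are illustrative, not the actual EGEE procedures): each CIC can host the same infrastructure service, and a watchdog promotes the next healthy CIC when the active one stops responding.

```python
# Hedged sketch of CIC fail-over: pick the first healthy centre in a
# fixed preference order. Names and health model are hypothetical.

CICS = ["CERN", "RAL", "CNAF", "IN2P3"]     # illustrative list of the 4 CICs

def pick_active(health, preferred=0):
    """Return the first healthy CIC, starting from the preferred one."""
    for i in range(len(CICS)):
        cic = CICS[(preferred + i) % len(CICS)]
        if health[cic]:
            return cic
    raise RuntimeError("no healthy CIC available")

health = {c: True for c in CICS}
print(pick_active(health))                  # prints "CERN"
health["CERN"] = False                      # the active CIC goes down
print(pick_active(health))                  # fails over: prints "RAL"
```

The same preference-order structure also captures the rotating operational "control" mentioned above, by advancing the preferred index on a weekly schedule.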
Future activities
Future activities

All experiments expect to have significant ongoing productions for the foreseeable future, and some will also run their next data challenges a year from now.

LCG will run a series of "service challenges", complementary to the data challenges and ongoing productions, to demonstrate essential service-level issues (e.g. reliable Tier 0 - Tier 1 data transfer). It is essential that we are able to build a manageable production service, based on the existing infrastructure with reasonable improvements.

In parallel, we will build a "pre-production" service where new middleware (gLite, ...) can be demonstrated and validated before being deployed in production, and where we can work out the migration strategy to the 2nd-generation middleware, using the existing production service as the baseline for comparison. It takes a long time to make new software production quality, and we must be careful not to step backwards, even though what we have is far from perfect.
What next? Service challenges

The service challenges are proposed in addition to the ongoing data challenges and production use. The goal is to ensure the baseline services can be demonstrated: demonstrate the resolution of the problems mentioned earlier, and demonstrate that operational and emergency procedures are in place.

Four areas are proposed:
- Reliable data transfer: demonstrate this fundamental service for Tier 0 to Tier 1 by end 2004
- Job flooding/exerciser: understand the limitations and baseline performance of the system (may be achieved by the ongoing real productions)
- Incident response: ensure the procedures are in place and work, before real life tests them
- Interoperability: how can we bring together the different grid infrastructures?
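The core of "reliable data transfer" is that transient failures must not lose files. A minimal sketch of the retry-with-backoff pattern involved; do_transfer here is a stand-in callable, not an actual LCG transfer API, and the filename and delays are invented:

```python
# Illustrative sketch of reliable transfer: wrap an unreliable transfer
# in retries with exponential backoff. The transfer function is a
# hypothetical stand-in for the real Tier 0 -> Tier 1 tooling.

import time

def reliable_transfer(do_transfer, filename, max_retries=5, base_delay=1.0):
    """Retry a transfer with exponential backoff; raise after max_retries."""
    for attempt in range(max_retries):
        try:
            return do_transfer(filename)
        except IOError:
            time.sleep(base_delay * 2 ** attempt)   # back off: 1x, 2x, 4x, ...
    raise IOError(f"transfer of {filename} failed after {max_retries} attempts")

# Stand-in transfer that fails twice with a transient error, then succeeds.
attempts = {"n": 0}
def flaky_transfer(filename):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient network error")
    return f"{filename}: OK"

result = reliable_transfer(flaky_transfer, "raw_run_001.dat", base_delay=0.01)
print(result)                                       # raw_run_001.dat: OK
```

A production service would add checksum verification and bookkeeping of completed files on top of this retry skeleton, which is part of what the challenge is meant to demonstrate.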
Issues

Operational management:
- How much control can or should be assumed by an operations centre? For small sites with little support, can the GOCs restart services? This calls for more intelligence in the services to recognise problems.
- A strong organisation is needed to take operational responsibility: ensure that problems are addressed, traced and reported.
- Site management must take responsibility.
- Ensure that the operational security group is in place with good communications.
- Simplify service configurations, to avoid mistakes.

Weight of VOs: EGEE has many VOs (most still national in scope), and deploying a VO is very heavyweight; this must become much simpler.
Summary

- The data challenges have been running for 8 months: a major effort and collaboration between all involved
- Distributed operations in such a large system are hard; they require significant effort, and EGEE will help here
- Many lessons have been learned; it is essential that the 2nd-generation middleware takes account of all these issues: not just functionality, but manageability, scalability, accountability and robustness (operational requirements are important requirements for users too)
- We are now moving to a phase of sustained, continuous operation, while building a parallel service to validate the next-generation middleware
- We have come a long way in the last few months; there is still much to be done
Related papers

- Distributed Computing Services track: Evolution of Data Management in LCG-2 (278), Jean-Philippe Baud
- Distributed Computing Systems and Experiences track: Deploying and Operating LCG-2 (389), Markus Schulz; many other papers on experience in using LCG-2
- Poster Session 2: several papers on LCG-2 covering all aspects: certification, usage, information systems, integration