EGEE is a project funded by the European Union under contract IST-2003-508833
“LCG Operation During the Data Challenges”
Markus Schulz, IT-GD, CERN
“Discussion on Operation Models”
HEPIX 2004 BNL CERN IT-GD 22 October 2004 2
Outline
• Building LCG-2
• Data Challenges (very brief)
• Problems (not so brief)
• Operating LCG
  – how it was planned
  – how it happened to be done
  – how it felt
• What’s next?
• I will skip many slides to leave room for discussions
Comment / Shout in REALTIME!!!!!
History
• December 2003: LCG-2
  – Full set of functionality for DCs, first MSS integration
  – Deployed in January to 8 core sites (fewer sites, less trouble)
  – DCs started in February -> testing in production
  – Large sites integrate resources into LCG (MSS and farms)
  – Introduced a pre-production service for the experiments
  – Alternative packaging (tool based and generic installation guides)
• May 2004 -> now: monthly incremental releases
  – Not all releases are distributed to external sites
  – Improved services, functionality, stability and packaging step by step
  – Timely response to experiences from the data challenges
LCG-2 Status, 22 October 2004
Total: 82 sites, ~9400 CPUs, ~6.5 PByte
[Map of participating sites; surviving label: Cyprus]
new interested sites should look here: release
Integrating Sites
• Sites contact GD Group or Regional Center
• Sites go to the release page
• Sites decide on manual or tool based installation (LCFGng)
  – documentation for both available
  – WN and UI: tar-ball based release from the next release on
  – almost trivial install of WNs and UIs
• Sites provide security and contact information
• Sites install and use provided tests for debugging
  – support from regional centers or CERN
• CERN GD certifies site and adds it to the monitoring and information system
  – sites are re-certified daily and problems traced in SAVANNAH
• Large sites have integrated their local batch systems in LCG-2
• Adding new sites is now quite smooth
  – worked 80+ times
  – failed 3-5 times
  – the problem is keeping a large number of sites correctly configured
Data Challenges
• Large scale production effort of the LHC experiments
  – test and validate the computing models
  – produce needed simulated data
  – test the experiments' production frameworks and software
  – test the provided grid middleware
  – test the services provided by LCG-2
• All experiments used LCG-2 for part of their production
Data Challenges
• Phase I
  – 120k Pb+Pb events produced in 56k jobs
  – 1.3 million files (26 TByte) in Castor@CERN
  – Total CPU: 285 MSI-2k hours (a 2.8 GHz PC working 35 years)
  – ~25% produced on LCG-2
• Phase II (underway)
  – 1 million jobs, 10 TB produced, 200 TB transferred, 500 MSI2k hours CPU
  – ~15% on LCG-2
• Phase I
  – 7.7 million events fully simulated (Geant 4) in 95,000 jobs
  – 22 TByte
  – Total CPU: 972 MSI-2k hours
  – >40% produced on LCG-2 (used LCG-2, GRID3, NorduGrid)
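The comparison "285 MSI-2k hours (2.8 GHz PC working 35 years)" implies a per-PC rating that can be checked with a few lines of arithmetic. The sketch below assumes that MSI-2k means 10^6 SPECint2000 units and that the PC runs around the clock; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def implied_si2k_rating(total_msi2k_hours, pc_years):
    """SI2k rating a single PC must have to deliver the quoted CPU total."""
    total_si2k_hours = total_msi2k_hours * 1_000_000  # assumption: M = 10^6
    return total_si2k_hours / (pc_years * HOURS_PER_YEAR)

# Figures quoted on the slide: 285 MSI-2k hours equals 35 PC-years.
rating = implied_si2k_rating(285, 35)
print(f"implied rating: {rating:.0f} SI2k per PC")

# The same rating applied to the 972 MSI-2k hours quoted for the third entry.
pc_years_972 = 972 * 1_000_000 / (rating * HOURS_PER_YEAR)
print(f"972 MSI-2k hours is about {pc_years_972:.0f} PC-years")
```

Under these assumptions the slide's comparison corresponds to a rating of roughly 930 SI2k for the 2.8 GHz PC.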
Data Challenges
• ~30 M events produced
• 25 Hz reached
  – (only once for a full day)
Data Challenges
[Plot of production rate over time; surviving annotations: DIRAC alone; LCG in action; 1.8 × 10^6/day; LCG paused; 3-5 × 10^6/day; LCG restarted]
• Phase I
  – 186 M events, 61 TByte
  – Total CPU: 424 CPU years (43 LCG-2 and 20 DIRAC sites)
  – Up to 5600 concurrent running jobs in LCG-2
Problems during the data challenges
• All experiments encountered similar problems on LCG-2
• LCG sites suffering from configuration and operational problems
  – inadequate resources on some sites (hardware, human, ...)
  – this is now the main source of failures
• Load balancing between different sites is problematic
  – jobs can be “attracted” to sites that have no adequate resources
  – modern batch systems are too complex and dynamic to summarize their behavior in a few values in the IS
• Identification and location of problems in LCG-2 is difficult
  – distributed environment, access to many logfiles needed
  – status of monitoring tools
• Handling thousands of jobs is time consuming and tedious
  – support for bulk operation is not adequate
• Performance and scalability of services
  – storage (access and number of files)
  – job submission
  – information system
  – file catalogues
• Services suffered from hardware problems (no fail-over services)
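The "attraction" effect in the load-balancing bullet can be illustrated with a toy ranking. This is a minimal sketch, not the LCG-2 broker: the record fields, site names, and the rule "prefer the lowest published estimated response time" are all assumptions chosen to show how a site publishing wrong values ends up with every job.

```python
from dataclasses import dataclass

@dataclass
class PublishedState:
    """Hypothetical summary of a batch system as seen in the information system."""
    site: str
    free_cpus: int
    waiting_jobs: int
    est_response_s: int  # estimated time until a new job starts

def rank(state):
    # Assumed ranking rule: a lower estimated response time ranks higher.
    return -state.est_response_s

def choose_site(states):
    return max(states, key=rank).site

sites = [
    PublishedState("site-a", free_cpus=40, waiting_jobs=5, est_response_s=600),
    PublishedState("site-b", free_cpus=0, waiting_jobs=200, est_response_s=7200),
    # Misconfigured site: publishes zeros although it cannot run anything.
    PublishedState("site-c", free_cpus=0, waiting_jobs=0, est_response_s=0),
]

# Every one of 1000 submissions goes to the site with the bogus values.
picks = [choose_site(sites) for _ in range(1000)]
print(set(picks))
```

Because the ranking only sees a few published numbers, one wrong value is enough to redirect the whole workload, which is the failure mode the slide describes.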
DC summary
Outstanding Middleware Issues
• Collection: Outstanding Middleware Issues
  – Important: 1st systematic confrontation of required functionalities with capabilities of the existing middleware
  – Some can be patched or worked around
  – Those related to fundamental problems with underlying models and architectures have to be input as essential requirements to future developments (EGEE)
• Middleware is not yet perfect, but quite stable
  – Much has been improved during the DCs
  – A lot of effort still going into improvements and fixes
  – Big hole is missing space management on SEs, especially for Tier 2 sites
Operational issues (selection)
• Slow response from sites
  – Upgrades, response to problems, etc.
  – Problems reported daily – some problems last for weeks
• Lack of staff available to fix problems
  – Vacation period, other high priority tasks
• Various mis-configurations (see next slide)
• Lack of configuration management – problems that are fixed reappear
• Lack of fabric management (mostly smaller sites)
  – scratch space, single nodes drain queues, incomplete upgrades, ...
• Lack of understanding
  – Admins reformat disks of SE ...
• Provided documentation often not read (carefully)
  – new activity started to develop “hierarchical” adaptive documentation
  – simpler way to install middleware on farm nodes (even remotely in user space)
• Firewall issues – often less than optimal coordination between grid admins and firewall maintainers
• PBS problems
  – Scalability, robustness (switching to Torque helps)
Site (mis) - configurations
• Site mis-configuration was responsible for most of the problems that occurred during the experiments' Data Challenges. Here is an incomplete list of problems:
  – The variable VO_<VO>_SW_DIR points to a non-existent area on WNs.
  – The ESM is not allowed to write in the area dedicated to the software installation
  – Only one certificate allowed to be mapped to the ESM local account
  – Wrong information published in the information system (Glue Object Classes not linked)
  – Queue time limits published in minutes instead of seconds and not normalized
  – /etc/ld.so.conf not properly configured. Shared libraries not found.
  – Machines not synchronized in time
  – Grid-mapfiles not properly built
  – Pool accounts not created but the rest of the tools configured with pool accounts
  – Firewall issues
  – CA files not properly installed
  – NFS problems for home directories or ESM areas
  – Services configured to use the wrong/no Information Index (BDII)
  – Wrong user profiles
  – Default user shell environment too big
• Only partly related to middleware complexity
All the common small problems integrated into 1 BIG PROBLEM
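Several items in the list above are mechanical checks that a site could run before joining production. The following is a minimal sketch of three of them, not one of the tests distributed with LCG-2: the function names, the uppercase form of the variable name, and the 300 second clock tolerance are assumptions, and the write check uses whatever account runs the script rather than the real ESM account.

```python
import os
import tempfile

def check_sw_dir(env, vo):
    """Check that the VO software area variable points to a usable directory."""
    name = f"VO_{vo.upper()}_SW_DIR"  # assumed naming: VO_<VO>_SW_DIR, uppercase VO
    path = env.get(name)
    if not path:
        return [f"{name} is not set"]
    if not os.path.isdir(path):
        return [f"{name} points to a non-existent area: {path}"]
    if not os.access(path, os.W_OK):
        return [f"{path} is not writable by the current account"]
    return []

def check_clock(local_epoch_s, reference_epoch_s, max_skew_s=300):
    """Flag machines whose clock differs too much from a reference time."""
    skew = abs(local_epoch_s - reference_epoch_s)
    return [] if skew <= max_skew_s else [f"clock is off by {skew} s"]

def check_queue_limit(published, configured_seconds):
    """Detect a queue time limit published in minutes instead of seconds."""
    if published == configured_seconds:
        return []
    if published * 60 == configured_seconds:
        return ["queue time limit looks like minutes, not seconds"]
    return [f"published limit {published} does not match {configured_seconds} s"]

# Example run with made-up values.
with tempfile.TemporaryDirectory() as good_dir:
    ok = check_sw_dir({"VO_ATLAS_SW_DIR": good_dir}, "atlas")
missing = check_sw_dir({"VO_ATLAS_SW_DIR": "/no/such/area"}, "atlas")
minutes = check_queue_limit(published=2880, configured_seconds=48 * 3600)
print(ok, missing, minutes)
```

Each check returns a list of problem strings, so results from many nodes can be concatenated into one report per site.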
Running Services
• Multiple instances of core services for each of the experiments
  – separates problems, avoids interference between experiments
  – improves availability
  – allows experiments to maintain individual configuration (information system)
  – addresses scalability to some degree
• Monitoring tools for services currently not adequate
  – tools under development to implement control system
• Access to storage via load balanced interfaces
  – CASTOR
  – dCache
• Services that carry “state” are problematic to restart on new nodes
  – needed after hardware problems or security problems
• “State Transition” between partial usage and full usage of resources
  – required change in queue configuration (fair share, individual queues/VO)
  – next release will come with description for fair share configuration (smaller sites)
DC summary
Support during the DCs
• User (Experiment) Support:
  – GD at CERN worked very closely with the experiments' production managers
  – Informal exchange (e-mail, meetings, phone)
    • “No Secrets” approach, GD people on experiments' mail lists and vice versa – ensured fast response
    • tracking of problems tedious, but both sides have been patient
    • clear learning curve on BOTH sides
  – LCG GGUS (grid user support) at FZK became operational after start of the DCs
    • due to the importance of the DCs the experiments switch slowly to the new service
  – Very good end user documentation by GD-EIS
  – Dedicated testbed for experiments with next LCG-2 release
    • rapid feedback, influenced what made it into the next release
• Installation (Site) Support:
  – GD prepared releases and supported sites (certification, re-certification)
  – Regional centres supported their local sites (some more, some less)
  – Community style help via mailing list (high traffic!!)
  – FAQ lists for troubleshooting and configuration issues: Taipei, RAL
Support during the DCs
• Operations Service:
  – RAL (UK) is leading the sub-project on developing operations services
  – Initial prototype: http://www.grid-support.ac.uk/GOC/
    • Basic monitoring tools
    • Mail lists for problem resolution
    • Working on defining policies for operation, responsibilities (draft document)
    • Working on grid wide accounting
  – Monitoring:
    • GridICE (development of DataTag Nagios-based tools)
    • GridPP job submission monitoring
    • Information system monitoring and consistency check: http://goc.grid.sinica.edu.tw/gstat/
  – CERN GD daily re-certification of sites (including history)
    • escalation procedure under development
    • tracing of site specific problems via problem tracking tool
    • tests core services and configuration
Screen Shots
Problem Handling: PLAN
[Diagram; surviving elements: VO A, VO B, VO C; GD CERN; GGUS (Remedy); GOC; P-Site-1 with S-Site-1 and S-Site-2; P-Site-2 with S-Site-1 and S-Site-2. Labels: Triage: VO / GRID; Monitoring/Followup; Escalation]
Problem Handling: Operation (most cases)
[Diagram; surviving elements: Community; VO A, VO B, VO C; GD CERN; GGUS; GOC; P-Site-1; S-Site-1, S-Site-2, S-Site-3. Labels: Triage; Monitoring, FAQs; Rollout Mailing List; Monitoring, Certification, Follow-Up; FAQs]
Problem Tracking
• GGUS: REMEDY
• Middleware problems: SAVANNAH LCG-OPERATION
• Re-certification: SAVANNAH LCG-SITES
• Many (MOST) problems only tracked by e-mail
• Much confusion on where to put problems
• Training needed to get reasonable 1st level user support
  – canned answers
  – experts need to focus on more complex tasks
• Unification of FAQs (RAL, Taipei, Italy, …)
EGEE Impact on Operations
• The available effort for operations from EGEE is now ramping up:
  – LCG GOC (RAL) EGEE CICs and ROCs, + Taipei
• Hierarchical support structure
  – Regional Operations Centres (ROC)
    • One per region (9)
    • Front-line support for deployment, installation, users
  – Core Infrastructure Centres (CIC)
    • Four (+ Russia next year)
    • Evolve from GOC – monitoring, troubleshooting, operational “control”
    • “24x7” in an 8x5 world ????
    • Also providing VO-specific and general services
  – EGEE NA3 organizes training for users and site admins
• “NOW” at HEPiX
  – Address common issues, experiences
• “Operations and Fabric Workshop”
  – CERN, 1-3 Nov
PART II
• Operation models
  – How much can be delegated to whom?
    • autonomy / availability
  – What are the consequences?
    • cost for 24/7 with 8x5 staff
  – One/multiple models for all sites/regions?
  – One model for site integration, update, user support, security, operation?
    • latency, efficiency, distribution of workload, ...
  – One size fits all?
Next slides are meant to stimulate discussions, not give answers
CICs and ROCs and Operations
• Core Infrastructure Centers (CICs)
  – run services like RBs, Information Indices, VO/VOMS, Catalogues
  – are the distributed Grid Operation Center (GOC)
  – and more...
• Regional Operation Centers (ROCs)
  – coordinate activities in their region
  – give support to regional RCs
  – coordinate setup/upgrades
  – and more...
• Resource Centers (RC)
  – computing and storage
• Operation Management Center (OMC)
  – coordination
Model I Strict Hierarchy
• A CIC locates a problem with an RC or CIC in a region
  – triggered by monitoring / user alert
• The CIC enters the problem into the problem tracking tool and assigns it to a ROC
• The ROC receives a notification and works on solving the problem
  – the region decides locally what the ROC can do on the RCs
    • This can include restarting services etc.
    • The main emphasis is that the region decides on the depth of the interaction.
    • ===> different regions, different procedures
  – CICs NEVER contact a site; the ROC does
    • ===> ROCs need to be staffed all the time
  – The ROC is fully responsible for ALL the sites in the region
Model I Strict Hierarchy
• Pros:
  – Best model to transfer knowledge to the ROCs
    • all information flows through them
  – Different regions can have their own policies
    • this can reflect different administrative relations of sites in a region
  – Clear responsibility
    • until it is discovered it is the CIC's fault, then it is always the ROC's fault
• Cons:
  – High latency
    • even for trivial operations we have to pass through the ROCs
  – ROCs have to be staffed (reachable) all the time. $$$$
  – Regions will develop their own tools
    • parallel strands, less quality
  – Excluded for handling security
Model II Direct Com. Local Contr.
• ROCs are active in:
  – the follow-up of problems that take longer to handle
  – setup of sites
• CICs are active in:
  – handling problems that can be solved by simple interactions
    • communicated directly between CICs and RCs
    • ROCs are informed on all interactions between CICs and RCs
    • all problems are entered into the problem tracking tool
• Restarting of services, etc. is handled by the RCs
Model II Direct Com. Local Contr.
• Pros:
  – Resources are not lost for trivial reasons
  – Principle of local control is maintained
  – ROCs are in the loop,
    • but weak ROCs can't create too severe delays
  – No complex tools for communication management needed
    • mail + IRC sufficient
• Cons:
  – RCs need to be reachable at all times
    • not realistic, and very expensive €€€€€€€€€€
  – CICs have to be aware of the level of maturity of O(100) RCs
  – ROCs have to monitor what is going on to learn the trade
  – Language problems between the CICs and sysadmins
  – Unclear responsibility
    • "This was reported" / "Why didn't the CICs fix it themselves"
Model III Direct Com. Direct Contr.
• Like Model II with some modifications
  – CICs have access to the services on the RCs
    • can, if the RC is not staffed, manage some of the services
    • site publishes at any time
      – whether the local support is reachable or not
      – what actions are permitted by the CICs
    • all interactions are logged and reported to RC and ROC
      – Some tools that allow very controlled (limited) access like this are under development (GSI enabled remote SUDO)
• Variation with ROCs only interaction (IIIa)
Model III Direct Com. Direct Contr.
• Pros:
  – Resources are not lost for trivial reasons
  – ROCs are in the loop,
    • but weak ROCs can't create too severe delays
  – One set of tools for remote operation
    • some uniformity ---> chance for better quality
  – Site decides at any time on balance between local/remote operation
  – RCs can be run for a (short) time unattended
• Cons:
  – Set of tools for secure limited remote operation respecting the sites' policies has to be put in place
  – ROCs have to monitor what is going on to learn the trade
  – Unclear responsibility
    • "This was reported" / "Why didn't the CICs fix it themselves"
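The difference between the three operation models comes down to who acts on a problem that a CIC has found at a resource center, and who is merely informed. The sketch below encodes one reading of the slides as a small routing function; the class, field, and function names are hypothetical, and details such as treating every long-running problem as a ROC task are simplifying assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ResourceCenter:
    name: str
    roc: str                       # the ROC responsible for this RC's region
    staff_reachable: bool = True   # published by the site (Model III)
    cic_permitted: set = field(default_factory=set)  # actions a CIC may perform

def route(model, rc, action, simple=True):
    """Return (who_acts, who_is_informed) for a problem a CIC found at `rc`."""
    if model not in ("I", "II", "III"):
        raise ValueError(f"unknown model: {model}")
    if model == "I":
        # Strict hierarchy: CICs never contact a site, the ROC handles everything.
        return rc.roc, []
    if not simple:
        # Models II and III: problems that take longer go to the ROC for follow-up.
        return rc.roc, []
    if model == "III" and not rc.staff_reachable and action in rc.cic_permitted:
        # Direct control: the CIC acts itself, logged and reported to RC and ROC.
        return "CIC", [rc.name, rc.roc]
    # Direct communication: the RC acts after the CIC contacts it; ROC is informed.
    return rc.name, [rc.roc]

rc = ResourceCenter("rc-1", roc="roc-north", staff_reachable=False,
                    cic_permitted={"restart-service"})
for model in ("I", "II", "III"):
    print(model, route(model, rc, "restart-service"))
```

With an unstaffed site that permits service restarts, Model I sends the restart to the ROC, Model II waits for the RC itself, and Model III lets the CIC act, which mirrors the latency and responsibility trade-offs listed in the pros and cons.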
Sample Use Cases
• User reports jobs failing on one site
• User reports jobs failing on some/all sites
• Monitoring shows site dropping in and out of the IS
• An acute security incident
• Upgrading to a new version
• Post mortem after the security incidents
• …….
• Good preparation for the Operations Workshop
Summary
• LCG-2 services have been supporting the data challenges
  – Many middleware problems have been found – many addressed
  – Middleware itself is reasonably stable
• Biggest outstanding issues are related to providing and maintaining stable operations
• Future middleware has to take this into account:
  – Must be more manageable, trivial to configure and install
  – Management and monitoring must be built into services from the start
• Outcome of the workshop in November is crucial for EGEE operation