Grid Technology
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
Update on SAM monitoring
Wojciech Lapka, David Collados

Outlook
• Current Situation
• Short overview of the migration of the LHC VOs from SAM to Nagios
  – as understood by GT, with some input from CMS and ATLAS
  – detailed update can be given at the next MB
  – not enough time to work on a common report
• January availability report only arrived during the first week of February (delay due to manual data quality assurance)
• Status of ACE computation
• Replacement of FCR for CMS blacklisting
• Issues

Current Situation
• We still have to operate two monitoring infrastructures in parallel
  – SAM legacy services
    • Original end of life was autumn 2010 (EGEE-III)
    • Old SAM Portal, DBs, FCR (operated by CERN-IT)
    • SAM-DPM machines (operated by CERN-IT)
    • SAM-BDII (run by CERN-IT)
  – Nagios-based system
    • CERN-ROC Nagios Instance (CERN-IT)
    • Asia-Pacific Nagios Instance (CERN-IT)
      – Last ROC Nagios not run by an NGI
      – Planned: ALL ROC-Nagios move by 10.2010 !!!
    • Experiment specific Nagios Prod and PreProd Services
      – 8 instances, fully Quattorized, ready to move

Current Situation
• New visualization and front-ends
  – MyEGI running on central Nagios DBs
    • Still needs work (features, bug fixing)
      – Team lost main developer → delay
• GridView (service run by GT)
  – Development by BARC collaboration
  – Service will be integrated into MyEGI
• 2nd Level support for SAM Nagios (GT)
  – Planned: Move to EGI autumn 2010
• Ops Nagios probes maintenance still with GT
  – Agreed to be moved to EMI Product Teams
• Many services and tasks still with the team
  – plus reduced manpower (went from 7 to 4)

Experiments moving to Nagios
• Probes and debugging by IT-ES and experiments
• Services and support by IT-GT
  – Follow-up on failures with experiment contact
  – New production and pre-prod setup from scratch
• Validation of Nagios monitoring
  – Dec/Jan availability reports for SAM/Nagios
    • To compare the results
    • We expect Nagios and SAM availability figures to be within 5%
      – Nagios should be a bit higher due to re-tries
  – Equivalent metrics for CE/SRM at T0/T1s
    • Standard GridView algorithm was used
    • This allows direct comparison
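
The expectation that Nagios figures come out slightly higher because of re-tries can be illustrated with a small calculation. This is only a sketch under a strong assumption that is not stated on the slides: each attempt fails independently with the same probability. The function name and the example failure rate are illustrative.

```python
def effective_pass_rate(p_fail: float, attempts: int) -> float:
    """Probability that at least one of `attempts` tries succeeds.

    Assumption (not from the slides): attempts fail independently,
    each with probability `p_fail`.
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    # A test result is only bad if every single attempt failed.
    return 1.0 - p_fail ** attempts


if __name__ == "__main__":
    # Illustrative 5% per-attempt failure rate, with 1 to 3 attempts.
    for attempts in (1, 2, 3):
        rate = effective_pass_rate(0.05, attempts)
        print(f"{attempts} attempt(s): pass rate {rate:.4%}")
```

Under this assumption a single re-try already removes most transient failures, which is consistent with Nagios being expected to report a bit higher than SAM rather than lower.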

Experiments moving to Nagios
• GridView reports for LHC VOs:
  – Official (SAM based):
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/2010/201012/wlcg/
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201101/wlcg/
  – New (Nagios based):
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/2010/201012-nagios/
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201101-nagios/

Status: ALICE on Nagios
• Followed up by Maria Dolores Saiz & Maarten Litmaath
• Random failures during job submission
  – Likely reason: 5h30 timeout in Nagios
    » SAM used 12h
  – December
    • RAL (Availability: 61% in Nagios, 93% in SAM)
  – January
    • RAL (Availability: 70% in Nagios, 90% in SAM)
• Suggested next step: increase timeout & re-evaluate in March

Status: ATLAS on Nagios
• Followed up by Alessandro di Girolamo
• December 2010
  – Very similar results for Nagios and SAM
• January 2011
  – BNL (43% in Nagios, 86% in SAM)
    • Problem understood by ATLAS and fixed by site
    • Nagios uses new DN and CRL was not sufficiently recent
• Nagios based availabilities have been implemented in ATLAS Dashboard
  – Data stored in legacy SAM DB
  – http://tinyurl.com/dashb-sam-nagios-48h

Status: CMS on Nagios
• Followed up by Andrea Sciaba
• 'org.cms.SRM-VOGet' fails randomly
  – December
    • RAL-LCG2 (81% in Nagios, 94% in SAM)
  – January
    • Taiwan-LCG2 (94% in Nagios, 100% in SAM)
  – Problem understood by CMS
    • Probe, Space Token and site config. related issue
• Next steps (February):
  – Modification of CMS Nagios probe
  – Calculate and compare Dashboard availabilities
  – Run test 'org.cms.WN-mc' with production role

Status: LHCb on Nagios
• Followed up by Roberto Santinelli
• Random failures during job submission
  – Most likely due to 5h30 timeout in Nagios
  – December
    • PIC (84% in Nagios, 99% in SAM)
    • INFN-CNAF (80% in Nagios, 99% in SAM)
    • RAL (79% in Nagios, 94% in SAM)
  – January 2011
    • RAL (88% in Nagios, 97% in SAM)
• Suggested next step: Increase timeout & re-evaluate in March
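
The site figures quoted on the four status slides can be checked against the expected 5% agreement between Nagios and SAM. The sketch below does that. It assumes "within 5%" means 5 percentage points, which the slides do not specify. The `Report` type and function name are illustrative, and only the numbers quoted on the slides are used.

```python
from typing import List, NamedTuple


class Report(NamedTuple):
    vo: str
    site: str
    month: str
    nagios: float  # availability in percent
    sam: float     # availability in percent


# Figures as quoted on the ALICE, ATLAS, CMS and LHCb status slides.
REPORTS = [
    Report("ALICE", "RAL", "Dec", 61, 93),
    Report("ALICE", "RAL", "Jan", 70, 90),
    Report("ATLAS", "BNL", "Jan", 43, 86),
    Report("CMS", "RAL-LCG2", "Dec", 81, 94),
    Report("CMS", "Taiwan-LCG2", "Jan", 94, 100),
    Report("LHCb", "PIC", "Dec", 84, 99),
    Report("LHCb", "INFN-CNAF", "Dec", 80, 99),
    Report("LHCb", "RAL", "Dec", 79, 94),
    Report("LHCb", "RAL", "Jan", 88, 97),
]

# Assumption: the tolerance is expressed in percentage points.
TOLERANCE = 5.0


def outside_tolerance(reports: List[Report],
                      tolerance: float = TOLERANCE) -> List[Report]:
    """Return the reports whose Nagios and SAM figures differ by more
    than `tolerance` percentage points."""
    return [r for r in reports if abs(r.nagios - r.sam) > tolerance]


if __name__ == "__main__":
    for r in outside_tolerance(REPORTS):
        diff = r.nagios - r.sam
        print(f"{r.vo:6} {r.site:12} {r.month}  "
              f"Nagios {r.nagios:.0f}%  SAM {r.sam:.0f}%  diff {diff:+.0f}")
```

Every listed site has Nagios below SAM, the opposite of what re-tries alone would produce, which fits the timeout, CRL and probe explanations given on the slides.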

HEP VOs – next steps
• Validate dashboard applications with Nagios tests (IT/ES)
  – Still requires legacy SAM DB and portal
    • The portal provides a programmatic interface (PI)
    • New interface by MyEGI has been available since mid-January
      – Still pre-prod service but can be used for migration
• GT will stop old SAM system as soon as we get green light from the experiments
  – June/July 2011: last security patches for SLC4; service can't be migrated to SL5
• Cannot afford running 2 services in parallel

ACE Schedule
• December 2010
  – Validated the standard availability for OPS √
• January
  – Computation of standard availabilities for LHC experiments (one profile per VO) √
• February
  – Multiple availabilities (different profiles, same algorithm) per VO √
• March
  – Multiple availabilities (different profiles and algorithms: CREAM CE use case) per VO
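
The distinction between "different profiles, same algorithm" and "different profiles and algorithms" can be sketched as follows. This is a deliberately simplified illustration and not the ACE or GridView computation: it assumes hourly OK/not-OK samples per service flavour, treats a profile as a list of flavours, and treats the algorithm as the rule that combines them. The flavour names, sample data and function name are all illustrative.

```python
from typing import Callable, Dict, Iterable, List

# Hourly status samples per service flavour: True = OK, False = not OK.
Samples = Dict[str, List[bool]]


def availability(samples: Samples,
                 profile: Iterable[str],
                 combine: Callable[[Iterable[bool]], bool] = all) -> float:
    """Fraction of hours in which the profile counts as 'up'.

    `profile` selects which flavours are considered; `combine` is the
    algorithm: `all` requires every flavour to be OK, `any` accepts
    one OK flavour. Both are simplifications chosen for this sketch.
    """
    flavours = list(profile)
    if not flavours:
        raise ValueError("profile must name at least one flavour")
    hours = min(len(samples[f]) for f in flavours)
    if hours == 0:
        raise ValueError("no samples for this profile")
    up = sum(1 for h in range(hours)
             if combine(samples[f][h] for f in flavours))
    return up / hours


# Made-up samples for one site over four hours.
samples = {
    "LCG-CE":   [True, True, False, True],
    "CREAM-CE": [True, False, True, True],
    "SRM":      [True, True, True, False],
}

if __name__ == "__main__":
    # Different profiles, same algorithm (every flavour must be OK).
    print(availability(samples, ["LCG-CE", "SRM"]))
    print(availability(samples, ["CREAM-CE", "SRM"]))
    # Different algorithm: either CE flavour being OK is enough.
    print(availability(samples, ["LCG-CE", "CREAM-CE"], combine=any))
```

Keeping the profile and the combining rule as separate inputs is what lets one data set yield several availability figures per VO.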

ACE – next steps
• March: validate ACE reports with GridView
  – OPS & LHC VOs
• March: Generate two ACE reports for OPS
  – CREAM & LCG-CE and compare results
• April: ACE validation
  – Production readiness
• May: ACE in production mode
  – Given that no major issues are found

CMS blacklisting
• In cooperation with Andrea Sciaba
• CREAM & LCG-CE status computed based on Nagios results
• Generic Programmatic Interface for data export available in JSON/XML
• Ongoing work on a solution
  – Test BDII with blacklisting by end of this week
  – Move to prod after CMS green light
• Would like CMS to consider the use of the generic PI
  – Increased flexibility
  – More uniform approach, no extra service
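
How a client might derive a blacklist directly from a JSON export, without an extra service, is sketched below. The slides do not describe the PI's schema, so the record layout (`hostname`, `flavour`, `status`), the status value "OK" and the host names are hypothetical placeholders. Only the Python standard library is used.

```python
import json
from typing import Iterable, List


def blacklist_from_export(payload: str,
                          flavours: Iterable[str] = ("CREAM-CE", "LCG-CE")
                          ) -> List[str]:
    """Return the sorted host names of CEs whose status is not OK.

    Assumption: `payload` is a JSON array of objects with the keys
    'hostname', 'flavour' and 'status'. This layout is hypothetical
    and is not the documented format of the generic PI.
    """
    wanted = set(flavours)
    records = json.loads(payload)
    bad = {r["hostname"] for r in records
           if r["flavour"] in wanted and r["status"] != "OK"}
    return sorted(bad)


# Made-up export with placeholder host names.
EXAMPLE = json.dumps([
    {"hostname": "ce01.example.org", "flavour": "CREAM-CE", "status": "OK"},
    {"hostname": "ce02.example.org", "flavour": "LCG-CE", "status": "CRITICAL"},
    {"hostname": "se01.example.org", "flavour": "SRM", "status": "CRITICAL"},
])

if __name__ == "__main__":
    print(blacklist_from_export(EXAMPLE))
```

Filtering on the client side is one way to read the "increased flexibility" argument: each consumer can apply its own flavour selection and status policy to the same export.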

Issues
• Manpower went down by 60% (4 FTEs remain)
• Team is in a Catch-22 situation
  – All resources are absorbed by operations and support
  – Decommissioning of legacy services would free resources
    • But requires effort that is not available
• We need to stop/move away services or development will freeze until more resources arrive
• Risk that use of Nagios data via old SAM-DB continues for too long → move to new PI
Questions?