Update on SAM monitoring
DESCRIPTION
Update on SAM monitoring. Wojciech Lapka, David Collados. Outlook: current situation; short overview of the migration of the LHC VOs from SAM to Nagios, as understood by GT with some input from CMS and ATLAS; a detailed update can be given at the next MB.
Grid Technology
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
Update on SAM monitoring
Wojciech Lapka, David Collados
Outlook
• Current situation
• Short overview of the migration of the LHC VOs from SAM to Nagios
  – As understood by GT, with some input from CMS and ATLAS
  – A detailed update can be given at the next MB
  – Not enough time to work on a common report
• The January availability report arrived only during the first week of February (delay due to manual data quality assurance)
• Status of ACE computation
• Replacement of FCR for CMS blacklisting
• Issues
Current Situation
• We still have to operate two monitoring infrastructures in parallel
  – SAM legacy services
    • Original end of life was autumn 2010 (EGEE-III)
    • Old SAM Portal, DBs, FCR (operated by CERN-IT)
    • SAM-DPM machines (operated by CERN-IT)
    • SAM-BDII (run by CERN-IT)
  – Nagios-based system
    • CERN-ROC Nagios instance (CERN-IT)
    • Asia-Pacific Nagios instance (CERN-IT)
      – Last ROC Nagios not run by an NGI
      – Planned: all ROC Nagios instances move by 10.2010 !!!
    • Experiment-specific Nagios Prod and PreProd services
      – 8 instances, fully Quattorized, ready to move
Current Situation
• New visualization and front-ends
  – MyEGI running on central Nagios DBs
    • Still needs work (features, bug fixing)
      – Team lost its main developer → delay
• GridView (service run by GT)
  – Development by BARC collaboration
  – Service will be integrated into MyEGI
• 2nd level support for SAM Nagios (GT)
  – Planned: move to EGI autumn 2010
• Ops Nagios probes maintenance still with GT
  – Agreed to be moved to EMI Product Teams
• Many services and tasks still with the team
  – Plus reduced manpower (went from 7 → 4)
Experiments moving to Nagios
• Probes and debugging by IT-ES and experiments
• Services and support by IT-GT
  – Follow-up on failures with experiment contact
  – New production and pre-prod setup from scratch
• Validation of Nagios monitoring
  – Dec/Jan availability reports for SAM/Nagios
    • To compare the results
    • We expect Nagios and SAM availability figures to be within 5%
      – Nagios should be a bit higher due to re-tries
  – Equivalent metrics for CE/SRM at T0/T1s
    • Standard GridView algorithm was used
    • This allows direct comparison
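The 5% expectation above can be turned into a simple check. The sketch below is only an illustration: the name `flag_discrepancies`, the record layout, and the reading of "within 5%" as five percentage points are assumptions, not part of the GridView tooling. The sample figures are the December LHCb values quoted later in this presentation.

```python
from typing import NamedTuple


class SiteAvailability(NamedTuple):
    site: str
    nagios: float  # availability in percent, from the Nagios-based report
    sam: float     # availability in percent, from the SAM-based report


def flag_discrepancies(rows, tolerance=5.0):
    """Return the rows whose Nagios and SAM figures differ by more than
    `tolerance` percentage points (assumed reading of "within 5%")."""
    return [r for r in rows if abs(r.nagios - r.sam) > tolerance]


# December LHCb figures quoted in this presentation.
december_lhcb = [
    SiteAvailability("PIC", 84, 99),
    SiteAvailability("INFN-CNAF", 80, 99),
    SiteAvailability("RAL", 79, 94),
]

flagged = flag_discrepancies(december_lhcb)
```

With these inputs all three sites exceed the tolerance, which is why they are listed individually on the per-VO status slides.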
Experiments moving to Nagios
• GridView reports for LHC VOs:
  – Official (SAM based):
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/2010/201012/wlcg/
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201101/wlcg/
  – New (Nagios based):
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/2010/201012-nagios/
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201101-nagios/
Status: ALICE on Nagios
• Followed up by Maria Dolores Saiz & Maarten Litmaath
• Random failures during job submission
  – Likely reason: 5h30 timeout in Nagios
    » SAM used 12h
  – December
    • RAL (availability: 61% in Nagios, 93% in SAM)
  – January
    • RAL (availability: 70% in Nagios, 90% in SAM)
• Suggested next step: increase timeout & re-evaluate in March
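To illustrate why a shorter timeout can lower the measured availability, the sketch below counts job tests that would time out under the two values mentioned above (5h30 for Nagios, 12h for SAM). The job turnaround times are invented for illustration, and the rule "a job that has not finished before the timeout counts as a failure" is an assumption about how the test outcome is derived.

```python
from datetime import timedelta

NAGIOS_TIMEOUT = timedelta(hours=5, minutes=30)  # value quoted on the slide
SAM_TIMEOUT = timedelta(hours=12)                # value quoted on the slide


def count_timeouts(durations, timeout):
    """Count jobs whose time from submission to completion exceeds `timeout`."""
    return sum(1 for d in durations if d > timeout)


# Hypothetical job turnaround times (queueing plus execution), not measured data.
durations = [
    timedelta(hours=1),
    timedelta(hours=4, minutes=45),
    timedelta(hours=6),
    timedelta(hours=9, minutes=30),
    timedelta(hours=13),
]

failed_nagios = count_timeouts(durations, NAGIOS_TIMEOUT)
failed_sam = count_timeouts(durations, SAM_TIMEOUT)
```

Under this assumption, jobs that wait in a site queue for between 5h30 and 12h fail in Nagios but pass in SAM, which is consistent with the suggested next step of increasing the timeout.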
Status: ATLAS on Nagios
• Followed up by Alessandro di Girolamo
• December 2010
  – Very similar results for Nagios and SAM
• January 2011
  – BNL (43% in Nagios, 86% in SAM)
    • Problem understood by ATLAS and fixed by the site
    • Nagios uses a new DN and the CRL was not sufficiently recent
• Nagios-based availabilities have been implemented in the ATLAS Dashboard
  – Data stored in legacy SAM DB
  – http://tinyurl.com/dashb-sam-nagios-48h
Status: CMS on Nagios
• Followed up by Andrea Sciaba
• ‘org.cms.SRM-VOGet’ fails randomly
  – December
    • RAL-LCG2 (81% in Nagios, 94% in SAM)
  – January
    • Taiwan-LCG2 (94% in Nagios, 100% in SAM)
  – Problem understood by CMS
    • Issue related to the probe, space token and site configuration
• Next steps (February):
  – Modification of CMS Nagios probe
  – Calculate and compare Dashboard availabilities
  – Run test ‘org.cms.WN-mc’ with production role
Status: LHCb on Nagios
• Followed up by Roberto Santinelli
• Random failures during job submission
  – Most likely due to 5h30 timeout in Nagios
  – December
    • PIC (84% in Nagios, 99% in SAM)
    • INFN-CNAF (80% in Nagios, 99% in SAM)
    • RAL (79% in Nagios, 94% in SAM)
  – January 2011
    • RAL (88% in Nagios, 97% in SAM)
• Suggested next step: increase timeout & re-evaluate in March
HEP VOs – next steps
• Validate dashboard applications with Nagios tests (IT/ES)
  – Still requires legacy SAM DB and portal
    • The portal provides a programmatic interface (PI)
    • A new interface by MyEGI has been available since mid-January
      – Still a pre-prod service, but can be used for migration
• GT will stop the old SAM system as soon as we get the green light from the experiments
  – June/July 2011: last security patches for SLC4; the service cannot be migrated to SL5
    • Cannot afford running 2 services in parallel
ACE Schedule
• December 2010
  – Validated the standard availability for OPS √
• January
  – Computation of standard availabilities for LHC experiments (one profile per VO) √
• February
  – Multiple availabilities (different profiles, same algorithm) per VO √
• March
  – Multiple availabilities (different profiles and algorithms: CREAM CE use case) per VO
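The idea of "different profiles, same algorithm" can be sketched as follows. This is not the ACE implementation: the profile names, the status data, and the simplified rule (a site counts as up in a given hour if every service type in the profile is OK, and availability is the fraction of up hours) are all assumptions made for illustration.

```python
def availability(hourly_status, profile):
    """Fraction of hours in which every service type in `profile` is OK.

    `hourly_status` is a list of dicts mapping service type -> "OK" or "CRITICAL".
    Simplified, assumed algorithm: unknown states and scheduled downtime
    are not modelled here.
    """
    if not hourly_status:
        raise ValueError("no status samples")
    up = sum(
        1 for hour in hourly_status
        if all(hour[service] == "OK" for service in profile)
    )
    return up / len(hourly_status)


# Hypothetical profiles: the same algorithm evaluated over different service sets.
PROFILES = {
    "vo_critical": ("CE", "SRM"),
    "vo_extended": ("CE", "SRM", "LFC"),
}

# Hypothetical four hours of test results for one site.
samples = [
    {"CE": "OK", "SRM": "OK", "LFC": "OK"},
    {"CE": "OK", "SRM": "OK", "LFC": "CRITICAL"},
    {"CE": "CRITICAL", "SRM": "OK", "LFC": "OK"},
    {"CE": "OK", "SRM": "OK", "LFC": "OK"},
]

results = {name: availability(samples, services) for name, services in PROFILES.items()}
```

The same site gets a different figure per profile because each profile selects a different set of services, while the computation itself is unchanged. The March milestone, where the algorithm also differs per profile, is not covered by this sketch.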
ACE – next steps
• March: validate ACE reports with GridView
  – OPS & LHC VOs
• March: generate two ACE reports for OPS
  – CREAM & LCG-CE and compare results
• April: ACE validation
  – Production readiness
• May: ACE in production mode
  – Given that no major issues are found
CMS blacklisting
• In cooperation with Andrea Sciaba
• CREAM & LCG-CE status computed based on Nagios results
• Generic Programmatic Interface for data export available in JSON/XML
• Ongoing work on a solution
  – Test BDII with blacklisting by end of this week
  – Move to prod after CMS green light
• Would like CMS to consider the use of the generic PI
  – Increased flexibility
  – More uniform approach, no extra service
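As a rough illustration of how a consumer might use a JSON export to derive a blacklist, the sketch below parses a payload and lists the CEs whose status is not OK. The JSON layout, field names, and host names are hypothetical; the actual format of the programmatic interface is not described in this presentation.

```python
import json

# Hypothetical payload; the real PI schema is not given in the slides.
payload = """
[
  {"hostname": "ce01.example.org", "flavour": "CREAM-CE", "status": "OK"},
  {"hostname": "ce02.example.org", "flavour": "LCG-CE", "status": "CRITICAL"},
  {"hostname": "ce03.example.org", "flavour": "CREAM-CE", "status": "CRITICAL"}
]
"""


def blacklist(records, bad_statuses=("CRITICAL",)):
    """Return the sorted host names of CEs whose status is in `bad_statuses`."""
    return sorted(r["hostname"] for r in records if r["status"] in bad_statuses)


blacklisted = blacklist(json.loads(payload))
```

A client built this way needs no dedicated blacklisting service, which matches the "no extra service" argument on the slide.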
Issues
• Manpower went down by 60% (4 FTEs remain)
• Team is in a Catch-22 situation
  – All resources are absorbed by operations and support
  – Decommissioning of legacy services would free resources
    • But requires effort that is not available
• We need to stop/move away services or development will freeze until more resources arrive
• Risk that use of Nagios data via old SAM-DB continues for too long → move to new PI
Questions?