Transcript
Page 1: Update on SAM monitoring

Grid Technology

CERN IT Department

CH-1211 Geneva 23

Switzerland

www.cern.ch/it

Update on SAM monitoring

Wojciech Lapka, David Collados

Page 2: Update on SAM monitoring

Outlook

• Current Situation
• Short overview of the migration of the LHC VOs from SAM to Nagios
  – as understood by GT, with some input from CMS and ATLAS
  – detailed update can be given at the next MB
  – not enough time to work on a common report
    • January availability report only arrived during the first week of February (delay due to manual data quality assurance)
• Status of ACE computation
• Replacement of FCR for CMS blacklisting
• Issues

Page 3: Update on SAM monitoring

Current Situation

• We still have to operate two monitoring infrastructures in parallel
  – SAM legacy services
    • Original end of life was autumn 2010 (EGEE-III)
    • Old SAM Portal, DBs, FCR (operated by CERN-IT)
    • SAM-DPM machines (operated by CERN-IT)
    • SAM-BDII (run by CERN-IT)
  – Nagios-based system
    • CERN-ROC Nagios instance (CERN-IT)
    • Asia-Pacific Nagios instance (CERN-IT)
      – Last ROC Nagios not run by an NGI
      – Planned: ALL ROC-Nagios move by 10.2010 !!!
    • Experiment-specific Nagios Prod and PreProd services
      – 8 instances, fully Quattorized, ready to move

Page 4: Update on SAM monitoring

Current Situation

• New visualization and front-ends
  – MyEGI running on central Nagios DBs
    • Still needs work (features, bug fixing)
      – Team lost main developer, causing delay
• GridView (service run by GT)
  – Development by BARC collaboration
  – Service will be integrated into MyEGI
• 2nd level support for SAM Nagios (GT)
  – Planned: move to EGI autumn 2010
• Ops Nagios probes maintenance still with GT
  – Agreed to be moved to EMI Product Teams
• Many services and tasks still with the team
  – plus reduced manpower (went from 7 to 4)

Page 5: Update on SAM monitoring

Experiments moving to Nagios

• Probes and debugging by IT-ES and experiments
• Services and support by IT-GT
  – Follow-up on failures with experiment contact
  – New production and pre-prod setup from scratch
• Validation of Nagios monitoring
  – Dec/Jan availability reports for SAM/Nagios
    • To compare the results
    • We expect Nagios and SAM availability figures to be within 5%
      – Nagios should be a bit higher due to re-tries
  – Equivalent metrics for CE/SRM at T0/T1s
    • Standard GridView algorithm was used
    • This allows direct comparison
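The 5% expectation can be checked mechanically once both reports exist. The following is a minimal sketch, not the GridView or ACE implementation. It assumes that "within 5%" means five percentage points (the slides do not say whether the tolerance is absolute or relative), and the function name and data layout are hypothetical. The figures are the Tier-1 numbers quoted on the experiment status slides of this talk.

```python
# Hypothetical helper: flag entries whose Nagios and SAM availabilities differ
# by more than a tolerance. Assumption: "5%" is an absolute difference in
# percentage points.
TOLERANCE_POINTS = 5

# (VO, month, site) -> (Nagios %, SAM %), as quoted on the status slides.
REPORTED = {
    ("ALICE", "Dec", "RAL"): (61, 93),
    ("ALICE", "Jan", "RAL"): (70, 90),
    ("ATLAS", "Jan", "BNL"): (43, 86),
    ("CMS", "Dec", "RAL-LCG2"): (81, 94),
    ("CMS", "Jan", "Taiwan-LCG2"): (94, 100),
    ("LHCb", "Dec", "PIC"): (84, 99),
    ("LHCb", "Dec", "INFN-CNAF"): (80, 99),
    ("LHCb", "Dec", "RAL"): (79, 94),
    ("LHCb", "Jan", "RAL"): (88, 97),
}


def outliers(figures, tolerance=TOLERANCE_POINTS):
    """Return {key: nagios - sam} for entries outside the tolerance."""
    return {
        key: nagios - sam
        for key, (nagios, sam) in figures.items()
        if abs(nagios - sam) > tolerance
    }


if __name__ == "__main__":
    for (vo, month, site), diff in sorted(outliers(REPORTED).items()):
        print(f"{vo:5} {month} {site:12} Nagios - SAM = {diff:+d} points")
```

With these inputs every quoted entry is flagged, and in each case the Nagios figure is the lower one, which is the opposite of the expected direction.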

Page 6: Update on SAM monitoring

Experiments moving to Nagios

• GridView reports for LHC VOs:
  – Official (SAM based):
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/2010/201012/wlcg/
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201101/wlcg/
  – New (Nagios based):
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/2010/201012-nagios/
    http://gvdev.cern.ch/GRIDVIEW/downloads/Reports/201101-nagios/

Page 7: Update on SAM monitoring

Status: ALICE on Nagios

• Followed up by Maria Dolores Saiz & Maarten Litmaath
• Random failures during job submission
  – Likely reason: 5h30 timeout in Nagios
    » SAM used 12h
  – December
    • RAL (Availability: 61% in Nagios, 93% in SAM)
  – January
    • RAL (Availability: 70% in Nagios, 90% in SAM)
• Suggested next step: increase timeout & re-evaluate in March
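To illustrate why a shorter timeout alone could lower the measured availability, the sketch below classifies the same test job under both limits. It is an illustration under stated assumptions: the function name, the pass/fail rule and the sample durations are invented, and only the 5h30 and 12h limits come from the slide.

```python
from datetime import timedelta

NAGIOS_TIMEOUT = timedelta(hours=5, minutes=30)  # limit quoted on the slide
SAM_TIMEOUT = timedelta(hours=12)                # limit quoted on the slide


def outcome(job_duration, timeout):
    """Assumed rule: a test job that does not finish within the timeout fails."""
    return "ok" if job_duration <= timeout else "timeout"


if __name__ == "__main__":
    # Invented sample durations (time from submission to result), in hours.
    for hours in (1, 4, 6, 9, 13):
        duration = timedelta(hours=hours)
        print(
            f"{hours:2d}h  Nagios: {outcome(duration, NAGIOS_TIMEOUT):8}"
            f"SAM: {outcome(duration, SAM_TIMEOUT)}"
        )
```

Under this assumed rule, any job that takes between 5h30 and 12h passes under the SAM limit but is counted as a failure under the Nagios limit, which would be consistent with the lower Nagios figures.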

Page 8: Update on SAM monitoring

Status: ATLAS on Nagios

• Followed up by Alessandro di Girolamo
• December 2010
  – Very similar results for Nagios and SAM
• January 2011
  – BNL (43% in Nagios, 86% in SAM)
    • Problem understood by ATLAS and fixed by site
    • Nagios uses new DN and CRL was not sufficiently recent
• Nagios-based availabilities have been implemented in ATLAS Dashboard
  – Data stored in legacy SAM DB
  – http://tinyurl.com/dashb-sam-nagios-48h

Page 9: Update on SAM monitoring

Status: CMS on Nagios

• Followed up by Andrea Sciaba
• ‘org.cms.SRM-VOGet’ fails randomly
  – December
    • RAL-LCG2 (81% in Nagios, 94% in SAM)
  – January
    • Taiwan-LCG2 (94% in Nagios, 100% in SAM)
  – Problem understood by CMS
    • Probe, Space Token and site config. related issue
• Next steps (February):
  – Modification of CMS Nagios probe
  – Calculate and compare Dashboard availabilities
  – Run test ‘org.cms.WN-mc’ with production role

Page 10: Update on SAM monitoring

Status: LHCb on Nagios

• Followed up by Roberto Santinelli
• Random failures during job submission
  – Most likely due to 5h30 timeout in Nagios
  – December
    • PIC (84% in Nagios, 99% in SAM)
    • INFN-CNAF (80% in Nagios, 99% in SAM)
    • RAL (79% in Nagios, 94% in SAM)
  – January 2011
    • RAL (88% in Nagios, 97% in SAM)
• Suggested next step: increase timeout & re-evaluate in March

Page 11: Update on SAM monitoring

HEP VOs – next steps

• Validate dashboard applications with Nagios tests (IT/ES)
  – Still requires legacy SAM DB and portal
    • The portal provides a programmatic interface (PI)
    • New interface by MyEGI has been available since mid-January
      – Still a pre-prod service but can be used for migration
• GT will stop the old SAM system as soon as we get the green light from the experiments
  – June/July 2011: last security patches for SLC4
  – Service can’t be migrated to SL5
• Cannot afford running 2 services in parallel

Page 12: Update on SAM monitoring

ACE Schedule

• December 2010
  – Validated the standard availability for OPS √
• January
  – Computation of standard availabilities for LHC experiments (one profile per VO) √
• February
  – Multiple availabilities (different profiles, same algorithm) per VO √
• March
  – Multiple availabilities (different profiles and algorithms: CREAM CE use case) per VO
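The slides do not define ACE profiles or algorithms, so the following is only a hedged sketch of what "different profiles, same algorithm" versus "different profiles and algorithms" could mean. It assumes that a profile is the list of service flavours taken into account, that an algorithm combines the per-flavour status of one time bin into a single up/down value, and that availability is the fraction of bins that are up. All names and the sample data are hypothetical; this is not the ACE implementation.

```python
# Invented hourly status samples per service flavour (True = tests passed).
STATUS = {
    "CREAM-CE": [True, True, False, True],
    "LCG-CE": [True, False, False, True],
    "SRM": [True, True, True, True],
}


def require_all(bin_status):
    """Assumed algorithm A: every flavour in the profile must be up."""
    return all(bin_status.values())


def any_ce_and_rest(bin_status):
    """Assumed algorithm B: CE flavours are interchangeable (at least one up),
    while every non-CE flavour must be up."""
    ce = [up for name, up in bin_status.items() if name.endswith("CE")]
    rest = [up for name, up in bin_status.items() if not name.endswith("CE")]
    return (any(ce) if ce else True) and all(rest)


def availability(status, profile, algorithm):
    """Fraction of time bins in which the algorithm reports 'up' for the profile."""
    bins = zip(*(status[name] for name in profile))
    results = [algorithm(dict(zip(profile, values))) for values in bins]
    return sum(results) / len(results)


if __name__ == "__main__":
    full = ("CREAM-CE", "LCG-CE", "SRM")
    lcg_only = ("LCG-CE", "SRM")
    print("lcg_only, require_all:", availability(STATUS, lcg_only, require_all))
    print("full, require_all:    ", availability(STATUS, full, require_all))
    print("full, any_ce_and_rest:", availability(STATUS, full, any_ce_and_rest))
```

The point of the sketch is that the same test results can yield different availability figures depending on which profile and which combination rule are applied, which is why the two need to be validated separately.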

Page 13: Update on SAM monitoring

ACE – next steps

• March: validate ACE reports with GridView
  – OPS & LHC VOs
• March: generate two ACE reports for OPS
  – CREAM & LCG-CE, and compare results
• April: ACE validation
  – Production readiness
• May: ACE in production mode
  – Given that no major issues are found

Page 14: Update on SAM monitoring

CMS blacklisting

• In cooperation with Andrea Sciaba
• CREAM & LCG-CE status computed based on Nagios results
• Generic Programmatic Interface for data export available in JSON/XML
• Ongoing work on a solution
  – Test BDII with blacklisting by end of this week
  – Move to prod after CMS green light
• Would like CMS to consider the use of the generic PI
  – Increased flexibility
  – More uniform approach, no extra service
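As an illustration of how a consumer might derive a blacklist from the generic PI's JSON export, here is a minimal sketch. The JSON layout, field names and status values are assumptions made for the example, since the slides do not describe the actual PI schema, and the hostnames are placeholders.

```python
import json

# Hypothetical export: one record per CE, with a status computed from Nagios
# results. The real PI schema is not given in the slides.
SAMPLE_EXPORT = """
[
  {"hostname": "ce01.example.org", "flavour": "CREAM-CE", "status": "OK"},
  {"hostname": "ce02.example.org", "flavour": "LCG-CE", "status": "CRITICAL"},
  {"hostname": "ce03.example.org", "flavour": "CREAM-CE", "status": "WARNING"}
]
"""


def blacklist(export_text, bad_statuses=("CRITICAL",)):
    """Return the sorted hostnames whose status is in bad_statuses.

    Which statuses lead to blacklisting is an assumed policy, not CMS's rule.
    """
    records = json.loads(export_text)
    return sorted(r["hostname"] for r in records if r["status"] in bad_statuses)


if __name__ == "__main__":
    print(blacklist(SAMPLE_EXPORT))
```

Keeping the policy (`bad_statuses`) on the consumer side is one way to read the "increased flexibility" argument: the export stays generic and each VO decides what to blacklist.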

Page 15: Update on SAM monitoring

Issues

• Manpower went down by 60% (4 FTEs remain)
• Team is in a Catch-22 situation
  – All resources are absorbed by operations and support
  – Decommissioning of legacy services would free resources
    • But requires effort that is not available
• We need to stop/move away services or development will freeze until more resources arrive
• Risk that use of Nagios data via old SAM-DB continues for too long: move to new PI

Page 16: Update on SAM monitoring

Questions?
