Page 1

EGEE-II INFSO-RI-031688

Enabling Grids for E-sciencE

www.eu-egee.org

EGEE and gLite are registered trademarks

Best Practices and Use cases

David Bouvet, Ioannis Liabotis

COD – 18, Abingdon, 02/12/2008

Page 2

Best Practices – General COD activity

• Follow up tickets assigned to developers
• Use the CIC mailing lists
• Report problems with tests on the CIC mailing list so that other COD teams are aware of them
• Read BROADCAST messages such as downtime announcements for Core OPS tools
• Escalate tickets properly
• Minimize the number of tickets per site
• Use alarm masking
• Answer comments from site administrators and try to adapt the template escalation mails
• Do not leave overdue tickets or alarms open at the end of the shift
• Report inactivity

Page 3

Best Practices – Hand Over Logs

• For OPS Meeting
  – List of sites for escalation (ROC, Site, GGUS #, reason)
  – Operational tools problems
  – Issues with COD procedures and OPS manual
  – Problems in Grid Core services
  – Tickets that need attention
• For COD use
  – Identification of complex tickets – new use cases identification
  – Strange alarms – not easily transformed into tickets
  – Open tickets not related to alarms
  – New issues arising in the OPS meeting
• The lead team updates the log after the weekly OPS meeting with all this info
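
The handover log contents listed above could be captured in a small data structure. The following is only a sketch: the class names, field names and the summary format are assumptions made for illustration, not part of the COD dashboard or of any existing tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EscalationEntry:
    """One site proposed for escalation (hypothetical record layout)."""
    roc: str
    site: str
    ggus_ticket: int
    reason: str


@dataclass
class HandoverLog:
    # Part reported at the weekly OPS meeting.
    sites_for_escalation: List[EscalationEntry] = field(default_factory=list)
    tool_problems: List[str] = field(default_factory=list)
    procedure_issues: List[str] = field(default_factory=list)
    core_service_problems: List[str] = field(default_factory=list)
    tickets_needing_attention: List[int] = field(default_factory=list)
    # Part kept for COD use.
    complex_tickets: List[int] = field(default_factory=list)
    strange_alarms: List[str] = field(default_factory=list)
    tickets_without_alarm: List[int] = field(default_factory=list)
    new_issues_from_ops_meeting: List[str] = field(default_factory=list)

    def ops_meeting_summary(self) -> str:
        """Render the escalation list as plain text, one site per line."""
        lines = ["Sites for escalation:"]
        for entry in self.sites_for_escalation:
            lines.append(
                f"  {entry.roc} | {entry.site} | "
                f"GGUS #{entry.ggus_ticket} | {entry.reason}"
            )
        return "\n".join(lines)
```

Keeping the two parts in one object means the lead team only has one record to update after the weekly OPS meeting.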

Page 4

Best practices – Details (1)

• Operational use cases
  – If a COD shifter detects an operational use case, it is recommended to create an entry for it in the tWiki use cases page: https://twiki.cern.ch/twiki/bin/view/EGEE/OperationalUseCasesAndStatus
  – The COD lead team is supposed to verify and update the tWiki page during its shift and to raise it at the WLCG Ops meeting.
• Handover
  – For tickets in the last escalation step, put the following in the handover: ROC, site name, GGUS ticket number, and a short summary of the reason why the site is asked for suspension
  – Follow up the tickets which are in the last escalation step, “Case transferred to political instances”:
      • when the ROC said it has to discuss with its site, the ROC should give an answer during the week following the WLCG Ops meeting
      • verify that the site is really suspended by the ROC after the WLCG Ops meeting if the decision is suspension
  – If there is still no action (no answer and/or no suspension), put the ticket again in the next handover
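
The carry-over rule for tickets in the last escalation step can be sketched as a simple filter. The record layout and the function names below are hypothetical; only the rule itself (no answer and/or no suspension means the ticket goes into the next handover) comes from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LastStepTicket:
    """A ticket in the last escalation step (hypothetical record layout)."""
    roc: str
    site: str
    ggus_ticket: int
    reason: str
    roc_answered: bool = False            # ROC answered after the Ops meeting
    decision_is_suspension: bool = False  # the answer was "suspend the site"
    site_suspended: bool = False          # suspension was actually verified


def needs_next_handover(ticket: LastStepTicket) -> bool:
    """True if the ticket must be listed again in the next handover."""
    if not ticket.roc_answered:
        return True  # no answer from the ROC
    if ticket.decision_is_suspension and not ticket.site_suspended:
        return True  # suspension decided but not carried out
    return False


def carry_over(tickets: List[LastStepTicket]) -> List[LastStepTicket]:
    """Keep only the tickets that still need follow-up."""
    return [t for t in tickets if needs_next_handover(t)]
```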

Page 5

Point to discuss at COD meeting and to raise at ROC managers meeting

• Last escalation step/Site suspension follow-up (use case #9 on twiki page)
  – Context: Follow-up of the last escalation step by OCC and ROC is not done correctly. When the last step is reached, as stated in the Operational Manual, the ROC should normally discuss in private with its site, and then say at the next Weekly Operation meeting whether the site should be suspended or not. Most of the time, at the Weekly Operation meeting, the ROC says that it has to discuss, and then there is no more news. The site stays in the last escalation step for several weeks.

➔ COD work seems not to be considered. At the WLCG Ops meeting on Nov. 17th, regarding suspension of the ITPA-LCG2 site, ROC North said it had seen neither CODs' mails nor CODs' tickets because it receives too many mails... This is not acceptable!

Page 6

Point to discuss at COD meeting and to raise at ROC managers meeting

• Some examples of "long" last step:
  – GGUS #40521: RU-Phys-SPbSU (1 month and a half)
      • 25/09/2008: last escalation step
      • 06/10/2008: raised at WLCG Ops meeting
      • 06/11/2008: still in last step and not suspended
      • 06/11/2008: Cyril L'Orphelin (COD-FR) sent mail to Maite, Steve and Nick
      • 06/11/2008: Maite sent mail to Russian ROC
      • 06/11/2008: site suspended by Russian ROC
  – GGUS #42015: ITPA-LCG2 (4 weeks)
      • 24/10/2008: last escalation step
      • 27/10/2008: raised at WLCG Ops meeting
      • 03/11/2008: raised again at WLCG Ops meeting
      • 07/11/2008: still in last step and not suspended
      • 10/11/2008: raised again at WLCG Ops meeting
      • 17/11/2008: still in last step and not suspended. ROC suspended. ROC North is present at WLCG Ops meeting and will check with site.
      • 18/11/2008: finally fixed by site
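
As a quick check of the durations quoted for these two tickets, the time spent in the last step can be computed from the first and last dates of each timeline. This is a minimal sketch; the helper name `days_between` is introduced here for illustration only.

```python
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # dates on the slide are written day/month/year


def days_between(start: str, end: str) -> int:
    """Number of calendar days from `start` to `end` (both dd/mm/yyyy)."""
    first = datetime.strptime(start, DATE_FORMAT).date()
    last = datetime.strptime(end, DATE_FORMAT).date()
    return (last - first).days


# GGUS #40521, RU-Phys-SPbSU: last escalation step until suspension by the ROC
print(days_between("25/09/2008", "06/11/2008"))  # 42
# GGUS #42015, ITPA-LCG2: last escalation step until fixed by the site
print(days_between("24/10/2008", "18/11/2008"))  # 25
```

42 days is six weeks, which matches "1 month and a half"; 25 days is about three and a half weeks, so the "4 weeks" on the slide is a rounded figure.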

Page 7

Point to discuss at COD meeting and to raise at ROC managers meeting

• In Operational Manual: "If no progress is made, COD make sure that OMC is informed of the situation, and the site status is set to “suspended” in GOCDB by COD unless OMC say differently."
• Proposed solution: As COD has rights to suspend a site
  – if the ROC is not present at the Weekly Operation meeting or has not sent a mail about that problem, COD suspends the site
  – if the ROC is present and asks for discussion with its site, OCC should put an action on the ROC in the list of actions of the Weekly Operation meeting so it will be followed up at the next meeting. Answer or suspension by the ROC should be done within the next 3 days: as acknowledgement, a mail should be sent to both OCC and COD mailing lists. If not, the site is suspended by COD after these 3 days.

➔ We should agree on a solution to propose to ROC managers (the one above or another).

Then it should be discussed at the ROC managers meeting here in Abingdon, to be later integrated in the Ops manual.
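
The proposed solution above can be written down as a small decision function, which may help when discussing edge cases. This is a sketch under stated assumptions: the first rule is read as "the ROC gave no input at all" (neither present nor a mail), the 3 days are counted as calendar days from the meeting date, and all names (`Action`, `cod_decision`, the parameters) are hypothetical.

```python
from datetime import date, timedelta
from enum import Enum

ANSWER_DEADLINE = timedelta(days=3)  # assumption: calendar days after the meeting


class Action(Enum):
    SUSPEND = "COD suspends the site"
    WAIT = "wait for ROC answer or suspension"
    NONE = "no COD action needed"


def cod_decision(roc_present: bool, roc_sent_mail: bool,
                 meeting_day: date, today: date,
                 roc_acknowledged: bool) -> Action:
    """Decide what COD does for a site in the last escalation step."""
    # No input from the ROC at all: COD suspends the site.
    if not roc_present and not roc_sent_mail:
        return Action.SUSPEND
    # ROC answered or suspended, and mailed the OCC and COD lists.
    if roc_acknowledged:
        return Action.NONE
    # ROC asked for discussion: it has 3 days, then COD suspends.
    if today > meeting_day + ANSWER_DEADLINE:
        return Action.SUSPEND
    return Action.WAIT
```

For example, with a meeting on 17/11/2008 and no acknowledgement, this returns `Action.WAIT` up to and including 20/11/2008 and `Action.SUSPEND` from 21/11/2008 onwards.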

Page 8

Best practices

• Alarm handling and masking
  – Shifter duty should not be a competition of who will process the higher number of alarms
      • Before assigning a ticket to an alarm, check if it is related to another alarm with the “Related alarms” table
      • Mask alarm instead of creating 2 tickets for the same problem

[Figure: 2 alarms which can be masked by the current one]
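
The masking rule can be illustrated with a short sketch that opens one ticket per problem and masks the related alarms. The `Alarm` record and the `plan_tickets` function are hypothetical; `related_ids` stands in for what a shifter would read from the “Related alarms” table, whose real format is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Alarm:
    """An alarm as seen by the shifter (hypothetical record layout)."""
    alarm_id: int
    site: str
    test: str
    related_ids: Set[int] = field(default_factory=set)


def plan_tickets(alarms: List[Alarm]) -> Dict[int, List[int]]:
    """Return {alarm id that gets a ticket: [alarm ids masked by it]}."""
    known = {a.alarm_id for a in alarms}
    handled: Set[int] = set()
    plan: Dict[int, List[int]] = {}
    for alarm in alarms:
        if alarm.alarm_id in handled:
            continue  # already masked by an earlier alarm
        masked = [rid for rid in sorted(alarm.related_ids)
                  if rid in known and rid not in handled]
        plan[alarm.alarm_id] = masked
        handled.add(alarm.alarm_id)
        handled.update(masked)
    return plan
```

With four alarms where alarm 1 lists alarms 2 and 3 as related, the plan contains two tickets (for alarms 1 and 4) instead of four.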

Page 9

Point to discuss at COD meeting

• Training for new COD members
  – Every new COD member should be trained on the COD tasks by their own COD team or by another team
  – We need to define how to do it.
      • Probably on-demand training
      • Training materials available from the COD dashboard?