Planning for LCG Emergencies HEPiX, Fall 2005 SLAC, 13 October 2005 David Kelsey CCLRC/RAL, UK [email protected]


Planning for LCG Emergencies
HEPiX, Fall 2005

SLAC, 13 October 2005

David Kelsey
CCLRC/RAL, UK

[email protected]


LHC Tier 0/1/2 Network Architecture

[Figure: network diagram. Labels: T0 at CERN; Tier 1 centres IN2P3, GridKa, TRIUMF, ASCC, Fermilab, Brookhaven, Nordic, CNAF, SARA, PIC and RAL; multiple T2 sites.]

• General Purpose IP Research Networks: NRENs, GEANT2, LHCNet, ESnet, Abilene, dedicated links, etc.
• Special Purpose Optical Private Network: GEANT2 + NREN 10 Gbit circuits and LHCNet dedicated 10 Gbit links to the US


Background
• Computing and Networking are essential
  – Tier 0 (CERN) and 12 Tier 1 critical for data taking
    • 10 Gbps Optical Private link to each T1
  – The T1s collectively keep a second copy of the raw data
  – The T1s play a vital role in (re)processing and providing access to derived data
  – During data taking, can cope with Tier 0 - Tier 1 link down for 12 hours to < few days. All T1s down: very bad!
  – LCG MoU requires avg T1 uptime during data taking: 99%
• LCG TDR says
  – “Special attention needs to be paid to the security aspects of the Tier-0, the Tier-1s and their network connections to maintain these essential services during or after an incident so as to reduce the effect on LHC data taking.”
• LCG also essential for analysis
• Need to keep the Grid running at all times
  – Therefore must deal quickly with incidents
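
As a rough illustration of what a 99% average uptime target implies, the sketch below converts an availability figure into an allowed-downtime budget. The function name and the 30-day window are assumptions made for illustration; the slides do not state the period over which the average is taken.

```python
def allowed_downtime_hours(availability: float, period_days: float) -> float:
    """Hours of downtime permitted over a period at the given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_days * 24.0


# Assumed 30-day window (hypothetical; not taken from the slides).
budget = allowed_downtime_hours(0.99, 30)
print(f"{budget:.1f} hours")  # 7.2 hours
```

The budget scales linearly with the window, so a longer data-taking period at the same target allows proportionally more total downtime.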


Security Incident Response
• Joint (LCG/EGEE) Security Policy Group & EGEE Operational Security Coordination Team
  – Based Security Incident Response Policy and procedures on work of Open Science Grid
• Agreement on Incident Response
  See https://edms.cern.ch/document/428035/
• Sites must
  – Take local action to prevent disruption
  – Report to local security officers
  – Report to others via Grid Incident Response mail list
• “Volunteer” incident response team created when needed


Incident classification
• High: (team leader required)
  – The incident could lead to exploitation of the trust fabric, i.e. user and host identities, or the incident could lead to instability of the overall Grid, or a denial-of-service is in progress against all replicas of a given Grid service.
• Medium: (team leader required if widespread)
  – The incident affects an instance of a Grid service, but Grid stability is not at risk, or a denial-of-service affects one replica of a given Grid service, or a local attack compromised a privileged user account.
• Low: (team leader probably not required)
  – A local attack compromised individual user, non-privileged credentials, or a denial-of-service attack or compromise affects only local grid resources.
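
The three levels above can be read as a small decision procedure. The following is a minimal sketch of that reading, not the policy's own implementation: every class, field and function name is hypothetical, and it simplifies the Medium rule by treating a denial-of-service against any subset of replicas (not only exactly one) as Medium.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    HIGH = "high"      # team leader required
    MEDIUM = "medium"  # team leader required if widespread
    LOW = "low"        # team leader probably not required


@dataclass
class Incident:
    # All field names are hypothetical labels for the criteria listed above.
    trust_fabric_at_risk: bool = False   # user or host identities exploitable
    grid_stability_at_risk: bool = False
    dos_replicas_affected: int = 0       # replicas of one service under DoS
    dos_replicas_total: int = 0
    service_instance_affected: bool = False
    privileged_account_compromised: bool = False


def classify(incident: Incident) -> Severity:
    """Map an incident description to High, Medium or Low."""
    all_replicas_under_dos = (
        incident.dos_replicas_total > 0
        and incident.dos_replicas_affected >= incident.dos_replicas_total
    )
    if (incident.trust_fabric_at_risk
            or incident.grid_stability_at_risk
            or all_replicas_under_dos):
        return Severity.HIGH
    if (incident.service_instance_affected
            or incident.dos_replicas_affected > 0
            or incident.privileged_account_compromised):
        return Severity.MEDIUM
    # Anything else: non-privileged credentials or purely local impact.
    return Severity.LOW
```

For example, `classify(Incident(dos_replicas_affected=1, dos_replicas_total=3))` returns `Severity.MEDIUM`, while the same incident with all three replicas affected returns `Severity.HIGH`.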


Emergency procedures
• JSPG discussed this at last meeting (Sep 2005)
• Started from point of view of Security incidents
  – But quickly realised that other disasters are also likely, so should deal with these too
• Very early overview of the issues at this point
  – Certainly no plan yet
  – Invite feedback from HEPiX
• There must be lots of site-based plans
• JSPG will produce a draft emergency plan (and address policy issues)
  – Grid Operations and OSCT will need to define the details


JSPG discussion topics
• What is the scope?
  – LCG vs EGEE?
  – Critical: Tier 0/1, data taking, data integrity
• Inter-site information flow
  – This is the critical point to be tackled
  – Users, Sys Admins and Managers
• External information
  – including interface(s) to the Press
• How do we keep the infrastructure operational?
  – Is this the aim?
• What do we take down?
  – And who decides?
• Can optical private networks remain up?
  – And are they sufficient for LCG data taking?
• How do we deal with Tier 2 problems?


LCG/EGEE Emergency Procedures

Denise Heagerty
CERN


When are emergency procedures required?

Emergency procedures are required to cover the following cases:
• Incident response plans cannot be followed
  – Critical parts of the infrastructure are unavailable (e.g. mailing lists)
• Incident response plans are inappropriate
  – E.g. need to rapidly inform large parts of the community beyond the security contacts, or incident communication channels are compromised

Examples
• Major power cut at Site A lasted several days
• Cable cut network access to Site B
• Major worm disrupted network access at Site C
• Security incident blocks user access to accounts at Site D
• Wide area exploit of the (homogeneous) security fabric
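
The two cases above amount to a simple trigger condition. The sketch below is one possible encoding, assuming three boolean inputs whose names are invented here for illustration; it is not taken from any LCG or EGEE procedure.

```python
def needs_emergency_procedures(
    response_infrastructure_available: bool,
    channels_compromised: bool,
    must_inform_wider_community: bool,
) -> bool:
    """Return True when normal incident response is not sufficient."""
    # Case 1: the plans cannot be followed (e.g. mailing lists are down).
    plans_cannot_be_followed = not response_infrastructure_available
    # Case 2: the plans are inappropriate for the situation.
    plans_inappropriate = channels_compromised or must_inform_wider_community
    return plans_cannot_be_followed or plans_inappropriate
```

With the mailing lists unavailable, `needs_emergency_procedures(False, False, False)` returns `True`; when everything works and only the security contacts need to be told, it returns `False`.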


What is needed in an emergency?
• Out of band communication channels
  – Alternative service providers (Internet, telephony)
  – Alternative contact details (e-mail, chat, …)
  – Alternative technology
• Clear decision-making roles
  – There is no time for consensus during a crisis
  – Usual decision making process needs to be bypassed
• Clear information flow and roles
  – For at least management, users, the press
  – Reduce the risk of mis-communication
• Disaster Recovery Plan
  – Definition of critical infrastructure to be kept running or repaired quickly
  – Dependencies and sequence must be clear for restoring services
  – Mailing lists (at CERN) are key to restoring communication
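
One way to make the restore sequence explicit is to record service dependencies and derive the order from them. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The dependency map is entirely hypothetical: only the mailing lists are named on the slide, and the other services are placeholders.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up first.
# Only the mailing lists are named on the slide; the rest are placeholders.
dependencies = {
    "network": set(),
    "dns": {"network"},
    "mailing_lists": {"network", "dns"},
    "grid_services": {"network", "dns", "mailing_lists"},
}

# static_order() yields every service after all of its prerequisites.
restore_order = list(TopologicalSorter(dependencies).static_order())
print(restore_order)
```

If the map contained a circular dependency, `static_order()` would raise `graphlib.CycleError`, which is itself a useful check that the recovery plan is well defined.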


Some ideas to stimulate discussion
• Define an emergency advisory committee?
  – Members, mandate
  – Goal is to ensure rapid and appropriate decisions
• Assure information flow
  – E.g. update DNS servers to point to temporary (web) servers
  – Pre-record messages on telephone help services
• Prepare alternative communication channels
  – E.g. commercial conference call facilities
  – Alternative Internet providers (e-mail addresses, chat, phone, …)
• When/do we return to normal Incident Response?


Final words
• LCG needs a written plan
• Clear definition of roles
• Operations staff need to know what to do
  – Training
• The sites need to agree to policy and procedures
  – Recognise the powers of operations staff
• Sites already have their own internal plans
  – Now trying to extend to the Grid
• Feedback and advice are welcome!