ATLAS Central Services and Computing Security Flavia Donno CERN/IT-ES-VOS 06/20/22 1 Flavia Donno, Tier1 Service Coordination

Post on 29-Dec-2015

ATLAS Central Services and Computing Security

Flavia Donno, CERN/IT-ES-VOS

04/19/23 1Flavia Donno, Tier1 Service Coordination

Outline

• Security in ATLAS: can this model be exported to other LHC experiments?
  – Why we do it
  – How we do it: policies and plan
  – Handling a security incident in ATLAS
• The ATLAS Central Services Operations Team: can this evolve into a general service for LHC experiments?
  – Who we are and what we do
  – Some details: service inventory and the web redirector
  – Better interface and documentation to CERN/IT available tools

ATLAS Security: The Goal

• Minimize ATLAS central services unavailability during data taking, mitigating vulnerability and improving operation and management of ATLAS services. Preserve ATLAS reputation worldwide.
  – The focus is on security
  – Security cannot be enforced without good service management practices (Policies)
  – Strategy: increase robustness and availability of critical services while containing vulnerabilities of less critical ones (Plan)
• Other sites supporting ATLAS can follow the same policies where applicable

ATLAS Security: references and documentation

• Various CERN IT and wLCG/EGEE documents define policies and good practices:
  – VOBox Service Level Agreement (CERN-IT/FIO-FS)
    • https://twiki.cern.ch/twiki/pub/FIOgroup/FsSLA/sla-v1.2.1.pdf
  – VOBox Security Recommendations and Questionnaire (wLCG Joint Security Policy Group)
    • https://edms.cern.ch/file/639856/0.6/VO-Box-security-policy-0-6.pdf
  – GRID system administrators best practice and guidelines (EGEE security group)
    • http://rss-grid-security.cern.ch/glite.php?display=1

SECURITY POLICIES AND PLAN FOR CERN ATLAS CENTRAL SERVICES

https://twiki.cern.ch/twiki/pub/Atlas/ATLASInternalSecurityPlan/ATLAS_Security_PlanV20.pdf

–Special thanks to Sebastian Lopienski, Romain Wartel and Stefan Lueders

Key figures and roles

The policies

P1 (AM and SM): Requests for new hardware must be made through the ATLAS VOC (VO Contact): [email protected] or https://savannah.cern.ch/support/?group=atlascsops
Requests must be justified, and details about the services the machine will run must be provided.

P2 (AM and SM): AM and SM are responsible for requesting new hardware to replace machines nearing warranty expiration. They are also responsible for requesting hardware to guarantee high availability of their service if needed (DNS load balancing, hot spares, etc.).

Sensors are provided to warn SM of warranty expirations.
Changes cannot be made to existing hardware configurations.

The policies

P3 (AM and SM): They must respond in a timely manner (within 1 working day) to inquiries coming from the VOC. AM and SM are also responsible for security issues with their service. For security inquiries, the response must come within 1 hour in case of severe incidents, and within a few hours otherwise. The subject of the e-mail is of the form “[Important|Severe]: SI#”, where # is an internal sequence number. If AM and SM are unresponsive, the call will be escalated to the ADC coordinator and the ATLAS computing coordinator.

P4 (VOC): The VOC must keep an updated list of all ATLAS SM and AM contacts.

P5 (AM): The AM must provide the VOC with installation and configuration details about the services, the need for data backup, connectivity details, procedures for draining a service, etc. Details must be given according to the ATLAS Service Documentation Template (*).

The Service Documentation Card

If you edit your service Twiki page, you are presented with the edit window. At the bottom you find the button “Add Form”.

(*) https://twiki.cern.ch/twiki/bin/view/Atlas/ATLASServiceDocumentationTemplate
No need to copy and paste. Just follow the instructions to add a form in your twiki.

The policies

P6 (VOC and AM): All ATLAS central services at CERN must be “quattorized” (**)

P7 (SM and AM): SM and AM should provide lemon sensors and procedures that allow for monitoring, raising alarms and directing the operators for their services. (***)

(**) https://twiki.cern.ch/twiki/bin/view/Atlas/CentralServicesManagementPoliciesAndProcedures
Procedures and good practices are listed in the twiki. It is ongoing work.
• Distribution through rpms (ATLAS software repository)
• SLC5 migration
• ATLAS secure external configuration repository (based on SINDES)
• Protecting DB passwords
• Good practices (aliases, http vs https and CERN SSO, svn checkout, …)

(***) The VOBox service is provided as a business hours service on CERN working days. For 24/7 coverage, procedures can be provided for the operators (OP).
– The VOC or AM must react to monitoring alarms escalated to them.
– The VOC or AM should be careful not to take actions that could raise alarms.

The policies

P8 (VOC, SM, AM): They must attend security courses and stay informed about security threats to the services they provide and manage. They are responsible for ensuring that their software does not pose security threats, that access to databases is secure and sufficiently monitored, that stored data are compliant with legal requirements, and that VO services, including pilot job frameworks, are operated according to the applicable policy documents. It is the responsibility of the ATLAS VOC to publish available security courses within the ATLAS collaboration. Operators/shifters must be trained to detect security anomalies.

System-level security compliance is monitored by the CERN IT security team.
SM/AM must proactively check for security-related updates.
Housekeeping of the machines is the responsibility of the VOC and AM (log rotation, tmp space management, etc.).

The policies

P9 (ATLAS, AM, SM): The management agrees that people working on computing will spend 5-10% of their time on security (1-2 days per month).

P10 (All): Security incidents must be reported to the ATLAS VOC and to CS: [email protected] or phone 70500, in accordance with [a] and [b]. Depending on the severity of the incident, services can be taken out of service either in agreement with AM/SM or, in case of unresponsiveness, independently. Services can be restored on hot spares if available.

[a] Report a Computer Security Incident: http://lcg.web.cern.ch/LCG/incident.htm
[b] LCG Agreement on Incident Response: https://edms.cern.ch/document/428035
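
The 5-10% figure in P9 corresponds to 1-2 days per month if one assumes about 20 working days in a month. That month length is an assumption made here for illustration and is not stated in the slides. A minimal sketch of the arithmetic:

```python
def security_days(working_days: int, percent: int) -> float:
    """Days per month spent on security for a given percentage of working time."""
    return working_days * percent / 100


# Assumption: roughly 20 working days per month (not a figure from the slides).
WORKING_DAYS_PER_MONTH = 20

low = security_days(WORKING_DAYS_PER_MONTH, 5)
high = security_days(WORKING_DAYS_PER_MONTH, 10)
print(f"{low:g}-{high:g} days per month")
```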

More on Security in ATLAS

– The AM must make available to the VOC and VO operators the list of possible threats/risks and the actions taken to mitigate them (software updates, security patches, etc.). In particular, services should depend as little as possible on specific package versions, allowing for automatic updates.
– The AM is responsible for training ATLAS operators to identify anomalies with the services.
– The SM is responsible for monitoring the service for correct behavior (load, performance, available free space, etc.); deviations might reveal an intrusion.
– The VOC will regularly review machine configurations to identify possible security threats (limited interactive or root access, interactive access [or sudo privileges!!!] to generic accounts, open doors, access control lists, un-needed services, world-writable files/directories [used in init.d], etc.). We work with FIO to make this process semi-automatic.
– The VOC might run security challenges vs. specific services if needed.
– VOC, AM and SM are responsible for following recommended procedures:
  • http://rss-grid-security.cern.ch/glite.php
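
One of the configuration checks listed above, looking for world-writable files and directories, lends itself to automation. The following is a minimal sketch of such a check using only the Python standard library. It is not the tool used by the VOC or FIO; the function name `find_world_writable` and the starting directory are hypothetical choices made for illustration.

```python
import os
import stat
from typing import Iterator


def find_world_writable(root: str) -> Iterator[str]:
    """Yield paths under root whose mode grants write permission to 'others'."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # entry vanished or cannot be inspected; skip it
            if stat.S_ISLNK(mode):
                continue  # permission bits on symlinks are not meaningful
            if mode & stat.S_IWOTH:
                yield path


if __name__ == "__main__":
    # Hypothetical starting point; the slides mention init.d only as an example.
    for hit in find_world_writable("/etc/init.d"):
        print(hit)
```

A real review would also need to treat legitimately world-writable locations (for example sticky-bit temporary directories) separately, which this sketch does not attempt.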

The plan

The plan

…such tasks will be repeated every three months.

Handling a security incident in ATLAS

• Report all suspected security incidents to the ATLAS VOC: atlas-adc-central-[email protected] and to [email protected] or phone 70500. Please use caution and discretion.
• The ATLAS VOC will coordinate the overall incident response process in ATLAS: in particular, the VOC will contact the AM/SM for the specific service. The AM/SM will be contacted both via phone (whenever provided) and e-mail. The subject of the e-mail is of the form “[Important|Severe]: SI#”, where # is an internal sequence number. In case of a severe incident we expect AM/SM to respond within 1 hour.
• The service will be removed from the public network (restricting it to local access if possible), weighing this against the impact on the experiment and the organization. Depending on the severity of the incident, services can be shut down either in agreement with AM/SM or, in case of unresponsiveness, independently (contacting ATLAS management and informing user communities).
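
The e-mail subject convention quoted above can be generated and recognized mechanically, for example to filter a mailbox. The sketch below assumes that the square brackets in “[Important|Severe]: SI#” denote a choice between the two words rather than literal brackets, and that # is a plain decimal number. Both readings are assumptions, and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed reading of the subject form "[Important|Severe]: SI#":
# one of the two severity words, a colon, then "SI" and a sequence number.
SUBJECT_RE = re.compile(r"^(Important|Severe): SI(\d+)\b")


def make_subject(severity: str, number: int) -> str:
    """Build a subject line such as 'Severe: SI12'."""
    if severity not in ("Important", "Severe"):
        raise ValueError(f"unknown severity: {severity!r}")
    return f"{severity}: SI{number}"


def parse_subject(subject: str) -> Optional[Tuple[str, int]]:
    """Return (severity, sequence number), or None if the subject does not match."""
    match = SUBJECT_RE.match(subject.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(make_subject("Severe", 12))
print(parse_subject("Important: SI7 please check the service"))
```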

Handling a security incident in ATLAS

• As in the case of normal maintenance interventions, if specific procedures are provided by the AM/SM concerning the unavailability of the service, the ATLAS VOC will follow them (e.g. informing user communities, advertising downtime, redirecting dependent services, operation procedures, etc.).
• In agreement with the AM/SM, and if the nature of the incident allows it, the service could be partially restored on hot spares, if previously arranged.
• The original machines on which the compromised service was running will be made available to the CERN Security Team for analysis. However, it is the responsibility of the AM and SM to investigate the incident with the advice and guidance of the CERN Security Team.

Handling a security incident in ATLAS

• Once the service vulnerability is mitigated by the FM/AM/SM, the ATLAS VOC will coordinate with the CS for a service examination/scan to make sure the mitigation is effective.
• Once the vulnerability problem is solved, the service is restored.

• Started with B. Koblitz, who left in July 2009
• The team:
  – Flavia Donno (CERN-IT/GS), coordination
  – Serguei Baranov (JINR, Russia)
  – Sergey Makarychev (ITEP, Russia), here until the end of December 2009
  – Alexey Buzykaev (BINP, Russia), left on November 18th, 2009

The ATLAS Central Services Team

• The mandate:
  – ATLAS contact for CERN/IT (Alarm Tickets, security, etc.)
  – Management of ATLAS Central Services according to the SLA between CERN-IT and ATLAS (hardware requests, quattor, best practices, etc.)
  – Providing assistance with machine, software and service management to ATLAS service developers and providers (distribution practices, software versions, rebooting, etc.)
  – Provision of general frameworks, tools and practices to guarantee availability and reliability of services (squid/frontier distribution, web redirector, hot spares, sensors, better documentation, etc.)
  – Collection of information and operational statistics (service inventory, machine usage, etc.)
  – Enforcing the application of security policies
  – Training of newcomers in the team
  – Spreading knowledge among service developers and providers about the tools available within CERN/IT (Shibboleth2, SLS, etc.)
  – Participating in the work of the ADC Operations Team
  – And more …

The ATLAS Central Services Team

• Communication channels:
  – Support requests are submitted here: https://savannah.cern.ch/support/?group=atlascsops
  – You can contact us via e-mail: [email protected]
• Documentation:
  – Useful twiki pages:
    https://twiki.cern.ch/twiki/bin/view/Atlas/CentralServicesManagementPoliciesAndProcedures
    https://twiki.cern.ch/twiki/bin/view/Atlas/ADCMachinesHowto
  – Public Machine and Service Inventory:
    https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedComputingMachines
  – Security Document:
    https://twiki.cern.ch/twiki/pub/Atlas/ATLASInternalSecurityPlan/ATLAS_Security_Planv20.pdf

Service Inventory

https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedComputingMachines

Service Inventory

• We manage a total of 184 machines [virtual/development/production]
• Production
  • 13 machines for the Run Time Tester Service
  • 64 machines for the ATLAS Build services
  • 53 machines used for production services (DQM, Site Services, Central Catalogues, T0, etc.)
• Development
  • 23 Development Machines
  • 20 Virtual Machines (on 3 physical ones)
• Spares
  • 5 spares
  • 6 unknown
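
As a quick consistency check, the per-category counts listed above do add up to the stated total of 184 machines. The short sketch below just repeats that sum; the dictionary keys are shortened labels chosen here, not names from the inventory itself.

```python
# Machine counts as listed on the slide; keys are shortened labels.
inventory = {
    "Run Time Tester Service": 13,
    "ATLAS Build services": 64,
    "Production services": 53,
    "Development machines": 23,
    "Virtual machines": 20,
    "Spares": 5,
    "Unknown": 6,
}

total = sum(inventory.values())
print(total)  # 184
```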

Availability in ATLAS

– Normal operations are complemented by the following actions to ensure availability of services:
– Every ~6 months the ATLAS VOC will exercise rebooting and moving of some critical, load-balanced services on hot spares in order to be prepared for emergencies.
– In coordination with the Security Team, the ATLAS VOC performs regular checks on the existing services for reported vulnerabilities.
– The ATLAS VOC makes web framework services (such as the ATLAS web redirector) available to AM/SM. Such frameworks ensure a secure and CERN-supported environment in which to run web services (Shibboleth, CERN approved software packages, safe services, etc.). In case the web redirector cannot be used, recommendations will be given to avoid common problems.
– The ATLAS VOC advises and provides support on specific software packages and their versions to be used on specific platforms (Python 2.6, emacs, mod_python, mod_wsgi, Django, etc.).
– The ATLAS VOC provides sensors for the most common alarm needs.

The web redirector

Administration account   URL                                  Server account
atmbadm                  https://atlas-minimumbias.cern.ch    atmbsrv
attrcadm                 https://atlas-trigconf.cern.ch       attrcsrv
atrqadm                  https://atlas-runquery.cern.ch       atrqsrv

Conclusions

• Security is important both for obtaining what we are working for (physics results) and for the good reputation of ATLAS and all LHC experiments
  – Please help us enforce it!
  – The ATLAS security model is general and can be applied to other LHC experiments
• Central Services Operations Teams have an important role in the smooth running of experiment services
  – The goal is to improve service availability, reliability and management
  – We depend on CERN/IT, but a lot remains to be done
  – Can the ATLAS model be expanded in order to offer a common service to all LHC experiments?

… comments/suggestions/corrections ?

Please, send e-mail to [email protected]