SSC2 and Update on Multi-user Pilot Jobs Framework
Mingchao Ma, STFC – RAL
HEPSysMan Meeting
20/06/2008
Slide 3
SSC - What is it?
“The goal of the LCG/EGEE Security Service Challenge is to investigate whether sufficient information is available to be able to conduct an audit trace as part of an incident response, and to ensure that appropriate communications channels are available.”
Like a fire drill!
Slide 4
SSC – Why and How?
• To check whether the communication channels among the parties involved (sites, VOs, security contacts, etc.) are functioning;
• An exercise for system admins in tracing users' activities and getting to know the various logfiles;
• Not intrusive – only ‘legal’ operations;
• No penetration and no execution of exploits;
• Conducted and monitored by OSCT and ROC security officers;
  – CERN challenges ALL Tier1 sites;
  – ROC security officer challenges Tier2 sites within that ROC
Slide 5
Security Service Challenge
• SSC 1: challenges the Workload Management System (WMS) on the Grid: Resource Broker (RB) and Compute Element (CE) (2005)
• SSC 2: challenges the Storage Elements on the Grid (2007/2008)
• SSC 3: challenges the Operational Diligence of the LCG/EGEE Grid Sites (ongoing)
https://twiki.cern.ch/twiki/bin/view/LCG/LCGSecurityChallenge
Slide 6
SSCs - UKI ROC
• Security Service Challenge 2
  – 22 Tier2 sites (SEs) in the UKI ROC were challenged by the ROC security officer
• Security Service Challenge 3
  – RAL Tier1 was challenged by CERN on 06 March 2008
http://www.gridpp.ac.uk/security/ssc/
https://www.gridpp.ac.uk/security/ssc/ssc2/index.html
http://grid-deployment.web.cern.ch/grid-deployment/ssc/SSC_2/SSC_2_google.html
Slide 7
Security Service Challenge 2
• Timeline
  – From 21 January 2008 to 10 March 2008
  – In total 22 sites (SEs) challenged
  – Job submission: from 21 Jan. to 28 Jan.
  – 4 weeks (Feb. 2008) cool-down period
  – GGUS ticket opened: 03 March 2008
  – Challenge completed: 5pm 10 March 2008
Slide 8
Security Service Challenge 2
• Basic Statistics
  – 22 SEs/Sites challenged, of which:
    • One site failed to run the challenge job;
    • One site opted out of the challenge due to a site rebuild;
    • One site is no longer part of the EGEE Grid;
    • Initial response received from the 21 sites;
    • 18 sites acknowledged the initial alert ticket within 24 hours;
    • 2 sites acknowledged the ticket within 48 hours;
    • 1 site acknowledged the ticket within 72 hours;
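The 24/48/72-hour figures are a bucketing of the delay between the ticket being opened and each site's first acknowledgement. A minimal sketch of that bucketing follows. The site names, the acknowledgement timestamps and the 09:00 opening time are all made up for illustration; only the opening date (03 March 2008) comes from the timeline.

```python
from datetime import datetime, timedelta

# Opening date is from the challenge timeline; the time of day is an assumption.
TICKET_OPENED = datetime(2008, 3, 3, 9, 0)


def ack_bucket(acknowledged, opened=TICKET_OPENED):
    """Return the smallest window (24, 48 or 72 hours) that contains the
    acknowledgement, or None if it arrived later than 72 hours."""
    delay = acknowledged - opened
    for hours in (24, 48, 72):
        if delay <= timedelta(hours=hours):
            return hours
    return None


# Hypothetical acknowledgement times for three made-up sites.
acks = {
    "site-a": datetime(2008, 3, 3, 15, 30),
    "site-b": datetime(2008, 3, 4, 17, 0),
    "site-c": datetime(2008, 3, 6, 8, 0),
}

buckets = {site: ack_bucket(when) for site, when in acks.items()}
```

Counting how many sites fall into each bucket then gives a summary of the same shape as the one on this slide.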
Slide 10
Security Service Challenge 2
• Preliminary Analysis
  – All sites that responded (18) found some traces of the job activities and identified at least one SE operation
  – Communication channel seems to work well;
    • Most sites acknowledged the ticket within 24 hours
    • 1 site acknowledged within 72 hours; a new staff member there had no support role in GGUS and was therefore unable to answer the ticket
Slide 11
Security Service Challenge 2
• Issues observed
  – None of the 19 sites was able to identify the Lookup operation
  – Some sites provided only RAW log information (though the correct part of the log) with little or no analysis
  – A few sites had missing logs (log files accidentally deleted due to misconfiguration; log retention of only a month, again due to misconfiguration; or log files lost in a system rebuild, etc.)
  – SE logs (syntax and format) are still too complex; it seems very difficult to fully reconstruct some operations (site configuration? or insufficient log information?); too many logfiles!
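To illustrate the kind of analysis that was missing when sites returned raw logs, here is a minimal sketch of an audit trace: filter log entries by the user's certificate DN and by the challenge time window, then list the operations found. The log format, the DN and the paths are invented for this example and do not correspond to any real Storage Element implementation, whose logs (as noted above) are considerably more complex and spread over many files.

```python
import re
from datetime import datetime

# Hypothetical log format, NOT that of any real Storage Element:
#   <date> <time> dn="<subject DN>" op=<OPERATION> path=<path>
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'dn="(?P<dn>[^"]*)" op=(?P<op>\w+) path=(?P<path>\S+)$'
)


def trace_user(lines, dn, start, end):
    """Return sorted (timestamp, operation, path) tuples for every entry
    made by `dn` between `start` and `end` (inclusive)."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not fit the assumed format
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        if m.group("dn") == dn and start <= ts <= end:
            hits.append((ts, m.group("op"), m.group("path")))
    return sorted(hits)


# Made-up sample entries for illustration only.
sample_log = [
    '2008-01-22 10:15:03 dn="/C=UK/O=Example/CN=challenge user" op=PUT path=/data/ssc/file1',
    '2008-01-22 10:15:09 dn="/C=UK/O=Example/CN=other user" op=GET path=/data/other',
    '2008-01-22 10:16:41 dn="/C=UK/O=Example/CN=challenge user" op=LOOKUP path=/data/ssc/file1',
    'a line that does not match the assumed format',
    '2008-03-01 08:00:00 dn="/C=UK/O=Example/CN=challenge user" op=DELETE path=/data/ssc/file1',
]

trace = trace_user(
    sample_log,
    dn="/C=UK/O=Example/CN=challenge user",
    start=datetime(2008, 1, 21),
    end=datetime(2008, 1, 28, 23, 59, 59),
)
operations = [op for _, op, _ in trace]
```

A reply built from `trace` (who, when, which operation, which file) is closer to what an incident responder needs than an unannotated log excerpt.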
Slide 13
What is multi-user pilot Job?
• A multi-user pilot job, hereafter referred to simply as a pilot job, is a Grid job for which the following holds*:
  – a Grid job is submitted with a set of credentials belonging to either a member of the VO or to a service owned and operated by the VO
  – when this Grid job begins to execute at a Site, it pulls down and executes workload, hereafter called a user job, owned and submitted by a different member of the VO or multiple user jobs owned and submitted by multiple different members of the VO
*Policy on Grid Multi-User Pilot Jobs
https://edms.cern.ch/cedar/plsql/doc.info?cookie=7587020&document_id=855383&version=1
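The definition above can be sketched as a toy model: a pilot is submitted under one identity, then pulls user jobs owned by other VO members from a central queue and runs them. Everything here (`UserJob`, `TaskQueue`, `run_pilot`, the owner names) is hypothetical and is not the design of any experiment's framework. The sketch only records the two identities involved, which is the information a site would need for an audit trace.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class UserJob:
    owner: str    # VO member who submitted the workload
    payload: str  # stand-in for the actual workload


class TaskQueue:
    """Toy stand-in for a VO-operated central task queue."""

    def __init__(self):
        self._jobs = deque()

    def submit(self, job):
        self._jobs.append(job)

    def fetch(self):
        """Return the next user job, or None when the queue is empty."""
        return self._jobs.popleft() if self._jobs else None


def run_pilot(pilot_owner, queue, max_jobs=10):
    """Pull up to `max_jobs` user jobs and record, for each one, the
    identity the pilot was submitted under and the user it ran for."""
    audit = []
    for _ in range(max_jobs):
        job = queue.fetch()
        if job is None:
            break
        # A real framework would switch identity on the worker node here
        # (for example with a tool such as glexec); this sketch only logs it.
        audit.append({
            "submitted_as": pilot_owner,
            "executed_for": job.owner,
            "payload": job.payload,
        })
    return audit


queue = TaskQueue()
queue.submit(UserJob(owner="alice", payload="analysis-1"))
queue.submit(UserJob(owner="bob", payload="simulation-7"))

audit = run_pilot("vo-pilot-service", queue)
```

The point of the model is that the site's batch system sees only `vo-pilot-service`, while the work actually belongs to `alice` and `bob`. Without a record like `audit`, running processes cannot be attributed to the correct user.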
Slide 14
Pilot Jobs Framework
• A VO/Experiment-specific Workload Management System (WMS):
  – CMS glideinWMS: http://indico.cern.ch/materialDisplay.py?sessionId=4&materialId=0&confId=20230
  – LHCb DIRAC WMS: http://indico.cern.ch/materialDisplay.py?sessionId=4&materialId=0&confId=20230
  – ATLAS PanDA: https://twiki.cern.ch/twiki/bin/view/Atlas/PanDA
  – ALICE ???
Slide 15
A Simplified Diagram
[Diagram. Components: End User; Central Job Repository/VO-Specific WMS; VOMS Server, MyProxy Server, Others; Site 1 and Site 2, each with Worker Node(s) running Pilot Job, glexec and User Job. Arrow labels: "Jobs + Proxy"; "Submit Pilot Job + Pilot Proxy"; "Get User Jobs & User Proxy".]
Slide 16
Pilot Job Frameworks Review Workgroup
• GDB working group mandated by WLCG MB on Jan. 22, 2008
• Mission
  – Review security issues in the pilot job framework of each experiment
    • Pilot jobs are taken as multi-user in this context
  – Define a minimum set of security requirements
  – Advise on improvements
    • Per framework or common to all
  – Report to GDB and MB
• Time frame is a few months
• Members
  – ALICE: Predrag Buncic
  – ATLAS: Torre Wenaus
  – CMS: Igor Sfiligoi
  – LHCb: Andrei Tsaregorodtsev
  – WLCG: Maarten Litmaath (chair)
  – EGEE: David Groep
  – FNAL: Eileen Berman
  – GridPP: Mingchao Ma
  – OSG: Mine Altunay
* Content from Maarten Litmaath, GDB, 2008/06/11
Slide 17
Questionnaire
• Describe in a schematic way all components of the system.
  – If a component needs to use IPC to talk to another component for any reason, describe what kind of authentication, authorization, integrity and/or privacy mechanisms are in place. If configurable, specify the typical, minimum and maximum protection you can get.
• Describe how user proxies are handled from the moment a user submits a task to the central task queue to the moment that the user task runs on a WN, through any intermediate storage.
• What happens around the identity change on the WN, e.g. how is each task sandboxed and to what extent?
• How can running processes be accounted to the correct user?
• How is a task spawned on the WN and how is it destroyed?
• How can a site be blocked?
Slide 18
Questionnaire (cont.)
• What site security processes are applied to the machine(s) running the WMS?
  – Who is allowed access to the machine(s) on which the service(s) run, and how do they obtain access?
  – How are authorized individuals authenticated on the machine(s)?
  – What is the process for keeping the service(s) and OS patched and up-to-date, especially with respect to security patches?
  – Do you have an identified security contact?
  – Describe the incident response plan to deal with security incidents and reports of unauthorized use.
  – What services (in general) run on the machine(s) that offer the WMS service?
  – What processes exist to maintain audit logs (e.g. for use during an incident)?
  – What monitoring exists on the machine(s) to aid detection of security incidents or unauthorized use?
• Can you limit the users that can submit jobs to the VO WMS? How?