
SSC2 and Update on Multi-user Pilot Jobs Framework

Mingchao Ma, STFC – RAL

HEPSysMan Meeting

20/06/2008

Security Service Challenge

• What is it?
• How does it work?
• SSC 2 - UKI ROC experience

SSC - What is it?

“The goal of the LCG/EGEE Security Service Challenge, is to investigate whether sufficient information is available to be able to conduct an audit trace as part of an incident response, and to ensure that appropriate communications channels are available”

Like a fire drill!

SSC – Why and How?

• To check whether the communication channels among the involved parties (sites, VOs, security contacts etc.) are functioning;
• An exercise for system admins in tracing users’ activities and getting to know the various log files;
• Not intrusive – only ‘legal’ operations;
• No penetration and no execution of exploits;
• Conducted and monitored by OSCT and ROC security officers;
  – CERN challenges ALL Tier1 sites;
  – The ROC security officer challenges Tier2 sites within that ROC

Security Service Challenge

• SSC 1: challenges the Workload Management System (WMS) on the Grid: Resource Broker (RB) and Compute Element (CE) (2005)

• SSC 2: challenges the Storage Elements on the Grid (2007/2008)

• SSC 3: challenges the Operational Diligence of the LCG/EGEE Grid Sites (ongoing)

https://twiki.cern.ch/twiki/bin/view/LCG/LCGSecurityChallenge

SSCs - UKI ROC

• Security Service Challenge 2
  – 22 Tier2 sites (SEs) in the UKI ROC were challenged by the ROC security officer
• Security Service Challenge 3
  – RAL Tier1 was challenged by CERN on 06 March 2008

http://www.gridpp.ac.uk/security/ssc/
https://www.gridpp.ac.uk/security/ssc/ssc2/index.html
http://grid-deployment.web.cern.ch/grid-deployment/ssc/SSC_2/SSC_2_google.html

Security Service Challenge 2

• Timeline
  – From 21 January 2008 to 10 March 2008
  – In total 22 sites (SEs) challenged
  – Job submission: from 21 Jan. to 28 Jan.
  – 4-week (Feb. 2008) cool-down period
  – GGUS ticket opened: 03 March 2008
  – Challenge completed: 5pm, 10 March 2008


• Basic Statistics
  – 22 SEs/Sites challenged, of which:
    • One site failed to run the challenge job;
    • One site opted out of the challenge due to a site rebuild;
    • One site is no longer part of the EGEE Grid;
    • Initial response received from the 21 sites;
    • 18 sites acknowledged the initial alert ticket within 24 hours;
    • 2 sites acknowledged the ticket within 48 hours;
    • 1 site acknowledged the ticket within 72 hours;
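The acknowledgement figures above can be turned into cumulative response rates with a few lines of arithmetic. This is only an illustrative sketch: the counts are the ones quoted on the slide, while the variable names are made up for the example.

```python
# Acknowledgement counts as quoted on the slide: time limit in hours -> sites.
acknowledged_within = {24: 18, 48: 2, 72: 1}

# Total number of sites that sent an initial response.
total = sum(acknowledged_within.values())

# Running total of sites that had acknowledged by each time limit.
cumulative = {}
running = 0
for hours in sorted(acknowledged_within):
    running += acknowledged_within[hours]
    cumulative[hours] = running
    print(f"within {hours} h: {running}/{total} sites ({running / total:.0%})")
```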

Security Service Challenge 2 - Result


• Preliminary Analysis
  – All sites that responded (18) found some traces of the job activities and identified at least one SE operation
  – Communication channel seems to work well;
    • Most sites acknowledged the ticket within 24 hours
    • 1 site took up to 72 hours because a new staff member had no support role in GGUS and was therefore unable to answer the ticket


• Issues observed
  – None of the 19 sites was able to identify the Lookup operation
  – Some sites provided only RAW log information (though the correct part of the log) with little or no analysis
  – A few sites had missing logs (a log file accidentally deleted due to misconfiguration; log retention of only a month, again due to misconfiguration; or log files lost in a system rebuild etc.)
  – SE logs (syntax and format) are still too complex; it seems very difficult to fully rebuild some operations (site configuration? Or insufficient log information?); too many log files!
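The kind of analysis the challenge asks for (going from raw log lines to a per-user timeline of SE operations) can be sketched as below. Everything here is an assumption made for illustration: the one-line log format, the sample DNs and paths, and the function names are hypothetical and do not correspond to any real SE implementation, whose logs are spread over several files in different formats.

```python
from datetime import datetime

# Hypothetical log format (an assumption, not a real SE log layout):
#   <ISO timestamp> <operation> "<user DN>" <path>
SAMPLE_LOG = '''\
2008-01-22T10:15:02 PUT "/C=UK/O=eScience/CN=challenge user" /data/ssc/file1
2008-01-22T10:16:40 GET "/C=UK/O=eScience/CN=other user" /data/other
2008-01-22T10:17:05 LOOKUP "/C=UK/O=eScience/CN=challenge user" /data/ssc/file1
2008-02-01T09:00:00 DELETE "/C=UK/O=eScience/CN=challenge user" /data/ssc/file1
'''


def parse_line(line):
    """Split one log line into (timestamp, operation, dn, path)."""
    stamp, rest = line.split(" ", 1)
    operation, rest = rest.split(" ", 1)
    # The DN is quoted because it may contain spaces.
    dn, path = rest[1:].split('" ', 1)
    return datetime.fromisoformat(stamp), operation, dn, path


def trace_user(log_text, dn, start, end):
    """Return the time-ordered operations of one DN inside a time window."""
    events = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        stamp, operation, line_dn, path = parse_line(line)
        if line_dn == dn and start <= stamp <= end:
            events.append((stamp, operation, path))
    return sorted(events)


# Example: reconstruct what the (hypothetical) challenge user did
# during the job submission week.
events = trace_user(
    SAMPLE_LOG,
    "/C=UK/O=eScience/CN=challenge user",
    datetime(2008, 1, 21),
    datetime(2008, 1, 28, 23, 59, 59),
)
for stamp, operation, path in events:
    print(stamp.isoformat(), operation, path)
```

A summary like this, rather than the raw log excerpt alone, is what makes a response useful during an incident. It also only works if the logs covering the period still exist, which is why log retention matters.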

Multi-user Pilot Jobs Framework

What is a multi-user pilot job?

• A multi-user pilot job, hereafter referred to simply as a pilot job, is a Grid job for which the following holds*:
  – a Grid job is submitted with a set of credentials belonging to either a member of the VO or to a service owned and operated by the VO
  – when this Grid job begins to execute at a Site, it pulls down and executes workload, hereafter called a user job, owned and submitted by a different member of the VO, or multiple user jobs owned and submitted by multiple different members of the VO

* Policy on Grid Multi-User Pilot Jobs: https://edms.cern.ch/cedar/plsql/doc.info?cookie=7587020&document_id=855383&version=1
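The definition above can be illustrated with a toy model: a pilot runs under one identity, pulls user jobs owned by other VO members from a central queue, and executes each on behalf of its owner. This is a minimal sketch under stated assumptions, not any experiment's framework. `TaskQueue`, `UserJob`, `run_as` and `pilot_job` are hypothetical names, and `run_as` only records the identity a job would run under, where a real site would perform an actual identity switch on the worker node.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class UserJob:
    owner: str    # identity of the VO member who submitted the job
    command: str  # placeholder for the actual workload


class TaskQueue:
    """Stand-in for the VO's central job repository (hypothetical)."""

    def __init__(self):
        self._jobs = deque()

    def submit(self, job):
        self._jobs.append(job)

    def fetch(self):
        """Hand out the next waiting user job, or None if the queue is empty."""
        return self._jobs.popleft() if self._jobs else None


def run_as(owner, command):
    """Stand-in for the identity switch on the worker node.

    No real identity change happens here; the function only reports
    which identity the job would have run under.
    """
    return f"{command} ran as {owner}"


def pilot_job(pilot_identity, queue):
    """Pull user jobs until the queue is empty, keeping an audit record."""
    audit_log = []
    while (job := queue.fetch()) is not None:
        result = run_as(job.owner, job.command)
        # Record both identities so each payload can be traced to its owner.
        audit_log.append((pilot_identity, job.owner, result))
    return audit_log


queue = TaskQueue()
queue.submit(UserJob(owner="alice", command="analysis-1"))
queue.submit(UserJob(owner="bob", command="simulation-7"))

audit_log = pilot_job("vo-pilot-service", queue)
for pilot, owner, result in audit_log:
    print(f"pilot={pilot} owner={owner}: {result}")
```

The point of the audit record is that the site only sees the pilot's credentials at submission time; without a per-payload record of the owner, user activity cannot be traced back to the right person.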

Pilot Jobs Framework

• A VO/Experiment-specific Workload Management System (WMS):
  – CMS glideinWMS
    http://indico.cern.ch/materialDisplay.py?sessionId=4&materialId=0&confId=20230
  – LHCb DIRAC WMS
    http://indico.cern.ch/materialDisplay.py?sessionId=4&materialId=0&confId=20230
  – ATLAS PanDA
    https://twiki.cern.ch/twiki/bin/view/Atlas/PanDA
  – ALICE ???

A Simplified Diagram

[Diagram. Components: End User; Central Job Repository/VO-Specific WMS; VOMS Server, MyProxy Server, Others; Site 1 and Site 2, each with Worker Node(s) running a Pilot Job, Glexec and a User Job. Arrow labels: "Jobs + Proxy"; "Submit Pilot Job + Pilot Proxy"; "Get User Jobs & User Proxy".]

Pilot Job Frameworks Review Workgroup

• GDB working group mandated by WLCG MB on Jan. 22, 2008
• Mission
  – Review security issues in the pilot job framework of each experiment
    • Pilot jobs are taken as multi-user in this context
  – Define a minimum set of security requirements
  – Advise on improvements
    • Per framework or common to all
  – Report to GDB and MB
• Time frame is a few months
• Members
  – ALICE: Predrag Buncic
  – ATLAS: Torre Wenaus
  – CMS: Igor Sfiligoi
  – LHCb: Andrei Tsaregorodtsev
  – WLCG: Maarten Litmaath (chair)
  – EGEE: David Groep
  – FNAL: Eileen Berman
  – GridPP: Mingchao Ma
  – OSG: Mine Altunay

* Content from Maarten Litmaath, GDB, 2008/06/11

Questionnaire

• Describe in a schematic way all components of the system.
  – If a component needs to use IPC to talk to another component for any reason, describe what kind of authentication, authorization, integrity and/or privacy mechanisms are in place. If configurable, specify the typical, minimum and maximum protection you can get.

• Describe how user proxies are handled from the moment a user submits a task to the central task queue to the moment that the user task runs on a WN, through any intermediate storage.

• What happens around the identity change on the WN, e.g. how is each task sandboxed and to what extent?

• How can running processes be accounted to the correct user?

• How is a task spawned on the WN and how is it destroyed?

• How can a site be blocked?

Questionnaire (cont.)

• What site security processes are applied to the machine(s) running the WMS?
  – Who is allowed access to the machine(s) on which the service(s) run, and how do they obtain access?
  – How are authorized individuals authenticated on the machine(s)?
  – What is the process for keeping the service(s) and OS patched and up-to-date, especially with respect to security patches?
  – Do you have an identified security contact?
  – Describe the incident response plan to deal with security incidents and reports of unauthorized use.
  – What services (in general) run on the machine(s) that offer the WMS service?
  – What processes exist to maintain audit logs (e.g. for use during an incident)?
  – What monitoring exists on the machine(s) to aid detection of security incidents or unauthorized use?

• Can you limit the users that can submit jobs to the VO WMS? How?
