TRANSCRIPT
28 April 2003 Lee Lueking, PPDG Review 1
BaBar and DØ Experiment Reports
DOE Review of PPDG
January 28-29, 2003
Lee Lueking
Fermilab Computing Division
D0 liaison to PPDG
BaBar and DØ Introduction
BaBar's PPDG effort concentrates on:
– Data distribution on the Grid (SRB, BdbServer++).
– Job submission on the Grid (EDG, LCG).
People involved:
– Tim Adye (RAL)
– Andy Hanushevsky (SLAC)
– Adil Hasan (SLAC)
– Wilko Kroeger (SLAC)
Interactions with other Grid efforts that are part of BaBar:
– GridPP (UK), EDG (Europe through Dominique Boutigny), GridKA, Italian Grid groups etc.
BaBar Grid applications are being designed to be data-format neutral
– BaBar's new computing model should have little impact on the apps.
DØ's PPDG effort concentrates on:
– Data distribution on the Grid (SAM).
– Job submission on the Grid (JIM with Condor-G and Globus).
People involved:
– Igor Terekhov (FNAL; JIM Team Lead)
– Gabriele Garzoglio (FNAL)
– Andrew Baranovski (FNAL)
– Parag Mhashilkar & Vijay Murthi (via Contr. w/ UTA CSE)
– Lee Lueking (FNAL; D0 Liaison to PPDG)
Interactions with other Grid efforts that are part of D0:
– GridPP (UK), GridKA (DE), NIKHEF (NL), CCIN2P3 (FR)
Working very closely with the Condor team to achieve:
– a Grid job and resource matchmaking service
– other robustness and usability features
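The matchmaking idea above can be sketched in a few lines: filter sites whose advertised resources satisfy a job's requirements, then rank the survivors by a preference (here, by how much of the job's input data is already cached). This is an illustrative toy in Python, not the actual JIM/Condor implementation, and all attribute names are invented:

```python
# Toy ClassAd-style matchmaking, loosely in the spirit of the Condor
# approach used by JIM. Attribute names ("free_cpus", "cached_files",
# etc.) are hypothetical, not real JIM or Condor attributes.

def rank_sites(job, sites):
    """Return sites satisfying the job's minimum requirements,
    best-ranked first (most cached input files wins)."""
    eligible = [s for s in sites
                if s["free_cpus"] >= job["min_cpus"]
                and s["free_disk_gb"] >= job["min_disk_gb"]]
    # Rank: prefer sites that already hold more of the input data.
    return sorted(eligible, key=lambda s: s["cached_files"], reverse=True)
```

A real matchmaker evaluates arbitrary requirement and rank expressions advertised by both sides; this sketch hard-codes one simple policy to show the filter-then-rank structure.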
Overview of BaBar and DØ Data Handling
Both experiments have extensive distributed computing and data handling systems. Significant amounts of data are processed at remote sites in the US and Europe.
[Figures: DØ SAM deployment map (regional centers and analysis sites) and BaBar deployment map (Tier A centers and Monte Carlo sites). Charts: DØ integrated files consumed, Mar '02 to Mar '03 (4.0 M files); DØ integrated data consumed, Mar '02 to Mar '03 (1.2 PB); BaBar database growth, Jan '02 to Dec '02 (730 TB); BaBar analysis jobs at SLAC, Apr '02 to Mar '03 (140k jobs).]
BaBar Bulk Data Distribution – SRB
The Storage Resource Broker (SRB) from SDSC is being used to test data distribution from Tier A to Tier A, with a view to production this summer.
So far there have been two successful demos: Supercomputing 2001 (SLAC to SLAC) and Supercomputing 2002 (SLAC to CCIN2P3).
Have been testing SRB V2 (released Feb 2003); new features include bulk registering in the RDBMS and parallel-stream file replication.
Currently incorporating the newly designed BaBar metadata tables into SRB's RDBMS tables, and looking to improve file replication performance (tuning stream counts, etc.).
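Parallel-stream replication splits one file transfer across several concurrent streams, each responsible for a byte range. The toy Python sketch below models the idea with local files and threads; real SRB V2 moves the streams over the network, so this is only an illustration of the chunking scheme, not SRB's implementation:

```python
import os
import threading

def parallel_copy(src, dst, streams=4):
    """Copy src to dst using several concurrent byte-range streams --
    a local-file toy model of parallel-stream file replication."""
    size = os.path.getsize(src)
    # Pre-size the destination so each thread can write its own region.
    with open(dst, "wb") as f:
        f.truncate(size)
    chunk = (size + streams - 1) // streams  # ceiling division

    def copy_range(start, end):
        if start >= end:
            return  # nothing assigned to this stream
        with open(src, "rb") as fin, open(dst, "r+b") as fout:
            fin.seek(start)
            fout.seek(start)
            fout.write(fin.read(end - start))

    threads = [
        threading.Thread(
            target=copy_range,
            args=(i * chunk, min((i + 1) * chunk, size)),
        )
        for i in range(streams)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because each stream writes a disjoint region, no locking is needed; over a wide-area link the payoff is that several TCP streams together fill more of the available bandwidth than one.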
BaBar User-driven data distribution: BdbServer++
Attempts to address the use case: a user wants to copy a collection of sparse events with little space overhead (mainly Tier A to Tier C).
BdbServer++ is essentially a set of scripts that:
– Submit a job to the Grid to make a deep copy of the sparse collection (i.e., copy objects only for the events of interest).
– Then copy the files back to the user's institution through the Grid (can use globus-url-copy).
– Presented as a poster at CHEP 2003.
Have tested deep copy through the Grid using EDG and pure Globus, and just completed a test of extracting data using globus-url-copy (a pure Globus request).
To do: incorporate into BaBar bookkeeping; robustness and reliability tests; production-level scripts for submission and copying.
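The two-step workflow above can be sketched as command construction in Python. The host name, remote paths, and the deep-copy wrapper name are invented for illustration; only `globus-job-run` and `globus-url-copy` are real Globus tools, and the actual BdbServer++ scripts differ:

```python
# Hypothetical sketch of the two BdbServer++ steps. The host
# "tierA.example.org" and the "/bin/deep-copy" wrapper are invented.

def deep_copy_job(collection, dest_dir):
    """Step 1: build the grid job that deep-copies only the objects
    for the events of interest into dest_dir at the Tier A site."""
    return ["globus-job-run", "tierA.example.org",
            "/bin/deep-copy", collection, dest_dir]

def fetch_command(remote_file, local_dir):
    """Step 2: build the globus-url-copy command that pulls one
    produced file back to the user's institution."""
    return ["globus-url-copy",
            f"gsiftp://tierA.example.org{remote_file}",
            f"file://{local_dir}/"]
```

Separating "make a small deep copy remotely" from "transfer the result" is what keeps the space overhead low: only the sparse selection ever crosses the network.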
BaBar Job Submission on the Grid
Many production-like activities could take advantage of using compute resources at more than one site:
– Analysis production: CCIN2P3 (France), UK, SLAC (using EDG installations).
– Simulation production: Ferrara (Italy) Grid group and Ohio (using EDG and VDT installations).
– Also very useful for data distribution (BdbServer++): CCIN2P3 (France), SLAC.
Proposed BaBar Grid Architecture
BaBar Job Submission on the Grid
There was a CHEP 2003 talk and poster, a Grid demo was set up in the UK (running BaBar jobs on the UK Grid), and Simulation Production and data distribution tests have been run on the Grid.
Plan: test new EDG2/LCG installations, and increase users as releases stabilize.
BbgUtils.pl – a Perl script that simplifies client-side installation of Globus plus CAs (currently works for Sun, Linux).
– The script copies all the tar files, signing policies, etc. necessary for a client installation for that experiment.
– Can readily be extended to include SRB client-side installation, EDG/LCG client-side installation, etc.
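The installer pattern is simple enough to sketch: enumerate the tarballs and signing-policy bundles a client needs, fetch each from a mirror, and unpack them into a target directory. The sketch below is a hypothetical Python analogue of BbgUtils.pl; the file names and mirror layout are invented, not BaBar's actual distribution:

```python
import os
import tarfile
import urllib.request

# Hypothetical bundle names -- not the real BaBar distribution files.
FILES = ["globus-client.tar.gz", "ca-certs.tar.gz",
         "signing-policies.tar.gz"]

def fetch_list(mirror, files=FILES):
    """Return (url, local-name) pairs the installer would download."""
    return [(f"{mirror.rstrip('/')}/{name}", name) for name in files]

def install_client(mirror, target_dir):
    """Download and unpack each client bundle into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    for url, name in fetch_list(mirror):
        path = os.path.join(target_dir, name)
        urllib.request.urlretrieve(url, path)
        with tarfile.open(path) as tar:
            tar.extractall(target_dir)
```

Keeping the download list in one table is what makes the script easy to extend to SRB or EDG/LCG client bundles: add entries, not code.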
DØ Objectives of SAMGrid
Bring standard grid technologies (including Globus and Condor) to the Run II experiments.
Enable globally distributed computing for DØ and CDF.
JIM (Job and Information Management) complements SAM by adding job management and monitoring to data handling.
Together, JIM + SAM = SAMGrid.
[Figure: SAMGrid job management architecture. Submission clients with a user interface send jobs to a broker (match making service plus information collector); the broker dispatches them to execution sites #1 through #n, each running a queuing system, a computing element, grid sensors, storage elements, and a data handling system.]
DØ JIM Deployment
[Figure: DØ analysis hierarchy. Desktop Analysis Stations (DAS) connect to Institutional Analysis Centers (IAC), which connect to Regional Analysis Centers (RAC), which connect to the Central Analysis Center (CAC); normal and occasional interaction/communication paths link the tiers.]
A site can join SAMGrid with any combination of services:
– Monitoring, and/or
– Execution, and/or
– Submission.
May 2003: expect 5 initial execution sites for SAMGrid deployment, and 20 submission sites:
– GridKa (Karlsruhe) – analysis site
– Imperial College and Lancaster – MC sites
– U. Michigan (NPACI) – reconstruction center
– FNAL – CLueD0 as a submission site.
Summer 2003: continue to add execution and submission sites. The second round of execution-site deployments includes Lyon (CCIN2P3), Manchester, MSU, Princeton, UTA, and the FNAL CAB system.
Hope to grow to dozens of execution sites and hundreds of submission sites over the next year(s).
Use grid middleware for job submission within a site too:
– Administrators will have general ways of managing resources.
– Users will use common tools for submitting and monitoring jobs everywhere.
What's Next for SAMGrid? After JIM version 1:
Improve job scheduling and decision making.
Improved monitoring: more comprehensive, easier to navigate.
Execution of structured jobs.
Simplify packaging and deployment. Extend the configuration and advertising features of the uniform, XML-based framework built for JIM.
CDF is adopting SAM and SAMGrid for their data handling and job submission. CDF has also asked to join PPDG.
Interoperability, interoperability, interoperability:
– Working with EDG and LCG to move in common directions.
– Moving to Web services, Globus V3, and all the good things OGSA will provide. In particular, interoperability by expressing SAM and JIM as a collection of services, and mixing and matching with other Grids.
Challenges
Meeting the challenges of real data handling and job submission, BaBar and DØ have confronted real-life issues, including:
– File replication integrity
– Preemptive distributed caching
– Private networks
– Routing data in a worldwide system
– Reliable network file transfers, timeouts, and retries
– Simplifying complex installation procedures
– Username clashing issues; moving to GSI and Grid certificates
– Interoperability with many mass storage systems (MSS)
– Security issues, firewalls, site policies
– Robust job submission on the grid
Troubleshooting is an important and time-consuming activity in distributed computing environments, and many tools are needed to do this effectively.
Operating these distributed systems on a 24/7 basis involves coordination, training, and worldwide effort.
Standard middleware is still hard to use, and requires significant integration, testing, and debugging.
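The "reliable network file transfers, timeouts, and retries" item is the kind of wrapper both experiments ended up writing around their transfer tools. A minimal sketch in Python, assuming the transfer is a callable that raises `OSError` on failure (the retry counts and backoff are illustrative defaults, not either experiment's actual policy):

```python
import time

def reliable_transfer(do_transfer, attempts=3, backoff_s=30,
                      sleep=time.sleep):
    """Run a transfer callable, retrying on failure with a growing
    backoff between attempts; re-raise the last error if all fail."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return do_transfer()
        except OSError as err:  # treat I/O and network errors as retryable
            last_error = err
            if attempt < attempts:
                # Back off longer after each failure before retrying.
                sleep(backoff_s * attempt)
    raise last_error
```

The `sleep` parameter is injected so the policy can be tested without real delays; in production it would default to actually waiting.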
PPDG Benefits to BaBar and DØ
PPDG has provided very useful collaboration with, and feedback to, other Grid and Computer Science Groups.
Development of tools and middleware that should be of general interest to the Grid community, e.g.:
– BbgUtils.pl
– Condor-G enhancements
Deploying and testing grid middleware under the battlefield conditions of operational experiments hardens the software and helps the CS groups learn what is needed.
The CS groups enable the experiments to examine problems in new, innovative ways, and provide important new technologies for solving them.