Nuclear Physics Data Management Needs
Bruce G. Gibbard
SLAC DMW2004 Workshop
16-18 March 2004
17 March 2004 B. Gibbard
Overview
Addressing a class of Nuclear Physics (NP) experiments utilizing large particle detector systems to study accelerator-produced reactions
  o Examples at: BNL (RHIC), JLab, CERN (LHC)
Technologies & data management needs of this branch of NP are quite similar to those of HEP
Integrating across its four experiments, the Relativistic Heavy Ion Collider (RHIC) at BNL is currently the most prolific producer of data
  o Study of very high energy collisions of heavy ions (up to Au on Au)
  o High nucleon count, high energy => high multiplicity
  o High multiplicity, high luminosity, and fine detector granularity => very high data rates
  o Raw data recording at up to ~250 MBytes/sec
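To get a feel for what a rate like that implies downstream, here is a back-of-envelope sketch of the raw-data volume per run. The duty factor and run length below are illustrative assumptions, not figures from the talk:

```python
# Back-of-envelope sketch: raw-data volume implied by the quoted peak
# recording rate of ~250 MBytes/sec.  Duty factor and run length are
# assumptions for illustration, not figures from the talk.
PEAK_RATE_MB_S = 250        # MBytes/sec, from the slide
SECONDS_PER_DAY = 86_400
DUTY_FACTOR = 0.5           # assumed sustained fraction of peak
RUN_DAYS = 100              # assumed run length

volume_tb = PEAK_RATE_MB_S * SECONDS_PER_DAY * DUTY_FACTOR * RUN_DAYS / 1_000_000
print(f"~{volume_tb:.0f} TBytes of raw data per run")
```

Even with these rough assumptions the answer lands in the hundreds of TBytes to ~1 PByte range, consistent with the per-run volumes quoted later in the talk.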
Digitized Event in STAR at RHIC
[Figure: STAR event display]
IT Activities of Such NP Experiments
Support the basic computing infrastructure for the experimental collaboration
  o Typically large (100's of physicists) and internationally distributed
  o Manage & distribute code, design, cost, & schedule databases
  o Facilitate communication, documentation, and decision making

Data Intensive Activities
Store, process, support analysis of, and serve data
  o Online recording of Raw data
  o Generation and recording of Simulated data
  o Construction of Summary data from Raw and Simulated data
  o Iterative generation of Distilled Data Subsets from Summary data
  o Serve Distilled Data Subsets and analysis capability to widely distributed individual physicists
Generic Computing Model
[Diagram: Detector System -> Raw Data -> Reconstruction -> Summary Data -> Data Mining -> Skimmed/Streamed/Distilled Data -> Individual Analysis -> Derived Physics Data -> Display -> Final Results. Supporting databases: Calibrations & Conditions DB, Bookkeeping DB, Physics Tag DB. Annotations: Meta-Data/Provenance, Physics-Based Indices, Data-Handling Limited.]
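The stage-to-stage provenance in the diagram can be sketched as a simple chain of processing stages. This is an illustrative model only (the stage names follow the diagram; the code is not any actual RHIC software):

```python
# A minimal sketch of the generic computing model as a chain of processing
# stages with recorded inputs, so provenance can be walked back to raw data.
# Stage names follow the diagram; the code itself is illustrative only.
from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    inputs: list = field(default_factory=list)

raw = Stage("Raw Data")
summary = Stage("Summary Data", inputs=[raw])                  # via Reconstruction
distilled = Stage("Distilled Data Subsets", inputs=[summary])  # via Data Mining
derived = Stage("Derived Physics Data", inputs=[distilled])    # Individual Analysis

def lineage(stage):
    """Walk the provenance chain back to the detector's raw data."""
    chain = [stage.name]
    while stage.inputs:
        stage = stage.inputs[0]
        chain.append(stage.name)
    return list(reversed(chain))

print(lineage(derived))
# ['Raw Data', 'Summary Data', 'Distilled Data Subsets', 'Derived Physics Data']
```

In a real system each stage would also carry the meta-data, calibration, and bookkeeping references shown as side databases in the diagram.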
Data Volumes in Current RHIC Run
Raw Data (PHENIX)
  o Peak rates to 120 MBytes/sec
  o First 2 months of '04 (Jan & Feb):
    • 10^9 Events
    • 160 TBytes
  o Project ~225 TBytes of Raw data for the current run
Derived Data (PHENIX)
  o Construction of Summary Data from Raw Data, then production of distilled subsets from that Summary Data
  o Project ~270 TBytes of Derived data
Total (all of RHIC) = ~1.2 PBytes for the current run
  o STAR ≈ PHENIX
  o BRAHMS + PHOBOS ≈ ~40% of PHENIX
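The slide's numbers are internally consistent, as a quick arithmetic check shows (all inputs are the figures quoted above):

```python
# Consistency checks on the slide's own numbers.
# Average PHENIX raw event size: 160 TBytes over 10^9 events.
avg_event_kbytes = 160e12 / 1e9 / 1e3
print(f"~{avg_event_kbytes:.0f} kBytes/event")

# RHIC-wide total: PHENIX raw + derived, STAR ~ PHENIX,
# BRAHMS + PHOBOS ~ 40% of PHENIX.
phenix_tb = 225 + 270
total_pbytes = (phenix_tb + phenix_tb + 0.4 * phenix_tb) / 1000
print(f"~{total_pbytes:.1f} PBytes")   # matches the quoted ~1.2 PBytes
```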
RHIC Raw Data Recording Rate
[Chart: raw-data recording rate vs. time for STAR and PHENIX, each peaking at ~120 MBytes/sec]
Current RHIC Technology
Tertiary Storage
  o StorageTek / HPSS
  o 4 silos, 4.5 PBytes (1.5 PBytes currently filled)
  o 1000 MB/sec theoretical native I/O bandwidth
Online Storage
  o Central NFS-served disk
    • ~170 TBytes of FibreChannel-connected RAID 5
    • ~1200 MBytes/sec served by 32 Sun SMP's
  o Distributed disk
    • ~300 TBytes of SCSI/IDE
    • Locally mounted on Intel/Linux farm nodes
Compute
  o ~1300 dual-processor Red Hat Linux / Intel nodes
  o ~2600 CPU's => ~1,400 kSPECint2K (3-4 TFLOPS)
Projected Growth in Capacity Scale
Moore's Law effect of component replacement in experiment DAQ's & in computing facilities => ~x6 increase in 5 years
Not-yet-fully-specified requirements of the RHIC II and eRHIC upgrades are likely to accelerate growth

[Chart: Disk Volume at RHIC; disk capacity (TBytes) vs. year, 2001-2008, rising from near zero toward ~3000 TBytes]
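The "~x6 increase in 5 years" figure corresponds to a capacity doubling time of 5·ln(2)/ln(6) years, i.e. roughly the classic 18-24 month Moore's-law cadence:

```python
# Doubling time implied by a x6 capacity increase over 5 years:
# t_double = 5 * ln(2) / ln(6).
import math

doubling_years = 5 * math.log(2) / math.log(6)
print(f"doubling time ~ {doubling_years:.2f} years")
```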
NP Analysis Limitations (1)
Underlying the Data Management issue
  o Events (interactions) of interest are rare relative to minimum-bias events
    • Threshold / phase-space effect for each new energy domain
  o Combinatorics of large-multiplicity events of all kinds confound selection of interesting events
  o Combinatorics also create backgrounds to signals of interest
Two analysis approaches
  o Topological: typically with
    • Many qualitative &/or quantitative constraints on the data sample
    • Relatively low background-to-signal
    • Modest number of events in the final analysis data sample
  o Statistical: frequently with
    • A more poorly constrained sample
    • Large background (signal is a small difference between large numbers)
    • Large number of events in the final analysis data sample
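A small numerical illustration (the counts below are hypothetical, chosen only to show the effect) of why "signal as a small difference between large numbers" pushes statistical analyses toward very large event samples: the Poisson fluctuations of the large counts set the sensitivity.

```python
# Illustration with hypothetical counts: a small signal on top of a large
# background is only significant when the sample is very large, because the
# Poisson fluctuations of the large counts dominate the uncertainty.
import math

n_total = 1_000_000      # events passing a loose statistical selection (assumed)
n_background = 990_000   # estimated background (assumed)

signal = n_total - n_background
error = math.sqrt(n_total + n_background)   # Poisson errors added in quadrature
print(f"signal = {signal} +/- {error:.0f} ({signal / error:.1f} sigma)")
```

Scaling both counts down by a factor of 100 would shrink the significance by a factor of 10, which is why the final analysis samples in this mode stay large and data management stays hard.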
NP Analysis Limitations (2)
It seems to be less frequently possible to do Topological Analyses in NP than in HEP, so Statistical Analyses are more often required
  o Evidence for this is rather anecdotal; not all would agree
  o To the extent that it is true, final analysis data sets tend to be large
  o These are the data sets accessed very frequently by large numbers of users, exacerbating the data management problem
In any case, the extraction and delivery of distilled data subsets to physicists for analysis currently most limits NP analyses
Grid / Data Management Issues
Major RHIC experiments are moving (or have moved) complete copies of Summary Data to regional analysis centers
  o STAR: to LBNL via Grid tools
  o PHENIX: to RIKEN via tape/airfreight
  o Evolution toward more sites and full dependence on the Grid
RHIC, JLab, and NP at the LHC are all very interested and active in Grid development
  o Including high-performance, reliable Wide Area data movement / replication / access services
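The core of such a reliable wide-area replication service can be sketched as copy-verify-retry. This is a hypothetical helper for illustration, not any actual Grid middleware:

```python
# A minimal sketch (hypothetical helper, not any actual Grid middleware) of
# the reliable replication pattern the talk calls for: copy a file, verify
# it by checksum, and retry on mismatch.
import hashlib
import shutil
from pathlib import Path

def checksum(path: Path) -> str:
    """MD5 of a file, read in 1 MB chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def replicate(src: Path, dst: Path, retries: int = 3) -> bool:
    """Copy src to dst; return True only once dst's checksum matches src's."""
    want = checksum(src)
    for _ in range(retries):
        shutil.copyfile(src, dst)
        if checksum(dst) == want:
            return True
    return False
```

A production service would add wide-area transport, replica catalogs, and restart of partial transfers, but end-to-end checksum verification is the piece that makes tape/airfreight and network replication interchangeable.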
Conclusions
NP and HEP accelerator/detector experiments have very similar Data Management requirements
NP analyses of this type currently tend to be more Data-limited than CPU-limited
"Mining" of Summary Data and affording end users adequate access (both Local and Wide Area) to the resulting distillate currently most limits NP analysis
This is expected to remain the case for the next 4-6 years, through
  o Upgrades of RHIC and JLab
  o Start-up of the LHC
with Wide Area access growing in importance relative to Local access