TRANSCRIPT
Fermilab Site Report
Mark O. Kaletka, Head, Core Support Services Department
Computing Division
CD mission statement
• The Computing Division’s mission is to play a full part in the mission of the laboratory and in particular:
• To proudly develop, innovate, and support excellent and forefront computing solutions and services, recognizing the essential role of cooperation and respect in all interactions between ourselves and with the people and organizations that we work with and serve.
How we are organized
We participate in all areas
• Accelerator – Tev BPM project and many projects to help Run II luminosity goals, Accelerator Simulation
• High Energy Physics experiments – older fixed-target analysis, CDF, DZero, MiniBooNE, MINOS, MIPP, Testbeam, CMS, BTeV, and future proposals such as Minerva and Nova and future kaon experiments
• Astrophysics experiments – Pierre Auger, CDMS, SDSS, and future proposals such as the SDSS extension, Joint Dark Energy Mission (SNAP), and Dark Energy Survey
• Theory – Lattice QCD facility
Production system capacities
Growth in farms usage
Growth in farms density
Projected growth of computers
[Chart: projected node counts for FY04–FY08; series: Computing Nodes and Server Nodes; y-axis in nodes, with values ranging from roughly 150 to 4,178.]
CD Computer Power Growth
[Chart: projected vs. actual computer power, 1995–2009, in KVA (y-axis 0–2,500); series: Projected KVA and Actual KVA; FCC maximum is 750 KVA.]
Projected power growth
Computer rooms
• Provide space, power & cooling for central computers
• Problem: increasing luminosity
  – ~2,600 computers in FCC
  – Expect to add ~1,000 systems/year
  – FCC has run out of power & cooling; cannot add utility capacity
• New Muon Lab
  – 256 systems for Lattice Gauge theory
  – CDF early buys of 160 systems + 160 existing CDF systems from FCC
  – Developing plan for another room
• Wide Band
  – Long-term phased plan, FY04–08
  – FY04/05 build: 2,880 computers (~$3M)
  – Tape robot room in FY05
  – FY06/07: ~3,000 computers
Storage and data movement
• 1.72 PB of data in ATL
  – Ingest of ~100 TB/mo
• Many 10's of TB fed to analysis programs each day
• Recent work:
  – Parameterizing storage systems for SRM
    • Apply to SAM
    • Apply more generally
  – VO notions in storage systems
FNAL Starlight dark fiber project
• FNAL dark fiber to Starlight
  – Completion: mid-June 2004
  – Initial DWDM configuration:
    • One 10 Gb/s (LAN_PHY) channel
    • Two 1 Gb/s (OC48) channels
• Intended uses of link
  – WAN network R&D projects
  – Overflow for production traffic:
    • ESnet link to remain production network link
  – Redundant offsite path
[Network diagram: Fermilab border router, ESnet Chicago PoP (622 Mb/s production link), Starlight, CERN, Abilene (Internet 2), CAnet, UKLight / former MREN sites, I-Wire, general internet, and Research Networks A and B; dark fiber link shown at 1 Gb/s (10 Gb/s soon); key distinguishes production traffic from R&D traffic.]
General network improvements
• Core network upgrades
  – Switch/router (Catalyst 6500) supervisors upgraded:
    • 720 Gb/s switching fabric (Sup720s); provides 40 Gb/s per slot
  – Initial deployment of 10 Gb/s backbone links
• 1000Base-T support expanded
  – Ubiquitous on computer room floors:
    • New farms acquisitions supported on gigabit ethernet ports
  – Initial deployment in a few office areas
Network security improvements
• Mandatory node registration for network access
  – "Hotel-like" temporary registration utility for visitors
  – System vulnerability scan is part of the process
• Automated network scan blocker deployed
  – Based on quasi-real-time network flow data analysis (see the sketch after this list)
  – Blocks outbound & inbound scans
• VPN service deployed
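The scan blocker works from flow records rather than packet payloads. As a rough illustration of that idea (not the production tool), the Python sketch below counts the distinct destinations each source contacts in a short window and flags sources above a threshold; the FlowRecord layout, the threshold value, and the block_host hook are assumptions made for the example.

```python
# Hypothetical sketch of flow-based scan detection (not the production blocker).
# Idea: a host contacting many distinct destinations in a short window looks like a scan.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class FlowRecord:
    src: str        # source IP
    dst: str        # destination IP
    dport: int      # destination port
    timestamp: float

SCAN_THRESHOLD = 100   # distinct destinations per window (illustrative value)
WINDOW_SECONDS = 60

def find_scanners(flows, window_start):
    """Return source IPs that contacted too many distinct hosts in the window."""
    targets = defaultdict(set)
    for f in flows:
        if window_start <= f.timestamp < window_start + WINDOW_SECONDS:
            targets[f.src].add(f.dst)
    return [src for src, dsts in targets.items() if len(dsts) >= SCAN_THRESHOLD]

def block_host(ip):
    # Placeholder: the real system would push an ACL or router filter here.
    print(f"blocking {ip} (suspected scan)")

if __name__ == "__main__":
    sample = [FlowRecord("131.225.0.1", f"10.0.0.{i}", 445, 0.0) for i in range(150)]
    for ip in find_scanners(sample, window_start=0.0):
        block_host(ip)
```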
Central services
• Email
  – Spam tagging in place
    • X-Spam-Flag: YES (filter sketch follows this list)
  – Capacity upgrades for gateways, IMAP servers, virus scanning
  – Redundant load sharing
• AFS
  – Completely on OpenAFS
  – SAN for backend storage
  – TiBS backup system
  – DOE-funded SBIR for performance investigations
• Windows
  – Two-tier patching system for Windows
    • 1st tier under control of OU (PatchLink)
    • 2nd tier domain-wide (SUS)
  – 0 Sasser infections post-implementation
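Because tagged mail carries the X-Spam-Flag: YES header, any downstream client or delivery agent can sort on it. The minimal Python sketch below shows such a filter; the header name comes from the slide, while the mailbox paths and sorting policy are illustrative assumptions.

```python
# Minimal sketch: sort a message into a spam folder based on the X-Spam-Flag header.
# The header name comes from the slide; mailbox paths are illustrative.
import mailbox
from email import message_from_string

def is_tagged_spam(msg):
    """Return True if the gateway tagged this message as spam."""
    return msg.get("X-Spam-Flag", "").strip().upper() == "YES"

def sort_message(raw_text, inbox_path="Inbox.mbox", spam_path="Spam.mbox"):
    msg = message_from_string(raw_text)
    dest = spam_path if is_tagged_spam(msg) else inbox_path
    mbox = mailbox.mbox(dest)
    mbox.add(msg)
    mbox.flush()
    mbox.close()

if __name__ == "__main__":
    sort_message("X-Spam-Flag: YES\nSubject: test\n\nbody\n")
```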
Central services -- backups
• Site-wide backup plan is moving forward
  – SpectraLogic T950-5
  – 8 SAIT-1 drives
  – Initial 450-tape capacity for 7 TB pilot project
• Plan for modular expansion to over 200 TB
Computer security
• Missed by the Linux rootkit epidemic
  – but no theoretical reason for immunity
• Experimenting with AFS cross-cell authentication
  – with Kerberos 5 authentication
  – subtle ramifications
• DHCP registration process
  – includes security scan, does not (yet) deny access
  – a few VIPs have been tapped during meetings
• Vigorous self-scanning program
  – based on Nessus
  – maintain database of results (bookkeeping sketch follows this list)
  – look especially for "critical vulnerabilities" (& deny access)
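The self-scanning program keeps a database of scan results and singles out critical findings for access denial. The sketch below illustrates that bookkeeping step only; the table layout, severity values, and deny action are assumptions, and it does not invoke Nessus itself.

```python
# Hypothetical bookkeeping for a Nessus-based self-scanning program.
# Stores parsed findings in SQLite and flags hosts with critical vulnerabilities.
import sqlite3

def store_findings(db_path, findings):
    """findings: iterable of (host, plugin_id, severity, description) tuples."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS findings (
               host TEXT, plugin_id INTEGER, severity TEXT, description TEXT,
               scanned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)"""
    )
    conn.executemany(
        "INSERT INTO findings (host, plugin_id, severity, description) VALUES (?, ?, ?, ?)",
        findings,
    )
    conn.commit()
    return conn

def hosts_to_deny(conn):
    """Hosts with at least one critical finding on record."""
    rows = conn.execute(
        "SELECT DISTINCT host FROM findings WHERE severity = 'critical'"
    )
    return [r[0] for r in rows]

if __name__ == "__main__":
    conn = store_findings(
        "scans.db",
        [("node01.fnal.gov", 12345, "critical", "example finding")],
    )
    for host in hosts_to_deny(conn):
        print(f"deny network access: {host}")  # real action: registration/DHCP block
```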
Run II – D0
• D0 reprocessed 600M events in fall 2003
  – using grid-style tools; 100M of those events were processed offsite at 5 other facilities
  – farm production capacity is roughly 25M events per week
  – MC production capacity is 1M events per week
  – about 1B events/week on the analysis systems
• Linux SAM station on a 2 TB fileserver to serve the new analysis nodes
  – next step in the plan to reduce D0min
  – station has been extremely performant, expanding the Linux SAM cache
  – station typically delivers about 15 TB of data and 550M events per week
• Rolled out an MC production system that has grid-style job submission
  – JIM component of SAM-Grid
• Torque (sPBS) is in use on the most recent analysis nodes
  – has been much more robust than PBS (submission sketch follows this list)
• Linux fileservers are being used as "project" space
  – physics-group-managed storage with high access patterns
  – good results
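Torque accepts the familiar qsub interface, so wrapping and submitting an analysis job can look roughly like the sketch below. This is a generic illustration, not a D0 production script; the queue name, resource requests, and wrapper helper are assumptions.

```python
# Generic sketch of submitting a batch job to Torque via qsub.
# Queue name and resource requests are illustrative, not D0 production settings.
import subprocess
import tempfile

JOB_SCRIPT = """#!/bin/sh
#PBS -N d0_analysis_example
#PBS -q analysis
#PBS -l nodes=1,walltime=04:00:00
cd "$PBS_O_WORKDIR"
echo "running on $(hostname)"
# ... run the analysis executable here ...
"""

def submit(script_text):
    """Write the job script to a file and hand it to qsub; return the job id."""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script_text)
        path = f.name
    result = subprocess.run(["qsub", path], capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print("submitted job:", submit(JOB_SCRIPT))
```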
MINOS & BTeV status
• MINOS
  – data taking in early 2005
  – using "standard" tools:
    • Fermi Linux
    • General-purpose farms
    • AFS
    • Oracle
    • enstore & dcache
    • ROOT
• BTeV
  – preparations for CD-1 review by DOE
    • included review of online (but not offline) computing
    • novel feature is that much of the Level 2/3 trigger software will be part of the offline reconstruction software
US-CMS computing
• DC04 Data Challenge and preparation for the Computing TDR
  – preparation for the Physics TDR (P-TDR)
  – roll-out of the LCG Grid service and federating it with the U.S. facilities
• Develop the required Grid and facilities infrastructure
  – increase the facility capacity through equipment upgrades
  – commission Grid capabilities through Grid2003 and LCG-1 efforts
  – develop and integrate required functionalities and services
• Increase the capability of the User Analysis Facility
  – improve how a physicist would use facilities and software
  – facilities and environment improvements
  – software releases, documentation, web presence, etc.
US-CMS computing – Tier 1
• 136 worker nodes (dual 1U Xeon and dual 1U Athlon servers)
  – 240 CPUs for production (174 kSI2000)
  – 32 CPUs for analysis (26 kSI2000)
• All systems purchased in 2003 are connected over gigabit ethernet
• 37 TB of disk storage
  – 24 TB in production for mass storage disk cache
    • In 2003 we switched to SATA disks in external enclosures connected over fibre channel
    • Only marginally more expensive than 3ware-based systems, and much easier to administer
  – 5 TB of user analysis space
    • Highly available, high-performance, backed-up space
  – 8 TB production space
• 70 TB of mass storage space
  – Limited by tape purchases, not silo space
US-CMS computing – DC03 & GRID 2003
• Over 72K CPU-hours used in a week
• 100 TB of data transferred across Grid3 sites
• Peak numbers of jobs approaching 900
• Average numbers during the daytime over 500
US-CMS computing – DC04
[Chart: number of transferred files, 1-Mar-2004 through 26-Apr-2004; y-axis 0–20,000.]
1st LHC magnet leaving FNAL for CERN
And our science has shown up in some unusual journals…
“Her sneakers squeaked as she walked down the halls where Lederman had walked. The 7th floor of the high-rise was where she did her work, and she found her way to the small, functional desk in the back of the pen.”