TRANSCRIPT
A Distributed Tier-1
An example based on the
Nordic Scientific Computing Infrastructure
GDB meeting – NIKHEF/SARA, 13th October 2004
John Renner Hansen – Niels Bohr Institute
With contributions from Oxana Smirnova, Peter Villemoes and Brian Vinter
Basis for a distributed Tier-1 structure
• External connectivity
• Internal connectivity
• Computer and Storage capacity
• Maintenance and operation
• Long term stability
NORDUnet network in 2003
[Network map: external links RUNNet 155M, NASK 12M, GÉANT 10G (Oct '03), NETNOD 3.5G, General Internet 5G and 2.5G, 622M; NorthernLight connecting Helsinki, Oslo, Stockholm and Copenhagen to NetherLight in Amsterdam (Aug/Dec 2003)]
2.5G links connected to "ONS boxes" giving 2 GE channels between endpoints
NORDUnet was represented at the NREN-TIER1 Meeting (Paris, Roissy Hilton, 12:00-17:00, 22 July 2004) by Peter Villemoes.
Denmark / Forskningsnet
• Upgrade from 622 Mbit/s to a 2.5 Gbit/s ring structure finished:
– Copenhagen-Odense-Århus-Aalborg and back via Göteborg to Copenhagen
• Network research
– setting up a Danish national IPv6 activity
– dark fibre through the country
– experimental equipment for 10GE channels
Finland / FUNET
• Upgraded to 2.5G already in 2002
• Upgrading backbone routers to 10G capability
Norway / UNINETT
• Network upgraded to 2.5G between major universities
• UNINETT is expanding to more services and organisations
Sweden / SUNET
• 10G resilient nationwide network since Nov 2002
–all 32 universities have 2.5G access
• Active participation in SweGrid, Swedish Grid Initiative
Computer and Storage capacity at a Nordic Tier-1

Nordic Tier-1      2004   2005   2006   2007    2008 (offered)   % of 2008 total
CPU (kSI2K)         300    300    700   1400      1400                 8%
Disk (TB)            60     70    200    700       700                 8%
Tape (PB)          0.06   0.07    0.2    0.7       0.7                12%
Tape (MB/s)           –      –      –      –       required: 580        –
WAN (Mbit/s)       1000   2000   5000  10000         –                 –

(2008 capacity to be split between ALICE, ATLAS, CMS and LHCb.)
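As a back-of-envelope plausibility check (not in the original slides), the planned WAN capacity can be set against the storage figures above: moving the 0.7 PB of 2008 tape data over the planned 10 Gbit/s WAN, at ideal sustained utilisation, takes roughly a week.

```python
# Back-of-envelope check (assumptions: sustained full link utilisation,
# decimal units: 1 PB = 1e15 bytes, 1 Gbit/s = 1e9 bit/s).
def transfer_days(petabytes: float, gbit_per_s: float) -> float:
    """Days needed to move `petabytes` of data over a `gbit_per_s` link."""
    bits = petabytes * 1e15 * 8          # data volume in bits
    seconds = bits / (gbit_per_s * 1e9)  # ideal transfer time
    return seconds / 86400

# 0.7 PB (2008 tape figure) over the planned 10 Gbit/s WAN:
print(f"{transfer_days(0.7, 10):.1f} days")  # prints "6.5 days"
```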
Who
• Denmark – Danish Center for Grid Computing (DCGC) and Danish Center for Scientific Computing (DCSC)
• Finland – CSC
• Norway – NorGrid
• Sweden – SweGrid
Denmark
• Two collaborating Grid projects
• Danish Centre for Scientific Computing Grid
– DCSC-Grid spans the four DCSC sites and thus unifies the resources (PC clusters, IBM Regatta, SGI Enterprise, …) within DCSC
• Danish Centre for Grid Computing
– is the national Grid project
– DCSC-Grid is a partner in DCGC
Finland
• Remains centred on CSC
– CSC participates in NDGF and NGC
– A Finnish Grid will probably be created
– This Grid will focus more on accessing CSC resources with local machines
Norway
• NOTUR Emerging Technologies on Grid Computing is the main mover
– Oslo-Bergen "mini-Grid" in place
– Trondheim and Tromsø should be joining
• Currently under reorganisation
Sweden
• SweGrid is a very ambitious project
– 6 clusters have been created for SweGrid, each equipped with 100 PCs and a large disk system
– A large support and education organisation is integrated in the plans
Nordic Data Grid Facility – Nordic Project
[Organisation chart: Research Councils (DK, SF, S, N) → NOS-N → Nordic Data Grid Facility]
1. Create the basis for a common Nordic Data Grid Facility
2. Coordinate Nordic Grid activities
• Core Group: Project Director and 4 Post Docs
• Steering Group: 3 members per country (1 research council civil servant, 2 scientists)
Services provided by the Tier-1 Regional Centres
• acceptance of raw and processed data from the Tier-0 centre, keeping up with data acquisition;
• recording and maintenance of raw and processed data on permanent mass storage;
• provision of managed disk storage providing permanent and temporary data storage for files and databases;
• operation of a data-intensive analysis facility;
• provision of other services according to agreed experiment requirements;
• provision of high capacity network services for data exchange with the Tier-0 centre, as part of an overall plan agreed between the experiments, Tier-1 and Tier-0 centres;
• provision of network services for data exchange with Tier-1 and selected Tier-2 centres, as part of an overall plan agreed between the experiments, Tier-1 and Tier-2 centres;
• administration of databases required by experiments at Tier-1 centres.
   Site                        Country      ~# CPUs   ~% dedicated
 1 atlas.hpc.unimelb.edu.au    Australia       28         30%
 2 genghis.hpc.unimelb.edu.au  Australia       90         20%
 3 charm.hpc.unimelb.edu.au    Australia       20        100%
 4 lheppc10.unibe.ch           Switzerland     12        100%
 5 lxsrv9.lrz-muenchen.de      Germany        234          5%
 6 atlas.fzk.de                Germany        884          5%
 7 morpheus.dcgc.dk            Denmark         18        100%
 8 lscf.nbi.dk                 Denmark         32         50%
 9 benedict.aau.dk             Denmark         46         90%
10 fe10.dcsc.sdu.dk            Denmark        644          1%
11 grid.uio.no                 Norway          40        100%
12 fire.ii.uib.no              Norway          58         50%
13 grid.fi.uib.no              Norway           4        100%
14 hypatia.uio.no              Norway         100         60%
15 sigrid.lunarc.lu.se         Sweden         100         30%
16 sg-access.pdc.kth.se        Sweden         100         30%
17 hagrid.it.uu.se             Sweden         100         30%
18 bluesmoke.nsc.liu.se        Sweden         100         30%
19 ingrid.hpc2n.umu.se         Sweden         100         30%
20 farm.hep.lu.se              Sweden          60         60%
21 hive.unicc.chalmers.se      Sweden         100         30%
22 brenta.ijs.si               Slovenia        50        100%
Totals at peak:
• 7 countries
• 22 sites
• ~3000 CPUs (~700 dedicated)
• 7 Storage Services (in RLS), a few more storage facilities, ~12 TB
• ~1 FTE (1-3 persons) in charge of production, at most 2 executor instances simultaneously
ARC-connected resources for DC2
[Bar chart: good vs. failed jobs per site, y-axis 0-6000 jobs; sites ordered from bluesmoke.nsc.liu.se, grid.uio.no and hypatia.uio.no down to lscf.nbi.dk and grid.fi.uib.no]
Total # of successful jobs: 42202 (as of September 25, 2004)
Failure rate before ATLAS ProdSys manipulations: 20%
• ~1/3 of failed jobs did not waste resources
Failure rate after: 35%. Possible reasons:
• Dulcinea failing to add DQ attributes in RLS
• DQ renaming
• Windmill re-submitting good jobs
ARC performance in ATLAS DC2
Failure analysis
• Dominant problem: hardware accidents