
Page 1:

A Distributed Tier-1

An example based on the

Nordic Scientific Computing Infrastructure

GDB meeting – NIKHEF/SARA, 13th October 2004
John Renner Hansen – Niels Bohr Institute

With contributions from Oxana Smirnova, Peter Villemoes and Brian Vinter

Page 2:

Basis for a distributed Tier-1 structure

• External connectivity

• Internal connectivity

• Computer and Storage capacity

• Maintenance and operation

• Long term stability

Page 3:

NORDUnet network in 2003

[Network map: NORDUnet external links and capacities — RUNNet 155M, 622M, NASK 12M, GÉANT 10G (Oct ’03), NETNOD 3.5G, General Internet 5G, General Internet 2.5G]

Page 4:

NorthernLight

[Map: NorthernLight connects Helsinki, Oslo, Stockholm and Copenhagen with NetherLight in Amsterdam; links dated Aug 2003 and Dec 2003]

2.5G links connected to “ONS boxes” giving 2 GE channels between endpoints

Page 5:

NORDUnet was represented at the NREN-TIER1 Meeting (Paris, Roissy Hilton, 12:00–17:00, 22 July 2004) by Peter Villemoes

Page 6:

Denmark / Forskningsnet

• Upgrade from 622 Mbit/s to a 2.5 Gbit/s ring structure finished:
– Copenhagen – Odense – Århus – Aalborg – and back via Göteborg to Copenhagen

• Network research
– setting up a Danish national IPv6 activity
– dark fibre through the country
– experimental equipment for 10GE channels

Page 7:

Finland / FUNET

• Upgraded to 2.5G already in 2002

• Upgrading backbone routers to 10G capability

Page 8:

Norway / UNINETT

• Network upgraded to 2.5G between major universities

• UNINETT is expanding to more services and organisations

Page 9:

Sweden / SUNET

• 10G resilient nationwide network since Nov 2002

– all 32 universities have 2.5G access

• Active participation in SweGrid, Swedish Grid Initiative

Page 10:

Computer and Storage capacity at a Nordic Tier-1 (planned ramp-up towards 2008; the 2008 requirement is split across ALICE, ATLAS, CMS and LHCb):

CPU (kSI2K):       ramp-up 300, 300, 700, 1400;   offered 2008: 1400 (8% of total)
Disk (TBytes):     ramp-up 60, 70, 200, 700;      offered 2008: 700 (8% of total)
Tape (PBytes):     ramp-up 0.06, 0.07, 0.2, 0.7;  offered 2008: 0.7 (12% of total)
Tape (MBytes/sec): required: 580
WAN (Mbits/sec):   ramp-up 1000, 2000, 5000, 10000
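To put the pledged shares in context, here is a small back-of-the-envelope sketch (illustrative only, using the offered values and percentages quoted above) of the implied totals across all Tier-1 centres:

```python
# Rough cross-check (illustrative): implied totals across all Tier-1s,
# given the Nordic 2008 offer and its quoted share of the total requirement.
offers = {
    "CPU (kSI2K)":   (1400, 0.08),   # (offered, fraction of total)
    "Disk (TBytes)": (700,  0.08),
    "Tape (PBytes)": (0.7,  0.12),
}

for resource, (offered, share) in offers.items():
    implied_total = offered / share
    print(f"{resource}: offered {offered} = {share:.0%} of total "
          f"-> implied total ~{implied_total:,.0f}")

# Tape rate of 580 MBytes/sec corresponds to roughly 4.6 Gbit/s sustained,
# well inside the planned 10 Gbit/s WAN figure.
print(f"580 MB/s ~= {580 * 8 / 1000:.1f} Gbit/s sustained")
```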

Page 11:

Who

• Denmark

Danish Center for Grid Computing - DCGC

Danish Center for Scientific Computing - DCSC

Page 12:

Who

• Denmark

• Finland - CSC

Page 13:

Who

• Denmark

• Finland

• Norway - NorGrid

Page 14:

Who

• Denmark

• Finland

• Norway

• Sweden - SweGrid

Page 15:

Denmark

• Two collaborating Grid projects

• Danish Centre for Scientific Computing Grid
– DCSC-Grid spans the four DCSC sites and thus unifies the resources (PC clusters, IBM Regatta, SGI Enterprise, …) within DCSC

• Danish Centre for Grid Computing
– is the national Grid project
– DCSC Grid is a partner in DCGC

Page 16:

Finland

• Remains centred on CSC
– CSC participates in NDGF and NGC
– A Finnish Grid will probably be created
– This Grid will focus more on accessing CSC resources with local machines

Page 17:

Norway

• NOTUR Emerging Technologies on Grid Computing is the main mover
– Oslo–Bergen “mini-Grid” in place
– Trondheim and Tromsø should be joining

• Currently under reorganization

Page 18:

Sweden

• SweGrid is a very ambitious project
– 6 clusters have been created for SweGrid, each equipped with 100 PCs and a large disk system
– A large support and education organisation is integrated into the plans

Page 19:

[Organisation chart]

Research Councils: DK, SF, S, N (NOS-N)

Nordic Data Grid Facility – Nordic Project:
1. Create the basis for a common Nordic Data Grid Facility
2. Coordinate Nordic Grid activities

Core Group: Project Director, 4 Post Docs

Steering Group: 3 members per country (1 Research Council civil servant, 2 scientists)

Page 20:

Services provided by the Tier-1 Regional Centres

• acceptance of raw and processed data from the Tier-0 centre, keeping up with data acquisition;

• recording and maintenance of raw and processed data on permanent mass storage;

• provision of managed disk storage providing permanent and temporary data storage for files and databases;

• operation of a data-intensive analysis facility;

• provision of other services according to agreed experiment requirements;

• provision of high capacity network services for data exchange with the Tier-0 centre, as part of an overall plan agreed between the experiments, Tier-1 and Tier-0 centres;

• provision of network services for data exchange with Tier-1 and selected Tier-2 centres, as part of an overall plan agreed between the experiments, Tier-1 and Tier-2 centres;

• administration of databases required by experiments at Tier-1 centres;

Page 21:

Page 22:

ARC-connected resources for DC2

     Site                          ~ # CPUs    ~ % Dedicated
  1  atlas.hpc.unimelb.edu.au           28          30%
  2  genghis.hpc.unimelb.edu.au         90          20%
  3  charm.hpc.unimelb.edu.au           20         100%
  4  lheppc10.unibe.ch                  12         100%
  5  lxsrv9.lrz-muenchen.de            234           5%
  6  atlas.fzk.de                      884           5%
  7  morpheus.dcgc.dk                   18         100%
  8  lscf.nbi.dk                        32          50%
  9  benedict.aau.dk                    46          90%
 10  fe10.dcsc.sdu.dk                  644           1%
 11  grid.uio.no                        40         100%
 12  fire.ii.uib.no                     58          50%
 13  grid.fi.uib.no                      4         100%
 14  hypatia.uio.no                    100          60%
 15  sigrid.lunarc.lu.se               100          30%
 16  sg-access.pdc.kth.se              100          30%
 17  hagrid.it.uu.se                   100          30%
 18  bluesmoke.nsc.liu.se              100          30%
 19  ingrid.hpc2n.umu.se               100          30%
 20  farm.hep.lu.se                     60          60%
 21  hive.unicc.chalmers.se            100          30%
 22  brenta.ijs.si                      50         100%

Totals at peak:
• 7 countries
• 22 sites
• ~3000 CPUs
– dedicated ~700
• 7 Storage Services (in RLS)
– a few more storage facilities
– ~12 TB
• ~1 FTE (1–3 persons) in charge of production
– at most 2 executor instances simultaneously
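As a rough cross-check of the “Totals at peak” numbers, the sketch below sums the table; reading “dedicated” as the dedication-weighted CPU sum is my own assumption, and it gives a figure of the same order as the quoted ~700:

```python
# (site, ~CPUs, ~% dedicated) taken from the DC2 resource table above
sites = [
    ("atlas.hpc.unimelb.edu.au", 28, 30),  ("genghis.hpc.unimelb.edu.au", 90, 20),
    ("charm.hpc.unimelb.edu.au", 20, 100), ("lheppc10.unibe.ch", 12, 100),
    ("lxsrv9.lrz-muenchen.de", 234, 5),    ("atlas.fzk.de", 884, 5),
    ("morpheus.dcgc.dk", 18, 100),         ("lscf.nbi.dk", 32, 50),
    ("benedict.aau.dk", 46, 90),           ("fe10.dcsc.sdu.dk", 644, 1),
    ("grid.uio.no", 40, 100),              ("fire.ii.uib.no", 58, 50),
    ("grid.fi.uib.no", 4, 100),            ("hypatia.uio.no", 100, 60),
    ("sigrid.lunarc.lu.se", 100, 30),      ("sg-access.pdc.kth.se", 100, 30),
    ("hagrid.it.uu.se", 100, 30),          ("bluesmoke.nsc.liu.se", 100, 30),
    ("ingrid.hpc2n.umu.se", 100, 30),      ("farm.hep.lu.se", 60, 60),
    ("hive.unicc.chalmers.se", 100, 30),   ("brenta.ijs.si", 50, 100),
]

total_cpus = sum(cpus for _, cpus, _ in sites)
weighted = sum(cpus * pct / 100 for _, cpus, pct in sites)

print(f"{len(sites)} sites, ~{total_cpus} CPUs in total")  # 22 sites, ~2920 CPUs
print(f"dedication-weighted CPUs: ~{weighted:.0f}")         # ~600, same order as the quoted ~700
```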

Page 23:

[Bar chart: good vs. failed jobs per site (0–6000 jobs) for the 22 ARC-connected resources listed above]

ARC performance in ATLAS DC2

Total number of successful jobs: 42202 (as of September 25, 2004)

Failure rate before ATLAS ProdSys manipulations: 20%
• ~1/3 of failed jobs did not waste resources

Failure rate after: 35%. Possible reasons:
• Dulcinea failing to add DQ attributes in RLS
• DQ renaming
• Windmill re-submitting good jobs
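For a sense of scale, a small sketch converting these rates into job counts; it assumes the failure rates are failed jobs divided by all job attempts, which is my reading rather than something stated on the slide:

```python
# Estimate total attempts from the successful-job count, assuming
# failure_rate = failed / attempted (an assumption, not stated on the slide).
successful = 42202

for label, failure_rate in [("before ProdSys manipulations", 0.20),
                            ("after ProdSys manipulations", 0.35)]:
    attempted = successful / (1 - failure_rate)
    failed = attempted - successful
    print(f"{label}: ~{attempted:,.0f} attempts, of which ~{failed:,.0f} failed")
```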

Page 24:

Failure analysis

• Dominant problem: hardware accidents