Fabric Management for CERN Experiments: Past, Present, and Future. Tim Smith, CERN/IT


Page 1: Fabric Management for CERN Experiments Past, Present, and Future

Fabric Management for CERN Experiments
Past, Present, and Future

Tim Smith, CERN/IT
HEPiX @ JLab, 2000/11/03

Page 2: Fabric Management for CERN Experiments Past, Present, and Future


Contents

The Fabric of CERN today
The new challenges of LHC computing
What has this got to do with the GRID?
Fabric Management solutions of tomorrow?
The DataGRID Project

Page 3: Fabric Management for CERN Experiments Past, Present, and Future


Fabric Elements

Functionalities: batch and interactive services, disk servers, tape servers + devices, stage servers, home directory servers, application servers, backup service

Infrastructure: job scheduler, authentication, authorisation, monitoring, alarms, console managers, networks

Page 4: Fabric Management for CERN Experiments Past, Present, and Future


Fabric Technology at CERN

[Chart: machine multiplicity (log scale, 1 to 10,000) vs. year, 1989-2005: mainframes (IBM, Cray) and SMPs (SGI, DEC, HP, Sun) give way to RISC workstations, scalable systems (SP2, CS2) and, finally, PC farms]

Page 5: Fabric Management for CERN Experiments Past, Present, and Future


Architecture Considerations

Physics applications exhibit ideal data parallelism: a mass of independent problems.

No message passing; throughput rather than single-job performance; resilience rather than ultimate reliability.

We can therefore build hierarchies of mass-market components: High Throughput Computing. (A minimal sketch of this model follows below.)
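The high-throughput model can be illustrated with a small sketch, assuming a hypothetical process_batch function and using multiprocessing on one box to stand in for a farm of independent nodes: every batch is an independent job and no messages pass between workers.

```python
# Minimal sketch of high-throughput computing: many independent event
# batches, no message passing between workers. process_batch() and the
# batch count are hypothetical placeholders, not CERN software.
from multiprocessing import Pool

def process_batch(batch_id: int) -> int:
    # Stand-in for reconstructing/analysing one independent batch of events;
    # each job reads its own input and writes its own output.
    return batch_id % 7  # pretend result

if __name__ == "__main__":
    batches = range(1000)             # 1000 independent jobs
    with Pool(processes=16) as pool:  # throughput scales with worker count
        results = pool.map(process_batch, batches)
    print(f"processed {len(results)} batches")
```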

Page 6: Fabric Management for CERN Experiments Past, Present, and Future


Component Architecture

[Diagram: CPU nodes attached to 100/1000baseT switches, and tape servers, disk servers and application servers attached to 1000baseT switches, all interconnected via a high-capacity backbone switch]

Page 7: Fabric Management for CERN Experiments Past, Present, and Future


Analysis Chain: Farms

[Diagram: detector raw data passes through the event filter (selection & reconstruction) and event reconstruction to produce event summary data; event simulation feeds the same chain; batch physics analysis turns the processed data into analysis objects (extracted by physics topic), which feed interactive physics analysis]
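As an illustration only, the chain in the diagram can be written as a sequence of processing stages; the event representation and all function names below are hypothetical, not the experiments' actual software.

```python
# Sketch of the analysis chain: raw data -> event filter -> reconstruction
# -> event summary data -> batch analysis -> analysis objects.
# All types, names and the example topic are hypothetical.

def event_filter(raw_events):
    # selection & reconstruction at the filter stage: keep triggered events
    return [e for e in raw_events if e.get("trigger")]

def reconstruct(events):
    # produce event summary data (ESD) from the filtered events
    return [{"esd": True, "topics": e.get("topics", [])} for e in events]

def batch_analysis(esd, topic):
    # extract analysis objects for one physics topic
    return [e for e in esd if topic in e["topics"]]

raw_data = [{"trigger": True, "topics": ["dimuon"]},
            {"trigger": False, "topics": []}]
objects = batch_analysis(reconstruct(event_filter(raw_data)), "dimuon")
print(len(objects), "analysis objects ready for interactive analysis")
```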

Page 8: Fabric Management for CERN Experiments Past, Present, and Future


Multiplication!

[Chart: number of CPUs, Jul-97 to Jan-00, growing from near zero to about 1200, stacked by cluster: tomog, tapes, pcsf, nomad, na49, na48, na45, mta, lxbatch, lxplus, lhcb, l3c, ion, eff, cms, ccf, atlas, alice]

Page 9: Fabric Management for CERN Experiments Past, Present, and Future


PC Farms

Page 10: Fabric Management for CERN Experiments Past, Present, and Future


Shared Facilities: EFF Scheduling 2000

[Chart: number of PCs (0 to 140) allocated per week (weeks 1 to 52) of the shared EFF facility, split between DELPHI, CMS, ALEPH, ATLAS, NA45, COMPASS, ALICE and available capacity]

Page 11: Fabric Management for CERN Experiments Past, Present, and Future


LHC Computing Challenge

The scale will be different:
CPU:  10k SI95  ->  1M SI95
Disk: 30 TB     ->  3 PB
Tape: 600 TB    ->  9 PB

The model will be different: there are compelling reasons why some of the farms and some of the capacity will not be located at CERN.

Page 12: Fabric Management for CERN Experiments Past, Present, and Future


Estimated Disk Capacity at CERN

[Chart: estimated disk storage capacity at CERN, 1998-2006, 0 to 1800 TeraBytes, split into non-LHC and LHC, compared against Moore's Law]

Estimated CPU Capacity at CERN

[Chart: estimated CPU capacity at CERN, 1998-2006, 0 to 2,500 K SI95, split into non-LHC and LHC; today's installed capacity is ~10K SI95 on 1200 processors]

Bad News: IO
1996: 4 GB disks @ 10 MB/s; per TB: 2500 MB/s
2000: 50 GB disks @ 20 MB/s; per TB: 400 MB/s

Bad News: Tapes
Less than a factor of 2 reduction in 8 years; a significant fraction of the cost

Page 13: Fabric Management for CERN Experiments Past, Present, and Future


Regional Centres: a Multi-Tier Model

[Diagram: the MONARC multi-tier model (http://cern.ch/MONARC): CERN as Tier 0, Tier 1 centres (FNAL, RAL, IN2P3), Tier 2 centres (Lab a, Uni b, Lab c, ..., Uni n), then departments and desktops, with link speeds of 155 Mbps, 622 Mbps and 2.5 Gbps between tiers]

Page 14: Fabric Management for CERN Experiments Past, Present, and Future


More realistically: a Grid Topology

[Diagram: the same centres (CERN Tier 0; Tier 1: FNAL, RAL, IN2P3; Tier 2: Lab a, Uni b, Lab c, ..., Uni n; departments and desktops) cross-connected as a grid rather than a strict hierarchy, with 155 Mbps to 2.5 Gbps links; the DataGRID project, http://cern.ch/grid]

Page 15: Fabric Management for CERN Experiments Past, Present, and Future


Can we build LHC farms?

Positive predictions: CPU and disk price/performance trends suggest that the raw processing and disk storage capacities will be affordable, and raw data rates and volumes look manageable (perhaps not today for ALICE). Space, power and cooling issues?

So probably yes... but can we manage them? Understand costs: 1 PC is cheap, but managing 10,000 is not! Building and managing coherent systems from such large numbers of boxes will be a challenge.

1999: CDR @ 45 MB/s for NA48!
2000: CDR @ 90 MB/s for ALICE!

Page 16: Fabric Management for CERN Experiments Past, Present, and Future


Management Tasks I

Supporting adaptability

Configuration Management: machine / service hierarchy; automated registration / insertion / removal; dynamic reassignment

Automatic Software Installation and Management (OS and applications): version management; application dependencies; controlled (re)deployment

(A sketch of such a configuration registry follows below.)
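To make the configuration-management tasks above concrete, here is a minimal sketch of a registry holding a machine/service hierarchy with automated registration and dynamic reassignment; the Node and Registry classes, service names and version strings are all hypothetical, not the WP4 design.

```python
# Hedged sketch of a configuration registry: machine/service hierarchy,
# automated insertion, dynamic reassignment, and the desired software state
# that an installation layer would enforce. All names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    service: str                                   # e.g. "lxbatch", "diskserver"
    packages: dict = field(default_factory=dict)   # package -> desired version

class Registry:
    def __init__(self):
        self.nodes = {}

    def register(self, node: Node) -> None:
        # automated insertion of a new machine into the fabric
        self.nodes[node.name] = node

    def reassign(self, name: str, new_service: str) -> None:
        # dynamic reassignment of a machine to a different service
        self.nodes[name].service = new_service

    def desired_state(self, name: str) -> dict:
        # what the software installation/management layer should enforce
        node = self.nodes[name]
        return {"service": node.service, "packages": node.packages}

reg = Registry()
reg.register(Node("pc001", "lxplus", {"os": "linux-2.2", "cernlib": "2000"}))
reg.reassign("pc001", "lxbatch")
print(reg.desired_state("pc001"))
```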

Page 17: Fabric Management for CERN Experiments Past, Present, and Future


Management Tasks II

Controlling Quality of Service

System Monitoring: oriented to the service, NOT the machine; uniform access to diverse fabric elements; integrated with configuration (change) management

Problem Management: identification of root causes (faults + performance); correlation of network / system / application data; highly automated; adaptive, integrated with configuration management

(A sketch of service-oriented monitoring follows below.)
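A toy sketch of the service-oriented monitoring idea: per-machine samples are rolled up per service using the node-to-service map held by configuration management, so reports and alarms refer to the service rather than the box. All node names, services and metrics here are hypothetical.

```python
# Toy sketch of service-oriented monitoring: aggregate per-machine samples
# into a per-service view using the mapping from configuration management.
# All node names, services and metrics are hypothetical.
from collections import defaultdict

def service_view(samples, node_to_service):
    # samples: list of (node, metric, value) tuples from diverse fabric elements
    grouped = defaultdict(list)
    for node, metric, value in samples:
        service = node_to_service.get(node, "unknown")
        grouped[(service, metric)].append(value)
    # report the average value per (service, metric) pair
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}

node_to_service = {"pc001": "lxbatch", "pc002": "lxbatch", "pc010": "lxplus"}
samples = [("pc001", "load", 0.9), ("pc002", "load", 0.4), ("pc010", "load", 2.1)]
print(service_view(samples, node_to_service))
```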

Page 18: Fabric Management for CERN Experiments Past, Present, and Future


Relevance to the GRID?

Scalable solutions are needed even in the absence of the GRID!

For the GRID to work it must be presented with information and opportunities: coordinated and efficiently run centres, presentable as a guaranteed-quality resource.

'GRID'ification: the interfaces

Page 19: Fabric Management for CERN Experiments Past, Present, and Future


Mgmt Tasks: A GRID centre

GRID-enable the fabric to support external requests for services:

Publication: coordinated and 'map'able
Security: authentication / authorisation
Policies: allocation / priorities / estimation / cost
Scheduling, reservation, change management
Guarantees: resource availability / QoS

(A sketch of the publication interface follows below.)
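As one possible illustration of the publication task: the fabric periodically publishes a coordinated summary of its resources and guarantees so a grid scheduler can treat it as a guaranteed-quality resource. The schema, the site label and the example values below are invented for this sketch, not a DataGRID interface.

```python
# Illustration of publishing fabric status to the grid: a coordinated summary
# of available resources, expected waiting time and honoured reservations.
# The schema, site label and values are hypothetical.
import json

def publish_fabric_status(free_cpus, free_disk_tb, queue_wait_min, reservations):
    status = {
        "site": "CERN-Tier0",                       # hypothetical site label
        "free_cpus": free_cpus,                     # currently schedulable CPUs
        "free_disk_tb": free_disk_tb,               # available disk space
        "estimated_queue_wait_min": queue_wait_min, # scheduling estimate
        "reservations": reservations,               # honoured allocations/priorities
    }
    return json.dumps(status)

print(publish_fabric_status(350, 12.5, 20, ["alice-cdr"]))
```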

Page 20: Fabric Management for CERN Experiments Past, Present, and Future


Existing Solutions?

The world outside is moving fast!

Dissimilar problems: virtual supercomputers (~200 nodes); MPI, latency, interconnect topology and bandwidth; Roadrunner, LosLobos, Cplant, Beowulf

Similar problems: ISPs / ASPs (~200 nodes); clustering for high availability / mission-critical services

The DataGRID: Fabric Management work package (WP4)

Page 21: Fabric Management for CERN Experiments Past, Present, and Future


WP4 Partners

CERN (CH): Tim Smith
ZIB (D): Alexander Reinefeld
KIP (D): Volker Lindenstruth
NIKHEF (NL): Kors Bos
INFN (I): Michele Michelotto
RAL (UK): Andrew Sansum
IN2P3 (Fr): Denis Linglin

Page 22: Fabric Management for CERN Experiments Past, Present, and Future


Concluding Remarks

Years of experience in exploiting inexpensive mass-market components

But we need to marry these with inexpensive, highly scalable management tools

And build the components back together as a resource for the GRID