CERN User Story

Towards An Agile Infrastructure at CERN
Tim Bell ([email protected])
OpenStack Conference, 6th October 2011


DESCRIPTION

CERN, the European Organization for Nuclear Research, is one of the world’s largest centres for scientific research. Its business is fundamental physics: finding out what the universe is made of and how it works. At CERN, accelerators such as the 27 km Large Hadron Collider are used to study the basic constituents of matter. This talk reviews the challenges of recording and analysing the 25 Petabytes/year produced by the experiments, and the investigations into how OpenStack could help deliver a more agile computing infrastructure.

TRANSCRIPT

Page 1: CERN User Story

Towards An Agile Infrastructure at CERN

Tim Bell, [email protected]

OpenStack Conference, 6th October 2011


Page 2: CERN User Story

What is CERN?


• Conseil Européen pour la Recherche Nucléaire – aka European Laboratory for Particle Physics

• Between Geneva and the Jura mountains, straddling the Swiss-French border

• Founded in 1954 with an international treaty

• Our business is fundamental physics and how our universe works

Page 3: CERN User Story


Answering fundamental questions…

• How to explain that particles have mass? We have theories but need experimental evidence.

• What is 96% of the universe made of? We can only see 4% of its estimated mass!

• Why isn’t there anti-matter in the universe? Nature should be symmetric…

• What was the state of matter just after the “Big Bang”? Travelling back to the earliest instants of the universe would help…

Page 4: CERN User Story

Community collaboration on an international scale


Page 5: CERN User Story

The Large Hadron Collider

Page 6: CERN User Story


Page 7: CERN User Story

LHC construction


Page 8: CERN User Story

The Large Hadron Collider (LHC) tunnel

Page 9: CERN User Story


Page 10: CERN User Story

Accumulating events in 2009-2011


Page 11: CERN User Story


Page 12: CERN User Story

Heavy Ion Collisions


Page 13: CERN User Story


Page 14: CERN User Story


• Tier-0 (CERN): data recording, initial data reconstruction, data distribution
• Tier-1 (11 centres): permanent storage, re-processing, analysis
• Tier-2 (~200 centres): simulation, end-user analysis

• Data is recorded at CERN and the Tier-1s and analysed in the Worldwide LHC Computing Grid
• In a normal day, the grid provides 100,000 CPU-days of processing, executing 1 million jobs
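As a rough cross-check (my own arithmetic, not a figure from the talk), those daily totals imply an average job length of a few CPU-hours:

$$\frac{100{,}000\ \text{CPU-days}}{1{,}000{,}000\ \text{jobs}} = 0.1\ \text{CPU-days per job} \approx 2.4\ \text{CPU-hours per job}$$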

Page 15: CERN User Story


• Data Centre by Numbers
– Hardware installation & retirement: ~7,000 hardware movements/year; ~1,800 disk failures/year
– Processor mix: Xeon L5520 33%, Xeon E5410 16%, Xeon E5345 14%, Xeon 5160 10%, Xeon L5420 8%, Xeon E5335 7%, Xeon E5405 6%, Xeon 3 GHz 4%, Xeon 5150 2%
– Disk manufacturer mix: Western Digital 59%, Hitachi 23%, Seagate 15%, Fujitsu 3%, HP 0%, Maxtor 0%, Other 0%

High Speed Routers (640 Mbps → 2.4 Tbps): 24
Ethernet Switches: 350
10 Gbps ports: 2,000
Switching Capacity: 4.8 Tbps
1 Gbps ports: 16,939
10 Gbps ports: 558
Racks: 828
Servers: 11,728
Processors: 15,694
Cores: 64,238
HEPSpec06: 482,507
Disks: 64,109
Raw disk capacity (TiB): 63,289
Memory modules: 56,014
Memory capacity (TiB): 158
RAID controllers: 3,749
Tape Drives: 160
Tape Cartridges: 45,000
Tape slots: 56,000
Tape Capacity (TiB): 34,000
IT Power Consumption: 2,456 kW
Total Power Consumption: 3,890 kW
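A few derived figures help put that table in perspective. This is a small sketch of my own arithmetic over the numbers above, not metrics quoted in the talk:

```python
# Derived metrics from the "Data Centre by Numbers" table above.
# All inputs are the slide's figures; the ratios are my own arithmetic.
servers = 11728
cores = 64238
disks = 64109
raw_disk_tib = 63289
memory_tib = 158
it_power_kw = 2456
total_power_kw = 3890

print("cores per server:      %.1f" % (cores / servers))                  # ~5.5
print("raw capacity per disk: %.2f TiB" % (raw_disk_tib / disks))         # ~1 TiB
print("memory per server:     %.1f GiB" % (memory_tib * 1024 / servers))  # ~14 GiB
print("implied PUE:           %.2f" % (total_power_kw / it_power_kw))     # ~1.6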

Page 16: CERN User Story

Our Environment

• Our users
– Experiments build on top of our infrastructure and services to deliver application frameworks for the 10,000 physicists

• Our custom user applications split into
– Raw data processing from the accelerator and export to the Worldwide LHC Computing Grid
– Analysis of physics data
– Simulation

• We also have standard large-organisation applications
– Payroll, Web, Mail, HR, …


Page 17: CERN User Story

Our Infrastructure

• Hardware is generally based on commodity, white-box servers
– Open tendering process based on SPECint/CHF, CHF/Watt and GB/CHF (see the toy sketch after this list)
– Compute nodes are typically dual-processor with 2 GB of memory per core
– Bulk storage on 24 × 2 TB disk storage-in-a-box servers with a RAID card

• The vast majority of servers run Scientific Linux, developed by Fermilab and CERN and based on Red Hat Enterprise Linux
– The focus is on stability in view of the number of centres on the WLCG
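To make the tendering metrics concrete, here is a toy sketch that simply computes the three ratios named above for some entirely made-up offers. The offer names and numbers are hypothetical, not CERN tender data:

```python
# Hypothetical tender offers; every value here is invented for illustration.
offers = [
    {"name": "offer A", "specint": 280, "price_chf": 2100, "watts": 250, "disk_gb": 2000},
    {"name": "offer B", "specint": 310, "price_chf": 2600, "watts": 310, "disk_gb": 4000},
]

for o in offers:
    print("%s: SPECint/CHF=%.3f  CHF/Watt=%.2f  GB/CHF=%.2f" % (
        o["name"],
        o["specint"] / o["price_chf"],   # compute performance per franc
        o["price_chf"] / o["watts"],     # purchase cost per watt of power draw
        o["disk_gb"] / o["price_chf"],   # storage per franc
    ))
```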


Page 18: CERN User Story

Our Challenges – Compute

• Optimise CPU resources
– Maximise production lifetime of servers
– Schedule interventions such as hardware repairs and OS patching
– Match memory and core requirements per job (see the toy sketch after this list)
– Reduce CPUs waiting idle for I/O

• Conflicting software requirements
– Different experiments want different libraries
– Maintenance of old programs needs old OSes
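Matching memory and core requirements per job is essentially a bin-packing problem. The toy first-fit sketch below is not CERN's batch system, and the node and job sizes are made up; it just shows the basic idea:

```python
# Toy first-fit placement: put each job on the first node with enough free
# cores and memory. Nodes and jobs are hypothetical examples.
nodes = [
    {"name": "node1", "free_cores": 4, "free_mem_gb": 8},
    {"name": "node2", "free_cores": 8, "free_mem_gb": 16},
]
jobs = [
    {"id": "job-a", "cores": 1, "mem_gb": 2},
    {"id": "job-b", "cores": 6, "mem_gb": 12},
]

def place(job):
    for n in nodes:
        if n["free_cores"] >= job["cores"] and n["free_mem_gb"] >= job["mem_gb"]:
            n["free_cores"] -= job["cores"]
            n["free_mem_gb"] -= job["mem_gb"]
            return n["name"]
    return None  # job has to wait for resources

for j in jobs:
    print(j["id"], "->", place(j))
```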


Page 19: CERN User Story

Our Challenges – variable demand


Page 20: CERN User Story

Our Challenges - Data storage


• 25 PB/year to record
• >20 years retention
• 6 GB/s average
• 25 GB/s peaks
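For scale, a quick back-of-envelope calculation on the figures above (my arithmetic, not numbers from the talk; the 6 GB/s average presumably covers all I/O, not only newly recorded data):

```python
# Back-of-envelope numbers implied by the storage figures above.
pb_per_year = 25
retention_years = 20
seconds_per_year = 365.25 * 24 * 3600

archive_pb = pb_per_year * retention_years            # data kept over the retention period
new_data_gbs = pb_per_year * 1e6 / seconds_per_year   # 1 PB = 1e6 GB

print("archive after 20 years: %d PB" % archive_pb)         # 500 PB
print("new data, averaged:     %.2f GB/s" % new_data_gbs)   # ~0.8 GB/s
print("peak vs average I/O:    %.1fx" % (25.0 / 6.0))       # ~4.2x
```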

Page 21: CERN User Story


Page 22: CERN User Story

Our Challenges – ‘minor’ other issues

• Power
– Living within a fixed envelope of 2.9 MW available for the computer centre

• Cooling
– Only 6 kW/m² without using water-cooled racks (and no spare power)

• Space
– New capacity replaces old servers in the same racks (as density is low)

• Staff
– CERN staff headcount is fixed

• Budget
– CERN IT budget reflects member states’ contributions


Page 23: CERN User Story

Server Consolidation


[Chart: number of virtual machines in the consolidation service from April 2010 to April 2011, broken down into Windows, Other Linux and Scientific Linux guests; y-axis 0 to 1,800.]

Page 24: CERN User Story

Batch Virtualisation


Page 25: CERN User Story

Infrastructure as a Service Studies

• CERN has been using virtualisation on a small scale since 2007
– Server consolidation with Microsoft System Center Virtual Machine Manager and Hyper-V
– Virtual batch compute farm using OpenNebula and Platform ISF on KVM

• We are investigating moving to a cloud service provider model for infrastructure at CERN
– Virtualisation consolidation across multiple sites
– Bulk storage / Dropbox / …
– Self-service

• Aims
– Improve efficiency
– Reduce operations effort
– Ease remote data centre support
– Enable cloud APIs


Page 26: CERN User Story

OpenStack Infrastructure as a Service Studies

• Current Focus
– Converge the current virtualisation services into a single IaaS
– Test Swift for bulk storage, compatibility with S3 tools and resilience on commodity hardware (a minimal smoke-test sketch follows this list)
– Integrate OpenStack with CERN’s infrastructure such as LDAP and network databases

• Status
– Swift testbed (480 TB) is being migrated to Diablo and expanded to 1 PB with 10 GbE networking
– 48 hypervisors running RHEL/KVM/Nova under test
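For context, a smoke test against such a Swift testbed could look roughly like the sketch below. This is not CERN's test harness: it assumes the python-swiftclient library and TempAuth-style v1.0 credentials, and the endpoint, account and key are placeholders.

```python
# Minimal Swift round-trip check (hypothetical endpoint and credentials).
from swiftclient import client

conn = client.Connection(
    authurl="http://swift-proxy.example.org:8080/auth/v1.0",  # placeholder proxy URL
    user="test:tester",                                       # placeholder account:user
    key="testing",                                            # placeholder key
)

conn.put_container("bulk-test")
conn.put_object("bulk-test", "hello.txt", contents=b"hello from the testbed")

headers, body = conn.get_object("bulk-test", "hello.txt")
assert body == b"hello from the testbed"
print("round-trip OK, etag =", headers.get("etag"))
```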


Page 27: CERN User Story

Areas where we struggled

• Networking configuration with Cactus
– Trying out the new Network-as-a-Service Quantum functions in Diablo

• Red Hat distribution base
– RPMs not yet in EPEL, but Grid Dynamics RPMs helped
– Puppet manifests needed adapting, drawing on multiple sources from OpenStack and Puppetlabs

• Currently only testing with KVM
– We’ll try Hyper-V once Diablo/Hyper-V support is fully in place


Page 28: CERN User Story

OpenStack investigations : next steps

• Homogeneous servers for both storage and batch?


[Pie chart: Batch 40%, Mass Storage 25%, Other 18%, Win Services 6%, VO Services 5%, Databases 4%, Grid Services 2%]

Page 29: CERN User Story

OpenStack investigations : next steps

• Scale testing with CERN’s toolchains to install and schedule 16,000 VMs


Previous test results performed with OpenNebula

Page 30: CERN User Story

OpenStack investigations : next steps

• Investigate the commodity solutions for external volume storage
– Ceph
– Sheepdog
– Gluster
– ...

• Focus is on
– Reducing the performance impact of I/O with virtualisation
– Enabling widespread use of live migration (see the sketch after this list)
– Understanding the future storage classes and service definitions
– Supporting remote data centre use cases
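On the live-migration point, shared volume storage is what makes moving running guests between hypervisors cheap. As a rough illustration (not CERN's tooling; hostnames and the guest name are placeholders), a KVM live migration via the libvirt Python bindings looks roughly like this:

```python
# Hedged sketch: live-migrate a KVM guest between two hypervisors that can
# both reach the guest's volume storage. Names below are placeholders.
import libvirt

src = libvirt.open("qemu+ssh://hypervisor01.example.org/system")
dst = libvirt.open("qemu+ssh://hypervisor02.example.org/system")

dom = src.lookupByName("batch-vm-0042")                    # placeholder guest name
dom.migrate(dst, libvirt.VIR_MIGRATE_LIVE, None, None, 0)  # keep the guest running while moving

print("guest is now running on", dst.getHostname())
```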


Page 31: CERN User Story

Areas of interest looking forward

• Nova and Glance
– Scheduling VMs near to the data they need
– Managing the queue of requests when there is “no credit card” and no free resources
– Orchestration of bare-metal servers within OpenStack

• Swift
– High-performance transfers through the proxies without encryption
– Long-term archiving to low-power disks or tape

• General
– Filling in the missing functions such as billing, availability and performance monitoring


Page 32: CERN User Story

Final Thoughts


• A small project to share documents at CERN in the ’90s created the massive phenomenon that is today’s World Wide Web
• Open source
• Transparent governance
• Basis for innovation and competition
• Standard APIs where there is consensus
• Stable, production-ready solutions
• Vibrant eco-system

• There is a strong need for a similar solution in the Infrastructure-as-a-Service space

• The next year is going to be exciting for OpenStack as the project matures and faces the challenges of production deployments

Page 33: CERN User Story

References


CERN: http://public.web.cern.ch/public/
Scientific Linux: http://www.scientificlinux.org/
Silent data corruption study: http://cern.ch/go/G7vL
HEPiX Working Group on virtualization: http://w3.hepix.org/virtualization/
Worldwide LHC Computing Grid: http://lcg.web.cern.ch/lcg/ and http://rtm.hep.ph.ic.ac.uk/
Jobs at CERN: http://cern.ch/jobs

Page 34: CERN User Story

Backup Slides


Page 35: CERN User Story

CERN’s tools

• The world’s most powerful accelerator: the LHC
– A 27 km long tunnel filled with high-tech instruments
– Equipped with thousands of superconducting magnets
– Accelerates particles to energies never before obtained
– Produces particle collisions creating microscopic “big bangs”

• Very large, sophisticated detectors
– Four experiments, each the size of a cathedral
– A hundred million measurement channels each
– Data acquisition systems treating Petabytes per second

• Top-level computing to distribute and analyse the data
– A Computing Grid linking ~200 computer centres around the globe
– Sufficient computing power and storage to handle 25 Petabytes per year, making the data available to thousands of physicists for analysis

Page 36: CERN User Story

Other non-LHC experiments at CERN


Page 37: CERN User Story

Superconducting magnets – October 2008


A faulty connection between two superconducting magnets led to the release of a large amount of helium into the LHC tunnel and forced the machine to shut down for repairs

Page 38: CERN User Story

CERN Computer Centre


Page 39: CERN User Story

Our Challenges – keeping up to date


Page 40: CERN User Story

CPU capacity at CERN during ‘80s and ‘90s


[Chart: CPU capacity at CERN, plotted per week (yyyyww) from 1987 to 2000; y-axis (CPU capacity) from 0 to 50,000; series for installed capacity and usage; annotated with “LEP starts”.]

Page 41: CERN User Story

Testbed Configuration for Nova / Swift

• 24 servers
• Single server configuration for both compute and storage
• Supermicro-based systems
– Intel Xeon L5520 CPU @ 2.27 GHz
– 12 GB memory
– 10 GbE connectivity
– IPMI


Page 42: CERN User Story

Data Rates at Tier-0


Typical Tier-0 bandwidth:
• Average in: 2 GB/s, with peaks at 11.5 GB/s
• Average out: 6 GB/s, with peaks at 25 GB/s
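For a sense of scale (my own arithmetic, not figures from the slide), the average rates correspond to roughly:

$$2\ \text{GB/s} \times 86{,}400\ \text{s/day} \approx 173\ \text{TB/day in}, \qquad 6\ \text{GB/s} \times 86{,}400\ \text{s/day} \approx 518\ \text{TB/day out}$$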

Page 43: CERN User Story

Web Site Activity


[Chart: CERN websites access statistics, number of hits from November 2007 to early 2011, y-axis 0 to 3 billion. Annotations: LHC first beam day (9 September 2008), 100 million hits to the main CERN websites and 300 million hits in total; LHC first collisions (25 March 2010), 50 million hits to the main CERN websites.]