The CMS Online Cluster: Setup, Operation and Maintenance of an Evolving Cluster
J.A. Coarasa, CERN, Geneva, Switzerland, for the CMS TriDAS group
ISGC 2012, 26 February - 2 March 2012, Academia Sinica, Taipei, Taiwan


DESCRIPTION

The CMS online cluster consists of more than 2700 computers, mostly running Scientific Linux CERN. They run the 15000 application instances responsible for data acquisition and experiment control on a private network. The high availability of the network and services, and the independence from external networks, allow operation around the clock. After successful tests, virtualization is being deployed to further enhance high availability while making servicing even easier. Due to the ever-increasing luminosity delivered to CMS by the LHC, the cluster size and the software running on it have been evolving to meet the increased performance demands. In the last year alone, the processing power of the High Level Trigger farm was increased by 50% without disruption to ongoing operations, and further growth is foreseen. At the same time, large updates of the running software happen once every two weeks, with smaller updates occurring all the time because of the many developers of the different subsystems. The configuration management infrastructure, based on quattor, has been instrumented accordingly to be flexible and easy to use by the software librarians while remaining performant and robust. Large parts of the cluster can be reconfigured, and failing computers reinstalled, in only a few minutes. The monitoring infrastructure is being revamped to increase performance and to allow fine-grained, user-configurable notification, so that the relevant experts receive problem notifications directly and on demand. Details will be given on the adopted solutions, which include the following topics: implementation of the redundant and load-balanced network and core IT services; the deployment and configuration management infrastructure and its customization; the new monitoring infrastructure; virtualization techniques for redundant services… Special emphasis will be put on the scalable approach, which allows the cluster to grow with no additional administration overhead. Finally, the lessons learnt from two years of running will be presented, together with the prospects for short- and long-term upgrades and the new technologies now in the pipeline.

TRANSCRIPT

Page 1: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

The CMS Online Cluster: Setup, Operation and Maintenance of an Evolving Cluster
J.A. Coarasa
CERN, Geneva, Switzerland
for the CMS TriDAS group
ISGC 2012, 26 February - 2 March 2012, Academia Sinica, Taipei, Taiwan

Page 2: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Outline
•  Introduction
•  On/Off-line Computing Model
•  The Compact Muon Solenoid (CMS) Data Acquisition system (DAQ)
•  The CMS Online Cluster
   –  IT Infrastructure
      •  Computing
      •  Networking
      •  Services
   –  Operation

J.A. Coarasa 2

Page 3: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Introduction
–  The CMS Data Acquisition system (DAQ)

J.A. Coarasa 3

Page 4: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

On/Off-line computing
•  Collision rate: ~1 GHz.
•  Level-1 (latency ~3 µs): massive parallel processing; particle identification (high pT e, µ, jets, missing ET); hardwired custom systems (ASIC, FPGA), synchronous and clock driven. First-level output rate: 100 kHz.
•  Readout: Data to Surface and Event Builder; 2 Tb/s optical data links and 2 Tb/s switch networks.
•  High Level Triggers (latency ~ s): physics process identification; clusters of PCs, asynchronous and event driven. Rate on tape: 100 Hz.
•  Distributed computing GRID (Tiers 0-4): analysis, production and archive.

Page 5: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

The CMS Data Acquisition system (DAQ). The challenge
Large data volumes (~100 GBytes/s data flow, 20 TB/day):
–  after the Level 1 Trigger, ~100 GBytes/s (rate ~O(100) kHz) reach the event building (2 stages, ~1600 computers);
–  the HLT filter cluster selects ~1 event out of 1000; max. rate to tape: ~O(100) Hz;
⇒  the storage manager (which stores and forwards) can sustain 2 GB/s of traffic;
⇒  up to ~300 MBytes/s sustained are forwarded to the CERN T0 (>20 TB/day).
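Just to make the orders of magnitude explicit, a quick arithmetic check on the figures quoted above (illustrative only, not additional information from the talk):

```python
# Back-of-the-envelope check of the throughput figures quoted above.
# Numbers are the approximate ones from the slides.
l1_rate_hz = 100e3          # events/s after the Level 1 Trigger
event_size_bytes = 1e6      # ~1 MB average event size (see the detector table in the backup slides)
hlt_rejection = 1000        # the HLT keeps roughly 1 event in 1000

daq_throughput = l1_rate_hz * event_size_bytes      # ~1e11 B/s, i.e. ~100 GB/s into event building
tape_rate_hz = l1_rate_hz / hlt_rejection           # ~100 Hz to tape
tier0_throughput = tape_rate_hz * event_size_bytes  # ~1e8 B/s, i.e. ~100 MB/s order of magnitude to Tier 0

print(f"Event building input: ~{daq_throughput / 1e9:.0f} GB/s")
print(f"Rate to tape: ~{tape_rate_hz:.0f} Hz")
print(f"Sustained transfer to Tier 0: ~{tier0_throughput / 1e6:.0f} MB/s")
```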

[DAQ architecture diagram: Detector Front-end, Readout Systems, Event Manager and Builder Networks, Builder and Filter Systems, Computing Services, with the Level 1 Trigger and Run Control; rates 40 MHz → 100 kHz → 100 Hz.]

J.A. Coarasa 5

Page 6: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

The CMS Online Cluster
–  IT Infrastructure

J.A. Coarasa 6

Page 7: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

IT Infrastructure. The purpose and qualities
The IT infrastructure (computing, networking and services) of the CMS Online Cluster takes care of the CMS data acquisition and experiment control.
•  Autonomous (i.e. independent from all other networks), providing uninterrupted 24/7 operation in two physical locations ~200 m apart, with two control rooms;
•  Redundant services design:
   –  losing 1 min of data wastes accelerator time (worth ~O(1000) CHF/min);
   –  the system serves online data taking: if data is not taken, it is gone!
•  Scalable services design to accommodate expansions;
•  Fast configuration turnaround to cope with the evolving nature of the DAQ applications and the large scale of the cluster;
•  Serving the needs of a community of more than 900 users.

J.A. Coarasa 7

Page 8: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Computing. Variety of roles
More than 2700 computers, mostly under Scientific Linux CERN 5:
–  640 (2-core) as the 1st stage of event building, each equipped with 2 Myrinet and 3 independent 1 Gbit Ethernet links for data networking (1280 cores);
–  1008 high level trigger computers (720 8-core + 288 12-core with HT), each with 2 Gbit Ethernet links for data networking (9216 cores);
–  16 (2-core) with access to 300 TBytes of FC storage, 4 Gbit Ethernet links for data networking and 2 additional ones for networking to Tier 0;

J.A. Coarasa 8


Page 9: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Computing. Variety of roles
More than 2700 computers, mostly under Scientific Linux CERN 5:
–  640 (2-core) as the 1st stage of event building, each equipped with 2 Myrinet and 3 independent 1 Gbit Ethernet links for data networking (1280 cores);
–  1008 high level trigger computers (720 8-core + 288 12-core with HT), each with 2 Gbit Ethernet links for data networking (9216 cores);
–  16 (2-core) with access to 300 TBytes of FC storage, 4 Gbit Ethernet links for data networking and 2 additional ones for networking to Tier 0;
–  more than 400 used by the subdetectors;
–  90 running Windows for the Detector Control Systems;
–  12 computers as an ORACLE RAC;
–  12 computers as CMS control computers;
–  50 computers as desktop computers in the control rooms;
–  200 computers for commissioning, integration and testing;
–  15 computers as infrastructure and access servers;
–  250 active spare computers.
⇒  Many different roles!

J.A. Coarasa 9


Page 10: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Networking. Overall Picture
CMS Networks:
–  Private Networks:
   •  Service Network (~3000 1 Gbit ports);
   •  Data Network (~4000 1 Gbit ports):
      –  source routing on the computers;
      –  VLANs on the switches;
   •  Central Data Recording (CDR) network to Tier 0;
   •  private networks for the Oracle RAC;
   •  private networks for the subdetectors;
–  Public CERN Network.
J.A. Coarasa 10

[Network diagram: readout, HLT, control and Storage Manager computers on the Service, Data and CDR networks at the CMS sites; computer gateways and a firewall separate them from the CERN network and the internet.]

Page 11: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

The Network Attached Storage
•  Our important data is hosted on a clustered Network Attached Storage:
   –  user home directories;
   –  CMS data: calibration, data quality monitoring…;
   –  admin data.
•  2 NetApp filer heads in a failover configuration (in two racks);
•  6 storage shelves (2 of them mirrored) with internal dual-parity RAID;
•  snapshot feature active (saves us from going to backup);
•  deduplication active (saves ~25-55%);
•  tested throughput > 380 MBytes/s.

J.A. Coarasa 11

           

[Diagram: NAS heads and shelves attached to the redundant routers via 4 Gbit and 2x10 Gbit links.]

Page 12: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

IT Structural Services. Redundancy and Load Balancing. The Concept
•  Pattern followed:
   –  1 master + N slaves/replicas (now N=3 for most services), hosted in different racks:
      ⇒  easy scalability;
      ⇒  needs replication for all services.
   –  Services working under a DNS alias where possible:
      ⇒  allows moving the service;
      ⇒  no service outage.
   –  Load balancing of the primary server for the clients (see the sketch below):
      •  DNS Round Robin;
      •  explicit client configuration, segregating the clients into groups of computers.
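To illustrate how a client consumes a service published under a DNS alias with Round Robin load balancing, here is a minimal sketch; the alias name and port are invented and this is not CMS code:

```python
# Minimal sketch: resolve a service alias that is served by several replicas
# (DNS Round Robin) and fall back to the next address if one replica is down.
# The alias "ldap.cms.example" and the port are hypothetical.
import socket

def connect_to_service(alias="ldap.cms.example", port=389, timeout=3):
    # getaddrinfo returns every A record behind the alias; DNS Round Robin
    # already rotates their order between lookups, so trying them in order
    # spreads the load and survives a dead replica.
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
            alias, port, socket.AF_INET, socket.SOCK_STREAM):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock                      # first replica that answers wins
        except OSError:
            continue                         # try the next replica
    raise RuntimeError(f"no replica behind {alias} is reachable")
```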

J.A. Coarasa 12


Page 13: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

IT Structural Services. Replication and Load Balancing. Basic Services
•  Critical data is local to the servers (hosted on RAID 1 protected servers);
•  Manual load balancing through segregation of the clients into groups (see the table and the sketch below).

J.A. Coarasa 13

Service              Replication through   Load balancing through
DNS                  named                 Round Robin
DHCP                 in-house scripts      No. First who answers
Kerberos             in-house scripts      explicit segregation
LDAP                 slurpd                explicit segregation
NTP                  -                     explicit segregation
syslog               -                     No. Single server, used for stress test purposes
Nagios monitoring    -                     explicit segregation
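A minimal sketch of what "explicit segregation" can look like on the client side; the replica names and the hash-based mapping are invented for illustration and are not the CMS in-house scripts:

```python
# Minimal sketch: deterministically assign each client to one server of a
# replicated service, so configuration files can point different groups of
# computers at different replicas. Host names and the replica list are made up.
import hashlib
import socket

REPLICAS = ["kdc1.cms.example", "kdc2.cms.example", "kdc3.cms.example"]

def preferred_replicas(hostname=None, replicas=REPLICAS):
    """Return the replica list with this host's preferred server first."""
    hostname = hostname or socket.gethostname()
    digest = hashlib.md5(hostname.encode()).hexdigest()
    index = int(digest, 16) % len(replicas)
    # Preferred replica first, the rest kept as an ordered fallback list.
    return replicas[index:] + replicas[:index]

if __name__ == "__main__":
    print(preferred_replicas("bu-c2d35-01"))   # hypothetical node name
```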

Page 14: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

CMS Management and Configuration Infrastructure: The tools
•  IPMI (Intelligent Platform Management Interface) is used to manage the computers remotely: reboot, console access, …;
•  PXE and anaconda kickstart over http are used as the bootstrap installation method;
•  Quattor (QUattor is an Administration ToolkiT for Optimizing Resources) is used as the configuration management system:
   ⇒  all Linux computers are configured through it, or through rpms distributed with it (even the Quattor servers themselves): BIOS, all networking parameters, …
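As an illustration of the kind of remote management IPMI enables, a node could be power-cycled with a small wrapper around the standard ipmitool client; the BMC naming convention and the credential handling below are assumptions, not the CMS tooling:

```python
# Minimal sketch: remotely power-cycle a node through its IPMI/BMC interface
# using the standard ipmitool client. The BMC naming convention, user name and
# the way credentials are stored are hypothetical.
import subprocess

def power_cycle(node, user="admin", password_file="/etc/ipmi.secret"):
    with open(password_file) as f:
        password = f.read().strip()
    bmc = f"{node}-ipmi"            # assumed convention: BMC reachable as <node>-ipmi
    cmd = ["ipmitool", "-I", "lanplus", "-H", bmc,
           "-U", user, "-P", password, "chassis", "power", "cycle"]
    subprocess.run(cmd, check=True)

# power_cycle("bu-c2d35-01")       # hypothetical node name
```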

J.A. Coarasa 14

Page 15: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

CMS Management and Configuration Infrastructure: Quattor implementation
•  Based on Quattor 1.3 (cdb 2.0.4, swrep 2.1.38, PANC 6.0.8).
•  Manages ~2500 installed computers of ~90 types:
   ⇒  one computer fully reinstalled in 9-30 min;
   ⇒  a big (~GByte) cluster-wide change in less than 25 min; a small change in only a few minutes.
•  Uses in-house conventions:
   –  a restricted template format: "hierarchical" plus other conventions;
   –  areas to define the subdetector software and its versioning;
   ⇒  this allowed easy in-house developments (see the sketch below):
      •  template summarizer / "inventory maker";
      •  dropbox for rpms;
      •  template updater.
J.A. Coarasa 15
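A minimal sketch of what a "template updater" can look like, assuming templates that pin rpm versions with pkg_repl(...) statements; the package name, paths and template layout are invented and may differ from the CMS conventions:

```python
# Minimal sketch of a "template updater": bump the pinned version of an rpm in
# Quattor/PAN templates. It assumes packages are pinned with lines such as
#   "/software/packages" = pkg_repl("daq-xdaq", "11.2.0-1", "x86_64");
# which only illustrates the idea, not the actual CMS template format.
import re
from pathlib import Path

def bump_package(template_dir, package, new_version):
    pattern = re.compile(
        r'(pkg_repl\(\s*"' + re.escape(package) + r'"\s*,\s*")([^"]+)(")')
    for tpl in Path(template_dir).rglob("*.tpl"):
        text = tpl.read_text()
        new_text, n = pattern.subn(r"\g<1>" + new_version + r"\g<3>", text)
        if n:
            tpl.write_text(new_text)
            print(f"{tpl}: {n} occurrence(s) updated to {new_version}")

# bump_package("/var/quattor/templates", "daq-xdaq", "11.3.0-1")  # hypothetical
```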

Page 16: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Monitoring infrastructure
•  We monitor 2500 hosts and 25000 services:
   –  the usual things (ping, ssh, disk usage, …);
   –  in-house checks (see the sketch below), e.g.:
      •  are the electronics modules working?
      •  are the Myrinet links working?
      •  did the Quattor spma run finish properly?
•  Based on Nagios: 1+3 servers with a manually split configuration.
•  Performance:
   –  ~1 minute latency for the ping test;
   –  more than 5 minutes for the rest of the services.
   ⇒  Needed improving; the new version of the monitoring is about to move to production.
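The in-house checks plug into Nagios through its standard plugin contract: print a status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch of such a check, with a hypothetical status file standing in for the real spma bookkeeping:

```python
#!/usr/bin/env python
# Minimal Nagios-style plugin sketch: report whether the last Quattor/spma run
# finished properly by looking at a status file. The path and file format are
# hypothetical; only the exit-code contract (0=OK, 1=WARNING, 2=CRITICAL,
# 3=UNKNOWN) is the standard Nagios plugin interface.
import sys
import time

STATUS_FILE = "/var/log/spma-status"     # assumed to hold "OK <epoch>" or "FAILED <epoch>"
MAX_AGE = 24 * 3600                      # warn if the last run is older than a day

try:
    state, epoch = open(STATUS_FILE).read().split()
    age = time.time() - float(epoch)
except (OSError, ValueError):
    print("SPMA UNKNOWN - cannot read status file")
    sys.exit(3)

if state != "OK":
    print(f"SPMA CRITICAL - last run failed {age / 3600:.1f}h ago")
    sys.exit(2)
if age > MAX_AGE:
    print(f"SPMA WARNING - last successful run {age / 3600:.1f}h ago")
    sys.exit(1)
print(f"SPMA OK - last run {age / 3600:.1f}h ago")
sys.exit(0)
```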

J.A. Coarasa 16

Page 17: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

New Monitoring infrastructure
•  Addresses two issues: easy configuration and scalability.
•  Based on icinga (1 server + N workers):
   –  mod_gearman: allows splitting the functionality (master/worker);
   –  check_multi: groups checks and can cascade them;
   –  PNP4Nagios: provides performance data;
   –  rrdcached: caches files and improves performance.
•  We monitor more services, grouped into fewer checks, and get historical performance data.
•  Performance: ~1 minute latency, including performance data.
•  A system is being added to notify experts on a fine-grained, self-enrollment basis (see the sketch below).
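The self-enrollment idea can be pictured as a subscription filter sitting between the monitoring core and the notification gateway. The data model and patterns below are invented purely to illustrate the concept, not the CMS implementation:

```python
# Minimal sketch of fine-grained, self-enrolled notifications: experts subscribe
# to (hostgroup, service) patterns and only matching alerts are forwarded to
# them. Addresses, patterns and alert fields are invented for this illustration.
import fnmatch

SUBSCRIPTIONS = {
    "daq-expert@example.cern.ch": [("hlt-*", "myrinet*")],
    "dcs-expert@example.cern.ch": [("dcs-*", "*")],
    "sysadmin@example.cern.ch":   [("*", "disk*"), ("*", "spma*")],
}

def recipients(alert):
    """Return the experts who enrolled for this host/service combination."""
    out = []
    for address, patterns in SUBSCRIPTIONS.items():
        if any(fnmatch.fnmatch(alert["host"], h) and
               fnmatch.fnmatch(alert["service"], s) for h, s in patterns):
            out.append(address)
    return out

print(recipients({"host": "hlt-c2e33-12", "service": "myrinet link"}))
```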

J.A. Coarasa 17

Page 18: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Databases
•  Store all the bookkeeping data:
   –  configuration of the detectors;
   –  conditions for the experimental data taken, …
•  Oracle 11.2.0.3 (just migrated, still in compatibility mode with 10.2.0.5):
   –  hosting two databases at CMS, totalling ~30 TBytes (more, including the replicated/synced ones at CERN);
   –  hardware:
      •  6 blades + standby nodes;
      •  a NetApp filer as the storage backend;
      •  completely redundant at the hardware level.

J.A. Coarasa 18

Page 19: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Hardware Tracking system
Our Hardware Tracking system (HTS) lets us know everything about the systems (software and hardware). It is based on:
•  OCS (www.ocsinventory-ng.org):
   –  inventory system;
   –  counts everything related to a machine.
•  GLPI (www.glpi-project.org):
   –  tracking system;
   –  works together with OCS;
   –  a set of tools to maintain the machines' history.

J.A. Coarasa 19

Page 20: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

The CMS Online Cluster
–  Operation

J.A. Coarasa 20

Page 21: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Constraints and implications
•  Manpower (~6 FTE):
   ⇒  requires effort on scripting and automation (see the sketch below):
      •  computers monitor their own temperature and shut down when it is too high;
      •  scripts check and fix service problems;
      ⇒  allows faster recovery after an unexpected power cut.
   ⇒  The team acts only as second-level support in the 24/7 operation.
   ⇒  Interaction with users goes through a ticketing system (savannah) where possible.
•  The cluster is 5 years old and still growing, adding resources as needed:
   –  some hardware is out of warranty;
   –  some hardware has an extended warranty.
•  Computers connected to the electronics need swift replacement (1 h) if broken:
   ⇒  standby spares are needed on site;
   ⇒  fast turnaround in reinstallation or reconfiguration.
•  Other computers are dealt with by fault-tolerant software and reconfiguration; replacing them the next day is enough.
J.A. Coarasa 21
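A minimal sketch of the temperature self-protection mentioned above; the sensor path, the threshold and the direct call to poweroff are assumptions for illustration, not the actual CMS script:

```python
# Minimal sketch of the "shut down when too hot" idea: read the CPU temperature
# from the kernel's hwmon interface and power the node off above a threshold.
# Sensor path, threshold and the use of /sbin/poweroff are assumptions.
import glob
import subprocess

THRESHOLD_C = 90.0

def max_cpu_temperature():
    temps = []
    for path in glob.glob("/sys/class/hwmon/hwmon*/temp*_input"):
        with open(path) as f:
            temps.append(int(f.read().strip()) / 1000.0)   # millidegrees -> degrees C
    return max(temps) if temps else None

def check_and_shutdown():
    temp = max_cpu_temperature()
    if temp is not None and temp > THRESHOLD_C:
        subprocess.run(["logger", f"temperature {temp:.1f}C above {THRESHOLD_C}C, powering off"])
        subprocess.run(["/sbin/poweroff"])

if __name__ == "__main__":
    check_and_shutdown()        # typically invoked periodically, e.g. from cron
```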

Page 22: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Operation. Emphasis on continuous running
•  Learning from previous failures.
•  Preventing service failures before they happen:
   –  redundancy where possible:
      •  full redundancy added to the new database;
      •  redundancy added to the control system (DCS);
      •  virtualization explored to add redundancy to other services;
   –  preventive campaigns carried out:
      •  battery replacement on old computers;
      •  replacement of aging SAS controllers (still under warranty; the HTS proved priceless);
      •  clean-up/reinstallation of the whole cluster at the beginning of 2011;
      •  selective clean-up at the beginning of 2012.
•  Identifying failures before they affect the running:
   –  tests carried out on the redundancy of the services;
   –  improved monitoring system;
   –  scripts reporting problems before they have an effect (e.g. the SAS controllers).
•  Testing before deployment:
   –  2 testing and integration areas, with 2 more being built.
J.A. Coarasa 22

Page 23: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Operation. Large-Scale Effect on Failures
Failures happen! For the last year of operation (2011):
•  Unscheduled power cuts: ~1 every 2 months:
   –  by far the worst to recover from; they also affect the detector electronics.
•  Unscheduled cooling cuts: ~1 every few months.
•  Network switch failures: ~1 every 3 months:
   –  a database failure;
   –  a failure of the redundancy of kerberos and ldap (on the way to being solved);
   –  affected the storage manager (still running with reduced bandwidth).
•  Computer failures: ~3 every week:
   –  the SAS controller aging problem was responsible for most of them until the exchange.
J.A. Coarasa 23

Page 24: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Summary
•  A scalable cluster of 2700 computers, growing to 3000 in the next months,
•  autonomous,
•  with service redundancy,
•  capable of more than 100 GB/s data throughput,
•  has been working successfully since 2007.

J.A. Coarasa 24

Page 25: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Thank you. Questions?
J.A. Coarasa 25

Page 26: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Backup slides

J.A. Coarasa 26

Page 27: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

CMS Networks Configuration
•  Service Network (.cms):
   –  service IPs:
      •  DHCP (fixed for the central servers);
   –  IPMI IPs:
      •  fixed at boot time from the DNS name (an in-house development that sets the IPMI and BIOS parameters).
•  Data and Central Data Recording Networks:
   –  IPs and routing rules:
      •  fixed at boot time from DNS (A and TXT records) by an in-house development, auto_ifconfig (see the sketch below).

J.A. Coarasa 27

Page 28: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

CMS networks

J.A. Coarasa 28

[Data-network diagram: Frontend Detector Readout → Myrinet switches → Readout Units (x 80 x 8); Force 10 routers and HP switches with 2x1 Gbit and 3x1 Gbit links → Builder Filter Units (x (90+36) x 8); 4x1 Gbit / 4x3 Gbit links → Storage Manager (x 16); 2x10 Gbit to Tier 0 via Force 10 routers and a redundant HP switch.]

Page 29: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Installation/Configuration Services. Replication and Load Balancing
•  Data is read from the NAS once and kept in the computers' memory cache:
   ⇒  the NAS is not loaded even at peak installation time, when the installation servers' network is saturated.

J.A. Coarasa 29

Service                                    Replication through                      Load balancing through
PXE boot/TFTP                              No replication. NFS mount from the NAS.  Through DHCP
Kickstart installation server                                                       DNS Round Robin
yum repository installation server                                                  DNS Round Robin
Quattor repository installation server                                              DNS Round Robin
Quattor cdb configuration server                                                    No

Page 30: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

CMS IT Security
•  We are behind the CERN firewall, and services are not reachable from the internet unless explicitly allowed.
•  The CMS networks are private networks: not even reachable from the CERN network.
•  A reverse proxy that requires authentication gives access to some internal information (selected http traffic).
•  All users have to log in through gateway computers subject to tighter security (ssh traffic to some computers).
•  Traffic from internal computers to the outside world goes through proxy servers.

J.A. Coarasa 30

Page 31: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

The Compact Muon Solenoid experiment (CMS). Design Parameters

J.A. Coarasa 31

Detector    Channels      Ev. Data (kB)
Pixel       60 000 000    50
Tracker     10 000 000    650
Preshower   145 000       50
ECAL        85 000        100
HCAL        14 000        50
Muon DT     200 000       10
Muon RPC    200 000       5
Muon CSC    400 000       90

⇒  >50 000 000 channels, ~630 data origins
⇒  average of 1 MByte/event (bigger for HI: ~20 MBytes/event)


Page 32: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

pp collisions in CMS

FM – TIPP2011 32

Page 33: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Pb-Pb collisions in CMS

FM – TIPP2011 33

Page 34: The CMS Online Cluster:  Setup, Operation and Maintenance  of an Evolving Cluster

Networking. Service Network Redundancy
•  Consists of networking in two physical locations:
   –  each switch in each rack is connected to 2 routers;
   –  the 2 routers are connected in a redundant failover configuration, with 10 Gbit lines to the routers in the other location.

J.A. Coarasa 34

[Diagram: rack switches in the two locations (USC and SCX5) uplinked with 1 Gbit lines to their local routers, which are interconnected by redundant 2x10 Gbit links; the NAS is attached to the routers.]