Managing Mature White Box Clusters at CERN LCW: Practical Experience Tim Smith CERN/IT



2002/10/21 White Box Farms: [email protected] 2

Contents

Scale
Behind the Scenes
Hardware
Complexity
Dynamics
Practical Steps
Software Legacy
Projects

Scale
  ~1000 boxes
  140k jobs/week
  2400 interactive users
  50 parallel reinstalls
  Parallel command engines
  350 kSI2000
  ~7/38 in top 500 clusters

Complexity
Hardware
  12 hardware acquisitions
  38 combinations of CPU/Mem/Disk
Software
  4 versions of RedHat OS
  37 clusters (indep. configurations)
User Communities
  30 expts/user communities + Public
  12,000 users

Dynamics
Hardware Drift
  e.g. missing after reboot: CPUs, Memory, Disks
  Ethernet speed wrong
Volatile configurations
  e.g. passwd file every couple of hours
Hardware Failures
  Up to 4% of farm on holiday
  Replacements generate new configurations
Monitoring
Inventory Tracking
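The hardware drift described above (components missing after a reboot, wrong Ethernet speed) could be caught by comparing a recorded inventory with what a node reports after booting. The following is a minimal sketch of that idea, not the tooling used on the farm; the `HardwareInventory` record, its field names, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareInventory:
    """Hypothetical per-node inventory record; field names are assumptions."""
    cpus: int
    memory_mb: int
    disks: int
    ethernet_mbps: int


def detect_drift(expected: HardwareInventory, observed: HardwareInventory) -> dict:
    """Return {field: (expected, observed)} for every field that differs."""
    drift = {}
    for name in ("cpus", "memory_mb", "disks", "ethernet_mbps"):
        want = getattr(expected, name)
        got = getattr(observed, name)
        if want != got:
            drift[name] = (want, got)
    return drift


# Illustrative values only, not measurements from the farm.
expected = HardwareInventory(cpus=2, memory_mb=512, disks=2, ethernet_mbps=100)
observed = HardwareInventory(cpus=1, memory_mb=512, disks=2, ethernet_mbps=10)

# Reports the lost CPU and the downgraded Ethernet link.
print(detect_drift(expected, observed))
```

Running such a check after every reboot would tie the Monitoring and Inventory Tracking items together: monitoring supplies the observed record, inventory tracking supplies the expected one.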

Vendor Call Analysis

[Bar chart. Vertical axis: number of calls, 0 to 45. Horizontal axis: reasons, with tick labels as extracted: disks dead motherb. memory video processor floppy power/fan tot. calls. Series: SIEMENS, ELONEX, TECH AS, SEIL. Annotation: 1 every 2 days!]

Acquisition Cycles

[Chart. Vertical axis: number of machines, 0 to 1200. Horizontal axis: Jan-97 to Jan-05. Labels: SEIL - 1000, ELONEX - 800, TECH - 600, ELONEX - 600, SIEMENS - 550, ELONEX - 500, HP - 450, ELONEX - 450, ELONEX - 450, ELONEX - 300, COGESTRA - 266, COGESTRA - 200. Annotation: Out of Warranty]

Addressing the Challenge
Interactive: Refresh from uniform batch machines
Batch: One large production facility
  Shares (and priorities)
  Selectable resources
  Flexibility
  Redundancy to reduce sensitivity to failures
Remedy
  Hardware workflows
But intractable
  Scatter in job return times
  Assumed but undeclared job requirements
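One way to picture the link between selectable resources and undeclared job requirements is a simple matching step: a job that declares what it needs is only placed on nodes that satisfy it, while a job that declares nothing can land on any hardware generation. The sketch below is an illustration under assumptions, not the batch system used at CERN; the `Node` and `JobRequirements` types, their fields, and the node names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one batch node."""
    name: str
    memory_mb: int
    scratch_gb: int


@dataclass(frozen=True)
class JobRequirements:
    """Hypothetical declared requirements; 0 means nothing was declared."""
    min_memory_mb: int = 0
    min_scratch_gb: int = 0


def eligible_nodes(nodes, req):
    """Return the names of nodes that satisfy every declared requirement."""
    return [
        n.name
        for n in nodes
        if n.memory_mb >= req.min_memory_mb and n.scratch_gb >= req.min_scratch_gb
    ]


# Illustrative farm with three hardware generations.
nodes = [
    Node("node-a", memory_mb=256, scratch_gb=4),
    Node("node-b", memory_mb=512, scratch_gb=8),
    Node("node-c", memory_mb=1024, scratch_gb=8),
]

declared = JobRequirements(min_memory_mb=512)
undeclared = JobRequirements()

print(eligible_nodes(nodes, declared))    # only nodes with enough memory
print(eligible_nodes(nodes, undeclared))  # every node, including the smallest
```

In this picture, a job that silently assumes 512 MB but declares nothing may run on `node-a`, which is one plausible source of the scatter in return times noted above.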

SW: Legacy from Maturity

[Diagram labels: OS, Applications, Mgmt Tools; KickStart, SUE, ASIS, BIS; /home/usr/cute/usr/local/var/opt]

SW: Legacy from Maturity
[Diagram labels: OS, Applications, Mgmt Tools; KickStart, SUE, ASIS, BIS; BIS DB, Oracle; AFS (four boxes), Local; acrontabs, crontabs; /home/usr/cute/usr/local/var/opt]
Multiple owners, methods, formats
Multiple locations

A Clean Restart

Node Configuration System
Monitoring System
Installation System
Fault Mgmt System

A Clean Restart: SnapShot

Node Configuration System
Monitoring System
Installation System
Fault Mgmt System
[Diagram labels: HW, SW; Function, State; Software Update, Base Installation; RPM, API; PXE, Kickstart]

State and Configuration Mgt
Clean Initial State
  Linux Standard Base, RPM
Externally Specified
  Configuration System, local cache
Versioned + Repository
  CVS
No inherent drift
  No external crontabs
  No unregistered application provider triggered updates
Update verification
  nodes + release cycle
Procedures and Workflows
  Transactions
  Notifications
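The combination of an externally specified configuration and no inherent drift can be read as a comparison between a desired state and the state found on a node, with every difference turned into an explicit, reviewable change. The sketch below illustrates that reading only; it is not the configuration system described in the talk, and the function name, package names, and version strings are hypothetical.

```python
def plan_changes(desired: dict, installed: dict) -> dict:
    """Compare desired package versions with those found on a node.

    Both arguments map package name -> version string. The result lists
    what would have to be installed, changed, or removed to match the
    externally specified state.
    """
    to_install = {p: v for p, v in desired.items() if p not in installed}
    to_change = {
        p: (installed[p], v)
        for p, v in desired.items()
        if p in installed and installed[p] != v
    }
    to_remove = sorted(p for p in installed if p not in desired)
    return {"install": to_install, "change": to_change, "remove": to_remove}


# Illustrative package sets; names and versions are made up for the example.
desired = {"openssh": "3.1p1", "afs-client": "1.2.7"}
installed = {"openssh": "2.9p2", "local-hack": "0.1"}

plan = plan_changes(desired, installed)
print(plan["install"])  # packages missing from the node
print(plan["change"])   # packages at the wrong version: (found, wanted)
print(plan["remove"])   # packages nobody registered
```

Keeping `desired` in a versioned repository, as the slide suggests with CVS, would let a verification node apply the plan first before it is released to the rest of the farm.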

Conclusions
Maturity brings…
  Degradation of initial state definition
    HW + SW
  Accumulation of innocuous temporary procedures
Scale brings…
  Marginal activities become full time
  Many hands on the systems
Combat with strong management automation