
Installing, running, and maintaining large Linux Clusters at CERN

Thorsten Kleinwort, CERN-IT/FIO
CHEP 2003, 24.03.2003

Overview

• The Linux Clusters at the CERN CC
• Recent achievements to improve manageability
  • Installation
  • Configuration
  • Monitoring
  • Collaboration with EDG (WP4)
• Maintenance of the clusters
• The batch system LSF
• Steps towards LHC Computing
• References

Introduction

• The computing facilities in the CERN Computer Center:
  • Decommissioned non-Linux platforms, apart from some Suns
  • Merged private clusters into two big, shared clusters:
    • LXPLUS for interactive use (~80 nodes)
    • LXBATCH as batch farm (~700 nodes)
• All commodity hardware (towers, dual CPU), but diverse (CPU speed, number and size of disks, memory, …)
• The current OS is RedHat Linux; we are in the transition from 6.1 to 7.3, and around 70% is done

The CERN Computer Center:

Recent achievements…

• Moving from RedHat 6.1 to 7.3
• Revised and rewrote existing installation and maintenance tools, because the requirements have changed:
  • Focusing on Linux
  • Using well-established tools/protocols/languages (RPM, HTTP, XML, …)
  • Standard adherence (LSB, init scripts, …)
• Separated installation & configuration:
  • Identified all parts of installation
  • Identified all sources of configuration information

Installation

• The system is installed with kickstart
  • The installation is completely automatic
• Software installation: RPM
  • RPM is the tool of choice:
    • Allows easy install/update/uninstall
    • Version control
  • Additional software in RPMs, ours as well as software from others (e.g. CASTOR, EDG, LCG, …)
• (Post-)Installation split up in components:
  • One RPM per component for installation
  • Configuration is done per component as well

Configuration

• Configuration of the system:
  • We enhanced SUE with a configuration interface
• Identified all sources of configuration information:
  • First step: Make this information available through one interface (CCConfig)
  • Next step: Work on the unification and merging of the different data sources behind it (ongoing)
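The "one interface" idea can be illustrated with a thin facade that asks several configuration sources in turn. This is only a sketch under assumptions: the class name ConfigFacade, the helper dict_source, the host name, and the keys are all hypothetical and do not reflect the actual CCConfig API or the real data sources.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical: each source maps (host, key) to a value, or None if unknown.
Source = Callable[[str, str], Optional[str]]


class ConfigFacade:
    """Single entry point in front of several configuration sources."""

    def __init__(self, sources: List[Source]) -> None:
        self._sources = sources

    def get(self, host: str, key: str) -> Optional[str]:
        # Ask each source in priority order; the first answer wins.
        for source in self._sources:
            value = source(host, key)
            if value is not None:
                return value
        return None


def dict_source(data: Dict[str, Dict[str, str]]) -> Source:
    """Wrap an in-memory table as a source (stand-in for a real database or file)."""
    return lambda host, key: data.get(host, {}).get(key)


# Invented example data; the real data sources are not described in the slides.
hardware_db = dict_source({"lxb0001": {"memory_mb": "1024"}})
cluster_db = dict_source({"lxb0001": {"cluster": "lxbatch"}})

config = ConfigFacade([hardware_db, cluster_db])
print(config.get("lxb0001", "cluster"))
```

Because callers only see the facade, the sources behind it could later be merged or replaced without changing client code, which is the point of the "next step" on the slide.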

Configuration II

• Using the EDG WP4 configuration tool:
  • Pan & CDB (Configuration Data Base) for describing hosts:
    • Pan is a very flexible language for describing host configuration information:
      • Expressed in templates (ASCII)
      • Allows includes (inheritance)
    • Pan is compiled into XML, inside CDB
    • The XML is downloaded, and the information is provided through CCConfig, which is the high-level API
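The template-with-includes idea can be sketched as follows: included templates are merged first, a template's own settings override them, and the result is rendered as XML. Plain Python data stands in for Pan syntax here, and the template names, keys, values, and XML layout are all invented for illustration; none of this reproduces the real Pan language or the CDB output format.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Hypothetical templates: name -> (included templates, own settings).
TEMPLATES: Dict[str, Tuple[List[str], Dict[str, str]]] = {
    "base": ([], {"os": "redhat-7.3", "scheduler": "none"}),
    "batch_node": (["base"], {"scheduler": "lsf"}),
    "lxb0001": (["batch_node"], {"hostname": "lxb0001"}),
}


def resolve(name: str) -> Dict[str, str]:
    """Merge included templates first, so a template's own settings override them."""
    includes, own = TEMPLATES[name]
    merged: Dict[str, str] = {}
    for included in includes:
        merged.update(resolve(included))
    merged.update(own)
    return merged


def to_xml(name: str) -> str:
    """Render the resolved settings as a small XML profile."""
    root = ET.Element("profile", {"name": name})
    for key, value in sorted(resolve(name).items()):
        ET.SubElement(root, "setting", {"key": key}).text = value
    return ET.tostring(root, encoding="unicode")


print(to_xml("lxb0001"))
```

In this sketch the host template inherits the operating system from "base" and the scheduler from "batch_node", so a change to a shared template reaches every host that includes it.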

Monitoring

• Adoption of EDG WP4 monitoring:
  • Has replaced the old self-made, home-grown alarm scripts
• Still relying on the old alarm system (SURE):
  • Will be replaced, either by the WP4 tool or by a commercial tool (PVSS)
• The monitoring information is stored in a database:
  • With a user API for queries
  • Eliminates the need for client access

Maintenance

• Machines must be ‘updatable’:
  • Updating a machine must lead to the same result as a new install
• Rpmupdate:
  • Based on RPMT, a transactional RPM which allows updates, installs, and uninstalls at the same time
  • Will be superseded by the EDG WP4 tool: SPMA
• Notification mechanism:
  • No automatic/periodic upgrade
  • Change mechanism triggered to run on the nodes
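The requirement that an update gives the same result as a new install can be illustrated by computing one transaction (installs, updates, removals) from the difference between the installed packages and a target list. This is a minimal sketch, not the rpmupdate or RPMT implementation: the function names plan and apply, and all package names and versions, are invented, and versions are compared only for equality.

```python
from typing import Dict, List, NamedTuple


class Transaction(NamedTuple):
    install: List[str]
    update: List[str]
    remove: List[str]


def plan(installed: Dict[str, str], target: Dict[str, str]) -> Transaction:
    """Compare installed packages (name -> version) with the target list."""
    install = sorted(p for p in target if p not in installed)
    update = sorted(p for p in target if p in installed and installed[p] != target[p])
    remove = sorted(p for p in installed if p not in target)
    return Transaction(install, update, remove)


def apply(installed: Dict[str, str], target: Dict[str, str], tx: Transaction) -> Dict[str, str]:
    """Apply all three kinds of change in one step and return the new state."""
    result = {p: v for p, v in installed.items() if p not in tx.remove}
    for package in tx.install + tx.update:
        result[package] = target[package]
    return result


# Invented example data.
target = {"kernel": "2.4.18", "openssh": "3.1", "lsf-client": "4.2"}
old_node = {"kernel": "2.2.19", "openssh": "3.1", "oldtool": "1.0"}

tx = plan(old_node, target)
print(tx)
# An updated node and a freshly installed node end up in the same state.
print(apply(old_node, target, tx) == apply({}, target, plan({}, target)))
```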

The batch system (LSF)

• Current version: LSF 4.2
  • LSF 5.1 is being evaluated at the moment
• No multi-clusters any more
• We introduced Fairshare, for better utilization of unused capacity:
  • Experiments have guaranteed shares of the batch capacity
  • If unused, they can be used by others
  • No more resources that are available but unusable
• We oversubscribe our hosts (3 jobs per dual CPU)
• Close collaboration with the provider, Platform: they benefit from our big farm, we benefit from their help and willingness to implement our requirements
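The idea of guaranteed shares whose unused part can be used by others can be sketched as a simple slot allocation. This is a toy model, not LSF's Fairshare algorithm: the function allocate, the experiment names, the share values, and the demands are all hypothetical, and shares are assumed to be positive.

```python
from typing import Dict


def allocate(capacity: int, shares: Dict[str, int], demand: Dict[str, int]) -> Dict[str, int]:
    """Give each experiment up to its guaranteed share, then hand out what is left."""
    total = sum(shares.values())
    # Guaranteed part: never more than the experiment actually asks for.
    alloc = {e: min(demand[e], capacity * shares[e] // total) for e in shares}
    spare = capacity - sum(alloc.values())
    # Spare slots go, one at a time, to the waiting experiment that currently
    # has the least allocation relative to its share.
    while spare > 0:
        waiting = [e for e in shares if alloc[e] < demand[e]]
        if not waiting:
            break
        neediest = min(waiting, key=lambda e: alloc[e] / shares[e])
        alloc[neediest] += 1
        spare -= 1
    return alloc


# Invented example: 100 job slots, three experiments.
shares = {"exp_a": 50, "exp_b": 30, "exp_c": 20}
demand = {"exp_a": 80, "exp_b": 10, "exp_c": 20}
print(allocate(100, shares, demand))
```

Here exp_b needs only 10 of its 30 guaranteed slots, so the remaining 20 go to exp_a, which has more work than its own share covers.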

Other improvements

• Secure installations:
  • Each node has its own GPG key pair to exchange secure information:
    • e.g. for SSH keys, (encrypted) root password
• Intervention rundown:
  • Allows a scheduled reboot on batch nodes, when they have finished batch jobs, e.g. for a new kernel or other software installs
• Server Cluster:
  • Serves the RPMs, the configuration information, etc.
  • Several machines, selected by ‘dynamic DNS aliases’
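The intervention rundown can be pictured as a small state machine: once a reboot is requested, the node accepts no new jobs and reboots only after the running jobs have finished. The sketch below is an assumption about the logic, not the actual tool; the class BatchNode and its methods are hypothetical, and the reboot itself is replaced by a counter.

```python
from dataclasses import dataclass


@dataclass
class BatchNode:
    name: str
    running_jobs: int = 0
    draining: bool = False
    reboots: int = 0  # stand-in for performing a real reboot

    def request_reboot(self) -> None:
        """Schedule a reboot; it happens as soon as the node is idle."""
        self.draining = True
        self._reboot_if_idle()

    def start_job(self) -> bool:
        """Accept a new job unless the node is being run down."""
        if self.draining:
            return False
        self.running_jobs += 1
        return True

    def finish_job(self) -> None:
        if self.running_jobs > 0:
            self.running_jobs -= 1
        self._reboot_if_idle()

    def _reboot_if_idle(self) -> None:
        if self.draining and self.running_jobs == 0:
            self.reboots += 1
            self.draining = False  # node is back in service after the reboot


node = BatchNode("lxb0001")  # invented host name
node.start_job()
node.start_job()
node.request_reboot()        # two jobs still running: no reboot yet
print(node.start_job())      # new work is refused while draining
node.finish_job()
node.finish_job()            # last job done: the reboot happens now
print(node.reboots)
```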

Going to Grid Computing

• Merging EDG/VDT middleware into a large-scale production farm
• Enlarging our batch capacity by 400 nodes in April
• Early contribution to LCG 1 by this summer
• LXBATCH fully integrated by Q4/2003
• Close collaboration with EDG (WP4) and LCG will continue

Conclusions

• Redid the Linux installation for RH 7.3
• Clearer concepts, new tools
• Streamlined it with EDG WP4 tools
• Continuous collaboration with EDG WP4 and LCG
• Facing and implementing the needs of Grid Computing

References

• CERN-IT/FIO: http://it-div-fio.web.cern.ch/it-div-fio/
• EDG: http://eu-datagrid.web.cern.ch/eu-datagrid/
  • WP4: http://hep-proj-grid-fabric.web.cern.ch/hep-proj-grid-fabric/
• LCG: http://lcg.web.cern.ch/LCG/
• SUE: http://proj-sue.web.cern.ch/proj-sue/
• LCFG: http://www.lcfg.org/
• LSF (Platform Computing): http://www.platform.com/
• PVSS: