Platform OCS 5 Technical Training

Post on 23-Jul-2020

Page 1: Platform OCS 5 - Queen's Universitywiki.phy.queensu.ca/hughes/images/5/56/OCS5-Day1-module1.pdf · Why is HPC so important ? Because money matters! Airlines: System-wide logistics

Platform OCS 5 Technical Training


What we’ll cover

Day 1:
Module 1: Concepts & Terminology
Module 2: Node Provisioning
Module 3: Basic Administration
Module 4: HPC and Workload Management


What we’ll cover

Day 2:
Module 1: Installer Node Setup & Configuration
Module 2: Compute Node Installation & Customization
Module 3: The OCS 5 “Survival Kit”
Module 4: Case Studies / Lab Activities


Day 1, Module 1

Platform OCS 5 Concepts & Terminology


Module Objectives

Upon completion of this module, you will be able to:

Understand the term High Performance Computing
Describe a “beowulf cluster”
Understand Platform OCS 5 concepts and terminology
Know what Kusu is and its relation to Platform OCS 5
Recognize key commands that will be explored in future modules


HPC: Who is doing HPC?

.com .gov .edu


HPC for .gov : Energy Research

Premier applied science laboratory that is part of the National Nuclear Security Administration (NNSA) within the Department of Energy (DOE)

#1, #11 on top500.org
IBM BlueGene/L eServer
212,992 processors
73,728 GB memory
478,200 GFlops (Rmax)
596,378 GFlops (Rpeak)

http://top500.org/site/systems/2556 - as of November 2007
Photo courtesy of Lawrence Livermore National Laboratory
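The two LINPACK figures quoted for this system can be related by a simple ratio. As a small worked example (the variable names are illustrative, and reading Rmax/Rpeak as "efficiency" is the customary TOP500 interpretation rather than something the slide states):

```python
# Figures quoted on the slide (TOP500, November 2007).
processors = 212_992
rmax_gflops = 478_200    # measured LINPACK performance
rpeak_gflops = 596_378   # theoretical peak performance

# Fraction of the theoretical peak actually achieved on the benchmark.
efficiency = rmax_gflops / rpeak_gflops

# Average measured throughput per processor.
gflops_per_processor = rmax_gflops / processors

print(f"efficiency: {efficiency:.1%}")
print(f"per processor: {gflops_per_processor:.2f} GFlops")
```

On these numbers the machine sustains roughly four fifths of its theoretical peak.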


HPC for .gov : merging black holes

Largest astrophysical calculation ever performed on a NASA supercomputer

SGI Altix system running Linux: 20 nodes, 512 CPUs per node

Total: 10,240 processors

http://www.nasa.gov/centers/goddard/universe/gwave.html


HPC for .edu : the bluebrain project

Understand brain function and dysfunction through detailed simulations

Objective : replicate neocortical column of a rat (10,000 neurons)

IBM Blue Gene: 8,000 CPUs, MPI

http://bluebrain.epfl.ch/page18699.html

First comprehensive attempt to reverse-engineer the mammalian brain


HPC for .com : GOOGLE

150M queries/day (2000/second)
8.0B documents in the index
100,000 Linux systems in data centers around the world
15 TFlops
1000 TB total

Eigenvalue problem, transition probability matrix, markov chain

http://en.wikipedia.org/wiki/Markov_chain
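The keywords on this slide describe ranking as an eigenvector problem: the stationary distribution pi of a Markov chain satisfies pi = pi P, so pi is a left eigenvector of the transition probability matrix P with eigenvalue 1. The following is a minimal sketch of power iteration on a made-up three-state chain. The matrix and function name are illustrative assumptions; this is not Google data or Google's implementation.

```python
def stationary_distribution(P, iterations=100):
    """Power iteration: repeatedly apply pi <- pi * P for a row-stochastic P."""
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical 3-page "web": P[i][j] is the probability of moving from page i to page j.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]

pi = stationary_distribution(P)
print([round(x, 4) for x in pi])
```

For this chain the middle page ends up with the largest share of the probability mass, which is the sense in which the stationary distribution ranks states.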


HPC for .edu (2) : Seti@home

Largest distributed computation project in existence

Running on 500,000 PCs, ~1000 CPU Years per day

Distributes Datasets from Arecibo Radio Telescope

Results sent back and combined

http://setiathome.berkeley.edu/


Why is HPC so important ?

Because money matters!

Airlines: System-wide logistics optimization
Savings: approx. $100 million per airline per year*.

Automotive design: CAD-CAM, crash testing, structural integrity and aerodynamics.
Savings: approx. $1 billion per company per year*.

Semiconductor industry: device electronics simulation and logic validation
Savings: approx. $1 billion per company per year*.

Securities industry: mortgage risk simulation
Savings: approx. $15 billion per year for U.S. home mortgages*.

Solve new classes of problems
More refined models
Larger models

* source: http://www.cs.utk.edu/~dongarra/WEB-PAGES/SPRING-2005/Lect01-overview.pdf


Different systems for solving different problems

Different architectures have evolved to solve different specialized problems in HPC

“SMP like”
Clusters with High Speed interconnects (parallel)
Clusters with standard interconnects (Monte Carlo)
Hybrid systems (ClearSpeed, Cell BE, FPGA, GPUs)
…


HPC: what is OpenMP?

OpenMP (Open Multi-Processing) is an Application Programming Interface (API) that supports multi-platform shared memory multiprocessing programming in C/C++ and Fortran.

It consists of a set of compiler directives, library routines, and environment variables (ex: OMP_NUM_THREADS) that influence run-time behavior.

GCC 4.2 supports OpenMP
Keywords: SMP, fat nodes

http://www.openmp.org/
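OpenMP itself is used from C/C++ or Fortran, so the following is only an analogy written in Python: a fork-join split of a loop across worker threads that share memory, with the worker count read from OMP_NUM_THREADS much as an OpenMP runtime would. The function name and the default of 4 threads are assumptions for illustration, and CPython threads will not actually speed up this CPU-bound loop. The point is the structure, not the performance.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_sum_of_squares(data, num_threads=None):
    # Like an OpenMP runtime, fall back to OMP_NUM_THREADS when no count is given.
    if num_threads is None:
        num_threads = int(os.environ.get("OMP_NUM_THREADS", "4"))
    num_threads = max(1, num_threads)
    # Static schedule: thread t handles elements t, t + num_threads, ...
    chunks = [data[t::num_threads] for t in range(num_threads)]
    # Fork: each worker reduces its own chunk. Join: combine the partial sums.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partial = list(pool.map(lambda chunk: sum(x * x for x in chunk), chunks))
    return sum(partial)

total = parallel_sum_of_squares(list(range(1000)))
print(total)
```

All workers read the same list in the same address space, which is what distinguishes this model from message passing.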


HPC: what is MPI ?

MPI stands for Message Passing Interface
Library specification designed to support parallel computing in a distributed environment
2 standards: MPI-1 and MPI-2
Several implementations (Open Source, ISV)
Keywords: distributed memory, beowulf cluster

http://www.mpi-forum.org
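Real MPI programs call an MPI library. As a library-free illustration of the message-passing model only, the sketch below simulates ranks with threads that exchange data exclusively through explicit send and receive operations on queues. The function names are hypothetical and are not MPI API calls.

```python
import queue
import threading

def worker(rank, inbox, outbox):
    # Each simulated rank only sees data that arrives as a message.
    chunk = inbox.get()                 # receive work from rank 0
    outbox.put((rank, sum(chunk)))      # send the partial result back

def scatter_and_reduce(data, num_workers=3):
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(r + 1, inboxes[r], results))
        for r in range(num_workers)
    ]
    for t in threads:
        t.start()
    # Rank 0 scatters one slice of the data to every worker...
    for r in range(num_workers):
        inboxes[r].put(data[r::num_workers])
    # ...then gathers and reduces the partial sums.
    total = sum(results.get()[1] for _ in range(num_workers))
    for t in threads:
        t.join()
    return total

print(scatter_and_reduce(list(range(101))))
```

On a real cluster the workers would be separate processes on separate nodes, each with its own memory, and the queues would be replaced by network messages.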


HPC: what is PVM?

PVM stands for Parallel Virtual Machine
PVM is a software package that permits a heterogeneous collection of Unix and/or Windows computers hooked together by a network to be used as a single large parallel computer
Not widely used but still actively developed
Keyword: virtual machine

http://www.csm.ornl.gov/pvm/


Cluster: What is a cluster?

Cluster: independent computers combined into a unified system through software and networking.

Typically used for High Availability (HA) for greater reliability or High Performance Computing (HPC) to provide greater computational power than a single computer can provide.


Cluster: What is a Beowulf cluster?

Beowulf clusters offer scalable performance on commodity hardware with an Open Source software infrastructure.

Class I clusters are built entirely using commodity hardware and software

Class II clusters may use specialized hardware to achieve higher performance.

http://www.beowulf.org/


Cluster: What is a Beowulf cluster?

NASA: Project Columbia, Beowulf cluster
42.7 teraflops, built in 120 days

Tunghai University, Taiwan: Beowulf parallel testbed
17 compute nodes

Class I Class II


Cluster: stateful vs. stateless nodes?

Stateful: each system can be modified locally
Stateless: no state exists on single computers; all state is centralized
  Manageable complexity (no or limited growing entropy)
  Scalability
  Needs 100% automatic configuration

As we’ll see shortly, OCS 5 has rich capabilities in this area: nodes can be stateful, but administrators can realize the administrative benefits of managing nodes as if they were stateless.


Cluster: clustering suites

OCS-like: OSCAR, xCAT, Warewulf

Single System Image (cluster seen as a single machine): Scyld (bproc), Clustermatic, Mosix / OpenMosix, Kerrighed


Cluster: what kinds of clusters can OCS 5 handle?

Beowulf clusters, Class I/II

Only x86, x86_64 based

Red Hat / CentOS or Fedora Nodes*

Extensible to other OS environments

Interoperable with other Architectures


What is Project Kusu?

‘Kusu’ is an Open Source provisioning, cluster file management and repository management toolkit.

‘Kusu’ is the first completely Open Source project created by Platform – http://www.osgdc.org

‘Kusu’ provides a technology foundation for Platform OCS 5


What is Kusu?

Kusu Island is located to the south of the main island of Singapore, off the Straits of Singapore. The name means "Tortoise Island" or "Turtle Island" in Chinese; the island is also known as Peak Island or Pulau Tembakul in Malay. From 2 tiny outcrops on a reef, the island was enlarged and transformed into an island holiday resort of 85,000 square metres. The island is 5.6 km south of the main island of Singapore.


What is Kusu?

Legend has it that a magical tortoise turned itself into an island to save two shipwrecked sailors - a Malay and a Chinese. Each of the two men gave thanks according to his own belief system, the former by building a Muslim kramat (keramat) (shrine), and the latter by establishing a Taoist shrine. Each year during the ninth lunar month (which falls between Sep and Nov according to the Lunar Calendar), thousands of devotees flock to the island for their annual Kusu Pilgrimage to pay homage for good health, peace, happiness, good luck and prosperity.


What is Platform OCS 5?

Project Kusu forms the cluster toolkit foundation for Platform OCS 5.

Kusu is a key part of Platform OCS 5.
OCS 5 is the complete software stack for HPC clusters.
The software stack is all of the individual software components that must be installed and configured in a cluster so that the end user or customer can run their applications.

Platform OCS 5 is a hybrid software stack.*
  Many components are Open Source (Platform Lava, MPI, Linux)
  Some components are freeware
  Some components are Commercial (Platform LSF HPC)

Red Hat HPC contains only the Open Source components of Platform OCS.


What is Red Hat HPC?

Red Hat® HPC is a solution based on Platform OCS 5
Red Hat HPC contains all of the Open Source components of OCS 5
Red Hat HPC is integrated with RHN
  A Red Hat HPC channel exists on RHN
  Customers must subscribe to the Red Hat HPC channel.
Red Hat HPC is installed using yum

Before an install proceeds the software checks the network and disk configuration to ensure the machine meets minimum requirements.


What is Red Hat HPC?

Support for Red Hat HPC
  Red Hat is the first line support – customers call Red Hat
  Red Hat will escalate to Platform when needed
  There is no ‘hand off’ of customers
  Red Hat and Platform jointly support customers until problem resolution

Patches and Security Updates for Red Hat HPC
  Red Hat builds the software from source at Red Hat
  Platform releases patches to Red Hat.
  Platform or Red Hat may identify security problems – patches are either pushed upstream to Platform or downstream to Red Hat.

Red Hat will use the source to identify issues and resolve them.


Overview of the OCS 5 framework

[Figure: layered diagram of the OCS 5 framework. Recoverable layers and their contents:]

User Applications: Reservoir, Tools, Seismic, CFD, BI, FEA, Bio, Other
Admin Applications: Cluster Monitoring, Cluster Reporting, Cluster Management, Workload Management
Cluster Middleware: SOAM, OFED, Lava/LSF, Portals, MPI, Compilers
Kusu Cluster: cfm, addhost, genconfig, driverpatch, repoman, repopatch, nghosts, ngedit, buildimage, buildinitrd, kitops, buildkit
Basic Cluster Services: DHCP, NFS, NTP, DNS, HTTP, IPMI, LDAP, NIS, MySQL, PFS
Linux® Operating System
Network Hardware: Infiniband, Myrinet, Ethernet, Gig-E, BMC
Compute, Network & Storage Resources


Challenges Uniquely Addressed by OCS 5

Key Platform OCS 5 Advantages

Deploy & manage site specific node types
Easily maintain systems at current patch levels
Perform low-risk “trial installs” of packages & OS environments
Support diskless clusters using images
Install patches or rpms without reprovisioning
Dynamically change node configurations
Synchronize key files across all nodes
Self monitoring / notification of problems
Scale to hundreds of hosts / multiple clusters


Platform OCS 5: Key Concepts & Definitions

Key Platform OCS 5/Kusu concepts:
Installer Node/Primary Installer Node
Node Group
Kit
Component
Snapshot
Repository

We’ll now explain these essential/fundamental concepts and will cover them in detail in future modules


Platform OCS 5 Installer Nodes

Installer Nodes are:
Nodes that install the cluster
  DHCP – for nodes in the cluster
  TFTP – to initiate remote install
  HTTP – send full OS install to nodes
Provide packages for installing nodes:
  OS Repositories (package collections)
  Kits – package collections of applications and operating systems
Provide the configuration management for the nodes
  Kit installation
  Cluster File Management, file replication and update
Maintain the SQL-based cluster database (kusudb)

Primary Installer Nodes (note: not implemented in OCS 5)
  Installer nodes can be arranged in a hierarchical manner
  One Installer Node (usually the first one) is the Primary Installer Node
  All other Installer Nodes synchronize their configuration and database with the Primary Installer Node. (not in OCS 5.0-.1)


OCS 5 Node Groups

A Node group is a template that defines the properties of cluster nodes:
Repository to use (defines OS)
Install to perform:
  Package Based / Image Based / Diskless
Components to install (components are packaged in kits)
  Platform and 3rd party applications
Disk partitioning scheme (including LVM)
Network configurations
  Multiple network types allowed including Infiniband interfaces.
OS packages
Custom scripts
Cluster node naming scheme
Custom driver modules to load
Kernel parameters
Kernel and initrd to boot
  Xen, or Regular
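To make the "template" idea concrete, a node group can be pictured as a record whose fields mirror the properties listed on this slide. The sketch below is purely illustrative: the class, field names, naming format and example values are assumptions and do not reflect the actual Kusu data model.

```python
from dataclasses import dataclass, field

@dataclass
class NodeGroup:
    # All field names are hypothetical; they mirror the slide's bullet points.
    name: str
    repository: str                          # defines the OS
    install_type: str                        # "package", "image" or "diskless"
    components: list = field(default_factory=list)
    os_packages: list = field(default_factory=list)
    name_format: str = "node-{index:03d}"    # cluster node naming scheme
    kernel_params: str = ""

    def __post_init__(self):
        if self.install_type not in ("package", "image", "diskless"):
            raise ValueError(f"unknown install type: {self.install_type}")

    def node_name(self, index):
        # Every node provisioned from this template is named the same way.
        return self.name_format.format(index=index)

compute = NodeGroup(
    name="compute-example",
    repository="example-repo",
    install_type="package",
    components=["example-component"],
)
print(compute.node_name(7))
```

Because every node in the group is stamped from the same record, changing the record is enough to change how all of its nodes are provisioned.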


Platform OCS 5 Kits

Kits are pre-packaged applications or services that, once added to a Platform OCS cluster, can be automatically installed and configured onto cluster nodes.

[Diagram: Kit A contains Components A-1, A-2 through A-k; the components group RPMs (RPM 1 through RPM N). Kit B is shown alongside Kit A.]
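The kit, component and RPM hierarchy in the diagram can be sketched as nested records. The class names, the package names and the idea that two components may share an RPM are assumptions made for illustration; they are not taken from the Kusu implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    rpms: list = field(default_factory=list)

@dataclass
class Kit:
    name: str
    components: list = field(default_factory=list)

    def all_rpms(self):
        # Flatten the kit into the set of packages it would bring to a node.
        return sorted({rpm for c in self.components for rpm in c.rpms})

kit_a = Kit("A", [
    Component("A-1", ["rpm-1", "rpm-2"]),
    Component("A-2", ["rpm-2", "rpm-3"]),
])
print(kit_a.all_rpms())
```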


Platform OCS 5 Repositories

Platform OCS 5 can use Red Hat®, Fedora or CentOS based repositories

A single Installer node can manage multiple repositories – this means that many different OS versions can be installed in a single cluster if desired.

Multiple repositories with the same OS and version are not supported

OCS 5 can take standard OS media and create a repository from the media.

A snapshot can be made from any repository, allowing the Administrator to modify the snapshot without affecting the original.


The Platform OCS 5 Database

Platform OCS 5 adheres to the following design principles:
All OCS cluster configuration is in the database.
All tools and GUIs modify the database tables.
Any configuration files required to run the cluster are generated from the database, such as: hosts, DNS, dhcpd, pdsh files etc.
All Platform OCS cluster tools retrieve configuration from the database.

So obviously the database is a key component of Platform OCS!
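The principle that configuration files are generated from the database can be sketched in a few lines. This is a hedged illustration only: Python's built-in sqlite3 stands in for the cluster database, and the table, column names and addresses are invented and do not reflect the real kusudb schema or the behavior of the genconfig tool.

```python
import sqlite3

# In-memory stand-in for the cluster database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (name TEXT, ip TEXT)")
conn.executemany("INSERT INTO nodes VALUES (?, ?)", [
    ("installer", "10.1.0.1"),
    ("compute-000", "10.1.0.10"),
    ("compute-001", "10.1.0.11"),
])

def generate_hosts(conn):
    # The file content is derived entirely from database rows, never edited by hand.
    rows = conn.execute("SELECT ip, name FROM nodes ORDER BY name")
    return "".join(f"{ip}\t{name}\n" for ip, name in rows)

hosts_text = generate_hosts(conn)
print(hosts_text, end="")
```

Adding a node then means inserting a row and regenerating the file, so the database remains the single source of truth.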


The Platform OCS 5 Database


Platform OCS 5 Administration tools

These are key administration tools we’ll examine in future modules:
addhost – add or remove hosts to/from cluster
driverpatch – installs new drivers into the initrd
kitops – used to create and manage kits
buildkit – used to build kits
buildimage – used to create disk images
buildinitrd – used to create the initial ram disk for diskless & imaged
boothost – create PXE config files for booting
genconfig – generate config files from database
repoman – repository management tool
repopatch – patch packages into the repository
nghosts – assigns nodes to node groups


Platform OCS 5 Administration tools

ngedit – change the properties of node groups
netedit – used to manage the networks in a cluster
cfmsync – signal nodes to update files/components


Platform OCS 5 Kits

OCS 5 comes with the following kits which we’ll also examine in future modules:
Base kit
HPC kit
Platform Lava kit
Platform LSF HPC kit
Nagios kit
OFED kit
Intel® Software Tools Kit
Cacti Kit
Ntop kit


Thank You!


Creating and using an OCS 5 Repository

[Diagram: the Installer Node holds repositories under /depot/repos, here an OCS 5 RHEL 5 repo and an OCS 5 Fedora Core repo built from the Fedora kit. The numbered steps are:]

1. Administrator inserts Fedora Core CDs/DVD into the Installer Node
2. Kitops is used to add Fedora Core to the Installer Node
3. Repoman is used to create a Fedora Core Repo from the Fedora Core Kit.
4. Ngedit is used to create a node group and associate the new repo.
5. Addhost is then used to add new nodes using the repository
6. Node is PXE booted


OCS 5 Repository Snapshots

Updating a cluster can be risky.
  Kernel drivers can make it impossible to update the kernel.
  Some update packages have dependencies on kernel versions.
  Sometimes updates ‘regress’ functionality.

Platform OCS 5 provides repository snapshots.
  A snapshot is a full copy of an existing repository.
  A new repository directory is created and symbolic links are created from the snapshot directory to the original kits.
  Administrators then update the snapshot using repopatch.
  Admins assign the snapshot repository to a ‘test’ node group and provision some machines to test the patches.
  If all goes well, the new patches can be merged into the original “production” repository by adding the update kit to that repository.
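The symbolic-link mechanism described for snapshots can be sketched as follows. This is a minimal illustration under stated assumptions: the directory names and the function are hypothetical, the demo runs in a throwaway temporary directory, and it does not reproduce how repoman actually builds a snapshot.

```python
import os
import tempfile

def snapshot_repository(repo_dir, snapshot_dir):
    """Create snapshot_dir containing one symlink per entry in repo_dir."""
    os.makedirs(snapshot_dir)
    for entry in sorted(os.listdir(repo_dir)):
        target = os.path.join(repo_dir, entry)
        os.symlink(target, os.path.join(snapshot_dir, entry))
    return snapshot_dir

# Demo in a throwaway directory; the layout is hypothetical.
root = tempfile.mkdtemp()
repo = os.path.join(root, "repo-production")
os.makedirs(os.path.join(repo, "kit-base"))
os.makedirs(os.path.join(repo, "kit-os"))

snap = snapshot_repository(repo, os.path.join(root, "repo-snapshot"))
print(sorted(os.listdir(snap)))
```

Because the snapshot only links to the original kits, it is cheap to create, and update kits added to the snapshot directory leave the production repository untouched until they are merged.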


The Platform OCS 5 Database

The SQL database tables that hold state are accessible to Kusu database administrators via the command line or via popular web-based MySQL administration tools such as phpMyAdmin.


Kits, Repositories and Node Groups

Kits are added to OCS 5 using ‘kitops’. After a Kit is added it must be assigned to a repository and then a node group.

[Diagram: Kit A moves from the OCS 5 Kit depot (/depot/kits) into the OCS 5 Repository (/depot/repo/…) and on to the node groups. The numbered steps are:]

1. Kit A is added to the kit depot: # kitops –add A
2. repoman adds kit to repository
3. repository refreshed to add new Kit
4a. Kit components automatically assigned to node group defined in the kit.
OR
4b. Use ‘ngedit’ to assign kit components to node groups
5. Nodes in node group are re-provisioned or updated with the new Kit