
Page 1: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

Managing Scale and Complexity of Next Generation HPC Systems and Clouds

Peter ffoulkes, Vice President of Marketing, April 2011

Page 2: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

The World’s Most Capable Computing Systems Are Powered by Moab


• The world’s largest HPC system, the No. 2-ranked Jaguar, with over 18,500 nodes, 224,000 cores and a speed of 1.75 petaflop/s
• Half of the top 10 systems
• Over one third of the top 50 systems (17 systems)
• 38% of the compute cores in the top 100 systems

Source: Nov 2010 rankings from www.Top500.org

The Top 500 List (Nov 2010)

2 Oak Ridge National Laboratory

5 Lawrence Berkeley National Lab

7 Los Alamos National Laboratory

8 University of Tennessee

10 Los Alamos National Laboratory

12 Lawrence Livermore National Lab

14 Sandia National Laboratories

16 Lawrence Livermore National Lab

23 Forschungszentrum Jülich

26 Lawrence Berkeley National Lab

30 Oak Ridge National Laboratory

31 Sandia National Laboratories

32 NOAA / Oak Ridge National Lab

39 SciNet, University of Toronto

Page 3: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

Oak Ridge National Laboratory: Jaguar, the second most capable HPC system in the world, running at 1.759 petaflop/s

18,686 nodes, 224,256 processing cores, 300 TB of memory

Diversity of users was severely limiting system workload-management capability


Moab resolved Jaguar’s workload-management problems: it increased system utilization, decreased downtime, and allowed more control over resources.

Page 4: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

What’s Next…


Tsubame 2.0: 2,816 six-core CPUs (16,896 cores) combined with 4,224 of NVIDIA’s Tesla M2050 (448-core) general-purpose GPUs, with a dual-rail, non-blocking fabric employing two Voltaire 40 Gb/s InfiniBand connections on each node

Tianhe-1A: 3,600 nodes, 14,336 six-core CPUs and 7,168 (448-core) GPUs, for roughly 86,000 general-purpose CPU cores (14,336 × 6 = 86,016) plus 7,168 GPUs, with a 160 Gbit/s Galaxy interconnect developed in China

Page 5: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

Managing Scale and Complexity: Moab 6.0

• A new command communication architecture that delivers a 100-fold increase in internal communications throughput

• Support for the most commonly used Moab commands in grids deploying multiple Moab instances, dramatically increasing the manageability of complex supercomputing environments

• Support for hybrid installations deploying GPGPU technologies in conjunction with TORQUE 2.5.4


Page 6: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

Managing GPGPUs: Moab 6.0 and TORQUE 2.5.4

• Specify GPGPUs in the same manner as CPUs
• GPGPUs are requested as a defined resource
• Applications receive indexed GPGPU information about which GPGPU(s) to access (sketched below)
• Moab’s intelligent scheduling ensures GPGPUs never get oversubscribed
• GPGPU usage is recorded in utilization reports
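As a concrete picture of that indexed handoff, here is a minimal Python sketch. It assumes TORQUE’s convention of a PBS_GPUFILE environment variable naming a file with one "hostname-gpuN" line per granted GPU (a TORQUE detail not spelled out on the slide, so treat the format as an assumption), and maps the granted indices to CUDA_VISIBLE_DEVICES so the application touches only its assigned devices:

    import os
    import socket

    def assigned_gpu_indices():
        """Parse the GPU assignments TORQUE records for this job.

        Assumes TORQUE's convention: $PBS_GPUFILE names a file whose
        lines look like "node042-gpu0", one per GPU granted to the job.
        """
        gpufile = os.environ.get("PBS_GPUFILE")
        if gpufile is None:
            return []          # not running under a TORQUE GPU allocation
        host = socket.gethostname().split(".")[0]
        indices = []
        with open(gpufile) as f:
            for line in f:
                node, _, gpu = line.strip().rpartition("-gpu")
                if node == host and gpu.isdigit():
                    indices.append(int(gpu))
        return sorted(indices)

    if __name__ == "__main__":
        idx = assigned_gpu_indices()
        # Restrict the application to exactly the GPUs Moab/TORQUE granted,
        # so two jobs sharing a node never oversubscribe a device.
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in idx)
        print("Using GPUs:", idx)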

Page 7: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

Managing Scale and Complexity: Moab 6.0

• New on-demand dynamic provisioning and management capabilities that support both virtual and physical resources, including VM migration for load balancing, workload packing and consolidation

• Idle-resource management to deliver increased utilization, efficiency and energy conservation for HPC and enterprise cloud deployments (a toy sketch follows this list)

• Improved administration and reporting, including new parameterized administration functions; enhanced limits for event, group and account management; and new formats for job and reservation event reporting
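To make the idle-resource idea concrete, the policy might look roughly like the following Python sketch. Everything here is a hypothetical stand-in: the threshold, the node model, and the power_off/power_on hooks are illustrative, not Moab’s actual interfaces.

    import time
    from dataclasses import dataclass
    from typing import List, Optional

    IDLE_THRESHOLD_SECS = 30 * 60   # hypothetical: power down after 30 idle minutes

    @dataclass
    class Node:
        name: str
        idle_since: Optional[float] = None   # None while the node is busy
        reserved: bool = False               # a future reservation pins the node up
        powered: bool = True

    def power_off(node):
        print(f"powering off {node.name}")   # stand-in for a provisioning call

    def power_on(node):
        print(f"powering on {node.name}")    # stand-in for a provisioning call

    def apply_idle_policy(nodes: List[Node], queued_jobs: int) -> None:
        """Toy idle-resource policy: power off long-idle, unreserved nodes
        when nothing is waiting, and wake nodes when a backlog appears."""
        now = time.time()
        for n in nodes:
            long_idle = (n.idle_since is not None
                         and now - n.idle_since > IDLE_THRESHOLD_SECS)
            if n.powered and long_idle and not n.reserved and queued_jobs == 0:
                power_off(n)
                n.powered = False
            elif not n.powered and queued_jobs > 0:
                power_on(n)
                n.powered = True

    # Example: one node idle for an hour, nothing queued -> it gets powered off.
    cluster = [Node("n001", idle_since=time.time() - 3600)]
    apply_idle_policy(cluster, queued_jobs=0)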

Page 8: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

Managing Scale and Complexity: Moab Viewpoint 2.0

HPC as a service and HPC cloud capability:
• Creation, management and status reporting of reservations and job queues for HPC and batch workloads and system maintenance
• On-demand dynamic management of VMs and physical nodes
• Increased scalability to support management of tens of thousands of nodes and hundreds of thousands of VMs
• Flexible security management at installation, including built-in security, Single Sign-On (SSO), or Lightweight Directory Access Protocol (LDAP) models
• Service-based administration and reporting for easy access and management of HPC and cloud resources (illustrated below)
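To make “service-based administration” concrete, creating a reservation through a web service might look like the sketch below. This is purely illustrative: the URL, payload fields, and token scheme are hypothetical placeholders, not Viewpoint’s actual API.

    import json
    import urllib.request

    # Hypothetical endpoint and fields, for illustration only.
    BASE_URL = "https://viewpoint.example.org/api"

    def create_reservation(nodes: int, start: str, hours: int, token: str) -> dict:
        """Sketch of creating a reservation via a service call (illustrative)."""
        payload = json.dumps({
            "nodes": nodes,              # how many nodes to reserve
            "start": start,              # e.g. "2011-04-01T08:00:00Z"
            "duration_hours": hours,
        }).encode("utf-8")
        req = urllib.request.Request(
            f"{BASE_URL}/reservations",
            data=payload,
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {token}",  # hypothetical auth scheme
            },
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)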

Page 9: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

University of Cambridge: COSMOS


Overview

COSMOS has expanded:
• New SGI Altix UV1000: six-core Nehalem-EX chips, 768 cores, 2 TB of global shared memory
• Existing SGI Altix 4700: 920 cores and 2.5 TB RAM
• Both compute systems are supported by 64 TB of high-performance storage

Challenge

• Managing both cluster-based workloads and SMP shared memory workloads in the same environment

Page 10: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

University of Birmingham


Overview

The University of Birmingham’s 1,500-core cluster runs a mixed workload, ranging from many (often hundreds of) short single-core parameter-sweep jobs to massively parallel multi-core computations, some running for over a week.

Challenges

The workload is variable, especially at different times of the year, and keeping the whole cluster powered up during less busy periods wastes power. A power-management system sophisticated enough to be aware of the scheduled as well as the active workload is required, so that resources are always available when needed without power being wasted.

Solution

Moab Adaptive HPC Suite™

Results

An annual saving of about 10% of current power costs, amounting to £50,000 (implying an annual power bill on the order of £500,000), from powering off nodes that are not in use. Further savings from ancillary supplies, especially datacentre air conditioning, are expected.

Page 11: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

What’s in a cloud: Vapor-ware or silver lining?

Is the Future of Computing Clear or is it Obscured by Clouds?

National Institute of Standards and Technology: “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.”

Essential Characteristics: On-demand self-service, Broad network access, Resource pooling, Rapid elasticity, Measured service.

Service Models: Cloud Software as a Service (SaaS), Cloud Platform as a Service (PaaS), Cloud Infrastructure as a Service (IaaS).

Deployment Models: Private cloud, Community cloud, Public cloud, Hybrid cloud.

Note: Cloud software takes full advantage of the cloud paradigm by being service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability. Source: The NIST Definition of Cloud Computing.

http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc

Page 12: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

Three Essential Cloud Characteristics

• Agile: Delivers business services rapidly, efficiently and successfully
• Automated: Eliminates human error, enables scaling and capacity, reduces management complexity and cost
• Adaptive: Anticipates and adapts intelligently to dynamic business service needs and conditions

Page 13: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

SciNet, University of Toronto

Solution
• Energy-aware, stateless, on-demand multi-OS provisioning (sketched below)
• Moab Adaptive HPC Suite™ and xCAT provisioning software
• 4,000-server supercomputer system
• 30,000 Intel Xeon 5500 cores, with a theoretical peak of 306 TFlops

Results
A state-of-the-art data center that saves enough energy to power more than 700 homes yearly. On-demand provisioning allows users to make their OS choice part of their automated job template. SciNet always has several different flavors of Linux running simultaneously.
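The “OS choice in the job template” idea can be pictured with the short Python sketch below. The template fields and the reprovision hook are hypothetical stand-ins for what Moab plus xCAT do during on-demand provisioning, not their real interfaces.

    from dataclasses import dataclass

    @dataclass
    class JobTemplate:
        name: str
        cores: int
        os_image: str   # the user's OS choice rides along with the job

    def reprovision(node_name, image):
        print(f"reimaging {node_name} with {image}")  # hypothetical xCAT hook

    def schedule(job, nodes):
        """Toy on-demand provisioning: prefer a node already running the
        requested OS; otherwise reprovision an idle node to match."""
        for node in nodes:
            if node["idle"] and node["os"] == job.os_image:
                return node["name"]
        for node in nodes:
            if node["idle"]:
                reprovision(node["name"], job.os_image)
                node["os"] = job.os_image
                return node["name"]
        return None  # no idle node: the job waits in the queue

    cluster = [{"name": "n1", "idle": True, "os": "centos5"},
               {"name": "n2", "idle": True, "os": "sles11"}]
    job = JobTemplate("cosmology-run", cores=8, os_image="sles11")
    print(schedule(job, cluster))   # -> n2 (already running sles11)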

“Why should we pay for cooling when it’s so cold outside? Toronto is pretty cold for at least half of the year. We could have bought a humongous pile of cheap x86 boxes but couldn’t power, maintain or operate them in any logical way.” 

Daniel Gruner, PhD, Chief Technology Officer of Software for SciNet

Page 14: Managing Scale and Complexity of Next Generation HPC Systems and Clouds

A Global Bank based in the USA

Who: Top 3 financial services company

What: Moab Automation Intelligence Manager will manage 80-90% of workloads (up to 10,000+ applications on more than 100,000 servers across more than 10 datacenters)

Use Case: IaaS, PaaS, AaaS, using Workload-Driven Cloud 2.0

Objective: Increase agility, reduce risk, and save over $1 billion in 3 years.
