Real-World Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Invent 2013
DESCRIPTION
"Running high-performance scientific and engineering applications is challenging no matter where you do it. Join IT executives from Hitachi Global Storage Technologies, The Aerospace Corporation, Novartis, and Cycle Computing and learn how they have used the AWS cloud to deploy mission-critical HPC workloads. Cycle Computing leads the session on how organizations of any scale can run HPC workloads on AWS. Hitachi Global Storage Technologies discusses experiences using the cloud to create next-generation hard drives. The Aerospace Corporation provides perspectives on running MPI and other simulations, and offers insights into considerations like security while running rocket science on the cloud. Novartis Institutes for BioMedical Research talks about a scientific computing environment for performance-benchmarking workloads and large HPC clusters, including a 30,000-core environment for research in the fight against cancer, using the Cancer Genome Atlas (TCGA)."
TRANSCRIPT
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Real-world Cloud HPC at Scale, for
Production Workloads
Jason A Stowe, Cycle Computing
November 15, 2013
Goals for today
• See real-world use cases from 3 leading engineering and scientific computing users
– Steve Phillpott, CIO, HGST, a Western Digital Company
– Bill E. Williams, Director, The Aerospace Corporation
– Michael Steeves, Sr. Systems Engineer, Novartis
• Understand the motivations, strategies, and lessons learned in running HPC / Big Data workloads in the cloud
• See the varying scales and application types that run well, including a 1.21 PetaFLOPS environment
Agenda
• Introduction
• Steve Phillpott – Journey into Cloud
• Bill Williams – Cloud Computing @ Aerospace
• Michael Steeves – Accelerating Science
• Spot, On-demand, & Other Production uses
• Questions and answers
Journey to the Cloud
Steve Phillpott
CIO
HGST, a Western Digital Company
Founded in 2003 through the combination of the hard drive businesses of IBM, the inventor of the hard drive, and Hitachi, Ltd.
Acquired by Western Digital in 2012
More than 4,200 active worldwide patents
Headquartered in San Jose, California
Approximately 41,000 employees worldwide
Develops innovative, advanced hard disk drives, enterprise-class solid state drives, external storage solutions and services
Delivers intelligent storage devices that tightly integrate hardware and software to maximize solution performance
Product portfolio:
• Capacity Enterprise: Ultrastar® & MegaScale DC™ (7200 RPM & CoolSpin HDDs)
• Performance Enterprise: Ultrastar® (10K & 15K HDDs)
• Cloud & Datacenter: Enterprise SSD (+3 acquisitions; PCIe, SAS)
April 2013: Zero to Cloud in 6+ Months
By 31 Oct 2013:
• Cloud email – Microsoft Office 365
• Cloud email archiving/eDiscovery
• External single sign-on (off VPN)
• Cloud file/collaboration – Box
• Cloud CRM – Salesforce.com, integrated to save files in Box
• Cloud high-performance computing (HPC) on Amazon AWS
• Cloud big data platform on Amazon AWS
Responding to the Changing Business Model
Where is our business model headed? "The New Age of Innovation" as a guide:
• N=1: focus on the individual customer experience
• R=G: resources are global
Implications:
– Increase in strategic partnering
– Need for a high level of flexibility
– Leveraging external expertise
Use of the cloud/SaaS aligns with the virtual business model:
• Variable cost model critically important
• Lightweight, scalable services
• Reduced up-front capital spend
• Accelerated provisioning
• Pay as you go
Paradigm Shift: Consumerization of IT ("I have better technology at home")
A new paradigm in ease of use and reduced cost. The consumer web has been driven by a series of platforms, and these platforms are household brand names today. When we use these platforms, it continually amazes us how easily and consistently they work. A new set of services: DRM to iTunes. Yet our workplace applications are cumbersome, costly, difficult to navigate, and require extensive support. (Workday, 2009)
The Big Switch: The Box has Disappeared
The Transformation of Computing as We Know It
• Physical to virtual/digital move: do you really care which computer processed your last Google search?
• Efficiency: do not waste a CPU cycle or a byte of memory. (Building a 4-story building and only using the 1st floor.)
• Utility: IT as a service; plug it in and get it. Where the electricity industry has gone, computing is following, and the shift is almost invisible to the end user.
• DATA is the value to the organization, not the "where".
Enabling the Virtual Organization: Reframing IT Away From Thinking of "The App"
• Enterprise Data Management
• End-to-End Business Processes
• Business Intelligence and Analytics
New IT Organizational Structures: Support and Align to the "New Business Model"
• Software as a Service (SaaS)
• Strategic Outsourcing
• New Computing Platforms
Creating an Innovation Playground: Where to Start and How to Evolve
• Awareness (Educate, build knowledge):
– IT supports business strategy; executive buy-in (CEO, CIO, InfoSec, etc.); reduce cap-ex, optimize data-center usage
– Team involvement, conferences, vendor briefings, expert services, best practices, collaboration with other companies
• Understanding (Experiment: play, learn, build expertise):
– Team approach, hands-on approach
– Understand the value proposition and the constraints
– Identify apps fit for cloud computing; define new processes
• Transition (Migrate, implement):
– Migrate dev/test environments
– Migrate or launch new apps on the cloud
• Commitment (Outcome defined):
– Embrace success; showcase cost savings
– Build an enterprise cloud strategy; learn from each experience; expand accordingly
AWS: ">5x the compute capacity of the next 14 providers combined" – Gartner, Aug 2013
• Access to massive compute and storage
• Billed by the hour: only pay for what is used
• HGST Japan Research Lab: using AWS for a higher-performance, lower-cost, faster-deployed solution vs. buying a huge on-site cluster
Multiple Opportunities to Leverage Amazon Web Services (AWS)
• Develop AWS competency
• Many opportunities: in-house and commercial HPC applications are "cloud ready"
• Provide computing when needed: reduce capital investment and risk, and increase flexibility
• Faster response to business needs: rapid prototyping to pilot new IT capabilities with a "PO process"; set up users, allocate compute and storage in minutes, load apps, and go
• AWS provides a great option for disaster recovery for our on-premises clusters and storage
HGST’s Amazon HPC Platform
[Figure: atoms handled vs. number of cores, from basic molecular simulation to large-scale molecular simulation for HDI (Cases 1-3, relaxation times from < 1 ns to 5 ns). Insets: heat spot in TAR; top view of lube molecules spreading onto COC, 36 nm (300,000 atoms); Case 3: lube depletion in TAR (2D heat profile).]
Base HPC Platform: scalable to thousands of instances to support numerous simultaneous simulations
• Pre- and post-processing server farms; new G2 instances add visualization capabilities
• Molecular dynamics simulation
• Electro-magnetic fields: Ansys HFSS, CST
• Read/write magnetics: MAGLAND, commercial LLG
• Mechanical simulation: Ansys
Big Data's Three "V's"
Trends:
• Variety (data sources, data types, applications): from structured to unstructured, semi-structured & structured
• Volume (data collected; analysis & metadata creation): from terabytes to petabytes & exabytes
• Velocity (data acquisition; analysis & action): from batch to real-time & streaming
Implications & opportunities:
• Hardware and software optimization
• Architectural shifts: scale-out systems, distributed filesystems, tiered storage, Hadoop…
• Key difference: data structure does not need to be defined before loading
Best pragmatic definition, from Snijders et al.: "Data sets so large and complex that they become awkward to work with using standard tools and techniques."
End-to-End Integrated Data: the Big Data Platform
Data sources: HDD, HGA, slider, wafer, media, substrate, field data, supplier; all raw parametric, logistic, and vintage data; SAP/DWs.
Big Data Platform consumers:
• Ad hoc analysis: Tableau and other tools
• Optimize/reduce testing
• New unified EDW: parallelized batch analytics, app-specific views
• New high-value parameters: raw extracts, enriched data
• Proactive drift identification, failure screen tests, customer FA via field data: SAS, Compellon, or other predictive analytic tools
Characteristics of a "Typical" Hadoop / Big Data Cluster
Big Data solutions must support a large variety of compute and I/O operations and storage needs… enter "the cloud."
Hadoop MapReduce I/O-bound operations and workloads:
• Indexing
• Grouping
• Data importing and exporting
• Data movement and transformation
Hadoop MapReduce compute-bound operations and workloads:
• Clustering/classification
• Complex text mining
• Natural-language processing
• Feature extraction
Hadoop handles large data volumes and reliability in the software tier:
– Hadoop distributes data across the cluster and uses replication to ensure data reliability and fault tolerance.
– Each machine in a Hadoop cluster stores AND processes data; machines must do both well.
– Processing is sent directly to the machines storing the data.
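The data-locality model described above is what MapReduce programs are written against. As a minimal sketch (not from the talk), here is the classic word-count pair of map and reduce steps, simulated locally in Python, where the `sorted()` call stands in for Hadoop's shuffle phase:

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit (word, 1) for every word; in Hadoop this runs
    on the node that already holds the data block."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_pairs(pairs):
    """Reducer: sum counts per word. Input must be sorted by key,
    which is exactly what the shuffle phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["the cloud stores the data", "the data"]
    shuffled = sorted(map_lines(lines))  # stand-in for the shuffle phase
    for word, total in reduce_pairs(shuffled):
        print(f"{word}\t{total}")
```

The same two functions could be wired to stdin/stdout and run unmodified under Hadoop Streaming, which is why this shape is the canonical teaching example.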
AWS Big Data Platform Storage Services
• Amazon EBS: block storage for elastic computing; optimized for performance (SSD / 15K / 10K); highly virtualized / SAN-based
• Amazon S3: "generic" object storage; the bulk of AWS storage today; virtualized or reserved use; server/network-based
• Amazon Glacier: cold/cool storage; lowest-cost model for the least-used data; 3-5 hour latency / sequential access
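Tiering "least used" data from S3 down to Glacier is typically automated with an S3 lifecycle rule. A minimal sketch, assuming a hypothetical bucket and prefix; building the rule is runnable as-is, while the boto3 call is shown commented out because it needs real AWS credentials:

```python
def glacier_lifecycle(prefix: str, days: int) -> dict:
    """Build an S3 lifecycle configuration that transitions objects
    under `prefix` to the GLACIER storage class after `days` days."""
    return {
        "Rules": [{
            "ID": f"tier-{prefix.strip('/')}-to-glacier",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
        }]
    }

config = glacier_lifecycle("raw-parametrics/", 90)  # hypothetical prefix

# With credentials configured, this would apply the rule:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="hgst-big-data",          # hypothetical bucket name
#     LifecycleConfiguration=config,
# )
```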
HGST's Other Amazon Use Cases/Capabilities
• Petabyte-scale data warehousing
• Storage "between Glacier & S3"
• Running data visualization tools in AWS
• Resource tracking tool, including a Tableau instance for reporting and visualization
We Are Just Starting with the Cloud
More and more users are coming to IT asking how to leverage this new compute capability.
• Current results from a 6-month effort
• Re-aligning business group leadership
• Demands and use to grow and accelerate
Cloud + HGST IT =
Strong Innovation and Business Partner
Cloud Computing @ Aerospace
Bill Williams, The Aerospace Corporation
Introduction and Background
• IT Executive for The Aerospace Corporation (Aerospace)
• Manage HPC compute and cloud resources for the corporation
• Career path has taken me through end-user support, system administration, and enterprise architecture
Agenda
• Who is Aerospace?
• High Performance Computing @ Aerospace
• Services Provided
• Cloud Motivation
• Where are we today?
• What makes this work?
• Challenges
• Lessons Learned
High Performance Computing @ Aerospace
• Allow engineers and scientists to focus on their
discipline and research
• Reduce and eliminate complexity in using High
Performance Computing (HPC) resources
• Supply and support centralized and networked
HPC resources
Services Provided
• Cluster Computing
• "Big Iron Linux" Dense Core Computing
• High Performance Cloud Computing
• High Performance Storage Systems
• Software Development Revision Control Repository
Cloud Motivation
• Respond to an increasing and variable demand
• Improve resource deployments and use
• Enhance provisioning
• Improve security posture
• Improve disaster recovery posture
• Greener
Where are we today?
• Successfully established elastic clusters in AWS GovCloud – Workload runs include Monte Carlo and Array Simulations
• Key features of the GovCloud clusters are auto-scaling and on-demand computing
• Compute instances are created as needed to meet job computational requirements
• Making strides towards mimicking internal clusters in GovCloud
What makes this work?
• AWS GovCloud – GovCloud is FedRAMP compliant
• Secure transport to and from Aerospace – VPC provides an additional layer of security while data is in transit
• Cycle Computing – Cycle provides cluster auto-scaling
Lessons Learned
• Enhanced analytics and business intelligence
• Customer success stories
• Standard images
• Demonstrated operational “agility”
Lessons Learned
• Domain space is dynamic
• Expertise required
• Layers of complexity
• Ensuring data security (in hybrid deployment model)
Challenges
• Establishing a cloud storage infrastructure
• Determining appropriate bandwidth between Aerospace and GovCloud
• Library replication of internal systems
• System integration with internal authentication services
• Ensuring a seamless transition to hybrid services
What’s Next?
• Expand offerings
• Explore charge back
• Explore “cloudifying” other HPC platforms
• Track technology
• Provide workload specific ad-hoc offerings
• Provide surge capability for HPC resources
Novartis Institutes for BioMedical Research (NIBR)
• Unique research strategy driven by patient needs
• World-class research organization with about 6,000 scientists globally
• Intensifying focus on molecular pathways shared by various diseases
• Integration of clinical insights with mechanistic understanding of disease
• Research-to-development transition redefined through fast and rigorous "proof-of-concept" trials
• Strategic alliances with academia and biotech strengthen preclinical pipeline
Requirements
• Large-scale computational chemistry simulation
• Results in under a week
• Ability to run multiple experiments "on-demand"
Challenges
• Sustained access to 50,000+ compute cores
• Ability to monitor and re-launch jobs
• No additional capital expenditure
• Internal HPC cluster already running at capacity
Job Profile
• Embarrassingly parallel
• CPU-bound
• Low I/O, memory, and network requirements
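A job profile like this (independent per-compound tasks, CPU-bound, minimal I/O) maps directly onto a worker pool. A minimal sketch using Python's multiprocessing, where `score_compound` is a hypothetical stand-in for the real docking/scoring code, not Novartis's software:

```python
from multiprocessing import Pool

def score_compound(compound_id: int) -> tuple:
    """Hypothetical stand-in for CPU-bound docking/scoring of one
    compound against the target; each call is fully independent."""
    score = (compound_id * 2654435761 % 997) / 997.0  # deterministic dummy score
    return compound_id, score

def screen(compounds, workers=4):
    """Embarrassingly parallel: no shared state and no inter-task
    communication, so throughput scales with the number of cores
    (or cloud instances) thrown at it."""
    with Pool(processes=workers) as pool:
        results = pool.map(score_compound, compounds)
    # Best-scoring candidates first
    return sorted(results, key=lambda r: r[1], reverse=True)

if __name__ == "__main__":
    top = screen(range(1000))[:3]
    print(top)
```

At cloud scale the `Pool` becomes a scheduler distributing the same independent tasks across thousands of instances; the structure of the problem is unchanged.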
Accelerating the Science: Virtual Screening
[Diagram: the target molecule's binding site is the "lock"; candidate compound molecules are the "keys."]
The Cloud: Flexible Science on Flexible Infrastructure
Engineering the right infrastructure for a workload:
• Software runs the same job many times across instance types
• Measures the throughput and determines the $ per job
• Use the instances that provide the best scientific ROI
• The CC2 instance (Intel® Xeon® "Sandy Bridge") ran best for this workload
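The benchmarking loop described above (run the same reference job on each instance type, measure throughput, rank by $ per job) can be sketched as follows. The instance names are real 2013-era EC2 types, but the prices and throughputs below are placeholder numbers, not measurements from the talk:

```python
def cost_per_job(price_per_hour: float, jobs_per_hour: float) -> float:
    """Dollars spent per completed job on one instance type."""
    return price_per_hour / jobs_per_hour

def rank_instances(benchmarks: dict) -> list:
    """benchmarks maps instance type -> (price $/hr, measured jobs/hr).
    Returns instance types sorted cheapest-per-job first, i.e. the
    best 'scientific ROI' order."""
    ranked = [(name, cost_per_job(price, rate))
              for name, (price, rate) in benchmarks.items()]
    return sorted(ranked, key=lambda r: r[1])

measured = {
    "cc2.8xlarge": (2.40, 120.0),   # placeholder numbers
    "m1.xlarge":   (0.48, 15.0),
    "c1.xlarge":   (0.58, 22.0),
}
best, dollars = rank_instances(measured)[0]
print(best, round(dollars, 4))  # with these numbers, cc2.8xlarge wins
```

The point of the exercise is that the pricier instance can still be the cheapest per unit of science once throughput is measured, which matches the CC2 result above.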
Metric: Count
• Compute hours of science: 341,700 hours
• Compute days of science: 14,238 days
• Compute years of science: 39 years
• AWS instance count (CC2): 10,600 instances
Super Computing in the Cloud
$44 Million infrastructure
10 million compounds screened
39 Drug Design years in 11 hours for a cost of …$4,232
3 compounds identified and synthesized for screening
Key Learnings/What’s Next?
Diversity of Life Sciences brings unique challenges
Spend the time analyzing and tuning
Flexibility, Scalability and Performance
Time to rethink and retool
Challenge the Science and the Scientist
Collaboration
Future plans:
• Chemical universe: 166 billion compounds (extreme-scale CPU)
• Next-generation sequencing in the cloud (extreme CPU, memory, I/O)
• "Disruptive" technologies: imaging (10x the data of NGS!)
Using On-Demand and Spot Instances together
When task durations are longer than 1 hour, or tasks require multiple machines (MPI) for long periods, use On-Demand Instances. Shorter workloads work great on Spot Instances. If you want a guaranteed end time, use On-Demand as well, so the architecture looks like…
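That rule of thumb can be written down directly; the thresholds and parameter names below are illustrative, not CycleCloud's actual policy engine:

```python
def choose_purchasing_option(task_hours: float,
                             needs_mpi: bool,
                             hard_deadline: bool) -> str:
    """Pick 'on-demand' or 'spot' for a task, per the guidance above:
    long tasks, multi-machine MPI jobs, and deadline-bound work go
    On-Demand; short interruptible tasks go to cheaper Spot capacity."""
    if task_hours > 1.0 or needs_mpi or hard_deadline:
        return "on-demand"
    return "spot"

# Sanity checks of the rule
assert choose_purchasing_option(0.5, False, False) == "spot"
assert choose_purchasing_option(3.0, False, False) == "on-demand"
assert choose_purchasing_option(0.5, True, False) == "on-demand"
```

The asymmetry exists because a Spot interruption wastes at most one short task's work, while losing a node mid-way through a long MPI run can waste the whole multi-machine computation.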
CycleCloud Deploys Secured, Auto-Scaled HPC Clusters
Scale from 150 to 150,000+ cores.
[Architecture: users and legacy internal HPC submit work; CycleCloud checks the job load and calculates the ideal HPC cluster. Load-based Spot bidding properly prices the bids and manages Spot Instance loss. The resulting HPC cluster combines Spot Instance execute nodes (auto-started and auto-stopped when the calculation is faster/cheaper) with On-Demand execute nodes (guaranteed finish), backed by a shared filesystem / S3, with HPC orchestration to handle Spot Instance bids and loss.]
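The "check job load, calculate ideal HPC cluster" step can be sketched as a simple sizing function. The one-core-per-task rule and the clamping bounds are assumptions for illustration (the bounds echo the 150 to 150,000+ core range quoted in the talk), not Cycle's actual algorithm:

```python
def ideal_cluster_cores(queued_tasks: int,
                        cores_per_task: int = 1,
                        min_cores: int = 150,
                        max_cores: int = 150_000) -> int:
    """Target core count for the auto-scaled cluster: enough cores to
    run the queue, clamped to a floor (keep a warm minimum) and a
    ceiling (cost / quota limit)."""
    wanted = queued_tasks * cores_per_task
    return max(min_cores, min(wanted, max_cores))

print(ideal_cluster_cores(0))        # floor applies
print(ideal_cluster_cores(80_000))   # demand-driven
print(ideal_cluster_cores(10**7))    # ceiling applies
```

An orchestrator would re-evaluate this target on a timer, starting nodes when the target rises and draining idle nodes when it falls.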
Other Production use cases
• Sequencing, Genomics, Life Sciences
• MPI workloads for FEA, CFD, energy, utilities
• MATLAB and R applications for stats/modeling
• Windows HPC Server cluster for finance
• Heat transfer and other FEA
• Insurance risk management
• Rendering/VFX
Designing Solar Materials
The challenge is efficiency: we need to efficiently turn photons from the sun into electricity. The number of possible materials is limitless:
• Need to separate the right compounds from the useless ones
• If the 20th century was the century of silicon, the 21st will be all organic
How do we find the right material out of 205,000 without spending the entire 21st century looking for it?
EMBARGOED until Nov. 12, 2013 8 a.m. EST
Challenge: 205,000 compounds totaling 2,312,959 core-hours, or 264 core-years.
16,788 Spot Instances, 156,314 cores: 1.21 PetaFLOPS (Rpeak), equivalent to #29 on the Jun 2013 Top500 list.
Done in 18 hours: access to a $68M system for $33k.
Solution: 205,000 compounds and 264 core-years on a 156k-core utility HPC cluster, in 18 hours, for $0.16/molecule, using Schrödinger Materials Science tools, CycleCloud, and AWS Spot Instances.
Question and Answer
How does utility HPC apply to your organization?
Follow us: @cyclecomputing, @jasonastowe
Come to Cycle’s booth: #1112
We're hiring: [email protected]