Real-World Cloud HPC at Scale, for Production Workloads (BDT212) | AWS re:Invent 2013
DESCRIPTION
"Running high-performance scientific and engineering applications is challenging no matter where you do it. Join IT executives from Hitachi Global Storage Technologies, The Aerospace Corporation, Novartis, and Cycle Computing and learn how they have used the AWS cloud to deploy mission-critical HPC workloads. Cycle Computing leads the session on how organizations of any scale can run HPC workloads on AWS. Hitachi Global Storage Technologies discusses experiences using the cloud to create next-generation hard drives. The Aerospace Corporation provides perspectives on running MPI and other simulations, and offers insights into considerations like security while running rocket science on the cloud. Novartis Institutes for BioMedical Research talks about a scientific computing environment for performance-benchmarking workloads and large HPC clusters, including a 30,000-core environment for research in the fight against cancer, using the Cancer Genome Atlas (TCGA)."
TRANSCRIPT
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Real-world Cloud HPC at Scale, for
Production Workloads
Jason A Stowe, Cycle Computing
November 15, 2013
Goals for today
• See real-world use cases from 3 leading engineering and scientific computing users
– Steve Phillpott, CIO, HGST, a Western Digital Company
– Bill E. Williams, Director, The Aerospace Corporation
– Michael Steeves, Sr. Systems Engineer, Novartis
• Understand the motivations, strategies, and lessons learned in running HPC / Big Data workloads in the cloud
• See the varying scales and application types that run well, including a 1.21 PetaFLOPS environment
Agenda
• Introduction
• Steve Phillpott – Journey into Cloud
• Bill Williams – Cloud Computing @ Aerospace
• Michael Steeves – Accelerating Science
• Spot, On-demand, & Other Production uses
• Questions and answers
Journey to the Cloud
Steve Phillpott
CIO
HGST, a Western Digital Company
Founded in 2003 through the combination of the hard drive businesses of IBM, the inventor of the hard drive, and Hitachi, Ltd.
Acquired by Western Digital in 2012
More than 4,200 active worldwide patents
Headquartered in San Jose, California
Approximately 41,000 employees worldwide
Develops innovative, advanced hard disk drives, enterprise-class solid state drives, external storage solutions and services
Delivers intelligent storage devices that tightly integrate hardware and software to maximize solution performance
Product portfolio:
• Capacity Enterprise: Ultrastar® & MegaScale DC™ (7200 RPM & CoolSpin HDDs)
• Performance Enterprise: Ultrastar® (10K & 15K HDDs)
• Cloud & Datacenter: Enterprise SSD (+3 acquisitions; PCIe, SAS)
April 2013: Zero to Cloud in 6+ Months
By 31 Oct 2013:
• Cloud email – Microsoft Office 365
• Cloud email archiving/eDiscovery
• External single sign-on (off VPN)
• Cloud file/collaboration – Box
• Cloud CRM – Salesforce.com, integrated to save files in Box
• Cloud high-performance computing (HPC) on Amazon AWS
• Cloud big data platform on Amazon AWS
Responding to the Changing Business Model
Where is our business model headed? "The New Age of Innovation" as a guide:
• N=1: focus on the individual customer experience
• R=G: resources are global
Implications:
– Increase in strategic partnering
– Need for a high level of flexibility
– Leveraging external expertise
Use of the cloud/SaaS aligns with the virtual business model:
• Variable cost model critically important
• Lightweight, scalable services
• Reduced up-front capital spend
• Accelerated provisioning
• Pay as you go
Paradigm Shift: Consumerization of IT ("I have better technology at home")
A new paradigm in ease of use and reduced cost. The consumer web has been driven by a series of platforms, and these platforms are household brand names today. When we use these platforms, it continually amazes us how easily and consistently they work. A new set of services: DRM to iTunes. Yet our workplace applications are cumbersome, costly, difficult to navigate, and require extensive support. (Workday, 2009)
The Big Switch: The Box has Disappeared
The Transformation of Computing as We Know It
• Physical to virtual/digital move: do you really care which computer processed your last Google search?
• Efficiency: do not waste a CPU cycle or a byte of memory. (Building a 4-story building and only using the 1st floor.)
• Utility: IT as a service; plug it in and get it. Where the electricity industry has gone, computing is following, and the shift is almost invisible to the end user.
• DATA is the value to the organization, not the "where".
Enabling the Virtual Organization: Reframing IT Away From Thinking of "The App"
• Enterprise Data Management
• End-to-End Business Processes
• Business Intelligence and Analytics
New IT Organizational Structures: Support and Align to the "New Business Model"
• Software as a Service (SaaS)
• Strategic Outsourcing
• New Computing Platforms
Creating an Innovation Playground: Where to Start and How to Evolve
• Awareness (Educate, build knowledge):
– IT supports business strategy; executive buy-in (CEO, CIO, InfoSec, etc.); reduce cap-ex, optimize data-center usage
– Team involvement, conferences, vendor briefings, expert services, best practices, collaboration with other companies
• Understanding (Experiment: play, learn, build expertise):
– Team approach, hands-on approach
– Understand the value proposition and the constraints
– Identify apps fit for cloud computing; define new processes
• Transition (Migrate, implement):
– Migrate dev/test environments
– Migrate or launch new apps on the cloud
• Commitment (Outcome defined):
– Embrace success; showcase cost savings
– Build an enterprise cloud strategy; learn from each experience; expand accordingly
AWS: ">5x the compute capacity of the next 14 providers combined" – Gartner, Aug 2013
• Access to massive compute and storage
• Billed by the hour: only pay for what is used
• HGST Japan Research Lab: using AWS for a higher-performance, lower-cost, faster-deployed solution vs. buying a huge on-site cluster
Multiple Opportunities to Leverage Amazon Web Services (AWS)
• Develop AWS competency
• Many opportunities: in-house and commercial HPC applications are "cloud ready"
• Provide computing when needed: reduce capital investment and risk, and increase flexibility
• Faster response to business needs: rapid prototyping to pilot new IT capabilities with a "PO process"; set up users, allocate compute and storage in minutes, load apps, and go
• AWS provides a great option for disaster recovery for our on-premises clusters and storage
HGST’s Amazon HPC Platform
[Figure: atoms handled vs. number of cores, from basic molecular simulation to large-scale molecular simulation for HDI (Cases 1-3, relaxation times from < 1 ns to 5 ns). Insets: heat spot in TAR; top view of lube molecules spreading onto COC, 36 nm (300,000 atoms); Case 3: lube depletion in TAR (2D heat profile).]
Base HPC Platform: scalable to thousands of instances to support numerous simultaneous simulations
• Pre- and post-processing server farms; new G2 instances add visualization capabilities
• Molecular dynamics simulation
• Electro-magnetic fields: Ansys HFSS, CST
• Read/write magnetics: MAGLAND, commercial LLG
• Mechanical simulation: Ansys
Big Data's Three "V's"
Trends:
• Variety (data sources, data types, applications): from structured to unstructured, semi-structured & structured
• Volume (data collected; analysis & metadata creation): from terabytes to petabytes & exabytes
• Velocity (data acquisition; analysis & action): from batch to real-time & streaming
Implications & opportunities:
• Hardware and software optimization
• Architectural shifts: scale-out systems, distributed filesystems, tiered storage, Hadoop…
• Key difference: data structure does not need to be defined before loading
Best pragmatic definition, from Snijders et al.: "Data sets so large and complex that they become awkward to work with using standard tools and techniques."
End-to-End Integrated Data: the Big Data Platform
Data sources: HDD, HGA, slider, wafer, media, substrate, field data, supplier; all raw parametric, logistic, and vintage data; SAP/DWs.
Big Data Platform consumers:
• Ad hoc analysis: Tableau and other tools
• Optimize/reduce testing
• New unified EDW: parallelized batch analytics, app-specific views
• New high-value parameters: raw extracts, enriched data
• Proactive drift identification, failure screen tests, customer FA via field data: SAS, Compellon, or other predictive analytic tools
Characteristics of a "Typical" Hadoop / Big Data Cluster
Big Data solutions must support a large variety of compute and I/O operations and storage needs… enter "the cloud."
Hadoop MapReduce I/O-bound operations and workloads:
• Indexing
• Grouping
• Data importing and exporting
• Data movement and transformation
Hadoop MapReduce compute-bound operations and workloads:
• Clustering/classification
• Complex text mining
• Natural-language processing
• Feature extraction
Hadoop handles large data volumes and reliability in the software tier:
– Hadoop distributes data across the cluster and uses replication to ensure data reliability and fault tolerance.
– Each machine in a Hadoop cluster stores AND processes data; machines must do both well.
– Processing is sent directly to the machines storing the data.
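The data-locality model described above is what MapReduce programs are written against. As a minimal sketch (not from the talk), here is the classic word-count pair of map and reduce steps, simulated locally in Python, where the `sorted()` call stands in for Hadoop's shuffle phase:

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit (word, 1) for every word; in Hadoop this runs
    on the node that already holds the data block."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_pairs(pairs):
    """Reducer: sum counts per word. Input must be sorted by key,
    which is exactly what the shuffle phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    lines = ["the cloud stores the data", "the data"]
    shuffled = sorted(map_lines(lines))  # stand-in for the shuffle phase
    for word, total in reduce_pairs(shuffled):
        print(f"{word}\t{total}")
```

The same two functions could be wired to stdin/stdout and run unmodified under Hadoop Streaming, which is why this shape is the canonical teaching example.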
AWS Big Data Platform Storage Services
• Amazon EBS: block storage for elastic computing; optimized for performance (SSD / 15K / 10K); highly virtualized / SAN-based
• Amazon S3: "generic" object storage; the bulk of AWS storage today; virtualized or reserved use; server/network-based
• Amazon Glacier: cold/cool storage; lowest-cost model for the least-used data; 3-5 hour latency / sequential access
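Tiering "least used" data from S3 down to Glacier is typically automated with an S3 lifecycle rule. A minimal sketch, assuming a hypothetical bucket and prefix; building the rule is runnable as-is, while the boto3 call is shown commented out because it needs real AWS credentials:

```python
def glacier_lifecycle(prefix: str, days: int) -> dict:
    """Build an S3 lifecycle configuration that transitions objects
    under `prefix` to the GLACIER storage class after `days` days."""
    return {
        "Rules": [{
            "ID": f"tier-{prefix.strip('/')}-to-glacier",
            "Filter": {"Prefix": prefix},
            "Status": "Enabled",
            "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
        }]
    }

config = glacier_lifecycle("raw-parametrics/", 90)  # hypothetical prefix

# With credentials configured, this would apply the rule:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="hgst-big-data",          # hypothetical bucket name
#     LifecycleConfiguration=config,
# )
```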
HGST's Other Amazon Use Cases/Capabilities
• Petabyte-scale data warehousing
• Storage "between Glacier & S3"
• Running data visualization tools in AWS
• Resource tracking tool, including a Tableau instance for reporting and visualization
We Are Just Starting with the Cloud
More and more users are coming to IT asking how to leverage this new compute capability.
• Current results from a 6-month effort
• Re-aligning business group leadership
• Demands and use to grow and accelerate
Cloud + HGST IT =
Strong Innovation and Business Partner
Cloud Computing @ Aerospace
Bill Williams, The Aerospace Corporation
Introduction and Background
• IT Executive for The Aerospace Corporation (Aerospace)
• Manage HPC compute and cloud resources for the corporation
• Career path has taken me through end-user support, system administration, and enterprise architecture
Agenda
• Who is Aerospace?
• High Performance Computing @ Aerospace
• Services Provided
• Cloud Motivation
• Where are we today?
• What makes this work?
• Challenges
• Lessons Learned
High Performance Computing @ Aerospace
• Allow engineers and scientists to focus on their
discipline and research
• Reduce and eliminate complexity in using High
Performance Computing (HPC) resources
• Supply and support centralized and networked
HPC resources
Services Provided
• Cluster Computing
• "Big Iron Linux" Dense Core Computing
• High Performance Cloud Computing
• High Performance Storage Systems
• Software Development Revision Control Repository
Cloud Motivation
• Respond to an increasing and variable demand
• Improve resource deployments and use
• Enhance provisioning
• Improve security posture
• Improve disaster recovery posture
• Greener
Where are we today?
• Successfully established elastic clusters in AWS GovCloud – Workload runs include Monte Carlo and Array Simulations
• Key features of the GovCloud clusters are auto-scaling and on-demand computing
• Compute instances are created as needed to meet job computational requirements
• Making strides towards mimicking internal clusters in GovCloud
What makes this work?
• AWS GovCloud – GovCloud is FedRAMP compliant
• Secure transport to and from Aerospace – VPC provides an additional layer of security while data is in transit
• Cycle Computing – Cycle provides cluster auto-scaling
Lessons Learned
• Enhanced analytics and business intelligence
• Customer success stories
• Standard images
• Demonstrated operational “agility”
Lessons Learned
• Domain space is dynamic
• Expertise required
• Layers of complexity
• Ensuring data security (in hybrid deployment model)
Challenges
• Establishing a cloud storage infrastructure
• Determining appropriate bandwidth between Aerospace and GovCloud
• Library replication of internal systems
• System integration with internal authentication services
• Ensuring a seamless transition to hybrid services
What’s Next?
• Expand offerings
• Explore charge back
• Explore “cloudifying” other HPC platforms
• Track technology
• Provide workload specific ad-hoc offerings
• Provide surge capability for HPC resources
Novartis Institutes for BioMedical Research (NIBR)
• Unique research strategy driven by patient needs
• World-class research organization with about 6,000 scientists globally
• Intensifying focus on molecular pathways shared by various diseases
• Integration of clinical insights with mechanistic understanding of disease
• Research-to-development transition redefined through fast and rigorous "proof-of-concept" trials
• Strategic alliances with academia and biotech strengthen preclinical pipeline
Requirements
• Large-scale computational chemistry simulation
• Results in under a week
• Ability to run multiple experiments "on-demand"
Challenges
• Sustained access to 50,000+ compute cores
• Ability to monitor and re-launch jobs
• No additional capital expenditure
• Internal HPC cluster already running at capacity
Job Profile
• Embarrassingly parallel
• CPU-bound
• Low I/O, memory, and network requirements
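A job profile like this (independent per-compound tasks, CPU-bound, minimal I/O) maps directly onto a worker pool. A minimal sketch using Python's multiprocessing, where `score_compound` is a hypothetical stand-in for the real docking/scoring code, not Novartis's software:

```python
from multiprocessing import Pool

def score_compound(compound_id: int) -> tuple:
    """Hypothetical stand-in for CPU-bound docking/scoring of one
    compound against the target; each call is fully independent."""
    score = (compound_id * 2654435761 % 997) / 997.0  # deterministic dummy score
    return compound_id, score

def screen(compounds, workers=4):
    """Embarrassingly parallel: no shared state and no inter-task
    communication, so throughput scales with the number of cores
    (or cloud instances) thrown at it."""
    with Pool(processes=workers) as pool:
        results = pool.map(score_compound, compounds)
    # Best-scoring candidates first
    return sorted(results, key=lambda r: r[1], reverse=True)

if __name__ == "__main__":
    top = screen(range(1000))[:3]
    print(top)
```

At cloud scale the `Pool` becomes a scheduler distributing the same independent tasks across thousands of instances; the structure of the problem is unchanged.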
Accelerating the Science: Virtual Screening
[Diagram: the target molecule's binding site is the "lock"; candidate compound molecules are the "keys."]
The Cloud: Flexible Science on Flexible Infrastructure
Engineering the right infrastructure for a workload:
• Software runs the same job many times across instance types
• Measures the throughput and determines the $ per job
• Use the instances that provide the best scientific ROI
• The CC2 instance (Intel® Xeon® "Sandy Bridge") ran best for this workload
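The benchmarking loop described above (run the same reference job on each instance type, measure throughput, rank by $ per job) can be sketched as follows. The instance names are real 2013-era EC2 types, but the prices and throughputs below are placeholder numbers, not measurements from the talk:

```python
def cost_per_job(price_per_hour: float, jobs_per_hour: float) -> float:
    """Dollars spent per completed job on one instance type."""
    return price_per_hour / jobs_per_hour

def rank_instances(benchmarks: dict) -> list:
    """benchmarks maps instance type -> (price $/hr, measured jobs/hr).
    Returns instance types sorted cheapest-per-job first, i.e. the
    best 'scientific ROI' order."""
    ranked = [(name, cost_per_job(price, rate))
              for name, (price, rate) in benchmarks.items()]
    return sorted(ranked, key=lambda r: r[1])

measured = {
    "cc2.8xlarge": (2.40, 120.0),   # placeholder numbers
    "m1.xlarge":   (0.48, 15.0),
    "c1.xlarge":   (0.58, 22.0),
}
best, dollars = rank_instances(measured)[0]
print(best, round(dollars, 4))  # with these numbers, cc2.8xlarge wins
```

The point of the exercise is that the pricier instance can still be the cheapest per unit of science once throughput is measured, which matches the CC2 result above.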
Metric: Count
• Compute hours of science: 341,700 hours
• Compute days of science: 14,238 days
• Compute years of science: 39 years
• AWS instance count (CC2): 10,600 instances
Super Computing in the Cloud
$44 Million infrastructure
10 million compounds screened
39 Drug Design years in 11 hours for a cost of …$4,232
3 compounds identified and synthesized for screening
Key Learnings/What’s Next?
Diversity of Life Sciences brings unique challenges
Spend the time analyzing and tuning
Flexibility, Scalability and Performance
Time to rethink and retool
Challenge the Science and the Scientist
Collaboration
Future plans:
• Chemical universe: 166 billion compounds (extreme-scale CPU)
• Next-generation sequencing in the cloud (extreme CPU, memory, I/O)
• "Disruptive" technologies: imaging (10x the data of NGS!)
Using On-Demand and Spot Instances together
When task durations are longer than 1 hour, or tasks require multiple machines (MPI) for long periods, use On-Demand Instances. Shorter workloads work great on Spot Instances. If you want a guaranteed end time, use On-Demand as well, so the architecture looks like…
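That rule of thumb can be written down directly; the thresholds and parameter names below are illustrative, not CycleCloud's actual policy engine:

```python
def choose_purchasing_option(task_hours: float,
                             needs_mpi: bool,
                             hard_deadline: bool) -> str:
    """Pick 'on-demand' or 'spot' for a task, per the guidance above:
    long tasks, multi-machine MPI jobs, and deadline-bound work go
    On-Demand; short interruptible tasks go to cheaper Spot capacity."""
    if task_hours > 1.0 or needs_mpi or hard_deadline:
        return "on-demand"
    return "spot"

# Sanity checks of the rule
assert choose_purchasing_option(0.5, False, False) == "spot"
assert choose_purchasing_option(3.0, False, False) == "on-demand"
assert choose_purchasing_option(0.5, True, False) == "on-demand"
```

The asymmetry exists because a Spot interruption wastes at most one short task's work, while losing a node mid-way through a long MPI run can waste the whole multi-machine computation.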
CycleCloud Deploys Secured, Auto-Scaled HPC Clusters
Scale from 150 to 150,000+ cores.
[Architecture: users and legacy internal HPC submit work; CycleCloud checks the job load and calculates the ideal HPC cluster. Load-based Spot bidding properly prices the bids and manages Spot Instance loss. The resulting HPC cluster combines Spot Instance execute nodes (auto-started and auto-stopped when the calculation is faster/cheaper) with On-Demand execute nodes (guaranteed finish), backed by a shared filesystem / S3, with HPC orchestration to handle Spot Instance bids and loss.]
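The "check job load, calculate ideal HPC cluster" step can be sketched as a simple sizing function. The one-core-per-task rule and the clamping bounds are assumptions for illustration (the bounds echo the 150 to 150,000+ core range quoted in the talk), not Cycle's actual algorithm:

```python
def ideal_cluster_cores(queued_tasks: int,
                        cores_per_task: int = 1,
                        min_cores: int = 150,
                        max_cores: int = 150_000) -> int:
    """Target core count for the auto-scaled cluster: enough cores to
    run the queue, clamped to a floor (keep a warm minimum) and a
    ceiling (cost / quota limit)."""
    wanted = queued_tasks * cores_per_task
    return max(min_cores, min(wanted, max_cores))

print(ideal_cluster_cores(0))        # floor applies
print(ideal_cluster_cores(80_000))   # demand-driven
print(ideal_cluster_cores(10**7))    # ceiling applies
```

An orchestrator would re-evaluate this target on a timer, starting nodes when the target rises and draining idle nodes when it falls.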
Other Production use cases
• Sequencing, Genomics, Life Sciences
• MPI workloads for FEA, CFD, energy, utilities
• MATLAB and R applications for stats/modeling
• Windows HPC Server cluster for finance
• Heat transfer and other FEA
• Insurance risk management
• Rendering/VFX
Designing Solar Materials
The challenge is efficiency: we need to efficiently turn photons from the sun into electricity. The number of possible materials is limitless:
• Need to separate the right compounds from the useless ones
• If the 20th century was the century of silicon, the 21st will be all organic
How do we find the right material out of 205,000 without spending the entire 21st century looking for it?
EMBARGOED until Nov. 12, 2013 8 a.m. EST
Challenge: 205,000 compounds totaling 2,312,959 core-hours, or 264 core-years.
16,788 Spot Instances, 156,314 cores: 1.21 PetaFLOPS (Rpeak), equivalent to #29 on the Jun 2013 Top500 list.
Done in 18 hours: access to a $68M system for $33k.
Solution: 205,000 compounds and 264 core-years on a 156k-core utility HPC cluster, in 18 hours, for $0.16/molecule, using Schrödinger Materials Science tools, CycleCloud, and AWS Spot Instances.
Question and Answer
How does utility HPC apply to your organization?
Follow us: @cyclecomputing, @jasonastowe
Come to Cycle’s booth: #1112
We're hiring: [email protected]