
Page 1: GPU based cloud computing

© NVIDIA Corporation 2010

GPU based cloud computing

Dairsie Latimer, Petapath, UK

Page 2: GPU based cloud computing


About Petapath

Founded in 2008 to focus on delivering innovative hardware and software solutions into the high performance computing (HPC) markets

Partnered with HP and SGI to deliver two Petascale prototype systems as part of the PRACE WP8 programme

The system is a testbed for new ideas in usability, scalability and efficiency of large computer installations

Active in exploiting emerging standards for acceleration technologies; we are members of the Khronos Group and sit on the OpenCL working committee

We also provide consulting expertise for companies wishing to explore the advantages offered by heterogeneous systems


Page 3: GPU based cloud computing


What is Heterogeneous or GPU Computing?

Heterogeneous Computing: computing with CPU + GPU

[Diagram: x86 CPU and GPU connected via the PCIe bus]

Page 4: GPU based cloud computing


Low Latency or High Throughput?

CPU: optimised for low-latency access to cached data sets; control logic for out-of-order and speculative execution

GPU: optimised for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation
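To make the division of labour concrete, here is a minimal, illustrative CUDA sketch (not from the deck, kernel name hypothetical): the host CPU owns and initialises the data, copies it across the PCIe bus, and launches a data-parallel SAXPY kernel on the throughput-oriented GPU.

```
// Host (CPU) orchestrates; device (GPU) runs one thread per element.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // data-parallel: one thread per element
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hx = (float*)malloc(bytes), *hy = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU over PCIe
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, dx, dy);    // throughput-oriented kernel

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU over PCIe
    printf("y[0] = %f\n", hy[0]);                        // expect 5.0
    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}
```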

Page 5: GPU based cloud computing


NVIDIA GPU Computing Ecosystem

[Ecosystem diagram: NVIDIA hardware solutions (GPU architecture, hardware architecture, deployment) and the CUDA SDK & tools feed a partner channel of TPP/OEMs, CUDA development specialists, ISVs, CUDA training companies, hardware architects and VARs, serving customer requirements and customer applications.]

Page 6: GPU based cloud computing


Science is Desperate for Throughput

[Chart: gigaflops required by molecular simulations from 1982 to 2012, from BPTI (3K atoms) and the estrogen receptor (36K atoms) through F1-ATPase (327K atoms), the ribosome (2.7M atoms) and the chromatophore (50M atoms) to bacteria with hundreds of chromatophores, rising from gigaflops toward 1 petaflop and 1 exaflop. An early simulation ran for 8 months to simulate 2 nanoseconds.]

Page 7: GPU based cloud computing


Power Crisis in Supercomputing

[Chart: supercomputer power vs. household equivalent, 1982 to 2020: gigaflop-class at about 60,000 Watts (a block), teraflop at about 850,000 Watts (a neighborhood), petaflop (Jaguar, Los Alamos) at about 7,000,000 Watts (a town), and a projected exaflop system at about 25,000,000 Watts (a city).]

Page 8: GPU based cloud computing


Enter the GPU

NVIDIA GPU Product Families

Tesla™: High-Performance Computing
Quadro®: Design & Creation
GeForce®: Entertainment

Page 9: GPU based cloud computing


NEXT-GENERATION GPU ARCHITECTURE — ‘FERMI’

Page 10: GPU based cloud computing


3 billion transistors

Up to 2× the cores (C2050 has 448)

Up to 8× the peak DP performance

ECC on all memories

L1 and L2 caches

Improved memory bandwidth (GDDR5)

Up to 1 Terabyte of GPU memory

Concurrent kernels

Hardware support for C++

[Fermi die diagram: GigaThread engine, host interface, six DRAM interfaces and a shared L2 cache surrounding the core array.]

Introducing the ‘Fermi’ Tesla Architecture: the soul of a supercomputer in the body of a GPU

Page 11: GPU based cloud computing


Design Goal of Fermi

[Chart: the CPU covers the instruction-parallel, many-decisions regime; the GPU covers the data-parallel, large-data-sets regime.]

Expand performance sweet spot of the GPU

Bring more users, more applications to the GPU

Page 12: GPU based cloud computing


Streaming Multiprocessor Architecture

[SM diagram: instruction cache, dual scheduler and dispatch units, register file, 32 CUDA cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable shared memory/L1 cache, uniform cache.]

32 CUDA cores per SM (512 total)

8× the peak double precision floating point performance of the previous generation

Double precision runs at 50% of the peak single precision rate

Dual Thread Scheduler

64 KB of RAM for shared memory and L1 cache (configurable)
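As an illustrative sketch (assumed code, not from the slides, kernel name hypothetical), a kernel that stages data in __shared__ memory can ask the CUDA runtime for the larger shared-memory split of that 64 KB of on-chip RAM:

```
// Block-wide reduction staged in shared memory, plus the per-kernel request for the
// 48 KB shared / 16 KB L1 split of the configurable on-chip RAM.
#include <cuda_runtime.h>

__global__ void block_sum(const float *in, float *block_out, int n) {
    __shared__ float scratch[256];                  // lives in the configurable on-chip RAM
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) block_out[blockIdx.x] = scratch[0];
}

void configure_and_launch(const float *d_in, float *d_block_out, int n) {
    // Prefer shared memory over L1 for this kernel; cudaFuncCachePreferL1 requests the inverse split.
    cudaFuncSetCacheConfig(block_sum, cudaFuncCachePreferShared);
    block_sum<<<(n + 255) / 256, 256>>>(d_in, d_block_out, n);
}
```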

Page 13: GPU based cloud computing


CUDA Core Architecture

[CUDA core diagram: dispatch port, operand collector, FP unit and INT unit, result queue.]

New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs

Fused multiply-add (FMA) instruction for both single and double precision

New integer ALU optimized for 64-bit and extended precision operations
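For illustration (a hypothetical kernel, not from the deck), CUDA device code can request fused multiply-adds explicitly through the standard fmaf()/fma() math functions; the compiler will also contract a*b+c into FMA on its own.

```
// FMA computes a*b+c with a single rounding, in both single and double precision.
__global__ void fma_demo(const float *a, const float *b, const float *c, float *out_f,
                         const double *x, const double *y, const double *z, double *out_d,
                         int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_f[i] = fmaf(a[i], b[i], c[i]);   // single precision fused multiply-add
        out_d[i] = fma(x[i], y[i], z[i]);    // double precision fused multiply-add
    }
}
```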

Page 14: GPU based cloud computing


Cached Memory Hierarchy

First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

L1 cache per SM (32 cores): improves bandwidth and reduces latency

Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU

Parallel DataCache™ memory hierarchy

Page 15: GPU based cloud computing


Larger, Faster, Resilient Memory Interface

GDDR5 memory interface: 2× the signaling speed of GDDR3

Up to 1 Terabyte of memory attached to the GPU: operate on larger data sets (3 GB and 6 GB cards)

ECC protection for GDDR5 DRAM

All major internal memories are ECC protected: register file, L1 cache, L2 cache
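As a small, assumed illustration of the ECC point (not from the slides), the CUDA runtime reports whether ECC is enabled on each visible device:

```
// Query ECC status for each device via cudaGetDeviceProperties.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d (%s): ECC %s, %zu MB global memory\n",
               d, prop.name, prop.ECCEnabled ? "enabled" : "disabled",
               prop.totalGlobalMem / (1024 * 1024));
    }
    return 0;
}
```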


Page 16: GPU based cloud computing


GigaThread Hardware Thread Scheduler: concurrent kernel execution + faster context switching

[Timeline: serial kernel execution (one kernel at a time) vs. parallel execution, where independent kernels 1 to 5 run concurrently on the GPU.]
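A minimal sketch, assuming Fermi-class hardware and hypothetical kernels (not code from the deck), of launching independent kernels into separate CUDA streams so the GigaThread scheduler can overlap them:

```
// Two small, independent kernels launched in different streams; on hardware that
// supports concurrent kernels the scheduler can execute them at the same time.
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

__global__ void offset(float *data, int n, float delta) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += delta;
}

void launch_concurrent(float *d_a, float *d_b, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    int threads = 256, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads, 0, s1>>>(d_a, n, 2.0f);    // stream 1
    offset<<<blocks, threads, 0, s2>>>(d_b, n, 1.0f);   // stream 2, independent work

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```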

Page 17: GPU based cloud computing


GigaThread Streaming Data Transfer Engine

Dual DMA engines: simultaneous CPU→GPU and GPU→CPU data transfer, fully overlapped with CPU and GPU processing time (see the sketch after the activity snapshot below)

[Activity snapshot: across kernels 0 to 3, the CPU, the GPU and both streaming data transfer engines (SDT0, SDT1) are active at the same time.]
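A hedged sketch of how an application might keep both DMA engines busy: pinned host buffers plus cudaMemcpyAsync in separate streams let an upload, a download and a kernel be in flight at once (illustrative kernel and buffer names, not from the deck).

```
// Overlap a host-to-device copy, a device-to-host copy and a kernel using two streams.
#include <cuda_runtime.h>

__global__ void process(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

void pipeline_step(int n) {
    size_t bytes = n * sizeof(float);
    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in, bytes);          // pinned host memory, required for async copies
    cudaMallocHost(&h_out, bytes);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    // Upload the next batch on one DMA engine while results of a previous batch
    // (already sitting in d_out) stream back on the other.
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, down);
    process<<<(n + 255) / 256, 256, 0, up>>>(d_in, n);   // ordered after the upload in 'up'

    cudaStreamSynchronize(up);
    cudaStreamSynchronize(down);
    cudaStreamDestroy(up);  cudaStreamDestroy(down);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
}
```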

Page 18: GPU based cloud computing


Enhanced Software Support

Many new features in CUDA Toolkit 3.0 (to be released on Friday)

Including early support for the Fermi architecture:
Native 64-bit GPU support
Multiple copy engine support
ECC reporting
Concurrent kernel execution
Fermi HW debugging support in cuda-gdb

Page 19: GPU based cloud computing


Enhanced Software Support

OpenCL 1.0 support
First-class language citizen in the CUDA architecture
Supports ICD (so interoperability between vendors is a possibility)
Profiling support available
Debug support coming to Parallel Nsight (NEXUS) soon

gDEBugger CL from Graphic Remedy: a third-party OpenCL profiler/debugger/memory checker

The software tools ecosystem is starting to grow, given a boost by the existence of OpenCL

Page 20: GPU based cloud computing


“Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is "expected to be 10-times more powerful than today's fastest supercomputer."

Since ORNL's Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 PFlops….

…we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range.”

September 30 2009

Page 21: GPU based cloud computing


NVIDIA TESLA PRODUCTS

Page 22: GPU based cloud computing


Tesla GPU Computing Products: 10 Series

SuperMicro 1U GPU SuperServer: 2 Tesla GPUs, 1.87 Teraflops single precision, 156 Gigaflops double precision, 8 GB memory (4 GB/GPU)
Tesla S1070 1U System: 4 Tesla GPUs, 4.14 Teraflops single precision, 346 Gigaflops double precision, 16 GB memory (4 GB/GPU)
Tesla C1060 Computing Board: 1 Tesla GPU, 933 Gigaflops single precision, 78 Gigaflops double precision, 4 GB memory
Tesla Personal Supercomputer: 4 Tesla GPUs, 3.7 Teraflops single precision, 312 Gigaflops double precision, 16 GB memory (4 GB/GPU)

Page 23: GPU based cloud computing


Tesla GPU Computing Products: 20 Series

Tesla S2050 1U System: 4 Tesla GPUs, 2.1 – 2.5 Teraflops double precision, 12 GB memory (3 GB/GPU)
Tesla S2070 1U System: 4 Tesla GPUs, 2.1 – 2.5 Teraflops double precision, 24 GB memory (6 GB/GPU)
Tesla C2050 Computing Board: 1 Tesla GPU, 500+ Gigaflops double precision, 3 GB memory
Tesla C2070 Computing Board: 1 Tesla GPU, 500+ Gigaflops double precision, 6 GB memory

Page 24: GPU based cloud computing


HETEROGENEOUS CLUSTERS

Page 25: GPU based cloud computing


Data Centers: Space and Energy Limited

Traditional Data Center Cluster

2x performance requires 2x the number of servers (8 cores per server)

1000s of cores, 1000s of servers

Quad-core CPU

Heterogeneous Data Center Cluster

10,000s of cores, 100s of servers

Augment/replace host servers

Page 26: GPU based cloud computing


Cluster Deployment

Now a number of GPU-aware cluster management systems:
ActiveEon ProActive Parallel Suite® Version 4.2
Platform Cluster Manager and HPC Workgroup
Streamline Computing GPU Environment (SCGE)

• Not just installation aids (i.e. putting the driver and toolkits in the right place); now starting to provide GPU node discovery and job steering

NVIDIA and Mellanox: better interoperability between Mellanox InfiniBand adapters and NVIDIA Tesla GPUs
Can provide as much as a 30% performance improvement by eliminating unnecessary data movement in a multi-node heterogeneous application
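As a loose illustration of one piece of GPU-aware job steering, a common pattern in multi-node heterogeneous jobs is to bind each host process to one GPU on its node. This is assumed, illustrative code; the LOCAL_RANK variable name is a placeholder for whatever per-node rank your launcher or cluster manager actually exports.

```
// Bind this process to one of the node's GPUs based on its per-node rank.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);                 // GPU node discovery
    const char *env = getenv("LOCAL_RANK");     // placeholder name, launcher-specific
    int local_rank = env ? atoi(env) : 0;
    cudaSetDevice(local_rank % ngpus);          // one process per GPU
    printf("local rank %d using GPU %d of %d\n", local_rank, local_rank % ngpus, ngpus);
    return 0;
}
```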

Page 27: GPU based cloud computing


Cluster Deployment

A number of cluster and distributed debug tools now support CUDA and NVIDIA Tesla

Allinea® DDT for NVIDIA CUDA: extends the well-known Distributed Debugging Tool (DDT) with CUDA support

TotalView® debugger (part of an Early Experience Program): extends with CUDA support; they have also announced intentions to support OpenCL

Both based on the Parallel Nsight (NEXUS) Debugging API

Page 28: GPU based cloud computing


NVIDIA RealityServer 3.0

Cloud computing platform for running 3D web applications

Consists of a Tesla RS GPU-based server cluster running RealityServer software from mental images

Deployed in a number of different sizes, from 2 to 100s of 1U servers

iray®: interactive photorealistic rendering technology
Streams interactive 3D applications to any web-connected device
Designers and architects can now share and visualize complex 3D models under different lighting and environmental conditions

Page 29: GPU based cloud computing


DISTRIBUTED COMPUTING PROJECTS

Page 30: GPU based cloud computing


Distributed Computing Projects

Traditional distributed computing projects have been making use of GPUs for some time (non-commercial)

Typically have 1,000s to 10,000s of contributors
Folding@Home has access to 6.5 PFLOPS of compute, of which ~95% comes from GPUs or PS3s

Many are bio-informatics, molecular dynamics and quantum chemistry codes

Represent the current sweet spot applications

Ubiquity of GPUs in home systems helps

Page 31: GPU based cloud computing


Distributed Computing Projects

Folding@Home: directed by Prof. Vijay Pande at Stanford University (http://folding.stanford.edu/)

Most recent GPU3 core is based on OpenMM 1.0 (https://simtk.org/home/openmm)
The OpenMM library provides tools for molecular modeling simulation
Can be hooked into any MM application, allowing that code to do molecular modeling with minimal extra effort
OpenMM has a strong emphasis on hardware acceleration, providing not just a consistent API but much greater performance

Current NVIDIA target is via CUDA Toolkit 2.3

OpenMM 1.0 also provides Beta support for OpenCL

OpenCL is the long-term convergence software platform

Page 32: GPU based cloud computing


Distributed Computing Projects

Berkeley Open Infrastructure for Network Computing: the BOINC project (http://boinc.berkeley.edu/)

Platform infrastructure originally evolved from SETI@home

Many projects use BOINC and several of these have heterogeneous compute implementations (http://boinc.berkeley.edu/wiki/GPU_computing)

Examples include:
GPUGRID.net
SETI@home
Milkyway@home (IEEE 754 double-precision-capable GPU required)
AQUA@home
Lattice
Collatz Conjecture

Page 33: GPU based cloud computing


Distributed Computing Projects

GPUGRID.net
Dr. Gianni De Fabritiis, Research Group of Biomedical Informatics, University Pompeu Fabra-IMIM, Barcelona

Uses GPUs to deliver high-performance all-atom biomolecular simulation of proteins using ACEMD (http://multiscalelab.org/acemd)

ACEMD is a production bio-molecular dynamics code specially optimized to run on graphics processing units (GPUs) from NVIDIA
It reads CHARMM/NAMD and AMBER input files with a simple and powerful configuration interface

A commercial implementation of ACEMD is available from Acellera Ltd (http://www.acellera.com/acemd/)

What makes this particularly interesting is that it is implemented using OpenCL

Page 34: GPU based cloud computing


Distributed Computing Projects

Have had to use brute-force methods to deal with robustness: run the same work unit (WU) with multiple users and compare results

Running on purpose-designed heterogeneous grids with ECC means that some of the paranoia can be relaxed (can at least detect that there have been soft errors or WU corruption)

Results in better throughput on these systems

But does result in divergence between consumer and HPC devices, which should be compensated for by HPC-class devices being about 4x faster

Page 35: GPU based cloud computing


Tesla Bio Workbench: Accelerating New Science

January, 2010

http://www.nvidia.com/bio_workbench

Page 36: GPU based cloud computing


Introducing Tesla Bio Workbench

Applications: MUMmerGPU, LAMMPS, GPU-AutoDock, TeraChem
Community: technical papers, discussion forums, benchmarks & configurations, downloads and documentation
Platforms: Tesla Personal Supercomputer, Tesla GPU clusters

Page 37: GPU based cloud computing


Tesla Bio Workbench Applications

AMBER (MD), ACEMD (MD), GROMACS (MD), GROMOS (MD), LAMMPS (MD), NAMD (MD), TeraChem (QC), VMD (visualization of MD & QC)

Docking: GPU AutoDock

Sequence analysis: CUDASW++ (Smith-Waterman), MUMmerGPU, GPU-HMMER, CUDA-MEME motif discovery

Page 38: GPU based cloud computing


Recommended Hardware Configurations

Up to 4 Tesla C1060s per workstation, 4 GB main memory per GPU

Tesla S1070 1U: 4 GPUs per 1U

Integrated CPU-GPU server: 2 GPUs per 1U + 2 CPUs

Tesla Personal Supercomputer Tesla GPU Clusters

Specifics at http://www.nvidia.com/bio_workbench

Page 39: GPU based cloud computing


Molecular Dynamics and Quantum Chemistry Applications

Page 40: GPU based cloud computing


Molecular Dynamics and Quantum Chemistry Applications

AMBER (MD), ACEMD (MD), HOOMD (MD), GROMACS (MD), LAMMPS (MD), NAMD (MD), TeraChem (QC), VMD (viz. MD & QC)

Typical speed-ups of 3-8x on a single Tesla C1060 vs. a modern 1U server
Some applications (compute-bound) show 20-100x speed-ups

Page 41: GPU based cloud computing


Usage of TeraGrid National Supercomputing Grid

[Chart: these molecular dynamics and quantum chemistry applications account for roughly half of the cycles used on the TeraGrid.]

Page 42: GPU based cloud computing


Summary

Page 43: GPU based cloud computing


Summary

‘Fermi’ debuts HPC/enterprise features, particularly ECC and high-performance double precision

Software development environments are now more mature
A significant software ecosystem is starting to emerge
Broadening availability of development tools, libraries and applications
Heterogeneous (GPU) aware cluster management systems

Economics, open standards and improving programming methodologies mean heterogeneous computing is gradually changing the long-held perception that it is just an ‘exotic’ niche technology

Page 44: GPU based cloud computing


Questions?

Page 45: GPU based cloud computing


Supporting Slides

Page 46: GPU based cloud computing


AMBER Molecular Dynamics

Implicit solvent GB results: 1 Tesla GPU is 8x faster than 2 quad-core CPUs

Data courtesy of the San Diego Supercomputer Center

[Chart: Generalized Born simulations with 7x and 8.6x speed-ups.]

Alpha (now): Generalized Born
Q1 2010 (Beta release): PME (Particle Mesh Ewald), Generalized Born
Q2 2010 (Beta 2 release): Multi-GPU + MPI support

More info: http://www.nvidia.com/object/amber_on_tesla.html

Page 47: GPU based cloud computing


GROMACS Molecular Dynamics

PME results: 1 Tesla GPU is 3.5x-4.7x faster than CPU

Data courtesy of Stockholm Center for Biomembrane Research

[Chart: GROMACS on Tesla GPU vs. CPU, with speed-ups of 3.5x, 5.2x and 22x across Particle-Mesh-Ewald (PME), reaction-field and cutoff simulations.]

Beta (now): Particle Mesh Ewald (PME), implicit solvent GB, arbitrary forms of non-bonded interactions
Q2 2010 (Beta 2 release): Multi-GPU + MPI support

More info: http://www.nvidia.com/object/gromacs_on_tesla.html

Page 48: GPU based cloud computing


HOOMD Blue Molecular Dynamics

Written bottom-up for CUDA GPUs

Modeled after LAMMPS; supports multiple GPUs

1 Tesla GPU outperforms 32 CPUs running LAMMPS

Data courtesy of University of Michigan

More info: http://www.nvidia.com/object/hoomd_on_tesla.html

Page 49: GPU based cloud computing


LAMMPS: Molecular Dynamics on a GPU Cluster

2 GPUs = 24 CPUs

Available as beta on CUDA: cut-off based non-bonded terms

2 GPUs outperform 24 CPUs

PME-based electrostatics: preliminary results show a 5x speed-up

Multiple GPU + MPI support enabled

Data courtesy of Scott Hampton & Pratul K. Agarwal, Oak Ridge National Laboratory

More info: http://www.nvidia.com/object/lammps_on_tesla.html

Page 50: GPU based cloud computing


NAMD: Scaling Molecular Dynamics on a GPU Cluster

4 GPUs = 16 CPUs

Feature complete on CUDA: available in NAMD 2.7 Beta 2

Full electrostatics with PME, multiple time-stepping, 1-4 exclusions

4 GPU Tesla PSC outperforms 8 CPU servers

Scales to a GPU cluster

Data courtesy of the Theoretical and Computational Biophysics Group, UIUC

More info: http://www.nvidia.com/object/namd_on_tesla.html

Page 51: GPU based cloud computing


TeraChem: Quantum Chemistry Package for GPUs

First quantum chemistry software written from the ground up for GPUs

4 Tesla GPUs outperform 256 quad-core CPUs

Beta (now): HF, Kohn-Sham, DFT; multiple GPUs supported
Q1 2010: full release, MPI support

More info: http://www.nvidia.com/object/terachem_on_tesla.html

Page 52: GPU based cloud computing


VMD: Acceleration using CUDA GPUs

Several CUDA applications in VMD 1.8.7

Molecular orbital display, Coulomb-based ion placement, implicit ligand sampling

Speed-ups: 20x to 100x

Multiple GPU support enabled

Images and data courtesy of Beckman Institute for Advanced Science and Technology, UIUC

More info: http://www.nvidia.com/object/vmd_on_tesla.html

Page 53: GPU based cloud computing


GPU-HMMER: Protein Sequence Alignment

Protein sequence alignment using profile HMMs

Available now; supports multiple GPUs

Speed-ups range from 60x to 100x faster than CPU

Download: http://www.mpihmmer.org/releases.htm


Page 54: GPU based cloud computing


MUMmerGPU: Genome Sequence Alignment

High-throughput pair-wise local sequence alignment

Designed for large sequences

Drop-in replacement for “mummer” component in MUMmer software

Speed-ups of 3.5x to 3.75x

Download: http://mummergpu.sourceforge.net