© NVIDIA Corporation 2010
GPU based cloud computing
Dairsie Latimer, Petapath, UK
About Petapath
Founded in 2008 to focus on delivering innovative hardware and software solutions into the high performance computing (HPC) markets
Partnered with HP and SGI to deliver two Petascale prototype systems as part of the PRACE WP8 programme
The system is a testbed for new ideas in the usability, scalability and efficiency of large computer installations
Active in exploiting emerging standards for acceleration technologies; members of the Khronos Group, sitting on the OpenCL working committee
We also provide consulting expertise for companies wishing to explore the advantages offered by heterogeneous systems
What is Heterogeneous or GPU Computing?
Heterogeneous Computing: computing with CPU + GPU
[Diagram: x86 CPU connected to the GPU over the PCIe bus]
Low Latency or High Throughput?
CPU: optimised for low-latency access to cached data sets, with control logic for out-of-order and speculative execution
GPU: optimised for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation
NVIDIA GPU Computing Ecosystem
[Diagram: NVIDIA hardware solutions, the CUDA SDK & tools and the GPU architecture meet customer requirements and customer applications through deployment partners: TPP/OEMs, CUDA development specialists, ISVs, CUDA training companies, hardware architects and VARs]
Science is Desperate for Throughput
[Chart: sustained performance needed for molecular simulation, 1982-2012, rising from gigaflops towards 1 petaflop and 1 exaflop as system size grows: BPTI (3K atoms), estrogen receptor (36K atoms), F1-ATPase (327K atoms), ribosome (2.7M atoms), chromatophore (50M atoms), bacteria (100s of chromatophores)]
One of these simulations ran for 8 months to simulate 2 nanoseconds.
Power Crisis in Supercomputing
[Chart: gigaflop, teraflop, petaflop and exaflop systems, 1982-2020, against household power equivalents from a block (60,000 W) through a neighborhood (850,000 W) and a town (7,000,000 W) to a city (25,000,000 W); Jaguar and Los Alamos mark the petaflop systems]
Enter the GPU
NVIDIA GPU Product Families
Tesla™: High-Performance Computing
Quadro®: Design & Creation
GeForce®: Entertainment
NEXT-GENERATION GPU ARCHITECTURE — ‘FERMI’
Introducing the ‘Fermi’ Tesla Architecture: The Soul of a Supercomputer in the Body of a GPU
3 billion transistors
Up to 2× the cores (C2050 has 448)
Up to 8× the peak DP performance
ECC on all memories
L1 and L2 caches
Improved memory bandwidth (GDDR5)
Up to 1 Terabyte of GPU memory
Concurrent kernels
Hardware support for C++
[Die diagram: the GigaThread scheduler, host interface, DRAM interfaces and shared L2 cache surrounding the SM array]
Design Goal of Fermi
Expand the performance sweet spot of the GPU
Bring more users, more applications to the GPU
[Diagram: a spectrum from instruction-parallel workloads with many decisions (CPU) to data-parallel workloads on large data sets (GPU)]
Streaming Multiprocessor Architecture
[Diagram: one SM with an instruction cache, dual scheduler/dispatch units, register file, 32 CUDA cores, 16 load/store units, 4 special function units, an interconnect network, 64 KB of configurable cache/shared memory and a uniform cache]
32 CUDA cores per SM (512 total)
8× peak double precision floating point performance
50% of peak single precision
Dual Thread Scheduler
64 KB of RAM for shared memory and L1 cache (configurable)
CUDA Core Architecture
[Diagram: a single CUDA core, with dispatch port, operand collector, FP unit, INT unit and result queue, highlighted within the SM block diagram from the previous slide]
New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs
Fused multiply-add (FMA) instruction for both single and double precision
New integer ALU optimized for 64-bit and extended precision operations
Cached Memory Hierarchy
First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory
L1 cache per SM (32 cores): improves bandwidth and reduces latency
Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU
[Diagram: Parallel DataCache™ memory hierarchy, with per-SM L1 caches feeding a shared L2, surrounded by the DRAM interfaces, host interface and GigaThread scheduler]
Larger, Faster, Resilient Memory Interface
GDDR5 memory interface: 2× the signaling speed of GDDR3
Up to 1 Terabyte of memory attached to the GPU: operate on larger data sets (3 GB and 6 GB cards)
ECC protection for GDDR5 DRAM
All major internal memories are ECC protected: register file, L1 cache, L2 cache
[Die diagram: DRAM interfaces, GigaThread scheduler, host interface and L2 cache]
GigaThread Hardware Thread Scheduler Concurrent Kernel Execution + Faster Context Switch
Serial Kernel Execution vs Parallel Kernel Execution
[Timeline diagram: serial execution runs Kernels 1-5 one after another; concurrent execution overlaps independent kernels to shorten total elapsed time]
GigaThread Streaming Data Transfer Engine
Dual DMA engines: simultaneous CPU→GPU and GPU→CPU data transfer, fully overlapped with CPU and GPU processing time
Activity snapshot:
[Diagram: Kernels 0-3 running on the GPU while the two streaming data transfer engines (SDT0, SDT1) move data concurrently with CPU and GPU processing]
Enhanced Software Support
Many new features in CUDA Toolkit 3.0, to be released on Friday
Including early support for the Fermi architecture:
Native 64-bit GPU support
Multiple Copy Engine support
ECC reporting
Concurrent Kernel Execution
Fermi HW debugging support in cuda-gdb
Enhanced Software Support
OpenCL 1.0 support: a first-class language citizen in the CUDA architecture
Supports the ICD (so interoperability between vendors is a possibility)
Profiling support available; debug support coming to Parallel Nsight (NEXUS) soon
gDEBugger CL from Graphic Remedy: a third-party OpenCL profiler/debugger/memory checker
The software tools ecosystem is starting to grow, given a boost by the existence of OpenCL
“Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is "expected to be 10-times more powerful than today's fastest supercomputer."
Since ORNL's Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 PFlops….
…we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range.”
September 30 2009
NVIDIA TESLA PRODUCTS
Tesla GPU Computing Products: 10 Series
SuperMicro 1U GPU SuperServer: 2 Tesla GPUs; 1.87 Teraflops single precision; 156 Gigaflops double precision; 8 GB (4 GB/GPU)
Tesla S1070 1U System: 4 Tesla GPUs; 4.14 Teraflops single precision; 346 Gigaflops double precision; 16 GB (4 GB/GPU)
Tesla C1060 Computing Board: 1 Tesla GPU; 933 Gigaflops single precision; 78 Gigaflops double precision; 4 GB
Tesla Personal Supercomputer: 4 Tesla GPUs; 3.7 Teraflops single precision; 312 Gigaflops double precision; 16 GB (4 GB/GPU)
Tesla GPU Computing Products: 20 Series
Tesla S2050 1U System: 4 Tesla GPUs; 2.1-2.5 Teraflops double precision; 12 GB (3 GB/GPU)
Tesla S2070 1U System: 4 Tesla GPUs; 2.1-2.5 Teraflops double precision; 24 GB (6 GB/GPU)
Tesla C2050 Computing Board: 1 Tesla GPU; 500+ Gigaflops double precision; 3 GB
Tesla C2070 Computing Board: 1 Tesla GPU; 500+ Gigaflops double precision; 6 GB
HETEROGENEOUS CLUSTERS
Data Centers: Space and Energy Limited
Traditional Data Center Cluster
2× performance requires 2× the number of servers (8 cores per server)
1000's of cores, 1000's of servers (quad-core CPUs)
Heterogeneous Data Center Cluster
10,000's of cores, 100's of servers
GPUs augment/replace the host servers
Cluster Deployment
There are now a number of GPU-aware cluster management systems:
ActiveEon ProActive Parallel Suite® Version 4.2
Platform Cluster Manager and HPC Workgroup
Streamline Computing GPU Environment (SCGE)
These are not just installation aids (i.e. putting the driver and toolkits in the right place); they are now starting to provide GPU node discovery and job steering
NVIDIA and Mellanox: better interoperability between Mellanox InfiniBand adapters and NVIDIA Tesla GPUs can provide as much as a 30% performance improvement by eliminating unnecessary data movement in a multi-node heterogeneous application
Cluster Deployment
A number of cluster and distributed debug tools now support CUDA and NVIDIA Tesla
Allinea® DDT for NVIDIA CUDA: extends the well-known Distributed Debugging Tool (DDT) with CUDA support
TotalView® debugger (part of an Early Experience Program): extends TotalView with CUDA support; intentions to support OpenCL have also been announced
Both are based on the Parallel Nsight (NEXUS) debugging API
NVIDIA Reality Server 3.0
Cloud computing platform for running 3D web applications
Consists of a Tesla RS GPU-based server cluster running RealityServer software from mental images
Deployed in a number of different sizes, from 2 to 100's of 1U servers
iray® Interactive Photorealistic Rendering Technology streams interactive 3D applications to any web-connected device; designers and architects can now share and visualize complex 3D models under different lighting and environmental conditions
DISTRIBUTED COMPUTING PROJECTS
Distributed Computing Projects
Traditional distributed computing projects have been making use of GPUs for some time (non-commercial)
They typically have 1,000's to 10,000's of contributors; Folding@Home has access to 6.5 PFLOPS of compute, of which ~95% comes from GPUs or PS3s
Many are bio-informatics, molecular dynamics and quantum chemistry codes; these represent the current sweet-spot applications
The ubiquity of GPUs in home systems helps
Distributed Computing Projects
Folding@Home: directed by Prof. Vijay Pande at Stanford University (http://folding.stanford.edu/)
The most recent GPU3 core is based on OpenMM 1.0 (https://simtk.org/home/openmm)
The OpenMM library provides tools for molecular modeling simulation; it can be hooked into any MM application, allowing that code to do molecular modeling with minimal extra effort
OpenMM has a strong emphasis on hardware acceleration, providing not just a consistent API but much greater performance
The current NVIDIA target is via CUDA Toolkit 2.3; OpenMM 1.0 also provides beta support for OpenCL
OpenCL is the long-term convergence software platform
Distributed Computing Projects
Berkeley Open Infrastructure for Network Computing: the BOINC project (http://boinc.berkeley.edu/)
The platform infrastructure originally evolved from SETI@home
Many projects use BOINC, and several of these have heterogeneous compute implementations (http://boinc.berkeley.edu/wiki/GPU_computing)
Examples include: GPUGRID.net, SETI@home, Milkyway@home (IEEE 754 double-precision-capable GPU required), AQUA@home, Lattice, Collatz Conjecture
Distributed Computing Projects
GPUGRID.net: Dr. Gianni De Fabritiis, Research Group of Biomedical Informatics, University Pompeu Fabra-IMIM, Barcelona
Uses GPUs to deliver high-performance all-atom biomolecular simulation of proteins using ACEMD (http://multiscalelab.org/acemd)
ACEMD is a production bio-molecular dynamics code specially optimized to run on graphics processing units (GPUs) from NVIDIA; it reads CHARMM/NAMD and AMBER input files with a simple and powerful configuration interface
A commercial implementation of ACEMD is available from Acellera Ltd (http://www.acellera.com/acemd/)
What makes this particularly interesting is that it is implemented using OpenCL
Distributed Computing Projects
Projects have had to use brute-force methods to deal with robustness: run the same work unit (WU) with multiple users and compare results
Running on purpose-designed heterogeneous grids with ECC means some of that paranoia can be relaxed (soft errors and WU corruption can at least be detected), resulting in better throughput on these systems
But this does create a divergence between consumer and HPC devices, which should be compensated for by HPC-class devices being about 4× faster
Tesla Bio Workbench: Accelerating New Science
January 2010
http://www.nvidia.com/bio_workbench
Introducing Tesla Bio Workbench
[Diagram: applications (e.g. MUMmerGPU, LAMMPS, GPU-AutoDock, TeraChem), community resources (technical papers, discussion forums, benchmarks & configurations, downloads and documentation) and platforms (Tesla Personal Supercomputer, Tesla GPU clusters)]
Tesla Bio Workbench Applications
Molecular dynamics and quantum chemistry: AMBER (MD), ACEMD (MD), GROMACS (MD), GROMOS (MD), LAMMPS (MD), NAMD (MD), TeraChem (QC), VMD (visualization of MD & QC)
Docking: GPU AutoDock
Sequence analysis: CUDASW++ (Smith-Waterman), MUMmerGPU, GPU-HMMER, CUDA-MEME motif discovery
Recommended Hardware Configurations
Tesla Personal Supercomputer: up to 4 Tesla C1060s per workstation, 4 GB main memory per GPU
Tesla GPU Clusters: Tesla S1070 1U (4 GPUs per 1U) or integrated CPU-GPU server (2 GPUs + 2 CPUs per 1U)
Specifics at http://www.nvidia.com/bio_workbench
Molecular Dynamics and Quantum Chemistry Applications
AMBER (MD), ACEMD (MD), HOOMD (MD), GROMACS (MD), LAMMPS (MD), NAMD (MD), TeraChem (QC), VMD (viz. of MD & QC)
Typical speed-ups of 3-8× on a single Tesla C1060 vs a modern 1U server; some compute-bound applications show 20-100× speed-ups
Usage of the TeraGrid National Supercomputing Grid
[Chart: these application classes account for half of the cycles]
Summary
‘Fermi’ debuts HPC/enterprise features, particularly ECC and high-performance double precision
Software development environments are now more mature, and a significant software ecosystem is starting to emerge: broadening availability of development tools, libraries and applications, plus heterogeneous (GPU-aware) cluster management systems
Economics, open standards and improving programming methodologies mean heterogeneous computing is gradually changing the long-held perception that it is just an ‘exotic’ niche technology
Questions?
Supporting Slides
AMBER Molecular Dynamics
Implicit solvent GB results: 1 Tesla GPU 8x faster than 2 quad-core CPUs
Data courtesy of San Diego Supercomputing Center
[Chart: Generalized Born simulations, 7x and 8.6x speed-ups]
Now: Generalized Born (alpha)
Q1 2010: PME (Particle Mesh Ewald), beta release
Q2 2010: multi-GPU + MPI support, beta 2 release
More info: http://www.nvidia.com/object/amber_on_tesla.html
GROMACS Molecular Dynamics
PME results: 1 Tesla GPU 3.5x-4.7x faster than CPU
Data courtesy of Stockholm Center for Biomembrane Research
[Chart: GROMACS on Tesla GPU vs CPU: 3.5x-5.2x for Particle-Mesh-Ewald (PME), up to 22x for Reaction-Field cutoffs]
Now (beta): Particle Mesh Ewald (PME), implicit solvent GB, arbitrary forms of non-bonded interactions
Q2 2010: multi-GPU + MPI support, beta 2 release
More info: http://www.nvidia.com/object/gromacs_on_tesla.html
HOOMD-blue Molecular Dynamics
Written bottom-up for CUDA GPUs
Modeled after LAMMPS; supports multiple GPUs
1 Tesla GPU outperforms 32 CPUs running LAMMPS
Data courtesy of University of Michigan
More info: http://www.nvidia.com/object/hoomd_on_tesla.html
LAMMPS: Molecular Dynamics on a GPU Cluster
Available as beta on CUDA
Cut-off based non-bonded terms: 2 GPUs outperform 24 CPUs
PME-based electrostatics, preliminary results: 5× speed-up
Multiple GPU + MPI support enabled
Data courtesy of Scott Hampton & Pratul K. Agarwal, Oak Ridge National Laboratory
More info: http://www.nvidia.com/object/lammps_on_tesla.html
NAMD: Scaling Molecular Dynamics on a GPU Cluster
Feature complete on CUDA; available in NAMD 2.7 Beta 2
Full electrostatics with PME; multiple time-stepping; 1-4 exclusions
A 4-GPU Tesla Personal Supercomputer outperforms 8 CPU servers (4 GPUs = 16 CPUs), and performance scales to a GPU cluster
Data courtesy of the Theoretical and Computational Biophysics Group, UIUC
More info: http://www.nvidia.com/object/namd_on_tesla.html
TeraChem: Quantum Chemistry Package for GPUs
First QC software written ground-up for GPUs
4 Tesla GPUs outperform 256 quad-core CPUs
Now (beta): HF and Kohn-Sham DFT; multiple GPUs supported
Q1 2010: full release, MPI support
More info: http://www.nvidia.com/object/terachem_on_tesla.html
VMD: Acceleration using CUDA GPUs
Several CUDA-accelerated features in VMD 1.8.7: molecular orbital display, Coulomb-based ion placement, implicit ligand sampling
Speed-ups: 20x-100x
Multiple GPU support enabled
Images and data courtesy of the Beckman Institute for Advanced Science and Technology, UIUC
More info: http://www.nvidia.com/object/vmd_on_tesla.html
GPU-HMMER: Protein Sequence Alignment
Protein sequence alignment using profile HMMs
Available now; supports multiple GPUs
Speed-ups range from 60x to 100x over the CPU
Download: http://www.mpihmmer.org/releases.htm
MUMmerGPU: Genome Sequence Alignment
High-throughput pair-wise local sequence alignment
Designed for large sequences
Drop-in replacement for the "mummer" component in the MUMmer software
Speed-ups of 3.5x to 3.75x
Download: http://mummergpu.sourceforge.net