
Page 1: High-Performance Computing with NVIDIA Tesla GPUs

HIGH-PERFORMANCE COMPUTING WITH NVIDIA TESLA GPUS

Timothy Lanfear, NVIDIA

Page 2: High-Performance Computing with NVIDIA Tesla GPUs


WHY GPU COMPUTING?

Page 3: High-Performance Computing with NVIDIA Tesla GPUs


Science is Desperate for Throughput

[Chart: computing power (Gigaflops, log scale from 1 to 1,000,000,000) required by molecular simulations of increasing size, 1982 to 2012: BPTI (3K atoms), estrogen receptor (36K atoms), F1-ATPase (327K atoms), ribosome (2.7M atoms, roughly 1 Petaflop), chromatophore (50M atoms), and bacteria with hundreds of chromatophores (roughly 1 Exaflop). One early simulation ran for 8 months to simulate 2 nanoseconds.]

Page 4: High-Performance Computing with NVIDIA Tesla GPUs


Power Crisis in Supercomputing

[Chart: supercomputer power consumption, 1982 to 2020. Milestones: Gigaflop (1982), Teraflop (1996), Petaflop (2008, e.g. Jaguar and the Los Alamos system), projected Exaflop (2020). Power draw grows from 60,000 W through 850,000 W and 7,000,000 W to 25,000,000 W, the household equivalent of a block, a neighborhood, a town, and a city.]

Page 5: High-Performance Computing with NVIDIA Tesla GPUs


“Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is ‘expected to be 10-times more powerful than today’s fastest supercomputer.’

Since ORNL’s Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 Petaflops …

… we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range.”

September 30, 2009

Page 6: High-Performance Computing with NVIDIA Tesla GPUs


What is GPU Computing?

Computing with CPU + GPU: heterogeneous computing

[Diagram: x86 CPU and GPU connected by the PCIe bus]
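A minimal CUDA C sketch of this heterogeneous model (the kernel and array size are illustrative, not from the deck): the host CPU allocates GPU memory, copies data across the PCIe bus, launches a kernel on the GPU, and copies the result back.

#include <cuda_runtime.h>
#include <stdio.h>

// GPU kernel: each thread scales one element of the array.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes);                  // host (CPU) memory
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc(&d_x, bytes);                              // GPU memory
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // across the PCIe bus

    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);        // runs on the GPU

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // result back to the CPU
    printf("x[0] = %f\n", h_x[0]);
    cudaFree(d_x); free(h_x);
    return 0;
}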

Page 7: High-Performance Computing with NVIDIA Tesla GPUs


Low Latency or High Throughput?

CPU: optimised for low-latency access to cached data sets; control logic for out-of-order and speculative execution

GPU: optimised for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation

[Diagram: CPU die dominated by control logic and cache in front of DRAM, versus GPU die dominated by ALUs in front of DRAM]

Page 8: High-Performance Computing with NVIDIA Tesla GPUs


Why Didn’t GPU Computing Take Off Sooner?

GPU architecture: gaming oriented, processes pixels for display; single-threaded operations; no shared memory

Development tools: graphics oriented (OpenGL, GLSL); university research (Brook); assembly language

Deployment: gaming solutions with limited lifetime; expensive OpenGL professional graphics boards; no HPC-compatible products

Page 9: High-Performance Computing with NVIDIA Tesla GPUs


NVIDIA Invested in GPU Computing in 2004

Strategic move for the company: expand the GPU architecture beyond pixel processing; future platforms will be hybrid, based on multi-core and many-core processors

Hired key industry experts: x86 architecture, x86 compilers, HPC hardware specialists

Goal: create a GPU-based compute ecosystem by 2008

Page 10: High-Performance Computing with NVIDIA Tesla GPUs


NVIDIA GPU Computing Ecosystem

[Diagram: the ecosystem links the GPU architecture, NVIDIA hardware solutions, and the CUDA SDK and tools to customer requirements and customer applications through partners: TPP/OEMs, CUDA development specialists, ISVs, CUDA training companies, hardware architects, and VARs.]

Page 11: High-Performance Computing with NVIDIA Tesla GPUs


NVIDIA GPU Product Families

Tesla™: high-performance computing

Quadro®: design and creation

GeForce®: entertainment

Page 12: High-Performance Computing with NVIDIA Tesla GPUs


Many-Core High Performance Computing

NVIDIA’s 10-series GPU has 240 cores

Each core has a floating-point/integer unit, a logic unit, a move/compare unit, and a branch unit

Cores are managed by a thread manager that can spawn and manage 30,000+ threads, with zero-overhead thread switching

1.4 billion transistors

1 Teraflop of processing power

240 processing cores

NVIDIA 10-Series GPU

NVIDIA’s 2nd Generation CUDA Processor
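For scale, a hedged sketch of how a CUDA launch maps work onto those cores: a hypothetical SAXPY kernel launched over a grid of blocks creates far more threads than there are cores, and the hardware thread manager keeps them all in flight.

// Hypothetical kernel operating on one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Launch: 256 threads per block, enough blocks to cover n elements.
// For n = 1,000,000 this creates roughly 3,900 blocks and a million threads,
// far more than the 240 cores; the thread manager schedules them.
void launch_saxpy(int n, float a, const float *d_x, float *d_y)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, a, d_x, d_y);
}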

Page 13: High-Performance Computing with NVIDIA Tesla GPUs


Tesla GPU Computing Products

Tesla C1060 Computing Board: 1 Tesla GPU; 933 Gigaflops single precision; 78 Gigaflops double precision; 4 GB memory

Tesla Personal Supercomputer: 4 Tesla GPUs; 3.7 Teraflops single precision; 312 Gigaflops double precision; 16 GB memory (4 GB per GPU)

Tesla S1070 1U System: 4 Tesla GPUs; 4.14 Teraflops single precision; 346 Gigaflops double precision; 16 GB memory (4 GB per GPU)

SuperMicro 1U GPU SuperServer: 2 Tesla GPUs; 1.87 Teraflops single precision; 156 Gigaflops double precision; 8 GB memory (4 GB per GPU)

Page 14: High-Performance Computing with NVIDIA Tesla GPUs


Tesla C1060 Computing Processor

Processor: 1 × Tesla T10
Number of cores: 240
Core clock: 1.296 GHz
Floating-point performance: 933 Gflops single precision; 78 Gflops double precision
On-board memory: 4.0 GB
Memory bandwidth: 102 GB/sec peak
Memory I/O: 512-bit, 800 MHz GDDR3
Form factor: full ATX, 4.736″ × 10.5″, dual-slot wide
System I/O: PCIe ×16 Gen2
Typical power: 160 W
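Several of these board parameters can be read back at run time through the CUDA runtime API; a small sketch using standard cudaDeviceProp fields (output formatting is illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query device 0
    printf("Name:            %s\n", prop.name);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Core clock:      %.3f GHz\n", prop.clockRate / 1.0e6);   // clockRate is in kHz
    printf("Global memory:   %.1f GB\n", prop.totalGlobalMem / 1.0e9);
    return 0;
}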

Page 15: High-Performance Computing with NVIDIA Tesla GPUs


Tesla M1060 Embedded Module

Processor: 1 × Tesla T10
Number of cores: 240
Core clock: 1.296 GHz
Floating-point performance: 933 Gflops single precision; 78 Gflops double precision
On-board memory: 4.0 GB
Memory bandwidth: 102 GB/sec peak
Memory I/O: 512-bit, 800 MHz GDDR3
Form factor: full ATX, 4.736″ × 10.5″, dual-slot wide
System I/O: PCIe ×16 Gen2
Typical power: 160 W

OEM-only product; available as an integrated product in OEM systems

Page 16: High-Performance Computing with NVIDIA Tesla GPUs


Tesla Personal Supercomputer

Supercomputing performance: massively parallel CUDA architecture; 960 cores, 4 Teraflops; 250× the performance of a desktop

Personal: one researcher, one supercomputer; plugs into a standard power strip

Accessible: program in C for Windows or Linux; available now worldwide under $10,000

Page 17: High-Performance Computing with NVIDIA Tesla GPUs


Tesla S1070 1U System

Processors: 4 × Tesla T10
Number of cores: 960
Core clock: 1.44 GHz
Performance: 4 Teraflops
Total system memory: 16.0 GB (4.0 GB per T10)
Memory bandwidth: 408 GB/sec peak (102 GB/sec per T10)
Memory I/O: 2048-bit, 800 MHz GDDR3 (512-bit per T10)
Form factor: 1U (EIA 19″ rack)
System I/O: 2 × PCIe ×16 Gen2
Typical power: 700 W

Page 18: High-Performance Computing with NVIDIA Tesla GPUs


SuperMicro GPU 1U SuperServer

Two M1060 GPUs in a 1U

Dual Nehalem-EP Xeon CPUs

Up to 96 GB DDR3 ECC

Onboard InfiniBand (QDR)

3× hot-swap 3.5″ SATA HDD

1200 W power supply


Page 19: High-Performance Computing with NVIDIA Tesla GPUs


Tesla Cluster Configurations

Modular 2U compute node: Tesla S1070 + host server (4 Teraflops)

Integrated 1U compute node: GPU SuperServer (2 Teraflops)

Page 20: High-Performance Computing with NVIDIA Tesla GPUs


GPU Computing Applications

[Diagram: the CUDA Parallel Computing Architecture stack: GPU computing applications on top; language and API layers (CUDA C, CUDA Fortran, Java and Python, OpenCL™, DirectCompute); the NVIDIA GPU with the CUDA parallel computing architecture underneath.]

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

Page 21: High-Performance Computing with NVIDIA Tesla GPUs


NVIDIA CUDA C and OpenCL

Shared back-end compiler and optimization technology: both CUDA C and OpenCL compile to PTX, which is then translated for the GPU

CUDA C: entry point for developers who prefer high-level C

OpenCL: entry point for developers who want a low-level API
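PTX is the virtual instruction set that the driver finishes compiling for the target GPU. As an illustrative sketch, the PTX for a trivial CUDA C kernel can be inspected by asking nvcc to stop at the PTX stage (file names here are hypothetical):

// kernel.cu -- compile with: nvcc -ptx kernel.cu -o kernel.ptx
// The generated kernel.ptx contains the virtual ISA that both the CUDA C and
// OpenCL front ends target before the driver translates it to native GPU code.
__global__ void add_one(float *x)
{
    x[threadIdx.x] += 1.0f;
}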

Page 22: High-Performance Computing with NVIDIA Tesla GPUs


[Diagram: the CUDA software stack. Application software written in C runs across a CPU (4 cores) and a 1U Tesla system (240 cores per GPU) connected through a PCI-E switch, using CUDA Libraries (cuFFT, cuBLAS, cuDPP), CUDA Compilers (C, Fortran), and CUDA Tools (debugger, profiler).]
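As a hedged example of the drop-in library model, a SAXPY call through the cuBLAS handle-based (v2) API; this interface postdates this deck, and error checking is omitted for brevity.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// y = alpha * x + y on the GPU via cuBLAS; d_x and d_y are device pointers.
void gpu_saxpy(int n, float alpha, const float *d_x, float *d_y)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    cublasDestroy(handle);
}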

Page 23: High-Performance Computing with NVIDIA Tesla GPUs


NVIDIA Nexus

Register for the Beta here at GTC!

The first development environment for massively parallel applications.

Beta available October 2009; releasing in Q1 2010

Hardware GPU Source Debugging

Platform-wide Analysis

Complete Visual Studio integration

http://developer.nvidia.com/object/nexus.html

Graphics Inspector

Platform Trace

Parallel Source Debugging

Page 24: High-Performance Computing with NVIDIA Tesla GPUs


Resources for CUDA developers: CUDA Zone, www.nvidia.com/CUDA

CUDA Toolkit: compiler, libraries

CUDA SDK: code samples

CUDA Profiler

Forums

Page 25: High-Performance Computing with NVIDIA Tesla GPUs


Wide Developer Acceptance and Success

146×: interactive visualization of volumetric white matter connectivity

36×: ion placement for molecular dynamics simulation

19×: transcoding HD video stream to H.264

17×: simulation in Matlab using .mex file CUDA function

100×: astrophysics N-body simulation

149×: financial simulation of LIBOR model with swaptions

47×: GLAME@lab, an M-script API for linear algebra operations on GPU

20×: ultrasound medical imaging for cancer diagnostics

24×: highly optimized object-oriented molecular dynamics

30×: Cmatch exact string matching to find similar proteins and gene sequences

Page 26: High-Performance Computing with NVIDIA Tesla GPUs


CUDA Co-Processing Ecosystem

Libraries: FFT, BLAS, LAPACK, image processing, video processing, signal processing, vision

Languages: C, C++, DirectX, Fortran, Java, OpenCL, Python

Compilers and tools: PGI Fortran, CAPS HMPP, MCUDA, MPI, NOAA Fortran2C, OpenMP

Consultants and OEMs: ANEO, GPU Tech, ...

Over 200 universities teaching CUDA, including UIUC, MIT, Harvard, Berkeley, Cambridge, Oxford, IIT Delhi, Tsinghua, Dortmund, ETH Zurich, Moscow, NTU, ...

Application areas: oil & gas, finance, medical, biophysics, numerics, imaging, CFD, DSP, EDA

Page 27: High-Performance Computing with NVIDIA Tesla GPUs


What We Did in the Past Three Years

2006: G80, the first GPU with built-in compute features; 128-core, multi-threaded, scalable architecture; CUDA SDK beta

2007: Tesla HPC product line; CUDA SDK 1.0 and 1.1

2008: GT200, second GPU generation, 240 cores, 64-bit; second-generation Tesla HPC products; CUDA SDK 2.0

2009: ...

Page 28: High-Performance Computing with NVIDIA Tesla GPUs


NEXT-GENERATION GPU ARCHITECTURE — ‘FERMI’

Page 29: High-Performance Computing with NVIDIA Tesla GPUs


3 billion transistors

Over 2× the cores (512 total)

8× the peak DP performance

ECC

L1 and L2 caches

~2× memory bandwidth (GDDR5)

Up to 1 Terabyte of GPU memory

Concurrent kernels

Hardware support for C++

[Diagram: Fermi chip layout: an array of streaming multiprocessors around a shared L2 cache, with multiple DRAM interfaces, a host interface, and the GigaThread scheduler.]

Introducing the ‘Fermi’ Architecture: the soul of a supercomputer in the body of a GPU

Page 30: High-Performance Computing with NVIDIA Tesla GPUs


Design Goal of Fermi

Expand the performance sweet spot of the GPU

Bring more users, more applications to the GPU

[Diagram: design space ranging from instruction-parallel workloads with many decisions (CPU) to data-parallel workloads with large data sets (GPU).]

Page 31: High-Performance Computing with NVIDIA Tesla GPUs


Streaming Multiprocessor Architecture

[Diagram: one streaming multiprocessor: instruction cache, dual schedulers and dispatch units, register file, 32 CUDA cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable cache/shared memory, uniform cache.]

32 CUDA cores per SM (512 total)

8× the peak double-precision floating-point performance (double precision runs at 50% of single-precision peak)

Dual Thread Scheduler

64 KB of RAM for shared memory and L1 cache (configurable)
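The 64 KB split between shared memory and L1 cache is software-selectable; a hedged sketch using the runtime call introduced for Fermi (the kernel itself is a hypothetical placeholder):

#include <cuda_runtime.h>

// Placeholder kernel body; a real stencil would stage data in shared memory.
__global__ void stencil_kernel(float *out, const float *in)
{
    int i = threadIdx.x;
    out[i] = in[i];
}

void configure_cache(void)
{
    // Prefer 48 KB shared memory / 16 KB L1 for a shared-memory-heavy kernel...
    cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);
    // ...or 48 KB L1 / 16 KB shared memory for a cache-friendly one:
    // cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferL1);
}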

Page 32: High-Performance Computing with NVIDIA Tesla GPUs


CUDA Core Architecture

[Diagram: within the SM layout shown above, each CUDA core contains a dispatch port, an operand collector, a floating-point unit and an integer unit, and a result queue.]

New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs

Fused multiply-add (FMA) instruction for both single and double precision

Newly designed integer ALU optimized for 64-bit and extended precision operations
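A hedged device-code sketch of the fused multiply-add path: the single-precision intrinsic __fmaf_rn computes a*b+c with a single rounding, matching the IEEE 754-2008 FMA the hardware provides (the compiler will also generate FMA from plain expressions); the polynomial coefficients are illustrative.

// Each thread evaluates a small polynomial using fused multiply-add.
__global__ void poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        // Horner's rule: ((3*v + 2)*v + 1)*v + 0.5, one rounding per FMA.
        float r = __fmaf_rn(3.0f, v, 2.0f);
        r = __fmaf_rn(r, v, 1.0f);
        r = __fmaf_rn(r, v, 0.5f);
        y[i] = r;
    }
}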

Page 33: High-Performance Computing with NVIDIA Tesla GPUs


Cached Memory Hierarchy

First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

L1 cache per SM (32 cores): improves bandwidth and reduces latency

Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU

[Diagram: the ‘Parallel DataCache™’ memory hierarchy: per-SM L1 caches and shared memory backed by a unified L2 cache, connected to the DRAM interfaces, host interface, and GigaThread scheduler.]
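A hedged sketch of how this on-chip memory is used from CUDA C: each block stages data in __shared__ memory (held in the per-SM storage described above) before writing a result that travels out through the L2 cache to DRAM. The kernel below is an illustrative per-block reduction, intended to be launched with 256 threads per block.

// Per-block sum reduction staged in on-chip shared memory.
__global__ void block_sum(const float *in, float *block_out, int n)
{
    __shared__ float s[256];                      // lives in the SM's shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block (assumes blockDim.x == 256).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_out[blockIdx.x] = s[0];             // written out through L2 to DRAM
}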

Page 34: High-Performance Computing with NVIDIA Tesla GPUs


Larger, Faster Memory Interface

GDDR5 memory interface: 2× the speed of GDDR3

Up to 1 Terabyte of memory attached to GPU

Operate on large data sets


Page 35: High-Performance Computing with NVIDIA Tesla GPUs


Error Correcting Code

ECC protection for DRAM: ECC supported for GDDR5 memory

All major internal memories are ECC protected: register file, L1 cache, L2 cache
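Whether ECC is enabled on a given board can be checked from the CUDA runtime; a small sketch assuming the standard ECCEnabled field of cudaDeviceProp:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: ECC %s\n", prop.name, prop.ECCEnabled ? "enabled" : "disabled");
    return 0;
}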

Page 36: High-Performance Computing with NVIDIA Tesla GPUs


GigaThreadTM Hardware Thread Scheduler

Hierarchically manages thousands of simultaneously active threads

10× faster application context switching

Concurrent kernel execution


Page 37: High-Performance Computing with NVIDIA Tesla GPUs


GigaThread Hardware Thread Scheduler: Concurrent Kernel Execution + Faster Context Switch

[Diagram: execution timelines comparing serial kernel execution, where kernels 1 to 5 run one after another, with parallel kernel execution, where independent kernels share the GPU concurrently and the same work finishes sooner.]
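A hedged sketch of how concurrent kernel execution is expressed in CUDA C: independent kernels launched into different streams may overlap on Fermi-class hardware (the kernels and launch shapes are hypothetical).

#include <cuda_runtime.h>

__global__ void kernel_a(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernel_b(float *y) { y[threadIdx.x] *= 2.0f; }

void run_concurrent(float *d_x, float *d_y)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent work in separate streams can execute concurrently on the GPU.
    kernel_a<<<1, 256, 0, s1>>>(d_x);
    kernel_b<<<1, 256, 0, s2>>>(d_y);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}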

Page 38: High-Performance Computing with NVIDIA Tesla GPUs


GigaThread Streaming Data Transfer Engine

Dual DMA engines: simultaneous CPU-to-GPU and GPU-to-CPU data transfers, fully overlapped with CPU and GPU processing time

Activity snapshot: [Diagram: for kernels 0 through 3, CPU work, streaming data transfers on the two DMA engines (SDT0 and SDT1), and GPU execution all proceed concurrently.]
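A hedged sketch of overlapping transfer and compute with the asynchronous copy API: with pinned host memory (allocated via cudaMallocHost) and two streams, the copy engines can move one chunk while a kernel processes another. The chunking scheme and kernel are hypothetical.

#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// h_data must be pinned host memory for the async copies to overlap.
void process_in_chunks(float *h_data, float *d_buf0, float *d_buf1, int chunk, int nchunks)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    float *d_buf[2] = { d_buf0, d_buf1 };

    for (int c = 0; c < nchunks; ++c) {
        int k = c % 2;                            // alternate buffers and streams
        float *src = h_data + (size_t)c * chunk;
        // Copy in, process, copy out; each chunk stays in its own stream, so
        // copies for one chunk overlap with compute on the other.
        cudaMemcpyAsync(d_buf[k], src, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        process<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d_buf[k], chunk);
        cudaMemcpyAsync(src, d_buf[k], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}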

Page 39: High-Performance Computing with NVIDIA Tesla GPUs


Enhanced Software Support

Full C++ support: virtual functions; try/catch hardware support

System call support: pipes, semaphores, printf, etc.

Unified 64-bit memory addressing
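One example of the expanded system-call support is printf from device code, available on Fermi-class GPUs (compute capability 2.0 and later); a small sketch:

#include <cstdio>

// printf from GPU threads requires sm_20 or later;
// compile with, for example: nvcc -arch=sm_20 hello.cu
__global__ void hello(void)
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<<2, 4>>>();
    cudaDeviceSynchronize();   // flush device-side printf output
    return 0;
}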

Page 40: High-Performance Computing with NVIDIA Tesla GPUs


“I believe history will record Fermi as a significant milestone.”

Dave Patterson, Director, Parallel Computing Research Laboratory, U.C. Berkeley; co-author of Computer Architecture: A Quantitative Approach

“Fermi surpasses anything announced by NVIDIA's leading GPU competitor (AMD).”

Tom Halfhill, Senior Editor, Microprocessor Report

Page 41: High-Performance Computing with NVIDIA Tesla GPUs


“Fermi is the world’s first complete GPU computing architecture.”

Peter Glaskowsky, Technology Analyst, The Envisioneering Group

“The convergence of new, fast GPUs optimized for computation as well as 3-D graphics acceleration and industry-standard software development tools marks the real beginning of the GPU computing era. Gentlemen, start your GPU computing engines.”

Nathan Brookwood, Principal Analyst & Founder, Insight 64

Page 42: High-Performance Computing with NVIDIA Tesla GPUs


GPU Revolutionizing Computing

[Chart: GPU performance (GFlops) over time, 2006 to 2015: T8 (128 cores, 2006), T10 (240 cores, 2008), Fermi (512 cores, 2010), and projected future GPUs through 2015.]

A 2015 GPU*: ~20× the performance of today’s GPU; ~5,000 cores at ~3 GHz (50 mW each); ~20 TFLOPS; ~1.2 TB/s of memory bandwidth

* This is a sketch of what a GPU in 2015 might look like; it does not reflect any actual product plans.

Page 43: High-Performance Computing with NVIDIA Tesla GPUs
