
Page 1: High-Performance Computing with NVIDIA Tesla GPUs

HIGH-PERFORMANCE COMPUTING WITH NVIDIA TESLA GPUS

Timothy Lanfear, NVIDIA

Page 2: High-Performance Computing with NVIDIA Tesla GPUs


WHY GPU COMPUTING?

Page 3: High-Performance Computing with NVIDIA Tesla GPUs


Science is Desperate for Throughput

[Chart: computing power (Gigaflops, log scale from 1 to 1,000,000,000) required by molecular simulations of increasing size, 1982 to 2012: BPTI (3K atoms), estrogen receptor (36K atoms), F1-ATPase (327K atoms), ribosome (2.7M atoms, roughly 1 Petaflop), chromatophore (50M atoms), and bacteria with hundreds of chromatophores (roughly 1 Exaflop). One early simulation ran for 8 months to simulate 2 nanoseconds.]

Page 4: High-Performance Computing with NVIDIA Tesla GPUs


Power Crisis in Supercomputing

[Chart: supercomputer power consumption, 1982 to 2020. Milestones: Gigaflop (1982), Teraflop (1996), Petaflop (2008, e.g. Jaguar and the Los Alamos system), projected Exaflop (2020). Power draw grows from 60,000 W through 850,000 W and 7,000,000 W to 25,000,000 W, the household equivalent of a block, a neighborhood, a town, and a city.]

Page 5: High-Performance Computing with NVIDIA Tesla GPUs


“Oak Ridge National Lab (ORNL) has already announced it will be using Fermi technology in an upcoming super that is ‘expected to be 10-times more powerful than today’s fastest supercomputer.’

Since ORNL’s Jaguar supercomputer, for all intents and purposes, holds that title, and is in the process of being upgraded to 2.3 Petaflops …

… we can surmise that the upcoming Fermi-equipped super is going to be in the 20 Petaflops range.”

September 30, 2009

Page 6: High-Performance Computing with NVIDIA Tesla GPUs


What is GPU Computing?

Computing with CPU + GPU: heterogeneous computing

[Diagram: x86 CPU and GPU connected by the PCIe bus]
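A minimal CUDA C sketch of this heterogeneous model (the kernel and array size are illustrative, not from the deck): the host CPU allocates GPU memory, copies data across the PCIe bus, launches a kernel on the GPU, and copies the result back.

#include <cuda_runtime.h>
#include <stdio.h>

// GPU kernel: each thread scales one element of the array.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes);                  // host (CPU) memory
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc(&d_x, bytes);                              // GPU memory
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // across the PCIe bus

    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);        // runs on the GPU

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // result back to the CPU
    printf("x[0] = %f\n", h_x[0]);
    cudaFree(d_x); free(h_x);
    return 0;
}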

Page 7: High-Performance Computing with NVIDIA Tesla GPUs


Low Latency or High Throughput?

CPU: optimised for low-latency access to cached data sets; control logic for out-of-order and speculative execution

GPU: optimised for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation

[Diagram: CPU die dominated by control logic and cache in front of DRAM, versus GPU die dominated by ALUs in front of DRAM]

Page 8: High-Performance Computing with NVIDIA Tesla GPUs


Why Didn’t GPU Computing Take Off Sooner?

GPU architecture: gaming oriented, processes pixels for display; single-threaded operations; no shared memory

Development tools: graphics oriented (OpenGL, GLSL); university research (Brook); assembly language

Deployment: gaming solutions with limited lifetime; expensive OpenGL professional graphics boards; no HPC-compatible products

Page 9: High-Performance Computing with NVIDIA Tesla GPUs


NVIDIA Invested in GPU Computing in 2004

Strategic move for the company: expand the GPU architecture beyond pixel processing; future platforms will be hybrid, based on multi-core and many-core processors

Hired key industry experts: x86 architecture, x86 compilers, HPC hardware specialists

Goal: create a GPU-based compute ecosystem by 2008

Page 10: High-Performance Computing with NVIDIA Tesla GPUs


NVIDIA GPU Computing Ecosystem

[Diagram: the ecosystem links the GPU architecture, NVIDIA hardware solutions, and the CUDA SDK and tools to customer requirements and customer applications through partners: TPP/OEMs, CUDA development specialists, ISVs, CUDA training companies, hardware architects, and VARs.]

Page 11: High-Performance Computing with NVIDIA Tesla GPUs


NVIDIA GPU Product Families

Tesla™: high-performance computing

Quadro®: design and creation

GeForce®: entertainment

Page 12: High-Performance Computing with NVIDIA Tesla GPUs


Many-Core High Performance Computing

NVIDIA’s 10-series GPU has 240 cores

Each core has a floating-point/integer unit, a logic unit, a move/compare unit, and a branch unit

Cores are managed by a thread manager that can spawn and manage 30,000+ threads, with zero-overhead thread switching

1.4 billion transistors

1 Teraflop of processing power

240 processing cores

NVIDIA 10-Series GPU

NVIDIA’s 2nd Generation CUDA Processor
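For scale, a hedged sketch of how a CUDA launch maps work onto those cores: a hypothetical SAXPY kernel launched over a grid of blocks creates far more threads than there are cores, and the hardware thread manager keeps them all in flight.

// Hypothetical kernel operating on one element per thread.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Launch: 256 threads per block, enough blocks to cover n elements.
// For n = 1,000,000 this creates roughly 3,900 blocks and a million threads,
// far more than the 240 cores; the thread manager schedules them.
void launch_saxpy(int n, float a, const float *d_x, float *d_y)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, a, d_x, d_y);
}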

Page 13: High-Performance Computing with NVIDIA Tesla GPUs


Tesla GPU Computing Products

Tesla C1060 Computing Board: 1 Tesla GPU; 933 Gigaflops single precision; 78 Gigaflops double precision; 4 GB memory

Tesla Personal Supercomputer: 4 Tesla GPUs; 3.7 Teraflops single precision; 312 Gigaflops double precision; 16 GB memory (4 GB per GPU)

Tesla S1070 1U System: 4 Tesla GPUs; 4.14 Teraflops single precision; 346 Gigaflops double precision; 16 GB memory (4 GB per GPU)

SuperMicro 1U GPU SuperServer: 2 Tesla GPUs; 1.87 Teraflops single precision; 156 Gigaflops double precision; 8 GB memory (4 GB per GPU)

Page 14: High-Performance Computing with NVIDIA Tesla GPUs


Tesla C1060 Computing Processor

Processor: 1 × Tesla T10
Number of cores: 240
Core clock: 1.296 GHz
Floating-point performance: 933 Gflops single precision; 78 Gflops double precision
On-board memory: 4.0 GB
Memory bandwidth: 102 GB/sec peak
Memory I/O: 512-bit, 800 MHz GDDR3
Form factor: full ATX, 4.736″ × 10.5″, dual-slot wide
System I/O: PCIe ×16 Gen2
Typical power: 160 W
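Several of these board parameters can be read back at run time through the CUDA runtime API; a small sketch using standard cudaDeviceProp fields (output formatting is illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // query device 0
    printf("Name:            %s\n", prop.name);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Core clock:      %.3f GHz\n", prop.clockRate / 1.0e6);   // clockRate is in kHz
    printf("Global memory:   %.1f GB\n", prop.totalGlobalMem / 1.0e9);
    return 0;
}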

Page 15: High-Performance Computing with NVIDIA Tesla GPUs


Tesla M1060 Embedded Module

Processor: 1 × Tesla T10
Number of cores: 240
Core clock: 1.296 GHz
Floating-point performance: 933 Gflops single precision; 78 Gflops double precision
On-board memory: 4.0 GB
Memory bandwidth: 102 GB/sec peak
Memory I/O: 512-bit, 800 MHz GDDR3
Form factor: full ATX, 4.736″ × 10.5″, dual-slot wide
System I/O: PCIe ×16 Gen2
Typical power: 160 W

OEM-only product; available as an integrated product in OEM systems

Page 16: High-Performance Computing with NVIDIA Tesla GPUs


Tesla Personal Supercomputer

Supercomputing performance: massively parallel CUDA architecture; 960 cores, 4 Teraflops; 250× the performance of a desktop

Personal: one researcher, one supercomputer; plugs into a standard power strip

Accessible: program in C for Windows or Linux; available now worldwide under $10,000

Page 17: High-Performance Computing with NVIDIA Tesla GPUs


Tesla S1070 1U System

Processors: 4 × Tesla T10
Number of cores: 960
Core clock: 1.44 GHz
Performance: 4 Teraflops
Total system memory: 16.0 GB (4.0 GB per T10)
Memory bandwidth: 408 GB/sec peak (102 GB/sec per T10)
Memory I/O: 2048-bit, 800 MHz GDDR3 (512-bit per T10)
Form factor: 1U (EIA 19″ rack)
System I/O: 2 × PCIe ×16 Gen2
Typical power: 700 W

Page 18: High-Performance Computing with NVIDIA Tesla GPUs


SuperMicro GPU 1U SuperServer

Two M1060 GPUs in a 1U

Dual Nehalem-EP Xeon CPUs

Up to 96 GB DDR3 ECC

Onboard InfiniBand (QDR)

3× hot-swap 3.5″ SATA HDD

1200 W power supply


Page 19: High-Performance Computing with NVIDIA Tesla GPUs


Tesla Cluster Configurations

Modular 2U compute node: Tesla S1070 + host server (4 Teraflops)

Integrated 1U compute node: GPU SuperServer (2 Teraflops)

Page 20: High-Performance Computing with NVIDIA Tesla GPUs


GPU Computing Applications

[Diagram: the CUDA Parallel Computing Architecture stack: GPU computing applications on top; language and API layers (CUDA C, CUDA Fortran, Java and Python, OpenCL™, DirectCompute); the NVIDIA GPU with the CUDA parallel computing architecture underneath.]

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

Page 21: High-Performance Computing with NVIDIA Tesla GPUs


NVIDIA CUDA C and OpenCL

Shared back-end compiler and optimization technology: both CUDA C and OpenCL compile to PTX, which is then translated for the GPU

CUDA C: entry point for developers who prefer high-level C

OpenCL: entry point for developers who want a low-level API
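PTX is the virtual instruction set that the driver finishes compiling for the target GPU. As an illustrative sketch, the PTX for a trivial CUDA C kernel can be inspected by asking nvcc to stop at the PTX stage (file names here are hypothetical):

// kernel.cu -- compile with: nvcc -ptx kernel.cu -o kernel.ptx
// The generated kernel.ptx contains the virtual ISA that both the CUDA C and
// OpenCL front ends target before the driver translates it to native GPU code.
__global__ void add_one(float *x)
{
    x[threadIdx.x] += 1.0f;
}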

Page 22: High-Performance Computing with NVIDIA Tesla GPUs


[Diagram: the CUDA software stack. Application software written in C runs across a CPU (4 cores) and a 1U Tesla system (240 cores per GPU) connected through a PCI-E switch, using CUDA Libraries (cuFFT, cuBLAS, cuDPP), CUDA Compilers (C, Fortran), and CUDA Tools (debugger, profiler).]
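As a hedged example of the drop-in library model, a SAXPY call through the cuBLAS handle-based (v2) API; this interface postdates this deck, and error checking is omitted for brevity.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// y = alpha * x + y on the GPU via cuBLAS; d_x and d_y are device pointers.
void gpu_saxpy(int n, float alpha, const float *d_x, float *d_y)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    cublasDestroy(handle);
}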

Page 23: High-Performance Computing with NVIDIA Tesla GPUs


NVIDIA Nexus

Register for the Beta here at GTC!

The first development environment for massively parallel applications.

Beta available October 2009; releasing in Q1 2010

Hardware GPU Source Debugging

Platform-wide Analysis

Complete Visual Studio integration

http://developer.nvidia.com/object/nexus.html

Graphics Inspector

Platform Trace

Parallel Source Debugging

Page 24: High-Performance Computing with NVIDIA Tesla GPUs


Resources for CUDA developers: CUDA Zone, www.nvidia.com/CUDA

CUDA Toolkit: compiler, libraries

CUDA SDK: code samples

CUDA Profiler

Forums

Page 25: High-Performance Computing with NVIDIA Tesla GPUs


Wide Developer Acceptance and Success

146×: interactive visualization of volumetric white matter connectivity

36×: ion placement for molecular dynamics simulation

19×: transcoding HD video stream to H.264

17×: simulation in Matlab using .mex file CUDA function

100×: astrophysics N-body simulation

149×: financial simulation of LIBOR model with swaptions

47×: GLAME@lab, an M-script API for linear algebra operations on GPU

20×: ultrasound medical imaging for cancer diagnostics

24×: highly optimized object-oriented molecular dynamics

30×: Cmatch exact string matching to find similar proteins and gene sequences

Page 26: High-Performance Computing with NVIDIA Tesla GPUs


CUDA Co-Processing Ecosystem

Libraries: FFT, BLAS, LAPACK, image processing, video processing, signal processing, vision

Languages: C, C++, DirectX, Fortran, Java, OpenCL, Python

Compilers and tools: PGI Fortran, CAPS HMPP, MCUDA, MPI, NOAA Fortran2C, OpenMP

Consultants and OEMs: ANEO, GPU Tech, ...

Over 200 universities teaching CUDA, including UIUC, MIT, Harvard, Berkeley, Cambridge, Oxford, IIT Delhi, Tsinghua, Dortmund, ETH Zurich, Moscow, NTU, ...

Application areas: oil & gas, finance, medical, biophysics, numerics, imaging, CFD, DSP, EDA

Page 27: High-Performance Computing with NVIDIA Tesla GPUs


What We Did in the Past Three Years

2006: G80, the first GPU with built-in compute features; 128-core, multi-threaded, scalable architecture; CUDA SDK beta

2007: Tesla HPC product line; CUDA SDK 1.0 and 1.1

2008: GT200, second GPU generation, 240 cores, 64-bit; second-generation Tesla HPC products; CUDA SDK 2.0

2009: ...

Page 28: High-Performance Computing with NVIDIA Tesla GPUs


NEXT-GENERATION GPU ARCHITECTURE — ‘FERMI’

Page 29: High-Performance Computing with NVIDIA Tesla GPUs


3 billion transistors

Over 2× the cores (512 total)

8× the peak DP performance

ECC

L1 and L2 caches

~2× memory bandwidth (GDDR5)

Up to 1 Terabyte of GPU memory

Concurrent kernels

Hardware support for C++

[Diagram: Fermi chip layout: an array of streaming multiprocessors around a shared L2 cache, with multiple DRAM interfaces, a host interface, and the GigaThread scheduler.]

Introducing the ‘Fermi’ Architecture: the soul of a supercomputer in the body of a GPU

Page 30: High-Performance Computing with NVIDIA Tesla GPUs


Design Goal of Fermi

Expand the performance sweet spot of the GPU

Bring more users, more applications to the GPU

[Diagram: design space ranging from instruction-parallel workloads with many decisions (CPU) to data-parallel workloads with large data sets (GPU).]

Page 31: High-Performance Computing with NVIDIA Tesla GPUs


Streaming Multiprocessor Architecture

[Diagram: one streaming multiprocessor: instruction cache, dual schedulers and dispatch units, register file, 32 CUDA cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable cache/shared memory, uniform cache.]

32 CUDA cores per SM (512 total)

8× the peak double-precision floating-point performance (double precision runs at 50% of single-precision peak)

Dual Thread Scheduler

64 KB of RAM for shared memory and L1 cache (configurable)
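The 64 KB split between shared memory and L1 cache is software-selectable; a hedged sketch using the runtime call introduced for Fermi (the kernel itself is a hypothetical placeholder):

#include <cuda_runtime.h>

// Placeholder kernel body; a real stencil would stage data in shared memory.
__global__ void stencil_kernel(float *out, const float *in)
{
    int i = threadIdx.x;
    out[i] = in[i];
}

void configure_cache(void)
{
    // Prefer 48 KB shared memory / 16 KB L1 for a shared-memory-heavy kernel...
    cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferShared);
    // ...or 48 KB L1 / 16 KB shared memory for a cache-friendly one:
    // cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferL1);
}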

Page 32: High-Performance Computing with NVIDIA Tesla GPUs


CUDA Core Architecture

[Diagram: within the SM layout shown above, each CUDA core contains a dispatch port, an operand collector, a floating-point unit and an integer unit, and a result queue.]

New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs

Fused multiply-add (FMA) instruction for both single and double precision

Newly designed integer ALU optimized for 64-bit and extended precision operations
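A hedged device-code sketch of the fused multiply-add path: the single-precision intrinsic __fmaf_rn computes a*b+c with a single rounding, matching the IEEE 754-2008 FMA the hardware provides (the compiler will also generate FMA from plain expressions); the polynomial coefficients are illustrative.

// Each thread evaluates a small polynomial using fused multiply-add.
__global__ void poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        // Horner's rule: ((3*v + 2)*v + 1)*v + 0.5, one rounding per FMA.
        float r = __fmaf_rn(3.0f, v, 2.0f);
        r = __fmaf_rn(r, v, 1.0f);
        r = __fmaf_rn(r, v, 0.5f);
        y[i] = r;
    }
}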

Page 33: High-Performance Computing with NVIDIA Tesla GPUs


Cached Memory Hierarchy

First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

L1 cache per SM (32 cores): improves bandwidth and reduces latency

Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU

[Diagram: the ‘Parallel DataCache™’ memory hierarchy: per-SM L1 caches and shared memory backed by a unified L2 cache, connected to the DRAM interfaces, host interface, and GigaThread scheduler.]
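A hedged sketch of how this on-chip memory is used from CUDA C: each block stages data in __shared__ memory (held in the per-SM storage described above) before writing a result that travels out through the L2 cache to DRAM. The kernel below is an illustrative per-block reduction, intended to be launched with 256 threads per block.

// Per-block sum reduction staged in on-chip shared memory.
__global__ void block_sum(const float *in, float *block_out, int n)
{
    __shared__ float s[256];                      // lives in the SM's shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block (assumes blockDim.x == 256).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_out[blockIdx.x] = s[0];             // written out through L2 to DRAM
}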

Page 34: High-Performance Computing with NVIDIA Tesla GPUs


Larger, Faster Memory Interface

GDDR5 memory interface: 2× the speed of GDDR3

Up to 1 Terabyte of memory attached to GPU

Operate on large data sets


Page 35: High-Performance Computing with NVIDIA Tesla GPUs


Error Correcting Code

ECC protection for DRAM: ECC supported for GDDR5 memory

All major internal memories are ECC protected: register file, L1 cache, L2 cache
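Whether ECC is enabled on a given board can be checked from the CUDA runtime; a small sketch assuming the standard ECCEnabled field of cudaDeviceProp:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: ECC %s\n", prop.name, prop.ECCEnabled ? "enabled" : "disabled");
    return 0;
}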

Page 36: High-Performance Computing with NVIDIA Tesla GPUs


GigaThreadTM Hardware Thread Scheduler

Hierarchically manages thousands of simultaneously active threads

10× faster application context switching

Concurrent kernel execution


Page 37: High-Performance Computing with NVIDIA Tesla GPUs


GigaThread Hardware Thread Scheduler: Concurrent Kernel Execution + Faster Context Switch

[Diagram: execution timelines comparing serial kernel execution, where kernels 1 to 5 run one after another, with parallel kernel execution, where independent kernels share the GPU concurrently and the same work finishes sooner.]
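A hedged sketch of how concurrent kernel execution is expressed in CUDA C: independent kernels launched into different streams may overlap on Fermi-class hardware (the kernels and launch shapes are hypothetical).

#include <cuda_runtime.h>

__global__ void kernel_a(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernel_b(float *y) { y[threadIdx.x] *= 2.0f; }

void run_concurrent(float *d_x, float *d_y)
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent work in separate streams can execute concurrently on the GPU.
    kernel_a<<<1, 256, 0, s1>>>(d_x);
    kernel_b<<<1, 256, 0, s2>>>(d_y);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}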

Page 38: High-Performance Computing with NVIDIA Tesla GPUs


GigaThread Streaming Data Transfer Engine

Dual DMA engines: simultaneous CPU-to-GPU and GPU-to-CPU data transfers, fully overlapped with CPU and GPU processing time

Activity snapshot: [Diagram: for kernels 0 through 3, CPU work, streaming data transfers on the two DMA engines (SDT0 and SDT1), and GPU execution all proceed concurrently.]
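A hedged sketch of overlapping transfer and compute with the asynchronous copy API: with pinned host memory (allocated via cudaMallocHost) and two streams, the copy engines can move one chunk while a kernel processes another. The chunking scheme and kernel are hypothetical.

#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// h_data must be pinned host memory for the async copies to overlap.
void process_in_chunks(float *h_data, float *d_buf0, float *d_buf1, int chunk, int nchunks)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    float *d_buf[2] = { d_buf0, d_buf1 };

    for (int c = 0; c < nchunks; ++c) {
        int k = c % 2;                            // alternate buffers and streams
        float *src = h_data + (size_t)c * chunk;
        // Copy in, process, copy out; each chunk stays in its own stream, so
        // copies for one chunk overlap with compute on the other.
        cudaMemcpyAsync(d_buf[k], src, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        process<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d_buf[k], chunk);
        cudaMemcpyAsync(src, d_buf[k], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}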

Page 39: High-Performance Computing with NVIDIA Tesla GPUs


Enhanced Software Support

Full C++ support: virtual functions; try/catch hardware support

System call support: pipes, semaphores, printf, etc.

Unified 64-bit memory addressing
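One example of the expanded system-call support is printf from device code, available on Fermi-class GPUs (compute capability 2.0 and later); a small sketch:

#include <cstdio>

// printf from GPU threads requires sm_20 or later;
// compile with, for example: nvcc -arch=sm_20 hello.cu
__global__ void hello(void)
{
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    hello<<<2, 4>>>();
    cudaDeviceSynchronize();   // flush device-side printf output
    return 0;
}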

Page 40: High-Performance Computing with NVIDIA Tesla GPUs


“I believe history will record Fermi as a significant milestone.”

Dave Patterson, Director, Parallel Computing Research Laboratory, U.C. Berkeley; co-author of Computer Architecture: A Quantitative Approach

“Fermi surpasses anything announced by NVIDIA's leading GPU competitor (AMD).”

Tom Halfhill, Senior Editor, Microprocessor Report

Page 41: High-Performance Computing with NVIDIA Tesla GPUs


“Fermi is the world’s first complete GPU computing architecture.”

Peter Glaskowsky, Technology Analyst, The Envisioneering Group

“The convergence of new, fast GPUs optimized for computation as well as 3-D graphics acceleration and industry-standard software development tools marks the real beginning of the GPU computing era. Gentlemen, start your GPU computing engines.”

Nathan Brookwood, Principal Analyst & Founder, Insight 64

Page 42: High-Performance Computing with NVIDIA Tesla GPUs


GPU Revolutionizing Computing

[Chart: GPU performance (GFlops) over time, 2006 to 2015: T8 (128 cores, 2006), T10 (240 cores, 2008), Fermi (512 cores, 2010), and projected future GPUs through 2015.]

A 2015 GPU*: ~20× the performance of today’s GPU; ~5,000 cores at ~3 GHz (50 mW each); ~20 TFLOPS; ~1.2 TB/s of memory bandwidth

* This is a sketch of what a GPU in 2015 might look like; it does not reflect any actual product plans.

Page 43: High-Performance Computing with NVIDIA Tesla GPUs
