Product Availability Update

Page 1: Product Availability Update

1

Product | Inventory | Lead time for big orders | Notes
C1060 | 200 units | 8 weeks | Build to order
M1060 | 500 units | 8 weeks | Build to order
S1070-400 | 50 units | 10 weeks | Build to order
S1070-500 | 25 units + 75 being built | 10 weeks | Build to order
M2050 | Shipping now; building 20K for Q2 | 8 weeks | Sold out through mid-July
S2050 | Shipping now; building 200 for Q2 | 8 weeks | Sold out through mid-July
C2050 | 2000 units | 8 weeks | Will maintain inventory
M2070 | Sept 2010 | - | Get PO in now to get priority
C2070 | Sept-Oct 2010 | - | Get PO in now to get priority
M2070-Q | Oct 2010 | - |

Parallel Processing on GPUs with the Fermi Architecture (Processamento Paralelo em GPUs na Arquitetura Fermi)

Arnaldo Tavares, Tesla Sales Manager for Latin America

Page 2: Product Availability Update

2

Quadro or Tesla?

QUADRO™ (professional graphics):

Computer Aided Design, e.g. CATIA, SolidWorks, Siemens NX

3D Modeling / Animation, e.g. 3ds Max, Maya, Softimage

Video Editing / FX, e.g. Adobe CS5, Avid

TESLA™ (GPU computing):

Numerical Analytics, e.g. MATLAB, Mathematica

Computational Biology, e.g. AMBER, NAMD, VMD

Computer Aided Engineering, e.g. ANSYS, SIMULIA/ABAQUS

Page 3: Product Availability Update

3

GPU Computing

CPU + GPU Co-Processing

CPU: 4 cores, 48 GigaFlops (DP)

GPU: 448 cores, 515 GigaFlops (DP)

(Average efficiency in Linpack: 50%)

Page 4: Product Availability Update

4

146X - Medical Imaging, U of Utah

36X - Molecular Dynamics, U of Illinois, Urbana

18X - Video Transcoding, Elemental Technologies

50X - Matlab Computing, AccelerEyes

100X - Astrophysics, RIKEN

149X - Financial Simulation, Oxford

47X - Linear Algebra, Universidad Jaime

20X - 3D Ultrasound, Techniscan

130X - Quantum Chemistry, U of Illinois, Urbana

30X - Gene Sequencing, U of Maryland

Overall: 50x - 150x

Page 5: Product Availability Update

5

Increasing Number of Professional CUDA Apps (a mix of available, announced, and future products)

Tools: TotalView Debugger, Allinea DDT Debugger, Parallel Nsight Visual Studio IDE, TauCUDA Perf Tools, ParaTools VampirTrace, Bright Cluster Manager, Platform LSF Cluster Manager, CAPS HMPP, PGI Accelerators, PGI CUDA Fortran, PGI CUDA x86, CUDA C/C++, Thrust C++ Template Library, MATLAB, AccelerEyes Jacket (MATLAB), Wolfram Mathematica

Libraries: CUDA FFT, CUDA BLAS, RNG & SPARSE CUDA Libraries, MAGMA (LAPACK), EMPhotonics CULAPACK, NVIDIA NPP Performance Primitives, NVIDIA Video Libraries, VSG Open Inventor

Oil & Gas: StoneRidge RTM, Headwave Suite, Acceleware RTM Solver, GeoStar Seismic Suite, ffA SVI Pro, OpenGeoSolutions OpenSEIS, Paradigm RTM, Paradigm SKUA, Seismic City RTM, Tsunami RTM, Panorama Tech

Bio-Chemistry: TeraChem, BigDFT, ABINIT, VMD, Acellera ACEMD, AMBER, DL-POLY, GROMACS, HOOMD, LAMMPS, NAMD, GAMESS, CP2K, OpenEye ROCS

Bio-Informatics: CUDA-BLASTP, CUDA-EC, CUDA-MEME, CUDA-SW++ (Smith-Waterman), GPU-HMMER, HEX Protein Docking, MUMmerGPU, PIPER Docking

CAE: ANSYS Mechanical, ACUSIM AcuSolve 1.8, Autodesk Moldflow, Prometech Particleworks, Remcom XFdtd 7.0, MSC.Software Marc 2010.2, LSTC LS-DYNA 971, FluiDyna OpenFOAM, Metacomp CFD++

Page 6: Product Availability Update

6

Increasing Number of Professional CUDA Apps (a mix of available, announced, and future products)

Rendering: NVIDIA OptiX (SDK), mental images iray (OEM), Bunkspeed Shot (iray), Autodesk 3ds Max, Lightworks Artisan, Refractive Software Octane, Works Zebra Zeany, Chaos Group V-Ray GPU, Cebas finalRender, Random Control Arion, Caustic Graphics, Weta Digital PantaRay, ILM Plume

Video: Adobe Premiere Pro CS5, Elemental Video, MainConcept CUDA Encoder, Fraunhofer JPEG2000, Cinnafilm Pixel Strings, Assimilate SCRATCH, The Foundry Kronos, TDVision TDVCodec, ARRI various apps, Black Magic Da Vinci, GenArts Sapphire, Digital Anarchy Photo

Finance: Murex MACS, Numerix Risk, RMS Risk Management Solutions, SciComp SciFinance, Hanweck Options Analytics, Aquimin AlphaVision, NAG RNG

EDA: Synopsys TCAD, Agilent EMPro 2010, Agilent ADS SPICE, CST Microwave, SPEAG SEMCAD X, Acceleware FDTD Solver, Acceleware EM Solution, Rocketick Verilog Sim, Gauda OPC

Medical: Siemens 4D Ultrasound, Digisens, Useful Progress

Other: Schrodinger Core Hopping, MotionDSP Ikena Video, Manifold GIS, Dalsa Machine Vision, MVTec Machine Vision

Page 7: Product Availability Update

7

3 of Top 5 Supercomputers

[Chart: peak performance (Gigaflops) and power (Megawatts) for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, and Tera 100]

Page 8: Product Availability Update

8

3 of Top 5 Supercomputers

[Chart repeated: peak performance (Gigaflops) and power (Megawatts) for Tianhe-1A, Jaguar, Nebulae, Tsubame, Hopper II, and Tera 100]

Page 9: Product Availability Update

9

What if Every Supercomputer Had Fermi?

[Chart: Linpack Teraflops of the Top 500 Supercomputers (Nov 2009), including Oak Ridge National Laboratory, Lawrence Livermore National Laboratory, IDRIS, Merlion Trade GmbH, and anonymized network, semiconductor, geoscience, hosting, and IT service providers]

150 GPUs, 37 TeraFlops, $740K: Top 150

225 GPUs, 55 TeraFlops, $1.1M: Top 100

450 GPUs, 110 TeraFlops, $2.2M: Top 50

Page 10: Product Availability Update

10

Hybrid ExaScale Trajectory

2008: 1 TFLOP, 7.5 kilowatts

2010: 1.27 PFLOPS, 2.55 megawatts

2017*: 2 EFLOPS, 10 megawatts

* This is a projection based on Moore’s law and does not represent a committed roadmap

Page 11: Product Availability Update

11

Tesla Roadmap

Page 12: Product Availability Update

12

The March of the GPUs

[Chart 1: Peak memory bandwidth (GBytes/sec), 2007-2012, NVIDIA T10, T20, and T20A GPUs vs. Nehalem 3 GHz, Westmere 3 GHz, and 8-core Sandy Bridge 3 GHz CPUs]

[Chart 2: Peak double precision floating point (GFlops/sec), 2007-2012, NVIDIA GPUs (ECC off) vs. the same x86 CPUs]

Page 13: Product Availability Update

13

Project Denver

Page 14: Product Availability Update

14

Expected Tesla Roadmap with Project Denver

Page 15: Product Availability Update

15

Workstation / Data Center Solutions

Workstations: up to 4x Tesla C2050/70 GPUs

Integrated CPU-GPU server: 2x Tesla M2050/70 GPUs in 1U

OEM CPU server + Tesla S2050/70: 4 Tesla GPUs in 2U

Page 16: Product Availability Update

16

Tesla C-Series Workstation GPUs: Tesla C2050 and Tesla C2070

Processor: Tesla 20-series GPU

Number of cores: 448

Caches: 64 KB L1 cache + shared memory per 32 cores (per SM); 768 KB L2 cache

Floating point peak performance: 1030 Gigaflops (single precision), 515 Gigaflops (double precision)

GPU memory: 3 GB (2.625 GB with ECC on) on the C2050; 6 GB (5.25 GB with ECC on) on the C2070

Memory bandwidth: 144 GB/s (GDDR5)

System I/O: PCIe x16 Gen2

Power: 238 W (max)

Availability: shipping now

Page 17: Product Availability Update

17

How is the GPU Used?

Basic component: the “Streaming Multiprocessor” (SM)

SIMD: “Single Instruction, Multiple Data”

Same instruction for all cores, but each core operates on different data

“SIMD at the SM level, MIMD at the GPU chip level”

Source: Presentation from Felipe A. Cruz, Nagasaki University
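To make the SIMD point concrete, here is a small hypothetical kernel sketch: every thread executes the same instruction stream but indexes its own element, and the data-dependent branch is where threads of a warp can diverge and get serialized.

__global__ void clamp_negatives(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // same instructions, per-thread data
    if (i < n) {
        if (data[i] < 0.0f)        // data-dependent branch: divergent threads within
            data[i] = 0.0f;        // a warp are executed serially by the SM
    }
}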

Page 18: Product Availability Update

18

The Use of GPUs and Bottleneck Analysis

Source: Presentation from Takayuki Aoki, Tokyo Institute of Technology

Page 19: Product Availability Update

19

The Fermi Architecture

3 billion transistors

16 x Streaming Multiprocessors (SMs)

6 x 64-bit Memory Partitions = 384-bit Memory Interface

Host Interface: connects the GPU to the CPU via PCI-Express

GigaThread global scheduler: distributes thread blocks to SM thread schedulers

Page 20: Product Availability Update

20

SM Architecture

[SM block diagram: instruction cache, dual warp scheduler and dispatch units, register file, 32 cores, 16 load/store units, 4 special function units, interconnect network, 64 KB configurable shared memory / L1 cache, uniform cache]

32 CUDA cores per SM (512 total)

16 x load/store units: source and destination addresses calculated for 16 threads per clock

4 x special function units (sine, cosine, square root, etc.)

64 KB of RAM for shared memory and L1 cache (configurable)

Dual Warp Scheduler

Page 21: Product Availability Update

21

Dual Warp Scheduler

1 Warp = 32 parallel threads

2 Warps issued and executed concurrently

Each Warp goes to 16 CUDA Cores

Most instructions can be dual issued (exception: Double Precision instructions)

Dual-Issue Model allows near peak hardware performance

Page 22: Product Availability Update

22

CUDA Core Architecture

[CUDA core block diagram: dispatch port, operand collector, FP unit and INT unit, result queue, set within the SM's register file, schedulers, load/store units, special function units, interconnect network, and caches]

New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs

Newly designed integer ALU optimized for 64-bit and extended precision operations

Fused multiply-add (FMA) instruction for both 32-bit single and 64-bit double precision

Page 23: Product Availability Update

23

Fused Multiply-Add Instruction (FMA)
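This slide is a diagram in the original deck; as a rough sketch of the difference FMA makes, the hypothetical kernel below computes the same expression with explicitly separate multiply and add intrinsics versus the fused fmaf intrinsic, which rounds only once per IEEE 754-2008.

__global__ void fma_demo(const float *a, const float *b, const float *c, float *diff, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float separate = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]); // multiply rounded, then add rounded
        float fused    = fmaf(a[i], b[i], c[i]);                 // a*b+c with a single rounding
        diff[i] = fused - separate;                              // nonzero where the extra rounding mattered
    }
}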

Page 24: Product Availability Update

24

GigaThreadTM Hardware Thread Scheduler (HTS)

Hierarchically manages thousands of simultaneously active threads

10x faster application context switching (each program receives a time slice of processing resources)

Concurrent kernel execution


Page 25: Product Availability Update

25

GigaThread Hardware Thread Scheduler: Concurrent Kernel Execution + Faster Context Switch

[Diagram: timeline comparing serial kernel execution (Kernel 1 through Kernel 5 one after another) with parallel kernel execution (independent kernels running concurrently on the GPU)]
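As a rough sketch of how an application exploits this (assuming a Fermi-class GPU and the CUDA runtime API), two independent kernels launched into different streams are eligible to run concurrently rather than back to back; the kernels here are hypothetical.

#include <cuda_runtime.h>

// Two independent, trivial kernels (hypothetical examples).
__global__ void scale(float *x, int n) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0f; }
__global__ void shift(float *y, int n) { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] += 1.0f; }

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Independent kernels in different streams: on Fermi the GigaThread
    // scheduler may run them concurrently instead of back to back.
    scale<<<(n + 255) / 256, 256, 0, s1>>>(x, n);
    shift<<<(n + 255) / 256, 256, 0, s2>>>(y, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}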

Page 26: Product Availability Update

26

GigaThread Streaming Data Transfer (SDT) Engine

Dual DMA engines: simultaneous CPU-to-GPU and GPU-to-CPU data transfer, fully overlapped with CPU and GPU processing time

[Activity snapshot diagram: CPU work, kernels 0-3 on the GPU, and transfers on the two DMA engines (SDT0, SDT1) proceeding in parallel]
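A minimal sketch of how the dual DMA engines get exercised from the runtime API: copies issued with cudaMemcpyAsync from page-locked host memory, in a stream, can overlap kernel execution and transfers queued in other streams (the kernel and sizes below are illustrative).

#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n)   // hypothetical kernel
{ int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) out[i] = in[i] * 2.0f; }

int main()
{
    const int n = 1 << 20;
    float *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in,  n * sizeof(float));   // page-locked host memory,
    cudaMallocHost(&h_out, n * sizeof(float));   // required for async copies
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Upload, compute, and download are all queued asynchronously; with the
    // dual DMA engines, copies in other streams could run at the same time.
    cudaMemcpyAsync(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice, s);   // CPU -> GPU
    process<<<(n + 255) / 256, 256, 0, s>>>(d_in, d_out, n);
    cudaMemcpyAsync(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost, s); // GPU -> CPU
    cudaStreamSynchronize(s);

    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    cudaStreamDestroy(s);
    return 0;
}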

Page 27: Product Availability Update

27

Cached Memory Hierarchy

First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

Shared memory / L1 cache per SM (64 KB): improves bandwidth and reduces latency

Unified L2 cache (768 KB): fast, coherent data sharing across all cores in the GPU

Global memory (up to 6 GB)
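The shared memory / L1 split can be chosen per kernel from the host; a small sketch using the runtime call cudaFuncSetCacheConfig (the kernel is hypothetical):

#include <cuda_runtime.h>

__global__ void stencil(float *x) { /* hypothetical kernel that leans on __shared__ memory */ }

int main()
{
    // Choose the 64 KB split for this kernel before launching it:
    //   cudaFuncCachePreferShared -> 48 KB shared memory + 16 KB L1 cache
    //   cudaFuncCachePreferL1     -> 16 KB shared memory + 48 KB L1 cache
    cudaFuncSetCacheConfig(stencil, cudaFuncCachePreferShared);
    stencil<<<1, 256>>>(0);
    cudaDeviceSynchronize();
    return 0;
}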

Page 28: Product Availability Update

28

CUDA: Compute Unified Device Architecture

NVIDIA’s Parallel Computing Architecture

Software development platform targeted at the GPU architecture:

1. CUDA parallel compute engines inside the GPU

2. CUDA support in the OS kernel-level driver

3. CUDA driver (PTX ISA)

4. Device-level APIs: OpenCL driver (applications using OpenCL C), CUDA Driver API, DirectX 11 Compute (applications using DirectX / HLSL)

5. Language integration: C for CUDA and the C runtime for CUDA (applications using C, C++, Fortran, Java, Python, ...)
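To make the split between the device-level Driver API and the C-for-CUDA language integration concrete, here is a hedged sketch of a kernel launch through the CUDA Driver API as exposed by later toolkits (module, kernel name, and sizes are hypothetical; the C-for-CUDA version of the same launch appears on page 34):

#include <cuda.h>

int main()
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load PTX produced by nvcc and look up a kernel by name (hypothetical file/kernel).
    cuModuleLoad(&mod, "saxpy.ptx");
    cuModuleGetFunction(&fn, mod, "saxpy_parallel");

    int n = 1 << 20;
    float a = 2.0f;
    CUdeviceptr d_x, d_y;
    cuMemAlloc(&d_x, n * sizeof(float));
    cuMemAlloc(&d_y, n * sizeof(float));

    // Explicit launch through the device-level API.
    void *args[] = { &n, &a, &d_x, &d_y };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1,   // grid dimensions
                       256, 1, 1,               // block dimensions
                       0, 0, args, 0);          // shared mem, stream, parameters
    cuCtxSynchronize();

    cuMemFree(d_x);
    cuMemFree(d_y);
    cuCtxDestroy(ctx);
    return 0;
}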

Page 29: Product Availability Update

29

Thread Hierarchy

Kernels (simple C functions) are executed by threads

Threads are grouped into Blocks

Threads in a Block can synchronize execution

Blocks are grouped in a Grid

Blocks are independent (they must be able to execute in any order)

Source: Presentation from Felipe A. Cruz, Nagasaki University
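A minimal sketch of the hierarchy in code (assuming the standard C-for-CUDA runtime syntax): one kernel, launched as a grid of independent blocks, with threads inside each block cooperating through a barrier. The kernel and sizes are hypothetical, and n is assumed to be a multiple of 256.

// A grid of independent blocks; threads within a block cooperate
// through shared memory and a barrier.
__global__ void block_reverse(float *data)
{
    __shared__ float tile[256];                      // visible to this block only
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    tile[threadIdx.x] = data[i];
    __syncthreads();                                 // barrier for all threads of the block
    data[i] = tile[blockDim.x - 1 - threadIdx.x];    // reverse the block's segment
}

// Launch as a grid of n/256 independent blocks of 256 threads each:
// block_reverse<<<n / 256, 256>>>(d_data);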

Page 30: Product Availability Update

30

Memory and Hardware Hierarchy

CUDA cores execute threads; threads access registers

Streaming Multiprocessors (SMs) execute blocks; threads within a block can share data/results via shared memory

The GPU executes grids; grids use global memory for result sharing (after kernel-wide global synchronization)

Source: Presentation from Felipe A. Cruz, Nagasaki University
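A sketch tying the two hierarchies together (hypothetical kernel, blockDim.x assumed to be 256): per-thread values live in registers, the per-block scratch array lives in the SM's shared memory, and the only way blocks share results is through global memory once the grid finishes.

__global__ void partial_sums(const float *in, float *block_sums, int n)
{
    __shared__ float scratch[256];                  // shared memory: one copy per block, on the SM
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i lives in a register of this thread
    scratch[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // in[] is global memory, visible to the whole grid
    __syncthreads();

    // Tree reduction inside the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();
    }

    // Blocks cannot synchronize with each other inside a kernel, so each block
    // publishes its result to global memory; combining the partial sums happens
    // in a later kernel (or on the host) after this grid has finished.
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = scratch[0];
}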

Page 31: Product Availability Update

31

Full View of the Hierarchy Model

CUDA | Hardware level | Memory access
Thread | CUDA core | Registers
Block | SM | Shared memory
Grid | GPU | Global memory
Device | Node | Host memory

Page 32: Product Availability Update

32

IDs and Dimensions

[Diagram: a device running Grid 1, a 3x2 arrangement of blocks; Block (1, 1) is expanded into a 5x3 arrangement of threads]

Threads: 3D IDs, unique within a block

Blocks: 2D IDs, unique within a grid

Dimensions are set at launch time and can be unique for each grid

Built-in variables: threadIdx, blockIdx, blockDim, gridDim
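A small sketch of how the built-in variables combine into global indices for a 2D launch (the matrix-add kernel and its sizes are hypothetical):

__global__ void add_matrices(const float *a, const float *b, float *c, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < width && y < height)
        c[y * width + x] = a[y * width + x] + b[y * width + x];
}

// Dimensions chosen at launch time via dim3:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// add_matrices<<<grid, block>>>(d_a, d_b, d_c, width, height);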

Page 33: Product Availability Update

33

Compiling C for CUDA Applications

Original serial C source:

void serial_function(...) { ... }

void other_function(int ...) { ... }

void saxpy_serial(float ...) {
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

void main() {
    float x;
    saxpy_serial(...);
    ...
}

[Compilation flow: the key kernels are modified into parallel CUDA code and compiled by NVCC (Open64) into CUDA object files; the rest of the C application is compiled by the CPU compiler into CPU object files; the linker combines both into a single CPU-GPU executable]
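In practice nvcc drives this whole flow; a hedged sketch of the equivalent command line for a Fermi-class target (file names are hypothetical):

    nvcc -arch=sm_20 -c saxpy.cu -o saxpy.o     (compile the CUDA kernels)
    gcc -c main.c -o main.o                     (compile the unchanged CPU code)
    nvcc saxpy.o main.o -o app                  (link into one CPU-GPU executable)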

Page 34: Product Availability Update

34

C for CUDA: C with a few keywords

Standard C code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}

// Invoke parallel SAXPY kernel with 256 threads per block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
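The slide shows only the kernel and its launch; for completeness, a minimal host program around the parallel version might look like this sketch (data and sizes are illustrative):

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *x = (float *)malloc(bytes);
    float *y = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);                              // global memory on the GPU
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0f, d_x, d_y);  // grid of blocks, 256 threads each

    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);    // results back to the host

    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}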

Page 35: Product Availability Update

35

Software Programming

Source: Presentation from Andreas Klöckner, NYU

Page 36: Product Availability Update

36

Software Programming

Source: Presentation from Andreas Klöckner, NYU

Page 37: Product Availability Update

37

Software Programming

Source: Presentation from Andreas Klöckner, NYU

Page 38: Product Availability Update

38

Software Programming

Source: Presentation from Andreas Klöckner, NYU

Page 39: Product Availability Update

39

Software Programming

Source: Presentation from Andreas Klöckner, NYU

Page 40: Product Availability Update

40

Software Programming

Source: Presentation from Andreas Klöckner, NYU

Page 41: Product Availability Update

41

Software Programming

Source: Presentation from Andreas Klöckner, NYU

Page 42: Product Availability Update

42

Software Programming

Source: Presentation from Andreas Klöckner, NYU

Page 43: Product Availability Update

43

CUDA C/C++ Leadership

2007 - 2010 timeline:

CUDA Toolkit 1.0 (July 07): C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples

CUDA Toolkit 1.1 (Nov 07): Win XP 64, atomics support, multi-GPU support

CUDA Toolkit 2.0 (Aug 08): double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation

CUDA Toolkit 2.3 (July 09): DP FFT, 16/32-bit conversion intrinsics, performance enhancements

Along the way: CUDA Visual Profiler 2.2, cuda-gdb HW debugger, Parallel Nsight beta

CUDA Toolkit 3.0 (Mar 10): C++ inheritance, Fermi architecture support, tools updates, driver / runtime interop

Page 44: Product Availability Update

44

Why should I choose Tesla over consumer cards? (Feature: Benefit)

Features:

4x higher double precision (on 20-series): higher performance for scientific CUDA applications

ECC only on Tesla and Quadro (on 20-series): data reliability inside the GPU and on DRAM memories

Bi-directional PCI-E communication (Tesla has dual DMA engines, GeForce has only one DMA engine): higher performance for CUDA applications by overlapping communication and computation

Larger memory for larger data sets (3 GB and 6 GB products): higher performance on a wide range of applications (medical, oil & gas, manufacturing, FEA, CAE)

Cluster management software tools available on Tesla only: needed for GPU monitoring and job scheduling in data center deployments

TCC (Tesla Compute Cluster) driver for Windows supported only on Tesla: higher performance for CUDA applications due to lower kernel launch overhead; TCC also adds support for RDP and Services

Integrated OEM workstations and servers: trusted, reliable systems built for Tesla products

Professional ISVs will certify CUDA applications only on Tesla: bug reproduction, support, and feature requests handled for Tesla only

Quality & Warranty:

2 to 4 days of stress testing and memory burn-in for reliability; added margin in memory and core clocks; built for 24/7 computing in data center and workstation environments

Manufactured and guaranteed by NVIDIA: no changes in key components like GPU and memory without notice; always the same clocks for known, reliable performance

3-year warranty from HP: reliable, long-life products

Support & Lifecycle:

Enterprise support and higher priority for CUDA bugs and requests: ability to influence the CUDA and GPU roadmap, with early access to requested features

18-24 months availability plus a 6-month EOL notice: reliable product supply