The Source for GPU Programming

developer.nvidia.com: Latest News, Developer Events Calendar, Technical Documentation, Conference Presentations, GPU Programming Guide, Powerful Tools, SDKs, and more...

Join our FREE registered developer program for early access to NVIDIA drivers, cutting edge tools, online support forums, and more!

GeForce 6 Series Performance

Matthias Wloka
Developer Technology

GeForce 6 Series Specific Performance

Instancing

Vertex- and Pixel-Shaders 3.0
    Branching and Looping
    Vertex Texture Fetch

Hardware Shadow Maps

Z- and Stencil-Cull

FP16 Filter and Blend, MRTs

Marketing Speak Translation

SM3, i.e., “Shader Model 3” hardware

Sometimes shorthand for “Every GeForce 6 feature not in GeForce FX”

Not just VS/PS 3.0
    See previous slide!

GeForce 6200 does not support fp16 filter/blend
    Okay, because ‘value’ cards lack the memory bandwidth to use fp16 render targets

Simplified Graphics Pipeline

[Diagram blocks, in pipeline order: CPU, Geometry Storage, Geometry Processor, Rasterizer, Fragment Processor, Frame Buffer; plus Texture Storage + Filtering]

Common bottlenecks: CPU, fragment processor

New features help address these bottlenecks
    Instancing, Vertex Shader 3.0
    Pixel Shader 3.0
    Fp16 Filter, Shadow Maps
    Fp16 Blend, MRT
    Z/Stencil Cull

CPU Bottleneck Getting Worse

Courtesy Ian Buck, Stanford University

Explicitly Address CPU Bottleneck

Reduce draw calls
    Budget/Design for your draw calls!
    Use instancing to reduce batches
    Use über-shaders to eliminate batches/passes
    Use fp16 blending to eliminate passes

Move more computations to GPU:
    GPGPU: General-Purpose Computations Using GPUs
    See http://gpgpu.org

Detail of a Single Vertex Shader Pipeline

[Diagram blocks: Input Vertex Data, FP32 Scalar Unit, FP32 Vector Unit, Branch Unit, Vertex Texture Fetch, Texture Cache, Primitive Assembly, Viewport Processing, To Setup]

Instancing: What Is It?

Lets the GPU ‘loop’ over vertex buffers:
    Tree Model VB
    Transform Matrices VB

A single draw call generates many instances of an object
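A minimal CPU-side sketch of the idea, not the hardware path: one shared mesh buffer is combined with a per-instance transform buffer inside a single call. The function and variable names are hypothetical, and translation-only transforms stand in for the full transform matrices to keep the example short.

```python
# Conceptual emulation of an instanced draw: one mesh, many transforms.
# All names are hypothetical; transforms are translation-only for brevity.

def draw_instanced(mesh_vertices, instance_offsets):
    """Emulate a single instanced draw call on the CPU.

    mesh_vertices: list of (x, y, z) tuples shared by every instance.
    instance_offsets: list of (dx, dy, dz) translations, one per instance.
    Returns the transformed vertices of all instances, in instance order.
    """
    out = []
    for dx, dy, dz in instance_offsets:   # the 'loop' over the transform VB
        for x, y, z in mesh_vertices:     # re-reads the same model VB each time
            out.append((x + dx, y + dy, z + dz))
    return out


tree = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
offsets = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 0.0, 5.0)]
forest = draw_instanced(tree, offsets)    # one call, three trees
```

The application issues one call regardless of how many entries the offsets list holds, which is the property that reduces CPU-side draw-call overhead.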

Instancing Demo

Complex lighting, post-processing
Simple CPU collision

Instancing Advantages

Most flexible and has the least
    Draw calls
    Memory overhead
    CPU/Bus overhead

Alternatives:
    One draw call / instance, change state in-between
    Static batching (static pre-transformed VB)
    Dynamic batching (dynamic 2 stream instancing)
    Vertex constant instancing

See ‘Instancing’ code sample and whitepaper:
http://download.nvidia.com/developer/SDK/Individual_Samples/samples.html

But…

Multiple vertex streams

GPU does extra work

Vertex sizes are larger
    Transform matrix is a per vertex attribute

Attribute Bound

Extra data fetched per instance
    Explains slowdown

Vertex cache optimize
    Cache hit saves all vertex work
    Including attribute access

Pack input attributes as tightly as possible
    Even if vertex shader work required to unpack
    Move constants or derivables out of attributes
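As an illustration of tight packing, the sketch below stores a unit normal in 4 bytes (three signed bytes plus one pad byte) instead of three 32-bit floats, and shows the unpacking work a vertex shader would then have to do. The byte layout and function names are assumptions made for this example, not a format the slides prescribe.

```python
import struct

def pack_normal(n):
    """Pack a unit normal (components in [-1, 1]) into 4 bytes."""
    q = [max(-127, min(127, round(c * 127.0))) for c in n]
    return struct.pack("4b", q[0], q[1], q[2], 0)   # last byte is padding

def unpack_normal(data):
    """Inverse of pack_normal; this is the extra 'unpack' work per vertex."""
    x, y, z, _ = struct.unpack("4b", data)
    return (x / 127.0, y / 127.0, z / 127.0)

original = (0.0, 0.6, -0.8)
packed = pack_normal(original)
restored = unpack_normal(packed)
unpacked_size = struct.calcsize("3f")   # size of three 32-bit floats
```

Here the attribute shrinks from 12 bytes to 4 at the price of a small quantization error and one multiply per component.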

Instancing Performance

[Chart: Instancing Method Comparison, 28 poly mesh. X axis: # polys (2800, 28000, 140000, 280000, 560000). Y axis: FPS relative to HW instancing, 0% to 140% (% is relative to HW instancing in each group). Series: Single Draw Calls, Dynamic 2 Stream Instancing, Static 2 Stream Instancing, VS Constant Instancing, Hardware Instancing, Static Pretransformed VB]

[Chart: Another View, FPS per polys, 28 poly mesh. X axis: # polys (1000 to 1000000). Y axis: FPS (ticks at 1, 10, 100, 1000). Same six series]

Vertex Shader 3.0: Flow Control

Vertex flow control near optimal:
    Branch instructions have fixed ~1 cycle overhead
    Divergence is full speed (MIMD)

Vertex branching is a win
    Except for short branches
    Compiler/Driver decides

Use branches and loops to
    Consolidate batches
    Skip over unnecessary work

Example:

Single unified v-shader for 1, 2, 3, and 4 bone skinning
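A rough sketch of the unified skinning idea in Python rather than shader code: the bone count becomes a loop bound, so one routine covers the 1, 2, 3, and 4 bone cases that would otherwise need separate shaders and separate batches. The data layout is hypothetical, and bones are translation-only here to keep the arithmetic easy to follow.

```python
def skin_vertex(position, bone_offsets, indices, weights, num_bones):
    """Blend a position by num_bones bones (1 to 4).

    bone_offsets: per-bone translations (a stand-in for bone matrices).
    indices, weights: per-vertex bone indices and blend weights.
    """
    x = y = z = 0.0
    for i in range(num_bones):        # loop count replaces four shader variants
        ox, oy, oz = bone_offsets[indices[i]]
        w = weights[i]
        x += w * (position[0] + ox)
        y += w * (position[1] + oy)
        z += w * (position[2] + oz)
    return (x, y, z)

bones = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
one_bone = skin_vertex((1.0, 1.0, 1.0), bones, [1], [1.0], 1)
two_bone = skin_vertex((1.0, 1.0, 1.0), bones, [0, 1], [0.5, 0.5], 2)
```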

Vertex Texture Fetch (VTF)

Mipmapped texture fetches from vertex:
    Only R32f and R32G32B32A32f formats
    Only point-sampling
    Up to 4 different texture stages
    Sample as often as you like

Large latency
    Equivalent to 20-30 instructions

Cover the Latency

Latency means you can ‘hide’ other ops in it
    For free
    Compiler/driver does this for you if possible

Branch over VTF if possible

Dependent VTFs are slow
    Less chance to hide latency

texldl r0, v0, sampler0
mul r1, v1, c0   // stuff not depending on vtf result
…
add r1, r1, r0   // use vtf result for the first time
…

Vertex Texture Fetch Performance

GeForce 6800 capable of peak 600 MVerts / s
    Minimalist (err, read no) work per vertex

Max with a single VTF: 33 MVerts / s
    Not all vertices in frame need to be displaced
    1 Million displaced vertices @ 33 fps!

Do not use as general constant memory replacement
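The throughput figures above reduce to simple arithmetic, sketched here with hypothetical helper names. It only restates the slide's numbers; it is not a performance model.

```python
def displaced_fps(vtf_verts_per_second, displaced_verts_per_frame):
    """Frame rate if vertex texture fetch throughput is the only limit."""
    return vtf_verts_per_second / displaced_verts_per_frame

fps = displaced_fps(33e6, 1e6)   # 33 MVerts/s over 1 million displaced vertices
slowdown = 600e6 / 33e6          # peak rate vs. rate with a single VTF
```

The second figure (roughly 18x) gives a sense of why a texture fetch is a poor substitute for constant memory.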

Early Z and Stencil Cull

Cull pixels that (will) fail depth/stencil tests before entering pixel-shader

For maximum z-cull:
    Render roughly front to back
    Or even better: render z-only pass before normal rendering
    Do stencil-only passes for other cull tricks

Things That Disable Z Culling

Changing depth-test direction
    For example, less-equal to greater-equal
    Only resets on clear

Z-Cull Uses Highly Compressed Z-Rep

[Figure: ‘Good’ vs. ‘Bad’ occluder examples]

Triangles with holes (alpha test/texkill/clip planes) are not occluding

Small triangles are bad occluders
    ‘Small’ ~= less than 4x4 pixels
    Z-cull may not recognize triangle as occluder

Things That Disable Stencil Culling

Changing stencil function, reference, or mask
    Only resets on clear

Writing stencil while rejecting based on stencil
    Write stencil in separate pass from rejecting color/z

Stencil Cull Example

1. Render light volume with color write disabled
    Depth func = LESS, Stencil func = ALWAYS
    Stencil Z-FAIL = REPLACE (with value X)
    Rest of stencil ops set to KEEP

2. Render with lighting shader
    Depth func = ALWAYS, Stencil func = EQUAL, all ops = KEEP, Stencil ref = X
    Unlit pixels will be culled because stencil does not match reference value
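The two passes can be walked through per pixel with a small simulation. This is a simplified sketch: it assumes at most one light-volume fragment per pixel, uses an arbitrary reference value for X, and all names are hypothetical.

```python
X = 1  # stencil reference value; any non-zero value works in this sketch

def stencil_pass(scene_depth, volume_depth):
    """Pass 1: light volume, color writes off, depth func LESS,
    stencil func ALWAYS, z-fail op REPLACE (with X), other ops KEEP."""
    stencil = [0] * len(scene_depth)
    for i, (scene_z, vol_z) in enumerate(zip(scene_depth, volume_depth)):
        if vol_z is None:
            continue                  # volume does not cover this pixel
        if not (vol_z < scene_z):     # depth test LESS fails
            stencil[i] = X            # z-fail: REPLACE with X
    return stencil

def lighting_pass(stencil):
    """Pass 2: stencil func EQUAL against X; True means the pixel is shaded."""
    return [s == X for s in stencil]

scene = [0.5, 0.5, 0.9, 0.2]          # depth already in the z-buffer
volume = [0.7, 0.3, None, 0.2]        # depth of the light-volume fragment
lit = lighting_pass(stencil_pass(scene, volume))
```

Pixels whose stencil value does not equal X are rejected before the lighting shader runs, which is where stencil cull saves work.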

Fast Z-Only Rendering

GeForce FX and 6 Series render z/stencil at double speed!

Important for dynamic shadow maps!
    Makes z-first/only pass (for z-cull benefits) attractive

Only enabled if:
    No color-writes
    Pixel shaders disabled (no depth replace, no texkill)
    Alpha test/color key disabled
    8-bit/component color buffer bound (not float)
    No user clip planes
    No AA
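The conditions above work well as a debug-time checklist. The sketch below encodes them as a predicate over a render-state dictionary; the dictionary keys are invented for this example and do not correspond to any real API.

```python
def z_only_fast_path(state):
    """Return True if every listed condition for double-speed z holds.

    state: dict describing current render state (hypothetical keys).
    """
    checks = [
        not state["color_writes"],
        not state["pixel_shader"],               # no depth replace, no texkill
        not state["alpha_test"],
        not state["color_key"],
        state["color_buffer_bits_per_component"] == 8,   # not a float target
        not state["user_clip_planes"],
        not state["antialiasing"],
    ]
    return all(checks)

shadow_pass = {
    "color_writes": False,
    "pixel_shader": False,
    "alpha_test": False,
    "color_key": False,
    "color_buffer_bits_per_component": 8,
    "user_clip_planes": False,
    "antialiasing": False,
}
alpha_tested_pass = dict(shadow_pass, alpha_test=True)
```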

Pixel Shader 3.0 Performance

What is Pixel Shader 3.0?

3.0 shaders help both CPU and GPU bottlenecks
    Consolidate draw calls / passes (über-shaders)
    Early-outs with dynamic branching

Gory performance details of particular pixel shader 3.0 features

Detail of a Single Pixel Shader Pipeline

[Diagram blocks: Input Fragment Data, FP Texture Processor, Texture Cache, Texture Data, FP32 Shader Unit 1, FP32 Shader Unit 2, Branch Processor, Fog ALU, Output Shaded Fragments]

Overall
    SIMD architecture
    Co-issue
    FP32 computation
    Shader Model 3.0

Shader Unit 1
    4 FP ops / pixel
    Co-issue
    Texture address calc
    Free fp16 normalize
    + mini ALU

Texture Filter
    Bi / Tri / Aniso
    1 texture @ full speed
    4 tap filter @ full speed
    16:1 aniso w/ trilinear
    FP16 texture filtering

Shader Unit 2
    4 FP ops / pixel
    Co-issue
    + mini ALU

Half (fp16) Performance

Half (fp16) still matters!
    Critical for GeForce FX performance

Reduces register pressure

Better able to hide texture latency

Fast fp16 normalize

Compiler/driver can NOT help you with this

GeForce 6 Single Cycle Normalize()

Pixel shader unit has single-cycle normalize

Caveat: only for 3-component 16-bit float values

float3 f3;
half3 h3;
half4 h4;

f3 = normalize(f3);          // slow: dp3/rsq/mul
h3 = normalize(h3);          // fast: nrmh
h4 = normalize(h4);          // slow: dp4/rsq/mul
h4.xyz = normalize(h4.xyz);  // fast: nrmh

GeForce 6 Superscalar Execution

Executes multiple instructions simultaneously

For example, in a single cycle you can execute
    Two 2-vector instructions, or
    One 3-vector and one scalar instruction
    Plus, there are 2 math units per shader pipe

Use swizzle / write masks to help compiler

half4 A, B;
A.w = sin(A.w);          // A = sin(A.w) not enough
A.xyz = A.xyz * B.xyz;

GeForce 6 Series Co-Issue

2 different instructions executing in the same cycle in the same shader unit

2 separate shader units

4 instructions/pixel/cycle

[Diagram: Shader Unit 1 and Shader Unit 2, each with its R, G, B, A channels split between two operations (Operation 1 and Operation 2, Operation 3 and Operation 4)]

Flow Control Performance Overview

Flow control instruction costs:

Instruction          Cost (Cycles)
if / endif           4
if / else / endif    6
call                 2
ret                  2
loop / endloop       4

Not free, but useful

Additional costs when pixels diverge (more later)

Looping Costs

DirectX ps.3.0 supports only static loops
    Unrolling is faster
    Compiler/driver can do that for you

Nonetheless useful because
    Reduces high-level code-complexity
    Reduces passes

Multiple lights in a single pass can be a big win
    Number of lights unknown at compile time

Reduces proliferation of pre-compiled shaders
    Thousands of shaders from just a few templates

Overcomes DirectX’s 512 static instruction limit

Branching Costs

Branching can provide substantial boost
    If able to skip > 6 instruction cycles, and
    If the branch condition is coherent

Noisy branch conditions cause performance loss
    Potentially worse than taking both branches all the time

[Figure: coherent vs. incoherent branch patterns]
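A back-of-the-envelope sketch of the break-even point, using the flow-control instruction costs quoted in this deck (4 cycles for if / endif, 6 for if / else / endif). The model assumes a perfectly coherent branch and ignores divergence penalties, so treat it as a first estimate only; the function names are made up for the example.

```python
IF_ENDIF_COST = 4        # cycles, from the flow-control cost table
IF_ELSE_ENDIF_COST = 6   # cycles, from the flow-control cost table

def cycles_with_branch(skippable_cycles, fraction_skipping,
                       overhead=IF_ENDIF_COST):
    """Average cycles per pixel when a coherent branch guards some work.

    skippable_cycles: cost of the guarded code when it does run.
    fraction_skipping: share of pixels (0..1) that skip the guarded code.
    """
    return overhead + (1.0 - fraction_skipping) * skippable_cycles

def branch_pays_off(skippable_cycles, fraction_skipping,
                    overhead=IF_ENDIF_COST):
    """True if branching beats always executing the guarded code."""
    branched = cycles_with_branch(skippable_cycles, fraction_skipping, overhead)
    return branched < skippable_cycles
```

For instance, guarding 20 cycles of work that half the pixels skip averages 14 cycles, while guarding only 4 cycles never recovers the branch overhead.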

How Coherent Do I Have To Be?

GPU has hundreds of pixels in flight

Best if coherent over regions of > ~1000 pixels
    That’s only ~30x30!

You need to experiment in your own application

Soft shadow demo shows:
    Incoherent branches on a small portion of the screen are still a big win
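One way to experiment is to dump the branch condition as a boolean mask and count how many screen tiles are uniform. The sketch below assumes square, screen-aligned tiles, which is a simplification chosen for this example and not a statement about how the hardware groups pixels; a tile edge of about 30 would match the ~30x30 figure above.

```python
def coherent_tile_fraction(mask, tile):
    """Fraction of tile x tile blocks in which every pixel agrees.

    mask: list of rows of booleans (the per-pixel branch condition).
    tile: tile edge length in pixels.
    """
    height, width = len(mask), len(mask[0])
    total = uniform = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            values = {mask[y][x]
                      for y in range(ty, min(ty + tile, height))
                      for x in range(tx, min(tx + tile, width))}
            total += 1
            uniform += len(values) == 1   # a single value means coherent
    return uniform / total

# 4x4 mask: left half takes the branch, right half does not
half_mask = [[x < 2 for x in range(4)] for _ in range(4)]
```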

Combine Branching With Others

Back face register (vFace)
    Shade front faces differently from back faces

Position register (vPos)
    Shade based on position
    For example, skip or simplify distant pixels

Early out:
    If in shadow, don’t do lighting computations
    If out of range (attenuation zero), don’t light
    Applies to vs.3.0 as well

Soft Shadow Demo

How Soft Shadow Demo Works

Takes 8 test samples from shadow map
    If all 8 in shadow or all 8 in the light then done
    If on the edge (some in shadow/some in light)
    Do 56 more samples for additional quality

64 samples at much lower cost!
    Quick-and-dirty importance sampling

Dynamic sampling > 2x faster
    Vs. 64 samples everywhere
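The sampling strategy described above can be sketched on the CPU as follows. The sample callback and the offset lists are placeholders invented for the example; the real demo runs this logic in a pixel shader with dynamic branching.

```python
def soft_shadow(sample_lit, test_offsets, extra_offsets):
    """Adaptive shadow sampling: a few test samples, more only on edges.

    sample_lit(offset) -> True if that shadow-map sample is lit.
    Returns (light_fraction, samples_taken).
    """
    test = [sample_lit(o) for o in test_offsets]
    lit = sum(test)
    if lit == 0 or lit == len(test):
        # All test samples agree: fully shadowed or fully lit, early out.
        return (float(lit > 0), len(test))
    # Penumbra: refine with the remaining samples.
    extra = [sample_lit(o) for o in extra_offsets]
    total = len(test) + len(extra)
    return ((lit + sum(extra)) / total, total)

TEST_OFFSETS = list(range(8))        # stand-ins for 8 jittered offsets
EXTRA_OFFSETS = list(range(8, 64))   # stand-ins for the 56 extra offsets
```

Only pixels near a shadow edge pay for all 64 samples; everywhere else the cost stays at 8.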

Hardware Shadow Maps

In DirectX:
    Render to a depth format texture (D3DFMT_D24X8, D3DFMT_D16)
    Use tex2Dproj to sample
    Shadow map comparison happens automatically

In OpenGL:
    Render to DEPTH_COMPONENT texture
    Use TEXTURE_COMPARE_MODE_ARB with COMPARE_R_TO_TEXTURE

Hardware Shadow Map Performance

Shadow map comparison is free (full speed)
    No need to compare and filter in the shader
    If bilinear state is on, you get percentage closer filtering of the 4 nearest texels

Use single tap for performance
    Quality roughly equivalent to 4-tap PCF R32F

Use multiple taps for higher quality
    4-tap HW shadow map roughly as fast as 4-tap manual-PCF R32F
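For reference, this is roughly what percentage closer filtering of the 4 nearest texels computes: each texel is compared against the reference depth first, and the comparison results (not the depths) are then bilinearly blended. The sketch uses integer texel coordinates and a less-or-equal comparison as simplifying assumptions; it illustrates the math, not the hardware implementation.

```python
def pcf_bilinear(depth_map, u, v, ref_depth):
    """Percentage closer filter over the 4 texels around (u, v).

    depth_map: rows of stored light-space depths.
    u, v: sample position in texel units (texel x sits at integer x).
    ref_depth: depth of the shaded point as seen from the light.
    Returns the lit fraction in [0, 1].
    """
    x0, y0 = int(u), int(v)
    fx, fy = u - x0, v - y0

    def lit(x, y):
        # Compare first; filtering happens on these 0/1 results.
        return 1.0 if ref_depth <= depth_map[y][x] else 0.0

    top = lit(x0, y0) * (1 - fx) + lit(x0 + 1, y0) * fx
    bottom = lit(x0, y0 + 1) * (1 - fx) + lit(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy

# Left column is far from the light (lit), right column holds an occluder.
shadow_map = [[1.0, 0.2],
              [1.0, 0.2]]
```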

Hardware Shadow Map Fallback

Possible to use R32F or R16F shadow maps
    Render depth to single-channel float texture in shader
    Multiple jittered samples for high quality / soft edges

Easy to maintain hardware shadow maps and R32F/R16F code paths:
    Same setup and pipeline as any shadow map technique
    HW shadow map shader code simpler and faster
    HW shadow maps buy speed or quality (or both)

Texture Instruction Performance

texldb (scalar LOD bias): Full speed

texldl (explicit scalar LOD selection): Full speed
    Hardware need not calculate derivatives for LOD
    Possible to dynamically branch over these instructions

texldd (gradient-based LOD selection): Factor 10 slower!
    But when you need to use this, you need to use this

Floating Point Texture Performance

Prefer 64bpp float textures and render targets
    Half the bandwidth of 128bpp (fp32) textures

More importantly: double cache coherence
    Poor cache coherence destroys performance
    Fp16 textures 2x faster than fp32 if texture bound

Also important: efficient channel allocation
    Use R32F buffers for scalar data, and R16G16F for 2-vectors
    Double cache coherence again!
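The bandwidth and cache arguments come down to bytes per texel, as this small calculation shows. The cache line size used below is a made-up figure purely for illustration; only the ratios matter.

```python
def texel_bytes(channels, bits_per_channel):
    """Storage size of one texel in bytes."""
    return channels * bits_per_channel // 8

fp16_rgba = texel_bytes(4, 16)   # 64 bpp
fp32_rgba = texel_bytes(4, 32)   # 128 bpp
r32f = texel_bytes(1, 32)        # scalar data
rg16f = texel_bytes(2, 16)       # 2-vector data

def texels_per_cache_line(bytes_per_texel, line_bytes=64):
    """How many texels fit in one cache line (line_bytes is hypothetical)."""
    return line_bytes // bytes_per_texel
```

Halving the texel size doubles the number of neighboring texels that arrive with each cache line, which is the "double cache coherence" effect.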

Common Sense Texture Performance

Use mipmaps
    GPU fetches local neighborhood for each texel

Sharper/Crisper textures
    Use anisotropic filtering
    Use better mipmap generation (use texture tools)
    Do NOT use LOD bias
    LOD bias is slower and lower quality

Normal Maps

Use D3DFMT_V8U8 or DXT5
    To store x and y
    Derive z in shader

Simon Green’s normal map compression paper
    Compares quality of variety of formats
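Deriving z relies on the normal being unit length, so z = sqrt(1 - x^2 - y^2). The sketch assumes z is non-negative, as is typical for tangent-space normal maps, and clamps the radicand so quantization error cannot produce a negative value; the function name is illustrative.

```python
import math

def reconstruct_normal(x, y):
    """Rebuild a unit normal from its stored x and y components (z >= 0)."""
    z_squared = max(0.0, 1.0 - x * x - y * y)   # clamp against quantization error
    return (x, y, math.sqrt(z_squared))

flat = reconstruct_normal(0.0, 0.0)
tilted = reconstruct_normal(0.6, 0.0)
```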

Multiple Render Targets

MRTs useful for reducing rendering passes
    When you need to output more than a single 4-vector

Deferred shading, particle physics, GPGPU algorithms
    Replaces up to four passes with one

But MRT is not free
    High bandwidth cost, especially with float formats
    Small overhead per target rendered
    GeForce 6 has a sweet spot of 3 render targets (RTs)

Split 6 passes into two 3-RT passes
    Not one 4-RT pass and one 2-RT pass

Other Render Target Advice

Do not render entire scene to a texture
    Not getting AA
    If user turns on control panel AA, hard to detect

Instead, render to back buffer, then StretchRect
    Drivers give performance priority to back buffer
    Ahead of texture surfaces
    AA works with back buffer

Full Screen Effects

Use scissor rects to restrict rendering
    Light bounds, etc.

Do not use full screen quads
    Use full-screen triangles with scissor rect instead
    Completely avoids inefficient diagonals
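A single oversized triangle can cover the whole screen, so there is no interior diagonal edge as there is in a two-triangle quad. The sketch below uses one common choice of vertices in normalized device coordinates (an assumption, since the slides do not give coordinates) and verifies that every screen corner falls inside the triangle.

```python
# One common full-screen triangle in normalized device coordinates.
FULLSCREEN_TRIANGLE = [(-1.0, -1.0), (3.0, -1.0), (-1.0, 3.0)]
SCREEN_CORNERS = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]

def inside_triangle(p, tri):
    """True if 2D point p lies inside or on the boundary of triangle tri."""
    def edge(a, b, c):
        # Signed area test: which side of edge a->b the point c is on.
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

    a, b, c = tri
    d0, d1, d2 = edge(a, b, p), edge(b, c, p), edge(c, a, p)
    has_negative = d0 < 0 or d1 < 0 or d2 < 0
    has_positive = d0 > 0 or d1 > 0 or d2 > 0
    return not (has_negative and has_positive)

covers_screen = all(inside_triangle(c, FULLSCREEN_TRIANGLE)
                    for c in SCREEN_CORNERS)
```

The parts of the triangle that extend past the screen are clipped, and a scissor rect can narrow rendering further to the region that actually needs the effect.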

Floating Point Blending

GeForce FX needs to emulate float blending
    Using “ping-pong buffer”
    Lots of context switches and additional passes
    Blending, e.g., lots of particles becomes infeasible

But fp16 is 2x bandwidth vs. A8R8G8B8

Increased Read Back Performance

Pre-GeForce 6
    Best case, < 200 MB/s, all chipsets
    Only PCI cycles used to write back to host memory

GeForce 6800 (AGP)
    600 MB/s - 1.0 GB/s, depending on AGP chipset

PCI-E Workstation boards
    1.0 GB/s on Quadro FX 4400
    Up to 2.4 GB/s on Quadro FX 1400

Read Back Still a BAD Idea

Read back still synchronizes CPU and GPU

CPU stalls until GPU finishes all rendering
    Can you afford wasting precious CPU cycles?

GPU pipeline drains completely and becomes idle

Memory Allocation

Order of resource allocation affects performance

Allocate render targets first
    Sort order by pitch (bpp * width)
    Sort pitch groups by frequency of use (most used first)

Then create vertex and pixel shaders

Load / create remaining textures
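One possible reading of the render-target ordering rule, sketched as a sort: targets are grouped by pitch, the groups are ordered by how often they are used, and targets inside a group are ordered the same way. The slide does not spell out these details, so the grouping policy, the dictionary keys, and the use counts are all assumptions for illustration.

```python
from collections import defaultdict

def allocation_order(render_targets):
    """Suggest a creation order for render targets.

    render_targets: dicts with hypothetical keys name, width, bpp, uses.
    Returns the target names, most-used pitch group first.
    """
    groups = defaultdict(list)
    for rt in render_targets:
        groups[rt["bpp"] * rt["width"]].append(rt)   # pitch = bpp * width

    ordered_groups = sorted(groups.values(),
                            key=lambda g: sum(rt["uses"] for rt in g),
                            reverse=True)
    order = []
    for group in ordered_groups:
        group.sort(key=lambda rt: rt["uses"], reverse=True)
        order.extend(rt["name"] for rt in group)
    return order

targets = [
    {"name": "a", "width": 1024, "bpp": 32, "uses": 10},
    {"name": "b", "width": 512, "bpp": 64, "uses": 3},    # same pitch as "a"
    {"name": "c", "width": 256, "bpp": 32, "uses": 20},
]
```

After the render targets are created in this order, shaders and the remaining textures would follow, as listed above.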

Conclusion

Lots of new/fast features
    Instancing, vs.3.0 flow control, vertex texture fetch
    Z-/Stencil-cull, fast z-only
    Fast normalize, ps.3.0 flow control
    Hardware shadow maps, fp16 blending

With some sneaky gotchas

Use these features to attack bottlenecks
    CPU
    Pixel shaders
    ...

Questions?

NVIDIA GPU Programming Guide:
http://developer.nvidia.com/object/gpu_programming_guide.html

Matthias Wloka ([email protected])

http://developer.nvidia.com
