The Source for GPU Programming
developer.nvidia.com: Latest News, Developer Events Calendar, Technical Documentation, Conference Presentations, GPU Programming Guide, Powerful Tools, SDKs, and more...
Join our FREE registered developer program for early access to NVIDIA drivers, cutting edge tools, online support forums, and more!
GeForce 6 Series Performance
Matthias Wloka
Developer Technology
GeForce 6 Series Specific Performance
Instancing
Vertex- and Pixel-Shaders 3.0
  Branching and Looping
  Vertex Texture Fetch
Hardware Shadow Maps
Z- and Stencil-Cull
FP16 Filter and Blend, MRTs
Marketing Speak Translation
SM3, i.e., “Shader Model 3” hardware
Sometimes shorthand for “Every GeForce 6 feature not in GeForce FX”
  Not just VS/PS 3.0
  See previous slide!
GeForce 6200 does not support fp16 filter/blend
  Okay, because ‘value’ cards lack the memory bandwidth to use fp16 render targets
Simplified Graphics Pipeline
[Diagram: CPU, Geometry Storage, Geometry Processor, Rasterizer, Fragment Processor, Frame Buffer, plus Texture Storage + Filtering]
Common bottlenecks: CPU, fragment processor
New features help address these bottlenecks:
  Instancing
  Vertex Shader 3.0
  Pixel Shader 3.0
  Fp16 Filter
  Shadow Maps
  Fp16 Blend
  MRT
  Z/Stencil Cull
CPU Bottleneck Getting Worse
Courtesy Ian Buck, Stanford University
Explicitly Address CPU Bottleneck
Reduce draw calls
  Budget/Design for your draw calls!
  Use instancing to reduce batches
  Use über-shaders to eliminate batches/passes
  Use fp16 blending to eliminate passes
Move more computations to GPU:
  GPGPU: General-Purpose Computations Using GPUs
  See http://gpgpu.org
Detail of a Single Vertex Shader Pipeline
[Diagram blocks: Input Vertex Data, FP32 Scalar Unit, FP32 Vector Unit, Branch Unit, Vertex Texture Fetch, Texture Cache, Primitive Assembly, Viewport Processing, To Setup]
Instancing: What Is It?
Let the GPU ‘loop’ over vertex buffers:
  Tree Model VB
  Transform Matrices VB
Single draw call generates many instances of object
Instancing Demo
Complex lighting, post-processing
Simple CPU collision
Instancing Advantages
Most flexible and has the least:
  Draw calls
  Memory overhead
  CPU/Bus overhead
Alternatives:
  One draw call / instance, change state in-between
  Static batching (static pre-transformed VB)
  Dynamic batching (dynamic 2 stream instancing)
  Vertex constant instancing
See ‘Instancing’ code sample and whitepaper:
http://download.nvidia.com/developer/SDK/Individual_Samples/samples.html
But…
Multiple vertex streams
GPU does extra work
Vertex sizes are larger
  Transform matrix is a per vertex attribute
Attribute Bound
Extra data fetched per instance
  Explains slowdown
Vertex cache optimize
  Cache hit saves all vertex work:
  Including attribute access
Pack input attributes as tightly as possible
  Even if vertex shader work required to unpack
  Move constants or derivables out of attributes
Instancing Performance
[Chart: Instancing Method Comparison, 28 poly mesh (note: % is relative to HW instancing in each group). Y axis: FPS relative to HW instancing, 0% to 140%. X axis: # polys, at 2800, 28000, 140000, 280000, 560000. Series: Single Draw Calls, Dynamic 2 Stream Instancing, Static 2 Stream Instancing, VS Constant Instancing, Hardware Instancing, Static Pretransformed VB]
[Chart: Another View, FPS per polys, 28 poly mesh. Y axis: FPS, log scale from 1 to 1000. X axis: # polys, log scale from 1000 to 1000000. Series: Single Draw Calls, Dynamic 2 Stream Instancing, Static 2 Stream Instancing, VS Constant Instancing, Hardware Instancing, Static Pretransformed VB]
Vertex Shader 3.0: Flow Control
Vertex flow control near optimal:
  Branch instructions have fixed ~1 cycle overhead
  Divergence is full speed (MIMD)
Vertex branching is a win
  Except for short branches
  Compiler/Driver decides
Use branches and loops to
  Consolidate batches
  Skip over unnecessary work
Example:
Single unified v-shader for 1, 2, 3, and 4 bone skinning
Vertex Texture Fetch (VTF)
Mipmapped texture fetches from vertex:
  Only R32f and R32G32B32A32f formats
  Only point-sampling
  Up to 4 different texture stages
  Sample as often as you like
Large latency
  Equivalent to 20-30 instructions
Cover the Latency
Latency means you can ‘hide’ other ops in it
  For free
  Compiler/driver does this for you if possible
Branch over VTF if possible
Dependent VTFs are slow
  Less chance to hide latency
  texldl r0, v0, sampler0
  mul r1, v1, c0   // stuff not depending on vtf result
  …
  add r1, r1, r0   // use vtf result for the first time
  …
Vertex Texture Fetch Performance
GeForce 6800 capable of peak 600 MVerts / s
  Minimalist (err, read no) work per vertex
Max with a single VTF: 33 MVerts / s
  Not all vertices in frame need to be displaced
  1 Million displaced vertices @ 33 fps!
Do not use as general constant memory replacement
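The "1 Million displaced vertices @ 33 fps" figure is plain frame-budget arithmetic, and a short sketch makes it easy to redo for other frame rates. Only the 33 and 600 MVerts/s rates come from the slide; the function name and the Python framing are illustrative assumptions.

```python
def verts_per_frame(verts_per_second: float, fps: float) -> float:
    """Vertices that fit in one frame at a given throughput and frame rate."""
    return verts_per_second / fps


VTF_RATE = 33e6    # from the slide: max rate with a single VTF per vertex
PEAK_RATE = 600e6  # from the slide: peak rate with no per-vertex work

print(verts_per_frame(VTF_RATE, 33))  # 1000000.0
print(verts_per_frame(VTF_RATE, 60))  # budget at a 60 fps target
```

The same budget shrinks further if a vertex needs more than one fetch, which is one reason not every vertex in the frame should be displaced.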
Early Z and Stencil Cull
Cull pixels that (will) fail depth/stencil tests before entering pixel-shader
For maximum z-cull:
  Render roughly front to back
  Or even better: render z-only pass before normal rendering
  Do stencil-only passes for other cull tricks
Things That Disable Z Culling
Changing depth-test direction
  For example, less-equal to greater-equal
  Only resets on clear
Z-Cull Uses Highly Compressed Z-Rep
[Figure: ‘Good’ and ‘Bad’ occluder examples]
Triangles with holes (alpha test/texkill/clip planes) are not occluding
Small triangles are bad occluders
  ‘Small’ ~= less than 4x4 pixels
  Z-cull may not recognize triangle as occluder
Things That Disable Stencil Culling
Changing stencil function, reference, or mask
  Only resets on clear
Writing stencil while rejecting based on stencil
  Write stencil in separate pass from rejecting color/z
Stencil Cull Example
1. Render light volume with color write disabled
  Depth func = LESS, Stencil func = ALWAYS
  Stencil Z-FAIL = REPLACE (with value X)
  Rest of stencil ops set to KEEP
2. Render with lighting shader
  Depth Func = ALWAYS, Stencil Func = EQUAL, all ops = KEEP, Stencil Ref = X
  Unlit pixels will be culled because stencil does not match reference value
Fast Z-Only Rendering
GeForce FX and 6 Series render z/stencil at double speed!
Important for dynamic shadow maps!Makes z-first/only pass (for z-cull benefits) attractive
Only enabled if:
  No color-writes
  Disable pixel shaders (no depth replace, no texkill)
  Disable alpha test/color key
  8-bit/component color buffer bound (not float)
  No user clip planes
  No AA
Pixel Shader 3.0 Performance
What is Pixel Shader 3.0?
3.0 shaders help both CPU and GPU bottlenecks
  Consolidate draw calls / passes (über-shaders)
  Early-outs with dynamic branching
Gory performance details of particular pixel shader 3.0 features
Detail of a Single Pixel Shader Pipeline
[Diagram blocks: Input Fragment Data, Texture Data, FP Texture Processor, Texture Cache, FP32 Shader Unit 1, FP32 Shader Unit 2, Branch Processor, Fog ALU, Output Shaded Fragments]
Overall: SIMD architecture, co-issue, FP32 computation, Shader Model 3.0
Shader Unit 1: 4 FP ops / pixel, co-issue, texture address calc, free fp16 normalize, plus mini ALU
Texture Filter: bi / tri / aniso, 1 texture @ full speed, 4 tap filter @ full speed, 16:1 aniso w/ trilinear, FP16 texture filtering
Shader Unit 2: 4 FP ops / pixel, co-issue, plus mini ALU
Half (fp16) Performance
Half (fp16) still matters!
  Critical for GeForce FX performance
Reduces register pressure
Better able to hide texture latency
Fast fp16 normalize
Compiler/driver can NOT help you with this
GeForce 6 Single Cycle Normalize()
Pixel shader unit has single-cycle normalize
Caveat: only for 3-component 16-bit float values
float3 f3;
half3 h3;
half4 h4;
f3 = normalize(f3); // slow: dp3/rsq/mul
h3 = normalize(h3); // fast: nrmh
h4 = normalize(h4); // slow: dp4/rsq/mul
h4.xyz = normalize(h4.xyz); // fast: nrmh
GeForce 6 Superscalar Execution
Executes multiple instructions simultaneously
For example, in a single cycle you can execute
  Two 2-vector instructions, or
  One 3-vector and one scalar instruction
  Plus, there are 2 math units per shader pipe
Use swizzle / write masks to help compiler
half4 A, B;
A.w = sin(A.w);
// A = sin(A.w) not enough
A.xyz = A.xyz * B.xyz;
GeForce 6 Series Co-Issue
2 different instructions executing in the same cycle in the same shader unit
2 separate shader units
4 instructions/pixel/cycle
[Diagram: Shader Unit 1 runs Operation 1 and Operation 2, Shader Unit 2 runs Operation 3 and Operation 4; each pair of operations splits the R, G, B, A channels between them]
Flow Control Performance Overview
Flow control instruction costs:
  Instruction          Cost (Cycles)
  if / endif           4
  if / else / endif    6
  call                 2
  ret                  2
  loop / endloop       4
Not free, but useful
Additional costs when pixels diverge (more later)
Looping Costs
DirectX ps.3.0 supports only static loops
  Unrolling is faster
  Compiler/driver can do that for you
Nonetheless useful because
  Reduces high-level code-complexity
  Reduces passes
Multiple lights in a single pass can be a big win
  Number of lights unknown at compile time
Reduces proliferation of pre-compiled shaders
  Thousands of shaders from just a few templates
Overcomes DirectX’s 512 static instruction limit
Branching Costs
Branching can provide substantial boost
  If able to skip > 6 instruction cycles, and
  If the branch condition is coherent
Noisy branch conditions cause performance loss
  Potentially worse than taking both branches all the time
[Figure: coherent vs. incoherent branch condition patterns]
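A toy cost model shows why a noisy condition can end up slower than computing both sides. The 6-cycle if / else / endif overhead comes from the flow control cost table; everything else is an assumption made for illustration, in particular that pixels run in lockstep groups and that a group pays for a branch side as soon as one of its pixels takes it. The names `group_cost` and `unbranched_cost` are hypothetical.

```python
IF_ELSE_OVERHEAD = 6  # cycles for if / else / endif, from the cost table


def group_cost(conditions, then_cycles, else_cycles):
    """Cycles for one lockstep group of pixels under the simplified model.

    Assumption (not from the slides): the group executes a branch side
    whenever at least one of its pixels takes that side.
    """
    cost = IF_ELSE_OVERHEAD
    if any(conditions):
        cost += then_cycles
    if not all(conditions):
        cost += else_cycles
    return cost


def unbranched_cost(then_cycles, else_cycles):
    """Cycles when both sides are always computed and one result is selected."""
    return then_cycles + else_cycles


coherent = [True] * 8           # every pixel in the group agrees
incoherent = [True, False] * 4  # condition flips from pixel to pixel

print(group_cost(coherent, 20, 20))    # 26
print(group_cost(incoherent, 20, 20))  # 46
print(unbranched_cost(20, 20))         # 40
```

In this model the coherent group skips one side and wins, while the incoherent group pays for both sides plus the branch overhead. Real hardware behavior depends on group size and scheduling, so measure in your own application.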
How Coherent Do I Have To Be?
GPU has hundreds of pixels in flight
Best if coherent over regions of > ~1000 pixels
  That’s only ~30x30!
You need to experiment in your own application
Soft shadow demo shows:
  Incoherent branches on a small portion of the screen are still a big win
Combine Branching With Others
Back face register (vFace)
  Shade front faces differently from back faces
Position register (vPos)
  Shade based on position
  For example, skip or simplify distant pixels
Early out:
  If in shadow, don’t do lighting computations
  If out of range (attenuation zero), don’t light
  Applies to vs.3.0 as well
Soft Shadow Demo
How Soft Shadow Demo Works
Takes 8 test samples from shadow map
  If all 8 in shadow or all 8 in the light then done
  If on the edge (some in shadow/some in light)
  Do 56 more samples for additional quality
64 samples at much lower cost!
  Quick-and-dirty importance sampling
Dynamic sampling > 2x faster
  Vs. 64 samples everywhere
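The sample-count side of the "> 2x faster" claim can be sketched with a few lines of arithmetic. The 8 and 56 sample counts come from the slide; the `edge_fraction` parameter (the share of pixels that land on a shadow edge) and both function names are assumptions, and the model ignores branch overhead and divergence entirely.

```python
TEST_SAMPLES = 8    # from the slide: samples taken for every pixel
EXTRA_SAMPLES = 56  # from the slide: additional samples for edge pixels


def average_samples(edge_fraction: float) -> float:
    """Mean shadow map samples per pixel when a fraction of pixels is on an edge."""
    return TEST_SAMPLES + EXTRA_SAMPLES * edge_fraction


def speedup_vs_full(edge_fraction: float) -> float:
    """Sample-count ratio against taking all 64 samples everywhere."""
    return (TEST_SAMPLES + EXTRA_SAMPLES) / average_samples(edge_fraction)


print(average_samples(0.25))  # 22.0
print(speedup_vs_full(0.25))  # roughly 2.9
```

Under these assumptions the ratio stays above 2 while fewer than 3/7 (about 43%) of the pixels are edge pixels, since 8 + 56p < 32 requires p < 3/7.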
Hardware Shadow Maps
In DirectX:
  Render to a depth format texture (D3DFMT_D24X8, D3DFMT_D16)
  Use tex2Dproj to sample
  Shadow map comparison happens automatically
In OpenGL:
  Render to DEPTH_COMPONENT texture
  Use TEXTURE_COMPARE_MODE_ARB with COMPARE_R_TO_TEXTURE
Hardware Shadow Map Performance
Shadow map comparison is free (full speed)
  No need to compare and filter in the shader
  If bilinear state is on, then percentage closer filtering of 4 nearest texels
Use single tap for performance
  Quality roughly equivalent to 4-tap PCF R32F
Use multiple taps for higher quality
  4-tap HW shadow map roughly as fast as 4-tap manual-PCF R32F
Hardware Shadow Map Fallback
Possible to use R32F or R16F shadow maps
  Render depth to single-channel float texture in shader
  Multiple jittered samples for high quality / soft edges
Easy to maintain hardware shadow maps and R32F/R16F code paths:
  Same setup and pipeline as any shadow map technique
  HW shadow map shader code simpler and faster
  HW shadow maps buy speed or quality (or both)
Texture Instruction Performance
Texldb (scalar LOD bias): Full speed
Texldl (explicit scalar LOD selection): Full speed
  Hardware need not calculate derivatives for LOD
  Possible to dynamically branch over these instructions
Texldd (gradient-based LOD selection): Factor 10 slower!
  But when you need to use this, you need to use this
Floating Point Texture Performance
Prefer 64bpp float textures and render targets
  Half the bandwidth of 128bpp (fp32) textures
More importantly: double cache coherence
  Poor cache coherence destroys performance
  Fp16 textures 2x faster than fp32 if texture bound
Also important: efficient channel allocation
  Use R32F buffers for scalar data, and R16G16F for 2-vectors
  Double cache coherence again!
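To put numbers on the bandwidth argument, the sketch below tabulates bytes per texel for the float formats involved. The format names follow Direct3D 9 naming (where the two-channel fp16 format the slide calls R16G16F is spelled G16R16F), which is an assumption about the intended formats; the helper name is hypothetical, and only the top mip level is counted.

```python
BYTES_PER_TEXEL = {
    "A32B32G32R32F": 16,  # 128 bpp, four fp32 channels
    "A16B16G16R16F": 8,   # 64 bpp, four fp16 channels
    "G16R16F": 4,         # 32 bpp, two fp16 channels
    "R32F": 4,            # 32 bpp, one fp32 channel
}


def texture_bytes(fmt: str, width: int, height: int) -> int:
    """Size of the top mip level only, ignoring padding and mipmaps."""
    return BYTES_PER_TEXEL[fmt] * width * height


print(texture_bytes("A32B32G32R32F", 1024, 1024) // 2**20)  # 16 (MiB)
print(texture_bytes("A16B16G16R16F", 1024, 1024) // 2**20)  # 8 (MiB)
print(texture_bytes("R32F", 1024, 1024) // 2**20)           # 4 (MiB)
```

Halving the bytes per texel also means twice as many texels fit in the same cache footprint, which is the "double cache coherence" point above.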
Common Sense Texture Performance
Use mipmaps
  GPU fetches local neighborhood for each texel
Sharper/Crisper textures
  Use anisotropic filtering
  Use better mipmap generation (use texture tools)
  Do NOT use LOD bias
  LOD bias is slower and lower quality
Normal Maps
Use D3DFMT_V8U8 or DXT5
  To store x and y
  Derive z in shader
Simon Green’s normal map compression paper
  Compares quality of variety of formats
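"Derive z in shader" relies on the normal being unit length, so z follows from x and y. The sketch below shows the math on the CPU side in Python rather than shader code; it assumes a tangent-space normal with non-negative z and x, y already expanded to the [-1, 1] range, and the function name is hypothetical.

```python
import math


def reconstruct_z(x: float, y: float) -> float:
    """Recover z of a unit normal from x and y, assuming z >= 0.

    The clamp guards against quantized inputs whose x*x + y*y
    slightly exceeds 1, which would otherwise give a negative radicand.
    """
    return math.sqrt(max(0.0, 1.0 - x * x - y * y))


print(reconstruct_z(0.0, 0.0))  # 1.0
print(reconstruct_z(0.6, 0.0))  # close to 0.8
```

In a shader the same expression costs a dot product, a subtract, and a square root, which is the trade made for storing only two channels.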
Multiple Render Targets
MRTs useful for reducing rendering passes
  When you need to output more than single 4-vector
Deferred shading, particle physics, GPGPU algorithms
  Replaces up to four passes with one
But MRT is not free
  High bandwidth cost, especially with float formats
  Small overhead per target rendered
  GeForce 6 has a sweet spot of 3 render targets (RTs)
Split 6 passes into 2 3-RT passes
  Not 1 4-RT pass and 1 2-RT pass
Other Render Target Advice
Do not render entire scene to a texture
  Not getting AA
  If user turns on control panel AA, hard to detect
Instead, render to back buffer, then StretchRect
  Drivers give performance priority to back buffer
  Ahead of texture surfaces
  AA works with back buffer
Full Screen Effects
Use scissor rects to restrict rendering
  Light bounds, etc.
Do not use full screen quads
  Use full-screen triangles with scissor rect instead
  Completely avoids inefficient diagonals
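One common way to build such a full-screen triangle is to oversize it so that its hypotenuse passes through the far screen corner. The clip-space coordinates below are that widely used convention, not something the slides specify, and the helper names are hypothetical. The sketch checks that the single triangle covers all four corners of the [-1, 1] screen square.

```python
def edge(a, b, p):
    """Edge function: non-negative when p is on or to the left of edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covers(tri, p):
    """True when point p lies inside or on a counter-clockwise triangle."""
    a, b, c = tri
    return edge(a, b, p) >= 0 and edge(b, c, p) >= 0 and edge(c, a, p) >= 0


# Assumed convention: one oversized triangle in clip space.
FULLSCREEN_TRI = [(-1.0, -1.0), (3.0, -1.0), (-1.0, 3.0)]
SCREEN_CORNERS = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]

print(all(covers(FULLSCREEN_TRI, p) for p in SCREEN_CORNERS))  # True
```

The parts of the triangle outside the screen are clipped, and a scissor rect narrows the shaded region further, so no diagonal edge ever crosses the visible area.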
Floating Point Blending
GeForce FX needs to emulate float blending
  Using “ping-pong buffer”
  Lots of context switches and additional passes
  Blending, e.g., lots of particles becomes infeasible
But fp16 is 2x bandwidth vs. A8R8G8B8
Increased Read Back Performance
Pre-GeForce 6
  Best case, < 200MB/s, all chipsets
  Only PCI cycles used to write back to host memory
GeForce 6800 (AGP)
  600 MB/s - 1.0 GB/s, depending on AGP chipset
PCI-E Workstation boards
  1.0 GB/s on Quadro FX 4400
  Up to 2.4 GB/s on Quadro FX 1400
Read Back Still a BAD Idea
Read back still synchronizes CPU and GPU
CPU stalls until GPU finishes all rendering
  Can you afford wasting precious CPU cycles?
GPU pipeline drains completely and becomes idle
Memory Allocation
Order of resource allocation affects performance
Allocate render targets first
  Sort order by pitch (bpp * width)
  Sort pitch groups by frequency of use (most used first)
Then create vertex and pixel shaders
Load / create remaining textures
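The render target ordering rule can be sketched as a small sort. The slides give only the two criteria (pitch = bpp * width, then frequency of use), so several details here are assumptions: targets with equal pitch form a group, a group's frequency is the sum of its members' uses, and members inside a group are also ordered most-used first. The dictionary keys and the function name are hypothetical.

```python
from collections import defaultdict


def allocation_order(targets):
    """Return render target names in a suggested allocation order.

    Each target is a dict with keys: name, width, bpp, uses_per_frame.
    """
    groups = defaultdict(list)
    for t in targets:
        groups[t["width"] * t["bpp"]].append(t)  # pitch = bpp * width

    # Assumption: a pitch group's frequency is the total use of its members.
    ordered_groups = sorted(
        groups.values(),
        key=lambda g: sum(t["uses_per_frame"] for t in g),
        reverse=True,
    )
    return [
        t["name"]
        for g in ordered_groups
        for t in sorted(g, key=lambda t: t["uses_per_frame"], reverse=True)
    ]


targets = [
    {"name": "bloom", "width": 256, "bpp": 8, "uses_per_frame": 2},
    {"name": "scene", "width": 1024, "bpp": 4, "uses_per_frame": 10},
    {"name": "blur", "width": 256, "bpp": 8, "uses_per_frame": 4},
    {"name": "shadow", "width": 1024, "bpp": 4, "uses_per_frame": 3},
]
print(allocation_order(targets))  # ['scene', 'shadow', 'blur', 'bloom']
```

Shaders and the remaining textures would then be created after every name in this list has been allocated.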
Conclusion
Lots of new/fast features
  Instancing, vs.3.0 flow control, vertex texture fetch
  Z-/Stencil-cull, fast z-only
  Fast normalize, ps.3.0 flow control
  Hardware shadow maps, fp16 blending
With some sneaky gotchas
Use these features to attack bottlenecks
  CPU
  Pixel shaders
  ...
Questions?
NVIDIA GPU Programming Guide:
http://developer.nvidia.com/object/gpu_programming_guide.html
Matthias Wloka
http://developer.nvidia.com