Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks New Ways to Use the Vertex Shader to Improve Performance
Bill Bilodeau
Developer Technology Engineer, AMD
Topics Covered
● Overview of the DX11 front-end pipeline
● Common bottlenecks
● Advanced Vertex Shader Features
● Vertex Shader Techniques
● Samples and Results
Graphics Hardware
DX11 Front-End Pipeline
● VS – vertex data
● HS – control points
● Tessellator
● DS – generated vertices
● GS – primitives
● Write to UAV at all stages
  ● Starting with DX11.1
[Figure: graphics hardware diagram. Pipeline stages: Input Assembler, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Stream Out, with access to CB, SRV, or UAV. Repeated execution units, each with Vector GPRs (256 2048-bit registers), a Vector ALU (1 64-way single precision operation every 4 clocks), a Scalar ALU (1 operation every 4 clocks), Scalar GPRs (256 64-bit registers), and a vector/scalar cross communication bus.]
Bottlenecks - VS
● VS Attributes
  ● Limit outputs to 4 attributes (AMD)
    ● This applies to all shader stages (except PS)
● VS Texture Fetches
  ● Too many texture fetches can add latency
    ● Especially dependent texture fetches
    ● Group fetches together for better performance
    ● Hide latency with ALU instructions
Bottlenecks - VS
● Use the caches wisely
  ● Avoid large vertex formats that waste pre-VS cache space
  ● DrawIndexed() allows for reuse of processed vertices saved in the post-VS cache
    ● Vertices with the same index only need to get processed once
[Figure: Input Assembler → Pre-VS Cache (hides latency) → Vertex Shader → Post-VS Cache (vertex reuse)]
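The vertex reuse described above can be illustrated with a small CPU-side model. This is only a sketch: the FIFO replacement policy and the cache size are assumptions made for illustration, not a description of any particular GPU, and countVertexShaderRuns is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Counts how many times the vertex shader would run for an index stream,
// assuming a FIFO post-VS cache that holds `cacheSize` processed vertices.
// Both the FIFO policy and the size are illustrative assumptions.
std::size_t countVertexShaderRuns(const std::vector<std::uint32_t>& indices,
                                  std::size_t cacheSize)
{
    assert(cacheSize > 0);
    std::deque<std::uint32_t> cache;
    std::size_t runs = 0;
    for (std::uint32_t index : indices) {
        if (std::find(cache.begin(), cache.end(), index) != cache.end())
            continue;            // hit: reuse the already processed vertex
        ++runs;                  // miss: the vertex shader has to run
        cache.push_back(index);
        if (cache.size() > cacheSize)
            cache.pop_front();   // evict the oldest entry
    }
    return runs;
}
```

For a quad drawn as two indexed triangles (0, 1, 2, 2, 1, 3) this model runs the shader four times, while the same quad drawn without indices needs six unique vertices and six runs.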
Bottlenecks - GS
● GS
  ● Can add or remove primitives
  ● Adding new primitives requires storing new vertices
    ● Going off chip to store data can be a bandwidth issue
  ● Using the GS means another shader stage
    ● This means more competition for shader resources
    ● Better if you can do everything in the VS
Advanced Vertex Shader Features
● SV_VertexID, SV_InstanceID
● UAV output (DX11.1)
● NULL vertex buffer
  ● VS can create its own vertex data
SV_VertexID
● Can use the vertex id to decide what vertex data to fetch
● Fetch from SRV, or procedurally create a vertex
VSOut VertexShader(uint id : SV_VertexID)
{
    // use the vertex id as an index into a buffer bound as an SRV
    float3 vertex = g_VertexBuffer[id];
    …
}
UAV buffers
● Write to UAVs from a Vertex Shader
  ● New feature in DX11.1 (UAV at any stage)
● Can be used instead of stream-out for writing vertex data
  ● Triangle output not limited to strips
    ● You can use whatever format you want
● Can output anything useful to a UAV
NULL Vertex Buffer
● DX11/DX10 allows this
  ● Just set the number of vertices in Draw()
  ● VS will execute without a vertex buffer bound
● Can be used for instancing
  ● Call Draw() with the total number of vertices
  ● Bind mesh and instance data as SRVs
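The instancing idea above can be sketched on the CPU as follows. The two vectors stand in for the mesh and instance SRVs, and the per-instance translation is a hypothetical example of instance data; the talk does not specify the layout. fetchInstancedVertex and totalVertexCount are illustrative names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Float3 { float x, y, z; };

// CPU stand-in for a vertex shader that runs with no vertex buffer bound.
// meshVertices and instanceOffsets play the role of the two SRVs; using a
// translation as the instance data is an assumption for illustration.
Float3 fetchInstancedVertex(std::uint32_t vertexId,
                            const std::vector<Float3>& meshVertices,
                            const std::vector<Float3>& instanceOffsets)
{
    const std::uint32_t vertsPerMesh =
        static_cast<std::uint32_t>(meshVertices.size());
    assert(vertsPerMesh > 0);
    const std::uint32_t instance = vertexId / vertsPerMesh;     // which copy
    const std::uint32_t vertexInMesh = vertexId % vertsPerMesh; // which vertex
    assert(instance < instanceOffsets.size());
    const Float3& v = meshVertices[vertexInMesh];
    const Float3& o = instanceOffsets[instance];
    return { v.x + o.x, v.y + o.y, v.z + o.z };
}

// Vertex count to pass to Draw() so that every instance is covered.
std::uint32_t totalVertexCount(std::uint32_t vertsPerMesh,
                               std::uint32_t instanceCount)
{
    return vertsPerMesh * instanceCount;
}
```

With a 3-vertex mesh and 2 instances, Draw() would be called with 6 vertices, and vertex id 4 resolves to vertex 1 of instance 1.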
Vertex Shader Techniques
● Full Screen Triangle
● Vertex Shader Instancing
  ● Merged Instancing
● Vertex Shader UAVs
Full Screen Triangle
● For post-processing effects
  ● Triangle has better performance than quad
● Fast and easy with VS generated coordinates
  ● No IB or VB is necessary
● Something you should be using for full screen effects
Clip Space Coordinates
(-1, -1, 0)
(-1, 3, 0)
(3, -1, 0)
Full Screen Triangle: C++ code
// Null VB, IB
pd3dImmediateContext->IASetVertexBuffers( 0, 0, NULL, NULL, NULL );
pd3dImmediateContext->IASetIndexBuffer( NULL, (DXGI_FORMAT)0, 0 );
pd3dImmediateContext->IASetInputLayout( NULL );
// Set Shaders
pd3dImmediateContext->VSSetShader( g_pFullScreenVS, NULL, 0 );
pd3dImmediateContext->PSSetShader( … );
pd3dImmediateContext->PSSetShaderResources( … );
pd3dImmediateContext->IASetPrimitiveTopology( D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST );
// Render 3 vertices for the triangle
pd3dImmediateContext->Draw(3, 0);
Full Screen Triangle: HLSL Code
VSOutput VSFullScreenTest(uint id:SV_VERTEXID)
{
    VSOutput output;
    // generate clip space position
    output.pos.x = (float)(id / 2) * 4.0 - 1.0;
    output.pos.y = (float)(id % 2) * 4.0 - 1.0;
    output.pos.z = 0.0;
    output.pos.w = 1.0;
    // texture coordinates
    output.tex.x = (float)(id / 2) * 2.0;
    output.tex.y = 1.0 - (float)(id % 2) * 2.0;
    // color
    output.color = float4(1, 1, 1, 1);
    return output;
}
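To check the arithmetic in the HLSL, the same expressions can be evaluated on the CPU for the three vertex ids. This is a sketch for verification only; FullScreenVertex and fullScreenVertex are illustrative names, not part of the talk's sample.

```cpp
#include <cassert>
#include <cstdint>

struct FullScreenVertex {
    float x, y, z, w; // clip-space position
    float u, v;       // texture coordinates
};

// Same arithmetic as the HLSL vertex shader, evaluated for id = 0, 1, 2.
FullScreenVertex fullScreenVertex(std::uint32_t id)
{
    assert(id < 3);
    FullScreenVertex out;
    out.x = static_cast<float>(id / 2) * 4.0f - 1.0f;
    out.y = static_cast<float>(id % 2) * 4.0f - 1.0f;
    out.z = 0.0f;
    out.w = 1.0f;
    out.u = static_cast<float>(id / 2) * 2.0f;
    out.v = 1.0f - static_cast<float>(id % 2) * 2.0f;
    return out;
}
```

The three ids give (-1, -1), (-1, 3) and (3, -1) with texture coordinates (0, 1), (0, -1) and (2, 1). Interpolated across the triangle this is u = (x + 1) / 2 and v = (1 - y) / 2, so the visible region from -1 to 1 receives texture coordinates from 0 to 1 and the oversized parts of the triangle are clipped.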
VS Instancing: Point Sprites
● Often done on GS, but can be faster on VS
  ● Create an SRV point buffer and bind to VS
  ● Call Draw or DrawIndexed to render the full triangle list.
  ● Read the location from the point buffer and expand to vertex location in quad
  ● Can be used for particles or Bokeh DOF sprites
  ● Don’t use DrawInstanced for a small mesh
Point Sprites: C++ Code
pd3dImmediateContext->IASetIndexBuffer( g_pParticleIndexBuffer, DXGI_FORMAT_R32_UINT, 0 );
pd3dImmediateContext->IASetPrimitiveTopology( D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST );
pd3dImmediateContext->DrawIndexed( g_particleCount * 6, 0, 0);
Point Sprites: HLSL Code
VSInstancedParticleDrawOut VSIndexBuffer(uint id:SV_VERTEXID)
{
    VSInstancedParticleDrawOut output;
    uint particleIndex = id / 4;
    uint vertexInQuad = id % 4;
    // calculate the position of the vertex
    float3 position;
    position.x = (vertexInQuad % 2) ? 1.0 : -1.0;
    position.y = (vertexInQuad & 2) ? -1.0 : 1.0;
    position.z = 0.0;
    position.xy *= PARTICLE_RADIUS;
    position = mul( position, (float3x3)g_mInvView ) + g_bufPosColor[particleIndex].pos.xyz;
    output.pos = mul( float4(position,1.0), g_mWorldViewProj );
    output.color = g_bufPosColor[particleIndex].color;
    // texture coordinate
    output.tex.x = (vertexInQuad % 2) ? 1.0 : 0.0;
    output.tex.y = (vertexInQuad & 2) ? 1.0 : 0.0;
    return output;
}
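The shader decodes id / 4 and id % 4, which implies an index buffer with six indices per particle that reference four vertex ids. The talk does not show how g_pParticleIndexBuffer is filled, so the following is a sketch under that assumption; the corner order inside each triangle is also assumed, and buildParticleIndexBuffer is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a triangle-list index buffer with two triangles per particle.
// Each particle owns four consecutive vertex ids, so the vertex shader can
// recover particleIndex = id / 4 and vertexInQuad = id % 4.
// The corner order (0,1,2) (2,1,3) is an assumption, not from the talk.
std::vector<std::uint32_t> buildParticleIndexBuffer(std::uint32_t particleCount)
{
    assert(particleCount <= 0xFFFFFFFFu / 6u); // index count must fit in 32 bits
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(particleCount) * 6);
    const std::uint32_t quad[6] = { 0, 1, 2, 2, 1, 3 };
    for (std::uint32_t p = 0; p < particleCount; ++p) {
        const std::uint32_t base = p * 4; // first vertex id of this particle
        for (std::uint32_t corner : quad)
            indices.push_back(base + corner);
    }
    return indices;
}
```

The resulting buffer has particleCount * 6 entries, matching the DrawIndexed( g_particleCount * 6, 0, 0 ) call, and each quad's four vertices can be served from the post-VS cache instead of being processed six times.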
Point Sprite Performance
[Figure: bar chart comparing Indexed, Non-Indexed, GS, and DrawInstanced sprite rendering at 500K and 1M sprites on AMD Radeon R9 290X and Nvidia Titan; axis from 0 to 12.]
Point Sprite Performance
● DrawIndexed() is the fastest method
● Draw() is slower but doesn’t need an IB
● Don’t use DrawInstanced() for creating sprites on either AMD or Nvidia hardware
  ● Not recommended for a small number of vertices
Merge Instancing
● Combine multiple meshes that can be instanced many times
  ● Better than normal instancing which renders only one mesh
  ● Instance nearby meshes for smaller bounding box
● Each mesh is a page in the vertex data
  ● Fixed vertex count for each mesh
    ● Meshes smaller than page size use degenerate triangles
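Merge Instancing
[Figure: Mesh Vertex Data holds fixed-length pages (Mesh Data 0, Mesh Data 1, Mesh Data 2, ...). Mesh Instance Data holds one entry per instance with a mesh index (Instance 0: Mesh Index 2, Instance 1: Mesh Index 0, ...). Each fixed-length page lists Vertex 0, Vertex 1, Vertex 2, Vertex 3, ... and is padded with zero vertices (0 0 0) that form degenerate triangles.]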
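Building the paged vertex data can be sketched as below. pageSize and appendMeshPage are hypothetical names; padding with zero vertices follows the diagram, where the unused tail of a page collapses into degenerate (zero-area) triangles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Float3 { float x, y, z; };

// Appends one mesh to the shared vertex data as a fixed-length page.
// Unused slots are filled with zero vertices, so the leftover triangles
// have zero area and produce no pixels.
// pageSize is a hypothetical parameter: it should be a multiple of 3 for a
// triangle list and at least as large as the biggest mesh.
void appendMeshPage(std::vector<Float3>& vertexData,
                    const std::vector<Float3>& mesh,
                    std::size_t pageSize)
{
    assert(pageSize % 3 == 0);
    assert(mesh.size() <= pageSize);
    vertexData.insert(vertexData.end(), mesh.begin(), mesh.end());
    const std::size_t padding = pageSize - mesh.size();
    vertexData.resize(vertexData.size() + padding, Float3{ 0.0f, 0.0f, 0.0f });
}
```

After every mesh has been appended this way, mesh m always starts at offset m * pageSize, which is what makes a simple divide on the vertex id sufficient later.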
Merged Instancing using VS
● Use the vertex ID to look up the mesh to instance
  ● All meshes are the same size, so (id / SIZE) can be used as an offset to the mesh
● Faster than using DrawInstanced()
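The lookup can be sketched on the CPU as follows, assuming the layout from the Merge Instancing diagram: id / SIZE selects the instance, the instance's mesh index selects the page, and id % SIZE selects the vertex inside the page. InstanceData, its offset field, and fetchMergedVertex are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Float3 { float x, y, z; };

// Per-instance record; the translation is a hypothetical example of
// instance data, the mesh index follows the diagram in the slides.
struct InstanceData {
    std::uint32_t meshIndex;
    Float3 offset;
};

// CPU stand-in for the vertex shader lookup with fixed-length pages.
Float3 fetchMergedVertex(std::uint32_t vertexId,
                         const std::vector<Float3>& vertexData,
                         const std::vector<InstanceData>& instances,
                         std::uint32_t pageSize)
{
    assert(pageSize > 0);
    const std::uint32_t instance = vertexId / pageSize;     // which instance
    const std::uint32_t vertexInPage = vertexId % pageSize; // which vertex
    assert(instance < instances.size());
    const InstanceData& inst = instances[instance];
    const std::size_t address =
        static_cast<std::size_t>(inst.meshIndex) * pageSize + vertexInPage;
    assert(address < vertexData.size());
    const Float3& v = vertexData[address];
    return { v.x + inst.offset.x, v.y + inst.offset.y, v.z + inst.offset.z };
}
```

A single Draw() with instanceCount * pageSize vertices then renders every instance, each one free to reference a different mesh page.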
Merge Instancing Performance
[Figure: bar chart of time in ms (axis from 0 to 30), DrawInstanced vs. Soft Instancing, on AMD Radeon R9 290X and Nvidia GTX 780.]
● Instancing performance test by Cloud Imperium Games for Star Citizen
● Renders 13.5M triangles (~40M verts)
● DrawInstanced version calls DrawInstanced() and uses instance data in a vertex buffer
● Soft Instancing version uses vertex instancing with Draw() calls and fetches instance data from SRV
Vertex Shader UAVs
● Random access Read/Write in a VS
● Can be used to store transformed vertex data for use in multi-pass algorithms
● Can be used for passing constant attributes between any shader stage (not just from VS)
Skinning to UAV
● Skin vertex data then output to UAV
  ● Instance the skinned UAV data multiple times
● Can also be used for non-instanced data
  ● Multiple passes can reuse the transformed vertex data – Shadow map rendering
● Performance is about the same as stream-out, but you can do more …
Bounding Box to UAV
● Can calculate and store Bbox in the VS
  ● Use a UAV to store the min/max values (6)
  ● InterlockedMin/InterlockedMax determine min and max of the bbox
    ● Need to use integer values with atomics
● Use the stored bbox in later passes
  ● GPU physics (collision)
  ● Tile based processing
Bounding Box: HLSL Code
void UAVBBoxSkinVS(VSSkinnedIn input, uint id:SV_VERTEXID )
{
    // skin the vertex
    . . .
    // output the max and min for the bounding box
    int x = (int) (vSkinned.Pos.x * FLOAT_SCALE); // convert to integer
    int y = (int) (vSkinned.Pos.y * FLOAT_SCALE);
    int z = (int) (vSkinned.Pos.z * FLOAT_SCALE);
    InterlockedMin(g_BBoxUAV[0], x);
    InterlockedMin(g_BBoxUAV[1], y);
    InterlockedMin(g_BBoxUAV[2], z);
    InterlockedMax(g_BBoxUAV[3], x);
    InterlockedMax(g_BBoxUAV[4], y);
    InterlockedMax(g_BBoxUAV[5], z);
    . . .
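The same fixed-point min/max accumulation can be tried on the CPU with std::atomic. This is a sketch, not the talk's code: the compare-exchange loops stand in for InterlockedMin/InterlockedMax, kFloatScale is a hypothetical value for FLOAT_SCALE, and clearing the six slots to INT_MAX/INT_MIN before the pass is an assumption about how the UAV would be initialized.

```cpp
#include <atomic>
#include <cassert>
#include <climits>
#include <cmath>

constexpr float kFloatScale = 1000.0f; // hypothetical value for FLOAT_SCALE

// Stand-ins for InterlockedMin / InterlockedMax using compare-exchange.
void atomicMin(std::atomic<int>& target, int value)
{
    int current = target.load();
    while (value < current && !target.compare_exchange_weak(current, value)) {
        // a failed exchange refreshes `current`; the loop re-tests it
    }
}

void atomicMax(std::atomic<int>& target, int value)
{
    int current = target.load();
    while (value > current && !target.compare_exchange_weak(current, value)) {
    }
}

// Six integer slots: min x, y, z followed by max x, y, z.
struct BoundingBox {
    std::atomic<int> v[6];
    BoundingBox()
    {
        for (int i = 0; i < 3; ++i) {
            v[i].store(INT_MAX);     // minimums start high (assumed clear value)
            v[i + 3].store(INT_MIN); // maximums start low (assumed clear value)
        }
    }
};

// Converts a coordinate to fixed point, since the atomics need integers.
int toFixed(float f)
{
    assert(std::fabs(f) * kFloatScale < 2147483648.0f); // must fit in an int
    return static_cast<int>(f * kFloatScale);
}

// What each vertex shader invocation would do after skinning.
void accumulateVertex(BoundingBox& box, float x, float y, float z)
{
    const int q[3] = { toFixed(x), toFixed(y), toFixed(z) };
    for (int i = 0; i < 3; ++i) {
        atomicMin(box.v[i], q[i]);
        atomicMax(box.v[i + 3], q[i]);
    }
}
```

A later pass would divide the stored integers by the same scale to recover the floating point bounds; the scale trades range against precision.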
Particle System UAV
● Single pass GPU-only particle system
● In the VS:
  ● Generate sprites for rendering
  ● Do Euler integration and update the particle system state to a UAV
Particle System: HLSL Code
uint particleIndex = id / 4;
uint vertexInQuad = id % 4;
// calculate the new position of the vertex
float3 oldPosition = g_bufPosColor[particleIndex].pos.xyz;
float3 oldVelocity = g_bufPosColor[particleIndex].velocity.xyz;
// Euler integration to find new position and velocity
float3 acceleration = normalize(oldVelocity) * ACCELLERATION;
float3 newVelocity = acceleration * g_deltaT + oldVelocity;
float3 newPosition = newVelocity * g_deltaT + oldPosition;
g_particleUAV[particleIndex].pos = float4(newPosition, 1.0);
g_particleUAV[particleIndex].velocity = float4(newVelocity, 0.0);
// Generate sprite vertices
. . .
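The integration step can be reproduced on the CPU to sanity-check values. This is a sketch under assumptions: kAcceleration is a hypothetical value for the shader's ACCELLERATION constant, and Particle and integrate are illustrative names.

```cpp
#include <cassert>
#include <cmath>

struct Float3 { float x, y, z; };

struct Particle {
    Float3 pos;
    Float3 velocity;
};

constexpr float kAcceleration = 2.0f; // hypothetical value for ACCELLERATION

// One integration step with the same order of operations as the HLSL:
// accelerate along the current velocity direction, update the velocity,
// then move the position using the new velocity.
Particle integrate(const Particle& p, float deltaT)
{
    const Float3& v = p.velocity;
    const float speed = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    assert(speed > 0.0f); // normalizing a zero vector is not meaningful
    const Float3 acceleration = { v.x / speed * kAcceleration,
                                  v.y / speed * kAcceleration,
                                  v.z / speed * kAcceleration };
    Particle out;
    out.velocity = { acceleration.x * deltaT + v.x,
                     acceleration.y * deltaT + v.y,
                     acceleration.z * deltaT + v.z };
    out.pos = { out.velocity.x * deltaT + p.pos.x,
                out.velocity.y * deltaT + p.pos.y,
                out.velocity.z * deltaT + p.pos.z };
    return out;
}
```

Because the position is advanced with the new velocity rather than the old one, this is the semi-implicit variant of Euler integration. In the shader, all four vertices of a quad compute the same update for their particle and write the same values to the UAV.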
Conclusion
● Vertex shader “tricks” can be more efficient than more commonly used methods
  ● Use SV_VertexID for smarter instancing
    ● Sprites
    ● Merge Instancing
  ● UAVs add lots of freedom to vertex shaders
    ● Bounding box calculation
    ● Single pass VS particle system
Demos
● Particle System
● UAV Skinning
  ● Bbox
Acknowledgements
● Merge Instancing
  ● Emil Persson, “Graphics Gems for Games” SIGGRAPH 2011
  ● Brendan Jackson, Cloud Imperium
● Thanks to
  ● Nick Thibieroz, AMD
  ● Raul Aguaviva (particle system UAV), AMD
  ● Alex Kharlamov, AMD
Questions
● [email protected]