Vertex Shader Tricks by Bill Bilodeau (AMD) at GDC14

Vertex Shader Tricks: New Ways to Use the Vertex Shader to Improve Performance

Bill Bilodeau, Developer Technology Engineer, AMD

Topics Covered
● Overview of the DX11 front-end pipeline
● Common bottlenecks
● Advanced Vertex Shader Features
● Vertex Shader Techniques
● Samples and Results

Graphics Hardware
[Diagram labels]
● Vector GPRs (256 2048-bit registers)
● Vector ALU (one 64-way single precision operation every 4 clocks)
● Scalar ALU (one operation every 4 clocks)
● Scalar GPRs (256 64-bit registers)
● Vector/Scalar cross communication bus

DX11 Front-End Pipeline
● VS – vertex data
● HS – control points
● Tessellator
● DS – generated vertices
● GS – primitives
● Write to UAV at all stages
  ● Starting with DX11.1
[Diagram: Input Assembler -> Vertex Shader -> Hull Shader -> Tessellator -> Domain Shader -> Geometry Shader -> Stream Out, with CB, SRV, or UAV resources]

Bottlenecks - VS
● VS Attributes
  ● Limit outputs to 4 attributes (AMD)
    ● This applies to all shader stages (except PS)
● VS Texture Fetches
  ● Too many texture fetches can add latency
    ● Especially dependent texture fetches
    ● Group fetches together for better performance
    ● Hide latency with ALU instructions

Bottlenecks - VS
● Use the caches wisely
  ● Avoid large vertex formats that waste pre-VS cache space
  ● DrawIndexed() allows for reuse of processed vertices saved in the post-VS cache
    ● Vertices with the same index only need to get processed once
[Diagram: Input Assembler -> Pre-VS Cache (hides latency) -> Vertex Shader -> Post-VS Cache (vertex reuse)]
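The vertex reuse point can be illustrated with a small CPU-side C++ sketch. It counts how often the vertex shader would run for an indexed quad list, assuming a simple FIFO post-VS cache. The cache size, the FIFO policy, and the helper names (countVertexShaderRuns, makeQuadIndices) are illustrative assumptions and do not describe any particular GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Counts vertex shader invocations for an indexed draw, assuming a FIFO
// post-VS cache with cacheSize entries (an illustrative model only).
std::size_t countVertexShaderRuns(const std::vector<std::uint32_t>& indices,
                                  std::size_t cacheSize)
{
    std::deque<std::uint32_t> cache;
    std::size_t runs = 0;
    for (std::uint32_t index : indices) {
        const bool hit =
            std::find(cache.begin(), cache.end(), index) != cache.end();
        if (hit) {
            continue;  // already processed: reuse the cached result
        }
        ++runs;        // cache miss: the vertex shader has to run
        cache.push_back(index);
        if (cache.size() > cacheSize) {
            cache.pop_front();
        }
    }
    return runs;
}

// Builds a triangle-list index buffer for independent quads:
// 4 unique vertices and 6 indices per quad.
std::vector<std::uint32_t> makeQuadIndices(std::uint32_t quadCount)
{
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(quadCount) * 6);
    const std::uint32_t corners[6] = {0, 1, 2, 2, 1, 3};
    for (std::uint32_t q = 0; q < quadCount; ++q) {
        for (std::uint32_t c : corners) {
            indices.push_back(q * 4 + c);
        }
    }
    return indices;
}
```

Under this model, 100 quads (600 indices) need 400 vertex shader runs with a 16-entry cache, compared with 600 runs when nothing is reused.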

Bottlenecks - GS
● GS
  ● Can add or remove primitives
  ● Adding new primitives requires storing new vertices
    ● Going off chip to store data can be a bandwidth issue
  ● Using the GS means another shader stage
    ● This means more competition for shader resources
    ● Better if you can do everything in the VS

Advanced Vertex Shader Features
● SV_VertexID, SV_InstanceID
● UAV output (DX11.1)
● NULL vertex buffer
  ● VS can create its own vertex data

SV_VertexID
● Can use the vertex ID to decide what vertex data to fetch
● Fetch from SRV, or procedurally create a vertex

VSOut VertexShader(uint id : SV_VertexID)
{
    float3 vertex = g_VertexBuffer[id];
    …
}

UAV buffers
● Write to UAVs from a Vertex Shader
  ● New feature in DX11.1 (UAV at any stage)
● Can be used instead of stream-out for writing vertex data
  ● Triangle output not limited to strips
  ● You can use whatever format you want
● Can output anything useful to a UAV

NULL Vertex Buffer
● DX11/DX10 allows this
  ● Just set the number of vertices in Draw()
  ● VS will execute without a vertex buffer bound
● Can be used for instancing
  ● Call Draw() with the total number of vertices
  ● Bind mesh and instance data as SRVs

Vertex Shader Techniques
● Full Screen Triangle
● Vertex Shader Instancing
  ● Merged Instancing
● Vertex Shader UAVs

Full Screen Triangle
● For post-processing effects
  ● Triangle has better performance than quad
● Fast and easy with VS generated coordinates
  ● No IB or VB is necessary
● Something you should be using for full screen effects
[Diagram: triangle with clip space coordinates (-1, -1, 0), (-1, 3, 0), (3, -1, 0)]

Full Screen Triangle: C++ code

// Null VB, IB
pd3dImmediateContext->IASetVertexBuffers( 0, 0, NULL, NULL, NULL );
pd3dImmediateContext->IASetIndexBuffer( NULL, (DXGI_FORMAT)0, 0 );
pd3dImmediateContext->IASetInputLayout( NULL );

// Set Shaders
pd3dImmediateContext->VSSetShader( g_pFullScreenVS, NULL, 0 );
pd3dImmediateContext->PSSetShader( … );
pd3dImmediateContext->PSSetShaderResources( … );

pd3dImmediateContext->IASetPrimitiveTopology( D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST );

// Render 3 vertices for the triangle
pd3dImmediateContext->Draw(3, 0);

Full Screen Triangle: HLSL Code

VSOutput VSFullScreenTest(uint id:SV_VERTEXID)
{
    VSOutput output;

    // generate clip space position
    output.pos.x = (float)(id / 2) * 4.0 - 1.0;
    output.pos.y = (float)(id % 2) * 4.0 - 1.0;
    output.pos.z = 0.0;
    output.pos.w = 1.0;

    // texture coordinates
    output.tex.x = (float)(id / 2) * 2.0;
    output.tex.y = 1.0 - (float)(id % 2) * 2.0;

    // color
    output.color = float4(1, 1, 1, 1);

    return output;
}
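To check the vertex ID arithmetic without a GPU, the same expressions can be mirrored on the CPU. The C++ sketch below is only a stand-in for the HLSL above; the struct and function names are made up for illustration.

```cpp
#include <cstdint>

struct FullScreenVertex {
    float x, y;  // clip space position (z = 0, w = 1 in the shader)
    float u, v;  // texture coordinates
};

// CPU mirror of the per-vertex arithmetic in the full screen triangle shader.
FullScreenVertex fullScreenVertex(std::uint32_t id)
{
    FullScreenVertex out;
    out.x = static_cast<float>(id / 2) * 4.0f - 1.0f;
    out.y = static_cast<float>(id % 2) * 4.0f - 1.0f;
    out.u = static_cast<float>(id / 2) * 2.0f;
    out.v = 1.0f - static_cast<float>(id % 2) * 2.0f;
    return out;
}
```

IDs 0, 1, and 2 map to (-1, -1), (-1, 3), and (3, -1) with texture coordinates (0, 1), (0, -1), and (2, 1). After clipping, the visible part of the triangle covers the [-1, 1] clip space square with texture coordinates spanning [0, 1].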


VS Instancing: Point Sprites
● Often done on GS, but can be faster on VS
  ● Create an SRV point buffer and bind to VS
  ● Call Draw or DrawIndexed to render the full triangle list
  ● Read the location from the point buffer and expand to vertex location in quad
  ● Can be used for particles or Bokeh DOF sprites
  ● Don’t use DrawInstanced for a small mesh

Point Sprites: C++ Code

pd3dImmediateContext->IASetIndexBuffer( g_pParticleIndexBuffer, DXGI_FORMAT_R32_UINT, 0 );

pd3dImmediateContext->IASetPrimitiveTopology( D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST );

pd3dImmediateContext->DrawIndexed( g_particleCount * 6, 0, 0);

Point Sprites: HLSL Code

VSInstancedParticleDrawOut VSIndexBuffer(uint id:SV_VERTEXID)
{
    VSInstancedParticleDrawOut output;

    uint particleIndex = id / 4;
    uint vertexInQuad = id % 4;

    // calculate the position of the vertex
    float3 position;
    position.x = (vertexInQuad % 2) ? 1.0 : -1.0;
    position.y = (vertexInQuad & 2) ? -1.0 : 1.0;
    position.z = 0.0;
    position.xy *= PARTICLE_RADIUS;

    position = mul( position, (float3x3)g_mInvView ) + g_bufPosColor[particleIndex].pos.xyz;
    output.pos = mul( float4(position,1.0), g_mWorldViewProj );
    output.color = g_bufPosColor[particleIndex].color;

    // texture coordinate
    output.tex.x = (vertexInQuad % 2) ? 1.0 : 0.0;
    output.tex.y = (vertexInQuad & 2) ? 1.0 : 0.0;

    return output;
}
```
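The slides do not show the contents of the particle index buffer. The C++ sketch below assumes one plausible layout (two clockwise triangles per particle, using indices 0, 1, 2, 2, 1, 3 offset by 4 per particle) and mirrors the shader's decoding of the vertex ID on the CPU. The function names and the index pattern are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Corner { float x, y, u, v; };

// Mirrors "id / 4" in the shader: which particle a vertex ID belongs to.
std::uint32_t particleIndexOf(std::uint32_t id) { return id / 4; }

// Mirrors the corner selection in the shader (before scaling by the radius).
Corner cornerOf(std::uint32_t id)
{
    const std::uint32_t vertexInQuad = id % 4;
    Corner c;
    c.x = (vertexInQuad % 2) ? 1.0f : -1.0f;
    c.y = (vertexInQuad & 2) ? -1.0f : 1.0f;
    c.u = (vertexInQuad % 2) ? 1.0f : 0.0f;
    c.v = (vertexInQuad & 2) ? 1.0f : 0.0f;
    return c;
}

// Hypothetical index buffer layout: 6 indices and 4 unique vertex IDs
// per particle, matching DrawIndexed(g_particleCount * 6, 0, 0).
std::vector<std::uint32_t> makeParticleIndices(std::uint32_t particleCount)
{
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(particleCount) * 6);
    const std::uint32_t pattern[6] = {0, 1, 2, 2, 1, 3};
    for (std::uint32_t p = 0; p < particleCount; ++p) {
        for (std::uint32_t c : pattern) {
            indices.push_back(p * 4 + c);
        }
    }
    return indices;
}

// Twice the signed area of triangle (a, b, c); a negative value means
// clockwise winding when +y points up.
float signedArea2(const Corner& a, const Corner& b, const Corner& c)
{
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}
```

With this layout each quad contributes only four distinct vertex IDs, so the two shared corners can be served from the post-VS cache instead of being shaded again.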

Point Sprite Performance
[Chart: Indexed, Non-Indexed, GS, and DrawInstanced sprite rendering at 500K and 1M sprites on AMD Radeon R9 290X and Nvidia Titan; axis from 0 to 12; bar values not recoverable]

Point Sprite Performance
● DrawIndexed() is the fastest method
● Draw() is slower but doesn’t need an IB
● Don’t use DrawInstanced() for creating sprites on either AMD or Nvidia hardware
  ● Not recommended for a small number of vertices

Merge Instancing
● Combine multiple meshes that can be instanced many times
  ● Better than normal instancing which renders only one mesh
  ● Instance nearby meshes for smaller bounding box
● Each mesh is a page in the vertex data
  ● Fixed vertex count for each mesh
  ● Meshes smaller than page size use degenerate triangles

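Merge Instancing
[Diagram: Mesh Vertex Data is a sequence of fixed length pages (Mesh Data 0, Mesh Data 1, Mesh Data 2, ...). Mesh Instance Data holds a mesh index per instance (Instance 0: Mesh Index 2, Instance 1: Mesh Index 0, ...). Each page lists Vertex 0, Vertex 1, Vertex 2, Vertex 3, ... and is padded with zeroed vertices (0 0 0) that form degenerate triangles.]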

Merged Instancing using VS
● Use the vertex ID to look up the mesh to instance
  ● All meshes are the same size, so (id / SIZE) can be used as an offset to the mesh
  ● Faster than using DrawInstanced()
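One way to read the lookup described above is sketched below in CPU-side C++. Here pageSize plays the role of SIZE, id / pageSize selects the instance, the instance record selects the mesh page, and id % pageSize selects the vertex within that page. The data layout, struct names, and the sample data helpers are assumptions based on the diagram, not the presenter's actual code.

```cpp
#include <cstdint>
#include <vector>

struct Float3 { float x, y, z; };

struct MergedVertex {
    std::uint32_t instance;   // instance this vertex ID belongs to
    std::uint32_t meshIndex;  // mesh page used by that instance
    Float3 position;          // vertex fetched from the page
};

// Decodes a flat vertex ID into (instance, mesh page, vertex in page).
MergedVertex fetchMergedVertex(std::uint32_t id,
                               std::uint32_t pageSize,
                               const std::vector<std::uint32_t>& instanceMeshIndex,
                               const std::vector<Float3>& meshVertexData)
{
    MergedVertex out;
    out.instance = id / pageSize;
    const std::uint32_t vertexInPage = id % pageSize;
    out.meshIndex = instanceMeshIndex[out.instance];
    out.position = meshVertexData[out.meshIndex * pageSize + vertexInPage];
    return out;
}

// Sample instance data following the diagram:
// instance 0 uses mesh 2, instance 1 uses mesh 0.
std::vector<std::uint32_t> exampleInstanceMeshIndex() { return {2, 0}; }

// Sample vertex data: three pages; vertex v of page p sits at (p, v, 0).
std::vector<Float3> exampleMeshVertexData(std::uint32_t pageSize)
{
    std::vector<Float3> data;
    for (std::uint32_t page = 0; page < 3; ++page) {
        for (std::uint32_t v = 0; v < pageSize; ++v) {
            data.push_back({static_cast<float>(page), static_cast<float>(v), 0.0f});
        }
    }
    return data;
}
```

A draw of (instance count × pageSize) vertices would then cover every instance in a single call. In a shader, the two vectors would typically be SRV buffers, and the per-instance transform would be applied after the fetch.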

Merge Instancing Performance
[Chart: time in ms (axis from 0 to 30) for DrawInstanced vs. Soft Instancing on AMD Radeon R9 290X and Nvidia GTX 780; bar values not recoverable]
● Instancing performance test by Cloud Imperium Games for Star Citizen
● Renders 13.5M triangles (~40M verts)
● DrawInstanced version calls DrawInstanced() and uses instance data in a vertex buffer
● Soft Instancing version uses vertex instancing with Draw() calls and fetches instance data from SRV

Vertex Shader UAVs
● Random access Read/Write in a VS
● Can be used to store transformed vertex data for use in multi-pass algorithms
● Can be used for passing constant attributes between any shader stage (not just from VS)

Skinning to UAV
● Skin vertex data then output to UAV
  ● Instance the skinned UAV data multiple times
● Can also be used for non-instanced data
  ● Multiple passes can reuse the transformed vertex data (for example, shadow map rendering)
● Performance is about the same as stream-out, but you can do more …

Bounding Box to UAV
● Can calculate and store Bbox in the VS
  ● Use a UAV to store the min/max values (6)
  ● InterlockedMin/InterlockedMax determine min and max of the bbox
    ● Need to use integer values with atomics
● Use the stored bbox in later passes
  ● GPU physics (collision)
  ● Tile based processing

Bounding Box: HLSL Code

void UAVBBoxSkinVS(VSSkinnedIn input, uint id:SV_VERTEXID )
{
    // skin the vertex
    . . .
    // output the max and min for the bounding box
    int x = (int) (vSkinned.Pos.x * FLOAT_SCALE); // convert to integer
    int y = (int) (vSkinned.Pos.y * FLOAT_SCALE);
    int z = (int) (vSkinned.Pos.z * FLOAT_SCALE);

    InterlockedMin(g_BBoxUAV[0], x);
    InterlockedMin(g_BBoxUAV[1], y);
    InterlockedMin(g_BBoxUAV[2], z);
    InterlockedMax(g_BBoxUAV[3], x);
    InterlockedMax(g_BBoxUAV[4], y);
    InterlockedMax(g_BBoxUAV[5], z);
    . . .
}
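The fixed-point min/max accumulation can be tried on the CPU with std::atomic standing in for the UAV. In the C++ sketch below, kFloatScale is a hypothetical value for FLOAT_SCALE, and clearing the minimum slots to INT_MAX and the maximum slots to INT_MIN before the pass is an assumption (the slides do not show how the UAV is initialized).

```cpp
#include <array>
#include <atomic>
#include <climits>
#include <cstddef>
#include <vector>

constexpr float kFloatScale = 1000.0f;  // hypothetical stand-in for FLOAT_SCALE

struct Point { float x, y, z; };

// Slots 0..2 hold the minimum x, y, z; slots 3..5 hold the maximum.
using BBox = std::array<std::atomic<int>, 6>;

// Assumed clear values so the first vertex always wins both comparisons.
void resetBBox(BBox& box)
{
    for (std::size_t i = 0; i < 3; ++i) box[i].store(INT_MAX);
    for (std::size_t i = 3; i < 6; ++i) box[i].store(INT_MIN);
}

// CPU stand-ins for InterlockedMin / InterlockedMax on a signed integer.
void atomicMin(std::atomic<int>& slot, int value)
{
    int current = slot.load();
    while (value < current && !slot.compare_exchange_weak(current, value)) {}
}

void atomicMax(std::atomic<int>& slot, int value)
{
    int current = slot.load();
    while (value > current && !slot.compare_exchange_weak(current, value)) {}
}

// Mirrors the shader: scale to integer, then update all six slots.
void accumulate(BBox& box, const Point& p)
{
    const int v[3] = { static_cast<int>(p.x * kFloatScale),
                       static_cast<int>(p.y * kFloatScale),
                       static_cast<int>(p.z * kFloatScale) };
    for (std::size_t i = 0; i < 3; ++i) {
        atomicMin(box[i], v[i]);
        atomicMax(box[i + 3], v[i]);
    }
}

// Runs the accumulation over a point set and returns the six integers.
std::array<int, 6> computeBBox(const std::vector<Point>& points)
{
    BBox box;
    resetBBox(box);
    for (const Point& p : points) accumulate(box, p);
    std::array<int, 6> result;
    for (std::size_t i = 0; i < 6; ++i) result[i] = box[i].load();
    return result;
}

// Two sample points for trying the functions above.
std::vector<Point> examplePoints()
{
    return { {1.5f, -2.0f, 0.25f}, {-0.5f, 4.0f, 3.0f} };
}
```

A later pass would divide the stored integers by the same scale to recover floating point bounds. The scale trades range for precision, so it has to be chosen so that scaled positions fit in a 32-bit integer.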

Particle System UAV
● Single pass GPU-only particle system
● In the VS:
  ● Generate sprites for rendering
  ● Do Euler integration and update the particle system state to a UAV

Particle System: HLSL Code

uint particleIndex = id / 4;
uint vertexInQuad = id % 4;

// calculate the new position of the vertex
float3 oldPosition = g_bufPosColor[particleIndex].pos.xyz;
float3 oldVelocity = g_bufPosColor[particleIndex].velocity.xyz;

// Euler integration to find new position and velocity
float3 acceleration = normalize(oldVelocity) * ACCELLERATION;
float3 newVelocity = acceleration * g_deltaT + oldVelocity;
float3 newPosition = newVelocity * g_deltaT + oldPosition;
g_particleUAV[particleIndex].pos = float4(newPosition, 1.0);
g_particleUAV[particleIndex].velocity = float4(newVelocity, 0.0);

// Generate sprite vertices
. . .
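The integration step can be checked on the CPU with a short C++ mirror of the HLSL above. Note that the position update uses the new velocity, which makes this the semi-implicit form of Euler integration. The struct names, the accelerationMagnitude parameter (standing in for ACCELLERATION), and the zero-velocity guard are additions for this sketch and are not part of the slide code.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
struct Particle { Vec3 pos; Vec3 velocity; };

// Convenience constructor for the examples below.
Particle makeParticle(float px, float py, float pz,
                      float vx, float vy, float vz)
{
    return Particle{ Vec3{px, py, pz}, Vec3{vx, vy, vz} };
}

// One integration step: accelerate along the current direction of travel,
// then advance the position with the updated velocity.
Particle integrate(const Particle& old, float accelerationMagnitude, float deltaT)
{
    const Vec3& v = old.velocity;
    const float speed = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);

    // Guard added for this sketch: normalizing a zero vector is not meaningful,
    // so a particle at rest receives no acceleration here.
    const float scale = (speed > 0.0f) ? accelerationMagnitude / speed : 0.0f;
    const Vec3 acceleration{ v.x * scale, v.y * scale, v.z * scale };

    Particle next;
    next.velocity = Vec3{ acceleration.x * deltaT + v.x,
                          acceleration.y * deltaT + v.y,
                          acceleration.z * deltaT + v.z };
    next.pos = Vec3{ next.velocity.x * deltaT + old.pos.x,
                     next.velocity.y * deltaT + old.pos.y,
                     next.velocity.z * deltaT + old.pos.z };
    return next;
}
```

For a particle at the origin moving at (2, 0, 0) with an acceleration magnitude of 1 and a time step of 0.5, the new velocity is (2.5, 0, 0) and the new position is (1.25, 0, 0).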

Conclusion
● Vertex shader “tricks” can be more efficient than more commonly used methods
  ● Use SV_VertexID for smarter instancing
    ● Sprites
    ● Merge Instancing
  ● UAVs add lots of freedom to vertex shaders
    ● Bounding box calculation
    ● Single pass VS particle system

Demos
● Particle System
● UAV Skinning
  ● Bbox

Acknowledgements
● Merge Instancing
  ● Emil Persson, “Graphics Gems for Games” SIGGRAPH 2011
  ● Brendan Jackson, Cloud Imperium
● Thanks to
  ● Nick Thibieroz, AMD
  ● Raul Aguaviva (particle system UAV), AMD
  ● Alex Kharlamov, AMD

Questions