gs-4108, direct compute in gaming, by bill bilodeau

DirectCompute in Gaming BILL BILODEAU DEVELOPER TECHNOLOGY ENGINEER, AMD

DESCRIPTION

Presentation GS-4108 by Bill Bilodeau at the AMD Developer Summit (APU13) November 11-13, 2013.

TRANSCRIPT

Page 1: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute in Gaming BILL BILODEAU

DEVELOPER TECHNOLOGY ENGINEER, AMD

Page 2: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

2 | DirectCompute in Gaming| NOVEMBER 15 2013 | AMD DEVELOPER SUMMIT

AGENDA

Introduction to DirectCompute

GCN Overview

DirectCompute Programming

Optimization Techniques

Examples of Compute in Games

‒ Separable Filter

‒ Tiled Lighting

‒ TressFX Physics

TOPICS COVERED IN THIS TALK

Page 3: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

INTRODUCTION TO DIRECT COMPUTE

Page 4: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

GPU COMPUTE

Games can be CPU limited on some configurations

‒ Many traditional CPU tasks can be offloaded to the GPU

Some rendering algorithms are faster when implemented using Compute

‒ Post-processing techniques

Some tasks are a natural fit for Compute

‒ Non-graphics Data Parallel programming

‒ Physics

WHY WE NEED COMPUTE IN GAMES

Page 5: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DX11 GRAPHICS PIPELINE

WHY WE NEED DirectCompute

Too many unnecessary stages for compute

Old DX9 “GPGPU” programming

‒ Render full screen quad

‒ Use Pixel Shader for compute

[Diagram: DX11 graphics pipeline. Stages shown: Input Assembly, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Stream Out, Rasterizer, Pixel Shader, Depth Test, Output Merger. Resources shown: Buffers, Textures, Constants, Render Targets, UAVs, Depth-stencil.]

Page 6: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute PIPELINE

API designed for compute programming

‒ Interoperability with DirectX

Pipeline is much simpler

‒ No need to render triangles

‒ Bind input and output then call Dispatch()

ONLY WHAT’S NECESSARY

[Diagram: DirectCompute pipeline. SRVs and UAVs feed a single Compute Shader stage.]

Page 7: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute FEATURES

Structured Buffers

‒ Maps better to general purpose data structures

Append/Consume

‒ Fast for order-independent I/O

Atomics, Barriers

‒ Synchronization for finer grain thread control

Thread Group Shared Memory

‒ Fast on-chip local memory shared between threads in a group

‒ Great for intermediate results

MORE REASONS TO USE DirectCompute

Page 8: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DISPATCH

Threads are organized into Thread Groups

Each thread is executing an instance of the compute shader code

ID3D11DeviceContext::Dispatch( Dx, Dy, Dz )

‒ Example: Dispatch(4,3,2) => 24 thread groups

THREAD GROUP ORGANIZATION

[Diagram: 4 x 3 x 2 grid of thread groups, labeled (x,y,z) from 0,0,0 to 3,2,1]

Page 9: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

THREAD GROUP

Threads in a group are organized in an array

‒ For full screen rendering you can think of a thread group as a tile on the screen, and the thread as a pixel

The compute shader HLSL code declares the thread layout

‒ [numthreads(Tx, Ty, Tz)]

‒ Example: [numthreads(8, 4, 2)] => 64 threads per thread group

SystemValues provide information about the current thread

‒ SV_GroupID = thread group address, SV_GroupThreadID = thread address relative to thread group

ORGANIZATION OF THREADS WITHIN A THREAD GROUP

[Diagram: 8 x 4 x 2 array of threads within one thread group, labeled (x,y,z) from 0,0,0 to 7,3,1]
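As a worked example of the arithmetic on the Dispatch and Thread Group slides, the following C++ sketch counts thread groups and threads. The `Extent3` type and the function names are hypothetical, chosen only for illustration; the Dispatch(4,3,2) and numthreads(8,4,2) figures are the ones from the slides.

```cpp
#include <cstdint>

// Hypothetical helper type: three unsigned extents, as passed to
// Dispatch(Dx, Dy, Dz) or declared in [numthreads(Tx, Ty, Tz)].
struct Extent3 {
    std::uint32_t x, y, z;
};

// Number of elements in a 3D extent: thread groups in a dispatch,
// or threads in one thread group.
constexpr std::uint64_t volume(Extent3 e) {
    return static_cast<std::uint64_t>(e.x) * e.y * e.z;
}

// Total number of threads launched by one dispatch.
constexpr std::uint64_t totalThreads(Extent3 dispatch, Extent3 numthreads) {
    return volume(dispatch) * volume(numthreads);
}

// The two examples from the slides, checked at compile time.
static_assert(volume(Extent3{4, 3, 2}) == 24, "Dispatch(4,3,2) => 24 thread groups");
static_assert(volume(Extent3{8, 4, 2}) == 64, "numthreads(8,4,2) => 64 threads per group");
```

Combining the two slide examples, a Dispatch(4,3,2) of a shader declared with numthreads(8,4,2) would launch 24 x 64 = 1536 threads.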

Page 10: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute SETUP

Create the resources

C CODE TO SETUP COMPUTE SHADER EXECUTION

pd3dDevice->CreateBuffer(&Desc, NULL, &pCBCSPerFrame);

pd3dDevice->CreateBuffer(&bufferDesc, NULL, &pRWBuffer);

Page 11: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute SETUP

Create the resources

Create the views

C CODE TO SETUP COMPUTE SHADER EXECUTION

pd3dDevice->CreateUnorderedAccessView(pRWBuffer, &UAVDesc, &pUAV);

Page 12: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute SETUP

Create the resources

Create the views

Compile the compute shader

C CODE TO SETUP COMPUTE SHADER EXECUTION

CompileShaderFromFile( L"myComputeShader.hlsl", "CSMain", "cs_5_0", &pBlob );

pd3dDevice->CreateComputeShader( pBlob, . . . , pComputeShader );

Page 13: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute SETUP

Create the resources

Create the views

Compile the compute shader

Bind the resources

C CODE TO SETUP COMPUTE SHADER EXECUTION

pd3dContext->CSSetConstantBuffers(0, 1, &pCBCSPerFrame);

ID3D11UnorderedAccessView* ppUAV[1] = {pUAV};

pd3dContext->CSSetUnorderedAccessViews(0, 1, ppUAV, NULL);

Page 14: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute SETUP

Create the resources

Create the views

Compile the compute shader

Bind the resources

Set the shader

C CODE TO SETUP COMPUTE SHADER EXECUTION

pd3dContext->CSSetShader(pComputeShader, NULL, 0);

Page 15: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute SETUP

Create the resources

Create the views

Compile the compute shader

Bind the resources

Set the shader

Go!

C CODE TO SETUP COMPUTE SHADER EXECUTION

pd3dContext->Dispatch(numOfGroupsX, numOfGroupsY, 1);
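The slide does not say how numOfGroupsX and numOfGroupsY are obtained. A common approach, sketched here as an assumption rather than as the presenter's code, is to round up so that the thread groups cover the whole render target. The function name `groupsToCover` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of thread groups of `groupSize` threads
// (along one axis) needed to cover `extent` pixels, rounding up.
inline std::uint32_t groupsToCover(std::uint32_t extent, std::uint32_t groupSize) {
    assert(groupSize > 0);
    // Written as divide-plus-remainder so it cannot overflow for large extents.
    return extent / groupSize + (extent % groupSize != 0 ? 1u : 0u);
}
```

For a 1920x1080 target and 16x16 thread groups this gives 120 x 68 groups. Because 1080 is not a multiple of 16, the last row of groups extends past the image, so the shader would need to skip threads whose coordinates fall outside it.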

Page 16: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

COMPUTE SHADER CODE

numthreads

‒ Defines the number of threads in a thread group

‒ Put this right before the definition of the compute shader

IMPORTANT ADDITIONS TO HLSL FOR COMPUTE

groupshared float4 sharedPos[THREAD_GROUP_SIZE];

[numthreads(THREAD_GROUP_SIZE, 1, 1)]

void MyComputeShader(uint GIndex : SV_GroupIndex)

{

sharedPos[GIndex] = gThreadData[myIndex];

GroupMemoryBarrierWithGroupSync();

InterlockedExchange(myUAV[loc], newVal, oldVal);

}

Page 17: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

COMPUTE SHADER CODE

Thread Group Shared Memory

‒ Variables that are stored in memory that is shared between threads in a group (up to 32K bytes per group)

‒ Use the groupshared modifier in front of the variable declaration

IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)

groupshared float4 sharedPos[THREAD_GROUP_SIZE];

[numthreads(THREAD_GROUP_SIZE, 1, 1)]

void MyComputeShader(uint GIndex : SV_GroupIndex)

{

sharedPos[GIndex] = gThreadData[myIndex];

GroupMemoryBarrierWithGroupSync();

InterlockedExchange(myUAV[loc], newVal, oldVal);

}

Page 18: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

COMPUTE SHADER CODE

System Values that can be used in a Compute Shader

‒ SV_GroupID , SV_GroupThreadID , SV_DispatchThreadID , SV_GroupIndex

‒ The system values tell the compute shader what group and what thread it’s currently working on.

IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)

groupshared float4 sharedPos[THREAD_GROUP_SIZE];

[numthreads(THREAD_GROUP_SIZE, 1, 1)]

void MyComputeShader(uint GIndex : SV_GroupIndex)

{

sharedPos[GIndex] = gThreadData[myIndex];

GroupMemoryBarrierWithGroupSync();

InterlockedExchange(myUAV[loc], newVal, oldVal);

}
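The relationship between the four system values can be emulated on the CPU. The sketch below follows the documented HLSL definitions: SV_DispatchThreadID is SV_GroupID times the numthreads dimensions plus SV_GroupThreadID, and SV_GroupIndex is the flattened SV_GroupThreadID. The `UInt3` type and the function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical stand-in for HLSL's uint3.
struct UInt3 {
    std::uint32_t x, y, z;
};

// SV_DispatchThreadID: the thread's address across the whole dispatch.
constexpr UInt3 dispatchThreadId(UInt3 groupId, UInt3 groupThreadId, UInt3 numthreads) {
    return UInt3{groupId.x * numthreads.x + groupThreadId.x,
                 groupId.y * numthreads.y + groupThreadId.y,
                 groupId.z * numthreads.z + groupThreadId.z};
}

// SV_GroupIndex: the thread's address within its group, flattened to one integer.
constexpr std::uint32_t groupIndex(UInt3 groupThreadId, UInt3 numthreads) {
    return groupThreadId.z * numthreads.x * numthreads.y
         + groupThreadId.y * numthreads.x
         + groupThreadId.x;
}
```

With numthreads(8,4,2), the last thread in a group, (7,3,1), has a group index of 63, which is why SV_GroupIndex is convenient for indexing a groupshared array sized to the thread group.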

Page 19: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

COMPUTE SHADER CODE

Synchronization

‒ Barriers

‒ GroupMemoryBarrier(), GroupMemoryBarrierWithGroupSync()

‒ Blocks execution until threads are synchronized

IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)

groupshared float4 sharedPos[THREAD_GROUP_SIZE];

[numthreads(THREAD_GROUP_SIZE, 1, 1)]

void MyComputeShader(uint GIndex : SV_GroupIndex)

{

sharedPos[GIndex] = gThreadData[myIndex];

GroupMemoryBarrierWithGroupSync();

InterlockedExchange(myUAV[loc], newVal, oldVal);

}

Page 20: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

COMPUTE SHADER CODE

Synchronization

‒ Atomics

‒ “Interlocked” versions of Add, Min, Max, Or, And, Xor, CompareStore, and Exchange (e.g. InterlockedAdd)

‒ Guarantee exclusive access to shared memory

IMPORTANT ADDITIONS TO HLSL FOR COMPUTE (CONTINUED)

groupshared float4 sharedPos[THREAD_GROUP_SIZE];

[numthreads(THREAD_GROUP_SIZE, 1, 1)]

void MyComputeShader(uint GIndex : SV_GroupIndex)

{

sharedPos[GIndex] = gThreadData[myIndex];

GroupMemoryBarrierWithGroupSync();

InterlockedExchange(myUAV[loc], newVal, oldVal);

}

Page 21: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute AND GCN

Page 22: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

GCN REFRESHER

Basic GPU building block

‒ R9 290X has 44 Compute Units

Four 16-wide SIMDs that can execute instructions from multiple threads

‒ Each SIMD has 64KB of vector general purpose register (VGPR) memory

16 Texture Fetch (load/store) Units

‒ 16KB L1 cache

One Scalar Unit

‒ 8KB of Scalar GPR memory

Local Data Share

‒ 64KB of shared memory

COMPUTE UNIT

[Diagram: GCN Compute Unit, showing the Scheduler, Branch & Message Unit, Scalar Unit, Vector Units (4x SIMD-16), Scalar Registers (8KB), Vector Registers (4x 64KB), Local Data Share (64KB), Texture Filter Units (4), Texture Fetch Load / Store Units (16), and L1 Cache (16KB)]

Page 23: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

WAVEFRONTS

A wavefront consists of 4 batches of threads

‒ Execution is time-sliced to hide latency

Since there are 16 ALUs per SIMD, the size of a wavefront is 4 x 16 = 64 threads

Each SIMD can keep track of 10 wavefronts

‒ 40 wavefronts per CU = 2560 threads

‒ For the R9 290X, that’s a total of 44 x 2560 = 112,640 threads in flight!

HOW THREADS EXECUTE ON COMPUTE UNITS

[Diagram: timeline in clocks for Batch 1 through Batch 4. Each batch alternates between Runnable and Stall until it is Done.]
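The thread counts on this slide follow from a few multiplications. The sketch below reproduces them; the constant and function names are hypothetical, and the values are the ones quoted in the slides (16 ALUs per SIMD, 4 batches per wavefront, 4 SIMDs per CU, 10 wavefronts per SIMD, 44 CUs on the R9 290X).

```cpp
#include <cstdint>

// Figures quoted in the slides for a GCN Compute Unit.
constexpr std::uint32_t kLanesPerSimd        = 16;
constexpr std::uint32_t kBatchesPerWavefront = 4;
constexpr std::uint32_t kSimdsPerCu          = 4;
constexpr std::uint32_t kMaxWavesPerSimd     = 10;

// 4 batches x 16 lanes = 64 threads per wavefront.
constexpr std::uint32_t wavefrontSize() {
    return kLanesPerSimd * kBatchesPerWavefront;
}

// 4 SIMDs x 10 wavefronts x 64 threads = 2560 threads per Compute Unit.
constexpr std::uint32_t maxThreadsPerCu() {
    return kSimdsPerCu * kMaxWavesPerSimd * wavefrontSize();
}

// Upper bound on threads in flight for a GPU with `computeUnits` CUs.
constexpr std::uint32_t maxThreadsInFlight(std::uint32_t computeUnits) {
    return computeUnits * maxThreadsPerCu();
}

static_assert(maxThreadsInFlight(44) == 112640, "R9 290X figure from the slide");
```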

Page 24: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

GCN AND DirectCompute

DirectCompute thread groups correspond to groups of threads running on a Compute Unit

‒ Choose a thread group size that works well with the hardware

Thread Group Shared Memory is stored in the Compute Unit’s Local Data Share (LDS)

‒ 4 SIMDs (64 ALUs) are all sharing the same LDS

‒ Barriers and atomics synchronize access to this storage

HOW DirectCompute Maps to the GPU

Page 25: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

OPTIMIZE FOR THE HARDWARE

Thread Group Size

‒ Since there are 64 threads per wavefront, make the thread group size multiples of 64

‒ If [numthreads(x,y,z)] then (x * y * z) should be a multiple of 64

‒ If the thread group size is not a multiple of 64, the SIMDs will still run 64 threads per wavefront

‒ Whether they’re all used or not!

‒ Start with 64 threads per group and then see if higher multiples improve performance.

LDS memory access is faster than off-chip memory

‒ Use thread group shared memory for storing intermediate results instead of writing out to the UAV

‒ Thread group shared memory can also be used to eliminate unnecessary fetches

‒ For example, discrete convolutions sample many of the same locations when the kernel moves to an adjacent pixel

‒ TGSM is limited to 32KB, so pack the data to save space

IMPORTANT THINGS TO CONSIDER WHEN DESIGNING A COMPUTE SHADER
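To make the multiple-of-64 advice concrete, the sketch below estimates how many wavefront lanes go unused for a given thread group size. It assumes, as the slide states, that a group is always executed as whole 64-thread wavefronts; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kWavefrontSize = 64;  // threads per wavefront, per the slides

// Number of wavefronts needed to run one thread group (rounded up).
inline std::uint32_t wavefrontsPerGroup(std::uint32_t threadsPerGroup) {
    assert(threadsPerGroup > 0);
    return (threadsPerGroup + kWavefrontSize - 1) / kWavefrontSize;
}

// Lanes that are occupied by the group's wavefronts but do no useful work.
inline std::uint32_t idleLanes(std::uint32_t threadsPerGroup) {
    return wavefrontsPerGroup(threadsPerGroup) * kWavefrontSize - threadsPerGroup;
}
```

For example, [numthreads(10,10,1)] gives 100 threads, which occupy two wavefronts and leave 28 lanes idle, while [numthreads(8,8,1)] fills exactly one wavefront.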

Page 26: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

OPTIMIZE FOR THE HARDWARE

Avoid Bank Conflicts

‒ Thread Group Shared Memory is stored in LDS which is organized in 32 banks

‒ If one thread accesses a location and another thread accesses location + (n x 32), a bank conflict will occur

‒ The hardware can resolve this, but at the cost of performance

THREAD GROUP SHARED MEMORY ACCESS

BANK       0  1  2  3  4  …  31   0   1   2   3

ADDRESS    0  1  2  3  4  …  31  32  33  34  35
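The bank mapping in the diagram can be modeled with a modulo. The sketch below is a simplified model, not a description of the actual hardware: it assumes addresses are counted in bank-width units as in the diagram, that thread t accesses address t x stride, and it ignores any special handling of threads reading the same address. The function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

constexpr std::uint32_t kNumBanks = 32;  // LDS banks, per the slide

// Bank that holds a given address (address in bank-width units).
constexpr std::uint32_t bankOf(std::uint32_t address) {
    return address % kNumBanks;
}

// Largest number of threads that land on the same bank when
// thread t accesses address t * stride. 1 means conflict-free.
inline std::uint32_t conflictDegree(std::uint32_t numThreads, std::uint32_t stride) {
    std::array<std::uint32_t, kNumBanks> hits{};  // zero-initialized hit counters
    for (std::uint32_t t = 0; t < numThreads; ++t) {
        ++hits[bankOf(t * stride)];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In this model a stride of 1 is conflict-free, a stride of 32 sends every thread to bank 0, and a stride of 33 is conflict-free again, which is why padding an array row by one element is a common way to avoid conflicts.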

Page 27: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

OPTIMIZE FOR THE HARDWARE

Local variables are stored in general purpose registers (GPRs)

‒ There’s a limit of 64K bytes of storage for vector registers per SIMD

‒ The register memory has to be shared between wavefronts, so using too many GPRs can reduce the number of wavefronts

‒ Maximizing parallelism is critical to compute shader performance (and shader performance in general)

‒ The more wavefronts that can run, the more parallelism you can get

‒ Using fewer local variables can help, but look at the DirectX assembly code to get a better idea of how many registers are being used.

GPR USAGE

GCN VGPR Count    <=24   28   32   36   40   48   64   84   <=128   >128

Max Waves/SIMD    10     9    8    7    6    5    4    3    2       1
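The table can be turned into a lookup. The sketch below assumes that a VGPR count falling between two columns is rounded up to the next column (for example, 25 VGPRs is treated like 28); that reading of the table is an assumption, and the function name is hypothetical.

```cpp
#include <cstdint>

// Maximum wavefronts per SIMD for a shader using `vgprCount` vector registers,
// using the thresholds from the table in the slide.
inline std::uint32_t maxWavesPerSimd(std::uint32_t vgprCount) {
    struct Row {
        std::uint32_t maxVgprs;  // upper bound of this column
        std::uint32_t waves;     // wavefronts per SIMD at or below that bound
    };
    static const Row kTable[] = {
        {24, 10}, {28, 9}, {32, 8}, {36, 7}, {40, 6},
        {48, 5},  {64, 4}, {84, 3}, {128, 2},
    };
    for (const Row& row : kTable) {
        if (vgprCount <= row.maxVgprs) {
            return row.waves;
        }
    }
    return 1;  // more than 128 VGPRs
}
```

Going from 64 to 65 VGPRs drops the limit from 4 wavefronts to 3, so a single extra register at a column boundary can cost a quarter of the wavefronts available for latency hiding.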

Page 28: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute Techniques for Games

Page 30: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

SEPARABLE FILTER

Many useful filters are separable

‒ Typical example is Gaussian filter for a high quality image blur – used often in post-processing

Separable Algorithm

‒ Run the first pass in the horizontal direction and store the results in an intermediate buffer

‒ Run the second pass on the intermediate buffer in the vertical direction

TWO PASS DISCRETE CONVOLUTIONS

Source RT

Intermediate RT

Destination RT

Horizontal Pass Vertical Pass
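The two-pass algorithm can be sketched as a CPU reference in C++. This is an illustration of the technique, not the compute shader from the talk: the function names, the row-major float image layout, and the clamp-to-edge handling of samples outside the image are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One 1D convolution pass over a row-major image.
// (stepX, stepY) is (1, 0) for the horizontal pass and (0, 1) for the vertical pass.
// Samples outside the image are clamped to the nearest edge (an assumption).
inline std::vector<float> convolvePass(const std::vector<float>& src, int width, int height,
                                       const std::vector<float>& kernel, int stepX, int stepY) {
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);
    assert(kernel.size() % 2 == 1);  // odd-length kernel, centered on the pixel
    const int radius = static_cast<int>(kernel.size() / 2);
    std::vector<float> dst(src.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int sx = std::min(std::max(x + k * stepX, 0), width - 1);
                const int sy = std::min(std::max(y + k * stepY, 0), height - 1);
                sum += kernel[k + radius] * src[sy * width + sx];
            }
            dst[y * width + x] = sum;
        }
    }
    return dst;
}

// Horizontal pass into an intermediate buffer, then vertical pass on that buffer.
inline std::vector<float> separableFilter(const std::vector<float>& src, int width, int height,
                                          const std::vector<float>& kernel) {
    const std::vector<float> intermediate = convolvePass(src, width, height, kernel, 1, 0);
    return convolvePass(intermediate, width, height, kernel, 0, 1);
}
```

For a kernel of length n, the two passes read 2n samples per pixel instead of the n x n samples a single 2D pass would need, which is the reason separable filters are attractive for blurs.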

Page 31: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

SEPARABLE FILTER

Use Thread Group Shared Memory to cache previously sampled values

Naïve approach is to load a row of texels into TGSM at once, then calculate each pixel value

COMPUTE SHADER OPTIMIZATION

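[Diagram: a row of texels with the Kernel Radius marked at each end. 128 threads load 128 texels, but only 128 – ( Kernel Radius * 2 ) threads compute results; the remaining Kernel Radius * 2 threads are redundant compute threads.]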
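The cost of the naïve approach is easy to quantify. The sketch below uses the 128-thread figure from the slide; the function names are hypothetical, and the kernel radius of 8 used in the note that follows is only an example value.

```cpp
#include <cassert>
#include <cstdint>

// Texels that must be in TGSM to filter `pixels` output pixels of one line:
// the kernel extends `kernelRadius` texels past each end.
inline std::uint32_t texelsNeeded(std::uint32_t pixels, std::uint32_t kernelRadius) {
    return pixels + 2 * kernelRadius;
}

// In the naive scheme every thread loads one texel, so only the threads
// that have a full kernel footprint in TGSM can compute a result.
inline std::uint32_t naiveComputeThreads(std::uint32_t groupSize, std::uint32_t kernelRadius) {
    assert(2 * kernelRadius <= groupSize);
    return groupSize - 2 * kernelRadius;
}
```

With 128 threads and a kernel radius of 8, only 112 threads produce results, so 16 threads (12.5% of the group) load a texel and then sit idle during the compute phase.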

Page 32: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

SEPARABLE FILTER

Make some of the threads load more texels

‒ We only want as many threads as there are pixels

‒ However, we need more texels than pixels since the kernel extends beyond the pixels

‒ Just have some of the threads load more texels

Process multiple lines per thread group

‒ Keeps the thread group size a multiple of 64

BETTER OPTIMIZATION

[Diagram: rows of texels with the Kernel Radius marked. 64 threads load 256 texels; Kernel Radius * 4 threads load 1 extra texel each; 64 threads compute 256 results.]

Page 34: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

TILED LIGHTING COMPUTE SHADER LIGHT CULLING

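[Diagram: three lights, numbered 1, 2 and 3, overlapping a row of screen tiles. The resulting per-tile light lists are [1], [1,2,3] and [2,3].]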

Break up the screen into tiles

Create a list of lights for each tile

Only render with lights touching the tile

Page 35: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

TILED LIGHTING

Divide screen into tiles

Fit asymmetric frustum around each tile

TILED FRUSTUMS

Tile0 Tile1 Tile2 Tile3

Page 36: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

TILED LIGHTING

Use z buffer from depth pre-pass as input

Find min and max depth per tile

Use this frustum for intersection testing

CREATE FRONT AND BACK PLANES
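Finding the depth extent of a tile is a min/max reduction over the tile's pixels. The following CPU sketch shows the idea serially; the type and function names, the row-major depth buffer layout, and the handling of partial tiles at the image edge are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct DepthBounds {
    float minZ;
    float maxZ;
};

// Minimum and maximum depth of the tile at tile coordinates (tileX, tileY).
// Tiles that overhang the right or bottom edge are clipped to the image.
inline DepthBounds tileDepthBounds(const std::vector<float>& depth, int width, int height,
                                   int tileX, int tileY, int tileSize) {
    assert(width > 0 && height > 0 && tileSize > 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    const int x0 = tileX * tileSize;
    const int y0 = tileY * tileSize;
    assert(x0 >= 0 && x0 < width && y0 >= 0 && y0 < height);
    const int x1 = std::min(x0 + tileSize, width);
    const int y1 = std::min(y0 + tileSize, height);

    DepthBounds bounds = {std::numeric_limits<float>::max(),
                          std::numeric_limits<float>::lowest()};
    for (int y = y0; y < y1; ++y) {
        for (int x = x0; x < x1; ++x) {
            const float z = depth[y * width + x];
            bounds.minZ = std::min(bounds.minZ, z);
            bounds.maxZ = std::max(bounds.maxZ, z);
        }
    }
    return bounds;
}
```

The resulting minZ and maxZ would define the front and back planes of the tile's frustum.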

Page 37: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

TILED LIGHTING

Test each light against the frustum

CREATE PER TILE LIGHT LIST (PART 1)

[Diagram: light buffer with entries Light0, Light1, Light2, Light3, Light4, ... Light10, each storing a Position and a Radius]
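Since each light is described by a position and a radius, testing it against the tile frustum amounts to a sphere-versus-planes test. The sketch below is one common way to do this and is not taken from the talk: it assumes each frustum plane is stored with its normal pointing into the frustum, and all type and function names are hypothetical.

```cpp
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Plane in the form dot(normal, p) + d = 0, with `normal` pointing
// into the frustum (an assumption of this sketch).
struct Plane {
    Vec3 normal;
    float d;
};

struct PointLight {
    Vec3 position;
    float radius;
};

// Signed distance from point p to the plane; positive on the inside.
inline float signedDistance(const Plane& plane, const Vec3& p) {
    return plane.normal.x * p.x + plane.normal.y * p.y + plane.normal.z * p.z + plane.d;
}

// True unless the light's sphere lies completely outside at least one plane.
inline bool lightTouchesFrustum(const PointLight& light, const std::vector<Plane>& planes) {
    for (const Plane& plane : planes) {
        if (signedDistance(plane, light.position) < -light.radius) {
            return false;  // entirely behind this plane: culled
        }
    }
    return true;
}
```

This test is conservative: a sphere near a frustum corner can pass every plane test without actually intersecting the frustum, which costs some extra lighting work but never drops a visible light.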

Page 38: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

TILED LIGHTING

For each light that intersects the frustum

‒ Write the index of the light in the per-tile list

CREATE PER TILE LIGHT LIST (PART 2)

[Diagram: lights 1 and 4 from the light buffer (Light0 ... Light10, each with Position and Radius) intersect the tile frustum. The per-tile list stores the Count in Index0, the light indices 1 and 4 in Index1 and Index2, and leaves Index3 and Index4 Empty.]
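The list layout in the diagram, a count followed by light indices, can be sketched serially in C++. This is an illustration only: the function name and the `std::vector<bool>` input of per-light intersection results are hypothetical, and a compute shader would instead append from many threads in parallel.

```cpp
#include <cstdint>
#include <vector>

// Builds a per-tile light list laid out as in the diagram:
// element 0 holds the count, elements 1..count hold light indices.
// `intersects[i]` says whether light i touches the tile's frustum.
inline std::vector<std::uint32_t> buildTileLightList(const std::vector<bool>& intersects,
                                                     std::uint32_t maxLightsPerTile) {
    std::vector<std::uint32_t> list(1, 0u);  // list[0] is the running count
    for (std::uint32_t i = 0; i < intersects.size() && list[0] < maxLightsPerTile; ++i) {
        if (intersects[i]) {
            list.push_back(i);
            ++list[0];
        }
    }
    return list;
}
```

With lights 1 and 4 intersecting, the result is {2, 1, 4}. Lights beyond `maxLightsPerTile` are dropped in this sketch; how the real shader handles overflow is not stated in the slides.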

Page 39: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

TILED LIGHTING

Implemented using a single compute shader

A thread group is executed per tile

‒ e.g. [numthreads(16,16,1)] for 16x16 tile size

Build frustum

Calculate Z extent

‒ Each thread calculates Z extent in parallel

256 lights are culled in parallel (for 16x16 tile size)

Indices of intersecting lights are written to thread group shared memory (TGSM)

‒ groupshared uint ldsLightIdx[MAX_NUM_LIGHTS_PER_TILE];

Then export to global light index list

‒ RWBuffer<uint> g_PerTileLightIndexBufferOut : register( u0 );

COMPUTE SHADER

Page 40: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

TressFX PHYSICS TOMB RAIDER

Page 41: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

TressFX PHYSICS

Responds to natural gravity

Global constraints to maintain various hair styles

Collision detection with head and body

Supports wind and other forces

Artist-friendly

‒ Programmatic control over attributes

‒ Can tweak for different results

The entire simulation is done on Compute Shaders

PHYSICALLY BASED HAIR SIMULATION

Page 42: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

TressFX PHYSICS TOMB RAIDER VIDEO

Page 43: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

TressFX PHYSICS DATA FLOW

[Diagram: TressFX data flow between CPU and GPU. CPU: Raw Vertex Data (CPU Memory). GPU: Hair Vertex Data UAV, processed by the Integration and Global Shapes CS, the Local Shapes Constraint CS, the Length Constraints and Wind CS, and the Collision and Tangents CS, then consumed by the Hair Render Vertex Shader.]

Vertex data is only loaded once during life of the program.

Vertex data stays on the GPU!

Page 44: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

TressFX PHYSICS

Interoperability with Direct3D

‒ UAVs can be shared with rendering so data can stay on the GPU

Massively Parallel

‒ Thousands of hair vertices that can be animated in parallel

‒ Each thread in the thread group represents a vertex (or strand)

Calculations done on data stored locally on chip

‒ Vertex data is loaded into Thread Group Shared Memory before calculations begin

BENEFITS OF USING DirectCompute TO SIMULATE HAIR

//------------------------------

// Copy data into shared memory

//------------------------------

if (localVertexIndex < numVerticesInTheStrand )

{

currentPos = sharedPos[indexForSharedMem] = g_HairVertexPositions[globalVertexIndex];

initialPos = g_InitialHairPositions[globalVertexIndex];

initialPos.xyz = mul(float4( initialPos.xyz, 1), g_ModelTransformForHead).xyz;

}

GroupMemoryBarrierWithGroupSync();

Page 45: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DirectCompute in Games

HDAO+

‒ New version of our High Definition Ambient Occlusion

‒ Uses TGSM for storing samples

Bokeh Depth of Field

‒ Used in the Frostbite 3 engine for rendering Bokeh shapes where “hotspots” are located.

Global Illumination

‒ Geomerics

Scene Management

Voxels ‒ Sparse Voxel Trees

‒ Octrees, Quadtrees

Particle Systems

OTHER USES

Page 46: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

Summary of DirectCompute in Games

Parallelism

‒ Some algorithms can be highly parallelized

Repetitive Fetches

‒ Caching values in TGSM can be a big win

Results used in the graphics pipeline

‒ DirectCompute can reduce transfers between GPU and CPU

Existing algorithms already optimized for DirectCompute

‒ Separable Filters

‒ Tiled Lighting

‒ Hair Physics

‒ HDAO+

‒ We have samples you can use!

http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/

WHEN TO USE DirectCompute

Page 47: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

QUESTIONS?

Ask now, or send questions to:

[email protected]

Page 48: GS-4108, Direct Compute in Gaming, by Bill Bilodeau

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.