tiled shading: light culling – reaching the speed of light
![Page 1: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/1.jpg)
Tiled shading: light culling – reaching the speed of light
Dmitry Zhdan
Developer Technology Engineer, NVIDIA
![Page 2: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/2.jpg)
Agenda
● Über Goal
● Classic deferred vs tiled shading
● How to improve culling in tiled shading?
● New culling method overview
● Cool results!
![Page 3: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/3.jpg)
Über Goal
Improve overall lighting
performance in tiled shading
![Page 4: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/4.jpg)
Takeaway
You’ll know how to speed up light culling by 10x or more!
![Page 5: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/5.jpg)
Classic deferred: overview
● For each light:
● Render proxy geometry
to mark pixels inside the light volume
Pixels where light will be processed
![Page 6: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/6.jpg)
Classic deferred: overview
● For each light:
● Render proxy geometry
to mark pixels inside the light volume
● Shade only marked
pixels
● Blend to output
![Page 7: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/7.jpg)
Classic deferred: pros and cons
● Pros
● Precise per-pixel light culling
● A lot of work is done outside of the shader
● Cons
● Lighting is likely to become bandwidth limited
● Culling is ROP limited
![Page 8: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/8.jpg)
What do we want to avoid?
● Blending
● G-buffer data reloading
● Per light state switching
![Page 9: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/9.jpg)
Tiled shading: overview
● Divide screen into tiles
● For each tile:
● Find min-max z
![Page 10: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/10.jpg)
Tiled shading: overview
● Divide screen into tiles
● For each tile:
● Find min-max z
● Cull light sources against tile frustum
Tiles where light will be processed
![Page 11: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/11.jpg)
Tiled shading: overview
● Divide screen into tiles
● For each tile:
● Find min-max z
● Cull light sources against tile frustum
● Shade tile using given light list
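The "find min-max z" step can be sketched on the CPU as a plain reduction over each tile. This is only an illustration: the function name, the row-major depth list and the dictionary output are assumptions, and a real implementation would run this as a GPU reduction pass.

```python
def tile_min_max_depth(depth, width, height, tile_size=16):
    """Return {(tile_x, tile_y): (min_z, max_z)} for a row-major depth buffer."""
    tiles = {}
    for y in range(height):
        for x in range(width):
            z = depth[y * width + x]
            key = (x // tile_size, y // tile_size)
            lo, hi = tiles.get(key, (z, z))  # first pixel seeds both bounds
            tiles[key] = (min(lo, z), max(hi, z))
    return tiles
```

The culling step then only needs these two values per tile to bound the tile frustum in depth.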
![Page 12: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/12.jpg)
Tiled shading: pros and cons
● Pros
● Lighting phase takes all visible lights in one go
● Cons
● Less accurate culling with tile granularity
● Frustum-primitive tests are either too coarse
or too slow
![Page 13: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/13.jpg)
Why care about culling?
● Culling itself can be a costly operation
● Accurate culling speeds up lighting
Adding “false positives” can dramatically
reduce lighting performance!
![Page 14: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/14.jpg)
Culling challenges
● Minimize the number of “false positive”
lights obtained in culling phase
● Improve light culling performance in tiled shading rendering
![Page 15: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/15.jpg)
Sphere vs frustum planes: never ever!
● Most commonly used
test
● In fact, it is a frustum-box test
● Extremely inaccurate
with large spheres
![Page 16: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/16.jpg)
Sphere vs frustum planes: never ever!
● Most commonly used
test
● In fact, it is a frustum-box test
● Extremely inaccurate
with large spheres
False positive
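The false positive can be reproduced in a 2D slice of a tile frustum. The sketch below is an illustration under assumptions, not the talk's code: the frustum lives in the (x, z) plane with the camera at the origin, and all function names are hypothetical. It compares the common per-plane rejection test with an exact distance test against the frustum outline.

```python
import math

def frustum_planes_2d(half_angle, near, far):
    """Planes (nx, nz, d) of a 2D frustum; a point is inside when nx*x + nz*z + d >= 0."""
    c, s = math.cos(half_angle), math.sin(half_angle)
    return [(c, s, 0.0), (-c, s, 0.0), (0.0, 1.0, -near), (0.0, -1.0, far)]

def plane_test(center, radius, planes):
    """Common test: reject only if the sphere lies fully behind one plane."""
    x, z = center
    return all(nx * x + nz * z + d >= -radius for nx, nz, d in planes)

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    ex, ez = b[0] - a[0], b[1] - a[1]
    t = ((p[0] - a[0]) * ex + (p[1] - a[1]) * ez) / (ex * ex + ez * ez)
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * ex), p[1] - (a[1] + t * ez))

def exact_test(center, radius, half_angle, near, far):
    """Exact 2D test: center inside the frustum, or within radius of its outline."""
    x, z = center
    planes = frustum_planes_2d(half_angle, near, far)
    if all(nx * x + nz * z + d >= 0.0 for nx, nz, d in planes):
        return True
    t = math.tan(half_angle)
    corners = [(-near * t, near), (near * t, near), (far * t, far), (-far * t, far)]
    edges = zip(corners, corners[1:] + corners[:1])
    return min(point_segment_distance(center, a, b) for a, b in edges) <= radius
```

With a 0.8° frustum between z = 10 and z = 20, a sphere of radius 5 centered at (5, 6) passes `plane_test` although it is more than 6 units away from the nearest frustum corner, so `exact_test` rejects it.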
![Page 17: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/17.jpg)
Frustum planes: No
Reference: does an “is point inside volume” test for each pixel in a tile
![Page 18: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/18.jpg)
Rounded AABB isn’t an option either…
● Doesn’t suit spot lights!
[Figure labels: Frustum, Light, AABB, Rounded AABB. False positives?]
![Page 19: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/19.jpg)
Rounded AABB isn’t an option either…
● Doesn’t suit spot lights!
● Works badly for very long frustums
[Figure label: False positives]
![Page 20: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/20.jpg)
Rounded AABB isn’t an option either…
● Doesn’t suit spot lights!
● Works badly for very long frustums
● Problematic for wide FOV
[Figure label: False positives!]
![Page 21: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/21.jpg)
Can we get away from frustums?
● Average tile frustum angle is small: FOV = 100°, tile size = 16x16 pixels
Angle ≈ FOV • (tile_size / screen_width) = 100° • (16 / 1920) ≈ 0.8° (at 1080p, horizontal FOV)
This one is only 2.5°
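As a quick check of the estimate above, the sketch below treats the per-tile angle as a linear share of the FOV, which is the slide's own approximation. With these inputs the 0.8° figure follows when the 100° FOV is spread across the 1920-pixel width of a 1080p frame; spread across the 1080-pixel height it would be about 1.5°. The function name is illustrative.

```python
def tile_angle_deg(fov_deg, tile_size_px, screen_extent_px):
    """Approximate angle covered by one tile, as a linear share of the field of view."""
    return fov_deg * tile_size_px / screen_extent_px
```

Either way the tile frustum is narrow, which is what motivates replacing it with a few rays.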
![Page 22: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/22.jpg)
Can we get away from frustums?
● Frustum can be represented as a single
ray at tile center
● Or 4 rays at tile corners
![Page 23: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/23.jpg)
How to improve culling accuracy?
● Replace frustum test with ray intersection test:
● Ray-sphere, ray-cone, …
![Page 24: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/24.jpg)
How to improve culling accuracy?
● Compare tile min-max z with min-max among all intersections
![Page 25: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/25.jpg)
How to improve culling accuracy?
● Compare tile min-max z with min-max among all intersections
● 4 rays work better
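A minimal sketch of the ray-based test for a point light, under stated assumptions: view space with the camera at the origin and +z pointing forward, rays given as direction vectors through the tile corners, and hypothetical function names. It intersects each ray with the light sphere, takes the min-max depth over all hits and compares that range with the tile's min-max z.

```python
import math

def ray_sphere_depth_range(direction, center, radius):
    """Intersect a ray from the origin with a sphere; return (z_near, z_far) or None."""
    dx, dy, dz = direction
    length = math.sqrt(dx * dx + dy * dy + dz * dz)
    dx, dy, dz = dx / length, dy / length, dz / length
    cx, cy, cz = center
    b = dx * cx + dy * cy + dz * cz                      # projection of center on the ray
    c = cx * cx + cy * cy + cz * cz - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None                                      # ray misses the sphere
    root = math.sqrt(disc)
    t_far = b + root
    if t_far < 0.0:
        return None                                      # sphere is behind the camera
    t_near = max(b - root, 0.0)                          # camera inside the sphere
    return (t_near * dz, t_far * dz)                     # convert ray distance to depth

def light_touches_tile(corner_rays, center, radius, tile_min_z, tile_max_z):
    """Compare the tile depth range with the depth range of all ray hits."""
    hits = [ray_sphere_depth_range(r, center, radius) for r in corner_rays]
    hits = [h for h in hits if h is not None]
    if not hits:
        return False
    light_min_z = min(h[0] for h in hits)
    light_max_z = max(h[1] for h in hits)
    return light_min_z <= tile_max_z and light_max_z >= tile_min_z
```

Note that this sketch alone would miss a light small enough to fit entirely between the rays, so it needs a coarse step in front of it that decides which tiles a light can cover at all.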
![Page 26: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/26.jpg)
Ray-primitive: Yes
Reference
![Page 27: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/27.jpg)
But culling on compute sucks
● It is a straightforward enumeration
total operations = X • Y • N
X – tile grid width
Y – tile grid height
N – number of lights
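To put numbers on the enumeration cost, here is a small sketch that evaluates X • Y • N for a given resolution. The function name is illustrative, and rounding the tile grid up to whole tiles is an assumption.

```python
import math

def brute_force_tests(width, height, tile_size, num_lights):
    """Number of tile-light tests when every tile is tested against every light."""
    tiles_x = math.ceil(width / tile_size)   # X, tile grid width
    tiles_y = math.ceil(height / tile_size)  # Y, tile grid height
    return tiles_x * tiles_y * num_lights
```

With 16x16 tiles and 10000 lights, a 1920x1080 frame has a 120x68 tile grid, which gives 81.6 million tests per frame; at 3840x2160 the count is four times higher.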
![Page 28: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/28.jpg)
How to improve culling performance?
● Reduce the order of enumeration
● Subdivide screen into 4-8 sub-screens
● Coarsely cull lights against sub-screen
frustums
● Select corresponding sub-screen during culling phase
● Up to 2x boost with small lights, but we
want more!
![Page 29: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/29.jpg)
How to improve culling performance?
● We are limited by the compute power
● Let’s try to offload some work from
shader to special HW units!
![Page 30: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/30.jpg)
How to improve culling performance?
● Let’s switch from compute to graphics pipeline! Like in the good old times!
![Page 31: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/31.jpg)
Take the best from classic and tiled!
● Migrate from compute idiom:
● “one tile - many lights”
![Page 32: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/32.jpg)
Take the best from classic and tiled!
● To classic deferred idiom:
● “one light - many pixels” (1 pixel = 1 tile)
![Page 33: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/33.jpg)
Light culling using graphics
● Use rasterizer to generate light fragments
● Empty tiles will be natively skipped
● Use depth test to account for occlusion
● Useless work for occluded tiles will be skipped
● Use primitive-ray intersection in PS for
fine culling and light list updating
![Page 34: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/34.jpg)
The Idea: overview
● Culling phase tile → 1 pixel
● Light volume → proxy geometry
● Coarse XY-culling → rasterization
● Coarse Z-culling → depth test
● Precise culling → pixel shader
![Page 35: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/35.jpg)
How to integrate?
● Don’t use über shaders
● Always break tiled shading into 3 phases:
● Reduction
● Culling → new method
● Lighting
![Page 36: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/36.jpg)
New Culling: Bird’s-eye view
● Camera frustum culling
● Depth buffers creation
● Rasterization & classification
![Page 37: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/37.jpg)
Step 1: Camera frustum culling
● Cull lights against
camera frustum
![Page 39: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/39.jpg)
Step 1: Camera frustum culling
● Cull lights against
camera frustum
● Split visible lights into “outer” and
“inner”
![Page 40: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/40.jpg)
Step 2: Depth buffers creation
● For each tile:
● Find and copy max depth for “outer” lights
● Find and copy min depth for “inner” lights
● Depth test is a key to high performance!
● Use [earlydepthstencil] in shader
![Page 41: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/41.jpg)
Step 3: Rasterization & Classification
● Render light geometry with depth test
● “outer” – max depth buffer
● Front faces with direct depth test
● “inner” - min depth buffer
● Back faces with inverted depth test
● Use PS for precise culling and per-tile
light list creation
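The coarse depth test in this step can be emulated on the CPU as follows. This is a sketch under assumptions the slides do not spell out: depth grows with distance from the camera, the "direct" test is less-or-equal and the "inverted" test is greater-or-equal, and the rasterizer is replaced by a ready-made list of fragments. All names are hypothetical, and the precise per-tile culling done in the pixel shader is left out.

```python
def passes_coarse_depth_test(kind, face_depth, tile_min_depth, tile_max_depth):
    """Emulate the depth test applied to one light fragment (1 fragment = 1 tile)."""
    if kind == "outer":
        # Front face against the max depth buffer, direct test.
        return face_depth <= tile_max_depth
    if kind == "inner":
        # Back face against the min depth buffer, inverted test.
        return face_depth >= tile_min_depth
    raise ValueError(f"unknown light kind: {kind}")

def build_tile_light_lists(fragments, tile_depths):
    """fragments: (light_id, kind, tile, face_depth) tuples a rasterizer would emit."""
    lists = {tile: [] for tile in tile_depths}
    for light_id, kind, tile, face_depth in fragments:
        tile_min, tile_max = tile_depths[tile]
        if passes_coarse_depth_test(kind, face_depth, tile_min, tile_max):
            lists[tile].append(light_id)  # the pixel shader would refine this further
    return lists
```

Tiles that receive no fragment for a light never do any work for it, which is where the approach differs from enumerating every tile-light pair.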
![Page 42: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/42.jpg)
Common light types
Point light (omni) Directional light (spot)
Light geometry can be replaced with proxy geometry
![Page 43: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/43.jpg)
Proxy geometry for point lights
● Geosphere (2 subdivisions,
octa-based)
● Close enough to sphere
● Low poly works well at low resolution
● Equilateral triangles can ease
rasterizer’s life
![Page 44: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/44.jpg)
Proxy geometry for spot lights
● Why so simple?
● Easy to parametrize
● From a searchlight
● To a hemisphere
● Plane part can be used to handle area lights
![Page 45: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/45.jpg)
Light culling via rasterization
● Advantages
● No work for tiles without lights and for
occluded lights
● Coarse culling is almost free!
● Incredible speed up with small lights
● Complex proxy models can be used!
● Mathematically it is a branch-and-bound procedure!
![Page 46: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/46.jpg)
Culling perf: long-ranged lights

| GPU | CS, ms | Raster, ms | Boost |
|---|---|---|---|
| GTX 970 - 19x12 | 0.55 | 0.15 | x4 |
| R9 390 - 19x12 | 0.60 | 0.25 | x3 |
| GTX 970 - 4K | 2.00 | 0.35 | x6 |
| R9 390 - 4K | 2.15 | 0.65 | x3 |

400 lights (200 omnis, 200 spots)
20 lights per tile on average
CS: ray-primitive based (same culling precision as using raster)
![Page 47: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/47.jpg)
Culling perf: medium-ranged lights

| GPU | CS, ms | Raster, ms | Boost |
|---|---|---|---|
| GTX 970 - 19x12 | 7.30 | 0.45 | x17 |
| R9 390 - 19x12 | 6.90 | 0.45 | x15 |
| GTX 970 - 4K | 25.35 | 1.10 | x23 |
| R9 390 - 4K | 23.75 | 1.30 | x18 |

10000 lights (5000 omnis, 5000 spots)
70 lights per tile on average
CS: ray-primitive based (same culling precision as using raster)
![Page 48: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/48.jpg)
Culling perf: fast CS vs Raster

| GPU | CS fast, ms | Raster, ms | Boost |
|---|---|---|---|
| GTX 970 - 19x12 | 1.60 | 0.45 | x3.5 |
| R9 390 - 19x12 | 1.30 | 0.45 | x3.0 |
| GTX 970 - 4K | 5.45 | 1.10 | x5.0 |
| R9 390 - 4K | 4.55 | 1.30 | x3.5 |

10000 lights (5000 omnis, 5000 spots)
70 lights per tile on average
CS fast: rounded AABB, sub-screens partitioning (less accurate culling)
![Page 49: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/49.jpg)
Lighting perf: accurate vs fast culling

| GPU | Fast, ms | Accurate, ms | Boost |
|---|---|---|---|
| GTX 970 - 19x12 | 6.50 | 4.85 | 25% |
| R9 390 - 19x12 | 3.55 | 2.75 | 22% |
| GTX 970 - 4K | 22.20 | 16.45 | 26% |
| R9 390 - 4K | 12.00 | 9.25 | 23% |

10000 lights (5000 omnis, 5000 spots)
70 lights per tile on average
Fast: CS with rounded AABB, sub-screens partitioning
Accurate: fine CS or raster
![Page 50: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/50.jpg)
Culling perf: HD vs 4K

| GPU | HD (ms) | 4K (ms) | 4K / HD |
|---|---|---|---|
| GTX 970 - CSopt | 1.45 | 5.45 | 3.8 |
| GTX 970 - Raster | 0.40 | 1.10 | 2.7 |
| R9 390 - CSopt | 1.15 | 4.55 | 4.0 |
| R9 390 - Raster | 0.40 | 1.30 | 3.2 |

Raster loses less performance at 4K than the optimized CS version
![Page 51: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/51.jpg)
Culling via rasterization: conclusion
3x to 20x faster than the equivalent CS version
Produces fewer “false positives” at a small cost
Has better resolution scaling
Raster allows us to use complex light volumes
![Page 52: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/52.jpg)
References
● Gareth Thomas, “Advancements in Tiled-Based Compute Rendering”, GDC 2015
● Johan Andersson, “Parallel Graphics in Frostbite – Current & Future”, SIGGRAPH 2009
● Jim Arvo, “A Simple Method for Box-Sphere Intersection Testing”, Graphics Gems, 1990
![Page 54: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/54.jpg)
But the devil is in the details…
BONUS SLIDES!
![Page 55: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/55.jpg)
Camera frustum culling
● Well suited to the CPU
● It is always better not only to compute the index list of visible lights but also to tightly pack the light data!
● Better cache locality
● Boosts culling and lighting phases
![Page 56: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/56.jpg)
Proxy geometry ideas
● We can integrate clip planes into proxy models to avoid light leaking
[Figure label: Walls]
![Page 57: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/57.jpg)
Proxy geometry ideas
● We can even use coarse shadow volumes to avoid lighting in shadows!
![Page 58: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/58.jpg)
Rasterization tips
● Conservative raster is not applicable here!
● Fragments on shared edges will be added twice, so the light will be added twice at some tiles
● Enlarge geometry in VS instead!
![Page 59: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/59.jpg)
Omni rasterization tips
● Reproject half tile size back to view space
● Use the z value closest to the camera for reprojection:
● z = light_view.z – light_range
● Add the result to the light range
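One way to read the omni tip is sketched below. Everything beyond the slide's `z = light_view.z – light_range` is an assumption: a symmetric perspective projection with a vertical FOV, positive view-space z in front of the camera, a clamp to a hypothetical near distance when the light reaches the camera, and the function and parameter names.

```python
import math

def enlarged_light_range(light_view_z, light_range, tile_size_px,
                         screen_height_px, fov_y_deg, near=0.1):
    """Grow the light range by half a tile, reprojected to view space."""
    # Depth of the point of the light volume closest to the camera.
    z = max(light_view_z - light_range, near)
    # Half a tile in NDC units (NDC spans 2 units over the screen height).
    half_tile_ndc = tile_size_px / screen_height_px
    # Size of that NDC extent in view space at depth z.
    half_tile_view = half_tile_ndc * z * math.tan(math.radians(fov_y_deg) / 2.0)
    return light_range + half_tile_view
```

For a light at z = 10 with range 2, 16-pixel tiles, a 1080-pixel-high frame and a 90° vertical FOV, the range grows from 2 to roughly 2.12.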
![Page 60: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/60.jpg)
Spot rasterization tips
● Reproject half tile size back to view space
● Use the z value closest to the camera for reprojection
● Enlarge geometry in all directions!
● This is why the plane part in the spot proxy is important
![Page 61: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/61.jpg)
Explicit Multi GPU Programming with DirectX 12
Juha Sjöholm Developer Technology Engineer
NVIDIA
![Page 62: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/62.jpg)
Agenda
● What is explicit Multi GPU
● API Introduction
● Engine Requirements
● Frame Pipelining – Case Study
![Page 63: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/63.jpg)
Problem With Implicit Multi GPU
Ideal situation:
● Driver does its magic
● Developer doesn’t have to care
● It just works
Reality:
● Driver needs lots of hints
● Clears, discards
● Vendor specific APIs
● Developer needs to understand what driver is trying to do
● It still doesn’t always fly
![Page 64: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/64.jpg)
What is Explicit Multi-GPU?
● Control cross GPU transfers
● No unintended implicit transfers
● Control what work is done on each GPU
● Not just Alternate Frame Rendering (AFR)
![Page 65: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/65.jpg)
DX12 Explicit Multi GPU
●No more driver magic
●There is no driver level support for AFR
●Now you can do it better yourself, and much more!
●No vendor specific APIs needed
![Page 66: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/66.jpg)
Adapters – Linked Node Adapter
[Diagram: a single ID3D12Device* exposing Node 0, Node 1 and Node 2, backed by GPU 0, GPU 1 and GPU 2]
![Page 67: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/67.jpg)
Adapters – Multiple Adapters
[Diagram: three ID3D12Device* objects, one each for GPU 0, GPU 1 and GPU 2, with a cross adapter resource heap (ID3D12Heap*)]
![Page 68: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/68.jpg)
Linked Node Adapter
● When the user has enabled use of multiple GPUs in the display driver, linked node mode is enabled
● IDXGIFactory1::EnumAdapters1() sees one adapter
● ID3D12Device::GetNodeCount() tells node count
● Nodes (GPUs) are referenced with affinity masks
● Node 0 = 0x1 (0000 0001)
● Node 1 = 0x2 (0000 0010)
● Nodes 0 and 1 = 0x3 (0000 0011)
GPU 0
GPU 1
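The affinity masks are plain bit fields with one bit per node, so node i maps to `1 << i`. A small sketch with hypothetical helper names:

```python
def node_mask(*node_indices):
    """Build an affinity mask with one bit set per node index."""
    mask = 0
    for index in node_indices:
        mask |= 1 << index
    return mask

def nodes_in_mask(mask):
    """List the node indices whose bits are set in the mask."""
    return [i for i in range(mask.bit_length()) if mask & (1 << i)]
```

For example, `node_mask(0, 1)` yields 0x3, and `nodes_in_mask(0x3)` yields nodes 0 and 1.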
![Page 69: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/69.jpg)
Linked Node Features
● Resource copies directly from discrete GPU to discrete GPU – not through system memory
● Special support for AFR
● IDXGISwapChain3::ResizeBuffers1() allows utilization of connections other than PCIe when presenting frames
● Good for multiple discrete GPUs!
[Diagram: GPU 0 and GPU 1 connected by PCI Express and a multi GPU link]
![Page 70: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/70.jpg)
Linked Node Load Balancing
●It’s safe to assume that nodes are balanced for the foreseeable future
●Life is easy
GPU 0 GPU 1
![Page 71: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/71.jpg)
Linked Node Load Balancing
●It’s safe to assume that nodes are balanced for the foreseeable future
●Life is easy
●Heterogeneous nodes may be available some day
GPU 0 GPU 1 ?
![Page 72: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/72.jpg)
Infrastructure For Explicit M-GPU
●Renderer has to be aware of multiple GPUs
●Expose multiple GPUs at right level
●Wrap command queues, resources, descriptors, gpu virtual addresses etc. for multiple GPUs
●This can actually be the part that requires most effort
●Once infrastructure exists, it’s easier to experiment
![Page 73: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/73.jpg)
Multi Node APIs
●With linked nodes, some things are very easy
●Some interfaces are omni node (no node mask)
●Starting with ID3D12Device
●Some interfaces are multi node
●Affinity mask can have more than one bit set
●Root signatures, pipeline states and command signatures can often simply be shared by all nodes
ID3D12PipelineState* NodeMask 0x3
ID3D12RootSignature* NodeMask 0x3
ID3D12CommandSignature* NodeMask 0x3
![Page 74: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/74.jpg)
Command Queues And Lists
●Each node has its own
ID3D12CommandQueue, i.e. “engine”
●ID3D12CommandLists are also exclusive to single node
●Command list pooling for each node is needed
ID3D12CommandQueue* NodeMask 0x1 D3D12_COMMAND_LIST_TYPE_DIRECT
![Page 75: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/75.jpg)
Command List Pooling
ID3D12CommandQueue* (NodeMask 0x1, D3D12_COMMAND_LIST_TYPE_DIRECT): pool of three ID3D12CommandList* (NodeMask 0x1, D3D12_COMMAND_LIST_TYPE_DIRECT)
ID3D12CommandQueue* (NodeMask 0x2, D3D12_COMMAND_LIST_TYPE_DIRECT): pool of three ID3D12CommandList* (NodeMask 0x2, D3D12_COMMAND_LIST_TYPE_DIRECT)
![Page 77: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/77.jpg)
Synchronization - Fences
●Different command queues need to be
synchronized when sharing resources
●ID3D12Fence is the synchronization tool
![Page 78: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/78.jpg)
Fences
●Application must avoid access conflicts
●Application must ensure that all engines see
shared resources in same state
ID3D12CommandQueue*: Write → Signal → Do something
ID3D12CommandQueue*: Wait → Read
Shared: ID3D12Resource*, ID3D12Fence*
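The write, signal, wait, read ordering can be illustrated with a CPU-side analogy. This is not D3D12 code: the `Fence` class below is a hypothetical stand-in built on Python threads, with two threads playing the role of the two command queues.

```python
import threading

class Fence:
    """Counter that one side signals and the other side waits on."""
    def __init__(self, initial=0):
        self._value = initial
        self._cond = threading.Condition()

    def signal(self, value):
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def wait(self, value, timeout=None):
        # Blocks until the fence has reached at least `value`.
        with self._cond:
            return self._cond.wait_for(lambda: self._value >= value, timeout)

def run_producer_consumer():
    fence = Fence()
    shared = {}
    seen = []

    def writer():
        shared["data"] = 42   # write the shared resource
        fence.signal(1)       # then signal
        # ...free to do something else here

    def reader():
        fence.wait(1)         # wait for the signal
        seen.append(shared["data"])  # then read

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

The reader can start first and still sees the written value, because the read is ordered after the signal rather than after thread start.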
![Page 79: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/79.jpg)
Copy Engine(s)
● ID3D12CommandQueue with D3D12_COMMAND_LIST_TYPE_COPY
● Cross GPU copies parallel to other processing
● Remember to double buffer the resources
GPU 1 Graphics: Frame 0 | Frame 1 | Frame 2 | Frame 3 | Frame 4 | Frame 5
GPU 1 Copy: Idle | F0 | Idle | F1 | Idle | F2 | Idle | F3 | Idle | F4
GPU 0 Graphics: (F-2) | (F-1) | F0 | F1 | F2 | F3 | F4
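Double buffering here means the graphics queue writes one copy of the resource while the copy engine still reads the previous frame's copy. A minimal index sketch, with a hypothetical function name and the assumption that the copy engine lags the graphics queue by exactly one frame:

```python
def buffers_in_flight(frame, buffer_count=2):
    """Return (index written by graphics, index read by the copy engine) for a frame."""
    write_index = frame % buffer_count
    read_index = (frame - 1) % buffer_count  # previous frame's output
    return write_index, read_index
```

With two buffers the two indices never coincide, so the parallel copy never reads a resource that is being rendered to.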
![Page 80: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/80.jpg)
Cross Node Sharing Tiers
● ID3D12Device has tiers for cross node sharing
● Tier 1 supports only cross node copy operations
● ID3D12GraphicsCommandList::CopyResource() etc
● Tier 2 supports cross node SRV/CBV/UAV access
● While SRV/CBV/UAV access may seem convenient, test whether using parallel copy engines would be more efficient
![Page 81: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/81.jpg)
Resources
●Resources and descriptors need the most attention
●Resources/heaps have two separate node masks
●CreationNodeMask is a single-node mask
●VisibleNodeMask is a multi-node mask
●A descriptor heap is exclusive to a single node
![Page 82: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/82.jpg)
Resources - Visibility
Node 0x1 memory: ID3D12Heap* (CreationNodeMask 0x1, VisibleNodeMask 0x1)
Node 0x2 memory: ID3D12Heap* (CreationNodeMask 0x2, VisibleNodeMask 0x2)
ID3D12DescriptorHeap* (NodeMask 0x1)
ID3D12DescriptorHeap* (NodeMask 0x2)
![Page 84: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/84.jpg)
Resources - Visibility
Node 0x1 memory: ID3D12Heap* (CreationNodeMask 0x1, VisibleNodeMask 0x1)
Node 0x2 memory: ID3D12Heap* (CreationNodeMask 0x2, VisibleNodeMask 0x2)
ID3D12DescriptorHeap* (NodeMask 0x1)
ID3D12DescriptorHeap* (NodeMask 0x2)
Node 0x1 memory, visible to both nodes: ID3D12Heap* (CreationNodeMask 0x1, VisibleNodeMask 0x3)
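The relationship between the two masks can be written as a small check. The helper names are hypothetical; the rules encoded are that the creation mask names exactly one node and that the visible mask includes that node. As a simplification, the sketch rejects a zero mask.

```python
def is_single_node(mask):
    """True when exactly one bit is set."""
    return mask != 0 and (mask & (mask - 1)) == 0

def validate_heap_masks(creation_mask, visible_mask):
    """Check a CreationNodeMask / VisibleNodeMask pair for consistency."""
    if not is_single_node(creation_mask):
        raise ValueError("CreationNodeMask must name exactly one node")
    if visible_mask & creation_mask == 0:
        raise ValueError("VisibleNodeMask must include the creation node")
    return True
```

The heap from the last slide (creation 0x1, visible 0x3) passes: it lives in node 0's memory and is visible to nodes 0 and 1.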
![Page 85: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/85.jpg)
Resources - Assets
●Upload art assets (vertex data, textures
etc.) to nodes that need them
●It’s often convenient to upload your assets to all nodes for easy experimentation
●AFR needs assets on all nodes
●Create a unique resource for each node,
not just one that would be visible to others
(with proper VisibleNodeMask)
![Page 86: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/86.jpg)
Resources - AFR Targets
●AFR requires all render targets be
duplicated for each node
●Need robust cycling mechanism
●Again, a unique resource for each node,
not one resource visible to all nodes
![Page 87: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/87.jpg)
AFR Isn’t For Everyone…
●Temporal techniques make AFR difficult
●Too many inter-frame dependencies can kill the
performance
●This applies to both explicit and implicit multi GPU
![Page 88: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/88.jpg)
![Page 89: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/89.jpg)
AFR Workflow Problem
Ideal:
GPU 1: Frame 0 | Frame 2 | Frame 4 | Frame 6 | Frame 8
GPU 0: Frame 1 | Frame 3 | Frame 5 | Frame 7 | Frame 9
Screen: (F-2) | (F-1) | F0 | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8
Dependencies between frames:
GPU 1 Graphics: Frame 0 | Idle | Frame 2 | Idle | Frame 4 | Idle | Frame 6
GPU 1 Copy: F0->F1 | Idle | F2->F3 | Idle | F4->F5 | Idle | F6->F7
GPU 0 Graphics: Idle | Frame 1 | Idle | Frame 3 | Idle | Frame 5 | Idle
GPU 0 Copy: Idle | F1->F2 | Idle | F3->F4 | Idle | F5->F6 | Idle
Screen: (F-1) | F0 | F1 | F2 | F3 | F4 | F5
![Page 90: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/90.jpg)
![Page 91: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/91.jpg)
New Possibility - Frame Pipelining
● Pipeline rendering of frames
● Begin frame on one GPU
● Transfer work to next GPU to finish rendering and present
● The GPUs and copy engines form a pipeline
GPU 1 Graphics: Frame 0 | Frame 1 | Frame 2 | Frame 3 | Frame 4 | Frame 5
Copy: Idle | F0 | Idle | F1 | Idle | F2 | Idle | F3 | Idle | F4
GPU 0 Graphics: (F-2) | (F-1) | F0 | F1 | F2 | F3 | F4
Screen: (F-2) | (F-1) | F0 | F1 | F2 | F3
![Page 92: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/92.jpg)
![Page 93: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/93.jpg)
Pipelining – Simple Dependencies
● No back-and-forth dependencies between GPUs
● Helps to minimize waits
● Easier to do large cross-GPU data transfers without reducing frame rate
● Unless copying takes longer than the actual work, it affects only latency, not frame rate
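The last point can be illustrated with a simple steady-state model. This is a sketch under assumptions, not a measurement: it assumes the copy runs on a copy engine fully overlapped with graphics work, that the three stages form an ideal pipeline, and the type and function names are hypothetical.

```cpp
#include <algorithm>

// Hypothetical result of the model, in milliseconds.
struct PipelineTiming {
    double frameIntervalMs;  // time between consecutive presented frames
    double latencyMs;        // time from starting a frame to finishing it
};

// In an ideal pipeline the slowest stage sets the frame interval,
// while every stage adds to the latency of a single frame.
constexpr PipelineTiming pipelineTiming(double firstGpuMs, double copyMs, double secondGpuMs) {
    return PipelineTiming{
        std::max({firstGpuMs, copyMs, secondGpuMs}),
        firstGpuMs + copyMs + secondGpuMs};
}
```

For example, with hypothetical stage times of 10 ms, 4 ms and 12 ms the model gives a 12 ms frame interval and 26 ms of latency. Raising the copy to 15 ms makes the copy the slowest stage, and only then does it lower the frame rate.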
![Page 94: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/94.jpg)
![Page 95: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/95.jpg)
Pipelining – Temporal techniques
●Temporal techniques allowed without penalties
●Limitation: GPUs at beginning of pipeline cannot use resources produced further down the pipeline
GPU 1 Graphics: Frame 0 | Frame 1 | Frame 2 | Frame 3 | Frame 4 | Frame 5
Copy: Idle | F0 | Idle | F1 | Idle | F2 | Idle | F3 | Idle | F4
GPU 0 Graphics: (F-2) | (F-1) | F0 | F1 | F2 | F3 | F4
Screen: (F-2) | (F-1) | F0 | F1 | F2 | F3
![Page 96: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/96.jpg)
Pipelining – Something More
● Instead of doing the same thing faster, do something more
● GI
● Ray tracing
● Physics
● Etc.
![Page 97: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/97.jpg)
Pipelining – Workload Distribution
● Needs a good point to split the frame
● Cross-GPU copies are slow regardless of parallel copy engines
● Less than 8 GB/s on x8 PCIe 3.0, so copying 64 MB takes at least 8 ms
● Doing some passes on both GPUs instead of transferring the results can be an option
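The copy-time figure above is straightforward arithmetic, and the same arithmetic gives a rough test for when repeating a pass on both GPUs beats transferring its result. This is a deliberately simplistic sketch: the function names are hypothetical, units are decimal (1 GB = 1000 MB), and it ignores the fact that a copy engine can overlap the transfer with other work.

```cpp
// Time in milliseconds to move `megabytes` over a link that sustains
// `gigabytesPerSecond`. With decimal units, 1 GB/s is 1 MB per millisecond.
constexpr double copyTimeMs(double megabytes, double gigabytesPerSecond) {
    return megabytes / gigabytesPerSecond;
}

// Rough heuristic: repeating a pass on the second GPU is attractive when
// the pass itself is shorter than the time needed to copy its result.
constexpr bool cheaperToRecompute(double passMs, double resultMegabytes,
                                  double gigabytesPerSecond) {
    return passMs < copyTimeMs(resultMegabytes, gigabytesPerSecond);
}
```

At the 8 GB/s upper bound, `copyTimeMs(64.0, 8.0)` evaluates to 8 ms, matching the figure on the slide. Since real throughput is below that bound, actual copies take longer.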
![Page 98: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/98.jpg)
Frame Pipelining Workflow
Ideal:
GPU 1 Graphics: Frame 0 | Frame 1 | Frame 2 | Frame 3 | Frame 4 | Frame 5
Copy: Idle | F0 | Idle | F1 | Idle | F2 | Idle | F3 | Idle | F4
GPU 0 Graphics: (F-2) | (F-1) | F0 | F1 | F2 | F3 | F4
Screen: (F-2) | (F-1) | F0 | F1 | F2 | F3
Unbalanced work:
GPU 1 Graphics: Frame 0 | Frame 1 | Frame 2 | Frame 3 | Frame 4 | Frame 5
Copy: (F-1) | Idle | F0 | Idle | F1 | Idle | F2 | Idle | F3 | Idle | F4 | Idle
GPU 0 Graphics: Idle | (F-1) | Idle | F0 | Idle | F1 | Idle | F2 | Idle | F3 | Idle | F4
Screen: (F-2) | (F-1) | F0 | F1 | F2 | F3
![Page 99: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/99.jpg)
![Page 100: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/100.jpg)
Pipelining – Possible Problems
● Workload balance between GPUs also depends on scene content
● It’s never perfect, but can be reasonable
● Latency can be a problem, as in AFR
● Scaling to 3 or 4 GPUs requires separate solutions
![Page 101: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/101.jpg)
Frame Pipelining Case Study
● Microsoft DX12 miniengine
● Pre-depth
● SSAO
● Sun shadow map
● Primary pass
● Particles
● Motion blur
● Bloom
● FXAA
![Page 102: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/102.jpg)
Frame Pipelining Case Study
● As a stress test, 3840x2160 screen and 4k by 4k sun shadow map resolutions were used
● Generated on first GPU:
| Resource | Format | Size | Copy time |
|---|---|---|---|
| Predepth | D32_FLOAT | 31.6 MB | 5.3 ms |
| Linear Depth | R16_FLOAT | 15.8 MB | 2.6 ms |
| SSAO | R8_UNORM | 7.9 MB | 1.3 ms |
| Sun Shadow Map | D16_UNORM | 32 MB | 5.3 ms |
| Total | | 87.3 MB | 14.6 ms |
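The sizes in this table can be reproduced, to within rounding, from the resolutions and formats. The sketch below assumes tightly packed surfaces with no padding or alignment, reads "MB" as 2^20 bytes and "4k by 4k" as 4096x4096, and uses hypothetical function names.

```cpp
#include <cstdint>

// Size of a tightly packed 2D surface. Real allocations may be larger
// because of row pitch and alignment, which this sketch ignores.
constexpr std::uint64_t surfaceBytes(std::uint64_t width, std::uint64_t height,
                                     std::uint64_t bytesPerTexel) {
    return width * height * bytesPerTexel;
}

// Converts a byte count to units of 2^20 bytes.
constexpr double toMiB(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Sum of the four surfaces listed for the first GPU in the case study.
constexpr std::uint64_t caseStudyBytes() {
    return surfaceBytes(3840, 2160, 4)   // Predepth, D32_FLOAT
         + surfaceBytes(3840, 2160, 2)   // Linear Depth, R16_FLOAT
         + surfaceBytes(3840, 2160, 1)   // SSAO, R8_UNORM
         + surfaceBytes(4096, 4096, 2);  // Sun Shadow Map, D16_UNORM
}
```

Under these assumptions the predepth buffer comes to about 31.6 MB and the shadow map to exactly 32 MB. Dividing each size by its listed copy time gives roughly 6 GB/s, which appears to be the transfer rate behind the timing column.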
![Page 103: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/103.jpg)
Frame Pipelining Case Study - Performance
FPS:
● Single GPU: 22
● Two GPUs: 31
● Two GPUs using Copy Engine: 37
![Page 104: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/104.jpg)
Pipelining Case Study - GPUView
Original single GPU workflow
Two GPUs pipelined without copy engine
Node 0x1
Node 0x2
Node 0x1
![Page 105: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/105.jpg)
Pipelining Case Study - GPUView
Two GPUs pipelined with copy engine
Node 0x1
Node 0x2
Node 0x2
![Page 106: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/106.jpg)
Frame Pipelining Case Study
● 1.7x frame rate from single to dual GPU
● Pretty even workload distribution, but it’s content dependent
● The cost of the copy step would limit the frame rate to about 60 fps on an x8 PCIe 3.0 system
![Page 107: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/107.jpg)
Pipelining – Hiding Copy Latency
● Break up copy work into smaller chunks
● Overlap with other work for the same frame
● More and smaller command lists
● Remember the guidelines from “Practical DirectX 12”
● In the case study, the ~15 ms extra latency from copies can be almost entirely hidden
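Breaking a copy into chunks can be sketched as simple range arithmetic. This is an illustration under assumptions: it treats the data as one linear byte range, whereas real texture copies would more likely be split per resource or per region, and the `CopyChunk` type and both functions are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical description of one piece of a larger copy.
struct CopyChunk {
    std::uint64_t offset;  // start of the chunk within the source range
    std::uint64_t size;    // number of bytes in the chunk
};

// Number of chunks needed to cover totalBytes; zero if chunkBytes is zero.
constexpr std::uint64_t chunkCount(std::uint64_t totalBytes, std::uint64_t chunkBytes) {
    return chunkBytes == 0
               ? 0
               : totalBytes / chunkBytes + (totalBytes % chunkBytes != 0 ? 1 : 0);
}

// The index-th chunk. The last chunk may be shorter than chunkBytes, and an
// out-of-range index yields an empty chunk at the end of the range.
constexpr CopyChunk chunkAt(std::uint64_t totalBytes, std::uint64_t chunkBytes,
                            std::uint64_t index) {
    const std::uint64_t offset = index * chunkBytes;
    if (offset >= totalBytes) {
        return CopyChunk{totalBytes, 0};
    }
    const std::uint64_t remaining = totalBytes - offset;
    return CopyChunk{offset, remaining < chunkBytes ? remaining : chunkBytes};
}
```

Each chunk could then be recorded in its own small command list and submitted as soon as the data it covers is ready, so that the copy overlaps with the remaining work for the same frame.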
![Page 108: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/108.jpg)
Hiding Copy Latency - GPUView
Predepth Sun Shadow
SSAO
Primary pass Particles, Motion Blur etc.
Predepth Sun Shadow Linear Depth
SSAO
One frame
Node 0x1
Node 0x2
Node 0x2
![Page 109: Tiled shading: light culling – reaching the speed of light](https://reader036.vdocuments.net/reader036/viewer/2022082216/58a2bcc61a28abac368b499b/html5/thumbnails/109.jpg)
Summary
● No more driver magic
● You’re in control of AFR
● Try pipelining with temporal techniques!
● Remember copy engines!
● You can do anything you want with that extra GPU - Surprise us!