© 2018 Arm Limited
• Hans-Kristian Arntzen
• 2018-08-16 – SIGGRAPH 2018
Rendering Structures
Analyzing modern rendering on mobile
Content
1. Motivation
2. Scene and lights
3. Rendering structures overview
4. Benchmark results
5. Post-AA benchmarking
Motivation
• Performance characteristics for mobile architectures differ from desktop
• Very little comparative data on rendering many lights on mobile
• Explore the most promising rendering structures for mobile
• Focus on Mali
• Midgard GPU family is very different to Bifrost
The scene
Sponza, duh! [3]
Classic deferred shading
Pros and Cons
• Pros
  • Arbitrary number of lights
  • Arbitrary different light types
  • Separate shadow maps per light
  • Small, decoupled shaders
  • Robust against geometry overdraw
• Cons
  • No (easy) MSAA
  • No (easy) transparency
  • False positives in shading (single-sided depth test)
Clustered shading
Forward and deferred clustered shading
• Forward
  • Look up cluster and shade all lights when rendering a mesh directly
• Deferred
  • Render a full-screen quad in lighting pass and shade all positional lights in one go based on reconstructed position
Pros and Cons
• Pros
  • Very flexible
    – Deferred, Forward, Transparency, MSAA, Volumetrics
  • Can be computed before knowing depth buffer
  • Can be expanded to support more than just lights
• Cons
  • All resources (e.g. shadow maps) need to be bound
  • Some fixed overhead to sample cluster
  • Very heavy shader
Forward Z pre-pass
The forward depth prepass
• Pixels shaded with forward clustering are expensive
• Want to avoid over-shading
• Perfect front-to-back sorting is impractical
• The prepass can potentially be done on-chip
• A prepass is sometimes unavoidable for certain effects
  • Screen-space AO techniques
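The over-shading argument can be illustrated with a toy model of a single pixel. This is a minimal sketch, not how any GPU counts work: it assumes a plain LESS depth test without a prepass, an EQUAL test after a depth-only prepass, and no hardware hidden-surface removal. The function name `shaded_fragments` is made up for the illustration.

```python
def shaded_fragments(depths, prepass=False):
    """Count fragment-shader invocations at one pixel.

    depths: fragment depths in draw order (smaller means closer).
    """
    if not depths:
        return 0
    if prepass:
        # A depth-only pass has already stored the nearest depth, so only
        # fragments matching it pass an EQUAL depth test in the main pass.
        nearest = min(depths)
        return sum(1 for d in depths if d == nearest)
    shaded = 0
    current = float("inf")  # cleared depth buffer
    for d in depths:
        if d < current:  # LESS depth test against what has been drawn so far
            shaded += 1
            current = d
    return shaded


back_to_front = [0.9, 0.7, 0.5, 0.3]
print(shaded_fragments(back_to_front))                # worst-case ordering
print(shaded_fragments(back_to_front, prepass=True))  # one heavy shade
```

With the worst-case back-to-front order every fragment runs the heavy shader, while the prepass variant shades only the visible one, at the cost of submitting the geometry twice.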
The forward depth prepass
• Pros
  • No over-shading
  • More flexible drawing order
• Cons
  • Double the geometry load
  • More bandwidth required
  • More CPU overhead
Comparison
How many pixels are touched?
Results
Using the Granite renderer [2]
The grand benchmark
• Create a massive benchmark sweep
• Compare the rendering structures head-to-head
• Turn knobs on and off, and see how it affects performance
Test hardware and renderer
• Mobile
• Desktop reference points
• Vulkan
• PBR
• Multipass techniques used for deferred
Midgard (T-880) greatly prefers deferred techniques
[Chart: normalized time against number of spot lights (0 to 27). Series: Classic deferred, Clustered deferred, Forward clustered, Forward clustered prepass, Clustered stencil culling.]
Clustered shading scales very well on Bifrost (G71 & G72)
[Chart: normalized time against number of spot lights (0 to 27). Series: Classic deferred, Clustered deferred, Forward clustered, Forward clustered prepass, Clustered stencil culling.]
Gap between deferred and forward is greater on desktop
[Chart: RX470 & GTX1060 average. Normalized time against number of spot lights (0 to 27). Series: Classic deferred, Clustered deferred, Forward clustered, Forward clustered prepass, Clustered stencil culling.]
Forward depth prepass might be a good idea
[Chart: normalized time (axis 0 to 1.2) with the forward depth prepass Off and On, grouped by Desktop, Midgard and Bifrost.]
MSAA is super expensive with microgeometry
[Chart: normalized time (axis 0 to 1.8) per technique: Forward clustered, Forward clustered prepass, Forward clustered (4x MSAA), Forward clustered prepass (4x MSAA), Classic deferred, Stencil culling, Clustered deferred. Series: Desktop, Midgard, Bifrost.]
Clustered shading actually improves bandwidth
[Chart: normalized DDR transactions, grouped by Read, Write and Total. Series: Classic deferred, Stencil culling, Clustered deferred, Forward Clustered, Forward Clustered Prepass, Forward Clustered (4x MSAA), Forward Clustered Prepass (4x MSAA).]
One tenth performance compared to mid-range desktop
[Chart: normalized time against desktop, per technique (axis 0 to 18), grouped by Desktop, Midgard and Bifrost. Series: Forward clustered, Forward clustered prepass, Forward clustered (4x MSAA), Forward clustered prepass (4x MSAA), Classic deferred, Stencil culling, Clustered deferred.]
Post-AA rundown
The usual suspects
• FXAA
  • 9-tap
• SMAA
  • Low, Medium, High, Ultra
• TAA
[Diagram: TAA building blocks and presets. Option labels: Bilinear, Bicubic, 5-tap cross, 3x3, Rounded corner, Clamp, AABB clip, Variance clipping, RGB, YCgCo, None, MAX3. Presets: Low, Medium, High, Ultra, Extreme, Nightmare.]
Results
[Chart: Post-AA rundown (normalized MP1). Y axis: cycles per pixel (single core GPU), 0 to 90. Series: Midgard, Bifrost.]
Links
[1] - http://efficientshading.com/wp-content/uploads/s2015_mobile.pptx
[2] - https://github.com/Themaister/Granite
[3] - https://github.com/KhronosGroup/glTF-Sample-Models/tree/master/2.0/Sponza
[4] - http://advances.realtimerendering.com/s2016/Siggraph2016_idTech6.pdf
[5] - https://www.khronos.org/assets/uploads/developers/library/2017-gdc/GDC_Vulkan-on-Mobile_Vulkan-Multipass-ARM_Mar17.pdf
Thank you
Bonus slides
Clustered stencil culling
Clustered stencil culling
• Basically, bucket the positional lights into N buckets
• Each bucket gets its own stencil bit
  • I used 7 bits for buckets and 1 for masking the background, YMMV
• Render back faces with a greater-than depth test at the end of the G-buffer pass
  • Each bucket sets its own stencil bit if the depth test passes
  • Instance all lights in a bucket
• In the lighting pass
  • Less-than depth test, plus a read-only stencil test that the bucket bit is set
  • Effectively a conservative double-sided test is achieved
  • Cuts out a lot of overdraw against the background
  • Lights which clip the near plane do not participate in the clustering and just use the back-face test as-is
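The two passes described above can be mimicked for a single pixel in a few lines. This is a sketch under stated assumptions rather than the renderer's code: each light volume is reduced to a `(z_front, z_back)` depth interval along the pixel's ray, the lighting pass is assumed to draw front faces, and the round-robin `bucket_of` binning is hypothetical.

```python
NUM_BUCKETS = 7  # stencil bits used for light buckets, as on the slide


def bucket_of(light_index):
    # Hypothetical binning: round-robin by light index.
    return light_index % NUM_BUCKETS


def mark_stencil(scene_depth, lights):
    """Back-face pass: set the bucket bit where a back face is behind geometry."""
    stencil = 0
    for i, (_z_front, z_back) in enumerate(lights):
        if z_back > scene_depth:  # GREATER depth test on back faces
            stencil |= 1 << bucket_of(i)
    return stencil


def lights_shaded(scene_depth, lights, stencil):
    """Lighting pass: front-face LESS depth test plus read-only stencil test."""
    shaded = []
    for i, (z_front, _z_back) in enumerate(lights):
        if z_front < scene_depth and stencil & (1 << bucket_of(i)):
            shaded.append(i)
    return shaded


# Light 0 floats entirely in front of the surface, light 1 encloses it,
# light 2 lies entirely behind it.
lights = [(0.2, 0.4), (0.3, 0.7), (0.6, 0.9)]
stencil = mark_stencil(0.5, lights)
print(lights_shaded(0.5, lights, stencil))
```

A front-face test alone would also shade light 0, which is the false positive the stencil bit removes. The test stays conservative because lights sharing a bucket share a bit: an eighth light in bucket 0 that encloses the surface would let light 0 through again.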
Compared to classic deferred
• Pros
  • Can reduce false positives in shading
• Cons
  • Need to free up some bits in the stencil buffer
  • Some extra early-ZS fill-rate required
  • Some work required to bin lights to stencil bits
Clustered shading
• Presented at SIGGRAPH 2015 [1]
  • Unfortunately, no real performance data
• Extremely flexible
  • Does not need to know depth buffer up-front
  • Supports marching-like techniques for volumetrics
  • Can be computed on CPU or GPU (I use GPU)
• Fully supports
  • Forward
  • Deferred
  • MSAA
  • Transparency
• To shade
  • Look up (offset, count) or equivalent from a 3D texture based on rasterized position
  • Iterate and shade lights; lights can be stored in a large buffer
  • Needs atlasing techniques (or bindless) for shadow maps
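The shading loop in the last bullet group can be sketched on the CPU. Everything named here is illustrative: a dictionary stands in for the 3D texture, `GRID` is a made-up resolution, and summing an `intensity` field stands in for evaluating a real lighting model.

```python
GRID = (4, 4, 4)  # hypothetical cluster resolution


def cell_index(pos, grid_min, grid_max):
    """Map a position to integer cell coordinates, clamped to the grid."""
    idx = []
    for p, lo, hi, n in zip(pos, grid_min, grid_max, GRID):
        t = (p - lo) / (hi - lo)
        idx.append(min(n - 1, max(0, int(t * n))))
    return tuple(idx)


def shade(pos, cells, light_indices, lights, grid_min, grid_max):
    """Fetch (offset, count) for the cell, then loop over that slice of lights."""
    cell = cell_index(pos, grid_min, grid_max)
    offset, count = cells.get(cell, (0, 0))  # empty cell: nothing to shade
    total = 0.0
    for k in range(offset, offset + count):
        light = lights[light_indices[k]]
        total += light["intensity"]  # stand-in for the real per-light shading
    return total


lights = [{"intensity": 1.0}, {"intensity": 10.0}, {"intensity": 100.0}]
light_indices = [0, 2, 1]  # flat buffer shared by all cells
cells = {(0, 0, 0): (0, 2), (3, 3, 3): (2, 1)}
print(shade((0.5, 0.5, 0.5), cells, light_indices, lights, (0, 0, 0), (4, 4, 4)))
```

The same lookup works for forward and deferred shading alike, since it only needs a position: the rasterized one in forward, the reconstructed one in deferred.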
Implementation
• Async compute shader computes the lights for each cell in the cluster
  • Low-resolution pre-pass in compute
  • Prunes lights in 4x4x4 blocks before testing at full resolution
• Accurate intersection tests are easy
  • Because the cells are perfect small cubes, treating each cube as a sphere is a good approximation
  • No need for conservative rasterization, which is popular for the common «froxel» layout
• Bitmasks limit the maximum number of lights
  • Can also use a classic «list» of lights for arbitrary amounts, but the list is more expensive to compute on the GPU
  • Shading performance seems the same (within margin of error), so going to leave it at that
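A rough CPU sketch of the cube-as-sphere test and the per-cell bitmask follows. The assumptions are mine, not the talk's: each light is a simple bounding sphere `(center, radius)` rather than a spot cone, the cell is replaced by its circumscribed sphere so the test errs on the conservative side, and the 32-bit mask width is a guess.

```python
import math

MAX_LIGHTS = 32  # assumed bitmask width


def cell_bitmask(cell_center, cell_size, lights):
    """Return a mask with bit i set when light i may touch the cubic cell."""
    # Radius of the sphere through the cube's corners (half the space diagonal).
    cell_radius = 0.5 * cell_size * math.sqrt(3.0)
    mask = 0
    for i, (center, radius) in enumerate(lights[:MAX_LIGHTS]):
        if math.dist(cell_center, center) <= cell_radius + radius:
            mask |= 1 << i
    return mask


def lights_in_mask(mask):
    """Iterate set bits from lowest to highest, as a shader loop might."""
    out = []
    while mask:
        low = mask & -mask  # isolate the lowest set bit
        out.append(low.bit_length() - 1)
        mask ^= low
    return out


lights = [((1.0, 0.0, 0.0), 0.5), ((10.0, 0.0, 0.0), 1.0), ((0.0, 3.0, 0.0), 1.5)]
mask = cell_bitmask((0.0, 0.0, 0.0), 2.0, lights)
print(bin(mask), lights_in_mask(mask))
```

A fixed-width mask keeps the per-cell data small and constant-size, which is where the cap on the number of lights comes from.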
Light sweep test
Test how the different rendering algorithms react to an increasing number of spot lights:
• Classic deferred (blend light volumes over frame buffer)
• Stencil culling (same as classic deferred, but tries to reduce false positives)
• Forward clustered
• Deferred clustered (all positional lights rendered as a single full-screen quad)
• Forward clustered prepass (on-tile, run prepass depth in same render pass)
Barebones rendering outside positional lights:
• LDR is used to avoid the constant overhead of HDR bloom
• Shadows for the directional light are turned off to avoid the overhead of rendering shadows
• No MSAA
Numbers on mobile are presented with normalized runtime based on:
• GPU cycle counts
• Peak clock speed
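One plausible reading of the normalization is sketched here: convert a cycle count to time at peak clock, then divide by a chosen baseline. The baseline step and the numbers in the usage lines are assumptions for illustration, not figures from the talk.

```python
def runtime_ms(gpu_cycles, peak_clock_hz):
    """Frame time in milliseconds if the GPU ran at peak clock throughout."""
    return gpu_cycles / peak_clock_hz * 1e3


def normalize(times, baseline):
    """Express every entry relative to the entry named by `baseline`."""
    base = times[baseline]
    return {name: t / base for name, t in times.items()}


# Made-up cycle counts at a made-up 850 MHz peak clock.
times = {
    "classic deferred": runtime_ms(8_500_000, 850e6),
    "forward clustered": runtime_ms(12_750_000, 850e6),
}
print(normalize(times, "classic deferred"))
```

Using cycle counts at peak clock rather than wall-clock time keeps the comparison independent of whatever frequency the device happened to run at during capture.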
Method sweep test
Tries to look at aggregate results to answer questions like:
• How expensive are shadows?
• How expensive is VSM vs PCF 1x1?
• Prepass vs no prepass?
• Does anything stick out compared to desktop?
Also have bandwidth numbers captured on the S9+.
[Chart: Directional light shadows bandwidth. DDR transactions grouped by Read, Write and Total. Series: Off, On.]
[Chart: Positional light shadows bandwidth. DDR transactions grouped by Read, Write and Total. Series: Off, On.]
[Chart: Forward prepass bandwidth. DDR transactions grouped by Read, Write and Total. Series: Off, On.]
[Chart: Shadow mapping filter bandwidth. DDR transactions grouped by Read, Write and Total. Series: PCF 1x1, VSM.]
[Chart: Stencil culling bandwidth. DDR transactions grouped by Read, Write and Total. Series: Off, On.]
[Chart: HDR vs LDR bandwidth. DDR transactions grouped by Read, Write and Total. Series: LDR, HDR + bloom.]
[Chart: Deferred methods bandwidth. DDR transactions grouped by Read, Write and Total. Series: Classic deferred, Clustered deferred.]
Observations
Bifrost deals way better with forward shading than Midgard
• Far more registers available, and scalar instead of vector
Clustered shading is great on Bifrost
• Deferred with multipass or forward, either is good
Forward prepass is surprisingly good
• Gets better with more complex content
• Bandwidth hit isn’t as extreme as I expected
• Expect there to be a cutoff point
MSAA is expensive with denser geometry
• We avoid all the bandwidth hit, but still need to shade a lot more partial quads
~10x gap to mid-range desktop
• At least with this renderer ☺
The usual suspects
• FXAA
  • The 9-tap version
  • It’s light on arithmetic, so it’s basically a 9-tap filtering benchmark ☺
• SMAA
  • Low
  • Medium
  • High
  • Ultra
• TAA
  • There is no de-facto reference TAA implementation
  • I implemented various well-known refinements as building blocks
  • Made some «presets»
TAA variants
• Low
  • RGB input used for clamping
  • No HDR luminance adjustment
  • Neighbor rejection method based on clamping to 5-tap cross
  • Nearest depth / max velocity found from 5-tap cross
• Medium
  • Turns on HDR luminance adjustment so we blend in tonemapped space (reduces flicker)
• High
  • Uses AABB clipping for neighbors (less color squashing in the AABB corner)
  • Rounded corner method to find neighbor color AABB
• Ultra
  • Converts RGB to YCgCo for neighbor clipping purposes (retains hue better)
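Two of these building blocks are small enough to sketch directly: the standard RGB to YCgCo transform used by the Ultra preset, and the per-channel min/max clamp that the Low preset applies over a neighborhood. This is an illustration with made-up function names, not the shader from the talk, and it shows clamping only; AABB clipping pulls the color toward the box center instead of clamping each channel independently.

```python
def rgb_to_ycgco(r, g, b):
    """Standard YCgCo forward transform."""
    y = 0.25 * r + 0.5 * g + 0.25 * b
    cg = -0.25 * r + 0.5 * g - 0.25 * b
    co = 0.5 * r - 0.5 * b
    return (y, cg, co)


def ycgco_to_rgb(y, cg, co):
    """Exact inverse of rgb_to_ycgco."""
    t = y - cg
    return (t + co, y + cg, t - co)


def clamp_history(history, neighbors):
    """Clamp the history color to the per-channel min/max of the neighborhood."""
    lo = [min(c[i] for c in neighbors) for i in range(3)]
    hi = [max(c[i] for c in neighbors) for i in range(3)]
    return tuple(min(h, max(l, x)) for x, l, h in zip(history, lo, hi))


cross = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]  # stand-in for the 5-tap cross
print(clamp_history((2.0, 0.5, -1.0), cross))
print(ycgco_to_rgb(*rgb_to_ycgco(0.25, 0.5, 0.75)))
```

Running the clamp on YCgCo values instead of RGB means the box is aligned with luma and chroma axes, which is one way to read the slide's remark about retaining hue better.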
More intense variants
• Extreme
  • Neighbor clamping method changed to rounded corner with variance clipping
  • Nearest depth / max velocity method bumped to a 3x3 grid
• Nightmare
  • When sampling the history buffer, use a 9-tap bicubic filter
    – Trades a lot of arithmetic to avoid full 16-tap bicubic
    – Massive blur reduction in motion
    – This final shader is about 27 texel fetches per pixel
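Variance clipping, as it is commonly described, builds the color box from the neighborhood's mean and standard deviation and then clips the history color toward the box center. The sketch below follows that common formulation; the `gamma` scale and its default of 1.0 are assumptions, and the rounded-corner neighborhood weighting from the slide is not modeled.

```python
import math


def variance_box(neighbors, gamma=1.0):
    """Per-channel box of mean +/- gamma * standard deviation."""
    n = len(neighbors)
    lo, hi = [], []
    for i in range(3):
        m1 = sum(c[i] for c in neighbors) / n
        m2 = sum(c[i] * c[i] for c in neighbors) / n
        sigma = math.sqrt(max(0.0, m2 - m1 * m1))
        lo.append(m1 - gamma * sigma)
        hi.append(m1 + gamma * sigma)
    return lo, hi


def clip_to_box(color, lo, hi):
    """Move the color along the line to the box center until it is inside."""
    center = [(l + h) * 0.5 for l, h in zip(lo, hi)]
    extent = [(h - l) * 0.5 for l, h in zip(lo, hi)]
    scale = 1.0
    for c, ctr, ext in zip(color, center, extent):
        dist = abs(c - ctr)
        if dist > ext:  # outside on this axis, so shrink the offset
            scale = min(scale, ext / dist)
    return tuple(ctr + (c - ctr) * scale for c, ctr in zip(color, center))


lo, hi = variance_box([(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)])
print(clip_to_box((5.0, 1.0, 1.0), lo, hi))
```

Because the box comes from statistics rather than the raw min/max, a single outlier neighbor widens it far less, which is the usual argument for the technique.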
Observations
• Mobile is 10-20x slower than mid-range desktop
  • The gap is larger than for regular rendering
• Post-AA is still hard to fit into a budget
  • At 720p, FXAA seems reasonable (~1 ms)
• Mali-G71 and G72 are neck and neck on some post-AA
  • Texture pipe throughput is theoretically the same per clock
  • µ-arch improvements show up nicely on SMAA
  • Otherwise, uplift seems to be eaten by texture throughput/bandwidth
• T-880 (and other Midgard GPUs) don’t scale with large, complex shaders
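To relate a cycles-per-pixel figure to a frame budget, a back-of-the-envelope conversion can help. Every input in the usage line is hypothetical (10 cycles per pixel, 12 cores, 768 MHz) and chosen only to make the arithmetic easy to follow; none of it is a measurement from the talk, and the formula assumes perfect scaling across cores.

```python
def pass_time_ms(width, height, cycles_per_pixel, cores, clock_hz):
    """Estimate a full-screen pass time from single-core cycles per pixel.

    Assumes the work divides evenly across all shader cores.
    """
    total_cycles = width * height * cycles_per_pixel
    return total_cycles / (cores * clock_hz) * 1e3


# 1280 x 720 = 921,600 pixels; all remaining inputs are hypothetical.
print(pass_time_ms(1280, 720, 10, 12, 768e6))
```

Since the cost scales linearly with pixel count, the same pass at 1080p (2.25x the pixels of 720p) would take 2.25x as long under this model.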