Fast Parallel Construction of High-Quality Bounding Volume Hierarchies
Tero Karras, Timo Aila
2
Ray tracing comes in many flavors
Interactive apps: 1M–100M rays/frame
Architecture & design: 100M–10G rays/frame
Movie production: 10G–1T rays/frame
© Activision 2009, Game trailer by Blur Studio
Courtesy of Delta Tracing Lucasfilm Ltd.™, Digital work by ILM
Courtesy of Columbia Pictures
NVIDIA
Courtesy of Dassault Systemes
3
Effective performance
$$\text{effective ray tracing performance} = \frac{\text{number of rays}}{\text{rendering time}}$$
4
Effective performance
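$$\text{rendering time} = \text{time to build BVH} + \frac{\text{number of rays}}{\text{ray throughput}}$$
$$\text{effective ray tracing performance} = \frac{\text{number of rays}}{\text{rendering time}}$$
Annotations: “speed” (time to build BVH), “quality” (ray throughput)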
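A minimal sketch of the two formulas on this slide. The function names and the two example builders (their build times and throughputs) are made-up illustrations, not measurements from the talk.

```python
def rendering_time(num_rays, build_time_s, ray_throughput):
    """Seconds per frame: BVH build plus the time spent tracing rays."""
    return build_time_s + num_rays / ray_throughput


def effective_performance(num_rays, build_time_s, ray_throughput):
    """Rays per second once the build time is charged to the frame."""
    return num_rays / rendering_time(num_rays, build_time_s, ray_throughput)


# Hypothetical builders: a quick build with a slower tree,
# and a slow build with a faster tree.
fast_build = dict(build_time_s=0.01, ray_throughput=200e6)
slow_build = dict(build_time_s=1.0, ray_throughput=400e6)

for rays in (1e6, 1e9, 1e12):
    a = effective_performance(rays, **fast_build)
    b = effective_performance(rays, **slow_build)
    print(f"{rays:.0e} rays: fast build {a / 1e6:.1f} Mrays/s, "
          f"slow build {b / 1e6:.1f} Mrays/s")
```

With few rays the build time dominates and the quick builder wins; with many rays the throughput term dominates and the higher-quality tree wins.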
5
Effective performance
[Plot: effective performance (Mrays/s, 0–450) vs. number of rays per frame (1M–1T)]
Interactive apps: speed matters
Architecture & design: both matter
Movie production: quality matters
6
Effective performance
[Plot: effective performance (Mrays/s, 0–450) vs. number of rays per frame (1M–1T)]
SODA (2.2M tris), NVIDIA GTX Titan, diffuse rays
7
Effective performance
[Plot: effective performance (Mrays/s, 0–450) vs. number of rays per frame (1M–1T)]
Curve: SBVH [Stich et al. 2009] (CPU, 4 cores)
Build time dominates
8
Effective performance
[Plot: effective performance (Mrays/s, 0–450) vs. number of rays per frame (1M–1T)]
Curves: HLBVH [Garanzha et al. 2011] (GPU), SBVH [Stich et al. 2009] (CPU, 4 cores)
???
9
Effective performance
[Plot: effective performance (Mrays/s, 0–450) vs. number of rays per frame (1M–1T)]
Curves: Our method (GPU), HLBVH [Garanzha et al. 2011] (GPU), SBVH [Stich et al. 2009] (CPU, 4 cores)
10
Effective performance
[Plot: effective performance (Mrays/s, 0–450) vs. number of rays per frame (1M–1T)]
30M–500G rays/frame
97% of SBVH
Best quality–speed tradeoff for a wide range of applications
11
Treelet restructuring
Idea
  Build a low-quality BVH
  Optimize its node topology
  Look at multiple nodes at once
12
Treelet restructuring
Idea
  Build a low-quality BVH
  Optimize its node topology
  Look at multiple nodes at once
Treelet
  Subset of a node’s descendants
13
Treelet restructuring
Idea
  Build a low-quality BVH
  Optimize its node topology
  Look at multiple nodes at once
Treelet
  Subset of a node’s descendants
[Figure labels: R (treelet root), treelet leaf]
14
Treelet restructuring
Idea
  Build a low-quality BVH
  Optimize its node topology
  Look at multiple nodes at once
Treelet
  Subset of a node’s descendants
  Grow by turning leaves into internal nodes
[Figure labels: R (treelet root), treelet leaf, grow]
15
Treelet restructuring
Idea
  Build a low-quality BVH
  Optimize its node topology
  Look at multiple nodes at once
Treelet
  Subset of a node’s descendants
  Grow by turning leaves into internal nodes
[Figure labels: R, treelet internal node, grow]
16
Treelet restructuring
Idea
  Build a low-quality BVH
  Optimize its node topology
  Look at multiple nodes at once
Treelet
  Subset of a node’s descendants
  Grow by turning leaves into internal nodes
[Figure: treelet rooted at R]
19
Treelet restructuring
Idea
  Build a low-quality BVH
  Optimize its node topology
  Look at multiple nodes at once
Treelet
  Subset of a node’s descendants
  Grow by turning leaves into internal nodes
  Largest leaves → best results
[Figure: treelet rooted at R]
20
Treelet restructuring
Idea
  Build a low-quality BVH
  Optimize its node topology
  Look at multiple nodes at once
Treelet
  Subset of a node’s descendants
  Grow by turning leaves into internal nodes
  Largest leaves → best results
  Valid binary tree in itself
[Figure: treelet with nodes R and A–G; labels: treelet leaves, treelet internal nodes]
21
Treelet restructuring
Idea
  Build a low-quality BVH
  Optimize its node topology
  Look at multiple nodes at once
Treelet
  Subset of a node’s descendants
  Grow by turning leaves into internal nodes
  Largest leaves → best results
  Valid binary tree in itself
  Leaves can represent arbitrary subtrees of the BVH
[Figure: treelet with nodes R and A–G; labels: actual BVH leaf, arbitrary subtree]
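The growing step on these slides can be sketched as follows. This is an illustration under assumptions: `Node` and `form_treelet` are hypothetical names, and "largest" is taken to mean largest AABB surface area.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)                  # compare nodes by identity
class Node:
    name: str
    area: float                       # surface area of the node's AABB
    left: Optional["Node"] = None     # both children None: actual BVH leaf
    right: Optional["Node"] = None

    @property
    def is_bvh_leaf(self):
        return self.left is None


def form_treelet(root, num_leaves):
    """Grow a treelet below `root` until it has `num_leaves` leaves."""
    internal = [root]
    leaves = [root.left, root.right]
    while len(leaves) < num_leaves:
        # Only treelet leaves that are internal BVH nodes can be expanded.
        candidates = [n for n in leaves if not n.is_bvh_leaf]
        if not candidates:
            break                     # subtree too small for a full treelet
        largest = max(candidates, key=lambda n: n.area)
        leaves.remove(largest)        # turn the leaf into an internal node
        internal.append(largest)
        leaves += [largest.left, largest.right]
    return internal, leaves


# Small example tree: R -> (C -> (A, B), F -> (D, E))
a, b, d, e = (Node(n, s) for n, s in
              [("A", 4.0), ("B", 3.0), ("D", 1.0), ("E", 2.0)])
c = Node("C", 6.0, a, b)
f = Node("F", 2.5, d, e)
r = Node("R", 10.0, c, f)

internal, leaves = form_treelet(r, 3)
print([n.name for n in internal], [n.name for n in leaves])
```

The remaining treelet leaves (here F, A and B) may be actual BVH leaves or roots of arbitrary subtrees; the sketch never looks inside them.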
22
Treelet restructuring
Restructuring
  Construct optimal binary tree for the same set of leaves
  Replace old treelet
[Figure: treelet with nodes R and A–G]
23
Treelet restructuring
Restructuring
  Construct optimal binary tree for the same set of leaves
  Replace old treelet
Reuse the same nodes
  Update connectivity and AABBs
  New AABBs should be smaller
[Figure: treelet with nodes R and A–G]
26
Treelet restructuring
Restructuring
  Construct optimal binary tree for the same set of leaves
  Replace old treelet
Reuse the same nodes
  Update connectivity and AABBs
  New AABBs should be smaller
Perfectly localized operation
  Leaves and their subtrees are kept intact
  No need to look at subtree contents
[Figure: restructured treelet with nodes R and A–G]
27
Processing stages
Initial BVH construction
Post-processing
Optimization
Input triangles
One triangle per leaf
28
Processing stages
Initial BVH construction
Post-processing
Optimization
Parallel LBVH [Karras 2012]
60-bit Morton codes for accurate spatial partitioning
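A 60-bit Morton code can be formed by quantizing each coordinate to 20 bits and interleaving the bits. The sketch below assumes points normalized to the unit cube and an x, y, z bit order with x most significant; both are illustrative choices, and the function names are hypothetical.

```python
def expand_bits(v, bits=20):
    """Spread the low `bits` bits of v so each is followed by two zero bits."""
    out = 0
    for i in range(bits):
        out |= ((v >> i) & 1) << (3 * i)
    return out


def morton60(x, y, z, bits=20):
    """Interleave the quantized coordinates of a point in the unit cube."""
    top = (1 << bits) - 1

    def quantize(c):
        return min(max(int(c * (1 << bits)), 0), top)

    xi, yi, zi = quantize(x), quantize(y), quantize(z)
    return ((expand_bits(xi, bits) << 2)
            | (expand_bits(yi, bits) << 1)
            | expand_bits(zi, bits))


# The most significant bit tells which side of the x = 0.5 plane a point is on.
print(hex(morton60(0.5, 0.0, 0.0)))
```

Sorting primitives by such codes orders them along a space-filling curve, and each bit of the code corresponds to one spatial median split.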
29
Processing stages
Initial BVH construction
Post-processing
Optimization
Parallel bottom-up traversal [Karras 2012]
Restructure multiple treelets in parallel
30
Processing stages
Initial BVH construction
Post-processing
Optimization
Parallel bottom-up traversal [Karras 2012]
35
Processing stages
Initial BVH construction
Post-processing
Optimization
Parallel bottom-up traversal [Karras 2012]
Strict bottom-up order → no overlap between treelets
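The bottom-up order can be illustrated with a sequential simulation of the per-node counter scheme from [Karras 2012]: one thread starts at each leaf and walks toward the root, the first thread to reach a node stops, and the second one processes it. The Python below only imitates this on one thread; on a GPU the counter update would be an atomic increment. All names are hypothetical.

```python
def bottom_up_order(parent, leaves):
    """Return internal nodes in the order they become ready for processing.

    parent maps each node to its parent; the root maps to None.
    """
    visits = {}
    order = []
    for leaf in leaves:                  # each leaf plays one GPU thread
        node = parent[leaf]
        while node is not None:
            visits[node] = visits.get(node, 0) + 1   # atomic add on a GPU
            if visits[node] == 1:
                break                    # sibling subtree not finished yet
            order.append(node)           # both children done: process node
            node = parent[node]
    return order


# Example tree: R -> (C -> (A, B), F -> (D, E))
parent = {"A": "C", "B": "C", "D": "F", "E": "F",
          "C": "R", "F": "R", "R": None}
print(bottom_up_order(parent, ["A", "B", "D", "E"]))
```

Because a node is processed only after both of its subtrees are finished, a treelet rooted at that node cannot overlap with one that is still being restructured.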
36
Processing stages
Initial BVH construction
Post-processing
Rinse and repeat (3 times is plenty)
Optimization
37
Processing stages
Initial BVH construction
Post-processing
Optimization
Collapse subtrees into leaf nodes
38
Processing stages
Initial BVH construction
Post-processing
Optimization
Collect triangles into linear lists
Prepare them for Woop’s intersection test [Woop 2004]
39
Processing stages
Initial BVH construction
Post-processing
Optimization
Fast GPU ray traversal [Aila et al. 2012]
40
Processing stages
Initial BVH construction
Post-processing
Optimization
Triangle splitting
Fast GPU ray traversal [Aila et al. 2012]
41
Processing stages
Initial BVH construction
Post-processing
Optimization
Triangle splitting
Fast GPU ray traversal [Aila et al. 2012]
DRAGON (870K tris), NVIDIA GTX Titan
Stage                      No splits   30% splits
Triangle splitting         -           0.4 ms
Initial BVH construction   5.4 ms      6.6 ms
Optimization               17.0 ms     21.4 ms
Post-processing            1.2 ms      1.6 ms
Total                      23.6 ms     30.0 ms
42
Cost model
Surface area cost model [Goldsmith and Salmon 1987], [MacDonald and Booth 1990]
Track cost and triangle count of each subtree
Minimize SAH cost of the final BVH
  Make collapsing decisions already during optimization
  → Unified processing of leaves and internal nodes
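A sketch of how tracking cost and triangle count per subtree allows the collapsing decision to be made on the fly. The two cost constants are assumed values chosen for illustration, and the function names are hypothetical.

```python
C_INTERNAL = 1.2   # assumed cost of a ray-node test
C_TRIANGLE = 1.0   # assumed cost of a ray-triangle test


def leaf_cost(area, tris=1):
    """(SAH cost, triangle count) of a leaf holding `tris` triangles."""
    return (C_TRIANGLE * area * tris, tris)


def subtree_cost(area, left, right):
    """Combine two child summaries into (cost, triangle count, collapsed).

    `left` and `right` are tuples whose first two entries are the child's
    SAH cost and triangle count.
    """
    tris = left[1] + right[1]
    keep = C_INTERNAL * area + left[0] + right[0]   # stay an internal node
    collapse = C_TRIANGLE * area * tris             # become one big leaf
    return (min(keep, collapse), tris, collapse < keep)


small = leaf_cost(1.0)
print(subtree_cost(1.5, small, small))   # tight parent: collapsing is cheaper
print(subtree_cost(4.0, small, small))   # loose parent: keep the node
```

Since a collapsed subtree and an internal node are both summarized by the same pair of numbers, the optimizer can treat them uniformly.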
43
Optimal restructuring
Finding the optimal node topology is NP-hard
  Naive algorithm → …
  Our approach → …
But it becomes very powerful as the treelet size grows
  7 treelet leaves is enough for high-quality results
Use fixed-size treelets
  Constant cost per treelet
  → Linear with respect to scene size
44
Optimal restructuring
Treelet size   Layouts   Quality vs. SBVH *
4              15        78%
5              105       85%
6              945       88%
7              10,395    97%
8              135,135   98%
* SODA (2.2M tris)
Layouts: number of unique ways for restructuring a given treelet
Quality: ray tracing performance after 3 rounds of optimization
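The "Layouts" column matches the number of distinct binary trees over n labeled leaves, which is the double factorial (2n − 3)!!. A short check (the function name is hypothetical):

```python
def num_layouts(n):
    """Number of binary tree topologies over n labeled leaves: (2n - 3)!!."""
    count = 1
    for k in range(3, 2 * n - 2, 2):   # 3 * 5 * ... * (2n - 3)
        count *= k
    return count


for n in range(4, 9):
    print(n, num_layouts(n))
```

The count grows by a factor of 2n − 3 with each added leaf, which is why exhaustive enumeration quickly becomes impractical.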
45
Optimal restructuring
Treelet size   Layouts   Quality vs. SBVH *
4              15        78%
5              105       85%
6              945       88%
7              10,395    97%
8              135,135   98%
* SODA (2.2M tris)
Annotations:
  Almost the same thing as tree rotations [Kensler 2008]
  Limited options during optimization → easy to get stuck in a local optimum
  Varies a lot between scenes
46
Optimal restructuring
Treelet size   Layouts   Quality vs. SBVH *
4              15        78%
5              105       85%
6              945       88%
7              10,395    97%
8              135,135   98%
* SODA (2.2M tris)
Annotations:
  Can still be implemented efficiently
  Surely one of these will take us forward
  Consistent across scenes
  Further improvement is marginal
47
Algorithm
Dynamic programming
  Solve small subproblems first
  Tabulate their solutions
  Build on them to solve larger subproblems
Subproblem: What’s the best node topology for a subset of the leaves?
48
Algorithm
input: set of n treelet leaves
for k = 2 to n do
    for each subset of size k do
        for each way of partitioning the leaves do
            look up subtree costs
            calculate SAH cost
        end for
        record the best solution
    end for
end for
reconstruct optimal topology
Annotations:
  Process subsets from smallest to largest
  Record the optimal SAH cost for each
49
Algorithm
input: set of n treelet leaves
for k = 2 to n do
    for each subset of size k do
        for each way of partitioning the leaves do
            look up subtree costs
            calculate SAH cost
        end for
        record the best solution
    end for
end for
reconstruct optimal topology
Annotations:
  Exhaustive search: assign each leaf to left/right subtree
  We already know how much the subtrees will cost
  Backtrack the partitioning choices
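A scalar sketch of the dynamic program over subsets, using bitmasks to represent sets of leaves. It is an illustration under assumptions: `C_INTERNAL` is an assumed constant, boxes are `(min_xyz, max_xyz)` tuples, all names are hypothetical, and the collapsing decision from the cost model slide is left out for brevity.

```python
from itertools import combinations

C_INTERNAL = 1.2   # assumed cost of one internal node per unit of area


def union(a, b):
    """Smallest AABB enclosing boxes a and b."""
    lo = tuple(min(p, q) for p, q in zip(a[0], b[0]))
    hi = tuple(max(p, q) for p, q in zip(a[1], b[1]))
    return lo, hi


def area(box):
    dx, dy, dz = (h - l for l, h in zip(box[0], box[1]))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def optimize_treelet(boxes, leaf_costs):
    """Return (optimal SAH cost, topology as nested tuples of leaf indices)."""
    n = len(boxes)
    full = (1 << n) - 1

    box = {}                                   # AABB of every subset
    for s in range(1, full + 1):
        low = s & -s
        rest = s ^ low
        leaf_box = boxes[low.bit_length() - 1]
        box[s] = leaf_box if rest == 0 else union(leaf_box, box[rest])

    cost = {1 << i: leaf_costs[i] for i in range(n)}   # subsets of size 1
    split = {}                                 # best left half per subset

    for k in range(2, n + 1):                  # smallest to largest
        for combo in combinations(range(n), k):
            s = sum(1 << i for i in combo)
            low = s & -s
            best_cost, best_left = float("inf"), 0
            p = (s - 1) & s                    # walk proper subsets of s
            while p:
                if p & low:                    # skip mirrored partitions
                    c = cost[p] + cost[s ^ p]  # look up subtree costs
                    if c < best_cost:
                        best_cost, best_left = c, p
                p = (p - 1) & s
            cost[s] = C_INTERNAL * area(box[s]) + best_cost
            split[s] = best_left

    def build(s):                              # backtrack the choices
        if s not in split:
            return s.bit_length() - 1
        return build(split[s]), build(s ^ split[s])

    return cost[full], build(full)


def cube(x):
    return (x, 0.0, 0.0), (x + 1.0, 1.0, 1.0)


# Four unit cubes on the x axis, listed in a deliberately poor order.
boxes = [cube(0.0), cube(10.0), cube(1.0), cube(11.0)]
best, topology = optimize_treelet(boxes, leaf_costs=[6.0] * 4)
print(best, topology)
```

In the example the optimizer pairs the two neighboring cubes on each side, which gives the smallest internal-node AABBs.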
50
Scalar vs. SIMD
Scalar processing
  Each thread processes one treelet
  Need many treelets in flight
  ✗ Spills to off-chip memory
  ✗ Doesn’t scale to small scenes
  ✓ Trivial to implement
SIMD processing
  32 threads collaborate on the same treelet
  Need few treelets in flight
  ✓ Data fits in on-chip memory
  ✓ Easy to fill the entire GPU
  ✗ Need to keep all threads busy
Parallelize over subproblems using a pre-optimized processing schedule (details in the paper)
51
Scalar vs. SIMD
Scalar processing
  Each thread processes one treelet
  Need many treelets in flight
  ✗ Spills to off-chip memory
  ✗ Doesn’t scale to small scenes
  ✓ Trivial to implement
SIMD processing
  32 threads collaborate on the same treelet
  Need few treelets in flight
  ✓ Data fits in on-chip memory
  ✓ Easy to fill the entire GPU
  ✓ Possible to keep threads busy
52
Quality vs. speed
Spend less effort on bottom-most nodes
  Low contribution to SAH cost
  Quick convergence
Additional parameter
  Only process subtrees that are large enough
  Trade quality for speed
Double it after each round
  Significant speedup
  Negligible effect on quality
53
Triangle splitting
Early Split Clipping [Ernst and Greiner 2007]
  Split triangle bounding boxes as a pre-process
[Figure: large triangle and its bounding box]
Bounding box is not a good approximation
Split it!
54
Triangle splitting
Early Split Clipping [Ernst and Greiner 2007]
  Split triangle bounding boxes as a pre-process
Resulting boxes provide a tighter bound
Keep going until they are small enough
55
Triangle splitting
Early Split Clipping [Ernst and Greiner 2007]
  Split triangle bounding boxes as a pre-process
Keep going until they are small enough
56
Triangle splitting
Early Split Clipping [Ernst and Greiner 2007]
  Split triangle bounding boxes as a pre-process
Treat each box as a separate primitive
Triangle itself remains the same
57
Triangle splitting
Shortcomings of pre-process splitting
  Can hurt ray tracing performance
  Unpredictable memory usage
  Requires manual tuning
Improve with better heuristics
  Select good split planes
  Concentrate splits where they matter
  Use a fixed split budget
58
Split plane selection
Root node partitions the scene at its spatial median
Reduce node overlap in the initial BVH
59
Split plane selection
Left child
Right child
Reduce node overlap in the initial BVH
60
Split plane selection
If a triangle crosses the plane, the bounding boxes will overlap
Use the same spatial median as a split plane
Reduce node overlap in the initial BVH
61
Split plane selection
No overlap
Use the same spatial median as a split plane
Reduce node overlap in the initial BVH
62
Split plane selection
Splitting one triangle does not help much
Need to split them all to get the benefits
Reduce node overlap in the initial BVH
63
Split plane selection
Reduce node overlap in the initial BVH
64
Split plane selection
Same reasoning holds on multiple levels
Reduce node overlap in the initial BVH
[Figure: spatial median planes labeled Level 0 through Level 3]
65
Split plane selection
Look at all spatial median planes that intersect a triangle
Split it with the dominant one
Reduce node overlap in the initial BVH
[Figure: spatial median planes labeled Level 0 through Level 3]
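One way to find the dominant spatial median plane for a triangle's bounding box is sketched below. It assumes the scene is normalized to the unit cube and that levels cycle through x, y, z in the same order as the Morton code bits (level 0 splits x at 0.5, level 1 splits y at 0.5, and so on); that numbering and the function name are assumptions made for illustration.

```python
def dominant_plane(lo, hi, max_depth=20):
    """Return (level, axis, position) of the most important spatial median
    plane strictly inside the box [lo, hi], or None if there is none."""
    best = None
    for axis in range(3):
        a, b = lo[axis], hi[axis]
        for depth in range(1, max_depth + 1):
            cells = 1 << depth
            # First plane of this depth lying strictly above a.
            pos = (int(a * cells) + 1) / cells
            if pos < b:
                # Coarser depths were already rejected, so this plane is
                # the lowest-level one crossing the box on this axis.
                level = 3 * (depth - 1) + axis
                if best is None or level < best[0]:
                    best = (level, axis, pos)
                break
    return best


print(dominant_plane((0.3, 0.1, 0.1), (0.6, 0.2, 0.2)))
```

A lower level means the plane separates larger parts of the scene, so splitting there removes overlap higher up in the initial BVH.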
66
Algorithm
1. Allocate memory for a fixed split budget
67
Algorithm
1. Allocate memory for a fixed split budget
2. Calculate a priority value for each triangle
68
Algorithm
1. Allocate memory for a fixed split budget
2. Calculate a priority value for each triangle
3. Distribute the split budget among triangles
   Proportional to their priority values
69
Algorithm
1. Allocate memory for a fixed split budget
2. Calculate a priority value for each triangle
3. Distribute the split budget among triangles
   Proportional to their priority values
4. Split each triangle recursively
   Distribute remaining splits according to the size of the resulting AABBs
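Step 3 can be sketched as a proportional allocation that never exceeds the budget. Rounding down is a simplification chosen here, so part of the budget may stay unused; the function name is hypothetical.

```python
import math


def distribute_splits(priorities, budget):
    """Give each triangle a number of splits proportional to its priority."""
    total = sum(priorities)
    if total <= 0:
        return [0] * len(priorities)
    scale = budget / total
    # Rounding down keeps the sum within the fixed budget.
    return [math.floor(scale * p) for p in priorities]


print(distribute_splits([4.0, 1.0, 0.0, 3.0], budget=6))
```

Because the budget is fixed up front, memory for the split results can be allocated once, before any triangle is split.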
70
Split priority
$$\text{priority} = \left(2^{-\text{level}} \cdot (A_{\text{aabb}} - A_{\text{ideal}})\right)^{1/3}$$
Crosses an important spatial median plane?
Has large potential for reducing surface area?
Concentrate on triangles where both apply
…but leave something for the rest, too
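The priority formula translates directly into code. In this sketch the level and both areas are passed in as plain numbers; clamping the area difference at zero is an added safeguard, and the function name is hypothetical.

```python
def split_priority(level, aabb_area, ideal_area):
    """priority = (2^(-level) * (A_aabb - A_ideal))^(1/3)"""
    gain = max(aabb_area - ideal_area, 0.0)   # guard against negative input
    return (2.0 ** (-level) * gain) ** (1.0 / 3.0)


# Same potential area reduction, different plane importance.
print(split_priority(0, 9.0, 1.0))
print(split_priority(3, 9.0, 1.0))
```

The cube root compresses the range of values, so triangles with low scores still receive a share of the split budget.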
71
Results
Compare against 4 CPU and 3 GPU builders
4-core i7 930, NVIDIA GTX Titan
Average of 20 test scenes, multiple viewpoints
72
Ray tracing performance
[Bar chart: ray tracing performance, 0%–140%, SweepSAH = 100%]
High-quality CPU builders: SweepSAH [MacDonald], SBVH [Stich], Tree rotations [Kensler], Iterative reinsertion [Bittner]
SBVH = 131%
73
Ray tracing performance
[Bar chart: ray tracing performance]
CPU builders: SweepSAH [MacDonald], SBVH [Stich], Tree rotations [Kensler], Iterative reinsertion [Bittner]
Fast GPU builders: LBVH [Karras], HLBVH [Garanzha], GridSAH [Garanzha]
67% – 69%
SBVH = 131%
Almost 2×
74
Ray tracing performance
[Bar chart: ray tracing performance, 0%–140%]
CPU builders: SweepSAH [MacDonald], SBVH [Stich], Tree rotations [Kensler], Iterative reinsertion [Bittner]
GPU builders: LBVH [Karras], HLBVH [Garanzha], GridSAH [Garanzha]
Our method: TRBVH (no splits), TRBVH + 30% splits
75
Ray tracing performance
[Bar chart: ray tracing performance, 0%–140%]
CPU builders: SweepSAH [MacDonald], SBVH [Stich], Tree rotations [Kensler], Iterative reinsertion [Bittner]
GPU builders: LBVH [Karras], HLBVH [Garanzha], GridSAH [Garanzha]
Our method: TRBVH, TRBVH + 30% splits
96% of SweepSAH
91% of SBVH
76
Effective performance
[Plot: effective performance (0%–140%) vs. number of rays per frame (1M–1T)]
Curves: SweepSAH [MacDonald], SBVH [Stich], Tree rotations [Kensler], Iterative reinsertion [Bittner]
Not Pareto-optimal: Tree rotations [Kensler], Iterative reinsertion [Bittner]
77
Effective performance
[Plot: effective performance (0%–140%) vs. number of rays per frame (1M–1T)]
Curves: SweepSAH [MacDonald], SBVH [Stich]
78
Effective performance
[Plot: effective performance (0%–140%) vs. number of rays per frame (1M–1T)]
Curves: SweepSAH [MacDonald], SBVH [Stich], LBVH [Karras], HLBVH [Garanzha], GridSAH [Garanzha]
Not Pareto-optimal: HLBVH [Garanzha], GridSAH [Garanzha]
79
Effective performance
[Plot: effective performance (0%–140%) vs. number of rays per frame (1M–1T)]
Curves: SweepSAH [MacDonald], SBVH [Stich], LBVH [Karras]
80
Effective performance
[Plot: effective performance (0%–140%) vs. number of rays per frame (1M–1T)]
Curves: SweepSAH [MacDonald], SBVH [Stich], LBVH [Karras], Our method (no splits), Our method (30% splits)
81
Effective performance
[Plot: effective performance (0%–140%) vs. number of rays per frame (1M–1T)]
7M–60G rays/frame → our method is the best choice
Below 7M → LBVH
Above 60G → SBVH
82
Conclusion
General framework for optimizing trees
  Inherently parallel
  Approximate restructuring → larger treelets?
Practical GPU-based BVH builder
  Best choice in a large class of applications
  Adjustable quality–speed tradeoff
Will be integrated into NVIDIA OptiX
83
Thank you
Acknowledgements
  Samuli Laine
  Jaakko Lehtinen
  Sami Liedes
  David McAllister
  Anonymous reviewers
Anat Grynberg and Greg Ward for CONFERENCE
University of Utah for FAIRY
Marko Dabrovic for SIBENIK
Ryan Vance for BUBS
Samuli Laine for HAIRBALL and VEGETATION
Guillermo Leal Laguno for SANMIGUEL
Jonathan Good for ARABIC, BABYLONIAN and ITALIAN
Stanford Computer Graphics Laboratory for ARMADILLO, BUDDHA and DRAGON
Cornell University for BAR
Georgia Institute of Technology for BLADE