Fast Parallel Construction of High-Quality Bounding Volume Hierarchies

Tero Karras, Timo Aila

2

Ray tracing comes in many flavors

Interactive apps: 1M–100M rays/frame

Architecture & design: 100M–10G rays/frame

Movie production: 10G–1T rays/frame

© Activision 2009, Game trailer by Blur Studio

Courtesy of Delta Tracing Lucasfilm Ltd.™, Digital work by ILM

Courtesy of Columbia Pictures

NVIDIA

Courtesy of Dassault Systemes

3

Effective performance

$$\text{effective ray tracing performance} = \frac{\text{number of rays}}{\text{rendering time}}$$

4


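$$\text{rendering time} = \text{time to build BVH} + \frac{\text{number of rays}}{\text{ray throughput}}$$

Time to build the BVH reflects builder “speed”; ray throughput reflects BVH “quality”.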
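The two formulas above can be evaluated directly. The following is a minimal sketch; the two builders and their build times and throughputs are hypothetical numbers chosen for illustration, not measurements from the talk.

```python
def effective_performance(num_rays, build_time_s, throughput_rays_per_s):
    """Effective performance in rays/s: number of rays divided by rendering time."""
    rendering_time = build_time_s + num_rays / throughput_rays_per_s
    return num_rays / rendering_time

# Hypothetical builders (illustrative values only).
fast_low_quality = dict(build_time_s=0.01, throughput_rays_per_s=200e6)
slow_high_quality = dict(build_time_s=1.0, throughput_rays_per_s=400e6)

for rays in (1e6, 1e9, 1e12):
    a = effective_performance(rays, **fast_low_quality)
    b = effective_performance(rays, **slow_high_quality)
    print(f"{rays:.0e} rays: fast build {a / 1e6:.1f} Mrays/s, "
          f"high quality {b / 1e6:.1f} Mrays/s")
```

With few rays the build time dominates and the fast builder wins; with many rays the throughput term dominates and the high-quality builder wins.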

5


[Plot: effective performance (Mrays/s, 0–450) versus number of rays per frame (1M–1T), where rendering time = time to build BVH + number of rays / ray throughput. Interactive apps: speed matters. Architecture & design: both matter. Movie production: quality matters.]

6


[Plot: effective performance (Mrays/s, 0–450) versus number of rays per frame (1M–1T). Test setup: SODA (2.2M tris), NVIDIA GTX Titan, diffuse rays.]

7


[Plot: effective performance (Mrays/s, 0–450) versus number of rays per frame (1M–1T). Curve: SBVH [Stich et al. 2009] (CPU, 4 cores). Annotation: build time dominates.]

8


[Plot: effective performance (Mrays/s, 0–450) versus number of rays per frame (1M–1T). Curves: HLBVH [Garanzha et al. 2011] (GPU) and SBVH (CPU, 4 cores). Annotation: "???"]

9


[Plot: effective performance (Mrays/s, 0–450) versus number of rays per frame (1M–1T). Curves: our method (GPU), HLBVH [Garanzha et al. 2011] (GPU), and SBVH (CPU, 4 cores).]

10

Effective performance

[Plot: effective performance (Mrays/s, 0–450) versus number of rays per frame (1M–1T). Annotations: 30M–500G rays/frame; 97% of SBVH.]

Best quality–speed tradeoff for a wide range of applications

11

Treelet restructuring

Idea
- Build a low-quality BVH
- Optimize its node topology
- Look at multiple nodes at once

12



Treelet
- Subset of a node’s descendants
- Grow by turning leaves into internal nodes
- Largest leaves → best results

[Figure: a treelet rooted at node R, with the treelet root, treelet internal nodes, and treelet leaves marked. The treelet grows step by step as leaves are turned into internal nodes.]

20


[Figure: example treelet with root R and nodes A–G, with treelet leaves and treelet internal nodes marked. One treelet leaf is an actual BVH leaf, another is an arbitrary subtree.]

Treelet properties
- Valid binary tree in itself
- Leaves can represent arbitrary subtrees of the BVH

22


[Figure: example treelet with root R and nodes A–G.]

Restructuring
- Construct optimal binary tree for the same set of leaves
- Replace old treelet
- Reuse the same nodes
- Update connectivity and AABBs
- New AABBs should be smaller

24

[Figure: the restructured treelet with root R and nodes A–G.]

Perfectly localized operation
- Leaves and their subtrees are kept intact
- No need to look at subtree contents

27

Processing stages

Input triangles → Initial BVH construction → Optimization → Post-processing

Initial BVH construction: one triangle per leaf

28

Initial BVH construction
- Parallel LBVH [Karras 2012]
- 60-bit Morton codes for accurate spatial partitioning
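As an illustration of what a 60-bit Morton code looks like, here is a minimal sketch that quantizes each coordinate to 20 bits and interleaves them. The function names, the x/y/z bit order, and the assumption that coordinates are normalized to [0, 1) are choices made for this example and are not taken from the talk.

```python
def expand_bits_20(v):
    """Spread the low 20 bits of v so that two zero bits follow each original bit."""
    result = 0
    for i in range(20):
        result |= ((v >> i) & 1) << (3 * i)
    return result

def morton60(x, y, z):
    """60-bit Morton code for a point with coordinates normalized to [0, 1)."""
    def quantize(c):
        # Map [0, 1) to an integer in [0, 2^20 - 1], clamping out-of-range input.
        return min(max(int(c * (1 << 20)), 0), (1 << 20) - 1)
    return ((expand_bits_20(quantize(x)) << 2)
            | (expand_bits_20(quantize(y)) << 1)
            | expand_bits_20(quantize(z)))

print(hex(morton60(0.5, 0.0, 0.0)))  # only the most significant bit is set
```

Sorting primitives by such a code orders them along a space-filling curve, which is what LBVH-style builders partition.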

29

Optimization
- Parallel bottom-up traversal [Karras 2012]
- Restructure multiple treelets in parallel
- Strict bottom-up order → no overlap between treelets
- Rinse and repeat (3 times is plenty)

37

Post-processing
- Collapse subtrees into leaf nodes
- Collect triangles into linear lists
- Prepare them for Woop’s intersection test [Woop 2004]

Fast GPU ray traversal [Aila et al. 2012]

40

Triangle splitting (optional stage before initial BVH construction)

Build time per stage, DRAGON (870K tris), NVIDIA GTX Titan:

Stage                      No splits   30% splits
Triangle splitting         -           0.4 ms
Initial BVH construction   5.4 ms      6.6 ms
Optimization               17.0 ms     21.4 ms
Post-processing            1.2 ms      1.6 ms
Total                      23.6 ms     30.0 ms

42

Cost model

Surface area cost model [Goldsmith and Salmon 1987], [MacDonald and Booth 1990]

Track cost and triangle count of each subtree

Minimize SAH cost of the final BVH
- Make collapsing decisions already during optimization

→ Unified processing of leaves and internal nodes
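A minimal sketch of tracking cost and triangle count per subtree follows. It assumes the common SAH formulation in which an internal node costs C_i·A(n) plus the cost of its children, and a collapsed leaf costs C_t·A(n)·N(n). The constants, the `Node` class, and the `annotate` function are hypothetical names and values for this example.

```python
from dataclasses import dataclass
from typing import Optional

C_INTERNAL = 1.2  # assumed cost of visiting an internal node
C_TRIANGLE = 1.0  # assumed cost of one ray-triangle test

@dataclass
class Node:
    area: float                     # surface area of the node's AABB
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    triangles: int = 1              # filled in by annotate()
    cost: float = 0.0               # filled in by annotate()
    collapse: bool = False          # True if the subtree should become one leaf

def annotate(node):
    """Bottom-up pass storing triangle count, SAH cost and collapse decision."""
    if node.left is None:           # leaf holding a single triangle
        node.triangles = 1
        node.cost = C_TRIANGLE * node.area
        return node
    annotate(node.left)
    annotate(node.right)
    node.triangles = node.left.triangles + node.right.triangles
    as_internal = C_INTERNAL * node.area + node.left.cost + node.right.cost
    as_leaf = C_TRIANGLE * node.area * node.triangles
    node.collapse = as_leaf < as_internal
    node.cost = min(as_internal, as_leaf)
    return node

root = annotate(Node(10.0, Node(4.0), Node(5.0)))
print(root.cost, root.triangles, root.collapse)
```

Because every node stores the cheaper of the two alternatives, leaves and internal nodes can be handled by the same code path.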

43

Optimal restructuring

Finding the optimal node topology is NP-hard
- Naive algorithm → O(n!)
- Our approach → O(3^n)

But it becomes very powerful as n grows
- n = 7 treelet leaves is enough for high-quality results

Use fixed-size treelets
- Constant cost per treelet

→ Linear with respect to scene size

44


Treelet size   Layouts    Quality vs. SBVH *
4              15         78%
5              105        85%
6              945        88%
7              10,395     97%
8              135,135    98%

* SODA (2.2M tris)

Layouts: number of unique ways for restructuring a given treelet
Quality: ray tracing performance after 3 rounds of optimization
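The layout counts in the table match the number of distinct binary trees over n labelled leaves, (2n − 3)!!. The short sketch below reproduces them; the function name is chosen for this example.

```python
def num_layouts(n):
    """Number of binary tree topologies over n labelled leaves: (2n - 3)!!."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # odd factors 3, 5, ..., 2n - 3
        count *= k
    return count

for n in range(4, 9):
    print(n, num_layouts(n))
```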

45



Smallest treelets:
- Almost the same thing as tree rotations [Kensler 2008]
- Limited options during optimization → easy to get stuck in a local optimum
- Quality varies a lot between scenes

46



Larger treelets:
- Can still be implemented efficiently
- Surely one of these layouts will take us forward
- Quality is consistent across scenes
- Further improvement is marginal

47

Algorithm

Dynamic programming
- Solve small subproblems first
- Tabulate their solutions
- Build on them to solve larger subproblems

Subproblem: What’s the best node topology for a subset of the leaves?

48

Algorithm

    input: set of n treelet leaves
    for k = 2 to n do
        for each subset of size k do
            for each way of partitioning the leaves do
                look up subtree costs
                calculate SAH cost
            end for
            record the best solution
        end for
    end for
    reconstruct optimal topology

- Process subsets from smallest to largest
- Record the optimal SAH cost for each
- Exhaustive search: assign each leaf to left/right subtree
- We already know how much the subtrees will cost
- Backtrack the partitioning choices
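The pseudocode above can be sketched as a bitmask dynamic program on the CPU. This is a minimal scalar illustration, not the GPU implementation from the talk: the function names, the box representation, and the traversal cost constant are assumptions, and leaf collapsing is left out for brevity.

```python
from itertools import combinations

C_INTERNAL = 1.2  # assumed cost of visiting an internal node

def union(a, b):
    """Union of two AABBs given as (min_corner, max_corner) tuples."""
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))

def area(box):
    dx, dy, dz = (hi - lo for lo, hi in zip(box[0], box[1]))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def optimize_treelet(leaf_boxes, leaf_costs):
    """Return (optimal SAH cost, topology as nested tuples of leaf indices)."""
    n = len(leaf_boxes)
    full = (1 << n) - 1
    box = [None] * (full + 1)
    cost = [0.0] * (full + 1)
    best_split = [0] * (full + 1)
    for i in range(n):
        box[1 << i] = leaf_boxes[i]
        cost[1 << i] = leaf_costs[i]
    for k in range(2, n + 1):                      # smallest subsets first
        for members in combinations(range(n), k):
            s = sum(1 << i for i in members)
            low = s & -s                           # lowest leaf in the subset
            box[s] = union(box[low], box[s ^ low])
            best = float("inf")
            p = (s - 1) & s                        # proper subsets of s
            while p:
                if p & low:                        # skip mirrored partitions
                    c = cost[p] + cost[s ^ p]      # look up subtree costs
                    if c < best:
                        best, best_split[s] = c, p
                p = (p - 1) & s
            cost[s] = C_INTERNAL * area(box[s]) + best

    def build(s):                                  # backtrack the choices
        if (s & (s - 1)) == 0:
            return s.bit_length() - 1
        p = best_split[s]
        return (build(p), build(s ^ p))

    return cost[full], build(full)

boxes = [((0, 0, 0), (1, 1, 1)), ((1, 0, 0), (2, 1, 1)), ((10, 0, 0), (11, 1, 1))]
print(optimize_treelet(boxes, [area(b) for b in boxes]))
```

In the example, the two adjacent boxes end up as siblings and the distant box is attached at the root.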

50

Scalar vs. SIMD

Scalar processing
- Each thread processes one treelet
- Need many treelets in flight
✗ Spills to off-chip memory
✗ Doesn’t scale to small scenes
✓ Trivial to implement

SIMD processing
- 32 threads collaborate on the same treelet
- Need few treelets in flight
✓ Data fits in on-chip memory
✓ Easy to fill the entire GPU
✗ Need to keep all threads busy
  → Parallelize over subproblems using a pre-optimized processing schedule (details in the paper)
  ✓ Possible to keep threads busy

52

Quality vs. speed

Spend less effort on bottom-most nodes
- Low contribution to SAH cost
- Quick convergence

Additional parameter
- Only process subtrees that are large enough
- Trade quality for speed

Double the parameter after each round
- Significant speedup
- Negligible effect on quality

53

Triangle splitting

Early Split Clipping [Ernst and Greiner 2007]
- Split triangle bounding boxes as a pre-process

[Figure: a large triangle whose bounding box is not a good approximation. Split it!]

54

Triangle splitting

- Resulting boxes provide a tighter bound
- Keep going until they are small enough
- Treat each box as a separate primitive
- Triangle itself remains the same
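A single split step can be sketched as follows: the triangle is cut by an axis-aligned plane and a tight box is computed for each side, while the triangle itself is left unchanged. The function name and the tuple-based representation are assumptions made for this example, which also assumes the plane actually crosses the triangle.

```python
def split_triangle_aabb(tri, axis, pos):
    """Split a triangle's bounding box at the plane coordinate[axis] == pos.

    tri is a sequence of three (x, y, z) vertices. Returns (left_box, right_box),
    each a (min_corner, max_corner) pair bounding the triangle part on that side.
    """
    left, right = [], []
    for i in range(3):
        a, b = tri[i], tri[(i + 1) % 3]
        if a[axis] <= pos:
            left.append(a)
        if a[axis] >= pos:
            right.append(a)
        # Edge strictly crosses the plane: its intersection bounds both sides.
        if (a[axis] - pos) * (b[axis] - pos) < 0:
            t = (pos - a[axis]) / (b[axis] - a[axis])
            p = tuple(a[k] + t * (b[k] - a[k]) for k in range(3))
            p = p[:axis] + (pos,) + p[axis + 1:]   # snap exactly onto the plane
            left.append(p)
            right.append(p)

    def bounds(points):
        return (tuple(map(min, *points)), tuple(map(max, *points)))

    return bounds(left), bounds(right)

tri = ((0, 0, 0), (4, 0, 0), (0, 4, 0))
print(split_triangle_aabb(tri, axis=0, pos=2))
```

In the example the right-hand box only reaches y = 2, whereas simply halving the original box would reach y = 4.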

57

Triangle splitting

Shortcomings of pre-process splitting
- Can hurt ray tracing performance
- Unpredictable memory usage
- Requires manual tuning

Improve with better heuristics
- Select good split planes
- Concentrate splits where they matter
- Use a fixed split budget

58

Split plane selection

Reduce node overlap in the initial BVH
- Root node partitions the scene at its spatial median into a left child and a right child
- If a triangle crosses the plane, the bounding boxes will overlap
- Use the same spatial median as a split plane → no overlap
- Splitting one triangle does not help much; need to split them all to get the benefits
- Same reasoning holds on multiple levels

Look at all spatial median planes that intersect a triangle
- Split it with the dominant one

[Figure: spatial median planes labelled by level (level 0, level 1, level 2, level 3) over a set of triangles.]
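Finding the dominant plane for a bounding box can be sketched as a search over levels, starting from the root. This sketch simplifies things by treating each axis independently with its own level numbering and by assuming coordinates normalized to [0, 1]; the function names and the `max_level` cutoff are hypothetical.

```python
import math

def dominant_plane_1d(lo, hi, max_level=20):
    """Lowest-level spatial median plane strictly inside the interval (lo, hi).

    Level 0 is the plane at 0.5, level 1 the planes at 0.25 and 0.75, and so on.
    Returns (level, position), or None if no plane up to max_level fits.
    """
    for level in range(max_level + 1):
        step = 2.0 ** -(level + 1)
        pos = (math.floor(lo / step) + 1) * step   # first plane strictly above lo
        if pos < hi:
            return level, pos
    return None

def dominant_plane(box_min, box_max):
    """Pick the axis whose dominant plane has the lowest level."""
    best = None
    for axis in range(3):
        found = dominant_plane_1d(box_min[axis], box_max[axis])
        if found and (best is None or found[0] < best[0]):
            best = (found[0], axis, found[1])
    return best  # (level, axis, position) or None

print(dominant_plane((0.6, 0.1, 0.3), (0.8, 0.2, 0.6)))
```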

66

Algorithm

1. Allocate memory for a fixed split budget
2. Calculate a priority value for each triangle
3. Distribute the split budget among triangles
   - Proportional to their priority values
4. Split each triangle recursively
   - Distribute remaining splits according to the size of the resulting AABBs
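Steps 3 and 4 can be sketched with two small helpers. Both are simplified illustrations under assumptions: the function names are hypothetical, rounding down is one simple way to stay within the budget, and the box "size" measure is passed in by the caller because the slides do not specify it.

```python
def distribute_splits(priorities, budget):
    """Step 3: give each triangle a share of the budget proportional to its priority.

    Shares are rounded down; any leftover splits are simply left unused here.
    """
    total = sum(priorities)
    if total <= 0:
        return [0] * len(priorities)
    return [int(budget * p / total) for p in priorities]

def divide_remaining(splits, size_left, size_right):
    """Step 4: after spending one split, divide the rest between the two halves
    in proportion to the sizes of their AABBs."""
    remaining = splits - 1
    left = round(remaining * size_left / (size_left + size_right))
    return left, remaining - left

print(distribute_splits([1.0, 3.0], budget=8))
print(divide_remaining(5, size_left=1.0, size_right=3.0))
```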

70

Split priority

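$$\text{priority} = \left( 2^{-\text{level}} \cdot \left( A_{\text{aabb}} - A_{\text{ideal}} \right) \right)^{1/3}$$

- 2^(−level): crosses an important spatial median plane?
- A_aabb − A_ideal: has large potential for reducing surface area?
- Product: concentrate on triangles where both apply
- Cube root: …but leave something for the rest, too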
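The priority formula translates directly into code. In this sketch the level and the two surface areas are supplied by the caller; clamping a negative area difference to zero is an added safeguard and not part of the formula on the slide.

```python
def split_priority(level, area_aabb, area_ideal):
    """priority = (2^(-level) * (A_aabb - A_ideal))^(1/3)."""
    gain = max(area_aabb - area_ideal, 0.0)  # clamp is an assumption of this sketch
    return (2.0 ** -level * gain) ** (1.0 / 3.0)

# Same potential area reduction, but crossing a less important plane (higher level).
print(split_priority(0, 9.0, 1.0), split_priority(3, 9.0, 1.0))
```

The cube root compresses the range of priorities, so triangles with small values still receive part of the budget.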

71

Results
- Compare against 4 CPU and 3 GPU builders
- 4-core i7 930, NVIDIA GTX Titan
- Average of 20 test scenes, multiple viewpoints

Ray tracing performance

[Bar chart: ray tracing performance relative to SweepSAH (= 100%), axis 0%–140%.]

High-quality CPU builders: SweepSAH [MacDonald], SBVH [Stich], tree rotations [Kensler], iterative reinsertion [Bittner]
- SweepSAH = 100%
- SBVH = 131%

Fast GPU builders: LBVH [Karras], HLBVH [Garanzha], GridSAH [Garanzha]
- 67%–69%
- SBVH is almost 2× faster

Our method: TRBVH (no splits), TRBVH + 30% splits
- 96% of SweepSAH
- 91% of SBVH

76


[Plot: relative effective performance (0%–140%) versus number of rays per frame (1M–1T) for each builder.]

- Tree rotations [Kensler] and iterative reinsertion [Bittner]: not Pareto-optimal
- HLBVH [Garanzha] and GridSAH [Garanzha]: not Pareto-optimal
- Remaining builders: SBVH [Stich], SweepSAH [MacDonald], LBVH [Karras], our method (no splits), our method (30% splits)

- Below 7M rays/frame → LBVH
- 7M–60G rays/frame → our method is the best choice
- Above 60G rays/frame → SBVH

82

Conclusion

General framework for optimizing trees
- Inherently parallel
- Approximate restructuring → larger treelets?

Practical GPU-based BVH builder
- Best choice in a large class of applications
- Adjustable quality–speed tradeoff

Will be integrated into NVIDIA OptiX

83

Thank you

Acknowledgements

Samuli Laine
Jaakko Lehtinen
Sami Liedes
David McAllister
Anonymous reviewers

Anat Grynberg and Greg Ward for CONFERENCE

University of Utah for FAIRY

Marko Dabrovic for SIBENIK

Ryan Vance for BUBS

Samuli Laine for HAIRBALL and VEGETATION

Guillermo Leal Llaguno for SANMIGUEL

Jonathan Good for ARABIC, BABYLONIAN and ITALIAN

Stanford Computer Graphics Laboratory for ARMADILLO, BUDDHA and DRAGON

Cornell University for BAR

Georgia Institute of Technology for BLADE
