![Page 1: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/1.jpg)
![Page 2: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/2.jpg)
Ray Tracing PerformanceRay Tracing Performance
Zero to Millions in 45 Minutes
Gordon Stoll, Intel
![Page 3: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/3.jpg)
Ray Tracing PerformanceRay Tracing Performance
Zero to Millions in 45 Minutes?!
Gordon Stoll, Intel
![Page 4: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/4.jpg)
Goals for this talkGoals for this talk
• Goals
– point you toward the current state-of-the-art (“BKM”)
• for non-researchers: off-the-shelf performance
• for researchers: baseline for comparison
– get you interested in poking at the problem
• Non-Goals
– present lowest-level details of kernels
– present “the one true way”
![Page 5: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/5.jpg)
Acceleration StructuresAcceleration Structures
• Previous (90s) BKM was to use a uniform grid
– very high raw traversal speed
– performance is not robust
• Current BKM is the kD-tree
– comparable performance even on regular scenes
– very good build technique handles irregular scenes
– other grids, octrees, etc, might as well just use kD-tree
– packet tracing was the icing on the cake
![Page 6: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/6.jpg)
Acceleration StructuresAcceleration Structures
• Couldn’t leave well enough alone
– several new schemes to be described later
– lots of stuff up in the air right now
– But…
• The good tree building technique applies across the board.
• General optimization principles are similar.
![Page 7: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/7.jpg)
kD-TreeskD-Trees
![Page 8: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/8.jpg)
kD-TreeskD-Trees
![Page 9: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/9.jpg)
kD-TreeskD-Trees
![Page 10: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/10.jpg)
kD-TreeskD-Trees
![Page 11: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/11.jpg)
Advantages of kD-TreesAdvantages of kD-Trees
• Adaptive
– Can handle the “Teapot in a Stadium”
• Compact
– Relatively little memory overhead
• Cheap Traversal
– One FP subtract, one FP multiply
![Page 12: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/12.jpg)
Take advantage of advantagesTake advantage of advantages
• Adaptive
– You have to build a good tree
• Compact
– At least use the compact node representation (8-byte)
– You can’t be fetching whole cache lines every time
• Cheap traversal
– No sloppy inner loops! (one subtract, one multiply!)
![Page 13: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/13.jpg)
“Bang for the Buck” ( !/$ )“Bang for the Buck” ( !/$ )
A basic kD-tree implementation will go pretty fast…
…but extra effort will pay off big.
![Page 14: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/14.jpg)
Fast Ray Tracing w/ kD-TreesFast Ray Tracing w/ kD-Trees
• Adaptive
• Compact
• Cheap traversal
![Page 15: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/15.jpg)
Building kD-treesBuilding kD-trees
• Given:
– axis-aligned bounding box (“cell”)
– list of geometric primitives (triangles?) touching cell
• Core operation:
– pick an axis-aligned plane to split the cell into two parts
– sift geometry into two batches (some redundancy)
– recurse
![Page 16: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/16.jpg)
Building kD-treesBuilding kD-trees
• Given:
– axis-aligned bounding box (“cell”)
– list of geometric primitives (triangles?) touching cell
• Core operation:
– pick an axis-aligned plane to split the cell into two parts
– sift geometry into two batches (some redundancy)
– recurse
– termination criteria!
![Page 17: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/17.jpg)
“Intuitive” kD-Tree Building“Intuitive” kD-Tree Building
• Split Axis
– Round-robin; largest extent
• Split Location
– Middle of extent; median of geometry (balanced tree)
• Termination
– Target # of primitives, limited tree depth
![Page 18: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/18.jpg)
“Hack” kD-Tree Building“Hack” kD-Tree Building
• Split Axis
– Round-robin; largest extent
• Split Location
– Middle of extent; median of geometry (balanced tree)
• Termination
– Target # of primitives, limited tree depth
• All of these techniques stink.
![Page 19: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/19.jpg)
“Hack” kD-Tree Building“Hack” kD-Tree Building
• Split Axis
– Round-robin; largest extent
• Split Location
– Middle of extent; median of geometry (balanced tree)
• Termination
– Target # of primitives, limited tree depth
• All of these techniques stink. Don’t use them.
![Page 20: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/20.jpg)
“Hack” kD-Tree Building“Hack” kD-Tree Building
• Split Axis
– Round-robin; largest extent
• Split Location
– Middle of extent; median of geometry (balanced tree)
• Termination
– Target # of primitives, limited tree depth
• All of these techniques stink. Don’t use them.
– I mean it.
![Page 21: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/21.jpg)
Building good kD-treesBuilding good kD-trees
• What split do we really want?
– Clever Idea: The one that makes ray tracing cheap
– Write down an expression of cost and minimize it
– Greedy Cost Optimization
• What is the cost of tracing a ray through a cell?
Cost(cell) = C_trav + Prob(hit L) * Cost(L) + Prob(hit R) * Cost(R)
![Page 22: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/22.jpg)
Splitting with Cost in MindSplitting with Cost in Mind
![Page 23: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/23.jpg)
Split in the middleSplit in the middle
• Makes the L & R probabilities equal
• Pays no attention to the L & R costs
![Page 24: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/24.jpg)
Split at the MedianSplit at the Median
• Makes the L & R costs equal
• Pays no attention to the L & R probabilities
![Page 25: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/25.jpg)
Cost-Optimized SplitCost-Optimized Split
• Automatically and rapidly isolates complexity
• Produces large chunks of empty space
![Page 26: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/26.jpg)
Building good kD-treesBuilding good kD-trees
• Need the probabilities
– Turns out to be proportional to surface area
• Need the child cell costs
– Simple triangle count works great (very rough approx.)
– Empty cell “boost”
Cost(cell) = C_trav + Prob(hit L) * Cost(L) + Prob(hit R) * Cost(R)
= C_trav + SA(L) * TriCount(L) + SA(R) * TriCount(R)
![Page 27: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/27.jpg)
Termination CriteriaTermination Criteria
• When should we stop splitting?
– Clever idea: When splitting isn’t helping any more.
– Use the cost estimates in your termination criteria
• Threshold of cost improvement
– Stretch over multiple levels (?)
• Threshold of cell size
– Absolute probability so small there’s no point
![Page 28: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/28.jpg)
Building good kD-treesBuilding good kD-trees
• Basic build algorithm
– Pick an axis, or optimize across all three
– Build a set of “candidates” (split locations)
• BBox edges or exact triangle intersections
– Sort them or bin them
– Walk through candidates to find minimum cost split
• Characteristics of highly optimized tree
– “stringy”, very deep, very small leaves, big empty cells
![Page 29: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/29.jpg)
BKM for BVH building as wellBKM for BVH building as well
![Page 30: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/30.jpg)
BVH Version of BuildBVH Version of Build
![Page 31: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/31.jpg)
BVH Version of BuildBVH Version of Build
![Page 32: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/32.jpg)
Just Do ItJust Do It
• Benefits of a good tree are not small
– not 10%, 20%, 30%...
– several times faster than a mediocre tree
• Warning #1
– get a good set of scenes for testing
– scanned models aren’t representative, just easy to find
• Warning #2
– beware the NOP
![Page 33: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/33.jpg)
Building Trees QuicklyBuilding Trees Quickly
• Very important to build good trees first
– otherwise you have no foundation for comparison (!!!)
• Don’t give up SAH cost optimization!
– approximate, sure, but don’t just drop it
• Luckily, lots of flexibility…
– axis picking (“hack” pick vs. full optimization)
– candidate picking (bboxes, exact; binning, sorting)
– termination criteria (“knob” controlling tradeoff)
![Page 34: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/34.jpg)
Building Trees QuicklyBuilding Trees Quickly
• Huge amount of work happening right now
• I’m going to defer to later speakers
![Page 35: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/35.jpg)
Fast Ray Tracing w/ kD-TreesFast Ray Tracing w/ kD-Trees
• adaptive
– build a cost-optimized tree w/ the surface area heuristic
• compact
• cheap traversal
![Page 36: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/36.jpg)
What’s in a node?What’s in a node?
• A kD-tree internal node needs:
– Am I a leaf?
– Split axis
– Split location
– Pointers to children
![Page 37: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/37.jpg)
Compact (8-byte) nodesCompact (8-byte) nodes
• kD-Tree node can be packed into 8 bytes
– Leaf flag + Split axis
• 2 bits
– Split location
• 32 bit float
– Always two children, put them side-by-side
• One 32-bit pointer
![Page 38: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/38.jpg)
Compact (8-byte) nodesCompact (8-byte) nodes
• kD-Tree node can be packed into 8 bytes
– Leaf flag + Split axis
• 2 bits
– Split location
• 32 bit float
– Always two children, put them side-by-side
• One 32-bit pointer
• So close! Sweep those 2 bits under the rug…
![Page 39: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/39.jpg)
No Bounding Box!No Bounding Box!
• kD-Tree node corresponds to an AABB
• Doesn’t mean it has to *contain* one
– 24 bytes
– 4X explosion (!)
![Page 40: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/40.jpg)
Memory LayoutMemory Layout
• Cache lines are much bigger than 8 bytes!
– advantage of compactness lost with poor layout
• Pretty easy to do something reasonable
– Building depth first, watching memory allocator
![Page 41: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/41.jpg)
Other DataOther Data
• Memory should be separated by rate of access
– Frames
– << Pixels
– << Samples [ Ray Trees ]
– << Rays [ Shading (not quite) ]
– << Triangle intersections
– << Tree traversal steps
• Example: pre-processed triangle, shading info…
![Page 42: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/42.jpg)
Fast Ray Tracing w/ kD-TreesFast Ray Tracing w/ kD-Trees
• adaptive
– build a cost-optimized tree w/ the surface area heuristic
• compact
– use an 8-byte node
– lay out your memory in a cache-friendly way
• cheap traversal
![Page 43: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/43.jpg)
kD-Tree Traversal StepkD-Tree Traversal Step
split
t_split
t_min
t_max
![Page 44: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/44.jpg)
kD-Tree Traversal StepkD-Tree Traversal Step
split
t_split t_min
t_max
![Page 45: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/45.jpg)
kD-Tree Traversal StepkD-Tree Traversal Step
split
t_split
t_min
t_max
![Page 46: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/46.jpg)
kD-Tree Traversal StepkD-Tree Traversal Step
Given: ray P & iV (1/V), t_min, t_max, split_location, split_axis
t_at_split = ( split_location - ray->P[split_axis] ) * ray_iV[split_axis]
if t_at_split > t_min
need to test against near child
If t_at_split < t_max
need to test against far child
![Page 47: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/47.jpg)
Optimize Your Inner LoopOptimize Your Inner Loop
• kD-Tree traversal is the most critical kernel
– It happens about a zillion times
– It’s tiny
– Sloppy coding will show up
• Optimize, Optimize, Optimize
– Remove recursion and minimize stack operations
– Other standard tuning & tweaking
![Page 48: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/48.jpg)
kD-Tree TraversalkD-Tree Traversal
while ( not a leaf )
t_at_split = (split_location - ray->P[split_axis]) * ray_iV[split_axis]
if t_split <= t_min
continue with far child // hit either far child or none
if t_split >= t_max
continue with near child // hit near child only
// hit both children
push (far child, t_split, t_max) onto stack
continue with (near child, t_min, t_split)
![Page 49: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/49.jpg)
Can it go faster?Can it go faster?
• How do you make fast code go faster?
• Parallelize it!
![Page 50: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/50.jpg)
Ray Tracing and ParallelismRay Tracing and Parallelism
• Classic Answer: Ray-Tree parallelism
– independent tasks
– # of tasks = millions (at least)
– size of tasks = thousands of instructions (at least)
• So this is wonderful, right?
![Page 51: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/51.jpg)
Parallelism in CPUsParallelism in CPUs
• Instruction-Level Parallelism (ILP)
– pipelining, superscalar, OOO, SIMD
– fine granularity (~10s of instructions “window” tops)
– easily confounded by unpredictable control
– easily confounded by unpredictable latencies
• So…what does ray tracing look like to a CPU?
![Page 52: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/52.jpg)
No joy in ILP-villeNo joy in ILP-ville
• At <1000 instruction granularity, ray tracing is anything but “embarrassingly parallel”
• kD-Tree traversal (CPU view):
1) fetch a tiny fraction of a cache line from who knows where
2) do two piddling floating-point operations
3) do a completely unpredictable branch, or two, or three
4) repeat until frustrated
PS: Each operation is dependent on the one before it.
PPS: No SIMD for you! Ha!
![Page 53: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/53.jpg)
Split PersonalitySplit Personality
• Coarse-Grained parallelism (TLP) is perfect
– millions of independent tasks
– thousands of instructions per task
• Fine-Grained parallelism (ILP) is awful
– look at a scale <1000 of instructions
• sequential dependencies
• unpredictable control paths
• unpredictable latencies
• no SIMD
![Page 54: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/54.jpg)
OptionsOptions
• Option #1: Forget about ILP, go with TLP
– improve low-ILP efficiency and use multiple CPU cores
• Option #2: Let TLP stand in for ILP
– run multiple independent threads (ray trees) on one core
• Option #3: Improve the ILP situation directly
– how?
• Option #4: …
![Page 55: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/55.jpg)
…All of the above!…All of the above!
• multi-core CPUs are here in spades
– better performance, better low-ILP performance
– on the right performance curve
• multi-threaded CPUs are already here
– improve well-written ray tracer by ~20-30%
• packet tracing
– trace multiple rays together in a packet
– bulk up the inner loop with ILP-friendly operations
![Page 56: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/56.jpg)
Packet TracingPacket Tracing
• Very, very old idea from vector/SIMD machines
– Vector masks
• Old way
– if the ray wants to go left, go left
– if the ray wants to go right, go right
• New way
– if any ray wants to go left, go left with mask
– if any ray wants to go right, go right with mask
![Page 57: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/57.jpg)
Key ObservationsKey Observations
• Doesn’t add “bad” stuff
– Traverses the same nodes
– Adds no global fetches
– Adds no unpredictable branches
• What it does add
– SIMD-friendly floating-point operations
– Some messing around with masks
• Result: Very robust in relation to single rays
![Page 58: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/58.jpg)
How many rays in a packet?How many rays in a packet?
• Packet tracing gives us a “knob” with which to adjust computational intensity.
• Do natural SIMD width first
• Real answer is potentially much more complex
– diminishing returns due to per-ray costs
– lack of coherence to support bigger packets
– register pressure, L1 pressure
• Makes hardware much more likely/possible
![Page 59: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/59.jpg)
Beyond PacketsBeyond Packets
• Diminishing returns limit packet size
– cost of any operation is still linear in packet size
– run out of overhead to amortize away
• Frustums and/or Interval Representations
– constant time operations on a whole packet
– MLRT, Alex Reshetov, SIGGRAPH ’05
• “inverse frustum culling”, interval traversal
– Carsten Benthin’s thesis
![Page 60: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/60.jpg)
Fast Ray Tracing w/ kD-TreesFast Ray Tracing w/ kD-Trees
• Adaptive
– build a cost-optimized tree (w/ surface area heuristic)
• Compact
– use an 8-byte node
– lay out your memory in a cache-friendly way
• Cheap traversal
– optimize your inner loop
– trace packets
![Page 61: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/61.jpg)
Getting started…Getting started…
• Read PBRT (yeah, I know, it’s 1300 pages)
– great book, pretty decent kD-tree builder
• Read Wald’s thesis and Benthin’s thesis
– lots of coding details for this stuff
• Track down the interesting references
• Learn SIMD programming (e.g. SSE intrinsics)
• Use a profiler.
![Page 62: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/62.jpg)
Getting started…Getting started…
• Read PBRT (yeah, I know, it’s 1300 pages)
– great book, pretty decent kD-tree builder
• Read Wald’s thesis and Benthin’s thesis
– lots of coding details for this stuff
• Track down the interesting references
• Learn SIMD programming (e.g. SSE intrinsics)
• Use a profiler. I mean it.
![Page 63: Ray Tracing Performance Zero to Millions in 45 Minutes Gordon Stoll, Intel](https://reader030.vdocuments.net/reader030/viewer/2022032521/56649d565503460f94a34ad3/html5/thumbnails/63.jpg)
If you remember nothing elseIf you remember nothing else
• “Rays per Second” is measured in millions.