introduction to monte carlo ray tracing, opencl implementation (cedec 2014)
DESCRIPTION
Introduction to Monte Carlo Ray Tracing, OpenCL Implementation (CEDEC 2014)TRANSCRIPT
INTRODUCTION TO MONTE CARLO RAY TRACING
OPENCL IMPLEMENTATION
TAKAHIRO HARADA 9/2014
2 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
RECAP OF LAST SESSION
y Talked about theory y BRDFs
‒ Reflec8on, Refrac8on, Diffuse, Microfacet
y Fresnel is everywhere y Monte Carlo Ray Tracing
‒ Intui8ve understanding of Monte Carlo Integra8on ‒ Simple sampling (Random sampling) ‒ BeXer sampling (Importance sampling) ‒ Layered material
hXp://www.slideshare.net/takahiroharada/introduc8on-‐to-‐monte-‐carlo-‐ray-‐tracing-‐cedec2013
3 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
REVIEW
for( i, j ) { ray = PrimaryRayGen( camera, pixelLoc ); { hit = Trace( ray ); if( hit ) d( pixelLoc ) += EvaluateDI( ray, hit ); } }
SIMPLE CPU MC RAY TRACER
Direct illumina<on
4 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
REVIEW
for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }
SIMPLE CPU MC RAY TRACER
Indirect illumina<on
5 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
REVIEW
for( i, j ) { ray = PrimaryRayGen( camera, pixelLoc ); { hit = Trace( ray ); if( hit ) d( pixelLoc ) += EvaluateDI( ray, hit ); } }
SIMPLE CPU MC RAY TRACER
for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }
Direct illumina<on Indirect Illumina<on
6 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
COMPARISON
Direct illumina<on Indirect illumina<on
7 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
WHY OPENCL?
y Speed! ‒ GPU can accelerate it ‒ Why? Faster is the beXer
y OpenCL is an API for GPU compute
y OpenCL is not only for graphics programmers
y OpenCL does not always require a GPU ‒ Runs on CPU too ‒ Runs if there is a CPU (everywhere)
y If renderer is wriXen in OpenCL, runs on Windows, Linux, MacOSX J
PORTING TO OPENCL (FIRST ATTEMPT)
9 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
THINGS TO BE DONE
y No pointer in OpenCL* y Change pointer to index y Stored in a flat memory
y Not suited for par8al update
*Shared Virtual Memory (OpenCL 2.0)
DATA STRUCTURE
10 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
THINGS TO BE DONE
y Node data for a binary tree ‒ Spa8al accelera8on structure (BVH) ‒ Shading network
y Buffer<NodeData> nodeData;
DATA STRUCTURE
Node Data
m_m
ax.x
m_m
ax.y
m_m
ax.z
m_m
in.x
m_m
in.y
m_m
in.z
m_child0
m_child1
Node Data
m_m
ax.x
m_m
ax.y
m_m
ax.z
m_m
in.x
m_m
in.y
m_m
in.z
m_child0
m_child1
Node Data
m_m
ax.x
m_m
ax.y
m_m
ax.z
m_m
in.x
m_m
in.y
m_m
in.z
m_child0
m_child1
11 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
THINGS TO BE DONE
y Material ‒ Texture entry
y Buffer<Material> material;
y Buffer<char> texData; y Buffer<uint> texTable;
DATA STRUCTURE
Material0
m_kd
m_ior
m_...
m_kdT
ex
m_iorTex
m_b
umpT
ex
TextureTable
tex0
tex1
tex2
tex3
tex4
. . .
Texture0
m_h
eade
r
m_d
ata
Texture1
m_h
eade
r
m_d
ata
Texture2
m_h
eade
r
m_d
ata
Material1
m_kd
m_ior
m_...
m_kdT
ex
m_iorTex
m_b
umpT
ex
. . .
Texture3
m_h
eade
r
m_d
ata
12 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
THINGS TO BE DONE
for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if(!hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }
WRITING OPENCL KERNEL
__kernel void PtKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if(!hit ) return; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }
CPU code OpenCL kernel
13 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
IT WORKS BUT…
y This approach is simple
y But a lot of issues
14 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
DRAWBACKS
y Likely not u8lize hardware efficiently ‒ SIMD divergence ‒ GPU occupancy (latency)
y Maintainability
y Extendibility, Portability
PERFORMANCE
GPU ARCHITECTURE
16 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
OPENCL ON CPU
y Processing element executes Work item (thread) ‒ A SIMD lane (4*)
y Compute unit executes Work group (thread group) ‒ A core (8*) ‒ # of processing elements != # of work items
y Compute device executes Kernel (shader) ‒ A CPU ‒ # of compute units != # of work groups
* On AMD FX-‐8350
Work item
Work group Compute Unit
Processing element
. . .
Kernel
17 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
GPU VS CPU
y Processing element executes Work item ‒ A SIMD lane (64*)
y Compute unit executes Work group ‒ A SIMD engine (44x4*) ‒ # of processing elements != # of work items
y Compute device executes Kernel ‒ A GPU ‒ # of compute units != # of work groups
* On AMD Radeon R9 290X
Work item
Work group Compute Unit (4)
Processing element
. . .
Kernel
Compute Unit (64)
Processing element
...
CPU GPU
18 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
HIGH LEVEL DESCRIPTION
y Today’s GPU is similar to a CPU (if you look at very high level) ‒ GPU is an extremely wide CPU ‒ Many cores ‒ Wide SIMD
y AMD Radeon R9 290X GPU ‒ 176 = 44x4 SIMD engines (cores) ‒ 64 wide SIMD
y But different in ‒ SIMD width (very wide) ‒ Limited local resources ‒ Strategy to hide latency
y Knowing those are the key to exploit the performance
19 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SIMD DIVERGENCE
y SIMD execu8on = Program counter is shared among SIMD lanes
y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD)
int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; }
Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
20 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SIMD DIVERGENCE
y SIMD execu8on = Program counter is shared among SIMD lanes
y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD)
int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; }
Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
21 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SIMD DIVERGENCE
y SIMD execu8on = Program counter is shared among SIMD lanes
y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD)
int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; }
Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
22 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SIMD DIVERGENCE
y SIMD execu8on = Program counter is shared among SIMD lanes
y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD)
int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; }
Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
23 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SIMD DIVERGENCE
y SIMD execu8on = Program counter is shared among SIMD lanes
y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD)
int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; }
Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
24 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SIMD DIVERGENCE
y SIMD execu8on = Program counter is shared among SIMD lanes
y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD)
int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; }
Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
25 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SIMD DIVERGENCE
y SIMD execu8on = Program counter is shared among SIMD lanes
y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD)
int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; }
Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
26 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
WIDE SIMD EXECUTION
y SIMD execu8on = Program counter is shared among SIMD lanes
y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD)
int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; }
Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
J J
L
J
L
L
L
27 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SIMD DIVERGENCE
y SIMD execu8on = Program counter is shared among SIMD lanes
y If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD)
int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; }
Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7
J J
J
J
J
J
J
28 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
LATENCY
y Highest latency is from memory access
y CPU prevent it by having larger cache ‒ Latency of cache access is small (fast)
y Most of the memory access do not go to memory
y CPU can run at full speed un8l a cache miss
y # of concurrent execu8on on the GPU is far much larger than CPU ‒ More than 11k (= 44x4x64) work items
y GPU cache is not large enough to absorb memory requests from those if they all requests different part of memory
y Strategy ‒ Keep memory access as local as possible (not realis8c for prac8cal apps) ‒ Uses GPU mechanism for latency hiding
29 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
GPU LATENCY HIDING
y GPU can execute at full speed if there are only ALU instruc8ons (Inst. 0 -‐ 2) *
y Stalls on memory access instruc8on (Inst. 3)
* Can hide latency using logical vector
Inst. 0
Inst. 1
Inst. 2
Inst. 3 (MemAccess)
Lane0 Lane1 Lane2 Lane3 Lane1
Inst. 4
30 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
GPU LATENCY HIDING
y When stalled, switch to another work group
y Could fill the stall with instruc8ons from WG1
y A SIMD of GPU needs to process mul8ple WGs at the same 8me to hide latency (or maximize its throughput)
WG0: Inst. 0
WG0: Inst. 1
WG0: Inst. 2
WG0: Inst. 3
Lane0 Lane1 Lane2 Lane3 Lane1
WG0: Inst. 4
WG1: Inst. 0
WG1: Inst. 1
WG1: Inst. 2
WG1: Inst. 3
Lane0 Lane1 Lane2 Lane3 Lane1
31 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
HOW MANY WGS CAN WE EXECUTE PER SIMD
y 10 wavefronts (64WIs) per SIMD is the max
y It depends on local resource usage of the kernel y VGPR usage is ozen the problem
y Share 256 VGPRs among n work groups ‒ 1 wavefront, 256VGPRs LL ‒ 2 wavefronts, 128VGPRs ‒ 4 wavefronts, 64VGPRs J ‒ 10 wavefronts, 25VGPRs
y Share 16KB LDS among n work groups ‒ 1 work group, 16KB LL ‒ 2 work group, 8KB ‒ 4 work group, 4KB J
y VGPRs ‒ Registers used by vector ALUs ‒ 64KB/SIMD ‒ 256 VGPRs/SIMD lane (= 64KB/64/4)
y LDS (Local data share) ‒ 64KB/CU (CU == 4SIMD) ‒ 32KB/SIMD
32 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
ADVICE TO REDUCE VGPR PRESSURE
y Don’t write a large kernel y If the program can be split into several pieces, split
them into several kernels ‒ Single kernel approach
‒ VGPR usage of the kernel is 200 = max(60, 200, 10) ‒ 1 wavefront per SIMD ‒ Bad for latency hiding
‒ Mul8ple kernel approach ‒ FuncA: 4 wavefronts per SIMD ‒ FuncB: 1 wavefronts per SIMD ‒ FuncC: 10 wavefronts per SIMD ‒ FuncB is bad, but FuncA, FuncC runs fast
y Helps compiler too
GET MORE PERFORMANCE FROM GPU
Single Kernel
FuncA (60VGPRS)
FuncB (200VGPRS)
FuncC(10VGPRS)
Mul8ple Kernels
FuncA (60VGPRS)
FuncB (200VGPRS)
FuncC(10VGPRS)
L
L
L
L
J
J
33 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
W
WHAT IS THE SCALAR ARCHITECTURE?
y Best for vector computa8on
y Low efficiency on scalar computa8on
y Physical concurrent execu8on ‒ 16 work items ‒ 4 ALU opera8ons each ‒ Total: 16x4 ALU opera8ons
y 1 SIMD is running in a CU
y Difficult to fill xyzw
y If not filled, we waste HW cycle
y Good for both vector and scalar computa8on
y Need more work groups to fill GPU
y Physical concurrent execu8on ‒ 16x4 work items ‒ 1 ALU opera8on each ‒ Total: 16x4 ALU opera8ons
y 4 SIMDs are running in a CU
y 4x more work groups are necessary to fill HW
VLIW4 (NI), VLIW5 (EG) Scalar Architecture (SI, CI)
. . . X Y Z W
Lane 0
X Y Z W
Lane 1
X Y Z W
Lane 2
X Y Z W
Lane 3
X Y Z W
Lane 4
X Y Z W
Lane 15
. . .
SIMD0 SIMD0 SIMD1 SIMD3
34 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
ANOTHER SOLUTION
y Spli{ng computa8on into mul8ple kernels ‒ Primary ray gen kernel ‒ Trace kernel ‒ Evaluate DI kernel ‒ Etc
y BeXer HW u8liza8on ‒ Less divergence ‒ Higher HW occupancy
35 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
OTHER BENEFITS?
Ray Genera8on
First Hit
First Hit (Normal)
Direct Illumina8on
Indirect Illumina8on
y Maintainability ‒ Debug is not as easy as we do on C, C++ ‒ Not all compilers are mature
‒ Can hit to a compiler bug, which is hard to debug ‒ Helps compiler
‒ By spli{ng kernels, we can isolate the issue ‒ If the code is developed by many people, this is important
y Extendibility, Portability ‒ Easy to extend features ‒ Primary Ray Gen Kernel
‒ Add another camera projec8on ‒ Ray Cas8ng Kernel
‒ Easy to add another primi8ves (e.g., vector displacement) ‒ Take it out for physics ray cas8ng queries
PORTING TO OPENCL (SECOND ATTEMPT)
37 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SPLITTING KERNELS
forAll() { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }
TRANSFORMING CPU CODE
{ forAll() PrimaryRayGenKernel(); while(1) { forAll() TraceKernel(); if( !any( hits ) ) break; forAll() SampleLightKernel(); forAll() TraceKernel(); forAll() AccumulateDIKernel(); forAll() SampleNextRayKernel(); } }
Naïve CPU implementa<on Preparing for OpenCL implementa<on
Each for loop => A kernel execu8on
38 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SPLITTING KERNELS
forAll() { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }
{ launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }
CPU implementa<on Host code
39 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SPLITTING KERNELS
{ launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }
__kernel void PrimaryRayGenKernel(); y Generate rays for all pixels in parallel __kernel void TraceKernel(); y Compute intersec8on for all rays in parallel __kernel void SampleLightKernel(); y Sample light for all hit points in parallel __kernel void AccumulateDItKernel(); y Accumulate DI for all hit points in parallel __kernel void SampleNextRayKernel(); y Generate bounced rays for all hit points in parallel
Host Code Device Code
40 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
DESIGN
{ launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }
LOCALIZE BRANCH
{ launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }
Camera Type Brdf Type
41 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
DESIGN
y Cannot keep state between kernels y Ray state needs to be saved to/restored from
global memory
RAY STATE
y Example ‒ Ray genera8on + ray direc8on visualiza8on
__kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; // save } __kernel void VisualizeRayKernel(__global ...) { RayState s = gRayState[get_global_id(0)]; // restore Ray ray = gRay[get_global_id(0)]; gFb[s.m_pixelIdx] = Ray_getDir( ray ); }
struct RayState { float4 m_throughput; int2 m_randomNumber; int m_pixelIdx; };
42 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
DESIGN
y Cannot keep state between kernels y Ray state needs to be saved to/restored from
global memory
RAY STATE
y Example ‒ Ray genera8on + ray direc8on visualiza8on
__kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; // save } __kernel void VisualizeRayKernel(__global ...) { RayState s = gRayState[get_global_id(0)]; // restore Ray ray = gRay[get_global_id(0)]; gFb[s.m_pixelIdx] = Ray_getDir( ray ); }
Out: Ray
In: pixelIdx
Out: Ray
In: pixelIdx
Out: Ray
In: pixelIdx
Out: Ray
In: pixelIdx
Prim
aryRayGe
nKerne
l
In: Ray
Out: Pixel color
In: Ray
Out: Pixel color
In: Ray
Out: Pixel color
In: Ray
Out: Pixel color
Visualize
Kernel
Global memory (state)
43 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
RAY COMPACTION
y Sparse data lowers SIMD u8liza8on
y Without compac8on ‒ 3 SIMD execu8ons ‒ Occupancy (3/8, 1/8, 4/8) Primary
Secondary
44 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
RAY COMPACTION
y Sparse data lowers SIMD u8liza8on
y Without compac8on ‒ 3 SIMD execu8ons ‒ Occupancy (3/8, 1/8, 4/8)
y With compac8on ‒ 1 SIMD execu8on ‒ Occupancy (7/8)
*When rays are created for all pixels, this is not necessary
Primary
Secondary
Primary
Secondary
45 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
RAY COMPACTION
y Sparse data lowers SIMD u8liza8on
y Without compac8on ‒ 3 SIMD execu8ons ‒ Occupancy (3/8, 1/8, 4/8)
y With compac8on ‒ 1 SIMD execu8on ‒ Occupancy (7/8)
*When rays are created for all pixels, this is not necessary
y No need to write a compac8on kernel y Can compact using global atomics
‒ Prepare a counter (gRayCount) ‒ Perform atomic increment to reserve memory ‒ BeXer to do atomics in WG first, then do an atomic add per WG
__kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; }
Primary
Secondary
Primary
Secondary
46 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
DIRECT ILLUMINATION COMPUTATION
y SampleLightKernel ‒ Want to keep the work uniform
‒ Different # of light sample per ray isn’t good ‒ Compute contribu8on from one point on a light
y Simple approach ‒ Select a light ‒ Select a point on a light ‒ Compute DI without occlusion term
y More sophis8cated light sampling ‒ Using poten8al contribu8on for PDF ‒ Forward+ style light culling
__kernel void SampleLightKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; shadowRay, lfnDotV = Light_Sample( ray, s ); gShadowRay[GIDX] = shadowRay; gDi[GIDX] = lfnDotV; gRayState[GIDX] = s; }
47 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
DIRECT ILLUMINATION COMPUTATION
y SampleLightKernel
y TraceRayKernel ‒ Check if the point on the light is visible or not ‒ Reuse code
y AccumulateDIKernel ‒ If the ray is not blocked, accumulate the result
__kernel void SampleLightKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; shadowRay, lfnDotV = Light_Sample( ray, s ); gShadowRay[GIDX] = shadowRay; gDi[GIDX] = lfnDotV; gRayState[GIDX] = s; } __kernel void AccumulateDIKernel(__global ...) { Hit shadowHit = gShadowHit[GIDX]; float4 di = gDi[GIDX]; if( !shadowHit ) gFb[GIDX] += di; }
48 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SAMPLE NEXT RAY
y Compute next ray by sampling BRDF
y Store ray and ray state __kernel void SampleNextRayKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; Hit hit = gHit[GIDX]; if( !hit ) return; nextRay, s = Brdf_Sample( ray, s ); int dst = atom_inc( &gRayCount ); gRayNext[dst] = nextRay; gRayStateNext[dst] = s; }
49 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
TRACE KERNEL
y BVH is used for accelera8on structure ‒ Index is used to describe hierarchy structure (no pointer)
0 1 2 3 4 5 6 7 8 9
1 2
0
4 3 5 6
Mesh0 xform0
Mesh1 xform1
Mesh2 xform2
Mesh3 xform3
50 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
TRACE KERNEL
y BVH is used for accelera8on structure ‒ Index is used to describe hierarchy structure (no pointer)
y 2 level BVH ‒ Top: stores an object in a leaf (object index, transform) ‒ BoXom: stores a primi8ve (triangle, quad) in a leaf
0 1 2 3 4 5 6 7 8 9
1 2
0
4 3 5 6
Top BVH
Bohom BVH
Mesh0 xform0
Mesh1 xform1
Mesh2 xform2
Mesh3 xform3
51 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
TRACE KERNEL
y BVH is used for accelera8on structure ‒ Index is used to describe hierarchy structure (no pointer)
y 2 level BVH ‒ Top: stores an object in a leaf (object index, transform) ‒ BoXom: stores a primi8ve (triangle, quad) in a leaf
y Store those BVHs in a single memory ‒ Traverse top tree ‒ Hit a leaf, transform the ray into object space ‒ Traverse boXom tree ‒ On exit, transform the ray back to world space
Bohom A Bohom B Bohom C Bohom D
0 1 2 3 4 5 6 7 8 9
1 2
0
root idx
Top
4 3 5 6
Top BVH
Bohom BVH
Mesh0 xform0
Mesh1 xform1
Mesh2 xform2
Mesh3 xform3
52 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SO FAR
y Explained an OpenCL implementa8on of a simple path tracer
y Easy to extend from here
y Extension can be done by swapping one or two kernels ‒ Material system, Shader ‒ Light sampling ‒ Support for different type of primi8ves ‒ Ray caster + spa8al accelera8on structure
ADVANCED TOPICS
54 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
INSTANCING
y Powerful technique to increase geometric complexity
y Small memory overhead ‒ Shares geometric informa8on (vertex, normal etc) ‒ Shares BVH ‒ Stores object transform
1 2
0
4 3 5 6
Top BVH
Mesh0 xform0
Mesh0 xform1
Mesh0 xform2
Mesh1 xform3
55 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
INSTANCING
y Powerful technique to increase geometric complexity
y Small memory overhead ‒ Shares geometric informa8on (vertex, normal etc) ‒ Shares BVH ‒ Stores object transform
Bohom A Top Bohom B
1 2
0
4 3 5 6
Top BVH
Mesh0 xform0
Mesh0 xform1
Mesh0 xform2
Mesh1 xform3
Bohom BVH
56 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
LAYERED MATERIAL
y Binary tree of BRDFs y Leaf node
‒ BRDF y Internal node
‒ Blend func8on ‒ Fresnel blend, Linear blend
y Evaluate one BRDF at a 8me ‒ Traverse binary tree ‒ Random sampling at internal node
Reflect
Microfacet Diffuse
pdf=0.25
pdf=0.5
0.5 0.5
0.5 0.5
pdf=0.25
57 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
LAYERED MATERIAL
y Binary tree of BRDFs y Leaf node
‒ BRDF y Internal node
‒ Blend func8on ‒ Fresnel blend, Linear blend
y Evaluate one BRDF at a 8me ‒ Traverse binary tree ‒ Random sampling at internal node
{ launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SelectBRDFKernel ); launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }
MORE ADVANCED TOPICS
59 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
VR
y Latency is super important
y To improve a frame rendering 8me, ‒ Used mul8ple GPUs ‒ Foveated rendering
y More than 60fps on 4 Hawaii GPUs ‒ 6M triangles ‒ 32 shadow rays/sample ‒ 2 AA rays/sample
60 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
VR
y Latency is important
y To improve a frame rendering 8me, ‒ Used mul8ple GPUs ‒ Foveated rendering
y More than 60fps on 4 Hawaii GPUs ‒ 6M triangles ‒ 32 shadow rays/sample ‒ 2 AA rays/sample
{ launch( VRPrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } launch( FillPixelKernel ); }
61 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
DISPLACEMENT MAPPING
y Powerful technique to increase geometric complexity
y Pre tessella8on ‒ Required memory is too large ‒ GPU memory is too small
y Direct ray tracing ‒ When hit a patch, tessellate and displace
Base mesh Vector displacement map With vector displacement
Fig. from hXp://support.nextlimit.com/display/mxdocsv3/Displacement+component
62 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
DISPLACEMENT MAPPING
y Powerful technique to increase geometric complexity
y Pre tessella8on ‒ Required memory is too large ‒ GPU memory is too small
y Direct ray tracing ‒ When hit a patch, tessellate and displace
y To amor8ze tessella8on, displacement cost, batch ray intersec8on
y Need to change TraceKernel
63 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
DISPLACEMENT MAPPING
y TraceKernel ‒ If a ray hit a quad with displacement map, save (ray, primi8ve) to a buffer
‒ Sort (ray, primi8ve) pairs by primi8ve index ‒ Process primi8ves in the list in parallel
y For each patch ‒ Build quad BVH in parallel ‒ Cast rays in parallel
y Key is work buffer memory alloca8on Level 0 (1 node)
BVH Comt
Ray Cast Ray Cast Ray Cast Ray Cast
Level 2 (16 nodes)
BVH Comt
BVH Comt
BVH Comt
BVH Comt
BVH Comt
BVH Comt
Ray Cast Ray Cast Ray Cast Ray Cast Ray Cast Ray Cast
Level 1 (4 nodes)
BVH Comt
BVH Comt
BVH Comt
BVH Comt
64 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
VECTOR DISPLACEMENT IN ACTION
Base mesh Vector displacement
52GB memory if pre tessella8on is used
65 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
OPEN SHADING LANGUAGE
y OSL itself has nothing to do with OpenCL y Many use cases
y Using OSL in OCL renderer ‒ Translate OSL to
‒ OCL kernel ‒ SPIR
‒ Feed those to OCL run8me ‒ clBuildProgram ‒ clCreateKernel
y OSL example surface
maXe [[ string descrip8on = "Lamber8an diffuse material" ]]
(float Kd = 1 [[float UImin = 0, float UIsozmax = 1 ]],
color Cs = 1 [[float UImin = 0, float UImax = 1 ]],
string texname = “diffuse.tex” [[int texture_slot = 1]] )
{
Ci = Kd * Cs * noise(5.0 * P) * diffuse (N);
}
66 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SPIR
y Standard Portable Intermediate Representa8on
y Based on LLVM IR (32, 64)
y Useful to ship OpenCL Apps y Device independent y OpenCL did not have usable binary code representa8on
‒ Binary for each device x driver ‒ Combina8on explode ‒ Embed kernel as string
‒ Load source, clCreateProgramWithSource ‒ Dump binary, clGetProgramInfo + CL_PROGRAM_BINARIES ‒ Load binary, clCreateProgramWithBinary
y OpenCL implementa8on has to support cl_khr_spir extension ‒ Works on AMD, Intel (OpenCL 1.2) ‒ SPIR 2.0 is coming with OpenCL 2.0
67 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014
SPIR
y Offline compiler ‒ clang-‐spir* -‐cc1 -‐emit-‐llvm-‐bc -‐triple spir-‐unknown-‐unknown -‐cl-‐spir-‐compile-‐op8ons ”-‐x spir" -‐include <opencl_spir.h> -‐o <output> <input>
‒ clBuildProgram with “-‐x spir -‐spir-‐std=CL1.2”
y Use host OpenCL API ‒ clCompileProgram + Op8on ‒ clGetProgramInfo + CL_PROGRAM_BINARIES
*hXps://github.com/KhronosGroup/SPIR
CREATE SPIR BINARY
68 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014