introduction to monte carlo ray tracing, opencl implementation (cedec 2014)

INTRODUCTION TO MONTE CARLO RAY TRACING

OPENCL IMPLEMENTATION

TAKAHIRO HARADA 9/2014

2 | Introduc8on to Monte Carlo Ray Tracing OpenCL implementa8on | SEPT 3, 2014

RECAP OF LAST SESSION

y  Talked about theory y  BRDFs

‒ Reflec8on, Refrac8on, Diffuse, Microfacet

y  Fresnel is everywhere y  Monte Carlo Ray Tracing

‒  Intui8ve understanding of Monte Carlo Integra8on ‒  Simple sampling (Random sampling) ‒ BeXer sampling (Importance sampling) ‒  Layered material

hXp://www.slideshare.net/takahiroharada/introduc8on-‐to-‐monte-‐carlo-‐ray-‐tracing-‐cedec2013


REVIEW

for( i, j ) { ray = PrimaryRayGen( camera, pixelLoc ); { hit = Trace( ray ); if( hit ) d( pixelLoc ) += EvaluateDI( ray, hit ); } }

SIMPLE CPU MC RAY TRACER

Direct illumina<on


REVIEW

for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }


Indirect illumina<on


REVIEW

for( i, j ) { ray = PrimaryRayGen( camera, pixelLoc ); { hit = Trace( ray ); if( hit ) d( pixelLoc ) += EvaluateDI( ray, hit ); } }


for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }

Direct illumina<on Indirect Illumina<on


COMPARISON

Direct illumina<on Indirect illumina<on


WHY OPENCL?

y  Speed! ‒ GPU can accelerate it ‒ Why? Faster is the beXer

y  OpenCL is an API for GPU compute

y  OpenCL is not only for graphics programmers

y  OpenCL does not always require a GPU ‒ Runs on CPU too ‒ Runs if there is a CPU (everywhere)

y  If renderer is wriXen in OpenCL, runs on Windows, Linux, MacOSX J

PORTING TO OPENCL (FIRST ATTEMPT)


THINGS TO BE DONE

y  No pointer in OpenCL* y  Change pointer to index y  Stored in a flat memory

y  Not suited for par8al update

*Shared Virtual Memory (OpenCL 2.0)

DATA STRUCTURE


THINGS TO BE DONE

y  Node data for a binary tree ‒  Spa8al accelera8on structure (BVH) ‒  Shading network

y  Buffer<NodeData> nodeData;

DATA STRUCTURE

Node Data

m_m

ax.x

m_m

ax.y

m_m

ax.z

m_m

in.x

m_m

in.y

m_m

in.z

m_child0

m_child1

Node Data

m_m

ax.x

m_m

ax.y

m_m

ax.z

m_m

in.x

m_m

in.y

m_m

in.z

m_child0

m_child1

Node Data

m_m

ax.x

m_m

ax.y

m_m

ax.z

m_m

in.x

m_m

in.y

m_m

in.z

m_child0

m_child1


THINGS TO BE DONE

y  Material ‒ Texture entry

y  Buffer<Material> material;

y  Buffer<char> texData; y  Buffer<uint> texTable;

DATA STRUCTURE

Material0

m_kd

m_ior

m_...

m_kdT

ex

m_iorTex

m_b

umpT

ex

TextureTable

tex0

tex1

tex2

tex3

tex4

. . .

Texture0

m_h

eade

r

m_d

ata

Texture1

m_h

eade

r

m_d

ata

Texture2

m_h

eade

r

m_d

ata

Material1

m_kd

m_ior

m_...

m_kdT

ex

m_iorTex

m_b

umpT

ex

. . .

Texture3

m_h

eade

r

m_d

ata


THINGS TO BE DONE

for( i, j ) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if(!hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }

WRITING OPENCL KERNEL

__kernel void PtKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if(!hit ) return; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }

CPU code OpenCL kernel


IT WORKS BUT…

y  This approach is simple

y  But a lot of issues


DRAWBACKS

y  Likely not u8lize hardware efficiently ‒  SIMD divergence ‒ GPU occupancy (latency)

y  Maintainability

y  Extendibility, Portability

PERFORMANCE

GPU ARCHITECTURE


OPENCL ON CPU

y  Processing element executes Work item (thread) ‒ A SIMD lane (4*)

y  Compute unit executes Work group (thread group) ‒ A core (8*) ‒ # of processing elements != # of work items

y  Compute device executes Kernel (shader) ‒ A CPU ‒ # of compute units != # of work groups

* On AMD FX-‐8350

Work item

Work group Compute Unit

Processing element

. . .

Kernel


GPU VS CPU

y  Processing element executes Work item ‒ A SIMD lane (64*)

y  Compute unit executes Work group ‒ A SIMD engine (44x4*) ‒ # of processing elements != # of work items

y  Compute device executes Kernel ‒ A GPU ‒ # of compute units != # of work groups

* On AMD Radeon R9 290X

Work item

Work group Compute Unit (4)

Processing element

. . .

Kernel

Compute Unit (64)

Processing element

...

CPU GPU


HIGH LEVEL DESCRIPTION

y  Today’s GPU is similar to a CPU (if you look at very high level) ‒ GPU is an extremely wide CPU ‒ Many cores ‒ Wide SIMD

y  AMD Radeon R9 290X GPU ‒ 176 = 44x4 SIMD engines (cores) ‒ 64 wide SIMD

y  But different in ‒  SIMD width (very wide) ‒  Limited local resources ‒  Strategy to hide latency

y  Knowing those are the key to exploit the performance


SIMD DIVERGENCE

y  SIMD execu8on = Program counter is shared among SIMD lanes

y  If it diverges in branches, HW u8liza8on decreases a lot (Gets easier to diverge on wide SIMD)

int funcA() { int value = 0; int a = computeA(); if( a == 0 ) value = compute0(); else if( a == 1 ) value = compute1(); else if( a == 2 ) value = compute2(); else if( a == 3 ) value = compute3(); return value; }

Lane0 Lane1 Lane2 Lane3 Lane4 Lane5 Lane6 Lane7


SIMD DIVERGENCE






WIDE SIMD EXECUTION





J J

L

J

L

L

L


SIMD DIVERGENCE





J J

J

J

J

J

J


LATENCY

y  Highest latency is from memory access

y  CPU prevent it by having larger cache ‒  Latency of cache access is small (fast)

y  Most of the memory access do not go to memory

y  CPU can run at full speed un8l a cache miss

y  # of concurrent execu8on on the GPU is far much larger than CPU ‒ More than 11k (= 44x4x64) work items

y  GPU cache is not large enough to absorb memory requests from those if they all requests different part of memory

y  Strategy ‒ Keep memory access as local as possible (not realis8c for prac8cal apps) ‒ Uses GPU mechanism for latency hiding


GPU LATENCY HIDING

y  GPU can execute at full speed if there are only ALU instruc8ons (Inst. 0 -‐ 2) *

y  Stalls on memory access instruc8on (Inst. 3)

* Can hide latency using logical vector

Inst. 0

Inst. 1

Inst. 2

Inst. 3 (MemAccess)

Lane0 Lane1 Lane2 Lane3 Lane1

Inst. 4


GPU LATENCY HIDING

y  When stalled, switch to another work group

y  Could fill the stall with instruc8ons from WG1

y  A SIMD of GPU needs to process mul8ple WGs at the same 8me to hide latency (or maximize its throughput)

WG0: Inst. 0

WG0: Inst. 1

WG0: Inst. 2

WG0: Inst. 3


WG0: Inst. 4

WG1: Inst. 0

WG1: Inst. 1

WG1: Inst. 2

WG1: Inst. 3



HOW MANY WGS CAN WE EXECUTE PER SIMD

y  10 wavefronts (64WIs) per SIMD is the max

y  It depends on local resource usage of the kernel y  VGPR usage is ozen the problem

y  Share 256 VGPRs among n work groups ‒ 1 wavefront, 256VGPRs LL ‒ 2 wavefronts, 128VGPRs ‒ 4 wavefronts, 64VGPRs J ‒ 10 wavefronts, 25VGPRs

y  Share 16KB LDS among n work groups ‒ 1 work group, 16KB LL ‒ 2 work group, 8KB ‒ 4 work group, 4KB J

y  VGPRs ‒ Registers used by vector ALUs ‒ 64KB/SIMD ‒ 256 VGPRs/SIMD lane (= 64KB/64/4)

y  LDS (Local data share) ‒ 64KB/CU (CU == 4SIMD) ‒ 32KB/SIMD


ADVICE TO REDUCE VGPR PRESSURE

y  Don’t write a large kernel y  If the program can be split into several pieces, split

them into several kernels ‒  Single kernel approach

‒  VGPR usage of the kernel is 200 = max(60, 200, 10) ‒  1 wavefront per SIMD ‒  Bad for latency hiding

‒ Mul8ple kernel approach ‒  FuncA: 4 wavefronts per SIMD ‒  FuncB: 1 wavefronts per SIMD ‒  FuncC: 10 wavefronts per SIMD ‒  FuncB is bad, but FuncA, FuncC runs fast

y  Helps compiler too

GET MORE PERFORMANCE FROM GPU

Single Kernel

FuncA (60VGPRS)

FuncB (200VGPRS)

FuncC(10VGPRS)

Mul8ple Kernels

FuncA (60VGPRS)

FuncB (200VGPRS)

FuncC(10VGPRS)

L

L

L

L

J

J


W

WHAT IS THE SCALAR ARCHITECTURE?

y  Best for vector computa8on

y  Low efficiency on scalar computa8on

y  Physical concurrent execu8on ‒ 16 work items ‒ 4 ALU opera8ons each ‒ Total: 16x4 ALU opera8ons

y  1 SIMD is running in a CU

y  Difficult to fill xyzw

y  If not filled, we waste HW cycle

y  Good for both vector and scalar computa8on

y  Need more work groups to fill GPU

y  Physical concurrent execu8on ‒ 16x4 work items ‒ 1 ALU opera8on each ‒ Total: 16x4 ALU opera8ons

y  4 SIMDs are running in a CU

y  4x more work groups are necessary to fill HW

VLIW4 (NI), VLIW5 (EG) Scalar Architecture (SI, CI)

. . . X Y Z W

Lane 0

X Y Z W

Lane 1

X Y Z W

Lane 2

X Y Z W

Lane 3

X Y Z W

Lane 4

X Y Z W

Lane 15

. . .

SIMD0 SIMD0 SIMD1 SIMD3


ANOTHER SOLUTION

y  Spli{ng computa8on into mul8ple kernels ‒ Primary ray gen kernel ‒ Trace kernel ‒ Evaluate DI kernel ‒ Etc

y  BeXer HW u8liza8on ‒  Less divergence ‒ Higher HW occupancy


OTHER BENEFITS?

Ray Genera8on

First Hit

First Hit (Normal)

Direct Illumina8on

Indirect Illumina8on

y  Maintainability ‒ Debug is not as easy as we do on C, C++ ‒ Not all compilers are mature

‒  Can hit to a compiler bug, which is hard to debug ‒ Helps compiler

‒ By spli{ng kernels, we can isolate the issue ‒  If the code is developed by many people, this is important

y  Extendibility, Portability ‒ Easy to extend features ‒ Primary Ray Gen Kernel

‒  Add another camera projec8on ‒ Ray Cas8ng Kernel

‒  Easy to add another primi8ves (e.g., vector displacement) ‒  Take it out for physics ray cas8ng queries

PORTING TO OPENCL (SECOND ATTEMPT)


SPLITTING KERNELS

forAll() { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }

TRANSFORMING CPU CODE

{ forAll() PrimaryRayGenKernel(); while(1) { forAll() TraceKernel(); if( !any( hits ) ) break; forAll() SampleLightKernel(); forAll() TraceKernel(); forAll() AccumulateDIKernel(); forAll() SampleNextRayKernel(); } }

Naïve CPU implementa<on Preparing for OpenCL implementa<on

Each for loop => A kernel execu8on


SPLITTING KERNELS

forAll() { ray, rayState = PrimaryRayGen( camera, pixelLoc ); while(1) { hit = Trace( ray ); if( !hit ) break; d( pixelLoc ) += EvaluateDI( ray, hit, rayState ); ray, rayState = sampleNextRay( ray, hit ); } }

{ launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }

CPU implementa<on Host code


SPLITTING KERNELS


__kernel void PrimaryRayGenKernel(); y  Generate rays for all pixels in parallel __kernel void TraceKernel(); y  Compute intersec8on for all rays in parallel __kernel void SampleLightKernel(); y  Sample light for all hit points in parallel __kernel void AccumulateDItKernel(); y  Accumulate DI for all hit points in parallel __kernel void SampleNextRayKernel(); y  Generate bounced rays for all hit points in parallel

Host Code Device Code


DESIGN


LOCALIZE BRANCH


Camera Type Brdf Type


DESIGN

y  Cannot keep state between kernels y  Ray state needs to be saved to/restored from

global memory

RAY STATE

y  Example ‒ Ray genera8on + ray direc8on visualiza8on

__kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; // save } __kernel void VisualizeRayKernel(__global ...) { RayState s = gRayState[get_global_id(0)]; // restore Ray ray = gRay[get_global_id(0)]; gFb[s.m_pixelIdx] = Ray_getDir( ray ); }

struct RayState { float4 m_throughput; int2 m_randomNumber; int m_pixelIdx; };


DESIGN

y  Cannot keep state between kernels y  Ray state needs to be saved to/restored from

global memory

RAY STATE

y  Example ‒ Ray genera8on + ray direc8on visualiza8on

__kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; // save } __kernel void VisualizeRayKernel(__global ...) { RayState s = gRayState[get_global_id(0)]; // restore Ray ray = gRay[get_global_id(0)]; gFb[s.m_pixelIdx] = Ray_getDir( ray ); }

Out: Ray

In: pixelIdx

Out: Ray

In: pixelIdx

Out: Ray

In: pixelIdx

Out: Ray

In: pixelIdx

Prim

aryRayGe

nKerne

l

In: Ray

Out: Pixel color

In: Ray

Out: Pixel color

In: Ray

Out: Pixel color

In: Ray

Out: Pixel color

Visualize

Kernel

Global memory (state)


RAY COMPACTION

y  Sparse data lowers SIMD u8liza8on

y  Without compac8on ‒ 3 SIMD execu8ons ‒ Occupancy (3/8, 1/8, 4/8) Primary

Secondary


RAY COMPACTION


y  Without compac8on ‒ 3 SIMD execu8ons ‒ Occupancy (3/8, 1/8, 4/8)

y  With compac8on ‒ 1 SIMD execu8on ‒ Occupancy (7/8)

*When rays are created for all pixels, this is not necessary

Primary

Secondary

Primary

Secondary


RAY COMPACTION


y  Without compac8on ‒ 3 SIMD execu8ons ‒ Occupancy (3/8, 1/8, 4/8)

y  With compac8on ‒ 1 SIMD execu8on ‒ Occupancy (7/8)

*When rays are created for all pixels, this is not necessary

y  No need to write a compac8on kernel y  Can compact using global atomics

‒ Prepare a counter (gRayCount) ‒ Perform atomic increment to reserve memory ‒ BeXer to do atomics in WG first, then do an atomic add per WG

__kernel void PrimaryRayGenKernel(__global ...) { ray, rayState = PrimaryRayGen( camera, pixelLoc ); int dst = atom_inc( &gRayCount ); gRay[dst] = ray; gRayState[dst] = rayState; }

Primary

Secondary

Primary

Secondary


DIRECT ILLUMINATION COMPUTATION

y  SampleLightKernel ‒ Want to keep the work uniform

‒ Different # of light sample per ray isn’t good ‒ Compute contribu8on from one point on a light

y  Simple approach ‒  Select a light ‒  Select a point on a light ‒ Compute DI without occlusion term

y  More sophis8cated light sampling ‒ Using poten8al contribu8on for PDF ‒  Forward+ style light culling

__kernel void SampleLightKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; shadowRay, lfnDotV = Light_Sample( ray, s ); gShadowRay[GIDX] = shadowRay; gDi[GIDX] = lfnDotV; gRayState[GIDX] = s; }


DIRECT ILLUMINATION COMPUTATION

y  SampleLightKernel

y  TraceRayKernel ‒ Check if the point on the light is visible or not ‒ Reuse code

y  AccumulateDIKernel ‒  If the ray is not blocked, accumulate the result

__kernel void SampleLightKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; shadowRay, lfnDotV = Light_Sample( ray, s ); gShadowRay[GIDX] = shadowRay; gDi[GIDX] = lfnDotV; gRayState[GIDX] = s; } __kernel void AccumulateDIKernel(__global ...) { Hit shadowHit = gShadowHit[GIDX]; float4 di = gDi[GIDX]; if( !shadowHit ) gFb[GIDX] += di; }


SAMPLE NEXT RAY

y  Compute next ray by sampling BRDF

y  Store ray and ray state __kernel void SampleNextRayKernel(__global ...) { RayState s = gRayState[GIDX]; Ray ray = gRay[GIDX]; Hit hit = gHit[GIDX]; if( !hit ) return; nextRay, s = Brdf_Sample( ray, s ); int dst = atom_inc( &gRayCount ); gRayNext[dst] = nextRay; gRayStateNext[dst] = s; }


TRACE KERNEL

y  BVH is used for accelera8on structure ‒  Index is used to describe hierarchy structure (no pointer)

0 1 2 3 4 5 6 7 8 9

1 2

0

4 3 5 6

Mesh0 xform0

Mesh1 xform1

Mesh2 xform2

Mesh3 xform3


TRACE KERNEL


y  2 level BVH ‒ Top: stores an object in a leaf (object index, transform) ‒ BoXom: stores a primi8ve (triangle, quad) in a leaf

0 1 2 3 4 5 6 7 8 9

1 2

0

4 3 5 6

Top BVH

Bohom BVH

Mesh0 xform0

Mesh1 xform1

Mesh2 xform2

Mesh3 xform3


TRACE KERNEL


y  2 level BVH ‒ Top: stores an object in a leaf (object index, transform) ‒ BoXom: stores a primi8ve (triangle, quad) in a leaf

y  Store those BVHs in a single memory ‒ Traverse top tree ‒ Hit a leaf, transform the ray into object space ‒ Traverse boXom tree ‒ On exit, transform the ray back to world space

Bohom A Bohom B Bohom C Bohom D

0 1 2 3 4 5 6 7 8 9

1 2

0

root idx

Top

4 3 5 6

Top BVH

Bohom BVH

Mesh0 xform0

Mesh1 xform1

Mesh2 xform2

Mesh3 xform3


SO FAR

y  Explained an OpenCL implementa8on of a simple path tracer

y  Easy to extend from here

y  Extension can be done by swapping one or two kernels ‒ Material system, Shader ‒  Light sampling ‒  Support for different type of primi8ves ‒ Ray caster + spa8al accelera8on structure

ADVANCED TOPICS


INSTANCING

y  Powerful technique to increase geometric complexity

y  Small memory overhead ‒  Shares geometric informa8on (vertex, normal etc) ‒  Shares BVH ‒  Stores object transform

1 2

0

4 3 5 6

Top BVH

Mesh0 xform0

Mesh0 xform1

Mesh0 xform2

Mesh1 xform3


INSTANCING


y  Small memory overhead ‒  Shares geometric informa8on (vertex, normal etc) ‒  Shares BVH ‒  Stores object transform

Bohom A Top Bohom B

1 2

0

4 3 5 6

Top BVH

Mesh0 xform0

Mesh0 xform1

Mesh0 xform2

Mesh1 xform3

Bohom BVH


LAYERED MATERIAL

y  Binary tree of BRDFs y  Leaf node

‒ BRDF y  Internal node

‒ Blend func8on ‒  Fresnel blend, Linear blend

y  Evaluate one BRDF at a 8me ‒ Traverse binary tree ‒ Random sampling at internal node

Reflect

Microfacet Diffuse

pdf=0.25

pdf=0.5

0.5 0.5

0.5 0.5

pdf=0.25


LAYERED MATERIAL

y  Binary tree of BRDFs y  Leaf node

‒ BRDF y  Internal node

‒ Blend func8on ‒  Fresnel blend, Linear blend

y  Evaluate one BRDF at a 8me ‒ Traverse binary tree ‒ Random sampling at internal node

{ launch( PrimaryRayGenKernel ); while(1) { launch( TraceKernel ); if( !any( hits ) ) break; launch( SelectBRDFKernel ); launch( SampleLightKernel ); launch( TraceKernel ); launch( AccumulateDIKernel ); launch( SampleNextRayKernel ); } }

introduction to monte carlo ray tracing, opencl implementation (cedec 2014)

Technology

simddivergencey simdexecu8on