CPU is in Focus Again! Implementing DOF on CPU

Advanced Visual Computing 3D Graphics Team

Presenter: Evgeny Gorodetsky, Graphics Software Engineer

Twitter: egorodet

Agenda

Introduction to depth of field effect & techniques

DOF Explorer and post-processing pipeline

DOF Techniques on GPU & with CPU Onloading:

– Traditional: Poisson Disk & Gaussian Blur

– Advanced: Summed Area Tables Gather & Scatter

Performance results on Sandy Bridge processors


DEPTH OF FIELD EXPLAINED: Introduction to DOF

Depth of Field Explained

Common effect in:

– Photography

– Cinematography

– Modern 3D games

Used to direct the viewer's attention

Optical nature of DoF:

– Lens settings: Aperture (f-stop), Focal distance

– Circle of Confusion (CoC)

– Bokeh effect (not addressed)

[Figure: CoC (blur radius), from 0 to Max Blur Radius, plotted against distance from camera (depth) with Near, Focal and Far marks. Two curves: linear approximation and real dependency.]
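The linear approximation of the CoC curve can be sketched as a piecewise-linear ramp on each side of the focal depth. This is a minimal illustration, not the DOF Explorer code: the `DofSettings` struct, its field names and `LinearCoC` are assumptions introduced here.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical lens settings for the linear approximation; names are illustrative.
struct DofSettings {
    float nearDepth;      // depth at which foreground blur reaches its maximum
    float focalDepth;     // depth that is perfectly in focus (CoC = 0)
    float farDepth;       // depth at which background blur reaches its maximum
    float maxBlurRadius;  // upper bound for the blur radius, in pixels
};

// Returns the blur radius (CoC) for a given depth, clamped to [0, maxBlurRadius].
float LinearCoC(float depth, const DofSettings& s) {
    assert(s.nearDepth < s.focalDepth && s.focalDepth < s.farDepth);
    const float t = (depth < s.focalDepth)
        ? (s.focalDepth - depth) / (s.focalDepth - s.nearDepth)
        : (depth - s.focalDepth) / (s.farDepth - s.focalDepth);
    return s.maxBlurRadius * std::min(std::max(t, 0.0f), 1.0f);
}
```

With near = 2, focal = 10, far = 50 and a maximum radius of 8, a pixel at depth 30 lies halfway between focal and far and gets a radius of 4.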

There’s no right DoF technique!

Physically correct reference techniques:

– Ray Tracing

– Accumulation Buffer

Real-time post-processing:

– Gathering techniques: Poisson Disk, Gaussian Blur, Summed Area Table Gather

– Scattering techniques: Summed Area Table Scatter, Heat diffusion simulation

Common challenges:

– Color bleeding from sharp objects in front to blurred objects behind

– Color bleeding from blurred objects behind to sharp objects in front

– Blurriness discontinuities

– Performance depends on resolution!

[Figure: Gathering vs. Scattering, input and output pixels]

Depth of Field Explorer

Post-processing on GPU and with CPU Onloading

Compare DoF techniques:

– On one of three scenes

– Performance & quality

– Runtime settings

Deferred rendering with asynchronous CPU-GPU execution

Performance analysis

Depth of Field techniques (table columns: GPU, CPU):

– Poisson Disk

– Gaussian Blur

– Gaussian Blur mixed with Poisson Disk

– Summed Area Table (SAT) Gather

– Summed Area Table (SAT) Scatter

– Simple MipMap

– Advanced MipMap

Post-Processing Pipeline Infrastructure simplifies CPU Onloading

Automatic resources management on GPU and CPU

Deferred execution mode in CPU Onloading:

– Runs compute work on the CPU while the GPU is busy with other work

– Hides data transfer latency

Preview of intermediate resources

Integrated performance analysis tools

Pipeline Diagram:

Defined by developer:

– Stage 1 render: Render Scene

– Stage 1 output pins: Color [size, format], Depth [size, format]

– Stage 2 render: Poisson Disk DoF

– Stage 2 input pins

– Stage 2 output pin: Color [size, format]

Created by Pipeline infrastructure:

– Stage 1-2 Intermediate Resources

– Stage 1 Render Target Views

– Stage 2 Shader Resource Views

– Stage 2 Screen Render Target

Depth of Field Explorer

Pipeline Oscilloscopes (F6) for CPU & GPU

Pipeline Preview (F5)

DX and UI Controls

Common explorer controls

Technique-specific controls

TRADITIONAL DOF TECHNIQUES: Poisson Disk & Gaussian Blur on GPU & CPU

Poisson Disk DOF Technique

Averages the color of randomly placed Poisson disk samples around each pixel

Easy to implement on GPU

A poor fit for the CPU because of its random memory access pattern

Used for Bokeh simulation in some games

Variable number of Poisson taps can be generated in DOF Explorer
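As a rough sketch of the gather step, the function below averages a fixed set of disk taps scaled by the pixel's blur radius. Everything here is an assumption made for illustration: the `GrayImage` type, the single gray channel, clamp-to-edge addressing, and especially the eight offsets, which are hand-picked points inside the unit disk rather than a generated Poisson distribution.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical single-channel image with clamp-to-edge addressing.
struct GrayImage {
    int width;
    int height;
    std::vector<float> pixels;
    float At(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return pixels[y * width + x];
    }
};

// Placeholder tap offsets inside the unit disk (illustrative values only).
constexpr float kDiskOffsets[8][2] = {
    {-0.61f, -0.33f}, {0.12f, -0.87f}, {0.73f, -0.41f}, {-0.24f, 0.18f},
    {0.41f, 0.22f},   {-0.78f, 0.47f}, {0.05f, 0.91f},  {0.66f, 0.69f}};

// Averages the taps around (x, y); the disk is scaled by the pixel's CoC radius.
float PoissonDiskGather(const GrayImage& img, int x, int y, float cocRadius) {
    assert(cocRadius >= 0.0f);
    float sum = 0.0f;
    for (const auto& offset : kDiskOffsets) {
        const int sx = x + static_cast<int>(std::lround(offset[0] * cocRadius));
        const int sy = y + static_cast<int>(std::lround(offset[1] * cocRadius));
        sum += img.At(sx, sy);  // scattered reads: the random access noted above
    }
    return sum / 8.0f;
}
```

The scattered `At` calls show why this pattern is friendlier to GPU texture units than to CPU caches.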


Gaussian Blur DOF Technique

Convolution of NxN neighbor pixels with pre-computed weights:

\[ G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}; \quad C(x, y) = \sum_{i=1}^{N} \sum_{j=1}^{N} G(x_i, y_j) \cdot f(x_i, y_j) \]

Decomposed into 2 passes:

– Vertical pass

– Horizontal pass

\[ G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}; \quad C(x, y) = \sum_{i=1}^{N} \sum_{j=1}^{N} G(x_i) \cdot G(y_j) \cdot f(x_i, y_j) \]

Implementation:

– Traditional for GPU in pixel shader

– Novel for CPU, accelerated with TBB & SSE
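The two-pass decomposition can be sketched in plain C++ as follows. This is a single-channel, single-threaded illustration without TBB or SSE. The function names, the clamp-to-edge border handling, and the choice to normalize the weights so they sum to 1 (instead of using the analytic prefactor) are assumptions, not details taken from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Pre-computes 2*radius+1 Gaussian weights, normalized to sum to 1.
std::vector<float> GaussianWeights(int radius, float sigma) {
    assert(radius >= 0 && sigma > 0.0f);
    std::vector<float> weights(2 * radius + 1);
    float sum = 0.0f;
    for (int k = -radius; k <= radius; ++k) {
        weights[k + radius] = std::exp(-static_cast<float>(k * k) / (2.0f * sigma * sigma));
        sum += weights[k + radius];
    }
    for (float& w : weights) w /= sum;
    return weights;
}

// One 1D pass: (dx, dy) = (1, 0) blurs horizontally, (0, 1) vertically.
static std::vector<float> BlurPass(const std::vector<float>& src, int width, int height,
                                   const std::vector<float>& weights, int dx, int dy) {
    const int radius = static_cast<int>(weights.size()) / 2;
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int sx = std::min(std::max(x + k * dx, 0), width - 1);
                const int sy = std::min(std::max(y + k * dy, 0), height - 1);
                sum += weights[k + radius] * src[sy * width + sx];
            }
            dst[y * width + x] = sum;
        }
    }
    return dst;
}

// Vertical pass followed by horizontal pass: 2N taps per pixel instead of N*N.
std::vector<float> SeparableGaussianBlur(const std::vector<float>& src, int width, int height,
                                         int radius, float sigma) {
    assert(src.size() == static_cast<std::size_t>(width) * height);
    const std::vector<float> weights = GaussianWeights(radius, sigma);
    const std::vector<float> vertical = BlurPass(src, width, height, weights, 0, 1);
    return BlurPass(vertical, width, height, weights, 1, 0);
}
```

Each output row of a pass depends only on the pass input, so the outer `y` loop is the natural candidate for splitting across threads.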


Gaussian Blur Pipeline

Render Scene → Color 1280 x 800, Depth 1280 x 800

Resize X 0.5 → Blurred Color 640 x 400

Gaussian Horiz. Blur → Blurred Color 640 x 400

Gaussian Vert. Blur → Blurred Color 640 x 400

DoF Simple Combine → Color 1280 x 800

Execution: GPU | CPU / GPU | GPU

Gaussian Blur on CPU: Multi-threading with TBB

[Diagram: 1. Vertical Pass and 2. Horizontal Pass. Each pass applies the Gaussian weights F0 … F4 and is split across threads with tbb::parallel_for.]

Gaussian Blur on CPU: Vectorization with SSE 4

[Diagram: RGBA pixels stored as consecutive floats (R0 G0 B0 A0 R1 G1 B1 A1 R2 G2 B2 A2 R3 …) are multiplied, one SSE register of four floats at a time, by Gaussian weights replicated across the four channels (F0 F0 F0 F0, F1 F1 F1 F1, F2 F2 F2 F2, …) and summed into output pixels (R0’ G0’ B0’ A0’, R1’ G1’ B1’ A1’, …). Shown for the Vertical Pass and for the Horizontal Pass (cache friendly).]
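A minimal sketch of the per-pixel SSE step: one register holds an RGBA pixel, the scalar weight is broadcast to all four lanes, and the products are accumulated. The function name and the `stride` parameter are assumptions. Only basic SSE intrinsics are used here (nothing specific to SSE 4), and a scalar fallback is compiled on targets without SSE.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_AMD64)
#include <xmmintrin.h>
#define DOF_SKETCH_HAS_SSE 1
#endif

// Weighted sum of `taps` RGBA pixels that lie `stride` floats apart in memory.
// stride = 4 walks along a row (horizontal pass); stride = 4 * imageWidth walks
// down a column (vertical pass). `out` receives the four filtered channels.
void WeightedSumRGBA(const float* src, std::size_t stride, const float* weights,
                     std::size_t taps, float* out) {
    assert(src != nullptr && weights != nullptr && out != nullptr);
#ifdef DOF_SKETCH_HAS_SSE
    __m128 acc = _mm_setzero_ps();
    for (std::size_t k = 0; k < taps; ++k) {
        const __m128 pixel = _mm_loadu_ps(src + k * stride);  // R G B A
        const __m128 weight = _mm_set1_ps(weights[k]);        // F F F F
        acc = _mm_add_ps(acc, _mm_mul_ps(pixel, weight));
    }
    _mm_storeu_ps(out, acc);
#else
    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t k = 0; k < taps; ++k)
        for (int c = 0; c < 4; ++c) acc[c] += src[k * stride + c] * weights[k];
    for (int c = 0; c < 4; ++c) out[c] = acc[c];
#endif
}
```

With `stride = 4` consecutive taps are adjacent in memory, which is the cache-friendly case.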

Gaussian Blur: Performance results

[Chart: Gaussian Blur speedup with TBB parallel_for. Time in milliseconds (axis 0 to 18) for 1 Thread and 8 Threads; series: GPU Rendering, CPU Kernel Time; data labels: 13.7, 4.4, 3.2, 5.6.]

ADVANCED DOF TECHNIQUES: Summed Area Tables Gather & Scatter

Summed Area Tables

Source Table, \( P = [p_{ij}] \):

        1   2   3   4
    1   0   7   2   4
    2   1   4   1   2
    3   6   1   2   0
    4   0   3   5   2

Summed Area Table (SAT):

        1   2   3   4
    1   0   7   9  13
    2   1  12  15  21
    3   7  19  24  30
    4   7  22  32  40

\[ S_{mn} = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \]

Averaging values in an area of the source table by SAT (LR, UR, LL, UL are the SAT values at the lower-right, upper-right, lower-left and upper-left corners):

\[ P_{area} = \frac{LR - UR - LL + UL}{width \times height} \]

Enables averaging over rectangles of any size in constant time, with just 4 SAT texture reads!
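A small sketch of both halves, building the table and querying an area with four reads. The `Table` alias and function names are assumptions. It uses zero-based, inclusive indices and treats reads outside the table as 0, and it assumes all rows have the same length.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Table = std::vector<std::vector<double>>;

// Builds S[r][c] = sum of p over rows 0..r and columns 0..c.
Table BuildSat(const Table& p) {
    Table s(p.size());
    for (std::size_t r = 0; r < p.size(); ++r) {
        s[r].resize(p[r].size());
        for (std::size_t c = 0; c < p[r].size(); ++c) {
            const double up = r > 0 ? s[r - 1][c] : 0.0;
            const double left = c > 0 ? s[r][c - 1] : 0.0;
            const double diag = (r > 0 && c > 0) ? s[r - 1][c - 1] : 0.0;
            s[r][c] = p[r][c] + up + left - diag;
        }
    }
    return s;
}

// Average over the inclusive rectangle rows r0..r1, columns c0..c1.
double AverageArea(const Table& s, int r0, int c0, int r1, int c1) {
    assert(0 <= r0 && r0 <= r1 && 0 <= c0 && c0 <= c1);
    auto at = [&s](int r, int c) { return (r < 0 || c < 0) ? 0.0 : s[r][c]; };
    // LR - UR - LL + UL: four reads, whatever the rectangle size.
    const double sum = at(r1, c1) - at(r0 - 1, c1) - at(r1, c0 - 1) + at(r0 - 1, c0 - 1);
    return sum / ((r1 - r0 + 1) * (c1 - c0 + 1));
}
```

For the 4 x 4 source table of the slide, the central 2 x 2 block (4, 1, 1, 2) is recovered as 24 - 9 - 7 + 0 = 8, giving an average of 2.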

Gathering vs. Scattering

[Figure: input and output pixels for Gathering and for Scattering]
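The difference between the two approaches shows up as soon as the blur radius varies per pixel. The 1D sketch below is an illustration under assumptions (box weights 1/(2r+1), windows clipped at the borders, hypothetical function names): gathering uses the radius of the output pixel, scattering uses the radius of the input pixel.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Gathering: each OUTPUT pixel reads the inputs inside its own radius.
std::vector<float> GatherBlur(const std::vector<float>& in, const std::vector<int>& radius) {
    assert(in.size() == radius.size());
    const int n = static_cast<int>(in.size());
    std::vector<float> out(in.size(), 0.0f);
    for (int o = 0; o < n; ++o) {
        const int r = radius[o];
        for (int i = std::max(0, o - r); i <= std::min(n - 1, o + r); ++i)
            out[o] += in[i] / static_cast<float>(2 * r + 1);
    }
    return out;
}

// Scattering: each INPUT pixel writes its color to the outputs inside its own radius.
std::vector<float> ScatterBlur(const std::vector<float>& in, const std::vector<int>& radius) {
    assert(in.size() == radius.size());
    const int n = static_cast<int>(in.size());
    std::vector<float> out(in.size(), 0.0f);
    for (int i = 0; i < n; ++i) {
        const int r = radius[i];
        for (int o = std::max(0, i - r); o <= std::min(n - 1, i + r); ++o)
            out[o] += in[i] / static_cast<float>(2 * r + 1);
    }
    return out;
}
```

For input {0, 9, 0} with radii {0, 1, 0}, gathering yields {0, 3, 0} because the sharp neighbors never read the blurred pixel, while scattering yields {3, 3, 3} because the blurred pixel spreads onto them.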

SAT Gather DoF pipeline

Render Scene → Color 8 bit/ch., Depth

Build SAT → Color 32 bit/ch. (intermediate: Color Temp)

SAT Gather DoF → Color 8 bit/ch.

Execution: GPU | CPU / GPU | GPU

Building SAT on GPU in Pixel Shader

Source:  1  2     3     4     5     6     7     8

Pass 1:  1  1..2  2..3  3..4  4..5  5..6  6..7  7..8

Pass 2:  1  1..2  1..3  1..4  2..5  3..6  4..7  5..8

Pass 3:  1  1..2  1..3  1..4  1..5  1..6  1..7  1..8
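The pass structure above is a recursive-doubling prefix sum: in pass k every texel adds the texel 2^(k-1) positions behind it, so a row of 8 needs 3 passes. The CPU-side sketch below emulates those passes with two ping-pong buffers, first along rows and then along columns. The function names are assumptions, and a real pixel shader version would render between two textures instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One emulated render pass: every texel adds the texel `offset` steps back
// along x (alongX = true) or along y (alongX = false).
static void DoublingPass(const std::vector<float>& src, std::vector<float>& dst,
                         std::size_t width, std::size_t height,
                         std::size_t offset, bool alongX) {
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t i = y * width + x;
            float value = src[i];
            if (alongX) {
                if (x >= offset) value += src[i - offset];
            } else {
                if (y >= offset) value += src[i - offset * width];
            }
            dst[i] = value;
        }
    }
}

// Builds a SAT in ceil(log2(width)) + ceil(log2(height)) passes.
std::vector<float> BuildSatByDoubling(std::vector<float> image,
                                      std::size_t width, std::size_t height) {
    assert(image.size() == width * height);
    std::vector<float> scratch(image.size());
    for (std::size_t offset = 1; offset < width; offset *= 2) {
        DoublingPass(image, scratch, width, height, offset, true);
        image.swap(scratch);  // ping-pong between the two buffers
    }
    for (std::size_t offset = 1; offset < height; offset *= 2) {
        DoublingPass(image, scratch, width, height, offset, false);
        image.swap(scratch);
    }
    return image;
}
```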

Building SAT on CPU with SSE & TBB

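Single pass on CPU

Build SAT for each row j=1..n:

– Direct recurrence: \( S_{i,j} = P_{i,j} + S_{i,j-1} + S_{i-1,j} - S_{i-1,j-1} \)

– With a row sum: \( T_j = S_{i-1,j} - S_{i-1,j-1} + P_{i,j} \), so \( S_{i,j} = S_{i,j-1} + T_j \)

– As a running sum along the row, starting from \( T_0 = P_{0,j} \): \( T \mathrel{+}= P_{i,j} \), then \( S_{i,j} = S_{i,j-1} + T \)

[Diagram: pixel \( P_{i,j} \) with its SAT neighbors \( S_{i-1,j-1} \) and \( S_{i,j-1} \) in the row above and \( S_{i-1,j} \) in the same row; the texture is split into 3 x 3 tiles \( T_{1,1} \) … \( T_{3,3} \).]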

Simultaneously process RGBA channels as 4 floats with SSE 4 (128-bit wide vector instructions):

– Can be easily extended to 256-bit wide AVX on Sandy Bridge

Split the texture into tiles and process them in parallel threads:

– Implemented with TBB Tasks

– Tile-processing tasks run in an order that respects their dependencies
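The single-pass scheme can be sketched with a running row sum: for each pixel the sum of the current row so far is added to the SAT value directly above. The version below is scalar and single-threaded, with `Float4` as a stand-in for one SSE register. Tiling and TBB tasks are left out, and all names are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Float4 = std::array<float, 4>;  // one RGBA pixel, i.e. one SSE register

// Single pass over the image: rowSum accumulates the current row, and the SAT
// value is the value directly above plus that running sum.
std::vector<Float4> BuildSatSinglePass(const std::vector<Float4>& pixels,
                                       std::size_t width, std::size_t height) {
    assert(pixels.size() == width * height);
    std::vector<Float4> sat(pixels.size());
    for (std::size_t j = 0; j < height; ++j) {
        Float4 rowSum = {0.0f, 0.0f, 0.0f, 0.0f};
        for (std::size_t i = 0; i < width; ++i) {
            const std::size_t index = j * width + i;
            for (std::size_t c = 0; c < 4; ++c) {  // the four lanes of one register
                rowSum[c] += pixels[index][c];
                const float above = j > 0 ? sat[index - width][c] : 0.0f;
                sat[index][c] = above + rowSum[c];
            }
        }
    }
    return sat;
}
```

Each pixel needs one read of the source, one read of the row above and one write, which is why a tile only depends on the tiles to its left and above.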

SAT Scatter DoF pipeline

Render Scene → Color 8 bit/ch. 1280 x 800, Depth

Compute Blur Radius → Blur Params.

SAT Scatter DoF (add 100px margins) → Color 32 bit/ch. 1480 x 1000

Build SAT → Color 32 bit/ch. 1480 x 1000 (intermediate: Color Temp)

Resize with Crop (remove margins) → Color 8 bit/ch. 1280 x 800

Execution: GPU | CPU / GPU | GPU

SAT Scatter: rectangle spreading

Spread pixels (derive), then build SAT (integrate).

[Diagram: input colors and input blur radius are spread into the output colors table. For each source pixel S, the corners of its rectangle are marked with + and ‒ values. Table regions are labelled SAT Computed, Ongoing Clearing, Ongoing SAT building, Ongoing rectangle spreading, and Padding.]
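A compact sketch of "derive, then integrate": every pixel writes four signed corner marks for its rectangle, and one SAT build turns those marks into filled rectangles. The function name and the single gray channel are assumptions. For brevity this version clips rectangles at the image border instead of adding padding margins, so energy is lost at the edges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<float> ScatterBySat(const std::vector<float>& color,
                                const std::vector<int>& radius, int width, int height) {
    assert(color.size() == static_cast<std::size_t>(width) * height);
    assert(radius.size() == color.size());
    // One extra row and column so marks just past the rectangle always fit.
    const int dw = width + 1;
    const int dh = height + 1;
    std::vector<float> table(static_cast<std::size_t>(dw) * dh, 0.0f);

    // 1. Spread (derive): +v at the upper-left corner, -v just past the
    //    upper-right and lower-left corners, +v just past the lower-right.
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const int r = radius[y * width + x];
            const float v = color[y * width + x] / static_cast<float>((2 * r + 1) * (2 * r + 1));
            const int x0 = std::max(x - r, 0);
            const int y0 = std::max(y - r, 0);
            const int x1 = std::min(x + r, width - 1) + 1;
            const int y1 = std::min(y + r, height - 1) + 1;
            table[y0 * dw + x0] += v;
            table[y0 * dw + x1] -= v;
            table[y1 * dw + x0] -= v;
            table[y1 * dw + x1] += v;
        }
    }

    // 2. Build SAT (integrate) in place, in raster order.
    for (int y = 0; y < dh; ++y) {
        for (int x = 0; x < dw; ++x) {
            const float left = x > 0 ? table[y * dw + x - 1] : 0.0f;
            const float up = y > 0 ? table[(y - 1) * dw + x] : 0.0f;
            const float diag = (x > 0 && y > 0) ? table[(y - 1) * dw + x - 1] : 0.0f;
            table[y * dw + x] += left + up - diag;
        }
    }

    // 3. Crop the extra row and column.
    std::vector<float> out(color.size());
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) out[y * width + x] = table[y * dw + x];
    return out;
}
```

The cost per source pixel is four writes regardless of its blur radius, which is what makes the scatter affordable.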

SAT Scatter: Optimization Notes

Rectangle spreading on GPU:

– Implemented in Geometry Shader

– Requires a huge number of draw calls (width x height)

– Runs slowly even on high-end GPUs

– Compute Shaders could help, but are not available on Sandy Bridge

Rectangle spreading on CPU:

– Takes advantage of SSE 4 instructions for RGBA float channels

– Multi-threaded with TBB Tasks (like SAT, but with different dependencies)

– Much faster than on GPU: 8.3x on SNB GT2, 2.7x on NHM GTX 280

Rectangle spreading CPU-stage can be fused with zeroing and SAT building to minimize memory footprint

Quality can be improved with repeated SAT integration (next slides)


SAT Scatter: CPU Optimization Results

[Figures: Sequential Rendering; Deferred Rendering]

Higher Order SAT Scatter (1/4)

– Original image: no filter

– 1st order filter: box filter

– 2nd order filter: triangle filter

– 3rd order filter: parabolic filter

PERFORMANCE RESULTS ON 2ND GENERATION CORE PROCESSORS

Depth of Field Performance on Sandy Bridge: GPU mode vs. CPU Onloading

[Chart: Frames per Second on Sandy Bridge, driver 15.21.2287, Gothic Temple Scene, 1280x800. X axis: DoF Techniques; Y axis: FPS, 0 to 300. Data labels in chart order:]

– SNB Huron River 2720QM + HDG 3000, GPU only: 262, 161, 58, 135, 137, 60, 19, 8

– SNB Huron River 2720QM + HDG 3000, CPU Onloading: 124, 40, 60, 67

– Speedup callouts: 3x, 8x

Significant speedup with CPU Onloading for advanced compute-intensive DoF techniques!

Depth of Field Performance on Sandy Bridge in GPU mode on HDG 3000 & HDG 2000

[Chart: FPS per DoF technique, Y axis 0 to 300. Data labels in chart order:]

– SNB Huron River 2720QM + HDG 3000, GPU only: 262, 161, 58, 135, 137, 60, 19, 8

– SNB Sugar Bay 2600 + HDG 2000, GPU only: 125, 91, 35, 70, 64, 31, 17, 3

~2x difference: performance depends strongly on the GPU, and the two GPUs differ twofold in compute power (12 vs. 6 EUs)

Depth of Field Performance on Sandy Bridge in CPU Onloading mode on HDG 3000 & HDG 2000

[Chart: FPS per DoF technique, Y axis 0 to 140. Data labels in chart order:]

– SNB Huron River 2720QM + HDG 3000, CPU Onloading: 124, 40, 60, 67

– SNB Sugar Bay 2600 + HDG 2000, CPU Onloading: 90, 34, 50, 53

~1.2-1.4x difference: performance depends much less on the GPU with extensive CPU Onloading!

DoF Techniques Overhead (1/2)


Conclusion & Follow-ups

Accelerate traditional & advanced post-processing techniques with CPU Onloading on modern processors with integrated processor graphics

Optimize compute kernel code with Intel Parallel Studio, TBB, SSE/AVX, MKL, OpenCL and ICC:

– http://software.intel.com/en-us/articles/intel-parallel-studio-home/

– http://software.intel.com/en-us/articles/opencl-sdk/

– http://software.intel.com/en-us/avx/

DOF Source code & article (will be published later):

– http://software.intel.com/en-us/articles/dofexplorer

See other graphics samples:

– http://software.intel.com/en-us/articles/code/
