CPU is in Focus Again! Implementing DOF on CPU

Advanced Visual Computing 3D Graphics Team Presenter: Evgeny Gorodetsky Graphics Software Engineer [email protected], twitter: egorodet CPU is in Focus Again! Implementing DOF on CPU.

Upload: evgeny-gorodetsky

Post on 29-Nov-2014


DESCRIPTION

Presented at Russian Game Developers Conference 2011.

Depth of Field (DoF) is an optical focus effect widely used in photography, movies, 3D graphics and games to draw the viewer's attention to part of the scene. Until recently this effect was too computationally expensive to do in real time, but with the growing power of graphics processors DoF is becoming widely used in modern computer games, raising the level of visual experience.

A physically correct DoF effect can be achieved with ray tracing or an accumulation buffer, but both are still too compute intensive for real time. Like many effects in computer graphics, there is no “right” way to do Depth of Field in a real-time application. Depth of Field Explorer offers developers a way to compare and contrast many different methods of calculating DoF and make an informed decision on the right balance between quality and performance on Sandy Bridge processors.

We present multiple DoF techniques along with a set of adjustable parameters which allow the user to explore their performance and quality characteristics. All DoF techniques have traditional implementations for GPU, and some of them additionally have novel “CPU Onloaded” implementations, demonstrating the advantages of integrated processor graphics on Sandy Bridge. The techniques presented are Poisson disk filter, separable Gaussian filter, Gaussian filter combined with Poisson disk, simple and advanced mipmap interpolation, and summed area tables (SAT) gather and scatter.

DoF Explorer demonstrates innovative “CPU Onloading” approaches to the Gaussian blur and summed area table based DoF techniques. CPU Onloading moves compute-intensive work from the GPU to the CPU, allowing faster DoF post-processing with better load balancing between graphics and central processor cores. CPU kernels demonstrate optimizations with SSE vector instructions and multi-threading on TBB, along with asynchronous execution of tasks on GPU and CPU. Using run-time controls, Depth of Field Explorer enables developers to compare the performance of traditional GPU-based implementations with the CPU versions.

Depth of Field Explorer is implemented as a DirectX application based on the DXUT framework and a custom post-processing pipeline infrastructure that facilitates running many different Depth of Field techniques. The pipeline infrastructure runs a sequence of multiple stages either on GPU or on CPU and supports asynchronous execution, which hides data-transfer latency between CPU and GPU. The integrated Oscilloscope performance monitor makes it easy to analyze the performance of DoF techniques by displaying charts of CPU and GPU execution times broken down by stage.

CPU Onloaded implementations of the summed area table gather and scatter techniques are significantly faster than their traditional GPU implementations, showing 3x and 8x speedups, respectively, on a mobile system with a Core i7 2720QM.

TRANSCRIPT

Page 1: CPU is in Focus Again! Implementing DOF on CPU

Advanced Visual Computing 3D Graphics Team

Presenter:

Evgeny Gorodetsky Graphics Software Engineer

[email protected], twitter: egorodet

CPU is in Focus Again! Implementing DOF on CPU.

Page 2: CPU is in Focus Again! Implementing DOF on CPU

Agenda

Introduction to depth of field effect & techniques

DOF Explorer and post-processing pipeline

DOF Techniques on GPU & with CPU Onloading:

– Traditional: Poisson Disk & Gaussian Blur

– Advanced: Summed Area Tables Gather & Scatter

Performance results on Sandy Bridge processors


Page 3: CPU is in Focus Again! Implementing DOF on CPU

DEPTH OF FIELD EXPLAINED Introduction to DOF


Page 4: CPU is in Focus Again! Implementing DOF on CPU

Depth of Field Explained


Common effect in:

– Photography

– Cinematography

– Modern 3D games

Used to bring attention of the viewer

Optical nature of DoF:

– Lens settings: Aperture (f-stop), Focal distance

– Circle of Confusion (CoC)

– Bokeh effect (not addressed)

[Figure: CoC (blur radius), from 0 to Max Blur Radius, plotted against distance from camera (depth) with Near, Focal and Far marks; two curves: linear approximation and real dependency]
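The two curves on this slide can be written down directly. The following is a minimal sketch, not code from DoF Explorer: `coc_linear` assumes a piecewise-linear ramp between hypothetical near, focal and far distances for the "linear approximation", and `coc_thin_lens` uses the standard thin-lens circle of confusion formula for the "real dependency". All function and parameter names are illustrative.

```python
def coc_linear(depth, near, focal, far, max_blur):
    """Piecewise-linear blur radius: 0 at the focal distance,
    max_blur at (and beyond) the near and far distances."""
    if depth < focal:
        t = (focal - depth) / (focal - near)
    else:
        t = (depth - focal) / (far - focal)
    return max_blur * min(max(t, 0.0), 1.0)


def coc_thin_lens(depth, focus_dist, focal_length, aperture_diameter):
    """Thin-lens circle of confusion diameter, in the units of the lens parameters."""
    return (aperture_diameter * focal_length * abs(depth - focus_dist)
            / (depth * (focus_dist - focal_length)))
```

The linear form needs only a subtract, a multiply and a clamp per pixel, which is presumably why it is attractive for a post-process; the thin-lens form saturates for far objects instead of growing without bound.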

Page 5: CPU is in Focus Again! Implementing DOF on CPU

There’s no right DoF technique!


Physically correct reference techniques:

– Ray Tracing

– Accumulation Buffer

Real-time post-processing:

– Gathering techniques:

  – Poisson Disk

  – Gaussian Blur

  – Summed area table Gather

– Scattering techniques:

  – Summed area table Scatter

  – Heat diffusion simulation

Common Challenges:

– Color bleeding:

  – From sharp objects in front to blurred objects behind

  – From blurred objects behind to sharp objects in front

– Blurriness discontinuities

– Performance depending on resolution!

[Figure: Gathering vs. Scattering, showing input and output pixels]

Page 6: CPU is in Focus Again! Implementing DOF on CPU

Depth of Field Explorer

Post-processing on GPU and with CPU Onloading

Compare DoF techniques:

– On one of three scenes

– Performance & quality

– Runtime settings

Deferred rendering with async. CPU-GPU execution

Performance analysis


Depth of Field techniques (table with GPU and CPU columns):

– Poisson Disk

– Gaussian Blur

– Gaussian Blur mixed with Poisson Disk

– Summed Area Table (SAT) Gather

– Summed Area Table (SAT) Scatter

– Simple MipMap

– Advanced MipMap

Page 7: CPU is in Focus Again! Implementing DOF on CPU

Post-Processing Pipeline Infrastructure simplifies CPU Onloading

Automatic resources management on GPU and CPU

Deferred execution mode in CPU Onloading:

– Performs computing on CPU while doing work on GPU

– Hides data transfer latency

Preview of intermediate resources

Integrated performance analysis tools
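The deferred execution mode described above can be illustrated with a toy frame loop. This is only a sketch of the overlap idea, assuming a one-frame delay between rendering and presenting; `gpu_render`, `cpu_stage` and `run_deferred` are hypothetical stand-ins and not part of the DoF Explorer pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def gpu_render(frame):
    # Stand-in for a GPU stage: produce the data the CPU stage will consume.
    return [frame] * 4


def cpu_stage(pixels):
    # Stand-in for an onloaded CPU kernel (e.g. a blur or SAT build).
    return sum(pixels)


def run_deferred(num_frames):
    """Overlap the CPU stage of frame N-1 with the GPU work for frame N."""
    presented = []
    with ThreadPoolExecutor(max_workers=1) as cpu:
        pending = None  # CPU task of the previous frame, if any
        for frame in range(num_frames):
            rendered = gpu_render(frame)            # "GPU" works on frame N...
            if pending is not None:
                presented.append(pending.result())  # ...while the CPU handled N-1
            pending = cpu.submit(cpu_stage, rendered)
        if pending is not None:
            presented.append(pending.result())      # drain the last frame
    return presented
```

The price of the overlap is one frame of latency: the result for frame N becomes available only while frame N+1 is being rendered.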

[Pipeline diagram: Stage 1 "Render Scene" with output pins Color [size, format] and Depth [size, format], connected to the input pins of Stage 2 "Poisson Disk DoF", which has one output pin Color [size, format]]

Defined by developer: Stage 1 render, Stage 1 output pins, Stage 2 input pins, Stage 2 render, Stage 2 output pin.

Created by Pipeline infrastructure: Stage 1-2 Intermediate Resources, Stage 1 Render Target Views, Stage 2 Shader Resource Views, Stage 2 Screen Render Target.

Page 8: CPU is in Focus Again! Implementing DOF on CPU

Depth of Field Explorer

Pipeline Oscilloscopes (F6) for CPU & GPU

Pipeline Preview (F5)

DX and UI Controls

Common explorer controls

Technique-specific controls

Page 9: CPU is in Focus Again! Implementing DOF on CPU

TRADITIONAL DOF TECHNIQUES Poisson Disk & Gaussian Blur on GPU & CPU


Page 10: CPU is in Focus Again! Implementing DOF on CPU

Poisson Disk DOF Technique

Averages color by random Poisson disk samples around each pixel

Easy to implement on GPU

Not good for CPU, because of random memory access

Used for Bokeh simulation in some games

Variable number of Poisson taps can be generated in DOF Explorer
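A minimal sketch of the technique on this slide, under stated assumptions: taps are generated by simple dart throwing inside the unit disk (the deck does not say how DoF Explorer generates them), and the gather scales the taps by a per-pixel blur radius and clamps reads at the image border. All names are illustrative.

```python
import math
import random


def poisson_disk_taps(count, min_dist, seed=0, max_tries=10000):
    """Dart throwing: random points in the unit disk, each at least min_dist apart."""
    rng = random.Random(seed)
    taps = []
    for _ in range(max_tries):
        if len(taps) == count:
            break
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y > 1.0:
            continue  # outside the unit disk
        if all(math.hypot(x - tx, y - ty) >= min_dist for tx, ty in taps):
            taps.append((x, y))
    return taps


def poisson_blur_pixel(image, px, py, radius, taps):
    """Average the image at the tap offsets scaled by the pixel's blur radius."""
    h, w = len(image), len(image[0])
    total = 0.0
    for tx, ty in taps:
        sx = min(max(int(round(px + tx * radius)), 0), w - 1)  # clamp to edge
        sy = min(max(int(round(py + ty * radius)), 0), h - 1)
        total += image[sy][sx]
    return total / len(taps)
```

The scattered `image[sy][sx]` reads are the random memory access pattern that the slide names as the reason this technique suits the GPU better than the CPU.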


Page 11: CPU is in Focus Again! Implementing DOF on CPU

Gaussian Blur DOF Technique

Convolution of NxN neighbor pixels with pre-computed weights:

$$G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}; \quad C(x, y) = \sum_{i=1}^{N} \sum_{j=1}^{N} G(x_i, y_j) \cdot f(x_i, y_j)$$

Decomposed into 2 passes:

– Vertical pass

– Horizontal pass

$$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}; \quad C(x, y) = \sum_{i=1}^{N} \sum_{j=1}^{N} G(x_i) \cdot G(y_j) \cdot f(x_i, y_j)$$

Implementation:

– Traditional for GPU in pixel shader

– Novel for CPU, accelerated with TBB & SSE
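The two-pass decomposition can be sketched in scalar code as follows. This is an illustration only, not the TBB/SSE kernel from the talk; it assumes normalized weights and clamped reads at the borders, and the function names are made up for the example.

```python
import math


def gaussian_weights(radius, sigma):
    """Normalized 1D Gaussian weights for taps -radius..radius."""
    w = [math.exp(-(x * x) / (2.0 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    s = sum(w)
    return [v / s for v in w]


def blur_1d(image, weights, horizontal):
    """One separable pass over a 2D list of floats, clamping reads at the borders."""
    h, w = len(image), len(image[0])
    r = len(weights) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k, wk in enumerate(weights):
                if horizontal:
                    acc += wk * image[y][min(max(x + k - r, 0), w - 1)]
                else:
                    acc += wk * image[min(max(y + k - r, 0), h - 1)][x]
            out[y][x] = acc
    return out


def gaussian_blur(image, radius, sigma):
    """Vertical pass followed by horizontal pass."""
    weights = gaussian_weights(radius, sigma)
    vertical = blur_1d(image, weights, horizontal=False)
    return blur_1d(vertical, weights, horizontal=True)
```

For an N-tap kernel the two passes cost 2N reads per pixel instead of the N x N reads of the direct convolution, while producing the same result.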


Page 12: CPU is in Focus Again! Implementing DOF on CPU

Gaussian Blur Pipeline

[Pipeline diagram: Render Scene -> Color 1280 x 800 and Depth 1280 x 800 -> Resize X 0.5 -> Blurred Color 640 x 400 -> Gaussian Horiz. Blur -> Blurred Color 640 x 400 -> Gaussian Vert. Blur -> Blurred Color 640 x 400 -> DoF Simple Combine -> Color 1280 x 800]

Stage placement: GPU (render and resize), CPU / GPU (Gaussian blur passes), GPU (combine).

Page 13: CPU is in Focus Again! Implementing DOF on CPU

Gaussian Blur on CPU: Multi-threading with TBB

[Diagram: 1. Vertical Pass: Gaussian weights F0..F4 are applied along a column of pixels. 2. Horizontal Pass: Gaussian weights F0..F4 are applied along a row of pixels. Each pass is parallelized with tbb::parallel_for.]

Page 14: CPU is in Focus Again! Implementing DOF on CPU

Gaussian Blur on CPU: Vectorization with SSE 4

[Diagram: pixels are stored as interleaved RGBA floats (R0 G0 B0 A0 R1 G1 B1 A1 R2 G2 B2 A2 R3 …) and each weight is replicated four times (F0 F0 F0 F0 F1 F1 F1 F1 F2 F2 F2 F2 F3 …), so that one SSE multiply processes the four channels of a pixel. Pixels (R0 G0 B0 A0), (R1 G1 B1 A1), (R2 G2 B2 A2), … are multiplied by (F0 F0 F0 F0), (F1 F1 F1 F1), (F2 F2 F2 F2), … and accumulated into the output pixel (R0’ G0’ B0’ A0’). Panels: Vertical Pass; Horizontal Pass (cache friendly).]

Page 15: CPU is in Focus Again! Implementing DOF on CPU

Gaussian Blur: Performance results

[Chart: Gaussian Blur speedup with TBB parallel_for; y-axis: time in milliseconds (0 to 18); bars: 1 Thread, 8 Threads; series: GPU Rendering, CPU Kernel Time; data labels: 13.7, 4.4, 3.2, 5.6]

Page 16: CPU is in Focus Again! Implementing DOF on CPU

ADVANCED DOF TECHNIQUES Summed Area Tables Gather & Scatter


Page 17: CPU is in Focus Again! Implementing DOF on CPU

Summed Area Tables

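Source Table:

      1   2   3   4
  1   0   7   2   4
  2   1   4   1   2
  3   6   1   2   0
  4   0   3   5   2

Summed Area Table (SAT):

      1   2   3   4
  1   0   7   9  13
  2   1  12  15  21
  3   7  19  24  30
  4   7  22  32  40

$$S_{mn} = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij}$$

Averaging values in the area of source table by SAT:

$$P = \sum_{(i,j) \in area} p_{ij} = LR - UR - LL + UL; \quad P_{area} = \frac{LR - UR - LL + UL}{width \times height}$$

[Figure: rectangle of size width x height inside the table, with the four SAT corner reads marked + LR, - UR, - LL, + UL]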

Enables averaging values in variable rectangle areas in constant time: just with 4 SAT-texture reads!
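The tables on this slide can be reproduced with a few lines of code. The sketch below uses zero-based indices and treats reads outside the table as 0; the function names are illustrative, and the rectangle corners follow the LR, UR, LL, UL convention of the slide.

```python
def build_sat(table):
    """Inclusive summed area table: sat[y][x] = sum of table[0..y][0..x]."""
    h, w = len(table), len(table[0])
    sat = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sat[y][x] = (table[y][x]
                         + (sat[y - 1][x] if y > 0 else 0)
                         + (sat[y][x - 1] if x > 0 else 0)
                         - (sat[y - 1][x - 1] if x > 0 and y > 0 else 0))
    return sat


def area_average(sat, x0, y0, x1, y1):
    """Average over the inclusive rectangle (x0, y0)-(x1, y1) using 4 SAT reads."""
    def read(x, y):
        return sat[y][x] if x >= 0 and y >= 0 else 0
    lr = read(x1, y1)          # lower right
    ur = read(x1, y0 - 1)      # just above the upper-right corner
    ll = read(x0 - 1, y1)      # just left of the lower-left corner
    ul = read(x0 - 1, y0 - 1)  # diagonal neighbour of the upper-left corner
    return (lr - ur - ll + ul) / ((x1 - x0 + 1) * (y1 - y0 + 1))


source = [[0, 7, 2, 4],
          [1, 4, 1, 2],
          [6, 1, 2, 0],
          [0, 3, 5, 2]]
sat = build_sat(source)
center_average = area_average(sat, 1, 1, 2, 2)  # the 2 x 2 block 4, 1, 1, 2
```

For the central 2 x 2 block the four reads are 24, 9, 7 and 0, giving a sum of 8 and an average of 2.0, regardless of how large the rectangle is.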

Page 18: CPU is in Focus Again! Implementing DOF on CPU

Gathering vs. Scattering

[Figure: input and output pixel grids for gathering and for scattering]

Page 19: CPU is in Focus Again! Implementing DOF on CPU

SAT Gather DoF pipeline

[Pipeline diagram: Render Scene -> Color 8 bit/ch. and Depth -> Build SAT (with a Temp color resource) -> Color 32 bit/ch. -> SAT Gather DoF -> Color 8 bit/ch.]

Stage placement: GPU (render), CPU / GPU (Build SAT), GPU (SAT Gather DoF).

Page 20: CPU is in Focus Again! Implementing DOF on CPU

Building SAT on GPU in Pixel Shader

Each cell shows the range of source elements summed into it:

Source:  1   2     3     4     5     6     7     8

Pass 1:  1   1..2  2..3  3..4  4..5  5..6  6..7  7..8

Pass 2:  1   1..2  1..3  1..4  2..5  3..6  4..7  5..8

Pass 3:  1   1..2  1..3  1..4  1..5  1..6  1..7  1..8
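The three passes on this slide follow the recursive doubling pattern: in pass k every element adds the element 2^(k-1) positions to its left. A minimal 1D sketch of that pattern, with an illustrative function name (a full SAT would apply it along rows and then along columns):

```python
def scan_recursive_doubling(row):
    """Inclusive prefix sum in ceil(log2(n)) passes."""
    current = list(row)
    passes = 0
    offset = 1
    while offset < len(current):
        # Every output element depends only on the previous pass, like a
        # pixel shader reading from the previous render target.
        current = [current[i] + (current[i - offset] if i >= offset else 0)
                   for i in range(len(current))]
        offset *= 2
        passes += 1
    return current, passes
```

Each pass touches every element, so the total work is n log n additions rather than n; this suits a pixel shader because all elements of a pass are independent.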

Page 21: CPU is in Focus Again! Implementing DOF on CPU

Building SAT on CPU with SSE & TBB

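[Diagram: pixel P_{i,j} with its already computed neighbours S_{i-1,j-1} and S_{i,j-1} (row above) and S_{i-1,j} (to the left); texture split into tiles T_{1,1} … T_{3,3}]

$$S_{i,j} = P_{i,j} + S_{i,j-1} + S_{i-1,j} - S_{i-1,j-1}$$

With the running row sum $T_i = S_{i-1,j} - S_{i-1,j-1} + P_{i,j}$ this becomes $S_{i,j} = S_{i,j-1} + T_i$.

Build SAT for each row j=1..n:

$$T_0 = P_{0,j}; \quad T_i = T_{i-1} + P_{i,j} \;\; (T \mathrel{+}= P_{i,j}); \quad S_{i,j} = S_{i,j-1} + T_i$$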

Single pass on CPU

Simultaneously process RGBA channels as 4 floats with SSE 4 (128-bit width vector instructions):

– Can be easily extended to 256-bit width AVX on Sandy Bridge

Split texture in tiles and process them in parallel threads:

– Implemented in TBB Tasks

– Run tile-processing tasks respecting their dependencies
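The single-pass recurrence and the tile split can be combined as sketched below. The deck does not spell out the task dependencies, so this sketch assumes that a tile needs its left and upper neighbours to be finished, which means all tiles on one anti-diagonal are independent. The waves run sequentially here; in a TBB implementation each wave could be handed to parallel tasks. All names are illustrative.

```python
def sat_tile(src, sat, x0, y0, x1, y1):
    """Fill sat for the tile [x0, x1) x [y0, y1) in a single pass over its rows.

    Requires the tiles to the left and above to be finished already.
    """
    for y in range(y0, y1):
        # Running sum of row y up to column x0 - 1, recovered from finished tiles.
        t = 0
        if x0 > 0:
            t = sat[y][x0 - 1] - (sat[y - 1][x0 - 1] if y > 0 else 0)
        for x in range(x0, x1):
            t += src[y][x]                                   # T += P(i,j)
            sat[y][x] = (sat[y - 1][x] if y > 0 else 0) + t  # S(i,j) = S(i,j-1) + T


def build_sat_tiled(src, tile):
    """Build the SAT tile by tile in anti-diagonal waves."""
    h, w = len(src), len(src[0])
    sat = [[0] * w for _ in range(h)]
    tiles_x = (w + tile - 1) // tile
    tiles_y = (h + tile - 1) // tile
    waves = []
    for d in range(tiles_x + tiles_y - 1):
        # Tiles with tx + ty == d do not depend on each other.
        wave = [(tx, d - tx) for tx in range(tiles_x) if 0 <= d - tx < tiles_y]
        for tx, ty in wave:
            sat_tile(src, sat, tx * tile, ty * tile,
                     min((tx + 1) * tile, w), min((ty + 1) * tile, h))
        waves.append(wave)
    return sat, waves
```

Calling `build_sat_tiled(source, 2)` on a 4 x 4 table yields three waves of 1, 2 and 1 tiles, which shows the limited parallelism at the start and end of the wavefront.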

Page 22: CPU is in Focus Again! Implementing DOF on CPU

SAT Scatter DoF pipeline

[Pipeline diagram: Render Scene -> Color 8 bit/ch. 1280 x 800 and Depth -> Compute Blur Radius -> Blur Params. -> SAT Scatter DoF (add 100px margins) -> Color 32 bit/ch. 1480 x 1000 -> Build SAT (with a Temp color resource) -> Color 32 bit/ch. 1480 x 1000 -> Resize with Crop (remove margins) -> Color 8 bit/ch. 1280 x 800]

Stage placement: GPU, CPU / GPU, GPU.

Page 23: CPU is in Focus Again! Implementing DOF on CPU

SAT Scatter: rectangle spreading

Spread pixels (derive), then build SAT (integrate).

[Diagram: input colors, input blur radius and output colors buffers. A source pixel S writes signed values (+, ‒, ‒, +) at the four corners of its blur rectangle; all other cells are 0. Legend: SAT computed; ongoing clearing; ongoing SAT building; ongoing rectangle spreading; padding.]
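A scalar sketch of "spread, then integrate" under stated assumptions: each pixel is spread over a square of side 2r+1, its color is divided by the square's area so that energy is preserved, and a padding margin keeps the corner writes inside the buffer. This is not the fused SSE/TBB kernel from the talk, and the names are illustrative.

```python
def sat_scatter(colors, radii, margin):
    """Scatter each pixel over a (2r+1)^2 box via corner deltas, then integrate."""
    h, w = len(colors), len(colors[0])
    H, W = h + 2 * margin + 1, w + 2 * margin + 1  # +1 for the far-side deltas
    buf = [[0.0] * W for _ in range(H)]            # clearing step
    # Rectangle spreading (derive): four signed writes per source pixel.
    for y in range(h):
        for x in range(w):
            r = min(radii[y][x], margin)
            v = colors[y][x] / float((2 * r + 1) ** 2)
            x0, y0 = x + margin - r, y + margin - r          # upper-left corner
            x1, y1 = x + margin + r + 1, y + margin + r + 1  # one past lower-right
            buf[y0][x0] += v
            buf[y0][x1] -= v
            buf[y1][x0] -= v
            buf[y1][x1] += v
    # SAT building (integrate): running row sum plus the finished row above.
    for y in range(H):
        t = 0.0
        for x in range(W):
            t += buf[y][x]
            buf[y][x] = t + (buf[y - 1][x] if y > 0 else 0.0)
    # Crop the margins back to the original size.
    return [row[margin:margin + w] for row in buf[margin:margin + h]]
```

Each source pixel costs four writes regardless of its blur radius, which is what makes scattering affordable; the integration afterwards turns the four corner deltas into a filled rectangle.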

Page 24: CPU is in Focus Again! Implementing DOF on CPU

SAT Scatter: Optimization Notes

Rectangle spreading on GPU:

– Implemented in Geometry Shader

– Requires a huge number of draw calls (width x height)

– Runs slowly even on high-end GPUs

– Compute Shaders could help, but are not available on Sandy Bridge

Rectangle spreading on CPU:

– Takes advantage of SSE 4 instructions for RGBA float channels

– Multi-threaded with TBB Tasks (like SAT, but with different dependencies)

– Much faster than on GPU: 8.3x on SNB GT2, 2.7x on NHM GTX 280

Rectangle spreading CPU-stage can be fused with zeroing and SAT building to minimize memory footprint

Quality can be improved with repeated SAT integration (next slides)


Page 25: CPU is in Focus Again! Implementing DOF on CPU

SAT Scatter : CPU Optimization Results

[Figures: Sequential Rendering; Deferred Rendering]

Page 26: CPU is in Focus Again! Implementing DOF on CPU

Higher Order SAT Scatter (1/4)

[Image: original image, no filter]

Page 27: CPU is in Focus Again! Implementing DOF on CPU

Higher Order SAT Scatter (2/4)

[Image: 1st-order filter, box filter]

Page 28: CPU is in Focus Again! Implementing DOF on CPU

Higher Order SAT Scatter (3/4)

[Image: 2nd-order filter, triangle filter]

Page 29: CPU is in Focus Again! Implementing DOF on CPU

Higher Order SAT Scatter (4/4)

[Image: 3rd-order filter, parabolic filter]
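The link between repeated integration and the filter shapes named on these slides can be seen in one dimension. The sketch below is an illustration under an assumption the deck does not confirm: that the spreading step writes the k-th finite difference of the desired kernel, so that k prefix sums reconstruct it. Only orders 1 and 2 are shown, the kernels are left unnormalized, and the names are made up for the example.

```python
def prefix_sum(values):
    """Inclusive running sum of a list."""
    out, t = [], 0
    for v in values:
        t += v
        out.append(t)
    return out


def scatter_1d(n, center, r, order):
    """Spread one unit sample in 1D and integrate `order` times.

    order 1 gives a box of width r, order 2 a triangle with peak r.
    """
    deltas = [0] * n
    if order == 1:
        deltas[center] += 1      # first difference of a box
        deltas[center + r] -= 1
    elif order == 2:
        deltas[center - r] += 1  # second difference of a triangle
        deltas[center] -= 2
        deltas[center + r] += 1
    else:
        raise ValueError("only orders 1 and 2 are sketched here")
    for _ in range(order):
        deltas = prefix_sum(deltas)
    return deltas
```

With `n=10`, `center=5` and `r=3`, order 1 produces a flat run of three ones, and order 2 produces a ramp up to 3 and back down. Each extra order costs more writes per sample and one more integration pass, in exchange for a smoother kernel.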

Page 30: CPU is in Focus Again! Implementing DOF on CPU

PERFORMANCE RESULTS ON 2-ND GENERATION CORE PROCESSORS


Page 31: CPU is in Focus Again! Implementing DOF on CPU

Depth of Field Performance on Sandy Bridge: GPU mode vs. CPU Onloading

[Chart: Frames per Second on Sandy Bridge, driver 15.21.2287, Gothic Temple Scene, 1280x800; y-axis: FPS (0 to 300); x-axis: DoF Techniques]

SNB Huron River 2720QM + HDG 3000: GPU only: 262, 161, 58, 135, 137, 60, 19, 8

SNB Huron River 2720QM + HDG 3000: CPU Onloading: 124, 40, 60, 67

Callouts: 3x, 8x

Significant speedup with CPU Onloading for advanced compute-intensive DoF techniques!

Page 32: CPU is in Focus Again! Implementing DOF on CPU

Depth of Field Performance on Sandy Bridge in GPU mode on HDG 3000 & HDG 2000

[Chart: Frames per Second on Sandy Bridge, driver 15.21.2287, Gothic Temple Scene, 1280x800; y-axis: FPS (0 to 300); x-axis: DoF Techniques]

SNB Huron River 2720QM + HDG 3000: GPU only: 262, 161, 58, 135, 137, 60, 19, 8

SNB Sugar Bay 2600 + HDG 2000: GPU only: 125, 91, 35, 70, 64, 31, 17, 3

~2x: strong dependence on the GPU; the two GPUs differ twofold in compute power (12 vs. 6 EUs)

Page 33: CPU is in Focus Again! Implementing DOF on CPU

Depth of Field Performance on Sandy Bridge in CPU Onloading mode on HDG 3000 & HDG 2000

[Chart: Frames per Second on Sandy Bridge, driver 15.21.2287, Gothic Temple Scene, 1280x800; y-axis: FPS (0 to 140); x-axis: DoF Techniques]

SNB Huron River 2720QM + HDG 3000: CPU Onloading: 124, 40, 60, 67

SNB Sugar Bay 2600 + HDG 2000: CPU Onloading: 90, 34, 50, 53

~1.2-1.4x: less dependent on the GPU with extensive CPU Onloading!
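The "~2x" and "~1.2-1.4x" callouts on the last two charts can be checked from the data labels. The sketch below assumes that the labels of the two series in each chart pair up in the order they appear, since the technique names on the x-axis are not preserved in the transcript; the variable names are illustrative.

```python
# FPS data labels from the charts (HDG 3000 vs. HDG 2000 systems).
gpu_hdg3000 = [262, 161, 58, 135, 137, 60, 19, 8]
gpu_hdg2000 = [125, 91, 35, 70, 64, 31, 17, 3]
cpu_hdg3000 = [124, 40, 60, 67]
cpu_hdg2000 = [90, 34, 50, 53]


def ratios(fast, slow):
    """Per-technique FPS ratio between the two systems, rounded to 2 digits."""
    return [round(a / b, 2) for a, b in zip(fast, slow)]


gpu_ratios = ratios(gpu_hdg3000, gpu_hdg2000)
cpu_ratios = ratios(cpu_hdg3000, cpu_hdg2000)
print("GPU only:", gpu_ratios)
print("CPU Onloading:", cpu_ratios)
```

Under that pairing the CPU Onloading ratios all fall between roughly 1.2 and 1.4, while most GPU-only ratios are close to 2, which matches the callouts.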

Page 34: CPU is in Focus Again! Implementing DOF on CPU

DoF Techniques Overhead (1/2)


Page 35: CPU is in Focus Again! Implementing DOF on CPU

Conclusion & Follow ups

Accelerate traditional & advanced post-processing techniques with CPU Onloading on modern processors with integrated processor graphics

Optimize compute kernels code with Intel Parallel Studio, TBB, SSE/AVX, MKL, OpenCL and ICC:

– http://software.intel.com/en-us/articles/intel-parallel-studio-home/

– http://software.intel.com/en-us/articles/opencl-sdk/

– http://software.intel.com/en-us/avx/

DOF Source code & article (will be published later):

– http://software.intel.com/en-us/articles/dofexplorer

See other graphics samples:

– http://software.intel.com/en-us/articles/code/

