CPU is in Focus Again! Implementing DOF on CPU
DESCRIPTION
Presented at Russian Game Developers Conference 2011.
Depth of Field (DoF) is an optical focus effect widely used in photography, movies, 3D graphics and games to draw the viewer's attention to a part of the scene. Until recently, this effect was too computationally expensive to do in real time, but with the growing power of graphics processors, DoF is becoming widely used in modern computer games, raising the level of visual experience.
A physically correct DoF effect can be achieved with ray tracing or an accumulation buffer, but both are still too compute intensive for real time. Like many effects in computer graphics, there is no “right” way to do Depth of Field in a real-time application. Depth of Field Explorer offers developers a way to compare and contrast many different methods of calculating DoF and make an informed decision on the right balance between quality and performance on Sandy Bridge processors.
We present multiple DoF techniques along with a set of adjustable parameters that allow the user to explore their performance and quality characteristics. All DoF techniques have traditional GPU implementations, and some of them additionally have novel “CPU Onloaded” implementations that demonstrate the advantages of integrated processor graphics on Sandy Bridge. The techniques presented are Poisson disk filter, separable Gaussian filter, Gaussian filter combined with Poisson disk, simple and advanced mipmap interpolation, and summed area tables (SAT) gather and scatter.
DoF Explorer demonstrates innovative “CPU Onloading” approaches to the Gaussian blur and summed area table based DoF techniques. CPU Onloading moves compute-intensive work from the GPU to the CPU, allowing faster DoF post-processing with better load balancing between the graphics and central processor cores. The CPU kernels demonstrate optimizations with SSE vector instructions and multi-threading on TBB, along with asynchronous execution of tasks on GPU and CPU. Using run-time controls, Depth of Field Explorer enables developers to compare the performance of traditional GPU-based implementations with the CPU versions.
Depth of Field Explorer is implemented as a DirectX application based on the DXUT framework and a custom post-processing pipeline infrastructure that facilitates running many different Depth of Field techniques. The pipeline infrastructure runs a sequence of multiple stages either on GPU or on CPU with support for asynchronous execution, which hides the data-transfer latency between CPU and GPU. The integrated Oscilloscope performance monitor makes it easy to analyze the performance of DoF techniques by displaying charts of CPU and GPU execution times broken down by stage.
CPU Onloaded implementations of the summed area table gather and scatter techniques are significantly faster than their traditional GPU implementations, showing 3x and 8x speedups, respectively, on a mobile system with a Core i7 2720QM.
TRANSCRIPT
Advanced Visual Computing 3D Graphics Team
Presenter:
Evgeny Gorodetsky Graphics Software Engineer
[email protected], twitter: egorodet
CPU is in Focus Again! Implementing DOF on CPU.
Agenda
Introduction to depth of field effect & techniques
DOF Explorer and post-processing pipeline
DOF Techniques on GPU & with CPU Onloading:
– Traditional: Poisson Disk & Gaussian Blur
– Advanced: Summed Area Tables Gather & Scatter
Performance results on Sandy Bridge processors
DEPTH OF FIELD EXPLAINED Introduction to DOF
Depth of Field Explained
Common effect in:
– Photography
– Cinematography
– Modern 3D games
Used to direct the viewer's attention
Optical nature of DoF:
– Lens settings: Aperture (f-stop), Focal distance
– Circle of Confusion (CoC)
– Bokeh effect (not addressed)
[Figure: CoC (Blur Radius), from 0 to Max Blur Radius, versus Distance from Camera (Depth), with Near, Focal and Far depths marked. Curves: Linear approximation, Real dependency.]
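The linear approximation in the figure can be sketched as a small function. This is a minimal illustration that assumes the blur radius is 0 at the focal depth and grows linearly to the maximum at the near and far depths; the function and parameter names are hypothetical and do not come from the DoF Explorer source.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: piecewise-linear blur radius (CoC) as a function of depth.
// Assumption: radius is 0 at focalDepth and reaches maxBlurRadius at nearDepth
// and farDepth, staying clamped at maxBlurRadius outside that range.
float blurRadiusLinear(float depth, float nearDepth, float focalDepth,
                       float farDepth, float maxBlurRadius) {
    assert(nearDepth < focalDepth && focalDepth < farDepth);
    const float t = depth < focalDepth
        ? (focalDepth - depth) / (focalDepth - nearDepth)   // in front of focus
        : (depth - focalDepth) / (farDepth - focalDepth);   // behind focus
    return maxBlurRadius * std::min(t, 1.0f);
}
```

For example, with near = 5, focal = 10, far = 30 and a maximum radius of 8, a depth of 20 lies halfway between the focal and far depths and gets a radius of 4.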
There’s no right DoF technique!
Physically correct reference techniques:
– Ray Tracing
– Accumulation Buffer
Real-time post-processing:
– Gathering techniques:
  – Poisson Disk
  – Gaussian Blur
  – Summed area table Gather
– Scattering techniques:
  – Summed area table Scatter
  – Heat diffusion simulation
Common Challenges:
– Color bleeding:
  – From sharp objects in front to blurred objects behind
  – From blurred objects behind to sharp objects in front
– Blurriness discontinuities
– Performance depends on resolution!
[Figure: Gathering vs. Scattering (input, output)]
Depth of Field Explorer
Post-processing on GPU and with CPU Onloading
Compare DoF techniques:
– On one of three scenes
– Performance & quality
– Runtime settings
Deferred rendering with async. CPU-GPU execution
Performance analysis
Depth of Field technique GPU CPU
Poisson Disk
Gaussian Blur
Gaussian Blur mixed with Poisson Disk
Summed Area Table (SAT) Gather
Summed Area Table (SAT) Scatter
Simple MipMap
Advanced MipMap
Post-Processing Pipeline Infrastructure simplifies CPU Onloading
Automatic resource management on GPU and CPU
Deferred execution mode in CPU Onloading:
– Performs computing on CPU while doing work on GPU
– Hides data transfer latency
Preview of intermediate resources
Integrated performance analysis tools
Pipeline Diagram:
Defined by developer:
– Stage 1 render: Render Scene
– Stage 1 output pins: Color [size, format], Depth [size, format]
– Stage 2 render: Poisson Disk DoF
– Stage 2 input pins
– Stage 2 output pin: Color [size, format]
Created by Pipeline infrastructure:
– Stage 1-2 Intermediate Resources
– Stage 1 Render Target Views
– Stage 2 Shader Resource Views
– Stage 2 Screen Render Target
Depth of Field Explorer
Pipeline Oscilloscopes (F6) for CPU & GPU
Pipeline Preview (F5)
DX and UI Controls
Common explorer controls
Technique-specific controls
TRADITIONAL DOF TECHNIQUES Poisson Disk & Gaussian Blur on GPU & CPU
Poisson Disk DOF Technique
Averages color by random Poisson disk samples around each pixel
Easy to implement on GPU
Not good for CPU, because of random memory access
Used for Bokeh simulation in some games
Variable number of Poisson taps can be generated in DOF Explorer
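A gather of this kind can be sketched as follows. This is only an illustration under stated assumptions: the taps are generated by simple dart throwing inside the unit disk, the image is single-channel, and all names (`Tap`, `makePoissonTaps`, `gatherPixel`) are hypothetical rather than taken from DoF Explorer. The scattered reads at `sx`, `sy` are the random memory accesses that make the technique a poor fit for the CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

struct Tap { float x, y; };  // sample offset inside the unit disk

// Dart throwing: accept a random point only if it lies inside the unit disk
// and is at least minDist away from every tap accepted so far.
std::vector<Tap> makePoissonTaps(int count, float minDist, unsigned seed = 1) {
    assert(count >= 0 && minDist > 0.0f);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> uni(-1.0f, 1.0f);
    std::vector<Tap> taps;
    for (int tries = 0; tries < 10000 && (int)taps.size() < count; ++tries) {
        const Tap c{uni(rng), uni(rng)};
        if (c.x * c.x + c.y * c.y > 1.0f) continue;  // outside the disk
        bool farEnough = true;
        for (const Tap& t : taps) {
            const float dx = t.x - c.x, dy = t.y - c.y;
            if (dx * dx + dy * dy < minDist * minDist) { farEnough = false; break; }
        }
        if (farEnough) taps.push_back(c);
    }
    return taps;
}

// Average the centre pixel and the taps, with the disk scaled by the blur radius.
float gatherPixel(const std::vector<float>& img, int w, int h, int x, int y,
                  float blurRadius, const std::vector<Tap>& taps) {
    float sum = img[y * w + x];
    for (const Tap& t : taps) {
        const int sx = std::min(std::max((int)std::lround(x + t.x * blurRadius), 0), w - 1);
        const int sy = std::min(std::max((int)std::lround(y + t.y * blurRadius), 0), h - 1);
        sum += img[sy * w + sx];  // scattered read
    }
    return sum / float(taps.size() + 1);
}
```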
Gaussian Blur DOF Technique
Convolution of NxN neighbor pixels with pre-computed weights:
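$$G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}; \quad C(x, y) = \sum_{i=1}^{N} \sum_{j=1}^{N} G(x_i, y_j) \cdot f(x_i, y_j)$$
Decomposed into 2 passes:
– Vertical pass
– Horizontal pass
$$G(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}}; \quad C(x, y) = \sum_{i=1}^{N} \sum_{j=1}^{N} G(x_i) \cdot G(y_j) \cdot f(x_i, y_j)$$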
Implementation:
– Traditional for GPU in pixel shader
– Novel for CPU, accelerated with TBB & SSE
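The two-pass decomposition can be sketched in plain scalar C++. This is a minimal single-channel illustration, not the DoF Explorer kernel: the function names are hypothetical, edges are clamped as an assumption, and the weights are normalized to sum to 1 instead of using the analytic prefactor.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// 1D Gaussian weights for taps -radius..radius, normalized to sum to 1.
std::vector<float> gaussianWeights(int radius, float sigma) {
    assert(radius >= 0 && sigma > 0.0f);
    std::vector<float> k(2 * radius + 1);
    float sum = 0.0f;
    for (int i = -radius; i <= radius; ++i) {
        k[i + radius] = std::exp(-float(i * i) / (2.0f * sigma * sigma));
        sum += k[i + radius];
    }
    for (float& v : k) v /= sum;
    return k;
}

// One 1D pass along (dx, dy): (1, 0) is horizontal, (0, 1) is vertical.
std::vector<float> blurPass(const std::vector<float>& src, int w, int h,
                            const std::vector<float>& k, int dx, int dy) {
    const int r = int(k.size()) / 2;
    std::vector<float> dst(src.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i) {
                const int sx = std::min(std::max(x + i * dx, 0), w - 1);  // clamp at edges
                const int sy = std::min(std::max(y + i * dy, 0), h - 1);
                acc += k[i + r] * src[sy * w + sx];
            }
            dst[y * w + x] = acc;
        }
    return dst;
}

// Separable blur: 2 * (2r + 1) taps per pixel instead of (2r + 1)^2.
std::vector<float> gaussianBlur(const std::vector<float>& src, int w, int h,
                                int radius, float sigma) {
    const std::vector<float> k = gaussianWeights(radius, sigma);
    return blurPass(blurPass(src, w, h, k, 0, 1), w, h, k, 1, 0);
}
```

Within a pass every output row is computed independently of the others, which is the property that lets the work be split across threads.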
Gaussian Blur Pipeline
[Pipeline diagram: stages and their output resources]
– Render Scene: Color 1280 x 800, Depth 1280 x 800
– Resize X 0.5: Blurred Color 640 x 400
– Gaussian Horiz. Blur: Blurred Color 640 x 400
– Gaussian Vert. Blur: Blurred Color 640 x 400
– DoF Simple Combine: Color 1280 x 800
Stage placement labels: GPU | CPU / GPU | GPU
Gaussian Blur on CPU: Multi-threading with TBB
[Figure: 1. Vertical Pass and 2. Horizontal Pass, each parallelized with tbb::parallel_for; the Gaussian weights F0 F1 F2 F3 F4 are applied along the direction of the pass.]
Gaussian Blur on CPU: Vectorization with SSE 4
[Figure: SSE vectorization of the Vertical Pass and the Horizontal Pass (cache friendly). The pixel row is laid out as R0 G0 B0 A0 R1 G1 B1 A1 R2 G2 B2 A2 R3 … and each filter weight is replicated across the four channels (F0 F0 F0 F0 F1 F1 F1 F1 F2 F2 F2 F2 F3 …). One SSE operation multiplies the four RGBA channels of a pixel by a weight, and the products for the pixels under the filter are summed into the output pixel R0’ G0’ B0’ A0’.]
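The per-pixel multiply-accumulate in the figure can be sketched with SSE intrinsics. This is a hedged illustration rather than the DoF Explorer kernel: `accumulateWeighted` and `verticalPassRow` are hypothetical names, the image is assumed to hold four floats (RGBA) per pixel, rows are clamped at the edges, and a scalar fallback is included so the sketch also builds on targets without SSE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define DOF_SKETCH_HAVE_SSE 1
#endif

// dst[p] += weight * src[p] for `pixels` RGBA pixels stored as 4 floats each.
void accumulateWeighted(float* dst, const float* src, std::size_t pixels, float weight) {
#ifdef DOF_SKETCH_HAVE_SSE
    const __m128 w = _mm_set1_ps(weight);                 // F F F F
    for (std::size_t p = 0; p < pixels; ++p) {
        const __m128 c = _mm_loadu_ps(src + 4 * p);       // R G B A of one pixel
        const __m128 d = _mm_loadu_ps(dst + 4 * p);
        _mm_storeu_ps(dst + 4 * p, _mm_add_ps(d, _mm_mul_ps(c, w)));
    }
#else
    for (std::size_t i = 0; i < 4 * pixels; ++i) dst[i] += weight * src[i];
#endif
}

// Vertical pass for output row y: sum of weighted source rows y-r .. y+r.
void verticalPassRow(const std::vector<float>& rgba, int width, int height, int y,
                     const std::vector<float>& weights, std::vector<float>& outRow) {
    assert(rgba.size() == std::size_t(width) * height * 4);
    const int r = int(weights.size()) / 2;
    outRow.assign(std::size_t(width) * 4, 0.0f);
    for (int k = -r; k <= r; ++k) {
        const int sy = std::min(std::max(y + k, 0), height - 1);  // clamp at edges
        accumulateWeighted(outRow.data(), rgba.data() + std::size_t(sy) * width * 4,
                           std::size_t(width), weights[k + r]);
    }
}
```

Each call to `accumulateWeighted` walks one source row and the output row linearly, so the memory access pattern stays sequential.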
Gaussian Blur: Performance results
[Chart: Gaussian Blur speedup with TBB parallel_for. Y axis: time in milliseconds, 0 to 18. Bars: 1 Thread, 8 Threads. Series: GPU Rendering, CPU Kernel Time. Data labels: 13.7, 4.4, 3.2, 5.6.]
ADVANCED DOF TECHNIQUES Summed Area Tables Gather & Scatter
Summed Area Tables
Source Table (P = p_ij):
    1  2  3  4
1   0  7  2  4
2   1  4  1  2
3   6  1  2  0
4   0  3  5  2
Summed Area Table (SAT):
$$S_{mn} = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij}$$
    1  2  3  4
1   0  7  9 13
2   1 12 15 21
3   7 19 24 30
4   7 22 32 40
Averaging values in the area of source table by SAT:
$$P_{area} = \frac{LR - UR - LL + UL}{width \times height}$$
[Figure: rectangle of the given width and height with the four SAT corner reads: + UL, - UR, - LL, + LR]
Enables averaging values in variable rectangle areas in constant time: just with 4 SAT-texture reads!
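The four-read average can be checked against the example table with a short sketch. The names `Sat`, `buildSat` and `areaAverage` are hypothetical, indices are 0-based here, and reads outside the table are treated as 0, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <vector>

struct Sat {
    int w, h;
    std::vector<double> s;
    // Reads left of or above the table count as 0.
    double at(int x, int y) const { return (x < 0 || y < 0) ? 0.0 : s[y * w + x]; }
};

// S(x, y) = sum of all source values with column <= x and row <= y.
Sat buildSat(const std::vector<double>& p, int w, int h) {
    assert((int)p.size() == w * h);
    Sat t{w, h, std::vector<double>(p.size())};
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            t.s[y * w + x] = p[y * w + x] + t.at(x - 1, y) + t.at(x, y - 1) - t.at(x - 1, y - 1);
    return t;
}

// Average over the inclusive rectangle [x0..x1] x [y0..y1] with 4 SAT reads.
double areaAverage(const Sat& t, int x0, int y0, int x1, int y1) {
    const double lr = t.at(x1, y1), ur = t.at(x1, y0 - 1);
    const double ll = t.at(x0 - 1, y1), ul = t.at(x0 - 1, y0 - 1);
    return (lr - ur - ll + ul) / double((x1 - x0 + 1) * (y1 - y0 + 1));
}
```

For the table above, the 2 x 2 block in rows 2 to 3 and columns 2 to 3 holds 4, 1, 1, 2; the SAT reads give 24 - 9 - 7 + 0 = 8, so the average is 2.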
Gathering vs. Scattering
[Figure: Gathering and Scattering, each shown as input pixels mapped to output pixels]
SAT Gather DoF pipeline
[Pipeline diagram: stages and their output resources]
– Render Scene: Color 8 bit/ch., Depth
– Build SAT: Color 32 bit/ch.
– SAT Gather DoF: Color 8 bit/ch.
Additional resource: Color Temp
Stage placement labels: GPU | CPU / GPU | GPU
Building SAT on GPU in Pixel Shader
Source: 1 2 3 4 5 6 7 8
Pass 1: 1 1..2 2..3 3..4 4..5 5..6 6..7 7..8
Pass 2: 1 1..2 1..3 1..4 2..5 3..6 4..7 5..8
Pass 3: 1 1..2 1..3 1..4 1..5 1..6 1..7 1..8
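The three passes follow a doubling pattern: in pass k every element adds the element 2^(k-1) positions to its left. A CPU sketch of that pattern for one row is given below; the function names are hypothetical, and the two vectors stand in for the pair of render targets a pixel shader would ping-pong between.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One pass: out[i] = in[i] + in[i - offset], or in[i] when i < offset.
void prefixSumPass(const std::vector<float>& in, std::vector<float>& out, std::size_t offset) {
    assert(offset > 0 && in.size() == out.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = (i >= offset) ? in[i] + in[i - offset] : in[i];
}

// Runs passes with offsets 1, 2, 4, ... and returns the inclusive prefix sum.
// `passes`, if given, receives the number of passes executed.
std::vector<float> prefixSumByDoubling(std::vector<float> row, int* passes = nullptr) {
    std::vector<float> next(row.size());
    int count = 0;
    for (std::size_t offset = 1; offset < row.size(); offset *= 2) {
        prefixSumPass(row, next, offset);
        row.swap(next);  // ping-pong between the two buffers
        ++count;
    }
    if (passes) *passes = count;
    return row;
}
```

A row of 8 elements needs 3 passes, matching the slide; in general the pass count grows with the base-2 logarithm of the row length, and a full SAT repeats the same passes along the columns.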
Building SAT on CPU with SSE & TBB
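Single pass on CPU
Build SAT for each row j=1..n:
$$S_{i,j} = P_{i,j} + S_{i,j-1} + S_{i-1,j} - S_{i-1,j-1}$$
$$T_j = S_{i-1,j} - S_{i-1,j-1} + P_{i,j}, \qquad S_{i,j} = S_{i,j-1} + T_j$$
$$T_0 = P_{0,j}, \qquad T_j = T_{j-1} + P_{i,j}$$
$$T \mathrel{+}= P_{i,j}, \qquad S_{i,j} = S_{i,j-1} + T$$
[Figure: pixel P i,j with its SAT neighbors S i-1,j-1, S i,j-1 and S i-1,j; a 3 x 3 grid labeled T1,1 to T3,3]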
Simultaneously process RGBA channels as 4 floats with SSE 4 (128-bit width vector instructions):
– Can be easily extended to 256-bit width AVX on Sandy Bridge
Split texture in tiles and process them in parallel threads:
– Implemented in TBB Tasks
– Run tile-processing tasks with respect to their dependencies
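The single-pass build with tiles can be sketched as follows. This is a sequential illustration under stated assumptions, not the DoF Explorer code: the types and function names are hypothetical, a plain four-float struct stands in for an SSE register, and the tile order (anti-diagonal wavefronts, where a tile runs after its left and upper neighbors) is one possible reading of the dependencies mentioned above. In a threaded version the tiles on one anti-diagonal could run in parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct RGBA { float r, g, b, a; };  // 4 floats, the width of one SSE register
inline RGBA operator+(RGBA x, RGBA y) { return {x.r + y.r, x.g + y.g, x.b + y.b, x.a + y.a}; }
inline RGBA operator-(RGBA x, RGBA y) { return {x.r - y.r, x.g - y.g, x.b - y.b, x.a - y.a}; }

struct Image {
    int w, h;
    std::vector<RGBA> px;
    Image(int w_, int h_) : w(w_), h(h_), px(std::size_t(w_) * h_, RGBA{0, 0, 0, 0}) {}
    RGBA& at(int x, int y) { return px[std::size_t(y) * w + x]; }
    const RGBA& at(int x, int y) const { return px[std::size_t(y) * w + x]; }
};

// Builds the SAT inside one tile [x0, x1) x [y0, y1). The tiles to the left
// and above must already be complete.
void satTile(const Image& src, Image& sat, int x0, int y0, int x1, int y1) {
    const RGBA zero{0, 0, 0, 0};
    for (int y = y0; y < y1; ++y) {
        // Running row sum T at the tile edge: S(x0-1, y) - S(x0-1, y-1).
        RGBA t = zero;
        if (x0 > 0) t = (y > 0) ? sat.at(x0 - 1, y) - sat.at(x0 - 1, y - 1) : sat.at(x0 - 1, y);
        for (int x = x0; x < x1; ++x) {
            t = t + src.at(x, y);                              // T += P
            sat.at(x, y) = (y > 0 ? sat.at(x, y - 1) : zero) + t;  // S = S_above + T
        }
    }
}

// Visits tiles one anti-diagonal at a time so every dependency is satisfied.
void buildSatTiled(const Image& src, Image& sat, int tile) {
    assert(tile > 0 && src.w == sat.w && src.h == sat.h && src.w > 0 && src.h > 0);
    const int tx = (src.w + tile - 1) / tile, ty = (src.h + tile - 1) / tile;
    for (int d = 0; d < tx + ty - 1; ++d)
        for (int j = std::max(0, d - tx + 1); j <= std::min(d, ty - 1); ++j) {
            const int i = d - j;
            satTile(src, sat, i * tile, j * tile,
                    std::min((i + 1) * tile, src.w), std::min((j + 1) * tile, src.h));
        }
}
```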
SAT Scatter DoF pipeline
[Pipeline diagram: stages and their output resources]
– Render Scene: Color 8 bit/ch. 1280 x 800, Depth
– Compute Blur Radius: Blur Params.
– SAT Scatter DoF (add 100px margins): Color 32 bit/ch. 1480 x 1000
– Build SAT: Color 32 bit/ch. 1480 x 1000
– Resize with Crop (remove margins): Color 8 bit/ch. 1280 x 800
Additional resource: Color Temp
Stage placement labels: GPU | CPU / GPU | GPU
SAT Scatter: rectangle spreading
Spread pixels (derive), then build SAT (integrate).
[Figure: input colors and input blur radius are spread into the output colors buffer. A source pixel S writes + and - values at the corners of its rectangle (+ upper left, - upper right, - lower left, + lower right). Legend: SAT computed, ongoing clearing, ongoing SAT building, ongoing rectangle spreading, padding.]
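The spread-then-integrate idea can be sketched on a single-channel image. This is an illustration under stated assumptions rather than the DoF Explorer kernel: `satScatter` is a hypothetical name, each pixel spreads its color uniformly over a square of side 2r + 1, and the padding is sized from an assumed `maxRadius` parameter instead of the fixed 100px margins of the pipeline.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Spreads every input pixel over a (2r+1) x (2r+1) square by writing four
// signed corner values, then integrates the buffer (builds a SAT in place).
std::vector<float> satScatter(const std::vector<float>& color, const std::vector<int>& radius,
                              int w, int h, int maxRadius) {
    assert(color.size() == radius.size() && (int)color.size() == w * h && maxRadius >= 0);
    const int pad = maxRadius + 1;                 // margins so corners stay in bounds
    const int pw = w + 2 * pad, ph = h + 2 * pad;
    std::vector<float> buf(std::size_t(pw) * ph, 0.0f);   // clearing

    // 1. Rectangle spreading (derive): + UL, - UR, - LL, + LR.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            const int r = radius[y * w + x];
            assert(r >= 0 && r <= maxRadius);
            const float v = color[y * w + x] / float((2 * r + 1) * (2 * r + 1));
            const int x0 = x + pad - r, y0 = y + pad - r;
            const int x1 = x + pad + r + 1, y1 = y + pad + r + 1;
            buf[y0 * pw + x0] += v;
            buf[y0 * pw + x1] -= v;
            buf[y1 * pw + x0] -= v;
            buf[y1 * pw + x1] += v;
        }

    // 2. SAT building (integrate), in place and in row-major order.
    for (int y = 0; y < ph; ++y)
        for (int x = 0; x < pw; ++x) {
            const float left = x > 0 ? buf[y * pw + x - 1] : 0.0f;
            const float up = y > 0 ? buf[(y - 1) * pw + x] : 0.0f;
            const float diag = (x > 0 && y > 0) ? buf[(y - 1) * pw + x - 1] : 0.0f;
            buf[y * pw + x] += left + up - diag;
        }

    // 3. Crop: remove the margins.
    std::vector<float> out(std::size_t(w) * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) out[y * w + x] = buf[(y + pad) * pw + x + pad];
    return out;
}
```

A single pixel of value 9 with radius 1 ends up as a 3 x 3 block of 1.0, so the total energy is preserved while the cost per source pixel stays at four writes regardless of the radius.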
SAT Scatter: Optimization Notes
Rectangle spreading on GPU:
– Implemented in Geometry Shader
– Requires a huge number of draw calls: width x height
– Runs slowly even on high-end GPUs
– Compute Shaders could help, but not available on Sandy Bridge
Rectangle spreading on CPU:
– Takes advantage of SSE 4 instructions for RGBA float channels
– Multi-threaded with TBB Tasks (like SAT, but with different dependencies)
– Much faster than on GPU: 8.3x on SNB GT2, 2.7x on NHM GTX 280
Rectangle spreading CPU-stage can be fused with zeroing and SAT building to minimize memory footprint
Quality can be improved with repeated SAT integration (next slides)
SAT Scatter: CPU Optimization Results
[Figures: Sequential Rendering; Deferred Rendering]
Higher Order SAT Scatter (1/4): original image, no filter
Higher Order SAT Scatter (2/4): 1st order filter (box filter)
Higher Order SAT Scatter (3/4): 2nd order filter (triangle filter)
Higher Order SAT Scatter (4/4): 3rd order filter (parabolic filter)
PERFORMANCE RESULTS ON 2-ND GENERATION CORE PROCESSORS
Depth of Field Performance on Sandy Bridge: GPU mode vs. CPU Onloading
[Chart: Frames per Second on Sandy Bridge, driver 15.21.2287, Gothic Temple Scene, 1280x800. X axis: DoF Techniques. Y axis: FPS, 0 to 300.]
SNB Huron River 2720QM + HDG 3000: GPU only: 262, 161, 58, 135, 137, 60, 19, 8
SNB Huron River 2720QM + HDG 3000: CPU Onloading: 124, 40, 60, 67
Speedup labels: 3x, 8x
Significant speedup with CPU Onloading for advanced compute-intensive DoF techniques!
Depth of Field Performance on Sandy Bridge in GPU mode on HDG 3000 & HDG 2000
[Chart: Frames per Second on Sandy Bridge, driver 15.21.2287, Gothic Temple Scene, 1280x800. X axis: DoF Techniques. Y axis: FPS, 0 to 300.]
SNB Huron River 2720QM + HDG 3000: GPU only: 262, 161, 58, 135, 137, 60, 19, 8
SNB Sugar Bay 2600 + HDG 2000: GPU only: 125, 91, 35, 70, 64, 31, 17, 3
~2x: strong dependence on the GPU, with a twofold difference in compute power (12 vs. 6 EUs)
Depth of Field Performance on Sandy Bridge in CPU Onloading mode on HDG 3000 & HDG 2000
[Chart: Frames per Second on Sandy Bridge, driver 15.21.2287, Gothic Temple Scene, 1280x800. X axis: DoF Techniques. Y axis: FPS, 0 to 140.]
SNB Huron River 2720QM + HDG 3000: CPU Onloading: 124, 40, 60, 67
SNB Sugar Bay 2600 + HDG 2000: CPU Onloading: 90, 34, 50, 53
~1.2-1.4x: less dependent on the GPU with extensive CPU Onloading!
DoF Techniques Overhead (1/2)
Conclusion & Follow ups
Accelerate traditional & advanced post-processing techniques with CPU Onloading on modern processors with integrated processor graphics
Optimize compute kernel code with Intel Parallel Studio, TBB, SSE/AVX, MKL, OpenCL and ICC:
– http://software.intel.com/en-us/articles/intel-parallel-studio-home/
– http://software.intel.com/en-us/articles/opencl-sdk/
– http://software.intel.com/en-us/avx/
DOF Source code & article (will be published later):
– http://software.intel.com/en-us/articles/dofexplorer
See other graphics samples:
– http://software.intel.com/en-us/articles/code/