real-time 3d tracking with gpus | gtc 2013on-demand.gputechconf.com/gtc/2013/presentations/s... ·...


Session S3227

Where's Waldo?

Real-time 3D Tracking Using GPUs

Dr. André R. Brodtkorb, Research Scientist

SINTEF ICT, Department of Applied Mathematics


• Established 1950 by the Norwegian Institute of Technology.

• The largest independent research organisation in Scandinavia.

• A non-profit organisation.

• Motto: “Technology for a better society”.

• Key Figures*

• 2123 Employees from 67 different countries.

• 2755 million NOK in turnover (about 340 million EUR / 440 million USD).

• 7216 projects for 2200 customers.

• Offices in Norway, USA, Brazil, Macedonia, United Arab Emirates, and Denmark.

About SINTEF

* Data from SINTEF’s 2009 annual report [Map CC-BY-SA 3.0 based on work by Hayden120 and NuclearVacuum, Wikipedia]


• Motivation & Introduction

• Our Work

• Efficient Video Decoding for CUDA

• Single-camera image processing

• Multi-camera image processing

• Utilizing multiple GPUs

• Summary

Outline


• ADABTS (project number 218 197)

• 7th Framework Programme, Security

• Coordinated by FOI (Sweden)

• Total project cost 4.5 million EUR / 5.8 million USD

• 7 European project partners

• This work is part of Work Package 6: Real Time Platform and System Integration

Motivation: The ADABTS Project


ADABTS Work Package 6 in a nutshell

"To develop a new hardware and software platform for advanced real-time video analysis and detection using heterogeneous computing. Exploit the possibilities that commercially available low cost heterogeneous hardware architectures (multi-core CPUs in combination with GPUs) represent."


• GPUs have a 7-10x advantage over CPUs in floating-point performance and memory bandwidth

• GPUs are naturally suited for image processing

• NVIDIA GPUs support hardware-accelerated video decoding for CUDA

Motivation for using GPUs


[Figure: IP cameras → H.264 (HTTP) → "Desktop Supercomputer" → Results (HTTP) → Operator and further processing]

• Hardware platform: A SuperMicro server with four NVIDIA GTX 580 GPUs

• Software platform: Linux / Windows, CUDA, NVCUVID, C++, a lot of threading and other snacks.

Simplified Real Time Platform Sketch


[Figure: IP cameras → H.264 (HTTP) → "Desktop Supercomputer" (HTTP read → Decode → Single camera image processing → Multi camera image processing → HTTP send) → Results (HTTP) → Operator and further processing]

Simplified Real Time Platform Sketch


Reading, decoding, and sending data





• Decode burden grows as the IP camera resolution grows…

• Current industry standard codecs are JPEG, H.264, and MPEG-4 Part 2

• NVIDIA GPUs support H.264 and MPEG-4 Part 2 decoding in GPU hardware! [2]

IP Cameras

"The majority of IP cameras offered are now megapixel (54.1%). This is somewhat amazing as megapixel was a distinct minority just two years ago." [1]

"Over 60% of megapixel cameras support H.264 while only about 20% support MPEG-4." [1]

[1] IP Camera Statistics 2011, John Honovich, 2010, http://ipvm.com/report/ip_camera_statistics

[2] NVIDIA Purevideo, feature set D


• NVCUVID - CUDA Video Decoder

• Released publicly in 2008

• Linux support after two years

• Enables GPU-accelerated decode

• Virtually zero CPU use

• H.264, MPEG-2, (MPEG-4 Part 2?)

• Decodes a frame into CUDA memory

• Transfers compressed video data directly to the GPU

• Far less PCI-e bandwidth used

GPU Decoding of Video

[Figure: compressed H.264 data goes from system memory (CPU) to the GPU, where it is decoded into a frame in CUDA memory]


• Video processor operates independently of other GPU engines [1]

• Decode comes for free!

• Presumably uses the same hardware as PureVideo

• When something looks too good to be true, it usually is:

• Writing a decoder is a lot of work!

GPU Concurrency Revisited

[Figure: GPU engines: Video Processor (VP), CUDA cores, DMA Engine 1, DMA Engine 2]

[1] E. Young and F. Jargstorff, Image processing & video algorithms with CUDA, Nvision 2008


• The SDK example decodes a single movie from file

• Our needs are a bit more complex:

• Decode multiple movies simultaneously

• Read from network

• Use multiple GPUs

• …

Our Starting Point: The SDK Example


• NVCUVID is sparsely documented

• Find random forum posts online

• Speak with the NVIDIA engineers

• Read the header files carefully

• SDK example is "pedagogically suboptimal"

• Major challenge to decipher

• Difficult to grasp data flow

• A lot of hidden threading

• Ended up creating a UML diagram of what was going on

Deciphering the SDK Example


• Four main threads:

• ByteStream reads data

• Decoder pushes data to NVCUVID

• CUVideoparser performs black magic

• CUVideodecoder decodes video bitstream

• Extremely easy to create a decoder:

1. Create a ByteStream

2. Give the ByteStream to a Decoder

3. Call getNextFrameAsync() or getNextFrame()

• All threading is now hidden!

Our Decoder Structure

ByteStream

• Reads data over HTTP

• Splits data into NALUs

Decoder

• Reads data from ByteStream

• Writes data to NVCUVID

• Keeps track of frames buffered by NVCUVID

CUVideoparser

• Part of NVCUVID

• Black magic

CUVideodecoder

• Part of NVCUVID

• Decodes video bitstream
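The "splits data into NALUs" step of the ByteStream can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the camera delivers an H.264 Annex B byte stream, where each NAL unit is preceded by a 00 00 01 or 00 00 00 01 start code, and the function name `split_nalus` is made up for this example.

```python
def split_nalus(data: bytes) -> list:
    """Split an Annex B byte stream into NAL units, with start codes removed.

    Assumption: the whole stream is in memory. A real ByteStream would also
    have to handle start codes that straddle two network reads.
    """
    nalus = []
    start = None  # index of the first payload byte of the current NAL unit
    i = 0
    while i + 3 <= len(data):
        if data[i] == 0 and data[i + 1] == 0 and data[i + 2] == 1:
            if start is not None:
                # Trailing zero bytes belong to the next (4-byte) start code.
                nalus.append(data[start:i].rstrip(b"\x00"))
            start = i + 3
            i += 3
        else:
            i += 1
    if start is not None:
        nalus.append(data[start:].rstrip(b"\x00"))
    return nalus
```

Each returned element can then be handed to the decoder one at a time.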


• If you feed the decoder unsplit H.264 data, it crashes

• Must be fed one NALU at a time or it will get angry!

• NVCUVID memory can be "special"

• cudaMemcpy used to crash when trying to copy from a decoded frame (works today)

• Access to the CUDA context is not thread safe!

• You must use cuvidCtxLock / cuvidCtxUnlock for *each and every* CUDA call (extremely easy to forget, and hard to keep new developers from making this mistake)

• We created a CudaContext class that pushes / pops and locks / unlocks the CUDA context

• JPEG works well!

• But apparently uses the CPU only (no surprise there, since PureVideo does not support JPEG)

• Other formats also exist in the header file…

Lessons Learned When Working With NVCUVID
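The idea behind the CudaContext class can be sketched as a scoped guard: lock and push on entry, pop and unlock on exit, so that no call site can forget either half. The sketch below only models the pattern. `FakeDriver` and its methods are hypothetical stand-ins for the real context push / pop and cuvidCtxLock / cuvidCtxUnlock calls, and the actual class in the project is C++ and may look different.

```python
import threading


class FakeDriver:
    """Hypothetical stand-in for the CUDA driver's context stack."""

    def __init__(self):
        self.context_stack = []

    def push_current(self, handle):
        self.context_stack.append(handle)

    def pop_current(self):
        return self.context_stack.pop()


class CudaContext:
    """Scoped guard: lock + push on entry, pop + unlock on exit."""

    def __init__(self, driver, handle):
        self._driver = driver
        self._handle = handle
        self._lock = threading.RLock()  # plays the role of the context lock

    def __enter__(self):
        self._lock.acquire()
        self._driver.push_current(self._handle)
        return self

    def __exit__(self, exc_type, exc, tb):
        # Runs even if the guarded code raised, so the context is always released.
        self._driver.pop_current()
        self._lock.release()
        return False


driver = FakeDriver()
context = CudaContext(driver, "gpu0")
with context:
    current = list(driver.context_stack)  # CUDA calls would go here
```

In C++ the same effect is usually obtained with a small RAII object whose constructor and destructor do the locking and unlocking.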


• Our decoder performance matches that of the SDK example

• Decoder speed varies with GPU and encoder options

• We get roughly 200 FPS for one camera

• When we use multiple decoders, the performance scales linearly!

• Two cameras give ~100 FPS per camera

• This means that one GPU should handle roughly 10 cameras in 20 FPS decode!

Performance

[Figure: Performance per camera (x-axis: 1 to 8; y-axis: performance, 0 to 1.2)]
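The capacity estimate above is simple arithmetic, written out here as a sketch. It assumes that the total decode throughput of one GPU stays constant and is shared evenly between cameras, which is what the 200 FPS and 2 x 100 FPS measurements suggest. The function names are illustrative only.

```python
def fps_per_camera(total_decode_fps: float, num_cameras: int) -> float:
    """Decode rate each camera gets when the GPU's throughput is shared evenly."""
    return total_decode_fps / num_cameras


def cameras_per_gpu(total_decode_fps: float, camera_fps: float) -> int:
    """Number of cameras one GPU can decode at the given per-camera frame rate."""
    return int(total_decode_fps // camera_fps)
```

With the figures from the slide, `fps_per_camera(200, 2)` gives 100 FPS and `cameras_per_gpu(200, 20)` gives 10 cameras.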


• Decoder parameters tweaked for throughput, at the expense of latency

• Varies with GPU, but on the order of one second for powerful gamer cards

• Not an issue with our usage scenario

Latency

[Figure: Frame number versus frame latency (x-axis: frame number, 1 to 251; y-axis: seconds, logarithmic, 0.001 to 1)]


Image processing for one camera





• Low level algorithms are at the heart of high-level logic

• Break a complex task up into less complex tasks

• For a single camera, we have a set of low-level tasks

• Image segmentation (foreground / background)

• Optical flow

• Face/pedestrian detection (boosting)

• Modular system

• We can exchange algorithms, and add or remove them

• Most algorithms can run simultaneously

Single camera processing

[Figure: Input frame → Foreground segmentation, Optical flow, Face detection → Detection results]


• Foreground segmentation relies on a good description of background

• Background is essentially an empty scene

• Foreground is everything that deviates from background

Segmentation


• There are many ways of describing the background

• We use intensity and edges (HOG)

• Anything that deviates from the background model is considered foreground

• Computing and updating the background model is embarrassingly parallel!

Describing the background
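To make the per-pixel independence concrete, here is a deliberately simplified sketch. It is not the model used in the project, which combines intensity and edges (HOG). It keeps only an exponential running average of intensity per pixel, and the parameters `alpha` and `threshold` are assumed values for illustration.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the background model, one pixel at a time."""
    return [
        [(1.0 - alpha) * b + alpha * f for b, f in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]


def segment_foreground(background, frame, threshold=25.0):
    """A pixel is foreground if it deviates enough from the background model."""
    return [
        [abs(f - b) > threshold for b, f in zip(bg_row, fr_row)]
        for bg_row, fr_row in zip(background, frame)
    ]
```

Every output pixel depends only on the same pixel in the model and in the frame, so on a GPU each pixel can be handled by its own thread.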


• There is noise in the input video (especially at H.264 I-frames)

• We need to allow some deviations from background

• It is notoriously difficult to handle

• Shadows

• Rapidly changing lighting conditions (clouds / headlights / …)

• Reflections

• Our implementation addresses video noise, varying lighting conditions and shadows.

Detections


• Foreground segmentation is difficult!

• Foreground objects that look like background

• Light changes (shadows, reflections, etc.) are difficult to get right

Example Detections

[Figure panels: Good, Bad, Ugly]


• Optical flow is the calculation of movement in a video stream

• Where did this pixel come from in the last frame?

• Computationally demanding algorithm

• Highly susceptible to image / compression noise

• Multiple ways of computing it

• Brute force search

• Polynomial expansion

• …

Optical flow


• Based on a local search for each pixel

• Take a neighborhood in the previous frame

• Compute sum of absolute differences for a variety of locations in the new frame

• Choose the minimum

• Embarrassingly parallel algorithm

• Quite expensive to compute per pixel in terms of memory bandwidth!

• Only computed for segmented foreground, which keeps it fast

Brute Force Search
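The search described above can be sketched for a single pixel as follows. This is an illustrative CPU version with assumed parameters (patch radius `r`, search radius `search`) and made-up function names, not the CUDA kernel from the project. The caller is expected to pass pixels whose patch lies fully inside the previous frame.

```python
def sad(prev, curr, py, px, cy, cx, r):
    """Sum of absolute differences between two (2r+1) x (2r+1) patches."""
    total = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            total += abs(prev[py + dy][px + dx] - curr[cy + dy][cx + dx])
    return total


def flow_at(prev, curr, y, x, r=1, search=2):
    """Return the (dy, dx) that best moves the patch at (y, x) into curr."""
    h, w = len(curr), len(curr[0])
    best, best_cost = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cy, cx = y + dy, x + dx
            if cy - r < 0 or cx - r < 0 or cy + r >= h or cx + r >= w:
                continue  # candidate patch would fall outside the new frame
            cost = sad(prev, curr, y, x, cy, cx, r)
            if best_cost is None or cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best


def flow_for_foreground(prev, curr, mask, r=1, search=2):
    """Compute flow only where the segmentation mask marks foreground."""
    h, w = len(prev), len(prev[0])
    return {
        (y, x): flow_at(prev, curr, y, x, r, search)
        for y in range(r, h - r)
        for x in range(r, w - r)
        if mask[y][x]
    }
```

Each pixel reads (2r+1)² values for every candidate location, which is where the memory bandwidth cost comes from.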


• Results are quite reasonable

• Algorithm is sensitive to choice of search directions and size of patches to compare

Optical Flow Results


• Approximates the local neighborhood with a second order polynomial.

• Based on implementation in OpenCV and computed for all pixels

• Gives high quality results, but is expensive

Farnebäck Optical Flow


• Face detection by combining many weak classifiers into one strong classifier (variation of AdaBoost)

• A few "easy" steps to perform

1. Generate a mipmap pyramid for resolution-independent face detection

2. Classify every window location for multiple weak classifiers, exit early on non-faces

3. (Summarize hits, i.e. positive face classifications, over scale and space)

WaldBoost Face detection


• Based on the implementation by Michael Hruby [1]

• Ported from OpenCL to CUDA

• A lot of the porting work was done by a set of #defines and constants.

• In-kernel syntax mostly identical!

• Worst part of the porting work was figuring out the data flow.

• Very easy to integrate into our framework

Porting from OpenCL to CUDA

[1] Michael Hruby, WaldBoost on OpenCL


• We have benchmarked over multiple GPU generations

• Growing performance with each new generation

• Non-optimal performance for GTX 680 for unknown reasons

Single Camera Performance

[Figure: Performance versus GPU generation (x-axis: 9800GX2, GTX 540M, GTX 285, GTX 480, GTX 580, GTX 680; y-axis: normalized performance, 0 to 1)]


Multi-camera image processing




Reception

1280x960 H.264 @ 20FPS

1280x960 H.264 @ 20FPS

1280x960 H.264 @ 20FPS

1280x960 H.264 @ 20FPS

720x576 JPEG @ 20FPS


• Multi-camera image processing places high demands on camera synchronization

• If a camera is off by a few frames, results deteriorate rapidly

• IP cameras are naturally out of sync

• Their internal clock is unreliable

• A 25 FPS camera means something like 25 FPS on a sunny day

• Mixing camera makes and frame rates gives extra challenges

Camera Synchronization

Torkel A. Haufmann, Poster P0168 / CO 09 in Computer Vision


• GPU implementation of automatic synchronization [1]

• Given two cameras, generate a set of planes

• Each plane must run through both cameras and a third point in 3D space

• Each plane appears as a line (an epipolar line) in both camera images

• Record changes along these lines, and synchronize based on that

Synchronizing Two Cameras

[1] Pundik and Moses, Video synchronization using temporal signals from epipolar lines, 2010.


• Changes along the epipolar lines are recorded in a 2D array for each camera

• Try matching these two arrays to find the camera drift

• For more cameras, split up into pairs of cameras and synchronize each pair

Synchronizing Two Cameras

[Figure: one 2D array each for Cam 0 and Cam 1, with axes plane number and frame number]
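Matching the two arrays can be sketched as a search over frame offsets. The slides do not say which matching cost is used, so the mean absolute difference below is an assumption, as are the function names. Both arrays are indexed as `[frame][plane]`.

```python
def mismatch(a, b, shift):
    """Mean absolute difference between a[t + shift] and b[t] over overlapping frames."""
    total, count = 0.0, 0
    for t, row_b in enumerate(b):
        s = t + shift
        if 0 <= s < len(a):
            total += sum(abs(x - y) for x, y in zip(a[s], row_b))
            count += len(row_b)
    return total / count if count else float("inf")


def find_drift(a, b, max_shift):
    """Frame offset (within +/- max_shift) that best aligns camera b with camera a."""
    return min(range(-max_shift, max_shift + 1), key=lambda s: mismatch(a, b, s))
```

A positive result means that frame t of camera b corresponds to frame t + drift of camera a. Every candidate offset can be evaluated independently of the others.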


• Synchronized cameras can be used for voxel carving

• Voxel carving gives us the volumetric extent of an object in 3D by combining a view of the object from several angles

• Algorithm is computationally demanding

• Example grid can be 16 million or more voxels

• The CPU is way too slow to handle this

• The basic idea is to create a voxel grid in 3D space, and project the foreground segmentation into this grid

• This projective texturing is a well known technique from computer graphics, used e.g. in shadow mapping.

• Perfectly suited for the GPU!

Voxel Carving

[Figure: Convex hull and carved voxel volume]


• We compute each output element independently

• This makes our algorithm output sensitive: easy to adjust performance by varying voxel grid size

• We use texture lookups for simplicity

• Caveat: Textures, constant memory etc. might be dangerous to use in a multi-threaded setting…

Voxel Carving

[Figure: views from Cam 0 and Cam 1]

Pseudocode

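parallel for each voxel v {
    float avg = 0.0f;
    for each camera c {
        // Foreground segmentation of camera c at the pixel where voxel v projects
        avg += getForeground(c, v);
    }
    avg /= num_cameras;
    // Store the result (grid is an illustrative name for the output voxel grid)
    grid[v] = avg;
}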
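A runnable CPU version of the pseudocode might look like this. It is a sketch under assumptions: `project` is a hypothetical callback that maps a voxel to pixel coordinates (u, v) in a given camera, whereas the real system uses calibrated cameras and texture lookups on the GPU. The threshold that turns averages into an occupied set is also an assumed value.

```python
def carve(masks, project, grid_shape, threshold=0.99):
    """Average the foreground masks over all cameras for every voxel.

    masks: one 2D boolean foreground mask per camera, indexed mask[v][u].
    project(cam, x, y, z): returns the (u, v) pixel the voxel maps to.
    Returns the set of voxels whose average reaches the threshold.
    """
    nx, ny, nz = grid_shape
    occupied = set()
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                avg = 0.0
                for cam, mask in enumerate(masks):
                    u, v = project(cam, x, y, z)
                    avg += 1.0 if mask[v][u] else 0.0
                avg /= len(masks)
                if avg >= threshold:
                    occupied.add((x, y, z))
    return occupied


def orthographic(cam, x, y, z):
    """Toy projection: camera 0 looks along z, camera 1 looks along x."""
    return (x, y) if cam == 0 else (z, y)
```

The three outer loops are what the GPU version runs in parallel, one thread per voxel.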


• By thresholding our voxel carving, we get an iso-surface for foreground objects

• Results dependent on segmentation quality

• Results dependent on occlusions

Voxel Carving Results

[Figure: voxel carving result, with axis ticks from 200 to 1600 (horizontal) and 200 to 1200 (vertical)]


• Voxel carving costs next to nothing for small voxel grid sizes

• Camera decode is the bottleneck

• For larger domain sizes, performance drops more slowly than expected

• Indication that we are not fully saturating the GPU, even for 1024x1024x64

Voxel Carving Results

[Figure: Performance versus voxel grid size (x-axis: voxel grid size n x n x 64 for n = 32, 64, 128, 256, 384, 512, 768, 1024; y-axis: frames per second, 0 to 60)]


• If we sum our voxel grid along the height dimension, we get the density of foreground for each ground plane location.

• We can easily download this 2D map to the CPU and track blobs

• Naïve and simple algorithm:

1. Find blobs with high density

2. Try matching with blobs from previous frame

• Easily extended to create better tracks:

• Face detection results, optical flow, image features, etc.

• Discrete optimization (max flow / k shortest paths)

Probability Map Tracker
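The naive tracker can be sketched in three small steps: sum over height, find connected high-density regions, and greedily match them to the previous frame. The connectivity rule, the threshold and the greedy nearest-neighbor matching are assumptions made for this sketch, since the slides only outline the algorithm.

```python
def density_map(grid):
    """Sum grid[x][y][z] over z (height) to get a 2D ground-plane density."""
    return [[sum(column) for column in row] for row in grid]


def find_blobs(density, threshold):
    """Centroids of 4-connected regions whose density exceeds the threshold."""
    h, w = len(density), len(density[0])
    seen, blobs = set(), []
    for sx in range(h):
        for sy in range(w):
            if (sx, sy) in seen or density[sx][sy] <= threshold:
                continue
            stack, cells = [(sx, sy)], []
            seen.add((sx, sy))
            while stack:  # flood fill one blob
                x, y = stack.pop()
                cells.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < h and 0 <= ny < w and (nx, ny) not in seen
                            and density[nx][ny] > threshold):
                        seen.add((nx, ny))
                        stack.append((nx, ny))
            blobs.append((sum(c[0] for c in cells) / len(cells),
                          sum(c[1] for c in cells) / len(cells)))
    return blobs


def match_blobs(previous, current, max_distance):
    """For each current blob: index of the nearest unused previous blob, or None."""
    matches, used = [], set()
    for cx, cy in current:
        best, best_d = None, max_distance
        for i, (px, py) in enumerate(previous):
            d = ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5
            if i not in used and d <= best_d:
                best, best_d = i, d
        if best is not None:
            used.add(best)
        matches.append(best)
    return matches
```

A blob with no match (`None`) would start a new track. The greedy matching is the part that the discrete optimization mentioned above would replace.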


• The voxel carver produces too much data.

• Output is e.g., 512x512x64 (16 M) voxels

• We want to send this over HTTP for further processing

• Most of the data is not important

• Areas with a low average detection (background)

• Areas with uniform high average detection (e.g., the inside of the carved hull)

• We can compress the data by using standard stream compaction

• Pick out the 1-voxel wide shell of each object

• Reduces data dramatically!

Compression
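Stream compaction is commonly built from three steps: flag the elements to keep, take an exclusive prefix sum of the flags to get output positions, and scatter. The sketch below applies this to the shell of a filled voxel volume. The shell criterion used here, a filled voxel with at least one empty or out-of-grid 6-neighbor, is an assumption, and the sequential loops stand in for what would be parallel kernels.

```python
def shell_flags(filled, shape):
    """Flag filled voxels that have at least one empty 6-neighbor."""
    nx, ny, nz = shape

    def is_filled(x, y, z):
        inside_grid = 0 <= x < nx and 0 <= y < ny and 0 <= z < nz
        return inside_grid and filled[(x * ny + y) * nz + z]

    flags = []
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                neighbours = ((x + 1, y, z), (x - 1, y, z), (x, y + 1, z),
                              (x, y - 1, z), (x, y, z + 1), (x, y, z - 1))
                on_shell = is_filled(x, y, z) and not all(
                    is_filled(*n) for n in neighbours)
                flags.append(1 if on_shell else 0)
    return flags


def exclusive_scan(flags):
    """Exclusive prefix sum: output position of every flagged element."""
    offsets, running = [], 0
    for flag in flags:
        offsets.append(running)
        running += flag
    return offsets, running


def compact(flags):
    """Return the indices of flagged elements, packed densely."""
    offsets, count = exclusive_scan(flags)
    out = [0] * count
    for index, flag in enumerate(flags):
        if flag:
            out[offsets[index]] = index  # scatter step
    return out
```

Only the packed list of shell voxel indices then needs to be sent over HTTP.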


Multi-GPU processing


• Different ways of utilizing multiple GPUs

• Task-parallel pipelining

• Data-parallel between GPUs

• Task-parallel pipelining is terrible

• Ruins all bandwidth savings

• Data-parallel is perfect!

• Our aim is many cameras

• Create a CPU thread for each GPU, and we're in business

Multi-GPU strategies

[Figure: pipeline stages: Decode, Segmentation, Optical flow, Face detect, Multi Camera]
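The data-parallel strategy can be sketched as one CPU thread per GPU, each running the whole pipeline for its own subset of cameras. The round-robin assignment and the `process_camera` callback are assumptions for illustration. In the real system each thread would bind to its GPU and run decode, segmentation, optical flow and face detection there.

```python
import threading


def assign_cameras(cameras, num_gpus):
    """Round-robin split of the camera list over the GPUs."""
    return [cameras[g::num_gpus] for g in range(num_gpus)]


def run_data_parallel(cameras, num_gpus, process_camera):
    """Run process_camera(gpu, camera) with one worker thread per GPU."""
    results = {}
    lock = threading.Lock()

    def worker(gpu, own_cameras):
        for camera in own_cameras:
            # The whole pipeline for this camera stays on one GPU,
            # so no frames have to cross between GPUs.
            output = process_camera(gpu, camera)
            with lock:
                results[camera] = (gpu, output)

    threads = [threading.Thread(target=worker, args=(gpu, own))
               for gpu, own in enumerate(assign_cameras(cameras, num_gpus))]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because no intermediate data moves between GPUs, the bandwidth savings from GPU-side decoding are preserved.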


• Perfect weak scaling

• One GPU supports four to five cameras with all processing

• Four GPUs support 16-20 cameras with all processing

Multi-GPU results

[Figure: Performance versus number of GPUs (x-axis: number of GPUs, 1 to 4; y-axis: normalized performance, 0 to 4)]

Demo video:

https://www.youtube.com/watch?v=JvQJHA0EI2E



• We have presented

• Efficient Video Decoding for CUDA

• Single and multi-camera image processing

• Utilizing multiple GPUs

• GPUs are superbly suited for these tasks

• Papers to be published

Summary


Contact:

André R. Brodtkorb, SINTEF ICT

Email: [email protected]

Webpage: http://babrodtk.at.ifi.uio.no/

Youtube: http://youtube.com/babrodtk/

Thank you for your attention

SINTEF ICT

Department of Applied Mathematics

http://www.sintef.no/math

SINTEF ICT

Department of Optical Measurement

Systems and Data Analysis

http://www.sintef.no/math

Project participants:

Asbjørn Berge*, André Brodtkorb,

Torkel A. Haufmann, Jens Olav Nygaard,

Anna Kim, Kristin Kaspersen,

Jon Hjelmervik

* Project leader
