Real-time 3D Tracking with GPUs | GTC 2013 | on-demand.gputechconf.com/gtc/2013/presentations/s...
TRANSCRIPT
Technology for a better society 1
Session S3227
Where's Waldo?
Real-time 3D Tracking Using GPUs
Dr. André R. Brodtkorb, Research Scientist
SINTEF ICT, Department of Applied Mathematics
• Established 1950 by the Norwegian Institute of Technology.
• The largest independent research organisation in Scandinavia.
• A non-profit organisation.
• Motto: “Technology for a better society”.
• Key Figures*
• 2123 Employees from 67 different countries.
• 2755 million NOK in turnover
(about 340 million EUR / 440 million USD).
• 7216 projects for 2200 customers.
• Offices in Norway, USA, Brazil, Macedonia,
United Arab Emirates, and Denmark.
About SINTEF
* Data from SINTEF’s 2009 annual report [Map CC-BY-SA 3.0 based on work by Hayden120 and NuclearVacuum, Wikipedia]
• Motivation & Introduction
• Our Work
• Efficient Video Decoding for CUDA
• Single-camera image processing
• Multi-camera image processing
• Utilizing multiple GPUs
• Summary
Outline
• ADABTS (project number 218 197)
• 7th framework programme, Security
• Coordinated by FOI (Sweden)
• Total project cost 4.5 million EUR / 5.8 million USD
• 7 European project partners
• This work is part of Work Package 6: Real Time Platform and System Integration
Motivation: The ADABTS Project
ADABTS Work Package 6 in a nutshell
"To develop a new hardware and software platform for advanced
real-time video analysis and detection using heterogeneous computing.
Exploit the possibilities that commercially available low cost heterogeneous
hardware architectures (multi-core CPUs in combination with GPUs) represent."
• GPUs have a 7-10x performance advantage over CPUs
for floating point and bandwidth
• GPUs are naturally suited for image
processing
• NVIDIA GPUs support hardware-accelerated
video decoding for CUDA
Motivation for using GPUs
• Hardware platform: A SuperMicro server with four NVIDIA GTX 580 GPUs
• Software platform: Linux / Windows, CUDA, NVCUVID, C++, a lot of threading and other snacks.
Simplified Real Time Platform Sketch
[Figure: IP cameras send H.264 over HTTP to the "Desktop Supercomputer", which sends results over HTTP to the operator and further processing]
Simplified Real Time Platform Sketch
[Figure: IP cameras (H.264 over HTTP), the "Desktop Supercomputer" running HTTP read, Decode, Single camera image processing, Multi camera image processing, and HTTP send, then results over HTTP to the operator and further processing]
Reading, decoding, and sending data
• The decode burden grows as IP camera resolution grows…
• Current industry-standard codecs are JPEG, H.264, and MPEG-4 Part 2
• NVIDIA GPUs support H.264 and MPEG-4 Part 2 decoding in GPU hardware! [2]
IP Cameras
"The majority of IP cameras offered are now megapixel (54.1%). This is somewhat amazing as megapixel was a distinct minority just two years ago." [1]
"Over 60% of megapixel cameras support H.264 while only about 20% support MPEG-4." [1]
[1] IP Camera Statistics 2011, John Honovich, 2010, http://ipvm.com/report/ip_camera_statistics
[2] NVIDIA PureVideo, feature set D
• NVCUVID - CUDA Video Decoder
• Released publicly in 2008
• Linux support after two years
• Enables GPU-accelerated decode
• Virtually zero CPU use
• H.264, MPEG-2, (MPEG-4 Part 2?)
• Decodes a frame into CUDA memory
• Transfers compressed video data directly to the GPU
• Far less PCI-e bandwidth used
GPU Decoding of Video
[Figure: compressed H.264 data is copied from system memory (CPU) to the GPU, where it is decoded into a frame in CUDA memory]
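A rough sketch of the PCI-e saving, as arithmetic. The 1280x960 @ 20 FPS stream size appears later in these slides; the NV12 frame layout (12 bits per pixel) and the 4 Mbit/s bitrate used in the usage note are assumptions for illustration, not measured values.

```cpp
#include <cassert>
#include <cstdint>

// Bytes per second needed to upload decoded frames, assuming NV12
// (12 bits per pixel: full-resolution luma plus half-resolution chroma).
std::uint64_t rawBytesPerSecond(std::uint32_t width, std::uint32_t height, std::uint32_t fps) {
    assert(width % 2 == 0 && height % 2 == 0);  // NV12 needs even dimensions
    return static_cast<std::uint64_t>(width) * height * 3 / 2 * fps;
}

// Bytes per second needed to upload the compressed stream instead.
std::uint64_t compressedBytesPerSecond(std::uint64_t bitsPerSecond) {
    return bitsPerSecond / 8;
}
```

With these assumptions, `rawBytesPerSecond(1280, 960, 20)` is about 36.9 MB/s per camera, while `compressedBytesPerSecond(4000000)` is 0.5 MB/s.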
• Video processor operates
independently of other GPU
engines [1]
• Decode comes for free!
• Presumably uses the
same hardware as PureVideo
• When something looks too good to be true, it usually is:
• Writing a decoder is a lot of work!
GPU Concurrency Revisited
[Figure: GPU engines that operate independently: Video Processor (VP), CUDA cores, DMA Engine 1, DMA Engine 2]
[1] E. Young and F. Jargstorff, Image processing & video algorithms with CUDA, Nvision 2008
• The SDK example decodes a single movie from file
• Our needs are a bit more complex:
• Decode multiple movies simultaneously
• Read from network
• Use multiple GPUs
• …
Our Starting Point: The SDK Example
• NVCUVID is sparsely documented
• Find random forum posts online
• Speak with the NVIDIA engineers
• Read the header files carefully
• SDK example is "pedagogically suboptimal"
• Major challenge to decipher
• Difficult to grasp data flow
• A lot of hidden threading
• Ended up creating a UML diagram
of what was going on
Deciphering the SDK Example
• Four main threads:
• ByteStream reads data
• Decoder pushes data to NVCUVID
• CUVideoparser performs black magic
• CUVideodecoder decodes video bitstream
• Extremely easy to create a decoder:
1. Create a ByteStream
2. Give the ByteStream to a Decoder
3. Call getNextFrameAsync()
or getNextFrame()
• All threading is now hidden!
Our Decoder Structure
ByteStream
• Reads data over HTTP
• Splits data into NALUs
Decoder
• Reads data from ByteStream
• Writes data to NVCUVID
• Keeps track of frames buffered by NVCUVID
CUVideoparser
• Part of NVCUVID
• Black magic
CUVideodecoder
• Part of NVCUVID
• Decodes video bitstream
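The NALU splitting done by the ByteStream could look roughly like the sketch below. It assumes the camera delivers an H.264 Annex B byte stream, where NAL units are separated by 00 00 01 start codes; `splitNalus` is a hypothetical helper, not the project's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Split an H.264 Annex B byte stream into NAL units at 00 00 01 start codes.
// A four-byte start code (00 00 00 01) is handled because zero bytes are
// trimmed from the end of the preceding unit. The final unit is emitted as-is.
std::vector<std::vector<std::uint8_t>> splitNalus(const std::vector<std::uint8_t>& data) {
    std::vector<std::vector<std::uint8_t>> nalus;
    const std::size_t n = data.size();
    std::size_t start = n;  // payload start of the current NALU; n means "none yet"
    for (std::size_t i = 0; i + 3 <= n; ++i) {
        if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1) {
            if (start != n) {
                std::size_t end = i;
                while (end > start && data[end - 1] == 0) --end;  // drop trailing zeros
                assert(end >= start);
                nalus.emplace_back(data.begin() + start, data.begin() + end);
            }
            start = i + 3;
            i += 2;  // skip the rest of the start code
        }
    }
    if (start < n) nalus.emplace_back(data.begin() + start, data.end());
    return nalus;
}
```

Each returned unit can then be handed to the decoder on its own.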
• If you feed the decoder unsplit H.264 data, it crashes
• Must be fed one NALU at a time or it will get angry!
• NVCUVID memory can be "special"
• cudaMemcpy used to crash when trying to copy from a decoded frame (works today)
• Access to the CUDA context is not thread safe!
• You must use cuvidCtxLock / cuvidCtxUnlock for *each and every* CUDA call (extremely easy to forget, and hard to get new developers not to make this mistake)
• We created a CudaContext class that pushes / pops and locks / unlocks the CUDA context
• JPEG works well!
• But apparently uses the CPU only (no surprise there, since PureVideo does not support JPEG)
• Other formats also exist in the header file…
Lessons Learned When Working With NVCUVID
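The slides do not show the CudaContext class. A minimal sketch of the idea is a scoped guard that locks on construction and unlocks on destruction. `ScopedContextLock` is a hypothetical name, and the callbacks stand in for the real lock / unlock calls (cuvidCtxLock / cuvidCtxUnlock in the slides), so the sketch itself does not touch CUDA.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Scoped guard: runs `lock` on construction and `unlock` on destruction, so a
// CUDA call placed inside the scope cannot forget to release the context.
class ScopedContextLock {
public:
    ScopedContextLock(std::function<void()> lock, std::function<void()> unlock)
        : unlock_(std::move(unlock)) {
        assert(lock && unlock_);  // both callbacks must be set
        lock();
    }
    ~ScopedContextLock() { unlock_(); }

    // Non-copyable: exactly one unlock per lock.
    ScopedContextLock(const ScopedContextLock&) = delete;
    ScopedContextLock& operator=(const ScopedContextLock&) = delete;

private:
    std::function<void()> unlock_;
};
```

Wrapping every block of CUDA calls in such a guard makes the unlock automatic, also on early returns.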
• Our decoder performance matches that of
the SDK example
• Decoder speed varies with GPU and encoder options
• We get roughly 200 FPS for one camera
• When we use multiple decoders, the performance
scales linearly!
• Two cameras give ~100 FPS per camera
• This means that one GPU should handle roughly
10 cameras at 20 FPS decode!
Performance
[Figure: Performance per camera. x-axis: 1 to 8; y-axis: Performance, 0 to 1.2]
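The camera estimate follows from dividing the aggregate decode rate by the per-camera rate. A small sketch of that arithmetic, assuming the aggregate rate is shared evenly between streams (function names are illustrative):

```cpp
#include <cassert>

// Number of camera streams one GPU can decode when the aggregate decode rate
// is shared evenly between the streams.
int camerasPerGpu(double aggregateFps, double perCameraFps) {
    assert(aggregateFps > 0.0 && perCameraFps > 0.0);
    return static_cast<int>(aggregateFps / perCameraFps);
}

// Decode rate each camera gets when `cameras` streams share one GPU.
double fpsPerCamera(double aggregateFps, int cameras) {
    assert(cameras > 0);
    return aggregateFps / cameras;
}
```

With the figures above, `fpsPerCamera(200, 2)` gives 100 FPS and `camerasPerGpu(200, 20)` gives 10 cameras.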
• Decoder parameters tweaked for
throughput, at the expense of latency
• Varies with GPU, but on the order of
one second for powerful gamer cards
• Not an issue with our usage scenario
Latency
[Figure: Frame number versus frame latency. x-axis: Frame number; y-axis: Seconds (logarithmic), 0.001 to 1]
Image processing for one camera
• Low level algorithms are at the heart of high-level logic
• Break a complex task up into less complex tasks
• For a single camera, we have a set of low-level tasks
• Image segmentation (foreground / background)
• Optical flow
• Face/pedestrian detection (boosting)
• Modular system
• We can exchange algorithms, and add or remove them
• Most algorithms can run simultaneously
Single camera processing
[Figure: the input frame feeds Foreground Segmentation, Optical flow, and Face detection, which together produce the detection results]
• Foreground segmentation relies on a good
description of background
• Background is essentially an empty scene
• Foreground is everything that deviates from
background
Segmentation
• There are many ways of describing the background
• We use intensity and edges (HOG)
• Anything that deviates from the background model is
considered foreground
• Computing and updating the background model is
embarrassingly parallel!
Describing the background
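To illustrate why the model is embarrassingly parallel, here is a deliberately simplified sketch: a per-pixel running average of intensity only, with no edge (HOG) term. The learning rate and threshold are assumed parameters, and this is not the background model used in the project.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-pixel background model: an exponential running average of intensity.
struct BackgroundModel {
    std::vector<float> mean;  // background intensity estimate per pixel
    float learningRate;       // how quickly the model adapts (assumed value)
    float threshold;          // allowed deviation before a pixel is foreground

    // Classify every pixel, then update the model where the pixel is background.
    std::vector<bool> segment(const std::vector<float>& frame) {
        assert(frame.size() == mean.size());
        std::vector<bool> foreground(frame.size());
        for (std::size_t i = 0; i < frame.size(); ++i) {  // each pixel is independent
            foreground[i] = std::fabs(frame[i] - mean[i]) > threshold;
            if (!foreground[i]) mean[i] += learningRate * (frame[i] - mean[i]);
        }
        return foreground;
    }
};
```

No pixel reads another pixel's state, so the loop body maps directly onto one GPU thread per pixel.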
• There is noise in the input video (especially at
H.264 I-frames)
• We need to allow some deviations from
background
• It is notoriously difficult to handle
• Shadows
• Rapidly changing lighting conditions
(clouds / headlights / …)
• Reflections
• Our implementation addresses video noise,
varying lighting conditions and shadows.
Detections
• Foreground segmentation is difficult!
• Foreground objects that look like background
• Light changes (shadows, reflections, etc.) are difficult to get right
Example Detections
[Figure: three example detections, labelled Good, Bad, and Ugly]
• Optical flow is the calculation of movement in a video stream
• Where did this pixel come from in the last frame?
• Computationally demanding algorithm
• Highly susceptible to image / compression noise
• Multiple ways of computing it
• Brute force search
• Polynomial expansion
• …
Optical flow
• Based on a local search for each pixel
• Take a neighborhood in the previous frame
• Compute sum of absolute differences for a
variety of locations in the new frame
• Choose the minimum
• Embarrassingly parallel algorithm
• Quite expensive to compute per pixel in terms of
memory bandwidth!
• Only computed for segmented foreground,
which keeps it fast
Brute Force Search
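The local search described above can be sketched for a single pixel as follows. The `Image` and `Flow` types, the square patch, and the square search window are assumptions for illustration; the slides do not specify the search pattern.

```cpp
#include <cassert>
#include <cstdlib>
#include <limits>
#include <vector>

struct Image {
    int width, height;
    std::vector<int> pixels;  // row-major grayscale intensities
    int at(int x, int y) const { return pixels[y * width + x]; }
};

struct Flow { int dx, dy; };

// Brute-force flow for pixel (x, y): compare the patch around (x, y) in the
// previous frame with patches at every offset within `radius` in the new
// frame, and return the offset with the smallest sum of absolute differences.
Flow sadSearch(const Image& prev, const Image& next, int x, int y, int patch, int radius) {
    assert(prev.width == next.width && prev.height == next.height);
    const int margin = patch + radius;  // keep every lookup inside the image
    assert(x >= margin && y >= margin && x < prev.width - margin && y < prev.height - margin);
    Flow best{0, 0};
    long bestSad = std::numeric_limits<long>::max();
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx) {
            long sad = 0;
            for (int py = -patch; py <= patch; ++py)
                for (int px = -patch; px <= patch; ++px)
                    sad += std::abs(prev.at(x + px, y + py) - next.at(x + dx + px, y + dy + py));
            if (sad < bestSad) { bestSad = sad; best = Flow{dx, dy}; }
        }
    return best;
}
```

Every candidate offset rereads a full patch from both frames, which is where the memory bandwidth cost comes from.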
• Results are quite reasonable
• Algorithm is sensitive to choice of search
directions and size of patches to compare
Optical Flow Results
• Approximates the local neighborhood with a second
order polynomial.
• Based on implementation in OpenCV and computed
for all pixels
• Gives high quality results, but is expensive
Farnebäck Optical Flow
• Face detection by combining many weak classifiers
into one strong classifier (variation of AdaBoost)
• A few "easy" steps to perform
1. Generate a mipmap pyramid for resolution-independent face detection
2. Classify every window location for multiple weak
classifiers, exit early on non-faces
3. (Summarize hits (positive face classifications)
over scale and space)
WaldBoost Face detection
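Step 2 can be sketched as a cascade that accumulates weak-classifier responses and stops as soon as the running sum falls below a stage threshold. This is a simplified illustration of the early-exit idea only; the `Stage` type, the weak classifiers, and the thresholds are hypothetical and are not taken from the WaldBoost implementation referenced in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One stage of the cascade: a weak classifier plus the rejection threshold
// that the running sum must stay above after this stage.
struct Stage {
    std::function<float(const std::vector<float>&)> weak;  // response for a window
    float rejectBelow;
};

// Evaluate the stages in order on one window. Returns true if the window
// survives every stage; `evaluated` reports how many stages were run.
bool classifyWindow(const std::vector<Stage>& stages, const std::vector<float>& window,
                    std::size_t& evaluated) {
    assert(!stages.empty());
    float sum = 0.0f;
    evaluated = 0;
    for (const Stage& stage : stages) {
        sum += stage.weak(window);
        ++evaluated;
        if (sum < stage.rejectBelow) return false;  // early exit on a non-face
    }
    return true;
}
```

Most windows are not faces, so most windows leave after the first few stages and the average cost per window stays low.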
• Based on the implementation by Michael Hruby [1]
• Ported from OpenCL to CUDA
• A lot of the porting work was done by a set of
#defines and constants.
• In-kernel syntax mostly identical!
• Worst part of the porting work was figuring out the
data flow.
• Very easy to integrate into our framework
Porting from OpenCL to CUDA
[1] Michael Hruby, WaldBoost on OpenCL
• We have benchmarked over multiple
GPU generations
• Performance grows with each new
generation
• Non-optimal performance for
GTX 680 for unknown reasons
Single Camera Performance
[Figure: Performance versus GPU generation. x-axis: 9800GX2, GTX 540M, GTX 285, GTX 480, GTX 580, GTX 680; y-axis: Normalized performance, 0 to 1]
Multi-camera image processing
Reception
1280x960 H.264 @ 20FPS
1280x960 H.264 @ 20FPS
1280x960 H.264 @ 20FPS
1280x960 H.264 @ 20FPS
720x576 JPEG @ 20FPS
• Multi-camera image processing places high
demands on camera synchronization
• If a camera is off by a few frames, results
deteriorate rapidly
• IP cameras are naturally out of sync
• Their internal clock is unreliable
• A 25 FPS camera means something like 25
FPS on a sunny day
• Mixing camera makes and frame-rates gives
extra challenges
Camera Synchronization
Torkel A. Haufmann, Poster P0168 / CO 09 in Computer Vision
• GPU-implementation of automatic
synchronization [1]
• Given two cameras, generate a set of planes
• Each plane must run through both
cameras, and a third point in 3D space
• Each plane looks like a line in both
cameras
• Record changes along these lines, and
synchronize based on that
Synchronizing Two Cameras
[1] Pundik and Moses, Video synchronization using temporal signals from epipolar lines, 2010.
• Changes along the epipolar lines are recorded in a
2D array for each camera
• Try matching these two arrays to find the camera
drift
• For more cameras, split up into pairs of cameras
and synchronize each pair
Synchronizing Two Cameras
[Figure: one 2D array per camera (Cam 0 and Cam 1). x-axis: Plane number; y-axis: Frame number]
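The slides do not say how the two arrays are matched. One simple possibility, shown here as an assumption rather than the method of Pundik and Moses, is an exhaustive search over frame offsets that minimizes the mean absolute difference; `estimateDrift` and the `Signal` layout are illustrative names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Signal = std::vector<std::vector<float>>;  // [frame][plane]

// Return the frame offset d in [-maxDrift, maxDrift] for which cam1[f + d]
// best matches cam0[f], using the mean absolute difference over the overlap.
int estimateDrift(const Signal& cam0, const Signal& cam1, int maxDrift) {
    assert(!cam0.empty() && cam0.size() == cam1.size());
    assert(maxDrift >= 0 && static_cast<std::size_t>(maxDrift) < cam0.size());
    const int frames = static_cast<int>(cam0.size());
    int best = 0;
    double bestCost = std::numeric_limits<double>::max();
    for (int d = -maxDrift; d <= maxDrift; ++d) {
        double cost = 0.0;
        int count = 0;
        for (int f = 0; f < frames; ++f) {
            const int g = f + d;
            if (g < 0 || g >= frames) continue;  // outside the overlap
            for (std::size_t p = 0; p < cam0[f].size(); ++p)
                cost += std::fabs(cam0[f][p] - cam1[g][p]);
            ++count;
        }
        cost /= count;  // count > 0 because maxDrift < frames
        if (cost < bestCost) { bestCost = cost; best = d; }
    }
    return best;
}
```

The candidate offsets are independent of each other, so on a GPU each one could be evaluated by its own thread block.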
• Synchronized cameras can be used for voxel carving
• Voxel carving gives us the volumetric extent of an object in
3D by combining a view of the object from several angles
• Algorithm is computationally demanding
• Example grid can be 16 million or more voxels
• The CPU is way too slow to handle this
• The basic idea is to create a voxel grid in 3D space, and project
the foreground segmentation into this grid
• This projective texturing is a well known technique from
computer graphics, used e.g. in shadow mapping.
• Perfectly suited for the GPU!
Voxel Carving
[Figure: Convex hull and carved voxel volume]
• We compute each output element independently
• This makes our algorithm output sensitive:
easy to adjust performance by varying voxel
grid size
• We use texture lookups for simplicity
• Caveat: Textures, constant memory etc. might
be dangerous to use in a multi-threaded
setting…
Voxel Carving
[Figure: views from Cam 0 and Cam 1]
Pseudocode
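parallel for each voxel v {
    float avg = 0.0f;
    for each camera c {
        // Foreground value where voxel v projects into camera c
        avg += getForeground(c, v);
    }
    avg /= num_cameras;
    grid[v] = avg;  // fraction of cameras that see foreground at v
}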
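A sequential C++ sketch of the same loop, for reference. Each camera is modelled as a function that hides the projection and texture lookup and returns the foreground value for a voxel; that interface and the grid layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// A camera is modelled as a function that projects voxel (x, y, z) into its
// foreground mask and returns the value there (0 = background, 1 = foreground).
using Camera = std::function<float(int, int, int)>;

// Average foreground over all cameras for every voxel of an nx * ny * nz grid.
std::vector<float> carve(int nx, int ny, int nz, const std::vector<Camera>& cameras) {
    assert(nx > 0 && ny > 0 && nz > 0 && !cameras.empty());
    std::vector<float> grid(static_cast<std::size_t>(nx) * ny * nz);
    for (int z = 0; z < nz; ++z)  // on the GPU: one thread per voxel
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                float avg = 0.0f;
                for (const Camera& camera : cameras) avg += camera(x, y, z);
                grid[(static_cast<std::size_t>(z) * ny + y) * nx + x] =
                    avg / static_cast<float>(cameras.size());
            }
    return grid;
}
```

A voxel with value 1 is seen as foreground by every camera; lower values mean at least one camera sees background there.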
• By thresholding our voxel carving, we
get an iso-surface for foreground
objects
• Results dependent on segmentation
quality
• Results dependent on occlusions
Voxel Carving Results
• Voxel carving is practically free for small voxel
grid sizes
• Camera decode is the bottleneck
• For larger domain sizes, performance
drops more slowly than expected
• Indication that we are not fully
saturating the GPU, even for
1024x1024x6
Voxel Carving Results
[Figure: Performance versus voxel grid size. x-axis: Voxel grid size n x n x 64, n = 32 to 1024; y-axis: Frames per second, 0 to 60]
• If we sum our voxel grid along the height dimension, we get
the density of foreground for each ground plane location.
• We can easily download this 2D map to the CPU and track
blobs
• Naïve and simple algorithm:
1. Find blobs with high density
2. Try matching with blobs from previous frame
• Easily extended to create better tracks:
• Face detection results, optical flow, image features, etc.
• Discrete optimization (max flow / k shortest paths)
Probability Map Tracker
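A minimal sketch of the two ingredients: summing the voxel grid along the height dimension, and matching a blob to the nearest blob from the previous frame. The grid layout (x fastest, height slowest), the `Blob` type, and nearest-neighbour matching are assumptions; blob extraction itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum an nx * ny * nz voxel grid (x fastest, z = height slowest) along the
// height dimension, giving one foreground density per ground-plane cell.
std::vector<float> groundDensity(const std::vector<float>& grid, int nx, int ny, int nz) {
    assert(grid.size() == static_cast<std::size_t>(nx) * ny * nz);
    std::vector<float> density(static_cast<std::size_t>(nx) * ny, 0.0f);
    for (int z = 0; z < nz; ++z)
        for (std::size_t i = 0; i < density.size(); ++i)
            density[i] += grid[static_cast<std::size_t>(z) * density.size() + i];
    return density;
}

struct Blob { float x, y; };

// Naive matching: index of the previous-frame blob closest to `blob`, or -1
// if there were no blobs in the previous frame.
int matchBlob(const Blob& blob, const std::vector<Blob>& previous) {
    int best = -1;
    float bestDist = 0.0f;
    for (std::size_t i = 0; i < previous.size(); ++i) {
        const float dx = previous[i].x - blob.x, dy = previous[i].y - blob.y;
        const float dist = dx * dx + dy * dy;
        if (best < 0 || dist < bestDist) { best = static_cast<int>(i); bestDist = dist; }
    }
    return best;
}
```

The 2D density map is tiny compared with the voxel grid, which is why downloading it to the CPU is cheap.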
• The voxel carver produces too much data.
• Output is e.g., 512x512x64 (16 M) voxels
• We want to send this over HTTP for further processing
• Most of the data is not important
• Areas with a low average detection (background)
• Areas with uniform high average detection (e.g., the inside of the carved hull)
• We can compress the data by using standard stream compaction
• Pick out the 1-voxel wide shell of each object
• Reduces data dramatically!
Compression
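A sequential sketch of the shell selection. It keeps an occupied voxel only if one of its six neighbours is unoccupied; treating cells outside the grid as unoccupied and using a fixed threshold are assumptions. On the GPU the same predicate would feed a parallel stream compaction instead of `push_back`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Return the linear indices of occupied voxels (value >= threshold) that have
// at least one of their six neighbours unoccupied or outside the grid.
std::vector<std::uint32_t> compactShell(const std::vector<float>& grid, int nx, int ny, int nz,
                                        float threshold) {
    assert(grid.size() == static_cast<std::size_t>(nx) * ny * nz);
    auto occupied = [&](int x, int y, int z) -> bool {
        if (x < 0 || y < 0 || z < 0 || x >= nx || y >= ny || z >= nz) return false;
        return grid[(static_cast<std::size_t>(z) * ny + y) * nx + x] >= threshold;
    };
    std::vector<std::uint32_t> shell;
    for (int z = 0; z < nz; ++z)
        for (int y = 0; y < ny; ++y)
            for (int x = 0; x < nx; ++x) {
                if (!occupied(x, y, z)) continue;
                const bool interior = occupied(x - 1, y, z) && occupied(x + 1, y, z) &&
                                      occupied(x, y - 1, z) && occupied(x, y + 1, z) &&
                                      occupied(x, y, z - 1) && occupied(x, y, z + 1);
                if (!interior) shell.push_back(static_cast<std::uint32_t>((z * ny + y) * nx + x));
            }
    return shell;
}
```

For a solid object the shell grows with the surface area rather than the volume, which is where the reduction comes from.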
Multi-GPU processing
• Different ways of utilizing multiple GPUs
• Task-parallel pipelining
• Data-parallel between GPUs
• Task-parallel pipelining is terrible
• Ruins all bandwidth savings, since decoded frames must be copied between GPUs
• Data-parallel is perfect!
• Our aim is many cameras
• Create a CPU thread for each GPU,
and we're in business
Multi-GPU strategies
[Figure: pipeline stages: Decode, Segmentation, Optical flow, Face detect, Multi Camera]
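The data-parallel split amounts to deciding which cameras each GPU owns. A minimal sketch, assuming a round-robin assignment (the slides do not say how cameras are distributed); each inner list would then be processed by the CPU thread that drives that GPU.

```cpp
#include <cassert>
#include <vector>

// Data-parallel split: assign camera c to GPU c % gpuCount, so that each GPU
// (driven by its own CPU thread) runs the whole pipeline for its cameras.
std::vector<std::vector<int>> assignCameras(int cameraCount, int gpuCount) {
    assert(cameraCount >= 0 && gpuCount > 0);
    std::vector<std::vector<int>> perGpu(gpuCount);
    for (int c = 0; c < cameraCount; ++c) perGpu[c % gpuCount].push_back(c);
    return perGpu;
}
```

Because every camera stays on one GPU from decode to output, no decoded frames need to move between GPUs.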
• Perfect weak scaling
• One GPU supports four to five cameras
with all processing
• Four GPUs support 16-20 cameras
with all processing
Multi-GPU results
[Figure: Performance versus number of GPUs. x-axis: Number of GPUs, 1 to 4; y-axis: Normalized performance, 0 to 4]
• We have presented
• Efficient Video Decoding for CUDA
• Single and multi-camera image processing
• Utilizing multiple GPUs
• GPUs are superbly suited for these tasks
• Papers to be published
Summary
Contact:
André R. Brodtkorb, SINTEF ICT
Email: [email protected]
Webpage: http://babrodtk.at.ifi.uio.no/
Youtube: http://youtube.com/babrodtk/
Thank you for your attention
SINTEF ICT
Department of Applied Mathematics
http://www.sintef.no/math
SINTEF ICT
Department of Optical Measurement
Systems and Data Analysis
http://www.sintef.no/math
Project participants:
Asbjørn Berge*, André Brodtkorb,
Torkel A. Haufmann, Jens Olav Nygaard,
Anna Kim, Kristin Kaspersen,
Jon Hjelmervik
* Project leader