Accelerated Connected Component Labeling Using CUDA Framework
Fanny Nina-Paravecino, David Kaeli
ICCVG 2014
Outline
• Introduction
• Connected Component Labeling
• NVIDIA’s Compute Unified Device Architecture
• Accelerated Connected Component Labeling
• Performance Results
• Conclusions

Introduction
• Image analysis plays an important role in many applications
• In the field of physical security, challenging tasks such as luggage scanning at airports require:
  • Near real-time response
  • Very high accuracy
• The connected component labeling algorithm identifies neighboring segments with similar intensities (see the small example below)
  • Potential for efficient segmentation
  • Provides high-quality results
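
As a small worked illustration (not taken from the original slides), consider a 4 x 5 binary image; labeling under 4-connectivity assigns one label per connected group of foreground pixels:

  Binary image        Labels
  1 1 0 0 1           1 1 0 0 2
  1 0 0 1 1           1 0 0 2 2
  0 0 0 0 1           0 0 0 0 2
  1 1 0 0 0           3 3 0 0 0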

Introduction
[Figure: each 512 x 512 image is treated as a matrix; a scan yields ~700 images, i.e. ~700 matrices, processed either as one frame or as multiple frames.]

Introduction
• Flow chart of object detection:
[Figure: DICOM Image → Preprocessing → Image Segmentation → Feature Extraction → Object Detection; our current focus is the segmentation stage.]

Connected Component Labeling (CCL)
• There have been a number of attempts to improve the performance of CCL:
  • Bailey and Johnston, “Single Pass Connected Components Analysis”, Image and Vision Computing (2007)
  • Zhao et al., “Stripe-based Connected Components Labeling” (2010)
  • Klaiber et al., “A memory-efficient parallel single pass architecture for connected component labeling of streamed images” (2012)
• GPU implementations:
  • Stava and Benes, “Connected component labeling in CUDA”, GPU Computing Gems (2010)

NVIDIA’s Compute Unified Device Architecture (CUDA)
• Compute capability by architecture (a query sketch follows below):
  • Tesla: compute capability 1.0, 1.1, 1.2, 1.3
  • Fermi: compute capability 2.0, 2.1
  • Kepler: compute capability 3.0, 3.5
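
A minimal sketch (not from the slides) of checking the compute capability at runtime with the standard CUDA runtime call cudaGetDeviceProperties; the Kepler features discussed next require compute capability 3.5:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);            // properties of device 0
      std::printf("%s: compute capability %d.%d\n",
                  prop.name, prop.major, prop.minor);
      return 0;
  }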

NVIDIA’s Compute Unified Device Architecture (CUDA)
• Dynamic Parallelism: a kernel running on the GPU can launch child kernels directly, without returning control to the CPU (requires compute capability 3.5); a minimal sketch follows below
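
A minimal dynamic-parallelism sketch under assumed names (childKernel, parentKernel); it is illustrative only, not the authors' code, and must be compiled with -rdc=true for a compute capability 3.5 device:

  // A parent kernel launches a child kernel directly from the device.
  __global__ void childKernel(int *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] += 1;                      // trivial per-element work
  }

  __global__ void parentKernel(int *data, int n) {
      if (threadIdx.x == 0 && blockIdx.x == 0) {
          // Launch the child grid from the GPU, with no round trip to the CPU.
          childKernel<<<(n + 255) / 256, 256>>>(data, n);
          cudaDeviceSynchronize();                  // wait for the child grid to finish
      }
  }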

NVIDIA’s Compute Unified Device Architecture (CUDA)
• Concurrent Kernel Execution: Hyper-Q (see the stream sketch below)
[Figure: kernel execution over time, by issue order, for Stream 0 and Stream 1; on Fermi the streams serialize through a single work queue, while on Kepler GK110 Hyper-Q lets kernels from independent streams execute concurrently.]
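
A sketch (assumed names, not the authors' code) of issuing independent kernels in separate CUDA streams; on Kepler GK110, Hyper-Q gives each stream its own hardware queue so these kernels can run concurrently, whereas on Fermi they would largely serialize:

  __global__ void work(float *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] = data[i] * 2.0f;          // placeholder per-element work
  }

  void launchConcurrent(float *d_data[], int n, int numStreams) {
      cudaStream_t streams[16];                     // assumes numStreams <= 16
      for (int s = 0; s < numStreams; ++s)
          cudaStreamCreate(&streams[s]);

      for (int s = 0; s < numStreams; ++s)          // one independent kernel per stream
          work<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data[s], n);

      for (int s = 0; s < numStreams; ++s) {
          cudaStreamSynchronize(streams[s]);
          cudaStreamDestroy(streams[s]);
      }
  }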

Accelerated Connected Component Labeling
• Two phases:
  • Phase 0: Find Spans
  • Phase 1: Merge Spans
[Figure: in Phase 0, threads scan the N x M binary input image and produce an N x K spans matrix (each pair of entries is one span) and an N x K/2 label index matrix; in Phase 1, merge threads launch the UpdateLabel child kernel (child threads) to relabel merged spans.]

Accelerated Connected Component Labeling
• Phase 0: Find Spans
  • Each span has two elements: (ystart, yend)
  • A unique label is assigned to each span immediately
[Figure: example N x M binary image with its resulting spans matrix and label matrix.]
$\mathrm{span}_x = \{(y_{\mathrm{start}}, y_{\mathrm{end}}) \mid I(x, y_{\mathrm{start}}) = I(x, y_{\mathrm{start}+1}) = \dots = I(x, y_{\mathrm{end}})\}$
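
A hedged sketch of Phase 0 under assumed names and memory layout (findSpans, one thread per image row, row-major buffers); it is not the authors' exact kernel, only an illustration of the idea that each row's runs of foreground pixels become (yStart, yEnd) spans with a unique provisional label:

  // Spans matrix is N x K with K = 2 * maxSpansPerRow; labels matrix is N x K/2.
  __global__ void findSpans(const unsigned char *img, int rows, int cols,
                            int *spans, int *labels, int maxSpansPerRow) {
      int x = blockIdx.x * blockDim.x + threadIdx.x;      // row index
      if (x >= rows) return;

      int nSpans = 0;
      int y = 0;
      while (y < cols && nSpans < maxSpansPerRow) {
          while (y < cols && img[x * cols + y] == 0) ++y; // skip background pixels
          if (y >= cols) break;
          int yStart = y;
          while (y < cols && img[x * cols + y] != 0) ++y; // consume the foreground run
          spans[x * 2 * maxSpansPerRow + 2 * nSpans]     = yStart;
          spans[x * 2 * maxSpansPerRow + 2 * nSpans + 1] = y - 1;
          labels[x * maxSpansPerRow + nSpans] = x * maxSpansPerRow + nSpans + 1;
          ++nSpans;
      }
      for (int s = nSpans; s < maxSpansPerRow; ++s) {     // mark unused span slots
          spans[x * 2 * maxSpansPerRow + 2 * s]     = -1;
          spans[x * 2 * maxSpansPerRow + 2 * s + 1] = -1;
      }
  }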

Accelerated Connected Component Labeling
• Phase 1: Merge Spans
[Figure: the Merge Span parent kernel walks the spans matrix and, for each pair of overlapping spans, decides whether to merge; on a merge it launches the UpdateLabel child kernel, so multiple label updates run at the same time instead of one single update; otherwise it moves on to the next span. With concurrent kernels, multiple images are processed at a time. The example spans matrix and label matrix converge to a consistent labeling.]
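
A hedged sketch of Phase 1 reusing the layout assumed in the Phase 0 sketch; it is a simplification, not the authors' exact code. Each parent thread compares the spans of two neighbouring rows and, where spans overlap, launches the UpdateLabel child kernel through dynamic parallelism so that many relabelings proceed concurrently:

  __global__ void updateLabel(int *labels, int nLabels, int oldLabel, int newLabel) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < nLabels && labels[i] == oldLabel) labels[i] = newLabel;
  }

  __global__ void mergeSpans(const int *spans, int *labels, int rows,
                             int maxSpansPerRow, int nLabels) {
      int x = blockIdx.x * blockDim.x + threadIdx.x;      // compare row x with row x+1
      if (x >= rows - 1) return;

      for (int a = 0; a < maxSpansPerRow; ++a) {
          int aStart = spans[x * 2 * maxSpansPerRow + 2 * a];
          int aEnd   = spans[x * 2 * maxSpansPerRow + 2 * a + 1];
          if (aStart < 0) break;                          // no more spans in row x
          for (int b = 0; b < maxSpansPerRow; ++b) {
              int bStart = spans[(x + 1) * 2 * maxSpansPerRow + 2 * b];
              int bEnd   = spans[(x + 1) * 2 * maxSpansPerRow + 2 * b + 1];
              if (bStart < 0) break;                      // no more spans in row x+1
              if (aStart <= bEnd && bStart <= aEnd) {     // overlapping spans: merge labels
                  int la = labels[x * maxSpansPerRow + a];
                  int lb = labels[(x + 1) * maxSpansPerRow + b];
                  if (la != lb)
                      updateLabel<<<(nLabels + 255) / 256, 256>>>(
                          labels, nLabels, max(la, lb), min(la, lb));
              }
          }
      }
  }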

Performance Results
• Input images:
  • DICOM format
  • Integer intensity values in [0, 255]
  • More than 700 images (512 x 512 pixels)

Performance Results
• Pre-processing steps (a thresholding sketch follows the figure):
  • Background noise removal
  • Binary conversion
[Figure: original image and the resulting binary image.]
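
A sketch of the binary-conversion step; the threshold value and kernel shape are assumptions, not taken from the slides, and background noise removal is reduced here to a simple intensity threshold:

  __global__ void toBinary(const unsigned char *in, unsigned char *out,
                           int n, unsigned char threshold) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = (in[i] > threshold) ? 1 : 0;   // 1 = foreground, 0 = background
  }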

Performance Results
• Experimental environment:
  • CPU: Intel Core i7-3770K, 8 GB RAM
  • GPU: GK110 (NVIDIA GTX Titan), compute capability 3.5, CUDA 5.5
  • gcc compiler 3.7, OpenMP 3.0

Performance Results
• One image:

  Method        Running Time (s)   Speedup
  CCL Serial    0.25               1.00x
  CCL OpenMP    0.18               1.39x
  ACCL          0.05               5.00x

Performance Results
• Multiple images: Hyper-Q (a per-stream launch sketch follows the table)

  # Streams   CCL Serial (s)   ACCL (s)   Speedup
  1           0.25             0.05       5.00x
  2           1.08             0.10       10.80x
  3           2.16             0.14       15.36x
  4           4.18             0.19       21.44x
  5           6.09             0.23       25.91x
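
A sketch (with hypothetical buffer and launch-configuration names, reusing the findSpans and mergeSpans sketches above) of pushing several 512 x 512 images through ACCL in separate streams, as in this Hyper-Q experiment:

  for (int s = 0; s < numStreams; ++s) {            // one image per stream
      cudaMemcpyAsync(d_img[s], h_img[s], imgBytes,
                      cudaMemcpyHostToDevice, streams[s]);
      findSpans<<<rowBlocks, 256, 0, streams[s]>>>(d_img[s], rows, cols,
                                                   d_spans[s], d_labels[s],
                                                   maxSpansPerRow);
      mergeSpans<<<rowBlocks, 256, 0, streams[s]>>>(d_spans[s], d_labels[s],
                                                    rows, maxSpansPerRow,
                                                    rows * maxSpansPerRow);
  }
  cudaDeviceSynchronize();                          // wait for every stream to finish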

Performance Results
• Comparison against Stava and Benes, “Connected component labeling in CUDA”:

  Method                          Mpixels/s   Speedup
  Stava and Benes, CCL in CUDA    1542        1.0x
  ACCL                            5242        3.3x

Conclusions
• Described Accelerated Connected Component Labeling (ACCL) using the CUDA framework
• Presented an evaluation of new features of the NVIDIA Kepler GPU, such as dynamic parallelism and Hyper-Q
• Compared serial CCL and OpenMP CCL with ACCL
• Our algorithm scales well as we increase the number of streams
• Dynamic parallelism becomes a disadvantage when launching a large number of child kernels

Thanks! Questions?