Accelerated Connected Component Labeling Using CUDA Framework
Fanny Nina-Paravecino, David Kaeli
ICCVG 2014
Outline
• Introduction
• Connected Component Labeling
• NVIDIA’s Compute Unified Device Architecture
• Accelerated Connected Component Labeling
• Performance Results
• Conclusions

Introduction
• Image analysis plays an important role in many applications
• In the field of physical security, challenging tasks such as luggage scanning at airports require:
  • Near real-time response
  • Very high accuracy
• The connected component labeling algorithm identifies neighboring segments with similar intensities (see the small example below)
  • Potential for efficient segmentation
  • Provides high-quality results
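
As a small worked illustration (not taken from the original slides), consider a 4 x 5 binary image; labeling under 4-connectivity assigns one label per connected group of foreground pixels:

  Binary image        Labels
  1 1 0 0 1           1 1 0 0 2
  1 0 0 1 1           1 0 0 2 2
  0 0 0 0 1           0 0 0 0 2
  1 1 0 0 0           3 3 0 0 0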

Introduction
[Figure: each 512 x 512 image is treated as a matrix; a scan yields ~700 images, i.e. ~700 matrices, processed either as one frame or as multiple frames.]

Introduction
• Flow chart of object detection:
[Figure: DICOM Image → Preprocessing → Image Segmentation → Feature Extraction → Object Detection; our current focus is the segmentation stage.]

Connected Component Labeling (CCL)
• There have been a number of attempts to improve the performance of CCL:
  • Bailey and Johnston, “Single Pass Connected Components Analysis”, Image and Vision Computing (2007)
  • Zhao et al., “Stripe-based Connected Components Labeling” (2010)
  • Klaiber et al., “A memory-efficient parallel single pass architecture for connected component labeling of streamed images” (2012)
• GPU implementations:
  • Stava and Benes, “Connected component labeling in CUDA”, GPU Computing Gems (2010)

NVIDIA’s Compute Unified Device Architecture (CUDA)
• Compute capability by architecture (a query sketch follows below):
  • Tesla: compute capability 1.0, 1.1, 1.2, 1.3
  • Fermi: compute capability 2.0, 2.1
  • Kepler: compute capability 3.0, 3.5
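
A minimal sketch (not from the slides) of checking the compute capability at runtime with the standard CUDA runtime call cudaGetDeviceProperties; the Kepler features discussed next require compute capability 3.5:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main() {
      cudaDeviceProp prop;
      cudaGetDeviceProperties(&prop, 0);            // properties of device 0
      std::printf("%s: compute capability %d.%d\n",
                  prop.name, prop.major, prop.minor);
      return 0;
  }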

NVIDIA’s Compute Unified Device Architecture (CUDA)
• Dynamic Parallelism: a kernel running on the GPU can launch child kernels directly, without returning control to the CPU (requires compute capability 3.5); a minimal sketch follows below
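
A minimal dynamic-parallelism sketch under assumed names (childKernel, parentKernel); it is illustrative only, not the authors' code, and must be compiled with -rdc=true for a compute capability 3.5 device:

  // A parent kernel launches a child kernel directly from the device.
  __global__ void childKernel(int *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] += 1;                      // trivial per-element work
  }

  __global__ void parentKernel(int *data, int n) {
      if (threadIdx.x == 0 && blockIdx.x == 0) {
          // Launch the child grid from the GPU, with no round trip to the CPU.
          childKernel<<<(n + 255) / 256, 256>>>(data, n);
          cudaDeviceSynchronize();                  // wait for the child grid to finish
      }
  }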

NVIDIA’s Compute Unified Device Architecture (CUDA)
• Concurrent Kernel Execution: Hyper-Q (see the stream sketch below)
[Figure: kernel execution over time, by issue order, for Stream 0 and Stream 1; on Fermi the streams serialize through a single work queue, while on Kepler GK110 Hyper-Q lets kernels from independent streams execute concurrently.]
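
A sketch (assumed names, not the authors' code) of issuing independent kernels in separate CUDA streams; on Kepler GK110, Hyper-Q gives each stream its own hardware queue so these kernels can run concurrently, whereas on Fermi they would largely serialize:

  __global__ void work(float *data, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) data[i] = data[i] * 2.0f;          // placeholder per-element work
  }

  void launchConcurrent(float *d_data[], int n, int numStreams) {
      cudaStream_t streams[16];                     // assumes numStreams <= 16
      for (int s = 0; s < numStreams; ++s)
          cudaStreamCreate(&streams[s]);

      for (int s = 0; s < numStreams; ++s)          // one independent kernel per stream
          work<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data[s], n);

      for (int s = 0; s < numStreams; ++s) {
          cudaStreamSynchronize(streams[s]);
          cudaStreamDestroy(streams[s]);
      }
  }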

Accelerated Connected Component Labeling
• Two phases:
  • Phase 0: Find Spans
  • Phase 1: Merge Spans
[Figure: in Phase 0, threads scan the N x M binary input image and produce an N x K spans matrix (each pair of entries is one span) and an N x K/2 label index matrix; in Phase 1, merge threads launch the UpdateLabel child kernel (child threads) to relabel merged spans.]

Accelerated Connected Component Labeling
• Phase 0: Find Spans
  • Each span has two elements: (ystart, yend)
  • A unique label is assigned to each span immediately
[Figure: example N x M binary image with its resulting spans matrix and label matrix.]
$\mathrm{span}_x = \{(y_{\mathrm{start}}, y_{\mathrm{end}}) \mid I(x, y_{\mathrm{start}}) = I(x, y_{\mathrm{start}+1}) = \dots = I(x, y_{\mathrm{end}})\}$
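
A hedged sketch of Phase 0 under assumed names and memory layout (findSpans, one thread per image row, row-major buffers); it is not the authors' exact kernel, only an illustration of the idea that each row's runs of foreground pixels become (yStart, yEnd) spans with a unique provisional label:

  // Spans matrix is N x K with K = 2 * maxSpansPerRow; labels matrix is N x K/2.
  __global__ void findSpans(const unsigned char *img, int rows, int cols,
                            int *spans, int *labels, int maxSpansPerRow) {
      int x = blockIdx.x * blockDim.x + threadIdx.x;      // row index
      if (x >= rows) return;

      int nSpans = 0;
      int y = 0;
      while (y < cols && nSpans < maxSpansPerRow) {
          while (y < cols && img[x * cols + y] == 0) ++y; // skip background pixels
          if (y >= cols) break;
          int yStart = y;
          while (y < cols && img[x * cols + y] != 0) ++y; // consume the foreground run
          spans[x * 2 * maxSpansPerRow + 2 * nSpans]     = yStart;
          spans[x * 2 * maxSpansPerRow + 2 * nSpans + 1] = y - 1;
          labels[x * maxSpansPerRow + nSpans] = x * maxSpansPerRow + nSpans + 1;
          ++nSpans;
      }
      for (int s = nSpans; s < maxSpansPerRow; ++s) {     // mark unused span slots
          spans[x * 2 * maxSpansPerRow + 2 * s]     = -1;
          spans[x * 2 * maxSpansPerRow + 2 * s + 1] = -1;
      }
  }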

Accelerated Connected Component Labeling
• Phase 1: Merge Spans
[Figure: the Merge Span parent kernel walks the spans matrix and, for each pair of overlapping spans, decides whether to merge; on a merge it launches the UpdateLabel child kernel, so multiple label updates run at the same time instead of one single update; otherwise it moves on to the next span. With concurrent kernels, multiple images are processed at a time. The example spans matrix and label matrix converge to a consistent labeling.]
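
A hedged sketch of Phase 1 reusing the layout assumed in the Phase 0 sketch; it is a simplification, not the authors' exact code. Each parent thread compares the spans of two neighbouring rows and, where spans overlap, launches the UpdateLabel child kernel through dynamic parallelism so that many relabelings proceed concurrently:

  __global__ void updateLabel(int *labels, int nLabels, int oldLabel, int newLabel) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < nLabels && labels[i] == oldLabel) labels[i] = newLabel;
  }

  __global__ void mergeSpans(const int *spans, int *labels, int rows,
                             int maxSpansPerRow, int nLabels) {
      int x = blockIdx.x * blockDim.x + threadIdx.x;      // compare row x with row x+1
      if (x >= rows - 1) return;

      for (int a = 0; a < maxSpansPerRow; ++a) {
          int aStart = spans[x * 2 * maxSpansPerRow + 2 * a];
          int aEnd   = spans[x * 2 * maxSpansPerRow + 2 * a + 1];
          if (aStart < 0) break;                          // no more spans in row x
          for (int b = 0; b < maxSpansPerRow; ++b) {
              int bStart = spans[(x + 1) * 2 * maxSpansPerRow + 2 * b];
              int bEnd   = spans[(x + 1) * 2 * maxSpansPerRow + 2 * b + 1];
              if (bStart < 0) break;                      // no more spans in row x+1
              if (aStart <= bEnd && bStart <= aEnd) {     // overlapping spans: merge labels
                  int la = labels[x * maxSpansPerRow + a];
                  int lb = labels[(x + 1) * maxSpansPerRow + b];
                  if (la != lb)
                      updateLabel<<<(nLabels + 255) / 256, 256>>>(
                          labels, nLabels, max(la, lb), min(la, lb));
              }
          }
      }
  }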

Performance Results
• Input images:
  • DICOM format
  • Integer intensity values in [0, 255]
  • More than 700 images (512 x 512 pixels)

Performance Results
• Pre-processing steps (a thresholding sketch follows the figure):
  • Background noise removal
  • Binary conversion
[Figure: original image and the resulting binary image.]
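
A sketch of the binary-conversion step; the threshold value and kernel shape are assumptions, not taken from the slides, and background noise removal is reduced here to a simple intensity threshold:

  __global__ void toBinary(const unsigned char *in, unsigned char *out,
                           int n, unsigned char threshold) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          out[i] = (in[i] > threshold) ? 1 : 0;   // 1 = foreground, 0 = background
  }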

Performance Results
• Experimental environment:
  • CPU: Intel Core i7-3770K, 8 GB RAM
  • GPU: GK110 (NVIDIA GTX Titan), compute capability 3.5, CUDA 5.5
  • gcc compiler 3.7, OpenMP 3.0

Performance Results
• One image:

  Method        Running Time (s)   Speedup
  CCL Serial    0.25               1.00x
  CCL OpenMP    0.18               1.39x
  ACCL          0.05               5.00x

Performance Results
• Multiple images: Hyper-Q (a per-stream launch sketch follows the table)

  # Streams   CCL Serial (s)   ACCL (s)   Speedup
  1           0.25             0.05       5.00x
  2           1.08             0.10       10.80x
  3           2.16             0.14       15.36x
  4           4.18             0.19       21.44x
  5           6.09             0.23       25.91x
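
A sketch (with hypothetical buffer and launch-configuration names, reusing the findSpans and mergeSpans sketches above) of pushing several 512 x 512 images through ACCL in separate streams, as in this Hyper-Q experiment:

  for (int s = 0; s < numStreams; ++s) {            // one image per stream
      cudaMemcpyAsync(d_img[s], h_img[s], imgBytes,
                      cudaMemcpyHostToDevice, streams[s]);
      findSpans<<<rowBlocks, 256, 0, streams[s]>>>(d_img[s], rows, cols,
                                                   d_spans[s], d_labels[s],
                                                   maxSpansPerRow);
      mergeSpans<<<rowBlocks, 256, 0, streams[s]>>>(d_spans[s], d_labels[s],
                                                    rows, maxSpansPerRow,
                                                    rows * maxSpansPerRow);
  }
  cudaDeviceSynchronize();                          // wait for every stream to finish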

Performance Results
• Comparison against Stava and Benes, “Connected component labeling in CUDA”:

  Method                          Mpixels/s   Speedup
  Stava and Benes, CCL in CUDA    1542        1.0x
  ACCL                            5242        3.3x

Conclusions
• Described Accelerated Connected Component Labeling (ACCL) using the CUDA framework
• Presented an evaluation of new features of the NVIDIA Kepler GPU, such as dynamic parallelism and Hyper-Q
• Compared serial CCL and OpenMP CCL with ACCL
• Our algorithm scales well as we increase the number of streams
• Dynamic parallelism becomes a disadvantage when launching a large number of child kernels

Thanks! Questions?