
Accelerated Connected Component Labeling Using CUDA Framework

Fanny Nina-Paravecino, David Kaeli

ICCVG 2014

Outline

• Introduction
• Connected Component Labeling
• NVIDIA’s Compute Unified Device Architecture
• Accelerated Connected Component Labeling
• Performance Results
• Conclusions


Introduction
• Image analysis plays an important role in many applications
• In the field of physical security, challenging tasks such as luggage scanning at airports require:
  • Near real-time response
  • Very high accuracy
• The connected component algorithm identifies neighboring segments possessing similar intensities
  • Potential for efficient segmentation
  • Provides high-quality results


Introduction

[Figure: each 512 x 512 image is treated as a 512 x 512 matrix; the dataset of ~700 images therefore yields ~700 matrices, processed either as one frame or as multiple frames.]


Introduction
• Flow chart of object detection:

Input (DICOM Image) → Preprocessing → Image Segmentation → Feature Extraction → Object Detection

Our current focus: the image segmentation stage.


Connected Component Labeling (CCL)
• There have been a number of attempts to improve the performance of CCL:
  • Bailey and Johnston, “Single Pass Connected Components Analysis,” Image and Vision Computing (2007)
  • Zhao et al., “Stripe-based Connected Components Labeling” (2010)
  • Klaiber et al., “A Memory-Efficient Parallel Single Pass Architecture for Connected Component Labeling of Streamed Images” (2012)
• GPU implementations:
  • Stava and Benes, “Connected Component Labeling in CUDA,” GPU Computing Gems (2010)


NVIDIA’s Compute Unified Device Architecture (CUDA)
• Compute capability by architecture:
  • Tesla: compute capability 1.0, 1.1, 1.2, 1.3
  • Fermi: compute capability 2.0, 2.1
  • Kepler: compute capability 3.0, 3.5
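Since features such as dynamic parallelism and Hyper-Q are gated on compute capability, it is useful to check it at runtime. A minimal sketch (not from the slides) using the standard CUDA runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the compute capability of each visible GPU; this determines which
// architecture features (e.g. dynamic parallelism, Hyper-Q) are available.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```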


NVIDIA’s Compute Unified Device Architecture (CUDA)
• Dynamic Parallelism
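The slide’s figure is not reproduced here. As a hedged sketch of the mechanism (names are illustrative; requires compute capability 3.5+ and compilation with `nvcc -arch=sm_35 -rdc=true -lcudadevrt`): a kernel running on the GPU launches child kernels directly from the device, with no round trip to the host.

```cuda
#include <cuda_runtime.h>

// Child kernel: doubles each element of the chunk handed to it by the parent.
__global__ void childKernel(int *data, int offset, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[offset + i] *= 2;
}

// Parent kernel: each thread launches a child grid for its own chunk,
// entirely on the device; no round trip to the host is needed.
__global__ void parentKernel(int *data, int chunk) {
    int offset = threadIdx.x * chunk;
    childKernel<<<(chunk + 255) / 256, 256>>>(data, offset, chunk);
    // The parent grid does not complete until all its child grids have.
}

int main() {
    const int threads = 4, chunk = 1024, n = threads * chunk;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 1, n * sizeof(int));  // arbitrary fill for the demo
    parentKernel<<<1, threads>>>(d_data, chunk);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

This is the mechanism the Merge Spans phase uses later to launch UpdateLabel child kernels.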


NVIDIA’s Compute Unified Device Architecture (CUDA)
• Concurrent Kernel Execution: Hyper-Q

[Figure: kernel execution over time, by issue order. On Fermi, work issued to Stream 0 and Stream 1 shares a single hardware queue and can serialize; on Kepler GK110, Hyper-Q lets the streams execute concurrently.]
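A minimal sketch (not from the slides) of issuing independent kernels to separate streams; on Kepler GK110, Hyper-Q’s per-stream hardware work queues let them run concurrently:

```cuda
#include <cuda_runtime.h>

// Stand-in for the per-image CCL work; any independent kernel will do.
__global__ void work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int kStreams = 5;            // mirrors the 1-5 streams measured later
    const int n = 512 * 512;           // one 512 x 512 image per stream
    cudaStream_t streams[kStreams];
    float *buf[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        // On Kepler GK110, Hyper-Q gives each stream its own hardware work
        // queue, so these kernels can truly run concurrently; on Fermi the
        // single queue could create false dependencies between them.
        work<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

This pattern corresponds to the multi-image results below, where each stream carries one image.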


Accelerated Connected Component Labeling
• Two phases:
  • Phase 0: Find Spans
  • Phase 1: Merge Spans

[Figure: Phase 0 takes an N x M binary image as input and, using one thread per row, produces an N x K spans matrix (each pair of entries is one span) and an N x K/2 label index matrix. Phase 1 passes these to the UpdateLabel kernel, launched as child threads, which merges spans and rewrites the label index matrix.]
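To make the two-phase pipeline concrete, a hedged host-side sketch; the kernel names `findSpans` and `mergeSpans` and the fixed per-row span budget `K` are assumptions (the kernels themselves are sketched under the next two slides), not the paper’s code:

```cuda
#include <cuda_runtime.h>

// Forward declarations of the two phase kernels (sketched below).
__global__ void findSpans(const unsigned char *img, int N, int M, int K,
                          int *spans, int *labels);
__global__ void mergeSpans(const int *spans, int *labels, int N, int K);

// Label one N x M binary image resident on the device. The spans matrix is
// N x K (each pair of entries is one span) and the label index matrix is
// N x K/2, matching the layout on this slide.
void labelImage(const unsigned char *d_img, int N, int M, int K,
                int *d_spans, int *d_labels, cudaStream_t stream) {
    int threads = 256;
    int blocks = (N + threads - 1) / threads;   // one thread per image row
    findSpans<<<blocks, threads, 0, stream>>>(d_img, N, M, K,
                                              d_spans, d_labels);   // Phase 0
    mergeSpans<<<blocks, threads, 0, stream>>>(d_spans, d_labels,
                                               N, K);               // Phase 1
}
```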


Accelerated Connected Component Labeling
• Phase 0: Find Spans
  • Each span has two elements: (y_start, y_end)
  • A unique label is assigned immediately

[Figure: an N x M binary image, the resulting spans matrix of (y_start, y_end) pairs, and the corresponding label matrix.]


$\mathrm{span}_x = \{(y_{\mathrm{start}}, y_{\mathrm{end}}) \mid I(x, y_{\mathrm{start}}) = I(x, y_{\mathrm{start}+1}) = \dots = I(x, y_{\mathrm{end}})\}$
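The paper’s kernel is not shown on the slide; a sketch of Phase 0 under the stated design (one thread per image row, a fixed span budget `K` per row; the array names and the unique-labeling scheme are illustrative assumptions):

```cuda
// Phase 0 sketch: one thread scans one image row and records every run of
// foreground pixels as a (y_start, y_end) pair, assigning a unique label
// immediately. The layout (N x K spans, N x K/2 labels) follows the slides.
__global__ void findSpans(const unsigned char *img, int N, int M, int K,
                          int *spans, int *labels) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // row index
    if (x >= N) return;
    int k = 0;                                      // next free span slot
    int y = 0;
    while (y < M && k + 1 < K) {
        while (y < M && img[x * M + y] == 0) ++y;   // skip background
        if (y >= M) break;
        int yStart = y;
        while (y < M && img[x * M + y] != 0) ++y;   // consume the run
        spans[x * K + k]     = yStart;              // (y_start, y_end) pair
        spans[x * K + k + 1] = y - 1;
        labels[x * (K / 2) + k / 2] = x * (K / 2) + k / 2 + 1;  // unique label
        k += 2;
    }
    // Mark unused slots so Phase 1 knows where this row's spans end.
    for (int j = k; j < K; ++j) spans[x * K + j] = -1;
    for (int j = k / 2; j < K / 2; ++j) labels[x * (K / 2) + j] = 0;
}
```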

Accelerated Connected Component Labeling
• Phase 1: Merge Spans
  • A Merge Span parent kernel walks the spans matrix; for each candidate pair it asks “Merge?”: if yes, it launches an UpdateLabel child kernel; if no, it moves on to the next span
  • UpdateLabel child kernels perform multiple updates at the same time, instead of one single update
  • Concurrent kernels process multiple images at a time

[Figure: spans matrix and label matrix before and after merging; after the UpdateLabel child kernels run, overlapping spans share a single label in the label matrix.]
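A hedged sketch of Phase 1 under the same assumed layout; the overlap test and the device-side `updateLabel` launch illustrate the dynamic-parallelism pattern, while the synchronization a full implementation needs between concurrent relabelings is omitted:

```cuda
// Hypothetical child kernel: rewrite every occurrence of oldLabel in the
// label index matrix to newLabel, all entries in parallel.
__global__ void updateLabel(int *labels, int count, int oldLabel, int newLabel) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count && labels[i] == oldLabel) labels[i] = newLabel;
}

// Phase 1 sketch: one thread per pair of adjacent rows. Overlapping spans
// trigger a device-side launch of updateLabel (dynamic parallelism), so many
// relabelings proceed at once instead of one single update at a time.
__global__ void mergeSpans(const int *spans, int *labels, int N, int K) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // compares rows x and x+1
    if (x >= N - 1) return;
    int count = N * (K / 2);
    for (int a = 0; a < K && spans[x * K + a] >= 0; a += 2) {
        for (int b = 0; b < K && spans[(x + 1) * K + b] >= 0; b += 2) {
            int aStart = spans[x * K + a],       aEnd = spans[x * K + a + 1];
            int bStart = spans[(x + 1) * K + b], bEnd = spans[(x + 1) * K + b + 1];
            if (aStart <= bEnd && bStart <= aEnd) {              // Merge? Yes
                int la = labels[x * (K / 2) + a / 2];
                int lb = labels[(x + 1) * (K / 2) + b / 2];
                if (la != lb)
                    updateLabel<<<(count + 255) / 256, 256>>>(
                        labels, count, max(la, lb), min(la, lb));
            }                                                    // No: next span
        }
    }
}
```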


Performance Results
• Input images:
  • DICOM format
  • Integer values [0 – 255]
  • More than 700 images (512 x 512 pixels)


Performance Results
• Pre-processing steps:
  • Background noise removal
  • Binary conversion

[Figure: original image (left) and the resulting binary image (right).]
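The binary conversion step can be expressed as a trivial per-pixel kernel; a minimal sketch (the thresholding scheme is an assumption, not the paper’s code):

```cuda
// Pre-processing sketch: suppress low-intensity background noise and convert
// the 8-bit DICOM-derived image to binary in a single per-pixel pass.
__global__ void binarize(const unsigned char *in, unsigned char *out,
                         int n, unsigned char threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (in[i] > threshold) ? 1 : 0;
}
```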


Performance Results
• Experimental environment:
  • CPU: Intel Core i7-3770K processor, 8 GB RAM
  • GPU: GK110 (NVIDIA GTX Titan), compute capability 3.5, CUDA 5.5
  • gcc compiler 3.7
  • OpenMP 3.0


Performance Results
• One image:

Method        Running Time (s)   Speedup
CCL Serial    0.25               1.00x
CCL OpenMP    0.18               1.39x
ACCL          0.05               5.00x


Performance Results
• Multiple images: Hyper-Q

# Streams   CCL Serial (s)   ACCL (s)   Speedup
1           0.25             0.05        5.00x
2           1.08             0.10       10.80x
3           2.16             0.14       15.36x
4           4.18             0.19       21.44x
5           6.09             0.23       25.91x


Performance Results
• Comparison with Stava, O., Benes, B., “Connected Component Labeling in CUDA”:

Method                          Mpixels/s   Speedup
Stava and Benes, CCL in CUDA    1542        1.0x
ACCL                            5242        3.3x


Conclusions
• Described Accelerated Connected Component Labeling (ACCL) using the CUDA framework
• Presented an evaluation of new features of the NVIDIA Kepler GPU, such as dynamic parallelism and Hyper-Q
• Compared serial CCL and OpenMP CCL with ACCL
• Our algorithm scales well as we increase the number of streams
• Dynamic parallelism turns out to be a disadvantage when using a larger number of child kernels


Thanks! Questions?
