
Accelerated Connected Component Labeling Using CUDA Framework

Fanny Nina-Paravecino, David Kaeli

ICCVG 2014

Outline

• Introduction
• Connected Component Labeling
• NVIDIA’s Compute Unified Device Architecture
• Accelerated Connected Component Labeling
• Performance Results
• Conclusions


Introduction
• Image analysis plays an important role in many applications
• In the field of physical security, challenging tasks such as luggage scanning at airports require:
  • Near real-time response
  • Very high accuracy
• The connected component algorithm identifies neighboring segments possessing similar intensities
  • Potential for efficient segmentation
  • Provides high-quality results


Introduction

[Figure: each 512 x 512 image is treated as a 512 x 512 matrix; the dataset of ~700 images therefore yields ~700 matrices, processed either as one frame or as multiple frames.]


Introduction
• Flow chart of object detection:

Input (DICOM Image) → Preprocessing → Image Segmentation → Feature Extraction → Object Detection

Our current focus: the image segmentation stage.


Connected Component Labeling (CCL)
• There have been a number of attempts to improve the performance of CCL:
  • Bailey and Johnston, “Single Pass Connected Components Analysis,” Image and Vision Computing (2007)
  • Zhao et al., “Stripe-based Connected Components Labeling” (2010)
  • Klaiber et al., “A Memory-Efficient Parallel Single Pass Architecture for Connected Component Labeling of Streamed Images” (2012)
• GPU implementations:
  • Stava and Benes, “Connected Component Labeling in CUDA,” GPU Computing Gems (2010)


NVIDIA’s Compute Unified Device Architecture (CUDA)
• Compute capability by architecture:
  • Tesla: compute capability 1.0, 1.1, 1.2, 1.3
  • Fermi: compute capability 2.0, 2.1
  • Kepler: compute capability 3.0, 3.5
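Since features such as dynamic parallelism and Hyper-Q are gated on compute capability, it is useful to check it at runtime. A minimal sketch (not from the slides) using the standard CUDA runtime API:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Print the compute capability of each visible GPU; this determines which
// architecture features (e.g. dynamic parallelism, Hyper-Q) are available.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d\n",
               d, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```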


NVIDIA’s Compute Unified Device Architecture (CUDA)
• Dynamic Parallelism
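The slide’s figure is not reproduced here. As a hedged sketch of the mechanism (names are illustrative; requires compute capability 3.5+ and compilation with `nvcc -arch=sm_35 -rdc=true -lcudadevrt`): a kernel running on the GPU launches child kernels directly from the device, with no round trip to the host.

```cuda
#include <cuda_runtime.h>

// Child kernel: doubles each element of the chunk handed to it by the parent.
__global__ void childKernel(int *data, int offset, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[offset + i] *= 2;
}

// Parent kernel: each thread launches a child grid for its own chunk,
// entirely on the device; no round trip to the host is needed.
__global__ void parentKernel(int *data, int chunk) {
    int offset = threadIdx.x * chunk;
    childKernel<<<(chunk + 255) / 256, 256>>>(data, offset, chunk);
    // The parent grid does not complete until all its child grids have.
}

int main() {
    const int threads = 4, chunk = 1024, n = threads * chunk;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 1, n * sizeof(int));  // arbitrary fill for the demo
    parentKernel<<<1, threads>>>(d_data, chunk);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

This is the mechanism the Merge Spans phase uses later to launch UpdateLabel child kernels.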


NVIDIA’s Compute Unified Device Architecture (CUDA)
• Concurrent Kernel Execution: Hyper-Q

[Figure: kernel execution over time, by issue order. On Fermi, work issued to Stream 0 and Stream 1 shares a single hardware queue and can serialize; on Kepler GK110, Hyper-Q lets the streams execute concurrently.]
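A minimal sketch (not from the slides) of issuing independent kernels to separate streams; on Kepler GK110, Hyper-Q’s per-stream hardware work queues let them run concurrently:

```cuda
#include <cuda_runtime.h>

// Stand-in for the per-image CCL work; any independent kernel will do.
__global__ void work(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main() {
    const int kStreams = 5;            // mirrors the 1-5 streams measured later
    const int n = 512 * 512;           // one 512 x 512 image per stream
    cudaStream_t streams[kStreams];
    float *buf[kStreams];
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        // On Kepler GK110, Hyper-Q gives each stream its own hardware work
        // queue, so these kernels can truly run concurrently; on Fermi the
        // single queue could create false dependencies between them.
        work<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

This pattern corresponds to the multi-image results below, where each stream carries one image.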


Accelerated Connected Component Labeling
• Two phases:
  • Phase 0: Find Spans
  • Phase 1: Merge Spans

[Figure: Phase 0 takes an N x M binary image as input and, using one thread per row, produces an N x K spans matrix (each pair of entries is one span) and an N x K/2 label index matrix. Phase 1 passes these to the UpdateLabel kernel, launched as child threads, which merges spans and rewrites the label index matrix.]
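To make the two-phase pipeline concrete, a hedged host-side sketch; the kernel names `findSpans` and `mergeSpans` and the fixed per-row span budget `K` are assumptions (the kernels themselves are sketched under the next two slides), not the paper’s code:

```cuda
#include <cuda_runtime.h>

// Forward declarations of the two phase kernels (sketched below).
__global__ void findSpans(const unsigned char *img, int N, int M, int K,
                          int *spans, int *labels);
__global__ void mergeSpans(const int *spans, int *labels, int N, int K);

// Label one N x M binary image resident on the device. The spans matrix is
// N x K (each pair of entries is one span) and the label index matrix is
// N x K/2, matching the layout on this slide.
void labelImage(const unsigned char *d_img, int N, int M, int K,
                int *d_spans, int *d_labels, cudaStream_t stream) {
    int threads = 256;
    int blocks = (N + threads - 1) / threads;   // one thread per image row
    findSpans<<<blocks, threads, 0, stream>>>(d_img, N, M, K,
                                              d_spans, d_labels);   // Phase 0
    mergeSpans<<<blocks, threads, 0, stream>>>(d_spans, d_labels,
                                               N, K);               // Phase 1
}
```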


Accelerated Connected Component Labeling
• Phase 0: Find Spans
  • Each span has two elements: (y_start, y_end)
  • A unique label is assigned immediately

[Figure: an N x M binary image, the resulting spans matrix of (y_start, y_end) pairs, and the corresponding label matrix.]


$\mathrm{span}_x = \{(y_{\mathrm{start}}, y_{\mathrm{end}}) \mid I(x, y_{\mathrm{start}}) = I(x, y_{\mathrm{start}+1}) = \dots = I(x, y_{\mathrm{end}})\}$
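The paper’s kernel is not shown on the slide; a sketch of Phase 0 under the stated design (one thread per image row, a fixed span budget `K` per row; the array names and the unique-labeling scheme are illustrative assumptions):

```cuda
// Phase 0 sketch: one thread scans one image row and records every run of
// foreground pixels as a (y_start, y_end) pair, assigning a unique label
// immediately. The layout (N x K spans, N x K/2 labels) follows the slides.
__global__ void findSpans(const unsigned char *img, int N, int M, int K,
                          int *spans, int *labels) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // row index
    if (x >= N) return;
    int k = 0;                                      // next free span slot
    int y = 0;
    while (y < M && k + 1 < K) {
        while (y < M && img[x * M + y] == 0) ++y;   // skip background
        if (y >= M) break;
        int yStart = y;
        while (y < M && img[x * M + y] != 0) ++y;   // consume the run
        spans[x * K + k]     = yStart;              // (y_start, y_end) pair
        spans[x * K + k + 1] = y - 1;
        labels[x * (K / 2) + k / 2] = x * (K / 2) + k / 2 + 1;  // unique label
        k += 2;
    }
    // Mark unused slots so Phase 1 knows where this row's spans end.
    for (int j = k; j < K; ++j) spans[x * K + j] = -1;
    for (int j = k / 2; j < K / 2; ++j) labels[x * (K / 2) + j] = 0;
}
```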

Accelerated Connected Component Labeling
• Phase 1: Merge Spans
  • A Merge Span parent kernel walks the spans matrix; for each candidate pair it asks “Merge?”: if yes, it launches an UpdateLabel child kernel; if no, it moves on to the next span
  • UpdateLabel child kernels perform multiple updates at the same time, instead of one single update
  • Concurrent kernels process multiple images at a time

[Figure: spans matrix and label matrix before and after merging; after the UpdateLabel child kernels run, overlapping spans share a single label in the label matrix.]
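A hedged sketch of Phase 1 under the same assumed layout; the overlap test and the device-side `updateLabel` launch illustrate the dynamic-parallelism pattern, while the synchronization a full implementation needs between concurrent relabelings is omitted:

```cuda
// Hypothetical child kernel: rewrite every occurrence of oldLabel in the
// label index matrix to newLabel, all entries in parallel.
__global__ void updateLabel(int *labels, int count, int oldLabel, int newLabel) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count && labels[i] == oldLabel) labels[i] = newLabel;
}

// Phase 1 sketch: one thread per pair of adjacent rows. Overlapping spans
// trigger a device-side launch of updateLabel (dynamic parallelism), so many
// relabelings proceed at once instead of one single update at a time.
__global__ void mergeSpans(const int *spans, int *labels, int N, int K) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // compares rows x and x+1
    if (x >= N - 1) return;
    int count = N * (K / 2);
    for (int a = 0; a < K && spans[x * K + a] >= 0; a += 2) {
        for (int b = 0; b < K && spans[(x + 1) * K + b] >= 0; b += 2) {
            int aStart = spans[x * K + a],       aEnd = spans[x * K + a + 1];
            int bStart = spans[(x + 1) * K + b], bEnd = spans[(x + 1) * K + b + 1];
            if (aStart <= bEnd && bStart <= aEnd) {              // Merge? Yes
                int la = labels[x * (K / 2) + a / 2];
                int lb = labels[(x + 1) * (K / 2) + b / 2];
                if (la != lb)
                    updateLabel<<<(count + 255) / 256, 256>>>(
                        labels, count, max(la, lb), min(la, lb));
            }                                                    // No: next span
        }
    }
}
```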


Performance Results
• Input images:
  • DICOM format
  • Integer values [0 – 255]
  • More than 700 images (512 x 512 pixels)


Performance Results
• Pre-processing steps:
  • Background noise removal
  • Binary conversion

[Figure: original image (left) and the resulting binary image (right).]
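The binary conversion step can be expressed as a trivial per-pixel kernel; a minimal sketch (the thresholding scheme is an assumption, not the paper’s code):

```cuda
// Pre-processing sketch: suppress low-intensity background noise and convert
// the 8-bit DICOM-derived image to binary in a single per-pixel pass.
__global__ void binarize(const unsigned char *in, unsigned char *out,
                         int n, unsigned char threshold) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (in[i] > threshold) ? 1 : 0;
}
```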


Performance Results
• Experimental environment:
  • CPU: Intel Core i7-3770K processor, 8 GB RAM
  • GPU: GK110 (NVIDIA GTX Titan), compute capability 3.5, CUDA 5.5
  • gcc compiler 3.7
  • OpenMP 3.0


Performance Results
• One image:

Method        Running Time (s)   Speedup
CCL Serial    0.25               1.00x
CCL OpenMP    0.18               1.39x
ACCL          0.05               5.00x


Performance Results
• Multiple images: Hyper-Q

# Streams   CCL Serial (s)   ACCL (s)   Speedup
1           0.25             0.05        5.00x
2           1.08             0.10       10.80x
3           2.16             0.14       15.36x
4           4.18             0.19       21.44x
5           6.09             0.23       25.91x


Performance Results
• Comparison with Stava, O., Benes, B., “Connected Component Labeling in CUDA”:

Method                          Mpixels/s   Speedup
Stava and Benes, CCL in CUDA    1542        1.0x
ACCL                            5242        3.3x


Conclusions
• Described Accelerated Connected Component Labeling (ACCL) using the CUDA framework
• Presented an evaluation of new features of the NVIDIA Kepler GPU, such as dynamic parallelism and Hyper-Q
• Compared serial CCL and OpenMP CCL with ACCL
• Our algorithm scales well as we increase the number of streams
• Dynamic parallelism turns out to be a disadvantage when using a larger number of child kernels


Thanks! Questions?
