hardware acceleration of feature detection and description ... · hardware acceleration of feature...

24
Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel, Christopher Picardo, Christopher Harris, Sherief Reda, R. Iris Bahar, School of Engineering, Brown University

Upload: others

Post on 04-Jul-2020

23 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Hardware Acceleration of Feature Detection and Description Algorithms on 

Low‐Power Embedded Platforms

Onur Ulusel, Christopher Picardo, Christopher Harris, Sherief Reda, R. Iris Bahar, 

School of Engineering, Brown University

Page 2: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Image Processing in Mobile Systems 

2

• Image processing is everywhere!– Input data has changed from words/numbers to images

– Sensors have improved dramatically

• Image processing is a major driving factor in technological advancement– Autonomization relies on image processing

• Mobile/Embedded platforms?? – Real‐time computing + limited data bandwidth           prefer local computing to offloading to cloud

– BUT image processing can be very computationally intensive and power hungry

www.google.com

www.guardiantv.com

Page 3: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Accelerating Image Processing on Low Power Embedded Platforms 

• Meeting real time image processing requirements for many of these applications requires HW assisted acceleration

• Which algorithms do we accelerate?– Feature detection and feature description are key building blocks for image retrieval, biometric identification, visual odometry, etc.

– Computational efficient detection and analysis of image features is critical for performance and energy‐efficiency

3http://www.sybernautix.com/ https://blog.pivotal.iohttps://vision.in.tum.de

Page 4: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Hardware Acceleration for Energy Constrained Image Processing

• Low power embedded platforms– Field Programmable Gate Arrays (FPGAs)– Graphical Processing Units (GPUs)– Low power general processors (CPUs)

4

FPGAs GPUs

XilinxVirtex 6 Xilinx Zynq 7020 1532 core NVIDIA

GeForce GTX 680192 core NVIDIA

Jetson TK1

Power 15 W <5W 195 W <12W

Page 5: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Our Contributions

• Comparative study of feature detection and description algorithms–What are their computation kernel characteristics?

• Comparative study of platforms for embedded applications– Advantages/disadvantages of each platform?

• Accelerating algorithms on different platforms– How can algorithms be modified to better exploit available hardware of each platform?

– How does performance compare in terms of run time and energy consumption?

5

Page 6: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Feature Detection

• What is a ‘feature’?– An “interesting” part of an image that can be used to identify objects

• Examples:  Edges, corners, ridges, blobs

6Slide adapted from Darya Frolova, Denis Simakov, Weizmann Institude

flat region:no change within block

edge:no change 

along the edge

corner:change in 

all directions

Page 7: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Feature Description• Given the features, uniquely describe them so they can be matched in other images

• Descriptors summarize characteristics of the features – E.g., intensity, orientation

• Descriptors should be distinctive and insensitive to local image deformations.

7Images from: R. Szeliski, Computer Vision: Algorithms and Applications

Page 8: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Accuracy and Run‐time Comparisons

8

• HoG (Histogram of Gradient) based Descriptors– SIFT: Scale‐Invariant Feature Transform– SURF: Speeded Up Robust Features

• Binary Feature Descriptors– BRIEF: Binary Robust Independent Elementary Features– BRISK: Binary Robust Invariant Scalable Keypoints

1

10

100

1000

0

20

40

60

80

SIFT SURF BRIEF BRISK

CPU ru

n‐tim

e (m

s) ‐log scale

Accuracy Percetage Precision

Recall

Run‐time

(%  of detected features that are correct)

(% of features detected)

Page 9: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

FAST: Features from Accelerated Segment Test

9

BresenhamCirclesliding

window

12‐pixel continuity?  If so then feature

Pre‐compare pixels 1, 5, 9, and 13 to determine possibility for continuity

On average 98.5% of the comparisons fail the continuity test at the pre‐compare stage

Rosten and  Drummond, ECCV’06

Page 10: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

BRIEF: Binary Robust Independent Elementary Features

• Compare intensities of pairs of points using Hamming distance

• BRIEF Sampling pattern–512 sampling pairs– For each pair, Xi is at (0,0) and Yitakes all possible values from coarse polar grid

– Sampling pairs are generated from a 31×31 region around center pixel

10

Chosen sampling pattern results in a 512‐bit characterization array

Page 11: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

BRISK: Binary Robust Invariant Scalable Keypoints

11

• BRISK uses custom sampling pattern• 512 sampling pairs generated from a 31×31 region (like BRIEF)

• Distinguishes between short/long pairs– Short pairs used similar to BRIEF to generate descriptor vectors based on intensity comparisons

– Long pairs used for orientation computation by rotating sampling pattern

BRISK sampling patternRed circles represent standard deviation of Gaussian smoothing

Page 12: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Start

R d IFrame

Read Input Frame

h i l l

Bresenham circle

For each pixel , p, apply the 7x7 filter of Bresenham circle

is a corner

p is a corner

Generate N  sampling pairs , Xi and Yi, around p

Xi > Yi

Di = 1

Di = 0

FeatureDetection

Brief FeatureDescription

Stop

False

False

Algorithm Flowchart

• FAST feature detection + BRIEF feature description

• Obtaining sampling window for feature description requires irregular access pattern 

FAST

BRIEF

Page 13: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Start

R d IFrame

Read Input Frame

F h i l l

Bresenham circle

For each pixel , p, apply the 7x7 filter of Bresenham circle

is a corner

p is a corner

Generate N  sampling pairs , Xi and Yi, around p

Xi > Yi

Di = 1

Di = 0

FeatureDetection

False

False

Perform Orientation Compensation for BRISK

Stop

FeatureDescription

Algorithm Flowchart

• FAST feature detection + BRISK feature description

• BRISK requires an extra step for orientation compensation– A significant amount of extra hardware resources for this step

FAST

BRISK

Page 14: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Experimental Embedded Platforms

• FPGA: MicroZED development board:– 28nm Zynq 7020 SoC– Artix‐7 FPGA + 1GB DDR3– dual‐core Arm Cortex A9 CPU (for debug and init. only)

• GPU & CPU: Jetson TK1 development kit– 28nm Tegra K1 SoC– Kepler GPU with 192 CUDA cores @ 950MHz– Quadcore ARM Cortex A15 CPU @ 2.5GHz (single core activated)– 2GB Memory– Running OpenCV versions of FAST, BRIEF, BRISK

14

Page 15: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Feature Detection & Description:Block Diagram

Image data

Zig‐Zag Traversing Line Buffers

Mask Size Register Array

Buffer Address Generator andRegister Array 

+‐

+‐

+‐

+‐

Pre‐compute units

Equa

lity X 12

+‐

Circle Comparator

XX

xx

xx

xx

x&ycoordina

tesEnable 

& Sync signals

ARM Cortex A9

 CPU

Central Interconn

ect

32b GP AX

I Master P

ort

10‐Line word 

Buffe

r

Processing System (PS)

Control

Programmable Logic (PL)

is_corner

Smoothing & Region Generation

N‐wide comparator

Orientation Compensation

descrip

tor

AXI Interconn

ect

+‐

Memory InterfaceDDR3

15

orientation

Page 16: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Feature Detection & Description:Block Diagram

Image data

Zig‐Zag Traversing Line Buffers

Mask Size Register Array

Buffer Address Generator andRegister Array 

+‐

+‐

+‐

+‐

Pre‐compute units

Equa

lity X 12

+‐

Circle Comparator

XX

xx

xx

xx

x&ycoordina

tesEnable 

& Sync signals

ARM Cortex A9

 CPU

Central Interconn

ect

32b GP AX

I Master P

ort

10‐Line word 

Buffe

r

Processing System (PS)

Control

Programmable Logic (PL)

is_corner

Smoothing & Region Generation

N‐wide comparator

Orientation Compensation

descrip

tor

AXI Interconn

ect

+‐

Memory InterfaceDDR3

16

orientation

FAST Feature Detection

Page 17: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Feature Detection & Description:Block Diagram

Image data

Zig‐Zag Traversing Line Buffers

Mask Size Register Array

Buffer Address Generator andRegister Array 

+‐

+‐

+‐

+‐

Pre‐compute units

Equa

lity X 12

+‐

Circle Comparator

XX

xx

xx

xx

x&ycoordina

tesEnable 

& Sync signals

ARM Cortex A9

 CPU

Central Interconn

ect

32b GP AX

I Master P

ort

10‐Line word 

Buffe

r

Processing System (PS)

Control

Programmable Logic (PL)

is_corner

Smoothing & Region Generation

N‐wide comparator

Orientation Compensation

descrip

tor

AXI Interconn

ect

+‐

Memory InterfaceDDR3

17

orientation

BRIEF Descriptor

Page 18: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Feature Detection & Description:Block Diagram

Image data

Zig‐Zag Traversing Line Buffers

Mask Size Register Array

Buffer Address Generator andRegister Array 

+‐

+‐

+‐

+‐

Pre‐compute units

Equa

lity X 12

+‐

Circle Comparator

XX

xx

xx

xx

x&ycoordina

tesEnable 

& Sync signals

ARM Cortex A9

 CPU

Central Interconn

ect

32b GP AX

I Master P

ort

10‐Line word 

Buffe

r

Processing System (PS)

Control

Programmable Logic (PL)

is_corner

Smoothing & Region Generation

N‐wide comparator

Orientation Compensation

descrip

tor

AXI Interconn

ect

+‐

Memory InterfaceDDR3

18

orientation

BRISK Descriptor

Page 19: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Feature Detection & Description:Block Diagram

19

Feature detection logic

Image data

Zig‐Zag Traversing Line Buffers

Mask Size Register Array

Buffer Address Generator andRegister Array 

+‐

+‐

+‐

+‐

Pre‐compute units

Equa

lity X 12

+‐

Circle Comparator

XX

xx

xx

xx

x&ycoordina

tesEnable & 

Sync signals

ARM Cortex A9

 CPU

Central Interconn

ect

32b GP AX

I Master P

ort

10‐Line word 

Buffe

r

Processing System (PS)

Control

Programmable Logic (PL)

is_corner

Smoothing & Region Generation

N‐wide comparator

Orientation Compensation

descrip

tor

AXI Interconn

ect

+‐

Memory InterfaceDDR3

orientation

Descriptor logicData control logic for detection and description

Page 20: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Results: Run‐time4.4

25.8

13.8

13.7

36.4

103.7

18.2

13.7

50.8

111.7

27.6

13.7

0

20

40

60

80

100

120

Intel i7 CPU ARM on Jetson Tegra GPU onJetson

Zynq FPGA

Run‐tim

e (m

s)

FAST FAST+BRIEF FAST+BRISK

20

Page 21: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Results: Power & Energy

21

(detection)     (detection + description)

21.5

6.5

4.3

2.20

22.2

6.2

6.3

2.27

22.4

6.3

6.3

2.31

0

5

10

15

20

25

Intel i7 CPU EmbeddedCPU

Tegra GPU Zynq FPGAPo

wer (W

)FAST FAST+BRIEF FAST+BRISK

94.0

166.8

59.3

30.00

809.4

696.4

114.1

30.96

1137

.7

705.1

174.3

31.58

0

200

400

600

800

1000

1200

Intel i7 CPU EmbeddedCPU

Tegra GPU Zynq FPGA

Energy (m

J)

FAST FAST+BRIEF FAST+BRISK

Page 22: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Results: Profiling

• Feature description stalled due to memory throttle– Needs better data management

22

0.0% 5.0% 10.0% 15.0% 20.0% 25.0% 30.0% 35.0% 40.0% 45.0%

ExecutionDependency

Pipe Busy

MemoryThrottle

Not Selected

Stall Reasons during GPU Computation

Description Detection(BRIEF)              (FAST)

0%

20%

40%

60%

80%

100%

FPGA: FAST FPGA: FAST + BRIEF GPU: FAST GPU: FAST + BRIEF

Instruction Distribution

Load/Store FP/Integer

• Feature description for GPU implementation has bump in load/store ops– Almost 10X more than just FAST

Page 23: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Results: FPGA Resource Utilization

23

10% 14%27%

32%19%

37%

57%66%

37%

LookUp Tables Flip Flops Block Rams

FAST FAST+BRIEF FAST+BRISK

Lookup Tables

Flip Flops

BlockRAMs

FAST 4564 1551 8

FAST + BRIEF 14398 2093 11

FAST + BRISK 25575 7115 11

• BRISK requires significant amount of extra resources for smoothing and orienting

• Extra resources do not translate to much extra power

Distribution of resourcesResource utilization

Page 24: Hardware Acceleration of Feature Detection and Description ... · Hardware Acceleration of Feature Detection and Description Algorithms on Low‐Power Embedded Platforms Onur Ulusel,

Conclusions

• FPGA outperforms CPUs and GPUs in terms of power & performance– FAST + BRISK:  36 fps vs. 147 fps– FPGA amenable to various HW optimizations:  

• deep pipelining, optimized memory access, pre‐computation

• FGPA implementations better for handling multiple kernels– For GPUs, multiple kernels highly bounded by kernel scheduler and memory bottlenecks

– FPGA customization on layers better for tackling operations on multiple kernels. • Use profiling on GPU implementation as first step to FPGA optimization

– identify nature of bottlenecks– Customized FPGA HW can often better manage certain types of bottlenecks

24