Copyright © 2014 Synopsys Inc. 1
Pierre Paulin, Director R&D
Santa Clara, 29 May 2014
Combining Flexibility and Low-power in
Embedded Vision Subsystems:
An Application to Pedestrian Detection
Bruno Lavigueur, Senior R&D Engineer
• Pedestrian Detection algorithm overview
• Computation and bandwidth requirements
• Embedded Vision Reference Platform
• Programming Tools and Architecture
• Application Mapping to a Heterogeneous Multi-Core
Platform
• From Functional implementation in OpenCV
to a fully optimized mapping to GPP and ASIP cores
• Final optimized mapping
• Power — Performance — Area analysis
• FPGA-based prototype
• Lessons learned, outlook
Outline
• EDA tool and IP provider
• $1.96B in revenue (FY 2013)
• ~8700 employees ( > 5600 R&D engineers)
• ~81 offices worldwide
• Products for Designing Embedded Vision Systems
• Embedded Cores (ARC HS, EM, 600, 700)
• Application Specific Processor (ASIP) design tools
• Semiconductor IP (DDR, DMA, AXI, HDMI, USB, A/D, …)
• Synthesis and verification for SoCs and FPGAs
• FPGA-based rapid prototyping system
Synopsys — EDA Industry Leadership
• Pedestrian detection
• One of the most popular EV applications
• Standard feature in luxury vehicles
• Moving to mid-size and compact vehicles
in the next 5-10 years, also due to
legislation efforts
• Implementation requirements
• Low cost
• Low power (small form factor, and/or battery powered)
• Programmable (to allow for in-field SW upgrades)
• Most popular algorithm for pedestrian detection is
Histogram of Oriented Gradients (HOG)
Pedestrian Detection and HOG
Histogram Of Oriented Gradients
Gradient Computation
Apply Sobel operators:
  +1 +2 +1        +1  0 −1
   0  0  0   and  +2  0 −2
  −1 −2 −1        +1  0 −1
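To illustrate the gradient step, the two 3x3 Sobel operators above can be applied to a grey-scale image roughly as follows. This is a minimal pure-Python sketch, not the fixed-point implementation used in the talk. The names `SOBEL_X`, `SOBEL_Y`, `apply_kernel` and `gradients` are made up for this example, and leaving border pixels at zero is an assumption.

```python
# The two 3x3 Sobel operators from the slide. The X/Y naming is an
# assumption of this sketch: SOBEL_Y responds to vertical intensity
# changes, SOBEL_X to horizontal ones.
SOBEL_Y = [[+1, +2, +1],
           [0, 0, 0],
           [-1, -2, -1]]
SOBEL_X = [[+1, 0, -1],
           [+2, 0, -2],
           [+1, 0, -1]]


def apply_kernel(image, kernel, row, col):
    """Correlate a 3x3 kernel with the neighbourhood centred on (row, col)."""
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            total += kernel[dr + 1][dc + 1] * image[row + dr][col + dc]
    return total


def gradients(image):
    """Return (gx, gy) images; border pixels stay 0 (assumption)."""
    height, width = len(image), len(image[0])
    gx = [[0] * width for _ in range(height)]
    gy = [[0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx[r][c] = apply_kernel(image, SOBEL_X, r, c)
            gy[r][c] = apply_kernel(image, SOBEL_Y, r, c)
    return gx, gy
```

For an image with a purely vertical edge, only `gx` is non-zero at the edge, which is the behaviour the later orientation histograms rely on.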
Grey scale conversion
Scale to multiple resolutions
Gradient computation
Histogram computation per block
Normalization of the histograms
SVM per window position
Non-max suppression
Scale to Multiple Resolutions
Use a fixed 64x128-pixel detection window.
Apply this detection window to scaled frames.
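The multi-resolution scan can be sketched as below: the frame is repeatedly downscaled and the fixed 64x128 window is slid over each scaled frame. The scale step of 1.2 and the 8-pixel window stride are assumptions chosen for illustration; the slides do not state the values used. The function names are hypothetical.

```python
WIN_W, WIN_H = 64, 128  # fixed detection window from the slide


def pyramid_sizes(width, height, scale_step=1.2):
    """Frame sizes from repeated downscaling until the window no longer fits.

    scale_step is an assumed value, not taken from the slides.
    """
    sizes = []
    scale = 1.0
    while True:
        w, h = int(width / scale), int(height / scale)
        if w < WIN_W or h < WIN_H:
            break
        sizes.append((w, h))
        scale *= scale_step
    return sizes


def window_positions(width, height, stride=8):
    """Number of window placements in one frame (stride of 8 is assumed)."""
    return ((width - WIN_W) // stride + 1) * ((height - WIN_H) // stride + 1)
```

Under these assumptions, `window_positions(640, 480)` gives 3285 window placements for the full-resolution frame alone, which hints at why the classifier stage is costly.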
Histogram Of Oriented Gradients
Histogram Computation
The image is divided into 8x8-pixel cells. For every block of 2x2 cells, apply
Gaussian weights and compute 4 histograms of orientation of gradients.
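A simplified view of the histogram step: each 8x8 cell accumulates gradient magnitudes into orientation bins, and a 2x2-cell block yields four such histograms. The sketch below assumes 9 unsigned-orientation bins over 0 to 180 degrees (a common HOG choice, not stated on the slide) and omits the Gaussian weighting and bin interpolation for brevity. All function names are hypothetical.

```python
import math

NUM_BINS = 9  # assumption: 9 unsigned-orientation bins over 0..180 degrees


def cell_histogram(gx, gy, top, left, cell=8):
    """Magnitude-weighted orientation histogram of one cell.

    Gaussian weighting and interpolation between bins are omitted here.
    """
    hist = [0.0] * NUM_BINS
    bin_width = 180.0 / NUM_BINS
    for r in range(top, top + cell):
        for c in range(left, left + cell):
            magnitude = math.hypot(gx[r][c], gy[r][c])
            angle = math.degrees(math.atan2(gy[r][c], gx[r][c])) % 180.0
            index = min(int(angle // bin_width), NUM_BINS - 1)
            hist[index] += magnitude
    return hist


def block_histograms(gx, gy, top, left, cell=8):
    """Four cell histograms for the 2x2-cell block whose corner is (top, left)."""
    return [cell_histogram(gx, gy, top + dr, left + dc, cell)
            for dr in (0, cell) for dc in (0, cell)]
```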
Normalization of the Histograms
(1) L2 Normalization (2) clipping (saturation) (3) L2 Normalization
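The three normalization steps listed above can be written down directly. In this sketch the clipping threshold of 0.2 and the small `eps` term are assumptions (0.2 is a value commonly used with HOG); the slide does not give them. `l2_normalize` and `normalize_block` are illustrative names.

```python
import math


def l2_normalize(vec, eps=1e-6):
    """Scale vec to unit L2 norm; eps (assumed) avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in vec) + eps * eps)
    return [v / norm for v in vec]


def normalize_block(vec, clip=0.2):
    """(1) L2 normalization, (2) clipping at `clip` (assumed 0.2), (3) L2 normalization."""
    step1 = l2_normalize(vec)
    step2 = [min(v, clip) for v in step1]
    return l2_normalize(step2)
```

The clipping step limits the influence of a single very strong gradient, so the second normalization spreads weight back to the weaker bins.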
Support Vector Machine
Linear classification of histograms
for every 64x128 window position.
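A linear SVM reduces to one dot product per window position. The sketch below shows the decision function and the resulting feature-vector length for one 64x128 window, assuming 9 bins per cell and blocks that slide by one cell (both assumptions, matching common HOG settings rather than anything stated on the slide). The weights and bias would come from offline training; the names here are hypothetical.

```python
def descriptor_length(win_w=64, win_h=128, cell=8, block_cells=2, bins=9):
    """Feature count per window, assuming a one-cell block stride and 9 bins."""
    blocks_x = win_w // cell - block_cells + 1
    blocks_y = win_h // cell - block_cells + 1
    return blocks_x * blocks_y * block_cells * block_cells * bins


def svm_score(descriptor, weights, bias):
    """Linear SVM decision value: dot(weights, descriptor) + bias."""
    return sum(w * x for w, x in zip(weights, descriptor)) + bias


def is_pedestrian(descriptor, weights, bias, threshold=0.0):
    """Classify one window; the threshold value is an assumption."""
    return svm_score(descriptor, weights, bias) > threshold
```

Under these assumptions each window needs a 3780-element dot product, repeated for every window position at every scale.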
Non-Max Suppression
Cluster multi-scale dense scan of
detection windows and select unique
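One generic way to reduce overlapping detections to unique ones is greedy overlap-based suppression, sketched below. The slide does not say which clustering method was used, so this is a stand-in technique, not the presenters' method. The overlap threshold of 0.5 and the `(x, y, w, h)` box format are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def non_max_suppression(detections, overlap=0.5):
    """Keep the best-scoring box of each group of overlapping detections.

    detections: list of (score, box) pairs; overlap threshold is assumed.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kept_box) <= overlap for _, kept_box in kept):
            kept.append((score, box))
    return kept
```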
Embedded Vision Reference Platform
Embedded Vision
Reference Platform
Ported OpenCV library Pedestrian Detection, etc.
C API to ASIP-based vision accelerators
Configurable ARC HS RISC processor
ASIP-based accelerators
HAPS® FPGA-based prototyping system
Pre-verified flow and examples
Time-to-market and Flexibility vs.
Power-Performance-Area Trade-offs
Subsystem Controller: HS
Emb. Vision Accelerators: ASIP, ASIP, ASIP
Power-Performance-Area Efficiency: 1X, 10X, 100X
Time-to-market, Flexibility
Parallelism: Sequential Tasks, Task-Level Parallelism, Data-Level Parallelism
MQX Lightweight O/S
High-level processing:
- Control
- Multi-object tracking
- Post-processing
- High-level command interpretation
Pre-processing:
- Filtering
- Color conversion
- Image scaling
- Feature extraction and matching
- Segmentation
• ARC HS family of high-performance cores
• ARC HS 36 Performance, power at 28 nm HPM process (worst case):
• Scalable to 1.6 GHz
• 1.9 DMIPS/MHz
• 37 µW/MHz
• Application-Specific Instruction-set Processors (ASIP)
• User-driven design of processors tailored to a specific application
• Ability to guide performance-power-area and flexibility trade-offs
• Automatic generation of implementation, C compiler and
programming tools from instruction-set specification
• Connectivity components
• DMA, AXI, DDR, etc.
Main architectural components
53 Dhrystone GIPS/W
Embedded Vision Flow and Architecture
HOG Embedded App.
Base drivers MQX runtime
AXI-4 local interconnect DMA, Sync & I/O HS DCCM
Dedicated Streaming Interconnect (FIFOs)
D D D ASIP1 ASIPn
C/C++ C API to Accelerators
HAPS-70 S12: 12M ASIC gate equiv.
L2 SRAM
ASIP2
• Refinement from an OpenCV high-level functional
description, to a fully optimized multi-processor SoC
combining a GP RISC with multiple ASIPs
• Main steps
• OpenCV functional reference
• Optimization and Porting onto MQX RTOS
• Profiling of all major functions
• Identification of high compute kernels
• Development of ASIPs using Synopsys ASIP design and
exploration tools
• Stepwise refinement
• From GPP only to GPP + multiple ASIPs
HOG Mapping and Refinement Flow
ARC and ASIP Exploration Tool Flow
[Chart: % of processing per stage (Rescale, Grad, Hist, Norm, SVM, Other); axis 0 to 50%]
ARC HS S/W Optimization (C code)
- Optimizing Compiler
- Assembler, Linker
- Instruction-Set Simulator
- Debugger, Profiler
ASIP HW/SW Optimization (Processor Description Language, C code)
- Optimizing Compiler
- Assembler, Linker
- Instruction-Set Simulator
- Debugger, Profiler
- RTL Gen.
- Sim, FPGA, RTL Synthesis
ARC-ASIP Trade-off Exploration
MQX RTOS
HOG Functional Validation on ARC HS
(640 × 480 pixels)
Config #1
AXI local interconnect; DMA, Sync & I/O
Dedicated Streaming Interconnect (FIFOs)
HS Subs. ctrl, DCCM
D D D ASIP1 ASIP2 ASIP4
L3 Ext. DRAM
Tasks: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression
• C fixed point profiling results: 2.25 G cycles per frame
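To put the profiling figure in perspective, the arithmetic below converts cycles per frame into a required cycle rate and an equivalent number of ARC HS cores. It uses the 25 FPS target and the 1.6 GHz maximum clock quoted elsewhere in the deck, and assumes ideal linear scaling across cores with no memory or synchronization overhead, which is optimistic. The function names are illustrative.

```python
CYCLES_PER_FRAME = 2.25e9  # fixed-point profiling result from the slide
TARGET_FPS = 25            # frame rate targeted in the deck
ARC_HS_MAX_HZ = 1.6e9      # "Scalable to 1.6 GHz"


def cycles_per_second(cycles_per_frame, fps):
    """Cycle rate needed to sustain the given frame rate."""
    return cycles_per_frame * fps


def equivalent_cores(cycles_per_frame, fps, clock_hz):
    """Cores needed under ideal linear scaling (an optimistic assumption)."""
    return cycles_per_second(cycles_per_frame, fps) / clock_hz


required = cycles_per_second(CYCLES_PER_FRAME, TARGET_FPS)
cores = equivalent_cores(CYCLES_PER_FRAME, TARGET_FPS, ARC_HS_MAX_HZ)
print(f"{required / 1e9:.2f} G cycles/s, about {cores:.1f} cores at 1.6 GHz")
```

Under these assumptions a software-only solution would need roughly 35 cores at full clock, which motivates offloading the heavy kernels to accelerators.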
Histogram Of Oriented Gradients Profiling
(640 × 480 pixels, at 25 FPS)
Stage                              ARC HS G cycles
Grey scale conversion              0.1
Scale to multiple resolutions      1.4
Gradient computation               17.3
Histogram computation per block    31.9
Normalization of the histograms    1.2
SVM per window position            15.7
Non-max suppression                0.004
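The relative weight of each stage follows from the cycle counts on this slide. The sketch below pairs the seven values with the seven stages in the order they are listed, which is an assumption about the original table layout, and computes each stage's share of the total.

```python
# Pairing of values to stages follows listing order (an assumption).
STAGE_GCYCLES = [
    ("Grey scale conversion", 0.1),
    ("Scale to multiple resolutions", 1.4),
    ("Gradient computation", 17.3),
    ("Histogram computation per block", 31.9),
    ("Normalization of the histograms", 1.2),
    ("SVM per window position", 15.7),
    ("Non-max suppression", 0.004),
]


def stage_shares(stages):
    """Return (name, percent of total cycles) for every stage."""
    total = sum(cycles for _, cycles in stages)
    return [(name, 100.0 * cycles / total) for name, cycles in stages]


for name, percent in stage_shares(STAGE_GCYCLES):
    print(f"{name:34s} {percent:5.1f} %")
```

With this pairing, histogram computation, gradient computation and the SVM together account for well over 90% of the cycles, so they are the natural candidates for acceleration.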
Histogram Of Oriented Gradients Profiling
(640 × 480 pixels, at 25 FPS)
[Chart: % of processing per stage (Rescale, Grad, Hist, Norm, SVM, Other); axis 0 to 50%]
Histogram Of Oriented Gradients Profiling
(640 × 480 pixels, at 25 FPS)
[Chart: # ARC HS per stage (Rescale, Grad, Hist, Norm, SVM, Other); axis 0 to 18]
Single Core
Multicore?
Accelerate!
Task Assignment #2
AXI local interconnect; DMA, Sync & I/O
Dedicated Streaming Interconnect (FIFOs)
HS Subs. ctrl, DCCM
D D D ASIP1 ASIP2 ASIP4
L3 Ext. DRAM
Tasks: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression
Clocks: 1.6 GHz, 1.6 GHz, 400 MHz
Task Assignment #3
AXI local interconnect; DMA, Sync & I/O; HS DCCM
Dedicated Streaming Interconnect (FIFOs)
Subs. ctrl
D D D D ASIP1 ASIP2 ASIP3 ASIP4
L3 Ext. DRAM
Tasks: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression
Clocks: 1.6 GHz, 400 MHz
Task Assignment #4
AXI local interconnect; DMA, Sync & I/O; HS DCCM
Dedicated Streaming Interconnect (FIFOs)
Subs. ctrl
D D D D ASIP1’ ASIP2 ASIP3 ASIP4
L3 Ext. DRAM
Tasks: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression
Clocks: 400 MHz, 400 MHz
Task Assignment #4 With On-Chip L2
AXI local interconnect; DMA, Sync & I/O
Dedicated Streaming Interconnect (FIFOs)
HS DCCM; L2 SRAM
Storage of scaled images
D D D D ASIP1’ ASIP2 ASIP3 ASIP4
L3 Ext. DRAM
Tasks: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression
200 MB/s, 80 MB/s
Clocks: 400 MHz, 400 MHz
Power, Gate Count Comparisons (28 nm)
640 × 480 pixels, at 25 FPS
[Chart: Gates (K) for Config #2, #3, #4; series: ASIP gates (K), ARC gates (K); axis 0 to 1400]
[Chart: Power (mW) for Config #2, #3, #4; series: ASIP power (mW), ARC power (mW); axis 0 to 140]
[Chart: Effort (person-months) for Config #2, #3, #4; series: ASIP design and S/W, ARC S/W; axis 0 to 6]
HAPS FPGA-based demo platform
Note: Gates and power for processors and local memory
• 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAM, L2 SRAM
• Fixed point version of HOG derived from OpenCV
• 25 frames/second at 400 MHz (ARC and ASIPs)
• TSMC HPM process, 28nm
• Gate count (at 400 MHz): 471K gates
• 303K gates for ASIPs, 168K gates for ARC HS 36
• Power consumption: 60 mW
• Prototype running on HAPS board (ASIPs)
• 4 frames/second at 70 MHz
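The FPGA prototype figure is consistent with simple clock scaling of the 28 nm result. The sketch below assumes the frame rate scales linearly with clock frequency, meaning the same cycle count per frame and no change in memory behaviour, which is a simplification. `scaled_fps` is an illustrative name.

```python
def scaled_fps(reference_fps, reference_hz, target_hz):
    """Frame rate at target_hz, assuming linear scaling with clock frequency."""
    return reference_fps * target_hz / reference_hz


# 25 frames/second at 400 MHz, rescaled to the 70 MHz HAPS prototype clock.
fpga_fps = scaled_fps(25, 400e6, 70e6)
print(f"Estimated prototype rate: {fpga_fps:.2f} frames/second")
```

The estimate of about 4.4 frames/second is in line with the 4 frames/second reported for the prototype.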
Final Results for Demonstrator Platform
Demonstration available at our booth
Lessons Learned
Subsystem Controller: HS
1) Greyscale
2) Non-max suppr.
3) Display
4) Control, O/S
Emb. Vision Accelerators: ASIP1’, ASIP2, ASIP3, ASIP4
1’) Rescaling + Gradient
2) Histogram
3) Normalization
4) SVM
Power-Performance-Area Efficiency: 1X, 10X, 100X
Time-to-market, Flexibility
Parallelism: Sequential Tasks, Task-Level Parallelism, Data-Level Parallelism
Lessons Learned
Subsystem Controller: HS
Emb. Vision Accelerators: ASIP1’, ASIP2, ASIP3, ASIP4
Area Efficiency: 1X, 60X~80X
Time-to-market, Flexibility
Parallelism: Sequential Tasks, Task-Level Parallelism, Data-Level Parallelism
Combined: 471K gates, 60 mW @ 28 nm
400 MHz
303K gates, 58 mW (Logic = 12 mW, SRAM = 46 mW)
20% utilization
1.6 GHz: 473K gates, 37 µW/MHz
400 MHz: 168K gates, 18 µW/MHz
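The combined figures can be cross-checked from the per-component numbers on this slide. The sketch below assumes that the 303K gates and 58 mW belong to the ASIPs (as the final results slide states for the gate count), that the 20% utilization applies to the ARC HS at 400 MHz, and that ARC power scales linearly with utilization. The last two points are interpretations, not statements from the slides.

```python
ASIP_LOGIC_MW = 12.0
ASIP_SRAM_MW = 46.0
ASIP_KGATES = 303
ARC_KGATES_400MHZ = 168
ARC_UW_PER_MHZ_400MHZ = 18.0
CLOCK_MHZ = 400
ARC_UTILIZATION = 0.20  # assumption: applies to the ARC HS, linear in power

asip_mw = ASIP_LOGIC_MW + ASIP_SRAM_MW
# uW/MHz * MHz gives uW; divide by 1000 for mW, then scale by utilization.
arc_mw = ARC_UW_PER_MHZ_400MHZ * CLOCK_MHZ * ARC_UTILIZATION / 1000.0
total_mw = asip_mw + arc_mw
total_kgates = ASIP_KGATES + ARC_KGATES_400MHZ

print(f"ASIPs: {asip_mw:.1f} mW, ARC HS: {arc_mw:.2f} mW, total: {total_mw:.2f} mW")
print(f"Total gates: {total_kgates}K")
```

Under these assumptions the total comes to about 59.4 mW and 471K gates, close to the quoted 60 mW and matching the quoted gate count.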
Embedded Vision Platform Directions
& Wish List
Subsystem Controller: HS
OpenCV
MQX O/S
Vision Extn. SIMD (64 bit)
Close coupling
Accelerator C API
Emb. Vision Accelerators: ASIP, ASIP, ASIP
Power-Performance-Area Efficiency: 1X, 10X, 100X
Time-to-market, Flexibility
Parallelism: Sequential Tasks, Task-Level Parallelism, Data-Level Parallelism
High-level processing:
- Body part detection
- Multi-object tracking
- Post-processing
- Command interpretation
Pre-processing:
- Filtering
- Color conversion
- Image scaling
- Feature extraction and matching
- Segmentation
• Embedded vision applications combine complex algorithms
and high data rates with a need for low power
• Need to trade off Flexibility vs. Power-Perf-Area
• Flexibility via High-performance ARC HS core
• Ability to trade off power vs. performance
• Scaling to multi-core, specialization and SIMD usage
• Highest PPA via ASIPs
• Performance gains and power efficiency due to tailored
instruction sets and dedicated memory architecture
• While fully programmable, gains are application specific
Conclusions
Design flow for the Vision Sub System
[Diagram labels: ARC HS; DW AXI interco; DesignWare DMA; DesignWare DDR; ARChitect; ASIP Processor Designer; Core Assembler; ASIP description; ASIP ISA description; Ref Sub System; ASIP; Synthesis + P&R tools; Core Consultant; SubSys settings; ARC settings; coreKit Tool; Core Builder; Core Builder User config; VCS; DVE; MDB; PDBG; Legend; HAPS]
Synopsys’ ASIP Design Tool Flow
Processor Description Language
Optimizing Compiler, Assembler, Linker
Instruction-Set Simulator, Debugger, Profiler
RTL Generator, RTL Sim & FPGA, RTL Synthesis
• Full-featured SDK with graphical debugger
• Compiler supports processor-specific data types and operators
• Advanced optimizations allow C programmers to easily tap into architectural efficiencies
• Fast retargeting to evaluate incremental processor architecture changes quickly
• High-level language to quickly capture ISA
• Tight control of architecture (RTL-level)
• Fast simulation technology
• Easy integration into SystemC virtual platforms
• Multicore and on-chip debugging
• Smooth integration with RTL implementation and verification flows
Architectural Optimization Space
ASIP architectural optimization space
Parallelism
Specialization
Instruction-level parallelism
Data-level parallelism
Task-level parallelism
Orthogonal instruction set (VLIW)
Encoded instruction set
Vector processing (SIMD)
Multi-core
App.-specific data types
App.-specific instructions
Connectivity & storage matching application’s data-flow
App.-spec. data processing
App.-spec. memory addressing
App.-spec. control processing
Distributed regs, sub-ranges
Multiple mem’s, sub-ranges
Jumps, subroutines, interrupts, HW do-loops, residual control, predication…
Direct, indirect, post-modification, indexed, stack indirect…
Any exotic operator
Integer, fractional, floating-point, bits, complex, vector…
Single or multi-cycle
Relative or absolute, address range, delay slots…
Pipeline
Synopsys ASIP tools …
• Support a wide range of ASIP architectures
• Support RTL accelerator tricks for highest PPA efficiency
• Enable ASIP optimization through architectural exploration
Multi-threading