SoC Subsystem Acceleration using Application-Specific Processors (ASIPs) Markus Willems Product Manager Synopsys

Upload: arich

Post on 24-Feb-2016



TRANSCRIPT

Page 1: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Markus Willems, Product Manager, Synopsys

Page 2: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

SoC Design

• What to do when the performance of your main processor is insufficient?
– Go multicore?
• Application mapping difficult, resource utilisation unbalanced
– Add hardwired accelerators?
• Balanced but inflexible

Page 3: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

SoC Design

• What to do when the performance of your main processor is insufficient?

ASIPs: application-specific processors
• Anything between general-purpose µP and hardwired data-path
• Deploys classic hardware tricks (parallelism and customized datapaths) while retaining programmability
– Hardware efficiency with software programmability

Page 4: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Agenda
• ASIPs as accelerators in SoCs
• How to design ASIPs
• Examples
• Conclusions

Page 5: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Architectural Optimization Space

ASIP architectural optimization space
• Parallelism
• Specialization

Page 6: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Architectural Optimization Space

Parallelism
• Instruction-level parallelism (ILP)
– Orthogonal instruction set (VLIW)
– Encoded instruction set
• Data-level parallelism
– Vector processing (SIMD)
• Task-level parallelism
– Multicore
– Multi-threading

Page 7: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Architectural Optimization Space

Specialization
• App.-specific data types
– Integer, fractional, floating-point, bits, complex, vector…
• App.-specific instructions
– App.-spec. data processing: any exotic operator; single or multi-cycle
– App.-spec. memory addressing: direct, indirect, post-modification, indexed, stack indirect…
– App.-spec. control processing: jumps, subroutines, interrupts, HW do-loops, residual control, predication…; relative or absolute, address range, delay slots…
• Connectivity & storage matching application’s data-flow
– Distributed regs, sub-ranges
– Multiple mem’s, sub-ranges
• Pipeline

Page 8: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

IP Designer: ASIP Design and Programming

Page 9: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Agenda
• ASIPs as accelerators in SoCs
• How to design ASIPs
• Examples
• Conclusions

Page 10: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Synopsys - Full Spectrum Processor Technology Provider

Page 11: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

32-bit ARC HS Processors: High-Performance for Embedded Applications

Block diagram labels (the legend marks some blocks as optional): 10-stage pipeline; ARCv2 ISA / DSP; Instruction CCM; Instruction Cache; Data CCM; Data Cache; User Defined Extensions; ARC Floating Point Unit; MAC & SIMD; Multiplier; ALU; Divider; Late ALU; Real-Time Trace; Memory Protection Unit; JTAG.

• Over 3100 DMIPS @ 1.6 GHz*
• 53 mW* of power; 0.12 mm² area in 28-nm process*
• HS Family products
– HS34 CCM, HS36 CCM plus I&D cache
– HS234, HS236 dual-core
– HS434, HS436 quad-core
• Configurable so each instance can be optimized for performance and power
• Custom instructions enable integration of proprietary hardware

*Worst case 28-nm silicon and conditions

Page 12: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Pedestrian Detection and HOG

• Pedestrian detection
– Standard feature in luxury vehicles
– Moving to mid-size and compact vehicles in the next 5-10 years, also due to legislation efforts
• Implementation requirements
– Low cost
– Low power (small form factor, and/or battery powered)
– Programmable (to allow for in-field SW upgrades)
• Most popular algorithm for pedestrian detection is Histogram of Oriented Gradients (HOG)

Page 13: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Histogram Of Oriented Gradients

Processing pipeline:
• Grey scale conversion
• Scale to multiple resolutions
• Gradient computation
• Histogram computation per block
• Normalization of the histograms
• SVM per window position
• Non-max suppression

Scale to Multiple Resolutions
Use a fixed 64x128-pixel detection window. Apply this detection window to scaled frames.

Gradient Computation
Apply Sobel operators in the horizontal and vertical directions.
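The gradient computation step on this slide can be sketched in a few lines of plain Python. This is only an illustration, not the implementation from the talk: the function name `sobel_gradients` is hypothetical, the kernels are the standard 3x3 Sobel pair, and folding the orientation into [0, 180) degrees (unsigned gradients) is an assumption the slides do not state.

```python
import math

# Standard 3x3 Sobel kernels for the horizontal (x) and vertical (y) derivatives.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_gradients(img):
    """Return per-pixel gradient magnitude and orientation (degrees).

    img is a list of rows of grey-scale values. Border pixels are left at
    zero. Orientation is folded into [0, 180), which is an assumption.
    """
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    p = img[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * p
                    gy += SOBEL_Y[dy][dx] * p
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang


# A vertical edge: dark on the left, bright on the right.
image = [[0, 0, 10, 10] for _ in range(4)]
mag, ang = sobel_gradients(image)
```

For the vertical edge in the example, the interior pixels get a purely horizontal gradient, so all of the magnitude lands at orientation 0.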

Page 14: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Histogram Of Oriented Gradients

Histogram Computation
The image is divided into 8x8-pixel cells. For every block of 2x2 cells, apply Gaussian weights and compute 4 histograms of orientation of gradients.

Normalization of the Histograms
(1) L2 normalization (2) clipping (saturation) (3) L2 normalization

Support Vector Machine
Linear classification of histograms for every 64x128 window position.

Non-Max Suppression
Cluster multi-scale dense scan of detection windows and select unique
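The histogram and normalization steps on this slide can be illustrated with a short plain-Python sketch. It is deliberately simplified and not the code from the talk: the number of bins (9) and the clipping threshold (0.2) are assumed values, the helper names are hypothetical, and the Gaussian weighting and any vote interpolation are left out.

```python
import math

NUM_BINS = 9   # assumed: 9 unsigned-orientation bins of 20 degrees each
CLIP = 0.2     # assumed clipping threshold for step (2)


def cell_histogram(mags, angs):
    """Accumulate the gradient magnitudes of one cell into orientation bins.

    mags and angs are flat lists over the cell's pixels, angles in [0, 180).
    Each pixel votes with its full magnitude into a single bin.
    """
    hist = [0.0] * NUM_BINS
    bin_width = 180.0 / NUM_BINS
    for m, a in zip(mags, angs):
        hist[int(a // bin_width) % NUM_BINS] += m
    return hist


def l2_normalize(v, eps=1e-6):
    """Scale v to unit L2 norm; eps avoids division by zero."""
    norm = math.sqrt(sum(x * x for x in v) + eps * eps)
    return [x / norm for x in v]


def normalize_block(v):
    """(1) L2 normalization, (2) clipping, (3) L2 normalization."""
    v = l2_normalize(v)
    v = [min(x, CLIP) for x in v]
    return l2_normalize(v)


hist = cell_histogram([3.0, 4.0, 1.0], [5.0, 95.0, 179.0])
block = normalize_block(hist)
```

In a full pipeline the vector passed to `normalize_block` would be the four cell histograms of a 2x2-cell block concatenated; a single cell is used here to keep the example small.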

Page 15: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

HOG Functional Validation on ARC HS

(640 x 480 pixels)

Block diagram (configuration 1). Labels: AXI local interconnect; DMA, Sync & I/O; HS; DCCM; Subs. ctrl; Dedicated Streaming Interconnect (FIFOs); D (x2); ASIP1, ASIP2, … ASIPn. Stages shown: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression.

• OpenCV float profiling results: 2.6 G cycles per frame
• Fixed point profiling results: 2.4 G cycles per frame

Page 16: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Profiling (640 x 480 pixels, at 30 FPS)

Stage | ARC HS G cycles | % | # ARC HS equivalent
Grey scale conversion | 0.1 | 0.2% | 0.07
Scale to multiple resolutions | 1.6 | 2.3% | 1.0
Gradient computation | 17.3 | 26% | 10.8
Histogram computation per block | 31.9 | 47% | 20.0
Normalization of the histograms | 1.2 | 1.8% | 0.8
SVM per window position | 15.7 | 23% | 9.8
Non-max suppression | 0.004 | 0.01% | 0.002
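The columns of the profiling table are related by simple arithmetic, which the sketch below reproduces. Two points are assumptions rather than statements from the slides: that the cycle counts are per second of video at 30 FPS, and that one "ARC HS equivalent" is one core clocked at 1.6 GHz (the figure quoted on the ARC HS slide). The assignment of values to stages follows the order of the stage list. The results agree with the slide's rounded numbers only approximately, since the table inputs are themselves rounded.

```python
HS_CLOCK_GHZ = 1.6  # assumed clock of one ARC HS core

# G cycles as listed in the profiling table (640 x 480 pixels, 30 FPS).
gcycles = {
    "Grey scale conversion": 0.1,
    "Scale to multiple resolutions": 1.6,
    "Gradient computation": 17.3,
    "Histogram computation per block": 31.9,
    "Normalization of the histograms": 1.2,
    "SVM per window position": 15.7,
    "Non-max suppression": 0.004,
}

total = sum(gcycles.values())

# Share of the total workload per stage, in percent.
share = {stage: 100.0 * g / total for stage, g in gcycles.items()}

# Number of 1.6 GHz cores needed to sustain each stage in real time.
hs_equivalent = {stage: g / HS_CLOCK_GHZ for stage, g in gcycles.items()}

for stage in gcycles:
    print(f"{stage:32s} {share[stage]:6.2f}% {hs_equivalent[stage]:7.3f}")
print(f"total: {total:.1f} G cycles, {total / HS_CLOCK_GHZ:.1f} core equivalents")
```

Summing the per-stage equivalents gives a little over 40 cores, in line with the "~40" ARC HS figure in the comparison table.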

Page 17: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Task Assignment #2

Block diagram (configuration 2). Labels: AXI local interconnect; DMA, Sync & I/O; HS; DCCM; Subs. ctrl; Dedicated Streaming Interconnect (FIFOs); D (x3); ASIP1, ASIP2, ASIP4; L3 Ext. DRAM. Stages shown: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression.

Page 18: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

ASIP Example: HISTOGRAM

• Vector-slot next to existing scalar instructions (VLIW)
• 16x(8/16)-bit vector register files
• 16x8-bit SRAM interface
• 16x8-bit FIFO interfaces
• Vector arithmetic instructions
• Special registers and instructions to compute histograms

4x size increase & 200x speedup (relative to RISC template)

Implemented in less than 1 week

Page 19: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Task Assignment #3

Block diagram (configuration 3). Labels: AXI local interconnect; DMA, Sync & I/O; HS; DCCM; Subs. ctrl; Dedicated Streaming Interconnect (FIFOs); D (x4); ASIP1, ASIP2, ASIP3, ASIP4; L3 Ext. DRAM. Stages shown: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression.

Page 20: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Task Assignment #4

Block diagram (configuration 4). Labels: AXI local interconnect; DMA, Sync & I/O; HS; DCCM; Subs. ctrl; Dedicated Streaming Interconnect (FIFOs); D (x4); ASIP1’, ASIP2, ASIP3, ASIP4; L3 Ext. DRAM. Stages shown: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression.

Page 21: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Task Assignment #4

Block diagram (configuration 4’). Labels: AXI local interconnect; DMA, Sync & I/O; HS; DCCM; Dedicated Streaming Interconnect (FIFOs); D (x4); ASIP1’, ASIP2, ASIP3, ASIP4; L2 SRAM; L3 Ext. DRAM. Stages shown: Grey scale conversion, Rescaling, Gradient, Histogram, Normalization, SVM, Non-max suppression.

Page 22: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Comparison

Config. | Platform configuration | #HS (MHz) | #ASIP (MHz) | ARC functions | ASIP functions
1 | HS | ~40 | 0 | All | None
2 | HS + ASIPs | 2 (1600) | 2.5 (500) | Greyscale, Rescaling, Normalization, Non-max suppr., Display | Gradient, Histogram, SVM
3 | HS + ASIPs | 1 (1600) | 3.5 (500) | Greyscale, Rescaling, Non-max suppr., Display | Gradient, Histogram, Normalization, SVM
4 | HS + ASIPs | 1 (500) | 4 (500) | Greyscale, Non-max suppr., Display | Rescaling, Gradient, Histogram, Normalization, SVM

Page 23: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Final Results

• 1 ARC HS, 4 ASIPs, AXI interconnect, private SRAM, L2 SRAM
• 30 frames/second at 500 MHz
• Functionally identical to OpenCV reference
• TSMC 28nm
• ASIP gate count: 330k gates
• ASIP power consumption: ~130 mW
• Power/performance/area via ASIPs
– Scaling due to multi-core, specialization and SIMD usage
– Performance gains and power efficiency due to tailored instruction sets and dedicated memory architecture

Page 24: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Scenario: Need for Flexible FEC Core

• Existing and emerging standards use advanced FEC schemes like turbo coding, LDPC and Viterbi
• Instead of duplication of FEC cores, need for re-configurable architecture at minimum power and area

Diagram: separate FEC cores (DVB-X? LDPC-A; UMTS Turbo-B; .11n LDPC-C; .16e LDPC-D; 3GPP-LTE turbo-A; .11n Vit) versus a single FlexFEC core (turbo/LDPC/Vit).

Page 25: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Architecture Refinement to Increase Throughput: Increased ILP from 2 to 6

• ILP: 2 FU (scalar + vector unit)
• ILP: 6 FU (1 scalar + 5 vector units)
– No duplication for arithmetic functionality
– For exploiting ILP to increase throughput
– 2 FUs for local memory access

Page 26: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Fast Area/Performance Trade-off (40nm logical synthesis, processor only)

Plot: cycle count (y axis, 0 to 100) versus total number of processor functional units (x axis, 2 to 6). Series: ldpc - layer 6, ldpc - layer 8, turbo - beta, turbo - output. Area annotations: 0.177 sqmm, 0.189 sqmm.

Page 27: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Architectural Exploration
FU Utilization: 2 to 5

Two bar charts of FU utilization (y axis, 0.0 to 100.0) per kernel (layer6, layer7, layer8, alpha, beta, output).
• First chart, series: scalar, vector
• Second chart, series: scalar, vector alu, vector spec, vector vmem, vector bg vmem

Vector slot separated in different FUs without overlapping functionality

Local memory access congestion

Page 28: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Architectural Exploration
More Balanced FU Utilization: 5 to 6

Bar chart of FU utilization (y axis, 0.0 to 90.0) per kernel (ldpc - layer6, ldpc - layer7, ldpc - layer8, turbo - alpha, turbo - beta, turbo - output). Series: scalar, vector alu, vector spec, vector vmem, vector vmem2, vector bg vmem.
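Why a second local-memory unit helps can be seen from a simple resource bound: with typed issue slots, a loop body cannot run in fewer cycles than its busiest functional-unit class requires. The sketch below computes that lower bound. The operation counts are hypothetical (they are not taken from the slides), the function name `resource_bound` is made up for illustration, and data dependencies are ignored, so this is a rough model and not the exploration flow used in the talk.

```python
import math


def resource_bound(op_counts, units):
    """Lower bound on cycles per loop iteration from issue slots alone.

    op_counts maps an FU class to the number of operations of that class
    in the loop body; units maps an FU class to the number of such FUs.
    """
    return max(math.ceil(n / units[kind]) for kind, n in op_counts.items())


# Hypothetical operation mix for one loop body, dominated by memory access.
ops = {"scalar": 4, "alu": 6, "spec": 5, "vmem": 12, "bg_vmem": 3}

# 5-FU machine: one unit per class.
five_fu = {"scalar": 1, "alu": 1, "spec": 1, "vmem": 1, "bg_vmem": 1}

# 6-FU machine: a second local-memory unit (vmem2) is added.
six_fu = dict(five_fu, vmem=2)

bound5 = resource_bound(ops, five_fu)
bound6 = resource_bound(ops, six_fu)
print(bound5, bound6)
```

With this made-up mix the memory class is the bottleneck on the 5-FU machine, and doubling it moves the bound down to the level of the arithmetic units, which is the kind of rebalancing the utilization charts show.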

Page 30: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Latest IP Available from IMEC

Blox-LDPC ASIP

adInstances available

Page 31: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Agenda
• ASIPs as accelerators in SoCs
• How to design ASIPs
• Examples
• Conclusions

Page 32: SoC Subsystem Acceleration using Application-Specific Processors (ASIPs)

Conclusion

• ASIPs enable programmable accelerators
• IP Designer enables efficient design and programming of ASIPs
• “Programmable datapath” ASIPs offer performance, area and power comparable to hardwired accelerators
• ASIPs enable balanced multicore SoC architectures