computing architecture for signal processorlacas/teaching/archi/archi_m2r_orsay_part1.pdf ·...

72
Embedded Computing Systems for Signal Processing Applications Part 1: Introduction November 7 th 2014 Eric Debes

Upload: phamthuan

Post on 26-Apr-2018

217 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

Embedded Computing Systems

for Signal Processing Applications

Part 1: Introduction

November 7th 2014

Eric Debes

Page 2: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

2

What is this about?

Introduction to power/performance tradeoffs and system architecture

Overview of existing processor and system architectures

Consumer vs. Industrial/Embedded

Why do we care?

Engineering added value is in complex and critical system architecture

Need to know different components available

Software/Hardware System Architecture and Modelling

Power/Performance/Price Tradeoffs

What’s the plan?

Introduction

Page 3: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

3

1. Introduction

2. General-Purpose Processors and Parallelism

3. Application Specific Processors: DSPs, FPGAs, accelerators, SoCs

4. PC Architecture vs. Embedded System Architecture

5. Hard Real-time Systems and RTOS

6. Power Constraints

7. Critical and Complex Systems, MDE, MDA

Outline

Page 4: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

4

Embedded

Size and thermal constraints

Sometime battery life (energy) constraints

Real-time

Time constraints

Can be hard real-time

Or soft-real time

Systems

Typically includes multiple components

Requires different expertises:

Signal Processing, computer vision, machine learning/Cognition and other algorithmic expertise

Software Architecture

Hardware/Computing Architecture

Thermal and mechanical engineering

Embedded Real-time Systems

Page 5: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

5

Consumer : DVD/video players, Set-top-box, Playstation, printers, disk drives, GPS, cameras, mp3 players

Communications: Cellphone, Mobile Internet Devices, Netbooks, PDAs with WiFi, GSM/3G, WiMax, GPS, cameras, music/video

Automotive: Driving innovation for many embedded applications, e.g. Sensors, buses, info-tainment

Industrial Applications: Process control, Instrumentation

Other niche markets: video surveillance, satellites, airplanes, sonars, radars, military applications

Application Examples

Page 6: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

6

Texec = NI * CPI * Tc

NI = Number of Instructions

CPI = Clock per Instruction

Tc = Cycle Time

Texec = NI / (IPC * F)

IPC = Instructions Per Cycle

F = Frequency

Performance improves with

Silicon manufacturing technology

Moore’s law contributing to higher frequency and parallelism

Microarchitecture improvements

Higher frequencies with deeper pipelines

Higher IPC through parallelism

Performance

Page 7: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

7

Performance

PentiumII(R)

Pentium Pro

Pentium(R)

486386

601, 603

604 604+MPC75021066

21064A21164

21164A21264

21264S

10

100

1,000

10,000

1987 1989 1991 1993 1995 1997 1999 2001 2003 2005

Mhz

1

10

100

ClockPeriod/Avggate

delay

Processor freq

scales by 2X per

generation

Page 8: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

8

Dynamic Power = αCV²

α = activity

C = capacitance

V = voltage

= frequency

Power = Pdyn + Pstatic

Power is limited by

maximum current (Voltage regulator limitation)

Thermal constraints

Power ≠ Energy

Power

Page 9: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

9

Power

100

1 386

486

Pentium Pentium MMX

PentiumPro

Pentium II

10

1.5 1.0 0.8 0.6 0.35 0.25 0.18

Process (microns)

Maxim

um

Pow

er (

W)

1

10

100

1000

Watt

s2/c

m

i386 i486

Pentium processor

Pentium Pro processor

Pentium II processor

Pentium III processor

Hot plate

Nuclear Reactor Rocket

Nozzle

Sun’s

Surface

Power density

Page 10: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

10

ASIC

High-performance

Dedicated to one specific application

Not programmable

Processor

Programmable

General-purpose

Reconfigurable Architecture

Good compromise between programmability and performance

Processor Architecture Spectrum

Microprocessor Reconfigurable ASIC

Page 11: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

11

Processor Architecture Spectrum

Page 12: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

12

Processor Architecture Spectrum

Page 13: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

13

What are the key components in a Computing System?

Processor with

Arithmetic and Logic Units

Register File

Caches or local memory

Memory

Buses/Interconnect

I/O Devices

Key Components of a Computing System

Page 14: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

Part 2: General-purpose Processors and Parallelism

November 7th 2014

Eric Debes

Embedded Computing Systems

for Signal Processing Applications

Page 15: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

15

Laundry Example

Ann, Brian, Cathy, Dave

each have one load of clothes

to wash, dry, and fold

Washer takes 30 minutes

Dryer takes 40 minutes

“Folder” takes 20 minutes

Pipelining: Its Natural!

A B C D

Page 16: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

16

Sequential laundry takes 6 hours for 4 loads

If they learned pipelining, how long would laundry take?

Sequential Laundry

A

B

C

D

30 40 20 30 40 20 30 40 20 30 40 20

6 PM 7 8 9 10 11 Midnight

T

a

s

k

O

r

d

e

r

Time

Page 17: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

17

• Pipelining doesn’t help

latency of single task, it

helps throughput of

entire workload

• Pipeline rate limited by

slowest pipeline stage

• Multiple tasks operating

simultaneously

• Potential speedup =

Number pipe stages

• Unbalanced lengths of

pipe stages reduces

speedup

• Time to “fill” pipeline and

time to “drain” it reduces

speedup

Pipelining Lessons

A

B

C

D

6 PM 7 8 9

T

a

s

k

O

r

d

e

r

Time

30 40 40 40 40 20

Page 18: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

18

Moore’s Law more transistors

for advanced architectures

Delivers higher peak perf

But lower power efficiency

Performance = Frequency x

Instruction per Clock Cycle

Power = Switching Activity x

Dynamic Capacitance x

Voltage x Voltage x

Frequency

History: How did we increase Perf in the Past?

0

1

2

3

4

5

Pipelined S-Scalar OOO-

Spec

Deep Pipe

Incre

ase (

X)

Area X

Perf X

-1

0

1

2

3

Pipelined S-Scalar OOO-

Spec

Deep

Pipe

Incre

ase (

X)

Power X

Mips/W (%)

Page 19: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

19

In many systems today power is the limiting factor and will

drive most of the architecture decisions

New Goal: optimize performance in a given power envelope

Why Multi-Cores?

Page 20: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

20

Dual Core

Voltage Frequency Power Performance

1% 1% 3% 0.66%

Rule of thumb (in the same process technology)

Core

Cache

Core

Cache

Core

Voltage = 1

Freq = 1

Area = 1

Power = 1

Perf = 1

Voltage = -15%

Freq = -15%

Area = 2

Power = 1

Perf = ~1.8

How to maximize performance in the same power envelope?

Power = Dynamic Capacitance x Voltage x Voltage x Frequency

Page 21: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

21

Multicore

Small

Core 1 1

Large Core

Cache

1

2

3

4

1

2

Power

Performance

Power = 1/4

Performance = 1/2

C1 C2

C3 C4

Cache

1

2

3

4

1

2

3

4 Multi-Core:

Power efficient

Better power and

thermal management

Page 22: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

22

Era of Parallelism

Page 23: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

23

Thermal is the main limitation factor in future design (not size)

Move away from Frequency alone to deliver performance

Challenges in scaling need to exploit thread level

parallelism to efficiently use the transistors available thanks

to Moore’s law.

Power/performance tradeoffs dictate architectural choices

Multi-everywhere

Multi-threading

Chip level multi-processing

Throughput oriented designs

Summary: Why Multi-Cores?

Page 24: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

24

Processors are designed to address the need of the mass market.

• Mobile applications low power and good power management

are top priorities to enable thinner systems and longer battery life

• Office, image, video single threaded perf matters, some level

of multithreaded perf Multi-core

• RMS (Recognition, Mining, Synthesis) Applications and

Model based Computing massively parallel apps, good scaling

on a large number of cores Many-core

Because of the large markets in each of the classes above, they

are the focus of silicon manufacturers and are driving innovation in

the semiconductor market

Application-driven Architecture Design

Page 25: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

25

RMS Scaling on a Many-Core Simulator

0

16

32

48

64

0 16 32 48 64

# of cores

Sp

eed

-up

Gauss-Seidel

Sparse Matrix (Avg.)

Dense Matrix (Avg.)

Kmeans

SVD

SVM_classification

Cloth-AVDF

Cloth-AVIF

Cloth-US

Face-AFD

SepiaTone (Core Image)

0

16

32

48

64

0 16 32 48 64

# of cores

Sp

eed

-up

Text indexing

CFD

Ray Tracing

FB_Estimation

Body Tracker

Portifolio management

Play physics

Data from Intel Application Research Lab

Page 26: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

26

• Low-power

architecture and SoCs

• ARM based

• LPIA/Atom based

• Multi-core

• Core microarchitecture

• PowerPC

• Many-core

• GP GPU

• Larrabee

3 Classes of Applications 3 Types of Processors

Page 27: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

27

Examples of ARM-based low power architectures and SoCs:

TI OMAP, Nvidia Tegra, Apple A4/A5, Samsung Exynos

Low-power Architecture and SoCs

Page 28: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

28

Towards PC on a chip

Same Intel Core (e.g. Bay Trail) for Tablets/Smartphones,

Consumer Electronic Devices and Embedded Market

Page 29: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

29

• Multi-core

• IBM Power4

• IBM Cell

• Intel Ivy Bridge

Multicore

Page 30: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

30

• Tick-Tock model

• Modular design to

decrease cost

(design, test,

validation)

• Integrate graphics

on chip

Intel Roadmap for Intel Core Microarchitecture

Page 31: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

31

• Binning for leakage distribution and performance

P = α.C.v2. + leakage

• Turbo mode to optimize performance under a given

power envelope

• Policy to balance thermal budget between general

purpose cores, and between GPP cores and graphics

• Next: Maximize performance under a given thermal

envelope at the platform level

Power/Performance Tradeoffs

Page 32: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

32

GP GPU: NVidia GeForce more than 2000 PEs

Page 33: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

33

• No need to put a lot of cache for GPUs because the

number of threads are hiding the latency. The chip is

designed for DRAM latency through a huge number of

threads. Local memory are still present to limit bandwidth

to GDDR

• CPU need multi-level large caches because the data need

to be close to the execution units

• Fast growing video game industry exerts strong

economic pressure that forces constant innovation

CPUs vs. GPUs

Page 34: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

34

For a given application, processor architectures should be

chosen depending on the performance/power efficiency

• MIPS/Watt or Gflops/Watt

• Energy efficiency (Energy Delay Product)

This is highly dependent on the application and targeted

power envelope. Examples:

• ARM and Atom are optimized for mainstream office and media apps for

a power envelope between 1W and <10W

• Core microarchitecture is optimized for high-end office and media apps

for a power envelope between 15W and ~75W

• GPUs are optimized for graphics applications and some selected

scientific applications between 10W and more than 400W

Performance/Power for different architectures

Page 35: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

35

Processor will integrate

- Big core for single thread perf

- Small core for multithreaded perf

- some dedicated hardware units for

- graphics

- media

- encryption

- networking function

- other function specific logic

Systems will be heterogeneous

Processor core will be connected to

- one or multiple many-core cards

- and dedicated function hw in the chipset

+ reconfigurable logic in the system or on chip?

Future: PC on a Chip

IA IA IA IA

IA IA IA IA

IA IA IA IA

IA IA IA IA

PCI-Ex PCI-Ex

Gfx/Media

Memory Ch

High-End Add-in

IA IA IA IA

IA IA IA IA

IA IA IA IA

IA IA IA IA

PCI-Ex PCI-Ex

Gfx/Media

Memory Ch

IA

(Big core)

IA

(Big core)

GCH

Page 36: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

Part 3: App Specific Proc: DSPs, FPGAs, Accelerators, SoCs

November 7th 2014

Eric Debes

Embedded Computing Systems

for Signal Processing Applications

Page 37: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

37

What are application specific processors?

Processors or System-on-chip targeting a specific (class of) application(s)

Very common for

Audio: MP3, AAC coding and decoding in audio players

Image: JPEG or JPEG2000 coding and decoding, e.g. Digital cameras

Video: MPEG, H264 coding and decoding, e.g. DVD players or set-top-boxes

Encryption: RSA, AES

Communication: GSM, 3G in cellphones

Why?

Large markets can justify the development of application specific processors

Dedicated circuits provide higher performance with lower power dissipation, better battery life and very often lower cost.

Application Specific Processors

Page 38: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

38

Application Specific Signal Processor Spectrum

Page 39: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

39

DSPs

Dedicated ASICs

FPGAs

Accelerators as coprocessors

ISA extensions

SoCs

Different Types of ASPs

Page 40: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

40

Summary of Architectural Features of DSPs

Data path configured for DSP

Fixed-point arithmetic

MAC- Multiply-accumulate

Multiple memory banks and buses -

Harvard Architecture: separate data and instruction memory

Multiple data memories

Specialized addressing modes

Bit-reversed addressing

Circular buffers

Specialized instruction set and execution control

Zero-overhead loops

Support for MAC

Specialized peripherals for DSP

Page 41: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

41

DSP Example: 320C62x/67x DSP

Page 42: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

42

Many dedicated ASICs exist on the market, especially for media and communication applications. Example:

MP3 player

DVD player

Video processing engines, e.g. De-interlacing, super-resolution

Video Encoder/Decoder

GSM/3G

TCP/IP Offload engine

Advantages:

Low power, high perf/power efficiency

Small area compared to same functionality in DSP or GPP

Drawbacks

Cost of designing ASICs requires large volume

Not flexible: cannot handle different applications, cannot evolve to follow standard evolution

Dedicated ASICs

Page 43: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

43

Reconfigurable architectures FPGAs contain gates that can be programmed for a specific application

• Each logic element outputs one data bit

• Interconnect programmable between elements

FPGAs can be reconfigured to target a different function by loading another configuration

Page 44: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

44

Spécifications

Input: RTL coding structural or behavioral description

RTL Simulation

Functional simulation check logic and data flow (no temporal

analysis)

Synthesis

Translate into specific hardware primitives

Optimisation to meet area and performance constraints

Place and Route

Map hw primitives to specific places on the chip based on area

and performance for the given technology

Specify routing

Temporal Analysis

Verification that temporal specification are met

Test and Verification of the component on the FPGA board

FPGAs Design Flow

Page 45: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

45

Vivado HLS Design flow: from C to VHDL

Page 46: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

46

Vivado HLS Development Flow

C/C++ programming

C Simulation

Algorithm validation

Optimization directives

insertion

Synthesis

Cosimulation

RTL design validation

IP generation

No

Yes

Pipeline

Unroll

Merge

Loops

- Array Partitionning

- Interfaces

Dataflow

Results OK ?

(Perf / Resources)

Page 47: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

47

Vivado HLS User Interface

Project explorer Directives insertion Code editor

Synthesis log

Page 48: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

48

Synthesis report

Vivado HLS Tooling

Instructions analysis view

Clock cycle accurate representation

Verification of actual parallelisation of

instructions (e.g. pipelining)

Localization of data dependencies

Latencies

Loops pipelining

(latencies/

Throughput)

Resources

Page 49: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

49

Xilinx Design methodology

• Design methodology options

A combination is possible!

Page 50: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

50

Current generations of FPGAs

add a GPP on the chip

Hardwired PowerPC (Xilinx)

NIOS Softcore (Altera)

MicroBlaze Softcore (Xilinx)

SoC with ARM on Xilinx Zynq

FPGAs with On-chip GPP

Page 51: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

51

DSP blocks in reconfigurable architectures

Stratix DSP blocks consist of hardware

multipliers, adders, subtractors,

accumulators, and pipeline registers

Some FPGAs add DSP blocks to increase performance of DSP algorithms

Example: Stratix DSP blocks

Page 52: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

52

Reconf matrix of DSP blocks as media coproc.

Execution

Unit

Data Cache

Instruction

Unit

Memory

Instruction

Cache

General purpose processor

Control (PLA)

Memory group #1

Memory group #2

Co processor

Matrix of

Processing

Elements

32b mult

32b add/sub

Shift reg

Row of

Processing

Elements

mem

op1

Reconfigurable MatriX (8x3 PEs)

mem

op2

Embedded memories

read write address

read data

write data

Control (ROM) chipselect

32b mult

32b add/sub

Shift reg

mem

op4

32b mult

32b add/sub

Shift reg

mem res

mem

op6

It is possible to build complex system based on recent FPGA architectures

Taking advantage of the regular structure of the DSP blocks in the FPGA matrix

Page 53: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

53

Dedicated circuits to accelerate a specific part of the processor

Typically will be connected to a general-purpose processor or a DSP

Granularity can vary

accelerator for a DCT function

Accelerator for a whole JPEG encoder

Accelerators are very common in system on chip

Are typically called through an API function call from the main CPU

Accelerators as Coprocessors

Page 54: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

54

Extending the ISA of a general purpose processor with SIMD instructions and specific instructions targeting media and communication applications is very common

It adds application specific features to a processor and turns a general purpose processor into a signal/image/video processor.

Example:

Intel MMX, SSE

PowerPC AltiVec

SUN VIS

Xscale WMMX

ARM Neon, Thumb-2, Trustzone, Jazelle, etc.

ISA extension in General-Purpose Processors

Page 55: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

55

Conflicting requirements

ASICs Media Proc/DSPs GPPs

Better Power efficiency, runs at lower frequency

Flexibility, re-programmability (vs. redesign cost)

Better programming tools, shorter TTM for new app

Smaller chip size, lower leakage

Page 56: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

56

The Energy-Flexibility Gap

Embedded Processors

Media Processors

DSPs

Dedicated

HW

Flexibility (Coverage)

En

ergy

Eff

icie

ncy

MO

PS

/mW

(or

MIP

S/m

W)

0.1

1

10

100

1000

Reconfigurable

Processor/Logic

Page 57: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

57

SoCs integrate the optimal mix of processors and dedicated hardware units for the different applications targeted by the system.

Typically integrate a general purpose processor, e.g. ARM

Can integrate a DSP

Accelerators for specific functions

Dedicated memories

Integration boosts performance, cuts cost, reduces power consumption compared to a similar mix of processors on a card

System-on-Chip

Page 58: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

58

Digital Camera hardware diagram

Mechanical Shutter

A/DCMOS Imager

Image

Processing

ASIC

256Kx16

DRAM

256Kx16

DRAM

MCU Memory

Card I/F

LCD

Control

ASICLCD

32 Kx8

SRAM

68

-pin

co

nn

.

ASIC

PCMCIA

Serial

EEPROM

Power

Control

3.3V CR-123

Lithium Cell

Expose

User Interface Keys

Activity LED

Door

Interlock

Memory Card

ASIC Integration Opportunity

Page 59: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

59

MPSoC: A Platform Story

What’s a platform?

“A coordinated family of architectures that satisfy a set of

architectural constraints imposed to support reuse of

hardware and software components”

Best of all worlds:

Provides some level of flexibility

While being power efficient

And enabling some level of reusability

Can last multiple product generations

Requires forward-looking platform based design to integrate potential

future application requirements in today’s platform

Programming model and design efficiency are key!

Page 60: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

60

Nvidia Tegra

Page 61: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

61

TI OMAP

Page 62: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

62

Freescale iMX6

Page 63: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

63

Intel Silvermont/Bay Trail

Page 64: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

64

Tegra K1

Page 65: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

65

Tegra K1

Page 66: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

66

Tegra K1

Page 67: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

67

Embedded Processor Architecture Trends

• Where do we come from?

DSP, FPGA, CPU, GPUs

Page 68: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

68

Embedded Processor Architecture Trends

• Where do we come from?

DSP, FPGA, CPU,

GPUs

• What is available today?

Mix of

multicore/manycore

and hardware

accelerator on the

same chip (e.g. Tegra

K1)

Or mix of multicore and

FPGA on the same

chip (Xynq)

SoC –i.MX6

I / O

Multicore CPU

GPU/Manycore

Multicore+Manycore

Image / V

ideo

A

ccelerator

SoC –i.MX6

I / O

Multicore CPU

FPGA

Multicore + FPGA

Image / V

ideo

A

ccelerator

Page 69: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

69

Embedded Processor Architecture Trends

• Where do we come from?

DSP, FPGA, CPU, GPUs

• What is available today?

Mix of multicore/manycore and

hardware accelerator on the

same chip (e.g. Tegra K1)

Or mix of multicore and FPGA

on the same chip (Xynq)

• Where are we going?

Mix of multicore,

manycore, FPGA and

Hardware accelerators

on the same chip

Designed for real-time

sensor processing I / O

Multicore CPU

GPU/Manycore

Image / Video Accelerator

FPGA

Higher Integration/Lower Power

Page 70: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

70

Very Fast Moving Industry

• Performance and power evolve at a very fast pace!

CPUs and GPUs driven by the PC market

System-on-Chip driven by the cellphone/tablet market

~50x perf/Watt improvement over the last 5 years

GTX 285 (2009) TegraK1 (2014)

Type PCIe Card System-on-chip

CPU Intel (PC board) ARM (integrated)

Interconnect PCIe On-Chip

# Cores 240 192

Power 200W 2W

Total Power 600W 5W

Performance 1000 Gflops 365 Gflops

Perf/Watt 5 Gflops/W 180 Gflops/W

50x perf/Watt improvement in 5 years fanless design possible today!

Page 71: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

71

Efficiently Use Heterogeneous SoCs

• Use the right core for the

right task:

GPU massively parallel

CPU control oriented tasks

FPGA compute intensive tasks

with hard-real time constraints

• Competitive programming

approaches:

Parallel programming

languages and tools for CPU

and GPU

High Level Synthesis (from C

to VHDL) for FPGAs

Development, profiling,

debugging tools are evolving

as fast as the hardware

2 6

3,0 4

15

3,8 2

20 10

Power(Watt)

Performance(Fps)

Perf/Watt

Single CPU

4 x CPU

4 x CPU + GPU

Video pipeline exampe

Improved power efficiency, time to market and portability within product line

Page 72: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

72

• New architecture is driven by power and thermal

• Transistor count continues to increase thanks to Moore’s law

• Most systems are limited by thermals

• Parallelism is needed for perf and power efficiency

• Instruction level parallelism: Pipeline, OOO, VLIW

• Data-level parallelism: SIMD, Vector, 2D SIMD Matrices

• Thread level parallelism: SMP, CMP, SMT/HT

• System level parallelism: I/Os, Memory Hierarchy

• Key Issues with Parallelism

• Amdahl’s law

• Extracting parallelism from applications

• Systems Issues the rest of the system needs to be well balanced

• Programming models need to be portable, easy to learn and efficient

• Application Specific Signal Processors and SoCs

• Spectrum: ASICs, FPGA, Media Proc, DSP, GPP + ISA extensions

• Depending on power/performance constraints, often a mix (SoC)

Summary