Download - Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems

Embedded Computing Systems

for Signal Processing Applications

Part 1: Introduction

November 7th 2014

Eric Debes

2

What is this about?

Introduction to power/performance tradeoffs and system architecture

Overview of existing processor and system architectures

Consumer vs. Industrial/Embedded

Why do we care?

Engineering added value is in complex and critical system architecture

Need to know different components available

Software/Hardware System Architecture and Modelling

Power/Performance/Price Tradeoffs

What’s the plan?

Introduction

3

1. Introduction

2. General-Purpose Processors and Parallelism

3. Application Specific Processors: DSPs, FPGAs, accelerators, SoCs

4. PC Architecture vs. Embedded System Architecture

5. Hard Real-time Systems and RTOS

6. Power Constraints

7. Critical and Complex Systems, MDE, MDA

Outline

4

Embedded

Size and thermal constraints

Sometimes battery life (energy) constraints

Real-time

Time constraints

Can be hard real-time

Or soft real-time

Systems

Typically includes multiple components

Requires different areas of expertise:

Signal Processing, computer vision, machine learning/Cognition and other algorithmic expertise

Software Architecture

Hardware/Computing Architecture

Thermal and mechanical engineering

Embedded Real-time Systems

5

Consumer : DVD/video players, Set-top-box, Playstation, printers, disk drives, GPS, cameras, mp3 players

Communications: Cellphone, Mobile Internet Devices, Netbooks, PDAs with WiFi, GSM/3G, WiMax, GPS, cameras, music/video

Automotive: Driving innovation for many embedded applications, e.g. Sensors, buses, info-tainment

Industrial Applications: Process control, Instrumentation

Other niche markets: video surveillance, satellites, airplanes, sonars, radars, military applications

Application Examples

6

Texec = NI * CPI * Tc

NI = Number of Instructions

CPI = Clock Cycles per Instruction

Tc = Cycle Time

Texec = NI / (IPC * F)

IPC = Instructions Per Cycle

F = Frequency

Performance improves with

Silicon manufacturing technology

Moore’s law contributing to higher frequency and parallelism

Microarchitecture improvements

Higher frequencies with deeper pipelines

Higher IPC through parallelism

Performance
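As a quick numerical check of the two execution-time formulas above, the following minimal Python sketch computes Texec both ways. The workload figures (one billion instructions, a CPI of 2, a 1 GHz clock) and the function names are illustrative assumptions, not values from the course.

```python
def exec_time_from_cpi(ni, cpi, tc):
    """Texec = NI * CPI * Tc, with Tc the cycle time in seconds."""
    return ni * cpi * tc


def exec_time_from_ipc(ni, ipc, f):
    """Texec = NI / (IPC * F), with F the clock frequency in Hz."""
    return ni / (ipc * f)


# Hypothetical workload (assumed values, for illustration only).
NI = 1_000_000_000   # number of instructions
CPI = 2.0            # clock cycles per instruction
F = 1.0e9            # clock frequency in Hz

t_cpi = exec_time_from_cpi(NI, CPI, 1.0 / F)   # Tc = 1 / F
t_ipc = exec_time_from_ipc(NI, 1.0 / CPI, F)   # IPC = 1 / CPI

print(f"Texec via CPI: {t_cpi:.3f} s, via IPC: {t_ipc:.3f} s")
```

Both forms give about 2 seconds, since IPC = 1/CPI and F = 1/Tc. Halving CPI (higher IPC) or doubling F would each halve the result.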

7

Performance

[Figure: processor clock frequency in MHz (log scale, 10 to 10,000) versus year, 1987 to 2005, for the 386, 486, Pentium, Pentium Pro and Pentium II; the PowerPC 601, 603, 604, 604+ and MPC750; and the Alpha 21066, 21064A, 21164, 21164A, 21264 and 21264S. A second axis (1 to 100) shows clock period / average gate delay.]

Processor frequency scales by 2X per generation

8

Dynamic Power = αCV²f

α = activity

C = capacitance

V = voltage

f = frequency

Power = Pdyn + Pstatic

Power is limited by

maximum current (Voltage regulator limitation)

Thermal constraints

Power ≠ Energy

Power
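To illustrate why power and energy are different quantities, here is a minimal Python sketch built on the dynamic power relation (activity x capacitance x voltage squared x frequency). It ignores static power, and every numeric parameter is an assumed, illustrative value rather than data from the slides.

```python
def dynamic_power(alpha, c, v, f):
    """Dynamic power in watts: activity * capacitance * voltage^2 * frequency."""
    return alpha * c * v * v * f


def task_energy(alpha, c, v, f, cycles):
    """Energy in joules to run a task of `cycles` clock cycles (dynamic power only)."""
    run_time = cycles / f
    return dynamic_power(alpha, c, v, f) * run_time


# Assumed, illustrative parameters (not from the slides).
ALPHA, C, V = 0.2, 1.0e-9, 1.0
CYCLES = 2_000_000_000

p_fast = dynamic_power(ALPHA, C, V, 1.0e9)
p_slow = dynamic_power(ALPHA, C, V, 0.5e9)
e_fast = task_energy(ALPHA, C, V, 1.0e9, CYCLES)
e_slow = task_energy(ALPHA, C, V, 0.5e9, CYCLES)

print(p_fast, p_slow)   # power halves when the frequency is halved
print(e_fast, e_slow)   # energy for the task is unchanged at fixed voltage
```

In this simplified model, halving the frequency halves the power but leaves the energy per task unchanged, because the task runs twice as long. Energy only drops if the voltage can be lowered as well.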

9

Power

[Figure: maximum power (W, log scale from 1 to 100) versus process technology (1.5, 1.0, 0.8, 0.6, 0.35, 0.25, 0.18 microns) for the 386, 486, Pentium, Pentium MMX, Pentium Pro and Pentium II.]

[Figure: power density (Watts/cm², log scale from 1 to 1000) for the i386, i486, Pentium, Pentium Pro, Pentium II and Pentium III processors, compared with a hot plate, a nuclear reactor, a rocket nozzle and the Sun's surface.]

Power density

10

ASIC

High-performance

Dedicated to one specific application

Not programmable

Processor

Programmable

General-purpose

Reconfigurable Architecture

Good compromise between programmability and performance

Processor Architecture Spectrum

Microprocessor Reconfigurable ASIC

13

What are the key components in a Computing System?

Processor with

Arithmetic and Logic Units

Register File

Caches or local memory

Memory

Buses/Interconnect

I/O Devices

Key Components of a Computing System

Part 2: General-purpose Processors and Parallelism




15

Laundry Example

Ann, Brian, Cathy, Dave

each have one load of clothes

to wash, dry, and fold

Washer takes 30 minutes

Dryer takes 40 minutes

“Folder” takes 20 minutes

Pipelining: It's Natural!

16

Sequential laundry takes 6 hours for 4 loads

If they learned pipelining, how long would laundry take?

Sequential Laundry

[Figure: sequential laundry timeline from 6 PM to midnight. Loads A, B, C and D (task order) run one after another, each taking 30 + 40 + 20 minutes.]

17

• Pipelining doesn't help the latency of a single task, it helps the throughput of the entire workload

• Pipeline rate is limited by the slowest pipeline stage

• Multiple tasks operate simultaneously

• Potential speedup = number of pipe stages

• Unbalanced lengths of pipe stages reduce speedup

• Time to "fill" the pipeline and time to "drain" it reduce speedup

Pipelining Lessons

[Figure: pipelined laundry timeline starting at 6 PM (axis 6 PM to 9). Loads A, B, C and D (task order) overlap; the timeline segments are 30, 40, 40, 40, 40 and 20 minutes.]
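The laundry example can be replayed with a small Python sketch. It uses the stage times given in the example (washer 30, dryer 40, folder 20 minutes, four loads); the function names and the scheduling rule (a load starts a stage as soon as both the load and the machine are free) are assumptions made for illustration.

```python
def sequential_time(stage_minutes, n_loads):
    """Each load finishes all stages before the next load starts."""
    return n_loads * sum(stage_minutes)


def pipelined_time(stage_minutes, n_loads):
    """Loads overlap: a stage starts when the load and the machine are both free."""
    machine_free_at = [0] * len(stage_minutes)
    for _ in range(n_loads):
        load_ready_at = 0
        for i, duration in enumerate(stage_minutes):
            start = max(load_ready_at, machine_free_at[i])
            machine_free_at[i] = start + duration
            load_ready_at = machine_free_at[i]
    return machine_free_at[-1]


STAGES = [30, 40, 20]   # washer, dryer, folder (minutes)
LOADS = 4               # Ann, Brian, Cathy, Dave

seq = sequential_time(STAGES, LOADS)
pipe = pipelined_time(STAGES, LOADS)
print(seq, pipe, round(seq / pipe, 2))   # 360 210 1.71
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes (3.5 hours) against 360 minutes sequentially. The speedup of about 1.7 is well below the three-stage ideal because the stages are unbalanced and the pipeline has to fill and drain.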

18

Moore's Law → more transistors for advanced architectures

Delivers higher peak perf

But lower power efficiency

Performance = Frequency x Instructions per Clock Cycle

Power = Switching Activity x Dynamic Capacitance x Voltage x Voltage x Frequency

History: How did we increase Perf in the Past?

[Figure: two bar charts over four microarchitecture steps (Pipelined, S-Scalar, OOO-Spec, Deep Pipe). First chart: increase (X) in area and performance, scale 0 to 5. Second chart: increase (X) in power and Mips/W (%), scale -1 to 3.]

19

In many systems today power is the limiting factor and will drive most of the architecture decisions

New Goal: optimize performance in a given power envelope

Why Multi-Cores?

20

Dual Core

Rule of thumb (in the same process technology):

Voltage   Frequency   Power   Performance
1%        1%          3%      0.66%

[Figure: a single core with its cache next to a dual-core chip with cache.]

             Single core   Dual core
Voltage      1             -15%
Freq         1             -15%
Area         1             2
Power        1             1
Perf         1             ~1.8

How to maximize performance in the same power envelope?

Power = Dynamic Capacitance x Voltage x Voltage x Frequency
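The dual-core numbers can be reproduced approximately from the rule of thumb (per 1% of voltage: 1% frequency, 3% power, 0.66% performance). The Python sketch below applies that rule linearly to a 15% voltage reduction and doubles the result for two cores; it assumes the workload scales perfectly across both cores, which is an idealization. The cubic figure is shown only for comparison with the Power = C x V x V x F relation.

```python
def scale_core(voltage_change_pct):
    """Linear rule of thumb: per 1% voltage, 1% frequency, 3% power, 0.66% perf."""
    freq = 1.0 + 1.00 * voltage_change_pct / 100.0
    power = 1.0 + 3.00 * voltage_change_pct / 100.0
    perf = 1.0 + 0.66 * voltage_change_pct / 100.0
    return freq, power, perf


freq, power, perf = scale_core(-15.0)

# Two cores, each at -15% voltage and frequency (perfect scaling assumed).
dual_power = 2 * power
dual_perf = 2 * perf

# Same operating point using the cubic relation V * V * F directly.
dual_power_cubic = 2 * (0.85 ** 2) * 0.85

print(round(dual_power, 3), round(dual_perf, 3), round(dual_power_cubic, 3))
```

The rule of thumb gives a dual-core power of about 1.1 and a performance of about 1.8, in line with the "Power = 1, Perf = ~1.8" figures on the slide. The cubic relation gives roughly 1.23, so the slide values are best read as approximations.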

21

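Multicore

[Figure: a large core with cache compared with a small core, with bar charts of power (scale 1 to 4) and performance (scale 1 to 2). Relative to the large core, the small core has Power = 1/4 and Performance = 1/2. A second panel shows four small cores C1 to C4 sharing a cache, with power and performance both on a 1 to 4 scale.]

Multi-Core:

Power efficient

Better power and thermal management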

22

Era of Parallelism

23

Thermal constraints are the main limiting factor in future designs (not size)

Move away from frequency alone to deliver performance

Scaling challenges: thread-level parallelism must be exploited to use efficiently the transistors made available by Moore's law.

Power/performance tradeoffs dictate architectural choices

Multi-everywhere

Multi-threading

Chip level multi-processing

Throughput oriented designs

Summary: Why Multi-Cores?

24

Processors are designed to address the needs of the mass market.

• Mobile applications → low power and good power management are top priorities to enable thinner systems and longer battery life

• Office, image, video → single-threaded perf matters, plus some level of multithreaded perf → Multi-core

• RMS (Recognition, Mining, Synthesis) applications and model-based computing → massively parallel apps, good scaling on a large number of cores → Many-core

Because of the large markets in each of the classes above, they are the focus of silicon manufacturers and are driving innovation in the semiconductor market

Application-driven Architecture Design

25

RMS Scaling on a Many-Core Simulator

[Figure: two plots of speed-up (0 to 64) versus number of cores (0 to 64). First plot: Gauss-Seidel, Sparse Matrix (Avg.), Dense Matrix (Avg.), Kmeans, SVD, SVM_classification, Cloth-AVDF, Cloth-AVIF, Cloth-US, Face-AFD, SepiaTone (Core Image). Second plot: Text indexing, CFD, Ray Tracing, FB_Estimation, Body Tracker, Portfolio management, Play physics.]

Data from Intel Application Research Lab

26

• Low-power architecture and SoCs

  • ARM based

  • LPIA/Atom based

• Multi-core

  • Core microarchitecture

  • PowerPC

• Many-core

  • GP GPU

  • Larrabee

3 Classes of Applications 3 Types of Processors

27

Examples of ARM-based low power architectures and SoCs:

TI OMAP, Nvidia Tegra, Apple A4/A5, Samsung Exynos

Low-power Architecture and SoCs

28

Towards PC on a chip

Same Intel Core (e.g. Bay Trail) for Tablets/Smartphones,

Consumer Electronic Devices and Embedded Market

29

• Multi-core

• IBM Power4

• IBM Cell

• Intel Ivy Bridge

Multicore

30

• Tick-Tock model

• Modular design to decrease cost (design, test, validation)

• Integrate graphics on chip

Intel Roadmap for Intel Core Microarchitecture

31

• Binning for leakage distribution and performance

P = αCV²f + leakage

• Turbo mode to optimize performance under a given power envelope

• Policy to balance thermal budget between general purpose cores, and between GPP cores and graphics

• Next: Maximize performance under a given thermal envelope at the platform level

Power/Performance Tradeoffs

32

GP GPU: Nvidia GeForce, more than 2000 PEs

33

• GPUs do not need large caches because the large number of threads hides the latency. The chip is designed to tolerate DRAM latency through a huge number of threads. Local memories are still present to limit bandwidth to GDDR

• CPUs need large multi-level caches because the data needs to be close to the execution units

• The fast-growing video game industry exerts strong economic pressure that forces constant innovation

CPUs vs. GPUs

34

For a given application, processor architectures should be chosen depending on the performance/power efficiency

• MIPS/Watt or Gflops/Watt

• Energy efficiency (Energy Delay Product)

This is highly dependent on the application and targeted power envelope. Examples:

• ARM and Atom are optimized for mainstream office and media apps for a power envelope between 1W and <10W

• Core microarchitecture is optimized for high-end office and media apps for a power envelope between 15W and ~75W

• GPUs are optimized for graphics applications and some selected scientific applications between 10W and more than 400W

Performance/Power for different architectures
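The efficiency metrics listed above can rank two processors differently. The Python sketch below compares a hypothetical low-power chip and a hypothetical high-performance chip on the same task; all power, time and operation counts are invented for illustration, and the helper names are assumptions.

```python
def energy(power_w, time_s):
    """Energy in joules for a task that runs for time_s at power_w."""
    return power_w * time_s


def energy_delay_product(power_w, time_s):
    """EDP = energy * delay; lower is better."""
    return energy(power_w, time_s) * time_s


def perf_per_watt(operations, power_w, time_s):
    """Operations per second per watt; higher is better."""
    return (operations / time_s) / power_w


OPS = 3.0e9   # hypothetical task size in operations

# (power in W, run time in s): assumed values, not measurements.
low_power_chip = (2.0, 10.0)
high_perf_chip = (10.0, 3.0)

for name, (p, t) in (("low-power", low_power_chip), ("high-perf", high_perf_chip)):
    print(name, energy(p, t), energy_delay_product(p, t), perf_per_watt(OPS, p, t))
```

With these assumed numbers the low-power chip wins on energy (20 J against 30 J) and on operations per second per watt, while the high-performance chip wins on energy-delay product (90 against 200) because EDP also penalizes the longer run time. Which metric matters depends on the application and its power envelope.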

35

Processor will integrate

- Big core for single thread perf

- Small core for multithreaded perf

- some dedicated hardware units for

- graphics

- media

- encryption

- networking function

- other function specific logic

Systems will be heterogeneous

Processor core will be connected to

- one or multiple many-core cards

- and dedicated function hw in the chipset

+ reconfigurable logic in the system or on chip?

Future: PC on a Chip

[Figure: two many-core chips, each a 4x4 array of IA cores with PCI-Ex links, a Gfx/Media block and memory channels (one labeled High-End Add-in), together with two big IA cores and a GCH.]

Part 3: App Specific Proc: DSPs, FPGAs, Accelerators, SoCs




37

What are application specific processors?

Processors or System-on-chip targeting a specific (class of) application(s)

Very common for

Audio: MP3, AAC coding and decoding in audio players

Image: JPEG or JPEG2000 coding and decoding, e.g. Digital cameras

Video: MPEG, H264 coding and decoding, e.g. DVD players or set-top-boxes

Encryption: RSA, AES

Communication: GSM, 3G in cellphones

Why?

Large markets can justify the development of application specific processors

Dedicated circuits provide higher performance with lower power dissipation, better battery life and very often lower cost.

Application Specific Processors

38

Application Specific Signal Processor Spectrum

39

DSPs

Dedicated ASICs

FPGAs

Accelerators as coprocessors

ISA extensions

SoCs

Different Types of ASPs

40

Summary of Architectural Features of DSPs

Data path configured for DSP

Fixed-point arithmetic

MAC: Multiply-accumulate

Multiple memory banks and buses

Harvard Architecture: separate data and instruction memory

Multiple data memories

Specialized addressing modes

Bit-reversed addressing

Circular buffers

Specialized instruction set and execution control

Zero-overhead loops

Support for MAC

Specialized peripherals for DSP
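Three of the DSP features listed above (multiply-accumulate, circular buffers and bit-reversed addressing) can be modelled in a few lines of Python. This is only a behavioural sketch: a real DSP performs the MAC and the address update in hardware, typically in a single cycle, and the class and function names here are hypothetical. Integer coefficients stand in for fixed-point values.

```python
class CircularFir:
    """FIR filter whose delay line is a circular buffer (modulo addressing)."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.buf = [0] * len(self.coeffs)
        self.pos = 0   # index of the newest sample

    def step(self, sample):
        n = len(self.buf)
        self.pos = (self.pos + 1) % n          # circular address update
        self.buf[self.pos] = sample            # overwrite the oldest sample
        acc = 0
        for k in range(n):
            # One multiply-accumulate (MAC) per filter tap.
            acc += self.coeffs[k] * self.buf[(self.pos - k) % n]
        return acc


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index (FFT input/output reordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


fir = CircularFir([1, 2, 3])
impulse_response = [fir.step(x) for x in (1, 0, 0, 0)]
print(impulse_response)                         # [1, 2, 3, 0]
print([bit_reverse(i, 3) for i in range(8)])    # [0, 4, 2, 6, 1, 5, 3, 7]
```

Feeding a unit impulse returns the coefficients in order, which confirms that the circular indexing reads the samples from newest to oldest. The inner loop is what a zero-overhead hardware loop would execute without branch cost.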

41

DSP Example: 320C62x/67x DSP

42

Many dedicated ASICs exist on the market, especially for media and communication applications. Examples:

MP3 player

DVD player

Video processing engines, e.g. De-interlacing, super-resolution

Video Encoder/Decoder

GSM/3G

TCP/IP Offload engine

Advantages:

Low power, high perf/power efficiency

Small area compared to same functionality in DSP or GPP

Drawbacks

Design cost is high, so ASICs require large volumes to be economical

Not flexible: cannot handle different applications, cannot evolve to follow standard evolution

Dedicated ASICs

43

Reconfigurable architectures

FPGAs contain gates that can be programmed for a specific application

• Each logic element outputs one data bit

• Interconnect programmable between elements

FPGAs can be reconfigured to target a different function by loading another configuration
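A common way to build such a one-bit logic element is a small lookup table (LUT) whose contents are the configuration. The Python sketch below models that idea; the slides do not detail the logic element, so the LUT model and the helper names are assumptions made for illustration.

```python
def make_lut_config(func, n_inputs):
    """Build the configuration bits (truth table) for a logic function."""
    config = []
    for address in range(2 ** n_inputs):
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        config.append(func(*bits) & 1)
    return config


def lut_eval(config, inputs):
    """Evaluate a logic element: the inputs form an address, the output is one bit."""
    address = 0
    for i, bit in enumerate(inputs):
        address |= (bit & 1) << i
    return config[address]


# The same logic element, loaded with two different configurations.
xor_config = make_lut_config(lambda a, b: a ^ b, 2)
and_config = make_lut_config(lambda a, b: a & b, 2)

print(xor_config, and_config)                      # [0, 1, 1, 0] [0, 0, 0, 1]
print(lut_eval(xor_config, (1, 0)), lut_eval(and_config, (1, 0)))   # 1 0
```

Reconfiguration amounts to loading a different table into the same element: no logic is rewired in the model, only the stored bits change.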

44

Specifications

Input: RTL coding, structural or behavioral description

RTL Simulation

Functional simulation: check logic and data flow (no timing analysis)

Synthesis

Translate into specific hardware primitives

Optimisation to meet area and performance constraints

Place and Route

Map hw primitives to specific places on the chip based on area and performance for the given technology

Specify routing

Timing Analysis

Verification that timing specifications are met

Test and Verification of the component on the FPGA board

FPGAs Design Flow

45

Vivado HLS Design flow: from C to VHDL

46

Vivado HLS Development Flow

[Flowchart: C/C++ programming → C simulation (algorithm validation) → optimization directives insertion (Pipeline, Unroll, Merge Loops, Array Partitioning, Interfaces, Dataflow) → synthesis → decision "Results OK? (Perf / Resources)" with Yes/No branches → cosimulation (RTL design validation) → IP generation.]

47

Vivado HLS User Interface

Project explorer Directives insertion Code editor

Synthesis log

48

Synthesis report

Vivado HLS Tooling

Instructions analysis view

Clock cycle accurate representation

Verification of actual parallelisation of instructions (e.g. pipelining)

Localization of data dependencies

Latencies

Loops pipelining (latencies/throughput)

Resources

49

Xilinx Design methodology

• Design methodology options

A combination is possible!

50

Current generations of FPGAs add a GPP on the chip

Hardwired PowerPC (Xilinx)

NIOS Softcore (Altera)

MicroBlaze Softcore (Xilinx)

SoC with ARM on Xilinx Zynq

FPGAs with On-chip GPP

51

DSP blocks in reconfigurable architectures

Stratix DSP blocks consist of hardware multipliers, adders, subtractors, accumulators, and pipeline registers

Some FPGAs add DSP blocks to increase performance of DSP algorithms

Example: Stratix DSP blocks

52

Reconf matrix of DSP blocks as media coproc.

[Figure: a general purpose processor (instruction unit, instruction cache, execution unit, data cache, memory) coupled to a coprocessor. The coprocessor contains control logic (PLA), two memory groups and a reconfigurable matrix of 8x3 processing elements; each processing element has a 32b multiplier, a 32b adder/subtractor and a shift register. Embedded memories (operand memories op1, op2, op4 and op6 and a result memory are shown) have read/write address, read data, write data and chip-select signals, with a control ROM.]

It is possible to build complex systems based on recent FPGA architectures

Taking advantage of the regular structure of the DSP blocks in the FPGA matrix

53

Dedicated circuits to accelerate a specific part of the processing

Typically connected to a general-purpose processor or a DSP

Granularity can vary

Accelerator for a DCT function

Accelerator for a whole JPEG encoder

Accelerators are very common in systems-on-chip

Typically called through an API function call from the main CPU

Accelerators as Coprocessors

54

Extending the ISA of a general purpose processor with SIMD instructions and specific instructions targeting media and communication applications is very common

It adds application specific features to a processor and turns a general purpose processor into a signal/image/video processor.

Examples:

Intel MMX, SSE

PowerPC AltiVec

Sun VIS

XScale WMMX

ARM NEON, Thumb-2, TrustZone, Jazelle, etc.

ISA extension in General-Purpose Processors
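To give a feel for what a SIMD media instruction does, the Python sketch below models a packed unsigned saturating add on eight 8-bit lanes held in one 64-bit word, in the spirit of the packed saturating adds offered by MMX-style extensions. It is a behavioural model only: real hardware processes all lanes in a single instruction, whereas this code loops over them, and the function names are hypothetical.

```python
LANES = 8        # eight 8-bit values packed in a 64-bit word
LANE_MAX = 0xFF


def pack_bytes(values):
    """Pack a list of 8-bit values into one integer, lane 0 in the low byte."""
    word = 0
    for i, value in enumerate(values):
        word |= (value & LANE_MAX) << (8 * i)
    return word


def unpack_bytes(word):
    """Split a packed word back into its 8-bit lanes."""
    return [(word >> (8 * i)) & LANE_MAX for i in range(LANES)]


def packed_add_saturate_u8(a, b):
    """Lane-wise unsigned add that clamps at 255 instead of wrapping around."""
    result = 0
    for i in range(LANES):
        x = (a >> (8 * i)) & LANE_MAX
        y = (b >> (8 * i)) & LANE_MAX
        result |= min(x + y, LANE_MAX) << (8 * i)
    return result


pixels = pack_bytes([10, 200, 255, 0, 1, 2, 3, 4])
offset = pack_bytes([20, 100, 1, 0, 1, 1, 1, 1])
print(unpack_bytes(packed_add_saturate_u8(pixels, offset)))
# [30, 255, 255, 0, 2, 3, 4, 5]
```

Saturation is what makes such instructions useful for image and video work: brightening a pixel that is already near white clamps to 255 rather than wrapping around to a dark value.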

55

Conflicting requirements

ASICs Media Proc/DSPs GPPs

Better Power efficiency, runs at lower frequency

Flexibility, re-programmability (vs. redesign cost)

Better programming tools, shorter TTM for new app

Smaller chip size, lower leakage

56

The Energy-Flexibility Gap

[Figure: energy efficiency in MOPS/mW (or MIPS/mW), log scale from 0.1 to 1000, versus flexibility (coverage). Labeled regions: embedded processors, media processors, DSPs, reconfigurable processor/logic, dedicated HW.]

57

SoCs integrate the optimal mix of processors and dedicated hardware units for the different applications targeted by the system.

Typically integrate a general purpose processor, e.g. ARM

Can integrate a DSP

Accelerators for specific functions

Dedicated memories

Integration boosts performance, cuts cost, reduces power consumption compared to a similar mix of processors on a card

System-on-Chip

58

Digital Camera hardware diagram

[Figure: block diagram of a digital camera. Blocks shown: mechanical shutter, CMOS imager, A/D, image processing ASIC, two 256Kx16 DRAMs, MCU, 32Kx8 SRAM, serial EEPROM, memory card I/F ASIC, PCMCIA 68-pin connector, memory card, LCD control ASIC, LCD, power control, 3.3V CR-123 lithium cell, user interface keys, expose, activity LED, door interlock.]

ASIC Integration Opportunity

59

MPSoC: A Platform Story

What’s a platform?

“A coordinated family of architectures that satisfy a set of architectural constraints imposed to support reuse of hardware and software components”

Best of all worlds:

Provides some level of flexibility

While being power efficient

And enabling some level of reusability

Can last multiple product generations

Requires forward-looking platform based design to integrate potential future application requirements in today’s platform

Programming model and design efficiency are key!

60

Nvidia Tegra

61

TI OMAP

62

Freescale iMX6

63

Intel Silvermont/Bay Trail

64

Tegra K1

65

Tegra K1

66

Tegra K1

67

Embedded Processor Architecture Trends

• Where do we come from?

DSP, FPGA, CPU, GPUs

68



• What is available today?

Mix of multicore/manycore and hardware accelerator on the same chip (e.g. Tegra K1)

Or mix of multicore and FPGA on the same chip (Zynq)

[Figure: two SoC block diagrams (labeled i.MX6). Multicore+Manycore: I/O, multicore CPU, GPU/manycore, image/video accelerator. Multicore + FPGA: I/O, multicore CPU, FPGA, image/video accelerator.]

69



• Where are we going?

Mix of multicore, manycore, FPGA and hardware accelerators on the same chip

Designed for real-time sensor processing

[Figure: SoC block diagram with I/O, multicore CPU, GPU/manycore, image/video accelerator and FPGA.]

Higher Integration/Lower Power

70

Very Fast Moving Industry

• Performance and power evolve at a very fast pace!

CPUs and GPUs driven by the PC market

System-on-Chip driven by the cellphone/tablet market

~50x perf/Watt improvement over the last 5 years

                 GTX 285 (2009)     Tegra K1 (2014)
Type             PCIe Card          System-on-chip
CPU              Intel (PC board)   ARM (integrated)
Interconnect     PCIe               On-Chip
# Cores          240                192
Power            200W               2W
Total Power      600W               5W
Performance      1000 Gflops        365 Gflops
Perf/Watt        5 Gflops/W         180 Gflops/W

50x perf/Watt improvement in 5 years: fanless design possible today!
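The perf/Watt figures in the comparison can be recomputed from the performance and power rows. The Python sketch below does so for both the chip power and the total system power; the dictionary layout and names are illustrative assumptions, and the numbers are the ones given in the comparison.

```python
systems = {
    "GTX 285 (2009)": {"gflops": 1000.0, "chip_w": 200.0, "total_w": 600.0},
    "Tegra K1 (2014)": {"gflops": 365.0, "chip_w": 2.0, "total_w": 5.0},
}


def perf_per_watt(gflops, watts):
    """Gflops delivered per watt."""
    return gflops / watts


chip_eff = {n: perf_per_watt(s["gflops"], s["chip_w"]) for n, s in systems.items()}
total_eff = {n: perf_per_watt(s["gflops"], s["total_w"]) for n, s in systems.items()}

chip_gain = chip_eff["Tegra K1 (2014)"] / chip_eff["GTX 285 (2009)"]
total_gain = total_eff["Tegra K1 (2014)"] / total_eff["GTX 285 (2009)"]

for name in systems:
    print(name, round(chip_eff[name], 1), round(total_eff[name], 1))
print(round(chip_gain, 1), round(total_gain, 1))
```

At chip level this gives 5 and 182.5 Gflops/W (a gain of about 36x), and at system level about 1.7 and 73 Gflops/W (a gain of about 44x). Both are the same order of magnitude as the roughly 50x quoted, with the exact ratio depending on which power figure is used.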

71

Efficiently Use Heterogeneous SoCs

• Use the right core for the right task:

GPU: massively parallel tasks

CPU: control oriented tasks

FPGA: compute intensive tasks with hard real-time constraints

• Competitive programming approaches:

Parallel programming languages and tools for CPU and GPU

High Level Synthesis (from C to VHDL) for FPGAs

Development, profiling, debugging tools are evolving as fast as the hardware

Video pipeline example:

                  Power (Watt)   Performance (Fps)   Perf/Watt
Single CPU        2              6                   3.0
4 x CPU           4              15                  3.8
4 x CPU + GPU     2              20                  10

Improved power efficiency, time to market and portability within product line

72

• New architectures are driven by power and thermal constraints

• Transistor count continues to increase thanks to Moore’s law

• Most systems are limited by thermals

• Parallelism is needed for perf and power efficiency

• Instruction level parallelism: Pipeline, OOO, VLIW

• Data-level parallelism: SIMD, Vector, 2D SIMD Matrices

• Thread level parallelism: SMP, CMP, SMT/HT

• System level parallelism: I/Os, Memory Hierarchy

• Key Issues with Parallelism

• Amdahl’s law

• Extracting parallelism from applications

• System issues: the rest of the system needs to be well balanced

• Programming models need to be portable, easy to learn and efficient

• Application Specific Signal Processors and SoCs

• Spectrum: ASICs, FPGA, Media Proc, DSP, GPP + ISA extensions

• Depending on power/performance constraints, often a mix (SoC)

Summary
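Amdahl's law, listed among the key issues with parallelism, bounds the speed-up by the fraction of the work that stays sequential. The following Python sketch evaluates it for a few core counts; the 95% parallel fraction is an assumed example value, not a figure from the course.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Speed-up on n_cores when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


PARALLEL_FRACTION = 0.95   # assumed: 95% of the run time can be parallelized

for cores in (1, 4, 16, 64):
    print(cores, round(amdahl_speedup(PARALLEL_FRACTION, cores), 2))

# Upper bound as the core count grows without limit.
print(round(1.0 / (1.0 - PARALLEL_FRACTION), 2))
```

Even with 95% of the work parallelized, 64 cores give a speed-up of only about 15, and no core count can exceed 20. This is why extracting parallelism from the application matters as much as adding cores.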
