computing architecture for signal processorlacas/teaching/archi/archi_m2r_orsay_part1.pdf ·...
TRANSCRIPT
Embedded Computing Systems
for Signal Processing Applications
Part 1: Introduction
November 7th 2014
Eric Debes
2
What is this about?
Introduction to power/performance tradeoffs and system architecture
Overview of existing processor and system architectures
Consumer vs. Industrial/Embedded
Why do we care?
Engineering added value is in complex and critical system architecture
Need to know different components available
Software/Hardware System Architecture and Modelling
Power/Performance/Price Tradeoffs
What’s the plan?
Introduction
3
1. Introduction
2. General-Purpose Processors and Parallelism
3. Application Specific Processors: DSPs, FPGAs, accelerators, SoCs
4. PC Architecture vs. Embedded System Architecture
5. Hard Real-time Systems and RTOS
6. Power Constraints
7. Critical and Complex Systems, MDE, MDA
Outline
4
Embedded
Size and thermal constraints
Sometime battery life (energy) constraints
Real-time
Time constraints
Can be hard real-time
Or soft-real time
Systems
Typically includes multiple components
Requires different expertises:
Signal Processing, computer vision, machine learning/Cognition and other algorithmic expertise
Software Architecture
Hardware/Computing Architecture
Thermal and mechanical engineering
Embedded Real-time Systems
5
Consumer : DVD/video players, Set-top-box, Playstation, printers, disk drives, GPS, cameras, mp3 players
Communications: Cellphone, Mobile Internet Devices, Netbooks, PDAs with WiFi, GSM/3G, WiMax, GPS, cameras, music/video
Automotive: Driving innovation for many embedded applications, e.g. Sensors, buses, info-tainment
Industrial Applications: Process control, Instrumentation
Other niche markets: video surveillance, satellites, airplanes, sonars, radars, military applications
Application Examples
6
Texec = NI * CPI * Tc
NI = Number of Instructions
CPI = Clock per Instruction
Tc = Cycle Time
Texec = NI / (IPC * F)
IPC = Instructions Per Cycle
F = Frequency
Performance improves with
Silicon manufacturing technology
Moore’s law contributing to higher frequency and parallelism
Microarchitecture improvements
Higher frequencies with deeper pipelines
Higher IPC through parallelism
Performance
7
Performance
PentiumII(R)
Pentium Pro
Pentium(R)
486386
601, 603
604 604+MPC75021066
21064A21164
21164A21264
21264S
10
100
1,000
10,000
1987 1989 1991 1993 1995 1997 1999 2001 2003 2005
Mhz
1
10
100
ClockPeriod/Avggate
delay
Processor freq
scales by 2X per
generation
8
Dynamic Power = αCV²
α = activity
C = capacitance
V = voltage
= frequency
Power = Pdyn + Pstatic
Power is limited by
maximum current (Voltage regulator limitation)
Thermal constraints
Power ≠ Energy
Power
9
Power
100
1 386
486
Pentium Pentium MMX
PentiumPro
Pentium II
10
1.5 1.0 0.8 0.6 0.35 0.25 0.18
Process (microns)
Maxim
um
Pow
er (
W)
1
10
100
1000
Watt
s2/c
m
i386 i486
Pentium processor
Pentium Pro processor
Pentium II processor
Pentium III processor
Hot plate
Nuclear Reactor Rocket
Nozzle
Sun’s
Surface
Power density
10
ASIC
High-performance
Dedicated to one specific application
Not programmable
Processor
Programmable
General-purpose
Reconfigurable Architecture
Good compromise between programmability and performance
Processor Architecture Spectrum
Microprocessor Reconfigurable ASIC
11
Processor Architecture Spectrum
12
Processor Architecture Spectrum
13
What are the key components in a Computing System?
Processor with
Arithmetic and Logic Units
Register File
Caches or local memory
Memory
Buses/Interconnect
I/O Devices
Key Components of a Computing System
Part 2: General-purpose Processors and Parallelism
November 7th 2014
Eric Debes
Embedded Computing Systems
for Signal Processing Applications
15
Laundry Example
Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
Pipelining: Its Natural!
A B C D
16
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
Sequential Laundry
A
B
C
D
30 40 20 30 40 20 30 40 20 30 40 20
6 PM 7 8 9 10 11 Midnight
T
a
s
k
O
r
d
e
r
Time
17
• Pipelining doesn’t help
latency of single task, it
helps throughput of
entire workload
• Pipeline rate limited by
slowest pipeline stage
• Multiple tasks operating
simultaneously
• Potential speedup =
Number pipe stages
• Unbalanced lengths of
pipe stages reduces
speedup
• Time to “fill” pipeline and
time to “drain” it reduces
speedup
Pipelining Lessons
A
B
C
D
6 PM 7 8 9
T
a
s
k
O
r
d
e
r
Time
30 40 40 40 40 20
18
Moore’s Law more transistors
for advanced architectures
Delivers higher peak perf
But lower power efficiency
Performance = Frequency x
Instruction per Clock Cycle
Power = Switching Activity x
Dynamic Capacitance x
Voltage x Voltage x
Frequency
History: How did we increase Perf in the Past?
0
1
2
3
4
5
Pipelined S-Scalar OOO-
Spec
Deep Pipe
Incre
ase (
X)
Area X
Perf X
-1
0
1
2
3
Pipelined S-Scalar OOO-
Spec
Deep
Pipe
Incre
ase (
X)
Power X
Mips/W (%)
19
In many systems today power is the limiting factor and will
drive most of the architecture decisions
New Goal: optimize performance in a given power envelope
Why Multi-Cores?
20
Dual Core
Voltage Frequency Power Performance
1% 1% 3% 0.66%
Rule of thumb (in the same process technology)
Core
Cache
Core
Cache
Core
Voltage = 1
Freq = 1
Area = 1
Power = 1
Perf = 1
Voltage = -15%
Freq = -15%
Area = 2
Power = 1
Perf = ~1.8
How to maximize performance in the same power envelope?
Power = Dynamic Capacitance x Voltage x Voltage x Frequency
21
Multicore
Small
Core 1 1
Large Core
Cache
1
2
3
4
1
2
Power
Performance
Power = 1/4
Performance = 1/2
C1 C2
C3 C4
Cache
1
2
3
4
1
2
3
4 Multi-Core:
Power efficient
Better power and
thermal management
22
Era of Parallelism
23
Thermal is the main limitation factor in future design (not size)
Move away from Frequency alone to deliver performance
Challenges in scaling need to exploit thread level
parallelism to efficiently use the transistors available thanks
to Moore’s law.
Power/performance tradeoffs dictate architectural choices
Multi-everywhere
Multi-threading
Chip level multi-processing
Throughput oriented designs
Summary: Why Multi-Cores?
24
Processors are designed to address the need of the mass market.
• Mobile applications low power and good power management
are top priorities to enable thinner systems and longer battery life
• Office, image, video single threaded perf matters, some level
of multithreaded perf Multi-core
• RMS (Recognition, Mining, Synthesis) Applications and
Model based Computing massively parallel apps, good scaling
on a large number of cores Many-core
Because of the large markets in each of the classes above, they
are the focus of silicon manufacturers and are driving innovation in
the semiconductor market
Application-driven Architecture Design
25
RMS Scaling on a Many-Core Simulator
0
16
32
48
64
0 16 32 48 64
# of cores
Sp
eed
-up
Gauss-Seidel
Sparse Matrix (Avg.)
Dense Matrix (Avg.)
Kmeans
SVD
SVM_classification
Cloth-AVDF
Cloth-AVIF
Cloth-US
Face-AFD
SepiaTone (Core Image)
0
16
32
48
64
0 16 32 48 64
# of cores
Sp
eed
-up
Text indexing
CFD
Ray Tracing
FB_Estimation
Body Tracker
Portifolio management
Play physics
Data from Intel Application Research Lab
26
• Low-power
architecture and SoCs
• ARM based
• LPIA/Atom based
• Multi-core
• Core microarchitecture
• PowerPC
• Many-core
• GP GPU
• Larrabee
3 Classes of Applications 3 Types of Processors
27
Examples of ARM-based low power architectures and SoCs:
TI OMAP, Nvidia Tegra, Apple A4/A5, Samsung Exynos
Low-power Architecture and SoCs
28
Towards PC on a chip
Same Intel Core (e.g. Bay Trail) for Tablets/Smartphones,
Consumer Electronic Devices and Embedded Market
29
• Multi-core
• IBM Power4
• IBM Cell
• Intel Ivy Bridge
Multicore
30
• Tick-Tock model
• Modular design to
decrease cost
(design, test,
validation)
• Integrate graphics
on chip
Intel Roadmap for Intel Core Microarchitecture
31
• Binning for leakage distribution and performance
P = α.C.v2. + leakage
• Turbo mode to optimize performance under a given
power envelope
• Policy to balance thermal budget between general
purpose cores, and between GPP cores and graphics
• Next: Maximize performance under a given thermal
envelope at the platform level
Power/Performance Tradeoffs
32
GP GPU: NVidia GeForce more than 2000 PEs
33
• No need to put a lot of cache for GPUs because the
number of threads are hiding the latency. The chip is
designed for DRAM latency through a huge number of
threads. Local memory are still present to limit bandwidth
to GDDR
• CPU need multi-level large caches because the data need
to be close to the execution units
• Fast growing video game industry exerts strong
economic pressure that forces constant innovation
CPUs vs. GPUs
34
For a given application, processor architectures should be
chosen depending on the performance/power efficiency
• MIPS/Watt or Gflops/Watt
• Energy efficiency (Energy Delay Product)
This is highly dependent on the application and targeted
power envelope. Examples:
• ARM and Atom are optimized for mainstream office and media apps for
a power envelope between 1W and <10W
• Core microarchitecture is optimized for high-end office and media apps
for a power envelope between 15W and ~75W
• GPUs are optimized for graphics applications and some selected
scientific applications between 10W and more than 400W
Performance/Power for different architectures
35
Processor will integrate
- Big core for single thread perf
- Small core for multithreaded perf
- some dedicated hardware units for
- graphics
- media
- encryption
- networking function
- other function specific logic
Systems will be heterogeneous
Processor core will be connected to
- one or multiple many-core cards
- and dedicated function hw in the chipset
+ reconfigurable logic in the system or on chip?
Future: PC on a Chip
IA IA IA IA
IA IA IA IA
IA IA IA IA
IA IA IA IA
PCI-Ex PCI-Ex
Gfx/Media
Memory Ch
High-End Add-in
IA IA IA IA
IA IA IA IA
IA IA IA IA
IA IA IA IA
PCI-Ex PCI-Ex
Gfx/Media
Memory Ch
IA
(Big core)
IA
(Big core)
GCH
Part 3: App Specific Proc: DSPs, FPGAs, Accelerators, SoCs
November 7th 2014
Eric Debes
Embedded Computing Systems
for Signal Processing Applications
37
What are application specific processors?
Processors or System-on-chip targeting a specific (class of) application(s)
Very common for
Audio: MP3, AAC coding and decoding in audio players
Image: JPEG or JPEG2000 coding and decoding, e.g. Digital cameras
Video: MPEG, H264 coding and decoding, e.g. DVD players or set-top-boxes
Encryption: RSA, AES
Communication: GSM, 3G in cellphones
Why?
Large markets can justify the development of application specific processors
Dedicated circuits provide higher performance with lower power dissipation, better battery life and very often lower cost.
Application Specific Processors
38
Application Specific Signal Processor Spectrum
39
DSPs
Dedicated ASICs
FPGAs
Accelerators as coprocessors
ISA extensions
SoCs
Different Types of ASPs
40
Summary of Architectural Features of DSPs
Data path configured for DSP
Fixed-point arithmetic
MAC- Multiply-accumulate
Multiple memory banks and buses -
Harvard Architecture: separate data and instruction memory
Multiple data memories
Specialized addressing modes
Bit-reversed addressing
Circular buffers
Specialized instruction set and execution control
Zero-overhead loops
Support for MAC
Specialized peripherals for DSP
41
DSP Example: 320C62x/67x DSP
42
Many dedicated ASICs exist on the market, especially for media and communication applications. Example:
MP3 player
DVD player
Video processing engines, e.g. De-interlacing, super-resolution
Video Encoder/Decoder
GSM/3G
TCP/IP Offload engine
Advantages:
Low power, high perf/power efficiency
Small area compared to same functionality in DSP or GPP
Drawbacks
Cost of designing ASICs requires large volume
Not flexible: cannot handle different applications, cannot evolve to follow standard evolution
Dedicated ASICs
43
Reconfigurable architectures FPGAs contain gates that can be programmed for a specific application
• Each logic element outputs one data bit
• Interconnect programmable between elements
FPGAs can be reconfigured to target a different function by loading another configuration
44
Spécifications
Input: RTL coding structural or behavioral description
RTL Simulation
Functional simulation check logic and data flow (no temporal
analysis)
Synthesis
Translate into specific hardware primitives
Optimisation to meet area and performance constraints
Place and Route
Map hw primitives to specific places on the chip based on area
and performance for the given technology
Specify routing
Temporal Analysis
Verification that temporal specification are met
Test and Verification of the component on the FPGA board
FPGAs Design Flow
45
Vivado HLS Design flow: from C to VHDL
46
Vivado HLS Development Flow
C/C++ programming
C Simulation
Algorithm validation
Optimization directives
insertion
Synthesis
Cosimulation
RTL design validation
IP generation
No
Yes
Pipeline
Unroll
Merge
Loops
- Array Partitionning
- Interfaces
Dataflow
Results OK ?
(Perf / Resources)
47
Vivado HLS User Interface
Project explorer Directives insertion Code editor
Synthesis log
48
Synthesis report
Vivado HLS Tooling
Instructions analysis view
Clock cycle accurate representation
Verification of actual parallelisation of
instructions (e.g. pipelining)
Localization of data dependencies
Latencies
Loops pipelining
(latencies/
Throughput)
Resources
49
Xilinx Design methodology
• Design methodology options
A combination is possible!
50
Current generations of FPGAs
add a GPP on the chip
Hardwired PowerPC (Xilinx)
NIOS Softcore (Altera)
MicroBlaze Softcore (Xilinx)
SoC with ARM on Xilinx Zynq
FPGAs with On-chip GPP
51
DSP blocks in reconfigurable architectures
Stratix DSP blocks consist of hardware
multipliers, adders, subtractors,
accumulators, and pipeline registers
Some FPGAs add DSP blocks to increase performance of DSP algorithms
Example: Stratix DSP blocks
52
Reconf matrix of DSP blocks as media coproc.
Execution
Unit
Data Cache
Instruction
Unit
Memory
Instruction
Cache
General purpose processor
Control (PLA)
Memory group #1
Memory group #2
Co processor
Matrix of
Processing
Elements
32b mult
32b add/sub
Shift reg
Row of
Processing
Elements
mem
op1
Reconfigurable MatriX (8x3 PEs)
mem
op2
Embedded memories
read write address
read data
write data
Control (ROM) chipselect
32b mult
32b add/sub
Shift reg
mem
op4
32b mult
32b add/sub
Shift reg
mem res
mem
op6
It is possible to build complex system based on recent FPGA architectures
Taking advantage of the regular structure of the DSP blocks in the FPGA matrix
53
Dedicated circuits to accelerate a specific part of the processor
Typically will be connected to a general-purpose processor or a DSP
Granularity can vary
accelerator for a DCT function
Accelerator for a whole JPEG encoder
Accelerators are very common in system on chip
Are typically called through an API function call from the main CPU
Accelerators as Coprocessors
54
Extending the ISA of a general purpose processor with SIMD instructions and specific instructions targeting media and communication applications is very common
It adds application specific features to a processor and turns a general purpose processor into a signal/image/video processor.
Example:
Intel MMX, SSE
PowerPC AltiVec
SUN VIS
Xscale WMMX
ARM Neon, Thumb-2, Trustzone, Jazelle, etc.
ISA extension in General-Purpose Processors
55
Conflicting requirements
ASICs Media Proc/DSPs GPPs
Better Power efficiency, runs at lower frequency
Flexibility, re-programmability (vs. redesign cost)
Better programming tools, shorter TTM for new app
Smaller chip size, lower leakage
56
The Energy-Flexibility Gap
Embedded Processors
Media Processors
DSPs
Dedicated
HW
Flexibility (Coverage)
En
ergy
Eff
icie
ncy
MO
PS
/mW
(or
MIP
S/m
W)
0.1
1
10
100
1000
Reconfigurable
Processor/Logic
57
SoCs integrate the optimal mix of processors and dedicated hardware units for the different applications targeted by the system.
Typically integrate a general purpose processor, e.g. ARM
Can integrate a DSP
Accelerators for specific functions
Dedicated memories
Integration boosts performance, cuts cost, reduces power consumption compared to a similar mix of processors on a card
System-on-Chip
58
Digital Camera hardware diagram
Mechanical Shutter
A/DCMOS Imager
Image
Processing
ASIC
256Kx16
DRAM
256Kx16
DRAM
MCU Memory
Card I/F
LCD
Control
ASICLCD
32 Kx8
SRAM
68
-pin
co
nn
.
ASIC
PCMCIA
Serial
EEPROM
Power
Control
3.3V CR-123
Lithium Cell
Expose
User Interface Keys
Activity LED
Door
Interlock
Memory Card
ASIC Integration Opportunity
59
MPSoC: A Platform Story
What’s a platform?
“A coordinated family of architectures that satisfy a set of
architectural constraints imposed to support reuse of
hardware and software components”
Best of all worlds:
Provides some level of flexibility
While being power efficient
And enabling some level of reusability
Can last multiple product generations
Requires forward-looking platform based design to integrate potential
future application requirements in today’s platform
Programming model and design efficiency are key!
60
Nvidia Tegra
61
TI OMAP
62
Freescale iMX6
63
Intel Silvermont/Bay Trail
64
Tegra K1
65
Tegra K1
66
Tegra K1
67
Embedded Processor Architecture Trends
• Where do we come from?
DSP, FPGA, CPU, GPUs
68
Embedded Processor Architecture Trends
• Where do we come from?
DSP, FPGA, CPU,
GPUs
• What is available today?
Mix of
multicore/manycore
and hardware
accelerator on the
same chip (e.g. Tegra
K1)
Or mix of multicore and
FPGA on the same
chip (Xynq)
SoC –i.MX6
I / O
Multicore CPU
GPU/Manycore
Multicore+Manycore
Image / V
ideo
A
ccelerator
SoC –i.MX6
I / O
Multicore CPU
FPGA
Multicore + FPGA
Image / V
ideo
A
ccelerator
69
Embedded Processor Architecture Trends
• Where do we come from?
DSP, FPGA, CPU, GPUs
• What is available today?
Mix of multicore/manycore and
hardware accelerator on the
same chip (e.g. Tegra K1)
Or mix of multicore and FPGA
on the same chip (Xynq)
• Where are we going?
Mix of multicore,
manycore, FPGA and
Hardware accelerators
on the same chip
Designed for real-time
sensor processing I / O
Multicore CPU
GPU/Manycore
Image / Video Accelerator
FPGA
Higher Integration/Lower Power
70
Very Fast Moving Industry
• Performance and power evolve at a very fast pace!
CPUs and GPUs driven by the PC market
System-on-Chip driven by the cellphone/tablet market
~50x perf/Watt improvement over the last 5 years
GTX 285 (2009) TegraK1 (2014)
Type PCIe Card System-on-chip
CPU Intel (PC board) ARM (integrated)
Interconnect PCIe On-Chip
# Cores 240 192
Power 200W 2W
Total Power 600W 5W
Performance 1000 Gflops 365 Gflops
Perf/Watt 5 Gflops/W 180 Gflops/W
50x perf/Watt improvement in 5 years fanless design possible today!
71
Efficiently Use Heterogeneous SoCs
• Use the right core for the
right task:
GPU massively parallel
CPU control oriented tasks
FPGA compute intensive tasks
with hard-real time constraints
• Competitive programming
approaches:
Parallel programming
languages and tools for CPU
and GPU
High Level Synthesis (from C
to VHDL) for FPGAs
Development, profiling,
debugging tools are evolving
as fast as the hardware
2 6
3,0 4
15
3,8 2
20 10
Power(Watt)
Performance(Fps)
Perf/Watt
Single CPU
4 x CPU
4 x CPU + GPU
Video pipeline exampe
Improved power efficiency, time to market and portability within product line
72
• New architecture is driven by power and thermal
• Transistor count continues to increase thanks to Moore’s law
• Most systems are limited by thermals
• Parallelism is needed for perf and power efficiency
• Instruction level parallelism: Pipeline, OOO, VLIW
• Data-level parallelism: SIMD, Vector, 2D SIMD Matrices
• Thread level parallelism: SMP, CMP, SMT/HT
• System level parallelism: I/Os, Memory Hierarchy
• Key Issues with Parallelism
• Amdahl’s law
• Extracting parallelism from applications
• Systems Issues the rest of the system needs to be well balanced
• Programming models need to be portable, easy to learn and efficient
• Application Specific Signal Processors and SoCs
• Spectrum: ASICs, FPGA, Media Proc, DSP, GPP + ISA extensions
• Depending on power/performance constraints, often a mix (SoC)
Summary