Embedded Computing Systems
for Signal Processing Applications
Part 1: Introduction
November 7th 2014
Eric Debes
What is this about?
Introduction to power/performance tradeoffs and system architecture
Overview of existing processor and system architectures
Consumer vs. Industrial/Embedded
Why do we care?
Engineering added value is in complex and critical system architecture
Need to know different components available
Software/Hardware System Architecture and Modelling
Power/Performance/Price Tradeoffs
What’s the plan?
Introduction
1. Introduction
2. General-Purpose Processors and Parallelism
3. Application Specific Processors: DSPs, FPGAs, accelerators, SoCs
4. PC Architecture vs. Embedded System Architecture
5. Hard Real-time Systems and RTOS
6. Power Constraints
7. Critical and Complex Systems, MDE, MDA
Outline
Embedded
Size and thermal constraints
Sometimes battery life (energy) constraints
Real-time
Time constraints
Can be hard real-time
Or soft real-time
Systems
Typically includes multiple components
Requires different areas of expertise:
Signal Processing, computer vision, machine learning/Cognition and other algorithmic expertise
Software Architecture
Hardware/Computing Architecture
Thermal and mechanical engineering
Embedded Real-time Systems
Consumer: DVD/video players, Set-top-box, Playstation, printers, disk drives, GPS, cameras, mp3 players
Communications: Cellphone, Mobile Internet Devices, Netbooks, PDAs with WiFi, GSM/3G, WiMax, GPS, cameras, music/video
Automotive: Driving innovation for many embedded applications, e.g. Sensors, buses, info-tainment
Industrial Applications: Process control, Instrumentation
Other niche markets: video surveillance, satellites, airplanes, sonars, radars, military applications
Application Examples
Texec = NI * CPI * Tc
NI = Number of Instructions
CPI = Clock Cycles per Instruction
Tc = Cycle Time
Equivalently, with IPC = 1/CPI and F = 1/Tc:
Texec = NI / (IPC * F)
IPC = Instructions Per Cycle
F = Frequency
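As a quick numerical check, the sketch below evaluates both forms of the execution-time formula for a made-up workload. The instruction count, CPI and clock frequency are illustrative assumptions, not figures from the course.

```python
def exec_time_cpi(ni, cpi, tc):
    """Texec = NI * CPI * Tc, in seconds."""
    return ni * cpi * tc


def exec_time_ipc(ni, ipc, f):
    """Texec = NI / (IPC * F), in seconds."""
    return ni / (ipc * f)


# Hypothetical workload (assumed values): 1e9 instructions, CPI = 2, 1 GHz clock.
NI = 1_000_000_000
CPI = 2.0
F = 1.0e9        # clock frequency in Hz
TC = 1.0 / F     # cycle time in seconds
IPC = 1.0 / CPI  # instructions per cycle

t_cpi = exec_time_cpi(NI, CPI, TC)
t_ipc = exec_time_ipc(NI, IPC, F)
print(f"Texec (CPI form) = {t_cpi:.3f} s")
print(f"Texec (IPC form) = {t_ipc:.3f} s")
```

Both forms describe the same quantity, so they should agree up to floating-point rounding.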
Performance improves with
Silicon manufacturing technology
Moore’s law contributing to higher frequency and parallelism
Microarchitecture improvements
Higher frequencies with deeper pipelines
Higher IPC through parallelism
Performance
Performance
[Figure: Processor clock frequency in MHz (log scale, 10 to 10,000) from 1987 to 2005 for the 386, 486, Pentium(R), Pentium Pro, Pentium II(R), PowerPC 601/603/604/604+/MPC750 and Alpha 21066/21064A/21164/21164A/21264/21264S, with clock period / average gate delay (1 to 100) on the second axis. Annotation: processor frequency scales by 2X per generation.]
Dynamic Power = αCV²f
α = activity
C = capacitance
V = voltage
f = frequency
Power = Pdyn + Pstatic
Power is limited by
maximum current (Voltage regulator limitation)
Thermal constraints
Power ≠ Energy
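To make the distinction between power and energy concrete, here is a minimal sketch that takes dynamic power as activity x capacitance x voltage² x frequency and adds a constant static term. All numeric values (activity, capacitance, voltage, static power, cycle count) are assumptions chosen for illustration only.

```python
def dynamic_power(alpha, c, v, f):
    """Pdyn = alpha * C * V^2 * f, in watts."""
    return alpha * c * v * v * f


def energy(power_w, time_s):
    """Energy in joules = power in watts * time in seconds."""
    return power_w * time_s


# Hypothetical chip (assumed values).
ALPHA = 0.2             # switching activity
C = 10e-9               # switched capacitance in farads
V = 1.0                 # supply voltage in volts
P_STATIC = 0.5          # static (leakage) power in watts
CYCLES = 2_000_000_000  # amount of work, in clock cycles

for f in (1.0e9, 2.0e9):
    p = dynamic_power(ALPHA, C, V, f) + P_STATIC
    t = CYCLES / f
    e = energy(p, t)
    print(f"f = {f / 1e9:.0f} GHz: power = {p:.2f} W, time = {t:.2f} s, energy = {e:.2f} J")
```

With these assumed numbers, doubling the frequency at constant voltage nearly doubles the power, while the energy spent on the fixed task changes very little: power limits the current and thermal design, energy limits battery life.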
Power
Power
[Figure: Maximum power in W (log scale, 1 to 100) versus process technology (1.5, 1.0, 0.8, 0.6, 0.35, 0.25, 0.18 microns) for the 386, 486, Pentium, Pentium MMX, Pentium Pro and Pentium II.]
[Figure: Power density in Watts/cm² (log scale, 1 to 1000) for the i386, i486, Pentium, Pentium Pro, Pentium II and Pentium III processors, compared with a hot plate, a nuclear reactor, a rocket nozzle and the Sun’s surface.]
ASIC
High-performance
Dedicated to one specific application
Not programmable
Processor
Programmable
General-purpose
Reconfigurable Architecture
Good compromise between programmability and performance
Processor Architecture Spectrum
Microprocessor Reconfigurable ASIC
What are the key components in a Computing System?
Processor with
Arithmetic and Logic Units
Register File
Caches or local memory
Memory
Buses/Interconnect
I/O Devices
Key Components of a Computing System
Part 2: General-purpose Processors and Parallelism
Laundry Example
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
Pipelining: It’s Natural!
A B C D
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
Sequential Laundry
[Figure: Sequential laundry timeline from 6 PM to midnight. Loads A, B, C and D are processed one after another in task order, each taking 30 + 40 + 20 minutes.]
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
Pipelining Lessons
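[Figure: Pipelined laundry timeline starting at 6 PM. Loads A, B, C and D overlap in task order; the segments along the time axis are 30, 40, 40, 40, 40 and 20 minutes.]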
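The laundry example can be replayed in a few lines of code. The sketch below is a simplified model that assumes each machine handles one load at a time and that a finished load can always wait between stages; the function names are hypothetical.

```python
STAGES = [30, 40, 20]  # washer, dryer, "folder" durations in minutes
LOADS = 4              # Ann, Brian, Cathy, Dave


def sequential_time(stages, loads):
    """Each load goes through all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_time(stages, loads):
    """Each stage starts a load as soon as both the load and the stage are free."""
    stage_free = [0] * len(stages)  # time at which each stage becomes available
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for i, duration in enumerate(stages):
            start = max(ready, stage_free[i])
            ready = start + duration
            stage_free[i] = ready
        finish = ready
    return finish


seq = sequential_time(STAGES, LOADS)
pipe = pipelined_time(STAGES, LOADS)
print(f"sequential: {seq} min, pipelined: {pipe} min, speedup: {seq / pipe:.2f}")
```

The pipelined total equals 30 + 4 x 40 + 20 minutes: the 40-minute dryer is the slowest stage and sets the rate, while the first wash and the last fold are the fill and drain times. The speedup stays well below the number of stages because the stages are unbalanced and there are only four loads.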
Moore’s Law → more transistors for advanced architectures
Delivers higher peak performance
But lower power efficiency
Performance = Frequency x Instructions per Clock Cycle
Power = Switching Activity x Dynamic Capacitance x Voltage x Voltage x Frequency
History: How did we increase Perf in the Past?
[Figure: Two bar charts over four microarchitecture steps (Pipelined, S-Scalar, OOO-Spec, Deep Pipe). First chart: increase (X) in area and in performance, scale 0 to 5. Second chart: increase (X) in power and change in MIPS/W (%), scale -1 to 3.]
In many systems today power is the limiting factor and will drive most of the architecture decisions
New Goal: optimize performance in a given power envelope
Why Multi-Cores?
Dual Core
Rule of thumb (in the same process technology)

| Voltage | Frequency | Power | Performance |
|---|---|---|---|
| 1% | 1% | 3% | 0.66% |

| | Single core + cache | Dual core + cache |
|---|---|---|
| Voltage | 1 | -15% |
| Freq | 1 | -15% |
| Area | 1 | 2 |
| Power | 1 | 1 |
| Perf | 1 | ~1.8 |
How to maximize performance in the same power envelope?
Power = Dynamic Capacitance x Voltage x Voltage x Frequency
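The dual-core numbers on this slide can be approximately reproduced from the rule of thumb. The sketch below applies the rule linearly (1% voltage change gives 1% frequency, 3% power and 0.66% performance change) and assumes the workload scales perfectly across two cores; both the linear extrapolation to -15% and the perfect scaling are simplifying assumptions.

```python
def apply_rule_of_thumb(delta_voltage_pct):
    """Return (frequency, power, performance) of one core relative to 1.0."""
    freq = 1 + 1.00 * delta_voltage_pct / 100  # 1% voltage -> 1% frequency
    power = 1 + 3.00 * delta_voltage_pct / 100  # 1% voltage -> 3% power
    perf = 1 + 0.66 * delta_voltage_pct / 100   # 1% voltage -> 0.66% performance
    return freq, power, perf


CORES = 2
freq, power, perf = apply_rule_of_thumb(-15)

# Assumption: power and performance simply add up across identical cores.
total_power = CORES * power
total_perf = CORES * perf

print(f"per core : freq = {freq:.2f}, power = {power:.2f}, perf = {perf:.3f}")
print(f"dual core: power = {total_power:.2f}, perf = {total_perf:.2f}")
```

Under these assumptions the two slower cores draw roughly the power of the original core while delivering close to the ~1.8x performance quoted on the slide.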
Multicore
[Figure: Bar charts of power and performance (scale 1 to 4) for a large core with cache, a small core, and four small cores C1 to C4 sharing a cache. Small core relative to large core: Power = 1/4, Performance = 1/2.]
Multi-Core:
Power efficient
Better power and thermal management
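A back-of-the-envelope sketch of the comparison above: one large core versus four small cores, each small core using 1/4 of the power and giving 1/2 of the performance. It assumes a perfectly parallel workload and ignores the shared cache and interconnect, so it is an upper bound rather than a prediction.

```python
LARGE = {"power": 1.0, "perf": 1.0}
SMALL = {"power": 0.25, "perf": 0.5}  # relative to the large core


def cluster(core, count):
    """Aggregate power and ideal throughput of `count` identical cores."""
    return {"power": count * core["power"], "perf": count * core["perf"]}


def perf_per_watt(config):
    return config["perf"] / config["power"]


four_small = cluster(SMALL, 4)
print("large core :", LARGE, "perf/power =", perf_per_watt(LARGE))
print("4 small    :", four_small, "perf/power =", perf_per_watt(four_small))
```

In this idealised model the four small cores fit in the same power envelope as the large core and offer up to twice the throughput, which is the power-efficiency argument for multi-core.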
Era of Parallelism
Thermal constraints are the main limiting factor in future designs (not size)
Move away from frequency alone to deliver performance
Scaling challenges: thread-level parallelism must be exploited to make efficient use of the transistors made available by Moore’s law.
Power/performance tradeoffs dictate architectural choices
Multi-everywhere
Multi-threading
Chip level multi-processing
Throughput oriented designs
Summary: Why Multi-Cores?
Processors are designed to address the needs of the mass market.
• Mobile applications → low power and good power management are top priorities to enable thinner systems and longer battery life
• Office, image, video → single-threaded performance matters, plus some level of multithreaded performance → Multi-core
• RMS (Recognition, Mining, Synthesis) applications and model-based computing → massively parallel apps, good scaling on a large number of cores → Many-core
Because of the large markets in each of the classes above, they are the focus of silicon manufacturers and are driving innovation in the semiconductor market
Application-driven Architecture Design
RMS Scaling on a Many-Core Simulator
[Figure: Two plots of speed-up (0 to 64) versus number of cores (0 to 64). First plot: Gauss-Seidel, Sparse Matrix (Avg.), Dense Matrix (Avg.), Kmeans, SVD, SVM_classification, Cloth-AVDF, Cloth-AVIF, Cloth-US, Face-AFD, SepiaTone (Core Image). Second plot: Text indexing, CFD, Ray Tracing, FB_Estimation, Body Tracker, Portfolio management, Play physics.]
Data from Intel Application Research Lab
• Low-power architecture and SoCs
  • ARM based
  • LPIA/Atom based
• Multi-core
  • Core microarchitecture
  • PowerPC
• Many-core
  • GP GPU
  • Larrabee
3 Classes of Applications → 3 Types of Processors
Examples of ARM-based low power architectures and SoCs:
TI OMAP, Nvidia Tegra, Apple A4/A5, Samsung Exynos
Low-power Architecture and SoCs
Towards PC on a chip
Same Intel Core (e.g. Bay Trail) for Tablets/Smartphones, Consumer Electronic Devices and Embedded Market
• Multi-core
• IBM Power4
• IBM Cell
• Intel Ivy Bridge
Multicore
• Tick-Tock model
• Modular design to decrease cost (design, test, validation)
• Integrate graphics on chip
Intel Roadmap for Intel Core Microarchitecture
• Binning for leakage distribution and performance
P = α·C·V²·f + leakage
• Turbo mode to optimize performance under a given power envelope
• Policy to balance thermal budget between general-purpose cores, and between GPP cores and graphics
• Next: maximize performance under a given thermal envelope at the platform level
Power/Performance Tradeoffs
GP GPU: NVidia GeForce with more than 2000 processing elements (PEs)
• GPUs do not need a lot of cache because the large number of threads hides memory latency. The chip is designed to tolerate DRAM latency by running a huge number of threads. Local memories are still present to limit the bandwidth demand on GDDR
• CPUs need large multi-level caches because the data needs to be close to the execution units
• The fast-growing video game industry exerts strong economic pressure that forces constant innovation
CPUs vs. GPUs
For a given application, processor architectures should be chosen depending on the performance/power efficiency
• MIPS/Watt or Gflops/Watt
• Energy efficiency (Energy Delay Product)
This is highly dependent on the application and targeted power envelope. Examples:
• ARM and Atom are optimized for mainstream office and media apps for a power envelope between 1W and <10W
• Core microarchitecture is optimized for high-end office and media apps for a power envelope between 15W and ~75W
• GPUs are optimized for graphics applications and some selected scientific applications between 10W and more than 400W
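The metrics named above can be computed directly. The following sketch compares two hypothetical processors on MIPS/Watt, energy and energy-delay product (EDP, energy x execution time); the instruction count, power figures and run times are made-up values, not measurements of any real chip.

```python
def mips_per_watt(instructions, seconds, watts):
    """Millions of instructions per second, per watt."""
    return (instructions / seconds / 1e6) / watts


def energy_joules(watts, seconds):
    return watts * seconds


def energy_delay_product(watts, seconds):
    """EDP = energy * delay, in joule-seconds (lower is better)."""
    return energy_joules(watts, seconds) * seconds


INSTRUCTIONS = 4e9  # hypothetical task size

# Hypothetical processors: (name, average power in W, time to finish the task in s)
CANDIDATES = [("low-power core", 2.0, 4.0), ("high-end core", 40.0, 0.5)]

for name, watts, seconds in CANDIDATES:
    print(
        f"{name}: {mips_per_watt(INSTRUCTIONS, seconds, watts):.0f} MIPS/W, "
        f"{energy_joules(watts, seconds):.1f} J, "
        f"EDP = {energy_delay_product(watts, seconds):.1f} J.s"
    )
```

With these assumed numbers the low-power core wins on MIPS/Watt and energy, while the high-end core wins on EDP because EDP also rewards finishing quickly. Which metric matters depends on the application and on the power envelope.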
Performance/Power for different architectures
Processor will integrate
- Big core for single thread perf
- Small core for multithreaded perf
- some dedicated hardware units for
- graphics
- media
- encryption
- networking function
- other function specific logic
Systems will be heterogeneous
Processor core will be connected to
- one or multiple many-core cards
- and dedicated function hw in the chipset
+ reconfigurable logic in the system or on chip?
Future: PC on a Chip
[Figure: Block diagram showing two IA big cores, 4x4 arrays of IA cores, Gfx/Media blocks, memory channels, PCI-Ex links, a GCH, and a high-end add-in card.]
Part 3: App Specific Proc: DSPs, FPGAs, Accelerators, SoCs
What are application specific processors?
Processors or System-on-chip targeting a specific (class of) application(s)
Very common for
Audio: MP3, AAC coding and decoding in audio players
Image: JPEG or JPEG2000 coding and decoding, e.g. Digital cameras
Video: MPEG, H264 coding and decoding, e.g. DVD players or set-top-boxes
Encryption: RSA, AES
Communication: GSM, 3G in cellphones
Why?
Large markets can justify the development of application specific processors
Dedicated circuits provide higher performance with lower power dissipation, better battery life and very often lower cost.
Application Specific Processors
Application Specific Signal Processor Spectrum
DSPs
Dedicated ASICs
FPGAs
Accelerators as coprocessors
ISA extensions
SoCs
Different Types of ASPs
Summary of Architectural Features of DSPs
Data path configured for DSP
Fixed-point arithmetic
MAC: multiply-accumulate
Multiple memory banks and buses
Harvard architecture: separate data and instruction memory
Multiple data memories
Specialized addressing modes
Bit-reversed addressing
Circular buffers
Specialized instruction set and execution control
Zero-overhead loops
Support for MAC
Specialized peripherals for DSP
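Several of the features listed above can be illustrated in software. The sketch below emulates, in plain Python, a Q15 fixed-point FIR filter whose delay line is a circular buffer and whose inner loop is a multiply-accumulate, plus the index mapping used by bit-reversed addressing. On a real DSP these operations are supported in hardware; the class and function names here are hypothetical and the Q15 format is just one common fixed-point choice.

```python
def q15(x):
    """Convert a float in [-1, 1) to a saturated Q15 integer."""
    return max(-32768, min(32767, int(round(x * 32768))))


class CircularFir:
    """FIR filter with a circular delay line and a MAC inner loop."""

    def __init__(self, coeffs_q15):
        self.h = list(coeffs_q15)
        self.buf = [0] * len(self.h)
        self.head = 0  # index of the most recent sample

    def step(self, sample_q15):
        n = len(self.h)
        self.head = (self.head - 1) % n  # circular (modulo) addressing
        self.buf[self.head] = sample_q15
        acc = 0  # wide accumulator, as in a DSP MAC unit
        for k in range(n):
            acc += self.h[k] * self.buf[(self.head + k) % n]  # multiply-accumulate
        # Q15 x Q15 gives Q30: shift back to Q15 and saturate.
        return max(-32768, min(32767, acc >> 15))


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (FFT input/output ordering)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


fir = CircularFir([q15(0.5), q15(0.5)])  # two-tap moving average
outputs = [fir.step(q15(0.5)) for _ in range(3)]
order = [bit_reverse(i, 3) for i in range(8)]
print("filter outputs (Q15):", outputs)
print("bit-reversed order for an 8-point FFT:", order)
```

The circular buffer avoids shifting the whole delay line on every sample, which is what modulo addressing hardware gives a DSP for free; the wide accumulator keeps intermediate Q30 products from overflowing before the final shift.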
DSP Example: 320C62x/67x DSP
Many dedicated ASICs exist on the market, especially for media and communication applications. Examples:
MP3 player
DVD player
Video processing engines, e.g. De-interlacing, super-resolution
Video Encoder/Decoder
GSM/3G
TCP/IP Offload engine
Advantages:
Low power, high perf/power efficiency
Small area compared to same functionality in DSP or GPP
Drawbacks
ASIC design cost is high and is only justified by large volumes
Not flexible: cannot handle different applications, cannot evolve to follow standard evolution
Dedicated ASICs
Reconfigurable architectures: FPGAs contain gates that can be programmed for a specific application
• Each logic element outputs one data bit
• Interconnect programmable between elements
FPGAs can be reconfigured to target a different function by loading another configuration
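A toy model can make the idea of a programmable logic element tangible. In the sketch below, a logic element is reduced to a look-up table (LUT) whose configuration bits are its truth table, and "loading another configuration" changes the function without changing the hardware. This is a deliberate simplification: real logic elements also contain registers and carry logic, and the class and constant names are hypothetical.

```python
class Lut:
    """k-input look-up table producing one output bit."""

    def __init__(self, num_inputs, config_bits):
        assert len(config_bits) == 2 ** num_inputs
        self.num_inputs = num_inputs
        self.config = list(config_bits)

    def load(self, config_bits):
        """Reconfigure: same element, new logic function."""
        assert len(config_bits) == len(self.config)
        self.config = list(config_bits)

    def out(self, *inputs):
        assert len(inputs) == self.num_inputs
        index = 0
        for bit in inputs:  # the first input is the most significant index bit
            index = (index << 1) | (bit & 1)
        return self.config[index]


# Truth tables for inputs (a, b) = 00, 01, 10, 11.
AND2 = [0, 0, 0, 1]
XOR2 = [0, 1, 1, 0]
OR2 = [0, 1, 1, 1]


def half_adder(a, b):
    """Two logic elements wired to the same inputs (the 'interconnect')."""
    sum_lut = Lut(2, XOR2)
    carry_lut = Lut(2, AND2)
    return sum_lut.out(a, b), carry_lut.out(a, b)


element = Lut(2, AND2)
before = [element.out(a, b) for a in (0, 1) for b in (0, 1)]
element.load(OR2)  # load another configuration
after = [element.out(a, b) for a in (0, 1) for b in (0, 1)]
print("as AND:", before, "as OR:", after, "half_adder(1, 1):", half_adder(1, 1))
```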
Specifications
Input: RTL coding, structural or behavioral description
RTL Simulation
Functional simulation: check logic and data flow (no timing analysis)
Synthesis
Translate into specific hardware primitives
Optimization to meet area and performance constraints
Place and Route
Map hardware primitives to specific places on the chip based on area and performance for the given technology
Specify routing
Timing Analysis
Verification that timing specifications are met
Test and verification of the component on the FPGA board
FPGAs Design Flow
Vivado HLS Design flow: from C to VHDL
Vivado HLS Development Flow
C/C++ programming
C Simulation (algorithm validation)
Optimization directives insertion: Loops (Pipeline, Unroll, Merge), Array Partitioning, Interfaces, Dataflow
Synthesis
Results OK? (Perf / Resources): if No, return to directives insertion; if Yes, continue
Cosimulation (RTL design validation)
IP generation
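To give a feel for what the Pipeline and Unroll loop directives trade off, here is a first-order cycle-count model. It is a simplified estimate under stated assumptions (no data dependencies between iterations, enough hardware resources, fixed latency per iteration), not a reproduction of what the tool reports; the function names and the example numbers are hypothetical.

```python
def sequential_cycles(trip_count, iter_latency):
    """No directive: iterations run one after another."""
    return trip_count * iter_latency


def pipelined_cycles(trip_count, iter_latency, ii=1):
    """Pipelined loop: a new iteration starts every `ii` cycles (initiation interval)."""
    return iter_latency + ii * (trip_count - 1)


def unrolled_cycles(trip_count, iter_latency, factor):
    """Unrolled loop: `factor` iterations execute side by side."""
    groups = -(-trip_count // factor)  # ceiling division
    return groups * iter_latency


N, LATENCY = 100, 4  # assumed loop: 100 iterations, 4 cycles each
print("sequential :", sequential_cycles(N, LATENCY))
print("pipelined  :", pipelined_cycles(N, LATENCY, ii=1))
print("unrolled x4:", unrolled_cycles(N, LATENCY, 4))
```

In this model pipelining and unrolling reach similar cycle counts, but unrolling replicates the datapath (more resources) whereas pipelining reuses it, which is why the flow loops back on the performance versus resources check.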
Vivado HLS User Interface
Project explorer Directives insertion Code editor
Synthesis log
Vivado HLS Tooling
Synthesis report
Latencies
Loop pipelining (latencies/throughput)
Resources
Instructions analysis view
Clock cycle accurate representation
Verification of actual parallelisation of instructions (e.g. pipelining)
Localization of data dependencies
Xilinx Design methodology
• Design methodology options
A combination is possible!
Current generations of FPGAs add a GPP on the chip
Hardwired PowerPC (Xilinx)
NIOS Softcore (Altera)
MicroBlaze Softcore (Xilinx)
SoC with ARM on Xilinx Zynq
FPGAs with On-chip GPP
DSP blocks in reconfigurable architectures
Stratix DSP blocks consist of hardware multipliers, adders, subtractors, accumulators, and pipeline registers
Some FPGAs add DSP blocks to increase performance of DSP algorithms
Example: Stratix DSP blocks
Reconf matrix of DSP blocks as media coproc.
[Figure: A general-purpose processor (instruction unit, instruction cache, execution unit, data cache, memory) coupled to a coprocessor. The coprocessor is a reconfigurable matrix of 8x3 processing elements with PLA/ROM control and two groups of embedded memories (read/write address, read data, write data, chip select). Each row of processing elements combines a 32b multiplier, a 32b adder/subtractor and a shift register, fed by operand memories (op1 to op6) and writing to a result memory.]
It is possible to build complex systems based on recent FPGA architectures
by taking advantage of the regular structure of the DSP blocks in the FPGA matrix
Dedicated circuits that accelerate a specific part of the processing
Typically connected to a general-purpose processor or a DSP
Granularity can vary
Accelerator for a DCT function
Accelerator for a whole JPEG encoder
Accelerators are very common in systems-on-chip
They are typically called through an API function call from the main CPU
Accelerators as Coprocessors
Extending the ISA of a general purpose processor with SIMD instructions and specific instructions targeting media and communication applications is very common
It adds application specific features to a processor and turns a general purpose processor into a signal/image/video processor.
Examples:
Intel MMX, SSE
PowerPC AltiVec
SUN VIS
Xscale WMMX
ARM Neon, Thumb-2, Trustzone, Jazelle, etc.
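To show what a SIMD media instruction does, the sketch below emulates in plain Python a packed unsigned saturating add on eight 8-bit lanes held in one 64-bit word, similar in spirit to the packed saturating byte add that MMX provides. It is an emulation for illustration only: real SIMD hardware performs all lanes in a single instruction, and the helper names are hypothetical.

```python
LANES = 8  # eight 8-bit values per 64-bit word


def pack8(values):
    """Pack eight unsigned bytes into one 64-bit word (lane 0 in the low byte)."""
    assert len(values) == LANES
    word = 0
    for lane, value in enumerate(values):
        word |= (value & 0xFF) << (8 * lane)
    return word


def unpack8(word):
    """Split a 64-bit word back into its eight unsigned bytes."""
    return [(word >> (8 * lane)) & 0xFF for lane in range(LANES)]


def add_saturate_u8x8(a, b):
    """Lane-wise unsigned add; each lane clamps at 255 instead of wrapping."""
    return pack8([min(x + y, 255) for x, y in zip(unpack8(a), unpack8(b))])


pixels = pack8([250, 100, 0, 255, 1, 2, 3, 4])
offset = pack8([10, 100, 0, 1, 1, 2, 3, 4])
brighter = unpack8(add_saturate_u8x8(pixels, offset))
print(brighter)
```

Saturation matters for image and audio data: a pixel of 250 brightened by 10 should clamp to 255 rather than wrap around to 4, and doing this for many pixels per instruction is what makes such extensions effective for media workloads.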
ISA extension in General-Purpose Processors
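To give a feel for what SIMD media instructions do, here is a small emulation of a packed unsigned saturating byte add, the kind of operation that MMX-class extensions perform in a single instruction. This is a Python model for illustration only; the function name `packed_add_saturate_u8` is made up, and a real implementation would use compiler intrinsics or assembly.

```python
def packed_add_saturate_u8(a: int, b: int, lanes: int = 8) -> int:
    """Add two packed words lane by lane, clamping each 8-bit lane at 255."""
    result = 0
    for lane in range(lanes):
        shift = 8 * lane
        x = (a >> shift) & 0xFF
        y = (b >> shift) & 0xFF
        # Saturation: bright pixels stay at 255 instead of wrapping around to dark.
        result |= min(x + y, 0xFF) << shift
    return result


# Eight 8-bit pixel values packed into one 64-bit word, brightened by 0x20 each.
pixels = 0x10F0807F00FF2001
offset = 0x2020202020202020
brightened = packed_add_saturate_u8(pixels, offset)
print(hex(brightened))  # 0x30ffa09f20ff4021
```

The loop over lanes is what the hardware does in parallel: one instruction processes all eight pixels, which is why such extensions help image and video workloads.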
![Page 55: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/55.jpg)
Conflicting requirements
Spectrum: ASICs, media processors/DSPs, GPPs
Better power efficiency, runs at lower frequency
Flexibility, re-programmability (vs. redesign cost)
Better programming tools, shorter time to market (TTM) for new applications
Smaller chip size, lower leakage
![Page 56: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/56.jpg)
The Energy-Flexibility Gap
[Figure: energy efficiency in MOPS/mW (or MIPS/mW), on a log scale from 0.1 to 1000, plotted against flexibility (coverage) for embedded processors, media processors, DSPs, reconfigurable processors/logic, and dedicated hardware.]
![Page 57: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/57.jpg)
System-on-Chip
SoCs integrate the optimal mix of processors and dedicated hardware units for the different applications targeted by the system.
Typically integrate a general-purpose processor, e.g. ARM
Can integrate a DSP
Accelerators for specific functions
Dedicated memories
Integration boosts performance, cuts cost and reduces power consumption compared with a similar mix of processors on a board
![Page 58: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/58.jpg)
Digital Camera Hardware Diagram
[Figure: block diagram of a digital camera. Labelled blocks: mechanical shutter, CMOS imager, A/D, image processing ASIC, two 256Kx16 DRAMs, MCU, memory card I/F, LCD, LCD control ASIC, 32Kx8 SRAM, 68-pin connector, PCMCIA ASIC, serial EEPROM, power control, 3.3V CR-123 lithium cell, expose, user interface keys, activity LED, door interlock, memory card.]
ASIC Integration Opportunity
![Page 59: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/59.jpg)
MPSoC: A Platform Story
What’s a platform?
“A coordinated family of architectures that satisfy a set of architectural constraints imposed to support reuse of hardware and software components”
Best of all worlds:
Provides some level of flexibility
While being power efficient
And enabling some level of reusability
Can last multiple product generations
Requires forward-looking, platform-based design to integrate potential future application requirements in today’s platform
Programming model and design efficiency are key!
![Page 60: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/60.jpg)
Nvidia Tegra
![Page 61: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/61.jpg)
TI OMAP
![Page 62: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/62.jpg)
Freescale i.MX6
![Page 63: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/63.jpg)
Intel Silvermont/Bay Trail
![Page 64: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/64.jpg)
Tegra K1
![Page 65: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/65.jpg)
Tegra K1
![Page 66: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/66.jpg)
Tegra K1
![Page 67: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/67.jpg)
Embedded Processor Architecture Trends
• Where do we come from?
DSP, FPGA, CPU, GPUs
![Page 68: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/68.jpg)
Embedded Processor Architecture Trends
• Where do we come from?
DSP, FPGA, CPU, GPUs
• What is available today?
Mix of multicore/manycore and hardware accelerator on the same chip (e.g. Tegra K1)
Or mix of multicore and FPGA on the same chip (Zynq)
[Figure: two block diagrams, each labelled "SoC – i.MX6". Multicore + Manycore: I/O, multicore CPU, GPU/manycore, image/video accelerator. Multicore + FPGA: I/O, multicore CPU, FPGA, image/video accelerator.]
![Page 69: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/69.jpg)
Embedded Processor Architecture Trends
• Where do we come from?
DSP, FPGA, CPU, GPUs
• What is available today?
Mix of multicore/manycore and hardware accelerator on the same chip (e.g. Tegra K1)
Or mix of multicore and FPGA on the same chip (Zynq)
• Where are we going?
Mix of multicore, manycore, FPGA and hardware accelerators on the same chip
Designed for real-time sensor processing
[Figure: single chip combining I/O, multicore CPU, GPU/manycore, image/video accelerator and FPGA. Caption: Higher Integration/Lower Power]
![Page 70: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/70.jpg)
Very Fast Moving Industry
• Performance and power evolve at a very fast pace!
CPUs and GPUs driven by the PC market
System-on-Chip driven by the cellphone/tablet market
~50x perf/Watt improvement over the last 5 years

| | GTX 285 (2009) | Tegra K1 (2014) |
|---|---|---|
| Type | PCIe card | System-on-chip |
| CPU | Intel (PC board) | ARM (integrated) |
| Interconnect | PCIe | On-chip |
| # Cores | 240 | 192 |
| Power | 200 W | 2 W |
| Total power | 600 W | 5 W |
| Performance | 1000 Gflops | 365 Gflops |
| Perf/Watt | 5 Gflops/W | 180 Gflops/W |

50x perf/Watt improvement in 5 years: fanless design possible today!
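The perf/Watt figures can be re-derived from the power and performance rows of the comparison. The short check below uses only the numbers quoted on the slide; the dictionary layout and function name are illustrative. Dividing by chip power gives about 5 and 180 Gflops/W, a ratio of roughly 36x, and dividing by total system power gives roughly 44x, so the 50x headline is probably best read as a rounded order-of-magnitude figure.

```python
# Figures copied from the GTX 285 vs. Tegra K1 comparison.
systems = {
    "GTX 285 (2009)": {"gflops": 1000.0, "chip_w": 200.0, "total_w": 600.0},
    "Tegra K1 (2014)": {"gflops": 365.0, "chip_w": 2.0, "total_w": 5.0},
}


def perf_per_watt(gflops: float, watts: float) -> float:
    """Energy efficiency in Gflops per Watt."""
    return gflops / watts


chip = {name: perf_per_watt(s["gflops"], s["chip_w"]) for name, s in systems.items()}
total = {name: perf_per_watt(s["gflops"], s["total_w"]) for name, s in systems.items()}

# Improvement factor of the 2014 SoC over the 2009 card.
chip_gain = chip["Tegra K1 (2014)"] / chip["GTX 285 (2009)"]
total_gain = total["Tegra K1 (2014)"] / total["GTX 285 (2009)"]

print(f"chip-level gain:   {chip_gain:.1f}x")   # 36.5x
print(f"system-level gain: {total_gain:.1f}x")  # 43.8x
```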
![Page 71: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/71.jpg)
Efficiently Use Heterogeneous SoCs
• Use the right core for the right task:
GPU: massively parallel tasks
CPU: control-oriented tasks
FPGA: compute-intensive tasks with hard real-time constraints
• Competitive programming approaches:
Parallel programming languages and tools for CPU and GPU
High-Level Synthesis (from C to VHDL) for FPGAs
Development, profiling and debugging tools are evolving as fast as the hardware
Video pipeline example:

| Configuration | Power (Watt) | Performance (Fps) | Perf/Watt |
|---|---|---|---|
| Single CPU | 2 | 6 | 3.0 |
| 4 x CPU | 4 | 15 | 3.8 |
| 4 x CPU + GPU | 2 | 20 | 10 |

Improved power efficiency, time to market and portability within product line
![Page 72: Computing Architecture for Signal Processorlacas/Teaching/archi/Archi_M2R_Orsay_part1.pdf · latency of single task, it ... Instruction per Clock Cycle ... Mips/W (%) 19 In many systems](https://reader033.vdocuments.net/reader033/viewer/2022051801/5ae16a417f8b9ac0428eb611/html5/thumbnails/72.jpg)
Summary
• New architectures are driven by power and thermal constraints
Transistor count continues to increase thanks to Moore’s law
Most systems are limited by thermals
• Parallelism is needed for performance and power efficiency
Instruction-level parallelism: pipeline, OOO, VLIW
Data-level parallelism: SIMD, vector, 2D SIMD matrices
Thread-level parallelism: SMP, CMP, SMT/HT
System-level parallelism: I/Os, memory hierarchy
• Key issues with parallelism
Amdahl’s law
Extracting parallelism from applications
System issues: the rest of the system needs to be well balanced
Programming models need to be portable, easy to learn and efficient
• Application-specific signal processors and SoCs
Spectrum: ASICs, FPGA, media processors, DSP, GPP + ISA extensions
Depending on power/performance constraints, often a mix (SoC)
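Amdahl's law, listed among the key issues with parallelism, bounds the speedup available from adding cores: if a fraction p of the work can be parallelized over n cores, the speedup is 1 / ((1 - p) + p / n). The sketch below evaluates it for p = 0.9 and a few core counts; these values are illustrative assumptions, not measurements from the course.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction and core count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed workload: 90% of the run time parallelizes perfectly.
for cores in (1, 4, 16, 192):
    print(f"{cores:4d} cores -> {amdahl_speedup(0.9, cores):.2f}x")
```

With 10% serial work the speedup can never reach 10x however many cores are added, which is why extracting parallelism from the application matters as much as the core count.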