Hardware platforms for Embedded computing
The energy/flexibility conflict: intrinsic power efficiency
[Figure: intrinsic power efficiency in operations per watt (MOPS/mW, log scale from 0.01 to 10) versus technology node (1.0µ, 0.5µ, 0.25µ, 0.13µ, 0.07µ). Labels in the plot: hardwired muxed ASIC, reconfigurable computing, DSP-ASIPs, processors (µPs), poor design techniques, ambient intelligence.]
Source: [H. de Man, Keynote, DATE'02; T. Claasen, ISSCC99]
Necessary to optimize HW/SW; otherwise the price for software flexibility cannot be paid!
Architectural Choices
[Figure: architectural choices plotted as flexibility versus 1/efficiency (power, speed). From most flexible to most efficient: general-purpose µP (software), programmable DSP, hardware-reconfigurable processor, and direct-mapped hardware (dedicated logic with satellite processors). Block labels: µP, Prog Mem, MAC Unit, Addr Gen.]
The Processor Design Space
[Figure: the processor design space, performance versus cost.]
Microprocessors: performance is everything and software rules.
Embedded processors: application-specific architectures for performance.
Microcontrollers: cost is everything.
Area of processor cores = cost (examples: Nintendo processor, cellular phones).
Another figure of merit: computation per unit area.
Embedded vs. general-purpose processors
Embedded processors may be optimized for a category of applications. Customization may be narrow or broad.
We may judge embedded processors using different metrics: code size, memory system performance, predictability.
Microcontrollers
[Figure: a microcontroller is a single chip combining CPU, memory (ROM, RAM), and I/O.]
Subsystems: timers, counters, analog interfaces, I/O interfaces.
Microcontroller Architectures
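Von Neumann architecture: the CPU reaches one memory holding program and data over a single address bus and data bus.
Harvard architecture: the CPU reaches a program memory (address bus, fetch bus) and a separate data memory (address bus, data bus).
[Figure: memory maps for both architectures, running from address 0 to 2^n.]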
MCS-51 “Family” of Microcontrollers
• 8051 introduced by Intel in the late 1970s
• Now produced by many companies in many variations
• The most popular microcontroller: about 40% market share
• 8-bit microcontroller
“Original” 8051 Microcontroller
[Figure: block diagram of the original 8051. Blocks connected by the internal data bus:]
• 8051 CPU
• Oscillator and timing
• 4096 bytes program memory
• 128 bytes data memory
• Two 16-bit timer/event counters
• 64 KByte bus expansion control
• Programmable I/O
• Programmable serial port: full-duplex UART, synchronous shifter
External connections: external interrupts, subsystem interrupts, control, parallel ports, address/data bus, I/O pins, serial input, serial output.
Microcontrollers: MHS 80C51 as an example
• 8-bit CPU optimised for control applications
• Extensive Boolean processing capabilities
• 64 k Program Memory address space
• 64 k Data Memory address space
• 4 k bytes of on-chip Program Memory
• 128 bytes of on-chip data RAM
• 32 bi-directional and individually addressable I/O lines
• Two 16-bit timers/counters
• Full duplex UART
• 6 sources/5-vector interrupt structure with 2 priority levels
• On-chip clock oscillators
• Very popular CPU with many different variations
Features for Embedded Systems
RISC processors
RISC generally means highly pipelinable, one instruction per cycle.
Pipelines of embedded RISC processors have grown over time:
• ARM7 has a 3-stage pipeline.
• ARM9 has a 5-stage pipeline.
• ARM11 has an 8-stage pipeline.
[Figure: ARM11 pipeline [ARM05].]
RISC processor families
• ARM: ARM7 is relatively simple, with no memory management; ARM11 has memory management and other features.
• MIPS: MIPS32 4K has a 5-stage pipeline; the 4KE family has a DSP extension; 4KS is designed for security.
• PowerPC: the 400 series includes several embedded processors; MPC7410 is a two-issue machine; 970FX has a 16-stage pipeline.
DSP Applications
• Audio applications: MPEG audio, portable audio
• Digital cameras
• Wireless: cellular telephones, base stations
• Networking: cable modems, ADSL, VDSL
Another Look at DSP Applications
• High end: wireless base station (TMS320C6000), cable modem gateways
• Mid end: cellular phone (TMS320C540), fax/voice server
• Low end: storage products (TMS320C27), digital camera (TMS320C5000), portable phones, wireless headsets, consumer audio, automobiles, toasters, thermostats, ...
[Figure: cost increases toward the high end; volume increases toward the low end.]
DSP vs. General Purpose MPU
• The “MIPS/MFLOPS” of DSPs is the speed of multiply-accumulate (MAC). DSPs are judged by whether they can keep the multipliers busy 100% of the time.
• The “SPEC” of DSPs is 4 algorithms: infinite impulse response (IIR) filters, finite impulse response (FIR) filters, FFT, and convolvers.
• In DSPs, algorithms are king! Binary compatibility is not an issue.
• Software is not (yet) king in DSPs. People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.
Architectural Features of DSPs
• Data path configured for DSP: fixed-point arithmetic, MAC (multiply-accumulate)
• Multiple memory banks and buses: Harvard architecture, multiple data memories
• Specialized addressing modes: bit-reversed addressing, circular buffers
• Specialized instruction set and execution control: zero-overhead loops, support for MAC
• Specialized peripherals for DSP
THE ULTIMATE IN BENCHMARK-DRIVEN ARCHITECTURE DESIGN!!!
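Bit-reversed addressing, listed above among the specialized addressing modes, can be illustrated in software. The sketch below is not taken from any DSP manual; the function names `bit_reverse` and `bit_reversed_order` are hypothetical, and it only shows the address sequence such a mode would produce for a radix-2 FFT of n = 2^k samples.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the next low bit
        index >>= 1
    return result


def bit_reversed_order(n: int) -> list[int]:
    """Address sequence a bit-reversed addressing mode would generate for n samples."""
    bits = n.bit_length() - 1
    if n <= 0 or (1 << bits) != n:
        raise ValueError("n must be a power of two")
    return [bit_reverse(i, bits) for i in range(n)]


print(bit_reversed_order(8))
```

A DSP produces this sequence in its address generation unit at no cycle cost, whereas a general-purpose CPU would typically spend instructions on it.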
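Domain-oriented architectures
Application: $y[j] = \sum_{i=0}^{n-1} x[j-i] \cdot a[i]$
$\forall i,\ 0 \le i \le n-1:\quad y_i[j] = y_{i-1}[j] + x[j-i] \cdot a[i]$
Architecture example: data path of the ADSP210x (parallelism, dedicated registers)
[Figure: ADSP210x data path (DP). Multiplier with registers MX, MY, MF and result register MR (operations *, +, -); ALU with registers AX, AY, AF and result register AR (operations +, -, ..). x[j-i] and a[i] feed the multiplier; the product x[j-i]*a[i] is added to y_{i-1}[j]. The address generation unit (AGU) holds address registers A0, A1, A2, .. and supplies the addresses for a and x (i+1, j-i+1).]
MR:=0; A1:=1; A2:=n-2; MX:=x[n-1]; MY:=a[0];
for ( j:=1 to n)
{MR:=MR+MX*MY; MY:=a[A1]; MX:=x[A2]; A1++; A2--}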
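As a rough software analogue of the MAC loop above, the following sketch computes one FIR output sample with one multiply-accumulate per iteration. It is an illustration only: `fir_output` is a hypothetical name, and the variables `mr`, `mx`, `my` merely mirror the register names MR, MX, MY from the slide.

```python
def fir_output(x, a, j):
    """Compute y[j] = sum_{i=0}^{n-1} x[j - i] * a[i], one MAC per iteration."""
    n = len(a)
    if j < n - 1 or j >= len(x):
        raise IndexError("need n input samples ending at index j")
    mr = 0                # accumulator, playing the role of MR
    for i in range(n):
        mx = x[j - i]     # sample fetched through a decrementing address
        my = a[i]         # coefficient fetched through an incrementing address
        mr += mx * my     # the multiply-accumulate step
    return mr


samples = [1, 2, 3, 4]
coefficients = [1, 0, -1]
print(fir_output(samples, coefficients, 2))
```

On the DSP the two fetches, the two address updates, and the MAC all happen in the same cycle, which is why the slide's loop body fits in one instruction.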
DSP - Features (1)
• Multiply/accumulate (MAC) and zero-overhead loop (ZOL) instructions (as shown)
• Heterogeneous registers (as shown)
• Separate address generation units (AGUs) (as in ADSP 210x)
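DSP - Features (2)
• Modulo addressing: Am++ ≡ Am:=(Am+1) mod n (implements ring or circular buffer in memory)
[Figure: sliding window over the signal x at times t1 and t2. Memory at t=t1 holds .. x[n-2], x[n-1], x[0], x[1] ..; at t2=t1+1 the oldest sample x[0] has been overwritten by x[n]: .. x[n-3], x[n-2], x[n-1], x[n], x[1].]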
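The modulo update Am := (Am+1) mod n can be mimicked in software as a ring buffer. This is a minimal sketch under that reading; the class `CircularBuffer` and its method names are hypothetical and not part of any DSP toolchain.

```python
class CircularBuffer:
    """Fixed-size ring buffer; `pos` plays the role of address register Am."""

    def __init__(self, n):
        self.data = [0] * n
        self.pos = 0

    def push(self, sample):
        # Overwrite the oldest sample, then advance with wrap-around:
        # Am := (Am + 1) mod n
        self.data[self.pos] = sample
        self.pos = (self.pos + 1) % len(self.data)

    def window(self):
        """Return the stored samples from oldest to newest."""
        return self.data[self.pos:] + self.data[:self.pos]


buf = CircularBuffer(3)
for sample in [1, 2, 3, 4]:
    buf.push(sample)
print(buf.window())
```

Hardware modulo addressing performs the wrap-around inside the AGU, so the sliding window never has to be copied.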
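• Multiple memory banks or memories: simplifies parallel fetches
[Figure: the ADSP210x data path (multiplier registers MX, MY, MF, MR; ALU registers AX, AY, AF, AR) and the address generation unit (AGU) with address registers A0, A1, A2, .., connected to separate memories.]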
Very long instruction word (VLIW) processors
Key idea: detection of possible parallelism is done by the compiler, not by hardware at run time (inefficient).
VLIW: parallel operations (instructions) are encoded in one long word (instruction packet), each instruction controlling one functional unit.
The Texas Instruments TMS 320C6xx as an example
[Figure: seven 32-bit instructions (bits 31..0); bit 0 of each instruction is shown.]
Instruction   A  B  C  D  E  F  G
Bit 0         0  1  1  0  1  1  0
Cycle  Instruction
1      A
2      B C D
3      E F G
Instructions B, C and D use disjoint functional units, cross paths and other data path resources. The same is also true for E, F and G.
Bit in each instruction encodes end of parallel execution
Parallel execution cannot span several packets.
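The grouping rule can be sketched in a few lines. The sketch assumes that a bit value of 1 means "the next instruction executes in parallel with this one" and 0 ends the group, which matches the A to G example on the slide; `execute_packets` is a hypothetical name, and the fetch-packet boundary mentioned above is not modeled.

```python
def execute_packets(instructions):
    """Group (name, p_bit) pairs into lists of instructions issued in the same cycle."""
    packets, current = [], []
    for name, p_bit in instructions:
        current.append(name)
        if p_bit == 0:        # a 0 bit ends the parallel group
            packets.append(current)
            current = []
    if current:               # unterminated chain at the end of the stream
        packets.append(current)
    return packets


stream = [("A", 0), ("B", 1), ("C", 1), ("D", 0), ("E", 1), ("F", 1), ("G", 0)]
for cycle, packet in enumerate(execute_packets(stream), start=1):
    print(cycle, " ".join(packet))
```

Because the grouping is fixed in the encoding, the hardware needs no dependence checking at run time; that work was done by the compiler.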
Partitioned register files
• Many memory ports are required to supply enough operands per cycle.
• Memories with many ports are expensive.
Registers are partitioned into (typically 2) sets, e.g. for TI C60x:
[Figure: data path A with register file A and functional units L1, S1, M1, D1; data path B with register file B and functional units D2, M2, S2, L2; address bus and data bus.]
Instruction types are mapped to functional unit types
There are 4 functional unit (FU) types:
• M: Memory Unit
• I: Integer Unit
• F: Floating-Point Unit
• B: Branch Unit
Instruction types correspond to FU types, except type A (which maps to either I or M functional units).
Large # of delay slots, a problem of VLIW processors
[Figure: three instruction packets in flight: add sub and or; sub mult xor div; ld st mv beq.]
• The execution of many instructions has been started before it is realized that a branch was required.
• Nullifying those instructions would waste compute power.
• Executing those instructions is declared a feature, not a bug.
• How to fill all "delay slots" with useful instructions?
• Avoid branches wherever possible.
Predicated execution: implementing IF-statements "branch-free"
Conditional instruction "[c] I" consists of:
• condition c
• instruction I
c = true => I executed
c = false => NOP
Predicated execution: implementing IF-statements "branch-free": TI C6x
if (c)
{ a = x + y;
  b = x + z;
}
else
{ a = x - y;
  b = x - z;
}
Conditional branch (max. 12 cycles):
    [c] B L1
        NOP 5
        B L2
        NOP 4
        SUB x,y,a
     || SUB x,z,b
L1:     ADD x,y,a
     || ADD x,z,b
L2:
Predicated execution (1 cycle):
    [c]  ADD x,y,a
 || [c]  ADD x,z,b
 || [!c] SUB x,y,a
 || [!c] SUB x,z,b
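A small software model may help show why the four predicated instructions are equivalent to the IF-statement. It is only a model: the `if pred` test below stands in for the hardware turning an instruction into a NOP, and the names `predicated` and `branching` are hypothetical.

```python
def predicated(c, x, y, z):
    """Issue all four operations; only those whose predicate holds write a result."""
    ops = [
        (c,     "a", lambda: x + y),   # [c]  ADD x,y,a
        (c,     "b", lambda: x + z),   # [c]  ADD x,z,b
        (not c, "a", lambda: x - y),   # [!c] SUB x,y,a
        (not c, "b", lambda: x - z),   # [!c] SUB x,z,b
    ]
    regs = {}
    for pred, dest, op in ops:
        if pred:                       # a false predicate acts as a NOP
            regs[dest] = op()
    return regs["a"], regs["b"]


def branching(c, x, y, z):
    """Reference version with an ordinary if/else."""
    if c:
        return x + y, x + z
    return x - y, x - z


for c in (True, False):
    print(c, predicated(c, 5, 3, 1), branching(c, 5, 3, 1))
```

Since the predicates c and !c are mutually exclusive, exactly one write reaches each destination, so all four instructions can share a single execute packet.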
Architecture Evolution
Roadmap continues: 90 → 65 → 45 nm
“Traditional” bus-based SoCs fit in one tile!!
Communication demand is staggering, but unevenly distributed, because of architectural heterogeneity.
[Figure: tiled architecture with processing elements (PE), local memory hierarchy, CPU, SRAM, DRAM, I/O and peripherals, and 3D-stacked main memory.]
Multicores Are Here!
[Figure: number of cores per chip (log scale, 1 to 512) versus year (1970 to 2005 and beyond) [Amarasinghe06].]
Chips plotted: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045.
MPSoC – 2005 ITRS roadmap
[Figure: total logic size and total memory size (normalized to 2005, left axis) and number of processing engines (right axis), 2005 to 2020 [Martin06].]
Number of processing engines by year:
Year     2005 2006 2007 2008 2009 2010 2011 2012
Engines    16   23   32   46   63   79  101  133
Year     2013 2014 2015 2016 2017 2018 2019 2020
Engines   161  212  268  348  424  526  669  878
Power is the Challenge!
[Figure: power (W) and power density (W/cm²) for a 10 mm die, scale 0 to 1400, at the 90nm, 65nm, 45nm, 32nm, 22nm, and 16nm nodes, broken down into active power, SD leakage (SD Lkg), and SiO2 leakage (SiO2 Lkg).]
Technology, Circuits, and Architecture to constrain the power
Near Term Solutions
• Move away from frequency alone to deliver performance
• More on-die memory
• Multi-everywhere: multi-threading, chip-level multi-processing
• Throughput-oriented designs
• Performance by higher level of integration
Architecture Techniques
Increase on-die memory
[Figure: cache as a percentage of total die area (0% to 100%) across the 1u, 0.5u, 0.25u, 0.13u, and 65nm generations: 486, Pentium®, Pentium® III, Pentium® 4, Pentium® M.]
Multi-threading
[Figure: a single thread (ST) stalls while waiting for memory; with multi-threading, threads MT1, MT2, and MT3 overlap their memory waits for full HW utilization.]
Improved performance, no impact on thermals & power delivery
Chip Multi-processing
[Figure: four cores (C1 to C4) sharing a cache. Relative performance (1 to 3.5) versus die area and power (1 to 4) for multi-core and single-core designs.]
Multi-Core
[Figure: a large core with cache compared with four small cores (C1 to C4) with cache, with bar charts of power and performance for each.]
Small core relative to large core: power = 1/4, performance = 1/2
Multi-core: power efficient
Better power and thermal management
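The arithmetic behind the multi-core argument can be worked through with the two ratios from the slide (a small core draws 1/4 of the power and delivers 1/2 of the performance of the large core). The Amdahl's-law term for a serial fraction is an added assumption, not something the slide states, and `multicore_estimate` is a hypothetical helper.

```python
from fractions import Fraction

SMALL_CORE_POWER = Fraction(1, 4)  # relative to one large core (from the slide)
SMALL_CORE_PERF = Fraction(1, 2)   # relative to one large core (from the slide)


def multicore_estimate(cores, parallel_fraction=Fraction(1)):
    """Return (power, performance) of `cores` small cores relative to one large core.

    Assumption: throughput scales by Amdahl's law with the given parallel fraction.
    """
    power = cores * SMALL_CORE_POWER
    serial = 1 - parallel_fraction
    speedup = 1 / (serial + parallel_fraction / cores)
    return power, SMALL_CORE_PERF * speedup


print(multicore_estimate(4))                   # fully parallel workload
print(multicore_estimate(4, Fraction(1, 2)))   # half of the work is serial
```

Under these assumptions, four small cores match the large core's power budget and double its throughput only when the workload parallelizes well; with a large serial fraction the large core wins.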
Embedded vs. General Purpose
Embedded applications:
• Asymmetric multi-processing: differentiated processors
• Specific tasks known early: mapped to dedicated processors
• Configurable and extensible processors: performance, power efficiency
• Communication: coherent memory, shared local memories, HW FIFOs, other direct connections
• Dataflow programming models
• Classical example: smart mobile (RISC + DSP + media processors)
Server applications:
• Symmetric multi-processing: homogeneous cores
• General tasks known late: tasks run on any core
• High-performance, high-speed microprocessors
• Communication: large coherent memory space on multi-core die or bus
• SMT programming models (simultaneous multi-threading)
• Examples: large server chips (e.g. Sun Niagara, 8x4 threads), scientific multi-processors
MPSoC architectures
Example system platforms
Generic Automotive Wireless Multimedia
PC-based platform
Basic hardware components: CPU; memory; timers; DMA; minimal I/O devices.
Basic software: BIOS.
PC-style hardware architecture
[Figure: CPU, memory, DMA controller, timers, and a bus interface on the system bus; a bridge between the high-speed bus and the low-speed bus; I/O devices attached to the buses.]
StrongARM
StrongARM system includes:
• CPU chip (3.686 MHz clock)
• System control module (32.768 kHz clock): real-time clock; operating system timer; general-purpose I/O; interrupt controller; power manager controller; reset controller.
Pros and cons
• Plentiful hardware options.
• Simple programming semantics.
• Good software development environments.
• Performance-limited.
TI Open Wireless Multimedia Applications Platform
Dual-processor shared memory system:
[Figure: a general-purpose processor (GPP OS, DSP manager) and a DSP (DSP OS, DSP task) connected through a bridge & I/O control to a memory controller and external memory.]
http://www.ti.com/sc/docs/apps/wireless/omap/overview.htm
TI OMAP™ Hardware platform
[Figure: RISC core with I-MMU, D-MMU, I-cache, and D-cache; DSP core plus application coprocessors with MMU, I-cache, and internal RAM/ROM; DMA; memory & traffic controller connected to program memory and SDRAM; peripherals (LCD controller, interrupt handlers, timers, GPIO, UARTs, ...).]
• ARM9 core: 16KB I-cache, 8KB D-cache, 2-way set associative, 150 MHz
• C55x DSP core: 16KB I-cache, 8KB RAM set, 2-way set associative, 200 MHz
OMAPI Standard (ST/TI)
Goal: standardize the interfaces between application processor and peripheral devices in a mobile product
Provide standard services (APIs) in the OS that can be used by application developers
STMicro Nomadik platform
[Figure: main core, memory system, HW accelerators, I/Os.]
Nomadik SW platform
Compliant with OMAPI standard
Philips Digital Video Nexperia™ Platform
• Scalable VLIW media processor (TriMedia™): 100 to 300+ MHz, 32-bit or 64-bit
• General-purpose scalable RISC processor (MIPS™): 50 to 300+ MHz, 32-bit or 64-bit
• System buses: 32-128 bit
• Library of device IP blocks: image coprocessors, DSPs, UART, 1394, USB, …and more
[Figure: DVP system silicon. A TriMedia CPU (TM-xxxx) and a MIPS CPU (PRxxxx), each with I$ and D$, and device IP blocks attached to PI buses; the DVP memory bus connects through the MMI to SDRAM.]
Nexperia™-DVP Software
[Figure: software stack, top to bottom: applications; middleware (JavaTV, TVPAK, OpenTV, MHP/Java, proprietary ...); streaming and platform software next to the kernel (pSOS, Win-CE, JavaOS); Nexperia hardware.]
Nexperia™-DVP Software Architecture
• Supports multiple OSs and middleware software
• Abstracts platform functionality via consistent APIs
Nexperia™-DVP Streaming Software
• Encapsulates implementation of streaming media components (hardware and software)
Nexperia™ Platform Software
• OS-independent device drivers for on-chip and off-chip devices
Infineon Automotive Platform
TC1166
Applications: high-performance drives / servo drives, industrial control, robotics
Features:
• 32-bit super-scalar TriCore™ V1.3 CPU, 4-stage pipeline
• Fully integrated DSP capabilities
• Single-precision floating point unit (FPU)
• 80 MHz at full industrial temperature range
• 32-bit peripheral control processor with single-cycle instruction (PCP2)
• Memories: 1.5 MByte embedded program flash with ECC; 32 KByte data flash (EEPROM emulation); 56 KB SRAM, 8 KB I$, 16 KB Imem
• 8-channel DMA controller
• Interrupt system with 2 x 255 hardware priority arbitration levels serviced by CPU and PCP2 coprocessor
• Triple bus structure: 64-bit local memory buses to internal flash and data memory, 32-bit system peripheral bus, 32-bit remote peripheral bus
MOSAIC SW Architecture & Components for Automotive Dashboard and Body Control
[Figure: layered software architecture.]
Layers: HW layer (Nec78k, HC12, HC08, H8S26, MB90); SW platform layer (> 60% of total SW); application platform layer (10% of total SW); application-specific software.
Components shown: boot loader, system configuration, OSEK RTOS, OSEK COM, transport (KWP 2000, CCP), I/O drivers & handlers (> 20 configurable modules), controllers library, application programming interface, application libraries, customer libraries.
Application-specific software examples: speedometer, tachometer, water temp., odometer, ...
SW platform reuse > 70% of total SW
Architecture trends
[Figure: architecture trends, spanning from high performance for a narrow application field to high performance for a wide application field.]
• Dedicated hardware
• Programmable hardware: FPGA, reconfigurable systems
• Dynamically reconfigurable processors
• Special-purpose processors: stream processor, graphic processor, network processor
• DSP
• General-purpose CPU
Arrows labeled "special instructions" and "multiple cores" lead to: configurable processor, tile processor, homogeneous chip-multiprocessor, heterogeneous multiprocessor.
Task Specific (configurable) Processors
[Figure: design flow. Applications and SysC specs drive a processor model (ISA, DP). From the model, an HDL generator produces an HDL model for RTL synthesis to silicon; a retargetable compiler produces machine code (µcode), e.g. MACD, APAC, SACH Y,1, NEG, LAR AR3,#X, …; an instruction set simulator supports break and step. Courtesy: Target Compilers T]
Tools: RWTH Aachen LISATek (CoWare); IMEC Target Compiler T; ARM OptimoDE; Philips Silicon Hive; Tensilica; PicoChip…
Parallelism at Three Levels in Extensible Instructions
Three forms of instruction-set parallelism:
• Very Long Instruction Word (VLIW): multi-issue instruction, L operations packed in one long instruction
• Single Instruction Multiple Data (SIMD), aka “vectors”: M copies of storage and function
• Fused operations, aka “complex operations”: N dependent operations implemented as a single fused operation, with register and constant inputs
Parallelism: L x M x N. Example: 3 x 4 x 3 = 36 ops/cycle
Example: SAD (sum of absolute differences)
Original C code:
short total = 0;
char *p1, *p2;
int i, j;
for (i = 0; i < m; i++)
  for (j = 0; j < n; j++)
    total += abs(*p1++ - *p2++);
[Figure: per-pixel operation graph: two l8ui loads with addi pointer updates feed sub, abs, and add.]
Vectorize? (NO / YES)
[Figure: vectorized operation graph: two liu9x8 loads feed sub9x8, abs9x8, cvt9_16, and add16x8; the arithmetic chain is combined into one fusion operation.]
Sample software-pipelined schedule (vector + fusion + FLIX configuration), slots 0 to 2:
loop j = 1, n/8 by 2:
  liu9x8[j];   liu9x8[j];   fusion[j-2]
  liu9x8[j+1]; liu9x8[j+1]; fusion[j-1]
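To make the vectorization step concrete, the sketch below contrasts a scalar SAD with a lane-wise version that processes eight pixels per step. It is a plain-Python model, not Xtensa code; the lane count of 8 is read off operation names such as add16x8 and is an assumption, as are the function names.

```python
def sad_scalar(p1, p2):
    """One load pair, sub, abs, and add per pixel, as in the original loop."""
    total = 0
    for u, v in zip(p1, p2):
        total += abs(u - v)
    return total


def sad_vector8(p1, p2, lanes=8):
    """Process `lanes` pixels per step; each step models one fused sub/abs/add."""
    if len(p1) != len(p2) or len(p1) % lanes:
        raise ValueError("inputs must have equal length, a multiple of the lane count")
    acc = [0] * lanes                      # one partial sum per vector lane
    for base in range(0, len(p1), lanes):
        for lane in range(lanes):          # in hardware, all lanes run in parallel
            acc[lane] += abs(p1[base + lane] - p2[base + lane])
    return sum(acc)                        # final reduction across lanes


block_a = list(range(16))
block_b = [15 - v for v in block_a]
print(sad_scalar(block_a, block_b), sad_vector8(block_a, block_b))
```

Both versions return the same sum; the vector version simply needs one eighth of the loop iterations, which is the gain the slide's schedule exploits.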
Dynamically Reconfigurable Processors
• Reconfigurable systems → previous lesson
  Flexible, but dynamic reconfiguration takes tens of milliseconds.
• Dynamically reconfigurable processors
  Improve area efficiency by changing the hardware structure.
  IPs used in various SoCs.
• History
  Reconfigurable co-processors: Garp (1997), CHIMAERA (2000)
  Multicontext reconfigurable devices: WASMII (1992), Time-multiplexing FPGA (1997), PipeRench (1998), DRL (1998)
  Functional-level synthesis
• Various commercial products have been available since 2000: IPFlex DAPDNA-2, NEC Electronics DRP-1, PACT XPP, Elixent D-Fabrix
• SONY’s VME (Virtual Mobile Engine) is embedded in Network Walkman and PSP
• Recently, many Japanese vendors have started to develop commercial products: Fujitsu, Hitachi, Lucent, Sanyo, Toshiba (MeP + D-Fabrix)
What is Configurable Computing?
Spatially-programmed connection of processing elements
• “Hardware” customized to specifics of problem.
• Direct map of problem-specific dataflow, control.
• Circuits “adapted” as problem requirements change.
Spatial vs. Temporal Computing
[Figure: spatial and temporal computing compared.]
Processor vs. FPGA Area
Processing Element
• Specialized for media/stream processing
• Coarse grain ⇔ fine grain: LUT of FPGAs
• Components: ALU, shifter + mask unit, multiplexers, registers
• Operations and interconnection between components are changeable
• No instruction fetch mechanism: a part of a large datapath
Reconfigurable HW (DSP fabric)
• Targets signal-processing and arithmetic-intensive applications
• Reconfigurable array of simple DSP cores (CNode)
• Low-power architecture: hierarchical clock gating, distributed leakage control (fine-grain power gating)
• Programmable DMA engine
• Reconfigurable at run time, multi-task
Mapping Flow
• ALUs execute a cyclic micro-sequence
• Data exchanges through hierarchical clustered interconnect
• Configuration step is sequence loading and interconnect programming
[Figure: mapping flow. Behavioral code is turned into a data-flow graph (DFG) using ILP + software pipelining; partitioning/static scheduling then yields a coarse-grained configuration; each stage has data in and data out. The interconnect has clusters at level 0 and multiplexers at levels 1 and 2, with ports N0_i/N0_o, N1_i/N1_o, N2_i/N2_o.]
Behavioral code:
Procedure(In,Out,inout)
Constant A,b,c,…;
Begin
X=a-in[0];
……..
End;
Mapping Flow
• 3D optimization problem (place/route/schedule)
• Traditional scheduling techniques for VLIW or clustered VLIW do not apply: their solutions do not take the spatial dimension of the problem into account
• Traditional P&R used for FPGAs does not apply either, because it does not consider the time dimension
Putting it all together
Constant SoC die size
• Slow evolution of peripherals (area decrease)
• GP CPU sub-system complexity 2x at each node (constant area)
• Embedded memory capacity 2x at each node (constant area)
• Loosely coupled DSP sub-system complexity increases by 30% at each node (30% area decrease)
Year                          2004  2006  2008  2010  2012
Technology node (nm)            90    65    45    32    22
Loosely coupled sub-systems      2     4     6     8    12
General-purpose CPU: single → multiple
Hardware accelerator: hardwired → reconfigurable
What can fit in 45 mm² in 45 nm
[Figure: floorplan around a central interconnect: host core 1 and host core 2 with L1/L2 caches; 4MB multi-port embedded memory; peripherals & analog; six DSP sub-systems, each with L1, DSP, HW, and DMA; a programmable multimedia accelerator with imaging H/W, video H/W, and 192 CNodes (40 GOPS).]