Embedded Processor Architecture
5kk73
Embedded Processor Architecture Henk Corporaal / Bart Mesman 2
Flexibility versus efficiency, from most flexible to most efficient:
• Programmable CPU
• Programmable DSP
• Application specific instruction set processor (ASIP)
• Application specific processor
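#define NTAPS 4

/* FIR filter with taps 0..NTAPS, so every array holds NTAPS+1 elements */
int fir(int in)
{
    int i;
    static int state[NTAPS + 1];
    static int coeff[NTAPS + 1];
    int out[NTAPS + 1];

    state[NTAPS] = in;                            /* newest sample enters the delay line */
    out[0] = state[0] * coeff[0];
    for (i = 1; i < NTAPS + 1; i++) {
        out[i] = out[i-1] + state[i] * coeff[i];  /* accumulate tap i */
        state[i-1] = state[i];                    /* shift the delay line */
    }
    return out[NTAPS];
}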
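[Figure: FIR filter. The input samples x0 ... x4 pass through a chain of Z^-1 delay elements; each sample is multiplied by its coefficient c0 ... c4 and the products are summed to give the output y.]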
Application examples (1)
.L100006:
    sll    $3, $2, 2          # R3 = R2 << 2          (offset of element i-1)
    addu   $14, $15, $3       # R14 = R15 + R3
    lw     $24, 0($14)        # R24 = load(*R14)      R24 = coeff[i-1]
    addiu  $12, $6, -4        # R12 = R6 - 4
    addu   $11, $12, $3       # R11 = R12 + R3
    lw     $13, 0($11)        # R13 = load(*R11)      R13 = state[i-1]
    nop
    mult   $24, $13           # HI/LO = R24 * R13
    addu   $25, $sp, $3       # R25 = sp + R3
    lw     $9, -4($25)        # R9 = load(R25-4)      R9 = out[i-1]
    addiu  $2, $2, 1          # R2 = R2 + 1           i = i + 1
    mflo   $13                # R13 = move from low multiply register
    addu   $10, $9, $13       # R10 = R9 + R13        R10 = out[i]
    sw     $10, 0($25)        # mem(*R25) = R10
    addu   $25, $7, $3        # R25 = R7 + R3
    sw     $24, 0($25)        # mem(*R25) = R24
    slti   $24, $2, 10
    bne    $24, $0, .L100006
    addiu  $15, $7, -4
Application examples (1)
19 instructions per tap!!
temp1 = input << 1
temp2 = if (bit(input,7) == 1) then 29 else 0
out   = temp1 exor temp2

Bit level operations: finite field arithmetic

          r1 = LB input         Load byte
          r2 = SLL r1           Shift left logical
          r3 = ANDI r1, mask    AND immediate
          r4 = ADDI r3, -1      ADD immediate
          BNE (r4 != r0)        Branch on != to nonzero
          nop
          r5 = XORI(r1, 29)     Exclusive or immediate
          J common              Jump
          nop
nonzero:  r5 = XOR(r1, r0)      Exclusive OR
common:   …
[Figure: hardware implementation. The input bits in[0] ... in[7] are wired to the output bits out[0] ... out[7] through three exor gates.]
Application examples (2)
10 instructions!!
Very simple in hardware
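The finite field step above can be written as a short C sketch. The function name gf_mul2 is hypothetical; the constant 29 and the test on bit 7 are taken from the slide's pseudocode, and an 8-bit field element is assumed.

```c
#include <stdint.h>

/* Multiply an 8-bit finite field element by 2 (shift and conditional reduce).
   Assumption: the reduction constant 29 is the one used on the slide. */
uint8_t gf_mul2(uint8_t in)
{
    uint8_t temp1 = (uint8_t)(in << 1);        /* shift left, bit 7 falls out */
    uint8_t temp2 = (in & 0x80u) ? 29u : 0u;   /* reduce only if bit 7 was set */
    return (uint8_t)(temp1 ^ temp2);
}
```

A C compiler for a general-purpose CPU still expands this into a shift, a test, a branch or select, and an exor, which is the point the slide makes: in hardware the same function is only wiring plus a few exor gates.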
srl   $13, $2, 20
andi  $25, $13, 1
srl   $14, $2, 21
andi  $24, $14, 6
or    $15, $25, $24
srl   $13, $2, 22
andi  $14, $13, 56
or    $25, $15, $14
sll   $24, $25, 2

[Figure: bits 20, 22, 23, 25, 26 and 27 of the source register ($2) are gathered into bits 2 to 7 of the destination register ($24).]

Bit level operations: DES example
Application examples (2)
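For comparison, the same bit gathering can be sketched in C. The function name des_gather is hypothetical; the shift amounts and masks are copied from the nine instructions of the DES example.

```c
#include <stdint.h>

/* Gather source bits 20, 22, 23, 25, 26, 27 into destination bits 2..7,
   mirroring the srl/andi/or/sll sequence of the DES example. */
uint32_t des_gather(uint32_t src)
{
    uint32_t b0  = (src >> 20) & 1u;    /* bit 20        -> bit 0      */
    uint32_t b12 = (src >> 21) & 6u;    /* bits 22, 23   -> bits 1, 2  */
    uint32_t b35 = (src >> 22) & 56u;   /* bits 25..27   -> bits 3..5  */
    return (b0 | b12 | b35) << 2;       /* final position: bits 2..7   */
}
```

Each C operator maps to one instruction, so the source-level view does not reduce the instruction count; a dedicated bit-permutation unit would.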
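srl   $24, $5, 18
srl   $25, $5, 17
xor   $8, $24, $25
srl   $9, $5, 16
xor   $10, $8, $9
srl   $11, $5, 13
xor   $12, $10, $11
andi  $13, $12, 1

[Figure: bits 18, 17, 16 and 13 of $5 are combined with xor; the result ends up in bit 0 of $13, all other bits of $13 are 0.]

Bit level operations: A5 example (GSM encryption)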
Application examples (2)
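The A5 instruction sequence computes a single feedback bit. A minimal C sketch, assuming a 32-bit register value and using the hypothetical name a5_feedback, with the tap positions 18, 17, 16 and 13 taken from the slide:

```c
#include <stdint.h>

/* Exor of bits 18, 17, 16 and 13 of r; the result is 0 or 1. */
uint32_t a5_feedback(uint32_t r)
{
    return ((r >> 18) ^ (r >> 17) ^ (r >> 16) ^ (r >> 13)) & 1u;
}
```

The result is the parity of the four tapped bits: it is 1 when an odd number of them is set.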
Application examples: conclusions
• CPUs offer flexibility, but…
  – not efficient in performance
  – not efficient in code size
  – not efficient in power consumption
Power consumption in microprocessors
Power consumption is (becoming) the limiting factor in processor design

Solution in direction of
• Hardware acceleration
• Instruction Level Parallelism instead of clock speed
• Code size efficiency

source: ISSCC2001, Patrick Gelsinger, Intel
Amdahl’s law
• Impact of an improvement on the execution time of a program depends on 2 parameters:
  – f = fraction of the original computation time that is affected by the improvement
  – s = speedup factor (local)
• exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
• speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
• if s >> 1 then speedup_overall approaches 1 / (1 - f)
• Example: 40 % of program can be executed 10 x faster
  speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
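The formula is easy to check numerically. The sketch below is only an illustration; the function names amdahl_speedup and amdahl_limit are hypothetical, f and s are the parameters defined on the slide.

```c
/* Overall speedup when a fraction f of the run time is sped up by a factor s. */
double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound on the overall speedup when s grows without limit. */
double amdahl_limit(double f)
{
    return 1.0 / (1.0 - f);
}
```

With f = 0.4 and s = 10 the first function returns 1.5625, the 1.56 of the example. The second returns about 1.67 for f = 0.4, so even an infinitely fast accelerator cannot do better than that when 60 % of the program is untouched.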
• Programmable CPU cores are important for the control parts of the application.
• They are well supported with tools for the development of end-user software (vs. deeply embedded software).
• Keep it Simple heuristic (RISC vs. CISC)
  – Make frequent cases fast and rare cases correct.
  – Regular (orthogonal) instruction set
  – No special features that match a high level language construct.
  – At least 16 registers to ease register allocation.
• Embedded cores are often light cores which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores which are optimised for performance).
Conclusions
Programmable Digital Signal Processors
• real-time worst-case processing = need for more compute power

  sec/prog = (instr/prog) * (cycles/instr) * (sec/cycle)

  CPI = 1
• instruction level parallelism (ILP)
• hardware support for loop control
• attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
• difficult to compare architectures
• e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, initialisation … can be included or forgotten
• benchmarking (Berkeley Design Technology Inc (BDTi)) (compare to SpecInt benchmarks for CPUs)
• architectures for programmable DSPs
• multiplier-accumulator
• modified Harvard architecture
• extension with an ALU (decision making)
• controller architectures
• examples: TI, Motorola, Philips
• code generation
• developments: VLIW (Very Long Instruction Word), examples: C6 and TM
Outline
• not every signal requires 32 bits
• 2 types of DSP: floating point and integer
• advantages FP: most specs are in FP (conversion to int is time consuming since the behavior may change)
• disadvantage FP: cost (area, speed, power)
• integer multiplication doubles the number of bits: n * n => 2n
DSP data types
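The last bullet (n * n => 2n bits) is what forces an integer DSP to shift, round and saturate after each multiplication. A minimal sketch, assuming 16-bit operands in the Q15 fractional format; Q15 and the function names are assumptions for illustration, the slides do not name a format.

```c
#include <stdint.h>

/* Full-precision product: two 16-bit operands need up to 32 bits. */
int32_t mul16x16(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}

/* Fractional (Q15) multiply: round the Q30 product and rescale it to Q15.
   Note: >> on a negative value is implementation-defined in C; an
   arithmetic shift is assumed here, as on common compilers. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = mul16x16(a, b);           /* Q30 product */
    p = (p + (1 << 14)) >> 15;            /* round to nearest, back to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;     /* saturate: only -1 * -1 overflows */
    return (int16_t)p;
}
```

For example, 16384 represents 0.5 in Q15, and q15_mul(16384, 16384) gives 8192, which represents 0.25.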
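[Figure: multiplier-accumulator datapath. The operands c(i) and x(i) enter the multiplier MPY (Booth, Wallace, ...); the product register PR feeds the ADDER, whose result is kept in the accumulator register ACR and passed on to SHIFT / ROUND / TRUNCATE. Labels on the registers: clock, P_reg, control.]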
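[Figure: three memory organisations, each feeding an execution unit (EXU)]
• Von Neumann (sequential): one combined program/data memory
• Harvard: separate program memory and data memory
• Modified Harvard: program memory plus two data memories (data mem. 1 and data mem. 2)

c(i) * x(i)
Goal = 1 cycle per iteration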
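[Figure: DSP architecture. Data path: RAM_A and RAM_B, each with an address computation unit (ACU_A, ACU_B), an address register (AR_A, AR_B) and a data register (DR_A, DR_B), feeding the MAC. Controller: PC with +1 incrementer, interrupt address, stack, reset, program memory, IR, control bus, Rfile.]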
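[Figure: FIR filter with inputs x1 ... x5, Z^-1 delay elements, coefficients c1 ... c5 and output y. Each product ci * xi is one pass of the filter loop (index i), which runs inside the time loop.]

How to update the delay line?
1 cycle/tap?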
Memory     output    output    output    output    output
location   sample 1  sample 2  sample 3  sample 4  sample 5
1          x1                                      x9
2          x2        x2
3          x3        x3        x3
4          x4        x4        x4        x4
5          x5        x5        x5        x5        x5
6                    x6        x6        x6        x6
7                              x7        x7        x7
8                                        x8        x8

Solution 2: indirect addressing
• use of a pointer to mark the beginning of the delay line
• problem: trashing of the whole memory
• solution: modulo addressing
• need for a register to store the pointer
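The pointer-plus-modulo idea can be sketched in C as a circular delay line. This is an illustration only: the names fir_circular, delay and head, the buffer size of 8 and the filter length of 5 are assumptions chosen to match the memory table, not code from the slides.

```c
#define NTAPS 5                 /* filter length, as in the table above */
#define BUFSZ 8                 /* delay-line size (assumed, 8 locations) */

static int delay[BUFSZ];        /* circular delay line */
static int head;                /* index of the newest sample */

/* One output sample y = sum over k of coeff[k] * x[n-k]; no data is moved. */
int fir_circular(int in, const int coeff[NTAPS])
{
    int acc = 0;
    int k;

    head = (head + 1) % BUFSZ;  /* advance the pointer, modulo the buffer size */
    delay[head] = in;           /* the new sample overwrites the oldest slot */

    for (k = 0; k < NTAPS; k++) {
        int idx = (head - k + BUFSZ) % BUFSZ;   /* x[n-k], wrapping around */
        acc += coeff[k] * delay[idx];
    }
    return acc;
}
```

Compared with the shifting version of the filter, the samples stay where they are and only the pointer moves, so the delay line stays inside its 8 locations instead of walking through the whole memory.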
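[Figure: ACU with registers A and S, a modulo unit, and the address output to RAM]

Instruction  Output  reg A  reg S
Read_A       A       A      S
Read_S       S       A      S
incA         A+1     A+1    S
decA         A-1     A-1    S
Step         A+S     A+S    S
Inc_step     S+1     A      S+1

Modulo can be implemented as a mask operation if the size is 2^k
Example: 16 = 10 000, 23 = 10 111; mask = hold (the upper bits 10 are held, the lower 3 bits wrap)

ACU architecture and instruction set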
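The mask trick can be sketched in C as follows. The function name acu_step and its parameters are hypothetical; the sketch assumes a buffer of size 2^k that starts at an address aligned to 2^k, as in the 16 to 23 example.

```c
#include <stdint.h>

/* Step address a by s inside a circular buffer of 2^k words.
   The upper address bits are held, the lower k bits wrap around. */
uint16_t acu_step(uint16_t a, int16_t s, unsigned k)
{
    uint16_t mask = (uint16_t)((1u << k) - 1u);    /* k = 3 gives 00 111 */
    unsigned sum  = (unsigned)a + (unsigned)s;     /* wraps like the hardware adder */
    return (uint16_t)((a & (uint16_t)~mask) | (sum & mask));
}
```

With k = 3, stepping from 23 by +1 gives 16 and stepping from 16 by -1 gives 23, so no comparison or division is needed for the modulo.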
Addressing modes
• register    ADD R4, R3        R[R4] = R[R4] + R[R3]
• immediate   ADD R4, #3        R[R4] = R[R4] + #3
• direct      ADD R4, (100)     R[R4] = R[R4] + Mem[100]
• indirect    ADD R4, (R3)      R[R4] = R[R4] + Mem[R[R3]]
• w. inc/dec  ADD R4, (R3)±     R[R4] = R[R4] + Mem[R[R3]];  R[R3] = R[R3] ± 1
• indexed     ADD R4, (R3±R2)   R[R4] = R[R4] + Mem[R[R3]];  R[R3] = R[R3] ± R[R2]

Remarks
• direct = for static data
• indirect = for arrays
• inc/dec = for stepping through arrays, e.g. x[n]
• index = for stepping through arrays, e.g. x[2n]
• 8 ARs (address or auxiliary register) available
• extra indirect modes
  – circular      *ARn ± %       post inc/dec by 1, circular
  – circular      *ARn ± AR0 %   post inc/dec by AR0, circular
  – bit reverse   *ARn ± AR0 B   post inc/dec by AR0, bit reversed
Addressing modes: extra for DSP
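Bit-reversed addressing produces the access order an FFT needs for its input or output shuffle. The sketch below computes that order in software; the function name bit_reverse is hypothetical, and a DSP address unit would obtain the same sequence in hardware rather than with a loop.

```c
/* Reverse the lowest `bits` bits of index i
   (for an 8-point FFT, bits = 3: 001 -> 100, 011 -> 110). */
unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    unsigned b;

    for (b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* move the current LSB of i into r */
        i >>= 1;
    }
    return r;
}
```

For 8 points the indices 0, 1, 2, ..., 7 map to 0, 4, 2, 6, 1, 5, 3, 7. On a CPU this costs a loop per access; with the addressing mode it is free.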
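[Figure: the DSP architecture extended with an ALU next to the MAC. Controller: PC with +1 incrementer, interrupt address, stack, reset, program memory, IR. Data path: ACU_A, AR_A, RAM_A, DR_A and ACU_B, AR_B, RAM_B, DR_B feeding the MAC and the ALU; control bus, Rfile.]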
c(i) * x(i): first solution

LABEL  ALU            MPY-ACC                      RAM        ACU
                      acc = 0                                 init (i=0)
       init counter
loop                                                          incr (=i+1)
                                                   read x(i)
                      acc(i)=acc(i-1)+x(i)*c(i)
       dec counter
       branch to loop if counter > 0
       nop

[Table axes: resources (columns) against time in clock cycles (rows)]
6 clock cycles/sample
limit pipelines in the controller
Not shown: coefficient RAM + ACU
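[Figure: the operations f, g and h applied to a(i), b(i), c(i) and d(i), drawn for iterations 0, 1 and 2, and then folded so that h, g and f of three successive iterations execute side by side.]

for i = 0 to n
    b(i) = f(a(i))
    c(i) = g(b(i))
    d(i) = h(c(i))

for i = 2 to n
    b(i)   = f(a(i))
    c(i-1) = g(b(i-1))
    d(i-2) = h(c(i-2))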
Loopfolding (software pipelining)
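The folded loop only becomes complete once a preamble fills the pipeline and a postamble drains it. A minimal C sketch follows; the stage functions f, g and h are arbitrary placeholders and N is a hypothetical iteration count (at least 2), none of them come from the slides.

```c
#define N 8                                  /* hypothetical iteration count */

static int f(int x) { return x + 1; }        /* placeholder stage functions */
static int g(int x) { return x * 2; }
static int h(int x) { return x - 3; }

/* Plain loop: the three stages of one iteration depend on each other. */
void plain(const int a[N], int d[N])
{
    int i;
    for (i = 0; i < N; i++)
        d[i] = h(g(f(a[i])));
}

/* Folded loop: each pass works on three different iterations, so the
   three statements in the body are independent of each other. */
void folded(const int a[N], int d[N])
{
    int b[N], c[N];
    int i;

    b[0] = f(a[0]);                          /* preamble: fill the pipeline */
    b[1] = f(a[1]);
    c[0] = g(b[0]);

    for (i = 2; i < N; i++) {                /* steady state */
        b[i]     = f(a[i]);
        c[i - 1] = g(b[i - 1]);
        d[i - 2] = h(c[i - 2]);
    }

    c[N - 1] = g(b[N - 1]);                  /* postamble: drain the pipeline */
    d[N - 2] = h(c[N - 2]);
    d[N - 1] = h(c[N - 1]);
}
```

Both functions produce the same d. On a sequential CPU the folded version gains nothing; on a machine with separate units for f, g and h the three statements of the loop body can issue in the same cycle.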
c(i) * x(i)
Pre- and postamble
4 clock cycles/sample

LABEL  ALU            MPY-ACC                            RAM          ACU
                      acc(i-1)=0                                      init (i=1)
       init counter                                      read x(i)    inc (=i+1)
loop                  acc(i)=acc(i-1)+x(i)*c(i)          read x(i+1)  incr (=i+2)
       dec counter
       branch to loop if counter > 0
       nop
                      acc(n-1)=acc(n-2)+x(n-1)*c(n-1)    read x(n)
                      acc(n)=acc(n-1)+x(n)*c(n)
Loopfolding (software pipelining)
LABEL       ALU            MPY-ACC                            RAM          ACU
                           acc(i-1)=0                                      init (i=1)
            init counter                                      read x(i)    inc (=i+1)
repeat n-2                 acc(i)=acc(i-1)+x(i)*c(i)          read x(i+1)  incr (=i+2)
                           acc(n-1)=acc(n-2)+x(n-1)*c(n-1)    read x(n)
                           acc(n)=acc(n-1)+x(n)*c(n)

c(i) * x(i)
hardware support for loop control
1 clock cycle/sample
repeat instruction and repeat block
[Figure: TMS320C5000 data path. Labels: T register, sign control on the operand inputs, multiplier (17*17), accumulators A(40) and B(40), input multiplexers (MUX), fractional MUX, adder (40) with ZERO / SAT / ROUND, ALU (40), barrel shifter, MSW/LSW select, COMP, TRN, TC, and the buses C, D, E and P.]
TMS320C5000
[Figure: Motorola 56K block diagram. Address ALU generating the X, Y and P addresses; X memory and Y memory (each 256-by-24-bit RAM and 256-by-24-bit ROM); 2,048-by-24-bit program memory ROM; external address switch with a 16-bit address bus; internal and external data-bus switches with a 24-bit data bus; X data, Y data, P data and global data buses; data ALU with a 24-by-24-bit multiplier-accumulator producing a 56-bit result; program controller with clock and interrupt inputs; on-chip peripherals (host, synchronous serial interface, serial communications interface, programmed I/O, bus control); I/O ports. The signal groups are labelled 2, 3, 24 and 7 bits.]
Motorola 56K family
[Figure: R.E.A.L. DSP block diagram. X data memory and Y data memory, each on a 16-bit bus; two address computation units; program memory (Z data) on a 16-bit bus; program control unit; instruction decoder with 96-bit instructions; buses for X, Y and Z data; two 16-by-16-bit multipliers (inputs X, Y0, Y1; products P0, P1) with scaling; two 40-bit arithmetic-logic units with saturation; four 40-bit accumulators; saturation/scale and shift.]
R.E.A.L.
source
Front end
• lexical analysis
• syntax analysis
• semantic analysis
Intermediate machine independent representation
Code generation
• Code selection
• Register allocation
• scheduling: 1 instr = // (parallel) ops, order of instr
code
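[Figure: data-flow graph of the basic block below: a * b gives t1, c + d gives t2, t1 + c gives t3, and t2 * t3 gives out. The basic blocks BBi, BBj and BBk form the control-flow graph.]

t1 := a * b
t2 := c + d
t3 := t1 + c
out := t2 * t3

Intermediate machine independent representation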
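[Figure: ADSP (Analog Devices) data path. ALU (+, -) with input registers ax and ay, result register ar and feedback register af; MAC (+, -, *) with input registers mx and my, result register mr and feedback register mf; the x and y operands come from d memory and p memory.]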
Code selection example
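[Figure: the data-flow graph for t1 := a * b, t2 := c + d, t3 := t1 + c, out := t2 * t3, covered by three patterns labelled 1, 2 and 3.]

Register transfers used in the cover:
mx := dmem
my := pmem
ax := dmem
ay := pmem
mr := dmem
ar := ax + ay
my := ar
mr := mr * my
mr := mr + (mx * my)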
Example of code selection = covering of intermediate representation with RTPs
Problems
• local decisions which have a global impact
• phase coupling: example
• asap schedule
• maximal freedom for scheduling
• code selection during scheduling
• register allocation comes afterwards
• can lead to infeasible solutions
Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler

Efficiency = exploit instruction level parallelism (ILP)
Compilation = systematic positioning of registers and regular interconnect
= VLIW = Very Long Instruction Word

It is very difficult, almost impossible, to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler
phase coupling: discussion