embedded processor architecture

Embedded Processor Architecture

5kk73

Embedded Processor Architecture Henk Corporaal / Bart Mesman 2

flexibilityefficiency

DSP

Programmable CPU

Programmable DSP

Application specific instruction set

processor (ASIP)

Applicationspecific processor


#define NTAPS 4

int fir(int in)int i;static int state[NTAPS];static int coeff[NTAPS];int out[NTAPS];

state[NTAPS] = in;out[0] = state[0] * coeff[0];for ( i = 1; i < NTAPS+1; i++)

out[i] = out[i-1] + state[i] * coeff[i];state[i-1] = state[i];

return(out[NTAPS]);

*

Z-1

*

Z-1

*

Z-1

*

+

c3c4 c2 c1

x4 x3 x2 x1

y

Z-1

c0

x0

*

Application examples (1)


.L1000006sll $3, $2, 2 R3=R2>>2 R3=i-1addu $14, $15, $3 R14=R15+R3lw $24, 0($14) R24=load(*R14) R24=coeff[i-1]addiu $12, $6, -4 R12=R6-4addu $11, $12, $3 R11=R12+R3lw $13, 0($11) R13=load(*R11) R13=state[i-1]nopmult $24, $13 R24=R24*R13addu $25, $sp, $3 R25=sp+R3lw $9, -4($25) R9=load(R25-4) R9=out[i-1]addiu $2, $2, 1 R2=R2+1 i=i+1mflo $13 R13=move from low mpy regaddu $10, $9, $13 R10=R9+R13 R10=out[i]sw $10, 0($25) mem(*R25)=R10addu $25, $7, $3 R25=R7+R3sw $24, 0($25) mem(*R25)=R24slti $24, $2, 10bne $24, $0, .L100006addiu $15, $7, -4


19 instructions per tap!!


temp1 = input << 1temp2 = if (bit(input,7) == 1

then 29 else 0

out = temp1 exor temp2

Bit level operations:finite field arithmetic

r1 = LB input Load byter2 = SLL r1 Shift left logicalr3 = ANDI r1, mask AND immediater4 = ADDI r3, -1 ADD immediateBNE ( r4 != r0) Branch on != to nonzeronopR5 = XORI(r1, 29) Exclusive or immediateJ common Jumpnop

nonzero r5 = XOR(r1,r0) Exclusive ORcommon …

in[0] in[1] in[2] in[3] in[4] in[5] in[6] in[7]

out[0] out[1] out[2] out[3] out[4] out[5] out[6] out[7]

exor exor exor


10 instructions!!Very simple in hardware


srl $13, $2, 20andi $25, $13, 1srl $14, $2, 21andi $24, $14, 6or $15, $25, $24srl $13, $2, 22andi $14, $13, 56or $25, $15, $14sll $24, $25, 2

202223252627source register ($2)

destination register ($24)

2 3 4 5 6 7

Bit level operations : DES exampleApplication examples (2)


srl $24, $5, 18srl $25, $5, 17xor $8, $24, $25srl $9, $5, 16xor $10, $8, $9srl $11, $5, 13xor $12, $10, $11andi $13, $12, 1

181716 13

xor

$5

1$13 … 0 ...

Bit level operations : A5 example (GSM encryption)



Application examples: conclusions

• CPUs offer flexibility, but…• not efficient in performance• not efficient in code size• not efficient in power consumption


Power Consumption in microprocessorsPower consumption is (becoming) the limiting factor in

processor design

Solution in direction of• Hardware acceleration• Instruction Level Parallelism instead of clock speed• Code size efficiency

source: ISSCC2001, Patrick Gelsinger, Intel


Amdahl’s law

• Impact of an improvement on the execution time of a program depends on 2 parameters:– f = fraction of the original computation time that is

affected by the improvement– s = speedup factor (local)

• exec_time_new = exec_time_old * (1-f) + exec_time_old * f / s

• speedup_overall = exec_time_old / exec_time_new = 1 / ( 1 – f + f / s)

• if s >> 1 then speedup_overall = 1 / ( 1 – f )• Example: 40 % of program can be executed 10 x faster

speedup_overall = 1 / ( 0.6 + 0.4 / 10 ) = 1.56


• Programmable CPU cores are important for the control parts of the application. • They are well supported with tools to support the development of end-user software. ( vs. deeply embedded sw)• Keep it Simple heuristic (RISC vs. CISC)

• Make frequent cases fast and rare cases correct. • Regular (orthogonal) instruction set• No special features that match a high level language construct.• At least 16 registers to ease register allocation.

• Embedded cores are often light cores which are a compromise between performance, area and power dissipation. (vs. stand-alone CPU cores which are optimised for performance)

Conclusions


Programmable Digital Signal Processors• real-time worst-case processing = need for more compute power

sec instr cycles secprog prog instr cycle

CPI = 1• instruction level parallelism (ILP)• hardware support for loop control• attention for high level data types e.g. arrays, delaylines

(vs. scalars for CPUs)• difficult to compare architectures

• e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling, shuffling, intialisation … can be included or forgotten

• benchmarking (Berkeley Design Technology Inc (BDTi))(compare to SpecInt benchmarks for CPs)


• architectures for programmable DSPs• multiplier-accumulator• modified Harvard architecture• extension with an ALU (decision making)• controller architectures

• examples: TI, Motorola, Philips • code generation• developments: VLIW (Very Long Instruction Word)

examples: C6 and TM

Outline


• not every signal requires 32 bits• 2 types of DSP: floating point and integer• advantages FP: most specs are in FP

(conversion to int is time consuming since the behavior may change)

• disadvantage FP: cost (area, speed, power)• integer multiplication doubles the number of bits: n * n => 2n

DSP data types


PR

ADDER

ACR

MPY(Booth,

Wallace..)

c(i) x(i)

SHIFTROUND

TRUNCATE

clockP_reg

clockP_reg

control


Prog/datamemory

EXU

Von Neumann(sequential)

progmem.

EXU

Harvard

datamem.

progmem.

EXU

datamem. 1

datamem. 2

Modified Harvard

c(i) * x(i)Goal = 1 cycle per iteration


RAM_A RAM_B

ACU_A

AR_A

ACU_B

AR_B

MAC

DR_A DR_B

+1 PC

Interrupt address

Stack

Reset

ProgramMemory

IR

Control Bus

Rfile


*

Z-1

*

Z-1

*

Z-1

*

+

c4c5 c3 c2

x5 x4 x3 x2

y

Z-1

c1

x1

*

ci * xi

time loop

filter loop i

How updating the delayline ?

1 cycle/tap ?


Memorylocation

outputsample 1

outputsample 2

outputsample 3

outputsample 4

Outputsample 5

1 x1 x92 x2 x23 x3 x3 x34 x4 x4 x4 x45 x5 x5 x5 x5 x56 x6 x6 x6 x67 x7 x7 x78 x8 x8

Solution 2: indirect adressing

• use of a pointer to mark the begin of the delay line• problem: trashing of the whole memory• solution: modulo addressing• need for a register to store the pointer


A S

Modulo

outputto RAM

Output reg A reg SRead_A A A SRead_S S A SincA A+1 A+1 SdecA A-1 A-1 SStep A+S A+S SInc_step S+1 A S+1

Modulo can beimplemented as a mask operation if the size is 2k

16 10 00023 10 111mask=hold

ACU architecture andInstruction set


Addressing modes

• register ADD R4, R3 R[R4] = R[R4] + R[R3]• immediate ADD R4, #3 R[R4] = R[R4] + #3• direct ADD R4, (100) R[R4] = R[R4] + Mem[100]• indirect ADD R4, (R3) R[R4] = R[R4] + Mem[R[R3]]

• w. inc/dec ADD R4, (R3)± R[R4] = R[R4] + Mem[R[R3]] R[R3] = R[R3] ± 1

• indexed ADD R4, (R3±R2) R[R4] = R[R4] + Mem[R[R3]] R[R3] = R[R3] ± R[R2]

Remarks• direct = for static data• indirect = for arrays

• inc/dec = for stepping through arrays e.g. xn

• index = for stepping through arrays e.g. x2n


• 8 ARs (address or auxiliary register) available• extra indirect modes

•circular *ARn ± % post inc/dec by 1 - circular *ARn ± AR0 % post inc/dec by AR0 - circular

• bit reverse *ARn ± AR0 B post inc/dec by AR0 - bit rev.

Addressing modes: extra for DSP


+1 PC

Interrupt address

Stack

Reset

ProgramMemory

IR

ACU_A

AR_A

RAM_A

DR_A

ACU_B

AR_B

RAM_B

DR_B

MAC ALUControl Bus

Rfile


LABEL ALU MPY-ACC RAM ACUAcc = 0 init (i=0)

init counterloop incr (=i+1)

read x(i)acc(i)=acc(i-1)+x(i)*c(i)

dec counter branch to loop if counter > 0

nop

c(i) * x(i)

6 clockcycles/samplelimit pipelines in the controller

first solution

resources

time (cc)

Not showncoefficient RAM+ACU


f

g

h

ai

bi

ci

di

f

g

h

a0

b0

c0

d0

f

g

h

a1

b1

c1

d1

f

g

h

a2

b2

c2

d2

h g f

ai

bi

bi-1ci-2

ci-1di-2

for i = 0 to n bi = f(ai) ci = g(bi) di = h(ci)

for i = 2 to n bi = f(ai) ci-1 = g(bi-1) di-2 = h(ci-2)

Loopfolding (software pipelining)


c(i) * x(i)

Pre- and postamble4 clockcycles /sample

LABEL ALU MPY-ACC RAM ACUacc(i-1)=0 init (i=1)

init counter read x(i) inc(=i+1)loop acc(i) = acc(i-1)+x(i)*c(i) read x(i+1) incr (=i+2)

dec counterbranch to loop if counter > 0nop

acc(n-1) = acc(n-2)+x(n-1)*c(n-1) read x(n)acc(n) = acc(n-1)+x(n)*c(n)

Loopfolding (software pipelining)


Label ALU MPY-ACC RAM ACUacc(i-1=0 init (i=1)

init counter read x(i) inc(=i+1)repeat n-2 acc(i)=acc(i-1)+x(i)*c(i) read x(i+1) incr(=i+2)

acc(n-1) = acc(n-2) + x(n-1)*c(n-1) read x(n)acc(n) = acc(n-1) + x(n)*c(n)

c(i) * x(i)

hardware support for loop control

1 clockcycles/samplerepeat instruction and repeat block


T register

Sign ctr Sign ctr Sign ctr Sign ctr Sign ctr

T

Multiplier (17*17)

A(40) B(40)

MUXA

0

A

A B

B A

fractional MUX

Adder (40)

ZERO SAT ROUND

MALU (40)

U B

MUX

TAB CD

C D

Barrer shifter

MSW/LSWselect

E

COMP

TRN

TC

B

A

P C DD

TMS320C5000


Address bus

16 bits

EXTERNALADRESS SWITCH

Y Address

Y memory256-by-24-bit

RAM256-by-24-bit

ROM

AddressALU

X memory256-by-24-bit

RAM256-by-24-bit

ROM

2,048-by-24-bitPROGRAMMEMORY

ROM

X AddressP Address

EXTERNALDATA-BUS

SWITCH

INTERNAL DATA-BUS

SWITCH

24 BITS DATA

BUS

X-DATAY DATAP DATAGLOBAL DATA

DATA ALU

24-by-24 bitMULTIPLIER-

ACCUMULATORPRODUCING

56 BIT RESULT

PROGRAM CONTROLLER

ON CHIPPERIPHERALS,

HOST,SYNCHRONOUS

SERIAL INTERFACESERIAL COMMU-

NICATIONSINTERFACE,

PROGRAMMED I/O,BUS CONTROL

2 BITS

CLOCK

3 BITS

INTERRUPT

24 BITS

I/OPORTS

7 BITS

Motorola 56K family


X data

Y data

Z data

Buses for

X

X datamemory

16 bitbus

Y datamemory

16 bit bus

Two address Compution

units

Y

Inst

ruc t

ion

d eco

der

96-b

it in

stru

ctio

ns

Program control

unit

Programmemory (Z data)

16-bit bus

Two 16-by-16 bitmultipliers

Y0

Y1

X

Y0

Y1

X

PO P1

scale scale

Two 40 bit arithmic-logic units

SaturationSaturation

Four 40 bitaccumulators

Saturation/scale

shift

R.E.A.L.


lexical analysis

syntax analysis

semantic analysis

Code selection

Register allocation

scheduling

Front end

Code generation

code

source

Intermediate machine independent

representation

1 instr = // opsorder of instr


a b

*

c d

+

+

*

c t1 := a * b t2 := c + d t3 := t1 + cout := t2 * t3

t1 t2

t3

BBi

BBj BBk

Intermediate machine independent

representation


ax ay

ar

af mx my

mr

mf

+ -

x y x y

+ - *ALU MAC

d memory p memory ADSP[Analog Devices]

Code selection example


a b

*

c d

+

+

*

c

t1 t2

t3

mx := dmem my := pmem ax := dmem ay := pmem

mr := dmem

2:

1:

3: ar := ax + ay

my := ar

mr = mr * my

Mr := mr + (mx * my)

Example of code selection = covering of intermediate representation with RTPs


Problems• local decisions which have a global impact• phase coupling: example

• asap schedule• maximal freedom for scheduling• code selection during scheduling• register allocation comes afterwards• can lead to infeasible solutions


Solution: 1. Solve code generation for DSPs2. Step back and rethink the architecture

develop an architecture which is still efficient but alsoa good model for building a compiler

Efficiency = exploit instruction level parallelism (ILP)compilation = systematic positioning of registers and regular interconnect= VLIW = Very Long Instruction Word

It is very difficult and almost impossible to develop robust and efficient DSP compilers. Current DSP practice = programming in assembler

phase coupling: discussion

embedded processor architecture

Documents