Ernest Jamro, Dep. of Electronics, AGH, Kraków

Post on 05-Jan-2016


Hardware Implementation of Algorithms

Multipliers, convolvers

Ernest Jamro

Dep. of Electronics, AGH, Kraków

2

Multiplication

        1 0 0 1      (9)
x       1 0 1 1      (11)
        -------
        1 0 0 1
      1 0 0 1
    0 0 0 0
+ 1 0 0 1
  -------------
  1 1 0 0 0 1 1      (99)

9 x 11 = 99
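The paper-and-pencil scheme above can be sketched as a shift-and-add loop. This is a minimal illustration for unsigned operands; the function name `shift_add_multiply` is an assumption and does not come from the slides.

```python
def shift_add_multiply(a, b):
    """Multiply two unsigned integers by summing shifted copies of a."""
    product = 0
    shift = 0
    while b:
        if b & 1:                    # current multiplier bit is 1
            product += a << shift    # add the shifted partial product
        b >>= 1                      # move to the next multiplier bit
        shift += 1
    return product


print(bin(shift_add_multiply(0b1001, 0b1011)))  # 0b1100011
```

Each loop iteration corresponds to one row of partial products in the worked example.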

3

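Parallel Array Multipliers

[Figure: 4x4 parallel array multiplier. Inputs a0..a3 and b0..b3 feed a grid of cells; the first row consists of AND gates (&) and the remaining rows of AND gates followed by adders (& +). Outputs p0..p7.]

Each cell takes a_i, b_j, the incoming sum s_{l-1} and the incoming carry c_{k-1}, and produces the sum s_l and the carry c_k:

$2c_k + s_l = s_{l-1} + (a_i \wedge b_j) + c_{k-1}$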
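The cell equation 2c_k + s_l = s_{l-1} + (a_i and b_j) + c_{k-1} can be checked with a small bit-level model. The sketch below assumes a simple ripple-carry arrangement of the cells, row by row; the exact wiring of the figure may differ, and the names `cell` and `array_multiply` are illustrative.

```python
def cell(s_in, a_i, b_j, c_in):
    """One array cell: AND gate plus full adder. Returns (s_l, c_k)."""
    total = s_in + (a_i & b_j) + c_in   # equals 2*c_k + s_l
    return total & 1, total >> 1


def array_multiply(a, b, n=4):
    """Unsigned n x n multiplication built only from `cell` calls."""
    s = [0] * (2 * n)                   # running sum bits, LSB first
    for j in range(n):                  # one row of cells per bit of b
        carry = 0
        for i in range(n):
            s[i + j], carry = cell(s[i + j], (a >> i) & 1, (b >> j) & 1, carry)
        s[j + n] = carry                # carry out of the row
    return sum(bit << k for k, bit in enumerate(s))
```

For 4-bit operands the model uses 16 cells, matching the 4x4 grid of the array.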

4

FPGA, Built-in multiplier DSP48

5

Sequential Multiplier

[Figure: sequential multiplier. Labels: operand bits A3..A0 and B3..B0, a PISO register, an adder built from full adders (FA) and a half adder (HA), flip-flops (FF), a (shift) register, and partial-product bits Pn,0..Pn,3.]

6

Wallace Tree Multiplier (with Carry Save Adders)

Carry-save adders (CSA) are not recommended in FPGAs.

7

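Multiplication of Signed Numbers

Sign-magnitude:

Standard unsigned multiplication of the magnitudes

Sign = Sign1 xor Sign2

Two's complement:

$A = -a_{N-1} 2^{N-1} + \sum_{i=0}^{N-2} a_i 2^i$

C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," IEEE Trans. Comput., vol. C-22, pp. 1045-1047, Dec. 1973.

Expanding the product term by term, as in (a1+a2)*(b1+b2) = a1b1 + a1b2 + a2b1 + a2b2:

$A \cdot B = a_{N-1} b_{N-1} 2^{2N-2} + \sum_{i=0}^{N-2} \sum_{j=0}^{N-2} a_i b_j 2^{i+j} - 2^{N-1} \sum_{i=0}^{N-2} a_{N-1} b_i 2^i - 2^{N-1} \sum_{i=0}^{N-2} a_i b_{N-1} 2^i$

The two negative sums are rewritten using inverted partial products:

$-2^{N-1} \left( \sum_{i=0}^{N-2} a_{N-1} b_i 2^i + \sum_{i=0}^{N-2} a_i b_{N-1} 2^i \right) = 2^{N-1} \left( \sum_{i=0}^{N-2} \overline{a_{N-1} b_i}\, 2^i + \sum_{i=0}^{N-2} \overline{a_i b_{N-1}}\, 2^i \right) + 2^N - 2^{2N-1}$

$A \cdot B = a_{N-1} b_{N-1} 2^{2N-2} + \sum_{i=0}^{N-2} \sum_{j=0}^{N-2} a_i b_j 2^{i+j} + 2^{N-1} \sum_{i=0}^{N-2} \overline{a_{N-1} b_i}\, 2^i + 2^{N-1} \sum_{i=0}^{N-2} \overline{a_i b_{N-1}}\, 2^i + 2^N - 2^{2N-1}$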
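A bit-level check of the two's complement scheme may help here. The sketch below assumes the usual Baugh-Wooley arrangement: partial products that involve exactly one sign bit are inverted, and the constants 2^N and 2^(2N-1) are added, with the result taken modulo 2^(2N). The function name `baugh_wooley` is illustrative.

```python
def baugh_wooley(a, b, n=4):
    """Multiply two n-bit two's complement numbers using only non-negative terms."""
    abit = [(a >> i) & 1 for i in range(n)]   # Python's >> sign-extends negatives
    bbit = [(b >> i) & 1 for i in range(n)]
    total = (abit[n - 1] & bbit[n - 1]) << (2 * n - 2)
    for i in range(n - 1):
        for j in range(n - 1):
            total += (abit[i] & bbit[j]) << (i + j)          # ordinary AND cells
    for i in range(n - 1):
        total += (1 - (abit[n - 1] & bbit[i])) << (i + n - 1)  # inverted (NAND) cells
        total += (1 - (abit[i] & bbit[n - 1])) << (i + n - 1)  # inverted (NAND) cells
    total += (1 << n) + (1 << (2 * n - 1))    # -2^(2N-1) equals +2^(2N-1) mod 2^(2N)
    total &= (1 << (2 * n)) - 1               # keep 2N result bits
    if total >> (2 * n - 1):                  # reinterpret as a signed 2N-bit value
        total -= 1 << (2 * n)
    return total
```

Because every added term is non-negative, the same adder array as in the unsigned case can be reused; only some AND gates become NAND gates and constant ones are injected.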

8

Two's Complement Multiplication

[Figure: 4x4 two's complement array multiplier. Same structure as the unsigned array, but the cells marked !& invert the partial product (NAND instead of AND), and a constant 1 is fed into the array. Inputs a0..a3 and b0..b3; outputs p0..p7.]

Cell equation:

$s_{l-1} + (a_i b_j) + c_{k-1} = 2c_k + s_l$

9

Reduced-Width Multiplier

[Figure: 4x4 array multiplier with a truncation line drawn across the outputs. Inputs a0..a3 and b0..b3; outputs p0..p7.]

Cell equation:

$s_{l-1} + (a_i b_j) + c_{k-1} = 2c_k + s_l$

10

Truncation Error Compensation

[Figure: reduced array multiplier with truncation error compensation. Inputs a0..a3 and b0..b2; only outputs p3..p7 are produced.]

Cell equation:

$s_{l-1} + (a_i b_j) + c_{k-1} = 2c_k + s_l$

11

Constant Coefficient Multiplier

Look-Up Table (LUT): the input is the LUT address, the product is the LUT data.

Example: Y = 5*X

Address  Data
0        0
1        5
2        10
3        15
...      ...
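The Y = 5*X table can be generated and used as follows. This is a minimal sketch; the 4-bit input width (`WIDTH`) is an assumption, since the slide does not state it.

```python
C = 5          # constant coefficient from the example
WIDTH = 4      # assumed input width in bits (not given on the slide)

# LUT contents: the address is the input X, the data is C * X.
LUT = [C * x for x in range(1 << WIDTH)]


def const_multiply(x):
    """Multiply by C with a single table read, no arithmetic at run time."""
    return LUT[x]


print(LUT[:4])  # [0, 5, 10, 15]
```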

12

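LUT-based Multiplier, Constant Coefficient C

$Y = C \cdot A = C \cdot A(0{:}3) + 2^4 \cdot C \cdot A(4{:}7)$

[Figure: the 8-bit input is split into two 4-bit parts that address LUT A and LUT B; each LUT outputs 12 bits; an adder combines the two LUT outputs into the 16-bit output.]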
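The split Y = C*A(0:3) + 2^4 * C*A(4:7) can be sketched as two reads of a 16-entry table followed by one addition. The helper names `build_lut` and `lut_multiply` are illustrative, and an 8-bit unsigned input is assumed as in the figure.

```python
def build_lut(c, bits=4):
    """Table of c * n for every 'bits'-wide address n."""
    return [c * n for n in range(1 << bits)]


def lut_multiply(c, a):
    """Multiply the 8-bit input a by the constant c using two 4-bit LUT reads."""
    lut = build_lut(c)                 # both LUTs hold the same contents
    low = lut[a & 0xF]                 # C * A(0:3)
    high = lut[(a >> 4) & 0xF]         # C * A(4:7)
    return low + (high << 4)           # the shift by 4 is just wiring in hardware
```

Two 16-entry tables replace one 256-entry table, at the cost of one adder.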

13

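Different ROM sizes (input data width = 6 bits)

[Figure: three ways of building the multiplier from small memories and an adder for a 6-bit input:
a) two Mem16x1 blocks and an adder (input split into 2 and 4 bits);
b) one Mem16x1 block (4 address bits) and an adder;
c) one Mem32x1 block (5 address bits) and an adder.]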

14

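Heterogeneous memory usage

Virtex memory sizes: 16x1, 32x1, 4kx1, 2kx2, 1kx4, 512x8, 256x16

Input data and coefficient width = 14

[Figure: 14-bit constant-coefficient multiplier built from memories of different sizes (256x16, 32x1 and 16x1); the memory outputs are combined by an adder into the 28-bit result.]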

15

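Replacing distributed RAM with BRAM

CLB → BRAM

[Figure: the 14-bit multiplier from the previous slide shown in two versions: one that uses block RAM (256x16) together with 32x1 and 16x1 memories, and one built only from CLB resources (16x1 LUT memories and adders) between the 14-bit input and the output.]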

16

Area [CLB] for different input and coefficient widths K

[Figure: plot for K = 4..24 (vertical axis 0..12). Series: CLB only (scale 1:10), number of BRAMs, equivalent cost of 1 BRAM.]

17

MM (Multiplierless Multiplication)

• Binary Representation (BR), example: B = 14 = $1110_2$

M = A·B = (A<<1) + (A<<2) + (A<<3)

• Sub-structure Sharing (SS), example: B = 27 = $11011_2$

tmp = A + (A<<1)

M = A·B = tmp + (tmp<<3)

• Canonic Signed Digit (CSD)

digit set {0, 1, -1} (0: no operation, 1: addition, -1, written $\bar{1}$: subtraction)

example: B = 7 = $111_2$ = $100\bar{1}_{\mathrm{CSD}}$

binary: M = B·A = (A<<2) + (A<<1) + A

CSD: M = (A<<3) - A
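The three examples above translate directly into shift-and-add code. The function names are illustrative; each body mirrors one formula from the slide.

```python
def times_14(a):
    """Binary representation: 14 = 1110b, two additions."""
    return (a << 1) + (a << 2) + (a << 3)


def times_27(a):
    """Sub-structure sharing: 27 = 11011b, the pattern 11b is computed once."""
    tmp = a + (a << 1)          # 3 * a
    return tmp + (tmp << 3)     # 3a + 24a, two additions instead of three


def times_7_binary(a):
    """Binary representation: 7 = 111b, two additions."""
    return (a << 2) + (a << 1) + a


def times_7_csd(a):
    """CSD: 7 = 8 - 1, a single subtraction."""
    return (a << 3) - a
```

In hardware the shifts cost nothing (they are wiring), so the cost of each variant is its number of adders or subtractors.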

18

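Binary to CSD conversion

Modified CSD (MCSD): insert the symbol $\bar{1}$ only if the total number of operations is reduced.

Coefficient  Binary (TC)  CSD                    MCSD
3            11           $10\bar{1}$            11
7            111          $100\bar{1}$           $100\bar{1}$
11           1011         $10\bar{1}0\bar{1}$    1011
23           10111        $10\bar{1}00\bar{1}$   $1100\bar{1}$

Standard algorithm (flowchart):

1. Start: i = 0, c_0 = 0, b_n = b_{n-1}
2. c_{i+1} = b_{i+1} b_i ∨ b_i c_i ∨ b_{i+1} c_i
3. d_i = b_i + c_i - 2 c_{i+1}
4. i = i + 1
5. If i < n go to step 2, otherwise stop.

Modified algorithm (flowchart, recoverable steps only):

1. Start: i = 0, carry = false
2. If (b_i = 1 and carry) or (b_i = 0 and not carry): d_i = 0
3. Otherwise scan j = i+1 up to w and test Q(i,j): the conditions "0 ≤ Q(i,j) < 2" and "Q(i,j) < 2 and not (Y < 0 and j = w)" (j = w is the sign bit) select between d_i = 1, carry = false and d_i = -1, carry = true
4. i = i + 1; repeat while i ≤ w
5. If carry and B > 0: d_i = 1
6. Stop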
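The standard flowchart (c_{i+1} as the majority of b_{i+1}, b_i and c_i, followed by d_i = b_i + c_i - 2c_{i+1}) can be sketched as below. The function name `to_csd` and the LSB-first digit list are assumptions made for this illustration; the modified variant is not covered.

```python
def to_csd(b, n):
    """Convert the n-bit two's complement value b to CSD digits, LSB first."""
    bits = [(b >> i) & 1 for i in range(n)]
    bits.append(bits[n - 1])            # b_n = b_{n-1}
    c = 0                               # c_0 = 0
    digits = []
    for i in range(n):
        # c_{i+1} = b_{i+1} b_i OR b_i c_i OR b_{i+1} c_i
        c_next = (bits[i + 1] & bits[i]) | (bits[i] & c) | (bits[i + 1] & c)
        digits.append(bits[i] + c - 2 * c_next)   # d_i in {-1, 0, 1}
        c = c_next
    return digits


print(to_csd(7, 4))   # [-1, 0, 0, 1], i.e. 8 - 1
```

The result never contains two adjacent non-zero digits, which is what bounds the number of adders and subtractors.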

19

Application of different MM techniques

[Figure: stacked chart, 0% to 100%, showing the share of each technique for coefficient widths K = 3..12. Series: CSD-SS, SS, CSD, BR.]

20

The MM cost for different coefficients

[Figure: plot of area in CLBs (0..18) versus coefficient value (0..250).]

21

FIR Filters

$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k)$

[Figure: direct-form FIR filter. A delay module (chain of z^{-1} elements) turns the input x(i) into x(i), x(i-1), ..., x(i-N+1); an arithmetic module multiplies the samples by the weights and sums them (+) to give the output y(i).]
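The direct form, with a delay line followed by the arithmetic module, can be sketched as below. Integer samples and a zero initial state are assumed; the name `fir_direct` is illustrative.

```python
def fir_direct(h, x):
    """y(i) = sum over k of h(k) * x(i-k), samples before the start are 0."""
    delay = [0] * len(h)                # delay[k] holds x(i-k)
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]   # shift the delay line by one z^-1
        y.append(sum(hk * xk for hk, xk in zip(h, delay)))
    return y


print(fir_direct([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]
```

Feeding a unit impulse returns the coefficients themselves, which is a quick sanity check for any FIR implementation.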

22

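FIR Filter (transposed form)

$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k)$

[Figure: two block diagrams. In the direct form, a delay module produces x(i), x(i-1), ..., x(i-N+1), which are multiplied by h(0), h(1), ..., h(N-1) and summed to give y(i). In the transposed form, the input is multiplied by h(0), h(1), h(2) at the same time, and the products pass through a chain of adders (+) separated by z^{-1} registers to the output.]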
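In the transposed form every coefficient multiplies the current input sample, and the z^{-1} registers sit between the adders. A minimal sketch, assuming integer samples and zeroed registers at start (the name `fir_transposed` is illustrative):

```python
def fir_transposed(h, x):
    """Transposed-form FIR: same output as the direct form."""
    n = len(h)
    regs = [0] * (n - 1)                # z^-1 registers between the adders
    y = []
    for sample in x:
        products = [hk * sample for hk in h]      # all taps see the same input
        y.append(products[0] + (regs[0] if regs else 0))
        # Each register stores its tap product plus the next register's old value.
        regs = [products[k + 1] + (regs[k + 1] if k + 1 < n - 1 else 0)
                for k in range(n - 1)]
    return y
```

Because all multipliers share one input, this form suits constant-coefficient and sub-structure-sharing multipliers, and each adder is followed by a register, which shortens the critical path.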

23

FIR 2D

[Figure: 3x3 2D FIR filter. The input pixel stream passes through z^{-1} delays and two line buffers, giving the window a_{y,x} .. a_{y+2,x+2}; each window sample is multiplied by its weight w_{0,0} .. w_{2,2} and the products are summed (+) to form the output b_{y+1,x+1}.]

24

Examples of 2D FIR Filters

Low-Pass:
 1  2  1
 2  4  2
 1  2  1

Sobel:
-1 -2 -1
 0  0  0
 1  2  1

Laplace:
 1  1  1
 1 -8  1
 1  1  1
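The three kernels can be applied with a plain 3x3 sliding window. The sketch assumes the output is the weighted sum of the window without flipping the kernel, and that border pixels are dropped; the names `fir_2d`, `LOW_PASS`, `SOBEL` and `LAPLACE` are illustrative.

```python
LOW_PASS = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
SOBEL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
LAPLACE = [[1, 1, 1], [1, -8, 1], [1, 1, 1]]


def fir_2d(image, w):
    """Apply a 3x3 kernel w to a list-of-rows image; output shrinks by 2 in each axis."""
    rows, cols = len(image), len(image[0])
    out = []
    for y in range(rows - 2):
        out.append([
            sum(w[j][i] * image[y + j][x + i] for j in range(3) for i in range(3))
            for x in range(cols - 2)
        ])
    return out


edge = [[0, 0, 0, 0], [0, 0, 0, 0], [10, 10, 10, 10], [10, 10, 10, 10]]
print(fir_2d(edge, SOBEL))  # [[40, 40], [40, 40]]
```

The low-pass weights sum to 16, so a hardware version would typically divide the result by 16 with a 4-bit right shift.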

25

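FIR Filter, N = 2, LUT-based multipliers

[Figure: two-tap FIR filter with an 8-bit input and one z^{-1} delay. Each tap uses two LUTs with 4-bit addresses and 12-bit outputs (LUTM0 and LUTL0 for tap 0, LUTM1 and LUTL1 for tap 1). Two arrangements of the three adders (Adder0, Adder1, Adder2) are shown: one grouped per multiplier (Multiplier 1, Multiplier 2) and one as a single adders block. The output is 18 bits wide.]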

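FIR: arithmetic in a different order, (Parallel) Distributed Arithmetic

$\sum_{i=0}^{N-1} h_i a_i = \sum_{i=0}^{N-1} h_i \sum_{j=0}^{L-1} 2^j a_{i,j} = \sum_{j=0}^{L-1} 2^j \sum_{i=0}^{N-1} h_i a_{i,j}$

where h_i is the coefficient, a_i is the input, and a_{i,j} are the different bits of the input.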

27

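Distributed Arithmetic

[Figure: L look-up tables. The LUT for bit j is addressed by a_{0,j}, a_{1,j}, ..., a_{N-1,j} and outputs S_j with width W_DAC; the outputs S_0, S_1<<1, ..., S_{L-1}<<(L-1) are summed.]

$S_0 = \sum_{i=0}^{N-1} h_i a_{i,0}$

$S_1 = \sum_{i=0}^{N-1} h_i a_{i,1}$

...

$S_{L-1} = \sum_{i=0}^{N-1} h_i a_{i,L-1}$

$\sum_{j=0}^{L-1} 2^j S_j = \sum_{j=0}^{L-1} 2^j \sum_{i=0}^{N-1} h_i a_{i,j}$

$W_{DAC} = K + \log_2(N+1)$

$W_{LC} = K + W_{IN}$

Each LUT combines input bits of the same weight (smaller LUT widths).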
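The reordered sum can be sketched with one precomputed table that is addressed by the j-th bit of every input sample. Unsigned inputs of a known width are assumed, and the names `da_lut` and `da_dot` are illustrative.

```python
def da_lut(h):
    """LUT[addr] = sum of h[i] over every bit i that is set in addr."""
    n = len(h)
    return [sum(h[i] for i in range(n) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_dot(h, a, width):
    """Compute sum(h[i] * a[i]) for unsigned 'width'-bit inputs a[i]."""
    lut = da_lut(h)
    acc = 0
    for j in range(width):
        addr = 0
        for i, sample in enumerate(a):
            addr |= ((sample >> j) & 1) << i   # gather bit j of every input
        acc += lut[addr] << j                  # S_j weighted by 2^j
    return acc
```

The table has 2^N entries for N coefficients, so the approach is practical for a small number of taps per LUT; the parallel version on the slide uses L such tables at once instead of a loop over j.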

28

Linear Phase FIR Filters (symmetric: h(0) = h(N-1), h(1) = h(N-2), ...)
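With symmetric coefficients, the two samples that share a coefficient can be added before the multiplication, which roughly halves the number of multipliers. A minimal sketch for one output sample (the name `fir_symmetric` and the `window` layout, where `window[k]` holds x(i-k), are assumptions):

```python
def fir_symmetric(h, window):
    """One output sample of a symmetric FIR filter using pre-addition."""
    n = len(h)
    acc = 0
    for k in range(n // 2):
        # h[k] == h[n-1-k], so one multiplication serves two taps
        acc += h[k] * (window[k] + window[n - 1 - k])
    if n % 2:                                  # middle tap for odd lengths
        acc += h[n // 2] * window[n // 2]
    return acc


print(fir_symmetric([5, 13, 5], [1, 2, 3]))  # 46
```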

29

FPGA, Built-in multiplier DSP48

30

Example of sub-structure sharing for FIR filters

$H(z) = 5 + 13z^{-1} + 5z^{-2} = 101_2 + 1101_2 z^{-1} + 101_2 z^{-2}$

Example 1:

$A = 5 = 101_2$ (temporary expression)

$H(z) = A + (1000_2 + A) z^{-1} + A z^{-2}$

Example 2:

$A = 1 + z^{-1}$

$H(z) = 5A + 8z^{-1} + 5z^{-2}$
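Both factorizations can be checked against the plain filter 5 + 13z^{-1} + 5z^{-2}. The sketch below assumes integer samples and zero history before the first sample; all function names are illustrative.

```python
def taps(x, i):
    """Return x(i), x(i-1), x(i-2), with samples before the start taken as 0."""
    return [x[i - k] if i - k >= 0 else 0 for k in range(3)]


def times5(v):
    """5 = 101b: one shift and one addition."""
    return (v << 2) + v


def fir_plain(x):
    return [5 * a + 13 * b + 5 * c for a, b, c in (taps(x, i) for i in range(len(x)))]


def fir_example1(x):
    # Example 1: 13 = 1000b + 5, so the 5x term is reused inside the middle tap.
    return [times5(a) + ((b << 3) + times5(b)) + times5(c)
            for a, b, c in (taps(x, i) for i in range(len(x)))]


def fir_example2(x):
    # Example 2: A = 1 + z^-1, so x(i) + x(i-1) is formed once and multiplied by 5.
    return [times5(a + b) + (b << 3) + times5(c)
            for a, b, c in (taps(x, i) for i in range(len(x)))]
```

Example 1 shares a sub-expression inside the coefficients, while Example 2 shares it across taps; which one is cheaper depends on the filter structure.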

31

Additional materials

The END

32

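Fast multiplication in FPGAs

[Figure: adder tree. The partial products $2^k a_k b$ for k = 7..0 are formed with AND gates and added in pairs, for example $2^6 (2 a_7 b + a_6 b)$; the pair sums are combined by further adder stages (+), with optional pipeline registers between stages.]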

33

Multipliers in FPGAs

Fragment of a Virtex Configurable Logic Block (CLB)

Example (LUT input assignment):

G4 - a7
G3 - b_i
G2 - a6
G1 - b_{i+1}

F4 - a7
F3 - b_{i-1}
F2 - a6
F1 - b_i

(a7 and b_i) xor (a6 and b_{i+1})
