ernest jamro kat. elektroniki agh, kraków dep. of electronics, agh
Post on 05-Jan-2016
Hardware Implementation of Algorithms
Multipliers, Convolvers
Ernest Jamro
Dep. of Electronics, AGH, Kraków
2
Multiplication
        1 0 0 1      (9)
x       1 0 1 1      (11)
        1 0 0 1
      1 0 0 1
    0 0 0 0
+ 1 0 0 1
  1 1 0 0 0 1 1      (99)
9 x 11 = 99
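The long multiplication above can be sketched in software as a sum of shifted partial products. This is only an illustration: the function name `shift_add_multiply` and the `width` parameter are assumptions, not part of the slides.

```python
def shift_add_multiply(a: int, b: int, width: int = 4) -> int:
    """Multiply two unsigned width-bit integers by summing shifted partial products."""
    product = 0
    for j in range(width):
        if (b >> j) & 1:           # bit j of the multiplier selects a partial product
            product += a << j      # multiplicand shifted to weight 2**j
    return product


print(bin(shift_add_multiply(0b1001, 0b1011)))  # 0b1100011
```

Each loop iteration corresponds to one row of the worked example: a row is either the shifted multiplicand or all zeros.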
3
Parallel Array Multipliers
[Figure: 4x4 array multiplier. The first row consists of AND gates (&) only; every other cell combines an AND gate with an adder (& +). Inputs a0..a3 and b0..b3, outputs p0..p7.]
Cell equation: $2c_k + s_l = s_{l-1} + (a_i \cdot b_j) + c_{k-1}$
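A minimal software model of the cell equation 2·c_k + s_l = s_(l-1) + (a_i·b_j) + c_(k-1). The row-by-row ripple arrangement in `array_multiply`, and both function names, are assumptions made for illustration; the wiring in the slide's figure may differ.

```python
def cell(a_i: int, b_j: int, s_in: int, c_in: int) -> tuple[int, int]:
    """One array cell: AND gate followed by a full adder. Returns (s_out, c_out)."""
    total = s_in + (a_i & b_j) + c_in   # value in 0..3
    return total & 1, total >> 1        # 2*c_out + s_out == total


def array_multiply(a: int, b: int, n: int = 4) -> int:
    """Unsigned n x n multiplication built only from cell()."""
    s = [0] * (2 * n)                   # running sum bits, LSB first
    for j in range(n):                  # one row of cells per multiplier bit b_j
        carry = 0
        for i in range(n):
            s[i + j], carry = cell((a >> i) & 1, (b >> j) & 1, s[i + j], carry)
        s[j + n] = carry                # this bit position is still empty before row j
    return sum(bit << k for k, bit in enumerate(s))


print(array_multiply(9, 11))  # 99
```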
4
FPGA, Built-in multiplier DSP48
5
Sequential Multiplier
[Figure: sequential multiplier. Labels: operand bits A3..A0 and B3..B0; a PISO (parallel-in, serial-out) register; an adder built from full adders (FA) and a half adder (HA); a shift register of flip-flops (FF) holding the partial product Pn,0..Pn,3.]
6
Wallace Tree Multiplier (with Carry Save Adders)
In FPGAs, carry-save adders (CSA) are not recommended.
7
Multiplication of Signed Numbers
Sign-magnitude:
Standard unsigned multiplication of the magnitudes
Sign = Sign1 XOR Sign2
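A small sketch of the sign-magnitude rule: the magnitudes go through an ordinary unsigned multiplier and the result sign is the XOR of the operand signs. The tuple layout (sign bit, magnitude) and the function name are assumptions for illustration.

```python
def sign_magnitude_multiply(sign1: int, mag1: int, sign2: int, mag2: int) -> tuple[int, int]:
    """Return (sign, magnitude) of the product; sign bit 1 means negative."""
    return sign1 ^ sign2, mag1 * mag2   # unsigned multiply plus one XOR gate


print(sign_magnitude_multiply(1, 9, 0, 11))  # (1, 99), i.e. -9 * 11 = -99
```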
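Two's complement:
$$a = -a_{N-1} 2^{N-1} + \sum_{i=0}^{N-2} a_i 2^i$$
C. R. Baugh and B. A. Wooley, “A two’s complement parallel array multiplication algorithm,” IEEE Trans. Comput., vol. C-22, pp. 1045–1047, Dec. 1973.
$$a \cdot b = a_{N-1} b_{N-1} 2^{2N-2} + \left(\sum_{i=0}^{N-2} a_i 2^i\right)\left(\sum_{i=0}^{N-2} b_i 2^i\right) - 2^{N-1}\sum_{i=0}^{N-2} a_{N-1} b_i 2^i - 2^{N-1}\sum_{i=0}^{N-2} a_i b_{N-1} 2^i$$
$$-\sum_{i=0}^{N-2} a_{N-1} b_i 2^i = 1 - 2^{N-1} + \sum_{i=0}^{N-2} \overline{a_{N-1} b_i}\, 2^i$$
$$a \cdot b = a_{N-1} b_{N-1} 2^{2N-2} + \left(\sum_{i=0}^{N-2} a_i 2^i\right)\left(\sum_{i=0}^{N-2} b_i 2^i\right) + 2^{N-1}\sum_{i=0}^{N-2} \overline{a_{N-1} b_i}\, 2^i + 2^{N-1}\sum_{i=0}^{N-2} \overline{a_i b_{N-1}}\, 2^i - 2^{2N-1} + 2^N$$
Expansion used: (a1+a2)*(b1+b2) = a1b1 + a1b2 + a2b1 + a2b2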
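The Baugh-Wooley idea of replacing the two negative partial-product groups by inverted (NAND) terms plus the constant 2^N - 2^(2N-1) can be checked numerically. The sketch below is an assumption-labelled illustration; `to_bits` and `baugh_wooley` are hypothetical helper names.

```python
def to_bits(x: int, n: int) -> list[int]:
    """Two's complement bits of x, LSB first (works for negative Python ints)."""
    return [(x >> i) & 1 for i in range(n)]


def baugh_wooley(a: int, b: int, n: int = 4) -> int:
    """Signed n x n product using only non-negative partial products plus a constant."""
    A, B = to_bits(a, n), to_bits(b, n)
    total = (A[n - 1] & B[n - 1]) << (2 * n - 2)          # sign bit times sign bit
    for i in range(n - 1):
        for j in range(n - 1):
            total += (A[i] & B[j]) << (i + j)             # ordinary AND terms
    for i in range(n - 1):
        total += (1 - (A[n - 1] & B[i])) << (n - 1 + i)   # NAND terms a_(N-1)*b_i
        total += (1 - (A[i] & B[n - 1])) << (n - 1 + i)   # NAND terms a_i*b_(N-1)
    total += (1 << n) - (1 << (2 * n - 1))                # constant correction
    return total


print(baugh_wooley(-8, 7))  # -56
```

In a 2N-bit hardware result the term -2^(2N-1) is congruent to +2^(2N-1) modulo 2^(2N), so it can be realised as a constant 1 added at the most significant bit.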
8
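Two's complement multiplication
[Figure: 4x4 two's complement array multiplier. Cells marked !& invert the partial product (NAND); they appear at the end of each of the first three rows and in the first three positions of the last row. All other cells use AND (&). A constant 1 is added. Inputs a0..a3 and b0..b3, outputs p0..p7.]
Cell equation: $s_{l-1} + (a_i b_j) + c_{k-1} = 2c_k + s_l$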
9
Reduced-width multiplier
[Figure: 4x4 array multiplier built from AND gates (&) and adders (+), inputs a0..a3 and b0..b3, with a truncation line drawn across the product outputs p0..p7.]
Cell equation: $s_{l-1} + (a_i b_j) + c_{k-1} = 2c_k + s_l$
10
Truncation error compensation
[Figure: reduced array multiplier built from AND gates (&) and adders (+); labels a0..a3 and b0..b2, outputs p3..p7.]
Cell equation: $s_{l-1} + (a_i b_j) + c_{k-1} = 2c_k + s_l$
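The slides show the compensated array only as a figure, so the following is a sketch under explicit assumptions: partial-product bits below weight 2^t are simply dropped, and the compensation is a single constant close to the mean of the dropped bits. The names `truncated_product` and `mean_error` and the values t = 2 and compensation = 1 are illustrative choices, not taken from the slides.

```python
def truncated_product(a: int, b: int, n: int = 4, t: int = 2, compensation: int = 0) -> int:
    """Sum only partial-product bits a_i*b_j with weight >= 2**t, plus a constant."""
    total = compensation
    for i in range(n):
        for j in range(n):
            if i + j >= t:                                    # cell kept in the reduced array
                total += (((a >> i) & 1) & ((b >> j) & 1)) << (i + j)
    return total


def mean_error(n: int = 4, t: int = 2, compensation: int = 0) -> float:
    """Average of (truncated result - exact product) over all unsigned operand pairs."""
    pairs = [(a, b) for a in range(2 ** n) for b in range(2 ** n)]
    return sum(truncated_product(a, b, n, t, compensation) - a * b for a, b in pairs) / len(pairs)


print(mean_error(compensation=0))  # -1.25
print(mean_error(compensation=1))  # -0.25
```

Each dropped bit a_i·b_j is 1 for a quarter of all operand pairs, which gives the mean error of -(1 + 2·2)/4 = -1.25 without compensation.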
11
Constant Coefficient Multiplier
Look-Up Table (LUT): the input is the LUT address, the product is the LUT data.
Example: Y = 5*X
Address  Data
0        0
1        5
2        10
3        15
...
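The Y = 5*X table can be reproduced with a precomputed list that plays the role of the LUT. The 4-bit address width is an assumption for illustration.

```python
C = 5                                   # constant coefficient
ADDRESS_BITS = 4                        # assumed LUT address width
LUT = [C * x for x in range(1 << ADDRESS_BITS)]


def lut_multiply(x: int) -> int:
    """Multiply by the constant with a single table read, no multiplier needed."""
    return LUT[x]


print(LUT[:4])           # [0, 5, 10, 15]
print(lut_multiply(13))  # 65
```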
12
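LUT-based Multiplier, Constant Coefficient C
$Y = C \cdot A = C \cdot A(0:3) + 2^4 \cdot C \cdot A(4:7)$
[Figure: the 8-bit input is split into two 4-bit halves that address LUT A and LUT B; the two 12-bit LUT outputs are combined by an adder into the 16-bit output.]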
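A sketch of the split Y = C·A(0:3) + 2^4·C·A(4:7) for an 8-bit input and 4-bit LUT addresses. The helper names are assumptions; in hardware the two lookups use two LUT copies in parallel, while this model reuses one list.

```python
def make_lut(c: int, address_bits: int = 4) -> list[int]:
    """Table of c*x for every possible address x."""
    return [c * x for x in range(1 << address_bits)]


def lut_multiply_8bit(c: int, a: int) -> int:
    """Multiply an 8-bit input a by constant c using two 16-entry lookups and one adder."""
    lut = make_lut(c)
    low = lut[a & 0xF]            # C * A(0:3)
    high = lut[(a >> 4) & 0xF]    # C * A(4:7)
    return low + (high << 4)      # the high product carries weight 2**4


print(lut_multiply_8bit(5, 200))  # 1000
```

Two 16-entry tables replace one 256-entry table at the cost of one adder.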
13
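Different ROM sizes
Input data width = 6 bits
[Figure, three variants: a) two Mem16×1 memories and an adder, input split into 2 and 4 bits; b) one Mem16×1 memory (4 address bits) and an adder; c) one Mem32×1 memory (5 address bits) and an adder.]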
14
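Heterogeneous memory usage
Virtex memory configurations: 16×1, 32×1, 4k×1, 2k×2, 1k×4, 512×8, 256×16
Input data and coefficient width = 14
[Figure: the 14-bit input is split into two 7-bit halves; each half uses memories labelled 256×16, 32×1 and 3×16×1 and yields a 21-bit partial result; an adder combines the two partial results into the 28-bit output.]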
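Exchanging distributed RAM for BRAM
CLB → BRAM
[Figure: the 14-bit multiplier from the previous slide (memories 256×16, 32×1, 3×16×1, two 21-bit partial results, adder, 28-bit output) shown next to a variant built from LUT16×1 and LUT2×1 blocks, with a 14-bit input split into 4 + 4 + 4 + 2 bits and summed by an adder.]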
16
Area [CLB] for different input and coefficient widths K
[Plot: area versus K for K = 4 to 24 (vertical axis 0 to 12). Series: only CLB (scale 1:10); number of BRAMs; equivalent cost of 1 BRAM.]
17
MM (Multiplierless Multiplication)
• Binary Representation, example B = 14 = 1110_2
M = A·B = (A<<1) + (A<<2) + (A<<3)
• Sub-structure Sharing (SS), example B = 27 = 11011_2
tmp = A + (A<<1)
M = A·B = tmp + (tmp<<3)
• Canonic Signed Digit (CSD)
digit set {0, 1, -1} (0: no operation; 1: addition; -1, written 1̄: subtraction)
example: B = 7 = 111_2 = 1001̄_CSD
binary: M = B·A = (A<<2) + (A<<1) + A
CSD: M = (A<<3) - A
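The three examples translate directly into shift-and-add expressions. The function names below are illustrative only.

```python
def times_14_binary(a: int) -> int:
    """B = 14 = 1110b: one shifted copy of A per nonzero bit (two additions)."""
    return (a << 1) + (a << 2) + (a << 3)


def times_27_shared(a: int) -> int:
    """B = 27 = 11011b: the pattern 11b (3*A) is computed once and reused (two additions)."""
    tmp = a + (a << 1)
    return tmp + (tmp << 3)


def times_7_csd(a: int) -> int:
    """B = 7 written as 8 - 1 in CSD: one subtraction instead of two additions."""
    return (a << 3) - a


print(times_14_binary(3), times_27_shared(3), times_7_csd(3))  # 42 81 21
```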
18
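Binary and CSD
Insert the symbol 1̄ only if the total number of operations is reduced.
Coefficient  Binary (TC)  CSD      MCSD
3            11           101̄      11
7            111          1001̄     1001̄
11           1011         101̄01̄    1011
23           10111        101̄001̄   11001̄
Standard algorithm (flowchart):
• Start: i = 0, c_0 = 0, b_n = b_(n-1)
• c_(i+1) = b_(i+1)·b_i OR b_i·c_i OR b_(i+1)·c_i
• d_i = b_i + c_i - 2·c_(i+1)
• i = i + 1
• If i < n, repeat from the c_(i+1) step; otherwise stop.
Modified algorithm (flowchart; the comparison operators between i, j and w were lost in extraction and are not reconstructed):
• Start: i = 0, carry = false
• Test: (b_i = 1 and carry) or (b_i = 0 and not carry); if yes, d_i = 0
• Test on i and w
• j = i + 1
• Test on j and w
• Test: Q(i,j) < 2 and not (Y < 0 and j = w) (sign bit)
• d_i = 1, carry = false, or d_i = -1, carry = true
• i = i + 1
• Test: carry and B > 0; if yes, d_i = 1
• Stop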
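The standard conversion flowchart (c_(i+1) as the majority of b_(i+1), b_i and c_i; d_i = b_i + c_i - 2·c_(i+1)) can be sketched as follows. The function name `to_csd` and the LSB-first digit list are assumptions for illustration; the modified variant is not covered.

```python
def to_csd(b: int, n: int) -> list[int]:
    """CSD digits (LSB first, each in {-1, 0, 1}) of the n-bit two's complement value b."""
    bits = [(b >> i) & 1 for i in range(n)]
    bits.append(bits[n - 1])              # b_n = b_(n-1): sign extension
    carry = 0                             # c_0 = 0
    digits = []
    for i in range(n):
        next_carry = (bits[i + 1] & bits[i]) | (bits[i] & carry) | (bits[i + 1] & carry)
        digits.append(bits[i] + carry - 2 * next_carry)
        carry = next_carry
    return digits


print(to_csd(7, 4))   # [-1, 0, 0, 1]      i.e. 8 - 1
print(to_csd(23, 6))  # [-1, 0, 0, -1, 0, 1]  i.e. 32 - 8 - 1
```

For a positive coefficient, n must include a leading 0 sign bit, which is why 7 is converted with n = 4.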
19
Application of different MM techniques
[Plot: share of each technique (0% to 100%) versus K = 3 to 12. Series: CSD-SS, SS, CSD, BR.]
20
The MM cost for different coefficients
[Plot: cost in CLBs (0 to 18) versus coefficient value (0 to 250).]
21
FIR Filters
$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k)$$
[Figure: direct form. A delay module (chain of z^-1 registers) turns the input x(i) into x(i), x(i-1), ..., x(i-N+1); an arithmetic module multiplies these by the weights (w0, w1, w2) and sums the products (+) to give the output y(i).]
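A sample-by-sample sketch of the direct form: a delay line holds x(i), x(i-1), ..., x(i-N+1) and the arithmetic part forms the weighted sum. The function name and the zero initial state are assumptions.

```python
def fir_direct(h: list[int], x: list[int]) -> list[int]:
    """y(i) = sum over k of h(k) * x(i-k), with the delay line initially zero."""
    n_taps = len(h)
    delay = [0] * n_taps                 # delay[k] holds x(i-k)
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]    # shift the delay line by one position
        y.append(sum(h[k] * delay[k] for k in range(n_taps)))
    return y


print(fir_direct([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]  (impulse response)
```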
22
FIR Filter (transposed form)
$$y(i) = \sum_{k=0}^{N-1} h(k)\, x(i-k)$$
[Figure: transposed form. The input x(i) feeds all multipliers h(0), h(1), h(2) at once (arithmetic module); the products are accumulated through a chain of adders (+) and z^-1 registers (delay module) to give the output y(i).]
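In the transposed form every input sample is multiplied by all coefficients at once and the delays sit between the adders. The sketch below models that register chain; the variable names and zero initial state are assumptions.

```python
def fir_transposed(h: list[int], x: list[int]) -> list[int]:
    """Transposed-form FIR: registers hold partial sums instead of past inputs."""
    n_taps = len(h)
    regs = [0] * (n_taps - 1)            # regs[k] is added to the product h[k]*x(i)
    y = []
    for sample in x:
        products = [hk * sample for hk in h]          # all multipliers see the same sample
        y.append(products[0] + (regs[0] if regs else 0))
        # each register latches the next product plus the next register's old value
        regs = [products[k + 1] + (regs[k + 1] if k + 1 < n_taps - 1 else 0)
                for k in range(n_taps - 1)]
    return y


print(fir_transposed([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]
```

Because all multipliers share one operand, this form suits the constant-coefficient and multiplierless techniques from the earlier slides.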
23
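2D FIR
[Figure: 3x3 window filter. The input stream passes through three rows of two z^-1 registers, connected by line buffers, so that the nine pixels
a(y+2,x+2) a(y+2,x+1) a(y+2,x)
a(y+1,x+2) a(y+1,x+1) a(y+1,x)
a(y,x+2)   a(y,x+1)   a(y,x)
are available at once. They are multiplied by the weights
w2,2 w2,1 w2,0
w1,2 w1,1 w1,0
w0,2 w0,1 w0,0
and summed (+) to give the output b(y+1,x+1).]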
24
Examples of 2D FIR Filters
Low-Pass:
1 2 1
2 4 2
1 2 1
Sobel:
-1 -2 -1
0 0 0
1 2 1
Laplace:
1 1 1
1 -8 1
1 1 1
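A plain software sketch of the 3x3 window filter using the three example kernels. It assumes the pairing b(y+1,x+1) = sum of w(r,c)·a(y+r,x+c) and computes only positions where the whole window fits; line buffers are not modelled.

```python
LOW_PASS = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
SOBEL = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
LAPLACE = [[1, 1, 1], [1, -8, 1], [1, 1, 1]]


def fir_2d(image: list[list[int]], w: list[list[int]]) -> list[list[int]]:
    """Apply a 3x3 weight window to every fully covered pixel position."""
    rows, cols = len(image), len(image[0])
    out = []
    for y in range(rows - 2):
        out_row = []
        for x in range(cols - 2):
            acc = 0
            for r in range(3):
                for c in range(3):
                    acc += w[r][c] * image[y + r][x + c]
            out_row.append(acc)
        out.append(out_row)
    return out


flat = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
ramp = [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
print(fir_2d(flat, LOW_PASS))  # [[16]]
print(fir_2d(flat, LAPLACE))   # [[0]]
print(fir_2d(ramp, SOBEL))     # [[8]]
```

The Laplace weights sum to zero, so a constant image gives 0, while the low-pass weights sum to 16 and would normally be followed by a division by 16 (a 4-bit right shift).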
25
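FIR Filter, N = 2, LUT-based multipliers
[Figure: the 8-bit input In and its delayed copy (z^-1) are each split into two 4-bit halves that address LUTM0/LUTL0 and LUTM1/LUTL1 (12-bit outputs); Adder0, Adder1 and Adder2 combine them into the 18-bit output. A second drawing shows the same adders grouped as Multiplier 1, Multiplier 2 and a common Adders Block.]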
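FIR: arithmetic in a different order, (Parallel) Distributed Arithmetic
$$\sum_{i=0}^{N-1} h_i a_i = \sum_{i=0}^{N-1} h_i \sum_{j=0}^{L-1} 2^j a_{i,j} = \sum_{j=0}^{L-1} 2^j \sum_{i=0}^{N-1} h_i a_{i,j}$$
where $h_i$ is the coefficient, $a_i$ is the input, and $a_{i,j}$ are the different bits of the input.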
27
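Distributed Arithmetic
[Figure: bit j of every input, a_{0,j}, a_{1,j}, ..., a_{N-1,j}, addresses a LUT that returns S_j; the values S_0, S_1<<1, ..., S_{L-1}<<(L-1) are summed. The LUT output width is W_DAC.]
$$S_j = \sum_{i=0}^{N-1} h_i a_{i,j}, \qquad j = 0, 1, \ldots, L-1$$
$$\sum_{j=0}^{L-1} 2^j S_j = \sum_{j=0}^{L-1} 2^j \sum_{i=0}^{N-1} h_i a_{i,j}$$
$W_{DAC} = K + \log_2(N+1)$
$W_{LC} = K + W_{IN}$
Each LUT is addressed by input bits of the same weight (smaller LUT widths).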
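A sketch of the distributed-arithmetic reordering: one LUT holds every possible value of S_j, and each bit slice of the inputs selects one entry that is then shifted by j. Unsigned L-bit inputs and the helper names are assumptions for illustration.

```python
def build_da_lut(h: list[int]) -> list[int]:
    """Entry addr = sum of h[i] over all i whose address bit i is set."""
    n = len(h)
    return [sum(h[i] for i in range(n) if (addr >> i) & 1) for addr in range(1 << n)]


def da_inner_product(h: list[int], a: list[int], width: int) -> int:
    """Compute sum of h[i]*a[i] for unsigned width-bit inputs using only LUT reads, shifts, adds."""
    lut = build_da_lut(h)
    total = 0
    for j in range(width):
        addr = 0
        for i, a_i in enumerate(a):
            addr |= ((a_i >> j) & 1) << i    # bit j of every input forms the address
        total += lut[addr] * (1 << j)        # S_j weighted by 2**j
    return total


print(da_inner_product([3, -1, 4], [5, 7, 2], width=3))  # 16
```

The LUT has 2^N entries, so the approach is practical only for a small number of taps N per table.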
28
Linear Phase FIR Filters (symmetric: h(0)=h(N-1), h(1)=h(N-2), ...)
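With symmetric coefficients, the two samples that share a coefficient can be added first, which roughly halves the number of multiplications. The sketch below computes one output sample; the function name and argument layout are assumptions.

```python
def symmetric_fir_output(h_half: list[int], window: list[int]) -> int:
    """One output of a symmetric FIR; window = [x(i), x(i-1), ..., x(i-N+1)].

    h_half holds h(0) .. h(ceil(N/2) - 1); the rest follows from h(k) = h(N-1-k).
    """
    n = len(window)
    acc = 0
    for k in range(n // 2):
        acc += h_half[k] * (window[k] + window[n - 1 - k])   # pre-add, then one multiply
    if n % 2:                                                # odd N: unpaired middle tap
        acc += h_half[n // 2] * window[n // 2]
    return acc


print(symmetric_fir_output([5, 13], [1, 2, 3]))  # 46, same as 5*1 + 13*2 + 5*3
```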
29
FPGA, Built-in multiplier DSP48
30
Example of sub-structure sharing for FIR filters
H(z) = 5 + 13z^-1 + 5z^-2 = 101_2 + 1101_2·z^-1 + 101_2·z^-2
Example 1:
A = 5 = 101_2 (temporary expression)
H(z) = A + (1000_2 + A)z^-1 + Az^-2
Example 2:
A = 1 + z^-1
H(z) = 5A + 8z^-1 + 5z^-2
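Example 1 can be sketched as a multiplierless filter in which the term A = 5·x is formed once per sample with one shift and one addition, then reused by all three taps. The function name and the zero initial state are assumptions.

```python
def fir_5_13_5_shared(x: list[int]) -> list[int]:
    """y(i) = 5x(i) + 13x(i-1) + 5x(i-2) using shifts and adds, sharing A = 5x."""
    a_hist = [0, 0, 0]        # A(i), A(i-1), A(i-2)
    x_prev = 0                # x(i-1), needed for the 1000b (times 8) term
    y = []
    for sample in x:
        a = sample + (sample << 2)                  # A = 101b * x, computed once
        a_hist = [a] + a_hist[:-1]
        y.append(a_hist[0] + ((x_prev << 3) + a_hist[1]) + a_hist[2])
        x_prev = sample
    return y


print(fir_5_13_5_shared([1, 0, 0, 0]))  # [5, 13, 5, 0]
```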
31
Additional materials
The END
32
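Fast multiplication in FPGAs
[Figure: the partial products 2^7·a7·b, 2^6·a6·b, ..., 2^0·a0·b are formed in pairs, each pair by AND gates combined with an adder (AND +); the four pair sums are then combined by a tree of adders (+), with optional pipeline registers.]
Example for the top pair: 2^6·(2·a7·b + a6·b)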
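The pairing 2^6·(2·a7·b + a6·b) generalises to every pair of multiplier bits. The sketch below forms the four pair sums for an 8-bit a and then adds them; the function name is an assumption and pipeline registers are not modelled.

```python
def paired_multiply(a: int, b: int, width: int = 8) -> int:
    """Unsigned multiply that builds partial products two at a time, then sums the pairs."""
    pair_sums = []
    for k in range(0, width, 2):
        low = ((a >> k) & 1) * b             # a_k * b
        high = ((a >> (k + 1)) & 1) * b      # a_(k+1) * b
        pair_sums.append((2 * high + low) << k)   # 2**k * (2*a_(k+1)*b + a_k*b)
    return sum(pair_sums)                    # adder tree in hardware


print(paired_multiply(200, 37))  # 7400
```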
33
Multipliers in FPGAs
Fragment of a Virtex Configurable Logic Block (CLB)
Example:
G4 - a7
G3 - b_i
G2 - a6
G1 - b_(i+1)
F4 - a7
F3 - b_(i-1)
F2 - a6
F1 - b_i
(a7 and b_i) xor (a6 and b_(i+1))