Pipelined Compressor Tree Optimization using Integer Linear Programming
International Conference on Field Programmable Logic
03.09.2014
Martin Kumm, Peter Zipf
University of Kassel, Germany
CONTENTS
1. Introduction to Compressor Trees
2. Compressor Trees on FPGAs
3. Optimal Compressor Tree Synthesis
COMPRESSOR TREES
A compressor tree realizes the addition of many (>2) bit-shifted numbers.
The applications are versatile:
Multipliers (real, complex, squarers)
Evaluation of polynomials (e.g., for function approximation)
Linear transforms (e.g., FFT, DCT)
Digital filters
…
EXAMPLE 1: MULTI-INPUT ADDITION
Dot representation of a 5-bit, 5-input addition.
Formula: $S = \sum_i X_i$
[Figure: dot diagram of the five input vectors, bit columns weighted $2^4$ down to $2^0$]
Input vectors (decimal): 21 + 27 + 13 + 7 + 22 = 90
Column sums: $3 \cdot 2^4 + 2 \cdot 2^3 + 4 \cdot 2^2 + 3 \cdot 2^1 + 4 \cdot 2^0 = 90$
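The column view of this example can be checked with a few lines of Python. This is only an illustrative sketch: the operand list is taken from the decimal sum on the slide, and the helper name `column_heights` is made up for this example.

```python
# Illustrative sketch; operands are the decimal values from the slide.
operands = [21, 27, 13, 7, 22]
WIDTH = 5  # word size of each operand in bits

def column_heights(values, width):
    """Count the set bits in each bit column, least significant column first."""
    return [sum((v >> c) & 1 for v in values) for c in range(width)]

heights = column_heights(operands, WIDTH)

# Weight each column count by 2^c; this must equal the plain sum.
weighted_sum = sum(h << c for c, h in enumerate(heights))

print(heights[::-1])   # MSB first: [3, 2, 4, 3, 4]
print(weighted_sum)    # 90
```

The order of the operands does not matter for the result; only the number of ones per column does, which is why a compressor tree can work on column heights alone.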
EXAMPLE 2: MULTIPLIER
Dot representation of a 5x5 multiplication:
Formula:
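As a rough illustration of where the multiplier's dot diagram comes from, the sketch below assumes an unsigned n x n multiplier whose partial products are the AND terms x_i·y_j with weight 2^(i+j). The function names are hypothetical and not taken from the slides.

```python
def partial_product_heights(n):
    """Column heights of the bit heap of an n x n unsigned multiplier, LSB first."""
    heights = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(n):
            heights[i + j] += 1  # partial product x_i AND y_j has weight 2^(i+j)
    return heights

def multiply_via_partial_products(x, y, n):
    """Sum all partial products (x_i AND y_j) * 2^(i+j)."""
    return sum((((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
               for i in range(n) for j in range(n))

print(partial_product_heights(5))            # [1, 2, 3, 4, 5, 4, 3, 2, 1]
print(multiply_via_partial_products(21, 27, 5))  # 567
```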
EXAMPLE 3: ADVANCED ARITHMETIC
Sine/cosine computation: dot representation for $Z - Z^3/6$:
[Dinechin HEART’13]
BASIC COMPRESSION
Full adder / (3;2) counter:
[Figure: full adder used as (3;2) counter]
Ripple carry adder:
[Figure: chain of full adders (FA)]
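The two building blocks can be sketched in Python as follows. This is a behavioral model for illustration only; the function names are assumptions and say nothing about how the blocks are mapped to hardware.

```python
from itertools import product

def full_adder(a, b, c):
    """(3;2) counter: compress three bits of equal weight into a sum and a carry bit."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry

def ripple_carry_add(x, y, width):
    """Add two width-bit numbers by chaining full adders through the carry."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result | (carry << width)

# The counter preserves the arithmetic value: a + b + c = s + 2 * carry.
for a, b, c in product((0, 1), repeat=3):
    s, carry = full_adder(a, b, c)
    assert a + b + c == s + 2 * carry

print(ripple_carry_add(21, 27, 5))  # 48
```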
FLOW OF COMPRESSION
[Figure: dot diagram of the compression flow]
TABULAR REPRESENTATION
Each (3;2) counter removes 3 bits from one column (- 3) and returns one bit to that column and one bit to the next-higher column (+ 1 1).
    5 5 5 5 5   bits in stage 0
  5 x (3;2) counter:  - 3, + 1 1 each
= 1 4 4 4 4 3   bits in stage 1
  5 x (3;2) counter:  - 3, + 1 1 each
= 1 3 3 3 3 1   bits in stage 2
  4 x (3;2) counter:  - 3, + 1 1 each
= 2 2 2 2 1 1   bits in stage 3
  ripple carry adder: - 2 2 2 2, + 1 1 1 1 1
= 1 1 1 1 1 1 1   bits in final stage
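The bookkeeping behind one row of this table can be sketched as below. It is a minimal model under the assumption that at most one (3;2) counter is placed per column; the function name `apply_counters` is hypothetical.

```python
def apply_counters(heights, columns):
    """Apply one (3;2) counter in each listed column; heights are LSB first."""
    new = list(heights) + [0]          # room for a carry out of the top column
    for c in columns:
        assert heights[c] >= 3, "a (3;2) counter needs three bits in its column"
        new[c] -= 3                    # three input bits are consumed
        new[c] += 1                    # the sum bit stays in the same column
        new[c + 1] += 1                # the carry bit moves one column up
    while new and new[-1] == 0:
        new.pop()                      # drop an unused top column
    return new

stage0 = [5, 5, 5, 5, 5]
stage1 = apply_counters(stage0, columns=range(5))
print(stage1[::-1])   # MSB first: [1, 4, 4, 4, 4, 3]

# Step from stage 2 to stage 3 with four counters in columns 1..4.
stage3 = apply_counters([1, 3, 3, 3, 3, 1], columns=range(1, 5))
print(stage3[::-1])   # MSB first: [2, 2, 2, 2, 1, 1]
```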
APPLICATION TO FPGAS
The compression using full adders is unsuitable for FPGAs:
Mapping a full adder to FPGA LUTs is inefficient and slow (large routing delays)
The fast carry chain is not exploited
Conventional solution: ripple-carry adder tree
Delay reduction is possible by using Generalized Parallel Counters (GPCs) [Parandeh-Afshar TRETS’11]
(1,5;3) GPC ON FPGA
[Figure: dot transform of the (1,5;3) GPC and its realization with full adders (FA)]
(1,5;3) GPC mapping [Parandeh-Afshar TRETS’11]:
[Figure: slice mapping onto LUTs and carry logic]
Efficiency = bits reduced / #LUTs = (1+5-3)/3 = 1.0 [Dinechin FPL’13]
EFFICIENT GPCS ON FPGAS
(1,4,1,5;5) GPC [Kumm MBMV’14]: Efficiency = 1.5
(1,4,0,6;5) GPC [Kumm MBMV’14]: Efficiency = 1.5
(1,3,2,5;5) GPC (proposed): Efficiency = 1.5
(6,0,6;5) GPC (proposed): Efficiency = 1.75
[Figures: slice mapping of each GPC onto LUTs and carry logic]
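The efficiency values can be reproduced from the GPC shapes with the definition "bits reduced / #LUTs". The sketch below assumes a LUT cost of 3 for the (1,5;3) GPC (as in the efficiency formula on the slide) and of 4 for the four-column GPCs (as listed in the attached tables); the function name is made up for illustration.

```python
def gpc_efficiency(inputs, outputs, luts):
    """Efficiency = (number of input bits - number of output bits) / LUTs."""
    return (sum(inputs) - outputs) / luts

# name: (inputs per column, number of outputs, assumed LUT cost)
gpcs = {
    "(1,5;3)":     ((1, 5), 3, 3),
    "(1,4,1,5;5)": ((1, 4, 1, 5), 5, 4),
    "(1,4,0,6;5)": ((1, 4, 0, 6), 5, 4),
    "(1,3,2,5;5)": ((1, 3, 2, 5), 5, 4),
    "(6,0,6;5)":   ((6, 0, 6), 5, 4),
}

efficiencies = {name: gpc_efficiency(*shape) for name, shape in gpcs.items()}
for name, e in efficiencies.items():
    print(name, e)
```

The (6,0,6;5) GPC stands out because it accepts 12 input bits with the same 4 LUTs that the other four-column GPCs need for 11.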
COMPRESSOR TREE OPTIMIZATION
Problem 1:
The presented GPCs have irregular input patterns.
How should they be selected to use the fewest LUT resources?
Problem 2:
Pipelining is important on FPGAs to obtain a high throughput.
How should they be selected to use the fewest LUT/FF resources (fewest pipeline-balancing FFs)?
EXAMPLE FOR PROBLEM 1
        5 5 5 5 5   bits in stage 0
  -       1 4 1 5   (1,4,1,5;5) GPC
  +     1 1 1 1 1
  -     1 4 1 4     (1,4,1,5;5) GPC
  +   1 1 1 1 1
  =   1 6 2 2 2 1   bits in stage 1

      1 6 2 2 2 1   bits in stage 1
  -     6           (6;3) GPC
  + 1 1 1
  = 1 2 1 2 2 2 1   bits in stage 2
EXAMPLE FOR PROBLEM 2
        5 5 5 5 5   bits in stage 0
  -       2 0 4 5   (2,0,4,5;5) GPC
  +     1 1 1 1 1
  -     5 0 5       (6,0,6;5) GPC
  + 1 1 1 1 1
  -       3   1     4 FF for pipeline balancing
  +       3   1
  = 1 1 2 5 2 2 1   bits in stage 1

      1 1 2 5 2 2 1   bits in stage 1
  -   1 1 2 5         (1,3,2,5;5) GPC
  + 1 1 1 1 1
  -           2 2 1   5 FF for pipeline balancing
  +           2 2 1
  = 1 1 1 1 1 2 2 1   bits in stage 2
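The two stages of this example can be replayed with a small model that also counts the pipeline-balancing flip-flops. This is a sketch under stated assumptions: every bit not covered by a GPC is assumed to pass through exactly one flip-flop, and the function name `apply_stage` and the placement tuples are invented for illustration.

```python
def apply_stage(heights, placements):
    """Apply GPCs to a bit heap given as column heights, LSB first.

    placements: list of (used_inputs, n_outputs, lsb_column), where used_inputs
    are the bits actually taken per column, LSB first.
    Returns (next_heights, flip_flops).
    """
    remaining = list(heights)
    width = max([len(heights)] + [col + n_out for _, n_out, col in placements])
    nxt = [0] * width
    for used, n_out, col in placements:
        for i, bits in enumerate(used):
            remaining[col + i] -= bits
            assert remaining[col + i] >= 0, "GPC takes more bits than available"
        for i in range(n_out):
            nxt[col + i] += 1          # one output bit per output column
    flip_flops = sum(remaining)        # uncovered bits are registered
    for c, bits in enumerate(remaining):
        nxt[c] += bits
    return nxt, flip_flops

# Stage 0: (2,0,4,5;5) GPC at column 0 and (6,0,6;5) GPC at column 2 (5 of 6 inputs used).
stage1, ff0 = apply_stage([5, 5, 5, 5, 5], [((5, 4, 0, 2), 5, 0), ((5, 0, 5), 5, 2)])
print(stage1[::-1], ff0)   # [1, 1, 2, 5, 2, 2, 1] 4

# Stage 1: (1,3,2,5;5) GPC at column 3 (1 of 3 inputs used in its third column).
stage2, ff1 = apply_stage(stage1, [((5, 2, 1, 1), 5, 3)])
print(stage2[::-1], ff1)   # [1, 1, 1, 1, 1, 2, 2, 1] 5
```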
PROPOSED OPTIMIZATION
A generic ILP optimizer was used.
The main idea of the ILP formulation is to count the GPCs in each column [Matsunaga’13] and to "cover" all bits in each stage by GPCs.
For that, a "pseudo compressor" with one input and one output is introduced (no compression).
To optimize a combinatorial compressor tree (problem 1), its cost is set to zero (a wire).
To optimize a pipelined compressor tree (problem 2), its cost is set to the flip-flop cost.
ILP FORMULATION
ILP variables:
No. of bits in stage $s$ and column $c$: $N_{s,c}$
No. of GPCs in stage $s$, of type $e$ and column $c$: $k_{s,e,c}$
No. of inputs and outputs of GPC (type $e$) in column $c$: $M_{e,c}$ and $K_{e,c}$, respectively
LUT cost of GPC $e$: $c_e$
Binary variable to select the active stage: $D_s = \begin{cases} 1 & \text{if stage } s \text{ is used} \\ 0 & \text{otherwise} \end{cases}$
minimize
$\sum_{s=0}^{S-1} \sum_{c=0}^{C-1} \sum_{e=0}^{E-1} c_e \, k_{s,e,c}$
subject to
C1: $N_{s-1,c} \le \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} M_{e,c+c'} \, k_{s-1,e,c+c'}$ for $s = 1, \ldots, S-1$, $c = 0, \ldots, C-1$, if $D_s = 0$
C2: $N_{s,c} = \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} K_{e,c+c'} \, k_{s-1,e,c+c'}$ for $s = 1, \ldots, S-1$, $c = 0, \ldots, C-1$
C3: $N_{s,c} \le \begin{cases} 2 & \text{for two-input VMA} \\ 3 & \text{for ternary VMA} \end{cases}$ if $D_s = 1$
C4: $\sum_{s=1}^{S-1} D_s = 1$
C1': $N_{s-1,c} \le \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} M_{e,c+c'} \, k_{s-1,e,c+c'} + I \, D_s$
C3': $N_{s,c} \le \begin{cases} 2 + (1 - D_s) \, I & \text{for two-input VMA} \\ 3 + (1 - D_s) \, I & \text{for ternary VMA} \end{cases}$
C1 and C3 have to be linearized (C1' and C3'): $I$ must be a sufficiently large integer.
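The effect of the large integer I can be illustrated with a small check that the linearized form of C3 behaves like the conditional form. This is a sketch, not part of any solver setup: the value chosen for `BIG_I` and the function names are assumptions, and the comparison only holds for column heights below `BIG_I`.

```python
BIG_I = 1000  # assumed "sufficiently large" constant, above any column height that can occur

def c3_conditional(n_bits, d_s, vma_inputs=2):
    """Conditional form: the height limit only applies if D_s = 1."""
    return n_bits <= vma_inputs if d_s == 1 else True

def c3_linearized(n_bits, d_s, vma_inputs=2, big_i=BIG_I):
    """Linearized form: N <= vma_inputs + (1 - D_s) * I."""
    return n_bits <= vma_inputs + (1 - d_s) * big_i

# For D_s = 0 the right-hand side grows by I, so the constraint never binds;
# for D_s = 1 it reduces to the plain height limit of the final adder.
agree = all(c3_conditional(n, d) == c3_linearized(n, d)
            for n in range(BIG_I) for d in (0, 1))
print(agree)  # True
```

C1' uses the same device with the opposite polarity: adding I·D_s to the right-hand side switches the covering constraint off when D_s = 1.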
RESULTS
… the full-adder of the carry chain. The shown XOR gates are necessary to complete the carry logic to a ripple carry adder (RCA). A similar structure is used in the second LUT, but now the two carry bits are computed and fed to the RCA. This structure is repeated, leading to the (6, 0, 6; 5) GPC. Its efficiency is E = 1.75, which is the highest efficiency reported so far. Even the ternary adder or the 4:2 compressor have a lower efficiency of E = 1.5 for the same size (k = 4). Its critical path consists only of a single LUT delay plus four stages of fast carry propagation. A GPC with a different input configuration is shown in Fig. 5(b), namely the (1, 3, 2, 5; 5) GPC. Although it has a lower efficiency of E = 1.5, it may be favorable in cases where not all of the inputs of the (6, 0, 6; 5) GPC are utilized. The delay is identical to that of the (6, 0, 6; 5) GPC. Note that the carry-in of the (6, 0, 6; 5) GPC cannot be used as an additional input due to routing constraints within the slice (when the 0-input of the carry-chain MUX is fed from a slice input).
VI. RESULTS
The proposed ILP formulation was integrated within the open-source arithmetic core generator FloPoCo [15], a framework that supports the handling of compressor trees as a bit heap (including signed number support) [10] as well as VHDL code generation and automated tests. It also includes the recently proposed heuristic compression method [10], which makes it well suited for comparisons, as both methods work on identical data structures and use the same VHDL generation. To be able to provide our method as an open-source tool, it was decided to use the open-source ILP solver SCIP [14], although it is well known that the commercial CPLEX ILP optimizer is much faster (we observed speedups of about 10×, which is confirmed by a benchmark provided at [14]).
A. Evaluation of the Optimization Quality
To evaluate the performance of the compression, we implemented a multiple-input adder with a variable number of inputs as well as a variable word size. We chose this type of circuit because it uses only the compressor tree plus an additional VMA at the output. As VMA we chose a common two-input adder for performance reasons. In the experiments we target Virtex 4 and Virtex 6 FPGAs from Xilinx as candidates with different LUT input sizes. For Virtex 4 (4-input LUTs), we used the same LUT-based GPCs that are used in the FloPoCo framework, namely the (3; 2), (4; 3) and (1, 3; 3) GPCs with LUT cost 2, 3 and 3, respectively. FloPoCo allows the specification of a target frequency to decide how many pipeline stages are used. This frequency was set to 600 MHz for the heuristic to yield similar timing results and thus comparable resource consumption. The number of inputs as well as the word length were varied from 4 to 16, leading to rectangular bit heaps of size 16 to 256 bit. As the ILP optimization may be very time-consuming, SCIP was set to a time limit of 1 hour, which we considered reasonable. In most cases, a valid solution that already outperformed the heuristic results was found within seconds or minutes. The current implementation allows the optimization to be interrupted at any point, and VHDL code is generated for the best solution found so far (if any).
Fig. 6: Resulting number of LUTs over the number of input bits using the heuristic [10] and the proposed ILP model for (a) Virtex 4 and (b) Virtex 6 FPGAs (axes: compressed bits vs. #LUT)
The resulting LUT costs from the optimization using the heuristic [10] and the proposed ILP method (respecting the flip-flop cost for pipelining) are shown in Fig. 6(a). It can be observed that the LUT costs follow a fairly linear trend related to the number of bits, independent of the method used. However, the proposed method has a much lower gradient of 1.9 LUT/bit compared to 2.5 LUT/bit. The average LUT reduction is 22.8%. Up to a complexity of 100 bits, an optimal solution was always found within the given time limit, often within a few seconds. As the trend in Fig. 6 continues for higher complexities, it can be assumed that the non-optimal solutions are not too far from being optimal.
The same procedure was applied to Virtex 6 FPGAs, which allow the use of 6-input LUTs. Here, many more LUT-based GPCs are possible, and some of them use the fact that 6-input LUTs can be configured as two 5-input LUTs with shared inputs (which is the case for GPCs with five or fewer inputs). The LUT-based GPCs used from the FloPoCo framework have the configurations (6; 3), (1, 5; 3), (5; 3), (1, 4; 3), (4; 3), (2, 3; 3), (1, 3; 3). In addition to that, we used the fastest of the Virtex 6 optimized GPCs from [12], namely (1, 4, 1, 5; 5), (1, 4, 0, 6; 5) and (2, 0, 4, 5; 5), as well as the (1, 3, 2, 5; 5) and (6, 0, 6; 5) GPCs proposed above for the ILP optimization. The results …
[Figure: result plots for the Virtex 4 FPGA and the Virtex 6 FPGA]
The required LUTs could be reduced by 23% (Virtex 4) and 30% (Virtex 6) compared to Dinechin (FPL’13) [8]
The slice reduction after synthesis was 12.5% (Virtex 4) and 19.5% (Virtex 6).
EXAMPLE COMPRESSION TREE WITH 16 INPUTS, 16 BIT EACH
[Figure: compressor trees generated by FloPoCo [Dinechin FPL’13] and by the proposed ILP]
CONCLUSION & OUTLOOK
A novel ILP formulation for the optimization of pipelined compressor trees was presented
There is a notable gap between the former state-of-the-art heuristic and our optimal solution
Extensions are proposed for minimal stage count or variable column counters like 4:2 compressors
Good heuristics are still required for problem sizes >100 bit due to the runtime of the ILP solver
So far there is no heuristic considering pipelining
THANK YOU!
LITERATURE
[Parandeh-Afshar TRETS’11]: H. Parandeh-Afshar, A. Neogy, P. Brisk, and P. Ienne, “Compressor Tree Synthesis on Commercial High-Performance FPGAs,” ACM TRETS, 2011.
[Dinechin HEART’13]: F. de Dinechin, M. Istoan, and G. Sergent, “Fixed-Point Trigonometric Functions on FPGAs,” HEART 2013, Jun. 2013.
[Dinechin FPL’13]: N. Brunie, F. de Dinechin, M. Istoan, G. Sergent, K. Illyes, and B. Popa, “Arithmetic Core Generation Using Bit Heaps,” FPL 2013.
[Matsunaga’13]: T. Matsunaga, S. Kimura, and Y. Matsunaga, “An Exact Approach for GPC-Based Compressor Tree Synthesis,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Dec. 2013.
ATTACHMENTS
DETAILED RESULTS VIRTEX 4
              Heuristic [Dinechin FPL’13]            Proposed ILP
Size [bits]   LUT4    FF      Slices  fmax [MHz]     LUT4    FF      Slices  fmax [MHz]
16            34      20      25      501.5          28      21      25      562.4
25            45      39      29      455.2          46      45      39      562.1
36            78      63      59      489.5          54      56      35      491.4
49            123     86      73      444.8          79      78      46      481.9
64            181     108     109     412.9          123     120     100     471.5
81            209     132     117     420.7          141     135     106     477.8
100           267     173     174     414.8          181     178     109     454.6
121           332     182     181     332.6          242     247     211     435.4
144           395     243     255     376.2          272     273     223     441.1
169           492     283     277     344.8          309     317     197     428.3
196           582     328     368     355.0          407     416     340     423.2
225           622     345     410     333.9          444     451     349     424.3
256           706     386     459     343.3          506     518     438     410.3
Avg.:         312.8   183.7   195.1   401.9          217.8   219.6   170.6   466.5
Imp.:         -       -       -       -              30.3%   -19.6%  12.5%   16.1%
DETAILED RESULTS VIRTEX 6
              Heuristic [Dinechin FPL’13]            Proposed ILP
Size [bits]   LUT6    FF      Slices  fmax [MHz]     LUT6    FF      Slices  fmax [MHz]
16            12      7       3       478.0          10      9       3       639.4
25            24      11      6       636.5          26      25      7       452.9
36            32      13      9       595.6          27      36      7       603.1
49            44      15      12      492.4          35      40      10      407.7
64            59      19      16      407.7          47      48      13      506.8
81            76      21      20      442.9          56      59      15      480.1
100           96      47      26      435.9          77      98      20      437.5
121           116     26      32      401.6          89      112     25      438.6
144           134     28      35      383.9          94      121     24      469.0
169           161     60      43      396.8          119     155     30      470.6
196           189     76      50      358.0          131     160     35      408.0
225           216     81      56      327.2          192     236     57      364.0
256           251     74      66      338.3          204     251     55      372.3
Avg.:         108.5   36.8    28.8    438.1          85.2    103.8   23.2    465.4
Imp.:         -       -       -       -              21.5%   -182.4% 19.5%   6.2%
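The "Avg." and "Imp." entries of the LUT columns can be recomputed from the rows of the two tables. The sketch below assumes that "Imp." is the relative reduction of the column averages, which matches the tabulated values; the helper names are made up for this example.

```python
# LUT counts copied from the two tables (sizes 16 ... 256 bits).
lut4_heuristic = [34, 45, 78, 123, 181, 209, 267, 332, 395, 492, 582, 622, 706]
lut4_ilp       = [28, 46, 54, 79, 123, 141, 181, 242, 272, 309, 407, 444, 506]
lut6_heuristic = [12, 24, 32, 44, 59, 76, 96, 116, 134, 161, 189, 216, 251]
lut6_ilp       = [10, 26, 27, 35, 47, 56, 77, 89, 94, 119, 131, 192, 204]

def mean(values):
    return sum(values) / len(values)

def improvement(baseline, proposed):
    """Relative reduction of the average, in percent (assumed definition of 'Imp.')."""
    return 100.0 * (1.0 - mean(proposed) / mean(baseline))

print(round(mean(lut4_heuristic), 1), round(mean(lut4_ilp), 1),
      round(improvement(lut4_heuristic, lut4_ilp), 1))
print(round(mean(lut6_heuristic), 1), round(mean(lut6_ilp), 1),
      round(improvement(lut6_heuristic, lut6_ilp), 1))
```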
EFFICIENT GPCS ON FPGAS
GPC / Compressor   #LUT6 (k)   Efficiency E = (bits reduced)/k   Delay
LUT-based GPCs from [Dinechin FPL’13]:
(3;2) GPC          1           1                                 τ_L ≈ τ
(6;3) GPC          3           1                                 τ_L ≈ τ
(1,5;3) GPC        3           1                                 τ_L ≈ τ
Improved GPC mappings from [Parandeh-Afshar TRETS’11]:
(6;3) GPC          3           1                                 2τ_L + τ_R + 3τ_CC ≈ 3τ
(1,5;3) GPC        2           1.5                               τ_L + 2τ_CC ≈ τ
(2,3;3) GPC        2           1                                 τ_L + 2τ_CC ≈ τ
(7;3) GPC          3           1.33                              2τ_L + τ_R + 3τ_CC ≈ 3τ
(5,3;4) GPC        3           1.33                              2τ_L + τ_R + 3τ_CC ≈ 3τ
(6,2;4) GPC        3           1.33                              2τ_L + τ_R + 3τ_CC ≈ 3τ
EFFICIENT GPCS ON FPGAS
GPC / Compressor   #LUT6 (k)   Efficiency E = (bits reduced)/k   Delay
GPCs and 4:2 compressor from [Kumm MBMV’14]:
(5,0,6;5) GPC      4           1.5                               τ_L + 4τ_CC ≈ τ
(1,4,1,5;5) GPC    4           1.5                               τ_L + 4τ_CC ≈ τ
(1,4,0,6;5) GPC    4           1.5                               τ_L + 4τ_CC ≈ τ
(2,0,4,5;5) GPC    4           1.5                               2τ_L + τ_R + 4τ_CC ≈ 3τ
4:2 compressor     k           2 - 2/k                           τ_L + kτ_CC
Adder with k BLE:
2-input adder      k           1                                 τ_L + kτ_CC
3-input adder      k           2 - 2/k                           2τ_L + τ_R + kτ_CC ≈ 3τ + kτ_CC
Proposed GPCs:
(6,0,6;5) GPC      4           1.75                              τ_L + 4τ_CC ≈ τ
(1,3,2,5;5) GPC    4           1.5                               τ_L ≈ τ
EFFICIENT GPCS ON FPGAS
(2,0,4,5;5) GPC [Kumm MBMV’14]: Efficiency = 1.5
[Figure: slice mapping of the GPC onto LUTs and carry logic]
4:2 COMPRESSOR
[Figure: slice mapping of the 4:2 compressor onto LUTs and carry logic] [Kumm MBMV’14]
PROPOSED OPTIMIZATION
We developed an ILP optimizer.
The main idea of the ILP formulation is to "cover" all bits in each stage by GPCs.
For that, a "pseudo element" $e'$ is introduced for which $M_{e',c} = 1$ and $K_{e',c} = 1$ (no compression).
In case of a combinatorial compressor tree (problem 1), we set its cost to $c_{e'} = 0$ (wire).
In case of a pipelined compressor tree (problem 2), $c_{e'}$ corresponds to the flip-flop cost.
(7;3) COMPRESSOR
[Figure: slice mapping of the (7;3) compressor onto LUTs and carry logic]
TERNARY ADDERS
A ternary adder realizes the operation $s = x + y + z$.
It can be realized as a cascade of two ripple carry adders:
[Figure: two cascaded rows of full adders (FA)]
TERNARY ADDERS
Using the 1st full adder stage as 3:2 compressor removes the carry chain:
[Figure: two rows of full adders (FA)]
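The idea of replacing the first adder by a row of 3:2 compressors can be modeled at word level as follows. This is a behavioral sketch only; the function name is hypothetical and the code makes no claim about the LUT mapping.

```python
def ternary_add(x, y, z):
    """s = x + y + z via a row of 3:2 compressors followed by one two-input addition."""
    partial_sum = x ^ y ^ z                       # sum outputs of the full adders
    carries = ((x & y) | (x & z) | (y & z)) << 1  # carry outputs, weighted by two
    return partial_sum + carries                  # the only carry-propagating addition

print(ternary_add(21, 27, 13))  # 61
```

The first stage has no carry chain because every full adder works on its own bit position; only the final two-input addition propagates carries.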