pipelined compressor tree optimization using integer ...martin-kumm.de/slides/2014_09_03_fpl.pdf ·...

Pipelined Compressor Tree Optimization using Integer Linear Programming

International Conference on Field Programmable Logic

03.09.2014

Martin Kumm, Peter Zipf

University of Kassel, Germany


CONTENTS

1. Introduction to Compressor Trees

2. Compressor Trees on FPGAs

3. Optimal Compressor Tree Synthesis

COMPRESSOR TREES

A compressor tree realizes the addition of many (>2) bit-shifted numbers.

The applications are versatile:

Multiplier (real, complex, squarer)

Evaluation of polynomials (e.g., for function approximation)

Linear transforms (e.g., FFT, DCT)

Digital filters

…

EXAMPLE 1: MULTI-INPUT ADDITION

Dot representation of a 5-bit, 5-input addition:

Formula: $S = \sum_i X_i$

Input vectors (one row per input, most significant bit first):

        2^4  2^3  2^2  2^1  2^0
         1    0    1    1    0      22
         0    0    1    1    1     + 7
         0    1    1    0    1     +13
         1    1    0    1    1     +27
         1    0    1    0    1     +21
                                   = 90

Column sums: $3 \cdot 2^4 + 2 \cdot 2^3 + 4 \cdot 2^2 + 3 \cdot 2^1 + 4 \cdot 2^0 = 90$
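The column view of this example can be reproduced with a few lines of Python. This is only an illustrative sketch: it takes the decimal input values 22, 7, 13, 27 and 21 from the example, counts the dots per column, and recombines the counts with their weights. The variable names are chosen here for illustration.

```python
values = [22, 7, 13, 27, 21]  # input vectors of the example
width = 5                     # word size in bits

# heights[c] = number of dots (ones) in the column of weight 2**c
heights = [sum((v >> c) & 1 for v in values) for c in range(width)]

# Recombine the column counts with their weights.
total = sum(h << c for c, h in enumerate(heights))

print(heights[::-1], total)  # prints [3, 2, 4, 3, 4] 90
```

The printed list is ordered from the most significant column to the least significant one, matching the order of the weighted sum above.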

EXAMPLE 2: MULTIPLIER

Dot Representation 5x5 Multiplication:

Formula:

EXAMPLE 3: ADVANCED ARITHMETIC

Sine/cosine computation: dot representation for $Z - Z^3/6$:

[Dinechin HEART’13]

BASIC COMPRESSION

Full adder / (3;2) counter: [figure]

Ripple carry adder: [figure: chain of full adders]

FLOW OF COMPRESSION

[Figure]

TABULAR REPRESENTATION

Each (3;2) counter removes 3 bits from one column (-3) and returns 1 sum bit to the same column and 1 carry bit to the next higher column (+1 1).

    Stage 0:          5 5 5 5 5   bits
      5 x (3;2) counter, each -3 and +1 1
    Stage 1:        1 4 4 4 4 3   bits
      5 x (3;2) counter, each -3 and +1 1
    Stage 2:        1 3 3 3 3 1   bits
      4 x (3;2) counter, each -3 and +1 1
    Stage 3:        2 2 2 2 1 1   bits
      ripple carry adder: -2 2 2 2 and +1 1 1 1 1
    Final stage:  1 1 1 1 1 1 1   bits
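The flow of compression can also be followed at bit level. The sketch below is a minimal illustration, not the schedule shown on the slides: it assumes a greedy strategy that places as many full adders as fit in every column of every stage, and it models the final adder simply as a weighted sum of the remaining bits. The function name `compress` is hypothetical.

```python
def compress(values, width):
    """Reduce a bit heap with full adders, then add the remaining rows."""
    # columns[c] holds the bits of weight 2**c
    columns = [[(v >> c) & 1 for v in values] for c in range(width)]
    while max(len(col) for col in columns) > 2:
        nxt = [[] for _ in range(len(columns) + 1)]
        for c, col in enumerate(columns):
            while len(col) >= 3:  # one full adder = one (3;2) counter
                a, b, d = col.pop(), col.pop(), col.pop()
                nxt[c].append(a ^ b ^ d)                        # sum bit
                nxt[c + 1].append((a & b) | (a & d) | (b & d))  # carry bit
            nxt[c].extend(col)  # leftover bits pass through unchanged
        if not nxt[-1]:
            nxt.pop()           # drop an unused top column
        columns = nxt
    # Final adder: at most two bits per column remain.
    return sum(bit << c for c, col in enumerate(columns) for bit in col)


result = compress([22, 7, 13, 27, 21], 5)
print(result)  # prints 90
```

Because every full adder satisfies a + b + d = sum + 2·carry, the value of the bit heap is preserved in every stage, whatever schedule is chosen.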

APPLICATION TO FPGAS

The compression using full adders is unsuitable for FPGAs:

Mapping of a full adder on FPGA LUTs is inefficient and slow (➯ large routing delays)

Fast carry chain is not exploited

Conventional Solution: Ripple-carry adder tree

Delay reduction possible by using Generalized Parallel Counters (GPCs) [Parandeh–Afshar TRETS’11]


(1,5;3) GPC ON FPGA

[Figure: dot transform of the (1,5;3) GPC and its realization with full adders]

(1,5;3) GPC Mapping [Parandeh-Afshar TRETS’11]:

Efficiency = bits reduced/#LUTs = (1+5-3)/3 = 1.0 [Dinechin FPL’13]
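The efficiency metric is easy to evaluate for any GPC. The following sketch applies the definition above (bits reduced per LUT); the helper name `gpc_efficiency` is hypothetical, and the LUT counts are the ones quoted on the slides.

```python
def gpc_efficiency(inputs, outputs, luts):
    """Bits reduced per LUT for a GPC with the given input columns."""
    return (sum(inputs) - outputs) / luts


print(gpc_efficiency((1, 5), 3, 3))        # (1,5;3) GPC with 3 LUTs -> 1.0
print(gpc_efficiency((1, 4, 1, 5), 5, 4))  # (1,4,1,5;5) GPC with 4 LUTs -> 1.5
print(gpc_efficiency((6, 0, 6), 5, 4))     # (6,0,6;5) GPC with 4 LUTs -> 1.75
```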


EFFICIENT GPCS ON FPGAS

(Each GPC is shown mapped to the LUTs and the carry logic of one slice.)

(1,4,1,5;5) GPC [Kumm MBMV’14]: Efficiency = 1.5

(1,4,0,6;5) GPC [Kumm MBMV’14]: Efficiency = 1.5

(1,3,2,5;5) GPC (proposed): Efficiency = 1.5

(6,0,6;5) GPC (proposed): Efficiency = 1.75


COMPRESSOR TREE OPTIMIZATION

Problem 1:

The presented GPCs have irregular input patterns.

How to select them to get the least LUT resources?

Problem 2:

Pipelining is important on FPGAs to obtain a high throughput.

How to select them to get the least LUT/FF resources (least pipeline balancing FFs)?

EXAMPLE FOR PROBLEM 1

          5 5 5 5 5   bits in stage 0
    -       1 4 1 5   } (1,4,1,5;5) GPC
    +     1 1 1 1 1
    -     1 4 1 4     } (1,4,1,5;5) GPC
    +   1 1 1 1 1
    =   1 6 2 2 2 1   bits in stage 1

        1 6 2 2 2 1   bits in stage 1
    -     6           } (6;3) GPC
    + 1 1 1
    = 1 2 1 2 2 2 1   bits in stage 2
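The bookkeeping of the Problem 1 example can be checked with a small column-height simulation. This is a sketch under stated assumptions: a GPC takes as many bits as are available up to its input capacity in each column, unused bits pass to the next stage as wires, and the function name `apply_gpcs` and the placement tuples are hypothetical.

```python
def apply_gpcs(heights, placements):
    """Apply GPCs to one stage of column heights (least significant first).

    Each placement is (inputs, outputs, column): `inputs` is the input
    capacity per column in the slide notation (most significant first),
    `outputs` is the number of output bits, `column` is the LSB position.
    """
    remaining = list(heights)
    result = [0] * (len(heights) + max(o for _, o, _ in placements))
    for inputs, outputs, column in placements:
        for j, capacity in enumerate(reversed(inputs)):
            if column + j < len(remaining):
                remaining[column + j] -= min(capacity, remaining[column + j])
        for j in range(outputs):
            result[column + j] += 1  # one output bit per column
    for c, bits in enumerate(remaining):
        result[c] += bits            # uncompressed bits pass through
    while len(result) > 1 and result[-1] == 0:
        result.pop()                 # drop empty top columns
    return result


stage0 = [5, 5, 5, 5, 5]
stage1 = apply_gpcs(stage0, [((1, 4, 1, 5), 5, 0), ((1, 4, 1, 5), 5, 1)])
stage2 = apply_gpcs(stage1, [((6,), 3, 4)])
print(stage1[::-1])  # prints [1, 6, 2, 2, 2, 1]
print(stage2[::-1])  # prints [1, 2, 1, 2, 2, 2, 1]
```

The lists are printed most significant column first, so they can be compared directly with the "bits in stage 1" and "bits in stage 2" rows.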

EXAMPLE FOR PROBLEM 2

            5 5 5 5 5   bits in stage 0
    -         2 0 4 5   } (2,0,4,5;5) GPC
    +       1 1 1 1 1
    -       5 0 5       } (6,0,6;5) GPC
    +   1 1 1 1 1
    -         3   1     } 4 FF for pipeline balancing
    +         3   1
    =   1 1 2 5 2 2 1   bits in stage 1

        1 1 2 5 2 2 1   bits in stage 1
    -   1 1 2 5         } (1,3,2,5;5) GPC
    + 1 1 1 1 1
    -           2 2 1   } 5 FF for pipeline balancing
    +           2 2 1
    = 1 1 1 1 1 2 2 1   bits in stage 2

PROPOSED OPTIMIZATION

A generic ILP optimizer was used.

The main idea of the ILP formulation is to count GPCs for each column [Matsunaga’13] and to 'cover' all bits in each stage by GPCs.

For that, a 'pseudo compressor' with one input and one output is introduced (no compression).

To optimize a combinatorial compressor tree (problem 1), its cost is set to zero (a wire).

To optimize a pipelined compressor tree (problem 2), its cost is set to the flip-flop cost.

ILP FORMULATION

ILP variables:

No. of bits in stage s and column c: $N_{s,c}$

No. of GPCs in stage s, of type e and column c: $k_{s,e,c}$

No. of inputs and outputs of GPC (type e) in column c: $M_{e,c}$ and $K_{e,c}$, respectively

LUT cost of GPC e: $c_e$

Binary variable to select the active stage: $D_s = 1$ if stage s is used, $D_s = 0$ otherwise

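minimize

$$\sum_{s=0}^{S-1} \sum_{c=0}^{C-1} \sum_{e=0}^{E-1} c_e \, k_{s,e,c}$$

subject to

C1: $$N_{s-1,c} \le \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} M_{e,c+c'} \, k_{s-1,e,c+c'} \quad \text{for } s = 1 \ldots S-1,\ c = 0 \ldots C-1, \text{ if } D_s = 0$$

C2: $$N_{s,c} = \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} K_{e,c+c'} \, k_{s-1,e,c+c'} \quad \text{for } s = 1 \ldots S-1,\ c = 0 \ldots C-1$$

C3: $$N_{s,c} \le \begin{cases} 2 & \text{for two-input VMA} \\ 3 & \text{for ternary VMA} \end{cases} \quad \text{if } D_s = 1$$

C4: $$\sum_{s=1}^{S-1} D_s = 1$$

C1 and C3 have to be linearized; I must be a sufficiently large integer:

C1': $$N_{s-1,c} \le \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} M_{e,c+c'} \, k_{s-1,e,c+c'} + I D_s$$

C3': $$N_{s,c} \le \begin{cases} 2 + (1 - D_s) I & \text{for two-input VMA} \\ 3 + (1 - D_s) I & \text{for ternary VMA} \end{cases}$$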
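To make the roles of the covering constraint (C1), the bit-count equation (C2) and the objective concrete, the following sketch evaluates them for the first stage of the Problem 2 example. It is not a solver and not the authors' implementation. It assumes one particular reading of the indices (a GPC placed at column c offers its j-th input column to heap column c + j and writes its j-th output to column c + j), a LUT cost of 4 for both GPCs, and an illustrative flip-flop cost of 1 for the pseudo compressor. All names (`GPCS`, `offered_inputs`, `next_heights`, `cost`) are hypothetical.

```python
# GPC library: name -> (inputs per column, least significant first;
#                       number of outputs; cost)
GPCS = {
    "(2,0,4,5;5)": ((5, 4, 0, 2), 5, 4),
    "(6,0,6;5)": ((6, 0, 6), 5, 4),
    "pseudo": ((1,), 1, 1),  # one input, one output; assumed FF cost of 1
}


def offered_inputs(k, num_cols):
    """Right-hand side of C1: inputs offered to each column."""
    offered = [0] * num_cols
    for (name, col), count in k.items():
        for j, m in enumerate(GPCS[name][0]):
            if col + j < num_cols:
                offered[col + j] += m * count
    return offered


def next_heights(k, num_cols):
    """C2: bits per column in the next stage."""
    heights = [0] * num_cols
    for (name, col), count in k.items():
        for j in range(GPCS[name][1]):
            heights[col + j] += count
    return heights


def cost(k):
    """Objective contribution of one stage."""
    return sum(GPCS[name][2] * count for (name, _), count in k.items())


n0 = [5, 5, 5, 5, 5]  # bits in stage 0, least significant column first
# k0[(type, column)] = number of GPCs of that type placed at that column
k0 = {("(2,0,4,5;5)", 0): 1, ("(6,0,6;5)", 2): 1,
      ("pseudo", 1): 1, ("pseudo", 3): 3}

covered = all(n <= m for n, m in zip(n0, offered_inputs(k0, len(n0))))
n1 = next_heights(k0, 7)
print(covered, n1[::-1], cost(k0))  # prints True [1, 1, 2, 5, 2, 2, 1] 12
```

The four pseudo compressors correspond to the 4 FFs for pipeline balancing in the example; with a pseudo-compressor cost of zero the same check describes the combinatorial case.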

RESULTS

[…] the full-adder of the carry chain. The shown XOR gates are necessary to complete the carry logic to a ripple carry adder (RCA). A similar structure is used in the second LUT, but now the two carry bits are computed and fed to the RCA. This structure is repeated, leading to the (6, 0, 6; 5) GPC. Its efficiency is E = 1.75, which is the highest efficiency reported so far. Even the ternary adder or the 4:2 compressor have a lower efficiency of E = 1.5 for the same size (k = 4). Its critical path only consists of a single LUT delay plus four stages of fast carry propagation. A GPC with a different input configuration is shown in Fig. 5(b), namely the (1, 3, 2, 5; 5) GPC. Although it has a lower efficiency of E = 1.5, it may be favorable in cases where not all of the inputs of the (6, 0, 6; 5) GPC are utilized. The delay is identical to that of the (6, 0, 6; 5) GPC. Note that the carry-in of the (6, 0, 6; 5) GPC cannot be used as an additional input due to routing constraints within the slice (when the 0-input of the carry-chain MUX is fed from a slice input).

VI. RESULTS

The proposed ILP formulation was integrated within the open-source arithmetic core generator FloPoCo [15], a framework that supports the handling of compressor trees as a bit heap (including signed number support) [10] as well as VHDL code generation and automated tests. It also includes the recently proposed heuristic compression method [10], which makes it well suited for comparisons, as both methods work on identical data structures and use the same VHDL generation. To be able to provide our method as an open-source tool, it was decided to use the open-source ILP solver SCIP [14], although it is well known that the commercial CPLEX ILP optimizer is much faster (we observed speedups of about 10×, which is confirmed by a benchmark provided at [14]).

A. Evaluation of the Optimization Quality

To evaluate the performance of the compression, we implemented a multiple-input adder with a variable number of inputs as well as a variable word size. We chose this type of circuit because it uses only the compressor tree plus an additional VMA at the output. As VMA we chose a common two-input adder for performance reasons. In the experiments we target Virtex 4 and Virtex 6 FPGAs from Xilinx as candidates with different LUT input sizes. For Virtex 4 (4-input LUTs), we used the same LUT-based GPCs that are used in the FloPoCo framework, namely the (3; 2), (4; 3) and (1, 3; 3) GPCs with LUT cost 2, 3 and 3, respectively. FloPoCo allows the specification of a target frequency to decide how many pipeline stages are used. This frequency was set to 600 MHz for the heuristic to yield similar timing results and thus comparable resource consumptions. The number of inputs as well as the word length were varied from 4 to 16, leading to rectangular bit heaps of size 16 to 256 bit. As the ILP optimization may be very time-consuming, SCIP was set to a time limit of 1 hour, which we considered reasonable. In most cases, a valid solution that already outperformed the heuristic results was found within seconds or minutes. The current implementation allows the optimization to be interrupted at any point, and VHDL code is generated for the best solution found so far (if any).

Fig. 6: Resulting number of LUTs over the number of input bits using the heuristic [10] and the proposed ILP model for (a) Virtex 4 and (b) Virtex 6 FPGAs (axes: compressed bits vs. #LUT; curves: Heuristic [8], prop. ILP)

The resulting LUT costs from the optimization using the heuristic [10] and the proposed ILP method (respecting the flip-flop cost for pipelining) are shown in Fig. 6(a). It can be observed that the LUT costs follow a fairly linear trend related to the number of bits, independent of the method used. However, the proposed method has a much lower gradient of 1.9 LUT/bit compared to 2.5 LUT/bit. The average LUT reduction is 22.8%. Up to a complexity of 100 bits, an optimal solution was always found within the given time limit, often within a few seconds. As the trend in Fig. 6 continues for higher complexities, it can be assumed that the non-optimal solutions are not too far from being optimal.

The same procedure was applied to Virtex 6 FPGAs, which allow the use of 6-input LUTs. Here, many more LUT-based GPCs are possible, and some of them use the fact that 6-input LUTs can be configured as two 5-input LUTs with shared inputs (which is the case for GPCs with five or fewer inputs). The LUT-based GPCs used from the FloPoCo framework have the configurations (6; 3), (1, 5; 3), (5; 3), (1, 4; 3), (4; 3), (2, 3; 3), (1, 3; 3). In addition to that, we used the fastest of the Virtex 6 optimized GPCs from [12], (1, 4, 1, 5; 5), (1, 4, 0, 6; 5) and (2, 0, 4, 5; 5), as well as the (1, 3, 2, 5; 5) and (6, 0, 6; 5) GPCs proposed above for the ILP optimization. The results […]


[Plots: Virtex 4 FPGA, Virtex 6 FPGA]

The required LUTs could be reduced by 23% (Virtex 4) and 30% (Virtex 6) compared to Dinechin (FPL’13) [8]

The slice reduction was 12.5% (Virtex 4) and 19.5% (Virtex 6) after synthesis.

EXAMPLE COMPRESSION TREE WITH 16 INPUTS, 16 BIT EACH

[Figure: FloPoCo [Dinechin FPL’13] vs. proposed ILP]

CONCLUSION & OUTLOOK

A novel ILP formulation for the optimization of pipelined compressor trees was presented

There is a notable gap between the former state-of-the-art heuristic and our optimal solution

Extensions are proposed for minimal stage count or variable column counters like 4:2 compressors

Good heuristics are still required for problem sizes >100 bit due to the runtime of the ILP solver

So far there is no heuristic considering pipelining

THANK YOU!

LITERATURE

[Parandeh-Afshar TRETS’11]: H. Parandeh-Afshar, A. Neogy, P. Brisk, and P. Ienne, “Compressor Tree Synthesis on Commercial High-Performance FPGAs,” ACM TRETS, 2011

[Dinechin HEART’13]: F. de Dinechin, M. Istoan, and G. Sergent, “Fixed-Point Trigonometric Functions on FPGAs,” HEART 2013, Jun. 2013.

[Dinechin FPL’13]: N. Brunie, F. de Dinechin, M. Istoan, G. Sergent, K. Illyes, and B. Popa, “Arithmetic Core Generation Using Bit Heaps,” FPL 2013

[Matsunaga’13]: T. Matsunaga, S. Kimura, and Y. Matsunaga, “An Exact Approach for GPC-Based Compressor Tree Synthesis,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Dec. 2013.

ATTACHMENTS


DETAILED RESULTS VIRTEX 4

                  Heuristic [Dinechin FPL’13]         Proposed ILP
    Size [bits]   LUT4    FF      Slices  fmax [MHz]  LUT4    FF      Slices  fmax [MHz]
    16            34      20      25      501.5       28      21      25      562.4
    25            45      39      29      455.2       46      45      39      562.1
    36            78      63      59      489.5       54      56      35      491.4
    49            123     86      73      444.8       79      78      46      481.9
    64            181     108     109     412.9       123     120     100     471.5
    81            209     132     117     420.7       141     135     106     477.8
    100           267     173     174     414.8       181     178     109     454.6
    121           332     182     181     332.6       242     247     211     435.4
    144           395     243     255     376.2       272     273     223     441.1
    169           492     283     277     344.8       309     317     197     428.3
    196           582     328     368     355.0       407     416     340     423.2
    225           622     345     410     333.9       444     451     349     424.3
    256           706     386     459     343.3       506     518     438     410.3
    Avg.:         312.8   183.7   195.1   401.9       217.8   219.6   170.6   466.5
    Imp.:         –       –       –       –           30.3%   -19.6%  12.5%   16.1%

DETAILED RESULTS VIRTEX 6

                  Heuristic [Dinechin FPL’13]         Proposed ILP
    Size [bits]   LUT6    FF      Slices  fmax [MHz]  LUT6    FF      Slices  fmax [MHz]
    16            12      7       3       478.0       10      9       3       639.4
    25            24      11      6       636.5       26      25      7       452.9
    36            32      13      9       595.6       27      36      7       603.1
    49            44      15      12      492.4       35      40      10      407.7
    64            59      19      16      407.7       47      48      13      506.8
    81            76      21      20      442.9       56      59      15      480.1
    100           96      47      26      435.9       77      98      20      437.5
    121           116     26      32      401.6       89      112     25      438.6
    144           134     28      35      383.9       94      121     24      469.0
    169           161     60      43      396.8       119     155     30      470.6
    196           189     76      50      358.0       131     160     35      408.0
    225           216     81      56      327.2       192     236     57      364.0
    256           251     74      66      338.3       204     251     55      372.3
    Avg.:         108.5   36.8    28.8    438.1       85.2    103.8   23.2    465.4
    Imp.:         –       –       –       –           21.5%   -182.4% 19.5%   6.2%


EFFICIENT GPCS ON FPGAS

    GPC / Compressor     #LUT6 (k)   Efficiency (E = δ/k)   Delay

    LUT-based GPCs from [Dinechin FPL’13]:
    (3;2) GPC            1           1                      τL ≈ τ
    (6;3) GPC            3           1                      τL ≈ τ
    (1,5;3) GPC          3           1                      τL ≈ τ

    Improved GPC mappings from [Parandeh-Afshar TRETS’11]:
    (6;3) GPC            3           1                      2τL + τR + 3τCC ≈ 3τ
    (1,5;3) GPC          2           1.5                    τL + 2τCC ≈ τ
    (2,3;3) GPC          2           1                      τL + 2τCC ≈ τ
    (7;3) GPC            3           1.33                   2τL + τR + 3τCC ≈ 3τ
    (5,3;4) GPC          3           1.33                   2τL + τR + 3τCC ≈ 3τ
    (6,2;4) GPC          3           1.33                   2τL + τR + 3τCC ≈ 3τ

    GPCs and 4:2 compressor from [Kumm MBMV’14]:
    (5,0,6;5) GPC        4           1.5                    τL + 4τCC ≈ τ
    (1,4,1,5;5) GPC      4           1.5                    τL + 4τCC ≈ τ
    (1,4,0,6;5) GPC      4           1.5                    τL + 4τCC ≈ τ
    (2,0,4,5;5) GPC      4           1.5                    2τL + τR + 4τCC ≈ 3τ
    4:2 compressor       k           2 - 2/k                τL + kτCC

    Adder with k BLE:
    2-input adder        k           1                      τL + kτCC
    3-input adder        k           2 - 2/k                2τL + τR + kτCC ≈ 3τ + kτCC

    Proposed GPCs:
    (6,0,6;5) GPC        4           1.75                   τL + 4τCC ≈ τ
    (1,3,2,5;5) GPC      4           1.5                    τL ≈ τ

(2,0,4,5;5) GPC [Kumm MBMV’14]:

Efficiency = 1.5

[Figure: mapping to slice LUTs and carry logic]

4:2 COMPRESSOR

[Figure: 4:2 compressor mapped to slice LUTs and carry logic] [Kumm MBMV’14]

PROPOSED OPTIMIZATION

We developed an ILP optimizer.

The main idea of the ILP formulation is to 'cover' all bits in each stage by GPCs.

For that, a 'pseudo element' $e'$ is introduced for which $M_{e',c} = 1$ and $K_{e',c} = 1$ (no compression).

In case of a combinatorial compressor tree (problem 1) we set its cost to $c_{e'} = 0$ (wire).

In case of a pipelined compressor tree (problem 2) $c_{e'}$ corresponds to the flip flop cost.

(7;3) COMPRESSOR

[Figure: (7;3) compressor mapped to slice LUTs and carry logic]

TERNARY ADDERS

A ternary adder realizes the operation $s = x + y + z$.

It can be realized as a cascade of two ripple carry adders. [figure]

Using the first full adder stage as a 3:2 compressor removes the carry chain of that stage. [figure]
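The effect of using the first full adder stage as a 3:2 compressor can be illustrated on whole words. The sketch below is a behavioural model only, with a hypothetical function name: the first stage computes sum and carry bits independently per bit position (no carry propagation), and a single ordinary addition then plays the role of the remaining ripple carry adder.

```python
def ternary_add(x, y, z):
    """Add three non-negative integers with one 3:2 stage and one adder."""
    sum_bits = x ^ y ^ z                             # per-bit sum of the full adders
    carry_bits = ((x & y) | (x & z) | (y & z)) << 1  # per-bit carry, weight doubled
    return sum_bits + carry_bits                     # the only carry-propagating step


print(ternary_add(22, 7, 13))  # prints 42
```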
