Pipelined Compressor Tree Optimization using Integer Linear Programming International Conference on Field Programmable Logic 03.09.2014 Martin Kumm , Peter Zipf University of Kassel, Germany


Pipelined Compressor Tree Optimization using Integer Linear Programming

International Conference on Field Programmable Logic

03.09.2014

Martin Kumm, Peter Zipf

University of Kassel, Germany


CONTENTS

1. Introduction to Compressor Trees

2. Compressor Trees on FPGAs

3. Optimal Compressor Tree Synthesis


A compressor tree realizes the addition of many (>2) bit-shifted numbers

The applications are versatile:

Multiplier (real, complex, squarer)

Evaluation of polynomials (e.g., for function approximation)

Linear transforms (e.g., FFT, DCT)

Digital filters

…

COMPRESSOR TREES


EXAMPLE 1: MULTI-INPUT ADDITION

Dot representation of a 5-bit, 5-input addition (column weights $2^4, 2^3, 2^2, 2^1, 2^0$; each row is one input vector)

Formula: $S = \sum_i X_i$

EXAMPLE 1: MULTI-INPUT ADDITION

Dot representation of a 5-bit, 5-input addition, with example input vectors (MSB first):

  1 0 1 1 0     22
  0 0 1 1 1    + 7
  0 1 1 0 1    +13
  1 1 0 1 1    +27
  1 0 1 0 1    +21
               = 90

Formula: $S = \sum_i X_i$

Column sums: $3 \cdot 2^4 + 2 \cdot 2^3 + 4 \cdot 2^2 + 3 \cdot 2^1 + 4 \cdot 2^0 = 90$
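
The column sums of this example can be recomputed with a few lines of Python. This is only a sketch for illustration; the helper name `column_heights` is an assumption and does not come from the slides.

```python
# Input vectors of the example, as decimal values.
inputs = [22, 7, 13, 27, 21]
WIDTH = 5  # 5-bit operands

def column_heights(values, width):
    """Return the number of set bits ("dots") per column, LSB first."""
    return [sum((v >> c) & 1 for v in values) for c in range(width)]

heights = column_heights(inputs, WIDTH)
# Weighted column sum: sum over c of heights[c] * 2^c
total = sum(h << c for c, h in enumerate(heights))

print(heights)  # LSB first: [4, 3, 4, 2, 3]
print(total)    # 90
```

The compressor tree only needs the column heights, not the individual bit values, which is why the later slides work on heights alone.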

EXAMPLE 2: MULTIPLIER

Dot Representation 5x5 Multiplication:

Formula:
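
As a hedged illustration of where the multiplier's dots come from: every partial product bit $x_i y_j$ has weight $2^{i+j}$, so an unsigned 5x5 multiplication yields nine columns. The function names below are assumptions made for this sketch.

```python
def multiplier_heights(n, m):
    """Column heights (LSB first) of the partial-product heap of an n x m unsigned multiplier."""
    heights = [0] * (n + m - 1)
    for i in range(n):
        for j in range(m):
            heights[i + j] += 1  # partial product x_i * y_j has weight 2^(i+j)
    return heights

def multiply_via_heap(x, y, n, m):
    """Sum all partial-product bits with their column weights."""
    return sum((((x >> i) & 1) & ((y >> j) & 1)) << (i + j)
               for i in range(n) for j in range(m))

print(multiplier_heights(5, 5))  # [1, 2, 3, 4, 5, 4, 3, 2, 1]
```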

EXAMPLE 3: ADVANCED ARITHMETIC

Sine/cosine computation: dot representation for $Z - Z^3/6$:

[Dinechin HEART’13]

BASIC COMPRESSION

Full adder / (3;2) counter: [figure]

Ripple carry adder: [figure: chain of full adders (FA)]
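
A minimal behavioural sketch of the two building blocks, assuming plain Python integers as bit containers (the function names are illustrative only):

```python
def full_adder(a, b, c):
    """(3;2) counter: three bits of equal weight in, (carry, sum) out."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return carry, s

def ripple_carry_add(x, y, width):
    """Add two width-bit numbers by chaining full adders through the carry."""
    carry, result = 0, 0
    for i in range(width):
        carry, s = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result | (carry << width)

print(ripple_carry_add(22, 27, 5))  # 49
```

The counter preserves the value of its inputs: `2 * carry + s` always equals `a + b + c`.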

FLOW OF COMPRESSION

TABULAR REPRESENTATION

  5 5 5 5 5     bits in stage 0
− 3, + 1 1      (3;2) counter
− 3, + 1 1      (3;2) counter
− 3, + 1 1      (3;2) counter
− 3, + 1 1      (3;2) counter
− 3, + 1 1      (3;2) counter
= 1 4 4 4 4 3   bits in stage 1
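
The step from stage 0 to stage 1 can be reproduced on column heights alone. The sketch below assumes one (3;2) counter per column that holds at least three bits, which matches the five counters used in this step; `full_adder_stage` is a name made up for this example.

```python
def full_adder_stage(heights):
    """Apply one (3;2) counter to every column holding at least 3 bits.

    heights is LSB first. Each counter removes 3 bits from its column and
    returns a sum bit to the same column and a carry bit to the next one.
    """
    nxt = list(heights) + [0]
    for c, h in enumerate(heights):
        if h >= 3:
            nxt[c] -= 2      # -3 input bits, +1 sum bit
            nxt[c + 1] += 1  # carry bit
    if nxt[-1] == 0:
        nxt.pop()
    return nxt

stage1 = full_adder_stage([5, 5, 5, 5, 5])
print(stage1[::-1])  # MSB first: [1, 4, 4, 4, 4, 3]
```

Each counter reduces the total number of bits by one, so the 25 bits of stage 0 become 20 bits in stage 1.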

TABULAR REPRESENTATION

  1 4 4 4 4 3   bits in stage 1
− 3, + 1 1      (3;2) counter
− 3, + 1 1      (3;2) counter
− 3, + 1 1      (3;2) counter
− 3, + 1 1      (3;2) counter
− 3, + 1 1      (3;2) counter
= 1 3 3 3 3 1   bits in stage 2

TABULAR REPRESENTATION

  1 3 3 3 3 1   bits in stage 2
− 3, + 1 1      (3;2) counter
− 3, + 1 1      (3;2) counter
− 3, + 1 1      (3;2) counter
− 3, + 1 1      (3;2) counter
= 2 2 2 2 1 1   bits in stage 3

TABULAR REPRESENTATION

    2 2 2 2 1 1   bits in stage 3
−   2 2 2 2       ripple carry adder
+ 1 1 1 1 1
= 1 1 1 1 1 1 1   bits in final stage


APPLICATION TO FPGAS

The compression using full adders is unsuitable for FPGAs:

Mapping of a full adder on FPGA LUTs is inefficient and slow (➯ large routing delays)

Fast carry chain is not exploited

Conventional Solution: Ripple-carry adder tree

Delay reduction possible by using Generalized Parallel Counters (GPCs) [Parandeh-Afshar TRETS’11]

(1,5;3) GPC ON FPGA

[Figures: dot transform and realization with full adders (FA)]

(1,5;3) GPC ON FPGA

(1,5;3) GPC mapping [Parandeh-Afshar TRETS’11]:

Efficiency = bits reduced / #LUTs = (1+5−3)/3 = 1.0 [Dinechin FPL’13]

[Figure: mapping onto slice LUTs and carry logic (full adders, FA)]

EFFICIENT GPCS ON FPGAS

(1,4,1,5;5) GPC [Kumm MBMV’14]:

Efficiency = 1.5

[Figure: mapping onto slice LUTs and carry logic (full adders, FA)]

EFFICIENT GPCS ON FPGAS

(1,4,0,6;5) GPC [Kumm MBMV’14]:

Efficiency = 1.5

[Figure: mapping onto slice LUTs and carry logic (full adders FA, half adders HA)]

EFFICIENT GPCS ON FPGAS

(1,3,2,5;5) GPC (proposed):

Efficiency = 1.5

[Figure: mapping onto slice LUTs and carry logic (full adders FA, half adders HA)]

EFFICIENT GPCS ON FPGAS

(6,0,6;5) GPC (proposed):

Efficiency = 1.75

[Figure: mapping onto slice LUTs and carry logic (full adders, FA)]
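
The efficiency figures of the GPCs shown so far follow directly from the definition "bits reduced / #LUTs". The sketch below recomputes them; the LUT counts are taken from the slides and the attachment tables, while the function names and the additional validity check (the largest possible input sum must fit into the output bits) are assumptions added for illustration.

```python
from fractions import Fraction

def efficiency(inputs, n_outputs, n_luts):
    """Bits reduced per LUT: (number of inputs - number of outputs) / #LUTs."""
    return Fraction(sum(inputs) - n_outputs, n_luts)

def fits_outputs(inputs, n_outputs):
    """Check that the maximum input sum is representable with n_outputs bits.

    inputs lists the per-column input counts MSB first, as in (k_{n-1},...,k_0; m).
    """
    max_sum = sum(k << i for i, k in enumerate(reversed(inputs)))
    return max_sum < (1 << n_outputs)

# name: (inputs per column MSB first, outputs, LUTs)
GPCS = {
    "(1,5;3)":     ((1, 5), 3, 3),
    "(1,4,1,5;5)": ((1, 4, 1, 5), 5, 4),
    "(1,4,0,6;5)": ((1, 4, 0, 6), 5, 4),
    "(1,3,2,5;5)": ((1, 3, 2, 5), 5, 4),
    "(6,0,6;5)":   ((6, 0, 6), 5, 4),
}

for name, (inputs, n_out, n_luts) in GPCS.items():
    print(name, float(efficiency(inputs, n_out, n_luts)), fits_outputs(inputs, n_out))
```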

COMPRESSOR TREE OPTIMIZATION

Problem 1:

The presented GPCs have irregular input patterns

How to select them to get the least LUT resources?

Problem 2:

Pipelining is important on FPGAs to obtain a high throughput.

How to select them to get the least LUT/FF resources? (least pipeline balancing FFs)

EXAMPLE FOR PROBLEM 1

      5 5 5 5 5   bits in stage 0
−       1 4 1 5   (1,4,1,5;5) GPC
+     1 1 1 1 1
−     1 4 1 4     (1,4,1,5;5) GPC
+   1 1 1 1 1
=   1 6 2 2 2 1   bits in stage 1

    1 6 2 2 2 1   bits in stage 1
−     6           (6;3) GPC
+ 1 1 1
= 1 2 1 2 2 2 1   bits in stage 2

EXAMPLE FOR PROBLEM 2

        5 5 5 5 5   bits in stage 0
−         2 0 4 5   (2,0,4,5;5) GPC
+       1 1 1 1 1
−       5 0 5       (6,0,6;5) GPC
+   1 1 1 1 1
−         3   1     4 FF for pipeline balancing
+         3   1
=   1 1 2 5 2 2 1   bits in stage 1

    1 1 2 5 2 2 1   bits in stage 1
−   1 1 2 5         (1,3,2,5;5) GPC
+ 1 1 1 1 1
−           2 2 1   5 FF for pipeline balancing
+           2 2 1
= 1 1 1 1 1 2 2 1   bits in stage 2
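
The two tabular examples can be replayed with a small bit-heap simulation. This is a sketch under stated assumptions: `compress_stage` and its placement format are invented for illustration, inputs are written LSB first, and only the pass-through bits are counted (wires for problem 1, pipeline-balancing flip-flops for problem 2).

```python
def compress_stage(heights, placements):
    """Apply GPCs to one stage of a bit heap (all lists are LSB first).

    placements: list of (column, used_inputs, n_outputs), where used_inputs[i]
    is the number of bits the GPC consumes in column `column + i`.
    Returns (next_heights, passed_bits): bits not consumed by any GPC are
    passed to the next stage (a wire, or a flip-flop when pipelined).
    """
    width = max([len(heights)] +
                [col + max(len(used), n_out) for col, used, n_out in placements])
    remaining = list(heights) + [0] * (width - len(heights))
    produced = [0] * width
    for col, used, n_out in placements:
        for i, bits in enumerate(used):
            if bits > remaining[col + i]:
                raise ValueError(f"not enough bits in column {col + i}")
            remaining[col + i] -= bits
        for i in range(n_out):
            produced[col + i] += 1
    next_heights = [r + p for r, p in zip(remaining, produced)]
    return next_heights, sum(remaining)

# Example for problem 2; (2,0,4,5;5) written LSB first is [5, 4, 0, 2].
stage1, ff1 = compress_stage([5, 5, 5, 5, 5],
                             [(0, [5, 4, 0, 2], 5),   # (2,0,4,5;5) GPC
                              (2, [5, 0, 5], 5)])     # (6,0,6;5) GPC, 5 of 6 inputs used per column
stage2, ff2 = compress_stage(stage1, [(3, [5, 2, 1, 1], 5)])  # (1,3,2,5;5) GPC, partially used

print(stage1[::-1], ff1)  # MSB first: [1, 1, 2, 5, 2, 2, 1] 4
print(stage2[::-1], ff2)  # MSB first: [1, 1, 1, 1, 1, 2, 2, 1] 5
```

With 4 LUTs per GPC, this solution costs 12 LUTs and 9 pipeline-balancing flip-flops; registers at the GPC outputs are not counted here.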

PROPOSED OPTIMIZATION

A generic ILP optimizer was used

The main idea of the ILP formulation is to count GPCs for each column [Matsunaga’13] and to ‘cover’ all bits in each stage by GPCs

For that, a ‘pseudo compressor’ with one input and one output is introduced (no compression)

To optimize a combinatorial compressor tree (problem 1), its cost is set to zero (a wire)

To optimize a pipelined compressor tree (problem 2), its cost is set to the flip-flop cost

ILP FORMULATION

ILP variables:

No. of bits in stage $s$ and column $c$: $N_{s,c}$

No. of GPCs in stage $s$, of type $e$ and column $c$: $k_{s,e,c}$

No. of inputs and outputs of GPC (type $e$) in column $c$: $M_{e,c}$ and $K_{e,c}$, respectively

LUT cost of GPC $e$: $c_e$

Binary variable to select the active stage:

$$D_s = \begin{cases} 1 & \text{if stage } s \text{ is used} \\ 0 & \text{otherwise} \end{cases}$$

ILP FORMULATION

minimize

$$\sum_{s=0}^{S-1} \sum_{c=0}^{C-1} \sum_{e=0}^{E-1} c_e \, k_{s,e,c}$$

subject to

C1: $N_{s-1,c} \le \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} M_{e,c+c'} \, k_{s-1,e,c+c'}$ for $s = 1 \ldots S-1$, $c = 0 \ldots C-1$, if $D_s = 0$

C2: $N_{s,c} = \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} K_{e,c+c'} \, k_{s-1,e,c+c'}$ for $s = 1 \ldots S-1$, $c = 0 \ldots C-1$

C3: $N_{s,c} \le \begin{cases} 2 & \text{for two-input VMA} \\ 3 & \text{for ternary VMA} \end{cases}$ if $D_s = 1$

C4: $\sum_{s=1}^{S-1} D_s = 1$

ILP FORMULATION

C1 and C3 have to be linearized: $I$ must be a sufficiently large integer.

C1’: $N_{s-1,c} \le \sum_{e=0}^{E-1} \sum_{c'=0}^{C_e-1} M_{e,c+c'} \, k_{s-1,e,c+c'} + I \, D_s$

C3’: $N_{s,c} \le \begin{cases} 2 + (1 - D_s) I & \text{for two-input VMA} \\ 3 + (1 - D_s) I & \text{for ternary VMA} \end{cases}$
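
The covering idea behind C1 to C3 can be checked on a candidate solution without an ILP solver. The sketch below is not the authors' implementation: it assumes a fixed number of stages (so the stage selection via $D_s$ and the big-$I$ terms are left out), it indexes GPC inputs by their offset relative to the placement column, and the names `check_and_cost`, `GPCS` and `"pseudo"` are hypothetical.

```python
def check_and_cost(n0, gpcs, k, vma_inputs=2):
    """Check covering (C1), output counting (C2) and the final height (C3).

    n0:   column heights of stage 0, LSB first.
    gpcs: {name: (inputs, outputs, cost)}; inputs[i] is the number of inputs
          at relative column i (LSB first), outputs the number of output bits.
    k:    one dict per compression stage, {(name, column): count}.
    Returns (heights of all stages, total cost); raises ValueError on violation.
    """
    stages = [list(n0)]
    cost = 0
    for s, placed in enumerate(k, start=1):
        prev = stages[-1]
        width = max([len(prev)] +
                    [c + max(len(gpcs[e][0]), gpcs[e][1]) for (e, c) in placed])
        capacity = [0] * width  # right-hand side of C1
        nxt = [0] * width       # right-hand side of C2
        for (e, c), count in placed.items():
            inputs, outputs, c_e = gpcs[e]
            for i, m in enumerate(inputs):
                capacity[c + i] += m * count
            for i in range(outputs):
                nxt[c + i] += count
            cost += c_e * count
        for c, n in enumerate(prev):  # C1: every bit must be covered
            if n > capacity[c]:
                raise ValueError(f"C1 violated in stage {s - 1}, column {c}")
        stages.append(nxt)
    if max(stages[-1]) > vma_inputs:  # C3: final stage must fit the VMA
        raise ValueError("C3 violated in the final stage")
    return stages, cost

# GPC library for the problem-1 example; the pseudo compressor is a wire (cost 0).
GPCS = {
    "(1,4,1,5;5)": ((5, 1, 4, 1), 5, 4),
    "(6;3)":       ((6,), 3, 3),
    "pseudo":      ((1,), 1, 0),
}
K = [
    {("(1,4,1,5;5)", 0): 1, ("(1,4,1,5;5)", 1): 1, ("pseudo", 4): 4},
    {("(6;3)", 4): 1, ("pseudo", 0): 1, ("pseudo", 1): 2,
     ("pseudo", 2): 2, ("pseudo", 3): 2, ("pseudo", 5): 1},
]
stages, cost = check_and_cost([5, 5, 5, 5, 5], GPCS, K)
print(stages[-1][::-1], cost)  # MSB first: [1, 2, 1, 2, 2, 2, 1] 11
```

Setting the cost of `"pseudo"` to the flip-flop cost instead of 0 turns the same check into the pipelined variant (problem 2).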

RESULTS

the full-adder of the carry chain. The shown XOR gates are necessary to complete the carry logic to a ripple carry adder (RCA). A similar structure is used in the second LUT, but now the two carry bits are computed and fed to the RCA. This structure is repeated, leading to the (6, 0, 6; 5) GPC. Its efficiency is E = 1.75 which is the highest efficiency reported so far. Even the ternary adder or the 4:2 compressor have a lower efficiency of E = 1.5 for the same size (k = 4). Its critical path only consists of a single LUT delay plus four stages of fast carry propagation. A GPC with different input configuration is shown in Fig. 5(b), namely the (1, 3, 2, 5; 5) GPC. Although it has a lower efficiency of E = 1.5, it may be favorable in cases where not all of the inputs of the (6, 0, 6; 5) GPC are utilized. The delay is identical to that of the (6, 0, 6; 5) GPC. Note that the carry-in of the (6, 0, 6; 5) GPC can not be used as additional input due to routing constraints within the slice (when the 0-input of the carry-chain MUX is fed from a slice input).

VI. RESULTS

The proposed ILP formulation was integrated within the open-source arithmetic core generator FloPoCo [15], which is a nice framework that supports the handling of compressor trees as a bit heap (including signed number support) [10] as well as the support for VHDL code generation and automated tests. It also includes the recently proposed heuristic compression method [10] which makes it perfectly suitable for comparisons as both methods work on identical data structures and use the same VHDL generation. To be able to provide our method as open-source tool it was decided to use the open-source ILP solver SCIP [14], although it is well known that the commercial CPLEX ILP optimizer is much faster (we observed speedups about 10× which is confirmed by a benchmark provided at [14]).

A. Evaluation of the Optimization Quality

To evaluate the performance of the compression we implemented a multiple-input adder with a variable number of inputs as well as a variable word size. We chose this type of circuit because it uses only the compressor tree plus an additional VMA at the output. As VMA we chose a common two-input adder for performance reasons. In the experiments we target Virtex 4 and Virtex 6 FPGAs from Xilinx as candidates with different LUT input sizes. For Virtex 4 (4-input LUTs), we used the same LUT-based GPCs that are used in the FloPoCo framework, namely the (3; 2), (4; 3) and (1, 3; 3) GPCs with LUT cost 2, 3 and 3, respectively. FloPoCo allows the specification of a target frequency to decide how many pipeline stages are used. This frequency was set to 600 MHz for the heuristic to yield similar timing results and thus comparable resource consumptions. The input word size as well as the word length were varied from 4 to 16 leading to rectangular bit heaps of size 16 to 256 bit. As the ILP optimization may be very time-consuming, SCIP was set to a time limit of 1 hour, which we thought is reasonable. In most cases, a valid solution was found within seconds or minutes which already outperformed the heuristic results. The current implementation allows to interrupt the optimization at any point and VHDL code is generated for the best solution found so far (if any).

[Fig. 6: Resulting number of LUTs over the number of input bits using the heuristic [10] and the proposed ILP model for (a) Virtex 4 and (b) Virtex 6 FPGAs. Axes: compressed bits (x), #LUT (y); curves: Heuristic [8], prop. ILP.]

The resulting LUT cost from the optimization using the heuristic [10] and the proposed ILP method (with respecting flip-flops cost for pipelining) are shown in Fig. 6(a). It can be observed that the LUT costs follow a fairly linear trend related to the number of bits, independent of the method used. However, the proposed method has a much lower gradient of 1.9 LUT/bit compared to 2.5 LUT/bit. The average LUT reduction is 22.8%. Up to a complexity of 100 bits, an optimal solution was always found within the given time limit, often within a few seconds. As the trend in Fig. 6 continues for higher complexities, it can be assumed that the non-optimal solutions are not too far from being optimal.

The same procedure was applied to Virtex 6 FPGAs which allow the use of 6-input LUTs. Here, much more LUT-based GPCs are possible and some of them use the fact that 6-input LUTs can be configured to two 5-input LUTs with shared inputs (which is the case for GPCs with five or less inputs). The LUT-based GPCs used from the FloPoCo framework have the configurations (6; 3), (1, 5; 3), (5; 3), (1, 4; 3), (4; 3), (2, 3; 3), (1, 3; 3). In addition to that, we used the fastest of the Virtex 6 optimized GPCs from [12] (1, 4, 1, 5; 5), (1, 4, 0, 6; 5) and (2, 0, 4, 5; 5) as well as the (1, 3, 2, 5; 5) and (6, 0, 6; 5) GPCs proposed above for the ILP optimization. The results


Virtex 4 FPGA Virtex 6 FPGA

The required LUTs could be reduced by 23% (Virtex 4) and 30% (Virtex 6) compared to Dinechin (FPL’13) [8]

The slice reduction was 12.5% (Virtex 4) and 19.5% (Virtex 6) after synthesis.


EXAMPLE COMPRESSION TREE WITH 16 INPUTS, 16 BIT EACH

FloPoCo [Dinechin FPL’13] | Proposed ILP


CONCLUSION & OUTLOOK

A novel ILP formulation for the optimization of pipelined compressor trees was presented

There is a notable gap between the former state-of-the-art heuristic and our optimal solution

Extensions are proposed for minimal stage count or variable column counters like 4:2 compressors

Good heuristics are still required for problem sizes >100 bit due to the runtime of the ILP solver

So far there is no heuristic considering pipelining


THANK YOU!


LITERATURE

[Parandeh-Afshar TRETS’11]: H. Parandeh-Afshar, A. Neogy, P. Brisk, and P. Ienne, “Compressor Tree Synthesis on Commercial High-Performance FPGAs,” ACM TRETS, 2011.

[Dinechin HEART’13]: F. de Dinechin, M. Istoan, and G. Sergent, “Fixed-Point Trigonometric Functions on FPGAs,” HEART 2013, Jun. 2013.

[Dinechin FPL’13]: N. Brunie, F. de Dinechin, M. Istoan, G. Sergent, K. Illyes, and B. Popa, “Arithmetic Core Generation Using Bit Heaps,” FPL 2013.

[Matsunaga’13]: T. Matsunaga, S. Kimura, and Y. Matsunaga, “An Exact Approach for GPC-Based Compressor Tree Synthesis,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Dec. 2013.


ATTACHMENTS


DETAILED RESULTS VIRTEX 4

              Heuristic [Dinechin FPL’13]          Proposed ILP
Size [bits]   LUT4    FF      Slices  fmax [MHz]   LUT4    FF      Slices  fmax [MHz]
16            34      20      25      501.5        28      21      25      562.4
25            45      39      29      455.2        46      45      39      562.1
36            78      63      59      489.5        54      56      35      491.4
49            123     86      73      444.8        79      78      46      481.9
64            181     108     109     412.9        123     120     100     471.5
81            209     132     117     420.7        141     135     106     477.8
100           267     173     174     414.8        181     178     109     454.6
121           332     182     181     332.6        242     247     211     435.4
144           395     243     255     376.2        272     273     223     441.1
169           492     283     277     344.8        309     317     197     428.3
196           582     328     368     355.0        407     416     340     423.2
225           622     345     410     333.9        444     451     349     424.3
256           706     386     459     343.3        506     518     438     410.3
Avg.:         312.8   183.7   195.1   401.9        217.8   219.6   170.6   466.5
Imp.:         –       –       –       –            30.3%   -19.6%  12.5%   16.1%

DETAILED RESULTS VIRTEX 6

              Heuristic [Dinechin FPL’13]          Proposed ILP
Size [bits]   LUT6    FF      Slices  fmax [MHz]   LUT6    FF      Slices  fmax [MHz]
16            12      7       3       478.0        10      9       3       639.4
25            24      11      6       636.5        26      25      7       452.9
36            32      13      9       595.6        27      36      7       603.1
49            44      15      12      492.4        35      40      10      407.7
64            59      19      16      407.7        47      48      13      506.8
81            76      21      20      442.9        56      59      15      480.1
100           96      47      26      435.9        77      98      20      437.5
121           116     26      32      401.6        89      112     25      438.6
144           134     28      35      383.9        94      121     24      469.0
169           161     60      43      396.8        119     155     30      470.6
196           189     76      50      358.0        131     160     35      408.0
225           216     81      56      327.2        192     236     57      364.0
256           251     74      66      338.3        204     251     55      372.3
Avg.:         108.5   36.8    28.8    438.1        85.2    103.8   23.2    465.4
Imp.:         –       –       –       –            21.5%   -182.4% 19.5%   6.2%
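
The "Avg." and "Imp." entries of the LUT columns in the two tables above can be recomputed from the per-size rows. The sketch below assumes that the improvement is the relative reduction of the average, which reproduces the reported values; the variable and function names are illustrative.

```python
# LUT counts copied from the two result tables (bit heap sizes 16 ... 256).
VIRTEX4 = {
    "heuristic": [34, 45, 78, 123, 181, 209, 267, 332, 395, 492, 582, 622, 706],
    "ilp":       [28, 46, 54, 79, 123, 141, 181, 242, 272, 309, 407, 444, 506],
}
VIRTEX6 = {
    "heuristic": [12, 24, 32, 44, 59, 76, 96, 116, 134, 161, 189, 216, 251],
    "ilp":       [10, 26, 27, 35, 47, 56, 77, 89, 94, 119, 131, 192, 204],
}

def average(values):
    return sum(values) / len(values)

def improvement_percent(baseline, proposed):
    """Relative reduction of the average (equal to the reduction of the sum)."""
    return 100.0 * (1.0 - sum(proposed) / sum(baseline))

for name, data in (("Virtex 4", VIRTEX4), ("Virtex 6", VIRTEX6)):
    print(name,
          round(average(data["heuristic"]), 1),
          round(average(data["ilp"]), 1),
          round(improvement_percent(data["heuristic"], data["ilp"]), 1))
```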

EFFICIENT GPCS ON FPGAS

GPC / Compressor   #LUT6 (k)   Efficiency (E = bits reduced / k)   Delay

LUT based GPCs from [Dinechin FPL’13]:
(3;2) GPC          1           1                                   τ_L ≈ τ
(6;3) GPC          3           1                                   τ_L ≈ τ
(1,5;3) GPC        3           1                                   τ_L ≈ τ

Improved GPC mappings from [Parandeh-Afshar TRETS’11]:
(6;3) GPC          3           1                                   2τ_L + τ_R + 3τ_CC ≈ 3τ
(1,5;3) GPC        2           1.5                                 τ_L + 2τ_CC ≈ τ
(2,3;3) GPC        2           1                                   τ_L + 2τ_CC ≈ τ
(7;3) GPC          3           1.33                                2τ_L + τ_R + 3τ_CC ≈ 3τ
(5,3;4) GPC        3           1.33                                2τ_L + τ_R + 3τ_CC ≈ 3τ
(6,2;4) GPC        3           1.33                                2τ_L + τ_R + 3τ_CC ≈ 3τ

EFFICIENT GPCS ON FPGAS

GPC / Compressor   #LUT6 (k)   Efficiency (E = bits reduced / k)   Delay

GPCs and 4:2 compressor from [Kumm MBMV’14]:
(5,0,6;5) GPC      4           1.5                                 τ_L + 4τ_CC ≈ τ
(1,4,1,5;5) GPC    4           1.5                                 τ_L + 4τ_CC ≈ τ
(1,4,0,6;5) GPC    4           1.5                                 τ_L + 4τ_CC ≈ τ
(2,0,4,5;5) GPC    4           1.5                                 2τ_L + τ_R + 4τ_CC ≈ 3τ
4:2 compressor     k           2 − 2/k                             τ_L + kτ_CC

Adder with k BLE:
2-input adder      k           1                                   τ_L + kτ_CC
3-input adder      k           2 − 2/k                             2τ_L + τ_R + kτ_CC ≈ 3τ + kτ_CC

Proposed GPCs:
(6,0,6;5) GPC      4           1.75                                τ_L + 4τ_CC ≈ τ
(1,3,2,5;5) GPC    4           1.5                                 τ_L ≈ τ

EFFICIENT GPCS ON FPGAS

(2,0,4,5;5) GPC [Kumm MBMV’14]:

Efficiency = 1.5

[Figure: mapping onto slice LUTs and carry logic (full adders FA, half adder HA)]

4:2 COMPRESSOR

[Figure: 4:2 compressor mapped onto slice LUTs and carry logic (full adders, FA)] [Kumm MBMV’14]

PROPOSED OPTIMIZATION

We developed an ILP optimizer

The main idea of the ILP formulation is to ‘cover’ all bits in each stage by GPCs.

For that, a ‘pseudo element’ $e'$ is introduced for which $M_{e',c} = 1$ and $K_{e',c} = 1$ (no compression)

In case of a combinatorial compressor tree (problem 1) we set its cost to $c_{e'} = 0$ (wire)

In case of a pipelined compressor tree (problem 2) $c_{e'}$ corresponds to the flip-flop cost.

(7;3) COMPRESSOR

[Figure: (7;3) compressor mapped onto slice LUTs and carry logic (full adders, FA)]

TERNARY ADDERS

A ternary adder realizes the operation $s = x + y + z$

It can be realized as a cascade of two ripple carry adders:

[Figure: two cascaded rows of full adders (FA)]

TERNARY ADDERS

Using the 1st full adder stage as 3:2 compressor removes the carry chain:

[Figure: a row of full adders (FA) used as 3:2 compressors, followed by one ripple carry adder]
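
A behavioural sketch of this idea, using Python integers as bit vectors: the first stage works bitwise without any carry propagation, and only the final two-input addition needs a carry chain. The function names are assumptions for this example.

```python
def compress_3_to_2(x, y, z):
    """Row of independent full adders: three operands in, sum and carry vectors out."""
    sum_bits = x ^ y ^ z
    carry_bits = ((x & y) | (x & z) | (y & z)) << 1  # carries have double weight
    return sum_bits, carry_bits

def ternary_add(x, y, z):
    """s = x + y + z with a single carry-propagating addition."""
    sum_bits, carry_bits = compress_3_to_2(x, y, z)
    return sum_bits + carry_bits  # the remaining ripple carry adder

print(ternary_add(22, 7, 13))  # 42
```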