new lut

8/7/2019 new LUT

1/4

New Look-up-Table Optimizations for Memory-Based Multiplication

Pramod Kumar Meher
Communication Systems Department, Institute for Infocomm Research,
1 Fusionopolis Way, Singapore, email: [email protected].

Abstract: Two new techniques are presented in this paper for the reduction of look-up-table (LUT) size of memory-based multipliers to be used in digital signal processing applications. It is shown that by simple sign-bit exclusion, the LUT size is reduced by half at the cost of a marginal area overhead. Moreover, a novel anti-symmetric product coding (APC) scheme is proposed to reduce the LUT size by a further half, where the LUT output is added with or subtracted from a fixed value. It is shown that the optimized LUTs for small input width could be used for efficient implementation of high-precision LUT-multipliers, where the total contribution of all such fixed offsets could be added to the final result or could be initialized for successive accumulations.

The proposed LUT-multiplier and the existing ones are coded in VHDL and synthesized by Synopsys Design Compiler using the TSMC 90 nanometer library. The proposed optimized LUT-multiplier is found to involve less area and less multiplication time than the existing LUT-multipliers. The conventional LUT-multiplier and the odd-multiple storage LUT of [1] involve nearly 56.8% and 26.2% more area-delay product, on average, for 16-bit input than the proposed LUT design.

I. INTRODUCTION

Along with device scaling over the years, semiconductor memory has become cheaper, faster and more power-efficient. Moreover, according to the projections of the international technology roadmap for semiconductors (ITRS) [2], embedded memories will have a dominating presence in systems-on-chip (SoC), which may exceed 90% of total SoC content. It has also been found that the transistor packing density of memory components is not only high but also increasing much faster than the transistor density of logic components. Apart from that, memory-based computing structures are more regular than multiply-accumulate structures, and offer many other advantages, e.g., greater potential for high-throughput and low-latency implementation, and less dynamic power consumption. Memory-based computing is well-suited for many digital signal processing (DSP) algorithms, which involve multiplication with a fixed set of coefficients.

The basic functional model of a memory-based multiplier is shown in Fig. 1. Let $A$ be a fixed coefficient and $X$ be an input word to be multiplied with $A$. Assuming $X$ to be a positive binary number of word-length $L$, there can be $2^L$ possible values of $X$, and accordingly, there can be $2^L$ possible values of the product $C = A \cdot X$. Therefore, for memory-based multiplication, an LUT of $2^L$ words, consisting of pre-computed product values corresponding to all possible values of $X$, is conventionally used. The product word $A \cdot X_i$ is stored at the location whose address is the same as $X_i$ for $0 \le i \le 2^L - 1$, such that if the $L$-bit binary value of $X_i$ is used as the address for the LUT, then the corresponding product value $A \cdot X_i$ is available as its output.
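The conventional model described above can be sketched in a few lines of Python. This is only an illustration of the addressing idea, not the hardware design: the function names and the coefficient value 23 are assumptions chosen for the example.

```python
def build_conventional_lut(a: int, length: int) -> list[int]:
    """Precompute a * x for every length-bit operand x (2**length words)."""
    return [a * x for x in range(1 << length)]


def lut_multiply(lut: list[int], x: int) -> int:
    """The operand itself is the address; the stored word is the product."""
    return lut[x]


A = 23   # illustrative fixed coefficient (assumption)
L = 5    # input word-length in bits
lut = build_conventional_lut(A, L)
product = lut_multiply(lut, 19)
```

The table has $2^L = 32$ entries here, which is the size that the optimizations in this paper aim to reduce.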

Several architectures have been reported in the literature for memory-based implementation of DSP algorithms involving orthogonal transforms and digital filters [3]-[6]. However, we do not find any significant work on LUT optimization for memory-based multiplication. In a recent paper [1], we have presented a new approach to LUT design for memory-based multiplication, which could be used to reduce the memory size by half for small input widths. It is shown in [1] that although $2^L$ possible values of $X$ correspond to $2^L$ possible values of $C = A \cdot X$, only $(2^L/2)$ words corresponding to the odd multiples of $A$ need to be stored in the LUT, while the even multiples of $A$ could be derived by left-shift operations of one of those odd multiples. We refer to this scheme as the odd-multiple-storage LUT in the rest of this paper. Using this approach, one can reduce the LUT size to half, but it has significant combinational overhead, since it requires a barrel-shifter along with a control circuit to generate the control bits for producing a maximum of $(L-1)$ left-shifts, and an encoder to map the $L$-bit input word to an $(L-1)$-bit LUT address.
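A rough software analogue of the odd-multiple-storage scheme may help to see where the shifter and encoder come from. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the handling of a zero input as a special case is a choice made here, not something stated in the text.

```python
def build_odd_multiple_lut(a: int, length: int) -> list[int]:
    """Store only the odd multiples a, 3a, 5a, ... (2**length / 2 words)."""
    return [a * m for m in range(1, 1 << length, 2)]


def odd_multiple_multiply(lut: list[int], x: int) -> int:
    """Recover a * x from a stored odd multiple followed by a left shift."""
    if x == 0:
        return 0                          # assumption: zero handled separately
    shifts = (x & -x).bit_length() - 1    # trailing zeros of x (control-circuit role)
    odd = x >> shifts                     # odd factor of x
    address = odd >> 1                    # (length - 1)-bit address (encoder role)
    return lut[address] << shifts         # left shift (barrel-shifter role)


lut_odd = build_odd_multiple_lut(23, 5)
```

For a 5-bit input the largest shift is 4, reached at $X = 16$, which matches the $(L-1)$ bound quoted above.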

In this paper, we propose two new schemes for optimization of the LUT with lower area- and time-overhead. One of the proposed optimizations is based on exclusion of the sign-bit from the LUT address, and the other is based on a recoding of the stored product word.

II. PROPOSED OPTIMIZATIONS OF LOOK-UP-TABLE FOR MEMORY-BASED MULTIPLICATION

The proposed LUT optimizations for multiplication of a number $X$ with a given constant $A$ are possible for both sign-magnitude and 2's complement representations of $X$ and $A$. Besides, both $X$ and $A$ could be fractions or integers in fixed-point format. But, for simplicity of presentation, the multiplicand $X$ is assumed here to be an integer in sign-magnitude representation, while the constant $A$ is assumed to be either in sign-magnitude or in 2's complement representation.

[Figure: a memory core of $2^L$ words; the $L$-bit input $X$ drives the address port and the $(W+L)$-bit product $AX$ appears at the output port.]

Fig. 1. Basic functional model of the LUT-based multiplier.

ISIC 2009, p. 663


A. Sign-Bit Exclusion for LUT Optimization

Let us write the input multiplicand as $X = s_X \circ |X|$, where the operation $\circ$ is a bit-concatenation at the MSB of $|X|$, $s_X$ is the sign-bit (MSB) of $X$, and the magnitude-part $|X|$ is given by

$$|X| = \sum_{i=0}^{L-2} x_i \, 2^i \qquad (1)$$

The product $C = A \cdot X$, similarly, can be written as

$$C = s \circ \left( |A| \cdot |X| \right), \quad \text{and} \qquad (2a)$$

$$s = s_A \ \text{XOR} \ s_X \qquad (2b)$$

where $s_A$ and $|A|$, respectively, are the sign-bit and magnitude-part of $A$.

Since $|X|$ is an $(L-1)$-bit binary number, all possible product values of $|A| \cdot |X|$ can be stored as $2^{(L-1)}$ LUT words, while the sign-bit can be derived by an XOR operation according to (2b). The product words and the words required to be stored for different values of $X$ for LUT-based multiplication (for $L = 5$) are shown in Table I. The product word corresponding to $X = (1\,x_3 x_2 x_1 x_0)$ is the negative of that for $X = (0\,x_3 x_2 x_1 x_0)$ for any given value of $|X| = (x_3 x_2 x_1 x_0)$. The product words in the fourth column of Table I can, therefore, be derived by negating the product word in the second column of the same row. Therefore, instead of 32 product words, only 16 values of $|A| \cdot |X|$ for all possible values of $|X| = (x_3 x_2 x_1 x_0)$ are required to be stored, as shown in the sixth column of Table I. The corresponding LUT-multiplier is shown in Fig. 2(a). It requires only one additional XOR gate to determine the sign of the product word $C$. The sign-exclusion technique can also be applied for 2's complement representation of coefficient $A$. The multiplier for 2's complement representation of $A$ is shown in Fig. 2(b), which stores the product words in 2's complement representation, and requires a 2's complement unit along with a 2:1 MUX to change the sign of the LUT output for negative values of $X$.

[Figure: (a) a 16-word LUT holding $0, |A|, 2|A|, \ldots, 15|A|$ in sign-magnitude representation, addressed by $x_3 x_2 x_1 x_0$ through a 4-to-16 line decoder; the $(W+3)$-bit LUT output, together with a sign bit obtained from $x_4 = s_X$ and $s_A$, forms the $(W+4)$-bit product $AX$ in sign-magnitude representation. (b) a 16-word LUT holding $0, A, 2A, \ldots, 15A$ in 2's complement representation, addressed through a 4-to-16 line decoder; a 2's complement unit and a 2:1 MUX selected by $x_4 = s_X$ deliver the $(W+4)$-bit product $AX$ in 2's complement representation.]

Fig. 2. Proposed LUTs for multiplication of a $W$-bit fixed coefficient $A$ with a 5-bit operand $X = x_4 x_3 x_2 x_1 x_0$, using sign-bit exclusion. (a) LUT-multiplier with coefficient $A$ in sign-magnitude representation. (b) LUT-multiplier with coefficient $A$ in 2's complement representation.

TABLE I
INPUT WORDS, PRODUCT WORDS AND STORED WORDS FOR L = 5

address, X    product    address, X    product    stored words
              values                   values     2's comp    sign-mag
0 0 0 0 0     0          1 0 0 0 0     0          0           0
0 0 0 0 1     A          1 0 0 0 1     -A         A           |A|
0 0 0 1 0     2A         1 0 0 1 0     -2A        2A          2|A|
0 0 0 1 1     3A         1 0 0 1 1     -3A        3A          3|A|
0 0 1 0 0     4A         1 0 1 0 0     -4A        4A          4|A|
0 0 1 0 1     5A         1 0 1 0 1     -5A        5A          5|A|
0 0 1 1 0     6A         1 0 1 1 0     -6A        6A          6|A|
0 0 1 1 1     7A         1 0 1 1 1     -7A        7A          7|A|
0 1 0 0 0     8A         1 1 0 0 0     -8A        8A          8|A|
0 1 0 0 1     9A         1 1 0 0 1     -9A        9A          9|A|
0 1 0 1 0     10A        1 1 0 1 0     -10A       10A         10|A|
0 1 0 1 1     11A        1 1 0 1 1     -11A       11A         11|A|
0 1 1 0 0     12A        1 1 1 0 0     -12A       12A         12|A|
0 1 1 0 1     13A        1 1 1 0 1     -13A       13A         13|A|
0 1 1 1 0     14A        1 1 1 1 0     -14A       14A         14|A|
0 1 1 1 1     15A        1 1 1 1 1     -15A       15A         15|A|
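The two sign-bit exclusion variants can be mimicked in software as follows. This is a minimal sketch for $L = 5$, assuming Python integers stand in for the stored words (so negation plays the role of the 2's complement unit); all function names are hypothetical.

```python
L = 5                              # 1 sign bit + 4 magnitude bits
MAG_MASK = (1 << (L - 1)) - 1      # selects x3 x2 x1 x0


def split_sign_magnitude(x_word: int) -> tuple[int, int]:
    """Separate the sign bit x4 from the 4-bit magnitude used as LUT address."""
    return (x_word >> (L - 1)) & 1, x_word & MAG_MASK


def build_lut(coefficient: int) -> list[int]:
    """16 stored words: coefficient * |X| for |X| = 0 .. 15."""
    return [coefficient * m for m in range(1 << (L - 1))]


def multiply_sign_magnitude(lut: list[int], s_a: int, x_word: int) -> tuple[int, int]:
    """Sign-magnitude case: LUT of |A| * |X| plus one XOR gate for the sign."""
    s_x, mag = split_sign_magnitude(x_word)
    return s_a ^ s_x, lut[mag]


def multiply_twos_complement(lut: list[int], x_word: int) -> int:
    """2's complement case: negate the LUT output when the input sign is set."""
    s_x, mag = split_sign_magnitude(x_word)
    word = lut[mag]
    return -word if s_x else word   # 2's complement unit + 2:1 MUX


lut_mag = build_lut(11)             # |A| = 11, illustrative value
lut_tc = build_lut(-11)             # A = -11 in 2's complement, illustrative value
```

For example, with $|A| = 11$, $s_A = 1$ and $X = (1\,0110)$, the first function returns sign 0 and magnitude 66, i.e. $(-11) \times (-6) = 66$.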

B. Anti-Symmetric Product Coding for LUT Optimization

The product words for different values of the modulus $|X|$ for $L = 5$ are shown in the second column of Table II. It may be observed in Table II that the address word $|X|$ on the $i$-th row is the 1's complement of that on the $(16 + 1 - i)$-th row for $1 \le i \le 8$. Besides, the sum of the product values on these two rows is $15A$. Let the product values on the $i$-th and $(16 + 1 - i)$-th rows be $u$ and $v$, respectively. Since one can write $u = \frac{u+v}{2} - \frac{v-u}{2}$ and $v = \frac{u+v}{2} + \frac{v-u}{2}$, for $(u + v) = 15A$ we can have

$$u = \frac{15A}{2} - \frac{v-u}{2} \quad \text{and} \quad v = \frac{15A}{2} + \frac{v-u}{2} \qquad (3)$$

The product values in the second column of Table II, therefore, could be rewritten (in the form given in the third column of the table) according to (3). Since the product words on the $i$-th and $(16 + 1 - i)$-th rows, for $1 \le i \le 8$, in the third column of this table have a negative mirror symmetry, we name this anti-symmetric product coding (APC). This behavior of APC can be used to reduce the LUT size as shown in the fourth and fifth columns of the table, where instead of storing $u$ and $v$ on the $i$-th and $(16 + 1 - i)$-th rows, respectively, $(v - u)$ is stored on the $i$-th row, for $1 \le i \le 8$. The desired product could be obtained by adding or subtracting the stored value $(v - u)$ to or from the fixed value $15A$ when $x_3$ is 1 or 0, respectively, followed by a right-shift operation (for scaling down by a factor of 2).
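As a worked instance of the coding, consider the pair of addresses $|X| = 0010$ and $|X| = 1101$, which are 1's complements of each other. The numbers below follow directly from the product values $2A$ and $13A$; the choice of this particular pair is only for illustration.

```latex
\begin{align*}
u &= 2A, \qquad v = 13A, \qquad u + v = 15A, \qquad v - u = 11A \\
u &= \frac{15A - 11A}{2} = 2A \quad (x_3 = 0:\ \text{subtract, then right-shift}) \\
v &= \frac{15A + 11A}{2} = 13A \quad (x_3 = 1:\ \text{add, then right-shift})
\end{align*}
```

Only the single word $11A$ needs to be stored for both addresses, which is where the halving of the LUT comes from.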

APC can also be used when both the multiplicands are in sign-magnitude representation. As shown in Table II, all the stored values in this case are positive, since $(v - u)$ is always positive. Besides, the result of addition/subtraction with/from the fixed value $15|A|$ is also positive. The sign of the product word $s$, therefore, can finally be determined by an XOR operation according to (2b).

TABLE II
THE APC-BASED LUT ALLOCATION FOR DIFFERENT VALUES OF |X|

modulus, |X|    product    APC                 stored words
x3 x2 x1 x0     values     representation      2's comp    sign-mag
0 0 0 0         0          (15A - 15A)/2       15A         15|A|
0 0 0 1         A          (15A - 13A)/2       13A         13|A|
0 0 1 0         2A         (15A - 11A)/2       11A         11|A|
0 0 1 1         3A         (15A - 9A)/2        9A          9|A|
0 1 0 0         4A         (15A - 7A)/2        7A          7|A|
0 1 0 1         5A         (15A - 5A)/2        5A          5|A|
0 1 1 0         6A         (15A - 3A)/2        3A          3|A|
0 1 1 1         7A         (15A - A)/2         A           |A|
1 0 0 0         8A         (15A + A)/2
1 0 0 1         9A         (15A + 3A)/2
1 0 1 0         10A        (15A + 5A)/2
1 0 1 1         11A        (15A + 7A)/2
1 1 0 0         12A        (15A + 9A)/2
1 1 0 1         13A        (15A + 11A)/2
1 1 1 0         14A        (15A + 13A)/2
1 1 1 1         15A        (15A + 15A)/2

The structure and function of the LUT-based multiplier for $L = 5$ using the sign-bit exclusion and APC techniques (assuming $A$ as well as the product values $C$ in 2's complement representation) are shown in Fig. 3(a). It consists of an LUT of 8 words to store the product words in 2's complement representation as given in the first eight rows of the fourth column of Table II. A 3-to-8 line decoder, which generates the word-select signals for the LUT, is fed with the address bits through an address mapping circuit. When $x_3 = 0$, the three least significant bits of $X$, i.e., $(x_2 x_1 x_0)$, are used as the LUT address, while for $x_3 = 1$, the 1's complement of $(x_2 x_1 x_0)$ is used as the address. The address-mapping circuit, therefore, can be designed with three XOR gates, where $x_3$ is fed to one input of every XOR gate while $x_2$, $x_1$ and $x_0$ are fed to the other inputs of the three gates, as shown in the figure. The output of the LUT is added with or subtracted from $15A$, for $x_3 = 1$ or 0, respectively, followed by scaling down by a factor of two, using an ADD/SUB-SCALE cell according to (3). The output of the ADD/SUB-SCALE cell is negated by a 2's complement unit and a 2:1 MUX when $x_4 = 1$.
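The data path just described (address mapping, 8-word LUT, ADD/SUB-SCALE, sign determination) can be traced in software. The following is a behavioural sketch only, assuming Python integers in place of fixed-width 2's complement words; the function names are hypothetical.

```python
def build_apc_lut(a: int) -> list[int]:
    """Eight stored words (v - u): 15a at address 000 down to a at address 111."""
    return [(15 - 2 * addr) * a for addr in range(8)]


def apc_multiply(lut: list[int], a: int, x_word: int) -> int:
    """Multiply a by the 5-bit sign-magnitude word x4 x3 x2 x1 x0."""
    x4 = (x_word >> 4) & 1                 # sign bit of X
    x3 = (x_word >> 3) & 1                 # selects add or subtract
    low = x_word & 0b111                   # x2 x1 x0
    address = low ^ (0b111 * x3)           # three XOR gates: complement if x3 = 1
    stored = lut[address]
    fixed = 15 * a
    doubled = fixed + stored if x3 else fixed - stored   # ADD/SUB
    magnitude_product = doubled >> 1       # SCALE: doubled is always even
    return -magnitude_product if x4 else magnitude_product  # 2's comp unit + MUX


apc_lut = build_apc_lut(23)                # A = 23, illustrative value
```

Because the value before scaling is always $2A|X|$, the right shift loses no information, including when $A$ is negative.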

The structure and function of the LUT-based multiplier for $L = 5$ using the sign-bit exclusion and APC techniques, assuming both the multiplicands in sign-magnitude representation, are shown in Fig. 3(b). It stores the magnitudes of the product values given in the first eight rows of the fifth column of Table II. The address mapping circuit in this case is the same as that of Fig. 3(a). The output of the LUT is added with or subtracted from $15|A|$, for $x_3 = 1$ or 0, respectively, by an ADD/SUB-SCALE cell according to (3). The sign-bit $s$ of the product word is generated by a 2-input XOR gate as in the case of Fig. 2(a).

In the next section, we show that the APC technique could be used for efficient implementation of high-precision LUT-based multiplication by suitable input operand decomposition and summation of the resulting partial sums.

[Figure: (a) an address mapping stage on $x_2, x_1, x_0$ controlled by $x_3$ feeds a 3-to-8 line decoder and an 8-word LUT holding the product words $A, 3A, 5A, \ldots, 15A$ in 2's complement representation; the $(W+4)$-bit LUT output and the fixed value $15A$ enter an ADD/SUB-SCALE cell, and a sign-determination stage (2's complement unit and 2:1 MUX selected by $x_4 = s_X$) delivers the product value $AX$ in 2's complement representation. (b) the same address mapping and decoder with an 8-word LUT holding the magnitudes $|A|, 3|A|, 5|A|, \ldots, 15|A|$; the $(W+3)$-bit LUT output enters an ADD/SUB-SCALE cell, and a sign-determination stage fed by $x_4 = s_X$ and $s_A$ completes the $(W+4)$-bit product $AX$ in sign-magnitude representation.]

Fig. 3. LUT design for multiplication of a $W$-bit fixed coefficient $A$ with a 5-bit input, $X = x_4 x_3 x_2 x_1 x_0$, using APC. (a) LUT-based multiplier with constant multiplicand in 2's complement representation. (b) LUT-based multiplier with constant multiplicand in sign-magnitude representation.

III. HIGH-PRECISION MEMORY-BASED MULTIPLICATION BY INPUT-OPERAND DECOMPOSITION

Although the area of the memory core of the LUT-based multiplier proposed in Section II is nearly half that of the conventional LUT-multiplier, the memory size still increases exponentially with the input word-length. It is, therefore, not an area-efficient choice to extend the proposed LUT-multiplier of Section II to large input words. When the width of the input multiplicand $X$ is large, it could however be decomposed into a certain number of segments or sub-words, and the partial products pertaining to the different sub-words could be shift-added to obtain the desired product word.


TABLE III
AREA AND TIME COMPLEXITIES OF LUT-MULTIPLIERS FOR L = 16

                                   W = 8                                    W = 16
multiplier designs                 area     delay   ADP       excess ADP    area     delay   ADP       excess ADP
conventional LUT multiplier        3688.9   3.24    11952.0   61.02 %       5866.4   3.99    23406.9   52.58 %
odd-multiple LUT-multiplier [1]    2870.4   3.32    9529.7    28.39 %       4608.3   4.13    19032.3   24.07 %
proposed LUT-multiplier            2348.9   3.16    7422.5    -             3893.5   3.94    15340.4   -

Area is in µm². Delay refers to the average multiplication time in ns. ADP stands for area-delay product. Excess ADP is the additional ADP of a design relative to the proposed LUT-multiplier.
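The derived columns of Table III can be rechecked from the area and delay figures. The short script below is a sketch under one assumption, namely that the excess ADP is measured relative to the proposed design, which is what the tabulated percentages appear to correspond to; the dictionary layout and function names are illustrative.

```python
# (area, delay in ns) per design and coefficient width W, copied from Table III
designs = {
    "conventional": {8: (3688.9, 3.24), 16: (5866.4, 3.99)},
    "odd-multiple": {8: (2870.4, 3.32), 16: (4608.3, 4.13)},
    "proposed":     {8: (2348.9, 3.16), 16: (3893.5, 3.94)},
}


def adp(name: str, w: int) -> float:
    """Area-delay product of one design at coefficient width w."""
    area, delay = designs[name][w]
    return area * delay


def excess_adp_percent(name: str, w: int) -> float:
    """Extra ADP of a design over the proposed one, in percent (assumed baseline)."""
    return 100.0 * (adp(name, w) / adp("proposed", w) - 1.0)


def average_excess(name: str) -> float:
    """Mean of the excess ADP over W = 8 and W = 16."""
    return (excess_adp_percent(name, 8) + excess_adp_percent(name, 16)) / 2.0
```

Averaging the two widths in this way gives figures close to the 56.8% and 26.2% quoted in the abstract.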

Let the magnitude part of the input multiplicand $X$ be decomposed into $P$ segments (or sub-words) $\{X_1\ X_2\ \cdots\ X_P\}$, where $X_i = \{x_{(i-1)S}\ x_{(i-1)S+1}\ \cdots\ x_{iS-1}\}$ is an $S$-bit sub-word for $1 \le i \le P - 1$, and $X_P = \{x_{(P-1)S}\ x_{(P-1)S+1}\ \cdots\ x_{(P-1)S+S'-1}\}$ is the last sub-word of $S'$ bits, where $S' \le S$, $x_i$ is the $(i+1)$-th bit of $X$, and $P$ is an integer for which the word-length of $|X|$ is not more than $PS$ (i.e., $L - 1 \le PS$). Without loss of generality, we can however assume the width of the input magnitude to be $L - 1 = PS$. The product word $C = A \cdot X$ thus can be written as a sum of partial products generated by multiplying $A$ with the sub-words of $X$ as

$$C = s \circ \sum_{i=1}^{P} 2^{S(i-1)} \, C_i \qquad (4)$$

The partial product $C_i = A \cdot X_i$ for $1 \le i \le P$ can have a maximum width of $(W + S - 1)$ bits or $(W + S)$ bits when $A$ is in sign-magnitude or 2's complement representation, respectively. Here $\circ$ denotes the 2's complement operation in the case of 2's complement representation of $A$ if $s = s_X = 1$, or concatenation of $s = s_A \ \text{XOR} \ s_X$ to the MSB of the estimated sum in the case of sign-magnitude representation of $A$.
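The shift-add of partial products in (4) can be illustrated with a small sketch. It assumes equal-width sub-words ($L - 1 = PS$) and a single shared $S$-bit table of plain products for every sub-word, which is a simplification made here for brevity; the function names are hypothetical.

```python
def split_magnitude(mag: int, s: int, p: int) -> list[int]:
    """Split |X| into p sub-words of s bits each, least significant first."""
    mask = (1 << s) - 1
    return [(mag >> (s * i)) & mask for i in range(p)]


def decomposed_multiply(a: int, mag: int, s: int, p: int) -> int:
    """Compute a * |X| as the sum of shifted partial products, as in (4)."""
    lut = [a * v for v in range(1 << s)]        # one small 2**s-word table
    sub_words = split_magnitude(mag, s, p)
    # sub_words[i] corresponds to X_{i+1}, so its weight is 2**(s * i)
    return sum(lut[x_i] << (s * i) for i, x_i in enumerate(sub_words))


example = decomposed_multiply(23, 0x5A3, 4, 3)  # 12-bit magnitude, S = 4, P = 3
```

Each table has only $2^S$ words, so the memory grows linearly with the number of sub-words rather than exponentially with the full word-length.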

For $L = 16$, let us take $S = 4$ and $P = 4$. The input multiplicand $X$, therefore, can be decomposed into 4 sub-words $X_1$, $X_2$, $X_3$ and $X_4$, where $X_2$, $X_3$ and $X_4$ contain the 12 more significant bits of $|X|$, and each of them is a 4-bit word. Let $X_1$ contain the less significant 3 bits of $|X|$. Three partial products $C_2 = A \cdot X_2$, $C_3 = A \cdot X_3$ and $C_4 = A \cdot X_4$ are computed by three LUT-based multipliers using APC. Since $X_1$ has three bits only, $C_1 = A \cdot X_1$ can be computed by a conventional 8-word LUT. Moreover, the fixed offset value $V = [120A\,(1 + 2^4 + 2^8)]$ due to the other three LUTs can also be added to $C_1$ and stored in this LUT, such that LUT-1 stores the values $(A \cdot X_1 + V)$. The proposed multiplier for $L = 16$ for any given value of coefficient size $W$ is shown in Fig. 4. The least significant 3-bit sub-word $X_1$ of $X$ is fed as input to LUT-1, and the next more significant sub-words $X_2$, $X_3$ and $X_4$ are fed as inputs to LUT-2, LUT-3 and LUT-4, respectively. The structure of each of LUT-2, LUT-3 and LUT-4 is similar to the one shown in Fig. 3 (except that they do not have the ADD/SUB-SCALE cell). The signs of the LUT outputs are modified according to the value of the most significant bit of the corresponding sub-word $X_i$ for $i = 2$, 3 and 4, and the LUT outputs are appropriately shifted and added together as shown in Fig. 4. The sign of the product is finally modified depending on the sign of $X$ by a sign-determination circuit (as shown in Fig. 3).
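The arithmetic of this $L = 16$ arrangement can be checked with a behavioural sketch. It rests on an assumption that is not spelled out in the text: the sum is accumulated at twice the final scale, so that LUT-1 holds $2A \cdot X_1 + V$ and a single right shift is applied at the end. Under that assumption the offset works out to exactly $V = 120A(1 + 2^4 + 2^8)$; the actual scaling arrangement in Fig. 4 may differ. The coefficient value and all names are illustrative.

```python
A = 23                                     # illustrative coefficient (assumption)
V = 120 * A * (1 + 2**4 + 2**8)            # fixed offset quoted in the text

APC_LUT = [(15 - 2 * addr) * A for addr in range(8)]   # shared by LUT-2..LUT-4
LUT1 = [2 * A * x1 + V for x1 in range(8)]  # assumption: kept at doubled scale


def apc_signed_word(sub: int) -> int:
    """APC LUT output for a 4-bit sub-word, with sign set by its MSB."""
    msb = (sub >> 3) & 1
    address = (sub & 0b111) ^ (0b111 * msb)  # complement address when MSB = 1
    word = APC_LUT[address]
    return word if msb else -word


def multiply_16bit(x_word: int) -> int:
    """Multiply A by a 16-bit sign-magnitude word (x15 is the sign bit)."""
    s_x = (x_word >> 15) & 1
    mag = x_word & 0x7FFF
    total = LUT1[mag & 0b111]                # 2*A*X1 + V
    for k in range(3):                       # sub-words X2, X3, X4
        shift = 3 + 4 * k
        sub = (mag >> shift) & 0xF
        total += apc_signed_word(sub) << shift
    product_mag = total >> 1                 # total equals 2*A*|X|
    return -product_mag if s_x else product_mag
```

The three per-LUT offsets $15A \cdot 2^3$, $15A \cdot 2^7$ and $15A \cdot 2^{11}$ add up to $V$, which is why no ADD/SUB-SCALE cell is needed in LUT-2 to LUT-4.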

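[Figure (Fig. 4): the 3-bit sub-word $X_1$ addresses LUT-1, whose output is $V + AX_1$; the 4-bit sub-words $X_2$, $X_3$ and $X_4$ address LUT-2, LUT-3 and LUT-4, each with a $(W+4)$-bit output; three adders (with widths marked $W+8$ and $W+15$) combine the LUT outputs into the $(W+15)$-bit value $A|X|$, and the sign bit $s_X = x_{15}$ is then applied to give $AX$.]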
