new lut
8/7/2019 new LUT
1/4
New Look-up-Table Optimizations for
Memory-Based Multiplication
Pramod Kumar Meher
Communication Systems Department, Institute for Infocomm Research
1 Fusionopolis Way, Singapore, email: [email protected].
Abstract: Two new techniques are presented in this paper for the reduction of look-up-table (LUT) size of memory-based multipliers to be used in digital signal processing applications. It is shown that by simple sign-bit exclusion, the LUT size is reduced by half at the cost of a marginal area overhead. Moreover, a novel anti-symmetric product coding (APC) scheme is proposed to reduce the LUT size by a further half, where the LUT output is added to or subtracted from a fixed value. It is shown that the optimized LUTs for small input widths could be used for efficient implementation of high-precision LUT-multipliers, where the total contribution of all such fixed offsets could be added to the final result or could be initialized for successive accumulations.
The proposed LUT-multiplier and the existing ones are coded in VHDL and synthesized by Synopsys Design Compiler using the TSMC 90 nanometer library. The proposed optimized LUT-multiplier is found to involve less area and less multiplication time than the existing LUT-multipliers. The conventional LUT-multiplier and the odd-multiple storage LUT of [1] involve nearly 56.8% and 26.2% more area-delay product, on average, for 16-bit input than the proposed LUT design.
I. INTRODUCTION
Along with the device scaling over the years, semiconductor
memory has become cheaper, faster and more power-efficient.
Moreover, according to the projections of the international
technology roadmap for semiconductors (ITRS) [2], embedded
memories will have dominating presence in the system-on-
chips (SoC), which may exceed 90% of total SoC content.
It has also been found that the transistor packing density
of memory components is not only high but also increasing
much faster than the transistor density of logic components.
Apart from that, the memory-based computing structures are
more regular than the multiply-accumulate structures; and
offer many other advantages, e.g., greater potential for high-
throughput and low-latency implementation; and less dynamic
power consumption. Memory-based computing is well-suited
for many digital signal processing (DSP) algorithms, which
involve multiplication with a fixed set of coefficients.
The basic functional model of a memory-based multiplier is shown in Fig. 1. Let A be a fixed coefficient and X be an input word to be multiplied with A. Assuming X to be a positive binary number with word-length L, there can be $2^L$ possible values of X, and accordingly, there can be $2^L$ possible values of the product $C = A \cdot X$. Therefore, for memory-based multiplication, an LUT of $2^L$ words consisting of pre-computed product values corresponding to all possible values of X is conventionally used. The product word $A \cdot X_i$ is stored at the location whose address is the same as $X_i$ for $0 \le X_i \le 2^L - 1$, such that if the L-bit binary value of $X_i$ is used as the address for the LUT, then the corresponding product value $A \cdot X_i$ is available as its output.
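The conventional scheme described above can be sketched in a few lines of Python. This is only an illustrative software model, not a hardware description; the function names and the example values A = 23 and L = 4 are assumptions made for the sketch.

```python
def build_conventional_lut(a, l):
    """Precompute A*X for every L-bit address X, giving 2**L stored words."""
    return [a * x for x in range(1 << l)]


def lut_multiply(lut, x):
    """Multiplication is a single memory read: the input word is the address."""
    return lut[x]


# Illustrative values (assumed, not from the paper).
A, L = 23, 4
lut = build_conventional_lut(A, L)
print(len(lut), lut_multiply(lut, 9))
```

The table length grows as $2^L$, which is the cost the optimizations in the following sections try to reduce.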
Several architectures have been reported in the literature for
memory-based implementation of DSP algorithms involving
orthogonal transforms and digital filters [3]-[6]. But, we do not
find any significant work on LUT optimization for memory-
based multiplication. In a recent paper [1], we have presented a
new approach to LUT design for memory-based multiplication,
which could be used to reduce the memory-size by half for
small input-widths. It is shown in [1] that although $2^L$ possible values of X correspond to $2^L$ possible values of $C = A \cdot X$, only $(2^L/2)$ words corresponding to the odd multiples of A need to be stored in the LUT, while the even multiples of A could be derived by left-shift operations of one of those odd multiples. We refer to this scheme as the odd-multiple-storage LUT in the rest of this paper. Using this approach, one can reduce the LUT size to half, but it has significant combinational overhead since it requires a barrel-shifter along with a control-circuit to generate the control-bits for producing a maximum of $(L-1)$ left-shifts, and an encoder to map the L-bit input word to an $(L-1)$-bit LUT address.
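The odd-multiple-storage idea of [1] can be illustrated with the following sketch. It is a behavioral model under stated assumptions, not the circuit of [1]: the names are hypothetical, the `encode` loop stands in for the encoder and control circuit, and the handling of X = 0 (which has no odd factor) is an assumption of this sketch.

```python
def build_odd_multiple_lut(a, l):
    """Store only the odd multiples A, 3A, 5A, ...: 2**(l-1) words."""
    return [a * (2 * k + 1) for k in range(1 << (l - 1))]


def encode(x):
    """Split nonzero x into (address, shift) with x == (2*address + 1) << shift."""
    shift = 0
    while x & 1 == 0:
        x >>= 1
        shift += 1
    return x >> 1, shift


def odd_lut_multiply(lut, x):
    if x == 0:
        return 0  # assumption: zero input is handled outside the LUT
    address, shift = encode(x)
    return lut[address] << shift  # the left shift models the barrel shifter


odd_lut = build_odd_multiple_lut(23, 4)
print(len(odd_lut), odd_lut_multiply(odd_lut, 12))
```

For L = 4 the largest shift occurs at X = 8, which needs three left shifts, in line with the maximum of $(L-1)$ shifts mentioned above.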
In this paper, we propose two new schemes for optimization of LUT with lower area- and time-overhead. One of the proposed optimizations is based on exclusion of the sign-bit from the LUT address, and the other is based on a recoding of the stored product word.
II. PROPOSED OPTIMIZATIONS OF LOOK-UP-TABLE FOR
MEMORY-BASED MULTIPLICATION
The proposed LUT optimizations for multiplication of a number X with a given constant A are possible for both sign-magnitude and 2's complement representations of X and A. Besides, both X and A could be fractions or integers in fixed-point format. But, for simplicity of presentation, the multiplicand X is assumed here to be an integer in sign-magnitude representation, while the constant A is assumed to be either in sign-magnitude or in 2's complement representation.
[Fig. 1 diagram: a memory core of $2^L$ words with an L-bit address port driven by X and a (W+L)-bit output port delivering A·X.]
Fig. 1. Basic Functional Model of LUT-Based Multiplier.
ISIC 2009, p. 663
A. Sign-Bit Exclusion for LUT Optimization
Let us write the input multiplicand as $X = s_X \odot |X|$, where the operation $\odot$ is a bit-concatenation at the MSB of $|X|$, $s_X$ is the sign-bit (MSB) of X, and the magnitude-part $|X|$ is given by
$$|X| = \sum_{i=0}^{L-2} x_i \cdot 2^i \qquad (1)$$
The product $C = A \cdot X$, similarly, can be written as
$$C = s \odot |A| \cdot |X|, \text{ and} \qquad (2a)$$
$$s = s_A \ \text{XOR} \ s_X \qquad (2b)$$
where $s_A$ and $|A|$, respectively, are the sign-bit and magnitude-part of A.
Since $|X|$ is an $(L-1)$-bit binary number, all possible product values of $|A| \cdot |X|$ can be stored as $2^{(L-1)}$ LUT words, while the sign-bit can be derived by an XOR operation according to (2b). The product words and the words required to be stored for different values of X for LUT-based multiplication (for L = 5) are shown in Table I. The product word corresponding to $X = (1\,x_3 x_2 x_1 x_0)$ is the negative of that for $X = (0\,x_3 x_2 x_1 x_0)$ for any given value of $|X| = (x_3 x_2 x_1 x_0)$. The product words in the fourth column of Table I can, therefore, be derived by negating the product word in the second column of the same row. Therefore, instead of 32 product words, only 16 values of $|A| \cdot |X|$ for all possible values of $|X| = (x_3 x_2 x_1 x_0)$ are required to be stored, as shown in the sixth column of Table I. The corresponding LUT-multiplier is shown in Fig. 2(a). It requires only one additional XOR gate to determine the sign of the product word C. The sign-exclusion technique can also be applied for 2's complement representation of coefficient
[Fig. 2 diagram. (a) A 4-to-16 line decoder driven by x0 to x3 selects one word of a 16-word LUT holding the (W+3)-bit product magnitudes 0, |A|, 2|A|, ..., 15|A|; the sign bit is formed from sA and x4 = sX, giving the (W+4)-bit product A·X in sign-magnitude representation. (b) A 4-to-16 line decoder driven by x0 to x3 selects one word of a 16-word LUT holding the (W+4)-bit product words 0, A, 2A, ..., 15A in 2's complement representation; a 2's complement unit and a 2:1 MUX controlled by x4 = sX deliver A·X in 2's complement representation.]
Fig. 2. Proposed LUTs for multiplication of W-bit fixed coefficient A with 5-bit operand X = x4 x3 x2 x1 x0, using sign-bit exclusion. (a) LUT-multiplier with coefficient A in sign-magnitude representation. (b) LUT-multiplier with coefficient A in 2's complement representation.
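TABLE I
INPUT WORDS, PRODUCT WORDS AND STORED WORDS FOR L = 5

| address X (x4 x3 x2 x1 x0) | product value | address X (x4 x3 x2 x1 x0) | product value | stored word (2's comp) | stored word (sign-mag) |
|---|---|---|---|---|---|
| 0 0 0 0 0 | 0 | 1 0 0 0 0 | 0 | 0 | 0 |
| 0 0 0 0 1 | A | 1 0 0 0 1 | -A | A | \|A\| |
| 0 0 0 1 0 | 2A | 1 0 0 1 0 | -2A | 2A | 2\|A\| |
| 0 0 0 1 1 | 3A | 1 0 0 1 1 | -3A | 3A | 3\|A\| |
| 0 0 1 0 0 | 4A | 1 0 1 0 0 | -4A | 4A | 4\|A\| |
| 0 0 1 0 1 | 5A | 1 0 1 0 1 | -5A | 5A | 5\|A\| |
| 0 0 1 1 0 | 6A | 1 0 1 1 0 | -6A | 6A | 6\|A\| |
| 0 0 1 1 1 | 7A | 1 0 1 1 1 | -7A | 7A | 7\|A\| |
| 0 1 0 0 0 | 8A | 1 1 0 0 0 | -8A | 8A | 8\|A\| |
| 0 1 0 0 1 | 9A | 1 1 0 0 1 | -9A | 9A | 9\|A\| |
| 0 1 0 1 0 | 10A | 1 1 0 1 0 | -10A | 10A | 10\|A\| |
| 0 1 0 1 1 | 11A | 1 1 0 1 1 | -11A | 11A | 11\|A\| |
| 0 1 1 0 0 | 12A | 1 1 1 0 0 | -12A | 12A | 12\|A\| |
| 0 1 1 0 1 | 13A | 1 1 1 0 1 | -13A | 13A | 13\|A\| |
| 0 1 1 1 0 | 14A | 1 1 1 1 0 | -14A | 14A | 14\|A\| |
| 0 1 1 1 1 | 15A | 1 1 1 1 1 | -15A | 15A | 15\|A\| |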
A. The multiplier for 2's complement representation of A is shown in Fig. 2(b), which stores the product words in 2's complement representation, and requires a 2's complement unit along with a 2:1 MUX to change the sign of the LUT output for negative values of X.
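A minimal behavioral sketch of sign-bit exclusion for the sign-magnitude case is given below. The function names and the example coefficient are assumptions made for illustration; the sketch models only the 16-word table and the XOR of (2b), not the decoder or memory circuit.

```python
def build_sign_excluded_lut(mag_a, l):
    """Store |A|*|X| for every (l-1)-bit magnitude: 2**(l-1) words."""
    return [mag_a * m for m in range(1 << (l - 1))]


def sm_multiply(lut, s_a, x, l):
    """x is an l-bit sign-magnitude word whose MSB is the sign bit.

    Returns (sign bit, magnitude) of the product, as in (2a) and (2b).
    """
    s_x = (x >> (l - 1)) & 1              # sign bit, excluded from the address
    mag_x = x & ((1 << (l - 1)) - 1)      # remaining bits address the LUT
    return s_a ^ s_x, lut[mag_x]          # one XOR gate gives the product sign


# Illustrative case (assumed values): A = -11, i.e. s_A = 1 and |A| = 11, L = 5.
sm_lut = build_sign_excluded_lut(11, 5)
print(sm_multiply(sm_lut, 1, 0b10011, 5))  # X = -3
```

Only 16 words are stored for L = 5, against 32 in the conventional table.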
B. Anti-Symmetric Product Coding for LUT Optimization
The product words for different values of the modulus $|X|$ for L = 5 are shown in the second column of Table II. It may be observed in Table II that the address word $|X|$ on the i-th row is the 1's complement of that on the $(16+1-i)$-th row for $1 \le i \le 8$. Besides, the sum of the product values on these two rows is 15A. Let the product values on the i-th and $(16+1-i)$-th rows be u and v, respectively. Since one can write $u = \frac{u+v}{2} - \frac{v-u}{2}$ and $v = \frac{u+v}{2} + \frac{v-u}{2}$, for $(u+v) = 15A$ we can have
$$u = \frac{15A}{2} - \frac{v-u}{2} \quad \text{and} \quad v = \frac{15A}{2} + \frac{v-u}{2} \qquad (3)$$
The product values in the second column of Table II, therefore, could be re-written (in the form given in the third column of the table) according to (3). Since the product words at the i-th and $(16+1-i)$-th rows, for $1 \le i \le 8$, in the third column of this table have a negative mirror symmetry, we name this anti-symmetric product coding (APC). This behavior of APC can be used to reduce the LUT size as shown in the fourth and fifth columns of the table, where instead of storing u and v at the i-th and $(16+1-i)$-th rows, respectively, $(v-u)$ is stored at the i-th row, for $1 \le i \le 8$. The desired product could be obtained by adding the stored value $(v-u)$ to, or subtracting it from, the fixed value 15A when $x_3$ is 1 or 0, respectively, followed by a right-shift operation (for scaling down by a factor of 2).
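As a worked instance of (3), consider the third row of Table II, $|X| = (0010)$, and its mirror row, the fourteenth, $|X| = (1101)$. The choice of rows is only illustrative.

```latex
% Row i = 3:  |X| = 0010, u = 2A.   Row 16+1-i = 14:  |X| = 1101, v = 13A.
\begin{align*}
u + v &= 2A + 13A = 15A, \qquad v - u = 13A - 2A = 11A \\
u &= \frac{15A}{2} - \frac{11A}{2} = 2A \qquad (x_3 = 0:\ \text{subtract the stored word } 11A) \\
v &= \frac{15A}{2} + \frac{11A}{2} = 13A \qquad (x_3 = 1:\ \text{add the stored word } 11A)
\end{align*}
```

A single stored word, 11A, therefore serves both addresses; only the add/subtract decision, taken from $x_3$, differs.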
APC can also be used when both the multiplicands are in sign-magnitude representation. As shown in Table II, all the stored values in this case are positive, since $(v-u)$ is always positive. Besides, the result of addition/subtraction with/from
TABLE II
THE APC-BASED LUT ALLOCATION FOR DIFFERENT VALUES OF |X|

| modulus \|X\| (x3 x2 x1 x0) | product value | APC representation | stored word (2's comp) | stored word (sign-mag) |
|---|---|---|---|---|
| 0 0 0 0 | 0 | (15A - 15A)/2 | 15A | 15\|A\| |
| 0 0 0 1 | A | (15A - 13A)/2 | 13A | 13\|A\| |
| 0 0 1 0 | 2A | (15A - 11A)/2 | 11A | 11\|A\| |
| 0 0 1 1 | 3A | (15A - 9A)/2 | 9A | 9\|A\| |
| 0 1 0 0 | 4A | (15A - 7A)/2 | 7A | 7\|A\| |
| 0 1 0 1 | 5A | (15A - 5A)/2 | 5A | 5\|A\| |
| 0 1 1 0 | 6A | (15A - 3A)/2 | 3A | 3\|A\| |
| 0 1 1 1 | 7A | (15A - A)/2 | A | \|A\| |
| 1 0 0 0 | 8A | (15A + A)/2 | | |
| 1 0 0 1 | 9A | (15A + 3A)/2 | | |
| 1 0 1 0 | 10A | (15A + 5A)/2 | | |
| 1 0 1 1 | 11A | (15A + 7A)/2 | | |
| 1 1 0 0 | 12A | (15A + 9A)/2 | | |
| 1 1 0 1 | 13A | (15A + 11A)/2 | | |
| 1 1 1 0 | 14A | (15A + 13A)/2 | | |
| 1 1 1 1 | 15A | (15A + 15A)/2 | | |
the fixed value 15|A| is also positive. The sign s of the product word, therefore, can finally be determined by an XOR operation according to (2b).
The structure and function of the LUT-based multiplier for L = 5 using the sign-bit exclusion and APC techniques (assuming A as well as the product values C in 2's complement representation) is shown in Fig. 3(a). It consists of an LUT of 8 words to store the product words in 2's complement representation as given in the first eight rows of the fourth column of Table II. A 3-to-8 line decoder, which generates the word-select signals for the LUT, is fed with the address bits through an address mapping circuit. When $x_3 = 0$, the three less significant bits of X, i.e., $(x_2 x_1 x_0)$, are used as the LUT address, while for $x_3 = 1$, the 1's complement of $(x_2 x_1 x_0)$ is used as the address. The address-mapping circuit, therefore, can be designed with three XOR gates, where $x_3$ is fed to one input of every XOR gate while $x_2$, $x_1$ and $x_0$ are fed to the other inputs of the three XOR gates, as shown in the figure. The LUT output is added to or subtracted from 15A, for $x_3 = 1$ or 0, respectively, followed by a scaling down by a factor of two, using an ADD/SUB-SCALE cell according to (3). The output of the ADD/SUB-SCALE cell is negated by a 2's complement unit and a 2:1 MUX when $x_4 = 1$.
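The data path described for the 2's complement case can be modeled behaviorally as follows. This is a sketch under stated assumptions: the function names are hypothetical, Python integers stand in for 2's complement words, and the decoder and memory are reduced to a list lookup.

```python
def build_apc_lut(a):
    """Eight stored words of Table II: address 0..7 holds 15A, 13A, ..., A."""
    return [(15 - 2 * addr) * a for addr in range(8)]


def apc_multiply(lut, a, x):
    """x is a 5-bit sign-magnitude input (x4 is the sign bit). Returns A*X."""
    x4 = (x >> 4) & 1
    x3 = (x >> 3) & 1
    low = x & 0b111
    addr = low ^ (0b111 if x3 else 0)   # address mapping: three XOR gates with x3
    stored = lut[addr]
    # ADD/SUB-SCALE cell: add for x3 = 1, subtract for x3 = 0, then halve.
    total = 15 * a + stored if x3 else 15 * a - stored
    product = total >> 1                # total is always even, so the shift is exact
    return -product if x4 else product  # 2's complement unit and 2:1 MUX


apc_lut = build_apc_lut(13)  # illustrative coefficient (assumed)
print(apc_multiply(apc_lut, 13, 0b01010))  # X = +10
```

Because the stored word at address a is (15 - 2a)A, the sum or difference with 15A is always an even multiple of A, which is why the final right shift loses no information.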
The structure and function of the LUT-based multiplier for L = 5 using the sign-bit exclusion and APC techniques, assuming both the multiplicands in sign-magnitude representation, is shown in Fig. 3(b). It stores the magnitudes of the product values given in the first eight rows of the fifth column of Table II. The address mapping circuit in this case is the same as that of Fig. 3(a). The LUT output is added to or subtracted from 15|A|, for $x_3 = 1$ or 0, respectively, by an ADD/SUB-SCALE cell according to (3). The sign-bit s of the product word is generated by a 2-input XOR gate as in the case of Fig. 2(a).
In the next section, we show that the APC technique could be used for efficient implementation of high-precision LUT-based multiplications by suitable input operand decomposition and summation of the resulting partial sums.
[Fig. 3 diagram. (a) An address mapping stage on x0 to x3 drives a 3-to-8 line decoder, which selects one word of an 8-word LUT holding the (W+4)-bit product words A, 3A, 5A, ..., 15A in 2's complement representation; an ADD/SUB-SCALE cell combines the LUT output with 15A; a sign-determination stage (2's complement unit and 2:1 MUX controlled by x4 = sX) delivers the product value A·X in 2's complement representation. (b) The same address mapping and 3-to-8 line decoder select one word of an 8-word LUT holding the (W+3)-bit magnitudes |A|, 3|A|, 5|A|, ..., 15|A|; an ADD/SUB-SCALE cell follows; the sign-determination stage forms the sign bit from sA and x4 = sX, giving the (W+4)-bit product A·X in sign-magnitude representation.]
Fig. 3. LUT design for multiplication of W-bit fixed coefficient A with 5-bit input, X = x4 x3 x2 x1 x0 using APC. (a) LUT-based multiplier with constant multiplicand in 2's complement representation. (b) LUT-based multiplier with constant multiplicand in sign-magnitude representation.
III. HIGH-PRECISION MEMORY-BASED MULTIPLICATION
BY INPUT-OPERAND DECOMPOSITION
Although the area of the memory-core of the LUT-based multiplier proposed in Section II is nearly half that of the conventional LUT-multiplier, the memory-size still increases exponentially with the input word-length. It is, therefore, not an area-efficient choice to extend the proposed LUT-multiplier of Section II for large input words. When the width of the input multiplicand X is large, it could however be decomposed into a certain number of segments or sub-words, and the partial products pertaining to the different sub-words could be shift-added to obtain the desired product word.
TABLE III
AREA AND TIME COMPLEXITIES OF LUT-MULTIPLIERS FOR L = 16

| multiplier design | W = 8: area | W = 8: delay | W = 8: ADP | W = 8: excess ADP | W = 16: area | W = 16: delay | W = 16: ADP | W = 16: excess ADP |
|---|---|---|---|---|---|---|---|---|
| conventional LUT multiplier | 3688.9 | 3.24 | 11952.0 | 61.02 % | 5866.4 | 3.99 | 23406.9 | 52.58 % |
| odd-multiple LUT-multiplier [1] | 2870.4 | 3.32 | 9529.7 | 28.39 % | 4608.3 | 4.13 | 19032.3 | 24.07 % |
| proposed LUT-multiplier | 2348.9 | 3.16 | 7422.5 | | 3893.5 | 3.94 | 15340.4 | |

Area is in $\mu\mathrm{m}^2$. Delay refers to the average multiplication-time in ns. ADP stands for area-delay product. ADP saving corresponds to the saving over conventional multiplier.
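The ADP and excess-ADP columns of Table III can be re-derived from the area and delay figures. The short check below assumes that ADP is simply area times delay and that excess ADP is measured relative to the proposed design; both readings are inferred from the tabulated numbers rather than stated explicitly, and the dictionary layout and function names are hypothetical.

```python
# (area, delay) pairs copied from Table III, keyed by coefficient width W.
designs = {
    "conventional": {8: (3688.9, 3.24), 16: (5866.4, 3.99)},
    "odd-multiple": {8: (2870.4, 3.32), 16: (4608.3, 4.13)},
    "proposed": {8: (2348.9, 3.16), 16: (3893.5, 3.94)},
}


def adp(area, delay):
    """Area-delay product."""
    return area * delay


def excess_adp_percent(name, w):
    """Extra ADP of a design relative to the proposed one, in percent."""
    reference = adp(*designs["proposed"][w])
    return 100.0 * (adp(*designs[name][w]) / reference - 1.0)


def mean_excess(name):
    """Average of the excess ADP over W = 8 and W = 16."""
    return sum(excess_adp_percent(name, w) for w in (8, 16)) / 2


for name in ("conventional", "odd-multiple"):
    print(name, round(mean_excess(name), 1))
```

Under these assumptions the two averages agree with the 56.8% and 26.2% figures quoted in the abstract.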
Let the magnitude part of the input multiplicand X be decomposed into P segments (or sub-words), $\{X_1\ X_2\ \cdots\ X_P\}$, where $X_i = \{x_{(i-1)S}\ x_{(i-1)S+1}\ \cdots\ x_{iS-1}\}$ is an S-bit sub-word for $1 \le i \le P-1$, and $X_P = \{x_{(P-1)S}\ x_{(P-1)S+1}\ \cdots\ x_{(P-1)S+S'-1}\}$ is the last sub-word of $S'$ bits, where $S' \le S$, $x_i$ is the $(i+1)$-th bit of X, and P is an integer for which the word-length of $|X|$ is not more than PS (i.e., $L - 1 \le PS$). Without loss of generality, we can however assume the width of the input magnitude to be $L - 1 = PS$. The product word $C = A \cdot X$ thus can be written as the sum of partial products generated by multiplying A with the sub-words of X as
$$C = s \odot \sum_{i=1}^{P} 2^{S(i-1)} \cdot C_i \qquad (4)$$
The partial product $C_i = A \cdot X_i$ for $1 \le i \le P$ can have a maximum width of $(W+S-1)$ bits or $(W+S)$ bits when A is in sign-magnitude or 2's complement representation, respectively. Here $\odot$ denotes the 2's complement operation in the case of 2's complement representation of A if $s = s_X = 1$, or concatenation of $s = s_A \oplus s_X$ to the MSB of the estimated sum in the case of sign-magnitude representation of A.
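The shift-add summation of (4) can be sketched as below for the magnitude part only, leaving the sign handling aside. The names are hypothetical, the sub-word index is 0-based (so index i here corresponds to $X_{i+1}$ in the text), and a single table is reused for all segments purely for brevity, whereas a hardware design would use one LUT per segment.

```python
def split_subwords(mag_x, s, p):
    """Return the p sub-words of s bits each, least significant first."""
    mask = (1 << s) - 1
    return [(mag_x >> (s * i)) & mask for i in range(p)]


def decomposed_multiply(a, mag_x, s, p):
    """Compute A*|X| as the sum over i of 2**(s*i) * (A * X_i)."""
    lut = [a * v for v in range(1 << s)]   # 2**s words instead of 2**(s*p)
    total = 0
    for i, x_i in enumerate(split_subwords(mag_x, s, p)):
        total += lut[x_i] << (s * i)       # shift-add of the partial product
    return total


# Illustrative values (assumed): A = -37, S = 4, P = 4.
print(split_subwords(0xBEEF, 4, 4), decomposed_multiply(-37, 0xBEEF, 4, 4))
```

With S = 4 and P = 4 the sketch reads four 16-word tables instead of one table of 65536 words, at the cost of three additions.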
For L = 16, let us take S = 4 and P = 4. The input multiplicand X, therefore, can be decomposed into 4 sub-words $X_1$, $X_2$, $X_3$ and $X_4$, where $X_2$, $X_3$ and $X_4$ contain the 12 more significant bits of X, and each of them is a 4-bit word. Let $X_1$ contain the less significant 3 bits of $|X|$. Three partial products $C_2 = A \cdot X_2$, $C_3 = A \cdot X_3$ and $C_4 = A \cdot X_4$ are computed by three LUT-based multipliers using APC. Since $X_1$ has three bits only, $C_1 = A \cdot X_1$ can be computed by a conventional 8-word LUT. Moreover, the fixed offset value $V = [120A \times (1 + 2^4 + 2^8)]$ due to the other three LUTs can also be added to $C_1$ and stored in this LUT, such that LUT-1 stores the values $(A \times X_1 + V)$. The proposed multiplier for L = 16 for any given value of coefficient size W is shown in Fig. 4. The least significant 3-bit sub-word $X_1$ of X is fed as input to LUT-1, and the next more significant sub-words $X_2$, $X_3$ and $X_4$ are fed as inputs to LUT-2, LUT-3 and LUT-4, respectively. The structure of each of LUT-2, LUT-3 and LUT-4 is similar to the one shown in Fig. 3 [except that they do not have the ADD/SUB-SCALE cell]. The signs of the LUT outputs are modified according to the value of the most-significant bit of the corresponding sub-word $X_i$ for i = 2, 3 and 4; and the LUT outputs are appropriately shifted and added together as shown in Fig. 4. The sign of the product is finally modified depending on the sign of X by a sign-determination circuit (as shown in Fig. 3).
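One way to check the offset arithmetic of this 16-bit arrangement is the behavioral sketch below. It rests on an assumption that is not recoverable from the text: every term is kept at twice its weight and the sum is halved once at the end, so that LUT-1 holds $2 A X_1 + V$ in this model. Under that scaling the accumulated offset of the three APC LUTs equals the value $V = 120A(1 + 2^4 + 2^8)$ quoted above. The function names are hypothetical, and the exact shift amounts of Fig. 4 may differ.

```python
def build_apc_lut(a):
    """Eight APC words: address 0..7 holds 15A, 13A, ..., A (Table II)."""
    return [(15 - 2 * addr) * a for addr in range(8)]


def signed_apc_output(lut, sub):
    """4-bit sub-word -> +stored or -stored word (no ADD/SUB-SCALE cell)."""
    msb = (sub >> 3) & 1
    addr = (sub & 0b111) ^ (0b111 if msb else 0)
    return lut[addr] if msb else -lut[addr]


def multiply_16bit(a, x):
    """x is a 16-bit sign-magnitude word (x15 is the sign bit). Returns A*X."""
    s_x = (x >> 15) & 1
    mag = x & 0x7FFF
    x1 = mag & 0b111                                        # 3-bit sub-word X1
    subs = [(mag >> (3 + 4 * k)) & 0xF for k in range(3)]   # X2, X3, X4
    v = 120 * a * (1 + 2**4 + 2**8)                         # offset V from the text
    lut1 = [2 * a * w + v for w in range(8)]                # ASSUMPTION: doubled scale
    apc = build_apc_lut(a)
    total = lut1[x1]
    for k, sub in enumerate(subs):
        total += signed_apc_output(apc, sub) << (3 + 4 * k)
    product = total >> 1                                    # single final halving
    return -product if s_x else product


print(multiply_16bit(77, 0x1234), 77 * 0x1234)
```

The identity behind the sketch is $2 A X_i = 15A \pm (\text{stored word})$ for each 4-bit sub-word, so the three constant terms $15A \cdot 2^3$, $15A \cdot 2^7$ and $15A \cdot 2^{11}$ add up to V and can be folded into LUT-1.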
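[Fig. 4 diagram: the 3-bit sub-word X1 addresses LUT1, whose output is V + A·X1; the 4-bit sub-words X2, X3 and X4 address LUT2, LUT3 and LUT4, whose (W+4)-bit outputs are labeled A·X2, A·X3 and A·X4; three adders (with widths W+8 and W+15 marked) combine the LUT outputs into the (W+15)-bit value A·|X|; the sign bit sX = x15 then determines the output A·X.]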