Applied Mathematics and Computation 166 (2005) 291–298
www.elsevier.com/locate/amc
An algorithm for multiple-precision floating-point multiplication
Daisuke Takahashi
Institute of Information Sciences and Electronics, University of Tsukuba,
1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan
Abstract
We present an algorithm for multiple-precision floating-point multiplication. Conventional algorithms based on the fast Fourier transform (FFT) multiply two n-bit numbers to obtain a 2n-bit result. In multiple-precision floating-point multiplication, however, we need only a result whose precision equals that of the multiple-precision floating-point operands. We show that the overall arithmetic operation count of FFT-based multiple-precision floating-point multiplication is reduced by decomposing the full-length multiplication into shorter-length multiplications.
© 2004 Elsevier Inc. All rights reserved.
Keywords: Program derivation; Multiplication; Multiple-precision arithmetic; Fast Fourier transform
0096-3003/$ - see front matter
doi:10.1016/j.amc.2004.04.034
E-mail address: [email protected]
1. Introduction
Many multiple-precision multiplication algorithms have been studied extensively [1–6]. Multiplying two n-bit numbers with the ordinary multiplication algorithm requires \(O(n^2)\) bit operations [1]. Karatsuba's algorithm [2] is known to reduce the number of operations to \(O(n^{\log_2 3})\).
Multiple-precision multiplication of n-bit numbers can be performed in \(O(n \log n \log \log n)\) bit operations by the Schönhage–Strassen algorithm [3], which is based on the fast Fourier transform (FFT) [7]. However, the Schönhage–Strassen algorithm may not be able to take advantage of computers with fast floating-point hardware, and it requires a binary-to-decimal radix conversion of the final result. In his calculation of π to 29 million decimal digits, Bailey computed the discrete Fourier transform (DFT) modulo three primes and reconstructed the result with the Chinese Remainder Theorem [8].
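The Chinese Remainder step mentioned above can be illustrated with a small sketch. It is not Bailey's implementation: each residue sequence is produced here by a direct convolution modulo a prime, standing in for a modular DFT, and the three primes are common NTT-friendly values chosen only for this example (an assumption, not taken from [8]). Each coefficient is recovered exactly as long as it is smaller than the product of the three primes.

```python
# Sketch: exact convolution from residues modulo three primes (CRT reconstruction).
# The primes are illustrative NTT-friendly choices, assumed for this example only.
PRIMES = (998244353, 167772161, 469762049)


def convolve_mod(x, y, p):
    """Linear convolution of digit sequences x and y, reduced modulo p.

    Computed directly in O(len(x) * len(y)) time; a number-theoretic DFT
    would produce the same residues faster.
    """
    z = [0] * (len(x) + len(y) - 1)
    for i, a in enumerate(x):
        for j, b in enumerate(y):
            z[i + j] = (z[i + j] + a * b) % p
    return z


def crt(residues, moduli):
    """Smallest non-negative r with r = residues[k] (mod moduli[k]) for every k."""
    r, m = 0, 1
    for a, p in zip(residues, moduli):
        # Choose t so that r + m*t = a (mod p); pow(m, -1, p) needs Python 3.8+.
        t = ((a - r) * pow(m, -1, p)) % p
        r += m * t
        m *= p
    return r


def convolve_three_primes(x, y):
    """Exact convolution, valid while every coefficient is below prod(PRIMES)."""
    per_prime = [convolve_mod(x, y, p) for p in PRIMES]
    return [crt(column, PRIMES) for column in zip(*per_prime)]


print(convolve_three_primes([1, 2, 3], [4, 5]))  # [4, 13, 22, 15]
```

The incremental CRT keeps the invariant that r satisfies all congruences seen so far, so no large modular inverse of the full product is needed.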
In addition, multiple-precision multiplication using a floating-point real FFT is another well-known fast multiplication algorithm [9,10].
These conventional FFT-based multiplication algorithms multiply two n-bit numbers to obtain a 2n-bit result. In multiple-precision floating-point multiplication, however, we need only a result whose precision equals that of the multiple-precision floating-point operands. We call this result the "short product" here. Algorithms for multiple-precision floating-point multiplication are given in [4,6]; they use the ordinary \(O(n^2)\) multiplication algorithm or Karatsuba's \(O(n^{\log_2 3})\) algorithm. However, for multiple-precision multiplication of several thousand decimal digits or more, FFT-based multiplication is the fastest. We show that the overall arithmetic operation count of FFT-based multiple-precision floating-point multiplication is reduced by decomposing the full-length multiplication into shorter-length multiplications.
For simplicity, we use in this paper a short product that does not provide exact rounding. However, it is not hard to extend multiple-precision floating-point multiplication to exact rounding [4,6]. We also restrict ourselves to computing the mantissa of floating-point numbers.
2. Multiple-precision multiplication based on FFT
We deal here with real FFT-based multiple-precision multiplication.
Let us consider the product Z of two integers X and Y of length n in base B:
\[ X = \sum_{i=0}^{2n-1} x_i B^i, \qquad Y = \sum_{i=0}^{2n-1} y_i B^i, \]
where \(0 \le x_i < B\) and \(0 \le y_i < B\) for \(0 \le i \le n-1\), and \(x_i = y_i = 0\) for \(i \ge n\).
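Then,
\[ Z = X \cdot Y = \left(\sum_{i=0}^{2n-1} x_i B^i\right)\left(\sum_{j=0}^{2n-1} y_j B^j\right) = \sum_{k=0}^{2n-1}\left(\sum_{j=0}^{2n-1} x_j y_{k-j}\right) B^k = \sum_{k=0}^{2n-1} z_k B^k. \]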
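Thus, \(z_k\) can be written as follows:
\[ z_k = C_k(x, y) = \sum_{j=0}^{2n-1} x_j y_{k-j} \quad \text{for } 0 \le k \le 2n-2, \qquad z_{2n-1} = 0, \]
where the subscript \(k - j\) is interpreted as \(k - j + 2n\) if \(k - j\) is negative.
Since \(0 \le x_j, y_j \le B - 1\) and at most n terms of the sum are nonzero, each \(z_k\) is at most \(n(B-1)^2 < nB^2\). Each \(z_k\) must therefore be normalized to \(0 \le z_k < B\) by carry propagation.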
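As a check on the definitions above, the convolution and the normalization step can be transcribed directly. This is an O(n^2) sketch for illustration only; digits are stored least significant first, and the helper names are hypothetical.

```python
def cyclic_convolution(x, y):
    """z_k = sum_j x_j * y_{k-j}, with the index k - j taken modulo len(x)."""
    N = len(x)
    # Python's % maps a negative k - j to k - j + N, as in the text.
    return [sum(x[j] * y[(k - j) % N] for j in range(N)) for k in range(N)]


def normalize(z, B):
    """Propagate carries so that every digit satisfies 0 <= z_k < B."""
    digits, carry = [], 0
    for zk in z:
        carry += zk
        digits.append(carry % B)
        carry //= B
    if carry:
        raise ValueError("result does not fit in len(z) digits")
    return digits


def multiply(x_digits, y_digits, B):
    """Product of two n-digit base-B numbers, least significant digit first."""
    n = len(x_digits)
    x = x_digits + [0] * n   # x_i = 0 for i >= n
    y = y_digits + [0] * n   # y_i = 0 for i >= n
    return normalize(cyclic_convolution(x, y), B)


# 1234 * 5678 = 7006652, printed least significant digit first:
print(multiply([4, 3, 2, 1], [8, 7, 6, 5], 10))  # [2, 5, 6, 6, 0, 0, 7, 0]
```

Because both inputs are zero-padded to 2n digits, the wrap-around in the cyclic index never mixes nonzero terms, so the cyclic convolution equals the ordinary product.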
The DFT and the inverse DFT are given by
\[ F_k(x) = \sum_{j=0}^{N-1} x_j e^{-2\pi i jk/N}, \tag{1} \]
\[ F_k^{-1}(x) = \frac{1}{N} \sum_{j=0}^{N-1} x_j e^{2\pi i jk/N}, \tag{2} \]
where \(N = 2n\) and \(i = \sqrt{-1}\).
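Equations (1) and (2) can be checked numerically by direct O(N^2) evaluation. The sketch below uses only Python's cmath module and hypothetical helper names; it confirms that (2) inverts (1) and that the inverse transform of the pointwise product of the transforms reproduces the cyclic convolution of two zero-padded digit sequences.

```python
import cmath


def dft(x, sign=-1):
    """Direct evaluation of (1) (sign=-1) or of the unscaled sum in (2) (sign=+1)."""
    N = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * cmath.pi * j * k / N) for j in range(N))
            for k in range(N)]


def idft(X):
    """Inverse DFT, eq. (2): positive-exponent sum scaled by 1/N."""
    N = len(X)
    return [v / N for v in dft(X, sign=+1)]


def cyclic_convolution(x, y):
    N = len(x)
    return [sum(x[j] * y[(k - j) % N] for j in range(N)) for k in range(N)]


x = [3, 1, 4, 1, 0, 0, 0, 0]   # n = 4 digits (1413, least significant first), N = 2n
y = [2, 7, 1, 8, 0, 0, 0, 0]   # 8172
via_dft = idft([a * b for a, b in zip(dft(x), dft(y))])
print([round(v.real) for v in via_dft] == cyclic_convolution(x, y))  # True
```

The rounded real parts are the coefficients \(z_k\); after carry propagation they give 1413 · 8172 = 11547036.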
Let \(C(x, y)\) denote the convolution of the sequences x and y. The convolution theorem for discrete sequences states that
\[ F[C(x, y)] = F(x)F(y), \quad\text{that is,}\quad C(x, y) = F^{-1}[F(x)F(y)]. \]
We can use the FFT to compute the DFTs in (1) and (2). The arithmetic operation count of the N-point FFT is \(O(N \log N)\) [7].
The data length N is assumed to be a power of two. Let the arithmetic operation count of the \(N (= 2n)\)-point FFT be \(c_{\mathrm{fft1}} N \log_2 N + c_{\mathrm{fft2}} N\), and let the arithmetic operation count for the point-by-point product of the N-point Fourier coefficients be \(c_{\mathrm{prod}} N\).
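In the conventional FFT-based multiplication algorithm, two N-point forward FFTs and one N-point inverse FFT are needed to multiply two n-bit numbers. Let \(c_{\mathrm{norm}} N\) denote the arithmetic operation count for normalizing the result (the term \(T_{\mathrm{norm}}\) in (7) and (12)). The overall arithmetic operation count \(T_{\mathrm{conv}}\) of multiple-precision floating-point multiplication is then
\[ T_{\mathrm{conv}} = 3\left(c_{\mathrm{fft1}} N \log_2 N + c_{\mathrm{fft2}} N\right) + c_{\mathrm{prod}} N + c_{\mathrm{norm}} N. \tag{3} \]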
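The conventional algorithm counted in (3) can be sketched as follows, under explicit assumptions: a textbook recursive radix-2 complex FFT written with Python's cmath module stands in for the real split-radix FFT assumed later in the cost model, and the function names are hypothetical. The sketch performs two forward transforms, one point-by-point product, one inverse transform and one normalization (carry) pass.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.

    Forward uses exp(-2*pi*i*jk/N) as in (1); inverse uses exp(+2*pi*i*jk/N)
    without the 1/N factor of (2), which the caller applies.
    """
    N = len(a)
    if N == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out


def fft_multiply(x_digits, y_digits, B):
    """Full 2n-digit product of two n-digit base-B numbers (least significant first)."""
    n = len(x_digits)
    N = 2 * n  # n is assumed to be a power of two
    fx = fft([complex(d) for d in x_digits + [0] * n])        # forward FFT 1
    fy = fft([complex(d) for d in y_digits + [0] * n])        # forward FFT 2
    z = fft([a * b for a, b in zip(fx, fy)], inverse=True)   # product, inverse FFT
    digits, carry = [], 0
    for v in z:                                              # normalization
        carry += round(v.real / N)
        digits.append(carry % B)
        carry //= B
    return digits


print(fft_multiply([4, 3, 2, 1], [8, 7, 6, 5], 10))  # [2, 5, 6, 6, 0, 0, 7, 0]
```

Rounding each coefficient to the nearest integer is correct only while the floating-point error stays below 0.5, which in practice limits the admissible combinations of n and B.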
3. Multiple-precision floating-point multiplication using the splitting algorithm
Let us consider multiple-precision floating-point multiplication of n-bit numbers based on FFT-based multiplication. First, we show FFT-based multiple-precision floating-point multiplication with 2-way splitting of X and Y, both of which are n-bit numbers in base B. We deal here with multiple-precision floating-point multiplication in the sense of multiplying two n-bit numbers to obtain an n-bit result:
\[ X = x_1 B^{n/2} + x_0 \quad (0 \le x_i < B^{n/2},\ 0 \le i \le 1), \]
\[ Y = y_1 B^{n/2} + y_0 \quad (0 \le y_j < B^{n/2},\ 0 \le j \le 1). \]
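The short product \(Z \approx X \cdot Y\), which retains only the upper half of the full product, is as follows:
\[ X \cdot Y \approx Z = F^{-1}[F(x_1)F(y_1)]\,B^{n} + F^{-1}[F(x_1)F(y_0) + F(x_0)F(y_1)]\,B^{n/2} = \left(z_1 B^{n/2} + z_0\right) B^{n/2}. \]
The omitted term is \(x_0 y_0 < B^n\).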
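The identity above can be checked with exact integers. In this sketch (hypothetical names; Python integers stand in for the FFT-based partial products) the only omitted term is x0·y0, which is below B^n. The upper halves of Z and X·Y therefore differ by at most one unit in the last place, which is why the short product does not provide exact rounding.

```python
def short_product_2way(X, Y, B, n):
    """Z = x1*y1*B^n + (x1*y0 + x0*y1)*B^(n/2); the product x0*y0 is dropped."""
    half = B ** (n // 2)          # n is assumed even
    x1, x0 = divmod(X, half)
    y1, y0 = divmod(Y, half)
    return x1 * y1 * B ** n + (x1 * y0 + x0 * y1) * half


B, n = 10, 8
X, Y = 31415926, 27182818         # two n-digit mantissas
Z = short_product_2way(X, Y, B, n)
error = X * Y - Z                 # exactly x0*y0 = 5926 * 2818
assert 0 <= error < B ** n
assert X * Y // B ** n - Z // B ** n in (0, 1)   # upper halves differ by at most 1
```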
The illustration of the short product with the 2-way splitting algorithm is
shown in Fig. 1.
When the partial products \(x_i y_j\) (\(0 \le i \le 1\), \(0 \le j \le 1\), \(1 \le i + j \le 2\)) are computed, the forward FFTs of \(x_i\) (\(0 \le i \le 1\)) and \(y_j\) (\(0 \le j \le 1\)) are required.
If we preserve the Fourier coefficients \(F(x_i)\) (\(0 \le i \le 1\)) and \(F(y_j)\) (\(0 \le j \le 1\)), the total number of forward FFTs is 4.
If the inverse FFT of each partial product were computed separately and the results at the same position summed afterwards, the total number of inverse FFTs would be \(3 \;(= \sum_{i=1}^{2} i)\). However, we can utilize the linearity of the DFT: we first sum the Fourier coefficients of the partial products that belong to the same position, \(F(x_1)F(y_0) + F(x_0)F(y_1)\), and then compute the inverse FFTs of these sums. This reduces the number of inverse FFTs to 2.
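The saving obtained from linearity can be made concrete. The sketch below (hypothetical names; a textbook radix-2 complex FFT from the standard library stands in for the real FFT) computes each of the four forward transforms once, forms F(x1)F(y0) + F(x0)F(y1) in the Fourier domain, and therefore needs only two inverse FFTs of length N/2 = n. The coefficients are not normalized digit by digit; Python's arbitrary-precision integers absorb the carries when the sum of z_k·B^k is evaluated.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 complex FFT (unscaled); len(a) must be a power of two."""
    N = len(a)
    if N == 1:
        return list(a)
    even, odd = fft(a[0::2], inverse), fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / N) * odd[k]
        out[k], out[k + N // 2] = even[k] + t, even[k] - t
    return out


def short_product_2way_fft(X, Y, B, n):
    """x1*y1*B^n + (x1*y0 + x0*y1)*B^(n/2) using 4 forward and 2 inverse FFTs."""
    h = n // 2                      # digits per half; n is assumed a power of two
    x1, x0 = divmod(X, B ** h)
    y1, y0 = divmod(Y, B ** h)

    def forward(v):                 # h digits, zero-padded to length n = N/2
        digits = [(v // B ** i) % B for i in range(h)]
        return fft([complex(d) for d in digits + [0] * h])

    fx1, fx0, fy1, fy0 = forward(x1), forward(x0), forward(y1), forward(y0)
    top = [a * b for a, b in zip(fx1, fy1)]
    # Linearity: add the two partial products' coefficients before one inverse FFT.
    mid = [a * b + c * d for a, b, c, d in zip(fx1, fy0, fx0, fy1)]

    def inverse_value(coeffs):      # inverse FFT, round, then sum z_k * B^k
        z = fft(coeffs, inverse=True)
        return sum(round(v.real / n) * B ** k for k, v in enumerate(z))

    return inverse_value(top) * B ** n + inverse_value(mid) * B ** h
```

For example, with B = 10 and n = 16 the result equals x1·y1·B^16 + (x1·y0 + x0·y1)·B^8 computed with plain integers.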
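We now consider the arithmetic operation count of multiple-precision multiplication with 2-way splitting. The total arithmetic operation count \(T_{\mathrm{fft}}\) for the six (N/2)-point FFTs (four forward and two inverse) is as follows:
\[ T_{\mathrm{fft}} = 6\left(c_{\mathrm{fft1}} \frac{N}{2} \log_2 \frac{N}{2} + c_{\mathrm{fft2}} \frac{N}{2}\right) = 3\left\{c_{\mathrm{fft1}} N (\log_2 N - 1) + c_{\mathrm{fft2}} N\right\}. \tag{4} \]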
Fig. 1. Illustration of the short product with the 2-way splitting algorithm.
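The total arithmetic operation count \(T_{\mathrm{prod}}\) for the point-by-point products of the Fourier coefficients is as follows:
\[ T_{\mathrm{prod}} = \left(\sum_{i=1}^{2} i\right) c_{\mathrm{prod}} \frac{N}{2} = \frac{3}{2} c_{\mathrm{prod}} N. \tag{5} \]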
Let the arithmetic operation count for the partial sum of the N-point coefficients be \(c_{\mathrm{sum}} N\). Then, the total arithmetic operation count \(T_{\mathrm{sum}}\) for the partial sums of the Fourier coefficients at the same place is as follows:
\[ T_{\mathrm{sum}} = \left(\sum_{i=1}^{1} i\right) c_{\mathrm{sum}} \frac{N}{2} = \frac{1}{2} c_{\mathrm{sum}} N. \tag{6} \]
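From (3)–(6), the overall arithmetic operation count \(T_{\mathrm{split}}\) of multiple-precision floating-point multiplication with 2-way splitting, where \(T_{\mathrm{norm}} = c_{\mathrm{norm}} N\) is the normalization cost, is as follows:
\[ \begin{aligned} T_{\mathrm{split}} &= T_{\mathrm{fft}} + T_{\mathrm{prod}} + T_{\mathrm{sum}} + T_{\mathrm{norm}} \\ &= 3\left\{c_{\mathrm{fft1}} N (\log_2 N - 1) + c_{\mathrm{fft2}} N\right\} + \frac{3}{2} c_{\mathrm{prod}} N + \frac{1}{2} c_{\mathrm{sum}} N + c_{\mathrm{norm}} N \\ &= T_{\mathrm{conv}} + \left(-3 c_{\mathrm{fft1}} + \frac{1}{2} c_{\mathrm{prod}} + \frac{1}{2} c_{\mathrm{sum}}\right) N. \end{aligned} \tag{7} \]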
It follows from (7) that the overall arithmetic operation count of multiple-precision floating-point multiplication with 2-way splitting is less than that of the conventional algorithm when
\[ -3 c_{\mathrm{fft1}} + \frac{1}{2} c_{\mathrm{prod}} + \frac{1}{2} c_{\mathrm{sum}} < 0. \tag{8} \]
From (8), since \(c_{\mathrm{fft1}} > 0\), we conclude that the 2-way splitting algorithm is better than the conventional algorithm if \((c_{\mathrm{prod}} + c_{\mathrm{sum}})/(6 c_{\mathrm{fft1}}) < 1\).
In the same way, we can show FFT-based multiple-precision multiplication with d-way splitting. The d-way splitting algorithm is shown in Fig. 2.
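For d-way splitting, the kept partial products are the x_i·y_j with i + j ≥ d − 1, that is, the d most significant diagonals; this reading follows from the counts in (10) and (11). The integer-level sketch below, with hypothetical names, checks that the dropped diagonals account exactly for the difference from the full product.

```python
def split(v, B, m, d):
    """Base-B^m blocks of v, least significant first: v = sum v_i * B^(m*i)."""
    return [(v // B ** (m * i)) % B ** m for i in range(d)]


def short_product_dway(X, Y, B, n, d):
    """Sum of x_i*y_j*B^(m*(i+j)) over i + j >= d - 1, with m = n/d."""
    m = n // d                      # n is assumed divisible by d
    x, y = split(X, B, m, d), split(Y, B, m, d)
    return sum(x[i] * y[j] * B ** (m * (i + j))
               for i in range(d) for j in range(d) if i + j >= d - 1)


def dropped_terms(X, Y, B, n, d):
    """The partial products the short product omits (i + j < d - 1)."""
    m = n // d
    x, y = split(X, B, m, d), split(Y, B, m, d)
    return sum(x[i] * y[j] * B ** (m * (i + j))
               for i in range(d) for j in range(d) if i + j < d - 1)


X, Y = 3141592653589793, 2718281828459045   # 16-digit mantissas, B = 10
for d in (1, 2, 4, 8, 16):
    assert short_product_dway(X, Y, 10, 16, d) + dropped_terms(X, Y, 10, 16, d) == X * Y
```

For d = 1 nothing is dropped and the full product is returned; larger d trades a slightly larger truncation error for shorter transforms.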
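We consider the arithmetic operation count of multiple-precision floating-point multiplication with d-way splitting. The total arithmetic operation count \(T_{\mathrm{fft}}\) for the 3d (N/d)-point FFTs (2d forward and d inverse) is as follows:
\[ T_{\mathrm{fft}} = 3d\left(c_{\mathrm{fft1}} \frac{N}{d} \log_2 \frac{N}{d} + c_{\mathrm{fft2}} \frac{N}{d}\right) = 3\left\{c_{\mathrm{fft1}} N (\log_2 N - \log_2 d) + c_{\mathrm{fft2}} N\right\}. \tag{9} \]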
The total arithmetic operation count \(T_{\mathrm{prod}}\) for the point-by-point products of the Fourier coefficients is as follows:
\[ T_{\mathrm{prod}} = \left(\sum_{i=1}^{d} i\right) c_{\mathrm{prod}} \frac{N}{d} = \frac{d+1}{2} c_{\mathrm{prod}} N. \tag{10} \]
Fig. 2. The d-way splitting algorithm.
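Then, the total arithmetic operation count \(T_{\mathrm{sum}}\) for the partial sums of the Fourier coefficients at the same place is as follows:
\[ T_{\mathrm{sum}} = \left(\sum_{i=1}^{d-1} i\right) c_{\mathrm{sum}} \frac{N}{d} = \frac{d-1}{2} c_{\mathrm{sum}} N. \tag{11} \]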
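From (9)–(11), the overall arithmetic operation count \(T_{\mathrm{split}}\) of multiple-precision floating-point multiplication with d-way splitting is as follows:
\[ \begin{aligned} T_{\mathrm{split}} &= T_{\mathrm{fft}} + T_{\mathrm{prod}} + T_{\mathrm{sum}} + T_{\mathrm{norm}} \\ &= 3\left\{c_{\mathrm{fft1}} N (\log_2 N - \log_2 d) + c_{\mathrm{fft2}} N\right\} + \frac{d+1}{2} c_{\mathrm{prod}} N + \frac{d-1}{2} c_{\mathrm{sum}} N + c_{\mathrm{norm}} N \\ &= T_{\mathrm{conv}} + \left(-3 c_{\mathrm{fft1}} \log_2 d + \frac{d-1}{2} c_{\mathrm{prod}} + \frac{d-1}{2} c_{\mathrm{sum}}\right) N. \end{aligned} \tag{12} \]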
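Equations (9)–(12) can be cross-checked by counting operations explicitly. The sketch below (hypothetical names; arbitrary test constants) enumerates the kept pairs (i, j) with i + j ≥ d − 1, charges one inverse FFT per diagonal as the linearity argument allows, and compares the resulting total with the closed form (12). It includes the normalization term c_norm·N in T_conv, which the identity in (12) requires.

```python
from math import isclose, log2


def counts(d):
    """(forward FFTs, inverse FFTs, pointwise products, coefficient additions)."""
    pairs = [(i, j) for i in range(d) for j in range(d) if i + j >= d - 1]
    diagonals = {i + j for i, j in pairs}
    n_forward = 2 * d                 # F(x_i) and F(y_j), each computed once
    n_inverse = len(diagonals)        # one inverse FFT per diagonal (linearity)
    n_products = len(pairs)
    n_sums = n_products - n_inverse   # arrays added within each diagonal
    return n_forward, n_inverse, n_products, n_sums


def t_split(N, d, c):
    """Total count from explicit operation counts, each on N/d points."""
    nf, ni, npr, ns = counts(d)
    M = N / d
    t_fft = (nf + ni) * (c["fft1"] * M * log2(M) + c["fft2"] * M)
    return t_fft + npr * c["prod"] * M + ns * c["sum"] * M + c["norm"] * N


def t_conv(N, c):
    """Eq. (3), including the normalization term c_norm * N."""
    return 3 * (c["fft1"] * N * log2(N) + c["fft2"] * N) + (c["prod"] + c["norm"]) * N


def t_split_closed(N, d, c):
    """Closed form (12)."""
    return t_conv(N, c) + (-3 * c["fft1"] * log2(d)
                           + (d - 1) / 2 * (c["prod"] + c["sum"])) * N


EXAMPLE_C = {"fft1": 2.5, "fft2": 1.0, "prod": 7.0, "sum": 3.0, "norm": 0.5}  # arbitrary
assert all(isclose(t_split(N, d, EXAMPLE_C), t_split_closed(N, d, EXAMPLE_C))
           for N in (2 ** 10, 2 ** 16) for d in (1, 2, 3, 4, 8, 16))
```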
It follows from (12) that the overall arithmetic operation count of the splitting algorithm is less than that of the conventional algorithm when
\[ -3 c_{\mathrm{fft1}} \log_2 d + \frac{d-1}{2} c_{\mathrm{prod}} + \frac{d-1}{2} c_{\mathrm{sum}} < 0. \tag{13} \]
From (13), we conclude that for \(d \ge 2\) the d-way splitting algorithm is better than the conventional algorithm if
\[ \frac{c_{\mathrm{prod}} + c_{\mathrm{sum}}}{6 c_{\mathrm{fft1}}} < \frac{\log_2 d}{d - 1}, \]
where d is independent of N.
In order to evaluate the effectiveness of the d-way splitting algorithm, we
compare the arithmetic operation counts between the conventional algorithm
and the splitting algorithm.
[Figure: \(T_{\mathrm{conv}}/T_{\mathrm{split}}\) (vertical axis, 0 to 1.2) plotted against d = 1, 2, 4, 8, 16, 32, 64 (horizontal axis) for \(N = 2^8, 2^{12}, 2^{16}, 2^{20}\).]
Fig. 3. Ratio of arithmetic operation count of splitting algorithm and conventional algorithm.
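Prior to the comparison, we determine the values of \(c_{\mathrm{fft1}}\), \(c_{\mathrm{fft2}}\), \(c_{\mathrm{prod}}\) and \(c_{\mathrm{sum}}\). Since the arithmetic operation count of the N-point real split-radix FFT is approximately \(2N \log_2 N - 4N\) [11], we assume that \(c_{\mathrm{fft1}} = 2\) and \(c_{\mathrm{fft2}} = -4\). In general, a complex multiplication is done with four real multiplications and two real additions. Since the Fourier coefficients of a real sequence are conjugate-symmetric, only about N/2 complex products are needed, that is, about \(6 \cdot N/2 = 3N\) real operations, so we assume that \(c_{\mathrm{prod}} = 3\). Likewise, adding two sets of about N/2 complex coefficients takes about N real additions, so we assume that \(c_{\mathrm{sum}} = 1\).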
Fig. 3 shows the ratio \(T_{\mathrm{conv}}/T_{\mathrm{split}}\) under the above assumptions as d and \(N (= 2n)\) are varied. The ratio shows the improvement that the splitting algorithm provides over conventional FFT-based multiple-precision multiplication. The arithmetic operation count of the splitting algorithm is less than that of the conventional algorithm for \(2 \le d \le 8\). In particular, for \(N = 2^8\) the 4-way splitting algorithm attains \(T_{\mathrm{conv}}/T_{\mathrm{split}} \approx 1.18\); that is, the conventional algorithm requires about 18% more arithmetic operations.
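The values plotted in Fig. 3 can be recomputed from (3) and (12) with the constants above. The sketch assumes c_norm = 0, since no value is given; the normalization cost cancels in T_split − T_conv but not in the ratio. Under that assumption, T_conv/N = 39 and T_split/N = 33 for N = 2^8 and d = 4, giving T_conv/T_split = 39/33 ≈ 1.18, and among the plotted d only 2, 4 and 8 give a ratio above 1.

```python
from math import log2

C_FFT1, C_FFT2, C_PROD, C_SUM = 2, -4, 3, 1   # constants assumed in the text
C_NORM = 0                                    # assumption: not specified in the text


def t_conv(N):
    """Eq. (3), with the normalization term c_norm * N."""
    return 3 * (C_FFT1 * N * log2(N) + C_FFT2 * N) + C_PROD * N + C_NORM * N


def t_split(N, d):
    """Eq. (12)."""
    return t_conv(N) + (-3 * C_FFT1 * log2(d) + (d - 1) / 2 * (C_PROD + C_SUM)) * N


for N in (2 ** 8, 2 ** 12, 2 ** 16, 2 ** 20):
    ratios = {d: t_conv(N) / t_split(N, d) for d in (1, 2, 4, 8, 16, 32, 64)}
    print(N, {d: round(r, 3) for d, r in ratios.items()})
```

Because the bracketed term in (12) does not depend on N, the best d is the same for every N; only the size of the gain shrinks as \(\log_2 N\) grows.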
4. Conclusion
In this paper, we proposed an algorithm for multiple-precision floating-point multiplication.
We showed that the overall arithmetic operation count of FFT-based multiple-precision floating-point multiplication is reduced by decomposing the full-length multiplication into shorter-length multiplications.
Compared with the conventional algorithm, the 4-way splitting algorithm reduces the arithmetic operation count by a factor of about 1.18 (\(T_{\mathrm{conv}}/T_{\mathrm{split}} \approx 1.18\)) for \(N = 2^8\).
We conclude that, under the cost model and constants of Section 3, 4-way splitting is optimal in terms of the total arithmetic operation count of multiple-precision multiplication.
References
[1] D.E. Knuth, in: The Art of Computer Programming, vol. 2: Seminumerical Algorithms, third ed., Addison-Wesley, Reading, MA, 1997.
[2] A. Karatsuba, Y. Ofman, Multiplication of multidigit numbers on automata, Doklady Akad. Nauk SSSR 145 (1962) 293–294.
[3] A. Schönhage, V. Strassen, Schnelle Multiplikation großer Zahlen, Computing (Arch. Elektron. Rechnen) 7 (1971) 281–292.
[4] W. Krandick, J.R. Johnson, Efficient multiprecision floating point multiplication with optimal directional rounding, in: Proc. 11th IEEE Symposium on Computer Arithmetic, 1993, pp. 228–233.
[5] D. Zuras, More on squaring and multiplying large integers, IEEE Trans. Comput. 43 (1994) 899–908.
[6] T. Mulders, On short multiplications and divisions, Appl. Algebra Eng. Commun. Comput. 11 (2000) 69–88.
[7] J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, Math. Comput. 19 (1965) 297–301.
[8] D.H. Bailey, The computation of π to 29,360,000 decimal digits using Borweins' quartically convergent algorithm, Math. Comput. 50 (1988) 283–296.
[9] J.M. Borwein, P.B. Borwein, D.H. Bailey, Ramanujan, modular equations, and approximations to pi or how to compute one billion digits of pi, Am. Math. Mon. 96 (1989) 201–219.
[10] D.H. Bailey, Algorithm 719: Multiprecision translation and execution of FORTRAN programs, ACM Trans. Math. Softw. 19 (1993) 288–319.
[11] H.V. Sorensen, D.L. Jones, M.T. Heideman, C.S. Burrus, Real-valued fast Fourier transform algorithms, IEEE Trans. Acoust. Speech Signal Processing ASSP-35 (1987) 849–863.