TRANSCRIPT
-
Two Floating Point Block LLL Reduction Algorithms
Yancheng Xiao
Master of Science
School of Computer Science
McGill University
Montreal, Quebec
September 2012
A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Master of Science in Computer Science
© Yancheng Xiao 2012
-
DEDICATION
This document is dedicated to my beloved parents.
-
ACKNOWLEDGEMENTS
I am indebted, in my postgraduate study and research and especially in the
preparation of this thesis, to my supervisor Prof. Xiao-Wen Chang of the School of
Computer Science at McGill University, whose academic guidance and financial support,
given with patience and kindness, have been invaluable to me. I am grateful to Prof.
Clark Verbrugge for kindly lending his AMD high concurrency machine, which has
been useful in testing the performance of our block LLL reduction algorithms. I would
like to thank all my lab mates of the Scientific Computing Lab in the School of
Computer Science, Mazen Al Borno, Stephen Breen, Xi Chen, Sevan Hanssian,
Wen-Yang Ku, Wanru Lin, Milena Scaccia, David Titley-Peloquin, Jinming Wen and
Xiaohu Xie, for the pleasant collaboration during my study and research. Thanks
also to all my friends and my boyfriend Bin Zhu for their various help with my study
and life in Montreal.
-
ABSTRACT
The Lenstra, Lenstra and Lovász (LLL) reduction is the most popular lattice
reduction and is a powerful tool for solving many complex problems in mathematics
and computer science. The blocking technique casts matrix algorithms in terms
of matrix-matrix operations to permit efficient reuse of data in the algorithms. In
this thesis, we use the blocking technique to develop two floating point block LLL
reduction algorithms, the left-to-right block LLL (LRBLLL) reduction algorithm
and the alternating partition block LLL (APBLLL) reduction algorithm, and give
the complexity analysis of these two algorithms. We compare these two block LLL
reduction algorithms with the original LLL reduction algorithm (in floating point
arithmetic) and the partial LLL (PLLL) reduction algorithm in the literature in
terms of CPU run time, flops and relative backward errors. The simulation results
show that in overall CPU run time the two block LLL reduction algorithms are
faster than the partial LLL reduction algorithm and much faster than the original
LLL, even though the two block algorithms cost more flops than the partial LLL
reduction algorithm in some cases. The shortcoming of the two block algorithms is
that sometimes they may not be as numerically stable as the original and partial
LLL reduction algorithms. The parallelization of APBLLL is discussed.
-
ABRÉGÉ
The Lenstra, Lenstra and Lovász (LLL) reduction is the most popular lattice
reduction and is a powerful tool for solving many complex problems in mathematics
and computer science. The blocking technique recasts matrix algorithms in terms
of matrix-matrix operations to permit the efficient reuse of data in the algorithms.
In this thesis, we use the blocking technique to develop two floating point block LLL
reduction algorithms, the left-to-right block LLL (LRBLLL) reduction algorithm
and the alternating partition block LLL (APBLLL) reduction algorithm, and give
the complexity analysis of these two algorithms. We compare these two block LLL
reduction algorithms with the original LLL reduction algorithm (in floating point
arithmetic) and the partial LLL (PLLL) reduction algorithm in the literature in
terms of CPU run time, flops and relative backward errors. The simulation results
show that in CPU run time the two block LLL reduction algorithms are faster
than the partial LLL reduction algorithm and much faster than the original LLL
reduction, even though the two block algorithms cost more flops than the partial
LLL reduction algorithm in some cases. The drawback of these two block algorithms
is that they may sometimes not be as numerically stable as the original and partial
LLL reduction algorithms. The parallelization of APBLLL is discussed.
-
TABLE OF CONTENTS
DEDICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
ABRÉGÉ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Lattice Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions and Organization of the Thesis . . . . . . . . . . . . 4
2 Introduction to LLL Reduction Algorithms . . . . . . . . . . . . . . . . . 7
2.1 LLL Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Original LLL Reduction Algorithm . . . . . . . . . . . . . . . . . 8
2.2.1 Size-Reductions . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Permutations . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . 13
2.3 Partial LLL Reduction Algorithm . . . . . . . . . . . . . . . . . . 16
2.3.1 Householder QR Factorization with Minimum Column Pivoting . . . 17
2.3.2 Partial Size-Reduction and Givens Rotation . . . . . . . . . 19
3 Block LLL Reduction Algorithms . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Subroutines of Block LLL Reduction Algorithms . . . . . . . . . . 24
3.1.1 Block Householder QR Factorization with Minimum Column Pivoting . . . 24
3.1.2 Block Size-Reduction . . . . . . . . . . . . . . . . . . . . . 32
-
3.1.3 Local Partial LLL Reduction . . . . . . . . . . . . . . . . . 35
3.1.4 Block Partial Size-Reduction . . . . . . . . . . . . . . . . . 39
3.2 Left-to-Right Block LLL Reduction Algorithm . . . . . . . . . . . 41
3.2.1 Partition and Block Operation . . . . . . . . . . . . . . . . 41
3.2.2 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . 45
3.3 Alternating Partition Block LLL Reduction Algorithm . . . . . . 48
3.3.1 Partition and Block Operation . . . . . . . . . . . . . . . . 48
3.3.2 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . 53
3.4 Simulation Results and Comparison of Algorithms . . . . . . . . . 55
4 Parallelization of Block LLL Reduction . . . . . . . . . . . . . . . . . . . 71
4.1 Parallel Methods for LLL Reduction . . . . . . . . . . . . . . . . . 71
4.2 A Parallel Block LLL Reduction Algorithm . . . . . . . . . . . . . 72
4.2.1 Parallel Diagonal Block Reduction and Block Updating . . 73
4.2.2 Parallel Block Size-Reduction . . . . . . . . . . . . . . . . . 73
4.3 Performance Evaluation of Parallel Algorithm . . . . . . . . . . . 76
5 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . 80
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
-
LIST OF TABLES
Table                                                                       page
3–1 Complexity analysis of the LRBLLL reduction algorithm . . . . . . . . 48
3–2 Complexity analysis of the APBLLL reduction algorithm . . . . . . . . 55
-
LIST OF FIGURES
Figure                                                                      page
1–1 A lattice in two dimensions . . . . . . . . . . . . . . . . . . . . . . . . 2
3–1 Partition 1 of matrix R . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3–2 Partition 2 of matrix R . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3–3 Performance comparison for Case 1, Intel . . . . . . . . . . . . . . . . 61
3–4 Performance comparison for Case 2, Intel . . . . . . . . . . . . . . . . 62
3–5 Performance comparison for Case 3, Intel . . . . . . . . . . . . . . . . 63
3–6 Performance comparison for Case 2 with dimension 200, Intel . . . . . 64
3–7 Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, Intel . . . 65
3–8 Performance comparison for Case 1, AMD . . . . . . . . . . . . . . . . 66
3–9 Performance comparison for Case 2, AMD . . . . . . . . . . . . . . . . 67
3–10 Performance comparison for Case 3, AMD . . . . . . . . . . . . . . . 68
3–11 Performance comparison for Case 2 with dimension 200, AMD . . . . 69
3–12 Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, AMD . . . 70
4–1 Task allocation for three processors (P1, P2, P3) . . . . . . . . . . . . 74
4–2 Approximating Parallel Simulation . . . . . . . . . . . . . . . . . . . . 79
-
CHAPTER 1
Introduction
1.1 Lattice Reduction
A set L in the real vector space R^m is referred to as a lattice if there exists a set
of linearly independent vectors b_1, b_2, …, b_n ∈ R^m such that

L = Σ_{j=1}^{n} Z b_j = { Σ_{j=1}^{n} z_j b_j : z_j ∈ Z, 1 ≤ j ≤ n }.
The set {b_1, b_2, …, b_n} is a basis of the lattice L. The dimension of the lattice is defined
to be n. The matrix B = [b_1, b_2, …, b_n] is referred to as the lattice basis matrix
which generates L, and the lattice is also written as L(B).
Geometrically, a lattice can be viewed as a set of intersection points of an infinite
grid, as shown in Figure 1–1. The lines of the grid need not be orthogonal to
each other. The same lattice may have different bases. For example, in Figure 1–1,
{b_1, b_2} is a basis of the lattice, and {c_1, c_2} is also a basis.
Suppose that we have two basis matrices B and C. If they generate the same
lattice, i.e., L(B) = L(C), we say that B and C are equivalent. Two basis matrices
B, C ∈ R^{m×n} are equivalent if and only if there exists a unimodular matrix Z ∈ Z^{n×n}
(i.e., an integer matrix with determinant det(Z) = ±1) such that C = BZ, see [25,
p. 4].
Lattice basis reduction transforms a given lattice basis into a basis
with short and nearly orthogonal basis vectors. There are several kinds of lattice
-
Figure 1–1: A lattice in two dimensions
reductions based on different criteria for the resulting basis, such as the Gaus-
sian reduction [12, Chapter 6.1], the Minkowski reduction [26, 27], the Korkine and
Zolotarev (KZ) reduction [21] and the Lenstra, Lenstra and Lovász (LLL) reduction
[22].
Lattice reduction is a powerful tool for solving many complex problems in
mathematics and computer science, especially the problems dealing with integers,
such as integer programming [1, 20], factoring polynomials with rational coefficients
[22], integer factoring [34] and cryptography [15].
The LLL reduction is the most popular lattice reduction. The LLL reduction
algorithm given in [22] and its variants have polynomial time complexity. It is widely
used for applications such as factoring polynomials [22], subset sum problems [37],
digital communications [23, 24, 28, 29, 39], shortest vector problems (SVP) [25] and
-
closest vector problems (CVP), which are also referred to as integer least-squares
(ILS) problems [2, 4, 9, 10, 17].
Generally, we can classify the LLL reduction algorithms into three categories.
The first category includes exact integer arithmetic LLL reduction algorithms with
both input and output bases being integral. For example, the original LLL algorithm
given in [22] is in this category.
The second category includes algorithms such as those in [30, 35, 36], which
use not only integer arithmetic but also floating point arithmetic. The input and
output bases in this category are also integral. The reason to use floating point
arithmetic is that integer arithmetic is expensive. The algorithms use floating point
numbers of sufficient precision to approximate the intermediate results, so that the
rounding errors do not lead to an output basis which is not exactly LLL reduced.
The applications of the first and second categories include factoring polynomials
[22], subset sum problems [37] and public-key cryptanalysis [15].
The third category includes floating point algorithms with both input and output
bases being real. This category applies to cases where exact integer arithmetic is not
required and where a nearly LLL reduced basis is acceptable, such as ILS problems
which arise in GPS, e.g., [9, 10, 11, 17, 43], and in multi-input multi-output (MIMO)
communications, e.g., [24, 42]. So an algorithm in this category does not require the
strict floating point error control of the algorithms in the second category, and an
algorithm in category three is much more efficient than those in categories one and two.
-
1.2 Contributions and Organization of the Thesis
The goal of this thesis is to propose efficient and reliable floating point algorithms
for the LLL reduction of real basis matrices by using the blocking technique [14,
Chapter 5]. The algorithms are based on the original LLL reduction algorithm [22]
and the partial LLL (PLLL) reduction algorithm [43].
The computation speed of a matrix algorithm is determined not only by the
number of floating point operations involved, but also by the amount of memory
traffic, i.e., the movement of data between memory and registers. The level
3 basic linear algebra subprograms (BLAS) are designed to reduce this movement
of data. The matrix-matrix operations implemented in level 3 BLAS make effi-
cient reuse of data residing in cache or local memory to avoid excessive data
movements. The blocking technique casts the algorithms in terms of matrix-matrix
operations to permit efficient reuse of data.
Two block LLL reduction algorithms utilizing this blocking technique are pro-
posed in this thesis, together with their complexity analysis. Numerical simulations
compare the performance of our block algorithms, in terms of CPU time, flops and
numerical stability, with the original LLL reduction algorithm and the PLLL reduction
algorithm. On average the block algorithms are computationally faster than PLLL
and LLL, although their numerical stability in some cases may need improvement.
The parallelization of one of the two block LLL reduction algorithms is discussed
in two parts, the parallelization of the block size-reduction and the parallelization
of the diagonal block reduction. Complexity analysis shows that the parallelized
size-reduction part can obtain a speedup of n_p in ideal cases, if n_p processors are
-
used. The improvement from the parallelized diagonal block reduction part is hard to
observe from the complexity analysis, since the complexity bound is too pessimistic. A
simple test is designed to examine the performance of the parallelized diagonal block
reduction part. The test result shows that the parallelized diagonal block reduction
part can obtain a speedup of 4.8 with 5 processors in the best cases.
The rest of the thesis is organized as follows. In Chapter 2, we first give the
definition of the LLL reduction. Then a description of the original LLL reduction
algorithm in the matrix language is given, followed by its complexity analysis. In
the last section of this chapter, we introduce the partial LLL (PLLL) reduction
algorithm.
In Chapter 3, we first apply the blocking technique to the components of the
PLLL algorithm, leading to block subroutines. Then two block LLL algorithms are
proposed based on these block subroutines. We give the complexity analysis for the
block algorithms under the assumption of using exact arithmetic. Finally, simulation
results are presented, compared and discussed.
In Chapter 4, we first review the literature of parallel LLL algorithms. Then we
discuss the parallelization of one of our two block algorithms.
Chapter 5 gives conclusions and future work.
We now describe the notation to be used in the thesis. The sets of all real and
integer m × n matrices are denoted by R^{m×n} and Z^{m×n}, respectively, and the sets of
real and integer n-vectors are denoted by R^n and Z^n, respectively. Upper case letters
are used to denote matrices and bold lower case letters are used to denote vectors.
The identity matrix is denoted by I and its i-th column is denoted by e_i. MATLAB
-
notation is used to denote a sub-matrix. Specifically, if A = (a_{ij}) ∈ R^{m×n}, then A(i, :)
denotes the i-th row, A(:, j) denotes the j-th column, and A(i_1 : i_2, j_1 : j_2) denotes
the sub-matrix formed by rows i_1 to i_2 and columns j_1 to j_2. For the (i, j) element
of A, sometimes we use a_{ij} and sometimes we use A(i, j). For a block matrix A, A_{ij}
denotes the (i, j) block. For a scalar z ∈ R, we use ⌊z⌉ to denote its nearest integer; if
there is a tie, ⌊z⌉ denotes the one with smaller magnitude. det(A) is the determinant
of A. Unless specified otherwise, ‖·‖ stands for the 2-norm, i.e., ‖a‖ = (a^T a)^{1/2}, and
‖·‖_F stands for the Frobenius matrix norm, i.e., ‖A‖_F = (Σ_{i,j} a²_{ij})^{1/2}.
-
CHAPTER 2
Introduction to LLL Reduction Algorithms
In this chapter, we first give the definition of the Lenstra-Lenstra-Lovász (LLL)
reduction. Then we introduce the original LLL reduction algorithm [22] and the
partial LLL (PLLL) reduction algorithm [43], which will be the bases of our new
LLL reduction algorithms to be presented in later chapters.
2.1 LLL Reduction
The LLL reduction introduced in [22] can be described as a QRZ matrix fac-
torization:

B = Q [R; 0] Z^{−1} = Q_1 R Z^{−1},

where B ∈ R^{m×n} is a given matrix with full column rank, Q = [Q_1, Q_2] ∈ R^{m×m}
with Q_1 ∈ R^{m×n} and Q_2 ∈ R^{m×(m−n)} is orthogonal, Z ∈ Z^{n×n} is unimodular, and
R ∈ R^{n×n} is upper triangular and satisfies two conditions:

|r_{ij}/r_{ii}| ≤ 1/2,  1 ≤ i < j ≤ n,   (2.1)
δ r²_{i−1,i−1} ≤ r²_{ii} + r²_{i−1,i},  1 < i ≤ n,   (2.2)

with the parameter δ ∈ (1/4, 1). The conditions Eq.(2.1) and Eq.(2.2) are named
the size-reduction condition and the Lovász condition, respectively. The matrix BZ
or the matrix R is said to be LLL reduced.
-
The LLL reduction algorithm in [22] is the most well known lattice basis reduc-
tion algorithm with polynomial time complexity; it was originally designed for
factoring polynomials with rational coefficients using integer arithmetic operations.
Later, the applications of the LLL reduction were widely extended to number theory
(see, e.g., [34, 37]), cryptography (see, e.g., [15, 25]), integer programming (see, e.g.,
[1, 20]), digital communications (see, e.g., [24]), and GPS (see, e.g., [11, 17]). Some
of these extended applications do not require an exact integer LLL reduced basis, so
floating point arithmetic is used to achieve better computational performance in such
application areas. One example of a floating point LLL application is to compute
a suboptimal solution (e.g., the Babai point [4]) or the optimal solution of an integer
least squares (ILS) problem.
In the remaining part of this chapter, the original LLL reduction algorithm and
the PLLL reduction algorithm are introduced; we assume they use floating point
arithmetic.
2.2 Original LLL Reduction Algorithm
We will describe the original LLL reduction algorithm in the matrix language
(see [44, Algorithm 3.3.1] and [13, Algorithm 2.6.3]). The algorithm involves the
Gram-Schmidt orthogonalization (GSO), integer Gauss transformations (IGT), col-
umn permutations and orthogonal transformations. GSO is applied to find the QR
factors Q and R of the given matrix B. The column permutations and IGTs produce
the unimodular matrix Z.
In the original exact integer LLL reduction algorithm, a column scaled Q and
a row scaled R which has unit diagonal entries are computed by a variation of GSO
-
to avoid square root computations. In the floating point LLL reduction algorithm in
this thesis, the regular GSO is applied to B and gives the compact form of the QR
factorization:

B = Q_1 R,

where Q_1 ∈ R^{m×n} has orthonormal columns, and R ∈ R^{n×n} is upper triangular.
After the GSO of B, integer Gauss transformations, column permutations and
GSO are used to transform R to an LLL reduced basis. IGTs are used to perform size-
reductions on the off-diagonal entries to achieve Eq.(2.1). The column permutations
are used to order the columns to achieve Eq.(2.2). Since a column permutation
destroys the upper triangular structure, GSO is used to recover the upper triangular
structure.
2.2.1 Size-Reductions
An integer matrix is called an IGT or an integer Gauss matrix if it has the
following form:

Z_{ij} = I_n − ζ e_i e_j^T,  i ≠ j,  ζ an integer.

Applying Z_{ij} to R from the right gives

R̄ = R Z_{ij} = R − ζ R e_i e_j^T.

Thus R̄ is the same as R, except that r̄_{kj} = r_{kj} − ζ r_{ki}, k = 1, …, i. By setting
ζ = ⌊r_{ij}/r_{ii}⌉, the nearest integer to r_{ij}/r_{ii}, we ensure |r̄_{ij}| ≤ |r_{ii}|/2.
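For illustration, the update can be written in a few lines of NumPy (a sketch of ours, not the thesis pseudocode; indices are 0-based and the helper name apply_igt is hypothetical):

```python
import numpy as np

def apply_igt(R, Z, i, j):
    """Apply one IGT Z_ij = I - zeta * e_i e_j^T from the right:
    column j of R changes in rows 0..i, and |R[i, j]| <= |R[i, i]|/2 after."""
    zeta = np.rint(R[i, j] / R[i, i])        # nearest integer to r_ij / r_ii
    if zeta != 0:
        R[:i + 1, j] -= zeta * R[:i + 1, i]  # only rows up to i change
        Z[:, j] -= int(zeta) * Z[:, i]       # accumulate the transformation
    return R, Z
```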
2.2.2 Permutations
The column permutations are applied to achieve Eq.(2.2). Suppose that the
Lovász condition is not satisfied for i = k; then a permutation matrix P_{k−1,k} is
-
performed to interchange columns k−1 and k of R. After the permutation, the
upper triangular structure of R is destroyed. An orthogonal transformation G_{k−1,k}
using the GSO technique (see [22]) is performed to re-construct the upper triangular
structure of R:

R̄ = G_{k−1,k} R P_{k−1,k},

where

G_{k−1,k} = diag(I_{k−2}, G, I_{n−k}),   G = [c, s; −s, c],

c = r_{k−1,k} / (r²_{k−1,k} + r²_{kk})^{1/2},   s = r_{kk} / (r²_{k−1,k} + r²_{kk})^{1/2}.
The columns k−1, k and the rows k−1, k of R are changed by this permutation
and orthogonalization process. The diagonal and super-diagonal entries of R which
are changed by the permutation and orthogonalization process become

r̄_{k−1,k−1} = (r²_{k−1,k} + r²_{kk})^{1/2},
r̄_{k−1,k} = r_{k−1,k−1} r_{k−1,k} / (r²_{k−1,k} + r²_{kk})^{1/2},
r̄_{k,k} = −r_{k−1,k−1} r_{kk} / (r²_{k−1,k} + r²_{kk})^{1/2}.

Thus, if δ r²_{k−1,k−1} > r²_{kk} + r²_{k−1,k} with δ ∈ (1/4, 1), then the above operations guar-
antee δ r̄²_{k−1,k−1} ≤ r̄²_{kk} + r̄²_{k−1,k}, i.e., the Lovász condition holds for index k after
the operations.
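In code, the permutation and re-triangularization step might look as follows (again a NumPy sketch of ours with 0-based arrays; the 1-based stage index k of the text is passed in):

```python
import numpy as np

def swap_and_triangularize(R, Z, k):
    """Interchange columns k-1 and k (1-based, as in the text) of R and Z,
    then zero the created subdiagonal entry with the 2x2 rotation G above."""
    i, j = k - 2, k - 1                    # 0-based column indices
    R[:, [i, j]] = R[:, [j, i]]            # permutation P_{k-1,k}
    Z[:, [i, j]] = Z[:, [j, i]]
    d = np.hypot(R[i, i], R[j, i])         # sqrt(r_{k-1,k}^2 + r_{kk}^2)
    c, s = R[i, i] / d, R[j, i] / d        # entries of G
    R[[i, j], i:] = np.array([[c, s], [-s, c]]) @ R[[i, j], i:]
    R[j, i] = 0.0                          # remove rounding residue
    return R, Z
```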
Based on the above description of size-reductions and permutations, we now
describe the procedure of the LLL reduction algorithm. The algorithm
iterates a sequence of stages to satisfy the LLL reduction conditions, and it
works on the columns of R from left to right. Define a column stage variable k which
-
indicates that the first k−1 columns of R are LLL reduced at the current stage, i.e.,

|r_{ij}/r_{ii}| ≤ 1/2,  1 ≤ i < j ≤ k−1,   (2.3)
δ r²_{i−1,i−1} ≤ r²_{ii} + r²_{i−1,i},  1 < i ≤ k−1.   (2.4)
At the beginning, k is set to 2. Then during the reduction procedure, the value of k
shifts between 2 and n+1 and changes by 1 in each step. At stage k, the algorithm
first uses an integer Gauss transformation to reduce r_{k−1,k}. Then it checks whether it
needs to permute columns k−1 and k according to the Lovász condition. If
δ r²_{k−1,k−1} > r²_{kk} + r²_{k−1,k}, it performs the permutation and applies the corresponding
orthogonal transformation, and moves back to stage k−1. Otherwise it reduces
r_{i,k} (i = k−2, k−3, …, 1) by IGTs and moves to the next stage k+1. When
k reaches n+1, the conditions Eq.(2.1) and Eq.(2.2) are satisfied, the upper
triangular matrix R is LLL reduced and the algorithm stops. The algorithm is given
as follows.
Algorithm 2.1. (LLL Reduction) Suppose B ∈ R^{m×n} has full column rank. This
algorithm computes the LLL reduction B = Q_1 R Z^{−1}, where Q_1 has orthonormal
columns, R is upper triangular and satisfies the LLL reduction criteria, and Z is
unimodular.
function: [R,Z] = LLL(B)
1: Apply GSO to obtain B = Q1R
2: k := 2, Z := I_n
3: while k ≤ n do
4: if |r_{k−1,k}/r_{k−1,k−1}| > 1/2 then
-
// Reduce r_{k−1,k}
5: ζ := ⌊r_{k−1,k}/r_{k−1,k−1}⌉
6: Z(1 : n, k) := Z(1 : n, k) − ζ Z(1 : n, k−1)
7: R(1 : k−1, k) := R(1 : k−1, k) − ζ R(1 : k−1, k−1)
8: end if
// δ is a parameter chosen in (1/4, 1)
9: if δ r²_{k−1,k−1} > r²_{kk} + r²_{k−1,k} then
10: Interchange columns Z(1 : n, k) and Z(1 : n, k−1)
11: Interchange columns R(1 : k, k) and R(1 : k, k−1)
12: Triangularize R: R := G_{k−1,k} R
13: if k > 2 then
14: k := k − 1
15: end if
16: else
// Size-reduction
17: for i = k−2 : −1 : 1 do
18: ζ := ⌊r_{i,k}/r_{ii}⌉
19: Z(1 : n, k) := Z(1 : n, k) − ζ Z(1 : n, i)
20: R(1 : i, k) := R(1 : i, k) − ζ R(1 : i, i)
21: end for
22: k := k + 1
23: end if
24: end while
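To make the whole loop concrete, the following is a self-contained NumPy sketch of Algorithm 2.1 (ours; np.linalg.qr, which is based on Householder transformations, stands in for the GSO of line 1, indices are 0-based, and δ = 3/4 is an assumed default):

```python
import numpy as np

def lll(B, delta=0.75):
    """Floating point LLL sketch of Algorithm 2.1. Returns (R, Z) with
    R upper triangular and LLL reduced and Z unimodular."""
    n = B.shape[1]
    R = np.linalg.qr(B, mode='r')               # stands in for GSO
    Z = np.eye(n, dtype=np.int64)
    k = 1                                       # 0-based stage (k = 2 in text)
    while k < n:
        if abs(R[k - 1, k] / R[k - 1, k - 1]) > 0.5:
            zeta = np.rint(R[k - 1, k] / R[k - 1, k - 1])
            R[:k, k] -= zeta * R[:k, k - 1]     # reduce r_{k-1,k}
            Z[:, k] -= int(zeta) * Z[:, k - 1]
        if delta * R[k - 1, k - 1]**2 > R[k, k]**2 + R[k - 1, k]**2:
            R[:, [k - 1, k]] = R[:, [k, k - 1]]     # column permutation
            Z[:, [k - 1, k]] = Z[:, [k, k - 1]]
            d = np.hypot(R[k - 1, k - 1], R[k, k - 1])
            c, s = R[k - 1, k - 1] / d, R[k, k - 1] / d
            R[k-1:k+1, k-1:] = np.array([[c, s], [-s, c]]) @ R[k-1:k+1, k-1:]
            R[k, k - 1] = 0.0
            k = max(k - 1, 1)                   # move back a stage
        else:
            for i in range(k - 2, -1, -1):      # size-reduce rest of column k
                zeta = np.rint(R[i, k] / R[i, i])
                R[:i + 1, k] -= zeta * R[:i + 1, i]
                Z[:, k] -= int(zeta) * Z[:, i]
            k += 1
    return R, Z
```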
-
2.2.3 Complexity Analysis
Assume that the operations used in the algorithm are performed in exact arith-
metic. The complexity of Algorithm 2.1 is measured by the number of arithmetic
operations. Part of the results of the complexity analysis will be used in Chapter 3
and Chapter 4. The QR factorization by GSO takes O(mn²) arithmetic operations
[16, Section 5.2]. Next, we give the analysis of the complexity of the while loop in
the LLL reduction algorithm. By adding the complexity of the QR factorization and the
while loop together, we get the complexity of the LLL reduction algorithm.
For the complexity of the while loop, we would like to first determine the number
of loops and then count the number of arithmetic operations in each loop.
Lemma 2.1 ([22]): Let β = max_j ‖b_j‖, and let λ = min_{x∈Z^n\{0}} ‖Bx‖ be the
length of the shortest nonzero vector of the lattice L(B). The number of permutations
involved in Algorithm 2.1 is bounded by O(n³ + n² log_{1/δ}(β/λ)) and the algorithm
converges.
Proof. We use the proof from [22] and [44, Chapter 3].
After the Gram-Schmidt QR factorization, we obtain the QR factors Q_1 and R in
the QR factorization B = Q_1 R. Let R^{(p)} denote the upper triangular matrix R after
the p-th permutation (R^{(0)} = R). Define the quantities w_i and Φ after the p-th
permutation as

w_i^{(p)} = Π_{j=1}^{i} (r_{jj}^{(p)})²,  i = 1, 2, …, n,   (2.5)

and

Φ^{(p)} = Π_{i=1}^{n} w_i^{(p)}.   (2.6)
-
Suppose the p-th permutation is applied to columns q−1 and q of matrix R^{(p−1)}
and the orthogonal transformation by GSO is applied to keep the upper triangular
structure as described in the algorithm. We obtain a matrix R^{(p)} with the following
properties:

r_{jj}^{(p)} = r_{jj}^{(p−1)}, j ≠ q−1, q,   |r_{q−1,q−1}^{(p)} r_{qq}^{(p)}| = |r_{q−1,q−1}^{(p−1)} r_{qq}^{(p−1)}|.

And by the permutation criterion (see line 9 of Algorithm 2.1) obtained from Eq.(2.2),
we have (r_{q−1,q−1}^{(p)})² < δ (r_{q−1,q−1}^{(p−1)})². Then from Eq.(2.5) we obtain

w_i^{(p)} = w_i^{(p−1)}, i ≠ q−1,   w_{q−1}^{(p)} / w_{q−1}^{(p−1)} < δ.
Substituting them into Eq.(2.6) gives

Φ^{(p)} < δ Φ^{(p−1)},   (2.7)

which means that one permutation operation decreases Φ by a factor of at least
δ. Assume that the algorithm involves a total of p permutations before convergence.
From Eq.(2.7) it follows that

Φ^{(p)} < δ^p Φ^{(0)},

or equivalently

p < log_{1/δ} (Φ^{(0)}/Φ^{(p)}) = log_{1/δ} Φ^{(0)} − log_{1/δ} Φ^{(p)}
  = Σ_{i=1}^{n} log_{1/δ} w_i^{(0)} − Σ_{i=1}^{n} log_{1/δ} w_i^{(p)}.   (2.8)
-
Since β = max_j ‖b_j‖ and ‖b_j‖² ≥ (r_{jj}^{(0)})², we have (r_{jj}^{(0)})² ≤ β² (j = 1, 2, …, n). Thus
from Eq.(2.5)

w_i^{(0)} ≤ β^{2i}.   (2.9)
By Theorem I of [7, Chapter II],

λ² = min_{x∈Z^n\{0}} ‖Bx‖² ≤ (4/3)^{(n−1)/2} (det(B^T B))^{1/n}.   (2.10)
For any x ∈ Z^n, we can define x̄ = (Z^{(p)})^{−1} x, where Z^{(p)} denotes the unimodular
matrix Z after the p-th permutation (Z^{(0)} = I_n). Define B̄^{(p)} = B Z^{(p)} = Q_1^{(p)} R^{(p)}.
From Eq.(2.10) we have

λ² = min_{x∈Z^n\{0}} ‖Bx‖² = min_{x̄∈Z^n\{0}} ‖B̄^{(p)} x̄‖²
   ≤ min_{x̄(1:i)∈Z^i\{0}} ‖B̄^{(p)}(:, 1:i) x̄(1:i)‖²
   ≤ (4/3)^{(i−1)/2} |det(B̄^{(p)}(:, 1:i)^T B̄^{(p)}(:, 1:i))|^{1/i}
   = (4/3)^{(i−1)/2} |det(R^{(p)}(:, 1:i)^T R^{(p)}(:, 1:i))|^{1/i}
   = (4/3)^{(i−1)/2} (w_i^{(p)})^{1/i}   (see Eq.(2.5)).

Then it follows that

w_i^{(p)} ≥ (3/4)^{i(i−1)/2} λ^{2i}.   (2.11)
-
Substituting Eq.(2.9) and Eq.(2.11) into Eq.(2.8) gives

p < Σ_{i=1}^{n} log_{1/δ} β^{2i} − Σ_{i=1}^{n} log_{1/δ} ((3/4)^{i(i−1)/2} λ^{2i})
  = (n+1)n log_{1/δ} (β/λ) + log_{1/δ} Π_{i=1}^{n} (4/3)^{i(i−1)/2}
  = (n+1)n log_{1/δ} (β/λ) + (1/6)(n³ − n) log_{1/δ} (4/3).

So Algorithm 2.1 involves at most O(n³ + n² log_{1/δ}(β/λ)) permutations and the algorithm
converges.
We should note that the bound on the number of permutations from the lemma
applies to all kinds of LLL reduction algorithms, provided they share the same
permutation criterion with Algorithm 2.1.
In Algorithm 2.1, k is either increased or decreased by 1 in the while loop. Since
each loop in which k is decreased must contain a column permutation, there are
p loops in which k is decreased. The algorithm starts from k = 2 and ends when
k = n+1, so the number of loops in which k is increased equals p + n − 1.
Thus there are 2p + n − 1 loops in total, which is bounded by O(n³ + n² log_{1/δ}(β/λ)).
Each loop costs O(n²) arithmetic operations in the worst case. So the whole
algorithm takes at most O(mn² + n⁵ + n⁴ log_{1/δ}(β/λ)) arithmetic operations.
2.3 Partial LLL Reduction Algorithm
Recently the so-called effective LLL (ELLL) reduction was proposed by Ling
and Howgrave-Graham [23], and later the so-called partial LLL (PLLL) reduction
algorithm was developed by Xie, Chang and Borno [43]. Both algorithms are more efficient
-
than Algorithm 2.1. The ELLL reduction algorithm is essentially identical to Al-
gorithm 2.1 with lines 17-21, which reduce the off-diagonal entries of R except the
super-diagonal ones, removed. It has lower computational complexity than LLL,
while it has the same effect on the performance of the Babai integer point as LLL.
[43] shows algebraically that the size-reduction condition of the LLL reduction has
no effect on a typical sphere decoding (SD) search process for solving an integer least
squares (ILS) problem. Thus it has no effect on the performance of the Babai inte-
ger point, the first integer point found in the search process. PLLL was proposed
to avoid the numerical stability problem with ELLL, and to avoid some unneces-
sary size-reductions involved in LLL and ELLL. Both PLLL and ELLL can compute
LLL reduced bases if an extra size-reduction procedure is added at the end of the
algorithms. The following part gives a description of the PLLL reduction.
2.3.1 Householder QR Factorization with Minimum Column Pivoting
The typical LLL algorithm first finds the QR factorization of the given matrix
B. In the original LLL algorithm, the Gram-Schmidt method is adopted for com-
puting the QR factorization. However the Householder method without forming the
orthogonal factor Q, which costs (4/3)mn² flops, is more efficient than the Gram-Schmidt
method, which costs 2mn² flops [16]. The Householder method requires square root
operations, so it is not suitable for the exact integer LLL reduction. The floating
point LLL reduction, however, has no problem with computing a square root, so it
can use Householder transformations to compute the QR factorization.
The PLLL reduction uses the Householder QR factorization with minimum col-
umn pivoting (QRMCP) instead of the classic Householder QR factorization. In
-
general, the number of permutations is a crucial factor in the cost of the whole LLL
reduction process. If one can make the upper triangular factor close to an LLL re-
duced one in the QR factorization stage, the number of permutations in the later
stage is likely to decrease. The minimum column pivoting strategy is used to help
achieve the Lovász condition, see [44, Section 4.1].
From Eq.(2.1) and Eq.(2.2), we can easily obtain

(δ − 1/4) r²_{i−1,i−1} ≤ r²_{ii},  1 < i ≤ n,  δ ∈ (1/4, 1).   (2.12)
The Householder QR factorization upper-triangularizes the matrix B column by
column, with the column index i increasing from 1 to n. In order to make the
matrix R more likely to satisfy Eq.(2.12), the minimum column pivoting strategy
chooses a column permutation such that |r_{ii}| is smallest in the i-th step. In the
i-th step of the QR factorization, the QRMCP finds the column in B(i :m, i :n) with
the minimum 2-norm, and interchanges the whole column with the i-th column of
B. After this the QRMCP eliminates the entries B(i + 1 : m, i) by a
Householder transformation H_i. By using the minimum column pivoting strategy,
the Householder QR becomes
BP = Q [R; 0] = [Q_1, Q_2] [R; 0] = Q_1 R,   (2.13)

where P ∈ R^{n×n} is a permutation matrix, R ∈ R^{n×n} is upper triangular, and
Q = [Q_1, Q_2] ∈ R^{m×m} with Q_1 ∈ R^{m×n} and Q_2 ∈ R^{m×(m−n)} is orthogonal.
Q^T = H_n H_{n−1} ⋯ H_1 is the product of n Householder transformations.
The algorithm is given as follows.
-
Algorithm 2.2. (Householder QR Factorization with Minimum Column Pivoting)
Suppose B ∈ R^{m×n} has full column rank. This algorithm computes the QRMCP
factorization B = Q_1 R P^T, where Q_1 has orthonormal columns, R is upper triangular
and P is a permutation matrix.
function: [R,P ] = QRMCP (B)
1: P := In
2: l_j := ‖B(1 : m, j)‖², j = 1 : n
3: for i = 1 : n do
4: q := arg min_{i≤j≤n} l_j
5: if q > i then
6: Interchange columns B(1 : m, i) and B(1 : m, q)
7: Interchange columns P (1 : n, i) and P (1 : n, q)
8: end if
9: Compute the Householder transformation Hi which zeros B(i+ 1 : m, i)
10: B := HiB
11: l_j := l_j − B(i, j)², j = i+1, i+2, …, n
12: end for
13: R := B(1 : n, 1 : n)
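A compact NumPy rendering of Algorithm 2.2 might look as follows (our sketch; 0-based indices, Q_1 is not formed, and the permutation is returned as an index vector p such that B(:, p) = Q_1 R):

```python
import numpy as np

def qrmcp(B):
    """Householder QR with minimum column pivoting (sketch of Algorithm 2.2)."""
    B = B.astype(float).copy()
    m, n = B.shape
    p = np.arange(n)
    l = np.sum(B * B, axis=0)                  # squared column norms
    for i in range(n):
        q = i + int(np.argmin(l[i:]))          # column of minimum residual norm
        if q > i:
            B[:, [i, q]] = B[:, [q, i]]
            p[[i, q]] = p[[q, i]]
            l[[i, q]] = l[[q, i]]
        x = B[i:, i].copy()                    # build the Householder vector u
        sigma = np.linalg.norm(x)
        u = x
        u[0] += sigma if x[0] >= 0 else -sigma # sign chosen to avoid cancellation
        beta = u @ u
        if beta > 0:                           # apply H_i to B(i:m, i:n)
            B[i:, i:] -= np.outer((2.0 / beta) * u, u @ B[i:, i:])
        l[i + 1:] -= B[i, i + 1:] ** 2         # downdate squared norms
    return np.triu(B[:n, :n]), p
```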
2.3.2 Partial Size-Reduction and Givens Rotation
After the QRMCP, the PLLL reduction performs permutations, IGTs and Givens
rotations on R in an efficient and numerically stable way. In the k-th column of R,
PLLL checks whether it needs to permute columns k and k−1 according to the Lovász
condition Eq.(2.2). If the Lovász condition holds, then the permutation does not
-
occur, no IGT is applied, and the algorithm moves to column k+1. If the
Lovász condition does not hold, r_{k−1,k} is reduced by an IGT; IGTs are also applied to
r_{k−2,k}, …, r_{1,k} for stability considerations. Then PLLL performs the permutation and
the Givens rotation, and moves back to the previous column.
In PLLL, Givens rotations are used, instead of the GSO in line 12 of Algorithm 2.1,
to do the triangularization after permutations. Define the Givens rotation matrix as

G = [c, s; −s, c],

where

c = r_{k−1,k} / (r²_{k−1,k} + r²_{kk})^{1/2},   s = r_{kk} / (r²_{k−1,k} + r²_{kk})^{1/2},

which is used in the following transformation:

[c, s; −s, c] [r_{k−1,k}, r_{k−1,k−1}; r_{k,k}, 0] = [r̄_{k−1,k−1}, r̄_{k−1,k}; 0, r̄_{k,k}].

The PLLL algorithm is given as follows.
Algorithm 2.3. (PLLL Reduction) Suppose B ∈ R^{m×n} has full column rank. This
algorithm computes the PLLL reduction of B: B = Q_1 R Z^{−1}, where Q_1 has orthonor-
mal columns, R is upper triangular and Z is unimodular. It computes IGTs only
when a column permutation occurs.
function: [R,Z] = PLLL(B)
1: Compute [R,P ] = QRMCP (B)
2: Set Z := P , k := 2
-
3: while k ≤ n do
4: ζ := ⌊r_{k−1,k}/r_{k−1,k−1}⌉
5: γ := r_{k−1,k} − ζ r_{k−1,k−1}
// δ is a parameter chosen in (1/4, 1)
6: if δ r²_{k−1,k−1} > γ² + r²_{kk} then
// Size-reduce R(1 : k−1, k)
7: for l = k−1 : −1 : 1 do
8: ζ := ⌊r_{l,k}/r_{ll}⌉
9: Z(1 : n, k) := Z(1 : n, k) − ζ Z(1 : n, l)
10: R(1 : l, k) := R(1 : l, k) − ζ R(1 : l, l)
11: end for
// Column permutation and updating
12: c := r_{k−1,k} / (r²_{k−1,k} + r²_{kk})^{1/2}
13: s := r_{kk} / (r²_{k−1,k} + r²_{kk})^{1/2}
14: G := [c, s; −s, c]
15: Interchange columns Z(1 : n, k) and Z(1 : n, k−1)
16: Interchange columns R(1 : n, k) and R(1 : n, k−1)
17: R(k−1 : k, k−1 : n) := G R(k−1 : k, k−1 : n)
18: if k > 2 then
19: k := k − 1
20: end if
21: else
-
22: k := k + 1
23: end if
24: end while
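For comparison with the LLL sketch in Section 2.2, the following is a NumPy sketch of Algorithm 2.3 (ours; plain np.linalg.qr replaces QRMCP, so the minimum column pivoting of line 1 is omitted, and indices are 0-based):

```python
import numpy as np

def plll(B, delta=0.75):
    """PLLL sketch: size-reduce a column only when it is about to be permuted."""
    n = B.shape[1]
    R = np.linalg.qr(B, mode='r')               # stand-in for QRMCP, no pivoting
    Z = np.eye(n, dtype=np.int64)
    k = 1
    while k < n:
        zeta = np.rint(R[k - 1, k] / R[k - 1, k - 1])
        gamma = R[k - 1, k] - zeta * R[k - 1, k - 1]
        if delta * R[k - 1, k - 1]**2 > gamma**2 + R[k, k]**2:
            for l in range(k - 1, -1, -1):      # size-reduce column k (lines 7-11)
                z = np.rint(R[l, k] / R[l, l])
                if z != 0:
                    R[:l + 1, k] -= z * R[:l + 1, l]
                    Z[:, k] -= int(z) * Z[:, l]
            d = np.hypot(R[k - 1, k], R[k, k])  # Givens parameters (lines 12-14)
            c, s = R[k - 1, k] / d, R[k, k] / d
            R[:, [k - 1, k]] = R[:, [k, k - 1]] # permutation and updating
            Z[:, [k - 1, k]] = Z[:, [k, k - 1]]
            R[k-1:k+1, k-1:] = np.array([[c, s], [-s, c]]) @ R[k-1:k+1, k-1:]
            R[k, k - 1] = 0.0
            k = max(k - 1, 1)
        else:
            k += 1
    return R, Z
```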
Notice that the final matrix R obtained by the PLLL reduction algorithm is
not fully size-reduced, since the algorithm performs size-reductions only when a per-
mutation immediately follows. However we can easily add an extra size-reduction
procedure at the end of the PLLL reduction algorithm, and transform R to an LLL re-
duced matrix. We name the PLLL algorithm with an extra size-reduction procedure
PLLL+.
The PLLL reduction algorithm uses the same permutation criterion as the LLL
reduction algorithm, so it has the same upper bound on permutations/loops as the
LLL reduction algorithm, which is O(n³ + n² log_{1/δ}(β/λ)).
For each loop, the PLLL reduction algorithm performs O(n²) arithmetic operations
in the worst case. The Householder QR costs O(mn²) flops [16, Section 5.2]. So
the PLLL algorithm takes at most O(mn² + n⁵ + n⁴ log_{1/δ}(β/λ)) arithmetic operations,
which is the same as the complexity bound of the LLL reduction algorithm. The
simulation results of PLLL in [43] show that it is faster and more stable than the
LLL reduction.
-
CHAPTER 3
Block LLL Reduction Algorithms
The blocking technique has been widely used to speed up conventional matrix
algorithms on today's high performance computers. The key to achieving high per-
formance on computers with a memory hierarchy has been to recast the algorithms
in terms of matrix-vector and matrix-matrix operations to permit efficient reuse of
data residing in cache or local memory. The blocking technique partitions a
big matrix into small blocks, and performs matrix-matrix operations implemented
in level 3 basic linear algebra subprograms (BLAS) as much as possible [14]. The
matrix-matrix operations implemented in level 3 BLAS are more efficient than the
matrix-vector operations implemented in level 2 BLAS or the vector-vector operations
implemented in level 1 BLAS. The level 3 BLAS can maximally reduce the move-
ment of data between memory and registers, which can be as costly as arithmetic
operations on the data in matrix algorithms.
In this chapter, we first explain how to apply the blocking technique to the com-
ponents of the partial LLL (PLLL) reduction algorithm. Then we propose two block
LLL reduction algorithms with different matrix partition strategies, and compare
their speed and stability with the original LLL reduction algorithm and the PLLL
reduction algorithm introduced in Chapter 2.
-
3.1 Subroutines of Block LLL Reduction Algorithms
In this section we describe a block QR factorization algorithm, a block size-
reduction algorithm named BSR, a variant of the PLLL reduction algorithm named
Local-PLLL and a block partial size-reduction algorithm named BPSR. They will
be used as subroutines of the block LLL reduction algorithms. Local-PLLL suits for
computing the PLLL reduction of blocks of the basis matrix. The block partial size-
reduction algorithm uses an efficient size-reduction strategy proposed in the PLLL
reduction algorithm.
3.1.1 Block Householder QR Factorization with Minimum Column Pivoting
In order to design a block Householder QR factorization by means of level 3
BLAS, Schreiber and Van Loan [38] proposed a storage-efficient WY representa-
tion for the product of Householder transformations. Later Quintana-Orti, Sun and
Bischof [32] proposed a level 3 BLAS version of the QR factorization with maximum
column pivoting in order to get a rank-revealing factorization. Based on their work,
we give the block QR factorization algorithm with minimum column pivoting in this
section.
Given a real full column rank matrix B ∈ R^{m×n}, the Householder QR factoriza-
tion with minimum column pivoting gives

BP = Q [R; 0] = [Q_1, Q_2] [R; 0] = Q_1 R,   (3.1)

where Q = [Q_1, Q_2] ∈ R^{m×m} with Q_1 ∈ R^{m×n} and Q_2 ∈ R^{m×(m−n)} is orthogonal,
R ∈ R^{n×n} is upper triangular, and P ∈ Z^{n×n} is a permutation matrix. The
orthogonal matrix Q is the product of n
-
Householder transformations:
Q^T = H_n ⋯ H_2 H_1,   (3.2)
H_i = I_m − τ_i u_i u_i^T,  i = 1, 2, …, n,   (3.3)

where τ_i = 2/(u_i^T u_i), u_i = [0; ū_i] ∈ R^m, ū_i ∈ R^{m−i+1} is a Householder vector, and
H_i ∈ R^{m×m} is the Householder transformation matrix which zeros B(i+1 : m, i).
The permutation matrix P is the product of n permutations:

P = P_1 P_2 ⋯ P_n,

where P_i (i = 1, 2, …, n) is the permutation matrix which interchanges the i-th
column with another column in B(1 : m, i : n) such that the 2-norm of B(i : m, i) is
minimum.
In order to explain the block QR implementation, we define B^{(i)} as the value of
B after i Householder transformations and i permutations, i.e.,

B^{(i)} = H_i ⋯ H_2 H_1 B P_1 P_2 ⋯ P_i,   (3.4)

with B^{(0)} = B. And we define B̃^{(i)} as B with only the i permutations applied, i.e.,

B̃^{(i)} = B P_1 P_2 ⋯ P_i.   (3.5)

Here we want to point out that B̃^{(i)} will not be formed in the i-th step of the block
algorithm; it is used only for explanations of the algorithm.
-
The storage efficient WY representation [38] for the product of i Householder
transformations has the following form:

H_i ⋯ H_2 H_1 = (I_m − τ_i u_i u_i^T) ⋯ (I_m − τ_1 u_1 u_1^T) = I_m − Y_i T_i Y_i^T,   (3.6)

where

Y_i = [u_1, u_2, …, u_i] ∈ R^{m×i}   (3.7)

is lower trapezoidal, and T_i ∈ R^{i×i} is lower triangular, given by the following recursion
formula:

T_i = [T_{i−1}, 0; h_i^T, τ_i],   h_i^T = −τ_i u_i^T Y_{i−1} T_{i−1} ∈ R^{1×(i−1)},

with the base case T_1 = τ_1.
Substituting Eq.(3.5) and Eq.(3.6) into Eq.(3.4), B^{(i)} can be expressed as

B^{(i)} = (I_m − Y_i T_i Y_i^T) B̃^{(i)} = B̃^{(i)} − Y_i F_i^T,   (3.8)

where

F_i^T = T_i Y_i^T B̃^{(i)} ∈ R^{i×n}.   (3.9)

It is easy to show that F_i^T can be computed by the recursion

F_1^T = τ_1 u_1^T B̃^{(1)},
F_i^T = [F_{i−1}^T P_i;  τ_i u_i^T B̃^{(i)} − τ_i u_i^T Y_{i−1} F_{i−1}^T P_i].   (3.10)
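The recursion for T_i can be exercised directly (our sketch; us holds the Householder vectors u_t already padded to length m, and taus the scalars τ_t):

```python
import numpy as np

def wy_accumulate(us, taus):
    """Build Y and lower triangular T with H_i ... H_2 H_1 = I - Y @ T @ Y.T
    (Eq.(3.6)), where H_t = I - taus[t] * np.outer(us[t], us[t])."""
    Y = us[0].reshape(-1, 1)
    T = np.array([[taus[0]]])                  # base case T_1 = tau_1
    for u, tau in zip(us[1:], taus[1:]):
        u = u.reshape(-1, 1)
        h = -tau * (u.T @ Y) @ T               # h^T = -tau_i u_i^T Y_{i-1} T_{i-1}
        T = np.block([[T, np.zeros((T.shape[0], 1))],
                      [h, np.array([[tau]])]])
        Y = np.hstack([Y, u])
    return Y, T
```

One can check, for randomly generated u_t and the corresponding τ_t = 2/(u_t^T u_t), that np.eye(m) - Y @ T @ Y.T reproduces the explicit product of the H_t.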
-
The block Householder QR factorization algorithm partitions the matrix B ∈
R^{m×n} into d blocks of size m × k (for simplicity we assume n = dk). The algo-
rithm deals with the blocks sequentially from left to right. Inside a block, k House-
holder transformations are performed for upper-triangularization, and are accumu-
lated into a single block transformation using the WY representation in Eq.(3.6).
Then the block transformation is applied to the other blocks of B by matrix-matrix
multiplication. Next we show how the block algorithm works.
In the first step, we first compute the squared column norms of B, denoted by l:

l_j := ‖B(1 :m, j)‖²,  j = 1, 2, …, n.

Utilizing l, a column in B with minimum 2-norm is permuted with the first column
by the permutation matrix P_1 (actually P_1 is not formed explicitly). Then we use
the Householder transformation H_1 to zero B(2 : m, 1). At this moment, unlike
Algorithm 2.2, we do not apply H_1 to the other columns of B. However, the first row of
B must be updated in order to downdate the squared column norms:

l_j := l_j − B(1, j)²,  j = 2, …, n,   (3.11)

which will be used in the next step for minimum column pivoting. In order to
update the first row, we form the following matrices (actually they are vectors)
using Eq.(3.6) and Eq.(3.10):

Y_1 := u_1,   F_1^T(1, 2:n) := τ_1 u_1^T B(1 :m, 2:n).
-
Notice that B(1 :m, 2:n) as stored in memory is equal to B̃^{(1)}(1 :m, 2:n) given
in Eq.(3.10). From Eq.(3.8) the first row of B except the first entry is updated as
follows:

B(1, 2:n) := B(1, 2:n) − Y_1(1, 1) F_1^T(1, 2:n).

Then the squared column norms are downdated using Eq.(3.11). Thus at the end
of the first step, the first row and the first column have been updated, and the rest
of B will be updated later.
In the second step, utilizing the vector l of the squared column norms, we apply
P_2 to permute the second column of B with a column, say column p, 2 ≤ p ≤ n, such
that the 2-norm of B(2 :m, 2) is minimum, and we permute the second column of
F_1^T with its p-th column (i.e., F_1^T := F_1^T P_2). Then from Eq.(3.8) the second column
B(2 :m, 2) is updated by the first Householder transformation H_1:

B(2 :m, 2) := B(2 :m, 2) − Y_1(2 :m, 1) F_1^T(1, 2).

After this update, we apply the Householder transformation H_2 to zero B(3 : m, 2).
As in step 1, we do not use H_2 to update the remaining columns of B at this moment.
But we need to update the second row of B, because it will be used to compute the
2-norms of the columns of B(3 :m, 3:n). In order to perform the update, Y_2 and F_2
are formed by accumulating H_2 into Y_1 and F_1 using Eq.(3.6) and Eq.(3.10):

Y_2 := [Y_1, u_2],
F_2^T(1:2, 3:n) := [F_1^T(1, 3:n);  τ_2 u_2^T B(1 :m, 3:n) − τ_2 u_2^T Y_1 F_1^T(1, 3:n)].
-
Note that here F_1^T has been permuted by P_2. Then we update the second row of B
except the first two entries:

B(2, 3:n) := B(2, 3:n) − Y_2(2, 1:2) F_2^T(1:2, 3:n),

and compute the squared column norms of B(3 :m, 3:n):

l_j := l_j − B(2, j)²,  j = 3, …, n.

At the end of the second step, the first two rows and the first two columns have been
updated.
Now assume we are in the i-th step of transforming the first block of B into an
upper triangular matrix. The first i−1 columns of B have been triangularized and
the first i−1 rows have been updated, while the rest of the matrix B is waiting
to be updated. We first permute the i-th column with a column in B(1 : m, i : n)
such that the 2-norm of B(i :m, i) is minimum, and we permute the corresponding
columns of F_{i−1}^T (i.e., F_{i−1}^T := F_{i−1}^T P_i). Then we update the i-th column B(i :m, i)
by using the Householder transformations H_1, H_2, …, H_{i−1} as follows (see Eq.(3.8)):

B(i :m, i) := B(i :m, i) − Y_{i−1}(i :m, 1: i−1) F_{i−1}^T(1 : i−1, i).
-
Then the Householder transformation H_i is used to zero B(i + 1 : m, i), and is
accumulated into Y_i and F_i:

Y_i := [Y_{i−1}, u_i],
F_i^T(1 : i, i+1:n) := [F_{i−1}^T(1 : i−1, i+1:n);
                        τ_i u_i^T B(1 :m, i+1:n) − τ_i u_i^T Y_{i−1} F_{i−1}^T(1 : i−1, i+1:n)].

Then we update the i-th row B(i, i+1:n) and downdate the squared column norms:

B(i, i+1:n) := B(i, i+1:n) − Y_i(i, 1: i) F_i^T(1 : i, i+1:n),
l_j := l_j − B(i, j)²,  j = i+1, …, n.

Now the first i columns and rows of B have been updated.
As shown above, the block algorithm updates one row and one column in
each step. At the end of the k-th step, we update the rest of B by using the
accumulated first k Householder transformations as follows:

B(k+1:m, k+1:n) := B(k+1:m, k+1:n) − Y_k(k+1:m, 1:k) F_k^T(1 :k, k+1:n).

At this point, the first k columns of B (i.e., the first block of B) have been upper-
triangularized, and the other columns of B have been updated. Then we can apply the
same procedure to triangularize the second block of B and so on until the final upper
triangular matrix is obtained.
The algorithm of block QR factorization with minimum column pivoting is given
as follows.
-
Algorithm 3.1. (Block Householder QR Factorization with Minimum Column Piv-
oting) Suppose B ∈ R^{m×n} has full column rank, and k is the chosen block size, which
is assumed to be a factor of n for simplicity. This algorithm computes the QR
factorization Q_1 R = BP, where Q_1 has orthonormal columns and P is a permutation
matrix. Note that the matrix B is overwritten by R in the computation.
function: [R, P] = BQRMCP(B, k)
1: P := I_n, m′ := m, n′ := n
2: l_j := ‖B(1 : m, j)‖², j = 1 : n
3: for j = 1 : k : n do
4: Y(1 : m′, 1 : k) := 0, F(1 : n′, 1 : k) := 0
5: for j′ = 1 : k do
// Permutation
6: i := j + j′ − 1, q := arg min_{i≤p≤n} l_p
7: Interchange columns B(1 : m, i) and B(1 : m, q)
8: Interchange columns P(1 : n, i) and P(1 : n, q)
9: Interchange rows F(j′, 1 : k) and F(q − j + 1, 1 : k)
// Update the i-th column
10: B(i : m, i) := B(i : m, i) − Y(j′ : m′, 1 : j′−1) F(j′, 1 : j′−1)^T
11: Zero B(i+1 : m, i) by the Householder transformation H_i = I_m − τ_i u_i u_i^T
// Accumulation of the block transformation
12: Y(j′ : m′, j′) := u_i(i : m)
13: F(j′+1 : n′, j′) := τ_i B(1 : m, i+1 : n)^T u_i
14: F(1 : n′, j′) := F(1 : n′, j′) − τ_i F(1 : n′, 1 : j′−1) Y(j′ : m′, 1 : j′−1)^T u_i(i : m)
-
// Update the i-th row and downdate the norms
15: B(i, i+1 : n) := B(i, i+1 : n) − Y(j′, 1 : j′) F(j′+1 : n′, 1 : j′)^T
16: l(i+1 : n) := l(i+1 : n) − B(i, i+1 : n) .∗ B(i, i+1 : n)
17: end for
// Block transformation of the unprocessed part of the matrix
18: B(i+1 : m, i+1 : n) := B(i+1 : m, i+1 : n) − Y(j′+1 : m′, 1 : k) F(j′+1 : n′, 1 : k)^T
19: m′ := m′ − k, n′ := n′ − k
20: end for
21: R := B(1 : n, 1 : n)
Here we make a remark. In our implementation, we actually use an n-dimensional
vector to store the permutation matrix P; we do not form P explicitly, for
efficiency.
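In NumPy terms the remark amounts to the following:

```python
import numpy as np

B = np.random.randn(8, 6)
perm = np.arange(6)                 # the n-dimensional vector standing for P
perm[[1, 4]] = perm[[4, 1]]         # a column interchange, composed in O(1)
BP = B[:, perm]                     # B @ P without ever forming P
```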
3.1.2 Block Size-Reduction
The idea of block size-reduction is to accumulate several IGTs into a block
updating, so that the algorithm is rich in matrix-matrix operations. The size-reduction
of an upper triangular matrix R ∈ R^{n×n} can be described as

U = RZ,

where Z ∈ Z^{n×n} is a unimodular matrix and U ∈ R^{n×n} is size-reduced, i.e., |u_{ij}| ≤
(1/2)|u_{ii}| (1 ≤ i < j ≤ n). The size-reduction algorithm repeatedly applies IGTs to R;
Z is the product of a sequence of IGTs, which have the form I_n − ζ e_i e_j^T (1 ≤ i < j ≤ n).
25: if i > 2 then
26: i := i − 1
27: end if
28: else
-
29: i := i + 1
30: end if
31: end while
32: R_l := R_local
3.1.4 Block Partial Size-Reduction
A block partial size-reduction algorithm is designed to coordinate with the Local-
PLLL reduction algorithm. In BSR (Algorithm 3.2), all off-diagonal entries of the
upper triangular matrix are checked for IGTs. However, this is not the case for the
PLLL reduction, where the off-diagonal entries are reduced only when necessary.
More specifically, if an IGT is applied to a super-diagonal entry of R, other
IGTs are applied to the off-diagonal entries in the same column in order to prevent
producing large numbers which may cause numerical stability problems. Thus, only
the entries in the columns which are affected by IGTs in Local-PLLL need to be
reduced. Local-PLLL stores the information about the columns affected by IGTs in
a vector c, so the block partial size-reduction (BPSR) algorithm can reduce only those
marked columns by IGTs.
Given an upper triangular matrix R ∈ R^{n×n} which consists of d × d blocks (here
we do not assume that each block has the same size):

R = [R_{11}, …, R_{1d};  ⋱, ⋮;  R_{dd}],  with blocks R_{ij}, 1 ≤ i ≤ j ≤ d.
-
It has sub-matrices

R̂ = [R_{11}, …, R_{1,ī−1};  ⋱, ⋮;  R_{ī−1,ī−1}],
R̄ = [R_{1ī}; R_{2ī}; ⋮; R_{ī−1,ī}],  1 < ī ≤ d,   (3.16)

where R̄ has k̄ columns, R_{i,ī} with 1 < i ≤ ī has k rows, and R_{1,ī} may have either k
or k/2 rows.
Given a vector c ∈ Z^{k̄} whose entries are either one or zero: for j = 1 : k̄, if
c_j = 1 we perform size-reductions on column j of R̄ by applying IGTs to it, which
involve R̂; if c_j = 0 we do nothing. After this, part of the entries of R̄ are size-reduced
according to c:

R̄ := R̄ + R̂ Z̄,

where Z̄, which is formed by those IGTs, has the same dimensions and block partition
as R̄:

Z̄ = [Z_{1ī}; Z_{2ī}; ⋮; Z_{ī−1,ī}],   (3.17)

and [I, Z̄; 0, I] is unimodular.
The BPSR algorithm is given as follows.
Algorithm 3.4. (Block Partial Size-Reduction) Given the two sub-matrices R̂, R̄ in
Eq.(3.16) and a vector c, this algorithm size-reduces the columns of R̄: R̄ := R̄ + R̂ Z̄,
-
where Z̄ has the block partition in Eq.(3.17). We use A_{i1:i2,j} to denote the sub-matrix
formed by block rows i1 to i2 in the j-th block column of A.

function: [R̄, Z̄] = BPSR(R̂, R̄, c)
1: for i = ī−1 : −1 : 1 do
// Partial size-reduction of R_{i,ī} by Z_{i,ī} involving R_{i,i}
2: for j = 1 : k̄ do
3: if c_j = 1 then
4: Size-reduce R_{i,ī}(:, j): R_{i,ī}(:, j) := R_{i,ī}(:, j) + R_{i,i} Z_{i,ī}(:, j)
5: end if
6: end for
7: Update R_{1:i−1,ī}: R_{1:i−1,ī} := R_{1:i−1,ī} + R_{1:i−1,i} Z_{i,ī}
8: end for
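A NumPy sketch of Algorithm 3.4 (ours; 0-based, with the block column stored inside the full array R, row_blocks a top-to-bottom list of (start, end) row ranges partitioning the rows above col0, and c the 0/1 mark vector produced by Local-PLLL):

```python
import numpy as np

def bpsr(R, col0, col1, row_blocks, c):
    """Size-reduce the marked columns of the block column R[:col0, col0:col1]
    block row by block row, bottom-up; the IGTs of each block row are
    accumulated in Zi and applied to all rows above by one matrix product."""
    Zbar = np.zeros((col0, col1 - col0))
    for (r0, r1) in reversed(row_blocks):
        Zi = np.zeros((r1 - r0, col1 - col0))
        for j in range(col1 - col0):
            if not c[j]:
                continue
            for i in range(r1 - 1, r0 - 1, -1):      # IGTs against R_{i,i}
                zeta = np.rint(R[i, col0 + j] / R[i, i])
                if zeta != 0:
                    R[r0:i + 1, col0 + j] -= zeta * R[r0:i + 1, i]
                    Zi[i - r0, j] = -zeta
        # level-3 update of everything above this block row (line 7)
        R[:r0, col0:col1] += R[:r0, r0:r1] @ Zi
        Zbar[r0:r1, :] = Zi
    return Zbar
```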
3.2 Left-to-Right Block LLL Reduction Algorithm
In this section, we present a left-to-right block LLL (LRBLLL) reduction algo-
rithm utilizing the subroutines introduced in the previous section, i.e., the block QR
factorization (Algorithm 3.1), the block size-reduction (Algorithm 3.2), the Local-
PLLL reduction algorithm (Algorithm 3.3) and the block partial size-reduction (Al-
gorithm 3.4). The complexity analysis of LRBLLL is presented in the second part
of this section.
3.2.1 Partition and Block Operation
The left-to-right block LLL reduction algorithm combines the blocking technique
with the PLLL algorithm. It consists of the following 7 steps.
-
Step 1. Compute the block QR factorization (Algorithm 3.1) of the full column
rank matrix B ∈ R^{m×n} with minimum column pivoting: BP = Q_1 R.
Step 2. Partition the matrix R into d × d blocks with block size k (here for simplicity,
we assume that n is a multiple of k, i.e., n = dk, and that d is even, and define k̄ = 2k,
d̄ = d/2):

R = [R_{11}, …, R_{1d};  ⋱, ⋮;  R_{dd}] ∈ R^{n×n},  R_{ij} ∈ R^{k×k},  1 ≤ i ≤ j ≤ d.

Initialize a block index i = 1.
Step 3. Compute the Local-PLLL reduction (Algorithm 3.3) of

R_local = [R_{ii}, R_{i,i+1};  0, R_{i+1,i+1}]:   R_local := Q_local^T R_local Z_local.

Step 4. Update the blocks of R relevant to R_local using the block transformations:

R_right := Q_local^T R_right,   R_up := R_up Z_local,

where

R_right = [R_{i,i+2}, R_{i,i+3}, …, R_{i,d};  R_{i+1,i+2}, R_{i+1,i+3}, …, R_{i+1,d}],
R_up = [R_{1,i}, R_{1,i+1};  R_{2,i}, R_{2,i+1};  ⋮;  R_{i−1,i}, R_{i−1,i+1}].
-
Step 5. Size-reduce R_up using the block partial size-reduction algorithm (Al-
gorithm 3.4):

R_up := R_up + [R_{11}, …, R_{1,i−1};  ⋱, ⋮;  R_{i−1,i−1}] Z_update.

Step 6. Set γ := r_{(i−1)k,(i−1)k+1} − ⌊r_{(i−1)k,(i−1)k+1}/r_{(i−1)k,(i−1)k}⌉ r_{(i−1)k,(i−1)k}.
Check whether the Lovász condition δ r²_{(i−1)k,(i−1)k} ≤ γ² + r²_{(i−1)k+1,(i−1)k+1} holds for the
first column of R_local and the column before it in R.
If i = 1 or the Lovász condition holds, set i := i + 1.
Else if i ≠ 1 and the Lovász condition does not hold, set i := i − 1.
If i < d, go to step 3; else, go to step 7.
Step 7. Apply the block size-reduction (Algorithm 3.2) to the whole matrix R, and
stop the algorithm.
In Section 3.1.3, we stated that the first k columns of R_local may be PLLL
reduced before applying Local-PLLL in step 3. It is easy to check from the algorithm
that the first k columns of R_local are PLLL reduced except in the first call of Local-
PLLL in step 3.
The left-to-right block LLL reduction algorithm is given as follows.
Algorithm 3.5. (Left-to-Right Block LLL Reduction) Given a full column rank ma-
trix B ∈ R^{m×n} and a block size k which is even, this algorithm computes the LLL
reduction B = Q_1 R Z^{−1}, where Q_1 has orthonormal columns, R is upper tri-
angular and LLL reduced, and Z is unimodular. In the algorithm, we assume Z
-
is partitioned into blocks in the same way as R. We use A_{i1:i2,j1:j2} to denote the
sub-matrix formed by block rows i1 to i2 and block columns j1 to j2 of A.
function: [R,Z] = LRBLLL(B, k)
// Compute the block QR factorization using Algorithm 3.1
1: [R,Z] = BQRMCP (B, k)
2: i := 1, k := k/2, d := n/k, f := 0
3: while i < d do
// Local-PLLL reduction of R_{i:i+1,i:i+1} using Algorithm 3.3
4: [Q̄, R_{i:i+1,i:i+1}, Z̄, r] = Local-PLLL(R_{i:i+1,i:i+1}, f)
5: f := 1
6: if Z̄ = I then
// The diagonal block is unchanged. The algorithm moves ahead.
7: i := i+ 1
8: Continue
9: end if
// Block updating
10: Z_{1:d,i:i+1} := Z_{1:d,i:i+1} Z̄
11: R_{1:i−1,i:i+1} := R_{1:i−1,i:i+1} Z̄
12: R_{i:i+1,i+2:d} := Q̄^T R_{i:i+1,i+2:d}
// Size-reduce the corresponding columns of R_{1:i−1,i:i+1} using Algorithm 3.4
13: [R_{1:i−1,i:i+1}, Ẑ] = BPSR(R_{1:i−1,1:i−1}, R_{1:i−1,i:i+1}, r)
14: Z_{1:d,i:i+1} := Z_{1:d,i:i+1} + Z_{1:d,1:i−1} Ẑ
// Check the Lovász condition, then move forward or backward
-
15: ζ := ⌊R((i−1)k, (i−1)k+1)/R((i−1)k, (i−1)k)⌉
16: γ := R((i−1)k, (i−1)k+1) − ζ R((i−1)k, (i−1)k)
// δ is a parameter chosen in (1/4, 1)
17: if δ R((i−1)k, (i−1)k)² ≤ γ² + R((i−1)k+1, (i−1)k+1)² or i = 1 then
18: i := i+ 1
19: else
20: i := i − 1
21: end if
22: end while
// Size-reduce R using Algorithm 3.2
23: [R, Z̃] = BSR(R)
24: Z := Z Z̃
Notice that if the Local-PLLL output Z̄ is an identity matrix, we do not apply the block
updating and BPSR to the relevant blocks, for efficiency. Also notice that if the matrix
dimension n is not a multiple of the block size k, the algorithm still works, by simply
changing the block size of the last column blocks to fit the matrix dimension. At the
end of each while loop the first ik columns of R are PLLL reduced. The while loop
ends when i = d. Then all n = dk columns of R are PLLL reduced, and the
matrix R is size-reduced after the final size-reduction. Thus the LRBLLL algorithm
outputs a basis matrix which is LLL reduced.
3.2.2 Complexity Analysis
In the LRBLLL algorithm, the column permutation operations are executed in
the Local-PLLL subroutine. Since LRBLLL uses the same permutation criterion as
-
LLL (Algorithm 2.1), Lemma 2.1 can also be applied to LRBLLL. As in Section 2.2.3,
we define β = max_j ‖b_j‖ and λ = min_{x∈Z^n\{0}} ‖Bx‖. Thus the LRBLLL algorithm
has at most O(n³ + n² log_{1/δ}(β/λ)) permutations, and the algorithm converges. During
the procedure of LRBLLL, the permutation operations are performed inside the
Local-PLLL subroutine. In the following part, we would like to obtain an upper
bound on the number of calls of Local-PLLL.
In the while loop of LRBLLL, Local-PLLL reductions of diagonal sub-
matrices of R are called. In each loop, the PLLL reduction of a diagonal sub-matrix is per-
formed, and the diagonal sub-matrix on which the PLLL reduction will be performed
in the next loop is selected in the current loop. From step 3 of LRBLLL, the diag-
onal sub-matrix R_local contains 2 diagonal blocks R_{i,i} and R_{i+1,i+1}. And R_local may
move one diagonal block forward or backward at the end of each loop, according to
whether the Lovász condition holds for columns (i−1)k and (i−1)k+1, see step
6 of LRBLLL described in Section 3.2.1. The matrix R, which is divided into d × d
blocks, has d diagonal blocks. In the first call of Local-PLLL, R_local contains the first
two diagonal blocks R_{1,1} and R_{2,2}, and the block index i equals 1; while in the
last call of Local-PLLL, R_local contains the last two diagonal blocks R_{d−1,d−1} and
R_{d,d}, and the block index i equals d−1. Only d−1 loops are needed for i to move
forward to i = d−1 from i = 1, if there are no backward moves. Actually there may
be some backward moves, say s of them, and the number of forward moves then increases
by an extra s. Thus the total number of moves of R_local is 2s + d − 1, which
equals 2s + 2d̄ − 1.
-
The remaining problem is to determine an upper bound on s, the number of times
the block index i moves backward during the execution of LRBLLL. Assume that in a
loop other than the first one, the Lovász condition does not hold for columns (i−1)k
and (i−1)k+1, so the algorithm moves one block back and the block index i is
decreased by one. However, at the beginning of this loop the Lovász condition held
for columns (i−1)k and (i−1)k+1. Then the Local-PLLL subroutine of LRBLLL
must have modified column (i−1)k+1 of R. To modify column (i−1)k+1,
which is the first column of the current R_local, Local-PLLL must perform at least
k permutations: since the subroutine Local-PLLL starts with column k+1 of R_local
(see Section 3.1.3), it takes at least k permutations to get back to the first column
from column k+1. Thus if the block index i is decreased in a loop, there are at
least k permutations taking place in Local-PLLL in this loop. Assume there are
p permutations involved in LRBLLL before convergence. Then s, i.e., the number of
loops in which i is decreased, is bounded above by p/k, which equals (d/n)p.
The cost of LRBLLL is then obtained as follows. The QR factorization with
minimum column pivoting takes O(mn²) arithmetic operations [16, Section 5.2]. In Local-PLLL
a permutation causes at most O(k²) arithmetic operations for the subsequent updating
and size-reduction. In each loop after Local-PLLL is called, the block updating of
R takes O(nk²) operations. The subroutine BPSR takes O(n²k) operations in the worst
case in each loop. And the block size-reduction subroutine at the end of the algorithm
takes O(n³) operations. From the above, there are p permutations and 2s + 2d̄ − 1 loops.
The cost of LRBLLL is

C_LRBLLL = O(mn²) + p · O(k²) + (2s + 2d̄ − 1) · O(n²k + nk²) + O(n³).
-
Notice that p is bounded above by $O(n^3 + n^2 \log_{1/\delta}(\alpha/\beta))$, so s is bounded above by $O(dn^2 + dn \log_{1/\delta}(\alpha/\beta))$. The total cost of LRBLLL is therefore bounded above by $O(mn^2 + n^5 + n^4 \log_{1/\delta}(\alpha/\beta))$. This bound is the same as the bounds of LLL and PLLL.
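To see how the total bound follows, one can substitute the bounds on p and s into the cost expression; a sketch of the arithmetic, using k = n/d so that $n^2k + nk^2 \le 2n^3/d$:

$$(2s + d - 1)\cdot O(n^2k + nk^2) = O\big(dn^2 + dn\log_{1/\delta}(\alpha/\beta)\big)\cdot O(n^3/d) = O\big(n^5 + n^4\log_{1/\delta}(\alpha/\beta)\big),$$

while $p \cdot O(k^2) \le O\big(n^3k^2 + n^2k^2\log_{1/\delta}(\alpha/\beta)\big) \le O\big(n^5 + n^4\log_{1/\delta}(\alpha/\beta)\big)$ since $k \le n$. Adding the $O(mn^2)$ QR cost and the $O(n^3)$ final size-reduction gives the stated total bound.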
Table 3.1 lists the costs of the important processes and the total cost of LRBLLL.
Table 3.1: Complexity analysis of the LRBLLL reduction algorithm

Process | Bound
Cost of QR factorization | $O(mn^2)$
Cost of one permutation in Local-PLLL | $O(k^2)$
Cost of block updating in one loop | $O(nk^2)$
Cost of size-reduction in one loop | $O(n^2k)$
Cost of final block size-reduction | $O(n^3)$
Number of permutations p | $O(n^3 + n^2\log_{1/\delta}(\alpha/\beta))$
Number of loops 2s + d - 1 | $O(dn^2 + dn\log_{1/\delta}(\alpha/\beta))$
Total cost of the algorithm | $O(mn^2 + n^5 + n^4\log_{1/\delta}(\alpha/\beta))$
3.3 Alternating Partition Block LLL Reduction Algorithm

In this section we propose an alternating partition block LLL (APBLLL) reduction algorithm, which is easier to parallelize. The complexity analysis of APBLLL is also given.

3.3.1 Partition and Block Operation

The LRBLLL algorithm essentially mimics PLLL: it works on the matrix from left to right, and may move forward or backward during the procedure. The new alternating partition block LLL reduction algorithm does not sweep forward and backward in this way; instead, it repeatedly reduces all the diagonal blocks of two alternating partitions of the matrix.
[Figure 3.1: Partition 1 of matrix R. For d = 4, R is partitioned into 4 × 4 blocks R_{ij}, each of size k × k.]

[Figure 3.2: Partition 2 of matrix R. The same R is repartitioned into 3 × 3 blocks, with diagonal block sizes 1.5k, k and 1.5k.]
We first perform BQRMCP on $B \in \mathbb{R}^{m \times n}$ (see Algorithm 3.1):

$$BP = Q_1 R,$$

where $Q_1 \in \mathbb{R}^{m \times n}$ has orthonormal columns, $R \in \mathbb{R}^{n \times n}$ is upper triangular, and $P \in \mathbb{Z}^{n \times n}$ is a permutation matrix.
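For illustration, the following MATLAB sketch performs an unblocked QR factorization with minimum column pivoting; the function name qrmcp_sketch is ours, and the thesis's BQRMCP (Algorithm 3.1) should be understood as a blocked variant of this idea rather than as this exact code.

    function [Q, R, P] = qrmcp_sketch(B)
    % Unblocked QR factorization with minimum column pivoting (illustrative).
    % At step j, the remaining column whose trailing part has the smallest
    % 2-norm is swapped to position j and then eliminated by a Householder
    % reflection, so that B*P = Q*R on return.
    [m, n] = size(B);
    Q = eye(m); R = B; P = eye(n);
    for j = 1:n
        trailing = R(j:m, j:n);
        norms = sqrt(sum(trailing.^2, 1));   % 2-norms of the trailing columns
        [~, t] = min(norms); t = t + j - 1;
        R(:, [j t]) = R(:, [t j]);           % bring the minimum column to front
        P(:, [j t]) = P(:, [t j]);
        x = R(j:m, j);                       % Householder vector for column j
        v = x;
        if x(1) >= 0, v(1) = v(1) + norm(x); else, v(1) = v(1) - norm(x); end
        if norm(v) > 0
            v = v / norm(v);
            R(j:m, j:n) = R(j:m, j:n) - 2 * v * (v' * R(j:m, j:n));
            Q(:, j:m)   = Q(:, j:m) - 2 * (Q(:, j:m) * v) * v';
        end
    end
    end

Afterwards norm(B*P - Q*R, 'fro') is at the level of the unit round-off, and the pivoting tends to order abs(diag(R)) increasingly, which is the ordering that benefits the subsequent reduction.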
Next we use an example to show how APBLLL works iteratively with the two alternating partitions shown in Figure 3.1 and Figure 3.2.

In the first iteration, R is partitioned into 4 × 4 blocks, each of size k × k (see Figure 3.1). This partition is referred to as partition 1 for convenience. Then we work on the blocks of partition 1. First we apply Local-PLLL (Algorithm 3.3) to R11, then we update R12, R13 and R14 by the Q generated by Local-PLLL. Second, we apply Local-PLLL to R22, then we update R23 and R24 by the Q generated by this Local-PLLL and update R12 by the Z also generated by this Local-PLLL, and then BPSR
(Algorithm 3.4) is applied to R12 to perform partial size-reduction. Third, we apply Local-PLLL to R33, update R34 by the Q generated by the current Local-PLLL and update R13 and R23 by the Z also generated by the current Local-PLLL, and then apply BPSR to the block column formed by R13 and R23. Fourth, we apply Local-PLLL to R44, update R14, R24 and R34 by the Z generated by the current Local-PLLL, and then apply BPSR to the block column formed by R14, R24 and R34. At this point the first iteration has finished, and all the diagonal blocks R11, R22, R33 and R44 are PLLL reduced.

In the second iteration, we repartition R into 3 × 3 blocks (see Figure 3.2); the block sizes are indicated in the figure. This repartition is referred to as partition 2. We do exactly the same for the blocks of partition 2 as we did in the first iteration. After the second iteration, the diagonal blocks R11, R22 and R33 of partition 2 are PLLL reduced.

In the following iterations, the same process is performed with partition 1 and partition 2 alternately, until no permutation takes place in an iteration. At that point it is easy to see that R is PLLL reduced. Then an extra block size-reduction (Algorithm 3.2) is applied to R; after this final size-reduction, R is LLL reduced and the algorithm ends.
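In matrix terms, the updates performed after reducing one diagonal block of partition 1 can be sketched in MATLAB as follows; this is illustrative only, with Qi and Zi standing for the Q and Z returned by Local-PLLL for block i and set to identities here so the fragment runs on its own.

    % Update pattern after reducing diagonal block i of partition 1
    n = 8; k = 2; i = 2;
    R = triu(randn(n)); Z = eye(n);
    Qi = eye(k); Zi = eye(k);                        % stand-ins for Local-PLLL output
    cols = (i-1)*k+1 : i*k;                          % columns of block i
    R(cols, i*k+1:n)   = Qi' * R(cols, i*k+1:n);     % blocks to the right of R_{ii}
    R(1:(i-1)*k, cols) = R(1:(i-1)*k, cols) * Zi;    % blocks above R_{ii}
    Z(:, cols)         = Z(:, cols) * Zi;            % accumulate the unimodular factor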
The two alternating partitions of R for the general case are given as follows. Assume the block size is k and n = dk. Partition 1 partitions R into d × d blocks:

$$R = \begin{bmatrix} R_{11} & \cdots & R_{1d} \\ & \ddots & \vdots \\ & & R_{dd} \end{bmatrix} \in \mathbb{R}^{n \times n}, \qquad R_{ij} \in \mathbb{R}^{k \times k}, \quad 1 \le i \le j \le d.$$
And partition 2 partitions R into (d-1) × (d-1) blocks:

$$R = \begin{bmatrix} R_{11} & \cdots & R_{1,d-1} \\ & \ddots & \vdots \\ & & R_{d-1,d-1} \end{bmatrix} \in \mathbb{R}^{n \times n},$$

where $R_{11}, R_{1,d-1}, R_{d-1,d-1} \in \mathbb{R}^{1.5k \times 1.5k}$, $R_{1,v} \in \mathbb{R}^{1.5k \times k}$, $R_{u,d-1} \in \mathbb{R}^{k \times 1.5k}$, and $R_{u,v} \in \mathbb{R}^{k \times k}$ for $1 < u \le v < d-1$.
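The block boundaries of the two partitions are easy to compute; the following MATLAB helper (our own naming, assuming d >= 3 and k even for partition 2) returns the first and last column indices of each diagonal block:

    function blocks = partition_bounds(n, k, which)
    % Index ranges of the diagonal blocks of R for the two partitions.
    % which = 1: d blocks of size k (partition 1), where d = n/k.
    % which = 2: d-1 blocks of sizes 1.5k, k, ..., k, 1.5k (partition 2);
    %            this assumes d >= 3 and that k is even.
    d = n / k;
    if which == 1
        sizes = repmat(k, 1, d);
    else
        sizes = [1.5*k, repmat(k, 1, d-3), 1.5*k];
    end
    last  = cumsum(sizes);
    first = [1, last(1:end-1) + 1];
    blocks = [first(:), last(:)];   % row i is [first, last] of block i
    end

For example, partition_bounds(128, 32, 2) returns the ranges 1:48, 49:80 and 81:128, matching Figure 3.2.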
The alternating partition block LLL reduction algorithm is given as follows.

Algorithm 3.6. (Alternating Partition Block LLL Reduction) Given a full column rank matrix $B \in \mathbb{R}^{m \times n}$ and a block size k (assume n is a multiple of k, i.e., n = dk), this algorithm computes the LLL reduction $B = Q_1 R Z^{-1}$, where $Q_1$ has orthonormal columns, R is upper triangular and LLL reduced, and Z is unimodular. In the algorithm, we assume Z is partitioned into blocks in the same way as R. We use $A_{i_1:i_2,\, j_1:j_2}$ to denote the sub-matrix of A formed by block rows $i_1$ to $i_2$ and block columns $j_1$ to $j_2$.
function: [R,Z] = APBLLL(B, k)
// Compute the block QR factorization using Algorithm 3.1
1: [R, Z] = BQRMCP(B, k)
2: d := n/k, f := 0
3: for i = 1 : d do
4: change_i := 1, nextChange_i := 1
5: end for
6: while (1) do
7: Partition R into blocks, using partition 1 and partition 2 alternately
8: for i = 1 : d do (for partition 2: i = 1 : d - 1; we assume partition 1 is used in the following description)
9: if change_i ≠ 1 then
10: continue
11: end if
// Apply Local-PLLL to the current diagonal block using Algorithm 3.3
12: [Q, R_{ii}, Z, r] = Local-PLLL(R_{ii}, f)
13: if Z = I then
// The diagonal block is unchanged, and updates are not needed
14: continue
15: end if
// Mark the neighboring diagonal blocks for re-examination in the next iteration
16: nextChange_{max(1,i-1)} := 1, nextChange_i := 1
// Block updating
17: Z_{1:d,i} := Z_{1:d,i} Z
18: R_{1:i-1,i} := R_{1:i-1,i} Z
19: R_{i,i+1:d} := Q^T R_{i,i+1:d}
// Size-reduce the corresponding columns of R_{1:i-1,i} using Algorithm 3.4
20: [R_{1:i-1,i}, Z̄] = BPSR(R_{1:i-1,i}, R_{1:i-1,1:i-1}, r)
21: Z_{1:d,i} := Z_{1:d,i} + Z_{1:d,1:i-1} Z̄
22: end for
23: if nextChange = 0 then
// Break when no permutation applied
24: break
25: end if
26: f := 1
27: for i = 1 : d do
28: change_i := nextChange_i, nextChange_i := 0
29: end for
30: end while
// Size-reduce R using Algorithm 3.2
31: [R, Z̄] = BSR(R)
32: Z := Z Z̄
Notice that the two vectors change and nextChange are used to track whether the diagonal blocks are already PLLL reduced in each iteration. If two diagonal blocks are unchanged in an iteration, then in the next iteration we do not apply Local-PLLL to the diagonal block whose diagonal entries come from those two unchanged diagonal blocks, since that diagonal block must also be PLLL reduced. Also notice that if the Local-PLLL output matrix Z is an identity matrix, we skip the block updating and BPSR of the relevant blocks for efficiency.
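A minimal MATLAB skeleton of this change/nextChange bookkeeping is sketched below; local_plll is a stub standing in for Algorithm 3.3 (it returns Z = I, so every block is reported unchanged and the loop terminates after one sweep), and the partition alternation and the update steps (lines 17-21 of Algorithm 3.6) are elided.

    n = 8; k = 2; d = n / k;
    local_plll = @(Rii) deal(eye(k), Rii, eye(k));   % stub: [Q, Rii, Z]
    R = triu(randn(n));
    change = true(1, d);
    while true
        nextChange = false(1, d);
        for i = 1:d                          % i = 1:d-1 when partition 2 is used
            if ~change(i), continue; end
            cols = (i-1)*k+1 : i*k;
            [Qi, R(cols, cols), Zi] = local_plll(R(cols, cols));
            if isequal(Zi, eye(k))
                continue;                    % block unchanged: skip all updates
            end
            nextChange(max(1, i-1)) = true;  % neighbours need re-examination
            nextChange(i) = true;
            % ... block updating and BPSR would go here ...
        end
        if ~any(nextChange), break; end      % no permutations: R is PLLL reduced
        change = nextChange;
    end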
3.3.2 Complexity Analysis
The APBLLL algorithm shares the same QR and final size-reduction parts as LRBLLL, so the costs of these two parts are the same as in LRBLLL: $O(mn^2)$ arithmetic operations for the QR factorization and $O(n^3)$ arithmetic operations for the final size-reduction. The cost of the remaining part of APBLLL
is divided into two parts: the cost of the subroutine Local-PLLL, and the cost outside Local-PLLL, i.e., the block updating and the block partial size-reductions. These two parts are estimated separately.

Since APBLLL uses the same permutation criterion as LLL (Algorithm 2.1), Lemma 2.1 also applies to APBLLL. Thus the total number of permutations p taking place in the Local-PLLL reductions is bounded above by $O(n^3 + n^2\log_{1/\delta}(\alpha/\beta))$. In Local-PLLL, a permutation causes at most $O(k^2)$ arithmetic operations for the subsequent updating and size-reductions. Thus, all the calls to the subroutine Local-PLLL cost at most $O(n^3k^2 + n^2k^2\log_{1/\delta}(\alpha/\beta))$ arithmetic operations.
In APBLLL, the block updating and BPSR (lines 17-21) are performed only if the output matrix Z of Local-PLLL is not the identity, i.e., only if some permutations took place during that execution of Local-PLLL. Because the total number of permutations is p, there are at most p calls to Local-PLLL that do not produce an identity Z, so in the worst case the block updating and BPSR are executed p times. Each execution causes at most $O(n^2k)$ arithmetic operations, so the total cost of the block updating and BPSR is $p \cdot O(n^2k)$ in the worst case.
From the above, the total cost of APBLLL is obtained by adding the costs of all the parts together:

$$C_{\mathrm{APBLLL}} = O(mn^2) + p \cdot O(k^2) + p \cdot O(n^2k) + O(n^3) = O\big(mn^2 + n^5k + n^4k\log_{1/\delta}(\alpha/\beta)\big).$$

This bound is larger than the bounds of LRBLLL, PLLL and LLL. However, the simulation results show that APBLLL performs better than LLL and PLLL, and similarly
to LRBLLL. The simulation results and analysis of the two block LLL reduction algorithms will be given in the next section.
Table 3.2 lists the costs of the important processes and the total cost of APBLLL.
Table 3.2: Complexity analysis of the APBLLL reduction algorithm

Process | Bound
Cost of QR factorization | $O(mn^2)$
Cost of one permutation in Local-PLLL | $O(k^2)$
Cost of block updating and size-reduction for one diagonal block | $O(n^2k)$
Cost of final block size-reduction | $O(n^3)$
Number of permutations p | $O(n^3 + n^2\log_{1/\delta}(\alpha/\beta))$
Total cost of the algorithm | $O(mn^2 + n^5k + n^4k\log_{1/\delta}(\alpha/\beta))$
3.4 Simulation Results and Comparison of Algorithms

The simulations are performed in MATLAB on two types of machines. One has MATLAB 7.12.0 on a 64-bit Ubuntu 11.10 system with 4 Intel Xeon(R) CPU W3530 2.8 GHz processors and 5 GB of memory. The other has MATLAB 7.13.0 on a 64-bit Red Hat 6.2 system with 64 AMD Opteron(TM) 2.2 GHz processors and 64 GB of memory. Our simulations use conventional MATLAB, not Parallel MATLAB. By default, MATLAB uses the IEEE double precision model for floating point arithmetic; the unit round-off for double precision is about $10^{-16}$. We compare four algorithms: the original LLL algorithm (Algorithm 2.1), the PLLL+ algorithm, the LRBLLL algorithm (Algorithm 3.5), and the APBLLL algorithm (Algorithm 3.6). The PLLL+ algorithm is the PLLL algorithm (Algorithm 2.3) with an extra size-reduction procedure to guarantee that the resulting matrix is size-reduced. All these
four algorithms produce LLL reduced matrices. We compare the CPU run time, the flops, and the relative backward errors

$$\frac{\|B - Q_c R_c Z_c^{-1}\|_F}{\|B\|_F}$$

of the four algorithms, where $Q_c$ is the computed orthogonal matrix, $R_c$ is the computed LLL reduced matrix, and $Z_c^{-1}$ is the unimodular matrix formed by the inverses of the computed permutation matrices and IGTs. The run time is measured in two separate parts, the run time for the QR factorization and the run time for the rest of each algorithm (for simplicity, we call this part the reduction), in order to observe how the blocking technique performs in each part.
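In MATLAB the relative backward error can be evaluated directly; in this sketch Qc, Rc and Zinv denote the computed $Q_c$, $R_c$ and $Z_c^{-1}$:

    % Relative backward error of a computed reduction B ~ Qc * Rc * Zinv
    relerr = norm(B - Qc * Rc * Zinv, 'fro') / norm(B, 'fro');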
In the simulation, we test three cases of matrices $B \in \mathbb{R}^{n \times n}$ with n = 100 : 50 : 1000. The square matrices B are generated as follows.

Case 1: B is generated by the MATLAB function randn, B = randn(n, n), i.e., each element follows the normal distribution N(0, 1).

Case 2: $B = USV^T$, where U and V are randomly generated orthogonal matrices and S is a diagonal matrix with

$$S(i,i) = 10^{4(i-1)/(n-1)}, \qquad i = 1, \ldots, n.$$

Case 3: $B = USV^T$, where U and V are randomly generated orthogonal matrices and S is a diagonal matrix with

$$S(i,i) = 1000, \quad i = 1, \ldots, \lfloor n/2 \rfloor, \qquad S(i,i) = 0.1, \quad i = \lfloor n/2 \rfloor + 1, \ldots, n.$$

Case 1 gives the most typical test matrices for numerical computations. Cases 2 and 3 are intended to show the reduction speed when the condition number is fixed at $10^4$.
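The three cases can be generated in MATLAB along the following lines (a sketch: the random orthogonal factors are obtained here from QR factorizations of Gaussian matrices, one common choice; the thesis does not spell out its exact recipe):

    n = 200;
    % Case 1: Gaussian matrix
    B1 = randn(n, n);
    % Random orthogonal factors for Cases 2 and 3
    [U, ~] = qr(randn(n, n));
    [V, ~] = qr(randn(n, n));
    % Case 2: geometrically graded singular values, condition number 10^4
    s2 = 10 .^ (4 * (0:n-1) / (n-1));
    B2 = U * diag(s2) * V';
    % Case 3: two-level singular values 1000 and 0.1, condition number 10^4
    s3 = [1000 * ones(1, floor(n/2)), 0.1 * ones(1, n - floor(n/2))];
    B3 = U * diag(s3) * V';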
Case 3 also shows that the block algorithms gain more efficiency in the reduction part when the reduction takes a long time to run.

For each dimension in all cases, we randomly generate 20 different matrices for the test. We run only 20 simulations per setting because LLL is too time consuming; the box plots show that the behavior of the algorithms is stable across runs, so 20 runs are enough for our simulation. For the block algorithms, the optimal block size may vary with the dimension of the matrix; in the simulation, a fixed block size of 32 is adopted for all dimensions for simplicity. In the average QR/reduction run time plots, the y-axis is the average run time (seconds) over the 20 matrices and the x-axis is the dimension. In the average flops plots, the y-axis is the average flops and the x-axis is the dimension. In the average relative backward error plots, the y-axis is the average relative backward error and the x-axis is the dimension.
In the simulation, we also test matrices with various condition numbers and give the results in the condition number plots. In these plots, the y-axis is the average QR/reduction run time, the average flops or the average relative backward error over 20 matrices of dimension 200 in Case 2, and the x-axis is the matrix condition number, ranging from $10^1$ to $10^6$. Box plots of the run times and relative backward errors of all three cases with dimension 200 are also drawn. In the box plots, the y-axis is either the algorithm run time or the relative backward error, and the x-axis indexes the four algorithms, i.e., LLL, PLLL+, LRBLLL and APBLLL.
The simulation results obtained on the Intel processors are shown in Figures 3.3, 3.4 and 3.5 for the overall performance in the three cases, in Figure 3.6 for Case
2 with different condition numbers, and in Figure 3.7 for the box plots of all the cases. The results obtained on the AMD processors are shown in Figures 3.8, 3.9 and 3.10, Figure 3.11, and Figure 3.12, respectively. For the overall performance in each case, we give six plots. The two plots in the first row are the average run time of the QR factorization and the average reduction run time of LLL, respectively; LLL runs much longer than the other three algorithms, so we put it in individual plots in order to compare the other three algorithms more easily. The two plots in the middle row are the average QR/reduction run times for PLLL+, LRBLLL and APBLLL. The two plots in the bottom row are the average flops and the average relative backward errors for LLL, PLLL+, LRBLLL and APBLLL. For Case 2 with different condition numbers, we also give six plots, ordered in the same way as the overall performance plots. For the box plot figure, we give six plots: the three plots in the left column are the algorithm run times for the three cases, and the three plots in the right column are the relative backward errors for the three cases.
From the simulation results, we can draw the following observations and conclusions.
1. Comparing the results between the two machines, Intel and AMD, we observe that the performance of the four algorithms is consistent across the two machines.
2. Comparing the run times of the different algorithms, we find that LLL is the slowest of the four. LRBLLL is about as fast as APBLLL, and both are faster than PLLL+ in all three cases. So on average the computational CPU times of the four algorithms satisfy LLL > PLLL+ > LRBLLL ≈ APBLLL.
7. In Figure 3.6 and Figure 3.11, the tests on matrices with various condition numbers show that the QR time is not affected by the condition number of the matrix, while the reduction time, the flops and the relative backward errors of the four algorithms increase as the condition number increases.
8. The box plots show that the behaviors of LLL, PLLL+, LRBLLL and APBLLL on the tests are stable across different simulation runs.
[Figure 3.3: Performance comparison for Case 1, Intel. Six panels versus dimension: QR run time and reduction run time for LLL (top row); QR run time and reduction run time for PLLL+, LRBLLL and APBLLL (middle row); flops and relative backward error for all four algorithms (bottom row).]
[Figure 3.4: Performance comparison for Case 2, Intel. Same six-panel layout as Figure 3.3.]
[Figure 3.5: Performance comparison for Case 3, Intel. Same six-panel layout as Figure 3.3.]
[Figure 3.6: Performance comparison for Case 2 with dimension 200, Intel. Six panels versus condition number (10^1 to 10^6), in the same layout as the overall performance figures.]
[Figure 3.7: Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, Intel.]
[Figure 3.8: Performance comparison for Case 1, AMD. Same six-panel layout as Figure 3.3.]
[Figure 3.9: Performance comparison for Case 2, AMD. Same six-panel layout as Figure 3.3.]
[Figure 3.10: Performance comparison for Case 3, AMD. Same six-panel layout as Figure 3.3.]
[Figure 3.11: Performance comparison for Case 2 with dimension 200, AMD. Same layout as Figure 3.6.]