
Two Floating Point Block LLL Reduction Algorithms

Yancheng Xiao

Master of Science

School of Computer Science
McGill University
Montreal, Quebec

September 2012

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Master of Science in Computer Science

© Yancheng Xiao 2012

DEDICATION

This document is dedicated to my beloved parents.

ACKNOWLEDGEMENTS

I have been indebted in my postgraduate study and research, and especially in the preparation of this thesis, to my supervisor Prof. Xiao-Wen Chang of the School of Computer Science at McGill University, whose academic guidance and financial support, given with patience and kindness, have been invaluable to me. I am grateful to Prof. Clark Verbrugge for kindly lending his AMD high-concurrency machine, which was useful in testing the performance of our block LLL reduction algorithms. I would like to thank all my lab mates in the Scientific Computing Lab of the School of Computer Science, Mazen Al Borno, Stephen Breen, Xi Chen, Sevan Hanssian, Wen-Yang Ku, Wanru Lin, Milena Scaccia, David Titley-Peloquin, Jinming Wen and Xiaohu Xie, for the pleasant collaboration during my study and research. Thanks also to all my friends and my boyfriend Bin Zhu for their various help with my study and life in Montreal.

ABSTRACT

The Lenstra, Lenstra and Lovász (LLL) reduction is the most popular lattice reduction and is a powerful tool for solving many complex problems in mathematics and computer science. The blocking technique casts matrix algorithms in terms of matrix-matrix operations to permit efficient reuse of data in the algorithms. In this thesis, we use the blocking technique to develop two floating point block LLL reduction algorithms, the left-to-right block LLL (LRBLLL) reduction algorithm and the alternating partition block LLL (APBLLL) reduction algorithm, and give the complexity analysis of these two algorithms. We compare these two block LLL reduction algorithms with the original LLL reduction algorithm (in floating point arithmetic) and the partial LLL (PLLL) reduction algorithm from the literature in terms of CPU run time, flops and relative backward error. The simulation results show that the overall CPU run times of the two block LLL reduction algorithms are shorter than that of the partial LLL reduction algorithm and much shorter than that of the original LLL, even though the two block algorithms cost more flops than the partial LLL reduction algorithm in some cases. The shortcoming of the two block algorithms is that they may sometimes not be as numerically stable as the original and partial LLL reduction algorithms. The parallelization of APBLLL is discussed.

ABRÉGÉ

The Lenstra, Lenstra and Lovász (LLL) reduction is the most popular lattice reduction, and it is a powerful tool for solving many complex problems in mathematics and computer science. The blocking technique recasts the algorithms in terms of matrix-matrix operations to permit efficient reuse of data in the block LLL algorithms. In this thesis, we use the blocking technique to develop two floating point block LLL reduction algorithms, the left-to-right block LLL reduction algorithm (LRBLLL) and the alternating partition block LLL reduction algorithm (APBLLL), and give the complexity analysis of these two algorithms. We compare these two block LLL reduction algorithms with the original LLL reduction algorithm (in floating point arithmetic) and the partial LLL (PLLL) reduction algorithm from the literature in terms of CPU run time, flops and relative backward errors. The simulation results show that the CPU run times of the two block LLL reduction algorithms are shorter than that of the partial LLL reduction algorithm and much shorter than that of the original LLL reduction, even though the two block algorithms cost more flops than the partial LLL reduction algorithm in certain cases. The drawback of these two block algorithms is that they may sometimes not be as numerically stable as the original and partial LLL reduction algorithms. The parallelization of APBLLL is discussed.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
ABRÉGÉ
LIST OF TABLES
LIST OF FIGURES

1 Introduction
    1.1 Lattice Reduction
    1.2 Contributions and Organization of the Thesis

2 Introduction to LLL Reduction Algorithms
    2.1 LLL Reduction
    2.2 Original LLL Reduction Algorithm
        2.2.1 Size-Reductions
        2.2.2 Permutations
        2.2.3 Complexity Analysis
    2.3 Partial LLL Reduction Algorithm
        2.3.1 Householder QR Factorization with Minimum Column Pivoting
        2.3.2 Partial Size-Reduction and Givens Rotation

3 Block LLL Reduction Algorithms
    3.1 Subroutines of Block LLL Reduction Algorithms
        3.1.1 Block Householder QR Factorization with Minimum Column Pivoting
        3.1.2 Block Size-Reduction
        3.1.3 Local Partial LLL Reduction
        3.1.4 Block Partial Size-Reduction
    3.2 Left-to-Right Block LLL Reduction Algorithm
        3.2.1 Partition and Block Operation
        3.2.2 Complexity Analysis
    3.3 Alternating Partition Block LLL Reduction Algorithm
        3.3.1 Partition and Block Operation
        3.3.2 Complexity Analysis
    3.4 Simulation Results and Comparison of Algorithms

4 Parallelization of Block LLL Reduction
    4.1 Parallel Methods for LLL Reduction
    4.2 A Parallel Block LLL Reduction Algorithm
        4.2.1 Parallel Diagonal Block Reduction and Block Updating
        4.2.2 Parallel Block Size-Reduction
    4.3 Performance Evaluation of Parallel Algorithm

5 Conclusion and Future Work

References

LIST OF TABLES

Table 3-1: Complexity analysis of the LRBLLL reduction algorithm
Table 3-2: Complexity analysis of the APBLLL reduction algorithm

LIST OF FIGURES

Figure 1-1: A lattice in 2 dimensions
Figure 3-1: Partition 1 of matrix R
Figure 3-2: Partition 2 of matrix R
Figure 3-3: Performance comparison for Case 1, Intel
Figure 3-4: Performance comparison for Case 2, Intel
Figure 3-5: Performance comparison for Case 3, Intel
Figure 3-6: Performance comparison for Case 2 with dimension 200, Intel
Figure 3-7: Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, Intel
Figure 3-8: Performance comparison for Case 1, AMD
Figure 3-9: Performance comparison for Case 2, AMD
Figure 3-10: Performance comparison for Case 3, AMD
Figure 3-11: Performance comparison for Case 2 with dimension 200, AMD
Figure 3-12: Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, AMD
Figure 4-1: Task allocation for three processors (P1, P2, P3)
Figure 4-2: Approximating Parallel Simulation

CHAPTER 1
Introduction

1.1 Lattice Reduction

A set $\mathcal{L}$ in the real vector space $\mathbb{R}^m$ is referred to as a lattice if there exists a set of linearly independent vectors $b_1, b_2, \ldots, b_n \in \mathbb{R}^m$ such that

$$\mathcal{L} = \sum_{j=1}^{n} \mathbb{Z} b_j = \Big\{ \sum_{j=1}^{n} z_j b_j \;\Big|\; z_j \in \mathbb{Z},\ 1 \le j \le n \Big\}.$$

The set $\{b_1, b_2, \ldots, b_n\}$ is a basis of the lattice $\mathcal{L}$. The dimension of the lattice is defined to be $n$. The matrix $B = [b_1, b_2, \ldots, b_n]$ is referred to as the lattice basis matrix which generates $\mathcal{L}$, also written as $\mathcal{L}(B)$.

Geometrically, a lattice can be viewed as the set of intersection points of an infinite grid, as shown in Figure 1-1. The lines of the grid need not be orthogonal to each other. The same lattice may have different bases. For example, in Figure 1-1, $\{b_1, b_2\}$ is a basis of the lattice, and $\{c_1, c_2\}$ is also a basis.

Figure 1-1: A lattice in 2 dimensions

Suppose that we have two basis matrices $B$ and $C$. If they generate the same lattice, $\mathcal{L}(B) = \mathcal{L}(C)$, we say that $B$ and $C$ are equivalent. Two basis matrices $B, C \in \mathbb{R}^{m \times n}$ are equivalent if and only if there exists a unimodular matrix $Z \in \mathbb{Z}^{n \times n}$ (i.e., an integer matrix with determinant $\det(Z) = \pm 1$) such that $C = BZ$; see [25, p. 4].
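To make the equivalence criterion concrete, here is a small NumPy illustration with made-up matrices (our example, not from the thesis):

```python
import numpy as np

# Two equivalent bases: C = B Z with Z unimodular (integer, det = +-1).
B = np.array([[2.0, 1.0],
              [0.0, 3.0]])          # a basis matrix of a lattice in R^2
Z = np.array([[1, 4],
              [0, 1]])              # integer matrix with det(Z) = 1
C = B @ Z                           # an equivalent basis of the same lattice

print(round(np.linalg.det(Z)))      # 1, so Z is unimodular
# Every point of L(C) lies in L(B): C z = B (Z z), and Z z is integral.
z = np.array([2, -1])
print(C @ z, B @ (Z @ z))           # the same lattice point twice
```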

Lattice basis reduction transforms a given lattice basis into a basis with short and nearly orthogonal basis vectors. There are several kinds of lattice reduction based on different criteria for the resulting basis, such as the Gaussian reduction [12, Chapter 6.1], the Minkowski reduction [26, 27], the Korkine and Zolotarev (KZ) reduction [21] and the Lenstra, Lenstra and Lovász (LLL) reduction [22].

Lattice reduction is a powerful tool for solving many complex problems in mathematics and computer science, especially problems dealing with integers, such as integer programming [1, 20], factoring polynomials with rational coefficients [22], integer factoring [34] and cryptography [15].

The LLL reduction is the most popular lattice reduction. The LLL reduction algorithm given in [22] and its variants have polynomial time complexity. It is widely used for applications such as factoring polynomials [22], subset sum problems [37], digital communications [23, 24, 28, 29, 39], shortest vector problems (SVP) [25] and closest vector problems (CVP), the latter also referred to as integer least squares (ILS) problems [2, 4, 9, 10, 17].

Generally, we can classify LLL reduction algorithms into three categories. The first category includes exact integer arithmetic LLL reduction algorithms, with both input and output bases being integral. For example, the original LLL algorithm given in [22] is in this category.

The second category includes algorithms such as those in [30, 35, 36], which use not only integer arithmetic but also floating point arithmetic. The input and output bases in this category are also integral. The reason for using floating point arithmetic is that integer arithmetic is expensive. These algorithms use sufficiently long floating point numbers to approximate the intermediate results, so that the rounding errors do not lead to an output basis which is not exactly LLL reduced.

The applications of the first and second categories include factoring polynomials [22], subset sum problems [37] and public-key cryptanalysis [15].

The third category includes floating point algorithms with both input and output bases being real. This category applies to cases where exact integer arithmetic is not required and where a nearly LLL reduced basis is acceptable, such as the ILS problems which arise in GPS, e.g., [9, 10, 11, 17, 43], and in multi-input multi-output (MIMO) communications, e.g., [24, 42]. An algorithm in this category therefore does not require the strict floating point error control of the algorithms in the second category, and it is much more efficient than those in categories one and two.

1.2 Contributions and Organization of the Thesis

The goal of this thesis is to propose efficient and reliable floating point algorithms for the LLL reduction of real basis matrices by using the blocking technique [14, Chapter 5]. The algorithms are based on the original LLL reduction algorithm [22] and the partial LLL (PLLL) reduction algorithm [43].

The computation speed of a matrix algorithm is determined not only by the number of floating point operations involved, but also by the amount of memory traffic, i.e., the movement of data between memory and registers. The level 3 basic linear algebra subprograms (BLAS) are designed to reduce this data movement. The matrix-matrix operations implemented in level 3 BLAS make efficient reuse of data residing in cache or local memory to avoid excessive data movement. The blocking technique casts the algorithms in terms of matrix-matrix operations to permit this efficient reuse of data.

Two block LLL reduction algorithms utilizing this blocking technique are proposed in this thesis, together with their complexity analysis. Numerical simulations compare the performance of our block algorithms, in terms of CPU time, flops and numerical stability, with the original LLL reduction algorithm and the PLLL reduction algorithm. On average the block algorithms are computationally faster than PLLL and LLL, although their numerical stability may need improvement in some cases.

The parallelization of one of the two block LLL reduction algorithms is discussed in two parts: the parallelization of the block size-reduction and the parallelization of the diagonal block reduction. Complexity analysis shows that the parallelized size-reduction part can obtain a speedup of $n_p$ in ideal cases, if $n_p$ processors are used. The improvement of the parallelized diagonal block reduction part is hard to observe from the complexity analysis, since the complexity bound is too pessimistic. A simple test is designed to examine the performance of the parallelized diagonal block reduction part. The test result shows that the parallelized diagonal block reduction part can obtain a speedup of 4.8 with 5 processors in the best situations.

The rest of the thesis is organized as follows. In Chapter 2, we first give the definition of the LLL reduction. Then a description of the original LLL reduction algorithm in matrix language is given, followed by its complexity analysis. In the last section of that chapter, we introduce the partial LLL (PLLL) reduction algorithm.

In Chapter 3, we first apply the blocking technique to the components of the PLLL algorithm, leading to block subroutines. Then two block LLL algorithms are proposed based on these block subroutines. We give the complexity analysis of the block algorithms under the assumption of exact arithmetic. Finally, simulation results are presented, compared and discussed.

In Chapter 4, we first review the literature on parallel LLL algorithms. Then we discuss the parallelization of one of our two block algorithms.

Chapter 5 gives conclusions and future work.

We now describe the notation used in the thesis. The sets of all real and integer $m \times n$ matrices are denoted by $\mathbb{R}^{m \times n}$ and $\mathbb{Z}^{m \times n}$, respectively, and the sets of real and integer $n$-vectors are denoted by $\mathbb{R}^n$ and $\mathbb{Z}^n$, respectively. Upper case letters are used to denote matrices and bold lower case letters are used to denote vectors. The identity matrix is denoted by $I$ and its $i$-th column is denoted by $e_i$. MATLAB notation is used to denote a sub-matrix. Specifically, if $A = (a_{ij}) \in \mathbb{R}^{m \times n}$, then $A(i, :)$ denotes the $i$-th row, $A(:, j)$ denotes the $j$-th column, and $A(i_1\!:\!i_2, j_1\!:\!j_2)$ denotes the sub-matrix formed by rows $i_1$ to $i_2$ and columns $j_1$ to $j_2$. For the $(i, j)$ element of $A$, sometimes we use $a_{ij}$ and sometimes we use $A(i, j)$. For a block matrix $A$, $A_{ij}$ denotes its $(i, j)$ block. For a scalar $z \in \mathbb{R}$, we use $\lfloor z \rceil$ to denote its nearest integer; if there is a tie, $\lfloor z \rceil$ denotes the integer with smaller magnitude. $\det(A)$ is the determinant of $A$. Unless specified otherwise, $\|\cdot\|$ stands for the 2-norm, i.e., $\|a\| = \sqrt{a^T a}$, and $\|\cdot\|_F$ stands for the Frobenius matrix norm, i.e., $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$.
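To illustrate the rounding notation, here is a small Python sketch of this tie-breaking rule (our code, not the thesis's; note that NumPy's rint breaks ties to the even integer instead):

```python
import numpy as np

def round_tie_to_smaller(z):
    """Nearest integer to z; on a tie, return the integer with the
    smaller magnitude, matching the thesis's nearest-integer notation."""
    f = np.floor(z)
    r = z - f                      # fractional part in [0, 1)
    if r > 0.5:
        return int(f) + 1
    if r < 0.5:
        return int(f)
    # exactly halfway between f and f+1: pick the smaller magnitude
    return int(f) if abs(f) < abs(f + 1) else int(f) + 1

print(round_tie_to_smaller(2.5), round_tie_to_smaller(-2.5))   # 2 -2
```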

CHAPTER 2
Introduction to LLL Reduction Algorithms

In this chapter we first give the definition of the Lenstra-Lenstra-Lovász (LLL) reduction. Then we introduce the original LLL reduction algorithm [22] and the partial LLL (PLLL) reduction algorithm [43], which will be the bases of our new LLL reduction algorithms presented in later chapters.

2.1 LLL Reduction

The LLL reduction introduced in [22] can be described as a QRZ matrix factorization:

$$B = Q \begin{bmatrix} R \\ 0 \end{bmatrix} Z^{-1} = Q_1 R Z^{-1},$$

where $B \in \mathbb{R}^{m \times n}$ is a given matrix with full column rank, $Q = [Q_1, Q_2] \in \mathbb{R}^{m \times m}$ with $Q_1 \in \mathbb{R}^{m \times n}$ and $Q_2 \in \mathbb{R}^{m \times (m-n)}$ is orthogonal, $Z \in \mathbb{Z}^{n \times n}$ is unimodular, and $R \in \mathbb{R}^{n \times n}$ is upper triangular and satisfies the two conditions

$$\Big|\frac{r_{ij}}{r_{ii}}\Big| \le \frac{1}{2}, \quad 1 \le i < j \le n, \quad (2.1)$$

$$\delta\, r_{i-1,i-1}^2 \le r_{ii}^2 + r_{i-1,i}^2, \quad 1 < i \le n, \quad (2.2)$$

with the parameter $\delta \in (1/4, 1)$. The conditions (2.1) and (2.2) are named the size-reduction condition and the Lovász condition, respectively. The matrix $BZ$ or the matrix $R$ is said to be LLL reduced.
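As a concrete reading of conditions (2.1) and (2.2), the following small NumPy check (a utility sketch of ours, not part of the thesis) tests whether an upper triangular $R$ is LLL reduced:

```python
import numpy as np

def is_lll_reduced(R, delta=0.75, tol=1e-12):
    """Check the size-reduction condition (2.1) and the Lovasz
    condition (2.2) for an upper triangular matrix R."""
    n = R.shape[1]
    for j in range(n):
        for i in range(j):
            if abs(R[i, j]) > 0.5 * abs(R[i, i]) + tol:               # Eq. (2.1)
                return False
    for i in range(1, n):
        if delta * R[i-1, i-1]**2 > R[i, i]**2 + R[i-1, i]**2 + tol:  # Eq. (2.2)
            return False
    return True
```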

The LLL reduction algorithm in [22] is the most well known lattice basis reduction algorithm with polynomial time complexity; it was originally designed for factoring polynomials with rational coefficients using integer arithmetic operations. Later, the LLL reduction widely extended its applications to number theory (see, e.g., [34, 37]), cryptography (see, e.g., [15, 25]), integer programming (see, e.g., [1, 20]), digital communications (see, e.g., [24]), and GPS (see, e.g., [11, 17]). Some of these applications do not require an exact integer LLL reduced basis, so floating point arithmetic is used to achieve better computational performance in such application areas. One example of a floating point LLL application is to compute a suboptimal solution (e.g., the Babai point [4]) or the optimal solution of an integer least squares (ILS) problem.

In the remaining part of this chapter, the original LLL reduction algorithm and the PLLL reduction algorithm are introduced, and we assume that they use floating point arithmetic.

2.2 Original LLL Reduction Algorithm

We will describe the original LLL reduction algorithm in matrix language (see [44, Algorithm 3.3.1] and [13, Algorithm 2.6.3]). The algorithm involves the Gram-Schmidt orthogonalization (GSO), integer Gauss transformations (IGTs), column permutations and orthogonal transformations. GSO is applied to find the QR factors $Q$ and $R$ of the given matrix $B$. The column permutations and IGTs produce the unimodular matrix $Z$.

In the original exact integer LLL reduction algorithm, a column scaled $Q$ and a row scaled $R$ with unit diagonal entries are computed by a variation of GSO to avoid square root computations. In the floating point LLL reduction algorithm in this thesis, the regular GSO is applied to $B$ and gives the compact form of the QR factorization:

$$B = Q_1 R,$$

where $Q_1 \in \mathbb{R}^{m \times n}$ has orthonormal columns, and $R \in \mathbb{R}^{n \times n}$ is upper triangular.

After the GSO of $B$, integer Gauss transformations, column permutations and GSO are used to transform $R$ into an LLL reduced basis. IGTs are used to perform size-reductions on the off-diagonal entries to achieve (2.1). The column permutations are used to order the columns to achieve (2.2). Since a column permutation destroys the upper triangular structure, GSO is used to recover the upper triangular structure.

2.2.1 Size-Reductions

An integer matrix is called an IGT or an integer Gauss matrix if it has the following form:

$$Z_{ij} = I_n - \zeta e_i e_j^T, \quad i \ne j,\ \zeta \text{ an integer}.$$

Applying $Z_{ij}$ to $R$ from the right gives

$$\bar{R} = R Z_{ij} = R - \zeta R e_i e_j^T.$$

Thus $\bar{R}$ is the same as $R$, except that $\bar{r}_{kj} = r_{kj} - \zeta r_{ki}$, $k = 1, \ldots, i$. By setting $\zeta = \lfloor r_{ij}/r_{ii} \rceil$, the nearest integer to $r_{ij}/r_{ii}$, we ensure $|\bar{r}_{ij}| \le |r_{ii}|/2$.
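For illustration, one IGT step in NumPy (a sketch of ours; note that np.rint breaks ties to even rather than to the smaller magnitude):

```python
import numpy as np

def apply_igt(R, Z, i, j):
    """Apply the IGT Z_ij = I - zeta * e_i e_j^T from the right, so that
    |r_ij| <= |r_ii| / 2 afterwards; Z accumulates the transformations."""
    zeta = np.rint(R[i, j] / R[i, i])
    if zeta != 0:
        R[:i + 1, j] -= zeta * R[:i + 1, i]   # only rows 1..i of column j change
        Z[:, j] -= int(zeta) * Z[:, i]
    return R, Z
```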

2.2.2 Permutations

The column permutations are applied to achieve (2.2). Suppose that the Lovász condition is not satisfied for $i = k$; then a permutation matrix $P_{k-1,k}$ is applied to interchange columns $k-1$ and $k$ of $R$. After the permutation, the upper triangular structure of $R$ is destroyed, so an orthogonal transformation $G_{k-1,k}$ using the GSO technique (see [22]) is performed to re-construct the upper triangular structure of $R$:

$$\bar{R} = G_{k-1,k} R P_{k-1,k},$$

where

$$G_{k-1,k} = \begin{bmatrix} I_{k-2} & & \\ & G & \\ & & I_{n-k} \end{bmatrix}, \quad G = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}, \quad c = \frac{r_{k-1,k}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}, \quad s = \frac{r_{kk}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}.$$

Only columns $k-1$, $k$ and rows $k-1$, $k$ of $R$ are changed by this permutation and orthogonalization process. The diagonal and super-diagonal entries of $R$ which are changed by the process become

$$\bar{r}_{k-1,k-1} = \sqrt{r_{k-1,k}^2 + r_{kk}^2}, \qquad \bar{r}_{k-1,k} = \frac{r_{k-1,k-1}\, r_{k-1,k}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}, \qquad \bar{r}_{k,k} = -\frac{r_{k-1,k-1}\, r_{kk}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}.$$

Thus, if $\delta\, r_{k-1,k-1}^2 > r_{kk}^2 + r_{k-1,k}^2$ with $\delta \in (1/4, 1)$, then the above operations guarantee $\bar{r}_{k-1,k-1}^2 < \delta\, r_{k-1,k-1}^2$.

Based on the above description of size-reductions and permutations, we now describe the procedure of the LLL reduction algorithm. The algorithm iterates through a sequence of stages to satisfy the LLL reduced conditions, and it works on the columns of $R$ from left to right. Define a column stage variable $k$ which indicates that the first $k-1$ columns of $R$ are LLL reduced at the current stage, i.e.,

$$\Big|\frac{r_{ij}}{r_{ii}}\Big| \le \frac{1}{2}, \quad 1 \le i < j \le k-1, \quad (2.3)$$

$$\delta\, r_{i-1,i-1}^2 \le r_{ii}^2 + r_{i-1,i}^2, \quad 1 < i \le k-1. \quad (2.4)$$

At the beginning, $k$ is set to 2. During the reduction procedure, the value of $k$ moves between 2 and $n+1$ and changes by 1 in each step. At stage $k$, the algorithm first uses an integer Gauss transformation to reduce $r_{k-1,k}$. Then it checks whether columns $k-1$ and $k$ need to be permuted according to the Lovász condition. If $\delta\, r_{k-1,k-1}^2 > r_{kk}^2 + r_{k-1,k}^2$, it performs the permutation, applies the corresponding orthogonal transformation, and moves back to stage $k-1$. Otherwise it reduces $r_{i,k}$ ($i = k-2, k-3, \ldots, 1$) by IGTs and moves to the next stage $k+1$. When $k$ reaches $n+1$, the conditions (2.1) and (2.2) are satisfied, the upper triangular matrix $R$ is LLL reduced, and the algorithm stops. The algorithm is given as follows.

Algorithm 2.1. (LLL Reduction) Suppose $B \in \mathbb{R}^{m \times n}$ has full column rank. This algorithm computes the LLL reduction $B = Q_1 R Z^{-1}$, where $Q_1$ has orthonormal columns, $R$ is upper triangular and satisfies the LLL reduced criteria, and $Z$ is unimodular.

function: [R, Z] = LLL(B)
1: apply GSO to obtain B = Q_1 R
2: k := 2, Z := I_n
3: while k ≤ n do
4:   if |r_{k-1,k}/r_{k-1,k-1}| > 1/2 then
       // reduce r_{k-1,k}
5:     ζ := ⌊r_{k-1,k}/r_{k-1,k-1}⌉
6:     Z(1:n, k) := Z(1:n, k) − ζ Z(1:n, k-1)
7:     R(1:k-1, k) := R(1:k-1, k) − ζ R(1:k-1, k-1)
8:   end if
     // δ is a parameter chosen in (1/4, 1)
9:   if δ r_{k-1,k-1}^2 > r_{kk}^2 + r_{k-1,k}^2 then
10:    interchange columns Z(1:n, k) and Z(1:n, k-1)
11:    interchange columns R(1:k, k) and R(1:k, k-1)
12:    triangularize R: R := G_{k-1,k} R
13:    if k > 2 then
14:      k := k − 1
15:    end if
16:  else
       // size-reduction
17:    for i = k-2 : -1 : 1 do
18:      ζ := ⌊r_{i,k}/r_{ii}⌉
19:      Z(1:n, k) := Z(1:n, k) − ζ Z(1:n, i)
20:      R(1:i, k) := R(1:i, k) − ζ R(1:i, i)
21:    end for
22:    k := k + 1
23:  end if
24: end while
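Putting the pieces together, here is a compact NumPy sketch of Algorithm 2.1. It is our illustration under two stated substitutions: numpy.linalg.qr stands in for GSO, and a Givens rotation (as in Section 2.3.2) stands in for the GSO-based re-triangularization:

```python
import numpy as np

def lll_reduce(B, delta=0.75):
    """Floating point LLL reduction sketch: returns (R, Z) with
    B Z = Q1 R, R upper triangular and LLL reduced."""
    B = np.asarray(B, dtype=float)
    R = np.linalg.qr(B, mode='r')
    n = B.shape[1]
    Z = np.eye(n, dtype=np.int64)
    k = 1                                       # 0-based; stage k+1 of the thesis
    while k < n:
        zeta = np.rint(R[k-1, k] / R[k-1, k-1])
        if zeta != 0:                           # reduce r_{k-1,k}
            R[:k, k] -= zeta * R[:k, k-1]
            Z[:, k] -= int(zeta) * Z[:, k-1]
        if delta * R[k-1, k-1]**2 > R[k, k]**2 + R[k-1, k]**2:
            R[:, [k-1, k]] = R[:, [k, k-1]]     # permute columns k-1 and k
            Z[:, [k-1, k]] = Z[:, [k, k-1]]
            r = np.hypot(R[k-1, k-1], R[k, k-1])
            c, s = R[k-1, k-1] / r, R[k, k-1] / r
            G = np.array([[c, s], [-s, c]])
            R[k-1:k+1, k-1:] = G @ R[k-1:k+1, k-1:]   # re-triangularize
            k = max(k - 1, 1)
        else:
            for i in range(k - 2, -1, -1):      # size-reduce the rest of column k
                zeta = np.rint(R[i, k] / R[i, i])
                if zeta != 0:
                    R[:i + 1, k] -= zeta * R[:i + 1, i]
                    Z[:, k] -= int(zeta) * Z[:, i]
            k += 1
    return R, Z
```

The returned R can be sanity-checked with the is_lll_reduced utility sketched in Section 2.1.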

2.2.3 Complexity Analysis

Assume that the operations used in the algorithm are performed in exact arithmetic. The complexity of Algorithm 2.1 is measured by the number of arithmetic operations. Part of the results of this complexity analysis will be used in Chapters 3 and 4. The QR factorization by GSO takes $O(mn^2)$ arithmetic operations [16, Section 5.2]. Next, we analyze the complexity of the while loop in the LLL reduction algorithm. Adding the complexity of the QR factorization and that of the while loop together gives the complexity of the LLL reduction algorithm.

For the complexity of the while loop, we first determine the number of loop iterations and then count the number of arithmetic operations in each iteration.

Lemma 2.1 ([22]): Let $\beta = \max_j \|b_j\|$, and let $\alpha = \min_{x \in \mathbb{Z}^n \setminus \{0\}} \|Bx\|$ be the length of the shortest nonzero vector of the lattice $\mathcal{L}(B)$. The number of permutations involved in Algorithm 2.1 is bounded by $O(n^3 + n^2 \log_{1/\delta}(\beta/\alpha))$ and the algorithm converges.

Proof. We use the proof from [22] and [44, Chapter 3].

After the Gram-Schmidt QR factorization, we obtain the factors $Q_1$ and $R$ in the QR factorization $B = Q_1 R$. Let $R^{(p)}$ denote the upper triangular matrix $R$ after the $p$-th permutation ($R^{(0)} = R$). Define the quantities $w_i$ and $\Phi$ after the $p$-th permutation as

$$w_i^{(p)} = \prod_{j=1}^{i} \big(r_{jj}^{(p)}\big)^2, \quad i = 1, 2, \ldots, n, \quad (2.5)$$

and

$$\Phi^{(p)} = \prod_{i=1}^{n} w_i^{(p)}. \quad (2.6)$$

Suppose the $p$-th permutation is applied to columns $q-1$ and $q$ of the matrix $R^{(p-1)}$ and the orthogonal transformation by GSO is applied to keep the upper triangular structure, as described in the algorithm. We obtain a matrix $R^{(p)}$ with the following properties:

$$r_{jj}^{(p)} = r_{jj}^{(p-1)}, \quad j \ne q-1, q, \qquad \big|r_{q-1,q-1}^{(p)}\, r_{qq}^{(p)}\big| = \big|r_{q-1,q-1}^{(p-1)}\, r_{qq}^{(p-1)}\big|.$$

And by the permutation criterion (see line 9 of Algorithm 2.1) obtained from (2.2), we have $\big(r_{q-1,q-1}^{(p)}\big)^2 < \delta \big(r_{q-1,q-1}^{(p-1)}\big)^2$. Then from (2.5) we obtain

$$w_i^{(p)} = w_i^{(p-1)}, \quad i \ne q-1, \qquad w_{q-1}^{(p)} / w_{q-1}^{(p-1)} < \delta.$$

Substituting these into (2.6) gives

$$\Phi^{(p)} < \delta\, \Phi^{(p-1)}, \quad (2.7)$$

which means that one permutation operation decreases $\Phi$ by at least a factor of $\delta$. Assume that the algorithm involves a total of $p$ permutations before convergence. From (2.7) it follows that

$$\Phi^{(p)} < \delta^p\, \Phi^{(0)},$$

or equivalently

$$p < \log_{1/\delta} \frac{\Phi^{(0)}}{\Phi^{(p)}} = \log_{1/\delta} \Phi^{(0)} - \log_{1/\delta} \Phi^{(p)} = \sum_{i=1}^{n} \log_{1/\delta} w_i^{(0)} - \sum_{i=1}^{n} \log_{1/\delta} w_i^{(p)}. \quad (2.8)$$

Since $\beta = \max_j \|b_j\|$ and $\|b_j\|^2 \ge \big(r_{jj}^{(0)}\big)^2$, we have $\big(r_{jj}^{(0)}\big)^2 \le \beta^2$ ($j = 1, 2, \ldots, n$). Thus from (2.5)

$$w_i^{(0)} \le \beta^{2i}. \quad (2.9)$$

By Theorem I of [7, Chapter II],

$$\alpha^2 = \min_{x \in \mathbb{Z}^n \setminus \{0\}} \|Bx\|^2 \le \Big(\frac{4}{3}\Big)^{(n-1)/2} \big(\det(B^T B)\big)^{1/n}. \quad (2.10)$$

For any $x \in \mathbb{Z}^n$, we can define $\bar{x} = (Z^{(p)})^{-1} x$, where $Z^{(p)}$ denotes the unimodular matrix $Z$ after the $p$-th permutation ($Z^{(0)} = I_n$). Define $B^{(p)} = B Z^{(p)} = Q_1^{(p)} R^{(p)}$. From (2.10) we have

$$\alpha^2 = \min_{x \in \mathbb{Z}^n \setminus \{0\}} \|Bx\|^2 = \min_{\bar{x} \in \mathbb{Z}^n \setminus \{0\}} \|B^{(p)} \bar{x}\|^2 \le \min_{\bar{x}(1:i) \in \mathbb{Z}^i \setminus \{0\}} \|B^{(p)}(:, 1:i)\, \bar{x}(1:i)\|^2$$
$$\le \Big(\frac{4}{3}\Big)^{(i-1)/2} \big|\det\big(B^{(p)}(:, 1:i)^T B^{(p)}(:, 1:i)\big)\big|^{1/i} = \Big(\frac{4}{3}\Big)^{(i-1)/2} \big|\det\big(R^{(p)}(:, 1:i)^T R^{(p)}(:, 1:i)\big)\big|^{1/i}$$
$$= \Big(\frac{4}{3}\Big)^{(i-1)/2} \big(w_i^{(p)}\big)^{1/i} \quad \text{(see (2.5))}.$$

Then it follows that

$$w_i^{(p)} \ge (3/4)^{i(i-1)/2}\, \alpha^{2i}. \quad (2.11)$$

Substituting (2.9) and (2.11) into (2.8) gives

$$p < \sum_{i=1}^{n} \log_{1/\delta} \beta^{2i} - \sum_{i=1}^{n} \log_{1/\delta}\big((3/4)^{i(i-1)/2} \alpha^{2i}\big) = (n+1)\,n \log_{1/\delta} \frac{\beta}{\alpha} + \log_{1/\delta} \prod_{i=1}^{n} (4/3)^{i(i-1)/2}$$
$$= (n+1)\,n \log_{1/\delta} \frac{\beta}{\alpha} + \frac{1}{6}(n^3 - n) \log_{1/\delta}(4/3).$$

So Algorithm 2.1 involves at most $O(n^3 + n^2 \log_{1/\delta}(\beta/\alpha))$ permutations and the algorithm converges.

We should note that the bound on the number of permutations from the lemma holds for all LLL reduction algorithms that share the same permutation criterion as Algorithm 2.1.

In Algorithm 2.1, $k$ is either increased or decreased by 1 in each iteration of the while loop. Since every iteration in which $k$ is decreased performs a column permutation, there are $p$ iterations in which $k$ is decreased. The algorithm starts from $k = 2$ and ends when $k = n+1$, so the number of iterations in which $k$ is increased equals $p + n - 1$. Thus there are $2p + n - 1$ iterations in total, which is bounded by $O(n^3 + n^2 \log_{1/\delta}(\beta/\alpha))$. Each iteration costs $O(n^2)$ arithmetic operations in the worst case. So the whole algorithm takes at most $O(mn^2 + n^5 + n^4 \log_{1/\delta}(\beta/\alpha))$ arithmetic operations.

2.3 Partial LLL Reduction Algorithm

Recently the so-called effective LLL (ELLL) reduction was proposed by Ling and Howgrave-Graham [23], and later the so-called partial LLL (PLLL) reduction algorithm was developed by Xie, Chang and Al Borno [43]. Both algorithms are more efficient than Algorithm 2.1. The ELLL reduction algorithm is essentially identical to Algorithm 2.1 with lines 17-21, which reduce the off-diagonal entries of $R$ other than the super-diagonal ones, removed. It has a lower computational cost than LLL, while it has the same effect as LLL on the performance of the Babai integer point. It is shown algebraically in [43] that the size-reduction condition of the LLL reduction has no effect on a typical sphere decoding (SD) search process for solving an integer least squares (ILS) problem; thus it has no effect on the performance of the Babai integer point, the first integer point found in the search process. The PLLL was proposed to avoid the numerical stability problem of ELLL, and to avoid some unnecessary size-reductions involved in LLL and ELLL. Both PLLL and ELLL can compute LLL reduced bases if an extra size-reduction procedure is added at the end of the algorithms. The following part gives a description of the PLLL reduction.

2.3.1 Householder QR Factorization with Minimum Column Pivoting

The typical LLL algorithm first finds the QR factorization of the given matrix $B$. In the original LLL algorithm, the Gram-Schmidt method is adopted for computing the QR factorization. However, the Householder method without forming the orthogonal factor $Q$, which costs $\frac{4}{3}mn^2$ flops, is more efficient than the Gram-Schmidt method, which costs $2mn^2$ flops [16]. The Householder method requires square root operations, so it is not suitable for the exact integer LLL reduction. The floating point LLL reduction, however, has no problem with computing square roots, so it can use Householder transformations to compute the QR factorization.

The PLLL reduction uses the Householder QR factorization with minimum column pivoting (QRMCP) instead of the classic Householder QR factorization. In general, the number of permutations is a crucial factor in the cost of the whole LLL reduction process. If one can make the upper triangular factor close to an LLL reduced one in the QR factorization stage, the number of permutations in the later stage is likely to decrease. The minimum column pivoting strategy is used to help achieve the Lovász condition; see [44, Section 4.1].

From (2.1) and (2.2), we can easily obtain

$$\Big(\delta - \frac{1}{4}\Big) r_{i-1,i-1}^2 \le r_{ii}^2, \quad 1 < i \le n, \quad \delta \in (1/4, 1). \quad (2.12)$$

The Householder QR factorization upper-triangularizes the matrix $B$ column by column, with the column index $i$ increasing from 1 to $n$. In order to make the matrix $R$ more likely to satisfy (2.12), the minimum column pivoting strategy chooses a column permutation such that $|r_{ii}|$ is smallest in the $i$-th step. In the $i$-th step of the QR factorization, the QRMCP finds the column in $B(i:m, i:n)$ with the minimum 2-norm, and interchanges the whole column with the $i$-th column of $B$. After this, the QRMCP eliminates the entries $B(i+1:m, i)$ by a Householder transformation $H_i$. With the minimum column pivoting strategy, the Householder QR factorization becomes

$$BP = Q \begin{bmatrix} R \\ 0 \end{bmatrix} = [Q_1\ Q_2] \begin{bmatrix} R \\ 0 \end{bmatrix} = Q_1 R, \quad (2.13)$$

where $P \in \mathbb{Z}^{n \times n}$ is a permutation matrix, $R \in \mathbb{R}^{n \times n}$ is upper triangular, and $Q = [Q_1, Q_2] \in \mathbb{R}^{m \times m}$ with $Q_1 \in \mathbb{R}^{m \times n}$ and $Q_2 \in \mathbb{R}^{m \times (m-n)}$ is orthogonal. $Q^T = H_n H_{n-1} \cdots H_1$ is the product of $n$ Householder transformations.

The algorithm is given as follows.

Algorithm 2.2. (Householder QR Factorization with Minimum Column Pivoting) Suppose $B \in \mathbb{R}^{m \times n}$ has full column rank. This algorithm computes the QRMCP factorization $B = Q_1 R P^T$, where $Q_1$ has orthonormal columns, $R$ is upper triangular and $P$ is a permutation matrix.

function: [R, P] = QRMCP(B)
1: P := I_n
2: l_j := ‖B(1:m, j)‖^2, j = 1:n
3: for i = 1:n do
4:   q := arg min_{i≤j≤n} l_j
5:   if q > i then
6:     interchange columns B(1:m, i) and B(1:m, q)
7:     interchange columns P(1:n, i) and P(1:n, q)
8:   end if
9:   compute the Householder transformation H_i which zeros B(i+1:m, i)
10:  B := H_i B
11:  l_j := l_j − B(i, j)^2, j = i+1, i+2, ..., n
12: end for
13: R := B(1:n, 1:n)
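A NumPy sketch of Algorithm 2.2 (our illustration; it returns the permutation as an index vector rather than as the matrix P):

```python
import numpy as np

def qrmcp(B):
    """Householder QR with minimum column pivoting: returns (R, perm)
    with B[:, perm] = Q1 R; the squared norms l are downdated row by row."""
    B = np.array(B, dtype=float)
    m, n = B.shape
    perm = np.arange(n)
    l = np.sum(B * B, axis=0)               # squared column norms
    for i in range(n):
        q = i + int(np.argmin(l[i:]))       # column of minimum residual norm
        if q > i:
            B[:, [i, q]] = B[:, [q, i]]
            perm[[i, q]] = perm[[q, i]]
            l[[i, q]] = l[[q, i]]
        u = B[i:, i].copy()                 # Householder vector for column i
        u[0] += np.copysign(np.linalg.norm(u), u[0])
        tau = 2.0 / (u @ u)
        B[i:, i:] -= tau * np.outer(u, u @ B[i:, i:])   # B := H_i B
        l[i+1:] -= B[i, i+1:] ** 2          # downdate the squared norms
    return np.triu(B[:n, :n]), perm
```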

2.3.2 Partial Size-Reduction and Givens Rotation

After the QRMCP, the PLLL reduction performs permutations, IGTs and Givens rotations on $R$ in an efficient and numerically stable way. At the $k$-th column of $R$, PLLL checks whether columns $k$ and $k-1$ need to be permuted according to the Lovász condition (2.2). If the Lovász condition holds, the permutation does not occur, no IGT is applied, and the algorithm moves to column $k+1$. If the Lovász condition does not hold, $r_{k-1,k}$ is reduced by an IGT, and IGTs are also applied to $r_{k-2,k}, \ldots, r_{1,k}$ for stability considerations. Then PLLL performs the permutation and the Givens rotation, and moves back to the previous column.

In PLLL, Givens rotations are used for the triangularization after permutations, instead of the GSO used in line 12 of Algorithm 2.1. Define the Givens rotation matrix

$$G = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}, \quad c = \frac{r_{k-1,k}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}, \quad s = \frac{r_{kk}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}},$$

which is used in the following transformation:

$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} r_{k-1,k} & r_{k-1,k-1} \\ r_{k,k} & 0 \end{bmatrix} = \begin{bmatrix} \bar{r}_{k-1,k-1} & \bar{r}_{k-1,k} \\ 0 & \bar{r}_{k,k} \end{bmatrix}.$$

The PLLL algorithm is given as follows.

Algorithm 2.3. (PLLL Reduction) Suppose $B \in \mathbb{R}^{m \times n}$ has full column rank. This algorithm computes the PLLL reduction of $B$: $B = Q_1 R Z^{-1}$, where $Q_1$ has orthonormal columns, $R$ is upper triangular and $Z$ is unimodular. It applies IGTs only when a column permutation occurs.

function: [R, Z] = PLLL(B)
1: compute [R, P] = QRMCP(B)
2: set Z := P, k := 2
3: while k ≤ n do
4:   ζ := ⌊r_{k-1,k}/r_{k-1,k-1}⌉
5:   γ := r_{k-1,k} − ζ r_{k-1,k-1}
     // δ is a parameter chosen in (1/4, 1)
6:   if δ r_{k-1,k-1}^2 > γ^2 + r_{kk}^2 then
       // size-reduce R(1:k-1, k)
7:     for l = k-1 : -1 : 1 do
8:       ζ := ⌊r_{l,k}/r_{ll}⌉
9:       Z(1:n, k) := Z(1:n, k) − ζ Z(1:n, l)
10:      R(1:l, k) := R(1:l, k) − ζ R(1:l, l)
11:    end for
       // column permutation and updating
12:    c := r_{k-1,k}/√(r_{k-1,k}^2 + r_{kk}^2)
13:    s := r_{kk}/√(r_{k-1,k}^2 + r_{kk}^2)
14:    G := [c s; −s c]
15:    interchange columns Z(1:n, k) and Z(1:n, k-1)
16:    interchange columns R(1:n, k) and R(1:n, k-1)
17:    R(k-1:k, k-1:n) := G R(k-1:k, k-1:n)
18:    if k > 2 then
19:      k := k − 1
20:    end if
21:  else
22:    k := k + 1
23:  end if
24: end while
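A NumPy sketch of Algorithm 2.3, built on the qrmcp sketch above (our illustration, not the thesis implementation):

```python
import numpy as np

def plll_reduce(B, delta=0.75):
    """PLLL reduction sketch: size-reductions are applied only when the
    Lovasz condition fails and a permutation is about to occur."""
    n = B.shape[1]
    R, perm = qrmcp(B)                       # Algorithm 2.2 sketch
    Z = np.eye(n, dtype=np.int64)[:, perm]   # Z := P
    k = 1                                    # 0-based column index
    while k < n:
        zeta = np.rint(R[k-1, k] / R[k-1, k-1])
        gamma = R[k-1, k] - zeta * R[k-1, k-1]
        if delta * R[k-1, k-1]**2 > gamma**2 + R[k, k]**2:
            for l in range(k - 1, -1, -1):   # size-reduce R(1:k-1, k)
                zeta = np.rint(R[l, k] / R[l, l])
                if zeta != 0:
                    R[:l + 1, k] -= zeta * R[:l + 1, l]
                    Z[:, k] -= int(zeta) * Z[:, l]
            r = np.hypot(R[k-1, k], R[k, k])
            c, s = R[k-1, k] / r, R[k, k] / r
            G = np.array([[c, s], [-s, c]])
            R[:, [k-1, k]] = R[:, [k, k-1]]  # column permutation
            Z[:, [k-1, k]] = Z[:, [k, k-1]]
            R[k-1:k+1, k-1:] = G @ R[k-1:k+1, k-1:]
            k = max(k - 1, 1)
        else:
            k += 1
    return R, Z
```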

Notice that the final matrix $R$ obtained by the PLLL reduction algorithm is not fully size-reduced, since the algorithm performs size-reductions only when they are immediately followed by a permutation. However, we can easily add an extra size-reduction procedure at the end of the PLLL reduction algorithm to transform $R$ into an LLL reduced matrix. We name the PLLL algorithm with this extra size-reduction procedure PLLL+.

The PLLL reduction algorithm uses the same permutation criterion as the LLL reduction algorithm, so it has the same upper bound on the number of permutations/loops as the LLL reduction algorithm, which is $O(n^3 + n^2 \log_{1/\delta}(\beta/\alpha))$.

In each loop, the PLLL reduction algorithm performs $O(n^2)$ arithmetic operations in the worst case. The Householder QR factorization costs $O(mn^2)$ flops [16, Section 5.2]. So the PLLL algorithm takes at most $O(mn^2 + n^5 + n^4 \log_{1/\delta}(\beta/\alpha))$ arithmetic operations, which is the same as the complexity bound of the LLL reduction algorithm. The simulation results of PLLL in [43] show that it is faster and more stable than the LLL reduction.

CHAPTER 3
Block LLL Reduction Algorithms

The blocking technique has been widely used to speed up conventional matrix algorithms on today's high performance computers. The key to achieving high performance on computers with a memory hierarchy is to recast the algorithms in terms of matrix-vector and matrix-matrix operations to permit efficient reuse of data residing in cache or local memory. The blocking technique partitions a big matrix into small blocks, and performs matrix-matrix operations implemented in level 3 basic linear algebra subprograms (BLAS) as much as possible [14]. The matrix-matrix operations implemented in level 3 BLAS are more efficient than the matrix-vector operations implemented in level 2 BLAS or the vector-vector operations implemented in level 1 BLAS. Level 3 BLAS can maximally reduce the movement of data between memory and registers, which can be as costly as the arithmetic operations on that data in matrix algorithms.

In this chapter, we first explain how to apply the blocking technique to the components of the partial LLL (PLLL) reduction algorithm. Then we propose two block LLL reduction algorithms with different matrix partition strategies, and compare their speed and stability with the original LLL reduction algorithm and the PLLL reduction algorithm introduced in Chapter 2.

3.1 Subroutines of Block LLL Reduction Algorithms

In this section we describe a block QR factorization algorithm, a block size-reduction algorithm named BSR, a variant of the PLLL reduction algorithm named Local-PLLL, and a block partial size-reduction algorithm named BPSR. They will be used as subroutines of the block LLL reduction algorithms. Local-PLLL is suited to computing the PLLL reduction of blocks of the basis matrix. The block partial size-reduction algorithm uses the efficient size-reduction strategy proposed in the PLLL reduction algorithm.

3.1.1 Block Householder QR Factorization with Minimum Column Pivoting

In order to design a block Householder QR factorization by means of level 3 BLAS, Schreiber and Van Loan [38] proposed a storage-efficient WY representation for the product of Householder transformations. Later Quintana-Orti, Sun and Bischof [32] proposed a level 3 BLAS version of the QR factorization with maximum column pivoting in order to get a rank-revealing factorization. Based on their work, we give the block QR factorization algorithm with minimum column pivoting in this section.

Given a real full column rank matrix $B \in \mathbb{R}^{m \times n}$, the Householder QR factorization with minimum column pivoting gives

$$BP = Q \begin{bmatrix} R \\ 0 \end{bmatrix} = [Q_1\ Q_2] \begin{bmatrix} R \\ 0 \end{bmatrix} = Q_1 R, \quad (3.1)$$

where $Q = [Q_1, Q_2] \in \mathbb{R}^{m \times m}$ with $Q_1 \in \mathbb{R}^{m \times n}$ and $Q_2 \in \mathbb{R}^{m \times (m-n)}$ is orthogonal, $R \in \mathbb{R}^{n \times n}$ is upper triangular, and $P \in \mathbb{Z}^{n \times n}$ is a permutation matrix. The orthogonal matrix $Q$ is the product of $n$ Householder transformations:

$$Q^T = H_n \cdots H_2 H_1, \quad (3.2)$$

$$H_i = I_m - \tau_i u_i u_i^T, \quad i = 1, 2, \ldots, n, \quad (3.3)$$

where $\tau_i = 2/(u_i^T u_i)$, $u_i = \begin{bmatrix} 0 \\ \bar{u}_i \end{bmatrix} \in \mathbb{R}^m$, $\bar{u}_i \in \mathbb{R}^{m-i+1}$ is a Householder vector, and $H_i \in \mathbb{R}^{m \times m}$ is the Householder transformation matrix which zeros $B(i+1:m, i)$.

The permutation matrix $P$ is the product of $n$ permutations:

$$P = P_1 P_2 \cdots P_n,$$

where $P_i$ ($i = 1, 2, \ldots, n$) is the permutation matrix which interchanges the $i$-th column with another column in $B(1:m, i:n)$ such that the 2-norm of $B(i:m, i)$ is minimum.

In order to explain the block QR implementation, we define $B^{(i)}$ as the value of $B$ after $i$ Householder transformations and $i$ permutations, i.e.,

$$B^{(i)} = H_i \cdots H_2 H_1 B P_1 P_2 \cdots P_i, \quad (3.4)$$

with $B^{(0)} = B$. And we define $\tilde{B}^{(i)}$ as $B$ with only the $i$ permutations applied, i.e.,

$$\tilde{B}^{(i)} = B\, (P_1 P_2 \cdots P_i). \quad (3.5)$$

Here we want to point out that $\tilde{B}^{(i)}$ will not be formed in the $i$-th step of the block algorithm; it is used only to explain the algorithm.

The storage-efficient WY representation [38] for the product of $i$ Householder transformations has the following format:

$$\prod_{t=1}^{i} H_t = \prod_{t=1}^{i} \big(I_m - \tau_t u_t u_t^T\big) = I_m - Y_i T_i Y_i^T, \quad (3.6)$$

where

$$Y_i = [u_1, u_2, \ldots, u_i] \in \mathbb{R}^{m \times i} \quad (3.7)$$

is lower trapezoidal, and $T_i \in \mathbb{R}^{i \times i}$ is lower triangular, given by the following recursion formula:

$$T_i = \begin{bmatrix} T_{i-1} & 0 \\ h_i^T & \tau_i \end{bmatrix}, \qquad h_i^T = -\tau_i\, u_i^T Y_{i-1} T_{i-1} \in \mathbb{R}^{1 \times (i-1)},$$

with the base case $T_1 = \tau_1$.

Substituting (3.5) and (3.6) into (3.4), $B^{(i)}$ can be expressed as

$$B^{(i)} = \big(I_m - Y_i T_i Y_i^T\big) \tilde{B}^{(i)} = \tilde{B}^{(i)} - Y_i F_i^T, \quad (3.8)$$

where

$$F_i^T = T_i Y_i^T \tilde{B}^{(i)} \in \mathbb{R}^{i \times n}. \quad (3.9)$$

It is easy to show that $F_i^T$ can be computed by recursion:

$$F_1^T = \tau_1 u_1^T \tilde{B}^{(1)}, \qquad F_i^T = \begin{bmatrix} F_{i-1}^T P_i \\ \tau_i u_i^T \tilde{B}^{(i)} - \tau_i u_i^T Y_{i-1} F_{i-1}^T P_i \end{bmatrix}. \quad (3.10)$$
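The recursion in (3.6) can be checked numerically. The following NumPy sketch (ours) accumulates given Householder vectors into the compact WY form and verifies the identity against the explicit product:

```python
import numpy as np

def wy_accumulate(U, taus):
    """Accumulate H_i = I - tau_i u_i u_i^T into H_k ... H_2 H_1
    = I - Y T Y^T with T lower triangular, following Eq. (3.6)."""
    m, k = U.shape
    Y = np.zeros((m, 0))
    T = np.zeros((0, 0))
    for i in range(k):
        u = U[:, i:i+1]
        h = -taus[i] * (u.T @ Y) @ T         # h_i^T = -tau_i u_i^T Y T
        T_new = np.zeros((i + 1, i + 1))     # append one row and column to T
        T_new[:i, :i] = T
        T_new[i, :i] = h
        T_new[i, i] = taus[i]
        T = T_new
        Y = np.hstack([Y, u])
    return Y, T

# verify against the explicit product for random Householder vectors
rng = np.random.default_rng(0)
m, k = 6, 3
U = rng.standard_normal((m, k))
taus = np.array([2.0 / (U[:, i] @ U[:, i]) for i in range(k)])
Y, T = wy_accumulate(U, taus)
H = np.eye(m)
for i in range(k):
    H = (np.eye(m) - taus[i] * np.outer(U[:, i], U[:, i])) @ H   # H_k ... H_1
assert np.allclose(H, np.eye(m) - Y @ T @ Y.T)
```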

The block Householder QR factorization algorithm partitions the matrix $B \in \mathbb{R}^{m \times n}$ into $d$ blocks of size $m \times k$ (for simplicity we assume $n = dk$). The algorithm deals with the blocks sequentially from left to right. Inside a block, $k$ Householder transformations are performed for upper-triangularization, and they are accumulated into a single block transformation using the WY representation in (3.6). Then the block transformation is applied to the other blocks of $B$ by matrix-matrix multiplication. Next we show how the block algorithm works.

In the first step, we first compute the squared column norms of $B$, denoted by $l$:

$$l_j := \|B(1:m, j)\|^2, \quad j = 1, 2, \ldots, n.$$

Utilizing $l$, a column of $B$ with minimum 2-norm is permuted with the first column by the permutation matrix $P_1$ (actually $P_1$ is not formed explicitly). Then we use the Householder transformation $H_1$ to zero $B(2:m, 1)$. At this moment, unlike in Algorithm 2.2, we do not apply $H_1$ to the other columns of $B$. However, the first row of $B$ must be updated in order to downdate the squared column norms:

$$l_j := l_j - B(1, j)^2, \quad j = 2, \ldots, n, \quad (3.11)$$

which will be used in the next step for minimum column pivoting. In order to update the first row, we form the following matrices (actually they are vectors) using (3.6) and (3.10):

$$Y_1 := u_1, \qquad F_1^T(1, 2:n) := \tau_1 u_1^T B(1:m, 2:n).$$

Notice that the $B(1:m, 2:n)$ stored in memory is equivalent to the $\tilde{B}^{(1)}(1:m, 2:n)$ given in (3.10). From (3.8), the first row of $B$, except the first entry, is updated as follows:

$$B(1, 2:n) := B(1, 2:n) - Y_1(1, 1)\, F_1^T(1, 2:n).$$

Then the squared column norms are downdated using (3.11). Thus at the end of the first step, the first row and the first column have been updated, and the rest of $B$ will be updated later.

In the second step, utilizing the vector $l$ of squared column norms, we apply $P_2$ to permute the second column of $B$ with a column, say column $p$ ($2 \le p \le n$), such that the 2-norm of $B(2:m, 2)$ is minimum, and we permute the second column of $F_1^T$ with its $p$-th column (i.e., $F_1^T := F_1^T P_2$). Then from (3.8) the second column $B(2:m, 2)$ is updated by the first Householder transformation $H_1$:

$$B(2:m, 2) := B(2:m, 2) - Y_1(2:m, 1)\, F_1^T(1, 2).$$

After this update, we apply the Householder transformation $H_2$ to zero $B(3:m, 2)$. As in step 1, we do not use $H_2$ to update the remaining columns of $B$ at this moment. But we need to update the second row of $B$, because it will be used to compute the 2-norms of the columns of $B(3:m, 3:n)$. In order to perform the update, $Y_2$ and $F_2$ are formed by accumulating $H_2$ into $Y_1$ and $F_1$ using (3.6) and (3.10):

$$Y_2 := [Y_1, u_2], \qquad F_2^T(1:2, 3:n) := \begin{bmatrix} F_1^T(1, 3:n) \\ \tau_2 u_2^T B(1:m, 3:n) - \tau_2 u_2^T Y_1 F_1^T(1, 3:n) \end{bmatrix}.$$

Note that here $F_1^T$ has already been permuted by $P_2$. Then we update the second row of $B$ except the first two entries:

$$B(2, 3:n) := B(2, 3:n) - Y_2(2, 1:2)\, F_2^T(1:2, 3:n),$$

and compute the squared column norms of $B(3:m, 3:n)$:

$$l_j := l_j - B(2, j)^2, \quad j = 3, \ldots, n.$$

At the end of the second step, the first two rows and the first two columns have been updated.

Now assume we are in the $i$-th step of transforming the first block of $B$ into an upper triangular matrix. The first $i-1$ columns of $B$ have been triangularized and the first $i-1$ rows have been updated, while the rest of the matrix $B$ is waiting to be updated. We first permute the $i$-th column with a column in $B(1:m, i:n)$ such that the 2-norm of $B(i:m, i)$ is minimum, and we permute the corresponding columns of $F_{i-1}^T$ (i.e., $F_{i-1}^T := F_{i-1}^T P_i$). Then we update the $i$-th column $B(i:m, i)$ by using the Householder transformations $H_1, H_2, \ldots, H_{i-1}$ as follows (see (3.8)):

$$B(i:m, i) := B(i:m, i) - Y_{i-1}(i:m, 1:i-1)\, F_{i-1}^T(1:i-1, i).$$

Then the Householder transformation $H_i$ is used to zero $B(i+1:m, i)$, and it is accumulated into $Y_i$ and $F_i$:

$$Y_i := [Y_{i-1}, u_i],$$

$$F_i^T(1:i, i+1:n) := \begin{bmatrix} F_{i-1}^T(1:i-1, i+1:n) \\ \tau_i u_i^T B(1:m, i+1:n) - \tau_i u_i^T Y_{i-1} F_{i-1}^T(1:i-1, i+1:n) \end{bmatrix}.$$

Then we update the $i$-th row $B(i, i+1:n)$ and downdate the squared column norms:

$$B(i, i+1:n) := B(i, i+1:n) - Y_i(i, 1:i)\, F_i^T(1:i, i+1:n),$$

$$l_j := l_j - B(i, j)^2, \quad j = i+1, \ldots, n.$$

Now the first $i$ columns and rows of $B$ have been updated.

As shown above, the block algorithm updates one row and one column in each step. At the end of the $k$-th step, we update the rest of $B$ by using the accumulated first $k$ Householder transformations as follows:

$$B(k+1:m, k+1:n) := B(k+1:m, k+1:n) - Y_k(k+1:m, 1:k)\, F_k^T(1:k, k+1:n).$$

At this point, the first $k$ columns of $B$ (i.e., the first block of $B$) have been upper-triangularized, and the other columns of $B$ have been updated. Then we can apply the same procedure to triangularize the second block of $B$, and so on, until the final upper triangular matrix is obtained.

The algorithm for the block QR factorization with minimum column pivoting is given as follows.

Algorithm 3.1. (Block Householder QR Factorization with Minimum Column Pivoting) Suppose $B \in \mathbb{R}^{m \times n}$ has full column rank, and let $k$ be the chosen block size, which for simplification is assumed to be a factor of $n$. This algorithm computes the QR factorization $Q_1 R = BP$, where $Q_1$ has orthonormal columns and $P$ is a permutation matrix. Note that the matrix $B$ is overwritten by $R$ in the computation.

function: [R, P] = BQRMCP(B, k)
1: P := I_n, m̄ := m, n̄ := n
2: l_j := ‖B(1:m, j)‖^2, j = 1:n
3: for j̄ = 1 : k : n do
4:   Y(1:m̄, 1:k) := 0, F(1:n̄, 1:k) := 0
5:   for j = 1 : k do
       // permutation
6:     i := j̄ + j − 1, q := arg min_{i≤p≤n} l_p
7:     interchange columns B(1:m, i) and B(1:m, q)
8:     interchange columns P(1:n, i) and P(1:n, q)
9:     interchange rows F(j, 1:k) and F(q − j̄ + 1, 1:k)
       // update the i-th column
10:    B(i:m, i) := B(i:m, i) − Y(j:m̄, 1:j-1) F(j, 1:j-1)^T
11:    zero B(i+1:m, i) by the Householder transformation H_i = I_m − τ_i u_i u_i^T
       // accumulation of the block transformation
12:    Y(j:m̄, j) := u_i(i:m)
13:    F(j+1:n̄, j) := τ_i B(1:m, i+1:n)^T u_i
14:    F(1:n̄, j) := F(1:n̄, j) − τ_i F(1:n̄, 1:j-1) Y(j:m̄, 1:j-1)^T u_i(i:m)
       // update the i-th row and downdate the norms
15:    B(i, i+1:n) := B(i, i+1:n) − Y(j, 1:j) F(j+1:n̄, 1:j)^T
16:    l(i+1:n) := l(i+1:n) − B(i, i+1:n) .* B(i, i+1:n)
17:  end for
     // apply the block transformation to the unprocessed part of the matrix
18:  B(i+1:m, i+1:n) := B(i+1:m, i+1:n) − Y(k+1:m̄, 1:k) F(k+1:n̄, 1:k)^T
19:  m̄ := m̄ − k, n̄ := n̄ − k
20: end for
21: R := B(1:n, 1:n)

Here we make a remark: in our implementation, we actually use an $n$-dimensional vector to store the permutation matrix $P$; we do not form $P$ explicitly, for efficiency.

3.1.2 Block Size-Reduction

The idea of block size-reduction is to accumulate several IGTs into one block update, so that the algorithm is rich in matrix-matrix operations. The size-reduction of an upper triangular matrix $R \in \mathbb{R}^{n \times n}$ can be described as

$$U = RZ,$$

where $Z \in \mathbb{Z}^{n \times n}$ is a unimodular matrix and $U \in \mathbb{R}^{n \times n}$ is size-reduced, i.e., $|u_{ij}| \le \frac{1}{2}|u_{ii}|$ ($1 \le i < j \le n$). The size-reduction algorithm repeatedly applies IGTs to $R$; $Z$ is the product of a sequence of IGTs, which have the form $I_n - \zeta e_i e_j^T$ ($1 \le i < j \le n$).

3.1.3 Local Partial LLL Reduction

[...]

25:     if i > 2 then
26:       i := i − 1
27:     end if
28:   else
29:     i := i + 1
30:   end if
31: end while
32: R_l := R_local

3.1.4 Block Partial Size-Reduction

A block partial size-reduction algorithm is designed to work together with the Local-PLLL reduction algorithm. In BSR (Algorithm 3.2), all off-diagonal entries of the upper triangular matrix are checked for IGTs. However, this is not the case for the PLLL reduction, where the off-diagonal entries are reduced only when necessary. More specifically, if an IGT is applied to a super-diagonal entry of $R$, other IGTs are applied to the off-diagonal entries in the same column in order to prevent producing large numbers which may cause numerical stability problems. Thus, only the entries in the columns which are affected by IGTs in Local-PLLL need to be reduced. Local-PLLL stores the information about the columns affected by IGTs in the vector $c$, so the block partial size-reduction (BPSR) algorithm can reduce only those marked columns by IGTs.

Given an upper triangular matrix $R \in \mathbb{R}^{n \times n}$ which consists of $d \times d$ blocks $R_{ij}$, $1 \le i \le j \le d$ (here we do not assume that each block has the same size):

$$R = \begin{bmatrix} R_{11} & \cdots & R_{1d} \\ & \ddots & \vdots \\ & & R_{dd} \end{bmatrix},$$

it has sub-matrices

$$\hat{R} = \begin{bmatrix} R_{11} & \cdots & R_{1,i-1} \\ & \ddots & \vdots \\ & & R_{i-1,i-1} \end{bmatrix}, \qquad \tilde{R} = \begin{bmatrix} R_{1i} \\ R_{2i} \\ \vdots \\ R_{i-1,i} \end{bmatrix}, \quad 1 < i \le d, \quad (3.16)$$

where $\tilde{R}$ has $\bar{k}$ columns, each $R_{\bar{\imath},i}$ with $1 < \bar{\imath} \le i-1$ has $k$ rows, and $R_{1,i}$ may have either $k$ or $k/2$ rows.

We are given a vector $c \in \mathbb{Z}^{\bar{k}}$ whose entries are either one or zero. For $j = 1:\bar{k}$, if $c_j = 1$ we perform size-reductions on column $j$ of $\tilde{R}$ by applying IGTs to it, which involve $\hat{R}$; if $c_j = 0$ we do nothing. After this, part of the entries of $\tilde{R}$ are size-reduced according to $c$:

$$\tilde{R} := \tilde{R} + \hat{R} \hat{Z},$$

where $\hat{Z}$, which is formed by those IGTs, has the same dimensions and block partition as $\tilde{R}$:

$$\hat{Z} = \begin{bmatrix} \hat{Z}_{1i} \\ \hat{Z}_{2i} \\ \vdots \\ \hat{Z}_{i-1,i} \end{bmatrix}, \quad (3.17)$$

and $\begin{bmatrix} I & \hat{Z} \\ 0 & I \end{bmatrix}$ is unimodular.

The BPSR algorithm is given as follows.

Algorithm 3.4. (Block Partial Size-Reduction) Given the two sub-matrices $\hat{R}$, $\tilde{R}$ in (3.16) and a vector $c$, this algorithm size-reduces the columns of $\tilde{R}$: $\tilde{R} := \tilde{R} + \hat{R}\hat{Z}$, where $\hat{Z}$ has the block partition in (3.17). We use $A_{i_1:i_2, j}$ to denote the sub-matrix formed by block rows $i_1$ to $i_2$ in the $j$-th block column of $A$.

function: [R̃, Ẑ] = BPSR(R̃, R̂, c)
1: for ī = i-1 : -1 : 1 do
     // partial size-reduction of R̃_{ī,i} by Ẑ_{ī,i}, involving R̂_{ī,ī}
2:   for j = 1 : k̄ do
3:     if c_j = 1 then
4:       size-reduce R̃_{ī,i}(:, j): R̃_{ī,i}(:, j) := R̃_{ī,i}(:, j) + R̂_{ī,ī} Ẑ_{ī,i}(:, j)
5:     end if
6:   end for
7:   update R̃_{1:ī-1, i}: R̃_{1:ī-1, i} := R̃_{1:ī-1, i} + R̂_{1:ī-1, ī} Ẑ_{ī,i}
8: end for
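A dense NumPy sketch of the idea behind Algorithm 3.4 (ours; it works entry by entry on the flagged columns instead of block by block, but realizes the same update R̃ := R̃ + R̂ Ẑ):

```python
import numpy as np

def bpsr(R_tilde, R_hat, c):
    """Size-reduce only the columns of R_tilde flagged by c (ones/zeros),
    using the leading upper triangular part R_hat; returns the updated
    R_tilde and the accumulated integer matrix Z_hat."""
    p = R_hat.shape[0]
    R_tilde = R_tilde.copy()
    Z_hat = np.zeros((p, R_tilde.shape[1]), dtype=np.int64)
    for j in np.flatnonzero(c):            # only the marked columns
        for i in range(p - 1, -1, -1):     # bottom-up IGTs against R_hat
            zeta = np.rint(R_tilde[i, j] / R_hat[i, i])
            if zeta != 0:
                R_tilde[:i + 1, j] -= zeta * R_hat[:i + 1, i]
                Z_hat[i, j] -= int(zeta)
    return R_tilde, Z_hat
```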

3.2 Left-to-Right Block LLL Reduction Algorithm

In this section, we present a left-to-right block LLL (LRBLLL) reduction algorithm utilizing the subroutines introduced in the previous section, i.e., the block QR factorization (Algorithm 3.1), the block size-reduction (Algorithm 3.2), the Local-PLLL reduction algorithm (Algorithm 3.3) and the block partial size-reduction (Algorithm 3.4). The complexity analysis of LRBLLL is presented in the second part of this section.

3.2.1 Partition and Block Operation

The left-to-right block LLL reduction algorithm combines the blocking technique with the PLLL algorithm. It consists of the following 7 steps.

Step 1. Compute the block QR factorization (Algorithm 3.1) of the full column rank matrix $B \in \mathbb{R}^{m \times n}$ with minimum column pivoting, $BP = Q_1 R$, using a chosen even block size $\bar{k}$.

Step 2. Partition the matrix $R$ into $d \times d$ blocks with block size $k = \bar{k}/2$ (here for simplicity we assume that $n$ is a multiple of $k$, i.e., $n = dk$, and that $d$ is even):

$$R = \begin{bmatrix} R_{11} & \cdots & R_{1d} \\ & \ddots & \vdots \\ & & R_{dd} \end{bmatrix} \in \mathbb{R}^{n \times n}, \quad R_{ij} \in \mathbb{R}^{k \times k}, \quad 1 \le i \le j \le d.$$

Initialize a block index $i = 1$.

Step 3. Compute the Local-PLLL reduction (Algorithm 3.3) of the $\bar{k} \times \bar{k}$ diagonal sub-matrix

$$R_{\mathrm{local}} = \begin{bmatrix} R_{ii} & R_{i,i+1} \\ & R_{i+1,i+1} \end{bmatrix}: \qquad R_{\mathrm{local}} := Q_{\mathrm{local}}^T R_{\mathrm{local}} Z_{\mathrm{local}}.$$

Step 4. Update the relevant blocks of $R$ using block transformations:

$$R_{\mathrm{right}} := Q_{\mathrm{local}}^T R_{\mathrm{right}}, \qquad R_{\mathrm{up}} := R_{\mathrm{up}} Z_{\mathrm{local}},$$

where

$$R_{\mathrm{right}} = \begin{bmatrix} R_{i,i+2} & R_{i,i+3} & \cdots & R_{i,d} \\ R_{i+1,i+2} & R_{i+1,i+3} & \cdots & R_{i+1,d} \end{bmatrix}, \qquad R_{\mathrm{up}} = \begin{bmatrix} R_{1,i} & R_{1,i+1} \\ R_{2,i} & R_{2,i+1} \\ \vdots & \vdots \\ R_{i-1,i} & R_{i-1,i+1} \end{bmatrix}.$$

Step 5. Size-reduce $R_{\mathrm{up}}$ using the block partial size-reduction algorithm (Algorithm 3.4):

$$R_{\mathrm{up}} := R_{\mathrm{up}} + \begin{bmatrix} R_{11} & \cdots & R_{1,i-1} \\ & \ddots & \vdots \\ & & R_{i-1,i-1} \end{bmatrix} Z_{\mathrm{update}}.$$

Step 6. Set $\gamma := r_{(i-1)k,(i-1)k+1} - \lfloor r_{(i-1)k,(i-1)k+1}/r_{(i-1)k,(i-1)k} \rceil\, r_{(i-1)k,(i-1)k}$. Check whether the Lovász condition $\delta\, r_{(i-1)k,(i-1)k}^2 \le \gamma^2 + r_{(i-1)k+1,(i-1)k+1}^2$ holds for the first column of $R_{\mathrm{local}}$ and the column before it in $R$.

If $i = 1$ or the Lovász condition holds, set $i := i + 1$. Else, if $i \ne 1$ and the Lovász condition does not hold, set $i := i - 1$.

If $i < d$, go to Step 3; else, go to Step 7.

Step 7. Apply block size-reduction (Algorithm 3.2) to the whole matrix $R$, and stop the algorithm.

In Section 3.1.3, we stated that the first $k$ columns of $R_{\mathrm{local}}$ may be PLLL reduced before Local-PLLL is applied in Step 3. It is easy to check from the algorithm that the first $k$ columns of $R_{\mathrm{local}}$ are PLLL reduced in every call of Local-PLLL in Step 3 except the first.

The left-to-right block LLL reduction algorithm is given as follows.

    Algorithm 3.5. (Left-to-Right Block LLL Reduction) Given a full column rank ma-

    trix B Rmn and a block size k which is even. This algorithm computes the LLL

    factorization: B = Q1RZ1, where Q1 has orthonormal columns, R is upper tri-

    angular and LLL reduced, and Z is unimodular. In the algorithm, we assume Z

    43

  • is partitioned into blocks in the same way as R. We use Ai1:i2,j1:j2 to denote the

    sub-matrix formed by block rows i1 to i2 and block columns j1 to i2 of A.

function: [R, Z] = LRBLLL(B, k)

// Compute the block QR factorization using Algorithm 3.1
1: [R, Z] = BQRMCP(B, k)
2: i := 1, k := k/2, d := n/k, f := 0   // halve the block size; d diagonal blocks
3: while i < d do
     // Local-PLLL reduction of R_{i:i+1,i:i+1} using Algorithm 3.3
4:   [Q, R_{i:i+1,i:i+1}, Z̄, r] = Local-PLLL(R_{i:i+1,i:i+1}, f)
5:   f := 1
6:   if Z̄ = I then
       // The diagonal block is unchanged. The algorithm moves ahead.
7:     i := i + 1
8:     continue
9:   end if
     // Block updating
10:  Z_{1:d,i:i+1} := Z_{1:d,i:i+1} Z̄
11:  R_{1:i−1,i:i+1} := R_{1:i−1,i:i+1} Z̄
12:  R_{i:i+1,i+2:d} := Qᵀ R_{i:i+1,i+2:d}
     // Size-reduce the corresponding columns of R_{1:i−1,i:i+1} using Algorithm 3.4
13:  [R_{1:i−1,i:i+1}, Z̄] = BPSR(R_{1:i−1,i:i+1}, R_{1:i−1,1:i−1}, r)
14:  Z_{1:d,i:i+1} := Z_{1:d,i:i+1} + Z_{1:d,1:i−1} Z̄
     // Check the Lovász condition, then move forward or backward
15:  γ := ⌊R((i−1)k, (i−1)k+1)/R((i−1)k, (i−1)k)⌉
16:  ζ := R((i−1)k, (i−1)k+1) − γ R((i−1)k, (i−1)k)
     // δ is a parameter chosen in (1/4, 1)
17:  if δ R((i−1)k, (i−1)k)² ≤ ζ² + R((i−1)k+1, (i−1)k+1)² or i = 1 then
18:    i := i + 1
19:  else
20:    i := i − 1
21:  end if
22: end while
// Size-reduce R using Algorithm 3.2
23: [R, Z̄] = BSR(R)
24: Z := Z Z̄

Notice that if the Local-PLLL output Z̄ is an identity matrix, we do not apply the block updating and BPSR to the relevant blocks, for efficiency. Also notice that if the matrix dimension n is not a multiple of the block size k, the algorithm still works: we simply change the size of the last block column to fit the matrix dimension. At the end of each while loop, the first ik columns of R are PLLL reduced. The while loop exits when i = d, at which point all n = dk columns of R are PLLL reduced, and R is size-reduced after the final size-reduction. Thus the LRBLLL algorithm outputs a basis matrix which is LLL reduced.
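The efficiency of LRBLLL comes from lines 10–12 of Algorithm 3.5, where the local transformations are applied to whole block rows and block columns as matrix–matrix products. The following MATLAB sketch illustrates this with plain QR standing in for Local-PLLL (so Zloc is the identity here); all names are ours, not the thesis's:

% Sketch of the block updating in lines 10-12 of Algorithm 3.5, with explicit
% index ranges.  Plain QR is a stand-in for Local-PLLL; in the real algorithm
% Qloc and Zloc come from Algorithm 3.3.
n = 8; k = 2; i = 2;
R = triu(randn(n)); Z = eye(n);
c = (i-1)*k+1 : (i+1)*k;            % columns of the two local diagonal blocks
[Qloc, Rloc] = qr(R(c, c));         % stand-in: R(c,c) = Qloc * Rloc
Zloc = eye(length(c));              % Local-PLLL would also return a unimodular Zloc
R(c, c) = Rloc;                     % reduced local block
Z(:, c) = Z(:, c) * Zloc;           % line 10: accumulate the unimodular matrix
up = 1 : (i-1)*k;
R(up, c) = R(up, c) * Zloc;         % line 11: column transformation
right = (i+1)*k+1 : n;
R(c, right) = Qloc' * R(c, right);  % line 12: orthogonal row transformation

Each update is a single level-3 BLAS operation on a block row or block column, which is what allows the blocking technique to reuse data efficiently.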

    3.2.2 Complexity Analysis

    In the LRBLLL algorithm, the column permutation operations are executed in

    the Local-PLLL subroutine. Since LRBLLL uses the same permutation criterion as


LLL (Algorithm 2.1), Lemma 2.1 can also be applied to LRBLLL. As in Section 2.2.3, we define β = max_j ‖b_j‖ and α = min_{x ∈ Z^n \ {0}} ‖Bx‖. Thus the LRBLLL algorithm has at most O(n³ + n² log_{1/δ}(β/α)) permutations, and the algorithm converges. In the following, we derive an upper bound on the number of calls of Local-PLLL.

In its while loop, LRBLLL calls Local-PLLL on diagonal sub-matrices of R. In each loop, the PLLL reduction of one diagonal sub-matrix is performed, and the diagonal sub-matrix to be reduced in the next loop is selected. From step 3 of LRBLLL, the diagonal sub-matrix R_local contains the two diagonal blocks R_{i,i} and R_{i+1,i+1}, and R_local may move one diagonal block forward or backward at the end of each loop, according to whether the Lovász condition holds for columns (i−1)k and (i−1)k+1 (see step 6 of LRBLLL in Section 3.2.1). The matrix R, divided into d×d blocks, has d diagonal blocks. In the first call of Local-PLLL, R_local contains the first two diagonal blocks R_{1,1} and R_{2,2}, and the block index i equals 1; in the last call of Local-PLLL, R_local contains the last two diagonal blocks R_{d−1,d−1} and R_{d,d}, and the block index i equals d−1. Only d−1 loops are needed for i to move forward from i = 1 to i = d−1 if there are no backward moves. In general there may be some backward moves, say s of them, and each backward move must be undone by an extra forward move. Thus the total number of moves of R_local, i.e., the total number of loops, is 2s + d − 1.


The remaining problem is to determine an upper bound on s, the number of times the block index i moves backward during the execution of LRBLLL. Suppose that in some loop other than the first, the Lovász condition does not hold for columns (i−1)k and (i−1)k+1, so the algorithm moves one block back and the block index i is decreased by one. At the beginning of this loop, however, the Lovász condition held for columns (i−1)k and (i−1)k+1. Then the Local-PLLL subroutine must have modified column (i−1)k+1 of R in this loop. To modify column (i−1)k+1, which is the first column of the current R_local, Local-PLLL must perform at least k permutations: since the subroutine Local-PLLL starts with column k+1 of R_local (see Section 3.1.3), it takes at least k permutations to get back to the first column from column k+1. Thus if the block index i is decreased in a loop, at least k permutations take place in Local-PLLL in that loop. Suppose there are p permutations in total in LRBLLL before convergence. Then s, i.e., the number of loops in which i is decreased, is bounded above by p/k.

The cost of LRBLLL is then obtained as follows. The QR factorization with minimum column pivoting takes O(mn²) arithmetic operations [16, Section 5.2]. In Local-PLLL, a permutation causes at most O(k²) arithmetic operations for the subsequent updating and size-reduction. In each loop after Local-PLLL is called, the block updating of R takes O(nk²) operations. The subroutine BPSR takes O(n²k) operations in the worst case in each loop, and the block size-reduction subroutine at the end of the algorithm takes O(n³) operations. From the above, there are p permutations and 2s + d − 1 loops, so the cost of LRBLLL is
\[ C_{\mathrm{LRBLLL}} = O(mn^2) + p\,O(k^2) + (2s + d - 1)\,O(n^2k + nk^2) + O(n^3). \]

Notice that p is bounded above by O(n³ + n² log_{1/δ}(β/α)), so s is bounded above by O(dn² + dn log_{1/δ}(β/α)). The total cost of LRBLLL is therefore bounded above by O(mn² + n⁵ + n⁴ log_{1/δ}(β/α)). This bound is the same as the bounds for LLL and PLLL.
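For completeness, the substitution behind this total is the following (using d = n/k and k ≤ n):
\begin{align*}
p\,O(k^2) &= O\big(n^3k^2 + n^2k^2\log_{1/\delta}(\beta/\alpha)\big)
           = O\big(n^5 + n^4\log_{1/\delta}(\beta/\alpha)\big),\\
(2s+d-1)\,O(n^2k + nk^2) &= O\big(dn^2 + dn\log_{1/\delta}(\beta/\alpha)\big)\,O(n^2k)
           = O\big(n^5 + n^4\log_{1/\delta}(\beta/\alpha)\big),
\end{align*}
since dn² · n²k = (n/k) n⁴ k = n⁵. Adding the O(mn²) QR cost and the O(n³) final size-reduction gives the stated total.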

Table 3–1 lists the costs of the important processes and the total cost of LRBLLL.

Table 3–1: Complexity analysis of the LRBLLL reduction algorithm

  Process                                    Bound
  Cost of QR factorization                   O(mn²)
  Cost of one permutation in Local-PLLL      O(k²)
  Cost of block updating in one loop         O(nk²)
  Cost of size-reduction in one loop         O(n²k)
  Cost of final block size-reduction         O(n³)
  Number of permutations p                   O(n³ + n² log_{1/δ}(β/α))
  Number of loops 2s + d − 1                 O(dn² + dn log_{1/δ}(β/α))
  Total cost of the algorithm                O(mn² + n⁵ + n⁴ log_{1/δ}(β/α))

3.3 Alternating Partition Block LLL Reduction Algorithm

In this section we propose an alternating partition block LLL (APBLLL) reduction algorithm which is easier to parallelize. The complexity analysis of APBLLL is also given.

3.3.1 Partition and Block Operation

The LRBLLL algorithm essentially mimics PLLL: it works on the matrix from left to right, and may move forward or backward during the procedure. In the new alternating partition block LLL reduction algorithm, instead of moving a single working window forward and backward, we proceed in a different way.

Figure 3–1: Partition 1 of matrix R. [R is partitioned into 4×4 blocks, each of size k×k.]

Figure 3–2: Partition 2 of matrix R. [R is partitioned into 3×3 blocks; the first and last diagonal blocks have size 1.5k×1.5k and the middle diagonal block has size k×k.]

We first perform BQRMCP on B ∈ R^{m×n} (see Algorithm 3.1):
\[ BP = Q_1 R, \]
where Q₁ ∈ R^{m×n} has orthonormal columns, R ∈ R^{n×n} is upper triangular and P ∈ Z^{n×n} is a permutation matrix.

Next we use an example to show how APBLLL works iteratively with the two alternating partitions shown in Figure 3–1 and Figure 3–2.

In the first iteration, R is partitioned into 4×4 blocks, each of size k×k (see Figure 3–1). This partition is referred to as partition 1 for convenience. We then work on the blocks of partition 1. First we perform Local-PLLL (Algorithm 3.3) on R11, and update R12, R13 and R14 by the Q generated by Local-PLLL. Second, we perform Local-PLLL on R22, update R23 and R24 by the Q generated by this Local-PLLL, and update R12 by the Z also generated by this Local-PLLL; then BPSR (Algorithm 3.4) is applied to R12 to do the partial size-reduction. Third, we perform Local-PLLL on R33, update R34 by the Q generated by the current Local-PLLL, and update R13 and R23 by the Z also generated by the current Local-PLLL; then BPSR is applied to the block column formed by R13 and R23. Fourth, we perform Local-PLLL on R44, update R14, R24 and R34 by the Z generated by the current Local-PLLL, and apply BPSR to the block column formed by R14, R24 and R34. At this point the first iteration has finished, and all the diagonal blocks R11, R22, R33 and R44 are PLLL reduced.

In the second iteration, we repartition R into 3×3 blocks (see Figure 3–2), with the block sizes indicated in the figure. This repartition is referred to as partition 2. We do exactly the same for the blocks of partition 2 as we did in the first iteration. After the second iteration, the diagonal blocks R11, R22 and R33 of partition 2 are PLLL reduced.

In the following iterations, the same process is performed with partition 1 and partition 2 alternately, until no permutation takes place within an iteration. At that point, it is easy to see that R is PLLL reduced. Then an extra block size-reduction (Algorithm 3.2) is applied to R. After this final size-reduction, R is LLL reduced and the algorithm ends.

The two alternating partitions of R for the general case are given as follows. Assume the block size is k and n = dk. Partition 1 partitions R into d×d blocks:
\[ R = \begin{bmatrix} R_{11} & \cdots & R_{1d} \\ & \ddots & \vdots \\ & & R_{dd} \end{bmatrix} \in \mathbb{R}^{n \times n}, \qquad R_{ij} \in \mathbb{R}^{k \times k}, \quad 1 \le i \le j \le d. \]

And partition 2 partitions R into (d−1)×(d−1) blocks:
\[ R = \begin{bmatrix} R_{11} & \cdots & R_{1,d-1} \\ & \ddots & \vdots \\ & & R_{d-1,d-1} \end{bmatrix} \in \mathbb{R}^{n \times n}, \]
where
\[ R_{11} \in \mathbb{R}^{1.5k \times 1.5k}, \quad R_{1,d-1} \in \mathbb{R}^{1.5k \times 1.5k}, \quad R_{d-1,d-1} \in \mathbb{R}^{1.5k \times 1.5k}, \]
\[ R_{1,v} \in \mathbb{R}^{1.5k \times k}, \quad R_{u,d-1} \in \mathbb{R}^{k \times 1.5k}, \quad R_{u,v} \in \mathbb{R}^{k \times k}, \quad 1 < u \le v < d-1. \]
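For concreteness, here is a small MATLAB helper, ours rather than the thesis's, that returns the block boundaries of the two partitions; it assumes n = dk with d ≥ 3 and k even, so that 1.5k is an integer:

% Block boundaries of the two alternating partitions (assumed helper; names
% are ours).  Block i occupies columns edges(i)+1 : edges(i+1).
function edges = partition_edges(n, k, which)
    if which == 1
        edges = 0:k:n;                       % d blocks of size k
    else
        edges = [0, 1.5*k : k : n-1.5*k, n]; % sizes 1.5k, k, ..., k, 1.5k
    end
end

For n = 8 and k = 2, partition_edges gives [0 2 4 6 8] for partition 1 and [0 3 5 8] for partition 2, matching Figures 3–1 and 3–2. The half-block offset is what makes the two partitions overlap, so that column swaps near the boundary of one partition's blocks fall inside a block of the other partition.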

The alternating partition block LLL reduction algorithm is given as follows.

Algorithm 3.6. (Alternating Partition Block LLL Reduction) Given a full column rank matrix B ∈ R^{m×n} and a block size k (assume n is a multiple of k, i.e., n = dk), this algorithm computes the LLL reduction B = Q₁RZ^{−1}, where Q₁ has orthonormal columns, R is upper triangular and LLL reduced, and Z is unimodular. In the algorithm, we assume Z is partitioned into blocks in the same way as R. We use A_{i1:i2,j1:j2} to denote the sub-matrix formed by block rows i1 to i2 and block columns j1 to j2 of A.

function: [R, Z] = APBLLL(B, k)

// Compute the block QR factorization using Algorithm 3.1
1: [R, Z] = BQRMCP(B, k)
2: d := n/k, f := 0
3: for i = 1 : d do
4:   change_i := 1, nextChange_i := 1
5: end for
6: while (1) do
7:   Partition R into blocks using partition 1 or partition 2 alternately
8:   for i = 1 : d do (for partition 2: i = 1 : d−1; we assume partition 1 is used in the following description)
9:     if change_i ≠ 1 then
10:      continue
11:    end if
       // Apply Local-PLLL to the diagonal block using Algorithm 3.3
12:    [Q, R_{ii}, Z̄, r] = Local-PLLL(R_{ii}, f)
13:    if Z̄ = I then
         // The diagonal block is unchanged, and updates are not needed
14:      continue
15:    end if
       // Perform the corresponding updates
16:    nextChange_{max(1,i−1)} := 1, nextChange_i := 1
       // Block updating
17:    Z_{1:d,i} := Z_{1:d,i} Z̄
18:    R_{1:i−1,i} := R_{1:i−1,i} Z̄
19:    R_{i,i+1:d} := Qᵀ R_{i,i+1:d}
       // Size-reduce the corresponding columns of R_{1:i−1,i} using Algorithm 3.4
20:    [R_{1:i−1,i}, Z̄] = BPSR(R_{1:i−1,i}, R_{1:i−1,1:i−1}, r)
21:    Z_{1:d,i} := Z_{1:d,i} + Z_{1:d,1:i−1} Z̄
22:  end for
23:  if nextChange = 0 then
       // Exit when no permutation has been applied
24:    break
25:  end if
26:  f := 1
27:  for i = 1 : d do
28:    change_i := nextChange_i, nextChange_i := 0
29:  end for
30: end while
// Size-reduce R using Algorithm 3.2
31: [R, Z̄] = BSR(R)
32: Z := Z Z̄

Notice that the two vectors change and nextChange are used to track whether the diagonal blocks still need to be PLLL reduced in each iteration. If two adjacent diagonal blocks are unchanged in an iteration, then in the next iteration we do not apply Local-PLLL to the diagonal block whose diagonal entries come from those two unchanged diagonal blocks, since that diagonal block must already be PLLL reduced. Also notice that if the Local-PLLL output matrix Z̄ is an identity matrix, we do not apply the block updating and BPSR to the relevant blocks, for efficiency. A minimal sketch of this flag bookkeeping is given below.
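The sketch uses a toy stand-in for the outcome of Local-PLLL; all names are ours.

% Toy sketch of the change/nextChange flags of Algorithm 3.6.  Whether a
% block actually changes is faked here (blockChanged); in APBLLL it is
% decided by whether Local-PLLL returns a non-identity Z.
d = 6;
change = true(1, d); nextChange = false(1, d);
iter = 0;
while true
    for i = 1:d
        if ~change(i), continue; end           % skip blocks known to be reduced
        blockChanged = (iter == 0 && mod(i, 2) == 0);  % toy stand-in
        if blockChanged
            nextChange(max(1, i-1)) = true;    % neighbour must be re-examined
            nextChange(i) = true;
        end
    end
    if ~any(nextChange), break; end            % no permutations: PLLL reduced
    change = nextChange; nextChange(:) = false;
    iter = iter + 1;
end
fprintf('converged after %d iteration(s)\n', iter + 1);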

    3.3.2 Complexity Analysis

The APBLLL algorithm shares the same QR and final size-reduction parts as LRBLLL, so the costs of these two parts are the same as in LRBLLL: O(mn²) arithmetic operations for the QR factorization and O(n³) arithmetic operations for the final size-reduction. The cost of the rest of APBLLL is divided into two parts: the cost of the subroutine Local-PLLL, and the cost outside Local-PLLL, i.e., the block updating and the block partial size-reductions. These two parts are calculated separately.

Since APBLLL uses the same permutation criterion as LLL (Algorithm 2.1), Lemma 2.1 also applies to APBLLL. Thus the total number of permutations p taking place in the Local-PLLL reductions is bounded above by O(n³ + n² log_{1/δ}(β/α)). In Local-PLLL, a permutation causes at most O(k²) arithmetic operations for the subsequent updating and size-reductions. Thus, all the calls to the subroutine Local-PLLL cost O(n³k² + n²k² log_{1/δ}(β/α)) arithmetic operations.

In APBLLL, the block updating and BPSR in lines 17–21 are performed only if the output matrix Z̄ of Local-PLLL is not the identity, i.e., only if some permutations take place during that execution of Local-PLLL. Because the total number of permutations is p, there are at most p calls to Local-PLLL that produce a non-identity Z̄. So in the worst case the block updating and BPSR are executed p times. Each execution of the block updating and BPSR causes at most O(n²k) arithmetic operations, so the total cost of the block updating and BPSR is p · O(n²k) in the worst case.

From the above, the total cost of APBLLL is obtained by adding the costs of all the parts together:
\[ C_{\mathrm{APBLLL}} = O(mn^2) + p\,O(k^2) + p\,O(n^2k) + O(n^3) = O\big(mn^2 + n^5k + n^4k \log_{1/\delta}(\beta/\alpha)\big). \]

This bound is larger than the bounds of LRBLLL, PLLL and LLL. However, the simulation results show that APBLLL performs better than LLL and PLLL and similarly to LRBLLL. The simulation results and analysis of the two block LLL reduction algorithms are given in the next section.

Table 3–2 lists the costs of the important processes and the total cost of APBLLL.

Table 3–2: Complexity analysis of the APBLLL reduction algorithm

  Process                                               Bound
  Cost of QR factorization                              O(mn²)
  Cost of one permutation in Local-PLLL                 O(k²)
  Cost of block updating and size-reduction
    for one diagonal block                              O(n²k)
  Cost of final block size-reduction                    O(n³)
  Number of permutations p                              O(n³ + n² log_{1/δ}(β/α))
  Total cost of the algorithm                           O(mn² + n⁵k + n⁴k log_{1/δ}(β/α))

    3.4 Simulation Results and Comparison of Algorithms

The simulations are performed in MATLAB on two types of machines. One has MATLAB 7.12.0 on a 64-bit Ubuntu 11.10 system with 4 Intel Xeon(R) W3530 2.8 GHz processors and 5 GB of memory. The other has MATLAB 7.13.0 on a 64-bit Red Hat 6.2 system with 64 AMD Opteron(TM) 2.2 GHz processors and 64 GB of memory. Our simulations use conventional MATLAB, not Parallel MATLAB. MATLAB uses the IEEE double precision model for floating point arithmetic by default; the unit round-off for double precision is about 10⁻¹⁶. We compare four algorithms, i.e., the original LLL algorithm (Algorithm 2.1), the PLLL+ algorithm, the LRBLLL algorithm (Algorithm 3.5), and the APBLLL algorithm (Algorithm 3.6). The PLLL+ algorithm is the PLLL algorithm (Algorithm 2.3) with an extra size-reduction procedure to guarantee that the resulting matrix is size-reduced. All these four algorithms produce LLL reduced matrices. We compare the CPU run time, the flops, and the relative backward errors
\[ \frac{\|B - Q_c R_c Z_c^{-1}\|_F}{\|B\|_F} \]
of the four algorithms, where Q_c is the computed orthogonal matrix, R_c is the computed LLL reduced matrix and Z_c^{−1} is the unimodular matrix formed by the inverses of the computed permutation matrix and IGTs. The run time is measured in two separate parts, the run time for the QR factorization and the run time for the rest of each algorithm (for simplicity, we call this part the reduction), in order to observe how the blocking technique performs in each part.
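In MATLAB, this backward error can be computed as below; this is a sketch with stand-in factors (our names), and applying Z_c^{−1} from the right via mrdivide avoids forming the inverse explicitly:

% Sketch (our names): relative backward error ||B - Qc*Rc/Zc||_F / ||B||_F,
% where Qc, Rc, Zc stand in for the computed factors of an LLL reduction.
B = randn(50);
[Qc, Rc] = qr(B); Zc = eye(50);       % stand-ins for the computed factors
err = norm(B - (Qc * Rc) / Zc, 'fro') / norm(B, 'fro');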

In the simulation, we test three cases of matrices B ∈ R^{n×n} with n = 100 : 50 : 1000. The square matrices B are generated as follows.

Case 1: B is generated by the MATLAB function randn: B = randn(n, n), i.e., each element follows the normal distribution N(0, 1).

Case 2: B = USVᵀ, where U and V are randomly generated orthogonal matrices and S is the diagonal matrix with
\[ S(i, i) = 10^{4(i-1)/(n-1)}, \quad i = 1, \ldots, n. \]

Case 3: B = USVᵀ, where U and V are randomly generated orthogonal matrices and S is the diagonal matrix with
\[ S(i, i) = 1000, \quad i = 1, \ldots, \lfloor n/2 \rceil, \qquad S(i, i) = 0.1, \quad i = \lfloor n/2 \rceil + 1, \ldots, n. \]

Case 1 gives the most typical test matrices for numerical computations. Cases 2 and 3 are intended to show the reduction speed when the condition number is fixed at 10⁴. Case 3 also shows that the block algorithms gain more efficiency in the reduction part when the reduction takes a long time to run.
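The thesis does not list its generation code; the following MATLAB sketch shows one standard way to produce the three cases, where the choice of obtaining random orthogonal factors from the QR factorization of Gaussian matrices is our assumption:

% One way to generate the three test cases (sketch; random orthogonal
% factors via QR of Gaussian matrices is our assumption).
n = 200;
B1 = randn(n);                                 % Case 1: i.i.d. N(0,1) entries
[U, ~] = qr(randn(n)); [V, ~] = qr(randn(n));  % random orthogonal U and V
s2 = 10.^(4 * (0:n-1) / (n-1));                % Case 2: cond(B2) = 10^4
B2 = U * diag(s2) * V';
s3 = [1000 * ones(1, round(n/2)), 0.1 * ones(1, n - round(n/2))];
B3 = U * diag(s3) * V';                        % Case 3: cond(B3) = 10^4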

For each dimension in all cases, we randomly generate 20 different matrices for the test. We use only 20 simulation runs because LLL is very time consuming; the box plots below show that the behaviors of the algorithms are stable across runs, so 20 runs are enough for our simulation. For the block algorithms, the optimal block size may vary with the dimension of the matrix; in the simulation, a fixed block size of 32 is adopted for matrices of all dimensions for simplicity. In the average QR/reduction run time plots, the y-axis is the average run time (seconds) over the 20 matrices, and the x-axis is the dimension. In the average flops plots, the y-axis is the average flops and the x-axis is the dimension. In the average relative backward error plots, the y-axis is the relative backward error and the x-axis is the dimension.

We also test matrices with various condition numbers and give the results in the corresponding plots. In these plots, the y-axis is the average QR/reduction run time, the average flops or the average relative backward error over 20 matrices of dimension 200 in case 2, and the x-axis is the matrix condition number, from 10¹ to 10⁶. Box plots of the run time and relative backward errors of all three cases with dimension 200 are also drawn. In the box plots, the y-axis is either the algorithm run time or the relative backward error, and the x-axis lists the four algorithms, i.e., LLL, PLLL+, LRBLLL and APBLLL.

The simulation results obtained on the Intel processors are shown in Figures 3–3, 3–4 and 3–5 for the overall performance in the three cases, in Figure 3–6 for case 2 with different condition numbers, and in Figure 3–7 for the box plots of all the cases. The results obtained on the AMD processors are shown in Figures 3–8, 3–9, 3–10, 3–11 and 3–12, respectively. For the overall performance of each case, we give six plots. The two plots in the first row are the average run time of the QR factorization and the average reduction run time of LLL; LLL runs much longer than the other three algorithms, so we put it in individual plots in order to compare the other three algorithms more easily. The two plots in the middle row are the average QR/reduction run times for PLLL+, LRBLLL and APBLLL. The two plots in the bottom row are the average flops and the average relative backward errors for LLL, PLLL+, LRBLLL and APBLLL. For case 2 with different condition numbers, we also give six plots, ordered in the same way as the overall performance plots. For the box plot figure, we give six plots: the three plots in the left column are the run times for the three cases, and the three plots in the right column are the relative backward errors for the three cases.

From the simulation results, we can draw the following observations and conclusions.

1. Comparing the results between the two machines (Intel and AMD), we observe that the performance of the four algorithms is consistent across the two machines.

2. Comparing the run times of the different algorithms, we find that LLL is the slowest of the four. LRBLLL is as fast as APBLLL, and both are faster than PLLL+ in all three cases. So on average the computational CPU times of the four algorithms have the following order: LLL > PLLL+ > LRBLLL ≈ APBLLL.

7. In Figures 3–6 and 3–11, the tests on matrices with various condition numbers show that the QR time is not affected by the condition number of the matrix, while the reduction time, the flops and the relative backward errors of the four algorithms increase as the condition number increases.

8. The box plots show that the behaviors of LLL, PLLL+, LRBLLL and APBLLL are stable over different simulation runs.


Figure 3–3: Performance comparison for Case 1, Intel. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–4: Performance comparison for Case 2, Intel. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–5: Performance comparison for Case 3, Intel. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–6: Performance comparison for Case 2 with dimension 200, Intel. [Six plots: QR and reduction run times, flops, and relative backward errors versus condition number (10¹ to 10⁶) for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–7: Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle) and Case 3 (bottom) with dimension 200, Intel.

Figure 3–8: Performance comparison for Case 1, AMD. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–9: Performance comparison for Case 2, AMD. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–10: Performance comparison for Case 3, AMD. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–11: Performance comparison for Case 2 with dimension 200, AMD. [Six plots: QR and reduction run times, flops, and relative backward errors versus condition number (10¹ to 10⁶) for LLL, PLLL+, LRBLLL and APBLLL.]