
Two Floating Point Block LLL Reduction Algorithms

Yancheng Xiao

Master of Science

School of Computer Science
McGill University
Montreal, Quebec

September 2012

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Master of Science in Computer Science

© Yancheng Xiao 2012

DEDICATION

This document is dedicated to my beloved parents.

ACKNOWLEDGEMENTS

I have been indebted in my postgraduate study and research, and especially in the preparation of this thesis, to my supervisor Prof. Xiao-Wen Chang of the School of Computer Science at McGill University, whose academic guidance and financial support, given with patience and kindness, have been invaluable to me. I am grateful to Prof. Clark Verbrugge for kindly lending his AMD high-concurrency machine, which was useful in testing the performance of our block LLL reduction algorithms. I would like to thank all my lab mates in the Scientific Computing Lab of the School of Computer Science, Mazen Al Borno, Stephen Breen, Xi Chen, Sevan Hanssian, Wen-Yang Ku, Wanru Lin, Milena Scaccia, David Titley-Peloquin, Jinming Wen and Xiaohu Xie, for the pleasant collaboration during my study and research. Thanks also to all my friends and my boyfriend Bin Zhu for their various help with my study and life in Montreal.

ABSTRACT

The Lenstra, Lenstra and Lovász (LLL) reduction is the most popular lattice reduction and is a powerful tool for solving many complex problems in mathematics and computer science. The blocking technique casts matrix algorithms in terms of matrix-matrix operations to permit efficient reuse of data in the algorithms. In this thesis, we use the blocking technique to develop two floating point block LLL reduction algorithms, the left-to-right block LLL (LRBLLL) reduction algorithm and the alternating partition block LLL (APBLLL) reduction algorithm, and give the complexity analysis of these two algorithms. We compare these two block LLL reduction algorithms with the original LLL reduction algorithm (in floating point arithmetic) and the partial LLL (PLLL) reduction algorithm from the literature in terms of CPU run time, flops and relative backward error. The simulation results show that the overall CPU run times of the two block LLL reduction algorithms are shorter than that of the partial LLL reduction algorithm and much shorter than that of the original LLL, even though the two block algorithms cost more flops than the partial LLL reduction algorithm in some cases. The shortcoming of the two block algorithms is that they may sometimes not be as numerically stable as the original and partial LLL reduction algorithms. The parallelization of APBLLL is discussed.

ABRÉGÉ

The Lenstra, Lenstra and Lovász (LLL) reduction is the most popular lattice reduction, and it is a powerful tool for solving many complex problems in mathematics and computer science. The blocking technique recasts the algorithms in terms of matrix-matrix operations to permit efficient reuse of data in the block LLL algorithms. In this thesis, we use the blocking technique to develop two floating point block LLL reduction algorithms, the left-to-right block LLL reduction algorithm (LRBLLL) and the alternating partition block LLL reduction algorithm (APBLLL), and give the complexity analysis of these two algorithms. We compare these two block LLL reduction algorithms with the original LLL reduction algorithm (in floating point arithmetic) and the partial LLL (PLLL) reduction algorithm from the literature in terms of CPU run time, flops and relative backward errors. The simulation results show that the CPU run times of the two block LLL reduction algorithms are shorter than that of the partial LLL reduction algorithm and much shorter than that of the original LLL reduction, even though the two block algorithms cost more flops than the partial LLL reduction algorithm in certain cases. The drawback of these two block algorithms is that they may sometimes not be as numerically stable as the original and partial LLL reduction algorithms. The parallelization of APBLLL is discussed.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
ABRÉGÉ
LIST OF TABLES
LIST OF FIGURES

1 Introduction
    1.1 Lattice Reduction
    1.2 Contributions and Organization of the Thesis

2 Introduction to LLL Reduction Algorithms
    2.1 LLL Reduction
    2.2 Original LLL Reduction Algorithm
        2.2.1 Size-Reductions
        2.2.2 Permutations
        2.2.3 Complexity Analysis
    2.3 Partial LLL Reduction Algorithm
        2.3.1 Householder QR Factorization with Minimum Column Pivoting
        2.3.2 Partial Size-Reduction and Givens Rotation

3 Block LLL Reduction Algorithms
    3.1 Subroutines of Block LLL Reduction Algorithms
        3.1.1 Block Householder QR Factorization with Minimum Column Pivoting
        3.1.2 Block Size-Reduction
        3.1.3 Local Partial LLL Reduction
        3.1.4 Block Partial Size-Reduction
    3.2 Left-to-Right Block LLL Reduction Algorithm
        3.2.1 Partition and Block Operation
        3.2.2 Complexity Analysis
    3.3 Alternating Partition Block LLL Reduction Algorithm
        3.3.1 Partition and Block Operation
        3.3.2 Complexity Analysis
    3.4 Simulation Results and Comparison of Algorithms

4 Parallelization of Block LLL Reduction
    4.1 Parallel Methods for LLL Reduction
    4.2 A Parallel Block LLL Reduction Algorithm
        4.2.1 Parallel Diagonal Block Reduction and Block Updating
        4.2.2 Parallel Block Size-Reduction
    4.3 Performance Evaluation of Parallel Algorithm

5 Conclusion and Future Work

References

LIST OF TABLES

Table 3-1: Complexity analysis of the LRBLLL reduction algorithm
Table 3-2: Complexity analysis of the APBLLL reduction algorithm

LIST OF FIGURES

Figure 1-1: A lattice in 2 dimensions
Figure 3-1: Partition 1 of matrix R
Figure 3-2: Partition 2 of matrix R
Figure 3-3: Performance comparison for Case 1, Intel
Figure 3-4: Performance comparison for Case 2, Intel
Figure 3-5: Performance comparison for Case 3, Intel
Figure 3-6: Performance comparison for Case 2 with dimension 200, Intel
Figure 3-7: Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, Intel
Figure 3-8: Performance comparison for Case 1, AMD
Figure 3-9: Performance comparison for Case 2, AMD
Figure 3-10: Performance comparison for Case 3, AMD
Figure 3-11: Performance comparison for Case 2 with dimension 200, AMD
Figure 3-12: Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle), Case 3 (bottom) with dimension 200, AMD
Figure 4-1: Task allocation for three processors (P1, P2, P3)
Figure 4-2: Approximating Parallel Simulation

CHAPTER 1
Introduction

1.1 Lattice Reduction

A set $\mathcal{L}$ in the real vector space $\mathbb{R}^m$ is referred to as a lattice if there exists a set of linearly independent vectors $b_1, b_2, \ldots, b_n \in \mathbb{R}^m$ such that

$$\mathcal{L} = \sum_{j=1}^{n} \mathbb{Z} b_j = \Big\{ \sum_{j=1}^{n} z_j b_j \;\Big|\; z_j \in \mathbb{Z},\ 1 \le j \le n \Big\}.$$

The set $\{b_1, b_2, \ldots, b_n\}$ is a basis of the lattice $\mathcal{L}$. The dimension of the lattice is defined to be $n$. The matrix $B = [b_1, b_2, \ldots, b_n]$ is referred to as the lattice basis matrix which generates $\mathcal{L}$, also written as $\mathcal{L}(B)$.

Geometrically, a lattice can be viewed as the set of intersection points of an infinite grid, as shown in Figure 1-1. The lines of the grid need not be orthogonal to each other. The same lattice may have different bases. For example, in Figure 1-1, $\{b_1, b_2\}$ is a basis of the lattice, and $\{c_1, c_2\}$ is also a basis.

Figure 1-1: A lattice in 2 dimensions

Suppose that we have two basis matrices $B$ and $C$. If they generate the same lattice, $\mathcal{L}(B) = \mathcal{L}(C)$, we say that $B$ and $C$ are equivalent. Two basis matrices $B, C \in \mathbb{R}^{m \times n}$ are equivalent if and only if there exists a unimodular matrix $Z \in \mathbb{Z}^{n \times n}$ (i.e., an integer matrix with determinant $\det(Z) = \pm 1$) such that $C = BZ$; see [25, p. 4].
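To make the equivalence criterion concrete, here is a small NumPy illustration with made-up matrices (our example, not from the thesis):

```python
import numpy as np

# Two equivalent bases: C = B Z with Z unimodular (integer, det = +-1).
B = np.array([[2.0, 1.0],
              [0.0, 3.0]])          # a basis matrix of a lattice in R^2
Z = np.array([[1, 4],
              [0, 1]])              # integer matrix with det(Z) = 1
C = B @ Z                           # an equivalent basis of the same lattice

print(round(np.linalg.det(Z)))      # 1, so Z is unimodular
# Every point of L(C) lies in L(B): C z = B (Z z), and Z z is integral.
z = np.array([2, -1])
print(C @ z, B @ (Z @ z))           # the same lattice point twice
```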

Lattice basis reduction transforms a given lattice basis into a basis with short and nearly orthogonal basis vectors. There are several kinds of lattice reduction based on different criteria for the resulting basis, such as the Gaussian reduction [12, Chapter 6.1], the Minkowski reduction [26, 27], the Korkine and Zolotarev (KZ) reduction [21] and the Lenstra, Lenstra and Lovász (LLL) reduction [22].

Lattice reduction is a powerful tool for solving many complex problems in mathematics and computer science, especially problems dealing with integers, such as integer programming [1, 20], factoring polynomials with rational coefficients [22], integer factoring [34] and cryptography [15].

The LLL reduction is the most popular lattice reduction. The LLL reduction algorithm given in [22] and its variants have polynomial time complexity. It is widely used for applications such as factoring polynomials [22], subset sum problems [37], digital communications [23, 24, 28, 29, 39], shortest vector problems (SVP) [25] and closest vector problems (CVP), the latter also referred to as integer least squares (ILS) problems [2, 4, 9, 10, 17].

Generally, we can classify LLL reduction algorithms into three categories. The first category includes exact integer arithmetic LLL reduction algorithms, with both input and output bases being integral. For example, the original LLL algorithm given in [22] is in this category.

The second category includes algorithms such as those in [30, 35, 36], which use not only integer arithmetic but also floating point arithmetic. The input and output bases in this category are also integral. The reason for using floating point arithmetic is that integer arithmetic is expensive. These algorithms use sufficiently long floating point numbers to approximate the intermediate results, so that the rounding errors do not lead to an output basis which is not exactly LLL reduced.

The applications of the first and second categories include factoring polynomials [22], subset sum problems [37] and public-key cryptanalysis [15].

The third category includes floating point algorithms with both input and output bases being real. This category applies to cases where exact integer arithmetic is not required and where a nearly LLL reduced basis is acceptable, such as the ILS problems which arise in GPS, e.g., [9, 10, 11, 17, 43], and in multi-input multi-output (MIMO) communications, e.g., [24, 42]. An algorithm in this category therefore does not require the strict floating point error control of the algorithms in the second category, and it is much more efficient than those in categories one and two.

1.2 Contributions and Organization of the Thesis

The goal of this thesis is to propose efficient and reliable floating point algorithms for the LLL reduction of real basis matrices by using the blocking technique [14, Chapter 5]. The algorithms are based on the original LLL reduction algorithm [22] and the partial LLL (PLLL) reduction algorithm [43].

The computation speed of a matrix algorithm is determined not only by the number of floating point operations involved, but also by the amount of memory traffic, i.e., the movement of data between memory and registers. The level 3 basic linear algebra subprograms (BLAS) are designed to reduce this data movement. The matrix-matrix operations implemented in level 3 BLAS make efficient reuse of data residing in cache or local memory to avoid excessive data movement. The blocking technique casts the algorithms in terms of matrix-matrix operations to permit this efficient reuse of data.

Two block LLL reduction algorithms utilizing this blocking technique are proposed in this thesis, together with their complexity analysis. Numerical simulations compare the performance of our block algorithms, in terms of CPU time, flops and numerical stability, with the original LLL reduction algorithm and the PLLL reduction algorithm. On average the block algorithms are computationally faster than PLLL and LLL, although their numerical stability may need improvement in some cases.

The parallelization of one of the two block LLL reduction algorithms is discussed in two parts: the parallelization of the block size-reduction and the parallelization of the diagonal block reduction. Complexity analysis shows that the parallelized size-reduction part can obtain a speedup of $n_p$ in ideal cases, if $n_p$ processors are used. The improvement of the parallelized diagonal block reduction part is hard to observe from the complexity analysis, since the complexity bound is too pessimistic. A simple test is designed to examine the performance of the parallelized diagonal block reduction part. The test result shows that the parallelized diagonal block reduction part can obtain a speedup of 4.8 with 5 processors in the best situations.

The rest of the thesis is organized as follows. In Chapter 2, we first give the definition of the LLL reduction. Then a description of the original LLL reduction algorithm in matrix language is given, followed by its complexity analysis. In the last section of that chapter, we introduce the partial LLL (PLLL) reduction algorithm.

In Chapter 3, we first apply the blocking technique to the components of the PLLL algorithm, leading to block subroutines. Then two block LLL algorithms are proposed based on these block subroutines. We give the complexity analysis of the block algorithms under the assumption of exact arithmetic. Finally, simulation results are presented, compared and discussed.

In Chapter 4, we first review the literature on parallel LLL algorithms. Then we discuss the parallelization of one of our two block algorithms.

Chapter 5 gives conclusions and future work.

We now describe the notation used in the thesis. The sets of all real and integer $m \times n$ matrices are denoted by $\mathbb{R}^{m \times n}$ and $\mathbb{Z}^{m \times n}$, respectively, and the sets of real and integer $n$-vectors are denoted by $\mathbb{R}^n$ and $\mathbb{Z}^n$, respectively. Upper case letters are used to denote matrices and bold lower case letters are used to denote vectors. The identity matrix is denoted by $I$ and its $i$-th column is denoted by $e_i$. MATLAB notation is used to denote a sub-matrix. Specifically, if $A = (a_{ij}) \in \mathbb{R}^{m \times n}$, then $A(i, :)$ denotes the $i$-th row, $A(:, j)$ denotes the $j$-th column, and $A(i_1\!:\!i_2, j_1\!:\!j_2)$ denotes the sub-matrix formed by rows $i_1$ to $i_2$ and columns $j_1$ to $j_2$. For the $(i, j)$ element of $A$, sometimes we use $a_{ij}$ and sometimes we use $A(i, j)$. For a block matrix $A$, $A_{ij}$ denotes its $(i, j)$ block. For a scalar $z \in \mathbb{R}$, we use $\lfloor z \rceil$ to denote its nearest integer; if there is a tie, $\lfloor z \rceil$ denotes the integer with smaller magnitude. $\det(A)$ is the determinant of $A$. Unless specified otherwise, $\|\cdot\|$ stands for the 2-norm, i.e., $\|a\| = \sqrt{a^T a}$, and $\|\cdot\|_F$ stands for the Frobenius matrix norm, i.e., $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$.
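To illustrate the rounding notation, here is a small Python sketch of this tie-breaking rule (our code, not the thesis's; note that NumPy's rint breaks ties to the even integer instead):

```python
import numpy as np

def round_tie_to_smaller(z):
    """Nearest integer to z; on a tie, return the integer with the
    smaller magnitude, matching the thesis's nearest-integer notation."""
    f = np.floor(z)
    r = z - f                      # fractional part in [0, 1)
    if r > 0.5:
        return int(f) + 1
    if r < 0.5:
        return int(f)
    # exactly halfway between f and f+1: pick the smaller magnitude
    return int(f) if abs(f) < abs(f + 1) else int(f) + 1

print(round_tie_to_smaller(2.5), round_tie_to_smaller(-2.5))   # 2 -2
```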

CHAPTER 2
Introduction to LLL Reduction Algorithms

In this chapter we first give the definition of the Lenstra-Lenstra-Lovász (LLL) reduction. Then we introduce the original LLL reduction algorithm [22] and the partial LLL (PLLL) reduction algorithm [43], which will be the bases of our new LLL reduction algorithms presented in later chapters.

2.1 LLL Reduction

The LLL reduction introduced in [22] can be described as a QRZ matrix factorization:

$$B = Q \begin{bmatrix} R \\ 0 \end{bmatrix} Z^{-1} = Q_1 R Z^{-1},$$

where $B \in \mathbb{R}^{m \times n}$ is a given matrix with full column rank, $Q = [Q_1, Q_2] \in \mathbb{R}^{m \times m}$ with $Q_1 \in \mathbb{R}^{m \times n}$ and $Q_2 \in \mathbb{R}^{m \times (m-n)}$ is orthogonal, $Z \in \mathbb{Z}^{n \times n}$ is unimodular, and $R \in \mathbb{R}^{n \times n}$ is upper triangular and satisfies the two conditions

$$\Big|\frac{r_{ij}}{r_{ii}}\Big| \le \frac{1}{2}, \quad 1 \le i < j \le n, \quad (2.1)$$

$$\delta\, r_{i-1,i-1}^2 \le r_{ii}^2 + r_{i-1,i}^2, \quad 1 < i \le n, \quad (2.2)$$

with the parameter $\delta \in (1/4, 1)$. The conditions (2.1) and (2.2) are named the size-reduction condition and the Lovász condition, respectively. The matrix $BZ$ or the matrix $R$ is said to be LLL reduced.
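As a concrete reading of conditions (2.1) and (2.2), the following small NumPy check (a utility sketch of ours, not part of the thesis) tests whether an upper triangular $R$ is LLL reduced:

```python
import numpy as np

def is_lll_reduced(R, delta=0.75, tol=1e-12):
    """Check the size-reduction condition (2.1) and the Lovasz
    condition (2.2) for an upper triangular matrix R."""
    n = R.shape[1]
    for j in range(n):
        for i in range(j):
            if abs(R[i, j]) > 0.5 * abs(R[i, i]) + tol:               # Eq. (2.1)
                return False
    for i in range(1, n):
        if delta * R[i-1, i-1]**2 > R[i, i]**2 + R[i-1, i]**2 + tol:  # Eq. (2.2)
            return False
    return True
```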

The LLL reduction algorithm in [22] is the most well known lattice basis reduction algorithm with polynomial time complexity; it was originally designed for factoring polynomials with rational coefficients using integer arithmetic operations. Later, the LLL reduction widely extended its applications to number theory (see, e.g., [34, 37]), cryptography (see, e.g., [15, 25]), integer programming (see, e.g., [1, 20]), digital communications (see, e.g., [24]), and GPS (see, e.g., [11, 17]). Some of these applications do not require an exact integer LLL reduced basis, so floating point arithmetic is used to achieve better computational performance in such application areas. One example of a floating point LLL application is to compute a suboptimal solution (e.g., the Babai point [4]) or the optimal solution of an integer least squares (ILS) problem.

In the remaining part of this chapter, the original LLL reduction algorithm and the PLLL reduction algorithm are introduced, and we assume that they use floating point arithmetic.

2.2 Original LLL Reduction Algorithm

We will describe the original LLL reduction algorithm in matrix language (see [44, Algorithm 3.3.1] and [13, Algorithm 2.6.3]). The algorithm involves the Gram-Schmidt orthogonalization (GSO), integer Gauss transformations (IGTs), column permutations and orthogonal transformations. GSO is applied to find the QR factors $Q$ and $R$ of the given matrix $B$. The column permutations and IGTs produce the unimodular matrix $Z$.

In the original exact integer LLL reduction algorithm, a column scaled $Q$ and a row scaled $R$ with unit diagonal entries are computed by a variation of GSO to avoid square root computations. In the floating point LLL reduction algorithm in this thesis, the regular GSO is applied to $B$ and gives the compact form of the QR factorization:

$$B = Q_1 R,$$

where $Q_1 \in \mathbb{R}^{m \times n}$ has orthonormal columns, and $R \in \mathbb{R}^{n \times n}$ is upper triangular.

After the GSO of $B$, integer Gauss transformations, column permutations and GSO are used to transform $R$ into an LLL reduced basis. IGTs are used to perform size-reductions on the off-diagonal entries to achieve (2.1). The column permutations are used to order the columns to achieve (2.2). Since a column permutation destroys the upper triangular structure, GSO is used to recover the upper triangular structure.

2.2.1 Size-Reductions

An integer matrix is called an IGT or an integer Gauss matrix if it has the following form:

$$Z_{ij} = I_n - \zeta e_i e_j^T, \quad i \ne j,\ \zeta \text{ an integer}.$$

Applying $Z_{ij}$ to $R$ from the right gives

$$\bar{R} = R Z_{ij} = R - \zeta R e_i e_j^T.$$

Thus $\bar{R}$ is the same as $R$, except that $\bar{r}_{kj} = r_{kj} - \zeta r_{ki}$, $k = 1, \ldots, i$. By setting $\zeta = \lfloor r_{ij}/r_{ii} \rceil$, the nearest integer to $r_{ij}/r_{ii}$, we ensure $|\bar{r}_{ij}| \le |r_{ii}|/2$.
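For illustration, one IGT step in NumPy (a sketch of ours; note that np.rint breaks ties to even rather than to the smaller magnitude):

```python
import numpy as np

def apply_igt(R, Z, i, j):
    """Apply the IGT Z_ij = I - zeta * e_i e_j^T from the right, so that
    |r_ij| <= |r_ii| / 2 afterwards; Z accumulates the transformations."""
    zeta = np.rint(R[i, j] / R[i, i])
    if zeta != 0:
        R[:i + 1, j] -= zeta * R[:i + 1, i]   # only rows 1..i of column j change
        Z[:, j] -= int(zeta) * Z[:, i]
    return R, Z
```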

2.2.2 Permutations

The column permutations are applied to achieve (2.2). Suppose that the Lovász condition is not satisfied for $i = k$; then a permutation matrix $P_{k-1,k}$ is applied to interchange columns $k-1$ and $k$ of $R$. After the permutation, the upper triangular structure of $R$ is destroyed, so an orthogonal transformation $G_{k-1,k}$ using the GSO technique (see [22]) is performed to re-construct the upper triangular structure of $R$:

$$\bar{R} = G_{k-1,k} R P_{k-1,k},$$

where

$$G_{k-1,k} = \begin{bmatrix} I_{k-2} & & \\ & G & \\ & & I_{n-k} \end{bmatrix}, \quad G = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}, \quad c = \frac{r_{k-1,k}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}, \quad s = \frac{r_{kk}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}.$$

Only columns $k-1$, $k$ and rows $k-1$, $k$ of $R$ are changed by this permutation and orthogonalization process. The diagonal and super-diagonal entries of $R$ which are changed by the process become

$$\bar{r}_{k-1,k-1} = \sqrt{r_{k-1,k}^2 + r_{kk}^2}, \qquad \bar{r}_{k-1,k} = \frac{r_{k-1,k-1}\, r_{k-1,k}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}, \qquad \bar{r}_{k,k} = -\frac{r_{k-1,k-1}\, r_{kk}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}.$$

Thus, if $\delta\, r_{k-1,k-1}^2 > r_{kk}^2 + r_{k-1,k}^2$ with $\delta \in (1/4, 1)$, then the above operations guarantee $\bar{r}_{k-1,k-1}^2 < \delta\, r_{k-1,k-1}^2$.

Based on the above description of size-reductions and permutations, we now describe the procedure of the LLL reduction algorithm. The algorithm iterates through a sequence of stages to satisfy the LLL reduced conditions, and it works on the columns of $R$ from left to right. Define a column stage variable $k$ which indicates that the first $k-1$ columns of $R$ are LLL reduced at the current stage, i.e.,

$$\Big|\frac{r_{ij}}{r_{ii}}\Big| \le \frac{1}{2}, \quad 1 \le i < j \le k-1, \quad (2.3)$$

$$\delta\, r_{i-1,i-1}^2 \le r_{ii}^2 + r_{i-1,i}^2, \quad 1 < i \le k-1. \quad (2.4)$$

At the beginning, $k$ is set to 2. During the reduction procedure, the value of $k$ moves between 2 and $n+1$ and changes by 1 in each step. At stage $k$, the algorithm first uses an integer Gauss transformation to reduce $r_{k-1,k}$. Then it checks whether columns $k-1$ and $k$ need to be permuted according to the Lovász condition. If $\delta\, r_{k-1,k-1}^2 > r_{kk}^2 + r_{k-1,k}^2$, it performs the permutation, applies the corresponding orthogonal transformation, and moves back to stage $k-1$. Otherwise it reduces $r_{i,k}$ ($i = k-2, k-3, \ldots, 1$) by IGTs and moves to the next stage $k+1$. When $k$ reaches $n+1$, the conditions (2.1) and (2.2) are satisfied, the upper triangular matrix $R$ is LLL reduced, and the algorithm stops. The algorithm is given as follows.

Algorithm 2.1. (LLL Reduction) Suppose $B \in \mathbb{R}^{m \times n}$ has full column rank. This algorithm computes the LLL reduction $B = Q_1 R Z^{-1}$, where $Q_1$ has orthonormal columns, $R$ is upper triangular and satisfies the LLL reduced criteria, and $Z$ is unimodular.

function: [R, Z] = LLL(B)
1: apply GSO to obtain B = Q_1 R
2: k := 2, Z := I_n
3: while k ≤ n do
4:   if |r_{k-1,k}/r_{k-1,k-1}| > 1/2 then
       // reduce r_{k-1,k}
5:     ζ := ⌊r_{k-1,k}/r_{k-1,k-1}⌉
6:     Z(1:n, k) := Z(1:n, k) − ζ Z(1:n, k-1)
7:     R(1:k-1, k) := R(1:k-1, k) − ζ R(1:k-1, k-1)
8:   end if
     // δ is a parameter chosen in (1/4, 1)
9:   if δ r_{k-1,k-1}^2 > r_{kk}^2 + r_{k-1,k}^2 then
10:    interchange columns Z(1:n, k) and Z(1:n, k-1)
11:    interchange columns R(1:k, k) and R(1:k, k-1)
12:    triangularize R: R := G_{k-1,k} R
13:    if k > 2 then
14:      k := k − 1
15:    end if
16:  else
       // size-reduction
17:    for i = k-2 : -1 : 1 do
18:      ζ := ⌊r_{i,k}/r_{ii}⌉
19:      Z(1:n, k) := Z(1:n, k) − ζ Z(1:n, i)
20:      R(1:i, k) := R(1:i, k) − ζ R(1:i, i)
21:    end for
22:    k := k + 1
23:  end if
24: end while
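Putting the pieces together, here is a compact NumPy sketch of Algorithm 2.1. It is our illustration under two stated substitutions: numpy.linalg.qr stands in for GSO, and a Givens rotation (as in Section 2.3.2) stands in for the GSO-based re-triangularization:

```python
import numpy as np

def lll_reduce(B, delta=0.75):
    """Floating point LLL reduction sketch: returns (R, Z) with
    B Z = Q1 R, R upper triangular and LLL reduced."""
    B = np.asarray(B, dtype=float)
    R = np.linalg.qr(B, mode='r')
    n = B.shape[1]
    Z = np.eye(n, dtype=np.int64)
    k = 1                                       # 0-based; stage k+1 of the thesis
    while k < n:
        zeta = np.rint(R[k-1, k] / R[k-1, k-1])
        if zeta != 0:                           # reduce r_{k-1,k}
            R[:k, k] -= zeta * R[:k, k-1]
            Z[:, k] -= int(zeta) * Z[:, k-1]
        if delta * R[k-1, k-1]**2 > R[k, k]**2 + R[k-1, k]**2:
            R[:, [k-1, k]] = R[:, [k, k-1]]     # permute columns k-1 and k
            Z[:, [k-1, k]] = Z[:, [k, k-1]]
            r = np.hypot(R[k-1, k-1], R[k, k-1])
            c, s = R[k-1, k-1] / r, R[k, k-1] / r
            G = np.array([[c, s], [-s, c]])
            R[k-1:k+1, k-1:] = G @ R[k-1:k+1, k-1:]   # re-triangularize
            k = max(k - 1, 1)
        else:
            for i in range(k - 2, -1, -1):      # size-reduce the rest of column k
                zeta = np.rint(R[i, k] / R[i, i])
                if zeta != 0:
                    R[:i + 1, k] -= zeta * R[:i + 1, i]
                    Z[:, k] -= int(zeta) * Z[:, i]
            k += 1
    return R, Z
```

The returned R can be sanity-checked with the is_lll_reduced utility sketched in Section 2.1.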

2.2.3 Complexity Analysis

Assume that the operations used in the algorithm are performed in exact arithmetic. The complexity of Algorithm 2.1 is measured by the number of arithmetic operations. Part of the results of this complexity analysis will be used in Chapters 3 and 4. The QR factorization by GSO takes $O(mn^2)$ arithmetic operations [16, Section 5.2]. Next, we analyze the complexity of the while loop in the LLL reduction algorithm. Adding the complexity of the QR factorization and that of the while loop together gives the complexity of the LLL reduction algorithm.

For the complexity of the while loop, we first determine the number of loop iterations and then count the number of arithmetic operations in each iteration.

Lemma 2.1 ([22]): Let $\beta = \max_j \|b_j\|$, and let $\alpha = \min_{x \in \mathbb{Z}^n \setminus \{0\}} \|Bx\|$ be the length of the shortest nonzero vector of the lattice $\mathcal{L}(B)$. The number of permutations involved in Algorithm 2.1 is bounded by $O(n^3 + n^2 \log_{1/\delta}(\beta/\alpha))$ and the algorithm converges.

Proof. We use the proof from [22] and [44, Chapter 3].

After the Gram-Schmidt QR factorization, we obtain the factors $Q_1$ and $R$ in the QR factorization $B = Q_1 R$. Let $R^{(p)}$ denote the upper triangular matrix $R$ after the $p$-th permutation ($R^{(0)} = R$). Define the quantities $w_i$ and $\Phi$ after the $p$-th permutation as

$$w_i^{(p)} = \prod_{j=1}^{i} \big(r_{jj}^{(p)}\big)^2, \quad i = 1, 2, \ldots, n, \quad (2.5)$$

and

$$\Phi^{(p)} = \prod_{i=1}^{n} w_i^{(p)}. \quad (2.6)$$

Suppose the $p$-th permutation is applied to columns $q-1$ and $q$ of the matrix $R^{(p-1)}$ and the orthogonal transformation by GSO is applied to keep the upper triangular structure, as described in the algorithm. We obtain a matrix $R^{(p)}$ with the following properties:

$$r_{jj}^{(p)} = r_{jj}^{(p-1)}, \quad j \ne q-1, q, \qquad \big|r_{q-1,q-1}^{(p)}\, r_{qq}^{(p)}\big| = \big|r_{q-1,q-1}^{(p-1)}\, r_{qq}^{(p-1)}\big|.$$

And by the permutation criterion (see line 9 of Algorithm 2.1) obtained from (2.2), we have $\big(r_{q-1,q-1}^{(p)}\big)^2 < \delta \big(r_{q-1,q-1}^{(p-1)}\big)^2$. Then from (2.5) we obtain

$$w_i^{(p)} = w_i^{(p-1)}, \quad i \ne q-1, \qquad w_{q-1}^{(p)} / w_{q-1}^{(p-1)} < \delta.$$

Substituting these into (2.6) gives

$$\Phi^{(p)} < \delta\, \Phi^{(p-1)}, \quad (2.7)$$

which means that one permutation operation decreases $\Phi$ by at least a factor of $\delta$. Assume that the algorithm involves a total of $p$ permutations before convergence. From (2.7) it follows that

$$\Phi^{(p)} < \delta^p\, \Phi^{(0)},$$

or equivalently

$$p < \log_{1/\delta} \frac{\Phi^{(0)}}{\Phi^{(p)}} = \log_{1/\delta} \Phi^{(0)} - \log_{1/\delta} \Phi^{(p)} = \sum_{i=1}^{n} \log_{1/\delta} w_i^{(0)} - \sum_{i=1}^{n} \log_{1/\delta} w_i^{(p)}. \quad (2.8)$$

Since $\beta = \max_j \|b_j\|$ and $\|b_j\|^2 \ge \big(r_{jj}^{(0)}\big)^2$, we have $\big(r_{jj}^{(0)}\big)^2 \le \beta^2$ ($j = 1, 2, \ldots, n$). Thus from (2.5)

$$w_i^{(0)} \le \beta^{2i}. \quad (2.9)$$

By Theorem I of [7, Chapter II],

$$\alpha^2 = \min_{x \in \mathbb{Z}^n \setminus \{0\}} \|Bx\|^2 \le \Big(\frac{4}{3}\Big)^{(n-1)/2} \big(\det(B^T B)\big)^{1/n}. \quad (2.10)$$

For any $x \in \mathbb{Z}^n$, we can define $\bar{x} = (Z^{(p)})^{-1} x$, where $Z^{(p)}$ denotes the unimodular matrix $Z$ after the $p$-th permutation ($Z^{(0)} = I_n$). Define $B^{(p)} = B Z^{(p)} = Q_1^{(p)} R^{(p)}$. From (2.10) we have

$$\alpha^2 = \min_{x \in \mathbb{Z}^n \setminus \{0\}} \|Bx\|^2 = \min_{\bar{x} \in \mathbb{Z}^n \setminus \{0\}} \|B^{(p)} \bar{x}\|^2 \le \min_{\bar{x}(1:i) \in \mathbb{Z}^i \setminus \{0\}} \|B^{(p)}(:, 1:i)\, \bar{x}(1:i)\|^2$$
$$\le \Big(\frac{4}{3}\Big)^{(i-1)/2} \big|\det\big(B^{(p)}(:, 1:i)^T B^{(p)}(:, 1:i)\big)\big|^{1/i} = \Big(\frac{4}{3}\Big)^{(i-1)/2} \big|\det\big(R^{(p)}(:, 1:i)^T R^{(p)}(:, 1:i)\big)\big|^{1/i}$$
$$= \Big(\frac{4}{3}\Big)^{(i-1)/2} \big(w_i^{(p)}\big)^{1/i} \quad \text{(see (2.5))}.$$

Then it follows that

$$w_i^{(p)} \ge (3/4)^{i(i-1)/2}\, \alpha^{2i}. \quad (2.11)$$

Substituting (2.9) and (2.11) into (2.8) gives

$$p < \sum_{i=1}^{n} \log_{1/\delta} \beta^{2i} - \sum_{i=1}^{n} \log_{1/\delta}\big((3/4)^{i(i-1)/2} \alpha^{2i}\big) = (n+1)\,n \log_{1/\delta} \frac{\beta}{\alpha} + \log_{1/\delta} \prod_{i=1}^{n} (4/3)^{i(i-1)/2}$$
$$= (n+1)\,n \log_{1/\delta} \frac{\beta}{\alpha} + \frac{1}{6}(n^3 - n) \log_{1/\delta}(4/3).$$

So Algorithm 2.1 involves at most $O(n^3 + n^2 \log_{1/\delta}(\beta/\alpha))$ permutations and the algorithm converges.

We should note that the bound on the number of permutations from the lemma holds for all LLL reduction algorithms that share the same permutation criterion as Algorithm 2.1.

In Algorithm 2.1, $k$ is either increased or decreased by 1 in each iteration of the while loop. Since every iteration in which $k$ is decreased performs a column permutation, there are $p$ iterations in which $k$ is decreased. The algorithm starts from $k = 2$ and ends when $k = n+1$, so the number of iterations in which $k$ is increased equals $p + n - 1$. Thus there are $2p + n - 1$ iterations in total, which is bounded by $O(n^3 + n^2 \log_{1/\delta}(\beta/\alpha))$. Each iteration costs $O(n^2)$ arithmetic operations in the worst case. So the whole algorithm takes at most $O(mn^2 + n^5 + n^4 \log_{1/\delta}(\beta/\alpha))$ arithmetic operations.

2.3 Partial LLL Reduction Algorithm

Recently the so-called effective LLL (ELLL) reduction was proposed by Ling and Howgrave-Graham [23], and later the so-called partial LLL (PLLL) reduction algorithm was developed by Xie, Chang and Al Borno [43]. Both algorithms are more efficient than Algorithm 2.1. The ELLL reduction algorithm is essentially identical to Algorithm 2.1 with lines 17-21, which reduce the off-diagonal entries of $R$ other than the super-diagonal ones, removed. It has a lower computational cost than LLL, while it has the same effect as LLL on the performance of the Babai integer point. It is shown algebraically in [43] that the size-reduction condition of the LLL reduction has no effect on a typical sphere decoding (SD) search process for solving an integer least squares (ILS) problem; thus it has no effect on the performance of the Babai integer point, the first integer point found in the search process. The PLLL was proposed to avoid the numerical stability problem of ELLL, and to avoid some unnecessary size-reductions involved in LLL and ELLL. Both PLLL and ELLL can compute LLL reduced bases if an extra size-reduction procedure is added at the end of the algorithms. The following part gives a description of the PLLL reduction.

2.3.1 Householder QR Factorization with Minimum Column Pivoting

The typical LLL algorithm first finds the QR factorization of the given matrix $B$. In the original LLL algorithm, the Gram-Schmidt method is adopted for computing the QR factorization. However, the Householder method without forming the orthogonal factor $Q$, which costs $\frac{4}{3}mn^2$ flops, is more efficient than the Gram-Schmidt method, which costs $2mn^2$ flops [16]. The Householder method requires square root operations, so it is not suitable for the exact integer LLL reduction. The floating point LLL reduction, however, has no problem with computing square roots, so it can use Householder transformations to compute the QR factorization.

The PLLL reduction uses the Householder QR factorization with minimum column pivoting (QRMCP) instead of the classic Householder QR factorization. In general, the number of permutations is a crucial factor in the cost of the whole LLL reduction process. If one can make the upper triangular factor close to an LLL reduced one in the QR factorization stage, the number of permutations in the later stage is likely to decrease. The minimum column pivoting strategy is used to help achieve the Lovász condition; see [44, Section 4.1].

From (2.1) and (2.2), we can easily obtain

$$\Big(\delta - \frac{1}{4}\Big) r_{i-1,i-1}^2 \le r_{ii}^2, \quad 1 < i \le n, \quad \delta \in (1/4, 1). \quad (2.12)$$

The Householder QR factorization upper-triangularizes the matrix $B$ column by column, with the column index $i$ increasing from 1 to $n$. In order to make the matrix $R$ more likely to satisfy (2.12), the minimum column pivoting strategy chooses a column permutation such that $|r_{ii}|$ is smallest in the $i$-th step. In the $i$-th step of the QR factorization, the QRMCP finds the column in $B(i:m, i:n)$ with the minimum 2-norm, and interchanges the whole column with the $i$-th column of $B$. After this, the QRMCP eliminates the entries $B(i+1:m, i)$ by a Householder transformation $H_i$. With the minimum column pivoting strategy, the Householder QR factorization becomes

$$BP = Q \begin{bmatrix} R \\ 0 \end{bmatrix} = [Q_1\ Q_2] \begin{bmatrix} R \\ 0 \end{bmatrix} = Q_1 R, \quad (2.13)$$

where $P \in \mathbb{Z}^{n \times n}$ is a permutation matrix, $R \in \mathbb{R}^{n \times n}$ is upper triangular, and $Q = [Q_1, Q_2] \in \mathbb{R}^{m \times m}$ with $Q_1 \in \mathbb{R}^{m \times n}$ and $Q_2 \in \mathbb{R}^{m \times (m-n)}$ is orthogonal. $Q^T = H_n H_{n-1} \cdots H_1$ is the product of $n$ Householder transformations.

The algorithm is given as follows.

Algorithm 2.2. (Householder QR Factorization with Minimum Column Pivoting) Suppose $B \in \mathbb{R}^{m \times n}$ has full column rank. This algorithm computes the QRMCP factorization $B = Q_1 R P^T$, where $Q_1$ has orthonormal columns, $R$ is upper triangular and $P$ is a permutation matrix.

function: [R, P] = QRMCP(B)
1: P := I_n
2: l_j := ‖B(1:m, j)‖^2, j = 1:n
3: for i = 1:n do
4:   q := arg min_{i≤j≤n} l_j
5:   if q > i then
6:     interchange columns B(1:m, i) and B(1:m, q)
7:     interchange columns P(1:n, i) and P(1:n, q)
8:   end if
9:   compute the Householder transformation H_i which zeros B(i+1:m, i)
10:  B := H_i B
11:  l_j := l_j − B(i, j)^2, j = i+1, i+2, ..., n
12: end for
13: R := B(1:n, 1:n)
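A NumPy sketch of Algorithm 2.2 (our illustration; it returns the permutation as an index vector rather than as the matrix P):

```python
import numpy as np

def qrmcp(B):
    """Householder QR with minimum column pivoting: returns (R, perm)
    with B[:, perm] = Q1 R; the squared norms l are downdated row by row."""
    B = np.array(B, dtype=float)
    m, n = B.shape
    perm = np.arange(n)
    l = np.sum(B * B, axis=0)               # squared column norms
    for i in range(n):
        q = i + int(np.argmin(l[i:]))       # column of minimum residual norm
        if q > i:
            B[:, [i, q]] = B[:, [q, i]]
            perm[[i, q]] = perm[[q, i]]
            l[[i, q]] = l[[q, i]]
        u = B[i:, i].copy()                 # Householder vector for column i
        u[0] += np.copysign(np.linalg.norm(u), u[0])
        tau = 2.0 / (u @ u)
        B[i:, i:] -= tau * np.outer(u, u @ B[i:, i:])   # B := H_i B
        l[i+1:] -= B[i, i+1:] ** 2          # downdate the squared norms
    return np.triu(B[:n, :n]), perm
```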

2.3.2 Partial Size-Reduction and Givens Rotation

After the QRMCP, the PLLL reduction performs permutations, IGTs and Givens rotations on $R$ in an efficient and numerically stable way. At the $k$-th column of $R$, PLLL checks whether columns $k$ and $k-1$ need to be permuted according to the Lovász condition (2.2). If the Lovász condition holds, the permutation does not occur, no IGT is applied, and the algorithm moves to column $k+1$. If the Lovász condition does not hold, $r_{k-1,k}$ is reduced by an IGT, and IGTs are also applied to $r_{k-2,k}, \ldots, r_{1,k}$ for stability considerations. Then PLLL performs the permutation and the Givens rotation, and moves back to the previous column.

In PLLL, Givens rotations are used for the triangularization after permutations, instead of the GSO used in line 12 of Algorithm 2.1. Define the Givens rotation matrix

$$G = \begin{bmatrix} c & s \\ -s & c \end{bmatrix}, \quad c = \frac{r_{k-1,k}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}}, \quad s = \frac{r_{kk}}{\sqrt{r_{k-1,k}^2 + r_{kk}^2}},$$

which is used in the following transformation:

$$\begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} r_{k-1,k} & r_{k-1,k-1} \\ r_{k,k} & 0 \end{bmatrix} = \begin{bmatrix} \bar{r}_{k-1,k-1} & \bar{r}_{k-1,k} \\ 0 & \bar{r}_{k,k} \end{bmatrix}.$$

The PLLL algorithm is given as follows.

Algorithm 2.3. (PLLL Reduction) Suppose $B \in \mathbb{R}^{m \times n}$ has full column rank. This algorithm computes the PLLL reduction of $B$: $B = Q_1 R Z^{-1}$, where $Q_1$ has orthonormal columns, $R$ is upper triangular and $Z$ is unimodular. It applies IGTs only when a column permutation occurs.

function: [R, Z] = PLLL(B)
1: compute [R, P] = QRMCP(B)
2: set Z := P, k := 2
3: while k ≤ n do
4:   ζ := ⌊r_{k-1,k}/r_{k-1,k-1}⌉
5:   γ := r_{k-1,k} − ζ r_{k-1,k-1}
     // δ is a parameter chosen in (1/4, 1)
6:   if δ r_{k-1,k-1}^2 > γ^2 + r_{kk}^2 then
       // size-reduce R(1:k-1, k)
7:     for l = k-1 : -1 : 1 do
8:       ζ := ⌊r_{l,k}/r_{ll}⌉
9:       Z(1:n, k) := Z(1:n, k) − ζ Z(1:n, l)
10:      R(1:l, k) := R(1:l, k) − ζ R(1:l, l)
11:    end for
       // column permutation and updating
12:    c := r_{k-1,k}/√(r_{k-1,k}^2 + r_{kk}^2)
13:    s := r_{kk}/√(r_{k-1,k}^2 + r_{kk}^2)
14:    G := [c s; −s c]
15:    interchange columns Z(1:n, k) and Z(1:n, k-1)
16:    interchange columns R(1:n, k) and R(1:n, k-1)
17:    R(k-1:k, k-1:n) := G R(k-1:k, k-1:n)
18:    if k > 2 then
19:      k := k − 1
20:    end if
21:  else
22:    k := k + 1
23:  end if
24: end while
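A NumPy sketch of Algorithm 2.3, built on the qrmcp sketch above (our illustration, not the thesis implementation):

```python
import numpy as np

def plll_reduce(B, delta=0.75):
    """PLLL reduction sketch: size-reductions are applied only when the
    Lovasz condition fails and a permutation is about to occur."""
    n = B.shape[1]
    R, perm = qrmcp(B)                       # Algorithm 2.2 sketch
    Z = np.eye(n, dtype=np.int64)[:, perm]   # Z := P
    k = 1                                    # 0-based column index
    while k < n:
        zeta = np.rint(R[k-1, k] / R[k-1, k-1])
        gamma = R[k-1, k] - zeta * R[k-1, k-1]
        if delta * R[k-1, k-1]**2 > gamma**2 + R[k, k]**2:
            for l in range(k - 1, -1, -1):   # size-reduce R(1:k-1, k)
                zeta = np.rint(R[l, k] / R[l, l])
                if zeta != 0:
                    R[:l + 1, k] -= zeta * R[:l + 1, l]
                    Z[:, k] -= int(zeta) * Z[:, l]
            r = np.hypot(R[k-1, k], R[k, k])
            c, s = R[k-1, k] / r, R[k, k] / r
            G = np.array([[c, s], [-s, c]])
            R[:, [k-1, k]] = R[:, [k, k-1]]  # column permutation
            Z[:, [k-1, k]] = Z[:, [k, k-1]]
            R[k-1:k+1, k-1:] = G @ R[k-1:k+1, k-1:]
            k = max(k - 1, 1)
        else:
            k += 1
    return R, Z
```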

Notice that the final matrix $R$ obtained by the PLLL reduction algorithm is not fully size-reduced, since the algorithm performs size-reductions only when they are immediately followed by a permutation. However, we can easily add an extra size-reduction procedure at the end of the PLLL reduction algorithm to transform $R$ into an LLL reduced matrix. We name the PLLL algorithm with this extra size-reduction procedure PLLL+.

The PLLL reduction algorithm uses the same permutation criterion as the LLL reduction algorithm, so it has the same upper bound on the number of permutations/loops as the LLL reduction algorithm, which is $O(n^3 + n^2 \log_{1/\delta}(\beta/\alpha))$.

In each loop, the PLLL reduction algorithm performs $O(n^2)$ arithmetic operations in the worst case. The Householder QR factorization costs $O(mn^2)$ flops [16, Section 5.2]. So the PLLL algorithm takes at most $O(mn^2 + n^5 + n^4 \log_{1/\delta}(\beta/\alpha))$ arithmetic operations, which is the same as the complexity bound of the LLL reduction algorithm. The simulation results of PLLL in [43] show that it is faster and more stable than the LLL reduction.

CHAPTER 3
Block LLL Reduction Algorithms

The blocking technique has been widely used to speed up conventional matrix algorithms on today's high performance computers. The key to achieving high performance on computers with a memory hierarchy is to recast the algorithms in terms of matrix-vector and matrix-matrix operations to permit efficient reuse of data residing in cache or local memory. The blocking technique partitions a big matrix into small blocks, and performs matrix-matrix operations implemented in level 3 basic linear algebra subprograms (BLAS) as much as possible [14]. The matrix-matrix operations implemented in level 3 BLAS are more efficient than the matrix-vector operations implemented in level 2 BLAS or the vector-vector operations implemented in level 1 BLAS. Level 3 BLAS can maximally reduce the movement of data between memory and registers, which can be as costly as the arithmetic operations on that data in matrix algorithms.

In this chapter, we first explain how to apply the blocking technique to the components of the partial LLL (PLLL) reduction algorithm. Then we propose two block LLL reduction algorithms with different matrix partition strategies, and compare their speed and stability with the original LLL reduction algorithm and the PLLL reduction algorithm introduced in Chapter 2.

3.1 Subroutines of Block LLL Reduction Algorithms

In this section we describe a block QR factorization algorithm, a block size-reduction algorithm named BSR, a variant of the PLLL reduction algorithm named Local-PLLL, and a block partial size-reduction algorithm named BPSR. They will be used as subroutines of the block LLL reduction algorithms. Local-PLLL is suited to computing the PLLL reduction of blocks of the basis matrix. The block partial size-reduction algorithm uses the efficient size-reduction strategy proposed in the PLLL reduction algorithm.

3.1.1 Block Householder QR Factorization with Minimum Column Pivoting

In order to design a block Householder QR factorization by means of level 3 BLAS, Schreiber and Van Loan [38] proposed a storage-efficient WY representation for the product of Householder transformations. Later Quintana-Orti, Sun and Bischof [32] proposed a level 3 BLAS version of the QR factorization with maximum column pivoting in order to get a rank-revealing factorization. Based on their work, we give the block QR factorization algorithm with minimum column pivoting in this section.

Given a real full column rank matrix $B \in \mathbb{R}^{m \times n}$, the Householder QR factorization with minimum column pivoting gives

$$BP = Q \begin{bmatrix} R \\ 0 \end{bmatrix} = [Q_1\ Q_2] \begin{bmatrix} R \\ 0 \end{bmatrix} = Q_1 R, \quad (3.1)$$

where $Q = [Q_1, Q_2] \in \mathbb{R}^{m \times m}$ with $Q_1 \in \mathbb{R}^{m \times n}$ and $Q_2 \in \mathbb{R}^{m \times (m-n)}$ is orthogonal, $R \in \mathbb{R}^{n \times n}$ is upper triangular, and $P \in \mathbb{Z}^{n \times n}$ is a permutation matrix. The orthogonal matrix $Q$ is the product of $n$ Householder transformations:

$$Q^T = H_n \cdots H_2 H_1, \quad (3.2)$$

$$H_i = I_m - \tau_i u_i u_i^T, \quad i = 1, 2, \ldots, n, \quad (3.3)$$

where $\tau_i = 2/(u_i^T u_i)$, $u_i = \begin{bmatrix} 0 \\ \bar{u}_i \end{bmatrix} \in \mathbb{R}^m$, $\bar{u}_i \in \mathbb{R}^{m-i+1}$ is a Householder vector, and $H_i \in \mathbb{R}^{m \times m}$ is the Householder transformation matrix which zeros $B(i+1:m, i)$.

The permutation matrix $P$ is the product of $n$ permutations:

$$P = P_1 P_2 \cdots P_n,$$

where $P_i$ ($i = 1, 2, \ldots, n$) is the permutation matrix which interchanges the $i$-th column with another column in $B(1:m, i:n)$ such that the 2-norm of $B(i:m, i)$ is minimum.

In order to explain the block QR implementation, we define $B^{(i)}$ as the value of $B$ after $i$ Householder transformations and $i$ permutations, i.e.,

$$B^{(i)} = H_i \cdots H_2 H_1 B P_1 P_2 \cdots P_i, \quad (3.4)$$

with $B^{(0)} = B$. And we define $\tilde{B}^{(i)}$ as $B$ with only the $i$ permutations applied, i.e.,

$$\tilde{B}^{(i)} = B\, (P_1 P_2 \cdots P_i). \quad (3.5)$$

Here we want to point out that $\tilde{B}^{(i)}$ will not be formed in the $i$-th step of the block algorithm; it is used only to explain the algorithm.

The storage-efficient WY representation [38] for the product of $i$ Householder transformations has the following format:

$$\prod_{t=1}^{i} H_t = \prod_{t=1}^{i} \big(I_m - \tau_t u_t u_t^T\big) = I_m - Y_i T_i Y_i^T, \quad (3.6)$$

where

$$Y_i = [u_1, u_2, \ldots, u_i] \in \mathbb{R}^{m \times i} \quad (3.7)$$

is lower trapezoidal, and $T_i \in \mathbb{R}^{i \times i}$ is lower triangular, given by the following recursion formula:

$$T_i = \begin{bmatrix} T_{i-1} & 0 \\ h_i^T & \tau_i \end{bmatrix}, \qquad h_i^T = -\tau_i\, u_i^T Y_{i-1} T_{i-1} \in \mathbb{R}^{1 \times (i-1)},$$

with the base case $T_1 = \tau_1$.

Substituting (3.5) and (3.6) into (3.4), $B^{(i)}$ can be expressed as

$$B^{(i)} = \big(I_m - Y_i T_i Y_i^T\big) \tilde{B}^{(i)} = \tilde{B}^{(i)} - Y_i F_i^T, \quad (3.8)$$

where

$$F_i^T = T_i Y_i^T \tilde{B}^{(i)} \in \mathbb{R}^{i \times n}. \quad (3.9)$$

It is easy to show that $F_i^T$ can be computed by recursion:

$$F_1^T = \tau_1 u_1^T \tilde{B}^{(1)}, \qquad F_i^T = \begin{bmatrix} F_{i-1}^T P_i \\ \tau_i u_i^T \tilde{B}^{(i)} - \tau_i u_i^T Y_{i-1} F_{i-1}^T P_i \end{bmatrix}. \quad (3.10)$$
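The recursion in (3.6) can be checked numerically. The following NumPy sketch (ours) accumulates given Householder vectors into the compact WY form and verifies the identity against the explicit product:

```python
import numpy as np

def wy_accumulate(U, taus):
    """Accumulate H_i = I - tau_i u_i u_i^T into H_k ... H_2 H_1
    = I - Y T Y^T with T lower triangular, following Eq. (3.6)."""
    m, k = U.shape
    Y = np.zeros((m, 0))
    T = np.zeros((0, 0))
    for i in range(k):
        u = U[:, i:i+1]
        h = -taus[i] * (u.T @ Y) @ T         # h_i^T = -tau_i u_i^T Y T
        T_new = np.zeros((i + 1, i + 1))     # append one row and column to T
        T_new[:i, :i] = T
        T_new[i, :i] = h
        T_new[i, i] = taus[i]
        T = T_new
        Y = np.hstack([Y, u])
    return Y, T

# verify against the explicit product for random Householder vectors
rng = np.random.default_rng(0)
m, k = 6, 3
U = rng.standard_normal((m, k))
taus = np.array([2.0 / (U[:, i] @ U[:, i]) for i in range(k)])
Y, T = wy_accumulate(U, taus)
H = np.eye(m)
for i in range(k):
    H = (np.eye(m) - taus[i] * np.outer(U[:, i], U[:, i])) @ H   # H_k ... H_1
assert np.allclose(H, np.eye(m) - Y @ T @ Y.T)
```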

The block Householder QR factorization algorithm partitions the matrix $B \in \mathbb{R}^{m \times n}$ into $d$ blocks of size $m \times k$ (for simplicity we assume $n = dk$). The algorithm deals with the blocks sequentially from left to right. Inside a block, $k$ Householder transformations are performed for upper-triangularization, and they are accumulated into a single block transformation using the WY representation in (3.6). Then the block transformation is applied to the other blocks of $B$ by matrix-matrix multiplication. Next we show how the block algorithm works.

In the first step, we first compute the squared column norms of $B$, denoted by $l$:

$$l_j := \|B(1:m, j)\|^2, \quad j = 1, 2, \ldots, n.$$

Utilizing $l$, a column of $B$ with minimum 2-norm is permuted with the first column by the permutation matrix $P_1$ (actually $P_1$ is not formed explicitly). Then we use the Householder transformation $H_1$ to zero $B(2:m, 1)$. At this moment, unlike in Algorithm 2.2, we do not apply $H_1$ to the other columns of $B$. However, the first row of $B$ must be updated in order to downdate the squared column norms:

$$l_j := l_j - B(1, j)^2, \quad j = 2, \ldots, n, \quad (3.11)$$

which will be used in the next step for minimum column pivoting. In order to update the first row, we form the following matrices (actually they are vectors) using (3.6) and (3.10):

$$Y_1 := u_1, \qquad F_1^T(1, 2:n) := \tau_1 u_1^T B(1:m, 2:n).$$

Notice that the $B(1:m, 2:n)$ stored in memory is equivalent to the $\tilde{B}^{(1)}(1:m, 2:n)$ given in (3.10). From (3.8), the first row of $B$, except the first entry, is updated as follows:

$$B(1, 2:n) := B(1, 2:n) - Y_1(1, 1)\, F_1^T(1, 2:n).$$

Then the squared column norms are downdated using (3.11). Thus at the end of the first step, the first row and the first column have been updated, and the rest of $B$ will be updated later.

In the second step, utilizing the vector $l$ of squared column norms, we apply $P_2$ to permute the second column of $B$ with a column, say column $p$ ($2 \le p \le n$), such that the 2-norm of $B(2:m, 2)$ is minimum, and we permute the second column of $F_1^T$ with its $p$-th column (i.e., $F_1^T := F_1^T P_2$). Then from (3.8) the second column $B(2:m, 2)$ is updated by the first Householder transformation $H_1$:

$$B(2:m, 2) := B(2:m, 2) - Y_1(2:m, 1)\, F_1^T(1, 2).$$

After this update, we apply the Householder transformation $H_2$ to zero $B(3:m, 2)$. As in step 1, we do not use $H_2$ to update the remaining columns of $B$ at this moment. But we need to update the second row of $B$, because it will be used to compute the 2-norms of the columns of $B(3:m, 3:n)$. In order to perform the update, $Y_2$ and $F_2$ are formed by accumulating $H_2$ into $Y_1$ and $F_1$ using (3.6) and (3.10):

$$Y_2 := [Y_1, u_2], \qquad F_2^T(1:2, 3:n) := \begin{bmatrix} F_1^T(1, 3:n) \\ \tau_2 u_2^T B(1:m, 3:n) - \tau_2 u_2^T Y_1 F_1^T(1, 3:n) \end{bmatrix}.$$

Note that here $F_1^T$ has already been permuted by $P_2$. Then we update the second row of $B$ except the first two entries:

$$B(2, 3:n) := B(2, 3:n) - Y_2(2, 1:2)\, F_2^T(1:2, 3:n),$$

and compute the squared column norms of $B(3:m, 3:n)$:

$$l_j := l_j - B(2, j)^2, \quad j = 3, \ldots, n.$$

At the end of the second step, the first two rows and the first two columns have been updated.

Now assume we are in the $i$-th step of transforming the first block of $B$ into an upper triangular matrix. The first $i-1$ columns of $B$ have been triangularized and the first $i-1$ rows have been updated, while the rest of the matrix $B$ is waiting to be updated. We first permute the $i$-th column with a column in $B(1:m, i:n)$ such that the 2-norm of $B(i:m, i)$ is minimum, and we permute the corresponding columns of $F_{i-1}^T$ (i.e., $F_{i-1}^T := F_{i-1}^T P_i$). Then we update the $i$-th column $B(i:m, i)$ by using the Householder transformations $H_1, H_2, \ldots, H_{i-1}$ as follows (see (3.8)):

$$B(i:m, i) := B(i:m, i) - Y_{i-1}(i:m, 1:i-1)\, F_{i-1}^T(1:i-1, i).$$

Then the Householder transformation $H_i$ is used to zero $B(i+1:m, i)$, and it is accumulated into $Y_i$ and $F_i$:

$$Y_i := [Y_{i-1}, u_i],$$

$$F_i^T(1:i, i+1:n) := \begin{bmatrix} F_{i-1}^T(1:i-1, i+1:n) \\ \tau_i u_i^T B(1:m, i+1:n) - \tau_i u_i^T Y_{i-1} F_{i-1}^T(1:i-1, i+1:n) \end{bmatrix}.$$

Then we update the $i$-th row $B(i, i+1:n)$ and downdate the squared column norms:

$$B(i, i+1:n) := B(i, i+1:n) - Y_i(i, 1:i)\, F_i^T(1:i, i+1:n),$$

$$l_j := l_j - B(i, j)^2, \quad j = i+1, \ldots, n.$$

Now the first $i$ columns and rows of $B$ have been updated.

As shown above, the block algorithm updates one row and one column in each step. At the end of the $k$-th step, we update the rest of $B$ by using the accumulated first $k$ Householder transformations as follows:

$$B(k+1:m, k+1:n) := B(k+1:m, k+1:n) - Y_k(k+1:m, 1:k)\, F_k^T(1:k, k+1:n).$$

At this point, the first $k$ columns of $B$ (i.e., the first block of $B$) have been upper-triangularized, and the other columns of $B$ have been updated. Then we can apply the same procedure to triangularize the second block of $B$, and so on, until the final upper triangular matrix is obtained.

The algorithm for the block QR factorization with minimum column pivoting is given as follows.

Algorithm 3.1. (Block Householder QR Factorization with Minimum Column Pivoting) Suppose $B \in \mathbb{R}^{m \times n}$ has full column rank, and let $k$ be the chosen block size, which for simplification is assumed to be a factor of $n$. This algorithm computes the QR factorization $Q_1 R = BP$, where $Q_1$ has orthonormal columns and $P$ is a permutation matrix. Note that the matrix $B$ is overwritten by $R$ in the computation.

function: [R, P] = BQRMCP(B, k)
1: P := I_n, m̄ := m, n̄ := n
2: l_j := ‖B(1:m, j)‖^2, j = 1:n
3: for j̄ = 1 : k : n do
4:   Y(1:m̄, 1:k) := 0, F(1:n̄, 1:k) := 0
5:   for j = 1 : k do
       // permutation
6:     i := j̄ + j − 1, q := arg min_{i≤p≤n} l_p
7:     interchange columns B(1:m, i) and B(1:m, q)
8:     interchange columns P(1:n, i) and P(1:n, q)
9:     interchange rows F(j, 1:k) and F(q − j̄ + 1, 1:k)
       // update the i-th column
10:    B(i:m, i) := B(i:m, i) − Y(j:m̄, 1:j-1) F(j, 1:j-1)^T
11:    zero B(i+1:m, i) by the Householder transformation H_i = I_m − τ_i u_i u_i^T
       // accumulation of the block transformation
12:    Y(j:m̄, j) := u_i(i:m)
13:    F(j+1:n̄, j) := τ_i B(1:m, i+1:n)^T u_i
14:    F(1:n̄, j) := F(1:n̄, j) − τ_i F(1:n̄, 1:j-1) Y(j:m̄, 1:j-1)^T u_i(i:m)
       // update the i-th row and downdate the norms
15:    B(i, i+1:n) := B(i, i+1:n) − Y(j, 1:j) F(j+1:n̄, 1:j)^T
16:    l(i+1:n) := l(i+1:n) − B(i, i+1:n) .* B(i, i+1:n)
17:  end for
     // apply the block transformation to the unprocessed part of the matrix
18:  B(i+1:m, i+1:n) := B(i+1:m, i+1:n) − Y(k+1:m̄, 1:k) F(k+1:n̄, 1:k)^T
19:  m̄ := m̄ − k, n̄ := n̄ − k
20: end for
21: R := B(1:n, 1:n)

Here we make a remark: in our implementation, we actually use an $n$-dimensional vector to store the permutation matrix $P$; we do not form $P$ explicitly, for efficiency.

3.1.2 Block Size-Reduction

The idea of block size-reduction is to accumulate several IGTs into one block update, so that the algorithm is rich in matrix-matrix operations. The size-reduction of an upper triangular matrix $R \in \mathbb{R}^{n \times n}$ can be described as

$$U = RZ,$$

where $Z \in \mathbb{Z}^{n \times n}$ is a unimodular matrix and $U \in \mathbb{R}^{n \times n}$ is size-reduced, i.e., $|u_{ij}| \le \frac{1}{2}|u_{ii}|$ ($1 \le i < j \le n$). The size-reduction algorithm repeatedly applies IGTs to $R$; $Z$ is the product of a sequence of IGTs, which have the form $I_n - \zeta e_i e_j^T$ ($1 \le i < j \le n$).

3.1.3 Local Partial LLL Reduction

[...]

25:     if i > 2 then
26:       i := i − 1
27:     end if
28:   else
29:     i := i + 1
30:   end if
31: end while
32: R_l := R_local

3.1.4 Block Partial Size-Reduction

A block partial size-reduction algorithm is designed to work together with the Local-PLLL reduction algorithm. In BSR (Algorithm 3.2), all off-diagonal entries of the upper triangular matrix are checked for IGTs. However, this is not the case for the PLLL reduction, where the off-diagonal entries are reduced only when necessary. More specifically, if an IGT is applied to a super-diagonal entry of $R$, other IGTs are applied to the off-diagonal entries in the same column in order to prevent producing large numbers which may cause numerical stability problems. Thus, only the entries in the columns which are affected by IGTs in Local-PLLL need to be reduced. Local-PLLL stores the information about the columns affected by IGTs in the vector $c$, so the block partial size-reduction (BPSR) algorithm can reduce only those marked columns by IGTs.

Given an upper triangular matrix $R \in \mathbb{R}^{n \times n}$ which consists of $d \times d$ blocks $R_{ij}$, $1 \le i \le j \le d$ (here we do not assume that each block has the same size):

$$R = \begin{bmatrix} R_{11} & \cdots & R_{1d} \\ & \ddots & \vdots \\ & & R_{dd} \end{bmatrix},$$

it has sub-matrices

$$\hat{R} = \begin{bmatrix} R_{11} & \cdots & R_{1,i-1} \\ & \ddots & \vdots \\ & & R_{i-1,i-1} \end{bmatrix}, \qquad \tilde{R} = \begin{bmatrix} R_{1i} \\ R_{2i} \\ \vdots \\ R_{i-1,i} \end{bmatrix}, \quad 1 < i \le d, \quad (3.16)$$

where $\tilde{R}$ has $\bar{k}$ columns, each $R_{\bar{\imath},i}$ with $1 < \bar{\imath} \le i-1$ has $k$ rows, and $R_{1,i}$ may have either $k$ or $k/2$ rows.

We are given a vector $c \in \mathbb{Z}^{\bar{k}}$ whose entries are either one or zero. For $j = 1:\bar{k}$, if $c_j = 1$ we perform size-reductions on column $j$ of $\tilde{R}$ by applying IGTs to it, which involve $\hat{R}$; if $c_j = 0$ we do nothing. After this, part of the entries of $\tilde{R}$ are size-reduced according to $c$:

$$\tilde{R} := \tilde{R} + \hat{R} \hat{Z},$$

where $\hat{Z}$, which is formed by those IGTs, has the same dimensions and block partition as $\tilde{R}$:

$$\hat{Z} = \begin{bmatrix} \hat{Z}_{1i} \\ \hat{Z}_{2i} \\ \vdots \\ \hat{Z}_{i-1,i} \end{bmatrix}, \quad (3.17)$$

and $\begin{bmatrix} I & \hat{Z} \\ 0 & I \end{bmatrix}$ is unimodular.

The BPSR algorithm is given as follows.

Algorithm 3.4. (Block Partial Size-Reduction) Given the two sub-matrices $\hat{R}$, $\tilde{R}$ in (3.16) and a vector $c$, this algorithm size-reduces the columns of $\tilde{R}$: $\tilde{R} := \tilde{R} + \hat{R}\hat{Z}$, where $\hat{Z}$ has the block partition in (3.17). We use $A_{i_1:i_2, j}$ to denote the sub-matrix formed by block rows $i_1$ to $i_2$ in the $j$-th block column of $A$.

function: [R̃, Ẑ] = BPSR(R̃, R̂, c)
1: for ī = i-1 : -1 : 1 do
     // partial size-reduction of R̃_{ī,i} by Ẑ_{ī,i}, involving R̂_{ī,ī}
2:   for j = 1 : k̄ do
3:     if c_j = 1 then
4:       size-reduce R̃_{ī,i}(:, j): R̃_{ī,i}(:, j) := R̃_{ī,i}(:, j) + R̂_{ī,ī} Ẑ_{ī,i}(:, j)
5:     end if
6:   end for
7:   update R̃_{1:ī-1, i}: R̃_{1:ī-1, i} := R̃_{1:ī-1, i} + R̂_{1:ī-1, ī} Ẑ_{ī,i}
8: end for
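A dense NumPy sketch of the idea behind Algorithm 3.4 (ours; it works entry by entry on the flagged columns instead of block by block, but realizes the same update R̃ := R̃ + R̂ Ẑ):

```python
import numpy as np

def bpsr(R_tilde, R_hat, c):
    """Size-reduce only the columns of R_tilde flagged by c (ones/zeros),
    using the leading upper triangular part R_hat; returns the updated
    R_tilde and the accumulated integer matrix Z_hat."""
    p = R_hat.shape[0]
    R_tilde = R_tilde.copy()
    Z_hat = np.zeros((p, R_tilde.shape[1]), dtype=np.int64)
    for j in np.flatnonzero(c):            # only the marked columns
        for i in range(p - 1, -1, -1):     # bottom-up IGTs against R_hat
            zeta = np.rint(R_tilde[i, j] / R_hat[i, i])
            if zeta != 0:
                R_tilde[:i + 1, j] -= zeta * R_hat[:i + 1, i]
                Z_hat[i, j] -= int(zeta)
    return R_tilde, Z_hat
```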

3.2 Left-to-Right Block LLL Reduction Algorithm

In this section, we present a left-to-right block LLL (LRBLLL) reduction algorithm utilizing the subroutines introduced in the previous section, i.e., the block QR factorization (Algorithm 3.1), the block size-reduction (Algorithm 3.2), the Local-PLLL reduction algorithm (Algorithm 3.3) and the block partial size-reduction (Algorithm 3.4). The complexity analysis of LRBLLL is presented in the second part of this section.

3.2.1 Partition and Block Operation

The left-to-right block LLL reduction algorithm combines the blocking technique with the PLLL algorithm. It consists of the following 7 steps.

Step 1. Compute the block QR factorization (Algorithm 3.1) of the full column rank matrix $B \in \mathbb{R}^{m \times n}$ with minimum column pivoting, $BP = Q_1 R$, using a chosen even block size $\bar{k}$.

Step 2. Partition the matrix $R$ into $d \times d$ blocks with block size $k = \bar{k}/2$ (here for simplicity we assume that $n$ is a multiple of $k$, i.e., $n = dk$, and that $d$ is even):

$$R = \begin{bmatrix} R_{11} & \cdots & R_{1d} \\ & \ddots & \vdots \\ & & R_{dd} \end{bmatrix} \in \mathbb{R}^{n \times n}, \quad R_{ij} \in \mathbb{R}^{k \times k}, \quad 1 \le i \le j \le d.$$

Initialize a block index $i = 1$.

Step 3. Compute the Local-PLLL reduction (Algorithm 3.3) of the $\bar{k} \times \bar{k}$ diagonal sub-matrix

$$R_{\mathrm{local}} = \begin{bmatrix} R_{ii} & R_{i,i+1} \\ & R_{i+1,i+1} \end{bmatrix}: \qquad R_{\mathrm{local}} := Q_{\mathrm{local}}^T R_{\mathrm{local}} Z_{\mathrm{local}}.$$

Step 4. Update the relevant blocks of $R$ using block transformations:

$$R_{\mathrm{right}} := Q_{\mathrm{local}}^T R_{\mathrm{right}}, \qquad R_{\mathrm{up}} := R_{\mathrm{up}} Z_{\mathrm{local}},$$

where

$$R_{\mathrm{right}} = \begin{bmatrix} R_{i,i+2} & R_{i,i+3} & \cdots & R_{i,d} \\ R_{i+1,i+2} & R_{i+1,i+3} & \cdots & R_{i+1,d} \end{bmatrix}, \qquad R_{\mathrm{up}} = \begin{bmatrix} R_{1,i} & R_{1,i+1} \\ R_{2,i} & R_{2,i+1} \\ \vdots & \vdots \\ R_{i-1,i} & R_{i-1,i+1} \end{bmatrix}.$$

Step 5. Size-reduce $R_{\mathrm{up}}$ using the block partial size-reduction algorithm (Algorithm 3.4):

$$R_{\mathrm{up}} := R_{\mathrm{up}} + \begin{bmatrix} R_{11} & \cdots & R_{1,i-1} \\ & \ddots & \vdots \\ & & R_{i-1,i-1} \end{bmatrix} Z_{\mathrm{update}}.$$

Step 6. Set $\gamma := r_{(i-1)k,(i-1)k+1} - \lfloor r_{(i-1)k,(i-1)k+1}/r_{(i-1)k,(i-1)k} \rceil\, r_{(i-1)k,(i-1)k}$. Check whether the Lovász condition $\delta\, r_{(i-1)k,(i-1)k}^2 \le \gamma^2 + r_{(i-1)k+1,(i-1)k+1}^2$ holds for the first column of $R_{\mathrm{local}}$ and the column before it in $R$.

If $i = 1$ or the Lovász condition holds, set $i := i + 1$. Else, if $i \ne 1$ and the Lovász condition does not hold, set $i := i - 1$.

If $i < d$, go to Step 3; else, go to Step 7.

Step 7. Apply block size-reduction (Algorithm 3.2) to the whole matrix $R$, and stop the algorithm.

In Section 3.1.3, we stated that the first $k$ columns of $R_{\mathrm{local}}$ may be PLLL reduced before Local-PLLL is applied in Step 3. It is easy to check from the algorithm that the first $k$ columns of $R_{\mathrm{local}}$ are PLLL reduced in every call of Local-PLLL in Step 3 except the first.

The left-to-right block LLL reduction algorithm is given as follows.

    Algorithm 3.5. (Left-to-Right Block LLL Reduction) Given a full column rank ma-

    trix B Rmn and a block size k which is even. This algorithm computes the LLL

    factorization: B = Q1RZ1, where Q1 has orthonormal columns, R is upper tri-

    angular and LLL reduced, and Z is unimodular. In the algorithm, we assume Z

    43

  • is partitioned into blocks in the same way as R. We use Ai1:i2,j1:j2 to denote the

    sub-matrix formed by block rows i1 to i2 and block columns j1 to i2 of A.

function: [R, Z] = LRBLLL(B, k)

// Compute the block QR factorization using Algorithm 3.1
1: [R, Z] = BQRMCP(B, k)
2: i := 1, k := k/2, d := n/k, f := 0   // halve the block size; d diagonal blocks
3: while i < d do
     // Local-PLLL reduction of R_{i:i+1,i:i+1} using Algorithm 3.3
4:   [Q, R_{i:i+1,i:i+1}, Z̄, r] = Local-PLLL(R_{i:i+1,i:i+1}, f)
5:   f := 1
6:   if Z̄ = I then
       // The diagonal block is unchanged. The algorithm moves ahead.
7:     i := i + 1
8:     continue
9:   end if
     // Block updating
10:  Z_{1:d,i:i+1} := Z_{1:d,i:i+1} Z̄
11:  R_{1:i−1,i:i+1} := R_{1:i−1,i:i+1} Z̄
12:  R_{i:i+1,i+2:d} := Qᵀ R_{i:i+1,i+2:d}
     // Size-reduce the corresponding columns of R_{1:i−1,i:i+1} using Algorithm 3.4
13:  [R_{1:i−1,i:i+1}, Z̄] = BPSR(R_{1:i−1,i:i+1}, R_{1:i−1,1:i−1}, r)
14:  Z_{1:d,i:i+1} := Z_{1:d,i:i+1} + Z_{1:d,1:i−1} Z̄
     // Check the Lovász condition, then move forward or backward
15:  γ := ⌊R((i−1)k, (i−1)k+1)/R((i−1)k, (i−1)k)⌉
16:  ζ := R((i−1)k, (i−1)k+1) − γ R((i−1)k, (i−1)k)
     // δ is a parameter chosen in (1/4, 1)
17:  if δ R((i−1)k, (i−1)k)² ≤ ζ² + R((i−1)k+1, (i−1)k+1)² or i = 1 then
18:    i := i + 1
19:  else
20:    i := i − 1
21:  end if
22: end while
// Size-reduce R using Algorithm 3.2
23: [R, Z̄] = BSR(R)
24: Z := Z Z̄

Notice that if the Local-PLLL output Z̄ is an identity matrix, we do not apply the block updating and BPSR to the relevant blocks, for efficiency. Also notice that if the matrix dimension n is not a multiple of the block size k, the algorithm still works: we simply change the size of the last block column to fit the matrix dimension. At the end of each while loop, the first ik columns of R are PLLL reduced. The while loop exits when i = d, at which point all n = dk columns of R are PLLL reduced, and R is size-reduced after the final size-reduction. Thus the LRBLLL algorithm outputs a basis matrix which is LLL reduced.
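The efficiency of LRBLLL comes from lines 10–12 of Algorithm 3.5, where the local transformations are applied to whole block rows and block columns as matrix–matrix products. The following MATLAB sketch illustrates this with plain QR standing in for Local-PLLL (so Zloc is the identity here); all names are ours, not the thesis's:

% Sketch of the block updating in lines 10-12 of Algorithm 3.5, with explicit
% index ranges.  Plain QR is a stand-in for Local-PLLL; in the real algorithm
% Qloc and Zloc come from Algorithm 3.3.
n = 8; k = 2; i = 2;
R = triu(randn(n)); Z = eye(n);
c = (i-1)*k+1 : (i+1)*k;            % columns of the two local diagonal blocks
[Qloc, Rloc] = qr(R(c, c));         % stand-in: R(c,c) = Qloc * Rloc
Zloc = eye(length(c));              % Local-PLLL would also return a unimodular Zloc
R(c, c) = Rloc;                     % reduced local block
Z(:, c) = Z(:, c) * Zloc;           % line 10: accumulate the unimodular matrix
up = 1 : (i-1)*k;
R(up, c) = R(up, c) * Zloc;         % line 11: column transformation
right = (i+1)*k+1 : n;
R(c, right) = Qloc' * R(c, right);  % line 12: orthogonal row transformation

Each update is a single level-3 BLAS operation on a block row or block column, which is what allows the blocking technique to reuse data efficiently.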

    3.2.2 Complexity Analysis

    In the LRBLLL algorithm, the column permutation operations are executed in

    the Local-PLLL subroutine. Since LRBLLL uses the same permutation criterion as


LLL (Algorithm 2.1), Lemma 2.1 can also be applied to LRBLLL. As in Section 2.2.3, we define β = max_j ‖b_j‖ and α = min_{x ∈ Z^n \ {0}} ‖Bx‖. Thus the LRBLLL algorithm has at most O(n³ + n² log_{1/δ}(β/α)) permutations, and the algorithm converges. In the following, we derive an upper bound on the number of calls of Local-PLLL.

In its while loop, LRBLLL calls Local-PLLL on diagonal sub-matrices of R. In each loop, the PLLL reduction of one diagonal sub-matrix is performed, and the diagonal sub-matrix to be reduced in the next loop is selected. From step 3 of LRBLLL, the diagonal sub-matrix R_local contains the two diagonal blocks R_{i,i} and R_{i+1,i+1}, and R_local may move one diagonal block forward or backward at the end of each loop, according to whether the Lovász condition holds for columns (i−1)k and (i−1)k+1 (see step 6 of LRBLLL in Section 3.2.1). The matrix R, divided into d×d blocks, has d diagonal blocks. In the first call of Local-PLLL, R_local contains the first two diagonal blocks R_{1,1} and R_{2,2}, and the block index i equals 1; in the last call of Local-PLLL, R_local contains the last two diagonal blocks R_{d−1,d−1} and R_{d,d}, and the block index i equals d−1. Only d−1 loops are needed for i to move forward from i = 1 to i = d−1 if there are no backward moves. In general there may be some backward moves, say s of them, and each backward move must be undone by an extra forward move. Thus the total number of moves of R_local, i.e., the total number of loops, is 2s + d − 1.


The remaining problem is to determine an upper bound on s, the number of times the block index i moves backward during the execution of LRBLLL. Suppose that in some loop other than the first, the Lovász condition does not hold for columns (i−1)k and (i−1)k+1, so the algorithm moves one block back and the block index i is decreased by one. At the beginning of this loop, however, the Lovász condition held for columns (i−1)k and (i−1)k+1. Then the Local-PLLL subroutine must have modified column (i−1)k+1 of R in this loop. To modify column (i−1)k+1, which is the first column of the current R_local, Local-PLLL must perform at least k permutations: since the subroutine Local-PLLL starts with column k+1 of R_local (see Section 3.1.3), it takes at least k permutations to get back to the first column from column k+1. Thus if the block index i is decreased in a loop, at least k permutations take place in Local-PLLL in that loop. Suppose there are p permutations in total in LRBLLL before convergence. Then s, i.e., the number of loops in which i is decreased, is bounded above by p/k.

The cost of LRBLLL is then obtained as follows. The QR factorization with minimum column pivoting takes O(mn²) arithmetic operations [16, Section 5.2]. In Local-PLLL, a permutation causes at most O(k²) arithmetic operations for the subsequent updating and size-reduction. In each loop after Local-PLLL is called, the block updating of R takes O(nk²) operations. The subroutine BPSR takes O(n²k) operations in the worst case in each loop, and the block size-reduction subroutine at the end of the algorithm takes O(n³) operations. From the above, there are p permutations and 2s + d − 1 loops, so the cost of LRBLLL is
\[ C_{\mathrm{LRBLLL}} = O(mn^2) + p\,O(k^2) + (2s + d - 1)\,O(n^2k + nk^2) + O(n^3). \]

Notice that p is bounded above by O(n³ + n² log_{1/δ}(β/α)), so s is bounded above by O(dn² + dn log_{1/δ}(β/α)). The total cost of LRBLLL is therefore bounded above by O(mn² + n⁵ + n⁴ log_{1/δ}(β/α)). This bound is the same as the bounds for LLL and PLLL.
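For completeness, the substitution behind this total is the following (using d = n/k and k ≤ n):
\begin{align*}
p\,O(k^2) &= O\big(n^3k^2 + n^2k^2\log_{1/\delta}(\beta/\alpha)\big)
           = O\big(n^5 + n^4\log_{1/\delta}(\beta/\alpha)\big),\\
(2s+d-1)\,O(n^2k + nk^2) &= O\big(dn^2 + dn\log_{1/\delta}(\beta/\alpha)\big)\,O(n^2k)
           = O\big(n^5 + n^4\log_{1/\delta}(\beta/\alpha)\big),
\end{align*}
since dn² · n²k = (n/k) n⁴ k = n⁵. Adding the O(mn²) QR cost and the O(n³) final size-reduction gives the stated total.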

Table 3–1 lists the costs of the important processes and the total cost of LRBLLL.

Table 3–1: Complexity analysis of the LRBLLL reduction algorithm

  Process                                    Bound
  Cost of QR factorization                   O(mn²)
  Cost of one permutation in Local-PLLL      O(k²)
  Cost of block updating in one loop         O(nk²)
  Cost of size-reduction in one loop         O(n²k)
  Cost of final block size-reduction         O(n³)
  Number of permutations p                   O(n³ + n² log_{1/δ}(β/α))
  Number of loops 2s + d − 1                 O(dn² + dn log_{1/δ}(β/α))
  Total cost of the algorithm                O(mn² + n⁵ + n⁴ log_{1/δ}(β/α))

3.3 Alternating Partition Block LLL Reduction Algorithm

In this section we propose an alternating partition block LLL (APBLLL) reduction algorithm which is easier to parallelize. The complexity analysis of APBLLL is also given.

3.3.1 Partition and Block Operation

The LRBLLL algorithm essentially mimics PLLL: it works on the matrix from left to right, and may move forward or backward during the procedure. In the new alternating partition block LLL reduction algorithm, instead of moving a single working window forward and backward, we proceed in a different way.

Figure 3–1: Partition 1 of matrix R. [R is partitioned into 4×4 blocks, each of size k×k.]

Figure 3–2: Partition 2 of matrix R. [R is partitioned into 3×3 blocks; the first and last diagonal blocks have size 1.5k×1.5k and the middle diagonal block has size k×k.]

We first perform BQRMCP on B ∈ R^{m×n} (see Algorithm 3.1):
\[ BP = Q_1 R, \]
where Q₁ ∈ R^{m×n} has orthonormal columns, R ∈ R^{n×n} is upper triangular and P ∈ Z^{n×n} is a permutation matrix.

Next we use an example to show how APBLLL works iteratively with the two alternating partitions shown in Figure 3–1 and Figure 3–2.

In the first iteration, R is partitioned into 4×4 blocks, each of size k×k (see Figure 3–1). This partition is referred to as partition 1 for convenience. We then work on the blocks of partition 1. First we perform Local-PLLL (Algorithm 3.3) on R11, and update R12, R13 and R14 by the Q generated by Local-PLLL. Second, we perform Local-PLLL on R22, update R23 and R24 by the Q generated by this Local-PLLL, and update R12 by the Z also generated by this Local-PLLL; then BPSR (Algorithm 3.4) is applied to R12 to do the partial size-reduction. Third, we perform Local-PLLL on R33, update R34 by the Q generated by the current Local-PLLL, and update R13 and R23 by the Z also generated by the current Local-PLLL; then BPSR is applied to the block column formed by R13 and R23. Fourth, we perform Local-PLLL on R44, update R14, R24 and R34 by the Z generated by the current Local-PLLL, and apply BPSR to the block column formed by R14, R24 and R34. At this point the first iteration has finished, and all the diagonal blocks R11, R22, R33 and R44 are PLLL reduced.

In the second iteration, we repartition R into 3×3 blocks (see Figure 3–2), with the block sizes indicated in the figure. This repartition is referred to as partition 2. We do exactly the same for the blocks of partition 2 as we did in the first iteration. After the second iteration, the diagonal blocks R11, R22 and R33 of partition 2 are PLLL reduced.

In the following iterations, the same process is performed with partition 1 and partition 2 alternately, until no permutation takes place within an iteration. At that point, it is easy to see that R is PLLL reduced. Then an extra block size-reduction (Algorithm 3.2) is applied to R. After this final size-reduction, R is LLL reduced and the algorithm ends.

The two alternating partitions of R for the general case are given as follows. Assume the block size is k and n = dk. Partition 1 partitions R into d×d blocks:
\[ R = \begin{bmatrix} R_{11} & \cdots & R_{1d} \\ & \ddots & \vdots \\ & & R_{dd} \end{bmatrix} \in \mathbb{R}^{n \times n}, \qquad R_{ij} \in \mathbb{R}^{k \times k}, \quad 1 \le i \le j \le d. \]

And partition 2 partitions R into (d−1)×(d−1) blocks:
\[ R = \begin{bmatrix} R_{11} & \cdots & R_{1,d-1} \\ & \ddots & \vdots \\ & & R_{d-1,d-1} \end{bmatrix} \in \mathbb{R}^{n \times n}, \]
where
\[ R_{11} \in \mathbb{R}^{1.5k \times 1.5k}, \quad R_{1,d-1} \in \mathbb{R}^{1.5k \times 1.5k}, \quad R_{d-1,d-1} \in \mathbb{R}^{1.5k \times 1.5k}, \]
\[ R_{1,v} \in \mathbb{R}^{1.5k \times k}, \quad R_{u,d-1} \in \mathbb{R}^{k \times 1.5k}, \quad R_{u,v} \in \mathbb{R}^{k \times k}, \quad 1 < u \le v < d-1. \]
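For concreteness, here is a small MATLAB helper, ours rather than the thesis's, that returns the block boundaries of the two partitions; it assumes n = dk with d ≥ 3 and k even, so that 1.5k is an integer:

% Block boundaries of the two alternating partitions (assumed helper; names
% are ours).  Block i occupies columns edges(i)+1 : edges(i+1).
function edges = partition_edges(n, k, which)
    if which == 1
        edges = 0:k:n;                       % d blocks of size k
    else
        edges = [0, 1.5*k : k : n-1.5*k, n]; % sizes 1.5k, k, ..., k, 1.5k
    end
end

For n = 8 and k = 2, partition_edges gives [0 2 4 6 8] for partition 1 and [0 3 5 8] for partition 2, matching Figures 3–1 and 3–2. The half-block offset is what makes the two partitions overlap, so that column swaps near the boundary of one partition's blocks fall inside a block of the other partition.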

The alternating partition block LLL reduction algorithm is given as follows.

Algorithm 3.6. (Alternating Partition Block LLL Reduction) Given a full column rank matrix B ∈ R^{m×n} and a block size k (assume n is a multiple of k, i.e., n = dk), this algorithm computes the LLL reduction B = Q₁RZ^{−1}, where Q₁ has orthonormal columns, R is upper triangular and LLL reduced, and Z is unimodular. In the algorithm, we assume Z is partitioned into blocks in the same way as R. We use A_{i1:i2,j1:j2} to denote the sub-matrix formed by block rows i1 to i2 and block columns j1 to j2 of A.

function: [R, Z] = APBLLL(B, k)

// Compute the block QR factorization using Algorithm 3.1
1: [R, Z] = BQRMCP(B, k)
2: d := n/k, f := 0
3: for i = 1 : d do
4:   change_i := 1, nextChange_i := 1
5: end for
6: while (1) do
7:   Partition R into blocks using partition 1 or partition 2 alternately
8:   for i = 1 : d do (for partition 2: i = 1 : d−1; we assume partition 1 is used in the following description)
9:     if change_i ≠ 1 then
10:      continue
11:    end if
       // Apply Local-PLLL to the diagonal block using Algorithm 3.3
12:    [Q, R_{ii}, Z̄, r] = Local-PLLL(R_{ii}, f)
13:    if Z̄ = I then
         // The diagonal block is unchanged, and updates are not needed
14:      continue
15:    end if
       // Perform the corresponding updates
16:    nextChange_{max(1,i−1)} := 1, nextChange_i := 1
       // Block updating
17:    Z_{1:d,i} := Z_{1:d,i} Z̄
18:    R_{1:i−1,i} := R_{1:i−1,i} Z̄
19:    R_{i,i+1:d} := Qᵀ R_{i,i+1:d}
       // Size-reduce the corresponding columns of R_{1:i−1,i} using Algorithm 3.4
20:    [R_{1:i−1,i}, Z̄] = BPSR(R_{1:i−1,i}, R_{1:i−1,1:i−1}, r)
21:    Z_{1:d,i} := Z_{1:d,i} + Z_{1:d,1:i−1} Z̄
22:  end for
23:  if nextChange = 0 then
       // Exit when no permutation has been applied
24:    break
25:  end if
26:  f := 1
27:  for i = 1 : d do
28:    change_i := nextChange_i, nextChange_i := 0
29:  end for
30: end while
// Size-reduce R using Algorithm 3.2
31: [R, Z̄] = BSR(R)
32: Z := Z Z̄

Notice that the two vectors change and nextChange are used to track whether the diagonal blocks still need to be PLLL reduced in each iteration. If two adjacent diagonal blocks are unchanged in an iteration, then in the next iteration we do not apply Local-PLLL to the diagonal block whose diagonal entries come from those two unchanged diagonal blocks, since that diagonal block must already be PLLL reduced. Also notice that if the Local-PLLL output matrix Z̄ is an identity matrix, we do not apply the block updating and BPSR to the relevant blocks, for efficiency. A minimal sketch of this flag bookkeeping is given below.
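The sketch uses a toy stand-in for the outcome of Local-PLLL; all names are ours.

% Toy sketch of the change/nextChange flags of Algorithm 3.6.  Whether a
% block actually changes is faked here (blockChanged); in APBLLL it is
% decided by whether Local-PLLL returns a non-identity Z.
d = 6;
change = true(1, d); nextChange = false(1, d);
iter = 0;
while true
    for i = 1:d
        if ~change(i), continue; end           % skip blocks known to be reduced
        blockChanged = (iter == 0 && mod(i, 2) == 0);  % toy stand-in
        if blockChanged
            nextChange(max(1, i-1)) = true;    % neighbour must be re-examined
            nextChange(i) = true;
        end
    end
    if ~any(nextChange), break; end            % no permutations: PLLL reduced
    change = nextChange; nextChange(:) = false;
    iter = iter + 1;
end
fprintf('converged after %d iteration(s)\n', iter + 1);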

    3.3.2 Complexity Analysis

The APBLLL algorithm shares the same QR and final size-reduction parts as LRBLLL, so the costs of these two parts are the same as in LRBLLL: O(mn²) arithmetic operations for the QR factorization and O(n³) arithmetic operations for the final size-reduction. The cost of the rest of APBLLL is divided into two parts: the cost of the subroutine Local-PLLL, and the cost outside Local-PLLL, i.e., the block updating and the block partial size-reductions. These two parts are calculated separately.

Since APBLLL uses the same permutation criterion as LLL (Algorithm 2.1), Lemma 2.1 also applies to APBLLL. Thus the total number of permutations p taking place in the Local-PLLL reductions is bounded above by O(n³ + n² log_{1/δ}(β/α)). In Local-PLLL, a permutation causes at most O(k²) arithmetic operations for the subsequent updating and size-reductions. Thus, all the calls to the subroutine Local-PLLL cost O(n³k² + n²k² log_{1/δ}(β/α)) arithmetic operations.

In APBLLL, the block updating and BPSR in lines 17–21 are performed only if the output matrix Z̄ of Local-PLLL is not the identity, i.e., only if some permutations take place during that execution of Local-PLLL. Because the total number of permutations is p, there are at most p calls to Local-PLLL that produce a non-identity Z̄. So in the worst case the block updating and BPSR are executed p times. Each execution of the block updating and BPSR causes at most O(n²k) arithmetic operations, so the total cost of the block updating and BPSR is p · O(n²k) in the worst case.

From the above, the total cost of APBLLL is obtained by adding the costs of all the parts together:
\[ C_{\mathrm{APBLLL}} = O(mn^2) + p\,O(k^2) + p\,O(n^2k) + O(n^3) = O\big(mn^2 + n^5k + n^4k \log_{1/\delta}(\beta/\alpha)\big). \]

This bound is larger than the bounds of LRBLLL, PLLL and LLL. However, the simulation results show that APBLLL performs better than LLL and PLLL and similarly to LRBLLL. The simulation results and analysis of the two block LLL reduction algorithms are given in the next section.

Table 3–2 lists the costs of the important processes and the total cost of APBLLL.

Table 3–2: Complexity analysis of the APBLLL reduction algorithm

  Process                                               Bound
  Cost of QR factorization                              O(mn²)
  Cost of one permutation in Local-PLLL                 O(k²)
  Cost of block updating and size-reduction
    for one diagonal block                              O(n²k)
  Cost of final block size-reduction                    O(n³)
  Number of permutations p                              O(n³ + n² log_{1/δ}(β/α))
  Total cost of the algorithm                           O(mn² + n⁵k + n⁴k log_{1/δ}(β/α))

    3.4 Simulation Results and Comparison of Algorithms

The simulations are performed in MATLAB on two types of machines. One has MATLAB 7.12.0 on a 64-bit Ubuntu 11.10 system with 4 Intel Xeon(R) W3530 2.8 GHz processors and 5 GB of memory. The other has MATLAB 7.13.0 on a 64-bit Red Hat 6.2 system with 64 AMD Opteron(TM) 2.2 GHz processors and 64 GB of memory. Our simulations use conventional MATLAB, not Parallel MATLAB. MATLAB uses the IEEE double precision model for floating point arithmetic by default; the unit round-off for double precision is about 10⁻¹⁶. We compare four algorithms, i.e., the original LLL algorithm (Algorithm 2.1), the PLLL+ algorithm, the LRBLLL algorithm (Algorithm 3.5), and the APBLLL algorithm (Algorithm 3.6). The PLLL+ algorithm is the PLLL algorithm (Algorithm 2.3) with an extra size-reduction procedure to guarantee that the resulting matrix is size-reduced. All these four algorithms produce LLL reduced matrices. We compare the CPU run time, the flops, and the relative backward errors
\[ \frac{\|B - Q_c R_c Z_c^{-1}\|_F}{\|B\|_F} \]
of the four algorithms, where Q_c is the computed orthogonal matrix, R_c is the computed LLL reduced matrix and Z_c^{−1} is the unimodular matrix formed by the inverses of the computed permutation matrix and IGTs. The run time is measured in two separate parts, the run time for the QR factorization and the run time for the rest of each algorithm (for simplicity, we call this part the reduction), in order to observe how the blocking technique performs in each part.
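In MATLAB, this backward error can be computed as below; this is a sketch with stand-in factors (our names), and applying Z_c^{−1} from the right via mrdivide avoids forming the inverse explicitly:

% Sketch (our names): relative backward error ||B - Qc*Rc/Zc||_F / ||B||_F,
% where Qc, Rc, Zc stand in for the computed factors of an LLL reduction.
B = randn(50);
[Qc, Rc] = qr(B); Zc = eye(50);       % stand-ins for the computed factors
err = norm(B - (Qc * Rc) / Zc, 'fro') / norm(B, 'fro');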

In the simulation, we test three cases of matrices B ∈ R^{n×n} with n = 100 : 50 : 1000. The square matrices B are generated as follows.

Case 1: B is generated by the MATLAB function randn: B = randn(n, n), i.e., each element follows the normal distribution N(0, 1).

Case 2: B = USVᵀ, where U and V are randomly generated orthogonal matrices and S is the diagonal matrix with
\[ S(i, i) = 10^{4(i-1)/(n-1)}, \quad i = 1, \ldots, n. \]

Case 3: B = USVᵀ, where U and V are randomly generated orthogonal matrices and S is the diagonal matrix with
\[ S(i, i) = 1000, \quad i = 1, \ldots, \lfloor n/2 \rceil, \qquad S(i, i) = 0.1, \quad i = \lfloor n/2 \rceil + 1, \ldots, n. \]

Case 1 gives the most typical test matrices for numerical computations. Cases 2 and 3 are intended to show the reduction speed when the condition number is fixed at 10⁴. Case 3 also shows that the block algorithms gain more efficiency in the reduction part when the reduction takes a long time to run.
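The thesis does not list its generation code; the following MATLAB sketch shows one standard way to produce the three cases, where the choice of obtaining random orthogonal factors from the QR factorization of Gaussian matrices is our assumption:

% One way to generate the three test cases (sketch; random orthogonal
% factors via QR of Gaussian matrices is our assumption).
n = 200;
B1 = randn(n);                                 % Case 1: i.i.d. N(0,1) entries
[U, ~] = qr(randn(n)); [V, ~] = qr(randn(n));  % random orthogonal U and V
s2 = 10.^(4 * (0:n-1) / (n-1));                % Case 2: cond(B2) = 10^4
B2 = U * diag(s2) * V';
s3 = [1000 * ones(1, round(n/2)), 0.1 * ones(1, n - round(n/2))];
B3 = U * diag(s3) * V';                        % Case 3: cond(B3) = 10^4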

For each dimension in all cases, we randomly generate 20 different matrices for the test. We use only 20 simulation runs because LLL is very time consuming; the box plots below show that the behaviors of the algorithms are stable across runs, so 20 runs are enough for our simulation. For the block algorithms, the optimal block size may vary with the dimension of the matrix; in the simulation, a fixed block size of 32 is adopted for matrices of all dimensions for simplicity. In the average QR/reduction run time plots, the y-axis is the average run time (seconds) over the 20 matrices, and the x-axis is the dimension. In the average flops plots, the y-axis is the average flops and the x-axis is the dimension. In the average relative backward error plots, the y-axis is the relative backward error and the x-axis is the dimension.

We also test matrices with various condition numbers and give the results in the corresponding plots. In these plots, the y-axis is the average QR/reduction run time, the average flops or the average relative backward error over 20 matrices of dimension 200 in case 2, and the x-axis is the matrix condition number, from 10¹ to 10⁶. Box plots of the run time and relative backward errors of all three cases with dimension 200 are also drawn. In the box plots, the y-axis is either the algorithm run time or the relative backward error, and the x-axis lists the four algorithms, i.e., LLL, PLLL+, LRBLLL and APBLLL.

The simulation results obtained on the Intel processors are shown in Figures 3–3, 3–4 and 3–5 for the overall performance in the three cases, in Figure 3–6 for case 2 with different condition numbers, and in Figure 3–7 for the box plots of all the cases. The results obtained on the AMD processors are shown in Figures 3–8, 3–9, 3–10, 3–11 and 3–12, respectively. For the overall performance of each case, we give six plots. The two plots in the first row are the average run time of the QR factorization and the average reduction run time of LLL; LLL runs much longer than the other three algorithms, so we put it in individual plots in order to compare the other three algorithms more easily. The two plots in the middle row are the average QR/reduction run times for PLLL+, LRBLLL and APBLLL. The two plots in the bottom row are the average flops and the average relative backward errors for LLL, PLLL+, LRBLLL and APBLLL. For case 2 with different condition numbers, we also give six plots, ordered in the same way as the overall performance plots. For the box plot figure, we give six plots: the three plots in the left column are the run times for the three cases, and the three plots in the right column are the relative backward errors for the three cases.

From the simulation results, we can draw the following observations and conclusions.

1. Comparing the results between the two machines (Intel and AMD), we observe that the performance of the four algorithms is consistent across the two machines.

2. Comparing the run times of the different algorithms, we find that LLL is the slowest of the four. LRBLLL is as fast as APBLLL, and both are faster than PLLL+ in all three cases. So on average the computational CPU times of the four algorithms have the following order: LLL > PLLL+ > LRBLLL ≈ APBLLL.

7. In Figures 3–6 and 3–11, the tests on matrices with various condition numbers show that the QR time is not affected by the condition number of the matrix, while the reduction time, the flops and the relative backward errors of the four algorithms increase as the condition number increases.

8. The box plots show that the behaviors of LLL, PLLL+, LRBLLL and APBLLL are stable over different simulation runs.


Figure 3–3: Performance comparison for Case 1, Intel. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–4: Performance comparison for Case 2, Intel. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–5: Performance comparison for Case 3, Intel. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–6: Performance comparison for Case 2 with dimension 200, Intel. [Six plots: QR and reduction run times, flops, and relative backward errors versus condition number (10¹ to 10⁶) for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–7: Box plots of run time (left) and relative backward error (right) for Case 1 (top), Case 2 (middle) and Case 3 (bottom) with dimension 200, Intel.

Figure 3–8: Performance comparison for Case 1, AMD. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–9: Performance comparison for Case 2, AMD. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–10: Performance comparison for Case 3, AMD. [Six plots: QR and reduction run times, flops, and relative backward errors versus dimension for LLL, PLLL+, LRBLLL and APBLLL.]

Figure 3–11: Performance comparison for Case 2 with dimension 200, AMD. [Six plots: QR and reduction run times, flops, and relative backward errors versus condition number (10¹ to 10⁶) for LLL, PLLL+, LRBLLL and APBLLL.]