
BLOCKED AND RECURSIVE ALGORITHMS FOR TRIANGULAR TRIDIAGONALIZATION

GIL SHKLARSKI AND SIVAN TOLEDO

Date: May 2007.

Abstract. We present partitioned (blocked) algorithms for reducing a symmetric matrix to tridiagonal form, with partial pivoting. That is, the algorithms compute a factorization $PAP^T = LTL^T$, where $P$ is a permutation matrix, $L$ is lower triangular with a unit diagonal, and $T$ is symmetric and tridiagonal. The algorithms are based on the column-by-column methods of Parlett and Reid and of Aasen. Our implementations also compute the QR factorization of $T$ and solve linear systems of equations using the computed factorization. The partitioning allows our algorithms to exploit modern computer architectures (in particular, cache memories and high-performance blas libraries). Experimental results demonstrate that our algorithms achieve approximately the same level of performance as the partitioned Bunch-Kaufman factorization and solve in lapack.

Categories and Subject Descriptors: G.1.3 [Numerical Analysis]: Numerical Linear Algebra—linear systems; G.4 [Mathematics of Computing]: Mathematical Software—algorithm analysis; efficiency

General Terms: Algorithms, Performance

Additional Key Words and Phrases: symmetric indefinite matrices, tridiagonalization, Aasen's tridiagonalization, Parlett-Reid tridiagonalization, partitioned factorizations, recursive factorizations

1. Introduction

This paper presents blocked and recursive algorithms for reducing a square symmetric matrix to tridiagonal form. These algorithms are essentially partitioned (blocked) and recursive formulations of the algorithms of Parlett and Reid [12] and of Aasen [1]. These partitioned formulations effectively exploit modern computer architectures through the use of the level-3 blas [7].

The Parlett-Reid and Aasen algorithms factor a symmetric matrix $A$ into a product of a unit lower triangular matrix $L$, a symmetric tridiagonal matrix $T$, and the transpose of $L$:

$$A = LTL^T.$$

When partial pivoting is employed, these algorithms compute a factorization $PAP^T = LTL^T$, where $P$ is a permutation matrix. Aasen's algorithm is backward stable [11]. Asymptotically, the Parlett-Reid algorithm performs $2n^3/3 + O(n^2)$ arithmetic operations. Aasen's method performs only $n^3/3 + O(n^2)$ arithmetic operations, so it is much more efficient.

This factorization is one of the two main methods for factoring dense symmetric matrices. The other method, due to Bunch and Kaufman, factors the matrix into a product $A = LBL^T$, where $L$ is unit lower triangular and $B$ is block diagonal with 1-by-1 and 2-by-2 blocks. The method of Bunch and Kaufman also performs $n^3/3 + O(n^2)$ operations. The performance of straightforward implementations of Bunch and Kaufman's method is similar to that of straightforward implementations of Aasen's method (see [5] for old experimental evidence; there is no reason that the results should be different on a more modern machine).

The performance of dense matrix algorithms improves dramatically when the ordering of operations is altered so as to cluster together operations on small blocks of the matrix. This is usually called partitioning or blocking (e.g., see [6] and [9]). The blocked Bunch-Kaufman algorithm is described in [8]. A study by Anderson and Dongarra compared the performance of blocked versions of the Bunch-Kaufman and Aasen methods on the Cray 2 and found that Aasen's method did not benefit from blocking as much as the Bunch-Kaufman method [3]. This finding appears to be the main reason that the Bunch-Kaufman method is used in the symmetric linear solver in lapack [2] rather than Aasen's method. The details of the blocked algorithms are not given in the report by Anderson and Dongarra; the code appears to have been lost [4].

Our paper presents blocked versions of both Aasen's method and that of Parlett and Reid. Both versions achieve performance similar to that of the blocked Bunch-Kaufman code in lapack. Our algorithms perform partial pivoting, and their storage requirements are similar to those of lapack's implementation of Bunch-Kaufman. Our findings show that there is no performance-related reason to prefer the Bunch-Kaufman method over that of Aasen. We cannot determine whether the opposite conclusion of Anderson and Dongarra resulted from the particular way they blocked the algorithm or whether it is specific to the Cray 2 architecture.

Another innovation of our paper consists of recursive formulations of the algorithms of Parlett and Reid and of Aasen. We have not been able to incorporate efficient partial pivoting into the recursive formulations, so they are not of much practical interest. However, they do clarify some aspects of the algorithms and elucidate our partitioned algorithms (which do pivot).

Our partitioned Parlett-Reid algorithm appears to be as stable as the straightforward column-by-column original algorithm. Our partitioned Aasen algorithm sometimes downdates elements of the reduced matrix; we cannot rule out the possibility that this may harm the stability of the algorithm.

The recursive and partitioned algorithms for the $LTL^T$ factorization that we present here are considerably more complicated to derive than recursive and partitioned algorithms for other factorizations, such as LU with partial pivoting or Cholesky.

This paper is organized as follows. Section 2 presents recursive formulations of the Parlett-Reid and Aasen tridiagonalization algorithms. Section 3 shows how to efficiently incorporate partial pivoting into a partitioned formulation of these algorithms. Section 4 presents our implementations of the partitioned algorithms (with pivoting). Section 5 presents experimental results. We present our conclusions in Section 6.

2. Recursive Tridiagonalization

This section presents a recursive tridiagonalization method. We begin with a description of a relatively simple formulation that is in some sense equivalent to the algorithm of Parlett and Reid. When this recursive formulation is applied to single columns, we get the Parlett-Reid algorithm, which is inefficient; it performs $2n^3/3 + \Theta(n^2)$ arithmetic operations. But when the recursion partitions the matrix into almost equal parts, the algorithm performs only $n^3/3 + \Theta(n^2)$ operations, like Aasen's algorithm. We later use this algorithm as a basis for an algorithm with pivoting.

Figure 1. An illustration of the factorization. Filled gray circles represent known values and empty circles represent values that are computed during the factorization. The lines show the partitioning of the matrices into blocks.

We also show a more efficient algorithm that uses Aasen's method of delaying rank-1 updates in order to combine pairs of related rank-1 updates into a single rank-1 update. This variant performs $n^3/3 + \Theta(n^2)$ operations no matter how the recursion partitions the matrix. However, the trailing submatrix in this algorithm is unsymmetric, which requires downdating elements of the trailing submatrix if only half of it is stored. The downdating may cause a numerical instability.

2.1. A Recursive Parlett-Reid Method. As in Aasen's original presentation, the key to the development of efficient tridiagonalization is the introduction of a lower Hessenberg matrix $H = LT$. Figure 1 shows the factorization $A = LTL^T = HL^T$. In the figure, as well as in subsequent figures, we use different graphic representations for known matrix elements and for matrix elements that are to be computed. In the overall factorization, the elements of $A$ are known, as are the diagonal elements of $L$, which are all 1. If $A$ has an $LTL^T$ factorization (without pivoting), then it has such a factorization for any assignment of values to the subdiagonal elements of the first column of $L$. Therefore, we treat the values in the first column of $L$ as if they are known; when the algorithm begins, it can set $L_{2:n,1}$ arbitrarily. This is a key observation in the development of recursive and blocked variants: we assume that $L_{2:n,1}$ is given to the algorithm as input; when we call the top level of the recursion, we set this input to all zeros.
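In Matlab-like notation, the top level of the recursion might look as follows (a sketch only; ltlt_rec is a hypothetical name for the recursive routine derived in this section):

    function [L, T] = ltlt(A)
    % Top-level driver (sketch). The subdiagonal of the first column of L
    % is free, so we fix it to zero and pass it to the recursion as input.
    n = size(A, 1);
    [L, T] = ltlt_rec(A, [1; zeros(n-1, 1)]);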

We derive the algorithm by partitioning all the matrices into 2-by-2 block matrices. The partitioning is the same for all the matrices. The diagonal blocks must be square, but the off-diagonal blocks can be rectangular, as shown in the illustrations. Assuming that the first block contains $k$ rows and columns, we use the following notation to denote the blocks:

$$A^{[11]} = A_{1:k,1:k}\,, \quad A^{[12]} = A_{1:k,k+1:n}\,, \quad A^{[21]} = A_{k+1:n,1:k}\,, \quad A^{[22]} = A_{k+1:n,k+1:n}$$

(and similarly for $L$, $T$, and $H$).

Figure 2. The first step in the factorization.

Figure 3. The second step of the algorithm, part one.

The partitioning into blocks allows us to write block-matrix equations that define the blocks of the factors. In the first step of the algorithm, we use the equation

$$A^{[11]} = H^{[11]} L^{[11]T} = L^{[11]} T^{[11]} L^{[11]T}.$$

This equation implies that we can compute $L^{[11]}$ and $T^{[11]}$ recursively. Since we assume that $L_{2:n,1}$ is given to the algorithm as input, the recursive call receives the values of $L^{[11]}_{2:k,1}$ as input. This is illustrated in Figure 2, with the known elements of $L^{[11]}_{2:k,1}$ shown in gray.
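In the sketch notation introduced above, step one is a single recursive call on the leading $k$-by-$k$ block, passing along the known part of the first column (ltlt_rec is again the hypothetical recursive routine):

    % Step 1 (sketch): recurse on the leading block; l1 is the given
    % first column of L, so l1(1:k) is the known first column of L[11].
    [L11, T11] = ltlt_rec(A(1:k, 1:k), l1(1:k));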

Once the recursive call returns, the second step of the algorithm computes $L^{[21]}$, the first column of $L^{[22]}$, and $T_{k+1,k}$. We start the derivation of this step with the equation

$$A^{[21]} = H^{[21]} L^{[11]T}.$$


Figure 4. The second step of the algorithm, part two.

We already know $L^{[11]}$, so we can compute $H^{[21]}$ using multiple triangular solves, as illustrated in Figure 3. In the illustration, the elements of $L^{[11]}$ are represented by filled gray circles, because they are already known.
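In Matlab, these triangular solves amount to a single right-division by $L^{[11]T}$ (a sketch with illustrative names; L11 is unit lower triangular):

    % Solve H21 * L11' = A21 for H21: one triangular solve per row of A21.
    H21 = A(k+1:n, 1:k) / L11';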

To obtain the elements of $L$ and $T$ that we need, we expand $H^{[21]}$ into

$$H^{[21]} = L^{[21]} T^{[11]} + L^{[22]} T^{[21]} = L^{[21]} T^{[11]} + \sum_{i=1}^{n-k} L^{[22]}_{:,i} T^{[21]}_{i,:} = L^{[21]} T^{[11]} + L^{[22]}_{:,1} T^{[21]}_{1,:}\,,$$

as shown in Figure 4. The transition from the first to the second line relies on the expression of $L^{[22]} T^{[21]}$ as a sum of rank-1 outer products. Since rows $2, \ldots, n-k$ of $T^{[21]}$ are zero, the sum collapses to its first term, on the third line. The first column of $L^{[21]}$ is known (it was given to the algorithm as input). This is crucial; it allows us to compute the second column of $L^{[21]}$ by substitution, using the equation

$$H^{[21]}_{:,1} = L^{[21]}_{:,1} T^{[11]}_{1,1} + L^{[21]}_{:,2} T^{[11]}_{2,1}\,.$$

In this equation everything is known except for $L^{[21]}_{:,2}$, which we can therefore compute. From now on we have similar equations but with three terms on the right-hand side; the first such equation is

$$H^{[21]}_{:,2} = L^{[21]}_{:,1} T^{[11]}_{1,2} + L^{[21]}_{:,2} T^{[11]}_{2,2} + L^{[21]}_{:,3} T^{[11]}_{3,2}\,.$$

From this equation we compute $L^{[21]}_{:,3}$, the only unknown in the equation. The general case, for $i = 2, \ldots, k-1$, is

$$H^{[21]}_{:,i} = L^{[21]}_{:,i-1} T^{[11]}_{i-1,i} + L^{[21]}_{:,i} T^{[11]}_{i,i} + L^{[21]}_{:,i+1} T^{[11]}_{i+1,i}\,.$$


Figure 5. The third step of the algorithm, part one.

These equations yield columns 2 to $k$ of $L^{[21]}$. Column $k$ of $H^{[21]}$ yields one more equation,

$$H^{[21]}_{:,k} = L^{[21]}_{:,k-1} T^{[11]}_{k-1,k} + L^{[21]}_{:,k} T^{[11]}_{k,k} + L^{[22]}_{:,1} T^{[21]}_{1,k}\,.$$

Here we know everything except for the last term, $L^{[22]}_{:,1} T^{[21]}_{1,k}$. We use the equation to compute it. We now use the fact that $L^{[22]}_{1,1} = 1$ to compute $T^{[21]}_{1,k}$. Now we can compute the rest of $L^{[22]}_{:,1}$ by scaling. Thus, the second step of the algorithm gave us $L^{[21]}$, $T^{[21]}$, and the first column of $L^{[22]}$. We will use the first column of $L^{[22]}$ in the next recursive call.
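The following Matlab sketch summarizes the substitution phase of the second step. It assumes $k \ge 2$, that H21 has been computed as above, and that no subdiagonal entry of T11 vanishes (there is no pivoting in this recursive formulation); all variable names are illustrative.

    L21 = zeros(n-k, k);
    L21(:,1) = l1(k+1:n);           % the first column of L[21] is given
    % H21(:,1) = L21(:,1)*T11(1,1) + L21(:,2)*T11(2,1) yields column 2:
    L21(:,2) = (H21(:,1) - L21(:,1)*T11(1,1)) / T11(2,1);
    % the general three-term equations yield columns 3 to k:
    for i = 2:k-1
        L21(:,i+1) = (H21(:,i) - L21(:,i-1)*T11(i-1,i) ...
                               - L21(:,i)*T11(i,i)) / T11(i+1,i);
    end
    % column k of H21 yields T21(1,k) and the first column of L22:
    r = H21(:,k) - L21(:,k-1)*T11(k-1,k) - L21(:,k)*T11(k,k);
    T21_1k = r(1);                  % because L22(1,1) = 1
    L22_1  = r / T21_1k;            % the rest of the column, by scaling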

In the third step of the algorithm, we use the equation

$$A^{[22]} = H^{[21]} L^{[21]T} + H^{[22]} L^{[22]T}\,,$$

illustrated in Figure 5. We know the first term on the right-hand side, so we subtract it and continue with an expansion of the second term,

$$A^{[22]} - H^{[21]} L^{[21]T} = H^{[22]} L^{[22]T}
= \left( L^{[21]} T^{[12]} + L^{[22]} T^{[22]} \right) L^{[22]T}
= \left( L^{[21]}_{:,k} T^{[12]}_{k,:} + L^{[22]} T^{[22]} \right) L^{[22]T}
= L^{[21]}_{:,k} T^{[12]}_{k,:} L^{[22]T} + L^{[22]} T^{[22]} L^{[22]T}
= L^{[21]}_{:,k} T^{[12]}_{k,1} \left( L^{[22]}_{:,1} \right)^T + L^{[22]} T^{[22]} L^{[22]T}\,.$$

Figure 6. The third step of the algorithm, part two.

This is illustrated in Figure 6. The transition from the second to the third line is based on an expansion of $L^{[21]} T^{[12]}$ into a sum of column-times-row outer products, all of which except the first are zero. The transition from the fourth to the fifth line is based on a similar trick: $L^{[21]}_{:,k} T^{[12]}_{k,:}$ is a square matrix in which only the first column, $L^{[21]}_{:,k} T^{[12]}_{k,1}$, is not zero. Therefore, we can compute $L^{[21]}_{:,k} T^{[12]}_{k,:} L^{[22]T}$ simply by multiplying $L^{[21]}_{:,k} T^{[12]}_{k,1}$ by the first row of $L^{[22]T}$. We have already computed $L^{[22]}_{:,1}$ in the second step of the algorithm, so all the factors in the product

$$L^{[21]}_{:,k} T^{[12]}_{k,1} \left( L^{[22]}_{:,1} \right)^T$$

are known. We can, therefore, subtract it from both sides to obtain the input to the second recursive call in the algorithm,

$$A^{[22]} - H^{[21]} L^{[21]T} - L^{[21]}_{:,k} T^{[12]}_{k,1} \left( L^{[22]}_{:,1} \right)^T = L^{[22]} T^{[22]} L^{[22]T}\,.$$

The recursive call computes $T^{[22]}$ and columns 2 to $n-k$ of $L^{[22]}$; column 1 has already been determined. Note that since the trailing submatrix has an $LTL^T$ factorization with a symmetric $T$, it is symmetric itself, so we only need to represent its lower triangle.
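In the same sketch notation, the third step forms the symmetric input of the second recursive call and recurses on it (T21_1k equals $T^{[12]}_{k,1}$ by the symmetry of $T$; ltlt_rec is the hypothetical recursive routine):

    % Step 3 (sketch): subtract both updates, then recurse; the first
    % column of L[22] has already been fixed to L22_1.
    A22r = A(k+1:n, k+1:n) - H21*L21' - (L21(:,k)*T21_1k)*L22_1';
    [L22, T22] = ltlt_rec(A22r, L22_1);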

The term $L^{[21]}_{:,k} T^{[12]}_{k,1} \left( L^{[22]}_{:,1} \right)^T$ is what causes the inefficiency when the recursion is applied column by column ($k = 1$). This term represents a second rank-1 update of the trailing submatrix, in addition to $H^{[21]} L^{[21]T}$. Performing two rank-1 updates on the trailing submatrix for every column that is eliminated leads to an overall operation count of $2n^3/3 + \Theta(n^2)$.

On the other hand, if we always choose $k = \lfloor n/2 \rfloor$, the extra cost associated with this rank-1 update becomes insignificant. To see this, we denote the number of operations of this variant by $T_{PR(n/2)}(n)$ and write down the recursion

$$T_{PR(n/2)}(n) = 2\,T_{PR(n/2)}\!\left(\frac{n}{2}\right) + \frac{n^3}{4} + O(n^2)\,.$$

It is not hard to show that $T_{PR(n/2)}(n) = n^3/3 + O(n^2)$. Our experimental results are consistent with this analysis.
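A short sketch of the argument: unrolling the recurrence over its $\log_2 n$ levels, level $i$ contributes $2^i$ subproblems, each with a non-recursive cost of $(n/2^i)^3/4 + O((n/2^i)^2)$, so

$$T_{PR(n/2)}(n) = \sum_{i=0}^{\log_2 n} 2^i \, \frac{(n/2^i)^3}{4} + O(n^2) \le \frac{n^3}{4} \sum_{i=0}^{\infty} 4^{-i} + O(n^2) = \frac{n^3}{4} \cdot \frac{4}{3} + O(n^2) = \frac{n^3}{3} + O(n^2)\,.$$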

2.2. A Recursive Aasen Method. To obtain a more efficient algorithm for an arbitrary $k$, we need to delay the last rank-1 update in the third step of the algorithm. Since we delay this rank-1 update, the input to the recursive call is an unsymmetric matrix minus a rank-1 update; the subtraction is symmetric, but the terms are not. To turn this into a complete algorithm, we formulate it as follows:

Compute the $LTL^T$ factorization of a symmetric matrix $\tilde{A} - h L^T_{:,1}$, given a column vector $h$ and the first column $L_{:,1}$ of $L$. The term $\tilde{A}$ is, in general, unsymmetric.

Figure 7 shows the graphical representation of this formulation. To initialize the recursion, we simply take $\tilde{A} = A$ and $h = 0$.

The first step in this recursive formulation is trivial; we compute the $(1,1)$ blocks of $L$ and $T$ given $\tilde{A}^{[11]}$, $h_{1:k}$, and $L_{1:k,1}$.

In the second step, we first solve

$$\tilde{A}^{[21]} = \tilde{H}^{[21]} L^{[11]T}$$


Figure 7. An illustration of the Aasen-like factorization.

for the matrix $\tilde{H}^{[21]}$. We then subtract $h_{k+1:n}$ from the first column of $\tilde{H}^{[21]}$ to obtain $H^{[21]}$. This works because

$$A^{[21]} = \tilde{A}^{[21]} - h_{k+1:n} L^T_{1:k,1}
= \tilde{H}^{[21]} L^{[11]T} - h_{k+1:n} L^T_{1:k,1}
= \sum_{j=1}^{k} \tilde{H}_{k+1:n,j} L^T_{1:k,j} - h_{k+1:n} L^T_{1:k,1}
= \left( \tilde{H}_{k+1:n,1} - h_{k+1:n} \right) L^T_{1:k,1} + \sum_{j=2}^{k} \tilde{H}_{k+1:n,j} L^T_{1:k,j}
= H^{[21]} L^{[11]T}\,.$$

Note that the rank-1 update $h L^T_{:,1}$ is applied here implicitly, which is the essential idea in Aasen's method. The second part of step two is carried out exactly as in Section 2.1, since it computes $L_{k+1:n,2:k+1}$ and $T_{k+1,k}$ from $H^{[21]}$ and $T^{[11]}$, which we already have.
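In Matlab-like sketch notation, the implicit rank-1 update of the second step costs only one vector subtraction (illustrative names: At stands for $\tilde{A}$, Ht21 for $\tilde{H}^{[21]}$, and h is the input vector of the recursion):

    Ht21 = At(k+1:n, 1:k) / L11';    % solve Atilde21 = Htilde21 * L11'
    H21  = Ht21;
    H21(:,1) = H21(:,1) - h(k+1:n);  % fold the rank-1 update into one column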

In the third step of the algorithm, we compute an $(n-k)$-vector $g$ and the matrix

$$B = L^{[22]} T^{[22]} L^{[22]T} + g \left( L^{[22]}_{:,1} \right)^T.$$

Given $B$ and $g$, we invoke the algorithm recursively to compute the $LTL^T$ factorization of $B - g \left( L^{[22]}_{:,1} \right)^T$. To determine $B$ and $g$, we start with the identity

$$H^{[22]} = L^{[21]} T^{[12]} + L^{[22]} T^{[22]} = \begin{bmatrix} L_{k+1:n,k} T_{k,k+1} & 0 & \cdots & 0 \end{bmatrix} + L^{[22]} T^{[22]}\,.$$

The first term in the second line is a square matrix in which the first column is $L_{k+1:n,k} T_{k,k+1}$ and the rest are zeros. This follows from the fact that $T^{[12]}$ contains only one nonzero. Therefore,

$$H^{[22]} L^{[22]T} = \begin{bmatrix} L_{k+1:n,k} T_{k,k+1} & 0 & \cdots & 0 \end{bmatrix} L^{[22]T} + L^{[22]} T^{[22]} L^{[22]T} = L^{[22]} T^{[22]} L^{[22]T} + L_{k+1:n,k} T_{k,k+1} \left( L^{[22]}_{:,1} \right)^T.$$


We can now define

$$B = H^{[22]} L^{[22]T}, \qquad g = L_{k+1:n,k} T_{k,k+1}\,.$$

Computing $g$ is trivial, since we already have $L_{k+1:n,k}$ and $T_{k,k+1}$. Computing $B$ is slightly more complex. We have

$$A^{[22]} = \tilde{A}^{[22]} - h_{k+1:n} L^T_{k+1:n,1} = H^{[21]} L^{[21]T} + H^{[22]} L^{[22]T}\,,$$

so

$$H^{[22]} L^{[22]T} = \tilde{A}^{[22]} - H^{[21]} L^{[21]T} - h_{k+1:n} L^T_{k+1:n,1}
= \tilde{A}^{[22]} - H^{[21]} L^{[21]T} - h_{k+1:n} \left( L^{[21]}_{:,1} \right)^T
= \tilde{A}^{[22]} - \left( H^{[21]} + \begin{bmatrix} h_{k+1:n} & 0 & \cdots & 0 \end{bmatrix} \right) L^{[21]T}\,.$$

But we already have $H^{[21]} + \begin{bmatrix} h_{k+1:n} & 0 & \cdots & 0 \end{bmatrix}$; it is the matrix $\tilde{H}^{[21]}$ that we used in step two. Therefore, to compute $B$ we subtract the matrix-matrix product $\tilde{H}^{[21]} L^{[21]T}$ from $\tilde{A}^{[22]}$. Here again, the rank-1 update was performed implicitly, as in Aasen's method.
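In the same sketch notation, the third step of the Aasen-like recursion is a single matrix-matrix product and a scaled column (Tk_k1 denotes the scalar $T_{k,k+1} = T_{k+1,k}$; the next recursive call then factors $B - g\,(L^{[22]}_{:,1})^T$):

    B = At(k+1:n, k+1:n) - Ht21 * L21';  % Htilde21 already carries the h term
    g = L21(:,k) * Tk_k1;                % g = L(k+1:n,k) * T(k,k+1)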

A close inspection of this algorithm shows that it only uses elements from the lower triangle of $\tilde{A}$. This allows us to compute only the lower triangle of the trailing submatrix $B$, leading to $n^3/3 + \Theta(n^2)$ operations for any choice of $k$.

3. Pivoting in a Partitioned Algorithm

We now show how to incorporate pivoting into these recursive formulations. The algorithms that we describe are essentially hybrids of the methods of Section 2 with Aasen's method. To keep the presentation concise, we do not present detailed equations or pseudo-code. We do provide compact Matlab code of the algorithms, as well as efficient C code.¹

To incorporate partial pivoting into $LTL^T$ factorizations, we perform symmetric row and column exchanges to ensure that element $(j+1, j)$ of the reduced matrix is largest in magnitude among the elements $j+1:n$ of column $j$, before column $j$ is eliminated. For example, in the first step of the algorithm we may need to exchange row/column $i$ with row/column 2 to ensure that $(PAP^T)_{2,1}$ is at least as large in magnitude as every element of $(PAP^T)_{3:n,1}$. Since both row 2 and column 2 may change, we cannot partition the matrix a priori into blocks, since essentially any element of $A_{2:n,2:n}$ may be permuted into row or column 2.

In this form of pivoting, we cannot partition the rows or the columns into panels that are eliminated contiguously. Any column may be the second to be eliminated. Therefore, we partition the algorithm into batches of $k$ consecutive elimination steps, which are not associated with specific columns. Each batch generates $k$ columns of $L$ and $T$.

To allow this kind of pivoting, we must ensure that all uneliminated rows and columns of the matrix are in the same state. We do not want to have some columns that have undergone one set of reductions and other columns that have undergone another set.

To keep all the uneliminated columns in the same state, we adopt a two-level approach. We eliminate a batch of $k$ rows and columns, and once they are factored, we reduce the remaining columns and move on to the next $k$. At the beginning of each batch, all the columns in the trailing submatrix have been reduced. During the processing of a batch of columns, we apply all the updates from this batch to the column to be eliminated, just before its elimination. This is a so-called left-looking approach. Aasen's algorithm is precisely this method with the batch size $k$ set to 1; by setting it to a larger value we obtain a partitioned algorithm that performs better.

¹ http://www.tau.ac.il/~stoledo/research.html

We note that in LU with partial pivoting, we can preselect the columns for each batch. This permits both left-looking and right-looking approaches within batches (traditionally called panels). Here we cannot use a right-looking approach within batches, since we do not know which columns will eventually belong to the batch.

The elimination of $k$ columns is exactly equivalent to a combination of steps 1 and 2 of the recursive algorithms. Once we eliminate a batch of $k$ columns, we reduce the trailing submatrix. We can perform the batch updates and the updates within batches using either the algorithm of Section 2.1 or the algorithm of Section 2.2. In the method of Section 2.1, the trailing submatrix is symmetric. Therefore, we can deduce its upper part even if we only compute and store the lower part. This allows us to easily implement the row/column exchanges.

If we use the algorithm of Section 2.2, however, the trailing submatrix is no longer symmetric. The trailing submatrix minus a rank-1 update is symmetric. We can therefore perform the row/column exchange as follows. We first subtract the rank-1 update from the lower-triangular part of the two rows/columns to be exchanged. We then exchange them, and add back the rank-1 update to obtain the unsymmetric-plus-rank-1 representation. The extra cost of these operations is $\Theta(n)$ operations per column, or $\Theta(n^2)$ for the entire factorization; a low-order term. We cannot rule out the possibility that this update/downdate operation may cause numerical instabilities.

Due to the potential instability of the method of Section 2.2, we prefer the method of Section 2.1. The latter is a little less efficient, due to the extra rank-1 updates. More precisely, the method of Section 2.1 performs one extra rank-1 update per batch of $k$ columns. Therefore, it performs $\Theta(\frac{n}{k} \cdot n^2) = \Theta(n^3/k)$ extra operations relative to Aasen's method. A careful examination of the constants shows that the number of operations in this algorithm is

$$\frac{1}{3} \left( 1 + \frac{1}{k} \right) n^3 + O(n^2 k)\,.$$

This is essentially a factor of $1 + 1/k$ more arithmetic operations than Aasen's original algorithm; for $k = 64$, for example, the overhead is about 1.6 percent. For moderately large $k$, the parallelism and cache benefits of the partitioned method should outweigh the extra operations.

We believe that this algorithm does not suffer from numerical instabilities, although we do not have a formal proof. It essentially combines one Parlett-Reid step with every $k-1$ Aasen steps.

4. Implementation

We have implemented the partitioned algorithm in C. The implementation uses lapack-style data structures for the matrices and has a lapack-style interface. We assume that the lower triangular part of $A$ is given in a square or rectangular two-dimensional array, in column-major order.


The factorization is computed in place, with columns of L overwriting columnsof A. The diagonal of L, which is all ones, is not stored.

We use a separate $n$-by-$k$ array to store the block of $H$ corresponding to the current batch of columns. It is used during the processing of the batch to reduce columns that are about to be eliminated, and it is used at the end of the batch to reduce all the remaining columns of $A$. Lapack uses a buffer of the same dimensions in its partitioned Bunch-Kaufman code. The implementation also uses three $n$-vectors as scratch areas.

The code performs most of its work in calls to the level-3 blas [7], which are highly efficient implementations of matrix-matrix operations. The smallest dimension of any matrix involved in a level-3 blas call in the algorithm is at least $k$ (except perhaps in the updating of the last batch, which may consist of fewer than $k$ columns). This allows the algorithm to effectively use caches and multiple cores. Only $O(n^2 k)$ operations are performed outside the level-3 blas.

In particular, we never perform a rank-1 update on the trailing submatrix; wealways fuse the rank-1 update with a larger matrix-matrix multiply.

The algorithm performs back pivoting only within the current batch. Columns of $L$ factored in previous batches are not permuted, and the solve code takes this into account. We perform forward pivoting completely, but only within the lower triangular part of the trailing submatrix.

We factor $T$ into its QR factors using Givens rotations. This is done during the $LTL^T$ factorization. The factors are stored in unused diagonals in the upper part of $A$. We could have also used LU with partial pivoting; both preserve the sparsity of a tridiagonal matrix almost completely (in both cases the upper triangular matrix has 3 nonzero diagonals; in LU the unit lower triangular matrix has only two nonzero diagonals, and in QR we need to store $n-1$ rotations).
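For illustration, here is a minimal Matlab sketch of a Givens QR factorization of a tridiagonal matrix (dense storage for simplicity; our C code instead stores the three diagonals of $R$ and the $n-1$ rotations in unused diagonals of the array holding $A$, as described above):

    function [R, c, s] = tridiag_qr(T)
    % QR of a tridiagonal matrix by Givens rotations. Instead of forming
    % Q, we return the n-1 rotations (c(i), s(i)); R has 3 nonzero diagonals.
    n = size(T, 1);
    R = T;  c = zeros(n-1, 1);  s = zeros(n-1, 1);
    for i = 1:n-1
        [G, ~] = planerot(R(i:i+1, i));  % 2-by-2 rotation zeroing R(i+1,i)
        c(i) = G(1,1);  s(i) = G(1,2);
        cols = i:min(i+2, n);            % R is banded: three columns affected
        R(i:i+1, cols) = G * R(i:i+1, cols);
    end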

5. Experimental Results

We have compared the performance of the two partitioned tridiagonalization methods to the performance of the partitioned Bunch-Kaufman algorithm in lapack version 3.1.1 [2]. The results that we report below show that the performance of the partitioned tridiagonalization methods is similar to that of the Bunch-Kaufman algorithm. The performance of the solver phases is also similar.

We linked all the codes (ours and lapack) to the Goto blas version 1.12 [10]. The codes were compiled using gcc version 4.1.2 on an Intel x86_64 machine running Linux. The computer on which we ran the experiments had an Intel Core 2 model 6400 running at 2.13 GHz. We restricted the blas to use only one core of the CPU using an environment variable. The computer had 4 GB of main memory; we did not detect any significant paging activity during the experiments.

We used a block size $k = 64$ both in our methods and in lapack. This value is the default of lapack 3.1.1. In experiments not reported here, we verified that this value is indeed the optimal block size for this machine (and for all the methods).

The experiments were conducted using random symmetric matrices whose elements are uniformly distributed in $(-1, 1)$. The relative residuals were always below $10^{-12}$. The residuals produced by the three methods were of similar magnitude on all the matrices.
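This setup is easy to reproduce in a few lines of Matlab (a sketch; we use backslash as a stand-in for the factor-and-solve codes that were actually compared):

    n = 1000;
    A = 2*rand(n) - 1;
    A = tril(A) + tril(A, -1)';   % random symmetric, entries in (-1, 1)
    b = randn(n, 1);
    x = A \ b;                    % stand-in for the LTL^T (or LBL^T) solve
    relres = norm(b - A*x) / (norm(A, 1) * norm(x));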

Figure 8 compares the performance of the three methods. The vertical axis shows effective computational rates in millions of floating-point operations per second,

$$\frac{1}{3}\,\frac{n^3}{T} \times 10^{-6}\,,$$

where $T$ is the running time in seconds. Note that this metric compares actual running times and ignores differences in operation counts. Therefore, a higher value always indicates a faster factorization.

[Figure 8: two panels, "Factorization Performance for Small Matrix Sizes" and "Factorization Performance for Large Matrix Sizes", plotting MFLOP/s against n for LAPACK-3.1.1, Partitioned-Parlett-Reid, and Partitioned-Aasen.]

Figure 8. The performance of the three methods. The vertical axis shows the effective computational rate.

[Figure 9: two panels, "Solve Time, Fixed Matrix Orders" (time in seconds against the number of right-hand sides, for n = 1000, 2000, 4000) and "Solve Time, Fixed Numbers of Right-Hand-Sides" (time against n, for NRHS = 100, 500, 2000), each with curves for LAPACK-3.1.1 and Partitioned-Parlett-Reid.]

Figure 9. The performance of the solve phase. The graphs of the partitioned Parlett-Reid method coincide with the graphs of the Bunch-Kaufman method in lapack. The solve phase in the partitioned Aasen and the partitioned Parlett-Reid methods is exactly the same.

On matrices of order up to about 5000, the tridiagonalization methods are slightly faster than, or run at the same speed as, the Bunch-Kaufman algorithm. On larger matrices, the Bunch-Kaufman algorithm is sometimes faster. In either case, the relative performance differences are not large. We do not know why the performance of the tridiagonalization algorithms drops at $n = 8000$. The performance of the two tridiagonalization algorithms is very similar.

The graphs in Figure 9 show that the performance of the solve phase is essentially the same for our new methods and for lapack's implementation of the Bunch-Kaufman algorithm.


6. Concluding Remarks

We have presented two variants of symmetric tridiagonalization algorithms. Both methods are partitioned, so they effectively exploit high-performance level-3 blas routines, and both incorporate partial pivoting. One method is a partitioned variant of the Parlett-Reid algorithm [12] and the other is a partitioned variant of Aasen's algorithm [1]. Although Aasen's method performs half the number of operations of Parlett-Reid when both are applied on a column-by-column basis, they perform similarly in the partitioned case. We have analyzed this issue theoretically and verified the results experimentally.

It is more difficult to incorporate pivoting into the partitioned Aasen method than into the partitioned Parlett-Reid method. In the former, we had to downdate certain columns in order to keep only the lower triangle of the reduced matrix. This may cause numerical instability, although we have not observed any such instability. The partitioned Parlett-Reid method is likely to be as stable as the original column-by-column algorithm.

Our implementation of the two methods performs similarly to lapack's partitioned implementation of the Bunch-Kaufman algorithm, in both the factor and the solve phases. This shows that the findings of Anderson and Dongarra, who found that a partitioned variant of Aasen's algorithm performed worse than a partitioned Bunch-Kaufman [3], are not valid for our new variants. Our experiments were conducted on a very different machine than the one used by Anderson and Dongarra, but we do not see why our variants would perform worse than lapack even on a machine like the Cray 2; our methods and lapack's implementation of Bunch-Kaufman perform essentially the same sequence of level-3 blas calls.

More generally, our results show that symmetric tridiagonalization methods can be as efficient as symmetric block-diagonalization (when both utilize non-unitary transformations). The extra cost of solving tridiagonal systems is negligible. The triangular factors that our methods produce always have elements bounded in magnitude by 1, whereas Bunch-Kaufman methods sometimes produce triangular factors with large elements [4].

Acknowledgements. This research was supported by an IBM Faculty Partnership Award, by grant 848/04 from the Israel Science Foundation (founded by the Israel Academy of Sciences and Humanities), and by grant 2002261 from the United-States-Israel Binational Science Foundation.

References

[1] J. O. Aasen. On the reduction of a symmetric matrix to tridiagonal form. BIT, 11:233–242, 1971.

[2] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.

[3] Edward Anderson and Jack Dongarra. Evaluating block algorithm variants in LAPACK. In Jack Dongarra, Paul Messina, Danny C. Sorensen, and Robert G. Voigt, editors, Proceedings of the 4th Conference on Parallel Processing for Scientific Computing, pages 3–8. SIAM, 1989.

[4] Cleve Ashcraft, Roger G. Grimes, and John G. Lewis. Accurate symmetric indefinite linear equation solvers. SIAM J. Matrix Anal. Appl., 20(2):513–561, 1998.

[5] V. Barwell and A. George. A comparison of algorithms for solving symmetric indefinite systems of linear equations. ACM Trans. Math. Software, 2:242–251, 1976.


[6] Paolo Bientinesi, John A. Gunnels, Margaret E. Myers, Enrique Quintana-Ortí, and Robert van de Geijn. The science of deriving dense linear algebra algorithms. ACM Transactions on Mathematical Software, 31:1–26, 2005.

[7] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S. Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software, 16(1):1–17, 1990.

[8] Jack J. Dongarra, Iain S. Duff, Danny C. Sorensen, and Henk A. van der Vorst. Numerical Linear Algebra for High-Performance Computers. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1998.

[9] Erik Elmroth, Fred Gustavson, Isak Jonsson, and Bo Kågström. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Review, 46(1):3–45, March 2004.

[10] Kazushige Goto and Robert van de Geijn. High-performance implementation of the level-3 BLAS. Technical Report CS-TR-06-23, The University of Texas at Austin, Department of Computer Sciences, May 5, 2006.

[11] Nicholas J. Higham. Notes on accuracy and stability of algorithms in numerical linear algebra. In Mark Ainsworth, Jeremy Levesley, and Marco Marletta, editors, The Graduate Student's Guide to Numerical Analysis '98, pages 48–82. Springer-Verlag, Berlin, 1999.

[12] B. N. Parlett and J. K. Reid. On the solution of a system of linear equations whose matrix is symmetric but not definite. BIT, 10:386–397, 1970.
