

J. Parallel Distrib. Comput. 71 (2011) 485–494


Implementation and tuning of a parallel symmetric Toeplitz eigensolver✩

Pedro Alonso a, Miguel O. Bernabéu b, Victor M. García a,∗, Antonio M. Vidal a

a Department of Information Systems and Computation, Universidad Politécnica de Valencia, Cno. Vera s/n, 46022 Valencia, Spain
b Oxford University Computing Laboratory, Oxford, UK

Article info

Article history:
Received 25 March 2010
Received in revised form 27 July 2010
Accepted 7 October 2010
Available online 29 October 2010

Keywords:
Matrix eigenvalues
Multicore platforms
Toeplitz matrices

Abstract

In a previous paper (Vidal et al., 2008, [21]), we presented a parallel solver for the symmetric Toeplitz eigenvalue problem, which is based on a modified version of the Lanczos iteration. However, its efficient implementation on modern parallel architectures is not trivial.

In this paper, we present an efficient implementation on multicore processors which takes advantage of the features of this architecture. Several optimization techniques have been incorporated into the algorithm: improvement of Discrete Sine Transform routines, utilization of the Gohberg–Semencul formulas to solve the Toeplitz linear systems, optimization of the workload distribution among processors, and others. Although the algorithm follows a distributed memory parallel programming paradigm that is led by the nature of the mathematical derivation, special attention has been paid to obtaining the best performance in multicore environments. Hybrid techniques, which merge OpenMP and MPI, have been used to increase the performance in these environments. Experimental results show that our implementation takes advantage of multicore architectures and clearly outperforms the results obtained with LAPACK or ScaLAPACK.

© 2010 Elsevier Inc. All rights reserved.

1. Introduction

In this paper, we study the numerical solution of the symmetric eigenvalue problem

$$T x = \lambda x, \tag{1}$$

where $T \in \mathbb{R}^{n \times n}$ is a symmetric Toeplitz matrix, $x \in \mathbb{R}^n$ with $x \neq 0$, and $\lambda \in \mathbb{R}$. This problem arises in many applications including control and digital signal processing [4,2,19]. In the case of non-structured matrices [24,11], this problem is commonly solved by reducing the matrix to a tridiagonal form (Householder reflections, Givens rotations, ...) and applying algorithms that compute the eigensystem of the tridiagonal form (iterative QR, bisection, divide-and-conquer, or MRRR [22]). In the case of structured matrices, these algorithms have a clear drawback: the tridiagonalization destroys the structure of the matrix.

Another approach for tackling this problem consists of using “iterative” algorithms. Although the essential nature of all eigenvalue algorithms is iterative, here we denote as “iterative” those algorithms based on the Lanczos, Arnoldi, or Jacobi–Davidson methods [7]. In the general case (non-structured matrices), these algorithms are used when memory space limits are reached. They are also used if only a few eigenvalues/eigenvectors are required, maybe the largest ones or the smallest ones. This approach is also preferred in the structured matrix field since the structure of the matrix is not destroyed in the process. One of these iterative methods proposed to solve problem (1) uses a bisection-like procedure to find intervals that contain a single eigenvalue by evaluating the characteristic polynomial with efficient recurrences and then extracting it with some suitable root-finding procedure [2,19].

✩ Supported by Spanish Government (Projects TIN2008-06570-C04 and TEC2009-13741), Universidad Politécnica de Valencia (Project 20080009) and Generalitat Valenciana (Project PROMETEO/2009/013).
∗ Corresponding author.
E-mail addresses: [email protected] (P. Alonso), [email protected] (M.O. Bernabéu), [email protected] (V.M. García), [email protected] (A.M. Vidal).
0743-7315/$ – see front matter © 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2010.10.010

A new approach has recently been proposed to solve (1) using the Lanczos method [21]. The algorithm partitions the interval that contains all the eigenvalues into smaller subintervals. The eigenvalues contained in these subintervals are extracted independently by using a “Shift-and-Invert” technique. The Shift-and-Invert technique allows the eigenvalues that are close to a given real number $\sigma$, called the shift, to be computed. The efficiency of the method is based on how fast linear systems of type $(T - \sigma I)x = b$ can be solved. Since different intervals can be processed in parallel, it becomes possible to greatly reduce the execution time by using parallel computational resources like clusters of computers or multicore computers. However, in order to obtain efficient versions of the algorithm, it must be carefully adapted to the underlying combination of hardware/software.

In this paper, we have refined the method proposed in [21], optimizing several aspects of that algorithm for a hardware architecture based on modern multicore boards, which are organized either in single computers or in clusters. The software already available for these computers must be exploited to accomplish this. The result is a very efficient algorithm that outperforms those subroutines contained in sequential (LAPACK) and parallel (ScaLAPACK) libraries.

We describe the general idea of the method in Section 2. The computational model addressed is discussed in Section 3. Section 4 is devoted to the description of the sequential algorithm, including the improvements incorporated in this paper. Section 5 describes all the improvements carried out in the parallel version. The environment parameters, link options, and other details needed to obtain high performance in the parallel hardware are examined in Section 6. The experimental results are presented throughout the paper where each modification is presented.

2. Method description

The basic algorithm is fully described in [21]. However, for the sake of completeness, here we give a reduced description.

The Lanczos method is used to compute eigenvalues of a symmetric matrix $A$. Given an initial vector $r$, it builds an orthonormal basis $v_1, v_2, \ldots, v_m$ of the Krylov subspace $K_m(A, r)$. In the new orthonormal basis, the matrix $A$ is represented as a tridiagonal matrix:

$$\hat{A}_j = \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{j-1} \\ & & \beta_{j-1} & \alpha_j \end{pmatrix}. \tag{2}$$

Some of the eigenvalues of the $\hat{A}_j$ matrix (called Ritz values) are good approximations to the eigenvalues of $A$, and the eigenvectors can also be easily obtained from the eigenvectors of the $\hat{A}_j$ matrix and the $v_i$ vectors.

This method, as such, is not suitable for computing many eigenvalues since, among other problems, the convergence speed would be very poor and the matrix $V = (v_i)$ might become huge. Therefore, the method must be adapted.

The Lanczos method converges first to the eigenvalues that are largest in magnitude. Given an eigenvalue $\lambda_\sigma$ of $W = (A - \sigma I)^{-1}$, then $\lambda = \frac{1}{\lambda_\sigma} + \sigma$ is an eigenvalue of $A$, since $W x = \lambda_\sigma x$ implies $(A - \sigma I)x = x / \lambda_\sigma$. Thus, the Lanczos method applied to $W$ first gives the set of eigenvalues of $A$ that are closest to $\sigma$. This method, which is known as “Shift-and-Invert” [7], was proposed in [12]. A parallel version can be found in [25]. Algorithm 1 represents the “Shift-and-Invert” version of the Lanczos algorithm.

Algorithm 1. “Shift-and-Invert” Lanczos algorithm for computing a few eigenvalues of $A$, the closest to $\sigma$.

1. $\beta_0 = \|r\|_2$;
2. for $j = 1, 2, \ldots$ until convergence:
3.   $v_j = r / \beta_{j-1}$;
4.   $r = (A - \sigma I)^{-1} v_j$;
5.   $r = r - v_{j-1} \cdot \beta_{j-1}$;
6.   $\alpha_j = v_j^T r$;
7.   $r = r - v_j \cdot \alpha_j$;
8.   re-orthogonalize if necessary;
9.   $\beta_j = \|r\|_2$;
10.   compute approximate eigenvalues of $\hat{A}_j$;
11.   test bounds for convergence;
12. end for
13. compute approximate eigenvectors.
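The spectral mapping behind “Shift-and-Invert” can be illustrated with a much simpler iteration than Algorithm 1. The following is a minimal sketch, not the Lanczos recurrence itself: it applies plain inverse iteration (a one-vector simplification) to a small symmetric Toeplitz matrix and recovers the eigenvalue closest to the shift through $\lambda = \sigma + 1/\mu$, where $\mu$ is the dominant eigenvalue of $(A - \sigma I)^{-1}$. The helper names `solve` and `eigenvalue_closest_to` are hypothetical, and the dense Gaussian elimination is only a stand-in for a fast Toeplitz solver.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (dense stand-in)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def eigenvalue_closest_to(a, sigma, iterations=60):
    """Inverse iteration: power method applied to W = (A - sigma I)^{-1}."""
    n = len(a)
    shifted = [[a[i][j] - (sigma if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # deliberately neither symmetric nor skew-symmetric
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    mu = 0.0
    for _ in range(iterations):
        r = solve(shifted, v)                  # r = (A - sigma I)^{-1} v
        mu = sum(x * y for x, y in zip(v, r))  # Rayleigh quotient of W at v
        norm = math.sqrt(sum(x * x for x in r))
        v = [x / norm for x in r]
    return sigma + 1.0 / mu                    # map back to an eigenvalue of A

# Tridiagonal symmetric Toeplitz matrix of order 5 with eigenvalues 2 - 2 cos(k pi / 6).
t = [2.0, -1.0, 0.0, 0.0, 0.0]
a = [[t[abs(i - j)] for j in range(5)] for i in range(5)]
lam = eigenvalue_closest_to(a, 0.9)
print(lam)
```

With the shift 0.9 the iteration settles on the eigenvalue $2 - 2\cos(2\pi/6) = 1$, the one nearest to the shift. Every iteration costs one linear solve with the same shifted matrix, which is why the speed of that solve dominates the method.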

The performance of Algorithm 1 depends upon how quickly the linear systems can be solved. There exist solvers for Toeplitz matrices that can be used to solve a linear system very quickly [15]. Since matrices of the form $T - \sigma I$ are still Toeplitz, the Lanczos method can be successfully applied.

When the Lanczos method is applied to symmetric Toeplitz matrices, it has an interesting “two-way symmetry” which was first proposed and exploited by Voss [23]. He used this property to compute the smallest eigenvalue of a symmetric Toeplitz matrix. Let us see how this property was used in [21] to compute the eigenvalues close to a given “shift”.

Let $J_n = (\delta_{i,n+1-j})_{i,j=1,\ldots,n}$ be the $(n, n)$ matrix that reverses a vector when it is applied. A vector $x$ is symmetric if $x = J_n x$ and it is skew-symmetric if $x = -J_n x$. It is well known that the eigenvectors of a symmetric Toeplitz matrix are either symmetric or skew-symmetric. By analogy, we denote the corresponding eigenvalues as symmetric and skew-symmetric. If the initial vector for the Lanczos recurrence is symmetric, then the whole Krylov space is in the same symmetry class, and the eigenvectors generated will be symmetric. If the initial vector for the Lanczos recurrence is skew-symmetric, then the whole Krylov space is in the same symmetry class, and the eigenvectors generated will be skew-symmetric. Very often, the symmetric and skew-symmetric eigenvalues of a symmetric Toeplitz matrix are interlaced [6]; therefore, if we can restrict the Lanczos method to only one of these classes, the relative separation between eigenvalues may increase, and the speed of convergence could improve.
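The invariance of the two symmetry classes can be checked directly on a small example. The sketch below is only an illustration with hypothetical helper names (`toeplitz`, `matvec`, `flip`); it uses integer entries so that the comparisons are exact, and it shows that a symmetric Toeplitz matrix maps a symmetric vector to a symmetric vector and a skew-symmetric vector to a skew-symmetric one.

```python
def toeplitz(first_column):
    """Dense symmetric Toeplitz matrix defined by its first column."""
    n = len(first_column)
    return [[first_column[abs(i - j)] for j in range(n)] for i in range(n)]

def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def flip(x):
    """Action of the reversal matrix J_n on a vector."""
    return x[::-1]

T = toeplitz([4, 1, 2, 3])
p = [1, 2, 2, 1]      # symmetric:      p == J p
q = [1, -3, 3, -1]    # skew-symmetric: q == -J q

tp = matvec(T, p)
tq = matvec(T, q)
print(tp, flip(tp) == tp)                     # T p stays symmetric
print(tq, flip(tq) == [-v for v in tq])       # T q stays skew-symmetric
```

The same argument applies to $(T - \sigma I)^{-1}$, because the inverse of a matrix that commutes with $J_n$ also commutes with $J_n$; this is what keeps each Krylov space inside its own class.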

Another useful property of the symmetry of Toeplitz matrices allows us to extract symmetric and skew-symmetric eigenvalues at the same time. All of these give what Voss called a “two-way” Lanczos process in which the two “ways” work in parallel until a linear system needs to be solved. Once the linear system is solved the process goes on in parallel. Algorithm 2 shows two columns denoting the two parallel ways of computation.

Algorithm 2. “Shift-and-Invert” Two-way Lanczos method.
Given $T \in \mathbb{R}^{n \times n}$ a symmetric Toeplitz matrix, this algorithm returns the eigenvalues closest to the shift $\sigma$ and the associated eigenvectors.

1. Let $p_1 = J_n p_1 \neq 0$ and $q_1 = -J_n q_1 \neq 0$ be initial vectors
2. Let $p_0 = q_0 = 0$; $\beta_0 = \delta_0 = 0$;
3. $p_1 = p_1 / \|p_1\|$;   $q_1 = q_1 / \|q_1\|$;
4. for $k = 1, 2, \ldots$ until convergence:
5.   $w = p_k + q_k$;
6.   solve $(T - \sigma I) v = w$;
7.   $v_s = 0.5 \cdot (v + J_n v)$;   $v_a = 0.5 \cdot (v - J_n v)$;
8.   $\alpha_k = v_s^T \cdot p_k$;   $\gamma_k = v_a^T \cdot q_k$;
9.   $v_s = v_s - \alpha_k \cdot p_k - \beta_{k-1} \cdot p_{k-1}$;   $v_a = v_a - \gamma_k \cdot q_k - \delta_{k-1} \cdot q_{k-1}$;
10.   full re-orthogonalization;
11.   $\beta_k = \|v_s\|_2$;   $\delta_k = \|v_a\|_2$;
12.   $p_{k+1} = v_s / \beta_k$;   $q_{k+1} = v_a / \delta_k$;
13.   obtain eigenvalues of $SYM_k$;   obtain eigenvalues of $SKS_k$;
14.   test bounds for convergence;
15. end for
16. compute associated eigenvectors.
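Steps 5 to 7 of Algorithm 2 rely on the fact that a single solve with the right-hand side $w = p_k + q_k$ serves both ways at once. The following sketch checks this on a small case in exact rational arithmetic: the symmetric part of the solution equals $(T - \sigma I)^{-1} p$ and the skew-symmetric part equals $(T - \sigma I)^{-1} q$. The dense `solve` helper and the chosen vectors are assumptions made for the illustration, not part of the algorithm in [21].

```python
from fractions import Fraction

def solve(a, b):
    """Exact Gaussian elimination on Fractions (dense stand-in for a Toeplitz solver)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if m[r][c] != 0)  # first nonzero pivot
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

first_column = [4, 1, 2, 3]
n = len(first_column)
sigma = Fraction(1, 2)
shifted = [[Fraction(first_column[abs(i - j)]) - (sigma if i == j else 0)
            for j in range(n)] for i in range(n)]

p = [Fraction(v) for v in (1, 2, 2, 1)]     # symmetric way
q = [Fraction(v) for v in (1, -3, 3, -1)]   # skew-symmetric way

w = [a + b for a, b in zip(p, q)]           # step 5
v = solve(shifted, w)                       # step 6: one solve for both ways
jv = v[::-1]
v_s = [(a + b) / 2 for a, b in zip(v, jv)]  # step 7, symmetric part
v_a = [(a - b) / 2 for a, b in zip(v, jv)]  # step 7, skew-symmetric part

print(v_s == solve(shifted, p), v_a == solve(shifted, q))
```

The split works because $(T - \sigma I)^{-1}$ preserves each symmetry class, so the two parts of the solution cannot mix.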

To compute all the eigenvalues of the matrix (or all the eigenvalues in a, possibly large, interval) the overall process starts first by finding a large interval that contains all the desired eigenvalues and then slicing this large interval into small subintervals. Then, a shift is selected in the middle of each subinterval so that the “Shift-and-Invert” iterative method is applied to extract all the eigenvalues of the subinterval. In [21], this algorithm is called the Full Spectrum Two-Way Lanczos-based algorithm (FSTWL). Here, it is summarized in Algorithm 3. The algorithm can also be used to compute the spectrum contained in any given interval.


Fig. 1. Hardware computational model.

Algorithm 3. FSTWL

1. Choose the interval $[a, b]$ containing the desired eigenvalues;
2. Divide the interval $[a, b]$ into small subintervals;
3. for each subinterval: (* in parallel *)
4.   Compute a “shift” $\sigma$, possibly $\sigma = (a + b)/2$;
5.   Compute a function/decomposition of $T - \sigma I$, to enable fast solution of linear systems;
6.   Apply “Shift-and-Invert” two-way Lanczos method (Algorithm 2) to extract all the eigenvalues in the subinterval and the associated eigenvectors;
7. end for
8. end algorithm.

An important ingredient of the algorithm is the solution of linear systems of the type $(T - \sigma I) v = w$ that appear in step 6 of Algorithm 2. The efficient algorithm used in [21] consists of transforming matrix $T - \sigma I$ into a Cauchy-like matrix $C$ by using the Discrete Sine Transformation (DST) [20]; therefore $C = S(T - \sigma I)S$, where $S$ represents the DST. It is well known that half the entries of the Cauchy-like matrix $C$ are zero [13]. This can be exploited to turn the problem of solving a linear system with $C$ as the system matrix into two independent problems of solving linear systems where the system matrices are $C_0 \in \mathbb{R}^{\lfloor n/2 \rfloor \times \lfloor n/2 \rfloor}$ and $C_1 \in \mathbb{R}^{\lceil n/2 \rceil \times \lceil n/2 \rceil}$, by using an odd–even permutation matrix $P$ so that $P C P^T = C_0 \oplus C_1$. Matrices $C_0$ and $C_1$ are Cauchy-like as well.
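The zero pattern of the transformed matrix is easy to observe numerically. The sketch below is an illustration under stated assumptions: it builds the orthonormal DST matrix $S_{jk} = \sqrt{2/(n+1)}\,\sin(\pi j k/(n+1))$ explicitly (the normalization used in [21] may differ), forms $C = S(T - \sigma I)S$ with dense products for a small random-looking symmetric Toeplitz matrix, and measures the entries whose row and column indices have different parity.

```python
import math

def dst_matrix(n):
    """Orthonormal, symmetric DST matrix of order n (normalization is an assumption)."""
    scale = math.sqrt(2.0 / (n + 1))
    return [[scale * math.sin(math.pi * (i + 1) * (j + 1) / (n + 1))
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

first_column = [4.0, 1.0, 2.0, 3.0, 5.0]
n = len(first_column)
sigma = 0.5
shifted = [[first_column[abs(i - j)] - (sigma if i == j else 0.0)
            for j in range(n)] for i in range(n)]

s = dst_matrix(n)
c = matmul(matmul(s, shifted), s)   # C = S (T - sigma I) S

# Entries with i + j odd vanish; the remaining entries carry all the information.
odd_parity = max(abs(c[i][j]) for i in range(n) for j in range(n) if (i + j) % 2 == 1)
even_parity = max(abs(c[i][j]) for i in range(n) for j in range(n) if (i + j) % 2 == 0)
print(odd_parity, even_parity)
```

Grouping the odd-numbered and the even-numbered rows and columns separately therefore yields the two decoupled blocks $C_0$ and $C_1$ described above.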

The solution of the linear systems is carried out by computing their $LDL^T$ factorization. The triangular factorization of each one of these submatrices can be carried out rapidly since there exists an algorithm that obtains the triangular factorization of symmetric Cauchy-like matrices in $O(n^2)$ flops [3,21].

These factorizations are computed only once (in step 5 of Algorithm 3), since the system matrix never changes for a given shift. Thus, the solution of the linear systems at each loop iteration in Algorithm 2 consists only of the solution of two triangular linear systems.

An efficient subinterval selection is based on the Inertia Theorem [7]. Given an interval $[\alpha, \beta]$, this theorem can be used to find out how many eigenvalues are in the interval. This could be done by computing the $LDL^T$ decompositions of $T - \alpha I$ (equal to $L_\alpha D_\alpha L_\alpha^T$) and $T - \beta I$ (equal to $L_\beta D_\beta L_\beta^T$). Then, the number of eigenvalues in the interval $[\alpha, \beta]$ is simply the number $\nu(D_\beta) - \nu(D_\alpha)$, where $\nu(D)$ denotes the number of negative elements in the diagonal $D$.
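The eigenvalue count given by the Inertia Theorem can be sketched in a few lines. This is a minimal illustration, assuming a dense $LDL^T$ factorization without pivoting (it breaks down if a pivot is exactly zero) in place of the fast Cauchy-like factorization; the function names `ldlt_diagonal`, `count_below` and `count_in_interval` are hypothetical.

```python
def ldlt_diagonal(a):
    """Diagonal D of the LDL^T factorization of a symmetric matrix (no pivoting)."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for j in range(n):
        d[j] = a[j][j] - sum(lower[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            lower[i][j] = (a[i][j] - sum(lower[i][k] * lower[j][k] * d[k]
                                         for k in range(j))) / d[j]
    return d

def count_below(first_column, x):
    """nu(D) for T - x I: the number of eigenvalues of T that are smaller than x."""
    n = len(first_column)
    shifted = [[first_column[abs(i - j)] - (x if i == j else 0.0)
                for j in range(n)] for i in range(n)]
    return sum(1 for value in ldlt_diagonal(shifted) if value < 0.0)

def count_in_interval(first_column, alpha, beta):
    return count_below(first_column, beta) - count_below(first_column, alpha)

# Symmetric Toeplitz matrix with eigenvalues 2 - 2 cos(k pi / 6), k = 1..5,
# that is, approximately 0.268, 1, 2, 3 and 3.732.
t = [2.0, -1.0, 0.0, 0.0, 0.0]
print(count_in_interval(t, 0.5, 2.5))
print(count_in_interval(t, 0.0, 4.0))
```

For this matrix the interval $[0.5, 2.5]$ contains the eigenvalues 1 and 2, and $[0, 4]$ contains the whole spectrum, which is what the two calls report.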

Once an appropriate number of eigenvalues has been decided, the main interval containing all the desired eigenvalues is divided into a reasonable number of subintervals, choosing division points $\sigma_i$ where the Inertia Theorem is used to determine the number of eigenvalues to the left and to the right of the chosen point. Then, applying the Inertia Theorem to each new point, a bisection-like search is performed, until a division of the main interval is obtained where none of the subintervals has more eigenvalues than the chosen number, and none is wider than a pre-set tolerance. Empty subintervals (without eigenvalues) are automatically discarded. We refer to the process of dividing the main interval (steps 1 and 2) as Isolation and the process performed by the Two-way Lanczos Algorithm of extracting all the eigenvalues of all the intervals as Extraction.
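The Isolation phase described above can be sketched as a bisection driven by an eigenvalue-counting function. In this illustration the function `isolate`, its parameters and the `min_width` safeguard against clustered eigenvalues are assumptions; the counting function is a stand-in built from a known spectrum, where the real algorithm would use the inertia of $T - xI$.

```python
def isolate(count_below, a, b, max_eigs, max_width, min_width=1e-12):
    """Split [a, b] into non-empty pieces with at most max_eigs eigenvalues each
    and width at most max_width. Returns sorted (lo, hi, count) triples."""
    pieces = []
    pending = [(a, b)]
    while pending:
        lo, hi = pending.pop()
        inside = count_below(hi) - count_below(lo)
        if inside == 0:
            continue                          # empty subintervals are discarded
        small_enough = inside <= max_eigs and hi - lo <= max_width
        if small_enough or hi - lo <= min_width:
            pieces.append((lo, hi, inside))
        else:
            mid = 0.5 * (lo + hi)             # bisection-like refinement
            pending.append((mid, hi))
            pending.append((lo, mid))
    return sorted(pieces)

# Stand-in counter based on a known spectrum (hypothetical data for the example).
spectrum = [0.268, 1.0, 2.0, 3.0, 3.732]

def count_below(x):
    return sum(1 for e in spectrum if e < x)

pieces = isolate(count_below, 0.0, 4.0, max_eigs=2, max_width=2.0)
for lo, hi, inside in pieces:
    print(lo, hi, inside)
```

Each resulting triple is an independent unit of work for the Extraction phase, with a shift placed inside the subinterval.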

3. Computational environment

In this section, we describe both the hardware and software computational models that we used to design our algorithm. The hardware computational model is composed of basic computing units that we call cores. The cores are clustered in groups. We denote each group as a computer or node (Fig. 1). The number of cores per computer can be different, although we have used computers with an equal number of cores. The cores inside a computer all have access to a global shared memory so that they can communicate data through this memory. Computers communicate data through an Interconnection Network. Thus, there are two hierarchical levels of data communication: intra-node and extra-node.

The software computational model is made up of “heavy” processes. These processes interchange data with each other by a message passing interface. Each process is mapped onto a computer, and it can use all the cores of the computer where it has been mapped. A process uses cores inside the node by means of “light” processes, that is, threads. The model does not make restrictions about the number of processes that can be mapped onto a single computer. Also, there are no restrictions regarding the number of cores that a process can use (the number of threads that a process can spawn), making it possible for several processes to compete for cores.

In our experiments, the hardware used was an SGI Altix XE270 cluster composed of four computing nodes and one front end. Each computing node was a two-processor board made up of two Intel Quad Core processors of type Xeon 5365 at 3.0 GHz and 16 GB of memory. Thus, each node was composed of two CPUs with 4 cores each, for a total of 8 cores. The front end was a two-processor board composed of two Intel Dual Core processors of type Xeon 5140 at 2.33 GHz and 8 GB of memory. The interconnection network was a 1 Gb Ethernet.

The processes were implemented as MPI (Message Passing Interface) processes [16]. Specifically, we used the MPI implementation OpenMPI-1.3 [9], which is a project that combines technologies and resources from several other projects and is MPI-2 specification compliant. From the programmer’s point of view, MPI processes interchange data by using a message passing paradigm, but the intra-node communication (i.e., between processes mapped onto the same computer) is more efficient since OpenMPI uses shared memory to implement the corresponding message passing routine. An MPI process can use more than one core inside the computer where it has been mapped. These resources are used through threads, which in our case are managed by using OpenMP [5] pragmas. The codes were implemented in Fortran 95 using Intel Fortran Version 11.0.

The programs use BLAS and LAPACK kernels provided by the Intel Math Kernel Library Version 10.1 (MKL). The BLAS and LAPACK packages are usually thought of as “sequential”. However, level 3 routines of BLAS and some level 3 routines of LAPACK provided by the Intel MKL are “threaded”, which means that they can make use of several cores inside a computer at the same time. A truly sequential version or a parallel one can be obtained by linking the application with the appropriate MKL.¹

The numerical tests shown in this paper were carried out using full symmetric Toeplitz matrices generated randomly. Of course, our numerical results may vary depending on the spectrum of the matrices considered; however, some extra experiments reported in [21] show that the basic algorithm is not too sensitive to the conditioning of the problem.

4. Sequential implementation of the algorithm

In this section, we propose some improvements in the FSTWL Algorithm (Algorithm 3).

4.1. The computation of the DST

The first improvement carried out in the algorithm deals with the computation of the DST, which is used to translate a Toeplitz linear system to the Cauchy-like domain. The FSTWL Algorithm uses the fftpack [17] libraries to compute the DST, specifically, routine sinti (which is called once to initialize the transformation) and sint (to apply the transformation).

As is well known, the DST is a transformation related to the Fast Fourier Transformation (FFT) that can be computed in $O(n \log n)$ flops in the best case. However, the actual cost may be far from the best one if the prime factors of $n + 1$ are large.

For the case of the FSTWL Algorithm, Table 1 shows the total time as the sum of the isolation and the extraction time for different problem sizes. The table also shows the total time used for the computation of the DST. The cost of the computation of the DST clearly grows with the size of the largest prime factor, as can be seen for problem sizes $n = 2002$ and $n = 2010$, since 2003 and 2011 are prime numbers. Because the isolation and the extraction parts of the FSTWL Algorithm use the DST (both processes perform the $LDL^T$ factorization of a Cauchy-like matrix), the impact is noticeable in both parts of the algorithm.
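The dependence on the factorization of $n + 1$ can be anticipated before running any transform. The short sketch below, with the hypothetical helper `prime_factors`, lists the prime factors of $n + 1$ for the problem sizes discussed above, so that sizes with a large prime factor can be spotted in advance.

```python
def prime_factors(m):
    """Prime factors of m in non-decreasing order, by trial division."""
    factors = []
    d = 2
    while d * d <= m:
        while m % d == 0:
            factors.append(d)
            m //= d
        d += 1
    if m > 1:
        factors.append(m)
    return factors

for n in (2000, 2002, 2004, 2006, 2008, 2010):
    factors = prime_factors(n + 1)
    print(n, factors, "largest prime factor:", max(factors))
```

Sizes such as 2002 and 2010, for which $n + 1$ is itself prime, are the ones where a mixed-radix FFT-based routine degrades the most.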

¹ A linkage with the parallel MKL is also commonly used and then the number of threads that will operate is manually selected through an environment variable [14].

Table 1
Time for each part of the FSTWL Algorithm and time for the DST computation (s).

| n | Prime decomposition of n + 1 | DST | Isolation | Extraction | Total |
|------|------------------------|------|------|------|------|
| 2000 | 2001 = 1 × 3 × 23 × 29 | 0.38 | 0.64 | 10.9 | 11.5 |
| 2002 | 2003 = 1 × 2003 | 30.4 | 1.29 | 40.4 | 41.7 |
| 2004 | 2005 = 1 × 5 × 401 | 2.77 | 0.69 | 13.2 | 13.9 |
| 2006 | 2007 = 1 × 3² × 223 | 1.38 | 0.66 | 11.8 | 12.4 |
| 2008 | 2009 = 1 × 7² × 41 | 0.40 | 0.64 | 11.0 | 11.7 |
| 2010 | 2011 = 1 × 2011 | 30.2 | 1.28 | 40.2 | 41.5 |

Table 2
Time for each part of the FSTWL Algorithm with the new DST module and time for the DST computation (s).

| n | Type of DST | DST | Isolation | Extraction | Total |
|------|---------|------|------|------|------|
| 2000 | fftpack | 0.39 | 0.64 | 10.9 | 11.5 |
| 2002 | mkl | 0.90 | 0.66 | 11.8 | 12.5 |
| 2004 | mkl | 0.88 | 0.65 | 11.5 | 12.1 |
| 2006 | mkl | 0.87 | 0.65 | 11.5 | 12.1 |
| 2008 | fftpack | 0.40 | 0.64 | 11.0 | 11.6 |
| 2010 | mkl | 0.88 | 0.66 | 11.6 | 12.3 |

In some cases, other packages that provide more efficient routines can be used. In our case, we used Intel MKL, which contains a very efficient routine for the computation of the FFT. In addition, this routine computes the FFT quite rapidly for the cases where the decomposition in prime numbers is not favorable. Since there is no routine for the computation of the DST in the MKL, we used the MKL–FFT routine to compute the DST as follows [20]. Let $v$ be an $n$-array to be transformed by the DST. First, array $w$ is built as

$$w = \begin{pmatrix} 0 & v^T & 0 & -\tilde{v}^T \end{pmatrix}^T,$$

where $\tilde{v}$ is the array with the entries of $v$ in the reverse order. Then, the following is computed

$$y = \frac{i}{2} F_{2(n+1)} w,$$

where $F_{2(n+1)}$ is the matrix representing the FFT of order $2(n + 1)$ and $i$ is $\sqrt{-1}$. Thus, $y_{2:n+1}$ is the DST of array $v$.
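The construction of the DST from a Fourier transform of order $2(n+1)$ can be verified on a small array. The following sketch is only a check of the identity, under two stated assumptions: a naive $O(m^2)$ discrete Fourier transform stands in for the FFT routine, and the DST is taken in its unnormalized form $y_k = \sum_j v_j \sin(\pi j k/(n+1))$. The function names are hypothetical.

```python
import cmath
import math

def dst_direct(v):
    """Unnormalized DST computed from its definition."""
    n = len(v)
    return [sum(v[r] * math.sin(math.pi * (r + 1) * (k + 1) / (n + 1))
                for r in range(n)) for k in range(n)]

def dft(w):
    """Naive discrete Fourier transform (stand-in for an FFT routine)."""
    m = len(w)
    return [sum(w[r] * cmath.exp(-2j * cmath.pi * r * k / m) for r in range(m))
            for k in range(m)]

def dst_via_fourier(v):
    n = len(v)
    # w = (0, v, 0, -reversed(v)), an odd extension of length 2(n + 1)
    w = [0.0] + list(v) + [0.0] + [-x for x in reversed(v)]
    y = [0.5j * value for value in dft(w)]   # y = (i / 2) F w
    # Entries 2..n+1 in 1-based numbering are entries 1..n in 0-based numbering.
    return [y[k].real for k in range(1, n + 1)]

v = [1.0, -2.0, 0.5, 3.0, 4.0, -1.5]
direct = dst_direct(v)
fourier = dst_via_fourier(v)
print(max(abs(a - b) for a, b in zip(direct, fourier)))
```

The augmented array has length $2(n+1)$, roughly twice the length handled by a dedicated DST routine, which is the overhead discussed in the text that follows.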

The fastest way to compute the DST is not always the MKL, mainly because it uses an augmented array of size $2(n + 1)$ instead of $n$, which is the size used by the routines of fftpack. Therefore, we have developed a Fortran module that automatically selects the best routine to apply in each case. Our code provides a routine called dst, which only receives array $v$ as an argument. The first time that our routine is called, it computes the DST using both the fftpack and the MKL and then measures the time needed for each one of the two options, selecting the fastest one for the next calls. The initialization routines are also executed in this first call. The time spent checking both methods is worthwhile given that the computation of the DST will be intensively used throughout the rest of the process. Table 2 shows the same information as Table 1, where there is an evident reduction of time for these problematic sizes. In the new version, those cases in which the fftpack produced small times (2000 and 2008) are still computed with the fftpack since the time using MKL was 0.57 s and 0.50 s, respectively.
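The first-call selection strategy can be sketched independently of any particular library. The class `AutoDST` below is a hypothetical illustration in Python, not the Fortran module described above: on its first call it runs and times every candidate routine, remembers the fastest one, and dispatches all later calls to it. The two candidates are simple pure-Python DST variants chosen only so that the example is self-contained.

```python
import math
import time

def dst_direct(v):
    """Unnormalized DST, evaluating each sine on the fly."""
    n = len(v)
    return [sum(v[r] * math.sin(math.pi * (r + 1) * (k + 1) / (n + 1))
                for r in range(n)) for k in range(n)]

def dst_table(v):
    """Same transform, using a precomputed sine table of period 2(n + 1)."""
    n = len(v)
    period = 2 * (n + 1)
    table = [math.sin(math.pi * m / (n + 1)) for m in range(period)]
    return [sum(v[r] * table[((r + 1) * (k + 1)) % period] for r in range(n))
            for k in range(n)]

class AutoDST:
    """Times all candidates on the first call and keeps the fastest afterwards."""

    def __init__(self, candidates):
        self.candidates = candidates
        self.chosen = None

    def __call__(self, v):
        if self.chosen is not None:
            return self.candidates[self.chosen](v)
        best_time, result = None, None
        for name, routine in self.candidates.items():
            start = time.perf_counter()
            output = routine(v)
            elapsed = time.perf_counter() - start
            if best_time is None or elapsed < best_time:
                best_time, self.chosen, result = elapsed, name, output
        return result

dst = AutoDST({"direct": dst_direct, "table": dst_table})
v = [float(k % 7) for k in range(64)]
first = dst(v)    # first call: both candidates are timed
second = dst(v)   # later calls: only the selected candidate runs
print(dst.chosen)
```

Which candidate wins depends on the machine and on the problem size, so the printed name is not fixed; the point is that the decision is paid for once and reused for every later transform.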

This modification may well be extended to any other package containing an efficient routine to compute the DST (like fftw [8]) or to a package containing only an efficient routine for the FFT.

4.2. The solution of Toeplitz linear systems

The extraction process (steps 3–7 of Algorithm 3) requires the solution of a very large number of symmetric Toeplitz linear systems, each of which has the same system matrix but a different right-hand side vector. The version of the FSTWL Algorithm implemented in [21] translates $(T - \sigma I)$ to a Cauchy-like form and performs the triangular factorization [21]; then the “Shift-and-Invert” two-way Lanczos process uses the triangular factors to solve the triangular linear systems at each iteration (Algorithm 2).

A detailed analysis of the cost of each step shows that the most expensive step is the solution of the triangular linear systems. Column 3 in Table 3 shows the time spent to perform the triangular decomposition, and column 4 shows the time used in the solution of all the triangular systems that follow leading to the solution of the linear systems. The extraction time includes both times in columns 3 and 4 plus other computations. We use the DST computation described in Section 4.1 to obtain the results in Table 3 for a fair comparison with further improvements.

Table 3
Cauchy-like factorization and triangular system solution times for the FSTWL Algorithm (s).

| n | Isolation | Cauchy-like factorization | Triangular system solution | Extraction | Total |
|-------|------|------|-------|-------|-------|
| 2000 | 0.64 | 0.58 | 8.95 | 10.92 | 11.55 |
| 4000 | 3.11 | 4.40 | 76.29 | 86.40 | 89.51 |
| 6000 | 7.31 | 10.1 | 248.8 | 274.7 | 282.1 |
| 8000 | 13.8 | 19.0 | 561.2 | 610.6 | 624.4 |
| 10000 | 22.3 | 30.3 | 1067 | 1150 | 1172 |

Table 4
Computation of the first column of $T^{-1}$ and the linear systems solution by means of Gohberg–Semencul formulas (s).

| n | Isolation | First column of T⁻¹ | Triangular system solution | Extraction | Total |
|-------|------|------|------|------|------|
| 2000 | 0.64 | 1.14 | 0.54 | 2.81 | 3.45 |
| 4000 | 3.11 | 5.52 | 2.15 | 12.5 | 15.6 |
| 6000 | 7.31 | 12.8 | 6.94 | 33.0 | 40.3 |
| 8000 | 13.8 | 24.0 | 9.31 | 61.1 | 74.8 |
| 10000 | 22.3 | 38.4 | 24.7 | 115 | 138 |

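In this work, we propose changing the method by which the linear systems are solved. The new method is based on the fact that the inverse of a Toeplitz matrix is also a structured matrix, which is determined by the first column of $T^{-1}$. Let $x$ be the first column of $T^{-1}$, then an explicit form of $T^{-1}$ can be built from the first Gohberg–Semencul formula [10] if $x_0 \neq 0$:

$$T^{-1} = \frac{1}{x_0}
\begin{pmatrix} x_0 & & & \\ x_1 & \ddots & & \\ \vdots & \ddots & \ddots & \\ x_{n-1} & \cdots & x_1 & x_0 \end{pmatrix}
\begin{pmatrix} x_0 & x_1 & \cdots & x_{n-1} \\ & \ddots & \ddots & \vdots \\ & & \ddots & x_1 \\ & & & x_0 \end{pmatrix}
- \frac{1}{x_0}
\begin{pmatrix} 0 & & & \\ x_{n-1} & \ddots & & \\ \vdots & \ddots & \ddots & \\ x_1 & \cdots & x_{n-1} & 0 \end{pmatrix}
\begin{pmatrix} 0 & x_{n-1} & \cdots & x_1 \\ & \ddots & \ddots & \vdots \\ & & \ddots & x_{n-1} \\ & & & 0 \end{pmatrix}.$$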
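The first Gohberg–Semencul formula can be checked on a small matrix in exact arithmetic. The sketch below is an illustration under stated assumptions: the first column $x$ of $T^{-1}$ is obtained with a dense exact solver (a stand-in for the Cauchy-like solver), and the products with the triangular Toeplitz factors are evaluated directly in $O(n^2)$ operations rather than with FFT-based convolutions. All function names are hypothetical.

```python
from fractions import Fraction

def solve(a, b):
    """Exact Gaussian elimination on Fractions (dense stand-in)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = next(r for r in range(c, n) if m[r][c] != 0)
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def lower_times(col, vec):
    """Lower triangular Toeplitz matrix (first column col) times vec."""
    return [sum(col[i - j] * vec[j] for j in range(i + 1)) for i in range(len(vec))]

def upper_times(row, vec):
    """Transpose of that matrix (upper triangular, first row `row`) times vec."""
    n = len(vec)
    return [sum(row[j - i] * vec[j] for j in range(i, n)) for i in range(n)]

def gohberg_semencul_apply(x, b):
    """Evaluate T^{-1} b from the first column x of T^{-1} (requires x[0] != 0)."""
    shifted = [Fraction(0)] + x[:0:-1]   # (0, x_{n-1}, ..., x_1)
    first = lower_times(x, upper_times(x, b))
    second = lower_times(shifted, upper_times(shifted, b))
    return [(p - q) / x[0] for p, q in zip(first, second)]

t = [Fraction(v) for v in (4, 1, 2, 3)]
n = len(t)
T = [[t[abs(i - j)] for j in range(n)] for i in range(n)]
x = solve(T, [Fraction(1)] + [Fraction(0)] * (n - 1))   # first column of T^{-1}
b = [Fraction(v) for v in (1, -2, 0, 5)]
y = gohberg_semencul_apply(x, b)
residual = [sum(T[i][j] * y[j] for j in range(n)) - b[i] for i in range(n)]
print(residual)
```

Once $x$ is known, every further right-hand side costs four triangular Toeplitz products and no factorization, which is the property exploited in the rest of this section.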

In our new version of the FSTWL Algorithm, we use the translation of the Toeplitz matrix to a Cauchy-like matrix to solve the linear system that allows the first column of $T^{-1}$ to be obtained. The computation of this column is done in step 5 of Algorithm 3. Column 3 in Table 4 shows the time required to obtain this column. It is larger than the corresponding one in Table 3 because it includes the triangular decomposition and the solution of the two associated triangular linear systems.

For the solution of the following linear systems, we use the FFT (for the convolution) to perform the multiplication of $T^{-1}$ with the independent vector to solve the linear system. We implemented a Fortran 90 module that, given the first column of a lower triangular Toeplitz matrix and a vector to be multiplied, automatically selects the smallest power of two that is greater than $n$ in order to build the FFT that will be used to carry out this convolution. As for the computation of the DST, the first time the module is called, all the initialization operations are done only once and stored for the next calls.
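The product of a lower triangular Toeplitz matrix with a vector is the leading part of a linear convolution, so it can be evaluated with power-of-two FFTs. The sketch below is an illustration, not the Fortran module described above: it uses a textbook recursive radix-2 FFT, and it pads both arrays to the smallest power of two that is at least $2n - 1$, which is the length this particular sketch needs to avoid wraparound in a circular convolution. That padding rule and the function names are assumptions of the example.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; the length must be a power of two (unscaled inverse)."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def lower_toeplitz_times_fft(col, vec):
    """Lower triangular Toeplitz matrix (first column col) times vec, via FFT."""
    n = len(vec)
    size = 1
    while size < 2 * n - 1:      # padding length chosen for this sketch
        size *= 2
    fa = fft(list(col) + [0.0] * (size - n))
    fb = fft(list(vec) + [0.0] * (size - n))
    product = fft([p * q for p, q in zip(fa, fb)], inverse=True)
    return [product[i].real / size for i in range(n)]   # first n convolution terms

def lower_toeplitz_times_direct(col, vec):
    return [sum(col[i - j] * vec[j] for j in range(i + 1)) for i in range(len(vec))]

col = [1.0, 2.0, 3.0, 4.0, 5.0]
vec = [1.0, -1.0, 2.0, 0.0, 3.0]
fast = lower_toeplitz_times_fft(col, vec)
slow = lower_toeplitz_times_direct(col, vec)
print(max(abs(a - b) for a, b in zip(fast, slow)))
```

In a setting where the same triangular factor multiplies many vectors, the transform of the padded first column would be computed once and reused, in the same spirit as the stored initialization mentioned above.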

When the triangular system solution times shown in column 4 of Tables 3 and 4 are compared, it can be observed that a great reduction in time is achieved when the Gohberg–Semencul formula method is used to perform the extraction step.

4.3. Selection of the subintervals

The efficiency of the method proposed in [21], applied in both parallel and sequential, will depend on a good choice of the subintervals. If some of the subintervals have too many eigenvalues or are too wide, then many Lanczos iterations (and, possibly, many restarts) will be needed. Therefore, the selection of the subintervals is another important part of the algorithm.

Fig. 2. Execution time of the two main parts of the FSTWL Algorithm considering different values for the maximum number of eigenvalues per subinterval.

There are several factors that must be taken into account when determining the subintervals. Experimentally, it can be observed that it takes a few Lanczos iterations until the eigenvalues start to be extracted (usually a minimum of 5–7 iterations until the first eigenvalue converges). After that, new eigenvalues converge quite rapidly, especially due to the use of the symmetry-exploiting Lanczos technique. This favours subintervals with many eigenvalues.

On the other hand, Algorithm 2 has been implemented with full re-orthogonalization; this means that the computational cost grows after each iteration. Furthermore, to deal with difficult cases and with multiple eigenvalues, a maximum number of Lanczos iterations (maximum dimension of the Krylov space) is set. When this number of iterations is reached, the algorithm performs an explicit restart (the method starts again by re-orthogonalizing the starting vector with respect to the already converged eigenvectors). This mechanism gives robustness to the algorithm [7] but yields an important loss of efficiency.

These two factors indicate that the number of eigenvalues in each subinterval must not be too large, and that the width of the subinterval must be controlled as well, since very large subintervals with eigenvalues in the extremes would slow down the convergence. The suitable maximum number of eigenvalues per subinterval depends on these two factors, but it also depends on the matrix spectrum, the solvers and the hardware architecture.

We have studied the selection of the best value for the maximum number of eigenvalues per subinterval (MNEPS) in this environment. Fig. 2 shows the behaviour of each component of the total time when MNEPS = η. The isolation time decreases with η: as η grows, the number of subintervals decreases, so the time used to process these subintervals is small. The extraction time is the main contributor to the total time. With low values of η, the extraction process is not efficient. With larger values of η, the time to extract the eigenvalues of all the subintervals decreases. However, the Lanczos iteration becomes inefficient when a large number of eigenvalues must be extracted from a given subinterval.

Thus, there exists a range of values for which the extraction process minimizes the execution time. The shape of the curve is algorithm-dependent: it does not depend on the hardware/software environment and depends only slightly on the spectrum of the Toeplitz matrix. Nevertheless, the optimal value of η is difficult to tune since it depends on the target hardware, the solver, the matrix size and its spectrum. A fixed value can be chosen within this range (as was done in [21]) since similar behaviour can be expected among the values that fall into it. Otherwise, a more detailed tuning can be performed in order to obtain a more accurate value.

We have experimentally verified that the following approximation is quite close to the optimal η for a large range of Toeplitz matrices generated randomly:

η(n) = 0.02n + 18, min(20, n) ≤ η(n) ≤ 218.

The value η is a linear function of n up to 218, which has been verified as the maximum number of eigenvalues that can be efficiently computed. All the constants are machine-dependent and must be tuned experimentally.
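The rule can be written as a small helper, sketched here in Python. The function and parameter names are hypothetical, the default constants are the ones quoted above for the authors' machine, and truncating the result to an integer is an assumption of this sketch.

```python
def mneps(n, slope=0.02, offset=18, upper=218):
    """Approximate maximum number of eigenvalues per subinterval, eta(n)."""
    lower = min(20, n)
    eta = int(slope * n + offset)  # assumption: truncate to an integer
    return max(lower, min(upper, eta))
```

With these constants the linear term reaches the cap of 218 at n = 10 000, so `mneps(2000)` gives 58 while every n ≥ 10 000 gives 218.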

In the next section, we will discuss the parallelization of the algorithm that results from applying the changes just described to the FSTWL algorithm. In order to avoid confusion between the versions with and without the new changes, from now on we will refer to the FSTWL algorithm including the improvements described above as Modified Full Spectrum Two-Way Lanczos (MFSTWL).

5. Parallelization of the algorithm

It is clear that the procedure in each subinterval is independent of the other subintervals, so that this algorithm parallelizes trivially by just assigning different subintervals to different processes. The parallel algorithm proposed in [21] is based on this idea and implements a master–slave scheme. The master process sends intervals to the slaves so that they can extract the eigenvalues contained in them. However, there are more opportunities available for parallelization in the decomposition of Cauchy-like matrices and in the isolation step that we propose in the following subsections.

5.1. The parallelization method

The FSTWL Algorithm was parallelized in [21] following a distributed memory scheme using MPI. Since our aim is to fully exploit the availability of many cores in multicore boards, we first studied the use of a shared memory scheme through OpenMP so that each subinterval is processed in parallel by a single thread.

However, we verified that this approach is not suitable. The reason is that our parallelization is based on heavy processes using distributed memory. These processes are completely independent and perform a considerable amount of work, resulting in an algorithm that is completely partitioned into full tasks. A design that is oriented to threads should be based on a large number of light tasks so that the scheduler can map them at runtime onto threads, keeping cores as busy as possible. Moreover, the processes that compute the subintervals use dynamically allocated memory and as yet there are no standard tools that enable a thread to allocate memory at runtime. Only proprietary routines exist, but they are non-portable and difficult to use.

Table 5
Comparison between the decomposition of Cauchy-like matrices with and without OpenMP (OMP) in the MFSTWL Algorithm (s).

n        Isolation                   Comput. first column T^{-1}
         Without OMP   With OMP      Without OMP   With OMP
2000     0.64          0.39          1.14          0.66
4000     3.11          1.83          5.52          3.21
6000     7.31          4.17          12.8          7.45
8000     13.8          7.91          24.0          13.9
10000    22.3          13.3          38.4          22.2

Table 6
Execution times (s) and percentages for the isolation and extraction phases.

n        Isolation   % of total   Extraction   % of total   Total
2000     0.39        14.5         2.29         85.5         2.68
5000     2.67        12.2         19.2         87.8         21.9
10000    12.7        11.4         99.1         88.6         111.8
15000    40.5        15.2         226.7        84.8         267.1
20000    88.6        15.5         481.5        84.5         570.1

The best solution we have found consists of using an MPI paradigm. OpenMPI provides some useful features for tuning the application on a multicore board. Among others, OpenMPI defines processor and memory "affinity". Processor affinity allows an MPI process to bind to a specific processor or core. The goal is that, by fixing a particular affinity, the operating system will only allow that process to run on that processor or core. On multi-processor machines, this can help improve performance by preventing the operating system from moving processes between processors. This is accomplished by using a rankfile to explicitly specify the process–processor (or process–core) binding, as we show below. In addition, by combining MPI processes with a suitable manual mapping, we get a parallel algorithm that is able to exploit all the existing cores wherever they are located (inside or outside the node).

5.2. Parallel decomposition of symmetric Cauchy-like matrices

Parallel algorithms in shared [18], distributed [1], and hybrid architectures [3] have been widely explored to perform the triangular decomposition of a symmetric Cauchy-like matrix. In this paper, we use OpenMP to compute the factorization of submatrices C_0 and C_1 concurrently. Cauchy-like matrices arise in the solution of the linear system (T − σI)x = e_1 to obtain the first column of matrix (T − σI)^{-1} and in the isolation process (steps 1 and 2). In both cases, the MPI process can optionally spawn two OpenMP threads to perform this computation in parallel.

Table 5 shows the time of the MFSTWL Algorithm for different problem sizes, analyzing the use of OpenMP in the decomposition of the Cauchy-like matrices. This decomposition arises in the isolation process and in the solution of the linear system leading to the computation of the first column of the inverse of a Toeplitz matrix.

5.3. The parallelization of the isolation step

As shown in Table 6, the isolation process accounts for a significant part of the total time, so, if it is carried out sequentially, the scalability will clearly be limited. There are three options to implement the parallel algorithm that differ in the way that the isolation process is carried out.

Sequential isolation: The first option is a straightforward implementation in which the isolation process is carried out by the master process. This option just allows us to reduce time in the extraction procedure, which represents more than 80% of the total time (Table 6). However, as the number of cores increases, the weight of the first stage becomes dominant, penalizing the performance of the algorithm.

Parallel isolation by interval: In the second option, the algorithm divides the whole interval into a number of subintervals that is equal to the number of processes. Then, each process splits its subinterval into more subintervals containing no more than the maximum number of eigenvalues defined at the start of the algorithm (MNEPS = η). The master gathers all the computed subintervals generated by the slaves and starts the extraction step. This was the option used in [21].

Parallel isolation by eigenvalues: Here we propose a third option which also parallelizes the isolation process as in the former case. The Parallel isolation by interval technique divides the interval containing all the eigenvalues into sp equal segments, with sp being the number of slave processes. Since the spectrum can be very unevenly distributed, this technique produces intervals with very different numbers of eigenvalues, leading to a very unbalanced workload in the subsequent isolation step performed by each process. A second source of imbalance is the total number of subintervals generated in the isolation process, which is usually larger than the number of slave processes.

The technique proposed is based on a master–slave synchronization of processes where the master distributes the isolation tasks among the processes by sending one subinterval to each process, in an attempt to give all of the subintervals approximately the same number of eigenvalues. The isolation algorithm is then divided into two main steps (Algorithm 4). The first is carried out only by the master and consists of a sequential isolation process to obtain subintervals with ≈ n/sp eigenvalues, that is, using MNEPS = n/sp in the isolation algorithm. In the second step, each slave process isolates the subintervals using MNEPS = η.

Algorithm 4. Parallel Isolation Algorithm by eigenvalues.

Master:
    Performs a sequential isolation with MNEPS = n/sp obtaining a set of m subintervals.
    i := 0 (Subinterval index)
    alive_slaves := sp
    do while alive_slaves > 0
        Waits for any Slave message.
        Receives a message from Slave_s.
        if ( not empty message ) then
            Store the subintervals of the message.
        end if
        subinterval_sent := false
        do while ( subinterval_sent = false and i < m )
            if ( eigs(subinterval_i) < η ) then
                Store subinterval_i.
                i := i + 1
            else
                Sends subinterval_i to Slave_s.
                subinterval_sent := true
            end if
        end do
        if ( i = m ) then
            Send a stopping message to Slave_s.
            alive_slaves := alive_slaves − 1
        end if
    end do

Slave:
    Sends an empty message to the Master.
    continue := true
    do while ( continue = true )
        Waits for a Master message.
        Receives an empty message or a subinterval sub.
        if ( empty message ) then
            continue := false
        else
            Performs a sequential isolation of sub with MNEPS = η.
            Sends the set of subintervals obtained to the Master.
        end if
    end do

Table 7
Comparison of different strategies of isolation in a multicore board (8 cores) for n = 20 000 (s). 9 MPI processes are used and no OpenMP.

Routine                       Isolation   Extraction   Total
Sequential                    88.5        200.6        289.1
Parallel, by interval         85.8        196.3        282.1
Parallel, by eigenvalues      42.5        200.9        243.3

In Algorithm 4, the master is driven by the slaves' messages. The process is started by the slaves sending an empty message to the master. We use empty messages as control messages for the sake of simplicity in the algorithm. The master works on each message received in a different way. Once a message is received from a slave, the master stores the subintervals received in the message and sends a new subinterval to the slave if the current interval (subinterval_i) contains more than η eigenvalues. If the current subinterval (subinterval_i) that the master is trying to send to the slave contains few eigenvalues (<η), then it stores the subinterval and operates with the following one (subinterval_{i+1}).

When all the m subintervals isolated in the first step have been sent, the master sends empty messages to stop the slaves. All the intervals after the isolation process contain ≤η eigenvalues and are kept by the master, ready to be processed by the following extraction stage.
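The message-driven behaviour of the master can be mimicked sequentially to check the bookkeeping. The following Python sketch is only a simulation under several assumptions: subintervals are represented by their eigenvalue counts, MPI messages are replaced by a queue, the helper `split_count` is a hypothetical stand-in for the isolation performed by a slave, and the subinterval index is advanced after a subinterval is either stored or sent.

```python
from collections import deque


def split_count(count, eta):
    """Hypothetical stand-in for a slave's isolation: chunks of at most eta."""
    full, rest = divmod(count, eta)
    return [eta] * full + ([rest] if rest else [])


def isolate_by_eigenvalues(first_step, eta, n_slaves):
    """Simulate the master loop; returns (stored subintervals, jobs per slave)."""
    stored = []  # subintervals kept by the master
    sent_to = {s: 0 for s in range(n_slaves)}
    # Every slave starts by sending an empty message to the master.
    inbox = deque((s, []) for s in range(n_slaves))
    i, alive = 0, n_slaves
    while alive > 0:
        slave, payload = inbox.popleft()
        stored.extend(payload)  # nothing to store for an empty message
        sent = False
        while not sent and i < len(first_step):
            if first_step[i] < eta:
                stored.append(first_step[i])
            else:
                # The slave isolates this subinterval and reports back later.
                inbox.append((slave, split_count(first_step[i], eta)))
                sent_to[slave] += 1
                sent = True
            i += 1  # assumption: advance after storing or sending
        if not sent:
            alive -= 1  # stopping message: no work left for this slave
    return stored, sent_to
```

For example, `isolate_by_eigenvalues([10, 250, 30, 400], 100, 2)` stores the two small subintervals directly and hands one large subinterval to each slave; every stored subinterval then holds at most 100 eigenvalues and the total count of 690 is preserved.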

We tested the three options for a large problem size (n = 20 000) to observe the impact of each solution (Table 7). The experiments used 9 MPI processes (no OpenMP): one master and 8 slaves. The worst performing option is, as expected, sequential isolation. The parallel isolation based on the distribution of the main interval runs slightly better than the sequential isolation but is still inefficient due to the unbalanced distribution of the eigenvalues among the subintervals. The weight of the isolation process in both cases is ≈30% when the number of slaves is 8. The isolation time of 42.5 s for the third isolation option is made up of a master isolation process that takes 13.2 s and the distributed isolation process, which accounts for the rest of the time. It can be observed that this is about half the time of the other two isolation options.
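The weights quoted above follow directly from Table 7. The short Python sketch below, with hypothetical variable names, recomputes the share of the isolation phase for each strategy from the tabulated times.

```python
# (isolation, extraction, total) in seconds, copied from Table 7
table7 = {
    "sequential": (88.5, 200.6, 289.1),
    "by interval": (85.8, 196.3, 282.1),
    "by eigenvalues": (42.5, 200.9, 243.3),
}


def isolation_share(row):
    """Fraction of the total time spent in the isolation phase."""
    isolation, _extraction, total = row
    return isolation / total


shares = {name: round(100 * isolation_share(row), 1) for name, row in table7.items()}
```

The shares come out at roughly 30.6% and 30.4% for the first two strategies and about 17.5% for the isolation by eigenvalues.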

By dividing the interval into larger subintervals, we have reduced the sequential isolation time from 88.5 s to 13.2 s, since these are the times required to isolate the main interval in subintervals of size η and n/sp, respectively. The difference, 42.5 − 13.2 = 29.3 s, is the time used by the slaves to isolate the other subintervals. This time is probably larger than would be expected of a parallel process. Indeed, the method still lacks workload balance since the number of eigenvalues per interval and the number of intervals are problem-dependent and difficult to predict. Nevertheless, the most important advance is the reduction of time achieved in the isolation process, which improves the scalability of the algorithm in comparison with the other methods.

6. Execution of the algorithm

In this section we describe the execution of the parallel MFSTWL Algorithm proposed. The algorithm can spawn two OpenMP threads for decomposing Cauchy-like matrices as needed in the new version of Algorithm 3. It also includes the parallel isolation algorithm by eigenvalues, carried out by MPI processes in a master–slave scheme. Here, we use the former master–slave scheme to perform the extraction process. The parallel algorithm used in this section uses all the improvements proposed in Section 4.

Current hardware architectures and software allow the application to run in different ways, for example, by combining the number of MPI processes and the number of threads to exploit the cores in a computer. In the following, we explore these different possibilities in order to select the best combination.

Table 8
Different combinations to exploit the 8 cores in a board (n = 20 000).

MPI processes   OpenMP   MKL          Mapping   Time
9               No       Sequential   Autom.    264.2
9               No       Parallel     Manual    263.1
9               No       Sequential   Manual    263.0
9               No       Parallel     Autom.    261.9
5               Yes      Sequential   Autom.    260.2
5               Yes      Parallel     Autom.    259.3
5               Yes      Parallel     Manual    245.7
5               Yes      Sequential   Manual    243.4

6.1. Tuning the parallel algorithm

In order to obtain the best performance on a multicore board, we need to compare the different existing options when the parallel algorithm is run. Table 8 shows the total execution time of the parallel algorithm with different combinations. The rows of the table are ordered by time in decreasing order. The issue is how to exploit the 8 cores contained in a board.

The first study concerns the relationship between the number of MPI processes and the use of OpenMP in the decomposition of Cauchy-like matrices (denoted as OpenMP in the tables). The algorithm now intensively uses OpenMP in the isolation phase to decompose the Cauchy-like matrices. There is a great difference between using OpenMP and not using it. For example, the isolation time for n = 5000 is 4.63 s without OpenMP and 2.67 s with OpenMP (the times in Table 6 were obtained by using OpenMP).

Although they are not shown in Table 8, our tests indicate that it is pointless to use more threads than cores, that is, the number of slave MPI processes multiplied by two should not be larger than the number of cores. The parallel algorithm uses one master and 8 slaves (9 MPI processes) if OpenMP is not used, and one master and 4 slaves (5 MPI processes) if OpenMP is used. The results clearly show that the hybrid combination MPI+OpenMP better exploits the multicore environment in all the cases.

Another feature that we focused on is related to the use of the parallel library of Intel. The Intel MKL provides a subset of BLAS and LAPACK routines that are parallelized (threaded) using OpenMP. We assess the impact of using the non-threaded routines (MKL sequential in the table) and the threaded ones (parallel) by linking our application to the suitable libraries.

As expected, the use of threaded routines has no great impact. The threaded routines of LAPACK called by the application (i.e., the ones to solve a symmetric eigenproblem) work on matrices that are too small to achieve a significant speed up. In cases where OpenMP is not used, it is better to use the parallel MKL even though the improvement is small. When OpenMP is used, it is not so clear that the parallel MKL improves the performance.

To complete the explanation of Table 8, we need to include a consideration about the MPI process mapping. Current versions of OpenMPI allow us to explicitly specify a process–core binding on multicore architectures. Together with the traditional machine file that allows processes to be mapped onto nodes in a cluster, it is possible to specify the combination of slots/cores where the MPI processes will be mapped inside a multicore node. This is done through a rankfile. In order to obtain the minimum possible running time, we tested different manual mapping options by changing the values in the rankfile.

The smallest times for 5 MPI processes were obtained by mapping one MPI process on each core and leaving the master MPI process free. The processes with ranks 1 to 4 in the rankfile are the slaves. All of them are mapped on the same node. Processes with ranks 1 and 2 are mapped onto slot (physical processor) 0 and processes with ranks 3 and 4 are mapped onto slot (physical processor) 1. Each slot has a total of 4 cores, so each MPI slave process will use a pair of cores of the given physical processor. Thus, each one of the two threads that are generated for the decomposition of the Cauchy-like matrices when using OpenMP will use one core exclusively.

Table 9
Time for dsyevd and the Parallel MFSTWL Algorithm with different number of threads (s).

                  n        1 core   2 cores   4 cores   8 cores
dsyevd            2000     6.70     3.47      1.86      1.54
                  5000     98.26    57.8      43.3      39.3
                  10000    757.2    449.2     347.2     321.8
                  15000    2502     1813      1151      1076
                  20000    5917     4248      2704      2515
Parallel MFSTWL   2000     3.47     2.52      2.26      1.58
                  5000     27.18    19.51     10.3      8.43
                  10000    138.1    98.76     51.8      44.9
                  15000    339.1    228.8     120       109.3
                  20000    752.0    505.8     314       243.4

The master (MPI process with rank 0) is not mapped manually. Our experiments showed that it is better to let the OpenMPI runtime scheduler do the mapping of the master dynamically instead of a manual static mapping.

The manual mapping for the case with 9 processes is a straightforward process–core mapping.
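The 5-process mapping described above can be expressed as a small table of rank, slot and cores. The Python sketch below only generates that table; the function name, its parameters and the numbering of the cores inside each slot are assumptions of this illustration, and translating the result into an actual rankfile is left to the OpenMPI documentation.

```python
def slave_core_map(n_slaves=4, sockets=2, cores_per_socket=4, threads_per_slave=2):
    """Return {rank: (slot, [cores])} for the slave ranks 1..n_slaves."""
    per_socket = cores_per_socket // threads_per_slave  # slaves that fit in a slot
    if n_slaves > sockets * per_socket:
        raise ValueError("more threads than cores")
    mapping = {}
    for rank in range(1, n_slaves + 1):
        index = rank - 1
        slot = index // per_socket
        first = (index % per_socket) * threads_per_slave
        mapping[rank] = (slot, list(range(first, first + threads_per_slave)))
    return mapping
```

With the default arguments, ranks 1 and 2 land on slot 0 and ranks 3 and 4 on slot 1, each with two cores of its own, while the master (rank 0) is deliberately left out so that the runtime can place it.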

Table 8 shows times with manual and automatic (no rankfile defined) mappings. There is no clear difference between manual and automatic mapping regarding the use of MKL parallel or sequential libraries in the 9-process case. In fact, it is pointless to use 9 MPI processes and parallel MKL if there are not enough cores for each thread. In such cases, the automatic mapping seems to obtain a slightly better performance.

If the user does not care about a precise mapping, the best option is an automatic scheduling. The program creates a set of threads and the scheduler tries to find free resources for each one. The drawback of this approach is that the processes can frequently migrate between cores, which penalizes the performance.

However, things improve if the implementation takes into account the correct use of the resources inside a multicore board. A smaller time is achieved when a resource is exclusively kept for the thread or process that is going to use it (except for the master process). This is why, in the best option, two cores per process are used; each process uses the two cores for the decomposition of Cauchy-like matrices by using OpenMP and the processes are manually mapped to provide locality. In this case, the use of the parallel MKL is even counterproductive. This conclusion can presumably be extended to the case of more than 8 cores per board.

6.2. Performance of the parallel algorithm

For the sake of comparison, we tested our results against the routine dsyevd of LAPACK. This routine computes the eigenvalues of non-structured dense symmetric matrices. The MKL version of the routine is threaded, that is, the MKL provides a parallel version of dsyevd that can be run with a different number of threads as shown in Table 9. Currently, it is the best routine that we have within reach to compute the eigenvalues of a symmetric Toeplitz matrix with a different number of cores.

The results in Table 9 of the routine dsyevd and the Parallel MFSTWL Algorithm can be compared elementwise. We use the concept of core instead of thread as defined in Section 3 here since the goal in each column is just to use the number of cores denoted in the heading; how this is achieved depends on which method the application uses to achieve parallelism. For routine dsyevd, the number of cores used is the same as the number of threads chosen. For the algorithm proposed in this paper, we use the sequential version of the algorithm without OpenMP in the "1 core" column. For the 2-cores case, the smallest time is obtained with the sequential version with OpenMP. To exploit a 4-cores configuration, the parallel algorithm with 2 MPI slave processes and OpenMP is the best option. Four MPI processes using OpenMP are used to exploit all the 8 cores in a computer.

Table 10
Time for pdsyevd and the Parallel MFSTWL Algorithm (s).

n        pdsyevd   Parallel MFSTWL
10000    222.6     17.7
15000    650.3     41.2
20000    1436      82.1

All times with the algorithm are smaller than the ones obtained with routine dsyevd, except for n = 2000 using more than 2 cores. The overhead induced by message passing makes our algorithm unsuitable for small problem sizes and many cores.

Table 10 shows the time spent to compute the eigenvalues for different problem sizes using the 32 cores of the cluster (Fig. 1). The parallel algorithm used in this case is launched with 16 slave MPI processes, each of which spawns 2 OpenMP threads, and each computer holds 4 slaves. The master is mapped onto the first node since the front end is slower than the nodes. The isolation phase for these cases ranges between 38% and 44% of the total time, reducing the scalability of the algorithm. Even so, the total time is more than one order of magnitude smaller than the time obtained with the ScaLAPACK routine pdsyevd (the parallel counterpart of dsyevd). The ScaLAPACK routine uses four MPI processes arranged in a 2×2 grid; one process is mapped per computer and uses the 8 cores by calling the threaded routine dsyevd of the MKL.
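The comparisons in this subsection can be reproduced from the tabulated times. The Python sketch below, with hypothetical variable names, computes for n = 20 000 the speedup of the Parallel MFSTWL Algorithm with respect to its own 1-core time and its advantage over dsyevd (Table 9), together with the ratio between pdsyevd and the Parallel MFSTWL Algorithm on 32 cores (Table 10).

```python
# Times in seconds for n = 20000 on 1, 2, 4 and 8 cores, copied from Table 9
dsyevd = (5917, 4248, 2704, 2515)
mfstwl = (752.0, 505.8, 314, 243.4)

# Speedup of MFSTWL with respect to its own 1-core time
self_speedup = [round(mfstwl[0] / t, 2) for t in mfstwl]
# How many times faster MFSTWL is than dsyevd on the same number of cores
advantage = [round(d / m, 1) for d, m in zip(dsyevd, mfstwl)]

# n -> (pdsyevd, Parallel MFSTWL) in seconds on 32 cores, copied from Table 10
table10 = {10000: (222.6, 17.7), 15000: (650.3, 41.2), 20000: (1436, 82.1)}
cluster_ratio = {n: round(p / m, 1) for n, (p, m) in table10.items()}
```

On 8 cores the self-speedup is about 3.1 while the advantage over dsyevd is about 10.3; on the cluster the ratio grows from roughly 12.6 to 17.5 as n increases, consistent with the "more than one order of magnitude" observed above.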

7. Conclusions

Current architectures increase the performance by adding more and more computational resources. The more cores a computer has, the more powerful it will be. However, algorithms that aim to obtain the maximum benefit from a computer have to exploit all these resources, which keep increasing in number. These algorithms must therefore be essentially parallel. Furthermore, the hierarchical nature of computational resources and memories requires the use of hybrid schemes that combine processes that communicate data by message passing and concurrent execution threads such as the ones that we use in this work. Due to the complexity of modern computational environments based on multicore computers, a good implementation is not enough. The way in which an MPI–OpenMP application uses the available computational resources (i.e., the binding of the processes/threads to cores) is also important in order to fully exploit this type of hardware.

The mathematical problems that can be formulated by Toeplitz-type matrices have been extensively studied in the past because of their applicability to engineering and because of the opportunities offered by their special structure. This structure allows fast algorithms to be obtained to solve these mathematical problems. Nevertheless, less attention has been paid to the parallelization of Toeplitz methods. Traditional partitioning techniques used to parallelize algorithms working on non-structured matrices cannot be applied without losing the Toeplitz structure and, consequently, losing the "fast" condition of the algorithms. The approach to a parallel solver has more similarity with solvers involving sparse matrices because these solvers access the matrix through a matrix–vector product and they are efficient as long as this operation is efficient.

We have selected different ideas from the broad experience that has been gained in the fast solution of Toeplitz problems in order to build an algorithm that fits the requirements of modern architectures. This algorithm aims to exploit all the existing computational resources, those contained in a single computer and those existing outside in other computers within a cluster. We have also compared the algorithm with others that also fit the same requirements, specifically, the routines of LAPACK and ScaLAPACK that are implemented in the Intel MKL, one of the most current and competitive implementations of these linear algebra libraries. The parallel algorithm presented in [21] to solve the symmetric Toeplitz eigenvalue problem in a distributed memory architecture has now been improved in this work. The algorithm presented here exploits the concurrent computational resources and runs faster than the standard threaded LAPACK routine for large problem sizes.

References

[1] Pedro Alonso, José M. Badía, Antonio M. Vidal, Parallel algorithms for the solution of Toeplitz systems of linear equations, Lect. Notes Comput. Sci. 3019 (2004) 969–976.

[2] José M. Badía, Antonio M. Vidal, Parallel algorithms to compute the eigenvalues and eigenvectors of symmetric Toeplitz matrices, Parallel Algorithms Appl. 1 (13) (1998) 75–93.

[3] Miguel O. Bernabeu, Pedro Alonso, Antonio M. Vidal, A multilevel parallel algorithm to solve symmetric Toeplitz linear systems, J. Supercomput. 44 (April) (2008) 237–256.

[4] James R. Bunch, Stability of methods for solving Toeplitz systems of equations, SIAM J. Sci. Stat. Comput. 6 (2) (1985) 349–364.

[5] L. Dagum, R. Menon, OpenMP: an industry-standard API for shared-memory programming, IEEE Comput. Sci. Eng. 5 (1) (1998) 46–55.

[6] P. Delsarte, Y. Genin, Spectral properties of finite Toeplitz matrices, in: Proceedings of the MTNS-83 International Symposium, Lecture Notes in Control and Information Sciences, vol. 58, Springer, Berlin, Heidelberg, 1984, pp. 194–213.

[7] James Demmel, Jack Dongarra, Axel Ruhe, Henk van der Vorst, Templates for the Solution of Algebraic Eigenvalue Problems: A Practical Guide, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.

[8] Matteo Frigo, Steven G. Johnson, The design and implementation of FFTW3, in: Program Generation, Optimization, and Platform Adaptation, Proc. IEEE 93 (2) (2005) 216–231 (Special issue).

[9] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Dongarra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett, Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham, Timothy S. Woodall, Open MPI: goals, concept, and design of a next generation MPI implementation, in: Proceedings, 11th European PVM/MPI Users' Group Meeting, Budapest, Hungary, September 2004, pp. 97–104.

[10] I. Gohberg, A. Semencul, On the inversion of finite Toeplitz matrices and their continuous analogs, Mat. Issled. 2 (1972) 201–233.

[11] Gene H. Golub, Charles F. Van Loan, Matrix Computations, third ed., in: Johns Hopkins Studies in the Mathematical Sciences, The Johns Hopkins University Press, Baltimore, MD, USA, 1996.

[12] Roger G. Grimes, John G. Lewis, Horst D. Simon, A shifted block Lanczos algorithm for solving sparse symmetric generalized eigenproblems, SIAM J. Matrix Anal. Appl. 15 (1) (1994) 228–272.

[13] Georg Heinig, Inversion of generalized Cauchy matrices and other classes of structured matrices, Linear Algebra and Signal Proc. IMA, Math. Appl. 69 (1994) 95–114.

[14] Intel Co. Intel Math Kernel Library: Manual Reference, August 2008.

[15] T. Kailath, A.H. Sayed (Eds.), Fast Reliable Algorithms for Matrices with Structure, SIAM, Philadelphia, PA, 1999.

[16] Message P Forum. MPI: a message-passing interface standard, Technical Report, Knoxville, TN, USA, 1994.

[17] P.N. Swarztrauber, Vectorizing the FFT's, Academic Press, New York, 1982.

[18] S. Thirumalai, High performance algorithms to solve Toeplitz and block Toeplitz systems, Ph.D. Thesis, Graduate College of the University of Illinois at Urbana-Champaign, 1996.

[19] William F. Trench, Numerical solution of the eigenvalue problem for Hermitian Toeplitz matrices, SIAM J. Matrix Anal. Appl. 10 (2) (1989) 135–146.

[20] C. Van Loan, Computational Frameworks for the Fast Fourier Transform, SIAM Press, Philadelphia, 1992.

[21] Antonio M. Vidal, Víctor M. García, Pedro Alonso, Miguel Oscar Bernabeu, Parallel computation of the eigenvalues of symmetric Toeplitz matrices through iterative methods, J. Parallel Distrib. Comput. 68 (8) (2008) 1113–1121.

[22] Christof Vömel, ScaLAPACK's MRRR algorithm, ACM Trans. Math. Softw. 37 (1) (2010) 1–35.

[23] H. Voss, A symmetry exploiting Lanczos method for symmetric Toeplitz matrices, Numer. Algorithms 25 (2000) 377–385.

[24] J.H. Wilkinson, The Algebraic Eigenvalue Problem, Oxford University Press, 1965.

[25] Hong Zhang, Barry Smith, Michael Sternberg, Peter Zapol, SIPs: shift-and-invert parallel spectral transformations, ACM Trans. Math. Softw. 33 (2) (2007) 9.


Pedro Alonso was born in Valencia, Spain, in 1968. He received the Engineer degree in Computer Science from the Universidad Politecnica de Valencia, Spain, in 1994 and the Ph.D. degree from the same University in 2003. His dissertation was on the design of parallel algorithms for structured matrices with application in several fields of digital signal analysis.

Since 1996 he has been a senior lecturer in the Department of Computer Science of the Universidad Politecnica de Valencia and he is a member of the High Performance Networking and Computing Research Group of the Universidad Politecnica de Valencia. His main areas of interest include parallel computing for the solution of structured matrices with applications in digital signal processing.

Miguel O. Bernabéu received his Engineer degree in computer science from the Universidad Politécnica de Valencia, Spain, in 2005.

He was a Research Fellow with the Universidad Politécnica de Valencia from 2004 through 2007. He is currently a Research Assistant with the Computing Laboratory of the University of Oxford, UK. His research interests include parallel computing and numerical linear algebra and its applications to signal processing and cardiac modeling and simulation.

Víctor M. García obtained a degree in Mathematics and Computer Science (Universidad Complutense, Madrid) in 1991, later an M.Sc. degree in Industrial Mathematics (University of Strathclyde, Glasgow) in 1992 and a Ph.D. degree in Mathematics (Universidad Politécnica de Valencia) in 1998. He is a senior lecturer in the Universidad Politécnica de Valencia, and his areas of interest are Numerical Computing, parallel numerical methods and applications.

Antonio M. Vidal received his M.S. degree in Physics from the "Universidad de Valencia", Spain, in 1972, and his Ph.D. degree in Computer Science from the "Universidad Politécnica de Valencia", Spain, in 1990. Since 1992 he has been in the Universidad Politécnica de Valencia, Spain, where he is currently a full professor in the Department of Computer Science. He is the coordinator of the project "High Performance Computing on Current Architectures for Problems of Multiple Signal Processing", developed by INCO2 Group and financed by the Generalitat Valenciana, in the frame of PROMETEO Program for research groups of excellence. His main areas of interest include parallel computing with applications in numerical linear algebra and signal processing.