
Computing Systems in Engineering Vol. 2, No. 2/3, pp. 277-287, 1991 0956-0521/91 $3.00 + 0.00 Printed in Great Britain. © 1991 Pergamon Press plc

IMPACT OF MAPPING AND SPARSITY ON PARALLELIZED FINITE ELEMENT METHOD MODULES

J. RODRIGUEZ and M. O'SULLIVAN
Department of Mechanical and Aerospace Engineering and CAIP Center, Rutgers University, Piscataway, NJ 08855-0909, U.S.A.

(Received 14 December 1990)

Abstract--The efficient parallelization of a flexible h-version finite element method (FEM) algorithm is investigated. The actual implementation and benchmarking are performed on a distributed memory, multiple instruction multiple data (MIMD) machine (NCUBE). After an initial modularization and benchmarking, three specific parameters that could influence the efficiency of a concurrent finite element analysis (FEA) solver are identified: (i) sparsity of matrices, (ii) storage format for the stiffness matrix, and (iii) mapping of the domain onto processors. These factors are interdependent and they heavily influence the overall efficiency of the implemented FE code. This study includes a performance evaluation of the FE modules using the techniques outlined below. The results demonstrate the importance of these parameters, and the advantages of exploiting different levels of concurrency in a parallel FEA code. Three direct methods of solving systems of linear equations are analyzed: a Block Cholesky method exhibiting coarse grain concurrency, and Gauss Elimination and Cholesky Decomposition methods with inherent medium grain parallelism. Additionally, work on two Assembly/Storage Modules (One-Way Dissection and Shifted Banded), and their corresponding simple Domain Decomposers (Vertical Strip and Wrap Mapping), is performed.

INTRODUCTION

Due to the inexpensive computational power they offer, parallel computers have been applied in numerous engineering and business applications since their introduction to the scientific market. Many practical applications of parallel processing techniques have been demonstrated as well.1,2 Here at the Center for Computer Aids for Industrial Productivity (CAIP Center) at Rutgers University, an industry-university consortium, we are interested in applying state-of-the-art technology to the design/production process. One of the ongoing projects is the development and implementation of FEA codes on parallel processing machines. This project has a two-fold objective: (a) to provide a flexible analysis tool for structural analysis, and (b) to develop a library of general purpose subroutines. This paper investigates various aspects of the application of parallel processing techniques to solid mechanics problems, in order to establish guidelines for attaining a highly efficient implementation.

The target multiprocessor for this investigation is the NCUBE 3200 hypercube machine. This machine may contain up to 1024 processors, each with a local memory of 512 Kbytes. The NCUBE multiprocessor belongs to the MIMD (multiple instruction, multiple data) classification of computer architecture. The autonomous nature of the processors allows efficient exploitation of both coarse and medium grain parallelism. When processors operate on independent tasks at the subroutine level, it is called coarse grain concurrency. Medium grain parallelism is exploited when independent operations within loops are assigned to different processors. The processors communicate by passing messages through a hypercube interconnection scheme. For a specific hypercube size n, there are 2^n processors, each connected to n neighboring nodes whose binary identification numbers vary by only one bit. This processor configuration allows flexibility in node allocation and in emulation of other topologies such as rings, meshes, and trees.
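To make the hypercube connectivity concrete, the short Python sketch below (added here for illustration, not part of the original paper) lists the neighbors of a node, whose IDs differ from it by exactly one bit, and generates the binary-reflected Gray code used later for wrap mapping; the function names are ours.

    # Illustrative sketch (not from the paper): hypercube connectivity.
    # Nodes whose binary IDs differ in exactly one bit are directly connected.

    def hypercube_neighbors(node, n):
        """Return the n neighbors of 'node' in an n-dimensional hypercube."""
        return [node ^ (1 << bit) for bit in range(n)]

    def gray_code(i):
        """Binary-reflected Gray code; consecutive codes differ in one bit."""
        return i ^ (i >> 1)

    print(hypercube_neighbors(5, 3))          # [4, 7, 1]: IDs 100, 111, 001
    print([gray_code(i) for i in range(8)])   # [0, 1, 3, 2, 6, 7, 5, 4]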

A complete FEA consists of three stages: (a) Pre-processing, (b) Analysis, and (c) Post-processing. This paper focuses on the Analysis phase, which consists of the application of global equilibrium conditions (Assembler Module) to the discretized solution domain (Decomposer Module) of the given problem, and the solution of the resulting set of equations (Solver Module),

[K]{d} = {F},

where [K] is the stiffness matrix, {d} the vector of system displacements, and {F} the vector/matrix of external forces.

Problems with a large number of degrees of freedom (dofs) result in large matrices. Solving these problems requires a great deal of both storage and CPU time. Such problems are potential candidates for parallel processing. Using two different parameters of the stiffness matrix [K] to represent the problem size, (a) NSIZE, the total number of dofs, and (b) MBAND, the semibandwidth, a serial code was benchmarked. The results clearly show the importance of studying the Solver Module in order to obtain an efficient parallelization in terms of computational requirements (i.e. CPU time and storage). The Solver Module comprises up to 85% of the total CPU time (Fig. 1), while the Assembler Module generally takes only one-eighth of the Solver CPU time. Thus, the parallelization of the Solver is the most critical aspect of our development.

Fig. 1. Normalized CPU times for different problem sizes in a sequential FEA code.

MAPPING AND SPARSITY

Even though the Solver Module is the most critical one, the data dependencies of the Solver and the Assembler must be considered to develop guidelines based on all the modules linked together. After the initial analysis of those data dependencies, the following parameters were found to have an effect on the modules' performance: matrix sparsity, storage allocation, and mapping onto the processors. Their effect on the inherent concurrency of the Analysis phase, as well as the inter-dependencies between the phases, are addressed in this investigation.

An FEA code assembles a banded, symmetric, sparse, positive definite stiffness matrix [K], with average density (number of non-zero entries/total number of entries) varying from 3 to 10%.3 A matrix is considered sparse if there are considerably more zero terms than non-zero terms in the matrix.4 The sparsity is exploited by avoiding unnecessary operations on those null terms, as well as by avoiding storage of extraneous data. For banded symmetric systems, the sparsity is gauged by the semibandwidth parameter MBAND, whose value is equal to half the bandwidth plus one. In a typical FE problem, this parameter is usually less than 10% of NSIZE. Thus, tremendous savings in CPU time and memory may be realized by adapting FE codes for sparsity.
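As an illustration of these two sparsity measures (a sketch added here, not the authors' code), the snippet below computes the density and a semibandwidth MBAND for a symmetric matrix held in dense storage, using one common convention consistent with the text: the largest |i - j| over the non-zero entries, plus one.

    # Illustrative sketch (ours): sparsity measures for a symmetric stiffness
    # matrix K held in dense storage.
    import numpy as np

    def density(K):
        """Number of non-zero entries divided by the total number of entries."""
        return np.count_nonzero(K) / K.size

    def semibandwidth(K):
        """MBAND: largest |i - j| over the non-zero entries, plus one."""
        rows, cols = np.nonzero(K)
        return int(np.max(np.abs(rows - cols))) + 1

    K = np.array([[4., 1., 0., 0.],
                  [1., 4., 1., 0.],
                  [0., 1., 4., 1.],
                  [0., 0., 1., 4.]])
    print(density(K), semibandwidth(K))   # 0.625 2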

The memory resources are taxed to their fullest extent in the Assembler Module. In concurrent computing, the storage dilemma may be significant, especially for local memory architectures. Each processor on the NCUBE 3200 machine contains only 512K of local memory, thus forcing the programmer to consider innovative storage techniques in order to analyze large-scale problems. Many storage techniques have been developed to exploit global sparsity. Some of them were specifically created for the matrices assembled during FEA.5 For this study, two different storage formats have been implemented in the Assembler: a Banded Shifted storage format (Fig. 2a), exhibiting medium grain parallelism, and a One-Way Dissection format (Fig. 2b), for the coarse grain code. These storage options are established during the Domain Decomposition and Assembly phases of the analysis.
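A minimal sketch of the kind of shifted banded storage referred to above, under our own assumptions about the layout (each row of the symmetric upper band is shifted so that the diagonal entry falls in column 0 of an NSIZE x MBAND array); this is not the paper's implementation.

    # Illustrative sketch (our assumed layout, not the paper's code):
    # shifted banded storage of a symmetric matrix. Row i of 'band' holds
    # K[i, i], K[i, i+1], ..., K[i, i+MBAND-1], so the diagonal sits in column 0.
    import numpy as np

    def to_shifted_band(K, mband):
        n = K.shape[0]
        band = np.zeros((n, mband))
        for i in range(n):
            width = min(mband, n - i)
            band[i, :width] = K[i, i:i + width]
        return band

    def band_entry(band, i, j):
        """Recover K[i, j] from the shifted band, using symmetry."""
        i, j = min(i, j), max(i, j)
        return band[i, j - i] if j - i < band.shape[1] else 0.0

Only NSIZE x MBAND entries are stored instead of NSIZE x NSIZE, which is where the memory savings discussed later come from.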

The mapping of the physical problem onto the processors has a strong influence on the resulting speedups and efficiencies. The mapping dictates the required inter-processor communication and the work load allocation. An optimal mapping will provide a minimum amount of inter-processor communication and a balanced distribution of calculations between processors. Given our architecture, a column-oriented Gray Code Wrap Mapping (Fig. 3a) and a specialized Block Mapping scheme (Fig. 3b) are utilized in this study. The Gray Code Wrap Mapping assigns columns of the assembled stiffness matrix [K] to the processors according to the Gray code sequence. The Block Mapping scheme results from a One-Way Dissection storage implemented during Assembly. This Block Mapping allocates each block of internal degrees of freedom (idofs) to a processor along with their contribution to the shared degrees of freedom (sdofs), while the last processor(s) stores information pertaining only to the sdofs.
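The column assignment can be pictured with the small sketch below (ours); it wraps the columns of [K] over the processors and relabels each wrap position with its Gray code, so that consecutive columns reside on directly connected nodes. The exact assignment rule used in the paper is not listed, so this should be read as an assumption.

    # Illustrative sketch (our assumption of the rule, not the paper's code):
    # Gray Code Wrap Mapping of matrix columns onto P processors.

    def gray_code(i):
        return i ^ (i >> 1)

    def wrap_map(num_cols, num_procs):
        """owner[j] = node that stores column j of the stiffness matrix."""
        return [gray_code(j % num_procs) for j in range(num_cols)]

    print(wrap_map(10, 4))   # [0, 1, 3, 2, 0, 1, 3, 2, 0, 1]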


Fig. 2. Storage schemes: (a) One-Way Dissection; (b) Banded Shifted.

METHODOLOGY--SOLVER

Several methods exist to solve the system of linear equations assembled during an FEM analysis.6,7 Three direct methods, (a) Gauss Elimination, (b) Cholesky Decomposition, and (c) Block Cholesky Decomposition, are considered here. Each of these methods is composed of two general phases, factorization and back substitution. During factorization, the matrix is reduced to a triangular system, which is solved by back substitution for the given right-hand side (RHS) vector/matrix. These techniques are valid for positive definite systems. The Cholesky methods are specifically coded for symmetric matrices.

Factorization

The factorization of the stiffness matrix [K] into an equivalent triangular system is the most computationally intensive part of the Solver Module. The decomposition of full systems requires O(NSIZE^3) operations. Both the Cholesky and Gauss algorithms consist of three nested loops (I, J, K). Although interchanges8 of the IJK loops require the same number of operations, their inherent concurrency varies considerably, and this concurrency was studied. The Block Cholesky algorithm operates on partitioned blocks of a matrix to triangularize it. The method used here is specifically tailored to the One-Way Dissection partitioning of the stiffness matrix mentioned previously. This algorithm requires the same number of operations as the standard Cholesky for full matrices, but for sparse, banded systems on parallel architectures the number of operations and the innate concurrency differ.

Two levels of concurrency are exploited in these direct methods. One level is a medium grain parallelism,9-12 which occurs in the elimination of a single variable, and is exhibited in the Cholesky Decomposition and Gauss Elimination codes. In our implementation, the operations in the middle loop of the algorithm are distributed across the hypercube. A coarse grain parallelism is exploited in the Block Cholesky solving method, since concurrency is attained in the elimination of partitions of the global matrix. Since guidelines are pursued for choosing the most appropriate method for a given problem, a reported relationship13 is of interest: it is expected that, when communication costs are negligible, medium grain parallelism dominates if MBAND > √(NSIZE). The validity of this indicator for NCUBE machines, where communication may be significant, is studied.
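Read as a decision rule, and under the stated assumption that communication costs are negligible, the reported indicator amounts to the following check (a sketch in our own phrasing):

    # Sketch of the reported indicator (communication assumed negligible):
    # medium grain concurrency is expected to dominate when MBAND > sqrt(NSIZE).
    import math

    def prefer_medium_grain(nsize, mband):
        return mband > math.sqrt(nsize)

    print(prefer_medium_grain(1000, 100))   # True  -> medium grain expected to win
    print(prefer_medium_grain(1000, 20))    # False -> coarse grain (Block Cholesky)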

Gauss and Cholesky. Feasibility studies on parallelism for IJK variants of the Cholesky and Gauss factorization algorithms have been previously reported.14,15 These studies were performed for fully populated matrices on a shared memory architecture. The impact of a message passing environment on the data dependencies of the IJK variants is studied here. The message passing architecture of the NCUBE is such that the communication of columns is most efficient. A column of a matrix, or a group of columns in a packet of information, may be sent without any overhead, whereas rows of a matrix must be rearranged in a vector buffer before broadcasting. During factorization, NSIZE - 1 broadcasts of data are required. This results in a significant overhead if row communication is used. This overhead is verified by benchmarking of column- and row-oriented IJK variants. For a banded matrix, elimination only needs to be performed on MBAND rows below the pivot row. Thus, our sparsity parameter MBAND dictates the degree of parallelism exhibited, and subsequently the efficiency of the factorization. A saving in storage in excess of 75% may be gained by exploiting sparsity. This is based on an average MBAND of 5% of NSIZE, a typical sparsity in structural FEA.
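For reference, a sequential Python sketch (ours, not the paper's code) of a column-oriented Cholesky factorization that performs elimination only on the MBAND rows below each pivot; in the concurrent version, the column updates of the middle loop would be distributed over the processors that own those columns.

    # Illustrative sketch (ours): column-oriented Cholesky factorization of a
    # symmetric positive definite banded matrix, touching only entries within
    # the semibandwidth MBAND below each pivot.
    import numpy as np

    def banded_cholesky(A, mband):
        n = A.shape[0]
        L = np.tril(A).astype(float)
        for k in range(n):
            last = min(k + mband, n)           # only MBAND rows below the pivot
            L[k, k] = np.sqrt(L[k, k])
            L[k + 1:last, k] /= L[k, k]        # scale the pivot column
            for j in range(k + 1, last):       # banded rank-1 update of later columns
                L[j:last, j] -= L[j:last, k] * L[j, k]
        return L

    A = np.array([[4., 1., 0.], [1., 4., 1.], [0., 1., 4.]])
    L = banded_cholesky(A, mband=2)
    print(np.allclose(L @ L.T, A))             # True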

Not all variants need to be benchmarked; in some cases only the modifications to the data interdependencies are fully investigated in order to attain an optimal factorization. For instance, the data dependency diagram of the column-oriented Gauss KJI factorization, when adapted to capitalize on the symmetry of the matrix, clearly shows that there is a prohibitive amount of communication between each task Tkj assigned to columns for the banded system.

Fig. 3. Mapping schemes: (a) Block Mapping; (b) Wrap Mapping.

Block Cholesky. A coarser concurrency is attained in the Block Cholesky by the simultaneous factorization of partitioned blocks. These operations are independent because of the One-Way Dissection storage produced by the Assembler. This special ordering scheme takes advantage, in terms of CPU time, of the sparsity of the system by avoiding unnecessary operations outside the local semibandwidth and by working on more densely populated local matrices. Storage is conserved by exploiting the global sparsity (i.e. between blocks) and by minimizing the local bandwidth. Issues related to the asynchronous procedures, that is, processors performing different calculations on dofs at different times, are investigated in this study. Load balancing, communication cost, idle time, and suboptimal allocation of processors to sdofs are some of those issues. The exact effect of each of these items is problem dependent and requires further investigation to reduce idle time.
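To show the structure being exploited, the sketch below (ours, and sequential) emulates the coarse grain steps on a bordered block-diagonal system produced by a One-Way Dissection: each block of idofs is factored independently, its contribution to the shared-dof (Schur complement) system is accumulated, and the sdofs are solved last. All names are assumptions; in the parallel code each block would live on its own processor and only the Schur contributions would be communicated.

    # Illustrative, sequential sketch (ours) of the Block Cholesky steps for a
    # bordered block-diagonal system: diagonal blocks A_i (idofs), borders B_i
    # coupling them to the shared dofs, and a block C for the sdofs.
    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    def block_cholesky_solve(A_blocks, B_blocks, C, f_blocks, g):
        schur = C.astype(float).copy()
        rhs_s = g.astype(float).copy()
        facs, Zs = [], []
        for A_i, B_i, f_i in zip(A_blocks, B_blocks, f_blocks):
            c_i = cho_factor(A_i)              # independent per block (coarse grain)
            Z_i = cho_solve(c_i, B_i)          # A_i^{-1} B_i
            schur -= B_i.T @ Z_i               # Schur complement on the sdofs
            rhs_s -= B_i.T @ cho_solve(c_i, f_i)
            facs.append(c_i)
            Zs.append(Z_i)
        y = cho_solve(cho_factor(schur), rhs_s)            # sdofs solved last
        xs = [cho_solve(c_i, f_i) - Z_i @ y                # per-block back substitution
              for c_i, Z_i, f_i in zip(facs, Zs, f_blocks)]
        return xs, y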

Back substitution

After factorization, the reduced triangular matrix needs to be solved. Gauss Elimination requires back-solving, whereas a Cholesky Decomposition demands forward- and back-solving. For a single processor, the cost of back substitution is of O(NSIZE).1 In a concurrent message passing environment, the back-solver takes on greater significance, since the calculation/communication ratio is much lower than that for factorization, and data packets are smaller. Consequently, communication times tend to dominate concurrent back-solving and may have quite an impact on the efficiency of the Solver Module.

Fan-out and Fan-in algorithms. Current and previous work on full triangular solvers, and their viability in message passing environments, has been reviewed.16 The Fan-in and Fan-out algorithms are investigated here. The concurrency in these methods lies in the inner loop, where operations, specifically the scalar product for the Fan-in and the vector sum for the Fan-out, are performed simultaneously. Adaptation of both codes to banded storage resulted in no major changes in their data dependencies.
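Written sequentially for a lower triangular system L x = b (the back-solve on the upper triangle is symmetric in structure), the two inner-loop organizations look as follows; this is our sketch, and in the parallel codes it is precisely this inner loop, the vector sum or the partial scalar products, that is distributed over the processors.

    # Illustrative sketch (ours): Fan-out (vector sum) versus Fan-in (scalar
    # product) organizations of a triangular solve L x = b.
    import numpy as np

    def fan_out_solve(L, b):
        """Column-oriented: once x[j] is known, its contribution is 'fanned out'
        to the remaining right-hand-side entries as a vector-sum update."""
        x = b.astype(float).copy()
        n = len(x)
        for j in range(n):
            x[j] /= L[j, j]
            x[j + 1:] -= L[j + 1:, j] * x[j]
        return x

    def fan_in_solve(L, b):
        """Row-oriented: partial scalar products are 'fanned in' to the
        processor computing x[i]."""
        n = len(b)
        x = np.zeros(n)
        for i in range(n):
            x[i] = (b[i] - L[i, :i] @ x[:i]) / L[i, i]
        return x

    L = np.array([[2., 0.], [1., 3.]])
    b = np.array([4., 7.])
    print(fan_out_solve(L, b), fan_in_solve(L, b))   # both give [2., 1.6667]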

Block Algorithm. This algorithm requires a Block forward-solver and a Block back-solver. Coarser concurrency is exploited here. The asynchronous nature of the Block back-solver is similar to that of the Block Elimination. Processors that are allocated to sdofs operate with different algorithms than the ones allocated to idofs. Synchronization of operations between processors is a significant factor and merits further investigation. Loading is balanced in our test cases by assigning an equal number of idofs per processor, and only one processor is allocated to the sdofs.

METHODOLOGY--ASSEMBLER

In the Assembler Module, a stiffness matrix [k] and a force vector {f} are generated for each finite element in the mesh, and they are subsequently assembled into the global matrices [K] and {F}. Four-node isoparametric elements are considered in this study. When studying this procedure, and assigning concurrent tasks, special consideration needs to be given to the Domain Decomposition algorithm.17 Since decomposition is a sequential scheme in our implementation, our goal in this study is to define criteria for its future concurrent implementation. Options for this task are: (a) Substructuring, and (b) Domain Decomposition. Substructuring18 is more a geometric and/or functional-based procedure and does not account for the balance of assigned tasks. Domain Decomposition is based on the characteristics of a given FE mesh, which could be that of a substructure. In both cases the Assembler Module must be provided with index tables (relating local to global dofs) and connectivity tables.
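A minimal assembly sketch (ours; dense global storage is used only for clarity, whereas the Assemblers described here write directly into the Shifted Banded or One-Way Dissection formats) showing how an index table maps local element dofs to global dof numbers:

    # Illustrative sketch (ours): assembling element matrices into [K] and {F}
    # using an index table from local element dofs to global dof numbers.
    import numpy as np

    def assemble(n_dofs, elem_k, elem_f, index_table):
        K = np.zeros((n_dofs, n_dofs))
        F = np.zeros(n_dofs)
        for k_e, f_e, gdofs in zip(elem_k, elem_f, index_table):
            for a, ga in enumerate(gdofs):            # local row -> global row
                F[ga] += f_e[a]
                for b, gb in enumerate(gdofs):        # local column -> global column
                    K[ga, gb] += k_e[a, b]
        return K, F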

Two simple Decomposers were developed: (a) a Vertical Strip Gray Code decomposer, a coarse grain scheme which partitions the FE mesh into vertical subdomains and assigns them to processors according to the Gray code, and (b) a Wrap Decomposer, a medium grain scheme which allocates global dofs in a wrap mapping sequence. The first scheme results in low communication costs and low overhead for processor load balancing when uniform strips (i.e. with an equal number of dofs) are defined. Storage efficiency is achieved by performing local renumbering; at the local level the sparsity is maximized by choosing minimal bandwidth paths in each processor. The second scheme provides each processor with only the information necessary to assemble the local stiffness matrices corresponding to the assigned subdomain.

Storage format and mapping are specified by the Domain Decomposer but formalized by the Assembler. After some analysis, two Assemblers that minimize storage requirements and overhead, as well as preserve the parallelism, were implemented. The One-Way Dissection storage format, linked to the Vertical Strip Decomposer, is generated by a special ordering of the dofs during assembly: the indexing enumerates sdofs only after all idofs have been assigned a number. The number of blocks depends upon the optimal number of processors needed by the Solver for the given NSIZE and MBAND. In the Shifted Banded format, linked to the Wrap Decomposer, the matrix elements are shifted k - 1 places (k = row/column number) for storage. This represents a saving in memory, since semibandwidths are usually below 10% of the matrix order, and the added overhead is negligible.
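The dissection ordering described above can be sketched as a simple renumbering (our illustration, with hypothetical dof numbers): idofs are numbered block by block and the sdofs receive the last numbers, which is what produces the bordered block-diagonal matrix the Block Cholesky relies on.

    # Illustrative sketch (ours): One-Way Dissection renumbering. Internal
    # dofs (idofs) are numbered block by block; shared dofs (sdofs) come last.

    def one_way_dissection_order(idof_blocks, sdofs):
        """Return a dict new_number[old_dof] implementing the ordering."""
        order = [dof for block in idof_blocks for dof in block] + list(sdofs)
        return {old: new for new, old in enumerate(order)}

    # Two strips of idofs separated by a line of sdofs (hypothetical numbering):
    blocks = [[0, 1, 2, 3], [7, 8, 9, 10]]
    sdofs = [4, 5, 6]
    print(one_way_dissection_order(blocks, sdofs))
    # block 0 idofs -> 0..3, block 1 idofs -> 4..7, sdofs -> 8..10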

RESULTS

The primary objective was to implement and benchmark the different modules for two FEA codes, one exploiting medium grain parallelism and one with coarse grain parallelism. For each code, benchmarking was performed on each subroutine independently as well as with the subroutines linked together.

In the first set of tests, Gauss and Cholesky factorization speedups were compared for different levels of sparsity on different hypercube sizes. The Cholesky algorithm most suitable for parallel computing is the KJI column-oriented variant with shifted storage. For Gauss Elimination,19,20 the data dependencies and tests pointed to the KJI algorithm as the one of choice. The matrices were allocated by Wrap Mapping. MBAND was varied from 5 to 100% of NSIZE. Both Gauss and Cholesky exhibit a similar behavior of increased speedup for increasing MBAND. This is due to the implemented concurrency across the semibandwidth. Cholesky Decomposition (Fig. 4) requires less CPU time than Gauss factorization, 47% less for a fully populated matrix but only 25% less for sparser systems, since fewer operations are performed along the bandwidth. These results are reflected in the speedup of the implementation as well; in Fig. 5 we observe the advantage of having denser matrices, for which better speedups are obtained because there are more concurrent operations.

Fig. 4. CPU times for Cholesky factorization.

Fig. 5. Speedups for Cholesky factorization.

The second set of factorization tests was performed on the Block Cholesky, with NSIZE and sdofs as the test parameters. It was found that there is a strong correlation between sdofs and CPU times, similar to the relationship between MBAND and CPU time in the previous cases. The trend of reduced CPU time with more processors is valid from hypercube order 2, the minimum required for this Solver to be considered concurrent, to hypercube order 5, after which communication cost becomes dominant (Fig. 6).

Fan-in and Fan-out Triangular Solvers were benchmarked at various degrees of sparsity (MBAND) and problem size (NSIZE). The performance of the Fan-out algorithm (Fig. 7) was superior to that of the Fan-in algorithm (Fig. 8) for all tested cases. An MBAND equal to 10% of NSIZE was specified for both algorithms, which restricts the parallelism that can be exploited. In the Fan-in algorithm the high communication cost experienced further reduces the efficiency.

A Block back-solver was tested. The implemented scheme was found to exhibit sufficient parallelism to improve efficiency with more processors, up to hypercube order 4 (Fig. 9). The dependence of the efficiency upon the number of sdofs is not as strong as for the factorization. The dependency has an asymptotic relationship for small hypercubes and appears linear for larger hypercube dimensions.

Fig. 6. CPU times for Block Cholesky factorization.

Fig. 7. CPU times for Fan-out (vector sum) back-solver.

Fig. 8. CPU times for Fan-in (scalar product) back-solver.

Fig. 9. CPU times for Block Cholesky back-solver.

When testing the linked Factorization and Back-solver subroutines, the following results were obtained. The Cholesky KJI decomposition linked with the two back-solvers was tested for a large, physically valid problem (NSIZE = 1000 and MBAND = 5-10% of NSIZE). The results show the expected benefit of reduced CPU time with an increasing number of processors, especially for larger problems. However, there is an exponential behavior in the number of processors that will produce better speedups (Fig. 10). As expected, when NSIZE and MBAND are increased, the efficiency of the algorithm increases. For the same NSIZE, a decrease in MBAND results in a significant decrease in efficiency. This is expected, since the parallelism in this algorithm is gained across the MBAND. Additionally, analysis of this situation indicated that the communication cost in the algorithm overrides the inherent parallelism in the calculations.

Fig. 10. Speedups for Cholesky Solver Module. MBAND = 10% NSIZE.

Results for the linked Block Solver resemble the factorization plots. This is explained by the fact that the back-solver takes only 1-9% of the CPU time that the factorization phase takes. Thus, the efficiency of the complete parallel Block Solver is mainly dependent upon the implementation of the factorization phase. An optimum number of processors, for a specific NSIZE and number of sdofs, is found for the Block Solver as well. But in this case, as opposed to the normal Cholesky, there is an increase in the CPU time when we go from 32 to 64 processors (Fig. 11), thus indicating hardware under-utilization. Synchronization issues are probably causing the crossover points, especially for 32 and 64 processors.

Fig. 11. Total CPU time for Solver Module using Block Cholesky scheme.

The final test concerned the comparison of the two linked algorithms, that is, the coarse and the medium grain codes. Possible advantages of the implemented Cholesky, the most efficient of the two medium grain Solvers, over our Block Cholesky were investigated. In order to compare the efficiency of the two methods, the sdofs parameter in the Block-based method must be translated into an equivalent MBAND for the Wrap-based algorithm. The correlation between the two parameters is problem dependent, so a universal relationship is not possible. A rectangular cantilever test problem was chosen for the correlation. A uniform mesh, Nx elements in the horizontal direction by Ny elements in the vertical direction, was utilized. By changing the aspect ratio (Nx/Ny), the semibandwidth of the assembled system of equations is changed. The results showed that for very small MBAND, less than 1% of NSIZE, the Cholesky with medium grain parallelism (Fig. 12) is the most efficient, with up to a 60% reduction. It is interesting to note that, for the problem size we tested, the effect of MBAND (i.e. sparsity) is basically null once all three modules are linked together. The asynchronous nature of the Cholesky algorithms is currently under investigation to reduce idle processor time. Tests of the One-Way Dissection Assembler indicate that increasing speedups will occur with increased allocation of processors as long as the chosen subdomains are sufficiently large, so that the time saved by parallel calculations on the idofs remains greater than the time spent in communication of the sdofs.

Fig. 12. Speedups for linked FEA modules with medium grain concurrency.

It is important to mention that some tests were run to demonstrate the impact of various parameters on the efficiency of the general algorithms, even though some of those cases (e.g. MBAND = NSIZE) do not relate to practical situations in FEA.

SUMMARY AND CONCLUSIONS

While developing and implementing modules for an FEA code on a distributed memory MIMD machine, several issues have been addressed in order to define a set of guidelines that can help us to produce a highly efficient and flexible software package. Our approach of starting with the Solver Module and working backwards to the Domain Decomposer led us to investigate three parameters: sparsity of the problem, storage organization, and mapping scheme.

Important results were obtained after extensive benchmarking of the two specialized algorithms. Among the most interesting conclusions is that, of the considered Solvers, the normal Cholesky algorithm proved to be the most efficient for an FEA code. Although the Block method is superior when dealing with very small MBANDs, the exponential increase in CPU time for MBANDs greater than 1% of NSIZE makes this algorithm too limited to be an effective method for general structural FEA. The reduction of processor idle time through the adjustment of load imbalances is currently under investigation. The objective is to increase the efficiency of the Block Cholesky factorization of the sdofs at larger semibandwidths.

One conclusion that will play an important role in our future work is the observed limit on the speedups, with linked modules, as the number of processors is increased. This conclusion implies that, for a given problem size, there is an optimum hypercube dimension that should be allocated for its efficient solution. Above this limit, the total CPU time required is reduced, but the speedups are lower.

Similarly, we can conclude that even though the Solver is the most CPU-intensive module, in the end the Domain Decomposer will have a significant impact on the final efficiency of the FEA. Research is currently being done in this area.

Acknowledgements--The authors would like to thank the CAIP Parallel Processing Laboratory and the Department of Mechanical and Aerospace Engineering, both at Rutgers University. The CAIP (Computer Aids for Industrial Productivity) Center is an Advanced Technology Center supported by the New Jersey Commission of Science and Technology and industrial members.

REFERENCES

1. J. L. Gustafson, G. R. Montry and R. E. Benner, "Development of parallel methods for a 1024-processor hypercube," SIAM Journal on Scientific and Statistical Computing 9(4), 609-638 (1988).

2. D. W. White and J. F. Abel, "Bibliography on finite elements and supercomputing," Communications in Applied Numerical Methods 4, 279-294 (1988).

3. R. D. Cook, Concepts and Applications of Finite Element Analysis, Wiley, New York, 1981.

4. I. S. Duff, A. M. Erisman and J. K. Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.


5. A. George and J. W. H. Liu, Computer Solution of Large Sparse Positive Definite Systems, Prentice-Hall, Englewood Cliffs, New Jersey, 1981.

6. D. Goehlich, L. Komzsik and R. E. Fulton, "Application of a parallel equation solver to static FEM problems," Computers & Structures 31(2), 121-129 (1989).

7. K. H. Law, "A parallel finite element solution method," Computers & Structures 23(6), 849-858 (1986).

8. J. J. Dongarra, F. G. Gustavson and A. Karp, "Implementing linear algebra algorithms for dense matrices on a vector pipeline machine," SIAM Review 26, 91-112 (1984).

9. C. Farhat and I. Wilson, "A parallel active column equation solver," Computers & Structures 28(2), 281-304 (1988).

10. G. A. Geist and M. T. Heath, "Matrix factorization on a hypercube multiprocessor," SIAM Review, 161-180 (1986).

11. A. Gerasoulis and I. Nelken, "Gaussian elimination and Gauss-Jordan on MIMD architectures," Department of Computer Science, Rutgers University, New Jersey, 1988.

12. M. T. Heath and C. H. Romine, "Parallel solution of triangular systems on distributed memory multi- processors," SIAM Journal on Scientific and Statistical Computing 9(3), 558-587 (1988).

13. J. J. Dongarra and L. Johnsson, "Solving banded systems on a parallel processor," Parallel Computing 5, 219-246 (1987).

14. M. Cosnard, M. Marrakchi, Y. Robert and D. Trystram, "Parallel Gaussian elimination on an MIMD computer," Parallel Computing 6, 275-296 (1988).

15. J. M. Ortega and C. H. Romine, "The IJK forms of factorization methods--II Parallel systems," Parallel Computing 7, 149-162 (1988).

16. G. A. Geist, "Efficient parallel LU factorization with pivoting on a hypercube multiprocessor," ORNL/TM-6211, Mathematical Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, 1985.

17. D. Zois, "Parallel processing techniques for FE analy- sis: stiffness, loads, and stress evaluation," Computers & Structures 28(2), 247-260 (1988).

18. O. O. Storaasli and P. Bergan, "Nonlinear substructuring method for concurrent processing computers," AIAA Journal 25(6), 871-876 (1987).

19. Y. Saad, "Gaussian elimination on hypercubes," in Parallel Algorithms and Architectures (edited by M. Cosnard, Y. Robert, P. Quinton and M. Tchuente), pp. 5-17, Elsevier Science Publishers Co., North-Holland, New York, 1986.

20. Y. Saad and M. Schultz, "Data Communication in hypercubes," Report YALEU/DCS/RR-428, Yale University, 1985.