a highly scalable matrix-free multigrid solver for fe …people.inf.ethz.ch/arbenz/lssc11.pdfa...


A highly scalable matrix-free multigrid solver for µFE analysis based on a pointer-less octree

Cyril Flaig and Peter Arbenz

ETH Zürich, Chair of Computational Science, 8092 Zürich, Switzerland

Abstract. The state-of-the-art method to predict bone stiffness is micro finite element (µFE) analysis based on high-resolution computed tomography (CT). Modern parallel solvers enable simulations with billions of degrees of freedom. In this paper we present a conjugate gradient solver that works directly on the CT image and exploits the geometric properties of the regular grid and the basic element shapes given by the 3D pixel. The data is stored in a pointer-less octree. The tree data structure provides different resolutions of the image that are used to construct a geometric multigrid preconditioner. It enables the use of a matrix-free representation of all matrices on all levels. The new solver reduces the memory footprint by more than a factor of 10 compared to our previous solver ParFE. It allows much bigger problems to be solved than before and scales excellently on a Cray XT-5 supercomputer.

Keywords: micro-finite element analysis, voxel based computing, matrix-free, geometric multigrid preconditioning, pointer-less octree

1 Introduction

Osteoporosis is a bone disease affecting millions of people around the world. The disease entails low bone quality and increases the risk of bone fracture. For a better understanding of bone structures and to improve the prediction of bone fractures, a precise estimation of bone stiffness and strength is required. Micro finite element analysis (µFE) is a tool to this end [12, 17]. It is based on high-resolution 3D images that are obtained by computed tomography (CT).

The high-resolution scans produce computation domains of complicated shape composed of a huge number of voxels (3D pixels), cf. Fig. 1. Since voxels directly translate into finite elements, the resulting linear systems can have enormous numbers of degrees of freedom (dofs). Some years ago, we developed a fully parallel state-of-the-art solver called ParFE [2, 11] based on the conjugate gradient algorithm preconditioned by smoothed aggregation-based algebraic multigrid. This code exploits the geometric properties of the underlying rectangular grid by avoiding the assembly of the system matrix. The largest realistic bone model solved with ParFE so far had a size of about 1.5 billion dofs [3].

It is natural to represent the voxel-based domains by octrees [4, 6, 15]. Sampath and Biros [15] used the different tree levels to construct a geometric multigrid preconditioner.


In this paper, we present a solver based on a pointer-less octree-like data structure. Both finite elements and nodes are identified by a key corresponding to a space filling curve. This curve is equivalent to an octree. In contrast to [4, 6, 15] we deal with incomplete octrees due to the bone free space. In full space approaches [9, 10] the bone free space is modeled by very soft material and its unknowns are included in the computations. With the help of the new data structure the algorithm can exploit the sparse structure of the bone. This enables us to run the simulation with an up to 6 times smaller memory footprint compared to the geometric multigrid that also stores the empty bone region [9]. Compared to matrix-free ParFE, the memory saving is more than a factor of 10.

2 The mathematical modeling of the problem

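The linear elasticity theory is used to analyse the bone strength. The weak formulation in 3D reads as follows [5]: Find the displacement field $\mathbf{u} \in [H^1_E(\Omega)]^3 = \{\mathbf{v} \in [H^1(\Omega)]^3 : \mathbf{v}|_{\Gamma_D} = \mathbf{u}_S\}$ such that

$$\int_\Omega \left[ 2\mu\, \varepsilon(\mathbf{u}) : \varepsilon(\mathbf{v}) + \lambda\, \operatorname{div}\mathbf{u}\, \operatorname{div}\mathbf{v} \right] d\Omega = \int_\Omega \mathbf{f}^T \mathbf{v}\, d\Omega + \int_{\Gamma_N} \mathbf{g}_S^T \mathbf{v}\, d\Gamma \qquad (1)$$

for all $\mathbf{v} \in [H^1_0(\Omega)]^3$ with the volume forces $\mathbf{f}$, the boundary traction $\mathbf{g}_S$ on the Neumann boundary, the linearized symmetric strain tensor

$$\varepsilon(\mathbf{u}) := \frac{1}{2}\left(\nabla\mathbf{u} + (\nabla\mathbf{u})^T\right),$$

and the Lamé constants

$$\lambda = \frac{E\nu}{(1+\nu)(1-2\nu)}, \qquad \mu = \frac{E}{2(1+\nu)}.$$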

Here, $E$ is Young's modulus and $\nu$ is Poisson's ratio.

We use two different boundary conditions. The Neumann boundaries are traction free, $\mathbf{g}_S = 0$. On the top and bottom of the domain we have Dirichlet boundary conditions with a fixed displacement. Engineers look for regions with high stresses and strains to determine the quality of the bone [17].
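As a worked example of the Lamé constants, the following evaluates both formulas for the Poisson's ratio $\nu = 0.3$ that this paper uses for bone. The value $E = 1\,\mathrm{GPa}$ is an assumption chosen only to give concrete numbers; since both constants are linear in $E$, other moduli simply rescale the result.

$$\lambda = \frac{E \cdot 0.3}{(1 + 0.3)(1 - 2 \cdot 0.3)} = \frac{0.3}{1.3 \cdot 0.4}\,E = \frac{0.3}{0.52}\,E \approx 0.577\,\mathrm{GPa}, \qquad \mu = \frac{E}{2(1 + 0.3)} = \frac{E}{2.6} \approx 0.385\,\mathrm{GPa}.$$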

The displacements are discretized by trilinear hexahedral elements. These are converted one-to-one from the voxels of the CT image. Thus, all elements are cubes of the same size. In contrast to ParFE only the Young's modulus can vary in the domain. The Poisson's ratio $\nu$ must be constant. Bone mass has a typical Poisson's ratio $\nu = 0.3$. Applying this finite element discretization to (1) results in a symmetric positive definite linear system

$$A\mathbf{u} = \mathbf{f}.$$

The number of degrees of freedom can exceed $10^9$. For symmetric positive definite linear systems of this size the preconditioned conjugate gradient algorithm is the solver of choice [13]. We use a geometric multigrid preconditioner.


Algorithm 1 Optimized Search

Require: int SearchIndex(int start, t_octree_key key, t_tree tree)
    int count = 1;
    while key > tree[start + count].key do
        count = count · 8;
    end while
    return binarySearch(start + count/8, start + count, key, tree);

We coarsen by aggregating 2 × 2 × 2 voxels. A voxel of the coarser level $\ell + 1$ gets its Young's modulus by averaging the Young's moduli of the eight aggregated smaller voxels of level $\ell$,

$$E^{\ell+1}_{x,y,z} = \frac{1}{8} \sum_{i,j,k=0}^{1} E^{\ell}_{2x+i,\,2y+j,\,2z+k}, \qquad (2)$$

where the Young's modulus of a non-existing child element is zero. If this procedure is applied to a homogeneous grid with the standard prolongation (interpolation) and restriction it corresponds to the Galerkin product [16]. For smoothing we use a Chebyshev polynomial [1]. This type of smoother was successfully used in ParFE [2] in the context of a smoothed aggregation-based algebraic multigrid preconditioner.
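The averaging rule (2) can be illustrated with a small serial sketch. It is not the authors' implementation: the dictionary-based voxel storage and the function name `coarsen` are assumptions made for illustration, and the moduli are arbitrary example values.

```python
def coarsen(fine):
    """Average Young's moduli over 2x2x2 voxel blocks, following Eq. (2).

    `fine` maps (x, y, z) voxel coordinates to a Young's modulus.
    Voxels that are not stored (bone-free space) contribute zero.
    """
    sums = {}
    for (x, y, z), modulus in fine.items():
        parent = (x // 2, y // 2, z // 2)
        sums[parent] = sums.get(parent, 0.0) + modulus
    # Always divide by 8, even if fewer than eight children exist.
    return {parent: total / 8.0 for parent, total in sums.items()}


# Two voxels share the parent (0, 0, 0); one lies alone in parent (1, 1, 1).
fine_level = {(0, 0, 0): 8.0, (1, 0, 0): 8.0, (2, 2, 2): 4.0}
coarse_level = coarsen(fine_level)
```

Because missing children count as zero, a coarse voxel that is only partly filled with bone receives a proportionally softer modulus.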

3 Implementation details

The mesh, which is constructed from a 3D image, is stored in an octree. An octree divides each spatial dimension in two parts. This means that each tree node has eight children. Finite elements and nodes of the grid that lie in bone free space are not stored. In our application we iterate over all elements of a multigrid (or octree) level. These elements have the same size. Both the nodes and the elements of each level are stored in one array. Each element is identified by the coordinate of its node with local number 0. If the data item has a weight $w \ge 0$ then it represents both an element with a Young's modulus of $E_{\mathrm{elem}} = w \cdot 1\,\mathrm{GPa}$ and the node of the element with local number 0. Plain nodes are characterized by a negative weight. The nodes and elements are sorted by their position in the depth-first traversal of the tree. This so-called Morton ordering corresponds to a space filling curve called Z-curve [14]. The Morton key can be computed easily from the three coordinates (short int) by interleaving their bits, $\mathrm{key} = z_{15}y_{15}x_{15} \cdots z_1y_1x_1z_0y_0x_0$. This pointer-less storage scheme reduces the memory needed to hold the octree by 24 bytes (on 64-bit systems by 56 bytes) per node. The whole application needs only about 100 bytes per degree of freedom. That is about 16 times less than the matrix-free ParFE code.
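The bit interleaving that defines the Morton key can be sketched as follows. The function names `morton_key` and `morton_decode` are hypothetical and the loop-based encoding is chosen for readability rather than speed; only the bit layout $z_i y_i x_i$ is taken from the text.

```python
def morton_key(x, y, z, bits=16):
    """Interleave the bits of x, y, z so that bit i of x lands at position 3i,
    bit i of y at 3i + 1 and bit i of z at 3i + 2."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key


def morton_decode(key, bits=16):
    """Inverse of morton_key: recover (x, y, z) from an interleaved key."""
    x = y = z = 0
    for i in range(bits):
        x |= ((key >> (3 * i)) & 1) << i
        y |= ((key >> (3 * i + 1)) & 1) << i
        z |= ((key >> (3 * i + 2)) & 1) << i
    return x, y, z


# Sorting data items by key yields the Z-curve (depth-first octree) order.
points = [(1, 1, 1), (2, 0, 0), (0, 1, 0), (0, 0, 0)]
ordered = sorted(points, key=lambda p: morton_key(*p))
```

Dividing a key by eight drops the lowest bit triple, which is the key of the enclosing octant on the next coarser level.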

3.1 Accessing nodes of an element

In matrix-free finite element applications the nodes of the corresponding element must be accessed. Usually an element-to-node table is queried to get the


Algorithm 2 Prolongation

Require: void Prolongate(Vector c, Vector f)
    ImportGhostNodes(c);
    cindex_tmp = 0;
    for each i in TreeFineLevel do
        coarsekey = i.key/8; bits = i.key mod 8; factor = FactorOfElem(bits);
        cindex_tmp = SearchIndex(cindex_tmp, coarsekey, coarsetree);
        f[IndexOf(i)] += factor · c[cindex_tmp];
        coarsekeylist = AddCoarseKeysIfBitInDimensionIsSet(coarsekey, bits);
        for each cnode in coarsekeylist do
            cindex = SearchIndex(cindex_tmp, cnode, coarsetree);
            f[IndexOf(i)] += factor · c[cindex];
        end for
    end for
    ZeroBoundaryNodes(f);

indices of the corresponding nodes. With the octree data structure this corresponds to the search of the eight neighbours in positive x, y, z direction. The binary search corresponds to traversing the tree from the root down to the leaves. Nodes with bigger coordinates always have a bigger Morton key. The search therefore only has to cover the range from the index of the current element to the end of the array.

A faster way to access the neighbouring nodes is to ascend in the tree and descend to the wanted node [7]. Ascending in the full octree is an exponential interval search by a factor of eight (see Algorithm 1). The binary search combined with an exponential interval search speeds up the application.
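A minimal sketch of the exponential interval search followed by a binary search is given below. It follows the structure of Algorithm 1 but is not the authors' code: the tree is reduced to a sorted list of unique integer keys, and the bounds check on the array end and the `-1` return value for missing keys are additions made here.

```python
from bisect import bisect_left


def search_index(start, key, keys):
    """Find `key` in the sorted list `keys`, looking only at indices >= start.

    Returns the index of the key, or -1 if it is not stored.
    """
    n = len(keys)
    count = 1
    # Exponential phase: widen the interval by a factor of eight until the
    # key at its upper end is no longer smaller than the wanted key.
    while start + count < n and key > keys[start + count]:
        count *= 8
    lo = start + count // 8
    hi = min(start + count, n - 1)
    # Binary phase: the key, if present, lies in keys[lo..hi].
    idx = bisect_left(keys, key, lo, hi + 1)
    return idx if idx < n and keys[idx] == key else -1


keys = [0, 1, 2, 5, 7, 8, 9, 16, 17, 64]
found_near = search_index(0, 9, keys)
found_last = search_index(2, 64, keys)
missing = search_index(0, 3, keys)
```

When the wanted key is close to the starting index, which is the typical case for neighbouring nodes, the exponential phase ends after very few steps and the binary search runs on a short interval.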

3.2 Matrix-vector multiplication

The first step is to store the prescribed values at the Dirichlet boundary points and zero the corresponding components of the source vector. This is done because the boundary conditions are not taken into account in the matrix. Afterwards we import the ghost nodes. Then all elements must be traversed in order to compute the matrix-free matrix-vector product. All corresponding displacements of the nodes are loaded. This involves the neighbour search described in Section 3.1. Then the local stiffness matrix is applied with a scaling parameter that corresponds to the Young's modulus of the element. The results of the local element are added into the appropriate places in the destination vector and the ghost nodes are exported. Finally the displacements at the Dirichlet boundary points are restored.
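The sequence of steps can be illustrated with a deliberately simplified serial sketch. Everything specific here is an assumption: a 1D chain of two-node elements stands in for the trilinear hexahedra, the 2 × 2 unit matrix stands in for the local stiffness matrix, ghost node exchange is omitted, and writing the prescribed values into the destination vector is one possible reading of the final restore step.

```python
def matvec_matrix_free(young, x, dirichlet):
    """Compute y = A x for a 1D chain of elements without assembling A.

    young[e]  : Young's modulus of element e, which joins nodes e and e + 1
    x         : source vector, one value per node (unchanged on return)
    dirichlet : indices of nodes with prescribed displacement
    """
    # Local stiffness matrix for unit modulus; each element only rescales it.
    k_local = ((1.0, -1.0), (-1.0, 1.0))

    # Store the prescribed values and zero them in the source vector.
    saved = {d: x[d] for d in dirichlet}
    for d in dirichlet:
        x[d] = 0.0

    y = [0.0] * len(x)
    for e, modulus in enumerate(young):
        nodes = (e, e + 1)
        u = [x[n] for n in nodes]  # gather the nodal values of this element
        for a in range(2):
            # apply the scaled local matrix and scatter-add the result
            y[nodes[a]] += modulus * (k_local[a][0] * u[0] + k_local[a][1] * u[1])

    # Restore the Dirichlet values in the source and copy them to the result.
    for d, value in saved.items():
        x[d] = value
        y[d] = value
    return y


x = [5.0, 1.0, 2.0, 7.0]
y = matvec_matrix_free([1.0, 2.0, 1.0], x, dirichlet=[0, 3])
```

The only per-element data the loop needs is the modulus, which is why identical cubic elements make a matrix-free product cheap in memory.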

3.3 Prolongation and restriction

In contrast to the matrix-vector multiplication, prolongation and restriction are procedures that involve two tree levels. Instead of traveling between the levels, the two different resolutions are traversed concurrently. The keys on the coarser level are computed from those of the finer level by a division by eight.


Fig. 1. Load balancing with a space filling curve in a cubical bone sample. 16 partitions are used. On the left side all partitions are shown. On the right side the partitions numbered three to nine are displayed. Note that partitions need not be connected.

Because we traverse the mesh in one direction, we can use the fast search described in Section 3.1. Algorithm 2 describes the prolongation. The restriction is implemented in a similar way.
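The key arithmetic behind the prolongation can be sketched in a serial form. This is one interpretation of Algorithm 2, not the authors' code: trilinear interpolation is assumed, so the factor is $1/2$ per dimension in which the fine node lies between two coarse nodes. The dictionary storage, the helper names and the treatment of missing coarse nodes as zero are assumptions, and ghost node import and boundary zeroing are left out.

```python
def encode(x, y, z, bits=16):
    """Morton key with bit layout ... z1 y1 x1 z0 y0 x0."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key


def decode(key, bits=16):
    """Inverse of encode."""
    x = y = z = 0
    for i in range(bits):
        x |= ((key >> (3 * i)) & 1) << i
        y |= ((key >> (3 * i + 1)) & 1) << i
        z |= ((key >> (3 * i + 2)) & 1) << i
    return x, y, z


def prolongate(coarse, fine_keys):
    """Interpolate coarse nodal values (dict: key -> value) to the fine nodes."""
    fine = {}
    for key in fine_keys:
        coarse_key, octant = key // 8, key % 8
        cx, cy, cz = decode(coarse_key)
        bx, by, bz = octant & 1, (octant >> 1) & 1, (octant >> 2) & 1
        factor = 0.5 ** (bx + by + bz)
        value = 0.0
        # A set bit means the fine node lies midway between two coarse nodes
        # in that dimension, so both of them contribute.
        for dx in range(bx + 1):
            for dy in range(by + 1):
                for dz in range(bz + 1):
                    neighbour = encode(cx + dx, cy + dy, cz + dz)
                    value += factor * coarse.get(neighbour, 0.0)
        fine[key] = value
    return fine


coarse = {encode(0, 0, 0): 2.0, encode(1, 0, 0): 4.0,
          encode(0, 1, 0): 6.0, encode(1, 1, 0): 8.0}
fine = prolongate(coarse, [encode(1, 0, 0), encode(2, 0, 0), encode(1, 1, 0)])
```

A fine node whose octant bits are all zero coincides with a coarse node and simply copies its value.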

3.4 Load balancing

The domain partitioning is obtained by splitting the space filling curve in equal sized sets of contiguous elements. This avoids the use of a data structure to store the mapping from the nodes to the processes.

After reading the image data each process sorts its nodes and elements according to the space filling curve. Afterwards, the key space is recursively bisected into buckets until each bucket holds fewer data items than a defined upper limit. Each process gets a number of consecutive buckets until the average number of elements per process is reached. This results in a nearly balanced distribution, cf. Fig. 1.
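The bucket scheme can be sketched serially as follows. The function names, the recursion on integer key ranges and the cumulative-load rule for switching to the next process are assumptions made for illustration; the distributed sorting and counting that a parallel code needs are not shown.

```python
from bisect import bisect_left


def make_buckets(keys, lo, hi, limit):
    """Bisect the key range [lo, hi) until each bucket holds fewer than
    `limit` keys. Returns the bucket sizes in key order."""
    n = bisect_left(keys, hi) - bisect_left(keys, lo)
    if n < limit or hi - lo <= 1:
        return [n]
    mid = (lo + hi) // 2
    return make_buckets(keys, lo, mid, limit) + make_buckets(keys, mid, hi, limit)


def assign_buckets(sizes, nprocs):
    """Give each process consecutive buckets until its share of the
    average load is reached. Returns the owning process of each bucket."""
    target = sum(sizes) / nprocs
    owners, proc, cumulative = [], 0, 0
    for size in sizes:
        owners.append(proc)
        cumulative += size
        if cumulative >= (proc + 1) * target and proc < nprocs - 1:
            proc += 1
    return owners


keys = list(range(0, 64, 2))            # 32 sorted keys in the range [0, 64)
sizes = make_buckets(keys, 0, 64, limit=5)
owners = assign_buckets(sizes, nprocs=4)
```

Since buckets are contiguous key ranges, the owner of any key follows from the bucket boundaries alone, which is why no node-to-process table is needed.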

4 Numerical results

We performed a strong and a weak scalability test. We used the boundary conditions described in Section 2. In each test we used the following stopping criterion: $\|r_k\|_{M^{-1}} \le 10^{-6}\|r_0\|_{M^{-1}}$. We used a W-cycle in the multigrid preconditioner $M$. On the finest level we used a Chebyshev smoother of degree 6. On each coarser level the degree was increased by one. On the coarsest level we solved the problem by a Jacobi preconditioned CG algorithm. We stopped CG after 20 iterations or if the residual norm was decreased by a factor of $10^7$. Usually the first criterion was met. The timings were made on the Cray XT5 of the Swiss National Supercomputing Center [8]. The Cray XT5 is based on Opteron processors with six cores running at 2.4 GHz. Each core has 1.33 GiB main memory.
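The stopping criterion in the $M^{-1}$ norm costs nothing extra inside preconditioned CG, because $\|r\|_{M^{-1}}^2 = r^T M^{-1} r = r^T z$ is already computed in every iteration. The sketch below shows this with a Jacobi preconditioner as a stand-in for the multigrid preconditioner $M$. The function names and the 1D Laplacian test operator are assumptions, not part of the paper.

```python
import math


def pcg_jacobi(apply_a, diag, b, rel_tol=1e-6, max_iter=1000):
    """Jacobi-preconditioned CG that stops when
    ||r_k||_{M^-1} <= rel_tol * ||r_0||_{M^-1}, with M = diag(A)."""
    x = [0.0] * len(b)
    r = list(b)
    z = [ri / di for ri, di in zip(r, diag)]
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))  # equals ||r||^2 in the M^-1 norm
    rz0 = rz
    if rz0 == 0.0:
        return x, 0
    for k in range(1, max_iter + 1):
        ap = apply_a(p)
        alpha = rz / sum(pi * ai for pi, ai in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        if math.sqrt(rz_new) <= rel_tol * math.sqrt(rz0):
            return x, k
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


def apply_laplace_1d(v):
    """Matrix-free product with the tridiagonal matrix tridiag(-1, 2, -1)."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0) for i in range(n)]


n = 10
b = apply_laplace_1d([1.0] * n)          # right-hand side with solution 1
solution, iterations = pcg_jacobi(apply_laplace_1d, [2.0] * n, b)
```

Passing `max_iter=20` and `rel_tol=1e-7` would mimic the limits stated for the coarsest-level solve.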


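| Mesh | Cores | 64 | 512 | 1728 | 5832 | 8000 |
|---|---|---|---|---|---|---|
| c240 | dofs | $445 \cdot 10^6$ | $3.6 \cdot 10^9$ | $12.0 \cdot 10^9$ | $40.5 \cdot 10^9$ | $55.5 \cdot 10^9$ |
| c240 | meshing time [s] | 8.5 | 20.9 | 52.7 | 154 | 204 |
| c240 | setup time [s] | 20.4 | 21.4 | 23.1 | 28.9 | 33.6 |
| c240 | GFlops | 32.3 | 253 | 854 | 2888 | 3947 |
| c320 | dofs | $758 \cdot 10^6$ | $6.1 \cdot 10^9$ | $20.4 \cdot 10^9$ | $69.1 \cdot 10^9$ | $94.7 \cdot 10^9$ |
| c320 | meshing time [s] | 16.6 | 37.1 | 92.3 | 273 | 804 |
| c320 | setup time [s] | 34.6 | 36.0 | 37.7 | 44.8 | 51.0 |
| c320 | GFlops | 31.8 | 252 | 856 | 2865 | 3921 |

Table 1. Weak scalability timings. The meshing time also includes the time to read the image data.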

[Figure 2: plot with the axes Time [s], Iterations and Number of Cores, showing the curves c320 iterations, c240 iterations, c320 solving time and c240 solving time.]

Fig. 2. Weak scaling with two different trabecular bone samples embedded in a $320^3$ and a $240^3$ regular grid. 3D mirroring is applied to generate the bigger meshes.

4.1 Weak scalability

The solver for the bone analysis is designed such that it scales well on MPI-based supercomputers with large meshes. We have tested the weak scalability with up to 8000 cores with two different meshes, cf. Table 1. The larger grids are generated by 3D mirroring [2] from a bone sample encased in a cube, cf. Fig. 1. We have used two base meshes:

– c240 is encased in a $240^3$ cube with $6.9 \cdot 10^6$ degrees of freedom and $1.46 \cdot 10^6$ elements (porosity 10.6%).

– c320 is encased in a $320^3$ cube with $11.8 \cdot 10^6$ degrees of freedom and $2.23 \cdot 10^6$ elements (porosity 6.83%).

The biggest mesh on 8000 cores has $94.7 \cdot 10^9$ dofs and is 62 times bigger than the largest problem solved with ParFE [3]. In these tests we always used 7 levels in the multigrid preconditioner.

In Fig. 2 we see that the solver scales nearly perfectly up to 8000 cores. With both meshes, above 125 cores the solving time increases only slightly. The setup time and the flop rate of the matrix-vector product also scale very well, cf. Table 1. However, the meshing time does not scale. This time includes the construction


[Figure 3: two plots over the number of cores (27 to 576), one showing Solution Time [s] and one showing Parallel Efficiency, with the curves degree 6 level 7, degree 6 level 6, degree 6 level 5, degree 10 level 6, degree 10 level 5 and linear speedup.]

Fig. 3. Strong scaling with different smoother degrees and numbers of levels in the multigrid algorithm. The c320 mesh, three times 3D mirrored, is used, starting on 27 cores. On the left side the parallel efficiency. On the right side the solution time. The yellow dashed line at the bottom denotes linear speedup.

of the octree (meshing) and, most of all, the time to distribute the voxel data among the cores. The latter means the broadcast of about 250 MiB = $320^3 \cdot 8$ B of image data from the root core to all other cores, which is a costly procedure.
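The figures quoted above can be checked with a few lines of arithmetic. The snippet uses only the c240 GFlops row of Table 1 and the image size stated in the text; the variable names are illustrative.

```python
cores = [64, 512, 1728, 5832, 8000]
gflops_c240 = [32.3, 253, 854, 2888, 3947]   # c240 row of Table 1

# Flop rate per core: a flat profile indicates good weak scaling.
per_core = [g / c for g, c in zip(gflops_c240, cores)]

# Flop rate per core on 8000 cores relative to 64 cores.
relative_rate = per_core[-1] / per_core[0]

# Volume of the broadcast image: 320^3 voxels at 8 bytes each, in MiB.
broadcast_mib = 320**3 * 8 / 2**20
```

The per-core rate stays close to 0.5 GFlops across all core counts, whereas the broadcast volume per core does not shrink as cores are added, which is consistent with the meshing time not scaling.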

4.2 Strong scalability

For the strong scalability test a mesh based on c320 was used with $320 \cdot 10^6$ dofs. This moderately sized problem could be solved on a machine that is affordable for a clinical institute. We have tested the scalability with different parameters to identify the limiting factors. The memory that is needed for solving this mesh forced us to use at least 27 cores.

Figure 3 shows that the application scales very well up to 576 cores. If the number of levels is chosen too large (red line) the parallel efficiency decreases, and a configuration with a smoother of higher degree needs less time to solve with 144 cores. The reason is that the problem size on the coarser mesh gets very small and the communication dominates. With redistribution and the use of a smaller set of cores on coarser meshes the efficiency would be higher, especially for large numbers of levels.

The higher smoother degree results in higher efficiency because on the fine meshes the matrix-vector product scales perfectly with the number of processors. However, on this mesh the smoother of degree ten needed more time to solve the problem than the smoother of degree six if the same number of levels is used.

5 Conclusions and future work

We have presented a highly parallel solver for voxel-based µFE bone analysis. The solver is based on the PCG method and uses a geometric multigrid preconditioner. Because the mesh is stored in an octree-like data structure all levels are implemented with matrix-free techniques. The minimal memory footprint enabled us to solve huge problems with more than $94 \cdot 10^9$ degrees of freedom.


Solving these problems with the old solver ParFE would require 16 times as many processors! The solver also shows nearly perfect weak scalability up to 8000 processors.

We plan to further improve the access to the element nodes by hashing with a low collision rate. Further enhancements could be achieved by enabling repartitioning of the coarser levels using a subset of processors. This would lower the communication complexity and further increase the parallel efficiency.

Acknowledgments

The work of the first author has been funded in part by the Swiss National Science Foundation project 205320 125114. The computations on the Cray XT5 have been performed in the framework of a Large User Project grant of the Swiss National Supercomputing Centre (CSCS).

References

1. Adams, M., Brezina, M., Hu, J., Tuminaro, R.: Parallel multigrid smoothing: polynomial versus Gauss–Seidel. J. Comput. Phys. 188(2), 593–610 (2003)

2. Arbenz, P., van Lenthe, G.H., Mennel, U., Müller, R., Sala, M.: A scalable multi-level preconditioner for matrix-free µ-finite element analysis of human bone structures. Internat. J. Numer. Methods Engrg. 73(7), 927–947 (2008)

3. Bekas, C., Curioni, A., Arbenz, P., Flaig, C., van Lenthe, G., Müller, R., Wirth, A.: Extreme scalability challenges in micro-finite element simulations of human bone. Concurrency Computat.: Pract. Exper. 22(16), 2282–2296 (2010)

4. Bielak, J., Ghattas, O., Kim, E.J.: Parallel octree-based finite element method for large-scale earthquake ground simulation. Comp. Model. in Eng. & Sci. 10(2), 99–112 (2005)

5. Braess, D.: Finite Elements: Theory, fast solvers and applications in solid mechanics. Cambridge University Press, Cambridge, 2nd edn. (2001)

6. Burstedde, C., Wilcox, L.C., Ghattas, O.: p4est: Scalable algorithms for parallel adaptive mesh refinement on forests of octrees, accepted for publication in SIAM J. Sci. Comput.

7. Castro, R., Lewiner, T., Lopes, H., Tavares, G., Bordignon, A.: Statistical optimization of octree searches. Computer Graphics Forum 27(6), 1557–1566 (2008)

8. Swiss National Supercomputing Centre (CSCS), http://www.cscs.ch/

9. Flaig, C., Arbenz, P.: A Scalable Memory Efficient Multigrid Solver for Micro-Finite Element Analyses Based on CT Images. Parallel Computing (2011), accepted for publication

10. Margenov, S., Vutov, Y.: Comparative analysis of PCG solvers for voxel FEM systems. In: Proceedings of the International Multiconference on Computer Science and Information Technology. pp. 591–598 (2006)

11. The ParFE Project Home Page (2010), http://parfe.sourceforge.net/

12. van Rietbergen, B., Weinans, H., Huiskes, R., Polman, B.J.W.: Computational strategies for iterative solutions of large FEM applications employing voxel data. Internat. J. Numer. Methods Engrg. 39(16), 2743–2767 (1996)

13. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia, PA, 2nd edn. (2003)

14. Samet, H.: The quadtree and related hierarchical data structures. ACM Comput. Surv. 16, 187–260 (1984)

15. Sampath, R.S., Biros, G.: A parallel geometric multigrid method for finite elements on octree meshes. SIAM J. Sci. Comput. 32(3), 1361–1392 (2010)

16. Trottenberg, U., Oosterlee, C.W., Schüller, A.: Multigrid. Academic Press, London (2000)

17. Wirth, A., Mueller, T., Vereecken, W., Flaig, C., Arbenz, P., Müller, R., van Lenthe, G.H.: Mechanical competence of bone-implant systems can accurately be determined by image-based micro-finite element analyses. Arch. Appl. Mech. 80(5), 513–525 (2010)