
NUMERICAL LINEAR ALGEBRA WITH APPLICATIONS
Numer. Linear Algebra Appl. 2006; 13:275–291
Published online 15 February 2006 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/nla.481

A cache-oblivious self-adaptive full multigrid method

M. Mehl∗,†, T. Weinzierl and Chr. Zenger

Institut für Informatik, TU München, Boltzmannstraße 3, 85748 Garching, Germany

SUMMARY

This paper presents a new efficient way to implement multigrid algorithms on adaptively refined grids. To cope with today's demands in high-performance computing, we cannot do without such highly sophisticated numerical methods. But if we do not implement them very carefully, we lose a lot of efficiency in terms of memory usage: using trees for the storage of hierarchical multilevel data causes a large amount of non-local (in terms of the physical memory space) data accesses, and often requires the storage of pointers to neighbours to allow the evaluation of discrete operators (difference stencils, restrictions, interpolations, etc.). The importance of this problem becomes clear if we remember that storage, and not the CPU, is the bottleneck on modern computers.

We established a cache-oblivious and storage-minimizing algorithm based on the concept of space-tree grids combined with a cell-oriented operator evaluation, a linear ordering of grid cells along a space-filling curve, and a sophisticated construction of linearly processed data structures for vertex data. In this context, we could show that the implementation of a dynamically adaptive F-cycle is, first, very natural and, second, does not cause any overhead in terms of storage usage and access, as adaptivity and multilevel data do not disturb the linear processing order of our data structures. Copyright © 2006 John Wiley & Sons, Ltd.

KEY WORDS: cache-efficiency; multigrid; dynamical adaptivity; space-tree; space-filling curve

1. INTRODUCTION

In general, the problem of finding efficient implementations for numerically efficient methods is well known, and there are numerous solution strategies known in the literature [1]. In our work, the particular task is to efficiently implement a dynamically adaptive multigrid solver for partial differential equations. Here, the difficulty is to prevent an inefficient usage of storage resources [2], which are, in fact, the bottleneck on modern computers.

Figure 1 visualizes the stumbling blocks which arise in the context of multigrid algorithms and adaptively refined grids. If we store the multilevel representation of our unknown functions in a tree structure, multilevel algorithms cause a large amount of non-local memory accesses

∗Correspondence to: M. Mehl, Institut für Informatik, TU München, Boltzmannstraße 3, 85748 Garching, Germany.
†E-mail: [email protected]

Received 13 May 2005
Revised 5 December 2005

Copyright © 2006 John Wiley & Sons, Ltd. Accepted 6 December 2005


Figure 1. Non-local interaction between data points in a multigrid PDE-solver (left); storage overhead caused by pointers to neighbours and/or specialized difference stencils in the adaptive case (right).

in terms of the physical memory space. The reasons for this non-locality are the usage of spatially neighboured data points for the evaluation of difference operators and the usage of data of different levels for the interpolation and restriction operators. Such 'jumps' in the physical memory space increase the probability of not finding needed data in the cache and, thus, cause unnecessarily long data access times. In addition, in the case of adaptive grids, often pointers to neighbours and/or specialized difference operators are stored for each degree of freedom. Thus, we end up with a poor performance both in terms of memory access and in terms of memory requirements.

In the literature, we find numerous approaches to optimize the storage usage of numerical solvers for partial differential equations, ranging from strategies enhancing instruction-level parallelism (like, for example, loop unrolling) over data layout optimizations (array padding, array merging) and data access optimizations (loop fusion and/or loop blocking) to enhanced methods implementing cache-aware versions of highly efficient new numerical approaches (for example, patch-adaptive relaxation) [3–6]. These strategies achieve a tremendous gain in efficiency also for multigrid methods. Even for multigrid methods working on unstructured grids, runtimes can be reduced substantially with the help of special patch-based methods [3, 4]. An important feature of state-of-the-art numerical PDE solvers which all codes developed on the basis of the methods mentioned above cannot handle is the dynamical adaptivity of the underlying computational grid. The methods applied to unstructured grids require a quite costly set-up phase decomposing the grid into patches and reorganizing data within these patches.

At this point, our approach comes into play. We established an algorithm based on space-partitioning grids and a data optimizing strategy using space-filling curves [7]. Due to the high structuredness of space-partitioning grids, the storage requirements are very low compared to unstructured grids, but, at the same time, a flexible and dynamical adaptivity is easy to implement.‡ The special construction of data structures with the help of a space-filling curve automatically gives us a very high locality of data both in terms of spatial locality (the algorithm performs only small steps within the physical memory space) and in terms of

‡The only exception up to now is an anisotropic refinement of the grid, which is subject to current research.


time locality (all actions using one certain datum are performed in a small period of time). This gives us an inherently good cache performance for any type of cache hierarchy.§

The idea to use space-filling curves in the context of algorithms working on space-partitioning grids is quite established. In particular, the curves are widely used as a tool to determine a balanced partitioning of an adaptively refined space-partitioning grid [11–16], whereas the communication cost for the resulting partitioning can be shown to be quasi-minimal [16] due to the good locality properties of self-similar, recursively defined space-filling curves. At the same time, this locality property of the curves results in a substantial improvement of cache-efficiency as soon as we reorder data according to the ordering induced by the space-filling curve [17].
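As a concrete illustration of such an ordering (our own sketch, not the authors' code), the following routine enumerates the cells of a regular 3^n × 3^n grid along a discrete iterate of the two-dimensional Peano curve; the recursion mirrors subcells in the x- and y-direction so that consecutive cells always share a face:

```python
def peano_order(n):
    """Enumerate the cells of a 3**n x 3**n grid along the 2D Peano curve."""
    cells = []

    def rec(level, x0, y0, fx, fy):
        if level == 0:
            cells.append((x0, y0))
            return
        s = 3 ** (level - 1)
        for i in range(3):  # serpentine sweep through the three columns
            js = range(3) if i % 2 == 0 else range(2, -1, -1)
            for j in js:
                gi = 2 - i if fx else i  # apply the accumulated mirroring
                gj = 2 - j if fy else j
                # children inherit mirrored orientations so that the curve
                # stays continuous across subcell boundaries
                rec(level - 1, x0 + gi * s, y0 + gj * s,
                    fx ^ (j % 2 == 1), fy ^ (i % 2 == 1))

    rec(n, 0, 0, False, False)
    return cells
```

Consecutive cells in the returned list are face neighbours, which is exactly the locality property exploited both for the partitioning applications cited above and for cache-efficiency.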

2. BASIC ALGORITHM

For our algorithm, minimizing storage requirements as well as jumps within the memory space, we concentrate on partial differential equations discretized on so-called space-tree grids. A space-tree grid consists of rectangular (in 2D) or cuboidal (in 3D) cells generated by a locally recursive refinement of cells into a prescribed number of equal subcells. Figure 2 shows an example of a two-dimensional space-tree grid with a local refinement of each cell of the grid into nine subcells. Space-tree grids are highly flexible with respect to adaptivity.¶

Figure 2. Two-dimensional adaptively refined space-tree grid with the associated discrete iterate of the Peano-curve.

§Algorithms which are cache-aware by concept, without detailed knowledge of the cache parameters, are also called cache-oblivious [8–10].

¶The only exception are anisotropic refinements. The inclusion of such cases into our method is a current subject of research.


Furthermore, we assume that all data of the unknown function(s) are located at the grid's vertices. The first step to prevent jumps in the memory space and, at the same time, to make the storage of pointers to neighbouring cells/data obsolete, is to switch from a so-called vertex-oriented operator evaluation to a strictly cell-oriented view.‖ This induces two conceptual changes:

(1) We process the grid cell-by-cell instead of vertex-by-vertex.
(2) In each cell we compute only cell-parts of the operator values for all four (in 2D) or eight (in 3D) vertices. All data we may use for this are the four or eight vertex data of the current cell. The complete operator values are computed by an accumulation of the cell-parts of all associated cells.
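The cell-wise accumulation of step (2) can be sketched for a regular 2D grid (our illustration, not the paper's code; K below is the standard bilinear finite-element stiffness matrix of the Laplacian, used here as a stand-in for a generic discrete operator):

```python
import numpy as np

# Q1 element stiffness matrix of the Laplacian on a unit square cell,
# vertex order (0,0), (1,0), (1,1), (0,1).
K = np.array([[ 4.0, -1.0, -2.0, -1.0],
              [-1.0,  4.0, -1.0, -2.0],
              [-2.0, -1.0,  4.0, -1.0],
              [-1.0, -2.0, -1.0,  4.0]]) / 6.0

def apply_laplacian_cellwise(u):
    """Accumulate A u by looping over cells; no neighbour pointers needed."""
    n = u.shape[0]
    r = np.zeros_like(u)
    for i in range(n - 1):
        for j in range(n - 1):
            verts = [(i, j), (i + 1, j), (i + 1, j + 1), (i, j + 1)]
            part = K @ np.array([u[p] for p in verts])  # cell-part only
            for p, v in zip(verts, part):
                r[p] += v                               # accumulation
    return r
```

For interior vertices, the accumulated values coincide with the usual assembled 9-point stencil, although no cell ever looks beyond its own four vertices.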

This leads to completely local operations on cells which can be performed without any information on neighbouring cells, even without any information on the refinement depth of neighbouring regions. Thus, our algorithm gets by with only two bits of administrative information per cell: one for the geometry (inside or outside the computational domain?) and one for the refinement (current cell refined or not?).

The second step now is to define suitable data structures together with data access rules for the vertex data. We further decompose this quite complicated task into three substeps:

(1) Define a processing order of grid cells.
(2) Define data structures to store vertex data.
(3) Define data access rules.

For the first substep, we use approximating polygons of a particular self-similar, recursively defined space-filling curve, the Peano-curve [7]. Due to the locally recursive construction of such curves, this results in an inherently high locality of data usage in terms of time: whenever data located at a vertex in the grid are used for the first time during an iteration (of the solver, e.g.), we can expect that they will be used for the last time in the near future, since the Peano-curve always finishes all work inside a subcube of the domain before it enters the next subcube.

To construct our particular data structures, we use a second substantial property of the Peano-curve: if we look at a face between two subcubes of our domain, we can show that the processing orders of cells at the two sides of this face are both approximating polygons of a lower-dimensional Peano-curve and are inverses of each other, that is, in the first subcube the face-cells are processed in the opposite direction as in the second subcube. This holds analogously for edges between subcubes. The resulting to-and-fro processing corresponds naturally to one of the simplest data structures known: stacks are data structures which allow only two operations: push a datum on top of the stack (corresponds to the to-direction) and pop a datum from the top of the stack (corresponds to the fro-direction). In References [19–22], we could show that it is really possible to get along with a small and constant number of stacks as the only data structures for vertex data in our grid. This gives us an optimal spatial locality of data access since only one step forward or backward is possible.
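The stack principle can be sketched in a few lines (an illustrative toy, not the actual grid traversal): vertex data written while the curve leaves a face are read back in exactly reversed order when the curve returns, so push and pop are the only operations ever needed.

```python
def to_and_fro(face_data):
    """Write data in curve order (push), then read them back (pop)."""
    stack = []
    for datum in face_data:   # 'to' direction: push while passing the face
        stack.append(datum)
    read_back = []
    while stack:              # 'fro' direction: pop on the return pass
        read_back.append(stack.pop())
    return read_back
```

Because only the topmost element is ever touched, memory accesses move at most one step forward or backward, which is the spatial locality property referred to in the text.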

‖For the case of finite element discretizations, the cell-oriented operator decomposition is well known and described, for example, in Reference [18].


The rules for reading and writing data from the stacks are recursively defined, both in terms of dimension and in terms of refinement levels, and do not require the storage of any additional information [19–22].

Based on this basic method (discretization on space-tree grids, cell-oriented operator evaluation, processing of cells along the Peano-curve, stacks as data structures, locally deterministic data access), we will describe a dynamically adaptive full multigrid solver in the following.

3. DYNAMICAL ADAPTIVITY

Whenever we try to numerically solve practically relevant partial differential equations, we are faced with the demand for a high accuracy which in many cases cannot be fulfilled due to limited storage resources. This holds in particular for the three-dimensional case, where each bisection of the grid width increases the number of unknowns by a factor of eight on a regular grid. Thus, we are forced to use adaptive grids which stay quite coarse (and, thus, save memory) in parts of the domain where the claimed accuracy can be achieved already with a small number of data points. As regions requiring a deep refinement of the grid are in most cases not known in advance and, in addition, may be changing over time for time-dependent problems, we have to implement this grid adaptivity in a dynamic way, that is, local refinements and/or coarsenings of the grid have to be determined during the simulation in dependence on the local properties of the current solution.

3.1. Algorithmic details

From the algorithmic view, this results in the need for inserting new data points into our data structures and removing obsolete data. In general, this is not a trivial problem, in particular for sophisticated cache-optimized data structures like ours, as the new data points might deteriorate their good properties. In our case, data are read from an input stack containing all data in the order of first usage at the beginning of each solver iteration. Analogously, results are written to an output stack at the end of each iteration, which serves as an input stack for the next iteration at the same time [19, 20, 22]. This allows a very natural implementation of refinements and coarsenings without any loss of cache-optimality of the existing data structures: for local refinements, the solver iteration creates new data when it reaches a newly refined area instead of reading data from the input stack. Analogously, the writing of data to the output stack at the end of an iteration is omitted for local coarsening [23]. Thus, data structures and their performance are not affected at all by dynamical adaptivity. The only thing that changes is the way of 'reading' or 'writing' data between iterations (Figure 3).
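A minimal sketch of this stream mechanism (with hypothetical names, not the paper's implementation): one iteration reads the input stream, creates new data on the fly where the grid was refined, and silently drops data where it was coarsened, so the output stream is again strictly linear.

```python
def iterate(input_stream, refine_at, coarsen_at):
    """One solver pass over a linear data stream with dynamic adaptivity.

    input_stream : list of (key, value) pairs in order of first usage
    refine_at    : dict mapping a key to the new keys created below it
    coarsen_at   : set of keys whose data have become obsolete
    """
    output_stream = []
    for key, value in input_stream:
        if key in coarsen_at:
            continue                          # coarsening: do not write out
        # (smoothing / operator evaluation would update 'value' here)
        output_stream.append((key, value))
        for new_key in refine_at.get(key, ()):
            output_stream.append((new_key, 0.0))  # refinement: created, not read
    return output_stream
```

The push/pop discipline of the stacks is untouched: refinement and coarsening only change what is read from, or written to, the streams between iterations.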

3.2. Adaptivity criteria

To maximize the relation between accuracy and number of degrees of freedom, we have to find guesses for the influence of local refinement or coarsening on the chosen error function. For this purpose, we have to manage with the already computed approximate solutions on the current grid. The optimal choice of error guesses strongly depends on the objective function. We implemented two criteria evenly minimizing the error in the whole domain (linear surplus and τ-criterion) and one criterion minimizing the error at a certain position in the domain (dually weighted linear surplus).


Figure 3. Schematic view of the algorithmic realization of local grid refinement and coarsening.

For all three possibilities, we need only two bits per data point to store the result of the evaluation of the criterion (need for refinement or coarsening) [23].

3.2.1. Linear surplus. The so-called linear surplus

$$ \delta u_k := u_k - \frac{1}{8} \sum_{i=1}^{8} u_{k,i} \qquad (1) $$

describing the difference between the linear interpolant of neighbouring points and the computed approximate value of the unknown function $u$ at the current grid point is a commonly used adaptivity criterion (in particular for finite difference methods). Here, $u_k$ denotes the approximate value of the unknown function $u$ at grid point $k$, and the $u_{k,i}$ are the values of $u$ at the eight diagonally neighboured grid points.

Formula (1) can be interpreted as a second-order approximation of the Laplacian [23] and, therefore, is a guess for the local curvature of $u$. As the curvature is a good indicator for the gain of replacing the old linear interpolant by a more accurate version with additional base points, we locally refine our grid wherever the surplus exceeds a certain limit and coarsen wherever it falls below a second limit.

The implementation of the linear surplus is straightforward in our concept: the surplus is accumulated in a cell-wise evaluation process analogous to the evaluation of difference operators. As the decision for refinements or coarsenings can be taken directly after finishing the computation of the linear surplus, its value does not have to be stored on the input/output stack of an iteration.
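On a regular grid, criterion (1) can be sketched as follows (a vectorized illustration of the formula; the actual code accumulates the surplus cell-wise during the traversal):

```python
import numpy as np

def linear_surplus(u):
    """delta_u = u minus the mean of the eight diagonally neighbouring
    values, evaluated at the interior points of a cubic grid."""
    n = u.shape[0]
    d = np.zeros_like(u)
    s = np.zeros_like(u[1:-1, 1:-1, 1:-1])
    for dx in (-1, 1):
        for dy in (-1, 1):
            for dz in (-1, 1):            # the 8 diagonal neighbours
                s += u[1 + dx:n - 1 + dx, 1 + dy:n - 1 + dy, 1 + dz:n - 1 + dz]
    d[1:-1, 1:-1, 1:-1] = u[1:-1, 1:-1, 1:-1] - s / 8.0
    return d
```

For a linear function the surplus vanishes (the interpolant is exact); for a quadratic it is proportional to the second derivative, which is the 'second-order approximation of the Laplacian' mentioned above.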

3.2.2. τ-Criterion. In contrast to the linear surplus, the τ-criterion does not directly measure the error in the solution $u$ but uses the local truncation error

$$ \tau_h := A_h u - f_h \qquad (2) $$


which can be approximated with the help of a current approximation $u_h$ by the relative local truncation error [24]

$$ \tau_h^H := (A_H R u_h - f_H) + R(f_h - A_h u_h) \qquad (3) $$

where $H$ denotes operators and function values on the next coarser grid, and $R$ is the restriction operator from grid width $h$ to grid width $H = 3h$.

$\tau_h^H$ has to be accumulated cell-wise again, but is located at the second finest grid level only. Since it has to be propagated to the finest grid to give a basis for refinement/coarsening decisions in a next step, it has to be stored on the input/output stacks, which causes only a small amount of extra storage requirements, as the number of coarse grid points is only about 4% of the number of fine grid points. Due to the fact that the criterion is located at coarse grid points, it results in a little less local grid refinement than the linear surplus (see Section 5).
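A one-dimensional sketch of criterion (3) (our simplification: a 3-point discretization of $-u''$ and plain injection as $R$, instead of the full-weighting restriction and the traversal machinery of the actual code):

```python
import numpy as np

def apply_A(u, h):
    """3-point discretization of -u''; boundary rows are left zero."""
    Au = np.zeros_like(u)
    Au[1:-1] = (-u[:-2] + 2.0 * u[1:-1] - u[2:]) / h**2
    return Au

def tau_Hh(u_h, f_h, h):
    """(A_H R u_h - f_H) + R(f_h - A_h u_h) with H = 3h and R = injection."""
    R = slice(None, None, 3)          # every third fine-grid point
    u_H, f_H = u_h[R], f_h[R]
    return apply_A(u_H, 3.0 * h) - f_H + (f_h - apply_A(u_h, h))[R]
```

For a grid function on which the discretization is exact (e.g. a linear $u$ with $f = 0$), both bracketed terms vanish, so the criterion correctly reports no refinement need.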

3.2.3. Dually weighted linear surplus. With the help of the previously described criteria, we adapted the grid based on evenly weighted local error guesses at all existing grid points. In some cases, however, we are not interested in a solution with a good overall accuracy but only in a high accuracy of a value at a certain point or in a local region of our computational domain. In this case, we need a different kind of criterion if we want to minimize the number of required grid points under this precondition.

In the following, we will concentrate on the task to minimize the error $|u(y) - u_h(y)|$ at a certain point $y$ in the domain. It is obvious that this will lead to a strong local refinement in particular around $y$, but possibly also in regions further away, as minimizing the error in $y$ also means that we have to keep errors small at other positions which strongly influence the solution at $y$.

As described in detail, for example, in References [23, 25], the function describing the influence of changes at a point $x$ on the solution at a point $y$ is the Green's function $G(\cdot\,, y)$ given as a solution of the dual problem, which can be given in the weak formulation as

$$ \int_\Omega \nabla z(x) \cdot \nabla v(x) \, dx = v(y) \quad \text{for all } v \qquad (4) $$

Only the right-hand side of the dual problem is different from the primary problem. Thus, it can easily be solved at the same time with relatively low computational extra costs. But we have to store the solution of the dual problem on our stacks, which doubles the memory requirements.

To approximate the contribution of a grid point $k$ to the error at point $y$, we have to weight

the local error at grid point $k$ with a measure for the influence of grid point $k$ on point $y$. As shown in References [23, 25], this results in the criterion

$$ \eta_k = \underbrace{h \cdot \delta z_k}_{\text{dual weighting}} \cdot \underbrace{\delta u_k}_{\text{linear surplus}} \qquad (5) $$

where the linear surpluses $\delta z_k$ and $\delta u_k$ can be computed as described in Section 3.2.1. In spite of the storage overhead, the dually weighted linear surplus is an attractive criterion as it can be used analogously for arbitrary error functionals.
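Given the two surplus fields, criterion (5) reduces to a pointwise product followed by thresholding; a sketch (the marker strings and the tolerance names are our notation, not the paper's):

```python
def mark(delta_u, delta_z, h, refine_tol, coarsen_tol):
    """Mark each grid point according to eta_k = h * delta_z_k * delta_u_k."""
    marks = []
    for du, dz in zip(delta_u, delta_z):
        eta = abs(h * dz * du)      # dual weighting times linear surplus
        if eta > refine_tol:
            marks.append("refine")
        elif eta < coarsen_tol:
            marks.append("coarsen")
        else:
            marks.append("keep")
    return marks
```

Note the effect of the weighting: a point whose error hardly influences the target functional (small delta_z) is coarsened even if its local surplus is large.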


4. MULTIGRID

To achieve an optimal runtime of our program, we have to combine our adaptive discretization with an efficient multigrid method. As our grid is dynamically adaptive, we end up with a full-multigrid cycle starting from a very simple and coarse grid: the equations are solved on the current grid up to a certain accuracy, then the grid is refined according to the selected criterion and the selected tolerances for refinement and coarsening, and, finally, the current solution is interpolated/restricted to the new grid (see Figure 4).

For the solution of the equations on each level of the full-multigrid algorithm, we use an additive multigrid method with a Jacobi smoother, due to the following algorithmic reasons: first, our cell-wise and, thus, partitioned evaluation of all operators makes the Jacobi smoother a trivial choice, whereas the implementation of a Gauss–Seidel smoother (at least for vertex-based data) is non-trivial, as using 'new' values for the computation of the cell-parts of the operators would lead to a 'mixed' computation of operators from 'new' and 'old' values of a variable at a fixed position. Second, we process the computational grid according to the Peano-curve in a top-down depth-first manner. In connection with the cell-wise operator evaluation at the cells' vertices, this means that we do not finish all work for the evaluation of the residual on the fine levels before we return to the coarser levels. Thus, the natural choice for a solver on a fixed adaptive grid is an additive multigrid method. Additive multigrid methods compute the residual on all grids simultaneously, that is, based on the same current solution. In our algorithm, we realize the computation of the residuals on all levels in the following steps:

(1) Accumulation of the nodal values of the unknown variables on the finest grid (using bilinear interpolation) during the top-down traversal of the space-tree,
(2) cell-oriented evaluation of the residual on the finest grid,
(3) restriction (full weighting) of the residual to coarser grids during the bottom-up traversal of the space-tree.
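The additive principle can be sketched for a one-dimensional model problem (our illustration with a two-level scheme and coarsening factor 2 for brevity; the paper's space-trees coarsen by a factor of 3 and use a whole grid hierarchy):

```python
import numpy as np

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (-u[:-2] + 2.0 * u[1:-1] - u[2:]) / h**2
    return r

def additive_step(u, f, h, omega=0.3):
    """Fine and coarse Jacobi corrections, both computed from the SAME
    current iterate and applied together (the additive multigrid idea)."""
    r = residual(u, f, h)
    du_fine = omega * r * h**2 / 2.0          # damped Jacobi, fine grid
    # full weighting of the residual to every second point
    rc = np.zeros(u.size // 2 + 1)
    rc[1:-1] = 0.25 * r[1:-3:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    dc = omega * rc * (2.0 * h)**2 / 2.0      # damped Jacobi, coarse grid
    # linear interpolation of the coarse correction back to the fine grid
    du_coarse = np.zeros_like(u)
    du_coarse[::2] = dc
    du_coarse[1::2] = 0.5 * (dc[:-1] + dc[1:])
    return u + du_fine + du_coarse
```

Since both corrections are derived from the same residual, the order in which they are computed during the traversal is irrelevant, which is exactly why the additive variant fits the single-pass Peano traversal.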

As we store data distributed over all cell levels following the concept of generating systems, we can simultaneously apply our smoother at all grid points for which the accumulation of the residual (either by operator evaluation or by restriction) is finished. Figure 5 shows a schematic view of the computation within a coarse grid cell containing nine fine grid cells in the two-dimensional case.

To conclude, our additive multigrid algorithm with a Jacobi smoother on each level can also be interpreted as a Jacobi iteration on the generating system. Of course, the performance of an additive multigrid algorithm is not optimal and, in particular, the fitting of the relaxation parameter is not trivial at all. For these reasons, we are on the way to implement a

Figure 4. Schematic view of the dynamically adaptive full multigrid cycle.


Figure 5. Schematic visualization of the computational steps of the additive multigrid method on a coarse grid cell consisting of nine fine grid cells in the two-dimensional case (shaded box: computation of the cell-part of the residual in the current fine grid cell; dark arrows: interpolation; light arrows: restriction; light circles: smoothing).

conjugate gradient method with the additive multigrid method as a preconditioner on the one hand, and, on the other hand, a multiplicative multigrid method. The implementation of multiplicative multigrid methods demands similar mechanisms for adding data points to the input stream of iterations and removing data points from the output stream as already used for the algorithmic realization of the dynamical adaptivity.

5. NUMERICAL RESULTS

To show the potential of our algorithm, we will present several results in this section: first, we examine the performance of the pure additive multigrid solver (leaving dynamical adaptivity and full multigrid aside) with respect to the criteria multigrid performance (5.1.1), cache-efficiency (5.1.2), and storage requirements (5.1.3), and compare the runtimes per iteration and degree of freedom with DiME [26], a highly cache- and runtime-optimized multigrid solver for partial differential equations on regular grids (5.1.4). In the next steps, we apply the complete full-multigrid algorithm and present results on the multigrid performance of the full-multigrid method (5.2), compare the three adaptivity criteria implemented in our code (5.3) and, as a last and important point for our work, show that our adaptivity is flexible enough to even cope with singularities (5.4).

5.1. Performance of the basic algorithm

In this section, we will present results of the basic additive multigrid algorithm on adaptively refined space-partitioning grids, without dynamical adaptivity and full multigrid, to show the general potential of the underlying algorithmic concept.


Table I. Solution of the three-dimensional Poisson equation on a cubic domain with regular grids: number of additive multigrid iterations needed for the reduction of the residual by a factor of 10^-5 (see Reference [20]).

Grid resolution    3    9    27    81    243
# iterations       20   21   22    22    22

Table II. Solution of the three-dimensional Poisson equation on a cubic domain with a cubic cut-off edge (singularity) with regular and adaptive (last two columns) grids: number of additive multigrid iterations needed for the reduction of the residual by a factor of 10^-5 (see Reference [20]).

Grid resolution                             27     81     243    2187    59 049
# degrees of freedom (% of regular grid)    100    100    100    1       1
# iterations                                35     37     38     38      38

5.1.1. Multigrid performance. Before testing the full-multigrid method, we examine the performance of the additive multigrid method on fixed grids for the three-dimensional Poisson equation

$$ \Delta u(x) = -3\pi^2 \cdot \prod_{i=0}^{2} \sin(\pi x_i) $$

with homogeneous Dirichlet boundary conditions on the unit cube and on the unit cube with a cubic cut-off at one edge (see Figure 9). Tables I and II show the resulting resolution-independent number of iterations and, thus, the multigrid performance of our method.

5.1.2. Cache-efficiency. All measurements∗∗ we performed for our code resulted in a cache hit-rate for the level 2 cache above 99.8% and, much more significant, the cache miss-rate in the level 2 cache was only about 10% higher than the theoretical minimum given by the need of loading data to the cache at least once per iteration [19, 20, 28].

5.1.3. Storage requirements. As described in Section 2, the storage requirements per degree of freedom are very low in our code due to the strict structuredness of our grids and the cell-oriented processing. See Table III for the results achieved for the solution of the Poisson equation on a cubic and a spherical domain and on regular and adaptive grids. The slightly higher storage costs for the adaptive grids on the spherical domain are not caused by some intrinsic overhead of the adaptivity but by the need of storing some obstacle cells outside the computational domain, which do, of course, not carry any degrees of freedom but, nevertheless, geometrical and refinement information. As we use space-partitioning grids with a minimal resolution outside the computational domain itself, the fraction of those obstacle cells will reduce dramatically with an increasing grid resolution inside the computational domain.

∗∗With the cache-simulator cachegrind [27], for example.


Table III. Storage requirements for the solution of the three-dimensional Poisson equation on a cubic and a spherical domain (taken from Reference [20]).

Domain    Resolution    Deg. of freedom    St. req. per dof (Byte)
Cube      243           14 702 584         5.2
          729           400 530 936        5.0
Sphere    243 (a)       855 816            7.0
          729 (a)       23 118 848         5.5

Test cases with adaptive grids are marked with '(a)'.

Table IV. Runtimes per degree of freedom and per solver iteration for the Poisson equation solved on an Intel Dual Xeon with 2.4 GHz and 4 GByte RAM, using the Intel Compiler 8 with options -O3 -xW (see Reference [20]).

Domain    Resolution    Runtime per dof and it (s)
Cube      243           5.77 × 10^-6
          729           5.66 × 10^-6
Sphere    243 (a)       6.96 × 10^-6
          729 (a)       6.05 × 10^-6

Test examples which used adaptive grids are marked by '(a)'.

5.1.4. Runtime. In terms of runtime, our program is not yet optimized. In particular, a more efficient implementation of the stack administration, leading to better pipelining and better vectorization, is expected to yield substantial gains. In spite of this, one main result is remarkable: the computational time per degree of freedom is independent of the degree of adaptivity/irregularity and of the size of the grid (see Table IV).

To compare with other codes, we measured the runtime per iteration and degree of freedom against DiME [26], a highly cache- and runtime-optimized multigrid solver for partial differential equations on regular grids. In the case of DiME, we applied multigrid V-cycles with one pre- and one post-smoothing step on each level.

For the judgement of the results, we would like to point out several aspects: First, as mentioned above, our program is by far not optimized yet. Second, we can handle fully adaptive grids, which are far more complicated and more difficult to implement efficiently than regular grids. We could show (see References [19, 20] and Table IV) that the runtime per iteration and degree of freedom of our code does not depend on the degree of irregularity of the underlying grid. Third, the storage requirements of our code are very low (less than 7 Bytes per degree of freedom even in the three-dimensional case, see Table III), whereas DiME needs more than 27 Bytes per degree of freedom already in the two-dimensional case. Table V shows the runtimes of both programs for the solution of the two-dimensional Poisson equation

−Δu(x, y) = −2π² sin(πx) sin(πy)


Table V. Comparison of the runtimes per degree of freedom and multigrid iteration for our code (left) and DiME [26] (right) on an AMD Athlon XP 2400+ (1.9 GHz) processor with 256 KB cache and 1 GB RAM using the gcc 3.4 compiler with options -O3 -Xw.

Our code                                     DiME
Grid   # deg. of   Runtime per               Grid   # deg. of   Runtime per
res.   freedom     it and dof (s)            res.   freedom     it and dof (s)
243    5.95 × 10⁴  1.70 × 10⁻⁶               257    6.60 × 10⁴  5.78 × 10⁻⁷
729    5.33 × 10⁵  1.71 × 10⁻⁶               513    2.63 × 10⁵  4.23 × 10⁻⁷
2187   4.79 × 10⁶  1.72 × 10⁻⁶               1025   1.05 × 10⁶  3.85 × 10⁻⁷
                                             2049   4.20 × 10⁶  3.74 × 10⁻⁷

Table VI. Number of additive multigrid iterations needed on each refinement level for the solution of a Poisson equation on the unit cube (taken from Reference [23]).

Refinement step   0             1             2             3
# iterations      9             10            9             9
L2 error          5.972 × 10⁻²  4.613 × 10⁻³  4.521 × 10⁻⁴  6.771 × 10⁻⁴

on the unit square ]0, 1[² with homogeneous Dirichlet boundary conditions, computed on regular grids with different grid resolutions. All computations were performed on an AMD Athlon XP 2400+ processor (1.9 GHz) with 256 KB cache and 1 GB RAM. We used the gcc 3.4 compiler with the options -O3 -Xw.

To conclude, our program is substantially slower (by a factor of about five) than DiME, but offers the potential for further runtime improvements, has lower storage requirements, and can be applied to adaptively refined grids without losing efficiency. Thus, we consider it a reasonable and efficient alternative for problems demanding highly adaptive grids such as, for example, problems with singularities or the laminar boundary layer in turbulent flows.

5.2. Multigrid performance of the F-cycle

Considering as an example the Poisson equation

−Δu = 3π² · ∏_{j=1}^{3} sin(πx_j)   in ]0, 1[³
  u = 0                             on ∂]0, 1[³

with the known analytical solution u(x) = ∏_{j=1}^{3} sin(πx_j), we tested the multigrid performance of our full multigrid cycle. Table VI shows that the resulting number of additive multigrid iterations needed after each refinement step is constant, as we would have expected. In the computed example, we used the linear surplus as refinement criterion.
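The linear surplus used as refinement criterion here is, in one dimension, the deviation of a nodal value from the linear interpolant of its coarse-grid neighbours. The following sketch illustrates the idea in 1D; the function names and the tolerance are ours, not taken from the paper's implementation:

```python
def linear_surplus(u):
    # 1D linear (hierarchical) surplus: deviation of each odd-indexed
    # nodal value from the linear interpolant of its two even-indexed
    # neighbours.
    return [u[i] - 0.5 * (u[i - 1] + u[i + 1])
            for i in range(1, len(u) - 1, 2)]

def refine_flags(u, tol):
    # Flag a fine-grid node for refinement where the surplus is large.
    return [abs(d) > tol for d in linear_surplus(u)]

# For a smooth function the surplus scales like h^2: for u(x) = x^2 on a
# grid of mesh width h, every surplus equals exactly -h^2.
h = 1.0 / 8
u = [(i * h) ** 2 for i in range(9)]
print(linear_surplus(u))   # four entries, all equal to -h^2 = -0.015625
```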


5.3. Comparison of adaptivity criteria

To evaluate the power of the three different adaptivity criteria, we examined another example with a less smooth solution:

Δu = c · (128π r² sinh(64π(2 − r²)) − 3 cosh(64π(2 − r²)))   in ]0, 1[³

with the constant

c := 128π / sinh(128π)

and the auxiliary variable

r² := (x₁ − 1/3)² + (x₂ − 1/3)² + (x₃ − 1/3)²

This problem has the analytical solution (see also Figure 6 for the two-dimensional analogon)

u = (1 / sinh(128π)) · sinh(64π(2 − r²))   in ]0, 1[³

for an appropriate definition of boundary values.
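Differentiating the stated solution term by term gives Δu = (128π/sinh(128π)) · (128π r² sinh(64π(2 − r²)) − 3 cosh(64π(2 − r²))). The sketch below (ours, not the paper's code) verifies this relation with central finite differences. Note that sinh(128π) overflows a double, so both u and Δu are evaluated in their asymptotically exact exponential form, which agrees with the hyperbolic expressions to machine precision for r² well below 2:

```python
import math

a = 64 * math.pi    # the problem parameter; sinh(2a) = sinh(128*pi)

def u(x1, x2, x3):
    # u = sinh(a*(2 - r^2)) / sinh(2a). sinh(128*pi) overflows a double,
    # but for 0 <= r^2 < 2 both sinh factors are deep in the exponential
    # regime, so u equals exp(-a*r^2) to machine precision.
    r2 = (x1 - 1/3) ** 2 + (x2 - 1/3) ** 2 + (x3 - 1/3) ** 2
    return math.exp(-a * r2)

def laplacian_exact(x1, x2, x3):
    # Delta u = c*(2a*r^2*sinh(a*(2-r^2)) - 3*cosh(a*(2-r^2))) with
    # c = 2a/sinh(2a); in the same regime this reduces to
    # 2a*(2a*r^2 - 3)*exp(-a*r^2).
    r2 = (x1 - 1/3) ** 2 + (x2 - 1/3) ** 2 + (x3 - 1/3) ** 2
    return 2 * a * (2 * a * r2 - 3) * math.exp(-a * r2)

def laplacian_fd(f, x1, x2, x3, h=1e-4):
    # Sum of second-order central differences in the three directions.
    p = [x1, x2, x3]
    lap = 0.0
    for d in range(3):
        q_plus, q_minus = p.copy(), p.copy()
        q_plus[d] += h
        q_minus[d] -= h
        lap += (f(*q_plus) - 2 * f(*p) + f(*q_minus)) / h ** 2
    return lap

point = (0.4, 0.35, 0.3)
print(laplacian_fd(u, *point), laplacian_exact(*point))  # agree closely
```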

Figure 6. Graph of the two-dimensional function (1/sinh(128π)) · sinh(64π(2 − (x₁ − 1/3)² − (x₂ − 1/3)²)) on the unit square [0, 1]² (taken from Reference [23]).


5.3.1. Cache-performance. For all examples computed and for all three adaptivity criteria, the level-2 cache hit rate, that is, the ratio between the number of successful data accesses in the level-two cache and the total number of data accesses, was above 99.8%.
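These hit rates (measured with cachegrind [27]) reflect the strictly linear processing order of the cell and vertex streams. A toy fully-associative LRU cache model (entirely our illustration, with made-up cache parameters) shows why streaming access keeps almost every access in cache while scattered, pointer-chasing-like access does not:

```python
import random
from collections import OrderedDict

def hit_rate(addresses, num_lines=256, line_words=8):
    """Fraction of accesses that hit a fully-associative LRU cache
    holding num_lines cache lines of line_words words each."""
    cache = OrderedDict()          # cache-line id -> None, in LRU order
    hits = 0
    for addr in addresses:
        line = addr // line_words
        if line in cache:
            hits += 1
            cache.move_to_end(line)          # mark as most recently used
        else:
            cache[line] = None
            if len(cache) > num_lines:
                cache.popitem(last=False)    # evict least recently used
    return hits / len(addresses)

n = 200_000
linear = list(range(n))                      # streaming, SFC-like order
rng = random.Random(0)
scattered = [rng.randrange(n) for _ in range(n)]   # tree/pointer-like

print(f"linear access:    hit rate {hit_rate(linear):.3f}")     # 7/8 = 0.875
print(f"scattered access: hit rate {hit_rate(scattered):.3f}")  # far lower
```

For purely sequential access, only the first touch of each cache line misses, so the hit rate is (line_words − 1)/line_words regardless of cache size; scattered access over a working set much larger than the cache hits almost never.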

5.3.2. Relation accuracy–degrees of freedom. The numerical efficiency of adaptivity criteria can be measured by the dependence between the number of degrees of freedom and the achieved accuracy. The generated grids strongly depend on the criterion used (see Figure 7). Figure 8 shows the development of the errors in dependence on the adaptivity criterion.

Although the linear surplus generates locally deeper refined grids (Figure 7), even the local accuracy at the tip of the peak of the analytical solution of the example equation is better for the τ-criterion. This shows the influence of the accuracy in the wider surroundings on the accuracy at a certain point. Therefore, the application of the dually weighted linear surplus, too, brings a gain in terms of the relation of accuracy versus the needed number of degrees of freedom.

5.3.3. Runtime. Considering the runtime per iteration and degree of freedom, we observed almost no differences between the linear surplus and the τ-criterion above a certain number of degrees of freedom. Only the dually weighted linear surplus caused an about 10–15% longer runtime due to the extra computation of the dual solution and its linear surplus.

A summary of the comparison of our adaptivity criteria can be found in Table VII. The cache-efficiency is equally high for all of them; the judgement of the accuracy depends on the given example and on the objective function measuring the error. The superiority of the linear surplus and the τ-criterion with respect to runtime and storage requirements follows naturally from the computational cost of the criterion. Nevertheless, the dual weighting is a very useful tool to minimize general error functionals.

Figure 7. Two-dimensional projection of the grids resulting from the application of the linear surplus (left) and the τ-criterion (right) (taken from Reference [23]).


Figure 8. Absolute error at the point P = (10/27, 10/27, 10/27) for regular grids and adaptive grids using the linear surplus or the τ-criterion (left); absolute error at the point Q = (2/3, 2/3, 2/3) for adaptive grids using the linear surplus or the dually weighted linear surplus (right) (taken from Reference [23]).

Table VII. Summary of the comparison of the three implemented adaptivity criteria.

Criterion        Cache-efficiency (%)   Accuracy   Runtime   Storage
Linear surplus   >99.8                  Depends    +         +
τ-criterion      >99.8                  Depends    +         +
Dual weighting   >99.8                  Depends    −         −

5.4. Behaviour for singularities

To show the high flexibility of our approach, we solved a singular problem on the unit cube with a cubic cut-off (the three-dimensional analogon of an L-shaped domain). Figure 9 shows the computational domain. On this domain Ω, we solve the Poisson equation

−Δu = 3π² · ∏_{j=1}^{3} sin(πx_j)   in Ω
  u = 0                             on ∂Ω

Table VIII shows the tremendously lower number of degrees of freedom needed with an adaptively refined grid in comparison to a regular grid to achieve the same accuracy (measured by the maximal discretization error ε_max).

To summarize the numerical results: we solve systems with up to 10¹⁰ degrees of freedom [20] on highly and dynamically adaptive grids with a computing time of less than 2 × 10⁻⁶ s per iteration and degree of freedom (all measured on an AMD Athlon XP


Figure 9. Unit cube with a cubic cut-off.

Table VIII. Number of degrees of freedom needed for a regular grid and for a grid adaptively refined (on the finest level) according to the τ-criterion to achieve a maximal discretization error ε_max = 1.1734 × 10⁻³.

                       Regular grid   Adaptive grid
# degrees of freedom   509 656        61 267
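The savings reported in Table VIII can be quantified directly; the short computation below (ours) derives the reduction factor:

```python
# Degrees of freedom from Table VIII for the same maximal
# discretization error on the cut-off cube domain.
regular, adaptive = 509_656, 61_267
factor = regular / adaptive
print(f"the adaptive grid needs about {factor:.1f}x fewer unknowns")
```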

2400+ (1.9 GHz) processor with 256 KB cache and 1 GB RAM using the gcc 3.4 compiler with options -O3 -Xw).

6. CONCLUSION

As we could show in the previous sections, we established an algorithmic concept for the implementation of multigrid methods on dynamically adaptive grids which fulfils the numerical demands on a modern simulation code (multigrid performance, flexible adaptivity) and, at the same time, uses hardware resources, in particular memory hierarchies, efficiently. The latter holds independently of the actual hardware parameters such as cache size, cache-line length, associativity of the cache, etc. [19].

REFERENCES

1. Goedecker S, Hoisie A. Performance Optimization of Numerically Intensive Codes. SIAM: Philadelphia, PA, 2001.

2. Douglas CC. Caching in with multigrid algorithms: problems in two dimensions. Parallel Algorithms and Applications 1996; 9:195–204.

3. Douglas CC, Hu J, Kowarschik M, Rüde U, Weiß C. Cache optimization for structured and unstructured grid multigrid. Electronic Transactions on Numerical Analysis 2000; 10:21–40.

Copyright ? 2006 John Wiley & Sons, Ltd. Numer. Linear Algebra Appl. 2006; 13:275–291

Page 17: A cache-oblivious self-adaptive full multigrid method

A CACHE-OBLIVIOUS SELF-ADAPTIVE FULL MULTIGRID METHOD 291

4. Kowarschik M. Data locality optimizations for iterative numerical algorithms and cellular automata on hierarchical memory architectures. Ph.D. Thesis, Institut für Informatik, Universität Erlangen-Nürnberg, SCS Publishing House, 2004.

5. Kowarschik M, Weiß C. An overview of cache optimization techniques and cache-aware numerical algorithms. In Algorithms for Memory Hierarchies—Advanced Lectures, Meyer U, Sanders P, Sibeyn J (eds), Lecture Notes in Computer Science, vol. 2625. Springer: Berlin, 2003; 213–232.

6. Weiß C. Data locality optimizations for multigrid methods on structured grids. Ph.D. Thesis, Institut für Informatik, TU München, 2001.

7. Sagan H. Space-Filling Curves. Springer: New York, 1994.

8. Prokop H. Cache-oblivious algorithms. Master Thesis, Massachusetts Institute of Technology, 1999.

9. Frigo M, Leiserson CE, Prokop H, Ramachandran S. Cache-oblivious algorithms. Proceedings of the 40th Annual Symposium on Foundations of Computer Science, New York, 1999; 285–297.

10. Demaine ED. Cache-oblivious algorithms and data structures. Lecture Notes from the EEF Summer School on Massive Data Sets, University of Aarhus, Denmark, June 27–July 1, Lecture Notes in Computer Science. Springer: Berlin, 2002.

11. Griebel M, Zumbusch GW. Parallel multigrid in an adaptive PDE solver based on hashing and space-filling curves. Parallel Computing 1999; 25:827–843.

12. Griebel M, Zumbusch G. Hash based adaptive parallel multilevel methods with space-filling curves. In NIC Series, Rollnik H, Wolf D (eds), vol. 9. 2002; 479–492.

13. Oden JT, Patra A, Feng Y. Domain decomposition for adaptive hp finite element methods. In Domain Decomposition Methods in Scientific and Engineering Computing, Proceedings of the 7th International Conference on Domain Decomposition, Keyes D, Xu J (eds), Contemporary Mathematics, vol. 180. 1994; 203–214.

14. Patra AK, Long J, Laszloffy A. Efficient parallel adaptive finite element methods using self-scheduling data and computations. In High Performance Computing—HiPC'99, 6th International Conference, Calcutta, India, December 17–20, 1999, Proceedings, Banerjee P, Prasanna VK, Sinha BP (eds), Lecture Notes in Computer Science, vol. 1745. 1999; 359–363.

15. Roberts S, Kalyanasundaram S, Cardew-Hall M, Clarke W. A key based parallel adaptive refinement technique for finite element methods. In Proceedings Computational Techniques and Applications: CTAC '97, Noye J, Teubner M, Gill A (eds). World Scientific: Singapore, 1998; 577–584.

16. Zumbusch GW. Adaptive Parallel Multilevel Methods for Partial Differential Equations. Habilitationsschrift, Universität Bonn, 2001.

17. Aftosmis MJ, Berger MJ, Adomavicius G. A parallel multilevel method for adaptively refined Cartesian grids with embedded boundaries. American Institute of Aeronautics and Astronautics 2000-808, 38th Aerospace Sciences Meeting and Exhibit, Reno, Nevada, January 10–13, 2000.

18. Braess D. Finite Elements. Theory, Fast Solvers and Applications in Solid Mechanics. Cambridge University Press: Cambridge, 2001.

19. Günther F. Eine cache-optimale Implementierung der Finite-Elemente-Methode. Doctoral Thesis, Institut für Informatik, TU München, 2004.

20. Pögl M. Entwicklung eines cache-optimalen 3D Finite-Element-Verfahrens für große Probleme. Doctoral Thesis, Institut für Informatik, TU München, 2004.

21. Hartmann J. Entwicklung eines cache-optimalen Finite-Element-Verfahrens zur Lösung d-dimensionaler Probleme. Diploma Thesis, Institut für Informatik, TU München, 2005.

22. Günther F, Mehl M, Pögl M, Zenger Ch. A cache-aware algorithm for PDEs on hierarchical data structures based on space-filling curves. SIAM Journal on Scientific Computing, under review.

23. Dieminger N. Kriterien für die Selbstadaption cache-effizienter Mehrgitteralgorithmen. Diploma Thesis, Institut für Informatik, TU München, 2005.

24. Fulton SR. On the accuracy of multigrid truncation error estimates. Electronic Transactions on Numerical Analysis 2003; 15:29–37.

25. Schneider S. Adaptive solution of elliptic partial differential equations by hierarchical tensor product finite elements. Doctoral Thesis, Institut für Informatik, TU München, 2000.

26. Kowarschik M, Weiß C. DiMEPACK—A cache-optimized multigrid library. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2001), Las Vegas, Nevada, U.S.A., Arabnia (ed.), vol. I, 2001.

27. Seward J, Nethercote N, Fitzhardinge J. Cachegrind: a cache-miss profiler. http://valgrind.kde.org/docs.html

28. Krahnke A. Adaptive Verfahren höherer Ordnung auf cache-optimalen Datenstrukturen für dreidimensionale Probleme. Doctoral Thesis, Institut für Informatik, TU München, 2005.
