
An Accurate Cost Model for Guiding Data Locality Transformations

XAVIER VERA, Mälardalens Högskola, and JAUME ABELLA, JOSEP LLOSA, and ANTONIO GONZÁLEZ, Universitat Politècnica de Catalunya-Barcelona

Caches have become increasingly important with the widening gap between main memory and processor speeds. Small and fast cache memories are designed to bridge this discrepancy. However, they are only effective when programs exhibit sufficient data locality.

The performance of the memory hierarchy can be improved by means of data and loop transformations. Tiling is a loop transformation that aims at reducing capacity misses by shortening the reuse distance. Padding is a data layout transformation targeted to reduce conflict misses.

This article presents an accurate cost model that describes misses across different hierarchy levels and considers the effects of other hardware components such as branch predictors. The cost model drives the application of tiling and padding transformations. We combine the cost model with a genetic algorithm to compute the tile and pad factors that enhance the program performance.

To validate our strategy, we ran experiments for a set of benchmarks on a large set of modern architectures. Our results show that this scheme is useful to optimize programs' performance. When compared to previous approaches, we observe that with a reasonable compile-time overhead, our approach gives significant performance improvements for all studied kernels on all architectures.

Categories and Subject Descriptors: C.4 [Performance of Systems]: Modeling techniques; D.3.4 [Programming Languages]: Processors—Compilers; optimizations

General Terms: Languages, Performance

Additional Key Words and Phrases: Cache memories, tiling, padding, genetic algorithms

1. INTRODUCTION

With ever-increasing clock rates and the use of new architectural features, the speed of processors increases dramatically every year.

This work was supported by the ESPRIT project MHAOTEU (EP 24942). X. Vera was supported in part by VR grant no. 2001–2575 and CICYT project 511/98. J. Abella, J. Llosa, and A. González were supported by CICYT project 995/2001.
Authors' addresses: X. Vera, Institutionen för Datateknik, Mälardalens Högskola, P.O. Box 883, Västerås SE-721 23, Sweden; email: [email protected]; J. Abella, J. Llosa, and A. González, Computer Architecture Department, Universitat Politècnica de Catalunya-Barcelona, Jordi Girona 1–3, Barcelona 08034, Spain; email: {jabella, josepll, antonio}@ac.upc.es.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or [email protected].
© 2005 ACM 0164-0925/05/0900-0946 $5.00

ACM Transactions on Programming Languages and Systems, Vol. 27, No. 5, September 2005, Pages 946–987.

Unfortunately, memory latency does not decrease at the same pace, which makes it a key obstacle to achieving high IPC. The basic solution that almost all systems rely on is the cache hierarchy.

While caches are useful, they are effective only when programs exhibit sufficient data locality in their memory accesses. Numerical applications tend to operate on large data sets and usually present a large amount of reuse. However, this reuse may not translate to locality since caches can only hold a small fraction of the data accessed.

1.1 Cache Compiler Optimizations

Memory is organized hierarchically in such a way that the lower levels are smaller and faster. In order to fully exploit the memory hierarchy, one has to ensure that most memory references are handled by the lowest levels of cache. Programmers spend a significant amount of time improving locality, which is tedious and error prone. Various hardware and software approaches have been proposed lately for increasing the effectiveness of the memory hierarchy. Software-controlled prefetching [Mowry et al. 1992] hides the memory latency by overlapping a memory access with computation and other accesses. Compilers apply useful loop transformations such as tiling [Carr and Kennedy 1992; Coleman and McKinley 1995; Lam et al. 1991; Wolf and Lam 1991] and data layout transformations [Chatterjee et al. 1999; Kandemir et al. 1999; Rivera and Tseng 1998a, 1999a; Temam et al. 1993]. In all cases, a fast and accurate assessment of a program's cache behavior at compile time is needed to make an appropriate choice of transformation parameters.

Unfortunately, cache memory behavior is very hard to predict. Simulators can describe it accurately, but they are very slow and do not provide much insight into the causes of the misses. Thus, current approaches are based on simple models (heuristics) for estimating locality [Carr et al. 1994; Coleman and McKinley 1995; Lam et al. 1991; Rivera and Tseng 1998a, 1999a]. However, modern architectures have a very complex internal organization, with different levels of cache, branch predictors, etc. Such models provide very rough performance estimates and, in practice, are too simplistic to statically select the best optimizations.

Tiling has been shown to be useful for many algorithms in linear algebra. By restructuring the loop and changing the order in which memory references are executed, it reuses data in the faster levels of the hierarchy, thus reducing the average latency. Nevertheless, finding the optimal tile sizes is a very complex task: the solution space is huge, and exploring all possible solutions is infeasible. Padding has a significant potential to remove conflict misses. In fact, it can remove most conflict misses by changing the addresses of conflicting data, and some compulsory misses by aligning data with cache lines. However, choosing the optimal data layout is an NP-hard problem [Petrank and Rawitz 2002]. A number of algorithms have been proposed that are based on simple cost models which only consider the first-level cache [Rivera and Tseng 1998a].

We introduce an approach to drive program transformations aimed at enhancing data locality. The centerpiece of the proposed method is an accurate cost model combined with a genetic algorithm.


Fig. 1. Optimizing framework.

In particular, we improve the order of memory accesses via tiling, whereas conflict misses that tiling cannot eliminate are removed via padding. Moreover, our approach chooses the best tile and pad factors at the same time. It makes use of a very precise cost model that allows us to consider all the different levels of the memory hierarchy. Furthermore, we consider the performance cost of mispredicted branches. We present results for a collection of kernels that exhibit a large number of misses. For the sake of concreteness, we report results for a set of modern processors that represent current architectural paradigms. We have chosen the Pentium-4 (CISC), Alpha-21264 and UltraSparc-III (RISC), and Itanium (EPIC).1

1.2 An Overview

This article proposes a new method to perform loop tiling combined with padding for numeric codes. We use a static data cache analysis that considers different levels of cache. Moreover, it accounts for the cost of branch instructions according to the outcome of the branch predictor.

We first describe data reuse using the well-known concept of reuse vectors [Wolf and Lam 1991]. We implemented Ghosh et al.'s [1999] Cache Miss Equations (CMEs) to compute the locality of a program, extending their applicability to deal with multilevel caches. This gives us a precise model that describes cache memory behavior across different levels. Once the information about read and write misses for all the different levels is obtained, we set up the cost model function. By considering the relative costs of each memory level, as well as the cost of mispredicted branches, the cost function is tuned for improving execution time. Finally, we use a genetic algorithm (GA) that traverses the solution space in order to determine all tile and pad factors at the same time, thus giving equal importance to both transformations.

Figure 1 depicts an overview of the approach described in this article. We have implemented our system in the SUIF2 compiler. We use SUIF2 to identify high-level information (such as array accesses and loop constructs), which is used to model the cache behavior. The GA generates different possible combinations of tile and pad factors, which are analyzed by the cost model function. Finally, the best parameters are fed back and used to generate the optimized code.

The rest of the article is organized as follows: Section 2 reviews our method to describe data locality and introduces the tiling and padding techniques. Section 3 describes our cost model for estimating performance.

1 We do not consider the 16 KB L1 cache since it only holds integers.

Section 4 explains in detail our implementation of the genetic algorithm. Section 5 presents the experimental framework, and Section 6 compares our results against state-of-the-art techniques. Section 7 discusses related work that aims at optimizing cache behavior. Finally, we conclude and outline future extensions in Section 8.

2. IMPROVING LOCALITY

In this section, we review the concepts of data reuse and data locality. We first discuss some important concepts related to data cache analysis and how we model different cache levels. Then, we introduce loop tiling and padding, and explain how they can be used to improve data locality.

Understanding data reuse is essential to predicting cache behavior. Reuse happens whenever the same data item is referenced multiple times. This reuse results in locality if it is actually realized; a reuse results in a cache hit if no intervening reference flushes out the datum.

Given that, a static data cache analysis can be split into the following steps:

(1) Reuse Analysis describes the intrinsic data reuse among all different memory references.2

(2) Data Locality Analysis describes the subset of reuses that actually results in locality.

We describe each step in detail in the following sections.

2.1 Reuse Analysis

In order to describe data reuse, we use the well-known concept of reuse vectors [Wolf and Lam 1991]. They provide a mechanism for summarizing repeated memory accesses within a perfect loop nest.

Trying to determine all the iterations that use the same data is extremely expensive. Thus, we use a concrete mathematical representation that describes the direction as well as the distance of the reuse in a methodical way. The shape of the set of iterations that uses the same data is represented by a reuse vector space [Wolf and Lam 1991], using the reuse vectors as a basis that describes the space. Whereas self reuse (both spatial and temporal) and group-temporal reuse are computed in an exact way, group-spatial reuse is only considered among uniformly generated references (UGRs), that is, references whose array subscript expressions differ at most in the constant term [Gannon et al. 1988].

Figure 2(b) presents the reuse vectors for the references in our running example, shown in Figure 2(a). The reference a(i, j) (W) may reuse the same datum (hence, temporal reuse) that a(i, j) (R) accesses in the same iteration (hence, group reuse). Reference c(k, j) is associated with the self-spatial reuse vector (0, 0, 1), since it may reuse the same cache line (thus, spatial reuse) that it accessed one iteration earlier in the innermost loop. The other reuse vectors can be understood in a similar way.
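Figure 2(a) itself is not reproduced in this transcript. The following minimal sketch shows the IJK matrix multiplication kernel the discussion assumes; the array and loop names come from the references cited above, while the declarations and bounds are assumed.

! Minimal sketch of the IJK matrix multiplication running example
! (Figure 2(a) is not reproduced in this transcript; declarations and
! bounds are assumed, only the reference names come from the text).
subroutine matmul_ijk(n, a, b, c)
  implicit none
  integer, intent(in) :: n
  real, intent(inout) :: a(n, n)
  real, intent(in)    :: b(n, n), c(n, n)
  integer :: i, j, k

  do i = 1, n
    do j = 1, n
      do k = 1, n
        ! a(i,j) is read and written in the same iteration (group-temporal
        ! reuse); c(k,j) has self-spatial reuse along the innermost k loop
        ! because Fortran arrays are stored column-major.
        a(i, j) = a(i, j) + b(i, k) * c(k, j)
      end do
    end do
  end do
end subroutine matmul_ijk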

2 We use memory reference to denote a static read or write in the program. A particular execution of that read or write at run time is a memory access.


Fig. 2. The IJK matrix multiplication and the reuse vectors that describe its reuse.

If a reuse vector is not present, we assume there is no reuse and, thus, no locality.

2.2 Data Locality Analysis

Data locality is the subset of reuse that is realized; that is, reuse where the subsequent use of the data results in a hit in the considered cache level. To discover whether a reuse translates into locality, we need to know all the data brought into the cache between the two accesses (which implies knowledge of the loop bounds and memory access addresses) and the particular cache architecture we are analyzing.

CMEs [Ghosh et al. 1999] are mathematical formulas that provide a precise characterization of the cache behavior of set-associative caches with LRU replacement.3 They consider perfect loop nests consisting of straight-line assignments. Based on the description of reuse given by the reuse vectors, equations are set up that describe the iteration points where the reuse is not realized. Solving them gives information about the number of misses and where they occur.

Even though generating the equations is linear in the number of references, solving them can be very time consuming. We use our previous work, which presents a probabilistic method based on sampling [Vera et al. 2000], to solve the equations in a fast and accurate way. Moreover, we use our own polyhedral representation [Bermudo et al. 2000] to further optimize the process of obtaining the number of cache misses from the equations. For results showing the accuracy of our method in predicting cache misses for different cache architectures and real processors, we refer the interested reader to our previous work [Vera et al. 2004].

2.2.1 CMEs for Multi-level Caches. The increasing performance mismatch between memory and processor speeds has required an increasing number of cache levels. For instance, Itanium has three levels of cache.

3 We plan to incorporate other replacement policies into our model in the future.


Fig. 3. Our approach for analyzing multilevel caches.

Thus, an accurate model for predicting cache behavior must give a quantitative measurement of cache misses for all levels of cache; a reduced number of misses in the first-level cache may not translate into reduced execution time due to the presence of other cache levels, which could have high miss ratios. Therefore, we have extended the CMEs to describe cache behavior for modern architectures.

Given a memory reference, the equations determine whether the reuse described by its reuse vectors is realized or not. We now discuss how we extend the analysis to memory hierarchies with more than one level.

For these architectures, we have to analyze memory references differently depending on the cache level they access. For that purpose, a set of equations that precisely describes the relationship among the iteration space, the array sizes, and the cache parameters is set up for each cache level.

Figure 3 shows our approach for a 3-level cache memory hierarchy. When analyzing potential cache set contention, only memory accesses that miss in lower cache levels are considered. Thus, we can see the equations for each level as filters, where only those memory accesses that miss are analyzed at further levels.
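The sketch below illustrates only this filtering structure; it is not the paper's implementation, and miss_at_level() is a hypothetical stand-in for solving the CMEs of one cache level.

! Rough sketch of the filtering structure only (NOT the paper's
! implementation): miss_at_level() is a hypothetical stand-in for
! solving the CMEs of one cache level.
subroutine classify_accesses(naccess, nlevels, level_misses)
  implicit none
  integer, intent(in)  :: naccess, nlevels
  integer, intent(out) :: level_misses(nlevels)
  logical :: reaches_level(naccess)
  integer :: a, lev

  reaches_level = .true.        ! every access is analyzed at L1
  level_misses  = 0
  do lev = 1, nlevels
    do a = 1, naccess
      if (reaches_level(a)) then
        if (miss_at_level(a, lev)) then
          level_misses(lev) = level_misses(lev) + 1   ! keeps flowing down
        else
          reaches_level(a) = .false.                  ! hit: filtered out
        end if
      end if
    end do
  end do
contains
  logical function miss_at_level(a, lev)
    integer, intent(in) :: a, lev
    miss_at_level = .false.     ! placeholder for solving the level-lev CMEs
  end function miss_at_level
end subroutine classify_accesses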

2.3 Tiling and Padding Overview

In addition to the hardware organization, it is common knowledge that the performance of the memory hierarchy is very sensitive to the particular memory reference patterns of each program. In order to enhance locality, transformations that do not alter the semantics of the program attempt to modify the order in which computations are performed, or simply change the data layout. In this section, we review the transformations implemented in our experimental compiler.

2.3.1 Tiling. Loop tiling combines strip-mining with loop interchange to increase the effectiveness of the memory hierarchy. Figure 2(a) shows the code for the NxN matrix multiplication kernel, which we use as our running example. Loop tiling basically consists of two steps [Wolf and Lam 1991]. The first is to restructure the code so that the loops that carry reuse can be tiled. The second is to select the tile factors that maximize locality. It is the latter step that is sensitive to the characteristics of the cache memory considered. Due to hardware constraints, caches have limited associativity, which may cause cache lines to be flushed out of the cache before they are reused, despite sufficient capacity in the overall cache.

We present the tiled version, with tile sizes T1 and T2, in Figure 4(a).
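Figure 4(a) is likewise not reproduced here. The sketch below shows one plausible form of the tiled kernel, assuming that the k and j loops are the ones tiled with factors T1 and T2; this is consistent with the five loop bounds used in the misprediction example of Section 3.3.1, but the exact code in the paper may differ.

! Plausible sketch of the tiled IJK kernel of Figure 4(a) (not reproduced
! in this transcript): the k and j loops are tiled with factors T1 and T2.
! The min() expressions in the inner bounds are the ones that defeat
! Itanium's loop-count predictor (Section 3.2.4).
subroutine matmul_tiled(n, t1, t2, a, b, c)
  implicit none
  integer, intent(in) :: n, t1, t2
  real, intent(inout) :: a(n, n)
  real, intent(in)    :: b(n, n), c(n, n)
  integer :: i, j, k, jj, kk

  do kk = 1, n, t1
    do jj = 1, n, t2
      do i = 1, n
        do k = kk, min(kk + t1 - 1, n)
          do j = jj, min(jj + t2 - 1, n)
            a(i, j) = a(i, j) + b(i, k) * c(k, j)
          end do
        end do
      end do
    end do
  end do
end subroutine matmul_tiled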


Fig. 4. Matrix multiply algorithm after applying tiling and padding.

Fig. 5. Example of tiled iteration space.

2.3.1.1 Implementing Tiling. CMEs are defined over convex iteration spaces [Ghosh et al. 1999]. However, the tiled iteration space is not convex. More formally stated, the iteration space obtained after tiling n dimensions can be expressed as the union of 2^n convex regions.

We illustrate this situation in Figure 5. Figure 5(a) shows how a 1-dimensional iteration space becomes a two-convex-region iteration space (see Figure 5(b)) after tiling (T = 3). The shaded regions correspond to the different convex regions before and after tiling.

A naive way to overcome this problem is to use only one convex region that approximates the actual nonconvex region. This convex region can be the smallest parallelepiped that includes all the convex regions (see Figure 5(c)) or, alternatively, the region that does not include the last iteration of every tiled loop when the tile size does not divide the upper bound (see Figure 5(d)). Nevertheless, neither option is accurate: the first includes points outside the iteration space, whereas the second excludes points belonging to the iteration space.

In our aim of having an accurate model, we decided to implement the exact solution. We have modified the CMEs to deal with multiple convex regions by defining a set of equations for every convex region. When analyzing an iteration point, we use the equations corresponding to the convex region that contains it.

Let m be the number of convex regions of a loop nest after tiling. Compulsory equations are defined for each convex region, so the number of compulsory equations is increased by m. For each reuse vector, we generate a set of replacement equations for each convex region [Ghosh et al. 1999]. In addition, we generate a set of equations for every pair of convex regions that reflects the potential reuse between different regions. Thus, the number of replacement equations is increased by a factor of m^2.

2.3.2 Padding. Unlike loop tiling, padding modifies the data layout to eliminate conflict misses. It changes the data layout in two different ways: interpadding modifies the base addresses of the arrays, whereas intrapadding changes the sizes of the array dimensions.

We refer to the L1 (primary) cache size as Cs. mem_i is the original base address of variable number i (Var_i), and PBase_i stands for the intervariable padding between Var_i and Var_{i-1}. dim_ij stands for the size of dimension j of Var_i (D_i is its number of dimensions), and S_i is its size. PDim_ij is the intravariable padding applied to dim_ij, and PS_i is the size of Var_i after padding (see Figure 6). We define Δ_i as PS_i − S_i.

2.3.2.1 Intervariable Padding. When intervariable padding is applied, only the base addresses of the variables are changed, so padding is performed in a simple way. Memory variable base addresses are initially defined using the values given by the compiler. Then, for each memory variable Var_i, we define a variable PBase_i, i = 0 . . . k:

0 ≤ PBase_i ≤ Cs − 1.

Note that padding a variable results in modifying the initial addresses of the other variables (see Figure 6). Thus, after padding, the memory variable base addresses are computed as follows:

BaseAddr(Var_i) = mem_i + \sum_{k=0}^{i} PBase_k .
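A minimal sketch of this computation follows (the subroutine and its argument names are illustrative, not from the paper); it simply accumulates the intervariable pads as a running sum.

! Illustrative helper (not from the paper): base addresses after
! intervariable padding, BaseAddr(Var_i) = mem_i + sum_{k=0..i} PBase_k.
subroutine interpad_addresses(nvars, mem, pbase, baseaddr)
  implicit none
  integer, intent(in)  :: nvars
  integer, intent(in)  :: mem(0:nvars-1), pbase(0:nvars-1)
  integer, intent(out) :: baseaddr(0:nvars-1)
  integer :: i, acc

  acc = 0
  do i = 0, nvars - 1
    acc = acc + pbase(i)          ! running sum of intervariable pads
    baseaddr(i) = mem(i) + acc
  end do
end subroutine interpad_addresses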


Fig. 6. Data layout: (a) before intervariable padding, (b) after intervariable padding, (c) before padding, (d) after padding, (e) 2-D array, (f) 2-D array after intravariable padding.

2.3.2.2 Adding Intravariable Padding. The result of applying both inter- and intravariable padding is that all the base addresses and the sizes of every dimension of each memory variable may change. They are initially set according to the values given by the compiler. For each memory variable Var_i, i = 0 . . . k, we define a set of variables {PBase_i, PDim_ij}, j = 0 . . . D_i:

0 ≤ PBase_i, PDim_ij ≤ Cs − 1.

After padding, the memory variable base addresses are computed in the following way (see Figure 6):

BaseAddr(Var_i) = mem_i + \sum_{k=0}^{i-1} (PBase_k + Δ_k) + PBase_i

and the sizes of the dimensions are:

Dim_j(Var_i) = dim_ij + PDim_ij.

Figure 4(b) shows our running example after tiling and intrapadding all array dimensions.
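Figure 4(b) is not reproduced in this transcript. In Fortran source, intrapadding the running example amounts to over-declaring each array dimension by its PDim_ij while leaving the loop bounds untouched; the sketch below illustrates this under that assumption, with made-up pad-factor names.

! Illustrative sketch of the running example after tiling and
! intravariable padding (Figure 4(b) is not reproduced here; the pad
! arrays pa, pb, pc stand for the PDim_ij values the GA selects).
! Each array dimension is over-declared, while the loops still run over
! 1..n, so only the data layout changes.
subroutine matmul_tiled_padded(n, t1, t2, pa, pb, pc)
  implicit none
  integer, intent(in) :: n, t1, t2
  integer, intent(in) :: pa(2), pb(2), pc(2)   ! PDim_ij per array (assumed)
  real :: a(n + pa(1), n + pa(2))
  real :: b(n + pb(1), n + pb(2))
  real :: c(n + pc(1), n + pc(2))
  integer :: i, j, k, jj, kk

  a = 0.0; b = 1.0; c = 1.0                    ! arbitrary init for a self-contained sketch
  do kk = 1, n, t1
    do jj = 1, n, t2
      do i = 1, n
        do k = kk, min(kk + t1 - 1, n)
          do j = jj, min(jj + t2 - 1, n)
            a(i, j) = a(i, j) + b(i, k) * c(k, j)
          end do
        end do
      end do
    end do
  end do
end subroutine matmul_tiled_padded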

3. PERFORMANCE MODELING

In this section, we introduce our cost model. We first describe how we model loop tiling, padding, and branch predictor behavior. Then, we describe our cost function for estimating performance. Finally, we justify the use of a GA to traverse the solution space.


3.1 Tiling and Padding Model

We want to improve data locality through loop tiling and padding. We focus on removing capacity misses by means of loop tiling, whereas we use padding to eliminate those conflict misses that loop tiling cannot remove.

We present a compiler strategy that combines both optimizations at the same time by implementing the CMEs in a parameterized way. We assume normalized loop nests; thus, each tile factor ranges between 1 and the upper bound of the corresponding loop. For the pad factors, there is no need to consider large domains: usually, if two references do not conflict in a cache of size S, they do not conflict in a cache of size nS (larger by a factor of n) either. Therefore, we use the size of the smallest cache in the hierarchy (which in practice is L1) as the domain. Results show that even smaller domains would be enough to achieve important speedups, since pad factors are generally very small.

Our measure of locality is the number of read and write misses for each cache level. More formally stated, given a loop nest L with n normalized enclosing loops L = {l_1, . . . , l_n}, a set of tile and pad factors F, and a memory hierarchy with u levels, we define a function MCost(L, F):

MCost : Loops × Factors → (integer, integer, . . . , integer, integer)
(L, F) ↦ (rmL1, wmL1, . . . , rmLu, wmLu)

where

F = Tile Factors × Pad Factors
Tile Factors = {T_1, . . . , T_k}, k ≤ n
Pad Factors = {PBase_i, PDim_ij}, 0 ≤ i < #vars, j ≤ D_i.

Here, rmLk (wmLk) stands for the number of read (write) misses in the kth-level cache.

3.1.1 Example. Let us recall the optimized version of our running example shown in Figure 4(b), where we tile two loops and only consider intrapadding. We call the following instance of MCost to describe its locality:

MCost(L, F) = MCost({l_1, l_2, l_3}, F)

where

F = Tile Factors × Pad Factors
Tile Factors = {T_1, T_2}, 1 ≤ T_1 ≤ N, 1 ≤ T_2 ≤ N
Pad Factors = {PDim_ij}, 0 ≤ i < 3, 0 ≤ j < 2
0 ≤ PDim_ij ≤ Cs − 1.

3.2 Branch Prediction

As the issue rate and pipeline depth of high-performance superscalar processors increase, control-flow speculation becomes ever more vital to achieving the potential performance of current processors. Deeply pipelined machines suffer a serious performance degradation on prediction misses because of the large amount of speculative work that has to be discarded [Butler et al. 1991]. The latest branch predictors work fairly well for loop-closing branches, since they usually miss only on the last iteration of the loop. Even so, current processors still mispredict branches frequently in this situation. Next, we review the strategies adopted by the processors used in our study.

3.2.1 Pentium-4. It dynamically predicts the direction and target of branches using a 4K branch target buffer combined with a branch history table. If no dynamic prediction is available, it predicts statically (taken for backward looping branches, not taken for forward branches). In addition, it makes use of the trace cache to alleviate the cost of a mispredicted branch.

3.2.2 Alpha-21264. It uses a hybrid predictor that dynamically chooses between two branch predictors, both of which are two-level predictors. One of them has 1024 branch-history registers in the first level and 1024 pattern-history entries in the second level. The other has a single branch-history register in the first level and 4096 pattern-history entries in the second level. The choice predictor is implemented by means of a 4096-entry table indexed by the global branch-history register.

3.2.3 UltraSparc-III. It uses a modified gshare [McFarling 1993].

3.2.4 Itanium. It has the most sophisticated branch prediction mechanism among the processors considered in our study. It consists of a hierarchy of three levels of branch predictors, which needs significant input from the compiler in order to achieve a high hit rate [Sharangpani 2000].

Level 1. It is a branch target structure which uses hints set by the compiler to decide the branch direction taken.

Level 2. For scalar codes, the processor uses a 2-level prediction scheme. A return stack buffer provides predictions for return instructions.

Level 3. Defined as the branch address calculation and correction level, it applies a correction for the exit condition of modulo-scheduled loops. It uses a special structure that keeps track of the loop count obtained during loop initialization. However, it does not work well with tiling due to the min expressions in the loop bounds.

3.3 Branch Model

Tiling must be applied carefully because it may increase the overhead due to the complexity of the tiled code. Besides, the extra levels of loops may lead to a larger number of mispredicted branches. In order to avoid a large performance degradation due to prediction misses, we incorporate into our model the number of possibly mispredicted branches. Notice that the same scheme can be applied to model the loop iteration overhead.

Let L be a loop nest with n normalized enclosing loops L = {l_1, . . . , l_n} with upper bounds U = {U_1, . . . , U_n}, respectively. Current branch predictors may mis-speculate when loops finish their execution.


Fig. 7. LoopCost algorithm.

Thus, the number of expected mispredicted branches is:

MissPred(L) = \sum_{j=1}^{n} \prod_{i=1}^{j-1} U_i .

3.3.1 Example. Let us consider our running example with {N = 100, T1 = 20, T2 = 20}. For this particular example, the values of the pad factors are irrelevant, since they do not affect the number of mispredicted branches. The number of mis-speculated branches for the nontiled version (see Figure 2(a)) is:

\sum_{j=1}^{3} \prod_{i=1}^{j-1} 100 = 1 + 100 + 10^4 = 10101.

The number of expected mispredictions for the tiled version (see Figure 4(a)) is:

\sum_{j=1}^{5} \prod_{i=1}^{j-1} U_i = 1 + \lceil 100/20 \rceil + \lceil 100/20 \rceil \times \lceil 100/20 \rceil + \lceil 100/20 \rceil \times \lceil 100/20 \rceil \times 100 + \lceil 100/20 \rceil \times \lceil 100/20 \rceil \times 100 \times 20
    = 1 + 5 + 25 + 2500 + 5 \times 10^4 = 52531,

which is over five times as many as in the original program. Thus, in order for the tiled code to run faster than the original one, we must obtain an important reduction in the number of misses to compensate for this overhead.
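As a quick check of the formula, the short program below (a sketch; the loop bounds are taken from this example) re-evaluates MissPred for the nontiled and tiled bounds and reproduces the two totals above.

! Minimal sketch of MissPred(L) = sum_{j=1..n} prod_{i=1..j-1} U_i,
! checked against the two values worked out in this example.
program misspred_check
  implicit none
  integer :: nontiled(3), tiled(5)

  nontiled = (/ 100, 100, 100 /)
  tiled    = (/ 5, 5, 100, 20, 20 /)   ! ceiling(100/20) = 5 for the two tile loops
  print *, misspred(3, nontiled)        ! prints 10101
  print *, misspred(5, tiled)           ! prints 52531
contains
  integer function misspred(n, u)
    integer, intent(in) :: n, u(n)
    integer :: j, prefix
    misspred = 0
    prefix   = 1                        ! empty product for j = 1
    do j = 1, n
      misspred = misspred + prefix
      prefix   = prefix * u(j)
    end do
  end function misspred
end program misspred_check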

3.4 Cost Model

Once we account for the misses across all the different levels and for the loop tiling overhead due to mispredicted branches, we can calculate the loop cost. In Figure 7, we give a detailed description of our cost model function, LoopCost.


Table I. Weights to Calculate LoopCost for a Particular Architecture

                 µR1   µW1   µR2   µW2   µMP
Pentium-4         24    24   150   150    20
Alpha-21264        6     6    84    84     7
UltraSparc-III    10    10   100   100     8
Itanium           21    21   117   117    15

MCost calculates the locality of the loop nest L given the set of tile and pad factors, that is, the number of read and write misses for each cache level. MissPred estimates the number of mispredicted branches that the branch predictor may incur when executing loop nest L. Then, LoopCost calculates the total cost of executing L. It simply adds up all the different misses and the number of mispredictions, weighting each value by its relative cost.

The values of the different µ weights depend on the considered architecture. Most current processors have out-of-order execution and nonblocking caches, which makes the different penalties vary at execution time. In Table I, we give the relative costs used to model the architectures in our study when optimizing for execution time. Cache latencies and misprediction penalties4 have been obtained from the vendors' specifications. The memory latency is calculated assuming an SDRAM with a memory access time of 160 ns. In order to have a more accurate model, the average penalties may be calculated empirically, which may translate into better transformed codes.
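Figure 7 itself is not reproduced in this transcript. The sketch below gives an illustrative two-level instance of such a cost function (names and signature are made up); for example, with the Pentium-4 row of Table I one would pass µR1 = µW1 = 24, µR2 = µW2 = 150, and µMP = 20.

! Illustrative two-level instance of the cost function (Figure 7 is not
! reproduced here): misses per level and expected mispredictions are
! weighted by the relative costs of Table I.
real function loop_cost(rm1, wm1, rm2, wm2, mpred, &
                        mu_r1, mu_w1, mu_r2, mu_w2, mu_mp)
  implicit none
  integer, intent(in) :: rm1, wm1, rm2, wm2, mpred
  real, intent(in)    :: mu_r1, mu_w1, mu_r2, mu_w2, mu_mp

  loop_cost = mu_r1 * rm1 + mu_w1 * wm1 &   ! L1 read and write misses
            + mu_r2 * rm2 + mu_w2 * wm2 &   ! L2 read and write misses
            + mu_mp * mpred                 ! expected mispredicted branches
end function loop_cost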

3.5 Compiler Strategy

The main objective of a compiler strategy is to determine which transformation to apply. In our case, the main concern is to decide which tile and pad factors yield the best results. In this section, we explain how we choose the parameter values guided by our cost model.

We want to find a set of tile and pad factors that minimizes the LoopCost of a loop nest L. More formally stated:

MIN LoopCost(L, ML, F)

1 ≤ T_k ≤ U_k

0 ≤ PBase_i, PDim_ij ≤ Cs − 1

where LoopCost is called the objective function.

3.6 Choosing Tile and Pad Factors

In this section, we use results from the CMEs and from compiler theory to reason about the complexity of achieving optimal tile and pad factors.

3.6.1 Complexity: CMEs. CMEs [Ghosh et al. 1999] describe cache behavior by means of Diophantine equations.

4 It is basically the length of the pipeline.


Each set of equations defines a bounded convex polyhedron. Obtaining the number of misses is equivalent to counting the number of integer points inside those polyhedra. However, counting the integer points in a general polyhedron has the same complexity as deciding whether a solution exists to a system of equalities and inequalities, which is NP-complete [Banerjee 1988]. So is the problem of computing the volume (i.e., the number of integer points) of a polyhedron [Dyer and Frieze 1988].

The number of integer points inside a polyhedron can be enumerated by means of Ehrhart polynomials [Clauss 1996], which are pseudo-polynomials whose coefficients are, in the general case, NP-hard to compute. Thus, the parameterized CMEs form a pseudo-polynomial function. Summing up, LoopCost is a pseudo-polynomial function, and hence the relationship between tiling, padding, and the number of misses is nonlinear.

Tile and pad factors can only take integer values; thus, the problem of optimizing LoopCost can be seen as a nonlinear integer optimization (NLP) problem, which again, in the general case, is NP-hard [Horst et al. 1995; Vavasis 1991].

3.6.2 Complexity: Compiler Theory. It is common knowledge that choosing the best tile sizes is very hard, and it is considered to be a very difficult problem. However, no proof has been published that it is an NP-hard problem.

Many researchers have spent much effort looking for the best data layout. Recently it has been proved that "unless P == NP there is no efficient optimal algorithm for data placement that minimizes the number of misses" [Petrank and Rawitz 2002]. That is, choosing the best pad factors is an NP-hard problem. Since we try to compute both tile and pad factors at the same time, minimizing LoopCost is an NP-hard problem as well.

3.7 How to Solve Nonlinear Integer Problems

One of the challenges in NLP is that some problems exhibit local minima. Algorithms proposed to overcome this problem are known as global optimization. Real-valued functions have been studied deeply [Gill et al. 1981; Horst et al. 1995; Torn and Zilinskas 1989]. Unfortunately, integer functions are hard to optimize.

There are some studies based on {0,1}-valued integer functions [Hansen et al. 1995], but in general this is a hard and time-consuming problem. Hence, the use of heuristics to traverse the solution space is necessary. Tabu search [Glover and Laguna 1997] obtains promising theoretical results, but only partial implementations have been reported so far. On the other hand, simulated annealing [Kirkpatrick et al. 1983] and genetic algorithms [Goldberg 1989; Holland 1975] have been used for years with very good results for many problems.

3.7.1 Why a Genetic Algorithm? The majority of research in optimization via high-level restructuring has relied on smart heuristics and very simple models [Carr and Kennedy 1992; Coleman and McKinley 1995; Lam et al. 1991; Rivera and Tseng 1998a; Temam et al. 1993; Wolf and Lam 1991], managing to improve program performance significantly. Current results in compiler theory [Petrank and Rawitz 2002] point out two important practical issues:


Fig. 8. Different implementations of a Genetic Algorithm.

(i) the use of heuristics is a must, and (ii) the preservation of information is critical to finding a good solution.

Our proposal is based on the use of a very accurate cost model, thus reducing the loss of information. We then use a heuristic, in this case a genetic algorithm, to optimize the LoopCost function. According to Petrank and Rawitz [2002], the only efficient way to evaluate the potential of our method is to compare it with previous ones. Our experimental results show that, with a small and reasonable compile-time overhead, our method outperforms all previous approaches for all benchmarks running on a variety of modern architectures.

4. IMPLEMENTING A GENETIC ALGORITHM

Algorithms for function optimization are generally limited to convex regular functions. However, many functions are not continuous, are nondifferentiable, or are multimodal. It is common to solve these problems by means of stochastic sampling. Whereas traditional search techniques use characteristics of the problem to determine the next sampling point (e.g., the gradient), stochastic methods use nondeterministic decision rules [Ermoliev and Wets 1988].

Genetic algorithms (GAs) are a particular type of stochastic method that has been used to solve hard problems whose objective functions do not meet the properties required by traditional methods [Goldberg 1989]. These algorithms search the solution space of a function by simulating the nature-based process of evolution, that is, the survival of the fittest. Usually, the fittest individuals tend to reproduce more than the inferior individuals, and they survive into the next generation, propagating the best genes.

GAs simulate the evolution of a population. Figure 8(a) shows the simplest GA. It starts from a randomly generated population.


Then, it evolves the population by means of basic genetic operators (selection, mutation, and crossover) [Goldberg 1989] applied to individuals of the current population to produce an improved next generation.

Next, we explain how we implemented the different genetic operators and our representation of the tile and pad factors.

4.1 Genetic Algorithm Parameters

The use of GAs requires the determination of the following issues: the chromosome representation, the selection function, the genetic operators, the creation of the initial population, and the termination criteria.

Each individual is made up of a set of chromosomes which represent the variables. In our work, each individual is one configuration of tiling/padding (identified by all the tile factors and the inter- and intravariable pad factors). Each chromosome represents one single factor, either a tile or a pad factor. The fitness of the individuals is computed using the objective function (in our case, LoopCost in Figure 7). The fittest individual is the one whose set of tile and pad factors results in the smallest cost according to our cost function LoopCost.

A chromosome representation is needed to represent each individual in the population. Genetic algorithms require the natural parameter set of the optimization problem to be coded as a finite-length string over some finite alphabet, such as the alphabet {0,1}. Thus, each chromosome is made up of a sequence of genes from a certain alphabet.

4.1.1 Representing Tile Factors. It has been shown that using larger alphabets gives better results [Michalewicz 1994]. For representing the tile factors, we have experimentally observed that using the alphabet {00, 01, 10, 11} produces good results.

The function that transforms chromosome values into tile sizes is not the identity function. Tile factor T_i can take any value in the range [1 . . . U_i]. On the other hand, a chromosome is represented by a sequence of genes encoded in a binary representation. Thus, each chromosome is represented by a value in the range [0 . . . 2^k − 1], where k is ⌈log2 U_i⌉. If k is an odd number, it is increased by 1 due to the alphabet we have used to represent genes. Thus, there are more values in the representation range of a chromosome than possible tile size values. Therefore, we need a function that maps values from the domain [0 . . . 2^k − 1] to the range [1 . . . U_i].

Let g be the function that gives the tile factor for each possible value of a chromosome. We define it as follows:

g : [0 . . . 2^k − 1] → [1 . . . U_i]

g(x) = \lfloor x (U_i − 1) / (2^k − 1) \rfloor + 1

where k = ⌈log2 U_i⌉ (+1 if odd) and x ∈ [0 . . . 2^k − 1].


Fig. 9. Example of mapping between representation values and tile factors.

Figure 9 illustrates an example of how this function works. It can be seen that every possible tile factor has at least one representation.

Example. Let us codify two tile factors {T_1, T_2} for two nested loops with upper bounds {U_1 = 40, U_2 = 100}. Each tile factor is represented by one chromosome. Thus, the first chromosome is represented by 3 genes (and function g_1), and the second one by 4 genes (and the corresponding function g_2). The values 27 (011011) and 74 (01001010) then correspond to the tile factors 17 (g_1(27) = 17) and 29 (g_2(74) = 29), respectively, and are represented by the following genes:

chromosome 1: 01 | 10 | 11  (genes 1, 2, 3)
chromosome 2: 01 | 00 | 10 | 10  (genes 1, 0, 2, 2).
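As a sanity check of the mapping, the short program below (a sketch, not the paper's code) recomputes the two values used in this example.

! Sketch of the chromosome-to-tile-factor mapping
!   g(x) = floor(x * (Ui - 1) / (2**k - 1)) + 1, k = ceil(log2(Ui)) (+1 if odd),
! reproducing g1(27) = 17 (U = 40) and g2(74) = 29 (U = 100).
program tile_mapping_check
  implicit none
  print *, g(27, 40)    ! prints 17, as g1(27) in the example
  print *, g(74, 100)   ! prints 29, as g2(74) in the example
contains
  integer function g(x, u)
    integer, intent(in) :: x, u
    integer :: k
    k = ceiling(log(real(u)) / log(2.0))   ! for exact powers of two an
    if (mod(k, 2) /= 0) k = k + 1          ! integer log2 would be safer
    g = (x * (u - 1)) / (2**k - 1) + 1     ! integer division acts as floor
  end function g
end program tile_mapping_check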

4.1.2 Representing Pad Factors. Representing pad factors is easier than representing tile factors, because each pad factor belongs to the range [0 . . . Cs − 1].5 Thus, we have used the alphabet {0, . . . , 2^t − 1}, where t is the greatest divisor of log2 Cs that is lower than log2 Cs. That is, t is the largest value that guarantees that a single pad factor consists of at least two genes for every cache size. This is not a restriction, because the compiler knows the cache size, so this computation can be done automatically.

Example. Let us assume a 32 KB cache. Thus, log2(32 × 2^10) = 15. The set of divisors is {1, 3, 5, 15}. Hence, the greatest divisor less than 15 is 5, and we use the alphabet {0, . . . , 31}, representing each single pad factor with 3 genes. For instance, a pad factor of 10017 is represented by the following three genes:

01001 | 11001 | 00001  (genes 9, 25, 1).
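The following short program (a sketch under the same 32 KB assumption) reproduces this decomposition of 10017 into base-32 genes.

! Sketch of the pad-factor encoding for a 32 KB cache: log2(Cs) = 15,
! its largest proper divisor is t = 5, so genes are base-32 digits and
! a pad factor uses 15/5 = 3 genes. Reproduces 10017 -> (9, 25, 1).
program pad_encoding_check
  implicit none
  integer, parameter :: t = 5, ngenes = 3, base = 2**t
  integer :: genes(ngenes), pad, i

  pad = 10017
  do i = ngenes, 1, -1            ! extract digits, least significant last
    genes(i) = mod(pad, base)
    pad = pad / base
  end do
  print *, genes                   ! prints 9 25 1, matching the example
end program pad_encoding_check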

5 If the cache size is not a power of 2, we consider the largest cache size that is a power of 2 and smaller than the considered cache.


Fig. 10. Schematic of simple crossover.

4.1.3 Genetic Operators. Genetic operators provide the basic search mechanism of GAs, creating new solutions based on the solutions that already exist. The selection of the individuals that produce successive generations plays an extremely important role. A common selection approach assigns a probability of selection to each individual depending on its fitness: individuals with higher fitness have a higher probability of contributing one or more offspring to the next generation, and individuals are then selected according to this probability. Let us have a population of size N (i.e., a population with N individuals). A selection scheme consists of choosing N individuals from the N individuals of the previous generation. We have adopted one of the selection schemes that gives better results, known as remainder stochastic selection without replacement [Goldberg 1989]. This selection scheme allows an individual to be chosen more than once, so that the best individuals contribute more offspring.

The next step consists of pairing the chosen individuals and applying crossover. Crossover takes two individuals and, with a given probability, produces two new individuals by merging their genetic material at a random point (named the cross site). In case they do not cross over, both individuals are added to the new population (see Figure 10). Finally, mutation changes one individual to produce a new one by flipping some of its genes. Both the crossover probability and the mutation probability have to be determined empirically, and are related to the size of the population. For a complete example, we refer the interested reader to Appendix A.

4.1.4 Convergence Criterion. The GA must be provided with an initial population (see Figure 8(a)), which is created randomly. GAs move from generation to generation and, even though other criteria can be used [Goldberg 1989], the usual termination criterion is the number of generations.

Our experiments have shown that an initial population of size 30, with a crossover probability of 0.9 and a mutation probability of 0.005, gives near-optimal results in most cases after 15 generations. In some other cases, however, the near-optimal results are obtained after between 15 and 25 generations. Figure 8(b) shows our particular GA, where converge() is a function that decides when the population is homogeneous enough. We consider that a population has converged when the best individual has a cost within 2% of the average of its generation. We have observed, for the evaluated loops, that this convergence criterion is only achieved when the population is close to the optimum.
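One plausible reading of this test, sketched below with illustrative names (the paper's converge() routine is not shown):

! Sketch of the convergence test: the population is considered converged
! when the best individual's cost is within 2% of the generation average.
logical function converged(npop, cost)
  implicit none
  integer, intent(in) :: npop
  real, intent(in)    :: cost(npop)
  real :: best, avg

  best = minval(cost)
  avg  = sum(cost) / real(npop)
  converged = (avg - best) <= 0.02 * avg
end function converged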


Table II. Processors Used for the Experimentation. Cs Stands for Cache Size in KB, Ls Stands for Cache Line Size in Bytes, and K Stands for the Degree of Associativity

Processor        Freq.     L1 (Cs, Ls, K)   L1 Replacement   L2 (Cs, Ls, K)
Pentium-4        1.6 GHz   (8, 64, 4)       LRU              (512, 128, 8)
Alpha-21264      525 MHz   (64, 64, 2)      FIFO             (4096, 64, 4)
UltraSparc-III   750 MHz   (64, 32, 4)      Random           (8192, 512, 4)
Itanium          800 MHz   (96, 64, 6)      LRU              (2048, 64, 4)

5. EXPERIMENTAL FRAMEWORK

Let us recall our experimental framework in Figure 1. We implemented the analysis to be as general as possible, so the compiler is written using the SUIF2 internal representation, which can be generated from different front-ends. We use SUIF2 to collect all the information about memory accesses and control flow.

The key component is the one that computes and solves the equations, which describe the cache behavior. We have implemented the CMEs following the techniques outlined in our previous work [Bermudo et al. 2000; Vera et al. 2000], and have chosen a confidence interval width of 0.1 and a 90% confidence level, which proved to be enough to guide our optimizations. The genetic algorithm has been implemented following the techniques described in Section 4.

To evaluate our method, we have implemented our algorithms, executed the original and transformed versions, and collected the number of misses for the different cache levels. For the sake of comparison, all kernels are compiled with "g77 -O3". All kernels can be found in the Appendix. We measured execution times on four modern architectures: Pentium-4, Alpha-21264, UltraSparc-III, and Itanium. Table II shows their memory configurations. The actual numbers of misses when executing programs on the Pentium-4 platform are obtained by means of the performance counters; we have measured the events L1 load misses retired and L2 load misses retired. Otherwise, miss counts are obtained by means of the CMEs.

In order to evaluate our ability to improve data locality, we start by studying padding and tiling separately. Then, we combine both of them and report results for a set of kernels.

5.1 Padding

Since the objective of padding is to remove conflict misses, we optimized those programs from SPECfp95 that present a high number of conflict misses, namely TOMCATV and SWIM [Fernandez 1999]. In addition, their miss ratio is highly affected by the cache size. We have chosen the most time-consuming loop nests from each program, which in total represent between 90% and 100% of the whole execution time, using the reference input data.

A fully associative cache has been evaluated as a reference point to estimate the amount of conflict misses that are not removed by the padding technique. In order to measure our ability to improve locality, we compare our padding algorithm with Rivera and Tseng's state-of-the-art technique [Rivera and Tseng 1998a, 1998b].


Table III. Description of the Kernels Used for Evaluating Tiling

Name     Description
MATMUL   Matrix multiplication
MATVEC   Matrix-vector multiplication
T2D      2D matrix transposition
ADI      2D ADI integration
VPENTA   Invert 3 pentadiagonals

Table IV. Average Miss Ratios for TOMCATV and SWIM for a Set of Direct-Mapped Caches. Cache Line is 32 B

Program   Cache Size   No Padding (%)   Inter-Padding (%)
TOMCATV   32 KB         9.6              8.8
          16 KB        14.8             11.8
           8 KB        46.0             21.6
           4 KB        72.1             52.0
SWIM      32 KB         8.1              7.1
          16 KB        28.8              7.2
           8 KB        62.9              7.8
           4 KB        77.9              8.2

5.2 Tiling

An overview of the five kernels that have been evaluated can be seen in Table III. For all of them, we have studied a set of different problem sizes, which are explained in the different experiments. All kernels are written in FORTRAN and drawn from different benchmarks (NAS,6 LIVERMORE). We chose these kernels because they exhibit a high number of capacity misses.

We also determined the effectiveness of our method by comparing it with other methods that represent the state of the art:

—lrw: Lam et al. [1991] choose the largest nonconflicting square tile.
—tss: Rivera and Tseng [1999a] extend Coleman and McKinley's [1995] Euclidean GCD algorithm.

6. EXPERIMENTAL RESULTS

In this section, we evaluate our approach. We first present results for a set of experiments where padding and tiling are applied in isolation. Then, we analyze the efficiency of our approach, which consists of applying both of them. In order to evaluate our ability to improve data locality, we compare against state-of-the-art techniques.

6.1 Padding

Table IV shows, for the two programs analyzed, the miss ratio of a direct-mapped cache before and after applying intervariable padding. Since the objective of padding is to eliminate conflict misses, intervariable padding provides a huge improvement in miss ratio for TOMCATV and SWIM.

6 Numerical Aerospace Simulation Facility (by NASA Ames Research Center).


Note that for both programs, only a small improvement is obtained for a 32 KB cache. This is because almost no conflicts arise for caches of 32 KB or larger for these programs, due to the relatively small working set of the SPECfp95 applications. However, the smaller the cache, the higher the miss ratio and the bigger the improvement that intervariable padding obtains.

For the SWIM program, the miss ratio grows from 8.1% to 28.8%, 62.9%, and 77.9% when the cache is reduced from 32 KB to 16 KB, 8 KB, and 4 KB, respectively. However, when we apply intervariable padding, the miss ratio remains almost constant (7.1%, 7.2%, 7.8%, and 8.2%, respectively). This is because most of the misses of this program are caused by conflicts between different data structures (intervariable conflict misses), and the algorithm practically obtains the optimal padding among them.

For the TOMCATV program, the miss ratio also grows significantly when the cache size is reduced (9.6%, 14.8%, 46.0%, and 72.1%, respectively, for the different cache sizes). In this program, we also obtain a considerable improvement when applying intervariable padding for caches smaller than 32 KB. However, the miss ratio after intervariable padding varies significantly with the cache size (8.8%, 11.8%, 21.6%, and 52%). This variation is caused by intravariable conflict misses (e.g., conflicts among distinct rows and columns of the same array), whose frequency also grows when the cache is reduced. Intervariable padding does not remove this latter type of conflicts, which are the target of intravariable padding.

Figure 11 details the miss ratio for the main loop nests of TOMCATV and SWIM (note the different scales for the different cache sizes). The figure shows the miss ratio for each loop before and after applying intervariable padding. It also shows the miss ratio for a fully associative cache.

For the SWIM program, loop nest 1 shows practically no improvement from intervariable padding (except for a slight improvement due to alignment) because it has no conflict misses. Note also that this loop nest has almost the same miss ratio regardless of the cache size. On the other hand, loop nests 2 to 6 have an extremely large miss ratio. As an extreme case, loop nest 2 has a miss ratio close to 100% for a 4 KB cache, which intervariable padding reduces to 11.8%. Note that intervariable padding removes all the conflict misses for all SWIM loops, since the miss ratio after intervariable padding and the fully associative miss ratio are practically identical.

The TOMCATV program has several loop nests that deserve special comment. For the 32 KB and 16 KB caches, the proposed intervariable padding technique practically removes all conflict misses. The higher miss ratio shown by the fully associative cache for some kernels is due to some pathological cases. We use the difference between the miss ratio of a given cache and that of a fully associative cache as an estimate of the amount of conflict misses that are not removed by the padding technique. We assume an LRU replacement policy, which is widely used in practice, unlike the Belady algorithm [Belady 1966], which is optimal but for which only off-line implementations are known. In this case, one may find situations where the miss ratio of a fully associative cache is higher than that of a direct-mapped cache. For instance, assume a loop nest that traverses multiple times an array whose size is equal to the cache size plus one line, with a line size equal to one array element.


Fig. 11. Miss ratio before and after intervariable padding for a set of direct-mapped caches.

the line size is equal to one array element. The LRU fully associative cache will miss on every access, whereas a direct-mapped cache will hit for all the accesses except those corresponding to the first and the last line.
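The following stand-alone C sketch makes this pathological case concrete; the 8-line cache capacity, the number of passes, and the simulation code itself are illustrative assumptions, not part of our framework.

#include <stdio.h>
#include <string.h>

#define LINES  8            /* cache capacity in lines (assumed)           */
#define N      (LINES + 1)  /* array size: cache size plus one line        */
#define PASSES 4            /* the array is traversed several times        */

int main(void) {
    int dm_tag[LINES];              /* direct-mapped: one tag per set      */
    int fa[LINES], fa_n = 0;        /* fully assoc. LRU: front = LRU       */
    int dm_miss = 0, fa_miss = 0;

    memset(dm_tag, -1, sizeof dm_tag);

    for (int p = 0; p < PASSES; p++) {
        for (int i = 0; i < N; i++) {
            /* Direct-mapped lookup: element i maps to set i % LINES.      */
            if (dm_tag[i % LINES] != i) { dm_miss++; dm_tag[i % LINES] = i; }

            /* Fully associative LRU lookup.                               */
            int hit = -1;
            for (int k = 0; k < fa_n; k++)
                if (fa[k] == i) { hit = k; break; }
            if (hit < 0) {                          /* miss                */
                fa_miss++;
                if (fa_n == LINES) {                /* evict LRU at fa[0]  */
                    memmove(fa, fa + 1, (LINES - 1) * sizeof(int));
                    fa[LINES - 1] = i;
                } else {
                    fa[fa_n++] = i;
                }
            } else {                                /* hit: promote to MRU */
                memmove(fa + hit, fa + hit + 1, (fa_n - 1 - hit) * sizeof(int));
                fa[fa_n - 1] = i;
            }
        }
    }
    /* Direct-mapped: N misses on the first pass, then only the first and
     * last elements of the array miss.  LRU fully associative: every
     * access misses, i.e., N * PASSES misses.                             */
    printf("direct-mapped misses:    %d\n", dm_miss);
    printf("fully associative (LRU): %d\n", fa_miss);
    return 0;
}

With LINES = 8 and PASSES = 4, the sketch reports 15 direct-mapped misses versus 36 fully associative misses, matching the behavior described above.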

For the 8 KB cache, intervariable padding removes all conflict misses from all loop nests except for loop 1. In this case, intervariable padding reduces the miss ratio from 53.6% to 29.2%, but not all conflict misses are removed since the fully associative miss ratio is 11.4%. An analysis of this loop shows that there are also intraconflict misses.

In the case of a 4 KB cache, intervariable padding achieves about the same miss ratio as a fully associative cache for loop nests 2 and 5. As a noticeable case, the miss ratio of loop 5 has been reduced from 42.3% to 15.8%. For the other loop nests there is a significant improvement, but the miss ratio is still far from that of the fully associative cache. An analysis of these three loop nests revealed that most of the remaining misses are intravariable conflict misses.

6.1.1 Intravariable Padding. The objective of intravariable padding is to eliminate those intravariable conflict misses that intervariable padding cannot remove.
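As a concrete, simplified illustration of the two padding flavors, the following C declarations sketch what each transformation does to the data layout; the array sizes, the 4 KB direct-mapped cache, and the pad amounts are assumptions chosen for the example, not values produced by our algorithm.

#define N 256   /* 256 doubles per row = 2 KB, so in a 4 KB direct-mapped
                   cache rows i and i+2 of A map to the same cache sets   */

/* Original layout: traversing A column-wise (stride N) thrashes the cache,
 * and A and B may also conflict with each other.                         */
double A[N][N], B[N][N];

/* Intervariable padding: shift the relative base addresses of the arrays
 * by inserting a pad between them (the struct forces adjacency).         */
struct padded_vars {
    double A[N][N];
    char   pad[3 * 64];          /* three 64-byte cache lines (assumed)   */
    double B[N][N];
};

/* Intravariable padding: pad the innermost dimension so that rows (and the
 * columns visited with stride N+8) no longer map to the same sets.       */
double A_padded[N][N + 8];       /* 8 extra doubles = one 64-byte line per row */

In our framework the pad amounts are not fixed by hand like this; they are the pad factors selected by the genetic algorithm under the guidance of the cost model.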

We have shown that TOMCATV is the only program that has a significant intravariable conflict miss ratio, in particular for caches of 4 KB and 8 KB. Figure 12 shows the miss ratio for the different loop nests of the TOMCATV program. The figure shows the miss ratio for each loop after applying inter- and


Fig. 12. Miss ratio for different TOMCATV loop nests before and after inter- and intravariable padding for two direct-mapped caches.

Fig. 13. L1 and L2 miss ratios before and after intra-padding for the Pentium-4.

intravariable padding. It also shows the miss ratio before padding and that of a fully associative cache. As we observed before, intervariable padding does not remove all conflict misses because there are intraconflict misses. Intravariable padding achieves about the same miss ratio as the fully associative cache, which means that the proposed padding algorithm removes practically all conflict misses.

6.1.2 Miss Ratio Results. We have obtained the miss ratios for the most significant loop nests from TOMCATV and SWIM after running them on the Pentium-4 platform (see Table II for the configuration of the machine used). Figure 13 shows the miss ratios for both L1 and L2 caches before and after applying intrapadding. We use the miss penalties shown in Table I to calculate the MCost (see Section 3.4) for each loop nest. Intravariable padding reduces the average

7 L2 misses are calculated with respect to the total number of memory accesses.


Table V. Problem Sizes for Evaluating Tiling

(a) Problem sizes for evaluating the reduction in the number of capacity misses.

Name     Problem Size
MATMUL   100, 200, 500, 1000, 2000
MATVEC   100, 200, 500, 1000, 2000
T2D      100, 200, 500, 1000, 2000
ADI      100, 200, 500, 1000, 2000
VPENTA   128

(b) Problem sizes (i = 0 ... 14) for evaluating execution time.

Name     Size 1       Size 2
MATMUL   400 + 50i    1000 + 50i
MATVEC   500 + 43i    1000 + 43i
T2D      2000 + 53i   4000 + 53i
ADI      2000 + 53i   4000 + 53i
VPENTA   1028 + 47i   2056 + 47i

miss penalty for the TOMCATV program by 12.4%, whereas it reduces the average miss penalty by 140.8% for the SWIM program.

6.1.3 Performance Results. Figure 14 shows the run-time improvements. We have executed the original and padded versions on the four considered platforms (see Table II). Notice the different scales for each chart. We have also compared our approach to select pad sizes with Rivera and Tseng's [1998a] algorithm. The first column presents the speedups achieved running Rivera and Tseng's method. We use the best result yielded by their two approaches, PAD and PADLITE. The second column shows the speedups obtained by our approach. The overall speedup has been obtained by adding the execution time of all loops. We observe that in all cases our approach performs better, with relative speedups over Rivera and Tseng's ranging between 1% and 227% for the TOMCATV program and 4% and 66% for the SWIM program.

6.2 Tiling

We now present results for our evaluation of tiling.

6.2.1 Removing Capacity Misses. We first show the ability of tiling to reduce the number of capacity misses.8 In order to investigate that, we have evaluated the replacement miss ratio of the studied kernels for the set of problem sizes shown in Table V(a). We do not include the compulsory misses since tiling does not change them. Unlike the SPECfp95 programs, we can change the size of the working sets. Thus, we use bigger caches and bigger working sets according to today's workloads.

We show the results in Figure 15. We observe that tiling practically removes all replacement misses for almost all kernels. However, the replacement miss ratio obtained for VPENTA is still rather high for all cache sizes due to conflict misses. To confirm this intuition, we have applied tiling and padding to this kernel, which removes all replacement misses (i.e., conflict and capacity misses). This case illustrates the need for applying tiling and padding in order to remove both conflict and capacity misses.
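For reference, a minimal C99 sketch of what loop tiling does to the MATMUL kernel is shown below (IJK order with the J and K loops tiled); the tile factors T1 and T2 stand in for the ones selected by the cost model and the genetic algorithm, and this is not the exact code generated in the experiments.

/* C must be zero-initialized by the caller; each (jj, kk) tile accumulates
 * a partial product into C so that the A and B sub-blocks being reused
 * stay resident in the cache.                                            */
void matmul_tiled(int n, int T1, int T2,
                  double A[n][n], double B[n][n], double C[n][n]) {
    for (int jj = 0; jj < n; jj += T1)
        for (int kk = 0; kk < n; kk += T2)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jj + T1 && j < n; j++) {
                    double sum = C[i][j];
                    for (int k = kk; k < kk + T2 && k < n; k++)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;
                }
}

The two extra loop levels are also the source of the loop tiling overhead discussed next: every tile boundary adds loop-exit branches that may be mispredicted.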

6.2.2 Loop Tiling Overhead. We now illustrate the importance of considering the loop tiling overhead. We show that without an accurate estimate of

8 CMEs' replacement misses account for both capacity and conflict misses.


Fig. 14. Speedups of the padded versions compared to the original programs. The marker stands for the relative speedup of our method compared to Rivera and Tseng's.


Fig. 15. Miss ratio before and after tiling for a set of direct-mapped caches. Cache line is 32 B.

Fig. 16. Impact of branch miss-prediction overhead for the Pentium-4 processor. Results are normalized to our estimated penalty, µMP = 20.

the penalty of miss-predicted branches, the degradation in performance can be severe.

The results of our set of experiments are shown in Figure 16. In order to prove the importance of considering branch predictor behavior, we have analyzed different penalty values for the Pentium-4 processor for the problem sizes shown in the second column of Table V(b). For the sake of comparison, we only consider the effects of tiling; we have run our approach for obtaining the best tile sizes considering different values of µMP (which is the estimated miss-predicted branch cost, see Section 3.4), and compared the execution times to that of the selected penalty (µMP = 20). We present results in terms of slowdowns. We can see that, in general, execution time converges smoothly to the estimated


Fig. 17. Run-time information of the three different tiling algorithms for the execution on the Pentium-4 platform.

Fig. 18. Speedup obtained by our approach compared with lrw and tss algorithms.

penalty, which confirms our intuition. When the penalty is set to small values, the degradation in performance may be very important (up to 34% for MATMUL). This is because we generate tiles that are very small in order to minimize memory penalty, though incurring a high overhead due to the increased number of miss-predicted branches. On the other hand, if we set large penalty values, we prioritize the branch overhead; thus, tiles are bigger but we incur more misses, which can degrade performance.
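The tradeoff can be summarized with a toy version of the cost function; the formulas and constants below are illustrative assumptions only (the real MCost and MissPred terms are defined in Section 3.4), but they show why the selected tile size moves with µMP.

#include <stdio.h>

int main(void) {
    const double n            = 1000.0;  /* iterations per loop dimension   */
    const double miss_penalty = 200.0;   /* cycles per cache miss (assumed) */
    const double mu_MP        = 20.0;    /* mispredicted-branch cost        */
    const double fit_limit    = 64.0;    /* largest tile that still fits    */

    for (double T = 4.0; T <= 256.0; T *= 2.0) {
        /* Toy stand-ins: misses stay low while the tile footprint fits in
         * the cache and grow with T afterwards; the number of hard-to-
         * predict loop-exit branches grows with the number of tiles.      */
        double est_misses  = (T <= fit_limit) ? n * n / 8.0
                                              : n * n * (T / fit_limit) / 8.0;
        double est_mispred = (n / T) * (n / T) + (n / T);
        double cost = est_misses * miss_penalty + est_mispred * mu_MP;
        printf("T = %3.0f  estimated cost = %.3e cycles\n", T, cost);
    }
    return 0;
}

Under these toy assumptions the estimated cost is minimized at an intermediate tile size: shrinking µMP pushes the minimum toward very small tiles (more mispredicted branches at run time), while inflating it pushes the minimum toward large tiles (more cache misses).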

6.2.3 LoopCost Results. We have obtained the miss ratios for the different problem sizes from MATMUL, as shown in the second column of Table V(b), after running them on the Pentium-4 platform. Figure 17 shows the number of L1


Fig. 19. Speedups for 5 different loop orders of the MATMUL kernel.

and L2 misses for lrw and tss (see Section 5.2 for a description) normalized to those of our approach. We also show the number of modeled miss-predicted branches (MissPred in Section 3.4) based on the selected tile sizes. Note that the resulting tiled code of our approach is always the fastest (see Figure 18).

Since our approach tries to optimize overall performance, it does not focus on removing L1 cache misses alone. Instead, it takes into account all the different factors mentioned before, that is, L1 and L2 misses and the number of expected miss-predicted branches. Thus, we can observe that the other approaches may yield fewer misses at the L1 or L2 cache level, or have fewer miss-predicted branches. However, our approach considers all these factors together, and the resulting program runs faster.

6.2.4 Performance Results. While lrw and tss can be applied to any loop nest, they were originally designed for programs involving matrix operations and, especially, for tiling matrix multiplication.

In this section, we compare our tile selection approach with them for the matrix multiplication kernel. We present results for five possible loop orders: IKJ, JIK, JKI, KIJ, and KJI (the remaining IJK order is used in the next section). We use 15 different matrix sizes:

N = 1000 + 53i, 0 ≤ i ≤ 14.

Figure 19 shows the average speedups of our method (only tiling is applied) compared to lrw and tss. The average is computed by adding the execution times of all the different loop orders. For obtaining these results, we have run all the different approaches to select the best tile sizes for each platform. Then, we have executed the tiled versions, measuring the actual execution time. The results show that our approach outperforms these two techniques significantly, by up to 310% for tss on the Alpha processor. We also show that our approach is better than both techniques on all platforms, with improvements ranging between 5% and 310%.

6.3 Tiling and Padding

We now present results for a set of common kernels that may benefit from tiling and padding. We showed the problem sizes in Table V(b). The second column shows the sizes considered for the Pentium-4, Alpha-21264, and UltraSparc-III, whereas the third column shows the sizes used for experimenting on the


Itanium machine. We chose different sizes for the Itanium in such a way that blocking could be useful to enhance performance.

In order to see the effectiveness of our method, we have compared our approach to select tile and pad factors with lrw and tss. Overall results are obtained by adding the execution times for all programs and problem sizes. Figure 18 shows, for each machine, to what extent our method is better in terms of execution time of the optimized codes.

We first consider the results where only loop tiling is applied. For each program, the first two bars report the speedup compared to lrw and tss, respectively. In all cases, our method yields better results than previous approaches. Our ability to select tile factors results in important run-time improvements; on average, our transformed code runs 8% and 49% faster on a Pentium-4 compared to lrw and tss. On the Alpha machine, results are even more impressive, with average speedups of 63% and 195%, respectively.

Now, we consider results where both tiling and padding techniques are applied. The second set of bars reports the speedup compared to lrwPad and tssPad, enhanced versions of lrw and tss where padding is allowed [Rivera and Tseng 1999a]. Note that the memory requirement for all methods was roughly the same. We can see that the speedup is smaller on the Pentium-4, where our transformed code runs 7.7% and 26% faster than lrwPad and tssPad, respectively. However, the difference increases on the other three platforms, with the most significant results showing for the Alpha (260% and 271%).

Finally, in order to see to what extent tiling and padding help enhance the program, we show in Figure 20 the speedups that the different approaches (with and without padding) obtain with respect to the original kernel. The application of padding on lrw and tss does not always translate into better performance. Padding especially improves tss on the Pentium-4, but it yields worse results on the Alpha machine. On the other hand, our approach applies padding selectively. Our accurate cost model guides the selection of tile and pad factors; if padding is not useful, our cost model will predict a performance degradation, so pad factors will be set to 0. Overall, our approach obtains (98%, 204%, 135%, 49%) average speedups on the Pentium-4, Alpha, UltraSparc-III, and Itanium, respectively. Combining the other methods with padding, lrwPad obtains (69%, 19%, 107%, 16%) and tssPad (20%, 80%, 109%, 11%). Otherwise, their speedups are (74%, 78%, 119%, 9%) and (30%, 20%, 105%, 3%) for lrw and tss, respectively.

Note that the use of an accurate model allows us to always obtain a version of the code that is not worse than the original one. For instance, when optimizing MATVEC for the Itanium platform, our cost model determines that tiling is not useful, so we do not apply it. However, the other approaches do not have an accurate model to guide the transformations, which sometimes results in optimized codes that run slower than the original version.

6.4 Compile-Time Overhead

Clearly, for our method to be considered a realistic optimization approach, it must be shown that the compile time required is small enough to be practical.


Fig. 20. Speedup of all approaches with respect to the original program.

Table VI. Compile-Time Overhead (In Seconds) When Selecting Tile and Pad Factors on a Pentium-4 Running at 1.6 GHz

Processor        MIN    MAX    AVG
Pentium-4        1.8    14.5   4.6
Alpha-21264      0.1    11.9   3.6
UltraSparc-III   0.87   16.5   5.8
Itanium          0.4    17.0   5.5

Although a precise cost model combined with a GA can find very good results, the compile time required for that may be infeasible. In order to investigate this, we have collected the execution time needed to obtain the tile and pad factors for all our experiments. We account for 15 problem sizes for each of the 5 kernels.

Table VI shows the average times needed to generate the optimized version (including both tiling and padding) for each architecture. We see that in the worst case, it takes an average of 5.8 seconds to optimize a code. We believe that this amount of time is reasonable for a static compiler.

6.5 Summary

Overall, we have shown the effectiveness of our method to select tile and pad factors. We first have presented results where padding and tiling are applied in isolation. Then, we have reported results that highlight the importance of modeling the branch predictor behavior. Later, we have seen that our approach outperforms


state-of-the-art techniques to select tile and pad sizes for all analyzed kernels, for all platforms. We have shown how our cost model selects tile and pad factors in concert, which translates to consistent speedups.

From these results, we conclude that accurate cost models that consider not only cache behavior but also other hardware components are necessary. A simple cost model may hinder the compiler's ability to generate good code that improves overall performance. For instance, it is not clear when padding should be combined with tiling for the lrw and tss algorithms.

7. RELATED WORK

Caches are an essential part of processors for reducing memory latency and increasing memory bandwidth. By reducing the number of accesses to the slow upper levels of the memory hierarchy, significant speedups can be achieved. Conflict misses may represent the majority of intra-nest misses and about half of all cache misses for typical scientific programs and cache architectures [McKinley and Temam 1996].

Researchers working on locality optimizations have considered reordering techniques such as loop interchange [Gannon et al. 1988; McKinley et al. 1996; Wolf and Lam 1991; Wolfe 1996], loop fission/fusion [McKinley et al. 1996], and loop tiling [Carr and Kennedy 1992; Coleman and McKinley 1995; Lam et al. 1991; Rivera and Tseng 1999a; Wolf and Lam 1991].

The success of loop tiling depends on the tile size and shape selection. Lam et al. [1991] present an algorithm that chooses the largest nonconflicting square tile, considering caches with low associativity. Coleman and McKinley [1995] try to maximize the tile size while minimizing the cross-interferences. Their cost model is based on computing the footprints of the array references. Rivera and Tseng [1999a] further extend the Euclidean algorithm [Coleman and McKinley 1995] by computing tile widths using a recurrence. They realize that there may be some pathological problem sizes where tile selection does not work very well. They propose padding the first dimension of all arrays with the same pad to eliminate such cases.

Array padding can help eliminate conflict misses. Rivera and Tseng [1998a, 1998b] propose several simple heuristics aimed at eliminating conflicts in some particular cases. They mainly focus on conflicts that occur on every loop iteration, addressing only interpadding for uniformly generated references (so they cannot remove conflict misses for references such as B(i, j) and C(k, j)). On the other hand, they do not use intrapadding to remove cross-interferences. In case they cannot remove all the conflicts, no changes are made to the data layout. Besides, they use the padding algorithm devised to avoid conflict misses for direct-mapped caches to remove conflict misses for set-associative caches, without taking into account that interferences arise in different situations for different cache architectures; a set contention in a set-associative cache does not necessarily mean there is a cache miss. They presented an extension of this work targeting multilevel caches [Rivera and Tseng 1999b], where they study the effects of optimizing the L1 cache on L2 cache behavior.


Ghosh et al. [1999] use the CMEs to propose a tiling and padding technique. Padding works on direct-mapped caches, optimizing conflicting arrays that have the same column size. Their technique finds the optimal padding if there is a padding such that the total number of replacement misses after padding is zero. However, if such a padding does not exist, their technique does not provide any solution. Note that replacement misses include both conflict and capacity misses, and one may expect the case where replacement misses cannot be reduced to zero to be common. Tiling is based on maximizing the tile size for every self-interference equation, obtaining a tile that has no conflicts for the given equation. However, they do not give insights into how to combine the different tile sizes obtained. Furthermore, tiling is not applied to cross-interferences.

Our approach has several advantages over previous research. First, our cost model accurately describes the cache behavior of multi-level caches and considers all affine array accesses within a loop nest. Moreover, we model the tiling overhead due to miss-predicted branches. Second, our padding considers different pad factors for each array dimension, increasing the chances of finding a better optimized code. Finally, we perform tiling and padding at the same time, hence considering a global solution.

8. CONCLUSIONS

This article presents a new approach to improve the execution time of programs by enhancing data locality. It combines tiling and padding to remove both capacity and conflict misses. First, we present a very accurate model that describes cache locality across different levels. Moreover, this cost model takes into account the possible tiling overhead due to the added miss-predicted branches. We discuss how this model can be tuned to describe accurately the performance cost for different modern architectures.

Second, we introduce the use of genetic algorithms to traverse the solution space. We show how our approach can guide compiler optimizations efficiently; with what we believe is a small compile-time overhead (an average of 4.6 seconds per kernel), we obtain significant run-time improvements. Our results show that, compared to the best performing scheme among previous approaches for each particular architecture (which happens not to always be the same scheme), we obtain 7.7%, 63.2%, 5.2%, and 35.7% average speedups for the Pentium-4, Alpha-21264, UltraSparc-III, and Itanium, respectively.

Overall, this article contributes a new technique that makes a case for the use of accurate models to guide compilers in order to improve execution time. Moreover, it models not only cache behavior but also hardware components such as branch predictors, which shows the possibility of having complex and accurate models for modern architectures.

Future work will investigate both the application of padding and tiling to whole programs and the addition of other compiler techniques such as loop fusion, loop interchange, and unrolling. Even though the results show that a cost model that assumes an LRU replacement policy is good enough, we plan to incorporate other replacement policies into our model. We also plan to consider modeling other hardware components such as register files and pipeline utilization.


APPENDIXES

A. EXAMPLE OF A GENETIC ALGORITHM

In this section, we show an example of an iteration of the genetic algorithm. Let us recall our running example in its tiled and padded version (see Figure 4(b)). We consider the problem of finding the best tile and pad sizes for a 4 KB cache when N = 100.

The steps to set up a genetic algorithm are as follows:

(1) Identify variables.
(2) Identify domains of those variables.
(3) Decide representation.

We review these three steps, explained in Section 4, by constructing a GA for the problem stated above. Then, we show how the GA works by illustrating some iterations.

A.1 Setting Up the GA

According to the code shown in Figure 4(b), our problem depends on eight different variables. There are two tile factors, T1 and T2, and six intra-pad factors: P_Dim00, P_Dim01, P_Dim10, P_Dim11, P_Dim20, and P_Dim21.

First, we determine the domains according to Section 3.1:

1 ≤ T1 ≤ 100         1 ≤ T2 ≤ 100
0 ≤ P_Dim00 < 4096   0 ≤ P_Dim01 < 4096
0 ≤ P_Dim10 < 4096   0 ≤ P_Dim11 < 4096
0 ≤ P_Dim20 < 4096   0 ≤ P_Dim21 < 4096

Once the domains are determined, the next step consists of deciding the representation that is used for each parameter. In order to do that, we follow the steps explained in Section 4.1.

For representing the tile factors, we will need 8 bits. We obtain this value as follows:

⌈log2 100⌉ = ⌈6.64⌉ = 7

Since 7 is odd, we add 1 and choose 8. Thus, each tile factor will be represented by 4 genes (each gene ∈ {00, 01, 10, 11}).

Pad factors are somewhat easier to represent. We start by considering the cache, which is 4 KB. Thus, log2 4096 = 12, which means that each chromosome will consist of 12 bits. The set of divisors is {1, 2, 3, 4, 6, 12}; therefore, we choose to represent each gene with 6 bits.

Summing up, each individual (which represents a set of tile and pad factors) will be made up of 88 bits; that is, 20 genes or 8 chromosomes:

\[
\underbrace{\underbrace{xx}\,\underbrace{xx}\,\underbrace{xx}\,\underbrace{xx}}_{T_1}
\;\cdots\;
\underbrace{\underbrace{xxxxxx}\,\underbrace{xxxxxx}}_{P\_Dim_{00}}
\;\cdots\;
\underbrace{\underbrace{xxxxxx}\,\underbrace{xxxxxx}}_{P\_Dim_{21}},
\qquad x \in \{0, 1\}.
\]
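A small C helper illustrating how such an 88-bit individual could be decoded back into the two tile factors and six pad factors is sketched below; the bit layout follows the text above, but the function names and the way out-of-range tile values are folded into 1..100 are illustrative assumptions rather than the exact decoding used in our implementation.

#define IND_BITS 88   /* 2 tile factors x 8 bits + 6 pad factors x 12 bits */

/* Read `width` bits starting at `pos` (most significant bit first). */
unsigned decode_field(const unsigned char bits[IND_BITS], int pos, int width) {
    unsigned v = 0;
    for (int i = 0; i < width; i++)
        v = (v << 1) | (bits[pos + i] & 1u);
    return v;
}

/* Map an individual to the parameters of the running example. */
void decode_individual(const unsigned char bits[IND_BITS],
                       unsigned *T1, unsigned *T2, unsigned pad[6]) {
    *T1 = decode_field(bits, 0, 8) % 100 + 1;    /* fold into 1..100 (assumed)   */
    *T2 = decode_field(bits, 8, 8) % 100 + 1;
    for (int d = 0; d < 6; d++)                  /* P_Dim00 .. P_Dim21: 0..4095  */
        pad[d] = decode_field(bits, 16 + 12 * d, 12);
}

The fitness of each decoded individual is then obtained by evaluating the cost model for the corresponding tile and pad factors.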


A.2 Iterating the GA

Now, we show in detail how the genetic operators are applied in order to improve the population. Let us consider that the selection method has chosen two individuals, I1 and I2, such that:

\[
I_1 = \underbrace{\underbrace{xx}\,\underbrace{xx}\,\underbrace{xx}\,\underbrace{xx}}_{T_1}\;\cdots\;\underbrace{\underbrace{xxxxxx}\,\underbrace{xxxxxx}}_{P\_Dim_{00}}\;\cdots\;\underbrace{\underbrace{xxxxxx}\,\underbrace{xxxxxx}}_{P\_Dim_{21}},\qquad x \in \{0, 1\}
\]

\[
I_2 = \underbrace{\underbrace{yy}\,\underbrace{yy}\,\underbrace{yy}\,\underbrace{yy}}_{T_1}\;\cdots\;\underbrace{\underbrace{yyyyyy}\,\underbrace{yyyyyy}}_{P\_Dim_{00}}\;\cdots\;\underbrace{\underbrace{yyyyyy}\,\underbrace{yyyyyy}}_{P\_Dim_{21}},\qquad y \in \{0, 1\}
\]

According to Section 4.1.3, we first apply crossover, and then we mutate the new individuals.

A.2.1 Crossover Is NOT Applied. Let us consider first that crossover is not applied. Thus, we obtain two new individuals which are exactly the same as before, I'1 = I1 and I'2 = I2. The next step consists of mutating some bits of these individuals. After mutating, we obtain these two new individuals, which are added to the new population:

\[
I_1'' = \underbrace{\underbrace{xx}\,\underbrace{xx}\,\underbrace{xx}\,\underbrace{xx}}_{T_1}\;\cdots\;\underbrace{\underbrace{xxxxxx}\,\underbrace{xxxxxx}}_{P\_Dim_{00}}\;\cdots\;\underbrace{\underbrace{xxxxxx}\,\underbrace{xxxxxx}}_{P\_Dim_{21}},\qquad x \in \{0, 1\}
\]

\[
I_2'' = \underbrace{\underbrace{yy}\,\underbrace{yy}\,\underbrace{yy}\,\underbrace{yy}}_{T_1}\;\cdots\;\underbrace{\underbrace{yyyyyy}\,\underbrace{yyyyyy}}_{P\_Dim_{00}}\;\cdots\;\underbrace{\underbrace{yyyyyy}\,\underbrace{yyyyyy}}_{P\_Dim_{21}},\qquad y \in \{0, 1\}
\]

A.2.2 Crossover IS Applied. When crossover is applied, we obtain two new individuals by merging the genes from both original individuals. In our case, the crossing site is at the 5th bit, which gives rise to the following two new individuals:

\[
I_1' = \underbrace{\underbrace{xx}\,\underbrace{xx}\,\underbrace{xy}\,\underbrace{yy}}_{T_1}\;\cdots\;\underbrace{\underbrace{yyyyyy}\,\underbrace{yyyyyy}}_{P\_Dim_{00}}\;\cdots\;\underbrace{\underbrace{yyyyyy}\,\underbrace{yyyyyy}}_{P\_Dim_{21}},\qquad x, y \in \{0, 1\}
\]

\[
I_2' = \underbrace{\underbrace{yy}\,\underbrace{yy}\,\underbrace{yx}\,\underbrace{xx}}_{T_1}\;\cdots\;\underbrace{\underbrace{xxxxxx}\,\underbrace{xxxxxx}}_{P\_Dim_{00}}\;\cdots\;\underbrace{\underbrace{xxxxxx}\,\underbrace{xxxxxx}}_{P\_Dim_{21}},\qquad x, y \in \{0, 1\}
\]

Finally, mutation is applied, which results in the two individuals that are eventually added to the new population:

\[
I_1'' = \underbrace{\underbrace{xx}\,\underbrace{xx}\,\underbrace{xy}\,\underbrace{yy}}_{T_1}\;\cdots\;\underbrace{\underbrace{yyyyyy}\,\underbrace{yyyyyy}}_{P\_Dim_{00}}\;\cdots\;\underbrace{\underbrace{yyyyyy}\,\underbrace{yyyyyy}}_{P\_Dim_{21}},\qquad x, y \in \{0, 1\}
\]

\[
I_2'' = \underbrace{\underbrace{yy}\,\underbrace{yy}\,\underbrace{yx}\,\underbrace{xx}}_{T_1}\;\cdots\;\underbrace{\underbrace{xxxxxx}\,\underbrace{xxxxxx}}_{P\_Dim_{00}}\;\cdots\;\underbrace{\underbrace{xxxxxx}\,\underbrace{xxxxxx}}_{P\_Dim_{21}},\qquad x, y \in \{0, 1\}
\]
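The two operators applied above can be written down in a few lines of C; the sketch below works on the 88-bit individuals represented as arrays of 0/1 bytes, and the per-bit mutation probability shown in the usage comments is an arbitrary assumption, not a parameter taken from our experiments.

#include <stdlib.h>

#define IND_BITS 88

/* One-point crossover: bits before `site` come from the first parent,
 * the rest from the second (and vice versa for the second child).        */
void crossover(const unsigned char p1[IND_BITS], const unsigned char p2[IND_BITS],
               unsigned char c1[IND_BITS], unsigned char c2[IND_BITS], int site) {
    for (int i = 0; i < IND_BITS; i++) {
        c1[i] = (i < site) ? p1[i] : p2[i];
        c2[i] = (i < site) ? p2[i] : p1[i];
    }
}

/* Bit-flip mutation: each bit is flipped independently with probability p. */
void mutate(unsigned char ind[IND_BITS], double p) {
    for (int i = 0; i < IND_BITS; i++)
        if ((double)rand() / RAND_MAX < p)
            ind[i] ^= 1u;
}

/* Example corresponding to the iteration above: crossing site at bit 5,
 * then a low per-bit mutation rate.
 *
 *   crossover(I1, I2, I1prime, I2prime, 5);
 *   mutate(I1prime, 1.0 / IND_BITS);
 *   mutate(I2prime, 1.0 / IND_BITS);
 */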

B. CODES

For the sake of comparison, we include all the codes we used for running our experiments.


Fig. 21. TOMCATV: Loop number 1.

Fig. 22. TOMCATV: Loop number 2.


Fig. 23. TOMCATV: Loop number 3.

Fig. 24. TOMCATV: Loop number 4.

Fig. 25. TOMCATV: Loop number 5.


Fig. 26. SWIM: Loop number 1.

Fig. 27. SWIM: Loop number 2.

Fig. 28. SWIM: Loop number 3.


Fig. 29. SWIM: Loop number 4.

Fig. 30. SWIM: Loop number 5.

Fig. 31. SWIM: Loop number 6.


Fig. 32. MATMUL.

Fig. 33. T2D.

Fig. 34. ADI.

Fig. 35. MATVEC.


Fig. 36. VPENTA.

ACKNOWLEDGMENT

We wish to thank Erik Hagersten for helpful discussions on the importance of modeling "more than L1 caches" to improve performance on current machines.

REFERENCES

BANERJEE, U. 1988. Dependence Analysis for Supercomputing. Kluwer Academic Publishers.
BELADY, L. A. 1966. A study of replacement algorithms for a virtual-storage computer. IBM Syst. J.
BERMUDO, N., VERA, X., GONZALEZ, A., AND LLOSA, J. 2000. An efficient solver for cache miss equations. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS'00). IEEE Computer Society Press, Los Alamitos, CA.
BUTLER, M., YEH, T.-Y., PATT, Y., ALSUP, M., SALES, H., AND SHEBANOW, M. 1991. Instruction level parallelism is greater than two. In Proceedings of the 18th International Symposium on Computer Architecture (ISCA'91). 276–286.
CARR, S. AND KENNEDY, K. 1992. Compiler blockability of numerical algorithms. In Proceedings of Supercomputing (SC'92). 114–124.
CARR, S., MCKINLEY, K., AND TSENG, C.-W. 1994. Compiler optimizations for improving data locality. In Proceedings of the VI International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'94). 252–262.
CHATTERJEE, S., JAIN, V. V., LEBECK, A. R., MUNDHRA, S., AND THOTTETHODI, M. 1999. Nonlinear array layout for hierarchical memory systems. In Proceedings of the ACM International Conference on Supercomputing (ICS'99) (Rhodes, Greece). ACM, New York, 444–453.
CLAUSS, P. 1996. Counting solutions to linear and non-linear constraints through Ehrhart polynomials. In Proceedings of the ACM International Conference on Supercomputing (ICS'96) (Philadelphia, PA). ACM, New York, 278–285.
COLEMAN, S. AND MCKINLEY, K. S. 1995. Tile size selection using cache organization and data layout. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'95). ACM, New York, 279–290.
DYER, M. AND FRIEZE, A. M. 1988. On the complexity of computing the volume of a polyhedron. SIAM J. Comput. 17, 5, 967–974.
ERMOLIEV, Y. AND WETS, R. J.-B. 1988. Numerical Techniques for Stochastic Optimization. Springer-Verlag, New York.
FERNANDEZ, A. 1999. A quantitative analysis of the SPECfp95. Tech. Rep. UPC-DAC-1999-12, Universitat Politecnica de Catalunya. March.
GANNON, D., JALBY, W., AND GALLIVAN, K. 1988. Strategies for cache and local memory management by global program transformations. J. Paral. Distrib. Comput. 5, 587–616.
GHOSH, S., MARTONOSI, M., AND MALIK, S. 1999. Cache miss equations: A compiler framework for analyzing and tuning memory behavior. ACM Trans. Prog. Lang. Syst. (TOPLAS) 21, 4, 703–746.
GILL, P. E., MURRAY, W., AND WRIGHT, M. H. 1981. Practical Optimization. Academic Press, Orlando, FL.
GLOVER, F. AND LAGUNA, M. 1997. Tabu Search. Kluwer.
GOLDBERG, D. E. 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.
HANSEN, P., JAUMARD, B., AND MATHON, V. 1995. Constrained nonlinear 0-1 programming. ORSA J. Comput.
HOLLAND, J. 1975. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI.
HORST, R., PARDALOS, P. M., AND THOAI, N. V. 1995. Introduction to Global Optimization. Kluwer Academic Publishers.
KANDEMIR, M., CHOUDHARY, A., BANERJEE, P., AND RAMANUJAM, J. 1999. A linear algebra framework for automatic determination of optimal data layouts. IEEE Trans. Paral. Distrib. Syst. 10, 2 (Feb.), 115–135.
KIRKPATRICK, S., GELATT, C. D., AND VECCHI, M. P. 1983. Optimization by simulated annealing. Science 220.
LAM, M., ROTHBERG, E. E., AND WOLF, M. E. 1991. The cache performance of blocked algorithms. In Proceedings of the IV International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'91).
MCFARLING, S. 1993. Combining branch predictors. Tech. Rep. TN-36, Digital Western Research Lab.
MCKINLEY, K., CARR, S., AND TSENG, C.-W. 1996. Improving data locality with loop transformations. ACM Trans. Prog. Lang. Syst. (TOPLAS) 18, 4 (Jul.), 424–453.
MCKINLEY, K. S. AND TEMAM, O. 1996. A quantitative analysis of loop nest locality. In Proceedings of the VII International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'96).
MICHALEWICZ, Z. 1994. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, New York.
MOWRY, T., LAM, M., AND GUPTA, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the V International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'92). 62–73.
PETRANK, E. AND RAWITZ, D. 2002. Hardness of cache conscious data placement. In Proceedings of the International Conference on Principles of Programming Languages (POPL'02).
RIVERA, G. AND TSENG, C.-W. 1998a. Data transformations for eliminating conflict misses. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'98). ACM, New York, 38–49.
RIVERA, G. AND TSENG, C.-W. 1998b. Eliminating conflict misses for high-performance architectures. In Proceedings of the ACM International Conference on Supercomputing (ICS'98). ACM, New York.
RIVERA, G. AND TSENG, C.-W. 1999a. A comparison of compiler tiling algorithms. In Proceedings of the 8th International Conference on Compiler Construction (CC'99).
RIVERA, G. AND TSENG, C.-W. 1999b. Locality optimizations for multi-level caches. In Proceedings of Supercomputing (SC'99).
SHARANGPANI, H. 2000. Itanium microprocessor architecture. IEEE Micro.
TEMAM, O., GRANSTON, E., AND JALBY, W. 1993. To copy or not to copy: A compile-time technique for assessing when data copying should be used to eliminate cache conflicts. In Proceedings of Supercomputing (SC'93). 410–419.
TORN, A. AND ZILINSKAS, A. 1989. Global Optimization. Springer-Verlag, New York.
VAVASIS, S. A. 1991. Nonlinear Optimization, Complexity Issues. Oxford University Press.
VERA, X., BERMUDO, N., LLOSA, J., AND GONZALEZ, A. 2004. A fast and accurate framework to analyze and optimize cache memory behavior. ACM Trans. Prog. Lang. Syst. (TOPLAS) 26, 2, 263–300.
VERA, X., LLOSA, J., GONZALEZ, A., AND BERMUDO, N. 2000. A fast and accurate approach to analyze cache memory behavior. In Proceedings of the European Conference on Parallel Computing (Europar'00).
WOLF, M. AND LAM, M. 1991. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'91). ACM, New York, 30–44.
WOLFE, M. 1996. Advanced loop interchanging. In Proceedings of the International Conference on Parallel Processing (ICPP'96).

Received April 2003; revised December 2003; accepted June 2004
