

Meta Optimization: Improving Compiler Heuristics with Machine Learning

Mark Stephenson and Saman Amarasinghe
Massachusetts Institute of Technology
Laboratory for Computer Science
Cambridge, MA 02139
{mstephen, saman}@cag.lcs.mit.edu

Martin Martin and Una-May O'Reilly
Massachusetts Institute of Technology
Artificial Intelligence Laboratory
Cambridge, MA 02139
{mcm, unamay}@ai.mit.edu

ABSTRACT
Compiler writers have crafted many heuristics over the years to approximately solve NP-hard problems efficiently. Finding a heuristic that performs well on a broad range of applications is a tedious and difficult process. This paper introduces Meta Optimization, a methodology for automatically fine-tuning compiler heuristics. Meta Optimization uses machine-learning techniques to automatically search the space of compiler heuristics. Our techniques reduce compiler design complexity by relieving compiler writers of the tedium of heuristic tuning. Our machine-learning system uses an evolutionary algorithm to automatically find effective compiler heuristics. We present promising experimental results. In one mode of operation, Meta Optimization creates application-specific heuristics which often result in impressive speedups. For hyperblock formation, one optimization we present in this paper, we obtain an average speedup of 23% (up to 73%) for the applications in our suite. Furthermore, by evolving a compiler's heuristic over several benchmarks, we can create effective, general-purpose heuristics. The best general-purpose heuristic our system found for hyperblock formation improved performance by an average of 25% on our training set, and 9% on a completely unrelated test set. We demonstrate the efficacy of our techniques on three different optimizations in this paper: hyperblock formation, register allocation, and data prefetching.

Categories and Subject Descriptors
D.1.2 [Programming Techniques]: Automatic Programming; D.2.2 [Software Engineering]: Design Tools and Techniques; I.2.6 [Artificial Intelligence]: Learning

General Terms
Design, Algorithms, Performance

Keywords
Machine Learning, Priority Functions, Genetic Programming, Compiler Heuristics

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'03, June 9-11, 2003, San Diego, California, USA.
Copyright 2003 ACM 1-58113-662-5/03/0006 ...$5.00.

1. INTRODUCTION
Compiler writers have a difficult task. They are expected to create effective and inexpensive solutions to NP-hard problems such as register allocation and instruction scheduling. Their solutions are expected to interact well with other optimizations that the compiler performs. Because some optimizations have competing and conflicting goals, adverse interactions are inevitable. Getting all of the compiler passes to mesh nicely is a daunting task.

The advent of intractably complex computer architectures also complicates the compiler writer's task. Since it is impossible to create a simple model that captures the intricacies of modern architectures and compilers, compiler writers rely on inaccurate abstractions. Such models are based upon many assumptions, and thus may not even properly simulate first-order effects.

Because compilers cannot afford to optimally solve NP-hard problems, compiler writers devise clever heuristics that quickly find good approximate solutions for a large class of applications. Unfortunately, heuristics rely on a fair amount of tweaking to achieve suitable performance. Trial-and-error experimentation can help an engineer optimize the heuristic for a given compiler and architecture. For instance, one might be able to use iterative experimentation to figure out how much to unroll loops for a given architecture (i.e., without thrashing the instruction cache or incurring too much register pressure).

After studying several compiler optimizations, we found that many heuristics have a focal point. A single priority or cost function often dictates the efficacy of a heuristic. A priority function, a function of the factors that affect a given problem, measures the relative importance of the different options available to a compiler algorithm.

Take register allocation for example. When a graph-coloring register allocator cannot successfully color an interference graph, it spills a variable to memory and removes it from the graph. The allocator then attempts to color the reduced graph. When a graph is not colorable, choosing an appropriate variable to spill is crucial. For many allocators, this decision is bestowed upon a single priority function. Based on relevant data (e.g., number of references, depth in loop nest, etc.), the function assigns weights to all uncolored variables and thereby determines which variable to spill.
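As a concrete illustration of such a spill-choice priority function, the sketch below uses a classic Chaitin-style weighting: references scaled by loop-nesting depth, normalized by interference degree. The factors, weights, and variable data are illustrative assumptions, not a function taken from any particular compiler.

```python
# Illustrative spill-priority function for a graph-coloring allocator.
# The weighting (spill cost ~ references scaled by loop depth, divided
# by interference degree) is a hypothetical Chaitin-style heuristic.

def spill_priority(num_refs, loop_depth, degree):
    """Lower priority => cheaper to spill (fewer dynamic reloads,
    and spilling a high-degree node unblocks more of the graph)."""
    spill_cost = num_refs * (10 ** loop_depth)   # refs weighted by nesting depth
    return spill_cost / max(degree, 1)

# Hypothetical uncolored variables and their features.
uncolored = {
    "a": dict(num_refs=4, loop_depth=2, degree=6),
    "b": dict(num_refs=2, loop_depth=0, degree=8),
    "c": dict(num_refs=9, loop_depth=1, degree=3),
}

# The allocator spills the lowest-priority variable.
victim = min(uncolored, key=lambda v: spill_priority(**uncolored[v]))
print(victim)  # b
```

Here "b" is chosen: it is referenced rarely, outside any loop, and interferes with many other variables.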

Fine-tuning priority functions to achieve suitable performance is a tedious process. Currently, compiler writers manually experiment with different priority functions. For instance, Bernstein et al. manually identified three priority functions for choosing spill variables [3]. By applying the three functions to a suite of benchmarks, they found that a register allocator's effectiveness is highly dependent on the priority function the compiler uses.

The importance of priority functions is a key insight that motivates Meta Optimization, a method by which a machine-learning algorithm automatically searches the priority function solution space. More specifically, we use a learning algorithm that iteratively searches for priority functions that improve the execution time of compiled applications.

Our system can be used to cater a priority function to a specific input program. This mode of operation is essentially an advanced form of feedback-directed optimization. More importantly, it can be used to find a general-purpose function that works well for a broad range of applications. In this mode of operation, Meta Optimization can perform the tedious work that is currently performed by engineers. For each of the three case studies we describe in this paper, we were able to at least match the performance of human-generated priority functions. In some cases we achieved considerable speedups.

While many researchers have used machine-learning techniques and exhaustive search algorithms to improve an application, none have used learning to search for priority functions. Because Meta Optimization improves the effectiveness of the compiler itself, in theory, we need only apply the process once (rather than on a per-application basis).

The remainder of this paper is organized as follows. The next section introduces priority functions. Section 3 describes genetic programming, a machine-learning technique that is well suited to our problem. Section 4 discusses our methodology. We apply our technique to three separate case studies in Section 5, Section 6, and Section 7. Results of our experiments are included in the case study sections. Section 8 discusses related work, and finally Section 9 concludes.

2. PRIORITY FUNCTIONS
This section is intended to give the reader a feel for the utility and ubiquity of priority functions. Put simply, priority functions prioritize the options available to a compiler algorithm.

For example, in list scheduling, a priority function assigns a weight to each instruction in the scheduler's dependence graph, dictating the order in which to schedule instructions. A common and effective heuristic assigns priorities using latency-weighted depths [10]. Essentially, this is the instruction's depth in the dependence graph, taking into account the latency of instructions on all paths to the root nodes:

    P(i) = latency(i)                                      if i is independent
    P(i) = max over j such that i depends on j of [latency(i) + P(j)]   otherwise

The list scheduler proceeds by scheduling ready instructions in priority order. In other words, if two instructions are ready to be scheduled, the algorithm will favor the instruction with the higher priority. The scheduling algorithm hinges upon the priority function. Apart from enforcing the legality of the schedule, the scheduler entirely relies on the priority function to make all of its decisions.

This description of list scheduling is a simplification. Production compilers use sophisticated priority functions that account for many competing factors (e.g., how a given schedule may affect register allocation).
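The latency-weighted depth above can be computed with a short recursive pass over the dependence graph. The instruction DAG and latencies below are invented for illustration:

```python
# Latency-weighted depth, following the priority function in the text:
#   P(i) = latency(i) if i is independent;
#   P(i) = max over j that i depends on of latency(i) + P(j), otherwise.
# Instruction kinds, latencies, and the DAG are hypothetical.

latency = {"load": 3, "add": 1, "mul": 2, "store": 1}
ops = {  # instruction -> (kind, instructions it depends on)
    "i1": ("load", []),
    "i2": ("load", []),
    "i3": ("mul", ["i1", "i2"]),
    "i4": ("add", ["i3"]),
    "i5": ("store", ["i4"]),
}

def priority(i):
    kind, deps = ops[i]
    if not deps:                      # independent instruction
        return latency[kind]
    return max(latency[kind] + priority(j) for j in deps)

# Among simultaneously ready instructions, the scheduler favors the one
# with the higher priority.
print({i: priority(i) for i in ops})  # {'i1': 3, 'i2': 3, 'i3': 5, 'i4': 6, 'i5': 7}
```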

The remainder of the section lists a few other priority functions that are amenable to the techniques we discuss in this paper. We will explore three of the following priority functions in detail later in the paper.

• Clustered scheduling: Ozer et al. describe an approach to scheduling for architectures with clustered register files [20]. They note that the choice of priority function has a "strong effect on the schedule." They also investigate five different priority functions [20].

• Hyperblock formation: Later in this paper we use the formation of predicated hyperblocks as a case study.

• Meld scheduling: Abraham et al. rely on a priority function to schedule across region boundaries [1]. The priority function is used to sort regions by the order in which they should be visited.

• Modulo scheduling: In [22], Rau states that "there is a limitless number of priority functions" that can be devised for modulo scheduling. Rau describes the tradeoffs involved when considering scheduling priorities.

• Data prefetching: Later in this paper we investigate a priority function that determines whether or not to prefetch an address.

• Register allocation: Many register allocation algorithms use cost functions to determine which variables to spill if spilling is required. We use register allocation as a case study later in the paper.

This is not an exhaustive list of applications. Many important compiler optimizations employ cost functions of the sort mentioned above. The next section introduces genetic programming, which we use to automatically find effective priority functions.

3. GENETIC PROGRAMMING
Of the many available machine-learning techniques, we chose to employ genetic programming (GP) because its attributes best fit the needs of our application. The following list highlights the suitability of GP to our problem:

• GP is especially appropriate when the relationships among relevant variables are poorly understood [13]. Such is the case with compiler heuristics, which often feature uncertain tradeoffs. Today's complex systems also introduce uncertainty.

• GP is capable of searching high-dimensional spaces. Many other learning algorithms are not as scalable.

• GP is a distributed algorithm. With the cost of computing power at an all-time low, it is now economically feasible to dedicate a cluster of machines to searching a solution space.

• GP solutions are human readable. The 'genomes' on which GP operates are parse trees which can easily be converted to free-form arithmetic equations. Other machine-learning representations, such as neural networks, are not as comprehensible.


Figure 1: GP genomes. Parts (a) and (b) show examples of GP genomes. Part (c) provides an example of a random crossover of the genomes in (a) and (b). Part (d) shows a mutation of the genome in part (a).

Figure 2: Flow of genetic programming. Genetic programming (GP) initially creates a population of expressions. Each expression is then assigned a fitness, which is a measure of how well it satisfies the end goal. In our case, fitness is proportional to the execution time of the compiled application(s). Until some user-defined cap on the number of generations is reached, the algorithm probabilistically chooses the best expressions for mating and continues. To guard against stagnation, some expressions undergo mutation.

Like other evolutionary algorithms, GP is loosely patterned on Darwinian evolution. GP maintains a population of parse trees [13]. In our case, each parse tree is an expression that represents a priority function. As with natural selection, expressions are chosen for reproduction (called crossover) according to their level of fitness. Expressions that best solve the problem are most likely to have progeny. The algorithm also randomly mutates some expressions to innovate a possibly stagnant population.

Figure 2 shows the general flow of genetic programming in the context of our system. The algorithm begins by creating a population of initial expressions. The baseline heuristic over which we try to improve is included in the initial population; the remainder of the initial expressions are randomly generated. The algorithm then determines each expression's level of fitness. In our case, compilers that produce the fastest code are fittest. Once the algorithm reaches a user-defined limit on the number of generations, the process stops; otherwise, the algorithm proceeds by probabilistically choosing the best expressions for mating. Some of the offspring undergo mutation, and the algorithm continues.

Unlike other evolutionary algorithms, which use fixed-length binary genomes, GP's expressions are variable in length and free-form. Figure 1 provides several examples of genetic programming genomes (expressions). Variable-length genomes do not artificially constrain evolution by setting a maximum genome size. However, without special consideration, genomes grow exponentially during crossover and mutation.

Our system rewards parsimony by selecting the smaller of two otherwise equally fit expressions [13]. Parsimonious expressions are aligned with our philosophy of using GP as a tool for compiler writers and architects to identify important heuristic features and the relationships among them. Without enforcing parsimony, expressions quickly become unintelligible.

In Figure 1, part (c) provides an example of crossover, the method by which two expressions reproduce. Here the two expressions in (a) and (b) produce offspring. Crossover works by selecting a random node in each parent, and then swapping the subtrees rooted at those nodes¹. In theory, crossover works by propagating 'good' subexpressions. Good subexpressions increase an expression's fitness.

Because GP favors fit expressions, expressions with favorable building blocks are more likely selected for crossover, further disseminating the blocks. Our system uses tournament selection to choose expressions for crossover. Tournament selection chooses N expressions at random from the population and selects the one with the highest fitness [13]. N is referred to as the tournament size. Small values of N reduce selection pressure; expressions are only compared against the other N - 1 expressions in the tournament.

Finally, part (d) shows a mutated version of the expression in (a). Here, a randomly generated expression supplants a randomly chosen node in the expression. For details on the mutation operators we implemented, see [2].
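Subtree crossover and tournament selection can be sketched over parse trees encoded as nested tuples. This is a generic GP sketch under simplifying assumptions: crossover points are chosen uniformly (the paper's system uses depth-fair selection, per footnote 1), and type constraints between real- and Boolean-valued subtrees are ignored.

```python
import random

random.seed(1)  # illustrative

# Genomes as nested tuples, e.g. the priority function exec_ratio * 4.0
# becomes ("mul", ("rarg", "exec_ratio"), ("rconst", 4.0)).

def subtrees(tree, path=()):
    """Yield the path (tuple of child indices) to every subtree root."""
    if isinstance(tree, tuple):
        yield path
        for k, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (k,))

def get(tree, path):
    for k in path:
        tree = tree[k]
    return tree

def replace(tree, path, subtree):
    if not path:
        return subtree
    k = path[0]
    return tree[:k] + (replace(tree[k], path[1:], subtree),) + tree[k + 1:]

def crossover(a, b):
    """Swap a random subtree of b into a random position of a."""
    return replace(a, random.choice(list(subtrees(a))),
                   get(b, random.choice(list(subtrees(b)))))

def tournament(population, fitness, n=7):
    """Pick the fittest of n randomly chosen expressions."""
    return max(random.sample(population, n), key=fitness)

a = ("mul", ("rarg", "exec_ratio"), ("rconst", 4.0))
b = ("add", ("rarg", "num_ops"), ("rarg", "num_branches"))
print(crossover(a, b))
```

A production system would additionally enforce that swapped subtrees have matching result types (Real vs. Boolean) so every offspring remains a well-typed priority function.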

¹Selection algorithms must use caution when selecting random tree nodes. If we consider a full binary tree, then leaf nodes comprise over 50% of the tree. Thus, a naive selection algorithm will choose leaf nodes over half of the time. We employ depth-fair crossover, which equally weighs each level of the tree [12].


Real-Valued Function                            Representation
Real1 + Real2                                   (add Real1 Real2)
Real1 - Real2                                   (sub Real1 Real2)
Real1 · Real2                                   (mul Real1 Real2)
Real1 / Real2 if Real2 ≠ 0; 0 if Real2 = 0      (div Real1 Real2)
√Real1                                          (sqrt Real1)
Real1 if Bool1; Real2 if not Bool1              (tern Bool1 Real1 Real2)
Real1 · Real2 if Bool1; Real2 if not Bool1      (cmul Bool1 Real1 Real2)
Returns real constant K                         (rconst K)
Returns real value of arg from environment      (rarg arg)

Boolean-Valued Function                         Representation
Bool1 and Bool2                                 (and Bool1 Bool2)
Bool1 or Bool2                                  (or Bool1 Bool2)
not Bool1                                       (not Bool1)
Real1 < Real2                                   (lt Real1 Real2)
Real1 > Real2                                   (gt Real1 Real2)
Real1 = Real2                                   (eq Real1 Real2)
Returns Boolean constant                        (bconst {true, false})
Returns Boolean value of arg from environment   (barg arg)

Table 1: GP primitives. Our GP system uses the primitives and syntax shown in this table. The top segment represents the real-valued functions, which all return a real value. Likewise, the functions in the bottom segment all return a Boolean value.
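A small interpreter makes the primitive set concrete. Expressions are nested tuples in the prefix syntax of Table 1, and the environment supplies the feature values extracted by the compiler. The guarded division follows the table; the abs() inside sqrt is our assumption, since the table does not say how negative arguments are handled.

```python
import math

# Sketch of an evaluator for the primitives in Table 1. The environment
# maps feature names to values; feature names here are hypothetical.

def evaluate(expr, env):
    op, *args = expr
    ev = lambda e: evaluate(e, env)
    if op == "add":    return ev(args[0]) + ev(args[1])
    if op == "sub":    return ev(args[0]) - ev(args[1])
    if op == "mul":    return ev(args[0]) * ev(args[1])
    if op == "div":    # guarded division, per Table 1
        d = ev(args[1])
        return ev(args[0]) / d if d != 0 else 0.0
    if op == "sqrt":   return math.sqrt(abs(ev(args[0])))  # abs() is an assumption
    if op == "tern":   return ev(args[1]) if ev(args[0]) else ev(args[2])
    if op == "cmul":   return ev(args[1]) * ev(args[2]) if ev(args[0]) else ev(args[2])
    if op == "rconst": return args[0]
    if op == "rarg":   return env[args[0]]
    if op == "and":    return ev(args[0]) and ev(args[1])
    if op == "or":     return ev(args[0]) or ev(args[1])
    if op == "not":    return not ev(args[0])
    if op == "lt":     return ev(args[0]) < ev(args[1])
    if op == "gt":     return ev(args[0]) > ev(args[1])
    if op == "eq":     return ev(args[0]) == ev(args[1])
    if op == "bconst": return args[0]
    if op == "barg":   return env[args[0]]
    raise ValueError(f"unknown primitive: {op}")

# exec_ratio * 4.0 + num_ops / 0  (the guarded div yields 0).
expr = ("add", ("mul", ("rarg", "exec_ratio"), ("rconst", 4.0)),
               ("div", ("rarg", "num_ops"), ("rconst", 0.0)))
print(evaluate(expr, {"exec_ratio": 0.5, "num_ops": 12}))  # 2.0
```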

To find general-purpose expressions (i.e., expressions that work well for a broad range of input programs), the learning algorithm learns from a set of 'training' programs. To train on multiple input programs, we use the technique described by Gathercole in [9]. The technique, called dynamic subset selection (DSS), trains on subsets of the training programs, concentrating more effort on programs that perform poorly compared to the baseline heuristics. DSS reduces the number of fitness evaluations that need to be performed in order to achieve a suitable solution. Because our system must compile and run benchmarks to test an expression's level of fitness, fitness evaluations for our problem are costly.
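The flavor of DSS-style subset selection can be sketched as weighted sampling without replacement, biased toward 'difficult' benchmarks. The benchmark names, weights, and sampling details below are illustrative; Gathercole's exact difficulty-update rules differ.

```python
import random

random.seed(2)  # illustrative; any seed works

# Difficulty weights: benchmarks on which evolved heuristics still trail
# the baseline get higher weight, so they are sampled more often.
difficulty = {"gzip": 1.0, "art": 4.0, "mcf": 1.0, "vpr": 1.0}

def pick_subset(difficulty, k):
    """Sample k distinct benchmarks, weighted by difficulty."""
    chosen, pool = [], dict(difficulty)
    for _ in range(k):
        r = random.uniform(0, sum(pool.values()))
        for b, w in pool.items():
            r -= w
            if r <= 0:
                break
        chosen.append(b)   # b is the weighted pick (or the last key,
        del pool[b]        # guarding against float round-off)
    return chosen

# Each generation evaluates fitness only on this subset of the suite.
subset = pick_subset(difficulty, k=2)
print(subset)
```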

The next section describes the methodology that we use throughout the remainder of the paper.

4. METHODOLOGY
Compiler priority functions are often based on assumptions that may not be valid across application and architectural variations. In other words, who knows on what set of benchmarks, and for what target architecture, the priority functions were designed? A priority function may have been designed for circumstances completely different from those under which you use your compiler.

Our system uses genetic programming to automatically search for effective priority functions. Though it may be possible to 'evolve' the underlying algorithm, we restrict ourselves to priority functions. This drastically reduces search space size, and the underlying algorithm ensures optimization legality. Furthermore, this technique is still very powerful; even small changes to the priority function can drastically improve (or diminish) performance.

We optimize a given priority function by wrapping the iterative framework of Figure 2 around the compiler and architecture. We replace the priority function that we wish to optimize with an expression parser and evaluator. This allows us to compile the benchmarks in our 'training' suite using the expressions, which are priority functions, in the population. The expressions that create the fastest executables for the applications in the training suite are favored for crossover.

Parameter                  Setting
Population size            400 expressions
Number of generations      50 generations
Generational replacement   22% of the population
Mutation rate              5%
Tournament size            7
Elitism                    Best expression is guaranteed survival.
Fitness                    Average speedup over the baseline on the suite of benchmarks.

Table 2: GP parameters. This table shows the GP parameters we used to collect the results in this section.

Our system uses total execution time to assign fitnesses. This approach focuses on frequently executed procedures, and therefore, may slowly converge upon general-purpose solutions. However, when one wants to specialize a compiler for a given input program, this evaluation of fitness works extremely well.

Table 1 shows the GP expression primitives that our system uses. Careful selection of GP primitives is essential. We want to give the system enough flexibility to potentially find unexpected results. However, the more leeway we give GP, the longer it will take to converge upon a general solution.

Our system creates an initial population that consists of 399 randomly generated expressions; it randomly 'grows' expressions of varying heights using the primitives in Table 1 and features extracted by the compiler writer. Features are measurable program characteristics that the compiler writer thinks may be important for forming good priority functions (e.g., latency-weighted depth for list scheduling).

In addition to the randomly generated expressions, we seed the initial population with the compiler writer's best guess. In other words, we include the priority function distributed with the compiler. For two of the three optimizations presented in this paper, we found that the seed is quickly obscured and weeded out of the population as more favorable expressions emerge. In fact, for hyperblock selection and data prefetching, which we discuss later, the seed had no impact on the final solution. These results suggest that one could use Meta Optimization to construct priority functions from scratch rather than trying to improve upon preexisting functions. In this way, our tool can reduce the complexity of compiler design by sparing the engineer from perfunctory algorithm tweaking.

Table 2 summarizes the parameters that we use to collect results. We chose the parameters in the table after a moderate amount of experimentation. We give our GP system 50 generations to find a solution. For the benchmarks that we surveyed, the time required to run for 50 generations is about one day per benchmark in the training set². Our system memoizes benchmark fitnesses because fitness evaluations are so costly.

After every generation the system randomly replaces 22% of the population with new expressions created via the crossover operation presented in Section 3. Only the best expression is guaranteed survival. Typically, GP practitioners use much higher replacement rates. However, since we use dynamic subset selection, only a subset of benchmarks is evaluated in a generation. Thus, we need a lower replacement rate in order to increase the likelihood that a given expression will be tested on more than one subset of benchmarks. The mutation operator, which is discussed in the same section, mutates roughly 5% of the new expressions. Finally, we use a tournament size of 7 when selecting the fittest expressions. This setting causes moderate selection pressure.

²We ran on 15 to 20 machines in parallel for the experiments in Section 5 and Section 6, and we used 5 machines for the experiments in Section 7.

Figure 3: Control flow vs. predicated execution. Part (a) shows a segment of control flow that demonstrates a simple if-then-else statement. As is typical with multimedia and integer applications, there are few instructions per basic block in the example. Part (b) is the corresponding predicated hyperblock. If-conversion merges disjoint paths of control by creating predicated hyperblocks. Choosing which paths to merge is a balancing act. In this example, branching may be more efficient than predicating if p3 is rarely true.

The following three sections build upon the methodology described in this section by presenting individual case studies. Results for each of the case studies are included in their respective sections.

5. CASE STUDY I: HYPERBLOCK FORMATION
This section describes the operation of our system in the context of a specific compiler optimization: hyperblock formation. Here we introduce the optimization, and then we discuss factors that might be important when creating a priority function for it. We conclude the section by presenting experimental results for hyperblock formation.

Architects have proposed two noteworthy methods for decreasing the costs associated with control transfers³: improved branch prediction, and predication. Improved branch prediction algorithms would obviously increase processor utilization. Unfortunately, some branches are inherently unpredictable, and hence, even the most sophisticated algorithm would fail. For such branches, predication may be a fruitful alternative.

³The Pentium® 4 architecture features 20 pipeline stages. It squashes up to 126 in-flight instructions when it mispredicts.

Rather than relying on branch prediction, predication allows a multiple-issue processor to simultaneously execute the taken and fall-through paths of control flow. The processor nullifies all instructions in the incorrect path. In this model, a predicate operand guards the execution of every instruction. If the value of the operand is true, then the instruction executes normally. If, however, the operand is false, the processor nullifies the instruction, preventing it from modifying processor state.

Figure 3 highlights the difference between control flow and predicated execution. Part (a) shows a segment of control flow. Using a process dubbed if-conversion, the IMPACT predicating compiler merges disjoint paths of execution into a predicated hyperblock. A hyperblock is a predicated single-entry, multiple-exit region. Part (b) shows the hyperblock corresponding to the control flow in part (a). Here, p2 and p3 are mutually exclusive predicates that are set according to the branch condition in part (a).

Though predication effectively exposes ILP, simply predicating everything will diminish performance by saturating machine resources with useless instructions. However, an appropriate balance of predication and branching can drastically improve performance.

5.1 Feature Extraction
In the following list we give a brief overview of several criteria that are useful to consider when forming hyperblocks. Such criteria are often referred to as features. In the list, a path refers to a path of control flow (i.e., a sequence of basic blocks that are connected by edges in the control flow graph):

• Path predictability: Predictable branches incur no misprediction penalties, and thus, should probably remain unpredicated. Combining multiple paths of execution into a single predicated region uses precious machine resources [15]. In this case, using machine resources to parallelize individual paths is typically wiser.

• Path frequency: Infrequently executed paths are probably not worth predicating. Including the path in a hyperblock would consume resources, and could negatively affect performance.

• Path ILP: If a path's level of parallelism is low, it may be worthwhile to predicate the path. In other words, if a path does not fully use machine resources, combining it with another sequential path probably will not diminish performance. Because predicated instructions do not need to know the value of their guarding predicate until late in the pipeline, a processor can sustain high levels of ILP.

• Number of instructions in path: Long paths use up machine resources, and if predicated, will likely slow execution. This is especially true when long paths are combined with short paths. Since every instruction in a hyperblock executes, long paths effectively delay the time to completion of short paths. The cost of misprediction is relatively high for short paths. If the processor mispredicts on a short path, the processor has to nullify all the instructions in the path, and the subsequent control-independent instructions fetched before the branch condition resolves.



• Number of branches in path: Paths of control through several branches have a greater chance of mispredicting. Therefore, it may be worthwhile to predicate such paths. On the other hand, including several such paths may produce large hyperblocks that saturate resources.

• Compiler optimization considerations: Paths that contain hazard conditions (i.e., pointer dereferences and procedure calls) limit the effectiveness of many compiler optimizations. In the presence of hazards, a compiler must make conservative assumptions. The code in Figure 3(a) could benefit from predication. Without architectural support, the load from *inp cannot be hoisted above the branch. The program will behave unexpectedly if the load is not supposed to execute and it accesses protected memory. By removing branches from the instruction stream, predication affords the scheduler freer code motion opportunities. For instance, the predicated hyperblock in Figure 3(b) allows the scheduler to rearrange memory operations without control-flow concerns.

• Machine-specific considerations: A heuristic should account for machine characteristics. For instance, the branch delay penalty is a decisive factor.

Clearly, there is much to consider when designing a heuristic for hyperblock selection. Many of the above considerations make sense on their own, but when they are put together, contradictions arise. Finding the right mix of criteria to construct an effective priority function is nontrivial. That is why we believe automating the decision process is crucial.
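A few of the features above can be computed directly from a path's basic blocks. The following sketch uses hypothetical block fields; the real extraction lives inside the compiler's intermediate representation.

```python
# Sketch: computing a few path features from a path given as a list of
# basic blocks. The block fields used here are hypothetical.

def path_features(blocks):
    """blocks: dicts with num_instructions, ends_in_branch, and (for
    branch-ending blocks) branch_predictability from a profile."""
    num_ops = sum(b["num_instructions"] for b in blocks)
    num_branches = sum(1 for b in blocks if b["ends_in_branch"])
    predict_product = 1.0                 # product of branch predictabilities
    for b in blocks:
        if b["ends_in_branch"]:
            predict_product *= b["branch_predictability"]
    return {"num_ops": num_ops,
            "num_branches": num_branches,
            "predict_product": predict_product}
```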

5.2 Trimaran's Heuristic

We now discuss the heuristic employed by Trimaran's IMPACT compiler for creating predicated hyperblocks [15, 16]. The IMPACT compiler begins by transforming the code so that it is more amenable to hyperblock formation [15]. IMPACT's algorithm then identifies acyclic paths of control that are suitable for hyperblock inclusion. Park and Schlansker detail this portion of the algorithm in [21]. A priority function, which is the critical calculation in the predication decision process, assigns a value to each of the paths based on characteristics such as the ones just described [15]. Some of these characteristics come from runtime profiling.

IMPACT uses the priority function shown below:

    h_i = { 0.25  if path_i contains a hazard
          { 1     if path_i is hazard free

    d_ratio_i = dep_height_i / max_{j=1..N} dep_height_j

    o_ratio_i = num_ops_i / max_{j=1..N} num_ops_j

    priority_i = exec_ratio_i * h_i * (2.1 - d_ratio_i - o_ratio_i)    (1)

The heuristic applies the above equation to all paths in a predicatable region. Based on a runtime profile, exec_ratio is the probability that the path is executed. The priority function also penalizes paths that contain hazards (e.g.,

Registers: 64 general-purpose registers, 64 floating-point registers, and 256 predicate registers.
Integer units: 4 fully-pipelined units with 1-cycle latencies, except for multiply instructions, which require 3 cycles, and divide instructions, which require 8.
Floating-point units: 2 fully-pipelined units with 3-cycle latencies, except for divide instructions, which require 8 cycles.
Memory units: 2 memory units. L1 cache accesses take 2 cycles, L2 accesses take 7 cycles, and L3 accesses require 35 cycles. Stores are buffered, and thus require 1 cycle.
Branch unit: 1 branch unit.
Branch prediction: 2-bit branch predictor with a 5-cycle branch misprediction penalty.

Table 3: Architectural characteristics. This table describes the EPIC architecture over which we evolved. This model approximates the Intel Itanium architecture.

pointer dereferences and procedure calls). Such paths may constrain aggressive compiler optimizations. To avoid large hyperblocks, the heuristic is careful not to choose paths that have a large dependence height (dep_height) with respect to the maximum dependence height. Similarly, it penalizes paths that contain too many instructions (num_ops).

IMPACT’s algorithm then merges the paths with the high-est priorities into a predicated hyperblock. The algorithmstops merging paths when it has consumed the target archi-tecture’s estimated resources.

5.3 Experimental Setup

This section discusses the experimental results for optimizing Trimaran's hyperblock selection priority function. Trimaran is an integrated compiler and simulator for a parameterized EPIC architecture. Table 3 details the specific architecture over which we evolved. This model resembles Intel's Itanium® architecture.

We modified Trimaran's IMPACT compiler by replacing its hyperblock formation priority function (Equation 1) with our GP expression parser and evaluator. This allows IMPACT to read an expression and evaluate it based on the values of human-selected features that might be important for creating effective priority functions. Table 4 describes these features.

The hyperblock formation algorithm passes the features in the table as parameters to the expression evaluator. For instance, if an expression contains a reference to dep_height, the path's dependence height will be used when the expression is evaluated. Most of the characteristics in Table 4 were already available in IMPACT. Equation 1 has a local scope. To provide some global information, we also extract the minimum, maximum, mean, and standard deviation of all path-specific characteristics in the table.
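A minimal evaluator for Lisp-style priority expressions of the kind our system manipulates might look as follows. The operator set shown, including the cmul conditional multiply, is an illustrative subset; the real grammar is larger, and our cmul semantics here (multiply the second and third arguments when the condition holds, otherwise return the third) is inferred from the discussion of Figure 8.

```python
# A minimal recursive evaluator for Lisp-style priority expressions,
# represented as nested Python tuples. The operator set is an
# illustrative subset of the actual grammar.

def evaluate(expr, features):
    """Evaluate an expression tree against one path's feature values."""
    if isinstance(expr, (int, float)):
        return float(expr)                        # numeric constant
    if isinstance(expr, str):
        return float(features[expr])              # feature lookup
    op, *args = expr
    vals = [evaluate(a, features) for a in args]
    if op == "add": return vals[0] + vals[1]
    if op == "sub": return vals[0] - vals[1]
    if op == "mul": return vals[0] * vals[1]
    if op == "div": return vals[0] / vals[1] if vals[1] != 0 else 1.0
    if op == "not": return 0.0 if vals[0] else 1.0
    if op == "cmul":                              # conditional multiply
        return vals[1] * vals[2] if vals[0] else vals[2]
    raise ValueError(f"unknown operator: {op}")

# exec_ratio * (2.1 - d_ratio - o_ratio), written as a tree:
expr = ("mul", "exec_ratio", ("sub", ("sub", 2.1, "d_ratio"), "o_ratio"))
result = evaluate(expr, {"exec_ratio": 0.5, "d_ratio": 0.4, "o_ratio": 0.7})
```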

We added a 2-bit dynamic branch predictor to the simulator, and we modified the compiler's profiler to extract branch predictability statistics. Lastly, we enabled the following compiler optimizations: function inlining, loop unrolling, backedge coalescing, acyclic global scheduling [6], modulo scheduling [25], hyperblock formation, register allocation, machine-specific peephole optimization, and several classic optimizations.



dep_height: The maximum instruction dependence height over all instructions in the path.
num_ops: The total number of instructions in the path.
exec_ratio: How frequently this path is executed compared to other paths considered (from profile).
num_branches: The total number of branches in the path.
predictability: Average path predictability obtained by simulating a branch predictor (from profile).
predict_product: Product of branch predictabilities in the path (from profile).
avg_ops_executed: The average number of instructions executed in the path (from profile).
unsafe_JSR: True if the path contains a subroutine call that may have side effects; otherwise false.
safe_JSR: True if the path contains a side-effect-free subroutine call; otherwise false.
mem_hazard: True if the path contains an unresolvable memory access; otherwise false.
max_dep_height: The maximum dependence height over all paths considered for hyperblock inclusion.
total_ops: The sum of all instructions in paths considered for hyperblock inclusion.
num_paths: Number of paths considered for hyperblock inclusion.

Table 4: Hyperblock selection features. The compiler writer chooses interesting attributes, and the system evolves a priority function based on them. We rely on profile information to extract some of these parameters. We also include the min, mean, max, and standard deviation of path characteristics. This provides some global information to the greedy local heuristic.

5.4 Experimental Results

We use the familiar benchmarks in Table 5 to test our system. All of the Trimaran certified benchmarks are included in the table⁴ [24]. Our suite also includes many of the Mediabench benchmarks [14]. We exclude ghostscript because its build process proved too difficult. We also exclude the remainder of the Mediabench applications because the Trimaran system does not compile them correctly⁵.

We begin by presenting results for application-specialized heuristics. Following this, we show that it is possible to use Meta Optimization to create general-purpose heuristics.

5.4.1 Specialized Priority Functions

Specialized heuristics are created by optimizing a priority function for a given application. In other words, we train the priority function on a single benchmark. Figure 4 shows that Meta Optimization is extremely effective on a per-benchmark basis. The dark bar shows the speedup (over Trimaran's baseline heuristic) of each benchmark when run with the same data on which it was trained. The light bar shows the speedup attained when the benchmark processes a data set that was not used to train the priority function. We call this the novel data set.

⁴Due to preexisting bugs in Trimaran, we could not get 134.perl to execute correctly, though [24] certified it.
⁵We exclude cjpeg, the complement of djpeg, because it does not execute properly when compiled with some priority functions. Our system can also be used to uncover bugs!

codrle4, decodrle4 (see [4]): RLE type 4 encoder/decoder.
huff_enc, huff_dec (see [4]): A Huffman encoder/decoder.
djpeg (Mediabench): Lossy still image decompressor.
g721encode, g721decode (Mediabench): CCITT voice compressor/decompressor.
mpeg2dec (Mediabench): Lossy video decompressor.
rasta (Mediabench): Speech recognition application.
rawcaudio, rawdaudio (Mediabench): Adaptive differential pulse code modulation audio encoder/decoder.
toast (Mediabench): Speech transcoder.
unepic (Mediabench): Experimental image decompressor.
085.cc1 (SPEC92): gcc C compiler.
052.alvinn (SPEC92): Single-precision neural network training.
179.art (SPEC2000): A neural network-based image recognition algorithm.
osdemo, mipmap (Mediabench): Part of a 3-D graphics library similar to OpenGL.
129.compress (SPEC95): In-memory file compressor and decompressor.
023.eqntott (SPEC92): Creates a truth table from a logical representation of a Boolean equation.
132.ijpeg (SPEC95): JPEG compressor and decompressor.
130.li (SPEC95): Lisp interpreter.
124.m88ksim (SPEC95): Processor simulator.
147.vortex (SPEC95): An object oriented database.

Table 5: Benchmarks used. The set includes applications from the SpecInt, SpecFP, and Mediabench benchmark suites, as well as a few miscellaneous programs.

Intuitively, in most cases the training input data achieves a better speedup. Because Meta Optimization is performance-driven, it selects priority functions that excel on the training input data. The alternate input data likely exercises different paths of control flow, paths which may have been unused during training. Nonetheless, in every case, the application-specific priority function outperforms the baseline.

Figure 5 shows fitness improvements over generations. In many cases, Meta Optimization finds a superior priority function quickly, and finds only marginal improvements as the evolution continues. In fact, the baseline priority function is quickly obscured by GP-generated expressions. Often, the initial population contains at least one expression that outperforms the baseline. This means that by simply creating and testing 399 random expressions, we were able to find a priority function that outperformed Trimaran's for the given benchmark.
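An initial random population of the kind referred to above could be generated with a simple "grow"-style routine. In the sketch, the depth limit, operator arities, terminal mix, and constant range are all illustrative assumptions, not the parameters our system uses.

```python
# Sketch of generating an initial population of random expression trees
# ("grow" initialization). Depth limit, operator arities, terminal mix,
# and constant range are illustrative assumptions.
import random

OPS = {"add": 2, "sub": 2, "mul": 2, "div": 2}
FEATURES = ["exec_ratio", "num_ops", "dep_height", "num_branches", "num_paths"]

def random_expr(depth, rng):
    """Return a random tree: a terminal (feature or constant) at depth 0
    or with 30% probability, otherwise an operator over random subtrees."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.5:
            return rng.choice(FEATURES)
        return round(rng.uniform(0.0, 2.0), 4)
    op = rng.choice(sorted(OPS))
    return (op, *(random_expr(depth - 1, rng) for _ in range(OPS[op])))

rng = random.Random(1)
population = [random_expr(4, rng) for _ in range(399)]   # 399 random candidates
```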

Once GP has discovered a decent solution, the search space and operator dynamics are such that most offspring will be worse, some will be equal, and very few turn out to be better. This seems indicative of a steep hill in the solution space. In addition, multiple reruns using different initialization seeds reveal minuscule differences in performance. It might be a space in which there are many possible solutions associated with a given fitness.

5.4.2 General-Purpose Priority Functions

We divided the benchmarks in Table 5 into two sets⁶: a training set, and a test set. Instead of creating a priority

⁶We chose to train mostly on Mediabench applications because they compile and run faster than the Spec benchmarks.



[Figure 4: Hyperblock specialization. This graph shows speedups obtained by training on a per-benchmark basis for 129.compress, g721encode, g721decode, huff_dec, huff_enc, rawcaudio, rawdaudio, toast, and mpeg2dec. The dark colored bars are executions using the same data set on which the specialized priority function was trained (average 1.54); the light colored bars are executions that use an alternate, or novel, data set (average 1.23).]

[Figure 5: Hyperblock formation evolution. This figure graphs the best fitness over generations (0 to 50) for 129.compress, g721decode, mpeg2dec, rawcaudio, rawdaudio, toast, huff_enc, and huff_dec. For this problem, Meta Optimization quickly finds a priority function that outperforms Trimaran's baseline heuristic.]

function for each benchmark, in this section we aim to find one priority function that works well for all the benchmarks in the training set. To this end, we evolve over the training set using dynamic subset selection [9].
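Dynamic subset selection can be sketched as follows. This is a simplification in the spirit of Gathercole and Ross [9]; the exact weighting our system uses differs.

```python
# Dynamic subset selection, sketched in the spirit of Gathercole and
# Ross [9]. The weighting below (difficulty + staleness) is a
# simplification of the actual scheme.
import random

def choose_subset(benchmarks, difficulty, age, size, rng):
    """Draw `size` distinct benchmarks, biased toward those the current
    population handles poorly (difficulty) or has not seen lately (age)."""
    pool = list(benchmarks)
    chosen = []
    while pool and len(chosen) < size:
        weights = [difficulty[b] + age[b] + 1 for b in pool]  # +1 keeps weights positive
        pick = rng.choices(pool, weights=weights, k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen
```

Evaluating each generation on a small, rotating subset rather than the full training set keeps the cost of a generation manageable while still steering the evolution toward the hard benchmarks.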

Figure 6 shows the results of applying the single best priority function to the benchmarks in the training set. The dark bar associated with each benchmark is the speedup over Trimaran's base heuristic when the training input data is used. This data set yields a 44% improvement. The light bar shows results when novel input data is used. The overall improvement for this set is 25%.

It is interesting that, on average, the general-purpose priority function outperforms the application-specific priority functions on the novel data set. The general-purpose solution is less susceptible to variations in input data because it was trained to be more general.

We then apply the resulting priority function to the benchmarks in the test set. The machine-learning community

[Figure 6: Training on multiple benchmarks. A single priority function was obtained by training over all the benchmarks in this graph (decodrle4, codrle4, g721decode, g721encode, rawdaudio, rawcaudio, toast, mpeg2dec, 124.m88ksim, 129.compress, huff_enc, and huff_dec). The dark bars represent speedups obtained by running the given benchmark on the same data that was used to train the priority function (average 1.44); the light bars correspond to a novel data set (average 1.25).]

refers to this as cross validation. Since the benchmarks in the test set are not related to the benchmarks in the training set, this is a measure of the priority function's generality.

The results of the cross validation are shown in Figure 7. This experiment applies the best priority function on the training set to the benchmarks in the test set. The average speedup on the test set is 9%. In three cases (unepic, 023.eqntott, and 085.cc1) Trimaran's baseline heuristic marginally outperforms the GP-generated priority function. For the remaining benchmarks, the heuristic our system found is better.

5.4.3 The Best Priority Function

Figure 8 shows the best general-purpose priority function our system found for hyperblock selection. Because parsimony pressure favors small expressions, most of our system's solutions are readable. Nevertheless, the expressions presented in this paper have been hand-simplified for ease of discussion.

Notice that some parts of the expression have no impact on the overall result. For instance, removing the sub-expression on line 2 will not affect the heuristic; the value is invariant within a scheduling region, since the mean execution ratio is the same for all paths in the region. Such 'useless' expressions are called introns. It turns out that introns are actually quite useful for preserving good building blocks during crossover and mutation [13].

The conditional multiply statement on line 4 does have a direct effect on the priority function: it favors paths that do not have pointer dereferences (because the sub-expression on line 5 will always be greater than one). Pointers inhibit the effectiveness of the scheduler and other compiler optimizations, and thus dereferences should be penalized. The IMPACT group came to the exact same conclusion, though the extent to which they penalize dereferences differs [15].

The sub-expression on line 8 favors 'bushy' parallel paths, where there are numerous independent operations. This result is somewhat counterintuitive, since highly parallel paths will quickly saturate machine resources. In addition, paths



[Figure 7: Cross validation of the general-purpose priority function. The best priority function found by training on the benchmarks in Figure 6 is applied to the benchmarks in this graph (unepic, djpeg, rasta, 023.eqntott, 132.ijpeg, 052.alvinn, 147.vortex, 085.cc1, art, 130.li, osdemo, and mipmap). Average speedup: 1.09.]

(1)  (add
(2)     (sub (mul exec_ratio_mean 0.8720) 0.9400)
(3)     (mul 0.4762
(4)        (cmul (not mem_hazard)
(5)              (mul 0.6727 num_paths)
(6)              (mul 1.1609
(7)                 (add
(8)                    (sub
(9)                       (mul
(10)                         (div num_ops dep_height) 10.8240)
(11)                      exec_ratio)
(12)                   (sub (mul (cmul has_unsafe_jsr
(13)                                   predict_product_mean
(14)                                   0.9838)
(15)                             (sub 1.1039 num_ops_max))
(16)                        (sub (mul dep_height_mean
(17)                                  num_branches_max)
(18)                             num_paths)))))))

Figure 8: The best priority function our system found for hyperblock scheduling.

with higher exec_ratios are slightly penalized, which also defies intuition.

The conditional multiply expression on line 12 penalizes paths with unsafe calls (i.e., calls to subroutines that may have side effects). Once again this agrees with the IMPACT group's reasoning [15].

Because Trimaran is such a large and complicated system, it is difficult to know exactly why the priority function in Figure 8 works well. This is exactly the point of using a methodology like Meta Optimization: the bountiful complexities of compilers and systems are difficult to understand. Also worthy of notice is the fact that we obtain such good speedups, particularly on the training set, by changing such a small portion of the compiler.

The next section presents another case study, which we also test on Trimaran.

6. CASE STUDY II: REGISTER ALLOCATION

The importance of register allocation is well known, so we will not motivate the optimization here. Many register allocation algorithms use cost functions to determine which

variables to spill when spilling is required. For instance, in priority-based coloring register allocation, the priority function is an estimate of the relative benefits of storing a given variable in a register [7].

Priority-based coloring first associates a live range with every variable. A live range is the composition of code segments (basic blocks) through which the associated variable's value must be preserved. The algorithm then prioritizes each live range based on the estimated execution savings of register allocating the associated variable:

    savings_i = w_i * (LDsave * uses_i + STsave * defs_i)    (2)

    priority(lr) = ( Σ_{i ∈ lr} savings_i ) / N    (3)

Equation 2 is used to compute the savings of each code segment. LDsave and STsave are estimates of the execution time saved by keeping the associated variable in a register for references and definitions, respectively. uses_i and defs_i represent the number of uses and definitions of a variable in block i. w_i is the estimated execution frequency for the block.

Equation 3 sums the savings over the N blocks that compose the live range. Thus, this priority function represents the savings incurred by accessing a register instead of resorting to main memory.

The algorithm then tries to assign registers to live ranges in priority order. Please see [7] for a complete description of the algorithm. For our purposes, the important thing to note is that the success of the algorithm depends on the priority function.
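Equations 2 and 3 can be sketched together in a few lines. The block field names are ours, not Elcor's.

```python
# Equations 2 and 3 in Python (block fields are hypothetical).

def live_range_priority(blocks, LDsave, STsave):
    """blocks: dicts with w (execution frequency), uses, and defs for
    each basic block in the live range. Returns Equation 3's value."""
    savings = [b["w"] * (LDsave * b["uses"] + STsave * b["defs"])
               for b in blocks]                   # Equation 2, per block
    return sum(savings) / len(blocks)             # Equation 3 (N blocks)
```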

The priority function described above is intuitive: it assigns weights to live ranges based on the estimated execution savings of register allocating them. Nevertheless, our system finds functions that improve the heuristic by up to 11%.

6.1 Experimental Results

We collected these results using the same experimental setup that we used for hyperblock selection. We use Trimaran and we target the architecture described in Table 3. However, to more effectively stress the register allocator, we only use 32 general-purpose registers and 32 floating-point registers.

We modified Trimaran’s Elcor register allocator by re-placing its priority function (Equation 2) with an expres-sion parser and evaluator. The register allocation heuristicdescribed above essentially works at the basic block level.Equation 3 simply sums and normalizes the priorities of theindividual basic blocks. For this reason, we stay within thealgorithm’s framework and leave Equation 3 intact.

For a more detailed description of our experiments with register allocation, including the features we extracted to perform them, please see [23].

6.1.1 Specialized Priority Functions

These results indicate that Meta Optimization works well, even for well-studied heuristics. Figure 9 shows speedups obtained by specializing Trimaran's register allocator for a given application. The dark bar associated with each application represents the speedup obtained by using the same input data that was used to specialize the heuristic. The light bar shows the speedup when the benchmark processes a novel data set.



[Figure 9: Register allocation specialization. This graph shows speedups obtained by training on a per-benchmark basis for mpeg2dec, rawcaudio, 129.compress, huff_enc, huff_dec, and g721decode. The dark colored bars are executions using the same data set on which the specialized priority function was trained (average 1.08); the light colored bars are executions that use a novel data set (average 1.06).]

[Figure 10: Register allocation evolution. This figure graphs fitness over generations for mpeg2dec, rawcaudio, g721decode, 129.compress, huff_enc, and huff_dec. Unlike the hyperblock selection evolution, these fitnesses improve gradually.]

Once again, it makes sense that the training input data outperforms the alternate input data. In the case of register allocation, however, we see that the disparity between speedups on training and novel data is less pronounced than it is with hyperblock selection. This is likely because hyperblock selection is extremely data-driven. An examination of the general-purpose hyperblock formation heuristic reveals two dynamic factors (exec_ratio and predict_product_mean) that are critical components in the hyperblock decision process.

Figure 10 graphs fitness improvements over generations. It is interesting to contrast this graph with Figure 5. The fairly constant improvement in fitness over several generations seems to suggest that this problem is harder to optimize than hyperblock selection. Additionally, unlike the hyperblock selection algorithm, the baseline heuristic typically remained in the population for several generations.

[Figure 11: Training a register allocation priority function on multiple benchmarks. Our DSS evolution trained on all the benchmarks in this figure (129.compress, g721decode, g721encode, huff_enc, huff_dec, rawcaudio, rawdaudio, and mpeg2dec). The single best priority function was applied to all the benchmarks. The dark bars represent speedups obtained by running the given benchmark on the same data that was used to train the priority function (average 1.03); the light bars correspond to an alternate data set (average 1.03).]

[Figure 12: Cross validation of the general-purpose register allocation priority function. The best priority function found by the DSS run is applied to the benchmarks in this graph (decodrle4, codrle4, 124.m88ksim, unepic, djpeg, 023.eqntott, 132.ijpeg, 147.vortex, 085.cc1, and 130.li). Results from two target architectures are shown.]

6.1.2 General-Purpose Priority Functions

Just as we did in Section 5.4.2, we divide our benchmarks into a training set and a test set⁷. The benchmarks in Figure 11 show the training set for this experiment. The figure also shows the results of applying the best priority function (from our DSS run) to all the benchmarks in the set. The dark bar associated with each benchmark is the speedup over Trimaran's baseline heuristic when using the training input data. The average for this data set is 3%. On a novel data set we attain an average speedup of 3%, which indicates that register allocation is not as susceptible to variations in input data.

Figure 12 shows the cross validation results for this ex-

⁷This experiment uses smaller test and training sets due to preexisting bugs in Trimaran. It does not correctly compile several of our benchmarks when targeting a machine with 32 registers.



[Figure 13: Prefetching specialization. This graph shows speedups obtained by training on a per-benchmark basis for 101.tomcatv, 102.swim, 103.su2cor, 125.turb3d, 146.wave5, 093.nasa7, 015.doduc, 034.mdljdp2, 107.mgrid, and 141.apsi. The dark colored bars are executions using the same data set on which the specialized priority function was trained (average 1.35); the light colored bars are executions that use a novel data set (average 1.40).]

periment. The figure shows the speedups (over Trimaran's baseline) achieved by applying the single best priority function to a set of benchmarks that were not in the training set. The learned priority function outperforms the baseline for all benchmarks except decodrle4 and 132.ijpeg. Although the overall speedup on the cross validation set is only 3%, this is an exciting result. Register allocation is a well-studied optimization which our technique is able to improve.

7. CASE STUDY III: DATA PREFETCHING

This section describes another memory hierarchy optimization. Data prefetching is an optimization aimed at reducing the costs of long-latency memory accesses. By moving data from main memory into cache before it is accessed, prefetching can effectively reduce memory latencies.

However, prefetching can degrade performance in many cases. For instance, aggressive prefetching may evict useful data from the cache before it is needed. In addition, adding unnecessary prefetch instructions may hinder instruction cache performance and saturate memory queues.

The Open Research Compiler (ORC) [19] uses an extension of Mowry's algorithm [18] to insert prefetch instructions. ORC uses a priority function that assigns a Boolean confidence to prefetching a given address. Subsequent passes use this value to determine whether or not to prefetch the address. Currently, the priority function is simply based upon how well the compiler can estimate loop trip counts.
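A Boolean-valued priority function of this flavor might look like the following sketch. The features and the trip-count threshold are illustrative assumptions; this is not ORC's actual test.

```python
# A sketch of a Boolean-valued prefetch-confidence function. The features
# and the trip-count threshold are illustrative; this is not ORC's test.

def prefetch_confidence(trip_count_known, est_trip_count, stride_known):
    """Return True when the compiler should be confident enough to emit
    a prefetch for a reference."""
    if not stride_known:
        return False          # unknown stride: the prefetch address is a guess
    if trip_count_known and est_trip_count >= 8:
        return True           # enough iterations to amortize prefetch overhead
    return False
```

Because the function returns a Boolean rather than a real number, GP evolves a predicate here instead of a ranking, which is the flexibility the next subsection highlights.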

7.1 Experimental Setup

This case study is different from those already presented in two important ways. First, we collected the results of this section in the context of a real machine: we use the Open Research Compiler, and we target an Itanium I architecture. Just as with the previous two case studies, the fitness of an expression is the speedup over the baseline priority function. However, unlike simulated execution, which is perfectly reproducible, real environments are inherently noisy. Even on an unloaded system, back-to-back runs of a program may vary.

[Figure 14: Prefetching evolution. This figure graphs fitness over generations for 103.su2cor, 102.swim, 034.mdljdp2, 093.nasa7, 125.turb3d, 146.wave5, 015.doduc, 101.tomcatv, 107.mgrid, and 141.apsi. The baseline expression is quickly weeded out of the population.]

Fortunately, GP can handle noisy environments, as long as the level of noise is smaller than the speedups attainable with our technique. For the Itanium processor, this is indeed the case. Since it is a single-threaded, statically scheduled processor, our measurements are fairly consistent; variations due to noise are well within the range of our attained speedups.
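One simple way to damp such noise is to compute fitness from the median of several timed runs. This is an assumption on our part; the measurement procedure sketched below is not specified in the text.

```python
# Sketch: computing a noise-damped fitness from repeated timings. Using
# the median of several runs is our assumption; the measurement procedure
# is not specified in the text.
import statistics

def speedup_from_times(baseline_time, times):
    """Fitness is the baseline runtime divided by the median of the
    measured runtimes; the median discards outlier runs."""
    return baseline_time / statistics.median(times)
```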

Another major divergence from the methodology employed in the last two case studies is the format of the priority function that we aim to optimize. Whereas the priority functions for register allocation and hyperblock formation are real-valued, the prefetching priority function is Boolean-valued. This case study emphasizes GP's flexibility.

As the ORC website recommends, we compile all benchmarks with -O3 optimizations enabled, and we use profile-driven feedback. For additional details, such as the features we extracted for this optimization, please see [23].

7.2 Experimental Results

Prefetching is known to be an effective technique for floating point benchmarks, and for this reason we train on various SPECFP benchmarks in this case study.

7.2.1 Specialized Priority Functions

Figure 13 shows the results of the ten different application-specialized priority functions. Closer examination of the GP solutions reveals that ORC overzealously prefetches, and that by simply taming the compiler's aggressiveness, one can substantially improve performance (on this set of benchmarks). The GP solutions rarely prefetched. In fact, shutting off prefetching altogether achieves gains within 7% of the specialized priority functions.

Figure 14 graphs fitness over generation for the application-specific experiments. Just as with hyperblock selection, the baseline priority function has no impact on the final solutions; it is quickly obscured by superior expressions. As is the case with hyperblock selection, it appears that in many cases, GP solutions get 'stuck' in a local minimum in the solution space; the fitnesses stop improving early in the evolution. One plausible explanation for this is our use of parsimony pressure in the selection process. For application-specific evolutions, it is often the case that very small expressions work well. While these small expressions are effective,


Figure 15: Training a prefetching priority function on multiple benchmarks (101.tomcatv, 102.swim, 103.su2cor, 125.turb3d, 146.wave5, 093.nasa7, 015.doduc, 034.mdljdp2, 107.mgrid, 141.apsi, and the average). Our DSS evolution trained on all the benchmarks in this figure. The single best priority function was applied to all the benchmarks. The dark bars represent speedups obtained by running the given benchmark on the same data that was used to train the priority function (average 1.31). The light bars correspond to an alternate data set (average 1.36).

they limit the genetic material available to the crossover operator. Furthermore, since we always keep the best expression, the population soon becomes inbred with copies of the top expression. Future work will explore the impact of parsimony pressure and elitism.
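For readers unfamiliar with the two selection forces discussed above, the following sketch shows how parsimony pressure (a per-node fitness penalty) and elitism (unconditionally copying the best individual forward) might interact; the penalty weight, dictionary encoding, and tournament scheme are hypothetical, not the paper's actual GP configuration.

```python
# Sketch: parsimony pressure plus elitism in one selection step.
import random

PARSIMONY_WEIGHT = 0.01  # hypothetical penalty per tree node

def adjusted_fitness(ind):
    """Raw speedup minus a small penalty proportional to tree size,
    which biases selection toward compact expressions."""
    return ind["fitness"] - PARSIMONY_WEIGHT * ind["size"]

def next_generation(pop, rng):
    """Elitism (keep the raw-fitness best unconditionally), then fill
    the rest of the population with binary tournament winners judged
    on parsimony-adjusted fitness."""
    elite = max(pop, key=lambda ind: ind["fitness"])
    survivors = [elite]
    while len(survivors) < len(pop):
        a, b = rng.sample(pop, 2)
        survivors.append(max((a, b), key=adjusted_fitness))
    return survivors
```

Because the elite is copied every generation while tournaments favor small trees, a population can quickly fill with near-duplicates of one compact winner, which is exactly the inbreeding effect the text describes.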

7.2.2 General-Purpose Priority Functions
The performance of the best DSS-generated prefetching priority function is shown in Figure 15. The priority function was trained on the same benchmarks in the figure, which are a combination of SPEC92 and SPEC95 floating point benchmarks. Data prefetching, like hyperblock selection, is extremely data-driven. By applying the same input data that we used to train the priority function, we achieve a 31% speedup. Somewhat surprisingly, the novel input data set achieves a better speedup of 36%. Because the priority function learned to prefetch infrequently, it is simply the case that the novel data set is more sensitive to prefetching than the training data set is.

Figure 16 shows the cross validation results for this optimization, and prompts us to mention a caveat of our technique. GP's ability to identify good general-purpose solutions is based on the benchmarks over which they are evolved. For the SPEC92 and SPEC95 benchmarks that were used to train our general-purpose heuristic, aggressive prefetching was debilitating. However, for a couple of benchmarks in the SPEC2000 floating point set, we see that aggressive prefetching is desirable. Thus, unless designers can assert that the training set provides adequate problem coverage, they cannot completely trust GP-generated solutions.

8. RELATED WORK
Many researchers have applied machine-learning methods to compilation, and therefore, only the most relevant works are cited here.

Figure 16: Cross validation of the general-purpose prefetching priority function on SPEC2000 (168.wupwise, 171.swim, 172.mgrid, 173.applu, 178.galgel, 183.equake, 187.facerec, 188.ammp, 189.lucas, 200.sixtrack, 301.apsi, 191.fma3d, and the average, 1.01). The best priority function found by the DSS run is applied to the benchmarks in this graph. Results from two target architectures are shown.

Calder et al. used supervised learning techniques to fine-tune static branch prediction heuristics [5]. They employ neural networks and decision trees to search for effective static branch prediction heuristics. While our methodology is similar, our work differs in several important ways. Most importantly, we use unsupervised learning, while they use supervised learning.

Unsupervised learning is used to capture inherent organization in data, and thus, only input data is required for training. Supervised learning attempts to match training inputs with known outcomes, called labels. This means that their learning techniques rely on knowing the optimal outcome, while ours does not (see footnote 8). In their case, determining the optimal outcome is trivial; they simply run the benchmarks in their training set and note the direction that each branch favors. In this sense, their method is simply a classifier: classify the data into two groups, either taken or not-taken. Priority functions cannot be classified in this way, and thus they demand an unsupervised method such as ours.

We also differ in the end goal of our learning techniques. They use misprediction rates to guide the learning process. While this is a perfectly valid choice, it does not necessarily reflect the bottom line: execution time.

Monsifrot et al. use a classifier based on decision tree learning to determine which loops to unroll [17]. Like [5], this supervised methodology relies on extracting labels, which is not only difficult but, in many cases, simply not feasible.

Cooper et al. use genetic algorithms to solve compilation phase ordering problems [8]. Their technique is quite effective. However, like other related work, they evolve the application, not the compiler (see footnote 9). Thus, their compiler iteratively evolves every program it compiles. By evolving compiler heuristics, and not the applications themselves, we need only apply our process once, as shown in Section 5.4.2.

The COGEN(t) compiler creatively uses genetic algorithms to map code to irregular DSPs [11]. This compiler, though interesting, also evolves on a per-application basis. Nonetheless, the compile-once nature of DSP applications may warrant the long, iterative compilation process.

Footnote 8: This is a strict requirement both for decision trees and for the gradient descent method they use to train their neural network.

Footnote 9: However, they were able to manually construct a general-purpose sequence using information gleaned from their application-specific evolutions.


9. CONCLUSION
Compiler developers have always had to contend with complex phenomena that are not easily modeled. For example, it has never been possible to create a useful model for all the input programs the compiler has to optimize. Until recently, however, most architectures, the target of compiler optimizations, were simple and analyzable. This is no longer the case. A complex compiler with multiple interdependent optimization passes exacerbates the problem. In many instances, end-to-end performance can only be evaluated empirically.

Optimally solving NP-hard problems is not practical even when simple analytical models exist. Thus, heuristics play a major role in modern compilers. Borrowing techniques from the machine-learning community, we created a general framework for developing compiler heuristics. We advocate a machine-learning based methodology for automatically learning effective compiler heuristics.

The techniques presented in this paper show promise, but they are still in their infancy. For many applications our techniques found excellent application-specific priority functions. However, the disparity in some cases between the application-specific performance and the general-purpose performance tells us that our techniques can be improved.

We also note disparities between the performance of training set applications and the cross validation performance. In some cases our solutions overfit the training set. If compiler developers use our technique but only train using benchmarks on which their compiler will be judged, the generality of their compiler may actually be reduced.

Our fledgling research has a few shortcomings that future work will address. For instance, the success of any learning algorithm hinges on selecting the right features. We will explore techniques that aid in extracting features that best reflect program variability. While genetic programming is well-suited to our application, it too has shortcomings. The overriding goal of our research is to free humans from tedious parameter tweaking. Unfortunately, GP's success is dependent on parameters such as population size and mutation rate, and finding an adequate solution relies on some experimentation (which fortunately can be performed with a minimal amount of user interaction). Future work will experiment with different learning techniques.

We believe the benefits of using a system like ours far outweigh the drawbacks. While our techniques do not always achieve large speedups, they do reduce design complexity considerably. Compiler writers are forced to spend a large portion of their time designing heuristics. The results presented in this paper lead us to believe that machine-learning techniques can optimize heuristics at least as well as human designers. We believe that automatic heuristic tuning based on empirical evaluation will become prevalent, and that designers will intentionally expose algorithm policies to facilitate machine-learning optimization.

A toolset that can be used to evolve compiler heuristics will be available at:

http://www.cag.lcs.mit.edu/metaopt

10. ACKNOWLEDGMENTS
We would like to thank the PLDI reviewers for their insightful comments. In general we thank everyone who helped us with this paper, especially Kristen Grauman, Michael Gordon, Sam Larsen, William Thies, Derek Bruening, and Vladimir Kiriansky. Finally, thanks to Michael Smith and Glenn Holloway at Harvard University for lending us their Itanium machines in our hour of need. This work is funded by DARPA, NSF, and the Oxygen Alliance.

11. REFERENCES
[1] S. G. Abraham, V. Kathail, and B. L. Deitrich. Meld Scheduling: Relaxing Scheduling Constraints Across Region Boundaries. In Proceedings of the 29th Annual International Symposium on Microarchitecture (MICRO-29), pages 308–321, 1996.
[2] W. Banzhaf, P. Nordin, R. Keller, and F. Francone. Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Applications. Morgan Kaufmann, 1998.
[3] D. Bernstein, D. Goldin, M. Golumbic, et al. Spill Code Minimization Techniques for Optimizing Compilers. In Proceedings of the SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 258–263, 1989.
[4] D. Bourgin. Lossless compression schemes. http://hpux.u-aizu.ac.jp/hppd/hpux/Languages/codecs-1.0/.
[5] B. Calder, D. Grunwald, M. Jones, D. Lindsay, J. Martin, M. Mozer, and B. Zorn. Evidence-Based Static Branch Prediction Using Machine Learning. In ACM Transactions on Programming Languages and Systems (ToPLaS-19), volume 19, 1997.
[6] P. Chang, D. Lavery, S. Mahlke, W. Chen, and W. Hwu. The Importance of Prepass Code Scheduling for Superscalar and Superpipelined Processors. In IEEE Transactions on Computers, volume 44, pages 353–370, March 1995.
[7] F. C. Chow and J. L. Hennessy. The Priority-Based Coloring Approach to Register Allocation. In ACM Transactions on Programming Languages and Systems (ToPLaS-12), pages 501–536, 1990.
[8] K. Cooper, P. Schielke, and D. Subramanian. Optimizing for Reduced Code Space using Genetic Algorithms. In Languages, Compilers, Tools for Embedded Systems, pages 1–9, 1999.
[9] C. Gathercole. An Investigation of Supervised Learning in Genetic Programming. PhD thesis, University of Edinburgh, 1998.
[10] P. B. Gibbons and S. S. Muchnick. Efficient Instruction Scheduling for a Pipelined Architecture. In Proceedings of the ACM Symposium on Compiler Construction, volume 21, pages 11–16, 1986.
[11] G. W. Grewal and C. T. Wilson. Mapping Reference Code to Irregular DSPs with the Retargetable, Optimizing Compiler COGEN(T). In International Symposium on Microarchitecture, volume 34, pages 192–202, 2001.
[12] M. Kessler and T. Haynes. Depth-Fair Crossover in Genetic Programming. In Proceedings of the ACM Symposium on Applied Computing, pages 319–323, February 1999.
[13] J. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, 1992.
[14] C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communication Systems. In International Symposium on Microarchitecture, volume 30, pages 330–335, 1997.
[15] S. A. Mahlke. Exploiting Instruction Level Parallelism in the Presence of Branches. PhD thesis, University of Illinois at Urbana-Champaign, Department of Electrical and Computer Engineering, 1996.
[16] S. A. Mahlke, D. Lin, W. Chen, R. Hank, and R. Bringmann. Effective Compiler Support for Predicated Execution Using the Hyperblock. In International Symposium on Microarchitecture, volume 25, pages 45–54, 1992.
[17] A. Monsifrot, F. Bodin, and R. Quiniou. A Machine Learning Approach to Automatic Production of Compiler Heuristics. In Artificial Intelligence: Methodology, Systems, Applications, pages 41–50, 2002.
[18] T. C. Mowry. Tolerating Latency through Software-Controlled Data Prefetching. PhD thesis, Stanford University, Department of Electrical Engineering, 1994.
[19] Open Research Compiler. http://ipf-orc.sourceforge.net.
[20] E. Ozer, S. Banerjia, and T. Conte. Unified Assign and Schedule: A New Approach to Scheduling for Clustered Register File Microarchitectures. In Proceedings of the 31st Annual International Symposium on Microarchitecture (MICRO-31), pages 308–315, 1998.
[21] J. C. H. Park and M. S. Schlansker. On Predicated Execution. Technical Report HPL-91-58, Hewlett Packard Laboratories, 1991.
[22] B. R. Rau. Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO-27), November 1994.
[23] M. Stephenson, S. Amarasinghe, U.-M. O'Reilly, and M. Martin. Compiler Heuristic Design with Machine Learning. Technical Report TR-893, Massachusetts Institute of Technology, 2003.
[24] Trimaran. http://www.trimaran.org.
[25] N. Warter. Modulo Scheduling with Isomorphic Control Transformations. PhD thesis, University of Illinois at Urbana-Champaign, Department of Electrical and Computer Engineering, 1993.
